Abstract

The explosion of time series count data with diverse characteristics and features in recent years has led to a proliferation of new analysis models and methods. Significant efforts have been devoted to achieving flexibility capable of handling complex dependence structures, capturing multiple distributional characteristics simultaneously, and addressing nonstationary patterns such as trends, seasonality, or change points. However, it remains a challenge to consider these in the context of long-range dependence. The Lévy-based modeling framework offers a promising tool to meet the requirements of modern data analysis. It enables the modeling of both short-range and long-range serial correlation structures by selecting the kernel set accordingly and accommodates various marginal distributions within the class of infinitely divisible laws. We propose an extension of the basic stationary framework to capture additional marginal properties, such as heavy-tailedness under both short-range and long-range dependence, as well as overdispersion and zero inflation modeled simultaneously. Statistical inference is based on composite pairwise likelihood. The model’s flexibility is illustrated through applications to rainfall data from Guinea from 2008 to 2023 and to the number of NSF awards granted to academic institutions. The proposed model demonstrates remarkable flexibility and versatility, capable of simultaneously capturing overdispersion, zero inflation, and heavy-tailedness in count time series data.

1. Introduction

Time series of count data arise in different disciplines where observed counts are recorded over time, such as economics, epidemiology, finance, and insurance. Several aspects of count time series data have been the subject of extensive research, as evidenced by the rich literature in this area. A major issue entails modeling dependence arising from the observations’ discrete nature, which renders the autoregressive structure for continuous data incoherent. The efforts directed towards handling time series of count data are aimed at ensuring the validity of inference and, consequently, of data-driven decisions. The challenge of handling serial correlation in count data continues to attract the attention of many researchers and scholars, motivated by the difficulties that arise when dealing with these data in various situations, including the trend of daily COVID-19 deaths in Ghana [1], stock market trends [2], road accident counts [3], and crime analysis [4]. Count data exhibit features such as nonnegativity and integer values, and are frequently overdispersed (the variance is greater than the corresponding mean), zero-inflated (a high occurrence of zero values in the dataset), and heavy-tailed (a higher relative frequency of extreme values or outliers in the dataset). The presence of zero inflation, extreme values, or outliers in count data can affect mean and variance estimation, as well as the validity of statistical inferences. Consequently, complexities arise in this setting due to the requirement to provide a modeling strategy capable of capturing the dependence patterns, the simultaneous modeling, and the marginal features of the observations. Various modeling strategies have been proposed to deal with the issues arising when handling time series of count data. Two paradigms are predominant in the literature for handling serial dependence in count data. The first is the class of discrete autoregressive moving-average models introduced by Jacobs and Lewis [5]; a generic ARMA model has the form $X_t = c + \sum_{i=1}^{p}\phi_i X_{t-i} + \sum_{j=1}^{q}\theta_j \varepsilon_{t-j} + \varepsilon_t$, where $X_t$ represents the value at time $t$, $c$ is the intercept, and $\phi_i$ and $\theta_j$ are the coefficients of the autoregressive and moving-average lags. The error term $\varepsilon_t$ is a statistically independent random variable, uncorrelated both with itself over time and with the other random variables in the model. The second is based on a thinning operator and was introduced by McKenzie [6] and Al-Osh and Alzaid [7]. The corresponding model for the dynamics of an integer-valued sequence is $X_t = \alpha \circ X_{t-1} + \varepsilon_t$, where $X_{t-1}$ represents the value of the sequence at the time step preceding $t$, $\circ$ is a thinning operator (e.g., binomial thinning, $\alpha \circ X = \sum_{i=1}^{X} B_i$ with $B_i$ i.i.d. Bernoulli($\alpha$)), and $(\varepsilon_t)$ is a sequence of integer-valued random variables. The advantage of the first approach is that the marginal distribution of such a stationary process can be of any kind, as shown by McKenzie [8]. Its drawback, however, is that sample paths contain long runs of constant values, which is unrealistic in many applications. The second approach provides a diverse set of models, and sample paths from thinning models frequently appear more realistic than those from discrete autoregressive moving-average processes. Thinning-based models, on the other hand, cannot generate an arbitrary marginal distribution for integer-valued data [9].
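To make the thinning mechanism concrete, the following R sketch simulates a classical INAR(1) process with binomial thinning and Poisson innovations; the parameter values and the Poisson innovation choice are illustrative and are not taken from the data analyzed later.

```r
# Illustrative INAR(1) simulation: X_t = alpha o X_{t-1} + eps_t, where
# "alpha o X" is binomial thinning (each of the X_{t-1} counts survives with
# probability alpha) and eps_t are i.i.d. Poisson innovations.
set.seed(1)
n <- 500; alpha <- 0.6; lambda_eps <- 1.5        # illustrative parameter values
x <- numeric(n)
x[1] <- rpois(1, lambda_eps / (1 - alpha))       # start near the stationary mean level
for (t in 2:n) {
  survivors <- rbinom(1, size = x[t - 1], prob = alpha)  # alpha o X_{t-1}
  x[t] <- survivors + rpois(1, lambda_eps)               # add innovation eps_t
}
acf(x, main = "INAR(1) sample autocorrelation")  # decays geometrically at rate alpha
```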

Efforts to adapt the ARMA framework have yielded promising results, particularly with Gaussian data, reflecting innovative strategies to accommodate continuous data’s distinct nature while retaining ARMA’s core principles. Although its success in Gaussian contexts demonstrates the model’s potential for capturing temporal dependencies, it has limitations in the count data field: Gaussian ARMA-type processes are insufficient for capturing features of integer-valued time series, such as overdispersion and zero inflation [10]. Since this class of models does not generate integer predictions, it is prone to approximation errors when applied to count data. This has led to the creation of approaches specific to count data, some of which draw concepts from the autoregressive modeling of continuous data. Popular approaches such as the integer-valued autoregressive (INAR) modeling framework strive to respect the data’s discrete nature. Several researchers have applied this technique in both univariate and multivariate contexts. However, a significant challenge emerges when attempting to capture higher-order dependencies, especially in the extension to multivariate cases, which introduces complexities in implementation within this framework. To tackle this problem, some authors adopt Markov modeling, for example [11]. However, Markov models have limited memory and do not explicitly capture long-term dependencies or past events beyond the current state. For systems with intricate temporal relationships, or those requiring considerable historical data for appropriate modeling and prediction, this can be an obstacle. Hidden Markov models (HMMs) mitigate this issue in part by including hidden states that carry additional information [12], but they fail to capture long-term dependencies in some cases. In addition, they work on the assumption that transitions between states are independent events, which is also not realistic. Another solution is copula-based modeling, which accounts for dependence in multivariate count data. Although copulas allow for different kinds of dependence structures, finding parametric distributions for high-dimensional random vectors remains difficult because high-dimensional multivariate distributions are limited in their covariance structure [13].

Time series of count data recorded in various applications exhibit diverse characteristics and features such as overdispersion, zero inflation, heavy-tailedness, volatility, nonstationarity, and complex dependence structures. In response, numerous models have been introduced to handle count time series data by accounting for zero inflation and overdispersion. However, the aspect of heavy-tailedness has received less attention, as noted by Qian et al. [10]. Yet ignoring the extreme values or outliers that characterize a heavy tail may result in the loss of useful information, since the tail probability can neither be ignored nor assumed to be negligible. Modeling heavy-tailed data presents a challenge because it necessitates identifying distributions that can capture both the main body of the data and the extreme values or outliers [14]. Zhu and Joe [15] introduced a family of distributions known as the generalized Poisson-inverse Gaussian distribution. This distribution is constructed to efficiently capture heavy-tailed count data and provides a flexible strategy for such scenarios. Building upon this work, Qian et al. [10] introduced a novel approach called the first-order GPIG-INAR model, an INAR process incorporating innovations from the generalized Poisson-inverse Gaussian distribution. This methodology was developed so that time series with heavy-tailed count distributions could be modeled successfully. However, it was considered only in the short-range-dependent INAR(1) setting, whereas the present work considers both short- and long-range dependence.

Additionally, simultaneous modeling of two or more of these aspects, when present in the data, presents challenges in model specification and estimation. This is further aggravated by the need to specify a modeling strategy that respects the integer nature of the data when accounting for the serial correlation over time. Existing frameworks, such as Markov modeling, the INAR strategy, and the GLM framework, though successful in their own right, encounter challenges in accommodating numerous features as well as capturing certain dependence patterns such as long-range serial correlation. The Lévy-based modeling approach was first introduced in the area of turbulence modeling by Barndorff-Nielsen and Schmiegel [16], who found that the Lévy-based framework allows very flexible autocorrelation structures and can produce any kind of marginal distribution within the class of integer-valued infinitely divisible distributions. In the context of time series analysis, the Lévy-based framework has been adopted for modeling time series of count data in recent years; see Barndorff-Nielsen et al. [9], Veraart [17], Bennedsen et al. [18], and Leonte and Veraart [19]. This approach entails modeling serially correlated count and integer-valued data in continuous time and offers several advantages, including flexibility of the autocorrelation structure, simplicity, and accommodation of short- or long-memory processes. Owing to its simplicity and dynamical control, this framework can be enhanced to accommodate various features and to develop flexible models within the count time series landscape. Zero inflation and overdispersion are common aspects in various application areas, and failure to account for them, if present in the data, may result in misleading inference.

To the best of our knowledge, there are existing gaps in the literature. First, how can heavy-tailed count data be modeled, considering both short-range and long-range dependence under stationary conditions? In other words, how can we account for all memory ranges in count data, given that it exhibits stationarity and heavy-tailedness? A second question is this: how can these features be handled in a simultaneous modeling framework?

In this work, we consider marginal distributions that can account for more features in the data, such as zero inflation and overdispersion within the stationary setting and heavy-tailedness under both short- and long-range dependence. To achieve this aim, we develop stationary Poisson-inverse Gaussian Lévy-based models for time series of counts with heavy-tailed characteristics, and stationary semi-Poisson Lévy-based models for the simultaneous modeling of zero-inflated and overdispersed count time series.

The article is structured as follows: Section 2 provides brief preliminaries and the components of the Lévy-based modeling framework. Section 3 presents the model specification, which consists of choosing the distributions and the kernel set. We estimate the parameters of our models using moments-based methods and composite pairwise likelihood in Section 4. In Section 5, a simulation study is presented. Real-data applications are considered in Section 6. We conclude in Section 7.

2. Lévy Bases

This section briefly introduces the Lévy-based framework. This framework can accommodate any kind of marginal distribution within the class of integer-valued infinitely divisible distributions. Further details are provided in Barndorff-Nielsen and Schmiegel [16]. A Lévy basis is a random measure that is infinitely divisible and independently scattered, meaning it can be decomposed into an infinite number of smaller independent random measures. This characteristic is useful for modeling a variety of phenomena, such as disease spread, traffic movement, and customer arrivals at stores. Additional information about independently scattered random measures can be found in Rajput and Rosinski [20] and Kwapien and Woyczynski [21].

Let $(\Omega, \mathcal{F}, P)$ be a probability space, let $(S, \mathcal{S}, \mathrm{leb})$ denote a Lebesgue–Borel space, and let $\mathrm{leb}(\cdot)$ denote the Lebesgue measure on $S$. We assume that $S$ is a subset of $\mathbb{R}^d$, i.e., $S \subseteq \mathbb{R}^d$ with $d \geq 1$. The collection of Borel measurable sets with finite Lebesgue measure contained in $S$ can be thought of as a collection of events that have a time and a location in space. A set $A$ has finite measure if $\mathrm{leb}(A) < \infty$. A Lévy measure $\nu$ on $\mathbb{R}$ is a Borel measure such that $\nu(\{0\}) = 0$ and $\int_{\mathbb{R}} \min(1, z^2)\,\nu(dz) < \infty$. Finally, $\mathcal{B}_b(S)$ denotes the bounded Borel sets of $S$, that is, $\mathcal{B}_b(S) = \{A \in \mathcal{B}(S) : A \text{ is bounded}\}$.

The cumulant transform of a random variable $X$ is given by $C(\theta; X) = \log \mathbb{E}[\exp(i\theta X)]$ [22]. We write $X \overset{d}{=} Y$ if $X$ and $Y$ are identically distributed.

2.1. Definition

A Lévy basis $L$ on $S$ is a collection of random variables $\{L(A) : A \in \mathcal{B}_b(S)\}$ such that:

(i) The law of $L(A)$ is infinitely divisible for all $A \in \mathcal{B}_b(S)$. Thus, for any natural number $n$, $L(A)$ can be expressed as the sum of $n$ independent and identically distributed random measures evaluated at $A$.

(ii) The random variables $L(A_1), \ldots, L(A_n)$ are independent whenever $A_1, \ldots, A_n$ are disjoint (independent scattering property).

(iii) For every disjoint sequence $(A_n)_{n \geq 1}$ with bounded union, we have $L\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} L(A_n)$ almost surely (additivity property).

The Lévy basis controls the marginal distribution of the resulting stochastic process and is specified by an infinitely divisible distribution; this is the only restriction on its marginal distribution. It therefore supports a wide range of stochastic processes on integer or real states, with light- or heavy-tailed properties. In addition, a Lévy basis on $S$ is homogeneous if it is stationary, meaning that its statistical properties remain unchanged across different points in time.

2.2. Definition

A Lévy basis $L$ on $S$ is said to be stationary if, for any $u$ and any $A \in \mathcal{B}_b(S)$ such that $A + u \in \mathcal{B}_b(S)$, the random variables $L(A + u)$ and $L(A)$ are identically distributed.

A Lévy basis is considered homogeneous if it exhibits stationarity and its characteristic function takes the form
$\mathbb{E}\left[\exp(i\theta L(A))\right] = \exp\left(\mathrm{leb}(A)\left(i\theta a - \tfrac{1}{2}\theta^{2} b + \int_{\mathbb{R}}\left(e^{i\theta z} - 1 - i\theta z\,\mathbf{1}_{[-1,1]}(z)\right)\nu(dz)\right)\right),$
where $a \in \mathbb{R}$, $b \geq 0$, and $\nu$ is a Lévy measure on $\mathbb{R}$. The condition for a Lévy basis to be homogeneous is that its characteristic function must be of the form given above, called the Lévy–Khinchine form. This condition ensures that the distribution of $L(A)$ is the same for all sets $A$ that have the same size and shape, regardless of their location in $S$.

3. Model Specification

The Lévy-based framework has two key components: the choice of the marginal distribution, which has to be infinitely divisible, and the choice of the kernel set, whose shape must induce a flexible and valid autocorrelation structure capable of producing both short-range and long-range dependence. For various choices of kernel sets, see Barndorff-Nielsen et al. [9] and Veraart [17]. The criteria for choosing the kernel set in this framework are, first, that it must have finite Lebesgue measure and, second, since we concentrate on stationary processes in this context, that the shape of the kernel set remains constant over time.

The Lévy basis determines the marginal law of the process with the chosen distribution depending on the problem at hand. It can handle various marginal distributions as long as they are infinitely divisible. The semi-Poisson distribution is effective in addressing overdispersion and zero-inflation scenarios, while in cases of heavy-tailedness modeling, we consider the Poisson-inverse Gaussian distribution.

3.1. Stationary Poisson-Inverse Gaussian Process

Using the Lévy-based framework, we define the following observation-driven model with a Poisson-inverse Gaussian process: $X_t = L(A_t)$, $t \geq 0$, where $L$ is a homogeneous Poisson-inverse Gaussian Lévy basis, $X_t$ is the value at time $t$ of a stochastic process following a Poisson-inverse Gaussian distribution, and $A_t$ is a kernel set.

More specifically, the Poisson-inverse Gaussian process is given bywith .

Moreover, a Lévy basis on is said to be stationary if for any and such that then

More specifically, the Poisson-inverse Gaussian basis is characterized by the probability mass function (pmf) of the Poisson-inverse Gaussian distribution, which is derived from a mixed Poisson distribution. The proposed model offers enhanced flexibility in capturing complex autocorrelation structures through the incorporation of a kernel set $D$, potentially providing a better fit for time series data compared to the standard PIG distribution.

3.2. Definition

A discrete random variable $X$ follows a Poisson-inverse Gaussian (PIG) distribution parameterized by two positive real numbers. Its stochastic representation is as follows: $X$ given $\lambda$ is Poisson with mean $\lambda$, where $\lambda$ is a random variable with an inverse Gaussian distribution. We write $X \sim \mathrm{PIG}$; the moment-generating function of $X$ follows from the Laplace transform of the inverse Gaussian mixing distribution.

The probability mass function is expressed in terms of the modified Bessel function of the third kind, a special function that can be evaluated using software such as Maple and Mathematica.

The mean is defined as follows:

The variance is defined as follows:

The heavy tail (HT) is defined as follows:

In this scenario, the structure of the model implies that it cannot exhibit negative correlations. Additionally, as the distance between observation times grows infinitely large, the overlapping region of the kernel sets tends towards 0. This characteristic guarantees that the process satisfies a mixing condition.
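To illustrate the marginal behavior implied by this specification, the following R sketch draws Poisson-inverse Gaussian variates via the mixture representation above: an inverse Gaussian mean is sampled first, then a Poisson count. The inverse Gaussian sampler follows Michael, Schucany, and Haas (1976); the parameterization (mu, shape) and the chosen values are illustrative assumptions and not the paper's notation.

```r
# Sample lambda ~ InverseGaussian(mu, shape) by the Michael-Schucany-Haas method,
# then X | lambda ~ Poisson(lambda), giving a Poisson-inverse Gaussian draw.
rinvgauss_msh <- function(n, mu, shape) {
  z <- rnorm(n)^2
  x <- mu + mu^2 * z / (2 * shape) -
       mu / (2 * shape) * sqrt(4 * mu * shape * z + mu^2 * z^2)
  u <- runif(n)
  ifelse(u <= mu / (mu + x), x, mu^2 / x)   # accept the smaller or larger root
}
rpig <- function(n, mu, shape) rpois(n, lambda = rinvgauss_msh(n, mu, shape))

y <- rpig(1e5, mu = 2, shape = 1)   # illustrative parameter values
c(mean = mean(y), var = var(y))     # variance exceeds the mean: overdispersion
max(y)                              # occasional large counts reflect the heavy right tail
```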

3.3. Stationary Semi-Poisson Process

Using the Lévy-based framework, we define the following observation-driven model with a semi-Poisson distribution: $X_t = L(A_t)$, $t \geq 0$, where $L$ is a homogeneous semi-Poisson Lévy basis and $A_t$ is a kernel set defined by an exponential or sup-IG shape.

More specifically, the semi-Poisson process is given by

Furthermore, a Lévy basis on is said to be stationary if for any and such that , then

More specifically, the semi-Poisson basis satisfies where probability mass function (pmf) of semi-Poisson distribution is defined bywhere and .

The cumulative function is given by

The mean is defined as follows:

Hence, the variance can be obtained as follows:

The index of dispersion is defined as follows:

We introduce the zero inflation index as follows:

The zero inflation index is a measure of the excess of zeros in a dataset. A negative value indicates that there are more zeros than expected, a value of zero indicates no excess zeros, and a positive value indicates fewer zeros than expected.

Let . For each component, the autocovariance and autocorrelation functions are given by

Hence,

In this work, for the kernel sets, we consider a parametric specification of the form given below, where the two parameters are location and scale parameters, respectively.

For short-range dependence, we consider the exponential shape $d(s) = e^{\lambda s}$ for $s \leq 0$, and the autocorrelation function is given by $\rho(h) = e^{-\lambda h}$ with $\lambda > 0$; consequently, $\rho(h) > 0$ for all $h \geq 0$, and $\rho(h) \to 0$ as $h \to \infty$.

For long-range dependence, we use the sup-IG shape. The corresponding autocorrelation function decays more slowly than exponentially, so that the process exhibits long memory for suitable parameter values.

The choice of our kernel sets is due to both analytical tractability and modeling flexibility: the exponential kernel is the simplest, while the sup-IG kernel is flexible, consistent with the data properties, and computationally tractable. For the process’s realization, the kernel set moves along the time axis via the location parameter, which governs the movement and temporal dependence of the process, while the scale parameter controls the strength and pattern of this dependence.
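The role of the kernel can be checked numerically. Under the common exponential trawl specification $d(s) = e^{\lambda s}$ for $s \leq 0$ (an assumption consistent with the short-range kernel above), the normalized overlap of two trawl sets at lag $h$ equals $e^{-\lambda h}$, which is the exponential autocorrelation quoted above. The R sketch below verifies this by numerical integration; the symbol names and values are illustrative.

```r
# Overlap of two exponential trawl sets A_0 and A_h, normalized by leb(A_0):
# A_t = {(x, y): x <= t, 0 <= y <= exp(lambda * (x - t))}.
lambda <- 0.5; h <- 2
leb_A   <- integrate(function(s) exp(lambda * s), lower = -Inf, upper = 0)$value        # = 1/lambda
overlap <- integrate(function(s) exp(lambda * (s - h)), lower = -Inf, upper = 0)$value  # = exp(-lambda*h)/lambda
c(numerical = overlap / leb_A, theoretical = exp(-lambda * h))                          # should agree
```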

4. Parameter Estimation

This section looks into the statistical properties of the moments-based methods and composite likelihood based on pairs of observations for the estimation of parameters. We give a thorough overview of moments-based methods. We also review pairwise likelihood methods and highlight their advantages over the standard likelihood method. Indeed, the maximum likelihood becomes impractical when the number of observations is very large. This is mainly due to computational challenges that arise with the increased size of the dataset. Pairwise likelihood can be useful in situations with large datasets or complex models where computing the whole likelihood function is difficult or when data are sparse or partial.

4.1. Composite Pairwise Likelihood

The introduction of composite likelihood methods is important because there are a number of situations, such as time series models, where the computation of the full likelihood is very difficult and too time-consuming [23]. The term composite likelihood denotes a general class of pseudolikelihoods constructed from likelihood-type objects [24]. Consider a vector random variable $Y$ with probability density function $f(y; \theta)$ for an unknown parameter vector $\theta$.

Let $\{\mathcal{A}_1, \ldots, \mathcal{A}_K\}$ be a collection of marginal or conditional events, with associated likelihoods $\mathcal{L}_k(\theta; y) \propto f(y \in \mathcal{A}_k; \theta)$, $k = 1, \ldots, K$.

A composite likelihood is defined as $\mathcal{L}_C(\theta; y) = \prod_{k=1}^{K} \mathcal{L}_k(\theta; y)^{w_k}$, where the $w_k$ are suitable nonnegative weights that do not depend on $\theta$. Here, we discuss a strategy based on a simple composite likelihood known as the “pairwise likelihood.” Its advantage is that it reduces the computational burden, making it possible to fit highly structured statistical models. The pairwise likelihood is a statistical technique that breaks down the joint likelihood function into a product of pairwise conditional or marginal likelihoods. This simplification allows for more manageable parameter estimation and inference, making it particularly useful in situations with complex or high-dimensional data where traditional likelihood methods become computationally infeasible. For more details, we refer to Lindsay [24], Varin et al. [25], and Davis and Yau [26]. Since the bivariate distributions are available in closed form, this technique has also been used in Bennedsen et al. [27] to draw inference for related processes. For the pairs $(X_t, X_{t+h})$, the trawl sets can be decomposed as $A_t = (A_t \setminus A_{t+h}) \cup (A_t \cap A_{t+h})$ and $A_{t+h} = (A_{t+h} \setminus A_t) \cup (A_t \cap A_{t+h})$, so that $L(A_t \setminus A_{t+h})$, $L(A_{t+h} \setminus A_t)$, and $L(A_t \cap A_{t+h})$ are random variables, and $X_t$ and $X_{t+h}$ have the component $L(A_t \cap A_{t+h})$ in common. Since the semi-Poisson Lévy basis is independently scattered and these sets are disjoint, the corresponding components are independent. The pairwise composite likelihood function of order $m$ (the maximum time lag) is $\mathrm{PL}_m(\theta; y) = \prod_{h=1}^{m} \prod_{t=1}^{n-h} f(y_t, y_{t+h}; \theta)$, where $y = (y_1, \ldots, y_n)$ is the observed data, $\theta$ is the parameter to be estimated, and $Y_t$ and $Y_{t+h}$ represent the random variables at times $t$ and $t+h$, respectively.

Davis and Yau [26] provided evidence that it is not necessary to use all possible lags.

To estimate the parameter $\theta$ based on the pairwise composite likelihood function, we maximize the pairwise log-likelihood of the form $\ell_m(\theta; y) = \sum_{h=1}^{m} \sum_{t=1}^{n-h} \log f(y_t, y_{t+h}; \theta)$, up to an additive constant that does not depend on $\theta$.
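To make the pairwise likelihood concrete, the sketch below implements it for the analytically simplest case of a Poisson Lévy basis with an exponential kernel, where the joint pmf of a pair factorizes through a shared Poisson component over the overlap. This is an illustrative stand-in: the semi-Poisson and PIG bases used in the paper require their own bivariate pmfs, plugged in at the same place. All function and parameter names below are our own.

```r
# Bivariate pmf of (X_t, X_{t+h}) for a Poisson basis with intensity nu and an
# exponential kernel with decay lam: the shared component over A_t ∩ A_{t+h}
# is Poisson(nu * exp(-lam * h) / lam); the remaining parts are independent Poissons.
pair_pmf <- function(x, y, h, nu, lam) {
  m  <- nu / lam                   # nu * leb(A)
  mh <- nu * exp(-lam * h) / lam   # nu * leb(A_t ∩ A_{t+h})
  k  <- 0:min(x, y)                # possible values of the shared component
  sum(dpois(k, mh) * dpois(x - k, m - mh) * dpois(y - k, m - mh))
}

# Negative pairwise log-likelihood using lags 1..H (log-parameterized for positivity).
pl_neg <- function(par, x, H = 3) {
  nu <- exp(par[1]); lam <- exp(par[2])
  ll <- 0
  for (h in 1:H)
    for (t in 1:(length(x) - h))
      ll <- ll + log(pair_pmf(x[t], x[t + h], h, nu, lam) + 1e-300)
  -ll
}

# fit <- optim(c(log(3), log(0.5)), pl_neg, x = X)   # X: observed count series
# exp(fit$par)                                        # estimates of (nu, lambda)
```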

4.2. Method of Moments Estimation

While the pairwise composite likelihood has proven effective within the univariate context and has demonstrated efficient computational performance in terms of both time and estimation, its application encounters challenges in the presence of multiple components. The approach faces computational challenges of increased magnitude, especially when closed-form solutions are unavailable due to the nontrivial interaction of two distributions, as in the Poisson-inverse Gaussian case. Another approach that bridges this gap is the method of moments estimation (MME), introduced by Karl Pearson in 1894. MME provides a flexible framework for parameter estimation and obtains the estimates by equating functions of the sample to their theoretical moments. MME is a special case of the generalized method of moments (GMM) and is often used to estimate the parameters of a distribution by equating the sample moments to the theoretical moments of the distribution. Given that it is often impractical to gather data from an entire population, we rely on a sample taken from that population to estimate its moments. The notion of a moment is fundamental for describing features of a population.

Suppose the observations $y_1, \ldots, y_n$ form a sample from a population for which we aim to estimate an unknown parameter vector $\theta$. This estimation involves a vector of statistics whose expected values under the specific model represent their theoretical counterparts. The concept of a “moment condition” is introduced: it involves the expectation of a function $g$ and is defined as $\mathbb{E}[g(Y; \theta_0)] = 0$, where $g$ is a continuous vector function of $\theta$ whose expectation is finite and exists for all values of $y$ and $\theta$. In practical terms, this moment condition is approximated using its sample equivalent, $\bar{g}_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} g(y_t; \theta)$, and setting it to zero allows us to obtain the estimator $\hat{\theta}$. When the dimension of $g$ equals the dimension of $\theta$, we arrive at what is known as the method of moments (MM) estimator. The MM estimator is obtained by minimizing the expression $\bar{g}_n(\theta)^{\top} W_n \bar{g}_n(\theta)$, where $W_n$ represents a symmetric and positive definite weight matrix. It can depend on the data and should converge in probability to a positive definite matrix $W$.

In our case for the moment conditions, with unknown parameters , let and denote the first- and second-order moments respectively, then

By substituting the first-moment estimator into the equation of , we have

This equation expresses the sample variance estimator in terms of the first-moment estimator and the deviations of each data point from it. The criterion function therefore relates the first- and second-moment estimators through these deviations, and it serves for parameter estimation and for assessing the goodness of fit of a distribution to the given data.
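A minimal sketch of the moment-matching step follows. It assumes that the theoretical mean and variance functions m1(theta) and m2(theta) (for instance, those given for the PIG or semi-Poisson marginals in Section 3) are supplied by the user, and it uses the identity matrix as the weight matrix; all names are illustrative.

```r
# Generic method-of-moments objective: match the sample mean and variance to their
# theoretical counterparts, i.e., minimize g(theta)' W g(theta) with W = identity.
mm_objective <- function(theta, x, m1, m2) {
  g <- c(mean(x) - m1(theta), var(x) - m2(theta))  # empirical moment conditions
  sum(g^2)
}
# Usage (m1_fun and m2_fun are the model-specific closed-form moment functions):
# est <- optim(par = theta_start, fn = mm_objective, x = X, m1 = m1_fun, m2 = m2_fun)
```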

5. Simulation Studies

5.1. Slice Partition

This section presents the simulation approach based on a slice partition. We start by decomposing the trawl sets into distinct slices. This allows the values of the Poisson-inverse Gaussian Lévy basis to be simulated over each slice, and the value of the process is then obtained as the sum of the basis values across the slices contained within each trawl set.

Exploiting the independently scattered nature of the Lévy basis, we can sample the slice values independently and then utilize the additivity property of the Lévy basis [19]. As a result, we can reconstruct the value of the trawl process by summing the simulated basis values over the slices contained in each trawl set.

For instance, if there exists a such that , let , using the ceiling function , and this consequently defines the slice partition as follows:

Consequently, consecutive trawl sets share nonempty intersections, with each containing exactly the same number of slices. By the translational invariance of the Lebesgue measure, slices at the same relative position have equal areas, which simplifies the computation of the slice areas. The calculation is defined as follows:

The equations given above completely determine the values of the slice areas.
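As a quick sanity check of this construction, the sketch below simulates a trawl process directly by drawing the points of a (truncated) Poisson Lévy basis on a strip and counting those falling in each trawl set. It uses a Poisson basis with an exponential kernel as an illustrative stand-in for the PIG basis, and it is a complement to, not an implementation of, the slice partition described above; all names and values are ours.

```r
# Direct point-process simulation of a trawl process with a homogeneous Poisson
# Levy basis (intensity nu) and exponential trawl d(s) = exp(lambda * s), s <= 0.
set.seed(123)
nu <- 3; lambda <- 0.5; n <- 500
depth <- 20 / lambda                          # truncate the trawl set where d() is negligible
x_lo <- 1 - depth; x_hi <- n                  # horizontal extent covering all trawl sets
N  <- rpois(1, nu * (x_hi - x_lo))            # basis points on the strip [x_lo, x_hi] x [0, 1]
px <- runif(N, x_lo, x_hi)
py <- runif(N)
X  <- sapply(1:n, function(t) sum(px <= t & py <= exp(lambda * (px - t))))
acf(X, lag.max = 30)                          # empirical ACF close to exp(-lambda * h)
```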

5.2. Inverse Transform Method

For the semi-Poisson Lévy basis, we generate random variables using the inverse transform method.

Theorem 1. Let be a random variable with cumulative distribution function (cdf) (continuous or not).

Then,

Proof. Let $u \in (0, 1)$ and suppose that $X$ has cumulative distribution function (cdf) $F$. From the above, the (generalized) inverse cdf can be defined as $F^{-1}(y) = \inf\{x : F(x) \geq y\}$ for $0 < y < 1$.

Proposition 2. Let $F$ denote any given cumulative distribution function (cdf), and let $F^{-1}$, with $F^{-1}(y) = \inf\{x : F(x) \geq y\}$ for $0 < y < 1$, be the inverse function defined above. Let $U \sim \mathrm{Uniform}(0, 1)$ and define $X = F^{-1}(U)$; then $X$ is distributed according to $F$, that is, $P(X \leq x) = F(x)$.

Proof. Let us show that $P(X \leq x) = F(x)$ with $X = F^{-1}(U)$. First, assume $F$ to be continuous and strictly increasing. Then the events $\{F^{-1}(U) \leq x\}$ and $\{U \leq F(x)\}$ coincide: by monotonicity of $F$, if $F^{-1}(u) \leq x$, then $u \leq F(x)$; likewise, if $u \leq F(x)$, then $F^{-1}(u) \leq x$. This establishes the equality of the two events, as was sought to prove, and taking probabilities yields $P(X \leq x) = P(U \leq F(x)) = F(x)$. In the general context, it is straightforward to show that the two events can differ at most on a set where $U$ takes a single fixed value, which leads to the same outcome when considering probabilities (since such an event has probability zero due to the continuous nature of the random variable $U$).
In this case, we simulate the semi-Poisson process following the scheme outlined below.
We first identify the cumulative distribution function (cdf) of the probability distribution we want to generate random numbers from. Next, we compute the inverse of the cdf, also referred to as the quantile function or percent-point function; this inverse cdf takes a probability value as input and outputs the corresponding value from the desired distribution. Once we have the inverse cdf, we generate a random number from a uniform distribution on the interval $(0, 1)$ and apply the inverse cdf to it.
The cumulative distribution function of a semi-Poisson random variable was given in Section 3.3. Setting it equal to a uniform draw and solving for the count value yields an expression for the inverse of the cdf in terms of the distribution’s parameter.
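A generic version of this scheme in R is sketched below: it tabulates the cdf of a discrete distribution on {0, 1, 2, ...} from its pmf and inverts it on uniform draws. The Poisson pmf is used only as an illustrative stand-in for the semi-Poisson pmf, which would be substituted in the same place; the function names and the truncation point are our own.

```r
# Inverse-transform sampler for a discrete distribution on {0, 1, 2, ...} given its pmf.
r_discrete_inv <- function(n, pmf, max_k = 1000) {
  k   <- 0:max_k
  cdf <- cumsum(pmf(k))                 # F(k), truncated at max_k
  u   <- runif(n)                       # U ~ Uniform(0, 1)
  # F^{-1}(u) = smallest k with F(k) >= u
  sapply(u, function(ui) k[which(cdf >= ui)[1]])
}

x <- r_discrete_inv(1000, function(k) dpois(k, lambda = 2))  # Poisson pmf as a stand-in
c(mean = mean(x), var = var(x))
```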
Now, we simulate our models by considering both the exponential kernel for short-range dependence and the sup-IG kernel for long-range dependence. We then estimate the parameter vectors of the respective models, including the parameter of the semi-Poisson distribution. We compute the mean, bias, and mean-squared error (MSE), $\mathrm{MSE} = \frac{1}{R} \sum_{r=1}^{R} (\hat{\theta}_r - \theta)^2$, where $\hat{\theta}_r$ represents the parameter vector estimated from the $r$th simulated series.
The study investigated the stationarity of data generated from a Poisson-inverse Gaussian Lévy-based model using an exponential kernel, as depicted in Figures 1 and 2, and extended this investigation to a sup-IG kernel, illustrated in Figure 3. Stationarity was confirmed through the observation of exponential decay in the autocorrelation function. The parameter estimates for these simulated series are summarized in Tables 1 and 2; the results indicate that larger sample sizes enhance the accuracy of parameter estimation. Additionally, Figures 4 and 5 demonstrate the stationarity of data from the semi-Poisson Lévy-based model with the exponential and sup-IG kernels, respectively. Parameter estimation was also conducted using the pairwise likelihood estimation method, which yielded more accurate estimates with increasing sample sizes, as shown in Tables 3 and 4. Finally, the trawl process of the sup-IG model, depicted in Figure 6, exhibits sustained long-term dependence under the estimated parameters.

6. Real-Data Applications

This section discusses the real-data applications of the proposed models: the Poisson-inverse Gaussian Lévy-based and semi-Poisson Lévy-based models. Data for the second model were obtained from the Meteorological Services Department of Guinea Conakry. For the first model, we analyzed data consisting of the number of NSF awards granted to academic institutions, which is discussed in Qian et al. [10] and available at https://www.nsf.gov.

6.1. Dataset 1

These authors used the NSF award count data, presented in Figure 7, to fit the GPIG-INAR model. Figure 8 shows that the data are stationary. The proposed Poisson-inverse Gaussian Lévy-based model exhibits flexibility in capturing both short-term (Table 5; Figure 9) and long-term (Table 5; Figure 10) dependencies within the same dataset. Table 5 shows that the PIG Lévy-based model is particularly suitable for the given data, as it has a lower AIC value. The mean predictions from the two models are comparable.

We also fitted the PIG Lévy-based model to the data (Tables 6 and 7) using an exponential and a sup-IG kernel, respectively, to capture both short-range and long-range dependence. Mean estimates from the model were slightly higher, while variance estimates were slightly lower (Table 5), and the AIC was slightly higher. Although a comparison with existing models could not be made, because they are not constructed within the long-range framework, the model still provided more precise predictions for the NSF award counts (Figure 10). The extension to long-range dependence is an advantage of our framework.

6.2. Dataset 2

We computed weekly rainfall frequencies from daily rainfall data for the N’zerekore region in Guinea Conakry between 2008 and 2023 (Figure 11); the autocorrelation and partial autocorrelation functions of the data are shown in Figure 12.

We considered a rainfall amount of at most 2.54 mm to represent a dry day; thus, the data are zero-inflated and overdispersed. The parameter estimates (Table 8) indicate that the semi-Poisson random variable from this model places a high probability on zero counts. The semi-Poisson Lévy-based model captured this tendency very well, predicting quantiles with a high degree of closeness (Table 9; Figures 13 and 14). Additionally, the predicted values were shown to follow the same probability distribution as the raw data, as depicted in Figures 15 and 16.

6.3. Goodness of Fit of the Model

To verify the accuracy of the model, we used the mean absolute error, $\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$ (equation (52)), and the coefficient of determination, $R^2 = 1 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / \sum_{i=1}^{n} (y_i - \bar{y})^2$ (equation (53)).

For the semi-Poisson Lévy-based model, the MAE is 0.128, implying that the model values are, on average, very close to the data values. The value of $R^2$ is 0.823, indicating that the model follows the data much more closely than the mean level of the data. The confidence interval for $R^2$ was calculated using the formula given in [28].

The 90%, 95%, and 99% confidence intervals were computed accordingly, confirming that $R^2$ remains close to the value of 82.3%.
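For reference, the two fit metrics reduce to one line each in R; here y denotes the observed counts and yhat the model predictions, both placeholder names.

```r
# Goodness-of-fit metrics (y and yhat are placeholders for data and fitted values).
mae <- mean(abs(y - yhat))                           # mean absolute error
r2  <- 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)  # coefficient of determination
```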

7. Summary and Conclusions

In conclusion, our exploration of the Lévy-based modeling framework for time series analysis of count data has revealed its remarkable flexibility and versatility. The developed time series models offer a powerful tool for simultaneously capturing diverse characteristics and features in count time series, including complex dependence structures and critical aspects such as heavy-tailedness, overdispersion, and zero inflation. Our study has also emphasized the importance of achieving realism in modeling by combining a Lévy basis, whose marginal distributions are infinitely divisible, with a kernel set for dependence modeling. Moreover, we have underscored the significance of stationary and homogeneous Lévy bases to ensure statistical consistency across time and space. Our simulations and real-data applications have demonstrated this approach’s practical relevance, flexibility, and potential advantages. There are several potential directions for future research. One compelling direction is the extension of Lévy-based models to multivariate settings, where higher-order dependencies can be effectively addressed; this would allow comprehensive modeling of both short-term and long-term serial correlation structures. Finally, theoretical advances in understanding these models’ features and limitations will contribute to a better understanding of their applicability and resilience. Overall, our findings highlight the promise of the Lévy-based paradigm and motivate further research into its application to count data analysis.

Data Availability

The data supporting the current study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors express their appreciation to the African Union, as well as to anonymous reviewers and editors, for their support in funding the research and for providing valuable feedback that contributed to the refinement of the paper in its current state. This work was funded by the Pan African University Institute for Basic Sciences, Technology, and Innovation.

Supplementary Materials

Supplementary materials include derivations and R codes for reference.