Journal of Probability and Statistics

Volume 2018, Article ID 1012647, 12 pages

https://doi.org/10.1155/2018/1012647

## A Poisson-Gamma Model for Zero Inflated Rainfall Data

^{1}Pan African University Institute of Basic Sciences, Technology and Innovation, Juja, Kenya^{2}University of Nairobi, Nairobi, Kenya^{3}Kenyatta University, Nairobi, Kenya

Correspondence should be addressed to Nelson Christopher Dzupire; wm.ca.cc@eripuzdn

Received 7 November 2017; Accepted 20 February 2018; Published 4 April 2018

Academic Editor: Steve Su

Copyright © 2018 Nelson Christopher Dzupire et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Rainfall modeling is significant for prediction and forecasting purposes in agriculture, weather derivatives, hydrology, and risk and disaster preparedness. Normally two models are used to model the rainfall process as a chain dependent process representing the occurrence and intensity of rainfall. Such two models help in understanding the physical features and dynamics of rainfall process. However rainfall data is zero inflated and exhibits overdispersion which is always underestimated by such models. In this study we have modeled the two processes simultaneously as a compound Poisson process. The rainfall events are modeled as a Poisson process while the intensity of each rainfall event is Gamma distributed. We minimize overdispersion by introducing the dispersion parameter in the model implemented through Tweedie distributions. Simulated rainfall data from the model shows a resemblance of the actual rainfall data in terms of seasonal variation, means, variance, and magnitude. The model also provides mechanisms for small but important properties of the rainfall process. The model developed can be used in forecasting and predicting rainfall amounts and occurrences which is important in weather derivatives, agriculture, hydrology, and prediction of drought and flood occurrences.

#### 1. Introduction

Climate variables, in particular, rainfall occurrence and intensity, hugely impact human and physical environment. Knowledge of the frequency of the occurrence and intensity of rainfall events is essential for planning, designing, and management of various water resources system [1]. Specifically rain-fed agriculture is a sensitive sector to weather and crop production is directly dependent on the amount of rainfall and its occurrence. Rainfall modeling has a great impact on crop growth, weather derivatives, hydrological systems, drought, and flood management and crop simulated studies.

Rainfall modeling is also important in pricing of weather derivatives which are financial instruments that are used as a tool for risk management to reduce risk associated with adverse or unexpected weather conditions.

Further as climate change greatly affects the environment there is an urgent need for predicting the variability of rainfall for future periods for different climate change scenarios in order to provide necessary information for high quality climate related impact studies [1].

However modeling precipitation poses a lot of challenges, namely, accurate measurement of precipitation since rainfall data consists of sequences of values which are either zero or some positive numbers (intensity) depending on the depth of accumulation over discrete intervals. In addition factors like wind can affect collection accuracy. Rainfall is localized unlike temperature which is highly correlated across regions; therefore a derivative holder based on rainfall may suffer geographical basis risk in case of pricing weather derivatives. The final challenge is the choice of a proper probability distribution function to describe precipitation data. The statistical property of precipitation is far more complex and a more sophisticated distribution is required [2].

Rainfall has been modeled as a chain dependent process where a two-state Markov chain model represents the occurrence of rainfall and the intensity of rainfall is modeled by fitting a suitable distribution like Gamma [3], exponential, and mixed exponential [1, 4]. These models are easy to understand and interpret and use maximum likelihood to find the parameters. However models involve many parameters to fully describe the dynamics of rainfall as well as making several assumptions for the process.

Wilks [5] proposed a multisite model for daily precipitation using a combination of two-state Markov process (for the rainfall occurrence) and a mixed exponential distribution (for the precipitation amount). He found that the mixture of exponential distributions offered a much better fit than the commonly used Gamma distribution.

In study of Leobacher and Ngare [3] the precipitation is modeled on a monthly basis by constructing a suitable Markov-Gamma process to take into account seasonal changes of precipitation. It is assumed that rainfall data for different years of the same month is independent and identically distributed. It is assumed that precipitation can be forecast with sufficient accuracy for a month.

Another approach of modeling rainfall is based on the Poisson cluster model where two of the most recognized cluster based models in the stochastic modeling of rainfall are the Newman-Scott Rectangular Pulses model and the Bartlett-Lewis Rectangular Pulse model. These models represent rainfall sequences in time and rainfall fields in space where both the occurrence and depth processes are combined. The difficulty in Poisson cluster models as observed by Onof et al. [6] is the challenge of how many features should be addressed so that the model is still mathematically tractable. In addition the models are best fitted by the method of moments and so requires matching analytic expressions for the statistical properties such as mean and variance.

Carmona and Diko [7] developed a time-homogeneous jump Markov process to describe rainfall dynamics. The rainfall process was assumed to be in form of storms which consists of cells themselves. At a cell arrival time the rainfall process jumps up by a random amount and at extinction time it jumps down by a random amount, both modeled as Poisson process. Each time the rain intensity changes, an exponential increase occurs either upwards or downwards. To preserve nonnegative intensity, the downward jump size is truncated to the current jump size. The Markov jump process also allows for a jump directly to zero corresponding to the state of no rain [8].

In this study the rainfall process is modeled as a single model where the occurrence and intensity of rainfall are simultaneously modeled. The Poisson process models the daily occurrence of rainfall while the intensity is modeled using Gamma distribution as the magnitude of the jumps of the Poisson process. Hence we have a compound Poisson process which is Poisson-Gamma model. The contribution of this study is twofold: a Poisson-Gamma model that simultaneously describes the rainfall occurrence and intensity at once and a suitable model for zero inflated data which reduces overdispersion.

This paper is structured as follows. In Section 2 the Poisson-Gamma model is described and then formulated mathematically while Section 3 presents methods of estimating the parameters of the model. In Section 4 the model is fitted to the data and goodness of fit of the model is evaluated by mean deviance whereas quantile residuals perform the diagnostics check of the model. Simulation and forecasting are carried out in Section 5 and the study concludes in Section 6.

#### 2. Model Formulation

##### 2.1. Model Description

Rainfall comprises discrete and continuous components in that if it does not rain the amount of rainfall is discrete whereas if it rains the amount is continuous. In most research works [3, 4, 9] the rainfall process is presented by use of two separate models: one is for the occurrence and conditioned on the occurrence and another model is developed for the amount of rainfall. Rainfall occurrence is basically modeled as first or higher order Markov chain process and conditioned on this process a distribution is used to fit the precipitation amount. Commonly used distributions are Gamma, exponential, mixture of exponential, Weibull, and so on. These models work based on several assumptions and inclusion of several parameters to capture the observed temporal dependence of the rainfall process. However rainfall data exhibit overdispersion [10] which is caused by various factors like clustering, unaccounted temporal correlation, or the fact that the data is a product of Bernoulli trials with unequal probability of events. The stochastic models developed in this way underestimate the overdispersion of rainfall data which may result in underestimating the risk of low or high seasonal rainfall.

Our interest in this research is to simultaneously model the occurrence and intensity of rainfall in one model. We would model the rainfall process by using a Poisson-Gamma probability distribution which is flexible to model the exact zeros and the amount of rainfall together.

Rainfall is modeled as a compound Poisson process which is a Lévy process with Gamma distributed jumps. This is motivated by the sudden changes of rainfall amount from zero to a large positive value following each rainfall event which are modeled as pure jumps of the compound Poisson process.

We assume rainfall arrives in forms of storms following a Poisson process, and at each arrival time the current intensity increases by a random amount based on Gamma distribution. The jumps of the driving process represent the arrival of the storm events generating a jump size of random size. Each storm comprises cells that also arrive following another Poisson process.

The Poisson cluster processes gives an appropriate tool as rainfall data indicating presence of clusters of rainfall cells. As observed by Onof et al. [6] use of Gamma distributed variables for cell depth improves the reproduction of extreme values.

Lord [11] used the Poisson-Gamma compound process to model the motor vehicle crashes where they examined the effects of low sample mean values and small sample size on the estimation of the fixed dispersion parameter. Wang [12] proposed a Poisson-Gamma compound approach for species richness estimation.

##### 2.2. Mathematical Formulation

Let be total number of rainfall event per day following a Poisson process such thatThe amount of rainfall is the total sum of the jumps of each rainfall event, say , assumed to be identically and independently Gamma distributed and independent of the times of the occurrence of rainfall:such that is with probability density function

Lemma 1. *The compound Poisson process (2) has a cumulant function for and , where is the moment generating function of the Gamma distribution.*

*Proof. *The moment generating function of is given by So the cumulant of is

If we observe the occurrence of rainfall for periods, then we have the sequence which is independent and identically distributed.

If on a particular day there is no rainfall that occurred, then

Therefore the process has a point mass at which implies that it is not entirely continuous random variable.

Lemma 2. *The probability density function of in (2) iswhere is a dirac function at zero.*

*Proof. *Let be the probability that it rained. Hence for we have If we let and , then we haveWe can express the probability density function in terms of a Dirac function as

Consider a random sample of size of with the probability density functionIf we assume that there are positive values , then there are zeros where .

We observe that and ; hence the likelihood function is and the log-likelihood for is Now for we have We can observe from the above evaluation that can not be expressed in closed form; similar derivation also shows that as well can not be expressed in closed form. Therefore we can only estimate and using numerical methods. Withers and Nadarajah [13] also observed that the probability density function can not be expressed in closed form and therefore it is difficult to find the analytic form of the estimators. So we will express the probability density function in terms of exponential dispersion models as described below.

*Definition 3 (see [14]). *A probability density function of the formfor suitable functions and is called an exponential dispersion model.

is the dispersion parameter. The function is the cumulant of the exponential dispersion model; since , then are the successive cumulants of the distribution [15]. The exponential dispersion models were first introduced by Fisher in 1922.

If we let as a contribution of to the likelihood function , then However we expect that and so that Furthermore Therefore the mean of the distribution is and the variance is .

The relationship is invertible so that can be expressed as a function of ; as such we have , where is called a variance function.

*Definition 4. *The family of exponential dispersion models, whose variance functions are of the form for , are called Tweedie family distributions.

Examples are as follows: for then we have a normal distribution, , and ; it is a Poisson distribution, and Gamma distribution for , while when it is Gaussian inverse distribution. Tweedie densities can not be expressed in closed form (apart from the examples above) but can instead be identified by their cumulants generating functions.

From , then for Tweedie family distribution we have Hence we can solve for and as follows: by equating the constants of integration above to zero.

For we have so that

Proposition 5. *The cumulant generating function of a Tweedie distribution for is *

*Proof. *From (16) the moment generating function is given by Hence cumulant generating function is For we substitute and to have

By comparing the cumulant generating functions in Lemma 1 and Proposition 5 the compound Poisson process can be thought of as Tweedie distribution with parameters expressed as follows:

The requirement that the Gamma shape parameter be positive implies that only Tweedie distributions between can represent the Poisson-Gamma compound process. In addition, for , implies and .

Proposition 6. *Based on Tweedie distribution, the probability of receiving no rainfall at all is and the probability of having a rainfall event is where *

*Proof. *This follows by directly substituting the values of and into (16).

The function is an example of Wright’s generalized Bessel function; however it can not be expressed in terms of the more common Bessel function. To evaluate it the value of is determined for which the function reaches the maximum [15].

#### 3. Parameter Estimation

We approximate the function following the procedure by [15] where the value of is determined for which reaches maximum. We treat as continuous so that is differentiated with respect to and set the derivative to zero. So for we have the following.

Lemma 7 (see [15]). *The log maximum approximation of is given by where .*

*Proof. * Substituting the values of in the above equation we have The term depends on the values so we maximize the summation Considering we have Using Stirling’s approximation of Gamma functions we have And hence we have For we have ; hence the logarithms have positive arguments. Differentiating with respect to we have where is ignored for large . Solving for we have Substituting in to find the maximum approximation of we have Hence the result follows.

It can be observed that is monotonically decreasing; hence is strictly convex as a function of . Therefore decays faster than geometrically on either side of [15]. Therefore if we are to estimate by the approximation error is bounded by geometric sum For quick and accurate evaluation of , the series is summed for only those terms in the series which contribute significantly to the sum.

Generalized linear models extend the standard linear regression models to incorporate nonnormal response distributions and possibly nonlinear functions of the mean. The advantage of GLMs is that the fitting process maximizes the likelihood for the choice of the distribution for a random variable and the choice is not restricted to normality unlike linear regression [16].

The exponential dispersion models are the response distributions for the generalized linear models. Tweedie distributions are members of the exponential dispersion models upon which the generalized linear models are based. Consequently fitting a Tweedie distribution follows the framework of fitting a generalized linear model.

Lemma 8. *In case of a canonical link function, the sufficient statistics for are .*

*Proof. *For independent observations of the exponential dispersion model (16) the log-likelihood function is But ; hence

Proposition 9. *Given that is distributed as (16) then its distribution depends only on its first two moments, namely, and .*

*Proof. *Let be the link function of the GLM such that . The likelihood equations are Using chain rule we have HenceSince , the relationship between the mean and variance characterizes the distribution.

Clearly a GLM only requires the first two moments of the response ; hence despite the difficulty of full likelihood analysis of Tweedie distribution as it can not be expressed in closed form for we can still fit a Tweedie distribution family. The likelihood is only required to estimate and as well as diagnostic check of the model.

Proposition 10. *Under the standard regularity conditions, for large , the maximum likelihood estimator of for generalized linear model is efficient and has an approximate normal distribution.*

*Proof. *From the log-likelihood, the covariance matrix of the distribution is the inverse of the information matrix .

So Hence where .

Therefore has an approximate with , where is evaluated at .

To compute we use the iteratively reweighted least square algorithm proposed by Dobson and Barnett [17] where the iterations use the working weights : where .

However estimating is more difficult than estimating and such that most researchers working with Tweedie densities have a priori. In this study we use the procedure in [15] where the maximum likelihood estimator of is obtained by directly maximizing the profile likelihood function. For any given value of we find the maximum likelihood estimate of and compute the log-likelihood function. This is repeated several times until we have a value of which maximizes the log-likelihood function.

Given the estimated values of and , then the unbiased estimator of is given by Since for the Tweedie density can not be expressed in closed form, it is recommended that the maximum likelihood estimate of must be computed iteratively from full data [15].

#### 4. Data and Model Fitting

##### 4.1. Data Analysis

Daily rainfall data of Balaka district in Malawi covering the period 1995–2015 is used. The data was obtained from Meteorological Surveys of Malawi. Figure 1 shows a plot of the data.