Research Article  Open Access
Improved Estimation of the Initial Number of Susceptible Individuals in the General Stochastic Epidemic Model Using Penalized Likelihood
Abstract
The initial size of a completely susceptible population in a group of individuals plays a key role in drawing inferences for epidemic models. However, this can be difficult to obtain in practice because, in any population, there might be individuals who may not transmit the disease during the epidemic. This short note describes how to improve the maximum likelihood estimators of the infection rate and the initial number of susceptible individuals and provides their approximate Hessian matrix for the general stochastic epidemic model by using the concept of the penalized likelihood function. The simulations of major epidemics show significant improvements in performance in averages and coverage ratios for the suggested estimator of the initial number in comparison to existing methods. We applied the proposed method to the Abakaliki smallpox data.
1. Introduction
In any epidemic in a group of individuals, there is a subgroup of individuals who are not susceptible to a disease, that is, those immune to the disease naturally or by vaccination, as well as those not exposed to the disease owing to physical separation or other reasons. Therefore, estimation of the size of the initially susceptible population in the group might be pivotal; for example, see [1]. For the general stochastic epidemic model, [2–4] have dealt with estimating the initial number.
For the case where an epidemic is observed fully over a given time interval such that all infection and removal times are known, [2] provided martingale estimating equations to propose an estimator called the Mestimator of the initial number of susceptible individuals and its approximate variance. However, the Mestimator does not have the property of consistent coverage ratios for confidence intervals of the initial number of susceptible individuals. For the same conditions, [4] derived a likelihood function using the counting process theory after [5] to yield the maximum likelihood estimator called the kMLE and its approximate variance. However, this likelihood function does not coincide with that given by [5] but with that obtained by using the survival function method of two earlier studies [6, 7]. The kMLE better improves coverage ratios for confidence intervals than the Mestimator, but the problem of inconsistent coverage ratios remains. [3] extended the martingale procedure of [2] when only the removal times are observed. See [8] for a summary of the likelihoods of the completely observed data given parameters under the various model setups for the general stochastic epidemic model adopted by many researchers such as [5, 6, 9, 10].
Here, the first approach to improving the estimator of the initial number of susceptible individuals was to employ the likelihood function of [5]. A system of equations was derived from the loglikelihood function to find the MLEs of the infection rate and the initial number of susceptible individuals, and a normal limiting distribution was assumed to propose a corresponding approximate Hessian matrix. However, because simulations for the MLE give unstable results such as infinite values for the estimate of the initial number and low coverage ratios for confidence intervals such as the Mestimator, a method of penalized likelihood function is proposed. See [11] for an example of estimation using a penalized likelihood function.
Simulations were conducted to compare the proposed maximum penalized likelihood estimator called the pMLE with the kMLE and the Mestimator of the initial number of susceptible individuals. Then, the proposed method was applied to the Abakaliki smallpox data from Nigeria to compare results with the findings of [2, 4].
The rest of this paper is organized as follows: Section 2 presents the notations and the general stochastic epidemic model. Section 3 describes the estimation methods. Section 4 presents the simulation results. Section 5 considers a numerical example, and Section 6 concludes with a discussion and concluding remarks.
2. Notations and the General Stochastic Epidemic Model
Notations very similar to those of [2, 4] were adopted for the spread of a susceptibleinfectedremoved infectious disease in a population whose individuals are mixing homogeneously. Suppose that the epidemic is observed over the time interval in the population, whose size is at time , where indicates infectious individuals and susceptible individuals. Let denote the number of susceptible individuals present at time ; the number of individuals infected up to and including time , including the initial set of infectious individuals; the number of infectious individuals present at time ; the number of infected individuals removed up to and including time the infection rate; the removal rate; and the algebra generated by the history . The number of individuals who become infected and are removed by time , respectively, is denoted by and . Note that when an individual becomes infected, the individual is assumed to be immediately infectious. Given and , assume that the probability of a susceptible individual becoming infected and that of an infectious individual being removed within a small time interval are given by and , respectively, such that the transition probability is The correction term becomes negligible when is small; that is, .
The process , is assumed not to be observed, such that is not observable. However, the process , is fully observable, such that the times at which individuals become infected are observable. The observation here includes the infection time for the infected individual and his or her removal time. Let be the ordered successive infection times observed over . As indicated in [2], the number of individuals infected in who are still susceptible at time can be observed; that is, . Note that although depends only on infection and removal times, depends on , as well as infection and removal times.
3. Derivation of the Estimators
First, consider the likelihood function of the parameters and according to [5] where and denote the situation just prior to time . The likelihood function (2) differs from that given by [4] which is obtained under the same conditions as (2), except that the epidemic process is observed until all infectious individuals are removed, such that the likelihood function can be interpreted as obtained by observing the process over the interval for and . Note that the likelihood functions (2) and (3) are both derived using the definition of a likelihood function in statistical physics based on the counting process theory.
Here, inferences are drawn for when the infection process is observed over a fixed time interval , which is a relaxation of the condition in [2, 4], where the epidemic is observed until it ceases.
The loglikelihood function is given by taking the logarithm of (2): Here, we use the relation . Note that the total number of susceptible individuals at time is just plus the number of susceptible individuals not infected by : where denotes the number of individuals infected in and still susceptible at time . With (5) substituted into (4) to use information on those infected at the infection time, the following is obtained: where Let the first partial derivatives of the loglikelihood function (4) with respect to and be 0; thus, the system of two nonlinear equations is where Because there are no terms of in (9), (9) can be solved with respect to to obtain the MLE of . Then, the MLE of can be plugged into (8) to get the maximum likelihood estimator of : which is the same as the Mestimator of in [2]. The two nonlinear equations are solved here separately to get solutions, whereas [4] maximizes his loglikelihood function of variables and .
The Hessian matrix can be approximated by assuming that the limiting distribution of and follows a normal distribution.
Equation (9) does not have a finite solution for some values of the observations of . To show this, we can first rewrite as by using the relationships and . It is clear that when becomes positive such that the solution of (9) should be infinity. Here, we assumed without loss of generality.
When the number of the initial susceptible individuals is not large enough, the number of simulated epidemics for which the estimates of that did not exist cannot be ignored (Table 1). Therefore a method for improving the maximum likelihood estimator is proposed by considering a penalized likelihood of (6): where for , which modifies (9) to Note that is a modification of to make slightly bigger, and is subtracted from to make it slightly smaller. The penalty function is heuristically chosen to penalize a large value of the estimate of for the loglikelihood (6). It can be shown that the denominator of the first derivative of with respect to is positive and that the numerator is given in the quadratic equation form for constants and such that increases as increases with for some finite value . See [11] for a discussion in choosing a penalty function. Let the estimators of and obtained by solving (8) and (16) be and , respectively, and be called the pMLE.

Note that the kMLE in [4] can be obtained by instead of (16), where with to reduce the value of in (9).
The Hessian matrix can be approximated to assuming that the limiting distribution of and follows a normal distribution. Therefore, the diagonal elements of can be used to give estimated standard errors and , which may be used to construct approximate nominal 95% confidence intervals of and as respectively.
4. Simulations
A simulation study very similar to that of [2, 4] was conducted to compare the efficiency of the pMLE relative to the kMLE and the Mestimate. Here, populations of = 100, 250, 1000, and 5000 susceptible individuals and = 5 initial infectious individuals are considered. In this simulation, and and 1.3 were taken. Results were conditional on a major epidemic, and, therefore, following [2], only simulated epidemics with more than 20% of infected individuals were considered. For = 1.3, epidemics with more than 40% of infected individuals were considered. The value of was set to to compare simulation results with the findings of [2, 4]. For each combination of parameters, 1000 epidemics were simulated.
The number of simulated epidemics for which the maximum likelihood estimate does not exist was counted (Table 1). The results resemble those of [2]. Furthermore, in each scenario, the following were computed: (i) , and (averages and standard deviations of 1000 estimates of and , resp.); (ii) and (averages of estimated approximate standard errors of 1000 estimates of and , resp.); and (iii) (percentage of 1000 nominal 95% confidence intervals containing ), the coverage ratio, and (average final size of simulated epidemics). The simulation results are presented in Table 2.
 
Coverage ratio for . 
Among the 1000 simulations, there were no cases in which the estimate was infinite; this held true for the kMLE as well, as in [4]. In the comparison with the kMLE and the Mestimator, there were substantial improvements in coverage ratios and averages of estimates of . The coverage ratios for the pMLE were quite stable in comparison to those for the kMLE and the Mestimator. An increase in the value of increased the coverage ratio such that all coverage ratios for = 5000 were close to the true confidence coefficient 0.95.
The average estimates of were closer to the true value than those of , the kMLE, for each combination of parameters. The standard deviations and the average of estimated approximate standard errors of 1000 estimates of were quite close to each other in all cases. However, this was not the case for the kMLE and the Mestimator.
In addition, the averages of 1000 estimates of tended to increase to the true value with an increase in , and the standard deviations of estimates and the average of estimated approximate standard errors were also quite close to each other in all cases.
5. Application
The proposed method was applied to the Abakaliki smallpox data from Nigeria. [12] provided 29 infection times for infected individuals and the number of infectious individuals on each day of the epidemic in Abakaliki. The infectious period is assumed to be fixed at 7 days for every individual, and the latent period fixed at 13 days. The estimates and , as well as corresponding standard errors, were obtained (Table 3). The function nleqslv in was used to solve (16) for . The results indicated that both the estimate and estimated approximate standard error for were less than those for and . The estimate (33.88) and estimated approximate standard error for (4.13) were close to those for (35.27 and 6.70, resp.) but lower than those for (42.12 and 37.15, resp.). The estimate of initial susceptible individuals was much lower than 120, the population size. Because of the assumption of homogeneous mixing between individuals, as in earlier studies [2, 4, 12, 13], this estimate was interpreted assuming that a number of individuals were not susceptible to the disease owing to natural immunity, vaccination, or isolation.

Figure 1 shows the shape of the loglikelihood function of and , which is similar to that in [4], and Figure 2 presents the gradient of the likelihood function of for the smallpox data.
6. Discussion and Conclusions
In any epidemic, the estimation of the initial number of susceptible individuals is of great interest. Although [5] derived a loglikelihood function for the initial number and the infection rate when the epidemic is fully observed over a given time interval, no study has investigated the properties of the MLEs. [4] considered the loglikelihood function to be obtained when the epidemic process is observed from the time of the first infection to the time when all infectious individuals are removed and derived the MLE of parameters of interest. [2] used a martingale framework to propose an estimator. The present study used the loglikelihood function of [5] and the relationship between and of [2] to derive a system of equations for parameters and solved the system to obtain MLEs of parameters, instead of finding them as [4] did, by maximizing the loglikelihood function. An approximate Hessian matrix of estimators and was derived based on the assumption that the limiting distribution of and follows a normal distribution. The derivation of the limiting distribution of and can be a challenge.
Our modification of (9) can be considered the same as penalizing the likelihood function (4) with a suitable penalizing factor. Because there are various penalizing methods, another penalized likelihood function can be attempted for better estimators in a future study.
The simulations for the pMLE provide a more stable result than the Mestimator and the kMLE for unbiasedness, standard errors, and coverage ratios, and therefore, the proposed method can be used as a more reliable tool for estimating the initial number of susceptible individuals in a population.
Conflict of Interests
The author declares that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
This research was supported by the 2012 Yeungnam University Research Grant.
References
 E. B. O'Dea, K. M. Pepin, B. A. Lopman, and C. O. Wilke, “Fitting outbreak models to data from many small norovirus outbreaks,” Epidemics, vol. 6, pp. 18–29, 2014. View at: Publisher Site  Google Scholar
 R. M. Huggins, P. S. F. Yip, and E. H. Y. Lau, “A note on the estimation of the initial number of susceptible individuals in the general epidemic model,” Statistics and Probability Letters, vol. 67, no. 4, pp. 321–330, 2004. View at: Publisher Site  Google Scholar  Zentralblatt MATH
 E. H. Y. Lau and P. S. F. Yip, “Estimating the basic reproductive number in the general epidemic model with an unknown initial number of susceptible individuals,” Scandinavian Journal of Statistics, vol. 35, no. 4, pp. 650–663, 2008. View at: Publisher Site  Google Scholar  MathSciNet
 T. Kypraios, “A note on maximum likelihood estimation of the initial number of susceptibles in the general stochastic epidemic model,” Statistics and Probability Letters, vol. 79, no. 18, pp. 1972–1976, 2009. View at: Publisher Site  Google Scholar  MathSciNet
 H. Andersson and T. Britton, Stochastic Epidemic Models and Their Statistical Analysis, vol. 151, Springer, New York, NY, USA, 2000.
 P. D. O'Neill and G. O. Roberts, “Bayesian inference for partially observed stochastic epidemics,” Journal of the Royal Statistical Society A, vol. 162, no. 1, pp. 121–129, 1999. View at: Publisher Site  Google Scholar
 M. Höhle and E. Jørgensen, “Estimating parameters for stochastic epidemics,” Dina Research Report 102, Danish Institute of Agricultural Sciences, Tjele, Denmark, 2002. View at: Google Scholar
 T. Kypraios, Efficient Bayesian inference for partially observed stochastic epidemics and a new class of semiparametric time series models [Ph.D. thesis], Department of Mathematics and Statistics, Lancaster University, Lancaster, UK, 2007.
 P. Neal and G. Roberts, “A case study in noncentering for data augmentation: stochastic epidemics,” Statistics and Computing, vol. 15, no. 4, pp. 315–327, 2005. View at: Publisher Site  Google Scholar  MathSciNet
 N. G. Becker and T. Britton, “Statistical studies of infectious disease incidence,” Journal of the Royal Statistical Society B: Statistical Methodology, vol. 61, no. 2, pp. 287–307, 1999. View at: Publisher Site  Google Scholar  MathSciNet
 M. Moreno and S. R. Lele, “Improved estimation of site occupancy using penalized likelihood,” Ecology, vol. 91, no. 2, pp. 341–346, 2010. View at: Publisher Site  Google Scholar
 N. G. Becker, Analysis of Infectious Disease Data, Monographs on Statistics and Applied Probability, Chapman & Hall, London, UK, 1989. View at: MathSciNet
 N. G. Becker and P. Yip, “Analysis of variations in an infection rate,” Australian Journal of Statistics, vol. 31, pp. 42–52, 1989. View at: Google Scholar
Copyright
Copyright © 2014 Changhyuck Oh. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.