Research Article | Open Access
Identification and Forecasting in Mortality Models
Mortality models often have inbuilt identification issues challenging the statistician. The statistician can choose to work with well-defined freely varying parameters, derived as maximal invariants in this paper, or with ad hoc identified parameters which at first glance seem more intuitive, but which can introduce a number of unnecessary challenges. In this paper we describe the methodological advantages from using the maximal invariant parameterisation and we go through the extra methodological challenges a statistician has to deal with when insisting on working with ad hoc identifications. These challenges are broadly similar in frequentist and in Bayesian setups. We also go through a number of examples from the literature where ad hoc identifications have been preferred in the statistical analyses.
Mortality models are commonly used in a wide range of fields such as actuarial sciences, epidemiology, and sociology. They are often used in important decisions such as how to deal with unisex legislation in the pension industry; see Ornelas et al.  and Jarner and Kryger . However, such models do often have inbuilt identification issues stemming from overparametrisation. While identification issues are omnipresent in statistical modelling, this paper focuses on mortality modelling, where estimated parameters are treated as time series and extrapolated to give forecasts of future mortality. The underlying theme of this paper is to provide strategies of avoiding arbitrariness resulting from the identification process. We suggest two ways forward. First, we can reparametrise the model in terms of a freely varying parameter, which therefore has to be of lower dimension than the original parameter. Secondly, we can work with an identified version of the original parameter as long as we keep track of the consequences of the identification choice. That way we ensure that two researchers making different identification choices get the same statistical inferences and forecasts.
A simple example is the age-period model for an age-period array of mortality rates. It is well-known that the levels of the age- and period-effects cannot be determined from the likelihood representing the overparametrisation of the model. When the estimated age- and period-effects are treated as time series and subjected to plotting and extrapolation, then our approach ensures that the statistical analysis is the same for two researchers identifying the above model in two different ways. Whereas this issue is relatively simple for the age-period model, identification becomes more tricky for complicated models such as the age-period-cohort model and the model of Lee and Carter , let alone two-sample situations.
Mortality models are built as a combination of age, period, and cohort-effects, but the likelihood only varies with a surjective function of these time effects. The time effects can be divided into two parts. One part that moves the likelihood function and another part which does not induce variation in the likelihood function. We will argue that all inferences and forecasts should be concerned primarily with the part of the parameter that moves the likelihood function. This does not preclude the researcher from working with the time effects, but it gives some limitations on what can be done. This is important because the motivation and the intuition of mortality models typically originate in the time effects. For instance, in the context of an age-period-cohort model linear trends cannot be identified so time series plots of the time effects need to be invariant to linear trends and extrapolations of time effects must preserve the arbitrary linear trend in the time effects. This applies regardless of whether the identification issue is dealt with in a frequentist manner or by Bayesian methods.
To formalise the discussion slightly return to the age-period example. Denote the predictor for the age-period data array by . The age-period model then determines how the predictor varies with a vector summarising age and period effects. That vector is split into two components and so that the predictor only depends on through but not on which cannot be identified by statistical analysis. In the age-period example could reflect the contrasts and the overall level of the predictor , whereas reflects the level of the age effect. The more principled solution is then to work exclusively with and simply consider as a motivation rather than the objective of the analysis. Another solution is to ad hoc identify based on a notion of mathematical convenience or based on a particular purpose given the substantive context.
Once an ad hoc identification of is chosen the identification problem appears to go away, because the likelihood analysis can now go through. The reason is that the variation of is now reduced to the variation of precisely because is fixed. Suppose two researchers choose the same likelihood and the same parametrisation of but different ad hoc identifications and . Which of their conclusions will be the same and which will be different? As the likelihood only depends on the fits of the two researchers will be identical. But differences might arise if the statistical inference or forecasting or any other statistical analysis involves in some way.
Indeed, with many extrapolation methods forecasts will be invariant to the choice of . But, there will also be extrapolation methods where this is not the case. Examples arise in the age-period-cohort model, where linear trends have to be handled with care.
We will start by analysing linearly parametrised models at a rather general level. We do this with two aspects in mind. First, we need to step back to a point in the analysis before ad hoc identification is made. Secondly, we also want to avoid the discussion of how to choose and , which tend to be specific to the mortality model in question. Working at the general level we can focus on the mappings between different parametrisations and the invariance properties coming from these mappings. It is then seen that the parameter arises as a maximal invariant. The general setting also allows the formulation of a series of results discussing different types of ad hoc identification, first in a frequentist fashion and then in a Bayesian fashion.
Subsequently, we will consider the age-period-cohort model in detail, both for one- and two-sample situations. Using the general results it becomes easier to see that a number of popular methods inadvertently include features that are not invariant to ad hoc identification. These include the “intrinsic estimator” advocated by Yang et al. , the “mixed model approach” by Yang and Land , the Bayesian approach by Berzuini and Clayton , and the two-sample analysis by Riebler and Held . Finally, we consider the nonlinearly parametrised model of Lee and Carter . The nonlinearity gives a further complication since the mapping from the time effects to the mortality predictor is nondifferentiable. As it turns out the mortality predictor varies in a smooth space, so the nondifferentiability is avoided by working directly with the mortality predictor instead of the original time effects. Instead, a Lee-Carter application should consider whether a certain matrix has rank of unity or zero. Apart from that the analysis is similar to that of linearly parametrised models. Likewise a theory is given for two-sample situations.
Throughout the paper our concern rests exclusively with the identification problem and the consequences of ad hoc identification for estimation, plots, inference, and forecasting. In practice, important additional concerns are how to choose appropriate models and forecasting methods. We would like to refer to Girosi and King , Pitacco et al.  for general discussions of these issues, and also to Kuang et al.  and Coelho and Nunes  for discussions of forecast methods in the light of structural breaks. Instead, the aim of the paper is to present an overall framework that can help streamlining the identification discussion that has appeared in so many papers in so many fields over so many years.
Section 2 of this paper considers standard linear statistical models, which lend themselves to a relative straightforward analysis based on linear algebra. Any ad hoc identification splits the time effect into two components. The first component is an arbitrary component, which is not needed for the identification of the likelihood. The other component is necessary and sufficient to identify the model and hence sufficient for statistical analysis. In Section 3 it is outlined how to analyze the statistical model when the latter component is ad hoc identified. It is argued that this can cause difficulties for estimation, interpretation, and forecast. In Section 4 it is shown that Bayesian analysis shares the same challenges as the frequentist approach. In Sections 5 and 6 we study the two particular examples: the omnipresent age-period-cohort and Lee-Carter mortality models. All proofs are collected in the Appendix.
2. Statistical Models with Linear Parametrisations
In this section we present the identification problem in a linear framework. The problem is solved by analysing the mapping from the original time effect to the predictor which, in turn, leads to standard statistical analysis. In Section 6 we show how these ideas transfer to a nonlinear context. This contrasts with Section 3 in which we illustrate the analytical challenges and inconveniences arising from ad hoc identification.
In Section 2.1 we present the overparametrized linear model for the mortality predictor. The identification problem is defined in Section 2.2 via the likelihood. In an overparametrized linear model two different parameters might produce the same likelihood. In Section 2.3 we analyze the mapping from the overparametrised parameter to the predictor. This mapping enables us to split the overparametrised parameter into two. One arbitrary parameter and one parameter identify the model without being overparametrised. This latter parameter is shown to be a maximal invariant parameter. In Section 2.4 it is demonstrated how any statistical analysis can be based on this maximal invariant parameter alone. In particular we comment that visual data representations, hypothesis testing, and forecasting are simple and well defined. This in turn leads to standard statistical analysis.
The analysis of the linearly parametrised involves projections on linear or affine spaces and on their orthogonal complements. It is therefore convenient to introduce the following notation. A matrix has full column rank if is invertible. In this case the orthogonal complement is a matrix so and is invertible. Thus, when itself is invertible then is the empty matrix. It is not difficult to calculate in practise, an explicit construction of follows from a singular value decomposition of , choosing as the eigenvectors associated with the zero eigenvalues. Moreover, let so that is the identity matrix, while .
2.1. The Model
Think of the time effect as our preferred intuitive, but unidentified parameter, and think of the predictor as some function of specifying the model at hand. In a Poisson type model, where the mean specifies the distribution, could be the log of that mean. Such Poisson models are omnipresent in mortality models. We will often think of as containing some time effects. Often forecasting is carried out simply by isolating and extrapolating such a time effect.
Consider a data vector of dimension . This could, for instance, be the vector consisting of the stacked mortality rates for a rectangular age-period array of dimension in which case . The statistical model for could be a generalized linear model. This involves an appropriately chosen distribution and a link function, which links the expected mortality rate to an -dimensional predictor, which is denoted by . Taken together this defines a likelihood function .
The model for the predictor is constructed in terms of, for instance, age, period, and cohort time effects. These time effects are summarized in a vector , which is of dimension . Therefore is a surjective function of . For the moment the specification of the predictor is assumed linear so that for some design matrix . We refer to this specification as the mortality model, while the space is the time effect space. The time effect space is chosen as an unrestricted real space in accordance with the starting point of most mortality analyses.
The parameter space for the likelihood function and therefore for the statistical model is given by the range of variation for the predictor ; that is, The likelihood function is assumed uniquely identified on this space in the sense that for all pairs of predictors so then the likelihood of , differ; that is, for in a set with positive probability.
2.2. The Identification Problem
The identification problem of mortality models arises when the mapping from the time effect space to the parameter space is surjective but not injective. With a linear parametrisation this arises when the design matrix has reduced column rank so is singular. In this situation there exists time effects with the same likelihood: for all data . Then the time effect space is not useful as parameter space for the statistical model.
2.3. Analysing the Mapping
When analysing the mapping from our intuitively preferred parametrisation into the linear predictor , we will be able to rewrite as a sum of two components: one is a function of the predictor and the other is the arbitrary part varying with , but not with the predictor. We provide two methods for analysis.
The first method is to find a basis with full column for the design . The design matrix of the mortality model can then be expressed as for some matrix with full column . Introduce a new -dimensional parameter: The parameter space can then be written more parsimoniously as The mapping from to is bijective, so the statistical model can just as well be parametrised in terms of .
Alternatively, the identification problem can be expressed through an invariance argument. This argument relates to the parameterization but resembles the classical invariance argument for reduction of data; see Cox and Hinkley [12, page 157]. With a linear parametrisation the argument involves the orthogonal complement to the matrix . That is a matrix which has the properties that and that is invertible. The mortality model (1) is defined by the mapping from to . This mapping is surjective in that two different values of may result in the same and therefore the same likelihood. These equivalence classes in the time effect space can be described by the group of transformations acting on for arbitrary . Indeed, it holds that and will result in the same . The mapping (7) is therefore invariant to the group . We will argue that the parameter is a maximal invariant to the group acting on , which provides a link with (6). It has to be argued that for any , so that equals then , see Cox and Hinkley [12, page 159]. For this argument use the orthogonal projection identity to write for unique and . Thus, if then with .
In applications it can be difficult to find a basis for the design . It can be easier to find a group and hence and then use this information to construct and a candidate basis , noting that . This argument leaves it to be proven that is a basis, or equivalently, that the suggested group actually describes the equivalence classes of the mapping from to .
It is useful to note that in the choices of , only the spaces spanned by them are unique since for any invertible . Likewise, the maximal invariant is only unique up to bijective transformations. This lack of uniqueness has no impact on the analysis of the likelihood albeit it influences interpretations.
2.4. Statistical Analysis Using the Maximal Invariant Parameter
The statistical model parametrised with the maximal invariant parameter can be analysed by standard statistical techniques. This contrasts to a range of problems that arise when working with an ad hoc identified time effect . In the following the relatively simple standard statistical analysis of the model parametrised by is discussed with respect to likelihood theory, interpretation, plots, hypothesis testing, forecasting, and Bayesian analysis. In Sections 3 and 4 we give an overview of the much more complicated theory underpinning models parametrised by the ad hoc identified time effect . Age-period-cohort examples follow in Section 5.
2.4.1. Exponential Family Theory
Suppose the likelihood is drawn from a generalized linear model based on an exponential family. Then the model is actually a regular exponential family where the maximal invariant parameter is the canonical parameter since it is freely varying in a real space; see Barndorff-Nielsen [13, page 116]. This opens up for a wealth of convenient statistical properties such as a likelihood equation with a simple expression and explicit conditions for a unique solution. In contrast, ad hoc identified parameters are based on an injective mapping of the canonical parameter into ; see Sections 3.1 and 3.2. It is then more difficult to fully exploit the exponential family theory.
2.4.2. Interpretation and Plots
The maximal invariant parameter varies freely in . It can therefore be interpreted as the parameter of any standard statistical model. Since is freely varying the coordinates of can be interpreted independently. When is a collection of time effects then can be organised as a collection of time series. Since the coordinates of are freely varying the time series plots of the components of have the usual interpretation of time series. In contrast, ad hoc identified estimators are constrained to a -dimensional subspace of , which is often affine but can be more complicated. A consequence is that plots are complicated to evaluate; see Section 3.4.1.
2.4.3. Hypothesis Testing
Hypotheses are easily formulated and analysed when using the maximal invariant parametrisation. An affine hypothesis that restricts to vary in a -dimensional affine subspace can be formulated as for known matrices , . This implies a restriction on the predictor of (6). Form the orthogonal complement and recall the orthogonal projection identity so that . Introduce a -dimensional parameter , a design matrix , and an offset . The restricted parameter space is In an exponential family context both the unrestricted model and the restricted model form regular exponential families. A variety of nice properties then follow for the estimators and the test statistics from the exponential family theory. Examples are given in Sections 5.3 and 5.5.3. In contrast, the hypothesis derived from restrictions on ad hoc identified parameters and the resulting degrees of freedom are complicated to analyse; see Section 3.4.2.
Most often the objective of a mortality study is to forecast the future mortality. In the linear context, , this is done by extending the design and by extrapolating .
It is usually easy to extend the design into the forecast horizon. This involves the construction of a triangular block matrix with an appropriate number of extra rows corresponding to the data over the forecast horizon as well as extra columns representing the extra parameters that would be needed: Extrapolating into a vector then gives the forecast The extrapolation of the parameter can be done as follows. The estimated parameter, or part of it, can be thought of as a time series. Any forecast techniques from the time series literature applied directly to can be used, subject to the usual contextual considerations.
Ad hoc identified time effects can be extrapolated in a similar way; see Section 3.4.3. This may, however, result in avoidable arbitrary effects in the forecast. Necessary and sufficient conditions for this eventuality are given for age-period-cohort models in Section 5.4.3. The practical examples are mainly Bayesian in nature and are discussed next.
2.4.5. Bayesian Analysis
The introduction of the canonical parameter shows that the likelihood, in Bayesian notation, is of the form where is freely varying. A purist Bayesian analysis can simply introduce a prior on the canonical parameter, . This is updated in a straight forward way, resulting in the posterior .
In contrast, introducing a prior on ad hoc identified parameters gives various difficulties. Only parts of the prior are updated by the likelihood, so that it becomes unclear which information arises from the data and which information arises from the ad hoc identification. Moreover, avoidable arbitrariness is introduced in the forecast; see Section 4. Introduction of hyperparameters exacerbates the issue. Examples are given in Sections 5.4.4, 5.5.2, and 6.1.6.
3. Working with the Time Effects
In Section 2 we considered the situations where estimation, hypothesis testing a hypothesis, or forecasting is carried out using the canonical parameter. However, there might be situations, where the original time effect parametrisation is preferred, perhaps because it is felt that this parametrisation is particularly helpful in guiding the intuition. This requires ad hoc identification of the time effect. In this section we will guide the considerations a statistician that has to go through when insisting on an analysis based on some nonunique parametrisations. As in Section 2 we focus on linearly parametrised models. Specific examples follow in Sections 5 and 6.
In Section 3.1 ad hoc identification is defined. As an example we consider a least squares estimation problem with collinear regressors in Section 3.2. For the age-period-cohort model reviewed in Section 5 it is common to ad hoc identify in two steps: first identifying levels then the linear trends. We consider such two-step ad hoc identification in Section 3.3. The consequence of ad hoc identification is considered in Section 3.4. Indeed, when forecasting the time effect, we do not want the forecast to depend on the identification scheme. The same applies to graphical visualisation of our data, where the eye may extract patterns that depend on the identification scheme. Likewise, confusion may arise when formulating a hypothesis directly on the time effect parameters.
3.1. Ad Hoc Identification
In this section the time effect parametrisation is considered. An identification scheme has to be introduced when working with the time effects. This may rest on mathematical convenience or it may be chosen for a particular purpose given the substantive context. We therefore call it ad hoc identification. Here we consider a simple identification scheme but turn to a more common two-step identification scheme in Section 3.3.
Once the canonical parameter has been estimated there is often a wish to return to the original time effect . The two are linked through the surjective mapping from to . Indeed, since is constructed as a function of the notation for is often chosen to reflect . The canonical parameter does, however, only give partial information about . The remaining part, say , of will have to be chosen by the researcher and combined with .
A linear ad hoc identification of comes about by the researcher choosing a constraint for some known and some matrix chosen so the square matrix is invertible. The time effect space is now reduced to an affine subspace Given we can find , through (13) and (14) as . At the same time, given values of , and the invertibility of , the ad hoc identified time effect is found through In this notation a subindex is introduced to avoid confusion with the time effect in the original mortality model. Indeed, there are now four different parameters in play, namely, the original time effect , the predictor , the maximal invariant parameter and the ad hoc identified time effect , each of which has a different interpretation. The mapping from to each of , , and is surjective, while there are bijective mappings between the latter three. The interpretations of the time effect and the canonical parameter will inevitably be different. For a start they have different dimensions. Endowing the spaces with Euclidean norms shows that distances in the two spaces and will be judged differently. The time effect and the ad hoc identified time effect will similarly have different interpretations. Although they have the same dimensions the Euclidean norms on and will be rather different. Confusion may arise in the interpretation of a mortality analysis if there is no clear distinction between and . In addition an unnecessary arbitrariness may arise when making inference on or extrapolating . We will return to these issues in Section 3.4.
It is perhaps interesting to note that despite the linear parametrisation the ad hoc identification need not be done in a linear fashion as in (14). Indeed it is common for Poisson models with a log link to ad hoc identify through the original multiplicative scale. That means that the ad hoc identification is done nonlinearly through
The fit of the model is unaffected by the ad hoc identification. Indeed the fit is measured in terms of the estimate of the predictor where . Since the identification is made so ; the estimated predictor reduces to regardless of the choice of ad hoc identification.
3.2. A Least Squares Example
As an illustration of estimation in the presence of ad hoc identification consider a normal likelihood. Different, but equivalent, expressions can be found depending on the parametrisation. The likelihood of the predictor is Rewriting it in terms of the canonical parameter it is while introducing the time effect parameter gives The likelihood (20) of the canonical parameter can be analysed by the least squares method since the design has full column rank. The maximum likelihood estimator for and the predictor for the data are Along with the residual variance this is all the information that is given by the likelihood.
The likelihood (21) of the time effect only depends on through . The lack of identification means that the maximum likelihood estimator for has an arbitrary element, so that it is a set valued estimator. Based on (16) this can be expressed by for any so is invertible and for any . The fit, however, remains the same and (18) becomes In order to compute actual estimates then , have to be chosen, which amounts to ad hoc identification. For instance, with the ad hoc identifying restrictions and then can be thought of as the least squares estimator of on using the Moore-Penrose generalised inverse for the singular matrix ; see Searle [14, page 212]. See Section 5.4.1 for an example.
3.3. Step-Wise Identification
It is common to ad hoc identify parameter in a step-wise fashion. In the first step the time effect parameter is only partially constrained. The full identification then follows in a second step. An example is given in Section 5.4.1 for an age-period-cohort model in which the levels of the time effects are constrained in the first step leaving the ad hoc identification of the linear trends to the second step.
The first step constraints are affine of the type for known matrices , . The constrained time effect space is then Thereby the -dimensional time effect space is reduced to a -dimensional variation. The properties of this partially ad hoc identified parameter space depends on the rank of the matrix . If the number of constraints, , is at most equal to the number of unidentified components , it is possible that has full column rank. In that case the constraint implies a partial ad hoc identification without constraining the parameter space of the statistical model. This is shown in Theorem 1; see also Section 5.4.1 for an example, while the proof is given in the Appendix. When has reduced rank the parameter space is also constrained; see Section 3.4.2 for a discussion.
Theorem 1. Suppose has full column rank. Then the matrix has full column rank and the constraint (25) does not constrain the canonical parameter and the predictor . Hence, the predictor space remains of the form (2). The equivalence classes in under the mapping are given by the group for arbitrary where is the orthogonal complement of . The maximal invariant remains .
The partial ad hoc identification by (25) implies that any time series analysis of the time effects has to happen relative to the constrained space rather than the space . This is awkward as discussed in Section 3.4 below. It is also considerably more complicated than working with the freely varying canonical parameter ; see Section 2.4.2.
3.4. Consequences of Ad Hoc Identification
In the following we will look closer at the consequences of working with the ad hoc identified time effect parameter in the context of a linear mortality model of the form . We consider the consequences for plotting, hypothesis testing, and forecasting.
3.4.1. Plots of Time Effects
In the mortality model (1) the time effect is the concatenation of age, period, and cohort effects. It seems natural to think of these individual time effects as time series and to plot them against time. As the time effect varies in the unrestricted space this maps the -vector into unrestricted time series.
Estimates of the time effects are constructed by combining an estimate of with an ad hoc chosen value for , see (14). The resulting estimate is therefore constrained to the space . The interpretation of the estimate is therefore different from the interpretation of the original time effect . Distances on the spaces and are judged differently and the variability of is deduced exclusively from through (16). The time series components of are now restricted through . Plots of the -time series are therefore interpreted differently from the imagined plots of the original -time series and from the plots of the maximal invariant parameter discussed in Section 2.4.2. Indeed, if one were to analyse the estimated -time series statistically the linear constraint should be taken into account. This is a bit complicated as illustrated below, but it is the consequence of working with the ad hoc identified parameter rather than the canonical parameter .
Attempts to give intrinsic meaning to will be specific to the index set for the data set at hand. For instance, the requirement that the age effect should be zero on average does not carry over when looking at a subsample or when forecasting. It is not obvious that such an ad hoc identification is any more or less arbitrary than saying that, for instance, the first or the last age effect should have a particular value.
Adding confidence bands to a plot of is in itself not difficult. If is asymptotically normal with mean and variance , then is asymptotically normal with mean and variance . This is a normal distribution on the space . The interpretation of these standard errors will therefore be similar to that of itself.
Finally, it may be of interest to analyse the estimated -time series statistically. Denote this time series by . Its sample space is now . A statistical model on can be built as follows. The starting point could be a time series model for unrestricted variables on the sample space . This gives a joint density for , which can be reduced by marginalisation to a density for . Whether one is working with the unrestricted model for or the restricted model for inferences that are invariant to must be based on those statistics of or that are invariant to . Thus, inferences must be based on the maximal invariant under . For a general overview of invariant reduction see Cox and Hinkley [12, page 175f], whereas Nielsen  gives the argument in some detail for an autoregression with a linear trend.
3.4.2. Hypothesis Testing
Having formulated the model in terms of time effects it may be of interest to test the hypothesis that one of these time effects is absent. No identification issues arise when the hypothesis is formulated as a restriction on the canonical parameter as discussed in Section 2.4.3. But one has to be careful when formulating hypotheses in terms of the original time effect. See Sections 5.4.5, 5.5.3, and 5.5.4 for examples.
Affine hypotheses on the time effect are of the form for known matrices , . The constrained time effect space is then To see how the restriction (28) restricts the predictor space recall that the predictor only depends on through . Thus, the analysis of the restriction (28) depends on the interplay between the matrices , . Theorem A.3 in Appendix A.3 gives a general result to that effect. It shows that the hypothesis (28) restricts the predictor space to a -dimensional affine subspace of in so far as it restricts the canonical parameter . In particular, the degrees of freedom of the hypothesis, , may in general be different from the dimension reduction of the time effect parameter, . When this is the case the restriction (28) has an element of ad hoc identifying the time effect.
Forecasts can be made by extrapolating the ad hoc identified time effects . Two researchers choosing different ad hoc identification schemes, but otherwise making the same analysis, may make different forecasts. This can be avoided if the extrapolation method is chosen with some care.
Following the linear approach outlined in Section 2.4.4 the predictor is forecasted by extending the design into Extrapolating the ad hoc identified into a vector then gives the forecast Often both components and depend on the ad hoc identification. Nonetheless, these dependencies of ad hoc identification may cancel each other so that the overall forecast is invariant to the ad hoc identification. Such invariance would seem desirable in most applications unless there is strong substantial reason for the ad hoc identification scheme. Necessary and sufficient conditions for invariance are presented for the age-period-cohort model in Section 5.4.3 and for a nonlinear model in Section 6.1.5.
In contrast, these considerations are redundant when working with the canonical parameter, ; see Section 2.4.4.
4. Bayesian Models and Random Effects Models
Mortality analysis is often carried out using either Bayesian methods or random effects methods. The mortality model is then altered through the introduction of a prior distribution on the parameters. One might think that the identification problems become less of an issue or even disappear. This is not the case since the Bayesian method and the random effects method is based on the mortality likelihood which only depends on the time effect through the maximal invariant parameter . Thus, the identification challenges remain. The issue is that a prior on the unidentified part, say , of the time effect amounts to an ad hoc identification. Indeed, the conditional prior of given is not updated by the mortality likelihood. A main difference is that a maximum likelihood analysis of the original mortality likelihood usually prompts the researcher when there is an identification issue, whereas both Bayesian methods and random effects methods allow computations to go through despite an identification issue.
In Section 4.1 it is seen that introduction of a conditional prior on given is the Bayesian analogue of ad hoc identification. This leads to the same type of forecasting challenges as in the frequentist settings as is seen in Section 4.2. In Section 4.3 we show how the Bayesian identification issues transfer to random effects models.
4.1. Bayesian Estimation
For Bayesian and random effects models we formulate a likelihood and a prior. Thus, consider a likelihood . Replacing by , the identification problem implies that The prior on is factorised as . In the case of Bayesian estimation the following result emerges.
Theorem 2. Suppose the likelihood satisfies (32). Then (i)the predictive distribution does not depend on the conditional prior for : (ii)the posterior satisfies (iii)the posterior means satisfy
Theorem 2 shows that it suffices to give a prior to and ignore as advocated in Section 2.4.5. Indeed the conditional prior for given is not updated. Theorem 2 appears to be well-known; see Poirier [16, Proposition 2] or Smith [17, Section ].
Due to Theorem 2 the Bayesian analyst faces the complications outlined in Section 3.4. Indeed, suppose that two Bayesian researchers choose the same likelihood and the same prior for , but different conditional priors for given . Their marginal distributions for the data are identical, but any inferences regarding interpretation or forecasting will differ in so far as they involve the unidentified parameter . A Bayesian researcher should therefore be cautious with inference related to . There will of course be situations where the prior knowledge of given is found to be of substantive relevance. In such situations it seems more fruitful to change the likelihood to include that information.
Bayesian forecasts involve integrating an extrapolative distribution. This can be done in two ways, either working exclusively with the identified, maximal invariant parameter as in Section 2.4.4, or working with the time effect as in Section 3.4.3.
4.2.1. Forecasting Using the Maximal Invariant Parameter
Consider first the case where only the maximal invariant parameter is used. In that case the forecast is computed by sampling from the posterior and then extrapolating using the sampled value using some extrapolative methods, say . In combination this gives the forecast
4.2.2. Forecasting Using the Ad Hoc Identified Time Effect
Consider now forecasts involving the full time effect . Theorem 2(ii) shows that the posterior satisfies . The distribution forecast with extrapolation is then The concern is now as follows. Suppose a second researcher chooses the same extrapolative method, likelihood, and prior for , but different conditional priors . In general, this will result in a different distribution forecast: The question is then under which conditions will so that the distribution forecasts are invariant to the choice of conditional prior for given ? A sufficient condition is that the extrapolation method does not depend on so Condition (39) could alternatively be expressed as requiring that the forecast is invariant to the group acting on the time effect space so that .
Theorem 3. Suppose that the likelihood satisfies (32) and the priors are probabilities. If the extrapolative distribution does not depend on so (39) holds; then the forecast distribution computed in (37) is invariant to the choice of conditional prior for given . The forecast then reduces to (36).
To summarise, the identification issues surrounding Bayesian analysis are similar to those outlined in the previous sections. Examples of the problems that can arise are discussed in Sections 5.4.4, 5.5.2, and 6.1.6. There are two solutions to the identification problem. The first is only to formulate a prior on ; see Section 2.4.5. Incidentally, this is what Bernardo and Smith [18, page 218] do in their discussion of the two-way analysis of variance, albeit without linking it to the considerations of Smith . The prior can of course be constructed by formulating a prior on and then reduce it to a prior on by marginalisation so . The other solution is to work with a prior on but avoid those parts of the posterior that depend on .
4.3. Random Effects Models
It is common to combine mortality models with a random effects approach, which effectively forms a new model. An example is given in Section 5.4.6. We consider the consequence of the lack of identification.
The random effect models are typically constructed as follows. Suppose the density of the data given the time effects is of the form as before; see (32). A prior is chosen that now depends on a parameter . The prior can be decomposed as . Theorem 2 implies that the density of the data given is This in turn is used to form the random effects likelihood of as This, effectively, defines a new model. The random effects likelihood only depends on the prior through . Two researchers choosing the same prior but different conditional priors will then get the same random effects likelihood and the same maximum likelihood estimator .
In mortality modelling it is common to go one step further and estimate the time effects through the mean of the posterior evaluated at . Then the identification problem may show up. Theorem 2 shows that so that the prior for is updated, while the conditional posterior for given is not updated by the data. Thus, in general the estimate for is based, in part, on a prior which is not updated by the data.
5. Age-Period-Cohort Models
We will now apply the theoretical considerations to analyse the age-period-cohort model. The methodological literature on this model is large and the consequences of the above theory are wide ranging.
In Section 5.1 we present the age-period-cohort model along with the maximal invariant parameter. This maximal invariant parameter is also called the canonical parameter because the age-period-cohort model is usually implemented as an exponential family; see Section 2.4.1. When formulating the model we choose a notation matching the age-period-cohort literature rather than the reserving literature. At the same time the exposition takes it starting point in Kuang et al. , but the notation deviates.
The implementation of the canonical parameter depends on the type of data array. In Section 5.2 design matrices are given for age-cohort, age-period, and period-cohort data arrays. While they illustrate interesting differences in the structure for these data arrays, they also provide the basis for an immediate implementation via any generalised linear model software. The age-cohort model is expressed as a hypothesis of the age-period-cohort model in Section 5.3. Time effects and forecasting are considered in Section 5.4, while the two-sample age-period-cohort model is discussed in Section 5.5.
5.1. The Model and the Canonical Parameter
Here the age-period model is set up and a quite general identification result is reported.
Consider data indexed by where is the age and is the period. The index set may be a rectangle given by and so that the cohort runs from to . More generally, the index set could be a generalized trapezoid where two corners are cut off the rectangle so that the cohort runs from to for some . The class of generalized trapezoids includes the three types of Lexis diagrams discussed by Keiding . We will return to those special cases below.
The statistical model is defined by the assumption that the variables are independent with an exponential family distribution with predictor given by The time effect , now varies in some time effect space where .
The model (43) is of the form (1) discussed in Section 2. Specifically, the predictors can be stacked in a vector of dimension and written as . Thus, the parameter space for the model is of the form for as outlined in (2). The mapping from to is surjective and the equivalence classes in the time effect space can be described by a group of transformations that are discussed in (8). This group can be represented as for any , , , and . This is of the form (8) with although the definition of the matrix depends on the structure of the index set .
A first clue for the canonical parametrisation is given by Fienberg and Mason  and Clayton and Schiffler  who pointed out that, on the multiplicative scale, ratios of relative risks are invariant. On the additive scale this amounts to looking at second differences, such as . A graphical illustration of the double differences is given in Figure 1 (graphics were done using R 3.0.2, see R Development Core Team ), which is taken from Miranda et al. . Panel (a) illustrates the interpretations of the formula for as follows. Consider the 1970 and 1971 cohorts. In 2010 these have ages 40 and 39, while in 2011 these have ages 41 and 40. Thus, represents the increase in mortality from ages 40 to 41 in 2011 relative to the increase from ages 39 to age 40 in 2010. An equivalent interpretation is that which represents the increase in mortality from ages 40 to 41 for the 1970 cohort relative to the increase from ages 39 to 40 for the 1971 cohort. In a similar way panels (b) and (c) illustrate the formulas for and .
Kuang et al.  introduces a parameter formed by these second differences as well as three entries of the predictor; that is, The parameter varies in the space where . If the three points , , and are chosen not to be linearly related then they define the levels and the linear trends in the predictor. The formal condition is that a certain determinant defined from the indices is nonzero; that is,
Theorem 4 (see , [25, Corollary 2]). Let satisfy (43). If the condition (46) is satisfied then the parameter of (45) satisfies the following:(i) is a function of which is invariant to the group in (44);(ii) is a function of ;(iii)the parametrisation of by is exactly identified in that .
Theorem 4 therefore shows that varies freely in . Moreover, is a maximal invariant of the mapping from to under the transformations . It should be noted that the choice of maximal invariant is not unique. Indeed, any bijective mapping of can serve as maximal invariant. The choice of is convenient since it becomes the canonical parameter in generalized linear models of the exponential family type.
In itself this theorem does not tell how to express the predictor in terms of the canonical parameter . The link depends on the structure of the index set . The above mentioned paper gives implicit expressions for generalized trapezoid index sets. In the following we report explicit expressions for the 3 principal Lexis diagrams.
5.2. Design Matrices for Lexis Diagrams
The link between the canonical parameter and the predictor is analysed for the 3 principal Lexis diagrams. We start with age-cohort data arrays, which were the focus of attention in Kuang et al. . Such arrays are easiest to analyse because all three time scales increase from the point where . As a consequence the results are relatively easier for these arrays.
5.2.1. Age-Cohort Data Arrays
Age-cohort data arrays are rectangular in the age and cohort indices and given by Consequently, the period index varies over . Keiding  refers to this Lexis diagram as the first principal set of death.
Age-cohort arrays are in particular used for reserving in general insurance. In that situation, only the triangle is observed. The issue is to forecast the other triangle in the square . In the reserving literature these triangles are referred to as the upper and lower triangles, since the cohort axis has reverse order. The two-factor age-cohort model for triangular age-cohort arrays is known as the chain-ladder model; see England and Verrall  for an overview. Zehnwirth  introduced an age-period-cohort model for such triangular arrays. The identification issue is analysed in detail in Kuang et al. [19, 25]. Subsequently, Kuang et al.  analysed the Poisson likelihood, while Kuang et al.  give an empirical analysis focusing on forecasting.
The age-period-cohort model for the age-cohort arrays is parametrised by The time effect now varies in .
The design matrix linking the canonical parameter in (45) and the predictor is essentially an identity linking the two parameters. A natural choice of the three levels points to the predictors that are , , and . We then get the representation with the convention that empty sums are zero, and recalling that second differences are defined as so that and .
The identity (49) is crucial to the understanding of the age-period-cohort model. It shows that the predictor has a single level expressed as , which in turn satisfies . The level is therefore estimable, but the individual levels , , , and are not identifiable from the model. Further, the model has two linear trends, here expressed with slopes and in terms of the age and cohort indices. These slopes can be expressed as and . They are estimable, but the individual slopes , , and are not identifiable.
The design matrix now follows from the identity (49) so that the predictor satisfies , where where , where and .
The identification relies on Theorem 4, which can be specialised to age-cohort arrays as follows.
Theorem 5 (see [19, Theorem 1]). Let satisfy (48). The parameter of (50) satisfies the following:(i) is a function of which is invariant to the group in (44);(ii) is a function of , because of (49);(iii)the parametrisation of by is exactly identified in that .
Theorem 5 in turn implies that the parameter varies freely in , while the design matrix given by (51) has full column rank. Originally, the more general Theorem 4 was proved as a corollary to Theorem 5.
5.2.2. Age-Period Arrays
An age-period data array is rectangular in the age and cohort indices and given by Consequently, the cohort index varies over . Keiding  refers to this Lexis diagram as the third principal set of death.
Age-period arrays are commonly used in epidemiology, in mortality analysis, and in sociology. The analysis of identification issue is largely similar to that of age-cohort arrays. However, the representation of the predictor in terms of differs in an intriguing way, because the third time index, the cohort , is the difference of the other two indices.
The age-period-cohort model for the age-period arrays is parametrised by The time effect now varies in . A representation of the predictor in terms of the canonical parameter is now The representation (54) differs from that of (49) in a subtle way. The three reference points for the levels of the predictor are chosen in the corner , . From this corner period and cohort indices increase, while age decreases. Hence, the age double differences are now cumulated backwards. This phenomenon arises because the cohort index is the difference of the principal indices of age and period, whereas for the age-cohort array the period index is the sum of the principal indices of age and cohort. The predictor is now where, with , The identification relies on Theorem 4. It is specialised to age-period arrays as follows.
Theorem 6 (see [24, Theorem 4.1]). Let satisfy (53). The parameter of (55) satisfies the following:(i) is a function of which is invariant to the group in (44);(ii) is a function of , because of (54);(iii)the parametrisation of by is exactly identified in that .
5.2.3. Period-Cohort Arrays
An age-period data arrays is rectangular in the age and cohort indices and given by Consequently, the age index varies over . Keiding  refers to this Lexis diagram as the second principal set of death. Age-period arrays are commonly used in prospective cohort studies in epidemiology and in sociology. The analysis is similar to that of age-period arrays when swapping the role of age and cohort.
The age-period-cohort model for the age-cohort arrays is parametrised by The time effect now varies in . A representation of the predictor in terms of the canonical parameter is now Thus, the canonical parameter and the design matrix are given by In parallel with Theorem 6 we then have the following identification result.
Theorem 7. Let satisfy (60). The parameter of (62) satisfies the following:(i) is a function of which is invariant to the group in (44);(ii) is a function of , because of (61);(iii)the parametrisation of by is exactly identified in that .
5.3. Expressing the Age-Cohort Model as a Hypothesis
It is often of interest to test the absence of the period effect. An application to analysing asbestos related mortality can be found in Miranda et al. .
The hypothesis is that , when expressed in terms of the time effect parameters. The restricted model is given by, with ,
The identification problem simplifies to a question of determining the levels of and . Therefore the (log) relative risk parameters are identified as pointed out by Clayton and Schifflers . In this model the cohort index is present and keeps the difference of the principal age and period indices. Therefore the representation of the predictor involves backward cumulated age differences as before but with a subtle change of sign, so that (54) reduces to As a consequence the canonical parameter and the design reduce to , where Miranda et al. [24, Theorem 4.2] establish an identification result similar to Theorem 6.
The age-cohort model can also be formulated as a hypothesis on the maximal invariant in the age-period-cohort model following Section 2.4.3. The period effects are set to zero through , where . Applying this to the expression for in (55) gives since in the absence of period effects; then and . The double differences cumulate to first differences through , so the above expression is seen to be a linear transformation of in (67). In other words the age-cohort model arises from the age-period-cohort model by restricting the maximal invariant parameter.
5.4. Working with the Time Effect
There is a large literature seeking to identify the original time effects , , and of the age-period-cohort model from the predictor. Here we look closer at some of those ad hoc identification proposals.
5.4.1. Ad Hoc Identification of Levels
For the age-period-cohort model it is popular to impose ad hoc identifications in two steps of the type discussed in Section 3.3. Here the first step is concerned with the level of the time effects and the second step is concerned with the linear trend. Examples are given in Sections 5.4.2 and 5.5.4.
A common first step ad hoc identification is to require that This ad hoc identification is specific to the chosen data range. For instance, the constraint is not easily transferable to a different data set drawn from the same population but with a different set of age groups. This aspect would have to be kept in mind if a substantive motivation was to be found for this constraint. Other ad hoc identification schemes such as have similar problems.
The constraint (69) is a special case of affine constraints of the form discussed in Section 3.3. The involved dimensions are and , while the number of constrains is . The matrix is given by the top left -block of in (58) padded with a column of zeros, while is given by . Theorem 1 shows that has full rank. Indeed, and its orthogonal complement are given by where, for instance, . Thus, the constrained group of equivalence classes (27) is
5.4.2. Ad Hoc Identification of Slopes: The “Intrinsic” Estimator
The “intrinsic” estimator is a popular estimator in the sociology literature; see Yang et al.  and see also O’Brien [31, 32] and Fu et al.  for a recent discussion of its merits. It has its roots in a suggestion by Kupper et al. , with an early critique given by Holford .
The “intrinsic” estimator is defined in two steps. In the first step, the levels are identified by the ad hoc constraint (69). Three of the -coordinates are then dropped; that is , , and are dropped. In a second step the linear trend is ad hoc identified using a Moore-Penrose inverse as in (23).
We can analyse these steps using the developed framework. The first step identifies the levels by the ad hoc constraint (69), which is a constraint of the form for the discussed in Section 5.4.1. This is defined on which is a linear subspace with a dimension deficiency of 3. Introduce a selection matrix that selects all coordinates of except , , and . Thus arises as a -dimensional with 3 columns deleted corresponding to , , and . This is chosen so that is invertible. Then is freely varying in that . The skew projection identity and the constraint then implies that where and . Note that while depends on and , it does not depend on the normalisation of , since we can replace by for arbitrary invertible matrices . This implies that is a function of and . The predictor is now parametrised by with . This corresponds to equation 5 of Yang et al.  who use the notation and for and , respectively.
In the second step the linear trend is ad hoc identified through a time effect parameter of the form (23) with , replaced by , so that where for some scalar and some matrix .
The “intrinsic” estimator is ad hoc identified through the choices and , while is chosen by (69). It therefore estimates an “intrinsic” parameter: which depends on the choices of , , and . However, since we can replace by for arbitrary invertible matrices without changing the expression does not depend on the normalisation of . The “intrinsic” parameter satisfies the following result.
Theorem 8. The “intrinsic” parameter is an injective mapping of the canonical parameter into a dimensional linear subspace of . The “intrinsic” time effect space is a -dimensional linear subspace of of the form where is uniquely defined up to a scale by .
Theorem 8 implies that the “intrinsic” parameter should be interpreted as an object varying in the linear subspace rather than in the unrestricted time effect space . As outlined in Section 3.4 this has consequences for the interpretation of plots of the time effects, hypothesis testing, and forecasts. A consequence of this argument is that different choices of , , , and would lead to other ad hoc identified parameters varying in other affine subspaces of . In other words, the “intrinsic” estimator carries the cost of working with the somewhat complicated linear subspace . This effort may be worthwhile if the particular choice of , , , and can be made on substantive grounds.
Forecasting of future mortality rates involves an extrapolation of the time parameters. In Section 2.4.4 it was argued that ad hoc identification may introduce an undesirable arbitrariness in the forecast. When working exclusively with the canonical parameter this arbitrariness is avoided. It is, however, also possible to work with ad hoc identified time effects under specific circumstances that we characterise here for age-period arrays. This builds on the theory developed in Kuang et al.  for age-cohort data arrays.
In the context of an age-period data array it is often of interest to forecast periods ahead. Suppose it is of interest to forecast the mortality at age in period , so that the cohort is . This requires an extrapolation of the period effect. If the cohort index is sufficiently large, that is, , then the cohort effect needs to be extrapolated too. Thus, there are two forecast index arrays of interest: Figure 2 illustrates these forecast index arrays.
Identification plays a role when extrapolating the estimates obtained on the data array . The identification issues can be ignored if the investigator simply extrapolates and . In the context of ad hoc identified time effects arbitrary linear trends are introduced in the model. The forecast of the predictor is invariant to these if and only if the chosen extrapolation method for , preserves these linear trends so that they can cancel with the arbitrary linear trend in . The next result gives a precise formulation of this statement. It applies both to point forecasts and distribution forecasts.
Theorem 9. Consider the predictor for as given in (53). Suppose the time effects , , and are ad hoc identified. Consider the class of periods-ahead forecasts over constructed as , where is a function of the ad hoc identified estimate . Let be the group (57). Invariance of the forecast with respect to the ad hoc identification is equivalent to either of the following:(i)the extrapolation method for period and cohort effects is linear trend-preserving: (ii)functions , exist so that with ; then
To illustrate the use of Theorem 9 consider the extrapolation methods and . The first forecast is a random walk forecast and it is seen to violate (ii). The second forecast is a cumulated random walk and satisfies (ii). The reason is that . Since , then . Further examples of forecasts that are linear trend-preserving as well as some which are not are given Kuang et al. [25, Table 1].
Kuang, Nielsen, and Nielsen  apply this to reserving data organised in an age-cohort array and discuss the issue of robustification of forecast with respect to structural breaks at the forecast origin. Miranda et al.  give an application to asbestos related mortality using an age-period array .
5.4.4. Bayesian Ad Hoc Identification Using a Dynamic Prior
A Bayesian ad hoc identification using a dynamic prior does not solve the identification problem as discussed in Section 4 and the same care has to be exercised to avoid the problems outlined in Section 3.4. Berzuini and Clayton  suggest such an ad hoc identification approach. On page 831 they write “Identificability problems may be solved by imposing an arbitrary linear constraint on the log-linear trend components of age, period and cohort effects. Happily, such an arbitrary constraint has no effect on the predictions of the model.” The previous analysis suggests that this is far from innocent.
The Berzuini-Clayton suggestion is to ad hoc identify the model (53) through A dynamic prior is chosen so that the double differences , , and are independent zero mean normal with variances that have -type prior. The purpose of this is in part to facilitate extrapolations , , and for , , and , which is done through further draws from normal distributions. The level/trend effects have independent uniform priors on some large intervals.
We will analyse the Berzuini-Clayton model as applied to an age-period data array . Decompose the canonical parameter from (54) into two parts: the slope and level parameters, say , and the collection of double differences, say . The assumed prior for is a simple collection of independent normal distributions with variances . The assumed prior for is a linear combination of not only the independent uniform variables , but also on , since the age double differences are cumulated backwards in (54), but forwards in (77). Thus, the prior for depends on the construction.
We get a hyper-parameter , where is some three-dimensional ad hoc identified level/trend effect dependent on . We will argue that the ad hoc identified level/trend effect will wash out in the Berzuini-Clayton model. However, the level/trend parameter is a function of the construction that is tailored to the ad hoc identification. That construction remains in the analysis.
In the presentation of the posterior Berzuini and Clayton are careful only to consider the double differences and stay clear of the ad hoc identified level/trend effect . Theorem 2 yields the posterior