Model and Variable Selection Procedures for Semiparametric Time Series Regression
Semiparametric regression models are very useful for time series analysis. They facilitate the detection of features resulting from external interventions. The complexity of semiparametric models poses new challenges for issues of nonparametric and parametric inference and model selection that frequently arise from time series data analysis. In this paper, we propose penalized least squares estimators which can simultaneously select significant variables and estimate unknown parameters. An innovative class of variable selection procedure is proposed to select significant variables and basis functions in a semiparametric model. The asymptotic normality of the resulting estimators is established. Information criteria for model selection are also proposed. We illustrate the effectiveness of the proposed procedures with numerical simulations.
Non- and semiparametric regression has become a rapidly developing field of statistics in recent years. Various types of nonlinear model such as neural networks, kernel methods, as well as spline method, series estimation, local linear estimation have been applied in many fields. Non- and semiparametric methods, unlike parametric methods, make no or only mild assumptions about the trend or seasonal components and are, therefore, attractive when the data on hand does not meet the criteria for classical time series models. However, the price of this flexibility can be high; when multiple predictor variables are included in the regression equation, nonparametric regression faces the so-called curse of dimensionality.
A major problem associated with non- and semiparametric trend estimation involves the selection of a smoothing parameter and the number of basis functions. Most literature on nonparametric regression with dependent errors focuses on the kernel estimator of the trend function (see, e.g., Altman , Hart  and Herrmann et al. ). These results have been extended to the case with long-memory errors by Hall and Hart , Ray and Tsay , and Beran and Feng . Kernel methods are affected by the so-called boundary effect. A well-known estimator with automatic boundary correction is the local polynomial approach which is asymptotically equivalent to some kernel estimates. For detailed discussions on local polynomial fitting see, for example, Fan and Gijbels  and Fan and Yao .
For semiparametric models with serially correlated errors, Gao  proposed the semiparametric least-square estimators (SLSEs) for the parametric component and studied its asymptotic properties. You and Chen  constructed a semiparametric generalized least-square estimator (SGLSE) with autoregressive errors. Aneiros-Pérez and Vilar-Fernández  constructed SLSE with correlated errors.
Like parametric regression models, variable selection of the smoothing parameter for the basis functions is important problem in non- and semiparametric models. It is common practice to include only important variables in the model to enhance predictability. The general approach to finding sensible parameters is to choose an optimal subset determined according to the model selection criterion. Several information criteria for evaluating models constructed by various estimation procedures have been proposed, see, for example, Konishi and Kitagawa . The commonly used criteria are generalized cross-validation, the Akaike information criterion (AIC), and the Bayesian information criterion (BIC). Although best subset selection is practically useful, these selection procedures ignore stochastic errors inherited between the stages of variable selection. Furthermore, best subset selection lacks stability, see, for example, Breiman . Nonconcave penalized likelihood approaches for selecting significant variables for parametric regression models have been proposed by Fan and Li . This methodology can be extended to semiparametric generalized regression models with dependent errors. One of the advantages of this procedure is the simultaneous selection of variables and the estimation of unknown parameters.
The rest of this paper is organized as follows. In Section 2.1 we introduce our semiparametric regression models and explain classical partial ridge regression estimation. Rather than focus on the kernel estimator of the trend function, we use the basis functions to fit the trend component of time series. In Section 2.2, we propose a penalized weighted least-square approach with information criteria for estimation and variable selection. The estimation algorithms are explained in Section 2.3. In Section 2.4, the GIC proposed by Konishi and Kitagawa , the BIC proposed by Hastie and Tibshirani , and the BIC proposed by Konishi et al.  are applied to the evaluation of models estimated by penalized weighted least-square. Section 2.5 contains the asymptotic results of proposed estimators. In Section 3 the performance of these information criteria is evaluated by simulation studies. Section 4 contains the real data analysis. Section 5 concludes our results, and proofs of the theorems are given in the appendix.
2. Estimation Procedures
In this section, we present our semiparametric regression model and estimation procedures.
2.1. The Model and Penalized Estimation
We consider the semiparametric regression model:
where is the response variable and is the covariate vector at time , is an unspecified baseline function of with , is a vector of unknown regression coefficients, and is a Gaussian and zero mean covariance stationary process.
We assume the following properties for the error terms and vectors of explanatory variables .(A.1) It holds that is a linear process given by where and is an i.i.d. Gaussian random variable with and .(A.2)The coefficients satisfy the conditions that for all , and .
We define .
The assumptions on covariate variables are as follows.(B.1) Also and , have mean zero and variance 1.
The trend function is expressed as a linear combination of a set of underlying basis functions :
where is an -dimensional vector constructed from basis functions , and is an unknown parameter vector to be estimated. The examples of basis functions are B-spline, P-spline, and radial basis functions. A P-spline basis is given by
where are spline knots. This specification uses the so-called truncated power function basis. The choice of the number of knots and the knot locations are discussed by Yu and Ruppert .
Radial basis function (RBF) emerged as a variant of artificial renewal network in late 80s. Nonlinear specification of using RBF has been widely used in cognitive science, engineering, biology, linguistics, and so on. If we consider the RBF modeling, a basis function can take the form
where determines the location and determines the width of the basis function.
Selecting appropriate basis functions, then the semiparametric regression model (2.1) can be expressed as a linear model
where , , with . The penalized least-square estimator is then a minimizer of the function
where is the smoothing parameter controlling the tradeoff between the goodness-of-fit measured by weighted least-square and the roughness of the estimated function. Also is an appropriate positive semidefinite symmetric matrix. For example, if satisfies , we have the usual quadratic integral penalty (see, e.g., Green and Silverman ). By simple calculus, (2.7) is minimized when and satisfy the block matrix equation
where , ,and is the identity matrix of order . Applying least-square to the linear model (2.9), we obtain the semiparametric ordinary least-square estimator (SOLSE) result:
Speckman  studied similar solutions for partial linear models with independent observations. Since the errors are serially correlated in model (2.1), is not asymptotically efficient. To obtain an asymptotically efficient estimator for , we use the prewhitening transformation. Note that the errors in (2.6) are invertible. Let , where is the lag operator and with . Applying to the model (2.6) and rewriting the corresponding equation, we obtain the new model:
where and . Here The regression errors in (2.12) are i.i.d. Because, in practice, the response variable is unknown, we use a reasonable approximation by based on the work by Xiao et al.  and Aneiros-Pérez and Vilar-Fernández .
Under the usual regularity conditions the coefficients decrease geometrically so, letting denote a truncation parameter, we may consider the truncated autoregression on :
where are i.i.d. random variables with . We make the following assumption about the truncation parameter.(C.1)The truncation parameter satisfies for some .
The expansion rate of the truncation parameter given in (C.1) is also for convenience. Let be the transformation matrix such that . Then the model (2.12) can be expressed as
with . Here denotes the lag autocorrelation function of .
Now our estimation problem for the semiparametric time series regression model can be expressed as the minimization of the function
where and . Based on the work by Aneiros-Pérez and Vilar-Fernández , an estimator for is constructed as follows. We use the residuals to construct an estimate of using the ordinary least square method applied to the model
Define the estimate of , where where and is the matrix of regressors with the typical element . Then is obtained from by replacing with , with and so forth. Applying least-square to the linear model, we obtain Then
where and , with and . The following theorem shows that the loss in efficiency associated with the estimation of the autocorrelation structure is modest in large samples.
Theorem 2.1. Let the conditions of (A.1), (A.2), (B.1), and (C.1) hold, and assume that is nonsingular. Let denote the true value of , then where denotes convergence in distribution and . Assume that is nonsingular and let denote the true value of , then one has where .
2.2. Variable Selection and Penalized Least Squares
Variable and model selection are an indispensable tool for statistical data analysis. However, it has rarely been studied in the semiparametric context. Fan and Li  studied penalized weighted least-square estimation with variable selection in semiparametric models for longitudinal data. In this section, we introduce the penalized weighted least-square approach. We propose an algorithm for calculating the penalized weighted least-square estimator of in Section 2.3. In Section 2.4 we present the information criteria for the model selection.
From now on, we assume that the matrices and are standardized so that each column has mean 0 and variance 1. The first term in (2.7) can be regarded as a loss function of and , which we will denote by . Then expression (2.7) can be written as
The methodology in the previous section can be applied to the variable selection via penalized least-square. A form of penalized weighted least-square is
where are penalty functions and are regularization parameters, which control the model complexity. By minimizing (2.27) with a special construction of the penalty function given in what following some coefficients are estimated as 0, which deletes the corresponding variables, whereas others are not. Thus, the procedure selects variables and estimates coefficients simultaneously. The resulting estimate is called a penalized weighted least-square estimate.
Many penalty functions have been used for penalized least-square and penalized likelihood in various non- and semiparametric models. There are strong connections between the penalized weighted least-square and the variable selection. Denote by and the true parameters and the estimates, respectively. By taking the hard thresholding penalty function
we obtain the hard thresholding rule
The penalty results in a ridge regression and the penalty yields a soft thresholding rule
This solution gives the best subset selection via stepwise deletion and addition. Tibshirani [24, 25] has proposed LASSO, which is the penalized least-square estimate with the penalty, in the general least-square and likelihood settings.
2.3. An Estimation Algorithm
In this section we describe an algorithm for calculating the penalized least-square estimator of . The estimate of minimizes the penalized sum of squares given by (2.17). First we obtain in Step 1. In Step 2, we estimate by using obtained in Step 1. Then is obtained using (Step 3). Here the penalty parameters , and and the number of basis functions are chosen using information criteria that will be discussed in Section 2.4.
Step 2. An estimator for is constructed followings the work of Aneiros-Pérez and Vilar-Fernández . We use the residuals to construct an estimate of using the ordinary least square method applied to the model The estimator is obtained from by replacing parameters with their estimates.
Step 3. Our SGLSE of is obtained by using the model where , , , and . Finding the solution of the penalized least-square of (2.27) needs the local quadratic approximation, because the and hard thresholding penalty are irregular at the origin and may not have second derivatives at some points. We follow the methodology of Fan and Li . Suppose that we are given an initial value that is close to the minimizer of (2.27). If is very close to 0, then set . Otherwise they can be locally approximated by a quadratic function as when . Therefore, the minimization problem (2.27) can be reduced to a quadratic minimization problem and the Newton-Raphson algorithm can be used. The right-hand side of equation (2.27) can be locally approximated by where The solution can be found by iteratively computing the block matrix equation: This gives the estimators where , , and .
2.4. Information Criteria
Selecting suitable values for the penalty parameters and number of basis functions is crucial to obtaining good curve fitting and variable selection. The estimate of minimizes the penalized sum of squares given by (2.17). In this section, we express the model (2.15) as
where and . In many applications, the number of basis functions needs to be large to adequately capture the trend. To determine the number of basis functions, all models with are fitted and the preferred model minimizes some model selection criteria.
The Schwarz BIC is given by
where is the least-square estimate of without a degree of freedom correction. Hastie and Tibshirani  used the trace of the smoother matrix as an approximation to the effective number of parameters. By replacing the number of parameters in BIC by , we formally obtain information criteria for the basis function Gaussian regression model in the form
Here is defined by (2.44) in what follows.
We also consider the use of the BIC criterion to choose appropriate values for these unknown parameters. Denote Let and be the number of zero components in and , respectively. Then the BIC criterion is
where is the matrix of second derivatives of the penalized likelihood defined by
Here is a diagonal matrix with th element and . The -dimensional vector has th element where is the element in the th row and th column of . Also is the matrix defined by
and and are the product of the and nonzero eigenvalues of and , respectively.
Konishi and Kitagawa  proposed a framework of Generalized Information Criteria (GIC) to the case where the models are not estimated by maximum likelihood. Hence, we also consider the use of GIC for the model evaluations. The GIC for the hard thresholding penalty function is given by
where is a matrix. Also is basically the product of the empirical influence function and the score function. It is defined by
The number of basis functions , penalty parameters are determined by minimizing BIC, BIC or GIC.
2.5. Sampling Properties
We now study the asymptotic properties of the estimate resulting from the penalized least-square function (2.27).
First we establish the convergence rate of the penalized profile least-square estimator. Assume that penalty functions and are negative and nondecreasing with . Let and denote the true values of and , respectively. Also let
Theorem 2.2. Under the conditions of Theorem 2.1, if , and tend to 0 as , then with probability tending to 1, there exist local minimizers and of such that and .
Theorem 2.2 demonstrates how the rate of convergence of the penalized least-square estimator of depends on for . To achieve the root convergence rate, we have to take small enough so that .
Next we establish the oracle property for the penalized least-square estimator. Let consist of all nonzero components of and let consist of all zero components. Let consist of all nonzero components of and let consist of all zero components. Let
Further, let consist of the first components of and let consist of the last components of . Let consist of the first components of and let consist of the last components of .
Theorem 2.3. Assume that for and , one has , , and . Assume that the penalty functions and satisfy If then, under the conditions of Theorem 2.1, with probability tending to 1, the root consistent local minimizers and in Theorem 2.2 must satisfy the following: (1)(sparsity) ;(2)(asymptotic normality) Here and consist of the first and rows and columns of and defined in Theorem 2.1, respectively.
3. Numerical Simulations
We now assess the performance of semiparametric estimators of the proposed in previous section via simulations. We generate simulation data from the model
where , and is a Gaussian AR(1) process with autoregressive coefficient . We used the radial basis function network modeling to fit the trend component. We simulate the covariate vector from a normal distribution with mean 0 and . In each case, the autoregressive coefficient is set to , , or and the sample size is set to 50, 100 or 200. Figure 1 depicts some examples of simulation data.
We compare the effectiveness of our proposed procedure (PLS + HT) with an existing procedure (PLS). We also compare the performance of the information criteria BIC, GIC and BIC for evaluating the models. As discussed in Section 3, the proposed procedure (PLS + HT) excludes basis functions as well as explanatory variables.
First we assess the performance of by the square root of average squared errors ():
where are the grid points at which the baseline function is estimated. In our simulation, we use . Table 1 shows the means and standard deviations of for based on 500 simulations. increases as the autoregressive coefficient increases but decreases as the sample size increases. From Table 1, we see that the proposed procedure (PLS + HT) works better than PLS and that models evaluated by BIC work better than those based on BIC or GIC.
Then the performance of is assessed by the square root of average squared errors ():
The means and standard deviations of for based on 500 simulations are shown in Table 2. We can see that the proposed procedure (PLS + HT) works better than the existing procedure. There is almost no change in as the autoregressive coefficient changes (unlike the procedure of You and Chen ), whereas depends strongly on the information, BIC works the best among the criteria. We can also confirm the consistency of the estimator, that is decreases as the sample size increases.
The first step ahead prediction error (PE), which is defined as
is also investigated. Table 3 shows the means and standard errors of PE for based on 500 simulations. The PE increases as the autoregressive coefficient increases, but the PE decreases as the sample size increases. From Table 3, we see that PLS + HT works better than the existing procedures and there is almost no difference in the PE depending on the information criteria. The models evaluated by BIC perform well for large sample sizes.
The means and standard deviations of the number and deviation of basis functions are shown in Tables 4 and 5. The BIC gives a smaller number of basis functions than the other information criteria. The models evaluated by BIC also give smaller standard deviations of the number of basis functions. The models determined by BIC tend to choose larger deviations of basis functions than those based on BIC and GIC. The number of basis functions increases gradually as the sample size or increase. From Table 4, it appears that the number of basis functions does not depend on the sample size . From Table 5, it also appears that the deviations of basis functions do not depend on the sample size and .
We now compare the performance of our procedure with existing procedures in terms of the reduction of model complexity. Table 6 shows simulation results of the means and standard deviations of the number of parameters excluded ( or ) by the proposed procedure. The results indicate that the proposed procedure reduces model complexity. From Table 6, It appears that the models determined by BIC tend to exclude fewer parameters and give smaller standard deviations for the number of parameters excluded. This is due to the selection of a smaller number of basis functions compared to the selection based on the other criteria (see Table 4). There is almost no dependence of the number of excluded parameters on . The models evaluated by BIC give a larger number of excluded parameters as the sample size increases. On the other hand, the models evaluated by BIC or GIC give a smaller number of excluded parameters as the sample size increases.
Table 7 shows the means and standard deviations of the number of basis functions excluded as by the proposed procedure. From Table 7 it appears that the models evaluated by BIC tend to exclude fewer basis functions than those based on GIC and BIC. Again this is due to the selection of a smaller number of basis functions (see Table 4). The models determined by BIC also give smaller standard deviations of the number of basis functions than the other criteria. There is almost no dependence of the number of basis functions on .
Table 8 shows the means and standard deviations of the number of basis functions excluded as by the proposed procedure. The number of which values really 0 was five. From Table 8 we see that the proposed procedure gives nearly five. The models determined by BIC give results more close to five and smaller standard deviations of the number of basis functions than the other criteria. The number of basis functions approaches five as the sample size increases. The standard deviations of the number of basis functions excluded decrease as increases. These results indicate that the proposed procedure reduces model complexity.
Table 9 shows the percentage of times that various were estimated as being zero. As for the parameters , these parameters were not estimated zero for every simulations, we omit to show the corresponding results on Table 9. The results indicate that the proposed procedure excludes insignificant variables and selects significant variables. It can be seen that the proposed procedure gives a better performance as the sample size increases and that BIC is superior to the other criteria.
4. Real Data Analysis
In this section we present the consequence of analyzing the real-time series data using proposed procedure. We use two data in this study; the data about the spirit consumption data in United Kingdom and the association between fertility and female employment in Japan.
4.1. The Spirit Consumption Data in the United Kingdom
We now illustrate our theory through an application to spirit consumption data for the United Kingdom from 1870 to 1938. The data-set can be found in Fuller [26, page 523]. In this data-set, the dependent variable is the logarithm of the annual per capita consumption of spirits. The explanatory variables and are the logarithms of per capita income and the price of spirits, respectively, and . Figure 2 shows that there is a change-point at the start of the First World War (1914). Therefore, we prepare a variable : from 1870 to 1914 and from 1915 to 1933. From this we derive another three explanatory variables: , , and . We consider the semiparametric model:
We computed the basis function estimate for using the existing procedure (PLS) and the proposed procedure (PLS + HT) with BIC. The resulting estimates and standard errors (SEs) of are given in Table 10. The selected number of basis function is seven with one excluded basis function and the spread parameter is estimated as . Therefore, we obtain the model
The results indicate that the proposed procedure excludes some variable and reduces model complexity. Table 10 shows that PLS + HT selects only and . That indicates possible interactions between consumption and income and between consumption and incomeprice after 1915. Consumption increases as income increases; however, as income increases and the price also increases, consumption decreases. We plot the estimated trend curve, residuals and autocorrelations functions in Figures 3 to 5. The residual mean square is .
You and Chen  used the following semiparametric partially linear regression model:
The semiparametric least-square (SLS) regression gives . The residual mean square is , which is more than that of our SGLSE fit. For a fair comparison, we use model (4.3) to revise You and Chen's estimation. Our semiparametric generalized least-square gives . The result indicates that is insignificant in model (4.3).
4.2. The Association between Fertility and Female Employment in Japan
Recent literature finds that in OECD countries the cross-country correlation between the total fertility rate and the female labor force participation rate, which until the beginning of the 1980s had a negative value, has since acquired a positive value. See for example, Brewster and Rindfuss , Ahn and Mira , and Engelhardt et al. . This result is often interpreted as evidence for a changing sign in the time series association between fertility and female employment within OECD countries.
However, OECD countries, including Japan, have different cultural backgrounds. We investigate whether or not the relation between the total fertility rate (TFR) and the female labor force participation rate (FLP) has changed in Japan from a negative value to a positive value. This application challenges previous findings and could be good news for policy makers, as a positive relationship implies that a rising FLP is associated with an increasing TFR.
Usually, FLP contains all women aged 15 to 64. However, TFR is a combination of fertility rates for ages 15–49, so we use the FLP of women aged 15 to 49 instead of women aged 15 to 64. We take the TFR from 1968 to 2007 in Japan. The estimation is a semiparametric regression of on . As the law of the Equal Employment Act came into force in 1985, we use the interaction variables “dummy for 1968–1984 ” () and for 1985–2007 (). We also use dummy variables for 1990–1999 and 2000–2007 (, ) and consider the semiparametric model
We applied the existing procedure (PLS) and proposed procedure (PLS + HT) with . The resulting estimates and standard errors (SE) of are given in Table 11. Therefore, we obtain the model
The residual mean square of PLS + HT is and that of PLS is . The selected number of basis functions is six with one excluded basis function and the spread parameter is estimated as . Table 11 shows that PLS + HT selects only 1968–1984 and 1985–2007. That indicates a negative correlation between TFR and FLP for 1968–2007, especially for 1968–1984, which means TFR decreases as FLP increases. We could not see the positive association in 80s which has been reported in recent studies, see, for example, Brewster and Rindfuss , Ahn and Mira , and Engelhardt et al. . We plot the estimated trend curve, residuals and autocorrelation functions in Figures 7 to 9.
5. Concluding Remarks
In this article we have proposed variable and model selection procedures for the semiparametric time series regression. We used the basis functions to fit the trend component. An algorithm of estimation procedures is provided and the asymptotic properties are investigated. From the numerical simulations, we have confirmed that the models determined by the proposed procedure are superior to those based on the existing procedure. They reduce the complexity of models and give good fitting by excluding basis functions and nuisance explanatory variables.
The development here is limited to the case with Gaussian stationary errors, but it seems likely our approach can be extended to the case with non-Gaussian long-range dependent errors, along with the lines explored in recent work by Aneiros-Pérez et al. . However, the efficient estimation for regression parameter is an open question in case of long-range dependence. This is a question we hope to address in future work. We also plan to explore the question of whether the proposed techniques can be extended to the cointegrating regression models with an autoregressive distributed lag framework.
In this appendix we give the proofs of the theorems in Section 2. We use to denote the Euclidian norm of .
Let be the infeasible estimator for constructed using OLS methods. That is , where and with . For ease of notation, we set for , and . We write for . Then we can construct the infeasible estimate using and . The following lemma states that the estimators and given in Theorem 2.1 have asymptotically normal distributions.
Proof of Lemma .1. From model (2.6), can be written as where , and are given by , and , respectively. Hence can be expressed without using . Then the minimization function in (2.17) can be written as First we consider asymptotic normality for , using the model The estimators minimize the function , which yields Then the minimization of this quadratic function is given by We now deal with , and . First we evaluate . From the expansion we can see that Similarly, we obtain Finally, we can evaluate as follows: We can also observe that the weighted least-square estimates have a normal distribution. Hence If and , then , and become and . Therefore, (A.7) can be written as By the law of large numbers and the central limit theorem, Next we deal with the estimators . These minimize the function , which yields The minimization of this quadratic function is given by If we substitute for its estimator , from (A.14) and (A.11), we have Similarly, by the law of large numbers and the central limit theorem, This completes the proof of the lemma.
Proof of Theorem 2.1. Let the estimator be the ordinary least-square estimate applied to model (2.18). For the ease of notation, we set for and . Then we write where From assumptions (A.1), (A.2), and Lemma .1 we can see that under the assumptions about and by the Caucy-Schwarz inequality Next we evaluate . In An et al. [31, proof of Theorem 5]: it is shown that, under the assumptions about , Thus, by the Cauchy-Schwarz inequality which yields . Finally, we evaluate . By the extended Baxter inequality from Bühlmann [32, proof of Theorem 3.1], we have Notice that . Since is a stationary and invertible process whose linear process coefficients satisfy the given summability assumptions, we have for some