Abstract
Reinforcement learning is one of the paradigms and methodologies of machine learning developed in the computational intelligence community. Recently, reinforcement learning algorithms have faced major challenges in systems with complex dynamics. From the perspective of variable selection, we often encounter situations in which too many variables are included in the full model at the initial stage of modeling. Because of the high-dimensional and intractable integral associated with longitudinal data, likelihood inference is computationally challenging: computationally intensive methods may converge very slowly or even fail to converge. Recently, the hierarchical likelihood (h-likelihood) has come to play an important role in inference for models having unobservable or unobserved random variables. This paper focuses on linear models with random effects in the mean structure and proposes a penalized h-likelihood algorithm that incorporates variable selection procedures into mean modeling via the h-likelihood. The penalized h-likelihood method avoids the messy integration over the random effects and is computationally efficient. Furthermore, it demonstrates good performance in selecting relevant variables. Through theoretical analysis and simulations, we confirm that the penalized h-likelihood algorithm produces good fixed effect estimates and can identify zero regression coefficients when modeling the mean structure.
1. Introduction
Reinforcement learning is specified as trial and error (variation, selection, and search) plus learning (association and memory) in Sutton and Barto [1]. Traditional variable selection procedures, such as the LASSO in Tibshirani [2] and OMP in Cai and Wang [3], consider only the fixed effect estimates in linear models. However, in real life, much existing data involves both fixed effects and random effects. For example, in clinical trials, several observations are taken over a period of time for each particular patient. After the data have been collected for all patients, it is natural to include a random effect for each individual patient in the model, since a common error term for all observations is not sufficient to capture the individual randomness. Moreover, the random effects, which are not directly observable, are of interest in themselves when inference focuses on each individual's response. Therefore, to accommodate random effects and obtain good estimates, Lee and Nelder [4] proposed hierarchical generalized linear models (HGLMs). HGLMs are based on the idea of the h-likelihood, a generalization of the classical likelihood that accommodates the random components entering the model. It is preferable because it avoids the integration required for the marginal likelihood and uses the conditional distribution instead.
Inspired by the ideas of reinforcement learning and hierarchical models, this paper proposes a method that adds a penalty term to the h-likelihood. This method considers not only the fixed effects but also the random effects in the linear model, and it produces good estimation results with the ability to identify zero regression coefficients in joint models of mean-covariance structures for high-dimensional multilevel data.
The rest of this paper is organized as follows: Section 2 provides a literature review of current variable selection methods based on partial linear models and the h-likelihood. Section 3 describes a penalty-based h-likelihood variable selection algorithm and demonstrates via simulation that the proposed algorithm exhibits the desired sample properties and can be useful in practical applications. Finally, Section 4 concludes the paper and outlines some future research directions.
2. Literature Review
2.1. Reinforcement Learning from the Perspective of Nonlinear Systems
Reinforcement learning, one of the most active research areas in artificial intelligence, is defined in Sutton and Barto [1] as a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives while interacting with a complex, uncertain environment. In addition, in Sutton and Barto [5], reinforcement learning is specified as trial and error (variation, selection, and search) plus learning (association and memory). Furthermore, Barto and Mahadevan [6] propose hierarchical control architectures and associated learning algorithms; approaches to temporal abstraction and hierarchical organization, which rely mainly on the theory of semi-Markov decision processes, are reviewed and discussed in their paper. Recent works, such as Dietterich [7], have focused on hierarchical methods that incorporate subroutines and state abstractions instead of solving "flat" problem spaces.
Nonlinear control design has attracted attention in the research community for a long time. In industrial settings, the controlled system usually exhibits strong nonlinearity. Various adaptive optimal control models have been applied to the identification of nonlinear systems in the literature. In fact, the two fundamental principles of controller design are optimality and veracity. He et al. [8] study a novel policy iteration scheme for the design of online optimal control laws for a class of nonlinear systems and establish the convergence of the scheme to the optimal control law. He et al. [9] investigate an online adaptive optimal control problem for a class of continuous-time Markov jump linear systems (MJLSs) using a parallel reinforcement learning (RL) algorithm with completely unknown dynamics; a novel parallel RL algorithm is proposed, and its convergence is shown. Wang et al. [10] study a new online adaptive optimal controller design scheme for a class of nonlinear systems with input time delays; an online policy iteration algorithm is proposed, and the effectiveness of the method is verified. He et al. [11] propose an online adaptive optimal controller design for a class of nonlinear systems through a novel policy iteration (PI) algorithm. Cheng et al. [12] investigate the observer-based asynchronous fault detection problem for a class of nonlinear Markov jump systems and introduce a hidden Markov model to ensure that the observer modes run synchronously with the system modes. Cheng et al. [13] propose a finite-time asynchronous output feedback control scheme for a class of Markov jump systems subject to external disturbances and nonlinearities.
2.2. Partial Linear Models
Linear models have been widely used in the literature. One extension of linear models, introduced by Nelder and Wedderburn [14], is the class of generalized linear models (GLMs). GLMs expand the class of response distributions from the normal distribution to the one-parameter exponential families. In addition, GLMs generalize linear regression in two ways: first, they relate the linear predictor to the response variable via a link function, or equivalently a monotonic transform of the mean, rather than the mean itself; second, they allow the variance of each measurement to be a function of its predicted value.
In contrast, Laird and Ware [15] propose linear mixed effects models (LMEs), which are widely used in the analysis of longitudinal and repeated measurement data. Linear mixed effects models have gained popularity because they account for within-cluster and between-cluster variation simultaneously. Vonesh and Chinchilli [16] have investigated and applied statistical estimation and inference for this class of LME models. However, the model selection problem in LME models was long ignored. This neglected problem was pointed out by Vaida and Blanchard [17], who noted that when the focus is on clusters instead of the population, traditional selection criteria such as AIC and BIC are not appropriate. Vaida and Blanchard [17] propose the conditional AIC for mixed effects models, with a detailed discussion of how to define degrees of freedom in the presence of random effects. Furthermore, Pu and Niu [18] study the asymptotic behavior of a generalized information criterion for selecting fixed effects. In addition, Rajaram and Castellani [19] use ordinary differential equations and linear advection partial differential equations (PDEs) to introduce a case-based density approach to modeling big data longitudinally.
Recently, Fan and Li [20] developed a class of variable selection procedures for both fixed effects and random effects in linear mixed effects models by incorporating the penalized profile likelihood method. With this regularization method, both fixed effects and random effects can be selected and estimated. Two aspects of Fan and Li's [20] method stand out. First, the proposed procedures can estimate the fixed effects and random effects separately; in other words, the fixed effects can be estimated without estimating the random effects, and vice versa. In addition, the method works in the high-dimensional setting by allowing the dimension of the random effects to grow exponentially with the sample size.
Combining the ideas of generalized linear models (GLMs) and linear mixed effects (LME) models leads to a further extension, generalized linear mixed models (GLMMs). Traditional GLMs assume that the observations are uncorrelated. To relax this restrictive assumption, GLMMs allow for correlation between observations, which often arises in longitudinal data and clustered designs. The advantages of GLMMs are as follows: first, GLMMs allow random effects to be included in the linear predictor, so that correlations between observations can be explained through an explicit probability model; second, when the focus is on estimating the effects for a particular individual, GLMMs provide good subject-specific parameter estimates. However, GLMMs, which are also called multilevel models, are generally more computationally intensive to fit.
The GLMs and GLMMs discussed so far are well-established parametric regression models. A serious disadvantage of parametric modeling is that a parametric model may be too restrictive in some applications. To overcome this difficulty, nonparametric regression has gained popularity in the literature. There are many nonparametric and smoothing methods, such as kernel smoothing, local polynomial fitting, and penalized splines. In this section, two often-used smoothing methods for estimating a nonparametric model are described, since they are used later in simulations and applications.
The first type is called local linear kernel smoothing. The main idea of local linear kernel smoothing is to locally approximate the function linearly. Local linear kernel smoothing uses Taylor expansion as a fundamental tool. In particular, Taylor expansion states that any smooth function can be locally approximated by a polynomial of some degree.
Suppose we have a simple nonparametric model in which the response is a smooth function of a scalar covariate plus noise. Let an arbitrary fixed point be the location at which the function is to be estimated, and assume the function has a continuous first-order derivative there. Then, by Taylor expansion, the function can be locally approximated by a linear function, involving its value and its first derivative at that point, in a neighborhood of the point where the expansion is valid.
The local linear smoother is obtained by fitting the data locally with a linear function so as to minimize a weighted least squares criterion, in which the weights are obtained by rescaling a kernel function with a positive constant bandwidth. The primary role of the bandwidth is to specify the size of the local neighborhood in which the local fitting is conducted. Moreover, the kernel function determines how observations within the neighborhood contribute to the fit at the point of interest. A detailed introduction to kernel functions is provided in the later paragraphs.
The fitted value of the local linear smoother at the point of interest can then be expressed in closed form as a weighted linear combination of the observed responses.
A local linear smoother is often good enough for most problems if the kernel function and the bandwidth are adequately determined. Moreover, it enjoys many good properties that the other linear smoothers may lack. Fan [21], Fan and Gijbels [22], and Hastie and Loader [23] separately discussed those good properties in detail.
The kernel function used in the local linear smoother is a symmetric probability density function. The kernel specifies how the observations contribute to the local linear kernel fit at the point of interest, whereas the bandwidth specifies the size of the local neighborhood. Several widely used kernel functions include the following: (i) the uniform kernel; (ii) the Epanechnikov kernel; (iii) the biweight kernel; (iv) the Gaussian kernel.
Suppose, for instance, that the uniform kernel is used. Then all observations within the neighborhood contribute equally (that is, with identical weights) to the local linear kernel fit, whereas all observations outside the neighborhood contribute nothing. Suppose, for another example, that the Gaussian kernel is used. The contribution of each observation is then determined by its distance from the fitting point: a smaller distance results in a larger contribution, since the Gaussian kernel is a bell-shaped curve that peaks at the origin.
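The local linear fitting procedure described above can be sketched in a few lines. This is an illustrative Python sketch under generic names (the paper's own experiments use R); the Epanechnikov kernel and the test function are assumptions for demonstration.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero outside."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def local_linear_fit(x, y, x0, h, kernel=epanechnikov):
    """Local linear smoother at point x0 with bandwidth h.

    Minimizes the kernel-weighted least squares criterion
    sum_i K_h(x_i - x0) * (y_i - a - b*(x_i - x0))^2
    and returns the local intercept a, the fitted value at x0.
    """
    w = kernel((x - x0) / h) / h                      # rescaled kernel weights
    X = np.column_stack([np.ones_like(x), x - x0])    # local linear design
    WX = X * w[:, None]
    # weighted normal equations: (X'WX) beta = X'Wy
    beta = np.linalg.solve(X.T @ WX, WX.T @ y)
    return beta[0]

# usage: smooth noisy observations of a smooth function
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 200)
fit = np.array([local_linear_fit(x, y, x0, h=0.1) for x0 in x])
```

The local intercept is the fitted value because the design is centered at the point of interest, which also gives local linear smoothing its good boundary behavior.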
The second type of smoothing is called regression spline smoothing. In local linear kernel smoothing, introduced above, local neighborhoods are defined by a bandwidth and a fixed point. In contrast, in regression spline smoothing, local neighborhoods are defined by a group of locations in an interval, known as knots; these are referred to as interior knots or simple knots. The interval is divided into local neighborhoods by these knots, and within any two neighboring knots, a Taylor expansion up to some degree is applicable.
A regression spline can be constructed in terms of a truncated power basis. As mentioned earlier, given the knots, the truncated power basis of a given degree consists of the polynomial terms up to that degree together with that power of the positive part of the difference between the variable and each knot. In most of the literature, this is called the constant, linear, quadratic, or cubic truncated power basis when the degree is 0, 1, 2, or 3, correspondingly. For the purposes of this paper, the cubic truncated power basis is used in the subsequent simulations and applications.
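The truncated power basis just described can be constructed as follows; this is a generic Python sketch (function and variable names are ours), with the cubic case used in the paper as the default.

```python
import numpy as np

def truncated_power_basis(x, knots, degree=3):
    """Truncated power basis of a given degree with the given knots:
    columns 1, x, ..., x^p followed by (x - k)_+^p for each knot k,
    where (.)_+ denotes the positive part.
    """
    x = np.asarray(x, dtype=float)
    poly = np.vstack([x**d for d in range(degree + 1)]).T        # 1, x, ..., x^p
    trunc = np.vstack([np.maximum(x - k, 0.0)**degree for k in knots]).T
    return np.hstack([poly, trunc])

# usage: cubic basis with three interior knots on [0, 1]
x = np.linspace(0, 1, 101)
B = truncated_power_basis(x, knots=[0.25, 0.5, 0.75], degree=3)
# B has (degree + 1) + K = 4 + 3 = 7 columns
```

A regression spline fit is then an ordinary linear regression of the response on the columns of this basis matrix.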
We again consider the simple nonparametric model mentioned above. It is conventional to collect the truncated power basis functions into a vector, whose length is the number of basis functions involved. The regression fit of the function in the nonparametric model can then be expressed as a linear combination of these basis functions.
To sum up, parametric models are very useful for longitudinal data analysis since they provide a clear and simple description of the relationship between the response variable and its covariates. However, in many analyses a parametric model does not fit the data well, resulting in biased estimates. To overcome the restrictive assumptions of parametric forms, various nonparametric models, such as nonparametric mixed effects models, have been proposed for longitudinal data; see, for example, Fan and Zhang [24] and Wu and Rice [25], among others. There is always a trade-off between model assumptions and model complexity. Parametric models are less robust against violations of model assumptions, but they are efficient when the models are correctly specified. In contrast, nonparametric models are more robust against model assumptions, but they are less efficient and more complex. A trade-off between efficiency and complexity via an information measure is fully investigated and discussed in Caves and Schack [26]. Zhang et al. [27] propose an improved K-means clustering algorithm, called the covering K-means algorithm (CK-means). The CK-means algorithm has two advantages: first, it produces efficient and accurate clustering results under both sequential and parallel conditions; furthermore, it self-adaptively provides a reasonable number of clusters based on the data features.
Semiparametric models arise from the need to compromise between, and retain the good features of, parametric and nonparametric models. A semiparametric model has two essential components: a parametric component and a nonparametric component. More specifically, the parametric component is often used to model important factors that affect the responses parametrically, whereas the nonparametric component is often used for less important or nuisance factors. Various semiparametric models for longitudinal data include the semiparametric population mean models proposed in Martinussen and Scheike [28] and Xu [29], among others, and the semiparametric mixed effects models studied by Zeger and Diggle [30], Groll and Tutz [31], and Heckman et al. [32]. For the purposes of this paper, we restrict our attention to partially linear regression models.
2.3. h-Likelihood
In longitudinal studies, there are two types of models: marginal models and conditional models. By definition, marginal models are usually referred to as population-average models, since they ignore the cluster random effects. In contrast, conditional models include random effects and are subject-specific. The main difference between marginal and conditional models is whether the regression coefficients describe an individual's response or the marginal response to changing covariates; in other words, whether changing covariates attempts to control for unobserved subject-level random effects. Diggle et al. [33] suggested the random effects model for inferences about individual responses and the marginal model for inferences about the margins.
The idea of the h-likelihood was introduced by Lee and Nelder [4]. The h-likelihood is an extension of the Fisher likelihood to GLM-type models with additional random effects in the linear predictor, and the concept serves inference about unobserved random variables. In fact, the h-likelihood is a special kind of extended likelihood in which the random effect parameter is specified to satisfy certain conditions, as we discuss in more detail later. With the idea of the h-likelihood, hierarchical generalized linear models (HGLMs) were also introduced in Lee and Nelder's [4] paper. This class of hierarchical GLMs allows various distributions for the random component, chosen to be conjugate to the distribution of the response. Four conjugate HGLMs were introduced in [4], namely, normal-normal, Poisson-gamma, binomial-beta, and gamma-inverse gamma (Table 1). The random effect enters the linear predictor linearly on some scale, and the unobserved random component is linked to that scale via a strictly monotonic function.
Consider the hierarchical model in which the response and the random component follow distributions listed in Table 1. The h-likelihood is defined as the sum of two terms: the log-likelihood of the response given the random effects and the log-likelihood of the random effects themselves. One point to note is that the h-likelihood is not a traditionally defined likelihood, since the random effects are not directly observable. In traditional maximum likelihood estimation for models with random effects, the method is based on the marginal likelihood as the objective function: the random effects are integrated out, and what remain in the maximized function are the fixed effects and the dispersion parameters. The marginal likelihood approach has two disadvantages. First, the required integration is often intractable. In addition, the random effects are not estimable after integration. In contrast, the h-likelihood approach avoids such intractable integration. In fact, as clearly stated by Lee and Nelder [4], "we can treat the h-likelihood as if it were an orthodox likelihood for the fixed effects and random effects, where the latter are regarded as fixed parameters for realized but unobservable values of the random effects." Furthermore, the h-likelihood allows a fixed effect estimator that is asymptotically as efficient as the marginal maximum likelihood estimator. Last but not least, the maximum h-likelihood estimates are derived by simultaneously solving the two score equations, for the fixed effects and for the random effects.
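For concreteness, the definition can be written out for the normal-normal case of Table 1. This is a sketch in generic notation (the symbols $X$, $Z$, $\beta$, $v$, $\phi$, $\lambda$ are ours, since the paper's own display is abbreviated here): with $y \mid v \sim N(X\beta + Zv,\ \phi I_n)$ and $v \sim N(0,\ \lambda I_q)$,

```latex
h(\beta, \phi, \lambda;\, y, v)
  = \log f(y \mid v;\, \beta, \phi) + \log f(v;\, \lambda)
  = -\frac{1}{2}\left[\, n\log(2\pi\phi)
     + \frac{\lVert y - X\beta - Zv \rVert^{2}}{\phi}
     + q\log(2\pi\lambda)
     + \frac{\lVert v \rVert^{2}}{\lambda} \right].
```

Jointly maximizing this expression over $(\beta, v)$ for fixed dispersion parameters recovers the familiar mixed-model equations in this normal case.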
An outstanding property expected of likelihood inference is invariance with respect to transformations. For maximum h-likelihood estimation, the estimates of the random effects are invariant with respect to transformations of the random components.
Furthermore, Lee and Nelder [4] mentioned the adjusted profile h-likelihood, which eliminates the nuisance effects from the h-likelihood; the additional term is often referred to as the adjustment term for this elimination. In fact, this adjusted profile h-likelihood, which is used for the estimation of the dispersion components, acts as an approximation of the marginal likelihood without integrating the random effects out.
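In generic notation (an illustrative sketch, not the paper's exact display), the adjusted profile h-likelihood takes the form

```latex
p_{v}(h) \;=\; \left[\, h \;-\; \frac{1}{2}\,
   \log \det\!\left\{ \frac{D(h, v)}{2\pi} \right\} \right]\Bigg|_{\,v = \hat v},
\qquad
D(h, v) \;=\; -\,\frac{\partial^{2} h}{\partial v\, \partial v^{\mathsf{T}}},
```

where $\hat v$ maximizes $h$ for fixed dispersion parameters. The determinant term is the adjustment: it is exactly the Laplace-approximation correction for the integral over $v$, which is why $p_v(h)$ approximates the marginal likelihood.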
Lee and Nelder's [4] publication makes several outstanding contributions. First, it widens the choice of random effect distributions in mixed generalized linear models. In addition, it introduces the h-likelihood as a device for estimation and prediction in hierarchical generalized linear models. Compared with the traditional marginal likelihood, the h-likelihood avoids the messy integration over the random effects and hence is convenient to use. Furthermore, maximum h-likelihood estimates are obtained by iteratively solving equation (14). To conclude, the h-likelihood is used for inference about the fixed and random effects given the dispersion parameters.
In a companion development, Lee and Nelder [34] demonstrated the use of an adjusted profile h-likelihood for inference about the dispersion components given the fixed and random effects. In that paper, the focus is on the joint modeling of the mean and dispersion structures. An iterative weighted least squares (IWLS) algorithm is used to estimate both the fixed and random effects via the extended likelihood, and the dispersion parameters via the adjusted profile likelihood. Later, in [35], the algorithm was modified by replacing the extended likelihood with the first-order adjusted profile likelihood for estimating the fixed effects in the mean structure.
Lee and Nelder [36] proposed a class of double hierarchical generalized linear models in which random effects can be specified for both the mean and the dispersion. Compared with HGLMs, double hierarchical generalized linear models allow heavy-tailed distributions to be present in the model. Random effects are introduced in the dispersion model to address heteroscedasticity between clusters, and the h-likelihood is applied as the synthesized inferential tool, providing statistical inference and an efficient algorithm. In addition, Lee and Noh [37] proposed a class of double hierarchical generalized linear models in which random effects can be specified for both the mean and the dispersion, allowing models with heavy-tailed distributions and providing estimation that is robust against outliers. Greenlaw and Kantabutra [38] address the parallel complexity of hierarchical clustering: instead of the traditional sequential algorithms, the top-down algorithm described in their paper is parallelized, with a corresponding bound on its computational cost.
In conclusion, for both hierarchical generalized linear models (HGLMs) and double hierarchical generalized linear models (DHGLMs), the h-likelihood plays an important role in inference for models having unobservable or unobserved random variables. Furthermore, numerical studies have shown that the h-likelihood gives statistically efficient estimates for HGLMs as well as DHGLMs. In addition, Noh and Lee [39] have shown that the h-likelihood procedure outperforms existing methods, including MCMC-type methods, in terms of bias. Last but not least, compared with the traditional marginal likelihood, the h-likelihood avoids the messy integration over the random effects and hence is convenient to use. The h-likelihood method is therefore worth attention.
3. Variable Selection via Penalized h-Likelihood
3.1. Model Setup
Suppose that we have several independent groups, each containing a number of subjects. Based on the idea of modeling the mean structure in the HGLM framework, we consider a partial linear model for the conditional mean: the canonical link of the conditional mean of the response is the sum of an unknown smooth function of an explanatory variable (taken univariate for simplicity) and a covariate vector multiplied by its associated coefficients. The model can also be written in matrix representation by stacking these terms over all observations.
We assume that the conditional random variables are drawn from an exponential family with specified mean and variance functions.
We also assume that the random components are independent. The random effects in the mean model are linked to the linear predictor through a strictly monotonic function, which allows the definition of the h-likelihood given in Lee and Nelder [4]. In this paper, the identity link is used; hence this canonical scale corresponds to the case in which the conditional distribution of the response is normal.
For simplicity, random effects are considered in the form of a random intercept throughout this paper. If a random intercept is not sufficient to represent the variation exhibited in the data, the model can easily be extended to a more general form with a more complex random effects structure.
3.2. Estimation Procedure via Penalized h-Likelihood
Thus, the log h-likelihood is the sum of the conditional log-likelihood of the response given the random effects and the log-likelihood of the random effects.
For the purposes of this paper, the first and second derivatives of the log h-likelihood with respect to the fixed effects and the random effects are derived below.
The maximum likelihood estimates of the random effects are obtained by setting the corresponding score equations to zero. An approximated likelihood for the fixed effects can then be obtained by plugging these estimates into the h-likelihood. In addition, the marginal likelihood is approximated by the adjusted profile likelihood.
The problem then arises of how to estimate the smooth function. In this paper, we use two nonparametric approaches: the local linear regression technique and the spline technique.
In the framework of penalized variable selection, we apply a penalty function, with a tuning parameter, to the approximated marginal likelihood. Our aim is to maximize the resulting penalized likelihood and obtain the maximum likelihood estimates of the fixed effects. We give brief theoretical support for how the estimates are derived in the following paragraphs.
First, the penalty functions are singular at the origin and do not have continuous second-order derivatives. However, they can be locally approximated by a quadratic function as follows. Assume that we are given an initial value close to the maximizer of the penalized likelihood. If a coefficient's initial value is very close to 0, set that coefficient to zero. Otherwise, the penalty can be locally approximated by a quadratic function in a neighborhood of the initial value. A drawback of this approximation is that once a coefficient is shrunk to zero, it stays at zero.
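The local quadratic approximation (LQA) step above can be sketched as follows. This is a generic Python illustration (the function names, the threshold `eps`, and the lasso-derivative example are our assumptions, not the paper's exact implementation): near an initial value, the penalty behaves like a ridge term whose weight is the penalty derivative divided by the coefficient magnitude.

```python
import numpy as np

def lqa_weights(beta, penalty_deriv, eps=1e-6):
    """Local quadratic approximation of a penalty p_lambda(|b|):
    near b0 != 0, p(|b|) ~ p(|b0|) + 0.5 * (p'(|b0|)/|b0|) * (b^2 - b0^2),
    so each active coefficient contributes a ridge-type weight
    p'(|b0|)/|b0|.  Coefficients within eps of zero are frozen at zero,
    reflecting the drawback noted in the text.
    """
    b = np.abs(beta)
    active = b > eps
    w = np.zeros_like(b)
    w[active] = penalty_deriv(b[active]) / b[active]
    return w, active

# example with the lasso derivative p'_lambda(t) = lambda
lam = 0.5
w, active = lqa_weights(np.array([2.0, 0.0, -0.3]),
                        lambda t: lam * np.ones_like(t))
```

With these weights, each LQA iteration reduces to a weighted ridge problem, which is why a Newton-Raphson-type update is available in closed form.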
Furthermore, note that the first two derivatives of the log h-likelihood function are continuous, so around a given point the log h-likelihood can be approximated by a second-order Taylor expansion.
Similarly, the penalized log-likelihood can be locally approximated by a quadratic function, up to a constant term, and the resulting quadratic maximization problem yields the solution iteratively.
When the algorithm converges, the estimator satisfies the penalized likelihood score condition for the nonzero elements of the coefficient vector.
As stated in Fan and Li [20], in the maximum likelihood estimation (MLE) setting, with a good initial value, the one-step procedure can be as efficient as the fully iterative procedure when the Newton-Raphson algorithm is used. Thus, if we have a good initial value, the very next iteration can be regarded as a one-step procedure, and the resulting estimator can be as efficient as the fully iterative method.
3.3. Variable Selection via the Adaptive Lasso Penalty
There are many penalized likelihood variable selection criteria available in the literature, such as the lasso and SCAD penalties. In this paper, we focus on the adaptive lasso penalty, introduced by Zou [40], in which each coefficient is penalized by its absolute value multiplied by a known weight, with a tuning parameter controlling the overall amount of shrinkage. It has been shown that if the weights are data-dependent and cleverly chosen, the weighted lasso can achieve the oracle properties; in other words, it performs as well as if the true underlying model were known in advance. This is the main reason for our choice of penalty function. In addition, the adaptive lasso is less complicated than the smoothly clipped absolute deviation (SCAD) penalty introduced by Fan and Li [20] and hence is easier to implement.
For the choice of the data-dependent weight vector, we use the hierarchical generalized linear model to obtain initial estimates of the coefficients; the weights are then specified as functions of these initial estimates.
As the sample size grows, the weights for zero-coefficient estimators tend to infinity, whereas the weights for nonzero coefficients converge to a finite constant.
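The weighting idea can be sketched as follows. This is an illustrative Python sketch under the common choice $w_j = 1/|\hat\beta_j^{\text{init}}|^{\gamma}$ from Zou's paper; here an ordinary least squares fit stands in for the HGLM-based initial estimate used in the paper, which is an assumption for demonstration.

```python
import numpy as np

def adaptive_lasso_weights(X, y, gamma=1.0, eps=1e-8):
    """Adaptive lasso weights w_j = 1 / |beta_init_j|^gamma.

    beta_init is an ordinary least squares fit, standing in here for
    the HGLM-based initial estimate described in the text; eps guards
    against division by an exactly zero initial estimate.
    """
    beta_init, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1.0 / (np.abs(beta_init) + eps) ** gamma

# usage: truly zero coefficients receive much larger weights
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
beta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0])
y = X @ beta_true + rng.normal(size=200)
w = adaptive_lasso_weights(X, y)
```

Because the initial estimates of the zero coefficients shrink toward zero as the sample grows, their weights blow up, which is exactly the behavior described in the text and the source of the oracle property.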
A significant part of our proposed method is the process of variable selection via an appropriate penalty function. As a result, the choice of the tuning parameter in the penalty function becomes important. The most popular methods for choosing such tuning parameters in the literature are K-fold cross-validation and generalized cross-validation. In fact, the selection consistency of various shrinkage methods relies on an appropriate choice of the tuning parameters, and the generalized cross-validation (GCV) method has been widely used in the literature. We therefore adopt the generalized cross-validation method for the choice of the tuning parameter. In particular, for a fitted linear method under squared error, the standard formula for the generalized cross-validation criterion applies.
We then choose the tuning parameter that minimizes the GCV criterion.
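The GCV computation can be sketched as follows. This is a generic Python illustration: a ridge-type penalty stands in for the quadratic LQA approximation of the adaptive lasso (an assumption), so that the fit is a linear smoother with an explicit hat matrix.

```python
import numpy as np

def gcv_score(X, y, lam):
    """Generalized cross-validation for a ridge-type linear smoother.

    GCV(lam) = (1/n) * ||y - y_hat||^2 / (1 - df/n)^2,
    where df = trace of the hat matrix S = X (X'X + lam I)^{-1} X'.
    """
    n, p = X.shape
    A = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # (X'X + lam I)^{-1} X'
    y_hat = X @ (A @ y)
    df = np.trace(X @ A)                                   # effective df
    rss = np.sum((y - y_hat) ** 2)
    return rss / n / (1.0 - df / n) ** 2

# usage: pick the tuning parameter minimizing GCV over a grid
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=100)
lams = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = [gcv_score(X, y, l) for l in lams]
best = lams[int(np.argmin(scores))]
```

The denominator penalizes effective degrees of freedom, so GCV trades off goodness of fit against model complexity without refitting on held-out folds.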
3.4. Computational Algorithm
We propose the following h-likelihood algorithm (Algorithm 1) for implementing the method discussed in this paper.

The computational cost of the proposed penalized h-likelihood algorithm grows with the sample size and the number of associated coefficients in equation (16). The efficient path algorithm makes the proposed penalized h-likelihood algorithm an attractive method for real applications. In particular, if we have a good initial value, the very next iteration can be regarded as a one-step procedure, and the resulting estimator can be as efficient as the fully iterative method.
3.5. Simulation Studies
To assess the finite-sample performance of our proposed method, we conduct several simulation studies, all implemented in R. The models take the partial linear mixed effects form introduced in Section 3.1, with normal random intercepts and normal errors; the distribution of the response conditional on the random components is likewise assumed normal. To form the covariates for the model, we draw random samples from a multivariate normal distribution whose covariance matrix has an AR(1) structure. The correlation parameter is held fixed, since trying several values showed that it has little impact on the resulting penalized estimates. Furthermore, the covariate of the nonparametric component is simulated from a uniform [0, 1] distribution. We conduct the simulation studies through several examples; for each case, we run the study over 100 simulated datasets.
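A data-generating step of this kind can be sketched as follows. This is an illustrative Python sketch (the paper's simulations are in R); the particular coefficient values, variances, correlation parameter, and the choice of sine function for the nonparametric part are our assumptions, not the paper's exact settings.

```python
import numpy as np

def simulate_cluster_data(n_groups=100, n_per=10, p=8, rho=0.5,
                          sigma_v=1.0, sigma_e=1.0, seed=0):
    """Simulate balanced clustered data
        y_ij = x_ij' beta + f(t_ij) + v_i + e_ij,
    with AR(1)-correlated normal covariates, a random intercept v_i per
    group, and an illustrative nonparametric part f(t) = sin(2*pi*t).
    """
    rng = np.random.default_rng(seed)
    beta = np.r_[3.0, 1.5, 2.0, np.zeros(p - 3)]    # sparse true coefficients
    n = n_groups * n_per
    # AR(1) covariance: cov[i, j] = rho^|i - j|
    cov = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    t = rng.uniform(0, 1, n)                         # uniform [0, 1] covariate
    v = np.repeat(rng.normal(0, sigma_v, n_groups), n_per)  # random intercepts
    y = X @ beta + np.sin(2 * np.pi * t) + v + rng.normal(0, sigma_e, n)
    group = np.repeat(np.arange(n_groups), n_per)
    return X, t, y, group

X, t, y, group = simulate_cluster_data()
```

Repeating this generator 100 times with different seeds reproduces the Monte Carlo structure described above, with the balanced 100-cluster design of the examples below.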
Furthermore, for the nonparametric part of the model, we use three different functions for simulation purposes. The first two are nonlinear, increasing functions, whereas the third is a nonlinear, nonmonotonic function.
In order to examine the finite sample performance of our proposed method, we run simulations based on the following six examples.
Example 1. We generate a balanced dataset such that there are 10 subjects within each of 100 groups. In other words, we have 100 clusters and 10 subjects within each cluster, denoted by  and . The size of the true model is , with the true values of the parameters set to be . In addition to the linear component, the nonparametric component is .
Example 2. Similar to Example 1 but with a reduced number of within-cluster subjects. We generate a balanced dataset such that there are 5 subjects within each of 100 groups. In other words, we have 100 clusters and 5 subjects within each cluster, denoted by  and . The size of the true model is , with the true values of the parameters set to be . In addition to the linear component, the nonparametric component is .
Example 3. We generate a balanced dataset such that there are 10 subjects within each of 100 groups. In other words, we have 100 clusters and 10 subjects within each cluster, denoted by  and . The size of the true model is , with the true values of the parameters set to be . In addition to the linear component, the nonparametric component is .
Example 4. Similar to Example 3 but with a reduced number of within-cluster subjects. We generate a balanced dataset such that there are 5 subjects within each of 100 groups. In other words, we have 100 clusters and 5 subjects within each cluster, denoted by  and . The size of the true model is , with the true values of the parameters set to be . In addition to the linear component, the nonparametric component is .
Example 5. We generate a balanced dataset such that there are 10 subjects within each of 100 groups. In other words, we have 100 clusters and 10 subjects within each cluster, denoted by  and . The size of the true model is , with the true values of the parameters set to be . In addition to the linear component, the nonparametric component is .
Example 6. Similar to Example 5 but with a reduced number of within-cluster subjects. We generate a balanced dataset such that there are 5 subjects within each of 100 groups. In other words, we have 100 clusters and 5 subjects within each cluster, denoted by  and . The size of the true model is , with the true values of the parameters set to be . In addition to the linear component, the nonparametric component is .
We simulate each random effect from a normal distribution with mean 0 and variance . Moreover, we simulate  from a uniform distribution on . Then, we obtain the smoothing function by plugging in the values of . Once we have the random effects and the nonparametric part , we can simulate the response by computing its mean and variance through the model. In this case, , where  and .
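Putting the pieces together, the data-generating process for one such example can be sketched as follows. The coefficient vector, variance components, and smooth function below are illustrative stand-ins chosen by us, not the exact values of the examples above; the cluster layout matches Example 1 (100 clusters, 10 subjects each).

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_i = 100, 10                         # clusters and within-cluster subjects
beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])  # illustrative sparse truth
sigma_v, sigma_e = 1.0, 0.5              # random-effect and error SDs (assumed)
f = lambda t: np.sin(2 * np.pi * t)      # illustrative smooth function

def ar1_cov(p, rho):
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

N = m * n_i
X = rng.multivariate_normal(np.zeros(len(beta)), ar1_cov(len(beta), 0.5), size=N)
t = rng.uniform(size=N)
cluster = np.repeat(np.arange(m), n_i)   # cluster label for each observation
v = rng.normal(0.0, sigma_v, size=m)     # one random intercept per cluster
# Response: linear part + smooth part + shared cluster effect + noise
y = X @ beta + f(t) + v[cluster] + rng.normal(0.0, sigma_e, size=N)
```

Because all observations in a cluster share the same draw v[cluster], the within-cluster correlation that motivates the random-effect term is built into the simulated response.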
By default, we estimate the unknown smooth function by two methods: the local linear kernel smoothing method and the cubic spline smoothing method. We denote the estimates corresponding to these two methods by PHKernel and PHSpline, respectively. In addition, we also compute the cubic spline smoothing method without the penalty term, i.e., , and denote this algorithm by HSpline. However, due to the computational complexity of the local linear kernel smoothing method, we only compare the local linear kernel and cubic spline smoothing methods for Examples 1 and 2. For the remaining four examples, we only run the simulations for HSpline and PHSpline.
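For the spline-based smoother, a cubic smoothing spline fit takes only a few lines. The sketch below uses Python's scipy purely for illustration (the simulations themselves use R); the signal, noise level, and smoothing parameter are arbitrary choices of ours.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(7)
t = np.sort(rng.uniform(size=200))                  # sorted index variable
y = np.sin(2 * np.pi * t) + rng.normal(scale=0.2, size=200)  # noisy illustrative signal

# Cubic (k=3) smoothing spline; s controls the residual sum-of-squares budget
spline = UnivariateSpline(t, y, k=3, s=len(t) * 0.04)
f_hat = spline(t)                                   # smoothed estimate of the function
```

The smoothing parameter s plays the same role as the bandwidth in the local linear kernel method: larger values give smoother, more biased fits.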
Before we report the simulation performance of our proposed penalty-based procedure, we introduce several terms that will be listed in the summary tables. First of all, let percentage of correctly fitted and percentage of overfitted be the proportions of selected models that are correctly fitted and overfitted, respectively. In the case of overfitting, the columns "1," "2," and "" represent the proportions of selected models including one, two, and more than two irrelevant predictors, respectively.
Furthermore, to characterize the capability of a method in producing sparse solutions, we define . To characterize the method's underfitting effect, we further define . Table 2 presents a detailed summary of variable selection accuracy for all six examples provided above. Several key findings can be observed from Table 2. First of all, none of the six examples exhibits an underfitting problem, which means all the relevant predictors are discovered by the PHSpline method. Equivalently, the zeros in the percent of incorrect zeros column further confirm this statement.
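These two quantities can be computed as in the following sketch; the function names are ours, and the percentages are taken over the truly zero (respectively, truly nonzero) coefficients, one common way such sparsity metrics are defined.

```python
import numpy as np

def percent_correct_zeros(beta_hat, beta_true):
    """Share of truly zero coefficients that the method also estimates as zero."""
    true_zero = (np.asarray(beta_true) == 0)
    est_zero = (np.asarray(beta_hat) == 0)
    return 100.0 * np.mean(est_zero[true_zero])

def percent_incorrect_zeros(beta_hat, beta_true):
    """Share of truly nonzero coefficients wrongly shrunk to zero (underfitting)."""
    true_nonzero = (np.asarray(beta_true) != 0)
    est_zero = (np.asarray(beta_hat) == 0)
    return 100.0 * np.mean(est_zero[true_nonzero])

# Illustrative check: two of three true zeros recovered, no relevant predictor lost
beta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0])
beta_hat  = np.array([2.8, 1.4, 0.0, 0.1, 1.9, 0.0])
pcz = percent_correct_zeros(beta_hat, beta_true)
piz = percent_incorrect_zeros(beta_hat, beta_true)
```

A method with no underfitting problem, as reported for all six examples, has a percent of incorrect zeros equal to 0.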
Furthermore, our proposed PHSpline method performs well in terms of variable selection consistency for Example 1, with a 100% correct fit. Similarly, the simulation results of our proposed PHSpline method for Example 2 show a 94% correct fit, a 2% overfit with 1 irrelevant predictor included, and a 2% overfit with 2 irrelevant predictors included. The overall variable selection consistency for Example 2 is good, with 98.8% correct zeros.
Third, when we have a sparser representation of the fixed effects with smaller magnitudes, our proposed PHSpline method tends to give somewhat more conservative results than in Examples 1 and 2 in terms of variable selection accuracy. In particular, the simulation results of our proposed PHSpline method for Example 3 show a 74% correct fit, a 14% overfit with 1 irrelevant predictor included, an 8% overfit with 2 irrelevant predictors included, and a 4% overfit with more than 2 irrelevant predictors included. In fact, the overall variable selection consistency for Example 3 is good, with 93.3% correct zeros. In contrast, when the number of within-cluster subjects decreases from 10 to 5 in Example 4, the percent of correct zeros decreases to 84.4%, meaning that more irrelevant predictors are included in the model.
Last but not least, similar trends can be observed for Examples 5 and 6 compared to Examples 3 and 4. Example 5 returns a 71% correct fit, a 20% overfit with 1 irrelevant predictor included, a 2% overfit with 2 irrelevant predictors included, and a 7% overfit with more than 2 irrelevant predictors included. In contrast, Example 6 returns a 64% correct fit, a 21% overfit with 1 irrelevant predictor included, a 6% overfit with 2 irrelevant predictors included, and a 9% overfit with more than 2 irrelevant predictors included. As a result, the 71% correct fit for Example 5 outperforms the 64% correct fit for Example 6 in terms of variable selection consistency. Hence, generally speaking, our proposed PHSpline method works better when the number of within-cluster subjects increases.
Besides the variable selection accuracy summarized in Table 2, the prediction accuracy of the fixed effects for the various examples is also of interest. In the following paragraphs, results on the prediction accuracy of the fixed effects are discussed and interpreted, based on Tables 3-8.
Table 3 summarizes the simulation results over 100 replications for Example 1. As we can see, both PHKernel and PHSpline can recover the relevant predictors accurately. In addition, the fixed effect estimates from both PHKernel and PHSpline are comparable, differing very little from the true values of . However, in terms of speed, the PHSpline method is much faster than the PHKernel method and hence easier to implement in practice. In contrast, the HSpline method returns the h-likelihood estimates of the fixed effects without the penalty term. As we can observe from Table 3, the HSpline method gives nonzero estimates for all the , resulting in poor variable selection performance compared with PHSpline, which involves a penalty term. Furthermore, PHSpline estimates tend to have relatively smaller standard deviations than the HSpline estimates. Therefore, the PHSpline method outperforms the other two methods in both variable selection accuracy and implementation speed.
Simulation results over 100 replications for Example 2 are summarized in Table 4. Example 2 has a smaller number of within-cluster subjects than Example 1. In fact, similar to the results obtained in Example 1, both the PHKernel and PHSpline methods return relatively good estimates of the fixed effects in terms of variable selection accuracy and prediction accuracy. In particular, both the PHKernel and PHSpline methods wrongly select one irrelevant covariate. In addition, the fixed effect estimates from both methods are comparable, differing very little from the true values of . In contrast, as we can observe from Table 4, the HSpline method gives nonzero estimates for all the , resulting in poor variable selection performance compared with PHSpline. Furthermore, PHSpline estimates tend to have relatively smaller standard deviations than the HSpline estimates. In fact, it is not surprising that both the PHKernel and PHSpline methods include  as a relevant predictor in the model. Equivalently, both methods return a nonzero . The reason is that we have an AR(1) model, which means there is a correlation of  between  and .
As we compare the simulation results of Examples 1 and 2, our proposed PHSpline method tends to perform better when the number of within-cluster subjects increases. A similar conclusion can be drawn for the PHKernel method. Furthermore, both the PHKernel and PHSpline methods work well when the nonparametric component is .
Tables 5 and 6 present simulation results over 100 replications for Examples 3 and 4. In these two examples, we have a sparser representation of the fixed effects than in Examples 1 and 2, and the magnitudes of the fixed effects are set to be smaller. For both examples, the PHSpline method outperforms the HSpline method in variable selection in two ways. First, the PHSpline method identifies some of the irrelevant predictors accurately, whereas the HSpline method gives nonzero estimates for all the . Although PHSpline cannot guarantee 100% selection accuracy, it does improve on the poor variable selection performance of HSpline by adding a penalty term. Furthermore, PHSpline estimates tend to have relatively smaller standard deviations than the HSpline estimates. Therefore, the PHSpline method performs better than the HSpline method, even in the sparse fixed effects situation.
Similarly, simulation results over 100 replications for Examples 5 and 6 are presented in Tables 7 and 8. Again, we have a sparser representation of the fixed effects than in Examples 1 and 2, with smaller magnitudes of the fixed effects . The PHSpline method works well in terms of variable selection for Example 5, even though it does not guarantee 100% selection accuracy. In contrast, the PHSpline method returns nonzero estimates for all the , resulting in poor variable selection performance for Example 6, where the number of within-cluster subjects is reduced to 5.
Overall, the simulation results show that our proposed penalized h-likelihood approach performs well in terms of variable selection accuracy because of its ability to recover the true zeros, especially when the number of within-cluster subjects is not too small. Generally, our proposed PHSpline method works better as the number of within-cluster subjects increases. In addition, even when the true model is sparse, our penalized estimator still does no worse than the h-likelihood estimator in terms of estimation accuracy.
4. Conclusion
To conclude, we have introduced a new penalized h-likelihood approach to identify the nonzero relevant fixed effects in the partial linear model setting. This penalized h-likelihood incorporates variable selection procedures into mean modeling via the h-likelihood. A few advantages of this newly proposed method are as follows. First, compared to the traditional marginal likelihood, the h-likelihood avoids the messy integration over the random effects and is hence convenient to use. In addition, the h-likelihood plays an important role in inference for models with unobserved random variables. Finally, the simulation studies have demonstrated that the proposed penalty-based method is able to identify zero regression coefficients in modeling the mean structure and produces good fixed effect estimation results.
As for future research, it would be interesting to extend the proposed penalized h-likelihood approach to more complicated settings for partial linear models. The model in this paper assumes a simple one-component structure for the random effects, in which only a random intercept is considered. For possible future research, we may consider a partial linear model for the conditional mean with more than one random effect, i.e., an extended multi-component random effects model. Other future work, including variance component estimation for the random effects and the study of the penalized h-likelihood estimator's theoretical and asymptotic properties, such as its convergence rate, remains to be investigated.
Data Availability
This is a theoretical study, and we do not have experimental data.
Disclosure
This work is based on the first author's Ph.D. thesis, written in 2013 [41].
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was funded by the Humanities and Social Sciences Project of the Ministry of Education (Grant no. 17YJCZH199). The authors gratefully acknowledge the Ministry of Education for the technical and financial support.