Abstract
Reinforcement learning is one of the paradigms and methodologies of machine learning developed in the computational intelligence community. Recently, reinforcement learning algorithms have faced major challenges in systems with complex dynamics. From the perspective of variable selection, we often encounter situations in which too many variables are included in the full model at the initial stage of modeling. Because of the high-dimensional and intractable integral associated with longitudinal data, likelihood inference is computationally challenging: computationally intensive methods may converge very slowly or even fail to converge. Recently, the hierarchical likelihood (h-likelihood) has come to play an important role in inference for models having unobservable or unobserved random variables. This paper focuses on linear models with random effects in the mean structure and proposes a penalized h-likelihood algorithm that incorporates variable selection procedures into mean modeling via the h-likelihood. The penalized h-likelihood method avoids the messy integration over the random effects and is computationally efficient. Furthermore, it demonstrates good performance in selecting relevant variables. Through theoretical analysis and simulations, we confirm that the penalized h-likelihood algorithm produces good fixed effect estimates and can identify zero regression coefficients when modeling the mean structure.
1. Introduction
Reinforcement learning is specified as trial and error (variation, selection, and search) plus learning (association and memory) in Sutton and Barto [1]. Traditional variable selection procedures, such as the LASSO in Tibshirani [2] and OMP in Cai and Wang [3], consider only the fixed effect estimates in linear models. However, in real life, much existing data involves both fixed effects and random effects. For example, in clinical trials, several observations are taken over a period of time for each particular patient. After the data have been collected for all patients, it is natural to include a random effect for each individual patient in the model, since a common error term for all observations is not sufficient to capture the individual randomness. Moreover, the random effects, which are not directly observable, are of interest in themselves when inference focuses on each individual's response. Therefore, to accommodate random effects and obtain good estimates, Lee and Nelder [4] proposed hierarchical generalized linear models (HGLMs). HGLMs are based on the idea of the h-likelihood, a generalization of the classical likelihood that accommodates the random components entering the model. It is preferable because it avoids the integration required for the marginal likelihood and uses the conditional distribution instead.
Inspired by the ideas of reinforcement learning and hierarchical models, this paper proposes a method that adds a penalty term to the h-likelihood. This method considers not only the fixed effects but also the random effects in the linear model, and it produces good estimation results with the ability to identify zero regression coefficients in joint models of mean-covariance structures for high-dimensional multilevel data.
The rest of this paper is organized as follows: Section 2 provides a literature review of current variable selection methods based on partial linear models and the h-likelihood. Section 3 describes a penalty-based h-likelihood variable selection algorithm and demonstrates via simulation that the proposed algorithm exhibits the desired sample properties and can be useful in practical applications. Finally, Section 4 concludes the paper and outlines some future research directions.
2. Literature Review
2.1. Reinforcement Learning from the Perspective of Nonlinear Systems
Reinforcement learning, one of the most active research areas in artificial intelligence, is defined in Sutton and Barto [1] as a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives while interacting with a complex, uncertain environment. In addition, in Sutton and Barto [5], reinforcement learning is specified as trial and error (variation, selection, and search) plus learning (association and memory). Furthermore, Barto and Mahadevan [6] propose hierarchical control architectures and associated learning algorithms; approaches to temporal abstraction and hierarchical organization, which rely mainly on the theory of semi-Markov decision processes, are reviewed and discussed in their paper. Recent works, such as Dietterich [7], have focused on hierarchical methods that incorporate subroutines and state abstractions instead of solving "flat" problem spaces.
Nonlinear control design has attracted attention in the research community for a long time. In industrial settings, the controlled system usually exhibits strong nonlinearity. Various adaptive optimal control models have been applied to the identification of nonlinear systems in the literature. In fact, the two fundamental principles of controller design are optimality and veracity. He et al. [8] study a novel policy iteration scheme for the design of online optimal control laws for a class of nonlinear systems and establish the convergence of the scheme to the optimal control law. He et al. [9] investigate an online adaptive optimal control problem for a class of continuous-time Markov jump linear systems (MJLSs) using a parallel reinforcement learning (RL) algorithm with completely unknown dynamics; a novel parallel RL algorithm is proposed, and its convergence is shown. Wang et al. [10] study a new online adaptive optimal controller design scheme for a class of nonlinear systems with input time delays; an online policy iteration algorithm is proposed, and the effectiveness of the method is verified. He et al. [11] propose an online adaptive optimal controller design for a class of nonlinear systems through a novel policy iteration (PI) algorithm. Cheng et al. [12] investigate the observer-based asynchronous fault detection problem for a class of nonlinear Markov jump systems and introduce a hidden Markov model to ensure that the observer modes run synchronously with the system modes. Cheng et al. [13] propose a finite-time asynchronous output feedback control scheme for a class of Markov jump systems subject to external disturbances and nonlinearities.
2.2. Partial Linear Models
Linear models have been widely used in the literature. One extension of linear models, introduced by Nelder and Wedderburn [14], is the class of generalized linear models (GLMs). GLMs expand the class of response distributions from the normal distribution to the one-parameter exponential families. In addition, GLMs generalize linear regression in two ways: first, they relate the linear predictor to the response variable via a link function, or equivalently a monotonic transform of the mean, rather than the mean itself; second, they allow the variance of each measurement to be a function of its predicted value.
In contrast, Laird and Ware [15] propose linear mixed effects models (LMEs), which are widely used in the analysis of longitudinal and repeated measurement data. Linear mixed effects models have gained popularity because they account for within-cluster and between-cluster variation simultaneously. Vonesh and Chinchilli [16] have investigated and applied statistical estimation and inference for this class of LME models. However, the model selection problem in LME models was long ignored. This neglected problem was pointed out by Vaida and Blanchard [17], who noted that when the focus is on clusters instead of the population, traditional selection criteria such as AIC and BIC are not appropriate. Vaida and Blanchard [17] propose the conditional AIC for mixed effects models, with a detailed discussion of how to define degrees of freedom in the presence of random effects. Furthermore, Pu and Niu [18] study the asymptotic behavior of a generalized information criterion for selecting fixed effects. In addition, Rajaram and Castellani [19] use ordinary differential equations and linear advection partial differential equations (PDEs) to introduce a case-based density approach to modeling big data longitudinally.
Recently, Fan and Li [20] developed a class of variable selection procedures for both fixed effects and random effects in linear mixed effects models by incorporating the penalized profile likelihood method. With this regularization method, both fixed effects and random effects can be selected and estimated. Two aspects of Fan and Li's [20] method stand out. First, the proposed procedures can estimate the fixed effects and random effects separately; in other words, the fixed effects can be estimated without estimating the random effects, and vice versa. In addition, the method works in the high-dimensional setting by allowing the dimension of the random effects to grow exponentially with the sample size.
Combining the ideas of generalized linear models (GLMs) and linear mixed effects (LME) models leads to a further extension, generalized linear mixed models (GLMMs). Traditional GLMs assume that the observations are uncorrelated. To relax this restrictive assumption, GLMMs allow for correlation between observations, which often arises in longitudinal data and clustered designs. The advantages of GLMMs are as follows: first, GLMMs allow random effects to be included in the linear predictor, so that correlations between observations can be explained through an explicit probability model; second, when the focus is on estimating the effects for a particular individual, GLMMs provide good subject-specific parameter estimates. However, GLMMs, which are also called multilevel models, are generally more computationally intensive to fit.
The GLMs and GLMMs discussed so far are well-established parametric regression models. A serious disadvantage of parametric modeling is that a parametric model may be too restrictive in some applications. To overcome this difficulty, nonparametric regression has gained popularity in the literature. There are many nonparametric and smoothing methods, such as kernel smoothing, local polynomial fitting, and penalized splines. In this section, two often-used smoothing methods for estimating a nonparametric model are described, since they are used later in simulations and applications.
The first type is called local linear kernel smoothing. The main idea of local linear kernel smoothing is to locally approximate the function linearly. Local linear kernel smoothing uses Taylor expansion as a fundamental tool. In particular, Taylor expansion states that any smooth function can be locally approximated by a polynomial of some degree.
Suppose we have a simple nonparametric model in which the response is a smooth function of a scalar covariate plus noise. Let an arbitrary fixed point be the location at which the function is to be estimated, and assume the function has a continuous first-order derivative there. Then, by Taylor expansion, the function can be locally approximated by a linear function, involving its value and its first derivative at that point, in a neighborhood of the point where the expansion is valid.
The local linear smoother is obtained by fitting the data locally with a linear function so as to minimize a weighted least squares criterion, in which the weights are obtained by rescaling a kernel function with a positive constant bandwidth. The primary role of the bandwidth is to specify the size of the local neighborhood in which the local fitting is conducted. Moreover, the kernel function determines how observations within the neighborhood contribute to the fit at the point of interest. A detailed introduction to kernel functions is provided in the later paragraphs.
The fitted value of the local linear smoother at the point of interest can then be expressed in closed form as a weighted linear combination of the observed responses.
A local linear smoother is often good enough for most problems if the kernel function and the bandwidth are adequately determined. Moreover, it enjoys many good properties that the other linear smoothers may lack. Fan [21], Fan and Gijbels [22], and Hastie and Loader [23] separately discussed those good properties in detail.
The kernel function used in the local linear smoother is a symmetric probability density function. The kernel specifies how the observations contribute to the local linear kernel fit at the point of interest, whereas the bandwidth specifies the size of the local neighborhood. Several widely used kernel functions include the following: (i) the uniform kernel; (ii) the Epanechnikov kernel; (iii) the biweight kernel; (iv) the Gaussian kernel.
Suppose, for instance, that the uniform kernel is used. Then all observations within the neighborhood contribute equally (that is, with identical weights) to the local linear kernel fit, whereas all observations outside the neighborhood contribute nothing. Suppose, for another example, that the Gaussian kernel is used. The contribution of each observation is then determined by its distance from the fitting point: a smaller distance results in a larger contribution, since the Gaussian kernel is a bell-shaped curve that peaks at the origin.
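The local linear fitting procedure described above can be sketched in a few lines. This is an illustrative Python sketch under generic names (the paper's own experiments use R); the Epanechnikov kernel and the test function are assumptions for demonstration.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero outside."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def local_linear_fit(x, y, x0, h, kernel=epanechnikov):
    """Local linear smoother at point x0 with bandwidth h.

    Minimizes the kernel-weighted least squares criterion
    sum_i K_h(x_i - x0) * (y_i - a - b*(x_i - x0))^2
    and returns the local intercept a, the fitted value at x0.
    """
    w = kernel((x - x0) / h) / h                      # rescaled kernel weights
    X = np.column_stack([np.ones_like(x), x - x0])    # local linear design
    WX = X * w[:, None]
    # weighted normal equations: (X'WX) beta = X'Wy
    beta = np.linalg.solve(X.T @ WX, WX.T @ y)
    return beta[0]

# usage: smooth noisy observations of a smooth function
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 200)
fit = np.array([local_linear_fit(x, y, x0, h=0.1) for x0 in x])
```

The local intercept is the fitted value because the design is centered at the point of interest, which also gives local linear smoothing its good boundary behavior.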
The second type of smoothing is called regression spline smoothing. In local linear kernel smoothing, introduced above, local neighborhoods are defined by a bandwidth and a fixed point. In contrast, in regression spline smoothing, local neighborhoods are defined by a group of locations in an interval, known as knots; these are referred to as interior knots or simple knots. The interval is divided into local neighborhoods by these knots, and within any two neighboring knots, a Taylor expansion up to some degree is applicable.
A regression spline can be constructed in terms of a truncated power basis. As mentioned earlier, given the knots, the truncated power basis of a given degree consists of the polynomial terms up to that degree together with that power of the positive part of the difference between the variable and each knot. In most of the literature, this is called the constant, linear, quadratic, or cubic truncated power basis when the degree is 0, 1, 2, or 3, correspondingly. For the purposes of this paper, the cubic truncated power basis is used in the subsequent simulations and applications.
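The truncated power basis just described can be constructed as follows; this is a generic Python sketch (function and variable names are ours), with the cubic case used in the paper as the default.

```python
import numpy as np

def truncated_power_basis(x, knots, degree=3):
    """Truncated power basis of a given degree with the given knots:
    columns 1, x, ..., x^p followed by (x - k)_+^p for each knot k,
    where (.)_+ denotes the positive part.
    """
    x = np.asarray(x, dtype=float)
    poly = np.vstack([x**d for d in range(degree + 1)]).T        # 1, x, ..., x^p
    trunc = np.vstack([np.maximum(x - k, 0.0)**degree for k in knots]).T
    return np.hstack([poly, trunc])

# usage: cubic basis with three interior knots on [0, 1]
x = np.linspace(0, 1, 101)
B = truncated_power_basis(x, knots=[0.25, 0.5, 0.75], degree=3)
# B has (degree + 1) + K = 4 + 3 = 7 columns
```

A regression spline fit is then an ordinary linear regression of the response on the columns of this basis matrix.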
We again consider the simple nonparametric model mentioned above. It is conventional to collect the truncated power basis functions into a vector, whose length is the number of basis functions involved. The regression fit of the function in the nonparametric model can then be expressed as a linear combination of these basis functions.
To sum up, parametric models are very useful for longitudinal data analysis since they provide a clear and simple description of the relationship between the response variable and its covariates. However, in many analyses a parametric model does not fit the data well, resulting in biased estimates. To overcome the restrictive assumptions of parametric forms, various nonparametric models, such as nonparametric mixed effects models, have been proposed for longitudinal data; see, for example, Fan and Zhang [24] and Wu and Rice [25], among others. There is always a trade-off between model assumptions and model complexity. Parametric models are less robust against violations of model assumptions, but they are efficient when the models are correctly specified. In contrast, nonparametric models are more robust against model assumptions, but they are less efficient and more complex. A trade-off between efficiency and complexity via an information measure is fully investigated and discussed in Caves and Schack [26]. Zhang et al. [27] propose an improved K-means clustering algorithm, called the covering K-means algorithm (CK-means). The CK-means algorithm has two advantages: first, it produces efficient and accurate clustering results under both sequential and parallel conditions; furthermore, it self-adaptively provides a reasonable number of clusters based on the data features.
Semiparametric models arise from the need to compromise between, and retain the good features of, parametric and nonparametric models. A semiparametric model has two essential components: a parametric component and a nonparametric component. More specifically, the parametric component is often used to model important factors that affect the responses parametrically, whereas the nonparametric component is often used for less important or nuisance factors. Various semiparametric models for longitudinal data include the semiparametric population mean models proposed in Martinussen and Scheike [28] and Xu [29], among others, and the semiparametric mixed effects models studied by Zeger and Diggle [30], Groll and Tutz [31], and Heckman et al. [32]. For the purposes of this paper, we restrict our attention to partially linear regression models.
2.3. h-Likelihood
In longitudinal studies, there are two types of models: marginal models and conditional models. By definition, marginal models are usually referred to as population-average models, since they ignore the cluster random effects. In contrast, conditional models include random effects and are subject-specific. The main difference between marginal and conditional models is whether the regression coefficients describe an individual's response or the marginal response to changing covariates; in other words, whether changing covariates attempts to control for unobserved subject-level random effects. Diggle et al. [33] suggested the random effects model for inferences about individual responses and the marginal model for inferences about the margins.
The idea of the h-likelihood was introduced by Lee and Nelder [4]. The h-likelihood is an extension of the Fisher likelihood to GLM-type models with additional random effects in the linear predictor, and the concept serves inference about unobserved random variables. In fact, the h-likelihood is a special kind of extended likelihood in which the random effect parameter is specified to satisfy certain conditions, as we discuss in more detail later. With the idea of the h-likelihood, hierarchical generalized linear models (HGLMs) were also introduced in Lee and Nelder's [4] paper. This class of hierarchical GLMs allows various distributions for the random component, chosen to be conjugate to the distribution of the response. Four conjugate HGLMs were introduced in [4], namely, normal-normal, Poisson-gamma, binomial-beta, and gamma-inverse gamma (Table 1). The random effect enters the linear predictor linearly on some scale, and the unobserved random component is linked to that scale via a strictly monotonic function.
Consider the hierarchical model in which the response and the random component follow distributions listed in Table 1. The h-likelihood is defined as the sum of two terms: the log-likelihood of the response given the random effects and the log-likelihood of the random effects themselves. One point to note is that the h-likelihood is not a traditionally defined likelihood, since the random effects are not directly observable. In traditional maximum likelihood estimation for models with random effects, the method is based on the marginal likelihood as the objective function: the random effects are integrated out, and what remain in the maximized function are the fixed effects and the dispersion parameters. The marginal likelihood approach has two disadvantages. First, the required integration is often intractable. In addition, the random effects are not estimable after integration. In contrast, the h-likelihood approach avoids such intractable integration. In fact, as clearly stated by Lee and Nelder [4], "we can treat the h-likelihood as if it were an orthodox likelihood for the fixed effects and random effects, where the latter are regarded as fixed parameters for realized but unobservable values of the random effects." Furthermore, the h-likelihood allows a fixed effect estimator that is asymptotically as efficient as the marginal maximum likelihood estimator. Last but not least, the maximum h-likelihood estimates are derived by simultaneously solving the two score equations, for the fixed effects and for the random effects.
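For concreteness, the definition can be written out for the normal-normal case of Table 1. This is a sketch in generic notation (the symbols $X$, $Z$, $\beta$, $v$, $\phi$, $\lambda$ are ours, since the paper's own display is abbreviated here): with $y \mid v \sim N(X\beta + Zv,\ \phi I_n)$ and $v \sim N(0,\ \lambda I_q)$,

```latex
h(\beta, \phi, \lambda;\, y, v)
  = \log f(y \mid v;\, \beta, \phi) + \log f(v;\, \lambda)
  = -\frac{1}{2}\left[\, n\log(2\pi\phi)
     + \frac{\lVert y - X\beta - Zv \rVert^{2}}{\phi}
     + q\log(2\pi\lambda)
     + \frac{\lVert v \rVert^{2}}{\lambda} \right].
```

Jointly maximizing this expression over $(\beta, v)$ for fixed dispersion parameters recovers the familiar mixed-model equations in this normal case.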
An outstanding property expected of likelihood inference is invariance with respect to transformations. For maximum h-likelihood estimation, the estimates of the random effects are invariant with respect to transformations of the random components.
Furthermore, Lee and Nelder [4] mentioned the adjusted profile h-likelihood, which eliminates the nuisance effects from the h-likelihood; the additional term is often referred to as the adjustment term for this elimination. In fact, this adjusted profile h-likelihood, which is used for the estimation of the dispersion components, acts as an approximation of the marginal likelihood without integrating the random effects out.
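In generic notation (an illustrative sketch, not the paper's exact display), the adjusted profile h-likelihood takes the form

```latex
p_{v}(h) \;=\; \left[\, h \;-\; \frac{1}{2}\,
   \log \det\!\left\{ \frac{D(h, v)}{2\pi} \right\} \right]\Bigg|_{\,v = \hat v},
\qquad
D(h, v) \;=\; -\,\frac{\partial^{2} h}{\partial v\, \partial v^{\mathsf{T}}},
```

where $\hat v$ maximizes $h$ for fixed dispersion parameters. The determinant term is the adjustment: it is exactly the Laplace-approximation correction for the integral over $v$, which is why $p_v(h)$ approximates the marginal likelihood.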
Lee and Nelder's [4] publication makes several outstanding contributions. First, it widens the choice of random effect distributions in mixed generalized linear models. In addition, it introduces the h-likelihood as a device for estimation and prediction in hierarchical generalized linear models. Compared with the traditional marginal likelihood, the h-likelihood avoids the messy integration over the random effects and hence is convenient to use. Furthermore, maximum h-likelihood estimates are obtained by iteratively solving equation (14). To conclude, the h-likelihood is used for inference about the fixed and random effects given the dispersion parameters.
In a companion development, Lee and Nelder [34] demonstrated the use of an adjusted profile h-likelihood for inference about the dispersion components given the fixed and random effects. In that paper, the focus is on the joint modeling of the mean and dispersion structures. An iterative weighted least squares (IWLS) algorithm is used to estimate both the fixed and random effects via the extended likelihood, and the dispersion parameters via the adjusted profile likelihood. Later, in [35], the algorithm was modified by replacing the extended likelihood with the first-order adjusted profile likelihood for estimating the fixed effects in the mean structure.
Lee and Nelder [36] proposed a class of double hierarchical generalized linear models in which random effects can be specified for both the mean and the dispersion. Compared with HGLMs, double hierarchical generalized linear models allow heavy-tailed distributions to be present in the model. Random effects are introduced in the dispersion model to address heteroscedasticity between clusters, and the h-likelihood is applied as the synthesized inferential tool, providing statistical inference and an efficient algorithm. In addition, Lee and Noh [37] proposed a class of double hierarchical generalized linear models in which random effects can be specified for both the mean and the dispersion, allowing models with heavy-tailed distributions and providing estimation that is robust against outliers. Greenlaw and Kantabutra [38] address the parallel complexity of hierarchical clustering: instead of the traditional sequential algorithms, the top-down algorithm described in their paper is parallelized, with a corresponding bound on its computational cost.
In conclusion, for both hierarchical generalized linear models (HGLMs) and double hierarchical generalized linear models (DHGLMs), the h-likelihood plays an important role in inference for models having unobservable or unobserved random variables. Furthermore, numerical studies have shown that the h-likelihood gives statistically efficient estimates for HGLMs as well as DHGLMs. In addition, Noh and Lee [39] have shown that the h-likelihood procedure outperforms existing methods, including MCMC-type methods, in terms of bias. Last but not least, compared with the traditional marginal likelihood, the h-likelihood avoids the messy integration over the random effects and hence is convenient to use. The h-likelihood method is therefore worth attention.
3. Variable Selection via Penalized h-Likelihood
3.1. Model Setup
Suppose that we have several independent groups, each containing a number of subjects. Based on the idea of modeling the mean structure in the HGLM framework, we consider a partial linear model for the conditional mean: the canonical link of the conditional mean of the response is the sum of an unknown smooth function of an explanatory variable (taken univariate for simplicity) and a covariate vector multiplied by its associated coefficients. The model can also be written in matrix representation by stacking these terms over all observations.
We assume that the conditional random variables are drawn from an exponential family with specified mean and variance functions.
We also assume that the random components are independent. The random effects in the mean model are linked to the linear predictor through a strictly monotonic function, which allows the definition of the h-likelihood given in Lee and Nelder [4]. In this paper, the identity link is used; hence this canonical scale corresponds to the case in which the conditional distribution of the response is normal.
For simplicity, random effects are considered in the form of a random intercept throughout this paper. If a random intercept is not sufficient to represent the variation exhibited in the data, the model can easily be extended to a more general form with a more complex random effects structure.
3.2. Estimation Procedure via Penalized h-Likelihood
Thus, the log h-likelihood is the sum of the conditional log-likelihood of the response given the random effects and the log-likelihood of the random effects.
For the purposes of this paper, the first and second derivatives of the log h-likelihood with respect to the fixed effects and the random effects are derived below.
The maximum likelihood estimates of the random effects are obtained by setting the corresponding score equations to zero. An approximated likelihood for the fixed effects can then be obtained by plugging these estimates into the h-likelihood. In addition, the marginal likelihood is approximated by the adjusted profile likelihood.
The problem then arises of how to estimate the smooth function. In this paper, we use two nonparametric approaches: the local linear regression technique and the spline technique.
In the framework of penalized variable selection, we apply a penalty function, with a tuning parameter, to the approximated marginal likelihood. Our aim is to maximize the resulting penalized likelihood and obtain the maximum likelihood estimates of the fixed effects. We give brief theoretical support for how the estimates are derived in the following paragraphs.
First, the penalty functions are singular at the origin and do not have continuous second-order derivatives. However, they can be locally approximated by a quadratic function as follows. Assume that we are given an initial value close to the maximizer of the penalized likelihood. If a coefficient's initial value is very close to 0, set that coefficient to zero. Otherwise, the penalty can be locally approximated by a quadratic function in a neighborhood of the initial value. A drawback of this approximation is that once a coefficient is shrunk to zero, it stays at zero.
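The local quadratic approximation (LQA) step above can be sketched as follows. This is a generic Python illustration (the function names, the threshold `eps`, and the lasso-derivative example are our assumptions, not the paper's exact implementation): near an initial value, the penalty behaves like a ridge term whose weight is the penalty derivative divided by the coefficient magnitude.

```python
import numpy as np

def lqa_weights(beta, penalty_deriv, eps=1e-6):
    """Local quadratic approximation of a penalty p_lambda(|b|):
    near b0 != 0, p(|b|) ~ p(|b0|) + 0.5 * (p'(|b0|)/|b0|) * (b^2 - b0^2),
    so each active coefficient contributes a ridge-type weight
    p'(|b0|)/|b0|.  Coefficients within eps of zero are frozen at zero,
    reflecting the drawback noted in the text.
    """
    b = np.abs(beta)
    active = b > eps
    w = np.zeros_like(b)
    w[active] = penalty_deriv(b[active]) / b[active]
    return w, active

# example with the lasso derivative p'_lambda(t) = lambda
lam = 0.5
w, active = lqa_weights(np.array([2.0, 0.0, -0.3]),
                        lambda t: lam * np.ones_like(t))
```

With these weights, each LQA iteration reduces to a weighted ridge problem, which is why a Newton-Raphson-type update is available in closed form.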
Furthermore, note that the first two derivatives of the log h-likelihood function are continuous, so around a given point the log h-likelihood can be approximated by a second-order Taylor expansion.
Similarly, the penalized log-likelihood can be locally approximated by a quadratic function, up to a constant term, and the resulting quadratic maximization problem yields the solution iteratively.
When the algorithm converges, the estimator satisfies the penalized likelihood score condition for the nonzero elements of the coefficient vector.
As stated in Fan and Li [20], in the maximum likelihood estimation (MLE) setting, with a good initial value, the one-step procedure can be as efficient as the fully iterative procedure when the Newton-Raphson algorithm is used. Thus, if we have a good initial value, the very next iteration can be regarded as a one-step procedure, and the resulting estimator can be as efficient as the fully iterative method.
3.3. Variable Selection via the Adaptive Lasso Penalty
There are many penalized likelihood variable selection criteria available in the literature, such as the lasso and SCAD penalties. In this paper, we focus on the adaptive lasso penalty, introduced by Zou [40], in which each coefficient is penalized by its absolute value multiplied by a known weight, with a tuning parameter controlling the overall amount of shrinkage. It has been shown that if the weights are data-dependent and cleverly chosen, the weighted lasso can achieve the oracle properties; in other words, it performs as well as if the true underlying model were known in advance. This is the main reason for our choice of penalty function. In addition, the adaptive lasso is less complicated than the smoothly clipped absolute deviation (SCAD) penalty introduced by Fan and Li [20] and hence is easier to implement.
For the choice of the data-dependent weight vector, we use the hierarchical generalized linear model to obtain initial estimates of the coefficients; the weights are then specified as functions of these initial estimates.
As the sample size grows, the weights for zero-coefficient estimators tend to infinity, whereas the weights for nonzero coefficients converge to a finite constant.
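The weighting idea can be sketched as follows. This is an illustrative Python sketch under the common choice $w_j = 1/|\hat\beta_j^{\text{init}}|^{\gamma}$ from Zou's paper; here an ordinary least squares fit stands in for the HGLM-based initial estimate used in the paper, which is an assumption for demonstration.

```python
import numpy as np

def adaptive_lasso_weights(X, y, gamma=1.0, eps=1e-8):
    """Adaptive lasso weights w_j = 1 / |beta_init_j|^gamma.

    beta_init is an ordinary least squares fit, standing in here for
    the HGLM-based initial estimate described in the text; eps guards
    against division by an exactly zero initial estimate.
    """
    beta_init, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1.0 / (np.abs(beta_init) + eps) ** gamma

# usage: truly zero coefficients receive much larger weights
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
beta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0])
y = X @ beta_true + rng.normal(size=200)
w = adaptive_lasso_weights(X, y)
```

Because the initial estimates of the zero coefficients shrink toward zero as the sample grows, their weights blow up, which is exactly the behavior described in the text and the source of the oracle property.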
A significant part of our proposed method is the process of variable selection via an appropriate penalty function. As a result, the choice of the tuning parameter in the penalty function becomes important. The most popular methods for choosing such tuning parameters in the literature are K-fold cross-validation and generalized cross-validation. In fact, the selection consistency of various shrinkage methods relies on an appropriate choice of the tuning parameters, and the generalized cross-validation (GCV) method has been widely used in the literature. We therefore adopt the generalized cross-validation method for the choice of the tuning parameter. In particular, for a fitted linear method under squared error, the standard formula for the generalized cross-validation criterion applies.
We then choose the tuning parameter that minimizes the GCV criterion.
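The GCV computation can be sketched as follows. This is a generic Python illustration: a ridge-type penalty stands in for the quadratic LQA approximation of the adaptive lasso (an assumption), so that the fit is a linear smoother with an explicit hat matrix.

```python
import numpy as np

def gcv_score(X, y, lam):
    """Generalized cross-validation for a ridge-type linear smoother.

    GCV(lam) = (1/n) * ||y - y_hat||^2 / (1 - df/n)^2,
    where df = trace of the hat matrix S = X (X'X + lam I)^{-1} X'.
    """
    n, p = X.shape
    A = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # (X'X + lam I)^{-1} X'
    y_hat = X @ (A @ y)
    df = np.trace(X @ A)                                   # effective df
    rss = np.sum((y - y_hat) ** 2)
    return rss / n / (1.0 - df / n) ** 2

# usage: pick the tuning parameter minimizing GCV over a grid
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=100)
lams = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = [gcv_score(X, y, l) for l in lams]
best = lams[int(np.argmin(scores))]
```

The denominator penalizes effective degrees of freedom, so GCV trades off goodness of fit against model complexity without refitting on held-out folds.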
3.4. Computational Algorithm
We propose the following h-likelihood algorithm (Algorithm 1) for implementing the method discussed in this paper.

The computational cost of the proposed penalized h-likelihood algorithm grows with the sample size and the number of associated coefficients in equation (16). The efficient path algorithm makes the proposed penalized h-likelihood algorithm an attractive method for real applications. In particular, if we have a good initial value, the very next iteration can be regarded as a one-step procedure, and the resulting estimator can be as efficient as the fully iterative method.
3.5. Simulation Studies
To assess the finite-sample performance of our proposed method, we conduct several simulation studies, all implemented in R. The models take the partial linear mixed effects form introduced in Section 3.1, with normal random intercepts and normal errors; the distribution of the response conditional on the random components is likewise assumed normal. To form the covariates for the model, we draw random samples from a multivariate normal distribution whose covariance matrix has an AR(1) structure. The correlation parameter is held fixed, since trying several values showed that it has little impact on the resulting penalized estimates. Furthermore, the covariate of the nonparametric component is simulated from a uniform [0, 1] distribution. We conduct the simulation studies through several examples; for each case, we run the study over 100 simulated datasets.
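A data-generating step of this kind can be sketched as follows. This is an illustrative Python sketch (the paper's simulations are in R); the particular coefficient values, variances, correlation parameter, and the choice of sine function for the nonparametric part are our assumptions, not the paper's exact settings.

```python
import numpy as np

def simulate_cluster_data(n_groups=100, n_per=10, p=8, rho=0.5,
                          sigma_v=1.0, sigma_e=1.0, seed=0):
    """Simulate balanced clustered data
        y_ij = x_ij' beta + f(t_ij) + v_i + e_ij,
    with AR(1)-correlated normal covariates, a random intercept v_i per
    group, and an illustrative nonparametric part f(t) = sin(2*pi*t).
    """
    rng = np.random.default_rng(seed)
    beta = np.r_[3.0, 1.5, 2.0, np.zeros(p - 3)]    # sparse true coefficients
    n = n_groups * n_per
    # AR(1) covariance: cov[i, j] = rho^|i - j|
    cov = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    t = rng.uniform(0, 1, n)                         # uniform [0, 1] covariate
    v = np.repeat(rng.normal(0, sigma_v, n_groups), n_per)  # random intercepts
    y = X @ beta + np.sin(2 * np.pi * t) + v + rng.normal(0, sigma_e, n)
    group = np.repeat(np.arange(n_groups), n_per)
    return X, t, y, group

X, t, y, group = simulate_cluster_data()
```

Repeating this generator 100 times with different seeds reproduces the Monte Carlo structure described above, with the balanced 100-cluster design of the examples below.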
Furthermore, for the nonparametric part of the model, we use three different functions for simulation purposes. The first two are nonlinear, increasing functions, whereas the third is a nonlinear, nonmonotonic function.
In order to examine the finite sample performance of our proposed method, we run simulations based on the following six examples.
Example 1. We generate a balanced dataset such that there are 10 subjects within each of 100 groups. In other words, we have 100 clusters and 10 subjects within each cluster, denoted by  and . The size of the true model is , with the true values of the parameters set to be . In addition to the linear component, the nonparametric component is .
Example 2. Similar to Example 1 but with a reduced number of within-cluster subjects. We generate a balanced dataset such that there are 5 subjects within each of 100 groups. In other words, we have 100 clusters and 5 subjects within each cluster, denoted by  and . The size of the true model is , with the true values of the parameters set to be . In addition to the linear component, the nonparametric component is .
Example 3. We generate a balanced dataset such that there are 10 subjects within each of 100 groups. In other words, we have 100 clusters and 10 subjects within each cluster, denoted by  and . The size of the true model is , with the true values of the parameters set to be . In addition to the linear component, the nonparametric component is .
Example 4. Similar to Example 3 but with a reduced number of within-cluster subjects. We generate a balanced dataset such that there are 5 subjects within each of 100 groups. In other words, we have 100 clusters and 5 subjects within each cluster, denoted by  and . The size of the true model is , with the true values of the parameters set to be . In addition to the linear component, the nonparametric component is .
Example 5. We generate a balanced dataset such that there are 10 subjects within each of 100 groups. In other words, we have 100 clusters and 10 subjects within each cluster, denoted by  and . The size of the true model is , with the true values of the parameters set to be . In addition to the linear component, the nonparametric component is .
Example 6. Similar to Example 5 but with a reduced number of within-cluster subjects. We generate a balanced dataset such that there are 5 subjects within each of 100 groups. In other words, we have 100 clusters and 5 subjects within each cluster, denoted by  and . The size of the true model is , with the true values of the parameters set to be . In addition to the linear component, the nonparametric component is .
We simulate each random effect from a normal distribution with mean 0 and variance . Moreover, we simulate  from a uniform distribution on . Then, we obtain the smoothing function by plugging in the values of . Once we have the random effects and the nonparametric part , we can simulate the response by computing its mean and variance through the model. In this case, , where  and .
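Putting the pieces together, the data-generating process for one such example can be sketched as follows. The coefficient vector, variance components, and smooth function below are illustrative stand-ins chosen by us, not the exact values of the examples above; the cluster layout matches Example 1 (100 clusters, 10 subjects each).

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_i = 100, 10                         # clusters and within-cluster subjects
beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])  # illustrative sparse truth
sigma_v, sigma_e = 1.0, 0.5              # random-effect and error SDs (assumed)
f = lambda t: np.sin(2 * np.pi * t)      # illustrative smooth function

def ar1_cov(p, rho):
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

N = m * n_i
X = rng.multivariate_normal(np.zeros(len(beta)), ar1_cov(len(beta), 0.5), size=N)
t = rng.uniform(size=N)
cluster = np.repeat(np.arange(m), n_i)   # cluster label for each observation
v = rng.normal(0.0, sigma_v, size=m)     # one random intercept per cluster
# Response: linear part + smooth part + shared cluster effect + noise
y = X @ beta + f(t) + v[cluster] + rng.normal(0.0, sigma_e, size=N)
```

Because all observations in a cluster share the same draw v[cluster], the within-cluster correlation that motivates the random-effect term is built into the simulated response.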
By default, we estimate the unknown smooth function by two methods: the local linear kernel smoothing method and the cubic spline smoothing method. We denote the estimates corresponding to these two methods by PHKernel and PHSpline, respectively. In addition, we also compute the cubic spline smoothing method without the penalty term, i.e., , and denote this algorithm by HSpline. However, due to the computational complexity of the local linear kernel smoothing method, we only compare the local linear kernel and cubic spline smoothing methods for Examples 1 and 2. For the remaining four examples, we only run the simulations for HSpline and PHSpline.
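For the spline-based smoother, a cubic smoothing spline fit takes only a few lines. The sketch below uses Python's scipy purely for illustration (the simulations themselves use R); the signal, noise level, and smoothing parameter are arbitrary choices of ours.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(7)
t = np.sort(rng.uniform(size=200))                  # sorted index variable
y = np.sin(2 * np.pi * t) + rng.normal(scale=0.2, size=200)  # noisy illustrative signal

# Cubic (k=3) smoothing spline; s controls the residual sum-of-squares budget
spline = UnivariateSpline(t, y, k=3, s=len(t) * 0.04)
f_hat = spline(t)                                   # smoothed estimate of the function
```

The smoothing parameter s plays the same role as the bandwidth in the local linear kernel method: larger values give smoother, more biased fits.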
Before we report the simulation performance of our proposed penalty-based procedure, we introduce several terms that will be listed in the summary tables. First of all, let percentage of correctly fitted and percentage of overfitted be the proportions of selected models that are correctly fitted and overfitted, respectively. In the case of overfitting, the columns "1," "2," and "" represent the proportions of selected models including one, two, and more than two irrelevant predictors, respectively.
Furthermore, to characterize the capability of a method in producing sparse solutions, we define . To characterize the method's underfitting effect, we further define . Table 2 presents a detailed summary of variable selection accuracy for all six examples provided above. Several key findings can be observed from Table 2. First of all, none of the six examples exhibits an underfitting problem, which means all the relevant predictors are discovered by the PHSpline method. Equivalently, the zeros in the percent of incorrect zeros column further confirm this statement.
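These two quantities can be computed as in the following sketch; the function names are ours, and the percentages are taken over the truly zero (respectively, truly nonzero) coefficients, one common way such sparsity metrics are defined.

```python
import numpy as np

def percent_correct_zeros(beta_hat, beta_true):
    """Share of truly zero coefficients that the method also estimates as zero."""
    true_zero = (np.asarray(beta_true) == 0)
    est_zero = (np.asarray(beta_hat) == 0)
    return 100.0 * np.mean(est_zero[true_zero])

def percent_incorrect_zeros(beta_hat, beta_true):
    """Share of truly nonzero coefficients wrongly shrunk to zero (underfitting)."""
    true_nonzero = (np.asarray(beta_true) != 0)
    est_zero = (np.asarray(beta_hat) == 0)
    return 100.0 * np.mean(est_zero[true_nonzero])

# Illustrative check: two of three true zeros recovered, no relevant predictor lost
beta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0])
beta_hat  = np.array([2.8, 1.4, 0.0, 0.1, 1.9, 0.0])
pcz = percent_correct_zeros(beta_hat, beta_true)
piz = percent_incorrect_zeros(beta_hat, beta_true)
```

A method with no underfitting problem, as reported for all six examples, has a percent of incorrect zeros equal to 0.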
Furthermore, our proposed PHSpline method performs well in terms of variable selection consistency for Example 1, with a 100% correct fit. Similarly, the simulation results of our proposed PHSpline method for Example 2 show a 94% correct fit, a 2% overfit with 1 irrelevant predictor included, and a 2% overfit with 2 irrelevant predictors included. The overall variable selection consistency for Example 2 is good, with 98.8% correct zeros.
Third, when we have a sparser representation of the fixed effects with smaller magnitudes, our proposed PHSpline method tends to give somewhat more conservative results than in Examples 1 and 2 in terms of variable selection accuracy. In particular, the simulation results of our proposed PHSpline method for Example 3 show a 74% correct fit, a 14% overfit with 1 irrelevant predictor included, an 8% overfit with 2 irrelevant predictors included, and a 4% overfit with more than 2 irrelevant predictors included. In fact, the overall variable selection consistency for Example 3 is good, with 93.3% correct zeros. In contrast, when the number of within-cluster subjects decreases from 10 to 5 in Example 4, the percent of correct zeros decreases to 84.4%, meaning that more irrelevant predictors are included in the model.
Last but not least, similar trends can be observed for Examples 5 and 6 compared to Examples 3 and 4. Example 5 returns a 71% correct fit, a 20% overfit with 1 irrelevant predictor included, a 2% overfit with 2 irrelevant predictors included, and a 7% overfit with more than 2 irrelevant predictors included. In contrast, Example 6 returns a 64% correct fit, a 21% overfit with 1 irrelevant predictor included, a 6% overfit with 2 irrelevant predictors included, and a 9% overfit with more than 2 irrelevant predictors included. As a result, the 71% correct fit for Example 5 outperforms the 64% correct fit for Example 6 in terms of variable selection consistency. Hence, generally speaking, our proposed PHSpline method works better when the number of within-cluster subjects increases.
Besides the variable selection accuracy summarized in Table 2, the prediction accuracy of the fixed effects for the various examples is also of interest. In the following paragraphs, results on the prediction accuracy of the fixed effects are discussed and interpreted, based on Tables 3-8.
Table 3 summarizes the simulation results over 100 replications for Example 1. As we can see, both PHKernel and PHSpline can recover the relevant predictors accurately. In addition, the fixed effect estimates from both PHKernel and PHSpline are comparable, differing very little from the true values of . However, in terms of speed, the PHSpline method is much faster than the PHKernel method and hence easier to implement in practice. In contrast, the HSpline method returns the h-likelihood estimates of the fixed effects without the penalty term. As we can observe from Table 3, the HSpline method gives nonzero estimates for all the , resulting in poor variable selection performance compared with PHSpline, which involves a penalty term. Furthermore, PHSpline estimates tend to have relatively smaller standard deviations than the HSpline estimates. Therefore, the PHSpline method outperforms the other two methods in both variable selection accuracy and implementation speed.
Simulation results over 100 replications for Example 2 are summarized in Table 4. Example 2 has a smaller number of within-cluster subjects than Example 1. In fact, similar to the results obtained in Example 1, both the PHKernel and PHSpline methods return relatively good estimates of the fixed effects in terms of variable selection accuracy and prediction accuracy. In particular, both the PHKernel and PHSpline methods wrongly select one irrelevant covariate. In addition, the fixed effect estimates from both methods are comparable, differing very little from the true values of . In contrast, as we can observe from Table 4, the HSpline method gives nonzero estimates for all the , resulting in poor variable selection performance compared with PHSpline. Furthermore, PHSpline estimates tend to have relatively smaller standard deviations than the HSpline estimates. In fact, it is not surprising that both the PHKernel and PHSpline methods include  as a relevant predictor in the model. Equivalently, both methods return a nonzero . The reason is that we have an AR(1) model, which means there is a correlation of  between  and .
As we compare the simulation results of Examples 1 and 2, our proposed PHSpline method tends to perform better when the number of within-cluster subjects increases. A similar conclusion can be drawn for the PHKernel method. Furthermore, both the PHKernel and PHSpline methods work well when the nonparametric component is .
Tables 5 and 6 present simulation results over 100 replications for Examples 3 and 4. In these two examples, we have a sparser representation of the fixed effects than in Examples 1 and 2, and the magnitudes of the fixed effects are set to be smaller. For both examples, the PHSpline method outperforms the HSpline method in variable selection in two ways. First, the PHSpline method identifies some of the irrelevant predictors accurately, whereas the HSpline method gives nonzero estimates for all the . Although PHSpline cannot guarantee 100% selection accuracy, it does improve on the poor variable selection performance of HSpline by adding a penalty term. Furthermore, PHSpline estimates tend to have relatively smaller standard deviations than the HSpline estimates. Therefore, the PHSpline method performs better than the HSpline method, even in the sparse fixed effects situation.
Similarly, simulation results over 100 replications for Examples 5 and 6 are presented in Tables 7 and 8. Again, we have a sparser representation of the fixed effects than in Examples 1 and 2, with smaller magnitudes of the fixed effects . The PHSpline method works well in terms of variable selection for Example 5, even though it does not guarantee 100% selection accuracy. In contrast, the PHSpline method returns nonzero estimates for all the , resulting in poor variable selection performance for Example 6, where the number of within-cluster subjects is reduced to 5.
Overall, the simulation results show that our proposed penalized h-likelihood approach performs well in terms of variable selection accuracy because of its ability to recover the true zeros, especially when the number of within-cluster subjects is not too small. Generally, our proposed PHSpline method works better as the number of within-cluster subjects increases. In addition, even when the true model is sparse, our penalized estimator still does no worse than the h-likelihood estimator in terms of estimation accuracy.
4. Conclusion
To conclude, we have introduced a new penalized h-likelihood approach to identify the nonzero relevant fixed effects in the partial linear model setting. This penalized h-likelihood incorporates variable selection procedures into mean modeling via the h-likelihood. A few advantages of this newly proposed method are as follows. First, compared to the traditional marginal likelihood, the h-likelihood avoids the messy integration over the random effects and is hence convenient to use. In addition, the h-likelihood plays an important role in inference for models with unobserved random variables. Finally, the simulation studies have demonstrated that the proposed penalty-based method is able to identify zero regression coefficients in modeling the mean structure and produces good fixed effect estimation results.
As for future research, it would be interesting to extend the proposed penalized h-likelihood approach to more complicated settings for partial linear models. The model in this paper assumes a simple one-component structure for the random effects, in which only a random intercept is considered. For possible future research, we may consider a partial linear model for the conditional mean with more than one random effect, i.e., an extended multi-component random effects model. Other future work, including variance component estimation for the random effects and the study of the penalized h-likelihood estimator's theoretical and asymptotic properties, such as its convergence rate, remains to be investigated.
Data Availability
This is a theoretical study, and we do not have experimental data.
Disclosure
This work is based on the first author's Ph.D. thesis, written in 2013 [41].
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was funded by the Humanities and Social Sciences Project of the Ministry of Education (Grant no. 17YJCZH199). The authors gratefully acknowledge the Ministry of Education for the technical and financial support.