Abstract

Most existing work in nonparametric regression applies a single type of estimator to all predictors. This study develops a mixed truncated spline and Fourier series model in nonparametric regression for longitudinal data. The mixed estimator is obtained through a two-stage estimation procedure consisting of penalized weighted least squares (PWLS) and weighted least squares (WLS) optimization. To demonstrate the performance of the proposed method, a simulation study and a real-data application are provided. The results of the simulation study and the case study are consistent.

1. Introduction

Regression analysis aims to model the association between the predictors and the response. If the data pattern suggests a regression curve of unknown form, nonparametric regression is used [1]. If the form of the regression curve is known, parametric regression can be applied instead [2]. Nonparametric regression is highly flexible because the data are allowed to determine the form of the estimated regression curve without being influenced by the researcher’s subjectivity [3]. Several nonparametric estimators have been studied, such as kernel, spline [4–7], and Fourier series [8].

A spline estimator, which handles data whose behavior changes at certain subintervals particularly well [9], can be obtained using penalized least squares optimization [10] or a Bayesian approach [11]. A spline estimator can be applied to cross-sectional data as well as longitudinal data. Several studies on nonparametric regression for longitudinal data have used kernel estimators [12, 13], generalized spline regression [14], and mixed-effects models [7]. The Fourier series, which is useful for describing curves that show sine and cosine waves, is generally used when the data pattern is unknown and tends to repeat.

A considerable amount of research has used only one type of estimator for all predictors. However, because each predictor can have a different pattern, a mixed estimator has been proposed. Recently, Sudiarsa et al. [15] studied a mixed estimator of the truncated spline and Fourier series. Their study considered only cross-sectional data and therefore did not obtain a model for each subject. Consequently, their approach cannot be used to investigate how the response behaves over time.

Although some research has been carried out on mixed estimators, no studies have so far explored multisubject data. This paper addresses that gap by proposing a mixed estimator of the truncated spline and Fourier series in nonparametric regression for longitudinal data and applying it to simulated data and a case study.

This study is organized as follows. We briefly explain the materials and methods used in our study in Section 2. Section 3 consists of three subsections: the developed theory, simulation study, and case study. We present the developed nonparametric regression theory for longitudinal data with a mixed estimator of the truncated spline and Fourier series with two-stage estimation in Section 3.1. In Section 3.2, we conduct a simulation study based on the developed theory to assess the proposed estimator’s behavior. To illustrate the applicability of the model, we use a dataset of patients with pulmonary tuberculosis in Section 3.3. Section 4 presents the conclusion.

2. Materials and Methods

Longitudinal data has independent subjects and observations for each subject. Given paired data , which consist of and predictors with subjects, each subject has observations. The relationship between and , which followed a nonparametric regression model for longitudinal data, is as follows: Each regression curve is additive so that the model can be expressed as is the truncated spline component and is the Fourier series component.
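For orientation, the additive structure described in this paragraph can be sketched in one possible notation. All symbols below ($y_{it}$, $x_{\ell it}$, $z_{sit}$, $p$, $q$, $n$, $T$, $f_{\ell i}$, $g_{si}$) are assumptions chosen for this sketch and need not match the symbols of the numbered equations.

```latex
% Assumed notation: subject i = 1..n, observation t = 1..T,
% p predictors for the truncated spline part, q for the Fourier part.
\begin{align*}
  y_{it} &= \mu_i\bigl(x_{1it},\dots,x_{pit},\,z_{1it},\dots,z_{qit}\bigr) + \varepsilon_{it},\\
  \mu_i(\cdot) &= \sum_{\ell=1}^{p} f_{\ell i}(x_{\ell it}) + \sum_{s=1}^{q} g_{si}(z_{sit}),
\end{align*}
% f_{\ell i}: truncated spline component; g_{si}: Fourier series component.
```

In this reading, each subject has its own set of component curves, which is what allows a separate fitted model per subject.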

This study has three objectives. The first is to obtain the mixed estimator of the truncated spline and Fourier series in nonparametric regression for longitudinal data. To achieve this, we propose a two-stage estimation method: the first stage estimates the Fourier series component using the penalized weighted least squares (PWLS) method, and the second stage estimates the truncated spline component using the weighted least squares (WLS) method. The second objective is a simulation study, for which we generate functions that have the characteristics of the truncated spline and the Fourier series. The third objective is to apply the developed theory to a dataset of patients with pulmonary tuberculosis.

3. Results and Discussion

3.1. Mixed Model of Truncated Spline and Fourier Series with Two-Stage Estimation

Lemmas and theorems are used to obtain a nonparametric regression model for longitudinal data with a mixed estimator. The regression curve component that is approximated by the Fourier series estimator is presented in Lemma 1 and the penalty component for the Fourier series function is presented in Lemma 2. Following the PWLS form in Lemma 3, we estimate the Fourier series component by using PWLS in Theorem 4. The regression curve component that is approximated by the truncated spline estimator is presented in Lemma 5, and we estimate the truncated spline component using the WLS method in Theorem 6. The results are summarized as follows:

Lemma 1. If is approximated by the Fourier series function, then the goodness of fit is where and is the weighting matrix.

Proof. The regression curve is a regression curve of an unknown shape and is contained in continuous space . The component of in Equation (2) is approximated by the Fourier series function with the trend line as follows:

If the regression curve in Equation (4) involves only one predictor, then it can be written as follows: By using Equation (4), Equation (6) can be written in the form of a matrix as follows:

The Fourier series function in the nonparametric regression component for longitudinal data with predictor can be expressed in the following form:

So, is a matrix as follows: with

Whereas, is a vector given by

To estimate the Fourier series component, the nonparametric regression model in Equation (1) can be written as

The model in Equation (14) can be written in matrix form:

Then, a goodness of fit for the model is formed as follows: with . If the function is approximated by a Fourier series function as in Equation (4), then the goodness of fit can be presented in the form with as a weighting matrix for the regression of longitudinal data.
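The construction in this proof can be illustrated numerically. The sketch below assumes the common cosine form with a trend line, g(z) = b z + a_0/2 + a_1 cos(z) + ... + a_K cos(Kz), and a goodness of fit of the form N^{-1}(y - Xa)' W (y - Xa). The function names `fourier_design` and `weighted_gof` are hypothetical and are used only for this illustration.

```python
import numpy as np

def fourier_design(z, K):
    """Design matrix with columns: z (trend), 1/2 (multiplies a_0), cos(z), ..., cos(K z).

    Assumed form: g(z) = b*z + a_0/2 + sum_k a_k cos(k z).
    """
    z = np.asarray(z, dtype=float)
    cols = [z, np.full_like(z, 0.5)]
    cols += [np.cos(k * z) for k in range(1, K + 1)]
    return np.column_stack(cols)

def weighted_gof(y, X, a, W):
    """Weighted goodness of fit N^{-1} (y - X a)' W (y - X a)."""
    r = np.asarray(y, dtype=float) - X @ a
    return float(r @ W @ r) / len(r)

# Small illustration with K = 2 oscillation terms and identity weights.
z = np.array([0.0, np.pi / 2, np.pi])
X = fourier_design(z, K=2)
a = np.array([0.5, 1.0, 2.0, -1.0])   # [b, a_0, a_1, a_2], arbitrary values
y = X @ a                             # data lying exactly on the curve
print(weighted_gof(y, X, a, np.eye(len(z))))
```

For longitudinal data, W would typically be block diagonal with one block per subject; the identity matrix is used here only to keep the example short.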

Lemma 2. If the Fourier series is given, then the penalty component is

Proof. The penalty component in the PWLS optimization based on Equation (4) can be obtained as follows:

As a result,

To simplify, we defined

The value of will be obtained as follows. Furthermore, the value of is given by

Based on Equations (22) and (23), it can be written as follows:

For , we obtained where

Thus, the penalty component can be expressed in a matrix form as follows:
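A quick numerical check can make the matrix form of the penalty concrete. The sketch below assumes the roughness penalty (2/π)∫_0^π (g''(z))^2 dz applied to the cosine series with trend line, for which the trend and constant terms drop out and the penalty reduces to the sum of k^4 a_k^2. The helper names are hypothetical.

```python
import numpy as np

def fourier_penalty(K):
    """Diagonal matrix D with a' D a = sum_k k^4 a_k^2.

    Coefficient order assumed: [b, a_0, a_1, ..., a_K]; b and a_0 are unpenalized.
    """
    return np.diag([0.0, 0.0] + [float(k) ** 4 for k in range(1, K + 1)])

def roughness_numeric(a, K, m=20001):
    """(2/pi) * integral over [0, pi] of g''(z)^2, computed with the trapezoid rule."""
    z = np.linspace(0.0, np.pi, m)
    # Second derivative of b*z + a_0/2 + sum_k a_k cos(k z).
    g2 = -sum(k ** 2 * a[k + 1] * np.cos(k * z) for k in range(1, K + 1))
    f = g2 ** 2
    h = z[1] - z[0]
    return (2.0 / np.pi) * h * (f[0] / 2 + f[1:-1].sum() + f[-1] / 2)

K = 3
a = np.array([0.7, 1.2, 0.5, -0.3, 0.1])   # [b, a_0, a_1, a_2, a_3], arbitrary values
closed_form = float(a @ fourier_penalty(K) @ a)
numeric = roughness_numeric(a, K)
print(closed_form, numeric)
```

The two values agree up to numerical integration error, which illustrates why the penalty can be written as a quadratic form in the coefficient vector.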

Lemma 3. If the goodness of fit component is presented in Lemma 1 and the penalty component is given by Lemma 2, then the PWLS is

In general, PWLS is defined as follows:

Besides, PWLS can be presented in the form of a matrix as follows:

The next step is to obtain a Fourier series estimator in nonparametric regression for longitudinal data derived in Theorem 4.

Theorem 4. If paired data are given, which follows the nonparametric regression model for longitudinal data, then the mixed estimator that minimizes PWLS in Lemma 3, is with and .

Proof. The first estimation step in the mixed estimator of the truncated spline and Fourier series in the nonparametric regression model for longitudinal data is performed by estimating the form of the Fourier series estimator by using the PWLS method. The PWLS in Equation (29) can be written in the form of a matrix as follows:

Next, we complete PWLS optimization using the following steps:

To complete the optimization, the estimators are obtained by taking the partial derivative of with respect to and setting the result equal to zero. The results are

By substituting into Equation (9), we get

So, the model in Equation (15) can be written as with and .
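The first-stage estimator can be sketched in a few lines. This is a minimal illustration, assuming the PWLS criterion has the form N^{-1}(y - Xa)' W (y - Xa) + λ a' D a, whose minimizer is a = (X' W X + N λ D)^{-1} X' W y. The function name `pwls_fit`, the toy data, and the identity weighting matrix are assumptions for the example.

```python
import numpy as np

def pwls_fit(X, y, W, D, lam):
    """Minimize (1/n)(y - X a)' W (y - X a) + lam * a' D a over a."""
    n = len(y)
    lhs = X.T @ W @ X + n * lam * D
    rhs = X.T @ W @ y
    return np.linalg.solve(lhs, rhs)

# Toy data (hypothetical): a trend plus one cosine wave and noise.
rng = np.random.default_rng(0)
z = np.linspace(0.0, np.pi, 40)
K = 3
X = np.column_stack([z, np.full_like(z, 0.5)]
                    + [np.cos(k * z) for k in range(1, K + 1)])
D = np.diag([0.0, 0.0] + [float(k) ** 4 for k in range(1, K + 1)])
W = np.eye(len(z))                      # identity weights, assumed for the demo
y = 0.5 * z + 1.0 + 2.0 * np.cos(z) + rng.normal(0.0, 0.3, len(z))

a_unpenalized = pwls_fit(X, y, W, D, lam=0.0)   # reduces to weighted least squares
a_smooth = pwls_fit(X, y, W, D, lam=10.0)       # heavier penalty on the oscillations
print(a_unpenalized @ D @ a_unpenalized, a_smooth @ D @ a_smooth)
```

Increasing the smoothing parameter reduces the roughness a' D a of the fitted Fourier component, which is the trade-off that the smoothing parameter controls.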

Lemma 5. If is approximated by the truncated spline function, then the WLS is

is a weighting matrix.

Proof. is a truncated spline estimator component. The component regression curve is a linear truncated spline function defined as follows: with truncated function

If the component of the regression curve involves one predictor, then it can be written as follows:

By using Equation (41), it can be described in the form of a matrix as follows:

The truncated spline function in the nonparametric regression component for longitudinal data with predictors can be expressed in the following form:

So,

Furthermore, it can be written as follows: is a matrix and is a vector.

The mixed model of nonparametric regression for longitudinal data in Equation (1) can be written in a matrix as follows:

By substituting Equation (36) into Equation (46), we get

The truncated spline component can be written as

Substituting Equation (45) into Equation (48), we obtain

If the function is approximated by the truncated spline function as in Equation (38), then

Therefore, we obtain the WLS by where is a weighting matrix for the regression of longitudinal data. Next, the truncated spline estimator in the nonparametric regression for longitudinal data is derived in Theorem 6.
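The linear truncated spline used in this proof can be illustrated with a small basis builder. This is a sketch under assumptions: the function name `truncated_linear_basis` is hypothetical, and the intercept is deliberately omitted here so that the constant term is carried by the Fourier component only, which keeps a combined design matrix full rank.

```python
import numpy as np

def truncated_linear_basis(x, knots):
    """Columns: x, (x - k_1)_+, ..., (x - k_r)_+ for a linear truncated spline.

    (x - k)_+ equals x - k when x >= k and 0 otherwise.
    No intercept column is included (assumption for this sketch).
    """
    x = np.asarray(x, dtype=float)
    cols = [x] + [np.maximum(x - k, 0.0) for k in knots]
    return np.column_stack(cols)

# One predictor, one knot at 1.5.
x = np.array([0.0, 1.0, 2.0, 3.0])
G = truncated_linear_basis(x, knots=[1.5])
print(G)
```

Each truncated column is zero to the left of its knot, so the fitted slope is allowed to change at every knot while the curve stays continuous.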

Theorem 6. If paired data which follows the nonparametric regression model for longitudinal data is given, then the mixed estimator that minimizes WLS in Lemma 5, is with and .

Proof. The second estimation stage of the mixed estimator of the truncated spline and Fourier series in the nonparametric regression model for longitudinal data is performed using the WLS method. The estimator can be obtained by completing the WLS optimization as follows:

The estimators are obtained by taking the partial derivative of and setting the result equal to zero. The partial derivative results are as follows: where and .

By substituting into the form of a truncated spline estimator component as in Equation (45), we obtain

After obtaining , we obtain by substituting Equation (56) into Equation (34).

So,

By substituting and into the mixed estimator of the truncated spline and the Fourier series in the nonparametric regression for longitudinal data, the following estimation results are obtained. with , , and .
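The two stages can be put together in a short numerical sketch. This is one plausible reading of the scheme rather than a reproduction of the derivation: stage one expresses the Fourier coefficients as a PWLS solution for a fixed spline part, a = M(y - Gβ), and stage two estimates β by WLS from (I - S)y = (I - S)Gβ + error, with S = XM. The function name `two_stage_fit`, the toy designs, and the identity weights are assumptions.

```python
import numpy as np

def two_stage_fit(G, X, y, W, D, lam):
    """Two-stage mixed fit. G: spline design, X: Fourier design.

    Returns (beta, a, A) where A is the hat matrix, y_hat = A y.
    """
    n = len(y)
    # Stage 1 (PWLS): for a fixed spline part, a = M (y - G beta).
    M = np.linalg.solve(X.T @ W @ X + n * lam * D, X.T @ W)
    S = X @ M                                  # Fourier smoother matrix
    R = np.eye(n) - S
    # Stage 2 (WLS): regress R y on R G with weights W.
    RG = R @ G
    B = np.linalg.solve(RG.T @ W @ RG, RG.T @ W @ R)   # beta = B y
    beta = B @ y
    a = M @ (y - G @ beta)
    A = G @ B + S @ (np.eye(n) - G @ B)
    return beta, a, A

# Noise-free toy data (hypothetical); with lam = 0 the truth should be recovered.
rng = np.random.default_rng(1)
n, K = 60, 2
x = rng.uniform(0.0, 1.0, n)
z = rng.uniform(0.0, np.pi, n)
G = np.column_stack([x, np.maximum(x - 0.5, 0.0)])          # one knot at 0.5
X = np.column_stack([z, np.full_like(z, 0.5)]
                    + [np.cos(k * z) for k in range(1, K + 1)])
D = np.diag([0.0, 0.0] + [float(k) ** 4 for k in range(1, K + 1)])
W = np.eye(n)
beta_true = np.array([1.5, -2.0])
a_true = np.array([0.3, 1.0, 0.8, -0.4])
y = G @ beta_true + X @ a_true

beta_hat, a_hat, A = two_stage_fit(G, X, y, W, D, lam=0.0)
print(beta_hat, a_hat)
```

Returning the hat matrix A is a design choice made here because criteria such as GCV are computed from it directly.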

The mixed estimator depends on the optimum knot point, oscillation parameter, and smoothing parameter. To obtain the best model, it is essential to select the optimum parameter. One of the criteria to select the optimum parameter is the generalized cross-validation (GCV) method [11]. The GCV function of the nonparametric regression model for longitudinal data is as follows:

The optimum knot point, oscillation parameter, and smoothing parameter are obtained by solving the minimization problem presented in Equation (60).
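As an illustration of how such a criterion is evaluated, the sketch below uses the standard GCV form for a linear smoother with hat matrix A, namely the mean squared residual divided by (N^{-1} trace(I - A))^2. Whether the longitudinal GCV in this paper carries additional weighting is not assumed here; the function name `gcv` and the toy data are hypothetical.

```python
import numpy as np

def gcv(y, A):
    """Generalized cross-validation score for a linear smoother y_hat = A y."""
    n = len(y)
    resid = y - A @ y
    mse = float(resid @ resid) / n
    denom = (np.trace(np.eye(n) - A) / n) ** 2
    return mse / denom

# Toy example: the hat matrix of an ordinary least squares line (trace = 2).
rng = np.random.default_rng(2)
n = 50
x = np.linspace(0.0, 1.0, n)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.2, n)
C = np.column_stack([np.ones(n), x])
A = C @ np.linalg.solve(C.T @ C, C.T)
score = gcv(y, A)
print(score)
```

In practice the score would be computed for each candidate combination of knot points, oscillation parameter, and smoothing parameter, and the combination with the smallest score would be kept.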

3.2. Simulation Study

To demonstrate the performance of the proposed method, we used a single sample size with . For the simulation study, we considered ten models, one for each subject. The models are generated from a formula that contains two different functions representing the truncated spline and Fourier series patterns. A polynomial function is used to represent the truncated spline, while a trigonometric function is used to represent the Fourier series. Additionally, and are generated from distribution, and the random errors are generated from a multivariate normal distribution.

Using ten subjects and two predictors, the formula for generated data is stated as follows:

The simulation study is applied based on these models, as shown in Table 1.
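A data generator of this general kind can be sketched as follows. Every specific below is a hypothetical stand-in and not the design of Table 1: the number of observations per subject, the uniform predictor distributions, the piecewise-linear and cosine functions, and the AR(1)-type error covariance are all assumptions made only to show the structure (ten subjects, two predictors, multivariate normal errors).

```python
import numpy as np

def simulate_longitudinal(n_subjects=10, T=20, rho=0.5, sigma=0.5, seed=0):
    """Generate one toy longitudinal dataset; all settings are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
    cov = sigma ** 2 * rho ** lags            # assumed within-subject covariance
    data = []
    for i in range(n_subjects):
        x = rng.uniform(0.0, 1.0, T)          # predictor for the spline-type pattern
        z = rng.uniform(0.0, np.pi, T)        # predictor for the Fourier-type pattern
        # Slope changes at 0.5, a pattern suited to a truncated spline.
        spline_part = (i + 1) * x - 2.0 * (i + 1) * np.maximum(x - 0.5, 0.0)
        # Repeating pattern around a trend line, suited to a Fourier series.
        fourier_part = 0.5 * z + np.cos(z) + 0.5 * np.cos(2 * z)
        eps = rng.multivariate_normal(np.zeros(T), cov)
        data.append({"subject": i, "x": x, "z": z,
                     "y": spline_part + fourier_part + eps})
    return data

data = simulate_longitudinal()
print(len(data), data[0]["y"].shape)
```

Fixing the seed makes the generated dataset reproducible, which is convenient when comparing GCV values across candidate models.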

Figure 1 illustrates the partial relationship between the response and each predictor variable. It can be seen that the relationship between predictor and the response for each subject tends to change at certain subintervals, which is suitable for the truncated spline estimator. The relationship between and the response for each subject has a repetitive pattern with a particular trend line, which is suitable for the Fourier series estimator.

Wu and Zhang [7] stated that a regression’s performance strongly depends on good knot locations and a good choice of the number of knots. In general, the number of knots is smaller than the sample size . Considering the scatterplot of simulated data and computational convenience, our study uses three knots and three oscillation parameters . To choose the optimum parameter, we use the minimum GCV criterion. Table 2 provides a summary of the GCV for varying knots and oscillations. Notably, using one knot and one oscillation with , we obtain the best model, with the lowest GCV of 3.112. This model yields satisfactory results with a root mean square error (RMSE) of 0.837.

3.3. Case Study

After conducting the simulation, we applied the proposed model to real data to confirm the simulation results. The data were obtained from a study of patients with pulmonary tuberculosis conducted by Fernandes and Solimun [16]. Pulmonary tuberculosis is a contagious disease caused by Mycobacterium tuberculosis, which can attack various organs, particularly the lungs. This disease is typical among women in their productive years (ages 15-50 years). The World Health Organization (WHO) declared tuberculosis a global emergency in 1992 [17]. The WHO report in 2013 stated that there were 8.6 million tuberculosis cases in 2012, of which 40% were in Southeast Asia. In a further report, Indonesia was noted as the country with the second-largest number of cases, 2.8 million, in 2015.

This study’s dataset consists of four patients, each representing a stage of the radiological image of the thorax: minimal lesion, mod advance, far advance, and KP Miller. The response is the suPAR level (in mL units), observed every two weeks during six months of treatment. The predictor variables are the erythrocyte sedimentation rate (in mm/hour) and the body mass index (in kg/m²).

The partial relationship between the suPAR level and each predictor variable for each subject is presented in Figure 2. In all four subjects, the data patterns changed over the six months of observation, with measurements taken every two weeks. The plot in Figure 2 shows a different pattern for each predictor. For this reason, we propose a nonparametric regression approach based on a mixed estimator for longitudinal data. The erythrocyte sedimentation rate is approximated by a truncated spline estimator, while the body mass index is approximated by a Fourier series estimator.

In this case study, as in the simulation study, up to three knot points and three oscillation parameters were considered. From the various knots and oscillation parameter results, we obtained the GCV values listed in Table 3. Interestingly, the minimum GCV in this table is achieved with one knot point and one oscillation parameter, the same configuration as in the simulation study. However, it has a different smoothing parameter, . This model provides a GCV value of 0.239 with an RMSE of 0.197. The knot point location and the results of the parameter estimation for each subject of patients with pulmonary tuberculosis are presented in Tables 4 and 5, respectively.

Based on the optimal knot points in Table 4 and the parameter estimation for each subject in Table 5, the nonparametric regression model based on a mixed estimator for longitudinal data can be written as follows:
(1) Model estimation for subject minimal lesion:
(2) Model estimation for subject mod advance:
(3) Model estimation for subject far advance:
(4) Model estimation for subject KP Miller:

4. Conclusions

Based on the simulation study and the case study, we selected the best model using the minimum GCV. A larger number of knot points or oscillation parameters does not necessarily produce a higher GCV, and vice versa. Therefore, we tried several combinations of knot points and oscillation parameters to choose the best model. The result of the case study of patients with pulmonary tuberculosis is similar to that of the simulation study: in both, the best model uses one knot point and one oscillation parameter, with different smoothing parameters. It can be concluded that the simulation study supports the results of the case study.

A limitation of this study is that it does not investigate other sample sizes. Consequently, we cannot compare the performance of the developed theory across different sample sizes. Despite this limitation, the study adds to the understanding of mixed estimators in nonparametric regression for longitudinal data.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

The authors are thankful for financial support via the PMDSU Grant (Batch III) from the Ministry of Education and Culture of the Republic of Indonesia.