Journal of Probability and Statistics

Volume 2019, Article ID 3743762, 14 pages

https://doi.org/10.1155/2019/3743762

## Conditional Analysis for Mixed Covariates, with Application to Feed Intake of Lactating Sows

^{1}Eli Lilly and Company, Indianapolis, IN 46285, USA^{2}Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA^{3}Department of Animal Science, North Carolina State University, Raleigh, NC 27695, USA

Correspondence should be addressed to S. Y. Park; moc.liamg@9583gnuoyoskrap and C. Li; ude.uscn@9ilc

Received 5 December 2018; Revised 28 April 2019; Accepted 28 May 2019; Published 16 July 2019

Guest Editor: Rahim Alhamzawi

Copyright © 2019 S. Y. Park et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

We propose a novel modeling framework to study the effect of covariates of various types on the conditional distribution of the response. The methodology accommodates flexible model structure, allows for joint estimation of the quantiles at all levels, and provides a computationally efficient estimation algorithm. Extensive numerical investigation confirms good performance of the proposed method. The methodology is motivated by and applied to a lactating sow study, where the primary interest is to understand how the dynamic change of minute-by-minute temperature in the farrowing rooms within a day (functional covariate) is associated with low quantiles of feed intake of lactating sows, while accounting for other sow-specific information (vector covariate).

#### 1. Introduction

Many modern applications routinely collect data on study participants comprising scalar responses and covariates of various types, vector, function, and image, and the main question of interest is to examine how the covariates affect the response. For example, in our motivating experimental study, the goal is to analyze how the minute-by-minute daily temperature and humidity of the farrowing rooms, where sows are placed after giving birth for nursing, affect their feed intake during a lactation period. The covariates consist of temperature profile, humidity, and sow age, where the response is a total amount of daily feed intake of sows. A popular approach in these cases is to use a nonparametric framework and assume that the mixed covariates solely affect the mean response; see Cardot et al. [1], James [2], Ramsay and Silverman [3, 4], Ferraty and Vieu [5, 6], Goldsmith et al. [7], McLean et al. [8], and others. However, for our application, while it is important to study the mean feed intake, animal scientists are often more concerned with the left tail of the feed intake distribution. This is because low feed intake of lactating sows could lead to many serious issues, including decrease in milk production and negative impact on the sows reproductive system; see, for reference, Quiniou and Noblet [9], Renaudeau and Noblet [10], St-Pierre et al. [11], among others. In this paper we focus on regression models that study the effects of covariates on the entire distribution of response. Our contribution is the development of a modeling framework that accommodates a comprehensive study of various types (vector and functional) of covariates on a scalar response.

Quantile regression models the effect of scalar/vector covariates beyond the mean response; it provides a more comprehensive study of the covariates on the response and has attracted great interest [12, 13]. For prespecified quantile levels, quantile regression models the conditional quantiles of the response as a function of the observed covariates; this approach has been extended more recently to ensure noncrossing of quantile functions [14]. Quantile regression has been also extended to handle functional covariates. Cardot et al. [15] discussed quantile regression models by employing a smoothing spline modeling based approach. Kato [16] considered the same problem and used a functional principal component (fPC) based approach. Both papers mainly discussed the case of having a single functional covariate and it is not clear how to extend them to the case where there are multiple functional covariates or mixed covariates (vector and functional).

More recently, Tang and Cheng [17], Lu et al. [18], and Yu et al. [19] studied quantile regression when the covariates are of mixed types and introduced the partial functional linear quantile regression modeling framework. The first two publications used fPC basis while the last one considered partial quantile regression (PQR) basis. These approaches all are suitable when the interest is studying the effect of covariate at a particular quantile level and do not handle the study of covariate effects at simultaneous quantile levels due to the well-known crossing-issue.

Ferraty et al. [20] and Chen and Müller [21] considered a different perspective and studied the effect of a functional predictor on the quantiles of the response by modeling the conditional distribution of the response directly. However their approach is limited to one functional predictor. In this paper we fill this gap and propose a unifying modeling framework and estimation technique that allows studying the effect of mixed type covariates (i.e., scalar, vector, and functional) on the conditional distribution of a scalar response in a computationally efficient manner.

Let denote the th conditional quantile of given a functional covariate , and let denote the conditional distribution of given . We model the conditional distribution using a generalized function-on-function regression framework, i.e., , where is an indicator function and is the logit link function, and we study the conditional quantiles by exploiting the relationship between and through for . The advantage and contribution of our proposed method mainly come from the following reasons: (1) our modeling approach is spline-based, and as a result it can easily accommodate smooth effects of scalar variables as well as of functional covariates and (2) our estimation approach is based on a single step function-on-function (or function-on-scalar) penalized regression, which enables efficient implementation by exploiting off-the-shelf software and leads to competitive computations.

The remainder of the paper is structured as follows. Section 2 discusses the details of the proposed method and Section 3 describes the estimation procedure and extensions. Section 4 performs a thorough simulation study evaluating the performance of the proposed method and its competitors. We apply the proposed method to analyze the sow data in Section 5. We conclude the paper with a discussion in Section 6.

#### 2. Methodology

##### 2.1. Statistical Framework

Let be index subjects, index repeated measurements, the number of subjects, and the number of observations for subject . Suppose we observe , where is the response, which is a scalar random variable, is a -dimensional vector of nuisance covariates, is a scalar covariate, and is a functional covariate, which is assumed to be square-integrable on a closed domain .

We propose the following model for the conditional distribution of given , , and :where denotes the conditional distribution function as before, is a known, monotone link function, namely, the logit link function defined as for arbitrary scalar , is an unknown and smooth functional intercept, is a -dimensional parameter capturing the linear additive effect of the covariate vector , is an unknown and smooth function, and is an unknown and smooth bivariate function. Here, the effect of the nuisance covariates is ; it is assumed to be constant over while the smooth intercept is -variant. The effect of is , which varies smoothly over ; quantifies -variant linear effects of the covariate . If the parameter function is zero then the covariate has no effect on the distribution of the response , which is equivalent to having no effect on any quantile level of . Similarly, it is easy to see that a null effect, say , is equivalent to the case that the functional covariate has no effect on any quantile level of the response. Chen and Müller [21] (CM, henceforth) considered a similar model; however, their approach is restrictive to a single functional covariate. We discuss the differences between their method and ours in Section 2.2

To explain our ideas, we consider the case that the functional covariates are observed without noise on a fine, regular, and common grid of sampling points, i.e., with for all . Bear in mind, this assumption is made for illustration only, and our framework can be extended to more general cases, including settings where the functional covariate is observed with noise and at irregular sampling points; see Section 3.3.

##### 2.2. Modeling of the Covariate Effects

We model and by using prespecified, truncated univariate basis. Let and be two bases of dimensions and , respectively. and , where ’s and ’s are unknown basis coefficients. We represent using the tensor product of two univariate bases functions, and , where and are the bases dimensions; , where ’s are unknown basis coefficients. In practice, the integration term is approximated by Riemann integration but other numerical approximation scheme can be also used.

Define for . In practice for each in a fine grid, we view as a binary-valued random functional variable. It follows that model (1) can be written equivalently as a generalized function-on-function regression model through relating the “artificial” binary functional response and the mixed covariates . This model can be fitted by using, for example, the ideas of Scheipl et al. [22] which we briefly summarize next.

Model (1) can be represented as the following generalized additive model:For convenience, we use the notation , . We let , , , , , , and is a coefficient matrix. Then , where is the vectorization of . We let , , and . Now model (2) can be written asThe general idea is to set the bases dimensions , , , and to be sufficiently large to capture the complexity of the coefficient functions and control the smoothness of the estimator through some roughness penalties. This approach of using roughness penalties has been widely used; see, for example, Eilers and Marx [23]; Ruppert [24]; Wood [25, 26] among many others.

It is important to emphasize that, even in the case of a single functional covariate, our methodology differs from [21] in two directions: (1) our proposed method is based on modeling the unknown smooth coefficient functions using prespecified basis function expansion and using penalties to control their roughness. In contrast, CM uses data-driven basis and chooses the number of basis functions through the percentage of explained variance (PVE) of the functional predictors. This key difference allows our method to accommodate covariates of different types as well as nonlinear effects. (2) Our estimation approach is based on a single step penalized function-on-function regression while CM uses pointwise estimation based on functional principal component bases and thus requires fitting multiple generalized regressions. This nice feature leads to an computational advantage.

#### 3. Estimation

##### 3.1. Estimation via Penalized Log-Likelihood

Let be a set of equally spaced points in the range of the response variable, ’s. Conditioning on , we model as independently distributed Bernoulli variables with mean , where . The coefficients , , , and are estimated by minimizing the penalized log-likelihood criterion:where is the log-likelihood function of data , , , , and are smoothing parameters, which control the balance between the model fit and its complexity, and , , , and are all penalties.

There are several choices to define the penalty matrix in nonparametric regression; see Eilers and Marx [23] and Wood [26]. We use quadratic penalties which penalize the size of the curvature of the estimated coefficient functions. Let , where is of dimension with its element equal to . Similarly, , where is of dimension with its element equal to . As is a bivariate function, the choice of penalty implies penalizing the size of curvature in each direction, respectively: , where is of dimension with the element of equal to for some orthonormal spline bases. Similarly, , where is of dimension with the element of equal to .

Criterion (4) can be viewed as a penalized quasilikelihood (PQL) of the corresponding generalized linear mixed modelwhere is the generalized inverse matrix of ; and are defined similarly. Wood [26] discusses an alternative way to deal with the rank-deficient matrices in the context of restricted maximum likelihood (REML) estimation. Here we do not account for the dependence over ; see Scheipl et al. [22] for a general formulation. See also Ivanescu et al. [27] who use the mixed model representation of a similar regression model to (5), but with a Gaussian functional response. The smoothing parameters are estimated using REML.

##### 3.2. Extension to Nonlinear Model

One advantage of the proposed framework is that it can be easily extended to allow for more flexible effects, i.e., extending the ideas to accommodate multiple covariates, scalar or functional, and varied types of effects. In particular, the smooth effect can be replaced by and by , where and are unknown bivariate and trivariate smooth functions, respectively; see Scheipl et al. [22] and Kim et al. [28]. These changes require little additional computational effort. The modeling and estimation follow roughly similar ideas as Scheipl et al. [22]. We consider the nonlinear model in the simulation study for the case of having a scalar covariate only, i.e., , and the corresponding results are presented in Section S1.1 of the Supplementary Materials. The results show excellent prediction performance compared to the competitive nonlinear quantile regression method, namely, constrained B-spline smoothing (COBS) [29].

##### 3.3. Extension to Sparse and Noisy Functional Covariates

In practice the functional covariates are often observed at irregular times across the units and the measurements are possibly corrupted by noises. In such case, one needs to first smooth and denoise the trajectories before fitting. When the sampling design of the functional covariate is dense, the common approach is to smooth each trajectory using splines or local polynomial smoothing, as proposed in Ramsay and Silverman [3] and Zhang and Chen [30]. When the design is sparse, the smoothing can be done by pooling all the subjects and following the PACE method proposed in Yao et al. [31]. As recovering the trajectories has been extensively discussed in the literatures, we do not review the procedures here. Instead, we discuss some available computing resources that can be used to fit these methods. In our numerical study, we used fpca.sc function in the refund R package [32] for recovering the latent trajectories, irrespective of a sampling design (dense or sparse). Alternatively, one can use fpca.face [33] in refund for regular dense design and face.spare [34] in the R package face [35] for irregular sparse design. Once the latent trajectories are estimated, they can be used in the fitting criterion (4).

##### 3.4. Estimation of Conditional Quantile

Let , , , and be parameter estimates in (5). It follows that the estimated distribution function can be obtained by plugging in the estimated coefficients. The th conditional quantile is estimated by inverting the estimated distribution, i.e., . The estimated distribution function is not a monotonic function yet. In practice we suggest to first apply a monotonization method as described in Section 3.5 and then estimate the conditional quantiles by inverting the resulting estimated distribution.

##### 3.5. Monotonization and Implementation

While a conditional quantile function is nondecreasing, the resulting estimated quantiles may not be. Two approaches are widely used: one is to monotonize the estimated conditional distribution function and the other is to monotonize the estimated conditional quantile function. We choose the former as is readily available at dense grid points ’s. We use an isotonic regression model [36] for monotonization, which imposes an order restriction; this is done by using the R function isoreg. Other monotonization approaches include Chernozhukov et al. [37], which was employed in Kato [16].

Our approach is implemented by first creating an artificial binary response and then fitting a penalized function-on-function regression model and using the logit link function. Fitting models in (4) can be done by extending the ideas of Ivanescu et al. [27] for Gaussian functional response; the extension of the model to the non-Gaussian functional response has recently been studied and implemented by Scheipl et al. [22] as the pffr function in refund package [32].

#### 4. Simulation Study

##### 4.1. Simulation Setting

In this section we evaluate the empirical performance of the proposed method. We present results for the case when we have both functional and scalar covariates; additional results when there is only a single scalar or a single functional covariate are discussed in the Supplementary Materials, Section S1.

Suppose the observed data for the th subject are , , where , for . Let , where , for odd values of , for even values of , , , and . We assume three cases for generating response :(i)Gaussian: ; this corresponds to the quantile regression model , where is the distribution function of the standard normal;(ii)Mixture of Gaussians: + , where the true quantiles can be approximated numerically by using qnorMix function in the R package norMix;(iii)Gaussian with heterogeneous error: ; the true quantiles are given by = + .

Let , where for .

For each case, we use different combinations of signal to noise ratio (SNR), sample size, and sampling designs to generate simulated datasets. We define SNR as , and we consider five levels of noise: . Two levels of sample size are and . Two sampling designs are considered: (i)* sparse design*, where are randomly selected points from a set of equispaced grids in ; and (ii)* dense design*, where the sampling points are equispaced time points in .

The performance is evaluated on a test set of subjects, for which we have available, in terms of mean absolute error (MAE) for quantile levels , and :

##### 4.2. Competing Methods

We denote the proposed method by Joint QR to emphasize the single step estimation approach. We compare our method with two alternative approaches: (1) a variant of our proposed approach using pointwise estimation, denoted by Pointwise QR. This approach consists of fitting multiple regression models with binomial link function as implemented by the penalized functional regression pfr, developed by Goldsmith et al. [7], of the refund package for generalized scalar responses; (2) a modified version of the CM method, denoted by Mod CM, that we developed to account for additional scalar covariates and which fits multiple generalized linear models with scalar covariates and fPC scores as predictors; (3) a linear quantile regression approach using the quantile loss function and the partial quantile regression bases for functional covariates, proposed by Yu et al. [19] and denoted by PQR. Notice that although the formulation of the first two methods implicitly accounts for a varying effect of the covariates on the response distribution, they do not ensure that this effect is smooth. The third approach can only estimate a specific quantile rather than the entire conditional distribution. Note that all the competing methods are monotonized for a fair comparison.

The R function pfr can incorporate both scalar/vector and functional predictors by adopting a mixed effects model framework. The functional covariates are presmoothed by fPC analysis [31]. Throughout the simulation study we fix PVE as for fPC analysis to determine the number of principal components and use REML to select the smoothing parameters for our proposed methods. Other basis settings are set to their default values. We use 100 equally distanced points between the minimum and maximum of the observed ’s to set the grid for the conditional distribution function.

##### 4.3. Simulation Results

Tables 1 and 2 show the accuracy of the quantile prediction for the two cases (normal and mixture) when the functional covariate is observed sparsely and the sample size is (Table 1) and (Table 2). Table 3 presents the prediction accuracy for the case of heteroskedasticity with sparsely observed functional covariates. The results based on dense sampling design show similar patterns and thus are relegated to the supplement; see Section S1.3. The comparison of running times is presented in Table 4.