

Marick S. Sinay, John S. J. Hsu, "Bayesian Inference of a Multivariate Regression Model", Journal of Probability and Statistics, vol. 2014, Article ID 673657, 13 pages, 2014. https://doi.org/10.1155/2014/673657

Bayesian Inference of a Multivariate Regression Model

Revised 14 Oct 2014
Accepted 28 Oct 2014
Published 24 Nov 2014

Abstract

We explore Bayesian inference of a multivariate linear regression model using a flexible prior for the covariance structure. The commonly adopted Bayesian setup involves the conjugate prior: a multivariate normal distribution for the regression coefficients and an inverse Wishart specification for the covariance matrix. Here we depart from this approach and propose a novel Bayesian estimator for the covariance. A multivariate normal prior for the unique elements of the matrix logarithm of the covariance matrix is considered. Such structure allows for a richer class of prior distributions for the covariance, with respect to the strength of beliefs in prior location hyperparameters, as well as the added ability to model potential correlation amongst the covariance structure. The posterior moments of all relevant parameters of interest are calculated numerically via a Markov chain Monte Carlo procedure. The Metropolis-Hastings-within-Gibbs algorithm is invoked, with a proposal density constructed to closely match the shape of the target posterior distribution. As an application of the proposed technique, we investigate a multiple regression based upon the 1980 High School and Beyond Survey.

1. Introduction

The multivariate multiple regression model is a natural extension of the univariate multiple regression model. The key difference, as the name implies, is that the univariate response variable is replaced by a multivariate response vector. By utilizing the multivariate multiple regression model, the covariance of the response vector can be modeled. From an estimation standpoint, van der Merwe and Zidek  suggest an intrinsic role to be played by the covariance structure; in the case of separate univariate multiple regression models, by contrast, the covariance of the distinct response variables cannot be modeled. Although optimal point estimates of any linear combination of the means of the various response variables can still be obtained, an appropriate estimate of the variance of said estimator cannot be obtained without fully incorporating the covariance amongst the multivariate response vector. Under this framework, multivariate analysis is required to most appropriately produce an estimate of the standard error.

As a particular example, in educational testing data when multiple subject area exams are administered it is common practice to simply report the sum of the individual exam scores as a total score. For instance, the metric most commonly associated with the Scholastic Aptitude Test (SAT) is simply the sum of the student’s verbal score and the mathematical score. Other exams, such as the ACT exam, have even more than two subject areas and report a composite score which is the arithmetic average of the individual subject area scores. In these instances, multivariate analysis, by capturing the covariance amongst the various subject exams, is required to properly estimate the standard error of the final score.

Formal Bayesian analysis has long been used for multivariate multiple regression models . The inverse Wishart is widely used in this respect, since it is a conjugate prior distribution for the multivariate normal covariance matrix . Also see Dawid  for a general discussion of the inverse Wishart and Wishart distributions. However, in contrast to traditional Bayesian methods we will not make use of the standard inverse Wishart conjugate prior for the covariance matrix. The reason is that the inverse Wishart is a rather restrictive distribution in its ability to capture prior information that may be available to the practitioner. See Leonard and Hsu , Hsu et al. , and Sinay et al.  for a more detailed explanation of the disadvantages of the inverse Wishart as a prior distribution for the covariance matrix.

Leonard and Hsu  presented an alternative approach that remedies the shortcomings of the inverse Wishart and allows for greater flexibility in the prior specification. In a univariate normal model setting, the normal distribution has been used as a prior for the logarithm of the variance parameter. In this same vein, Leonard and Hsu  consider the matrix logarithm transformation of the covariance matrix for the multivariate case. Making use of a result from Bellman [10, page 171], it can be demonstrated that the exponential terms of a multivariate normal likelihood function can be expressed in the form of a Volterra linear integral equation. An approximation to a function that is proportional to the likelihood can then be obtained via Bellman’s iterative solution to the Volterra integral equation. The resulting approximation has a multivariate normal form, with respect to the unique elements of the matrix logarithm of the covariance matrix. This allows a normal prior specification to act as a conjugate prior distribution, thereby yielding an approximate normal posterior for the covariance structure.

One of the primary benefits of such a technique is the ability to specify varying degrees of confidence in each element of the normal prior hyperparameter mean vector via the variance terms of the prior hyperparameter covariance matrix. Obviously, larger variance terms in the prior hyperparameter covariance matrix indicate a lack of confidence in the corresponding prior location hyperparameter. Another chief advantage of this method is the ability to model beliefs about any possible interdependency between the covariance parameters. This can be accomplished by specifying the covariance terms of the prior hyperparameter covariance matrix. Note that in this way both the interrelationships and the strength of prior beliefs with respect to the covariance parameters can be modeled.

Bayesian estimates of all relevant parameters of interest are calculated using Markov chain Monte Carlo (MCMC) techniques. With respect to the covariance structure, since an approximate posterior distribution is used as a proposal density, we employ the Metropolis-Hastings-within-Gibbs (MHWG) algorithm [11, page 291], to correctly estimate the true target posterior density.

Having laid out the general outline we now move on to the body of the paper. We begin by introducing and defining the standard multivariate multiple regression model. We follow that by making a distributional assumption about the error matrix of the multivariate regression model. This provides us with a mechanism to state the likelihood function for the model. In turn, we then go through the formal analytical Bayesian derivations. Subsequent to the Bayesian analysis we outline the MCMC procedure and discuss how the posterior means and standard errors are numerically calculated. We conclude with an application to the High School and Beyond Survey .

2. Multivariate Multiple Regression Model

We consider the standard multivariate multiple regression model: Notationally, in matrix form we can succinctly write where is the matrix of response variables, is the matrix of explanatory variables, is the matrix of unknown regression coefficients, and is the matrix of random errors. In Section 2.1 we introduce the matrix normal distribution and make a general assumption about the distribution of the random error matrix in (2). The matrix normal representation will greatly facilitate the Bayesian analysis that follows. Based upon the matrix normal, we proceed in Section 2.2 to develop the joint likelihood function for and . Hierarchical prior specifications are discussed in Section 2.3 and the joint posterior distribution is reported in Section 2.4.

2.1. Distributional Assumptions and Matrix Normal Distribution

The matrix normal distribution is closely related to, and is a generalization of, the multivariate normal. In particular, the random matrix , if and only if, the random vector , where denotes the dimensional matrix normal distribution, is a location matrix, is a first covariance matrix, and is a second covariance matrix [13, page 54]. and are the standard vector operator and Kronecker product, respectively.

We make the distributional assumption that, conditional on the covariance matrix , the random error matrix , in (2), follows a matrix normal with zero mean matrix and covariance matrices given by and , where is a identity matrix. Formally, we have , or equivalently, the error terms are independent and identically distributed normal random vectors each with mean vector and covariance matrix . The probability density function for the error matrix is given by where is the standard trace function.
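As an illustrative sketch, the matrix normal error assumption can be simulated by drawing iid rows, since a matrix normal with identity first covariance means the rows of the error matrix are iid multivariate normal with covariance Sigma. The example covariance matrix below is an assumption for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3

# an assumed example covariance matrix Sigma (p x p, positive definite)
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.5, 0.5],
                  [0.3, 0.5, 1.0]])

# E ~ MatrixNormal(0, Sigma, I_n): the rows of E are iid N_p(0, Sigma).
L = np.linalg.cholesky(Sigma)
E = rng.standard_normal((n, p)) @ L.T   # each row e_i = L z_i, so Cov(e_i) = Sigma

# sanity check: the sample covariance of the rows should approximate Sigma
S = E.T @ E / n
```

With n = 500 rows, the sample covariance S typically agrees with Sigma to within a few tenths elementwise.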

2.2. Likelihood Function for Conditional on

From the multivariate multiple regression model (2) we can write . Therefore, the joint likelihood function for and is given by the following:

For a given value of let . Note that is a symmetric almost surely positive definite matrix. Then the likelihood function (4) for conditional on can be written as In Bayesian analysis for a univariate normal model, the logarithm of the variance parameter has been modeled by a univariate normal prior distribution. In a multivariate setting the matrix logarithm of a covariance matrix has also been investigated by Chiu et al. . Along these same lines, we consider the matrix logarithm of and : where is a orthonormal matrix whose columns are normalized eigenvectors and is a diagonal matrix of the corresponding normalized eigenvalues associated with . and are defined analogously for . Using the fact that from (6) and noting that , we can express the exact likelihood function (5) in the following equivalent fashion: We now define the following unconventional matrix operator . Let be the element in the th row and th column of the matrix , and then where is a vector and . We analogously define , which will appear again in Section 3.4. Moving forward we will use , which captures the unique elements of the matrix logarithm of the covariance matrix, to model the covariance structure.
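A minimal numerical sketch of the matrix logarithm transformation via the spectral decomposition described above. The `vech` half-vectorization below is a stand-in for the paper's unconventional operator for extracting the unique elements (the exact operator definition was lost in extraction and may differ, e.g., in the ordering or scaling of off-diagonal terms):

```python
import numpy as np

def matrix_log(Sigma):
    """Matrix logarithm of a symmetric positive definite matrix via
    its spectral decomposition Sigma = E diag(lam) E'."""
    lam, E = np.linalg.eigh(Sigma)
    return E @ np.diag(np.log(lam)) @ E.T

def vech(A):
    """Stack the unique (lower-triangular) elements of a symmetric matrix."""
    idx = np.tril_indices(A.shape[0])
    return A[idx]

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.5]])
alpha = vech(matrix_log(Sigma))   # q = p(p+1)/2 unique elements

# the transformation is invertible: the matrix exponential recovers Sigma
lam, E = np.linalg.eigh(matrix_log(Sigma))
Sigma_back = E @ np.diag(np.exp(lam)) @ E.T
```

The invertibility check is the key property: modeling the unique elements of the matrix logarithm imposes no constraints, yet always maps back to a valid positive definite covariance matrix.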

2.3. Prior Distributional Specifications

We will assume a priori that is independent of . More specifically, with respect to we make the assumption of a uniform prior distribution: Note that the uniform prior assumption for can be viewed as a limiting case of the informative multivariate normal prior specification, which this modeling approach could accommodate.

We will assume a priori that, given and , follows a dimensional normal distribution with mean location hyperparameter vector and covariance hyperparameter matrix . The multivariate normal provides a very rich and flexible family of prior distributions for the matrix logarithm of the covariance structure. This adds far greater flexibility than the conventional inverse Wishart prior specification. Since the multivariate normal is fully parameterized by a mean vector and covariance matrix, we have the ability to model more complex prior information. In particular, we can specify different prior mean values for each element of via the elements of the location hyperparameter . Moreover, we have the ability to model varying degrees of strength of the prior belief in each of the elements of through the diagonal elements of the covariance hyperparameter matrix . Additionally, with the multivariate normal prior we are able to model potential interdependency among the elements of because we can specify nontrivial covariance terms in the covariance hyperparameter matrix. That is, the off diagonal elements of can be used to specify any potential correlations amongst the elements of .

We are now able to craft a more complex and accurate prior specification for the covariance structure. A subjective Bayesian may in fact wish to specify all hyperparameters. In this way, the practitioner can fully take advantage of any relevant prior information through use of the flexible multivariate normal prior specification for the covariance structure. Alternatively, we can opt to model and , where and are of smaller order than and , respectively. That is, a priori we may wish to only model certain subsets of the covariance structure. An obvious choice is to consider the variance components as one subset and the covariance components as another. However, we stress the point that the fully general multivariate normal prior specification can be utilized in its totality.

Here we will consider the intraclass matrix form for the prior specification as an example of the fully generalized multivariate normal prior distribution. Specifically, we will consider the first elements of separate from the remaining terms. That is, we wish to model the variance components separately from the covariance components of . Formally, we assume for the prior distribution. We have the following prior distributional form: where the vector and Note that is a matrix, whose first elements of the first column are equal to one and the remaining terms of the first column are equal to zero. The second column of consists of the first elements equal to zero and the remaining elements equal to one. In , and are indicator matrices of dimension and , respectively.

Thus, and are the location and variance hyperparameters, respectively, for the variance components of . Analogously, and are the location and variance hyperparameters for the covariance components of . In this way we can specify two different location hyperparameters and two levels of confidence.
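Under the intraclass prior form above, the indicator matrix mapping the two location hyperparameters to the prior mean vector can be built mechanically. The names `W`, `theta`, `sig1_sq`, and `sig2_sq` below are illustrative stand-ins for the (extraction-lost) symbols in the displayed prior, and the numeric values are assumptions:

```python
import numpy as np

p = 3                     # dimension of the response
q = p * (p + 1) // 2      # number of unique elements of the matrix log

# W maps the two location hyperparameters (one for the p variance
# components, one for the q - p covariance components) to a q-vector
# of prior means: the prior mean of the unique elements is W @ theta.
W = np.zeros((q, 2))
W[:p, 0] = 1.0            # variance components share the first location
W[p:, 1] = 1.0            # covariance components share the second location

# prior covariance: two variance hyperparameters on the block diagonal,
# expressing two different levels of confidence
sig1_sq, sig2_sq = 1.0, 0.5    # assumed example values
D = np.diag(np.concatenate([np.full(p, sig1_sq),
                            np.full(q - p, sig2_sq)]))
```

Each row of W has exactly one nonzero entry, so every element of the covariance structure is assigned to exactly one of the two hyperparameter groups.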

For the hyperparameters, and , we will assume the following vague prior distribution: Note here that the uniform prior specification can be viewed as a limiting case of a multivariate normal and inverse Wishart prior specification for and , respectively. Furthermore, the analysis could in fact accommodate such nontrivial specifications quite easily. Having stated all the prior distributional assumptions we now turn to the posterior Bayesian analysis. We begin this by first examining the exact joint posterior distribution.

2.4. Exact Joint Posterior Distribution

The exact joint posterior distribution for all parameters and hyperparameters will be proportional to the product of (4) the exact likelihood function, (9) the prior distribution for , (10) the prior distribution for , and (12) the prior distribution for and . Note that here we will use interchangeably with : The term in (13) can be integrated out, so that where .

We clearly see that the exact joint posterior distribution is in fact not tractable. This is the driving motivation behind the implementation of the numerical MHWG sampling techniques. Rather than working with the cumbersome exact joint posterior distribution it is much easier to consider the so-called full conditional distributions for each of the parameters and hyperparameters.

3. Markov Chain Monte Carlo Approach

As already noted, the joint posterior distribution in (14) is not analytically tractable with respect to drawing inference for the relevant parameters and hyperparameters. This will give rise to our consideration in the subsequent subsections of the full conditional posterior distributions. The conditional posterior distributions we derive below will provide the framework for the MHWG sampling techniques.

3.1. Exact Conditional Posterior Distribution for

Recall that is the maximum likelihood estimator of . Furthermore, we can rewrite the exact likelihood function (4) as Therefore, the posterior distribution for conditional on will be proportional, with respect to only the terms involving , to the product of (15) the exact likelihood function multiplied by (9) the prior distribution for : We recognize that the posterior distribution of conditional on is of a matrix normal form. Making use of the relationship between the matrix normal and the multivariate normal distributions as stated above in Section 2.1, we have
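The matrix normal conditional posterior for the coefficient matrix can be sampled directly. The sketch below assumes the posterior is matrix normal centered at the maximum likelihood estimator with row covariance (X'X)⁻¹ and column covariance Sigma, which is consistent with the uniform prior on the coefficients; the explicit matrices in (17) did not survive extraction, so this form is an assumption:

```python
import numpy as np

def draw_B(X, Y, Sigma, rng):
    """One Gibbs draw of the coefficient matrix from the matrix normal
    conditional posterior MN(B_hat, (X'X)^{-1}, Sigma), with B_hat the
    maximum likelihood estimator (uniform prior on the coefficients)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    B_hat = XtX_inv @ X.T @ Y          # maximum likelihood estimator
    Lr = np.linalg.cholesky(XtX_inv)   # row (regressor) covariance factor
    Lc = np.linalg.cholesky(Sigma)     # column (response) covariance factor
    Z = rng.standard_normal(B_hat.shape)
    return B_hat + Lr @ Z @ Lc.T

# small synthetic example; the dimensions are purely illustrative
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))
Y = rng.standard_normal((50, 3))
B_draw = draw_B(X, Y, np.eye(3), rng)
```

Averaging many such draws recovers the maximum likelihood estimator, as the matrix normal form implies.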

3.2. Exact Conditional Posterior Distribution for

The joint prior distribution for , , and is given by the product of (10) the prior distribution for and (12) the joint prior distribution for and . We can derive the joint prior distribution for just and by integrating over the joint prior distribution for , , and with respect to . Upon completion of the integration we have the following joint prior distribution for and : where and is a identity matrix. By integrating out we will help to facilitate the MCMC procedure both in terms of speed and simplification of the algorithm.

The exact posterior distribution will be proportional to the product of (7) the exact likelihood function, multiplied by (18) the joint prior distribution for and . Note that here we will use interchangeably with : Note that the above exact posterior distribution is not of a known form and cannot be directly simulated from in an easy fashion. This will motivate us to use a proposal density that closely matches this target density in a MHWG sampling routine.

3.3. Exact Conditional Posterior Distribution for

The exact posterior distribution for conditional on and will be proportional to (18) the joint prior distribution for and . Note that the exact likelihood function (7) does not depend upon and thus can be omitted entirely: where and are the arithmetic means of the variance and covariance components of , respectively. We recognize that the posterior distributions for and conditional on are independent Inverse Gamma random variables: This result is intuitive and theoretically appealing in that the posterior distribution for , the variance hyperparameter for the variance components of , depends only on the number of variance terms and , whereas the posterior distribution for , the variance hyperparameter for the covariance components of , depends only on the number of covariance terms and . This draws out the point of modeling the variance components separate from the covariance components. In addition, the Inverse Gamma is highly tractable and lends itself to the numerical procedures in the subsequent section.
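The Inverse Gamma full conditionals are straightforward to simulate. The exact shape and scale parameters in (22) and (23) did not survive extraction, so the sketch below shows only the generic mechanism (an inverse gamma draw as the reciprocal of a gamma draw), with assumed example parameter values:

```python
import numpy as np

def draw_inverse_gamma(shape, scale, rng, size=None):
    """Draw from InverseGamma(shape, scale): if G ~ Gamma(shape, rate=scale),
    then 1/G ~ InverseGamma(shape, scale).  numpy's gamma is parameterized
    by scale, so rate=scale corresponds to gamma scale 1/scale."""
    return 1.0 / rng.gamma(shape, 1.0 / scale, size=size)

rng = np.random.default_rng(2)
sig1_sq = draw_inverse_gamma(3.0, 4.0, rng)   # illustrative shape/scale values
```

For shape a and scale b with a > 1, the draws have mean b/(a - 1), which gives a quick check of the sampler.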

3.4. Approximate Conditional Posterior Distribution for

In order to implement the MHWG sampling algorithm, we have derived all the full conditional posterior distributions , , and in (17), (20), (22), and (23), respectively. However, we clearly see that the simulation of based upon the true conditional posterior distribution in (20), the target distribution, is not tractable. Therefore, the Metropolis-Hastings algorithm is employed and a proposal density needs to be constructed. The algorithm works best if the proposal density closely matches the shape of the target distribution.

We can construct an approximation to a function that is proportional to the likelihood function by utilizing the linear Volterra integral equation. The key advantage of this is that the approximation can be written as a multivariate normal with respect to the unique elements of the matrix logarithm of the covariance matrix. This allows a multivariate normal to act as a conjugate prior. Hence, we have a multivariate normal posterior that is a good proxy for the true posterior, and the proposal can be easily simulated from. Interested readers should refer to Leonard and Hsu  and Hsu et al.  for a detailed exposition of how this is performed.

Leonard and Hsu  show how we can use Bellman’s solution of a Volterra integral equation [10, page 171] to derive the following approximation that is proportional to the likelihood function of given and : Recall from Section 2.2 that , where is as defined in (6). The symmetric almost surely positive definite matrix is the likelihood information matrix of and is a function of the normalized eigenvalues and normalized eigenvectors of . In particular, where and where and for are the th normalized eigenvalue and eigenvector, respectively, of . denotes the vector that satisfies the condition . We see that the approximate likelihood function (24) is of a multivariate normal form with respect to . This functional form of the approximate likelihood function in (24) will be the driving mechanism in the Bayesian analysis for . Again, for details of the derivation of the approximate likelihood function please refer to Leonard and Hsu  and Hsu et al. .

Equation (24) provides an excellent approximation to a function that is proportional to the exact likelihood (7) when the sample size is large. To illustrate the effects of , the sample size, and , the dimension of the covariance matrix , we conduct the following exercise. Without loss of generality, we consider a simplified model, when is a random sample from . In our illustrative example, three dimensional sizes (, 5, and 10), and four sample sizes (, 100, 500, and 5000) were considered for comparison. For a fair comparison, the same sample covariance is used for all four different sample sizes, for each . The sample covariance matrix is assumed to consist of elements for the th row and th column, where . For example, when , then

The histograms, in Figure 1, represent the exact likelihood from (7) of the (1, 1)th element, , of , where the histogram is normalized for comparison purposes, and the dotted curves represent the univariate normal densities based on approximation (24). Please note that these histograms are computed according to an importance sampling method using (24) as the importance function. For an overview of importance sampling methods, please see, for example, Rubinstein  and Leonard et al. . It can be seen from Figure 1 that the approximation is better when the sample size is bigger and the dimension size is smaller. Similar patterns were found for other variance and covariance elements of the covariance matrix . The approximation is fairly accurate when is 500 or greater.
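The self-normalized importance sampling estimate referred to here can be sketched generically. The one-dimensional target and proposal densities below are toy stand-ins, not the densities in (7) and (24); the function and variable names are assumptions for illustration:

```python
import numpy as np

def posterior_mean_is(log_target, log_proposal, draws):
    """Self-normalized importance sampling estimate of the target mean,
    using draws simulated from the proposal density (both log densities
    may be known only up to additive constants)."""
    logw = log_target(draws) - log_proposal(draws)
    w = np.exp(logw - logw.max())   # subtract the max for numerical stability
    w /= w.sum()
    return np.sum(w * draws)

rng = np.random.default_rng(3)
draws = rng.normal(0.0, 2.0, size=200_000)     # proposal N(0, 2^2)
log_q = lambda x: -0.5 * (x / 2.0) ** 2        # proposal log density, up to a constant
log_p = lambda x: -0.5 * (x - 1.0) ** 2        # toy target N(1, 1), up to a constant
est = posterior_mean_is(log_p, log_q, draws)
```

Because the weights are normalized, the unknown constants of proportionality in both densities cancel, which is exactly why the method suits a likelihood known only up to proportionality.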

The approximate joint posterior distribution for and conditional on will be proportional to the product of the approximate likelihood function (24) and the joint prior distribution (18):

Completing the square for the two terms in the exponent of (28) yields the following approximate posterior distribution for conditional on and : where the vector . Recall that and are as defined in (25) and (19), respectively. Thus, we have the following approximate posterior distribution for conditional on and : This demonstrates the conjugacy of utilizing the approximate likelihood function. Equation (30) provides an efficient proposal distribution for implementing a MHWG algorithm. In short, we have developed a highly flexible while at the same time tractable Bayesian methodology for the covariance structure.

3.5. Metropolis-Hastings within Gibbs Sampling Procedure

Based upon the theoretical results derived above we outline the following procedure for implementing the MHWG algorithm. Specifically, from (17), (30), (20), (22), and (23) we have a formal setup for implementing a MCMC procedure with a Metropolis-Hastings step. The specific steps are as follows.

(1) Simulate and from (22) and (23), respectively. Initial starting values for may be set equal to , where and is the maximum likelihood estimator for . Subsequent simulations of and will be based upon simulated values of .

(2) Simulate from (17). Initial starting values for may be based upon . Subsequent simulations of will be based upon simulated values of .

(3) Simulate a candidate value from (30) based upon and and from steps (1) and (2), respectively. Then let where is given by the following expression: and and are as defined in (29) and (20), respectively.

The last procedure, in step (3), is the Metropolis-Hastings algorithm. We employ this procedure since we are utilizing an approximation to the exact posterior distribution [11, page 291].
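The acceptance step can be sketched as a generic independence Metropolis-Hastings update. The toy one-dimensional target and proposal below are stand-ins for the true conditional posterior (20) and the approximate posterior proposal (30); all names are illustrative:

```python
import numpy as np

def mh_step(curr, log_target, log_proposal, propose, rng):
    """One independence Metropolis-Hastings update: accept the candidate
    with probability min(1, pi(cand) q(curr) / (pi(curr) q(cand)))."""
    cand = propose(rng)
    log_ratio = (log_target(cand) - log_target(curr)
                 + log_proposal(curr) - log_proposal(cand))
    if np.log(rng.uniform()) < log_ratio:
        return cand
    return curr

# toy illustration: target N(0, 1), slightly wider proposal N(0, 1.5^2)
rng = np.random.default_rng(4)
log_p = lambda x: -0.5 * x * x
log_q = lambda x: -0.5 * (x / 1.5) ** 2
propose = lambda r: r.normal(0.0, 1.5)

x, chain = 0.0, []
for _ in range(20_000):
    x = mh_step(x, log_p, log_q, propose, rng)
    chain.append(x)
```

When the proposal closely matches the target, as the approximate posterior is designed to do, the acceptance rate is high and the chain mixes rapidly.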

Posterior moments for any parameter of interest can be calculated easily from the MHWG results. It is usually the case in MHWG estimation procedures that the first simulations are not included in the estimates, due to the fact that the Markov chain has not yet reached a steady state. In the parlance of numerical procedures, this count is usually referred to as the number of burn-in iterations.

In addition to the burn-in value, we also explored the potential need to perform a so-called thinning procedure. A thinning step entails retaining only every th simulated value of the MHWG sampling, where is chosen large enough so that any autocorrelation is removed. By examining the sample autocorrelation function (ACF) plots, practitioners can decide if thinning is necessary. We investigated several plots for numerous parameters. For illustrative purposes, in Figure 2, we present the sample ACF plots for , , , and . We can see that autocorrelation is not a significant concern in our particular application. The sample ACF plots for numerous other parameters looked quite similar. Practitioners should explore the need for a thinning procedure in their specific applications.
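Burn-in, thinning, and the sample ACF diagnostic described above can be sketched as follows (the white-noise chain is a stand-in for actual MHWG output):

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation function of a simulated chain."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([1.0] + [np.dot(x[:-k], x[k:]) / denom
                             for k in range(1, max_lag + 1)])

def burn_and_thin(chain, n_burn, m):
    """Discard the first n_burn draws, then keep every m-th draw."""
    return np.asarray(chain)[n_burn::m]

rng = np.random.default_rng(5)
chain = rng.standard_normal(10_000)          # white-noise stand-in chain
acf = sample_acf(chain, max_lag=20)
kept = burn_and_thin(chain, n_burn=2_000, m=10)
```

If the ACF is already negligible at small lags, as in Figure 2, thinning (m > 1) is unnecessary and only discards information.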

4. Application: High School and Beyond Survey

In 1980 the National Education Longitudinal Studies program of the National Center for Education Statistics administered the High School and Beyond (HSB) Survey . The HSB study contains both a 1980 senior class cohort and a 1980 sophomore class cohort. Within the sophomore class cohort we have a total sample size of 14,667 students. The HSB study has been analyzed extensively by many, for example, Astone and McLanahan , Grogger , St. John , and Zwick and Sklar .

4.1. Description of Data

The HSB study contains a myriad of data and variables. In particular, for the sophomore class cohort a total of seven exams were administered in the areas of vocabulary, reading, mathematics (two exams), science, writing, and civics. As is often the case with educational testing data, the test scores were standardized to have a mean of fifty and a standard deviation equal to ten. We will use these standardized test scores for the sophomore class cohort as our multivariate response variables.

Grade point average data was also obtained from official transcripts that were included in the survey. In addition, a number of demographic variables were collected on both the school and individual student levels. These variables included school type, relative urbanization of school environment, geographic region of the country, gender, and race. All of these variables will serve as categorical or qualitative explanatory variables in the multivariate multiple regression. Table 1 provides a description of the categorical explanatory variables as well as the associated number of students per category. Note that the original HSB study did not include a school type four.

Categorical variable | Level | Level description | Students per level
School type | School type 0 | Regular public | 9534
School type | School type 1 | Alternative | 437
School type | School type 2 | Cuban/Hispanic public | 156
School type | School type 3 | Other Hispanic public | 1483
School type | School type 5 | Regular catholic | 1341
School type | School type 6 | Black catholic | 758
School type | School type 7 | Cuban/Hispanic catholic | 216
School type | School type 8 | Private elite | 294
School type | School type 9 | Private nonelite | 448
Urbanization | Urban 1 | Urban | 3451
Urbanization | Urban 2 | Suburban | 7325
Urbanization | Urban 3 | Rural | 3891
Geographic region | Region 1 | New England | 736
Geographic region | Region 2 | Middle Atlantic | 2518
Geographic region | Region 3 | South Atlantic | 2178
Geographic region | Region 4 | East South Central | 713
Geographic region | Region 5 | West South Central | 1750
Geographic region | Region 6 | East North Central | 2931
Geographic region | Region 7 | West North Central | 1128
Geographic region | Region 8 | Mountain | 696
Geographic region | Region 9 | Pacific | 2017
Race | Race 1 | Hispanic/Spanish | 3122
Race | Race 2 | American Indian/Alaskan native | 224
Race | Race 3 | Asian/Pacific Islander | 345
Race | Race 4 | Black | 2050
Race | Race 5 | White | 8880
Race | Race 6 | Other | 46
Gender | Gender 1 | Male | 7265
Gender | Gender 2 | Female | 7402

As can be expected with educational data there was some moderate degree of missing data. We employed a data augmentation technique for missing data imputation. The specific procedure was invoked using the  mice library  in R. This particular data augmentation procedure fits nicely within the context of our research here since it employs a multivariate imputation by chained equations technique. That is, it uses a multivariate Gibbs sampler procedure to augment the missing data set. In particular, incomplete columns of data, that is, variables with missing data, are augmented by generating appropriate values of data given the values of the other columns of variables.

4.2. Treatment Contrast for Categorical Variables

Characterization of the explanatory data or design matrix to account for categorical explanatory variables is not unique [22, page 173]; there are several contrast methods. In our analysis, school type zero, which corresponds to a regular public school, will serve as the base level and will not in fact have its own regressor. In the same fashion, urban will serve as the base level for the relative urbanization categorical explanatory variable. Also, the New England region will act as the base level for the geographic variable. Hispanic or Spanish is the base level for the race identifier variable. Finally, we will designate male as the base gender level. Thus, after properly accounting for the base levels in our particular application .
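Treatment-contrast coding of a categorical variable, with the base level absorbed into the intercept, can be sketched as follows (the mini data set and function name are hypothetical):

```python
import numpy as np

def treatment_dummies(levels, base):
    """Treatment-contrast coding: one 0/1 column per non-base level;
    the base level carries no regressor of its own and is absorbed
    into the intercept."""
    levels = np.asarray(levels)
    cats = [c for c in np.unique(levels) if c != base]
    cols = np.column_stack([(levels == c).astype(float) for c in cats])
    return cols, cats

# hypothetical mini-example: three school types with type 0 as the base level
school = np.array([0, 1, 2, 0, 2])
X_school, kept_levels = treatment_dummies(school, base=0)
```

Each non-base coefficient is then interpreted as a contrast against the base level, e.g., against regular public schools for the school type variable.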

4.3. Classical Multivariate Multiple Regression Results

We consider the standard multivariate regression model described in (2) for the HSB survey data within the sophomore class cohort. A total of 14,667 students in that cohort participated in the study. We estimated the model for the seven standardized exams VOCAB, READ, MATH 1, MATH 2, SCI, WRITE, and CIVIC regressed on GPA and the other explanatory variables described in the previous section. Most of the regression coefficient estimates are highly significant. In particular, GPA is highly significant for all seven exams, although certain levels of various categorical variables are not significant for some of the response variables. All of the associated statistics are highly significant for all response variables. Moreover, based upon the analysis of variance that was performed, we can conclude that nearly all the explanatory variables are highly significant for all seven subject area exams.

4.4. Posterior Estimates of the Model Parameters

In Tables 2, 3, 4, 5, and 6 we present the Bayesian posterior means and the associated standard errors for the matrices and . In practice, for the MCMC procedure we found that a total iteration size of 500,000 was quite sufficient to establish convergence, and we used 2,000 as the burn-in value.

 | VOCAB | READ | MATH 1 | MATH 2 | SCI | WRITE | CIVIC
(Intercept) | 37.0582 | 35.6555 | 33.6377 | 37.7119 | 37.3335 | 31.2303 | 37.9283
GPA | 4.7810 | 5.2711 | 5.9424 | 4.8139 | 4.4674 | 5.7163 | 4.3733
School type 1 | −1.1971 | −0.3382 | −1.0521 | −0.5196 | −1.1388 | −0.9455 | −0.5326
School type 2 | −0.8747 | −1.9121 | −0.6682 | −0.4325 | −2.4736 | −1.4395 | −2.4039
School type 3 | −1.0713 | −0.9146 | −0.3638 | −0.3816 | −1.1779 | −0.9105 | −0.7742
School type 5 | 2.7359 | 2.1862 | 2.3550 | 1.5275 | 0.7704 | 2.5247 | 2.1433
School type 6 | 1.8095 | 1.8877 | 0.9502 | 0.2953 | 0.2920 | 2.0570 | 2.1941
School type 7 | 3.0532 | 1.6308 | 1.5283 | 0.0003 | 0.5107 | 2.1525 | 1.7716
School type 8 | 10.3236 | 8.7564 | 8.8743 | 8.7145 | 5.8165 | 7.0452 | 5.5020
School type 9 | 2.9475 | 1.9583 | 2.9381 | 2.9743 | 2.1760 | 1.9975 | 1.7927
Urban 2 | 0.5558 | 0.0821 | 0.6002 | 0.9824 | 0.7407 | 0.5115 | 0.2248
Urban 3 | −0.8006 | −0.5359 | −0.3611 | −0.2156 | 0.4344 | 0.0087 | −0.3363
Region 2 | −1.0396 | −0.7905 | −0.3661 | −1.0051 | −0.4256 | −0.6938 | −0.9750
Region 3 | −2.2086 | −1.1256 | −1.6328 | −1.7664 | −1.4603 | −1.1176 | −1.7760
Region 4 | −3.9350 | −1.8827 | −2.4900 | −2.8335 | −2.0910 | −1.3902 | −2.3088
Region 5 | −2.6954 | −1.5560 | −1.5315 | −2.1349 | −1.1494 | −0.8252 | −1.6827
Region 6 | −2.1648 | −1.5579 | −0.9690 | −1.7886 | −0.6775 | −0.5598 | −1.5524
Region 7 | −1.9171 | −0.4627 | 0.1163 | −0.3902 | 0.2802 | 0.2396 | −0.3970
Region 8 | −1.6480 | −0.9278 | −1.6167 | −2.4756 | −0.6947 | −0.4478 | −1.4824
Region 9 | −0.4866 | −0.7586 | −0.5338 | −1.0931 | −0.6208 | −0.0413 | −1.5015
Gender 2 | −1.5605 | −1.2414 | −1.8966 | −1.5490 | −3.0722 | 3.0841 | 0.2048
Race 2 | −0.9813 | 0.5513 | −0.8029 | −0.2627 | 0.5303 | 0.8640 | −0.9432
Race 3 | 1.7286 | 1.7006 | 4.9363 | 3.9146 | 2.4132 | 3.5702 | 1.3489
Race 4 | −0.8534 | 0.0337 | −0.6765 | −0.5625 | −1.3975 | −0.3582 | 0.4884
Race 5 | 4.5403 | 3.7077 | 3.8777 | 2.3674 | 4.5873 | 3.9272 | 2.7858
Race 6 | −0.5276 | 0.3656 | −0.0162 | −0.1581 | 0.4371 | 0.3194 | 0.9498
Table 3: Posterior standard errors of the regression coefficients.

                  VOCAB    READ     MATH 1   MATH 2   SCI      WRITE    CIVIC
(Intercept)      0.4421   0.4482   0.4195   0.4585   0.4383   0.4201   0.4678
GPA              0.0920   0.0930   0.0872   0.0951   0.0913   0.0874   0.0972
School type 1    0.4327   0.4379   0.4102   0.4483   0.4297   0.4110   0.4568
School type 2    0.7111   0.7199   0.6736   0.7352   0.7047   0.6761   0.7518
School type 3    0.2797   0.2823   0.2651   0.2902   0.2773   0.2662   0.2954
School type 5    0.2561   0.2589   0.2428   0.2659   0.2540   0.2438   0.2707
School type 6    0.3358   0.3392   0.3185   0.3480   0.3327   0.3193   0.3553
School type 7    0.6116   0.6187   0.5786   0.6331   0.6054   0.5806   0.6452
School type 8    0.5164   0.5229   0.4901   0.5359   0.5125   0.4912   0.5457
School type 9    0.4171   0.4222   0.3963   0.4325   0.4138   0.3972   0.4414
Urban 2          0.1864   0.1882   0.1764   0.1927   0.1844   0.1769   0.1964
Urban 3          0.2135   0.2158   0.2020   0.2206   0.2114   0.2027   0.2253
Region 2         0.3623   0.3672   0.3441   0.3759   0.3593   0.3438   0.3825
Region 3         0.3706   0.3759   0.3520   0.3848   0.3681   0.3524   0.3915
Region 4         0.4512   0.4574   0.4286   0.4678   0.4480   0.4290   0.4774
Region 5         0.3877   0.3928   0.3681   0.4020   0.3850   0.3686   0.4097
Region 6         0.3538   0.3587   0.3358   0.3669   0.3513   0.3365   0.3743
Region 7         0.4056   0.4103   0.3846   0.4208   0.4022   0.3858   0.4286
Region 8         0.4647   0.4709   0.4411   0.4819   0.4614   0.4426   0.4917
Region 9         0.3783   0.3830   0.3586   0.3925   0.3753   0.3598   0.3998
Gender 2         0.1421   0.1436   0.1347   0.1470   0.1411   0.1351   0.1500
Race 2           0.5944   0.6018   0.5630   0.6159   0.5897   0.5652   0.6273
Race 3           0.4938   0.4996   0.4679   0.5114   0.4903   0.4692   0.5216
Race 4           0.2599   0.2639   0.2467   0.2697   0.2579   0.2470   0.2752
Race 5           0.2061   0.2085   0.1952   0.2136   0.2042   0.1959   0.2175
Race 6           1.2666   1.2806   1.1995   1.3091   1.2548   1.2040   1.3359
Table 4: Posterior means of the covariance matrix.

          VOCAB     READ      MATH 1    MATH 2    SCI       WRITE     CIVIC
VOCAB    72.3406   45.3069   34.3079   25.0339   39.1280   35.7151   32.6781
READ     45.3069   74.0616   37.4238   29.2657   40.8084   36.8560   34.3226
MATH 1   34.3079   37.4238   65.0307   39.9597   36.9083   34.2467   27.4186
MATH 2   25.0339   29.2657   39.9597   77.6078   28.6053   25.5057   19.7419
SCI      39.1280   40.8084   36.9083   28.6053   71.1835   34.7257   31.7008
WRITE    35.7151   36.8560   34.2467   25.5057   34.7257   65.3904   31.9590
CIVIC    32.6781   34.3226   27.4186   19.7419   31.7008   31.9590   80.8041
Table 5: Classical frequentist estimate of the covariance matrix.

          VOCAB     READ      MATH 1    MATH 2    SCI       WRITE     CIVIC
VOCAB    72.3308   45.3567   34.2973   24.9684   39.1487   35.7217   32.6627
READ     45.3567   74.0529   37.4264   29.2202   40.8335   36.8651   34.3098
MATH 1   34.2973   37.4264   64.9922   39.9886   36.9119   34.2393   27.3645
MATH 2   24.9684   29.2202   39.9886   77.6275   28.5646   25.4523   19.6396
SCI      39.1487   40.8335   36.9119   28.5646   71.1592   34.7254   31.6765
WRITE    35.7217   36.8651   34.2393   25.4523   34.7254   65.3498   31.9434
CIVIC    32.6627   34.3098   27.3645   19.6396   31.6765   31.9434   80.8225
Table 6: Posterior standard errors for the elements of the covariance matrix.

          VOCAB    READ     MATH 1   MATH 2   SCI      WRITE    CIVIC
VOCAB    0.8456   0.7119   0.6329   0.6507   0.6759   0.6399   0.6872
READ     0.7119   0.8657   0.6514   0.6705   0.6886   0.6513   0.6990
MATH 1   0.6329   0.6514   0.7604   0.6731   0.6389   0.6082   0.6393
MATH 2   0.6507   0.6705   0.6731   0.9069   0.6567   0.6251   0.6742
SCI      0.6759   0.6886   0.6389   0.6567   0.8316   0.6326   0.6779
WRITE    0.6399   0.6513   0.6082   0.6251   0.6326   0.7647   0.6556
CIVIC    0.6872   0.6990   0.6393   0.6742   0.6779   0.6556   0.9425

From Table 4 we observe that the posterior means for the variance and covariance components are each slightly pulled towards central quantities. To better illustrate this shrinkage property of the posterior mean we present the classical frequentist estimate of the covariance matrix in Table 5. Making the element-wise comparison between the posterior mean in Table 4 and the classical frequentist estimate in Table 5, we see that, among the diagonal elements (variances), relatively smaller elements of the classical frequentist estimate are pulled up and relatively larger elements are pulled down. For example, the estimated variance for MATH 1 moved up from 64.9922 in Table 5 to 65.0307 in Table 4, while the estimated variance for CIVIC moved down from 80.8225 in Table 5 to 80.8041 in Table 4. A similar phenomenon appears for the off-diagonal elements (covariances). This is due to the fact that we have assumed the intraclass matrix form for the prior specification of the covariance matrix.
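The intraclass (compound-symmetry) form has a single common diagonal value and a single common off-diagonal value, so shrinkage pulls every variance toward one center and every covariance toward another. A minimal sketch of constructing such a matrix, with purely hypothetical hyperparameter values `sigma2` and `rho`:

```python
import numpy as np

def intraclass(p, sigma2, rho):
    """Intraclass (compound-symmetry) matrix: sigma2 on the diagonal and
    sigma2 * rho on every off-diagonal entry."""
    return sigma2 * ((1.0 - rho) * np.eye(p) + rho * np.ones((p, p)))

# Hypothetical prior-location values for the 7 exam scores.
S0 = intraclass(7, 72.0, 0.47)
```

Every diagonal entry of S0 equals 72.0 and every off-diagonal entry equals 72.0 × 0.47 = 33.84; under a prior centered at such a matrix, the smaller observed variances are pulled up toward the common diagonal value and the larger ones pulled down, which is the pattern seen in Tables 4 and 5.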

To further investigate the shrinkage property we considered an informative prior distribution for the regression coefficients and the covariance matrix, instead of the vague prior specification of (12). Table 7 presents the posterior mean for the covariance matrix under this informative conjugate prior specification. Under the informative prior the variance and covariance components are each pulled more strongly towards central quantities in comparison to the elements of Table 4. For example, the estimated variance for MATH 1 moved further up, from 65.0307 in Table 4 to 65.7643 in Table 7, and the estimated variance for CIVIC moved further down, from 80.8041 in Table 4 to 79.0518 in Table 7. Under the informative prior specification the shrinkage property is thus more pronounced.

Table 7: Posterior means of the covariance matrix under the informative prior specification.

          VOCAB     READ      MATH 1    MATH 2    SCI       WRITE     CIVIC
VOCAB    72.3394   44.8244   34.4191   25.5068   38.8961   35.6451   32.5834
READ     44.8244   74.1193   37.4280   29.5260   40.5447   36.7774   34.1974
MATH 1   34.4191   37.4280   65.7643   39.5284   36.9010   34.3695   27.7634
MATH 2   25.5068   29.5260   39.5284   76.2078   28.8305   25.8843   20.4709
SCI      38.8961   40.5447   36.9010   28.8305   71.2920   34.7172   31.6945
WRITE    35.6451   36.7774   34.3695   25.8843   34.7172   65.8718   31.8856
CIVIC    32.5834   34.1974   27.7634   20.4709   31.6945   31.8856   79.0518

Specifically, observe that the posterior estimates for the variances of the MATH 2 and CIVIC exams are less than their associated classical frequentist estimates, whereas all other diagonal elements of the posterior mean estimate of the covariance matrix are greater than their respective classical frequentist estimates. An analogous statement can be made concerning the covariance terms. This draws our attention to the notion of the Bayesian posterior mean as a compromise between the prior information and the information contained in the data.

4.5. Posterior Estimates of the Sum Total

Of particular interest in educational testing data is an estimate of the overall summary, or composite score, of the individual subject area exams for a given set of explanatory variables. An obvious choice is to estimate the sum total of the individual exam scores. In a Bayesian framework this can easily be accomplished by simply summing the individual posterior estimates of the seven subject area exams. However, the standard error associated with this estimate of the sum total cannot simply be calculated as the square root of the sum of the variances of the individual estimates: doing so would fail to incorporate the covariance terms that exist amongst the subject area exams, and would also overlook any parameter uncertainty.

Suppose for an individual we have the vector of response variables $y$ and the associated vector of explanatory variables $x$. Note that the linear model for a single observation can be expressed as $y = B^{\top} x + \varepsilon$, where $B$ is the matrix of regression coefficients and $\varepsilon \sim N(0, \Sigma)$. Furthermore, the sum total score is $T = \mathbf{1}^{\top} y$, where $\mathbf{1}$ is a vector of ones. Then, given $B$ and $\Sigma$, the total score $T$ follows a univariate normal distribution with mean $\mathbf{1}^{\top} B^{\top} x$ and variance $\mathbf{1}^{\top} \Sigma \mathbf{1}$.

In order to more fully account for the added variability due to parameter uncertainty we consider the unconditional mean and variance of the sum total for an individual:
$$E(T) = E\left(\mathbf{1}^{\top} B^{\top} x\right), \qquad \mathrm{Var}(T) = E\left(\mathbf{1}^{\top} \Sigma \mathbf{1}\right) + \mathrm{Var}\left(\mathbf{1}^{\top} B^{\top} x\right). \tag{34}$$
Here it is understood, and the notation implies, that the expectations in (34) are in fact taken with respect to the posterior distributions of $B$ and $\Sigma$, given the data. Equation (34) can be calculated based upon the results from the MCMC procedures in the following manner:
$$\widehat{E}(T) = \frac{1}{N-M} \sum_{t=M+1}^{N} \mathbf{1}^{\top} \left(B^{(t)}\right)^{\top} x,$$
$$\widehat{\mathrm{Var}}(T) = \frac{1}{N-M} \sum_{t=M+1}^{N} \mathbf{1}^{\top} \Sigma^{(t)} \mathbf{1} + \frac{1}{N-M} \sum_{t=M+1}^{N} \left(\mathbf{1}^{\top} \left(B^{(t)}\right)^{\top} x - \widehat{E}(T)\right)^{2},$$
where $N$ is the total number of iterations of the MCMC algorithm, $M$ is the number of burn-in iterations defined in Section 3, and $\left(B^{(t)}, \Sigma^{(t)}\right)$ denotes the pair of draws at iteration $t$.
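These estimators translate directly into a few lines of array code. The sketch below is a minimal illustration, assuming the retained draws are stacked in arrays; `B_draws`, `Sigma_draws`, `x`, and every numeric value here are synthetic stand-ins, not output from the paper's actual chains.

```python
import numpy as np

# Synthetic stand-ins for the retained MCMC draws (iterations M+1, ..., N).
rng = np.random.default_rng(1)
n_kept, q, p = 5_000, 3, 7                  # retained draws, covariates, responses
ones = np.ones(p)
x = np.array([1.0, 3.0, 1.0])               # hypothetical covariate vector

B_draws = rng.normal(5.0, 0.1, size=(n_kept, q, p))           # coefficient draws
S0 = 70.0 * np.eye(p) + 2.0                                   # base covariance level
Sigma_draws = S0 + rng.normal(0.0, 0.2, size=(n_kept, p, p))  # jittered draws

# Per-draw conditional mean 1'B'x and conditional variance 1'Sigma 1 of T.
cond_means = np.einsum('q,tqp,p->t', x, B_draws, ones)
cond_vars = np.einsum('p,tpq,q->t', ones, Sigma_draws, ones)

# Law of total variance: Var(T) = E[1'Sigma 1] + Var(1'B'x).
post_mean_T = cond_means.mean()
post_var_T = cond_vars.mean() + cond_means.var()
post_se_T = np.sqrt(post_var_T)
```

Averaging `cond_vars` captures the within-draw sampling variance, while the spread of `cond_means` across draws captures the parameter uncertainty; dropping the second term would understate the standard error, exactly the pitfall noted above.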

Notice that by calculating the variance of the sum total score in this fashion we fully incorporate the obvious covariance structure amongst the response variables. Furthermore, in a Bayesian sense we capture the added variation due to parameter uncertainty.

As a particular example we investigated a hypothetical student. The student is a female of Hispanic or Spanish descent. Her GPA is 3.0 and she attends a private nonelite school in a rural community of the Pacific region of the United States. Thus, for this particular example the vector of explanatory variables $x$ has the GPA element set to 3.0, the indicator elements corresponding to her gender, race, school type, urbanicity, and region set accordingly, and all of the other remaining elements equal to zero.

We obtained the following estimates of the Bayesian posterior means for the individual area exams, 50.9732, 51.2566, 51.5953, 52.1122, 50.0902, 53.7476, and 52.1576 for the VOCAB, READ, MATH 1, MATH 2, SCI, WRITE, and CIVIC exams, respectively. This results in an estimated sum total of 361.9328 with a standard error of 44.2357.
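The point estimate of the sum total is simply the sum of the per-exam posterior means. Checking this against the figures reported above (the reported total agrees up to rounding in the printed values):

```python
# Per-exam posterior means for the hypothetical student, as reported in the text:
# VOCAB, READ, MATH 1, MATH 2, SCI, WRITE, CIVIC.
exam_means = [50.9732, 51.2566, 51.5953, 52.1122, 50.0902, 53.7476, 52.1576]
total = sum(exam_means)   # close to the reported 361.9328
```

The standard error, by contrast, cannot be recovered from these seven numbers alone; it requires the covariance terms and the parameter-uncertainty term discussed above.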

5. Conclusion

In conclusion, we have demonstrated how a flexible prior specification for the covariance structure of a multivariate multiple regression model can provide a richer class of distributions than the inverse Wishart family. We showed how the likelihood function for the covariance structure can be approximated based upon Bellman’s solution of a linear Volterra integral equation, and we discussed the shrinkage properties of the posterior mean of the covariance structure, which highlighted the concept of the posterior mean as a compromise between the prior information and the information contained in the data.

All posterior estimates were calculated based upon the numerical results of an MHWG procedure. The Metropolis-Hastings step was employed to account for sampling from an approximate posterior distribution for the covariance structure.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

1. A. van der Merwe and J. V. Zidek, “Multivariate regression analysis and canonical variates,” The Canadian Journal of Statistics, vol. 8, no. 1, pp. 27–39, 1980.
2. G. Tiao and A. Zellner, “On the Bayesian estimation of multivariate regression,” Journal of the Royal Statistical Society, Series B, vol. 26, pp. 277–285, 1964.
3. C. F. Chen, “Bayesian inference for a normal dispersion matrix and its application to stochastic multiple regression analysis,” Journal of the Royal Statistical Society, Series B, vol. 41, no. 2, pp. 235–248, 1979.
4. J. Dickey, D. Lindley, and S. Press, “Bayesian estimation of the dispersion matrix of a multivariate normal distribution,” Communications in Statistics—Theory and Methods, vol. 14, no. 5, pp. 1019–1034, 1985.
5. I. J. Evans, “Bayesian estimation of parameters of a multivariate normal distribution,” Journal of the Royal Statistical Society, Series B, vol. 27, pp. 279–283, 1965.
6. A. P. Dawid, “Some matrix-variate distribution theory: notational considerations and a Bayesian application,” Biometrika, vol. 68, no. 1, pp. 265–274, 1981.
7. T. Leonard and J. S. J. Hsu, “Bayesian inference for a covariance matrix,” Annals of Statistics, vol. 20, no. 4, pp. 1669–1696, 1992.
8. C.-W. Hsu, M. S. Sinay, and J. S. J. Hsu, “Bayesian estimation of a covariance matrix with flexible prior specification,” Annals of the Institute of Statistical Mathematics, vol. 64, no. 2, pp. 319–342, 2012.
9. M. S. Sinay, C. Hsu, and J. S. J. Hsu, “Bayesian estimation with flexible prior for the covariance structure of linear mixed effects models,” International Journal of Statistics and Probability, vol. 2, pp. 29–41, 2013.
10. R. Bellman, Introduction to Matrix Analysis, McGraw-Hill, New York, NY, USA, 1960.
11. A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian Data Analysis, 2nd edition, Chapman & Hall, New York, NY, USA, 2004.
12. National Center for Education Statistics, High School and Beyond, 1980: A Longitudinal Survey of Students in the United States, 2nd edition, produced by National Opinion Research Center, Chicago, Ill, USA, 1980; distributed by Inter-University Consortium for Political and Social Research, Ann Arbor, Mich, USA, 1986.
13. M. S. Srivastava and C. G. Khatri, An Introduction to Multivariate Statistics, Elsevier North Holland, New York, NY, USA, 1979.
14. T. Y. M. Chiu, T. Leonard, and K.-W. Tsui, “The matrix-logarithmic covariance model,” Journal of the American Statistical Association, vol. 91, no. 433, pp. 198–210, 1996.
15. R. Y. Rubinstein, Simulation and the Monte Carlo Method, John Wiley & Sons, New York, NY, USA, 1981.
16. T. Leonard, J. S. J. Hsu, and K. Tsui, “Bayesian marginal inference,” Journal of the American Statistical Association, vol. 84, pp. 1051–1058, 1989.
17. N. M. Astone and S. S. McLanahan, “Family structure, parental practices and high school completion,” American Sociological Review, vol. 56, no. 3, pp. 309–320, 1991.
18. J. Grogger, “School expenditures and post-schooling earnings: evidence from high school and beyond,” Review of Economics and Statistics, vol. 78, no. 4, pp. 628–637, 1996.
19. E. P. St. John, “Price response in enrollment decisions: an analysis of the High School and Beyond Sophomore cohort,” Research in Higher Education, vol. 31, no. 2, pp. 161–176, 1990.
20. R. Zwick and J. C. Sklar, “Predicting college grades and degree completion using high school grades and SAT scores: the role of student ethnicity and first language,” The American Educational Research Journal, vol. 42, no. 3, pp. 439–464, 2005.
21. S. Van Buuren and C. G. M. Oudshoorn, “Multivariate Imputation by Chained Equations: MICE V1.0 User's Manual,” Tech. Rep. PG/VGZ/00.038, TNO Prevention and Health, Leiden, The Netherlands, 2000.
22. J. J. Faraway, Linear Models with R, Chapman & Hall, New York, NY, USA, 2005.
