Journal of Applied Mathematics

Volume 2016 (2016), Article ID 1648462, 12 pages

http://dx.doi.org/10.1155/2016/1648462

## Assessing Heterogeneity for Factor Analysis Model with Continuous and Ordinal Outcomes

Department of Applied Mathematics, Nanjing Forestry University, Nanjing, Jiangsu 210037, China

Received 8 December 2015; Revised 23 February 2016; Accepted 2 March 2016

Academic Editor: Wei-Chiang Hong

Copyright © 2016 Ye-Mao Xia and Jian-Wei Gou. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Factor analysis models with continuous and ordinal responses are a useful tool for assessing relations between the latent variables and mixed observed responses. These models have been successfully applied to many different fields, including behavioral, educational, and social-psychological sciences. However, within the Bayesian analysis framework, most developments are constrained within parametric families, of which the particular distributions are specified for the parameters of interest. This leads to difficulty in dealing with outliers and/or distribution deviations. In this paper, we propose a Bayesian semiparametric modeling for factor analysis model with continuous and ordinal variables. A truncated stick-breaking prior is used to model the distributions of the intercept and/or covariance structural parameters. Bayesian posterior analysis is carried out through the simulation-based method. Blocked Gibbs sampler is implemented to draw observations from the complicated posterior. For model selection, the logarithm of pseudomarginal likelihood is developed to compare the competing models. Empirical results are presented to illustrate the application of the methodology.

#### 1. Introduction

Owing to its wide applications in behavioral and social science researches, analysis of factor analysis models with mixed data structure has received a lot of attention; see [1–6]. However, most of these methods are mainly developed within particular parametric distribution families such as the exponential family or normal scale mixture family, which have a limited role in dealing with the distributional deviations, in particular heterogeneity or multimodality of the data. Though some robust methods are developed to downweight the influence of the outliers [7–12], most of them are still confined to dealing with unimodality and are less effective for the asymmetric and/or multimodal problems.

Recently, some authors focused on the Bayesian semiparametric modeling for latent variables model. For multivariate categorical data analysis, Kottas et al. [13] extended the traditional multivariate probit model [14–16] to a flexible underlying prior probability model. The usual single multivariate normal model for the latent variables is replaced by a mixture of normal priors with infinite number of components. And, for the latent variable model with fixed covariates and continuous responses, Lee et al. [17] established the semiparametric Bayesian hierarchal model for the structural equation models (SEMs) by relaxing the common normal distribution of exogenous factors to follow a finite-dimensional Dirichlet process [18]. Song et al. [19] developed a semiparametric Bayesian procedure for analyzing the latent variable model with unordered categorical data. For some recent advances in semiparametric analysis for factor analysis model, see [20–23] among others.

In this paper, we developed a Bayesian semiparametric approach for analyzing factor analysis model with mixed continuous and ordinal responses. The methods are twofold. Firstly, we extended Kottas, Müller, and Quintana’s model to a more general multivariate model which contains factor variables. This extension aims to interpret the relationships between measurements and latent variables and explore correlations among the multiple manifest variables. Moreover, we treat the threshold parameters as unknown and estimate them simultaneously with other model parameters, thus providing a more flexible approach to fit the data. Secondly, we introduce the truncated Dirichlet process prior as the prior of the mean vector and variance-covariance parameters of unique errors and latent variables. This facilitates the interpretation of heterogeneity in the mean and/or covariance structure across the subjects.

This paper is organized as follows. We first introduce the Bayesian semiparametric modeling framework for factor analysis model with continuous and ordinal variables. We then present the Markov chain Monte Carlo procedure for parameters estimation and model selection. Simulation studies and a real example are provided to illustrate the performance of the proposed procedure. We close with some remarks and concluding comments.

#### 2. Model Description

##### 2.1. Factor Analysis Model with Continuous and Ordinal Responses

Suppose that a -dimensional mixed observed vector contains continuous variables and ordinal variables with taking an integral value in for , . We assume that the observed ordinal vector is related to the unobserved continuous vector throughwhere , is a set of unknown threshold parameters that define the categories: . Hence, for the th variable , there are categories.

Let denote the vector of continuous observed measurements and unobserved variables. For subject , we formulate the dependence among ’s through the following measurement model:where is a intercept vector, is a factor loading matrix, is an vector of latent variables, and is a vector of measurement errors which is independent of . In many applications, may represent the hypothesized factors underlying manifest responses and/or unobserved heterogeneity not explained by covariates.

The latent variable model with mixed continuous and ordinal responses defined by (1) and (2) faces two sources of identification problems. The first one is associated with the determinacy of latent variables in modeling of categorical variables, and the second one is related to the uniqueness of the factor loadings matrix. To solve the first problem, we use the common method [24] to fix endpoints and () at preassigned values. For the second problem, we follow the usual practice in structural equation modeling to identify the covariance matrix of by fixing appropriate elements in at preassigned values.

Let be the parametric vector formed by the unknown parameters contained in and let denote the free parameters contained in factor loading matrices and with . Based on the assumptions of (2), the conditional distribution of given is a normal distribution with mean vector and covariance matrix .

Note that the latent factors here play an important role in characterizing the associations between the observed variables. It can be seen clearly that and are dependent when is integrated out. The marginal density of is given by within which is the standard normal cumulative distribution function.

##### 2.2. Bayesian Semiparametric Hierarchical Modeling

Let be the conditional density of given and denote by a prior distribution function of . Suppose that is proper; we define the following mixture density: in which is the conditional distribution of given . By taking a prior for and restricting to be a parametric family of distributions indexed by , we complete the Bayesian parametric model specification. However, this restriction severely constrains the estimation of and produces estimators that shrink data values toward the same points. A more flexible modeling for is to treat as random and assign a prior for it. For this end, we introduce a latent variable vector and assume that, given , ’s are conditionally independent and drawn from . Furthermore, we suppose that ’s are independent and identically distributed (i.i.d.) according to with a prior on it. As a result, we break the mixture model into where “ind” means “independent” and is a prior of .

We consider the following truncated version of Dirichlet process for :in which denotes a discrete probability measure concentrated on atom and , independent of , are random weights constructed through the following stick-breaking procedure:with ; ’s are i.i.d with common distribution .

Truncated Dirichlet process prior (7) can be considered as a truncation version of Dirichlet process [25–30] in the nonparametric Bayesian analysis. It can be shown that, under (7) and (8), for any Borel set in ,This indicates that can be served as the starting point or guess of and determines the concentration of the prior around . In practice, the value of is either set to a large, predetermined value (e.g., ) or chosen empirically. For instance, Ishwaran and Zarepour [31] suggested that the adequacy of the truncation level, , can be assessed by evaluating moments of the tail probability. Our simulation results have shown that is more than adequate for the model considered in the present context.

Now, we specify the distribution . Recalling that by convention is the collection of , hence, we assume thatwhere , and are hyperparameters, is a diagonal matrix with the th diagonal element , and is an positive definite matrix; refers to the inverse gamma distribution with shaper parameters and scale parameters , respectively, and denotes the inverse Wishart distribution [32].

Modeling in (7) into the random probability measure and incorporating the latent variable into (5) generate the following hierarchical model: for ,where is given by (7) and (8).

#### 3. Parameters Estimation and Model Selection

##### 3.1. Prior Specifications and Estimation via Blocked Gibbs Sampler

Let . To implement Bayesian analysis, blocked Gibbs sampler is used to simulate observations from the posterior. The key for blocked Gibbs sampler is to recast model (11) completely by introducing the cluster variables such that Consequently, the semiparametric hierarchical model (11) can be reformulated as the following framework:where is a prior of , is the stick-breaking prior given by (8) with , and is the joint distribution of given byin which is given in (10).

For the Bayesian analysis, we need to specify priors for the parameters involved in the model. The whole parameters can be divided into two parts: parametric component part and nonparametric component part . For the parametric components, we assume that withwhere is a column vector that contains unknown parameters in the th row of .

For the hyperparameter , we consider the following conjugate priors:The hyperparameters , , , , , , and in (10), (14), and (15) are treated as known.

Let , , and . Posterior analysis in relation to the complex is carried out through the data augmentation technique [33]. Specifically, we treat the latent quantities as missing data and augment them with the observed data. A sequence of random observations is generated from the joint posterior distribution by the blocked Gibbs sampler [31, 34], coupled with the Metropolis-Hastings algorithm [35, 36]: given at the th iteration draw from , , , , , , , draw from , , , , , , draw from , , , , , , draw from , , , , , , , draw from , , , , , , ,and form . It can be shown that as tends to infinity, the empirical distribution of converges to at any geometrical rate. The full conditional distributions and the implementation of the above algorithm are given in the Appendix.

##### 3.2. Model Selection

Model selection is an important issue in Bayesian semiparametric modeling for latent variable model since it is of practical interest to compare different modelings for factor analytic models. Formal Bayesian model selection is accomplished by comparing the marginal predictive distribution of data across models. Consider the problem of comparing competing models and . Let and denote the marginal density of data under and , respectively. A popular choice for selecting models is achieved via Bayes factor (BF) (e.g., [37–39]). However, in view of the fact that computing BF involves the high-dimensional density which is hard to estimate well, we prefer comparing the following logarithm of pseudomarginal likelihood (LPML) [40, 41]:where is known as the conditional predictive ordinate (CPO) defined asHere, is the data set with removed. Obviously, from (17), we can see that is the marginal posterior predictive density of given and can be interpreted as the height of this marginal density at . Thus, small values of LPML imply that does not support the model.

Based on MCMC sample already available in the estimation, a consistent estimate for LPML can be obtained via ergodic average given by

It is noted that, under our proposed model, which is complicated due to the existence of and . This can be solved by Monte Carlo approximation. Specifically, given the current values at the th iteration, we draw (i) from and (ii) from for and then evaluate at the observation through Obviously, the distributions involved in (i) and (ii) are standard and sampling is rather straightforward and fast.

#### 4. A Simulation Study

In this section, a simulation study to evaluate the performance of the proposed procedure is conducted. The goal is to assess the accuracy of estimates under parametric, partly exchangeable, and semiparametric modelings when data take on the multimodality or heterogeneity. We consider the situation in which each observed vector consists of three-dimensional continuous vector and three-dimensional ordinal vector with threshold values . We generate by first generating with from the mixture of two factor analytic models with weights 0.45 and 0.55 and then transforming into () via (1) to create the ordinal observations, where represents a observed continuous random vector and is a latent continuous random vector. Each component in the mixture model is specified through the following measurement model: for ,The parameters involved in the components of mixture model are taken as , , , ,in which is a vector with all elements equal to one and is a identity matrix. The elements with asterisks involved in loading matrix and threshold parameters are treated as fixed for identifying model (see Section 2.1). Based on these settings, random sample with size 500 is generated and 100 replications are completed for each combination.

Prior inputs in the prior distributions involved in the parametric components (see (14)) are as follows: and are diagonal matrices with the diagonal elements 1.0, and elements in are equal to the true values, while prior inputs in the prior distribution of superparameter (see (15)) are , , , , , , and . Note that these values ensure approaching noninformative priors.

A few test runs are conducted to explore the effect of truncated levels on the estimates of unknown parameters and the convergence of the blocked Gibbs sampler. We take , , , , , , , and 300 and calculate the total sum of the root mean square (RMS) of estimates (see below for details). The resulting values are 1.9830, 1.7382, 1.6582, 1.5548, 1.4194, 1.4128, 1.4108, and 1.4101, respectively. It can be seen that the total sum of the root mean square (RMS) becomes rather stable when . In the following analysis, we set in our data analysis. For the threshold parameters , we choose (see Appendix) in MH algorithm to produce the acceptance rate about 0.40. Figure 1 gives the plots of EPSR (estimated potential scale reduction [42]) values of unknown free parameters in , , and against iterations for three groups of different starting values. It can be seen that the estimates converge in less than 1000 iterations. To be conservative, in the following analysis, we collect 3000 observations after 2000 “burn-in”s deleted to take posterior analysis. We first consider the performance of the proposed LMPL in model comparison. We compare the proposed model with the parametric model (denoted by PARA) and the partly exchangeable model (denoted by PAEX), which approximately correspond to and under our proposal, respectively. The parametric model is defined by The priors of the parameters are given by , , and