Abstract

Most experimental design literature on causal inference focuses on establishing a causal relationship between variables, but there is no literature on how to identify a design that yields the most precise parameter estimates for a structural equation model (SEM). In this research, search algorithms are used to produce a D-optimal design for a SEM under both three-stage least squares (3SLS) and full information maximum likelihood (FIML) estimation. A D-optimal design is then obtained for estimating the model parameters of a mixed-effects SEM. The efficiency of each of the D-optimal designs for SEMs is compared with univariate optimal and uniform designs. In each case, accounting for the causal relationships changed the optimal design dramatically, and the new D-optimal designs were more efficient.

1. Introduction

Most experimental design literature on causal inference focuses on establishing a causal relationship between variables, but there is no literature on how to identify a design that yields the most precise parameter estimates for a structural equation model (SEM). This work was motivated by an agricultural study by Milander and colleagues [1], in which a SEM was proposed to determine the influence of seeding rate on the yield components of waxy maize and to better characterize the interrelationship between yield and yield components by estimating the direct effects of the path model, based on the biological understanding of the interrelationships among the endogenous and exogenous variables. The endogenous variables in the proposed system, as described in Figure 1, were the yield components: rows per ear, ear length, kernels per row, kernels per ear, grain yield, ear circumference, and kernel weight. Seeding rate was the only exogenous variable.

The goodness-of-fit statistics for the model were poor, so the data were reanalyzed to obtain a model that better fit the data, with the resulting model described in Figure 2. However, the revised model did not give any insight into the interrelationships among many of the variables, which raised the question of whether the design was adequate for structural equation modeling.

The study was conducted over a period of three years, and the original model was known before the data were collected [1]. Ideally, the researchers would have designed an experiment that allowed for the most precise estimates of the parameters, thus leading to more precise inferential statistics. Rather than uniformly applying the seeding rate levels across the experiment, an optimal design would have told the researchers how many replicates to use for each seeding rate level and which of the seeding rate levels should be replicated more. However, they did not have the tools to obtain an optimal design for the proposed model, nor were they able to take advantage of the information collected in the first year to adjust the design for subsequent years. There simply was no literature to support such an objective.

Historically, what drove the growth of optimal design theory was the need for a design that is optimal for all parameters that need to be estimated [2, 3]. Technological advancements made it possible to develop the search algorithms that are needed to obtain optimal designs [4–8]. The new growth in the application of experimental design came about because of the lack of theoretical results on the optimal design of blocked and split-plot experiments, which made computerized design algorithms popular and led to three proposals for point-exchange algorithms for constructing D-optimal blocked designs [9–15].

Traditional statistical analyses focus on univariate linear models, which are limited because most random processes involve multiple dependent variables. This led to the development of the multivariate linear model, which still does not address the causal relationships among the dependent (endogenous) variables; structural equation modeling does [16, 17]. SEMs allow qualitative cause-effect information to be combined with statistical data and provide a quantitative assessment of the cause-effect relationships among, and within, the endogenous and exogenous variables [17, 18].

Even though many algorithms have been developed for univariate optimal designs and univariate mixed-model optimal designs, the literature on multivariate optimal design is very limited, with examples such as the work of Soumaya and colleagues [19] and Schwabe [20]. There is literature on experimental design in structural equation modeling, but not on optimal design: the research in this area so far aims to establish causal relationships (the direction of the arrows) among the variables [21, 22] and extends to counterfactuals and experimental designs used to establish causal relationships [23]. However, no research exists on obtaining an optimal design to estimate the parameters of a SEM.

This gap in the literature demonstrates the need for the current research, which comprises three objectives. For a given causal structure, the first objective is to obtain a D-optimal design that allows the most precise estimates of the endogenous and exogenous parameters of that model. Then, for a given mixed causal structure with random blocks or split-plots, the second objective is to obtain a D-optimal design for estimating the parameters of the model. Finally, the efficiency of each of the D-optimal designs for causal structures is compared with the optimal designs for the univariate case.

2. Methods

For the univariate case, the linear model is $y = X\beta + \varepsilon$, where $y$ is an $n \times 1$ vector of observations, $X$ is an $n \times p$ design matrix of rank $p$, $\beta$ is a $p \times 1$ vector of unknown coefficients that are estimable, and $\varepsilon$ is an independently and identically normally distributed $n \times 1$ vector with $E(\varepsilon) = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2 I_n$, where $I_n$ is the identity matrix of rank $n$. The least-squares estimate of $\beta$ is $\hat{\beta} = (X'X)^{-1}X'y$, with $\operatorname{Var}(\hat{\beta}) = \sigma^2(X'X)^{-1}$. Our objective is to choose a design "$X^*$" such that $\lvert X^{*\prime}X^{*} \rvert = \max_X \lvert X'X \rvert$ [7, 8]. Such a design "$X^*$" is called a D-optimal design.
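As a small illustration of the criterion (our own sketch, not code from the paper), the following evaluates $\lvert X'X \rvert$ for two hypothetical candidate designs; under D-optimality, the design with the larger determinant is preferred.

```python
import numpy as np

def d_criterion(X):
    """det(X'X): the quantity a D-optimal design maximizes for the univariate linear model."""
    return np.linalg.det(X.T @ X)

# Two hypothetical 6-run designs for a two-factor model with no intercept.
X_a = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [1, 1]], dtype=float)
X_b = np.array([[1, 0], [0, 1], [1, 1], [1, 1], [1, 1], [0, 0]], dtype=float)

print(d_criterion(X_a), d_criterion(X_b))  # the larger value identifies the more D-efficient design
```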

The same philosophy is followed for a mixed model. The model for data that come from experiments with random blocks or split-plots is $y = X\beta + Z\gamma + \varepsilon$, where $y$, $X$, $\beta$, and $\varepsilon$ are defined above and $\gamma$ is a $b \times 1$ random vector with $E(\gamma) = 0$ and $\operatorname{Var}(\gamma) = \sigma_{\gamma}^2 I_b$. $Z$ is an $n \times b$ incidence matrix of the form $Z = \operatorname{diag}(1_{k_1}, \ldots, 1_{k_b})$, where $1_{k}$ is a $k \times 1$ vector of ones. Then, it follows that $\operatorname{Var}(y) = V = \sigma_{\gamma}^2 ZZ' + \sigma_{\varepsilon}^2 I_n$. Now it can be shown that $\operatorname{Var}(\hat{\beta}) = (X'V^{-1}X)^{-1}$, where $\hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}y$ is the generalized least-squares estimator. The objective becomes to choose the design matrix "$X^*$" such that $\lvert X^{*\prime}V^{-1}X^{*} \rvert = \max_X \lvert X'V^{-1}X \rvert$. However, $V$ is unknown. This presents a new challenge and shows that the more complicated the model, the more challenges can arise when choosing a D-optimal design.

The first solution to the above challenge is to assume that $V$ is known by making assumptions about $\sigma_{\gamma}^2$ and $\sigma_{\varepsilon}^2$. The second solution is to rewrite $V = \sigma_{\varepsilon}^2(I_n + \eta ZZ')$, where $\eta = \sigma_{\gamma}^2/\sigma_{\varepsilon}^2$ [14, 15, 24]. This allows us to treat $\lvert X'V^{-1}X \rvert$ as the function to optimize, as for a univariate linear model. It can be assumed without loss of generality that $\sigma_{\varepsilon}^2 = 1$; thus $V$ can be expressed in terms of the variance ratio $\eta$ or in terms of the correlation coefficient $\rho = \sigma_{\gamma}^2/(\sigma_{\gamma}^2 + \sigma_{\varepsilon}^2)$. Therefore, the D-optimal design for a blocked or a split-plot experiment depends on the variance ratio. However, Goos and Vandebroek [14, 15] and Goos and Jones [13] argued that the dependence is minor and presented algorithms for constructing locally optimal designs for a given $\eta$ or $\rho$. In some special cases, the optimal design is globally optimal and depends on neither the variance ratio nor the degree of correlation, the most popular of which is the orthogonal design [14].
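A minimal sketch of evaluating the blocked criterion for a guessed variance ratio $\eta$ (the function name and the tiny example design are our own illustration, not from the paper):

```python
import numpy as np

def blocked_d_criterion(X, Z, eta):
    """det(X' V^{-1} X) with V = I_n + eta * Z Z', taking sigma_eps^2 = 1 without loss of generality."""
    n = X.shape[0]
    V = np.eye(n) + eta * Z @ Z.T
    return np.linalg.det(X.T @ np.linalg.inv(V) @ X)

# Four runs in two blocks of size two; one two-level factor coded -1/+1.
X = np.array([[-1.0], [1.0], [-1.0], [1.0]])
Z = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)  # block-incidence matrix

for eta in (0.1, 1.0, 10.0):  # locally optimal designs are constructed for a guessed eta
    print(eta, blocked_d_criterion(X, Z, eta))
```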

The previous method, or "traditional method" as it is referred to in the literature, is based on optimization of $\lvert X'V^{-1}X \rvert$ for a given $\eta$ or $\rho$ and has two weaknesses [25]. The first weakness is that the D-optimal design depends on $\eta$ or $\rho$, and these values are not known prior to the experiment. The second weakness is that the optimal design focuses only on the fixed-effect parameters and ignores the estimation of the random-effect parameters. These weaknesses could have severe consequences for the estimates of the variance components, which in turn severely affect the inferential statistics on the fixed-effect parameters. Mylona et al. [25] provided scenarios of optimal designs that were constructed using the traditional D-optimality criterion in which the variance components $\sigma_{\gamma}^2$ and $\sigma_{\varepsilon}^2$ are not estimable.

Consequently, these two weaknesses led to a new composite D-optimality criterion, proposed by Mylona et al. [25]. It is called composite because the criterion takes into consideration both the fixed-effect and the random-effect parameters.

However, the D-optimality criterion and orthogonal designs both depend on an assumed model. If the assumed model incorrectly represents the interrelationships between the endogenous and exogenous variables, it may lead to increased bias. Thus, when there is no knowledge available for model specification, one of the most commonly used designs is the uniform design, in which the design points are scattered over the experimental region [26]. In comparison to D-optimal and orthogonal designs, uniform designs are less sensitive to model selection [27, 28]. Consequently, a uniform design does not require pre-experiment model specification [29] and can be used to explore the relationships between variables.

For the same reasons that led to the development of algorithms that account for all parameters of interest in the univariate case, our objective is to develop algorithms for SEMs that consider both the endogenous and exogenous parameters. In structural equation modeling, estimation methods are more complex and there is a variety of classes of estimators. This research focuses on three-stage least squares (3SLS) and full information maximum likelihood (FIML) because they are the only full-system estimation methods, in which all parameters are estimated simultaneously.

3. D-Optimal Design for a Causal Structure Model

3.1. The Information Matrix Approximation for 3SLS

In most systems, researchers are interested in more than one endogenous variable, and these variables directly and indirectly affect each other. To model such a study, we use SEMs. To demonstrate the methodology and introduce the new approach for D-optimal designs, consider the following examples. A simple example is shown in Figure 3, with two exogenous variables and two endogenous variables.

Many phenomena can be modeled with this SEM. For example, at the University of Washington Tacoma there are two pre-calculus pathways into Calculus I: a two-course sequence (TMATH 115 and TMATH 116) and a one-course sequence (TMATH 120). In this model, the grades in TMATH 115 and TMATH 120 are the exogenous variables, and the grade in TMATH 116 and the Calculus I grade are the endogenous variables. If we are designing a study to estimate the parameters in Figure 3, our first objective would be to determine how many students should be enrolled in each pathway to obtain the best estimates for the three path coefficients. Two parameters need to be estimated through the first pathway but only one through the second. In this example, assigning half of the sample size to each pathway may not be optimal because there are more parameters on the first pathway than on the second. So, the fundamental question is, how many students should be assigned to each pathway to obtain the optimal estimates for all parameters?

We will focus on the model in Figure 3. In this centered model, each endogenous variable satisfies one equation of a two-equation system. Some definitions are necessary to develop the model. $Y_i$ is an $n \times g_i$ matrix that consists of the endogenous independent variables that are in the $i$th equation. $y_i$ is an $n \times 1$ vector that consists of the responses of the endogenous dependent variable for the $i$th equation. $X_i$ is an $n \times k_i$ matrix that consists of the exogenous variables that are in the $i$th equation, $\Gamma_i$ is the vector of endogenous parameters of the $i$th equation, and $B_i$ is the vector of exogenous parameters of the $i$th equation. $\varepsilon_i$ is an $n \times 1$ vector of random errors for the $i$th endogenous variable. Each equation in the system can be written in vector notation as $y_i = Y_i\Gamma_i + X_iB_i + \varepsilon_i$, and then rewritten as $y_i = Z_i\delta_i + \varepsilon_i$, where $Z_i = [Y_i \; X_i]$ and $\delta_i = (\Gamma_i', B_i')'$. Multiply both sides by $X'$, the transpose of the full matrix of exogenous variables, stack the equations, and use the Kronecker product to obtain $(I \otimes X')y = (I \otimes X')Z\delta + (I \otimes X')\varepsilon$, where $y = (y_1', y_2')'$, $Z = \operatorname{diag}(Z_1, Z_2)$, $\delta = (\delta_1', \delta_2')'$, and $\varepsilon = (\varepsilon_1', \varepsilon_2')'$ [30].

Then, to estimate $\delta$, use generalized least squares (GLS) to obtain $\hat{\delta} = \left[Z'\left(\hat{\Sigma}^{-1} \otimes X(X'X)^{-1}X'\right)Z\right]^{-1} Z'\left(\hat{\Sigma}^{-1} \otimes X(X'X)^{-1}X'\right)y$. So, the covariance matrix of the 3SLS estimators is $M^{-1}$, where $M = Z'\left(\Sigma^{-1} \otimes X(X'X)^{-1}X'\right)Z$ and $\Sigma$ is the covariance matrix of the errors across equations [31, 32], which is equivalent to the asymptotic information matrix for FIML [33]. Thus, the estimate of the covariance matrix would be
$$\widehat{\operatorname{Var}}(\hat{\delta}) = \left[Z'\left(\hat{\Sigma}^{-1} \otimes X(X'X)^{-1}X'\right)Z\right]^{-1}. \tag{1}$$

The reason that we are interested in this quantity is that our objective is to obtain the design $X$ that "minimizes" $\widehat{\operatorname{Var}}(\hat{\delta})$. Since normality is assumed throughout the research, $\widehat{M} = Z'\left(\hat{\Sigma}^{-1} \otimes X(X'X)^{-1}X'\right)Z$ is called the information matrix estimate. Thus, our objective would be to obtain the D-optimal design "$X^*$" such that
$$\bigl\lvert \widehat{M}(X^*) \bigr\rvert = \max_{X} \Bigl\lvert Z'\bigl(\hat{\Sigma}^{-1} \otimes X(X'X)^{-1}X'\bigr)Z \Bigr\rvert. \tag{2}$$
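The following sketch (our own illustration, with hypothetical inputs, and with the observed endogenous regressors replaced by their expected values) assembles the information matrix estimate $\widehat{M} = Z'\left(\hat{\Sigma}^{-1} \otimes X(X'X)^{-1}X'\right)Z$ whose determinant the D-optimal design maximizes:

```python
import numpy as np

def threesls_information(Z_list, X, Sigma):
    """Z_list: per-equation regressor matrices [Y_i, X_i]; X: full exogenous matrix; Sigma: error covariance."""
    P = X @ np.linalg.inv(X.T @ X) @ X.T               # projection onto the column space of X
    n_rows = sum(Zi.shape[0] for Zi in Z_list)
    n_cols = sum(Zi.shape[1] for Zi in Z_list)
    Z = np.zeros((n_rows, n_cols))
    r = c = 0
    for Zi in Z_list:                                  # block-diagonal stacking of the equations
        Z[r:r + Zi.shape[0], c:c + Zi.shape[1]] = Zi
        r, c = r + Zi.shape[0], c + Zi.shape[1]
    return Z.T @ np.kron(np.linalg.inv(Sigma), P) @ Z  # Z'(Sigma^{-1} (x) X(X'X)^{-1}X')Z

# Hypothetical two-equation system with n = 4 observations and two 0/1 treatments.
X_full = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
Z1 = X_full[:, [0]]                                    # first equation: its exogenous regressor only
Z2 = np.hstack([8.0 * X_full[:, [0]], X_full[:, [1]]]) # second equation: expected y1 (= 8*x1) and x2
M = threesls_information([Z1, Z2], X_full, Sigma=np.eye(2))
print(np.linalg.det(M))
```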

In this paper, we assume that there are no constraints on the treatments or on the exogenous variables. However, if there are constraints on the treatments, such as when one of the treatments is limited, then the optimality criteria can be adjusted as is done in the univariate case, which is known as the optimal design of mixture experiments problem [13, 34–37].

Example 1. Based on equation (2), a D-optimal design is obtained for the causal structure below (Figure 4), assuming that the treatments are qualitative, that there are twenty observations, and that the true values of the three model parameters are 8, 2, and 5, respectively.
In vector notation, the model consists of two equations: the first endogenous variable depends on the first treatment, and the second endogenous variable depends on the first endogenous variable and the second treatment. Here, $y_1$ is a $20 \times 1$ vector of responses of the first endogenous variable, $y_2$ is a $20 \times 1$ vector of responses of the second endogenous variable, $x_1$ is a $20 \times 1$ vector indicating whether or not the first treatment is applied to the $j$th experimental unit (denoted by 1 or 0), $x_2$ is a $20 \times 1$ vector indicating whether or not the second treatment is applied to the $j$th experimental unit (denoted by 1 or 0), $\varepsilon_1$ is a $20 \times 1$ vector of random residuals for the first endogenous variable, $\varepsilon_2$ is a $20 \times 1$ vector of random residuals for the second endogenous variable, $0$ is a vector of zeros, and $I$ is a $20 \times 20$ identity matrix. To obtain a D-optimal design, the algorithm starts with multiple random designs in order to avoid local optima. For this specific example, 50 initial random designs were used, each of size 20. The initial designs were constructed randomly, where each value had a 50–50 chance of being a 1 or a 0. To improve a design, the algorithm compares each point in the design with the candidate points and makes an exchange with the candidate point that improves the optimality criterion from equation (2) the most. The exchanges continue until no further improvement is achieved or the improvement is sufficiently small. This procedure is repeated for each of the 50 initial designs and the final designs are compared. The design with the largest value of the criterion in equation (2), that is, the smallest determinant of the estimated covariance matrix, among the 50 final designs is selected as the D-optimal design.
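A minimal sketch of the multi-start exchange search described above (function names are our own, and the simple univariate criterion $\det(X'X)$ is used in the usage example in place of the causal-structure criterion from equation (2)):

```python
import numpy as np

def exchange_search(candidates, n_runs, criterion, n_starts=50, tol=1e-8, seed=0):
    """Multi-start point-exchange search: maximize `criterion` over designs of size n_runs."""
    rng = np.random.default_rng(seed)
    best_design, best_value = None, -np.inf
    for _ in range(n_starts):
        design = candidates[rng.integers(0, len(candidates), size=n_runs)]  # random starting design
        value = criterion(design)
        improved = True
        while improved:                                 # exchange points until no improvement remains
            improved = False
            for i in range(n_runs):
                for cand in candidates:
                    trial = design.copy()
                    trial[i] = cand
                    trial_value = criterion(trial)
                    if trial_value > value + tol:       # accept an improving exchange
                        design, value, improved = trial, trial_value, True
        if value > best_value:
            best_design, best_value = design, value
    return best_design, best_value

# Usage with the four candidate points of Example 1 and the univariate D-criterion det(X'X).
cands = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
design, value = exchange_search(cands, n_runs=20,
                                criterion=lambda X: np.linalg.det(X.T @ X))
print(value)
```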
Using the previous algorithm, a D-optimal design for 20 observations for the model in Example 1 is given in Table 1.
In the D-optimal design for the causal structure of the model in Figure 4, two of the candidate points were each replicated nine times and a third point was replicated twice. The point $(x_1, x_2) = (0, 0)$ was not replicated, which is expected because the model was assumed to have no intercept.
Table 1 was based on the true values of the parameters, which are used in the information matrix estimate. However, the true values are unknown. There are two approaches that address this issue. First, when prior data are available, estimates of the parameters can be used in place of the true parameters. One question that arises with this approach is the robustness of the design. To address the sensitivity of the design to the true values of the parameters, simulation is used to obtain the optimal design based on estimates of the parameters.
When no prior data are available, a second approach is a Bayesian one, which assumes a prior distribution for the endogenous parameters, similar to the approach used by Mylona et al. [25] and Goos and Mylona [38] to address the variance components in optimal designs for blocked or split-plot experiments. This approach allows for the uncertainty of the endogenous parameters. Using the Bayesian approach to obtain a 3SLS D-optimal design for a causal structure will be the subject of future work.
Optimal design theory is currently limited to univariate or multivariate applications, and one of the biggest weaknesses of these approaches is that they ignore any possible causal structure among the endogenous variables. Because of the lack of theoretical results and algorithms to produce optimal designs for a causal structure, one approach is to use univariate optimal designs for causal models, which leads to a loss of efficiency. To obtain a univariate D-optimal design for Example 1, it is assumed that each treatment affects both endogenous variables, to avoid the situation where there is a separate optimal design for each equation; in this case, we obtain a single D-optimal design for both treatments. An optimal design for the univariate case with two qualitative treatments (each with levels 0 and 1) can be found in Table 2, where the four possible candidate points are $(x_1, x_2) = (0, 0)$, $(1, 0)$, $(0, 1)$, and $(1, 1)$. The three nonzero candidate points are replicated seven times, seven times, and six times. For the same reason as in the D-optimal design for the causal structure, the point $(0, 0)$ was not replicated since the intercept was not included in the model.
The ratio of the determinants of the information matrices of the univariate versus the 3SLS causal structure designs and the D-efficiency of the two designs are used to compare the optimality of the designs [9, 10]. The results in Table 3 compare the efficiency of the univariate optimal design from Table 2 with that of the new D-optimal design from Table 1.
When comparing the univariate optimal design to the 3SLS D-optimal design for a causal structure, there was an approximately 18% increase in the determinant of the asymptotic information matrix estimates. The univariate optimal design is about 94% as D-efficient as the 3SLS D-optimal design for a causal structure in all four cases.
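These figures are consistent with the standard relative D-efficiency measure (our notation; $p = 3$ is the number of parameters in Example 1):

$$D_{\text{eff}} = \left( \frac{\lvert M(\xi_1) \rvert}{\lvert M(\xi_2) \rvert} \right)^{1/p}, \qquad \left( \frac{1}{1.18} \right)^{1/3} \approx 0.94,$$

so an 18% smaller determinant corresponds to a design that is about 94% as D-efficient.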
One criticism of the comparison of the D-optimal designs obtained above is that they are based on the asymptotic information matrices, whereas the D-optimal designs were obtained for small sample sizes (20 observations). Thus, it is important to ensure that the results are consistent for small sample sizes as well. To verify the comparison, we simulated data based on the causal structure D-optimal design and data based on the univariate optimal design 100,000 times each. We then estimated the parameters for both designs using the GLS estimator and calculated the covariance matrix of the parameter estimates for both designs, whose $(i, j)$th element is the sample covariance between the $i$th and $j$th elements of $\hat{\delta}$ across the simulations. The determinants of the covariance matrices of the parameter estimates were computed. This process was repeated three times. The determinant of the covariance matrix of the parameter estimates for the 3SLS causal structure D-optimal design was consistently smaller than that for the univariate optimal design, as shown in Table 4.
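A sketch of this kind of Monte Carlo check, under our own simplified assumptions: a recursive two-equation model in the spirit of Figure 4 (the equation forms, the per-equation least-squares estimator, and the illustrative 20-run design are ours, not the paper's):

```python
import numpy as np

def simulate_cov_det(design, b1, g, b2, sigma=1.0, n_sims=10_000, seed=1):
    """Empirical determinant of the covariance of the parameter estimates for the assumed
    recursive system y1 = b1*x1 + e1, y2 = g*y1 + b2*x2 + e2, estimated equation by equation."""
    rng = np.random.default_rng(seed)
    x1, x2 = design[:, 0], design[:, 1]
    estimates = np.empty((n_sims, 3))
    for s in range(n_sims):
        y1 = b1 * x1 + rng.normal(0, sigma, size=len(x1))
        y2 = g * y1 + b2 * x2 + rng.normal(0, sigma, size=len(x2))
        bhat1 = np.linalg.lstsq(x1[:, None], y1, rcond=None)[0]
        coef2 = np.linalg.lstsq(np.column_stack([y1, x2]), y2, rcond=None)[0]
        estimates[s] = [bhat1[0], coef2[0], coef2[1]]
    return np.linalg.det(np.cov(estimates, rowvar=False))

# An illustrative 20-run design (not necessarily the design of Table 1).
design = np.array([[1, 0]] * 9 + [[1, 1]] * 9 + [[0, 1]] * 2, dtype=float)
print(simulate_cov_det(design, b1=8.0, g=2.0, b2=5.0))
```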
The results from Table 4 were consistent with the results of the comparison of the asymptotic information matrices. The 3SLS causal structure D-optimal design consistently produced a smaller determinant for the covariance matrix of the parameter estimates than the 3SLS univariate optimal design. Specifically, the determinant of the covariance matrix of the parameter estimates for the 3SLS causal structure D-optimal design was about 15%–19% smaller than that for the 3SLS univariate optimal design. Based on those results, the 3SLS causal structure D-optimal design was 5%–7% more D-efficient than the univariate optimal design. It is important to note that whether the true parameters or their estimates were used, the 3SLS D-optimal design for the causal structure did not change.

3.2. The Information Matrix Approximation for FIML

Full information maximum likelihood (FIML) is another estimation method for SEMs. In this section, we discuss optimal designs based on the FIML estimators and investigate the relationship between the 3SLS and FIML optimal designs.

Durbin [33] proposed a transformation of the maximum likelihood equations that simplified the computations and made it easier to study the properties of FIML estimators and their advantages over 3SLS estimators. However, even with the special notation, the FIML information matrix is complicated and takes much more work to obtain than the 3SLS information matrix. Based on Durbin's [33] results, our objective is to obtain a design matrix "X" that maximizes the determinant of the FIML information matrix estimate. The derivation and the special notation are given in Appendix A. The reason that both D-optimal designs are considered is that the 3SLS information matrix is easy to obtain, while FIML estimators have advantages over 3SLS estimators, as discussed by Durbin [33].

As with the 3SLS D-optimal design, the objective is to obtain the design matrix that minimizes the estimated determinant of the covariance matrix of the FIML estimates or, equivalently, maximizes the determinant of the FIML information matrix estimate. The performance of the D-optimal design for a causal structure is then compared to the univariate optimal design by computing the determinants of the information matrices through simulation. The design that consistently has the larger determinant of the information matrix is the better design. In the following example, the objective is to use the estimate of the information matrix to obtain the FIML optimal design.

For the causal structure in Example 1, the 3SLS D-optimal design is also the D-optimal design for the FIML estimates. As with 3SLS, the true parameters are unknown, so the FIML estimates are used in place of the true parameters for the three simulations. In practice, the information matrix is estimated based on the parameter estimates. This is similar to the approach of Goos and Vandebroek [14, 15] and Goos and Jones [13] for optimal designs with random blocks or split-plots, where they suggest using estimates of the variance parameters or a reasonable guess, arguing that the design depends only minimally on those values. The results obtained for a D-optimal design for a causal structure in this research are consistent with those of Goos and Vandebroek [14, 15] and Goos and Jones [13]: the optimal design did not change when the true parameters were replaced with their estimates.

The results in Table 5 compare the efficiency of the D-optimal design from Table 1 with that of the univariate optimal design from Table 2.

The results for the FIML causal structure D-optimal design are similar to the results for the 3SLS causal structure D-optimal design. The optimal design for the univariate case was about 6% less D-efficient than the FIML design for the causal structure. The loss of efficiency demonstrates the importance of taking the endogenous parameters into account to obtain an optimal design to study the causal structure.

For both the 3SLS and FIML D-optimal designs, the goal of the three simulations is to show that the D-optimal design did not change even though the estimates of the parameters deviated from the true values. However, it may be problematic to base a conclusion on only three simulations; therefore, 300 data sets were simulated and the parameters were estimated from each. These estimates were then used to obtain a D-optimal design. In each of the 300 simulations, the D-optimal design did not change and was equivalent to the D-optimal design in Table 1.

Another way to demonstrate the robustness of the design is to incrementally change the value of the parameter(s) from the true value, obtain a D-optimal design based on each changed value, and compare it to the design obtained from the true value. For the model in Figure 4, the D-optimal design depends only on the parameter whose true value is 8, since it is the only parameter that enters the information matrix through the endogenous regressors. Table 1 is the D-optimal design for that true value. We therefore changed this value incrementally over an interval around 8: starting at 8 and decreasing in increments of 0.1, and then starting at 8 and increasing in increments of 0.1, obtaining the D-optimal design for each value and checking whether it is equivalent to the D-optimal design in Table 1. The D-optimal design did not change even though the value of the parameter was changed by up to 75% from the true value. These results are consistent with Goos and Vandebroek [14, 15] and Goos and Jones [13], who argued that the optimal design depends only minimally on the specific parameter values.
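A sketch of this robustness check under our own simplifications ($\Sigma$ taken as the identity, the expected endogenous values used as regressors, and an exhaustive search over replicate allocations, which is feasible here because there are only four candidate points; the grid of parameter values is ours for illustration):

```python
import numpy as np
from itertools import product

def causal_criterion(counts, beta1):
    """Simplified D-criterion for a design given by replicate counts of the candidate
    points (0,0), (1,0), (0,1), (1,1), with Sigma = I and expected y1 = beta1 * x1."""
    n00, n10, n01, n11 = counts
    X = np.repeat(np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float),
                  [n00, n10, n01, n11], axis=0)
    Z1 = X[:, [0]]                                    # regressor of the first equation
    Z2 = np.column_stack([beta1 * X[:, 0], X[:, 1]])  # expected y1 and x2 in the second equation
    return np.linalg.det(Z1.T @ Z1) * np.linalg.det(Z2.T @ Z2)

def best_allocation(beta1, n_runs=20):
    """Exhaustive search over all replicate allocations of n_runs points."""
    best, best_val = None, -np.inf
    for n00, n10, n01 in product(range(n_runs + 1), repeat=3):
        n11 = n_runs - n00 - n10 - n01
        if n11 < 0:
            continue
        val = causal_criterion((n00, n10, n01, n11), beta1)
        if val > best_val:
            best, best_val = (n00, n10, n01, n11), val
    return best

# Step the parameter over a grid around its true value of 8 and check whether the design changes.
allocations = {round(b, 1): best_allocation(b) for b in np.arange(2.0, 14.1, 0.5)}
print(set(allocations.values()))  # a single allocation across the grid indicates robustness
```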

4. D-Optimal Design for a Causal Structure Model with Random Blocks

In applied research, blocked designs are likely the most commonly used experimental designs, since they can account for variation attributable to sources other than the treatments and can considerably improve the precision of the experiment. The first objective is to establish the blocked causal structure and then to obtain the 3SLS and FIML estimators for the endogenous and exogenous parameters, which are used to estimate their information matrices. The second objective of this section is to use the algorithm described in Section 3.1 (Example 1) to obtain a D-optimal design for both the endogenous and exogenous parameters. The last objective is to compare the efficiency of the D-optimal design to the classical univariate mixed-model optimal design.

4.1. 3SLS D-Optimal Design for a Causal Structure with Random Blocks

The model with random blocks is similar to the model for a completely randomized design except that a block effect is added to every endogenous variable. Therefore, for the $i$th endogenous variable as the dependent variable, the model can be written as $y_i = Y_i\Gamma_i + X_iB_i + W\gamma_i + \varepsilon_i$, where $\gamma_i$ is a $b \times 1$ vector of random block effects, $W$ is the $n \times b$ block-incidence matrix, and all of the other terms have been previously defined. If the blocks have the same size $k$, then $W = I_b \otimes 1_k$. The elements of $\gamma_i$ and $\varepsilon_i$ are assumed to be mutually independent and normally distributed with zero mean and variances $\sigma_{\gamma_i}^2$ and $\sigma_{\varepsilon_i}^2$, respectively.

Let $Z_i = [Y_i \; X_i]$ and $\delta_i = (\Gamma_i', B_i')'$ as before. Then, the model with random blocks can be rewritten as $y_i = Z_i\delta_i + W\gamma_i + \varepsilon_i$. Using matrix notation and the Kronecker product, the stacked system becomes $y = Z\delta + (I \otimes W)\gamma + \varepsilon$, where $y$, $Z$, $\delta$, and $\varepsilon$ are stacked as in Section 3.1 and $\gamma$ stacks the block effects of the equations.

Therefore, $\operatorname{Var}(y) = V$, where $V$ depends on both the block and the residual variance components. Let $\Sigma_{\gamma}$ and $\Sigma_{\varepsilon}$ denote the covariance matrices of the block effects and of the residuals across the equations, where $\Sigma_{\gamma}$ is a nonnegative definite matrix. Then, $V = \Sigma_{\gamma} \otimes WW' + \Sigma_{\varepsilon} \otimes I_n$.

Using the same approach as described in Section 3.1, the asymptotic information matrix would be $M = Z'(I \otimes X)\left[(I \otimes X')\,V\,(I \otimes X)\right]^{-1}(I \otimes X')Z$, where $V = \Sigma_{\gamma} \otimes WW' + \Sigma_{\varepsilon} \otimes I_n$.
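A sketch of assembling this covariance structure and information matrix numerically (the symbols $\Sigma_\gamma$ and $\Sigma_\varepsilon$, the placeholder regressor matrix, and the small example are our own stand-ins, not the paper's):

```python
import numpy as np
from itertools import product

def blocked_3sls_information(Z, X, W, Sigma_gamma, Sigma_eps):
    """Information matrix Z'(I(x)X)[(I(x)X') V (I(x)X)]^{-1}(I(x)X')Z with
    V = Sigma_gamma (x) W W' + Sigma_eps (x) I_n, the covariance structure assumed here."""
    n = X.shape[0]
    g = Sigma_eps.shape[0]                          # number of equations
    V = np.kron(Sigma_gamma, W @ W.T) + np.kron(Sigma_eps, np.eye(n))
    Xg = np.kron(np.eye(g), X)                      # the exogenous matrix applied to every equation
    A = Xg.T @ V @ Xg
    return Z.T @ Xg @ np.linalg.solve(A, Xg.T @ Z)

# Hypothetical example: n = 8 runs in 2 blocks of size 4, three -1/+1 factors, two equations.
X = np.array(list(product([-1.0, 1.0], repeat=3)))  # 2^3 factorial as the exogenous matrix
W = np.kron(np.eye(2), np.ones((4, 1)))             # block-incidence matrix, 2 blocks of size 4
Z = np.kron(np.eye(2), X[:, :2])                    # placeholder per-equation regressor matrices
M = blocked_3sls_information(Z, X, W, Sigma_gamma=0.5 * np.eye(2), Sigma_eps=np.eye(2))
print(np.linalg.det(M))
```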

Based on the information matrix for the 3SLS estimates for the causal structure D-optimal design with random blocks, our objective is to use the algorithm described in Section 3.1 (Example 1) to obtain the design that maximizes the determinant of this information matrix. Since the information matrix depends on the unknown parameters through the variance components and the endogenous parameters, we face the same predicament as in the case of the completely randomized D-optimal design for a causal structure, where the parameters are unknown. The solutions that were proposed for the completely randomized D-optimal design for a causal structure in Section 3.1 are also appropriate for the causal structure D-optimal design with random blocks. Other approaches are possible as well. For example, in an experiment with random blocks that is conducted in multiple sequential stages, the data of one block (year, lab, day, etc.) could be used to obtain initial estimates that are then used to obtain the D-optimal design for the remaining blocks (other years, other labs, other days, etc.). Requiring preliminary information is a standard part of univariate design. If there are no prior data, then a reasonable guess can be used, similar to the approach of Goos and Vandebroek [14, 15] and Goos and Jones [13]. In many experiments, multiple factors and limitations of the experimental material require the use of split-plot designs and/or incomplete block designs. Specifically, the way the factors are applied and the nature of the experimental material give rise to different sizes of experimental units and limits on block size. For this reason, an incomplete block design with three factors is used as an example in this section. In the next example, the objective is to use the asymptotic information matrix estimates to obtain a D-optimal design for a causal structure with four random blocks, where the size of each block is four.

Example 2. Assume that there are three binary treatment factors (exogenous variables) and two endogenous variables for the causal structure given in Figure 5. Here, the initial random designs were constructed so that each factor setting had a 50–50 chance of being a 1 or a −1. Also, assume that the true values of the four model parameters are 8, 2, 3, and 5, respectively.
A D-optimal design for a causal structure with four random blocks with block size equal to four under the given assumptions can be found in Table 6.
The expected determinant of the information matrix for this design is 5,222,400, and the design is not orthogonal. The expected determinant of the information matrix of the orthogonal design in Table 7 is 4,784,128. The orthogonal design is the optimal design for the univariate mixed model with three binary treatment factors and four blocks, each of size four.
One concern for both D-optimal and orthogonal designs is that they both require an assumed model structure. One popular alternative is the uniform design, which is "a space filling design" ([39], p. 131) that uniformly distributes experimental points across the design domain without model pre-specification [39]. Therefore, we compare the information of the uniform design to that of the orthogonal design and the new D-optimal design. Using a uniform design table [39, 40], a uniform design for 16 runs with 3 factors over the experimental cube $[-1, 1]^3$ is shown in Table 8.
The expected determinant of the information matrix for the uniform design is 28,612, which is significantly smaller than the determinants of the D-optimal and the orthogonal design information matrices.
The results show that the new D-optimal design was approximately 2.2% more D-efficient than the orthogonal design, whereas the D-optimal design was 72.8% more D-efficient than the uniform design. The D-optimal design performed significantly better than the uniform design, while the orthogonal design is still comparable. Although orthogonal designs are desirable because they are universally optimal for first-order linear models and are also robust for nonlinear models, they come with concerns in that they sometimes do not exist and require a large sample size [25]. It is also important to keep in mind that in Example 2 there is only one endogenous parameter, so the gain in efficiency may not be easily seen in our simple example. In a more complex design with more endogenous variables interconnected in more sophisticated ways, the increase in the determinant of the information matrix would be amplified.
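These efficiencies follow from the determinants reported above, using the relative D-efficiency measure with the four model parameters of Example 2 ($p = 4$):

$$\left( \frac{4{,}784{,}128}{5{,}222{,}400} \right)^{1/4} \approx 0.978, \qquad \left( \frac{28{,}612}{5{,}222{,}400} \right)^{1/4} \approx 0.272,$$

so the orthogonal and uniform designs are about 97.8% and 27.2% as D-efficient as the new D-optimal design, respectively.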
It is worth noting the differences between the 3SLS causal structure D-optimal design and the orthogonal design. First, the orthogonal design completely confounds the three-way interaction of the treatment factors with the blocks, and each candidate point is replicated twice. This is not the case in the 3SLS causal structure D-optimal design, where one two-way interaction is confounded with Blocks 1 and 3, the three-way interaction is confounded with Block 4, and another two-way interaction is confounded with Block 2. In the 3SLS causal structure D-optimal design, the candidate points are not replicated equally; for example, one candidate point was replicated three times while another was replicated only once.
To understand the differences between the designs, it is important to contrast the asymptotic information matrix of the 3SLS D-optimal design for a causal structure with that of the orthogonal design. In this way, we can see where information is gained and parameter estimates are improved for the former design, and where information is sacrificed and parameter estimates worsen for the latter. Tables 9 and 10 show the asymptotic information matrices for the model in Figure 5 for the orthogonal design and the 3SLS D-optimal design, respectively.
When the information matrices of the estimates are compared, the exogenous parameters have the same information in both designs except for a few elements. However, the endogenous parameters in the 3SLS D-optimal design gained more information, with an increase of about 16% over the orthogonal design. These results are expected because univariate optimal designs do not take the precision of the endogenous parameters into account, whereas D-optimal designs for causal structures do. Therefore, our new approach has the advantage of giving the optimal combination of treatments for the entire model while taking into account the precision of both the exogenous and endogenous parameters.

4.2. FIML D-Optimal Design for a Causal Structure with Random Blocks

As in the completely randomized D-optimal design for a causal structure, another important class of estimators is the maximum likelihood estimators. We will obtain the FIML estimators for a causal structure with random blocks and then obtain the covariance matrix estimates for the estimators, or the information matrix estimates. The information matrix estimates will then be used to obtain a D-optimal design for the FIML estimators.

Starting with the notation from Section 4.1, the log-likelihood for the blocked causal structure is written down. To maximize the likelihood, the partial derivatives with respect to the parameters are set equal to zero and solved. The second partial derivatives are then taken, and their expected values give the information matrix. In general, the expected value cannot be found explicitly until the model is known, since the matrix of regressors includes both fixed and random variables. For Example 2 above, however, the information matrix simplifies to a closed form [30].

Based on the information matrix above, a FIML D-optimal design for Example 2 is equivalent to the design in Table 6. The determinant of the information matrix is 5,280,295.4. Neither the orthogonal design nor the uniform design was as efficient as the new D-optimal design; the determinants of their information matrices were 4,845,883.1 and 36,622, respectively. Thus, the D-optimal design was approximately 2.1% more D-efficient than the orthogonal design and 71.1% more D-efficient than the uniform design. These results are similar to those in Section 3.2, where both the 3SLS and the FIML estimates had the same D-optimal design.

5. Discussion

Optimal design theory and its applications are generally limited to univariate designs. However, most natural phenomena include multiple endogenous variables which are related in complicated ways. There are two weaknesses concerning univariate designs. First, univariate designs are incapable of addressing more than one endogenous variable that could be affected by different exogenous variables. The second weakness is that univariate designs ignore the endogenous parameters. Our new approach overcomes both weaknesses. The D-optimality criteria were developed for both the 3SLS and FIML estimates for a causal structure and for a causal structure with random blocks. In addition, our approach is broadly applicable since it generalizes to any linear causal model with both fixed and random effects, and with any type of nesting and factorial structure.

Our results show that the D-optimal designs were the same for both FIML and 3SLS estimates. The reader may wonder why the 3SLS D-optimal design is of interest at all, given that FIML estimators have traditionally been used in structural equation modeling and that 3SLS estimators are not based on optimization. However, 3SLS information matrices are mathematically easier to obtain than FIML information matrices, and the interest here is in optimal design, not estimation. Since there was no difference between the FIML and 3SLS D-optimal designs in either example and, in general, the FIML and 3SLS information matrices are asymptotically equivalent [33], we use the 3SLS information matrix both for its simplicity in obtaining a D-optimal design and to ease the hesitancy of scientists to utilize the more complicated FIML information matrix.

In the example of the completely randomized causal structure, the new D-optimal design increased the information matrix determinant by at least 17% over the univariate optimal design. Because of this increase, the univariate optimal design was only 94% as D-efficient as the new D-optimal design. The information matrix of a causal structure depends on parameter values that are unknown. This raises the concern of whether or not the D-optimal design is robust. Based on the simulation results from Example 1, the D-optimal design is robust because it did not change when the parameter values changed.

The fact that the information matrix of a causal structure depends on unknown parameter values is not unique to causal structure D-optimal designs. The same issue arises for D-optimal designs for random effects, or the covariance parameters, in split-plot or blocked designs, where the values of the parameters are needed to obtain the D-optimal design. The results of our research are consistent with those of Goos and Vandebroek [14, 15] and Goos and Jones [13], where the optimal design did not change when the true parameters were replaced with estimates.

Similarly, for a causal structure model with random blocks, the endogenous parameters had a significant impact on the D-optimal design. In the univariate mixed model, orthogonal designs are universally optimal [14, 41]. However, they are not optimal for a causal structure model with random blocks. As shown in Example 2, the new D-optimal design increased the determinant of the information matrix by at least 9%, meaning that the orthogonal design was less efficient than the new D-optimal design. The orthogonal design was consistently about 97.8% and 97.9% as D-efficient as the new 3SLS and FIML D-optimal designs, respectively.

While it may be a disadvantage that D-optimal designs require an assumed model structure, the results show that the uniform design, which focuses on distributing the design points uniformly over the experimental domain, is not optimal. The uniform design was about 27.2% as D-efficient as the new 3SLS D-optimal design and 28.9% as D-efficient as the new FIML D-optimal design. In both cases, the uniform designs consistently lost more than 70% of the information relative to the new D-optimal design. Such a large loss of information inflates the variances of the parameter estimates and has serious consequences for inferential statistics and the power of statistical tests.

The designs that were produced using the new criteria show that endogenous parameters have an important effect on D-optimal designs. The results further show the importance of taking advantage of prior knowledge of the interrelationships of endogenous and exogenous parameters to develop a D-optimal design using the algorithms and methodology developed here. An application for the use of our algorithms could be in research similar to the heart study by Mi and colleagues [42] who examined the complex interrelationships among physiological, genetic, and environmental factors and their interaction effects on coronary heart disease (CHD), as shown in Figure 6. The data in that study were collected and analysis was performed based on the data alone; there was no pre-planned design. To obtain the most precise estimates to describe the system, causal structure D-optimal design can be used to determine replication of the levels of each factor such as how many men and women to include in the study, how many of the subjects should be smokers, how many should be obese, or how to determine an optimal sample of volunteers based on budgetary constraints.

Another practical use of our algorithms and methodology comes from the field of agriculture, through research conducted by Campbell and colleagues [43] and Dhungana and colleagues [44], where the objective was to understand how gene-environment interactions (GEI) influence interrelated and complex traits such as grain yield. This subject of future work would help answer questions such as which genotypes should be chosen for the study and in what combination to obtain optimal estimates. Because the experiment was a multi-stage study conducted over a period of three years, our methods could have used the preliminary results from the first year to answer those questions, provided the model was predetermined.

5.1. Limitations

The fact that the algorithm requires the model to be predetermined is a challenge because, in most natural phenomena, the interrelationships in a system can be complex. In univariate designs, the challenge is reduced because there is only one endogenous variable that is directly affected by an exogenous variable. However, in structural equation modeling, not all exogenous variables affect the endogenous variables directly, and the interrelationships among the endogenous variables need to be predetermined as well. An additional challenge in obtaining a D-optimal design through our algorithm is that estimates of the endogenous parameters are required before the experiment is conducted, when they are unknown. Addressing these limitations is also the subject of future work, by using a prior distribution and a Bayesian approach to allow for the uncertainty of the unknown endogenous and exogenous parameters. Finally, how our technique works with other designs, such as A-optimal, C-optimal, or E-optimal designs, is yet to be explored.

Appendix

A. Derivation of the FIML Information Matrix

For a structural equation model in which the rows of the error matrix are independently and normally distributed with mean vector zero and variance matrix $\Sigma$, the log-likelihood function can be written down. First, to obtain the FIML estimators, we find the partial derivatives and set them to zero as follows:
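For orientation, writing the system in the generic matrix form $Y\Gamma = XB + E$ (our notation, which may differ from the paper's), the log-likelihood being differentiated has the standard form

$$\ell(\Gamma, B, \Sigma) = \text{const} + n \ln\lvert \det \Gamma \rvert - \frac{n}{2}\ln\lvert \Sigma \rvert - \frac{1}{2}\operatorname{tr}\!\left[\Sigma^{-1}(Y\Gamma - XB)'(Y\Gamma - XB)\right],$$

where $n$ is the number of observations and the rows of $E = Y\Gamma - XB$ are independent $N(0, \Sigma)$ vectors.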

Because of the invariance property of maximum likelihood (ML) estimators [45], the derivation is simplified by differentiating the likelihood with respect to $\Sigma^{-1}$ instead of $\Sigma$. So,

Per Kmail [30], algebraic manipulation of equations (A.1) and (A.3) gives

After introducing new notation for the quantities above, equations (A.2) and (A.4) can be rewritten in terms of these new quantities as

Then, by rewriting the previous equation,

Next,

Finally,

These hold under the same restrictions as equations (A.2) and (A.4), which implies that only the unknown elements on the left-hand side are equated to zero.

Consider the $i$th column of the coefficient matrix. Its $i$th element is one, since the diagonal consists of 1's, and its other restricted elements are zero by the original model restrictions on the parameters. Let the number of unknown elements in this column be counted and the vector of these unknowns be formed; in other words, this vector represents the unknown coefficients of the $i$th equation in the original model. The columns corresponding to the unknown elements are rearranged into a matrix, and its columns are indexed accordingly. This new notation allows equation (A.8) to be rewritten without restrictions: the leading coefficient is one, the unknown elements carry the remaining nonzero coefficients, and the rest of the columns have coefficients of zero. Thus,

By substituting this quantity into equation (A.8), it becomes equation (A.10). Consequently, each column of the left-hand side of equation (A.10) can be written out explicitly. Using similar notation as before, let the columns that correspond to the unknown elements of the parameter vector be collected; the resulting expression is equivalent to the previous equation. The full system is obtained by stacking all of the equations and using a technique similar to 3SLS. Let

Then, equation (A.10) simplifies to the following, which implies that

The advantage of this notation is that the unknown elements form an unrestricted set of equations; the restrictions on the original coefficient matrices have been resolved by the notation. However, there is a problem in solving the system because $\Sigma$ is unknown, and the regressor matrix consists of columns that are themselves unknown because they depend on the unknown coefficients. Therefore, replace these values by initial values and replace $\Sigma$ by its ML estimate $\hat{\Sigma}$. Thus, equation (A.13) becomes

The maximum likelihood estimates of these quantities are given distinct notation to avoid confusion with the final estimates, and the estimate obtained by rearranging them yields the maximum likelihood estimate of the original parameter vector.

Using the results from (A.14), the estimator cannot be computed directly because several of the quantities involved are unknown. Therefore, an iterative approach is used. Start with the 2SLS estimates, which are consistent in probability. To obtain the second-round estimates, apply first-order Taylor expansions around the current estimates; the higher-order terms are small in comparison to the leading terms and are therefore ignored. Repeating this procedure gives the estimates at each subsequent iteration. Kmail [30] shows that the asymptotic variance matrix of the estimates is the inverse of the information matrix given above.

Data Availability

Data sharing is not applicable to this article as no data sets were generated or analyzed during the current study.

Disclosure

This research is presented in part as the first author's dissertation research at https://digitalcommons.unl.edu/statisticsdiss/21/. Any mistakes, omissions, etc., are those of the authors alone.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors thank Dr. Erin Blankenship, Dr. Stephen Kachman, and Dr. Allan Peterson for their critical reviews of the first author’s Ph.D. dissertation upon which this paper is based.