Abstract

Item response theory (IRT) is a popular approach used for addressing statistical problems in psychometrics as well as in other fields. The fully Bayesian approach for estimating IRT models is, however, computationally expensive, which limits its use in real applications. In an effort to reduce the execution time, a previous study showed that high performance computing provides a solution by achieving a considerable speedup via the use of multiple processors. Given the high data dependencies in a single Markov chain for IRT models, it is not possible to avoid communication overhead among processors. This study seeks to reduce the communication overhead via the use of a row-wise decomposition scheme. The results suggest that the proposed approach increased the speedup and the efficiency of each implementation while minimizing the cost and the total overhead. This further sheds light on developing high performance Gibbs samplers for more complicated IRT models.

1. Introduction

Item response theory (IRT) is a popular approach used for describing probabilistic relationships between correct responses on a set of test items and continuous latent traits (see [1–4]). In addition to educational and psychological measurement, IRT models have been used in other areas of applied mathematics and statistical research, including US Supreme Court decision-making processes [5], alcohol disorder analysis [6–9], nicotine dependency [10–12], multiple-recapture population estimation [13], and psychiatric epidemiology [14–16], to name a few.

IRT has the advantage of allowing the influences that items and persons have on the responses to be modeled by distinct sets of parameters. As a result, a primary concern associated with IRT research has been parameter estimation, which offers the basis for the theoretical advantages of IRT. Specifically, of concern are the statistical complexities that can often arise when item and person parameters are estimated simultaneously (see [1, 17–19]). More recent attention has focused on fully Bayesian estimation, where Markov chain Monte Carlo (MCMC, [20, 21]) simulation techniques are used. In spite of its many advantages, fully Bayesian estimation is computationally expensive, which further limits its actual applications. It is hence important to seek ways to reduce the execution time. A suitable solution is to use high performance computing via the Message Passing Interface (MPI) standard, which employs supercomputers and computer clusters to tackle problems with complex computations.

The implementation of parallel computing is, however, not straightforward due to the high data dependencies in a single Markov chain for an IRT model, such as the dependency of one state of the chain on the previous state and the dependencies among the data within the same state. Patsias et al. [22] developed a parallel algorithm to implement Gibbs sampling [23], one of the most efficient MCMC algorithms, for the two-parameter normal ogive (2PNO, [24]) IRT model. In their study, the implementation of parallel computing was realized through decomposition of the data matrix and item parameters into column blocks while minimizing the communication overhead among processors. In their implementation, the person parameters were communicated between the root and the processor nodes. Given that the fully Bayesian estimation of IRT models requires a minimum of 20 or 30 times more subjects than test items, which is typical in a test situation, it is believed that one can reduce the communication overhead if item parameters are communicated instead of person parameters. Hence, an alternative approach is to decompose the data matrix and person parameters into row blocks. The present study develops such a high performance computing algorithm for the 2PNO model and further demonstrates its utility by comparing it with the algorithm developed in [22].

The remainder of the paper is organized as follows. Section 2 reviews the 2PNO IRT model, the Gibbs sampler, and the decomposition scheme used in [22]. Section 3 illustrates the approach we propose in the present study to parallelize the serial algorithm. In Section 4, the performance of the developed parallel algorithm is investigated by comparing it with the serial algorithm and further with the parallel algorithm developed in [22].

2. Preliminaries

The 2PNO IRT model provides a fundamental framework for modeling the person-item interaction by assuming one latent trait. Let $\mathbf{Y} = [y_{ij}]_{n \times k}$ denote a matrix of responses of $n$ persons to $k$ items, where $y_{ij} = 1$ ($y_{ij} = 0$) if the $i$th person answers the $j$th item correctly (incorrectly), for $i = 1, \dots, n$ and $j = 1, \dots, k$. The probability of person $i$ obtaining a correct response for item $j$ is then defined for the 2PNO model as

$$P(y_{ij} = 1 \mid \theta_i, a_j, \gamma_j) = \Phi(a_j \theta_i - \gamma_j), \quad (1)$$

where the scalars $a_j$ and $\gamma_j$ denote item parameters, $\theta_i$ denotes the continuous person trait parameter, and $\Phi$ denotes the unit normal cdf.
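For illustration, (1) can be evaluated directly with the GSL routines used in our implementation; a minimal sketch (the function name is ours, not part of the original code):

#include <gsl/gsl_cdf.h>

/* P(y_ij = 1 | theta_i, a_j, gamma_j) = Phi(a_j * theta_i - gamma_j),
   where Phi is the unit normal cdf. */
double p_correct(double a_j, double gamma_j, double theta_i)
{
    return gsl_cdf_ugaussian_P(a_j * theta_i - gamma_j);
}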

The Gibbs sampler involves updating three sets of parameters in each iteration, namely, an augmented continuous variable $Z_{ij}$ (which is positive if $y_{ij} = 1$ and negative if $y_{ij} = 0$), the person parameter $\theta_i$, and the item parameters $\boldsymbol{\xi}_j = (a_j, \gamma_j)'$, where $i = 1, \dots, n$ and $j = 1, \dots, k$, from their respective full conditional distributions, namely,

$$Z_{ij} \mid \theta_i, \boldsymbol{\xi}_j, y_{ij} \sim N(a_j \theta_i - \gamma_j, 1), \text{ truncated at the left by } 0 \text{ if } y_{ij} = 1 \text{ and at the right by } 0 \text{ if } y_{ij} = 0, \quad (2)$$

$$\theta_i \mid \mathbf{Z}, \boldsymbol{\xi} \sim N\left(\frac{\sum_j a_j (Z_{ij} + \gamma_j)}{1 + \sum_j a_j^2}, \frac{1}{1 + \sum_j a_j^2}\right), \quad (3)$$

assuming $\theta_i \sim N(0, 1)$, and

$$\boldsymbol{\xi}_j \mid \mathbf{Z}, \boldsymbol{\theta} \sim N\left((\mathbf{x}'\mathbf{x})^{-1}\mathbf{x}'\mathbf{Z}_j, (\mathbf{x}'\mathbf{x})^{-1}\right) I(a_j > 0), \quad (4)$$

where $\mathbf{x} = [\boldsymbol{\theta}, -\mathbf{1}]$ is an $n \times 2$ matrix and $\mathbf{Z}_j$ is the $j$th column of $\mathbf{Z}$ (see, e.g., [25, 26]).
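For illustration, the conditional draws in (2) and (3) can be implemented directly with GSL routines; the following sketch (function and variable names ours) uses the standard inverse-cdf method for the truncated normal draw:

#include <math.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <gsl/gsl_cdf.h>

/* Draw Z_ij from N(eta, 1), eta = a_j*theta_i - gamma_j, truncated to
   (0, inf) if y_ij = 1 and to (-inf, 0) if y_ij = 0. */
double draw_Z_ij(const gsl_rng *r, int y_ij, double eta)
{
    double u  = gsl_rng_uniform_pos(r);
    double p0 = gsl_cdf_ugaussian_P(-eta);   /* P(Z <= 0) */
    if (y_ij == 1)
        return eta + gsl_cdf_ugaussian_Pinv(p0 + u * (1.0 - p0));
    else
        return eta + gsl_cdf_ugaussian_Pinv(u * p0);
}

/* Draw theta_i from its full conditional (3), assuming a N(0, 1) prior. */
double draw_theta_i(const gsl_rng *r, const double *Z_i,
                    const double *a, const double *g, int k)
{
    double num = 0.0, den = 1.0;   /* the 1 comes from the prior */
    int j;
    for (j = 0; j < k; j++) {
        num += a[j] * (Z_i[j] + g[j]);
        den += a[j] * a[j];
    }
    return num / den + gsl_ran_gaussian(r, sqrt(1.0 / den));
}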

Hence, with starting values $\boldsymbol{\theta}^{(0)}$ and $\boldsymbol{\xi}^{(0)}$, observations $(\mathbf{Z}^{(m)}, \boldsymbol{\theta}^{(m)}, \boldsymbol{\xi}^{(m)})$ can be simulated from the Gibbs sampler by iteratively drawing from the full conditional distributions specified in (2), (3), and (4). To go from $(\mathbf{Z}^{(m-1)}, \boldsymbol{\theta}^{(m-1)}, \boldsymbol{\xi}^{(m-1)})$ to $(\mathbf{Z}^{(m)}, \boldsymbol{\theta}^{(m)}, \boldsymbol{\xi}^{(m)})$, it takes three transition steps:
(1) draw $\mathbf{Z}^{(m)} \sim p(\mathbf{Z} \mid \boldsymbol{\theta}^{(m-1)}, \boldsymbol{\xi}^{(m-1)})$;
(2) draw $\boldsymbol{\theta}^{(m)} \sim p(\boldsymbol{\theta} \mid \mathbf{Z}^{(m)}, \boldsymbol{\xi}^{(m-1)})$;
(3) draw $\boldsymbol{\xi}^{(m)} \sim p(\boldsymbol{\xi} \mid \mathbf{Z}^{(m)}, \boldsymbol{\theta}^{(m)})$.
This iterative procedure produces a sequence of $(\mathbf{Z}^{(m)}, \boldsymbol{\theta}^{(m)}, \boldsymbol{\xi}^{(m)})$, $m = 1, \dots, l$. To reduce the effect of the starting values, early iterations in the Markov chain are set as burn-ins to be discarded. Samples from the remaining iterations are then used to summarize the posterior densities of the item parameters $\boldsymbol{\xi}$ and ability parameters $\boldsymbol{\theta}$. For more detailed examples of the Gibbs sampler, interested readers should refer to Casella and George [27].
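For instance, with $l_0$ burn-in iterations, the posterior mean of $\theta_i$ is approximated by the usual MCMC average of the retained draws:

$$\hat{\theta}_i \approx \frac{1}{l - l_0} \sum_{m = l_0 + 1}^{l} \theta_i^{(m)}.$$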

Since a large number of iterations are needed for the Markov chain to reach convergence, the algorithm is computationally intensive and requires a considerable amount of execution time, especially with large datasets [26]. In an effort to achieve a speedup, Patsias et al. [22] developed a parallel algorithm where they used domain decomposition to divide the updating tasks into blocks. In particular, their approach was to assign each processor a column block of the data matrix $\mathbf{Y}$ so that it could update the corresponding blocks of $\mathbf{Z}$ and of the item parameters $\boldsymbol{\xi}$. This scheme is depicted in Figure 1 and is hereafter called column-wise decomposition. Since $\boldsymbol{\theta}$ is of size $n \times 1$ (a column vector), it could not be decomposed. Therefore, in each iteration, all processor nodes need to send the portions of $\mathbf{Z}$ and $\boldsymbol{\xi}$ needed for updating $\boldsymbol{\theta}$ to the root for it to update $\boldsymbol{\theta}$ and consequently send it back to the rest of the nodes to proceed. The approach of [22] is certainly limited in situations where $n$ is large, because the amount of data communicated grows with $n$. Given that test data typically involve $k$ much smaller than $n$, it is reasonable to believe that if the domain decomposition is done differently, such as the one depicted in Figure 2, one can achieve a greater speedup. The present study focuses on the development of a parallel algorithm using such a row-wise decomposition.

3. Methodology

The study was performed using the Maxwell Linux cluster, a cluster with 106 processing nodes. Maxwell uses the message-passing model via MPICH, an implementation of the MPI standard. One of the 106 nodes acted as the root node, while the rest of the nodes acted as slave nodes. The root node was responsible for generating and partitioning the matrix $\mathbf{Y}$, transmitting the submatrices, updating and broadcasting the item parameters, and recording the execution time, in addition to assuming the same duties as the slave nodes.

Each node on the cluster has dual quad-core Intel Xeon CPUs clocked at 2.3 GHz, 8 GB of RAM, 90 TB of storage, and a 64-bit Linux operating system. MPICH allows the user to choose the number of nodes before executing a program, so that different numbers of processing nodes may be used across executions.
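For example, a run on eight processing nodes can be launched with MPICH's process manager (the executable name here is illustrative):

mpiexec -n 8 ./gibbs_2pno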

3.1. Parallel Gibbs Sampler

Given the decomposition of $\mathbf{Y}$ and $\boldsymbol{\theta}$ that is illustrated in Figure 2, each processor uses a submatrix of $\mathbf{Y}$ to update a block of $\mathbf{Z}$ (which is of the same size as the submatrix of $\mathbf{Y}$) and a block of $\boldsymbol{\theta}$. For instance, processor $P_i$ updates the block of $\mathbf{Z}$ consisting of rows $(i-1)n/p + 1$ to $in/p$ and the block of $\boldsymbol{\theta}$ consisting of $\theta_{(i-1)n/p+1}$ to $\theta_{in/p}$, where $i = 1, \dots, p$ and $p$ is the number of processing nodes.
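In code, each node can derive its own row range from its MPI rank; a minimal sketch (assuming, as in our experiments, that $n$ is divisible by $p$):

/* Compute the 0-based row range [first, last] owned by a given rank,
   assuming n is divisible by p. */
void row_block(int n, int p, int rank, int *first, int *last)
{
    int rows_per_node = n / p;
    *first = rank * rows_per_node;
    *last  = *first + rows_per_node - 1;
}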

Since $\mathbf{a} = (a_1, \dots, a_k)$ and $\boldsymbol{\gamma} = (\gamma_1, \dots, \gamma_k)$ are each of size $1 \times k$ (a row vector), they are not decomposed and hence have to be communicated among processors. In order to minimize the communication cost, after finishing updating $\mathbf{Z}$ and $\boldsymbol{\theta}$ in each iteration, each node needs to calculate its partial cross-products $\mathbf{x}'\mathbf{x}$ and $\mathbf{x}'\mathbf{Z}$ and send them to the root for it to update $\boldsymbol{\xi}$ from (4). This way, each processing node is sending a vector of size $2k + 2$ to the root, which in return is broadcasting one message of size $2k$ to each processing node. The total data transferred between all the nodes for a single Markov chain involving $l$ iterations is therefore on the order of $(4k + 2)(p - 1)l$. The corresponding total using column-wise decomposition grows with $n$ rather than $k$ [22, page 67]. In practice, $n \gg k$. In addition, the sequential Gibbs sampler described in Section 2 calls for $n$ larger than $k$ to ensure that $(\mathbf{x}'\mathbf{x})^{-1}$ in (4) exists so that the item parameters can be updated. Hence, as an example, if a multiple-choice test consisting of 10 items is given to 500 examinees, the sample item response data can be denoted using a binary data matrix of size $500 \times 10$. We can implement the high performance Gibbs sampling algorithm to fit the 2PNO model, which would reach convergence within 10,000 iterations. With five processing nodes used, the total data transferred is on the order of $(4k + 2)(p - 1)l = 42 \times 4 \times 10{,}000 = 1{,}680{,}000$ numbers for row-wise decomposition, whereas the corresponding amount for column-wise decomposition, growing with $n = 500$ rather than $k = 10$, is many times larger. The difference would be even larger with a larger sample size (which is very common in large-scale testing situations), more processing nodes, and/or a longer Markov chain. Hence, the total data transferred using column-wise decomposition is substantially more than that of the proposed approach.
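The partial cross-products can be packed into a single buffer of length $2k + 2$, mirroring the KSI_array used in the appendix; the packing order shown below is our assumption about a workable layout, not necessarily the one in the original code:

/* Pack the local cross-products into one buffer of length 2k + 2:
   entries 0..k-1  : sums of theta_i * Z_ij  (row 1 of x'Z),
   entries k..2k-1 : sums of -Z_ij           (row 2 of x'Z),
   entry  2k       : sum of theta_i^2        (entry (1,1) of x'x),
   entry  2k+1     : sum of -theta_i         (off-diagonal of x'x).
   Z is stored row-major as an n_local-by-k block. */
void pack_cross_products(const double *Z, const double *theta,
                         int n_local, int k, double *buf)
{
    int i, j;
    for (j = 0; j < 2 * k + 2; j++) buf[j] = 0.0;
    for (i = 0; i < n_local; i++) {
        for (j = 0; j < k; j++) {
            buf[j]     += theta[i] * Z[i * k + j];
            buf[k + j] += -Z[i * k + j];
        }
        buf[2 * k]     += theta[i] * theta[i];
        buf[2 * k + 1] += -theta[i];
    }
}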

3.2. Implementation

The proposed algorithm was implemented in ANSI C and MPI, with utilization of the GNU Scientific Library (GSL) [28]. To achieve the parallel computation illustrated in the previous section, the MPI_Gather and MPI_Bcast routines were used for collective communications. For implementation details, see the appendix for part of the source code of the parallel algorithm for updating the model parameters.
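In outline, each iteration pairs one MPI_Gather with one MPI_Bcast; a minimal sketch of this communication round, with ranks and buffers assumed to be set up elsewhere:

/* Gather each node's 2k+2 partial cross-products at the root, then
   broadcast the 2k updated item parameters back to every node. */
MPI_Gather(local_buf, 2 * k + 2, MPI_DOUBLE,
           gathered_buf, 2 * k + 2, MPI_DOUBLE, ROOT, MPI_COMM_WORLD);
if (rank == ROOT) {
    /* combine the partial sums and draw a and gamma (see the appendix) */
}
MPI_Bcast(item_buf, 2 * k, MPI_DOUBLE, ROOT, MPI_COMM_WORLD);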

3.3. Performance Analyses

In order to investigate the advantages of the proposed parallel solution over its serial counterpart, and further over the parallel algorithm based on column-wise decomposition, four experiments were carried out in which the sample size $n$, the test length $k$, and the number of iterations $l$ were manipulated to vary:
Experiment 1: $n$ = 1,000, $k$ = 10, $l$ = 10,000;
Experiment 2: $n$ = 1,000, $k$ = 50, $l$ = 10,000;
Experiment 3: $n$ = 5,000, $k$ = 50, $l$ = 10,000;
Experiment 4: $n$ = 1,000, $k$ = 50, $l$ = 20,000.
In all these experiments, one (representing the serial algorithm) to eight processing nodes were used to implement the Gibbs sampler. The implementations were evaluated using four performance metrics in addition to the execution time: the total overhead, the relative speedup, the relative efficiency, and the cost.
(i) The total overhead is defined as $T_o = pT_p - T_s$, where $p$ is the number of available processing nodes, $T_s$ is the fastest sequential algorithm execution time, and $T_p$ is the parallel algorithm execution time.
(ii) The relative speedup is the factor by which the execution time is reduced on $p$ processors, and it is defined as $S = T_s / T_p$.
(iii) Efficiency describes how well the algorithm manages the computational resources; more specifically, it tells us how much time the processors spend executing important computations [29]. The relative efficiency is defined as $E = S / p = T_s / (pT_p)$.
(iv) The cost of solving a problem on a parallel system is defined as the product of the parallel runtime and $p$, that is, $C = pT_p$. Consequently, the cost reveals the sum of the individual processing nodes' runtimes.
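These are the standard definitions from the parallel computing literature [29]; a small helper (ours) that computes them from measured wall-clock times:

/* Performance metrics from the serial time Ts and the parallel time Tp
   on p nodes: overhead To = p*Tp - Ts, speedup S = Ts/Tp,
   efficiency E = S/p, and cost C = p*Tp. */
typedef struct { double overhead, speedup, efficiency, cost; } metrics_t;

metrics_t compute_metrics(double Ts, double Tp, int p)
{
    metrics_t m;
    m.overhead   = p * Tp - Ts;
    m.speedup    = Ts / Tp;
    m.efficiency = m.speedup / p;
    m.cost       = p * Tp;
    return m;
}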

4. Numerical Results

Results from the four experiments are summarized in Figures 3 to 7, where those based on the column-wise decomposition are in (a) and those based on the row-wise decomposition are in (b). Note that the values plotted represent the average of ten replications. As expected, the execution time decreased as the number of processing nodes increased, and row-wise decomposition resulted in a much shorter execution time than column-wise decomposition in all the experimented conditions (see Figure 3).

In the case of column-wise decomposition, the communication overhead (see Figure 6(a)) incurred using six nodes overshadowed the gain in computational speed compared with five nodes, and hence at this point some decrease in the speedup (see Figure 7(a)) and increase in the execution time (see Figure 3(a)) are observed. On the other hand, when seven nodes were used, the computational gain overpowered the increase in communication overhead, and the speedup was higher than that with five nodes. A similar scenario is observed when we compare the performance using eight nodes with that using seven or five nodes. In addition, because the communication size in column-wise decomposition depends only on $n$, not $k$, increasing the test length adds computation without adding communication; the experiments with $k$ = 50 therefore maintained a higher speedup (Figure 7(a)) and were more efficient (Figure 4(a)) than Experiment 1, where $k$ = 10. This agrees with the findings of [22].

With respect to row-wise decomposition, the communication overhead using eight nodes dominated the gain in computational speed from adding one more processor in the experiments where $n$ = 1,000 and $k$ = 50. Hence, there was less speedup when increasing the number of processors from seven to eight in these experiments (see Figure 7(b)). This can be further illustrated by profiling the time necessary for carrying out individual sections of the code. In particular, as is shown in the appendix, the parallel algorithm using row-wise decomposition involves six tasks in each iteration of the Markov chain: (1) update $\mathbf{Z}$; (2) update $\boldsymbol{\theta}$; (3) compute $\mathbf{x}'\mathbf{x}$ and $\mathbf{x}'\mathbf{Z}$; (4) send $\mathbf{x}'\mathbf{x}$ and $\mathbf{x}'\mathbf{Z}$ to the root; (5) update $\mathbf{a}$ and $\boldsymbol{\gamma}$; and (6) broadcast $\mathbf{a}$ and $\boldsymbol{\gamma}$. The time for completing each task in a single iteration was hence recorded and plotted in Figure 8 using two to eight processing nodes for the experiments where $n$ = 1,000 and $k$ = 50. One can see that when more processing nodes were used, the time for updating $\mathbf{Z}$, $\boldsymbol{\theta}$, $\mathbf{x}'\mathbf{x}$, and $\mathbf{x}'\mathbf{Z}$ (tasks 1 to 3) decreased and the time for updating $\mathbf{a}$ and $\boldsymbol{\gamma}$ (task 5) remained the same, whereas the time for sending $\mathbf{x}'\mathbf{x}$ and $\mathbf{x}'\mathbf{Z}$ to the root (task 4) varied depending on the number of processing nodes. This communication overhead was the largest with 5 nodes and the smallest with 4 or 7 nodes. Hence, given that the algorithm with seven nodes involved reduced communication overhead and computation time, it resulted in a much improved speedup and efficiency for Experiments 2 and 4.

In addition, as we noted earlier, the communication size for this approach depends on $k$, not $n$. Hence, although Experiment 3 (where $n$ = 5,000) had the largest input size ($n \times k$ = 250,000), its overhead was close to, or even smaller than, that of Experiments 2 and 4 (where $n$ = 1,000) with the use of the parallel algorithm (see Figure 6). However, because of the larger matrix size, the time used by each slave node to calculate its blocks of $\mathbf{Z}$ and $\boldsymbol{\theta}$ increased, and consequently Experiment 3 had only a slight advantage in speedup and efficiency over the other experiments. To further understand this, we can also look at the time for completing each task within one iteration of the Markov chain for Experiments 1 and 3 (see Figures 9 and 10). A comparison of Figures 8 to 10 indicates that Experiment 3 involved more calculation due to the larger input size, but, in spite of this, it had less communication overhead (e.g., in sending $\mathbf{x}'\mathbf{x}$ and $\mathbf{x}'\mathbf{Z}$ to the root) than Experiments 2 and 4.

With both decomposition schemes, the communication overhead increased more than the computational gain when a certain number of processors was used. As a result, the speedup does not always increase with an increasing number of processors and, consequently, the cost increases (see Figure 5). However, a comparison of the two parallel algorithms clearly indicates that row-wise decomposition involves much less total overhead than column-wise decomposition in all the experimented conditions (see Figure 6) and hence is a more efficient approach in terms of enhancing speedup (Figure 7) and efficiency (Figure 4) while reducing cost (Figure 5).

5. Conclusion

This study developed a high performance Gibbs sampling algorithm for the 2PNO IRT model with the purpose of achieving as short an execution time as possible using a row-wise decomposition scheme. The algorithm was implemented using the ANSI C programming language and the message-passing interface. Experiments were performed to empirically evaluate its performance with varying sample sizes, test lengths, and chain lengths.

Results indicated that the proposed parallel algorithm using row-wise decomposition (for the given problem sizes) performed much better than that using column-wise decomposition. Regarding the number of processing nodes, the algorithm worked relatively better, in terms of efficiency and cost, with two to seven processing nodes. On the other hand, in the experiment with a relatively large input matrix (e.g., Experiment 3), the smallest execution time was attained when eight processing nodes were used, the largest number of processing nodes in the experiments. Given the high data dependencies in the Gibbs sampler for IRT models, such as the dependency of one state of the chain on the previous state and the dependencies among the data within the same state, it is not possible to completely avoid communication among processors in each iteration of the chain. By changing the domain decomposition scheme from column-wise to row-wise, however, we managed to reduce the size of the data communicated so that a speedup was achieved. This further sheds light on developing high performance Gibbs samplers for more complicated IRT models. In the IRT literature, models can be more complex by assuming multiple latent traits, in which case row-wise decomposition is theoretically even more appealing than column-wise decomposition.

This study achieved parallelization through a row-wise decomposition and the use of all-to-one and one-to-all broadcast schemes. Further studies can be undertaken to increase the speedup and the efficiency and minimize the cost and the total overhead. For example, an all-to-all broadcast scheme may be adopted in order to achieve a smaller communication overhead.
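As a sketch of one such direction (our suggestion, not an implemented result), the gather-and-broadcast round could be replaced with a single MPI_Allreduce, which leaves the summed cross-products on every node; each node could then draw the item parameters locally, provided all nodes use identically seeded random number generators for that step so they produce the same draws:

/* Replace MPI_Gather + MPI_Bcast with one collective: every node ends up
   with the global x'x and x'Z sums and can update a and gamma locally. */
MPI_Allreduce(local_buf, global_buf, 2 * k + 2,
              MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
/* update_A_G(...) would then run on every node using global_buf,
   with a shared RNG seed for the item-parameter draws. */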

Appendix

The pseudocode for updating the values of $\mathbf{Z}$, $\boldsymbol{\theta}$, $\mathbf{x}'\mathbf{x}$, $\mathbf{x}'\mathbf{Z}$, $\mathbf{a}$, and $\boldsymbol{\gamma}$ is shown in Pseudocode 1. First of all, $\mathbf{Z}$ and $\boldsymbol{\theta}$ are updated through the functions update_Z and update_TH, respectively. Then, update_KSI_AR is called to compute $\mathbf{x}'\mathbf{x}$ and $\mathbf{x}'\mathbf{Z}$, and MPI_Gather is called to send them to the root. The root receives $\mathbf{x}'\mathbf{x}$ and $\mathbf{x}'\mathbf{Z}$ and calls update_A_G to update $\mathbf{a}$ and $\boldsymbol{\gamma}$. It then broadcasts $\mathbf{a}$ and $\boldsymbol{\gamma}$ by calling MPI_Bcast. In order to reduce the communication overhead, $\mathbf{x}'\mathbf{x}$ and $\mathbf{x}'\mathbf{Z}$ are sent in the same message. To achieve that, an array of size $2k + 2$ is set up, where the first $2k$ entries consist of the elements of $\mathbf{x}'\mathbf{Z}$ and the last two entries consist of the two unknown elements of $\mathbf{x}'\mathbf{x}$ (its remaining entry, $n$, is known); the name of this array in the code is KSI_array (see Pseudocode 1).

// Start iteration:
for (m = 0; m < l; m++){
count++;
update_Z(Z, y, TH, A, G, r);
update_TH(TH, THV, Z, A, G, r);
update_KSI_AR(KSI_array, Z, TH);
MPI_Gather(KSI_array, (2*k)+2, MPI_DOUBLE, KSI_rec, (2*k)+2, MPI_DOUBLE, ROOT, MPI_COMM_WORLD);
if(rank == ROOT){
double XX00=0.0; // partial sum for entry (0,0) of x'x
double XX01=0.0; // partial sum for entries (0,1) and (1,0) of x'x
double XZ00=0.0; // partial sum for entry (0,j) of x'Z
double XZ10=0.0; // partial sum for entry (1,j) of x'Z
// Retrieve x'x and x'Z from KSI_rec:
for(i=0; i<size; i++) {
XX00 += KSI_rec[2*k+(2*k+2)*i];
XX01 += KSI_rec[(2*k+1)+(2*k+2)*i];
}
gsl_matrix_set(XX, 0, 0, XX00);
gsl_matrix_set(XX, 0, 1, XX01);
gsl_matrix_set(XX, 1, 0, XX01);
gsl_matrix_set(XX, 1, 1, n);
for(j=0; j < k; j++) {
XZ00=0.0;
XZ10=0.0;
for(i=0; i<size; i++) {
// Assumed packing: row 0 of x'Z at offsets 0..k-1, row 1 at k..2k-1:
XZ00 += KSI_rec[j+(2*k+2)*i];
XZ10 += KSI_rec[(k+j)+(2*k+2)*i];
}
gsl_matrix_set(XZ, 0, j, XZ00);
gsl_matrix_set(XZ, 1, j, XZ10);
}
update_A_G(A, G, AV, GV, XX, XZ, unif, count, r);
// Transfer A and G data into a buffer so that they can be broadcasted:
for(i=0; i < k; i++){
A_G_array[i] = gsl_vector_get(A, i);
A_G_array[i+k] = gsl_vector_get(G, i);}
}
MPI_Bcast(A_G_array, 2*k, MPI_DOUBLE, ROOT, MPI_COMM_WORLD);
// Transfer A and G received to a vector structure:
for(i=0; i < k; i++){
gsl_vector_set(A, i, A_G_array[i]);
gsl_vector_set(G, i, A_G_array[i+k]);
}
}// End iteration