Abstract

Pooling is an attractive strategy for screening specimens for infection, especially for rare diseases. An essential step in pooled testing is to determine the group size. An equal group size is sometimes inappropriate because of population heterogeneity. In this case, varying group sizes are preferred and can be determined when individual-level information is available. In this study, we propose a sequential procedure that determines varying group sizes by fully utilizing the available information. The procedure is data driven. Simulations show that it performs well in estimating parameters.

1. Introduction

Routine monitoring and large-scale screening are common in biomedical research to identify infected specimens [1–4]. However, some test kits, e.g., the nucleic acid amplification test (NAAT), are expensive [2, 5]. The expense of a large-scale monitoring process is therefore usually a financial burden when resources are limited [6–8]. Pooling biospecimens, a strategy first used during World War II to screen for syphilis [12], is attractive for addressing this issue [9–11]. The strategy first pools specimens into groups and then screens these groups. If a group tests negative, all specimens in it are declared negative; otherwise, the specimens in the group are tested individually. When the prevalence is low, the total number of tests under pooling is far smaller than under individual testing. Owing to its efficiency and cost savings, pooling is now applied in many fields, such as agriculture [13], genetics [14, 15], HIV/AIDS [16, 17], blood screening [18], and environmental epidemiology [19, 20].
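The saving from two-stage pooling can be illustrated with a short calculation. The sketch below is not part of the original study: it assumes a perfect assay, independent specimens, and a common prevalence p, and the function name is chosen here for illustration only.

```python
def expected_tests_per_specimen(p: float, k: int) -> float:
    """Expected number of tests per specimen under two-stage pooling.

    Assumes a perfect assay and independent specimens with common prevalence p.
    """
    if k == 1:
        return 1.0  # individual testing: exactly one test per specimen
    prob_group_positive = 1.0 - (1.0 - p) ** k
    # One pooled test is shared by k specimens; a positive pool adds k retests.
    return 1.0 / k + prob_group_positive


# Compare the best group size (searched over 1..50) at several prevalences.
for p in (0.01, 0.05, 0.20):
    best_k = min(range(1, 51), key=lambda k: expected_tests_per_specimen(p, k))
    print(p, best_k, round(expected_tests_per_specimen(p, best_k), 3))
```

Under these assumptions, a value below 1 means that pooling needs fewer tests than individual testing, which happens when the prevalence is low.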

The gain from pooling depends mainly on the pooling algorithm. Assuming a homogeneous population, many papers have investigated how to design an efficient algorithm [21–25]. However, this assumption might be violated in practice [26–28]. When individual information is available, it is of interest to estimate individual-level prevalence by incorporating such information. Note that only the group-level status, e.g., positive or negative, is observed. This problem has been studied in the parametric context through the framework of binary regression models [29–31], and also in semiparametric [32, 33] and nonparametric contexts [34, 35]. However, the aforementioned work mostly uses a single group size that is determined in advance.

A set of pool sizes might be more appropriate when the population is heterogeneous. For example, varying pool sizes were used to estimate the infection prevalence of Myxobolus cerebralis, which causes whirling disease, among free-ranging salmonid fish collected from the Truckee River in Nevada and California [36]. In a study estimating the prevalence of several viruses in carnations grown in nursery glasshouses in Victoria, sequential pooled testing involving several pool sizes was adopted [37]. A single group size might be optimal for some estimates but far from optimal for others, especially when little information is available before the experiment [37, 38]. This issue deserves more work since the benefit of a pooling algorithm depends mainly on the choice of pool size [38–40]. In this study, we propose a pooling strategy with varying pool sizes that takes advantage of individual information. Our procedure is a data-driven pooling algorithm in which groups are formed sequentially. Its performance is investigated by simulations and a real data set.

2. Methods

2.1. Notations and Background

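Suppose $N$ specimens are assigned into $m$ groups, the $i$th group having size $k_i$ for $i = 1, \ldots, m$. Let $Z_i$ denote the observed status of the $i$th group, and let $x_{ij}$ denote the covariates of the $j$th specimen in the $i$th group for $i = 1, \ldots, m$ and $j = 1, \ldots, k_i$. The observations are $\{(Z_i, X_i), i = 1, \ldots, m\}$, where $X_i = (x_{i1}, \ldots, x_{ik_i})^T$. Here, the notation $A^T$ represents the transpose of a matrix $A$. The sensitivity and specificity of the screening tool are denoted by $S_e$ and $S_p$, respectively. The full likelihood function is

$$L(\beta) = \prod_{i=1}^{m} \pi_i^{Z_i} (1 - \pi_i)^{1 - Z_i}, \qquad (1)$$

where $\pi_i = S_e - r \prod_{j=1}^{k_i} (1 - p_{ij})$ and $r = S_e + S_p - 1$. The parameter $p_{ij}$ is defined by $g(p_{ij}) = x_{ij}^T \beta$, and the function $g$ is a known, monotone, and differentiable link function.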
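As an illustration of the likelihood, the following sketch evaluates the log-likelihood of pooled binary responses. It is a minimal example under stated assumptions rather than the authors' implementation: a group is taken to test positive with probability Se − (Se + Sp − 1)∏(1 − p_ij), the link is assumed to be logistic with an intercept as the first coefficient, and all function names and default values are hypothetical.

```python
import math


def individual_prob(x, beta):
    """Logistic inverse link; beta[0] is the intercept (an assumption)."""
    eta = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-eta))


def group_positive_prob(group_x, beta, se, sp):
    """P(group tests positive) = Se - (Se + Sp - 1) * prod_j (1 - p_ij)."""
    all_negative = 1.0
    for x in group_x:
        all_negative *= 1.0 - individual_prob(x, beta)
    return se - (se + sp - 1.0) * all_negative


def log_likelihood(groups, statuses, beta, se=0.95, sp=0.95):
    """Sum of Bernoulli log-densities of the observed group statuses."""
    total = 0.0
    for group_x, z in zip(groups, statuses):
        pi = group_positive_prob(group_x, beta, se, sp)
        total += z * math.log(pi) + (1 - z) * math.log(1.0 - pi)
    return total
```

Here `groups` is a list of groups, each group a list of covariate vectors, and `statuses` holds the corresponding 0/1 test results. Maximizing `log_likelihood` over `beta` with any numerical optimizer would give the maximum likelihood estimator.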

Sometimes there might be a maximum admissible group size , e.g., a large group size might bring the dilution effect. Therefore, we should carefully choose an appropriate group size that is smaller than . Define a set , and denote it by , Once the group size is determined, we could obtain the estimator of β through maximum likelihood function . The Fisher information matrix of the parameter β could be rewritten as follows:where

The calculation of the Fisher information is presented in the Supplemental Material. To obtain a better estimator of β, we seek the group sizes that maximize the Fisher information. However, individual-level measurements make it difficult to achieve this goal.
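For readers who want to experiment numerically, the Fisher information of pooled binary data can be computed directly. The sketch below is an assumption-laden illustration and not the expression derived in the Supplemental Material: it uses the generic Bernoulli form, the sum over groups of (∂π_i/∂β)(∂π_i/∂β)^T / {π_i(1 − π_i)}, with π_i the probability that the ith group tests positive, and it assumes a logistic link with an intercept. The function name and defaults are hypothetical.

```python
import math


def fisher_information(groups, beta, se=0.95, sp=0.95):
    """Fisher information matrix of beta for pooled binary responses.

    Assumes a logistic link; each covariate vector is prefixed with 1.0
    so that beta[0] acts as the intercept.
    """
    r = se + sp - 1.0
    d = len(beta)
    info = [[0.0] * d for _ in range(d)]
    for group_x in groups:
        all_negative = 1.0
        weighted_x = [0.0] * d  # sum over specimens of p_ij * (1, x_ij)
        for x in group_x:
            design = [1.0] + list(x)
            eta = sum(b * v for b, v in zip(beta, design))
            p = 1.0 / (1.0 + math.exp(-eta))
            all_negative *= 1.0 - p
            for a in range(d):
                weighted_x[a] += p * design[a]
        pi = se - r * all_negative
        # For the logistic link, d(pi)/d(beta) = r * prod(1 - p_ij) * sum_j p_ij * x_ij.
        grad = [r * all_negative * w for w in weighted_x]
        scale = 1.0 / (pi * (1.0 - pi))
        for a in range(d):
            for b in range(d):
                info[a][b] += scale * grad[a] * grad[b]
    return info
```

With one specimen per group and a perfect assay, this reduces to the usual logistic regression information, which gives a quick sanity check.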

The Fisher information defined in (2) involves a measurement , along with its functions and . According to Delaigle and Hall [41], is generally close to , where . This closeness let the Fisher information reduce to the following format: where . Then, we propose to determine the group sizes through minimizing all with respect to for

Note that the aforementioned approximation requires the pools to be homogeneous. There are two ways to obtain homogeneous pools: reordering the specimens according to the similarity of their covariates, or according to their individual risk probabilities. The latter is adopted in this study. Following McMahan et al. [42], the procedure for forming homogeneous pools is as follows. First, use training data or prior knowledge to obtain an initial estimator of β [42]. Second, sort the specimens by their risk probabilities. Let G denote the set containing the covariates of all enrolled specimens, where N is the number of specimens. Sort G by risk probability in descending order to obtain a sorted set. The remaining procedure is performed directly on this sorted set.
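The sorting step can be sketched in a few lines. This is an illustration only: it assumes a logistic link with an intercept, and the names `risk_probability`, `sort_by_risk`, and `beta_init` (the initial estimator) are hypothetical.

```python
import math


def risk_probability(x, beta_init):
    """Individual risk under an assumed logistic model with intercept beta_init[0]."""
    eta = beta_init[0] + sum(b * v for b, v in zip(beta_init[1:], x))
    return 1.0 / (1.0 + math.exp(-eta))


def sort_by_risk(covariates, beta_init):
    """Return the original indices and the covariates in descending order of risk."""
    order = sorted(
        range(len(covariates)),
        key=lambda i: risk_probability(covariates[i], beta_init),
        reverse=True,
    )
    return order, [covariates[i] for i in order]
```

Keeping the original indices alongside the sorted covariates makes it possible to map each pooled result back to the enrolled specimens.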

2.2. Sequential Adaptive Pooling Algorithm

Our strategy is an adaptive design, which is often adopted in biological experiments and also in pooled testing [22]. Before stating the algorithm, we need the following result. Suppose the specimens are assigned for the first groups with the corresponding group sizes Let for and . Denote Then the group size for the next group, , equals if . Here, is the root of an equation and is approximately 1.8414. The proof of this result is presented in the Supplemental Material. Our pooling strategy is described as follows:

Step 1. Label the specimens according to the ordering of . For example, label the specimen with covariants by number 1. Assign specimens with labels up to into group.

Step 2. Calculate the corresponding function and . If , defines by , choose the group size which minimizes the function , . Define the set of covariants .

Step 3. Let , . Repeat Step 2 to form the next group in the same way until all specimens are assigned.

Step 4. Screen the groups and obtain the maximum likelihood estimator of β.

Note that this is a data-driven pooling strategy. Additionally, the procedure does not strictly require all specimens to be enrolled before screening, since the sorted set is dynamic and can be updated with newly enrolled specimens.
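The sequential structure of such a procedure can be sketched as follows. The stopping rule used here is a stand-in, not the criterion of Step 2: each group grows until the probability that it contains at least one positive specimen reaches a hypothetical `target`, or until it reaches the maximum admissible size `k_max`. Input risk probabilities are assumed to be sorted in descending order and to lie strictly below 1.

```python
import math


def sequential_pooling(sorted_probs, k_max, target=0.8):
    """Greedily form groups from risk-sorted specimens.

    A group is closed once P(at least one positive) >= target or once it
    holds k_max specimens. `target` is an illustrative tuning parameter.
    """
    groups, current, log_all_negative = [], [], 0.0
    for index, p in enumerate(sorted_probs):
        current.append(index)
        log_all_negative += math.log1p(-p)  # accumulates log(1 - p)
        prob_positive = 1.0 - math.exp(log_all_negative)
        if prob_positive >= target or len(current) == k_max:
            groups.append(current)
            current, log_all_negative = [], 0.0
    if current:  # leftover specimens form a final, possibly smaller, group
        groups.append(current)
    return groups
```

Because groups are closed one at a time, newly enrolled specimens can be appended to the unassigned tail of the sorted list without revisiting groups that have already been formed. High-risk specimens end up in small groups and low-risk specimens in large ones, which balances the probability of a positive result across groups.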

2.3. Numerical Results

In this section, we evaluate the performance of our proposed procedure, which we call PSV (pooling strategy with varied group sizes). For comparison, we also present the results of the pooling strategy with a single group size k, called PSS(k). The group size k for PSS(k) is either given in advance or determined from the average prevalence of the enrolled samples. For the latter, we determine the optimal single group size by minimizing the variance of the prevalence estimator.
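A data-driven single group size can be found by a direct search. The sketch below is one possible reading, not necessarily the exact criterion used in the study: it assumes a homogeneous population with prevalence p, a fixed number of groups, and the standard asymptotic variance of the prevalence estimator from pooled tests, which is proportional to π(1 − π) / {r k (1 − p)^(k−1)}² with π = Se − r(1 − p)^k and r = Se + Sp − 1. Function names and defaults are hypothetical.

```python
def variance_factor(p, k, se, sp):
    """Asymptotic variance of the prevalence estimator per group of size k."""
    r = se + sp - 1.0
    q = 1.0 - p
    pi = se - r * q ** k  # probability that a group of size k tests positive
    return pi * (1.0 - pi) / (r * k * q ** (k - 1)) ** 2


def optimal_single_group_size(mean_prevalence, k_max, se=0.95, sp=0.95):
    """Group size in 1..k_max minimizing the variance factor."""
    return min(
        range(1, k_max + 1),
        key=lambda k: variance_factor(mean_prevalence, k, se, sp),
    )
```

The search is capped at `k_max` so that the chosen size respects a maximum admissible group size. Lower prevalence favors larger groups under this criterion.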

To investigate the performance of these methods, define the link function as the logistic function $g(p) = \log\{p/(1 - p)\}$. Then, individual prevalence is obtained through the following model:

$$p = \frac{\exp(x^T \beta)}{1 + \exp(x^T \beta)}.$$

We first consider a single covariant (), following the normal distribution or the gamma distribution The corresponding parameters are set by and . The samples are generated under these settings, and the procedures are repeated by times. We report the estimators and , along with their mean square error (MSE) in Table 1 under different settings of sensitivity, specificity, and the number of groups. In Figure 1, we further report the relative bias of the parameters.
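A simulation of this kind can be sketched as follows. The values used here (a standard normal covariate, the coefficients, and the group size) are placeholders chosen for illustration and are not the settings of the study. Groups are formed with a fixed size for simplicity, and the function name is hypothetical.

```python
import math
import random


def simulate_pooled_data(n, k, beta, se, sp, rng):
    """Simulate pooled test results for n specimens in groups of size k.

    Each specimen has one covariate x ~ N(0, 1) (a placeholder choice) and
    is truly positive with logistic probability based on beta[0] + beta[1] * x.
    """
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    groups, statuses = [], []
    for start in range(0, n, k):
        group = xs[start:start + k]
        any_positive = False
        for x in group:
            p = 1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * x)))
            if rng.random() < p:
                any_positive = True
        # The assay detects a truly positive pool with probability Se and
        # falsely flags a truly negative pool with probability 1 - Sp.
        prob_test_positive = se if any_positive else 1.0 - sp
        statuses.append(1 if rng.random() < prob_test_positive else 0)
        groups.append(group)
    return groups, statuses


rng = random.Random(2024)
groups, statuses = simulate_pooled_data(500, 5, [-3.0, 1.0], 0.95, 0.95, rng)
print(len(groups), sum(statuses))
```

Repeating the simulation many times and fitting the model to each replicate would yield empirical estimates of bias and mean square error for any pooling strategy under comparison.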

Table 1 shows that all procedures perform similarly except PSS(5). When the group size k of PSS(k) is prespecified, we have to choose it in advance. This choice is crucial for a group testing algorithm since the precision of the estimators depends heavily on the group size. In our setting, the average individual prevalence is about 0.0997, and the corresponding optimal single group size is mostly for respectively. Consequently, PSS(10) performs better than PSS(5) since the latter uses too small a group size. Figure 1 further shows the relative bias of the parameter estimators. Our procedure with varying group sizes, PSV, performs very well under different scenarios. PSS(5) again has the poorest performance in terms of relative bias. As data-driven pooling strategies, PSV and PSS(k) with the data-driven group size both perform well, but PSV has smaller bias, which is a desirable characteristic.

We proceed to consider the model (2) with . Denote the single variable in the above setting by . We add two more variables: follows the binomial distribution and follows the normal distribution . Then, the model (2) is

Specifically, denote by “Model I”: , , , and “Model II”: , , . Set the parameters by , , , and . In Figure 2, we report the relative bias of the estimators under Model I. Furthermore, define a measurement of to calculate the overall relative bias. The results are reported in Figure 3.

Figure 2 shows that our procedure PSV performs best among the four procedures, a result similar to that shown in Figure 1. The overall relative bias of the estimators reported in Figure 3 confirms this. It also reveals that pooling procedures using a single group size are not desirable for a heterogeneous population, even when the group size is carefully chosen.

2.4. An Illustrative Application

Verstraeten et al. conducted a surveillance study in Kenya to monitor the trend in HIV risk over time [43]. The samples were collected from pregnant women, along with potential risk covariates such as age, parity, and education level. They used a common group size of 10 to estimate the seroprevalence of HIV. However, the individual prevalence of HIV is related to those risk covariates, e.g., the risk of HIV might tend to increase with age. For this data set, Vansteelandt et al. reported a set of group sizes varying between 5 and 12 under a cost-precision trade-off [40].

We proceed to illustrate our pooling strategy based on part of these data published in [44]. They reported individuals enrolled in the experiment, including their age () and education level (). Using model presented in [2], the individual prevalence follows the model: with . Let the initial estimator be . Using our proposed pooling strategies PSV and , the group sizes are listed in Table 2. Correspondingly, we obtain estimators: using PSV and using .

3. Discussion

In biological and epidemiological studies, there is growing interest in developing methods that give more accurate results at lower cost. Group testing is such a cost-saving strategy. In this study, we developed a pooling strategy that uses varying group sizes when individual information is available. This strategy is attractive since it depends only on the information of the enrolled specimens and does not require a group size chosen in advance. Because it is data driven and theoretically justified, the proposed procedure, PSV, performs robustly under different settings. It is convenient for practical application since there is no need to choose an appropriate group size beforehand.

Varying group sizes are reasonable when the target population is diverse. For example, a sequential testing procedure using several group sizes was adopted to estimate virus infection levels of carnation populations grown in glasshouses, since different carnation populations were expected to have a wide range of infection levels [45]. We could pool more specimens into one group if the probability of testing positive is small. It is reasonable to balance the probability of testing positive across groups, which mimics the situation in which all enrolled specimens are homogeneous.

In this study, we also propose a procedure using a single group size determined by minimizing the variance of the estimator of the prevalence. This procedure could be chosen if a simple procedure is preferred or if the diversity among the specimens to be screened is negligible. In addition, we did not consider the cost of collecting specimens. If testing is much more expensive than collecting specimens, then the cost of tests is the main consideration in a project involving large-scale screening. Otherwise, it is necessary to take into account the overall cost of collecting and testing when using the pooling strategy.

Data Availability

The Kenya data supporting this study are from previously reported studies and datasets, which have been cited. The data are available at https://cran.r-project.org/package=binGroup.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (nos. 11801102 and 11501134), Guangxi Scholarship Fund of Guangxi Education Department, Guangxi Natural Science Foundation (no. 2018GXNSFAA138161), and Research Projects of Guangxi Colleges (no. 2018KY0081).

Supplementary Materials

This article contains additional information on some technical aspects of the research, including the detailed calculation of the Fisher information matrix of the regression parameter and theoretical justification of Step 2 of our sequential adaptive pooling algorithm.