Computational and Mathematical Methods in Medicine
Volume 2019, Article ID 4381084, 8 pages
Research Article

Determination of Varying Group Sizes for Pooling Procedure

School of Mathematics and Statistics, Guangxi Normal University, Yucai Road 15, Guilin 541004, China

Correspondence should be addressed to Juan Ding; dingjuan@gxnu.edu.cn

Received 15 May 2018; Revised 17 January 2019; Accepted 5 February 2019; Published 1 April 2019

Academic Editor: Nadia A. Chuzhanova

Copyright © 2019 Wenjun Xiong et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Abstract

Pooling is an attractive strategy for screening infected specimens, especially for rare diseases. An essential step in performing a pooled test is to determine the group size. Sometimes an equal group size is not appropriate because of population heterogeneity. In this case, varying group sizes are preferred and can be determined when individual information is available. In this study, we propose a sequential procedure that determines varying group sizes by fully utilizing the available information. The procedure is data driven. Simulations show that it performs well in estimating parameters.

1. Introduction

Routine monitoring or large-scale screening is common in biomedical research to identify infected specimens [1–4]. However, some test kits, e.g., the nucleic acid amplification test (NAAT), are expensive [2, 5]. The expense of a large-scale monitoring process is therefore often a financial burden when resources are limited [6–8]. Pooling biospecimens is an attractive way to address this issue [9–11]; it was first used during World War II to screen for syphilis [12]. The strategy first pools specimens into groups and then screens these groups. If a group tests negative, all specimens in it are declared negative; otherwise, the specimens in the group are tested individually. When the prevalence is low, the total number of tests under pooling is far smaller than under individual testing. Because of its efficiency and cost savings, pooling is now applied in many fields, such as agriculture [13], genetics [14, 15], HIV/AIDS [16, 17], blood screening [18], and environmental epidemiology [19, 20].
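The saving from the two-stage scheme described above can be illustrated with a short calculation. The sketch below assumes a perfect assay and a common prevalence p for all specimens; the function name and the prevalence values in the usage note are illustrative, not taken from the study.

```python
def expected_tests_per_specimen(p: float, k: int) -> float:
    """Expected number of tests per specimen for two-stage pooling.

    Assumes a perfect assay and independent specimens with common prevalence p.
    """
    if k == 1:
        return 1.0  # individual testing: exactly one test per specimen
    prob_pool_positive = 1.0 - (1.0 - p) ** k
    # One pooled test is shared by k specimens; a positive pool triggers
    # k individual retests, i.e. one extra test per specimen in that pool.
    return 1.0 / k + prob_pool_positive


def best_pool_size(p: float, max_size: int) -> int:
    """Pool size in 1..max_size with the fewest expected tests per specimen."""
    return min(range(1, max_size + 1),
               key=lambda k: expected_tests_per_specimen(p, k))
```

Calling `expected_tests_per_specimen(0.01, 10)` gives a value well below 1, whereas at a prevalence of 0.5 any pool size larger than 1 costs more than one test per specimen, which is why pooling pays off only when the prevalence is low.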

The gain from pooling depends mainly on the pooling algorithm. Assuming a homogeneous population, dozens of papers have investigated how to design an efficient algorithm [21–25]. However, this assumption may be violated in practice [26–28]. When individual information is available, it is of interest to estimate individual-level prevalence by incorporating such information. Note that only the group-level status, i.e., positive or negative, is observed. This problem has been studied in a parametric context through binary regression models [29–31], and also in semiparametric [32, 33] and nonparametric contexts [34, 35]. However, the aforementioned work mostly uses a single group size that is determined in advance.

A set of pool sizes may be more appropriate when the population is heterogeneous. For example, varying pool sizes were used to estimate the infection prevalence of Myxobolus cerebralis, which causes whirling disease, among free-ranging salmonid fish collected from the Truckee River in Nevada and California [36]. In a study estimating the prevalence of several viruses in carnations grown in nursery glasshouses in Victoria, sequential pooled testing involving several pool sizes was adopted [37]. A single group size might be optimal for some estimates but far from optimal for others, especially when little information is available before the experiment [37, 38]. This issue deserves further work, since the benefit of a pooling algorithm depends mainly on the choice of pool size [38–40]. In this study, we propose a pooling strategy with varying pool sizes that takes advantage of individual information. Our procedure is a data-driven pooling algorithm in which groups are formed sequentially. Its performance is extensively investigated through simulations and a real data set.

2. Methods

2.1. Notations and Background

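Suppose $N$ specimens are assigned to $m$ groups, the $i$th of size $k_i$, for $i = 1, \ldots, m$. Let $Z_i$ denote the observed status of the $i$th group, and let $x_{ij}$ denote the covariates of the $j$th specimen in the $i$th group, for $i = 1, \ldots, m$ and $j = 1, \ldots, k_i$. The observations are $\{(Z_i, X_i)\}_{i=1}^{m}$, where $X_i = (x_{i1}, \ldots, x_{ik_i})^{T}$. Here, $A^{T}$ denotes the transpose of a matrix $A$. The sensitivity and specificity of the screening tool are denoted by $S_e$ and $S_p$, respectively. The full likelihood function is

$$L(\beta) = \prod_{i=1}^{m} \pi_i^{Z_i} (1 - \pi_i)^{1 - Z_i}, \tag{1}$$

where $\pi_i = S_e - r \prod_{j=1}^{k_i} (1 - p_{ij})$ and $r = S_e + S_p - 1$. The parameter $p_{ij}$ is defined by $g(p_{ij}) = x_{ij}^{T} \beta$, and the function $g$ is a known, monotone, and differentiable link function.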
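The likelihood described above can be sketched in code. The sketch assumes the standard group-testing form in which a pool tests positive with probability se * (1 - q) + (1 - sp) * q, where q is the probability that every specimen in the pool is truly negative, and it assumes a logistic link. The function names and the layout of the arguments are illustrative choices.

```python
import math


def inv_logit(t: float) -> float:
    """Numerically stable inverse of the logit link."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def pool_positive_prob(probs, se=1.0, sp=1.0):
    """P(pool tests positive) given individual prevalences in the pool."""
    all_negative = math.prod(1.0 - p for p in probs)
    # A truly positive pool is detected with probability se;
    # a truly negative pool is misread as positive with probability 1 - sp.
    return se * (1.0 - all_negative) + (1.0 - sp) * all_negative


def log_likelihood(beta, pools, z, se=1.0, sp=1.0):
    """Log-likelihood of pooled test results.

    pools: list of pools, each a list of covariate vectors
           (include a leading 1.0 for the intercept).
    z:     list of 0/1 pool results, one per pool.
    """
    total = 0.0
    for covariates, zi in zip(pools, z):
        probs = [inv_logit(sum(b * x for b, x in zip(beta, xvec)))
                 for xvec in covariates]
        pi = pool_positive_prob(probs, se, sp)
        total += zi * math.log(pi) + (1 - zi) * math.log(1.0 - pi)
    return total
```

Passing this function (negated) to a general-purpose optimizer would give a maximum likelihood estimate of the regression coefficients; that step is omitted here to keep the sketch dependency-free.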

Sometimes there might be a maximum admissible group size , e.g., a large group size might bring the dilution effect. Therefore, we should carefully choose an appropriate group size that is smaller than . Define a set , and denote it by , Once the group size is determined, we could obtain the estimator of β through maximum likelihood function . The Fisher information matrix of the parameter β could be rewritten as follows:where

The calculation of the Fisher information is presented in the Supplementary Materials. To obtain a better estimator of β, we try to find the group sizes that maximize the Fisher information. However, individual-level measurements make it difficult to achieve this goal.
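As a sketch of where such a Fisher information comes from, assume the usual group-testing regression setup: pool $i$ contains $k_i$ specimens with individual prevalences $p_{ij}$, its result $Z_i$ is a Bernoulli variable with success probability $\pi_i = S_e - r \prod_{j}(1 - p_{ij})$, where $r = S_e + S_p - 1$, and the pools are independent. The symbols $Z_i$, $\pi_i$, $p_{ij}$, $x_{ij}$, and $r$ are notation assumed here for illustration, and the arrangement below is not necessarily the one used in the Supplementary Materials.

```latex
\begin{aligned}
\ell(\beta) &= \sum_{i=1}^{m} \bigl\{ Z_i \log \pi_i + (1 - Z_i) \log (1 - \pi_i) \bigr\}, \\
\frac{\partial \pi_i}{\partial \beta}
  &= r \prod_{l=1}^{k_i} (1 - p_{il})
     \sum_{j=1}^{k_i} \frac{1}{1 - p_{ij}} \, \frac{\partial p_{ij}}{\partial \beta}, \\
I(\beta) &= \sum_{i=1}^{m} \frac{1}{\pi_i (1 - \pi_i)} \,
     \frac{\partial \pi_i}{\partial \beta} \,
     \frac{\partial \pi_i}{\partial \beta^{T}} .
\end{aligned}
```

Under a logistic link, $\partial p_{ij} / \partial \beta = p_{ij}(1 - p_{ij})\, x_{ij}$, so the middle line simplifies to $\partial \pi_i / \partial \beta = r \prod_{l}(1 - p_{il}) \sum_{j} p_{ij}\, x_{ij}$. Each pool contributes through its own size $k_i$, which is why the choice of group sizes affects the precision of the estimator.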

The Fisher information defined in (2) involves a measurement , along with its functions and . According to Delaigle and Hall [41], is generally close to , where . This closeness let the Fisher information reduce to the following format: where . Then, we propose to determine the group sizes through minimizing all with respect to for

Note that the approximation above requires the pools to be homogeneous. There are two ways to obtain homogeneous pools: reorder the specimens according to the similarity of their covariates, or according to their individual risk probabilities. The latter is adopted in this study. Following McMahan et al. [42], homogeneous pools are formed as follows. First, use training data or prior knowledge to obtain an initial estimator of β [42]. Second, sort the specimens by their risk probability. Let $G = \{x_1, \ldots, x_N\}$ denote the set containing the covariates of all enrolled specimens, where $N$ is the number of specimens and $x_i$ is the covariate vector of the $i$th specimen. Sort $G$ by risk probability in descending order to obtain a sorted set. The remaining procedure is performed directly on this sorted set.
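The sorting step can be sketched as follows, assuming a logistic link and an initial coefficient vector obtained elsewhere (for example from training data). The names `beta_init`, `risk_probability`, and `sort_by_risk` are hypothetical.

```python
import math


def risk_probability(x, beta_init):
    """Estimated individual risk under an assumed logistic link.

    x includes a leading 1.0 for the intercept.
    """
    t = sum(b * xi for b, xi in zip(beta_init, x))
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def sort_by_risk(covariates, beta_init):
    """Return specimen indices and risks ordered from highest to lowest risk."""
    probs = [risk_probability(x, beta_init) for x in covariates]
    order = sorted(range(len(covariates)), key=lambda i: probs[i], reverse=True)
    return order, [probs[i] for i in order]
```

Because neighbouring specimens in the sorted list have similar estimated risks, consecutive blocks of the list form approximately homogeneous pools.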

2.2. Sequential Adaptive Pooling Algorithm

Our strategy is an adaptive design, which is often adopted in the biological experiment and also in the pooled test [22]. Before stating the algorithm, we need the following result. Suppose the specimens are assigned for the first groups with the corresponding group sizes Let for and . Denote Then the group size for the next group, , equals if . Here, is the root of an equation and is approximately 1.8414. The proof of this result is presented in Supplemental Material (Available here). Our pooling strategy is described as follows:

Step 1. Label the specimens according to the ordering of . For example, label the specimen with covariants by number 1. Assign specimens with labels up to into group.

Step 2. Calculate the corresponding function and . If , defines by , choose the group size which minimizes the function , . Define the set of covariants .

Step 3. Let , . Repeat Step 2 to form the next group in the same way until all specimens are assigned.

Step 4. Screen the groups and obtain maximum likelihood estimator of .
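The sequential structure of Steps 1 to 3 can be sketched as a greedy loop over the risk-sorted specimens. The criterion used in the procedure is not reproduced here; the sketch takes a user-supplied `criterion` function and, purely for illustration, shows one hypothetical choice that keeps each pool's probability of containing a positive close to a target value. All names and the target value are assumptions.

```python
import math


def pool_positive_prob(probs):
    """Probability that a pool contains at least one positive specimen."""
    return 1.0 - math.prod(1.0 - p for p in probs)


def form_pools_sequentially(sorted_probs, max_size, criterion):
    """Greedy sequential grouping of specimens already sorted by risk.

    criterion(candidate_probs) returns a score; the candidate size with the
    smallest score is chosen for the next pool (hypothetical interface).
    """
    pools, start, n = [], 0, len(sorted_probs)
    while start < n:
        candidate_sizes = range(1, min(max_size, n - start) + 1)
        best_k = min(candidate_sizes,
                     key=lambda k: criterion(sorted_probs[start:start + k]))
        pools.append(list(range(start, start + best_k)))
        start += best_k
    return pools


def balance_criterion(target):
    """Illustrative criterion: distance of the pool-positive probability from target."""
    return lambda probs: abs(pool_positive_prob(probs) - target)
```

With risks `[0.3, 0.2, 0.1, 0.05, 0.05, 0.02, 0.02, 0.01]`, a maximum size of 5, and `balance_criterion(0.4)`, the high-risk specimens end up in a small pool and the low-risk specimens in a larger one, which is the qualitative behaviour expected from varying group sizes.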

Note that this is a data-driven pooling strategy. In addition, the above procedure does not strictly require all specimens to be enrolled before screening, since the set of enrolled specimens is dynamic and can be updated with newly enrolled specimens.
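The constant 1.8414 quoted in the result preceding the algorithm is described only as the root of an equation. As an assumption, one simple equation whose positive root agrees with that value to four decimals is x + exp(-x) = 2; whether this is the equation derived in the Supplementary Materials is not confirmed here. The bisection sketch below only checks the numerical agreement.

```python
import math


def bisect_root(f, lo, hi, tol=1e-12):
    """Find a root of f in [lo, hi] by bisection; f(lo) and f(hi) must differ in sign."""
    f_lo = f(lo)
    if (f_lo > 0) == (f(hi) > 0):
        raise ValueError("f must change sign on [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid  # root lies in the upper half
        else:
            hi = mid               # root lies in the lower half
    return 0.5 * (lo + hi)


# Assumed equation: x + exp(-x) - 2 = 0, searched on [1, 2].
threshold = bisect_root(lambda x: x + math.exp(-x) - 2.0, 1.0, 2.0)
```

The value of `threshold` agrees with 1.8414 to four decimal places.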

2.3. Numerical Results

In this section, we evaluate the performance of the proposed procedure, which we call PSV (pooling strategy with varying group sizes). For comparison, we also present the results of a pooling strategy with a single group size k, called PSS(k). The group size k for PSS(k) is either given in advance, e.g., k = 5 or k = 10, or determined from the average prevalence of the enrolled samples. For the latter, we determine the optimal single group size by minimizing the variance of the prevalence estimator.
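Choosing a single group size from an average prevalence can be sketched with the delta-method variance of the pooled prevalence estimator. The sketch assumes a homogeneous prevalence p, a fixed number of pools m, and a test with sensitivity `se` and specificity `sp` such that se + sp > 1; the function names and the search range are illustrative, and the exact variance expression used in the study may differ.

```python
def prevalence_variance(p, k, m, se=1.0, sp=1.0):
    """Approximate variance of the prevalence estimator from m pools of size k."""
    r = se + sp - 1.0                       # must be positive for identifiability
    pi = se - r * (1.0 - p) ** k            # P(pool tests positive)
    dpi_dp = r * k * (1.0 - p) ** (k - 1)   # derivative of pi with respect to p
    # Delta method: Var(p_hat) ~ Var(pi_hat) / (d pi / d p)^2, Var(pi_hat) = pi(1-pi)/m.
    return pi * (1.0 - pi) / (m * dpi_dp ** 2)


def best_single_size(p, m, max_size, se=1.0, sp=1.0):
    """Group size in 1..max_size that minimizes the approximate variance."""
    return min(range(1, max_size + 1),
               key=lambda k: prevalence_variance(p, k, m, se, sp))
```

For a fixed number of pools, the variance first decreases and then increases with k, so the minimizer depends on the prevalence and on the test accuracy; a size that is too small wastes information, which is consistent with the comparison between small and moderate fixed sizes.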

To investigate the performance of these methods, define the link function as the logistic function . Then, individual prevalence is obtained through the following model:

We first consider a single covariant (), following the normal distribution or the gamma distribution The corresponding parameters are set by and . The samples are generated under these settings, and the procedures are repeated by times. We report the estimators and , along with their mean square error (MSE) in Table 1 under different settings of sensitivity, specificity, and the number of groups. In Figure 1, we further report the relative bias of the parameters.
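One replicate of such a simulation can be sketched as follows. The sketch assumes a logistic model with an intercept and a single covariate, and a test whose errors act on the pool's true status; the coefficient values, covariate distribution, pool sizes, and accuracy values in the usage note are hypothetical and are not the settings of the study.

```python
import math
import random


def inv_logit(t):
    """Numerically stable inverse logit."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def simulate_pool_results(pools_x, beta0, beta1, se, sp, rng):
    """Simulate 0/1 pool results for pools of scalar covariates.

    pools_x: list of pools, each a list of covariate values.
    """
    results = []
    for xs in pools_x:
        # Draw each specimen's true status from its individual prevalence.
        statuses = [rng.random() < inv_logit(beta0 + beta1 * x) for x in xs]
        truly_positive = any(statuses)
        # Apply test error: sensitivity for positive pools, 1 - specificity otherwise.
        p_read_positive = se if truly_positive else 1.0 - sp
        results.append(1 if rng.random() < p_read_positive else 0)
    return results
```

A usage example under hypothetical settings: draw covariates with `rng = random.Random(2019)` and `x = [rng.gauss(0.0, 1.0) for _ in range(300)]`, split them into pools, and call `simulate_pool_results(pools, -3.0, 1.0, 0.95, 0.95, rng)`. Repeating this and refitting the model each time gives the replicate estimates from which bias and MSE are computed.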

Table 1: The performance of estimators using different pooling procedures.
Figure 1: The relative bias of the parameters and . The distribution of covariant is set by (top two panels) and (bottom two panels), with the fixed number of groups .

Table 1 shows that all procedures perform similarly except PSS(5). With the procedure PSS(k), a group size has to be chosen in advance. This choice is crucial for a group testing algorithm, since the precision of the estimators depends heavily on the group size. In our setting, the average of individual prevalence is about 0.0997, and the corresponding optimal single group size is mostly for respectively. Consequently, PSS(10) performs better than PSS(5), since the latter uses too small a group size. Figure 1 further shows the relative bias of the parameter estimates. Our procedure with varying group sizes, PSV, performs very well under the different scenarios. The procedure PSS(5) again performs worst in terms of relative bias. As data-driven pooling strategies, PSV and PSS with a data-determined group size both perform well, but PSV has the smaller bias, which is a desirable characteristic.

We proceed to consider the model (2) with . Denote the single variable in the above setting by . We add two more variables: follows the binomial distribution and follows the normal distribution . Then, the model (2) is

Specifically, denote by “Model I”: , , , and “Model II”: , , . Set the parameters by , , , and . In Figure 2, we report the relative bias of the estimators under Model I. Furthermore, define a measurement of to calculate the overall relative bias. The results are reported in Figure 3.
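The per-parameter summaries used in the simulation study can be computed from replicate estimates as sketched below. The sketch assumes the usual definitions, namely signed relative bias (mean estimate minus true value, divided by the true value) and mean squared error; the overall measure that combines several parameters is not reconstructed here, and the function names are illustrative.

```python
def relative_bias(estimates, truth):
    """Signed relative bias of replicate estimates of one parameter."""
    mean_estimate = sum(estimates) / len(estimates)
    return (mean_estimate - truth) / truth


def mean_squared_error(estimates, truth):
    """Mean squared error of replicate estimates of one parameter."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)
```

For example, replicate estimates `[1.1, 0.9, 1.3]` of a parameter whose true value is 1.0 have a mean of 1.1, so the relative bias is about 0.1.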

Figure 2: The relative bias of the parameters under Model I: , , and , with the number of groups .
Figure 3: The overall relative bias of the parameters, defined as . Model I: , , and . Model II: , , and .

Figure 2 shows that our procedure PSV performs best among the four procedures, which is consistent with the results in Figure 1. The overall relative bias of the estimators reported in Figure 3 confirms this property. It also reveals that pooling procedures using a single group size are not desirable for a heterogeneous population, even when the group size is carefully chosen.

2.4. An Illustrative Application

Verstraeten et al. conducted a surveillance study in Kenya to monitor the trend in HIV risk over time [43]. The samples were collected from pregnant women, along with potential risk covariates such as age, parity, and education level. They used a common group size of 10 to estimate the seroprevalence of HIV. However, the individual prevalence of HIV is related to those risk covariates; e.g., the risk of HIV might tend to increase with age. For this data set, Vansteelandt et al. reported a set of group sizes varying between 5 and 12 under a cost-precision trade-off [40].

We proceed to illustrate our pooling strategy based on part of these data published in [44]. They reported individuals enrolled in the experiment, including their age () and education level (). Using model presented in [2], the individual prevalence follows the model: with . Let the initial estimator be . Using our proposed pooling strategies PSV and , the group sizes are listed in Table 2. Correspondingly, we obtain estimators: using PSV and using .

Table 2: The group sizes chosen using PSV procedure for the Kenyan example.

3. Discussion

In biological and epidemiological studies, there is growing interest in methods that deliver more accurate results at lower cost. Group testing is such a cost-saving strategy. In this study, we developed a pooling strategy that uses varying group sizes when individual information is available. This strategy is attractive because it depends only on the information of the enrolled specimens and does not require a group size to be chosen in advance. Because it is data driven and theoretically justified, the proposed procedure, PSV, performs robustly under different settings. It is convenient in practice, since there is no need to choose an appropriate group size beforehand.

Varying group sizes are reasonable when the target population is diverse. For example, a sequential testing procedure using several group sizes was adopted to estimate virus infection levels of carnation populations grown in glasshouses, since different carnation populations were expected to have a wide range of infection levels [45]. More specimens can be pooled into one group if the probability of testing positive is small. It seems reasonable to balance the probability of testing positive across groups, which mimics the situation in which all enrolled specimens are homogeneous.

In this study, we also propose a procedure using a single group size determined by minimizing the variance of the prevalence estimator. This procedure may be chosen if simplicity is preferred or if the diversity among the specimens to be screened is negligible. In addition, we did not consider the cost of collecting specimens. If a test is much more expensive than collecting a specimen, then the cost of tests is the main consideration in a project involving large-scale screening. Otherwise, the overall cost of both collecting and testing should be taken into account when using the pooling strategy.

Data Availability

The Kenya data supporting this study are from previously reported studies and datasets, which have been cited. The data are available at

Conflicts of Interest

The authors declare no conflicts of interest.


Acknowledgments

This work was supported by the National Natural Science Foundation of China (nos. 11801102 and 11501134), Guangxi Scholarship Fund of Guangxi Education Department, Guangxi Natural Science Foundation (no. 2018GXNSFAA138161), and Research Projects of Guangxi Colleges (no. 2018KY0081).

Supplementary Materials

This article contains additional information on some technical aspects of the research, including the detailed calculation of the Fisher information matrix of the regression parameter and theoretical justification of Step 2 of our sequential adaptive pooling algorithm.


References

  1. F. Behets, S. Bertozzi, M. Kasali et al., “Successful use of pooled sera to determine HIV-1 seroprevalence in Zaire with development of cost-efficiency models,” AIDS, vol. 4, no. 8, pp. 737–742, 1990.
  2. D. J. Westreich, M. G. Hudgens, S. A. Fiscus, and C. D. Pilcher, “Optimizing screening for acute human immunodeficiency virus infection with pooled nucleic acid amplification tests,” Journal of Clinical Microbiology, vol. 46, no. 5, pp. 1785–1792, 2008.
  3. Z. Zhou, R. M. Mitchell, J. Gutman et al., “Pooled PCR testing strategy and prevalence estimation of submicroscopic infections using Bayesian latent class models in pregnant women receiving intermittent preventive treatment at Machinga District Hospital, Malawi, 2010,” Malaria Journal, vol. 13, no. 1, p. 509, 2014.
  4. D. Leong, K. NicAogáin, L. Luque-Sastre et al., “A 3-year multi-food study of the presence and persistence of Listeria monocytogenes in 54 small food businesses in Ireland,” International Journal of Food Microbiology, vol. 249, pp. 18–26, 2017.
  5. A. B. Hutchinson, P. Patel, S. L. Sansom et al., “Cost-effectiveness of pooled nucleic acid amplification testing for acute HIV infection after third-generation HIV antibody screening and rapid testing in the United States: a comparison of three public health settings,” PLoS Medicine, vol. 7, no. 9, Article ID e1000342, 2010.
  6. J. C. Emmanuel, M. T. Bassett, H. J. Smith, and J. A. Jacobs, “Pooling of sera for human immunodeficiency virus (HIV) testing: an economical method for use in developing countries,” Journal of Clinical Pathology, vol. 41, no. 5, pp. 582–585, 1988.
  7. S. Linauts, J. Saldanha, and D. M. Strong, “PRISM hepatitis B surface antigen detection of hepatits B virus minipool nucleic acid testing yield samples,” Transfusion, vol. 48, no. 7, pp. 1376–1382, 2008.
  8. P. Mester, A. K. Witte, C. Robben et al., “Optimization and evaluation of the qPCR-based pooling strategy DEP-pooling in dairy production for the detection of Listeria monocytogenes,” Food Control, vol. 82, pp. 298–304, 2017.
  9. C. Lindan, M. Mathur, S. Kumta et al., “Utility of pooled urine specimens for detection of Chlamydia trachomatis and Neisseria gonorrhoeae in men attending public sexually transmitted infection clinics in Mumbai, India, by PCR,” Journal of Clinical Microbiology, vol. 43, no. 4, pp. 1674–1677, 2005.
  10. P. Saha-Chaudhuri and C. R. Weinberg, “Specimen pooling for efficient use of biospecimens in studies of time to a common event,” American Journal of Epidemiology, vol. 178, no. 1, pp. 126–135, 2013.
  11. E. M. Mitchell, R. H. Lyles, A. K. Manatunga, and E. F. Schisterman, “Semiparametric regression models for a right-skewed outcome subject to pooling,” American Journal of Epidemiology, vol. 181, no. 7, pp. 541–548, 2015.
  12. R. Dorfman, “The detection of defective members of large populations,” Annals of Mathematical Statistics, vol. 14, no. 4, pp. 436–440, 1943.
  13. J. Tebbs and C. Bilder, “Confidence interval procedures for the probability of disease transmission in multiple-vector-transfer designs,” Journal of Agricultural, Biological, and Environmental Statistics, vol. 9, no. 1, pp. 79–90, 2004.
  14. J. L. Gastwirth, “The efficiency of pooling in the detection of rare mutations,” American Journal of Human Genetics, vol. 67, no. 4, pp. 1036–1039, 2000.
  15. M. Ozerov, A. Vasemägi, V. Wennevik et al., “Finding markers that make a difference: DNA pooling and SNP-arrays identify population informative markers for genetic stock identification,” PLoS One, vol. 8, no. 12, Article ID e82434, 2013.
  16. C. D. Pilcher, M. A. Price, I. F. Hoffman et al., “Frequent detection of acute primary HIV infection in men in Malawi,” AIDS, vol. 18, no. 3, pp. 517–524, 2004.
  17. S. B. Kim, H. W. Kim, H.-S. Kim et al., “Pooled nucleic acid testing to identify antiretroviral treatment failure during HIV infection in Seoul, South Korea,” Scandinavian Journal of Infectious Diseases, vol. 46, no. 2, pp. 136–140, 2014.
  18. D. H. Seo, D. H. Whang, E. Y. Song et al., “Occult hepatitis B virus infection and blood transfusion,” World Journal of Hepatology, vol. 7, no. 3, pp. 600–606, 2015.
  19. A. L. Heffernan, L. L. Aylward, L.-M. L. Toms, P. D. Sly, M. Macleod, and J. F. Mueller, “Pooled biological specimens for human biomonitoring of environmental chemicals: opportunities and limitations,” Journal of Exposure Science and Environmental Epidemiology, vol. 24, no. 3, pp. 225–232, 2014.
  20. M. Ramos, A. L. Heffernan, L. Toms et al., “Concentrations of phthalates and DINCH metabolites in pooled urine from Queensland, Australia,” Environment International, vol. 88, pp. 179–186, 2016.
  21. W. H. Swallow, “Relative mean squared error and cost considerations in choosing group size for group testing to estimate infection rates and probabilities of disease transmission,” Phytopathology, vol. 77, no. 10, pp. 1376–1381, 1987.
  22. J. M. Hughes-Oliver and W. H. Swallow, “A two-stage adaptive group-testing procedure for estimating small proportions,” Journal of the American Statistical Association, vol. 89, no. 427, pp. 982–993, 1994.
  23. X. M. Tu, E. Litvak, and M. Pagano, “On the informativeness and accuracy of pooled testing in estimating prevalence of a rare disease: application to HIV screening,” Biometrika, vol. 82, no. 2, pp. 287–297, 1995.
  24. A. Liu, C. Liu, Z. Zhang, and P. S. Albert, “Optimality of group testing in the presence of misclassification,” Biometrika, vol. 99, no. 1, pp. 245–251, 2011.
  25. W. Xiong and J. Ding, “Robust procedures for experimental design in group testing considering misclassification,” Statistics & Probability Letters, vol. 100, pp. 35–41, 2015.
  26. P. Chen, J. M. Tebbs, and C. R. Bilder, “Group testing regression models with fixed and random effects,” Biometrics, vol. 65, no. 4, pp. 1270–1278, 2009.
  27. Z. Zhang, A. Liu, R. H. Lyles, and B. Mukherjee, “Logistic regression analysis of biomarker data subject to pooling and dichotomization,” Statistics in Medicine, vol. 31, no. 22, pp. 2473–2484, 2012.
  28. Q. Li, A. Liu, and W. Xiong, “D-optimality of group testing for joint estimation of correlated rare diseases with misclassification,” Statistica Sinica, vol. 27, no. 2, pp. 823–838, 2017.
  29. J. L. Gastwirth and P. A. Hammick, “Estimation of the prevalence of a rare disease, preserving the anonymity of the subjects by group testing: application to estimating the prevalence of AIDS antibodies in blood donors,” Journal of Statistical Planning and Inference, vol. 22, no. 1, pp. 15–27, 1989.
  30. M. Xie, “Regression analysis of group testing samples,” Statistics in Medicine, vol. 20, no. 13, pp. 1957–1969, 2001.
  31. C. R. Bilder and J. M. Tebbs, “Bias, efficiency, and agreement for group-testing regression models,” Journal of Statistical Computation and Simulation, vol. 79, no. 1, pp. 67–80, 2009.
  32. M. Li and M. Xie, “Nonparametric and semiparametric regression analysis of group testing samples,” International Journal of Statistics in Medical Research, vol. 1, no. 1, pp. 60–72, 2012.
  33. D. Wang, C. S. McMahan, C. M. Gallagher, and K. B. Kulasekera, “Semiparametric group testing regression models,” Biometrika, vol. 101, no. 3, pp. 587–598, 2013.
  34. A. Delaigle and A. Meister, “Nonparametric regression analysis for group testing data,” Journal of the American Statistical Association, vol. 106, no. 494, pp. 640–650, 2011.
  35. A. Delaigle and W.-X. Zhou, “Nonparametric and parametric estimators of prevalence from group testing data with aggregated covariates,” Journal of the American Statistical Association, vol. 110, no. 512, pp. 1785–1796, 2015.
  36. C. J. Williams and C. M. Moffitt, “Estimation of fish and wildlife disease prevalence from imperfect diagnostic tests on pooled samples with varying pool sizes,” Ecological Informatics, vol. 5, no. 4, pp. 273–280, 2010.
  37. G. Hepworth, “Confidence intervals for proportions estimated by group testing with groups of unequal size,” Journal of Agricultural, Biological, and Environmental Statistics, vol. 10, no. 4, pp. 478–497, 2005.
  38. G. Haber and Y. Malinovsky, “Random walk designs for selecting pool sizes in group testing estimation with small samples,” Biometrical Journal, vol. 59, no. 6, pp. 1382–1398, 2017.
  39. G. Haber, Y. Malinovsky, and P. S. Albert, “Sequential estimation in the group testing problem,” Sequential Analysis, vol. 37, no. 1, pp. 1–17, 2018.
  40. S. Vansteelandt, E. Goetghebeur, and T. Verstraeten, “Regression models for disease prevalence with diagnostic tests on pools of serum samples,” Biometrics, vol. 56, no. 4, pp. 1126–1133, 2000.
  41. A. Delaigle and P. Hall, “Nonparametric regression with homogeneous group testing data,” The Annals of Statistics, vol. 40, no. 1, pp. 131–158, 2012.
  42. C. S. McMahan, J. M. Tebbs, and C. R. Bilder, “Informative Dorfman screening,” Biometrics, vol. 68, no. 1, pp. 287–296, 2012.
  43. T. Verstraeten, B. Farah, L. Duchateau, and R. Matu, “Pooling sera to reduce the cost of HIV surveillance: a feasibility study in a rural Kenyan district,” Tropical Medicine & International Health, vol. 3, no. 9, pp. 747–750, 1998.
  44. C. R. Bilder, B. Zhang, F. Schaarschmidt, and J. M. Tebbs, “binGroup: a package for group testing,” The R Journal, vol. 2, no. 2, pp. 56–60, 2010.
  45. G. Hepworth and R. Watson, “Debiased estimation of proportions in group testing,” Journal of the Royal Statistical Society: Series C (Applied Statistics), vol. 58, no. 1, pp. 105–121, 2009.