Computational and Mathematical Methods in Medicine

Volume 2019, Article ID 4381084, 8 pages

https://doi.org/10.1155/2019/4381084

## Determination of Varying Group Sizes for Pooling Procedure

School of Mathematics and Statistics, Guangxi Normal University, Yucai Road 15, Guilin 541004, China

Correspondence should be addressed to Juan Ding; dingjuan@gxnu.edu.cn

Received 15 May 2018; Revised 17 January 2019; Accepted 5 February 2019; Published 1 April 2019

Academic Editor: Nadia A. Chuzhanova

Copyright © 2019 Wenjun Xiong et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Pooling is an attractive strategy for screening infected specimens, especially for rare diseases. An essential step in performing a pooled test is determining the group size. Sometimes an equal group size is not appropriate because of population heterogeneity. In this case, varying group sizes are preferred and can be determined when individual-level information is available. In this study, we propose a sequential, data-driven procedure that determines varying group sizes by fully utilizing the available information. Simulations show that it performs well in estimating parameters.

#### 1. Introduction

Routine monitoring and large-scale screening are common in biomedical research for identifying infected specimens [1–4]. However, some test kits, e.g., the nucleic acid amplification test (NAAT), are expensive [2, 5], so large-scale monitoring is often a financial burden when resources are limited [6–8]. Pooling biospecimens is an attractive way to address this issue [9–11]; the strategy was first used during World War II to screen for syphilis [12]. Specimens are first pooled into groups, and the groups are then screened. If a group tests negative, all specimens in it are declared negative; otherwise, the specimens in the group are tested individually. When the prevalence is low, the total number of tests under pooling is far smaller than under individual testing. Owing to its efficiency and cost savings, pooling is now applied in many fields, such as agriculture [13], genetics [14, 15], HIV/AIDS [16, 17], blood screening [18], and environmental epidemiology [19, 20].
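To make the saving concrete, the following minimal sketch computes the expected number of tests per specimen for the two-stage scheme described above. It assumes a perfect test and a common prevalence `p` for all specimens; both are simplifying assumptions for illustration, and the function names are hypothetical.

```python
def expected_tests_per_specimen(p: float, k: int) -> float:
    """Expected tests per specimen for two-stage pooling with group size k.

    Assumes a perfect test and a common prevalence p (illustrative only).
    """
    if k == 1:
        return 1.0  # individual testing: one test per specimen
    # One pooled test shared by k specimens, plus k individual retests
    # whenever the pool contains at least one positive specimen.
    return 1.0 / k + 1.0 - (1.0 - p) ** k


def best_group_size(p: float, k_max: int = 50) -> int:
    """Group size in 1..k_max that minimizes the expected tests per specimen."""
    return min(range(1, k_max + 1),
               key=lambda k: expected_tests_per_specimen(p, k))


if __name__ == "__main__":
    for p in (0.01, 0.05, 0.5):
        k = best_group_size(p)
        print(p, k, round(expected_tests_per_specimen(p, k), 4))
```

Under these assumptions, pooling at a prevalence of 1% needs roughly one fifth of the tests required by individual testing, whereas at a prevalence of 50% no group size larger than 1 is beneficial.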

The gain from pooling depends mainly on the pooling algorithm. Assuming a homogeneous population, dozens of papers have investigated how to design an efficient algorithm [21–25]. However, this assumption may be violated in practice [26–28]. When individual-level information is available, it is of interest to estimate individual-level prevalence by incorporating such information, even though only the group-level status (positive or negative) is observed. This problem has been studied in the parametric context through binary regression models [29–31], and also in semiparametric [32, 33] and nonparametric contexts [34, 35]. However, most of this work uses a single group size that is determined in advance.

A set of pool sizes might be more appropriate when the population is heterogeneous. For example, varying pool sizes were used to estimate the infection prevalence of *Myxobolus cerebralis*, which causes whirling disease, among free-ranging salmonid fish collected from the Truckee River in Nevada and California [36]. In a study estimating the prevalence of several viruses in carnations grown in nursery glasshouses in Victoria, sequential pooled testing involving several pool sizes was adopted [37]. A single group size might be optimal for some estimates but far from optimal for others, especially when little information is available before the experiment [37, 38]. This issue deserves further study, since the benefit of a pooling algorithm depends mainly on the choice of pool size [38–40]. In this study, we propose a pooling strategy with varying pool sizes that takes advantage of individual-level information. Our procedure is a data-driven pooling algorithm in which groups are formed sequentially. Its performance is extensively investigated using simulations and a real data set.

#### 2. Methods

##### 2.1. Notations and Background

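Suppose $N$ specimens are assigned to $m$ groups, the $i$th of size $k_i$, for $i = 1, \ldots, m$. Let $Z_i$ denote the observed status of the $i$th group and $X_{ij}$ the covariates of the $j$th specimen in the $i$th group, for $i = 1, \ldots, m$ and $j = 1, \ldots, k_i$. The observations are $\{(Z_i, X_i)\}_{i=1}^{m}$, where $X_i = (X_{i1}, \ldots, X_{ik_i})^{T}$. Here, the notation $A^{T}$ represents the transpose of a matrix $A$. The sensitivity and specificity of the screening tool are denoted by $S_e$ and $S_p$, respectively. The full likelihood function is

$$L(\beta) = \prod_{i=1}^{m} \pi_i^{Z_i} (1 - \pi_i)^{1 - Z_i}, \qquad (1)$$

where $\pi_i = S_e - r \prod_{j=1}^{k_i} (1 - p_{ij})$ and $r = S_e + S_p - 1$. The parameter $p_{ij}$ is defined by $p_{ij} = g(X_{ij}^{T}\beta)$, and the function $g$ is a known, monotone, and differentiable link function.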
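As a minimal sketch of how a likelihood of this kind might be evaluated, the code below assumes the standard misclassification model for pooled tests: a pool tests positive with probability equal to the sensitivity if it contains at least one positive specimen, and with probability one minus the specificity otherwise. It also assumes a logistic relationship between covariates and individual risk. The function names and default values are hypothetical.

```python
import math


def inv_logit(u: float) -> float:
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-u))


def pooled_log_likelihood(beta, pools, statuses, se=0.95, sp=0.95):
    """Log-likelihood of pooled test results (illustrative sketch).

    beta     : regression coefficients
    pools    : list of pools, each a list of covariate vectors
    statuses : observed pool results (1 = positive, 0 = negative)
    se, sp   : assumed sensitivity and specificity of the test
    """
    r = se + sp - 1.0
    loglik = 0.0
    for covariates, z in zip(pools, statuses):
        # Probability that every specimen in the pool is truly negative.
        prob_all_negative = 1.0
        for x in covariates:
            p = inv_logit(sum(b * xi for b, xi in zip(beta, x)))
            prob_all_negative *= 1.0 - p
        # Probability that the pool tests positive, allowing for test error.
        pi = se - r * prob_all_negative
        loglik += math.log(pi) if z == 1 else math.log(1.0 - pi)
    return loglik
```

A maximum likelihood estimate could then be obtained by passing the negative of this function to a general-purpose numerical optimizer.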

Sometimes there is a maximum admissible group size , e.g., because a large group may cause a dilution effect. Therefore, we should carefully choose an appropriate group size that is smaller than . Define a set , and denote it by , Once the group size is determined, we can obtain the estimator of *β* by maximizing the likelihood function . The Fisher information matrix of the parameter *β* can be rewritten as follows: where

The calculation of the Fisher information is presented in the Supplemental Material. To obtain a better estimator of *β*, we seek the group sizes that maximize the Fisher information. However, individual-level measurements make it difficult to achieve this goal.

The Fisher information defined in (2) involves a measurement , along with its functions and . According to Delaigle and Hall [41], is generally close to , where . This closeness lets the Fisher information reduce to the following form: where . We then propose to determine the group sizes by minimizing all with respect to for

Note that the aforementioned approximation requires that the pools be homogeneous. There are two ways to obtain homogeneous pools: reorder the specimens according to the similarity of their covariates, or according to their individual risk probabilities. The latter is adopted in this study. Following the method in McMahan et al. [42], the procedure for forming homogeneous pools is as follows. First, use training data or prior knowledge to obtain an initial estimator [42]. Second, sort the specimens by their risk probability. Let *G* denote the set that contains the covariates of all enrolled specimens, where *N* is the number of specimens and is the covariate of the specimen. Sort *G* by risk probability in descending order, and obtain a sorted set . The remaining procedure is performed directly on this sorted set.
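The sorting step can be sketched as follows. This is only an illustration: it assumes a logistic model for the risk probability, and `beta_init` stands for whatever initial estimate is available from training data or prior knowledge; the names are hypothetical.

```python
import math


def risk_probability(x, beta_init):
    """Estimated individual risk under an assumed logistic model."""
    eta = sum(b * xi for b, xi in zip(beta_init, x))
    return 1.0 / (1.0 + math.exp(-eta))


def sort_by_risk(covariates, beta_init):
    """Return the covariate vectors ordered by descending estimated risk."""
    return sorted(covariates,
                  key=lambda x: risk_probability(x, beta_init),
                  reverse=True)


if __name__ == "__main__":
    # Each vector is (intercept term, covariate); the values are made up.
    specimens = [(1.0, -1.0), (1.0, 2.0), (1.0, 0.5)]
    print(sort_by_risk(specimens, beta_init=(-3.0, 1.0)))
```

Because neighbouring specimens in the sorted list have similar estimated risks, consecutive blocks of the list form approximately homogeneous pools.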

##### 2.2. Sequential Adaptive Pooling Algorithm

Our strategy is an adaptive design, which is often adopted in biological experiments and also in pooled testing [22]. Before stating the algorithm, we need the following result. Suppose the specimens are assigned for the first groups with the corresponding group sizes Let for and . Denote Then the group size for the next group, , equals if . Here, is the root of an equation and is approximately 1.8414. The proof of this result is presented in the Supplemental Material. Our pooling strategy is described as follows:

*Step 1*. Label the specimens according to the ordering of . For example, label the specimen with covariates by number 1. Assign specimens with labels up to into group.

*Step 2*. Calculate the corresponding function and . If , defines by , choose the group size which minimizes the function , . Define the set of covariates .

*Step 3*. Let , . Repeat Step 2 to form the next group in the same way until all specimens are assigned.

*Step 4*. Screen the groups and obtain the maximum likelihood estimator of *β*.

Note that this is a data-driven pooling strategy. Additionally, the above procedure does not strictly require that all specimens be enrolled before screening, since the set is dynamic and can be updated with newly enrolled specimens.
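The sequential structure of the steps above can be sketched as follows. The selection criterion used here is a stand-in, not the criterion of this paper: each pool size is chosen to maximize the Fisher information that a single pooled test carries about a common prevalence, evaluated at the mean estimated risk of the candidate pool. The function names, the default sensitivity and specificity, and the requirement that all risks lie strictly between 0 and 1 are assumptions of this sketch.

```python
def pool_information(risks, se=0.95, sp=0.95):
    """Stand-in criterion: information of one pooled test about a common
    prevalence, evaluated at the pool's mean risk (risks must be in (0, 1))."""
    k = len(risks)
    p = sum(risks) / k
    r = se + sp - 1.0
    pi = se - r * (1.0 - p) ** k           # P(pool tests positive)
    d_pi = r * k * (1.0 - p) ** (k - 1)    # derivative of pi with respect to p
    return d_pi ** 2 / (pi * (1.0 - pi))


def sequential_groups(sorted_risks, k_max, criterion=pool_information):
    """Form groups one at a time along a list already sorted by risk.

    Returns a list of groups, each a list of positions in sorted_risks.
    """
    groups, start, n = [], 0, len(sorted_risks)
    while start < n:
        upper = min(k_max, n - start)  # cannot exceed the remaining specimens
        best_k = max(range(1, upper + 1),
                     key=lambda k: criterion(sorted_risks[start:start + k]))
        groups.append(list(range(start, start + best_k)))
        start += best_k
    return groups


if __name__ == "__main__":
    risks = [0.5, 0.5, 0.3, 0.1, 0.05, 0.02, 0.01, 0.01, 0.01, 0.01]
    print(sequential_groups(risks, k_max=5))
```

Passing a different `criterion` function changes the rule without touching the loop, which mirrors how the group sizes in the procedure are determined one group at a time from the specimens not yet assigned.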

##### 2.3. Numerical Results

In this section, we evaluate the performance of our proposed procedure, which we refer to as PSV (pooling strategy with varied group sizes). For comparison, we also present the results of the pooling strategy with a single group size *k*, referred to as PSS(k). The group size *k* for PSS(k) is given in advance, e.g., , or can be determined from the average prevalence of the enrolled samples. For the latter, we determine the optimal single group size by minimizing the variance of .

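To investigate the performance of these methods, we take the link function to be the logistic function $g(u) = e^{u}/(1 + e^{u})$. The individual prevalence of a specimen with covariates $x$ is then obtained through the following model:

$$p(x) = \frac{\exp(x^{T}\beta)}{1 + \exp(x^{T}\beta)}.$$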

We first consider a single covariate (), following the normal distribution or the gamma distribution The corresponding parameters are set by and . The samples are generated under these settings, and the procedures are repeated by times. We report the estimators and , along with their mean squared error (MSE), in Table 1 under different settings of sensitivity, specificity, and the number of groups. In Figure 1, we further report the relative bias of the parameters.
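A simulation of this kind could be set up along the following lines. This is a minimal sketch only: the distribution parameters, coefficient values, sensitivity, and specificity below are placeholders chosen for illustration and are not the settings used in this study, and the function name is hypothetical.

```python
import math
import random


def simulate_pooled_data(group_sizes, beta, se=0.95, sp=0.95,
                         covariate="normal", seed=0):
    """Generate one covariate per specimen and one test result per pool.

    beta = (intercept, slope) of an assumed logistic model; all numeric
    settings here are placeholders, not the values used in the paper.
    """
    rng = random.Random(seed)
    pools, statuses = [], []
    for k in group_sizes:
        xs, truly_positive = [], False
        for _ in range(k):
            if covariate == "normal":
                x = rng.gauss(0.0, 1.0)           # placeholder N(0, 1)
            else:
                x = rng.gammavariate(2.0, 1.0)    # placeholder Gamma(2, 1)
            p = 1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * x)))
            infected = rng.random() < p
            truly_positive = truly_positive or infected
            xs.append(x)
        # The pool result is subject to test error.
        p_test_positive = se if truly_positive else 1.0 - sp
        statuses.append(1 if rng.random() < p_test_positive else 0)
        pools.append(xs)
    return pools, statuses


if __name__ == "__main__":
    pools, statuses = simulate_pooled_data([5, 5, 3, 2], beta=(-3.0, 1.0))
    print(statuses)
```

Repeating such a simulation many times and refitting the model on each replicate yields the empirical bias and mean squared error of the estimators.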