Research Article | Open Access
Sample Size Determination for the Polychotomous Randomized Response Model for Sensitive Questions in a Stratified Two-Stage Sampling Survey
Extreme-value methods and the Lagrange function were applied to derive formulae for the optimum sample sizes under the polychotomous randomized response technique (RRT) model in stratified two-stage sampling, both to minimize the survey cost for a specified sampling error and to minimize the sampling error under a fixed budget. These formulae were applied to a survey of sensitive topics among men who have sex with men (MSM) in Beijing, China.
1. Introduction
Surveys are an important means of collecting information about the characteristics of a population, particularly in medical and public health research. Their accuracy depends on ample participation and an unbiased sample. However, the validity of surveys on sensitive attitudes and behaviours suffers from the tendency of individuals to distort their responses towards what they perceive to be socially desirable. As a consequence, conventional methods such as direct questioning have limitations in some epidemiological investigations. Direct questioning often leads to refusals or untruthful replies.
To encourage respondents' cooperation and to obtain reliable data, Warner introduced the randomized response technique (RRT) in 1965. It allows a respondent to answer a sensitive question truthfully without revealing anything definite to the interviewer in the course of the survey.
Sample size estimation, like all design issues, is a critical part of the design of a public health survey. For each study, an acceptable sample size must be chosen that balances the likelihood of a statistically significant result against the cost of conducting the survey.
Our previous studies derived estimators of the proportion of the population carrying a sensitive characteristic (the qualitative case) and of the population mean (the quantitative case) under RRT models in complex surveys on sensitive topics [6–8].
Given the estimators of the population parameters for the polychotomous RRT model in a stratified two-stage sampling survey, this paper derives sample size formulae for such a survey. The formulae either minimize the cost of the survey for a specified level of precision or give the most precise estimates attainable under a fixed budget. In addition, a preliminary study in Beijing is used to determine the optimum sample size for a formal field investigation to be carried out there in the future.
2. Survey Method
2.1. Randomized Response Designs for Polychotomous Characteristics
The RRT for dichotomous polling can be generalized to a polychotomous RRT model. Each respondent belongs to exactly one of $k$ mutually exclusive groups, which together form a set of sensitive categories. Let $\pi_j$ be the proportion of respondents who belong to group $j$ ($j = 1, 2, \ldots, k$). The randomization device is a pack of cards, identical in all respects except for the number on each card, labeled with the integers from 0 to $k$. Fix the probabilities $P_0$ and $P_1, P_2, \ldots, P_k$ of drawing each label such that $P_0 + P_1 + \cdots + P_k = 1$. Each respondent is instructed to draw one card. If the card labeled 0 is drawn, the respondent gives his/her true response. Otherwise, the respondent reports the number on the card.
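The mechanics of the card device can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: the true proportions, the sample size, and all function and variable names are hypothetical, and the card probabilities (0.6 for card 0 and 0.1 for each other card) are borrowed from the application in Section 4. It relies on the total probability argument that answer $j$ is reported with probability $P_0 \pi_j + P_j$, so that subtracting $P_j$ from the observed share and dividing by $P_0$ recovers $\pi_j$.

```python
import random


def simulate_answers(true_props, p_true, p_forced, n, seed=0):
    """Simulate n respondents using the card device; return answer counts per category."""
    rng = random.Random(seed)
    groups = list(range(1, len(true_props) + 1))
    cards = [0] + groups                      # card 0 means "give the true answer"
    card_weights = [p_true] + list(p_forced)  # probabilities of cards 0, 1, ..., k
    counts = [0] * len(groups)
    for _ in range(n):
        true_group = rng.choices(groups, weights=true_props)[0]
        card = rng.choices(cards, weights=card_weights)[0]
        answer = true_group if card == 0 else card
        counts[answer - 1] += 1
    return counts


def estimate_proportions(counts, p_true, p_forced):
    """Invert the total probability relation: (observed share - P_j) / P_0."""
    n = sum(counts)
    return [(c / n - p_j) / p_true for c, p_j in zip(counts, p_forced)]


# Hypothetical true proportions for four categories (illustration only).
TRUE_PROPS = [0.15, 0.35, 0.30, 0.20]
P_TRUE = 0.6                      # probability of drawing card 0
P_FORCED = [0.1, 0.1, 0.1, 0.1]   # probabilities of drawing cards 1..4

counts = simulate_answers(TRUE_PROPS, P_TRUE, P_FORCED, n=50_000, seed=1)
estimates = estimate_proportions(counts, P_TRUE, P_FORCED)
for j, (truth, est) in enumerate(zip(TRUE_PROPS, estimates), start=1):
    print(f"category {j}: true {truth:.3f}, estimated {est:.3f}")
```

No individual answer reveals the respondent's group, yet the aggregate estimates stay close to the true proportions when the sample is large.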
2.2. Stratified Two-Stage Sampling Design
Suppose that a population was subdivided into $L$ non-overlapping strata. The $h$th stratum was subdivided into $N_h$ primary sampling units (PSUs). The $i$th PSU in stratum $h$ comprised $M_{hi}$ secondary sampling units (SSUs). On average, each PSU in stratum $h$ contained $\bar{M}_h$ SSUs ($h = 1, 2, \ldots, L$, and $i = 1, 2, \ldots, N_h$). The population comprised $\sum_{h=1}^{L} N_h \bar{M}_h$ SSUs (population elements).
In the first stage, $n_h$ PSUs were randomly drawn in the $h$th stratum. In the second stage, $m_{hi}$ SSUs were randomly drawn within the $i$th of the $n_h$ selected PSUs from stratum $h$. On average, $\bar{m}_h$ SSUs were randomly drawn from each chosen PSU in stratum $h$ ($h = 1, 2, \ldots, L$, and $i = 1, 2, \ldots, n_h$). The polychotomous RRT model was used to survey all the chosen SSUs.
3. The Formula Deduction
3.1. Estimation for the Population Proportions of the Sensitive Polychotomous Attribute and Their Estimator’s Variance
Note that represents the estimator of the population proportion in the th sensitive category, stands for the estimator of the population proportion in the th sensitive categories from stratum , denotes the estimator of the population proportion in the th sensitive category in the th PSU from stratum . Then by Gao and Wang , it is shown that where and . Consider the following: The variance of is expressed as The sample estimator of is as follows: where , and .
The sample estimator of is as follows: where and .
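3.2. The Formulae for $\hat{\pi}_{hij}$
Suppose again that $\hat{\pi}_{hij}$ is the estimator of the population proportion $\pi_{hij}$ in the $j$th sensitive category from the $i$th PSU in the $h$th stratum, $a_{hij}$ denotes the number of respondents who answer $j$ in the $i$th PSU from stratum $h$, and $\theta_{hij}$ stands for the probability that a respondent answers $j$ in the $i$th PSU from stratum $h$. $\theta_{hij}$ is estimated by $\hat{\theta}_{hij} = a_{hij}/m_{hi}$. By the total probability formula, $\theta_{hij} = P_0 \pi_{hij} + P_j$ for all $h$, $i$, and $j$, provided that $P_0 \neq 0$. An unbiased estimator for $\pi_{hij}$ is as follows:
$$\hat{\pi}_{hij} = \frac{\hat{\theta}_{hij} - P_j}{P_0}, \tag{8}$$
where $h = 1, 2, \ldots, L$, $i = 1, 2, \ldots, n_h$, and $j = 1, 2, \ldots, k$.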
3.3. Sample Size Formulae
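Let the overall survey cost be
$$C = \sum_{h=1}^{L} \left( c_{0h} + c_{1h} n_h + c_{2h} n_h \bar{m}_h \right), \tag{9}$$
where $c_{0h}$ is the fixed cost of initiating the survey in the $h$th stratum, $c_{1h}$ is the average cost of approaching one PSU within stratum $h$, and $c_{2h}$ is the average cost of interviewing one SSU in stratum $h$.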
The variance of can also be written in the following alternative form:
To minimize the sampling cost under a given variance (), the optimum sampling size can be considered as the minimal values of function (9) subject to the constraint (10). The Lagrange function is defined as where is a Lagrange multiplier.
The necessary conditions for the solution of the problem are for .
Equation (12) gives Substituting the values of from expression (13) in (14), the is obtained as And from (14), the is obtained as Substituting the values of and from (15) and (16), respectively, formula (10) gives, when ( is a given variance of ), Hence, The minimum value of under a cost function (fixed survey cost ), the optimum sampling size is obtained as the minimum values of function (10) subject to the constraint (9). Consider the following Lagrange function : where λ is a Lagrange multiplier.
The optimums and are the solution of the following numerical problem: Results are presented as follows: We have the approximate optimal sample sizes given by Define as the value of the survey cost, from (21); the formula of the overall survey cost is expressed as Hence, For the stratum, the optimum size of the sample of SSUs in each selected PSU is given by It is noted that the value of may need to be considered in the process of estimating and . Difference of value leads to difference of and . Taking the maximum value of and is necessary to be ensured.
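The Lagrange argument of this section can be made concrete under one simplifying assumption. The sketch below is not a reproduction of equations (9) to (25). It assumes that the variance depends on the sample sizes only through terms of the form $A_h/n_h$ and $B_h/(n_h \bar{m}_h)$, where $A_h > 0$ and $B_h > 0$ are hypothetical shorthand for the between-PSU and within-PSU variance components of stratum $h$, $W_h$ is the stratum weight, $n_h$ and $\bar{m}_h$ are the number of sampled PSUs and the average number of SSUs per sampled PSU, and $V_0$ is the specified variance after any terms that do not involve $n_h$ or $\bar{m}_h$ have been moved to the right-hand side. The cost is taken to be linear, with a fixed cost $c_{0h}$, a cost $c_{1h}$ per PSU, and a cost $c_{2h}$ per SSU.

```latex
\begin{aligned}
&\text{Minimize } C = \sum_{h}\bigl(c_{0h} + c_{1h} n_h + c_{2h} n_h \bar{m}_h\bigr)
  \quad\text{subject to}\quad
  \sum_{h} W_h^2\Bigl(\frac{A_h}{n_h} + \frac{B_h}{n_h \bar{m}_h}\Bigr) = V_0. \\[4pt]
&F = C + \lambda\Bigl[\sum_{h} W_h^2\Bigl(\frac{A_h}{n_h} + \frac{B_h}{n_h \bar{m}_h}\Bigr) - V_0\Bigr]. \\[4pt]
&\frac{\partial F}{\partial \bar{m}_h}
  = c_{2h} n_h - \lambda W_h^2 \frac{B_h}{n_h \bar{m}_h^2} = 0
  \;\Longrightarrow\;
  \lambda W_h^2 = \frac{c_{2h}\, n_h^2\, \bar{m}_h^2}{B_h}. \\[4pt]
&\frac{\partial F}{\partial n_h}
  = c_{1h} + c_{2h}\bar{m}_h - \frac{\lambda W_h^2}{n_h^2}\Bigl(A_h + \frac{B_h}{\bar{m}_h}\Bigr) = 0
  \;\Longrightarrow\;
  c_{1h} = \frac{c_{2h} A_h \bar{m}_h^2}{B_h}. \\[4pt]
&\bar{m}_h = \sqrt{\frac{c_{1h} B_h}{c_{2h} A_h}}, \qquad
  n_h = \sqrt{\lambda}\; W_h \sqrt{\frac{A_h}{c_{1h}}}, \qquad
  \sqrt{\lambda} = \frac{1}{V_0}\sum_{h} W_h\bigl(\sqrt{A_h c_{1h}} + \sqrt{B_h c_{2h}}\bigr).
\end{aligned}
```

Under these assumptions $\bar{m}_h$ does not involve $\lambda$, so the same per-PSU sample size is obtained whether the variance or the budget is held fixed. Only $n_h$ differs between the two problems.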
4.1. Preliminary Survey
Homosexual behaviour was investigated in a stratified two-stage sampling study of MSM living in Beijing from August to October 2010. The information was used to characterize high-risk sexual behaviours among MSM. All respondents were arranged in two strata, the first consisting of MSM aged 15 to 29 years and the second consisting of MSM aged 30 to 49 years. Districts/counties in Beijing were defined as PSUs. Beijing currently comprises 16 county-level subdivisions, including 14 districts and 2 rural counties, so each stratum contained 16 districts/counties ($N_1 = N_2 = 16$). The MSM were considered as SSUs. We took 2.5% as the proportion of adult males who were homosexually active in the city of Beijing, which suggested that there were 67750 MSM aged 15 to 49 years living in Beijing. This corresponded to an average of 2466 MSM and 1768 MSM in each district/county within stratum 1 and stratum 2, respectively ($\bar{M}_1 = 2466$ and $\bar{M}_2 = 1768$). In the first sampling stage, 13 districts/counties were randomly drawn within each stratum ($n_1 = n_2 = 13$), while in the second sampling stage, 1523 MSM were randomly selected from all the chosen subdivisions. On average, 68 and 49 MSM were drawn from each selected subdivision in the first and second strata, respectively ($\bar{m}_1 = 68$ and $\bar{m}_2 = 49$).
The participants were interviewed using the polychotomous RRT model, with a focus on male-to-male sexual behaviour. The information collected covered condom use, the cost of each commercial same-sex encounter, the proportion engaging in commercial same-sex services, HIV testing status, STD testing status, preferences for sexual behaviours, and latex condom failure. The sensitive quantitative variables closely followed a normal distribution in the MSM population, and the sensitive qualitative variables followed discrete probability distributions.
Take condom use, for example, which is particularly important for combatting the spread of HIV. The sensitive question was worded as “Did you use a new condom with every act of anal intercourse?” with answers “1: Never use,” “2: Occasionally use,” “3: Consistently use,” and “4: Say no to anal sex.” By these answers, respondents were classified into four mutually exclusive groups. The randomizing device was a deck of cards, identical in all respects except for the number on each card, labelled with the integers from 0 to 4. Fix the probabilities $P_0$, $P_1$, $P_2$, $P_3$, and $P_4$ so that $P_0 : P_1 : P_2 : P_3 : P_4 = 0.6 : 0.1 : 0.1 : 0.1 : 0.1$ ($P_0 + P_1 + P_2 + P_3 + P_4 = 1$). Each SSU (a selected MSM) was instructed to draw one card at random from the deck, with replacement. If he drew the card labelled 0, the respondent gave his true response about whether he used a new condom during anal intercourse. If he drew any other card, he reported the number on that card.
In the first stratum, 66 MSM were randomly drawn in district/county one ($m_{11} = 66$), of whom 12 gave answer 1 (when $j = 1$, $a_{111} = 12$). The estimated probability of answering 1 was therefore $12/66 = 0.1818$ ($\hat{\theta}_{111} = 0.1818$). Under the randomizing device, a participant either revealed his true type with probability 0.6 ($P_0 = 0.6$) or answered 1 with probability 0.1 ($P_1 = 0.1$). From formula (8), therefore, the estimated proportion of MSM who had never used a condom for each act of anal intercourse in district/county one of stratum 1 was approximately $\hat{\pi}_{111} = (0.1818 - 0.1)/0.6 \approx 0.1364$.
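The calculation in this example can be reproduced in a few lines. This is a minimal sketch that assumes, from the total probability formula, that answer $j$ is reported with probability $P_0 \pi_j + P_j$, where $P_0$ is the probability of drawing card 0 and $P_j$ the probability of drawing card $j$. The function name `rrt_proportion` and its parameter names are hypothetical.

```python
def rrt_proportion(count_j, sample_size, p_true, p_forced_j):
    """Estimate the true proportion in category j from randomized responses.

    count_j      -- number of respondents in the PSU who reported answer j
    sample_size  -- number of respondents interviewed in the PSU
    p_true       -- probability of drawing card 0 (true answer is given)
    p_forced_j   -- probability of drawing card j (answer j is forced)
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 < p_true <= 1:
        raise ValueError("p_true must lie in (0, 1]")
    observed_share = count_j / sample_size
    return (observed_share - p_forced_j) / p_true


# District/county one, stratum 1: 12 of 66 respondents reported answer 1.
pi_hat = rrt_proportion(count_j=12, sample_size=66, p_true=0.6, p_forced_j=0.1)
print(f"{pi_hat:.4f}")  # prints 0.1364
```

In a small PSU the estimate can fall below 0 or above 1 when the observed share is far from its expectation, so any truncation rule would be a separate design decision.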
In a similar way, the proportions of MSM who had never used condom for each act of anal intercourse in other districts/counties within each stratum were obtained. Furthermore, and were given by the formulae (2), (4), and (5). Table 1 showed both these variances which were needed in the determination of optimum sample size.
4.2. Optimum Sample Size Estimation
We plan to conduct a formal investigation among the MSM population in Beijing by the end of 2014, using a stratified two-stage sampling design. Confidentiality will be protected by applying the polychotomous RRT. The survey sample size, including the number of participants and the number of districts/counties in the formal investigation, can be determined for every response category of a polychotomous sensitive question. Consequently, the optimum sample sizes vary both across sensitive topics and across response categories of the same topic. It is appropriate to take the maximum value as the final optimum sample size. Sample size determination is illustrated below for the case of condom use.
The budget for the formal investigation was set on the basis of the preliminary investigation. The average cost of initiating the survey within each stratum was fifty thousand Yuan ($c_{01} = c_{02} = 50000$). The average cost of approaching one district/county within each stratum was one hundred thousand Yuan ($c_{11} = c_{12} = 100000$). The average cost of obtaining information on sensitive characteristics from one respondent in each stratum was fifteen Yuan ($c_{21} = c_{22} = 15$).
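To show how these unit costs combine, the sketch below evaluates a linear cost model: a fixed cost per stratum, plus a cost per sampled district/county, plus a cost per interviewed respondent. This is an illustration under stated assumptions rather than the authors' computation. The design values (13 districts/counties per stratum, with 68 and 49 respondents per district/county) are borrowed from the preliminary survey purely as example inputs, and the function name and dictionary keys are hypothetical.

```python
def survey_cost(strata):
    """Total cost: sum over strata of c0 + c1 * n + c2 * n * m_bar.

    Each stratum is a dict with
      c0    -- fixed cost of initiating the survey in the stratum
      c1    -- average cost of approaching one PSU (district/county)
      c2    -- average cost of interviewing one SSU (respondent)
      n     -- number of PSUs sampled
      m_bar -- average number of SSUs sampled per PSU
    """
    return sum(
        s["c0"] + s["c1"] * s["n"] + s["c2"] * s["n"] * s["m_bar"]
        for s in strata
    )


# Unit costs in Yuan as stated in the text; design values are illustrative.
example_design = [
    {"c0": 50_000, "c1": 100_000, "c2": 15, "n": 13, "m_bar": 68},
    {"c0": 50_000, "c1": 100_000, "c2": 15, "n": 13, "m_bar": 49},
]

total = survey_cost(example_design)
print(total)  # prints 2722815
```

With a per-district cost several thousand times the per-interview cost, the number of sampled districts/counties dominates the total, which is why the optimum design favours relatively many respondents per district/county.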
Table 1 indicated that related estimators of sample variance within each stratum, , , , and , were 0.0058, 0.0152, 0.0075, and 0.0154, respectively. From expressions (15) and (22), an average size of MSM who were needed to be recruited in each chosen district/county from stratum 1 and stratum 2, respectively, was given by To minimize the survey cost under the constraint of sampling error, where the value of sampling error was 0.000057 (). From formula (18), we can get The number of districts/counties which were needed to be chosen within each stratum was given by formula (16): To minimize the sampling error for the fixed overall survey cost, where the value of the fixed overall survey cost was 1000000 , from formula (25), we can get The number of districts/counties which need to be sampled from each stratum was given by formula (23): Table 2 summarized the , , , and in the other different extent of condom usage among MSM discussed in this research.
The sample size determined for a sampling survey may vary across the categories of a polychotomous sensitive topic, so the maximum sample size should be used. According to the sampling survey on condom use among MSM, an average of 132 MSM and 117 MSM should be sampled in each chosen district/county in the first stratum and second stratum, respectively ($\bar{m}_1 = 132$ and $\bar{m}_2 = 117$). Once $\bar{m}_h$ was obtained, the number of MSM to be drawn from the $i$th district/county in the $h$th stratum could be determined by formula (26). For example, if a certain chosen district/county had 3342 MSM in the first stratum, the number of MSM drawn from this district/county in the first stratum should be .
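The per-district sample sizes quoted above can be checked against the classical two-stage optimum, in which the number of SSUs per PSU is $\sqrt{c_1/c_2}$ times the ratio of the within-PSU to the between-PSU standard deviation. The sketch below rests on two assumptions that the text does not confirm: that the four values reported from Table 1 are ordered as between-district and within-district variance for stratum 1, then for stratum 2, and that finite-population corrections can be ignored. The function and variable names are hypothetical.

```python
import math


def optimum_ssu_per_psu(c1, c2, s1_sq, s2_sq):
    """Classical optimum number of SSUs per PSU: sqrt(c1 / c2) * (s2 / s1).

    c1    -- cost of approaching one PSU
    c2    -- cost of interviewing one SSU
    s1_sq -- between-PSU variance component (assumed interpretation)
    s2_sq -- within-PSU variance component (assumed interpretation)
    """
    if min(c1, c2, s1_sq, s2_sq) <= 0:
        raise ValueError("costs and variance components must be positive")
    return math.sqrt((c1 / c2) * (s2_sq / s1_sq))


C1, C2 = 100_000, 15  # Yuan per district/county and per respondent
# (between, within) variance estimates per stratum; the ordering is an assumption.
VARIANCES = {1: (0.0058, 0.0152), 2: (0.0075, 0.0154)}

m_bar = {
    h: optimum_ssu_per_psu(C1, C2, s1_sq, s2_sq)
    for h, (s1_sq, s2_sq) in VARIANCES.items()
}
for h, value in m_bar.items():
    print(f"stratum {h}: about {value:.1f} MSM per district/county")
```

Under these assumptions the two values round to 132 and 117, in agreement with the figures reported for the two strata.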
We previously derived sample size formulae for (stratified) multistage sampling surveys on nonsensitive topics. However, sample size formulae for multistage sampling surveys on sensitive characteristics are not yet available. The main purpose of this paper is to provide sample size determination for the polychotomous RRT model for sensitive characteristics in a stratified two-stage sampling design. We thereby extend the sample size formulae for multistage sampling designs from nonsensitive questions to sensitive questions.
China is currently undergoing a serious HIV epidemic. Male-to-male sexual contact is one of the leading modes of HIV transmission. HIV prevalence among MSM appears to be increasing, and MSM in China might play an important role in spreading the HIV-1 epidemic. The method proposed in this study appears to be an effective technique for obtaining more accurate estimates of population proportions for sensitive qualitative characteristics among HIV-related high-risk groups. In addition, the sampling survey schemes for project 81273188, which will commence in 2014 to estimate the size of HIV-related high-risk groups, have been completed on the basis of the sample size formulae derived in this study.
Validity and reliability are fundamental cornerstones of the scientific method, and they are arguably the most important criteria for judging the quality of a survey. In our previous studies [6, 14, 15], the validity and reliability of RRT models for sensitive quantitative/qualitative characteristics under complex surveys were assessed by correlation analysis of repeated survey data and by Monte Carlo simulation. These survey methods and statistical formulae showed high validity and reliability.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research was supported by a grant (81273188) from the National Natural Science Foundation and a grant (CXLX13_839) from the Postgraduate Research and Innovation Program of Jiangsu. The authors are grateful to Wei Li, Xiangyu Chen, Qiaoqiao Du, Mingrun Yu, and Xudong Li for their invaluable help in the field investigations. The authors also thank the reviewers for their comments and suggestions.
References
- F. Esponda, “Negative surveys,” http://arxiv.org/abs/math/0608176.
- M. Moshagen, Multinomial randomized response models, 2008.
- G. L. Tian, M. L. Tang, Z. Liu, M. Tan, and N. S. Tang, “Sample size determination for the non-randomised triangular model for sensitive questions in a survey,” Statistical Methods in Medical Research, vol. 20, no. 3, pp. 159–173, 2011.
- S. L. Warner, “Randomized response: a survey technique for eliminating evasive answer bias,” Journal of the American Statistical Association, vol. 60, no. 309, pp. 63–66, 1965.
- W. Brannath, “Book Review: S. C. Chow and M. Chang 2007: adaptive design methods in clinical trials,” Clinical Trials, vol. 6, no. 1, pp. 102–103, 2009.
- Q. Q. Du, G. Gao, W. Li, and X. Y. Chen, “Application of monte carlo simulation in reliability and validity evaluation of two-stage cluster sampling on multinomial sensitive question,” in Information Computing and Applications, pp. 261–268, Springer, 2012.
- W. Li, G. Gao, Y. H. Ruan, X. Y. Chen, and Q. Q. Du, “Analysis of sensitive questions of MSM based on RRT,” in Information Computing and Applications, pp. 273–279, Springer, 2012.
- X. K. Pu, G. Gao, Y. B. Fan, and M. Wang, “Stratified cluster sampling under multiplicative model for quantitative sensitive question survey,” Interciencia, vol. 37, pp. 833–837, 2012.
- A. Ambainis, M. Jakobsson, and H. Lipmaa, “Cryptographic randomized response techniques,” in Public Key Cryptography—PKC 2004, vol. 2947 of Lecture Notes in Computer Science, pp. 425–438, Springer, Berlin, Germany, 2004.
- G. Gao and S. G. Wang, “The estimation of sample size in stratified two-stage sampling,” Chinese Journal of Health Statistics, vol. 15, pp. 51–53, 1998.
- J. F. Wang, G. Gao, Y. B. Fan et al., “The estimation of sample size in multi-stage sampling and its application in medical survey,” Applied Mathematics and Computation, vol. 178, no. 2, pp. 239–249, 2006.
- K. H. Choi, H. Liu, Y. Guo, L. Han, J. S. Mandel, and G. W. Rutherford, “Emerging HIV-1 epidemic in China in men who have sex with men,” The Lancet, vol. 361, no. 9375, pp. 2125–2126, 2003.
- G. Mumtaz, N. Hilmi, W. McFarland et al., “Are HIV epidemics among men who have sex with men emerging in the Middle East and North Africa?: a systematic review and data synthesis,” PLoS Medicine, vol. 8, no. 8, Article ID e1000444, 2011.
- Z. D. Jin, H. R. Zhu, B. Yu, and G. Gao, “A monte-carlo simulation investigating the validity and reliability of two-stage cluster sampling survey with sensitive topics,” Computer Modelling and New Technologies, vol. 17, pp. 65–79, 2013.
- M. Wang and G. Gao, “Cluster sampling and its application on quantitative sensitive questions,” Chinese Journal of Health Statistics, vol. 25, pp. 586–598, 2008.
Copyright © 2014 Zongda Jin et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.