Abstract
Determining the sample size is a key step in designing a sampling scheme, and it requires weighing survey accuracy, survey cost, and the sampling method together. In this article, we discuss how to determine the sample size for complex successive sampling with sample rotation for sensitive issues. Using the Cauchy-Schwarz inequality, we deduce formulas for the optimal sample size under two-stage sampling and under stratified two-stage sampling, so as to minimize the cost for a given sampling error and to minimize the sampling error for a given cost.
1. Introduction
Sampling, a kind of incomplete survey, is the most common form of investigation [1, 2]. From a sample taken from the population [3], we obtain an estimate of a population parameter. Because of privacy and variability, when sensitive questions are investigated, some respondents refuse to answer or give wrong answers for self-protection [4]. It is therefore difficult to obtain valid data with conventional methods, and the survey results cannot accurately reflect the true population characteristics. The multiplication model of the Randomized Response Technique (RRT) is used to improve the response rate and thus to obtain more realistic and reliable results [5]. Geng used the RRT model to survey the behavioral risk profile of men who have sex with men in Beijing, China [6]. Successive surveys generally suffer from two disadvantages, sample fatigue and a decrease in sample representativeness, but sample rotation can greatly improve the accuracy of the estimators [7]. Yu developed two complex successive sampling methods with sample rotation for sensitive issues, under two-stage sampling and stratified two-stage sampling, respectively [8].
Determining the sample size is a significant part of sampling design and a necessary premise for implementing a survey. Nouri proposed a method of sampling and sample size determination for a comprehensive integrated community-based intervention [9]. However, there is no universal solution or perfect prescription for determining the sample size. Wang and Gao deduced formulae for the optimum sample size for two-stage sampling [10]. For the empirical judgment of sample size, Chen obtained a method to determine the sample size and used it in a testability verification experiment based on fault injection [10]. Jin and Yu deduced formulae for the optimum sample sizes for the Randomized Response Technique (RRT) model in stratified two-stage sampling [11].
However, very little research addresses the determination of sample size for successive surveys with sensitive questions. Yu and Jin gave the estimator of sample size for a successive survey with partial cluster rotation under a given cost [12]. The sample size formulas associated with the two complex successive sampling methods mentioned above [8] are not yet available. In this paper, we deduce the formulas for the optimal sample size of these two sampling methods by using the Cauchy-Schwarz inequality, so as to minimize the cost for a given sampling error and to minimize the sampling error for a given cost.
The remainder of this paper is organized as follows. In Section 2, we discuss the method of determining the sample size of complex successive sampling with sample rotation for sensitive issues under the multiplication RRT model and deduce the formulas for the optimal sample size under two-stage sampling and stratified two-stage sampling by using the Cauchy-Schwarz inequality. In Section 3, we give four examples. Using the formulas deduced in this paper, we investigate the number of sexual services per sex girl each month and the sex girls' age at first sexual service in Xichang City, as well as the proportion of condom use during anal sex and the monthly average value of male-male sexual behavior among MSM (men who have sex with men) in Beijing City. Section 4 summarizes this article. Finally, all technical details are given in the Appendix.
2. The Formula Derivation of Sample Size
2.1. The Determination of Sample Size for Successive Survey with the Multiplication RRT Model under Two-Stage Random Sampling
2.1.1. Multiplication RRT Model for Sensitive Questions
The randomization device [13] is a box containing ten balls labelled 0, 1, 2, ..., 9, respectively. Each respondent randomly selects a ball from the box and fills in the questionnaire with the product of his or her own value of the quantitative sensitive characteristic and the number on the selected ball.
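The reason the multiplication device still allows estimation can be illustrated with a small simulation. This is only a sketch: the names `rrt_answer` and `estimate_mean` and the simulated true values are hypothetical, and the only facts used are that the balls carry 0 to 9 (mean 4.5) and that the ball is drawn independently of the respondent's true value, so the mean of the products equals the mean of the true values times 4.5.

```python
import random

BALLS = list(range(10))             # balls labelled 0, 1, ..., 9
MU_Y = sum(BALLS) / len(BALLS)      # mean of the ball values

def rrt_answer(x, rng):
    """Value written on the questionnaire by a respondent with true value x."""
    return x * rng.choice(BALLS)

def estimate_mean(answers):
    """Estimate the mean of the sensitive variable from the reported products."""
    return sum(answers) / len(answers) / MU_Y

rng = random.Random(0)
# Hypothetical true values, used only to illustrate the mechanism.
true_values = [rng.randint(0, 20) for _ in range(20000)]
answers = [rrt_answer(x, rng) for x in true_values]

true_mean = sum(true_values) / len(true_values)
estimated_mean = estimate_mean(answers)
print(true_mean, estimated_mean)
```

No individual answer reveals the respondent's true value with certainty, yet the two printed means should be close for a sample of this size.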
2.1.2. Two-Stage Sampling
Assume that the population consists of $N$ primary units and that the i-th primary unit is composed of $M_i$ secondary units. On average, each primary unit includes $\bar{M}$ secondary units. In the first stage, $n$ primary units are drawn from the $N$ primary units by simple random sampling. Let the sampling fraction of the first stage be $f_1 = n/N$. In the second stage, $m_i$ secondary units are drawn from the i-th chosen primary unit. On average, $\bar{m}$ second-stage units are drawn from each selected primary unit. Let the sampling fraction of the second stage be $f_2 = \bar{m}/\bar{M}$. In the second stage, the successive survey with sample rotation is carried out separately in each selected primary unit. In the i-th chosen primary unit, for the first survey, the multiplication RRT model is applied to investigate the respondents in the second-stage units. At the h-th survey, secondary units are reserved randomly from the units chosen in the (h-1)-th survey, and () secondary units, the rotated part, are drawn from the remaining units of the i-th primary unit, which were not chosen in the (h-1)-th survey. Then the multiplication RRT model is used to investigate the reserved secondary units and the rotated secondary units.
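The rotation step inside one selected primary unit can be sketched as follows. The function names, the constant sample size across surveys, and the numbers 126, 10, and 6 are illustrative assumptions, not values prescribed by the design above.

```python
import random

def first_survey(units, m, rng):
    """Simple random sample of m secondary units from one primary unit."""
    return rng.sample(units, m)

def rotate_sample(units, previous, n_reserved, rng):
    """Keep n_reserved units of the previous sample and draw the rest
    from units that were not chosen in the previous survey."""
    reserved = rng.sample(previous, n_reserved)
    previous_set = set(previous)
    not_chosen = [u for u in units if u not in previous_set]
    rotated = rng.sample(not_chosen, len(previous) - n_reserved)
    return reserved, rotated

rng = random.Random(1)
units = list(range(126))            # hypothetical secondary units of one primary unit
sample_1 = first_survey(units, 10, rng)
reserved, rotated = rotate_sample(units, sample_1, 6, rng)
print(sorted(reserved), sorted(rotated))
```

The reserved part links consecutive surveys, while the rotated part brings in fresh respondents.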
2.1.3. Estimators of the Population Mean
Let the estimator of the population mean in the i-th primary unit of the h-th survey, the sample mean in the i-th primary unit of the h-th survey, the sample mean of the rotated samples in the i-th primary unit of the h-th survey, the sample mean of reserved samples in the i-th primary unit of the h-th survey, the weight of combined estimation in the i-th primary unit of the h-th survey, the regression coefficient of the same variable in the same sample units of the i-th primary unit between the h-1-th survey and the h-th survey (h > 1).
In the second stage of sampling, simple random sampling under sample rotation is used for the secondary units in the selected primary unit in the first stage.
The estimator of the population mean in i-th primary unit in the h-th survey is
Based on simple random sampling in the first stage, according to the essential feature of the mean, the estimator of the population mean in the h-th survey is
Suppose the quantitative characteristic of the sensitive problem of the respondents is , the extracted random variable is , and the product of and is . The population mean of and is and , respectively. The mean of all random numbers printed on the balls in the box is .
According to the essential feature of the mean, we have
By (3), is
where is the sample mean of the answer value for the reserved secondary units of the i-th primary unit in the h-th survey.
By (3), we get
where is the sample mean of the answer value for the reserved secondary units of the i-th primary unit in the (h-1)-th survey.
According to the essential feature of the mean, is
where is the sample mean of the answer value for the rotated secondary units of the i-th primary unit in the (h-1)-th survey.
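How the quantities defined in this subsection are usually combined can be sketched in code. The form below is the textbook composite estimator for successive sampling with partial replacement (as in Cochran [14]), which is assumed here rather than taken from the numbered equations; all function and argument names are hypothetical.

```python
def to_sensitive_scale(answer_mean, mu_y=4.5):
    """Convert a mean of reported products into a mean of the sensitive variable."""
    return answer_mean / mu_y

def rotation_estimate(mean_rotated, mean_reserved,
                      prev_estimate, prev_mean_reserved, b, phi):
    """Composite estimate for one primary unit at the h-th survey.

    mean_rotated       : mean of the rotated (new) units, h-th survey
    mean_reserved      : mean of the reserved units, h-th survey
    prev_estimate      : estimate for the unit from the (h-1)-th survey
    prev_mean_reserved : mean of the same reserved units in the (h-1)-th survey
    b                  : regression coefficient between the two surveys
    phi                : weight given to the rotated part (0 <= phi <= 1)
    """
    regression_part = mean_reserved + b * (prev_estimate - prev_mean_reserved)
    return phi * mean_rotated + (1.0 - phi) * regression_part

# Hypothetical means of reported products, converted to the sensitive scale.
estimate = rotation_estimate(
    to_sensitive_scale(45.0), to_sensitive_scale(54.0),
    11.0, to_sensitive_scale(45.0), b=1.0, phi=0.5,
)
print(estimate)
```

The regression term adjusts the reserved-sample mean using what was learned about the same units in the previous survey, and the weight balances that adjusted mean against the fresh rotated sample.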
2.1.4. The Estimator Variance of Population Mean
The variance of the mean estimator in two-stage sampling is
where is the variance of the mean for the secondary units among primary units. Based on simple random sampling in the first stage, by Cochran [14],
where is the population mean of the sensitive characteristic in the h-th survey and is the population mean of the sensitive characteristic of the i-th primary unit in the h-th survey.
From (3), we get
By (8), (9), and (10), we have
where is the population mean of the answer value in the h-th survey and is the population mean of the answer value of the i-th primary unit in the h-th survey.
Thus, we obtain the sample estimator
where is the sample mean of the answer value in the h-th survey and is the sample mean of the answer value of the i-th primary unit in the h-th survey. Moreover, is the variance of secondary units within the primary units:
where is the variance of the estimator of the population mean in the i-th primary unit of the h-th survey.
In the second stage of sampling, simple random sampling under sample rotation was used to investigate the secondary units of the primary units selected in the first stage. The variance of the estimator of the population mean in the i-th primary unit of the h-th survey is
where is the estimator of the correlation coefficient of the answer value for the i-th primary unit between the h-th and the (h-1)-th surveys.
The sample estimator of is
where is the variance of the answer value for the rotated sample in the i-th primary unit of the h-th survey, is the variance of the answer value for the reserved sample in the i-th primary unit of the h-th survey, and is the sample variance of the answer value for the whole sample in the i-th primary unit of the h-th survey.
The sample estimator of is
So, the sample estimator of is
2.1.5. Optimal Weight and Optimal Sample Rotation Rate
Based on simple random sampling in the first stage and simple random rotation sampling in each primary unit, the optimal weight of the i-th primary unit in the h-th survey is
So, we get the rate of sample rotation for the i-th primary unit in the h-th survey
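For orientation, the classical two-occasion results from Cochran [14] show how the correlation between surveys drives both the weight and the rotation rate. The sketch below assumes equal variances on both occasions and a negligible finite population correction; the formulas of this subsection for a general h-th survey may differ, and all names are hypothetical.

```python
import math

def optimal_matched_fraction(rho):
    """Share of the sample to retain between two surveys (two-occasion case)."""
    root = math.sqrt(1.0 - rho ** 2)
    return root / (1.0 + root)

def optimal_rotation_rate(rho):
    """Share of the sample to replace between two surveys."""
    return 1.0 - optimal_matched_fraction(rho)

def optimal_weight(n, n_rotated, rho):
    """Inverse-variance weight on the rotated-sample mean.

    Variances are expressed in units of the common unit variance S^2,
    which cancels in the ratio.
    """
    n_reserved = n - n_rotated
    var_rotated = 1.0 / n_rotated
    var_reserved = (1.0 - rho ** 2) / n_reserved + rho ** 2 / n
    return (1.0 / var_rotated) / (1.0 / var_rotated + 1.0 / var_reserved)

for rho in (0.0, 0.5, 0.8, 0.95):
    print(rho, optimal_rotation_rate(rho), optimal_weight(100, 50, rho))
```

In this simplified setting, a stronger correlation between consecutive surveys makes the regression-adjusted reserved sample more informative, so more of the sample should be rotated.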
2.1.6. The Determination of Sample Size
In practice, the survey cost often takes the following simple form, by Cochran [14]:
where is the fixed cost, which is independent of the sample size (such as the cost of leasing premises, hiring employees, and publicizing the investigation), is the average cost of investigating each primary unit, and is the average cost of investigating each secondary unit. The sample size of the first stage is , and the sample size of the second stage is .
From (7), we get
Letting , , and , we have
From (20), we have
Using the Cauchy-Schwarz inequality, from (22) and (23), we get the product
Equality holds if and only if , and then we get
From (25), we get samples in the second stage
where and .
From (23) and (25), we get (the coefficient for fixed cost of survey)
From (25) and (27), we get (the optimal sample size for the fixed cost of survey)
where and .
From (22) and (27), we get (the coefficient for the given variance)
From (22) and (29), we get (the optimal sample size for the given variance)
where and .
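The optimization in this subsection can be summarized in a short sketch. The notation here is assumed rather than taken from the numbered equations: $S_1^2$ and $S_2^2$ stand for the between-primary-unit and within-primary-unit variance components (the quantities named `S1_2` and `S2_2` in the Appendix code), $V$ is the variance of the estimator, $C_T$ is the total cost, $c_0$, $c_1$, $c_2$ are the fixed, per-primary-unit, and per-secondary-unit costs, and $n$, $m$ are the first-stage and average second-stage sample sizes. The sketch also assumes $A > 0$.

```latex
\begin{align*}
V + \frac{S_1^2}{N} &= \frac{1}{n}\left(A + \frac{B}{m}\right),
  \qquad A = S_1^2 - \frac{S_2^2}{\bar{M}}, \quad B = S_2^2, \\
C_T - c_0 &= n\,(c_1 + c_2 m), \\
\left(V + \frac{S_1^2}{N}\right)(C_T - c_0)
  &= \left(A + \frac{B}{m}\right)(c_1 + c_2 m)
  \;\ge\; \left(\sqrt{A c_1} + \sqrt{B c_2}\right)^2, \\
\text{with equality iff}\quad \frac{A}{c_1} &= \frac{B}{c_2 m^2}
  \;\Longleftrightarrow\;
  m_{\mathrm{opt}} = \sqrt{\frac{B\,c_1}{A\,c_2}}, \\
n &= \frac{C_T - c_0}{c_1 + c_2\, m_{\mathrm{opt}}} \quad (\text{fixed cost}),
  \qquad
n = \frac{A + B/m_{\mathrm{opt}}}{V + S_1^2/N} \quad (\text{given variance}).
\end{align*}
```

Because the product on the left does not depend on $n$, the same $m_{\mathrm{opt}}$ minimizes the variance for a fixed cost and the cost for a given variance; only the resulting $n$ differs. These expressions agree with the computations in the Appendix code.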
2.2. The Determination of Sample Size for Successive Survey with the Multiplication RRT Model under Stratified Two-Stage Random Sampling
2.2.1. Multiplication RRT Model for Sensitive Questions
The randomization device [13] is a box containing 10 balls labelled 0, 1, 2, ..., 9, respectively. Each respondent randomly selects a ball from the box and fills in the questionnaire with the product of his or her own value of the quantitative sensitive characteristic and the number on the selected ball.
2.2.2. Stratified Two-Stage Sampling
The population is divided into strata, and stratum includes primary units, . On average, each stratum includes primary units. Two-stage rotation sampling is conducted independently in each stratum.
First, primary units are selected in stratum , and on average primary units are selected. In the first survey, secondary units are selected in the first stage from the selected i-th primary unit (containing secondary units, on average population has primary units) of stratum (on average secondary units are selected, ); then the multiplication RRT model for sensitive questions is used. In the h-th (h ≥ 2) survey, secondary units are reserved randomly from the chosen secondary units of the i-th selected primary unit in stratum of the (h-1)-th survey, and secondary units are drawn from the remaining units, which were not chosen from the t-th stratum in the (h-1)-th survey. Then the multiplication RRT model is used to investigate the reserved secondary units and the rotated secondary units.
2.2.3. Estimators of the Population Mean
(1) Estimators of the Population Mean in Stratum . Let the estimator of the population mean from the i-th primary unit in stratum of the h-th survey, the sample mean from the i-th primary unit in stratum of the h-th survey, the sample mean of the rotated samples from the i-th primary unit in stratum of the h-th survey, the sample mean of reserved samples from the i-th primary unit in stratum of the h-th survey, the weight of combined estimation from the i-th primary unit in stratum of the h-th survey, the regression coefficient of the same variable in the same sample units from the i-th primary unit in stratum between the h-1-th survey and the h-th(h > 1) survey.
In the second stage, rotation sampling under simple random sampling is used for the secondary units from the selected primary unit in each stratum.
The estimator of the population mean in stratum of the h-th survey is
Suppose the quantitative characteristic of the sensitive problem of the respondents is , the random variable extracted is , and is the product of and . and are the population mean of and , respectively. is the mean of all random numbers printed on each ball in the box.
According to the basic properties of the mean
By (32), we get
where is the sample mean of the answer value of the reserved sample from the i-th primary unit in stratum of the h-th survey.
From (32), we get
where is the sample mean of the answer value of the reserved sample from the i-th primary unit in stratum of the (h-1)-th survey.
By (32), we get
where is the sample mean of the answer value of the rotated sample from the i-th primary unit in stratum of the h-th survey.
(2) Estimators of the Population Mean. According to the basic properties of the mean, the estimator of the population mean in the h-th survey is
where .
(3) The Estimator Variance of Population Mean. Two-stage successive sampling with rotation of secondary units is used in each stratum. By (17), the estimator variance of the population mean in stratum of the h-th survey is
where , is the variance of the answer value for the rotated sample from the i-th primary unit in stratum of the h-th survey, is the variance of the answer value for the reserved sample from the i-th primary unit in stratum of the h-th survey, is the sample variance of the answer value for the whole sample from the i-th primary unit in stratum of the h-th survey, and is the estimator of the correlation coefficient of the answer value from the i-th primary unit in stratum between the h-th and (h-1)-th surveys.
According to the basic properties of variance, the estimator variance of population mean in h-th survey is
The sample estimator of is
Let
where is the sample mean of the answer value in stratum of the h-th survey and is the sample mean of the answer value from the i-th primary unit in stratum of the h-th survey.
(4) Optimal Weight and Optimal Sample Rotation Rate. Based on simple random sampling in the first stage in each stratum and simple random rotation sampling in each primary unit of each stratum, the optimal weight for the i-th primary unit in stratum of the h-th survey is
So, we get the rate of sample rotation for the i-th primary unit in stratum of the h-th survey
2.2.4. Sample Size Determination
In practice, the survey cost often takes the following simple form, by Cochran [14]:
where is the fixed cost, which is independent of the sample size (such as the cost of leasing premises, hiring employees, and publicizing the investigation), is the average cost of investigating each primary unit, is the average cost of investigating each secondary unit, is the sample size of the first stage in each stratum, and is the sample size of the second stage.
From (39), we get
Letting
then, we have
From (41), we get
Letting and , we have
Using the Cauchy-Schwarz inequality, from (49) and (50), we get the product
Equality holds if and only if , and then we get
From (52), we get in the second stage:
From (41) and (51), we get (the optimal sample size for the fixed cost of survey)
where and .
From (45) and (53), we get (the optimal sample size for the given variance)
where and .
3. Applications
3.1. Applications of Two Stage Sampling
3.1.1. An Application in Xichang City
In 2013, two-stage sampling was employed to estimate the number of sexual services performed per sex girl each month in Xichang City. The streets are defined as the primary units and the sex girls as the secondary units. According to the relevant reference [15], the permitted error was taken as half of the confidence interval (), so the confidence is , and we get the given variance $V=9.6$. Xichang City has 54 streets (N=54), and on average each street has 126 sex girls ($\bar{M}=126$). We also budget the survey cost of each street (1500 dollars), of each person (15 dollars), and the fixed cost (2500 dollars).
(1) According to the results of a previous investigation in Xichang City in 2011, we could compute the estimators of the relevant values and from (11) and (13).
(2) From (26), we could get the average number of sex girls that need to be investigated in each chosen street.
(3) Supposing that the cost is fixed (40000 dollars), from (28), we could get the number of streets that need to be investigated among all the streets in Xichang City.
(4) Supposing that the variance is fixed ($V=9.6$), from (30), we could get the number of streets that need to be investigated among all the streets in Xichang City.
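The four steps above can be reproduced with a short script. This is a sketch that mirrors the MATLAB code for Section 3.1.1 in the Appendix: the numerical inputs are copied from that code, while the function name, the argument names, and the reading of `S1_2` and `S2_2` as between-street and within-street variance components are assumptions.

```python
import math

def optimal_two_stage(s1_sq, s2_sq, N, M_bar, c0, c1, c2, total_cost, target_var):
    """Return (m_opt, n for fixed cost, n for given variance)."""
    a = s1_sq - s2_sq / M_bar                  # y1^2 in the Appendix code
    b = s2_sq                                  # y2^2 in the Appendix code
    m_opt = math.sqrt(b * c1 / (a * c2))       # second-stage size per primary unit
    n_cost = (total_cost - c0) / (c1 + c2 * m_opt)
    n_var = (a + b / m_opt) / (target_var + s1_sq / N)
    return m_opt, n_cost, n_var

# Inputs taken from the Appendix code for Section 3.1.1.
m_opt, n_cost, n_var = optimal_two_stage(
    s1_sq=308.7, s2_sq=280.77, N=54, M_bar=126,
    c0=2500, c1=1500, c2=15, total_cost=40000, target_var=9.6,
)
print(m_opt, n_cost, n_var)
```

With these inputs the sketch returns a second-stage size close to 10 and first-stage sizes close to 23 (fixed cost) and 22 (given variance) before rounding; how the results are rounded for fieldwork is not specified here.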
3.1.2. An Application in Beijing City
In 2015, two-stage sampling was employed to estimate the proportion of condom use during anal sex in Beijing City. The districts are defined as the primary units and the MSM (men who have sex with men) as the secondary units. According to the relevant reference [16], we took the permitted error as half of the confidence interval (), so the confidence is , and we get the given variance . Beijing City has 16 districts (), and on average each district has 4234 MSM (). We also budget the survey cost of each district ( dollars), of each person ( dollars), and the fixed cost ( dollars).
(1) According to the results of a previous investigation in Beijing City in 2010, we could compute the estimators of the relevant values and from (11) and (13).
(2) From (26), we could get the average number of MSM that need to be investigated in each chosen district.
(3) Supposing that the cost is fixed ( dollars), from (28), we could get the number of districts that need to be investigated among all the districts in Beijing City.
(4) Supposing that the variance is fixed (), from (30), we could get the number of districts that need to be investigated among all the districts in Beijing City.
3.2. Applications of Stratified Two-Stage Random Sampling
3.2.1. An Application in Xichang City
In 2015, stratified two-stage random sampling was employed to estimate the sex girls' age at first sexual service in Xichang City. According to their age, the sex girls are divided into two strata (), in which the ages of the sex girls in the first stratum and the second stratum are from 15 to 29 and from 30 to 49, respectively. The streets are defined as the primary units and the sex girls as the secondary units. According to the relevant reference [15], the permitted error was taken as half of the confidence interval (), so the confidence is , and we get the given variance and . On average, each stratum has 27 streets (), and each street has 126 sex girls on average (). At the first stage, streets were drawn. At the second stage, sex girls were selected from each chosen street. The fixed cost of the survey is 2200 dollars, the average cost of investigating each street is 1750 dollars, and the average cost of investigating each sex girl is 17 dollars.
(1) According to the results of a previous survey, we get
(2) From (53), we could get the average number of sex girls that need to be sampled from each chosen street.
(3) Supposing that the cost is fixed (45000 dollars), from (54), we could get the number of streets that need to be sampled from each stratum in Xichang City.
(4) Supposing that the variance is fixed (1.95), from (55), we could get the number of streets that need to be sampled from each stratum in Xichang City.
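The stratified computation can be sketched in the same way. The script below mirrors the MATLAB code for Section 3.2.1 in the Appendix: the numerical inputs are copied from that code, while the function name, the argument names, and the reading of the `S_1t` and `S_2t` values as between-street and within-street variance components of stratum t are assumptions.

```python
import math

def optimal_stratified_two_stage(weights, s1_sq, s2_sq, L_bar, N_bar,
                                 c0, c1, c2, total_cost, target_var):
    """Return (secondary units per primary unit,
               primary units per stratum for fixed cost,
               primary units per stratum for given variance)."""
    T = len(weights)                                   # number of strata
    a = sum(w ** 2 * (s1 - s2 / N_bar)
            for w, s1, s2 in zip(weights, s1_sq, s2_sq))   # y1^2 in the Appendix
    b = sum(w ** 2 * s2 for w, s2 in zip(weights, s2_sq))  # y2^2 in the Appendix
    n_opt = math.sqrt(b * c1 / (a * c2))
    l_cost = (total_cost - c0) / (T * (c1 + c2 * n_opt))
    v = target_var + sum(w ** 2 * s1 for w, s1 in zip(weights, s1_sq)) / L_bar
    l_var = (a + b / n_opt) / v
    return n_opt, l_cost, l_var

# Inputs taken from the Appendix code for Section 3.2.1.
n_opt, l_cost, l_var = optimal_stratified_two_stage(
    weights=[0.5, 0.5], s1_sq=[130, 93], s2_sq=[94.6, 97.5],
    L_bar=27, N_bar=126, c0=2200, c1=1750, c2=17,
    total_cost=45000, target_var=1.95,
)
print(n_opt, l_cost, l_var)
```

With these inputs the sketch gives about 9.4 sex girls per street, about 11 streets per stratum under the fixed cost, and about 15 streets per stratum under the given variance, before rounding.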
3.2.2. An Application in Beijing City
In 2015, stratified two-stage random sampling was employed to estimate the monthly average value of male-male sexual behavior among MSM in Beijing City. According to their age, the MSM are divided into two strata , in which the ages of the MSM in the first stratum and the second stratum are from 15 to 29 and from 30 to 49, respectively. The entertainment venues (such as gay bars and gay clubs) are defined as the primary units and the MSM as the secondary units. According to the relevant reference [8], we took the permitted error as half of the confidence interval (), so the confidence is , and we get the given variance and and . Beijing has 3984 entertainment venues (), and each entertainment venue has 17 MSM on average (). At the first stage, entertainment venues were drawn. At the second stage, MSM were selected from each chosen entertainment venue. The fixed cost of the survey is 4500 dollars, the average cost of investigating each entertainment venue is 270 dollars, and the average cost of investigating each MSM is 32 dollars.
(1) According to the results of a previous survey, we get
(2) From (53), we could get the average number of MSM that need to be sampled from each chosen entertainment venue.
(3) Supposing that the cost is fixed (90000 dollars), from (54), we could get the number of entertainment venues that need to be sampled from each stratum in Beijing City.
(4) Supposing that the variance is fixed (0.385), from (55), we could get the number of entertainment venues that need to be sampled from each stratum in Beijing City.
4. Discussion
(1) The formulas for the optimum sample sizes with sample rotation under two-stage random sampling and stratified two-stage random sampling for sensitive questions are deduced for the first time in this paper. Because of the nature of sensitive questions, we adopt the multiplication RRT model to obtain realistic and reliable data. Sample fatigue and the decrease in sample representativeness are two disadvantages of successive surveys, but sample rotation can greatly improve the accuracy of the estimators, so we apply sample rotation to balance these conflicting concerns. Using the formulae deduced in this paper, we obtain the optimum sample size in each stage for investigating the number of monthly services and the age at first sexual service of sex girls in Xichang City, and the proportion of condom use during anal sex and the monthly average value of male-male sexual behavior of MSM in Beijing City.
(2) The survey method and statistical formulas of this paper have been successfully applied to survey and analyze sensitive issues among the sex service girls in Xichang City, Sichuan Province, which indicates that the formulas work well in practical application. The randomized response technique was adopted for the interviewees, and the multiplication RRT model was used to improve their response rate and to make the survey results more authentic and reliable. The results calculated from our formulas provide a scientific basis for health authorities to make regional policies and decisions for effectively controlling HIV/AIDS.
(3) The RRT model has great advantages, although its limitations should not be overlooked. It works by adding random noise to the data, which may cause errors. Nevertheless, RRT is still a good model for protecting sensitive personal information in surveys of sensitive issues. The RRT model is more likely to obtain correct data than direct question designs when investigating sensitive issues such as premarital sex, premarital pregnancy, and extramarital sex. However, some respondents still provide untruthful answers, which negatively affects the accuracy of the data. In addition, the RRT model needs larger samples than direct question designs. It is therefore necessary that investigators be familiar with the principle and operation of the RRT model and gain the trust of respondents, so as to protect privacy and improve reliability and validity. Moreover, the non-randomized response model behaves better than the RRT model in terms of efficiency and privacy protection, which will be the next research direction. Because the random noise that the RRT model adds to guard privacy reduces accuracy and efficiency, it is all the more important to have the formulas of this article for the optimal sample size when the variance is given or the cost is fixed.
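The remark that the RRT model needs larger samples than direct questioning can be quantified for the ten-ball multiplication device. The sketch below assumes that the ball value is independent of the true value and ignores the finite population correction; the function names and the example mean and variance are hypothetical.

```python
BALLS = range(10)                               # balls labelled 0, 1, ..., 9
MU_Y = sum(BALLS) / 10                          # mean of the ball values
EY2 = sum(y * y for y in BALLS) / 10            # mean of the squared ball values

def rrt_variance(mu_x, var_x):
    """Variance of one scaled answer Z / MU_Y, where Z = X * Y."""
    ez2 = (var_x + mu_x ** 2) * EY2             # E[Z^2] = E[X^2] * E[Y^2]
    return (ez2 - (mu_x * MU_Y) ** 2) / MU_Y ** 2

def sample_size_ratio(mu_x, var_x):
    """Factor by which the sample must grow to match direct questioning."""
    return rrt_variance(mu_x, var_x) / var_x

# Hypothetical sensitive variable with mean 10 and variance 4.
print(rrt_variance(10, 4), sample_size_ratio(10, 4))
```

The inflation grows with the square of the mean relative to the variance, which is one reason the choice of sample size matters so much under this model.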
Appendix
MATLAB CODE
% two-stage
% 3.1.1
clear
Ct=40000;      % total cost
C1=1500;       % cost per primary unit
C2=15;         % cost per secondary unit
C0=2500;       % fixed cost
M=126;
N=54;
V=9.6;         % given variance
S1_2=308.7;
S2_2=280.77;
v=(1/N)*S1_2;
y1=sqrt(S1_2-S2_2/M);
y2=sqrt(S2_2);
m=y2/y1*sqrt(C1/C2)
% the cost is fixed
n1=(Ct-C0)/(C1+C2*m)
% the given variance
n2=(y1^2+y2^2/m)/(V+v)
% 3.1.2
clear
Ct=147680;     % total cost
C1=14768;      % cost per primary unit
C2=0.45;       % cost per secondary unit
C0=12368;      % fixed cost
M=117;
N=13;
V=0.00033;     % given variance
S1_2=0.0226;
S2_2=0.0018;
v=(1/N)*S1_2;
y1=sqrt(S1_2-S2_2/M);
y2=sqrt(S2_2);
m=y2/y1*sqrt(C1/C2)
% the cost is fixed
n1=(Ct-C0)/(C1+C2*m)
% the given variance
n2=(y1^2+y2^2/m)/(V+v)
% stratified two-stage
% 3.2.1
clear
T=2;           % number of strata
Ct=45000;      % total cost
C0=2200;       % fixed cost
C1=1750;       % average cost per street (primary unit)
C2=17;         % average cost per respondent (secondary unit)
S_11=130;
S_12=93;
S_21=94.6;
S_22=97.5;
W1=1/2;
N=126;
L=27;
y1=sqrt(1/4*(S_11-1/N*S_21) ...
+1/4*(S_12-1/N*S_22));
y2=sqrt(1/4*(S_21)+1/4*(S_22));
n_=y2*sqrt(C1)/(y1*sqrt(C2))
% the cost is fixed
l1=(Ct-C0)*y1*sqrt(C2)/(C1*T*y1*sqrt(C2) ...
+C2*T*y2*sqrt(C1))
v=1.95+1/L*1/4*(S_11+S_12);
% the given variance
l2=y1^2/v+y2*y1*sqrt(C2)/(sqrt(C1)*v)
% 3.2.2
clear
T=2;           % number of strata
Ct=90000;      % total cost
C0=4500;       % fixed cost
C1=270;        % average cost per entertainment venue (primary unit)
C2=32;         % average cost per respondent (secondary unit)
S_11=89.5;
S_12=86.02;
S_21=85.83;
S_22=78.73;
W1=0.5824;
W2=0.4176;
N=17;
L=1992;
y1=sqrt(W1^2*(S_11-1/N*S_21) ...
+W2^2*(S_12-1/N*S_22));
y2=sqrt(W1^2*(S_21)+W2^2*(S_22));
n_=y2*sqrt(C1)/(y1*sqrt(C2))
% the cost is fixed
l1=(Ct-C0)*y1*sqrt(C2)/(C1*T*y1*sqrt(C2) ...
+C2*T*y2*sqrt(C1))
v=0.385+1/L*(W1^2*S_11+W2^2*S_12);
% the given variance
l2=y1^2/v+y2*y1*sqrt(C2)/(sqrt(C1)*v)
Data Availability
Our data come from field surveys. The survey data supporting this study are from previously reported studies and datasets about AIDS, which are cited at the relevant places within the text as [8, 15, 16]. The survey data used to support the findings of this study are available from the following sources: [8] Yu B, The Research of Successive Sampling for Quantitative Sensitive Questions Survey and Its Application [D], Soochow University, 2015; [15] Wei Li, The RRT Model for Inquiring Quantitative Sensitive Questions under Cluster Sampling and Assessment of Validity and Reliability through Computer Simulation [D], Soochow University, 2013; and [16] Pu X KG. Gao and Y. H., Sample Size Determination of Dichotomous Sensitive Question Survey under Two-Stage Sampling [J], Soochow University, 2013(2): 254-256.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
The research was fully supported by a grant (31701424) from National Natural Science Foundation of China.