Abstract
This paper studies the behavior of double sampling for dealing with nonresponse when ranked set sampling is used. The characteristics of the sampling strategies are derived. The structure of the errors motivated a study of the optimality of the strategies through a set of Monte Carlo experiments.
1. Introduction
The usual theory of survey sampling is developed assuming that the finite population is composed of individuals that can be perfectly identified. A sample of size $n$ is selected, and the variable of interest $Y$ is measured in each selected unit. Real-life surveys must deal with missing observations. There are three ways to cope with this: ignore the nonrespondents, subsample the nonrespondents, or impute the missing values. Ignoring the nonresponses is a dangerous decision; subsampling is a conservative and costly solution. Imputation is often used to compensate for item nonresponse. For discussions on the theme see, for example, Rueda and González [1] and Singh [2].
Section 2 presents the problem of nonresponse when a single sample is selected.
We consider the use of double sampling for obtaining information on an auxiliary variable $X$. A first large sample is selected; it is assumed to be inexpensive. The values of $X$ are used for selecting a ranked set sample (RSS), as the units are ranked using the values of $X$ in the first-stage sample. The second sample is thus a subsample of the preliminary large sample. The literature on double sampling under simple random sampling (SRS) is large. Textbooks give the basic theory; see Singh [2] and Cochran [3]. In this paper we consider a double sampling procedure based on RSS. It is presented in Section 3, where a family of estimators is considered as an RSS alternative to the proposal of Singh and Kumar [4]. An expression for the gain in accuracy due to our proposed estimator is found. The estimator is compared with the simple mean and with the proposal of Singh and Kumar [4]. Real-life data are used in Section 4 to evaluate the behavior of these alternative estimators of the population mean.
2. The Nonresponse Problem: A Single Sample
Nonresponses may be caused by the refusal of some units to give the true value of $Y$ or by other causes. Hansen and Hurwitz [5] proposed in 1946 selecting a subsample among the nonrespondents; see Cochran [3]. The performance of this approach depends heavily on the subsampling rule. Subsampling rules are due to Hansen and Hurwitz [5], Srinath [6], and Bouza [7]. The existence of nonresponses implies that the population $U$ is divided into two strata: $U_1 = \{u \in U : u \text{ responds at the first visit}\}$ and $U_2 = U \setminus U_1$. Similarly, the sample $s$ is partitioned into $s_1$ and $s_2$. The procedure is a particular double sampling design which, using the Hansen-Hurwitz rule (HHR), is described as follows.
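Step 1. Select a sample $s$ of size $n$ from $U$ using srswr.
Step 2. Evaluate $Y$ among the respondents and determine $s_1 = \{u \in s : u \text{ responds at the first visit}\}$, $|s_1| = n_1$. Compute $\bar{y}_1 = \frac{1}{n_1} \sum_{i \in s_1} y_i$ (2.1).
Step 3. Determine $n_2' = \theta n_2$, $0 < \theta < 1$; with $\theta = 1/K$, where $n_2 = n - n_1$ is the number of nonrespondents in $s$.
Step 4. Select a subsample $s_2'$ of size $n_2'$ from $s_2 = s \setminus s_1$ using srswr.
Step 5. Evaluate $Y$ among the units in $s_2'$. Compute $\bar{y}_2' = \frac{1}{n_2'} \sum_{i \in s_2'} y_i$ (2.2).
Step 6. Compute the estimate of $\mu$: $\bar{y} = \frac{n_1 \bar{y}_1 + n_2 \bar{y}_2'}{n}$ (2.3).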
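The steps above can be sketched in Python as follows. This is only an illustrative sketch: the function name `hansen_hurwitz_estimate`, the boolean list `responds` marking first-visit respondents, and the rounding of the subsample size to at least one unit are assumptions made here, not part of the original procedure.

```python
import random


def hansen_hurwitz_estimate(population, responds, n, K, rng):
    """One realization of the subsampling estimator (hypothetical helper).

    population: list of Y values; responds: list of booleans (assumed known
    only after the first visit); n: sample size; K: inverse subsampling rate.
    """
    # Step 1: srswr sample of unit indices.
    sample = [rng.randrange(len(population)) for _ in range(n)]
    # Step 2: split into respondents and nonrespondents, mean of respondents.
    s1 = [i for i in sample if responds[i]]
    s2 = [i for i in sample if not responds[i]]
    n1, n2 = len(s1), len(s2)
    mean1 = sum(population[i] for i in s1) / n1 if n1 else 0.0
    if n2 == 0:
        return mean1
    # Step 3: subsample size n2' = n2 / K (rounded, at least 1: an assumption).
    n2_sub = max(1, round(n2 / K))
    # Steps 4-5: srswr subsample of the nonrespondents and its mean.
    sub = [rng.choice(s2) for _ in range(n2_sub)]
    mean2 = sum(population[i] for i in sub) / n2_sub
    # Step 6: weighted combination of the two means.
    return (n1 * mean1 + n2 * mean2) / n
```

Averaging the returned value over many replications with a seeded `random.Random` gives a quick empirical check of the unbiasedness discussed next.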
Note that (2.1) is the mean of an srswr sample selected from the response stratum, so its expected value is the mean of $Y$ in the respondent stratum. The conditional expectation of (2.2) follows in the same way, as (2.4) is the mean of an srswr sample selected from the nonresponse stratum; taking into account the expected sizes of the two strata in the sample, the unbiasedness of (2.3) is easily derived.
The variance of (2.3) is deduced by the following device: the first term of the decomposition is a sample mean, so its variance is $\sigma^2/n$. The remaining terms are handled by conditioning and then taking expectations, which leads to the well-known expression for the expected error of (2.3). Our proposal is to use the information provided by a known auxiliary variable $X$ in order to apply RSS.
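For reference, the classical form of the Hansen-Hurwitz result, as found in standard texts such as Cochran [3] for sampling without replacement, can be written as below. The notation is assumed here for illustration: $N$ is the population size, $S^2$ and $S_2^2$ are the variances of $Y$ in the population and in the nonresponse stratum, $W_2$ is the proportion of nonrespondents in the population, and $K = n_2/n_2'$. Under srswr the first term reduces to $\sigma^2/n$, as stated in the text; the remaining details of the srswr version may differ from this sketch.

```latex
\bar{y} = \frac{n_1 \bar{y}_1 + n_2 \bar{y}_2'}{n},
\qquad
V(\bar{y}) = \left( \frac{1}{n} - \frac{1}{N} \right) S^2
           + \frac{W_2 (K - 1)}{n} \, S_2^2 .
```

The second term shows the price of subsampling: it vanishes when $K = 1$ (all nonrespondents are revisited) and grows linearly with $K$.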
McIntyre [8] proposed the RSS method. He noticed a gain in accuracy of the RSS mean with respect to the sample mean under srswr. Dell and Clutter [9] and Takahasi and Wakimoto [10] provided mathematical support for his claims. The following procedure describes RSS selection.
2.1. RSS Procedure
Step 1. Randomly select $m^2$ units from the target population.
Step 2. Allocate the selected units as randomly as possible into $m$ sets, each of size $m$.
Step 3. Without yet knowing any values of the variable of interest, rank the units within each set with respect to the variable of interest. This may be based on personal professional judgment or done with a concomitant variable correlated with the variable of interest.
Step 4. Choose a sample for actual quantification by including the smallest ranked unit in the first set, the second smallest ranked unit in the second set, and so on, until the largest ranked unit is selected from the last set.
Step 5. Repeat Steps 1 through 4 for $r$ cycles to obtain a sample of size $n = mr$ for actual quantification.
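The selection steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `ranked_set_sample` and the `rank_key` argument (a stand-in for judgment ranking or a concomitant variable) are hypothetical, and the $m^2$ units of each cycle are drawn without replacement within the cycle.

```python
import random


def ranked_set_sample(units, m, r, rank_key, rng):
    """Select m * r units by ranked set sampling (hypothetical helper).

    units: list of population units (needs at least m * m elements);
    rank_key: function used only for ranking, e.g. a concomitant variable.
    """
    selected = []
    for _ in range(r):  # Step 5: repeat for r cycles
        # Steps 1-2: draw m*m units and allocate them into m sets of size m.
        drawn = rng.sample(units, m * m)
        sets = [drawn[j * m:(j + 1) * m] for j in range(m)]
        # Steps 3-4: rank each set and keep the (j+1)-th ranked unit of set j.
        for j, group in enumerate(sets):
            ranked = sorted(group, key=rank_key)
            selected.append(ranked[j])
    return selected
```

For example, with units stored as `(x, y)` pairs, passing `rank_key=lambda u: u[0]` ranks on the auxiliary value only, so the variable of interest needs to be measured on just the `m * r` selected units.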
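The RSS sample is the sequence of order statistics (OS) $Y_{(1:1)t}, \dots, Y_{(m:m)t}$, $t = 1, \dots, r$, where $Y_{(h:h)t}$ denotes the statistic of order $h$ in the $h$th sample in cycle $t$. We have $n = mr$ observations, and $r$ of them are $h$th order statistics, $h = 1, \dots, m$. The RSS estimator of the mean of a variable of interest is $\bar{y}_{rss} = \frac{1}{mr} \sum_{t=1}^{r} \sum_{h=1}^{m} Y_{(h:h)t}$ and its variance is given by $V(\bar{y}_{rss}) = \frac{\sigma^2}{n} - \frac{1}{nm} \sum_{h=1}^{m} \Delta_{(h)}^2$ (2.11), where $\Delta_{(h)} = \mu_{(h)} - \mu$ and $\mu_{(h)} = E(Y_{(h:m)})$.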
The second term of (2.11) is the gain in accuracy due to the use of RSS instead of srswr.
Bouza [11] developed an RSS alternative under nonresponse. The number of nonresponses in $s$ is $n_2$. He derived that, using an RSS subsample of the nonrespondents, the resulting mean is unbiased for the mean of $Y$ in the nonresponse stratum.
The expected value of the cross-expectation is zero. In this case the RSS is balanced and we may express the variance of the order statistics (OS) as a function of the variance of $Y$ in ,, and the gains in accuracy measured by the as Substituting we obtain the following: Taking the RSS estimator Then there is gain in accuracy due to the use of RSS which is where is the gain in accuracy due to the use of RSS in the second stage.
3. The Nonresponse Problem: Double Sampling
We will consider that double sampling is used for obtaining a sample s* from the population using srswr. A cheap variable $X$ is measured in the units of s*. $X$ is correlated with $Y$, and we are able to compute its mean in the first stage. There are nonresponses. In the second stage we know and . Note that these estimates are used only in the estimation process.
Nonresponses on $Y$ are present in the second-stage sample, and a subsample among the nonrespondents is selected. Singh and Kumar [4] considered this problem for simple random sampling. They proposed the family of estimators characterized by The sampler fixes the constants α and β as well as $a$ and $b$. They can be constants or functions, with $a$ different from zero. Taking
Proposition 3.1 (see [4]). The bias of is defining where The variance is given by defining
We are going to derive the RSS counterpart of this family. The first-phase sample is selected using srswr, and the information on $X$ is used for selecting the initial sample and for subsampling the nonrespondents. Our proposal is to use is the RSS mean of in the second stage and Let us represent the involved estimators by Due to the unbiasedness of the estimators .
Taking We can rewrite (3.9) as Note that Under the hypothesis , an expansion in Taylor series of (3.13) may be worked out. Grouping conveniently we have that The cross-products for the OS , are expressed by The conditional expectations of the RSS estimators are Using these results we have that with In addition Substituting in (3.15) after some algebraic work we obtain that the bias of (3.9) is where For a large value of the bias tends to zero. Then we have proved the first statement of the following proposition.
Proposition 3.2. The estimator is asymptotically unbiased in terms of and its variance is given by If .
Proof. An expansion in Taylor series of may be worked out. It is, neglecting the terms of order , where Calculating the expected value and grouping we have that
Remark 3.3. The gain in accuracy due to the use of (3.9) in terms of the variance is
Hence, as ) the proposed method is more precise if .
This result allows us to deduce the RSS counterparts of different double sampling estimators of the mean; see, for example, Khare and Srivastava [12, 13] and Singh and Kumar [4, 14, 15].
4. Numerical Comparisons
We compared the behavior of the proposed RSS method with that of the SRS method using data from three populations, described as follows.
Population 1
A set of 244 accounts was considered. The balance of each of them in the previous semester was $X$, and $Y$ was produced by an audit. The first-phase sample consisted of 120 selected accounts, and 72 nonresponses were reported. A new audit was performed. The second-stage sample was of size 24.
Population 2
The evaluation of radiographs provided values of $X$ in 350 patients with cancer. A sample of 100 provided the first-phase sample and 24 of them the second phase. $Y$ was the size of the excised tumor. 53 measurements were missing; obtaining them required a search in the pathology department.
Population 3
The height of 1270 pigs provided the information on $X$ in the population. 170 of them were selected at the first phase and 24 of them at the second phase. $Y$ was the weight of the pigs, and 69 initial measurements were missing. The missing weights were obtained by locating the pigs before they were sent to the slaughterhouse.
The RSS design parameters were fixed conveniently for obtaining a sample of size 24. The means and variances of the OS involved were determined by forming all the possible samples and computing them. The relative gain in accuracy due to the use of RSS was measured by for . The results are given in Table 1. They support the claim that the use of RSS provides gains in accuracy larger than 10%.
A similar study was carried out by generating a sample of 240 values of and determining was generated using the same distribution. The results are given in Table 2. Note that the gain in efficiency is generally larger when the underlying distribution is symmetric. The best results are obtained when , except for the Beta distribution.
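A Monte Carlo comparison of this kind can be sketched as below. This is not the experiment reported in Table 2: the bivariate normal generator `draw_normal`, its correlation `rho = 0.9`, the design `m = 4`, `r = 6`, and the gain measure $1 - V_{rss}/V_{srs}$ are all assumptions chosen for illustration, and only the means of complete samples (no nonresponse) are compared.

```python
import random
import statistics


def draw_normal(rng, rho=0.9):
    """Draw one (x, y) pair from a standard bivariate normal (assumed model)."""
    x = rng.gauss(0.0, 1.0)
    y = rho * x + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
    return x, y


def srs_mean(draw, n, rng):
    """Mean of Y over n independent draws."""
    return statistics.fmean(draw(rng)[1] for _ in range(n))


def rss_mean(draw, m, r, rng):
    """Mean of Y over an RSS of size m * r, ranking each set on X only."""
    total = 0.0
    for _ in range(r):
        for j in range(m):
            group = sorted((draw(rng) for _ in range(m)), key=lambda u: u[0])
            total += group[j][1]  # Y of the unit ranked (j+1)-th on X
    return total / (m * r)


def relative_gain(draw, m, r, reps, seed=0):
    """Estimate 1 - V(rss mean) / V(srs mean) over reps replications."""
    rng = random.Random(seed)
    v_srs = statistics.pvariance([srs_mean(draw, m * r, rng) for _ in range(reps)])
    v_rss = statistics.pvariance([rss_mean(draw, m, r, rng) for _ in range(reps)])
    return 1.0 - v_rss / v_srs
```

Calling `relative_gain(draw_normal, 4, 6, 2000)` returns a value that should be positive under this model; replacing `draw_normal` with a generator for a skewed distribution gives a rough way to explore how symmetry affects the gain.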
5. Conclusions
The accuracy of the proposed method seems to be better than that of the SRS method when is analyzed. It can take negative values, but it was larger than zero in the experiments carried out. It was around 0.1 in all cases, and using may be the best choice.
Acknowledgments
The authors thank the referees for their helpful comments, which helped to improve a previous version. This paper was supported by the CONACYT Contract 10110/62/10, FON. INST. 8/10.