Abstract

Controlled sampling is a method of sample selection that minimizes the probability of selecting undesirable combinations of units. Extending the linear programming approach with an effective distance measure, we propose a simple method for two-dimensional optimal controlled selection that assigns zero probability to undesired samples. We also suggest alternative estimators for the population total and its variance. Numerical examples demonstrate the utility of the proposed procedure in comparison with existing procedures.

1. Introduction

Goodman and Kish [1] introduced controlled sampling as a method of sample selection that increases the probability of desired samples. Controlled sampling may be described as a technique of sampling from a finite universe that allows multiple stratifications beyond what is possible with stratified random sampling. Situations often arise in which some combinations of units are less beneficial, or even undesirable, to include in the sample because of considerations such as distance, similarity of units, and cost. Samples containing such undesirable combinations of units are known as nonpreferred or undesirable samples. Using the technique of controlled selection, one can exclude undesirable combinations of units from the sample or assign them minimum probability of selection. This results in an increase in the selection probability of preferred samples.

The controlled sampling technique can be used effectively in two or more dimensions. Researchers commonly face multidimensional sampling problems in social research, where several variables are involved in the population and stratification is required in more than one dimension. The need for multidimensional stratification in various real-life situations was discussed by Bryant [2], Hess and Srikantan [3], Moore et al. [4], and Jessen [5]. Jessen [5] considered a problem with 12 geographical areas and 12 income classes, resulting in a total of 144 strata cells, out of which only 24 cells were to be selected. In such situations, the researcher requires stratification techniques that permit fewer cells to be selected than the total number of strata cells permitted under stratified sampling, without sacrificing the requirements of probability sampling. This is known as controls beyond stratification.

Goodman and Kish [1] were the first to address this problem under the name of two-dimensional controlled selection but did not provide any general method to solve such problems. Hess and Srikantan [3] and Groves and Hess [6] discussed the multidimensional controlled selection problem for hospital data in the US and presented a formal algorithm for obtaining solutions to the two-dimensional and three-dimensional problems. However, there are simple examples where their algorithm fails, even for two-dimensional problems. To select the set of feasible samples, Jessen [7] proposed two methods for two-way and three-way stratification, but both methods are quite complicated to implement, involve a lot of trial and error, and sometimes even fail to provide a solution. Ernst [8] was the first to present a constructive solution for two-dimensional controlled selection problems, but his procedure is quite cumbersome.

Causey et al. [9] proposed an algorithm based on transportation theory to solve two-dimensional controlled selection problems, which is efficient but complex to implement. Inspired by the idea of Rao and Nigam [10, 11], Sitter and Skinner [12] proposed a linear programming approach to solve multidimensional controlled selection problems. Tiwari and Nigam [13] solved two-dimensional optimal controlled selection problems with controls beyond stratification using the simplex method of linear programming. The procedure of Tiwari and Nigam [13] is best suited to problems with integer marginals, while the method of Sitter and Skinner [12] is best suited to noninteger marginals. Extending the linear programming idea of Sitter and Skinner [12], Lu and Sitter [14] discussed some methods to reduce the amount of computation so that very large problems become feasible using the linear programming approach. Tiwari and Nigam [15] applied the idea of the nearest proportional to size sampling design to two-dimensional optimal controlled sampling problems using quadratic programming, introducing a sampling design that ensures zero probability to nonpreferred samples. The procedure of Tiwari and Nigam [15] is efficient but quite cumbersome: before applying the idea of the nearest proportional to size design to obtain the desired controlled inclusion probability proportional to size (IPPS) design, one must first obtain an appropriate uncontrolled IPPS design and then define a non-IPPS design that totally avoids the nonpreferred samples, so that their probabilities are zero. In this paper, we introduce an effective distance measure as the objective function and a new constraint in the linear programming problem to propose a very simple and effective method for two-dimensional controlled sampling that fully excludes the undesirable samples.
The proposed procedure appears to perform better than the earlier two-dimensional controlled selection procedures, as it ensures zero probability to undesirable samples without complicating the implementation process.

Another problem that needs attention is variance estimation in multidimensional controlled selection designs. For one-dimensional controlled selection problems, the Horvitz and Thompson [16] estimator is best suited, as the stability and nonnegativity conditions of the Yates-Grundy [17] form of the Horvitz-Thompson [16] variance estimator are satisfied for such designs. However, as observed by Tiwari and Nigam [15], two-dimensional controlled selection problems do not satisfy these conditions, which gives rise to the need for alternative estimators. To overcome this difficulty, Jessen [18], Tiwari and Nigam [13], and Tiwari and Nigam [15] suggested alternative variance estimation procedures using the “split sample,” “half sample,” and “random group” methods, respectively. In this paper, we propose a systematic method for estimation of the population total and its variance for two-dimensional controlled selection problems. The proposed variance estimator appears to perform better than the existing estimators in terms of bias. We demonstrate its utility with the help of some examples.

2. The Basic Notations and Preliminaries

Let us consider a two-dimensional population array of units, consisting of cells that have real numbers, , . Suppose a sample of size is to be obtained from this population. Let be the characteristic under study, the y-value for the ijth unit in the population , and the y-value for the lth unit in the sample . Let , , denote the kth possible samples. Also let be each internal entry of . Then equals either [] or , where [] is the integer part of . We have to consider a set of samples with selection probabilities that satisfy the constraints: where is the set of all possible samples and is the selection probability of each sample .

There can be many sets of probability distributions satisfying (1), although only one set of probabilities can be used to obtain a solution of the two-dimensional controlled selection problem. We may consider an algorithm based on an appropriate and objective principle to find the solution that reflects the closeness of each sample to . For this purpose we consider the following measures of closeness between and .

The first ordinary distance, which is often called the Euclidean distance, given as is the most common measure to define the closeness between and , as it is easy to calculate.

Two other distance measures can also be used to define the distance between and . These are(i)cosine distance function: (ii)Bray-Curtis distance function:

Huang [19] and Khatri [20] compared all the above distance measures and found that the cosine distance function works well in comparison to the other distance functions. In particular, Huang [19] evaluated different distance measures empirically using seven data sets, and the results indicated that the cosine distance function performs reasonably well. We also applied the three distance functions (2), (3), and (4) to all the controlled sampling problems considered by us and found that the distance function given in (3) provides minimum bias, which supports the findings of Huang [19] and Khatri [20]. In view of these observations, we use the cosine distance (3) as the distance measure in this paper. We used the OPTMODEL procedure in SAS 9.3 to solve the linear programming problem and the “pdist2” (pairwise distance between two sets of observations) function in MATLAB 10.0 to compute the three distance functions.
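The three measures of closeness can be illustrated with a minimal Python sketch. It assumes the standard textbook definitions of the Euclidean, cosine, and Bray-Curtis distances, applied to the two-way arrays after flattening them cell by cell. The function names and the 2 × 2 arrays are hypothetical and are not taken from the examples in this paper.

```python
import math


def flatten(array):
    """Flatten a two-way array (a list of rows) into one list of cell values."""
    return [value for row in array for value in row]


def euclidean_distance(a, b):
    """Square root of the sum of squared cell-wise differences."""
    x, y = flatten(a), flatten(b)
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(x, y)))


def cosine_distance(a, b):
    """One minus the cosine of the angle between the flattened arrays."""
    x, y = flatten(a), flatten(b)
    dot = sum(u * v for u, v in zip(x, y))
    norm_x = math.sqrt(sum(u * u for u in x))
    norm_y = math.sqrt(sum(v * v for v in y))
    return 1.0 - dot / (norm_x * norm_y)


def bray_curtis_distance(a, b):
    """Sum of absolute differences divided by the sum of absolute sums."""
    x, y = flatten(a), flatten(b)
    numerator = sum(abs(u - v) for u, v in zip(x, y))
    denominator = sum(abs(u + v) for u, v in zip(x, y))
    return numerator / denominator


# Hypothetical data: expected cell counts and one candidate sample array.
population = [[0.5, 0.5], [0.5, 0.5]]
sample = [[1, 0], [0, 1]]

d_euclidean = euclidean_distance(population, sample)
d_cosine = cosine_distance(population, sample)
d_bray_curtis = bray_curtis_distance(population, sample)
```

In an application, each feasible sample array would be compared with the population array in this way, and the resulting distances would serve as the weights in the objective function.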

3. The Proposed Two-Dimensional Optimal Controlled Sampling Plan

Let denote the set of undesired samples, that is, the samples containing the undesired combinations of units. The required set of samples is obtained through the solution of the following linear programming problem.

Minimize the objective function, where

Subject to the following constraints:

The constraints (i) and (ii) in (6) are necessary for any sampling design and the constraint (iii) assures that the resultant design is an IPPS design. The constraint (iv) ensures that the probabilities of undesired samples are equal to zero. We also tried to add one more constraint , in (6), to ensure the nonnegativity of the Yates-Grundy form of Horvitz-Thompson variance estimator and applied it to all the two-dimensional controlled selection problems considered by us. However, in no case did it yield a solution. Consequently, we dropped the idea of adding this constraint and suggested an alternative procedure for variance estimation.
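The structure of this linear programming problem can be sketched in Python on a small hypothetical case. The sketch assumes that the objective is the probability-weighted sum of cosine distances, that constraints (i) and (ii) are nonnegativity and summation to one, that constraint (iii) equates the expected cell counts with the population array, and that constraint (iv) sets the probabilities of undesired samples to zero. The 3 × 3 array, the choice of undesired samples, and all names are illustrative assumptions. The code only assembles the problem and checks a candidate design; it does not replace an LP solver.

```python
import itertools
import math

ROWS = COLS = 3
# Hypothetical population array, flattened row by row: each cell expects 1/3.
a = [1.0 / 3.0] * (ROWS * COLS)

# Feasible samples: one unit per row and per column (permutation arrays).
samples = []
for perm in itertools.permutations(range(COLS)):
    cells = [0] * (ROWS * COLS)
    for i, j in enumerate(perm):
        cells[i * COLS + j] = 1
    samples.append(cells)

# Assumed undesired samples: those containing the 1st, 5th, and 9th units.
diagonal = [0, 4, 8]
undesired = [k for k, s in enumerate(samples) if all(s[c] for c in diagonal)]


def cosine_distance(x, y):
    """One minus the cosine similarity of two flattened arrays."""
    dot = sum(u * v for u, v in zip(x, y))
    norms = math.sqrt(sum(u * u for u in x)) * math.sqrt(sum(v * v for v in y))
    return 1.0 - dot / norms


# Objective coefficients: one distance weight per feasible sample.
cost = [cosine_distance(a, s) for s in samples]


def objective(p):
    """Assumed objective: sum of distance times selection probability."""
    return sum(c_k * p_k for c_k, p_k in zip(cost, p))


def is_feasible(p, tol=1e-9):
    """Check the assumed constraints (i) to (iv) for a candidate design p."""
    if any(p_k < -tol for p_k in p):                      # (i) nonnegativity
        return False
    if abs(sum(p) - 1.0) > tol:                           # (ii) sums to one
        return False
    for c in range(ROWS * COLS):                          # (iii) IPPS
        expected = sum(s[c] * p_k for s, p_k in zip(samples, p))
        if abs(expected - a[c]) > tol:
            return False
    return all(abs(p[k]) <= tol for k in undesired)       # (iv) zero mass


# Candidate design: probability 1/3 on each sample with one diagonal unit.
candidate = [1.0 / 3.0 if sum(s[c] for c in diagonal) == 1 else 0.0
             for s in samples]
uniform = [1.0 / len(samples)] * len(samples)
```

In this toy case the uniform design satisfies the first three constraints but gives the undesired sample probability 1/6, whereas the candidate design satisfies all four. In practice the arrays `cost`, `samples`, and `undesired` would be passed to a linear programming solver, which would choose the design.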

The solution of the linear programming problem, namely, the minimization of (5) subject to the constraints (6), using the “pdist2” (pairwise distance between two sets of observations) function in MATLAB 10.0 and the OPTMODEL procedure in SAS 9.3, provides an optimal controlled IPPS sampling plan that ensures zero probability of selection for the undesired samples. The proposed strategy also provides an opportunity to add more constraints to the controlled selection problem. The proposed plan performs better than the plans of Sitter and Skinner [12] and Tiwari and Nigam [13] in the sense that these plans only attempt to minimize the selection probabilities of the nonpreferred samples, whereas the proposed plan ensures zero probability to nonpreferred samples through constraint (iv) in (6). The exclusion of nonpreferred samples was also attempted by Tiwari and Nigam [15] for two-dimensional controlled selection problems, using the idea of the nearest proportional to size design. However, their procedure is quite lengthy and tedious, as an uncontrolled IPPS design must first be constructed manually and the required controlled IPPS design is then obtained using the quadratic programming approach. The proposed plan achieves the same advantage in a very simple manner, by adding just one more constraint to the linear programming problem to ensure zero probability to nonpreferred samples. The implementation of the proposed design is very simple in comparison to the earlier designs. One limitation of the proposed design is that it becomes impractical when the set of all possible samples () is very large, as the process of enumerating all possible samples and forming the objective function and constraints becomes quite tedious. This limitation also holds for the optimum approaches of Sitter and Skinner [12], Tiwari and Nigam [13], and Tiwari and Nigam [15]. However, with the help of faster computing techniques and modern statistical tools, there may not be much difficulty in using the proposed plan for moderately large populations. Moreover, the proposed procedure takes less computing time than the procedures of Sitter and Skinner [12], Tiwari and Nigam [13], and Tiwari and Nigam [15]. In what follows, we show the utility of the proposed procedure with the help of some numerical examples.

4. Empirical Evaluation

In this section, we will present some numerical examples to demonstrate the utility of the proposed procedure and compare it with the existing procedures of optimal controlled sampling designs.

Example 1. Let us consider a 4 × 3 hypothetical population borrowed from Tiwari and Nigam [15], given in Table 1. The desired sample size of is less than the total number of cells, 12. The set of all possible samples consists of 12 samples, given in Table 2. Let the set of undesirable samples consist of those samples that do not contain all three of the elements 1st, 5th, and 9th or all three of the elements 3rd, 5th, and 7th. Thus samples 6 and 9 are the nonpreferred samples.
Applying the plans of Tiwari and Nigam [13] (to be denoted by TN-1) and Tiwari and Nigam [15] (to be denoted by TN-2) and the proposed plan discussed in Section 3 to this population, we get the selection probabilities of the samples shown in Table 3. For this example, the probability of nonpreferred samples under the plan of Tiwari and Nigam [13] is 0.1, whereas the proposed plan assigns zero probability to nonpreferred samples.

Example 2. Now let us consider another hypothetical example borrowed from Bryant et al. [21] given in Table 4. The desired sample of size 10 is less than the total number of units, 15. The integer parts of ’s are known as “certainty proportion.” For obtaining the set of feasible samples, we initially remove the certainty proportions and replace them at their original position after getting these samples. After removing the certainty proportions, we get a two-way array shown in Table 5.
After subtracting the certainty proportions, the problem is reduced to selecting 6 units from the array. The set of all possible samples consists of 15C6 = 5005 samples, out of which 4989 samples do not satisfy the marginal constraints of the 5 × 3 population. Thus, the set of samples satisfying the marginal constraints has only 16 samples, given in Table 6. Now we suppose the situation of controls beyond stratification. Based on considerations similar to those of Avadhani and Sukhatme [22], Tiwari and Nigam [13], and Tiwari and Nigam [15], we consider that if all three units 4th, 8th, and 12th or 6th, 8th, and 10th do not appear in a sample, then the sample is a nonpreferred sample. Thus the set of all preferred samples consists of only 10 samples, that is, the sample numbers 1, 3, 5, 7, 9, 10, 11, 13, 14, and 16. Applying the proposed plan and the plans of Tiwari and Nigam [13] (TN-1) and Tiwari and Nigam [15] (TN-2) to the modified problem, we get the selection probabilities shown in Table 7.
For this example, the probability of undesired samples is zero for the proposed plan and the plan of Tiwari and Nigam [15]. The proposed plan again ensures zero probability to undesirable samples, whereas the plan of Tiwari and Nigam [13] only attempts to minimize the probability of undesirable samples.
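The removal of certainty proportions used in Example 2 can be sketched as follows. This is a minimal illustration on a hypothetical 2 × 2 array of expected cell counts, not the data of Table 4. The function name `split_certainty` is an assumption, and exact fractions are used to avoid rounding error.

```python
import math
from fractions import Fraction


def split_certainty(array):
    """Split each cell into its integer part and the remaining fraction."""
    certainty = [[math.floor(value) for value in row] for row in array]
    remainder = [[value - math.floor(value) for value in row] for row in array]
    return certainty, remainder


# Hypothetical expected cell counts (they sum to a sample size of 5).
expected = [
    [Fraction(3, 2), Fraction(1, 2)],
    [Fraction(1, 2), Fraction(5, 2)],
]

certainty, remainder = split_certainty(expected)

# Units selected with certainty, and units left for the reduced problem.
certain_units = sum(sum(row) for row in certainty)
reduced_sample_size = sum(sum(row) for row in remainder)
```

Here three units are selected with certainty and the reduced problem selects the remaining two units from the array of fractional parts. Adding the certainty array back to any sample from the reduced problem restores a sample for the original problem.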

Example 3. Let us consider a real-life application, borrowed from Tiwari and Nigam [15], where two-dimensional stratification is required in plot sampling in field experiments. Consider the yield (in tons) of wheat given in Table 8 for an experiment involving 4 blocks (B1, B2, B3, and B4) and 4 treatments (T1, T2, T3, and T4). The integer parts of ’s are known as “certainty proportion.” For obtaining the set of feasible samples, we initially remove the certainty proportions and replace them at their original position after getting these samples. After removing the certainty proportions, we get a two-way array shown in Table 9.
After subtracting the certainty proportions, the problem is reduced to selecting 8 units from the array. The set of all possible samples consists of 16C8 = 12870 samples, out of which 12780 samples do not satisfy the marginal constraints of the 4 × 4 population. Thus, the set of samples satisfying the marginal constraints has only 90 samples. Now we suppose the situation of controls beyond stratification. Based on considerations similar to those of Tiwari and Nigam [15], we consider that if three or more diagonal units appear in a sample, then the sample is a nonpreferred sample. Thus, the set of all preferred samples consists of only 33 samples, shown in Table 10. Applying the proposed plan and the plan of Tiwari and Nigam [15] (TN-2) to the modified problem, we get the selection probabilities shown in Table 11. For this example, the probability of undesired samples is zero for both the proposed plan and the plan of Tiwari and Nigam [15].
Some other examples are also considered to analyse the performance of the proposed plan. Details of these examples are given in the Appendix. The probabilities of selecting the undesirable samples for the proposed plan, the plan of Tiwari and Nigam [13] and Tiwari and Nigam [15], are given in Table 12. Table 12 again shows that while the plan of Tiwari and Nigam [13] only attempts to minimize the probability of undesirable samples, the proposed plan always ensures zero probability to undesirable samples. Tiwari and Nigam [15] plan also provides zero probability of nonpreferred samples, but as discussed earlier, it is quite difficult to implement.
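The enumeration step in Example 3 can be reproduced by brute force. The sketch below assumes that, after removing the certainty proportions, each block and each treatment must contribute exactly two units. This assumption is consistent with the 8 units and the 90 feasible samples quoted in Example 3, but the marginal totals themselves are not stated in the text. All names are hypothetical.

```python
import itertools
import math

ROWS, COLS = 4, 4
# Assumed marginal totals: two units from every block and every treatment.
ROW_TOTALS = [2, 2, 2, 2]
COL_TOTALS = [2, 2, 2, 2]
SAMPLE_SIZE = sum(ROW_TOTALS)


def satisfies_margins(cells):
    """Check whether a set of cell indices meets the row and column totals."""
    row_counts = [0] * ROWS
    col_counts = [0] * COLS
    for cell in cells:
        row_counts[cell // COLS] += 1   # cells are numbered row by row
        col_counts[cell % COLS] += 1
    return row_counts == ROW_TOTALS and col_counts == COL_TOTALS


# Every way of choosing 8 of the 16 cells, then the subset meeting the margins.
all_samples = list(itertools.combinations(range(ROWS * COLS), SAMPLE_SIZE))
feasible = [cells for cells in all_samples if satisfies_margins(cells)]
rejected = len(all_samples) - len(feasible)
```

Under these assumptions the enumeration gives 12870 candidate samples, of which 90 satisfy the margins and 12780 do not. Because the number of candidates grows combinatorially with the array size, this direct enumeration is practical only for small to moderate problems.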

5. Variance Estimation for the Proposed Procedure

Jessen [18] suggested the split sample estimator as an alternative to the Horvitz-Thompson (HT) estimator. This estimator also works in situations where the nonnegativity condition of the Yates-Grundy form of the HT variance estimator is not satisfied. Jessen’s split sample estimator is negatively biased, and its bias is found to be quite high. Using the half sample method, Tiwari and Nigam [13] introduced a method of variance estimation for two-dimensional controlled selection problems. Their variance estimator was found to be positively biased, with a lower bias than Jessen’s split sample estimator. An important limitation of both estimators is that they require exactly two units from each row and column of the two-way array; neither method can be applied if two units are not available from each row and column. Using the idea of random groups, Tiwari and Nigam [15] introduced an alternative estimator for the population total and its variance that can be used even when two units are not available from each row and column of the two-way array.

Using the procedure of systematic sampling for variance estimation, originally developed by W. G. Madow and L. H. Madow [23], we propose an alternative estimation procedure for the population total and its variance in two-dimensional controlled selection problems. The proposed estimator performs better than the split sample estimator of Jessen [18] and the estimators proposed by Tiwari and Nigam [13] and Tiwari and Nigam [15] in terms of bias. The proposed procedure can also be used in the situations where exactly two units are not available from each row and column of the two-way array. The proposed approach is as follows.

To construct systematic samples from a sample of size drawn from a population of units, we first arrange all sample units in a list: they can be placed at random in the list, placed in a particular sequence, or left in the sequence in which they naturally occur.

Let be the y-value for the th unit in the sample and let be the measure of size. Next, a cumulative measure of size, , is calculated for each sample unit; that is, . To select a systematic sample of units, a selection interval, say, , is calculated as the total of all measures of size divided by ; that is, . The selection interval is not necessarily an integer but is typically rounded off to two or three decimal places. To initiate the sample selection process, a uniform random deviate, say, , is chosen on the half open interval . The selection numbers for the sample are then . The sample unit identified for the systematic sample by each selection number is the first unit on the list for which the cumulative size, , is greater than or equal to the selection number. With the help of this procedure the units of the sample can be divided into systematic samples. The various values of will give various systematic samples and the proposed estimator will depend on the value of . However, it has been found that the proposed estimator works satisfactorily in all the situations.
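The selection step described above can be sketched in Python. The sketch assumes that the selection interval equals the total measure of size divided by the number of units to be selected, that the random start lies in the half-open interval from zero (exclusive) to the selection interval (inclusive), and that successive selection numbers are obtained by repeatedly adding the interval to the start. The function name, the sizes, and the fixed start value are hypothetical.

```python
import bisect
import random


def systematic_pps(sizes, m, start):
    """Select m list positions by systematic sampling on cumulative sizes.

    sizes: measures of size of the listed units, in list order
    m:     number of units to select
    start: random start, assumed to lie in (0, interval]
    """
    interval = sum(sizes) / m
    if not 0 < start <= interval:
        raise ValueError("start must lie in the half-open interval (0, interval]")

    # Cumulative measure of size for each listed unit.
    cumulative = []
    running = 0
    for size in sizes:
        running += size
        cumulative.append(running)

    selected = []
    for t in range(m):
        number = start + t * interval
        # First unit whose cumulative size is >= the selection number.
        position = bisect.bisect_left(cumulative, number)
        # Guard against floating-point overshoot on the last selection number.
        selected.append(min(position, len(sizes) - 1))
    return selected


# Hypothetical measures of size for eight listed sample units.
sizes = [2, 3, 1, 4, 2, 4, 3, 1]
chosen = systematic_pps(sizes, m=4, start=2.5)

# A random start in (0, interval] can be drawn as follows.
interval = sum(sizes) / 4
random_start = interval * (1.0 - random.random())
```

With these sizes the cumulative totals are 2, 5, 6, 10, 12, 16, 19, and 20, the interval is 5, and the selection numbers 2.5, 7.5, 12.5, and 17.5 pick the units at list positions 1, 3, 5, and 6 (counting from zero). Repeating the procedure with different starts yields different systematic samples.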

With the help of this approach, an unbiased estimator of the population total is given as where and are the observations from the tth systematic sample and and are their corresponding inclusion probabilities. An estimator of the variance of is given as where is an approximate finite population correction factor.

The proposed procedure of variance estimation can be applied to square as well as rectangular populations and works equally well when the numbers of units selected from each row and column are not fixed and equal. When the nonnegativity condition of the Yates-Grundy form of the Horvitz-Thompson variance estimator is not satisfied, we can apply the variance estimator given in (8). The proposed variance estimator is always nonnegative, as it involves only a sum of squared quantities. We consider some examples to show the utility of the proposed variance estimator and compare it with Jessen’s split sample estimator and the estimators suggested by Tiwari and Nigam [13, 15].

Example 4. Let us consider a 3 × 3 population borrowed from Jessen [7], shown in Table 13. Values of () obtained by Jessen’s split sample estimator (to be denoted by S-S), the estimator given by Tiwari and Nigam [13] (to be denoted by TN-1), Tiwari and Nigam [15] (to be denoted by TN-2), and the proposed estimator are shown in Table 14.
The actual value of for this population is 123/20. From Table 13, we have Thus is an unbiased estimate of . The expected value of for proposed estimator is
The true value of for this population is 0.0581, which shows that the proposed estimator is positively biased. The bias of the proposed estimator is lowest among the four estimators, showing that the proposed estimator performs better than the previous estimators.

Example 5. To further evaluate the utility of the proposed variance estimator, we consider a 4 × 4 population borrowed from Jessen [24], shown in Table 15. A sample of size 8 is to be drawn from this population. The values of for the four estimators and selection probabilities of all twenty possible samples are presented in Table 16.
From Table 15, we get Thus is an unbiased estimate of . The expected value of for the proposed estimator is
The true value of () for this population is 0.24375, which shows that the proposed estimator is positively biased. The bias is lowest for the proposed estimator among the four estimators considered by us.
The outcomes of the above two examples show that the proposed variance estimator performs better than the estimators suggested by Jessen [18], Tiwari and Nigam [13], and Tiwari and Nigam [15]. The bias is minimum for the proposed estimator and it also performs favourably in the situations where the estimators of Jessen [18] and Tiwari and Nigam [13] cannot be applied.

6. Conclusion

In this paper, we have proposed a simple linear programming approach that uses a distance measure as a weight for each sample to obtain an optimum solution in two-way controlled selection problems. In the proposed plan, we have introduced one more constraint in the linear programming problem to ensure zero probability to nonpreferred samples. The proposed procedure is quite simple and flexible to implement. We have also proposed a new strategy for the estimation of variance in two-way controlled sampling designs. The proposed estimator appears to perform better than the earlier estimators for two-dimensional controlled sampling suggested by different researchers. The proposed procedure takes less computing time than the procedures of Tiwari and Nigam [13] and Tiwari and Nigam [15] and is found to be more advantageous than these plans.

Appendix

Example A.1. Consider a 4 × 3 hypothetical population, borrowed from Tiwari and Nigam [15] and given in Table 17, with population size 12 and sample size 8. The samples in which all three units 1st, 5th, and 9th or 3rd, 5th, and 7th do not appear are considered undesired samples.

Example A.2. Consider a 3 × 3 hypothetical population with and , borrowed from Tiwari and Nigam [15]. The proposed sample cell counts () for this population are given in Table 18. The samples in which all three elements 1st, 5th, and 9th appear together are considered undesired samples.

Example A.3. Let us consider an 8 × 3 population consisting of 24 elements, borrowed from Causey et al. [9], from which a sample of size 10 is to be drawn. The basic data for this population are shown in Table 19. The samples having two consecutive elements in a column are treated as undesired samples.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.