Research Article  Open Access
Neeraj Tiwari, Akhil Chilwal, "A Simplified Approach for Two-Dimensional Optimal Controlled Sampling Designs", Advances in Statistics, vol. 2014, Article ID 875352, 10 pages, 2014. https://doi.org/10.1155/2014/875352
A Simplified Approach for Two-Dimensional Optimal Controlled Sampling Designs
Abstract
Controlled sampling is a unique method of sample selection that minimizes the probability of selecting nondesirable combinations of units. Extending the concept of linear programming with an effective distance measure, we propose a simple method for two-dimensional optimal controlled selection that ensures zero probability to nondesired samples. Alternative estimators for the population total and its variance have also been suggested. Some numerical examples have been considered to demonstrate the utility of the proposed procedure in comparison to the existing procedures.
1. Introduction
Goodman and Kish [1] introduced controlled sampling as a method of sample selection that increases the probability of desired samples. Controlled sampling may be described as a technique of sampling from a finite universe, which allows multiple stratifications beyond what is possible by stratified random sampling. There often arises a situation where some combinations of units may be less beneficial or even undesirable to include in the sample due to considerations such as distance, similarity of units, and cost. The samples having undesirable combinations of units are known as nonpreferred or undesirable samples. Using the technique of controlled selection, one can exclude the possibility of including undesirable combinations of units in the sample or assign them the minimum probability of selection. This results in an increase in the selection probability of preferred samples.
The controlled sampling technique can be effectively used in two or more dimensions. Generally, researchers face multidimensional sampling problems in social research where various variables are involved in the population, requiring stratification in more than one dimension. The need for multidimensional stratification in various real-life situations was discussed by Bryant [2], Hess and Srikantan [3], Moore et al. [4], and Jessen [5]. Jessen [5] considered a problem with 12 geographical areas and 12 income classes, resulting in a total of 144 strata cells, out of which only 24 cells were to be selected. In such situations, the researcher requires stratification techniques that permit fewer cells to be selected than the total number of strata cells permitted under stratified sampling, without sacrificing the requirements of probability sampling. This is known as controls beyond stratification.
Goodman and Kish [1] were the first to address this problem under the name of two-dimensional controlled selection but did not provide any general method to solve such problems. Hess and Srikantan [3] and Groves and Hess [6] discussed the multidimensional controlled selection problem for hospital data in the US and presented a formal algorithm for obtaining solutions to the two-dimensional and three-dimensional problems. However, there are simple examples where their algorithm fails, even for two-dimensional problems. To select the set of feasible samples, Jessen [7] proposed two methods for two-way and three-way stratification, but both methods are quite complicated to implement, involve a lot of trial and error, and sometimes even fail to provide a solution. Ernst [8] was the first to present a constructive solution for two-dimensional controlled selection problems, but his procedure is quite cumbersome.
Causey et al. [9] proposed an algorithm based on transportation theory to solve two-dimensional controlled selection problems, which is efficient but complex to implement. Inspired by the idea of Rao and Nigam [10, 11], Sitter and Skinner [12] proposed a linear programming approach to solve multidimensional controlled selection problems. Tiwari and Nigam [13] solved the two-dimensional optimal controlled selection problems with controls beyond stratification using the simplex method in linear programming. The procedure of Tiwari and Nigam [13] is best suited to problems with integer marginals, while the method of Sitter and Skinner [12] is best suited to noninteger marginals. Extending the linear programming idea of Sitter and Skinner [12], Lu and Sitter [14] discussed some methods to reduce the amount of computation so that very large problems become feasible using the linear programming approach. Tiwari and Nigam [15] applied the idea of the nearest proportional to size sampling design to two-dimensional optimal controlled sampling problems using quadratic programming, to introduce a sampling design which ensures zero probability to nonpreferred samples. The procedure of Tiwari and Nigam [15] is efficient but quite cumbersome: before applying the idea of the nearest proportional to size design to obtain the desired controlled inclusion probability proportional to size (IPPS) design, one must first obtain an appropriate uncontrolled IPPS design and then define a non-IPPS design which totally avoids the nonpreferred samples to make their probabilities zero. In this paper, we introduce an effective distance measure as the objective function and a new constraint in the linear programming problem to propose a very simple and effective method for two-dimensional controlled sampling which fully excludes the undesirable samples. The proposed procedure appears to perform better than the earlier two-dimensional controlled selection procedures, as it ensures zero probability to undesirable samples without complicating the implementation process.
Another problem that needs attention is that of variance estimation in multidimensional controlled selection designs. For one-dimensional controlled selection problems, the Horvitz and Thompson [16] estimator is best suited, as the stability and nonnegativity conditions of the Yates-Grundy [17] form of the Horvitz-Thompson [16] variance estimator are satisfied for such designs. However, as observed by Tiwari and Nigam [15], two-dimensional controlled selection problems do not satisfy these conditions, which creates the need for alternative estimators. To overcome this difficulty, Jessen [18], Tiwari and Nigam [13], and Tiwari and Nigam [15] suggested alternative variance estimation procedures using the “split sample,” “half sample,” and “random group” methods, respectively. In this paper, we propose a systematic method for estimation of the population total and its variance for two-dimensional controlled selection problems. The proposed variance estimator appears to perform better than the existing estimators in terms of bias. We demonstrate its utility with the help of some examples.
2. The Basic Notations and Preliminaries
Let us consider a two-dimensional population array $A$ of $N$ units, consisting of $R \times C$ cells that have real numbers $a_{ij}$, $i = 1, \ldots, R$, $j = 1, \ldots, C$. Suppose a sample of size $n$ is to be obtained from this population. Let $Y$ be the characteristic under study, $Y_{ij}$ the $y$-value for the $ij$th unit in the population, and $y_l$ the $y$-value for the $l$th unit in the sample. Let $B_k$, $k = 1, 2, \ldots$, denote the $k$th possible sample. Also let $b_{ijk}$ be each internal entry of $B_k$. Then $b_{ijk}$ equals either $[a_{ij}]$ or $[a_{ij}] + 1$, where $[a_{ij}]$ is the integer part of $a_{ij}$. We have to consider a set of samples with selection probabilities that satisfy the constraints

$$E(b_{ijk}) = \sum_{B_k \in S} b_{ijk}\, p(B_k) = a_{ij} \quad \text{for all } i, j, \tag{1}$$

where $S$ is the set of all possible samples and $p(B_k)$ is the selection probability of each sample $B_k$.
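The sample space described above can be listed by brute force for a small array. The following is a minimal sketch, not the authors' implementation: it assumes integer row and column totals, and the 3 × 3 array with every entry equal to 1/3 and the function name `feasible_samples` are hypothetical choices made only for illustration.

```python
from fractions import Fraction
from itertools import product
from math import floor


def feasible_samples(a):
    """List every array B whose entries are the integer part of a_ij or that
    value plus one, and whose row and column totals equal those of a.
    The margins of a are assumed to be integers."""
    n_rows, n_cols = len(a), len(a[0])
    row_totals = [sum(row) for row in a]
    col_totals = [sum(a[i][j] for i in range(n_rows)) for j in range(n_cols)]

    # A cell with an integer entry is fixed; any other cell rounds down or up.
    choices = []
    for row in a:
        for value in row:
            low = floor(value)
            choices.append((low,) if value == low else (low, low + 1))

    samples = []
    for flat in product(*choices):
        b = [list(flat[i * n_cols:(i + 1) * n_cols]) for i in range(n_rows)]
        rows_ok = [sum(row) for row in b] == row_totals
        cols_ok = [sum(b[i][j] for i in range(n_rows))
                   for j in range(n_cols)] == col_totals
        if rows_ok and cols_ok:
            samples.append(b)
    return samples


# Hypothetical 3 x 3 array: every cell has expected count 1/3, so n = 3.
third = Fraction(1, 3)
a = [[third, third, third] for _ in range(3)]
samples = feasible_samples(a)
print(len(samples))  # 6: the feasible samples are the 3 x 3 permutation matrices
```

Exact fractions are used so that the margin comparisons are not affected by floating-point rounding. Full enumeration grows quickly with the size of the array, so this approach is only practical for small problems.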
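There can be many sets of probability distributions satisfying (1), although only one set of probabilities can be used to obtain a solution of the two-dimensional controlled selection problem. We may consider an algorithm based on an appropriate and objective principle to find the solution that reflects the closeness of each sample $B_k$ to the population array $A$. For this purpose we consider the following measures of closeness between $A = (a_{ij})$ and $B_k = (b_{ijk})$.

The first is the ordinary distance, often called the Euclidean distance, given as

$$d_1(A, B_k) = \sqrt{\sum_{i} \sum_{j} \left(a_{ij} - b_{ijk}\right)^2}. \tag{2}$$

It is the most common measure of the closeness between $A$ and $B_k$, as it is easy to calculate.

Two other distance measures can also be used to define the distance between $A$ and $B_k$. These are

(i) the cosine distance function:

$$d_2(A, B_k) = 1 - \frac{\sum_{i} \sum_{j} a_{ij}\, b_{ijk}}{\sqrt{\sum_{i} \sum_{j} a_{ij}^2}\, \sqrt{\sum_{i} \sum_{j} b_{ijk}^2}}; \tag{3}$$

(ii) the Bray-Curtis distance function:

$$d_3(A, B_k) = \frac{\sum_{i} \sum_{j} \left|a_{ij} - b_{ijk}\right|}{\sum_{i} \sum_{j} \left(a_{ij} + b_{ijk}\right)}. \tag{4}$$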
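As an illustration, the three distance measures named above can be computed for one array and one candidate sample as follows. This is a sketch that uses the standard textbook definitions of the Euclidean, cosine, and Bray-Curtis distances, which are assumed to match the ones intended here; the 2 × 2 arrays and the helper names are hypothetical.

```python
from math import sqrt


def flatten(matrix):
    """Turn a two-way array into a flat list of cell values."""
    return [value for row in matrix for value in row]


def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    # One minus the cosine of the angle between the two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def bray_curtis(a, b):
    # Assumes nonnegative entries, as is the case for expected cell counts.
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(x + y for x, y in zip(a, b))


# Hypothetical expected cell counts and one candidate sample.
array_a = flatten([[0.5, 0.5], [0.5, 0.5]])
sample_b = flatten([[1, 0], [0, 1]])

d_euclid = euclidean(array_a, sample_b)
d_cosine = cosine_distance(array_a, sample_b)
d_bray = bray_curtis(array_a, sample_b)
print(d_euclid, d_cosine, d_bray)
```

In a linear programming formulation these distances enter only as fixed coefficients, one per candidate sample, so they can be computed once before the optimization step.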
Huang [19] and Khatri [20] compared all the above distance measures in their studies and found that the cosine distance function works well in comparison to the other distance functions. Huang [19] evaluated different distance measures empirically using seven data sets, and the results indicated that the cosine distance function performs reasonably well. We have also applied the three distance functions (2), (3), and (4) to all the controlled sampling problems considered by us and found that the distance function given in (3) provides minimum bias, which supports the findings of Huang [19] and Khatri [20]. In view of the above observations, we have decided to use the cosine distance function (3) as the distance measure in this paper. We have used the OPTMODEL procedure in SAS 9.3 to solve the linear programming problem and the “pdist2” (pairwise distance between two sets of observations) function in MATLAB 10.0 to compute the three distance functions.
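3. The Proposed Two-Dimensional Optimal Controlled Sampling Plan
Let $S_1$ denote the set of undesired samples, that is, the samples containing the undesired combinations of units. The required set of samples is obtained through the solution of the following linear programming problem.
Minimize the objective function

$$\phi = \sum_{B_k \in S} d(A, B_k)\, p(B_k), \tag{5}$$

where $d(A, B_k)$ is the chosen distance between the population array $A = (a_{ij})$ and the sample $B_k = (b_{ijk})$, $p(B_k)$ is the selection probability of $B_k$, and $S$ is the set of all possible samples,
subject to the following constraints:

$$\begin{aligned}
&\text{(i)} \quad p(B_k) \ge 0 \quad \text{for all } B_k \in S, \\
&\text{(ii)} \quad \sum_{B_k \in S} p(B_k) = 1, \\
&\text{(iii)} \quad \sum_{B_k \in S} b_{ijk}\, p(B_k) = a_{ij} \quad \text{for all } i, j, \\
&\text{(iv)} \quad \sum_{B_k \in S_1} p(B_k) = 0.
\end{aligned} \tag{6}$$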
The constraints (i) and (ii) in (6) are necessary for any sampling design and the constraint (iii) assures that the resultant design is an IPPS design. The constraint (iv) ensures that the probabilities of undesired samples are equal to zero. We also tried to add one more constraint to (6), namely $\pi_u \pi_v - \pi_{uv} \ge 0$ for every pair of units $u \ne v$, where $\pi_u$ and $\pi_{uv}$ denote the first-order and second-order inclusion probabilities, to ensure the nonnegativity of the Yates-Grundy form of the Horvitz-Thompson variance estimator, and applied it to all the two-dimensional controlled selection problems considered by us. However, in no case did it yield a solution. Consequently, we dropped the idea of adding this constraint and suggest an alternative procedure for variance estimation.
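The role of the four constraints can be illustrated by checking candidate designs against them. The sketch below is not a solver and not the authors' code: it assumes the constraints take the forms described in the preceding paragraph (nonnegative probabilities, probabilities summing to one, expected cell counts equal to the array entries, and zero probability on undesired samples). The 3 × 3 array with all entries 1/3, the rule that a sample is undesired when all three diagonal cells appear together, and the name `check_design` are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

third = Fraction(1, 3)
a = [[third] * 3 for _ in range(3)]  # hypothetical expected cell counts

# With unit margins the feasible samples are the six 3 x 3 permutation matrices.
samples = [[[1 if perm[i] == j else 0 for j in range(3)] for i in range(3)]
           for perm in permutations(range(3))]


def is_undesired(b):
    """Hypothetical control: all three diagonal cells selected together."""
    return all(b[i][i] == 1 for i in range(3))


def check_design(p, samples, a, is_undesired):
    """Report which of the constraints (i) to (iv) a design p satisfies."""
    cells = [(i, j) for i in range(len(a)) for j in range(len(a[0]))]
    return {
        "nonnegative": all(pk >= 0 for pk in p),                      # (i)
        "sums_to_one": sum(p) == 1,                                   # (ii)
        "ipps": all(sum(pk * b[i][j] for pk, b in zip(p, samples)) == a[i][j]
                    for i, j in cells),                               # (iii)
        "zero_on_undesired": all(pk == 0 for pk, b in zip(p, samples)
                                 if is_undesired(b)),                 # (iv)
    }


# Uncontrolled design: equal probability on all six samples.
uniform = [Fraction(1, 6)] * 6
# Controlled design: probability 1/3 on each sample with exactly one diagonal cell.
controlled = [third if sum(b[i][i] for i in range(3)) == 1 else Fraction(0)
              for b in samples]

uniform_report = check_design(uniform, samples, a, is_undesired)
controlled_report = check_design(controlled, samples, a, is_undesired)
print(uniform_report)
print(controlled_report)
```

In this small case the uniform design meets constraints (i) to (iii) but gives positive probability to the undesired sample, while the controlled design meets all four. A linear programming solver would search among all designs of the second kind for the one with the smallest objective value.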
The solution of the linear programming problem, namely, minimization of (5) subject to the constraints (6), using the “pdist2” (pairwise distance between two sets of observations) function in MATLAB 10.0 and the OPTMODEL procedure in SAS 9.3, provides an optimal controlled IPPS sampling plan that ensures zero probability of selection for the undesired samples. The proposed strategy also provides an opportunity to add more constraints to the controlled selection problem. The proposed plan performs better than the plans of Sitter and Skinner [12] and Tiwari and Nigam [13] in the sense that these plans only attempt to minimize the selection probabilities of the nonpreferred samples, whereas the proposed plan ensures zero probability to nonpreferred samples through constraint (iv) in (6). The exclusion of nonpreferred samples was also attempted by Tiwari and Nigam [15] for two-dimensional controlled selection problems, using the idea of the nearest proportional to size design. However, their procedure is quite lengthy and tedious, as an uncontrolled IPPS design must first be constructed manually and the required controlled IPPS design is then obtained using the quadratic programming approach. The same advantage has been achieved in the proposed plan in a very simple manner by just adding one more constraint to the linear programming problem, ensuring zero probability to nonpreferred samples. The implementation of the proposed design is very simple in comparison to the earlier designs. One limitation of the proposed design is that it becomes impractical when the set of all possible samples is very large, as the process of enumerating all possible samples and forming the objective function and constraints becomes quite tedious. This limitation also holds for the optimum approaches of Sitter and Skinner [12], Tiwari and Nigam [13], and Tiwari and Nigam [15]. However, with the help of faster computing techniques and modern statistical tools, there may not be much difficulty in using the proposed plan for moderately large populations. Nevertheless, the proposed procedure takes less computing time in comparison to the procedures of Sitter and Skinner [12], Tiwari and Nigam [13], and Tiwari and Nigam [15]. In what follows, we show the utility of the proposed procedure with the help of some numerical examples.
4. Empirical Evaluation
In this section, we will present some numerical examples to demonstrate the utility of the proposed procedure and compare it with the existing procedures of optimal controlled sampling designs.
Example 1. Let us consider a 4 × 3 hypothetical population borrowed from Tiwari and Nigam [15], given in Table 1. The desired sample size is less than the total number of cells, 12. The set of all possible samples consists of 12 samples, given in Table 2. Let the set of undesirable samples consist of those samples that do not contain all the three elements 1st, 5th, and 9th or 3rd, 5th, and 7th. Thus sample numbers 6 and 9 are the nonpreferred samples.
Applying the plan of Tiwari and Nigam [13] (to be denoted by TN1), the plan of Tiwari and Nigam [15] (to be denoted by TN2), and the proposed plan discussed in Section 3 to this population, we get the selection probabilities of the samples as shown in Table 3. For this example we find that the probability of nonpreferred samples for the Tiwari and Nigam [13] plan is 0.1, whereas the proposed plan always assures zero probability to nonpreferred samples.


 
Undesirable sample. 
Example 2. Now let us consider another hypothetical example borrowed from Bryant et al. [21], given in Table 4. The desired sample of size 10 is less than the total number of units, 15. The integer parts of the cell entries are known as the “certainty proportion.” To obtain the set of feasible samples, we initially remove the certainty proportions and restore them to their original positions after getting these samples. After removing the certainty proportions, we get the two-way array shown in Table 5.
After subtracting the certainty proportions, the problem is reduced to selecting 6 units from the array. The set of all possible samples consists of ${}^{15}C_{6} = 5005$ samples, out of which 4989 samples do not satisfy the marginal constraints of the 5 × 3 population. Thus, the set of samples satisfying the marginal constraints has only 16 samples, given in Table 6. Now we suppose the situation of controls beyond stratification. Based on considerations similar to those of Avadhani and Sukhatme [22], Tiwari and Nigam [13], and Tiwari and Nigam [15], we consider that if all three units 4th, 8th, and 12th or 6th, 8th, and 10th do not appear in a sample, then the sample is a nonpreferred sample. Thus the set of all preferred samples consists of only 10 samples, that is, sample numbers 1, 3, 5, 7, 9, 10, 11, 13, 14, and 16. Applying the proposed plan and the plans of Tiwari and Nigam [13] (TN1) and Tiwari and Nigam [15] (TN2) to the modified problem, we get the selection probabilities shown in Table 7.
For this example, the probability of undesired samples is zero for the proposed plan and the Tiwari and Nigam [15] plan. The proposed plan again ensures zero probability to undesirable samples, whereas the plan of Tiwari and Nigam [13] only attempts to minimize the probability of undesirable samples.




Example 3. Let us consider a real-life application, borrowed from Tiwari and Nigam [15], where two-dimensional stratification is required in plot sampling in field experiments. Consider the yield (in tons) of wheat given in Table 8 for an experiment involving 4 blocks (B1, B2, B3, and B4) and 4 treatments (T1, T2, T3, and T4). The integer parts of the cell entries are known as the “certainty proportion.” To obtain the set of feasible samples, we initially remove the certainty proportions and restore them to their original positions after getting these samples. After removing the certainty proportions, we get the two-way array shown in Table 9.
After subtracting the certainty proportions, the problem is reduced to selecting 8 units from the array. The set of all possible samples consists of ${}^{16}C_{8} = 12870$ samples, out of which 12780 samples do not satisfy the marginal constraints of the 4 × 4 population. Thus, the set of samples satisfying the marginal constraints has only 90 samples. Now we suppose the situation of controls beyond stratification. Based on considerations similar to those of Tiwari and Nigam [15], we consider that if three or more diagonal units appear in a sample, then the sample is a nonpreferred sample. Thus, the set of all preferred samples consists of only 33 samples, shown in Table 10. Applying the proposed plan and the Tiwari and Nigam [15] (TN2) plan to the modified problem, we get the selection probabilities shown in Table 11. For this example, the probability of undesired samples is zero for the proposed and Tiwari and Nigam [15] plans.
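The sample counts quoted in Examples 2 and 3 can be checked with a short calculation. The sketch below uses only the figures stated in the text; the variable names are illustrative.

```python
from math import comb

# Example 2: 6 units chosen from 15 cells, 4989 combinations violate the margins.
total_ex2 = comb(15, 6)
feasible_ex2 = total_ex2 - 4989

# Example 3: 8 units chosen from 16 cells, 12780 combinations violate the margins.
total_ex3 = comb(16, 8)
feasible_ex3 = total_ex3 - 12780

print(total_ex2, feasible_ex2)  # 5005 16
print(total_ex3, feasible_ex3)  # 12870 90
```

Both results agree with the 16 and 90 feasible samples reported in the two examples.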
Some other examples are also considered to analyse the performance of the proposed plan. Details of these examples are given in the Appendix. The probabilities of selecting the undesirable samples for the proposed plan and for the plans of Tiwari and Nigam [13] and Tiwari and Nigam [15] are given in Table 12. Table 12 again shows that while the plan of Tiwari and Nigam [13] only attempts to minimize the probability of undesirable samples, the proposed plan always ensures zero probability to undesirable samples. The Tiwari and Nigam [15] plan also gives zero probability to nonpreferred samples but, as discussed earlier, it is quite difficult to implement.




5. Variance Estimation for the Proposed Procedure
Jessen [18] suggested the split sample estimator as an alternative to the Horvitz-Thompson (HT) estimator. This estimator also works in the situation where the nonnegativity condition of the Yates-Grundy form of the HT variance estimator is not satisfied. Jessen’s split sample estimator is negatively biased and the bias is found to be quite high. Using the half sample method, Tiwari and Nigam [13] introduced a method of variance estimation for two-dimensional controlled selection problems. Their variance estimator was found to be positively biased and the bias was low in comparison to Jessen’s split sample estimator. An important limitation of both estimators is that they require exactly two units from each row and column of the two-way array and cannot be applied otherwise. Using the idea of random groups, Tiwari and Nigam [15] introduced an alternative estimator for the population total and its variance that can be used even when two units are not available from each row and column of the two-way array.
Using the procedure of systematic sampling for variance estimation, originally developed by W. G. Madow and L. H. Madow [23], we propose an alternative estimation procedure for the population total and its variance in two-dimensional controlled selection problems. The proposed estimator performs better than the split sample estimator of Jessen [18] and the estimators proposed by Tiwari and Nigam [13] and Tiwari and Nigam [15] in terms of bias. The proposed procedure can also be used in situations where exactly two units are not available from each row and column of the two-way array. The proposed approach is as follows.
To construct systematic samples from a sample of size $n$ drawn from a population of $N$ units, we first arrange all sample units in a list: they can be placed at random in the list, placed in a particular sequence, or left in the sequence in which they naturally occur.
Let $y_i$ be the $y$-value for the $i$th unit in the sample and let $x_i$ be its measure of size. Next, a cumulative measure of size, $M_i$, is calculated for each sample unit; that is, $M_i = \sum_{j \le i} x_j$. To select a systematic sample of $m$ units, a selection interval, say, $I$, is calculated as the total of all measures of size divided by $m$; that is, $I = M_n / m$. The selection interval is not necessarily an integer but is typically rounded off to two or three decimal places. To initiate the sample selection process, a uniform random deviate, say, $r$, is chosen on the half-open interval $(0, I]$. The selection numbers for the sample are then $r, r + I, r + 2I, \ldots, r + (m - 1)I$. The sample unit identified for the systematic sample by each selection number is the first unit on the list for which the cumulative size, $M_i$, is greater than or equal to the selection number. With the help of this procedure the $n$ units of the sample can be divided into systematic samples. Different choices in this procedure give different systematic samples, and the value of the proposed estimator depends on them. However, it has been found that the proposed estimator works satisfactorily in all the situations.
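One pass of the selection rule described above can be sketched as follows. This is an illustration only: the function name `systematic_select`, its arguments, and the size measures in the usage example are hypothetical, and the random start is passed in explicitly so that the result is reproducible. How the remaining units are grouped into further systematic samples is not shown.

```python
from itertools import accumulate


def systematic_select(sizes, m, start):
    """Return the 0-based list positions of the m units picked by the
    selection numbers start, start + I, ..., start + (m - 1) * I,
    where I is the total measure of size divided by m."""
    cumulative = list(accumulate(sizes))
    interval = cumulative[-1] / m
    if not 0 < start <= interval:
        raise ValueError("start must lie in the half-open interval (0, I]")

    picks = []
    for t in range(m):
        target = start + t * interval
        # First unit whose cumulative size reaches the selection number;
        # fall back to the last unit if rounding pushes the target past the total.
        position = next((i for i, c in enumerate(cumulative) if c >= target),
                        len(cumulative) - 1)
        picks.append(position)
    return picks


# Hypothetical measures of size for five listed units (total 12, so I = 4 for m = 3).
sizes = [2, 3, 1, 4, 2]
picks = systematic_select(sizes, m=3, start=1.5)
print(picks)  # [0, 2, 3]: selection numbers 1.5, 5.5, 9.5 against 2, 5, 6, 10, 12
```

In practice the start would be drawn at random, for example with `random.uniform`, taking care that the value falls in the half-open interval described in the text.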
With the help of this approach, an unbiased estimator of population total is given as where and are the observation from the tth systematic sample and and are their corresponding inclusion probabilities. An estimator of the variance of is given as where is an approximate finite population correction factor.
The proposed procedure of variance estimation can be applied to square as well as rectangular populations and works equally well even when the numbers of units selected from each row and column are not fixed and equal. When the nonnegativity condition of the Yates-Grundy form of the Horvitz-Thompson variance estimator is not satisfied, we can apply the variance estimator given in (8). The proposed variance estimator is always nonnegative, as it involves only the sum of squared quantities. We consider some examples to show the utility of the proposed variance estimator and compare it with Jessen’s split sample estimator and the estimators suggested by Tiwari and Nigam [13, 15].
Example 4. Let us consider a 3 × 3 population borrowed from Jessen [7], shown in Table 13. Values of () obtained by Jessen’s split sample estimator (to be denoted by SS), the estimator given by Tiwari and Nigam [13] (to be denoted by TN1), Tiwari and Nigam [15] (to be denoted by TN2), and the proposed estimator are shown in Table 14.
The actual value of for this population is 123/20. From Table 13, we have
Thus is an unbiased estimate of . The expected value of for proposed estimator is
The true value of for this population is 0.0581, which shows that the proposed estimator is positively biased. The bias of the proposed estimator is lowest among the four estimators, showing that the proposed estimator performs better than the previous estimators.


Example 5. To further evaluate the utility of the proposed variance estimator, we consider a 4 × 4 population borrowed from Jessen [24], shown in Table 15. A sample of size 8 is to be drawn from this population. The values of for the four estimators and selection probabilities of all twenty possible samples are presented in Table 16.
From Table 15, we get
Thus is an unbiased estimate of . The expected value of for the proposed estimator is
The true value of () for this population is 0.24375, which shows that the proposed estimator is positively biased. The bias is lowest for the proposed estimator among the four estimators considered by us.
The outcomes of the above two examples show that the proposed variance estimator performs better than the estimators suggested by Jessen [18], Tiwari and Nigam [13], and Tiwari and Nigam [15]. The bias is minimum for the proposed estimator and it also performs favourably in the situations where the estimators of Jessen [18] and Tiwari and Nigam [13] cannot be applied.


6. Conclusion
In this paper, we have proposed a simple linear programming approach using a distance measure as a weight for each sample to obtain an optimum solution in two-way controlled selection problems. In the proposed plan, we have introduced one more constraint in the linear programming problem to ensure zero probability to nonpreferred samples. The proposed procedure is quite simple and flexible to implement. We have also proposed a new strategy for the estimation of variance in two-way controlled sampling designs. The proposed estimator appears to perform better than the earlier estimators for two-dimensional controlled sampling suggested by different researchers. The proposed procedure takes less computing time in comparison to the procedures of Tiwari and Nigam [13] and Tiwari and Nigam [15] and is found to be more advantageous than these plans.
Appendix
Example A.1. Consider a 4 × 3 hypothetical population, borrowed from Tiwari and Nigam [15] and given in Table 17, with population size 12 and sample size 8. The samples in which all three units 1st, 5th, and 9th or 3rd, 5th, and 7th do not appear are considered undesired samples.

Example A.2. Consider a 3 × 3 hypothetical population with and , borrowed from Tiwari and Nigam [15]. The proposed sample cell counts () for this population are given in Table 18. The samples in which all the three elements 1st, 5th, and 9th appear together are considered undesired samples.

Example A.3. Let us consider an 8 × 3 population, borrowed from Causey et al. [9], consisting of 24 elements, from which a sample of size 10 is to be drawn. The basic data for this population are shown in Table 19. The samples having two consecutive elements in a column are treated as undesired samples.

Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
References
[1] R. Goodman and L. Kish, “Controlled selection - a technique in probability sampling,” Journal of the American Statistical Association, vol. 45, no. 251, pp. 350–372, 1950.
[2] E. C. Bryant, “Sampling methods,” Seminar Paper, Iowa State University, 1961.
[3] I. Hess and K. S. Srikantan, “Some aspects of the probability sampling technique of controlled selection,” Health Services Research, vol. 1, no. 1, pp. 8–52, 1966.
[4] R. P. Moore, J. R. Chromy, and W. T. Rogers, The National Assessment Approach to Sampling, National Assessment of Educational Progress, Denver, Colo, USA, 1974.
[5] R. J. Jessen, “Square and cubic lattice sampling,” Biometrics, vol. 31, no. 2, pp. 449–471, 1975.
[6] R. M. Groves and I. Hess, “An algorithm for controlled selection,” in Probability Sampling of Hospitals and Patients, I. Hess, D. C. Ridel, and T. B. Fitzpatrick, Eds., chapter 7, Health Administration Press, Ann Arbor, Mich, USA, 2nd edition, 1975.
[7] R. J. Jessen, “Probability sampling with marginal constraints,” Journal of the American Statistical Association, vol. 65, no. 330, pp. 776–796, 1970.
[8] L. R. Ernst, “A constructive solution for two-dimensional controlled selection problems,” in Proceeding of the Survey Research Methodology Section, American Statistical Association, 1981.
[9] B. D. Causey, L. H. Cox, and L. R. Ernst, “Applications of transportation theory to statistical problems,” Journal of the American Statistical Association, vol. 80, no. 392, pp. 903–909, 1985.
[10] J. N. K. Rao and A. K. Nigam, “Optimal controlled sampling designs,” Biometrika, vol. 77, no. 4, pp. 807–814, 1990.
[11] J. N. K. Rao and A. K. Nigam, “Optimal controlled sampling: a unified approach,” International Statistical Review, vol. 60, no. 1, pp. 89–98, 1992.
[12] R. R. Sitter and C. J. Skinner, “Multiway stratification by linear programming,” Survey Methodology, vol. 20, no. 1, pp. 65–73, 1994.
[13] N. Tiwari and A. K. Nigam, “On two-dimensional optimal controlled selection,” Journal of Statistical Planning and Inference, vol. 69, no. 1, pp. 89–100, 1998.
[14] W. Lu and R. R. Sitter, “Multiway stratification by linear programming made practical,” Survey Methodology, vol. 28, no. 2, pp. 199–207, 2002.
[15] N. Tiwari and A. K. Nigam, “On two-dimensional optimal controlled nearest proportional to size sampling designs,” Statistical Methodology, vol. 7, no. 6, pp. 601–613, 2010.
[16] D. G. Horvitz and D. J. Thompson, “A generalization of sampling without replacement from a finite universe,” Journal of the American Statistical Association, vol. 47, no. 260, pp. 663–685, 1952.
[17] F. Yates and P. M. Grundy, “Selection without replacement from within strata with probability proportional to size,” Journal of Royal Statistical Society, vol. 15, no. 2, pp. 253–261, 1953.
[18] R. J. Jessen, “Some properties of probability lattice sampling,” Journal of the American Statistical Association, vol. 68, no. 341, pp. 26–28, 1973.
[19] A. Huang, “Similarity measures for text document clustering,” in Proceeding of the New Zealand Computer Science Research Student Conference (NZCSRSC '08), pp. 49–56, Christchurch, New Zealand, April 2008.
[20] M. Khatri, “Cosine similarity function for the temporal dynamic web data,” International Journal of Computer Science & Engineering Technology, vol. 3, no. 8, pp. 315–318, 2012.
[21] E. C. Bryant, H. O. Hartley, and R. J. Jessen, “Design and estimation in two-way stratification,” Journal of the American Statistical Association, vol. 55, no. 289, pp. 105–124, 1960.
[22] M. S. Avadhani and B. V. Sukhatme, “Controlled sampling with equal probabilities and without replacement,” International Statistical Review, vol. 41, no. 2, pp. 175–182, 1973.
[23] W. G. Madow and L. H. Madow, “On the theory of systematic sampling,” The Annals of Mathematical Statistics, vol. 15, no. 1, pp. 1–24, 1944.
[24] R. J. Jessen, Statistical Survey Techniques, Wiley, New York, NY, USA, 1978.
Copyright
Copyright © 2014 Neeraj Tiwari and Akhil Chilwal. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.