On Quantiles Estimation Based on Stratified Sampling Using Multiplicative Bias Correction Approach

Makumi, Nicholas; Otieno, Romanus Odhiambo; Orwa, George Otieno; Habineza, Alexis

doi:https://doi.org/10.1155/2022/4530489

Journal of Mathematics

Research Article | Open Access

Volume 2022 | Article ID 4530489 | https://doi.org/10.1155/2022/4530489

Nicholas Makumi,¹,² Romanus Odhiambo Otieno,³ George Otieno Orwa,² and Alexis Habineza¹

Academic Editor: A. Ghareeb

Received 01 Mar 2022

Accepted 07 Apr 2022

Published 13 May 2022

Abstract

In the context of stratified sampling, we develop a nonparametric regression technique for estimating finite population quantiles in a model-based framework using a multiplicative bias correction strategy. The asymptotic behavior of the proposed estimator is presented; under certain conditions, the estimator is shown to be asymptotically unbiased and consistent. Simulation studies were conducted to assess the estimator's performance for the three quartiles of two simulated populations under varied distributional assumptions. Judged by relative bias, mean-squared error, and relative root-mean-squared error, the proposed estimator performs satisfactorily.

1. Introduction

Many of the activities conducted by official statistics institutes are based on surveys of a finite population that employ stratified random sampling without replacement. According to Thompson [1], stratified simple random sampling works as follows: the population is subdivided into mutually exclusive and exhaustive subgroups, or strata, each of which represents a known proportion of the entire population. The researcher selects units from each stratum into the sample using random sampling [2], and the results from these separate samples are combined appropriately to yield an estimate of a specified population parameter. These surveys gather information on three categories of variables: binary variables, categorical variables with more than two modalities, and continuous quantitative variables. Stratified sampling is an appropriate strategy when a population contains several internally homogeneous and mutually exclusive strata or subpopulations. Stratification can increase sample representativeness by reducing sampling error: the greater the difference between the strata, the greater the gain in accuracy. Furthermore, certain strata may be small but important to the study. In these circumstances, exhaustive sampling is advised, meaning that all individuals from these strata are included in the sample.

In sample surveys in agriculture, markets, industry, and social research, for example, multiple characteristics are typically observed on each selected unit of the population. For reasons of economy and efficiency, stratified random sampling is preferable to alternative survey designs for gathering information from a heterogeneous population. Stratified sampling theory is concerned with the desirable properties of estimators built from stratified simple random samples and with the optimal sample size to select from each stratum, either to maximize the precision of the estimators for a fixed cost or to minimize the survey cost for a fixed precision. When several characteristics are observed on each selected unit of a finite population, the sample allocation problem becomes much more challenging. An allocation that is optimal for one characteristic may not be optimal for the others unless the characteristics are strongly correlated.

Researchers are frequently interested in estimating cumulative distribution functions from analytical survey data. Sedransk and Sedransk [3] studied the suitability of using estimated cumulative distribution functions to compare patient treatment at radiation therapy centers, using a large nationwide survey of cancer patient medical data. Functions of the cumulative distribution function, such as quantiles and the interquartile range, are also of interest. The Bureau of Labor Statistics, for example, regularly publishes median salaries for wage and salary workers. The medians are derived from a stratified multistage subsample of the Current Population Survey. Although large-scale surveys almost always use some form of stratified cluster sampling, much of the research on quantile estimation for finite populations is limited to simple random sampling or stratified random sampling.

Simple random sampling (SRS) is appropriate mainly when the values of the variables do not vary much and the population is homogeneous. SRS is among the most basic sampling procedures, and it requires no auxiliary information. Furthermore, when a sample is drawn by SRS, sample weights are not required for analyzing the survey data using, for example, regression or multivariate analysis. A drawback of SRS is the difficulty of controlling precision and the inefficiency of ignoring auxiliary data, which can lead to unnecessarily large samples. Furthermore, because no auxiliary information is used, there is always the possibility of drawing an unrepresentative sample.

Stratification is widely used to improve the accuracy of estimates and to ensure that the sample within a survey region is adequately spread across subpopulations. It is sometimes a feature of designs used in soil surveys and soil science research. The population (e.g., the survey area) is divided into mutually exclusive classes, or strata, and samples are selected independently within each stratum. More reliable estimators are obtained when the variation of the characteristic of interest within each stratum is small compared with the variation between strata. In addition, if subpopulations of interest are identified by strata, an allocation scheme can be implemented to ensure that a sufficient number of sample units for making inferences on these subpopulations are located within each stratum.

A benefit of stratified sampling is that precision can be assessed within each stratum. Furthermore, practical features of response, measurement, and auxiliary information may change from one subgroup to the next, and this information can help stratify the population and increase efficiency. Geographic territories can also serve as strata for administrative purposes.

The simple random sampling (SRS) design is the most commonly used in the literature. To acquire a representative sample of the population of interest, a more organized sampling approach, such as stratified sampling or systematic sampling, might be used in practice. In many agricultural and environmental studies, as well as more recently in human populations and reliability analyses (see, e.g., Samawi and Al-Sagheer [4]), the actual measurement of a sampling unit might be more expensive than its physical acquisition. As a result, when all available sampling units contribute to the selection process but just a small fraction (experimental units) is used for actual quantification, significant cost reductions can be gained in survey sampling and experimental research. The stratified simple random sampling (SSRS) approach can be used to accomplish this.

In the literature, much emphasis has been placed on the problem of estimating the $p$th quantile. By definition, estimating the $p$th quantile requires an estimator of the cumulative distribution function (cdf). Although the empirical distribution function is the best-known nonparametric estimator of the cdf, it is a step function and thus not smooth.

Most contributions in the literature use simple random sampling (SRS) to estimate the $p$th quantile by means of kernel functions; see, for example, Nadaraya [5], Lio and Padgett [6], and Jones [7]. Some studies have also addressed the estimation of the $p$th quantile under the SSRS scheme. For example, Samawi et al. [8] developed an estimation technique for population quantiles based on stratified simple random sampling (SSRS) as well as stratified ranked set sampling (SRSS) using the empirical distribution function of a stratified population, and Eftekharian and Samawi [9] recently introduced kernel-based estimators of population quantiles based on SSRS and SRSS.
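To make the contrast between the step-function empirical cdf and a kernel-smoothed cdf concrete, the following is a minimal sketch of a kernel-smoothed distribution function and the quantile obtained by inverting it, for a simple random sample. It is an illustration only, not the estimator of any cited paper: the function names, the toy data, the bandwidth `h`, and the choice of the Epanechnikov kernel on $[-1, 1]$ are all assumptions made here.

```python
from typing import Sequence


def epanechnikov_cdf(u: float) -> float:
    """Integrated Epanechnikov kernel: integral of 0.75*(1 - v^2) from -1 to u."""
    if u <= -1.0:
        return 0.0
    if u >= 1.0:
        return 1.0
    return 0.5 + 0.75 * u - 0.25 * u ** 3


def smooth_cdf(t: float, ys: Sequence[float], h: float) -> float:
    """Kernel-smoothed cdf: average of integrated kernels centred at the data."""
    return sum(epanechnikov_cdf((t - y) / h) for y in ys) / len(ys)


def smooth_quantile(p: float, ys: Sequence[float], h: float, tol: float = 1e-10) -> float:
    """Invert the smoothed cdf by bisection (it is continuous and nondecreasing)."""
    lo, hi = min(ys) - h, max(ys) + h  # smoothed cdf is 0 at lo and 1 at hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if smooth_cdf(mid, ys, h) < p:
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical toy sample and bandwidth, for illustration only.
ys = [1.2, 2.3, 2.9, 3.1, 4.8, 5.0, 6.7]
h = 1.0
median_hat = smooth_quantile(0.5, ys, h)
```

Because the smoothed cdf is continuous, the estimated quantile solves the equation "smoothed cdf equals $p$" up to the bisection tolerance, which a step function cannot guarantee.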

It has long been recognized in kernel estimation that the smoothing bandwidth affects the performance of the method more than the choice of kernel. The vast majority of kernels in use are symmetric kernels of fixed shape. Such kernels work well for estimating curves with unbounded support but poorly for curves with compact support that are discontinuous at the boundary points. For curves of this sort, a fixed kernel shape causes boundary bias: when smoothing close to the boundary, the fixed symmetric kernel allocates weight outside the support of the distribution.

Boundary bias is a well-known problem, and various researchers have proposed methods to reduce it; the reader is referred to [10–16]. In this study, we present a nonparametric estimator for the quantile function of a finite population based on SSRS, addressing the problem of boundary bias in quantile estimation using the multiplicative bias-corrected technique described in [17]. This method has two distinguishing characteristics: first, it ensures a precise estimate, and second, it reduces the bias of the estimate while increasing the variance by only a negligible amount.
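The idea behind multiplicative bias correction can be sketched in its basic regression form: a pilot smoother is computed first, the ratios of the responses to the pilot fit are smoothed, and the pilot is multiplied by that smoothed ratio. The sketch below assumes a Nadaraya-Watson pilot, an Epanechnikov kernel on $[-1, 1]$, and strictly positive responses; the function names, grid, and bandwidth are hypothetical, and this is not the paper's quantile estimator.

```python
from typing import Sequence


def epanechnikov(u: float) -> float:
    """Epanechnikov kernel on [-1, 1] (assumed form)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def nw_smoother(x0: float, xs: Sequence[float], ys: Sequence[float], h: float) -> float:
    """Nadaraya-Watson estimate at x0: kernel-weighted average of ys."""
    weights = [epanechnikov((x0 - x) / h) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no observations within one bandwidth of x0")
    return sum(w * y for w, y in zip(weights, ys)) / total


def mbc_smoother(x0: float, xs: Sequence[float], ys: Sequence[float], h: float) -> float:
    """Pilot fit times the smoothed ratio of responses to the pilot fit."""
    pilot = [nw_smoother(x, xs, ys, h) for x in xs]   # pilot fit at each data point
    ratios = [y / m for y, m in zip(ys, pilot)]       # requires a nonzero pilot
    return nw_smoother(x0, xs, ys, h) * nw_smoother(x0, xs, ratios, h)


# Hypothetical noise-free example: m(x) = 1 + 2x on an equally spaced grid.
xs = [i / 50 for i in range(51)]
ys = [1.0 + 2.0 * x for x in xs]
h = 0.2
pilot_at_boundary = nw_smoother(0.0, xs, ys, h)   # overshoots m(0) = 1 at the left edge
mbc_at_boundary = mbc_smoother(0.0, xs, ys, h)    # closer to 1 in this example
```

In this noise-free example the pilot overshoots the true value at the left boundary, and the multiplicative correction pulls the fit back toward it, which is the behavior motivating the approach.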

2. Notation and Basic Concepts

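Let $U$ denote a finite population subdivided into $H$ strata, with $N_h$ being the known number of units in stratum $h$, so that

$$N = \sum_{h=1}^{H} N_h,$$

where $h = 1, 2, \ldots, H$. Assume that every unit in $U$ is associated with a unique value of the characteristic $y$. The numbering of the units in each stratum, $i = 1, 2, \ldots, N_h$, is considered to be independent of $y$. For $h = 1, \ldots, H$ and $i = 1, \ldots, N_h$, let $y_{hi}$ denote the value of $y$ associated with unit $hi$. Let $F$ be the distribution function of $y$ in the population:

$$F(t) = \frac{1}{N} \sum_{h=1}^{H} \sum_{i=1}^{N_h} I(y_{hi} \le t),$$

where

$$I(y_{hi} \le t) = \begin{cases} 1, & y_{hi} \le t, \\ 0, & \text{otherwise}. \end{cases}$$

Equivalently, the distribution function of $y$ in every stratum can be used to define $F$:

$$F(t) = \sum_{h=1}^{H} W_h F_h(t),$$

where

$$W_h = \frac{N_h}{N}, \qquad F_h(t) = \frac{1}{N_h} \sum_{i=1}^{N_h} I(y_{hi} \le t).$$

The population's $p$th quantile of $y$ is defined as

$$Q(p) = \inf\{t : F(t) \ge p\},$$

where $p$ takes values between 0 and 1. Simple random samples of predetermined size $n_h$ are taken from each stratum $h$ independently and without replacement. Let $y_{h1}, \ldots, y_{hn_h}$ denote the values of the characteristic for the units in the sample from stratum $h$, for $h = 1, \ldots, H$. For the combined sample of size $n = \sum_{h=1}^{H} n_h$, the weighted empirical distribution function is expressed by

$$\hat{F}(t) = \sum_{h=1}^{H} \sum_{i=1}^{n_h} w_{hi}\, I(y_{hi} \le t),$$

in which the weight assigned to unit $hi$ is

$$w_{hi} = \frac{N_h}{N n_h},$$

and $I(\cdot)$ is the indicator function defined above.

The weight $w_{hi}$ is inversely proportional to the probability that unit $i$ in stratum $h$ will be included in the sample. It is also possible to write the weighted empirical distribution function as

$$\hat{F}(t) = \sum_{h=1}^{H} W_h \hat{F}_h(t),$$

where

$$\hat{F}_h(t) = \frac{1}{n_h} \sum_{i=1}^{n_h} I(y_{hi} \le t).$$

The sample quantile is denoted as

$$\hat{Q}(p) = \inf\{t : \hat{F}(t) \ge p\}.$$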
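The weighted empirical distribution function and the sample quantile can be sketched as follows. The sketch assumes the standard stratified form, in which the estimate is the sum over strata of the stratum weight $N_h/N$ times the within-stratum empirical cdf, and takes the sample quantile as the smallest sampled value at which this estimate reaches $p$. The function names and the toy data are hypothetical.

```python
from typing import Dict, List


def weighted_ecdf(t: float, samples: Dict[int, List[float]],
                  stratum_sizes: Dict[int, int]) -> float:
    """Sum over strata of (N_h / N) times the within-stratum empirical cdf at t."""
    N = sum(stratum_sizes.values())
    total = 0.0
    for h, ys in samples.items():
        W_h = stratum_sizes[h] / N                      # stratum weight N_h / N
        F_h = sum(1 for y in ys if y <= t) / len(ys)    # within-stratum ecdf
        total += W_h * F_h
    return total


def sample_quantile(p: float, samples: Dict[int, List[float]],
                    stratum_sizes: Dict[int, int]) -> float:
    """Smallest sampled value t at which the weighted ecdf reaches p."""
    candidates = sorted(y for ys in samples.values() for y in ys)
    for t in candidates:
        # Small tolerance guards against floating-point round-off in the sum.
        if weighted_ecdf(t, samples, stratum_sizes) >= p - 1e-12:
            return t
    return candidates[-1]


# Hypothetical example: two strata with N_1 = 20, N_2 = 80 and unequal sample sizes.
samples = {1: [1.0, 2.0], 2: [3.0, 4.0, 5.0, 6.0]}
stratum_sizes = {1: 20, 2: 80}
median_hat = sample_quantile(0.5, samples, stratum_sizes)
```

Note that the two observations from the small stratum jointly carry weight 0.2, while the four from the large stratum carry 0.8, so the unweighted sample median would differ from the weighted one here.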

The following assumptions were considered:(i)The underlying population of the th stratum has the cdf as which is Hölder continuous with a square-integrable second derivative for any .(ii) is an absolutely continuous function, such that and .(iii)The kernel function satisfies the following conditions for any

3. MBC Quantile Estimation for SSRS

In this section, we describe the multiplicative bias correction distribution function based on SSRS which was proposed by Onsongo et al. [18] and later use it to introduce quantile estimator along with its asymptotic properties. For the stratum, let be the auxiliary variable with associated survey measurement from a predominant univariate distribution function. Suppose a simple random sample of size is taken from stratum without replacement, with the sample fraction as as well as . Accordingly, for a finite population, the empirical distribution function is therefore defined as

For a stratified population, the associated estimator of a distribution function is defined asin which is perhaps step function of a particular set, -quantile, and represent measurements taken from stratum. is the stratum distribution function for the random variable . Suppose represents a set of data units selected from stratum using simple random sampling with no replacement, represent nonsampled units in stratum. Suppose that survey variables were generated with the help of a super population model, which is represented bywhere and

As a result, the predicted form of the empirical distribution function for a stratified population is obtained using the model-based technique.

The second term of equation (17) is not known, and the concern is determining how to accurately estimate it. Onsongo et al. [18] suggested a multiplicative bias-corrected estimator for finite population distribution under stratified sampling to estimate equation (17).where the term represents nonparametric estimator under model-based technique for and denotes residual estimated distribution function, where residuals are given by for th stratum. According to Onsongo et al. [18], leads to an unbiased estimator for and variance is expressed as

We have that . Hence, the SSRS MBC-based estimator of cdf, , can be considered for estimating quantile function. As with , the th quantile of the underlying distribution is defined as follows:and is alternatively denoted by . Based on a sample from SSRS with size and using an approach similar to that used by Eftekharian and Samawi [9], it is immediate that an MBC estimator of the th quantile is proposed aswhere is an index taking values between and . That is, is the smallest value of for which at least of the population values are less than or equal to that value. Furthermore, from (18), can be computed by solving . However, under assumption (i), it can be easily seen that is twice differentiable at .

Now, assume that be proportionally allocation into strata, then using Taylor series expansion of the function around , we can writewhere and [19] become negligible as . From equation (22), Bahadur’s representation [20] of the estimator, , is given by

As with , the proportion of individuals in the population that are less than or equal to the population quantile is as follows:where follows a hypergeometric distribution with the parameters , , and . Using findings from Francisco [21], the estimator’s expectation and variance are computed as

4. Properties of the Proposed Estimator

4.1. Asymptotic Unbiasedness of the Proposed Estimator

Now, consider the bias for the nonparametric estimator defined by

Then, from equation (23), it follows that

Using the results of equation (25), it can be easily seen that

Since becomes negligible as [19], the right-hand side of equation (29) tends to 0, and so, is asymptotically unbiased.

4.2. Asymptotic Variance of the Proposed Estimator

The variance of will now be computed as follows. From equation (23), taking variance on both sides, we have

Applying the results of equation (26), it is immediate that

4.3. Asymptotic Mean-Squared Error

Asymptotic MSE of the estimator is expressed as

From equations (29) and (31), the following results are immediate consequences:

Equation (33) tends to zero as , and thus, . This shows that is a consistent estimator of . Furthermore, has an asymptotic normal distribution as in Serfling [22].

5. Empirical Study

5.1. Description of the Population

In this part, simulation studies were carried out to investigate the performance of the proposed multiplicative bias-corrected quantile estimator for a stratified population. Two data variables, linear and cosine, were used to simulate a population of size 1000. The linear function was constructed using a linear model that has the following relationship.

The cosine function, which has the relationshipwas used to get the second study variable or mean function. The supplementary variable was considered to have a uniform distribution on a range of . The error term is perceived as a standard normal variable that follows .

To investigate the proposed estimator’s practical performance, each of the populations (i.e. ) was subdivided into 5 equal, disjoint, and mutually exclusive subgroups , which are made as homogeneous as possible to ensure that units in each stratum vary little from each other. Thereafter, a sample of size was drawn, with each stratum providing a sample size of employing simple random sampling with no replacement for each scenario. The Epanechnikov kernel, defined by,was employed for kernel smoothing on the different populations.
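The simulation design described above can be sketched as follows. Every specific value here is an assumption made for illustration: the two mean functions `linear_mean` and `cosine_mean` are hypothetical stand-ins (the forms used in the study are not reproduced here), the auxiliary variable is taken as uniform on $[0, 1]$, strata are formed by sorting on the auxiliary variable and cutting into five equal groups, and the per-stratum sample size `n_h = 40` is arbitrary.

```python
import math
import random
from typing import Callable, List, Tuple

Unit = Tuple[float, float]  # (auxiliary value x, survey value y)


def linear_mean(x: float) -> float:
    """Hypothetical linear mean function (coefficients are assumptions)."""
    return 1.0 + 2.0 * (x - 0.5)


def cosine_mean(x: float) -> float:
    """Hypothetical cosine mean function (form is an assumption)."""
    return 2.0 + math.cos(2.0 * math.pi * x)


def simulate_population(mean_fn: Callable[[float], float],
                        N: int = 1000, seed: int = 1) -> List[Unit]:
    """y = m(x) + e with x ~ Uniform(0, 1) (assumed range) and e ~ N(0, 1)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 1.0) for _ in range(N)]
    return [(x, mean_fn(x) + rng.gauss(0.0, 1.0)) for x in xs]


def stratify(population: List[Unit], H: int = 5) -> List[List[Unit]]:
    """Sort units by x and cut into H equal, disjoint strata."""
    ordered = sorted(population)
    size = len(ordered) // H
    return [ordered[h * size:(h + 1) * size] for h in range(H)]


def draw_ssrs(strata: List[List[Unit]], n_h: int, seed: int = 2) -> List[List[Unit]]:
    """Simple random sample without replacement of size n_h from every stratum."""
    rng = random.Random(seed)
    return [rng.sample(stratum, n_h) for stratum in strata]


population = simulate_population(cosine_mean)
strata = stratify(population)
sample = draw_ssrs(strata, n_h=40)
```

Sorting on the auxiliary variable before cutting is one simple way to make strata internally homogeneous when the study variable depends on it; other stratification rules would fit the description equally well.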

5.2. Estimators Included in the Empirical Study

We compare the MBC quantile estimator under SSRS defined by (21) to some of the popular quantile estimators under SSRS proposed in the literature since one of our goals is to develop estimators with desirable features with respect to bias, variance, and asymptotic mean-squared error. For comparison purposes, the following estimators were used, and first, we include in our study estimator of [8] defined aswhere . We also include in our empirical study kernel-based estimator of the quantile based on SSRS which is proposed by Eftekharian and Samawi [9].where, in this case, Finally, in our empirical study, we include Chambers and Clark estimator studied in [23].

The corresponding estimator of the quantile function according to Chambers and Clark [23] was defined by

5.3. Results

The unconditional biases, unconditional relative mean error (RME), and unconditional relative root-mean-squared error (RRMSE) for the estimators for various values of the quantile (i.e., , and 0.75) are shown in Table 1. The findings were tabulated using linear and cosine mean functions. Additional mean functions, including bump, quadratic, cycle, and sine, can provide comparable findings and draw similar conclusions. For any estimator , say, we define the relative mean error asand the relative root-mean-squared error aswhere is the quantile corresponding to the simulated sample.
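For readers who want to reproduce summary measures of this kind, the sketch below computes a relative mean error and a relative root-mean-squared error over simulation replicates. The exact definitions are assumptions: the relative mean error is taken here as the average deviation of the replicate estimates from the true quantile, divided by the true quantile, and the relative root-mean-squared error as the root of the average squared deviation, divided by the true quantile. The function names and the replicate values are hypothetical.

```python
import math
from typing import Sequence


def relative_mean_error(estimates: Sequence[float], truth: float) -> float:
    """Average signed deviation from the true quantile, relative to the truth."""
    return sum(e - truth for e in estimates) / (len(estimates) * truth)


def relative_rmse(estimates: Sequence[float], truth: float) -> float:
    """Root of the average squared deviation, relative to the truth."""
    mse = sum((e - truth) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse) / truth


# Hypothetical replicate estimates of a quantile whose true value is 10.
replicates = [9.0, 10.0, 11.0]
rme = relative_mean_error(replicates, 10.0)
rrmse = relative_rmse(replicates, 10.0)
```

A small relative mean error with a large relative root-mean-squared error, as in this toy example, indicates an estimator that is nearly unbiased but variable.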

It is clear from Table 1 that MBCQE has a smaller bias than ESQE and RCQE for all values of $p$. MBCQE also performs better than ESQE and RCQE in terms of RME and RRMSE, with smaller values of both for the linear and the cosine mean functions.

Table 2 reports the quantile estimates for the two mean functions. For $p = 0.25$, $0.50$, and $0.75$, the table gives the true population quantile together with the MBCQE, RCQE, and ESQE estimates. Comparing the true quantile with the listed estimates suggests that MBCQE is the better estimator of the true population quantile, since it is close to it at all probability levels.

We now turn to the conditional performance of the estimators by studying plots of the conditional bias (CB), conditional relative absolute bias (CRAB), and conditional mean-squared error (CMSE) of the estimators against group means of the means of the auxiliary variables, for quantile levels $p = 0.25$, $0.50$, and $0.75$. The objective is to determine whether significant differences exist among the estimators. In Figures 1–6, the red, green, and blue lines represent RCQE, MBCQE, and ESQE, respectively.

Figures 1–3 show the CB, CRAB, and CMSE when a linear mean function was used, and Figures 4–6 show the same quantities when a cosine mean function was used.

Expected value, bias, and MSE are functions of sample size and the quantile level, and they can be used to exhibit the performance characteristics of individual estimators. Bias and MSE are two criteria by which estimators can be compared. Estimators should have low bias and minimum MSE.

Figures 1–6 show that the proposed estimator MBCQE has the lowest bias and MSE at all quantile levels $p$, for both the linear and the cosine mean functions. MBCQE thus outperforms the other estimators investigated. Our results indicate that the proposed estimator performs well, both unconditionally and conditionally.

6. Conclusion

A multiplicative bias-corrected quantile estimator based on stratified simple random sampling has been proposed. We investigated the proposed estimator's properties and found that it is asymptotically normal. Under SSRS, it is also an asymptotically unbiased and consistent estimator of population quantiles. The simulation results show that the proposed estimator achieves a larger reduction in bias than the estimators of Chambers and Clark [23], Samawi et al. [8], and Eftekharian and Samawi [9]. In terms of performance, MBCQE consistently produced more precise results than the existing quantile estimators. We therefore conclude that MBCQE can be used to estimate finite population quantiles for stratified populations in various sectors.

Further work could address the construction of confidence intervals for the proposed estimator and explore other bias correction procedures in quantile estimation, including adaptive boosting and bootstrap bias reduction strategies. The design of quantile estimators under stratified ranked set sampling, as well as under complex sample designs such as cluster sampling, also remains a topic for future research.

Data Availability

The dataset used to back up the theoretical assertions was obtained through simulation using the R-GUI statistical software.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] S. K. Thompson, “Simple random sampling,” Sampling, vol. 755, pp. 9–37, 2012.
[2] R. Scheaffer, W. Mendenhall III, L. Ott, and K. Gerow, Survey Sampling, Brooks/Cole–Cengage Learning, Stamford, CT, USA, 7th edition, 2012.
[3] N. Sedransk and J. Sedransk, “Distinguishing among distributions using data from complex sample designs,” Journal of the American Statistical Association, vol. 74, no. 368, pp. 754–760, 1979.
[4] H. M. Samawi and O. A. Al-Sagheer, “On the estimation of the distribution function using extreme and median ranked set sampling,” Biometrical Journal, vol. 43, no. 3, pp. 357–373, 2001.
[5] E. A. Nadaraya, “Some new estimates for distribution functions,” Theory of Probability and Its Applications, vol. 9, no. 3, pp. 497–500, 1964.
[6] Y. Lio and W. Padgett, “A note on the asymptotically optimal bandwidth for Nadaraya’s quantile estimator,” Statistics and Probability Letters, vol. 11, no. 3, pp. 243–249, 1991.
[7] M. C. Jones, “Estimating densities, quantiles, quantile densities and density quantiles,” Annals of the Institute of Statistical Mathematics, vol. 44, no. 4, pp. 721–727, 1992.
[8] H. Samawi, A. Chatterjee, J. Yin, and H. Rochani, “On quantiles estimation based on different stratified sampling with optimal allocation,” Communications in Statistics-Theory and Methods, vol. 48, no. 6, pp. 1529–1544, 2019.
[9] A. Eftekharian and H. Samawi, “On kernel-based quantile estimation using different stratified sampling schemes with optimal allocation,” Journal of Statistical Computation and Simulation, vol. 91, no. 5, pp. 1040–1056, 2021.
[10] T. Gasser and H.-G. Müller, “Kernel estimation of regression functions,” Smoothing Techniques for Curve Estimation, Springer, Berlin, Germany, 1979.
[11] J. Rice, “Boundary modification for kernel regression,” Communications in Statistics-Theory and Methods, vol. 13, no. 7, pp. 893–900, 1984.
[12] M. C. Jones, “Simple boundary correction for kernel density estimation,” Statistics and Computing, vol. 3, no. 3, pp. 135–146, 1993.
[13] J. S. Marron and D. Ruppert, “Transformations to reduce boundary bias in kernel density estimation,” Journal of the Royal Statistical Society: Series B, vol. 56, no. 4, pp. 653–671, 1994.
[14] H.-G. Müller, “Smooth optimum kernel estimators near endpoints,” Biometrika, vol. 78, no. 3, pp. 521–530, 1991.
[15] H.-G. Müller and J.-L. Wang, “Hazard rate estimation under random censoring with varying kernels and bandwidths,” Biometrics, vol. 50, no. 1, pp. 61–76, 1994.
[16] E. F. Schuster, “Incorporating support constraints into nonparametric estimators of densities,” Communications in Statistics-Theory and Methods, vol. 14, no. 5, pp. 1123–1136, 1985.
[17] O. Linton and J. P. Nielsen, “A multiplicative bias reduction method for nonparametric regression,” Statistics & Probability Letters, vol. 19, no. 3, pp. 181–187, 1994.
[18] W. M. Onsongo, R. O. Otieno, and G. O. Orwa, “Nonparametric estimation of distribution function for stratified populations,” International Journal of Probability and Statistics, vol. 7, no. 5, pp. 125–129, 2018.
[19] J. Kiefer, “On Bahadur’s representation of sample quantiles,” The Annals of Mathematical Statistics, vol. 38, no. 5, pp. 1323–1342, 1967.
[20] R. R. Bahadur, “A note on quantiles in large samples,” The Annals of Mathematical Statistics, vol. 37, no. 3, pp. 577–580, 1966.
[21] C. A. Francisco, Estimation of Quantiles and the Interquartile Range in Complex Surveys, Iowa State University, Ames, IA, USA, 1987, PhD Thesis.
[22] R. Serfling, Approximation Theorems of Mathematical Statistics, John Wiley & Sons, New York, NY, USA, 1980.
[23] R. Chambers and R. Clark, An Introduction to Model-Based Survey Sampling with Applications, OUP, Oxford, UK, 2012.

Copyright

Copyright © 2022 Nicholas Makumi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
