Abstract

In this research, the authors were interested in an efficiency comparison study of new adjusted nonparametric and parametric statistics interval estimation methods in the simple linear regression model. The independent variable and the error came from normal, scale-contaminated normal, and gamma distributions. Six point estimations were performed, namely, least squares, Bayesian, jackknife, Theil, optimum-type Theil, and new adjusted Theil–Sen and Siegel methods in the simple linear regression model with 1,000 iterations. The criteria used in this study were the coefficient of confidence interval and the average width of confidence interval, which were used to compare and determine the optimal effectiveness of the six interval estimations of the simple linear regression model. In the interval estimation of β₀ for the normal and scale-contaminated normal distributions, the least squares method had the narrowest average width of confidence interval. For the interval estimation of β₁, the Bayesian method had the narrowest average width of confidence interval at a small variance of 1, followed by the equally ranked optimum-type Theil and new adjusted Theil–Sen and Siegel methods, and the Theil method, respectively. In the interval estimation of β₁ for the gamma distribution, the Bayesian method had the narrowest average width of confidence interval, followed by the optimum-type Theil, new adjusted Theil–Sen and Siegel, and Theil methods, respectively. The optimum-type Theil method was good for medium sample sizes, while the Theil and new adjusted Theil–Sen and Siegel methods were good for small and large sample sizes. Therefore, the new adjusted Theil–Sen and Siegel method can be used in many situations and can be used in place of the optimum-type Theil and Theil methods for nonparametric statistics interval estimation.

1. Introduction

A simple linear regression is an analysis used to study the relationship between one independent variable and one dependent variable, where the two variables have a linear relationship. A simple regression model is y = β₀ + β₁x + ε, where the variables may be positively or negatively related. It is the simplest basic form of regression analysis. There are several methods of estimation for simple linear regression equations. The estimation in each method can be classified into two types: point estimation and interval estimation. First, point estimation, also known as single estimation, is the calculation of one statistic from sample data. The obtained value is used as an estimate of the parameter. Second, interval estimation is an estimate of whether a population parameter will fall into a range of two values, L and U, where L is the lower limit and U is the upper limit. Assuming that the parameter to be estimated in a population is θ, this interval estimate of θ will be in the range L < θ < U using the data from the sample [1]. In this research, the authors were interested in an efficiency comparison of new adjusted nonparametric and parametric statistics interval estimation methods in the simple linear regression model, an area in which relatively little research has been conducted. Most previous studies compared parametric and nonparametric statistics point estimation methods in the simple linear regression model. Based on the literature review, a comparative study of three linear regression model estimation methods, least squares, parametric bootstrap, and nonparametric bootstrap, when the errors came from normal, uniform, logistic, and double exponential distributions, was conducted by comparing the mean squared error and the average of mean squared errors. The results of point estimation showed that when the errors came from normal, logistic, and double exponential distributions, the least squares method had the best effect.
In the case of interval estimation, it was found that when the errors came from normal and logistic distributions, the least squares method had the best effect. For uniform and double exponential distributions, the parametric bootstrap method was found to have the best effect [2]. In a comparative study between the classical ordinary least squares (OLS) regression method and the symmetric Bayesian linear regression method, the linear regression model used the criterion of mean squared error of point parameter estimation. The results showed that the symmetric Bayesian method was more efficient, consistent, and stable than the classical OLS method [3]. An efficiency comparison of simple linear regression coefficient estimation methods by Theil, quantile, and least squares methods, when the data in the independent and dependent variables contained outliers, was carried out by comparing the linear regression coefficients. The criterion used in the study was the mean squared error of point parameter estimation. The results found that when the data had no outliers, the least squares method had the best effect. When the data had outliers in a small sample, the Theil method had the best effect. When the data had outliers in medium and large sample sizes, the Theil method had the best effect in all cases [4]. In the same year, a comparative study of point parameter estimation methods using Bayesian, least squares, and parametric bootstrap methods in simple linear regression analysis was conducted. The criterion for determining the effectiveness of the estimation methods was the average of mean squared errors. The results concluded that the Bayesian method was the most effective for all given situations [5]. Two years later, research compared the point parameter estimation of least squares, Bayesian, Markov chain Monte Carlo, bootstrap, and jackknife methods for the simple regression model.
The criterion used to determine the most efficient method was the minimum average of mean squared errors. It was found that in most cases the Bayesian method had the minimum average of mean squared errors [6]. In addition, there was also a comparative study of point parameter estimation with least squares and maximum likelihood methods for simple linear regression analysis. The criterion for determining the effectiveness of the estimation methods was the mean squared error. The results showed that when the data had a normal distribution, the least squares and maximum likelihood methods were equally effective. When the data had a gamma distribution with small or medium sample sizes, the least squares and maximum likelihood methods had a similar effect. The least squares method performed better than the maximum likelihood method for a large sample size [7]. After that, a study compared the interval estimation of simple linear regression models by least squares, parametric bootstrap, and Bayesian methods. The criterion for determining the most effective method was the narrowest average width of confidence interval. It was found that the Bayesian method had the narrowest average width of confidence interval, followed by the least squares method [1]. A study was conducted to compare the point estimation of parametric and nonparametric statistics in the simple linear regression model using least squares, Mood–Brown, Theil–Sen, optimum-type Theil, Theil–Hodges–Lehmann, weighted Theil-1 (mean), weighted Theil-1 (median), weighted Theil-2 (mean), and weighted Theil-2 (median) methods based on mean absolute deviations. The results showed that the optimum-type Theil method had the minimum mean absolute deviation [8]. Finally, several comparisons were made between ordinary least squares regression, quantile regression, Theil–Sen regression, and Theil–Sen and Siegel regression methods.
The point parameter estimation comparisons used mean bias, median bias, standard deviation, standard error, root mean squared error, relative root mean squared error, median absolute error, and relative median absolute error of the four regression procedures to evaluate the model fitting. The results showed that, under the normality assumption with no outliers, ordinary least squares regression was the most suitable regression procedure, followed by quantile regression. When there were outliers in both the X and Y directions, Theil–Sen and Siegel regression was the most suitable, followed by quantile regression. Under the nonnormality assumption, quantile regression, Theil–Sen regression, and Theil–Sen and Siegel regression had the same performance [9]. Therefore, the authors were interested in an efficiency comparison study of the interval estimation methods of new adjusted nonparametric and parametric statistics in the simple linear regression model, in which the independent variable and the error came from normal, scale-contaminated normal, and gamma distributions, for the following reasons drawn from the above summary of research studies. The least squares method of [2] was the most effective. In point estimation, the least squares method of [7] had the best effect for a normal distribution. Moreover, when the data had no outliers, the least squares method of [4, 9] also had the best effect. The research of [1, 3, 5, 6] found that the Bayesian method was the most effective; those researchers proposed the prior distribution function as a normal distribution. If the data had outliers for all of the small, medium, and large sample sizes, the Theil method of [4] had the best effect. The optimum-type Theil method of [8] had the best effect, followed by the Theil method. In addition, when there were outliers in both the X and Y directions, the Theil–Sen and Siegel method was the most suitable [9]. Finally, the authors were also interested in using the jackknife method in these comparisons.

2. Materials and Methods

2.1. Materials

An efficiency comparison of new adjusted nonparametric and parametric statistics interval estimation methods in the simple linear regression model was carried out with R-Studio version 4.1.0 software. The software was run on a notebook with an Intel Core i3 CPU at 2 GHz and 4 GB of RAM under Windows 10.

2.2. Methods

The research method examines an efficiency comparison of new adjusted nonparametric and parametric statistics interval estimation methods in the simple linear regression model, where least squares, Bayesian, and jackknife methods are parametric statistics and Theil, optimum-type Theil, and new adjusted Theil–Sen and Siegel methods are nonparametric statistics, as follows:

(1) The model is a simple linear regression model in which the equation showing the relationship between the independent variable (X) and the dependent variable (Y) has the form Y = Xβ + ε, where Y is a vector of the dependent variable of size n × 1, X is a matrix of the independent variable of size n × 2, β is a vector of the parameter values of the regression equation of size 2 × 1, and ε is a vector of the error of size n × 1.

(2) Determine the sample sizes as follows: small sample sizes are 10 and 30, medium sample sizes are 50 and 70, and large sample sizes are 100 and 200.

(3) Determine the parameter vector β = (β₀, β₁)ᵀ, where β₀ is the y-intercept and β₁ is the regression coefficient.

(4) Randomize the independent variable from the given distribution as follows:
(i) The normal distribution has mean μ and variance σ².
(ii) The scale-contaminated normal distribution is a transformed distribution from the normal distribution. The proportion of contamination (p) is 0.05, the scale factor (c) is 5, and the size is 1. The scale-contaminated normal distribution has the probability density function f(x) = (1 − p)·φ(x; μ, σ²) + p·φ(x; μ, c²σ²), where φ(·; μ, σ²) denotes the normal density and μ and σ² are constant. Here, c is the scale factor and p is the probability of the binomial distribution for the sample size, size, and proportion of contamination.
(iii) The gamma distributions have an alpha (α) of 2 and a beta (β) of 1/2, 1/3, 1/4, 1/5, 2, 3, 4, and 5. The gamma distributions have mean αβ and variance αβ².

(5) Randomize the error from the given distribution as follows:
(i) The normal distribution has a mean of 0 and a variance of 1, 4, 9, 16, 25, 36, 49, 64, 81, and 100.
(ii) The scale-contaminated normal distribution is the same as in 4(ii).
(iii) The gamma distribution is the same as in 4(iii).

(6) Generate the data of the dependent variable Y from the relationship model. To generate the data, start by determining the sample size to be studied and fixing the parameter values β₀ and β₁. The independent variable is generated from the normal, scale-contaminated normal, and gamma distributions. The commands in the R program are then used to generate the error with the normal, scale-contaminated normal, and gamma distributions to be studied. Finally, Y is generated according to the aforementioned relationship model.

(7) Six point estimations and six interval estimations were performed with 1,000 iterations. The confidence level was set at 95% as follows:
(i) Least squares (LS) method: the least squares method is the basic parametric estimation method for determining the regression coefficients. It minimizes the sum of squares of the error, which can be written as SSE(β) = (Y − Xβ)ᵀ(Y − Xβ) [10]. Taking the derivative of SSE(β) with respect to β and setting it equal to 0 gives XᵀXβ = XᵀY. Therefore, the point estimate of parameter β with the least squares method is β̂ = (XᵀX)⁻¹XᵀY. The confidence interval of parameter β₀ with the least squares method is β̂₀ ± t(α/2, n − 2)·SE(β̂₀) and the confidence interval of parameter β₁ is β̂₁ ± t(α/2, n − 2)·SE(β̂₁), where SE(β̂₀) = s·√(1/n + x̄²/Sxx), SE(β̂₁) = s/√Sxx, Sxx = Σᵢ(xᵢ − x̄)², and s² = SSE/(n − 2) [11].
(ii) Bayesian (BS) method: the Bayesian method is a method of parameter estimation with conditional probability, where the posterior distribution function varies with the product of the prior distribution function and the likelihood function [12]. The Bayesian equation can be written as p(β | Y, X, σ²) ∝ p(β)·L(Y | X, β, σ²), where p(·) is the distribution function, p(β) is the prior distribution for β, L(Y | X, β, σ²) is the likelihood function when X and σ² are known, and p(β | Y, X, σ²) is the posterior distribution for β.
The Bayesian method has the following steps. Step 1 calculates the likelihood function when X and σ² are known, from a normal population with mean Xβ and variance σ²; the distribution of Y given X, β, and σ² is N(Xβ, σ²Iₙ). Step 2 chooses the prior distribution function for β. The prior distribution function is chosen as a normal distribution with mean μ₀ and covariance Σ₀, which is in the conjugate prior family for β given σ². The equation form of the prior is p(β) ∝ exp(−(1/2)(β − μ₀)ᵀΣ₀⁻¹(β − μ₀)), where μ₀ is the vector of prior means of β and Σ₀ is the prior covariance matrix of β. Step 3 calculates the posterior distribution function for β from the Bayesian equation. The equation form is β | Y, X, σ² ~ N(μₙ, Σₙ), where μₙ = Σₙ(Σ₀⁻¹μ₀ + σ⁻²XᵀY) is the vector of posterior means of β and Σₙ = (Σ₀⁻¹ + σ⁻²XᵀX)⁻¹ is the posterior covariance matrix of β. Therefore, the point estimate of parameter β with the Bayesian method is the posterior mean β̃ = μₙ. Hence, the confidence interval of parameter β₀ with the Bayesian method is β̃₀ ± z(α/2)·SE(β̃₀) and the confidence interval of parameter β₁ is β̃₁ ± z(α/2)·SE(β̃₁), where SE(β̃₀) and SE(β̃₁) are the standard errors of β̃₀ and β̃₁ for the Bayesian method, respectively. The priors of β₀ and β₁ have constant variances of 1 and 2, respectively [13].
(iii) Jackknife (JK) method: from a random sample with a normal distribution, new sets of samples are generated. By omitting the ith value, a new sample of size n − 1 is obtained. The omitted value is returned to the sample before the next sample is generated. This is done n times with the following steps [14, 15]. In Step 1, a random sample (x₁, y₁), …, (xₙ, yₙ) is obtained from the model with errors drawn from the given distribution. In Step 2, at the first time, (x₁, y₁) is omitted from the sample to obtain β̂₀(−1) and β̂₁(−1). At the second time, (x₂, y₂) is omitted from the sample to obtain β̂₀(−2) and β̂₁(−2), and so on, n times. In Step 3, from the least squares method, an estimate of β is obtained from each leave-one-out sample. In Step 4, the values β̂₀(−1), …, β̂₀(−n) are substituted into the equation to obtain the jackknife estimate of β₀. In Step 5, the obtained β̂₁(−1), …, β̂₁(−n) are taken to estimate parameter β₁. In Step 6, steps 2, 3, and 4 are repeated n times. Therefore, the point estimate of parameter β₀ with the jackknife method is β̂₀,JK = (1/n)Σᵢ β̂₀(−i) and the point estimate of parameter β₁ is β̂₁,JK = (1/n)Σᵢ β̂₁(−i).
The confidence interval of parameter β₀ with the jackknife method is β̂₀,JK ± t(α/2, n − 1)·SE(β̂₀,JK) and the confidence interval of parameter β₁ is β̂₁,JK ± t(α/2, n − 1)·SE(β̂₁,JK), where SE(β̂₀,JK) and SE(β̂₁,JK) are the jackknife standard errors computed from the n leave-one-out estimates [16].
(iv) Theil (T) method: Theil [17] proposed a method for estimating the slope of a simple linear regression line with the following steps [4]. Step 1 calculates the pairwise slopes Sᵢⱼ = (yⱼ − yᵢ)/(xⱼ − xᵢ) for i < j with xᵢ ≠ xⱼ, to obtain the total number of slope values N = n(n − 1)/2. Step 2 estimates β₁; the point estimate of parameter β₁ with the Theil method is the median of the ordered pairwise slopes S₍₁₎ ≤ … ≤ S₍N₎. If N is odd and N = 2k + 1, then β̂₁ = S₍ₖ₊₁₎. If N is even and N = 2k, then β̂₁ = (S₍ₖ₎ + S₍ₖ₊₁₎)/2. Step 3 estimates β₀; the point estimate of parameter β₀ with the Theil method is β̂₀ = med(Y) − β̂₁·med(X), where med(Y) and med(X) are the medians of Y and X, respectively. The confidence intervals of parameters β₀ and β₁ with the Theil method are constructed from the order statistics of the corresponding pairwise estimates [18].
(v) Optimum-type Theil (OT) method: the variables need no assumption of a symmetric distribution, and x̄ is the arithmetic mean of X [8]. Therefore, the point estimate of parameter β₁ with the optimum-type Theil method is the Theil-type median slope β̂₁, and the point estimate of parameter β₀ is β̂₀ = ȳ − β̂₁·x̄. The confidence intervals of parameters β₀ and β₁ with the optimum-type Theil method are constructed in the same way [18].
(vi) New adjusted Theil–Sen and Siegel (ATSS) method: Siegel [19], as cited by Farooqi [9], considered repeated medians. For each observation (xᵢ, yᵢ), the regression coefficients between it and the other (n − 1) observations are calculated and their median is taken. This results in n medians, and the median of these medians is the regression coefficient estimator. A robust estimator of the regression coefficient can be obtained by taking the median of these estimates, i.e., β̂₁ = medᵢ medⱼ≠ᵢ (yⱼ − yᵢ)/(xⱼ − xᵢ). Similarly, the y-intercept can be estimated by the median of all possible intercept estimates. Since the pairwise slopes may take negative values, the authors adjusted them by taking the absolute values of the numerator and denominator terms of the slope. Therefore, the authors propose that the point estimate of parameter β₁ with the new adjusted Theil–Sen and Siegel method is β̂₁ = medᵢ medⱼ≠ᵢ |yⱼ − yᵢ|/|xⱼ − xᵢ| and the point estimate of parameter β₀ is the corresponding median intercept estimate.
The confidence intervals of parameters β₀ and β₁ with the new adjusted Theil–Sen and Siegel method are constructed in the same way as for the Theil method [18].

(8) Calculate the estimate of the coefficient of confidence interval and the average width of confidence interval for the six linear regression estimation methods.
(i) The coefficient of confidence interval of an estimation method is a value used to measure the efficiency of the interval estimation method as follows: 1 − α̂ = (the total number of times the confidence interval covers the parameter value)/m, where m is the number of iterations in the experiment, which is equal to 1,000. The 1,000 point estimates were taken to find the six confidence intervals; each method was estimated as 1,000 intervals, and it was then determined whether the intervals covered the parameters β₀ and β₁. The confidence intervals covering the parameter value were counted, and this count was divided by the number of repetitions (m) to obtain the coefficient of confidence interval 1 − α̂. After that, it was considered whether the obtained values were within the range of the coefficient of confidence estimate or not [16]. The coefficient of confidence criterion was determined by testing the following hypothesis [20]: H₀: 1 − α = 0.95 versus H₁: 1 − α ≠ 0.95. The test statistic is Z = ((1 − α̂) − 0.95)/√((0.95)(0.05)/m). The critical region rejects H₀ if Z < −1.96 or Z > 1.96, that is, if 1 − α̂ < 0.9365 or 1 − α̂ > 0.9635. Therefore, an estimation method whose coefficient of confidence interval lies in the range of the given coefficient of confidence, 0.9365 ≤ 1 − α̂ ≤ 0.9635, is acceptable, where 1 − α is the given coefficient of confidence (0.95), 1 − α̂ is the estimate of the coefficient of confidence, and the significance level has a value of 0.05.
(ii) The width of confidence interval is denoted by the symbol W. It was calculated as the difference between the upper bound (U) and the lower bound (L) of the interval, where W = U − L. The average width of confidence interval is calculated by taking the sum of the widths of the confidence intervals over the m iterations, that is, AW = ΣᵢWᵢ/m. After that, the average widths of confidence intervals of the methods were compared. The estimation method with the narrowest average width of confidence interval was considered the most efficient. The Monte Carlo simulation was performed with the R program [21].
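The data-generation scheme in steps (4)-(6) can be sketched in code. The study itself used R; the following is a minimal Python illustration under our own assumptions (function names are ours, and the parameter values beta0 = beta1 = 1 are placeholders for the values fixed in step (3)):

```python
import numpy as np

rng = np.random.default_rng(2565)

def r_scale_contaminated_normal(n, mu=0.0, sigma=1.0, p=0.05, c=5.0, rng=rng):
    """Scale-contaminated normal draws: with probability p an observation's
    standard deviation is inflated by the scale factor c (here p=0.05, c=5)."""
    contaminated = rng.random(n) < p
    sd = np.where(contaminated, c * sigma, sigma)
    return rng.normal(mu, sd)

def generate_data(n, beta0=1.0, beta1=1.0, error_sampler=None, rng=rng):
    """Generate (x, y) from the model y = beta0 + beta1*x + e.
    beta0/beta1 are placeholder values; error_sampler draws the error term."""
    if error_sampler is None:
        error_sampler = lambda size: rng.normal(0.0, 1.0, size)
    x = rng.normal(0.0, 1.0, n)
    e = error_sampler(n)
    return x, beta0 + beta1 * x + e
```

A gamma error with alpha = 2 and beta = 1/5 (mean αβ = 2/5, variance αβ² = 2/25, matching the variance 2/25 reported for Table 5) would be drawn as `rng.gamma(2, 1/5, n)` in NumPy's shape/scale parameterization.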
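For step (7)(ii), with a normal conjugate prior and known error variance, the posterior mean and covariance have a standard closed form. This is a textbook-result sketch, not the authors' exact code; `mu0` and `Sigma0` are the prior mean vector and covariance matrix (the paper fixes prior variances of 1 and 2 for the two coefficients):

```python
import numpy as np

def bayes_posterior(X, y, sigma2, mu0, Sigma0):
    """Conjugate normal-prior posterior for the coefficient vector beta when
    the error variance sigma2 is known:
      Sigma_n = (Sigma0^-1 + X'X / sigma2)^-1
      mu_n    = Sigma_n (Sigma0^-1 mu0 + X'y / sigma2)
    """
    prior_prec = np.linalg.inv(Sigma0)
    Sigma_n = np.linalg.inv(prior_prec + X.T @ X / sigma2)
    mu_n = Sigma_n @ (prior_prec @ mu0 + X.T @ y / sigma2)
    return mu_n, Sigma_n

# With a very diffuse prior the posterior mean approaches the least squares fit.
x = np.arange(1.0, 21.0)
X = np.column_stack([np.ones_like(x), x])
y = 3.0 + 2.0 * x  # exact line, no noise
mu_n, Sigma_n = bayes_posterior(X, y, sigma2=1.0,
                                mu0=np.zeros(2), Sigma0=1e8 * np.eye(2))
```

The point estimate is the posterior mean `mu_n`, and interval endpoints follow from the diagonal of `Sigma_n`.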
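Step (7)(iii) can be sketched as a leave-one-out loop. This encodes one common jackknife variant (average of the leave-one-out least squares fits, with a jackknife standard error), as our reading of the steps rather than the authors' exact formulas:

```python
import numpy as np
from statistics import NormalDist

def ols_coefs(x, y):
    """Least squares fit of y = b0 + b1*x; returns (b0, b1)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

def jackknife_estimates(x, y, conf=0.95):
    """Refit with each observation omitted in turn, average the n leave-one-out
    estimates, and form normal-quantile intervals from the jackknife SE."""
    n = len(x)
    loo = np.array([ols_coefs(np.delete(x, i), np.delete(y, i))
                    for i in range(n)])
    est = loo.mean(axis=0)
    se = np.sqrt((n - 1) / n * ((loo - est) ** 2).sum(axis=0))
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    cis = [(b - z * s, b + z * s) for b, s in zip(est, se)]
    return est, cis
```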
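The Theil slope of step (7)(iv) and the repeated-median construction behind the new adjusted Theil–Sen and Siegel estimator of step (7)(vi) can be sketched as follows; the body of `atss_slope` encodes our reading of the authors' adjustment (absolute values of the numerator and denominator of each pairwise slope):

```python
import numpy as np
from itertools import combinations

def theil_estimates(x, y):
    """Theil estimator: slope = median of all pairwise slopes,
    intercept = median(y) - slope * median(x)."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2) if x[i] != x[j]]
    b1 = np.median(slopes)
    b0 = np.median(y) - b1 * np.median(x)
    return b0, b1

def atss_slope(x, y):
    """Repeated-median (Siegel-type) slope with absolute numerator and
    denominator, per our reading of the adjustment described in the text."""
    n = len(x)
    inner_medians = []
    for i in range(n):
        s = [abs(y[j] - y[i]) / abs(x[j] - x[i])
             for j in range(n) if j != i and x[j] != x[i]]
        inner_medians.append(np.median(s))
    return np.median(inner_medians)
```

On a noiseless line y = 2x + 1, both slope estimators recover 2 and the Theil intercept recovers 1.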
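Finally, the least squares intervals of step (7)(i) and the two evaluation criteria of step (8) can be sketched together. The z quantile below is a stand-in for the paper's t(α/2, n − 2) quantile, to keep the sketch standard-library only:

```python
import numpy as np
from statistics import NormalDist

def ls_confidence_intervals(x, y, conf=0.95):
    """OLS confidence intervals for (b0, b1); uses the normal quantile as an
    approximation to t(alpha/2, n-2)."""
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    sxx = ((x - xbar) ** 2).sum()
    b1 = ((x - xbar) * (y - ybar)).sum() / sxx
    b0 = ybar - b1 * xbar
    resid = y - b0 - b1 * x
    s2 = (resid ** 2).sum() / (n - 2)  # residual mean square
    se_b0 = np.sqrt(s2 * (1.0 / n + xbar ** 2 / sxx))
    se_b1 = np.sqrt(s2 / sxx)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return (b0 - z * se_b0, b0 + z * se_b0), (b1 - z * se_b1, b1 + z * se_b1)

def coverage_and_width(intervals, true_value):
    """Coefficient of confidence interval (coverage proportion) and the
    average width AW = sum(U - L) / m over the m simulated intervals."""
    lo = np.array([a for a, b in intervals])
    hi = np.array([b for a, b in intervals])
    return np.mean((lo <= true_value) & (true_value <= hi)), np.mean(hi - lo)
```

In the simulation, each method's 1,000 intervals per setting would be fed to `coverage_and_width` with the true β₀ or β₁ to obtain the two criteria.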

3. Results of a Simulation Study

3.1. The Independent Variable and the Error Have Normal Distributions

From Table 1, at the 95% confidence level for β₀ with a small sample size of 10 and variance of 4, the least squares method had the narrowest average width of confidence interval, 4.113, whereas with small sample sizes of 10 and 30 and variance of 1 for β₁, the Bayesian method had the narrowest average width of confidence interval, 1.347 and 0.808, respectively. With a medium sample size of 70 and variance of 25 for β₀, the least squares method had the narrowest average width of confidence interval, 3.612, whereas with medium sample sizes of 50 and 70 and variance of 1 for β₁, the Bayesian method had the narrowest average width of confidence interval, 0.485 and 0.492, respectively. With large sample sizes of 100 and 200 and variances of 49 and 100 for β₀, the least squares method had the narrowest average width of confidence interval, 4.055 and 3.687, respectively, whereas with large sample sizes of 100 and 200 and variance of 1 for β₁, the Bayesian method had the narrowest average width of confidence interval, 0.485 and 0.492, respectively.

3.2. The Independent Variable and the Error Have Scale-Contaminated Normal Distributions

From Table 2, at the 95% confidence level for β₁ with a small sample size of 10 and variance of 4, the new adjusted Theil–Sen and Siegel method had the narrowest average width of confidence interval, 2.830, whereas with a small sample size of 30 and variance of 1, the Bayesian method had the narrowest average width of confidence interval, 1.103. With medium sample sizes of 50 and 70 and variance of 1 for β₁, the Bayesian method had the narrowest average width of confidence interval, 0.679 and 0.454, respectively. With large sample sizes of 100 and 200 and variances of 4 and 9 for β₀, the least squares method had the narrowest average width of confidence interval, 3.751 and 3.943, respectively, whereas with large sample sizes of 100 and 200 and variance of 1 for β₁, the Bayesian method had the narrowest average width of confidence interval, 0.360 and 0.263, respectively. With a large sample size of 200 and variance of 1 for β₁, the optimum-type Theil method also had the narrowest average width of confidence interval, 0.263.

3.3. An Independent Variable Has a Normal Distribution and an Error Has a Scale-Contaminated Normal Distribution

From Table 3, at the 95% confidence level for β₁ with a small sample size of 10 and variance of 1, the new adjusted Theil–Sen and Siegel method had the narrowest average width of confidence interval, 5.392, whereas with a small sample size of 30 and variance of 1, the Bayesian method had the narrowest average width of confidence interval, 4.139. With a medium sample size of 50 and variance of 1 for β₀, the least squares method had the narrowest average width of confidence interval, 4.010, whereas with medium sample sizes of 50 and 70 and variance of 1 for β₁, the Bayesian method had the narrowest average width of confidence interval, 2.631 and 1.714, respectively. With a large sample size of 200 and variance of 4 for β₀, the least squares method had the narrowest average width of confidence interval, 3.771, whereas with large sample sizes of 100 and 200 and variance of 1 for β₁, the Bayesian method had the narrowest average width of confidence interval, 1.618 and 1.302, respectively.

3.4. An Independent Variable Has a Scale-Contaminated Normal Distribution and an Error Has a Normal Distribution

From Table 4, at the 95% confidence level for β₀ with a small sample size of 10 and variance of 9, the least squares method had the narrowest average width of confidence interval, 3.617, whereas at the 95% confidence level for β₁ with a small sample size of 10 and variance of 1, the Theil method had the narrowest average width of confidence interval, 0.203. With a small sample size of 30 and variance of 1 for β₁, the Bayesian method had the narrowest average width of confidence interval, 0.169. With medium sample sizes of 50 and 70 and variances of 49 and 64 for β₀, the least squares method had the narrowest average width of confidence interval, 3.899 and 3.938, respectively, whereas with a medium sample size of 50 and variance of 1 for β₁, the Bayesian method had the narrowest average width of confidence interval, 0.103. With a medium sample size of 70 and variance of 1 for β₁, the Bayesian and optimum-type Theil methods had the same narrowest average width of confidence interval, 0.088. With a large sample size of 100 and variance of 100 for β₀, the least squares method had the narrowest average width of confidence interval, 3.780, whereas at the 95% confidence level for β₁ with a large sample size and variance of 1, the Bayesian method had the narrowest average width of confidence interval, 0.068. With a large sample size of 200 and variance of 1 for β₁, the Bayesian, Theil, optimum-type Theil, and new adjusted Theil–Sen and Siegel methods had the same narrowest average width of confidence interval, 0.062.

3.5. The Independent Variable and the Error Have Gamma Distributions

From Table 5, at the 95% confidence level for β₀, only with a small sample size of 10 and variance of 2/25 did the Bayesian method have the narrowest average width of confidence interval, 1.461. At the 95% confidence level for β₁, for all sample sizes and almost all variances, the Bayesian method had the narrowest average width of confidence interval, 1.310, 1.165, 0.643, 0.481, 0.363, and 0.212, respectively, followed by the optimum-type Theil method and the equally ranked Theil and new adjusted Theil–Sen and Siegel methods, respectively.

4. Discussion

When the independent variable and the error were normally distributed, for the interval estimation of β₀, the least squares method had the narrowest average width of confidence interval. For the interval estimation of β₁, the Bayesian method had the narrowest average width of confidence interval at a small variance of 1. When the independent variable and the error had scale-contaminated normal distributions, for the interval estimation of β₀, the least squares method had the narrowest average width of confidence interval for large sample sizes. For the interval estimation of β₁, the Bayesian method had the narrowest average width of confidence interval at a small variance of 1, followed by the new adjusted Theil–Sen and Siegel method at a small sample size of 10 and a small variance of 4, and the optimum-type Theil method at a large sample size of 200 and a small variance of 1. When the independent variable had a normal distribution and the error had a scale-contaminated normal distribution, for the interval estimation of β₀, the least squares method had the narrowest average width of confidence interval for some variances of the medium and large sample sizes. For the interval estimation of β₁, the Bayesian method had the narrowest average width of confidence interval at a small variance of 1, followed by the new adjusted Theil–Sen and Siegel method at a small sample size of 10 and a small variance of 1. Finally, when the independent variable had a scale-contaminated normal distribution and the error had a normal distribution, for the interval estimation of β₀, the least squares method had the narrowest average width of confidence interval. For the interval estimation of β₁, the Bayesian method had the narrowest average width of confidence interval at a small variance of 1, followed by the equally ranked optimum-type Theil and Theil methods and the new adjusted Theil–Sen and Siegel method, respectively.
These conclusions were similar to the research of [1], in which the Bayesian method had the narrowest average width of confidence interval, followed by the least squares method. In point estimation, the research in [3, 5, 6] found that the Bayesian method had the best effect, and the research in [2] found that the least squares method had the narrowest average width of confidence interval. In point estimation, the least squares method of [7] had the best effect for normal distributions. Moreover, when the data had no outliers, the least squares method of [4, 9] also had the best effect. When the independent variable and the error were gamma distributed, for the interval estimation of β₀, the Bayesian method had the narrowest average width of confidence interval. For the interval estimation of β₁, the Bayesian method had the narrowest average width of confidence interval, followed by the optimum-type Theil method and the equally ranked new adjusted Theil–Sen and Siegel and Theil methods, respectively. These conclusions were similar to the research in [1, 3, 5, 6], in which the Bayesian method had the best effect. In point estimation, the research in [9] showed that when there were outliers in both the X and Y directions, the Theil–Sen and Siegel method was the most suitable.

5. Conclusions

In the interval estimation of β₀ for the normal and scale-contaminated normal distributions, the least squares method had the narrowest average width of confidence interval in all four cases. For the interval estimation of β₁, the Bayesian method had the narrowest average width of confidence interval at a small variance of 1, followed by the equally ranked optimum-type Theil and new adjusted Theil–Sen and Siegel methods and the Theil method, respectively. If only the parametric statistics interval estimations of β₀ and β₁ are considered, the least squares and Bayesian methods have the narrowest average width of confidence interval, respectively. Furthermore, if only the nonparametric statistics interval estimation of β₁ is considered, the optimum-type Theil and new adjusted Theil–Sen and Siegel methods were slightly better than the Theil method. As the sample size increased, the average width of confidence interval tended to decrease, but as the variance increased, the average width of confidence interval tended to increase. In the interval estimation of β₁ for the gamma distribution, the Bayesian method had the narrowest average width of confidence interval, followed by the optimum-type Theil, new adjusted Theil–Sen and Siegel, and Theil methods, respectively. The optimum-type Theil method was good for medium sample sizes, while the Theil and new adjusted Theil–Sen and Siegel methods were good for small and large sample sizes. Therefore, the new adjusted Theil–Sen and Siegel method can be used in many situations and can be used in place of the optimum-type Theil and Theil methods for nonparametric statistics interval estimations.

Data Availability

The data used to support this study were simulated from normal, scale-contaminated normal, and gamma distributions using R programs.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank the committees of the Department of Statistics and other departments, School of Science, King Mongkut’s Institute of Technology Ladkrabang (KMITL), for consideration of funding the research project on “Efficiency Comparison of New Adjusted Nonparametric and Parametric Statistics Interval Estimation Methods in Simple Linear Regression Model” (Grant no. 2565-02-05-007). The authors would also like to thank the senior project students in the Department of Statistics who helped in finding the relevant papers and coding the R programs during the research.