Abstract

The rank-based method is a popular robust estimation technique for linear models and an alternative to ordinary least squares (OLS) and restricted maximum likelihood (REML) in the presence of extreme observations. The method is applied in machine reliability analysis and quantum engineering, especially in artificial intelligence and optimization problems where outliers are commonly observed. It has also been extended to the multilevel model, where the shape of the error distribution plays a significant role in efficient estimation. In this study, we propose the Weibull score function for Weibull distributed error terms in the multilevel model. The efficiency of the proposed score function is compared with that of the existing Wilcoxon score function and the traditional REML method via Monte Carlo simulations after adding simulated extreme observations. For small values of the shape parameter of the Weibull error distribution, which indicate the presence of outliers, the Weibull score function was found to be more efficient than the Wilcoxon and REML methods. However, for a large value of the shape parameter, the Wilcoxon score appeared as efficient as the Weibull score function. REML was the least precise in all situations. These findings are verified through a real application to test-score data with a small value of the shape parameter, for which the Weibull score function turned out to be the most efficient.

1. Introduction

The rank-based method is a robust alternative to conventional estimation techniques for linear models, such as restricted maximum likelihood (REML) and ordinary least squares (OLS), when errors are asymmetric due to the presence of outliers. Rank-based fitting works by assigning ranks to the residuals and minimizing a pseudonorm, analogous to the squared Euclidean norm used in OLS, which provides a robust fit [1]. Rank estimation is also a popular parameter estimation method in reliability analysis, machine learning, and artificial intelligence, where the error term is positively skewed. The skewness may be due to the presence of one or more extreme observations. For example, in a study of the life (reliability) of batteries, some batteries may fail very early and some only after a long time. Both kinds of failure times are extreme observations, which are sometimes also called outliers. In this situation, a rank-based estimation method is needed as a robust alternative to REML and OLS.

Garcia et al. [2] developed a new rank-based constraint handling technique for the solution of engineering optimization problems. The comparison with other techniques revealed the accuracy and robustness of the rank-based technique. Neelima et al. [3] presented a rank-based skeleton extraction algorithm that outperforms various methods in the literature in terms of speed, time, rate, and connectivity. Li et al. [4] presented a new powerful rank-based procedure for the EWMA control scheme for online sequential monitoring. This procedure can be used to monitor the location and scale parameters of a continuous distribution. In the simulation study, the rank-based method appeared quite robust to nonnormally distributed data and efficient in detecting various process shifts. Chen et al. [5] advocated that rank regression is a more objective approach for dealing with data that are nonnormal due to the presence of outliers. This finding is supported by comparing the rank-based method with traditional parametric and semiparametric regression models through Monte Carlo simulation and real data examples. Liu et al. [6] proposed estimators of nonparametric coefficients in robust estimation for the varying coefficient partially functional linear regression model (VCPFLM) based on B-spline approximation and rank regression. The proposed estimators were found to be robust for nonnormal error distributions compared with OLS through a simulation study and a real-life dataset. Henry et al. [7] compared rank-based estimation and OLS for the Granger and Lee asymmetric price transmission model. Both estimation methods appeared equivalent for normal data irrespective of the sample size. For data affected by outliers, on the other hand, the OLS estimates became inaccurate, whereas the rank-based estimation was robust to outliers in large samples and produced unbiased estimates.
The results of the simulation study confirmed the superior performance of rank-based estimation for asymmetric data with outliers and its almost equivalent performance for normal datasets in the context of asymmetric price transmission modeling. The proper choice of a score function leads to asymptotically efficient estimators. The efficiency of robust rank-based analyses has been compared with the traditional REML method for normal and nonnormal error distributions. For nonnormal error terms in linear models, rank-based estimators are more efficient than REML, whereas the two are similarly efficient for normal error terms [1]. Fixed effect estimates, standard errors, and estimates of variance components were compared through a large simulation study, in which rank-based analyses appeared much more powerful and efficient than REML. McKean and Hettmansperger presented a review of the development of rank-based theory for linear models and its various extensions [8]. Kloke et al. [9] extended this rank theory to linear models with cluster-correlated errors, for example, mixed models and nonlinear models. A complete unified methodology of rank-based analyses is presented in [10]. Rank-based analyses are usually robust and efficient. Nevertheless, this efficiency can be optimized based on the information available about the distribution of the error term. The rank-based analysis turns out to be more accurate and efficient when the selected score function is close to the form implied by the error density f. If the probability density function (pdf) of u is f(u), then the expression for the optimal score function can be obtained through some mathematical derivation [8]. Thus, prior knowledge of the distribution of the error term assists in the proper choice of score function, leading to efficient regression estimators. The underlying shapes of error distributions can be symmetric, right-tailed, or left-tailed, and light-, moderate-, or heavy-tailed.
One common probability distribution for right-skewed data is the Weibull distribution, considered here as the case of a right-tailed error distribution. The distribution is right-skewed and controlled by two parameters, scale and shape. The level of skewness of the Weibull distribution decreases as the value of the shape parameter increases. The Weibull distribution mostly occurs in the reliability of products based on design, manufacturing, and development in prelaunch and postlaunch stages [11]. The rank-based estimation method is more efficient than OLS and REML in the case of a nonnormal error term. A dependent structure in the error term also spoils normality. The efficiency of the rank-based method can be optimized by selecting the appropriate score function for a Weibull distributed error term. In peridynamics, the variability in brittle fracture can be modeled well by the Weibull distribution [12]. A Weibull score function would provide the most efficient estimates of standard error. Some research work related to the rank-based estimation method is available in the literature. McKean and Kloke [13] developed a family of optimal score functions for the skew-normal family of distributions. Additionally, the validity and efficiency of the rank-based estimators were compared with MLE and the Wilcoxon score through a simulation study for skew-normal and contaminated normal distributions. McKean and Sievers [14] proposed rank score functions suitable for skewed error distributions, specifically for the generalized F family of error distributions. These score functions are bounded and provide a robust and powerful rank analysis for a linear model, with application to modeling lifetime data. Zhen et al. [15] proposed a new score function for intuitionistic fuzzy values (IFVs). The new score function is able to overcome the problems of other score functions for ranking IFVs. Lalancette et al.
[16] provided a review of rank-based estimation methods for asymptotic dependence and independence in linear models. Rank-based M-estimators are proposed, and asymptotic normality is established under weak regularity conditions. Watcharotone et al. [17] developed a robust rank-based picked-point analysis, optimizable for heavy-tailed or skewed distributions, for the analysis of covariance with heteroscedastic slopes. This technique is compared with least-squares-based picked-point analyses for normal and nonnormal models by Monte Carlo simulation. Rank-based analyses appeared valid and more powerful than the least-squares method in all situations, although they lose a little efficiency for normal models. Kloke and McKean [18] developed an R package for computing rank-based estimators and their associated inferences. This package includes a library of several score functions suitable for different shapes of error distributions and provides detailed rank analyses. Shao et al. [19] provided ordered treatment methods based on maximum likelihood and robust estimation for a one-factor randomized group design including a vector of covariates for the cluster-correlated model. The proposed estimation method produced higher power and showed robustness against outliers in both theoretical and simulation results. Bindele et al. [20] proposed an estimator for rank estimation of the regression coefficients in a single-index regression model. Conditions are established for consistency and asymptotic normality of the estimators. The efficiency and robustness of the rank estimator are compared with semiparametric least squares by using Monte Carlo simulation. A real-life example illustrates that rank regression can address model nonlinearity in the presence of outliers. Abebe and McKean [21] discussed the estimation of parameters in the linear regression model.
They considered the estimator that minimizes a weighted Wilcoxon dispersion function and established its asymptotic properties. The robustness, efficiency, and validity of these estimates are verified over several normal and nonnormal error distributions. Terpstra and McKean [22] demonstrated robust analyses based on weighted Wilcoxon (WW) estimates for linear models. A suite of R functions is available for WW estimation, hypothesis testing, diagnostic measures, and residual analysis. Bindele [23] considered the rank estimator of the general linear regression model and established its strong consistency under mild conditions. Abebe et al. [24] presented a rank-based fitting procedure for repeated measurement designs that replaces the Euclidean norm with a norm based on a score function. Using a simulation study, the proposed estimators were shown to be efficient, asymptotically normal, and valid; a real dataset with inherent hierarchy is also used for illustration. Černý et al. [25] studied algorithms for the minimization of Jaeckel’s dispersion function to obtain a robust rank estimator that is insensitive to outliers. A new two-stage algorithm is developed that merges the benefits of two kinds of algorithms already used in the literature, approximate and exact. The behavior of this two-stage algorithm is illustrated using a computational experiment in which convergence and the exact result are achieved. Černý et al. [26] focused on optimization algorithms for rank estimators by dividing them into two major classes: those with a continuous and convex objective function (CCC) and another class (GEN). For CCC, an unconditionally polynomial algorithm is proposed, and for GEN, an enumerative algorithm that is efficient and superior to other algorithms in the literature. Dutta and Datta [27] developed rank-based weighted estimating equations appropriate when the intracluster group size is informative.
Additionally, an aligned rank-sum test based on covariate-adjusted outcomes is constructed. The authors provided the asymptotic distributions and test statistics of the rank estimators. When informativeness is present, the importance of selecting proper weights is shown through simulation studies. The superiority and robustness of the proposed method are shown in comparison with traditional mixed models for clustered data by using real-life datasets. Dutta and Datta [28] presented a novel extension of the rank-sum test for the scenario where group-specific marginal distributions depend on the intracluster group size. The performance of the proposed test is compared with the Wilcoxon rank-sum test and the classical signed-rank test in a simulation study with informative intracluster group size. The proposed test maintained the correct size and attained more power [27, 28]. Xie et al. [29] developed a rank test based on the rank score function using functional principal component analysis. The asymptotic properties of the test are also established under local and alternative hypotheses. The proposed test attained good size and power in the simulation study.

The literature review shows that no work has yet addressed multilevel modeling when the error term follows a Weibull distribution. In this article, we therefore propose a score function for the multilevel model with Weibull distributed error terms in order to optimize rank-based analyses. Additionally, we compare the performance of the proposed score function with the Wilcoxon score and the traditional REML method.

2. Description of Multilevel Model and Rank Theory

This section gives a brief overview of the random intercept multilevel model, rank theory, and the optimal score function.

2.1. Random Intercept Model

Multilevel models are a special case of mixed effect models and generate cluster-correlated errors due to the hierarchical structure of the data. A general two-level random intercept multilevel model is defined as [30]

$$Y_j = X_j\beta + Z_j u_j + e_j, \tag{1}$$

where $Y_j$ is an $n_j \times 1$ vector of responses measured at level-1 for the individuals $i$ in cluster $j$ (the subscript $i$ refers to individual-level variation and $j$ to group-level variation), $X_j$ is the design matrix of fixed effects, and $Z_j$ is the design matrix of random effects. $\beta$ is the vector of fixed effects, $e_j$ is the vector of individual-level error terms for group $j$, and $u_j$ is the vector of random effects for cluster $j$; both follow normal distributions, i.e., $e_j \sim N(0, \sigma_e^2 I_{n_j})$ and $u_j \sim N(0, \sigma_u^2)$, respectively. It is assumed that the error terms are independently and normally distributed across individuals, with distribution function $F$ and density function $f$.

2.2. Brief Overview of Rank Theory

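In least-squares estimation, the vector $\beta$ is obtained by minimizing the Euclidean norm; similarly, rank-based estimation is accomplished by minimizing a pseudonorm. Thus, the robust estimate of $\beta$ can be defined as [9]

$$\hat{\beta}_\varphi = \underset{\beta}{\operatorname{Argmin}}\, \lVert Y - X\beta \rVert_\varphi. \tag{2}$$

It is efficient and robust in Y-space (Kloke and McKean, 2012). Al-Shomrani (2003) defined this pseudonorm by

$$\lVert Y - X\beta \rVert_\varphi = \sum_{j}\sum_{i} a\!\left[R\!\left(y_{ij} - x_{ij}^{T}\beta\right)\right]\left(y_{ij} - x_{ij}^{T}\beta\right), \tag{3}$$

where $R(y_{ij} - x_{ij}^{T}\beta)$ denotes the rank of the residual $y_{ij} - x_{ij}^{T}\beta$ among all $N$ residuals, $a(\cdot)$ denotes the scores, the ranks are invariant to constant shifts, $y_{ij}$ is the dependent variable for individual $i$ in cluster $j$, and $x_{ij}$ is the corresponding vector of covariates.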
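To make the pseudonorm concrete, the following minimal Python sketch evaluates the rank-based dispersion of a residual vector with Wilcoxon scores and uses it to pick a slope from a grid of candidates. Python is used only for illustration (the study itself was run in R); the function names, the toy data, and the grid search are assumptions and do not reproduce the fitting algorithm of any package.

```python
import math

def wilcoxon_score(u):
    """Standardized Wilcoxon score, phi(u) = sqrt(12) * (u - 1/2)."""
    return math.sqrt(12.0) * (u - 0.5)

def ranks(values):
    """Ranks 1..N of the values; ties are broken by position."""
    order = sorted(range(len(values)), key=lambda t: values[t])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def dispersion(residuals, score=wilcoxon_score):
    """Pseudonorm: sum over t of a(R(e_t)) * e_t, with a(t) = score(t / (N + 1))."""
    n = len(residuals)
    return sum(score(r / (n + 1)) * e for r, e in zip(ranks(residuals), residuals))

def rank_slope(x, y, grid):
    """Candidate slope from `grid` that minimizes the dispersion of y - b * x."""
    return min(grid, key=lambda b: dispersion([yi - b * xi for xi, yi in zip(x, y)]))

# Hypothetical data: y is roughly 2 * x, except for one gross outlier at the end.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 8.0, 9.9, 12.1, 14.0, 60.0]
grid = [b / 100 for b in range(0, 1001)]   # slopes 0.00, 0.01, ..., 10.00
slope = rank_slope(x, y, grid)             # stays close to 2 despite the outlier
```

Because the scores sum to zero, the dispersion does not change when a constant is added to every residual, which is why no intercept appears in the minimization; in practice the intercept is estimated separately from the residuals.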

A suitable selection of the score function according to the shape of the error distribution is required for optimal analysis. The score function $\varphi(u)$ is a nondecreasing, square-integrable function defined on the interval (0, 1). It can be standardized such that $\int_0^1 \varphi(u)\,du = 0$ and $\int_0^1 \varphi^2(u)\,du = 1$. Scores are calculated as $a(t) = \varphi\left(t/(N+1)\right)$, where $t = 1, \ldots, N$ and $N$ is the total sample size, and these scores sum to zero, i.e., $\sum_{t=1}^{N} a(t) = 0$.
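As a quick numerical illustration of these properties, the sketch below generates scores from the Wilcoxon score function and checks the zero-sum and standardization conditions with a simple midpoint rule. The helper names and the sample size of 10 are illustrative assumptions.

```python
import math

def wilcoxon_score(u):
    """Standardized Wilcoxon score, phi(u) = sqrt(12) * (u - 1/2)."""
    return math.sqrt(12.0) * (u - 0.5)

def scores(score, n):
    """Scores a(t) = score(t / (n + 1)) for t = 1, ..., n."""
    return [score(t / (n + 1)) for t in range(1, n + 1)]

def midpoint_integral(g, m=10000):
    """Midpoint-rule approximation of the integral of g over (0, 1)."""
    h = 1.0 / m
    return sum(g((k + 0.5) * h) for k in range(m)) * h

a = scores(wilcoxon_score, 10)
score_sum = sum(a)                                                # close to 0
integral_phi = midpoint_integral(wilcoxon_score)                  # close to 0
integral_phi_sq = midpoint_integral(lambda u: wilcoxon_score(u) ** 2)  # close to 1
```

Any other nondecreasing score function can be passed to `scores` in the same way; a score that is not yet standardized can be centered and rescaled using the two integrals.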

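The asymptotic distribution of the rank-based estimator is defined as [13]

$$\hat{\beta}_\varphi \;\dot{\sim}\; N\!\left(\beta,\ \tau_\varphi^{2}\,(X^{T}X)^{-1}\right). \tag{4}$$

Hence, minimizing the variance of $\hat{\beta}_\varphi$ is equivalent to minimizing $\tau_\varphi$. The parameter $\tau_\varphi$ is given by [13]

$$\tau_\varphi^{-1} = \int_0^1 \varphi(u)\left[-\frac{f'\!\left(F^{-1}(u)\right)}{f\!\left(F^{-1}(u)\right)}\right]du. \tag{5}$$

The comparison based on asymptotic relative efficiency can be obtained as $\mathrm{ARE} = \sigma^{2}/\tau_\varphi^{2}$, where $\sigma^{2}$ is the variance of the REML estimator, the traditional method for estimating multilevel models, which is based on maximization of the likelihood function.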
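As a worked illustration of the efficiency ratio, the sketch below computes it for the Wilcoxon score under standard normal errors. It relies on two simplifying assumptions that are not part of the study: the errors are independent (no clustering), and the scale parameter of the Wilcoxon score is taken from the well-known closed form 1 / (sqrt(12) * integral of f^2). Under these assumptions the ratio equals the classical value 3/pi.

```python
import math

def normal_pdf(x, sigma=1.0):
    """Density of N(0, sigma^2)."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def integral_f_squared(pdf, lo=-12.0, hi=12.0, m=20000):
    """Midpoint-rule approximation of the integral of pdf(x)^2 over [lo, hi]."""
    h = (hi - lo) / m
    return sum(pdf(lo + (k + 0.5) * h) ** 2 for k in range(m)) * h

sigma = 1.0
# Scale parameter of the rank-based fit with Wilcoxon scores (independent errors).
tau_wilcoxon = 1.0 / (math.sqrt(12.0) * integral_f_squared(normal_pdf))
# Asymptotic relative efficiency of the rank-based fit versus least squares.
are = sigma ** 2 / tau_wilcoxon ** 2
```

A ratio below one means the likelihood-based fit is slightly more efficient at the normal model; for skewed or heavy-tailed errors the ratio typically exceeds one.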

2.3. Optimal Scores

The rank-based analyses depend on the selection of a score function. If the underlying distribution of the error term is known, then [31] showed that the optimal score function can be derived as

$$\varphi_f(u) = -\frac{f'\!\left(F^{-1}(u)\right)}{f\!\left(F^{-1}(u)\right)}. \tag{6}$$

This expression yields scores that lead to fully efficient rank-based estimates, where $f$ and $F$ are the probability density function (pdf) and cumulative distribution function (cdf), respectively, of the error distribution. The Wilcoxon score function is the optimal score for the logistic distribution of the error term. The expression for the Wilcoxon score is $\varphi(u) = \sqrt{12}\,(u - 1/2)$.

According to McKean and Kloke [13], from equations (5) and (6), $\tau_\varphi^{-1}$ can be expressed as

$$\tau_\varphi^{-1} = \int_0^1 \varphi(u)\,\varphi_f(u)\,du = \sqrt{I(f)}\;\frac{\int_0^1 \varphi(u)\,\varphi_f(u)\,du}{\sqrt{\int_0^1 \varphi_f^{2}(u)\,du}} = \sqrt{I(f)}\,\rho. \tag{7}$$

In the last expression, $\rho$ is the correlation coefficient between $\varphi(u)$ and $\varphi_f(u)$, and $I(f) = \int_0^1 \varphi_f^{2}(u)\,du$ is the Fisher information. Hence, the scale parameter $\tau_\varphi$ is minimized by maximizing $\rho$, that is, by taking $\rho = 1$, i.e., $\varphi(u) = \varphi_f(u)/\sqrt{I(f)}$. Thus, (6) is a compact expression of the optimal score function. Only knowledge of the form of $f(x)$ is required to make the rank-based estimator fully efficient [13].
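The statement that the Wilcoxon score is optimal for logistic errors can be checked numerically. For the standard logistic distribution, the expression -f'(F^{-1}(u)) / f(F^{-1}(u)) reduces to 2u - 1, so the correlation between this function and the standardized Wilcoxon score should equal one. The sketch below is an illustrative check under that assumption; the helper names are hypothetical.

```python
import math

def wilcoxon_score(u):
    """Standardized Wilcoxon score, phi(u) = sqrt(12) * (u - 1/2)."""
    return math.sqrt(12.0) * (u - 0.5)

def logistic_optimal_score(u):
    """Optimal score for the standard logistic distribution: 2u - 1."""
    return 2.0 * u - 1.0

def midpoint_integral(g, m=20000):
    """Midpoint-rule approximation of the integral of g over (0, 1)."""
    h = 1.0 / m
    return sum(g((k + 0.5) * h) for k in range(m)) * h

# Inverse scale parameter: integral of phi(u) * phi_f(u).
inv_tau = midpoint_integral(lambda u: wilcoxon_score(u) * logistic_optimal_score(u))
# Fisher information: integral of phi_f(u)^2 (equals 1/3 for the logistic).
fisher_info = midpoint_integral(lambda u: logistic_optimal_score(u) ** 2)
# Correlation between the chosen score and the optimal score.
rho = inv_tau / math.sqrt(fisher_info)
```

Replacing `logistic_optimal_score` with the optimal score of another error distribution gives a correlation below one, which quantifies the efficiency lost by using Wilcoxon scores for that distribution.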

3. Weibull Score Function

In this section, after presenting a brief description of the Weibull distribution, a path diagram to reach the optimal score function for the Weibull distribution is shown. This segment includes all the important steps in completing the derivation of the score function.

3.1. Proposed Score Function

The path to the proposed optimal score function is summarized in the flow diagram in Figure 1.

3.2. Weibull Error Distribution

The Weibull distribution is a continuous right-skewed distribution. It was first identified by Fréchet in 1928 [32] and later discussed in detail by the Swedish mathematician Waloddi Weibull [33]. This distribution is extensively used in survival analysis, reliability analysis, and extreme value theory. The pdf of the Weibull distribution is of the form

$$f(x; \alpha, \lambda) = \frac{\alpha}{\lambda}\left(\frac{x}{\lambda}\right)^{\alpha-1} \exp\!\left[-\left(\frac{x}{\lambda}\right)^{\alpha}\right], \quad x \ge 0,$$

where $\alpha > 0$ is the shape parameter and $\lambda > 0$ is the scale parameter. The skewness is controlled by the shape parameter, which typically ranges between 0.5 and 8.0. In this study, we fixed the scale parameter at $\lambda = 1$ and varied the shape parameter from 1.0 to 4.0, which changes the shape of the distribution from skewed to nearly symmetric. Hence, the pdf simplifies to the following form:

$$f(x; \alpha) = \alpha\, x^{\alpha-1} \exp\!\left(-x^{\alpha}\right), \quad x \ge 0.$$

In this article, we are concerned with the random intercept multilevel model (1), in which the level-1 and level-2 random errors follow the Weibull distribution. Developing an optimal score function for a specific value of the shape parameter $\alpha$ requires only knowledge of the form of the pdf.
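The effect of the shape parameter on skewness can be quantified with the standard moment formulas of the Weibull distribution, whose r-th raw moment for unit scale is Gamma(1 + r / shape). The following sketch, with an illustrative function name and the shape values 1 to 4 chosen to match the range studied, computes the skewness coefficient for each shape.

```python
import math

def weibull_skewness(shape):
    """Skewness coefficient of a Weibull distribution (it does not depend on scale)."""
    g1 = math.gamma(1.0 + 1.0 / shape)   # first raw moment for unit scale
    g2 = math.gamma(1.0 + 2.0 / shape)   # second raw moment
    g3 = math.gamma(1.0 + 3.0 / shape)   # third raw moment
    variance = g2 - g1 ** 2
    third_central = g3 - 3.0 * g1 * g2 + 2.0 * g1 ** 3
    return third_central / variance ** 1.5

skewness_by_shape = {k: weibull_skewness(k) for k in (1.0, 2.0, 3.0, 4.0)}
```

With shape 1 the distribution is exponential and the skewness is 2; the coefficient then falls steadily and is close to zero near shape 4, in line with the move from a skewed to a nearly symmetric density.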

3.3. Derivation of Score Function

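The pdf of the Weibull distribution for the error term is defined by

$$f(x) = \frac{\alpha}{\lambda}\left(\frac{x}{\lambda}\right)^{\alpha-1} \exp\!\left[-\left(\frac{x}{\lambda}\right)^{\alpha}\right], \quad x \ge 0,$$

where $\alpha$ is the shape parameter and $\lambda$ is the scale parameter. Let us assume $\lambda = 1$; then the pdf becomes $f(x) = \alpha x^{\alpha-1} e^{-x^{\alpha}}$.

The quantile function of the Weibull distribution for $\lambda = 1$ is given by

$$F^{-1}(u) = \left[-\ln(1-u)\right]^{1/\alpha}, \quad 0 < u < 1.$$

The optimal score function can be derived by the following expression:

$$\varphi_f(u) = -\frac{f'\!\left(F^{-1}(u)\right)}{f\!\left(F^{-1}(u)\right)}.$$

Let us assume that $x = F^{-1}(u)$ for convenience in derivation, where $0 < u < 1$. Since $f'(x) = \alpha x^{\alpha-2} e^{-x^{\alpha}}\left[(\alpha-1) - \alpha x^{\alpha}\right]$,

$$-\frac{f'(x)}{f(x)} = \alpha x^{\alpha-1} - \frac{\alpha-1}{x}.$$

After substituting the value of $x$,

$$\varphi_f(u) = \alpha\left(\left[-\ln(1-u)\right]^{1/\alpha}\right)^{\alpha-1} - \frac{\alpha-1}{\left[-\ln(1-u)\right]^{1/\alpha}}.$$

The score function is

$$\varphi_f(u) = \alpha\left[-\ln(1-u)\right]^{(\alpha-1)/\alpha} - (\alpha-1)\left[-\ln(1-u)\right]^{-1/\alpha}.$$

The derivative of the score function is required for implementation in the R package:

$$\varphi_f'(u) = \frac{\alpha-1}{1-u}\left\{\left[-\ln(1-u)\right]^{-1/\alpha} + \frac{1}{\alpha}\left[-\ln(1-u)\right]^{-(\alpha+1)/\alpha}\right\}.$$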
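The derivation can be sketched in code as follows. This Python version is only an illustration under the unit-scale assumption (the study's implementation was in R); it applies the general optimal-score expression -f'(x)/f(x) at the Weibull quantile, and the function names and the shape value used in the example are hypothetical. The score is left unstandardized.

```python
import math

def weibull_quantile(u, shape):
    """Quantile of the unit-scale Weibull: (-ln(1 - u)) ** (1 / shape)."""
    return (-math.log(1.0 - u)) ** (1.0 / shape)

def weibull_score(u, shape):
    """Optimal score -f'(x)/f(x) evaluated at x = weibull_quantile(u, shape)."""
    w = -math.log(1.0 - u)
    return shape * w ** ((shape - 1.0) / shape) - (shape - 1.0) * w ** (-1.0 / shape)

def weibull_score_derivative(u, shape):
    """Derivative of weibull_score with respect to u."""
    w = -math.log(1.0 - u)
    inner = w ** (-1.0 / shape) + w ** (-(shape + 1.0) / shape) / shape
    return (shape - 1.0) / (1.0 - u) * inner

# Scores a(t) = phi(t / (N + 1)) for a small hypothetical sample of N = 9.
n, shape = 9, 2.5
scores = [weibull_score(t / (n + 1), shape) for t in range(1, n + 1)]
```

For shape values above 1 the derivative is positive on (0, 1), so the scores increase with the rank, while for shape equal to 1 the expression reduces to a constant. The scores can be centered and rescaled afterwards if standardized scores are required.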

Figure 2 shows the score function for the Weibull distribution at the shape-parameter values considered. The left panel contains the pdfs, and the right panel displays the corresponding optimal score functions. The pdf is extremely right-skewed for the smallest value of the shape parameter. The distribution becomes more symmetric as the shape parameter increases; hence, for large values of the shape parameter, the pdf is close to a symmetric distribution.

4. Simulation

The simulated model is as follows:

$$y_{ij} = \beta_0 + \beta_1 x_{1ij} + \cdots + \beta_p x_{pij} + u_j + e_{ij},$$

where $y_{ij}$ is the response variable and $x_{1ij}, \ldots, x_{pij}$ are the explanatory variables, generated from a uniform distribution and measured at level-1. $u_j$ and $e_{ij}$ are the random errors at the group level and individual level, respectively, simulated from the Weibull distribution. Groups are indexed by $j$, and the response vector for group $j$ contains $n_j$ observations. The score function is evaluated for multilevel models at different sample sizes. The level-1 sample size (N1) was set to 5, 10, 20, and 40 within each level-2 sample size (N2) of 5, 10, 20, and 40, giving total sample sizes in the range N = 25-1600. The simulation is repeated 1000 times in R. The number of covariates included in the model is $p = 3$ and $p = 6$. The true values of the regression parameters are selected according to [13]. The efficiency of the Weibull score function is compared with the Wilcoxon score function and the traditional REML method. To assess the performance of the Weibull score function, we use bias, MSE, and precision as the performance evaluation criteria. The absolute bias for the fixed effects of the covariates and the random effect is calculated from the difference between the vector of estimates $\hat{\beta}$ and the true parameter values, and the MSE is calculated from the corresponding squared differences. Three estimates of precision are calculated as ratios of the scale estimators obtained by the rank-based method with the Weibull score, the rank-based method with the Wilcoxon score, and REML.
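A minimal sketch of one such simulation loop is given below, written in Python for illustration although the study was run in R. Several details are assumptions made only for this example: a single covariate drawn from U(0, 1), hypothetical true coefficients, unit-scale Weibull errors at both levels, 200 replications, the common definitions of absolute bias (distance of the mean estimate from the truth) and MSE (mean squared deviation), and a pooled least-squares slope as a stand-in for the rank-based and REML fits.

```python
import random

def simulate_random_intercept(n_groups, group_size, beta, shape, rng):
    """Simulate y_ij = beta[0] + beta[1] * x_ij + u_j + e_ij with Weibull errors."""
    xs, ys, groups = [], [], []
    for j in range(n_groups):
        u_j = rng.weibullvariate(1.0, shape)        # group-level error (scale 1)
        for _ in range(group_size):
            x = rng.uniform(0.0, 1.0)               # assumed U(0, 1) covariate
            e = rng.weibullvariate(1.0, shape)      # individual-level error (scale 1)
            xs.append(x)
            ys.append(beta[0] + beta[1] * x + u_j + e)
            groups.append(j)
    return xs, ys, groups

def pooled_ls_slope(xs, ys):
    """Pooled least-squares slope; a stand-in estimator for the demo only."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def bias_and_mse(estimates, truth):
    """Absolute bias of the mean estimate and mean squared error over replications."""
    n = len(estimates)
    bias = abs(sum(estimates) / n - truth)
    mse = sum((b - truth) ** 2 for b in estimates) / n
    return bias, mse

rng = random.Random(2024)
true_beta = (1.0, 2.0)                 # hypothetical intercept and slope
estimates = []
for _ in range(200):                   # the study used 1000 replications
    xs, ys, groups = simulate_random_intercept(10, 10, true_beta, 2.0, rng)
    estimates.append(pooled_ls_slope(xs, ys))
bias, mse = bias_and_mse(estimates, true_beta[1])
```

In a full replication of the design, the stand-in estimator would be replaced by the rank-based fits with each score function and by REML, and the loop would be repeated for every combination of level-1 size, level-2 size, and shape value.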

Two kinds of error distributions are considered in this simulation study: the Weibull distribution and the contaminated Weibull distribution. The contaminated Weibull error model combines a Weibull random error having the shape parameter under study with a contaminating Weibull random variable, where the contamination is governed by an indicator variable that follows a binomial distribution with parameters (1, 0.20). A similar setting is used for the group-level error. For robust analysis through rank theory, the score is selected to suit the shape parameter of the Weibull distribution. Five values of the shape parameter close to the true value are considered to provide a comparative analysis of the optimal score function.
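One common way to generate contaminated errors of this kind is a two-component mixture, sketched below. The mixture form, the function name, and the parameters of the contaminating Weibull distribution (`contam_shape`, `contam_scale`) are assumptions made for illustration; only the Binomial(1, 0.20) indicator is taken from the description of the simulation.

```python
import random

def contaminated_weibull(shape, rng, eps=0.20, contam_shape=0.5, contam_scale=1.0):
    """Return (error, flag); flag is 1 when the draw comes from the contaminating Weibull."""
    flag = 1 if rng.random() < eps else 0          # Binomial(1, eps) indicator
    if flag:
        # Hypothetical contaminating component with a heavier right tail.
        return rng.weibullvariate(contam_scale, contam_shape), flag
    # Regular component: unit-scale Weibull with the shape under study.
    return rng.weibullvariate(1.0, shape), flag

rng = random.Random(7)
draws = [contaminated_weibull(2.0, rng) for _ in range(10000)]
contaminated_share = sum(flag for _, flag in draws) / len(draws)   # close to 0.20
```

Note that Python's `random.weibullvariate` takes the scale as its first argument and the shape as its second. The same generator can be reused for the group-level errors.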

5. Results and Discussion

Table 1 and Table 2 summarize the results of the simulation study, giving the estimated bias and MSE of $\hat{\beta}$ for the multilevel model with three and six covariates at the first shape-parameter setting. The estimates are computed through the rank-based method with the Weibull and Wilcoxon score functions and through REML. From the three estimates of precision, it is worth noting that the rank-based method is more efficient than REML, whether applied with the Weibull or the Wilcoxon score function. Auda et al. [1] reported the same result, namely that the rank-based method is generally highly efficient even if a few outliers are present in the data. The efficiency of the regression estimates is many times better than that of MLE for skewed data. Therefore, a lower MSE is observed for the rank-based method, with either the Wilcoxon or the Weibull score function, than for MLE.

Furthermore, in the case of three covariates, the smallest bias and MSE are achieved with the Weibull score function as compared with the Wilcoxon score. For six covariates, the Weibull score is better than the Wilcoxon score for group size 5, but for group sizes larger than 10, the Weibull and Wilcoxon scores are equally precise.

Table 3 and Table 4 contain the estimates of bias and MSE of $\hat{\beta}$ for three and six covariates in the multilevel model at the second shape-parameter setting. In all simulation scenarios, the Weibull score function is more efficient than the Wilcoxon score, and REML is less efficient than rank-based estimation with either score function. Table 5 and Table 6 display the estimates of bias and MSE of $\hat{\beta}$ for three and six covariates at the third shape-parameter setting. For three covariates, the Weibull score function appeared more efficient than the Wilcoxon score at all sample sizes, and REML is again less efficient than rank-based estimation with either score function. For six covariates, however, both score functions produce quite a similar fit, and all three techniques are equally precise. Table 7 and Table 8 show the estimated bias and MSE of $\hat{\beta}$ for three and six covariates at the fourth shape-parameter setting. For three covariates, the Weibull score function performed more efficiently than the Wilcoxon score function at all sample sizes, while REML is as efficient as rank-based estimation with the Weibull and Wilcoxon score functions. For six covariates, both score functions produce quite a similar fit, and all three techniques are almost equally precise. As the value of the shape parameter increases, the shape of the distribution becomes more symmetric; hence, the Wilcoxon score function approaches the optimal score.

5.1. Contaminated Weibull Distributed Error Term

A contaminated Weibull distribution of the error terms is also considered. The contamination model described in (11) is designed to include outliers and make the shape of the distribution more skewed. Table 9 gives the bias and MSE of $\hat{\beta}$ with three covariates at the first shape-parameter setting. The estimates of precision show that the Weibull score function is more efficient than the Wilcoxon score function and that the rank-based method with either score is far more efficient than REML. McKean and Kloke [13] fitted skew-normal data by the rank-based method with the skew-normal score function and compared the empirical relative efficiencies with those of MLE and the Wilcoxon score. They observed that the most efficient estimates are obtained at the parameter value for which the score function is derived, although the empirical efficiencies do not differ significantly over a nearby range of parameter values. In the same manner, we observe that the estimates obtained with the Weibull score function for the true shape value are more efficient than those for neighboring parameter values. The Weibull score function also appeared more efficient than Wilcoxon and MLE.

From Table 10, for six covariates, it is clear that the Weibull score is more precise than REML, whereas the Wilcoxon score is almost as precise as REML. Moreover, the Weibull score is more efficient than the Wilcoxon score function, as it produces lower estimates of bias and MSE. Table 11 contains the estimates for three covariates at the second shape-parameter setting, where the Weibull and Wilcoxon score functions are almost equally precise. However, the rank-based method appeared more efficient than REML in some instances. In Table 12, for six covariates, the Weibull score appeared more efficient at three of the sample sizes, and REML was found to be precise for some combinations of level-1 and level-2 sample sizes.

Table 13 displays the estimated bias and MSE of $\hat{\beta}$ with three covariates at the third shape-parameter setting. The Weibull score function yields the smallest estimates and thus leads to the most efficient analysis, while rank-based analysis with the Wilcoxon score function and REML produce an almost equal fit. Table 14, for six covariates, shows that the Weibull score is better than the Wilcoxon score, although in most instances both score functions produce quite a similar fit. Similar behavior is seen when the efficiency of the rank-based methods is compared with REML. Table 15 presents the estimates of bias and MSE of $\hat{\beta}$ with three covariates at the fourth shape-parameter setting. At most sample sizes, the Weibull score is better than the Wilcoxon score, and sometimes both techniques perform equally well. Rank-based estimation and REML are found to be equally precise. Table 16, for six covariates, indicates that the Weibull score is as efficient as the Wilcoxon score at all sample sizes. Similar behavior is seen when the efficiency of the rank-based methods is compared with REML: both techniques produce quite a similar fit. Overall, for three covariates the Weibull score is found to be more efficient than the Wilcoxon score, whereas for six covariates both score functions show almost the same precision.

5.2. Real Data Application

The dataset “CASchools”, taken from [34], contains information on test performance, school characteristics, and student demographic backgrounds for all 420 K-6 and K-8 districts in California with data available for 1998 and 1999. The reading score is the response variable, and three explanatory variables are included in the model: the percentage of students who are English learners (a student demographic variable), the expenditure per student spent by the school (in thousands of US dollars), and the student-teacher ratio (a school characteristic). The data are analyzed to verify the simulation results because the response variable is Weibull distributed with shape parameter 2.60825 and scale parameter 6.67620. A multilevel model has been fitted to these data through REML and through the rank-based method with the Wilcoxon and Weibull score functions. The estimates of fixed effects, SE, and value are summarized in Table 17.

As the number of students who are English learners (English is not their first language) increases, the average reading score declines significantly. An improvement in the student-teacher ratio does not change reading scores considerably. The amount the school spends on each student is significant in rank-based estimation but nonsignificant in REML. With the Weibull score function, the minimum SE is observed for all explanatory variables. The estimate of the scale parameter is the smallest for the Weibull score (14.923) among all three methods.

6. Conclusion

We proposed a score function for the rank-based estimation of the multilevel model when the error terms follow the Weibull distribution. To reach the optimal analysis, we compared the performance of the Weibull and Wilcoxon score functions with the traditional REML method through a simulation study and a real application. A random intercept multilevel model with three and six covariates is analyzed by the rank-based method for several level-1 and level-2 sample sizes under Weibull and contaminated Weibull distributed error terms. The estimates of bias, MSE, and precision of the three methods, i.e., Weibull, Wilcoxon, and REML, are computed for each simulation scenario. For estimates of $\hat{\beta}$ with three covariates, the Weibull score function is found to be more precise than the Wilcoxon score for the shape-parameter values considered, whereas for six covariates, both score functions show the same precision. REML appeared the least efficient of the three methods in the simulation scenarios. For the contaminated Weibull distribution, quite similar behavior is observed: for three covariates, the Weibull score function is mostly more efficient than the Wilcoxon score function, whereas for six covariates, both score functions show equal efficiency. For small values of the shape parameter, the Weibull score function is the most efficient estimation method because the distribution is skewed. On the other hand, for large values of the shape parameter, the Wilcoxon score function performs as well as the Weibull score function because the distribution is close to symmetric. Moreover, the real application also shows that the Weibull score function produces smaller variation than the Wilcoxon score and REML methods.

7. Limitations and Future Recommendations

Limitations: the Weibull score function is derived only for the Weibull distribution of the error term and for cluster-correlated error terms in linear models. In this study, the Weibull score function is applied only to random intercept multilevel models.

Future recommendations: variants of the Weibull score function could be derived for special cases of the Weibull distribution. The score function could also be derived for other dependent error structures, such as error terms following a time-series AR(1) model. Its application could be extended to multilevel models with both random intercepts and random slopes.

Data Availability

No data were used to support the findings of this study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.