Abstract

Maximum likelihood estimation (MLE) is often used to estimate the parameters of the circular logistic regression model because of its efficiency under a parametric model. However, evidence shows that the classical MLE is strongly affected by outliers. This article discusses the effect of outliers on circular logistic regression and extends four robust estimators, namely, the Mallows, Schweppe, Bianco and Yohai (BY), and weighted BY (WBY) estimators, to the circular logistic regression model. These estimators have been successfully used in linear logistic regression models for the same purpose. The four proposed robust estimators are compared with the classical MLE through simulation studies. They demonstrate satisfactory finite sample performance in the presence of misclassification errors and leverage points. Meteorological and ecological datasets are analyzed for illustration.

1. Introduction

Circular data arise whenever the values of a random variable can be represented on the circumference of the unit circle. They are measured as angles with values between 0 and $2\pi$ or between $-\pi$ and $\pi$. Examples include wind directions, animal navigation, and any periodic phenomenon, such as times on a 24-hour clock or days of the year, that can be converted to circular data [1]. The modeling of the relationship between circular variables is called “circular regression,” and it is classified into three main classes, namely, circular-circular, circular-linear, and linear-circular regression models [2].
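The conversion of periodic measurements to angles mentioned above can be sketched as follows. This is a minimal illustration only; the function names are hypothetical and do not come from the cited references.

```python
import math

def hour_to_angle(hour: float) -> float:
    """Map a time on a 24-hour clock to an angle in [0, 2*pi)."""
    return (hour % 24.0) / 24.0 * 2.0 * math.pi

def day_to_angle(day_of_year: int, days_in_year: int = 365) -> float:
    """Map day 1..days_in_year to an angle in [0, 2*pi); day 1 maps to 0."""
    return ((day_of_year - 1) % days_in_year) / days_in_year * 2.0 * math.pi

def wrap_to_pi(angle: float) -> float:
    """Re-express any angle (in radians) in the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi
```

For instance, 6:00 on a 24-hour clock corresponds to a quarter turn, and an angle of $3\pi/2$ is the same direction as $-\pi/2$.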

Applications of circular regression models are widespread in many applied fields. Several regression models have been proposed to predict a continuous circular variable from other circular or linear predictors [3–6].

Logistic regression analysis is a useful statistical tool for analyzing the relationship between a binary response and a predictor. The theory of logistic regression is well developed [7]. Daffaie and Khan [8] proposed the circular logistic model to predict a binary variable from a circular variable, for example, rainfall (yes, no) from wind direction or fatal road accidents (yes, no) from the time of the accident.

The existence of outliers is a common problem in regression analysis. For the linear logistic model, Feser and Pia [9] showed that maximum likelihood estimation can be influenced by outliers. Croux et al. [10] found that the most dangerous outliers, termed “bad leverage points,” are misclassified observations that are outlying in the design space of the predictor variables.

Circular logistic regression is also subject to outliers, as shown in [11], whose authors proposed an outlier detection procedure based on penalized maximum likelihood and applied it to several real datasets.

Several robust estimators that are less affected by outliers have been proposed in the literature to improve estimation in linear logistic regression models [12]. The authors in [13] introduced weights depending on the response and covariates and proposed the Mallows-type estimator. This estimator was analyzed in depth by Carroll and Pederson [14], who used the Mahalanobis distance to downweight observations with high leverage, and Bianco and Yohai [15] proposed estimators based on a bounded function of the deviance.

Since published work has considered only the detection of outliers in the circular logistic regression model, this article addresses the problem of outliers in this model by extending some robust estimators from the classical linear logistic case to the circular logistic case.

The rest of this paper is organized as follows. Section 2 reviews the formulation of the circular logistic regression model and its parameter estimation via MLE. Section 3 presents the types of outliers in circular logistic regression and derives the proposed robust estimators for the circular logistic regression model. Section 4 discusses the effect of outliers on the circular logistic estimators by computing the influence functions. Section 5 investigates the performance of the considered robust estimators. Section 6 applies the proposed estimators to meteorological and ecological datasets. Section 7 provides the conclusion.

2. Model Formulation

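A circular logistic regression describes the relationship between a binary response and circular predictors. It has potential applications in the environmental sciences. The authors in [8] assumed that binomial observations $y_i$ with a probability of success, $p_i$, for $i = 1, \dots, n$, depend on a circular random variable $\theta_i$, and the proposed model is given as follows:

$$\operatorname{logit}(p_i) = \log\frac{p_i}{1 - p_i} = \beta_0 + \beta\cos(\theta_i - \theta_0), \tag{1}$$

where $\beta_0$ is the value of the logit (log odds) when the cosine term vanishes and $\theta_0$ is the angle where the logit reaches its highest value. Let $\beta_1 = \beta\cos\theta_0$ and $\beta_2 = \beta\sin\theta_0$, and equation (1) can be written as

$$\operatorname{logit}(p_i) = \beta_0 + \beta_1\cos\theta_i + \beta_2\sin\theta_i. \tag{2}$$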
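To make the two parameterizations concrete, the following sketch evaluates the success probability from the linear form of the logit and recovers the amplitude and peak angle from the two trigonometric coefficients. It assumes the logit is linear in $\cos\theta$ and $\sin\theta$; the names `beta0`, `beta1`, `beta2` and both function names are illustrative choices, not notation confirmed by [8].

```python
import math

def success_probability(theta, beta0, beta1, beta2):
    """P(y = 1 | theta) when logit(p) = beta0 + beta1*cos(theta) + beta2*sin(theta)."""
    eta = beta0 + beta1 * math.cos(theta) + beta2 * math.sin(theta)
    return 1.0 / (1.0 + math.exp(-eta))

def peak_direction(beta1, beta2):
    """Return (amplitude, theta0) such that
    beta1 = amplitude*cos(theta0) and beta2 = amplitude*sin(theta0).
    theta0 is the angle at which the logit is largest."""
    amplitude = math.hypot(beta1, beta2)
    theta0 = math.atan2(beta2, beta1) % (2.0 * math.pi)
    return amplitude, theta0
```

With `beta1 = 0` and `beta2 = 2`, for example, the logit peaks at a quarter turn, and the probability of success is highest there.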

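Suppose that binomial data of the form $y_i$ successes out of $n_i$ trials are observations from a binomial distribution; the likelihood function is then given by

$$L(\beta_0, \beta_1, \beta_2) = \prod_{i=1}^{n} \binom{n_i}{y_i}\, p_i^{\,y_i} (1 - p_i)^{\,n_i - y_i}. \tag{3}$$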

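Let $\eta_i = \beta_0 + \beta_1\cos\theta_i + \beta_2\sin\theta_i$, and by using the exponential function, we obtain

$$p_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}}. \tag{4}$$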

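The maximum likelihood estimation is classically used for parameter estimation and is defined by an objective function as

$$\hat{\boldsymbol{\beta}}_{\mathrm{MLE}} = \arg\max_{\boldsymbol{\beta}} \sum_{i=1}^{n} \ell_i(\boldsymbol{\beta}), \qquad \boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2)^{\top}, \tag{5}$$

where

$$\ell_i(\boldsymbol{\beta}) = \log\binom{n_i}{y_i} + y_i \log p_i + (n_i - y_i)\log(1 - p_i). \tag{6}$$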

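The maximum likelihood equations are given as follows:

$$\sum_{i=1}^{n} (y_i - n_i p_i) = 0, \tag{7}$$

$$\sum_{i=1}^{n} (y_i - n_i p_i)\cos\theta_i = 0, \tag{8}$$

$$\sum_{i=1}^{n} (y_i - n_i p_i)\sin\theta_i = 0. \tag{9}$$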

These equations are solved iteratively by using the Newton–Raphson method. Recently, Abuzaid and ElShekh Ahmed [11] used the penalized maximum likelihood estimator () to identify outliers in the circular logistic regression model and investigated its performance via simulation. The following section discusses some possible robust estimators for the circular logistic regression model.
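The Newton–Raphson iteration mentioned above can be sketched as follows for the binary case (one trial per observation). This is a minimal illustration under the assumption that the logit is `b0 + b1*cos(theta) + b2*sin(theta)`; it is not the authors' implementation, and the function names are hypothetical. It uses no safeguards against separation or a singular information matrix.

```python
import math

def _solve3(a, b):
    """Solve the 3x3 linear system a x = b by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / m[i][i]
    return x

def fit_circular_logistic(thetas, ys, max_iter=50, tol=1e-10):
    """Newton-Raphson MLE for binary y with logit = b0 + b1*cos(theta) + b2*sin(theta)."""
    beta = [0.0, 0.0, 0.0]
    for _ in range(max_iter):
        score = [0.0] * 3                      # gradient of the log-likelihood
        info = [[0.0] * 3 for _ in range(3)]   # Fisher information matrix
        for theta, y in zip(thetas, ys):
            x = (1.0, math.cos(theta), math.sin(theta))
            eta = sum(b * xj for b, xj in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            for j in range(3):
                score[j] += (y - p) * x[j]
                for k in range(3):
                    info[j][k] += p * (1.0 - p) * x[j] * x[k]
        step = _solve3(info, score)            # Newton step: info^{-1} * score
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta
```

At convergence the three score components are numerically zero, which is the condition expressed by the maximum likelihood equations.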

3. Robust Estimators for Circular Logistic Regression

3.1. Outliers in Circular Logistic Regression

This section distinguishes different cases of outlying observations in circular logistic regression, where outliers might occur in the dependent variable, the independent variable, or both.

For binary data, all the $y_i$ values are either 0 or 1. Hence, an error in the $y$ direction can only occur as a transposition of 0 to 1 or vice versa. This type of outlier is known as a residual outlier or misclassification-type error [16,17].

A leverage outlier or leverage point occurs when the circular observation at position $d$ (e.g., $\theta_d$) is contaminated as follows: $\theta_d^{*} = \theta_d + \lambda\pi \pmod{2\pi}$, where $\theta_d^{*}$ is the value after contamination and $\lambda$ is the degree of contamination in the range $0 \le \lambda \le 1$. A leverage point can be considered a good leverage point when $y_d = 1$ with a large value of $p_d$, while it is a bad leverage point when $y_d = 1$ with a small value of $p_d$, and vice versa. Abuzaid and ElShekh Ahmed [11] considered the misclassification-type error outlier in their simulation study without investigating the detection of leverage points.
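The two kinds of contamination described in this section can be sketched as follows. The leverage shift used here, adding `lam * pi` to the angle and reducing modulo $2\pi$, is an assumption made for illustration; the function names and the `rng` argument are likewise hypothetical.

```python
import math
import random

def flip_responses(ys, fraction, rng):
    """Misclassification-type error: flip a random fraction of binary responses."""
    ys = list(ys)
    k = round(fraction * len(ys))
    idx = rng.sample(range(len(ys)), k)
    for i in idx:
        ys[i] = 1 - ys[i]
    return ys, sorted(idx)

def shift_angles(thetas, fraction, lam, rng):
    """Leverage-type contamination (assumed form): shift a random fraction
    of the angles by lam*pi and wrap the result to [0, 2*pi)."""
    thetas = list(thetas)
    k = round(fraction * len(thetas))
    idx = rng.sample(range(len(thetas)), k)
    for i in idx:
        thetas[i] = (thetas[i] + lam * math.pi) % (2.0 * math.pi)
    return thetas, sorted(idx)
```

Applying both functions to the same indices would produce observations that are outlying in the angle and also misclassified.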

Alternatively, robust estimators are commonly used to handle the problem of outliers in logistic models. The following sections present four robust estimators for the parameters of the circular logistic model.

3.2. Circular Mallows Class

The proposed estimator () is extended from the Mallows class in [18] to the circular logistic model for weighting the maximum likelihood estimator.

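Assume $F$ is a continuous and increasing distribution function, and the model is given by

$$P(y_i = 1 \mid \theta_i) = F(\eta_i), \qquad \eta_i = \beta_0 + \beta_1\cos\theta_i + \beta_2\sin\theta_i. \tag{10}$$

Then, for binary responses and the logistic $F$, the partial derivatives in (7)–(9) become

$$\sum_{i=1}^{n} \bigl(y_i - F(\eta_i)\bigr), \qquad \sum_{i=1}^{n} \bigl(y_i - F(\eta_i)\bigr)\cos\theta_i, \qquad \sum_{i=1}^{n} \bigl(y_i - F(\eta_i)\bigr)\sin\theta_i, \tag{11}$$

respectively. The robust estimates for the circular logistic regression model in equation (10) are given by the solution of $(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2)$ obtained by

$$\sum_{i=1}^{n} w_i \bigl[y_i - F(\eta_i) - c(\theta_i, \boldsymbol{\beta})\bigr] = 0, \tag{12}$$

$$\sum_{i=1}^{n} w_i \bigl[y_i - F(\eta_i) - c(\theta_i, \boldsymbol{\beta})\bigr]\cos\theta_i = 0, \tag{13}$$

$$\sum_{i=1}^{n} w_i \bigl[y_i - F(\eta_i) - c(\theta_i, \boldsymbol{\beta})\bigr]\sin\theta_i = 0, \tag{14}$$

where $w_i$ are the weights that may depend on $\theta_i$, $y_i$, or both and $c(\cdot)$ is a correction function needed to ensure consistency. If $w_i = 1$ and $c = 0$, then equations (12)–(14) give the usual circular logistic regression estimate. If $w_i = w(\theta_i)$ and $c = 0$, then the weights depend only on $\theta_i$.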
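A weighted score of the Mallows type can be evaluated as in the sketch below. It assumes the logistic link, a zero correction function, and weights that are supplied by the caller rather than computed; the function name and argument layout are illustrative only.

```python
import math

def weighted_score(beta, thetas, ys, weights):
    """Return the three components of sum_i w_i * (y_i - p_i) * (1, cos, sin),
    i.e. a Mallows-type score with the correction function set to zero."""
    score = [0.0, 0.0, 0.0]
    for theta, y, w in zip(thetas, ys, weights):
        x = (1.0, math.cos(theta), math.sin(theta))
        eta = sum(b * xj for b, xj in zip(beta, x))
        p = 1.0 / (1.0 + math.exp(-eta))
        for j in range(3):
            score[j] += w * (y - p) * x[j]
    return score
```

With all weights equal to one this is the ordinary maximum likelihood score; setting a weight to zero removes that observation's contribution entirely.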

3.3. Circular Schweppe Class

Stefanski [19] stated that is robust Mahalanobis distance for the vector that depends on the covariance matrix of the regression model, which is given bywhere is a diagonal matrix with and is the probability of ; then, the Mahalanobis distance is given by

If , then the estimator is the same as the linear logistic Schweppe class proposed in [13]. Here, depends on and circular . This estimator is called the circular conditionally unbiased bounded influence function, or CUBIF, estimator.

3.4. Circular BY Estimators

Bianco and Yohai [15] proposed a robust estimator for the linear logistic model; in this section, we extend it to the circular case and refer to it as the circular BY estimator.

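Let the MLE be obtained by minimizing the deviance,

$$\hat{\boldsymbol{\beta}}_{\mathrm{MLE}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} d(y_i, \eta_i), \tag{17}$$

where $d(y_i, \eta_i) = -y_i \log F(\eta_i) - (1 - y_i)\log\bigl(1 - F(\eta_i)\bigr)$ and $\eta_i = \beta_0 + \beta_1\cos\theta_i + \beta_2\sin\theta_i$. By replacing the deviance function in (17) with function $\rho(d)$, the robust estimator is defined by

$$\hat{\boldsymbol{\beta}}_{\mathrm{BY}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} \Bigl[\rho\bigl(d(y_i, \eta_i)\bigr) + G\bigl(F(\eta_i)\bigr) + G\bigl(1 - F(\eta_i)\bigr)\Bigr], \tag{18}$$

where $\rho$ is a bounded, differentiable, and nondecreasing function defined in [15] and given by

$$\rho(t) = \begin{cases} t - \dfrac{t^{2}}{2c}, & t \le c, \\[4pt] \dfrac{c}{2}, & \text{otherwise}, \end{cases} \tag{19}$$

where $c$ is a positive number, $G(t) = \int_{0}^{t} \psi(-\log u)\,du$, and $\psi(t) = \rho'(t)$.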
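The effect of bounding the deviance can be illustrated with the sketch below. It uses the form of the Bianco and Yohai function that grows as `t - t**2/(2c)` up to `c` and is constant afterwards; the default `c = 0.5` is an illustrative tuning value, not one taken from this article, and the consistency correction term of the full estimator is omitted.

```python
import math

def deviance_component(y, p):
    """Deviance contribution of one binary observation with fitted probability p."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

def by_rho(t, c=0.5):
    """Bounded loss: t - t^2/(2c) for t <= c, and the constant c/2 beyond c."""
    if t <= c:
        return t - t * t / (2.0 * c)
    return c / 2.0

def by_psi(t, c=0.5):
    """Derivative of by_rho: 1 - t/c for t <= c, and 0 beyond c."""
    return 1.0 - t / c if t <= c else 0.0
```

A grossly misclassified observation, such as `y = 1` with a fitted probability near zero, has an arbitrarily large deviance but contributes at most `c/2` to the bounded loss.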

3.5. Circular WBY Estimators

Extending the BY estimator by including weights reduces the influence of outliers in the predictor space. This weighted BY (WBY) estimator is extended to the circular logistic regression model and is defined by multiplying each observation's term in the BY objective function by a weight. The weights are a decreasing function of the robust Mahalanobis distances (RD), which are computed using the minimum covariance determinant (MCD) estimator (see [20]).
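One common way to turn robust distances into weights is a hard-rejection rule, sketched below. Several assumptions are made here that are not stated in the text: the robust distances are taken as already computed (the MCD step is not shown), the circular predictor is treated as the two-dimensional vector of its cosine and sine, and the cutoff is the 0.975 quantile of a chi-square distribution with 2 degrees of freedom. The function names are hypothetical.

```python
import math

def chi2_quantile_2df(q):
    """Quantile of the chi-square distribution with 2 degrees of freedom.
    For 2 df the CDF is 1 - exp(-x/2), so the quantile has a closed form."""
    return -2.0 * math.log(1.0 - q)

def hard_rejection_weights(robust_distances, q=0.975):
    """Weight 1 if the squared robust distance is within the cutoff, else 0."""
    cutoff = chi2_quantile_2df(q)
    return [1.0 if rd * rd <= cutoff else 0.0 for rd in robust_distances]
```

Smoother decreasing weight functions are equally possible; the hard-rejection form is shown only because it is simple to state.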

4. Influence Function

Suppose two source populations with -dimensional circular variables, which are both von Mises distribution with different means but the same concentration parameter, . Circular variable can arise from one of these populationswhere and . Let the binary variable indicates the source population of the corresponding , then
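The link between the two-population setup and the circular logistic model can be made explicit with a short derivation. The notation below is assumed for illustration and is not taken from the original text: prior probabilities $\pi_0$ and $\pi_1$, mean directions $\mu_0$ and $\mu_1$, a common concentration parameter $\kappa$, and the von Mises density $f(\theta; \mu, \kappa) = e^{\kappa\cos(\theta - \mu)} / (2\pi I_0(\kappa))$. Because both populations share $\kappa$, the normalizing constants cancel in the log odds.

```latex
\begin{aligned}
\log\frac{P(y = 1 \mid \theta)}{P(y = 0 \mid \theta)}
  &= \log\frac{\pi_1 f(\theta; \mu_1, \kappa)}{\pi_0 f(\theta; \mu_0, \kappa)} \\
  &= \log\frac{\pi_1}{\pi_0}
     + \kappa\left[\cos(\theta - \mu_1) - \cos(\theta - \mu_0)\right] \\
  &= \beta_0 + \beta_1\cos\theta + \beta_2\sin\theta,
\end{aligned}
\qquad
\begin{aligned}
\beta_0 &= \log\frac{\pi_1}{\pi_0}, \\
\beta_1 &= \kappa\,(\cos\mu_1 - \cos\mu_0), \\
\beta_2 &= \kappa\,(\sin\mu_1 - \sin\mu_0).
\end{aligned}
```

Under these assumptions the posterior log odds are linear in $\cos\theta$ and $\sin\theta$, which is the form of the circular logistic model.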

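Let the joint distribution of $(y, \theta)$ be denoted by $H$ and let $T$ be an estimator of the circular logistic parameters. Then, the influence function [21] is defined as

$$\mathrm{IF}\bigl((y, \theta); T, H\bigr) = \lim_{\varepsilon \to 0} \frac{T\bigl((1 - \varepsilon)H + \varepsilon\Delta_{(y, \theta)}\bigr) - T(H)}{\varepsilon},$$

where $\Delta_{(y, \theta)}$ is the point mass at $(y, \theta)$.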

If , as shown in equations (7)–(9), then the influence function is an unbounded function for spaces and . Specifically, a small amount of contamination in the training data due to the presence of possible outlier in or intensively affects the , as shown in the simulation section.

Suppose , where the weights depend on the robust distance of observation , and robust distance is equal to the Mahalanobis distance of to the center of the data cloud. This condition reduces the influence of outlying observations in the space. Thus, is bounded with respect to but unbounded with respect to . Similar conclusions as can be derived.

If or , which adds a weight to , then a fully bounded influence is obtained.

5. Simulation Study

5.1. Settings

This simulation aims to compare the robustness of the proposed robust estimators and the classical MLE. The independent circular variable $\theta$ is generated from the von Mises distribution with mean $\mu$ and concentration parameter $\kappa$ = 1, 2, 6, 10, and 15, with sample size , and 300; a large sample size is chosen to avoid separation problems. The true parameter values are .
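A data-generating step of this kind can be sketched with the Python standard library, which provides a von Mises sampler. The coefficient values, mean direction, and seed used in the usage note are placeholders chosen for illustration; they are not the settings of this study, and the function name is hypothetical.

```python
import math
import random

def simulate_circular_logistic(n, mu, kappa, beta, rng):
    """Draw n angles from a von Mises(mu, kappa) distribution and binary
    responses with logit = beta[0] + beta[1]*cos(theta) + beta[2]*sin(theta)."""
    thetas = [rng.vonmisesvariate(mu, kappa) for _ in range(n)]
    ys = []
    for theta in thetas:
        eta = beta[0] + beta[1] * math.cos(theta) + beta[2] * math.sin(theta)
        p = 1.0 / (1.0 + math.exp(-eta))
        ys.append(1 if rng.random() < p else 0)
    return thetas, ys
```

For example, `simulate_circular_logistic(300, 0.0, 2.0, (0.5, 1.0, -0.5), random.Random(42))` returns 300 angles in radians and the matching 0/1 responses.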

The simulation study covers a variety of situations. Initially, data without contamination are simulated. The robustness of all estimators is then examined under three contamination schemes. First, a proportion of the responses is chosen randomly, and each chosen response is changed from 0 to 1 or from 1 to 0. This process constitutes the misclassification-type error. For each contaminated case, 5%, 10%, 20%, 30%, and 40% of the original data are contaminated. Second, the same proportions are taken to contaminate with for good leverage points. Finally, the same proportions are considered, and the generated data are contaminated with the two types of outliers simultaneously. This process constitutes bad leverage points.

Each simulation includes 1000 replications. The performance of the estimators is evaluated based on the bias and the median squared error (MSE) for each parameter, which are defined as follows:
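The two performance measures can be computed from the replicated estimates of a single parameter as in the sketch below. It assumes that the bias is the mean of the estimates minus the true value and that the median squared error is the median of the squared deviations from the true value; the exact definitions used in the study may differ, and the function names are illustrative.

```python
from statistics import mean, median

def bias(estimates, true_value):
    """Average estimate over the replications minus the true parameter value."""
    return mean(estimates) - true_value

def median_squared_error(estimates, true_value):
    """Median over the replications of the squared estimation error."""
    return median([(e - true_value) ** 2 for e in estimates])
```

Using the median rather than the mean of the squared errors keeps a few divergent replications from dominating the summary.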

A good estimator has bias and MSE that are relatively small or close to zero.

The simulation used the standard “Robust” package in R to obtain the estimators and the “CircStats” package to generate the circular variable $\theta$.

5.2. Results

The bias and of the five estimators are shown in Tables 1 to 7. Table 1 shows the results for uncontaminated data (i.e., clean data), where the biases and of all five estimators are fairly close to each other. However, the ’s is larger than the others for large concentration parameter (  = 6, 10, and 15). Hence, performs worse compared with the other estimators in this situation.

Tables 2 and 3 show the results of data with misclassified errors. The bias and of the estimates are immediately affected by 5% misclassified type error. The results suggest that the becomes biased with 5% contamination, with 20% contamination, and with 30% contamination. and are good robust estimators and considered as the best methods.

As shown in Tables 4 and 5, only small differences are found between the classical MLE and the robust methods when the data are contaminated with good leverage points, and the biases and MSE for all methods are relatively small. Hence, good leverage points do not noticeably affect the estimates.

The results in the case of bad leverage points change the performance of estimators intensively. As shown in Tables 6 and 7, the estimator can only withstand up to 5% contamination. The estimator can tolerate up to 30% contamination. However, the performance of and is equally good: they can withstand up to 40% contamination.

6. Application to Real Data

Two real datasets are considered to illustrate the performance of the proposed robust estimators. The estimated coefficients are presented with their standard errors (SE) and p values. A good estimator has a small SE and low p values. The analyzed datasets are available from the corresponding author upon request.

6.1. Meteorological Data

A total of 365 daily observations of rainfall (yes, no) and wind direction, measured in degrees at 0.00 a.m., were recorded from January 1 to December 31, 2019. The data were obtained from the Palestinian Meteorological Department for the Hebron meteorology station in the West Bank, Palestine, with geographical coordinates (latitude =  32 N and longitude =  06 W) and an elevation of 1005 m above sea level. Forty-eight rainy days were reported in 2019 in Hebron governorate. Figure 1 shows the circular plot of the wind direction data in Hebron.

Figure 2 suggests that there are no outliers in the data (clean dataset). Abuzaid and ElShekh Ahmed [11] did not identify any outliers with their method and attributed that to the low concentration parameter of the wind directions ().

Table 8 shows the estimated parameters and the estimated SE for the different estimators. As observed in Table 8, the estimates from all five estimators are fairly close to each other. Thus, the model of the relation between the rainfall as a binary response and the wind directions as a circular predictor of the considered dataset is given by

6.2. Ecological Data

Leaf inclination angles are extremely important for modeling light transmission through vegetation canopies and for indirectly quantifying canopy attributes such as the leaf area index and the G-function. Two hundred measurements of leaf inclination angles of Betula pendula trees were taken on May 9, 2013, and the canopy position of each (top or bottom) was recorded [22].

Figure 3 shows the circular plot of leaf inclination angles, and Figure 4 shows the scatter plot of the data. Six observations far from the mass of directions were identified as outliers by Abuzaid and ElShekh Ahmed [11]. Table 9 presents the parameter estimates, , and values for the various robust estimators where the and estimates are reasonably close to the . However, and have the smallest values. Therefore, and estimators give the best results for this dataset. Thus, the model of the relation between the vegetation canopy () as a binary response and the leaf inclination angles () as a circular predictor of the considered dataset is given by

7. Conclusion

This paper proposed robust estimators for the circular logistic regression model. We compared the performance of the MLE and the proposed robust estimators for the circular logistic model under clean and contaminated datasets. The findings indicate that the MLE tends to be biased in the presence of misclassification errors and bad leverage points. The proposed robust estimators perform well in estimating the parameters of such models. Some robust estimators, such as and , show superiority over others depending on the type of contamination.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.