Abstract

The presence of outliers can result in seriously biased parameter estimates. In order to detect outliers in panel data models, this paper presents a modeling method to assess the intervention effects based on the variance of remainder disturbance using an arbitrary strictly positive twice continuously differentiable function. This paper also provides a Lagrange Multiplier (LM) approach to detect and identify a general type of outlier. Furthermore, fixed effects models and random effects models are discussed to identify outliers and the corresponding LM test statistics are given. The LM test statistics for an individual-based model to detect outliers are given as a particular case. Finally, this paper performs an application using panel data and explains the advantages of the proposed method.

1. Introduction

Outliers are observations in a dataset that appear unusual and discordant. If the sample contains outliers, estimates obtained from the contaminated observations may be strongly distorted and lead to unreliable results. Intervention effects may cause serious bias in parameter estimates, as explained in the work of Fox [1], Martin and Yohai [2], Chang et al. [3], and Verardi and Croux [4]. Therefore, it is very important to identify these outliers in large datasets in both the natural and social sciences, in disciplines such as engineering, biology, education, medicine, economics, and sociology.

Detection of potential outliers plays a very important role in obtaining valuable and accurate information, particularly in the field of engineering. An outlier among engineering observations may be due to an error in data transmission or human error in measurement. Several experts in engineering have noted the importance of identifying outliers. Febrero et al. [5] highlighted that the analysis of outliers is an important aspect of any statistical analysis of data and that it is especially important to identify days or periods in which the levels are significantly large. Outlier identification is also important for analyzing the traffic volume data collected and used for a variety of purposes in intelligent transportation systems [6], and in ecological engineering, where identification of atypical observations is an important concern in water quality monitoring [7].

Early research on outlier detection focused on outliers in time series. Several influential studies have contributed to the detection and identification of outliers [1, 3, 8–13]. Fox [1] proposed a fundamental approach for detecting and identifying outliers in a time series model. For linear regression models, a general approach is to employ the mean-shift outlier model together with test statistics equivalent to the externally studentized residuals [8, 9]. Chang et al. [3] provided an iterative procedure to detect and identify outliers. Two distinct kinds of outliers are considered, namely, the additive outlier (AO) and the innovation outlier (IO), and the outliers in time series are regarded as being generated by dynamic intervention models at unknown time points. In recent years, researchers have turned to outlier detection in more complicated models. Shi and Chen [14] developed outlier detection for multilevel models; the proposed test can be used to detect outliers at any level of a multilevel model and for any combination of units. Chen [15] provided an approach to estimate the panel data model with a mixed fractional ARIMA remainder process, in which the data may contain different types of outliers. By the modified inverse Fourier transform, the outliers for the spectral Whittle approach can be quickly detected and identified. Willems et al. [16] diagnosed multivariate outliers. Riani et al. [17] addressed the problem of finding an unknown number of multivariate outliers. Cerioli [18] developed multivariate outlier tests based on the high-breakdown Minimum Covariance Determinant estimator. Yan [19] proposed a novel method integrating the self-organizing map (SOM) with the adaptive nonlinear map (ANLM) for multivariate outlier detection. Yuen and Mu [20] proposed a novel probabilistic method for robust parametric identification and outlier detection in linear regression problems. The crux of this method is to calculate the probability of outlier, which quantifies how probable it is that a data point is an outlier. Rapallo [21] made use of log-linear models and exact goodness-of-fit tests to specify the notions of outlier and pattern of outliers. Kuhnt et al. [22] introduced a new technique for the detection of outliers in contingency tables, where outliers are unusual cell counts with respect to classical log-linear Poisson models.

It is very difficult to identify outliers directly by eye, especially when faced with a panel data model with unfamiliar features and large, uncertain datasets. As pointed out in Bramati and Croux [23], outliers are not always detectable by looking at residuals from a least squares fit, and diagnostic measures like the Cook distance suffer from the masking effect as soon as multiple outliers are present. Not much effort has been devoted to the diagnostics and influence assessment of outliers in panel data models. Although a few researchers are aware of this issue, there is little literature on the detection of outliers in panel data models.

This paper focuses on the detection and identification of outliers in panel data models with individual effects. Because of the presence of individual effects, the traditional mean-shift model cannot distinguish individual effects from a shift in the mean of the disturbance, and thus cannot be applied to detect outliers in panel data models with individual effects. Outliers in a panel data model are likely to contaminate the residuals, which means that the variance of the error term probably deviates substantially in the outlier model. Therefore, this paper proposes a variance intervention effects model for outlier detection, in which the intervention acts on the variance of the remainder disturbance through an arbitrary strictly positive, twice continuously differentiable function. As noted in Baltagi [24], even if the error term were observable, the equations defining the maximum likelihood (ML) estimator would still be highly nonlinear and difficult to solve explicitly. Test statistics based on the Lagrange Multiplier (LM) approach are therefore derived, since the LM test relies only on parameter estimates under the null hypothesis and is simple to compute, requiring only residuals. The paper considers a more general type of outlier that has specific impacts on subsequent observations, of which the individual outlier model is a particular case. The test statistics for a general type of outlier and for an individual outlier are each derived through the LM approach. Furthermore, outlier detection in fixed effects models and random effects models is demonstrated with the corresponding LM test statistics.

The rest of the paper is organized as follows. Section 2 briefly presents panel data models with individual effects. Section 3 proposes the variance intervention effects outlier model based on the remainder disturbance. Section 4 describes the maximum likelihood estimator. Section 5 provides an LM testing approach for the detection and identification of a general type of outlier; fixed effects and random effects models with outliers are also discussed, the corresponding LM test statistics are given, and the LM test statistics of an individual outlier model are obtained as a particular case. Section 6 applies the proposed method to panel data and explains its advantages. Finally, Section 7 provides concluding remarks. Proofs of the main results are provided in the appendix.

2. Panel Data Models with Individual Effects

First, consider the following panel data model with individual effects:

$$y_{it} = x_{it}'\beta + \mu_i + v_{it}, \quad i = 1, \ldots, N, \; t = 1, \ldots, T, \tag{1}$$

where the subscript $i$ denotes individuals and $t$ denotes time, so that $i$ indexes the cross-section dimension and $t$ the time series dimension. Here $y_{it}$ is the dependent variable observed for individual $i$ at time $t$, $x_{it}$ is a column vector of observable independent variables, $\beta$ is a column vector of regression parameters, $\mu_i$ is the unobserved time-invariant individual effect, and $v_{it}$ denotes the remainder disturbance term, which is uncorrelated over time and across cross-sectional units.

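The panel data model can be written in matrix form as

$$y = X\beta + Z_{\mu}\mu + v, \tag{2}$$

where $y$ is an $NT \times 1$ vector of the dependent variable, $X$ is an $NT \times K$ matrix, $K$ is the number of explanatory variables, $Z_{\mu} = I_N \otimes \iota_T$, $\iota_T$ is a $T \times 1$ vector of ones, $\mu$ is the vector of individual effects coefficients, $\otimes$ is the Kronecker product, and $v$ is an $NT \times 1$ vector.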
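The stacking behind the matrix form can be illustrated with a short sketch. It assumes the usual convention that observations are ordered by individual and then by time, so that the individual-effects design matrix is the Kronecker product of an N-dimensional identity matrix with a T-vector of ones; the function names and the toy values are hypothetical and serve only as illustration.

```python
def individual_effects_design(n_individuals, n_periods):
    """Return I_N kron iota_T as a list of N*T rows with N columns each."""
    rows = []
    for i in range(n_individuals):
        for _ in range(n_periods):
            # Every observation of individual i selects the i-th effect.
            rows.append([1.0 if j == i else 0.0 for j in range(n_individuals)])
    return rows


def stack_effects(design, effects):
    """Multiply the design matrix by the vector of individual effects."""
    return [sum(z * m for z, m in zip(row, effects)) for row in design]


# Toy panel: N = 2 individuals, T = 3 periods, hypothetical effects.
Z = individual_effects_design(2, 3)
stacked = stack_effects(Z, [0.5, -1.0])
print(stacked)  # each individual's effect is repeated T times
```

Each individual effect is repeated once per time period, which is why a shift in the mean of one individual's disturbances cannot be separated from that individual's effect.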

3. An Intervention Effects Outlier Model Based on Variance

In a panel data model the standard assumption is that the majority of data follow a certain specified distribution. Unfortunately, a certain small percentage of the panel data take values unlikely to follow this same distribution. The residuals are likely to be contaminated with outliers in a panel data model. This means that the error term has the deviation of variance in the model. Since the main purpose of this paper is to propose a method of outlier detection based on the remainder disturbance, it will be assumed that there is no intervention effects problem based on variance if is present, and therefore this paper will not deal with the inference with the variance of . The following is the panel data model for (1):

Suppose outliers interfere with the remainder disturbance and there are the following two conditions. One is variation across both individuals and time where ; the other is variation across only individuals where . Here deviations of the variance across only individual can be assumed, so the variance of the remainder disturbance will bewhere is arbitrary strictly positive twice continuously differentiable function satisfying the conditions , , and . denotes the first derivative of with respect to .

The panel data outlier model is based on the variance of remainder disturbance in a matrix form, and thenwhere and . denotes the th diagonal matrix. It is the th individual variance interventions’ effects model. If the outliers impact not only the th individual observations, but also the subsequent observations, then without loss of generality, it can be assumed that the outliers employ sequence impacts, which are respectively, ; then ; here . denotes a matrix with its th diagonal matrix being and other matrixes being zeros.
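To make the idea of a variance intervention concrete, the following sketch simulates remainder disturbances in which one individual has an inflated variance. It assumes the exponential function as the variance function, one convenient choice that is strictly positive, twice continuously differentiable, and equal to one at zero; the parameter name `gamma`, the function names, and all simulation settings are hypothetical.

```python
import math
import random


def variance_scale(gamma):
    """Assumed variance function h(gamma) = exp(gamma); h(0) = 1 and h > 0."""
    return math.exp(gamma)


def simulate_disturbances(n_individuals, n_periods, sigma, shifted, gamma, seed=0):
    """Draw normal remainder disturbances; individual `shifted` gets variance
    sigma^2 * h(gamma), all other individuals keep variance sigma^2."""
    rng = random.Random(seed)
    panel = []
    for i in range(n_individuals):
        scale = sigma * math.sqrt(variance_scale(gamma if i == shifted else 0.0))
        panel.append([rng.gauss(0.0, scale) for _ in range(n_periods)])
    return panel


def sample_variances(panel):
    """Per-individual mean of squared disturbances (the true mean is zero)."""
    return [sum(v * v for v in row) / len(row) for row in panel]


# Individual with index 2 receives the variance intervention.
panel = simulate_disturbances(5, 200, sigma=1.0, shifted=2, gamma=3.0, seed=42)
variances = sample_variances(panel)
print([round(s, 2) for s in variances])
```

With `gamma = 0` the function returns one and the model reduces to the homoskedastic case, which is the situation a test for outliers takes as its null hypothesis.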

Remark 1. Researchers in applied mathematics, statistics, and engineering have made active and useful contributions to the detection of outliers and have provided several methods to identify them, such as the Grubbs test, -test, Dixon test, and Nair test. The Grubbs test and the Dixon test are generally not effective at identifying multiple outliers. The -test is more complex to compute and is generally used for small samples. The Nair test requires the variance to be known. In practice, a set of observations often contains several outliers rather than only one; the traditional methods, designed mainly for the detection of a single outlier, are therefore less resistant to contamination by multiple outliers and are highly likely to produce the shielding (masking) effect when multiple outliers are present in the sample.

Remark 2. The outlier model with variance-based intervention effects is widely applicable and covers the case in which the observations contain multiple outliers. By constructing a valid test statistic through hypothesis testing, this model provides a new method for the detection of multivariate outliers. It is strongly resistant to contamination by outliers. The method avoids the shielding effect caused by the presence of multivariate outliers and can be applied repeatedly to detect multiple outliers.

4. Maximum Likelihood (ML) Estimation

Under the assumption of normality, the log-likelihood function for the model (2) can be written aswhere and . is the variance-covariance matrix of the error term . The variance-covariance matrix can be computed aswhere and is the column vector. In order to obtain the ML estimator of the regression coefficients, needs to be computed as

The maximum likelihood estimators of , , and are obtained by solving the following normal equations:

It can be noted that these equations would be nonlinear and difficult to solve explicitly. Let , , and denote the ML estimates of , , and .

5. Outlier Detection and Testing

In what follows, outliers in panel data are detected and the corresponding LM test for the variance intervention effects model is derived.

5.1. LM Test with a General Type of Outlier for

The log-likelihood function for variance intervention effects model can be written aswhere .

The Hessian matrix of can be obtained:

Following Magnus [25], Lejeune [26], and Baltagi [24], the information matrix is given by denoting expectation taken with respect to the true distribution. To calculate the information matrix, it is noted that and the first derivatives of the likelihood function with respect to , evaluated at the restricted MLE of , are zero. Thus, .

Thus the information matrix under the null hypothesis is

The information matrix is block-diagonal between and , and the part of the information matrix corresponding to is ignored in computing the LM statistic since the null hypothesis only involves . Therefore, the LM statistic may be written aswhere is a vector of partial derivatives of the log-likelihood with respect to each element of , evaluated at the restricted MLE . is the information matrix, evaluated at the restricted MLE . Under the null hypothesis, this statistic is asymptotically distributed as a with degrees of freedom, being the number of parameters in the vector .
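The generic form of an LM statistic, a quadratic form of the score vector in the inverse information matrix with both evaluated at the restricted estimates, can be sketched as follows. The function names are hypothetical, and the score and information values in the example are made-up numbers, not quantities derived from the model in this paper.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("information matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def lm_statistic(score, information):
    """Quadratic form score' * information^{-1} * score."""
    direction = solve_linear_system(information, score)
    return sum(s * d for s, d in zip(score, direction))


# Hypothetical two-parameter example.
example = lm_statistic([2.0, 1.0], [[4.0, 0.0], [0.0, 2.0]])
print(example)
```

Solving the linear system instead of inverting the information matrix explicitly is a common numerical choice; because of the block-diagonal structure, only the block belonging to the tested parameters needs to be supplied.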

The first derivative can be obtained asFor the information matrix, the formula is given by Baltagi [24]:Based on (10), log-likelihood function may be written asHereTesting for outliers in this model amounts to testing , . Let , and under the null hypothesis , the inverse matrix of the covariance matrix is

Theorem 3. If denote the ML estimates of , . Outliers interfere with the remainder disturbance . The LM test statistic is given bywhereUnder the null hypothesis , this statistic is asymptotically distributed as with degrees of freedom.

5.2. LM Test of Fixed Effects or Random Effects Model with Outliers for

This section considers the fixed effects model of panel data with outliers and the random effects model of panel data with the th individual outliers.

Proposition 4. If and , so the LM statistic for fixed effects model of panel data with a number of individuals outliers is given bywhere .
Under the null hypothesis , this statistic is asymptotically distributed as with degrees of freedom.

Proposition 5. If and , so the LM statistic for random effects model of panel data with the th individual outliers is given bywhere .
Under the null hypothesis, this statistic is asymptotically distributed as $\chi^2$ with one degree of freedom.

Proposition 6. If and , so the LM statistic for fixed effects model of panel data with the th individual outliers is given bywhere .
Under the null hypothesis, this statistic is asymptotically distributed as $\chi^2$ with one degree of freedom.
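As a rough illustration of how a one-degree-of-freedom test for an individual variance shift can be computed from residuals alone, the sketch below implements a textbook Breusch-Pagan-type score statistic with a dummy variable for one individual. This is a stand-in chosen for illustration under normality and a balanced panel, not the statistic of Propositions 4 to 6; the function name and the toy residuals are hypothetical.

```python
def variance_shift_lm(residuals):
    """Breusch-Pagan-type LM statistic for a variance shift in each individual.

    residuals[i][t] holds the residual of individual i at time t from a fit
    under the null hypothesis (for example, within-group residuals).
    Requires a balanced panel with at least two individuals.
    """
    n = len(residuals)
    t = len(residuals[0])
    # Pooled variance estimate under the null of equal variances.
    sigma2 = sum(e * e for row in residuals for e in row) / (n * t)
    stats = []
    for row in residuals:
        # Sum of standardized squared residuals minus their null expectation.
        s_k = sum(e * e / sigma2 - 1.0 for e in row)
        # Half the explained sum of squares from regressing the standardized
        # squared residuals on a constant and a dummy for this individual.
        stats.append(0.5 * s_k * s_k * n / (t * (n - 1)))
    return stats


# Toy residuals: the second individual is far more dispersed than the others.
toy = [
    [0.1, -0.2, 0.1, 0.0],
    [3.0, -2.5, 2.8, -3.1],
    [-0.1, 0.2, 0.0, -0.1],
]
print(variance_shift_lm(toy))
```

Each statistic can be compared with a chi-square critical value with one degree of freedom; when all individuals are screened at once, the critical value should be adjusted for multiple testing.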

6. Application Analysis

In this section, the proposed method is applied to a panel dataset and its advantages are explained.

To evaluate the performance and advantages of the proposed method for detecting outliers in panel data, a sample is used as the test dataset to explore the effects of the main variables on the regional environment. The dataset consists of 570 observations for 30 provinces in China from 1992 to 2010. The basic model takes regional carbon emissions (CO2) as the dependent environmental variable, regional export volume and import volume as the main independent trade variables, and regional gross domestic product as the main regional development level variable. The source data for the independent variables are mainly downloaded from the National Bureau of Statistics of China, and carbon emissions are obtained by following the method of Lin and Sun [27], based on fossil fuel consumption data and the CO2 emission factors of various types of energy from the Intergovernmental Panel on Climate Change (IPCC) reference approach.

6.1. Outlier Detection of the Tested Panel Data

Although the residual plots (Figure 1) of the basic model suggest that some data points are abnormal, detecting outliers from residual plots alone is not rigorous enough, because the diagnostic standard for outliers in residual plots is vague and some cases are difficult to judge. This paper therefore applies the LM test method to identify outliers accurately.

For those nonidentified outlier tests, which mean that we do not know in advance which data is outlier point, the test statistic . The Bonferroni inequality will be used to approximate the function, and then the process to identify nonidentified outliers with the confidence level as will be as follows: if ( is sample size), the data estimated will be outliers.

Let the confidence level , and then the critical value for detection of nonidentified outliers. Thus, we can find that individuals such as #5, #19, #21, and #29 in Table 1 are detected as outliers as their test statistics are more than the critical value.
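The Bonferroni-adjusted critical value can be computed with a few lines of code. The sketch below assumes that each statistic is asymptotically chi-square with one degree of freedom, as for the individual outlier tests, and that the adjustment divides the level by the number of statistics being screened; the function names, the level, and the example statistics are hypothetical and do not reproduce the values in Table 1.

```python
from statistics import NormalDist


def bonferroni_critical_value(alpha, n_tests):
    """Upper alpha / n_tests quantile of the chi-square distribution with 1 df.

    Uses the identity chi2_1 = Z^2, so P(chi2_1 > c) = 2 * (1 - Phi(sqrt(c))).
    """
    z = NormalDist().inv_cdf(1.0 - alpha / (2.0 * n_tests))
    return z * z


def flag_outliers(lm_stats, alpha):
    """Return 1-based labels of individuals whose statistic exceeds the cutoff."""
    cutoff = bonferroni_critical_value(alpha, len(lm_stats))
    return [i + 1 for i, lm in enumerate(lm_stats) if lm > cutoff]


# Hypothetical statistics for three individuals at the 5% level.
flagged = flag_outliers([0.5, 40.0, 1.2], alpha=0.05)
print(flagged)
```

Because the cutoff grows with the number of tests, the adjustment guards against declaring outliers merely because many individuals are examined at the same time.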

In data analysis, it may happen that a univariate view does not capture the structure and correlation among the variables. With the LM test method, significant outliers that could disrupt such relationships can be effectively identified, whereas the traditional test methods are not in a position to deal with this. Outliers can also provide valuable information; for example, significant discoveries are often made from observations whose dispersion exceeds that expected from random errors. In addition, outliers can suggest ways to modify and optimize the model.

6.2. Correction Model

Based on the above results of the panel model outlier test, adjustment models are now proposed. After the correction models are established, the outlier test results are validated by comparing the merits of the correction models with those of the original model. The adjustment plans for the correction models are as follows.

6.2.1. Correction Model 1

Since the results in Table 1 show that individuals #5, #19, #21, and #29 are outliers, with #21 showing an especially high degree of abnormality, correction model 1 excludes these individuals. The corrected sample size is 494, and the regression results of this model are shown in Table 2.

6.2.2. Correction Model 2

Individuals #5, #19, #21, and #29 have been detected as outliers. Correction model 2 reflects these abnormal data within the model and is specified as follows:

When the presence of outliers is taken into account, the regression results of the correction models differ from those of the basic model, and the estimated values change. The value of the basic model shows that the coefficient of does not pass the test, while the corresponding coefficient passes the test at the 10% significance level in correction model 1 and at the 5% significance level in correction model 2. The coefficients of the other two variables () pass the test at the 1% significance level in both correction models. The results in Table 2 show that the value and std. err. indicators improve in the two correction models; thus, the presence of outliers does affect the accuracy of the estimates, and correction based on the suggested method makes the regression results more accurate.

To further evaluate the regression results of the correction models, the residuals are tested for stationarity using the LLC test [28], HT test [29], IPS test [30], and Fisher-ADF test [31]. Table 3 shows the results of these diagnostic tests for the residuals of the panel data. The results show that the residuals of the correction models are stationary at the 1% significance level, which indicates that the regression results of the correction models are better than those of the basic model and are robust.

This application shows the effectiveness of the LM test in identifying outliers and its good detection performance through avoidance of the shielding effect.

7. Conclusion

The presence of individual outliers in panel data can dramatically affect parameter estimates and statistical tests. Some literature is available on outliers in time series models, whereas there are very few studies on panel data models. When individual effects exist, the traditional mean-shift model cannot be applied to detect outliers in panel data. Therefore, this paper developed a variance intervention effects model based on the remainder disturbance to detect outliers. Because the ML equations of the outlier model are nonlinear and difficult to solve explicitly, outlier detection is carried out through the Lagrange Multiplier (LM) test. The exact distribution of the test statistics is not available. Although the LM test statistics are only asymptotically distributed as $\chi^2$, some studies show that the LM test is also very effective even in small samples. This paper presented an effective LM testing approach for detecting a general type of outlier based on the variance intervention effects model. Furthermore, the corresponding LM test statistics of the fixed effects model and the random effects model with outliers are obtained, and the LM test statistics of an individual outlier model are given as a particular case. The maximum likelihood (ML) estimation can be performed in Stata by using the regress command. Through this estimation, the ML estimates of and are obtained under the null hypothesis. The LM test statistics of outlier detection for each individual are then calculated as proposed. Following this procedure, this paper identified individual outliers from the LM test statistics. Outlier detection and identification can provide additional valuable information that improves the robustness of the statistical model.

Appendix

This appendix derives the LM test for the intervention model with the variance of , and the null hypothesis is given by . Under , the following are obtained:Using (15), the elements of the information matrix are given byThe derivative vector with respect to under is given byThe information matrix evaluated under is given byLet ,Next, the corresponding Lagrange Multiplier (LM) test for null hypotheses is derived:whereUsing (A.6), the following is obtained asLetand thenUsing (A.11), getUnder the null hypothesis , this statistic is asymptotically distributed as with degrees of freedom.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (Grant no. 71302054), China Scholarship Council (201308210076), and Dongbei University of Finance & Economics (DUFE2014J13). The author would like to thank the editors and referees whose comments led to an appreciable clarification of the argument of this paper.