Research Article | Open Access
Quadrilateral Interval Type-2 Fuzzy Regression Analysis for Data Outlier Detection
This paper presents a fuzzy regression analysis method based on general quadrilateral interval type-2 fuzzy numbers for data outlier detection. The Euclidean distance for the general quadrilateral interval type-2 fuzzy numbers is provided. In the sense of this Euclidean distance, parameter estimation laws of the type-2 fuzzy linear regression model are designed. Then, a parameter estimation method oriented to data outlier detection is proposed using the data deletion-based type-2 fuzzy regression model. Moreover, based on the fuzzy regression model, an impact evaluation rule for detecting data outliers is designed by using the root mean squared error method. Finally, an example is provided to validate the presented methods.
1.1. Related Works and Motivations
Regression analysis is widely used in engineering science, social science, economy, and finance [1–4] since it is a significant and comprehensive method to analyze the dependence between a dependent variable and one or more independent variables. Generally, in a regression model, deviations (errors) between the estimated and the observed values are attributed to random variations and/or measurement errors. To this end, statistical analysis techniques are usually developed for model determination. Nevertheless, in practice, the deviations are sometimes caused by the indefiniteness of the system structure or by incomplete measurable data. In such a case, the uncertainties are ordinarily called fuzziness rather than randomness. To deal with the fuzziness, regression analysis based on fuzzy data or fuzzy numbers, called fuzzy regression analysis, was proposed and developed in [5–7].
Different from conventional regression techniques, which are based on statistical methods, fuzzy regression methodology uses possibility theory and fuzzy set theory [8, 9] for modeling and analysis. Thus, it is more appropriate for managing uncertainties. Zadeh introduced two types of fuzzy sets to describe the uncertainties [10–12]. The first type, called the type-1 fuzzy set, has several typical extended forms, such as the interval fuzzy set, triangular fuzzy set, trapezoidal fuzzy set, pentagonal fuzzy set, and intuitionistic fuzzy set. Because type-1 fuzzy sets have some shortcomings in dealing with uncertainties, such as the inaccurate description of linguistic uncertainties and the parameter perturbation in the uncertainties, the type-2 fuzzy set was presented to tackle this problem. Its degree of ambiguity is characterized by two subordinate membership functions. It is an extension of the type-1 fuzzy set and offers wider applicability to complex systems. In return, however, using type-2 fuzzy sets complicates the calculation of the fuzzy regression. For simplicity of calculation, Mendel et al. specialized the type-2 fuzzy sets to interval type-2 fuzzy sets, which can still describe the uncertainties better than the type-1 fuzzy sets. In many studies and applications of the fuzzy regression model, fuzzy numbers based on triangular fuzzy sets are most frequently applied because of their simplicity. However, such fuzzy numbers show some limitations in observation, particularly when the complete observed output of the predicted model is required.
In many practical studies, the complete description of regression problems greatly depends on the properties of the input-output data. Different types of input-output data lead to different analysis results of the fuzzy regression. There are typically four cases of fuzzy regression analysis in the literature, based on crisp-input crisp-output (CICO), crisp-input fuzzy-output (CIFO) [15, 18], fuzzy-input crisp-output (FICO), and fuzzy-input fuzzy-output (FIFO) [19–21] observations. The case of CIFO data is the one most commonly studied in practice. In this paper, the fuzzy regression of interest is analyzed based on CIFO data. Specifically, we consider the case where the predictor variable is crisp but the parameters (coefficients) are fuzzy. Therefore, the observed responses in this case are naturally fuzzy.
To sum up, there are three main methods of fuzzy regression analysis, namely, the least squares fitting criterion, the minimum fuzziness criterion, and the interval regression analysis approach. For example, mathematical programming methods have been used for estimating the parameters of a fuzzy regression model in the trapezoidal and triangular cases. Another work developed a fuzzy regression model and used the least squares method to estimate the coefficients in the sense of distance. A modified fuzzy linear model has also been presented, with which all the observed data can be enveloped by the identified model output. A tolerance approach has been introduced for the construction of fuzzy regression coefficients based on a possibilistic linear regression model with fuzzy data. In this paper, the least squares method will be used to calculate the estimation error of the fuzzy regression values.
1.2. Contributions of This Work
Actually, the kernel interval in a triangular fuzzy number is limited to a single point equal to the midpoint of the support interval. As an extension of the triangular fuzzy numbers, the trapezoidal fuzzy numbers remove this restriction. That is why we develop the fuzzy regression model using trapezoidal interval type-2 fuzzy numbers and, more generally, the quadrilateral interval type-2 fuzzy numbers considered in this paper. Meanwhile, the model structure is assumed to be linear, which is common in the literature. Consequently, the corresponding fuzzy regression problem becomes a parameter estimation problem of the regression model.
In a related research field, the data of all individuals often cannot be collected by experimenters because of measurement methods, preservation methods, and human factors. This results in incomplete observation values of some data indicators in the samples, which is common in clinical trials, socioeconomic statistics, environmental ecology, and other research fields. In fact, under normal circumstances, samples with missing values cannot fully reflect the real characteristics of the systems of interest and the internal relationships between variables. Thus, improper treatment may lead to large deviations in the results. Therefore, how to deal with missing data and extract information from data correctly and effectively is an important issue in statistical inference.
However, a certain proportion of outliers or strong influence points is often unavoidably mixed into the actual data because of the interference of many factors, such as negligence error and rounding error. Once outliers are mixed in, these fuzzy regression methods become impractical and may even lead to wrong conclusions. Such data problems are common in clinical trials, socioeconomic statistics, environmental ecology, and other research fields [26–28]. Hence, the influence analysis of outliers on models is an important part of the statistical diagnosis. To this end, this paper investigates the fuzzy regression analysis based on type-2 trapezoidal fuzzy numbers for data outlier detection. The contributions of the paper can be summarized as follows:
(1) Firstly, the definition of interval trapezoidal type-2 fuzzy numbers is provided. The parameter estimation laws of the fuzzy linear regression model based on trapezoidal fuzzy numbers are designed in the sense of Euclidean distance.
(2) Then, parameter estimation laws oriented to data outlier detection are synthesized for the trapezoidal fuzzy regression model.
(3) Moreover, based on the trapezoidal fuzzy linear regression model, the impact evaluation rule is established by using the root mean squared error method.
The rest of the paper is organized as follows. Section 2 describes the fuzzy linear regression model and the quadrilateral type-2 fuzzy numbers. Sections 3 and 4 present parameter estimation laws of the fuzzy regression model. Section 5 provides the impact evaluation rule. Section 6 gives a simulation example. Section 7 summarizes the paper.
2. Model Description and Preliminaries
Consider a set of n pairs of observation data, each consisting of a predictor variable (input) and an observed response variable (output). For each observation pair, the linear regression model (1) expresses the response as an intercept plus a slope multiplied by the predictor, where the intercept and the slope are the unknown parameters to be estimated.
Some definitions used in developing the theoretical results are presented in Appendix.
The uncertainty of a type-2 fuzzy set F can be described by a bounded region, that is, the projection of the fuzzy set F onto the plane of the primary variable and the primary membership grade, which is called the footprint of uncertainty and is expressed by FOU(F). The upper-bound membership function and the lower-bound membership function of an interval type-2 fuzzy number each correspond to a type-1 fuzzy set.
An illustrative description and comparison of the type-2 membership functions under different cases is shown in Figure 1. From Figures 1(a) and 1(b), we know that the considered trapezoidal membership function (2), shown in Figure 1(b), is a special case of the quadrilateral membership function, obtained when the membership values of the second and third elements are equal. The most commonly used triangular type-2 membership function, illustrated in Figure 1(c), is in turn a special case of the trapezoidal one considered in this paper. Figure 1(d) shows a crisp value for comparison with the fuzzy numbers mentioned above.
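The nesting of the membership shapes in Figure 1 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name `quadrilateral_membership`, the vertex names `a1` to `a4`, and the heights `h2` and `h3` are hypothetical, and the membership function is assumed to be piecewise linear through the four vertices. Equal heights give a trapezoid, and equal heights with coinciding second and third vertices give a triangle.

```python
def quadrilateral_membership(x: float, a1: float, a2: float, a3: float,
                             a4: float, h2: float = 1.0, h3: float = 1.0) -> float:
    """Piecewise-linear grade through (a1, 0), (a2, h2), (a3, h3), (a4, 0).

    Assumed shape for illustration only: h2 == h3 gives a trapezoid,
    and h2 == h3 with a2 == a3 gives a triangle.
    """
    if not a1 <= a2 <= a3 <= a4:
        raise ValueError("vertices must satisfy a1 <= a2 <= a3 <= a4")
    if not (0.0 <= h2 <= 1.0 and 0.0 <= h3 <= 1.0):
        raise ValueError("heights must lie in [0, 1]")
    if x < a1 or x > a4:
        return 0.0                                    # outside the support
    if x < a2:
        return h2 * (x - a1) / (a2 - a1)              # rising left side
    if x <= a3:
        if a3 == a2:
            return max(h2, h3)                        # degenerate single-point top
        return h2 + (h3 - h2) * (x - a2) / (a3 - a2)  # top edge, possibly tilted
    return h3 * (a4 - x) / (a4 - a3)                  # falling right side
```

For an interval type-2 fuzzy number, the same function would be evaluated twice, once with the vertices and heights of the upper bound and once with those of the lower bound, which gives the two ends of the membership interval at `x`.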
The upper- and lower-bound membership functions of the trapezoidal type-2 fuzzy numbers are expressed as the following form:where and are nonnegative real numbers, , , and denote the heights of the two trapezoids. Considering , we give the following definition of general quadrilateral interval type-2 fuzzy numbers.
Based on the discussion above, we will analyze the regression model for the general case in which quadrilateral type-2 fuzzy numbers are used. In the CIFO-based interval type-2 fuzzy regression model (3), the two parameters are fuzzy numbers, specifically the quadrilateral interval type-2 fuzzy numbers considered in this paper, and the output is the observed response. Before proceeding further, we give the following definition of the Euclidean distance of two quadrilateral interval type-2 fuzzy numbers.
Remark 1. The coefficients a and b used in (A.6) can be adjusted as needed. If , then the Euclidean distance is called the centralized Euclidean distance.
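A weighted distance of this kind can be sketched in Python as follows. This is not the definition in (A.6), which is not reproduced here. The sketch assumes that the squared distance is a weighted sum of squared differences between corresponding vertices, with one hypothetical weight `w_upper` for the upper-bound vertices and another, `w_lower`, for the lower-bound vertices. The class `QuadIT2` is likewise hypothetical, and the membership heights are assumed equal for both numbers and therefore ignored.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Tuple

Quad = Tuple[float, float, float, float]


@dataclass(frozen=True)
class QuadIT2:
    """Hypothetical container: four vertices for each membership bound."""
    upper: Quad
    lower: Quad


def euclidean_distance(p: QuadIT2, q: QuadIT2,
                       w_upper: float = 1.0, w_lower: float = 1.0) -> float:
    """Weighted vertex-wise Euclidean distance (assumed form, see lead-in)."""
    if w_upper < 0 or w_lower < 0:
        raise ValueError("weights must be nonnegative")
    d_upper = sum((u - v) ** 2 for u, v in zip(p.upper, q.upper))
    d_lower = sum((u - v) ** 2 for u, v in zip(p.lower, q.lower))
    return sqrt(w_upper * d_upper + w_lower * d_lower)
```

Under this assumed form, changing the two weights rescales the contribution of each bound without changing which pairs of fuzzy numbers are at distance zero.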
In the following section, the fuzzy regression is analyzed in the sense of the defined Euclidean distance in order to estimate the parameters in (3). Furthermore, the parameter estimation in the presence of data outliers will also be discussed subsequently.
3. Parameter Estimation of the Type-2 Fuzzy Regression Model
For the considered CIFO observation data, we write the interval type-2 fuzzy parameters and in (3) as follows:
According to (3), the form of the fuzzy observed response depends on the sign of the predictor value. If the predictor value is nonnegative, then the corresponding quadrilateral type-2 fuzzy number is
Conversely, if the predictor value is negative, the corresponding type-2 fuzzy number is
Remark 2. Actually, one can find the minimum of all predictor values, whether it is positive or negative, and then subtract it from each predictor value. Thus, a new set of predictor values that are all nonnegative is obtained. Therefore, we will investigate the regression analysis by considering nonnegative predictor values in (3).
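The shift described in Remark 2 amounts to one subtraction per data point. A minimal Python sketch follows, in which the function name `shift_to_nonnegative` is hypothetical:

```python
from typing import List, Sequence, Tuple


def shift_to_nonnegative(xs: Sequence[float]) -> Tuple[List[float], float]:
    """Subtract the smallest predictor value so that every shifted value is >= 0.

    Returns the shifted values together with the subtracted minimum, so that
    the original predictor scale can be restored afterwards.
    """
    if not xs:
        raise ValueError("at least one predictor value is required")
    x_min = min(xs)
    return [x - x_min for x in xs], x_min
```

For instance, the predictor values -2.0, 0.5, and 3.0 become 0.0, 2.5, and 5.0, and the returned minimum is -2.0.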
Based on the fuzzy regression model (3), we aim to design the estimates of the bounds of the two fuzzy parameters by minimizing the distance between the resulting fuzzy number and the observed response. As discussed in Remark 2, we can consider the case of nonnegative predictor values for the parameter estimation. We provide the estimation laws for the general case of the membership grades in the following theorem.
Theorem 1. Consider the fuzzy regression model with a set of observation data , . For the quadrilateral type-2 fuzzy parameters and in (4)-(5), if the observed data are a set of quadrilateral type-2 fuzzy numbers , then the estimates of and in the Euclidean distance sense are designed as follows:where .
Proof. According to Definition 4, for the n observation pairs, the sum of the squared Euclidean distances between the fuzzy regression output and the observed response can be obtained as a function of the two fuzzy parameters. Then, we take the partial derivatives of this sum with respect to the bounds of the two fuzzy parameters, set them to zero, and respectively obtain a set of algebraic equations. Solving these equations, we get the estimates of the bounds of the two fuzzy parameters exactly as expressed in Theorem 1. This completes the proof.
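The minimization in the proof can be illustrated with a Python sketch under explicit assumptions; it is not the estimation law of Theorem 1 itself. The sketch assumes that the squared distance is a positively weighted sum of squared differences between corresponding vertices and that the membership heights are fixed. With nonnegative predictor values, each vertex of the model output is then a straight line in the predictor, so the problem separates into one ordinary least-squares fit per vertex. The function name `fit_vertexwise` and the row-per-observation data layout are hypothetical.

```python
from typing import List, Sequence, Tuple


def fit_vertexwise(x: Sequence[float],
                   Y: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Least-squares intercept and slope for every vertex of the fuzzy response.

    x holds crisp, nonnegative predictor values; Y[i][k] is the k-th vertex of
    the i-th observed fuzzy response (upper and lower vertices concatenated).
    """
    n = len(x)
    if n < 2 or n != len(Y):
        raise ValueError("need at least two observations, one row of Y per x")
    if min(x) < 0:
        raise ValueError("shift the predictor values to be nonnegative first")
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all predictor values are identical")
    intercepts: List[float] = []
    slopes: List[float] = []
    for k in range(len(Y[0])):
        yk = [row[k] for row in Y]          # k-th vertex across all observations
        yk_bar = sum(yk) / n
        slope = sum((xi - x_bar) * (yi - yk_bar) for xi, yi in zip(x, yk)) / sxx
        slopes.append(slope)
        intercepts.append(yk_bar - slope * x_bar)
    return intercepts, slopes
```

This sketch does not enforce any ordering of the fitted vertices, so the result should be checked to confirm that it still forms a valid fuzzy number.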
4. Parameter Estimation against the Data Deletion Fuzzy Regression Model
To evaluate the impact of the j-th data point in the regression analysis based on the regression model (3), we can delete the j-th data point and check whether it is an outlier or a strong influence point by comparing the changes in the statistical inference results. The regression model (1) with the j-th data point deleted is called the data deletion-based regression model (12), whose two parameters are quadrilateral interval type-2 fuzzy numbers.
By the parameter estimation method in Theorem 1, the following results can be drawn for the parameter estimation against the j-th data deleted.
Theorem 2. Consider the trapezoidal fuzzy regression model (12) with a set of observation data , , and is a class of quadrilateral interval type-2 fuzzy number. If the j-th data point is deleted, then the following estimates of and in the Euclidean distance are designed for .where .
Proof. Similar to the proof of Theorem 1, after the j-th data point is deleted, the sum of the squared Euclidean distances is taken over the remaining n − 1 observations. Then, following the steps in the proof of Theorem 1, the results in Theorem 2 can be obtained accordingly. We omit the detailed proof to save space.
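The deletion step can be sketched in Python for a single vertex of the fuzzy response. This is a hedged illustration, not the estimation law of Theorem 2. It assumes that each vertex is fitted by ordinary least squares, and it obtains every leave-one-out fit by removing the j-th contribution from running sums instead of refitting from scratch. The function name `leave_one_out_fits` is hypothetical.

```python
from typing import List, Sequence, Tuple


def leave_one_out_fits(x: Sequence[float],
                       y: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (intercept, slope) of the least-squares line with point j deleted."""
    n = len(x)
    if n < 3 or n != len(y):
        raise ValueError("need at least three (x, y) pairs of equal length")
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    m = n - 1                                # sample size after one deletion
    fits: List[Tuple[float, float]] = []
    for xj, yj in zip(x, y):
        sx_j, sy_j = sx - xj, sy - yj        # sums without the j-th point
        sxx_j, sxy_j = sxx - xj * xj, sxy - xj * yj
        denom = m * sxx_j - sx_j ** 2
        if denom == 0:
            raise ValueError("remaining predictor values are all identical")
        slope = (m * sxy_j - sx_j * sy_j) / denom
        intercept = (sy_j - slope * sx_j) / m
        fits.append((intercept, slope))
    return fits
```

When one observation lies far from the line through the others, the fit with that observation deleted stands out from the remaining leave-one-out fits, which is the change the data deletion model is meant to expose.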
5. Impact Evaluation Rule for the Data Outlier Detection
Since the two quantities to be compared are type-2 fuzzy numbers, it is inconvenient to compare them directly. For this reason, a suitable statistical measure is usually suggested in order to compare the impact quantitatively. In this paper, we introduce the standard deviation of the regression equation as the statistical measure to analyze the impact of the data deletion. The standard deviation of the regression equation (3) is defined in Definition 5 in Appendix.
From Definition 5, we know that the standard deviation of the regression equation is actually the average deviation between the observed value and the estimated value. Clearly, the smaller the standard deviation is, the closer the estimated value is to the observed value and the more tightly the observation points cluster around the fuzzy regression model.
When calculating the standard error in (A.8), we first obtain the two parameter estimates by solving the extreme-value problem in the statistical analysis and then estimate the regression value. This uses two rounds of statistical calculations, and thus two degrees of freedom are consumed. Therefore, the denominator in (A.8) is n − 2 rather than n.
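The role of the n − 2 denominator can be illustrated with a short Python sketch. It is not the expression in (A.8), which is not reproduced here. The sketch assumes that the squared deviation between an observed and a fitted fuzzy number is the unweighted sum of squared differences between corresponding vertices. The function name `regression_std` is hypothetical.

```python
from math import sqrt
from typing import Sequence


def regression_std(observed: Sequence[Sequence[float]],
                   fitted: Sequence[Sequence[float]]) -> float:
    """Root mean squared error with n - 2 degrees of freedom.

    observed[i] and fitted[i] hold the vertices of the i-th observed and
    fitted fuzzy responses in the same order.
    """
    n = len(observed)
    if n != len(fitted):
        raise ValueError("observed and fitted must have the same length")
    if n <= 2:
        raise ValueError("more than two observations are needed for n - 2")
    sse = sum((o - f) ** 2
              for obs, fit in zip(observed, fitted)
              for o, f in zip(obs, fit))
    return sqrt(sse / (n - 2))
```

After one data point is deleted, the same function applies to the remaining n − 1 observations, so the denominator becomes n − 3.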
According to the data deletion-based type-2 fuzzy regression model (12), when the j-th data point is deleted, the corresponding standard deviation is
Specifically, the square of for the type-2 fuzzy regression model (12) with fuzzy and fuzzy can be calculated bywhere and ( and ) are defined in Theorem 2.
For the data deletion-based fuzzy linear regression model in (12), a metric of the impact of the j-th data point on the regression model is defined. Evidently, if this metric increases after deleting the j-th data point, then the impact is greater and this data point may be an outlier; otherwise, the j-th data point is normal.
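One way to turn this rule into a ranking is sketched below in Python. The impact metric used here is an assumption, not the metric defined in the paper: it is taken to be the absolute change between the standard deviation of the full-data model and that of each data deletion-based model. The function names `deletion_impacts` and `most_influential` are hypothetical.

```python
from typing import List, Sequence


def deletion_impacts(s_full: float, s_deleted: Sequence[float]) -> List[float]:
    """Assumed impact metric: absolute change of the standard deviation
    caused by deleting each data point in turn."""
    return [abs(s_full - s_j) for s_j in s_deleted]


def most_influential(s_full: float, s_deleted: Sequence[float]) -> int:
    """Index of the data point whose deletion changes the fit the most."""
    impacts = deletion_impacts(s_full, s_deleted)
    if not impacts:
        raise ValueError("at least one deletion result is required")
    return max(range(len(impacts)), key=impacts.__getitem__)
```

As a purely illustrative input, not taken from Table 2, a full-data standard deviation of 4.5 with deletion results 5.2, 4.4, 4.6, 0.1, and 4.9 would single out the fourth data point (index 3) as the candidate outlier.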
The derived results can be reduced to the case of trapezoidal type-2 fuzzy regression model when (, ). Besides, it becomes the normal case when using . Therefore, one can consider and () when dealing with the triangular type-2 fuzzy numbers in practice.
6. Simulation Example
In this part, we provide an example to validate the presented fuzzy regression model and the designed impact evaluation rule for data outlier detection. We borrow a set of data, namely the estimation errors reported in Table 9 of a previous study, and treat them as type-2 trapezoidal fuzzy numbers. Table 1 gives the considered interval type-2 trapezoidal fuzzy numbers as the observed data.
Based on this set of observed data, we will detect whether any of them is an outlier or a strong influence point by using the designed impact evaluation rule. For simplicity, we use the normal trapezoidal type-2 fuzzy number (, ) for the type-2 fuzzy parameters and . Considering , according to Theorem 1, by setting , we obtain and of the resulting type-2 fuzzy regression equation as follows:
Then, in the following steps, we can use this fuzzy regression equation to calculate the standard deviation of the regression value after deleting the j-th data point. Based on the fuzzy regression model (17), according to Theorem 2, we can obtain the standard deviations as shown in Table 2.