Quadrilateral Interval Type-2 Fuzzy Regression Analysis for Data Outlier Detection

Gao, Pingping; Gao, Yabin

doi:https://doi.org/10.1155/2019/4914593

Mathematical Problems in Engineering


Research Article | Open Access

Volume 2019 | Article ID 4914593 | https://doi.org/10.1155/2019/4914593


Pingping Gao¹ and Yabin Gao²

Academic Editor: Andras Szekrenyes

Received 02 Apr 2019

Revised 03 Jul 2019

Accepted 22 Jul 2019

Published 21 Aug 2019

Abstract

This paper presents a fuzzy regression analysis method based on general quadrilateral interval type-2 fuzzy numbers for data outlier detection. The Euclidean distance between general quadrilateral interval type-2 fuzzy numbers is defined. In the sense of this distance, parameter estimation laws for the type-2 fuzzy linear regression model are designed. A parameter estimation method oriented to data outlier detection is then proposed using the data deletion-based type-2 fuzzy regression model. Moreover, based on the fuzzy regression model, an impact evaluation rule for detecting data outliers is designed using the root mean squared error method. Finally, an example is provided to validate the presented methods.

1. Introduction

1.1. Related Works and Motivations

Regression analysis is widely used in engineering science, social science, economics, and finance [1–4] because it is a comprehensive method for analyzing the dependence between a dependent variable and one or more independent variables. Generally, in a regression model, deviations (errors) between the estimated and observed values are attributed to random variations and/or measurement errors, and statistical analysis techniques are developed for model determination accordingly. In practice, however, the deviations are sometimes caused by the indefiniteness of the system structure or by incompletely measurable data. In such cases, the uncertainty is ordinarily regarded as fuzziness rather than randomness. To deal with this fuzziness, regression analysis based on fuzzy data or fuzzy numbers, called fuzzy regression analysis, was proposed and developed in [5–7].

Unlike conventional regression techniques, which are based on statistical methods, fuzzy regression methodology uses possibility theory and fuzzy set theory [8, 9] for modeling and analysis. Thus, it is more appropriate for managing such uncertainties. Zadeh [9] introduced two types of fuzzy sets to describe uncertainties [10–12]. The first type, called the type-1 fuzzy set, has several typical extended forms, such as the interval fuzzy set, triangular fuzzy set, trapezoidal fuzzy set, pentagonal fuzzy set, and intuitionistic fuzzy set [13]. Because type-1 fuzzy sets have limitations in dealing with uncertainties, such as the inaccurate description of linguistic uncertainties and parameter perturbations, the type-2 fuzzy set was introduced. Its degree of ambiguity is characterized by two membership functions. It is an extension of the type-1 fuzzy set and has wider applicability to complex systems. In return, however, type-2 fuzzy sets make the calculation of the fuzzy regression complicated. For simplicity of calculation, Mendel et al. [14] specialized the type-2 fuzzy sets to interval type-2 fuzzy sets, which still describe uncertainties better than type-1 fuzzy sets. In much of the research on and application of fuzzy regression models, fuzzy numbers based on triangular fuzzy sets are the most frequently used because of their simplicity. However, such fuzzy numbers have limitations in representing observations, particularly when the complete observed output of the predicted model is required [15].

In many practical studies, the complete description of a regression problem depends greatly on the properties of the input-output data [16]. Different types of input-output data lead to different analysis results for the fuzzy regression. Four cases of fuzzy regression analysis are typically distinguished in the literature, based on crisp-input crisp-output (CICO) [17], crisp-input fuzzy-output (CIFO) [15, 18], fuzzy-input crisp-output (FICO) [19], and fuzzy-input fuzzy-output (FIFO) [19–21] observations. The CIFO case is the one most commonly studied in practice. In this paper, the fuzzy regression is analyzed based on CIFO data. Specifically, we consider the case where the predictor variable is crisp but the parameters (coefficients) are fuzzy. The observed responses in this case are therefore naturally fuzzy.

There are three main methods of fuzzy regression analysis, namely, the least squares fitting criterion, the minimum fuzziness criterion, and the interval regression analysis approach [22]. For example, in [23], mathematical programming methods were used to estimate the parameters of a fuzzy regression model in the trapezoidal and triangular cases. The work in [24] developed a fuzzy regression model and used the least squares method to estimate the coefficients in the sense of distance. The authors of [15] presented a modified fuzzy linear model in which all the observed data are enveloped by the identified model output. A tolerance approach was introduced in [25] for constructing fuzzy regression coefficients based on a possibilistic linear regression model with fuzzy data. In this paper, the least squares method is used to calculate the estimation error of the fuzzy regression values.

1.2. Contributions of This Work

The kernel interval of a triangular fuzzy number is limited to a single point equal to the midpoint of the support interval. As an extension of triangular fuzzy numbers, trapezoidal fuzzy numbers allow a kernel interval of nonzero width and can therefore fill this gap. For this reason, we develop the fuzzy regression model using trapezoidal interval type-2 fuzzy numbers and, more generally, the quadrilateral interval type-2 fuzzy numbers considered in this paper. The model structure is assumed to be linear, as is common in the literature. Consequently, the corresponding fuzzy regression problem becomes a parameter estimation problem for the regression model.

In a related research field, experimenters often fail to collect the data of all individuals because of measurement methods, preservation methods, and human factors. This results in incomplete observation values for some data indicators in the samples, which is common in clinical trials, socioeconomic statistics, environmental ecology, and other research fields. Under normal circumstances, samples with missing values cannot fully reflect the real characteristics of the systems of interest or the internal relationships between variables, and improper treatment may lead to large deviations in the results. Therefore, how to deal with missing data and extract information from data correctly and effectively is an important issue in statistical inference.

Moreover, it is often unavoidable that a certain proportion of outliers or strong influence points is mixed into the actual data because of interfering factors such as negligence errors and rounding errors. Once outliers are mixed in, fuzzy regression methods become impractical and can even lead to wrong conclusions [26–28]. Hence, the influence analysis of outliers on models is an important part of statistical diagnosis. To this end, this paper investigates fuzzy regression analysis based on type-2 trapezoidal fuzzy numbers for data outlier detection. The contributions of the paper can be summarized as follows:

(1) First, the definition of interval trapezoidal type-2 fuzzy numbers is provided. The parameter estimation laws of the fuzzy linear regression model based on trapezoidal fuzzy numbers are designed in the sense of Euclidean distance.

(2) Then, parameter estimation laws oriented to data outlier detection are synthesized for the trapezoidal fuzzy regression model.

(3) Moreover, based on the trapezoidal fuzzy linear regression model, the impact evaluation rule is established using the root mean squared error method.

The rest of the paper is organized as follows. Section 2 describes the fuzzy linear regression model and the quadrilateral type-2 fuzzy numbers. Sections 3 and 4 present parameter estimation laws of the fuzzy regression model. Section 5 provides the impact evaluation rule. Section 6 gives a simulation example. Section 7 concludes the paper.

2. Model Description and Preliminaries

Let $\{(x_i, y_i)\}$ denote a set of pairs of observation data, where $x_i$ is the predictor variable (input), $y_i$ is the observed response variable (output), and $i = 1, 2, \ldots, n$. For each observation pair $(x_i, y_i)$, the functional form of the linear regression model is formulated as

$$y_i = \beta_0 + \beta_1 x_i, \tag{1}$$

where $\beta_0$ and $\beta_1$ are the unknown parameters to be estimated.

Some definitions used in developing the theoretical results are presented in Appendix.

The uncertainty of a type-2 fuzzy set F can be described by a bounded region, namely, the projection of F onto the plane spanned by the primary variable and the primary membership, which is called the footprint of uncertainty (FOU). The upper-bound and lower-bound membership functions of an interval type-2 fuzzy number each correspond to a type-1 fuzzy set.

An illustrative comparison of the type-2 membership functions under different cases is shown in Figure 1. Figures 1(a) and 1(b) show that the trapezoidal membership function (2) in Figure 1(b) is a special case of the quadrilateral membership function [29] in Figure 1(a), in which the second and third elements each have their own membership value. The most commonly used triangular type-2 membership function, illustrated in Figure 1(c), is in turn a special case of the trapezoidal one considered in this paper. For comparison with these fuzzy numbers, Figure 1(d) shows a crisp value.

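Figure 1: Type-2 membership functions under different cases: (a) quadrilateral, (b) trapezoidal, (c) triangular, and (d) crisp value.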

The upper- and lower-bound membership functions of the trapezoidal type-2 fuzzy numbers are expressed as the following form:where and are nonnegative real numbers, , , and denote the heights of the two trapezoids. Considering , we give the following definition of general quadrilateral interval type-2 fuzzy numbers.

Based on this discussion above, we will analyze the regression model (3) for the general case when the quadrilateral type-2 fuzzy numbers are used. The CIFO-based interval type-2 fuzzy regression model is formulated as follows:where and denote the two fuzzy numbers, which are quadrilateral interval type-2 fuzzy numbers considered in this paper. is the observed response . Before proceeding further, we give the following definition of the Euclidean distance of two quadrilateral interval type-2 fuzzy numbers.

Remark 1. The coefficients a and b used in (A.6) can be adjusted as needed. If , then the Euclidean distance is called the centralized Euclidean distance.
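
As a rough illustration of how such a distance can be evaluated, the following Python sketch uses a simplified stand-in for (A.6), not the paper's exact formula. It assumes that each fuzzy number is stored as four upper-bound and four lower-bound vertices, that the heights are ignored, and that the weights a and b multiply the squared vertex differences of the upper and lower bounds, respectively. The class `QuadIT2` and the function `distance` are hypothetical names introduced here.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class QuadIT2:
    """Hypothetical container for a quadrilateral interval type-2 fuzzy number."""
    upper: Tuple[float, float, float, float]  # vertices of the upper-bound membership function
    lower: Tuple[float, float, float, float]  # vertices of the lower-bound membership function


def distance(p: QuadIT2, q: QuadIT2, a: float = 0.5, b: float = 0.5) -> float:
    """Weighted Euclidean distance over the eight vertices (assumed form, heights ignored)."""
    d_upper = sum((u - v) ** 2 for u, v in zip(p.upper, q.upper))
    d_lower = sum((u - v) ** 2 for u, v in zip(p.lower, q.lower))
    return math.sqrt(a * d_upper + b * d_lower)


p = QuadIT2(upper=(0.0, 1.0, 2.0, 3.0), lower=(0.5, 1.0, 2.0, 2.5))
q = QuadIT2(upper=(1.0, 2.0, 3.0, 4.0), lower=(1.5, 2.0, 3.0, 3.5))
print(distance(p, q))  # every vertex differs by 1, so the result is 2.0
```

Changing `a` and `b` shifts the emphasis between the upper and lower bounds, which is the kind of adjustment Remark 1 refers to.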
In the following section, the fuzzy regression is analyzed in the sense of the defined Euclidean distance in order to estimate the two fuzzy parameters in (3). Parameter estimation in the presence of data outliers is discussed subsequently.

3. Parameter Estimation of the Type-2 Fuzzy Regression Model

For the considered CIFO observation data, we write the interval type-2 fuzzy parameters and in (3) as follows:

According to (3), the fuzzy-observed response can be represented subject to the positive or negative . If , then the corresponding quadrilateral type-2 fuzzy number is

On the contrary, if , the corresponding type-2 fuzzy number results in

Remark 2. Actually, one can find the minimum from all whether it is positive or negative and then subtract for each , that is . Thus, a new set of which are all nonnegative are obtained. Therefore, we will investigate the regression analysis by considering in (3).
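
The sign handling and the shift described in Remark 2 can be sketched as follows. This is a minimal illustration under standard fuzzy arithmetic, assuming each bound (upper or lower) is stored as an ordered tuple of four vertices with the heights left out. The function names `scale`, `add`, `predict`, and `shift_nonnegative` are hypothetical.

```python
from typing import List, Tuple

Quad = Tuple[float, float, float, float]


def scale(a: Quad, x: float) -> Quad:
    """Multiply the vertices by a crisp scalar and keep them in increasing order."""
    scaled = tuple(v * x for v in a)
    # A negative scalar reverses the order of the vertices.
    return scaled if x >= 0 else tuple(reversed(scaled))


def add(a: Quad, b: Quad) -> Quad:
    """Vertex-wise sum of two quadrilateral bounds."""
    return tuple(u + v for u, v in zip(a, b))


def predict(a0: Quad, a1: Quad, x: float) -> Quad:
    """One bound of the fuzzy response a0 + a1 * x for a crisp input x."""
    return add(a0, scale(a1, x))


def shift_nonnegative(xs: List[float]) -> List[float]:
    """Subtract the minimum input so that all shifted inputs are nonnegative."""
    x_min = min(xs)
    return [x - x_min for x in xs]


a0 = (0.0, 1.0, 2.0, 3.0)
a1 = (1.0, 2.0, 3.0, 4.0)
print(predict(a0, a1, 2.0))    # nonnegative input: vertices keep their order
print(predict(a0, a1, -1.0))   # negative input: vertices of a1 * x are reversed
print(shift_nonnegative([-2.0, 0.0, 3.0]))
```

After the shift, only the nonnegative branch is needed, which is why the estimation can be restricted to nonnegative inputs. If the second and third vertices carry different heights, the heights would have to be swapped together with the vertices; the sketch does not cover that case.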
Based on the fuzzy regression model (3), we aim to design the estimates of the bounds of and by minimizing the distance between the resulting fuzzy number and the observed response . As discussed in Remark 2, we can consider the case of , for the parameter estimation. We provide the estimation laws for the general case of the membership grades and , in the following theorem.

Theorem 1. Consider the fuzzy regression model with a set of observation data , . For the quadrilateral type-2 fuzzy parameters and in (4)-(5), if the observed data are a set of quadrilateral type-2 fuzzy numbers , then the estimates of and in the Euclidean distance sense are designed as follows:where .

Proof. According to Definition 4, for the n observation pairs, the sum of the squared Euclidean distances between the observed responses and the model outputs is formed as a function of the bounds of the two fuzzy parameters. Taking the partial derivatives of this sum with respect to the bounds of the two parameters and setting them to zero gives a set of algebraic equations. Solving these equations yields the estimates of the bounds of the two parameters, exactly as expressed in Theorem 1. This completes the proof.
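
To make the minimization step concrete, here is a hedged Python sketch under a simplifying assumption that is not taken from the paper: the squared distance is the plain sum of squared vertex differences, the heights are fixed, and all inputs are nonnegative. Under that assumption the objective separates by vertex, so each vertex of the two fuzzy parameters is obtained by ordinary least squares. The function `fit_vertexwise` is a hypothetical name, and the closed forms of Theorem 1 are not reproduced here.

```python
from typing import List, Sequence, Tuple


def fit_vertexwise(
    xs: Sequence[float], ys: Sequence[Sequence[float]]
) -> Tuple[List[float], List[float]]:
    """Least-squares intercept and slope for each vertex of the fuzzy response."""
    n = len(xs)
    x_bar = sum(xs) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("the inputs must not all be equal")
    intercepts, slopes = [], []
    for k in range(len(ys[0])):
        y_k = [y[k] for y in ys]            # k-th vertex of every observed response
        y_bar = sum(y_k) / n
        s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, y_k))
        slope = s_xy / s_xx                 # zero of the partial derivative w.r.t. the slope
        slopes.append(slope)
        intercepts.append(y_bar - slope * x_bar)
    return intercepts, slopes


xs = [0.0, 1.0, 2.0, 3.0]
# Synthetic responses generated from known vertex-wise parameters.
ys = [(x, 1 + x, 2 + 2 * x, 3 + 2 * x) for x in xs]
a0_hat, a1_hat = fit_vertexwise(xs, ys)
print(a0_hat, a1_hat)
```

The same routine can be applied separately to the upper-bound and lower-bound vertices of the observed responses.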

4. Parameter Estimation against the Data Deletion Fuzzy Regression Model

To evaluate the impact of the j-th data point on the regression analysis based on the regression model (3), we can delete the j-th data point and, by comparing the changes in the statistical inference results, detect whether it is an outlier or a strong influence point. The regression model (1) with the j-th data point deleted is called the data deletion-based regression model (12), in which the two parameters are again quadrilateral interval type-2 fuzzy parameters.

By the parameter estimation method in Theorem 1, the following results can be drawn for the parameter estimation against the j-th data deleted.

Theorem 2. Consider the trapezoidal fuzzy regression model (12) with a set of observation data , , and is a class of quadrilateral interval type-2 fuzzy number. If the j-th data point is deleted, then the following estimates of and in the Euclidean distance are designed for .where .

Proof. As in Theorem 1, after the j-th data point is deleted, the sum of the squared Euclidean distances between the observed responses and the model outputs is formed over the remaining observations. Following the steps in the proof of Theorem 1, the results in Theorem 2 are obtained accordingly. The details are omitted to save space.

5. Impact Evaluation Rule for the Data Outlier Detection

Since the two quantities introduced above are type-2 fuzzy numbers, it is inconvenient to compare them directly. For this reason, a suitable statistical measure is needed to compare the impact quantitatively. In this paper, we introduce the standard deviation of the regression equation as the statistical measure to analyze the impact of the data deletion. The standard deviation of the regression equation (3) is defined in Definition 5 in the Appendix.

From Definition 5, we know that the standard deviation of the regression equation is actually the average deviation between the observed value and the estimated value. Apparently, the smaller the standard deviation is, the closer the estimated value is to the observed value, as well as the closer the observation points are clustered around the fuzzy regression model.

When calculating the standard error in (A.8), we first obtain the two parameter estimates by solving the extreme-value problem in the statistical analysis and then estimate the regression value. This involves two rounds of statistical calculation, and thus two degrees of freedom are consumed. Therefore, the denominator in (A.8) uses $n - 2$ rather than $n$ in the statistical analysis.
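
The role of the reduced denominator can be illustrated with a short Python sketch. It is not formula (A.8) itself: the squared distance between an observed and an estimated response is assumed here to be the plain sum of squared vertex differences, and the function name `regression_std` and its parameter `n_params` are hypothetical.

```python
import math
from typing import Sequence


def regression_std(
    observed: Sequence[Sequence[float]],
    fitted: Sequence[Sequence[float]],
    n_params: int = 2,
) -> float:
    """Root of the summed squared distances divided by (n - n_params)."""
    n = len(observed)
    if n <= n_params:
        raise ValueError("need more observations than estimated parameters")
    sse = sum(
        sum((o - f) ** 2 for o, f in zip(obs, fit))
        for obs, fit in zip(observed, fitted)
    )
    return math.sqrt(sse / (n - n_params))


# Four observations; each fitted response is off by 1 in its first vertex only.
observed = [(0.0, 1.0, 2.0, 3.0), (1.0, 2.0, 3.0, 4.0),
            (2.0, 3.0, 4.0, 5.0), (3.0, 4.0, 5.0, 6.0)]
fitted = [(1.0, 1.0, 2.0, 3.0), (2.0, 2.0, 3.0, 4.0),
          (3.0, 3.0, 4.0, 5.0), (4.0, 4.0, 5.0, 6.0)]
print(regression_std(observed, fitted))  # sqrt(4 / (4 - 2))
```

Dividing by the number of observations minus two, rather than by the number of observations, gives a larger value for the same residuals, which compensates for the two parameters having been fitted to the same data.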

According to the data deletion-based type-2 fuzzy regression model (12), when the j-th data point is deleted, the corresponding standard deviation is

Specifically, the square of for the type-2 fuzzy regression model (12) with fuzzy and fuzzy can be calculated bywhere and ( and ) are defined in Theorem 2.

For the data deletion-based fuzzy linear regression model in (12), let be the metric of the impact on the regression model (12). Evidently, if increases after deleting the j-th data point, then it indicates that the impact is greater and this data point may be an outlier; otherwise, the j-th data are normal.
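
The deletion procedure can be sketched in Python as follows. Several assumptions are made that do not come from the paper: the fit is a per-vertex ordinary least squares with heights ignored, the standard deviation of a model fitted to m points divides by m − 2, and the impact score is taken to be the absolute change in the standard deviation after deletion. The function names and the synthetic data are hypothetical, and the paper's exact metric is not reproduced.

```python
import math
from typing import List, Sequence, Tuple


def fit(xs: Sequence[float], ys: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-vertex least-squares intercepts and slopes (heights ignored)."""
    n = len(xs)
    x_bar = sum(xs) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    a0, a1 = [], []
    for k in range(len(ys[0])):
        y_bar = sum(y[k] for y in ys) / n
        s_xy = sum((x - x_bar) * (y[k] - y_bar) for x, y in zip(xs, ys))
        a1.append(s_xy / s_xx)
        a0.append(y_bar - a1[-1] * x_bar)
    return a0, a1


def std_dev(xs: Sequence[float], ys: Sequence[Sequence[float]]) -> float:
    """Fit the model to the given points and return sqrt(SSE / (m - 2))."""
    a0, a1 = fit(xs, ys)
    sse = sum(
        (y[k] - (a0[k] + a1[k] * x)) ** 2
        for x, y in zip(xs, ys)
        for k in range(len(a0))
    )
    return math.sqrt(sse / (len(xs) - 2))


def deletion_scores(xs: list, ys: list) -> List[float]:
    """Absolute change in the standard deviation when each point is deleted in turn."""
    s_full = std_dev(xs, ys)
    return [
        abs(std_dev(xs[:j] + xs[j + 1:], ys[:j] + ys[j + 1:]) - s_full)
        for j in range(len(xs))
    ]


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [(2 * x, 2 * x + 0.5, 2 * x + 1.5, 2 * x + 2) for x in xs]
ys[3] = tuple(v + 10.0 for v in ys[3])  # inject one artificial outlier at index 3
scores = deletion_scores(xs, ys)
suspect = max(range(len(scores)), key=scores.__getitem__)
print(suspect, scores)
```

In this synthetic data set, the point with the largest score is the injected one, because deleting it is the only deletion that changes the fit substantially.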

The derived results can be reduced to the case of trapezoidal type-2 fuzzy regression model when (, ). Besides, it becomes the normal case when using . Therefore, one can consider and () when dealing with the triangular type-2 fuzzy numbers in practice.

6. Simulation Example

In this section, we provide an example to validate the presented fuzzy regression model and the designed impact evaluation rule for data outlier detection. The data are borrowed from the estimation errors reported in Table 9 of [30], here treated as type-2 trapezoidal fuzzy numbers. Table 1 gives the resulting interval type-2 trapezoidal fuzzy numbers used as the observed data.

Based on this set of observed data, we will detect if some of them is an outlier or a strong impact point by using the designed impact evaluation rule. For simplicity, we use the normal trapezoidal type-2 fuzzy number (, ) for the type-2 fuzzy parameters and . Considering , according to Theorem 1, by setting , we obtain and of the resulting type-2 fuzzy regression equation as follows:

Then, in the following steps, we can use this fuzzy regression equation to calculate the standard deviation of the regression value after deleting the j-th data point. Based on the fuzzy regression model (17), according to Theorem 2, we can obtain the standard deviations as shown in Table 2.

From the standard deviations in Table 2, one can see that the standard deviation of the estimate of the fuzzy regression equation after deleting the 14th data point is larger than that obtained after deleting any other point. This is also evident from Figure 2. Thus, the 14th data point is likely an outlier, which means that the corresponding estimation error in Table 1 is the least accurate one.

7. Conclusion

This paper has dealt with the problems of the fuzzy regression analysis and data outlier detection based on general quadrilateral interval type-2 fuzzy numbers. The Euclidean distance for the type-2 fuzzy numbers has been provided and used for the parameter estimation and standard deviation. Some parameter estimation laws of the quadrilateral interval type-2 fuzzy linear regression model have been designed. Finally, the impact evaluation rule has been designed using the data deletion-based fuzzy regression model. The data outlier detection can be achieved by using the calculation of the standard deviations of the fuzzy regression values.

Appendix

Definitions used in this paper are as follows:

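Definition 1 (see [14, 31]). Suppose that F is a nonempty fuzzy set and that the expression of type-2 fuzzy numbers is as follows:

$$F = \{((x, u), \mu_F(x, u)) \mid \forall x \in X,\ \forall u \in J_x \subseteq [0, 1],\ 0 \le \mu_F(x, u) \le 1\}, \tag{A.1}$$

in which X is the universe of discourse, $\mu_F$ denotes the type-2 membership function, and $J_x \subseteq [0, 1]$. The fuzzy numbers can also be represented by

$$F = \int_{x \in X} \int_{u \in J_x} \mu_F(x, u) / (x, u). \tag{A.2}$$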

Definition 2 (see [14, 31]). For a type-2 fuzzy set F, if $\mu_F(x, u) = 1$ for all the type-2 membership functions, then F is called an interval type-2 fuzzy set, which is expressed by

$$F = \int_{x \in X} \int_{u \in J_x} 1 / (x, u), \tag{A.3}$$

where X is the universe of discourse and $J_x \subseteq [0, 1]$.

Definition 3. If the upper- and lower-bound membership functions of the interval type-2 fuzzy numbers are trapezoidal fuzzy numbers, then they are called trapezoidal interval type-2 fuzzy numbers, expressed bywhere and denote the membership grades of the second and third elements and () above. If , then the generalized fuzzy number F is called a normal trapezoidal type-2 fuzzy number denoted .
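
For readers who want to evaluate such membership functions numerically, the following Python sketch may help. It assumes a piecewise-linear shape through the four points (a1, 0), (a2, h1), (a3, h2), and (a4, 0), where h1 and h2 are the membership grades of the second and third elements; the trapezoidal case corresponds to h1 = h2 and the normal case to h1 = h2 = 1. The function name `quad_membership` and the example values are hypothetical.

```python
def quad_membership(x: float, a1: float, a2: float, a3: float, a4: float,
                    h1: float = 1.0, h2: float = 1.0) -> float:
    """Piecewise-linear membership through (a1, 0), (a2, h1), (a3, h2), (a4, 0)."""
    if not (a1 <= a2 <= a3 <= a4):
        raise ValueError("vertices must satisfy a1 <= a2 <= a3 <= a4")
    if x < a1 or x > a4:
        return 0.0
    if x < a2:                      # rising edge
        return h1 * (x - a1) / (a2 - a1)
    if x <= a3:                     # top edge, sloped when h1 != h2
        if a3 == a2:
            return max(h1, h2)
        return h1 + (h2 - h1) * (x - a2) / (a3 - a2)
    return h2 * (a4 - x) / (a4 - a3)  # falling edge


def membership_interval(x: float) -> tuple:
    """Lower and upper membership of an example trapezoidal interval type-2 number."""
    upper = quad_membership(x, 0.0, 1.0, 2.0, 3.0, h1=1.0, h2=1.0)
    lower = quad_membership(x, 0.5, 1.0, 2.0, 2.5, h1=0.8, h2=0.8)
    return lower, upper


print(membership_interval(1.5))
print(membership_interval(0.75))
```

At each x, the pair returned by `membership_interval` bounds the footprint of uncertainty of the example fuzzy number.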

Definition 4. Let and be expressed as two quadrilateral interval type-2 fuzzy numbers:Then, the Euclidean distance between and is defined aswith , , and

Definition 5. Consider the observation pair of the regression equation in (3), with crisp and fuzzy , for . Denote the estimate of . Then, the functionis called the Euclidean distance-based standard deviation of the fuzzy regression equation, where .

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] R. J. Brook, Applied Regression Analysis and Experimental Design, Routledge, Abingdon, UK, 2018.
[2] G. W. Bohrnstedt and T. M. Carter, “Robustness in regression analysis,” Sociological Methodology, vol. 3, pp. 118–146, 1971.
[3] J. Fox, Applied Regression Analysis and Generalized Linear Models, Sage Publications, Thousand Oaks, CA, USA, 2015.
[4] K. Yao and B. Liu, “Uncertain regression analysis: an approach for imprecise observations,” in Soft Computing, Springer, Berlin, Germany, 2018.
[5] K. Asai, H. Tanaka, and S. Uegima, “Linear regression analysis with fuzzy model,” IEEE Transactions on Systems, Man and Cybernetics, vol. 12, no. 6, pp. 903–907, 1982.
[6] H. Tanaka, “Fuzzy data analysis by possibilistic linear models,” Fuzzy Sets and Systems, vol. 24, no. 3, pp. 363–375, 1987.
[7] H. Tanaka, I. Hayashi, and J. Watada, “Possibilistic linear regression analysis for fuzzy data,” European Journal of Operational Research, vol. 40, no. 3, pp. 389–396, 1989.
[8] L. A. Zadeh, “Fuzzy sets,” Information and Control, vol. 8, no. 3, pp. 338–353, 1965.
[9] L. A. Zadeh, “Fuzzy sets as a basis for a theory of possibility,” Fuzzy Sets and Systems, vol. 100, pp. 9–34, 1999.
[10] J. M. Mendel, Uncertain Rule-Based Fuzzy Systems, Springer, Berlin, Germany, 2017.
[11] J. Liu, Y. Gao, W. Luo, and L. Wu, “Takagi–Sugeno fuzzy-model-based control of three-phase AC/DC voltage source converters using adaptive sliding mode technique,” IET Control Theory & Applications, vol. 11, no. 8, pp. 1255–1263, 2016.
[12] Y. Gao, F. Xiao, J. Liu, and R. Wang, “Distributed soft fault detection for interval type-2 fuzzy-model-based stochastic systems with wireless sensor networks,” IEEE Transactions on Industrial Informatics, vol. 15, no. 1, pp. 334–347, 2018.
[13] K. T. Atanassov, “Interval valued intuitionistic fuzzy sets,” in Intuitionistic Fuzzy Sets, pp. 139–177, Springer, Berlin, Germany, 1999.
[14] J. M. Mendel, R. I. John, and F. Liu, “Interval type-2 fuzzy logic systems made simple,” IEEE Transactions on Fuzzy Systems, vol. 14, no. 6, pp. 808–821, 2006.
[15] A. Bisserier, R. Boukezzoula, and S. Galichet, “A revisited approach to linear fuzzy regression using trapezoidal fuzzy intervals,” Information Sciences, vol. 180, no. 19, pp. 3653–3673, 2010.
[16] J. Nowaková and M. Pokorný, “Fuzzy linear regression analysis,” IFAC Proceedings Volumes, vol. 46, no. 28, pp. 245–249, 2013.
[17] S. Roychowdhury and W. Pedrycz, “Modeling temporal functions with granular regression and fuzzy rules,” Fuzzy Sets and Systems, vol. 126, no. 3, pp. 377–387, 2002.
[18] J. Chachi and M. Roozbeh, “A fuzzy robust regression approach applied to bedload transport data,” Communications in Statistics-Simulation and Computation, vol. 46, no. 3, pp. 1703–1714, 2017.
[19] P. D’Urso, “Linear regression analysis for fuzzy/crisp input and fuzzy/crisp output data,” Computational Statistics & Data Analysis, vol. 42, no. 1-2, pp. 47–72, 2003.
[20] H.-C. Wu, “Fuzzy estimates of regression parameters in linear regression models for imprecise input and output data,” Computational Statistics & Data Analysis, vol. 42, no. 1-2, pp. 203–217, 2003.
[21] J. Chachi, “A weighted least-squares fuzzy regression for crisp input-fuzzy output data,” IEEE Transactions on Fuzzy Systems, vol. 27, no. 4, pp. 739–748, 2018.
[22] Y.-H. O. Chang and B. M. Ayyub, “Fuzzy regression methods - a comparative assessment,” Fuzzy Sets and Systems, vol. 119, no. 2, pp. 187–203, 2001.
[23] A. Arabpour and M. Tata, “Estimating the parameters of a fuzzy linear regression model,” Iranian Journal of Fuzzy Systems, vol. 5, no. 2, pp. 1–19, 2008.
[24] L.-H. Chen and C.-C. Hsueh, “Fuzzy regression models using the least-squares method based on the concept of distance,” IEEE Transactions on Fuzzy Systems, vol. 17, no. 6, pp. 1259–1272, 2009.
[25] M. Černý and M. Hladík, “Possibilistic linear regression with fuzzy data: tolerance approach with prior information,” Fuzzy Sets and Systems, vol. 340, pp. 127–144, 2018.
[26] M. Soltani, A. J. Telmoudi, L. Chaouech, M. Ali, and A. Chaari, “Design of a robust interval-valued type-2 fuzzy c-regression model for a nonlinear system with noise and outliers,” Soft Computing, vol. 23, no. 15, pp. 6125–6134, 2018.
[27] W.-L. Hung and M.-S. Yang, “An omission approach for detecting outliers in fuzzy regression models,” Fuzzy Sets and Systems, vol. 157, no. 23, pp. 3109–3122, 2006.
[28] K. Y. Chan, C. K. Kwong, and T. C. Fogarty, “Modeling manufacturing processes using a genetic programming-based fuzzy regression with detection of outliers,” Information Sciences, vol. 180, no. 4, pp. 506–518, 2010.
[29] L.-W. Lee and S.-M. Chen, “A new method for fuzzy multiple attributes group decision-making based on the arithmetic operations of interval type-2 fuzzy sets,” in Proceedings of the 2008 International Conference on Machine Learning and Cybernetics, vol. 6, pp. 3084–3089, IEEE, Kunming, China, July 2008.
[30] S.-P. Chen and J.-F. Dang, “A variable spread fuzzy linear regression model with higher explanatory power and forecasting accuracy,” Information Sciences, vol. 178, no. 20, pp. 3973–3988, 2008.
[31] L. Abdullah, C. W. R. Adawiyah, and C. W. Kamal, “A decision making method based on interval type-2 fuzzy sets: an approach for ambulance location preference,” Applied Computing and Informatics, vol. 14, no. 1, pp. 65–72, 2018.

Copyright

Copyright © 2019 Pingping Gao and Yabin Gao. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
