Analyzing Big Data with the Hybrid Interval Regression Methods

Huang, Chia-Hui; Yang, Keng-Chieh; Kao, Han-Ying

doi:https://doi.org/10.1155/2014/243921

The Scientific World Journal

Special Issue

Optimization Methods in Information and Management Sciences

Research Article | Open Access

Volume 2014 | Article ID 243921 | https://doi.org/10.1155/2014/243921

Academic Editor: Jung-Fa Tsai

Received 19 May 2014

Accepted 07 Jul 2014

Published 20 Jul 2014

Abstract

Big data is a current trend with a significant impact on information technologies. In big data applications, one of the main concerns is dealing with large-scale data sets that often require computation resources provided by public cloud services. Analyzing big data efficiently is therefore a major challenge. In this paper, we combine interval regression with the smooth support vector machine (SSVM) to analyze big data. The SSVM was recently proposed as an alternative to the standard SVM and has been proved more efficient than the traditional SVM in processing large-scale data. In addition, the soft margin method is proposed to modify the excursion of the separation margin and to be effective in the gray zone, where the distribution of data becomes hard to describe and the separation margin between classes is unclear.

1. Introduction

Big data has become one of the new research frontiers. Generally speaking, big data is a collection of large-scale and complex data sets that are difficult to process using current database management systems and traditional data processing applications. In 2012, Gartner Inc. defined big data as follows: “Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization” [1]. The trend toward larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data.

One of the major applications of future parallel, distributed, and cloud systems is big data analytics [2–5]. The main concern is dealing with large-scale data sets, which often require computation resources provided by public cloud services. Analyzing big data efficiently is therefore a major challenge.

The support vector machine (SVM) has been shown to be an efficient approach for a variety of tasks in data mining, classification, analysis, pattern recognition, and distribution estimation [6–14]. Recently, using SVM to solve the interval regression model [15] has become an alternative approach. Hong and Hwang [16] evaluated interval regression models with the quadratic loss SVM. Bisserier et al. [17] proposed a revisited fuzzy regression method where a linear model is identified from Crisp-Inputs Fuzzy-Outputs (CISO) data. D’Urso et al. [18] presented fuzzy clusterwise regression analysis with an LR fuzzy response variable and numeric explanatory variables. The suggested model allows for linear and nonlinear relationships between the output and input variables. Jeng et al. [19] developed support vector interval regression networks (SVIRNs) based on both SVM and neural networks. Huang and Kao [20] proposed a soft-margin SVM for interval regression analysis. Huang [21] solved the interval regression model with the reduced support vector machine.

However, there are several main problems when using the SVM model.
(1) Big data: when dealing with big data sets, the solution of an SVM with a nonlinear kernel may be difficult to find.
(2) Noise and interaction: the distribution of data becomes hard to describe and the separation margin between classes becomes a “gray” zone.
(3) Unbalance: the number of samples from one class is much larger than the number of samples from other classes, which causes the excursion of the separation margin.

Under these circumstances, developing an efficient method to analyze big data becomes important. The smooth support vector machine (SSVM) has been proved more efficient than the traditional SVM in processing large-scale data [22–24]. SSVM is solved by a fast Newton-Armijo algorithm [25] and has been extended to nonlinear separation surfaces by using a nonlinear kernel technique [24].

In this study, we combine interval regression [15] with SSVM to analyze big data. Additionally, the soft margin method is proposed to modify the excursion of the separation margin and to be effective in the gray zone. The experimental results show that the proposed methods are more efficient than the existing methods compared.

This study is organized as follows. Section 2 reviews the current methods for interval regression analysis. Section 3 proposes the soft margin method and the formulation of interval regression with SSVM to analyze big data. Section 4 gives a numerical example in which the proposed methods are applied to big data extracted from the Taiwan Stock Exchange Capitalization Weighted Stock Index (TAIEX) [26]. Finally, Section 5 gives the concluding remarks.

2. Literature Review

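Since Tanaka et al. [27] introduced the fuzzy regression model with symmetric fuzzy parameters, the properties of fuzzy regression have been studied extensively by many researchers. The fuzzy regression model can be simplified to interval regression analysis, which is considered the simplest version of possibilistic regression analysis with interval coefficients. An interval linear regression model is described as

$$Y(\mathbf{x}_j) = A_0 + A_1 x_{1j} + \cdots + A_n x_{nj}, \tag{1}$$

where $Y(\mathbf{x}_j)$, $j = 1, 2, \ldots, q$, is the estimated interval corresponding to the real input vector $\mathbf{x}_j = (x_{1j}, x_{2j}, \ldots, x_{nj})^t$. An interval coefficient $A_i$ is defined as $(a_i, c_i)$, where $a_i$ is the center and $c_i$ is the radius. Hence, $A_i$ can also be represented as

$$A_i = [a_i - c_i,\; a_i + c_i] = \{t \mid a_i - c_i \le t \le a_i + c_i\}. \tag{2}$$

The interval linear regression model (1) can also be expressed as

$$Y(\mathbf{x}_j) = \left(a_0 + \sum_{i=1}^{n} a_i x_{ij},\; c_0 + \sum_{i=1}^{n} c_i |x_{ij}|\right). \tag{3}$$

For a data set with crisp inputs and interval outputs, two interval regression models, the possibility and necessity models, are considered. By assumption, the center coefficients of the possibility regression model and the necessity regression model are the same [15]. For this data set, the possibility and necessity estimation models are defined as

$$Y^*(\mathbf{x}_j) = A_0^* + A_1^* x_{1j} + \cdots + A_n^* x_{nj}, \qquad Y_*(\mathbf{x}_j) = A_{*0} + A_{*1} x_{1j} + \cdots + A_{*n} x_{nj}, \tag{4}$$

where the interval coefficients $A_i^*$ and $A_{*i}$ are defined as $A_i^* = (a_i, c_i + d_i)$ and $A_{*i} = (a_i, c_i)$, respectively, with $d_i \ge 0$. The interval estimated by the possibility model must include the observed interval, $Y^*(\mathbf{x}_j) \supseteq Y_j$, and the interval estimated by the necessity model must be included in the observed interval, $Y_*(\mathbf{x}_j) \subseteq Y_j$.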
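The interval arithmetic behind these models is simple: the center of the output is the usual linear combination of the coefficient centers, and the radius is the combination of the coefficient radii with the absolute values of the inputs. The following is a minimal Python sketch of that computation and of the two inclusion requirements. The function names and all numeric values are hypothetical and chosen only for illustration.

```python
def interval_output(a, c, x):
    """Return (center, radius) of Y(x) = A_0 + A_1 x_1 + ... + A_n x_n.

    a, c: centers and radii of the n + 1 interval coefficients
    (index 0 is the intercept); x: the n crisp inputs.
    """
    center = a[0] + sum(ai * xi for ai, xi in zip(a[1:], x))
    radius = c[0] + sum(ci * abs(xi) for ci, xi in zip(c[1:], x))
    return center, radius


def to_bounds(center, radius):
    """Convert (center, radius) to (lower, upper) endpoints."""
    return center - radius, center + radius


def contains(outer, inner):
    """True if interval `outer` includes interval `inner` (both as bounds)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


# Hypothetical coefficients: shared centers a, necessity radii c,
# and extra possibility spread d (possibility radius is c + d).
a = [1.0, 2.0]
c = [0.5, 0.25]
d = [0.5, 0.25]
x = [-2.0]

center, nec_radius = interval_output(a, c, x)
_, pos_radius = interval_output(a, [ci + di for ci, di in zip(c, d)], x)
necessity = to_bounds(center, nec_radius)
possibility = to_bounds(center, pos_radius)
observed = (-4.5, -1.5)  # hypothetical observed interval

print("necessity:", necessity, "possibility:", possibility)
print("necessity inside observed:", contains(observed, necessity))
print("possibility covers observed:", contains(possibility, observed))
```

Note that a negative input still widens the interval, because the radius uses the absolute value of each input.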

In this section, we review the current methods which are ordinarily used for interval regression analysis.

2.1. Tanaka and Lee’s Approach

Tanaka and Lee [15] proposed an interval regression analysis with a quadratic programming (QP) approach which gives more diverse spread coefficients than a linear programming (LP) one.

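The interval regression analysis by the QP approach, which unifies the possibility and necessity models subject to the inclusion relations $Y_*(\mathbf{x}_j) \subseteq Y_j \subseteq Y^*(\mathbf{x}_j)$, can be represented as

$$\begin{aligned} \min_{a,\,c,\,d}\quad & \sum_{j=1}^{q} \left(d_0 + \sum_{i=1}^{n} d_i |x_{ij}|\right)^2 + \xi \sum_{i=0}^{n} \left(a_i^2 + c_i^2\right) \\ \text{s.t.}\quad & Y_*(\mathbf{x}_j) \subseteq Y_j \subseteq Y^*(\mathbf{x}_j), \quad j = 1, \ldots, q, \\ & c_i \ge 0,\; d_i \ge 0, \quad i = 0, \ldots, n, \end{aligned} \tag{5}$$

where $\xi$ is an extremely small positive number that makes the influence of the term $\xi \sum_{i=0}^{n} (a_i^2 + c_i^2)$ on the objective function negligible. The constraints of the inclusion relations are equivalent to

$$\begin{aligned} y_j - e_j &\le \left(a_0 + \sum_{i=1}^{n} a_i x_{ij}\right) - \left(c_0 + \sum_{i=1}^{n} c_i |x_{ij}|\right), \\ \left(a_0 + \sum_{i=1}^{n} a_i x_{ij}\right) + \left(c_0 + \sum_{i=1}^{n} c_i |x_{ij}|\right) &\le y_j + e_j, \\ \left(a_0 + \sum_{i=1}^{n} a_i x_{ij}\right) - \left(c_0 + \sum_{i=1}^{n} c_i |x_{ij}|\right) - \left(d_0 + \sum_{i=1}^{n} d_i |x_{ij}|\right) &\le y_j - e_j, \\ \left(a_0 + \sum_{i=1}^{n} a_i x_{ij}\right) + \left(c_0 + \sum_{i=1}^{n} c_i |x_{ij}|\right) + \left(d_0 + \sum_{i=1}^{n} d_i |x_{ij}|\right) &\ge y_j + e_j, \end{aligned} \tag{6}$$

where $\mathbf{x}_j$ is the $j$th input vector and $Y_j$ is the corresponding interval output that consists of a center $y_j$ and a radius $e_j$, denoted by $Y_j = (y_j, e_j)$.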
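To make the structure of this quadratic program concrete, the sketch below evaluates its objective and checks the inclusion constraints for a candidate solution. It does not solve the QP. It assumes the usual reading of the approach in [15]: the objective penalizes the squared extra spread of the possibility model plus a small ridge term, and the constraints require the necessity interval to lie inside the observed interval and the possibility interval to cover it. The names `qp_objective` and `is_feasible`, the default value of `xi`, and the data are hypothetical.

```python
def center(a, x):
    """Center a_0 + sum_i a_i * x_i of the estimated interval."""
    return a[0] + sum(ai * xi for ai, xi in zip(a[1:], x))


def radius(coef, x):
    """Radius-type term coef_0 + sum_i coef_i * |x_i|."""
    return coef[0] + sum(ci * abs(xi) for ci, xi in zip(coef[1:], x))


def qp_objective(a, c, d, X, xi=1e-6):
    """Squared extra spread of the possibility model plus a tiny ridge term."""
    spread = sum(radius(d, x) ** 2 for x in X)
    ridge = xi * (sum(v * v for v in a) + sum(v * v for v in c))
    return spread + ridge


def is_feasible(a, c, d, X, Y, tol=1e-9):
    """Check necessity-inside-observed and possibility-covers-observed.

    Y is a list of (y_j, e_j) pairs: center and radius of each observation.
    """
    if any(v < 0 for v in c) or any(v < 0 for v in d):
        return False
    for x, (y, e) in zip(X, Y):
        m = center(a, x)
        r_nec = radius(c, x)
        r_pos = r_nec + radius(d, x)
        necessity_inside = (y - e <= m - r_nec + tol) and (m + r_nec <= y + e + tol)
        possibility_covers = (m - r_pos <= y - e + tol) and (y + e <= m + r_pos + tol)
        if not (necessity_inside and possibility_covers):
            return False
    return True


# Hypothetical data: two samples with one input each.
X = [[1.0], [2.0]]
Y = [(3.0, 1.0), (5.0, 1.5)]
a, c, d = [1.0, 2.0], [0.5, 0.25], [0.5, 0.0]

print(is_feasible(a, c, d, X, Y), qp_objective(a, c, d, X))
```

A QP solver would search over `a`, `c`, and `d` for the feasible point with the smallest objective value; the check above is only the constraint test that such a solution must pass.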

2.2. Hong and Hwang’s Approach

Hong and Hwang [16] evaluated the interval regression model by combining the possibility and necessity estimation formulation with the principle of the quadratic loss support vector machine (QLSVM). This version of SVM utilizes the quadratic loss function. The QLSVM performs interval nonlinear regression analysis by constructing an interval linear regression function in a high-dimensional feature space.

With the principle of QLSVM, the interval nonlinear regression model is given as follows: where , , , , , , and are Lagrange multipliers. is a nonlinear kernel. The followings are well-known nonlinear kernels, where , , , , and are kernel parameters:(1)Gaussian (radial basis) kernel: , [10],(2)hyperbolic tangent kernel: , [12],(3)polynomial kernel: , , , and [14].
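The three kernel families named above can be sketched in Python as follows. The parameterizations used here (`sigma` for the Gaussian width, `kappa` and `theta` for the hyperbolic tangent, and `degree`, `gamma`, and `coef0` for the polynomial) are common textbook forms assumed for illustration; they may differ in detail from the forms given in [10, 12, 14].

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian (radial basis) kernel, assumed form exp(-|u - v|^2 / (2 sigma^2))."""
    sq_dist = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def tanh_kernel(u, v, kappa=1.0, theta=0.0):
    """Hyperbolic tangent kernel, assumed form tanh(kappa * <u, v> + theta)."""
    return math.tanh(kappa * dot(u, v) + theta)


def polynomial_kernel(u, v, degree=2, gamma=1.0, coef0=1.0):
    """Polynomial kernel, assumed form (gamma * <u, v> + coef0) ** degree."""
    return (gamma * dot(u, v) + coef0) ** degree


def gram_matrix(X, kernel):
    """Full kernel matrix K[i][j] = kernel(X[i], X[j])."""
    return [[kernel(u, v) for v in X] for u in X]


X = [[1.0, 2.0], [3.0, 4.0], [0.0, -1.0]]  # hypothetical inputs
K = gram_matrix(X, gaussian_kernel)
for row in K:
    print(["%.4f" % value for value in row])
```

The full kernel matrix has one row and one column per sample, which is why its size becomes the bottleneck for large data sets.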

The advantage of Hong and Hwang’s approach is that it is model-free, in the sense that there is no need to assume the underlying model function for the interval nonlinear regression model with crisp inputs and interval output.

2.3. Huang’s Approach

There are two problems when using the traditional SVM model. (1) Large scale: when dealing with large-scale data sets, the solution of an SVM with nonlinear kernels may be difficult to find. (2) Unbalance: the number of samples from one class is much larger than the number of samples from the other classes, which causes the excursion of the separation margin.

To resolve these problems, Huang [21] proposed a reduced support vector machine (RSVM) approach in evaluating interval regression models. RSVM has been proven more efficient than the traditional SVM in processing large-scale data.

With the principle of RSVM, the interval nonlinear regression model is listed as follows: where , , , , and are Lagrange multipliers. is a positive semidefinite matrix in RSVM. is a nonlinear kernel.

The advantage of Huang’s approach is that it reduces the number of support vectors by randomly selecting a subset of samples. When processing large-scale data sets, the solution with nonlinear kernels can then be found more easily.
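The random-subset idea can be illustrated with a rectangular (reduced) kernel matrix: each sample is compared only with a small random subset of the data instead of with every other sample. The following is a minimal sketch of that reduction step only, not of the full RSVM optimization. The function name `reduced_kernel`, the Gaussian kernel form, the subset size, and the seed are assumptions made for illustration.

```python
import math
import random


def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian kernel, assumed form exp(-|u - v|^2 / (2 sigma^2))."""
    sq_dist = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def reduced_kernel(X, subset_size, kernel, seed=0):
    """Return the q x subset_size kernel matrix and the chosen row indices.

    A random subset of the q samples plays the role of the candidate
    support vectors, so the matrix has subset_size columns instead of q.
    """
    rng = random.Random(seed)
    idx = sorted(rng.sample(range(len(X)), subset_size))
    subset = [X[i] for i in idx]
    K = [[kernel(x, s) for s in subset] for x in X]
    return K, idx


# Hypothetical data: 1000 one-dimensional samples, 20 retained columns.
X = [[i / 100.0] for i in range(1000)]
K, idx = reduced_kernel(X, 20, gaussian_kernel)
print(len(K), "rows x", len(K[0]), "columns; full matrix would be",
      len(X), "x", len(X))
```

With 1000 samples and 20 retained columns, the reduced matrix stores 20,000 entries instead of 1,000,000.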

3. Proposed Methods

In this section we first propose the soft margin method to modify the excursion of separation margin and to be effective in the gray zone. Then the formulation of interval regression with SSVM to analyze big data is introduced.

3.1. Soft Margin

In a conventional SVM, the sign function is used as the decision-making function. The separation threshold of the sign function is 0, which results in an excursion of the separation margin for unbalanced data sets. The aim of the hard-margin formulation is to find a hyperplane with the largest distance to the nearest training data. However, the hard-margin formulation has the following limitations:
(1) there is no separating hyperplane for certain training data;
(2) complete separation with zero training error can lead to suboptimal prediction error;
(3) it is difficult to deal with the gray zone between classes.

Thus, the soft margin method is proposed to modify the excursion of separation margin and to be effective in the gray zone. The soft margin is defined as where is the decision value. and are offset parameter and scale parameter which need to be estimated using statistical method.

With the soft margin as shown in Figure 1, the predication of the class labels can be determined as follows: where is a random number between 0 and 1.
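The general idea of a soft decision rule can be sketched as follows. This is only an illustration under stated assumptions, not the authors' formula: it assumes a logistic-shaped margin function of the decision value with an offset and a scale parameter, and it assigns the positive label when the margin value exceeds a uniform random number. The names `soft_margin`, `predict_label`, `offset`, and `scale` are hypothetical.

```python
import math
import random


def soft_margin(decision_value, offset=0.0, scale=1.0):
    """Assumed logistic margin: maps a decision value to (0, 1)."""
    t = (decision_value - offset) / scale
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)  # avoids overflow for very negative t
    return e / (1.0 + e)


def predict_label(decision_value, offset=0.0, scale=1.0, rng=random):
    """Stochastic rule: +1 if the margin value exceeds a random threshold."""
    r = rng.random()  # uniform in [0, 1)
    return 1 if soft_margin(decision_value, offset, scale) > r else -1


rng = random.Random(0)
for value in (-3.0, -0.2, 0.0, 0.2, 3.0):
    labels = [predict_label(value, rng=rng) for _ in range(1000)]
    share = labels.count(1) / len(labels)
    print("decision value %+.1f -> share of +1 labels: %.3f" % (value, share))
```

Under this assumed form, decision values far from the offset are labeled almost deterministically, while values in the gray zone near the offset are labeled with probabilities close to one half. Shifting `offset` moves the effective threshold, which is one way to compensate for unbalanced classes.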

3.2. Interval Regression with SSVM

The smooth support vector machine (SSVM) is solved by a fast Newton-Armijo algorithm [25] and has been extended to nonlinear separation surfaces by using a nonlinear kernel technique [24].

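Suppose that $m$ training data $\{(\mathbf{x}_i, y_i)\}$, $i = 1, \ldots, m$, are given, where $\mathbf{x}_i \in \mathbb{R}^n$ are the input patterns and $y_i \in \{-1, 1\}$ are the related target values of the two-class pattern classification case. Then the standard support vector machine with a linear kernel [14] is

$$\begin{aligned} \min_{\mathbf{w},\,b,\,\xi}\quad & C \sum_{i=1}^{m} \xi_i + \frac{1}{2}\|\mathbf{w}\|_2^2 \\ \text{s.t.}\quad & y_i(\mathbf{w}^t \mathbf{x}_i + b) + \xi_i \ge 1, \quad \xi_i \ge 0, \quad i = 1, \ldots, m, \end{aligned} \tag{12}$$

where $b$ is the location of the hyperplane relative to the origin. The regularization constant $C$ is a positive parameter that controls the tradeoff between the training error and the part of maximizing the margin that is achieved by minimizing $\|\mathbf{w}\|_2^2$. $\xi_i$ is the slack variable with weight $C$. $\|\mathbf{w}\|_2$ is the Euclidean norm of $\mathbf{w}$, which is the normal to the following hyperplanes:

$$\mathbf{w}^t \mathbf{x} + b = +1, \tag{13}$$

$$\mathbf{w}^t \mathbf{x} + b = -1. \tag{14}$$

The first hyperplane (13) bounds the class $y_i = +1$ and the second hyperplane (14) bounds the class $y_i = -1$. The linear separating hyperplane is

$$\mathbf{w}^t \mathbf{x} + b = 0. \tag{15}$$

In Lee and Mangasarian’s approach [24], $b^2/2$ is added to the objective function of (12), and the slack variables are penalized through their squared 2-norm with weight $C/2$. Adding $b^2/2$ is equivalent to adding a constant feature to the training data and finding a separating hyperplane through the origin. Consider

$$\xi_i = \left(1 - y_i(\mathbf{w}^t \mathbf{x}_i + b)\right)_+, \tag{16}$$

where $\xi_i \ge 0$ for all $i$ and the “+” function is defined as $(x)_+ = \max\{0, x\}$. Then (12) can be reformulated as the following minimization problem by replacing $\xi_i$ with $(1 - y_i(\mathbf{w}^t \mathbf{x}_i + b))_+$:

$$\min_{\mathbf{w},\,b}\quad \frac{C}{2} \sum_{i=1}^{m} \left(1 - y_i(\mathbf{w}^t \mathbf{x}_i + b)\right)_+^2 + \frac{1}{2}\left(\|\mathbf{w}\|_2^2 + b^2\right). \tag{17}$$

The objective function in (17) is not twice differentiable, so it cannot be solved directly by a fast Newton-Armijo method [25]. Thus the “+” function in SSVM is approximated by a smooth function, $p(x, \alpha)$, as follows:

$$p(x, \alpha) = x + \frac{1}{\alpha} \log\left(1 + e^{-\alpha x}\right), \tag{18}$$

where $\alpha > 0$ is the smooth parameter. $p(x, \alpha)$ is the integral of the sigmoid function $1/(1 + e^{-\alpha x})$ of neural networks [28]. Replacing the “+” function of (17) with $p(x, \alpha)$ yields the following smooth support vector machine (SSVM) with a linear kernel:

$$\min_{\mathbf{w},\,b}\quad \frac{C}{2} \sum_{i=1}^{m} p\left(1 - y_i(\mathbf{w}^t \mathbf{x}_i + b),\, \alpha\right)^2 + \frac{1}{2}\left(\|\mathbf{w}\|_2^2 + b^2\right). \tag{19}$$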
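The smoothing step and the Newton-Armijo iteration can be sketched in plain Python for the linear-kernel case. This is a minimal sketch under stated assumptions rather than the authors' implementation: it assumes the smooth function p(x, α) = x + log(1 + exp(−αx))/α from [24], an objective of the form (C/2) Σ p(1 − y_i(w·x_i + b), α)² + (‖w‖² + b²)/2, and an Armijo rule that halves the step. All names, the toy data, and the parameter defaults are hypothetical.

```python
import math


def smooth_plus(x, alpha):
    """p(x, alpha) = x + log(1 + exp(-alpha * x)) / alpha, written stably."""
    return max(x, 0.0) + math.log1p(math.exp(-alpha * abs(x))) / alpha


def sigmoid(t):
    """Logistic function 1 / (1 + exp(-t)), written to avoid overflow."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def objective(z, X, y, C, alpha):
    """Smooth linear-kernel objective; z holds w followed by b."""
    loss = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(z, xi)) + z[-1]
        loss += smooth_plus(1.0 - yi * score, alpha) ** 2
    return 0.5 * C * loss + 0.5 * sum(v * v for v in z)


def grad_hess(z, X, y, C, alpha):
    """Gradient and Hessian of the smooth objective at z."""
    n = len(z)
    g = list(z)  # gradient of the regularization term
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for xi, yi in zip(X, y):
        v = [yi * xj for xj in xi] + [yi]  # y_i * (x_i, 1)
        r = 1.0 - sum(zj * vj for zj, vj in zip(z, v))
        p, s = smooth_plus(r, alpha), sigmoid(alpha * r)  # p and dp/dr
        curv = s * s + p * alpha * s * (1.0 - s)  # second derivative of p^2 / 2
        for i in range(n):
            g[i] -= C * p * s * v[i]
            for j in range(n):
                H[i][j] += C * curv * v[i] * v[j]
    return g, H


def solve(H, rhs):
    """Solve H d = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [b] for row, b in zip(H, rhs)]
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[piv] = M[piv], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    d = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * d[j] for j in range(i + 1, n))
        d[i] = (M[i][n] - tail) / M[i][i]
    return d


def ssvm_newton_armijo(X, y, C=10.0, alpha=5.0, delta=1e-4, tol=1e-8, max_iter=50):
    """Minimize the smooth objective with Newton steps and Armijo backtracking."""
    z = [0.0] * (len(X[0]) + 1)
    for _ in range(max_iter):
        g, H = grad_hess(z, X, y, C, alpha)
        if math.sqrt(sum(v * v for v in g)) < tol:
            break
        d = solve(H, [-v for v in g])  # Newton direction
        slope = sum(gi * di for gi, di in zip(g, d))  # negative for descent
        f0, step = objective(z, X, y, C, alpha), 1.0
        while step > 1e-12:
            trial = [zi + step * di for zi, di in zip(z, d)]
            if objective(trial, X, y, C, alpha) <= f0 + delta * step * slope:
                break  # sufficient decrease reached
            step *= 0.5
        z = trial
    return z[:-1], z[-1]


# Hypothetical two-class toy data.
X = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]]
y = [1, 1, 1, -1, -1, -1]
w, b = ssvm_newton_armijo(X, y)
predictions = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else -1 for xi in X]
print(w, b, predictions)
```

Because the regularization term contributes an identity matrix to the Hessian, the Newton system is always positive definite, so the Newton direction is a descent direction and the backtracking loop terminates. For large data sets the nested Python loops would normally be replaced by matrix operations.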

For specific data sets, an appropriate nonlinear mapping can be used to embed the original features into a Hilbert feature space , , with a nonlinear kernel . Thus, (19) can be extended to the SSVM with a nonlinear kernel: where is the nonlinear SSVM classifier. The coefficient is determined by solving an optimization problem (20) and the data points with corresponding nonzero coefficients.

With the principle of SSVM, we can formulate the interval linear regression model as follows: where , , and are the collections of all , , and , , respectively.

Given (21), the corresponding Lagrangian objective function is where is Lagrangian and , , , and are Lagrange multipliers. The idea to construct a Lagrange function from the objective function and the corresponding constraints is to introduce a dual set of variables. It can be shown that the Lagrangian function has a saddle point with respect to the primal and dual variables in the solution [29].

The Karush-Kuhn-Tucker (KKT) conditions that the partial derivatives of with respect to the primal variables for optimality

Substituting (23) in (22) yields the following optimization problem:

Similarly, we can obtain the interval nonlinear regression model by mapping to embed the original features into a Hilbert feature space , , with a nonlinear kernel as discussed in Section 2.2. Then we obtain the optimization problem as (25) by replacing and in (24) with and , respectively:

4. Numerical Example

To illustrate the methods developed in Section 3, the following example is presented.

Example. To illustrate the proposed methods dealing with big data sets, we use the data sets from Taiwan Stock Exchange Capitalization Weighted Stock Index (TAIEX) [26] which included the highest, lowest, and closed data and the ranges are from to , from to , from to , and from to , respectively. For these data sets, the Gaussian kernel [10] is used where and the regularization constant . The results are illustrated from Figure 2 to Figure 5.

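The comparison uses the measure of fitness [15] given in (26), which defines how closely the possibility output for the $j$th input approximates the necessity output for the $j$th input. Consider

$$\varphi_Y = \frac{1}{q} \sum_{j=1}^{q} \frac{c_0 + \sum_{i=1}^{n} c_i |x_{ij}|}{\left(c_0 + \sum_{i=1}^{n} c_i |x_{ij}|\right) + \left(d_0 + \sum_{i=1}^{n} d_i |x_{ij}|\right)}, \tag{26}$$

where $q$ is the sample size and $0 \le \varphi_Y \le 1$.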
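As a hedged illustration, the fitness measure can be computed as the average ratio of the necessity radius to the possibility radius over all samples, which is how the measure in [15] is usually stated for the linear model. The function names and the coefficient values below are hypothetical.

```python
def radius(coef, x):
    """Radius-type term coef_0 + sum_i coef_i * |x_i|."""
    return coef[0] + sum(ci * abs(xi) for ci, xi in zip(coef[1:], x))


def fitness(c, d, X):
    """Average ratio of necessity radius to possibility radius.

    c: necessity radii; d: extra spread of the possibility model, so the
    possibility radius at x is radius(c, x) + radius(d, x).
    """
    total = 0.0
    for x in X:
        r_nec = radius(c, x)
        r_pos = r_nec + radius(d, x)
        total += r_nec / r_pos
    return total / len(X)


# Hypothetical coefficients and inputs.
c = [0.5, 0.25]
d = [0.5, 0.0]
X = [[1.0], [2.0]]

print("fitness: %.4f" % fitness(c, d, X))
```

A value close to 1 means the possibility and necessity intervals nearly coincide, and a value close to 0 means the possibility intervals are much wider than the necessity intervals.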

Table 1 presents the results of the proposed methods with a Gaussian kernel along with the results computed by the methods of Tanaka and Lee [15], Hong and Hwang [16], and Huang [21]. The results indicate that the proposed methods are more efficient than the other methods.

5. Conclusions

In this paper, we combine interval regression with SSVM to analyze big data. In addition, the soft margin method is proposed to modify the excursion of the separation margin and to be effective in the gray zone. SSVM is solved by a fast Newton-Armijo algorithm and has been extended to nonlinear separation surfaces by using a nonlinear kernel technique. The experimental results show that the proposed methods are more efficient than the existing methods compared. In this study, we estimate the interval regression model with crisp inputs and interval output. In future work, both interval inputs with interval output and fuzzy inputs with fuzzy output will be considered.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank the anonymous referees for their useful comments and suggestions, which helped to improve the quality and presentation of this paper. The original version was accepted by the International Conference on Business, Information, and Cultural Creative Industry, 2014. Special thanks are also due to the National Science Council, Taiwan, for financially supporting this research under Grants nos. NSC 102-2410-H-141-012-MY2 (C. H. Huang) and NSC 102-2410-H-259-039- (H. Y. Kao).

References

1. D. Laney, The Importance of Big Data: A Definition, Gartner, 2012.
2. C. L. P. Chen and C. Y. Zhang, “Data-intensive applications, challenges, techniques and technologies: a survey on Big Data,” Information Sciences, vol. 275, pp. 314–347, 2014.
3. K. Kambatla, G. Kollias, V. Kumar, and A. Grama, “Trends in big data analytics,” Journal of Parallel and Distributed Computing, vol. 74, no. 7, pp. 2561–2573, 2014.
4. V. López, S. del Río, J. M. Benítez, and F. Herrera, “Cost-sensitive linguistic fuzzy rule based classification systems under the MapReduce framework for imbalanced big data,” Fuzzy Sets and Systems, 2014.
5. T. Shelton, A. Poorthuis, M. Graham, and M. Zook, “Mapping the data shadows of Hurricane Sandy: uncovering the sociospatial dimensions of 'big data',” Geoforum, vol. 52, pp. 167–179, 2014.
6. M. Arun Kumar, R. Khemchandani, M. Gopal, and S. Chandra, “Knowledge based least squares twin support vector machines,” Information Sciences, vol. 180, no. 23, pp. 4606–4618, 2010.
7. S. Maldonado, R. Weber, and J. Basak, “Simultaneous feature selection and classification using kernel-penalized support vector machines,” Information Sciences, vol. 181, no. 1, pp. 115–128, 2011.
8. O. L. Mangasarian, “Mathematical programming in data mining,” Data Mining and Knowledge Discovery, vol. 1, no. 2, pp. 183–201, 1997.
9. O. L. Mangasarian, “Generalized support vector machines,” in Advances in Large Margin Classifiers, A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, Eds., pp. 135–146, The MIT Press, Cambridge, Mass, USA, 2000.
10. C. A. Micchelli, “Interpolation of scattered data: distance matrices and conditionally positive definite functions,” Constructive Approximation, vol. 2, no. 1, pp. 11–22, 1986.
11. R. Savitha, S. Suresh, and N. Sundararajan, “Fast learning Circular COMplex-valued Extreme Learning Machine (CCELM) for real-valued classification problems,” Information Sciences, vol. 187, pp. 277–290, 2012.
12. B. Schölkopf, C. J. C. Burges, and A. J. Smola, Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, Mass, USA, 1999.
13. A. Unler, A. Murat, and R. B. Chinnam, “mr²PSO: a maximum relevance minimum redundancy feature selection method based on swarm intelligence for support vector machine classification,” Information Sciences, vol. 181, no. 20, pp. 4625–4641, 2011.
14. V. N. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, NY, USA, 1998.
15. H. Tanaka and H. Lee, “Interval regression analysis by quadratic programming approach,” IEEE Transactions on Fuzzy Systems, vol. 6, no. 4, pp. 473–481, 1998.
16. D. H. Hong and C. H. Hwang, “Interval regression analysis using quadratic loss support vector machine,” IEEE Transactions on Fuzzy Systems, vol. 13, no. 2, pp. 229–237, 2005.
17. A. Bisserier, R. Boukezzoula, and S. Galichet, “A revisited approach to linear fuzzy regression using trapezoidal fuzzy intervals,” Information Sciences, vol. 180, no. 19, pp. 3653–3673, 2010.
18. P. D'Urso, R. Massari, and A. Santoro, “Robust fuzzy regression analysis,” Information Sciences, vol. 181, no. 19, pp. 4154–4174, 2011.
19. J. T. Jeng, C. C. Chuang, and S. F. Su, “Support vector interval regression networks for interval regression analysis,” Fuzzy Sets and Systems, vol. 138, no. 2, pp. 283–300, 2003.
20. C. Huang and H. Kao, “Interval regression analysis with soft-margin reduced support vector machine,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 5579, pp. 826–835, 2009.
21. C. H. Huang, “A reduced support vector machine approach for interval regression analysis,” Information Sciences, vol. 217, pp. 56–64, 2012.
22. C.-C. Chang, L.-J. Chien, and Y.-J. Lee, “A novel framework for multi-class classification via ternary smooth support vector machine,” Pattern Recognition, vol. 44, no. 6, pp. 1235–1244, 2011.
23. Y. J. Lee, W. F. Hsieh, and C. M. Huang, “ε-SSVR: a smooth support vector machine for ε-insensitive regression,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 5, pp. 678–685, 2005.
24. Y. Lee and O. L. Mangasarian, “SSVM: a smooth support vector machine for classification,” Computational Optimization and Applications, vol. 20, no. 1, pp. 5–22, 2001.
25. L. Armijo, “Minimization of functions having Lipschitz continuous first partial derivatives,” Pacific Journal of Mathematics, vol. 16, pp. 1–3, 1966.
26. “Taiwan Stock Exchange Capitalization Weighted Stock Index,” http://www.twse.com.tw.
27. H. Tanaka, S. Uejima, and K. Asai, “Fuzzy linear regression model,” IEEE Transactions on Systems, Man and Cybernetics, vol. 10, pp. 2933–2938, 1980.
28. O. L. Mangasarian, “Mathematical programming in neural networks,” ORSA Journal on Computing, vol. 5, no. 4, pp. 349–360, 1993.
29. O. L. Mangasarian, Nonlinear Programming, McGraw-Hill, New York, NY, USA, 1969.

Copyright

Copyright © 2014 Chia-Hui Huang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
