Applied Computational Intelligence and Soft Computing

Volume 2017 (2017), Article ID 5134962, 13 pages

https://doi.org/10.1155/2017/5134962

## Distributed Nonparametric and Semiparametric Regression on SPARK for Big Data Forecasting

Clausthal University of Technology, Clausthal-Zellerfeld, Germany

Correspondence should be addressed to Jelena Fiosina

Received 22 July 2016; Accepted 22 November 2016; Published 8 March 2017

Academic Editor: Francesco Carlo Morabito

Copyright © 2017 Jelena Fiosina and Maksims Fiosins. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Forecasting in big datasets is a common but complicated task, which cannot be executed using the well-known parametric linear regression. However, nonparametric and semiparametric methods, which enable forecasting by building nonlinear data models, are computationally intensive and lack sufficient scalability to cope with big datasets and produce useful results in a reasonable time. We present distributed parallel versions of some nonparametric and semiparametric regression models. We use the MapReduce paradigm and describe the algorithms in terms of SPARK data structures to parallelize the calculations. The forecasting accuracy of the proposed algorithms is compared with that of the linear regression model, which is the only forecasting model currently having a parallel distributed realization within the SPARK framework to address big data problems. The advantages of parallelizing the algorithms are also demonstrated. We validate our models by conducting various numerical experiments: evaluating the goodness of fit, analyzing how increasing dataset size influences time consumption, and analyzing time consumption by varying the degree of parallelism (number of workers) in the distributed realization.

#### 1. Introduction

Most current methods of data analysis, data mining, and machine learning must deal with big databases. Cloud Computing technologies can be successfully applied to parallelize standard data mining techniques in order to make working with massive amounts of data feasible [1]. For this purpose, standard algorithms often have to be redesigned for a parallel environment so that computations are distributed among multiple computation nodes.

One such approach is to use Apache Hadoop, which includes MapReduce for job distribution [2] and the Hadoop Distributed File System (HDFS) for data sharing among nodes.

Recently, a new and efficient framework called Apache SPARK [3] was built on top of Hadoop, which allows more efficient execution of distributed jobs and therefore is very promising for big data analysis problems [4]. However, SPARK is currently in the development stage, and the number of standard data analysis libraries is limited.

R software is a popular instrument for data analysts. It provides several possibilities for parallel data processing through add-on packages [5]. It is also possible to use Hadoop and SPARK from within R using SPARKR, an R package that provides a lightweight front-end to Apache SPARK. SPARKR is still in the development stage and supports only some features of SPARK but has great potential for the future of data science [6].

Alternative parallelization approaches also exist, such as the Message Passing Interface (MPI) [7]. However, in the present paper, we concentrate on SPARK because of its speed, simplicity, and scalability [8].

In this study, we consider regression-based forecasting for the case where the data has a nonlinear structure, which is common in real-world datasets. This implies that linear regression cannot make accurate forecasts and, thus, we resort to nonparametric and semiparametric regression methods, which do not require linearity and are more robust to outliers. However, the main disadvantage of these methods is that they are very time-consuming, and therefore a dataset becomes "big data" for such methods at a much smaller size than for parametric approaches. In the case of big datasets, traditional nonparallel realizations are not capable of processing all the available data. This makes it imperative to adapt existing techniques and to develop new ones that overcome this disadvantage. The distributed parallel SPARK framework gives us the possibility of addressing this difficulty and increasing the scalability of nonparametric and semiparametric regression methods, allowing us to deal with bigger datasets.

There are some approaches in the current literature that address nonparametric or semiparametric regression models for parallel processing of big datasets [9], for example, using R add-on packages or MPI. Our study examines a novel, fast, parallel, and distributed realization of the algorithms based on the modern version of Apache SPARK, which is a promising tool for the efficient realization of different machine learning and data mining algorithms [3].

The main objective of this study is to enable parallel distributed versions of nonparametric and semiparametric regression models, particularly kernel-density-based and partial linear models, to be applied to big data. To realize this, a SPARK MapReduce based algorithm has been developed, which splits the data, performs the required computations on each part in parallel in the map phase, and then merges the partial results in the reduce phase.

More specifically, the contribution of this study is (i) to design novel distributed parallel kernel density regression and partial linear regression algorithms over the SPARK MapReduce paradigm for big data and (ii) to validate the algorithms, analyzing their accuracy, scalability, and speed-up by means of numerical experiments.

The remainder of this paper is organized as follows. Section 2 reviews the traditional regression models to be analyzed. Section 3 reviews the existing distributed computation frameworks for big datasets. In Section 4, we propose parallel versions of kernel-density-based and partial linear regression model algorithms, based on the SPARK MapReduce paradigm. In Section 5, we present the experimental setup and in Section 6 we discuss the experimental framework and analysis. Section 7 concludes the paper and discusses future research opportunities.

#### 2. Background: Regression Models

##### 2.1. Linear Multivariate Regression

We start with linear regression, which is the only regression model realized in the current version of SPARK and therefore serves as the baseline against which the results of the proposed methods are compared.

Let us first consider the classical multivariate linear regression model [10, 11]:

$$Y = X\beta + \varepsilon, \tag{1}$$

where $n$ is the number of observations, $d$ is the number of factors, $Y_{n \times 1}$ is a vector of dependent variables, $\beta_{d \times 1}$ is a vector of unknown parameters, $\varepsilon_{n \times 1}$ is a vector of random errors, and $X_{n \times d}$ is a matrix of explanatory variables. The rows of the matrix $X$ correspond to observations and the columns correspond to factors. We suppose that $\{\varepsilon_i\}$ are mutually independent and have zero expectation and equal variances.

The well-known least square estimator (LSE) $\hat{\beta}$ of $\beta$ is

$$\hat{\beta} = (X^T X)^{-1} X^T Y. \tag{2}$$

Further, let $(x_i, y_i)$, $i = 1, \ldots, n$, be the observations sampled from the distribution of $(X, Y)$. After the estimation of the parameters $\beta$, we can make a forecast for a certain $k$th (future) time moment as $\hat{y}_k = x_k \hat{\beta}$, where $x_k$ is a vector of observed values of explanatory variables for the $k$th time moment.
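To make the estimation and forecasting steps concrete, the following minimal sketch solves the normal equations $X^T X \beta = X^T Y$ on a toy dataset and then forecasts one new point. It is an illustration only and not the paper's implementation: the helper names (`least_squares`, `forecast`, `solve`), the use of Gaussian elimination instead of an explicit matrix inverse, and the toy data are all assumptions made for this example.

```python
def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def solve(a, b):
    """Solve the square system a z = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [b[i]] for i in range(n)]  # augmented matrix [a | b]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z


def least_squares(X, Y):
    """Least squares estimate obtained from the normal equations (hypothetical helper)."""
    Xt = transpose(X)
    XtX = [[sum(a * b for a, b in zip(r1, r2)) for r2 in Xt] for r1 in Xt]
    XtY = [sum(x * y for x, y in zip(row, Y)) for row in Xt]
    return solve(XtX, XtY)


def forecast(x_k, beta_hat):
    """Forecast for a new row of explanatory variables x_k."""
    return sum(x * b for x, b in zip(x_k, beta_hat))


# Toy data generated exactly by y = 1 + 2 * x; the first column is the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
Y = [1.0, 3.0, 5.0, 7.0]

beta_hat = least_squares(X, Y)
y_next = forecast([1.0, 4.0], beta_hat)
```

Because the toy data are exactly linear, `beta_hat` recovers the intercept and slope up to floating-point error. For large $n$ the products $X^T X$ and $X^T Y$ are sums over observations, which is what makes this estimator amenable to splitting across data partitions.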

For big data, performing the matrix operations in (2) is problematic, so other optimization techniques can be used instead. One effective option is the Stochastic Gradient Descent algorithm [12], which is realized in SPARK. The generalized cost function to be minimized is

$$J(\beta) = \frac{1}{2n}(X\beta - Y)^T (X\beta - Y). \tag{3}$$

Algorithm 1 presents the Stochastic Gradient Descent algorithm, where $\alpha$ is a learning rate parameter.
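Independently of Algorithm 1, the idea of a stochastic gradient step can be sketched in a few lines of plain Python. The sketch assumes the usual squared-error cost, for which the gradient contributed by a single observation $(x_i, y_i)$ is $(x_i \beta - y_i)\, x_i$. The function name `sgd_linear_regression`, the default learning rate `alpha`, the number of epochs, and the toy data are hypothetical choices for illustration; this is neither the paper's algorithm listing nor SPARK's implementation.

```python
import random


def sgd_linear_regression(X, Y, alpha=0.05, epochs=2000, seed=0):
    """Estimate regression coefficients with per-observation gradient steps.

    alpha is the learning rate; epochs is the number of passes over the data.
    Both defaults are illustrative assumptions, tuned only for the toy data below.
    """
    rng = random.Random(seed)        # fixed seed so the run is reproducible
    beta = [0.0] * len(X[0])         # start from the zero vector
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)         # visit observations in random order
        for i in indices:
            # Residual of observation i under the current coefficients.
            residual = sum(x * b for x, b in zip(X[i], beta)) - Y[i]
            # Step against the gradient of the squared error of observation i.
            beta = [b - alpha * residual * x for b, x in zip(beta, X[i])]
    return beta


# Toy data generated exactly by y = 1 + 2 * x; the first column is the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
Y = [1.0, 3.0, 5.0, 7.0]

beta_sgd = sgd_linear_regression(X, Y)
```

On this noise-free toy dataset the iterates approach the intercept 1 and slope 2. With noisy data a constant learning rate generally leaves the estimate fluctuating around the least squares solution, which is why decreasing step-size schedules are commonly used in practice.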