Abstract

With the advancement of Internet technologies and the rapid growth of World Wide Web applications, the volume of digital data has increased tremendously, taking the digital world into a new era of big data. Many existing data processing technologies are neither consistent nor scalable enough to handle the complexity and size of such datasets. Recently, several distributed data processing and programming models have been proposed and implemented to handle big data applications. The open-source MapReduce programming model implemented in Apache Hadoop is the foremost model for data-intensive as well as computation-intensive applications owing to its inherent scalability, fault tolerance, and simplicity. In this research article, a new approach for predicting target labels in big data applications, named MR-MLR, is developed using a multiple linear regression algorithm and the MapReduce programming model. This approach achieves favourable values for MAE, RMSE, and the determination coefficient (R2) and thus shows its effectiveness for prediction in big data applications.

1. Introduction

Linear regression is one of the most important prediction algorithms in statistics and machine learning. It is a well-understood algorithm that aims to model the linear association between two attributes, namely, the dependent and the independent attribute. Over the past years, the regression method has been widely used in forecasting, prediction, batch process analysis, and chemical calibration [1–5]. Multivariate linear regression (MLR) attempts to build a model of the relationship between two or more attributes of a given domain [6–10].

In recent years, regression algorithms have been widely used in data-intensive big data applications for prediction as well as classification. To improve scalability, runtime management, and computational effectiveness, several parallel programming frameworks implement classification and prediction techniques in a distributed environment. Google’s MapReduce framework [11] was the revolutionary parallel programming tool for handling big data applications, built on its own distributed file system, the Google File System (GFS) [12]. MapReduce became a very popular tool in the parallel programming paradigm because of its scalability, simplicity, throughput, and fault tolerance [13]. Hadoop MapReduce is one of the publicly available and extensively used open-source frameworks for large-scale data processing [14]; it handles many low-level programming details such as block storage, task scheduling, data management, fault tolerance, and load balancing. Various statistical and machine learning algorithms that were developed for sequential execution in a single-machine context have been modified to run concurrently across multiple machines in a cluster computing environment. Apache developed Mahout, a set of machine learning libraries supported by the MapReduce model [15], and Yahoo uses a MapReduce-based programming model to analyze the big data [16] generated by its e-mail, news, sports, finance, entertainment, and shopping applications. Urbani et al. [17] implemented a MapReduce-based Web-scale inference engine (WebPIE) for scalable ontology reasoning on the semantic web [18] and demonstrated its performance on a Hadoop cluster with 64 nodes using the Bio2RDF, LLDLSR, and LUBM datasets [18]. Ding et al. [19] developed a MapReduce-based multilayered massive data query processing system over the Skyline smart transportation dataset.

In cloud computing environments and commodity clusters, the distributed MapReduce programming model is the best choice for data-intensive analysis due to its simplicity, its scalability, and the fact that complex tasks can be parallelized across the underlying computing resources [20–22]. Currently, the MapReduce runtime is available as a cloud service, so one can easily build a data processing application in the cloud environment [23]. Wang et al. [24] developed a MapReduce-based random forest machine learning algorithm called PaRFR (parallel random forest regression) for regression analysis in large-scale population genetic association studies involving multivariate traits.

Ashiq et al. [25] integrated the MapReduce framework with the graphics package OpenCV to process video streams in the cloud computing environment and showed that the system performs better in terms of processing time. Yaseen et al. [26] demonstrated a cloud-based, parallelized, fully automated video stream processing system capable of handling large numbers of video streams in a short period of time. Swapnil et al. [27] proposed a novel architecture that uses a Hadoop-based image interfacing system [28] for processing a large number of images concurrently on a Hadoop cluster, delivering high throughput. Jatmiko et al. [29] developed a MapReduce framework to analyze biomedical images for detecting breast and brain cancer. Huang et al. [30] demonstrated Hadoop-based parallel processing of remotely sensed images. Maillo et al. [31] adopted the MapReduce programming model for a k-nearest neighbor (MR-kNN) prediction and classification algorithm. Their results show that the implementation is scalable, is an exact parallel implementation, and achieves good computational time compared with the sequential version of k-NN.

1.1. The Contributions of This Research Paper
(i) A distributed machine learning multivariate linear regression model to process massively large-sized datasets is proposed
(ii) We executed experiments to show the scalability of the proposed model and compared it with a standalone sequential implementation of the linear regression algorithm
(iii) We evaluated the performance of the proposed research model for different split ratios of training and testing samples

2. Materials and Methods

2.1. Multivariate Linear Regression

A multivariate linear regression model with $n$ predictor variables $x_1, x_2, \ldots, x_n$ and a response variable $y$ can be written mathematically as

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n,$$

where $\hat{y}$ represents the predicted or expected value of the response/dependent variable, $x_1$ through $x_n$ are the distinct independent variables, $\beta_0$ represents the intercept value of the regression model, and $\beta_i$ is the $i$th regression coefficient, which determines the weight the equation places on the $i$th independent attribute when producing the estimated output.

MLR involves two procedures: model building through learning, and prediction. Learning a multivariate linear regression model means estimating the values of the intercept and regression coefficients used in the model representation from a given training dataset. Given the coefficients, if we substitute values for all the input variables, the learned model yields the predicted value of the response variable.
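As a brief illustration (not part of the proposed implementation), a minimal Python sketch of learning and prediction for such a model, using ordinary least squares on a hypothetical dataset:

import numpy as np

# Hypothetical data: X (m x n) holds the independent attributes, y (m,) the response.
X = np.random.rand(100, 3)
y = 2.0 + X @ np.array([1.5, -0.7, 3.2]) + np.random.normal(0, 0.1, 100)

# Learning: append a column of ones so the first solved coefficient is the intercept.
design = np.hstack([np.ones((X.shape[0], 1)), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, coefficients = beta[0], beta[1:]

# Prediction: substitute the input variables into the learned model.
y_hat = intercept + X @ coefficients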

3. The Proposed MapReduce Algorithm

The objective of a machine learning algorithm is to gain knowledge from past data, i.e., past or present events, and to represent that knowledge as a model. The learned model is then used to make predictions or classifications/decisions regarding unseen upcoming events. Figure 1 depicts the flow of the proposed MR-MLR algorithm, and the two phases of the algorithm are described as follows.

3.1. MapReduce-I for Training

The mapper program reads the training instances from the underlying distributed file system (HDFS) and computes the correlation among the regression variables, the intercept value, and the coefficients of all attributes in the training dataset. The reducer program collects the intercept values and coefficients for all the attributes and computes their average values. From the averaged intercept and regression coefficients, a distributed learned MR-MLR model is constructed for the given training instances.

3.2. MapReduce-II for Prediction/Classification

The mapper program of MapReduce-II reads the test dataset instances from the underlying distributed file system (HDFS) and predicts the values of the response variable. This mapper also calculates the various measures pertaining to the prediction of the response variable for all the instances, and these measures are passed to the reducer component of MapReduce-II. The reducer part collects the different performance metrics for the different blocks of data from the map tasks and computes their average values.

3.3. Hadoop Implementation of MapReduce-I

The instances are distributed uniformly among all the partitions, and each partition contains approximately ‘m/s’ instances. For each partition, a separate map task is created, and the regression coefficients are computed. Algorithm 1 presents the working flow of the map task operation. The reducer task reads the partial results computed by all instances of the map job and generates the average value of all the regression coefficients, as shown in Figure 2. The number of reduce tasks is smaller than the number of map tasks in the system because only a small portion of the proposed MR-MLR algorithm is implemented in the reducer component. Finally, the reducer constructs a distributed machine learning model and writes the model into HDFS (Algorithm 2).

Function MAP-1 (training dataset)
Begin
   Input: training dataset D with m instances and n attributes
   Partition the dataset D into s partitions as p1, p2, p3, ..., ps
   Read x_train[], y_train for each partitioned dataset
      Compute intercept and regression coefficients for each block of instances
      Convert it into a <key, value> pair as <Dataset_id, (intercept, coefficients[])>
   Output <Dataset_id, (intercept, coefficients[])>
End
Function REDUCE-1 (MAP-1 output)
Begin
   Read <Dataset_id, (intercept, coefficients[])> from HDFS
   for i = 1 to s partitions
      sum_intercept += intercept
   end for
   for i = 1 to s partitions
      for j = 1 to n attributes
         sum_coefficients[j] += coefficients[j]
      end for
   end for
   Compute avg_intercept, avg_coefficients[]
   Construct a learned MR-MLR model
   Output <MR-MLR model>
End
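A minimal Python sketch of MAP-1 and REDUCE-1 in the Hadoop Streaming style (each script reads records from standard input and emits tab-separated key/value pairs) is given below. The comma-separated record layout, the placeholder partition key "p1", and the command-line dispatch are illustrative assumptions, not the exact implementation used in the experiments.

import sys
import numpy as np

def map_1(lines, dataset_id="p1"):
    # MAP-1: fit a local MLR on one block of training records and emit
    # <Dataset_id, (intercept, coefficients)>. Each input line is assumed
    # to be comma-separated, with the response variable in the last column.
    rows = np.array([list(map(float, l.split(","))) for l in lines if l.strip()])
    x_train, y_train = rows[:, :-1], rows[:, -1]
    design = np.hstack([np.ones((x_train.shape[0], 1)), x_train])
    beta, *_ = np.linalg.lstsq(design, y_train, rcond=None)
    print(f"{dataset_id}\t{beta[0]}\t{','.join(map(str, beta[1:]))}")

def reduce_1(lines):
    # REDUCE-1: average the intercepts and coefficient vectors emitted by
    # all map tasks to obtain the learned MR-MLR model.
    intercepts, coeffs = [], []
    for line in lines:
        _, intercept, coeff_str = line.strip().split("\t")
        intercepts.append(float(intercept))
        coeffs.append([float(c) for c in coeff_str.split(",")])
    avg_intercept = np.mean(intercepts)
    avg_coefficients = np.mean(np.array(coeffs), axis=0)
    print(f"MR-MLR\t{avg_intercept}\t{','.join(map(str, avg_coefficients))}")

if __name__ == "__main__":
    # Invoked, e.g., as the -mapper ("map") or -reducer ("reduce") script
    # of a Hadoop Streaming job.
    {"map": map_1, "reduce": reduce_1}[sys.argv[1]](sys.stdin)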
3.4. Hadoop Implementation of MapReduce-II

The input test dataset TD with ‘z’ instances and ‘n’ columns is partitioned into ‘s’ splits named as p1, p2,…, ps in the underlying HDFS of Hadoop. For every block/partition, a separate map task is created and the machine learning model constructed in the MapReduce-I is validated as given in Algorithm 3. Finally, the reducer computes the average value of all the performance metrics as given in Algorithm 4 and writes the result in HDFS.

Function MAP-II (testing dataset)
Begin
   Input: testing dataset TD with z instances and n attributes
   Partition the dataset TD into s partitions as p1, p2, p3, ..., ps
   Read x_test[], y_test for each partitioned dataset
      Predict y_predict with the MR-MLR model
      Convert it into a <key, value> pair as <Dataset_id, (y_predict, y_test)>
   Output <Dataset_id, (y_predict, y_test)>
End
Function REDUCE-II (MAP-II output)
Begin
   Read <Dataset_id, (y_predict, y_test)>
   for all the s partitions
      Compute MSE, RMSE, and determination coefficient (R2) from y_predict and y_test
   end for
   Find the average values of MSE, RMSE, and determination coefficient (R2)
   Output <Dataset_id, (MSE, RMSE, R2)>
End
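The corresponding Hadoop Streaming-style sketch for the prediction phase is shown below. The model file name, partition key, and record layout are again illustrative assumptions for the sketch.

import sys
import numpy as np

def map_2(lines, model_path="mr_mlr_model.txt", dataset_id="p1"):
    # MAP-II: apply the learned MR-MLR model to one block of test records
    # and emit <Dataset_id, (y_predict, y_test)>. The model file path and
    # record layout are illustrative assumptions.
    with open(model_path) as f:
        _, intercept, coeff_str = f.read().strip().split("\t")
    intercept = float(intercept)
    coefficients = np.array([float(c) for c in coeff_str.split(",")])
    for line in lines:
        if not line.strip():
            continue
        *features, y_test = map(float, line.split(","))
        y_predict = intercept + float(np.dot(coefficients, features))
        print(f"{dataset_id}\t{y_predict}\t{y_test}")

def reduce_2(lines):
    # REDUCE-II: collect (y_predict, y_test) pairs from all map tasks and
    # report the error metrics of the MR-MLR model.
    pairs = np.array([[float(p), float(t)]
                      for _, p, t in (l.strip().split("\t") for l in lines if l.strip())])
    y_predict, y_test = pairs[:, 0], pairs[:, 1]
    mse = np.mean((y_test - y_predict) ** 2)
    rmse = np.sqrt(mse)
    r2 = 1.0 - np.sum((y_test - y_predict) ** 2) / np.sum((y_test - y_test.mean()) ** 2)
    print(f"metrics\tMSE={mse:.4f}\tRMSE={rmse:.4f}\tR2={r2:.4f}")

if __name__ == "__main__":
    {"map": map_2, "reduce": reduce_2}[sys.argv[1]](sys.stdin)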

4. Experimental Setup

4.1. Datasets Description

In this proposed research work, four datasets from the UCI machine learning repository are used. The characteristics of each dataset are tabulated in Table 1 and shown in Figure 3.

4.2. Performance Measures
4.2.1. Mean Absolute Error (MAE)

This measures the average absolute difference between the actual and predicted values of the target attribute.

4.2.2. Mean Square Error (MSE)

This measures the average squared difference between the actual values of the target attribute and the values predicted by the model.

4.2.3. Root Mean Square Error (RMSE)

This is the square root of the mean squared error, measuring the deviation between the predicted and actual values of the target attribute.

4.2.4. Determination Coefficient (R2)

This parameter assesses how well the model reproduces the actual data and thus quantifies the predictive quality of the model. The higher the R2, the better the developed model. The value of R2 usually lies between 0 and 1; for example, R2 = 1 means the model fits the data perfectly, i.e., all the data points lie on the regression line.
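For reference, with $m$ test instances, actual values $y_i$, predicted values $\hat{y}_i$, and mean actual value $\bar{y}$, these measures take their standard forms:

$$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|y_i - \hat{y}_i\right|, \qquad \mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2,$$

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{m}\left(y_i - \bar{y}\right)^2}.$$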

5. Experimental Result Analysis

The experimental result analysis is organized in three parts:
(i) First, we compare the performance metrics of the distributed learning MR-MLR model with the standalone sequential learning MLR model, as given in Section 5.1
(ii) Second, we analyze the scalability performance of the MR-MLR model, as reported in Section 5.2
(iii) Finally, we examine the influence of the split ratio of training and testing samples on the performance metrics of the MR-MLR model, as described in Section 5.3
5.1. Comparison of MR-MLR Model with the Standalone MLR Model

Initially, we run the standalone sequential multivariate linear regression method on all four datasets, which serves as a baseline for comparison. To do this, 10% of the samples are chosen arbitrarily from each dataset; the standalone MLR model is trained with 80% of these samples and tested with the remaining 20%. The intercept, regression coefficients, and performance metrics are recorded. For the proposed approach, four map tasks are created for each subset dataset, and all four map tasks learn the correlation among the attributes and compute the regression coefficients on the given input data. Table 2 shows the intercept values generated by the map tasks executed in the Hadoop cluster environment. The reduce task collects all the regression coefficients and computes their average values. The MR-MLR model is constructed with the help of the regression coefficients generated from the MapReduce phase I implementation.
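As an illustration of this baseline procedure (a sketch on hypothetical data, not the exact experimental code), the 80/20 standalone MLR run can be reproduced with scikit-learn as follows:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical 10% sample of a dataset: X holds the attributes, y the target.
X = np.random.rand(500, 4)
y = 1.0 + X @ np.array([2.0, -1.0, 0.5, 3.0]) + np.random.normal(0, 0.2, 500)

# 80% training / 20% testing split, as in the baseline experiment.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print("intercept:", model.intercept_, "coefficients:", model.coef_)
print("MAE :", mean_absolute_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R2  :", r2_score(y_test, y_pred))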

The MapReduce-II implementation validates the model constructed in the earlier phase with the 20% of test dataset samples. The performance is measured with the various parameters, and the results are tabulated and compared with the standalone MLR model performance, as shown in Table 3.

From these tables, we can observe the following:
(i) The intercepts and regression coefficients obtained from the four map tasks have average values nearly the same as those obtained from the standalone MLR model, which indicates that data-parallel processing is highly efficient
(ii) There is not much deviation from the baseline values in the performance metrics MAE, RMSE, and the determination coefficient (R2) of the MR-MLR model when compared with the standalone MLR model

5.2. Scalability

To study the scalability of the proposed MR-MLR algorithm and the influence of dataset splitting and distribution on the performance metrics, the input file is divided into s subsets of equally sized files with a balanced distribution of instances. We run our experiment on all four datasets with s = 6, 8, 10, 12, 14, and 16. About s map tasks are initiated for processing each dataset, and each map task computes the regression coefficients.

Initially, we set s = 6 and execute the experiment on all four datasets in the cluster. The values from each map task are converted into <Dataset_id, (intercept, regression coefficients)> key-value pairs. The reducer receives the intercepts and regression coefficients and computes their average values. Afterwards, a learned MR-MLR model is constructed from the computed average intercept and regression coefficients. Similarly, the experiment is repeated with s = 8, 10, 12, 14, and 16, and the results obtained are shown in Table 4.

The MR-MLR model constructed from the average values of the intercept and coefficients in MapReduce-I is tested with the test dataset. The performance metrics obtained from the experiments are shown in Tables 5–7.

According to these tables, we can conclude the following:
(i) The Hadoop cluster is capable of generating as many map tasks as needed, based on the availability of processor cores in the research cluster and the size of the datasets
(ii) The MR-MLR model is scalable to handle any number of instances, map tasks, and reduce tasks, provided a sufficient number of CPU cores is available in the cluster
(iii) The performance metrics MAE, RMSE, and the determination coefficient (R2) of the MR-MLR model vary by 33.3% (from s = 6 to s = 8), 25% (from s = 8 to s = 10), 20% (from s = 10 to s = 12), 16.7% (from s = 12 to s = 14), and 12.5% (from s = 14 to s = 16) as the value of s increases, for all four datasets

5.3. Influence of Training and Testing Split Ratio on Performance Metrics

To analyze the influence of the split ratio of training and testing samples on the performance metrics, we choose the percentage of training samples in each dataset as 50%, 60%, 70%, 80%, and 90%. The effectiveness of the designed MR-MLR model is measured with a 64 MB HDFS block configuration, and s is set to 4. From Figure 4, it can be concluded that the mean absolute error (MAE) gradually decreases as the training sample size increases. In the case of the year prediction MSD dataset, there is a rapid fall in MAE compared to the other three datasets.

Figure 5 shows that the root mean square error (RMSE) is much lower for the combined cycle power plant dataset, which is due to the smaller number of training samples compared with the other three datasets. As the ratio of training samples increases, R2 approaches 1, which indicates that the model fits the chosen datasets very well, as shown in Figure 6.
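The split-ratio sweep can be sketched (again on hypothetical data, with a standalone regression standing in for the distributed model) as a simple loop over the training fraction:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Hypothetical data standing in for the UCI datasets used in the study.
X = np.random.rand(1000, 4)
y = 0.7 + X @ np.array([1.0, 2.0, -0.5, 0.3]) + np.random.normal(0, 0.2, 1000)

for train_frac in (0.5, 0.6, 0.7, 0.8, 0.9):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=train_frac, random_state=1)
    mae = mean_absolute_error(y_te, LinearRegression().fit(X_tr, y_tr).predict(X_te))
    print(f"training fraction {train_frac:.0%}: MAE = {mae:.4f}")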

6. Conclusion

In this article, we have designed and developed a two-stage MapReduce model called MR-MLR for multivariate linear regression, a statistical/machine learning technique, based on Apache Hadoop. The constructed model is trained and tested with large datasets in a multinode Hadoop cluster environment. The use of the Hadoop framework provides scalability, fault tolerance, and runtime management when dealing with datasets of millions of instances. The predictive effectiveness of MR-MLR is computed in terms of mean absolute error, root mean square error, and the determination coefficient (R2). The results obtained show that the main achievements of MR-MLR are the following:
(i) The MSE, RMSE, and determination coefficient (R2) remain consistent even when the number of dataset subsets increases
(ii) When the train and test sample split ratio increases, the determination coefficient (R2) moves close to 1, indicating that the model learns and fits the given dataset very well
(iii) MR-MLR is a scalable, exact parallel approach, as many mappers and reducers can be instantiated, and it achieves very good performance metrics in large-sized big data applications

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.