Mobile Information Systems

Volume 2018, Article ID 3890341, 14 pages

https://doi.org/10.1155/2018/3890341

## An Evaluation Model and Benchmark for Parallel Computing Frameworks

^{1}School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China

^{2}Jiangsu High Technology Research Key Laboratory for Wireless Sensor Networks, Nanjing, Jiangsu 210003, China

^{3}College of Computer & Information Engineering, Henan University, Kaifeng, Henan 475001, China

Correspondence should be addressed to Zhijie Han; hanzhijie@126.com

Received 15 December 2017; Revised 8 February 2018; Accepted 10 February 2018; Published 29 March 2018

Academic Editor: Laurence T. Yang

Copyright © 2018 Weibei Fan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

MARS and Spark are two popular parallel computing frameworks that are widely used for large-scale data analysis. In this paper, we first propose a performance evaluation model based on the support vector machine (SVM), which is used to analyze the performance of parallel computing frameworks. Furthermore, we give representative results of a set of analyses with the proposed analytical performance model and then perform a comparative evaluation of MARS and Spark using representative workloads, considering factors such as performance and scalability. The experiments show that our evaluation model has higher accuracy than multifactor line regression (MLR) in predicting execution time, and it also predicts resource consumption requirements. Finally, we study benchmark experiments on MARS and Spark. MARS achieves better throughput and speedup than Spark in the executions of logistic regression and Bayesian classification because MARS has a large number of GPU threads that can handle higher parallelism. The experiments also show that Spark has lower latency than MARS in the execution of the four benchmarks.

#### 1. Introduction

Cloud computing has grown exponentially because of the increasing demand for storing, processing, and retrieving large amounts of data in cloud clusters. Apache Hadoop [1] is a framework that allows the distributed processing of large data sets across a cluster using a simple programming model. It is crucial for processor architects to understand which processor microarchitecture parameters affect performance [2]. MapReduce, one of the main components of Hadoop, is a parallelized, scalable computing framework. Applications based on Hadoop are widely used in machine learning, data mining, and graph processing due to its simple interface [3]. Encouraged by the success of CPU-based MapReduce, a MapReduce framework on graphics processors was proposed in [4]. Spark [5] is another cluster-computing framework that supports the MapReduce paradigm yet does not depend on it. Among these frameworks, Spark has gained the most popularity because it outperforms Hadoop significantly for interactive and iterative applications. A clear understanding of system performance under different circumstances is key to making decisions in resource management and task planning, such as choosing the hardware configuration of cluster nodes and tuning system parameters.

Since these tasks are time critical and require high performance, the throughput of server CPUs and I/O becomes a serious challenge when dealing with massive data [6]. The traditional technical architecture cannot satisfy the requirements of large-scale data processing, storage, fault tolerance, and acquisition. MapReduce is a CPU-based computation model that involves two procedures: Map and Reduce. Since a CPU supports only a few outstanding memory accesses, fetching massive data from memory can lead to significant latency. Consequently, it is difficult to exploit the high parallelism of query processing to hide the memory access latency, and the CPU cache helps little in reducing it due to its small capacity. GPUs have recently been utilized in various domains, including high-performance computing. A GPU can be regarded as a massively parallel processor with roughly 10x faster computation and 10x higher memory bandwidth than a CPU. It has strong parallel computing capability and is composed of thousands of compute units [7]. GPUs are especially suitable for data-intensive parallel computing mainly because they use a large number of threads to process different data at the same time. MARS is a GPU-based MapReduce designed for batch tasks, but it is also widely used for iterative tasks. Spark is a parallel computing engine designed mainly for iterative tasks, but it is also used for batch tasks. Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing. Spark provides a data set abstraction called resilient distributed data sets (RDDs) [5], which cache intermediate data in memory across a set of nodes. Since RDDs are kept in memory, they are efficient for algorithms that need multiple iterations.
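As a concrete, if toy, illustration of the Map and Reduce procedures described above, the classic word count can be written in a few lines of plain Python. This is only a sketch of the programming model, not MARS or Spark code:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from every input document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["spark mars spark", "mars gpu"]
result = reduce_phase(map_phase(docs))
# result == {"spark": 2, "mars": 2, "gpu": 1}
```

In a real framework, the Map outputs would be partitioned by key and shuffled across nodes (or GPU threads, in MARS) before the Reduce step, which is exactly where the memory and parallelism characteristics discussed above come into play.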
However, the input data type and size can significantly affect the performance of Spark and MARS on a particular job. When designing and implementing algorithms, it is extremely difficult to predict the performance metrics of a job, such as execution time, throughput, resource consumption, and latency, with limited computing capability [8].

Given the increasing use of parallel computing frameworks, methods that allow one to understand and predict the performance of such applications are appealing [9]. Performance evaluation modeling is crucial in both research and engineering work to gain insight into the behavior of complex systems [10]. Evaluation performance models are attractive tools for this purpose, as they can provide reasonably accurate task performance estimates, such as throughput and execution time, and ultimately help answer the aforementioned questions at significantly lower cost than simulation or experimental evaluation of real setups [11]. They are usually used to describe a parallel computing system through a set of equations capturing the wide range of factors that influence it.

Multifactor line regression (MLR) [12] is an important method in statistical analysis and data mining and is widely used in engineering. The traditional MLR prediction model is suitable only for small-scale input sample data and can run only on a single node. When the input sample size increases, it tends to slow down because of the growing amount of computation, or it may even fail to produce a result within the valid time. This is because matrix multiplication is a basic operation in MLR prediction: large input data form high-order matrices, and products of high-order matrices have high time complexity. Since the calculation consumes more system resources, computational efficiency is reduced. Distributing the matrix multiplication and reducing its computation time would allow the MLR prediction model to meet the requirements of large-scale data.
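A minimal sketch of the MLR fit the paragraph refers to, using NumPy's least-squares solver. The workload features and execution times below are fabricated purely for illustration; they are not measurements from this paper:

```python
import numpy as np

# Toy multifactor linear regression (MLR): predict execution time from
# two hypothetical workload features (input size in GB, node count).
X = np.array([[1.0, 10.0, 2.0],
              [1.0, 20.0, 2.0],
              [1.0, 20.0, 4.0],
              [1.0, 40.0, 4.0]])          # leading 1s add the intercept term
y = np.array([12.0, 22.0, 14.0, 24.0])    # fabricated execution times (s)

# Least-squares fit; forming products like X^T X is the high-order
# matrix multiplication whose cost grows quickly with the sample size.
beta, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ beta                            # in-sample predictions
```

The cubic cost of solving these matrix systems as the sample matrix grows is precisely the scaling bottleneck the paragraph describes.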

Support vector machine (SVM) [13] is a binary classification model whose basic form is the maximum-margin linear classifier on the feature space. Analytical techniques based on SVM have been used to analyze and predict the performance of various distributed and parallel systems. Since SVM does not rely on probability measures or the law of large numbers, it requires only small sample data. SVM can effectively infer predictions for test data from the training samples. Its learning strategy is margin maximization, which can be transformed into a convex quadratic programming problem. The final decision of the SVM is determined by a few support vectors, not by the dimensionality of the sample data. To classify data sets, SVM maps the input space to a high-dimensional feature space through preselected nonlinear mappings. SVM can identify key samples and eliminate redundant ones, which means it is insensitive to the addition or removal of nonsupport vectors. SVM is easy to implement and robust. In summary, SVM and MLR are two major performance evaluation methods; we apply both to the parallel computing frameworks to compare which one achieves higher accuracy in predicting execution time and resource consumption.
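To make the SVM idea concrete, here is a hand-rolled linear SVM trained by sub-gradient descent on the hinge loss, applied to fabricated, linearly separable toy data. This is only an illustrative sketch of the classifier family, not the evaluation model proposed in this paper:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Tiny linear SVM via sub-gradient descent on the hinge loss.
    Labels y must be in {-1, +1}; returns weight vector w and bias b."""
    rng = np.random.default_rng(0)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:                        # point inside the margin
                w = (1 - lr * lam) * w + lr * y[i] * X[i]
                b += lr * y[i]
            else:                                 # only shrink (regularize)
                w = (1 - lr * lam) * w
    return w, b

# Fabricated, linearly separable data: two clusters in the plane.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
pred = np.sign(X @ w + b)
```

The trained decision rule depends only on the points that end up on or inside the margin, which is the "support vector" property the paragraph describes.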

Our work aims to build an analytical evaluation model for parallel computing frameworks and to deploy MARS and Spark in a cloud computing environment. By comparing the two evaluation models, we conclude that SVM has higher prediction accuracy than MLR. The purpose of studying benchmarks is to help big data clients select the appropriate framework for processing their data. Our major contributions are as follows:

(1) We propose an evaluation performance model based on the machine learning method SVM for parallel computing frameworks and compare it with MLR.

(2) We study four benchmarks on MARS and Spark, respectively, and conduct a detailed analysis to understand how MARS and Spark process batch and iterative jobs.

The rest of the paper is organized as follows. Section 2 introduces the background and related works. Section 3 presents the analytical performance modeling techniques. Section 4 proposes a performance evaluation model based on SVM and compares it with MLR. Section 5 presents the implementation and analysis of experimental results of four benchmarks for MARS and Spark. Section 6 concludes the paper.

#### 2. Related Works

Big data has wide applications, such as batch processing, stream processing, interactive analysis, and query processing. There have been many proposals for performance analysis techniques specific to parallel computing frameworks, covering both evaluation performance models and benchmarks. There are several implementations of the MapReduce model, specifically in terms of workload, task scheduling, and heterogeneous environments [14–16]. Zhang et al. [17] proposed a distributed HOPCM method based on MapReduce for very large amounts of heterogeneous data. Vianna et al. [18] presented an evaluation model that estimates performance for a Hadoop online prototype using the job pipeline parallelism method. Zhang et al. [19] reviewed emerging research on deep learning models for big data feature learning and pointed out the remaining challenges of big data deep learning. In [3], machine learning techniques were applied to predict the performance of MapReduce workloads; the modeling approach consists of correlating preexecution characteristics of the workload with measured postexecution performance metrics. Considering other parallel computing frameworks, Wang and Khan [8] proposed a prediction model for Apache Spark, which simulates the execution of the actual job by using only a fraction of the input data and collects execution traces (e.g., I/O overhead, memory consumption, and execution time) to predict job performance for each execution stage individually. Chawla et al. [20] evaluated the performance of a cloud workstation from the perspective of a mathematical analysis model, using benchmarks for CPU, internal memory, and network bandwidth together with a comprehensive fuzzy evaluation model to assess the system impact caused by different parameters of an application.

Queuing models are also used to model the performance of a computing framework. In a queuing model, hardware and software resources are represented by a service center that comprises a server and an associated queue. Specifically, this approach operates in two steps: (i) jobs are spawned at a fork node into multiple tasks, and then (ii) the tasks are submitted to queuing stations that, in turn, model the available servers. Markov models are used to solve queuing network models; they represent the system by a state diagram that captures all possible states the modeled system may find itself in, as well as the possible transitions between such states and the rates at which those transitions occur. However, the limitation of Markov models is their complexity: the size of the state space grows exponentially with the number of tasks.
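For intuition about what such queuing models compute, the simplest textbook case, an M/M/1 queue (single server, Poisson arrivals, exponential service), has closed-form formulas for utilization, mean population, and mean response time. This is a generic illustration, not one of the models from the cited works:

```python
def mm1_metrics(arrival_rate, service_rate):
    """Textbook M/M/1 queue metrics; requires arrival_rate < service_rate."""
    rho = arrival_rate / service_rate          # server utilization
    L = rho / (1 - rho)                        # mean number of jobs in system
    W = 1 / (service_rate - arrival_rate)      # mean response time (Little's law)
    return rho, L, W

# Hypothetical station: 8 tasks/s arriving at a server handling 10 tasks/s.
rho, L, W = mm1_metrics(arrival_rate=8.0, service_rate=10.0)
# rho = 0.8, L = 4.0, W = 0.5
```

Queuing networks for fork-join jobs compose many such stations, and solving them exactly via a Markov chain is what causes the exponential state-space blowup noted above.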

Kavulya and Gandhi [21] predicted Hadoop process execution times by analyzing the behavior of Hadoop users and using a logistic regression algorithm to measure the similarity of jobs. However, the accuracy of this approach is unstable, and it requires a large amount of historical data. Ganapathi [3] analyzed the performance of the loaded historical operation information prediction system by using a machine learning algorithm; they extracted historical information on jobs and proposed a resource scheduling model that ensures jobs finish in a specific time. Popescu et al. [23] worked on execution time prediction for network-intensive iterative algorithms on MapReduce. However, their study mainly focused on iterative algorithms that require representative tuning data to achieve high prediction accuracy. Zhang et al. [24] targeted distributed computing jobs on heterogeneous machines and predicted job completion time with boundary-based performance modeling, whose aim is to evaluate the upper and lower limits of task completion time. In [25], the authors proposed a performance evaluation model for parallel computing models deployed in cloud centers to support big data applications, in which a big data application is divided into many parallel tasks and task arrivals follow a general distribution.

#### 3. Analytical Performance Modeling Techniques

In this section, we analyze analytical performance modeling techniques in detail. Various techniques have been applied to model the performance of computer systems. Hyperplane segmentation of the training data in the high-dimensional attribute space avoids nonlinear surface segmentation calculations in the original input space.

First, the linear binary classification problem is described mathematically as follows:

$$T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}, \quad x_i \in \mathbb{R}^d,\ y_i \in \{-1, +1\},$$

where $x_i$ is the training sample and $y_i$ denotes the two different categories. For a linearly separable sample set $T$, there is at least one sort line that separates the positive and negative samples. This sort line can be described as

$$w \cdot x + b = 0.$$

As illustrated in Figure 1, countless dividing lines exist between the positive and negative samples. For the two-dimensional space, we want to find the best dividing line. For the *n*-dimensional space, our ultimate goal is to find the best classification hyperplane, that is, the final decision boundary.
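The standard way to pick the "best" hyperplane among these candidates is to maximize the margin between the two classes. In the textbook hard-margin SVM formulation (stated here for completeness in the notation above, not reproduced from this paper's omitted equations), this is the convex quadratic program

$$\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2} \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1, \qquad i = 1, \ldots, n.$$

Its solution is determined only by the support vectors, that is, the samples for which the constraint holds with equality.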