Computational Intelligence and Neuroscience

Volume 2016, Article ID 2842780, 11 pages

http://dx.doi.org/10.1155/2016/2842780

## Parallelizing Backpropagation Neural Network Using MapReduce and Cascading Model

School of Electrical Engineering and Information, Sichuan University, Chengdu 610065, China

Received 22 November 2015; Revised 16 January 2016; Accepted 19 January 2016

Academic Editor: Jose de Jesus Rubio

Copyright © 2016 Yang Liu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Artificial Neural Network (ANN) is a widely used algorithm in pattern recognition, classification, and prediction fields. Among a number of neural networks, backpropagation neural network (BPNN) has become the most famous one due to its remarkable function approximation ability. However, a standard BPNN frequently employs a large number of sum and sigmoid calculations, which may result in low efficiency in dealing with large volume of data. Therefore to parallelize BPNN using distributed computing technologies is an effective way to improve the algorithm performance in terms of efficiency. However, traditional parallelization may lead to accuracy loss. Although several complements have been done, it is still difficult to find out a compromise between efficiency and precision. This paper presents a parallelized BPNN based on MapReduce computing model which supplies advanced features including fault tolerance, data replication, and load balancing. And also to improve the algorithm performance in terms of precision, this paper creates a cascading model based classification approach, which helps to refine the classification results. The experimental results indicate that the presented parallelized BPNN is able to offer high efficiency whilst maintaining excellent precision in enabling large-scale machine learning.

#### 1. Introduction

At present, big data analysis has become an important methodology in finding data associations [1], whilst classification is one of the most famous research methods. Among types of classification algorithms, Artificial Neural Network (ANN) is proved to be an effective one that can adapt to various research scenarios. In numbers of ANN implementations, backpropagation neural network (BPNN) is the most widely used one due to its excellent function approximation ability [2]. A typical BPNN usually contains three kinds of layers including input layer, hidden layer, and output layer. Input layer is the entrance of the algorithm. It inputs one instance of the data into the network. The dimension of the instance determines the number of inputs in the input layer. Hidden layer contains one or several layers. It outputs intermediate data to the output layer that generates the final output of the neural network. The number of outputs is determined by the encoding of the classified results. In BPNN each layer consists of a number of neurons. The linear functions or nonlinear functions in each neuron are frequently controlled by two kinds of parameters, weight and bias. In the training phase, BPNN employs feed forward to generate output. And then it calculates the error between the output and the target output. Afterwards, BPNN employs backpropagation to tune weights and biases in neurons based on the calculated error. In the classifying phase, BPNN only executes feed forward to achieve the ultimate classified result. Although it is difficult to determine an optimal number of the hidden layers and neurons for one classification task, it is proved that a three-layer BPNN is enough to fit the mathematical equations which approximate the mapping relationships between the inputs and the outputs.

However, BPNN has encountered a critical issue that, due to a large number of mathematical calculations existing in the algorithm, low efficiency of BPNN leads to performance deterioration in both training phase and classification phase when the data size is large. Therefore to fulfil the potential of BPNN in big data processing, this paper presents a parallel BPNN (CPBPNN) algorithm based on the MapReduce computing model [3] and cascading model. The algorithm firstly creates a number of classifiers. Each classifier is trained by only one class of the training data. However, in order to speed up the training efficiency and maintain generalization, the class of training data does not train only one classifier but a group of classifiers. As long as one testing instance is input into these classifiers, they classify it and output their individual results. Afterwards, a majority voting is executed to decide the final result. If the testing instance is correctly classified, its classification is completed. Otherwise if the testing instance cannot be correctly classified by the classifiers, it will be output to a second group of classifiers trained by another class of training data until all groups of classifiers are traversed. The algorithm is implemented in the MapReduce environment. The detailed algorithm design and implementation are presented in the following sections.

The rest of the paper is organized as follows. Section 2 presents the related work; Section 3 describes the algorithm design in detail; Section 4 discusses the experimental results; and Section 5 concludes the paper.

#### 2. Related Work

It has been widely admitted that ANN has become an effective tool for processing nonlinear function approximation tasks, for example, recognition, classification, and prediction tasks. A number of researches employed neural network to facilitate their researches. Almaadeed et al. introduced a wavelet analysis and neural networks based text-independent multimodal speaker identification system [4]. The wavelet analysis firstly employs wavelet transforms to execute feature extraction. And then the extracted features are used as input for different types of neural networks, which create a number of learning modules including general regressive, probabilistic, and radial basis. Their results indicate that the employed BPNN can classify the data generated by DWT (discrete wavelet transform) and WPT (wavelet packet transform) with high accuracy. Chen and Ye employed a four-layer backpropagation neural network to compute ship resistance [5]. In their research, they studied the impact of algorithm performances with different parameters. Based on their results, with the original ship model experimental data, BPNN can help to develop high-precision neural network systems for the computation of ship resistance. Khoa et al. pointed out that, in the stock price forecasting, it is difficult to generate accurate predictions due to multiple unknown factors [6]. Therefore, they employed a feed forward neural network (FFN) and a recurrent neural network (RNN) to execute the prediction and they also employed the backpropagation mechanism to train and adjust the network parameters.

Recently, neural network with processing large-scale tasks is anxiously needed in big data application. However, the neural networks including BPNN have low efficiency in processing large-scale data. A number of effects have been done by researchers. They mainly focused on tuning the network parameters to achieve high performance. Research [7] combines the neural network algorithm with evolutionary algorithms. The approach can exploit the geometry of the task by mapping its regularities onto the topology of the network, thereby shifting problem difficulty away from dimensionality to the underlying problem structure. Jin and Shu pointed out that BPNN needs a long time to converge [8] so they employed the artificial bee colony algorithm to train the weights of the neural network to avoid the deficiency of BPNN. Li et al. proposed an improved BPNN algorithm with self-adaptive learning rate [9]. The experimental results show the number of iterations is less than that of the standard BPNN with constant learning rate. Also several researchers tried to solve the scale issue with combing cloud computing techniques. For example, Yuan and Yu proposed a privacy preserving BPNN in the cloud computing environment [10]. The authors aimed at enabling multiple parties to jointly conduct the BPNN learning without revealing their private data. The input datasets owned by the parties can be arbitrarily partitioned to achieve a scalable system. Although the researchers claimed that their algorithm supplies satisfied accuracy, they have not conducted the detailed experiments for testing the algorithm efficiency. It is well known that the cloud computing is extremely loosely coupled so that the cloud environment based neural network may encounter a large overhead. Additionally the researches have not mentioned how their algorithm performs in dealing with the practical large-scale tasks. Researches [11–13] stated that a better choice to implement large-scale classification is to parallelize BPNN using the parallel and distributed computing techniques [14]. Research [15] presented three types of Hadoop based distributed BPNN algorithms. A great difference from the work presented by this paper is that, due to the cascading model, our algorithm could improve the algorithm precision. However, the algorithms in [15] can only guarantee but not improve the algorithm precision.

Gu et al. presented a parallel neural network using in-memory data processing techniques to speed up the computation of the neural network. However, their algorithm does not consider the accuracy issue [16]. In this work, the training data is simply segmented into a number of data chunks which are processed in parallel, which may result in accuracy loss. Hebboul et al. also parallelized a distributed neural network algorithm based on the data separation [17]. However, the accuracy loss is also a critical issue in their work. Ganeshamoorthy and Ranasinghe created a vertical partition and hybrid partition scheme [18] for parallelizing neural network using MPI (Message Passing Interface) [19]. However, MPI requires a highly homogeneous environment which decays the adaption of the parallelized algorithm.

The work presented in this paper mainly focuses on parallelizing BPNN in terms of improving the algorithm efficiency, simultaneously maintaining the algorithm classification accuracy in dealing with large-scale data. The paper employs the Hadoop framework as the underlying infrastructure. And then a number of designs have been done in order to improve the algorithm efficiency in both training and classification phases. Also a cascading mechanism is introduced to enhance the algorithm classification accuracy.

#### 3. Algorithm Design

##### 3.1. Backpropagation Neural Network

BPNN is a multilayer network including input layer, hidden layer, and output layer. Each layer consists of a number of neurons. In order to adjust the weights and biases in neurons, BPNN employs error backpropagation operation. Benefiting from the gradient-descent feature, the algorithm has become an effective function approximation algorithm [20, 21]. A standard BPNN which consists of a number of inputs and outputs is shown in Figure 1.