Wireless Communications and Mobile Computing

Volume 2017 (2017), Article ID 6817627, 13 pages

https://doi.org/10.1155/2017/6817627

## A Variable Impacts Measurement in Random Forest for Mobile Cloud Computing

^{1}Department of IT Engineering, Sookmyung Women’s University, Cheongpa-ro 47-gil 100, Yongsan-gu, Seoul 04310, Republic of Korea

^{2}Big Data Using Research Center, Sookmyung Women’s University, Cheongpa-ro 47-gil 100, Yongsan-gu, Seoul 04310, Republic of Korea

Correspondence should be addressed to Young-Ho Park

Received 13 April 2017; Accepted 13 June 2017; Published 7 September 2017

Academic Editor: B. B. Gupta

Copyright © 2017 Jae-Hee Hur et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Recently, the importance of mobile cloud computing has increased. Mobile devices can collect personal data from various sensors within a short period of time, and sensor-based data contains valuable information about users. Advanced computation power and data analysis technology based on cloud computing provide an opportunity to classify massive sensor data into given labels. The random forest algorithm is known as a black box model whose internal process is hard to interpret. In this paper, we propose a method that analyzes variable impact in the random forest algorithm to clarify which variable affects classification accuracy the most. We apply the Shapley Value to random forest to analyze the variable impact. Under the assumption that the variables cooperate as players in a cooperative game, the Shapley Value fairly distributes the payoff among the variables. Our proposed method calculates the relative contributions of the variables within the classification process. We analyze the influence of the variables and rank them by their effect on classification accuracy. The results indicate that the proposed method is suitable for data interpretation in a black box model such as random forest, so that the algorithm is applicable in a mobile cloud computing environment.

#### 1. Introduction

Mobile cloud computing has become a significant issue for data mining. Since multimodal sensor data is gathered from mobile devices, data mining in a mobile cloud environment is an important research area. Multidimensional data from mobile devices, such as health information and GPS data, grows exponentially, so that it becomes difficult to handle manually.

Research is in progress on measuring variable impact in classification and regression on big data with multidimensional attributes by using data mining algorithms. As data becomes more complex, research on interpreting the results of data classification and regression is becoming more important. The main problem of multidimensional data analysis is the curse of dimensionality. Since high-dimensional data arrives in real time, often with few observations and many variables (the so-called “small *n* large *p*” problem), dimension reduction is a critical issue for efficient data analysis. The following examples illustrate the increasing need for research that identifies the variables that most affect a classification, in addition to improving classification accuracy.

*Example 1.* Suppose that a doctor diagnosing patient *P* uses a data mining algorithm to determine whether the patient has cancer. The algorithm, trained on data from previous patients who were judged cancer-positive, classifies patient *P* as cancer-positive. Before making a definitive diagnosis, the doctor wants to know the specific reason why the learning algorithm gave patient *P* a cancer-positive result.

*Example 2.* Consider two people: *B*, a banker, and *C*, a customer. *C* wants to borrow money from the bank. When *C* visits the bank and asks *B* to approve a loan, *B* would like to know the transaction history of *C*. Before approving, *B* wants to predict whether *C* is able to repay the loan. Since transaction data is composed of multidimensional attributes, it is impractical for *B* to investigate all of it. Therefore, a data mining algorithm can support the decision by querying the historical data of *C* in the database. When the algorithm suggests granting the loan to *C*, *B* may want to inspect the decision that the algorithm made and which variables had a major impact on the result.

As the examples above suggest, the need for research on variable impact measurement is increasing. In the first case, even if the prediction accuracy of the learning algorithm is high, the reliability of the diagnosis may deteriorate if the doctor cannot directly confirm the cause of the algorithm’s result. In the second case, it is very important for the banking industry to determine which customer data affected the classification result before deciding whether to approve the customer’s loan.

Recently, in the field of bioinformatics, related work has been proposed as personal medical data becomes more complicated and accumulates in real time [1–3]. There is an increasing demand for algorithms that can accurately predict a patient’s disease from data with multidimensional attributes [4]. Therefore, it is important to measure which of the various attributes in an individual’s medical data affected the prediction of the algorithm. The random forest algorithm performs reliable classification in this area. Statnikov et al. [5] applied binary and multicategory classification to cancer diagnosis and reported that random forests are outperformed by SVMs. Díaz-Uriarte and Alvarez de Andrés [6] show that the random forest algorithm is well suited to a large number of datasets and apply it to the gene selection problem in classification. Wu et al. [7] compare five machine learning algorithms: linear discriminant analysis, the *k*NN classifier, bagging and boosting classification trees, SVM, and random forest.

However, the random forest algorithm has a critical problem. Since it is a black box model, we cannot see which variables affect the classification result. It is therefore important to interpret the classification result with a variable importance measurement. Hapfelmeier et al. [8] investigate variable importance measurement when the data contains missing values. They propose random allocation instead of permuting values, to overcome the drawback of previous approaches that do not consider missing data. Gregorutti et al. [9] proposed a new algorithm that eliminates variables recursively so that prediction is possible with a smaller number of variables. The algorithm is efficient when high-dimensional regression or classification is required.

In this paper, we propose a new method, based on the random forest algorithm, that accurately captures the relative influence of each variable on classification, in an attempt to solve the problems described above. To this end, we incorporate the Shapley Value, a concept from economic theory, into the Mean Decrease Accuracy (MDA) index.

##### 1.1. Random Forest

The random forest algorithm, a kind of ensemble learning technique, generates several decision trees by bootstrapping the training data and trains each tree with randomization. The results of all the trees are then combined: by averaging in the case of regression and by majority vote in the case of classification. By training randomized decision trees and then aggregating them, random forests mitigate the overfitting problem by reducing the variance compared with a single decision tree. In particular, studies have shown that random forests perform well on data with many attributes but few observations (the “small *n* large *p*” setting), which makes them well suited to the field of bioinformatics. However, although the random forest algorithm has high prediction accuracy, it is a black box model, so the way in which the classification is performed internally cannot be interpreted intuitively.

The principle of random forest operation is as follows. First, various subsets are randomly generated from the existing training data. The most important characteristic of the random forest is bagging, which was proposed by Breiman [2] in 1996 as a shorthand for bootstrap aggregation. A single decision tree classifies well but tends to overfit, so random forests use the bootstrap to perturb the data. According to Breiman, bagging generates multiple versions of a predictor and uses them to obtain an aggregated predictor. Bagging can improve accuracy when perturbing the learning set causes a change in the constructed predictor. Research on the stability of variable impact measurements based on the random forest algorithm has received considerable attention recently [10]. In recent studies, variable impact measurement is divided into two categories: Mean Decrease Impurity (MDI) and Mean Decrease Accuracy (MDA).
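The bootstrap sampling and aggregation steps described above can be illustrated with a minimal sketch. It uses only the Python standard library; the function names `bootstrap_indices` and `majority_vote`, the sample size, and the class labels are hypothetical choices for illustration and are not taken from the paper.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Draw n indices with replacement; return the in-bag list and the left-out set."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    left_out = set(range(n)) - set(in_bag)
    return in_bag, left_out

def majority_vote(predictions):
    """Aggregate the class predictions of several trees by majority."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)           # fixed seed so the sketch is reproducible
n = 1000                         # hypothetical number of training observations
in_bag, left_out = bootstrap_indices(n, rng)

# Each bootstrap sample leaves out roughly (1 - 1/n)^n, about 37%, of the observations.
left_out_fraction = len(left_out) / n

# Hypothetical votes of three trees for one observation.
vote = majority_vote(["cancer", "healthy", "cancer"])
```

Because each tree sees a different perturbed sample, the trees disagree on some observations, and the vote averages out part of that variability.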

##### 1.2. Variable Impact Measurement Index

Linear regression analysis and decision trees are the algorithms most frequently used to verify the influence of variables on classification results [11]. However, as data becomes more complex in the age of big data, linear regression algorithms no longer give effective classification results. Decision trees, whose learning results are easy to interpret intuitively and which perform well, have emerged as an alternative for classifying data with multidimensional attributes. However, decision trees fit the training data too closely, and this overfitting makes the prediction accuracy on test data relatively low. The random forest method was proposed to solve this problem of the prediction accuracy of decision trees.

There are two main indicators for measuring the influence of a variable on classification in a random forest. One is the Mean Decrease Impurity (MDI) index, which measures the impact of a variable by totaling the decrease in impurity as the classification is performed; the other is the Mean Decrease Accuracy (MDA) index, which sums the decrease in accuracy depending on the presence or absence of a specific variable. However, since both indicators are biased by the order of the variables in the tree structure, they tend to report a classification influence larger than the actual value. According to [12], the two indicators cannot accurately determine classification influence because they cannot distinguish spurious correlation arising from the characteristics of the data. The paper [12] therefore proposed a technique that measures conditional variable importance to solve this problem. However, this technique cannot accurately capture the relative classification influence between variables, and it ranks the classification influence of the variables inconsistently.

This paper makes the following contributions:

(1) We propose a technique for measuring variable impacts based on the Shapley Value method on random forest regression. The proposed method attempts to solve the problem that highly correlated variables receive a relatively high contribution regardless of their real contribution to prediction.

(2) We propose a method that demonstrates the impact of variable coalitions. Since not only individual variables but also sets of variables have an impact, our proposed method is able to inspect the interaction between variables. A variable with a high priority of classification influence is expected to improve overall accuracy when it is used as a splitting variable in the tree.

(3) Finally, we propose a coherent ranking of variable impacts based on the marginal contribution of each variable.
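To make the idea of a marginal contribution within a coalition concrete, the following minimal sketch computes exact Shapley Values for a two-variable toy game. It is not the method proposed in this paper: the payoff table `toy_accuracy`, the variable names `x1` and `x2`, and the function `shapley_values` are hypothetical, and the payoff is simply assumed to be the accuracy reached when only the listed variables are used.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley Values: average each player's marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: totals[p] / len(orderings) for p in players}

# Hypothetical payoff: accuracy of a model that uses only the given variables.
toy_accuracy = {
    frozenset(): 0.50,
    frozenset({"x1"}): 0.70,
    frozenset({"x2"}): 0.60,
    frozenset({"x1", "x2"}): 0.90,
}

phi = shapley_values(["x1", "x2"], lambda s: toy_accuracy[frozenset(s)])
# phi["x1"] is 0.25 and phi["x2"] is 0.15, up to floating-point rounding.
```

The two values sum to 0.40, the gain of the full coalition over the empty one, so the payoff is fully distributed among the variables, and sorting by `phi` gives a ranking of the kind mentioned in contribution (3).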

The rest of this paper is organized as follows. In Section 2, we describe related work on variable impact measurement in the random forest regression algorithm. In Section 3, we explain the Shapley Value, a method from economic theory, and its basic structure. In Section 4, we propose a Shapley Value-based variable impact measurement method. In Section 5, we present experiments that compare previous methods with our proposed method. In Section 6, we summarize our research and conclude the paper.

#### 2. Related Work

In this section, we discuss previous research on variable impact measurement indices. In Section 2.1, we introduce previous research on variable impact measuring techniques in random forests. In Section 2.2, we describe several data mining algorithms that apply the Shapley Value.

##### 2.1. Variable Impact Measurement Index

We first review related research on variable impact measurement indices in random forests. The representative indices are Mean Decrease Impurity (MDI) and Mean Decrease Accuracy (MDA), proposed by Breiman [2]. To improve on their performance, Strobl et al. [12] proposed a conditional variable impact measuring technique for random forests.

###### 2.1.1. Mean Decrease Impurity

Breiman [2] proposed the impurity-based variable impact measurement index called MDI. A data impurity index is used to decide where to make a split and which variable to split on. Therefore, the MDI assumes that the amount of impurity reduction obtained when an individual variable is selected as the splitting node is its contribution in the random forest. The impurity reductions over all the trees are accordingly aggregated to give the importance of the variable. For impurity reduction, classification trees use the Gini index or information gain, and regression trees use the variance of the response around the node mean.

The variable importance (VI) of a variable $X_j$ is defined as follows. To calculate the variable importance with the MDI method, the decrease in Gini index attributable to the variable is added up over the trees from 1 to $ntree$, the number of trees, and the average is taken.

*The Formula of Mean Decrease Impurity [12]*
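The averaging described above can be sketched as a formula. The notation is an assumption for illustration: $X_j$ is the variable, $ntree$ is the number of trees, and $\Delta G^{(t)}(X_j)$ stands for the total decrease in Gini index produced by the splits on $X_j$ in tree $t$ (zero if the tree never splits on $X_j$).

```latex
\[
VI(X_j) \;=\; \frac{1}{ntree} \sum_{t=1}^{ntree} \Delta G^{(t)}(X_j)
\]
```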

MDI has the advantage of being easy to compute, but it has the disadvantage that it can be biased in favor of categorical variables with many categories. For example, if the data contains both continuous variables and categorical variables with several classes, the categorical variables offer more candidate partitions and are therefore more likely to be selected under otherwise equal conditions, which biases their importance. When a node is split on a specific variable, the most effective split is the one that yields the lowest impurity. If a single split reduces the impurity by the maximum amount, it is considered an efficient split, which means a high contribution to tree partitioning.

Conversely, if the decrease in impurity before and after a split on a specific variable is 0, the split is meaningless because the data is not classified by that variable. In this case, the importance of the variable is judged to be zero.
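The two extreme cases above, a split that removes all impurity and a split that removes none, can be checked with a small sketch of the Gini impurity decrease. The helper names `gini` and `impurity_decrease` and the four-observation toy node are hypothetical and serve only as an illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Decrease in Gini impurity when parent is split into left and right children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

parent = ["pos", "pos", "neg", "neg"]                                 # impurity 0.5
perfect = impurity_decrease(parent, ["pos", "pos"], ["neg", "neg"])   # pure children
useless = impurity_decrease(parent, ["pos", "neg"], ["pos", "neg"])   # same class mix as parent
```

Here `perfect` equals 0.5, the whole impurity of the parent node, while `useless` equals 0, matching the case in which the variable receives zero importance.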

###### 2.1.2. Mean Decrease Accuracy

MDA is also called permutation importance. The intuition behind permutation is that randomly permuting the values of a variable that is not useful for predicting the outcome hardly changes the prediction accuracy, whereas permuting a useful variable decreases it. Each decision tree is built on a bootstrap subsample of the training data, and the Out-Of-Bag (OOB) observations, those not drawn into a tree’s subsample, are used to calculate the prediction error of that tree. MDA calculates variable importance by permutation and uses the OOB observations as its evaluation sample. In other words, MDA computes the OOB accuracy before and after permuting a variable and takes the difference.
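The permutation step can be sketched for a single tree as follows. This is a simplified illustration rather than a full random forest: `predict` is a hypothetical stand-in for one fitted tree, `rows` and `labels` play the role of that tree’s OOB sample, and the data is synthetic, with column 0 determining the label and column 1 being noise.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted class equals the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, j, rng):
    """Accuracy before minus accuracy after shuffling column j across the rows."""
    baseline = accuracy(predict, rows, labels)
    column = [r[j] for r in rows]
    rng.shuffle(column)
    permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
    return baseline - accuracy(predict, permuted, labels)

rng = random.Random(0)
# Synthetic stand-in for an OOB sample: two columns, label depends on column 0 only.
rows = [[rng.random(), rng.random()] for _ in range(200)]
labels = [int(r[0] > 0.5) for r in rows]

def predict(row):
    """Stand-in for one fitted tree that splits only on column 0."""
    return int(row[0] > 0.5)

vi_informative = permutation_importance(predict, rows, labels, 0, rng)
vi_noise = permutation_importance(predict, rows, labels, 1, rng)
```

Shuffling the noise column leaves every prediction unchanged, so `vi_noise` is exactly zero, while shuffling the informative column lowers the accuracy noticeably. In a forest, these per-tree differences would be averaged over all trees.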

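Following the notation of [12], the variable importance of $X_j$ in tree $t$ is the difference, averaged over the out-of-bag observations $i$, between the accuracy of the predicted class before permuting $X_j$, which is $\hat{y}_i^{(t)} = f^{(t)}(\mathbf{x}_i)$, and after permuting variable $X_j$, which is $\hat{y}_{i,\pi_j}^{(t)} = f^{(t)}(\mathbf{x}_{i,\pi_j})$.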

*The Formula of Mean Decrease Accuracy [12]*
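The description above can be sketched as follows. The notation is assumed here for illustration and is modeled on the permutation importance of [12]: $\bar{B}^{(t)}$ is the out-of-bag sample of tree $t$, $y_i$ is the true class of observation $i$, $\hat{y}_i^{(t)}$ and $\hat{y}_{i,\pi_j}^{(t)}$ are the classes predicted by tree $t$ before and after permuting $X_j$, $I(\cdot)$ is the indicator function, and $ntree$ is the number of trees.

```latex
\[
VI^{(t)}(X_j) \;=\;
\frac{\sum_{i \in \bar{B}^{(t)}} I\bigl(y_i = \hat{y}_i^{(t)}\bigr)}{\bigl|\bar{B}^{(t)}\bigr|}
\;-\;
\frac{\sum_{i \in \bar{B}^{(t)}} I\bigl(y_i = \hat{y}_{i,\pi_j}^{(t)}\bigr)}{\bigl|\bar{B}^{(t)}\bigr|},
\qquad
VI(X_j) \;=\; \frac{1}{ntree} \sum_{t=1}^{ntree} VI^{(t)}(X_j)
\]
```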

###### 2.1.3. Conditional Variable Importance

Strobl et al. [12] identified the biased selection problem in MDI and MDA. Both methods are sensitive to the selection of split variables, so the selected variables are biased. For predictor variables with a spurious correlation, the influence of the variables is overestimated. The authors therefore suggest permuting a variable conditionally, within groups of observations defined by splits on the other variables, instead of the unconditional random permutation in which the values of the variable in the input data are replaced with values that are independent of the other variables. The research uses a simulation, shown in Table 1, to illustrate the problem. The first row of the table lists the input variables, and the second row lists their weights in generating the response. In this simulation, some of the input variables are correlated.