Abstract

When dealing with multihoming big data network problems, the random forest algorithm under the MapReduce framework suffers from too many redundant and irrelevant features, low information content in the training features, and low parallelization efficiency. To address this, a parallel random forest algorithm based on information theory and norms (PRFITN) is proposed. The technique first builds a hybrid dimensionality reduction approach (DRIGFN) based on information gain and the Frobenius norm, which effectively reduces the number of redundant and irrelevant features and yields a dimensionality-reduced dataset. Next, a feature grouping strategy based on information theory (FGSIT) is offered: the features are grouped according to the FGSIT strategy, and stratified sampling is employed to guarantee the amount of information in the training features used to build the decision trees of the random forest. When datasets are provided as key/value pairs, it is common to aggregate statistics across all objects with the same key; therefore, a key-value pair redistribution strategy (RSKP) is applied in the Reduce stage to obtain global classification results and achieve a rapid, even distribution of key-value pairs, which improves the cluster's parallel efficiency. According to the experimental findings, the approach provides a superior classification effect in multihoming big data networks, particularly for datasets with many features. Feature selection and feature extraction can be used together: in addition to minimizing overfitting and redundancy, lowering dimensionality contributes to better human interpretation and lower computing cost through model simplicity.

1. Introduction

A classification algorithm is a supervised learning algorithm that can discover classification rules and construct classification models from labeled information in order to predict the attributes of unlabeled data [1]. Among classification algorithms, random forest (RF) has been used in recent years in text classification [2] and environmental prediction because of its strong stability and good tolerance to noise and outliers [3, 4], and it has received extensive attention in credit evaluation [5], bioinformatics [6], medical diagnosis [7], and other fields. A random forest, as the name indicates, is a classifier that trains decision trees on diverse subsets of the supplied dataset and combines them to improve forecasting accuracy. Instead of relying on a single decision tree, the random forest takes the forecast from each tree and predicts the outcome based on the majority vote of those predictions. Increasing the number of trees in the forest improves reliability and generalization and helps prevent overfitting.

Big data systems nowadays are backed by a variety of processing, analytical, and dynamic visualization capabilities. These platforms make it possible to retrieve knowledge and information from complicated, dynamic environments. Through suggestions and automatic identification of anomalies, deviant behavior, or new trends, they also assist in decision-making [8]. Big data has become a research hotspot as information and network technologies have advanced. Compared to traditional data, big data has the 4V characteristics of volume (large quantity), variety (many types), velocity (fast generation), and value (low density) [8], which require longer running times and more memory when processing; improving computing capability is therefore especially important to meet the demands of big data analysis and processing. By refining the classic random forest technique and merging it with a distributed computing model, the notion of parallelized computing becomes highly relevant, and it has become the major focus of current research.

In recent years, researchers and businesses have embraced Google's MapReduce parallel programming model in the area of large-scale data processing owing to its ease of use, automated fault tolerance, and high scalability. The MapReduce programming style makes it possible to write programs that process big data in parallel on several nodes, and large amounts of complicated data may be analyzed using analytical tools built on it. The MapReduce concept aims to make the transformation and analysis of huge datasets more straightforward while allowing programmers to concentrate on algorithms rather than data management, and it makes data-parallel algorithms easy to build. This paradigm has been implemented in a number of ways, notably Google's C++ implementation and Apache's Hadoop implementation (written in Java), both of which run on large clusters of commodity hardware. At the same time, Hadoop and Spark, which represent distributed computing systems, have received a lot of attention [9]. Many random forest techniques based on the MapReduce computing architecture have already been effectively applied to big data analysis and processing. Among these, the MapReduce-based parallelized random forest method MR_RF [10] uses a divide-and-conquer approach: the MapReduce paradigm splits the input and transfers it to several computing nodes, each of which builds a base classifier, and the outputs of the computing nodes are then aggregated to create the random forest model.

MapReduce is extremely scalable and runs on big clusters of commodity computers; many terabytes of data are often processed on thousands of machines during a typical MapReduce computation [11]. In MR_RF, the MapReduce model is then called again, and the created random forest is used to predict the test set in order to obtain the classification accuracy, completing the parallelization of the random forest algorithm. Because the parallelization framework is called twice and the intermediate results are read and written several times, this takes considerable time. To lower the time complexity of the MR_RF technique, the literature developed a revised MR_RF method [12], which uses out-of-bag data to directly compute the classification model's generalization error in order to estimate the random forest's classification accuracy, reducing the number of calls to the parallel framework. However, in a big data environment, a significant number of redundant and irrelevant features in the dataset diminish the quality of the features picked by the decision trees while building the random forest model, which affects the random forest model's overall classification accuracy.

The author devised a parallel random forest (PRF) approach to lessen the effect of redundant and irrelevant features in big datasets on the model [13], and the PRF algorithm is optimized with a hybrid strategy integrating data-parallel and task-parallel optimization [14]. The out-of-bag data is used as the training set to determine the classification accuracy corresponding to each decision tree as its weight, which is then employed in the model prediction step. Although the PRF method increases the random forest's classification performance by optimizing the training features, it does not reduce the number of redundant and irrelevant features in the dataset, so the resulting training feature set still contains considerable redundancy and irrelevance. In light of this, the authors presented PRFMIC [15], a parallel random forest method based on the maximum information coefficient. The features are separated into three intervals using the maximum information coefficient, the low-correlation interval is eliminated, and the high-correlation interval is chosen. Compared to a single decision tree, the random forest approach is more accurate [16]. Feature subsets are formed from the features in the high-correlation and mid-correlation intervals, and the random forest model is built in parallel. However, even though the method considers the impact of irrelevant features on the random forest model, redundant features still cannot be handled during the random forest modeling step.

The importance of creating big data applications has increased over the past several years, and companies from various industries increasingly rely on the knowledge derived from enormous amounts of data. Traditional data platforms and methodologies, on the other hand, perform poorly in this context: they lack scalability, efficiency, and accuracy and have slow response times. Much effort has been expended in addressing these difficult big data issues, and as a result, several distributions and technical advances have emerged. Existing surveys provide a global overview of the primary big data technologies and compare them across several system layers, including the data storage, data processing, data querying, data access, and data management layers, classifying and examining their main technical aspects, benefits, and limitations.

The aforementioned approaches do not consider the amount of information in the training features when producing the training feature set, and they increase the correlation between decision trees, which impacts the overall accuracy of the random forest model. The overall accuracy of the random forest is determined by the decision trees trained on the training feature set; moreover, owing to load imbalance, these methods take too long in the prediction and classification stages, reducing the overall parallelization efficiency of the random forest [17]. There are thus still pressing concerns to be addressed: how to minimize redundant and irrelevant features in huge datasets, how to increase the amount of information in the training features, and how to improve the parallel efficiency of the algorithms. This work presents a parallel random forest method based on information theory and norms (PRFITN) to address these issues. First, the method creates a dimensionality-reduced dataset using a hybrid dimensionality reduction approach called DRIGFN (dimension reduction based on information gain and Frobenius norm), effectively minimizing redundancy and irrelevance. Furthermore, the algorithm presents a feature grouping strategy based on information theory (FGSIT), which groups the features according to the FGSIT strategy and uses stratified sampling to guarantee the amount of information in the feature subsets used to build the decision trees of the random forest, enhancing the accuracy of the classification results. Finally, taking into account the impact of the cluster load on the efficiency of the parallel algorithm, a key-value pair redistribution strategy (RSKP) is proposed in the Reduce phase to obtain the global classification results and realize a fast and even distribution of key-value pairs, thereby improving the parallel efficiency of the cluster. The experimental findings suggest that the algorithm performs better in a big data context, particularly when dealing with datasets with a large number of features.

The next section describes the related concepts, followed by the PRFITN algorithm analyzed in this research. Then, the experimental results are analyzed, and finally, conclusions are drawn.

2.1. Introduction to Related Concepts

Definition 1 (information gain). Given a discrete variable $X$ and its corresponding category $Y$, the information gain is calculated by the following formula:
$$IG(Y; X) = H(Y) - H(Y \mid X),$$
where $H(Y)$ is the information entropy of category $Y$ and $H(Y \mid X)$ is the conditional entropy of category $Y$ given variable $X$.

Definition 2 (mutual information). Given discrete variables $X$ and $Y$, the mutual information is calculated by the following formula:
$$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}.$$

Definition 3 (conditional mutual information). Given discrete variables $X$ and $Y$ and their corresponding category $Z$, the conditional mutual information is calculated by the following formula:
$$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z),$$
where $H(X \mid Z)$ is the conditional entropy of variable $X$ given category $Z$ and $H(X \mid Y, Z)$ is the conditional entropy of $X$ given variable $Y$ and category $Z$.

Definition 4 (Frobenius norm). Given that $A$ is an $m \times n$ matrix and $a_{ij}$ is an element of the matrix, the Frobenius norm can be calculated by
$$\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^{2}}.$$
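The following minimal Python/NumPy sketch illustrates how Definitions 1-4 can be computed for small discrete samples. The helper names and the tiny demonstration arrays are illustrative assumptions, not part of the original algorithm.

import numpy as np
from collections import Counter

def entropy(y):
    """H(Y) = -sum_y p(y) log2 p(y) for a discrete sequence."""
    n = len(y)
    return -sum((c / n) * np.log2(c / n) for c in Counter(y).values())

def cond_entropy(a, b):
    """H(A | B) = sum_b p(b) H(A | B = b)."""
    n = len(b)
    return sum((np.sum(b == v) / n) * entropy(a[b == v]) for v in np.unique(b))

def info_gain(y, x):
    """Definition 1: IG(Y; X) = H(Y) - H(Y | X)."""
    return entropy(y) - cond_entropy(y, x)

def mutual_info(x, y):
    """Definition 2: I(X; Y) = H(X) - H(X | Y) (equals IG for discrete data)."""
    return entropy(x) - cond_entropy(x, y)

def cond_mutual_info(x, y, z):
    """Definition 3: I(X; Y | Z) = H(X | Z) - H(X | Y, Z)."""
    yz = np.array([f"{a}|{b}" for a, b in zip(y, z)])  # joint variable (Y, Z)
    return cond_entropy(x, z) - cond_entropy(x, yz)

def frobenius_norm(A):
    """Definition 4: square root of the sum of squared matrix elements."""
    return float(np.sqrt(np.sum(np.asarray(A, dtype=float) ** 2)))

x = np.array([0, 0, 1, 1, 1, 0])
y = np.array([0, 1, 1, 1, 0, 0])
z = np.array([0, 0, 0, 1, 1, 1])
print(info_gain(y, x), mutual_info(x, y), cond_mutual_info(x, y, z))
print(frobenius_norm([[3, 4], [0, 0]]))   # -> 5.0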

2.2. Principal Component Analysis Algorithm

Principal component analysis (PCA) [18] is a multivariate statistical method for dimensionality reduction. Its main purpose is to find a transformation matrix that reduces the dimension of the dataset while retaining as much of the variation as possible. PCA is the most widely used unsupervised dimensionality reduction technique; it produces relevant characteristics by combining the original variables in linear (linear PCA) or nonlinear (kernel PCA) arrangements. Significant features are created by linearly mapping correlated data onto a smaller set of uncorrelated variables. This is accomplished by projecting (dot producting) the original data onto the reduced PCA space using the eigenvectors of the covariance/correlation matrix, also known as the principal components (PCs). PCA is thus a linear transformation of the data into a sequence of uncorrelated variables in the reduced PCA space, where the first component explains the most variation and each successive component explains less. The PCA algorithm is mainly divided into four steps: (1) establish a data matrix and standardize the original dataset; (2) establish the correlation coefficient matrix and calculate the eigenvalue and corresponding eigenvector of each principal component; (3) determine the number of principal components required according to the eigenvalues and the cumulative contribution rate; and (4) combine the eigenvectors corresponding to the principal components to obtain the transformation matrix and reduce the dimension of the original dataset.
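As a concrete illustration of these four steps, the short NumPy sketch below standardizes the data, decomposes the correlation matrix, inspects the cumulative contribution rate, and projects onto the top-k eigenvectors. The random demonstration data and the choice of k are assumptions made only for illustration.

import numpy as np

def pca_transform(X, k):
    """Minimal PCA sketch following the four steps in the text."""
    # (1) standardize the original dataset
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # (2) correlation coefficient matrix and its eigen-decomposition
    R = np.corrcoef(Xs, rowvar=False)
    eigval, eigvec = np.linalg.eigh(R)                 # ascending eigenvalues
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order], eigvec[:, order]
    # (3) cumulative contribution rate of the leading components
    contrib = np.cumsum(eigval) / np.sum(eigval)
    print("cumulative contribution rate:", np.round(contrib[:k], 3))
    # (4) transformation matrix from the top-k eigenvectors, then project
    W = eigvec[:, :k]
    return Xs @ W, W

X = np.random.default_rng(0).normal(size=(100, 5))
Z, W = pca_transform(X, k=2)
print(Z.shape)   # (100, 2)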

2.3. Support Vector Machine Algorithm

The support vector machine (SVM) algorithm [19] is a data mining algorithm based on statistical learning theory. It selects an optimal classification hyperplane that meets the classification requirements, so that the hyperplane guarantees classification accuracy while maximizing the blank margin on both sides of the hyperplane [20]. SVMs are used in web page classification, intrusion detection, face identification, email categorization, genre classification, and handwriting recognition, among other applications, and they support both classification and regression of linear and nonlinear data. The SVM algorithm is mainly divided into three steps: (1) construct the classification hyperplane $w^{T}x + b = 0$, where $w$ is the weight vector of the hyperplane and $x$ is the data vector; (2) use a kernel function to solve for the classification hyperplane and obtain the hyperplane weight $w$; and (3) use the hyperplane weight to predict the data classification. Large datasets are not a good fit for the SVM algorithm, and SVM does not perform well when the target classes overlap or the dataset contains much noise. The SVM will also tend to underperform when the number of features for each data point exceeds the number of training samples.
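The following scikit-learn sketch walks through the three steps on synthetic data: fitting a hyperplane, reading back its weight vector, and predicting labels. The synthetic dataset and the linear kernel are assumptions chosen for illustration.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Steps (1)-(3): fit the hyperplane w.x + b = 0, read back its weights,
# and use the fitted model to predict labels on held-out data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)          # synthetic labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = SVC(kernel="linear").fit(X_tr, y_tr)             # solve for the hyperplane
print("weight vector w:", clf.coef_, "bias b:", clf.intercept_)
print("test accuracy:", clf.score(X_te, y_te))         # predict with the model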

3. PRFITN Algorithm

The PRFITN algorithm mainly includes three stages: data dimensionality reduction, feature grouping, and parallel construction of the random forest. (1) In the data dimensionality reduction stage, the DRIGFN strategy is proposed to accurately identify and delete redundant and irrelevant features in the dataset and obtain the dimensionality-reduced dataset. (2) In the feature grouping stage, the FGSIT strategy is proposed to measure the importance of features and then distribute the features cyclically on this basis, obtaining two feature subsets Q and S. (3) In the stage of parallel construction of random forests, the RSKP strategy is proposed to optimize the MapReduce computing model and improve its parallelization efficiency [21]; the optimized MapReduce model is then used to build the random forest, predict and classify the dataset, and obtain the accuracy of the random forest.

3.1. Data Dimensionality Reduction

At present, dimensionality reduction algorithms mainly include feature selection and feature extraction. However, in the big data environment, because of the large number of redundant and irrelevant features in the dataset, feature selection or feature extraction alone cannot achieve good results. This paper therefore proposes the DRIGFN strategy to identify and filter redundant and irrelevant data in a big data environment [22]. First, combined with the MapReduce model, the information gain value of each feature is calculated in parallel to remove irrelevant features; then, the Frobenius norm is used to estimate the amount of information loss, the classification error, and the degree of overfitting of the classifier, and on this basis a global optimization function is proposed to iteratively optimize the dimensionality reduction parameters. Suppose $X$ represents the samples in the feature space of the original dataset DB, the dataset contains several different categories, and $Y$ represents the label corresponding to the feature matrix $X$. The DRIGFN strategy is as follows.

3.2. Feature Selection

For dataset DB, the main purpose of feature selection is to reduce the number of irrelevant features. The specific process is as follows. First, the default file block strategy in Hadoop is used to divide the feature space of the original dataset into file blocks of the same size; the file blocks are then used as input data [23]. According to Definition 1, each Mapper node calls the Map function to calculate the information gain of each feature in the form of a key-value pair <key, value>, where key is the feature name and value is the information gain of the corresponding feature, and the key-value pairs are combined to obtain the feature information gain set A. Finally, the elements of set A are sorted in descending order of their information gain values, the features ranked at the bottom of set A are removed, the remaining features are recombined into the new feature matrix X, and the dataset obtained by merging the feature matrix and the label vector by column is passed to the next stage. Feature selection is performed as follows.

Input: original dataset DB.
Output: feature matrix X, dataset DB.
  1. Block ⟵ split the feature space of the original dataset
  2. Key: feature name
  3. Value: combine the feature space of the key with label Y
  4. For each feature Xi in each block, do
  5. key ⟵ feature name
  6. //according to Definition 1, calculate the information gain value of feature Xi; pc represents the proportion of category c in the dataset; α represents a tuple divided according to the value of feature Xi; |n|, |nα|, and |nα,c|, respectively, correspond to the total number of data samples, the number of elements in tuple α divided by the value of feature Xi, and the number of elements of category c in tuple α
  7. value ⟵ IG(Y; Xi) //the information gain value of Xi is assigned to value
  8. A ⟵ <key, value> //put the key-value pair <key, value> corresponding to the feature into the set A
  9. End for
  10. Sorted(A) //sort A in descending order of value
  11. Delete the later features in set A //delete the elements ranked at the bottom of A
  12. Return X //the remaining features in A form the feature matrix X
  13. DB ⟵ combine X and Y by column
  14. Return DB
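A single-machine Python sketch of this step is given below; each block of features stands in for one Mapper emitting <feature, information gain> pairs, and mutual_info_score from scikit-learn serves as the information gain of a discrete feature. The block count, the keep threshold, and the synthetic data are illustrative assumptions.

import numpy as np
from sklearn.metrics import mutual_info_score   # information gain of a discrete feature

def select_by_information_gain(X, y, keep, n_blocks=4):
    """Single-machine sketch of the feature-selection step: each block of
    features plays the role of one Mapper emitting <feature, IG> pairs,
    which are then merged, sorted in descending order, and truncated."""
    pairs = []                                      # set A of <key, value> pairs
    for block in np.array_split(np.arange(X.shape[1]), n_blocks):
        for i in block:                             # "Map": one pair per feature
            pairs.append((i, mutual_info_score(X[:, i], y)))
    pairs.sort(key=lambda kv: kv[1], reverse=True)  # sort A by information gain
    selected = [i for i, _ in pairs[:keep]]         # drop the lower-ranked features
    return X[:, selected], selected

# Tiny discrete example: only the first two columns carry label information.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(300, 6))
y = (2 * X[:, 0] + X[:, 1] >= 3).astype(int)
Xs, kept = select_by_information_gain(X, y, keep=2)
print(kept)   # expected: [0, 1]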
3.3. Feature Extraction

In the feature extraction stage, to further optimize the dataset after feature selection, principal component analysis and the support vector machine algorithm are first used to obtain the initial parameters, and the obtained parameters are used to reconstruct the feature matrix; secondly, the Frobenius norm is used to estimate the amount of information loss, the classification error, and the degree of overfitting of the classifier [20]; finally, to minimize the sum of the information loss, the classification error, and the degree of overfitting, a global optimization function is proposed to optimize the transformation matrix and the classification matrix. The specific process of feature extraction is described as follows:

(1) Initialization of parameters and reconstruction of the feature matrix, given the feature matrix X and the dataset DB

First, principal component analysis (PCA) is adopted to obtain the initial transformation matrix $W$; the feature matrix $XW$ obtained after dimensionality reduction by PCA is merged with the label by column to obtain the transformed dataset.

Secondly, the support vector machine (SVM) algorithm is used, with all the samples in the transformed dataset as the training set, to obtain the classification matrix $\upsilon$, according to which the predicted label can be calculated as $\hat{Y} = XW\upsilon$.

Next, in order to evaluate the loss of information in the feature extraction process, the transformation matrix $W$ is used to reconstruct the feature matrix $X$, and the reconstruction matrix of $X$ can be expressed as $\tilde{X} = XWW^{T}$.

(2) Estimation of the amount of information loss, the classification error, and the degree of overfitting of the classifier

According to the transformation matrix $W$, the classification matrix $\upsilon$, and the reconstruction matrix $\tilde{X}$ obtained in the previous step, this part uses the Frobenius norm to estimate the amount of information loss, the classification error, and the degree of overfitting of the classifier.

Since the reconstruction matrix $\tilde{X}$ is obtained by transforming the feature matrix $X$ through the matrix $W$, its elements will differ to a greater or lesser extent from those of $X$. The Frobenius norm is therefore used to measure and sum the differences between the elements of the two matrices; the result reflects the amount of information lost by the dimensionality-reduced matrix compared with the original matrix, which is defined as follows.

Definition 5 (information loss $L_{info}$). Given the known feature matrix $X$ and reconstruction matrix $\tilde{X}$, then according to Definition 4, the information loss can be expressed as
$$L_{info} = \|X - \tilde{X}\|_F.$$

Similarly, the difference between the predicted label $\hat{Y}$ and the label $Y$ can also be measured by the Frobenius norm, which is defined as follows.

Definition 6 (classification error $L_{err}$). It is known that $Y$ is the label corresponding to the feature matrix $X$ and $\hat{Y}$ is the label predicted by the support vector machine; then according to Definition 4, the classification error can be expressed as
$$L_{err} = \|Y - \hat{Y}\|_F.$$

Finally, an overfitting term is designed according to the Frobenius norm, and the degree of overfitting of the classifier is controlled by its value, which is specifically defined as follows.

Definition 7 (overfitting degree $L_{over}$). Knowing that $\upsilon$ is the classification matrix of the feature matrix $X$, then according to Definition 4, the overfitting degree can be expressed as
$$L_{over} = \|\upsilon\|_F.$$

According to the definition of the Frobenius norm, it can be inferred that the more uniform the distribution of the elements of $\upsilon$, the smaller the value of $L_{over}$; conversely, the larger the values of individual elements of $\upsilon$, the larger the value of $L_{over}$.

(3) Global optimization function

To obtain the globally optimal transformation matrix, it is necessary to reduce $L_{info}$, $L_{err}$, and $L_{over}$ simultaneously, so, combining the three Equations (8)~(10), the global optimization function is defined as follows.

Definition 8 (global optimization function $F$). Knowing the feature matrix $X$ and the label $Y$, the corresponding transformation matrix and classification matrix are $W$ and $\upsilon$, respectively, from which the information loss $L_{info}$, the classification error $L_{err}$, and the degree of overfitting $L_{over}$ can be obtained. Then the global optimization function can be expressed as
$$F(W, \upsilon) = L_{info} + \lambda_{1} L_{err} + \lambda_{2} L_{over},$$
where $\lambda_{1}$ and $\lambda_{2}$ are the weight parameters of $L_{err}$ and $L_{over}$, respectively.
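To make Definition 8 concrete, the small NumPy sketch below evaluates the three Frobenius-norm terms and their weighted sum. Treating the predicted labels as the linear map X·W·υ and leaving the norms unsquared follow the reconstruction above and are assumptions made for illustration.

import numpy as np

def global_objective(X, Y, W, ups, lam1=1.0, lam2=0.1):
    """Evaluate L_info, L_err, L_over (Definitions 5-7) and their weighted
    sum (Definition 8) for a given transformation matrix W and
    classification matrix ups."""
    X_rec = X @ W @ W.T                           # reconstruction of X through W
    Y_hat = X @ W @ ups                           # predicted label matrix
    L_info = np.linalg.norm(X - X_rec, "fro")     # Definition 5: information loss
    L_err = np.linalg.norm(Y - Y_hat, "fro")      # Definition 6: classification error
    L_over = np.linalg.norm(ups, "fro")           # Definition 7: overfitting degree
    return L_info + lam1 * L_err + lam2 * L_over  # Definition 8: global objective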

Considering that the classification matrix $\upsilon$ in the global optimization function is affected by the transformation matrix $W$, the minimization of the function by gradient methods is divided into two steps: (1) treat the classification matrix $\upsilon$ as a constant and solve for the transformation matrix $W$; (2) substitute the obtained transformation matrix $W$ and solve for the classification matrix $\upsilon$. According to the definitions of $L_{info}$, $L_{err}$, and $L_{over}$, setting the gradient of the function $F$ with respect to $W$ to zero yields a transformation matrix $W^{*}$ such that $F(W^{*}, \upsilon) \le F(W, \upsilon)$ in its neighborhood, so $W^{*}$ is a locally optimal transformation matrix that minimizes $F$; in the same way, substituting $W^{*}$ into $F$ and setting the gradient of $F$ with respect to $\upsilon$ to zero yields a classification matrix $\upsilon^{*}$ such that $F(W^{*}, \upsilon^{*}) \le F(W^{*}, \upsilon)$ in its neighborhood, so $\upsilon^{*}$ is the locally optimal classification matrix that minimizes $F$.

According to Definition 8 and its solution process, the locally optimal transformation matrix and classification matrix can be obtained. To obtain the globally optimal transformation matrix, the locally optimal transformation matrix and classification matrix are substituted into the function $F$ and solved alternately and iteratively until convergence; the returned transformation matrix is the globally optimal transformation matrix $W^{*}$. Finally, substituting the globally optimal transformation matrix into Equation (5) yields the feature matrix after feature extraction, which is merged with the label by column to obtain the dimensionality-reduced dataset. The execution process of feature extraction is as follows.

Input: dataset DB, weight parameters λ1, λ2.
Output: feature matrix X′, dataset DB′.
1. W ⟵ the transformation matrix of X calculated according to PCA
2. X1 ⟵ XW //project the feature matrix with the initial transformation matrix
3. DB1 ⟵ combine X1 and Y by column
4. υ ⟵ the classification matrix of DB1 according to SVM
5. Ŷ ⟵ X1υ //compute the predicted label
6. Do //iteratively solve the global optimal transformation matrix
7. W ⟵ transformation matrix solved by minimizing F(W, υ) with υ fixed
8. υ ⟵ classification matrix solved by minimizing F(W, υ) with W fixed
9. End do until convergence
10. Get W*
11. X′ ⟵ XW*
12. Return X′
13. DB′ ⟵ combine X′ and Y by column
14. Return DB′
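A compact NumPy sketch of this alternating procedure is given below. The squared Frobenius objective used for differentiability, the ridge-style closed-form update that stands in for the SVM-derived classification matrix, and the fixed-step gradient update for W are all illustrative assumptions rather than the paper's exact formulation.

import numpy as np

def drigfn_feature_extraction(X, Y, d, lam1=1.0, lam2=0.1, lr=1e-4, iters=200):
    """Sketch of the feature-extraction loop: X is the (n, m) feature matrix
    after feature selection, Y an (n, c) one-hot label matrix, d the target
    dimension.  Returns the transformation matrix W and the reduced features."""
    # (1) Initialize W with PCA: top-d right singular vectors of the centered data.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:d].T                                  # (m, d)
    S = X.T @ X                                   # cached Gram matrix

    for _ in range(iters):
        Z = X @ W                                 # projected features (n, d)
        # (2) Classification matrix with W fixed: ridge least squares stands in
        #     for the SVM-derived matrix of the paper.
        ups = np.linalg.solve(Z.T @ Z + (lam2 / lam1) * np.eye(d), Z.T @ Y)
        # (3) Gradient step on W with ups fixed, for the objective
        #     ||X - X W W^T||_F^2 + lam1 * ||Y - X W ups||_F^2.
        grad_rec = -4 * S @ W + 2 * W @ (W.T @ S @ W) + 2 * S @ (W @ (W.T @ W))
        grad_cls = -2 * lam1 * X.T @ (Y - Z @ ups) @ ups.T
        W = W - lr * (grad_rec + grad_cls)
    return W, X @ W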
3.4. Feature Grouping

In current parallel random forest algorithms in the big data environment, the training features are formed by randomly selecting features of the dataset. Although the DRIGFN strategy reduces the redundant and irrelevant features in the dataset through data dimensionality reduction [24], many low-information features remain, and because of them the resulting training features carry little information. Therefore, a feature grouping strategy based on information theory, FGSIT, is proposed to solve this problem. The strategy first uses information theory to measure the degree of influence between feature and label and between features; secondly, on this basis, a feature evaluation function is proposed; finally, the features are divided into two groups in an iterative manner. The specific process of feature grouping is described as follows:

(1) The degree of influence between feature-label and feature-feature

It is known that $X_i$ is any feature in the feature matrix $X$ and $Y$ is the label corresponding to the feature matrix $X$; according to Definition 1, the information gain of the feature $X_i$ is obtained as follows:
$$IG(Y; X_i) = H(Y) - H(Y \mid X_i).$$

However, the information gain only measures the influence between features and labels, ignoring the influence between features. Considering the effect of candidate features on the already selected features during feature grouping, a function for calculating the degree of feature-feature influence is proposed and defined as follows:

Definition 9 (feature-feature influence function $D$). $Y$ is the label, $Q$ is the selected feature set, $X_j$ is an element of $Q$, and the candidate feature $X_i$ is a feature in $X$. According to Definition 2 and Definition 3, the influence degree can be expressed as
$$D(X_i, Q) = \sum_{X_j \in Q} \bigl( I(X_j; Y \mid X_i) - I(X_j; Y) \bigr).$$

According to Definition 2 and Definition 3, the mutual information $I(X_j; Y)$ represents the correlation between the selected feature $X_j$ and the label $Y$, and the conditional mutual information $I(X_j; Y \mid X_i)$ represents the correlation between $X_j$ and the label $Y$ under the condition of the feature $X_i$. The difference between the conditional mutual information and the mutual information can therefore represent the influence of feature $X_i$ on feature $X_j$ and label $Y$, so in the function $D$, the sum of the impact of feature $X_i$ on all features in $Q$ is used to express the overall influence degree of feature $X_i$ on $Q$.

(2) Feature evaluation function

To take into account the degree of influence between feature-label and feature-feature in the process of feature grouping, combined with the above two points, a feature evaluation function is proposed, and its definition is as follows.

Definition 10 (feature evaluation function $J$). Given a candidate feature $X_i$, the label vector $Y$, and the selected feature set $Q$, the evaluation function for feature $X_i$ can be expressed as
$$J(X_i) = IG(Y; X_i) + \beta\, D(X_i, Q),$$
where $\beta$ is the weight parameter of the function $D(X_i, Q)$.

Because the information gain measures the influence between features and labels and the function $D(X_i, Q)$ measures the influence between features, the feature evaluation function obtained by combining Equations (12) and (13) measures the degree of influence between feature and label and between features simultaneously.

(3) Feature grouping

Based on the feature evaluation function proposed in Definition 10, the feature grouping process can be divided into three steps:

① Put the feature with the largest information gain value in $X$ into Q

② Calculate the $J$ value of each candidate feature in turn, and put the feature corresponding to the maximum value of $J$ into Q

③ Execute step ② iteratively until the number of features in Q reaches the threshold Thr; the remaining features then form the set S by themselves

According to the nature of the random forest, the classification effect of a random forest is related to the correlation between the decision trees in the forest and to the classification ability of each decision tree: the stronger the correlation between decision trees, the worse the classification effect of the random forest; the stronger the classification ability of the decision trees, the better the classification effect of the random forest. The choice of the threshold affects the grouping of features, which in turn affects both the correlation between decision trees and their classification ability. The selection of the threshold is therefore essential, so a threshold function is proposed to determine it, defined as follows.

Definition 11 (threshold function). Assume that there are $k$ decision trees in the random forest, that the feature set contains $m$ features of which Q holds Thr, and that high-information features are randomly extracted from Q in proportion and combined with features randomly drawn from S as the training features for constructing a decision tree. The threshold function is then expressed as the combination of two terms: one uses the proportion of features drawn from Q to reflect the overall classification ability of the decision trees in the random forest, and the other (denoted cor) uses the similarity of the features selected by two decision trees to reflect the correlation between the decision trees in the random forest.

According to Definition 11, the first term can be used to measure the overall classification ability of the decision trees in a random forest, and cor can be used to measure the correlation between the decision trees. Therefore, according to the nature of the random forest, the classification effect of the random forest can be measured by the threshold function. Observing the formula, when all the features belong to Q the threshold function attains its maximum, but the meaning of grouping is lost, so this case is discarded; in the big data environment the number of features is large, and the threshold function reaches its maximum when the value of cor is smallest, which determines the choice of the threshold Thr.

The execution of the FGSIT policy is as follows.

Input: dimensionality-reduced dataset DB′, weight parameter β.
Output: feature subsets Q, S.
1. For each feature Xi in X′ do
2. Calculate the IG(Y; Xi) by Equation (1)
3. B ⟵ IG(Y; Xi) //put the calculated information gain into set B
4. End for
5. Q ⟵ the feature corresponding to the maximum value in B //put the feature with the largest information gain into the set Q
6. Remove the feature from X′
7. While the length of Q < Thr do
8. For each feature Xi in X′ do
9. Calculate the J(Xi) by Equation (14)
10. C ⟵ J(Xi) //put the calculated J(Xi) into the set C
11. End for
12. Q ⟵ the feature corresponding to the maximum value in C
13. Remove the feature from X′
14. End while
15. S ⟵ the remaining features in X′
16. Return Q, S
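The greedy grouping can be sketched in a few lines of Python. The evaluation function used below follows the reconstruction of Definitions 9 and 10 given in the text and is an assumption, not the authors' reference implementation; scikit-learn's mutual_info_score is used for the discrete mutual information.

import numpy as np
from sklearn.metrics import mutual_info_score as mi   # I(X; Y) for discrete data

def cond_mutual_info(x, y, z):
    """I(X; Y | Z) = sum_z p(z) * I(X; Y | Z = z)."""
    n = len(z)
    return sum((np.sum(z == v) / n) * mi(x[z == v], y[z == v]) for v in np.unique(z))

def fgsit_grouping(X, y, thr, beta=1.0):
    """Greedy FGSIT-style grouping sketch using
    J(Xi) = IG(Y; Xi) + beta * sum_{Xj in Q} (I(Xj; Y | Xi) - I(Xj; Y))."""
    remaining = list(range(X.shape[1]))
    # Step 1: seed Q with the feature of largest information gain.
    first = max(remaining, key=lambda i: mi(X[:, i], y))
    Q, remaining = [first], [i for i in remaining if i != first]
    # Step 2: iteratively add the feature maximizing J until |Q| = thr.
    while len(Q) < thr and remaining:
        def J(i):
            d = sum(cond_mutual_info(X[:, j], y, X[:, i]) - mi(X[:, j], y) for j in Q)
            return mi(X[:, i], y) + beta * d
        best = max(remaining, key=J)
        Q.append(best)
        remaining.remove(best)
    return Q, remaining        # Q: high-information group, S: the remaining features

# Tiny discrete example with five features and a binary label.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(400, 5))
y = (2 * X[:, 0] + X[:, 1] >= 3).astype(int)
print(fgsit_grouping(X, y, thr=2))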
3.5. Building Random Forests in Parallel

After data dimensionality reduction and feature grouping, the classifier needs to be trained in parallel according to the reduced-dimensionality dataset and the feature subsets Q and S. At present, parallel random forest algorithms in the big data environment mainly build multiple decision trees based on the training data and training features as the output result.

On this basis, the samples are predicted to obtain the model accuracy. However, in the prediction stage of this method, because the decision trees in each computing node differ, the predicted key-value pairs obtained for the dataset also differ, so after merging, the number of key-value pairs on each Mapper node will be different. This difference usually leads to an unbalanced load on the Reducer nodes in the next stage, which affects the parallelization efficiency of the algorithm. To deal with this problem, this section first proposes the RSKP strategy to optimize the MapReduce computing model and balance the load of the Reducer nodes; it then uses the optimized MapReduce model to build the random forest in parallel, predict the classification of the dataset, and obtain the accuracy of the random forest.

The specific process is described as follows:

(1) RSKP strategy

Given the sets of key-value pairs obtained after merging in each Mapper node, the process of the RSKP strategy is shown in Figure 1:

(a) Aggregate all key-value pairs into an intermediate file and sort them according to the keys in the key-value pairs

(b) According to the number of key-value pairs and the number of Reducer nodes, distribute the key-value pairs in the intermediate file evenly to each Reducer node

(2) Parallel construction of the random forest and prediction of the dataset classification. The optimized MapReduce model is obtained through the RSKP strategy. Combined with this model, the parallel construction of the random forest is divided into four steps, as shown in Figure 2:

(a) Call Hadoop's default data block strategy, divide the dataset into blocks of the same size, and transmit them to the Mapper nodes as input data

(b) According to the task assigned to each Mapper node by the primary node, call the Map function to extract the training set of each decision tree through bootstrap sampling and randomly extract features from the feature subsets Q and S in proportion as training features; based on the training set and the training features, construct a decision tree in the form of a key-value pair <key, value>, where key is the decision tree model number and value is the decision tree model; after all Mapper nodes have executed, all decision trees are parsed and merged to obtain the random forest model

(c) Use the decision trees in the Mapper node to predict the dataset and form new key-value pairs <key, value>, where key is the combination of the sample ID and the corresponding category and value is the number of occurrences of the key; key-value pairs with the same key are merged locally (for example, three local key-value pairs with the same key are merged into one key-value pair whose value is the sum of their counts)

(d) The key-value pairs predicted in the Mapper nodes are distributed by the master node and transferred to the corresponding Reducer nodes, where they are merged again to obtain the global classification result, which is compared with the labels to obtain the model's accuracy. So far, the execution process of building a random forest in parallel is as follows

Input: dataset DB′, feature subsets Q, S.
Output: random forest model and its accuracy.
Map stage
1. For each block corresponding to each Mapper node, do
2. T ⟵ select training set randomly
3. F ⟵ select training features randomly from Q and S
4. key ⟵ number of the decision tree trained by T and F
5. value ⟵ train decision tree based on T and F
6. End for
7. Model ⟵ collection of all decision trees
8. Return model //output random forest model
9. For each decision tree in each Mapper node, do
10. Predict the category of each sample with the decision tree
11. key ⟵ combine sample ID with sample category
12. value ⟵ number of key-value pairs
13. Output <key, value>
14. End for
15. Combine key-value pairs with the same key in each Mapper
16. AK ⟵ collection of all key-value pairs
Reduce stage
1. Sorted(AK) //sort all key-value pairs
2. RSKP(AK) //evenly distribute key-value pairs
3. Obtain the global classification results by combining key-value pairs with the same key
4. Accuracy ⟵ compare label with global classification results //get classification accuracy
5. Return accuracy
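The following Python sketch mimics the Map-side merge, the RSKP redistribution, and the Reduce-side majority vote on toy data. Modelling each Mapper's output as a dictionary of <(sample ID, class), count> pairs and splitting the sorted pairs into contiguous, nearly equal chunks are assumed readings of the strategy, not the authors' implementation.

from collections import Counter

def rskp_redistribute(mapper_outputs, num_reducers):
    """Aggregate all mapper key-value pairs, sort them by key, and split
    them into contiguous, nearly equal-sized chunks, one per Reducer."""
    merged = Counter()
    for local in mapper_outputs:
        merged.update(local)                 # combine pairs with the same key
    items = sorted(merged.items())           # sort by (sample_id, class)
    size = -(-len(items) // num_reducers)    # ceiling division
    return [items[i * size:(i + 1) * size] for i in range(num_reducers)]

def reduce_partial_votes(chunk):
    """Each Reducer sums the vote counts it holds for every sample."""
    votes = {}
    for (sample_id, cls), count in chunk:
        votes.setdefault(sample_id, Counter())[cls] += count
    return votes

def global_classification(reducer_outputs):
    """Merge the reducers' partial votes and take the majority class."""
    totals = {}
    for votes in reducer_outputs:
        for sid, counter in votes.items():
            totals.setdefault(sid, Counter()).update(counter)
    return {sid: counter.most_common(1)[0][0] for sid, counter in totals.items()}

# Example: two mappers voting on two samples, redistributed to two reducers.
mappers = [{(0, "A"): 2, (1, "B"): 1}, {(0, "A"): 1, (1, "A"): 2}]
chunks = rskp_redistribute(mappers, num_reducers=2)
print(global_classification(reduce_partial_votes(c) for c in chunks))
# -> {0: 'A', 1: 'A'}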
3.6. PRFITN Algorithm Steps

The specific implementation steps of the PRFITN algorithm are as follows.

Step 1. Divide the original dataset into file blocks of the same size through the default file block strategy of Hadoop, call a MapReduce task to calculate the information gain of the original data features in parallel, and select the features on this basis.

Step 2. Invoke the FEKFN strategy to extract new features from the feature-selected dataset in an iterative manner.

Step 3. Call the FGSIT strategy to group the features of the reduced dimensionality dataset.

Step 4. Start a new MapReduce task, call the Map function, use bootstrap and stratified sampling to extract training samples and features used for modeling, build a decision tree, and aggregate all decision trees to obtain a random forest; use the RSKP strategy to distribute Reducer node tasks evenly, call the Reduce function to get the global classification results, and evaluate the model classification accuracy.

3.7. Analysis of Algorithm Time Complexity

The PRFITN algorithm mainly includes three stages: data dimensionality reduction, feature grouping, and parallel construction of random forests. Therefore, the algorithm’s time complexity is primarily composed of three parts, denoted as T1, T2, and T3, respectively.

In the feature selection stage of data dimensionality reduction, the time complexity mainly depends on calculating the information gain value of each feature, which requires traversing every sample for every feature in the dataset. Given that the number of samples in the dataset is $n$, the number of features is $m$, and the number of Mapper nodes executing the MapReduce task is $M$, the time complexity of this stage is on the order of $n \times m / M$.

In the feature extraction stage of data dimensionality reduction, the time complexity mainly depends on the process of iteratively optimizing the transformation matrix $W$ and the classification matrix $\upsilon$. Assuming that this stage requires $t$ iterations, the time complexity is proportional to $t$ and to the cost of the matrix operations performed in each iteration.

Therefore, the time complexity of data preprocessing, $T_1$, is the sum of the complexities of these two stages.

The feature grouping stage mainly uses the FGSIT strategy to divide the features, and the feature evaluation function between each candidate feature and the selected features must be calculated at each screening step. Knowing that the number of processed features is $m'$ and the number of samples is $n$, the time complexity of this stage, $T_2$, is determined by the number of evaluation-function computations, each of which traverses the $n$ samples.

In the parallel construction of the random forest, the MapReduce task is mainly called to build the random forest model in parallel and to predict the classification of all data in order to evaluate the accuracy. Assuming that the random forest model contains $k$ decision trees, the number of Mapper nodes of the MapReduce task is $M$, and the number of Reducer nodes is $R$, the time complexity of this stage, $T_3$, is determined by the cost of building the $k$ trees and predicting the samples, divided across the $M$ Mapper and $R$ Reducer nodes.

4. Experimental Results and Comparison

4.1. Experimental Environment

To verify the performance of the PRFITN algorithm, related experiments are designed in this paper. In terms of hardware, the experimental environment includes four computing nodes: 1 Master node and 3 Slave nodes. The CPUs of all nodes are AMD Ryzen 7, each with eight processing units and 16 GB of memory. The four nodes are in the same local area network and connected by 200 Mbit/s Ethernet. In terms of software, the Hadoop version installed on each node is 2.7.4, the Java version is 1.8.0, and the operating system is Ubuntu 16.04. The specific configuration of each node is shown in Table 1.

4.2. Experimental Data

The experimental data used by the PRFITN algorithm are three real datasets from the UCI public database (https://archive.ics.uci.edu/ml/index.php), namely, Farm Ads, Susy, and APS Failure at Scania Trucks. The Farm Ads dataset is various farm animal-related data collected from text ads on 12 websites.

The Farm Ads dataset contains 4143 samples and 54877 attributes, i.e., a small sample size and many features. Susy is a dataset that records the detection of particles with a particle accelerator; it contains 5000000 records and 18 attributes, i.e., a large sample size and a small number of features. The APS Failure at Scania Trucks dataset records Scania truck APS faults and operations; it contains 60000 samples and 171 attributes, i.e., a moderate sample size and a moderate number of features. The specific information of the datasets is shown in Table 2.

4.3. Performance Analysis of PRFITN Algorithm

To verify the feasibility of the PRFITN algorithm in the big data environment, this paper selects 50, 100, and 150 decision trees in the random forest. It applies the PRFITN algorithm to the three datasets of Farm Ads, Susy, and APS Failure at Scania Trucks, runs ten times independently, takes the average of the ten running results, and compares the running time and accuracy of the algorithm to achieve an overall evaluation of the performance of the PRFITN algorithm. Figure 3 shows the execution results of the PRFITN algorithm under three datasets.

As can be seen from Figure 3 and Table 3, when the number of decision trees changed from 50 to 100 and then to 150, the running time of the algorithm on the Farm Ads dataset increased by 8700 s and 9000 s, and the accuracy increased by 3.8 and 1.5 percentage points, respectively; the running time on the Susy dataset increased by 4250 s and 6000 s, while the accuracy increased by 2.5 percentage points and then decreased by 0.7 percentage points; the running time on the APS Failure at Scania Trucks dataset increased by 750 s and 4500 s, and the accuracy increased by 3.0 and 1.1 percentage points, respectively. The data show that the running time and accuracy of the PRFITN algorithm on the three datasets gradually increase; the growth in running time accelerates, while the gain in accuracy slows. The former is mainly because, as the number of decision trees increases, the number of tasks assigned to the computing nodes in the modeling stage grows and the number of key-value pairs also increases exponentially, so more time is needed to process them. The latter is mainly because, as the number of decision trees increases, the differences between trees decrease and their impact on the classification results of the random forest becomes smaller and smaller, so the gain in accuracy diminishes as the number of decision trees grows.

4.3.1. Time Complexity Comparison of PRFITN Algorithm

This study conducts tests based on the three datasets of Farm Ads, Susy, and APS Failure at Scania Trucks to verify the time complexity of the PRFITN method, and a thorough comparison is carried out with the PRFMIC algorithm. In order to investigate the effect of load balancing on the PRFITN algorithm, the PRFITN algorithm without the RSKP strategy, referred to as PRFITN-ER, is also run. The specific time complexity results are shown in Figure 4 and Table 4.

As can be seen from Figure 5, on the Farm Ads dataset, the running time of the PRFITN algorithm is 2300 s, 3833.3 s, and 8666.7 s higher than that of the PRFMIC algorithm, the PRF algorithm, and the improved MR_RF algorithm, respectively. On the APS Failure at Scania Trucks dataset, the running time of the PRFITN algorithm is on average 200 s, 416.7 s, and 733.3 s higher than that of the PRFMIC algorithm, the PRF algorithm, and the improved MR_RF algorithm, respectively. These two situations occur because the PRF algorithm applies data dimensionality reduction to the training features when building the random forest model, the PRFMIC algorithm applies hierarchical processing to the features, and the PRFITN algorithm adopts both dimensionality reduction and feature layering. In addition, the dimensionality reduction and layering strategy of the PRFITN algorithm focuses on directly evaluating the features themselves. Therefore, when dealing with the Farm Ads and APS Failure at Scania Trucks datasets, which have relatively many features, the PRFITN algorithm is significantly better than PRFMIC, PRF, and the improved MR_RF; however, the algorithm takes more time.

On the contrary, when dealing with the Susy dataset, which has a large sample size and a small number of features, the running time of the PRFITN algorithm is 6783.4 s and 3750 s lower than that of the PRFMIC algorithm and the PRF algorithm, respectively. When the number of features is small, the PRFITN algorithm takes less time in the data dimensionality reduction and feature layering stages, and it uses the RSKP strategy to balance the load of each node and reduce the time complexity. In addition, to judge more intuitively the impact of load balancing on the model, that is, the optimization effect of the RSKP strategy, the running times of the PRFITN algorithm and the PRFITN-ER algorithm on the three datasets are compared: on the Farm Ads, Susy, and APS Failure at Scania Trucks datasets, the running time of the PRFITN algorithm is on average 1733.33 s, 1583.33 s, and 295 s less than that of the PRFITN-ER algorithm, so adopting the RSKP strategy saves model learning time to a certain extent.

5. Conclusion

To address the shortcomings of parallel random forest algorithms in the big data environment, this paper proposes a parallel random forest algorithm, PRFITN, based on information theory and norms. The related parallel random forest (PRF) method uses Spark to boost the effectiveness of the RF technique and to alleviate the data communication cost and workload imbalance of massive data in a distributed and parallel environment: the PRF method is optimized using a hybrid parallel technique that combines data-parallel and task-parallel optimization, with a vertical data-partitioning approach and a data-multiplexing method used for data-parallel optimization. These strategies lower the amount of data and the frequency of data transfer operations in a distributed setting while maintaining algorithm correctness. A weighted voting technique and dimension reduction are used to optimize the PRF algorithm's accuracy, and the hybrid parallel PRF technique incorporating data-parallel and task-parallel optimization is carried out on Apache Spark: the data-parallel optimization reuses the training dataset and greatly decreases the amount of data, while the task-parallel optimization significantly lowers the cost of data transmission and enhances algorithm performance. According to experimental findings, PRF is superior to comparable algorithms and has distinguishing advantages over them in terms of accuracy, efficiency, and scalability. Building on this line of work, the proposed PRFITN algorithm first fully considers the problem of redundant and irrelevant features in large datasets and proposes a hybrid dimensionality reduction strategy, DRIGFN, which can effectively reduce the dimension of the dataset while significantly reducing the amount of information lost during dimension reduction. Secondly, to increase the amount of information in the features used to train the decision trees of the random forest, a feature grouping strategy, FGSIT, is proposed, which fully considers the relationships between features and between features and labels; on this basis, the features are divided into two groups, and the training features are extracted proportionally, ensuring the information content of the selected features when constructing the decision trees. Finally, considering the impact of the cluster load on the efficiency of parallel algorithms, a key-value pair redistribution strategy, RSKP, is designed to evenly distribute the intermediate results of the algorithm, balancing the load of the Reducer nodes in the cluster and reducing the time complexity of the algorithm. To verify the classification performance of the PRFITN algorithm, this paper compares and analyzes the improved MR_RF algorithm, the PRF algorithm, and the PRFMIC algorithm on the three datasets of Farm Ads, Susy, and APS Failure at Scania Trucks. The experimental results show that the PRFITN algorithm has high accuracy in the big data environment, especially for classifying datasets with a large number of features.

6. Future Scope

Building on the foundation of the research findings reported and the general knowledge attained in this work, the authors outline topics that warrant additional investigation and expect that these issues have the potential to contribute to future research. Because the majority of the studied publications adopt an analytical methodology, empirical research could be enhanced through extensive case studies using qualitative and quantitative survey-based approaches. As a cross-cutting theme, business and management big data have numerous ties to well-established subjects in the fields of computing, engineering, mathematics, business, and social sciences, among others.

Data Availability

The data shall be made available on request.

Conflicts of Interest

The authors declare that they have no conflict of interest.