Abstract

This paper introduces a robust framework for demand forecasting that comprises data preprocessing, data transformation and standardization, feature selection, cross-validation, and a regression ensemble. Bagging (random forest regression (RFR)), boosting (gradient boosting regression (GBR) and extreme gradient boosting regression (XGBR)), and stacking (STACK) are employed as ensemble models. Single machine learning (ML) approaches, including support vector regression (SVR), extreme learning machine (ELM), and multilayer perceptron neural network (MLP), are adopted as reference models. To maximize the coefficient of determination (R2) and minimize the root mean square error (RMSE), hyperparameters are set using the grid search method. All tests are carried out on a steel industry dataset under identical experimental conditions. In this context, the STACK1 (ELM + GBR + XGBR-SVR) and STACK2 (ELM + GBR + XGBR-LASSO) models performed better than the other models. The highest R2 of 0.97 is obtained by both STACK1 and STACK2. Moreover, the models rank by performance as STACK1, STACK2, XGBR, GBR, RFR, MLP, ELM, and SVR. Because it improves model performance and reduces the risk in decision-making, the ensemble method can be used to forecast demand in the steel industry one month ahead.

1. Introduction

Demand forecasting is the prediction of the future need for a product or service [1]. A company must follow a systematic procedure to obtain a clear picture of demand and to track its customers’ needs if it is to hold its position in the market. In recent years, the steel industry in Bangladesh has been a fast-growing industry in the local market. The industries have managed to manufacture a large amount of steel to serve both local and international markets, but producing a large amount of steel without proper forecasting causes various problems. Demand prediction supports many fundamental business assumptions, including turnover, total revenues, income, capital consumption, risk evaluation and mitigation plans, capacity planning, transportation and distribution plans, and more. Any misjudged assessment can lead to the decay or scarcity of raw materials. It can also lead to overproduction or underproduction. All these cases erode the entire supply chain and total income, resulting in opportunity cost. Moreover, the entire industry setup depends on this demand, including the amount of raw material, labor, and space. Time is also a crucial issue for these arrangements, as some processes have predefined deadlines that must be synchronized precisely. For a smart business strategy, the most important thing is to forecast demand precisely, but the industries do not have any intelligent method to measure the need accurately. They follow the time series of their sales data and often skip factors, such as raw material supply, availability, and the number of workers at the factories, that significantly influence steel production.

Forecasting methods can be classified into three categories: (1) statistical methods, (2) artificial intelligence-based methods such as single machine learning (ML) methods, and (3) ensemble/hybrid methods. Most steel industries in Bangladesh use traditional statistical approaches. Statistical approaches, such as exponential smoothing [2], moving average [3], autoregressive moving average [4], and autoregressive integrated moving average [5], are most frequently used for time series prediction. The major drawback of these techniques is that their parameter values are fixed by statistical calculation. Consequently, the estimation error increases when the input data fluctuate strongly, and the techniques do not yield convincing results for complicated time series patterns [6]. Thus, the companies need an intelligent decision support system that considers several factors.

Several researchers report that, in most of the cases investigated, ML approaches provide higher accuracy than traditional approaches and have therefore drawn much attention [7]. Single artificial intelligence-based models, such as support vector machine (SVM), extreme learning machine, heuristic techniques, and multilayer perceptron (MLP), are widely used in industry to predict demand because they demonstrate promising results in control, prediction, and pattern recognition [8–10]. Support vector regression (SVR) is popular for predicting future demand because of its strong generalization capability and its independence from the dimensionality of the input space [11]. It produces high accuracy in agribusiness prediction [8] and supply chain demand forecasting [12]. Recently, MLP has been used for monthly water demand prediction [10], wind speed prediction [13], and water demand prediction [14]. To improve MLP’s prediction accuracy, different MLP architectures have been used, and an optimization algorithm has been used to tune its parameters [15]. The extreme learning machine (ELM) is another advanced model: a single hidden layer feed-forward neural network (SLFN) whose high learning speed and fast convergence make it efficient to train [16]. It is widely used in applications such as sales forecasting in fashion retailing [17] and sales prediction for the retail industry [18].

Since demand forecasting in steel industries is considerably challenging, it is difficult to solve this problem accurately using single ML models. No single model is ideally suited for all ML applications. Each method and application domain has its own prerequisites, advantages, assumptions, and characteristics [19]. Generally, the performance of combined forecasting models is better than that of a single forecasting model [20]. The literature describes several strategies to enhance the predictive performance of regression models, and one of these is the regression ensemble [8]. Regression ensemble theory is rooted in the divide-and-conquer concept and addresses the limitations of ML models working in isolation [21]. An ensemble model is one in which numerous base models are constructed to address the same problem, with each model learning the dataset’s feature attributes and making a prediction. The separate models’ forecasts are then integrated to generate the final prediction. For regression problems, the simplest way to combine the base forecasts is to take their mean or weighted average. Regression ensemble models construct a collection of models in order to improve the power of the selected models to predict a numerical target variable [22, 23]. Ensemble methods are used in several studies, such as forecasting for energy consumption [24], agribusiness prediction [8], and wind power forecasting [25]. Although numerous frameworks have been established, there is always a need for improved forecasting accuracy and robustness, particularly in the steel industry.

This study proposes a new pipeline for demand forecasting in steel industries. To this end, it explores the capacity of predictive regression ensemble models by comparing the ensembles among themselves and against single reference models. The proposed pipeline includes data preprocessing, feature selection, hyperparameter tuning, cross-validation, and regression ensemble approaches to outperform the state-of-the-art results. Missing values are filled with the mean of the attribute rather than its median, since the mean is taken to better reflect the central tendency of the attribute distribution. The appropriate features are selected using feature selection algorithms (correlation-based, principal component analysis (PCA), and independent component analysis (ICA)) to avoid redundancy and model overfitting. Different single ML techniques, such as SVR, MLP, and ELM, are adopted as reference models. The ensemble bagging (RFR), boosting (GBR and XGBR), and stacking (STACK) models are used in our proposed framework to enhance the robustness and efficiency of demand forecasting. The grid search technique with cross-validation is used to select the optimal hyperparameters for each ML model. Comprehensive experiments are conducted on different data preprocessing steps and combinations of ML techniques to minimize the RMSE and maximize the R2 of the demand forecasting models. All experiments are carried out under the same experimental settings and on the same dataset. Finally, we investigate the performance of the regression ensemble approaches and verify that they outperform the single reference models.
The contributions of this paper are summarized as follows:
(i) Collect the dataset from a well-known steel industry in Bangladesh.
(ii) Present a modification of the theory underlying regression ensembles based on bagging (RFR), boosting (GBR and XGBR), and stacking (STACK) as well as single models (SVR, MLP, and ELM) (details in Supplementary Material Appendix A).
(iii) Find the best preprocessing pipeline using filling of missing values, data transformation, standardization, and feature selection algorithms, where the number of selected features is also varied.
(iv) Implement different ML regression models with their optimal hyperparameters, obtained using grid search with cross-validation. Investigate and analyze the performance of bagging, boosting, and stacking ensemble approaches and compare them with each other on the same dataset and preprocessing under the same experimental conditions.
(v) Verify the superiority of the proposed ensemble approaches using the Friedman test and the Wilcoxon signed-rank test.

The remainder of the paper is arranged in the following manner: Section 2 describes a collection of related studies for the purpose of forecasting. Section 3 illustrates the suggested approach, dataset, feature selection methods, and assessment measures. Various experimental findings are documented in Section 4 based on the interpretation of the data. Section 5 provides a conclusion as well as a scope for further development.

2. Related Work

Forecasting demand for industrial products is important since a large portion of a company’s planning process is based on the amount of product to be produced. To meet increasing demand, precise demand forecasting is required. This section discusses work that has been done to forecast demand in a variety of disciplines and describes several representative studies.

Ribeiro and dos Santos Coelho [8] proposed a system for agribusiness prediction using ensemble methods. Bagging, boosting, and stacking ensembles were used along with the single reference models SVR, MLP, and KNN. The experiment showed that the ensemble methods performed better than the single models. They obtained MAPE of 0.9787 and 0.7394 for both cases with the best ensemble models. They did not apply any metaheuristic algorithm for optimizing hyperparameters. Yu et al. [9] developed an ensembling and decomposition algorithm with EEML for crude oil price forecasting. In Ref. [12], the authors introduced a system that ensembles regression algorithms and time series algorithms to forecast supply chain demand. The system showed superior results because it avoided both overestimation and underestimation. Cankurt [26] employed a variety of regression models, including M5P and M5-Rule model trees, bagging, boosting, randomization, stacking, and voting, to forecast tourism demand. The results were reported in terms of measures such as RAE. The bagging and boosting methods contributed significantly to improving the performance of the regression tree models.

Yang et al. [27] developed a system for forecasting agricultural commodities using bagging and combination approaches with the Heterogeneous Autoregression (HAR) model. The HAR model combined with bagging and principal component combination showed outstanding performance for agricultural commodity forecasting. In Ref. [28], the authors used ensemble empirical mode decomposition (EEMD) to analyze global food price volatility. Tao et al. [29] proposed a method combining EEMD, extreme learning machine (ELM), and ARIMA for forecasting hog prices. They obtained the best estimated accuracy of R = 0.848. Ribeiro et al. [30] designed nonlinear prediction models based on ensemble aggregation to improve the accuracy of electricity load forecasting. To validate their framework, they used hourly load values from Italy in 2015 and from the Global Energy Forecasting Competition in 2012. Compared with the multilayer perceptron neural network (MPNN) and regression tree approaches, their wavelet ensemble-based forecasting framework performed better.

da Silva et al. [31] introduced a decomposition-ensemble learning strategy for multi-step-ahead very short-term forecasting, which aggregated several regression models. They employed a range of preprocessing strategies to account for the high degree of correlation among the system’s inputs. Across all time horizons, the proposed models outperformed the CEEMD, STACK, and single models. In Ref. [32], the authors presented a rolling decomposition-ensemble model for gasoline forecasting. Their experimental results demonstrate that the model is both accurate and robust in projecting gasoline consumption levels and trends. Liu et al. [33] developed a wind speed ensemble forecasting system (WSEFS) to enhance point forecasting (PF) and interval forecasting (IF). They obtained MAPE of 1.9322%, 2.1579%, and 2.2808% for the 1st, 2nd, and 3rd steps, respectively. The experimental results showed that the MOMA ensemble forecasting system is better than MOGWO and MODA. To estimate sediment movement in open channels, Ebtehaj and Bonakdari [35] developed an ELM-based feed-forward neural network (FFNN-ELM). In all training and testing modes, the FFNN-ELM outperformed the previously used FFNN-BP and GP methods. For the testing mode, they found RMSE = 0.121 and MARE = 0.023.

The literature summarized in Table 1 shows that ensemble models predict more accurately than traditional models in each case. Although several frameworks have already been developed, there is still a need to improve the accuracy and robustness of demand forecasting, especially in the steel industry. To sum up, no complete pipeline yet exists that combines data preprocessing, feature selection, hyperparameter tuning, and a regression ensemble method. This study applies bagging, boosting, and two-level stacking ensemble methods to the historical time series data of the steel industry to obtain more accurate demand forecasts. The steel industry follows the traditional time series trend to predict demand, which fluctuates strongly. To address this problem, this study combines multiple approaches instead of using a single traditional method to determine a precise result for the industry.

3. Materials and Methods

This section gives a concise description of the materials and methods used. The proposed framework is depicted in Figure 1. Its primary phases are as follows: (i) collecting industrial environmental data as the primary inputs of the framework; (ii) preprocessing the data, including filling the missing values, Yeo–Johnson transformation, and standardization; (iii) discarding irrelevant and redundant features to avoid overfitting of the models; (iv) applying the grid search algorithm with cross-validation to tune the hyperparameters of each machine learning model; (v) developing a two-level stacking ensemble method, where machine learning models with optimal hyperparameters are used as the baseline models; and (vi) evaluating the proposed framework using performance metrics. These blocks are explained in the following sections.

3.1. Data Collection

The data were collected from a prominent steel company, Bangladesh Steel Re-Rolling Mills Ltd., in Chittagong, Bangladesh. During the industrial attachment, some raw data were obtained from sources such as workers, production leaders, and human resources. These data were then consolidated to build the dataset. The dataset comprises 132 cases and six input features covering January 2009 to December 2019 (11 years). The task is to forecast the demand of each month based on the other factors. The dataset holds the amount of raw material used in a month, availability, the number of workers, working days, and other attributes. The data were gathered from the company’s monthly and annual industrial reports on its official website, such as financial reports and production reports, and cover other factors that directly affect its production. Table 2 describes each feature and shows a statistical summary.

3.2. Data Preprocessing

The data preprocessing stage comprises missing value imputation and power transformation of the data. The raw data contain missing values in various features, which must be filled before applying any ML technique. Several imputation techniques can fill missing values. In our proposed method, mean-based imputation is used, where each missing value is filled with the mean of the observed values of that specific feature.

After the imputation of missing or null values, the power transformation of the data is performed. In regression analysis, transformations are crucial [36]. Power transformations are parametric, monotonic transformations used to make data more Gaussian-like. This technique is useful for heteroscedasticity problems or other circumstances where data normality is required. The two most popular power transformation methods are the Box–Cox and Yeo–Johnson transformations. Here, the Yeo–Johnson transformation is used because the Box–Cox transformation demands that input data are strictly positive, whereas the Yeo–Johnson transformation supports both positive and negative data [37]. The Yeo–Johnson transformation is given by

$$
\psi(\lambda, y) =
\begin{cases}
\dfrac{(y + 1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\ y \geq 0, \\
\log(y + 1), & \lambda = 0,\ y \geq 0, \\
-\dfrac{(-y + 1)^{2 - \lambda} - 1}{2 - \lambda}, & \lambda \neq 2,\ y < 0, \\
-\log(-y + 1), & \lambda = 2,\ y < 0,
\end{cases}
$$

where $\psi$ is the transformed value, $y$ is an input value (which may be positive, zero, or negative), and $\lambda$ is a hyperparameter used to control the transformation. Here, the Scikit-learn implementation PowerTransformer(method = "yeo-johnson", standardize = True) is used, which performs the Yeo–Johnson power transformation and then standardizes the transformed output to zero mean and unit variance.
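
As a minimal sketch of this preprocessing stage, the two steps can be chained in a Scikit-learn pipeline. The data below are synthetic stand-ins (132 rows, six columns, roughly 5% of entries removed), not the steel dataset, and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer

# Synthetic stand-in for the monthly feature table (not the paper's data).
rng = np.random.default_rng(0)
X = rng.lognormal(mean=3.0, sigma=0.6, size=(132, 6))
X[rng.random(X.shape) < 0.05] = np.nan  # remove about 5% of the entries

preprocess = make_pipeline(
    SimpleImputer(strategy="mean"),                            # mean-based imputation
    PowerTransformer(method="yeo-johnson", standardize=True),  # transform, then standardize
)
X_ready = preprocess.fit_transform(X)

# One fitted lambda per feature column.
lambdas = preprocess.named_steps["powertransformer"].lambdas_
```

In a forecasting setting, the pipeline would normally be fitted on the training portion only and then applied to the test portion, so that no information from the test period leaks into the imputation means or the fitted lambdas.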

3.3. Feature Selection

Feature selection or reduction removes irrelevant, redundant, or only partially important features that might mislead the model, since the accuracy of an ML model depends on the features on which it has been trained. Removing redundant features reduces the chance of overfitting and lessens the model’s complexity. Several feature selection or reduction techniques exist. In our proposed method, PCA, ICA [36], and correlation-based feature selection algorithms were used to discard irrelevant features.

PCA is frequently employed for this purpose due to its adaptability and ease of implementation. PCA works by projecting the data onto an orthogonal space so that the eigenvectors corresponding to the greatest eigenvalues preserve the maximum data variance. It is a technique that focuses on the covariance matrix and second-order statistics. ICA decomposes observable data linearly into statistically independent components. The correlation-based method ranks features using a heuristic evaluation function that takes into account the correlation between each feature and the target outcome. Both PCA and ICA follow the default implementation of Scikit-learn except for the n_components parameter, which represents the number of features to be chosen by the respective algorithm; its value is obtained from hyperparameter tuning. For example, the PCA configuration is (n_components, copy, whiten, svd_solver, tol, iterated_power) = ({4, 5, 6}, True, False, auto, 0.0, auto). Algorithms 1–3 summarize the procedures of the PCA, ICA, and correlation-based feature selection algorithms, respectively.

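Algorithm 1: PCA-based feature selection.
Input: $d$-dimensional input data matrix $X$ with number of samples $n$, and variance threshold $\tau$
Output: reduced $k$-dimensional data matrix $Z$, $k \leq d$
Load $X$ and calculate the mean $\mu_j$ of each feature, for $j = 1, \ldots, d$; subtract the mean from each corresponding dimension, $\tilde{x}_{ij} = x_{ij} - \mu_j$, for $i = 1, \ldots, n$ and $j = 1, \ldots, d$
/* Make each signal uncorrelated to each other */
Calculate the covariance matrix $C = \frac{1}{n - 1} \tilde{X}^{\top} \tilde{X}$ of $\tilde{X}$
Solve the eigendecomposition of $C$ as $C = V \Lambda V^{\top}$, where $V$ is the matrix of eigenvectors and $\Lambda$ is the diagonal matrix containing the eigenvalues on its diagonal
Sort the eigenvectors in descending order of eigenvalue, keep the first $k$ eigenvectors that have variance $\geq \tau$, and form a projection matrix $W$
Finally, project $\tilde{X}$ on the PCA space, $Z = \tilde{X} W$
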
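Algorithm 2: ICA-based feature selection.
Input: $d$-dimensional input data matrix $X$ with number of samples $n$, and variance threshold $\tau$
Output: reduced $k$-dimensional data matrix $Z$, $k \leq d$
Select a nonquadratic nonlinear function $g$
Initialize the unmixing matrix $W$ for the mixing model $X = AS$, where $A$ is the ratio of the sources during mixing, matrix $S$ contains the different components, and $X$ is the mixed output
Perform PCA on $X$ to obtain $\tilde{X}$, as in Algorithm 1
while $W$ has not converged do
 Update each row $w$ of $W$: $w \leftarrow E\{\tilde{x}\, g(w^{\top} \tilde{x})\} - E\{g'(w^{\top} \tilde{x})\}\, w$
 Normalize $w \leftarrow w / \lVert w \rVert$
Derive the new dataset by taking $Z = \tilde{X} W^{\top}$, where the rows of $W$ are the $k$ converged unmixing vectors
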
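Algorithm 3: Correlation-based feature selection.
Input: $d$-dimensional input data matrix $X$ with number of samples $n$, and expected outcome $y$
Output: reduced $k$-dimensional data matrix $Z$, $k \leq d$
for $j = 1$ to $d$ do
 Calculate the correlation $r_j$ between feature $x_j$ and the outcome $y$
Sort the correlations $r_j$ in descending order to choose the first $k$ features for $Z$
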
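
The correlation-based selection can be illustrated with a short, self-contained sketch. The feature labels mimic the paper's naming (F1, F4, F5), but the values are invented for illustration, the helper names are hypothetical, and ranking by the absolute value of Pearson's coefficient is an assumption rather than a detail confirmed by the text.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def select_by_correlation(columns, target, k):
    """Return the names of the k features with the largest |r| to the target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data: F1 tracks the target, F5 moves against it, F4 is only weakly related.
demand = [10.0, 12.0, 15.0, 11.0, 18.0, 20.0]
columns = {
    "F1": [1.0, 1.2, 1.5, 1.1, 1.8, 2.0],
    "F4": [3.0, 1.0, 2.0, 3.0, 1.0, 2.0],
    "F5": [9.0, 8.0, 6.0, 8.5, 4.0, 3.0],
}
selected = select_by_correlation(columns, demand, k=2)
```

Unlike PCA and ICA, this procedure keeps original features rather than constructing new components, so the selected columns remain directly interpretable.
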
3.4. Hyperparameters Determination

Hyperparameters are values that directly control the learning process of ML techniques and are set by the user before the training phase starts. The correct combination of values is essential for obtaining a high-quality model. Choosing the values that give the optimal model is known as hyperparameter optimization or hyperparameter tuning [38]. Grid search and random search are both well-known techniques for tuning the hyperparameters of an estimator. This study used the grid search method based on cross-validation, which yields precise predictions [39]. This algorithm divides the range of each parameter to be tuned into a grid and evaluates all grid points to obtain the optimal parameters. For each model, different parameter combinations were evaluated on training and validation splits produced by the cross-validation method [39]. Table 3 provides an overview of the hyperparameters tuned for each ML technique and their tuning ranges.
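
A minimal sketch of grid search with cross-validation, using Scikit-learn's GridSearchCV, is shown below. The data are synthetic, the parameter ranges are hypothetical (the paper's actual ranges are those of Table 3), and TimeSeriesSplit is used here as a stand-in for a time-ordered validation scheme.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.svm import SVR

# Synthetic training data: 92 time-ordered samples with four features.
rng = np.random.default_rng(42)
X = rng.normal(size=(92, 4))
y = X @ np.array([1.5, -0.7, 0.3, 0.0]) + rng.normal(scale=0.1, size=92)

# Hypothetical grid: 3 x 2 = 6 candidate combinations.
param_grid = {"C": [0.1, 1.0, 10.0], "epsilon": [0.01, 0.1]}

search = GridSearchCV(
    SVR(kernel="linear"),
    param_grid,
    scoring="neg_root_mean_squared_error",  # higher is better, so RMSE is negated
    cv=TimeSeriesSplit(n_splits=5),         # validation folds always follow training folds
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_  # undo the sign flip to recover the RMSE
```

Every point of the grid is fitted once per fold, so the cost grows multiplicatively with the number of values per hyperparameter; this is the main reason tuning ranges are kept small.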

3.5. Cross-Validation in Time Series

Cross-validation is a widely used validation approach for tuning hyperparameters and assessing the effectiveness of machine learning techniques [40]. Different parameters must be specified for each case depending on the dataset. A grid search technique combined with cross-validation is effective at identifying the optimal hyperparameter combination for each model. As a consequence, the forecasting errors on test samples may be decreased, allowing the determination of the set of hyperparameters that enhances predictive performance while minimizing model overfitting [41]. The leave-one-out cross-validation procedure is acceptable in this scenario when dealing with time series data [42]. Alternatively, this method can be considered a sequential block cross-validation procedure and a special case of K-fold cross-validation.

Thus, the training set is constructed iteratively, with training and validation sets used together in each iteration, a process known as rolling cross-validation. This procedure is performed several times, with each iteration increasing the number of observations in the training set and decreasing the number in the validation set. The training set at each iteration comprises only observations that occurred before the observations in the validation set. The dataset is first partitioned into training and test sets, with 70% of the data used for training and validating the models and the remaining 30% held out for testing. The idea of the time series split is to divide the training set into two parts at each iteration, such that the validation part always lies ahead of the training part in time. The model is initially trained on a small subset of the data in order to forecast the next data point. The forecasted data points are then incorporated into the succeeding training set, and subsequent data points are forecasted. This process is repeated until the complete training set has been used. The training performance is obtained by aggregating the performance measured in each iteration.
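
The splitting scheme can be sketched with plain index arithmetic. The function names and the minimum training window of 24 months are illustrative assumptions; the sketch validates on the single observation that follows each training window, which is one reading of the one-step-ahead procedure described above.

```python
def chronological_split(n_samples, train_fraction=0.7):
    """Split indices in time order: the first part for training, the rest for test."""
    cut = int(n_samples * train_fraction)
    return list(range(cut)), list(range(cut, n_samples))

def rolling_origin_splits(train_idx, min_train):
    """Yield (train, validation) index pairs with an expanding training window.

    Each validation fold is the single observation that immediately follows
    the training window, so no future data leaks into training.
    """
    for end in range(min_train, len(train_idx)):
        yield train_idx[:end], [train_idx[end]]

# 132 monthly observations, as in the dataset described in Section 3.1.
train_idx, test_idx = chronological_split(132)
folds = list(rolling_origin_splits(train_idx, min_train=24))
```

With 132 observations, this gives 92 training and 40 test indices, and the test indices are never touched during validation.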

3.6. Structure of Stacked Ensemble Modeling

STACK modeling was conducted in two stages, level 0 and level 1: the predictions of the base learners (level 0) are combined by the meta-learner (level 1). Previous studies have used support vector regression (SVR) and least absolute shrinkage and selection operator (LASSO) regression as the meta-learner [8, 25]. The key advantage of adopting SVR at level 1 of the STACK technique is its ability to identify nonlinearities in the predictors and exploit them to improve demand forecasts [8]. In our experiment, an SVR with a linear kernel and a LASSO regression model were each used as the meta-learner (level 1).

The following steps were adopted in this work.
(1) After training the SVR/LASSO, RFR, MLP, ELM, GBR, and XGBR models, the predicted results are combined (2 in 2, 3 in 3, 4 in 4, and 5 in 5) to build layer 0 of a STACK (SVR/LASSO). Layer 0 of a stack does not use the model used in layer 1.
(2) For each STACK model, 56 models are analyzed, and the best one is chosen for the study based on the test set results.
(3) In Tables S1 and S2, models numbered 1–15 are combinations of 2 in 2, models numbered 16–35 are combinations of 3 in 3, models numbered 36–50 are combinations of 4 in 4, and models numbered 51–56 are combinations of 5 in 5, in the order specified in step 1.
(4) The performance evaluation measures are computed for the training and test sets after training each STACK model.
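
The model counts in these steps can be checked by enumeration. The sketch below assumes six candidate names, taken from the list in step 1, because choosing 2, 3, 4, or 5 out of six gives 15 + 20 + 15 + 6 = 56 combinations; how the paper reconciles this with excluding the layer-1 model from layer 0 is not stated, and the ordering within each group is also an assumption.

```python
from itertools import combinations

# Candidate names as listed in step 1 (treated here as six distinct entries).
candidates = ["SVR/LASSO", "RFR", "MLP", "ELM", "GBR", "XGBR"]

# Number the combinations consecutively: all pairs first, then triples, and so on.
numbered = {}
model_id = 1
for size in range(2, 6):
    for combo in combinations(candidates, size):
        numbered[model_id] = combo
        model_id += 1

# How many combinations exist for each group size.
counts = {
    size: sum(1 for combo in numbered.values() if len(combo) == size)
    for size in range(2, 6)
}
```

The resulting numbering reproduces the ranges quoted above: identifiers 1–15 hold pairs, 16–35 triples, 36–50 quadruples, and 51–56 quintuples.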

The working procedure of the stacking ensemble in this paper is described in Algorithm 4.

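Algorithm 4: Stacking ensemble for demand forecasting.
Input: Input dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $\theta_m$ is the set of optimal hyperparameters for each base regression model $h_m$, and $M$ is the number of base models, $m = 1, \ldots, M$.
Output: final forecast demand level $\hat{y}$ and performance indices.
Step 1: learn first-level base regression models;
/* Loop to train and evaluate the first-level individual regressors */
for $m = 1$ to $M$ do
 Divide the dataset $D$ into $D_{\text{train}}$ and $D_{\text{test}}$;
 /* 70% data for training and validation, 30% for test set */
 /* Leave-One-Out Cross-Validation */
 for each fold $k$ of $D_{\text{train}}$ do
  Split $D_{\text{train}}$ into $D_{\text{train}}^{(k)}$ and $D_{\text{val}}^{(k)}$
  Train $h_m$ with optimal hyperparameter set $\theta_m$ on $D_{\text{train}}^{(k)}$
  Predict the demand level for $D_{\text{val}}^{(k)}$ with $h_m$
Step 2: create a new dataset from the predictions of the base models;
for $i = 1$ to $n$ do
 Create a new dataset for the meta-regressor, $D' = \{(z_i, y_i)\}$ with $z_i = (h_1(x_i), \ldots, h_M(x_i))$,
 where $h_m(x_i)$ is the output of the $m$-th model and $M$ is the number of base models;
Step 3: learn second-level regressor model;
/* Loop to train and evaluate the final-level meta-regressor model */
for each fold $k$ of $D'_{\text{train}}$ do
 Split $D'_{\text{train}}$ into $D'^{(k)}_{\text{train}}$ and $D'^{(k)}_{\text{val}}$;
 Train the meta-model $H$ with its optimal hyperparameters using $D'^{(k)}_{\text{train}}$
 Predict the demand level for $D'^{(k)}_{\text{val}}$ with $H$
The test set $D'_{\text{test}}$ is used for the prediction and performance measure using $H$
return $\hat{y}$ and the performance indices
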
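
A two-level stack of this kind can be sketched with Scikit-learn's StackingRegressor. This is an illustration under stated assumptions, not the paper's implementation: the data are synthetic, GradientBoostingRegressor and RandomForestRegressor stand in for the paper's base learners (ELM and XGBR are not part of Scikit-learn), and the default 5-fold scheme replaces the time-ordered validation described in Section 3.5.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.metrics import r2_score
from sklearn.svm import SVR

# Synthetic stand-in data: 132 time-ordered samples with four features.
rng = np.random.default_rng(7)
X = rng.normal(size=(132, 4))
y = 2.0 * X[:, 0] - X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.2, size=132)

# Chronological 70/30 split: the last 30% of the samples form the test set.
cut = int(0.7 * len(X))
X_train, X_test = X[:cut], X[cut:]
y_train, y_test = y[:cut], y[cut:]

stack = StackingRegressor(
    estimators=[  # level 0: base learners (stand-ins)
        ("gbr", GradientBoostingRegressor(random_state=0)),
        ("rfr", RandomForestRegressor(n_estimators=100, random_state=0)),
    ],
    final_estimator=SVR(kernel="linear"),  # level 1: meta-learner
    cv=5,  # out-of-fold base predictions become the meta-learner's training inputs
)
stack.fit(X_train, y_train)

y_pred = stack.predict(X_test)
r2 = r2_score(y_test, y_pred)
```

Training the meta-learner on out-of-fold predictions, rather than on predictions for data the base learners have already seen, is what keeps the second level from simply rewarding the most overfitted base model.
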
3.7. Performance Measures

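Estimating a model’s accuracy is crucial in designing ML models, since it quantifies how well the model predicts. Accuracy measures are used to determine the goodness of fit between models and data and to compare various models for model selection. If $y_i$ are the actual values and $\hat{y}_i$ are the corresponding predicted values, then the formulas for evaluating the accuracy of the regression models are as follows:

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \qquad
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|,
$$

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}, \qquad
\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|,
$$

where $\bar{y}$ is the mean of the actual values and $n$ is the number of samples in the set being evaluated; in this paper, both the training set and the test set are used.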
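
The four measures reported in this study (R2, MAE, RMSE, and MAPE) can be computed with a few lines of plain Python. The function name and the sample values are illustrative, and MAPE is expressed as a percentage here, which is an assumption about the paper's convention.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R2, MAE, RMSE, and MAPE (in percent) for paired sequences."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    errors = [a - p for a, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)   # total sum of squares
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(ss_res / n),
        "MAPE": 100.0 * sum(abs(e / a) for e, a in zip(errors, y_true)) / n,
    }

# Every prediction is off by exactly 10 units.
scores = regression_metrics(
    [100.0, 200.0, 300.0, 400.0],
    [110.0, 190.0, 310.0, 390.0],
)
```

In this toy case MAE and RMSE coincide because all errors have the same magnitude, while MAPE weights the same absolute error more heavily for small actual values. MAPE is undefined when an actual value is zero.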

Along with the performance evaluation metrics mentioned above, several statistical tests [43, 44] are performed in this study to confirm the superiority of the proposed approach. The Friedman test is used to examine whether the absolute percentage errors (APE) of two or more models differ with statistical significance. Once statistical significance has been established, nonparametric post hoc tests, such as the Wilcoxon signed-rank test, can be employed to assess pairwise whether the APE of one model is lower than that of another (lower tail) [44, 45]. The null hypothesis of the Wilcoxon test states that there is no difference in APE between models 1 and 2, whereas the alternative hypothesis states that model 1 has a lower APE than model 2.
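
Both tests are available in SciPy. The sketch below uses synthetic APE values for three hypothetical models over 40 test months; the variable names and the error distributions are assumptions chosen only so that one model is clearly better.

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# Synthetic absolute percentage errors on the same 40 test points.
rng = np.random.default_rng(1)
ape_stack = rng.uniform(1.0, 4.0, size=40)
ape_gbr = ape_stack + rng.uniform(0.5, 2.0, size=40)
ape_svr = ape_stack + rng.uniform(1.0, 3.0, size=40)

# Friedman test: do the models' APE distributions differ at all?
friedman_stat, friedman_p = friedmanchisquare(ape_stack, ape_gbr, ape_svr)

# One-sided Wilcoxon signed-rank test (lower tail):
# alternative hypothesis is that the first model's APE is lower than the second's.
w_stat, w_p = wilcoxon(ape_stack, ape_svr, alternative="less")
```

Note that SciPy's friedmanchisquare requires at least three related samples, and that running several pairwise Wilcoxon tests calls for a multiple-comparison correction when conclusions are drawn from them jointly.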

4. Experimental Results and Discussion

In this section, Section 4.2 presents the exploratory analysis of the steel industry data used in this study. The performance of the adopted models and the statistical tests on the test set errors are described in Section 4.3. Tables S1 and S2 present the performance measures of the 56 generated models.

4.1. Experimental Setup

A single computer (Asus X556U with an Intel Core i5-7200U central processing unit running at 2.50 GHz, 8.0 GB of random access memory, and an Nvidia GeForce 940MX graphics card) running the Windows 10 operating system was used to produce the results reported in Section 4. The machine learning approaches and ensemble methods were implemented in the Python 3.6 programming language using the Spyder environment, which is included in Anaconda.

4.2. Exploratory Analysis

Correlation analysis is a statistical approach used to determine the relationship between two numerical variables. From an ML viewpoint, it indicates how the features correspond to the outcome. However, it is challenging to identify how features are interconnected. Data visualization can help determine how individual features correlate with the outcome. Pearson’s correlation coefficient is used to quantify the relationship between two variables. It ranges from −1 to +1, where 0 means that there is no correlation at all, +1 indicates a perfect positive correlation, and −1 indicates a perfect negative correlation. Figure 2 shows the correlation matrix of the explanatory variables after the Yeo–Johnson transformation has been applied to the training data set. The color scale of the correlation is shown on the right-hand side of the figure. A light color indicates a correlation close to 0, whereas an intense color indicates a correlation close to +1 or −1. The indicators (F1, F2, and F3) and the response variable (Demand) are highly positively correlated. Thus, an increase or decrease in the value of one tends to be accompanied by an increase or decrease in the others. However, indicator F5 is negatively correlated with the outcome (Demand), indicating that if the number of holidays in a month increases, demand decreases, and vice versa.

4.3. Evaluation of Proposed Models

In this study, each proposed model is trained using the set of optimal hyperparameters, found by grid search, that maximizes its predictive performance. The steel production data from January 2009 to December 2019, covering 132 months, are divided into the training and test sets.

Table 3 presents an overview of the hyperparameters tuned for each ML model, their explanation, and their tuning ranges. Table 4 reports the results used to select the best-performing preprocessing, number of selected features, and ML models, where R2 with its standard deviation is stated for comparison. Table 5 summarizes the highest R2 obtained by each model using the proposed pipeline, along with the optimal preprocessing and feature selection algorithms and the number of selected features. In addition, Table 5 lists the best hyperparameters found by the grid search. The analysis of Table 4 reveals that the models produce better outcomes when suitable preprocessing is used. The different architectures of the MLP model are shown in Table 6. Table 7 summarizes the performance metrics used to evaluate each model, which include R2, MAE, RMSE, and MAPE. Each model achieves its best results when missing-value filling, Yeo–Johnson transformation, and data standardization are combined with either correlation-based or PCA-based feature selection (Tables 4 and 5). For SVR, the estimated accuracy of R2 = 0.931 is obtained with preprocessed data and correlation-based feature selection.

Comprehensive experiments were performed on the same dataset to find the best architecture for the MLP model. Eight separate MLP models (Table 6) were implemented and evaluated, with 1–7 hidden layers, with the number of neurons per layer treated as a hyperparameter. The experimental results in Figure 3 indicate that the optimal architecture is the MLP layout with M = 4 hidden layers (H1, H2, H3, and H4) and N1 = 12, N2 = 12, N3 = 12, and N4 = 8 neurons. In addition, adding hidden layers when few samples are available, as in the steel dataset, limits the MLP model’s capability (Figure 3). With such limited data, a very deep MLP model can overfit and suffer from vanishing gradients. Table 3 lists the optimal hyperparameters of the best MLP model. The model used the ReLU activation function and the Adam solver. It was trained for 200 epochs with a constant learning rate of 0.01, a batch size of 32, and a regularization parameter of 0.1. To reduce overfitting, a dropout layer was used, randomly dropping 60% of the neurons. The highest accuracy (R2) of the MLP model is 0.961, obtained with data preprocessing and PCA-based feature selection. Similarly, the ELM model with eight neurons in the hidden layer obtained the best result. Table 3 lists the optimal hyperparameters of the best ELM model. The model used ReLU as the transformation function of the hidden layer neurons, and the optimal regularization parameter was 0.001. The best-estimated accuracy (R2) of the ELM model with preprocessed data and correlation-based feature selection is 0.942.

Feature selection methods (correlation-based, PCA, and ICA) are used to improve the overall performance of each model. PCA reduces a higher-dimensional space to a lower-dimensional space by selecting the orthogonal projections with the highest variance. The ICA theory implies that data are only partly independent if their variances across features are larger than their covariance. The number of components used has a significant impact on PCA performance. Because ICA-based feature selection constructs new, mutually independent components, the correlation with the desired output may be lost in the process. Since both PCA and ICA create new components in an unsupervised manner, they cannot guarantee better performance on the steel dataset. Correlation-based feature selection, on the other hand, takes into consideration the relationship between the features and the outcome in order to find the most closely related features. As shown in Table 4, the majority of models perform better when four features, F1, F2, F3, and F6, are used. These four features were chosen using the correlation-based feature selection technique.
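The idea behind correlation-based selection can be sketched as follows: score each feature by the absolute value of its Pearson correlation with the outcome and keep the top k. This is only an illustration under stated assumptions. The helper names `pearson` and `select_by_correlation`, the toy columns, and the top-k rule are hypothetical, and the exact selection criterion used in the study may differ.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def select_by_correlation(features, target, k):
    """Return the k feature names with the largest |r| against the target."""
    scores = {name: pearson(col, target) for name, col in features.items()}
    ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    return ranked[:k], scores

# Hypothetical toy data; the column names only mimic the paper's labelling.
demand = [10.0, 12.0, 15.0, 18.0, 20.0, 24.0]
features = {
    "F1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # strongly positively correlated
    "F4": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],  # weakly correlated
    "F5": [9.0, 8.0, 6.0, 5.0, 4.0, 1.0],  # strongly negatively correlated
}
selected, scores = select_by_correlation(features, demand, k=2)
print(selected, scores)
```

Ranking by the absolute value keeps strongly negative predictors as well as strongly positive ones. Unlike PCA or ICA, this rule uses the outcome directly, so the retained columns are original features rather than new components.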

Further improvement of demand forecasting was obtained using regression ensemble models. Bagging (RFR), boosting (GBR and XGBR), and stacking (STACK) regression ensembles were adopted to improve the performance of demand forecasting. Table 5 presents the performance evaluation of the adopted models. The results are sorted in ascending order of R2 on the test set, and the best models are those with the lowest RMSE and the highest R2 on the test set. RFR is an ensemble learner built from unpruned decision trees, and it reduces the effects of overfitting by combining multiple trees. Table 5 shows the optimal hyperparameters for the RFR model. The best-estimated accuracy (R2) of the RFR model is 0.966, obtained from preprocessed data and PCA-based feature selection. RFR performs better than SVR, MLP, and ELM in terms of the RMSE, that is, it has lower RMSE values. GBR and XGBR are also used to increase the accuracy of the forecasts. Extreme gradient boosting is a specific variant of the gradient boosting strategy that finds the best tree model by using a more accurate approximation than the conventional gradient boosting method. The best-estimated accuracy (R2) of the GBR model is 0.969, obtained from preprocessed data and correlation-based feature selection. XGBR reduces the loss further: its highest accuracy (R2) is 0.974, and its lowest RMSE is 0.151. The RMSE of XGBR is considerably lower than those of the reference models, RFR, and GBR. The best result of XGBR is obtained when the minimum child weight is less than 4 and the subsample ratio used to construct each tree is 0.7.

Finally, the stacking ensemble method is used to integrate multiple base models in order to reduce prediction errors as far as possible. According to the results from the test set, level 0 of the STACK1 method is formed of the models ELM, GBR, and XGBR, with SVR as the meta-learner in level 1. For STACK2, level 0 is likewise formed of ELM, GBR, and XGBR, with LASSO as the level 1 meta-learner. In Table A1, the models numbered 14, 24, 33, 35, 45, 50, and 55 achieve the same performance (R2). Model 35 is selected for the STACK1 technique because its complexity is lower than that of the other configurations and it has the lowest MAPE. Similarly, the models numbered 33, 35, 50, and 56 in Table A2 exhibit the same level of performance (R2). For the STACK2 technique, model 35 is also selected because its complexity is lower than that of the other configurations and it has the lowest MAPE of the models tested. The best-estimated accuracy (R2) of STACK1 is 0.977, with a MAPE of 0.445. Similarly, the best-estimated accuracy (R2) of STACK2 is 0.977, with a MAPE of 0.463. According to Table 7, based on the findings of the test phase, the approaches based on ensemble learning produced results that were consistent with the objective of minimizing error.
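The two-level structure described above can be illustrated with a deliberately small, dependency-free sketch. It is not the study's implementation: a straight-line fit and a nearest-neighbour mean stand in for the level 0 learners (ELM, GBR, XGBR), an intercept-free least-squares combiner stands in for the level 1 meta-learner (SVR or LASSO), and training the meta-learner on out-of-fold predictions is a common stacking practice assumed here. All function names and the toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Level 0 stand-in: ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    b = my - a * mx
    return lambda x: a * x + b

def fit_knn(xs, ys, k=2):
    """Level 0 stand-in: mean of the k nearest training targets."""
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict

def fit_meta(p1, p2, ys):
    """Level 1 stand-in: least-squares weights for y ~ w1*p1 + w2*p2."""
    a11 = sum(u * u for u in p1)
    a12 = sum(u * v for u, v in zip(p1, p2))
    a22 = sum(v * v for v in p2)
    b1 = sum(u * y for u, y in zip(p1, ys))
    b2 = sum(v * y for v, y in zip(p2, ys))
    det = a11 * a22 - a12 * a12  # assumes the two columns are not collinear
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det

def fit_stack(xs, ys, n_folds=3):
    """Train base models per fold, fit the meta-learner on out-of-fold output."""
    base_fitters = [fit_line, fit_knn]
    n = len(xs)
    oof = [[0.0] * n for _ in base_fitters]  # out-of-fold predictions
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        for m, fitter in enumerate(base_fitters):
            model = fitter([xs[i] for i in train], [ys[i] for i in train])
            for i in range(fold, n, n_folds):  # indices held out in this fold
                oof[m][i] = model(xs[i])
    w1, w2 = fit_meta(oof[0], oof[1], ys)
    line, knn = fit_line(xs, ys), fit_knn(xs, ys)  # refit on all data
    return (lambda x: w1 * line(x) + w2 * knn(x)), (w1, w2)

# Hypothetical, exactly linear toy series so the outcome is easy to check.
xs = [float(x) for x in range(1, 13)]
ys = [2.0 * x + 1.0 for x in xs]
predict, weights = fit_stack(xs, ys)
forecast = predict(6.5)
print(weights, forecast)
```

On this toy series the line model reproduces the held-out targets exactly, so the meta-learner places all of its weight on it. With real data the weights would typically be spread across the base learners, which is where the error reduction of stacking comes from.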

Figure 4 shows violin plots of the absolute percentage error (APE) distribution of each model on the test set. The mean APE is shown by the white dot in the center of each plot. Compared with the other models, the ensemble-based techniques yield the lowest APE. In this way, we can show that a model with lower error metric values in Table 7 (for the test set) has a more stable APE and less volatility than a model with higher values. The Friedman test established that the APEs of the adopted models varied in the test set. This implies that there exist models with observed APE values that are equal to or less than those of the others. In addition, given the statistically significant difference revealed by the Friedman test, Table 8 presents the results of the Wilcoxon signed-rank test (lower tail) for measuring the APE reduction of the assessed models in the test set.
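As an illustration of the first of these tests, the Friedman statistic can be computed from a table of APE values as sketched below. The sketch uses the textbook formula without the tie correction and does not compute a p-value; the function names and the toy APE table are hypothetical and are not the study's results.

```python
def rank_row(values):
    """Average ranks within one row (1 = smallest); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks

def friedman_statistic(ape_table):
    """ape_table[i][j] is the APE of model j on test observation i."""
    n, k = len(ape_table), len(ape_table[0])
    rank_sums = [0.0] * k
    for row in ape_table:
        for j, r in enumerate(rank_row(row)):
            rank_sums[j] += r
    chi2 = 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)
    return chi2, rank_sums

# Hypothetical APE values: 4 test months (rows) by 3 models (columns).
ape_table = [
    [0.4, 1.1, 2.0],
    [0.5, 0.9, 1.8],
    [0.3, 1.2, 2.5],
    [0.6, 1.0, 1.9],
]
chi2, rank_sums = friedman_statistic(ape_table)
print(chi2, rank_sums)
```

The statistic is compared against a chi-square distribution with k − 1 degrees of freedom. A significant result only says that at least one model differs, which is why pairwise Wilcoxon signed-rank tests follow.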

At the 5% level of significance, the APE of the STACK1 model is lower than the APEs of the RFR, MLP, ELM, and SVR models, as shown in Table 8. When the STACK1 model is compared to the remaining models, the errors are statistically equivalent at the 5% level. Similarly, Table 8 reveals that the APE of the STACK2 model is lower than the APEs of the RFR, MLP, ELM, and SVR models at the 5% level of significance, and when the STACK2 model is compared to the remaining models, the errors are statistically equivalent at the 5% level. This highlights the advantages of the proposed stacking ensemble models. Ensemble-based models, on average, have a lower APE than ELM and SVR. Thus, the ability of this approach to learn the data is reflected in lower estimation errors and lower variance for the ensemble methods than for the others, which supports the validity of this methodology.

Furthermore, a relationship between the actual and predicted demand was established, as illustrated in Figures 5 and 6. Figure 5 shows the correlation-based comparison between actual and predicted demand levels for both the reference models and the regression ensemble models. Figure 6 provides a pictorial view of actual versus predicted demand and shows that the demand pattern fluctuates irregularly because of the impact of the variables affecting it.

As shown in Figures 5 and 6, models that are capable of providing predictions that are consistent with the observed values are able to learn from data behavior. The improved performance attained during the training phases is maintained during the test phases, suggesting that the regression ensemble methodology is reliable in terms of achieving established predictions. This is supported by the capability of machine learning models to manage nonlinearities and model the complicated interaction between response variables and input features.

5. Conclusions

Precise demand forecasting significantly improves the performance and durability of the steel industry. This study compares the predictive performance of the STACK, GBR, XGBR, and RFR regression ensembles and the MLP, ELM, and SVR reference models. Data preparation and feature selection procedures are critical for improving the prediction performance of regression ensemble models. The proposed preprocessing scheme improves the quality of the raw dataset, where filling the missing values and data standardization are the main concerns. The Yeo–Johnson transformation is applied to the features and the response variable. While PCA and ICA focus solely on interfeature redundancy, correlation-based feature selection takes the correlation between the features and the outcome into account. A grid search algorithm is used to find the optimal hyperparameter set for each ML technique. The best-performing models are combined to form level 0 of the stacking ensembles. SVR with a linear kernel and LASSO regression are adopted as meta-learners in level 1. The Friedman and Wilcoxon signed-rank tests (lower tail) are used to validate the models’ APE differences. Based on the findings, two models may be used to forecast demand one month ahead: STACK1 (ELM + GBR + XGBR-SVR) and STACK2 (ELM + GBR + XGBR-LASSO). The test set results demonstrate that ensemble approaches, notably the STACK models, outperform single models in forecasting demand in the steel industry.

Future research will (i) develop other ensemble techniques and integrate other ML regression techniques into the ensemble; (ii) include other influencing variables such as occasions and political factors; (iii) collect more data, since only 132 months of production data are used in this study; and (iv) extend the approach to other industrial fields to evaluate its generality and flexibility in predicting several types of demand [36, 46, 47].

Data Availability

The data used in this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest to report regarding the present study.

Acknowledgments

The authors are grateful to the Deanship of Scientific Research, King Saud University for funding through Vice Deanship of Scientific Research Chairs. The authors also thank the Deanship of Scientific Research and RSSU at King Saud University for their technical support. The authors are also grateful for the support from Taif University, Taif, Saudi Arabia. The APC is funded by Taif University Researchers Supporting Project Number (TURSP-2020/331), Taif University, Taif, Saudi Arabia.

Supplementary Materials

Figure S1: general architecture of stacking ensemble; Figure S2: architecture of MLP, with M hidden layers and neurons in layer, for forecasting the demand in the proposed framework; Figure S3: single hidden layer architecture of ELM used in this study; Table S1: performance metrics for evaluating STACK models that are used to anticipate steel demand one month ahead when the metalearner is SVR with a linear kernel; Table S2: performance metrics for evaluating STACK models that are used to anticipate steel demand one month ahead when the metalearner is LASSO.