Abstract

In this article, we compare Autometrics with machine learning techniques, namely the Minimax Concave Penalty (MCP), Elastic Smoothly Clipped Absolute Deviation (E-SCAD), and Adaptive Elastic Net (AEnet). In the simulation experiments, three kinds of scenarios are considered, allowing for multicollinearity, heteroscedasticity, and autocorrelation, with varying sample sizes and numbers of covariates. We find that all methods improve their performance as the sample size grows. In the presence of low and moderate multicollinearity and low and moderate autocorrelation, the considered methods retain all relevant variables. However, under low and moderate multicollinearity, all methods except AEnet also keep many irrelevant predictors, whereas under low and moderate autocorrelation, Autometrics, along with AEnet, retains fewer irrelevant predictors. In the case of extreme multicollinearity, AEnet retains more than 93 percent of the correct variables with an outstanding gauge (zero percent). The potency of the remaining techniques, specifically MCP and E-SCAD, tends towards unity as the sample size increases, but they capture a massive number of irrelevant predictors. Similarly, under high autocorrelation, E-SCAD performs well in selecting relevant variables for small samples, while in terms of gauge, Autometrics and AEnet perform better and often retain less than 5 percent irrelevant variables. In the presence of heteroscedasticity, all techniques typically hold all relevant variables but also suffer from overspecification, except AEnet and Autometrics, which avoid the irrelevant predictors and identify the true model precisely. For an empirical application, we consider workers’ remittance data for Pakistan along with twenty-seven determinants spanning 1972 to 2020. AEnet selects thirteen relevant covariates of workers’ remittance, while E-SCAD and MCP suffer from overspecification. Hence, policymakers and practitioners should focus on the relevant variables selected by AEnet to improve workers’ remittances in Pakistan. In this regard, the Pakistani government has devised policies that make it easy to transfer remittances legally and that mitigate the cost of transferring remittances from abroad. The AEnet approach can help policymakers identify the relevant variables in the presence of a huge set of covariates, which in turn produces accurate predictions.

1. Introduction

“Big Data” has arrived, but big insights have not [1]. In regression analysis, researchers are often interested in discovering the important features while predicting the response variable. It is therefore important to identify the potential features both for knowledge discovery and for the predictive ability of the model [2]. Variable selection is thus one of the crucial steps in constructing a linear regression model. Picking too many covariates is likely to increase the variance of the estimated model: including more variables leads to higher variability in the least squares fit, resulting in overfitting and hence poor future predictions [3]. In contrast, selecting too few covariates may yield unreliable or biased results [3, 4]. As [5] stated, for valid results all relevant predictors should be incorporated in the regression model; missing even a single predictor may lead to a misspecified model and fallacious conclusions. According to [6, 7], if the covariates are highly correlated with each other, the confidence interval associated with each estimated coefficient becomes wider and leads to wrong inferences.

In the recent era, a substantial body of research has concentrated on the analysis of “Big Data” in the field of economics. As a result, substantial attention is being paid to the wide variety of techniques available in the areas of data mining, machine learning, dimension reduction, and penalized least squares [8, 9]. Recently, in the regression context, [1] categorized Big Data into three classes: Tall Big Data, Huge Big Data, and Fat Big Data. Each type can be defined as follows:
(i) Tall Big Data: many observations and a few covariates (N >> P)
(ii) Huge Big Data: many observations and many covariates (N > P)
(iii) Fat Big Data: fewer observations than covariates (N < P)

Here, N and P represent the number of observations and covariates, respectively. We graphically represent the types of Big Data in Figure 1.

Handling Big Data is clearly not an easy task, and to date only a handful of methods exist in the literature that can improve least squares estimates in a data-rich environment (Big Data). In Figure 2, we identify the common methods and their modifications.

Now, we briefly discuss these methods. Penalized least squares methods are an integral component of machine learning (ML). It has already been shown in the literature that ML methods are efficient approaches for handling Big Data [10]. Penalized regression methods are modified forms of ordinary least squares (OLS). Mathematically, the modified form can be written as follows:
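
In standard notation (our reconstruction, with tuning parameter λ and mixing parameter α as described below), the penalized least squares objective is

\[
\hat{\beta} = \arg\min_{\beta}\; \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{P} x_{ij}\beta_j\Big)^{2} + \lambda \sum_{j=1}^{P}\Big(\alpha\,|\beta_j| + \tfrac{1}{2}(1-\alpha)\,\beta_j^{2}\Big).
\]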

As in classical regression, the first component is the sum of squared residuals, and the remaining part represents the shrinkage penalty. Here, λ refers to the tuning parameter and is often selected by cross-validation. The other parameter is α; by altering its value, we obtain different models. More specifically, setting α = 0 yields the ridge regression model and α = 1 yields the Lasso, while for values of α between zero and one we obtain the elastic net model [6]. As its name reflects, penalized least squares methods are based on constraints. A good penalty satisfies the following three oracle properties: unbiasedness, continuity, and sparsity [11]. Methods belonging to the family of penalized regression, such as ridge, Lasso, and elastic net, do not satisfy all of the aforementioned oracle properties [12, 13]. In the literature, some modified methods satisfy the required oracle properties, including smoothly clipped absolute deviation (SCAD) and the adaptive Lasso, but the drawback associated with these two methods is that they select only one variable from a group of correlated covariates and ignore the others; the selected variable may or may not be theoretically important. [14] modified SCAD by adding another property to its penalty, which encourages a set of highly correlated covariates to enter or leave the model at the same time. In other words, the new version of SCAD is able to select a group of correlated variables instead of a single one. Similarly, [2] modified the elastic net in the form of the adaptive elastic net, which achieves the oracle property; the method is capable of including and excluding groups of features simultaneously. The minimax concave penalty (MCP) is another extended method, developed by [6] and based on a concave penalty; it also enjoys the oracle property. To summarize, the Adaptive Elastic net, MCP, and Elastic SCAD are updated forms of penalization techniques, primarily used for variable selection, and will be explored in the next sections.
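
As a brief illustration (the glmnet package and the toy data below are our own choices and are not part of the original analysis), the three special cases can be obtained by varying the mixing argument alpha, with lambda chosen by cross-validation:

library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)  # toy predictors
y <- x[, 1] - 2 * x[, 2] + rnorm(100)                # toy response

ridge_fit <- cv.glmnet(x, y, alpha = 0)    # alpha = 0: ridge regression
lasso_fit <- cv.glmnet(x, y, alpha = 1)    # alpha = 1: Lasso
enet_fit  <- cv.glmnet(x, y, alpha = 0.5)  # 0 < alpha < 1: elastic net

coef(enet_fit, s = "lambda.min")           # coefficients at the cross-validated lambda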

Another approach for automatic model selection was proposed by [15, 16], known as PcGets. This method is based on the idea of general-to-specific (gets) modeling. It starts from a general unrestricted model that captures the key attributes of the underlying dataset. Standard testing approaches are then utilized to reduce its complexity by removing statistically insignificant variables, checking the validity of the reductions at every stage to ensure the congruence of the selected model. They studied the probability that PcGets recovers the data generating process (DGP) through Monte Carlo experiments and obtained reliable results. The consistency of the PcGets procedure was established by [17].

A new version of the PcGets algorithm, known as Autometrics, was proposed by [18]. This version is based on the same principles as PcGets. Autometrics utilizes a tree-path search to identify and knock out statistically insignificant covariates. If a relevant covariate is eliminated by chance, the algorithm does not get stuck in a single search path in which other covariates act as proxies (as happens in stepwise regression). The beauty of this algorithm is that it works well even if the number of covariates exceeds the number of observations [10].

Our study contributes theoretically as well as empirically to the literature. There is an immense literature using conventional approaches such as vector autoregressions and vector error correction models. Such approaches can accommodate no more than about 10 covariates, as more covariates create serious issues that invalidate the results. More precisely, increasing the number of predictors (Big Data) leads to major problems in such models, including loss of degrees of freedom, high variability, and multicollinearity. To address these problems and achieve valid results, this study adopts several updated classical and machine learning techniques. The techniques are compared under simulated scenarios of multicollinearity, heteroscedasticity, and autocorrelation, and are then applied to macroeconomic data to provide conclusive answers on predictability and validity across distinct theoretical scenarios simultaneously. Our study aims to provide an improved technique to help policymakers; the improved tool is not restricted to workers’ remittances (our application) but is valid for any macroeconomic data set under Huge Big Data (P < N).

The goal of this study is to compare the performance of the classical approach (Autometrics) with improved shrinkage methods, including the Adaptive Elastic net, Elastic Smoothly Clipped Absolute Deviation, and Minimax Concave Penalty, under different scenarios (multicollinearity, heteroscedasticity, and autocorrelation) in terms of variable selection. In this study, we focus solely on exploring these techniques for the case of Huge Big Data.

The rest of the article is arranged as follows: Section 2 gives an overview of the methods; Section 3 discusses the simulation exercise; Section 4 carries out the real data analysis; and Section 5 comprises the conclusion.

2. Methods

In statistics and econometrics, it is imperative to investigate the performance of statistical models both theoretically and empirically. This work attempts to address both aspects for the included methods. Our study considers several modified forms of penalization techniques and a classical approach. The methods considered here are the Adaptive Elastic net, Elastic Smoothly Clipped Absolute Deviation, Minimax Concave Penalty, and Autometrics. Here, we provide a detailed description of each method.

2.1. Adaptive Elastic Net (AEnet)

The lasso estimator was designed to improve on the ridge estimator. It is certainly useful, particularly when most coefficients of the true model are zero. However, ridge regression performs better than the lasso when the correlation between predictors is high [19].

To overcome the shortcomings of the lasso and ridge regression, the elastic net method was proposed by [19]; it uses both the lasso and ridge penalties simultaneously. The penalty function of the elastic net (EN) is given by the following:
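
In the two-tuning-parameter form (standard notation, reconstructed here), the EN penalty combines the ℓ1 and ℓ2 terms as

\[
P_{\text{EN}}(\beta) = \lambda_1 \sum_{j=1}^{P} |\beta_j| + \lambda_2 \sum_{j=1}^{P} \beta_j^{2}.
\]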

Chosen by a cross-validation approach, the tuning parameters λ1 and λ2 control the relative importance of the ℓ1-norm and ℓ2-norm penalties. Both Lasso and ridge regression are special cases of the elastic net, as already discussed in Section 1. In this sense, the elastic net combines two features: shrinkage and variable selection.

To estimate the elastic net coefficients, [19] proposed an algorithm based on least angle regression (LAR). In fact, the EN does not satisfy the oracle property, unlike the adaptive Lasso, although it performs better than the adaptive Lasso [11]. Later on, the ideas of the adaptive Lasso and elastic net regularization were combined to achieve further improvement, known as the adaptive elastic net (AEnet), which is defined as follows:
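
Written here in standard notation (a reconstruction in the spirit of [2]; the weights ŵ_d are defined immediately below), the AEnet estimator is

\[
\hat{\beta}(\text{AEnet}) = \Big(1 + \frac{\lambda_2}{n}\Big)\Big\{\arg\min_{\beta}\; \|y - X\beta\|^{2} + \lambda_2 \sum_{d=1}^{m}\beta_d^{2} + \lambda_1^{*} \sum_{d=1}^{m} \hat{w}_d\,|\beta_d| \Big\}.
\]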

Here, ŵ_d (d = 1, 2, …, m) are adaptive, data-driven weights. According to [2], we first estimate the coefficients by the EN method given in (2) and then use them to compute the weights as ŵ_d = (|β̂_d(EN)|)^(−γ), where γ is a constant and should be positive. Thus, the AEnet, a modified form of the elastic net, attains the oracle property.
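
A minimal two-step sketch of this idea in R, using the glmnet package (an illustrative implementation of the weighting principle via the penalty.factor argument, not the exact estimator of [2]; the toy data, the choice of γ, and the small constant eps are assumptions):

library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 20), 100, 20)   # toy predictors
y <- x[, 1] - 2 * x[, 2] + rnorm(100)   # toy response

# Step 1: initial elastic net fit to obtain the EN coefficient estimates
en_fit  <- cv.glmnet(x, y, alpha = 0.5)
beta_en <- as.numeric(coef(en_fit, s = "lambda.min"))[-1]  # drop the intercept

# Step 2: adaptive weights w_d = (|beta_d(EN)| + eps)^(-gamma), gamma > 0
gamma <- 1
eps   <- 1e-6                            # guards against division by zero
w     <- (abs(beta_en) + eps)^(-gamma)

# Re-fit the elastic net with coefficient-specific penalty weights
aenet_fit <- cv.glmnet(x, y, alpha = 0.5, penalty.factor = w)
coef(aenet_fit, s = "lambda.min")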

2.2. Elastic Smoothly Clipped Absolute Deviation (E-SCAD)

Reference [12] developed a new regularization method known as SCAD. This penalty is nonconvex and fulfills the properties of a good penalty. The method not only selects the important features consistently but also estimates the unknown coefficients as efficiently as if the true model were known. Therefore, the SCAD function overcomes the limitations of existing methods such as ridge and Lasso.

The penalty function of SCAD is defined as follows:
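
In the usual notation (reconstructed here in standard form, with tuning parameter λ and shape parameter a > 2), the SCAD penalty is

\[
p_{\lambda}(|\beta|) =
\begin{cases}
\lambda |\beta|, & |\beta| \le \lambda,\\[4pt]
\dfrac{2a\lambda|\beta| - \beta^{2} - \lambda^{2}}{2(a-1)}, & \lambda < |\beta| \le a\lambda,\\[4pt]
\dfrac{(a+1)\lambda^{2}}{2}, & |\beta| > a\lambda.
\end{cases}
\]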

They considered the value of a equal to 3.7, and the unknown tuning parameter λ was computed by generalized cross-validation. As noted above, the penalty function is continuous, and the resulting solution is given by the following:

The tuning parameters can be chosen using data-driven techniques. The idea of combining the SCAD and ℓ2 penalties was proposed by [14], who called the result Elastic SCAD. Mathematically, E-SCAD can be written as follows:
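
In standard notation (our reconstruction, with SCAD tuning parameter λ1 and ridge tuning parameter λ2), the E-SCAD criterion augments the SCAD-penalized least squares problem with a ridge-type term:

\[
\hat{\beta}(\text{E-SCAD}) = \arg\min_{\beta}\; \|y - X\beta\|^{2} + \sum_{j=1}^{P} p_{\lambda_1}(|\beta_j|) + \lambda_2 \sum_{j=1}^{P} \beta_j^{2},
\]

where p_{λ1}(·) denotes the SCAD penalty given above.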

2.3. Minimax Concave Penalty

The idea of the minimax concave penalty (MCP) was initially proposed by [20]. This method provides the convexity of the penalized loss in sparse regions to the greatest extent, given certain thresholds for variable selection and unbiasedness. The MCP is described as follows:
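
Using the penalty level λ and concavity parameter γ (standard notation, reconstructed here), the MCP takes the piecewise form

\[
P_{\gamma}(\beta; \lambda) =
\begin{cases}
\lambda|\beta| - \dfrac{\beta^{2}}{2\gamma}, & |\beta| \le \gamma\lambda,\\[4pt]
\dfrac{1}{2}\gamma\lambda^{2}, & |\beta| > \gamma\lambda.
\end{cases}
\]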

The tuning parameter (γ) reduces the maximal concavity subject to the following constraints, i.e., unbiasedness and feature selection:

The role of the dual tuning parameters in concave penalized regression is to control the amount of regularization. Moreover, the MCP provides sparse convexity of the penalized loss by reducing the maximal concavity. As the value of the regularization parameter rises, the resulting penalized loss exhibits more convexity and the penalty approaches an unbiased one [20]. The penalty function is a quadratic spline governed by the dual tuning parameters.
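
For illustration only (the ncvreg package, its settings, and the toy data are our own choices and are not reported in the paper), penalties of this family can be fitted in R, with gamma playing the role of the concavity parameter and alpha mixing in a ridge component:

library(ncvreg)

set.seed(1)
x <- matrix(rnorm(100 * 20), 100, 20)   # toy predictors
y <- x[, 1] - 2 * x[, 2] + rnorm(100)   # toy response

# MCP with concavity parameter gamma; lambda chosen by 10-fold cross-validation
mcp_cv <- cv.ncvreg(x, y, penalty = "MCP", gamma = 3)
coef(mcp_cv)                            # coefficients at the selected lambda

# SCAD combined with a ridge component (alpha < 1), in the spirit of E-SCAD
escad_cv <- cv.ncvreg(x, y, penalty = "SCAD", gamma = 3.7, alpha = 0.5)
coef(escad_cv)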

2.4. Classical Approach

Autometrics comprises five fundamental phases. The first phase concerns the construction of a linear model known as the general unrestricted model (GUM); the second yields the estimates of the unknown parameters and the statistical testing of the GUM; the third consists of the presearch process; the fourth provides the tree-path search; and the last involves the selection of the final model.

The complete algorithm is precisely delineated in [18]. The key notion is to commence modeling with a linear model incorporating every essential feature. The GUM is estimated by least squares, and statistical tests are then executed to ensure the congruency of the model. If the estimated GUM contains statistically insignificant coefficients at the prespecified criteria, simpler models are estimated along different path searches and ratified by statistical and diagnostic tests. Once some terminal models are detected, Autometrics undertakes their union testing. Rejected models are eliminated, and the union of the surviving terminal models forms a new GUM for another tree-path search iteration. This inspection process continues, and the terminal models are statistically examined against their union. If two or more terminal models pass the encompassing tests, the prechosen information criterion serves as the gateway to the final decision.

The forecasting model is obtained by applying the Autometrics approach to the GUM:

Here, two strategies are widely used for variable selection: a conservative strategy and a superconservative strategy. This study adopts the superconservative strategy, which is based on a one percent level of significance instead of five percent.
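
Autometrics itself is implemented in OxMetrics/PcGive; a closely related general-to-specific search is available in the R package gets (our choice here for illustration, with toy data), where t.pval = 0.01 mimics the one percent significance level of the superconservative strategy:

library(gets)

set.seed(1)
x <- matrix(rnorm(100 * 20), 100, 20)   # toy candidate regressors
colnames(x) <- paste0("x", 1:20)
y <- x[, 1] - 2 * x[, 2] + rnorm(100)   # toy response

# General unrestricted model (GUM) containing all candidate regressors
gum <- arx(y, mxreg = x)

# General-to-specific tree search at the one percent significance level
final <- getsm(gum, t.pval = 0.01)
coef(final)                             # regressors retained in the final model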

3. Simulation Study

Our simulation experiment involves three main scenarios, namely simulations on a data generating process (DGP) with (i) multicollinearity, (ii) heteroscedasticity, and (iii) autocorrelation. In each case, we vary the DGP characteristics, such as the correlation structure among the predictors, the variance of the error term, and the correlation between the current and lagged values of the error term.

3.1. Data Generating Process

We generate data from the following equation, where y is the outcome variable. The feature set X = (x1, x2, …, xP) is generated from a multivariate normal distribution, X ∼ MVN(0, Σ), where the mean of the covariates is zero and Σ is the variance–covariance matrix. The same data generating process (DGP), given in equation (9), was used by [1, 21] for artificial data generation. Three sample sizes are used in the simulation exercise. Moreover, we assume two sets of candidate variables with varying numbers of relevant (p) and irrelevant (q) variables, as presented in Figure 3.

In the first scenario, we generate pairwise correlation between the predictors, say xi and xj, governed by the parameter ρ. The population covariance matrix Σ is generated as follows:

By varying the parameter ρ, we obtain different pairwise correlations; here, we assume the values of ρ to be {0.25, 0.5, 0.9}, following [22]. In the second scenario, we generate correlation between the current and lagged residuals (autocorrelation). The autocorrelated errors are generated by the following first-order autoregressive equation:

\[
\varepsilon_t = \phi\,\varepsilon_{t-1} + u_t ,
\]

where φ is the coefficient of the lagged residuals and u_t is a white noise disturbance.

We assign the following values to the coefficient of the lagged residuals: φ ∈ {0.25, 0.5, 0.9}. In the third scenario, heteroscedasticity, the variance of the error term is not constant but varies across observations as σi².

Thus, we divide the variance into two parts, σ1² and σ2². For half of the observations (n/2), we set the variance to σ1², and for the remaining (n/2) data points we use σ2². Our experiment assumes three cases of heteroscedasticity, setting the values of (σ1/σ2), for i = 1, 2, 3, to 0.1/0.3, 0.2/0.6, and 0.3/0.9. This study evaluates the performance of Autometrics, AEnet, E-SCAD, and MCP using Huge Big Data under all of the preceding scenarios. Tenfold cross-validation is executed to determine the optimal value of the tuning parameter.
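
A condensed sketch of the three designs in R (the equicorrelation form of Σ, the unit coefficients on the relevant covariates, and the specific standard deviations are illustrative assumptions, since the exact constants are those of equation (9) and [22]):

library(MASS)   # for mvrnorm

set.seed(1)
n <- 100; p <- 10; q <- 20               # relevant (p) and irrelevant (q) covariates
rho <- 0.5                               # pairwise correlation (scenario 1)
phi <- 0.5                               # coefficient of the lagged residuals (scenario 2)
s1 <- 0.2; s2 <- 0.6                     # two error standard deviations (scenario 3)

# Covariates: multivariate normal with mean zero and covariance matrix Sigma
Sigma <- matrix(rho, p + q, p + q); diag(Sigma) <- 1
X <- mvrnorm(n, mu = rep(0, p + q), Sigma = Sigma)

beta <- c(rep(1, p), rep(0, q))          # only the first p covariates are relevant

# Scenario 2: AR(1) errors, e_t = phi * e_{t-1} + u_t
u <- rnorm(n)
e_ar <- as.numeric(stats::filter(u, phi, method = "recursive"))

# Scenario 3: heteroscedastic errors with two variance regimes (n/2 each)
e_het <- c(rnorm(n / 2, sd = s1), rnorm(n / 2, sd = s2))

y_collin  <- X %*% beta + rnorm(n)       # scenario 1: collinearity enters through X
y_autocor <- X %*% beta + e_ar
y_hetero  <- X %*% beta + e_het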

To evaluate performance, the authors of [1] used potency and gauge to assess which model performs best in feature selection; we therefore follow the same criteria for model selection. The entire process is replicated 1,000 times. The comparison of the regularization techniques and Autometrics is assessed in terms of the retention of irrelevant variables, namely gauge, and the retention of relevant variables, namely potency [1]. For both the simulation and the empirical analysis, we use the R software.
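
Potency and gauge can be computed from each replication’s selection outcome as the retention rates of the relevant and irrelevant variables, respectively; a small self-contained sketch (variable names are illustrative):

# potency = share of relevant variables retained;
# gauge   = share of irrelevant variables retained.
potency_gauge <- function(selected, relevant) {
  c(potency = mean(selected[relevant]),
    gauge   = mean(selected[!relevant]))
}

# Toy example: 3 relevant and 5 irrelevant covariates in one replication
relevant <- c(rep(TRUE, 3), rep(FALSE, 5))
selected <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
potency_gauge(selected, relevant)   # potency = 2/3, gauge = 0.2
# In the study, these rates are averaged over the 1,000 replications.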

3.2. Simulation Results

The Monte Carlo simulation results are described in Tables 1–3.

Table 1 depicts the simulation findings for low, moderate, and high multicollinearity across different combinations of observations (n) and covariates. The performance of all methods improves with increasing sample size:
(1) In the case of low and moderate multicollinearity, the potency associated with all methods is one under most simulated scenarios, clearly revealing that they retain all the relevant variables. Increasing the level of multicollinearity tends to improve the performance of AEnet and Autometrics, in the sense that they hold fewer irrelevant variables, but it adversely affects the performance of MCP, particularly in small and moderate samples. Across low and moderate multicollinearity, the gauge associated with AEnet is lower than that of the other methods, showing that it retains fewer irrelevant covariates. Comparatively, E-SCAD retains more irrelevant variables and thus overspecifies the true model.
(2) In the case of high collinearity, high collinearity among variables substantially distorts the performance of MCP and Autometrics in terms of potency and gauge. AEnet retains more than 93 percent of the correct variables with an outstanding gauge (zero percent). The potency and gauge of the other methods tend to increase with increasing sample size, with MCP and E-SCAD in particular significantly overspecifying the true model (retaining more irrelevant variables). AEnet shows an outstanding performance in terms of gauge. In large samples, an improvement in the E-SCAD gauge is achieved relative to the cases of low and moderate multicollinearity.

Table 2 presents the simulation results for varying degrees of heteroscedasticity along with the sample size and the number of covariates (both relevant and irrelevant):
(1) In the case of heteroscedasticity, the potency of all included methods is one in almost all scenarios, clearly showing that they hold all the active covariates. In contrast, the gauges of AEnet and Autometrics show that they avoid the irrelevant variables and identify the true model very precisely. A higher level of autocorrelation adversely affects the potency of Autometrics relative to the rival methods. The results suggest that MCP drops the inactive variables, particularly when the sample size is increased, whereas E-SCAD considerably overspecifies the model. Increasing the number of covariates tends to affect the gauges associated with Autometrics and AEnet.

Table 3 portrays the simulation output for varying autocorrelation, sample size, and number of covariates (both active and inactive). Low (0.25), moderate (0.5), and high (0.9) autocorrelation are considered here:
(1) In the case of low and moderate autocorrelation, under most simulated schemes the methods find all the right variables, but E-SCAD and MCP retain a huge set of irrelevant variables and overspecify the model. In contrast, AEnet and Autometrics provide the best results under almost all combinations of n and p; in other words, they avoid the irrelevant variables and specify the true model very well. Increasing the number of covariates improves the E-SCAD gauge but negatively affects the gauges of Autometrics and AEnet.
(2) In the case of high autocorrelation, compared with the rival methods, E-SCAD shows good performance in selecting relevant variables in small samples. However, the same method collapses in terms of gauge. Conversely, Autometrics and AEnet perform better in gauge and often hold less than 5 percent inactive variables. Expanding the covariate window adversely affects the performance of AEnet and Autometrics in terms of gauge.

4. Real Data Implications

After the Monte Carlo experiments, this study performs a real data analysis using Huge Big Data. We consider workers’ remittance inflows and all their possible determinants. Many factors affect workers’ remittance inflows: some covariates are recommended by economic theory for inclusion in the model, and a long list of additional variables has been recommended by past studies. This study considers all the possible determinants based on economic theory and the literature to build a general model. In the econometrics literature, such a model is known as the general unrestricted model (GUM).

4.1. Data Source

This study collects yearly data for Pakistan from 1972 to 2020 from different sources, namely the World Development Indicators (WDI), International Financial Statistics (IFS), the International Country Risk Guide, and the State Bank of Pakistan. The few missing observations in the data set are replaced by averaging the neighboring observations. Most variables are transformed into logarithmic form to bring their distributions closer to normality. Details regarding the variables are given in Table 4, which describes the variables, their symbols, the definition of each variable, and the data source.

4.2. Correlation Matrix

In Figure 4, blue and red colors exhibit positive and negative correlations between the variables, respectively. The color intensity and the area of the circles indicate the strength of the pairwise correlation, and the legend on the right side of the correlogram maps the colors to the pairwise correlation values. We can observe numerous intense blue and red circles, which is evidence of high pairwise correlation.
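
A correlogram of this kind can be produced in R with, for example, the corrplot package (an illustrative choice; the toy matrix below stands in for the matrix of the twenty-seven determinants):

library(corrplot)

set.seed(1)
toy <- matrix(rnorm(49 * 5), 49, 5)                    # placeholder for the determinants matrix
colnames(toy) <- paste0("det", 1:5)
corr_mat <- cor(toy, use = "pairwise.complete.obs")
corrplot(corr_mat, method = "circle", type = "upper")  # circle size/intensity = correlation strength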

Figure 4 shows that there exists high multicollinearity among the predictors over the sample period 1972 to 2020. We noted in the Monte Carlo simulations that, in the case of high multicollinearity, AEnet outperformed its rival counterparts in terms of potency and gauge, especially when the sample size is small. This reveals that AEnet is more robust in such circumstances, and thus we should rely on the AEnet output.

We performed diagnostic tests and found that the residuals of the estimated model are homoscedastic and uncorrelated. Table 5 reports the feature selection based on the real data using the classical and shrinkage methods. In Table 5, AEnet selects 13 important determinants of workers’ remittance out of the 27 candidates. In contrast, MCP and E-SCAD retain many unrelated determinants of workers’ remittance; in other words, they overspecify the model. Apart from this, Autometrics keeps the smallest number of irrelevant variables. Selecting an irrelevant set of covariates leads to poor forecasting.

In contrast, the right set of covariates can improve forecasting, leading to lower forecast errors. Consequently, an accurate forecast can help the government and other sectors in decision-making. To summarize, the empirical application strongly supports the findings of the simulation exercise.

5. Conclusion and Recommendations

This study compares Autometrics with three machine learning techniques, namely the Minimax Concave Penalty (MCP), Elastic Smoothly Clipped Absolute Deviation (E-SCAD), and Adaptive Elastic net (AEnet), under different scenarios (multicollinearity, heteroscedasticity, and autocorrelation) with varying sample sizes and numbers of covariates. We conducted Monte Carlo experiments to compare all methods in terms of variable selection using potency and gauge. All methods improve their performance as the sample size expands. In the cases of low and moderate multicollinearity as well as low and moderate autocorrelation, the techniques retain all relevant predictors. However, for low and moderate multicollinearity, all methods except AEnet also keep many irrelevant predictors, whereas under low and moderate autocorrelation, Autometrics, along with AEnet, retains few irrelevant predictors. In the presence of extreme multicollinearity, AEnet retains more than 93 percent of the correct variables. The potency of the remaining techniques, specifically MCP and E-SCAD, tends towards unity with increasing sample size, but they also capture a massive number of irrelevant predictors. At a higher level of autocorrelation, E-SCAD shows good performance in the selection of relevant variables in small samples; however, the same method collapses in terms of gauge. Autometrics and AEnet perform better in gauge and often hold less than 5 percent irrelevant variables. In the presence of heteroscedasticity, all techniques typically hold all relevant variables but also suffer from overspecification, except AEnet and Autometrics, which avoid the irrelevant predictors and identify the true model precisely.

On the application side, we take the workers’ remittance data along with its twenty-seven determinants spanning 1972 to 2020. AEnet keeps thirteen predictors of workers’ remittance, whereas MCP and E-SCAD select many irrelevant determinants and consequently overspecify the model. This study offers several recommendations:
(i) In the case of low or moderate multicollinearity with a small sample size, practitioners and policymakers can use E-SCAD, provided there are few irrelevant covariates. Apart from this case, AEnet is recommended in the presence of multicollinearity, particularly when the covariates are highly correlated with each other.
(ii) The study recommends AEnet when the residuals are heteroscedastic.
(iii) In the presence of autocorrelation, if there are many active variables and few inactive variables, researchers should adopt E-SCAD; if the scenario is the converse, they should use AEnet or Autometrics.
(iv) In the case of Pakistan, AEnet showed remarkable performance in selecting the relevant variables. Hence, policymakers and practitioners should focus on the relevant variables selected by AEnet to improve workers’ remittances in Pakistan. In this regard, the Pakistani government has devised policies that make it easy to transfer remittances legally and that mitigate the cost of transferring remittances from abroad. The AEnet approach can help policymakers identify the relevant variables in the presence of a huge set of covariates, which in turn produces accurate predictions.

Appendix

Table 4 describes the variables, symbols, definition of each variable, and source of data.

Data Availability

Data can be shared upon request to the corresponding author.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Conceptualization was proposed by Faridoon Khan, Amena Urooj, and Saud Ahmed Khan. Methodology and formal analysis were performed by Faridoon Khan. The original draft was written by Faridoon Khan. Supervision and cosupervision were done by Amena Urooj and Saud Ahmed Khan. Software was provided by Faridoon Khan, Sara Muhammadullah, and Zahra Almaspoor. Investigations were conducted by Amena Urooj and Saud Ahmed Khan. Review and editing were done by Faridoon Khan, Zahra Almaspoor, and Saima K. Khosa.