#### Abstract

Credit scoring analysis has gained tremendous importance among researchers and the financial industry around the globe. It helps financial institutions grant credit or loans to deserving applicants with minimal risk. However, developing an accurate and effective credit scoring model is challenging because of class imbalance and the presence of irrelevant features. Recent research shows that ensemble learning has achieved superior performance in this field. In this paper, we perform an extensive comparative analysis of ensemble algorithms; to further improve their performance, oversampling and feature selection (FS) techniques are implemented. The relevant features are identified using three FS techniques, namely, information gain (IG), principal component analysis (PCA), and genetic algorithm (GA). Additionally, a comparative performance analysis is performed using 5 base and 14 ensemble models on three credit scoring datasets. The experimental results show that the GA-based FS technique and the CatBoost algorithm perform significantly better than the other models in terms of five metrics, i.e., accuracy (ACC), area under the curve (AUC), F1-score, Brier score (BS), and Kolmogorov-Smirnov statistic (KS).

#### 1. Introduction

Credit risk assessment is one of the most sensitive issues in the financial industry, as it identifies the position of the potential borrower. Different types of risks are associated with the banking industry, which may affect banks' business and their customers. Credit risk is one of the major risks that banks face; credit scoring helps them make the crucial decision of whether or not to lend to an applicant. Banks assess the creditworthiness of applicants before granting loans by implementing credit scoring models. Thus, developing an effective credit scoring model has become an important goal for researchers and the financial industry, so that risky customers can be precisely distinguished from nonrisky ones [1].

Credit scoring is often treated as a binary classification problem, in that it determines whether new credit applicants are “good” or “bad” by comparing their socioeconomic attributes. The models were initially developed using statistical methods, of which discriminant analysis and logistic regression (LR) [2] are the most common. AI-based credit scoring models have been proposed in recent decades to improve accuracy and minimize error rates. Some commonly used machine learning (ML) techniques are decision tree (DT) [3], k-nearest neighbors (KNN) [4], support vector machine (SVM), and Naïve Bayes (NB) [5]. AI-based models have gained more popularity than statistical models because they achieve high accuracy levels [6], can easily handle nonlinear classification problems [7], and effectively handle high-dimensional datasets [6–8]. However, credit scoring models are not always fully machine-dependent. In semiautomated systems, banks process applications in two steps before approving a loan. First, the application must be approved by a financial analyst (or expert), and then the approved applications are processed through computational models [9]. In these systems, loans are approved based on the combined effect of expert knowledge and ML techniques. In automated credit scoring systems, by contrast, applications are approved based solely on the decisions made by the ML techniques. Semiautomated credit scoring models are very rarely proposed in the literature [10, 11]. In this study, only fully automated credit scoring models are discussed.

A single machine learning algorithm may not provide the best results in every case. In recent works [12, 13], ensemble models have been developed for credit scoring problems, resulting in more advanced and accurate models than single classification methods. The performance of ensemble models improves by compensating for the limitations of the base learners [14]. The performance can be further enhanced by applying preprocessing mechanisms such as feature selection (FS) and resampling of the instances. Several studies [15, 16] have applied FS techniques in their credit scoring models to reduce the high-dimensional feature space and improve the overall performance of the model. In [15], five traditional FS techniques, namely, t-test, correlation matrix, stepwise regression, PCA, and factor analysis, are used to build a bankruptcy prediction model, and their performances are analyzed using MLP neural networks. In [16], multiple FS techniques, namely, LDA, rough set theory, DT, and F1-score, are used to build the credit scoring model, and their performances are examined through an SVM classifier. To improve the accuracy and stability of the credit scoring model, GA and ANN are used to select the optimal features [17].

The lack of a balanced dataset, i.e., a dataset with equally populated tuples for each class, creates a problem for the intended classifier. Skewness towards the positive class requires extra caution from the model designer to make its impact as negligible as possible [18]. Imbalanced data is a common problem in credit scoring datasets, where the number of “bad” customers is much smaller than that of “good” customers. This biases the classifier towards the majority class and leads to huge financial losses when the classifier incorrectly predicts a bad customer as good. The synthetic minority oversampling technique (SMOTE) is one of the most widely used resampling techniques for imbalanced datasets, and it improves performance by oversampling the minority class samples [19].

From the above studies, it is understood that credit scoring models are designed considering three factors, i.e., (1) ensemble methods, (2) resampling, and (3) FS techniques. Most researchers have implemented either one factor or a combination of any two factors to build their models. Table 1 lists the studies related to credit scoring models: five papers have combined ensemble and resampling techniques, and four papers have combined FS and ensemble techniques. However, to the best of our knowledge, none of these papers has implemented all three factors in its model. To fill this research gap, this paper proposes a credit scoring model that considers all three factors, each in a different phase. In the resampling phase, the SMOTE oversampling method is applied to tackle the imbalanced dataset. In the FS phase, three FS techniques, namely, IG, PCA, and GA, are employed to identify the informative features, which helps to reduce the models’ dimensionality and complexity. In each phase, 19 classifiers (5 base classifiers and 14 ensembles) are used for model building. The baseline classification algorithms (i.e., LR, SVM, DT, NB, and KNN) and the ensembles of classifiers (i.e., bagging, boosting, and tree-based) are evaluated on three publicly available credit scoring datasets, i.e., Australian, German, and Japanese. The predictive performances of the credit scoring models are evaluated using five metrics: ACC, AUC, F1-score, BS, and KS. Additionally, the ranks of the models are obtained using the Friedman and Nemenyi post hoc statistical tests [20]. The classifiers are compared through a set of experiments in terms of the above metrics.

In brief, the contributions of this paper are as follows:

(1) It implements all three approaches, i.e., resampling, FS, and ensemble methods, on three credit scoring datasets.

(2) It provides experimental results for 5 base classifiers and 14 ensembles of classifiers in terms of ACC, AUC, F1-score, BS, and KS.

(3) It implements three FS techniques, i.e., IG, PCA, and GA, to select the relevant features.

(4) It carries out five sets of experiments to identify the combination of machine learning algorithm and FS technique that is able to build an accurate and reliable credit scoring model.

The main objective of this work is to identify the best combination of FS technique and machine learning algorithms to build an accurate and reliable ensemble-based credit scoring model.

The remaining part of the paper is outlined as follows: Section 2 presents the literature survey of related works, ensemble, and feature selection techniques for the credit scoring model. In Section 3, the proposed methodology with preprocessing techniques is discussed. Experimental setup, evaluation metrics, and statistical tests are given in Section 4. In Section 5, results are discussed, and finally, we draw conclusions and future work in Section 6.

#### 2. Literature Review

In recent decades, most credit scoring models have been proposed using an ensemble of classifiers due to their superior performance. In this section, the credit scoring and its related works, ensemble learning techniques in credit scoring, and credit scoring models with and without feature selection techniques are reviewed.

##### 2.1. Credit Scoring and Its Related Works

The credit scoring model acts as a decision-making system for banks; it helps them decide whether or not to approve a loan for an applicant. The models are designed using different methods, such as judgmental methods, statistical methods, rule-based methods, reject inference methods, profit-based methods, and machine learning methods. In the beginning, because data science methodology was not available, the judgmental approach was followed: an expert team approved loans by reviewing the application forms. In certain situations, accurately estimating the risk may be challenging even for experienced professionals [24]. With the development of technology, many effective statistical credit scoring models have been proposed. Statistical techniques such as LR and LDA have proved to be superior to the traditional expertise-based models [24]. These methods can determine the linear relationship between the attributes and the class variable. However, they cannot analyze the nonlinear mappings between the variables of the credit scoring data [36]. In rule-based credit scoring models, rule extraction algorithms are combined with machine learning techniques to predict the creditworthiness of applicants. The main advantage of these models is that they can easily identify patterns in complex problems, and rules can be easily extracted from these patterns. However, it is difficult to apply these rules to high-dimensional credit scoring problems [9]. Traditionally, credit scoring models are designed using data that contain the records of accepted applicants only. Such models have a selection bias, since they are trained only on accepted applicants and not on rejected applicants [37].
In reject inference credit scoring methods, the models are trained using both labeled data (accepted applicants) and unlabelled data (rejected applicants) [38], so that they can classify all types of loan applications. In profit-based credit scoring methods, the model aims to maximize the profit from granting loans by maximizing the benefits and minimizing the losses due to bad credit [39]. In the last two decades, credit scoring models built with ML techniques have become more popular and are effectively used to estimate the probability of default. ML methods can automatically extract the relevant information from the instances and build advanced credit scoring models. It has been observed that ML algorithms achieve significantly better results than statistical methods [40]. However, these methods have some limitations: (1) hyperparameter tuning is required, (2) they can get stuck at local minima, (3) they may overfit, and (4) they are computationally more expensive to train.

Different ML techniques have been employed to build credit scoring models. SVM separates the class samples by an optimal hyperplane and thereby significantly increases the performance of the models [41]. A probabilistic Gaussian algorithm was proposed to build a credit scoring model that gives better accuracy than LR and SVM [42]. However, financial institutions are not able to correctly identify defaulters by employing a single classification algorithm. As a result, to mitigate default risk, researchers have proposed highly predictive models based on ensemble methods [34, 35].

In [43], a set of experiments was conducted, which determined that an ensemble model performs better than a single classifier. In some ensemble methods, such as AdaBoost, gradient boosting decision tree (GBDT), and extreme gradient boosting (XGBoost), DTs are used as the base classifier [44, 45]. According to [46], loan defaulters can be predicted better with random forest (RF) than with LR and KNN. Similarly, RF outperforms other traditional classifiers, such as SVM, KNN, and LR, in predicting the best borrowers in peer-to-peer lending [46]. RF is an ensemble algorithm that achieves good accuracy, avoids overfitting, trains quickly, and, above all, effectively handles outliers and noise [27].

Credit scoring datasets may include unimportant or redundant features that increase the training time and reduce the algorithms’ performance. FS techniques help to minimize the complexity, reduce the training time, and improve the accuracy of the algorithms by selecting the informative features from the datasets [47]. Each FS technique has its pros and cons. The filter method selects features based on a ranking criterion; the top-ranked features are then used in the classification process without considering their impact on the classifier’s performance. The wrapper method selects the optimal feature subset according to the classifier’s performance; it yields the best feature subset but is computationally more expensive. The hybrid method combines both methods and reflects the advantages of each; i.e., it achieves high classification accuracy and is computationally less expensive [48]. In [49], two FS algorithms were proposed using a set of ML algorithms, namely, LR, RF, SVM, MARS, XGBoost, and deep neural networks (DNN). XGBoost and DNN, combined with the RF-based new approach (NAP) FS method, achieve high ACC and AUC, respectively. In [40], five traditional FS techniques, namely, t-test, correlation matrix, stepwise regression, PCA, and factor analysis, are used to build a bankruptcy prediction model, and their performances are analyzed using MLP neural networks. In [15], multiple FS techniques, namely, LDA, rough set theory, DT, and F1-score, are used to build the credit scoring model, and their performances are examined through an SVM classifier.

##### 2.2. Ensemble Learning Techniques in Credit Scoring

Improving the performance level is one of the biggest issues in ensemble models. Ensemble models can be implemented using variants of a single base learner (called a homogeneous ensemble) or by combining different base learners (called a heterogeneous ensemble). By applying multiple algorithms, the ensemble model achieves better results than each individual base algorithm. It is widely accepted that the diversity and the performance of the base learners are the two key factors of ensemble models, and improving the generalization and robustness of an ensemble requires focusing on both. It has been observed that diversified base learners enhance the performance of the ensemble model [27]. To create diversity, the base learners are trained on different data subsets. Bagging [50] and boosting [26, 51–53] are the two common approaches for generating diverse members. Building an ensemble model involves two stages, namely, creating diverse base learners and combining their outputs. The outputs can be combined using techniques such as majority voting, weighted averaging, performance weighting, and stacking [6, 47].
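
As a minimal illustration of the two stages described above, the following Python sketch draws a bootstrap sample (one way to obtain different data subsets, as in bagging) and combines class predictions by majority voting. The function names, the applicant identifiers, and the toy predictions are illustrative assumptions and are not taken from the original study.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement (bagging-style subset)."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Return the most frequent class label among the base learners' outputs."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)                        # fixed seed for reproducibility
applicants = ["a1", "a2", "a3", "a4", "a5"]   # hypothetical training identifiers
subset = bootstrap_sample(applicants, rng)

# Hypothetical outputs of three base learners for a single applicant.
combined = majority_vote(["good", "bad", "good"])
```

Because each base learner sees a different bootstrap sample, their errors tend to differ, which is the diversity that the voting step relies on.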

##### 2.3. Credit Scoring Models with and without Feature Selection Techniques

Datasets may include irrelevant or redundant features, which increase the complexity of the training process and reduce the performance of the model. Feature selection (FS) techniques help to reduce the complexity of the problem by eliminating irrelevant features and also help to increase the predictive capability of the model [15, 54]. In [30], a hybrid FS technique called HMPGA was proposed, in which three feature subsets are shortlisted using three different filter methods, namely, IG ratio, F1-score, and Pearson’s correlation. Then, the optimal feature subset is finalized using a wrapper method called MPGA (multiple population genetic algorithm). Similarly, another hybrid FS method for credit scoring problems, IGDFS, was proposed in [29]. It uses the IG filter method to select feature subsets, and the best subset is then selected using GA. FS techniques help to build models from variables with complicated nonlinear relationships without relying on model assumptions [26]. They also help to identify the relationships between independent and dependent variables and reduce the training time of the models, especially for large datasets.

In the literature, several credit scoring models have been proposed without implementing FS techniques. A set of ML algorithms, such as LR, classification and regression trees (CART), ANN, and SVM, has been implemented in these models. In [55], a three-layered neural network model is proposed using a back-propagation learning algorithm to predict whether to grant or reject a loan application in an automated processing system. A hybrid credit scoring model was proposed in [56] by combining genetic programming with a deep learning network: genetic programming was applied to extract rules, and the deep learning network was used to build the credit scoring model. In [57], a few open issues in credit scoring were identified, such as correctly setting the cut-offs for classifying good/bad borrowers, dealing with imbalanced credit datasets, and the implementation of ensemble methods. A profit-based credit scoring model using reinforcement learning is proposed in [58] to determine the optimal threshold value. Similarly, to address imbalanced credit scoring problems, an improved SMOTE algorithm using AdaBoost and a deep learning technique is proposed in [35].

#### 3. Proposed Methodology

In the proposed methodology, three methods are integrated, i.e., FS, resampling of minority class instances using SMOTE, and an ensemble of learners to build an effective credit scoring model. The flow of the proposed work is illustrated in Figure 1. It consists of four phases: (1) data preprocessing, (2) resampling using SMOTE, (3) feature selection, and (4) model generation. In the subsequent sections, the details of each phase are discussed.

##### 3.1. Data Preprocessing

In the first phase, some necessary steps are employed to preprocess the dataset. Data preprocessing helps to enhance the accuracy and efficacy of the classifier. Each dataset undergoes different methods to make it more appropriate for the classification process. In this phase, data cleaning and normalization techniques are implemented. In data cleaning, missing values are handled, which is a common issue in many real-world problems. Data imputation fills the missing values of an attribute by using the existing information. In mean imputation, the missing values are replaced by the mean of the existing values of that attribute.

In most cases, the ranges of the features differ widely. Features with a larger range of values exert more influence on the classification model. Data normalization is another preprocessing technique in which the feature values are transformed into a common range so that all features have equal influence. Normally, all feature values are scaled to the range [0, 1], which is expressed as

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$$

where $x$ is the original feature value, $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of that feature, and $x'$ is the normalized value.
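
The two preprocessing steps can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: missing values are represented as `None`, each feature is handled as a plain list, and the `income` column is a made-up example rather than data from the study.

```python
from statistics import mean


def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def min_max_scale(column):
    """Scale values linearly into [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


income = [20.0, None, 40.0, 60.0]   # hypothetical feature with one missing value
imputed = impute_mean(income)       # [20.0, 40.0, 40.0, 60.0]
scaled = min_max_scale(imputed)     # [0.0, 0.5, 0.5, 1.0]
```

In a cross-validated experiment, the mean, minimum, and maximum would normally be computed on the training folds only and then applied to the test fold.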

##### 3.2. Resampling

This paper implements SMOTE to address the class imbalance before modeling. It oversamples the minority class subset by generating a specific number of artificial minority class samples. For each instance $x_i$ belonging to the minority class subset $S_{\min}$, its k-nearest neighbors within $S_{\min}$ are identified using Euclidean distances. Then, a random sample $\hat{x}_i$ is selected from these neighbors, and finally, along the line segment between $x_i$ and $\hat{x}_i$, a new minority sample $x_{\text{new}}$ is obtained using the following equation:

$$x_{\text{new}} = x_i + \text{rand}(0, 1) \times (\hat{x}_i - x_i),$$

where $\text{rand}(0, 1)$ generates random numbers between 0 and 1. The new sample $x_{\text{new}}$ is added to $S_{\min}$, and the process is repeated until the imbalance ratio (IR) reaches 1.
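
The interpolation step can be sketched as follows. This is a simplified illustration, not the implementation used in the experiments: the base instance is drawn at random for each synthetic sample (rather than iterating over every minority instance), the minority set is assumed to contain at least two points, and the function name `smote`, its parameters, and the toy points are assumptions.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x by Euclidean distance (x excluded).
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        x_hat = rng.choice(neighbours)
        gap = rng.random()  # random number in [0, 1)
        # New point on the line segment between x and x_hat.
        synthetic.append([a + gap * (b - a) for a, b in zip(x, x_hat)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy minority class
new_samples = smote(minority, n_new=6, k=2)
```

To reach an IR of 1, `n_new` would be set to the difference between the majority and minority class counts.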

Additionally, three different oversampling methods are used in the experiment, namely, random oversampling (ROS), adaptive synthetic sampling (ADASYN) [59], and Borderline-SMOTE [60]. ROS randomly replicates the minority class samples to make its number equal to the number of majority class samples. ADASYN assigns different weights to the minority class samples, and more weights are assigned to the samples that are hard to classify. More synthetic samples are generated for instances having higher weights. But, in the case of Borderline-SMOTE, the samples that lie near the borderline are assumed to have more impact on classification. These samples are used to generate synthetic samples through interpolation.

##### 3.3. Feature Selection

In the third stage, three FS techniques, namely, IG, GA, and PCA, are applied to select the relevant features from the datasets. These techniques choose valuable feature subsets, which helps to achieve better model performance in the subsequent stages. FS not only improves model efficiency but also reduces the complexity and running time of the algorithms. The three FS techniques are discussed in the following subsections.

##### 3.4. Information Gain (IG)

IG adopts a feature ranking principle to find the features that are most strongly related to the class variable. The features with high IG are selected to enhance the classification performance of the model [54]. The IG of a feature is computed by evaluating the overall reduction in entropy. Entropy quantifies the expected amount of information needed to classify an instance. Let $X$ be an input feature vector and $Y$ be the corresponding class variable; the entropy of $Y$ is computed from the probability distribution over each class $y_i$, which is expressed as

$$H(Y) = -\sum_{i} p(y_i) \log_2 p(y_i),$$

where $p(y_i)$ is the probability of $Y$ belonging to class $y_i$. Now considering the feature vector $X$, the entropy is defined as

$$H(Y \mid X) = -\sum_{j} p(x_j) \sum_{i} p(y_i \mid x_j) \log_2 p(y_i \mid x_j),$$

where $x_j$ ranges over the values of $X$.

Finally, the IG of feature vector $X$ is defined as

$$IG(X) = H(Y) - H(Y \mid X).$$
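
A small worked example may make the entropy reduction concrete. The sketch below computes IG for a discrete feature; the function names and the toy `housing` and `risk` values are assumptions made for illustration, and continuous features would first need to be discretized.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


risk = ["good", "good", "bad", "bad"]
housing = ["own", "own", "rent", "rent"]   # separates the classes perfectly
noise = ["a", "b", "a", "b"]               # carries no class information

ig_housing = information_gain(housing, risk)
ig_noise = information_gain(noise, risk)
```

Here `housing` removes all uncertainty about the class (IG of 1 bit), whereas `noise` removes none (IG of 0), so a ranking by IG would keep the former and discard the latter.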

##### 3.5. Genetic Algorithm (GA)

GA [61] is an evolutionary heuristic search algorithm in which selection, crossover, and mutation operators are used to find the optimal feature subset. Each chromosome represents a feature subset as a binary string: a feature is selected if its corresponding bit is 1; otherwise, it is discarded. Each feature subset is evaluated using a fitness score, and subsets with higher fitness scores are preferred. According to [62], if a single metric is used to evaluate the performance of the classification algorithm, AUC is a more appropriate choice for the fitness function used to evaluate each individual of the population.
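
The search loop can be sketched as follows. This is a minimal sketch under stated assumptions rather than the configuration used in the study: tournament selection, one-point crossover, bit-flip mutation, and all default parameter values are illustrative choices, and `toy_fitness` is a hypothetical stand-in for a fitness function that would, in practice, return the cross-validated AUC of a classifier trained on the selected features.

```python
import random


def ga_feature_selection(n_features, fitness, pop_size=20, generations=30,
                         crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Search for a binary feature mask with a simple genetic algorithm."""
    rng = random.Random(seed)

    def tournament(population):
        # Selection: pick two chromosomes at random and keep the fitter one.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            p1, p2 = tournament(population), tournament(population)
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_features - 1)   # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Mutation: flip each bit with a small probability.
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
        best = max(population + [best], key=fitness)   # keep the best so far
    return best


RELEVANT = {0, 2, 5}   # hypothetical indices of the informative features


def toy_fitness(mask):
    """Reward selecting relevant features; penalise the subset size."""
    return sum(mask[i] for i in RELEVANT) - 0.1 * sum(mask)


best_mask = ga_feature_selection(n_features=8, fitness=toy_fitness)
selected = [i for i, bit in enumerate(best_mask) if bit == 1]
```

Since each fitness evaluation in a wrapper setting requires training a classifier, the population size and the number of generations largely determine the running time.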

##### 3.6. Principal Component Analysis (PCA)

PCA is a feature transformation technique used to transform a high-dimensional feature vector in ℝ^{d} into a lower-dimensional feature vector in ℝ^{p} [63], where p < d. The resultant features are the principal components, which are obtained from the eigenvalues and eigenvectors of the covariance matrix of the feature vector. These orthogonal principal components are used to identify the correlated essential features.
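
To illustrate the role of the covariance matrix, the sketch below extracts the first principal component of a small two-feature dataset. It uses power iteration to find the leading eigenvector, which is a choice made here to keep the example dependency-free; a full eigendecomposition would normally be used to obtain all p components. The function names and the toy data are assumptions.

```python
import math


def covariance_matrix(rows):
    """Return the sample covariance matrix and the mean-centred data."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return cov, centered


def first_principal_component(cov, iterations=200):
    """Leading eigenvalue and eigenvector of cov via power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
    eigenvalue = sum(x * y for x, y in zip(v, cov_v))
    return eigenvalue, v


# Two strongly correlated toy features (d = 2), reduced to p = 1.
data = [[2.0, 1.0], [3.0, 2.5], [4.0, 2.9], [5.0, 4.2]]
cov, centered = covariance_matrix(data)
eigenvalue, direction = first_principal_component(cov)
scores = [sum(c * v for c, v in zip(row, direction)) for row in centered]
```

The share of total variance retained by this component is `eigenvalue` divided by the trace of `cov`, which is a common criterion for choosing p.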

##### 3.7. Model Generation

In the final stage, the models are generated by applying a set of base and ensemble of classifiers. In the subsequent sections, the details of each classifier are discussed.

###### 3.7.1. Baseline Classifiers

(1) Logistic regression (LR): LR is the most traditional statistical method and is widely used in credit scoring problems [43]. For binary classification problems, it transforms a continuous value in (−∞, +∞) into a value between 0 and 1. For credit scoring problems, LR can be used to estimate the probability of a customer’s default using the logistic transformation function.

(2) Support vector machine (SVM): SVM separates the instances of the two classes by an optimal hyperplane such that the data points of the two classes are separated by the maximal distance [64]. SVM can also classify nonlinear data points accurately. It handles such classification by transforming the data points into a high-dimensional space using different kernel functions, such as linear, polynomial, and Gaussian radial basis function (RBF) kernels.

(3) Naïve Bayes (NB): NB is a probabilistic ML algorithm that uses Bayes’ theorem and performs well for high-dimensional input feature vectors [65]. It applies Bayes’ rule to predict whether the applicant is eligible for a loan or not: the instance is assigned to the class with the highest posterior probability.

(4) Decision tree (DT): DT is very popular and easy to interpret because its graphical structure resembles human reasoning [66]. In a DT, the attributes are represented as nodes, the branches split the instances into smaller subsets, and the terminal nodes represent the class labels. The tree is built using the training instances, and the class label of each test sample can be easily predicted from its structure. This study employs a classification and regression tree (CART) to build the credit scoring model. It evaluates the possible splits of each attribute to build an optimized model.

(5) K-nearest neighbor (KNN): KNN is a nonparametric classifier that does not need to train a model [4]. The classifier predicts new instances based on the k-nearest training instances. For each test observation, its k-nearest training samples are identified, and the class outcome is predicted as the majority class of these k-nearest neighbors.
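
Of the baseline classifiers above, KNN is the simplest to express directly. The following sketch shows the neighbor search and majority vote; the function name, the choice of k = 3, and the two-feature toy applicants are illustrative assumptions.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Predict the majority class among the k nearest training samples."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy applicants described by two already-normalised features.
train_X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
           [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
train_y = ["good", "good", "good", "bad", "bad", "bad"]

pred_a = knn_predict(train_X, train_y, [0.2, 0.1])   # near the "good" cluster
pred_b = knn_predict(train_X, train_y, [5.5, 5.5])   # near the "bad" cluster
```

Because the prediction depends on Euclidean distances, the features should be scaled to a common range first; otherwise, features with large ranges dominate the neighbor search.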

###### 3.7.2. Ensemble of Classifiers

Some of the widely used ensemble techniques are boosting, bagging, and tree-based ensembles.

(1) Boosting: boosting operates sequentially by training a set of weak learners (normally shallow DTs). Initially, a weak learner is trained on the training set with weights assigned to all samples. In every iteration, the instances that were misclassified in the current step are given more weight in the next iteration so that they are more likely to be classified correctly [67]. All the instances with their respective weights are used to train the next weak classifier. More weight is assigned to classifiers that perform well. By repeating the process, classification performance increases as more weak learners are added. Finally, the resultant model is generated as a weighted linear combination of the classifiers.

(2) Bagging: bagging is another ensemble approach in which different training subsets are generated using the bootstrap aggregation method [50]. In this approach, K different classifiers are trained on training subsets selected randomly with replacement. Each classifier predicts the output for each input vector, and the final output is generated using majority voting. Random forest (RF) implements the bagging approach: a set of DTs is constructed by selecting the training subsets using bootstrap samples and randomly selecting the features. At each round, a different tree is built, and each tree predicts the output for a given input pattern. The outputs predicted by the trees are aggregated, and the final output is generated by majority voting [68].

(3) Tree-based ensembles (TBE): different tree-based ensemble methods, such as gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost) [69], light gradient boosting (LGBM) [70], and categorical boosting (CatBoost) [71], are used in this study. GBDT is a boosting algorithm in which a set of weak decision trees is combined to build a strong ensemble model [67]. In this case, the model is updated along the negative gradient of the loss until the algorithm converges [68]. XGBoost is an improved version of GBDT that is designed to minimize the computational cost and, above all, to increase model efficiency for both classification and regression problems. Like GBDT, XGBoost can control overfitting through the learning rate, the number of boosting rounds, the maximum tree depth, and subsampling [67]. Compared with GBDT, XGBoost increases the efficiency of the model by optimizing the objective function.

The high performance of LGBM is due to its use of “best-first” tree growth and histogram-based decision trees. XGBoost grows its trees depth-wise, whereas LGBM adopts the “best-first” strategy, growing the tree leaf-wise and limiting the depth-wise growth. Best-first growth helps to minimize the loss function rapidly but may lead to overfitting; LGBM counters this by limiting the tree depth and the splitting of nodes.

CatBoost is another powerful GBDT-based algorithm that relies on two advanced techniques, i.e., ordered boosting and a dedicated method for handling categorical features. Ordered boosting helps to reduce biased gradient estimates and overfitting. Categorical features are usually present in credit scoring datasets. Normally, one-hot encoding is used to transform categorical attributes into numeric values, but it may lead to overfitting. CatBoost addresses this by converting categorical features to numeric values during training, at each boosting step, rather than in a separate preprocessing step.
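
The reweighting idea described for boosting can be made concrete with a single round of an AdaBoost-style update. The text above describes boosting generically, so the specific formulas used here (the classifier weight as half the log-odds of the weighted error, and exponential sample reweighting) are the standard AdaBoost choices and are an assumption about which variant is meant; the function name and the four-sample example are also illustrative.

```python
import math


def adaboost_round(weights, correct):
    """One AdaBoost-style reweighting step.

    weights: current sample weights (summing to 1)
    correct: booleans, True where the weak learner classified the sample correctly
    Returns the learner weight (alpha) and the renormalised sample weights.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1.0 - error) / error)   # assumes 0 < error < 1
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Four equally weighted samples; the weak learner misclassifies the last one.
alpha, new_weights = adaboost_round([0.25] * 4, [True, True, True, False])
```

After this round, the misclassified sample carries half of the total weight, so the next weak learner is pushed to classify it correctly.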

#### 4. Empirical Study

In this section, the experimental setup is designed to evaluate the performance of the proposed credit scoring models. It includes the descriptions of three credit scoring datasets and performance metrics to evaluate the experimental results, and finally, the performance of the classifiers is analyzed using statistical tests. The experiments are implemented in Python 3.8 on a Jupyter notebook on a PC with Intel Core 4 CPU, 4GB RAM, and Windows 10 operating system.

##### 4.1. Credit Scoring Datasets and Experimental Settings

For the experimental analysis of the different models, the three most commonly used credit scoring datasets were applied, namely, the Australian, Japanese, and German datasets, which were collected from the UCI ML repository^{1}. Table 2 gives brief descriptions of these datasets; all of them are relatively small but contain the total data. The Australian credit dataset contains 690 instances, of which 307 are positive and the remaining 383 are negative, yielding an imbalance ratio (IR) of 1.25. Similarly, the Japanese dataset contains 690 samples, with an IR of 1.25. The German dataset contains 1000 instances, of which 700 are positive and 300 are negative, yielding an IR of 2.33. In Table 2, the “#instance” column gives the number of instances, the “#feature” column gives the number of features/attributes, the “%Good” and “%Bad” columns give the percentages of good/positive and bad/negative applicants, and the IR column gives the imbalance ratio.
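
The IR values quoted above follow directly from the class counts, as the short calculation below shows. It takes IR to be the majority class count divided by the minority class count, which is consistent with the reported figures; the function names are illustrative.

```python
def imbalance_ratio(n_majority, n_minority):
    """Ratio of majority-class to minority-class instances."""
    return n_majority / n_minority


def synthetic_needed(n_majority, n_minority):
    """Number of minority samples to add so that the IR becomes 1."""
    return n_majority - n_minority


australian_ir = imbalance_ratio(383, 307)   # about 1.25
german_ir = imbalance_ratio(700, 300)       # about 2.33
german_extra = synthetic_needed(700, 300)   # 400 synthetic samples
```

Note that these counts refer to the full datasets; when oversampling is applied inside cross-validation, the counts of each training partition would be used instead.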

In the proposed work, an experiment is conducted to perform an extensive comparative analysis of the performances of baseline and ensemble models. In total, 5 base and 14 ensemble models are used to perform comparative analyses on the three credit scoring datasets. All the experiments have been conducted using 5-fold cross-validation to limit the effect of the variability introduced by random partitioning and to obtain reliable results. Each dataset is partitioned into five folds; one fold is used as the test set to evaluate the model, and the remaining four folds are used for training. The 5-fold cross-validation process was repeated ten times for tuning the hyperparameters.
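The partitioning scheme described above can be sketched in plain Python as follows. This is only an illustration: the text does not say whether the folds were stratified, so the stratification here is an assumption, and the helper names `stratified_folds` and `repeated_cv_splits` are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=0):
    """Split sample indices into n_splits folds that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


def repeated_cv_splits(labels, n_splits=5, n_repeats=10, seed=0):
    """Yield (train_indices, test_indices) for every fold of every repetition."""
    for repeat in range(n_repeats):
        folds = stratified_folds(labels, n_splits, seed + repeat)
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
            yield train, test


# Labels with the class sizes reported for the German dataset (700 good, 300 bad).
labels = [1] * 700 + [0] * 300
splits = list(repeated_cv_splits(labels))
print(len(splits))  # 50 train/test pairs: 10 repetitions of 5 folds
```

Each repetition reshuffles the data with a different seed, so the ten repetitions produce different partitions of the same dataset.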

##### 4.2. Evaluation Metrics

To evaluate the effectiveness of the credit scoring models, five evaluation metrics were used, i.e., accuracy (ACC), the area under the curve (AUC), the F1-score, the Brier score (BS), and the Kolmogorov-Smirnov statistic (KS). These measures are the most commonly used in this problem domain, as they cover different aspects of a model’s performance. The ACC, AUC, and F1-score metrics can be defined using the confusion matrix (shown in Table 3), in which one class is labeled as positive and the other class as negative. True negative (TN) and true positive (TP) represent the numbers of correctly classified negative and positive cases, respectively. Similarly, false negative (FN) represents the number of positive cases incorrectly classified as negative, and false positive (FP) represents the number of negative cases incorrectly classified as positive.

The ACC metric measures the proportion of instances that are correctly predicted by the model and is defined in Equation (6). The AUC evaluates the discriminatory ability of the model based on the receiver operating characteristic (ROC) curve. The AUC value lies between 0 and 1, where 1 indicates perfect discrimination and 0.5 indicates the predictive ability of a random classifier. The F1-score combines precision and recall by taking their harmonic mean. Precision measures exactness, i.e., the proportion of samples predicted as positive that are actually positive, whereas recall measures completeness, i.e., the proportion of positive samples that are correctly predicted.

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{6}$$

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}.$$
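The confusion-matrix metrics discussed here can be illustrated with a short, self-contained sketch. The function names and the toy labels are assumptions made for illustration; the AUC is computed through its rank interpretation (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one), which is one of several equivalent ways to obtain it.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for binary labels."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Compute ACC, precision, recall, and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"acc": acc, "precision": precision, "recall": recall, "f1": f1}


def auc_score(y_true, scores, positive=1):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == positive]
    neg = [s for y, s in zip(y_true, scores) if y != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 4 positive and 6 negative samples, with one FN and one FP.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(*confusion_counts(y_true, y_pred))
toy_auc = auc_score([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise AUC computation is quadratic in the number of samples, which is acceptable for datasets of the size considered here but not for large ones.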

The BS metric evaluates the accuracy of the probability predictions. It computes the mean-squared error between the predicted probabilities and the actual labels (0 or 1). It is defined as

$$\mathrm{BS} = \frac{1}{n}\sum_{i=1}^{n} (p_i - y_i)^2,$$

where $p_i$ is the predicted score and $y_i$ is the actual label of the $i$th sample, and $n$ is the number of samples.
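A minimal sketch of the Brier score as a mean-squared error between predicted probabilities and binary labels is given below. The function name and the toy values are illustrative only.

```python
def brier_score(y_true, y_prob):
    """Mean-squared error between predicted probabilities and 0/1 labels."""
    if not y_true or len(y_true) != len(y_prob):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy labels and predicted probabilities of the positive class.
toy_bs = brier_score([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4])
```

Lower values are better: a score of 0 corresponds to perfectly confident and correct probability predictions.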

The KS statistic measures the maximum difference between the cumulative score distributions of the positive and negative samples.
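One way to compute the KS statistic, sketched here under the assumption that it is taken as the largest gap between the empirical cumulative distributions of the scores of the two classes, is shown below. The function name and the toy scores are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(y_true, scores, positive=1):
    """Largest gap between the empirical score CDFs of the two classes."""
    pos = sorted(s for y, s in zip(y_true, scores) if y == positive)
    neg = sorted(s for y, s in zip(y_true, scores) if y != positive)
    best = 0.0
    for threshold in sorted(set(scores)):
        # Fraction of each class with a score at or below the threshold.
        cdf_pos = bisect_right(pos, threshold) / len(pos)
        cdf_neg = bisect_right(neg, threshold) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


toy_ks = ks_statistic([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.4, 0.5, 0.3, 0.2])
```

A value of 1 means the scores of the two classes are completely separated, while a value near 0 means the two score distributions are almost identical.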

##### 4.3. Statistical Tests

Although several performance measures are used, a model’s superiority over others cannot be concluded without statistical tests [20]. A statistical test needs to be carried out to show that the model’s performance is statistically different from that of the others. In this experiment, nonparametric tests are used to compare the performance of all the algorithms over the different datasets. The algorithms are ranked using the Friedman test: rank 1 is assigned to the best algorithm, rank 2 to the second best, and so on. For $K$ algorithms, the Friedman statistic follows a chi-square distribution with $K-1$ degrees of freedom. This value is computed using the rank $r_i^j$ of each classifier $j$ on each dataset $i$, where $N$ indicates the number of datasets. It is defined using the following equation:

$$\chi_F^2 = \frac{12N}{K(K+1)}\left[\sum_{j=1}^{K} R_j^2 - \frac{K(K+1)^2}{4}\right], \qquad R_j = \frac{1}{N}\sum_{i=1}^{N} r_i^j.$$

The Friedman test is applied to determine whether there are significant differences in the performances of the classifiers. If the null hypothesis that there is no significant difference is rejected, then the Nemenyi post hoc test can be applied to determine significant differences in performance between each pair of classifiers. According to the Nemenyi post hoc test [20], the performances of two classifiers are significantly different if their respective average ranks differ by at least the critical difference (CD), which is defined as

$$\mathrm{CD} = q_{\alpha}\sqrt{\frac{K(K+1)}{6N}},$$

where $K$ is the number of classifiers, $N$ is the number of datasets, and the value of $q_{\alpha}$ is based on the studentized range statistic table.
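The two tests can be illustrated with a small sketch that ranks classifiers per dataset, computes the Friedman chi-square statistic from the average ranks, and then evaluates the Nemenyi critical difference. The score matrix below is made up for illustration and is not taken from the paper’s tables; the function names are hypothetical, and the critical value 2.728 (for five classifiers at a 0.05 significance level) is quoted from the commonly tabulated values and should be checked against the table in [20].

```python
import math


def rank_row(row):
    """Rank scores within one dataset (1 = best, i.e., highest score); ties share the average rank."""
    order = sorted(range(len(row)), key=lambda j: -row[j])
    ranks = [0.0] * len(row)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and row[order[end + 1]] == row[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1
        for q in range(pos, end + 1):
            ranks[order[q]] = average
        pos = end + 1
    return ranks


def friedman_statistic(scores):
    """scores[i][j] is the score of classifier j on dataset i; returns (chi2, mean ranks)."""
    n, k = len(scores), len(scores[0])
    rank_rows = [rank_row(row) for row in scores]
    mean_ranks = [sum(r[j] for r in rank_rows) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4)
    return chi2, mean_ranks


def nemenyi_cd(q_alpha, n_classifiers, n_datasets):
    """Critical difference between average ranks for the Nemenyi post hoc test."""
    return q_alpha * math.sqrt(n_classifiers * (n_classifiers + 1) / (6 * n_datasets))


# Illustrative scores of 3 classifiers (columns) on 3 datasets (rows).
toy_scores = [[0.87, 0.85, 0.80], [0.90, 0.86, 0.82], [0.78, 0.76, 0.70]]
chi2, mean_ranks = friedman_statistic(toy_scores)
cd = nemenyi_cd(q_alpha=2.728, n_classifiers=5, n_datasets=3)
```

In the toy example the first classifier is always ranked first, so the average ranks are 1, 2, and 3. With only three datasets the critical difference is large, which means that only classifiers with widely separated average ranks can be declared significantly different.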

##### 4.4. Hyperparameter Tuning

The performance of the classifiers depends on their hyperparameters, which need to be tuned to improve the performance significantly. Therefore, the grid search method, a popular hyperparameter optimization method, is employed to determine the best parameters from a prespecified parameter list. Table 4 presents the parameter search space for all the classifiers and the best parameters obtained for all the datasets. The classifiers used in this study are LR, SVM, KNN, CART, AdaBoost, bagging, RF, XGBoost, LGBM, GBDT, and CatBoost; all these classifiers have a set of hyperparameters that needs to be optimized. However, NB bases its classification only on the prior probabilities of the features in the training set to estimate the posterior probability, so no parameter tuning is required for it.
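The idea of grid search can be sketched as an exhaustive loop over every combination in a prespecified parameter list. In the sketch below, `toy_cv_score` is a hypothetical stand-in for a cross-validated performance estimate, and the parameter names and values are illustrative rather than those listed in Table 4.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Evaluate every parameter combination and return the best one with its score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_cv_score(params):
    # Hypothetical score surface that peaks at max_depth=4 and learning_rate=0.1.
    return 1.0 - 0.05 * abs(params["max_depth"] - 4) - abs(params["learning_rate"] - 0.1)


grid = {"max_depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 0.3]}
best_params, best_score = grid_search(grid, toy_cv_score)
```

The cost grows with the product of the list lengths, so in practice the search space is kept small for the slower classifiers.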

#### 5. Results and Discussion

In the experiment, we build 5 baseline classifiers, i.e., LR, SVM, DT, NB, and KNN, and 14 ensemble models, which are generated by combining the baseline classifiers with AdaBoost, bagging, and tree-based ensemble methods. In the AdaBoost method, 4 ensemble models are generated by using the LR, SVM, DT, and NB algorithms as base classifiers, represented as A_LR, A_SVM, A_DT, and A_NB, respectively. Similarly, in the bagging approach, 6 ensemble models are generated: random forest (RF) and 5 models that use DT, SVM, LR, NB, and KNN as base classifiers, which are represented as B_DT, B_SVM, B_LR, B_NB, and B_KNN. In the tree-based approach, 4 ensemble models are used, namely, XGBoost (XGB), GBDT (GB), CatBoost (CTB), and LGBM (LGB).

All the models are evaluated with respect to the ACC, AUC, and F1-score metrics in four separate experiments. They are as follows:

(i) Performance analysis of each classifier without resampling and FS

(ii) Performance analysis of each classifier with resampling but without FS

(iii) Performance analysis of each FS technique

(iv) Performance analysis of each classifier with resampling and GA-based FS technique

Finally, in the last two experiments, the performance analysis of each classifier using nonparametric statistical tests and the comparison of the computational cost of each classifier are performed.

##### 5.1. Experiment I: Performance Analysis of each Classifier without Resampling and FS

In this experiment, the performance of the five individual classifiers and the 14 ensemble learners is compared before applying oversampling and FS techniques to the datasets. Table 5 presents the results of each base and ensemble model. The best classifiers in each metric are highlighted in bold. Additionally, we rank the classifiers from the best (rank 1) to the worst performer using Friedman’s rank in terms of ACC (R_ACC), AUC (R_AUC), F1-score (R_F1), BS (R_BS), and KS (R_KS) for all the datasets. All the tests have been carried out with the level of significance ($\alpha$) equal to 0.05.

From Table 5, we can observe that CTB is the best performer and RF is the second best considering all the three datasets. The mean scores of CTB in terms of different metrics are ACC (84.56%), AUC (89.73%), F1 (76.56%), BS (0.116), and KS (65.06%). Among the base learners, LR is the best performer with a mean rank of 7.7 and CART is the worst performer with a mean rank of 15.6.

After applying ensemble methods, some improvements in performance were observed on most of the datasets for all the base learners. In particular, DT (CART) shows the largest improvements compared to the other base learners. The accuracy of DT improves by 5.5% to 9.8% on all the datasets after applying bagging (RF). Similarly, XGBoost improves the accuracy by 5.3% to 10%. In general, ensemble methods obtain significant improvements in different metrics compared to all the base learners.

##### 5.2. Experiment II: Performance Analysis of Each Classifier with Resampling but without FS Technique

Through this experiment, we aim to show the effect of resampling on the performance of the classifiers. In the 5-fold cross-validation process, oversampling is implemented 5 times on all the training folds, while in all the testing folds, the class distribution is kept intact. That is, all the classification algorithms are tested on the dataset having the original class distribution.
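The key point, that only the training folds are rebalanced while the test fold keeps its original class distribution, can be sketched as follows. Random oversampling is used here purely to keep the example short; the helper name `random_oversample` and the toy split are assumptions made for illustration.

```python
import random


def random_oversample(indices, labels, seed=0):
    """Duplicate minority-class indices at random until all classes are the same size."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    return balanced


# Toy data: 70 positive and 30 negative samples; every fifth sample forms the test fold.
labels = [1] * 70 + [0] * 30
test_idx = list(range(0, 100, 5))
test_set = set(test_idx)
train_idx = [i for i in range(100) if i not in test_set]

# Oversampling touches the training indices only; test_idx is left untouched.
train_balanced = random_oversample(train_idx, labels)
```

Because the duplicated samples are drawn only from the training indices, no copy of a test sample can leak into the training data.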

In this experiment, four popular oversampling methods are implemented on all three datasets using DT classification. Oversampling methods used in the experiment are random oversampling (ROS), ADASYN, SMOTE, and Borderline-SMOTE (BSMOTE). After implementing oversampling methods, the performances of DTs are measured in terms of ACC, AUC, F1-score, BS, and KS which are shown in Table 6.

To show the effect of oversampling on the classification algorithms, we balance all the datasets using SMOTE, and the algorithms are trained on those balanced datasets. Table 7 illustrates the performances of all the base and ensemble learners before oversampling (BO) and after oversampling (AO) in terms of each metric. Bold values indicate that the performance of the classifier increases after oversampling. After implementing SMOTE, most of the classifiers show improvements in ACC (i.e., 0.2-5%), AUC (i.e., 0.5-5%), F1-score (i.e., 7-20%), and KS (i.e., 0.5-3%). However, very few classifiers show even slight improvements in BS. From this experiment, we conclude that the SMOTE method brings significant improvements in the ACC, AUC, and F1-score metrics for most of the classifiers.
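The core interpolation step of SMOTE can be sketched in a few lines. This is a simplified illustration for numeric features only, not the implementation used in the experiments; the function name, the neighbour count, and the toy points are assumptions.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (squared Euclidean distance).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Four minority-class points at the corners of the unit square.
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic_points = smote(minority_points, n_new=6, k=2)
```

Each synthetic point lies on the line segment between a minority sample and one of its nearest minority neighbours, which is why SMOTE adds new points rather than exact duplicates.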

##### 5.3. Experiment III: Performance Analysis of Each FS Technique

Through this experiment, we aim to study the impact of FS techniques on the performance of the classification algorithms. As stated above, the class level distribution of the training dataset is made balanced using SMOTE, and then, FS techniques are applied to these balanced datasets. Next, the top-ranked features are identified and the classifier gets trained on these selected features. Finally, the testing set with the same set of features is used to evaluate the performance of the classification. The FS techniques are implemented and their parameters are adjusted according to the performance of the SVM classifier.

In the IG FS approach, only the number of selected features in each subset varies. Each feature subset is evaluated using ACC, AUC, F1-score, BS, and KS. The results of each subset are shown in Table 8. In the German and Australian datasets, model-4 (i.e., selecting 12 important features) obtains better results than the others in the classification process. In the Japanese dataset, model-2 (i.e., selecting 6 important features) obtains better results than the others.
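Ranking features by information gain and keeping the top $k$ can be sketched as follows for discrete features (continuous features would first need to be binned, which is not shown). The function names and the two toy features are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_k_features(columns, labels, k):
    """Return the names of the k features with the highest information gain."""
    gains = {name: information_gain(values, labels) for name, values in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)[:k]


# Toy data: feature A separates the classes perfectly, feature B carries no information.
toy_labels = [1, 1, 0, 0]
toy_columns = {"A": ["x", "x", "y", "y"], "B": ["x", "y", "x", "y"]}
selected = top_k_features(toy_columns, toy_labels, k=1)
```

Varying `k` corresponds to the different feature subsets (models) compared in this experiment.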

Similarly, in the case of PCA, the feature space is reduced by retaining different numbers of components. The best feature subset is determined based on the SVM classification performance. The results of the classification are shown in Table 9. According to the classification report, model-3 (i.e., 8 components) obtains the best results on the German dataset and model-4 (i.e., 12 components) obtains the best results on the Australian and Japanese datasets.

In the GA FS technique, a set of parameters with different combinations is chosen by referring to previous studies. To obtain the best feature subset, a series of experiments is conducted with various ranges of population size [50-300], mutation rate [0.001-0.3], crossover rate [0.01-0.9], and number of generations [20-100]. Finally, Table 10 presents the best combination of parameters obtained after comparing all the combinations. The features selected using these optimal parameters form the best feature set, which is then applied to the SVM classification algorithm; the results are shown in Table 11.
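A minimal sketch of GA-based feature selection is given below. The text does not specify the selection, crossover, or mutation operators, so tournament selection, single-point crossover, bit-flip mutation, and elitism are assumptions made for illustration. In the study the fitness of a feature subset would come from the SVM classification performance; here a hypothetical `toy_fitness` function stands in for it, and the default parameter values are simply taken from within the ranges quoted above.

```python
import random


def genetic_feature_selection(n_features, fitness, pop_size=50, generations=20,
                              crossover_rate=0.9, mutation_rate=0.01, seed=0):
    """Evolve binary masks (1 = feature kept) and return the fittest mask found."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]

    def tournament(pop):
        # Pick two distinct individuals at random and keep the fitter one.
        first, second = rng.sample(pop, 2)
        return first if fitness(first) >= fitness(second) else second

    best = max(population, key=fitness)
    for _ in range(generations):
        offspring = [best[:]]  # elitism: carry the current best forward unchanged
        while len(offspring) < pop_size:
            parent_a, parent_b = tournament(population), tournament(population)
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_features - 1)  # single-point crossover
                child = parent_a[:cut] + parent_b[cut:]
            else:
                child = parent_a[:]
            # Bit-flip mutation, applied independently to each gene.
            child = [1 - gene if rng.random() < mutation_rate else gene
                     for gene in child]
            offspring.append(child)
        population = offspring
        best = max(population, key=fitness)
    return best


RELEVANT = {0, 3, 5}  # hypothetical set of useful features for the toy fitness


def toy_fitness(mask):
    # Reward each relevant feature kept and penalize each irrelevant feature kept.
    return sum(1.0 if i in RELEVANT else -0.5 for i, bit in enumerate(mask) if bit)


best_mask = genetic_feature_selection(n_features=8, fitness=toy_fitness)
```

Because the best individual is always copied into the next generation, the fitness of the returned mask never decreases from one generation to the next. With a real classifier as the fitness function, each evaluation involves training a model, so the population size and the number of generations dominate the running time.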

The features selected using IG and GA for all three datasets are shown in Table 12. In the Australian and Japanese datasets, the features are named [X1-X14] and [F1-F15], respectively. Finally, Table 13 shows the mean values of each metric for all the FS techniques. From Table 13, we conclude that GA is the best FS technique, and the features selected using this technique are applied to all the datasets in the model generation phase.

##### 5.4. Experiment IV: Performance Analysis of Each Classifier with Resampling and GA-Based FS Technique

In this experiment, the optimal feature subset of each dataset is employed by all the classification algorithms on the balanced training set. Table 14 presents the results of each algorithm after implementing the GA-based FS technique on the oversampled dataset. From Table 14, it is clear that CTB and XGB are the two best classification algorithms for the credit scoring problem. The mean ACC, AUC, F1-score, BS, and KS values of CTB are 87.1%, 91.5%, 86.53%, 0.112, and 68.22%, respectively; for XGB, the corresponding values are 85.67%, 90.30%, 84.0%, 0.121, and 64.94%. From experiments I-III, it is clear that the performance of most of the classifiers improves after employing SMOTE and improves further after implementing the GA-based FS technique on the oversampled dataset.

##### 5.5. Experiment V: Performance Analysis of Each Classifier Using Statistical Tests

In each experiment, the Friedman test is conducted to rank the classifiers according to their predictive performances. In all three cases, the $p$ value is less than $\alpha$ (0.05). This indicates that the performance measures of the classifiers are significantly different, and therefore, the null hypothesis is rejected.

Finally, for further analysis, a Nemenyi post hoc test [20] with $\alpha = 0.05$ is applied to make pairwise comparisons of the classifiers using the mean ranks of all the classification algorithms. Table 15 presents the ranks of each classifier in terms of ACC (R_ACC), AUC (R_AUC), F1-score (R_F1), BS (R_BS), and KS (R_KS), and the last column presents the mean ranks. Figure 2 shows a graphical representation of the global ranks. The CD defines the mean-ranking score difference among the classifiers [20]. The algorithms on the right-hand side of the diagram are considered the top-ranked algorithms, and among them, CTB is the best one. The graphical representation shows that CTB is the superior algorithm among all the base and ensemble models.

##### 5.6. Comparison of Computational Cost

Computational cost is another important metric that needs to be considered for a credit scoring model. An ideal credit scoring model should quickly decide whether to grant loans to the applicants or not. Except for SVM, each base learner takes comparatively less training time than the ensemble models, because the base learners are trained only once, while the ensemble models are trained multiple times. Moreover, tree-based ensemble methods, such as GBDT, XGBoost, LGBM, and CatBoost, take comparatively low computational time due to GPU computing systems. The computational cost of the classifiers is determined by computing the single training time [12], which represents the training time of a single cross-validation process. Figure 3 shows the average computational time of the base and ensemble models. SVM consumes the maximum computational time, so it may not be appropriate to design a credit scoring model using SVM on a CPU computing system. However, the computational cost of the ensemble models can be further reduced by implementing advanced computing mechanisms, such as distributed environments and GPU computing systems. Hence, despite their higher computational cost, ensemble models should be used to build credit scoring models.

#### 6. Conclusions and Future Work

Effectively analyzing the default customers is an important process to improve the financial status of banks and financial industries. Advanced credit scoring models are an effective tool for identifying the default customers. As stated in the literature section, different approaches have been proposed to build credit scoring models. This study develops a hybrid credit scoring model by applying SMOTE and different FS techniques to base and ensemble learners. Three FS techniques, namely IG, GA, and PCA, are applied on the balanced training set to select the best predictors. In the experiment, 5 base and 14 ensemble classifiers are used, and the effectiveness of the models is validated in terms of the ACC, AUC, F1-score, BS, and KS metrics across three benchmark credit scoring datasets, i.e., German, Australian, and Japanese.

After applying the SMOTE method, we observe some improvements in most of the algorithms, and among them, CTB is the best classifier. Next, FS techniques are applied to all the algorithms, resulting in further improvements in the performance of the models. GA is the FS technique that brings the highest improvements in the model. Therefore, this study suggests that combining the CTB machine learning algorithm with the GA-based FS technique could build an accurate and reliable credit scoring model. The experimental results suggest that financial industries could use the proposed hybrid model to predict the defaulters effectively.

The classification performance of the proposed model can be further improved in future studies by incorporating different optimization techniques, such as particle swarm optimization, GA, and ant colony optimization methods. Moreover, multiple base learners can be combined using different ensemble methods, such as random subspace, stacking, and DECORATE. Additionally, more FS techniques, like RELIEF, chi-square, and rough sets, can be applied, and the optimized feature subsets may give better results. Finally, more credit scoring datasets should be explored to further validate the conclusions of this paper.

#### Data Availability

The data used to support the findings of this study are included within the article.

#### Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

#### Acknowledgments

This work was supported by the Dongseo University, “Dongseo Cluster Project” Research Fund of 2022 (DSU-20220006).