Abstract

Credit risk prediction, model reliability, monitoring, and efficient loan processing are all important factors in decision-making and transparency. Machine learning methods offer new hope in this area. However, it is up to banking and nonbanking institutions to determine how they will implement these advanced methods in order to decrease human bias in loan decision-making. Objective. This paper proposes a novel machine learning-based credit risk analysis model for digital banking evaluation. The purpose of this research is to compare various ML algorithms in order to develop an accurate model for credit risk assessment using data from a real credit registry dataset. The aim is to design a classification-based model using the particle swarm optimization (PSO) algorithm with structure decision tree learning (SDTL) for predicting credit risk. This system has the potential to improve quality criteria such as dependability, robustness, extensibility, and scalability. Features are extracted and classified by the proposed PSO_SDTL model. Experimental Results. The data were collected in real time for credit analysis. The simulation is carried out in Python, and a comparative analysis with existing techniques shows that the proposed technique improves accuracy and reduces the error rate.

1. Introduction

With the growth of rural markets, credit penetration among farmers is gaining traction, and it is becoming recognised as a growth engine for a rising economy such as India. Access to banking, particularly for smallholder farmers, poorer households, and women farmers, is a major issue that must be addressed [1]. These groups are likely unserved or neglected by current banking and financial organisations [2]. For these organisations, the digital transformation of corporate processes is unavoidable as they implement big data solutions to improve their operations [3]. The term “big data” has been popular in recent years; it refers to massive amounts of data as well as the technologies for storing and analysing them. The banking industry holds a significant amount of data that is growing at an exponential rate, and it is grappling with the issues of managing and analysing it. Banks are using big data and data science to boost profits by gaining new insights from existing data and improving predictions based on those data [4]. AI- and ML-based technologies may therefore be of great assistance to banks in determining credit scores, in providing forecasts for a variety of expert systems, including liquidity, risk, customer attrition, fraud detection, and revenue, and in making informed decisions [5].

Many researchers have studied credit risk assessment models using a variety of techniques. Zhou et al. and Lowd and Davis [6, 7] suggested a particle swarm optimization algorithm-based financial credit risk assessment technique. The study in [8] provided an xgbfs-based financial credit risk assessment approach that reduces the dimension of the user’s credit data, trains the XGBoost assessment model, and then analyses the user’s credit risk. According to [9], combining supervised and unsupervised ML methods produces better outcomes than employing only one of them. According to [10], traditional ways of forecasting credit risk are not well adapted to assist financial institutions, which require ML methods instead; the authors introduced a hybrid ensemble ML method that combined random subspace (RS) and multiboosting, two classic ML methods. The work in [11] used the usual probit method as well as several ML methods such as NN and KNN and, compared with traditional credit risk assessment methodologies, achieved a lower error rate using ML methods. Kovvuri and Cheripelli [12] used a loan dataset from a commercial bank to test five ML methods for credit risk assessment: RF, KNN, NB, DT, and logistic regression. According to their findings, the random forest method outperforms the other algorithms studied. The author in [13] built a new P2P credit risk measurement model based on a classic credit risk model and an NN model, which produced better findings in simulation. In [14], the standard credit risk measurement model was combined with SVM in an empirical study of loan default. The researcher in [15] used the borrower’s loan amount, current overdue loan amount, loan interest rate, credit rating, debt yield, and other parameters as independent variables to build an LR method for testing and obtained good results. Dufour et al. [16] replaced the BP neural network’s training algorithm with an enhanced PSO method, combined it with a credit assessment index system, and finally realised an enhanced PSO-BPNN. The author of [17] examined the origins of P2P online loan risk and used loan data from Lending Club to create a risk prediction model. The forecast accuracy demonstrated on data from the world’s largest P2P company is intended to provide a credit risk management technique for domestic P2P enterprises. The restricted Boltzmann machine (RBM), introduced in [18], is a nonlinear dimensionality reduction approach employed in this paper. The RBM-DBN model is a deep belief network (DBN) built from stacked RBMs. It first learns through multistep unsupervised training and then adjusts its parameters with supervised learning before the discriminant classifier model is trained.

The contributions of this paper are as follows:
(i) On the basis of the credit registry dataset, we analyse ML methods in order to construct a precise model for credit risk assessment.
(ii) We concentrate on credit risk assessment, looking at the impact of the various proposed models on detecting business defaults. In addition, we investigate the models’ stability in relation to the subset of variables chosen by the models.
(iii) We design the classification-based model using the particle swarm optimization (PSO) algorithm with structure decision tree learning (SDTL) for predicting credit risk.
(iv) The findings reveal the central banks’ perspective on credit risk analysis, which differs significantly from standard commercial bank credit risk evaluations, which rely on more extensive data per client.

2. Digital Banking Credit Risk Analysis Using Particle Swarm Optimization (PSO) Algorithm with Structure Decision Tree Learning (SDTL)

To regulate financial credit risk successfully, it is important to categorise the financial credit risk level precisely, to classify financial credit risk data using PSO-SDTL, and to design the credit risk evaluation model. Targeted control measures must be implemented for the various levels of financial credit risk. The overall architecture is shown in Figure 1.

PSO is initialised with a set of random particles (candidate solutions) and then searches for the best solution by updating them over successive generations. Each particle tracks two values. The first is pbest, the best solution (fitness) that the particle itself has attained thus far. The second is gbest, the global best, which is the best value attained by any particle in the population. When a particle takes only part of the population as its topological neighbours, the best value is a local best and is denoted lbest. PSO is used here because it performs well in demanding, nonconvex, and continuous search spaces.

The PSO algorithm is made up of two equations, a velocity update and a position update, in which “k” refers to the current iteration and “k + 1” denotes the next iteration.
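For reference, a commonly used textbook form of these two equations is sketched below. The symbols are assumptions based on the standard inertia-weight PSO and may differ from the notation of the proposed model: \omega is the inertia weight, c_1 and c_2 are acceleration coefficients, r_1 and r_2 are uniform random numbers in [0, 1], and x_i and v_i are the position and velocity of particle i.

```latex
\begin{align}
v_i^{k+1} &= \omega\, v_i^{k}
  + c_1 r_1 \left( \mathrm{pbest}_i^{k} - x_i^{k} \right)
  + c_2 r_2 \left( \mathrm{gbest}^{k} - x_i^{k} \right), \\
x_i^{k+1} &= x_i^{k} + v_i^{k+1}.
\end{align}
```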

For each combined node k, its output is given by

Each particle is updated through the velocity that corresponds to it. Because PSO is a swarm method, it searches for feasible solutions in m-dimensional space in parallel, with i and j indexing the ith and jth particles, respectively.

Consider a discrete state space model.

Using the definition y[k] = x[k + 1], k = 0, …, N − 1, the preceding equation may be recast as a nonlinear process equation. Here x is one of the inputs, y is the output, and the unknown q-dimensional parameter vector must be estimated. A state space model is defined to estimate the parameters, in which r is the process noise. The first equation of this model is the process equation; the latter is the measurement equation, which is affected by the input as well as the measurement noise. The unknown parameter and its covariance matrix are then initialised.

Assuming that the parameter vector has a given mean and covariance, a set of 2q + 1 sigma vectors W is constructed from them.

The sigma points are then propagated through the nonlinear process F.

The mean and covariance of the transformed sigma points are computed as weighted sums.

Finally, the cross-covariance matrix of the measurement and parameter vectors is computed.
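The sigma-vector steps described above can be sketched as follows. This is a minimal illustration that assumes the classic unscented transform with a single scaling parameter `kappa`; the function names, the weights, and the use of NumPy are assumptions for illustration and are not taken from the proposed model. `F` is assumed to map a 1-D array to a 1-D array.

```python
import numpy as np

def sigma_points(mean, cov, kappa=1.0):
    """Return 2q + 1 sigma vectors W and their weights for a q-dimensional mean."""
    q = mean.shape[0]
    # Cholesky factor acts as the matrix square root of the scaled covariance.
    root = np.linalg.cholesky((q + kappa) * cov)
    points = [mean]
    for i in range(q):
        points.append(mean + root[:, i])
        points.append(mean - root[:, i])
    weights = np.full(2 * q + 1, 1.0 / (2.0 * (q + kappa)))
    weights[0] = kappa / (q + kappa)
    return np.array(points), weights

def unscented_transform(mean, cov, F, kappa=1.0):
    """Propagate sigma points through F; return mean, covariance, cross-covariance."""
    W, w = sigma_points(mean, cov, kappa)
    Y = np.array([F(p) for p in W])          # transformed sigma points
    y_mean = w @ Y                           # weighted mean
    dY = Y - y_mean
    dW = W - mean
    y_cov = (w[:, None] * dY).T @ dY         # weighted covariance of the outputs
    cross_cov = (w[:, None] * dW).T @ dY     # parameter/measurement cross-covariance
    return y_mean, y_cov, cross_cov

# Hypothetical two-dimensional parameter with a linear measurement map A.
m = np.array([1.0, 2.0])
P = np.array([[2.0, 0.3], [0.3, 1.0]])
A = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])
y_mean, y_cov, cross_cov = unscented_transform(m, P, lambda p: A @ p)
```

For a linear map the transform is exact, which makes this a convenient sanity check before a nonlinear F is substituted.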

In the search space, particles move according to their own best known position (pbest) and the entire swarm’s best known position (gbest). In iteration t, the velocity of the jth dimension of particle i is updated from its previous velocity, a cognitive term that pulls the particle towards pbest, and a social term that pulls it towards gbest. For the gbest PSO method, the social term uses the best position found by the whole swarm. When the cognitive and social acceleration coefficients are equal, all particles are attracted towards the average of pbest and gbest.
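As a minimal sketch of the gbest velocity and position updates, the following loop assumes the standard inertia-weight formulation. The values of `w`, `c1`, and `c2`, the search bounds, and the sphere objective are illustrative assumptions and are not the settings used in this paper.

```python
import random

def pso_minimize(objective, dim, n_particles=20, n_iter=200,
                 w=0.7, c1=1.5, c2=1.5, bounds=(-5.0, 5.0), seed=0):
    """Minimise `objective` with a gbest PSO (assumed inertia-weight form)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for j in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + cognitive pull (pbest) + social pull (gbest).
                vel[i][j] = (w * vel[i][j]
                             + c1 * r1 * (pbest[i][j] - pos[i][j])
                             + c2 * r2 * (gbest[j] - pos[i][j]))
                # Position update, clipped to the search bounds.
                pos[i][j] = min(hi, max(lo, pos[i][j] + vel[i][j]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def sphere(x):
    """Toy objective with its minimum at the origin."""
    return sum(v * v for v in x)

best_pos, best_val = pso_minimize(sphere, dim=3)
```

In a credit risk setting the objective would instead score a candidate set of model parameters, for example by validation error.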

The one-dimensional parameter θ is estimated from the likelihood function.

Equivalently, θ can be calculated by minimising the negative log-likelihood function.

The method of moments, by contrast, provides the simplest estimate and can be compared with the discrete maximum likelihood function.
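The two estimators can be compared on a small example. The sketch below assumes a Bernoulli model in which θ is a default probability; this model, the grid search, and the sample data are hypothetical choices made only for illustration.

```python
import math

def negative_log_likelihood(theta, defaults):
    """Bernoulli negative log-likelihood for a default probability theta."""
    k = sum(defaults)
    n = len(defaults)
    return -(k * math.log(theta) + (n - k) * math.log(1.0 - theta))

def mle_by_grid(defaults, steps=999):
    """Minimise the negative log-likelihood over a grid 0.001, ..., 0.999."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return min(grid, key=lambda t: negative_log_likelihood(t, defaults))

def method_of_moments(defaults):
    """Match the first moment: the sample mean estimates theta."""
    return sum(defaults) / len(defaults)

# Hypothetical sample: 3 defaults among 10 loans.
defaults = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
theta_mle = mle_by_grid(defaults)
theta_mom = method_of_moments(defaults)
```

Under this assumed model the two estimates coincide at the observed default rate; for other distributions they generally differ.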

We now describe how our method learns the Markov network structure from data. Each tree is converted into a set of conjunctive features that can represent the tree’s probability distribution.

The primary purpose of a classification and regression tree T is to predict the response variable y as precisely as possible from the predictor variables x. The prediction models attached to the leaf nodes define the prediction structure.

For each k = 1, …, K, a specific observation is allocated to class k with a probability defined at the leaf node that it reaches.

Because the transition density (17) cannot be evaluated in closed form, a nonparametric kernel density technique was used.
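A nonparametric kernel density estimate of this kind can be sketched as follows. The Gaussian kernel, the fixed bandwidth, and the sample values are assumptions for illustration; the kernel and bandwidth used for the proposed model are not specified here.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a density function built from Gaussian kernels centred on the samples."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

# Hypothetical one-dimensional observations.
samples = [0.8, 1.1, 1.3, 2.0, 2.2]
pdf = gaussian_kde(samples, bandwidth=0.4)
```

A smaller bandwidth follows the samples more closely, while a larger one produces a smoother estimate.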

3. Simulation Results

3.1. Dataset Description
3.1.1. General Data Protection Regulation (GDPR)

The dataset serves as the hub for all credit in the country, collecting data from all commercial banks and savings institutions. Each entry reflects the monthly credit and credit card status of a certain client. The credit registry database is approximately 1 TB in size, but this study used a subset of 1,000,000 rows, which represent the status of 1,000,000 different credits for specific clients at the planned completion date of the entire loan repayment. The dataset lacks a feature that indicates whether or not the client is capable of repaying the loan [19].

3.1.2. Advanced Analytics of Credit Registry Dataset

Using the advanced analytics features of Power BI, we examined the dataset’s properties, their dependencies, trends, and relationships with further data sources. The analysis went through several stages, including additional calculated columns, measures, and a star schema model. After constructing the model and its relationships, we developed various reports efficiently. Compared with conventional tools, Power BI is fast and offers up-to-date features for visualisation, dependency analysis, and prediction, which allowed us to investigate the dataset and acquire a better understanding of it. The speed with which Power BI can manipulate and analyse data was also a factor in our decision. Because Power BI has its own format tailored to handle large data, the subset of the dataset that we evaluated, which was 13 GB in size when stored in a SQL database, was reduced to 330 MB in Power BI format [20, 21].

3.1.3. WIND Dataset

The sample data used in this paper are real data from a commercial bank branch’s personal credit database. In this work, 75 data samples are chosen at random from the personal credit database. After the invalid samples are removed, 61 valid samples remain. These are divided into 40 learning samples and 21 test samples.

Here, the financial credit risk analysis is carried out for the proposed PSO-SDTL in terms of accuracy, MSE, feasibility, and GDP on the various datasets.
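Two of these measures, accuracy and MSE, can be computed as in the sketch below. The labels, scores, and the 0.5 decision threshold are hypothetical; feasibility and GDP are not sketched because their definitions are not given in this section.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def mean_squared_error(y_true, y_score):
    """Average squared difference between true labels and predicted scores."""
    return sum((t - s) ** 2 for t, s in zip(y_true, y_score)) / len(y_true)

# Hypothetical labels (1 = default) and predicted default probabilities.
y_true = [1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.6, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

acc = accuracy(y_true, y_pred)
mse = mean_squared_error(y_true, y_score)
```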

Table 1 and Figures 2(a) and 2(b) show the performance analysis of the proposed PSO_SDTL on the GDPR dataset. In Figure 2(a), accuracy and MSE are analysed; dark blue represents PSO_BPNN and light blue indicates RBM_DBN, and the accuracy and MSE of the proposed PSO_SDTL method are 98% and 55%, respectively. In Figure 2(b), feasibility and GDP are analysed with the same colour coding, and the feasibility and GDP values of the proposed PSO_SDTL method are 94% and 96%, respectively.

Table 2 and Figure 3 show the analysis of the proposed PSO_SDTL on the AACRD dataset. This dataset provides the credit analysis based on the analysis of the registry dataset. Here, the GDP obtained by the proposed PSO_SDTL is 94%, the accuracy is 96%, the MSE is 52%, and the feasibility is 91%.

Table 3 and Figure 4 show the analysis of the proposed technique on the WIND dataset. This dataset covers the sample credit registry dataset of the bank. The GDP analysis reflects the accuracy of the system, together with enhanced feasibility and minimised MSE. The analysis shows that the GDP result is optimal for this dataset. The accuracy of the proposed model on this dataset is 97%.

4. Discussion

Credit risk classification is delicate: we do not want to miss a borrower whose risk goes unnoticed, but it is also critical to ensure that the forecast risk is correct. Risk assessment for farm assets falls within the binary classification category, for which ML methods are well suited. Because the feature extraction and preprocessing step is so important for the overall performance of machine learning models, it must be customised for datasets from different fields. Techniques for feature engineering and data mining are constantly evolving. As a result, one of the next projects we are excited about is devising a more efficient method of increasing the application’s decision-making accuracy. Another key factor contributing to the models’ low accuracy is the small size of the datasets: one dataset contains 1000 entries, whereas the other contains roughly 3000. Models perform better when high-dimensional datasets have a large number of entries. One of the drawbacks of using machine learning in credit risk assessment is the scarcity of high-quality datasets. When evaluating the suggested classifiers on our credit registry dataset, it is important to keep in mind that the dataset is imbalanced. The results reveal that, on imbalanced data with feature scaling, all of the models achieve good accuracy and precision. Because the results were not promising, we omit the balanced dataset with scaling. Decision tree, random forest, and linear regression are the best models as identified by F1 score. When the data are balanced, the results are poor, and with feature scaling they are even worse. Compared with existing articles on datasets from commercial banks, the results obtained by evaluating five machine learning models show improved outcomes. The findings reveal that, using the suggested PSO_SDTL, the models performed well with high accuracy.
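The reason for reporting F1 score alongside accuracy on imbalanced data can be illustrated with a short sketch. The class counts and predictions below are hypothetical and are not results from the credit registry dataset.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 score for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Hypothetical imbalanced sample: 95 good loans (0) and 5 defaults (1).
y_true = [0] * 95 + [1] * 5

# A classifier that never predicts default looks accurate but finds no defaults.
always_good = [0] * 100
acc_naive = accuracy(y_true, always_good)
f1_naive = precision_recall_f1(y_true, always_good)[2]

# A classifier with 2 false alarms that catches 3 of the 5 defaults.
y_pred = [0] * 93 + [1] * 2 + [1] * 3 + [0] * 2
acc_model = accuracy(y_true, y_pred)
precision, recall, f1_model = precision_recall_f1(y_true, y_pred)
```

In this example the naive classifier has the higher accuracy but an F1 score of zero, which is why accuracy alone can mislead on imbalanced credit data.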

5. Conclusion

This paper proposes a novel technique for the analysis of financial credit risk in digital banking. On the basis of the credit registry dataset, the goal is to test ML methods in order to construct an accurate model for credit risk assessment. We concentrate on credit risk assessment, looking at the impact of the various proposed models on detecting business defaults. In addition, we investigate the models’ stability in relation to the subset of variables chosen by the models. A classification-based model using the particle swarm optimization (PSO) algorithm with structure decision tree learning (SDTL) has been designed for predicting credit risk. The findings reveal the central banks’ perspective on credit risk analysis, which differs significantly from standard commercial bank credit risk evaluations, which rely on more extensive data per client but lack data for the same client in other banks.

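(i) Tree Growth
TreeGrowth(E, F), with E the training records and F the attributes
    if stopping_cond(E, F) = true then
        leaf = createNode()
        leaf.label = Categorize(E)
        return leaf
    else
        root = createNode()
        root.test_cond = determine_best_split(E, F)
        let V = {v | v is a probable outcome of root.test_cond}
        for each v in V do
            E_v = {e | root.test_cond(e) = v and e in E}
            child = TreeGrowth(E_v, F)
            add child as descendent of root and label edge (root → child) as v
        return root

function LEARNTREE(D, X_i), with D the data and X_i the target variable
    best_split ← none
    best_score ← 0
    for each variable X_j other than X_i do
        for each value x_j of X_j do
            s ← score of splitting D on X_j = x_j
            if s > best_score then
                best_split ← (X_j = x_j)
                best_score ← s
    if best_score exceeds the split threshold then
        (D_t, D_f) ← D partitioned by best_split
        T_t ← LEARNTREE(D_t, X_i)
        T_f ← LEARNTREE(D_f, X_i)
        return new TreeVertex(best_split, T_t, T_f)
    else
        use D to estimate the distribution of X_i
        return new TreeLeaf
    end if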
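The tree growth procedure can be sketched in Python as follows. The split criterion (weighted Gini impurity over numeric thresholds), the depth limit, the dictionary-based node layout, and the sample loans are assumptions made for illustration; they are not the scoring function or data of the proposed PSO_SDTL model.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def stopping_cond(rows, labels, depth, max_depth):
    """Stop when the node is pure, too small, or the depth limit is reached."""
    return len(set(labels)) == 1 or len(rows) < 2 or depth >= max_depth

def find_best_split(rows, labels):
    """Return (feature index, threshold) that lowers weighted Gini, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        # Dropping the largest value keeps both sides of the split non-empty.
        for threshold in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, threshold), score
    return best

def tree_growth(rows, labels, depth=0, max_depth=3):
    """Recursively grow a tree; leaves store a label and class probabilities."""
    split = None
    if not stopping_cond(rows, labels, depth, max_depth):
        split = find_best_split(rows, labels)
    if split is None:
        counts = Counter(labels)
        probs = {k: c / len(labels) for k, c in counts.items()}
        return {"leaf": True, "label": counts.most_common(1)[0][0], "probs": probs}
    f, threshold = split
    node = {"leaf": False, "feature": f, "threshold": threshold, "children": {}}
    for outcome in (True, False):
        idx = [i for i, r in enumerate(rows) if (r[f] <= threshold) == outcome]
        node["children"][outcome] = tree_growth(
            [rows[i] for i in idx], [labels[i] for i in idx], depth + 1, max_depth
        )
    return node

def predict(node, row):
    """Follow the test conditions from the root down to a leaf."""
    while not node["leaf"]:
        node = node["children"][row[node["feature"]] <= node["threshold"]]
    return node["label"]

# Hypothetical loans: [loan amount, overdue amount]; label 1 means default.
rows = [[5, 0], [8, 0], [12, 3], [20, 5], [7, 1], [15, 4]]
labels = [0, 0, 1, 1, 0, 1]
tree = tree_growth(rows, labels)
```

Calling `predict(tree, row)` on a new loan then returns the label of the leaf that the row reaches.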

Data Availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that they have no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Acknowledgments

This work was supported by grants from the Youth Foundation WuHan Donghu University (project no. 2021dhsk002).