Abstract

Finance, as the core of the modern economy, supports sustained economic growth through financing and distribution. With the continuous development of the market economy, finance plays an increasingly important role in economic development. In recent years, a new economic and financial phenomenon known as financial intervention has emerged. It has created a series of new problems, promoting a rapid increase in both credit and investment and disturbing the normal operation of financial institutions; in the long run, it will inevitably affect the stability and soundness of the entire economic and financial system. To maximize the effect of financial intervention and respond to these problems, this article takes a series of US practices in financial intervention as the survey content and, combined with loan data provided by the US government financial intervention department, mines the data with the general C4.5 decision tree algorithm, generates a decision tree, and converts it into classification rules. On this basis, we uncover the laws hidden behind the loan data, identify information that may violate relevant financial policies, provide a reliable basis for financial intervention, and improve the efficiency of financial intervention. Experiments show that the method used in this article can effectively solve the above problems and has practical value in financial intervention. With stratified sampling, the high-risk accuracy rate increased by 10%, probably because stratified sampling increased the number of high-risk samples.

1. Introduction

The United States is considered to be the world’s most free market [1], but no country in the world has a free market economy that is completely laissez-faire and free from government regulation, and the United States is no exception [2]. In fact, American government intervention in the economy is intended to ensure that the market can operate more healthily and that market players can compete more fairly and freely [3]. The financial system of the United States is dominated by the capital market: because of its well-regulated system, its well-developed financial institutions and financial instruments have formed a developed capital market, supported by the US dollar’s status as the world’s leading international currency [4]. The market should be governed by the laws of the market, not determined and controlled by administrative orders [5]. The free competition of market entities under equal conditions is very important. Antimonopoly regulation exists because monopoly harms free market competition [6], and unified financial legislation exists because fraud and misleading practices harm free market competition. The way the United States handled the economic crisis shows that administrative intervention is an effective way for a country to emerge from crisis [7], and practice has proved this. Government intervention in the economy is inevitable in the development of a market economy [8]. It is also a good remedy for “market failure” and “market self-defeat” in the process of market economic development [9, 10]. Market mechanisms and government intervention each have their own time and space [11, 12], and neither can be ignored or replaced [13]. A government should perform its coercive intervention and management function in the economy during dramatic market changes and economic crises [14], with the purpose of curbing the damage to society caused by harmful behaviors arising from such changes and crises [15].

The most basic characteristic of data mining is that it works on a large amount of data [16]; its purpose is to discover unknown and hidden information, extract valuable information, and use this information to make important decisions [17–20]. Data mining is the process of extracting useful information from data and using it to make more appropriate decisions. The key to data mining can be divided into three parts: data, information, and decision-making [21]. Data is the basis of all mining [22], but it is most valuable only when it is mobilized or converted into useful information [23–25]. Simply obtaining information is not enough [26] and is not what data mining requires [27]; applying the obtained information in decision-making is the ultimate goal of obtaining information. Therefore, the ultimate goal of data mining is to extract useful information from data to improve the efficiency of decision-making and make more appropriate decisions. In the past few years, data mining has been used in many industries to help senior managers make important and appropriate decisions. For example, different data mining methods can be used in the banking industry to solve the difficulties encountered in the business processes of bank cards, credit, and so on, using these advanced computer technologies to enhance decision-making security and efficiency.

The financial market is producing huge amounts of data. Analyzing these data, extracting valuable information, and helping to make financial decisions are great opportunities and challenges for data mining. The essence of many financial theories is to construct a prediction model that conforms to reality and minimizes prediction error. However, in traditional financial analysis and theory, the prediction models used are often built on restrictive assumptions and take the form of simple mathematical expressions. Although such a model is simple and has good interpretability and comprehensibility, it sacrifices prediction accuracy to some extent. Data mining technology has broken this limitation in some respects; by analyzing the characteristics of financial data, we can see its advantages more clearly. Data mining technology arose because databases alone cannot predict the development trend of data, and its concept was first proposed at the 1989 International Joint Conference on Artificial Intelligence (IJCAI). It refers to the process of extracting hidden and potentially useful information and knowledge from a large amount of incomplete, noisy, ambiguous, and random practical application data. Data mining is a new information processing technology whose main function is to extract, transform, analyze, and model the large amounts of data in a database. The process of data mining is also called the process of knowledge discovery, and it is a broad academic subject.

In this paper, when studying the problem of financial intervention strategies, various irregular noises in the data can seriously interfere with the experiment. To avoid this situation and reveal the hidden laws in the data, this paper adopts an effective two-way cohesive information entropy data analysis method to establish a relevant model, which can discover the hidden information and patterns in financial data and help government financial departments make correct intervention decisions. Supported by information entropy theory, a simulation model based on two-way clustering is proposed. After extensive analysis and theoretical demonstration, the results show that the multichannel clustering algorithm clearly improves the accuracy of data analysis, which provides a strong scientific basis for the formulation of the financial intervention policy of the modern American government. Traditional clustering algorithms can only deal with single-attribute data and cannot handle the clustering of mixed-attribute data well, and most current clustering algorithms for mixed-attribute data are sensitive to initialization and cannot deal with data of arbitrary shape. Therefore, a spectral clustering algorithm for mixed-attribute data based on information entropy is proposed to deal with mixed-type data. First, a new similarity measurement method is proposed: the traditional similarity matrix is replaced by the combination of a Gaussian kernel matrix built from the numerical data and an influence factor matrix built from the categorical data using information entropy. The new similarity matrix avoids the conversion and parameter adjustment between numerical and categorical attributes. The new similarity matrix is then fed into the spectral clustering algorithm, which can process data of arbitrary shape, and finally the clustering results are obtained.
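A minimal sketch of the entropy-based mixed-attribute similarity just described, under the assumption that each record is split into a numeric part X_num and a categorical part X_cat (both hypothetical). It combines a Gaussian kernel on the numeric attributes with an entropy-weighted match matrix on the categorical attributes and feeds the result to spectral clustering as a precomputed affinity; it is an interpretation of the description above, not the authors' exact formulation.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import SpectralClustering

def mixed_similarity(X_num, X_cat, sigma=1.0):
    # Gaussian kernel matrix on the numeric attributes
    k_num = np.exp(-cdist(X_num, X_num, metric="sqeuclidean") / (2.0 * sigma ** 2))

    # Entropy-based influence weights for the categorical attributes:
    # attributes whose value distribution has lower entropy carry more weight.
    n, m = X_cat.shape
    weights = np.empty(m)
    for j in range(m):
        _, counts = np.unique(X_cat[:, j], return_counts=True)
        p = counts / n
        weights[j] = 1.0 / (1.0 - np.sum(p * np.log2(p)))   # 1 / (1 + entropy)
    weights /= weights.sum()

    # Influence-factor matrix: weighted agreement of categorical values
    k_cat = sum(weights[j] * (X_cat[:, j][:, None] == X_cat[:, j][None, :])
                for j in range(m)).astype(float)

    return 0.5 * (k_num + k_cat)   # combined similarity matrix

# Hypothetical mixed-attribute records: two numeric and two categorical columns.
X_num = np.array([[1.0, 0.2], [1.1, 0.3], [5.0, 2.0], [5.2, 2.1]])
X_cat = np.array([["A", "x"], ["A", "x"], ["B", "y"], ["B", "y"]])

affinity = mixed_similarity(X_num, X_cat)
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
print(labels)
```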

2. Proposed Method

2.1. Basic Technology of Data Mining

After years of development, data mining technology has gradually matured. Commonly used data mining techniques and algorithms include decision trees, neural networks, rough sets, association rules, cluster analysis, regression analysis, and genetic algorithms. Here, the focus is on cluster analysis, association rules, and regression analysis, in line with the application area of this paper. Figure 1 shows several common data mining methods.

2.1.1. Cluster Analysis

Cluster analysis plays a role in data mining in the following respects: first, as a preprocessing step for other algorithms, which then work on the clusters that have been generated; second, to analyze individual clusters, mainly specific clusters of interest; and third, to explore and process relatively independent data, which are often ignored when mining.

(1) Partitioning Method. Given a database containing data objects or tuples, a partitioning method constructs partitions of the data, each partition corresponding to a cluster with its own representative. As a general rule, a partitioning criterion (such as distance) is used so that objects in the same cluster are “similar” and objects in different clusters are “different.” Partitioning methods are mainly used to find spherical clusters and are mostly applied to small- and medium-sized databases. For better management and processing of the data within clusters, new partitioning methods are still urgently needed; a concrete example of the partitioning idea is sketched below.
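As a concrete example of a partitioning method, the short sketch below runs the widely used k-means algorithm on a handful of hypothetical two-dimensional points; it illustrates the general idea only, not the specific method used in this paper.

```python
# Minimal example of a partitioning (k-means) clustering run on hypothetical data.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 2.0], [1.2, 1.8], [8.0, 9.0],
                   [8.3, 9.1], [0.9, 2.2], [7.9, 8.8]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(model.labels_)           # cluster index of each point
print(model.cluster_centers_)  # representative center of each cluster
```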

(2) Hierarchical Method. The hierarchical method decomposes the collection of given data objects hierarchically. Depending on whether the hierarchical decomposition proceeds bottom-up or top-down, hierarchical clustering techniques can be divided into agglomerative and divisive approaches. The disadvantage of hierarchical clustering is that a step cannot be undone once it has been completed, so errors cannot be corrected afterwards.
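For comparison, the following sketch performs bottom-up (agglomerative) hierarchical clustering on similar hypothetical points with SciPy; each merge recorded in the linkage is irreversible, which is exactly the limitation noted above.

```python
# Minimal example of agglomerative hierarchical clustering on hypothetical data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 2.0], [1.2, 1.8], [8.0, 9.0], [8.3, 9.1]])
merges = linkage(points, method="average")          # pairwise merge history
labels = fcluster(merges, t=2, criterion="maxclust")  # cut into 2 flat clusters
print(labels)
```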

2.1.2. Association Rules

Association can be divided into simple association and temporal association. The most commonly used association rule algorithm is the Apriori algorithm proposed by R. Agrawal. It uses candidate itemsets to search for frequent itemsets and is the most influential algorithm for mining frequent itemsets of Boolean association rules. It proceeds in two steps:
(1) Find all frequent itemsets whose support is at least the predefined minimum support
(2) Use the frequent itemsets found in the first step to generate the target rules, producing all rules that contain only items of the frequent itemset and have a single item in the consequent of each rule; the minimum confidence threshold is applied here

The Apriori algorithm can generate a large number of candidate itemsets and may need to scan the database repeatedly; this is where the Apriori algorithm is insufficient.
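The sketch below is a compact, illustrative implementation of the two Apriori steps listed above, applied to a handful of hypothetical transactions; it shows the level-wise support filtering and the confidence-based rule generation, not an optimized Apriori.

```python
from itertools import combinations

transactions = [{"loan", "deposit"}, {"loan", "card"}, {"loan", "deposit", "card"},
                {"deposit", "card"}, {"loan", "deposit"}]
min_support, min_confidence = 0.4, 0.6
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

# Step 1: level-wise search for frequent itemsets.
items = {frozenset([i]) for t in transactions for i in t}
frequent = {}
level = {s for s in items if support(s) >= min_support}
while level:
    frequent.update({s: support(s) for s in level})
    candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
    level = {c for c in candidates if support(c) >= min_support}

# Step 2: generate rules A -> B with confidence = support(A u B) / support(A),
# keeping only rules that meet the minimum confidence threshold.
for itemset, sup in frequent.items():
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            confidence = sup / frequent[antecedent]
            if confidence >= min_confidence:
                print(set(antecedent), "->", set(itemset - antecedent),
                      f"support={sup:.2f} confidence={confidence:.2f}")
```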

2.1.3. Regression Analysis

(1) Simple Linear Regression Analysis. If the dependent variable and the independent variable are found to be highly correlated, a linear equation can be determined so that all data points lie as close as possible to a straight line. The model can be expressed as follows:

$$y = b_0 + b_1 x + \varepsilon$$

(2) Multivariate Linear Regression Analysis. More commonly, a single dependent variable corresponds to multiple independent variables; this mode of correspondence is called multiple regression. Its form is as follows:

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k + \varepsilon$$

$b_0$ represents the intercept, and $b_1, \dots, b_k$ represent the regression coefficients.
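As a small illustration on hypothetical data, the sketch below fits such a multiple linear regression with scikit-learn; `intercept_` plays the role of $b_0$ and `coef_` the role of $b_1, \dots, b_k$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical observations: two independent variables x1, x2 and one dependent y.
X = np.array([[2.0, 30.0], [3.0, 35.0], [4.0, 42.0], [5.0, 50.0], [6.0, 55.0]])
y = np.array([10.0, 13.5, 17.2, 21.0, 24.4])

model = LinearRegression().fit(X, y)
print("intercept b0:", model.intercept_)
print("coefficients b1..bk:", model.coef_)
```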

(3) Nonlinear Regression Analysis. For linear regression problems, the sample points fall on or near a straight line in space, so a linear function can be used to represent the relationship between the independent and dependent variables. In some applications, however, the relationship between the variables takes the form of a curve, so it cannot be expressed by a linear function and must be expressed by a nonlinear function instead.
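To make the contrast concrete, the following sketch (hypothetical data, NumPy polynomial fitting) fits both a straight line and a quadratic curve to points generated from a curved relationship; the linear fit leaves a visibly larger residual error.

```python
import numpy as np

# Hypothetical points generated from a curved (quadratic) relationship.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = 0.5 * x ** 2 + 1.0 + np.array([0.1, -0.2, 0.15, -0.1, 0.05, -0.05])

linear = np.poly1d(np.polyfit(x, y, deg=1))     # straight-line model
quadratic = np.poly1d(np.polyfit(x, y, deg=2))  # nonlinear (quadratic) model

print("linear residual sum of squares   :", np.sum((y - linear(x)) ** 2))
print("quadratic residual sum of squares:", np.sum((y - quadratic(x)) ** 2))
```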

2.2. Decision Tree Algorithms

A decision tree approximates the objective function through the arrangement of its nodes, and both leaf node classification and instance classification are performed on that basis. Classification of an instance starts at the root node: the attribute associated with that node is tested, and the instance moves to the branch corresponding to its attribute value, repeating the process at each node it reaches.

2.2.1. ID3 Algorithm

The ID3 algorithm uses information gain as the attribute selection measure when choosing the best attribute for each node. The measure is based on the pioneering work of C. E. Shannon on information value and information theory.

The first step is to compute the information gain of each candidate attribute and to choose the attribute with the highest information gain (that is, the greatest reduction in entropy) as the root node of the tree.

The second step is to branch according to the different values of the root attribute and then build the lower-level nodes and branches for each branch.

The third step is to repeat the first and second steps and stop branching when the data contained in the subset are of the same category.

In this way, a decision tree can be obtained and used to classify test samples.

For the calculation of the information gain value, let $D$ be a set of training data whose class label attribute defines $m$ distinct classes $C_i$ $(i = 1, \dots, m)$. The expected information needed to classify a tuple in $D$ is given by the following formula:

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$$

Note that the logarithm is taken to base 2 because the information is encoded in binary (bits).

Now, suppose the tuples in $D$ are to be partitioned on attribute $A$, which has $v$ distinct values $\{a_1, a_2, \dots, a_v\}$ according to the observations in the training data. Attribute $A$ therefore divides $D$ into $v$ subsets $\{D_1, D_2, \dots, D_v\}$, where the tuples in $D_j$ share the value $a_j$ on attribute $A$. These partitions may contain tuples from different classes rather than a single class, that is, they may be impure. The amount of information still needed after this partition to classify the generated tuples exactly can be measured by the following formula:

$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j)$$

Here, $\frac{|D_j|}{|D|}$ denotes the weight of the $j$th partition, and $\mathrm{Info}_A(D)$ denotes the expected information needed to classify a tuple of $D$ after partitioning on attribute $A$. The information gain obtained by branching on attribute $A$ can be described as:

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$
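As a minimal, illustrative reimplementation of the three formulas above (not the paper's experimental code), the sketch below computes $\mathrm{Info}(D)$, $\mathrm{Info}_A(D)$, and $\mathrm{Gain}(A)$ for a toy attribute, here a hypothetical collateral flag, against a risk class label.

```python
import numpy as np

def info(labels):
    # Info(D): expected information (entropy) of the class distribution
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(attribute, labels):
    # Gain(A) = Info(D) - Info_A(D), where Info_A(D) is the weighted entropy of the partitions
    expected = sum(np.mean(attribute == v) * info(labels[attribute == v])
                   for v in np.unique(attribute))
    return info(labels) - expected

# Hypothetical toy data: a binary "collateral" attribute and a risk class label.
collateral = np.array(["yes", "yes", "no", "no", "yes", "no"])
risk = np.array(["low", "low", "high", "high", "low", "low"])
print(info_gain(collateral, risk))
```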

The advantages of the ID3 algorithm are as follows:
(1) The basic principle of the algorithm is clear
(2) The classification speed is fast
(3) It is a practical learning algorithm for building classifiers from examples

Its shortcomings are as follows:
(1) There is a bias problem: attributes with more values tend to yield a larger information gain
(2) Differences in the training data can change the results, so the algorithm is sensitive to noise
(3) The probability of error grows as the number of categories increases

2.2.2. C4.5 Algorithm

C4.5 is an early machine learning algorithm and a common algorithm for constructing decision tree classifiers; it became the basis of many later decision tree algorithms. Compared with ID3, it has the following features:
(1) The information gain ratio is used as the attribute selection measure to solve the bias problem
(2) It can discretize attributes with continuous values and deal with incomplete data
(3) Pruning is performed during the tree construction process

The gain ratio, an extension of information gain, overcomes this drawback of ID3. When an attribute is selected as the splitting attribute, its gain ratio must be higher than that of the other candidate attributes. The split information is defined by the following formula:

$$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}$$

The gain ratio compares the information gain with the total amount of split information produced by the partition. The formula is as follows:

$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)}$$
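The same toy data can be used to sketch the C4.5 selection measure; the snippet below computes the gain ratio directly from the two formulas above, noting that $\mathrm{SplitInfo}_A(D)$ is simply the entropy of the attribute's value distribution. The data and attribute names are hypothetical.

```python
import numpy as np

def entropy(values):
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(attribute, labels):
    gain = entropy(labels) - sum(np.mean(attribute == v) * entropy(labels[attribute == v])
                                 for v in np.unique(attribute))   # Gain(A)
    split_info = entropy(attribute)                               # SplitInfo_A(D)
    return gain / split_info

collateral = np.array(["yes", "yes", "no", "no", "yes", "no"])
risk = np.array(["low", "low", "high", "high", "low", "low"])
print(gain_ratio(collateral, risk))
```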

2.2.3. CART Algorithm

Classification and Regression Tree (CART) is a technique for generating binary decision trees. Its principle is recursive binary partitioning: each split divides the data into two sample subsets, so only two child nodes are generated, and a simple binary decision tree is finally obtained. Unlike ID3 and C4.5, which split on information entropy, CART chooses the best grouping variables and split points based on the Gini index and variance, and selects the attribute with the minimum Gini index as the current test attribute. The smaller the Gini index, the more reasonable the split and the higher the purity of the resulting sample subsets.

If the training tuple set $D$ contains records from $m$ classes, the Gini index of $D$ is determined as follows:

$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^2$$

The sum is taken over the $m$ classes, where $p_i$ is the probability that any record in $D$ belongs to class $C_i$ and is estimated by $|C_{i,D}|/|D|$. If $D$ is divided into $D_1$ and $D_2$, the Gini index of this division is

$$\mathrm{Gini}_A(D) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$

where $|D|$ is the number of samples in $D$, and $|D_1|$ and $|D_2|$ are the numbers of samples in $D_1$ and $D_2$, respectively.
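The sketch below evaluates $\mathrm{Gini}(D)$ and the Gini index of a binary split on the same toy data; it is only an illustration of the formulas, not CART itself.

```python
import numpy as np

def gini(labels):
    # Gini(D) = 1 - sum(p_i^2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(labels, mask):
    # Weighted Gini index of the binary split D -> (D1, D2)
    d1, d2 = labels[mask], labels[~mask]
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

risk = np.array(["low", "low", "high", "high", "low", "low"])
collateral = np.array(["yes", "yes", "no", "no", "yes", "no"])
print(gini(risk), gini_split(risk, collateral == "yes"))
```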

The CART algorithm terminates splitting and stops constructing the decision tree if either of the following conditions holds:
(1) The data records contained in a leaf node belong to the same category
(2) The number of samples covered by a branch is less than a threshold set by the user in advance

3. Experiments

3.1. Selection of Experimental Platform

In this paper, SPSS Clementine 12.0 is selected as the data mining platform for the actual research work. The choice of platform is mainly based on the following six aspects:
(1) Clementine provides classification and prediction, association analysis, time series analysis, and clustering, through a variety of methods such as neural networks, decision trees and regression trees, linear regression, logistic regression, self-organizing networks, and fast clustering
(2) Clementine has an interactive, visual user interface that combines an intuitive graphical interface with a variety of analysis techniques. Models are built by connecting nodes, and a data mining model can be built without programming, so users can devote their energy to applying data mining to specific business problems rather than to operating the software
(3) Clementine has an open database interface that provides rich data access capabilities for files and relational databases, as well as facilities for configuring input data processing and output data settings
(4) Clementine offers two ways to build models: in simple mode, the user does not need to make any settings and the system builds the model with default settings; in expert mode, the user can adjust the model parameters according to his own needs so that the model achieves the best results
(5) It provides powerful publishing capabilities, allowing data mining models or entire data mining processes to be exported to embedded systems
(6) It provides complete data stream management and project management functions. The former manages the data streams, data mining models, and mining results in the work area; the latter manages the entire project, so users can organize project files according to the stage of data mining and manage data mining projects by data streams, nodes, models, results, and other elements

3.2. Data Acquisition

In this paper, the data are obtained from the financial data of 1,500 relevant firm clients of a commercial bank, averaged over the years 2015 to 2018. The attributes in the financial information tables provided by the bank are transaction-database attributes, so attribute conversion is performed to form 18 attributes that reflect a firm's financial indicators, as shown in Table 1. First, according to the relevant indicators and the actual situation of the enterprises in 2018, the experts of financial institutions classify the risk of each enterprise as high, higher, medium, or low. Among them, high-risk enterprises are those that will fail between 2017 and 2018; higher-risk enterprises are those that will produce a credit default; medium-risk enterprises are those that have not defaulted but whose financial situation is deteriorating; and low-risk enterprises have a good financial situation and no credit default. A randomized method was used to construct each tree. To verify the stability of the decision tree classification, a total of 5 experiments were conducted. For each experiment, 1,200 records were randomly selected from the original dataset as training data, and 12 attributes were randomly selected from the 18 alternatives.
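A minimal sketch of the randomized construction just described, assuming hypothetical placeholder arrays X (1,500 records with the 18 indicator attributes) and y (the four risk levels): in each of the five runs, 1,200 records and 12 attributes are drawn at random and a decision tree is trained on that subset. It illustrates the sampling scheme only, not the actual Clementine workflow.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 18))                        # placeholder financial indicators
y = rng.choice(["high", "higher", "medium", "low"], size=1500)

trees = []
for run in range(5):                                   # five experiments
    rows = rng.choice(1500, size=1200, replace=False)  # 1,200 random training records
    cols = rng.choice(18, size=12, replace=False)      # 12 of the 18 attributes
    tree = DecisionTreeClassifier(random_state=run).fit(X[np.ix_(rows, cols)], y[rows])
    trees.append((cols, tree))                         # keep the column subset with its tree
```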

4. Discussion

4.1. Accuracy Comparison

(1) Accuracy analysis of the multiple random decision tree algorithm. To compare the data, we recorded the accuracy of a large number of random decision trees so that the comparison between them is clear; the comparison results are shown in Table 2 and Figure 2.

Based on the data presented in the graph, and as confirmed by bank personnel, the classification accuracy is informative for bank risk prediction. However, the algorithm's classification accuracy for high risk is relatively low. The main reason is that the number of high-risk records in the training data set is small, resulting in insufficient training of this branch.
(2) Accuracy analysis of the C4.5 algorithm, as shown in Table 3 and Figure 3.

The classification accuracy of the C4.5 algorithm for high risk is also relatively low, and the main reason is the same: the number of high-risk records in the training data set is small, which results in insufficient training of this branch. Table 4 compares the accuracy of the random decision tree algorithm and the C4.5 algorithm.

According to Table 4, the accuracy comparison chart between the random decision tree algorithm and the C4.5 algorithm is obtained, as shown in Figure 4.

From Figure 4, we can see that the accuracy of the random decision tree method is about 10% higher than that of C4.5. In order to improve the accuracy for high risk, 300 high-risk records were added to the training data set, and the original random sampling was replaced by stratified sampling: the original data are stratified into the high, higher, medium, and low risk levels, and random sampling is applied within each level to guarantee the number of high-risk training records. Tables 5 and 6 and Figures 5 and 6 show, respectively, the stratified sampling accuracy of the multiple random decision trees, the comparison of stratified sampling and random sampling accuracy, and the corresponding comparison charts; a sketch of the stratified selection is given below.
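A minimal sketch of the stratified selection described above, assuming hypothetical arrays X and y with 1,800 records (the original 1,500 plus the 300 added high-risk records): sampling within each risk level keeps the rare high-risk class represented in the 1,200 training records.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1800, 18))                         # 1,500 original + 300 added records
y = rng.choice(["high", "higher", "medium", "low"], size=1800,
               p=[0.05, 0.15, 0.30, 0.50])              # high risk is deliberately rare

X_train, _, y_train, _ = train_test_split(X, y, train_size=1200,
                                          stratify=y, random_state=0)
print(dict(zip(*np.unique(y_train, return_counts=True))))  # class mix is preserved
```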

We can see that the high-risk accuracy increases by 10% after stratified sampling, mainly because stratified sampling increases the number of high-risk samples. The accuracy of decision tree classification is thus related to the number of training samples: the larger the training sample, the more accurate the decision tree classification.

5. Conclusions

With the continuous progress of computer theory and technology, more and more computer data processing and analysis methods are being combined efficiently and organically with financial intervention work, bringing revolutionary innovation to the theory, mode, and method of financial intervention. In particular, the introduction of data mining technology brings new ideas to financial analysts, improves the efficiency and quality of financial intervention, and plays an increasingly important role. The main work of this paper is as follows:
(1) It introduces the background of the topic and the process steps and application fields of data mining and focuses on the commonly used data mining algorithms
(2) Based on the theory of information entropy and through theoretical proof, it proposes an objective and fair method to evaluate the clustering effect, applies this method to practical problems, and achieves good practical results. Because the raw data are partly incomplete and partly distorted, the accuracy of the model is affected to a certain extent. Further work is to increase the number of experimental samples, fully tap the potentially useful information, and add derivative variables so that the analysis results are more objective, convincing, and comprehensive and have greater practical value
(3) It analyses the decision tree classification technology in data mining, especially the application of the C4.5 algorithm to the loan data of a credit cooperative, establishes a decision tree and classification rules, and builds an audit analysis model that helps financial analysts find problems and clues to financial issues

Data Availability

This article does not cover data research. No data were used to support this study.

Conflicts of Interest

The author declares that there are no conflicts of interest.

Acknowledgments

This work was supported by the Humanities and Social Sciences Project of Henan Provincial Education Department (No. 2022-ZZJH-098), the Independent Innovation Application Research Project of Zhongyuan University of Technology (No. K2018YY023), and the Graduate Quality Engineering Project of Zhongyuan University of Technology (No. QY202102).