Abstract

Credit card fraud detection (CCFD) is important for protecting cardholders' property and the reputation of banks. Class imbalance in credit card transaction data is a primary factor affecting the classification performance of current detection models. Prior approaches aim to improve the prediction accuracy for minority class samples (fraudulent transactions), but this usually leads to a significant drop in predictive performance for majority class samples (legal transactions), which greatly increases the investigation cost for banks. In this paper, we propose a heterogeneous ensemble learning model based on data distribution (HELMDD) to deal with imbalanced data in CCFD. We validate the effectiveness of HELMDD on two real credit card datasets. The experimental results demonstrate that, compared with current state-of-the-art models, HELMDD has the best comprehensive performance. HELMDD not only achieves good recall rates for both the minority class and the majority class but also increases the savings rate for banks to 0.8623 and 0.6696 on the two datasets, respectively.

1. Introduction

With the rapid development of mobile internet and e-commerce technologies, online payment tools such as credit cards are used by a growing number of people. While credit cards bring convenience to customers, they also expose cardholders and banks to potential fraud risks [1, 2]. Credit card fraud is a global problem. The Nilson Report projected that worldwide fraud losses would reach $35.67 billion annually by 2023 [3]. Fraud prevention and fraud detection are the two main ways to combat credit card fraud [4]. Fraud prevention consists of a series of rules, procedures, and protocols. Commonly used technologies in fraud prevention include secure payment gateways, intrusion detection systems, and firewalls [5]. Fraud detection takes place after the fraud prevention mechanism has been breached [4], which means that fraud detection is the last line of defense for the security of credit card transactions. Banks have to invest considerable money to optimize their fraud detection systems [6], because they need to protect cardholders' funds and their own business reputation.

Data mining and machine learning are widely used technologies in financial fraud detection [7–9]. As early as 1998, researchers had begun to build CCFD systems based on machine learning techniques [10]. After more than two decades of development, researchers have proposed many different methods and models [2, 11]. In machine learning terms, CCFD is a typical binary classification problem. The detection system aims to determine, based on historical transaction data, whether the current transaction is legal (made by the cardholder) or fraudulent (made by an unauthorized person) [12]. Various methods have been proposed to tackle this problem, including supervised learning, unsupervised learning, and semisupervised learning. In supervised learning, the historical transaction data (training data) are labeled with known outcomes. Commonly used supervised learning models include Hidden Markov Model (HMM) [13], Logistic Regression (LR) [14], Support Vector Machine (SVM) [15], k-nearest neighbors (KNN) [16], Bayesian Networks (BN) [17], Decision Tree (DT) [18], random forest (RF) [19], and Artificial Neural Network (ANN) [20]. Conversely, the historical transaction data used in unsupervised learning models (ULMs) are unlabeled. ULMs judge whether transactions are fraudulent by observing the distribution of current and historical transaction data. Commonly used ULMs include artificial immune systems [21] and self-organizing maps [22]. Semisupervised learning models combine supervised and unsupervised learning: they use some labeled data together with a large amount of unlabeled data. This can help banks reduce the cost of labeling large volumes of transaction data [12, 23].

In the real world, the proportion of fraudulent transactions (minority class) is much lower than that of legitimate transactions (majority class), which means that the distribution of credit card transaction data is highly imbalanced, and this increases the difficulty of fraud detection [15, 24]. Most standard classifiers have poor performance on imbalanced data, especially for the minority class [25]. Resampling is a widely used method to address the problem of imbalanced classification data. Several resampling algorithms have been proposed to improve the recognition performance of classifiers for the minority class [26–28]. However, the disadvantage of the resampling method is that it significantly reduces the performance of classifiers for the majority class. For CCFD, this means that a large number of legal transactions are misclassified as fraudulent, which will significantly increase the investigation costs. Therefore, it is critical to build a CCFD model with strong recognition performance in both the minority and majority classes.

To address the above issues, we propose a new kind of heterogeneous ensemble learning model based on data distribution (HELMDD) for credit card fraud detection. The core idea is to incorporate a resampling method based on the distribution of data (RMDD). To reduce information loss in the majority class and improve the performance of the base classifiers, RMDD applies the KNN and K-Means algorithms to obtain samples from the majority class that retain its diversity and boundary contours. Finally, balanced subsets for training the base classifiers are obtained by pairing majority and minority class training subsets.

The main contributions of our study are as follows:
(1) We design a new undersampling method based on the distribution of majority class samples, RMDD, which can reduce information loss within the majority class.
(2) We design a novel combination of heterogeneous ensemble learning and our RMDD resampling method to obtain better prediction performance on highly imbalanced credit card transaction datasets.
(3) Experimental results on two real credit card fraud datasets demonstrate that the proposed model can achieve better performance.

2. Literature Review

2.1. Credit Card Fraud Detection Model

Credit card datasets contain detailed information about each transaction, such as account number, transaction amount, time, location, and merchant category. We can construct a model to determine whether a transaction is fraudulent by expressing the transaction-related information as vectors and calculating their similarity. Singh and Jain [29] reviewed the literature on CCFD and summarized the topical issues in current research, such as datasets, evaluation metrics, and the advantages and disadvantages of different models. Armel and Zaidouni [30] compared and analyzed the effectiveness of simple anomaly detection using DT, RF, and Naive Bayes (NB) in CCFD through a series of experiments. Sohony et al. [4] found that RF achieves higher accuracy in predicting legal transaction instances, while a Feedforward Neural Network (FNN) achieves higher accuracy in predicting fraudulent transaction instances. Consequently, they proposed an ensemble learning model based on RF and FNN.

Deep learning for CCFD has been discussed in several works [20, 31, 32]. Rushin et al. [20] conducted comparative experiments on deep learning, LR, and Gradient Boosted Tree (GBT) with a dataset containing approximately 80 million account level transactions. The results showed that the performance of deep learning models is better than the GBT and LR. Kim et al. [31] proposed a champion-challenger framework that includes deep learning and ensemble learning and evaluated it on a large transaction dataset taken from a major card issuing company in South Korea. Li et al. [32] proposed a deep representation learning model based on a full center loss function, which considers both distances and angles among different features.

Some studies have made improvements in feature engineering methods for credit card transaction data. Zhang et al. [24] proposed a feature engineering method based on homogeneity-oriented behavior analysis and then used a deep belief network for learning the extracted features. Lucas et al. [33] proposed an HMM-based feature engineering strategy that could incorporate sequential knowledge in the transactions in the form of HMM-based features, which enabled a nonsequential RF classifier to make use of the sequential information. Wu et al. [34] proposed a new feature engineering method to detect fraudulent cash-out of credit cards that considers both snapshot and dynamic behavioral patterns of cardholders and conducted a comparative experiment with the feature extraction method based on Whitrow’s strategy. Vlasselaer et al. [35] proposed a feature engineering method based on the network structure of cardholders and merchants and then calculated a time-dependent suspiciousness score for each network object.

Many other approaches have been used recently in the identification of credit card fraud. Gianini et al. [36] proposed a method of rule pool management based on game theory in which the system distributes suspicious transactions for manual investigation while avoiding the need to isolate the individual rules. Based on generative adversarial networks, Fiore et al. [37] proposed a method to generate simulated fraudulent transaction samples to improve the effectiveness of classification models. Carcillo et al. [38] proposed a scalable real-time CCFD framework that could deal with imbalance and feedback latency based on big data tools such as Spark. Their work provides a reference for real-time detection in massive credit card transaction data.

2.2. Imbalanced Data Learning Methods

Imbalanced distribution of data (class imbalance) has a great impact on the performance of classification models, reducing the accuracy of prediction in the minority class [25]. Some effective solutions for class imbalanced data have been proposed by many researchers. These solutions can be arranged into two groups: data level and algorithm level [2].

Resampling is a simple and efficient way to address the problem of class imbalance at the data level. Current resampling strategies can be divided into those that oversample the minority class samples and those that undersample the majority class samples. Commonly used oversampling methods include Random Oversampling (ROS), Synthetic Minority Oversampling Technique (SMOTE) [39], and Borderline-SMOTE [40]. For a highly imbalanced credit card transaction dataset, oversampling generates many minority class samples (fraudulent transactions). Although this can increase the learning weight of the classification model for minority class samples, it also increases computational complexity and generates many noise samples, which will reduce the predictive performance for the majority class (legal transactions). Commonly used undersampling methods include Random Undersampling (RUS), one-sided dynamic undersampling [41], and neighborhood-based undersampling [42]. The undersampling approach involves deleting a large number of majority class samples. This improves the computational efficiency of the classification model but may result in the loss of important information from the majority class samples, which can increase the false-positive rate of the classification model and lead to additional investigation costs for the banks.

Cost-sensitive learning technology is often used to address the problem of imbalanced datasets at the algorithm level. These learning models introduce some constraints and weights through a cost matrix based on the loss function of conventional learning models, which causes models to shift to a smaller total cost. The advantage of cost-sensitive learning technology is that it does not generate or add new information, thereby avoiding the introduction of external noise into the classification model. The disadvantage of cost-sensitive learning technology is that the establishment of the cost matrix needs to be estimated by business experts and cannot be calculated accurately. Commonly used cost-sensitive learning models include cost-sensitive SVM [43], cost-sensitive LR [44], and cost-sensitive DT [18].

Akila and Reddy [45] proposed a cost-sensitive risk-induced Bayesian inference bagging model for CCFD to help card issuers reduce costs. They verified the effectiveness of this model on a dataset from a Brazilian bank. Nami and Shajari [46] proposed a two-stage detection algorithm to address class imbalance in payment card fraud detection. The first stage extracts the relevant features from the transaction data and the second stage extracts the recent transaction behavioral characteristics of cardholders. In the second stage, a cost-sensitive dynamic random forest model is used to improve classification performance.

3. Methodology

In this section, we introduce the proposed heterogeneous ensemble learning model based on data distribution (HELMDD) in detail. It consists of two main components. The first is a resampling method based on data distribution (RMDD), as illustrated in Figure 1. RMDD undersamples the majority class based on the data distribution of the majority samples and creates several balanced training subsets by using KNN and K-Means. The second is a framework based on a heterogeneous ensemble learning model (HELM), as illustrated in Figure 2. HELM integrates seven kinds of heterogeneous classification models (LR, SVM, NB, DT, RF, AdaBoost, and XGBoost) using a bagging approach.

3.1. KNN

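KNN is a widely used supervised learning method. KNN predicts the category of a sample from the labels of its nearest neighbors, where closeness is measured by the Euclidean distance between points. For two n-dimensional points x = (x_1, ..., x_n) and y = (y_1, ..., y_n), the Euclidean distance is given by the following equation:

$$ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (1) $$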

RMDD divides the majority class samples of the training dataset into a subset of boundary samples and a subset of ordinary samples using a KNN algorithm. A selection of samples is then drawn from each of these subsets to create several new balanced datasets that contain cases from both the majority and minority classes. The advantage of this method is that the new balanced training subsets retain some of the boundary features of the majority class from the original training dataset, which can reduce information loss in these critical boundary cases.
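The boundary/ordinary split described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `split_majority_by_knn`, the parameter `fraud_threshold`, and the toy data are all hypothetical, and the rule "more than `fraud_threshold` fraud-labeled neighbors means boundary" is an assumed reading of the description.

```python
import math

def split_majority_by_knn(samples, labels, k=2, fraud_threshold=0):
    """Split majority samples (label 0) into boundary and ordinary index lists.

    Assumption: a majority sample is a boundary sample when more than
    `fraud_threshold` of its k nearest neighbors carry label 1.
    """
    boundary, ordinary = [], []
    for i, (x, y) in enumerate(zip(samples, labels)):
        if y != 0:
            continue  # only majority class samples are partitioned
        # Euclidean distance from x to every other training sample.
        dists = sorted(
            (math.dist(x, other), j)
            for j, other in enumerate(samples)
            if j != i
        )
        neighbors = [j for _, j in dists[:k]]
        fraud_neighbors = sum(1 for j in neighbors if labels[j] == 1)
        if fraud_neighbors > fraud_threshold:
            boundary.append(i)
        else:
            ordinary.append(i)
    return boundary, ordinary

# Toy data: three legitimate points near the origin, two fraud points
# near (1, 1), and one legitimate point sitting next to the fraud points.
toy_samples = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.1, 1.0), (0.9, 0.9)]
toy_labels = [0, 0, 0, 1, 1, 0]
boundary_idx, ordinary_idx = split_majority_by_knn(toy_samples, toy_labels)
print(boundary_idx, ordinary_idx)
```

In this toy example only the legitimate point adjacent to the fraud points ends up in the boundary set. The brute-force neighbor search is quadratic in the number of samples, so a real dataset would call for an indexed nearest-neighbor search instead.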

3.2. K-Means

K-Means is a popular unsupervised clustering algorithm. Taking a dataset D and the number of clusters K as inputs, the K-Means algorithm aims to divide D into K subsets quickly. Specifically, K-Means first randomly selects K samples as the initial cluster centroids. Second, for each sample x in the dataset, the Euclidean distance between x and the centroid of each cluster is calculated. If the distance between x and the centroid of cluster C_j is the shortest, x is assigned to cluster C_j. The third step is to calculate the average value of the samples in each cluster C_j and update the centroid of C_j. The second and third steps are repeated until the difference between the old centroid and the new centroid is less than a preset threshold. After the algorithm is executed, we obtain the data distribution of the majority class samples.
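The three steps above can be written out as a short pure-Python sketch. It is meant only to make the loop concrete; the function name `k_means`, the parameters `tol`, `max_iter`, and `seed`, and the handling of empty clusters (the old centroid is kept) are assumptions of this sketch, not details taken from the paper.

```python
import math
import random

def k_means(data, k, tol=1e-6, max_iter=100, seed=0):
    """Cluster `data` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 1: pick k samples at random as the initial centroids.
    centroids = rng.sample(data, k)
    for _ in range(max_iter):
        # Step 2: assign every sample to the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: math.dist(x, centroids[j]))
            clusters[nearest].append(x)
        # Step 3: move each centroid to the mean of its assigned samples.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[j])  # assumption: keep old centroid
        # Stop once no centroid moved by more than the preset threshold.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, clusters

toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_centroids, toy_clusters = k_means(toy_points, k=2)
print(toy_clusters)
```

On this toy input the two well-separated groups are recovered regardless of which samples are drawn as initial centroids; on real transaction data the result depends on the initialization, which is why library implementations usually run several restarts.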

In the task of imbalanced classification, undersampling methods can increase the learning weight of minority class samples, which helps classification models attain a higher recall rate. RUS is the most widely used method, but a significant defect of RUS is that it discards a large number of samples from the majority class, which may increase the false-positive rate of classification models. If the distribution of majority class samples is not considered during undersampling, the selected samples cannot represent the main features of the majority class, which can degrade the performance of the base classifiers. In our method, we divide the majority class into K clusters using the K-Means algorithm and then randomly sample from each cluster in different proportions. This resampling method takes the distribution of majority class samples into account and thus better retains the main features of these cases.

3.3. RMDD Resampling Method

RMDD is an undersampling algorithm that fully considers data distribution, which has three components. The first is to sample the minority class. Due to the highly imbalanced distribution in a CCFD dataset, we use all the minority class samples to improve recognition ability for the minority class of the base classifier. In the second part, we undersample the majority class to generate multiple subsets so that the number of majority class samples is the same as the number of minority class samples, which forms the core of the RMDD algorithm. The third part is to generate several balanced subsets to provide training data for the base classifiers by merging the minority class samples and the subset of majority class samples. The flowchart of the RMDD resampling algorithm is shown in Figure 1.

The second part above consists of the following four steps:
(1) We divide the majority class samples into a boundary sample set and an ordinary sample set using the KNN algorithm. For each sample in the majority class (labeled 0), we find its k nearest neighbors in the training set. If more than a threshold number of these neighbors have a label of 1, the sample is assigned to the boundary sample set; otherwise, it is assigned to the ordinary sample set.
(2) Random sampling with replacement is used to divide the boundary sample set into subsets. The number of samples in each subset depends on the number of samples in the boundary sample set and on a weight parameter, which is used to adjust the number of boundary samples in each subset.
(3) Using the K-Means algorithm, we divide the ordinary sample set into K clusters and sample with replacement from each cluster at a different sampling rate. The sampling rate of the i-th cluster is calculated from Equation (4), which is defined in terms of the number of samples in the i-th cluster and the number of samples in the ordinary sample set. The number of samples randomly selected from the i-th cluster is calculated from Equation (5), which also involves the number of minority samples.
(4) We combine each boundary sample subset with an ordinary sample subset to construct a subset of majority class samples.

In each resulting subset, the number of majority class samples is equal to the number of minority class samples.
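The steps above can be sketched as follows, under clearly stated assumptions. The exact forms of the boundary-size formula and of Equations (4) and (5) are not reproduced here, so this sketch assumes that (a) each subset receives `round(boundary_weight * n_minority)` boundary samples, and (b) the remaining slots are shared among the clusters in proportion to cluster size. The names `allocate`, `rmdd_subsets`, and `boundary_weight` are hypothetical.

```python
import random

def allocate(cluster_sizes, total):
    """Share `total` draws among clusters in proportion to their size."""
    n_ordinary = sum(cluster_sizes)
    exact = [total * size / n_ordinary for size in cluster_sizes]
    counts = [int(e) for e in exact]
    # Hand out the draws lost to truncation, largest fractional part first,
    # so that the counts add up to exactly `total`.
    by_fraction = sorted(
        range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in by_fraction[: total - sum(counts)]:
        counts[i] += 1
    return counts

def rmdd_subsets(minority, boundary, clusters, n_subsets, boundary_weight=0.2, seed=0):
    """Build `n_subsets` balanced (majority, minority) training subsets."""
    rng = random.Random(seed)
    n_minority = len(minority)
    # Assumed rule: a fixed share of each majority subset comes from the boundary set.
    n_boundary = round(boundary_weight * n_minority) if boundary else 0
    per_cluster = allocate([len(c) for c in clusters], n_minority - n_boundary)
    subsets = []
    for _ in range(n_subsets):
        # Sampling with replacement from the boundary set ...
        majority = rng.choices(boundary, k=n_boundary) if n_boundary else []
        # ... and from every K-Means cluster of ordinary samples.
        for cluster, n_i in zip(clusters, per_cluster):
            if n_i:
                majority += rng.choices(cluster, k=n_i)
        # All minority samples are reused in every subset.
        subsets.append((majority, list(minority)))
    return subsets

toy_minority = [f"fraud{i}" for i in range(10)]
toy_boundary = ["b0", "b1", "b2"]
toy_clusters = [
    [f"c0_{i}" for i in range(60)],
    [f"c1_{i}" for i in range(30)],
    [f"c2_{i}" for i in range(10)],
]
toy_subsets = rmdd_subsets(toy_minority, toy_boundary, toy_clusters, n_subsets=3)
print([len(maj) for maj, _ in toy_subsets])
```

Whatever the exact formulas, the invariant this sketch preserves is the one stated in the text: every majority subset has the same size as the minority class, and every cluster contributes in proportion to its size.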

3.4. Framework of HELM

To improve classification performance, we propose a heterogeneous ensemble learning model (HELM) framework, as shown in Figure 2. The HELM framework uses data resampling and ensemble learning technologies to address the problem of imbalanced data in CCFD. Through the training and screening of multiple heterogeneous base classifiers, we improve the robustness of HELM as well as avoid reliance on a single classifier. The HELM framework consists of two phases: (1) the data processing phase and (2) the model training and selection phase.

3.4.1. Data Processing Phase

The main task of this phase is to preprocess the original credit card transaction dataset, including feature selection, data normalization, dataset division, and resampling. First, we divide the original dataset into a training set and a test set. The training set is used for estimating the parameters of the classification model, and the test set is used for evaluating the trained classification model. We further divide the training set into a training subset and a validation subset and use the RMDD algorithm to resample the training subset, which turns the highly imbalanced training subset into several balanced subsets for training the base classifiers. The RMDD algorithm takes the distribution of majority class samples into account. Boundary samples are distinguished from ordinary samples by applying the KNN algorithm. Ordinary samples are grouped into K clusters by the K-Means algorithm. Through these two algorithms, we can build several balanced subsets and ensure that each balanced training subset contains a certain ratio of boundary samples and of ordinary samples from each cluster. The advantage of this is that more feature information of the majority class samples is preserved while generating the new balanced training subsets. In addition, the introduction of boundary samples retains some of the boundary contours of the original dataset in each new balanced subset, which can help to improve classification performance.

3.4.2. Model Training and Selection Phase

When the preprocessing phase is completed, we have several balanced training subsets. For each subset, we train seven different base classifiers: LR, SVM, NB, DT, RF, AdaBoost, and XGBoost. Then, we use an imbalanced validation subset to obtain the Area Under the Curve (AUC) score of each base classifier and select the base classifier with the best AUC score as the recommended classifier for that subset. Finally, we obtain an ensemble learning model made up of the recommended classifiers, each trained on a different subset; these may be heterogeneous or homogeneous, depending on which classifier is selected for each subset. For samples in the test dataset, each recommended classifier gives an initial prediction; the final prediction is then generated by voting across the recommended classifiers. In the model selection phase, we use the AUC score as the selection criterion because it takes into account the prediction accuracy of both the majority and minority classes, which gives a good compromise between the accuracy and recall metrics of the classification model. In credit card fraud prediction, misclassifying legitimate transactions as fraudulent or fraudulent transactions as legitimate incurs costs for banks and customers, such as loss of the transaction amount and manual investigation costs. Therefore, by comparing the AUC scores of multiple base classifiers and selecting the base classifier with the best AUC score to build an ensemble model, we can effectively improve prediction performance and reduce economic losses for cardholders and banks.
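The selection-and-voting procedure can be illustrated with a small self-contained sketch. Two toy scorers stand in for the seven base classifiers, and several details are assumptions of the sketch rather than statements from the paper: a simple majority vote with ties resolved as legitimate, a score cutoff of 0.5, and all function names (`auc`, `fit_threshold`, `fit_constant`, `select_and_vote`).

```python
def auc(scores, labels):
    """Rank-based AUC: share of (fraud, legitimate) pairs ranked correctly."""
    fraud = [s for s, y in zip(scores, labels) if y == 1]
    legit = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if f > l else 0.5 if f == l else 0.0 for f in fraud for l in legit)
    return wins / (len(fraud) * len(legit))

def fit_threshold(train_x, train_y):
    """Toy stand-in for a real classifier: cut feature 0 between the class means."""
    legit = [x[0] for x, y in zip(train_x, train_y) if y == 0]
    fraud = [x[0] for x, y in zip(train_x, train_y) if y == 1]
    cut = (sum(legit) / len(legit) + sum(fraud) / len(fraud)) / 2
    return lambda x: 1.0 if x[0] > cut else 0.0

def fit_constant(train_x, train_y):
    """Toy stand-in that ignores its input and always returns the same score."""
    return lambda x: 0.5

def select_and_vote(subsets, candidates, val_x, val_y, test_x):
    chosen = []
    for train_x, train_y in subsets:
        # Train every candidate on this balanced subset.
        fitted = {name: fit(train_x, train_y) for name, fit in candidates.items()}
        # Keep the candidate with the best AUC on the imbalanced validation subset.
        best = max(fitted, key=lambda name: auc([fitted[name](x) for x in val_x], val_y))
        chosen.append((best, fitted[best]))
    predictions = []
    for x in test_x:
        # Assumed voting rule: strict majority of recommended classifiers.
        votes = sum(1 for _, model in chosen if model(x) >= 0.5)
        predictions.append(1 if 2 * votes > len(chosen) else 0)
    return [name for name, _ in chosen], predictions

toy_subsets = [
    ([(1,), (2,), (8,), (9,)], [0, 0, 1, 1]),
    ([(0,), (3,), (7,), (10,)], [0, 0, 1, 1]),
]
toy_candidates = {"threshold": fit_threshold, "constant": fit_constant}
names, preds = select_and_vote(
    toy_subsets, toy_candidates,
    val_x=[(1,), (2,), (3,), (9,)], val_y=[0, 0, 0, 1],
    test_x=[(0,), (10,)],
)
print(names, preds)
```

Because selection is done independently per subset, the final ensemble may contain the same classifier type several times or a mix of types, which matches the "heterogeneous or homogeneous" outcome described above.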

The HELM framework can be deployed in a distributed manner. Base classifier training tasks on different subsets in HELM can be assigned to different cluster nodes. Each node can perform model training in parallel during periods of low credit card transactions (such as the early morning). Since the proportion of fraudulent transactions is very low, the balanced training subset space generated by the RMDD algorithm is quite small, which can significantly reduce the training time of base classifiers. Compared with other traditional methods, our HELM framework can significantly reduce model training and deployment time by reducing the training sample space and facilitating the application of parallel computing technologies.

4. Experiments

4.1. Dataset Description

In this paper, we use two real credit card transaction datasets: one from Kaggle (public dataset) and one from a bank in China (our private dataset). The detailed statistics are shown in Table 1.
(1) Kaggle dataset [47]. This dataset is composed of credit card transaction records of European cardholders in September 2013. The time span of these transactions is two days, and each transaction record contains 30 features. Due to privacy considerations, 28 of the features were encoded by Principal Component Analysis (PCA); the only two features left in their original form are transaction time and amount. This dataset contains a total of 284,807 instances, of which 492 are minority class samples (fraudulent transactions). The fraud rate of this dataset is 0.173%, which indicates that the dataset is highly imbalanced.
(2) Our private dataset. This dataset is provided by a bank in China and contains credit card transaction records of customers on a typical day in May 2017. Each instance has 23 features, including some personal information of cardholders (such as age, gender, marital status, and education level) and transaction-related features (such as transaction amount, time, and merchant number). This dataset contains 24,024 instances, including 660 fraud instances. The fraud rate of this dataset is 2.747%.

4.2. Performance Measures

A confusion matrix provides helpful information about the actual labels and the labels predicted by the classification model. The confusion matrix used in this study is shown in Table 2. Because our credit card datasets are highly imbalanced, widely used evaluation indexes (such as accuracy and precision) do not fully represent the performance of classification models. For example, if we classify all samples in the Kaggle dataset as legitimate transactions, the accuracy will be about 99.8%, even though not a single fraudulent transaction is detected; such a model is clearly not a good classification model. Therefore, we choose Fra_Recall (fraud class recall), Leg_Recall (legal class recall), G-mean, AUC, and savings rate [48] to evaluate the model.
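The accuracy of this trivial "everything is legitimate" classifier follows directly from the Kaggle dataset counts given in Section 4.1. The short check below uses only those two counts; the variable names are illustrative.

```python
# Counts from the Kaggle dataset description in Section 4.1.
total_transactions = 284_807
fraud_transactions = 492

# A classifier that labels every transaction as legitimate is right on all
# legitimate transactions and wrong on all fraudulent ones.
accuracy_all_legit = (total_transactions - fraud_transactions) / total_transactions
fra_recall_all_legit = 0 / fraud_transactions  # no fraud is ever flagged

print(f"accuracy = {accuracy_all_legit:.4%}, Fra_Recall = {fra_recall_all_legit}")
```

The accuracy is simply one minus the fraud rate, which is why it says nothing about the model's ability to detect fraud.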

Fra_Recall and Leg_Recall are calculated by Equations (7) and (8), respectively. The larger the Fra_Recall value, the higher the proportion of fraudulent transactions identified by the classification model, and the more fraud losses can be avoided for banks and cardholders. The larger the Leg_Recall value, the higher the proportion of legitimate transactions identified by the classification model, and the greater the investigation costs that can be saved for banks. An ideal model has both Fra_Recall and Leg_Recall close to 1. G-mean and AUC are important measures that are widely used in model evaluation studies in the presence of imbalanced data. The larger the G-mean and AUC values, the better the performance of the classification model. G-mean and AUC can be calculated by Equations (9) and (10).

The two sets used in these equations denote the collection of fraudulent transactions and the collection of legitimate transactions, respectively.
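As a concrete reference, the sketch below computes the four measures using their standard definitions, with fraud as the positive class: Fra_Recall = TP / (TP + FN), Leg_Recall = TN / (TN + FP), G-mean as the geometric mean of the two recalls, and AUC as the fraction of (fraudulent, legitimate) pairs in which the fraudulent transaction receives the higher score. That these standard forms coincide with Equations (7) to (10) is an assumption; the function names are illustrative.

```python
import math

def recall_scores(y_true, y_pred):
    """Return (Fra_Recall, Leg_Recall, G-mean); label 1 = fraud, 0 = legitimate."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fra_recall = tp / (tp + fn)   # share of fraud that is caught
    leg_recall = tn / (tn + fp)   # share of legitimate traffic left alone
    return fra_recall, leg_recall, math.sqrt(fra_recall * leg_recall)

def auc_score(y_true, scores):
    """Pairwise (rank-based) AUC over fraud/legitimate pairs; ties count as 0.5."""
    fraud = [s for s, y in zip(scores, y_true) if y == 1]
    legit = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if f > l else 0.5 if f == l else 0.0 for f in fraud for l in legit)
    return wins / (len(fraud) * len(legit))

toy_truth = [1, 1, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 0, 1]
toy_scores = [0.9, 0.4, 0.1, 0.2, 0.3, 0.8]
print(recall_scores(toy_truth, toy_pred), auc_score(toy_truth, toy_scores))
```

In the toy example one of two frauds is caught (Fra_Recall 0.5) and three of four legitimate transactions are passed (Leg_Recall 0.75), so G-mean penalizes the imbalance between the two recalls in a way that plain accuracy does not.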

The savings rate is an indicator to which banks attach great importance, because it quantifies the economic benefits that a fraud detection model can create for the bank. The CCFD cost matrix [48] is shown in Table 3. Each transaction has an actual label and a predicted label given by the classifier. If a transaction is predicted to be fraudulent (TP or FP), the bank needs to investigate it, incurring an investigation cost. Conversely, if the transaction is predicted to be legitimate (TN or FN), there is no investigation cost, but in the case of FN, the loss of the bank is equal to the transaction amount. If no classifier is used for CCFD, the total loss of the bank is calculated by Equation (11). The proportion of cost saved for the bank by using a classifier is calculated by Equation (12).
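The cost logic described above can be made concrete with a small sketch. It assumes that the no-classifier loss is the sum of all fraudulent transaction amounts and that the savings rate is the relative reduction of that loss; both follow the verbal description, but the exact forms of Equations (11) and (12) are not reproduced here. The names `classifier_cost`, `savings_rate`, and `admin_cost`, and the toy amounts, are hypothetical.

```python
def classifier_cost(y_true, y_pred, amounts, admin_cost):
    """Total cost incurred by the bank when acting on the classifier's predictions."""
    cost = 0.0
    for actual, predicted, amount in zip(y_true, y_pred, amounts):
        if predicted == 1:
            cost += admin_cost   # TP or FP: the transaction is investigated
        elif actual == 1:
            cost += amount       # FN: the fraud goes through and the amount is lost
        # TN: no cost
    return cost

def savings_rate(y_true, y_pred, amounts, admin_cost):
    """Relative cost reduction versus using no classifier (assumed baseline)."""
    no_classifier_loss = sum(a for y, a in zip(y_true, amounts) if y == 1)
    cost = classifier_cost(y_true, y_pred, amounts, admin_cost)
    return (no_classifier_loss - cost) / no_classifier_loss

toy_truth = [1, 1, 0, 0]
toy_pred = [1, 0, 1, 0]            # one TP, one FN, one FP, one TN
toy_amounts = [100.0, 200.0, 50.0, 30.0]
print(savings_rate(toy_truth, toy_pred, toy_amounts, admin_cost=10.0))
```

The example shows why both recalls matter for savings: the missed fraud costs its full amount, while every false alarm adds an investigation cost, so a model that flags too many legitimate transactions can erode the savings it earns by catching fraud.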

4.3. Experimental Design

To evaluate the effectiveness of the HELMDD model, we conducted experiments on two real credit card datasets and compared the proposed model with several competing approaches. The competing models fall into two categories: independent models and ensemble learning models. The independent models used in our experiments are LR, SVM, NB, and DT. The ensemble learning models are RF, AdaBoost, and XGBoost. In addition, we combined these models with different resampling methods, such as SMOTE and RUS.

5. Experimental Results and Discussion

To directly compare with previous works, we evaluate our model using 10-fold cross-validation similar to prior approaches on the two datasets. The experimental results for each classification model on the Kaggle dataset and our private dataset are shown in Tables 4 and 5, respectively. For convenience of comparison, we have also presented the data of Tables 4 and 5 in histogram form, as shown in Figures 3 and 4. In Tables 4 and 5, numbers in italic indicate the best values of the model in the corresponding evaluation measure.

For the Kaggle dataset, we compare the proposed model with several competing approaches and show the results in Table 4. From the results, we can observe the following:
(1) For the same classification model, preprocessing the training subset with a resampling method yields better performance than training on the original imbalanced training subset. Fra_Recall, AUC, and G-mean improve to different degrees. For example, the gain in Fra_Recall ranges from 0.0235 (DT model with SMOTE method) to 0.2353 (LR model with RUS method), the gain in AUC ranges from 0.0023 (AdaBoost model with SMOTE method) to 0.0275 (NB model with RUS method), and the gain in G-mean ranges from 0.0018 (DT model with RUS method) to 0.1239 (LR model with RUS method). The main reason is that preprocessing the original imbalanced dataset with the SMOTE or RUS method increases the learning weight of fraudulent transaction instances and therefore enhances the ability of the models to identify fraudulent transactions.
(2) When the same resampling method is applied, the AUC obtained by ensemble learning models is generally better than that of independent learning models. As shown in Table 4, the highest AUC values obtained by independent learning models under the three training data settings are 0.9559 (imbalanced data), 0.9678 (SMOTE method), and 0.9697 (RUS method), while the average AUC values obtained by ensemble learning models under the same settings are 0.9621 (imbalanced data), 0.9678 (SMOTE method), and 0.9713 (RUS method), so the ensemble learning models are slightly better than the independent learning models. This is because ensemble learning models are strengthened by combining multiple weak classification models. Compared with independent models, ensemble models can obtain a smaller deviation and better generalization ability.
(3) For the same classification model, using RUS to preprocess the training dataset achieves better Fra_Recall and AUC than using SMOTE. For example, the gain in Fra_Recall ranges from 0.0047 (NB model) to 0.0589 (DT model), and the gain in AUC ranges from 0.0010 (RF model) to 0.0149 (NB model). However, we cannot ignore that Leg_Recall decreases by 0.0095 (SVM model) to 0.0903 (DT model). This is because the RUS method discards many legitimate transaction samples, which improves the identification of fraudulent transactions while increasing the false prediction rate for legitimate transactions.
(4) In terms of savings score, we have two findings. First, after resampling the original imbalanced training data with the SMOTE method, the savings score of six classification models improves to varying degrees. For example, the savings score of the XGBoost model increases from 0.7656 to 0.8194, and that of the LR model increases from 0.5980 to 0.7591. Second, when we use the RUS method to resample the original data, the savings score of three classification models (SVM, DT, and AdaBoost) decreases to different degrees (for example, the SVM model drops from 0.7893 to 0.7237), while the savings score of the other four classification models improves to different degrees (for example, the XGBoost model rises from 0.7656 to 0.7886). There may be two reasons for this. First, the savings score is highly correlated with the recognition rate of fraudulent transactions; the SMOTE method can help classification models increase the recall rate of fraudulent transactions and reduce the fraud losses of banks. Second, the RUS method discards many legitimate transaction samples; although this strengthens the learning of fraud samples and improves the recall rate of fraudulent transactions, it also increases the false prediction rate for legitimate transactions and therefore the investigation cost for banks.
(5) The HELMDD model proposed in this paper achieved the best AUC, G-mean, and savings scores, which were 0.0128, 0.0054, and 0.0429 higher than those of previous state-of-the-art methods, respectively. The model showed good stability. While obtaining the second highest Fra_Recall, it did not significantly reduce Leg_Recall, thus ensuring that banks can achieve greater savings. The overall performance of HELMDD is better than that of the ensemble learning models (such as XGBoost, AdaBoost, and RF) with different resampling methods. This is because the RMDD resampling algorithm takes the distribution of legitimate transaction samples into account. Samples extracted from the boundary subset and from multiple clusters retain the diversity and boundary contours of legitimate transaction samples. In addition, the selection mechanism for base classification models also helps to improve the overall performance of the framework.

Table 5 presents the performance comparison between our approach and other competitive methods on our private dataset. From the results, we can observe the following:
(1) Using the SMOTE method to resample the training dataset does not necessarily improve the performance of the classification models and may even degrade it. In Table 5, the AUC and G-mean obtained by SVM, NB, and DT combined with SMOTE improve to varying degrees. For example, AUC increases by between 0.0067 (DT model) and 0.0314 (SVM model), and G-mean increases by between 0.0207 (SVM model) and 0.1040 (NB model). However, the overall performance of LR, RF, AdaBoost, and XGBoost combined with SMOTE decreases to varying degrees: AUC decreases by between 0.0276 (LR model) and 0.0688 (RF model), and G-mean decreases by between 0.0300 (XGBoost model) and 0.0568 (RF model). This is possibly because the SMOTE method generates a large amount of minority-class noise when resampling the training dataset, which degrades the performance of some classification models.
(2) When RUS is used to resample the training dataset, the AUC of the LR model drops from 0.7334 to 0.7250, whereas the AUC obtained by the other six classification models improves by between 0.0081 (XGBoost model) and 0.0962 (DT model). This may be because RUS generates no new minority samples, which avoids introducing noise samples and improves the performance of the classification models.
(3) The HELMDD model proposed in this article achieves the best Fra_Recall, AUC, G-mean, and savings scores, which are 0.0254, 0.0317, 0.0381, and 0.0578 higher than the corresponding measures of the previous state-of-the-art models, respectively. The validity and stability of the HELMDD model are thus verified again.
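
The recall-based measures compared above can be computed from a set of predictions as sketched below. This is a minimal Python example that assumes the conventional definitions: Fra_Recall is the recall of the fraud class, Leg_Recall is the recall of the legitimate class, and G-mean is the geometric mean of the two. The function name `recalls_and_gmean` and the toy labels are illustrative and are not taken from our experiments.

```python
import math

def recalls_and_gmean(y_true, y_pred, fraud_label=1):
    """Compute (Fra_Recall, Leg_Recall, G-mean) for binary labels."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == fraud_label:
            if pred == fraud_label:
                tp += 1  # fraud correctly flagged
            else:
                fn += 1  # fraud missed
        else:
            if pred == fraud_label:
                fp += 1  # legitimate transaction wrongly flagged
            else:
                tn += 1  # legitimate transaction correctly passed
    fra_recall = tp / (tp + fn) if (tp + fn) else 0.0
    leg_recall = tn / (tn + fp) if (tn + fp) else 0.0
    return fra_recall, leg_recall, math.sqrt(fra_recall * leg_recall)

# Toy example (hypothetical): 4 fraudulent and 10 legitimate transactions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
fra_recall, leg_recall, g_mean = recalls_and_gmean(y_true, y_pred)
print(fra_recall, leg_recall, round(g_mean, 4))  # 0.75 0.8 0.7746
```

Because G-mean multiplies the two recalls, a resampling method that raises Fra_Recall at a large cost in Leg_Recall can still lower G-mean, which is consistent with the mixed SMOTE results reported above.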

6. Ablation Study

We conduct an ablation study to investigate the effectiveness of our model components.

Table 6 shows the effects of the different resampling and model ensemble methods on the AUC and savings scores. Here, -RMDD denotes replacing the RMDD resampling technique in HELMDD with RUS, and -HELM denotes replacing the ensemble of seven heterogeneous models in HELMDD with XGBoost. Base is the model obtained by applying both ablations, that is, the default XGBoost model with RUS resampling. For both the Kaggle dataset and our private dataset, we observe that RMDD and HELM are each beneficial for identifying fraudulent transactions and for controlling the cost of investigating fraudulent transactions. The reason is that the two model components can significantly improve the recognition rate of fraudulent transactions without reducing the recognition rate of legitimate transactions.
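
The four configurations compared in the ablation can be written out explicitly. The Python sketch below only enumerates which resampler and which learner each variant uses, following the definitions given above; it does not train any model, and the function name `ablation_variants` and the string labels are illustrative assumptions.

```python
from itertools import product

def ablation_variants():
    """Map each ablation variant to its (resampler, learner) pair."""
    variants = {}
    for use_rmdd, use_helm in product([True, False], repeat=2):
        resampler = "RMDD" if use_rmdd else "RUS"
        learner = "heterogeneous ensemble" if use_helm else "XGBoost"
        if use_rmdd and use_helm:
            name = "HELMDD"  # full model
        elif use_helm:
            name = "-RMDD"   # RMDD replaced by RUS
        elif use_rmdd:
            name = "-HELM"   # ensemble replaced by XGBoost
        else:
            name = "Base"    # both components ablated
        variants[name] = (resampler, learner)
    return variants

variants = ablation_variants()
for name, (resampler, learner) in variants.items():
    print(f"{name}: {resampler} + {learner}")
```

Laying the variants out this way makes it clear that each ablation changes exactly one component relative to the full model, so a difference between HELMDD and a single-ablation variant can be attributed to that component.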

7. Conclusions

In this paper, we propose a heterogeneous ensemble learning model based on data distribution (HELMDD) to address the highly imbalanced data distribution encountered in CCFD. In our HELMDD model, we first propose an undersampling method, RMDD, based on the distribution of the majority class. RMDD divides the majority class into boundary samples and ordinary samples and then generates multiple balanced subsets based on the idea of clustering to train multiple base classifiers. The RMDD algorithm can maintain the classification boundary contours of the majority class and reduce the loss of sample information. Therefore, our model can obtain a higher majority class recall rate while also improving the minority class recall rate. In terms of model selection, we choose the base classifiers that obtain the best AUC score on each balanced subset to generate an ensemble model, which helps to improve classification performance. Finally, we evaluate the proposed method on the Kaggle dataset and our private dataset. The results show that HELMDD achieves new state-of-the-art performance compared to other competing approaches.

Data Availability

The labeled dataset used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Nos. 61732022, 61732004, 61672020, and 62072131) and the National Key R&D Program of China (Nos. 2017YFB0802204, 2019QY1406, and 2017YFB0803303).