Detecting fraudulent accounts through their transaction networks helps to proactively prevent illegal transactions in financial scenarios. In this paper, three convolutional neural network (CNN) models, i.e., NTD-CNN, TTD-CNN, and HDF-CNN, are created to identify whether a bank account is fraudulent. The three models share the same structure but differ in the type of input features. First, we embed the bank accounts’ historical trading records into a general directed and weighted transaction network. Then, a DirectedWalk algorithm is proposed for learning an account’s network vector. DirectedWalk learns social representations of a network’s vertices by modeling a stream of directed and time-related trading paths. The local topological feature, generated from the accounts’ network vectors, is taken as the input of NTD-CNN, while TTD-CNN takes the time series transaction feature as input. Finally, the two kinds of heterogeneous data are integrated into a novel feature matrix and fed into HDF-CNN for classifying bank accounts. Experiments conducted on a real bank transaction dataset show the advantage of HDF-CNN over the existing methods.

1. Introduction

According to Cornell University Law School (CULS) [1], bank fraud is defined as “whoever knowingly executes, or attempts to execute, a scheme or artifice to defraud a financial institution; or to obtain any of the moneys, funds, credits, assets, securities, or other property owned by, or under the custody or control of, a financial institution, by means of false or fraudulent pretenses, representations, or promises”. Hence, fraudulent banking activities include, but are not limited to, money laundering, illegal pyramid selling, and illegal fund-raising. These activities mainly involve financial flows across bank accounts. Identifying the fraudulent accounts among a massive number of bank accounts is of great significance in cracking down on economic crime. In this paper, we refer to this task as Fraud Account Detection (FAD). Investigating a large volume of financial transactions to identify fraudsters cannot be done manually because of its heavy cost in both time and labor. Therefore, automatic FAD has attracted increasing research interest.

A FAD issue can be regarded as a binary classification problem over accounts. To improve the classification performance, deep learning is employed to address FAD on the basis of the accounts’ trading behaviors. $D = \{X, Y\}$ is used to denote the dataset that includes the whole bank account information, where $X = \{x_1, x_2, \dots, x_n\}$ is the information set of bank accounts and $x_i \in \mathbb{R}^m$ means the $m$-dimensional feature vector of the $i$th account. Moreover, the set $Y = \{y_a, y_n\}$ is utilized as the label set of FAD, in which $y_a$ and $y_n$ represent the abnormal label and the normal label, respectively. Therefore, the target of a FAD task is to assign the correct label from $Y$ to each bank account in $X$.

Back in the 1960s, the biological research of Hubel et al. [2] showed that the transfer of visual information from the retina to the brain is accomplished by the activation of multiple receptive fields. In recent years, deep learning methods have been employed in a wide range of areas, e.g., image processing [3], Natural Language Processing (NLP) [4], and network classification [5]. The convolutional neural network (CNN) architecture has been widely used for two reasons: its flexible structure is easy to transfer to other scenarios, and it extracts features automatically. The good scalability of the CNN structure has made it successful in many classification problems. In a specific classification scenario, one can tune the structural settings of a CNN, e.g., the number of layers, the number of neurons in each layer, the type of pooling function, and the activation function, to achieve the best performance.

By abstracting the bank accounts into vertices and their transaction relationships into directed edges, the trading behaviors of accounts can be formed into a directed and weighted network. The transaction relationship information and time series information of bank accounts are embedded into the generated network. As mentioned above, CNN models obtain excellent performance in time series classification and social network analysis. The strengths of the convolution kernel and the structural design inspire us to employ the CNN framework for the FAD issue. Therefore, with the labelled data provided by economic investigation experts, three CNN models are proposed to address the FAD issue. The models are listed as follows. A CNN model that uses network topological data (NTD) is called the NTD-CNN model. A CNN model that uses time series transaction data (TTD) is referred to as the TTD-CNN model. A CNN model that employs the two kinds of heterogeneous data features (HDF), which are extracted from the former two kinds of data, is called the HDF-CNN model for short. The experiments on a real dataset, containing illegal pyramid selling accounts, demonstrate the effectiveness of our three CNN models. Except for TTD-CNN, the CNN models achieve better performance than the traditional anomaly detection methods regarding precision, sensitivity, and F1-score. In summary, the classification performance of HDF-CNN is much better than that of the other three methods. To the best of our knowledge, this is the first time that CNN is applied to this application domain.

The main contributions of this paper are listed as follows.
(i) This paper establishes a general mathematical model of the account transaction network, embedding the transaction relationships and timestamp information, to represent accounts’ historical trading behavior. The network is used as the foundation for learning the network vectors of accounts.
(ii) This paper proposes a DirectedWalk algorithm to learn the accounts’ network vectors. DirectedWalk quantifies the local topological structures of the transaction network into high-dimensional vectors.
(iii) This paper devises three CNN models for detecting fraudulent bank accounts, including HDF-CNN, which classifies an account by using the combination of its two kinds of heterogeneous data.
(iv) The experimental results show that HDF-CNN achieves the best classification performance.

The rest of the paper is organized as follows. The related work is presented in Section 2. Section 3 gives the classification features and proposes three CNN models in turn. Performance evaluation on the proposed models is analyzed in Section 4. Section 5 concludes the paper, and finally Section 6 describes our future work.

2. Related Work

Few studies have specifically addressed the quantitative analysis of bank account transaction activities; most studies focus on detecting anomalous activities by using historical transaction records.

Zhu [6] develops a new empirical mode decomposition method that uses information from the same group to detect suspicious financial transaction time series data. The model shows its superiority in analyzing nonlinear and nonstationary stochastic time series data. However, the influence of the transaction relationships is ignored in describing the suspicious behavior. Wang et al. [7] introduce a decision tree to create determination rules of trading risk by using customer profiles, which come from a commercial bank in China. The resulting classification model, benefiting from the determination rules, achieves a high recall rate.

Yang [8] establishes a central model of criminal networks for discovering their core figures, with the help of the Fisher Discriminant Analysis method. In order to control and reduce loan risk, Cao et al. [9] construct a loan classification model by using an improved optimization technique and a multiclass support vector machine (SVM) [10] method. Ling et al. [11] build a customer credit scoring model by using a novel multikernel function and a chaos particle swarm optimization method. A multilayer neural network and the SVM method are utilized in [12] for predicting whether a loan applicant can be classified as solvent or bankrupt. To detect fraudulent credit-card activities, in [13], the transaction aggregation strategy is expanded into a comprehensive strategy, which combines transaction timestamps and spending behavioral patterns. In order to identify illegal pyramid selling networks with social behavioral data, Li et al. [14] model the ego-networks of different types of users, e.g., regular users and illegal pyramid scheme users. The authors take the structural property into consideration when analyzing the statistical features of illegal pyramid networks. Reference [15] proposes a novel framework for creating accounts’ trading behavior profiles and detects the members of pyramid schemes by using anomaly detection methods. The features used are extracted from the sequential transaction records of an account. Four kinds of anomaly detection methods are tested in [15], showing that IForest [16] is the best method in this scenario. Taking the financial transaction relationships between clients into consideration, Ma [17] advances a novel algorithm for discovering anomalous groups in the financial network by mining closed-loop client transaction relationships.

Summary. Many techniques and algorithms have been proposed for dealing with fraud detection. However, most of the existing feature extraction methods are designed for specific fields and are difficult to transfer to other scenarios. Therefore, this paper attempts to break this bottleneck by using the CNN architecture.

3. CNN for FAD

This section first gives the problem definitions. Then, we propose a DirectedWalk algorithm for learning the network vectors of the accounts. Subsequently, three CNN models, fed with different kinds of features, are devised for classifying bank accounts.

3.1. Problem Definitions

This part gives the definitions of fraudulent account, bank accounts transaction network, and neighbor vertex set in turn.

Definition 1 (fraudulent account). A bank account is called a fraudulent account if it is mainly used to process money that is relevant to illegal transaction activities.

The illegal transaction activities include illegal pyramid selling, illegal money laundering, and illegal financing.

Given a nonempty finite set $V = \{v_1, v_2, \dots, v_n\}$, where $v_i$ denotes a bank account, for $v_i, v_j \in V$, each transaction record, containing the transaction accounts, timestamp, and amount, can be represented as a four-tuple $(v_i, v_j, t, a)$. The four-tuple means transferring amount $a$ from $v_i$ to $v_j$ at time $t$. All transactions between $v_i$ and $v_j$ within a time period can be denoted as $S_{ij} = \{(v_i, v_j, t_k, a_k) \mid 1 \le k \le K\}$, where $t_1 \le t_2 \le \dots \le t_K$. Here, $S_{ij}$ means the time series of transactions from $v_i$ to $v_j$, and $K$ is the total number of transactions that happened during the period from $t_1$ to $t_K$.

Taking accounts as vertices, transaction relationships between accounts as directed edges, and transaction information as edge weights, the time series transaction data can be formed into a directed and dynamic network. We define the network as follows.

Definition 2 (bank account transaction network). Given $G = (V, E, W, T)$, where $V$ is the vertex set and $E \subseteq V \times V$ denotes the set of edges, $e_{ij} \in E$ represents the directed weighted edge from $v_i$ to $v_j$, which exists if at least one transaction from $v_i$ to $v_j$ occurred. We define $W$ as the set of all of the weight information and $T$ as the set of all of the timestamp information, where $W = \{w_{ij} \mid e_{ij} \in E\}$, $w_{ij} = (w_{ij}^1, \dots, w_{ij}^K) \in \mathbb{R}_{+}^{K}$ is the weight vector of $e_{ij}$, $\mathbb{R}_{+}$ is the set of positive real numbers, $K$ is the total number of transactions on $e_{ij}$, $w_{ij}^k$ ($1 \le k \le K$) means the $k$th transaction amount from $v_i$ to $v_j$, and $t_{ij}^k \in T$ is the timestamp of $w_{ij}^k$.

Definition 3 (neighbor set). Given $G$ and a neighbor radius $r$, for $v \in V$, $N_r^{in}(v)$ represents its $r$-step neighbor vertices existing in incoming transactions, and $N_r^{out}(v)$ means the reverse vertex set, i.e., its $r$-step neighbor vertices existing in outgoing transactions. Therefore, the neighbor vertex set of $v$ is denoted as $N_r(v) = N_r^{in}(v) \cup N_r^{out}(v)$.
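The definitions above can be made concrete with a minimal Python sketch. It assumes transaction records arrive as `(source, destination, timestamp, amount)` four-tuples and interprets the neighbor set as all vertices reachable within the given radius along incoming or outgoing edges; the function names `build_network` and `neighbor_set` and the toy records are illustrative only and are not taken from the paper.

```python
from collections import defaultdict

def build_network(records):
    """Group four-tuples (src, dst, timestamp, amount) into per-edge vectors."""
    weights = defaultdict(list)      # (src, dst) -> amounts in time order
    timestamps = defaultdict(list)   # (src, dst) -> matching timestamps
    for src, dst, ts, amount in sorted(records, key=lambda r: r[2]):
        weights[(src, dst)].append(amount)
        timestamps[(src, dst)].append(ts)
    return dict(weights), dict(timestamps)

def neighbor_set(edges, v, radius=1):
    """Vertices within `radius` steps of v via incoming or outgoing edges.
    Assumption: 'r-step neighbors' means reachable in at most r steps."""
    preds, succs = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        succs[src].add(dst)
        preds[dst].add(src)

    def reach(adj):
        seen, frontier = set(), {v}
        for _ in range(radius):
            frontier = {u for x in frontier for u in adj[x]} - seen - {v}
            seen |= frontier
        return seen

    return reach(preds) | reach(succs)

# Toy records (made up for illustration).
records = [("a", "b", 1, 100.0), ("b", "c", 2, 80.0),
           ("a", "b", 3, 50.0), ("d", "a", 4, 20.0)]
weights, timestamps = build_network(records)
```

Here each key of `weights` plays the role of a directed edge, and the list stored under it plays the role of that edge's weight vector.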

3.2. CNN with the Topological Data of Transaction Network

Economic crime investigators believe that it is helpful to consider a vertex together with its neighbors when determining whether the vertex is fraudulent. That is, for any $v$ in $V$, the category it belongs to is closely related to its local topological structure, which is formed by its neighbor set $N_r(v)$.

We propose a DirectedWalk algorithm, an improved version of DeepWalk [18] for directed and dynamic networks, to learn the network vectors of accounts. Based on this, a method to generate the local topological feature matrix of an account is presented by constructing its social relationships. Finally, we devise a CNN framework with network topological data, which is called NTD-CNN for short.

3.2.1. DeepWalk

To capture the network topological information, [18] proposes the DeepWalk approach, which learns features that describe the graph structure. Optimization techniques originally designed for language modeling are used for learning social representations of a graph’s vertices. DeepWalk takes a graph as input and produces its latent structure representation as output. The learned structural features are used in many applications, such as network classification and anomaly detection, with strong results.

The DeepWalk algorithm, borrowing concepts from language modeling, considers a set of short truncated random walks as its corpus and the graph vertices as its vocabulary. It has two main components, a random walk generator and an update procedure. Given a graph $G$ as input, the random walk generator uniformly samples a random vertex $v_i$ as the root of a random walk $W_{v_i}$. For each vertex $v_i$, which is the start of $\gamma$ random walks, $t$ vertices are selected randomly in order. To update the vector representation of $v_i$, Skip-Gram [19] is utilized to maximize the probability of its neighbors in the walk $W_{v_i}$. Inspired by DeepWalk, we propose a DirectedWalk algorithm to learn the network vectors of accounts in the transaction network.
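The random walk generator described above can be sketched as follows. This is a simplified illustration of truncated random walks on an adjacency-list graph, not the DeepWalk reference implementation; the Skip-Gram update is omitted, and the names `random_walk` and `build_corpus` and the toy graph are assumptions made for the example.

```python
import random

def random_walk(adj, root, walk_length, rng):
    """One truncated random walk: repeatedly pick a uniform random neighbor."""
    walk = [root]
    while len(walk) < walk_length:
        neighbors = adj.get(walk[-1], [])
        if not neighbors:          # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk

def build_corpus(adj, walks_per_vertex, walk_length, seed=0):
    """Every vertex roots `walks_per_vertex` walks; the walks form the corpus."""
    rng = random.Random(seed)
    vertices = list(adj)
    corpus = []
    for _ in range(walks_per_vertex):
        rng.shuffle(vertices)      # visit the roots in random order on each pass
        for v in vertices:
            corpus.append(random_walk(adj, v, walk_length, rng))
    return corpus

# Toy undirected triangle graph (made up for illustration).
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
corpus = build_corpus(adj, walks_per_vertex=2, walk_length=4)
```

Each walk in `corpus` would then be treated as a sentence whose words are vertices.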

3.2.2. Learning the Network Vector

We generate a corpus and a vocabulary from the directed and dynamic network, which is the only required input for learning the network vectors. Given $G$, the vertex set $V$ is considered as the vocabulary, and the directed sequential transaction paths are seen as the corpus. In DeepWalk, the relationship strength of two vertices is determined by how frequently the two vertices occur in adjacent positions in the random walks. Here, as defined in Definition 2, the strength of the relationship between any two vertices $v_i$ and $v_j$ is determined by the weight of $e_{ij}$ and of $e_{ji}$. Moreover, in $G$, the order of the vertices passed by a transaction path is not random but time-related. Therefore, we propose a DirectedWalk algorithm for learning the network vectors of the transaction vertices. Each vertex can be seen as the start vertex of a directed walk with a maximum length of $l$. For the last visited vertex $u$, the walk passes over all of its directed neighbors that have occurred in at least one transaction with $u$ within a time interval $\Delta t$. The walk grows iteratively until the constraints are no longer satisfied.

The nested loops of Algorithm 1 form the core of our algorithm. The outer loop specifies the paths that start with each vertex $v$ and generates a time-ordered sequence of the directed neighbor vertices. The state factor of a path is initialized to True and is reset to False when its last vertex has no directed neighbor vertex or its length reaches $l$. Once all of the paths have reached the False state, the generation procedure for the walks starting at $v$ is completed. In the inner loop, for the last passed vertex $u$, the timestamp information is used to determine whether a vertex will be appended to the path. In summary, for a fixed start vertex, the longer the weight vectors of the passed directed edges are, the more walks are produced by our DirectedWalk approach. Therefore, frequent traders are more likely to occur in walks and to appear within a window with high probability.

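Algorithm 1: DirectedWalk.
Input: transaction network $G$; maximum walk length $l$; timestamp interval $\Delta t$;
       window size $\omega$; embedding size $d$
Output: matrix of vertex representations $\Phi$
the directed corpus $C$ is initialized to be empty;
for each $v \in V$ do
    initialize the path set $P_v$ with the single active path $[v]$
    while $P_v$ contains a path in its active state do
        NodeSentence = $P_v$.pop() (pop a path in its active state)
        retrieve $u$’s directed neighbor vertices ($N(u)$) from $G$, where $u$ is the last vertex of NodeSentence
        for each $x \in N(u)$ do
            if (NodeSentence contains only $v$) or (a transaction from $u$ to $x$ occurs within $\Delta t$ after the last transaction on NodeSentence) then
                NodeSentence adds a new vertex $x$
                if $x$ has no directed neighbor vertex or the length of NodeSentence equals $l$ then
                    state(NodeSentence) = False
                end if
                (push NodeSentence into $P_v$; a path whose state is False is also added to $C$)
            end if
        end for
    end while
end for
for each walk in $C$ do
    SkipGram($\Phi$, walk, $\omega$)
end for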

In the final loop of Algorithm 1, the Skip-Gram model [19] is exploited to maximize the cooccurrence probability among the vertices in the time-ordered walks. Skip-Gram, using the independence assumption, iterates over all possible collocations in the directed corpus within the window $\omega$.

We finally obtain the optimal network vector representation of the vertices by using the same learning process as DeepWalk. DirectedWalk maps the directed and dynamic transaction network into a $d$-dimensional vector space. Vertices with similar local directed topological structures are mapped to nearby vectors.
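The walk-generation step of DirectedWalk can be sketched as below. This is a simplified reading of the description, not the authors' implementation: it assumes that a walk may be extended along an edge only if that edge carries a transaction occurring no earlier than, and within a hypothetical `delta_t` of, the previous transaction on the walk. The Skip-Gram training step is omitted, and the name `directed_walks` and the toy timestamps are illustrative.

```python
def directed_walks(timestamps, start, max_len, delta_t):
    """Enumerate time-ordered walks from `start`.

    timestamps: dict mapping (src, dst) -> sorted list of transaction times.
    A walk stops when it reaches max_len vertices or cannot be extended.
    """
    succs = {}
    for src, dst in timestamps:
        succs.setdefault(src, []).append(dst)

    finished = []
    stack = [([start], None)]   # (path, time of the last transaction on it)
    while stack:
        path, last_t = stack.pop()
        extended = False
        if len(path) < max_len:
            for nxt in succs.get(path[-1], []):
                for t in timestamps[(path[-1], nxt)]:
                    # First hop is unconstrained; later hops must follow in time.
                    if last_t is None or 0 <= t - last_t <= delta_t:
                        stack.append((path + [nxt], t))
                        extended = True
        if not extended:        # inactive path: it joins the corpus
            finished.append(path)
    return finished

# Toy timestamps (made up for illustration).
timestamps = {("a", "b"): [1, 5], ("b", "c"): [2, 20], ("c", "a"): [3]}
walks = directed_walks(timestamps, "a", max_len=3, delta_t=2)
```

Because every qualifying transaction spawns its own extension, edges with longer weight vectors yield more walks, which matches the observation that frequent traders appear in more walks.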

3.2.3. Constructing the Local Topological Feature Matrix

For $v \in V$, given $N_r(v)$ in $G$, the local topological structure of vertex $v$ is the spanning subgraph of the vertex set $N_r(v)$. As mentioned above, the structure of $v$ in $G$ is embedded into a $d$-dimensional network vector $\Phi(v)$. We calculate the Euclidean distance between $v$ and each $u \in N_r(v)$ by using their network vectors. The size of the set $N_r(v)$ is denoted as $k$. The vertex vectors are then spliced into a matrix by arranging them in ascending order of their distances to $v$. This yields the local topological feature matrix as follows:

$$M_N(v) = \left[\Phi(v), \Phi(u_1), \Phi(u_2), \dots, \Phi(u_k)\right]^{\top}, \quad (1)$$

where $\Phi(u_i)$ ($1 \le i \le k$) is the network vector of $u_i \in N_r(v)$, while $\Phi(v)$ denotes the network vector of $v$ itself.
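The construction of the local topological feature matrix can be illustrated with a short sketch. It assumes the network vectors are already available in a dictionary, and it also applies zero-padding to a fixed number of rows; the name `local_topology_matrix`, the `max_rows` parameter, and the toy vectors are assumptions for illustration.

```python
import math

def local_topology_matrix(vectors, v, neighbors, max_rows):
    """Stack v's vector and its neighbors' vectors, nearest neighbor first."""
    def distance_to_v(u):
        return math.dist(vectors[v], vectors[u])   # Euclidean distance

    ordered = [v] + sorted(neighbors, key=distance_to_v)
    rows = [list(vectors[u]) for u in ordered[:max_rows]]
    dim = len(vectors[v])
    # Zero-pad so that every account yields a matrix of the same shape.
    rows += [[0.0] * dim for _ in range(max_rows - len(rows))]
    return rows

# Toy 2-dimensional network vectors (made up for illustration).
vectors = {"v": [0.0, 0.0], "b": [3.0, 4.0], "c": [1.0, 0.0]}
matrix = local_topology_matrix(vectors, "v", {"b", "c"}, max_rows=4)
```

In this toy case the neighbor `c` (distance 1) is placed above `b` (distance 5), and one all-zero row pads the matrix to four rows.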

3.2.4. Creating the NTD-CNN Framework

This part establishes an NTD-CNN classification framework, which takes the accounts’ local topological feature matrices as input. NTD-CNN is composed of the following elements, as shown in Figure 1.

Input layer: for $v \in V$, its input matrix is defined as matrix $M_N(v)$ (given in Formula (1)). In the CNN model, the input matrix is seen as the pixel matrix of an image. Therefore, the size of the input image equals that of the input matrix. As introduced above, matrix $M_N(v)$ embeds the local topological structure of vertex $v$, with each row expressing a network vector. This paper keeps the number of rows of $M_N(v)$ at a fixed value $n_N$, where $n_N$ corresponds to the maximum size of the union of the in-degree set and out-degree set over all vertices in network $G$. If the size of the neighbor set of $v$ satisfies $|N_r(v)| < n_N$, we fill the empty rows with 0. The format of the input image is shown in Figure 2(a); the specific values of $n_N$ and $d$ are given in the experimental section.

Convolution layer: the input matrix is used as the first-layer feature map. Similar to the definition of semantic unit features (unigram, bigram, trigram) in NLP, we employ three kinds of convolution filters to extract n-vertex features, i.e., single-vertex, two-vertex, and three-vertex features, from $M_N(v)$. As shown in Figure 1, for the $l$th convolution layer, the processing function on each feature map is shown in formula (2):

$$x_j^{l} = f\Big(\sum_{i} x_i^{l-1} * k_{ij}^{l} + b_j^{l}\Big), \quad (2)$$

where $f$ represents an activation function, the operator “$*$” expresses the convolution operation, $x_i^{l-1}$ denotes an input feature map, $k_{ij}^{l}$ denotes the $j$th convolution kernel, and $b_j^{l}$ means the bias vector. The stride length is set as 1. And the sizes of and are the same and range from . Moreover, the numbers of and are denoted as and , respectively. Therefore, the number of output feature maps of th layer is times that of the input. To apply the CNN to nonlinear classification problems, the Rectified Linear Unit (ReLU) function is used in generating the output feature maps.

Pooling layer: the processing procedure on the feature map of the $l$th pooling layer is shown in formula (3):

$$x_j^{l} = f\big(\beta_j^{l}\,\mathrm{down}(x_j^{l-1}) + b_j^{l}\big), \quad (3)$$

where the operator “$\mathrm{down}(\cdot)$” represents a pooling function, which is used for downsizing the $j$th input matrix $x_j^{l-1}$. The notations $\beta_j^{l}$ and $b_j^{l}$ denote the weight matrix and bias vector, respectively. The number of feature maps is not changed by the pooling procedure.

Fully connected layer: in this layer, the input feature maps, each of size $p \times q$ and obtained from the last pooling layer, are connected with each neuron of this layer. The fully connected structure compresses the input vector $x$ into a shorter one $h$ by using the following map function:

$$h = f(W_h x + b_h), \quad (4)$$

where $W_h$ denotes the parameter matrix and $b_h$ means the bias vector.

Output layer: this layer is a fully connected layer, which contains two neurons and adopts the softmax activation function. The softmax function is used to determine the category. For this reason, a vector $\hat{y}$ is built through formula (5):

$$\hat{y} = \big(p(y_a \mid v),\, p(y_n \mid v)\big), \quad (5)$$

where $\hat{y}$ denotes the probabilities of a node $v$ belonging to each category, i.e., $p(y_a \mid v)$ and $p(y_n \mid v)$. The probability $p(c \mid v)$ is defined in formula (6):

$$p(c \mid v) = \frac{\exp(o_c)}{\sum_{c' \in Y} \exp(o_{c'})}, \quad (6)$$

where $o_c$ is the activation of the output neuron for $c$, and $c$ denotes a category label in the label set $Y$. The category that node $v$ belongs to is given by $\hat{c}$, which is defined as follows:

$$\hat{c} = \arg\max_{c \in Y} p(c \mid v). \quad (7)$$
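To make the convolution and output layers more concrete, the following pure-Python sketch applies one n-vertex filter with ReLU and then a two-class softmax decision. It assumes, by analogy with n-gram filters in NLP, that each kernel covers n consecutive rows and the full row width; the names `conv_rows` and `softmax` and all numbers are illustrative and not taken from the paper.

```python
import math

def conv_rows(matrix, kernel, bias=0.0):
    """Slide an n-row kernel down the rows with stride 1, then apply ReLU."""
    n = len(kernel)
    outputs = []
    for i in range(len(matrix) - n + 1):
        total = bias
        for r in range(n):
            for c in range(len(kernel[r])):
                total += matrix[i + r][c] * kernel[r][c]
        outputs.append(max(0.0, total))   # ReLU activation
    return outputs

def softmax(scores):
    """Numerically stable softmax over a list of output activations."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# A two-vertex filter applied to a toy 3 x 2 input matrix.
feature_map = conv_rows([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                        [[1.0, 0.0], [0.0, 1.0]])

# Class probabilities and the arg-max decision for toy output activations.
probs = softmax([2.0, 0.5])
predicted = max(range(len(probs)), key=probs.__getitem__)
```

The predicted category is simply the index of the largest probability, which is the arg-max rule used by the output layer.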

3.3. CNN with the Time Series Data of Transactions

Each transaction record of an account is composed of three kinds of information: the two trading accounts, a timestamp, and the transaction amount. All the transaction records that belong to one account can be classified into two parts, i.e., incoming transactions and outgoing transactions. In the opinion of economic investigators, accounts’ historical trading information can be used as direct evidence in detecting whether they are fraudulent. Thus, in this part, the time series transaction features are quantified into matrices and fed into a CNN classification model, which is named TTD-CNN for short.

For $v \in V$, its set of in-edge weight vectors is expressed by the matrix $W_{in}(v)$, including all of its incoming transactions, where $d_{in}(v)$, the number of rows, is the in-degree of vertex $v$. Each row of matrix $W_{in}(v)$ is a vector composed of trading amounts ordered in time. The row vectors are spliced into the matrix according to the timestamps of their first trading records. That is, the earlier an account first transferred money to $v$, the higher its row is placed in $W_{in}(v)$. Given any vertex $u$ with an edge to $v$, the weight vector of $e_{uv}$, which is defined in Definition 2, is shown in formula (8):

$$w_{uv} = \big(w_{uv}^{1}, w_{uv}^{2}, \dots, w_{uv}^{K_{uv}}\big), \quad (8)$$

where $K_{uv}$ is the total number of transactions from $u$ to $v$, and $w_{uv}^{k}$, $1 \le k \le K_{uv}$, means the $k$th transaction amount from $u$ to $v$. Similarly, the out-edge matrix $W_{out}(v)$ represents all of the outgoing transactions of $v$. Moreover, the row vectors of $W_{out}(v)$ are sorted in the same way as those of $W_{in}(v)$. We denote the weight vector of $e_{vu}$ as formula (9):

$$w_{vu} = \big(w_{vu}^{1}, w_{vu}^{2}, \dots, w_{vu}^{K_{vu}}\big). \quad (9)$$

As shown in Figure 2(b), for any vertex $v$, the fixed-size time series feature matrix is denoted as follows:

$$M_T(v) = \begin{bmatrix} W_{in}(v) \\ W_{out}(v) \end{bmatrix} \in \mathbb{R}^{n_T \times k_T}, \quad (10)$$

where $n_T$ is the maximum sum of in-degree and out-degree over all vertices, and $k_T$ is the maximum number of transactions on an edge, as defined in Definition 2, over all vertices. We then pad any blank position in $M_T(v)$ with 0.

Finally, the time series transaction feature of vertex $v$ is formed into matrix $M_T(v)$. Utilizing $M_T(v)$ as the input matrix, a TTD-CNN model is devised for identifying whether an account is fraudulent. The specific values of $n_T$ and $k_T$, which are obtained from statistical computing, are given in the experimental section.
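The time series feature matrix can be assembled as in the sketch below. It assumes per-edge amount and timestamp lists keyed by `(source, destination)`, sorts the incoming rows and then the outgoing rows by the time of each edge's first transaction, and zero-pads to a fixed shape; the name `time_series_matrix`, the parameters `max_rows` and `max_cols`, and the toy data are illustrative only.

```python
def time_series_matrix(weights, timestamps, v, max_rows, max_cols):
    """Stack v's incoming rows above its outgoing rows and zero-pad the result."""
    def rows_for(edges):
        # Edges whose first transaction happened earlier are placed higher.
        ordered = sorted(edges, key=lambda e: timestamps[e][0])
        return [list(weights[e]) for e in ordered]

    incoming = rows_for([e for e in weights if e[1] == v])
    outgoing = rows_for([e for e in weights if e[0] == v])
    rows = (incoming + outgoing)[:max_rows]
    # Pad (or truncate) each row to max_cols, then add all-zero rows.
    padded = [(row + [0.0] * max_cols)[:max_cols] for row in rows]
    padded += [[0.0] * max_cols for _ in range(max_rows - len(padded))]
    return padded

# Toy per-edge amounts and timestamps (made up for illustration).
weights = {("a", "b"): [100.0, 50.0], ("d", "b"): [20.0], ("b", "c"): [80.0]}
timestamps = {("a", "b"): [3, 5], ("d", "b"): [1], ("b", "c"): [4]}
matrix = time_series_matrix(weights, timestamps, "b", max_rows=4, max_cols=3)
```

In this toy case the row for `d` comes first because its first transfer to `b` is the earliest, and the single outgoing row follows the two incoming rows.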

We omit the architecture details of TTD-CNN because they are similar to those of NTD-CNN, which is depicted in Figure 1.

3.4. CNN with Two Kinds of Heterogeneous Data

The former two models (NTD-CNN and TTD-CNN) take advantage of two types of transaction information of an account, i.e., the local topological structure and the time series transaction information, respectively. We expect that making full use of the two complementary node features will improve the accuracy of FAD. Therefore, we merge the former two kinds of heterogeneous data into an integrated feature matrix. The obtained matrix is used as the input of a CNN classification model, which is named HDF-CNN for short.

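Given $v \in V$, its local topological structure matrix $M_N(v)$, and its time series feature matrix $M_T(v)$, we generate a new feature matrix $M_H(v)$ as follows:

$$M_H(v) = \begin{bmatrix} M_N(v) \\ M_T(v) \end{bmatrix}, \quad (11)$$

where $M_N(v) \in \mathbb{R}^{n_N \times d}$ and $M_T(v) \in \mathbb{R}^{n_T \times k_T}$. Therefore, the numbers of rows and columns of $M_H(v)$ are constrained to fixed sizes as follows: $n_H = n_N + n_T$ and $m_H = \max(d, k_T)$, where $n_H$ equals the sum of the numbers of the two matrices’ row vectors and $m_H$ is set as the maximum of $d$ and $k_T$.

Finally, the integrative feature matrix $M_H(v)$ is used as the input of HDF-CNN. As depicted in Figure 2(c), the input matrix corresponds to a picture of length $m_H$ and width $n_H$. The length of $M_H(v)$ is the larger of the lengths of $M_N(v)$ and $M_T(v)$. Consequently, there are many empty positions in $M_H(v)$, which is generated by formula (11). We pad the empty positions with 0. The shape of $M_H(v)$, i.e., the input picture of HDF-CNN, is determined by the shapes of $M_N(v)$ and $M_T(v)$.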
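The merging step can be sketched as follows, assuming both feature matrices are given as lists of rows: the narrower matrix is zero-padded to the wider width, and the two are stacked vertically. The helper names `pad_cols` and `merge_features` and the toy matrices are illustrative assumptions.

```python
def pad_cols(matrix, width):
    """Right-pad every row with zeros up to `width` columns."""
    return [list(row) + [0.0] * (width - len(row)) for row in matrix]

def merge_features(topo, series):
    """Stack the topological matrix on top of the time series matrix."""
    width = max(len(topo[0]), len(series[0]))   # larger of the two widths
    return pad_cols(topo, width) + pad_cols(series, width)

# Toy matrices (made up for illustration): 2 x 3 and 2 x 2.
topo = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
series = [[100.0, 50.0], [80.0, 0.0]]
merged = merge_features(topo, series)
```

The merged matrix has as many rows as the two inputs combined and as many columns as the wider input.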

We omit the architecture details of HDF-CNN because its structure is similar to that of NTD-CNN.

3.5. Training the CNN Models

In this paper, we optimize the three CNN models by employing the negative log-likelihood loss function. We aim to obtain the globally optimal parameters, with which most samples are predicted into their correct categories. Under this condition, the cost of the loss function reaches its minimum value. Therefore, we minimize the cost given by formula (12). Notably, the main parameters updated in a CNN model are the weights of the map functions in each layer. This part denotes the weight parameters as $\theta$:

$$J(\theta) = -\frac{1}{N}\sum_{i=1}^{N} y_i^{\top} \log \hat{y}_i + \lambda \lVert \theta \rVert_2^2, \quad (12)$$

where $y_i$ is the target label and $\hat{y}_i$ means the predicted label of the $i$th of $N$ training samples, and $\lambda$ is the regularization coefficient. In this paper, $y_i$ is a one-hot representation and $\hat{y}_i$ denotes the predicted probability of each class, as defined in formula (7). The second term of $J(\theta)$ is a regularization factor, known as L2. L2 is an effective regularization method for avoiding overfitting. The overfitting problem, common in optimizing deep learning models, means that the model fits the training set very well but does not work on the testing set. Without a regularization term, the weight values tend to be relatively large, causing the cost function to fluctuate greatly in some small intervals. L2 regularization avoids overfitting by decaying the weights in the cost function (12). Furthermore, another regularization method, dropout, is used in training our CNN models. Dropout avoids the overfitting problem by randomly ignoring a proportion of the hidden layer neurons.

The Minibatch Gradient Descent (MBGD) [20] method is adopted for minimizing the objective function. MBGD updates the network parameters with one batch of samples at a time. Our CNN models utilize the MBGD method for its fast training speed and high probability of discovering the global optimal value. The training process stops when the loss of the objective function becomes stable at its minimal value.
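As a worked illustration of the objective, the sketch below evaluates a negative log-likelihood loss with an L2 penalty for a tiny batch. The exact scaling is an assumption: the loss is averaged over the batch and the penalty is the coefficient times the sum of squared weights. The name `nll_l2_loss` and all numbers are made up for the example.

```python
import math

def nll_l2_loss(targets, probs, weights, lam):
    """Mean negative log-likelihood plus an L2 penalty on the weights.

    targets: one-hot label vectors; probs: predicted class probabilities.
    """
    eps = 1e-12                     # guards against log(0)
    nll = -sum(
        t * math.log(p + eps)
        for y, y_hat in zip(targets, probs)
        for t, p in zip(y, y_hat)
    ) / len(targets)
    l2 = lam * sum(w * w for w in weights)
    return nll + l2

# Two samples, two classes, and two toy weights.
targets = [[1, 0], [0, 1]]
probs = [[0.9, 0.1], [0.2, 0.8]]
loss = nll_l2_loss(targets, probs, weights=[0.5, -0.5], lam=1e-3)
```

With one-hot targets only the probability assigned to the true class contributes to the first term, so confident correct predictions drive the loss toward the regularization term alone.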

4. Experiments

4.1. Environmental Settings

Considering that the volume of sample data is large and the deep learning parameters are numerous, a GPU is adopted to accelerate matrix multiplication and convolution computation through its massively parallel processing. Since running the above CNN framework on a single CPU is very time-consuming, we build the experimental hardware environment on a deep learning server with an NVIDIA Tesla P100 GPU and 128 GB of memory. TensorFlow is adopted to train the three CNN models.

4.2. Data Set

In recent years, we have been exploring computational models to classify bank accounts in combating illegal pyramid selling. The department of economic investigation provided us with plenty of transaction data of real bank accounts. We sampled the transaction records belonging to 10145 bank accounts to form our dataset for training the models. The dataset contains 9270 normal accounts and 875 accounts involved in a multilevel marketing (MLM) organization. These MLM members were manually annotated as “abnormal” by economic investigators. Each record includes the card id, the timestamp of the transaction, the amount of the transaction, and the direction of the transaction, i.e., revenue or expenditure. Before training the models, we first filtered out noisy data, i.e., duplicate records, incomplete records, and records whose transaction amounts are no more than 50. Second, the fivefold cross-validation method is used to evaluate the trained network in all of the experiments. Table 1 summarizes the characteristics of the dataset.

As is common in neural network training, our models are trained with the Minibatch Gradient Descent (MBGD) method. In each iteration, MBGD randomly takes a batch of samples, e.g., 100 accounts, and updates the parameters with the average gradient value. Therefore, changing the input order only changes the random batches drawn by MBGD, which means the performance of our models remains stable within a reasonable range.

4.3. Experiments Settings

Determining the best CNN parameter values requires various experiments. In this paper, we refer to both the related literature and the features of our dataset to set and optimize those parameters.

The values of the structural parameters in Figure 1: experimental results show that the classification performance changes little when the length of the network vector varies from 200 to 500. Thus, we set $d$ as 200. In the three CNN models, the number of each kind of n-vertex convolution filter (i.e., and ) is set as 100, and the number of neurons (i.e., the length of ) of the fully connected layer is set as 150. In our real transaction network, the maximum in-degree and out-degree (i.e., ) are 338 and 420, respectively. The maximum value of and are 571 and 459, respectively. In addition, is set as 1359, which is also obtained from the statistical data. Therefore, the sizes of the input image of the three CNN models are listed in Table 2.

Regularization parameter and dropout rate: these are two important parameters used to avoid model overfitting. In this paper, we adopt the most widely used regularization method, L2, and set its coefficient as 10e-4, as in many CNN studies. Similarly, the dropout rate is set to the default value 0.5.

Learning rate: the convergence of training will be too slow with a very small MBGD learning rate. Meanwhile, a too large learning rate will make the objective function fluctuate. The learning rate is therefore significant for whether a CNN model converges to the global minimum. For this reason, our CNN models employ a dynamic strategy in learning rate selection, in which the learning rate changes as the epochs go on. The learning rate is large at the beginning of training and is finally reduced to a minimum value. This paper trains the three CNN models for 100 epochs so that they have the same amount of training time to attain the convergence condition. During the first 80 epochs, the learning rate is reduced by 0.01 every 10 epochs. On arriving at the 80th epoch, if the current learning rate is larger than 0.01, the value is reset to 0.01. After that, the learning rate of HDF-CNN decays once per epoch, using an exponential schedule. This paper tests the starting value from the set in a decreasing order.
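One possible reading of this schedule is sketched below. Several details are assumptions rather than values from the paper: the starting rate of 0.1, the per-epoch exponential decay factor of 0.9, and a lower bound of 0.01 that keeps the stepwise phase from reaching zero. Only the step of 0.01 every 10 epochs and the reset to 0.01 at epoch 80 follow the description; the name `learning_rate` is illustrative.

```python
RESET_VALUE = 0.01   # cap applied from epoch 80 onward (from the text)

def learning_rate(epoch, start=0.1, step=0.01, floor=0.01, decay=0.9):
    """Stepwise decay for epochs 0-79, then exponential decay per epoch.

    `start`, `floor`, and `decay` are hypothetical values for illustration.
    """
    # Reduce by `step` every 10 epochs during the first 80 epochs.
    stepped = max(start - step * (min(epoch, 79) // 10), floor)
    if epoch < 80:
        return stepped
    # Cap at RESET_VALUE, then decay once per epoch.
    return min(stepped, RESET_VALUE) * decay ** (epoch - 80)

schedule = [learning_rate(e) for e in range(100)]
```

Under these assumptions the rate starts at 0.1, drops to 0.03 by epochs 70 to 79, is reset to 0.01 at epoch 80, and then shrinks geometrically until epoch 99.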

The experimental results show that HDF-CNN obtains the most stable convergence procedure.

We test the three CNN models on the dataset from two aspects, comparing their classification performances with the traditional abnormal detection methods and assessing the influence of additional convolution layers.

4.3.1. Comparing Methods in Classification Performance

We design four groups of experiments to compare the classification performance as follows.
(i) Traditional anomaly detection methods: there are four methods, i.e., isolation forest (IForest) [16], local outlier factor (LOF) [21], one-class SVM (One-SVM), and robust covariance (RC) [22]. These four are the most common methods for detecting fraudulent financial customers. We apply these four methods to the real dataset by using the features extracted in [15]. To be specific, those features can be grouped into three categories: transaction statistical features, network behavioral features, and periodic behavioral features. Three values of a necessary threshold parameter, the outlier factor, i.e., 0.01, 0.08, and 0.1, are selected in our experiments. The optimum factors for each method are shown in Table 3.
(ii) NTD-CNN model: the input is the local topological feature matrices of bank accounts.
(iii) TTD-CNN model: the input is the time series transaction feature matrices of bank accounts.
(iv) HDF-CNN model: the input is the integrative feature matrices of bank accounts. An input matrix is obtained by concatenating the former two matrices, which are generated from the two kinds of heterogeneous information, respectively.

In our experiments, the hidden layer structure and the number of neurons in each layer are the same in the three CNN models. The traditional methods are regarded as the baseline. The three CNN models are tested with many different parameter values, and the best parameters are finally selected. The classification results on the test set are illustrated in Table 3.

As shown in Table 3, among the traditional anomaly detection methods, IForest obtains the best results when the outlier factor is 0.08. The results indicate that the traditional automatic anomaly detection methods are not well suited to our dataset. This may be caused by the special fraud scenario. The fraudulent accounts exhibit different trading behaviors owing to their different roles in the MLM organization. Moreover, in this scenario, most of the abnormal accounts look normal when considered individually but are obviously abnormal from a group perspective. Therefore, the four kinds of linear classification methods are too coarse-grained to find the optimum classification boundary. Besides, the features used are heuristic, so the performance of these methods depends on expert knowledge.

NTD-CNN achieves higher performance than IForest and TTD-CNN. The features extracted from local topological structure data are thus more useful than those extracted from time series data. This also illustrates that the proposed DirectedWalk algorithm is effective in representing the network vector of bank accounts. Utilizing the same type of data, TTD-CNN and IForest obtain similar classification performance. HDF-CNN achieves the best results, supporting our assumption that the two kinds of heterogeneous data are complementary. For an MLM member in the real dataset, the fraudulent trading behaviors are hidden in both its transaction relationships and its trading details. Considering that the only difference between the three CNN models is the type of input feature, we conclude that the integrative feature describes the category of an account best.

It is worth noting that there are significant differences between precision and sensitivity for all four methods. Technically, this result means that the number of false negative (FN) samples is larger than the number of false positive (FP) samples. The confusion matrix of HDF-CNN is shown in Table 4.
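The relation between the precision-sensitivity gap and the FN and FP counts follows directly from the definitions, precision = TP / (TP + FP) and sensitivity = TP / (TP + FN): for TP > 0, precision exceeds sensitivity exactly when FN exceeds FP. The short sketch below illustrates this with hypothetical counts; they are not the values of Table 4.

```python
def precision_and_sensitivity(tp, fp, fn):
    """Compute precision and sensitivity (recall) from confusion-matrix counts."""
    if tp + fp == 0 or tp + fn == 0:
        raise ValueError("precision or sensitivity is undefined for these counts")
    precision = tp / (tp + fp)      # share of flagged accounts that are fraudulent
    sensitivity = tp / (tp + fn)    # share of fraudulent accounts that are flagged
    return precision, sensitivity


if __name__ == "__main__":
    # Hypothetical counts with more false negatives than false positives.
    p, s = precision_and_sensitivity(tp=80, fp=5, fn=20)
    print(f"precision={p:.3f} sensitivity={s:.3f}")
```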

Moreover, we use different random seeds to feed the input accounts in different orders in our experiments. The results show that the order of the input accounts does not affect the performance of the CNN models.

The changes in the training and testing errors of HDF-CNN with an increasing number of epochs are depicted in Figure 3. The testing error starts to converge after 80 epochs. The training error keeps decreasing during the first 80 epochs but almost converges to a constant value in the last 20 epochs. The testing error initially drops sharply and then fluctuates mildly before the 80th epoch. With a smaller step size, the fluctuations almost disappear in both the training error and the testing error. Hence, a smooth testing error can be obtained by reducing the step size after a specific number of epochs.

4.3.2. Assessing Different Convolution Layers

As shown in Figure 1, HDF-CNN contains 2 convolution-pooling layers. This part investigates the effect of adding extra convolution layers. That is, the number of convolution-pooling layers of HDF-CNN is set as for testing. The one or two additional layers are added after the last pooling layer in Figure 1. The ROC curves of HDF-CNN with different numbers of convolution layers are shown in Figure 4.

The AUC value is the area under the ROC curve. As is well known, the greater the AUC value, the better the classification performance of the model. As shown in Figure 4, the HDF-CNN model achieves its best classification results with two and three convolution layers. The performance decreases when four convolution-pooling layers are adopted. This means that a more complex convolution structure does not improve the classification performance significantly. We conclude that the specific data features of transaction records, e.g., the amount of money, the number of counterparties, and the local topological structure, are the reasons why a shallow structure is sufficient for HDF-CNN.
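To make the AUC definition concrete, the following minimal sketch builds ROC points by sweeping the decision threshold over predicted scores and integrates them with the trapezoidal rule. The function names and the toy labels and scores are hypothetical and are unrelated to the results in Figure 4.

```python
def roc_points(labels, scores):
    """Return ROC points (FPR, TPR) for binary labels (1 = fraudulent)."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    # Rank samples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Samples sharing a score are processed together (one ROC segment).
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under a piecewise-linear curve, by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


if __name__ == "__main__":
    labels = [1, 1, 0, 1, 0, 0]               # hypothetical ground truth
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # hypothetical model outputs
    print(auc(roc_points(labels, scores)))
```

A value of 1.0 corresponds to a perfect ranking of fraudulent accounts above normal ones, and 0.5 to a ranking no better than chance.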

5. Conclusion and Discussion

In this paper, the CNN framework is applied to the FAD problem. We propose three CNN classification models, fed with feature matrices derived from historical transaction records, to detect fraudulent accounts. The three CNN models share the same framework structure but differ in the types of their input matrices. We devise a DirectedWalk algorithm to learn the network vector of a vertex, which is used to generate the local topological feature. DirectedWalk embeds the local network structure of an account into a high dimensional vector. The local topological feature is found to be effective in classifying fraudulent accounts, and the two kinds of heterogeneous features, i.e., the local topological feature and the time series feature, are complementary in describing the category of an account. The experimental results on a real dataset show that HDF-CNN achieves significant improvements in classification performance over the other CNN models.

6. Future Work

In recent years, automatic FAD has become one of the most closely watched yet challenging research topics. The two kinds of available information, from the local topological structure and from the time series transaction records, are independent but interrelated. Finding a better way to represent both kinds of information, so that they reflect account behavioral features more accurately, is one of our future goals.

The Long Short-Term Memory (LSTM) model has proved well suited to classifying time series data. Considering the natural time series characteristic of transaction data, we also ran an experiment with the LSTM model, but its classification results were less satisfactory than those of the CNN models, for reasons that are still unknown. In the future, we will study and improve the LSTM model in order to adapt it to detecting fraudulent accounts.

Moreover, to address the shortage of labelled samples, we aim to add a semisupervised strategy in the next step.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.


Acknowledgments

This research was supported by the National Key Research and Development Plan (No. 2016YFB0800802 and No. 2017YFB0801804), the Frontier Science and Technology Innovation of China (No. 2016QY05X1002-2), and the Key Research and Development Program of Shandong Province (No. 2017CXGC0706 and No. 2016ZDJS01A04).