Abstract

Online mobile advertising plays a vital role in the mobile app ecosystem. Mobile advertising fraud, caused by fraudulent clicks or other actions on advertisements, is considered one of the most critical issues in mobile advertising systems. To combat evolving mobile advertising fraud, machine learning methods have been successfully applied to identify advertising fraud in tabular data, distinguishing suspicious operations from normal ones. However, such approaches may suffer from labor-intensive feature engineering and limited robustness of the detection algorithms, since the online advertising big data and the complex fraudulent advertising actions generated by malicious codes, botnets, and click-firms are constantly changing. In this paper, we propose a novel weighted heterogeneous graph embedding and deep learning-based fraud detection approach, namely, GFD, to identify fraudulent apps for mobile advertising. In the proposed GFD approach, (i) we construct a weighted heterogeneous graph to represent behavior patterns between users, mobile apps, and mobile ads and design a weighted metapath to vector algorithm to learn node representations (graph-based features) from the graph; (ii) we use a time window based statistical analysis method to extract intrinsic features (attribute-based features) from the tabular sample data; (iii) we propose a hybrid neural network to fuse graph-based features and attribute-based features for classifying fraudulent apps from normal apps. The GFD approach was applied to a large real-world mobile advertising dataset, and the experimental results demonstrate that the approach outperforms well-known learning methods (SVM, RF, and FCNN) in terms of AP and AUC.

1. Introduction

Online mobile advertising plays a vital role in the mobile app ecosystem. One of the popular models in mobile app advertising is known as cost per action (CPA), where payment is based on a user action, such as downloading and installing an app on the user's mobile device. This CPA model may incentivize malicious mobile content publishers (typically app owners) to generate fraudulent actions on advertisements to get more financial returns [1–3]. Some traditional methods and techniques have been used for detecting and stopping click fraud, such as the threshold-based method [4], CAPTCHA [5], splay tree [6], TrustZone [7], power spectral density analysis [8], and social network analysis [9].

To automatically detect mobile advertising fraud behaviors, machine learning methods have been successfully applied to find fraud patterns in data, distinguishing suspicious operations from normal ones [10–14]. For learning models with attribute features, researchers usually use several attributes from each sample to train a model that identifies the fraud behaviors. Unfortunately, such approaches may suffer from labor-intensive feature engineering and limited robustness of the detection algorithms, since the online advertising big data and the complex fraudulent advertising actions generated by malicious codes, botnets, and click-firms are constantly changing. What is more, fraudsters could easily adjust their fraud patterns based on existing fraud detection attributes and rules to avoid being detected. Recently, some researchers have used the relationships between information entities to construct a graph model and then applied graph mining or learning methods to identify the changing fraud behaviors [15–17]. All these methods provide useful insights into the learning mechanism for separating fraud behaviors from normal activities. Intuitively, if we could combine the complementary information from the attributes of sample data and the relationships between entities (e.g., users, apps, and ads), we would be able to improve the accuracy and robustness of fraud detection.

However, to unleash the power of attribute-based information and graph-based information, we have to address a series of challenges. First, to take advantage of the characteristics of graphs, we should construct a suitable graph that can represent the interaction behaviors between information entities such as users, apps, and ads. Second, an efficient graph learning method should be developed to learn useful structural and semantic representations from the constructed graph [18, 19], particularly from a heterogeneous graph [20]. Third, fusing the different kinds of information from sample attributes and node representations is difficult because of their inherent heterogeneity and high-order characteristics.

To address the above challenges, in this paper, we propose a weighted heterogeneous graph embedding and deep learning-based fraud detection approach, namely, GFD, to identify fraudulent apps for mobile advertising. In the proposed GFD approach, (i) considering behavior patterns between users, mobile apps, and mobile ads, we construct a weighted heterogeneous graph to represent mobile app advertising behavior and propose a new weighted metapath to vector algorithm, namely, WMP2vec, to learn low-dimensional latent representation (graph-based features) for apps’ nodes in the weighted heterogeneous graph; (ii) we use a time window based statistical analysis method to extract intrinsic features (attribute-based features) from the tabular sample data; (iii) we present a hybrid convolutional neural network model to fuse graph-based features and attribute-based features for classifying the fraudulent apps from normal apps.

We evaluate the GFD approach and the WMP2vec algorithm on a real-world dataset from one of the mobile advertising platforms in China. Results show that WMP2vec reaches higher performance than three well-known graph embedding algorithms on the constructed weighted heterogeneous graph, and the GFD approach achieves the highest classification performance compared with Support Vector Machine (SVM), Random Forest (RF), and Fully Connected Neural Networks (FCNN).

The rest of the paper is organized as follows. We introduce GFD approach to detect fraudulent apps with deep neural networks and heterogeneous graph embedding algorithm WMP2vec in Section 2. We present the experimental results and discussion in Section 3. In Section 4, we introduce the related work. We conclude this paper in Section 5.

2. Proposed Approach

The flow chart of the proposed GFD approach is shown in Figure 1. First, we propose a weighted heterogeneous graph embedding method to learn the node representation, including constructing the weighted heterogeneous graph and the WMP2vec algorithm. Second, we use statistical analysis method to extract attribute-based features from the tabular sample data. Third, we introduce the deep neural networks to fuse the attribute-based features and graph-based features for identifying fraudulent apps from normal ones.

2.1. Data Description

We collect advertising log data of mobile apps from a mobile advertising platform. Our mobile advertisement dataset contains the following attributes: user ID, a code to identify a unique mobile user; app ID, a code to identify a unique mobile app; ad ID, a code to identify a unique mobile advertisement; geographical attributes, a series of user geographical attributes used to detect anomalies, including encrypted IP and city; action type, user behavior related to the ads, such as viewing, clicking, app downloading start, app downloading completion, and app installation completion; action time, the time-stamp when the action happened; and device attribute, user device related attributes, such as device ID, device system models, and screen size.

A seven-day mobile advertising log dataset in June 2015 was studied in this paper, and some examples of our raw data are shown in Table 1.

2.2. Weighted Heterogeneous Graph Embedding

In this section, we firstly propose the problem definition and construct the weighted heterogeneous graph, and then we present WMP2vec algorithm to learn latent representation of nodes in weighted heterogeneous graph.

2.2.1. Problem Definition

(1). Given. An undirected weighted heterogeneous graph $G = (V, E, W)$ is given, where $V$ is the set of app nodes, ad nodes, and user nodes; $E$ is the set of undirected weighted edges between any two types of nodes (app nodes and user nodes, user nodes and ad nodes, and ad nodes and app nodes); and $W$ is the set of edge weights.

(2). Task. The task is to learn $d$-dimensional latent representations $X \in \mathbb{R}^{|V| \times d}$ (where $d \ll |V|$) for the nodes, which capture the structural and semantic relations among nodes in the graph $G$ and can be used for classifying fraudulent apps.

2.2.2. Weighted Heterogeneous Graph Construction

Let $U$ be the set of user nodes, let $P$ be the set of app nodes, and let $A$ be the set of advertisement nodes. If there exists an action from user $u \in U$ to advertisement $a \in A$ through app $p \in P$, we form edges from $u$ to $p$, from $u$ to $a$, and from $a$ to $p$, respectively, such that $e_{up}$, $e_{ua}$, and $e_{ap}$ belong to the edge set $E$ of the heterogeneous graph $G$. The set of weights is $W = \{w_{up}, w_{ua}, w_{ap}\}$, where the weights $w_{up}$, $w_{ua}$, and $w_{ap}$ are defined proportional to the behavioral centrality of $u$ to $p$, $u$ to $a$, and $a$ to $p$, respectively. The calculation formula of $w_{ua}$ is shown in equation (1), and $w_{up}$ and $w_{ap}$ are computed analogously:

$$w_{ua} = \frac{n_{ua}}{|O_u|}, \tag{1}$$

where $n_{ua}$ is the number of times user $u$ operated on advertisement $a$ and $O_u$ is the set of operations of user $u$ on all the advertisements.
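The construction can be illustrated with a short Python sketch. The record layout `(user_id, app_id, ad_id)`, the function name `build_weighted_graph`, and the exact normalization (action count of a node pair divided by all actions of the source node) are assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter, defaultdict


def build_weighted_graph(actions):
    """Build an undirected weighted user/app/ad graph from action records.

    `actions` is an iterable of (user_id, app_id, ad_id) tuples (hypothetical
    layout). Assumption: an edge weight is the number of actions linking the
    two nodes divided by all actions of the edge's source node.
    """
    pair_counts = Counter()    # (source node, target node) -> action count
    source_totals = Counter()  # (source node, target type) -> action count
    for user, app, ad in actions:
        u, p, a = ("user", user), ("app", app), ("ad", ad)
        for src, dst in ((u, p), (u, a), (a, p)):
            pair_counts[(src, dst)] += 1
            source_totals[(src, dst[0])] += 1

    graph = defaultdict(dict)
    for (src, dst), n in pair_counts.items():
        weight = n / source_totals[(src, dst[0])]
        graph[src][dst] = weight  # store both directions: the graph is undirected
        graph[dst][src] = weight
    return dict(graph)
```

Nodes are keyed by `(type, id)` so that a user and an app sharing the same raw identifier do not collide.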

2.2.3. Graph Embedding Algorithm

In this section, based on the sequence generation method from metapath based random walk in heterogeneous graph [20], we propose WMP2vec algorithm to generate random walk sequence in weighted heterogeneous graph and embed sequence to representation vector with Skip-Gram [21] for nodes.

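(1). Weighted Metapath Based Random Walk. We predefine the number of walks per node $r$, the number of walk sequences $n$, and a metapath $M$. The metapath is defined as a path in the heterogeneous graph $G$ with its metatemplate $V_1 \xrightarrow{R_1} V_2 \xrightarrow{R_2} \cdots \xrightarrow{R_t} V_{t+1}$, where $V_i$ is a node type and $R_i$ is the relationship between the node types $V_i$ and $V_{i+1}$. Each node $v$ and each edge $e$ are associated with the mapping functions $\phi(v)$ and $\varphi(e)$, respectively.

Supposing that the current node is $v_i$, the relationship between $v_i$ and the next node $v_{i+1}$ is $R_i$; that is, $\varphi(v_i, v_{i+1}) = R_i$.

For walk sequence generation, we go through the metapath scheme $n$ times, and each pass generates one corresponding walk sequence. In the first pass, we use two different selection methods (first phase and second phase), because there is no limit on the edge weight at the beginning. After the first pass, we use the method of the second phase to select the next node.

For the first phase, when the length of the walk sequence is less than 2, the next node in the sequence is selected uniformly at random from the neighbors of the current node that meet the requirements of metapath $M$ [20]. The transition probability from $v_i$ to $v_{i+1}$ is defined as follows:

$$p(v_{i+1} \mid v_i, M) = \begin{cases} \dfrac{1}{|N_{R_i}(v_i)|}, & v_{i+1} \in N_{R_i}(v_i), \\ 0, & \text{otherwise}, \end{cases} \tag{2}$$

where $N_{R_i}(v_i)$ is the set of neighbors of $v_i$ whose node type is $V_{i+1}$.

For the second phase, when the length of the walk sequence is between 2 and the longest walk length $l$, the transition probability is restricted by a weight bias $\beta$. Supposing that the latest weight of an edge of relationship $R_i$ is $w_{R_i}$, the weight of the next edge should lie in a range around $w_{R_i}$ whose width is set by $\beta$. The transition probability from $v_i$ to $v_{i+1}$ is defined as follows:

$$p(v_{i+1} \mid v_i, M, \beta) = \begin{cases} \dfrac{1}{|N^{\beta}_{R_i}(v_i)|}, & v_{i+1} \in N^{\beta}_{R_i}(v_i), \\ 0, & \text{otherwise}, \end{cases} \tag{3}$$

where $N^{\beta}_{R_i}(v_i)$ is the set of neighbors meeting the requirement.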
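A minimal Python sketch of the two-phase walk follows. It makes two labeled assumptions that the text does not fix: the weight restriction is taken as an absolute difference of at most `beta` from the weight of the previously traversed edge, and qualifying neighbors are drawn uniformly. The function and parameter names are hypothetical.

```python
import random


def weighted_metapath_walk(graph, node_type, start, metapath, max_length, beta, rng):
    """Walk along `metapath` (e.g. "PUAUP"), restricting edge weights by `beta`.

    graph: {node: {neighbor: weight}}, node_type: {node: one-letter type}.
    Assumption: after the first step, a candidate edge qualifies only if its
    weight differs from the previously traversed edge weight by at most beta.
    """
    period = len(metapath) - 1  # the first and last node types coincide
    walk, last_weight = [start], None
    while len(walk) < max_length:
        wanted = metapath[len(walk) % period]
        candidates = [
            (node, weight)
            for node, weight in graph.get(walk[-1], {}).items()
            if node_type[node] == wanted
            and (last_weight is None or abs(weight - last_weight) <= beta)
        ]
        if not candidates:  # no qualifying neighbor: stop the walk early
            break
        node, last_weight = rng.choice(candidates)
        walk.append(node)
    return walk
```

With weights normalized to [0, 1], setting `beta = 1.0` removes the restriction in this sketch, so the walk reduces to an unweighted metapath walk.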

(2). Embedding Sequence to Vector with Skip-Gram. Based on the weighted metapath random walk sequences, we use Skip-Gram model [21] and negative sampling [22] to learn low-dimensional representation of nodes.
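To make the Skip-Gram input concrete, the sketch below shows how (center, context) training pairs and negative samples could be derived from walk sequences. It is an illustrative simplification: the helper names are hypothetical, and drawing negatives from the node frequency raised to the power 0.75 follows the usual Word2vec convention, which is assumed here rather than stated in the text.

```python
import random
from collections import Counter


def skipgram_pairs(walk, window):
    """Return (center, context) pairs for every node within `window` steps."""
    pairs = []
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        pairs.extend((center, walk[j]) for j in range(lo, hi) if j != i)
    return pairs


def draw_negatives(walks, center, context, k, rng):
    """Draw k negative nodes, weighted by frequency ** 0.75 (assumed convention)."""
    freq = Counter(node for walk in walks for node in walk)
    nodes = [n for n in freq if n not in (center, context)]
    if not nodes:
        return []
    weights = [freq[n] ** 0.75 for n in nodes]
    return rng.choices(nodes, weights=weights, k=k)
```

Each positive pair is then contrasted with its `k` negatives when the embedding vectors are updated.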

A description of our proposed WMP2vec algorithm method is shown in Algorithm 1.

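Algorithm 1: WMP2vec.
Input: the weighted heterogeneous information graph $G = (V, E, W)$, a metapath scheme $M$, walks per node $r$, longest walk length per walk $l$, embedding dimension $d$, neighborhood size $k$
Output: the latent node embedding $X \in \mathbb{R}^{|V| \times d}$
Initialize $X$ and the set of random walk sequences $S = \emptyset$
for $i = 1$ to $r$ do
    for each $v \in V$ do
        $s$ = WeightedMetaPathRandomWalk($G$, $M$, $v$, $l$)
        $S = S \cup \{s\}$
    end
end
$X$ = HeterogeneousSkipGram($X$, $k$, $S$)
return $X$

WeightedMetaPathRandomWalk($G$, $M$, $v$, $l$):
initialize the random walk array $s = [v]$, the weight array, and the relationship array
for each pass over the metapath scheme $M$ do
    for each relationship $R$ of $M$ in the first pass do
        if the length of $s$ is less than 2 then
            draw $v'$ and $w'$ according to equation (2) with relationship $R$
            append $v'$ to $s$ and record $w'$
        else
            draw $v'$ and $w'$ according to equation (3) with relationship $R$
            if $v'$ does not exist then return $s$
            else append $v'$ to $s$ and record $w'$
    end
    for each relationship $R$ of $M$ in a later pass do
        draw $v'$ and $w'$ according to equation (3) with relationship $R$
        if $v'$ does not exist then return $s$
        else append $v'$ to $s$ and record $w'$
    end
end
return $s$
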
2.3. Attribute-Based Feature Extracting

From the raw log data (tabular data) of mobile advertising, we define a time window ($\tau$ hours) and divide the original data of one day (24 hours) into $24/\tau$ data blocks. Then, a plain statistical analysis is performed on each field in each data block: the ratio of the number of unique values of the field to the total number of records in the specified time window is computed. The attribute-based features corresponding to one mobile app can then be represented as a feature matrix with $24/\tau$ rows.
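The per-window statistic can be sketched in a few lines of Python. The record layout (dictionaries with an `hour` key between 0 and 23 plus one key per field) and the function name are assumptions for illustration; empty windows are given a ratio of 0.0, which the text does not specify.

```python
def window_features(records, fields, window_hours=1):
    """Unique-value ratio of each field per time window for one app and one day.

    Returns a matrix with 24 // window_hours rows and len(fields) columns.
    Assumption: a window without records yields 0.0 for every field.
    """
    n_windows = 24 // window_hours
    blocks = [[] for _ in range(n_windows)]
    for record in records:
        blocks[record["hour"] // window_hours].append(record)

    matrix = []
    for block in blocks:
        row = []
        for field in fields:
            unique = len({record[field] for record in block})
            row.append(unique / len(block) if block else 0.0)
        matrix.append(row)
    return matrix
```

A low ratio for a field such as the user ID means that a few distinct values account for many records in that window.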

2.4. Hybrid Neural Network for Classification

To take advantage of both the graph-based features and the attribute-based features, we propose a hybrid convolutional neural network (HNN) model that fuses and learns from the two kinds of information in the GFD approach. An overview of the hybrid neural network is shown in Figure 2.

In the HNN model, the first layer (input layer) contains the attribute-based feature matrix $X_a \in \mathbb{R}^{N \times T \times F}$ and the graph-based feature $X_g \in \mathbb{R}^{N \times d}$, where $N$ is the number of samples, $T$ is the number of time windows in one day (24 hours), $F$ is the dimension of the attribute-based feature in a time window, and $d$ is the dimension of the node embedding.

The convolutional part includes two convolutional layers, and the output of the first convolutional layer is

$$C_1 = f(K_1 * X_a + b_1),$$

where $K_1$ and $b_1$ are the convolution kernel and bias, respectively, $k_1$ is the size of the kernel, $*$ indicates the convolution operation, and the function $f$ is the ReLU activation function.

The second convolutional layer is constructed as follows:

$$C_2 = f(K_2 * C_1 + b_2),$$

where $K_2$ and $b_2$ are the convolution kernel and bias, respectively, and $k_2$ is the size of the kernel.

$C_2$ is flattened to $c \in \mathbb{R}^{N \times n_c}$, where $n_c$ is the number of elements in $C_2$ for one sample.

We concatenate $c$ and $X_g$ into a single matrix to be the input of the first fully connected layer $H_1$. $H_1$ is constructed as follows:

$$H_1 = f([c, X_g] W_3 + b_3),$$

where $W_3$ and $b_3$ are the weight and bias, respectively, and $h_1$ is the number of neurons in the first fully connected layer.

The second fully connected layer $H_2$ is constructed as follows:

$$H_2 = f(H_1 W_4 + b_4),$$

where $W_4$ and $b_4$ are the weight and bias, respectively, and $h_2$ is the number of neurons in the second fully connected layer.

In the output of HNN, $\hat{y}$ is the probability of an application being a fraudulent application:

$$\hat{y} = \sigma(H_2 W_5 + b_5),$$

where $W_5$ and $b_5$ are the weight and bias, respectively, and $\sigma$ is the sigmoid function.
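The data flow through the network (two convolutions, flattening, concatenation with the node embedding, two fully connected layers, sigmoid output) can be traced with a small pure-Python forward pass for a single sample. This is a simplified sketch under stated assumptions: one channel, no padding, ReLU in the hidden layers, and hypothetical parameter names in the dictionary `p`; it is not the authors' implementation.

```python
import math


def conv2d_relu(x, kernel, bias):
    """Single-channel 'valid' convolution (cross-correlation) followed by ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(x) - kh + 1):
        row = []
        for j in range(len(x[0]) - kw + 1):
            s = bias
            for a in range(kh):
                for b in range(kw):
                    s += kernel[a][b] * x[i + a][j + b]
            row.append(max(0.0, s))
        out.append(row)
    return out


def dense(vec, weights, biases, activation):
    """Fully connected layer: one weight row and one bias per output neuron."""
    return [activation(sum(w * v for w, v in zip(row, vec)) + b)
            for row, b in zip(weights, biases)]


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hnn_forward(attr_matrix, graph_vec, p):
    """Return the fraud probability of one app (hypothetical parameter dict p)."""
    c1 = conv2d_relu(attr_matrix, p["k1"], p["b1"])
    c2 = conv2d_relu(c1, p["k2"], p["b2"])
    flat = [v for row in c2 for v in row]
    fused = flat + list(graph_vec)  # fuse attribute-based and graph-based features
    h1 = dense(fused, p["w3"], p["b3"], relu)
    h2 = dense(h1, p["w4"], p["b4"], relu)
    return dense(h2, p["w5"], p["b5"], sigmoid)[0]
```

The graph-based vector enters only at the concatenation step, so the convolutions operate on the time-window matrix alone.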

The cross-entropy function with L2 regularization is used to calculate the loss of the hybrid convolutional neural network model.
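As a worked illustration, the loss can be written as the mean binary cross-entropy plus a penalty on the squared weights. The regularization strength `lam` and the clipping constant `eps` are assumed names and values, since the text does not report them.

```python
import math


def bce_l2_loss(y_true, y_prob, weights, lam, eps=1e-12):
    """Mean binary cross-entropy plus lam times the sum of squared weights."""
    cross_entropy = -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for y, p in zip(y_true, y_prob)
    ) / len(y_true)
    l2_penalty = lam * sum(w * w for w in weights)
    return cross_entropy + l2_penalty
```

For two samples predicted at probability 0.5, the cross-entropy term equals ln 2, and the penalty term grows with the magnitude of the weights.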

3. Experiments

3.1. Data Description and Preprocessing

A real-world dataset was collected from a mobile advertising platform in China. The dataset consists of seven days with around 2 M users, 3.5 K apps, and 1 K advertisements per day. We partition our log data into seven subsets with one-day period and conduct experiments on each subset to evaluate our model. The proportion of fraudulent apps is about 2–4 percent in the total 3,500 apps each day. More details of the dataset are described in Section 2.1.

3.2. Evaluation Metric

In this paper, we treat the fraudulent apps as positive samples and the other apps as negative samples. The Average Precision (AP) and the Area Under the ROC Curve (AUC) are used to evaluate the proposed algorithm and approach.

The AP criterion summarizes the Precision-Recall performances at different threshold levels and corresponds to area under the Precision-Recall curve. The ROC curve is created by plotting the true positive rate against the false positive rate at various threshold settings. The AUC is the total area under the ROC curve.
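Both metrics can be computed directly from labels and scores, as in the following sketch. The AP function assumes that no two samples share the same score, and the AUC function uses the equivalent pairwise formulation (the fraction of positive-negative pairs ranked correctly, with ties counted as one half); the function names are illustrative.

```python
def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive sample."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because fraudulent apps are only a few percent of all apps, AP is the more sensitive of the two measures to errors on the positive class.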

3.3. Evaluation of WMP2vec Algorithm

In this section, we use WMP2vec algorithm to learn the embedding vector of the nodes (apps) from the constructed weighted heterogeneous graph and then take their embedding vectors as the input of Random Forest (RF) model to classify fraudulent apps.

Based on Section 2.2.2, we construct a weighted heterogeneous graph and define a metapath: app-user-ad-user-app (PUAUP); that is, $P \to U \to A \to U \to P$, which represents the heterogeneous semantics of fraudulent publishers (apps) that mimic legitimate users to act on the ads from the apps.

3.3.1. Comparison Models and Parameters

We compare the AP and AUC of the WMP2vec model with three well-known graph embedding models: DeepWalk [23], Node2vec [24], and Metapath2vec [20]. The compared algorithms and their parameters are as follows:
(1) DeepWalk: DeepWalk [23] is the first graph embedding model based on Word2vec. We use the Skip-Gram model [21] and hierarchical softmax [25] with gradient descent to learn the node representation. The negative sampling technique [22] is used to accelerate the Skip-Gram model. The count of random walks is 30, and the walk length is 40.
(2) Node2vec: Node2vec [24] extends the DeepWalk algorithm by introducing a backward probability p and a forward probability q. The same random walk parameters as for DeepWalk (count = 30 and length = 40) are used, and the negative sampling technique is also used. In addition, we use p and q = 0.2 for the backward probability and forward probability, respectively.
(3) Metapath2vec: Metapath2vec [20] uses the metapath based random walk to construct node sequences and then leverages Skip-Gram to perform node embedding. The metapath in this study is PUAUP. The count of random walks is 30, and the walk length is 10.
(4) WMP2vec: We use the same parameters as for Metapath2vec (count = 30, length = 10, and metapath = PUAUP), and additionally the weight bias β is 0.1.

In all the compared models, we train the Skip-Gram model with a window size of 5 and 5 negative samples in negative sampling. The graph-based feature of each node is a 32-dimensional vector. The parameters of the RF model are as follows: the number of weak learners is 150, the maximum depth is 5, and the minimum number of samples per leaf is 5.

3.3.2. Experimental Results

Tables 2 and 3 show the experimental results by comparing the AP and AUC over 10-fold cross-validation for seven days. The WMP2vec model reached the highest AP value on six days and the highest AUC value on three days out of all seven days. The Metapath2vec model reached the highest AP value on one day and the highest AUC value on two days. Thus, WMP2vec ranks first more often than any other model, and the overall ordering is WMP2vec > Metapath2vec > Node2vec > DeepWalk.

3.3.3. Impacts of Parameters

In this subsection, we evaluate the impacts of parameters over the classification task: (i) count of random walk, walk length, and window size of Skip-Gram in WMP2vec and Metapath2vec model; (ii) weighted bias β of WMP2vec. We compare the AP and AUC values in the dataset from one day.

(1). Count of Random Walk. Figure 3 shows the experimental results by comparing the AP and AUC with different counts of random walks, with a fixed walk length of 5. When the count of random walks is 30 or larger, both the WMP2vec and Metapath2vec models perform better than with count = 10. In addition, the values of AP and AUC change only slightly when the count of random walks is 30, 50, or 70.

(2). Walk Length. Figure 4 shows the experimental results by comparing the AP and AUC with different walk length (length = 5, 10, 20, 50, and 80), with fixed count of random walk of 10. WMP2vec and Metapath2vec models reach better performance when the walk length ≥10. In addition, when the length changes from 10, 20, and 50 to 80, the AP values change very little and the AUC values have some fluctuations.

(3). Window Size of Skip-Gram. Figure 5 shows the experimental results by comparing the AP and AUC with different window size (size = 3, 4, 5, 6, and 7) over the classification task. The best performance of models is reached when the window size is 5.

(4). Weighted Bias of WMP2vec. Figure 6 shows the experimental results by comparing the AP and AUC with different weighted bias β of WMP2vec (β = 0.1, 0.3, 0.5, 0.7, and 1.0) over the classification task. As the weighted bias β increases, the performance of WMP2vec gets closer to the performance when β = 1.0. The values of AP and AUC change very little when β ≥ 0.5.

3.4. Evaluation of Hybrid Neural Network

In this section, we evaluate the classification performance of HNN model for fusing graph-based features and attribute-based features in GFD approach. As the flow of GFD approach in Figure 1, we extract the attribute-based features and the graph-based features and then use HNN model to fuse two kinds of features to identify fraudulent apps.

3.4.1. Features Extraction

Based on Section 2.3, we divide the log data for each app into 24 parts per day; that is, the time window is one hour. We calculate the ratio of records whose attributes take a certain value to all records in each time window, and we calculate them for each of 22 attributes in total, such as anonymized user id, advertisement id, country id, and device operating system. In addition, we calculate the ratio for browsing behavior and other actions on ads of users, respectively. Finally, we get 24 features for a time window (one hour), and the dimension of attribute-based features of each app is 24 × 24 for one day.

Based on Section 2.2.2 and Section 3.3, for the graph-based feature extraction, we construct the weighted heterogeneous graph of user-app-ad and then extract the graph-based feature through training by using WMP2vec. The dimension of graph-based features for each app is 32.

3.4.2. Comparison Models and Experiment Setup

We compare the proposed HNN with Support Vector Machine (SVM), Random Forests (RF), and Fully Connected Neural Networks (FCNN).
(i) SVM: SVM is an effective and widely used two-class classification model. The RBF kernel is used, and the penalty parameter C is 0.9.
(ii) RF: RF is a well-known ensemble learning method that operates by constructing a multitude of decision trees at training time. The number of decision trees is 200 with a depth of 5. Minimum samples split and minimum samples leaf are both set to 5.
(iii) FCNN: FCNN is a fully connected neural network. The number of hidden layers is 4, with 100 neurons in each layer. The learning rate is 0.001, and the keep probability of dropout is 0.9.
(iv) HNN: HNN is the fusing model proposed in this study. The number of convolutional layers is 2, and the kernel size is 3 × 3. The number of fully connected layers is 2 with 100 neurons each, using the ReLU activation function, and the keep probability of dropout is 0.9. The learning rate is 0.0001, the decay factor of the learning rate is 0.98, and the batch size is 100.

In order to make sure that all models could learn the same knowledge from the dataset, when training the comparison models, we flatten the attribute-based features into a 576-dimensional vector. Furthermore, the vector is concatenated with graph-based features, and the dimension of total input vector is 576 + 32 = 608.
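The input layout for the comparison models can be checked with a short sketch: the 24 × 24 attribute matrix is flattened row by row and the 32-dimensional node embedding is appended. The function name is hypothetical.

```python
def baseline_input(attr_matrix, graph_vec):
    """Flatten the time-window matrix row by row and append the node embedding."""
    flat = [value for row in attr_matrix for value in row]
    return flat + list(graph_vec)
```

With a 24 × 24 matrix and a 32-dimensional embedding, the resulting vector has 24 × 24 + 32 = 608 entries.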

We randomly divide the negative samples and positive samples of the dataset into three subsets 8 : 1 : 1, respectively, and combine the corresponding positive and negative example subsets into training (80%), validation (10%), and test (10%) sets. In order to handle the imbalanced category problem between fraudulent and nonfraudulent apps, we adopt upsampling technique during training.
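The split and the upsampling step can be sketched as follows. The text does not detail the upsampling procedure, so resampling the minority class with replacement until both classes have equal size in the training set is an assumption, as are the function names.

```python
import random


def stratified_split(labels, rng):
    """Split sample indices 8:1:1 per class into train, validation and test."""
    split = {"train": [], "val": [], "test": []}
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train, n_val = len(idx) * 8 // 10, len(idx) // 10
        split["train"] += idx[:n_train]
        split["val"] += idx[n_train:n_train + n_val]
        split["test"] += idx[n_train + n_val:]
    return split


def upsample(train_idx, labels, rng):
    """Resample the minority class with replacement until classes balance."""
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    minority, majority = sorted((pos, neg), key=len)
    if not minority:
        return list(train_idx)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra
```

Upsampling is applied to the training indices only, so the validation and test sets keep the original class ratio.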

3.4.3. Experimental Results

The experimental results are shown in Tables 4 and 5. The HNN model proposed in this study reaches the highest AP value on six days and the highest AUC value on four days out of all seven days. The FCNN, RF, and SVM models perform similarly in terms of AUC, and Table 4 shows that HNN > FCNN > RF > SVM in terms of AP. Thus, HNN outperforms the other models on most days in terms of both AP and AUC.

3.4.4. Comparative Experiments without Graph-Based Features

To show the contribution of graph-based feature extraction in proposed GFD approach, we remove the graph-based features in our dataset. When the proposed HNN model has only attribute-based features as input and no graph-based features as input, the HNN model leaves only the fully connected part to work, since the convolution part of HNN model has no input. This also means that the working HNN model would change to a fully connected neural network, that is, FCNN model, in this setting. So we use the SVM, RF, and FCNN models in this comparative experiment. The results are shown in Tables 6 and 7. Comparing the performances of models with/without graph-based features in Tables 4 and 5 and Tables 6 and 7, we could find that the FCNN model with graph-based features reaches better performance than the model without the graph-based features in both AP and AUC measures, while the performance improvement of SVM and RF models is not obvious with graph-based features.

3.4.5. Impacts of Parameters

(1). Time Windows. The time window in the attribute-based feature extraction of the GFD approach determines the dimension of the attribute-based features. We designed experiments to show the impact of the time window, and the result is shown in Table 8. The size of the time window is set to 1, 3, and 6 hours. As the size of the time window increases, the AP of HNN decreases. The other models do not seem to be sensitive to the size of the time window.

(2). Number of Convolutional Layers in HNN Model. We compare the effect of the number of convolutional layers of 1, 2, and 3 in HNN model and show the results in Table 9. The AUC and AP values achieve a high level when the number of convolutional layers is 2.

(3). Number of Fully Connected Layers in HNN Model. We set the number of fully connected layers to be from 1 to 4, and the experiment result is shown in Table 10. When the number of fully connected layers is 2, the HNN model reaches the highest performance.

(4). Activation Functions in HNN Model. We compare three well-known activation functions, ReLU, tanh, and Sigmoid, in HNN model, and the experiment results are shown in Table 11. The AUC values of the models with different activation functions are similar, and ReLU is slightly better than others. In terms of AP, ReLU is obviously better than the other two activation functions.

4. Related Work

Our work is related to existing studies on attribute-based fraud detection and graph-based fraud detection with machine learning. The challenges of the fraud detection problem in mobile advertising systems are summarized as the accuracy requirement, the throughput requirement, and the ability to combat the latest fraud methods [1].

Attribute-based fraud detection approaches have been used in the fraud detection domain. Crussell et al. [26] built decision trees based on the features extracted from their dataset for classification. Liu et al. [27] proposed a binary SVM classifier to determine whether two UIs are likely to lead to equivalent states. This classification is used to simulate user interaction in the context of ad clicking. In order to classify malicious publishers, Mouawi et al. [11] evaluated KNN, SVM, and ANN based on features extracted from their dataset, and the experimental results show that all three classifiers give very promising results. Haider et al. [2] proposed an ensemble-based method to classify each individual ad display as fraudulent or nonfraudulent. Gabriel et al. [28] evaluated the performance of logistic regression, gradient trees, and a deep learning method in credit card fraud detection and showed that the deep learning method outperforms the other compared methods.

Graph-based fraud detection approaches have been studied recently. Hu et al. [15] proposed a weighted graph propagation algorithm to identify the fraudulent apps in user-app bipartite graphs. Vasumati et al. [29] applied decision trees to classify spam publishers based on a constructed feature vector and computed a spam score for each of the spam publishers by constructing a bipartite graph between users and publishers to find fraudulent publishers. What is more, the natural language processing (NLP) model known as Word2vec [21] has been applied to graph embedding, as in DeepWalk [23], Node2vec [24], and Metapath2vec [20]. Zheng et al. [30] proposed an unsupervised method to detect abnormal users and items through deep joint network embedding. Yu et al. [16] proposed a deep embedding approach for anomaly detection in dynamic networks by learning network representations which can be updated dynamically as the network evolves.

Mobile advertising fraud detection is still challenging; however, ensemble learning methods were usually the winner algorithms in fraud detection competition [10], and deep learning and graph learning are recently the most promising methods in this area.

There are two key differences between our proposed approach and existing works. First, we used app id, ad id, and user id from the real-world dataset to construct a weighted heterogeneous graph with these three types of nodes and proposed the graph embedding algorithm for mobile advertising fraud detection. The popular existing datasets, such as TalkingData dataset [31], usually have one or two types of entities (e.g., app id), so there are not enough entities to construct a heterogeneous graph as we did in this paper. Second, we proposed a fusing model to combine attribute-based and graph-based information for mobile advertising fraud detection by graph embedding and deep learning methods.

5. Conclusion

In this paper, we focus on the fraud detection problem in mobile advertising to detect fraudulent publishers. We propose a novel weighted heterogeneous graph and deep learning-based fraud detection approach, namely, GFD, to identify fraudulent apps for mobile advertising. Based on the relationship of users, publishers, and advertisement in mobile ad system, we construct a weighted heterogeneous graph and proposed a weighted metapath based graph embedding approach, named WMP2vec, to learn structural features of publishers in the graph. Furthermore, we construct a hybrid convolutional neural network to learn high-order features from attribute-based features and graph-based features. The experimental results in a real-world dataset show that our method is effective in classifying fraudulent apps for mobile advertising system.

There are two limitations in the work presented here. First, the dataset is limited to one mobile advertising dataset. In order to be more generalizable, it would be important to see whether the proposed GFD approach excels in more fraud detection datasets. Second, the dataset is limited to seven days. In the complex and dynamic online advertising environment, more time is still needed to evaluate the proposed approach.

Despite being focused on mobile advertising fraud detection in this presentation, the proposed GFD approach could be generalized to benefit many other online applications (e.g., e-commerce) that involve relationship between several types of entities. Future work should focus on the robustness and accuracy of our proposed model for other large-scale online datasets.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

This work was supported in part by the Natural Science Foundation of Guangdong Province of China (Grant no. 2018A030313309), the Innovation Fund of Introduced High-End Scientific Research Institutions of Zhongshan (Grant no. 2019AG031), and the Fundamental Research Funds for the Central Universities, SCUT (Grant no. 2019KZ20).