Abstract
Online mobile advertising plays a vital role in the mobile app ecosystem. Mobile advertising fraud, caused by fraudulent clicks or other actions on advertisements, is considered one of the most critical issues in mobile advertising systems. To combat evolving mobile advertising fraud, machine learning methods have been successfully applied to identify advertising fraud in tabular data, distinguishing suspicious operations from normal ones. However, such approaches may suffer from labor-intensive feature engineering and limited robustness of the detection algorithms, since online advertising big data and the complex fraudulent advertising actions generated by malicious code, botnets, and click farms are constantly changing. In this paper, we propose a novel weighted heterogeneous graph embedding and deep learning-based fraud detection approach, namely, GFD, to identify fraudulent apps for mobile advertising. In the proposed GFD approach, (i) we construct a weighted heterogeneous graph to represent behavior patterns between users, mobile apps, and mobile ads and design a weighted metapath to vector algorithm to learn node representations (graph-based features) from the graph; (ii) we use a time-window-based statistical analysis method to extract intrinsic features (attribute-based features) from the tabular sample data; (iii) we propose a hybrid neural network to fuse graph-based features and attribute-based features for classifying fraudulent apps from normal apps. The GFD approach was applied to a large real-world mobile advertising dataset, and experimental results demonstrate that the approach significantly outperforms well-known learning methods.
1. Introduction
Online mobile advertising plays a vital role in the mobile app ecosystem. One of the popular models in mobile app advertising is known as cost per action (CPA), where payment is based on user action, such as downloading and installing an app on the user's mobile device. This CPA model may incentivize malicious mobile content publishers (typically app owners) to generate fraudulent actions on advertisements to get more financial returns [1–3]. Some traditional methods and techniques have been used for detecting and stopping click fraud, such as the threshold-based method [4], CAPTCHA [5], splay tree [6], TrustZone [7], power spectral density analysis [8], and social network analysis [9].
To automatically detect mobile advertising fraud behaviors, machine learning methods have been successfully applied to find fraud patterns in data, distinguishing suspicious operations from normal ones [10–14]. For learning models with attribute features, researchers usually use several attributes from each sample to train a model that identifies fraud behaviors. Unfortunately, such approaches may suffer from labor-intensive feature engineering and limited robustness of the detection algorithms, since online advertising big data and the complex fraudulent advertising actions generated by malicious code, botnets, and click farms are constantly changing. What is more, fraudsters can easily adjust their fraud patterns based on existing fraud detection attributes and rules to avoid being detected. Recently, some researchers have used the relationships between information entities to construct a graph model and then applied graph mining or learning methods to identify the changing fraud behaviors [15–17]. All these methods provide useful insights into the learning mechanism for classifying fraud behaviors from normal activities. Intuitively, if we could combine the complementary information from the attributes of sample data and the relationships between entities (e.g., users, apps, and ads), we would be able to improve the accuracy and robustness of fraud detection.
However, to unleash the power of attribute-based information and graph-based information, we have to address a series of challenges. First, to take advantage of the characteristics of a graph, we should construct a suitable graph that can represent the interaction behaviors between information entities such as users, apps, and ads. Second, an efficient graph learning method should be developed to learn useful structural and semantic representation information from the constructed graph [18, 19], particularly from a heterogeneous graph [20]. Third, fusing different kinds of information from sample attributes and node representations is difficult because of their inherent heterogeneity and high-order characteristics.
To address the above challenges, in this paper, we propose a weighted heterogeneous graph embedding and deep learning-based fraud detection approach, namely, GFD, to identify fraudulent apps for mobile advertising. In the proposed GFD approach, (i) considering behavior patterns between users, mobile apps, and mobile ads, we construct a weighted heterogeneous graph to represent mobile app advertising behavior and propose a new weighted metapath to vector algorithm, namely, WMP2vec, to learn low-dimensional latent representations (graph-based features) for app nodes in the weighted heterogeneous graph; (ii) we use a time-window-based statistical analysis method to extract intrinsic features (attribute-based features) from the tabular sample data; (iii) we present a hybrid convolutional neural network model to fuse graph-based features and attribute-based features for classifying fraudulent apps from normal apps.
We evaluate the GFD approach and the WMP2vec algorithm on a real-world dataset from one of the mobile advertising platforms in China. Results show that WMP2vec reaches higher performance than three well-known graph embedding algorithms on the constructed weighted heterogeneous graph, and the GFD approach achieves the highest classification performance compared with Support Vector Machine (SVM), Random Forest (RF), and Fully Connected Neural Networks (FCNN).
The rest of the paper is organized as follows. We introduce GFD approach to detect fraudulent apps with deep neural networks and heterogeneous graph embedding algorithm WMP2vec in Section 2. We present the experimental results and discussion in Section 3. In Section 4, we introduce the related work. We conclude this paper in Section 5.
2. Proposed Approach
The flow chart of the proposed GFD approach is shown in Figure 1. First, we propose a weighted heterogeneous graph embedding method to learn the node representation, including constructing the weighted heterogeneous graph and the WMP2vec algorithm. Second, we use a statistical analysis method to extract attribute-based features from the tabular sample data. Third, we introduce deep neural networks to fuse the attribute-based features and graph-based features for identifying fraudulent apps from normal ones.
2.1. Data Description
We collect advertising log data of mobile apps from a mobile advertising platform. Our mobile advertisement dataset contains the following attributes: user ID, a code to identify a unique mobile user; app ID, a code to identify a unique mobile app; ad ID, a code to identify a unique mobile advertisement; geographical attributes, a series of user geographical attributes used to detect anomalies, including encrypted IP and city; action type, user behavior related to the ads, such as viewing, clicking, app downloading start, app downloading completion, and app installation completion; action time, the timestamp when the action happened; and device attribute, user device related attributes, such as device ID, device system models, and screen size.
A seven-day mobile advertising log dataset from June 2015 was studied in this paper, and some examples of our raw data are shown in Table 1.
2.2. Weighted Heterogeneous Graph Embedding
In this section, we first state the problem definition and construct the weighted heterogeneous graph, and then we present the WMP2vec algorithm to learn latent representations of nodes in the weighted heterogeneous graph.
2.2.1. Problem Definition
(1). Given. An undirected weighted heterogeneous graph G = (V, E, W) is given, where V is a set of app nodes, ad nodes, and user nodes; E is a set of undirected weighted edges between any two types of nodes: app nodes and user nodes, user nodes and ad nodes, and ad nodes and app nodes; W is the set of edge weights.
(2). Task. The task is to learn d-dimensional latent representations (where d ≪ |V|) for the nodes, which capture the structural and semantic relations among nodes in the graph G and can be used for classifying fraudulent apps.
2.2.2. Weighted Heterogeneous Graph Construction
The node set of G consists of user nodes, app nodes, and advertisement nodes. If there exists an action from a user to an advertisement through an app, we form three edges: one between the user and the advertisement, one between the user and the app, and one between the app and the advertisement. Together these edges form the edge set of the heterogeneous graph G. Each edge carries a weight, and the three kinds of weights are defined proportional to the behavioral centrality of one endpoint to the other. For a user u and an advertisement p, the weight is computed by equation (1) from the number of times user u operates on advertisement p and the set of operations of user u on all the advertisements; the weights of the other two kinds of edges are computed analogously.
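The weight construction can be illustrated with a small sketch. It assumes that the weight of a user-advertisement edge is the share of the user's operations that target that advertisement (count on the pair divided by the user's total count); this is one reading of the description above, not a reproduction of equation (1). The function name `edge_weights` and the toy log records are hypothetical.

```python
from collections import Counter, defaultdict

def edge_weights(pairs):
    """Return {(src, dst): weight}.

    Assumed form: weight = (times src operates on dst) /
    (all operations of src). This is an illustrative reading,
    not necessarily the exact equation (1).
    """
    counts = Counter(pairs)        # times src operates on dst
    totals = defaultdict(int)      # all operations of src
    for (src, _dst), n in counts.items():
        totals[src] += n
    return {(src, dst): n / totals[src] for (src, dst), n in counts.items()}

# Hypothetical log records: (user, app, ad)
log = [
    ("u1", "app1", "ad1"),
    ("u1", "app1", "ad1"),
    ("u1", "app2", "ad2"),
    ("u2", "app1", "ad1"),
]

# One weight table per edge type of the heterogeneous graph
user_ad = edge_weights([(u, ad) for u, _app, ad in log])
user_app = edge_weights([(u, app) for u, app, _ad in log])
app_ad = edge_weights([(app, ad) for _u, app, ad in log])
```

In this toy log, user `u1` performs two of its three operations on `ad1`, so that edge receives weight 2/3 under the assumed normalization. Because the graph is undirected, a real implementation would also have to decide which endpoint the counts are normalized by.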
2.2.3. Graph Embedding Algorithm
In this section, based on the sequence generation method of metapath-based random walks in heterogeneous graphs [20], we propose the WMP2vec algorithm to generate random walk sequences in a weighted heterogeneous graph and to embed the sequences into representation vectors for nodes with Skip-Gram [21].
(1). Weighted Metapath Based Random Walk. We predefine the number of walks per node, the number of walk sequences, and a metapath M. The metapath M is defined as a path in the heterogeneous graph G that follows a meta-template, that is, a sequence of node types connected by relations between them. Each node and each edge are associated with a mapping function to its node type and edge type, respectively.
Suppose that the relationship between the current node and the next node in the walk is R_{i}.
For walk sequence generation, we go through the metapath scheme the predefined number of times, and each pass generates one corresponding walk sequence. In the first pass, we use two different selection methods (a first phase and a second phase), because there is no limit on the edge weight at the beginning. After the first pass, we use the second-phase method to select the next node.
For the first phase, when the length of the walk sequence is less than 2, the next node in the sequence is randomly selected from those neighbors of the current node that meet the requirements of metapath M [20]. The transition probability is nonzero only for such neighbors.
For the second phase, when the length of the walk sequence is between 2 and the maximum walk length, the transition probability is restricted by a weight bias β. Supposing that the latest weight of an edge of relationship R_{i} is w, the weight of the next edge should fall within a range around w that is determined by β. The next node is then selected from the set of neighbors meeting this requirement.
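The two-phase selection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: the admissible range is assumed to be [w − β, w + β] around the weight w of the most recently traversed edge, eligible neighbors are chosen uniformly, and a walk stops early when no neighbor qualifies. The function name `weighted_metapath_walk`, the adjacency format, and the toy graph are hypothetical.

```python
import random

def weighted_metapath_walk(adj, node_type, start, metapath, walk_length, beta, rng):
    """Generate one walk that follows `metapath` (a cyclic list of node types).

    adj: {node: [(neighbor, weight), ...]}; node_type: {node: type}.
    Assumption: once an edge has been traversed, a neighbor is eligible
    only if its edge weight w2 satisfies |w2 - w| <= beta, where w is the
    weight of the most recently traversed edge.
    """
    walk = [start]
    last_weight = None               # first phase: no edge traversed yet
    cycle = metapath[:-1]            # e.g. P, U, A, U for P-U-A-U-P
    while len(walk) < walk_length:
        wanted = cycle[len(walk) % len(cycle)]
        candidates = [
            (nbr, w) for nbr, w in adj.get(walk[-1], [])
            if node_type[nbr] == wanted
            and (last_weight is None or abs(w - last_weight) <= beta)
        ]
        if not candidates:           # dead end: stop the walk early
            break
        nxt, last_weight = rng.choice(candidates)  # uniform over eligible neighbors
        walk.append(nxt)
    return walk

# Toy undirected graph: P = app (publisher), U = user, A = ad
node_type = {"p1": "P", "p2": "P", "u1": "U", "u2": "U", "a1": "A"}
adj = {}
for x, y, w in [("p1", "u1", 0.9), ("u1", "a1", 0.8),
                ("a1", "u2", 0.2), ("u2", "p2", 0.3)]:
    adj.setdefault(x, []).append((y, w))
    adj.setdefault(y, []).append((x, w))

rng = random.Random(0)
loose = weighted_metapath_walk(adj, node_type, "p1", list("PUAUP"), 5, 1.0, rng)
tight = weighted_metapath_walk(adj, node_type, "p1", list("PUAUP"), 5, 0.15, rng)
```

With β = 1.0 and weights in [0, 1], the restriction never excludes a neighbor, so the walk behaves like an unweighted metapath walk. With β = 0.15, the low-weight edge to `u2` is excluded after the walk has traversed edges with weights near 0.8.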
(2). Embedding Sequence to Vector with Skip-Gram. Based on the weighted metapath random walk sequences, we use the Skip-Gram model [21] and negative sampling [22] to learn low-dimensional representations of nodes.
A description of our proposed WMP2vec algorithm is shown in Algorithm 1.

2.3. Attribute-Based Feature Extraction
From the raw log data (tabular data) of mobile advertising, we define a time window of a fixed number of hours and divide the original data for one day (24 hours) into data blocks, one per time window. Then, a plain statistical analysis is performed on each field in each data block: the ratio of the number of unique values of the field to the total number of records in the specified time window is computed. The attribute-based feature corresponding to one mobile app can be represented as a feature matrix with one row per time window.
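A minimal sketch of this extraction step is given below. It assumes that each record carries an integer `hour` field (0 to 23) and that an empty time window yields a ratio of 0.0; both are illustrative choices, as are the function name `attribute_features` and the field names.

```python
def attribute_features(records, fields, window_hours=1):
    """Build the per-app feature matrix for one day.

    Rows are time windows, columns are fields. Each entry is
    (#unique values of the field) / (#records) inside the window.
    Empty windows yield 0.0 (an assumption of this sketch).
    """
    n_windows = 24 // window_hours
    blocks = [[] for _ in range(n_windows)]
    for rec in records:
        blocks[rec["hour"] // window_hours].append(rec)
    matrix = []
    for block in blocks:
        row = [
            len({rec[f] for rec in block}) / len(block) if block else 0.0
            for f in fields
        ]
        matrix.append(row)
    return matrix

# Hypothetical records of a single app
records = [
    {"hour": 0, "user_id": "u1", "city": "c1"},
    {"hour": 0, "user_id": "u2", "city": "c1"},
    {"hour": 5, "user_id": "u3", "city": "c2"},
]
hourly = attribute_features(records, ["user_id", "city"], window_hours=1)
coarse = attribute_features(records, ["user_id", "city"], window_hours=3)
```

With a one-hour window the matrix has 24 rows; with a three-hour window it has 8 rows, which is how the window size controls the dimension of the attribute-based features.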
2.4. Hybrid Neural Network for Classification
To take advantage of the graph-based features and attribute-based features, we propose a hybrid convolutional neural network (HNN) model to fuse and learn from both kinds of information in the GFD approach. An overview of the hybrid neural network is shown in Figure 2.
In the HNN model, the first layer (input layer) contains the attribute-based feature matrix and the graph-based feature. For n samples, the attribute-based input holds one t × m matrix per sample, where t is the number of time windows in one day (24 hours) and m is the dimension of the attribute-based feature in a time window; the graph-based input holds one d-dimensional node embedding per sample.
The convolutional part includes two convolutional layers. The first convolutional layer convolves its input with a convolution kernel of a given size, adds a bias, and applies the activation function (ReLU in our experiments; see Section 3.4.2).
The second convolutional layer is constructed in the same way on the output of the first convolutional layer, with its own convolution kernel, bias, and kernel size.
The output of the second convolutional layer is flattened into a vector whose length is the number of elements in that output.
We concatenate the flattened output and the other input feature into a single matrix, which is the input of the first fully connected layer. The first fully connected layer applies a weight and a bias followed by the activation function, and its width is the number of neurons in the layer.
The second fully connected layer is constructed in the same way on the output of the first fully connected layer, with its own weight, bias, and number of neurons.
The output of HNN is the probability that an application is fraudulent. It is obtained by applying a weight and a bias to the output of the second fully connected layer and passing the result through the sigmoid function σ.
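The layer sequence can be made concrete with a small forward-pass sketch for a single sample. The text here does not state unambiguously which of the two inputs feeds the convolutional part; this sketch assumes that the attribute-based matrix is convolved and the graph-based vector joins at the concatenation step. It also assumes a single channel, stride 1, no padding, and randomly initialized weights. All names (`conv2d`, `dense`, `hnn_forward`, `params`) and the small hidden width are hypothetical.

```python
import math
import random

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def conv2d(x, kernel, bias):
    """Single-channel 2-D convolution (no padding, stride 1) followed by ReLU."""
    k = len(kernel)
    rows, cols = len(x) - k + 1, len(x[0]) - k + 1
    return [[relu(bias + sum(x[i + a][j + b] * kernel[a][b]
                             for a in range(k) for b in range(k)))
             for j in range(cols)] for i in range(rows)]

def dense(vec, weights, biases, activation):
    """Fully connected layer with one weight row per neuron."""
    return [activation(sum(w * v for w, v in zip(row, vec)) + b)
            for row, b in zip(weights, biases)]

def hnn_forward(attr, graph_vec, params):
    c1 = conv2d(attr, params["k1"], params["b1"])
    c2 = conv2d(c1, params["k2"], params["b2"])
    flat = [v for row in c2 for v in row]       # flatten the conv output
    fused = flat + list(graph_vec)              # concatenate both feature kinds
    h1 = dense(fused, params["w1"], params["c1"], relu)
    h2 = dense(h1, params["w2"], params["c2"], relu)
    return dense(h2, params["wo"], params["bo"], sigmoid)[0]

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.05, 0.05) for _ in range(cols)] for _ in range(rows)]

t, m, d, hidden = 24, 24, 32, 8                 # hidden width kept small for the demo
fused_len = (t - 4) * (m - 4) + d               # two 3x3 convolutions shrink each side by 4
params = {
    "k1": rand_matrix(3, 3), "b1": 0.0,
    "k2": rand_matrix(3, 3), "b2": 0.0,
    "w1": rand_matrix(hidden, fused_len), "c1": [0.0] * hidden,
    "w2": rand_matrix(hidden, hidden), "c2": [0.0] * hidden,
    "wo": rand_matrix(1, hidden), "bo": [0.0],
}
attr = [[rng.random() for _ in range(m)] for _ in range(t)]
graph_vec = [rng.uniform(-1.0, 1.0) for _ in range(d)]
prob = hnn_forward(attr, graph_vec, params)
```

Under these assumptions a 24 × 24 input shrinks to 20 × 20 after two 3 × 3 convolutions, so the fused vector has 400 + 32 = 432 entries. A practical implementation would use a deep learning framework with multiple channels, dropout, and batched inputs.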
The cross-entropy function with L2 regularization is used to calculate the loss of the hybrid convolutional neural network model.
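As an illustration, the loss can be written as the mean binary cross-entropy plus a penalty on the squared weights. The regularization coefficient `lam` and the function name `bce_l2_loss` are hypothetical, and some conventions scale the penalty by lam/2 instead.

```python
import math

def bce_l2_loss(probs, labels, weights, lam, eps=1e-12):
    """Mean binary cross-entropy plus lam * sum of squared weights."""
    ce = -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for p, y in zip(probs, labels)
    ) / len(labels)
    l2 = lam * sum(w * w for w in weights)
    return ce + l2

# Two maximally uncertain predictions give a cross-entropy of ln 2
plain = bce_l2_loss([0.5, 0.5], [1, 0], [], 0.0)
penalized = bce_l2_loss([0.5, 0.5], [1, 0], [1.0, 2.0], 0.1)
```

The `eps` clamp only guards against taking the logarithm of zero when a predicted probability saturates.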
3. Experiments
3.1. Data Description and Preprocessing
A real-world dataset was collected from a mobile advertising platform in China. The dataset covers seven days, with around 2 M users, 3.5 K apps, and 1 K advertisements per day. We partition our log data into seven one-day subsets and conduct experiments on each subset to evaluate our model. Fraudulent apps account for about 2–4 percent of the roughly 3,500 apps each day. More details of the dataset are described in Section 2.1.
3.2. Evaluation Metric
In this paper, we treat the fraudulent apps as positive samples and the other apps as negative samples. The Average Precision (AP) and the Area Under the ROC Curve (AUC) are used to evaluate the proposed algorithm and approach.
The AP criterion summarizes the Precision-Recall performance at different threshold levels and corresponds to the area under the Precision-Recall curve. The ROC curve is created by plotting the true positive rate against the false positive rate at various threshold settings. The AUC is the total area under the ROC curve.
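Both metrics can be computed directly from labels and scores, as in the sketch below. It uses the common step-wise definition of AP (mean of the precision values at the ranks of the positive samples) and the pairwise-ranking form of AUC, with tied scores counted as one half. The function names are hypothetical, and ties between scores in the AP ranking are broken arbitrarily in this sketch.

```python
def average_precision(labels, scores):
    """Mean of precision@rank taken at the rank of every positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_score, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predictions: 1 = fraudulent app, 0 = normal app
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
ap = average_precision(labels, scores)
auc = roc_auc(labels, scores)
```

In this toy example the positives sit at ranks 1 and 3, so AP is (1/1 + 2/3)/2, and five of the six positive-negative pairs are ordered correctly. With only 2–4 percent positives, AP is usually the more sensitive of the two measures because it is not inflated by the large number of true negatives.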
3.3. Evaluation of WMP2vec Algorithm
In this section, we use the WMP2vec algorithm to learn the embedding vectors of the app nodes from the constructed weighted heterogeneous graph and then take these embedding vectors as the input of a Random Forest (RF) model to classify fraudulent apps.
Based on Section 2.2.2, we construct a weighted heterogeneous graph and define a metapath, app-user-ad-user-app (PUAUP), which represents the heterogeneous semantics of fraudulent publishers (apps) that mimic legitimate users to act on the ads from the apps.
3.3.1. Comparison Models and Parameters
We compare the AP and AUC of the WMP2vec model with three well-known graph embedding models: DeepWalk [23], Node2vec [24], and Metapath2vec [20]. The compared algorithms and their parameters are as follows:
(1) DeepWalk: DeepWalk [23] is the first graph embedding model based on Word2vec. We use the Skip-Gram model [21] and hierarchical softmax [25] with gradient descent to learn the node representation. The negative sampling technique [22] is used to accelerate the Skip-Gram model. The count of random walks is 30, and the walk length is 40.
(2) Node2vec: Node2vec [24] extends the DeepWalk algorithm by introducing a backward probability p and a forward probability q. The same random walk parameters as for DeepWalk (count = 30 and length = 40) are used, and the negative sampling technique is also used. In addition, we set the backward probability p and the forward probability q, with q = 0.2.
(3) Metapath2vec: Metapath2vec [20] uses metapath-based random walks to construct node sequences and then leverages Skip-Gram to perform node embedding. The metapath in this study is PUAUP. The count of random walks is 30, and the walk length is 10.
(4) WMP2vec: We use the same parameters as for Metapath2vec (count = 30, length = 10, and metapath = PUAUP), and the weighted bias β is additionally set to 0.1.
In all the compared models, we train the Skip-Gram model with a window size of 5, and the number of negative samples in negative sampling is 5. The graph-based feature of each node is a 32-dimensional vector. The parameters of the RF model are as follows: the number of weak learners is 150, the maximum depth is 5, and the minimum number of samples per leaf is 5.
3.3.2. Experimental Results
Tables 2 and 3 show the experimental results by comparing the AP and AUC over 10-fold cross-validation for seven days. The WMP2vec model reached the highest AP value on six days and the highest AUC value on three days out of all seven days. The Metapath2vec model reached the highest AP value on one day and the highest AUC value on two days. Overall, WMP2vec outperforms the other models, and the models rank as WMP2vec > Metapath2vec > Node2vec > DeepWalk.
3.3.3. Impacts of Parameters
In this subsection, we evaluate the impacts of parameters on the classification task: (i) the count of random walks, the walk length, and the window size of Skip-Gram in the WMP2vec and Metapath2vec models; (ii) the weighted bias β of WMP2vec. We compare the AP and AUC values on the dataset from one day.
(1). Count of Random Walk. Figure 3 shows the experimental results by comparing the AP and AUC with different counts of random walks, with a fixed walk length of 5. When the count of random walks is 30 or larger, both WMP2vec and Metapath2vec perform better than with count = 10. In addition, the values of AP and AUC change only slightly when the count of random walks is 30, 50, or 70.
(2). Walk Length. Figure 4 shows the experimental results by comparing the AP and AUC with different walk lengths (length = 5, 10, 20, 50, and 80), with a fixed count of random walks of 10. The WMP2vec and Metapath2vec models reach better performance when the walk length is ≥ 10. In addition, when the length changes from 10 through 20 and 50 to 80, the AP values change very little and the AUC values fluctuate somewhat.
(3). Window Size of Skip-Gram. Figure 5 shows the experimental results by comparing the AP and AUC with different window sizes (size = 3, 4, 5, 6, and 7) on the classification task. The best performance of the models is reached when the window size is 5.
(4). Weighted Bias of WMP2vec. Figure 6 shows the experimental results by comparing the AP and AUC with different weighted biases β of WMP2vec (β = 0.1, 0.3, 0.5, 0.7, and 1.0) on the classification task. As the weighted bias β increases, the performance of WMP2vec gets closer to its performance at β = 1.0. The values of AP and AUC change very little when β ≥ 0.5.
3.4. Evaluation of Hybrid Neural Network
In this section, we evaluate the classification performance of the HNN model for fusing graph-based features and attribute-based features in the GFD approach. Following the flow of the GFD approach in Figure 1, we extract the attribute-based features and the graph-based features and then use the HNN model to fuse the two kinds of features to identify fraudulent apps.
3.4.1. Feature Extraction
Based on Section 2.3, we divide the log data for each app into 24 parts per day; that is, the time window is one hour. We calculate the ratio of records whose attributes take a certain value to all records in each time window, and we calculate it for each of 22 attributes in total, such as anonymized user ID, advertisement ID, country ID, and device operating system. In addition, we calculate the ratio for the browsing behavior and for the other actions of users on ads, respectively. Finally, we get 24 features for a time window (one hour), namely the 22 attribute ratios plus the 2 behavior ratios, and the dimension of the attribute-based features of each app is 24 × 24 for one day.
Based on Section 2.2.2 and Section 3.3, for the graph-based feature extraction, we construct the weighted heterogeneous graph of users, apps, and ads and then extract the graph-based features by training WMP2vec. The dimension of the graph-based features for each app is 32.
3.4.2. Comparison Models and Experiment Setup
We compare the proposed HNN with Support Vector Machine (SVM), Random Forests (RF), and Fully Connected Neural Networks (FCNN).
(i) SVM: SVM is an effective and widely used two-class classification model. The RBF kernel is used, and the penalty parameter C is 0.9.
(ii) RF: RF is a well-known ensemble learning method that operates by constructing a multitude of decision trees at training time. The number of decision trees is 200 with a depth of 5. The minimum samples split and the minimum samples leaf are both set to 5.
(iii) FCNN: FCNN is a fully connected neural network. The number of hidden layers is 4, with 100 neurons in each layer. The learning rate is 0.001, and the keep probability of dropout is 0.9.
(iv) HNN: HNN is the fusing model proposed in this study. The number of convolutional layers is 2, and the kernel size is 3 × 3. The number of fully connected layers is 2 with 100 neurons each, using the ReLU activation function, and the keep probability of dropout is 0.9. The learning rate is 0.0001, the decay factor of the learning rate is 0.98, and the batch size is 100.
To make sure that all models can learn the same knowledge from the dataset, when training the comparison models, we flatten the attribute-based features into a 576-dimensional vector. Furthermore, this vector is concatenated with the graph-based features, so the dimension of the total input vector is 576 + 32 = 608.
We randomly divide the negative samples and the positive samples of the dataset separately into three subsets with a ratio of 8 : 1 : 1 and combine the corresponding positive and negative subsets into the training (80%), validation (10%), and test (10%) sets. To handle the class imbalance between fraudulent and non-fraudulent apps, we adopt an upsampling technique during training.
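The split and the upsampling step can be sketched as follows. The sketch assumes that upsampling means duplicating randomly chosen minority-class samples in the training set until both classes have the same size; the upsampling variant actually used is not specified here. The function names and the toy data are hypothetical.

```python
import random

def stratified_split(samples, labels, rng, ratios=(0.8, 0.1, 0.1)):
    """Split positives and negatives separately, then merge the parts."""
    parts = ([], [], [])
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        a = int(len(idx) * ratios[0])
        b = a + int(len(idx) * ratios[1])
        for part, chunk in zip(parts, (idx[:a], idx[a:b], idx[b:])):
            part.extend((samples[i], labels[i]) for i in chunk)
    return parts

def upsample(train, rng):
    """Duplicate random minority-class samples until the classes are balanced."""
    pos = [s for s in train if s[1] == 1]
    neg = [s for s in train if s[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:                       # nothing to duplicate
        return list(train)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return train + extra

# Hypothetical data: 100 apps, 10 of them fraudulent
rng = random.Random(0)
samples = [f"app{i}" for i in range(100)]
labels = [1 if i < 10 else 0 for i in range(100)]
train, val, test = stratified_split(samples, labels, rng)
balanced = upsample(train, rng)
```

Only the training set is upsampled; the validation and test sets keep the original class ratio, so the reported metrics reflect the real imbalance.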
3.4.3. Experimental Results
The experimental results are shown in Tables 4 and 5. The HNN model proposed in this study reaches the highest AP value on six days and the highest AUC value on four days out of all seven days. The FCNN, RF, and SVM models perform similarly in terms of the AUC measure, and Table 4 shows that HNN > FCNN > RF > SVM in terms of the AP measure. Overall, HNN outperforms the other models in terms of the AP and AUC measures.
3.4.4. Comparative Experiments without Graph-Based Features
To show the contribution of graph-based feature extraction in the proposed GFD approach, we remove the graph-based features from our dataset. When the proposed HNN model has only attribute-based features as input and no graph-based features, only the fully connected part of the HNN model remains active, since the convolutional part of the HNN model has no input. This also means that the working HNN model reduces to a fully connected neural network, that is, the FCNN model, in this setting. We therefore use the SVM, RF, and FCNN models in this comparative experiment. The results are shown in Tables 6 and 7. Comparing the performance of the models with and without graph-based features (Tables 4 and 5 versus Tables 6 and 7), we find that the FCNN model with graph-based features performs better than the model without them in both AP and AUC, while the improvement of the SVM and RF models with graph-based features is not obvious.
3.4.5. Impacts of Parameters
(1). Time Windows. The time window in the attribute-based feature extraction of the GFD approach determines the dimension of the attribute-based features. We designed experiments to show the impact of the time window, and the results are shown in Table 8. The size of the time window is set to 1, 3, and 6 hours. As the size of the time window increases, the AP values of HNN decrease. The other models do not appear to be sensitive to the size of the time window.
(2). Number of Convolutional Layers in HNN Model. We compare the effect of the number of convolutional layers of 1, 2, and 3 in HNN model and show the results in Table 9. The AUC and AP values achieve a high level when the number of convolutional layers is 2.
(3). Number of Fully Connected Layers in HNN Model. We set the number of fully connected layers to be from 1 to 4, and the experiment result is shown in Table 10. When the number of fully connected layers is 2, the HNN model reaches the highest performance.
(4). Activation Functions in HNN Model. We compare three well-known activation functions, ReLU, tanh, and Sigmoid, in the HNN model, and the experimental results are shown in Table 11. The AUC values of the models with different activation functions are similar, and ReLU is slightly better than the others. In terms of AP, ReLU is clearly better than the other two activation functions.
4. Related Work
Our work is related to existing studies on attribute-based fraud detection and graph-based fraud detection with machine learning. The challenges of the fraud detection problem in mobile advertising systems are summarized as the accuracy requirement, the throughput requirement, and the ability to combat the latest fraud methods [1].
Attribute-based fraud detection approaches have been used in the fraud detection domain. Crussell et al. [26] built decision trees based on the features extracted from their dataset for classification. Liu et al. [27] proposed a binary SVM classifier to determine whether two UIs are likely to lead to equivalent states. This classification is used to simulate user interaction in the context of ad clicking. To classify malicious publishers, Mouawi et al. [11] evaluated KNN, SVM, and ANN based on features extracted from a dataset, and the experimental results show that all three classifiers give very promising results. Haider et al. [2] proposed an ensemble-based method to classify each individual ad display as fraudulent or non-fraudulent. Gabriel et al. [28] evaluated the performance of logistic regression, gradient trees, and a deep learning method in credit card fraud detection and showed that the deep learning method outperforms the other compared methods.
Graph-based fraud detection approaches have been studied recently. Hu et al. [15] proposed a weighted graph propagation algorithm to identify fraudulent apps in user-app bipartite graphs. Vasumati et al. [29] applied decision trees to classify spam publishers based on a constructed feature vector and computed a spam score for each spam publisher by constructing a bipartite graph between users and publishers to find fraudulent publishers. What is more, the natural language processing (NLP) model known as Word2vec [21] has been applied to graph embedding, for example, in DeepWalk [23], Node2vec [24], and Metapath2vec [20]. Zheng et al. [30] proposed an unsupervised method to detect abnormal users and items through deep joint network embedding. Yu et al. [16] proposed a deep embedding approach for anomaly detection in dynamic networks by learning network representations that can be updated dynamically as the network evolves.
Mobile advertising fraud detection is still challenging; however, ensemble learning methods were usually the winner algorithms in fraud detection competition [10], and deep learning and graph learning are recently the most promising methods in this area.
There are two key differences between our proposed approach and existing works. First, we used the app ID, ad ID, and user ID from the real-world dataset to construct a weighted heterogeneous graph with these three types of nodes and proposed a graph embedding algorithm for mobile advertising fraud detection. Popular existing datasets, such as the TalkingData dataset [31], usually have one or two types of entities (e.g., app ID), so there are not enough entities to construct a heterogeneous graph as we did in this paper. Second, we proposed a fusing model to combine attribute-based and graph-based information for mobile advertising fraud detection by graph embedding and deep learning methods.
5. Conclusion
In this paper, we focus on the fraud detection problem in mobile advertising to detect fraudulent publishers. We propose a novel weighted heterogeneous graph and deep learning-based fraud detection approach, namely, GFD, to identify fraudulent apps for mobile advertising. Based on the relationships among users, publishers, and advertisements in a mobile ad system, we construct a weighted heterogeneous graph and propose a weighted metapath-based graph embedding approach, named WMP2vec, to learn structural features of publishers in the graph. Furthermore, we construct a hybrid convolutional neural network to learn high-order features from attribute-based features and graph-based features. The experimental results on a real-world dataset show that our method is effective in classifying fraudulent apps for mobile advertising systems.
There are two limitations in the work presented here. First, the dataset is limited to one mobile advertising dataset. In order to be more generalizable, it would be important to see whether the proposed GFD approach excels in more fraud detection datasets. Second, the dataset is limited to seven days. In the complex and dynamic online advertising environment, more time is still needed to evaluate the proposed approach.
Although this paper focuses on mobile advertising fraud detection, the proposed GFD approach could be generalized to benefit many other online applications (e.g., e-commerce) that involve relationships between several types of entities. Future work should focus on the robustness and accuracy of our proposed model on other large-scale online datasets.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest.
Acknowledgments
This work was supported in part by the Natural Science Foundation of Guangdong Province of China (Grant no. 2018A030313309), the Innovation Fund of Introduced High-End Scientific Research Institutions of Zhongshan (Grant no. 2019AG031), and the Fundamental Research Funds for the Central Universities, SCUT (Grant no. 2019KZ20).