Abstract

Click-through rate prediction is critical in Internet advertising and affects web publishers' profits and advertisers' payments. Traditional feature extraction methods do not account for the sparseness of advertising data or the highly nonlinear associations between features. To reduce data sparseness and mine the hidden features in advertising data, a sparse feature learning method is proposed. Our method exploits decomposition-based dimension reduction, takes advantage of the attention mechanism in neural network modelling, and improves FM so that feature interactions contribute differently to the prediction. We use a stacked autoencoder to explore high-order feature interactions and an improved FM for low-order feature interactions, thereby capturing the nonlinear relationships between features. Experiments show that our method improves CTR prediction and produces economic benefits in Internet advertising.

1. Introduction

Click-through rate (CTR) prediction is critical to many web applications, including web search, recommender systems [1, 2], sponsored search, and display advertising. In search advertising, also known as sponsored search, advertisers select keywords relevant to their products or services; when a user's query matches a keyword purchased by an advertiser, the corresponding advertisement is triggered and displayed. In the cost-per-click model, the advertiser pays the web publisher only when a user clicks the advertisement and visits the advertiser's site. CTR prediction is defined as estimating the ratio of clicks to impressions for advertisements that will be displayed [3].

With the rapid development of the mobile Internet and its wide range of applications, advertising has become one of the most successful business models in the world. Internet text advertising is regarded as a particularly effective form of advertising because it is highly targeted and convenient for users to click, and it has become an important source of income for many Internet companies. Electronic commerce and search engine companies increasingly rely on targeted advertising to raise their revenue.

In general, the display of online advertising can be seen as a three-party game between media, advertisers, and users. How to show advertisements to specific user groups is a key issue in the field of online advertising: inappropriate advertising degrades the user experience, the advertisement fails to achieve its desired effect, and the media's revenue suffers as well. Internet text advertising is usually displayed as text, and advertisers buy the media's ad slots through the cost-per-click (CPC) model [4]. In the CPC model, the click-through rate (CTR) is an important indicator of the effectiveness of an advertising display and a key factor in the three-party game. Therefore, CTR estimation for advertising is an active research direction in the field of computational advertising. In this paper, CTR prediction for Internet text advertising means predicting the probability that a user clicks a text advertisement in the current context. Because the prediction depends on three sources of information (advertising properties, user properties, and the context environment), CTR prediction is a complicated task.

At present, the prediction of click-through rate for online advertising has attracted widespread attention from researchers in industry and academia. Researchers have proposed many models, usually based on machine learning methods, which can be divided into three categories: linear, nonlinear, and fusion models. Typically, a predictive task is formulated as estimating a function that maps predictor variables to some target. To build predictive models with these predictor variables, a common solution is to convert them to a set of binary features (a.k.a. a feature vector) via one-hot encoding [5]. McMahan et al. [6] used the logistic regression [7] model to solve the CTR problems of Google Advertising. They adopted user information, advertising data, search keywords, and other features as the input of the model and proposed an online sparse learning algorithm to train it. Chapelle [8] proposed a machine-learning framework based on logistic regression in which advertiser, web publisher, user, and time characteristics were used as input to the model to solve the advertising CTR prediction for Yahoo. Dave and Varma [9] used the gradient boosting decision tree (GBDT) to predict the advertising CTR. They extracted similar features from advertising data, discovered implicit relationships between different features, and captured the nonlinear relationships between the predicted target and the features. He et al. [10] introduced a fusion model that combines decision trees with logistic regression for predicting clicks on Facebook ads. Traditional CTR prediction models depend mainly on the design of features: the features are manually selected and processed. Advertising data exhibit complex mapping relationships, and it is crucial to account for the interactions between features. Many successful solutions in both industry and academia largely rely on manually crafted combinatorial features [11], i.e., new features constructed by combining multiple predictor variables, also known as cross features. However, the power of such features comes at a high cost, since designing them effectively requires heavy engineering effort and domain knowledge. Factorization machines (FMs) [12] are a supervised learning approach that embeds features into a latent space and models the interactions between features via the inner product of their embedding vectors. Models based on degree-2 polynomial mappings and factorization machines are widely used for CTR prediction. Juan et al. [13] developed field-aware factorization machines, a factorization-based prediction method.

In recent years, deep learning [14, 15] has achieved very good results in the fields of speech recognition [16], image processing [17], and natural language processing [18]. As a powerful approach to learning feature representations, deep neural networks have the potential to learn sophisticated feature interactions. Liu et al. [19] extended CNNs for CTR prediction, but CNN-based models are biased towards interactions between neighboring features. Zhang et al. [20] studied feature representations and proposed the factorization machine-supported neural network (FNN). This model pretrains FM before applying the DNN and is thus limited by the capability of FM. He and Chua [21] proposed the neural factorization machine (NFM) for prediction under sparse settings. NFM combines the linearity of FM in modelling second-order feature interactions with the nonlinearity of neural networks in modelling higher-order feature interactions. Despite its great promise, we argue that FM is hindered by modelling all factorized interactions with the same weight. In real-world applications, different predictor variables often have different predictive power, and not all features contain useful information for predicting the target. The interactions of less informative features should therefore be assigned lower weights, indicating that they contribute less to the prediction. However, FM lacks the ability to distinguish the importance of feature interactions, which leads to suboptimal prediction.

Considering the high-dimensional sparsity of advertising data and the highly nonlinear association between features [22], a hybrid model for advertising CTR estimation based on the stacked autoencoder, named Attention Stacked Autoencoder (ASAE), is proposed. Our model takes advantage of the attention mechanism in neural network modelling [23, 24] and improves FM so that feature interactions contribute differently to the prediction. More importantly, the importance of feature interactions is learned automatically from the data without any human domain knowledge. We explore data dimension reduction and identify the relationships between features. Extensive experiments show that this method improves the accuracy of CTR estimation.

The rest of this paper is organized as follows. Section 2 reviews factorization machines. In Section 3, the sparse feature learning method for advertising data based on the ASAE model is proposed. In Section 4, we design the experiments, verify the prediction performance of the method through comparison experiments, and analyze the results. Section 5 concludes the paper and lists possible future work.

2. Factorization Machines

Factorization machines were originally proposed for learning feature interactions in recommender systems. Given a real-valued feature vector $\mathbf{x} \in \mathbb{R}^{n}$, where $n$ denotes the number of features, FM estimates the target by modelling all interactions between each pair of features:

$$\hat{y}_{FM}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \hat{w}_{ij}\, x_i x_j,$$

where $w_0$ is the global bias, $w_i$ denotes the weight of the $i$th feature, and $\hat{w}_{ij}$ denotes the weight of the cross feature $x_i x_j$, which is factorized as $\hat{w}_{ij} = \mathbf{v}_i^{\top}\mathbf{v}_j$, where $\mathbf{v}_i \in \mathbb{R}^{k}$ denotes the embedding vector for feature $i$ and $k$ denotes the size of the embedding vectors. Besides linear (order-1) interactions among features, FM models pairwise (order-2) feature interactions as the inner product of the respective feature latent vectors. It can capture order-2 feature interactions much more effectively than previous approaches, especially when the dataset is sparse. It is worth noting that FM models all feature interactions in the same way: first, a latent vector $\mathbf{v}_i$ is shared in estimating all feature interactions that the $i$th feature involves; second, all feature interactions have the same weight of 1. However, it is common that not all features are relevant to the prediction. The interactions of irrelevant features can be considered noise that does not contribute to the prediction, and modelling all interactions with the same weight may have a negative impact on generalization performance.
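To make the evaluation of this formula concrete, the following minimal NumPy sketch scores one sample using the standard FM identity that rewrites the pairwise sum in O(nk) time. The parameter names (w0, w, V) mirror the notation above but are illustrative, not the paper's implementation.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Minimal FM score for one sample.

    x  : (n,)   real-valued feature vector
    w0 : scalar global bias
    w  : (n,)   first-order weights
    V  : (n, k) embedding vectors, one row per feature
    """
    linear = w0 + np.dot(w, x)
    # Pairwise term via the O(n*k) identity:
    # sum_{i<j} <v_i, v_j> x_i x_j
    #   = 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i (v_if x_i)^2 ]
    xv = V * x[:, None]                 # (n, k): each row is v_i * x_i
    pairwise = 0.5 * np.sum(np.sum(xv, axis=0) ** 2 - np.sum(xv ** 2, axis=0))
    return linear + pairwise
```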

3. Click-through Rate Estimation Based on Deep Neural Network

One of the necessary steps in a click-through rate prediction system is to mine features that are highly correlated with the prediction task. To reduce the high sparseness of the features and to characterize the nonlinear associations between them, we propose a sparse feature learning method for advertising data based on deep learning (DLSAE).

3.1. Data Dimensionality Reduction

Click log data contain many types of objects, such as users, queries, and advertisements. The relationships between these objects are complex: objects of the same type exhibit similarity, while objects of different types are connected by intricate relationships. For instance, given a particular user and the query submitted by that user, it is necessary to predict whether the user will click on an advertisement and with what probability. There are complex implicit relationships between users, queries, and advertisements. Based on these characteristics of the click log data, dimension reduction is carried out in two respects: the similarity between objects of the same type and the association between objects of different types.

In this paper, the distance-based k-means clustering algorithm [25] is adopted. We cluster queries, advertisements, and users separately, so that similar objects are aggregated into the same cluster. Using advertising frequency as the weight between an advertisement and a query, we build an ad-query matrix $M \in \mathbb{R}^{m \times n}$ (where $m$ is the number of ads and $n$ is the number of queries) and apply the k-means algorithm to it. We scan the ad-query matrix to obtain the ad set $A = \{a_1, a_2, \ldots, a_m\}$ and the query set $Q = \{q_1, q_2, \ldots, q_n\}$. Then, we randomly take $K$ samples from the advertising set as the initial cluster centers, denoted $C = \{c_1, c_2, \ldots, c_K\}$, and compute the weighted distance between each ad $a_i$ and each cluster center $c_j$, where the weights are the advertising frequencies of the corresponding ads and queries. Users and queries are clustered in the same way, and the numbers of user, ad, and query clusters are denoted $K_u$, $K_a$, and $K_q$, respectively. Finally, the numbers of users, ads, and queries in the dataset are reduced from $(N_u, N_a, N_q)$ to $(K_u, K_a, K_q)$.
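As an illustration of this clustering step, the sketch below clusters the rows of an ad-query frequency matrix with scikit-learn's KMeans and replaces every AdID by its cluster id. The matrix sizes, the random frequencies, and the cluster count are made-up values, not the paper's actual settings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative ad-query frequency matrix: rows are ads, columns are queries,
# entries are how often the ad was shown for that query.
rng = np.random.default_rng(0)
ad_query = rng.poisson(lam=0.3, size=(1000, 500)).astype(float)

K_a = 50                                  # assumed number of ad clusters
km = KMeans(n_clusters=K_a, n_init=10, random_state=0)
ad_cluster = km.fit_predict(ad_query)     # cluster id for every ad

# Each original AdID is replaced by its "virtual" cluster id, shrinking the
# ad dimension from 1000 to K_a; users and queries are clustered analogously.
print(ad_cluster[:10])
```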

There is a ternary relationship between user, query, and ad in the click log data. In this paper, we use a three-dimensional tensor structure model [26, 27] to represent the users, queries, and advertisements, and the tensor decomposition method is then used to reduce its dimensions. The sum of the display counts of the ads in each cluster is used as the weight of the corresponding element in the 3D space. The resulting three-dimensional tensor is denoted $\mathcal{X} \in \mathbb{R}^{K_u \times K_q \times K_a}$. In this paper, the tensor $\mathcal{X}$ is decomposed using the Tucker factorization:

$$\mathcal{X} \approx \mathcal{G} \times_1 U \times_2 Q \times_3 A,$$

where $\mathcal{G}$ represents the core tensor of $\mathcal{X}$, and $U$, $Q$, and $A$ represent the factor matrices of the tensor $\mathcal{X}$ on the user, query, and ad dimensions, respectively.

Figure 1 is a schematic diagram of the Tucker decomposition. The purpose of the Tucker decomposition is to find an approximate tensor that is close to the original tensor and retains the original tensor's information and structure to the greatest extent. The minimization problem is shown below:

$$\min_{\mathcal{G},\, U,\, Q,\, A} \left\| \mathcal{X} - \mathcal{G} \times_1 U \times_2 Q \times_3 A \right\|^2.$$

This is the objective optimization function. For orthonormal factor matrices, the expression of the core tensor follows from the decomposition as

$$\mathcal{G} = \mathcal{X} \times_1 U^{\top} \times_2 Q^{\top} \times_3 A^{\top},$$

and the objective function can be written in a squared form:

$$\left\| \mathcal{X} - \mathcal{G} \times_1 U \times_2 Q \times_3 A \right\|^2 = \left\| \mathcal{X} \right\|^2 - \left\| \mathcal{G} \right\|^2.$$

Therefore, since $\left\| \mathcal{X} \right\|^2$ is constant, minimizing the objective function is equivalent to maximizing the norm of the core tensor:

$$\max_{U,\, Q,\, A} \left\| \mathcal{X} \times_1 U^{\top} \times_2 Q^{\top} \times_3 A^{\top} \right\|^2.$$

To solve for the optimal factor matrices, we fix the matrices of the other dimensions, solve for the remaining one, and perform a singular value decomposition (SVD) of the corresponding unfolding. Specifically, we unfold the tensor $\mathcal{X}$ into matrices $X_{(1)}$, $X_{(2)}$, and $X_{(3)}$ along the user, query, and advertising dimensions, respectively, and apply SVD to each unfolding:

$$X_{(n)} = U_n \Sigma_n V_n^{\top}, \quad n = 1, 2, 3,$$

where $\Sigma_n$ are the diagonal singular value matrices obtained from the singular value decomposition of the matrices $X_{(n)}$. The reduced dimensions are obtained by keeping the leading diagonal singular values of each $\Sigma_n$ in proportion; in this paper, the proportion of excluded singular values is set to 50%. The factor matrices $U$, $Q$, and $A$ consist of the corresponding leading left singular vectors, and the core tensor after dimension reduction is calculated as

$$\mathcal{G} = \mathcal{X} \times_1 U^{\top} \times_2 Q^{\top} \times_3 A^{\top}.$$

The three dimensions of the initial tensor are $(K_u, K_q, K_a)$, and the three dimensions of the approximate tensor after dimension reduction are denoted $(R_u, R_q, R_a)$. The time complexity of the Tucker decomposition algorithm is proportional to the size of the tensor, i.e., to the product of its dimensions. Because we previously used clustering to shrink the original matrices, the cost of the Tucker decomposition is greatly reduced, which improves both efficiency and precision.
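A minimal NumPy sketch of this procedure is given below. It performs a truncated higher-order SVD: each mode unfolding is decomposed with SVD, roughly half of the singular values are kept per mode (matching the 50% exclusion ratio described above), and the core tensor is formed by mode products with the transposed factor matrices. The tensor sizes and the helper names (unfold, hosvd_truncated) are illustrative assumptions, not the paper's code.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding of a 3-way tensor into a matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd_truncated(X, keep=0.5):
    """Truncated HOSVD: SVD each mode unfolding, keep a fraction of the
    leading left singular vectors, then form the core tensor."""
    factors = []
    for mode in range(3):
        U, s, _ = np.linalg.svd(unfold(X, mode), full_matrices=False)
        r = max(1, int(np.ceil(keep * len(s))))   # keep ~50% of singular values
        factors.append(U[:, :r])
    # Core tensor G = X x_1 U^T x_2 Q^T x_3 A^T (mode products with transposes)
    G = np.einsum('ijk,ia,jb,kc->abc', X, factors[0], factors[1], factors[2])
    return G, factors

# Toy user-query-ad tensor built from cluster-level impression counts
# (illustrative sizes only).
X = np.random.rand(30, 40, 20)
G, (U, Q, A) = hosvd_truncated(X)
X_approx = np.einsum('abc,ia,jb,kc->ijk', G, U, Q, A)
print(G.shape, np.linalg.norm(X - X_approx) / np.linalg.norm(X))
```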

3.2. Feature Composition Analysis of the Input Layer

There is a high degree of nonlinear correlation between the features in advertising data. Although the approximate tensor obtained by the Tucker decomposition is much smaller than the original tensor, it only reflects the information between the three characteristic dimensions of user, query, and ad. Other useful information in the data, such as the position of the advertisement on the page, the number of ads, and the age and gender of the user, is not yet exploited for click-through rate estimation. This paper combines the <user, query, ad> features obtained after tensor reduction with other valid information in the log data as the object of feature learning. The composition of the input layer features is summarized as follows (a small construction sketch is given after this list):

(1) ID Features. An ID feature uniquely identifies a class of entities in the click log, usually represented by a numeric string; for instance, "10110" identifies exactly one user group. The ID features used in this article are the UserID, QueryID, AdID, the position of the advertisement, and the number of advertisements on the returned page. UserID, QueryID, and AdID are "virtual" ID classes obtained through k-means clustering and tensor dimension reduction.

(2) Attribute Features. ID features are symbols that cannot be transferred to new entity data and therefore have weak generalization ability. Attribute features describe a set of users, a collection of ads, etc.; they generalize better and apply to multiple instances, so they are added as further input-layer features. Commonly used attribute features are the user's URL, the user's gender, the user's age, the time at which the advertisement is triggered, and the query keywords.

(3) Statistical Features. Statistical features summarize historical data to provide estimates for the prediction model. They consist of the number of historical ad impressions, the number of historical ad clicks, and the click-through rate after position normalization, denoted by Shows, Clicks, and COEC, respectively. The input layer features of the ASAE model used in the experiments are shown in Figure 2.
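The sketch below shows one plausible way to assemble such an input vector: the one-hot encoded "virtual" ID and attribute fields are concatenated with the three dense statistical features. The vocabularies, field names, and sizes are hypothetical and only serve to illustrate the layout summarized in Figure 2.

```python
import numpy as np

# Hypothetical vocabularies for the "virtual" ID features obtained after
# clustering / tensor reduction, plus a few attribute features.
vocab = {
    "user_cluster": 50, "query_cluster": 80, "ad_cluster": 60,
    "position": 3, "gender": 3, "age_bucket": 6,
}
offsets = np.cumsum([0] + list(vocab.values()))[:-1]
one_hot_dim = sum(vocab.values())

def build_input(sample):
    """sample maps each categorical field to an integer index; Shows, Clicks,
    and COEC are appended as dense statistical features."""
    x = np.zeros(one_hot_dim + 3)
    for off, (field, size) in zip(offsets, vocab.items()):
        x[off + sample[field]] = 1.0          # one-hot slot for this field
    x[-3:] = [sample["Shows"], sample["Clicks"], sample["COEC"]]
    return x

x = build_input({"user_cluster": 7, "query_cluster": 12, "ad_cluster": 3,
                 "position": 1, "gender": 0, "age_bucket": 4,
                 "Shows": 120, "Clicks": 5, "COEC": 0.86})
print(x.shape, x.nonzero()[0])
```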

3.3. Study on CTR Prediction via Attention Mechanism Based on the Stacked Autoencoder
3.3.1. Attentional Factorization Machines

Since the attention mechanism was introduced into neural network modelling, it has been widely used in many tasks. Building on FM, Figure 3 shows the neural network structure of attentional factorization machines (AFM). The input layer and the embedding layer are the same as in FM: the input features are represented as a sparse vector, and each nonzero feature is embedded into a dense vector. Formally, let the set of nonzero features in the feature vector $\mathbf{x}$ be $\mathcal{X}$ and the output of the embedding layer be $\mathcal{E} = \{\mathbf{v}_i x_i\}_{i \in \mathcal{X}}$. In the pair-wise interaction layer, the output is represented as a set of vectors:

$$f_{PI}(\mathcal{E}) = \{(\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j\}_{(i,j) \in \mathcal{R}_x},$$

where $\odot$ denotes the element-wise product of two vectors and $\mathcal{R}_x = \{(i,j) : i, j \in \mathcal{X},\, j > i\}$. By defining the interaction layer, we express FM under the neural network architecture: compressing $f_{PI}(\mathcal{E})$ with sum pooling and projecting it with a fully connected layer yields the prediction score

$$\hat{y} = \mathbf{p}^{\top} \sum_{(i,j) \in \mathcal{R}_x} (\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j + b,$$

where $\mathbf{p}$ denotes the weights and $b$ denotes the bias of the prediction layer.

The attention mechanism has been widely used in many tasks. The idea is to allow different parts to contribute differently when compressing them into a single representation. We apply attention-based pooling to the feature interactions:

$$f_{Att}\big(f_{PI}(\mathcal{E})\big) = \sum_{(i,j) \in \mathcal{R}_x} a_{ij}\, (\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j,$$

where $a_{ij}$ is the attention score for feature interaction $(i, j)$ and can be interpreted as the importance of that interaction in predicting the target. In principle, $a_{ij}$ could be learned directly by minimizing the loss function, but the attention scores of interactions that never occur in the training data cannot be estimated this way. To solve this generalization problem, we use a multilayer perceptron (MLP) to parameterize the attention score, which we call the attention network. The input of the attention network is the interaction vector of two features, which encodes their interaction information in the embedding space. The attention network is defined as

$$a'_{ij} = \mathbf{h}^{\top} \mathrm{ReLU}\big(W (\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j + \mathbf{b}\big), \qquad a_{ij} = \frac{\exp(a'_{ij})}{\sum_{(i,j) \in \mathcal{R}_x} \exp(a'_{ij})},$$

where $W \in \mathbb{R}^{t \times k}$, $\mathbf{b} \in \mathbb{R}^{t}$, and $\mathbf{h} \in \mathbb{R}^{t}$ are the model parameters and $t$ is the hidden layer size of the attention network, which we call the attention factor. Rectifiers are used as the activation function for the attention scores and show good performance empirically. The output of the attention layer is a $k$-dimensional vector that compresses all feature interactions in the embedding space while differentiating their importance. The overall formulation of attentional factorization machines is

$$\hat{y}_{AFM}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \mathbf{p}^{\top} \sum_{(i,j) \in \mathcal{R}_x} a_{ij}\, (\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j,$$

where $a_{ij}$ is defined by the attention network above.

For the attention network, which is a single-layer MLP, we apply $L_2$ regularization on the weight matrix $W$ to prevent possible overfitting. In other words, the actual objective function we optimize is

$$L = \sum_{(\mathbf{x},\, y) \in \mathcal{T}} \big(y - \hat{y}_{AFM}(\mathbf{x})\big)^2 + \lambda \left\| W \right\|^2,$$

where $\mathcal{T}$ denotes the set of training instances and $\lambda$ controls the regularization strength.
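Putting the pieces together, the NumPy sketch below computes the AFM score for one sparse sample: it builds the pair-wise interaction vectors, scores them with a one-layer attention network (ReLU plus softmax), pools them with the attention weights, and adds the first-order terms. Parameter names and shapes follow the notation above but are otherwise illustrative, not the paper's implementation.

```python
import numpy as np
from itertools import combinations

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def afm_predict(x, w0, w, V, W_att, b_att, h_att, p):
    """Attentional FM score for one sparse sample (needs >= 2 nonzero features).

    x     : (n,)    feature values (mostly zeros)
    V     : (n, k)  feature embeddings
    W_att : (t, k), b_att : (t,), h_att : (t,)   attention network parameters
    p     : (k,)    projection of the pooled interaction vector
    """
    nz = np.flatnonzero(x)
    # Pair-wise interaction layer: (v_i x_i) element-wise product (v_j x_j)
    pairs = np.stack([(V[i] * x[i]) * (V[j] * x[j])
                      for i, j in combinations(nz, 2)])        # (num_pairs, k)
    # Attention network: a'_ij = h^T relu(W e_ij + b), normalised with softmax
    scores = relu(pairs @ W_att.T + b_att) @ h_att             # (num_pairs,)
    a = softmax(scores)
    pooled = (a[:, None] * pairs).sum(axis=0)                  # (k,)
    return w0 + w @ x + p @ pooled
```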

3.3.2. Stacked Autoencoder

The autoencoder (AE) [28] is a kind of neural network model that automatically learns features from data without supervision. It consists of three network layers: the bottom is the input layer I, the middle is the hidden layer H, and the top is the output or reconstruction layer O. The autoencoder architecture is shown in Figure 4, where $W$ is the connection weight between two layers and $b$ is the bias. Between the input layer and the hidden layer, the AE model encodes the input data into the nodes of the hidden layer; between the hidden layer and the reconstruction layer, the values of the hidden-layer nodes are decoded to reconstruct the output data.

The stacked autoencoder (SAE) [29] is a network that consists of AEs stacked from the bottom to the top, as shown in Figure 5. The input data of the bottom AE are $\mathbf{x}$. When the training of the bottom AE is finished, the feature of its hidden layer, denoted $\mathbf{h}^{(1)}$, is obtained. Then, $\mathbf{h}^{(1)}$ is regarded as the input data of the second AE, which is trained to provide the features of its hidden layer, denoted $\mathbf{h}^{(2)}$. This process is repeated until the top-level feature $\mathbf{h}^{(L)}$ is obtained.

The quantities related to the nodes in the hidden layer of an AE are defined as follows: $m$ is the number of nodes in the hidden layer (H); $W_{ji}$ is the connection weight between the $j$th node of the hidden layer (H) and the $i$th node of the input layer (I); $b_j$ is the bias of the $j$th node in the hidden layer (H); $z_j$ is the weighted sum of the inputs of the $j$th node in the hidden layer (H); $h_j$ is the output value of the $j$th node in the hidden layer (H). The activation function of every neuron node is denoted $f(\cdot)$.

The output value of the $j$th node in the hidden layer (H) can be represented by the following formula:

$$h_j = f(z_j) = f\left(\sum_{i} W_{ji}\, x_i + b_j\right).$$

When the feature of the hidden layer (H) is decoded, the feature of the reconstruction layer O is obtained. The output value of the $i$th node in the reconstruction layer O can be represented by the following formula:

$$\hat{x}_i = f\left(\sum_{j} W'_{ij}\, h_j + b'_i\right),$$

where $W'$ and $b'$ are the weight and bias of the reconstruction layer.

To simplify the calculations and derivations, we define the residual error of each node in a layer. With the squared reconstruction error as the objective, the residual error of the $i$th neuron node of the reconstruction layer can be calculated by the chain rule as

$$\delta_i^{O} = -\left(x_i - \hat{x}_i\right) f'\left(z_i^{O}\right),$$

where $z_i^{O}$ is the weighted input of the $i$th reconstruction-layer node.

The gradients with respect to the parameters $W$ and $b$ can then be calculated as

$$\frac{\partial J}{\partial W_{ji}} = \delta_j\, x_i, \qquad \frac{\partial J}{\partial b_j} = \delta_j,$$

where $\delta_j$ is the residual error of the corresponding node and $x_i$ is its input.

The parameters $W$ and $b$ are updated by gradient descent, where $\eta$ is the learning rate:

$$W_{ji} \leftarrow W_{ji} - \eta\, \frac{\partial J}{\partial W_{ji}}, \qquad b_j \leftarrow b_j - \eta\, \frac{\partial J}{\partial b_j}.$$

The SAE is a generative model composed of a stack of autoencoders. The training algorithm of the autoencoder is used to initialize the parameters of the stacked autoencoder: each new layer is stacked on top of the current autoencoder, gradually refining the previously learned information and discovering more complex features. A dense real-valued feature vector is then generated and finally fed into the sigmoid function for CTR prediction:

$$y_{SAE} = \mathrm{sigmoid}\left(W^{(L+1)}\, \mathbf{h}^{(L)} + b^{(L+1)}\right),$$

where $W^{(L+1)}$ is the model weight, $b^{(L+1)}$ is the bias, and $L$ is the number of hidden layers.

This paper selects the squared error as the objective function and adopts gradient descent [30, 31] to train the parameters. The objective function can be described by the following formula:

$$J = \frac{1}{2} \sum_{(\mathbf{x},\, y) \in \mathcal{T}} \left(y - \hat{y}(\mathbf{x})\right)^2,$$

where $\mathcal{T}$ is the set of training instances.
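The sketch below illustrates the greedy layer-wise training just described: a single autoencoder is trained with squared reconstruction error and full-batch gradient descent, and its hidden features become the input of the next layer. The sigmoid activation, learning rate, and layer sizes are assumptions made for illustration, not the paper's exact settings.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, lr=0.1, epochs=50):
    """One AE layer trained with squared reconstruction error and
    full-batch gradient descent (untied encoder/decoder weights)."""
    n = X.shape[1]
    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(0, 0.1, (n_hidden, n)), np.zeros(n_hidden)
    W2, b2 = rng.normal(0, 0.1, (n, n_hidden)), np.zeros(n)
    for _ in range(epochs):
        H = sigmoid(X @ W1.T + b1)                      # encode
        X_hat = sigmoid(H @ W2.T + b2)                  # decode / reconstruct
        # Residual errors (chain rule) for the squared-error objective
        d_out = (X_hat - X) * X_hat * (1 - X_hat)       # (m, n)
        d_hid = (d_out @ W2) * H * (1 - H)              # (m, n_hidden)
        W2 -= lr * d_out.T @ H / len(X)
        b2 -= lr * d_out.mean(axis=0)
        W1 -= lr * d_hid.T @ X / len(X)
        b1 -= lr * d_hid.mean(axis=0)
    return W1, b1

def stacked_autoencoder(X, layer_sizes):
    """Greedy layer-wise pretraining: the hidden features of one AE
    become the input of the next."""
    feats, params = X, []
    for n_hidden in layer_sizes:
        W, b = train_autoencoder(feats, n_hidden)
        params.append((W, b))
        feats = sigmoid(feats @ W.T + b)
    return feats, params

X = np.random.rand(256, 40)                  # toy input features
top_features, params = stacked_autoencoder(X, [32, 16, 8])
print(top_features.shape)
```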

3.3.3. ASAE Model

The ASAE model consists of two components, the AFM component and the SAE component, which share the same input. The graphical model of ASAE is shown in Figure 6. It is able to learn feature interactions of all orders in an end-to-end manner, without any feature engineering beyond the raw features. The input is fed into the AFM component to model order-2 feature interactions and distinguish their importance, and into the SAE component to model high-order feature interactions; the low-dimensional dense embeddings learned for the sparse features allow the model to generalize better to unseen feature combinations. All parameters are trained jointly for the combined prediction model:

$$\hat{y} = \mathrm{sigmoid}\left(y_{AFM} + y_{SAE}\right),$$

where $\hat{y} \in (0, 1)$ is the predicted CTR, $y_{AFM}$ is the output of the AFM component, and $y_{SAE}$ is the output of the SAE component.
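Conceptually, the combined prediction can be sketched as below, assuming a callable for each component's raw score (for instance the hypothetical afm_predict sketch above with its parameters bound, and some stacked-autoencoder scoring function); in the actual model both components are trained jointly rather than called independently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def asae_predict(x, afm_score, sae_score):
    """Combined ASAE prediction for one sample.

    afm_score : callable returning the raw AFM score for x
    sae_score : callable returning the raw stacked-autoencoder score for x
    Both components see the same input; their raw scores are summed and
    squashed by the sigmoid to give the predicted CTR.
    """
    return sigmoid(afm_score(x) + sae_score(x))
```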

4. Experiments

4.1. Datasets

We perform experiments with two publicly accessible datasets: Frappe [32] and SIGKDD Cup2012 track2. The Frappe dataset has been used for context-aware recommendation. It contains 96,215 app usage logs of users under different contexts. Each log contains 8 context variables, including app ID, user ID, city, and daytime. We convert each log into a feature vector with one-hot encoding, resulting in 5,479 features in total. We split the dataset into a training set and a testing set with a random partition at a ratio of 8 to 1. A target value of 1 indicates that the user has used the application in the given context.
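A minimal preprocessing sketch along these lines is shown below; the file name and column names are hypothetical, and the 8 : 1 split corresponds to a test fraction of 1/9.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative Frappe-style log: 8 categorical context fields plus the label.
logs = pd.read_csv("frappe_logs.csv")                 # hypothetical file name
X = pd.get_dummies(logs.drop(columns=["label"]).astype(str))   # one-hot encode
y = logs["label"]                                     # 1 = app used in this context

# 8 : 1 split between training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/9, random_state=42)
print(X.shape, X_train.shape, X_test.shape)
```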

The research question corresponding to the KDD2012 CUP track2 dataset is to predict the click-through rate of advertisements based on actual click data. The training dataset provided by the competition contains a total of 149,639,105 records and is 9.8 GB in size. Apart from omitting the numbers of clicks and impressions, the test dataset has the same format as the training dataset; it contains 20,257,594 records and is 1.28 GB in size. After data cleaning and preprocessing, a total of 3.5 million samples were randomly selected from the candidate dataset for the experiments. Table 1 summarizes the statistics of the final evaluation datasets.

From the KDD2012 CUP track2 dataset, we sample seven datasets of different scales, containing 150,000, 200,000, 300,000, 500,000, 600,000, 750,000, and 1 million samples. The training data are grouped randomly, and the final result is the average over all runs to ensure the reliability of the experimental results.

4.2. Evaluation Index

We use two evaluation metrics in our experiments: AUC (area under the ROC curve) and Logloss (cross entropy). The ROC curve is the receiver operating characteristic [33], which is commonly used to measure the performance of a two-class classifier, and CTR prediction is a classic binary classification task based on whether an advertisement is clicked. The value of AUC usually lies in [0.5, 1); the larger the AUC, the more accurate the CTR prediction. The lower the Logloss, the better the predicted probabilities match the observed clicks.
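Both metrics are available in scikit-learn; a tiny example with made-up labels and predicted probabilities:

```python
from sklearn.metrics import roc_auc_score, log_loss

# y_true: 0/1 click labels, y_pred: predicted click probabilities
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

auc = roc_auc_score(y_true, y_pred)     # higher is better, 0.5 = random
ll = log_loss(y_true, y_pred)           # lower is better (cross entropy)
print(f"AUC = {auc:.3f}, Logloss = {ll:.3f}")
```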

4.3. Baseline Models

We compare the ASAE model with the following methods that are designed for sparse data prediction:

FM [34]: FM has been successfully applied to recommender systems and user response prediction tasks. It explores feature interactions and is effective on sparse data.

FNN [20]: FNN is an FM-initialized feedforward neural network. It is able to capture high-order latent patterns of multifield categorical data.

CCPM [19]: the convolutional click prediction model (CCPM) is based on a convolutional neural network. It can extract local-global key features from an input instance with varied elements and can be applied to both single and sequential advertising impressions.

Deep cross [11]: it applies a multilayer residual network on top of a feature embedding cascade to learn feature interactions. This model is a deep neural network that automatically combines features to produce superior models.

Wide and deep [35]: this model combines a linear ("wide") model and a deep model. The deep part is a three-layer MLP applied to the concatenated feature embeddings. The wide part (a linear regression model) relies on manual design to incorporate cross features.

4.4. Analysis of Experimental Results

This section evaluates the ASAE model from two perspectives: (1) discussing the impact of relevant parameters and (2) comparing the ASAE model with five existing prediction models.

4.4.1. Impact of Parameters

Dropout [36] refers to the probability that a neuron is kept in the network; it is a regularization technique that trades off the precision and the complexity of the neural network. We vary the dropout ratio from 0.1 to 0.8. As shown in Figure 7, the optimal dropout ratio on Frappe is 0.3. The result shows that adding reasonable randomness to the model can strengthen its robustness.

The number of network layers in the deep learning phase has a direct effect on the final estimate of the model. Therefore, we experimented with this parameter to select a better configuration. On the Frappe dataset, as presented in Figure 8, increasing the number of hidden layers improves the performance of the model at the beginning. However, as the number of hidden layers keeps growing, the model performance degrades because of overfitting. As can be seen from Figure 8, the highest AUC value on the Frappe dataset is obtained with 4 hidden layers.

The number of iterations in the training phase also has a direct effect on the final estimate of the model. On the KDD dataset, we trained models on a sampled subset of 500,000 records, tested them on the test set, and used the results to select the best parameters. With the number of network layers fixed, we can analyze the effect of different iteration counts on the model performance; the results are shown in Table 2.

In accordance with Table 2, Figure 9 shows how the AUC changes with the number of iterations for different numbers of hidden layers and for the LR model. As seen in Figure 9, when the number of iterations is between 90 and 120, the AUC values of the curves stabilize. Therefore, in the comparison experiments, 115 is chosen as the number of iterations for training the prediction model. Figure 9 also shows that the curves fluctuate to different degrees as the number of iterations changes, and we choose the configuration whose curve is relatively stable.

When other factors remain constant, the number of hidden-layer units in the ASAE has a large impact on network performance, so choosing it appropriately is extremely important. However, there is no general theoretical method for tuning this parameter. Therefore, in this part, we carry out experiments on the effect of the number of hidden-layer neurons. As we can observe from Figure 10, increasing the number of neurons per layer does not always bring benefits. For example, when the number of neurons per layer is increased from 400 to 800, the performance of the ASAE stays flat, because the more complicated model is prone to overfitting. In our experiments, 200 to 400 neurons per layer is a good choice.

4.4.2. Performance Comparison

We trained the models on the two datasets and evaluated the estimated results on the same test set. Tables 3–5 describe the results of the different methods on the different datasets.

Tables 3–5 show the overall performance. Compared with the other five methods, the ASAE model shows a better prediction effect. As the data size increases, the accuracy rises and the logloss declines.

FM: this model was successfully applied to the user response prediction task. It explores feature interactions, which is effective on sparse data. However, it is limited in mining high-order latent patterns or learning quality feature representations. As shown in Tables 3–5, the performance of this model is the worst among all compared models.

FNN: FNN is an FM-initialized feedforward neural network. The FM pretraining strategy brings some limitations: the embedding parameters may be overly affected by FM, and efficiency is reduced by the overhead introduced by the pretraining stage. From Tables 3–5, we can see that the performance of FNN ranks fifth.

CCPM: this model is based on a convolutional neural network for single and sequential advertising impressions. However, it relies heavily on feature alignment and lacks interpretability. Thus, as shown in Tables 3–5, the performance of this model ranks fourth.

Deep cross: the deep cross is the deepest of the compared methods, stacking 10 layers above the embedding layer. From Tables 3–5, we can see that it ranks third because of overfitting problems.

Wide and deep: wide and deep combines a linear model and a deep model and learns both high- and low-order feature interactions, but it requires expert feature engineering for the input to the "wide" part. Thus, as shown in Tables 3–5, the performance of this model ranks second.

ASAE: the ASAE model performs best, for the following reasons. (1) The input of the model exploits decomposition-based dimension reduction, which reduces the sparseness of the data. (2) The model takes advantage of the attention mechanism in neural network modelling and improves FM so that feature interactions contribute differently to the prediction. (3) The improved FM handles low-order feature interactions while the stacked autoencoder handles high-order feature interactions, so the model mines the relationships between features more effectively, which improves CTR prediction.

5. Conclusions

In this paper, based on search advertising click data, we proposed a sparse feature learning method for advertising data from the perspective of feature learning (DLSAE). We used clustering to group similar advertisements, queries, and users and established a three-dimensional tensor model over the <user, query, ad> triple after dimension reduction; the low-order approximate tensor was then obtained using the Tucker decomposition. Aiming at the highly nonlinear relations between the features, we proposed a hybrid model (ASAE) for advertising CTR estimation based on the stacked autoencoder. The ASAE model trains a deep component and an AFM component jointly, which brings several advantages. First, the model does not need any pretraining. Second, it learns both high- and low-order feature interactions, introduces a sharing strategy for feature embeddings, and mines the relationships between features more effectively. Last but not least, the proposed model distinguishes the importance of features, making click-through rate prediction more accurate. More importantly, the importance of feature interactions is learned automatically from the data without any human domain knowledge. We conducted extensive experiments on two datasets to compare the effectiveness of ASAE with other models. There are two interesting directions for future study. One is exploring a convolutional click prediction model based on CNN for single and sequential advertising impressions. The other is exploring pooling for recurrent neural networks (RNNs) for sequential data modelling.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (nos. 61572301 and 61772321), the Innovation Foundation of Science and Technology Development Center of Ministry of Education and New H3C Group (2017A15047), CERNET Innovation Project (NGII20170508), and the Open Research Fund from Shandong Provincial Key Laboratory of Computer Network (no. SDKLCN-2016-01).