Computational and Mathematical Methods in Medicine

Volume 2018, Article ID 8056541, 11 pages

https://doi.org/10.1155/2018/8056541

## A New Approach for Advertising CTR Prediction Based on Deep Neural Network via Attention Mechanism

^{1}School of Information Science and Engineering, Shandong Normal University, Jinan, China
^{2}School of Mathematical Science, Shandong Normal University, Jinan, China

Correspondence should be addressed to Fang’ai Liu; lfa@sdnu.edu.cn

Received 29 March 2018; Accepted 1 August 2018; Published 13 September 2018

Academic Editor: Martti Juhola

Copyright © 2018 Qianqian Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Click-through rate prediction is critical in Internet advertising and affects web publishers’ profits and advertisers’ payments. Traditional feature extraction methods do not account for the sparseness of advertising data or the highly nonlinear associations between features. To reduce the sparseness of the data and to mine the hidden features in advertising data, a method that learns the sparse features is proposed. Our method exploits dimension reduction based on decomposition, takes advantage of the attention mechanism in neural network modelling, and improves FM so that feature interactions contribute differently to the prediction. We utilize a stacked autoencoder to explore high-order feature interactions and use the improved FM for low-order feature interactions to portray the nonlinear associations between features. Experiments show that our method improves the effect of CTR prediction and produces economic benefits in Internet advertising.

#### 1. Introduction

Click-through rate (CTR) prediction is critical to many web applications including web search, recommender systems [1, 2], sponsored search, and display advertising. Search advertising, also known as sponsored search, refers to advertisers bidding on keywords relevant to their product or service. When a user’s query matches a keyword purchased by an advertiser, the corresponding advertisement is triggered and displayed. In the cost-per-click model, the advertiser pays the web publisher only when a user clicks the advertisement and visits the advertiser’s site. CTR prediction is defined as estimating the ratio of clicks to impressions of the advertisements that will be displayed [3].

With the rapid development of the mobile Internet and its wide range of applications, advertising has become one of the most successful business models in the world. Internet text advertising is regarded as a more effective advertising communication method due to its strong targeted communication and convenience of user clicking and has become an important income resource for many Internet companies. Some electronic commerce companies and search engine companies are seeking targeted advertising to increase their revenue.

In general, the display of online advertising can be seen as a three-party game between media, advertisers, and users. How to advertise to specific user groups is a key issue in the field of online advertising. Inappropriate advertising degrades the user experience, prevents the advertisement from achieving its desired effect, and can also harm the media. Internet text advertising is usually in the form of text, and advertisers buy media ad placements through the cost-per-click (CPC) model [4]. In the CPC model, the click-through rate (CTR) is an important indicator of the effectiveness of an advertising display and a key factor in the three-party game. Therefore, CTR estimation for advertising is an active research direction in the field of computational advertising. In this paper, the click-through rate prediction of Internet text advertising denotes the probability that a user clicks on a text advertisement in the current context. Because it involves the three-party information of advertising properties, user properties, and context environment, CTR prediction is very complicated.

At present, the prediction of click-through rate for online advertising has attracted widespread attention from researchers in industry and academia. Researchers have proposed many models that are usually based on machine learning methods. We can divide them into three categories: linear, nonlinear, and fusion models. Typically, a predictive task is formulated as estimating a function that maps predictor variables to some target. To build predictive models with these predictor variables, a common solution is to convert them to a set of binary features (a.k.a. a feature vector) via one-hot encoding [5]. McMahan et al. [6] used the logistic regression [7] model to solve the CTR problems of Google Advertising. They adopted user information, advertising data, search keywords, and other features as the input of the model and proposed an online sparse learning algorithm to train the model. Chapelle [8] proposed a machine-learning framework based on logistic regression in which advertiser, web publisher, user, and time characteristics were used as input to the model to solve the advertising CTR prediction for Yahoo. Dave and Varma [9] used the gradient boosting decision tree (GBDT) to predict the advertising CTR. They extracted similar features from advertising data, discovered implicit relationships between different features, and thereby captured the nonlinear relationships between the predicted target and the features. He et al. [10] introduced a fusion model that combines decision trees with logistic regression for predicting clicks on Facebook ads. Traditional CTR prediction models depend mainly on feature engineering: the features of the data are manually selected and processed. The data have complex mapping relationships, and it is crucial to account for the interactions between features.
Many successful solutions in both industry and academia largely rely on manually crafting combinatorial features [11], i.e., constructing new features by combining multiple predictor variables, also known as cross features. However, the power of such features comes at a high cost, since designing effective features requires heavy engineering effort and deep domain knowledge. Factorization machines (FMs) [12] are a supervised learning approach that embeds features into a latent space and models the interactions between features via the inner product of their embedding vectors. Models based on degree-2 polynomial mapping and factorization machines are widely used for CTR prediction. Juan et al. developed field-aware factorization machines (FFMs) [13], a factorization-based prediction method.
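As background for the encoding step mentioned above, the following is a minimal illustrative sketch (not the paper’s code) of converting categorical predictor variables into sparse binary features via one-hot encoding; the field names and records are invented for the example.

```python
# Illustrative sketch: one-hot encoding of categorical fields into
# binary feature vectors, as commonly used in CTR prediction.

def one_hot_encode(records, fields):
    """Map each (field, value) pair to a binary feature column."""
    # Build a vocabulary of feature indices over all records.
    vocab = {}
    for r in records:
        for f in fields:
            key = (f, r[f])
            if key not in vocab:
                vocab[key] = len(vocab)
    # Encode each record as a binary vector (dense here for clarity;
    # real CTR systems store only the indices of the active features).
    encoded = []
    for r in records:
        vec = [0] * len(vocab)
        for f in fields:
            vec[vocab[(f, r[f])]] = 1
        encoded.append(vec)
    return encoded, vocab

# Hypothetical toy click-log records.
records = [
    {"site": "news", "device": "mobile"},
    {"site": "sports", "device": "desktop"},
]
X, vocab = one_hot_encode(records, ["site", "device"])
```

Each row then has exactly one active bit per field, which is the source of the high-dimensional sparsity the paper addresses.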

In recent years, deep learning [14, 15] has achieved very good results in the fields of speech recognition [16], image data processing [17], and natural language processing [18]. As a powerful approach to learning feature representations, deep neural networks have the potential to learn sophisticated feature interactions. Liu et al. [19] extended CNN for CTR prediction, but CNN-based models are biased towards the interactions between neighboring features. Zhang et al. [20] studied feature representations and proposed the factorization machine-supported neural network (FNN). This model pretrains FM before applying the DNN and is thus limited by the capability of FM. He and Chua [21] proposed a novel neural factorization machine (NFM) for prediction under sparse settings. NFM combines the linearity of FM in modelling second-order feature interactions and the nonlinearity of neural networks in modelling higher-order feature interactions. Despite great promise, we argue that FM can be hindered by its modelling of all factorized interactions with the same weight. In real-world applications, different predictor variables often have different predictive power, and not all features contain useful information for predicting the target. Therefore, interactions of features with less useful information should be assigned a lower weight, indicating that they contribute less to the prediction. However, FM lacks the ability to distinguish the importance of feature interactions, which leads to suboptimal prediction.
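The idea of weighting feature interactions can be sketched as follows. This is a simplified, assumed illustration (not the paper’s exact model): a softmax over interaction scores reweights the element-wise products of feature embeddings, so that less informative interactions contribute less to the pooled representation. The attention parameter `att_w` is a hypothetical stand-in for what a real model would learn with a small MLP.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_pooling(embeddings, att_w):
    """Pool pairwise interaction vectors with attention weights.

    embeddings: latent vectors v_i of the active features.
    att_w: hypothetical attention parameter vector scoring each
           interaction vector; learned in a real model.
    """
    # Element-wise product of every pair of embeddings.
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            pairs.append([a * b for a, b in
                          zip(embeddings[i], embeddings[j])])
    # Normalized attention weight per interaction.
    scores = softmax([dot(att_w, p) for p in pairs])
    k = len(embeddings[0])
    # Attention-weighted sum of the interaction vectors.
    pooled = [sum(scores[p] * pairs[p][d] for p in range(len(pairs)))
              for d in range(k)]
    return pooled, scores

emb = [[0.1, 0.2], [0.3, -0.1], [0.05, 0.4]]
pooled, scores = attention_pooling(emb, att_w=[1.0, 1.0])
```

Plain FM corresponds to replacing `scores` with a constant weight of 1 for every interaction, which is exactly the limitation discussed above.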

Considering the high-dimensional sparsity of advertising data and the highly nonlinear associations between features [22], a hybrid model for advertising CTR estimation based on a stacked autoencoder, named Attention Stacked Autoencoder (ASAE), is proposed. Our model takes advantage of the attention mechanism in neural network modelling [23, 24] and improves FM so that feature interactions contribute differently to the prediction. More importantly, the importance of feature interactions is automatically learned from the data without any human domain knowledge. We explore data dimension reduction and identify the relationships between features. Additionally, extensive experiments are conducted to show that this method improves the accuracy of CTR estimation.

The rest of this paper is organized as follows. Section 2 reviews factorization machines. In Section 3, the sparse feature learning method for advertising data based on the ASAE model is proposed. In Section 4, we design the experiments and verify the prediction performance of the method through comparison experiments; we also analyze the experimental results. Section 5 concludes the paper and lists possible future work.

#### 2. Factorization Machines

Factorization machines were originally proposed for learning feature interactions in recommender systems. Given a real-valued feature vector $\mathbf{x} \in \mathbb{R}^n$, where $n$ denotes the number of features, FM estimates the target by modelling all interactions between each pair of features:

$$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \hat{w}_{ij} x_i x_j, \tag{1}$$

where $w_0$ is the global bias, $w_i$ denotes the weight of the $i$th feature, and $\hat{w}_{ij}$ denotes the weight of the cross feature $x_i x_j$, which is factorized as $\hat{w}_{ij} = \mathbf{v}_i^{\top} \mathbf{v}_j$, where $\mathbf{v}_i \in \mathbb{R}^k$ denotes the embedding vector of feature $i$ and $k$ denotes the size of the embedding vectors. Besides linear (order-1) interactions among features, FM models pairwise (order-2) feature interactions as the inner product of the respective feature latent vectors. It can capture order-2 feature interactions much more effectively than previous approaches, especially when the dataset is sparse. It is worth noting that FM models all feature interactions in the same way: first, a latent vector $\mathbf{v}_i$ is shared in estimating all feature interactions that the $i$th feature is involved in; second, all feature interactions have the same weight of 1. However, it is common that not all features are relevant to the prediction. Interactions of irrelevant features can be considered noise that does not contribute to the prediction. Because FM weights all interactions equally, such noise may have a negative impact on generalization performance.
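The FM prediction equation can be sketched directly in code. The following is a minimal illustrative implementation (assumed, not the paper’s code) that also demonstrates the standard identity $\sum_{i<j} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \tfrac{1}{2} \sum_d \big[ (\sum_i v_{id} x_i)^2 - \sum_i v_{id}^2 x_i^2 \big]$, which reduces the pairwise term from $O(kn^2)$ to $O(kn)$; the toy inputs are invented.

```python
# Sketch of the FM prediction: y(x) = w0 + sum_i w_i x_i
#                                   + sum_{i<j} <v_i, v_j> x_i x_j.

def fm_predict(x, w0, w, V):
    """O(kn) FM prediction using the sum-of-squares identity."""
    linear = w0 + sum(wi * xi for wi, xi in zip(w, x))
    k = len(V[0])
    pairwise = 0.0
    for d in range(k):
        s = sum(V[i][d] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][d] * x[i]) ** 2 for i in range(len(x)))
        pairwise += s * s - s_sq
    return linear + 0.5 * pairwise

def fm_predict_naive(x, w0, w, V):
    """Direct O(kn^2) double loop, for checking the identity."""
    y = w0 + sum(wi * xi for wi, xi in zip(w, x))
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            inner = sum(V[i][d] * V[j][d] for d in range(len(V[0])))
            y += inner * x[i] * x[j]
    return y

# Toy example: n = 3 features, embedding size k = 2.
x = [1.0, 0.0, 2.0]
w0, w = 0.5, [0.1, -0.2, 0.3]
V = [[0.1, 0.2], [0.3, 0.1], [-0.2, 0.4]]
```

Note that the latent vector $\mathbf{v}_i$ (row `V[i]`) is shared by every interaction feature $i$ participates in, and each interaction enters the sum with the same weight of 1 — the two properties criticized above.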

#### 3. Click-through Rate Estimation Based on Deep Neural Network

One of the necessary steps in a click-through rate prediction system is to mine features that are highly correlated with the prediction task. To reduce the high sparseness of the features and characterize the nonlinear associations between features, we propose a deep-learning-based sparse feature learning method for advertising data (DLSAE).

##### 3.1. Data Dimensionality Reduction

Click log data contain many types of objects, such as users, queries, and advertisements, and the relationships between these objects are very complex. Objects of the same type are similar to one another, and there are complex relationships between objects of different types. For instance, given a particular user and the query submitted by that user, it is necessary to predict whether the user will click on the advertisement and with what probability. There is a complex implicit relationship between users, queries, and advertising. Based on the characteristics of the click log data, dimension reduction is carried out in the following two aspects: the similarity between objects of the same type and the associations between different types of objects.

In this paper, a distance-based k-means clustering algorithm [25] is adopted. We cluster queries, advertisements, and users separately, so that similar objects are aggregated into the same cluster. We use advertising frequency as the weight of the advertisement-query pair and create an ad-query matrix $M \in \mathbb{R}^{m \times n}$ (where $m$ is the number of ads and $n$ is the number of queries), using the k-means algorithm to cluster the ad-query matrix. We scan the ad-query matrix to obtain the ad set and the query set, denoted $A = \{a_1, a_2, \ldots, a_m\}$ and $Q = \{q_1, q_2, \ldots, q_n\}$. Then, we randomly take *K* samples from the ad set as the initial cluster centers, recorded as $C = \{c_1, c_2, \ldots, c_K\}$. Next, Equation (2) is used to calculate the distance between ad $a_i$ and each cluster center $c_j$:

$$d(a_i, c_j) = \sqrt{\sum_{t=1}^{n} \left( w_{a_i} a_{it} - w_{c_j} c_{jt} \right)^2}, \tag{2}$$

where $w_{a_i}$ is the weight of $a_i$, $w_{c_j}$ is the weight of $c_j$, and $d(a_i, c_j)$ is the distance between $a_i$ and $c_j$. The numbers of clusters of users, ads, and queries are denoted $k_u$, $k_a$, and $k_q$, respectively. Finally, the numbers of users, ads, and queries in the dataset are reduced from $(n_u, n_a, n_q)$ to $(k_u, k_a, k_q)$.
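The clustering step above can be sketched as follows. This is a minimal, assumed illustration using plain Euclidean distance and a deterministic farthest-point initialization (the paper’s weighted distance and random initialization would replace `dist` and `init_centers`); the toy ad-query matrix is invented.

```python
# Minimal k-means sketch for clustering rows of an ad-query
# frequency matrix, so similar ads fall into the same cluster.

def dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def init_centers(rows, k):
    # Greedy farthest-point initialization, kept deterministic
    # for this sketch (a real run would sample K ads at random).
    centers = [list(rows[0])]
    while len(centers) < k:
        far = max(rows, key=lambda r: min(dist(r, c) for c in centers))
        centers.append(list(far))
    return centers

def kmeans(rows, k, iters=20):
    centers = init_centers(rows, k)
    assign = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: nearest center for each row.
        assign = [min(range(k), key=lambda c: dist(r, centers[c]))
                  for r in rows]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [r for r, a in zip(rows, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    return assign, centers

# Toy ad-query frequency matrix: 4 ads x 2 queries.
M = [[9.0, 1.0], [8.0, 0.0], [1.0, 9.0], [0.0, 8.0]]
assign, centers = kmeans(M, k=2)
```

After clustering, each cluster stands in for its member ads (or queries, or users), which is what shrinks the dataset from its original counts to the cluster counts.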

There is a ternary user-query-ad relationship in the click log data. In this paper, we use a three-dimensional tensor structure model [26, 27] to represent users, queries, and advertisements, and then use tensor decomposition to reduce the dimensions. The sum of the display counts of the ads in a cluster is used as the weight of the corresponding element in the 3D space. The three-dimensional tensor model is constructed and represented by $\mathcal{X}$. In this paper, tensor $\mathcal{X}$ is decomposed using the Tucker factorization. Equation (3) is the decomposition formula:

$$\mathcal{X} \approx \mathcal{G} \times_1 U \times_2 Q \times_3 A, \tag{3}$$

where $\mathcal{G}$ represents the core tensor of tensor $\mathcal{X}$, and $U$, $Q$, and $A$ represent the feature matrices of the tensor $\mathcal{X}$ on the user, query, and ad dimensions, respectively.
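The mode-$n$ products that make up a Tucker reconstruction can be sketched with NumPy. This is an illustrative sketch under assumed shapes (4 users, 5 queries, 6 ads, core of rank 2×3×2, random factors), not the paper’s fitted decomposition.

```python
import numpy as np

# Reconstructing a 3-way user-query-ad tensor from a Tucker core G
# and factor matrices U, Q, A: X_hat = G x1 U x2 Q x3 A.

def mode_n_product(T, M, mode):
    """Multiply tensor T by matrix M along the given mode."""
    # Contract M's columns against T's `mode` axis, then move the
    # resulting axis back into position `mode`.
    return np.moveaxis(np.tensordot(M, T, axes=(1, mode)), 0, mode)

def tucker_reconstruct(G, U, Q, A):
    X = mode_n_product(G, U, 0)   # expand user mode
    X = mode_n_product(X, Q, 1)   # expand query mode
    X = mode_n_product(X, A, 2)   # expand ad mode
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((2, 3, 2))   # core tensor
U = rng.standard_normal((4, 2))      # 4 users   x rank 2
Q = rng.standard_normal((5, 3))      # 5 queries x rank 3
A = rng.standard_normal((6, 2))      # 6 ads     x rank 2
X_hat = tucker_reconstruct(G, U, Q, A)
```

Fitting the decomposition then amounts to choosing $\mathcal{G}$, $U$, $Q$, and $A$ so that this reconstruction is as close as possible to the original tensor.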

Figure 1 is a schematic diagram of the Tucker decomposition. The purpose of the Tucker decomposition is to find an approximate tensor close to the original tensor while retaining the original tensor’s information and structure to the greatest extent. The minimization formula is shown below:

$$\min_{\mathcal{G},\, U,\, Q,\, A} \left\| \mathcal{X} - \mathcal{G} \times_1 U \times_2 Q \times_3 A \right\|^2.$$