#### Abstract

Knowledge graphs (KGs) contain a large amount of real-world knowledge and have become an invaluable aid to applications of artificial intelligence. Knowledge graph completion (KGC) is the task of inferring the triples that are missing from a KG. Our goal in this study is to enhance the performance of CNN-based models on KGC tasks. To do this, we first investigated the effect of adding multiple filters of different shapes to the original ConvKB model. The resulting improvement was marginal, which led us to seek other approaches. Our second proposed model, DP-ConvKB, is a deep convolutional neural network model that outperforms state-of-the-art models on several metrics. Our study provides supporting evidence that incorporating a deep pyramid network structure into CNN-based models can significantly improve KGC performance.

#### 1. Introduction

Since the development of the Semantic Web in the 1980s, the knowledge graph (KG), a form of knowledge base, has served as a surrogate for real-world information, focusing on the relationships between interlinked entities. Built on a graph-structured data model, a KG stores interlinked descriptions of entities as triples (head entity, relation, tail entity). Because KGs are structured and machine-readable, their applications have proliferated in domains such as search, analytics, and recommendation. In 2012, Google launched its KG [1], and by May 2020, the Google KG had grown to 500 billion facts on 5 billion entities [2]. However, incompleteness is a persistent problem in all KGs, including massive ones. For example, in Freebase and DBpedia, more than 66% of the person entities are missing a birthplace [3, 4]. These missing facts not only affect the KG structure itself but can also lead to misleading inferences. Therefore, knowledge graph completion (KGC) aims to identify missing links under the supervision of the existing KG. The scope of KGC includes finding and mining missing entities and relationships [5], link prediction [6], and inferring new facts [7].

The key problem of knowledge graph completion is how to represent and model the combination of entities and relations. Data representation and representation learning are known to play a pivotal role in the development of KGC [8]. Embedding is a popular data representation technique that converts entities and relations into low-dimensional vectors or matrices. For example, in TransE [9], a candidate triple $(h, r, t)$ represents a valid fact when the relation corresponds to a translation between the embedding $v_h$ of the head entity and the embedding $v_t$ of the tail entity, that is, $v_h + v_r \approx v_t$. Several subsequent translation-based models build on the basic idea of TransE, such as TransH [10], TransD [11], TransR [12], and TransG [13]. Among them, TransH models a relation as a hyperplane, and TransR separates the entity space from the relation space when representing relations and entities. In addition, a trilinear dot product has also been used to compute the score for each triple; DistMult [14] and ComplEx [15] are two typical linear models of this kind.
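The translation idea behind TransE can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function name `transe_score`, the choice of the L1 distance, and the toy vectors are assumptions made for illustration only.

```python
def transe_score(v_h, v_r, v_t):
    """Negative L1 distance between v_h + v_r and v_t.

    A score closer to 0 means the triple fits the translation
    v_h + v_r ~ v_t better, i.e., it is more plausible.
    """
    return -sum(abs(h + r - t) for h, r, t in zip(v_h, v_r, v_t))


# Toy 2-dimensional embeddings (made up for this example).
v_h = [0.1, 0.2]
v_r = [0.3, -0.1]
v_t_close = [0.4, 0.1]   # approximately v_h + v_r
v_t_far = [-0.5, 0.9]

print(transe_score(v_h, v_r, v_t_close) > transe_score(v_h, v_r, v_t_far))
```

A tail whose embedding lies near $v_h + v_r$ receives the higher score, which is the ranking signal used for link prediction.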

KGC models can be broadly divided into embedding-based models, linear models, and neural network models. Neural networks (NNs) have been widely used in machine learning tasks such as pattern recognition and perception science [16] and have recently gained more attention in the field of KGs. Convolutional neural networks (CNNs), a specialized type of NN built around the convolution operation, have been heavily used in computer vision since their inception, because their structure natively matches the composition of images. Since Kim proposed TextCNN in 2013 [17], CNNs have also received considerable attention in natural language processing (NLP), for tasks such as sentence classification [18], sentence modeling [17], and search query retrieval [19]. Most of these models adopt a convolution layer similar to that of TextCNN to extract features from the embeddings. Inspired by computer vision, Dettmers et al. proposed ConvE [20], the first CNN-based model for the KGC task. Following ConvE, Nguyen et al. proposed ConvKB [21], in which a triple $(h, r, t)$ is represented by $k$-dimensional vectors $v_h$, $v_r$, and $v_t$ that are combined into a $k \times 3$ input matrix. In the convolution layer, different filters of the same shape $1 \times 3$ are used to explore the global features among the same dimensional entries of the embedding triple $(v_h, v_r, v_t)$.

Although ConvKB overcomes the limitations of ConvE and obtains better link prediction results than earlier models on several benchmark datasets, several problems remain unsolved. For example, because the convolution kernel has a fixed size and there is only a single convolutional layer, the model extracts only shallow information and cannot capture long-distance interactions between different positions in the embedding vectors. Several methods have been proposed to address this problem, such as CapsE [22] and ConvR [23]. In this paper, we combined a deep pyramid convolutional network structure with the original ConvKB model and designed a new model named deep pyramid ConvKB (DP-ConvKB).

We summarize the main contributions of our work as follows.

To improve KGC performance, we first built a tentative model that introduces multiple filters of different shapes into ConvKB, hereafter referred to as multifilter ConvKB (MF-ConvKB). Compared with existing CNN-based models, including ConvKB, experimental results on two benchmark datasets, FB15k-237 and WN18RR, showed that MF-ConvKB brought only mild improvements on some specific metrics. This motivated us to seek other model structures that might overcome this limit. We found that by incorporating a deep pyramid network structure into ConvKB, the newly designed model DP-ConvKB significantly improves KGC performance on several metrics.

#### 2. Related Work

##### 2.1. Embedding-Based Model

TransE was the first embedding-based model to use translation-based methods for modeling multirelational data [9]. Inspired by the Word2Vec Skip-gram model, TransE maps the relation in a triple to a translation in the latent feature space, denoted as $v_h + v_r \approx v_t$, where $v_h$, $v_r$, and $v_t$ are the embeddings of the head entity, the relation, and the tail entity. However, because of its low parameter complexity, TransE works well only on one-to-one relations and fails to deal with complex relations such as one-to-many, many-to-one, and many-to-many. To extend the scope and efficiency of the score function, TransH [10] models a complex relation as a hyperplane together with a translation operation on it; however, this model still adapts poorly to scenarios with multiple relations. TransD [11] and TransR/CTransR [12] represent entities and relations in separate spaces and project each entity with a relation-specific matrix. Unlike TransR, STransE [24] and TranSparse [25] associate the head and tail entities with their own projection matrices. In addition, PTransE [26] and PTransR [27] take relation path information into account when representing the triple $(h, r, t)$. Recently, models that expand the embedding space, such as RotatE [28], have continued to play a major role in KGC.

##### 2.2. Linear Model

RESCAL [29] is a typical bilinear model that captures the underlying semantics of entity representations, but it is prone to overfitting because of its large number of parameters. DistMult [14] can be considered a special case of RESCAL in which each relation is represented as a diagonal matrix rather than a full matrix to reduce overfitting. ComplEx [15] and SimplE [30] can be viewed as direct extensions of DistMult: ComplEx uses complex-valued embeddings to handle a variety of binary relations, and in SimplE, the subject and object embeddings of the same entity are learned dependently.
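The relationship between these linear models can be made concrete with a small pure-Python sketch. The function names and toy values are illustrative assumptions; the formulas are the standard bilinear form $v_h^{\top} M_r v_t$ for RESCAL, the trilinear dot product for DistMult, and the real part of a complex trilinear product for ComplEx.

```python
def rescal_score(v_h, M_r, v_t):
    """Bilinear score v_h^T M_r v_t with a full relation matrix M_r."""
    return sum(v_h[i] * M_r[i][j] * v_t[j]
               for i in range(len(v_h)) for j in range(len(v_t)))


def distmult_score(v_h, v_r, v_t):
    """Trilinear dot product: RESCAL restricted to a diagonal M_r."""
    return sum(h * r * t for h, r, t in zip(v_h, v_r, v_t))


def complex_score(v_h, v_r, v_t):
    """Re(sum_i h_i * r_i * conj(t_i)) with complex-valued embeddings."""
    return sum(h * r * t.conjugate() for h, r, t in zip(v_h, v_r, v_t)).real


v_h, v_r, v_t = [1.0, 2.0], [0.5, -1.0], [3.0, 4.0]
diag = [[0.5, 0.0], [0.0, -1.0]]           # diagonal matrix built from v_r
print(rescal_score(v_h, diag, v_t), distmult_score(v_h, v_r, v_t))

# DistMult is symmetric in head and tail; ComplEx need not be.
print(complex_score([1 + 1j], [1j], [1 + 0j]),
      complex_score([1 + 0j], [1j], [1 + 1j]))
```

With a diagonal relation matrix the RESCAL and DistMult scores coincide, while swapping head and tail changes the ComplEx score, which is what lets it model asymmetric relations.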

##### 2.3. Neural Network Model

ConvE [20] and ConvKB [21] are both CNN-based models, and ConvE is the first model to apply CNNs to the KGC task. In ConvE, filters of a fixed shape are applied to an input matrix formed by reshaping and concatenating the head entity and relation embeddings. This input design is straightforward but one sided: it ignores the global information among the same dimensional entries of the whole triple. ConvKB was proposed to address this problem by replacing the reshaping operation with a convolution layer over the embedding triple $(v_h, v_r, v_t)$. CapsE [22] applies a capsule network to the KGC task by adding a capsule network layer on top of a ConvKB-style convolution layer. SACN [31] and R-GCN [32] are models based on graph convolutional networks (GCNs), and DOLORES [33] introduces long short-term memory (LSTM) networks to KGC tasks. In addition, KBGAT [34], GAATs [35], ConvR [23], and CoPER-ConvE [36] are effective methods proposed in recent years. Collectively, NN-based approaches have become the mainstream way to address incomplete knowledge graphs. Table 1 lists the score function and the optimization method of each related work.

Deep CNNs are commonly used to extract deep features from images, and a number of powerful models have emerged, such as LeNet [37], AlexNet [38], ResNet [39], and Google Inception Net [40]. Inspired by ResNet, Johnson and Zhang proposed DPCNN [41], a low-complexity word-level deep convolutional neural network architecture for text categorization. In this paper, we propose a novel neural network model, DP-ConvKB, which takes advantage of a deep convolution structure to improve KGC performance.

#### 3. The Construction of Proposed Models

As we mentioned above, we attempted two different approaches to improve the KGC task performances by ConvKB. In the first model, MF-ConvKB, we incorporated multiple filters of different shapes into ConvKB. In the second model, DP-ConvKB, we utilized a deep pyramid convolutional network to improve the KGC task performance. In this section, we present the procedures for constructing these two models.

##### 3.1. MF-ConvKB

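An incomplete KG $\mathcal{G}$ collects facts in the form of triples $(h, r, t)$, with $h, t \in \mathcal{E}$ and $r \in \mathcal{R}$, where $\mathcal{E}$ and $\mathcal{R}$ denote the sets of entities and relations, respectively. Each triple is represented by a unique embedding group $(v_h, v_r, v_t)$ of $k$-dimensional vectors. These vectors are concatenated into a matrix $A = [v_h, v_r, v_t] \in \mathbb{R}^{k \times 3}$, and $A_{i,:}$ represents the $i$-th row of $A$. For example, a filter $\omega \in \mathbb{R}^{m \times 3}$ is applied to the window consisting of rows $A_{i,:}$ to $A_{i+m-1,:}$ to generate a feature $c_i$, which is defined as follows:

$$c_i = g\left(\omega \cdot A_{i:i+m-1,:} + b\right), \tag{1}$$

where $b$ is a bias term and $g$ is a nonlinear function such as the rectified linear unit (ReLU). The features $c = [c_1, c_2, \ldots, c_{k-m+1}]$ form the feature map of this filter.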

Unlike ConvKB, which uses filters of a single shape only, our model uses filters of multiple shapes to generate the transitional features from the pretrained embeddings. As shown in Figure 1, multiple filters are designed to capture features from long or short distances. We used $\tau_1$, $\tau_2$, and $\tau_3$ to denote the numbers of filters of the three different shapes, $m_1 \times 3$, $m_2 \times 3$, and $m_3 \times 3$, so the original ConvKB, which uses only $1 \times 3$ filters, can be viewed as a special case of MF-ConvKB. After convolution, all feature maps are concatenated into a vector $v$. Then, the score for the triple $(h, r, t)$ is calculated by a dot product between $v$ and $w$, where $w$ is a weight vector, and the score function is defined as

$$f(h, r, t) = \mathrm{concat}_{j=1}^{3}\left(c^{j}\right) \cdot w, \qquad c^{j} = g\left([v_h, v_r, v_t] * \Omega_j\right), \tag{2}$$

where $\Omega_j$ denotes the set of filters of the $j$-th shape, $c^{j}$ is the feature map generated from the set of filters of the $j$-th shape, $*$ denotes a convolution operator, and $\cdot$ denotes a dot product.
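The multifilter scoring described above can be sketched in pure Python as follows. This is an illustrative sketch rather than the authors' code: the function names, the stride of 1 without padding, and the toy filter heights are assumptions. Passing a single set of $1 \times 3$ filters reduces it to a ConvKB-style score.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def conv_feature_map(A, omega, bias=0.0):
    """Slide an m x 3 filter down the k x 3 matrix A (stride 1, no padding)."""
    k, m = len(A), len(omega)
    features = []
    for i in range(k - m + 1):
        s = bias
        for a in range(m):
            for b in range(3):
                s += omega[a][b] * A[i + a][b]
        features.append(relu(s))
    return features


def mf_convkb_score(v_h, v_r, v_t, filter_sets, w):
    """Concatenate the feature maps of every filter of every shape, then dot with w."""
    A = [list(row) for row in zip(v_h, v_r, v_t)]   # k x 3 input matrix
    v = []
    for filters in filter_sets:                     # one set per filter shape
        for omega in filters:
            v.extend(conv_feature_map(A, omega))
    if len(v) != len(w):
        raise ValueError("weight vector length must match the concatenated features")
    return sum(vi * wi for vi, wi in zip(v, w))


# Toy example with k = 2: one 1 x 3 filter and one 2 x 3 filter (made-up values).
v_h, v_r, v_t = [1.0, 2.0], [1.0, 1.0], [1.0, 5.0]
one_by_three = [[[1.0, 1.0, -1.0]]]
two_by_three = [[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]]
print(mf_convkb_score(v_h, v_r, v_t, [one_by_three, two_by_three], [1.0, 1.0, 1.0]))
```

Taller filters yield shorter feature maps, so the length of the weight vector depends on both the number of filters and their heights.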

##### 3.2. DP-ConvKB

Compared with the tentative MF-ConvKB model, DP-ConvKB takes a completely different approach to feature extraction. Instead of adding a variety of filters, we took advantage of the deep residual technique [39] and added a deep residual learning module on top of the original ConvKB model. In this way, more long-distance features can be taken into account. DP-ConvKB takes its name from deep pyramid (DP) CNNs [41], whose structure it follows. As shown in Figure 2, DP-ConvKB consists of two modules.

###### 3.2.1. ConvKB Module

The left module comes from the original ConvKB, where the TransE model is used to initialize the entity and relation embeddings. To preserve as much as possible of the transitional features of the translation-based model, we applied only the $1 \times 3$ filter, which exploits the global relationships among the same dimensional entries of the embedding triple. The feature map is defined as

$$c = g\left([v_h, v_r, v_t] * \Omega\right), \tag{3}$$

in which $\Omega$ denotes the set of filters, $*$ denotes a convolution operator, and $g$ is the activation function. The final feature map from this module is fed into the right module as the initial region embedding layer.

###### 3.2.2. Deep Residual Learning Module

In the deep residual learning module, the initial region embedding is followed by two convolutional layers, an activation layer, and a shortcut connection. This module can be represented as $z + f(z)$, where $f(z)$ denotes the skipped layers of the feedforward network and $z$ is the shortcut connection, as proposed in ResNet [39]. A similar structure has also been applied in DPCNN for text categorization [41]. The shortcut connection effectively mitigates the vanishing gradient problem when training deep NNs. ReLU was chosen as the activation function, consistent with the design of the feature map.

In this paper, we focused on the global information shared across the same dimensional entries over long distances. This differs from capturing semantic features across a long sentence: in the KGC task, the input matrix is composed only of the embeddings of entities and relations, and there are no semantic relationships between different dimensions. To capture features over relatively long distances, we used a kernel size of 2 for convolution, as shown in Figure 2.

After the first shortcut connection, a repeated block is added, consisting of a pooling layer, two convolution layers, and a shortcut connection, in which each convolution combines a ReLU activation with a filter initialized from a normal distribution. At the beginning of this block, we performed max-pooling with size 2 and stride 2: the pooling layer takes the maximum over 2 contiguous internal vectors, and the stride of 2 halves the size of the input feature map. The outputs of the successive pooling layers, arranged by size, form a “pyramid,” which is where DP takes its name.

For the shortcut connection $z + f(z)$, both $z$ and $f(z)$ must have the same dimensionality so that they can be summed. To avoid an extra dimension-matching operation, in the DP structure we used the same number of filters in every convolutional layer, so that each layer produces the same number of feature maps. In DP-ConvKB, we set this number to $\tau$, where $\tau$ is the number of filters used in ConvKB (the left module); the region embedding therefore also has $\tau$ feature maps. After the repeated block, the score function of the model can be written as follows:

$$f(h, r, t) = c^{\mathrm{out}} \cdot w, \tag{5}$$

where $c^{\mathrm{out}}$ is the output of the final convolution layer and $w$ is a weight vector.
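The pyramid structure of this module can be sketched in pure Python. The sketch rests on several assumptions that are not confirmed by the text: a 1-D convolution over the length axis that mixes all channels, zero padding on the right so that the length is preserved, and the pre-activation order (ReLU before each convolution) used in DPCNN. All function names are hypothetical.

```python
def conv1d_same(x, W):
    """x: C channels of length n. W[o][c][j]: kernel of size K.

    Zero-pads on the right so that every output channel has length n.
    """
    n = len(x[0])
    out = []
    for kernels in W:                       # one output channel per filter
        row = []
        for i in range(n):
            s = 0.0
            for c, kernel in enumerate(kernels):
                for j, weight in enumerate(kernel):
                    if i + j < n:
                        s += weight * x[c][i + j]
            row.append(s)
        out.append(row)
    return out


def relu_all(x):
    return [[max(v, 0.0) for v in ch] for ch in x]


def residual_block(z, W1, W2):
    """Return z + f(z), where f is two (ReLU, convolution) layers."""
    f = conv1d_same(relu_all(z), W1)
    f = conv1d_same(relu_all(f), W2)
    return [[a + b for a, b in zip(cz, cf)] for cz, cf in zip(z, f)]


def max_pool2(x):
    """Max-pooling with size 2 and stride 2 along the length axis."""
    return [[max(ch[i], ch[i + 1]) for i in range(0, len(ch) - 1, 2)] for ch in x]


def deep_pyramid(x, blocks):
    """Apply the blocks in order; pooling precedes every block but the first."""
    lengths = []
    for idx, (W1, W2) in enumerate(blocks):
        if idx > 0:
            x = max_pool2(x)
        x = residual_block(x, W1, W2)
        lengths.append(len(x[0]))
    return x, lengths


# Two feature maps of length 8 and all-zero filters, so f(z) = 0 and only
# the shortcut and the pooling are visible in the output.
zeros = [[[0.0, 0.0] for _ in range(2)] for _ in range(2)]
x = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], [0.0] * 8]
out, lengths = deep_pyramid(x, [(zeros, zeros)] * 3)
print(lengths)
```

Because every layer keeps the same number of channels, the sum $z + f(z)$ needs no dimension matching, and each pooling step halves the length of the feature maps.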

#### 4. Experiment Setup

##### 4.1. Dataset

We evaluated the MF-ConvKB and DP-ConvKB models on two benchmark datasets, WN18RR and FB15k-237, which are subsets of WN18 and FB15k, respectively. We chose these subsets because WN18 and FB15k contain highly redundant relations, as noted in [42, 43], which can lead to inaccurate test results. One example of redundancy is when a triple and its inverse triple appear in the test set and the training set, respectively. The test triple then becomes trivially retrievable, which inflates the results and masks poor generalization. WN18RR and FB15k-237 were designed to remove triples with such inverse relations. Statistics of the two benchmark datasets are given in Table 2.
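The inverse-relation leakage described above can be detected with a simple check. The sketch below is illustrative only: the function name `inverse_leaks` and the toy triples are assumptions, and it flags any test triple whose entity pair appears reversed in the training set, regardless of the relation.

```python
def inverse_leaks(train, test):
    """Return test triples (h, r, t) for which some (t, r', h) occurs in train."""
    reversed_pairs = {(t, h) for h, _, t in train}
    return [(h, r, t) for h, r, t in test if (h, t) in reversed_pairs]


train = [("animal", "_hyponym", "dog")]
test = [("dog", "_hypernym", "animal"), ("cat", "_hypernym", "animal")]
print(inverse_leaks(train, test))
```

A model can answer the flagged test triple simply by memorizing the reversed training triple, which is why such pairs inflate link prediction scores.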

The relations in the two datasets can be classified into four categories: 1-1, 1-M, M-1, and M-M, with M for MANY. According to [9], 1-1 denotes that one head can appear with at most one tail entity, and 1-M denotes that one head can appear with more than one tail. Similarly, M-1 denotes that more than one head can appear with the same tail, and M-M is the case in which multiple heads map to multiple tails. We found that 0.9% of the relations in FB15k-237 are of the 1-1 type, and the 1-M, M-1, and M-M types account for 6.3%, 20.5%, and 72.3%, respectively. The WN18RR dataset has 11 relations, among which *also_see*, *similar_to*, *verb_group*, and *derivationally_related_form* can be classified as M-M relations, and *member_meronym* and *hypernym* are 1-M and M-1 relations, respectively. Moreover, there are 1251 triples containing the relation *hypernym* and 1074 triples containing the relation *derivationally_related_form*, which together account for about 75% of the 3134 triples in the test set. Figure 3 shows the percentage of each relation in both the test and training sets.
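One way to assign these categories is to compute, for each relation, the average number of tails per head and of heads per tail, and to label a side as MANY when the average reaches a threshold. The sketch below assumes a threshold of 1.5, the convention commonly attributed to TransE [9]; the function names and toy triples are hypothetical.

```python
from collections import defaultdict


def relation_stats(triples):
    """Return {relation: (tph, hpt)}: average tails per head and heads per tail."""
    tails = defaultdict(set)   # (relation, head) -> distinct tails
    heads = defaultdict(set)   # (relation, tail) -> distinct heads
    for h, r, t in triples:
        tails[(r, h)].add(t)
        heads[(r, t)].add(h)
    tail_counts = defaultdict(list)
    head_counts = defaultdict(list)
    for (r, _), ts in tails.items():
        tail_counts[r].append(len(ts))
    for (r, _), hs in heads.items():
        head_counts[r].append(len(hs))
    return {r: (sum(tail_counts[r]) / len(tail_counts[r]),
                sum(head_counts[r]) / len(head_counts[r]))
            for r in tail_counts}


def relation_category(tph, hpt, threshold=1.5):
    """Label each side '1' if its average is below the threshold, else 'M'."""
    head_side = "1" if hpt < threshold else "M"
    tail_side = "1" if tph < threshold else "M"
    return head_side + "-" + tail_side


triples = [("a", "born_in", "x"), ("b", "born_in", "x"), ("c", "born_in", "x"),
           ("p", "has_part", "q1"), ("p", "has_part", "q2"),
           ("u", "spouse", "v")]
stats = relation_stats(triples)
categories = {r: relation_category(tph, hpt) for r, (tph, hpt) in stats.items()}
print(categories)
```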

##### 4.2. Design of Loss Function

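The loss function was minimized as follows:

$$\mathcal{L} = \sum_{(h, r, t) \in \mathcal{G} \cup \mathcal{G}'} \log\left(1 + \exp\left(l_{(h,r,t)} \cdot f(h, r, t)\right)\right) + \frac{\lambda}{2}\lVert w \rVert_2^2, \tag{6}$$

$$l_{(h,r,t)} = \begin{cases} 1, & (h, r, t) \in \mathcal{G}, \\ -1, & (h, r, t) \in \mathcal{G}', \end{cases}$$

where $\mathcal{G}$ is the training set of valid triples, while $\mathcal{G}'$ is a collection of invalid triples generated by corrupting a valid triple, that is, by replacing its head entity or tail entity with another entity in $\mathcal{E}$. According to the Bernoulli trick [4], the new invalid triples $(h', r, t)$ and $(h, r, t')$ occur with probabilities $\frac{\eta_t}{\eta_h + \eta_t}$ and $\frac{\eta_h}{\eta_h + \eta_t}$, respectively. For a given relation $r$, $\eta_h$ denotes the average number of head entities per tail entity, and $\eta_t$ denotes the average number of tail entities per head entity. $f(h, r, t)$ is the score function as in (5), and $\frac{\lambda}{2}\lVert w \rVert_2^2$ is the regularization on the weight vector $w$.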
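The sampling and loss described in this subsection can be sketched as follows, under stated assumptions: the loss is taken to be the ConvKB-style soft-margin loss with label $+1$ for valid and $-1$ for invalid triples, `tph` and `hpt` are hypothetical dictionaries holding the average tails per head and heads per tail of each relation, and the sampler does not check whether a corrupted triple happens to be a valid one.

```python
import math
import random


def corrupt(triple, entities, tph, hpt, rng):
    """Replace the head with probability tph/(tph + hpt), otherwise the tail."""
    h, r, t = triple
    p_head = tph[r] / (tph[r] + hpt[r])
    if rng.random() < p_head:
        return (rng.choice([e for e in entities if e != h]), r, t)
    return (h, r, rng.choice([e for e in entities if e != t]))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def soft_margin_loss(valid_scores, invalid_scores, w, lam=0.001):
    """Sum of log(1 + exp(l * f)) over both sets plus (lam / 2) * ||w||^2."""
    total = sum(softplus(s) for s in valid_scores)        # label l = +1
    total += sum(softplus(-s) for s in invalid_scores)    # label l = -1
    return total + 0.5 * lam * sum(wi * wi for wi in w)


rng = random.Random(0)
negative = corrupt(("a", "r", "b"), ["a", "b", "c"], {"r": 3.0}, {"r": 1.0}, rng)
print(negative)
print(soft_margin_loss([-2.0], [2.0], [0.1, -0.1]))
```

Under this loss, valid triples are pushed toward low scores and invalid triples toward high scores, so a lower score indicates a more plausible triple.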

##### 4.3. Evaluation Protocol

The objective of this study is to predict the missing entity in a triple, i.e., predicting the head entity $h$ given $(r, t)$ or predicting the tail entity $t$ given $(h, r)$. To evaluate our proposed models, we ranked the scores of candidate entities from the dataset. We employed three evaluation metrics commonly used in previous studies: mean rank (MR), mean reciprocal rank (MRR), and Hits@10. MR denotes the mean rank of all test triples as calculated in (7). It reflects how highly the correct entities are ranked on average across all test queries.

$$\mathrm{MR} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{rank}_i, \tag{7}$$

where $N$ is the number of test queries and $\mathrm{rank}_i$ is the rank of the correct entity for the $i$-th query.

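MRR is the mean of the reciprocal ranks over all queries of the test triples, as in (8); for each query, it depends only on the rank of the single highest-ranked relevant item.

$$\mathrm{MRR} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{\mathrm{rank}_i}, \tag{8}$$

where $N$ is the number of test queries and $\mathrm{rank}_i$ is the rank of the correct entity for the $i$-th query.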

Hits@$n$ is the proportion of test queries in which the rank of the correct entity is lower than or equal to $n$, with $n$ usually set to 10 in the link prediction task. A lower MR, a higher MRR, or a higher Hits@$n$ indicates a better result. Note that, following TransE [9], we removed the corrupted triples that already exist in the datasets when ranking the test triples.
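The evaluation protocol can be summarized in a short sketch. It assumes that a lower score means a more plausible triple (the ConvKB convention) and that `known_true` holds the candidate entities that would form triples already present in the datasets; the function names and toy values are illustrative.

```python
def filtered_rank(scores, target, known_true):
    """Rank of `target` among candidates, skipping other known-valid entities.

    scores: dict mapping each candidate entity to its score (lower is better).
    """
    target_score = scores[target]
    better = sum(1 for e, s in scores.items()
                 if e != target and e not in known_true and s < target_score)
    return 1 + better


def mean_rank(ranks):
    return sum(ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at(ranks, n=10):
    return sum(1 for r in ranks if r <= n) / len(ranks)


scores = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.05}
print(filtered_rank(scores, "c", known_true={"a"}))

ranks = [1, 2, 4, 20]
print(mean_rank(ranks), mean_reciprocal_rank(ranks), hits_at(ranks, 10))
```

In the first call, entity `a` scores better than the target but is skipped because it already forms a valid triple, which is the effect of the filtering step.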

##### 4.4. Implementation Details

Prior to the model training, the entities and relation embeddings were first preprocessed by the embeddings produced from TransE. We then applied stochastic gradient descent (SGD) algorithm with 3000 epochs on WN18RR and FB15k-237 to train the TransE model parameters. We adopted grid search algorithm on the validation set, in order to optimize parameters, including the dimensionality of the word embedding, the margin hyperparameter, and the SGD learning rate. We fixed the in our objective function. Our experiment showed that, for MF-ConvKB, the highest Hits@10 can be obtained when , , and on WN18RR and , , and on FB15k-237; for DP-ConvKB, these optimized parameters were , , and for both datasets.

We used the Adam optimizer to train both MF-ConvKB and DP-ConvKB. For training MF-ConvKB, we set the initial learning rate of Adam [44] at , and set the -regularizer at 0.001 to avoid overfitting. After convolution, we chose ReLU as the activation function. We designed the filters with the shape of , , and . Each filter was initialized by a truncated normal distribution. The highest Hits@10 scores were obtained when using , , and on WN18RR and , , and on FB15k-237. For training DP-ConvKB, we set the initial learning rate of Adam at and set the -regularizer at 0.001. To get the feature map for deep convolution, we initialized the original filter by a truncated normal distribution, and we fixed its shape at . In the deep NN part, convolutional filters were initialized by a normal distribution. The highest Hits@10 scores were obtained when using , , and on WN18RR and , , and on FB15k-237.

##### 4.5. Main Results

Table 3 shows the link prediction comparison results between our two models and the state-of-the-art models.

From Table 3, it is clear that the proposed model DP-ConvKB significantly outperformed all other models listed, including the original ConvKB (baseline model, hereafter). Specifically, on dataset WN18RR, DP-ConvKB obtains significant improvements of in MRR and absolute improvement in Hits@10. On dataset FB15k-237, these improvements were 0.327 in MRR and 22.3% in Hits@10.

As shown in Table 3, the tentative model MF-ConvKB performed only slightly better than the baseline model on all metrics on WN18RR and even obtained lower MRR and Hits@10 scores on FB15k-237. However, it achieved a lower MR than the baseline model on both datasets. The shape of the convolutional filter is known to play a pivotal role in feature extraction in fields such as image processing [45], speech recognition [46], and sentence classification [47]. However, its effect on the link prediction task is not well understood. To investigate, we used convolutional filters of different shapes in MF-ConvKB and tested model performance on link prediction tasks. Results are shown in Table 4: the first row, with a single filter, represents the baseline model; in the following rows, we replaced the filter with filters of different shapes; and the final row is MF-ConvKB. These results suggest that, in general, replacing the single filter in the baseline model with a filter of a different shape does not bring substantial improvements, and that combining multiple different filters can improve MR but does not necessarily improve MRR and Hits@10. Specifically, MF-ConvKB with a variety of filters performs better on WN18RR but not necessarily on FB15k-237. Taken together, these results indicate that combining filters of different shapes may not be the most efficient strategy for improving link prediction performance. Therefore, we sought other possible structures, e.g., the deep NN in our proposed model DP-ConvKB, to improve link prediction performance.

Figure 4 shows a result comparison between our proposed DP-ConvKB model and baseline model, tested on MRR and Hits@10 scores of different relations on dataset FB15k-237. Results of predicting head and tail entities are separately displayed. It is clear that DP-ConvKB improves baseline performances on all relations. Specifically, achieves the highest scores, followed by , , and ranking last. Moreover, DP-ConvKB shows more robust performances when predicting head and tail on and relations, while the performances from the baseline model fluctuate in these cases (from 0.37 to 0.084 and 0.46 to 0.72 on MRR).

Performance comparisons between DP-ConvKB and the baseline model for all relations on WN18RR are shown in Figure 5. In general, DP-ConvKB outperforms the baseline model on all considered relations on both Hits@10 and MRR. Strikingly, DP-ConvKB more than doubles the MRR score. In particular, the relations *verb_group* and *similar_to* reach the highest MRR scores. On WN18RR, we also tested the model's performance in predicting heads and tails. Figure 6 depicts the performance of DP-ConvKB on the head and tail prediction tasks in terms of MRR and Hits@10 on WN18RR. The scores for predicting head and tail are very close on almost every relation category. By comparison, the baseline model shows a relatively large gap between head and tail prediction, as can be seen in Figure 4. The results in Figure 6 demonstrate that DP-ConvKB reduces the discrepancy between head and tail prediction.

Taken together, the experimental results show that the deep CNN-based model DP-ConvKB can effectively capture global features across long distances and obtains the best performance on both the head prediction and tail prediction tasks.

#### 5. Conclusion

In this paper, we proposed a CNN-based model, DP-ConvKB, to improve the performance of the knowledge graph completion task. We first showed that simply adding a variety of filters to the original ConvKB model might not be the right direction for enhancing its performance. We then designed a new model, DP-ConvKB, which incorporates a deep pyramid neural network into ConvKB and is therefore capable of exploring deep features. Our results on WN18RR and FB15k-237 show that DP-ConvKB outperforms the baseline model (ConvKB), obtaining the best mean rank and the highest mean reciprocal rank and Hits@10. In summary, our study demonstrates that incorporating such a deep convolutional network structure into models for KGC tasks can significantly improve their performance.

Although DP-ConvKB achieves a substantial improvement on the link prediction task, it involves more computational complexity, and its optimal hyperparameters are difficult to find during training. In future work, we plan to prune the deep neural network to reduce training time and to generalize our structure to other NLP tasks.

#### Data Availability

The data that support the findings of this study are available from the authors. Requests for access to these data should be made to Xueting Wang, xuetingcuc@163.com.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.