Abstract
Graph convolutional networks (GCNs) are efficient networks for learning graph representations. However, learning the high-order interaction relationships among a node's neighbors is expensive. In this paper, we propose a novel graph convolutional model to learn and fuse multihop neighbor information. We adopt a weight-sharing mechanism across graph convolutions of different orders to avoid potential overfitting. Moreover, we design a new multihop neighbor information fusion (MIF) operator, which mixes neighbor features from 1 hop to k hops. We theoretically analyse the computational complexity and the number of trainable parameters of our models. Experiments on text networks show that the proposed models achieve state-of-the-art performance and outperform Text GCN.
1. Introduction
Text classification is a fundamental problem in many natural language processing (NLP) applications, such as text mining, spam detection, summarization, and question-answering systems [1–5]. Many deep learning approaches, such as convolutional neural networks (CNNs) [6], recurrent neural networks (RNNs) [7], and long short-term memory (LSTM) networks [8], have been introduced to text classification.
Text can be represented as a typical graph-structured network, and graph networks have natural advantages for processing such data. Scarselli et al. [9] proposed a graph neural network, which was widely used for text classification [10–12] and other NLP tasks [13]. The graph convolutional network (GCN) [10], which extends the CNN to graph data, has shown better text classification performance than the traditional CNN [14]. Yao et al. [15] proposed Text GCN, which uses document nodes and weighted edges to construct the text network graph, and their model outperformed the state-of-the-art text classification methods.
When messages pass through the graph of the text network, a node's output is affected not only by its directly connected nodes but also by its k-hop nodes [16]. To obtain more neighbor node information, GCN models can expand the receptive field by stacking multiple layers. However, deep GNN and GCN models often suffer from the oversmoothing issue [17, 18]. Conversely, the representation ability of a shallow network structure is clearly insufficient.
To address the above issue, we propose a multihop neighbor information fusion graph convolutional network for text classification based on the GCN. In our model, we propose a novel negative minimum value fusion operator to fuse multihop neighbor information (MIF). To reduce the computational complexity, we share the trainable weight [19] among the multihop neighbor nodes. Our experimental results show that our models achieve state-of-the-art performance on several text datasets with lower computational complexity.
The contributions of this paper are as follows. First, we propose a novel negative minimum value fusion operator, which fuses the feature information of multihop neighbors. Second, we propose a high-efficiency MIF-based graph convolutional network that captures k-hop neighbors for node classification on text datasets.
The remainder of this paper is organized as follows. In Section 2, the related works are reviewed. In Section 3, our methods are proposed. In Section 4, the experimental results are presented. Finally, we draw the concluding remarks in Section 5.
2. Related Work
In this section, we review related work on graph convolutional networks for text classification and on capturing multihop neighbor information in graph convolutional networks.
2.1. Graph Convolutional Network
Gori et al. [20] first proposed the concept of the graph neural network (GNN), which was based on the recurrent neural network architecture. Micheli [21] developed the GNN by random walk on the graph network [18]. Morris et al. [22] proved that the graph neural network and the 1D Weisfeiler-Leman test have the same ability to distinguish non-isomorphic graphs.
Graph convolutional networks (GCNs) were developed from convolutional neural networks [23]. However, it is difficult to apply the GCN for large-scale graphs due to the high computational burden of eigenvalue decomposition [23]. Defferrard et al. [10] proposed a localized filter using Chebyshev polynomial. Kipf and Welling [24] proposed vanilla GCN, which achieves state-of-the-art classification performance on the citation network. Niepert et al. [25] proposed PATCHY-SAN to capture the information from locally connected regions. Hamilton et al. [26] developed a set of aggregate functions by sampling nodes in the neighbor to address the limitation of transductive learning. Monti et al. [27] contributed a unified framework for generalizing convolution to non-Euclidean domains. Velickovic et al. [28] leveraged masked attention to propose a graph attention network (GAT). Ding et al. [29] developed GAT and achieved better classification performance.
Recent works have been proposed for capturing k-hop neighbor information of nodes [30–33]. Zhou and Li [33] proposed a new high-order convolution operator to capture k-hop neighbor information and developed adaptive filtering to adjust the weights of the operator. Based on the motif graph attention mechanism, Lee et al. [34] proposed motif convolutional networks to capture k-hop interactive information. Mao et al. [35] proposed a Siamese framework to capture the k-hop information in the brain network. Abu-El-Haija et al. [36–38] proposed several versions of the MixHop convolution to mix the features of different order graph convolutions using a fully connected layer.
2.2. Text Classification
Recently, deep learning models have been introduced into text classification and achieve far better performance than traditional models.
As deep learning techniques have developed, an increasing number of deep learning models have been applied to text classification. Kim [39] developed several variants of the CNN model for text classification. Recurrent neural networks (RNNs) [2, 7] are also widely applied for text classification.
With the development of graph network models, many researchers have developed GCN-based classification methods [26, 40–42]. Zhaoa et al. [42] proposed the SDGCN model to capture the interdependencies hidden in the data. Liu et al. [43] developed TensorGCN to aggregate intragraph and intergraph information of the text graph. However, Text GCN [15] has a large number of parameters and high computational complexity. We propose a novel GCN-based model to address this issue in Text GCN [15].
3. Method
In this section, we review the definition of related graph notations and analyse the layerwise propagation model of the GCN in detail. Then, we develop a novel information propagation method to capture and fuse the multihop neighbor information. We propose two novel frameworks to capture the rich information of the text network. Finally, we analyse the computational complexity and parameter quantities of our models.
3.1. Notations’ Definition
We assume that a graph signal is characterized by the node feature matrix $X \in \mathbb{R}^{n \times d}$ of the graph $G$, where $n$ and $d$ represent the number of nodes and the feature dimension, respectively. Let $A \in \mathbb{R}^{n \times n}$ be the adjacency matrix representing its edge connections. We define the normalized Laplacian matrix as $L = I - D^{-1/2} A D^{-1/2}$, where $I$ and $D$ denote the identity matrix and the degree matrix of the graph, respectively.
The popular convolutional propagation model [24] is as follows:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}H^{(l)}W^{(l)}\right), \tag{1}$$
where $\tilde{A}$ is the adjacency matrix with self-loops, namely, $\tilde{A} = A + I$. The degree matrix could be written as $\tilde{D}_{ii} = \sum_{j}\tilde{A}_{ij}$, and $H^{(l)}$ denotes the propagation matrix of layer $l$. If $l = 0$, then $H^{(0)} = X$, which means that the input signal is fed into the network. The trainable weight matrix $W^{(l)}$ could be optimized by gradient descent. We repeat the application of the convolutional model to get the vanilla GCN [24] framework:
$$Z = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A}XW^{(0)}\right)W^{(1)}\right), \tag{2}$$
where $\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$, and $W^{(0)}$ and $W^{(1)}$ denote different weight matrices. The classification function is $\mathrm{softmax}(x_i) = \exp(x_i)/\sum_{j}\exp(x_j)$. The convolutional operator is essentially a linear combination of a vertex's own features and those of its one-hop neighbourhood, which makes the features of vertices in the same category similar. Stacking two convolutional layers makes the vertices of the same category more closely connected and further eases the classification task. However, when more layers are applied, the vertices of different categories are mixed and become indistinguishable, which is known as oversmoothing [17, 18].
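As a rough illustration of the two-layer propagation described above, the following pure-Python sketch builds the self-loop-normalized adjacency matrix and applies two convolutional layers. The toy path graph, the feature and weight values, and all function names are assumptions made for this example only; they are not taken from the experiments.

```python
import math

def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(adj):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a dense adjacency matrix."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def gcn_forward(adj, x, w0, w1):
    """Two-layer GCN: softmax(A_hat ReLU(A_hat X W0) W1)."""
    a_hat = normalize_adjacency(adj)
    hidden = relu(matmul(a_hat, matmul(x, w0)))
    return softmax_rows(matmul(a_hat, matmul(hidden, w1)))

# Hypothetical toy input: path graph 0 - 1 - 2, two features, two hidden units, two classes.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w0 = [[0.5, -0.2], [0.1, 0.3]]
w1 = [[0.4, -0.1], [-0.3, 0.2]]
z = gcn_forward(adj, x, w0, w1)
```

Each row of `z` is a probability distribution over the two classes for one node.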
3.2. Multihop Neighbor Information Fusion with the Graph Convolutional Operator
Definition 1 (multihop neighbor information fusion (MIF)).
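It is assumed that matrix $\hat{A}$ denotes the regularized adjacency matrix of graph $G$, where $\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$. If $k = 1$, then $\hat{A}^{k} = \hat{A}$. The power matrix of $\hat{A}$ is $\hat{A}^{k} = \hat{A}^{k-1}\hat{A}$ with $\hat{A}^{0} = I$, where $I$ denotes the identity matrix. The multihop neighbor information fusion operator fuses the k-hop neighbor information elementwise, so that the topological information is preserved. The MIF operator is defined as follows:
$$\mathrm{MIF}(H_1, H_2, \ldots, H_k) = -\min(H_1, H_2, \ldots, H_k), \tag{3}$$
where $\hat{A}^{k}$ denotes the k-hop regularized adjacency matrix, $H_j = \hat{A}^{j}H^{(l)}W^{(l)}$ represents the j-hop neighbor information, and the minimum is taken elementwise.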
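Proposition 1. The multihop neighbor information fusion operator is a topology-preserving operator.
Proof. If $H^{(l)} \in \mathbb{R}^{n \times d}$ and $W^{(l)} \in \mathbb{R}^{d \times h}$, then $\hat{A}^{j}H^{(l)}W^{(l)} \in \mathbb{R}^{n \times h}$ for every $j$, and the k-hop neighbor information in the convolutional layer has the same dimension, $n \times h$. As defined in formula (3), the MIF operator is an elementwise operation. Therefore, the dimension of $\mathrm{MIF}(H_1, \ldots, H_k)$ is equal to the dimension of each $H_j$. The MIF operator preserves topological information.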
The MIF operator is an information aggregation layer that is used to mix k-order adjacency information. The procedure of the calculation of MIF is as follows:
(1) Calculate the elementwise minimum of the features from graph convolutions of different orders:
$$H_{\min} = \min(H_1, H_2, \ldots, H_k).$$
As an example of how the MIF operator works, take $k = 3$, namely, neighbor information of up to 3 hops is used; $H_{\min}$ is then the elementwise minimum of $H_1$, $H_2$, and $H_3$.
(2) The output of the MIF operator is the negative of each element of $H_{\min}$, namely, $\mathrm{MIF}(H_1, \ldots, H_k) = -H_{\min}$.
Following DIFFPOOL [44], we use the topology-preserving operator MIF to improve the performance.
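The two steps above can be traced on a small case. The sketch below is only an illustration: the three 2 x 2 matrices are made-up values standing in for the 1-hop, 2-hop, and 3-hop features, and the function name `mif` is a label chosen for this example.

```python
def mif(features):
    """Negative elementwise minimum over a list of equally shaped matrices."""
    # zip(*features) groups the same row of every matrix;
    # zip(*rows) then groups the same entry of every row.
    return [[-min(vals) for vals in zip(*rows)] for rows in zip(*features)]

# Hypothetical 1-hop, 2-hop, and 3-hop feature matrices (k = 3).
h1 = [[0.2, -1.0], [3.0, 0.5]]
h2 = [[0.1, 0.4], [-2.0, 0.5]]
h3 = [[0.3, -0.5], [1.0, 0.25]]

fused = mif([h1, h2, h3])
print(fused)  # [[-0.1, 1.0], [2.0, -0.25]]
```

The output has the same shape as each input, which is the sense in which the operator keeps the node-by-feature layout intact.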
3.3. MIF Propagation Model
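To address the limitations of Text GCN, we propose a propagation model of multihop neighbor information fusion graph convolution as follows:
$$H^{(l+1)} = \sigma\left(\mathrm{MIF}\left(\hat{A}H^{(l)}W^{(l)}, \hat{A}^{2}H^{(l)}W^{(l)}, \ldots, \hat{A}^{k}H^{(l)}W^{(l)}\right)\right), \tag{6}$$
where $\mathrm{MIF}(\cdot)$ is the operator of Definition 1, $\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$, $\hat{A}^{k}$ denotes $\hat{A}$ multiplied by itself $k$ times, $\hat{A}^{j}H^{(l)}W^{(l)}$ is the j-hop graph convolution, and $W^{(l)}$ is the weight parameter matrix. In the MixHop model [36–38], Abu-El-Haija et al. adopted different weights for different $j$. To reduce the computational complexity, we share the weight $W^{(l)}$ among the multihop convolutional operators in the same convolutional layer.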
In formula (6), the convolutional layers combine the multihop neighbor information from the 1-hop to the k-hop graph convolution. The calculation procedure is summarized in Algorithm 1. The MIF has advantages compared to Text GCN. MIF implements feature aggregation on nodes and their k-hop neighbor nodes. Therefore, MIF contains multihop neighbor information and captures more information than Text GCN. In summary, MIF merges multihop neighbor information features without adding extra parameters. Nodes in the same category become more closely connected. Furthermore, the MIF operator suppresses excessively weighted features while retaining features with small weight values, which may help prevent vanishing and exploding gradients.
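One possible reading of this propagation step is sketched below in pure Python. It is not the paper's Algorithm 1; the function name `mif_propagate`, the toy propagation matrix, and the weight values are assumptions, and the activation is left out so that the fusion itself is easy to check by hand. The product with the shared weight is computed once and then reused for every hop.

```python
def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def mif_propagate(a_hat, x, w, k):
    """Fuse the 1-hop to k-hop convolutions, all sharing the weight matrix w."""
    hop = matmul(x, w)            # X W, computed once for all hops
    hops = []
    for _ in range(k):
        hop = matmul(a_hat, hop)  # A^j X W = A (A^(j-1) X W)
        hops.append(hop)
    # Negative elementwise minimum over the k hop outputs.
    return [[-min(vals) for vals in zip(*rows)] for rows in zip(*hops)]

# Hypothetical toy values: 2 nodes, 2 input features, 2 output features.
a_hat = [[1.0, 0.0], [0.5, 0.5]]
x = [[1.0, 0.0], [0.0, 1.0]]
w = [[1.0, -2.0], [3.0, 4.0]]

one_hop = mif_propagate(a_hat, x, w, k=1)
two_hop = mif_propagate(a_hat, x, w, k=2)
print(two_hop)  # [[-1.0, 2.0], [-1.5, 0.5]]
```

With `k=1` the layer reduces to the negated 1-hop convolution, and increasing `k` adds hops without adding any weights.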

3.4. Graph Convolutional Network Based on MIF
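In Figure 1, we propose a two-layer graph convolutional neural network using the MIF layer. The first layer is the graph convolutional layer with MIF, and the convolutional layer is represented as follows:
$$H^{(1)} = \mathrm{MIF}\left(\hat{A}XW^{(0)}, \hat{A}^{2}XW^{(0)}, \ldots, \hat{A}^{k}XW^{(0)}\right), \tag{7}$$
where $\hat{A}$ is the regularized adjacency matrix, $X$ is the node feature matrix, and $W^{(0)}$ is the weight parameter matrix between the input layer and hidden layer.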
The second layer is the traditional graph convolution. We set the nonlinear activation function $\sigma$ between the two layers as ReLU and achieve multiclass classification via softmax after the second layer. The network extends the 1-hop graph convolution to the k-hop graph convolution to capture multihop neighbor interactive information. The output of our model is expressed as follows:
$$Z = \mathrm{softmax}\left(\hat{A}\,\sigma\left(H^{(1)}\right)W^{(1)}\right), \tag{8}$$
where $H^{(1)}$ is the output of the first (MIF) layer, $\hat{A}$ is the regularized adjacency matrix, and $W^{(1)}$ represents the weight parameter matrix between the hidden layer and output layer. The trainable weight parameters $W^{(0)}$ and $W^{(1)}$ would be updated by gradient descent.
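The two layers described above can be put together in a short forward pass. This is a minimal sketch under stated assumptions: the function name `nmgc_forward`, the 3-node propagation matrix, and all weight values are hypothetical, and training by gradient descent is omitted.

```python
import math

def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def mif(features):
    """Negative elementwise minimum over equally shaped matrices."""
    return [[-min(vals) for vals in zip(*rows)] for rows in zip(*features)]

def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def nmgc_forward(a_hat, x, w0, w1, k):
    """MIF layer over k hops (shared w0), ReLU, 1-hop convolution, softmax."""
    hop = matmul(x, w0)
    hops = []
    for _ in range(k):
        hop = matmul(a_hat, hop)
        hops.append(hop)
    hidden = [[max(0.0, v) for v in row] for row in mif(hops)]  # ReLU
    logits = matmul(a_hat, matmul(hidden, w1))
    return [softmax(row) for row in logits]

# Hypothetical toy input: 3 nodes, 2 features, 2 hidden units, 2 classes.
a_hat = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w0 = [[0.5, -0.2], [-0.4, 0.3]]
w1 = [[0.4, -0.1], [-0.3, 0.2]]

z2 = nmgc_forward(a_hat, x, w0, w1, k=2)  # 2-hop variant
z3 = nmgc_forward(a_hat, x, w0, w1, k=3)  # 3-hop variant
```

Switching between the 2-hop and 3-hop variants changes only the argument `k`; the weight matrices keep the same shapes.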
In the preliminary network design, we compare how many convolutional layers and hops suit our model. The two-convolutional-layer network shows better performance than networks with three or more convolutional layers. When we implement the multihop neighbor information fusion, we observe that $k = 2$ is better for most text networks, while $k = 3$ is better for a few networks. In further experiments, when $k > 3$, the classification results decrease. Moreover, the larger the value of $k$, the higher the computational cost. Therefore, we only discuss the cases of $k = 2$ and $k = 3$ in our models.
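When $k = 2$, our MIF operator fuses the 1-hop graph convolutional layer and the 2-hop graph convolutional layer. The 2-hop MIF graph convolutional layer (MIFGC2) is as follows:
$$H^{(1)} = \mathrm{MIF}\left(\hat{A}XW^{(0)}, \hat{A}^{2}XW^{(0)}\right). \tag{9}$$
When $k = 3$, the 3-hop MIF graph convolutional layer (MIFGC3) fuses the 1-hop to 3-hop convolutional layers. The MIFGC3 layer, on which the NMGC3 model is built, is as follows:
$$H^{(1)} = \mathrm{MIF}\left(\hat{A}XW^{(0)}, \hat{A}^{2}XW^{(0)}, \hat{A}^{3}XW^{(0)}\right). \tag{10}$$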
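The cross-entropy loss is utilized as our model loss function:
$$\mathcal{L} = -\sum_{i \in \mathcal{Y}_{L}}\sum_{f=1}^{F} Y_{if}\ln Z_{if}, \tag{11}$$
where $\mathcal{Y}_{L}$ denotes the set of labeled nodes and $F$ represents the number of classes. $Y_{if}$ denotes the real labels of the labeled nodes, and $Z_{if}$ denotes the probability value between 0 and 1 predicted by softmax.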
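For one-hot labels, the inner sum over classes keeps only the log-probability of the true class, so the loss can be computed as in the following sketch. The probabilities, labels, and the function name `masked_cross_entropy` are illustrative assumptions; unlabeled nodes are simply left out of the index list.

```python
import math

def masked_cross_entropy(probs, labels, labeled_idx):
    """Sum of -ln(probability of the true class) over the labeled nodes only."""
    return -sum(math.log(probs[i][labels[i]]) for i in labeled_idx)

# Hypothetical softmax outputs for 3 nodes and 2 classes; node 2 is unlabeled.
probs = [[0.5, 0.5], [0.25, 0.75], [0.9, 0.1]]
labels = [0, 1, 0]  # the entry for node 2 is never read

loss = masked_cross_entropy(probs, labels, labeled_idx=[0, 1])
```

Here the loss equals -(ln 0.5 + ln 0.75), and the prediction for node 2 does not contribute.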
3.5. Computational Complexity and Parameters
Because the actual running time is sensitive to hardware and implementation, we follow He and Sun [45] and report the theoretical time complexity rather than the actual running time. For large-scale graph networks, it is a huge challenge to directly calculate the power $\hat{A}^{k}$ of the regularized adjacency matrix $\hat{A}$. Therefore, we calculate $\hat{A}^{k}X$ with right-to-left multiplication. For example, if $k = 2$, we calculate $\hat{A}^{2}X$ as $\hat{A}(\hat{A}X)$. $\hat{A}$ is usually a sparse matrix with $m$ nonzero entries. In formulas (7)–(10), our graph convolutional layers adopt the weight-sharing mechanism. Therefore, the calculation procedure is efficient.
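The right-to-left evaluation can be sketched as repeated sparse-times-dense products, so that the matrix power is never formed. The dictionary-based sparse format, the function names, and the toy values below are assumptions for illustration; each product touches every nonzero entry once per feature column.

```python
def sparse_matmul(a_sparse, dense):
    """Multiply a square sparse matrix {row: [(col, value), ...]} by a dense matrix."""
    cols = len(dense[0])
    out = [[0.0] * cols for _ in dense]
    for i, entries in a_sparse.items():
        for j, value in entries:          # one pass over the nonzero entries
            for c in range(cols):
                out[i][c] += value * dense[j][c]
    return out

def k_hop(a_sparse, h, k):
    """Compute A^k H as A (A (... (A H))) without building A^k."""
    for _ in range(k):
        h = sparse_matmul(a_sparse, h)
    return h

# Hypothetical 3 x 3 propagation matrix with 5 nonzero entries.
a_sparse = {
    0: [(0, 1.0)],
    1: [(0, 0.5), (1, 0.5)],
    2: [(1, 0.5), (2, 0.5)],
}
h = [[4.0], [0.0], [8.0]]

two_hop = k_hop(a_sparse, h, 2)
print(two_hop)  # [[4.0], [3.0], [3.0]]
```

Under these assumptions, k hops cost k sparse products, each proportional to the number of nonzero entries times the number of feature columns.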
Since different hop graph convolutions share the same weight in the same layer, the parameter quantities are consistent with the 1-hop graph convolution. It is assumed that , where is the number of nodes; , where is the feature dimension; , where represents the number of hidden neurons in the first layer; and , where represents the number of hidden neurons in the second layer. Then, the output dimensions in the first layer are the same, namely, , and . Therefore, in the first convolutional layer of our proposed model, the computational complexity is , and the trainable parameters are . The whole computational complexity of our proposed model is with trainable parameters. The node feature dimension is far larger than the number of neurons, namely, . Therefore, the computational complexity of our network frameworks approximates , and trainable parameters approximate , respectively. It matches the computational complexity and parameters of vanilla GCN [24]. Similarly, Text GCN [15] takes computational complexity and trainable parameters.
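The effect of weight sharing on the parameter count can be checked with simple arithmetic. The sketch below assumes a two-layer model without bias terms and compares it with a hypothetical variant in which every hop has its own first-layer weight; the dimensions are invented for illustration and are not the values used in the experiments.

```python
def shared_weight_params(d, h, c):
    """Weights of a two-layer model whose hops share one d x h first-layer matrix."""
    return d * h + h * c

def per_hop_weight_params(d, h, c, k):
    """Weights if each of the k hops had its own d x h matrix (hypothetical variant)."""
    return k * d * h + h * c

# Hypothetical sizes: 1000 input features, 64 hidden units, 8 classes.
d, h, c = 1000, 64, 8

shared = shared_weight_params(d, h, c)
separate = per_hop_weight_params(d, h, c, k=3)
print(shared, separate)  # 64512 192512
```

The shared count does not depend on k, which is why it coincides with the count for a 1-hop convolution.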
4. Experiment
We evaluate our NMGC2 and NMGC3 on text networks and compare them with classic and deep learning methods, including embedding-based, CNN-based, LSTM-based, and GCN-based models. We analyse the computational complexity and trainable parameters in detail. We also investigate the impact of network framework parameters and training epochs on classification accuracy.
4.1. Datasets
We test our methods on five benchmark corpora: R52 and R8 of Reuters-21578, 20-Newsgroups (20NG), Ohsumed, and Movie Review (MR). Following the preprocessing steps of Yao et al. [15], we process the text datasets and use the documents and words as nodes to build the text graph. The summary statistics of the datasets are given in Table 1.
4.2. Baseline and Experimental Settings
We compare with the same baseline methods as Yao et al. [15], i.e., CNN with randomly initialized word vectors (CNN-rand) [39], CNN with pretrained word vectors (CNN-pretrain) [39], predictive text embedding (PTE) [46], LSTM framework (LSTM) [2], LSTM framework with pretrained word embeddings (LSTM-pretrain) [2], fast text classifier (fastText) [47], fast text classifier with bigrams (fastText-bigrams) [47], label embedding model with attention (LEAM) [48], simple word-embedding model (SWEM) [49], graph CNN with spline filter (GCNN-S) [23], graph CNN with Fourier filter (GCNN-F) [50], and graph CNN with Chebyshev filter (GCNN-C) [10].
We tune the values of a series of hyperparameters (learning rate, dropout rate, hidden units, and epochs) to determine the best hyperparameters of our model on text networks. The hyperparameters are reported in Tables 2 and 3. In our NMGC2 and NMGC3 models, we set the L2 regularization factor to 0 and use the Adam optimizer [51], following Yao et al. [15].
4.3. Results’ Analysis
We compare our NMGC2 and NMGC3 with other baseline methods in terms of test accuracy. As shown in Table 4, the proposed model NMGC2 or NMGC3 achieves the highest classification performance on datasets R52, R8, 20NG, and Ohsumed. Specifically, our NMGC2 obtains the best accuracy of 94.35%, 97.31%, and 69.21% on datasets R52, R8, and Ohsumed, respectively, whereas our NMGC3 obtains the highest accuracy of 86.68% on 20NG.
The success of NMGC2 and NMGC3 is mainly due to the following three aspects. (1) Our NMGC2 and NMGC3 models are able to capture the word-word and document-word relations in the datasets. (2) Our NMGC2 and NMGC3 make full use of the advantages of the GCN. We implement feature aggregation on each node and its 1-hop neighbors (in each layer) so that the node (word-word and document-word) features in the same cluster are similar and therefore easy to classify. (3) Our NMGC2 and NMGC3 capture more and richer feature information from 1-hop to k-hop neighbors, which may circumvent the limitations of the GCN.
On dataset MR, CNN-pretrain [39] achieves the highest classification accuracy of 77.75%. One reason is that CNN-pretrain [39] and LSTM-pretrain [2] model consecutive word sequences and thus capture consecutive and short-distance semantics, while our NMGC2 and NMGC3 ignore word order. Another reason is that it is difficult for our NMGC2 and NMGC3 models to propagate information among the nodes (word-word and document-word) on MR, whose graph has few edges.
Compared to Text GCN [15], our NMGC2 and NMGC3 are better by a large margin in most cases, which demonstrates the effectiveness of our method in terms of capturing high-order interactive information.
4.4. Hidden Units’ Analyses of Our Method
To evaluate the relationship between hidden units and model performance, we conduct experiments with different numbers of hidden units. We choose a representative set of hidden unit sizes that balances computational complexity and classification performance. The results are summarized in Table 5. Specifically, our NMGC2 achieves its best accuracy with 128, 64, 128, 128, and 64 hidden units on R52, R8, 20NG, Ohsumed, and MR, respectively, whereas our NMGC3 achieves its best accuracy with 128, 128, 128, 128, and 32 hidden units on R52, R8, 20NG, Ohsumed, and MR, respectively. We note that our NMGC2 is better than our NMGC3 in many cases for different hidden units. This is most likely because our NMGC3 suffers from oversmoothing in most cases. However, NMGC3 outperforms NMGC2 on the short-text MR dataset, which has few edges, because NMGC3 considers higher-order node information and propagates more label information to the entire graph network. In most cases, when the number of hidden units decreases, the accuracy of our method decreases, and more epochs are required to train our network.
4.5. Computational Complexity and Trainable Weight Parameters
We design the weight-sharing mechanism to share the weight within the proposed convolutional layer, which reduces the number of parameters. With weight sharing, our calculations are efficient, which reduces the computational complexity. The mechanism also helps to avoid the overfitting caused by a large number of parameters.
As shown in Table 6, we compare our method with Text GCN [15] in terms of computational complexity and the number of trainable weight parameters. The derivation of Comp. and Params. is given in Section 3.5. In Table 6, we observe that our NMGC2 and NMGC3 match the computational complexity of Text GCN [15]. In particular, for MR, NMGC3 is about 1 time less than Text GCN [15] in terms of computational complexity. Interestingly, our models have the fewest parameters on all datasets because we use the weight-sharing mechanism across the convolutions from the 1-hop to the k-hop graph convolution, together with a smaller number of hidden units.
5. Conclusion
In this work, we propose a new multihop neighbor information fusion graph convolutional network on graph-structured data. We develop a novel MIF operator to combine the graph convolution features of multihop neighbor information from the 1-hop to the k-hop graph convolution. Experiments on text networks suggest that our models are capable of encoding node features and global graph topology in a way useful for graph classification. In this setting, our models achieve performance improvement compared to other methods while being computationally efficient and using fewer trainable parameters. In future work, we plan to study different fusion schemes and extend our model to more datasets.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was funded by the National Natural Science Foundation of China (U1701266, 61702117, and 61672008), the Guangdong Provincial Key Laboratory of Intellectual Property and Big Data (2018B030322016), the Special Projects for Key Fields in Higher Education of Guangdong (2020ZDZX3077), and in part by Qingyuan Science and Technology Plan Project (Grant nos. 170809111721249 and 170802171710591).