Special Issue: Graph Invariants and Their Applications
Multihop Neighbor Information Fusion Graph Convolutional Network for Text Classification
Graph convolutional network (GCN) is an efficient network for learning graph representations. However, learning the high-order interaction relationships among a node's neighbors is expensive. In this paper, we propose a novel graph convolutional model to learn and fuse multihop neighbor information. We adopt a weight-sharing mechanism across graph convolutions of different orders to avoid potential overfitting. Moreover, we design a new multihop neighbor information fusion (MIF) operator, which mixes neighbor features from 1-hop to k-hop. We theoretically analyse the computational complexity and the number of trainable parameters of our models. Experiments on text networks show that the proposed models achieve better performance than Text GCN.
Text classification is a fundamental problem in many natural language processing (NLP) applications, such as text mining, spam detection, summarization, and question answering [1–5]. Many deep learning approaches, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory (LSTM), have been introduced to text classification.
Text can be represented as a graph-structured network, and graph networks have natural advantages for processing such data. Scarselli et al. proposed a graph neural network, which has been widely used for text classification [10–12] and other NLP tasks. The graph convolutional network (GCN), which extends the CNN to graph data, has shown better performance on text classification than the traditional CNN. Yao et al. proposed Text GCN, which builds the text network graph from document and word nodes connected by weighted edges, and their model outperformed the state-of-the-art text classification methods.
When messages pass through the graph of the text network, a node's output is affected not only by its directly connected nodes but also by its k-hop nodes. To obtain more neighbor information, GCN models can expand the receptive field by stacking multiple layers. However, deep GNN and GCN models often suffer from the oversmoothing issue [17, 18]. Conversely, the representation ability of a shallow network structure is clearly insufficient.
To address the above issue, we propose a multihop neighbor information fusion graph convolutional network for text classification based on the GCN. In our model, we propose a novel negative minimum value fusion operator to fuse multihop neighbor information (MIF). To reduce the computational complexity, we share the trainable weight among the multihop neighbor nodes. Our experimental results show that our models achieve state-of-the-art performance on several benchmark text datasets with lower computational complexity.
The contributions of this paper are as follows. First, we propose a novel negative minimum value fusion operator, which fuses the feature information of multihop neighbors. Second, we propose an efficient MIF-based graph convolutional network that captures k-hop neighbors for node classification on text datasets.
The remainder of this paper is organized as follows. In Section 2, the related works are reviewed. In Section 3, our methods are proposed. In Section 4, the experimental results are presented. Finally, we draw the concluding remarks in Section 5.
2. Related Work
In this section, we review related work on graph convolutional networks for text classification and on the use of multihop neighbor information in graph convolutional networks.
2.1. Graph Convolutional Network
Gori et al. first proposed the concept of the graph neural network (GNN), which was based on the recurrent neural network architecture. Micheli developed the GNN by random walk on the graph network. Morris et al. proved that the graph neural network and the 1D Weisfeiler-Leman algorithm have the same ability to distinguish nonisomorphic graphs.
Graph convolutional networks (GCNs) were developed from convolutional neural networks. However, it is difficult to apply the GCN to large-scale graphs due to the high computational burden of eigenvalue decomposition. Defferrard et al. proposed a localized filter using Chebyshev polynomials. Kipf and Welling proposed vanilla GCN, which achieves state-of-the-art classification performance on citation networks. Niepert et al. proposed PATCHY-SAN to capture the information from locally connected regions. Hamilton et al. developed a set of aggregate functions by sampling nodes in the neighborhood to address the limitation of transductive learning. Monti et al. contributed a unified framework for generalizing convolution to non-Euclidean domains. Velickovic et al. leveraged masked attention to propose a graph attention network (GAT). Ding et al. developed GAT and achieved better classification performance.
Recent works have been proposed for capturing k-hop neighbor information of nodes [30–33]. Zhou and Li proposed a new high-order convolution operator to capture k-hop neighbor information and developed adaptive filtering to adjust the weights of the operator. Based on the motif graph attention mechanism, Lee et al. proposed motif convolutional networks to capture k-hop interactive information. Mao et al. proposed a Siamese framework to capture the k-hop information in the brain network. Abu-El-Haija et al. [36–38] proposed several versions of the mix-hop convolution, which mixes the features of different-order graph convolutions using a fully connected layer.
2.2. Text Classification
Recently, deep learning models have been introduced into text classification, where they achieve far better performance than traditional models. Kim developed several variants of the CNN model for text classification. Recurrent neural networks (RNNs) [2, 7] are also widely applied to text classification and show better results than traditional models.
With the development of graph network models, many researchers have developed GCN-based classification methods [26, 40–42]. Zhaoa et al. proposed the SDGCN model to capture the interdependencies hidden in the data. Liu et al. developed TensorGCN to aggregate intragraph and intergraph information of the text graph. However, Text GCN has a large number of parameters and high computational complexity. We propose a novel GCN-based model to address this issue in Text GCN.
In this section, we review the definition of related graph notations and analyse the layer-wise propagation model of the GCN in detail. Then, we develop a novel information propagation method to capture and fuse the multihop neighbor information. We propose two novel frameworks to capture the rich information of the text network. Finally, we analyse the computational complexity and parameter quantities of our models.
3.1. Notations’ Definition
We assume that a graph signal is characterized by the node feature matrix $X \in \mathbb{R}^{n \times d}$ of the graph $G$, where $n$ and $d$ represent the number of nodes and the feature dimension, respectively. Let $A \in \mathbb{R}^{n \times n}$ be the adjacency matrix representing its edge connections. We define the normalized Laplacian matrix as $L = I - D^{-1/2} A D^{-1/2}$, where $I$ and $D$ denote the identity matrix and the degree matrix of the graph, respectively.
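The notation above can be made concrete with a small sketch. It assumes the usual symmetric normalization $L = I - D^{-1/2} A D^{-1/2}$; the function name `normalized_laplacian` and the 3-node path graph are hypothetical and chosen only for illustration.

```python
import math

def normalized_laplacian(adj):
    """Return L = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix."""
    n = len(adj)
    degrees = [sum(row) for row in adj]
    # Isolated nodes (degree 0) get a zero scaling factor to avoid division by zero.
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degrees]
    return [
        [(1.0 if i == j else 0.0) - inv_sqrt[i] * adj[i][j] * inv_sqrt[j]
         for j in range(n)]
        for i in range(n)
    ]

# Hypothetical 3-node path graph 0 - 1 - 2.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
L = normalized_laplacian(A)
```

The diagonal of `L` is 1 for every connected node, and each off-diagonal entry is the negated edge weight scaled by the degrees of its two endpoints.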
The popular convolutional propagation model is as follows:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right), \tag{1}$$
where $\tilde{A}$ is the adjacency matrix with self-loops, namely, $\tilde{A} = A + I$. The degree matrix could be written as $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, and $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ denotes the propagation matrix. If $l = 0$, then $H^{(0)} = X$, which means the input signal is connected to the network. The trainable weight matrix $W^{(l)}$ could be optimized by gradient descent. We repeat the application of the convolutional model to get the vanilla GCN framework:
$$Z = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right), \tag{2}$$
where $W^{(0)}$ and $W^{(1)}$ are different weight matrices. The classification function is $\mathrm{softmax}(x_i) = \exp(x_i) / \sum_j \exp(x_j)$. The convolutional operator is essentially a linear combination of a vertex's own features and those of its one-hop neighbourhood, which makes the features of vertices in the same category similar. Stacking two convolutional layers makes the vertices of the same category more closely connected and further eases the classification task. However, when more layers are applied, the vertices of different categories are mixed and become indistinguishable, which is excessive smoothing (oversmoothing) [17, 18].
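As a rough illustration of the two-layer vanilla GCN described above, the following sketch builds the propagation matrix with self-loops and runs one forward pass. It uses plain Python lists instead of a tensor library; the helper names, the toy graph, and all weight values are hypothetical.

```python
import math

def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def propagation_matrix(adj):
    """Add self-loops, then apply the symmetric degree normalization."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Self-loops guarantee every degree is at least 1, so the division is safe.
    inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in a_tilde]
    return [[inv_sqrt[i] * a_tilde[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def gcn_forward(adj, x, w0, w1):
    """Two-layer GCN: softmax(A_hat ReLU(A_hat X W0) W1)."""
    a_hat = propagation_matrix(adj)
    hidden = relu(matmul(a_hat, matmul(x, w0)))
    return softmax_rows(matmul(a_hat, matmul(hidden, w1)))

# Hypothetical toy inputs: 3 nodes, one-hot features, 2 hidden units, 2 classes.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W0 = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.6]]
W1 = [[1.0, -1.0], [0.5, 0.5]]
Z = gcn_forward(A, X, W0, W1)
```

Each row of `Z` is a probability distribution over the two classes for one node.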
3.2. Multihop Neighbor Information Fusion with the Graph Convolutional Operator
Definition 1 (multihop neighbor information fusion (MIF)).
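It is assumed that matrix $\hat{A} \in \mathbb{R}^{n \times n}$ denotes the regularized adjacency matrix of graph $G$. If $X \in \mathbb{R}^{n \times d}$, then $\hat{A} X \in \mathbb{R}^{n \times d}$. The power matrix of $\hat{A}$ satisfies $\hat{A}^{0} = I$, where $I$ denotes the identity matrix. The multihop neighbor information fusion operator fuses the k-hop neighbor information element-wise, so the topological information is preserved. The MIF operator is defined as follows:
$$\mathrm{MIF}\left(\hat{A} X, \hat{A}^{2} X, \ldots, \hat{A}^{k} X\right) = -\min\left(\hat{A} X, \hat{A}^{2} X, \ldots, \hat{A}^{k} X\right), \tag{3}$$
where the minimum is taken element-wise, $\hat{A}^{k}$ denotes the k-hop regularized adjacency matrix, and $\hat{A}^{k} X$ represents the k-hop neighbor information.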
Proposition 1. The multihop neighbor information fusion operator is a topological preserved operator.
Proof. If $X \in \mathbb{R}^{n \times d}$, then $\hat{A} X \in \mathbb{R}^{n \times d}$, and the k-hop neighbor information in the convolutional layer has the same dimension, $\hat{A}^{k} X \in \mathbb{R}^{n \times d}$. As defined in formula (3), the MIF operator is an element-wise operation. Therefore, the dimension of the MIF output is equal to the dimension of $\hat{A} X$. The MIF operator preserves topological information.
The MIF operator is an information aggregation layer that is used to mix k-order adjacency information. The procedure of the calculation of MIF is as follows:
(1) Calculate the element-wise minimum of the features from the different-order graph convolutions:
$$H_{\min} = \min\left(\hat{A} X, \hat{A}^{2} X, \ldots, \hat{A}^{k} X\right). \tag{4}$$
For example, with $k = 3$, namely, when at most 3-hop neighbor information is used, each entry of $H_{\min}$ is the smallest of the three corresponding entries of the 1-hop, 2-hop, and 3-hop features.
(2) The output of the MIF operator is defined as the negative of each element of $H_{\min}$, namely, $\mathrm{MIF} = -H_{\min}$.
Following DIFFPOOL, we use the topology-preserving operator MIF to improve the performance.
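The two-step procedure can be sketched on small matrices. The three hop-feature matrices below are hypothetical values chosen for illustration, not the example from the paper, and `mif` is an assumed helper name.

```python
def mif(hop_features):
    """Negative element-wise minimum over a list of equally shaped matrices."""
    rows, cols = len(hop_features[0]), len(hop_features[0][0])
    for h in hop_features:
        if len(h) != rows or any(len(r) != cols for r in h):
            raise ValueError("all hop features must share one shape")
    # Step 1: element-wise minimum across hops; step 2: negate each element.
    return [[-min(h[i][j] for h in hop_features) for j in range(cols)]
            for i in range(rows)]

# Hypothetical 1-hop, 2-hop, and 3-hop features of a 2-node graph (k = 3).
H1 = [[0.5, -1.0], [2.0, 0.25]]
H2 = [[0.1, 3.0], [-4.0, 1.0]]
H3 = [[0.9, -2.0], [1.0, 0.5]]
fused = mif([H1, H2, H3])
print(fused)  # [[-0.1, 2.0], [4.0, -0.25]]
```

The output has the same shape as each input, which is the dimension-preserving property used in the proof of Proposition 1.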
3.3. MIF Propagation Model
To address the limitations of the Text GCN, we propose a propagation model of multihop neighbor information fusion graph convolution as follows:
$$H^{(l+1)} = \sigma\left(\mathrm{MIF}\left(\hat{A} H^{(l)} W^{(l)}, \hat{A}^{2} H^{(l)} W^{(l)}, \ldots, \hat{A}^{k} H^{(l)} W^{(l)}\right)\right), \tag{6}$$
where $\mathrm{MIF}(\cdot)$ is the negative of the element-wise minimum in formula (4), $j = 1, 2, \ldots, k$, and $\hat{A}^{j}$ denotes $\hat{A}$ multiplied by itself $j$ times, where $\hat{A}^{j} H^{(l)} W^{(l)}$ is the j-hop graph convolution, and $W^{(l)}$ is the weight parameter matrix. In the MixHop model [36–38], Abu-El-Haija et al. adopted different weights for different hops. To reduce the computational complexity, we share the weight across all hops of the multihop convolutional operator within the same convolutional layer.
In formula (6), the convolutional layer combines the multihop neighbor information from the 1-hop to the k-hop graph convolution. The calculation procedure is summarized in Algorithm 1. MIF has advantages over Text GCN. MIF implements feature aggregation on nodes and their k-hop neighbor nodes, so it contains multihop neighbor information and captures more information than Text GCN. In summary, MIF merges multihop neighbor features without adding parameters, and nodes in the same category become more closely connected. Furthermore, the MIF operator suppresses features with excessively large weights while retaining features with small weight values, which may help prevent vanishing and exploding gradients.
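A minimal sketch of one MIF convolutional layer follows, assuming ReLU as the activation and a single weight matrix shared by all hops. It is not Algorithm 1 from the paper; the function name `mif_layer` and all numeric values are hypothetical.

```python
def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def mif_layer(a_hat, h, w, k):
    """ReLU(MIF(A_hat H W, ..., A_hat^k H W)) with one weight W shared by all hops."""
    current = matmul(h, w)  # H W is computed once and reused by every hop
    hops = []
    for _ in range(k):
        current = matmul(a_hat, current)  # j-hop features from (j-1)-hop features
        hops.append(current)
    rows, cols = len(current), len(current[0])
    # MIF: negative element-wise minimum over the k hop features.
    fused = [[-min(hop[i][j] for hop in hops) for j in range(cols)]
             for i in range(rows)]
    return [[max(0.0, v) for v in row] for row in fused]

# Hypothetical propagation matrix, input features, and shared weight.
A_hat = [[0.5, 0.5], [0.0, 1.0]]
H = [[1.0, -1.0], [3.0, 1.0]]
W = [[-1.0], [0.0]]
out_k1 = mif_layer(A_hat, H, W, k=1)
out_k2 = mif_layer(A_hat, H, W, k=2)
print(out_k1, out_k2)  # [[2.0], [3.0]] [[2.5], [3.0]]
```

Increasing `k` changes only the number of propagation steps; the weight matrix `W`, and therefore the parameter count, stays the same.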
3.4. Graph Convolutional Network Based on MIF
In Figure 1, we propose a two-layer graph convolutional neural network using the MIF layer. The first layer is the graph convolutional layer with MIF, and the convolutional layer is represented as follows:
$$H^{(1)} = \mathrm{ReLU}\left(\mathrm{MIF}\left(\hat{A} X W^{(0)}, \hat{A}^{2} X W^{(0)}, \ldots, \hat{A}^{k} X W^{(0)}\right)\right), \tag{7}$$
where $W^{(0)}$ is the weight parameter matrix between the input layer and hidden layer.
The second layer is the traditional graph convolution. We set the nonlinear activation function σ between the two layers as ReLU and achieve multiclass classification via softmax after the second layer. The network extends the 1-hop graph convolution to k-hop graph convolution to capture multihop neighbor interactive information. The output of our model is expressed as follows:
$$Z = \mathrm{softmax}\left(\hat{A} H^{(1)} W^{(1)}\right), \tag{8}$$
where $H^{(1)}$ is the output of the first layer after the ReLU activation and $W^{(1)}$ represents the weight parameter matrix between the hidden layer and output layer. The trainable weight parameters $W^{(0)}$ and $W^{(1)}$ are updated by gradient descent.
In the preliminary network design, we compare how many convolutional layers and hops suit our model. The network with two convolutional layers performs better than networks with three or more convolutional layers. When we implement the multihop neighbor information fusion, we observe that $k = 2$ is better for most text networks, while $k = 3$ is better for a few networks. In further experiments, when $k > 3$, the classification results decrease. Moreover, the larger the value of $k$, the higher the computational cost. Therefore, we only discuss the cases of $k = 2$ and $k = 3$ in our models.
When $k = 2$, our MIF operator fuses the 1-hop graph convolutional layer and 2-hop graph convolutional layer. The model with the 2-hop MIF graph convolutional layer (MIFGC-2), referred to as NMGC-2, is as follows:
$$Z = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\mathrm{MIF}\left(\hat{A} X W^{(0)}, \hat{A}^{2} X W^{(0)}\right)\right) W^{(1)}\right). \tag{9}$$
When $k = 3$, the 3-hop MIF graph convolutional layer (MIFGC-3) fuses the 1-hop to 3-hop convolutional layers. The NMGC-3 model is as follows:
$$Z = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\mathrm{MIF}\left(\hat{A} X W^{(0)}, \hat{A}^{2} X W^{(0)}, \hat{A}^{3} X W^{(0)}\right)\right) W^{(1)}\right). \tag{10}$$
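The two-layer architecture can be sketched end to end as follows, assuming an MIF first layer with ReLU and a standard graph convolution with softmax as the second layer. The function name `nmgc_forward`, the propagation matrix, and all weights are hypothetical; the 2-hop and 3-hop variants differ only in the argument `k`.

```python
import math

def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def neg_min(mats):
    """MIF fusion: negative element-wise minimum over equally shaped matrices."""
    return [[-min(vals) for vals in zip(*rows)] for rows in zip(*mats)]

def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def nmgc_forward(a_hat, x, w0, w1, k):
    """softmax(A_hat ReLU(MIF(A_hat X W0, ..., A_hat^k X W0)) W1)."""
    current = matmul(x, w0)  # shared first-layer weight, applied once
    hops = []
    for _ in range(k):
        current = matmul(a_hat, current)
        hops.append(current)
    hidden = [[max(0.0, v) for v in row] for row in neg_min(hops)]
    return softmax_rows(matmul(a_hat, matmul(hidden, w1)))

# Hypothetical 3-node propagation matrix, features, and weights.
A_hat = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W0 = [[0.3, -0.7], [-0.2, 0.4]]
W1 = [[1.0, -0.5], [0.25, 0.75]]
Z2 = nmgc_forward(A_hat, X, W0, W1, k=2)  # 2-hop variant
Z3 = nmgc_forward(A_hat, X, W0, W1, k=3)  # 3-hop variant
```

Both calls use exactly the same two weight matrices, which reflects the weight-sharing design described in Section 3.3.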
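The cross-entropy loss is utilized as our model loss function:
$$\mathcal{L} = -\sum_{i \in \mathcal{Y}_L} \sum_{c=1}^{C} Y_{ic} \ln Z_{ic}, \tag{11}$$
where $\mathcal{Y}_L$ denotes the set of nodes with labels and $C$ represents the number of classes. $Y_{ic}$ denotes the real labels of the labeled nodes, and $Z_{ic}$ denotes the probability value between 0 and 1 predicted by softmax.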
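A short sketch of this loss, assuming one-hot labels given as class indices and summing only over labeled nodes, is shown below. The function name `masked_cross_entropy` and the sample probabilities are hypothetical.

```python
import math

def masked_cross_entropy(probs, labels, labeled_idx):
    """Sum of -ln(probability of the true class) over the labeled nodes only."""
    loss = 0.0
    for i in labeled_idx:
        # With one-hot labels only the true-class term survives the inner sum.
        loss -= math.log(probs[i][labels[i]])
    return loss

# Hypothetical softmax outputs for 3 nodes and 2 classes; node 1 is unlabeled.
probs = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
labels = [0, 0, 1]
loss = masked_cross_entropy(probs, labels, labeled_idx=[0, 2])
```

Here the loss equals $-(\ln 0.5 + \ln 0.8)$; the prediction for the unlabeled node does not contribute.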
3.5. Computational Complexity and Parameters
Because the actual running time is sensitive to hardware and implementations, we follow He and Sun and report the theoretical time complexity rather than the actual running time. For large-scale graph networks, it is a huge challenge to directly calculate $\hat{A}^{k}$. Therefore, we calculate $\hat{A}^{k} X W$ with right-to-left multiplication. For example, if $k = 2$, we calculate $\hat{A}^{2} X W$ as $\hat{A}(\hat{A}(X W))$. $\hat{A}$ is usually a sparse matrix with $m$ nonzero entries. In formulas (7)–(10), our graph convolutional layers adopt the weight-sharing mechanism. Therefore, the calculation procedure is efficient.
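The right-to-left evaluation can be sketched with a simple sparse representation. The example stores each row of the sparse matrix as a `{column: value}` dictionary, which is an assumption made for illustration (a real implementation would more likely use a sparse matrix library); the matrix entries are hypothetical placeholders.

```python
def spmm(sparse_rows, dense):
    """Multiply a sparse matrix (list of {col: value} dicts) by a dense matrix."""
    width = len(dense[0])
    out = []
    for row in sparse_rows:
        acc = [0.0] * width
        for col, val in row.items():  # only the nonzero entries are visited
            for j in range(width):
                acc[j] += val * dense[col][j]
        out.append(acc)
    return out

def k_hop_features(sparse_a, xw, k):
    """Return [A XW, A^2 XW, ..., A^k XW] without ever forming a power of A."""
    feats, current = [], xw
    for _ in range(k):
        current = spmm(sparse_a, current)  # one more sparse multiplication per hop
        feats.append(current)
    return feats

# Hypothetical sparse matrix with m = 6 nonzero entries and a precomputed X W.
A_sparse = [{0: 0.5, 1: 0.5}, {0: 0.5, 2: 0.5}, {1: 0.5, 2: 0.5}]
XW = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
hop1, hop2 = k_hop_features(A_sparse, XW, k=2)
print(hop2)  # [[1.0, 0.75], [0.75, 1.0], [1.25, 1.25]]
```

Each hop touches every nonzero entry once per output column, so the dense and potentially much larger matrix power is never materialized.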
Since different hop graph convolutions share the same weight in the same layer, the number of parameters is the same as in the 1-hop graph convolution. It is assumed that $\hat{A} \in \mathbb{R}^{n \times n}$, where $n$ is the number of nodes; $X \in \mathbb{R}^{n \times d}$, where $d$ is the feature dimension; $W^{(0)} \in \mathbb{R}^{d \times h_1}$, where $h_1$ represents the number of hidden neurons in the 1st layer; and $W^{(1)} \in \mathbb{R}^{h_1 \times h_2}$, where $h_2$ represents the number of neurons in the 2nd layer. Then, the outputs of the different hops in the 1st layer have the same dimension, namely, $\hat{A} X W^{(0)} \in \mathbb{R}^{n \times h_1}$ and $\hat{A}^{k} X W^{(0)} \in \mathbb{R}^{n \times h_1}$. Therefore, the first convolutional layer of our proposed model has $d \times h_1$ trainable parameters, and the whole model has $d \times h_1 + h_1 \times h_2$ trainable parameters. The node feature dimension is far larger than the number of neurons, namely, $d \gg h_2$. Therefore, the number of trainable parameters of our network frameworks is approximately $d \times h_1$. The computational complexity follows from the right-to-left multiplication and, like the number of parameters, matches that of vanilla GCN. The computational complexity and trainable parameters of Text GCN are obtained in the same way.
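The effect of weight sharing on the parameter count can be checked with simple arithmetic. The sketch compares the shared-weight design with a hypothetical variant that keeps one separate first-layer weight matrix per hop; the dimensions are invented for illustration and are not taken from the paper's tables.

```python
def shared_weight_params(d, h1, h2):
    """Parameters of the two-layer model when all hops share one first-layer weight."""
    return d * h1 + h1 * h2

def per_hop_weight_params(d, h1, h2, k):
    """Hypothetical variant with a separate first-layer weight matrix for each hop."""
    return k * d * h1 + h1 * h2

# Hypothetical sizes: 1000 input features, 64 hidden units, 8 output units.
d, h1, h2 = 1000, 64, 8
print(shared_weight_params(d, h1, h2))         # 64512
print(per_hop_weight_params(d, h1, h2, k=3))   # 192512
```

With shared weights the count does not depend on the number of hops, and because the feature dimension dominates, it is close to `d * h1`.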
We evaluate our NMGC-2 and NMGC-3 on text networks and compare them with classic methods and deep learning methods, including embedding-based, CNN-based, LSTM-based, and GCN-based models. We analyse the computational complexity and the number of trainable parameters in detail. We also investigate the impact of network framework parameters and training epochs on classification accuracy.
We test our methods on five benchmark corpora: R52 and R8 of Reuters-21578, 20-Newsgroups (20NG), Ohsumed, and Movie Review (MR). Following the preprocessing steps of Yao et al., we process the text datasets and use the documents and words as nodes to build the text graph. The statistics of the datasets are described in detail in Table 1.
4.2. Baseline and Experimental Settings
We compare with the following baseline methods as in Yao et al., i.e., CNN with randomly initialized word vectors (CNN-rand), CNN with pretrained word vectors (CNN-pretrain), predictive text embedding (PTE), LSTM framework (LSTM), LSTM framework with pretrained word embeddings (LSTM-pretrain), fast text classifier (fastText), fast text classifier with bigrams (fastText-bigrams), label embedding model with attention (LEAM), simple word-embedding model (SWEM), graph CNN with spline filter (GCNN-S), graph CNN with Fourier filter (GCNN-F), and graph CNN with Chebyshev filter (GCNN-C).
We tune a series of hyperparameters (learning rate, dropout rate, hidden units, and epochs) to determine the best values for our model on text networks. The hyperparameters are reported in Tables 2 and 3. In our NMGC-2 and NMGC-3 models, we set the L2 regularization factor to 0 and use Adam as the optimizer, following Yao et al.
4.3. Results’ Analysis
We compare our NMGC-2 and NMGC-3 with other baseline methods in terms of test accuracy. As shown in Table 4, NMGC-2 or NMGC-3 achieves the highest classification performance on datasets R52, R8, 20NG, and Ohsumed. Specifically, our NMGC-2 obtains the best accuracy of 94.35%, 97.31%, and 69.21% on datasets R52, R8, and Ohsumed, respectively, whereas our NMGC-3 obtains the highest accuracy of 86.68% on 20NG.
The success of NMGC-2 and NMGC-3 is mainly due to the following three aspects. (1) Our NMGC-2 and NMGC-3 models have the capability to capture the relations in terms of word-word and document-word in the datasets. (2) Our NMGC-2 and NMGC-3 make full use of the advantages of the GCN. We implement the feature information aggregation on the node and its 1-hop neighbor information (each layer) so that the node (word-word and document-word) features in the same cluster are similar, which are easy to classify. (3) Our NMGC-2 and NMGC-3 capture more and richer feature information from 1-hop to k-hop neighbors, which may circumvent the limitations of the GCN.
On dataset MR, CNN-pretrain achieves the highest classification result of 77.75%. CNN-pretrain and LSTM-pretrain model consecutive word sequences and thus capture consecutive and short-distance semantics, while our NMGC-2 and NMGC-3 ignore word order. Another reason is that it is difficult for our NMGC-2 and NMGC-3 models to propagate information among the nodes (word-word and document-word) on MR, which has few edges.
Compared to Text GCN, our NMGC-2 and NMGC-3 are better by a large margin in most cases, which demonstrates the effectiveness of our method in terms of capturing high-order interactive information.
4.4. Hidden Units’ Analyses of Our Method
To evaluate the relationship between hidden units and the model performance, we conduct experiments with different numbers of hidden units. We choose a representative set of hidden unit sizes for the comparison, balancing computational complexity and classification performance. The results are summarized in Table 5. Specifically, our NMGC-2 achieves its best accuracy with 128, 64, 128, 128, and 64 hidden units on R52, R8, 20NG, Ohsumed, and MR, respectively, whereas our NMGC-3 achieves its best accuracy with 128, 128, 128, 128, and 32 hidden units on R52, R8, 20NG, Ohsumed, and MR, respectively. We note that NMGC-2 is better than NMGC-3 in most cases across the different numbers of hidden units. This is most likely because NMGC-3 suffers from oversmoothing in most cases. However, NMGC-3 outperforms NMGC-2 on MR, which consists of short texts with few edges, because NMGC-3 considers higher-order node information and propagates more label information to the entire graph network. In most cases, when the number of hidden units decreases, the accuracy of our method decreases, and more epochs are required to train our network.
4.5. Computational Complexity and Trainable Weight Parameters
We design the weight-sharing mechanism to share the weight among the hops in the proposed convolutional layer. This reduces the number of parameters, which helps to avoid the overfitting caused by many parameters, and it keeps the computation efficient, which reduces the computational complexity.
As shown in Table 6, we compare our method with Text GCN in terms of computational complexity and the number of trainable weight parameters. The detailed derivation of Comp. and Params. is analysed in Section 3.5. In Table 6, we observe that our NMGC-2 and NMGC-3 could match the computational complexity of Text GCN. In particular, for MR, NMGC-3 is about 1 time less than Text GCN in terms of computational complexity. Interestingly, our models have the fewest parameters on all datasets because we use the weight-sharing mechanism across the convolutions from the 1-hop to the k-hop graph convolution and a smaller number of hidden units.
In this work, we propose a new multihop neighbor information fusion graph convolutional network on graph-structured data. We develop a novel MIF operator to combine the graph convolution features of multihop neighbor information from the 1-hop to the k-hop graph convolution. Experiments on text networks suggest that our models are capable of encoding both node features and global graph topology in a way useful for graph classification. In this setting, our models achieve performance improvement compared to other methods while being computationally efficient and having fewer trainable parameters. In future work, we plan to study different fusion schemes and extend our model to more datasets.
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This work was funded by the National Natural Science Foundation of China (U1701266, 61702117, and 61672008), the Guangdong Provincial Key Laboratory of Intellectual Property and Big Data (2018B030322016), the Special Projects for Key Fields in Higher Education of Guangdong (2020ZDZX3077), and in part by Qingyuan Science and Technology Plan Project (Grant nos. 170809111721249 and 170802171710591).
S. Lai, L. Xu, K. Liu et al., “Recurrent convolutional neural networks for text classification,” in Proceedings of the Twenty-ninth AAAI Conference on Artificial Intelligence, Austin, TX, USA, January 2015.
H. Tao, S. Tong, H. Zhao et al., “A radical-aware attention-based model for Chinese text classification,” in Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, February 2019.
Y. Kim, “Convolutional neural networks for sentence classification,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751, Doha, Qatar, October 2014.
F. Scarselli, M. Gori, A. C. Tsoi et al., “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2008.
M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” Advances in Neural Information Processing Systems, Springer, Berlin, Germany, 2016.
L. Huang, D. Ma, S. Li et al., “Text level graph neural network for text classification,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3444–3450, Hong Kong, China, November 2019.
Q. Li, Z. Han, and X.-M. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, February 2018.
M. Gori, G. Monfardini, and F. Scarselli, “A new model for learning in graph domains,” in Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, pp. 729–734, IEEE, Montreal, Canada, July 2005.
M. Niepert, M. Ahmed, and K. Kutzkov, “Learning convolutional neural networks for graphs,” in Proceedings of the International Conference on Machine Learning, pp. 2014–2023, New York, NY, USA, June 2016.
W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Proceedings of the Advances in Neural Information Processing Systems, pp. 1024–1034, Long Beach, CA, USA, July 2017.
F. Monti, D. Boscaini, J. Masci et al., “Geometric deep learning on graphs and manifolds using mixture model cnns,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5115–5124, Honolulu, HI, USA, July 2017.
M. Ding, J. Tang, and J. Zhang, “Semi-supervised learning on graphs with generative adversarial nets,” in Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 913–922, ACM, Torino, Italy, October 2018.
R. A. Rossi, R. Zhou, and N. K. Ahmed, “Estimation of graphlet counts in massive networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 1, pp. 44–57, 2018.
R. A. Rossi, N. K. Ahmed, and E. Koh, “Higher-order network representation learning,” in Companion Proceedings of the Web Conference, Lyon, France, April 2018.
S. Abu-El-Haija, N. Alipourfard, H. Harutyunyan et al., “A higher-order graph convolutional layer,” in Proceedings of the 32nd Conference on Neural Information Processing Systems (NIPS 2018), Montreal, Canada, December 2018.
S. Abu-El-Haija, B. Perozzi, A. Kapoor et al., “MixHop: higher-order graph convolutional architectures via sparsified neighborhood mixing,” in Proceedings of the International Conference on Machine Learning, pp. 21–29, Long Beach, CA, USA, June 2019.
H. Peng, J. Li, Y. He et al., “Large-scale hierarchical text classification with recursively regularized deep graph-cnn,” in Proceedings of the 2018 World Wide Web Conference, pp. 1063–1072, Lyon, France, April 2018.
Z. Ying, J. You, C. Morris et al., “Hierarchical graph representation learning with differentiable pooling,” in Proceedings of the Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, pp. 4805–4815, Montréal, Canada, December 2018.
K. He and J. Sun, “Convolutional neural networks at constrained time cost,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5353–5360, Boston, MA, USA, June 2015.
J. Tang, M. Qu, and Q. Mei, “Pte: predictive text embedding through large-scale heterogeneous text networks,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1165–1174, ACM, Sydney, Australia, August 2015.
G. Wang, C. Li, W. Wang et al., “Joint embedding of words and labels for text classification,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pp. 2321–2331, Melbourne, Australia, July 2018.
D. Shen, G. Wang, W. Wang et al., “Baseline needs more love: on simple word-embedding-based models and associated pooling mechanisms,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pp. 440–450, Melbourne, Australia, July 2018.