Hybrid Low-Order and Higher-Order Graph Convolutional Networks
With the higher-order neighborhood information of a graph network, the accuracy of graph representation learning classification can be significantly improved. However, the current higher-order graph convolutional networks have a large number of parameters and high computational complexity. Therefore, we propose a hybrid lower-order and higher-order graph convolutional network (HLHG) learning model, which uses a weight sharing mechanism to reduce the number of network parameters. To reduce the computational complexity, we propose a novel information fusion pooling layer to combine the high-order and low-order neighborhood matrix information. We theoretically compare the computational complexity and the number of parameters of the proposed model with those of the other state-of-the-art models. Experimentally, we verify the proposed model on large-scale text network datasets using supervised learning and on citation network datasets using semisupervised learning. The experimental results show that the proposed model achieves higher classification accuracy with a small set of trainable weight parameters.
Convolutional neural networks (CNNs) have achieved great success on grid-structured data such as images and videos [1, 2]. This success is attributed to the series of filters in the convolutional layers, which extract locally invariant features. In contrast to a regular grid, the number of neighbors of a node in a graph network may vary from node to node. Therefore, it is difficult to directly apply the filter operator to an irregular network structure.
In the graph network, the nodes and the connecting edges between them contain abundant network characteristic information. A graph convolutional network (GCN) aggregates the neighborhood nodes to realize continuous information transmission based on a graph network. By making full use of this information, a GCN can effectively achieve tasks such as classification, prediction, and recommendation.
A GCN generalizes traditional CNNs to the graph domain. GCN methods are mainly divided into two categories: frequency domain-based methods [4–6] and spatial domain-based methods [7, 8].
In the spatial domain, to simulate the convolution operation of the traditional CNN on an image, the convolution operation aggregates the information of the neighborhood nodes [7–10]. Henaff et al.  proposed a smoothed parametric spectral filter to realize localization and to preserve the parameters of filters independent of the input dimension. One of the key challenges is that the number of neighborhood nodes in the network irregularly changes.
In the frequency domain, Bruna et al. were the first to extend CNN-type architectures to graphs. Cao et al. applied a generalized convolutional network to the graph frequency domain using the Fourier transform. In this method, eigenvalue decomposition is performed on the neighborhood matrix. To reduce the computational complexity, Defferrard et al. proposed using Chebyshev polynomials of the eigenvalues of the graph Laplacian to obtain efficient and localized graph convolutional filters. Kipf and Welling proposed the classical GCN, which uses a first-order Chebyshev polynomial approximation. This approach reduces the computational complexity but introduces truncation errors. These errors make the model unable to capture high-level interaction information between the nodes in the graph and thus limit its capabilities. The information propagation process in the graph is related not only to the first-order neighborhood but also to higher-order neighborhoods.
Abu-El-Haija et al. [14, 15] proposed a high-order graph convolutional layer that uses a linear combination of the high-order neighborhood bases of the GCN. Tiao et al. proposed a Bayesian estimation approach via stochastic variational inference on the adjacency matrix of the graph. Levie et al. proposed Cayley polynomials to compute localized regular filters for the frequency bands of interest in graphs. Therefore, the rational use of second-order neighborhoods, third-order neighborhoods, and other high-order neighborhood information will be beneficial to classification prediction accuracy [14–16, 18–20].
Based on the classical GCN, to make full use of the high-order and low-order neighborhood information, we propose a novel hybrid low-order and higher-order graph convolutional network (HLHG). As shown in Figure 1, the graph convolutional layer of our model is simple and effective at capturing the high-order neighborhood information, nonlinearly combining the different order neighborhood information. The contributions are summarized as follows:
(1) We propose a new fusion pooling layer to achieve high-order neighborhood fusion with the low-order neighborhood of graph networks
(2) We propose a low-order neighborhood and high-order neighborhood weight sharing mechanism to reduce the computational complexity and number of parameters of the model
(3) The experimental results show that our HLHG achieves state-of-the-art performance in both the text network classification with supervised learning and the citation network with semisupervised learning
The rest of the paper is organized as follows. In Section 2, the related theoretical basis, such as the graph convolution and the high-order graph convolution, is introduced. In Section 3, the general information fusion pooling for the high-order neighborhood is presented. Then, the proposed model and its variant are presented. The computational complexity and parameter quantity of the proposed model are also theoretically analyzed. In Section 4, our proposed model is verified and the corresponding analysis is presented. Finally, Section 5 concludes the paper.
2. Related Theoretical Background
In this section, the related theoretical basis will be introduced, including the graph convolutional network (GCN).
Given a graph $G$ with node set $V$ and edge set $E$, the graph is represented as $G = (V, E)$. If nodes $v_i$ and $v_j$ are connected, then $A_{ij} = 1$; otherwise, $A_{ij} = 0$. The information in the graph propagates along the edges $e_{ij} \in E$. This also applies to node self-loops, for which $A_{ii} = 1$. Assuming that the information propagated by each node in the graph network is $x_i \in \mathbb{R}^{r}$, the information matrix of the graph is $X \in \mathbb{R}^{n \times r}$, where $n$ is the total number of nodes in the graph network and $r$ is the dimension of the information feature. If the graph network with self-loops is denoted $\tilde{G}$, then its adjacency matrix is $\tilde{A} = A + I$. The degree matrix of $\tilde{A}$ is the diagonal matrix $\tilde{D}$ with $\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}$.
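As a concrete illustration of these definitions, the following NumPy sketch builds the self-loop adjacency matrix and the degree matrix for a small graph. The function names and the 4-node path graph are hypothetical and chosen only for illustration.

```python
import numpy as np

def add_self_loops(A):
    """Return A + I: the adjacency matrix with a self-loop on every node."""
    return A + np.eye(A.shape[0], dtype=A.dtype)

def degree_matrix(A_tilde):
    """Diagonal matrix whose i-th diagonal entry is the i-th row sum of A_tilde."""
    return np.diag(A_tilde.sum(axis=1))

# Hypothetical undirected path graph: 0 - 1 - 2 - 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

A_tilde = add_self_loops(A)       # adjacency matrix including self-loops
D_tilde = degree_matrix(A_tilde)  # degrees count the self-loop as well
```

With self-loops included, every node has degree at least 1, so the inverse square root of the degree matrix used in later normalization is always defined.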
2.2. Graph Convolutional Network
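Given the graph $G$, consider two signals $x$ and $y$. Their graph Fourier transforms are defined as $\hat{x} = U^{T} x$ and $\hat{y} = U^{T} y$, where $U$ is the matrix of orthonormal eigenvectors of the graph Laplacian $L$ of graph $G$. As in Euclidean space, the spectral graph convolution of $x$ and $y$ is given by an elementwise product in the Fourier domain:
$$x * y = U\big((U^{T} x) \odot (U^{T} y)\big) = U \hat{Y} U^{T} x, \tag{1}$$
where $\hat{Y} = \mathrm{diag}(\hat{y})$ represents the diagonal matrix of $\hat{y}$.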
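Defferrard et al. utilized $K$-th order polynomial filters based on Chebyshev polynomials to represent the graph convolution operation on the Laplacian $L$, $g_{\theta}(\Lambda) \approx \sum_{k=0}^{K} \theta_{k} T_{k}(\tilde{\Lambda})$, where $\theta_{k}$ denotes the coefficients, $T_{k}$ is the $k$-th Chebyshev polynomial, and $\tilde{\Lambda}$ represents the rescaled eigenvalues of the Laplacian.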
Kipf and Welling proposed the classical graph convolutional neural network model based on the graph Fourier transform. The GCN model approximates the spectral filter using a first-order Chebyshev polynomial. The propagation model in the graph network is as follows:
$$H^{(l+1)} = \sigma\big(\hat{A} H^{(l)} W^{(l)}\big), \tag{2}$$
where $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, the symmetrically normalized adjacency matrix with self-loops, denotes the information propagation matrix; $W^{(l)}$ represents the trainable weight of layer $l$; when $l = 0$, $H^{(0)} = X$, which represents the initial input value of the GCN; and $\sigma$ denotes the activation function. To reduce the computational complexity, the convolution operator in the graph is defined by a simple neighborhood average. However, the convolutional filters are too simple to capture the high-level interaction information between the nodes in the graph. Therefore, the classification accuracy on citation network datasets is limited.
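The propagation rule of the classical GCN can be sketched in a few lines of NumPy. This is a minimal dense illustration rather than the reference implementation: the ReLU activation, the helper names, and the random inputs are assumptions made for the example.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} of an adjacency matrix."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W):
    """One GCN layer: aggregate first-order neighbors, project, apply ReLU."""
    return np.maximum(A_hat @ H @ W, 0.0)

# Hypothetical 4-node path graph with 3 input features and 2 hidden units
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
W0 = rng.normal(size=(3, 2))

A_hat = normalize_adjacency(A)
H1 = gcn_layer(A_hat, X, W0)
```

Each row of `H1` mixes only a node's own features and those of its direct neighbors, which is the first-order limitation discussed above.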
Abu-El-Haija et al. [14, 15] proposed a high-order graph convolutional layer model based on the GCN for semisupervised node classification. The propagation model of the high-order graph convolution is shown in formula (3). In this model, the transfer function of the $(l+1)$-th layer is a column concatenation from the first order to the $P$-th order in the $l$-th layer, which is a linear combination of the high-order neighborhoods. In the propagation model, the different order neighborhoods of the same layer use different weight parameters:
$$H^{(l+1)} = \Big\Vert_{j=1}^{P} \sigma\big(\hat{A}^{j} H^{(l)} W_{j}^{(l)}\big), \tag{3}$$
where $\Vert$ denotes column-wise concatenation and $\hat{A}^{j}$ is the $j$-th power of the propagation matrix $\hat{A}$. However, as the network layers deepen, the dimensions of $H^{(l)}$ will increase and propagate between layers. Therefore, the number of trainable weight parameters grows, and more training resources are needed to learn weights of the appropriate dimensions.
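To make the parameter growth concrete, the sketch below implements a concatenation-style high-order layer in the spirit of [14, 15]. It is not the authors' code: the function name, the ReLU activation, and the toy sizes are assumptions. Each order has its own weight matrix, and the output width is the sum of the per-order widths.

```python
import numpy as np

def concat_high_order_layer(A_hat, H, weights):
    """Concatenate ReLU(A_hat^j H W_j) column-wise for j = 1..P.

    weights[j-1] is the separate weight matrix used with the j-th power of A_hat.
    """
    outputs = []
    propagated = H
    for W_j in weights:
        propagated = A_hat @ propagated            # now holds A_hat^j H
        outputs.append(np.maximum(propagated @ W_j, 0.0))
    return np.concatenate(outputs, axis=1)

# Hypothetical sizes: n = 4 nodes, 5 input features, 8 hidden units, P = 3 orders
rng = np.random.default_rng(0)
n, d, h, P = 4, 5, 8, 3
A_hat = np.full((n, n), 1.0 / n)                   # toy propagation matrix
H = rng.normal(size=(n, d))
weights = [rng.normal(size=(d, h)) for _ in range(P)]

H_next = concat_high_order_layer(A_hat, H, weights)
num_params = sum(W.size for W in weights)          # P * d * h
```

The output has `P * h` columns, so the next layer's weight matrices must have `P * h` rows, which is how the widths compound as layers are stacked.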
When the message passes through the graph network, the nodes will receive latent representations from their first-hop nodes and from their N-hop neighbors every time. In this section, we propose a model to nonlinearly aggregate the trainable parameters, which can choose how to mix latent messages from various hop nodes.
3.1. General Information Fusion Pooling
The information in the graph network propagates along the edges between the vertices. We assume that the graph network is an undirected graph. The general procedure of fusion pooling is as follows. Let the $k$-th order neighborhood matrix be $A^{k}$, where $k \in \{1, \dots, p\}$ represents the hop distance from the given node. The fusion pooling operator combines these neighborhood matrices into a single fused result.
Here is an example showing how to fuse the different order neighborhoods. For a given adjacency matrix $A$, let $A^{1}$ denote the first-order neighborhood and $A^{2}$ denote the second-order neighborhood. Fusion pooling combines $A^{1}$ and $A^{2}$ entry by entry into a single matrix of the same size.
In the information dissemination and fusion process, both the first-order neighborhood features and the high-order neighborhood features are fully considered. Therefore, the classification accuracy should be improved.
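A small numerical sketch of fusion pooling follows. It assumes the elementwise maximum as the fusion operator (the choice the paper adopts in Section 3.2), and the 3-node path graph is hypothetical.

```python
import numpy as np

def fusion_pool(*matrices):
    """Fuse same-shaped matrices by taking the elementwise maximum."""
    fused = matrices[0]
    for M in matrices[1:]:
        fused = np.maximum(fused, M)
    return fused

# Hypothetical 3-node path graph: 0 - 1 - 2
A1 = np.array([[0, 1, 0],
               [1, 0, 1],
               [0, 1, 0]])
A2 = A1 @ A1                # second-order neighborhood (two-hop walks)

fused = fusion_pool(A1, A2)
# A2    = [[1, 0, 1], [0, 2, 0], [1, 0, 1]]
# fused = [[1, 1, 1], [1, 2, 1], [1, 1, 1]]
```

Entries that are zero in the first-order matrix, such as the pair (0, 2), become nonzero after fusion because the two nodes are connected by a two-hop path.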
3.2. Our Proposed Model
In Figure 2, we propose the high-order graph convolutional network model to fuse the high-order messages that pass through the graph network. The model consists of an input layer, two graph convolutional layers, and an information fusion pooling layer that is connected to the graph convolutional layer. The softmax function is used for the multiclassification output.
The proposed model extends the classical GCN model  to the graph neural network of higher-order neighborhoods. Each node in the model can get its representation from its neighborhood and integrate messages. The system model is as follows:where is the order of the neighborhoods, = , is the activation function, function denotes the softmax function. Parameter is the trainable weight parameter of layer in the graph network, and function represents , which denotes the hybrid high-order and low-order of the information fusion. When parameter is equal to 0, , which is the output of the first convolutional layer of the graph propagation model. In addition, , which represents the initial input of our model.
In preliminary experiments, we found that a two-layer hybrid high- and low-order graph convolution performs better than a one-layer one, and that stacking more layers does not significantly improve the accuracy of the graph recognition task. Therefore, this paper uses two graph convolutional layers. In further experiments, we validate the neighborhood orders $p = 2$ and $p = 3$ in equation (4) for our HLHG models. In the supervised and semisupervised classification tasks, our HLHG models perform well and achieve a good balance between classification accuracy and computational complexity. We also found that higher orders do not significantly improve the classification accuracy. Therefore, we only analyze and implement our model for $p = 2$ and $p = 3$ in the following sections.
In equation (4), the model with order $p = 2$, that is, the hybrid model of the 1st and 2nd order neighborhoods, is called the HLHG-2 model. The model with order $p = 3$, that is, the hybrid model of the 1st, 2nd, and 3rd order neighborhoods, is called the HLHG-3 model.
In the HLHG-2 model, it assumes that the graph convolutional network has 2 convolutional layers and the activation function is . Then, the output Y of the HLHG-2 model can be expressed as follows:where and denotes the fusion pooling .
The same as with the HLHG-2 model, the output Y of the HLHG-3 model can be expressed as follows:where and .
For a large-scale graph network, it is unacceptable to directly calculate . Therefore, we calculate . In general, the dimension of is less than , and this procedure avoids large-scale matrix multiplication operations.
Therefore, our HLHG model has a 2-layer graph network, and the iterative expression of the 2nd order neighborhood is as follows:where . We use the elementwise maximum as our fusion pooling operator, which takes the maximum value of the corresponding elements. Algorithm 1 shows how to fuse the different order neighbors.
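The following NumPy sketch shows one possible reading of the two-layer hybrid model. It is an illustration under stated assumptions, not the authors' released implementation: it assumes that fusion pooling is the elementwise maximum over the per-order outputs, that all orders in a layer share a single weight matrix, that the hidden activation is ReLU, and that the output is a row-wise softmax. All function names and toy inputs are hypothetical.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def hybrid_layer(A_hat, H, W, p):
    """Elementwise max over A_hat^k (H W), k = 1..p, with one shared W."""
    propagated = H @ W              # shared projection, computed once
    fused = None
    for _ in range(p):
        propagated = A_hat @ propagated   # next power, right to left
        fused = propagated if fused is None else np.maximum(fused, propagated)
    return fused

def softmax(Z):
    """Numerically stable row-wise softmax."""
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def hlhg_forward(A_hat, X, W0, W1, p=2):
    """Two hybrid layers: ReLU after the first, softmax after the second."""
    H1 = np.maximum(hybrid_layer(A_hat, X, W0, p), 0.0)
    return softmax(hybrid_layer(A_hat, H1, W1, p))

# Hypothetical toy problem: 4 nodes, 5 features, 8 hidden units, 3 classes
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))
W0 = rng.normal(size=(5, 8))
W1 = rng.normal(size=(8, 3))

A_hat = normalize_adjacency(A)
Y = hlhg_forward(A_hat, X, W0, W1, p=2)   # p=2 mirrors HLHG-2, p=3 HLHG-3
```

Under these assumptions, setting `p=1` reduces the sketch to a plain two-layer GCN, which makes the role of the extra orders easy to isolate in experiments.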
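We use the multiclass cross entropy as the loss function of our HLHG model, $\mathcal{L} = -\sum_{i \in \mathcal{Y}_{L}} \sum_{c} T_{ic} \ln Y_{ic}$, where $\mathcal{Y}_{L}$ is the set of labeled samples, $T_{ic}$ is the one-hot ground-truth label of sample $i$ for class $c$, and $Y_{ic}$ is the corresponding softmax output. The trainable weights $W^{(0)}$ and $W^{(1)}$ of the graph neural network are trained using gradient descent. In each training iteration, we perform batch gradient descent.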
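A minimal sketch of a cross-entropy loss restricted to labeled nodes is given below. The function name, the averaging over labeled nodes (a sum differs only by a constant factor), and the small constant added for numerical safety are assumptions for illustration.

```python
import numpy as np

def masked_cross_entropy(Y_pred, labels, labeled_idx):
    """Mean negative log-probability of the true class over labeled nodes only.

    Y_pred:      (n, C) array of class probabilities (rows sum to 1)
    labels:      (n,) integer class labels; entries of unlabeled nodes are ignored
    labeled_idx: indices of the nodes whose labels are available for training
    """
    true_class_probs = Y_pred[labeled_idx, labels[labeled_idx]]
    return -np.mean(np.log(true_class_probs + 1e-12))

# Hypothetical example: uniform predictions over 3 classes, 2 labeled nodes
Y_pred = np.full((4, 3), 1.0 / 3.0)
labels = np.array([0, 1, 2, 0])
labeled_idx = np.array([0, 2])

loss = masked_cross_entropy(Y_pred, labels, labeled_idx)  # close to ln(3)
```

In the semisupervised setting only a handful of nodes are labeled, so the mask determines which rows of the prediction matrix contribute to the gradient.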
3.3. Computational Complexity and Parameter Quantity
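In a large-scale graph network, the adjacency matrix is $\hat{A} \in \mathbb{R}^{n \times n}$. It is difficult to directly calculate $\hat{A}^{p}$. To reduce the computational complexity, we calculate $\hat{A}^{p} X$ iteratively. For higher orders, the right-to-left iterative multiplication procedure is $\hat{A}^{p} X = \hat{A}(\hat{A}^{p-1} X)$. For example, when $p = 2$, $\hat{A}^{2} X = \hat{A}(\hat{A} X)$. When $p = 3$, $\hat{A}^{3} X = \hat{A}(\hat{A}(\hat{A} X))$.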
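The right-to-left evaluation can be checked numerically. The sketch below, with hypothetical names and random dense inputs, compares the iterative product against explicitly forming the matrix power first.

```python
import numpy as np

def propagate(A_hat, X, p):
    """Compute A_hat^p X right to left without forming A_hat^p."""
    out = X
    for _ in range(p):
        out = A_hat @ out       # each step is an (n x n) by (n x d) product
    return out

rng = np.random.default_rng(0)
n, d = 6, 2
A_hat = rng.random((n, n))
X = rng.random((n, d))

iterative = propagate(A_hat, X, 3)
explicit = np.linalg.matrix_power(A_hat, 3) @ X   # forms the n x n power first
```

Both results agree up to floating-point error. The iterative form only ever multiplies the adjacency matrix by a matrix with as many columns as the features, and with a sparse adjacency matrix each step costs time proportional to the number of nonzero entries, whereas the explicit power is generally much denser than the original matrix.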
In the proposed model, the input feature of the graph network is. The weight of the first convolutional layer is , and the weight of the second layer is . Then, the input of the first convolutional layer is where the parameter represents the dimension of the input feature. For example, denotes the number of hidden neurons in the first convolutional layer and denotes the number of hidden neurons in the second convolutional layer. In our HLHG model, the trainable weight parameters are shared in the same convolutional layer. Therefore, in the first convolutional layer, the output dimension after the convolutional operator is the same. That is, , , and , where is the order of the adjacency matrix .
In the -th convolutional layer, , where denotes the number of hidden neurons in the -th convolutional layer. It assumes that is a sparse matrix with nonzero elements. For the -th convolutional layer of our HLHG, the computational complexity is and the quantity of trainable weight is .
The total computational complexity of our HLHG model is , and the total number of trainable parameters is , where parameter denotes the total number of convolutional layers and denotes the -th convolutional layer. When , represents the feature dimensions of the datasets and represents the number of hidden neurons in the -th convolutional layer. For all the datasets, ; therefore, we only consider the first convolutional layer when we compare the computational complexity and number of parameters.
Compared to , we set fewer filters to maintain a similar computational complexity and the number of parameters is less via weight sharing for both the lower-order and higher-order convolutions.
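The effect of weight sharing on the first-layer parameter count can be illustrated with a short calculation. The function name and the example sizes are hypothetical, and bias terms are ignored.

```python
def first_layer_params(num_features, num_hidden, order, shared=True):
    """Number of first-layer weights, with or without sharing across orders.

    With sharing, all neighborhood orders reuse one (features x hidden) matrix.
    Without sharing, each of the `order` neighborhoods has its own matrix.
    """
    per_matrix = num_features * num_hidden
    return per_matrix if shared else order * per_matrix

# Hypothetical sizes: 1000 input features, 128 hidden units, 3 orders
shared_count = first_layer_params(1000, 128, 3, shared=True)     # 128000
separate_count = first_layer_params(1000, 128, 3, shared=False)  # 384000
```

With sharing, the count does not depend on the order, so adding higher-order neighborhoods increases the propagation cost but not the number of weights to learn.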
We conduct experiments in order to verify that our HLHG model can be applied to supervised learning and semisupervised learning. On the text network datasets, we compare our model with the state-of-the-art methods using supervised learning. On the citation network datasets, we compare our model with the state-of-the-art methods using semisupervised learning. For all experiments, we construct a 2-layer graph convolutional network of our model using TensorFlow. The code and data are available on GitHub.
4.1. Supervised Text Network Classification
We conduct supervised learning on five benchmark text graph datasets to compare the classification accuracy of HLHG with the graph convolutional neural network and other deep learning approaches.
In our supervised experiments, the 20-Newsgroups (20NG), Ohsumed, R52 and R8 of Reuters 21578, and Movie Review (MR) datasets are used to verify the proposed models. These datasets are publicly available on the web and are widely used as benchmark datasets. The summary statistics of the text networks are shown in Table 1.
These benchmark text datasets were processed by Yao et al. , who converted the text datasets into graph network structures. Then, they used preprocessing to construct the adjacency matrix of the graph network input and input parameters. The dataset is divided into a training dataset and a test dataset in the same way.
4.1.2. Baselines and Experimental Setting
We compare our HLHG with the following approaches: the convolutional neural network with randomly initialized word vectors (CNN-rand), the LSTM model with pretrained vectors (LSTM-pre), the predictive text embedding for text classification (PTE), the fast text classifier (fastText), the simple word embedding model with simple pooling strategies (SWEM), the label-embedding attentive model for text classification (LEAM), the graph CNN model with the Chebyshev filter (GCN-C), the graph CNN model with the spline filter (GCN-S), the graph CNN model with the Fourier filter (GCN-F), and the graph convolutional network for text classification (Text GCN). The baseline models were tested by Yao et al.
In our HLHG-2 model, we set the dropout rate to 0.2. The model is trained with the Adam optimizer, which adapts the step size during the training process. We set the L2 loss weight to 0, and we adopt early stopping. We set the learning rate to 0.02 for the R8 dataset and to 0.01 for the remaining datasets. We set different numbers of epochs for different datasets: 350 for the R52 dataset, 200 for the OH and 20NG datasets, and 60 for the R8 and MR datasets. In the HLHG-2 model, we set the number of hidden neurons in the 1st convolutional layer to 128 for all datasets.
Except for the parameters in Table 2, the other parameters are the same as in the HLHG-2 model.
For our HLHG-3, we set the number of hidden neurons in the first convolutional layer to 128 except for the MR dataset, which is set to 64. To obtain better training results, we separately set different hyperparameters such as the dropout rate, learning rate, and number of epochs for different datasets (see Table 2). In addition, the other parameters of HLHG-3 are the same as those in HLHG-2.
We construct the graph network for our HLHG-2 and HLHG-3 models, and the feature matrix and other parameters are the same as those by Yao et al. .
Table 3 presents the classification accuracies and standard deviations of our models and the benchmark on the text network data. In general, our HLHG-2 and HLHG-3 achieve high levels of performance. Specifically, they achieve the best performances on R52, OH, 20NG, and R8. Compared to the best performing approach, the proposed models yield worse accuracies on the MR dataset. In general, the HLHG-3 and HLHG-2 models perform equally well. More specifically, the 3rd order HLHG has slightly better classification accuracy than the 2nd order HLHG on most datasets. However, the performance difference is not very large. Overall, the proposed architecture with hybrid high- and low-order neighborhoods has good classification performance, which indicates that it effectively preserves the topological information of the graph, and it also obtains a high-quality representation of the nodes.
The benchmark results are copied from Yao et al. The means and standard deviations of our models are computed over 100 runs.
Table 4 shows the comparison of the network complexity and the number of parameters with the Text GCN . Our HLHG can match the Text GCN with respect to computational complexity while requiring fewer parameters than the Text GCN. As described in Section 3.3, the number of features in the dataset is much larger than the number of neurons in the hidden convolutional layer. Therefore, we only compare the computational complexity and number of parameters of the first convolutional layer in our HLHG model. In Table 4, Comp. and Params represent the computational complexity and the number of parameters in the first layer of the graph convolutional network, respectively. In the computational complexity results, the first constant denotes the number of neurons in the first convolutional layer and the second constant denotes the order of the adjacency matrix. The parameter m denotes the number of nonzero entries of the sparse regularization adjacency matrix. The parameter r denotes the feature dimension of the nodes in the graph network.
In the Text GCN , the number of hidden neurons in the first convolutional layer is 200; therefore, the complexity and params are 200. In our HLHG-2 model, 128 denotes the number of hidden neurons in the first convolutional layer and 2 represents the highest order of HLHG-2. In our HLHG-3 model, 128 and 64 denote the number of hidden neurons in the first convolutional layer and 3 represents the highest order of the corresponding model. The result in Table 4 shows that our HLHG-3 model has better computational complexity for the MR dataset. Because of the weight sharing in the different order neighborhoods, our HLHG models require fewer trainable weight parameters. Especially on the MR dataset, the number of parameters is only 1/3 of that of the Text GCN .
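The parameter ratios quoted above follow directly from the hidden-layer sizes, because the first-layer weight matrices of both models have one row per input feature and that factor cancels. A short check, with a hypothetical helper name:

```python
def first_layer_param_ratio(hidden_model, hidden_baseline):
    """Ratio of first-layer weight counts for two models on the same features.

    Each model has (features x hidden) weights, so the feature dimension cancels.
    """
    return hidden_model / hidden_baseline

# Hidden sizes stated in the text: Text GCN uses 200; HLHG uses 128 or 64 (MR)
ratio_mr = first_layer_param_ratio(64, 200)      # 0.32, roughly one third
ratio_other = first_layer_param_ratio(128, 200)  # 0.64
```

The value 0.32 for the MR dataset is consistent with the "1/3" figure given in the text.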
4.2. Semisupervised Node Classification
We conduct semisupervised learning on three benchmark citation network datasets to compare the node classification accuracy of HLHG with some classical approaches and with some graph convolutional neural network approaches. The graph semisupervised learning corresponds to the process of “label” spreading on citation networks.
In semisupervised node classification, we use the CiteSeer, Cora, and PubMed citation network datasets . In these citation datasets, the nodes represent the articles that were published in the corresponding journal. The edges between the two nodes represent references from one article to another, and the tags represent the topics of the articles. The citation link constructs an adjacency matrix. Those datasets have low label rates. The summary statistic features of the citation graph are shown in Table 5.
4.2.2. Baselines and Experimental Setting
We compare our HLHG with the same baseline methods as Abu-El-Haija et al. and Yang et al. The baselines are as follows: manifold regularization (ManiReg), semisupervised embedding (SemiEmb), label propagation (LP), skip-gram-based graph embeddings (DeepWalk), the iterative classification algorithm (ICA), Planetoid, HO, and MixHop.
For the HLHG-2 model, we use the following parameters for the citation datasets (Cora, CiteSeer, and PubMed): 16 (number of hidden units), 0.5 (dropout rate), 0.0005 (L2 regularization), 10 (early stopping), 300 (number of epochs), and 0.01 (learning rate).
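These settings can be collected in a small configuration object. The dictionary keys below are hypothetical names, and the `should_stop` helper shows one common patience-based reading of the early stopping value of 10. The text does not specify the exact criterion, so this is an assumption.

```python
# HLHG-2 settings for the citation datasets, as listed in the text
HLHG2_CITATION_CONFIG = {
    "hidden_units": 16,
    "dropout_rate": 0.5,
    "l2_regularization": 0.0005,
    "early_stopping": 10,
    "epochs": 300,
    "learning_rate": 0.01,
}

def should_stop(val_losses, patience):
    """Return True if the last `patience` epochs brought no new best loss."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before

# Hypothetical validation curve that stalls after the second epoch
stalled = [1.0, 0.9] + [0.95] * 10
stop_now = should_stop(stalled, HLHG2_CITATION_CONFIG["early_stopping"])
```

A training loop would call `should_stop` once per epoch and break out before reaching the maximum number of epochs when it returns `True`.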
For the HLHG-3 model, we set different numbers of hidden neurons for the different datasets. We set 8 hidden neurons for the CiteSeer dataset to reduce the computational complexity and the number of parameters, and set 10 hidden neurons for the Cora and PubMed datasets to capture richer features. The hyperparameters of the HLHG-3 are set as shown in Table 6.
In the semisupervised experiments, we train and test our models on those citation network datasets following the methodology that was proposed by Yang et al. . The classification accuracy is the average of 100 runs with random weight initializations.
In Table 7, the node classification accuracies above the line are copied from Abu-El-Haija et al. [14, 15] and Yang et al. The values below the line are those of our HLHG models. The ± values represent the standard deviation over 100 runs with different random initializations. These splits utilize only 20 labeled nodes per class during training. We achieve the best test accuracies of 82.7% and 71.5% on the Cora and CiteSeer datasets, respectively. Other high-order graph convolutional neural networks [14, 15] obtain the high-order information on the same datasets through linear combinations of features from farther distances, whereas our HLHG model combines the high-order neighborhood information nonlinearly.
In Table 8, we compare the network complexity and the number of parameters with the other high-order graph convolutional networks and the classic GCN. The result shows that our model has the same computational complexity as other approaches. With respect to the number of parameters, our HLHG-3 model has fewer parameters than the GCN. The reason is that our model shares the weights in the same layer among the different order neighborhood matrices.
In this paper, we propose a hybrid lower-order and higher-order GCN model for the supervised classification of text network datasets and for semisupervised classification in citation networks. In our model, we propose a novel nonlinear information fusion layer to combine the low- and higher-order neighborhoods. To reduce the number of parameters, we propose sharing the weights in the same convolutional layer across the different order neighborhoods. Experiments on the two types of network datasets suggest that HLHG is able to fuse higher-order neighborhoods for supervised and semisupervised classification. Our model outperforms the benchmarks on most of the datasets. We also find that the computational complexity and the number of parameters are lower than those of the high-order method. To obtain more neighborhood information, we could use more higher-order adjacency matrices. However, the direct use of higher orders may lead to oversmoothing problems. Therefore, in future research work, we will extend our HLHG models to fuse graph attention networks to develop a deeper graph convolutional network.
The Supervised Text Network Classification data used to support the findings of this study have been deposited in the repository DOI:10.1609/aaai.v33i01.33017370. The Semisupervised Node Classification data used to support the findings of this study have been deposited in the repository DOI:10.1609/aimag.v29i3.2157
The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; nor in the decision to publish the results.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This work was supported in part by the National Natural Science Foundation of China under Grants U1701266, 61571141, 61702120, and 61672008; Guangdong Province Key Laboratory of Intellectual Property and Big Data under Grant 2018B030322016; Scientific and Technological Projects of Guangdong Province under Grant 2019A070701013; and Qingyuan Science and Technology Plan Project under Grants 170809111721249 and 170802171710591.
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proceedings of the International Conference on Neural Information Processing Systems, pp. 1097–1105, Lake Tahoe, NV, USA, December 2012.
F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein, “Geometric deep learning on graphs and manifolds using mixture model CNNs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5115–5124, Honolulu, HI, USA, July 2017.
J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in Proceedings of the 34th International Conference on Machine Learning, pp. 1263–1272, Sydney, Australia, August 2017.
S. Abu-El-Haija, N. Alipourfard, H. Harutyunyan, A. Kapoor, and B. Perozzi, “A higher-order graph convolutional layer,” in Proceedings of the 32nd Conference on Neural Information Processing Systems (NIPS 2018), NIPS, Montreal, Canada, December 2018.
J. Tang, M. Qu, and Q. Mei, “PTE: predictive text embedding through large-scale heterogeneous text networks,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1165–1174, Sydney, Australia, August 2015.
Z. Yang, W. W. Cohen, and R. R. Salakhutdinov, “Revisiting semi-supervised learning with graph embeddings,” in Proceedings of the ICML, New York, NY, USA, June 2016.
M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: a geometric framework for learning from labeled and unlabeled examples,” Journal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006.
J. Weston, F. D. R. Ratle, H. Mobahi, and R. Collobert, “Deep learning via semi-supervised embedding,” in Neural Networks: Tricks of the Trade, Springer, Berlin, Germany, 2012.
X. Zhu, Z. Ghahramani, and J. D. Lafferty, “Semi-supervised learning using Gaussian fields and harmonic functions,” in Proceedings of the ICML, Washington, DC, USA, August 2003.
B. Perozzi, R. Al-Rfou, and S. Skiena, “DeepWalk: online learning of social representations,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’14, New York, NY, USA, August 2014.
Q. Lu and L. Getoor, “Link-based classification,” in Proceedings of the ICML, Washington, DC, USA, August 2003.