Abstract
With the higher-order neighborhood information of a graph network, the accuracy of graph representation learning classification can be significantly improved. However, current higher-order graph convolutional networks have a large number of parameters and high computational complexity. Therefore, we propose a hybrid lower-order and higher-order graph convolutional network (HLHG) learning model, which uses a weight sharing mechanism to reduce the number of network parameters. To reduce the computational complexity, we propose a novel information fusion pooling layer to combine the high-order and low-order neighborhood matrix information. We theoretically compare the computational complexity and the number of parameters of the proposed model with those of other state-of-the-art models. Experimentally, we verify the proposed model on large-scale text network datasets using supervised learning and on citation network datasets using semisupervised learning. The experimental results show that the proposed model achieves higher classification accuracy with a small set of trainable weight parameters.
1. Introduction
Convolutional neural networks (CNNs) have achieved great success on grid-structured data such as images and videos [1, 2]. This success is attributed to the series of filters in the convolutional layers of CNNs, which can obtain local invariant features. In contrast to a regular grid, the number of neighbors of a node in a graph network may differ from node to node. Therefore, it is difficult to directly implement the filter operator in an irregular network structure [3].
In the graph network, the nodes and the connecting edges between them contain abundant network characteristic information. A graph convolutional network (GCN) aggregates the neighborhood nodes to realize continuous information transmission based on a graph network. By making full use of this information, a GCN can effectively achieve tasks such as classification, prediction, and recommendation.
A GCN generalizes traditional CNNs to the graph domain. GCN methods are mainly divided into two categories [3]: frequency domain-based methods [4–6] and spatial domain-based methods [7, 8].
In the spatial domain, to simulate the convolution operation of the traditional CNN on an image, the convolution operation aggregates the information of the neighborhood nodes [7–10]. Henaff et al. [11] proposed a smoothed parametric spectral filter to realize localization and to keep the number of filter parameters independent of the input dimension. One of the key challenges is that the number of neighborhood nodes in the network varies irregularly.
In the frequency domain, Bruna et al. [5] were the first to extend CNN-type architectures to graphs. Cao et al. [12] applied a generalized convolutional network to the graph frequency domain using the Fourier transform. In this method, eigenvalue decomposition is performed on the neighborhood matrix. To reduce the computational complexity, Defferrard et al. [13] proposed using Chebyshev polynomials of the eigenvalues of the graph Laplacian to achieve efficient and localized graph convolutional filters. Kipf and Welling [6] proposed a classical GCN, which is approximated by a first-order Chebyshev polynomial. This approach reduces the computational complexity but introduces truncation errors. As a result, the model cannot capture high-level interaction information between the nodes in the graph, which limits its capabilities. The information propagation process in the graph is related not only to the first-order neighborhood of a node but also to its higher-order neighborhoods.
Abu-El-Haija et al. [14, 15] proposed a high-order convolutional layer on a graph that uses a linear combination of the high-order neighborhood basis of the GCN [6]. Tiao et al. [16] proposed a Bayesian estimation approach via stochastic variational inference on the adjacency matrix of the graph. Levie et al. [17] proposed Cayley polynomials to compute localized regular filters for the frequency bands of interest in graphs. Therefore, the rational use of second-order, third-order, and other high-order neighborhood information will benefit classification prediction accuracy [14–16, 18–20].
Based on the classical GCN [6], to make full use of the high-order and low-order neighborhood information, we propose a novel hybrid low-order and higher-order graph convolutional network (HLHG). As shown in Figure 1, the graph convolutional layer of our model is simple and effective at capturing the high-order neighborhood information, nonlinearly combining the different order neighborhood information. The contributions are summarized as follows:
(1) We propose a new fusion pooling layer to achieve high-order neighborhood fusion with the low-order neighborhood of graph networks
(2) We propose a low-order neighborhood and high-order neighborhood weight sharing mechanism to reduce the computational complexity and number of parameters of the model
(3) The experimental results show that our HLHG achieves state-of-the-art performance in both the text network classification with supervised learning and the citation network with semisupervised learning
The rest of the paper is organized as follows. In Section 2, the related theoretical basis, such as the graph convolution and the high-order graph convolution, is introduced. In Section 3, the general information fusion pooling for the high-order neighborhood is presented, followed by the proposed model and its variant; the computational complexity and parameter quantity of the proposed model are also theoretically analyzed. In Section 4, the proposed model is verified experimentally and the corresponding analysis is presented. Finally, Section 5 concludes the paper.
2. Related Theoretical Background
In this section, the related theoretical basis will be introduced, including the graph convolutional network (GCN).
2.1. Graph
Given a graph $G$ with node set $V$ and edge set $E$, the graph is represented as $G = (V, E)$. If nodes $v_i$ and $v_j$ are connected, then $A_{ij} = 1$; otherwise, $A_{ij} = 0$, where $A$ is the adjacency matrix. The information in the graph propagates along the edges $e_{ij} \in E$. This also applies when node self-loops are considered, which means that $A_{ii} = 1$. Assuming that the information propagated by each node $v_i$ is $x_i \in \mathbb{R}^{r}$, the information matrix of the graph is $X \in \mathbb{R}^{n \times r}$, where $n$ is the total number of nodes in the graph network and $r$ is the dimension of the information feature. If the graph network $G$ with self-loops is represented as $\tilde{G}$, then its adjacency matrix is represented as $\tilde{A} = A + I_n$. The degree matrix of $\tilde{A}$ is the diagonal matrix $\tilde{D}$ with $\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}$.
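The definitions above can be sketched in a few lines of NumPy. This is only an illustration under our own assumptions: the 4-node path graph and the helper names `add_self_loops` and `degree_matrix` are hypothetical and do not come from the paper or its released code.

```python
import numpy as np

def add_self_loops(A):
    """Return the self-loop adjacency matrix A + I for a square matrix A."""
    return A + np.eye(A.shape[0], dtype=A.dtype)

def degree_matrix(A_tilde):
    """Diagonal degree matrix whose i-th entry is the i-th row sum of A_tilde."""
    return np.diag(A_tilde.sum(axis=1))

# Hypothetical undirected path graph 0-1-2-3 (illustration only).
edges = [(0, 1), (1, 2), (2, 3)]
n = 4
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0  # undirected graph: symmetric entries

A_tilde = add_self_loops(A)
D_tilde = degree_matrix(A_tilde)
print(np.diag(D_tilde))  # node degrees, each including the self-loop
```

For large graphs a sparse matrix format would normally replace the dense arrays used here.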
2.2. Graph Convolutional Network
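In the given graph $G$, there are two signals $x$ and $y$. Their graph Fourier transforms are defined as $\hat{x} = U^{\top} x$ and $\hat{y} = U^{\top} y$, where $U$ is the matrix of orthonormal eigenvectors of the graph Laplacian $L$ of graph $G$. As in Euclidean space, the spectral graph convolution of $x$ and $y$ is given by an element-wise product in the Fourier domain as follows:
$$x *_{G} y = U\big((U^{\top} x) \odot (U^{\top} y)\big) = U \operatorname{diag}(\hat{y})\, U^{\top} x, \quad (1)$$
where $\operatorname{diag}(\hat{y})$ represents the diagonal matrix of $\hat{y}$.
Defferrard et al. [13] utilized $k$th-order polynomial filters based on Chebyshev polynomials to represent the graph convolutional operation of the Laplacian, $g_{\theta}(\Lambda) \approx \sum_{j=0}^{k} \theta_j T_j(\tilde{\Lambda})$, where $\theta_j$ denotes the coefficients, $\Lambda$ represents the eigenvalues of the Laplacian, and $\tilde{\Lambda}$ is $\Lambda$ rescaled to the interval $[-1, 1]$.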
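As a small numerical illustration of the spectral convolution, the sketch below diagonalizes a normalized Laplacian with `numpy.linalg.eigh` and multiplies the two signals element-wise in the Fourier domain. The 4-node cycle graph, the signals, and the function names are assumptions made for this example only.

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for an undirected graph without isolated nodes."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return np.eye(A.shape[0]) - A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def spectral_conv(U, x, y):
    """Graph convolution: transform both signals, multiply element-wise, transform back."""
    return U @ ((U.T @ x) * (U.T @ y))

# Hypothetical 4-node cycle graph (illustration only).
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

# eigh returns ascending eigenvalues and orthonormal eigenvectors (columns of U).
eigvals, U = np.linalg.eigh(normalized_laplacian(A))

x = np.array([1.0, 0.0, 0.0, 0.0])
y = np.array([0.5, 0.25, 0.0, 0.25])
z = spectral_conv(U, x, y)
```

The full eigendecomposition is what makes this formulation expensive on large graphs, which motivates the polynomial approximations discussed next.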
Kipf and Welling [6] proposed the classical graph convolutional neural network model based on the Fourier transform, $g_{\theta} * x = U g_{\theta} U^{\top} x$. The GCN model approximates this filter using a first-order Chebyshev polynomial. The propagation model in the graph network is as follows:
$$H^{(l+1)} = \sigma\big(\hat{A} H^{(l)} W^{(l)}\big), \quad (2)$$
where $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ denotes the information propagation matrix; $W^{(l)}$ represents the trainable weight of layer $l$; when $l = 0$, $H^{(0)} = X$, which represents the initial input value of the GCN; and $\sigma(\cdot)$ denotes the activation function. To reduce the computational complexity, the convolution operator in the graph is defined by a simple neighborhood average. However, the convolutional filters are too simple to capture the high-level interaction information between the nodes in the graph. Therefore, the classification accuracy on citation network datasets is relatively low.
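A single propagation step of this kind can be sketched as follows. The sketch assumes the usual symmetric normalization with self-loops and a ReLU activation; the toy graph, the random features, and the function names are illustrative choices of ours, not the reference implementation.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} with self-loops added."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))  # self-loops guarantee degree > 0
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W):
    """One first-order propagation step: ReLU(A_hat @ H @ W)."""
    return np.maximum(A_hat @ H @ W, 0.0)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # hypothetical 3-node path graph
X = rng.normal(size=(3, 5))              # 3 nodes, 5 input features
W0 = rng.normal(size=(5, 2))             # 2 hidden units

A_hat = normalize_adjacency(A)
H1 = gcn_layer(A_hat, X, W0)
print(H1.shape)  # (3, 2)
```

Each output row mixes only a node and its direct neighbors, which is the first-order limitation the text refers to.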
Abu-El-Haija et al. [14, 15] proposed a high-order graph convolutional layer model based on the GCN for semisupervised node classification. The propagation model of the high-order graph convolution is shown in formula (3). In this model, the transfer function of the $(l+1)$th layer is a column concatenation from the first order to the $k$th order in the $l$th layer, which is a linear combination of the high-order neighborhoods. In the propagation model, the different order neighborhoods of the same layer use different weight parameters:
$$H^{(l+1)} = \Big\Vert_{j=1}^{k} \sigma\big(\hat{A}^{j} H^{(l)} W_{j}^{(l)}\big), \quad (3)$$
where $\hat{A}^{j}$ denotes the $j$th power of $\hat{A}$ and $\Vert$ denotes column-wise concatenation. However, as the network deepens, the dimensions of $H^{(l)}$ increase and propagate between layers. Therefore, the number of trainable weight parameters grows, and more training resources are required to learn the weights.
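The growth in width and parameters caused by per-order weights can be seen in a short sketch. It follows the description above (orders 1 to k, one weight matrix per order, column concatenation) with a ReLU activation assumed; the placeholder propagation matrix and all names are hypothetical and are not taken from the cited implementations.

```python
import numpy as np

def concat_high_order_layer(A_hat, H, weights):
    """Column-concatenate ReLU(A_hat^j @ H @ W_j) for j = 1..k, one W_j per order."""
    outputs = []
    AjH = H
    for W_j in weights:                      # weights[j-1] belongs to order j
        AjH = A_hat @ AjH                    # A_hat^j @ H, built iteratively
        outputs.append(np.maximum(AjH @ W_j, 0.0))
    return np.concatenate(outputs, axis=1)

rng = np.random.default_rng(0)
n, r_in, r_out, k = 4, 6, 3, 2
A_hat = np.full((n, n), 1.0 / n)             # placeholder propagation matrix
H = rng.normal(size=(n, r_in))
weights = [rng.normal(size=(r_in, r_out)) for _ in range(k)]

H_next = concat_high_order_layer(A_hat, H, weights)
n_params = sum(W.size for W in weights)
print(H_next.shape, n_params)                # (4, 6) 36
```

The output width is k times the per-order width and the parameter count scales with k, which is the cost that weight sharing is meant to avoid.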
3. Method
When a message passes through the graph network, each node receives latent representations from its first-hop neighbors and from its N-hop neighbors. In this section, we propose a model that nonlinearly aggregates these representations with trainable parameters, so that it can choose how to mix latent messages from nodes at various hops.
3.1. General Information Fusion Pooling
The information in the graph network propagates along the edges between the vertices. We assume that the graph network is an undirected graph. The general procedure of fusion pooling is as follows. Assume that the $j$th-order neighborhood matrix is $A^{j}$ and that the result of the fusion pooling operator is $\operatorname{FP}(A^{1}, A^{2}, \ldots, A^{k})$, where $j \in \{1, 2, \ldots, k\}$ represents the hop from the given node.
Here is an example to show how to fuse the different order neighborhoods. For a given adjacency matrix $A$, assume that $A^{1} = A$ denotes the first-order neighborhood and $A^{2}$ denotes the second-order neighborhood.
If $A^{1} = [a_{ij}]$ and $A^{2} = [b_{ij}]$, then $\operatorname{FP}(A^{1}, A^{2}) = [\max(a_{ij}, b_{ij})]$.
In the information dissemination and fusion process, both the firstorder neighborhood features and the highorder neighborhood features are fully considered. Therefore, the classification accuracy should be improved.
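A concrete instance of the fusion step may help. The sketch below assumes that fusion pooling is the element-wise maximum (the operator named later in Section 3.2) and applies it to the first- and second-order neighborhood matrices of a hypothetical 3-node path graph. The matrices and the name `fusion_pool` are our own illustration, not the example from the original figure.

```python
import numpy as np

def fusion_pool(*matrices):
    """Element-wise maximum over matrices of identical shape."""
    return np.maximum.reduce(np.stack(matrices))

# Hypothetical path graph 0-1-2 (illustration only).
A1 = np.array([[0, 1, 0],
               [1, 0, 1],
               [0, 1, 0]])
A2 = A1 @ A1            # entry (i, j) counts the 2-step walks from i to j

fused = fusion_pool(A1, A2)
print(A2)
print(fused)
```

In the fused matrix, nodes 0 and 2 become connected through their common neighbor, while the direct first-order links are preserved.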
3.2. Our Proposed Model
Figure 2 shows the proposed high-order graph convolutional network model, which fuses the high-order messages that pass through the graph network. The model consists of an input layer, two graph convolutional layers, and an information fusion pooling layer that is connected to the graph convolutional layer. The softmax function is used for the multiclass classification output.
The proposed model extends the classical GCN model [6] to a graph neural network of higher-order neighborhoods. Each node in the model can get its representation from its neighborhood and integrate messages. The system model is as follows:
$$H^{(l+1)} = \sigma\Big(\operatorname{FP}\big(\hat{A} H^{(l)} W^{(l)}, \hat{A}^{2} H^{(l)} W^{(l)}, \ldots, \hat{A}^{k} H^{(l)} W^{(l)}\big)\Big), \quad (4)$$
where $k$ is the order of the neighborhoods, $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, and $\sigma(\cdot)$ is the activation function; in the output layer, $\sigma(\cdot)$ is replaced by the softmax function $\operatorname{softmax}(\cdot)$. Parameter $W^{(l)}$ is the trainable weight parameter of layer $l$ in the graph network, and function $\operatorname{FP}(\cdot)$ represents the fusion pooling operator, which denotes the hybrid high-order and low-order information fusion. When parameter $l$ is equal to 0, $H^{(1)} = \sigma\big(\operatorname{FP}(\hat{A} X W^{(0)}, \ldots, \hat{A}^{k} X W^{(0)})\big)$, which is the output of the first convolutional layer of the graph propagation model. In addition, $H^{(0)} = X$, which represents the initial input of our model.
In a preliminary experiment, we found that a two-layer hybrid high- and low-order graph convolution performs better than a one-layer one, and that stacking more layers does not significantly improve the accuracy of the graph recognition task. Therefore, this paper uses two graph convolutional layers. In further experiments, we validate $k = 2$ and $k = 3$ in equation (4) for our HLHG models. In the supervised and semisupervised classification tasks, our HLHG models perform well and achieve a good balance between classification accuracy and computational complexity. We also observe that for the larger values of $k$ that we tested, the classification accuracy is not significantly improved. Therefore, we only analyze and implement our model for $k = 2$ and $k = 3$ in the following sections.
In equation (4), the model with $k = 2$, that is, the hybrid model of the 1st- and 2nd-order neighborhoods, is called the HLHG-2 model. The model with $k = 3$, that is, the hybrid model of the 1st-, 2nd-, and 3rd-order neighborhoods, is called the HLHG-3 model.
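In the HLHG-2 model, we assume that the graph convolutional network has 2 convolutional layers and that the activation function is $\sigma$. Then, the output $Y$ of the HLHG-2 model can be expressed as follows:
$$Y = \operatorname{softmax}\Big(\operatorname{FP}\big(\hat{A} H^{(1)} W^{(1)}, \hat{A}^{2} H^{(1)} W^{(1)}\big)\Big), \quad (5)$$
where $H^{(1)} = \sigma\big(\operatorname{FP}(\hat{A} X W^{(0)}, \hat{A}^{2} X W^{(0)})\big)$ and $\operatorname{FP}(\cdot)$ denotes the fusion pooling operator.
As with the HLHG-2 model, the output $Y$ of the HLHG-3 model can be expressed as follows:
$$Y = \operatorname{softmax}\Big(\operatorname{FP}\big(\hat{A} H^{(1)} W^{(1)}, \hat{A}^{2} H^{(1)} W^{(1)}, \hat{A}^{3} H^{(1)} W^{(1)}\big)\Big), \quad (6)$$
where $H^{(1)} = \sigma\big(\operatorname{FP}(\hat{A} X W^{(0)}, \hat{A}^{2} X W^{(0)}, \hat{A}^{3} X W^{(0)})\big)$ and $\operatorname{FP}(\cdot)$ denotes the fusion pooling operator.
For a large-scale graph network, it is unacceptable to directly calculate $\hat{A}^{2}$. Therefore, we calculate $\hat{A}(\hat{A} X W^{(0)})$ from right to left. In general, the number of columns of $X W^{(0)}$ is much smaller than $n$, so this procedure avoids large-scale matrix multiplication operations.
Therefore, our HLHG model has a 2-layer graph network, and the iterative expression of the 2nd-order neighborhood is as follows:
$$Y = \operatorname{softmax}\Big(\operatorname{FP}\big(\hat{A}(H^{(1)} W^{(1)}), \hat{A}(\hat{A}(H^{(1)} W^{(1)}))\big)\Big), \quad (7)$$
where $H^{(1)} = \sigma\big(\operatorname{FP}(\hat{A}(X W^{(0)}), \hat{A}(\hat{A}(X W^{(0)})))\big)$. We use $\max(\cdot)$ as our fusion pooling operator $\operatorname{FP}(\cdot)$, which takes the maximum value of the corresponding elements. Algorithm 1 shows how to fuse the different order neighbors.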
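The two-layer forward pass described above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions rather than the released TensorFlow code: ReLU is assumed as the hidden activation, the element-wise maximum is used for fusion, one weight matrix is shared across orders within each layer, and the toy graph, feature sizes, and function names are hypothetical.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} with self-loops added."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def fused_propagation(A_hat, H, W, k):
    """Element-wise max over A_hat^j @ H @ W for j = 1..k, with W shared by all orders."""
    P = H @ W                      # project once with the shared weight
    fused = None
    for _ in range(k):
        P = A_hat @ P              # right-to-left product, A_hat^j is never formed
        fused = P if fused is None else np.maximum(fused, P)
    return fused

def softmax(Z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def hlhg_forward(A_hat, X, W0, W1, k=2):
    """Two fused layers: ReLU after the first (assumed), softmax after the second."""
    H1 = np.maximum(fused_propagation(A_hat, X, W0, k), 0.0)
    return softmax(fused_propagation(A_hat, H1, W1, k))

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # hypothetical 4-node path graph
X = rng.normal(size=(4, 6))                 # 6 input features
W0 = 0.1 * rng.normal(size=(6, 5))          # 5 hidden units
W1 = 0.1 * rng.normal(size=(5, 3))          # 3 classes

A_hat = normalize_adjacency(A)
Y = hlhg_forward(A_hat, X, W0, W1, k=2)
print(Y.shape)  # (4, 3)
```

With `k=1` the sketch reduces to a plain two-layer GCN, which is a convenient sanity check; `k=3` gives the third-order variant without adding any parameters.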

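We use the multiclass cross entropy as the loss function of our HLHG model, $\mathcal{L} = -\sum_{i \in \mathcal{Y}_L} \sum_{c} T_{ic} \ln Y_{ic}$, where $\mathcal{Y}_L$ is the set of labeled samples and $T$ is the one-hot label matrix. The graph neural network trainable weights $W^{(0)}$ and $W^{(1)}$ are trained using gradient descent. In each training iteration, we perform batch gradient descent.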
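Restricting the loss to labeled nodes can be sketched as below. The summed form, the clipping constant, and the toy probabilities are assumptions for illustration; the function name `masked_cross_entropy` is hypothetical.

```python
import numpy as np

def masked_cross_entropy(Y_prob, labels, labeled_idx):
    """Sum of negative log-probabilities of the true class over labeled nodes only."""
    p_true = Y_prob[labeled_idx, labels[labeled_idx]]
    return -np.sum(np.log(np.clip(p_true, 1e-12, 1.0)))  # clip guards against log(0)

# Toy predictions for 3 nodes and 2 classes; only nodes 1 and 2 are labeled.
Y_prob = np.array([[0.5, 0.5],
                   [0.9, 0.1],
                   [0.2, 0.8]])
labels = np.array([0, 0, 1])
labeled_idx = np.array([1, 2])

loss = masked_cross_entropy(Y_prob, labels, labeled_idx)
print(round(loss, 4))  # 0.3285
```

Unlabeled nodes still influence the result through the propagation step, but they contribute no term to the loss.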
3.3. Computational Complexity and Parameter Quantity
In a large-scale graph network, the adjacency matrix is $\hat{A} \in \mathbb{R}^{n \times n}$. It is difficult to directly calculate $\hat{A}^{k}$. To reduce the computational complexity, we iteratively calculate $\hat{A}(\hat{A} X W)$. For higher orders, the right-to-left iterative multiplication procedure is $\hat{A}^{k} X W = \hat{A}(\hat{A}(\cdots(\hat{A}(X W))))$. For example, when $k = 2$, $\hat{A}^{2} X W = \hat{A}(\hat{A}(X W))$. When $k = 3$, $\hat{A}^{3} X W = \hat{A}(\hat{A}(\hat{A}(X W)))$.
In the proposed model, the input feature of the graph network is $X \in \mathbb{R}^{n \times r_0}$. The weight of the first convolutional layer is $W^{(0)} \in \mathbb{R}^{r_0 \times r_1}$, and the weight of the second layer is $W^{(1)} \in \mathbb{R}^{r_1 \times r_2}$. The input of the first convolutional layer is $X$, where the parameter $r_0$ represents the dimension of the input feature; $r_1$ denotes the number of hidden neurons in the first convolutional layer and $r_2$ denotes the number of hidden neurons in the second convolutional layer. In our HLHG model, the trainable weight parameters are shared in the same convolutional layer. Therefore, in the first convolutional layer, the output dimension after the convolutional operator is the same for every order. That is, $\hat{A} X W^{(0)} \in \mathbb{R}^{n \times r_1}$, $\hat{A}^{2} X W^{(0)} \in \mathbb{R}^{n \times r_1}$, and $\hat{A}^{k} X W^{(0)} \in \mathbb{R}^{n \times r_1}$, where $k$ is the order of the adjacency matrix $\hat{A}$.
In the $l$th convolutional layer, $W^{(l)} \in \mathbb{R}^{r_l \times r_{l+1}}$, where $r_{l+1}$ denotes the number of hidden neurons in the $l$th convolutional layer. Assume that $\hat{A}$ is a sparse matrix with $m$ nonzero elements. For the $l$th convolutional layer of our HLHG, the computational complexity is $O(k \times m \times r_l \times r_{l+1})$ and the number of trainable weights is $O(r_l \times r_{l+1})$.
The total computational complexity of our HLHG model is $O\big(k \times m \times \sum_{l=0}^{L-1} r_l \times r_{l+1}\big)$, and the total number of trainable parameters is $O\big(\sum_{l=0}^{L-1} r_l \times r_{l+1}\big)$, where parameter $L$ denotes the total number of convolutional layers and $l$ denotes the $l$th convolutional layer. When $l = 0$, $r_0$ represents the feature dimension of the dataset and $r_{l+1}$ represents the number of hidden neurons in the $l$th convolutional layer. For all the datasets, $r_0 \gg r_l$ for $l \geq 1$; therefore, we only consider the first convolutional layer when we compare the computational complexity and number of parameters.
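The right-to-left ordering can be checked numerically. The sketch below compares it with explicitly forming the matrix power on a small random dense matrix and gives rough dense multiply-add counts for both orderings. The sizes, the random matrices, and the cost formulas are illustrative assumptions; with a sparse propagation matrix the iterative cost would scale with the number of nonzeros rather than with n squared.

```python
import numpy as np

def propagate_right_to_left(A_hat, P, k):
    """Compute A_hat^k @ P as k successive products, never forming A_hat^k."""
    for _ in range(k):
        P = A_hat @ P
    return P

rng = np.random.default_rng(0)
n, r1, k = 50, 4, 3
A_hat = rng.random((n, n)) / n        # stand-in for a propagation matrix
P = rng.normal(size=(n, r1))          # stand-in for the projected features X @ W

explicit = np.linalg.matrix_power(A_hat, k) @ P
iterative = propagate_right_to_left(A_hat, P, k)
print(np.allclose(explicit, iterative))  # True

# Rough dense multiply-add counts for the two orderings.
cost_explicit = (k - 1) * n**3 + n * n * r1   # build A_hat^k, then apply it once
cost_iterative = k * n * n * r1               # k products of (n x n) by (n x r1)
print(cost_explicit, cost_iterative)          # 260000 30000
```

The gap widens quickly as n grows, because the explicit route pays a cubic cost in n while the iterative route stays linear in the feature width.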
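A quick calculation shows why the first layer dominates when the input dimension is much larger than the hidden width. The dimensions below are hypothetical and are chosen only to mimic a bag-of-words style input; the helper names are ours, and `layer_cost` simply evaluates the product of order, nonzeros, input width, and output width used in the per-layer complexity.

```python
def layer_params(r_in, r_out):
    """Trainable weights of one layer when a single matrix is shared by all orders."""
    return r_in * r_out

def layer_cost(k, m, r_in, r_out):
    """Per-layer cost term: order x nonzeros x input width x output width."""
    return k * m * r_in * r_out

# Hypothetical sizes: 10,000 input features, 128 hidden units, 8 classes.
r0, r1, r2 = 10_000, 128, 8
first = layer_params(r0, r1)
second = layer_params(r1, r2)
share = first / (first + second)
print(first, second, round(share, 4))  # 1280000 1024 0.9992
```

With sizes of this kind, more than 99.9% of the weights sit in the first layer, so comparing only that layer loses little.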
Compared to [14], we set fewer filters to maintain a similar computational complexity, and the number of parameters is smaller because the weights are shared between the lower-order and higher-order convolutions.
4. Experiments
We conduct experiments to verify that our HLHG model can be applied to supervised learning and semisupervised learning. On the text network datasets, we compare our model with the state-of-the-art methods using supervised learning. On the citation network datasets, we compare our model with the state-of-the-art methods using semisupervised learning. For all experiments, we construct a 2-layer graph convolutional network of our model using TensorFlow. The code and data are available on GitHub.
4.1. Supervised Text Network Classification
We conduct supervised learning on five benchmark text graph datasets to compare the classification accuracy of HLHG with the graph convolutional neural network and other deep learning approaches.
4.1.1. Datasets
In our supervised experiments, the 20Newsgroups (20NG), Ohsumed, R52 and R8 of Reuters 21578, and Movie Review (MR) datasets are used to verify the proposed models. These datasets are publicly available on the web and are widely used as benchmark datasets. The summary statistics of the text networks are shown in Table 1.
These benchmark text datasets were processed by Yao et al. [21], who converted the text datasets into graph network structures. They then used preprocessing to construct the adjacency matrix of the graph network input and the input parameters. Each dataset is divided into a training set and a test set in the same way as in [21].
4.1.2. Baselines and Experimental Setting
We compare our HLHG with the following approaches: the convolutional neural network with pretrained vectors (CNN-rand) [22], the LSTM model with pretrained vectors (LSTM-pre) [23], the predictive text embedding for text classification (PTE) [24], the fast text classifier (fastText) [25], the simple word embedding model with simple pooling strategies (SWEM) [26], the label-embedding attentive model for text classification (LEAM) [27], the graph CNN model with the Chebyshev filter (GCN-C) [13], the graph CNN model with the spline filter (GCN-S) [5], the graph CNN model with the Fourier filter (GCN-F) [11], and the graph convolutional network for text classification (Text GCN) [21]. The baseline models were tested by Yao et al. [21].
In our HLHG-2 model, we set the dropout rate to 0.2. The step size is adapted by the Adam optimizer [28] during training. We set the L2 loss weight to 0 and adopt early stopping. We set the learning rate to 0.02 for the R8 dataset and to 0.01 for the remaining datasets. We set different numbers of epochs for different datasets: 350 for R52, 200 for Ohsumed (OH) and 20NG, and 60 for R8 and MR. In the HLHG-2 model, we set the number of hidden neurons in the first convolutional layer to 128 for all datasets.
Except for the parameters in Table 2, the other parameters are the same as in the HLHG-2 model.
For our HLHG-3, we set the number of hidden neurons in the first convolutional layer to 128, except for the MR dataset, for which it is set to 64. To obtain better training results, we separately set hyperparameters such as the dropout rate, learning rate, and number of epochs for the different datasets (see Table 2). All other parameters of HLHG-3 are the same as those of HLHG-2.
We construct the graph network for our HLHG-2 and HLHG-3 models, and the feature matrix and other parameters are the same as those used by Yao et al. [21].
4.1.3. Results
We show supervised text classification accuracies for the five datasets in Table 3. We demonstrate how our model performs on common splits that were taken from Yao et al.’s study [21].
Table 3 presents the classification accuracies and standard deviations of our models and the benchmarks on the text network data. In general, our HLHG-2 and HLHG-3 achieve high levels of performance. Specifically, they achieve the best performance on R52, OH, 20NG, and R8. On the MR dataset, the proposed models yield lower accuracies than the best performing approach. The HLHG-3 and HLHG-2 models perform comparably: the 3rd-order HLHG has slightly better classification accuracy than the 2nd-order HLHG on most datasets, but the difference is not large. Overall, the proposed architecture with hybrid high- and low-order neighborhoods has good classification performance, which indicates that it effectively preserves the topological information of the graph and obtains a high-quality representation of the nodes.
The benchmark test results are copied from [8]. The mean and standard deviation of our model are computed over 100 runs.
Table 4 compares the network complexity and the number of parameters with those of the Text GCN [21]. Our HLHG can match the Text GCN with respect to computational complexity while requiring fewer parameters. As described in Section 3.3, the number of features in the dataset is much larger than the number of neurons in the hidden convolutional layer. Therefore, we only compare the computational complexity and number of parameters of the first convolutional layer in our HLHG model. In Table 4, Comp. and Params represent the computational complexity and the number of parameters in the first layer of the graph convolutional network, respectively. In the computational complexity results, the first constant denotes the number of neurons in the first convolutional layer and the second constant denotes the order of the adjacency matrix. The parameter $m$ denotes the number of nonzero entries of the sparse normalized adjacency matrix. The parameter $r$ denotes the feature dimension of the nodes in the graph network.
In the Text GCN [21], the number of hidden neurons in the first convolutional layer is 200; therefore, the constant in both the complexity and the parameter count is 200. In our HLHG-2 model, 128 denotes the number of hidden neurons in the first convolutional layer and 2 represents the highest order of HLHG-2. In our HLHG-3 model, 128 and 64 denote the number of hidden neurons in the first convolutional layer and 3 represents the highest order of the corresponding model. The results in Table 4 show that our HLHG-3 model has lower computational complexity on the MR dataset. Because of the weight sharing across the different order neighborhoods, our HLHG models require fewer trainable weight parameters. On the MR dataset in particular, the number of parameters is only about 1/3 of that of the Text GCN [21].
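The constants quoted in this paragraph can be tabulated directly. The sketch below uses only the hidden-layer widths and orders stated in the prose (it does not reproduce Table 4 itself) and treats the leading constant of the first-layer complexity as width times order and the leading constant of the parameter count as the width, both multiplying the common factors m and r.

```python
# First-layer settings as stated in the text: (hidden neurons, highest order).
settings = {
    "Text GCN": (200, 1),
    "HLHG-2": (128, 2),
    "HLHG-3": (128, 3),
    "HLHG-3 (MR)": (64, 3),
}

# Leading constants: complexity ~ hidden * order * m * r, parameters ~ hidden * r.
summary = {name: {"comp": h * k, "params": h} for name, (h, k) in settings.items()}
for name, row in summary.items():
    print(f"{name:12s} comp = {row['comp']:3d} x m x r   params = {row['params']:3d} x r")

mr_ratio = summary["HLHG-3 (MR)"]["params"] / summary["Text GCN"]["params"]
print(mr_ratio)  # 0.32
```

The MR configuration has a complexity constant of 192 against 200 for the Text GCN, and its parameter ratio of 0.32 is the "about 1/3" quoted above.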
4.2. Semisupervised Node Classification
We conduct semisupervised learning on three benchmark citation network datasets to compare the node classification accuracy of HLHG with some classical approaches and with some graph convolutional neural network approaches. The graph semisupervised learning corresponds to the process of “label” spreading on citation networks.
4.2.1. Datasets
In semisupervised node classification, we use the CiteSeer, Cora, and PubMed citation network datasets [29]. In these citation datasets, the nodes represent the articles that were published in the corresponding journal. The edges between two nodes represent references from one article to another, and the labels represent the topics of the articles. The citation links define the adjacency matrix. These datasets have low label rates. The summary statistics of the citation graphs are shown in Table 5.
4.2.2. Baselines and Experimental Setting
We compare our HLHG with the same baseline methods as Abu-El-Haija et al. [15] and Yang et al. [30]. The baselines are as follows: manifold regularization (ManiReg) [31], semisupervised embedding (SemiEmb) [32], label propagation (LP) [33], skip-gram-based graph embeddings (DeepWalk) [34], the iterative classification algorithm (ICA) [35], Planetoid [30], HO [14], and MixHop [15].
For the HLHG-2 model, we use the following parameters for the citation datasets (Cora, CiteSeer, and PubMed): 16 (number of hidden units), 0.5 (dropout rate), 0.0005 (L2 regularization), 10 (early stopping), 300 (number of epochs), and 0.01 (learning rate).
For the HLHG-3 model, we set different numbers of hidden neurons for the different datasets. We set 8 hidden neurons for the CiteSeer dataset to reduce the computational complexity and the number of parameters, and 10 hidden neurons for the Cora and PubMed datasets to capture richer features. The hyperparameters of HLHG-3 are set as shown in Table 6.
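For reference, these settings can be collected in a small configuration object. The values are the ones listed above; the class name, the field names, and the early-stopping rule are our assumptions. The rule shown, which stops once the latest validation loss exceeds the mean of the previous `window` losses, is only one plausible reading of the early stopping value of 10, since the text does not define it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CitationConfig:
    """HLHG-2 settings for the citation datasets as listed in the text."""
    hidden_units: int = 16
    dropout: float = 0.5
    l2_weight: float = 0.0005
    early_stopping: int = 10
    epochs: int = 300
    learning_rate: float = 0.01

def should_stop(val_losses, window):
    """Assumed rule: stop when the latest loss exceeds the mean of the previous `window` losses."""
    if len(val_losses) <= window:
        return False
    recent = val_losses[-(window + 1):-1]
    return val_losses[-1] > sum(recent) / window

cfg = CitationConfig()
history = [1.0] * 10 + [2.0]                     # toy validation losses
print(should_stop(history, cfg.early_stopping))  # True
```

Keeping the settings in one immutable object makes it straightforward to define a per-dataset variant for the third-order model.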
4.2.3. Results
In the semisupervised experiments, we train and test our models on these citation network datasets following the methodology proposed by Yang et al. [30]. The classification accuracy is the average of 100 runs with random weight initializations.
The benchmark test results were copied from [15, 30]. The mean and standard deviation of our model are computed over 100 runs.
In Table 7, the node classification accuracies above the line are copied from Abu-El-Haija et al. [14, 15] and Yang et al. [30]. The values below the line are those of our HLHG models. The ± values represent the standard deviation over 100 runs with different random initializations. These splits utilize only 20 labeled nodes per class during training. We achieve the best test accuracies of 82.7% and 71.5% on the Cora and CiteSeer datasets, respectively. Other high-order graph convolutional neural networks [14, 15] obtain the high-order information on the same datasets through linear combinations of features from more distant neighbors, whereas our HLHG model combines the high-order neighborhood information nonlinearly.
In Table 8, we compare the network complexity and the number of parameters with those of the other high-order graph convolutional networks and the classic GCN. The results show that our model has the same computational complexity as the other approaches. With respect to the number of parameters, our HLHG-3 model has fewer parameters than the GCN [6]. The reason is that our model shares the weights in the same layer among the different order neighborhood matrices.
5. Conclusion
In this paper, we propose a hybrid lower-order and higher-order GCN model for the supervised classification of text network datasets and for semisupervised classification in citation networks. In our model, we propose a novel nonlinear information fusion layer to combine the low- and higher-order neighborhoods. To reduce the number of parameters, we propose sharing the weights in the same convolutional layer across the different order neighborhoods. Experiments on the two types of network datasets suggest that HLHG has the capability to fuse higher-order neighborhoods for supervised and semisupervised classification. Our model outperforms the benchmarks on most of the datasets. We also find that the computational complexity and the number of parameters are lower than those of the high-order method. To obtain more neighborhood information, we could use higher-order adjacency matrices. However, the direct use of higher orders may lead to oversmoothing problems. Therefore, in future work, we will extend our HLHG models to fuse graph attention networks [36] to develop a deeper graph convolutional network.
Data Availability
The supervised text network classification data used to support the findings of this study have been deposited in the repository DOI:10.1609/aaai.v33i01.33017370. The semisupervised node classification data used to support the findings of this study have been deposited in the repository DOI:10.1609/aimag.v29i3.2157.
Disclosure
The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; nor in the decision to publish the results.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grants U1701266, 61571141, 61702120, and 61672008; Guangdong Province Key Laboratory of Intellectual Property and Big Data under Grant 2018B030322016; Scientific and Technological Projects of Guangdong Province under Grant 2019A070701013; and Qingyuan Science and Technology Plan Project under Grants 170809111721249 and 170802171710591.