Abstract

By crawling hyperlinks under the domain name “.edu.cn”, which constitutes the China Education and Research Network, we build a complex directed network of 366,422 web pages connected by 540,755 hyperlinks. This network forms through self-organization. By analyzing its topology, we find that the China Education and Research Network differs from the general Internet in several respects. Most of the vertices have incoming links, a few vertices have outgoing links, and very few vertices have both incoming and outgoing links. The vertex degree distribution has a power-law tail. A large proportion of newly added edges connect to pages within the subnetwork to which they belong, rather than to pages selected from the whole network. Based on these features, we present an evolution model of this complex directed network. The results indicate that the model reflects some of the main characteristics of the China Education and Research Network.

1. Introduction

Research on complex networks is developing at a brisk pace, and significant achievements have been made in recent years. Among them is the introduction of the scale-free network and related models [1–4], which marked major progress in revealing the dynamic evolution of complex networks. Theoretical and empirical research on complex networks has also produced important results [5–9].

The China Education and Research Network (CERNET) was established in 1995. More than 1000 universities and research institutes have been connected to this network so far. It has 36 regional network centers and main nodes, which are distributed among different provinces of China. The network now has more than 1,200,000 host machines and has become the second largest internet in China. However, compared with the large body of research on the general Internet [10–13], only a few studies of CERNET can be found. From these studies we found that the features of CERNET differ from those of the general Internet, especially in structure and formation mechanism [14, 15]. Hence, the study of CERNET is quite important.

We have been working on CERNET since 2005, trying to establish an evolution model of CERNET for analysis and prediction purposes [14–16]. However, owing mainly to the large scale of CERNET and a lack of computing power, adjusting the parameters of the model took a long time. As a result, the model we obtained is relatively simple and does not reflect the main features of CERNET well [16]. For example, the average shortest path length of the simulated model is only about 2.8, far from the 8.95 of the real network [17].

In this paper, the CERNET we analyze is a virtual network made up of web pages whose addresses contain “.edu.cn”. In this network, web pages are the nodes, and the hyperlinks in these pages that point to other pages are the directed edges. This directed complex network has 366,422 nodes and 540,755 edges. We analyze the features of this network and derive an evolution model by empirical methods to reveal the formation mechanism of CERNET.

The remainder of the paper is organized as follows. The topological structure of CERNET is analyzed in Section 2. The evolution model of CERNET and a comparison between the real and simulated networks are described in Section 3. Conclusions and future work are given in Section 4.

2. Topological Structure of CERNET

There are several features that can be used to characterize a network, for example, the degree distribution, the average shortest path length, and the clustering coefficients. Among them the degree distribution is considered to be the most important [2].

From graph theory we know that the degree of a node is the number of edges connected to it. For a directed graph, the outdegree is the number of outgoing edges and the indegree is the number of incoming edges. Using the data we collected, we set up a database of CERNET and obtain $P(k_{\mathrm{out}})$ and $P(k_{\mathrm{in}})$, where $P(k_{\mathrm{out}})$ is the probability that a page links to $k_{\mathrm{out}}$ pages and $P(k_{\mathrm{in}})$ is the probability that a page is linked from $k_{\mathrm{in}}$ pages. The formulas used to calculate the output and input probabilities of a node are listed in (1) and (2), respectively, where is the maximum outdegree of the network and the maximum indegree of the network:

We plot $P(k_{\mathrm{out}})$ and $P(k_{\mathrm{in}})$ against the degree on double logarithmic axes, as shown in Figures 1 and 2, respectively. Linear regression is performed on the linearized data, giving the straight red lines in these figures. From Figure 1 we see that the tail of the outdegree distribution of CERNET follows a power law, $P(k_{\mathrm{out}}) \sim k_{\mathrm{out}}^{-\gamma_{\mathrm{out}}}$, where $\gamma_{\mathrm{out}} = 2.48$. From Figure 2 we see that the indegree distribution generally follows a power law, $P(k_{\mathrm{in}}) \sim k_{\mathrm{in}}^{-\gamma_{\mathrm{in}}}$, where $\gamma_{\mathrm{in}} = 2.40$, but its tail is not very smooth. This differs greatly from the Poisson distribution predicted by the traditional theory of random graphs.
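The procedure described above (empirical degree probabilities followed by a straight-line fit on log-log axes) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the crawl is available as a list of (source page, target page) pairs, that the probabilities are plain normalized frequencies, and that the fit is an ordinary least-squares regression. The function names and the `k_min` cutoff are hypothetical.

```python
import math
from collections import Counter


def degree_distributions(edges):
    """Return (p_out, p_in): dicts mapping a degree k to the fraction of pages with that degree."""
    out_deg, in_deg, pages = Counter(), Counter(), set()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
        pages.update((src, dst))
    n = len(pages)
    # Counter returns 0 for pages that never appear as a source (or as a target).
    p_out = {k: c / n for k, c in Counter(out_deg[p] for p in pages).items()}
    p_in = {k: c / n for k, c in Counter(in_deg[p] for p in pages).items()}
    return p_out, p_in


def fit_power_law_exponent(p_k, k_min=1):
    """Least-squares line through (log10 k, log10 P(k)); returns (gamma, intercept).

    gamma is the negated slope, so P(k) ~ k^(-gamma). Points with k < k_min or
    P(k) == 0 are skipped because they have no logarithm or lie outside the tail.
    """
    xs, ys = [], []
    for k, p in p_k.items():
        if k >= k_min and p > 0:
            xs.append(math.log10(k))
            ys.append(math.log10(p))
    if len(xs) < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return -slope, mean_y - slope * mean_x
```

A least-squares fit on log-log data is sensitive to the noisy tail, so the choice of `k_min` affects the estimated exponent; the value to use for CERNET is not stated in the text.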

We perform a statistical analysis of these data and obtain the accumulated frequency of each degree and the corresponding ratio of that degree to the total degree in CERNET, as shown in Table 1. From Table 1 we can see that a large number of pages have few connections, a few pages have a medium number of connections, and a tiny minority of notable pages have a large number of connections. This phenomenon is similar to the result reported by Albert et al. [1].
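A table of this kind can be derived from the per-page degrees. The sketch below is one plausible reading of "accumulated frequency" and "ratio of the degree to total degree": for each degree value it reports the cumulative share of pages and the cumulative share of all link endpoints. The function name and row layout are assumptions made for illustration.

```python
from collections import Counter


def cumulative_degree_table(degrees):
    """Rows of (k, share of pages with degree <= k, share of total degree held by them)."""
    hist = Counter(degrees)
    n_pages = sum(hist.values())
    total_degree = sum(k * c for k, c in hist.items())
    rows, pages_so_far, degree_so_far = [], 0, 0
    for k in sorted(hist):
        pages_so_far += hist[k]
        degree_so_far += k * hist[k]
        degree_share = degree_so_far / total_degree if total_degree else 0.0
        rows.append((k, pages_so_far / n_pages, degree_share))
    return rows
```

In a heavy-tailed network the page share approaches 1 quickly while the degree share lags behind, which is the pattern described for Table 1.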

This virtual network of CERNET is made up of subsets of web pages belonging to different universities. The number of web pages in each subset is determined by the corresponding university; the addition and deletion of pages depend entirely on the university to which the pages belong. However, we find that although the number of pages differs between universities, the subsets share some similar features. For example, the proportion of pages that have output links is less than 25% in every university, while the proportion of pages that have input links is usually greater than 85%. Only a very small number of pages have both output links and input links. Hence, if each university is treated as a subnetwork, then in each subnetwork most nodes have only input edges, a few nodes have only output edges, and nodes with both input and output edges are rare. From these features we know that each university connects to other universities through a small number of pages, as shown in Table 2.
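The per-university proportions quoted above can be computed directly from the edge list. The following sketch assumes a mapping from each page to its university (here called `subnet_of`, a hypothetical name) in addition to the list of (source, target) hyperlinks; how pages are assigned to universities in the original study is not specified.

```python
from collections import defaultdict


def link_profile_by_subnetwork(edges, subnet_of):
    """For each subnetwork, the share of pages with output links, input links, and both."""
    has_out, has_in = set(), set()
    for src, dst in edges:
        has_out.add(src)
        has_in.add(dst)
    counts = defaultdict(lambda: {"pages": 0, "out": 0, "in": 0, "both": 0})
    for page, subnet in subnet_of.items():
        c = counts[subnet]
        c["pages"] += 1
        c["out"] += page in has_out
        c["in"] += page in has_in
        c["both"] += page in has_out and page in has_in
    return {
        subnet: {
            "out_share": c["out"] / c["pages"],
            "in_share": c["in"] / c["pages"],
            "both_share": c["both"] / c["pages"],
        }
        for subnet, c in counts.items()
    }
```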

3. The Evolution Model of CERNET

Using the mechanisms of growth and preferential attachment, the scale-free model proposed by Barabási et al. can to some degree disclose the nature of many complicated phenomena in the real world. However, this model cannot be applied directly to CERNET. For example, every newly attached node has output edges in the scale-free model, but in the directed network of CERNET a large number of newly attached nodes have only one input edge; that is, these nodes have zero outdegree. Also, in the scale-free model a newly added node searches the whole network for the best node to connect to, while in CERNET newly added pages generally connect to pages in the same university. Only occasionally do newly added pages connect to pages in other universities, and even then they do not search the whole of CERNET for the best pages to connect to. Based on these features, we propose the following evolution model of CERNET.

(i) The CERNET starts from nodes and edges. The nodes are randomly divided into subsets. There are , and nodes and , and edges in each subset, respectively, where , and .

(ii) At each moment, a new node is randomly added to one of the subsets of the network. There are 5 cases for the edges that are added together with the new node:
(1) the new node has only one input edge;
(2) the new node has only output edges;
(3) the new node has one input edge and one output edge;
(4) the new node has one input edge and output edges;
(5) the new node has one output edge and input edges,
where and is the minimum initial number of nodes among subsets and .

(iii) When the new node with one input edge is added to the network with probability , this node randomly chooses a subset and receives an edge from a preferentially selected node in this subset. Let $\Pi(k_i^{\mathrm{out}})$ denote the probability that node $i$ is selected as the source node; $\Pi(k_i^{\mathrm{out}})$ is determined by $k_i^{\mathrm{out}}$, the outdegree of $i$.

(iv) When the new node with output edges is added to the network, there are 2 cases to consider. The probabilities of the two cases are and , respectively.
(1) In the first case, the new node randomly chooses a subset and connects to a preferentially selected node in this subset. Let $\Pi(k_i^{\mathrm{in}})$ denote the probability that node $i$ is selected as the target node; $\Pi(k_i^{\mathrm{in}})$ is determined by $k_i^{\mathrm{in}}$, the indegree of $i$. For the rest of the output edges, at each moment one edge randomly chooses a subset that has not yet been connected to the new node and connects to a preferentially selected node in this subset, until all output edges are processed.
(2) In the second case, the new node still randomly chooses a subset, but this time it preferentially chooses nodes in this subset and connects to them. The probability that node $i$ is selected as the target node is again $\Pi(k_i^{\mathrm{in}})$, determined by $k_i^{\mathrm{in}}$, the indegree of $i$. For the rest of the edges that the new node carries, it randomly picks a subset that has not yet been connected to the new node and connects to a preferentially selected node in this subset.

(v) When the new node with one input edge and one output edge is added to the network, there are also 2 cases to consider. The probabilities of the two cases are and , respectively.
(1) In the first case, the new node randomly chooses a subset and receives an edge from a preferentially selected node in this subset. The probability that node $i$ is selected as the source node is determined by $k_i^{\mathrm{out}}$, the outdegree of $i$. The output edge of the new node randomly selects a subset that has not yet been connected to the new node and connects to a preferentially selected node. The probability that a node $i$ is selected as the target node is determined by $k_i^{\mathrm{in}}$, the indegree of $i$.
(2) In the second case, the new node randomly chooses a subset and receives an edge from a preferentially selected node in this subset. The probability that node $i$ is selected as the source node is determined by $k_i^{\mathrm{out}}$, the outdegree of $i$. The output edge of the new node stays in the same subset and connects to a preferentially selected node other than the source of the new node's input edge. The probability that a node $i$ is selected as the target node is determined by $k_i^{\mathrm{in}}$, the indegree of $i$.

(vi) When the new node with 1 input edge and output edges is added to the network with probability , this node randomly chooses a subset and receives an edge from a preferentially selected node in this subset. The probability that node $i$ is selected as the source node is determined by $k_i^{\mathrm{out}}$, the outdegree of $i$. For the output edges, at each moment one edge randomly chooses a subset that has not yet been connected to the new node and connects to a preferentially selected node in this subset, until all the output edges are processed. The probability that node $i$ is selected as the target node is determined by $k_i^{\mathrm{in}}$, the indegree of $i$.

(vii) When the new node with 1 output edge and input edges is added to the network with probability , this node randomly chooses a subset and connects to a preferentially selected node in this subset. The probability that node $i$ is selected as the target node is determined by $k_i^{\mathrm{in}}$, the indegree of $i$. For the input edges, at each moment one edge randomly chooses a subset that has not yet been connected to the new node, and a preferentially selected node in this subset connects to the new node, until all input edges are processed. The probability that node $i$ is selected as the source node is determined by $k_i^{\mathrm{out}}$, the outdegree of $i$.

The definitions of $\Pi(k_i^{\mathrm{in}})$ and $\Pi(k_i^{\mathrm{out}})$ are listed in (3) and (4), respectively. The relation between the different probabilities is listed in (5). We have the following equations:

In (3) and (4), the sums run over the nodes of the subset that receives the new edges. The denominator of (3) is the total indegree of that subset, and the denominator of (4) is its total outdegree.
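Read together with the description above, (3) and (4) presumably take the standard preferential-attachment form, restricted to one subset. The block below is a reconstruction under that assumption rather than the authors' typeset equations; the symbol $N$ for the number of nodes in the chosen subset and the summation index $j$ are introduced here for illustration only. Equation (5) is not reconstructed because the names of the case probabilities are not recoverable from the text.

```latex
\begin{align}
\Pi\left(k_i^{\mathrm{in}}\right)  &= \frac{k_i^{\mathrm{in}}}{\sum_{j=1}^{N} k_j^{\mathrm{in}}},  \tag{3} \\
\Pi\left(k_i^{\mathrm{out}}\right) &= \frac{k_i^{\mathrm{out}}}{\sum_{j=1}^{N} k_j^{\mathrm{out}}}. \tag{4}
\end{align}
```

Under this form a node with zero indegree can never be chosen as a target, and a node with zero outdegree can never be chosen as a source, which is consistent with the observation that most pages in CERNET have no output links.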

After moments, we get a directed random network with nodes and edges, where , and . From the analysis of CERNET we set , , , , , , and . When , , and , we get the outdegree and indegree distributions of the simulated network, illustrated in Figures 3 and 4, respectively. Figures 5 and 6 compare the simulated data with the real data. For the outdegree distribution, the slope of the simulated data is 2.48, the same as that of the real data, but the beginning part of the simulated data does not fully reproduce the statistics of the real data. For the indegree distribution, the slope of the simulated data is 2.40, again the same as that of the real data; the beginning part likewise does not fully reproduce the real statistics, and the tail is smoother than that of the real data.
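To make the growth mechanism concrete, the following is a deliberately simplified sketch of a subset-restricted preferential-attachment simulation. It is not the model as parameterized in this paper: only two of the five cases are implemented (a new node with one input edge, and a new node with a single output edge), and every parameter value (`n_subsets`, `seed_size`, `steps`, `p_input_only`) is a hypothetical placeholder, since the values used for CERNET are not recoverable here. The uniform fallback for an all-zero subset and the ring-shaped seed are also assumptions.

```python
import random


def pick_preferential(members, degree, rng):
    """Pick one node from `members` with probability k_i / sum_j k_j within the subset.

    Falls back to a uniform choice when every degree in the subset is zero
    (an assumption; the text does not say how this case is handled).
    """
    weights = [degree[v] for v in members]
    if sum(weights) == 0:
        return rng.choice(members)
    return rng.choices(members, weights=weights, k=1)[0]


def grow(n_subsets=4, seed_size=3, steps=2000, p_input_only=0.8, seed=0):
    """Grow a directed network; all default values are illustrative placeholders."""
    rng = random.Random(seed)
    subsets = [[] for _ in range(n_subsets)]
    in_deg, out_deg, edges = {}, {}, []

    def add_node(s):
        v = len(in_deg)  # node ids are consecutive integers
        in_deg[v] = out_deg[v] = 0
        subsets[s].append(v)
        return v

    def add_edge(u, v):
        edges.append((u, v))
        out_deg[u] += 1
        in_deg[v] += 1

    # Seed every subset with a small directed ring so that no degree sum starts at zero.
    for s in range(n_subsets):
        ring = [add_node(s) for _ in range(seed_size)]
        for a, b in zip(ring, ring[1:] + ring[:1]):
            add_edge(a, b)

    for _ in range(steps):
        # Snapshot the chosen subset first so the new node cannot select itself.
        candidates = list(subsets[rng.randrange(n_subsets)])
        new = add_node(rng.randrange(n_subsets))
        if rng.random() < p_input_only:
            # One input edge: the source is chosen by outdegree.
            add_edge(pick_preferential(candidates, out_deg, rng), new)
        else:
            # One output edge: the target is chosen by indegree.
            add_edge(new, pick_preferential(candidates, in_deg, rng))
    return subsets, edges, in_deg, out_deg
```

Even this reduced version reproduces the qualitative feature that most nodes end up with zero outdegree; matching the reported exponents would require the full set of cases and the fitted probabilities.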

4. Conclusions

From the figures of the degree distributions, we can see that the simulated network partly reflects the characteristics of CERNET. The degree distribution of the simulated network matches the real network much better than that of the model in [16]. We also compared other features of the simulated and real networks. For example, the average shortest path length is 8.95 for the real network and 7.81 for the simulated network, which is much closer than the result of the model in [16].
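The average shortest path length used in this comparison can be computed by breadth-first search. The sketch below assumes that edge directions are respected and that the average is taken over ordered pairs for which a directed path exists; the text does not state which convention was used for the reported values, so this is one possible definition.

```python
from collections import defaultdict, deque


def average_shortest_path_length(edges):
    """Mean directed shortest-path length over all ordered pairs (u, v), u != v, with v reachable from u."""
    adj = defaultdict(list)
    nodes = set()
    for u, v in edges:
        adj[u].append(v)
        nodes.update((u, v))
    total = pairs = 0
    for source in nodes:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs if pairs else float("nan")
```

Running one search per node is costly on a network with hundreds of thousands of pages, so in practice the sources would likely be sampled.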

The main contribution of this paper is the evolution model of CERNET. The results show that the simulated model partly discloses the properties of this network. However, the model introduced in this paper is an idealized one, in which only the main features of the real network are considered. With the help of fast-growing computing power, we intend to adjust this model so that it can be used in the analysis of ever larger complex networks.

Acknowledgments

The authors are grateful to colleagues in the German Research School for Simulation Science for their constructive suggestions. The authors are also grateful to the reviewers for their valuable comments and suggestions to improve the presentation of this paper. This work is supported by the National Natural Science Foundation of China under Grant no. 70971089, Shanghai Leading Academic Discipline Project under Grant no. XTKX2012, and Jiangsu Overseas Research and Training Programs for University Prominent Young and Middle-Aged Teachers and Presidents.