Abstract

Scientific coauthorship, generated by collaborations and competitions among researchers, reflects effective organizations of human resources. Researchers, their expected benefits through collaborations, and their cooperative costs constitute the elements of a game. Hence, we propose a cooperative game model to explore the evolution mechanisms of scientific coauthorship networks. The model generates geometric hypergraphs, where the costs are modelled by space distances, and the benefits are expressed by node reputations, that is, geometric zones that depend on node position in space and time. Modelled cooperative strategies conditioned on positive benefit-minus-cost reflect the spatial reciprocity principle in collaborations and generate high clustering and degree assortativity, two typical features of coauthorship networks. Modelled reputations generate the generalized Poisson parts and fat tails that appear in specific distributions of empirical data, for example, the paper team size distribution. The combined effect of modelled costs and reputations reproduces the transitions that emerge in the degree distribution, in the correlation between degree and local clustering coefficient, and so on. The model provides an example of how individual strategies induce network complexity, as well as an application of game theory to social affiliation networks.

1. Introduction

Collaborations between researchers contribute not only to breakthrough achievements unattainable by individuals [1, 2] but also to the transmission and fusion of knowledge, and hence, they incubate several interdisciplines [3–6]. Coauthorship in scientific papers, as a valid proxy of collaborations, can be expressed graphically (termed a coauthorship network), where nodes and edges represent authors and coauthorship, respectively. Studies of large-scale coauthorship networks provide a bird’s eye view of collaboration patterns in diverse fields and have become an important topic of social sciences [7–11].

Empirical coauthorship networks have specific common local (degree assortativity, high clustering) and global (fat-tail, small-world) features [12–17]. Some important models have been proposed to reproduce those properties, such as modeling fat-tail through preferential attachment or cumulative advantage [18–23] and modeling degree assortativity by connecting two nonconnected nodes that have similar degrees [24]. Besides preferential attachment, the inhomogeneity of node influences is an alternative explanation for fat-tail: nodes with wider influences are likely to gain more connections [25]. The idea has been applied to model coauthorship networks in a geometric way: node influences are modelled by attaching specific geometric zones to nodes [26, 27].

To find the essence behind the above features, we face a basic question [28]: “how did cooperative behavior evolve”? Five typical mechanisms of cooperative evolution [29] all hold for coauthorship: coauthoring frequently occurs between students and their tutors (kin selection); cooperation helps to achieve breakthroughs that are unattainable by an individual (direct reciprocity); coauthoring with someone could establish a good reputation (indirect reciprocity); spatial structures or social networks make some researchers interact more often than others and obtain more collaborators (network reciprocity); a successful research team is attractive for collaborators (group selection). To quantify collaborations and predict behavioral outcomes, game theory has been developed as a modelling approach for finding rational strategies. Do there then exist inherent game rules behind the complexity of coauthorship networks?

We try to answer the above question through simulation. A cooperative game consists of two elements: a set of players and a characteristic function specifying the value (i.e., benefit-minus-cost) created by subsets of players in the game. Scientific cooperation has those elements. The diversity of researchers’ learning programs leads to their individual research interests. Cooperation costs could be considered investments of time and effort to complete a study by crossing the distance between research interests [30]. The reputation in academic society could be regarded as the expected benefit of cooperation: coauthoring with a famous researcher contributes to achieving academic success.

In the model, the set of interests is abstracted as a circle, and players are located on the circle. Cooperative costs are geometrized as angular distances, and the reputation benefit of a player is valued as a power function of player generation time. Modelled cooperative strategies conditioned on positive benefit-minus-cost imitate the spatial reciprocity principle in collaborations [31] and yield high clustering and degree assortativity. The designed form of reputations, together with the strategies, yields the features (hook heads, fat tails [32]) of specific distributions of empirical data, such as the degree distribution and the distribution of paper team sizes. Moreover, the combined effect of spatial reciprocity and the diversity of reputations reproduces the transition phenomena in the degree distribution, in the correlation between degree and local clustering coefficient, and so on. The good fit between model and data supports the reasonableness of the designed game mechanisms.

This paper is organized as follows: the model and data are described in Sections 2 and 3, respectively. Cooperation cost, reputation benefit, and the relationship between them are discussed in Sections 4 and 5. The conclusion is drawn in Section 6.

2. The Model

A hypergraph is a generalization of a graph, in which an edge (termed a hyperedge) can join any number of nodes. A coauthorship relationship can be expressed by a hypergraph, where nodes represent authors, and the author group of a paper (called a “paper team”) forms a hyperedge. A number of models have been proposed for generating hypergraphs in specific random ways, and some of them have been used for modelling coauthorship networks [32–34]. Meanwhile, there has been a considerable amount of previous work on the structures of specific random hypergraphs, such as clustering and the emergence of a giant component [35–38].

We provide a geometric hypergraph model, where the set of research interests is abstracted as a circle , and researchers are expressed as nodes located on the circle (Figure 1). The nodes are generated in batches from 1 to ; hence, they can be identified by spatiotemporal coordinates. Some nodes are randomly selected as “lead nodes” to attach specific arcs that imitate their “reputations.” The nodes covered by a lead node’s arc constitute a “research team.” The paper teams are modelled by hyperedges, which are generated by following a cooperative game mechanism.

A cooperative game consists of two elements: a set of players and a characteristic function specifying the value (benefit-minus-cost) created by subsets of players in the game. The characteristic function is a function mapping each subset of to the value it creates. Regard nodes as players . Think of a player as a lead node with players as its research team members and player as a candidate attempting to cooperate with and specific members (e.g., ).

Assume that the cooperation cost is , and suppose that those players will receive a benefit valued by ’s reputation . We can define a game as follows:

Under definition (1), if , those players will collaborate.

The empirical distributions of paper team sizes exhibit a hook head and a fat tail, which means that the sizes of most papers are around their average, while a few papers have a significantly large size. In reality, researchers in a small research team are more likely to write papers together. Members of a large research team rarely coauthor a paper all together, but rather with a fraction of the members. Treating paper team size as a random variable , we design a mechanism to simulate the distribution of . Give the upper bound of small research team and the lower bound of large research team . Denote the expected value of paper team size and the size of the corresponding research team to be and , respectively. Let if . Let , if . Draw from a power law distribution with an exponent and domain , if . Then draw from a Poisson distribution with expected value . Note that in the description of the above game, and .
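The size-drawing mechanism above can be sketched in Python. This is only a sketch under stated assumptions, not the authors' implementation: the names `small_max`, `large_min`, and `exponent`, the use of `small_max` as the expected size for mid-sized research teams, and the power-law domain `[large_min, team_size]` are hypothetical stand-ins for symbols that did not survive in the text.

```python
import math
import random


def sample_poisson(mean: float, rng: random.Random) -> int:
    """Draw one Poisson variate by Knuth's multiplication method.

    Only suitable for means below a few hundred, since exp(-mean) underflows.
    """
    limit = math.exp(-mean)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return count
        count += 1


def sample_power_law(low: float, high: float, exponent: float,
                     rng: random.Random) -> float:
    """Inverse-CDF draw from a density proportional to x**(-exponent) on [low, high]."""
    u = rng.random()
    if exponent == 1.0:
        return low * (high / low) ** u
    a = 1.0 - exponent
    return (low ** a + u * (high ** a - low ** a)) ** (1.0 / a)


def expected_paper_team_size(team_size: int, small_max: int, large_min: int,
                             exponent: float, rng: random.Random) -> float:
    """Expected paper team size for a research team of the given size.

    ASSUMPTION: the three branches below are a guess at the lost formulas.
    """
    if team_size <= small_max:      # small team: everyone tends to coauthor
        return float(team_size)
    if team_size < large_min:       # mid-sized team: capped at the small bound
        return float(small_max)
    # Large team: occasionally produce a very large paper team.
    return sample_power_law(large_min, team_size, exponent, rng)


def sample_paper_team_size(team_size: int, small_max: int, large_min: int,
                           exponent: float, rng: random.Random) -> int:
    """Draw a paper team size: Poisson around the expected size."""
    mean = expected_paper_team_size(team_size, small_max, large_min, exponent, rng)
    return sample_poisson(mean, rng)
```

With `small_max` near the average number of authors per paper, most draws cluster around that value, while research teams larger than `large_min` occasionally yield much larger paper teams.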

Cooperation costs could be considered investments of time and effort to complete a study by crossing the distance between the research interest of the leader and that of the candidate and so on. Denote the spatiotemporal coordinates of player by , and write the player as . We abstractly geometrize the cost , namely, the angular distance between and .
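As a small illustration of the geometrized cost, the angular distance between two players on the circle can be computed as follows. The function names are hypothetical, angles are assumed to be in radians, and the rule "cooperate when reputation minus distance is positive" is an assumed reading of the benefit-minus-cost condition rather than the paper's exact definition (1).

```python
import math


def angular_distance(theta_a: float, theta_b: float) -> float:
    """Length of the shorter arc between two angles on the unit circle (radians)."""
    diff = abs(theta_a - theta_b) % (2.0 * math.pi)
    return min(diff, 2.0 * math.pi - diff)


def will_cooperate(theta_lead: float, theta_candidate: float,
                   reputation: float) -> bool:
    """ASSUMPTION: cooperate when benefit (reputation) minus cost (distance) > 0."""
    return reputation - angular_distance(theta_lead, theta_candidate) > 0.0
```

For example, players at angles 0.1 and 2π − 0.1 are 0.2 apart, not 2π − 0.2, because the distance wraps around the circle.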

We now show how to value reputation. Considering the insufficient information available to new players, we simply assume each lead node has the same attraction to new players, and so we value the reputation of a lead node as . Hence, the expected number of ’s collaborators is at time , where . Those yield . The probability density of a lead node generated at is . Hence, . Then the tail of degree distribution . We can obtain the general case for large enough by valuing the reputation , where . The strict mathematical deduction of the degree distribution tail needs averaging on Poisson distribution, which is inspired by some of the same general ideas as explored in [25].

We next show how to generate a paper team, namely, the cooperation rule. Empirical collaboration behaviors have specific certainty (due to kin selection, network selection, etc.), as well as uncertainty. Consider a typical scenario: a researcher of leader research team wants to complete a work and write it as a paper, which needs researchers to work together. Then would ask leader for help, and would suggest the members of his research team , who have the most similar interest to , to cooperate with . Such behavior can be viewed as kin selection and is characterized by certainty. When finishing the work is beyond the ability of team , the researcher would ask for external helpers. Uncertainty exists in this selection behavior, which inspires the design of randomly choosing players outside of to cooperate. This uncertainty shortens the average shortest path length of modelled networks. Note that a researcher could belong to several research teams; hence, the above scenario would happen in each team.

Based on the above set-up, we build the hypergraph model as follows:

(1) Reputation assignment: for time , do the following:

Sprinkle nodes as new players uniformly and randomly on . Select subset from randomly as lead nodes, and value the reputation of as .

(2) Cooperation rule: for time , do the following:

For each new node , select a lead node set for which satisfies and . For each lead node in this set, add to ’s research team , and generate a hyperedge with probability by grouping players of nearest to and players randomly, where is the random variable defined above.

The player set of the model , and the number of players . Here, we let and be constants over . Compared with the model in [38], the new model reduces the number of parameters. Moreover, the new model has the ability to reproduce the empirical feature of the distribution of hyperdegrees and that of paper team sizes. A node’s hyperdegree is the number of hyperedges that contain the node.
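The hyperdegree defined above, and the paper team size distribution that the model aims to reproduce, can both be read directly off a list of hyperedges. The following is a minimal sketch with hypothetical function names, assuming each hyperedge is given as a collection of hashable node identifiers.

```python
from collections import Counter


def hyperdegrees(hyperedges):
    """Map each node to the number of hyperedges that contain it."""
    counts = Counter()
    for edge in hyperedges:
        counts.update(set(edge))  # set() guards against repeated names in one paper
    return dict(counts)


def team_size_distribution(hyperedges):
    """Empirical probability of each paper team size (hyperedge cardinality)."""
    sizes = Counter(len(set(edge)) for edge in hyperedges)
    total = sum(sizes.values())
    return {size: count / total for size, count in sorted(sizes.items())}
```

For three papers with author lists `["a", "b", "c"]`, `["a", "b"]`, and `["a"]`, author `a` has hyperdegree 3 and each team size from 1 to 3 has probability one third.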

3. The Data

To test the fitting ability of the proposed model, we analyze two empirical coauthorship networks (Table 1). Dataset PNAS is composed of 52,803 papers published in Proceedings of the National Academy of Sciences during 1999–2013. Dataset PRE comprises 24,079 papers published in Physical Review E during 2007–2016. Note that 43,304 papers of the first dataset belong to biological sciences, and the second dataset comes from physical sciences. The different collaboration level (reflected by the average number of authors per paper) of the two datasets (PNAS 6.028, PRE 3.102) helps to test the flexibility of the model.

In the process of extracting networks from those metadata, authors are identified by their names on their papers. For example, the author named “Carlo M. Croce” on his paper is represented by that name. We mainly focus on the distribution of degree and that of hyperdegree as well as some properties based on degrees. From the analysis of [39], we find that identifying authors by their names on papers preserves the degree distribution feature of ground truth data, which partially verifies the reliability of the empirical networks used here.

Using the surname and the initial of the first given name generates many merging errors in name disambiguation [40]. Hence, we compute the proportion of those authors and that of those authors further conditioned on publishing more than one paper. Meanwhile, Chinese names were also found to account for the repetition of names [39]. We count the proportion of names with a given name of fewer than six characters and a surname among the 100 major Chinese surnames. The small proportions of such authors and those of such authors publishing more than one paper (Table 1) limit the impact of name repetition, especially for dataset PNAS.

To reproduce specific features of the empirical data, we choose proper parameters (Table 2) to generate two hypergraphs and extract simple graphs from them (where edges are formed between every two nodes in each hyperedge, isolated nodes are ignored, and multiple edges are viewed as one). Since the model is stochastic, we generate 20 networks with the same parameters and compare their statistical indicators in Table 3. The finding is that the model is robust on those indicators (Table 4).
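The extraction of a simple graph from a hypergraph described above (an edge between every two nodes of each hyperedge, isolated nodes ignored, multiple edges merged) can be sketched as follows. The function name is hypothetical, and node identifiers are assumed to be hashable and mutually comparable.

```python
from itertools import combinations


def hypergraph_to_simple_graph(hyperedges):
    """Return an adjacency dict {node: set of neighbours} for the simple graph.

    Each hyperedge becomes a clique. Storing neighbours in sets merges
    multiple edges, and nodes that appear only in single-author hyperedges
    never enter the dict, so isolated nodes are ignored.
    """
    adjacency = {}
    for edge in hyperedges:
        for u, v in combinations(sorted(set(edge)), 2):
            adjacency.setdefault(u, set()).add(v)
            adjacency.setdefault(v, set()).add(u)
    return adjacency
```

For hyperedges `[1, 2, 3]`, `[2, 3]`, and `[4]`, the result is a triangle on nodes 1, 2, and 3; the repeated pair (2, 3) is counted once, and node 4 is dropped.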

4. Cooperation Cost and Reputation Benefit

Based on the cost and the benefit of collaborations, we explain the distribution feature of paper team sizes. The benefit of joining a paper team is limited. The law of diminishing marginal utility holds in academic society. The allocation of academic achievements is often according to author order. Hence, only the researchers with positive benefit-minus-cost would join the paper team. Assume the number of those researchers is . Meanwhile, the joining behavior has certain degrees of randomness. Let the joining probability be . Then the paper team size will follow a binomial distribution and so a Poisson distribution with expected value approximately (Poisson limit theorem). Due to the law of diminishing marginal utility, the sizes of those papers would follow a generalized Poisson distribution, because this distribution describes situations where the occurrence probability of an event involves memory [41].
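The step from a binomial to a Poisson distribution can be checked numerically. The sketch below compares the two probability mass functions for a large number of candidate researchers and a small joining probability; the function names and the truncation point `k_max` are illustrative choices, not part of the paper.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p); math.comb returns 0 when k > n."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def poisson_pmf(k: int, mean: float) -> float:
    """P(X = k) for X ~ Poisson(mean)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def truncated_total_variation(n: int, p: float, k_max: int = 60) -> float:
    """Half the summed absolute pmf difference over k = 0..k_max."""
    mean = n * p
    return 0.5 * sum(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, mean))
        for k in range(k_max + 1)
    )
```

With the expected team size held at 3, the gap for `n = 1000, p = 0.003` is far smaller than for `n = 10, p = 0.3`, which is the regime in which the Poisson approximation of paper team sizes is reasonable.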

Some important works require many researchers (even from different research teams) to work together, which would bring about huge economic and social benefits. The papers of those works would have many authors and sometimes appear in specific famous journals; for example, a paper in Nature has 2832 authors (see Figure 2 in [36]). In fact, signing on a paper of a famous journal will also bring about a huge benefit. The existence of those papers leads to fat tails emerging in paper team size distributions.

In brief, the above analysis suggests that benefit-minus-cost and the randomness of joining behavior make paper team sizes follow a generalized Poisson distribution, and that huge expected benefits lead to a fat tail. There exists a crossover between the two limits (Figure 3). The fitting function of the distribution, including the distributions of hyperdegree and degree discussed below, is a combination of a generalized Poisson distribution and a power law function (Table 5). We perform a two-sample Kolmogorov-Smirnov (KS) test to compare the distributions of two data vectors: indexes (e.g., paper team sizes) and the samples drawn from the corresponding fitting distribution. The null hypothesis is that the two data vectors are from the same distribution. The p-value of each fit shows the test cannot reject the null hypothesis at the 5% significance level.
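The two-sample KS comparison can be sketched without external libraries as follows. This is an illustrative implementation, not the authors' code: the p-value uses the asymptotic Kolmogorov series with a commonly used small-sample correction, so it is approximate, and more so for discrete data with many ties such as team sizes.

```python
import math


def ks_statistic(sample_a, sample_b) -> float:
    """Largest gap between the two empirical CDFs (tied values advance together)."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    gap = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / n - j / m))
    return gap


def ks_p_value(gap: float, n: int, m: int, terms: int = 100) -> float:
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    effective = n * m / (n + m)
    lam = (math.sqrt(effective) + 0.12 + 0.11 / math.sqrt(effective)) * gap
    if lam < 0.2:  # the series equals 1 to machine precision here
        return 1.0
    series = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * series))


def not_rejected(sample_a, sample_b, alpha: float = 0.05) -> bool:
    """True if the test cannot reject 'same distribution' at level alpha."""
    gap = ks_statistic(sample_a, sample_b)
    return ks_p_value(gap, len(sample_a), len(sample_b)) >= alpha
```

In the setting of the text, `sample_a` would hold the empirical indexes and `sample_b` the samples drawn from the fitted distribution.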

In the model, with a proper upper bound parameter (around the average number of authors per paper of the corresponding empirical data), the model can reproduce the generalized Poisson part of the distribution of paper team sizes, because most of the modelled paper team sizes are drawn from a Poisson distribution with an expected value around . Meanwhile, with a proper lower bound parameter , the mechanism can generate a few significantly large paper team sizes and so the fat tails of the modelled paper team size distributions. We choose through iteration from the starting point of the power law part in the corresponding empirical distribution of paper team sizes ( in Table 5) until the modelled networks have features similar to those of the empirical distribution of degrees and that of paper team sizes.

Now, we turn to explaining the features of degree distributions. A substantial fraction of authors publish only one paper (PNAS: 64.8%, PRE: 63.9%), and most paper team sizes are drawn from a generalized Poisson distribution (PNAS: 99.9%, PRE: 99.9%). Those lead to the generalized Poisson parts of degree distributions. Note that the boundaries of the generalized Poisson parts of paper team size distributions are 41 and 20 for PNAS and PRE, respectively, which are detected by the boundary point detection algorithm for probability density functions in [38] (listed in the appendix).

As their papers accumulate, a few authors experience a cumulative process of gaining collaborators over time, and their reputations also increase. As empirical data show, it is an accelerating process, which is often explained by cumulative advantage. The process is reflected in the transition from a generalized Poisson to a power law (Figure 2). The above explanation can also be used to explain the similar feature of hyperdegree distributions. Note that the nodes of large paper teams also have a large degree, which is reflected in the outliers in the tails of degree distributions.

In the model, we can choose suitable parameters and to make the hyperdegrees of a substantial fraction of players equal to one (Synthetic-1: 48.5%, Synthetic-2: 61.5%). Meanwhile, most modelled paper team sizes follow a generalized Poisson distribution (Synthetic-1: 99.9%, Synthetic-2: 100%). Those yield the generalized Poisson part of modelled degree distributions. The boundary of the generalized Poisson part is 34 for Synthetic-1 and 23 for Synthetic-2. The mechanism of generating hyperedges means that only early lead nodes and specific players close to them can experience the cumulative process of connecting new players. The cumulative process generates the fat tails of modelled degree distributions, as well as those of modelled hyperdegree distributions (Figure 2). The cumulative speed and the power law exponent can be tuned by parameter .

5. Spatial Reciprocity and Network Reputation

Cooperation needs to be based on acquaintanceship. Hence, there is an acquaintanceship network under each coauthorship network. Geographic contexts (such as organization and institution) contribute to an emerging clustering structure in an acquaintanceship network, namely, “the friend of my friend is also my friend” [16]. The Internet extends the scope of acquaintanceship, which now crosses spatial barriers and even national boundaries. Therefore, the factor of clustering changes from geography to interest, namely, “birds of a feather flock together” [42].

Cooperation costs mean that cooperators tend to have similar research interests; that is, collaborations occur within clusters of researchers formed by similar interests. Hence, the spatial reciprocity principle in the cooperative game theory [31] needs to be modified by interest in the situation of academic cooperation. From a network perspective, the extent of spatial reciprocity can be reflected by the local clustering coefficient and the degree difference between a node and its neighbors.

Now, we discuss the relationship between spatial reciprocity and network reputation. In the view of nodes, network reputation can be reflected by degree. Hence, the relationship can be reflected by two functions of degree, namely, the average local clustering coefficient of -degree nodes and the average degree of -degree nodes’ neighbors . There is a transition in each of the functions (Figure 4). The tipping points of and are detected by the boundary point detection algorithm for general functions in [38] (listed in the appendix). Inputs of the algorithm are , , and . Using those inputs is based on the observation of and .
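The two functions of degree discussed above can be computed from an adjacency structure as follows. This is a minimal sketch with hypothetical names, assuming the graph is given as a dict mapping each node to the set of its neighbours, and adopting the convention that nodes with fewer than two neighbours have local clustering coefficient 0.

```python
from collections import defaultdict
from itertools import combinations


def local_clustering(adjacency, node) -> float:
    """Fraction of pairs of the node's neighbours that are themselves linked."""
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention assumed here
    linked = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
    return 2.0 * linked / (k * (k - 1))


def degree_profiles(adjacency):
    """Return (C(k), N(k)): average local clustering coefficient and average
    neighbour degree, each taken over all nodes of degree k."""
    clustering = defaultdict(list)
    neighbour_degree = defaultdict(list)
    for node, neighbours in adjacency.items():
        k = len(neighbours)
        if k == 0:
            continue
        clustering[k].append(local_clustering(adjacency, node))
        neighbour_degree[k].append(sum(len(adjacency[v]) for v in neighbours) / k)
    c_of_k = {k: sum(v) / len(v) for k, v in sorted(clustering.items())}
    n_of_k = {k: sum(v) / len(v) for k, v in sorted(neighbour_degree.items())}
    return c_of_k, n_of_k
```

Plotting both dictionaries against degree gives curves of the kind in which the transitions are sought.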

Coauthorship networks are found to have two features: high clustering (a high probability of a node’s two neighbors connecting) and degree assortativity (a positive correlation coefficient between two random variables: a node’s degree and the average degree of the node’s neighbors), which are measured by GCC and AC in Table 3, respectively. To understand the essence of high clustering and degree assortativity, as well as the transitions in and , we analyze the feature of the basic context of collaborations, that is, research teams. Given the cost and benefit of joining a research team, only the researchers with positive benefit-minus-cost would join the team with a probability (that would be affected by previous members due to gossips, etc.). With an argument similar to the one used in the distribution of paper team sizes, we can assume the research team sizes follow a generalized Poisson distribution. A few research teams with a huge reputation would attract substantial collaborators and become significantly large ones.
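The two network-level indicators named at the start of this paragraph can be sketched as follows. The code follows the verbal definitions given in the text (probability that two neighbours of a node are connected; correlation between a node's degree and the average degree of its neighbours). It is an assumption that this matches the exact estimators behind GCC and AC in Table 3; in particular, it differs from the edge-based assortativity coefficient that is also in common use.

```python
import math
from itertools import combinations


def global_clustering(adjacency) -> float:
    """Share of neighbour pairs (connected triples) that are closed by an edge."""
    closed = 0
    triples = 0
    for node, neighbours in adjacency.items():
        k = len(neighbours)
        triples += k * (k - 1) // 2
        closed += sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
    return closed / triples if triples else 0.0


def pearson(xs, ys) -> float:
    """Pearson correlation; returns 0.0 when either variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def degree_neighbour_correlation(adjacency) -> float:
    """Correlation between node degree and the mean degree of its neighbours."""
    degrees, neighbour_means = [], []
    for node, neighbours in adjacency.items():
        if not neighbours:
            continue
        degrees.append(len(neighbours))
        neighbour_means.append(
            sum(len(adjacency[v]) for v in neighbours) / len(neighbours)
        )
    return pearson(degrees, neighbour_means)
```

A positive return value of `degree_neighbour_correlation` indicates degree assortativity in the sense used here; a triangle with one pendant node, for instance, gives a negative value.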

Based on the above analysis, we can think that small degree authors comprise two parts: one is composed of the authors of small research teams, and the other one comprises the unproductive authors belonging to small paper teams and to large research teams. Researchers in the small research team probably write a paper together, which causes them to have a high local clustering coefficient and a slight degree difference between them and their neighbors. Many authors in large research teams only write one paper, and the paper team only contains a few leaders. Hence, those authors would have a relatively high local clustering coefficient and a relatively small difference between their degree and the average degree of their neighbors. From the above analysis, we can infer small degree authors contribute to degree assortativity and high clustering of coauthorship networks, which fits the empirical data (Figure 4).

The collaborators of some productive authors may not coauthor with each other, and some productive authors often have many collaborators. A degree difference emerges between those authors and their neighbors, on average. Hence, we can infer that those large degree authors contribute negatively to degree assortativity and high clustering. The inference fits the empirical data: the tails of and of each empirical network show a trend different from that of the heads (Figure 4). Note that the authors of large paper teams also have a large degree but contribute to degree assortativity and high clustering. The existence of those authors causes the scattered points in the tails of and .

The model can generate research teams with a size distribution as inferred above. Due to the power function of reputation, the expected size of a research team of lead is proportional to for . This yields . With an argument similar to the one used to justify the reputation function, we can obtain the tail of the distribution of modelled research team sizes . When , the research team size is drawn from a Poisson distribution with an expected value proper to due to the Poisson point process of generating nodes. Hence, the small modelled research team sizes are drawn from a range of Poisson distributions with expected values taken from a power function. With proper parameters, those can be used as a basis to fit a given generalized Poisson distribution.

Most modelled hyperedges are generated by grouping a small fraction (around ) of nodes close in space, which expresses the spatial reciprocity principle. Moreover, to fit empirical hyperdegree distributions, we choose specific parameters which make a large fraction of nodes belong to only one hyperedge. Meanwhile, most modelled hyperedges contain one lead node, and only early lead nodes and a few nodes close to them can be persistently contained by new hyperedges. As a result, the small/large degree nodes contribute positively/negatively to degree assortativity and high clustering. Hence, the model reproduces the transitions well. In addition, the tails of proportional to also hold in modelled networks. For a lead node , the probability that its new team member coauthors with the former members is , where is the degree of node .

6. Discussions and Conclusions

Five typical mechanisms of cooperation evolution hold for academic collaborations, which inspires us to explore game mechanisms in the evolution of coauthorship networks. We define a cooperative game model on a circle and reveal how the costs and benefits of individuals generate a range of statistical and topological features of coauthorship networks, such as fat-tail and small-world. The model overcomes a weakness of the model in [27], namely its large number of parameters, and has the new ability to fit the distribution of paper team sizes and that of hyperdegree. Moreover, it has the potential to illuminate specific views and implications in the broader study of cooperative behaviors as follows.

Do there exist innate rules behind social complexity? The model provides an example of how individual strategies based on maximizing benefit-minus-cost and on specific randomness generate the complexity that emerges in coauthorship networks. The general idea of the model potentially bridges the cooperative game theory and specific social networks generated by human strategies, for example, social affiliation networks.

Does utilitarianism help the development of sciences? The strategy of maximizing benefit-minus-cost will give rise to flocking to famous research teams or to hot fields. Adopting such a strategy helps to collect publications and citations but suppresses diversity and consequently harms the flexibility of an academic environment. However, current academic evaluation methods and funding mechanisms are mainly oriented by specific indexes, for example, the number of citations. Specific regulations could be simulated through the model to work out a way to maintain the balance in the academic environment, while encouraging breakthroughs in key fields.

Appendix

Detecting Boundary Points

The following boundary detection algorithms come from [38].

Boundary point detection algorithm for probability density functions:

Input: Observations Ds, s = 1, …, n, rescaling function g(·), and fitting model h(·).
For k from 1 to max(D1, …, Dn) do:
 Fit h(·) to the PDF h0(·) of {Ds, s = 1, …, n | Ds ≤ k} by maximum-likelihood estimation;
 Do KS test for two data vectors g(h(t)) and g(h0(t)), t = 1, …, k, with the null hypothesis that they come from the same continuous distribution;
 Break if the test rejects the null hypothesis at significance level 5%.
Output: The current k as the boundary point.

Boundary point detection algorithm for general functions:

Input: Data vector h0(s), s = 1, …, K, rescaling function g(·), and fitting model h(·).
For k from 1 to K do:
 Fit h(·) to h0(s), s = 1, …, k by regression;
 Do KS test for two data vectors g(h(s)) and g(h0(s)), s = 1, …, k, with the null hypothesis that they come from the same continuous distribution;
 Break if the test rejects the null hypothesis at significance level 5%.
Output: The current k as the boundary point.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

The authors have contributed equally to this work. All authors conceived and designed the research and wrote the paper. Zheng Xie analyzed the data. All authors discussed the research and approved the final version of the manuscript.

Acknowledgments

This work is supported by the National Science Foundation of China (Grant no. 61773020).