Abstract

Finding suitable collaborators is valuable for researchers working on complex academic tasks. Academic collaborator recommendation models are usually based on the network embedding of academic collaborator networks, and most of them focus on the network structure, the text information, or a combination of the two. The text information of the nodes in an academic collaborator network also implies latent semantic relationships, which indicate similarity among researchers, but these relationships are often ignored. Capturing the latent semantic relationships among researchers in the academic collaborator network is therefore a challenge. In this paper, we propose a content-enhanced network embedding model for academic collaborator recommendation, namely, CNEacR. We build a content-enhanced academic collaborator network based on the weighted text representation of each researcher. This network contains both intrinsic collaboration relationships and latent semantic relationships. Firstly, the weighted text representation of each researcher is obtained from the researcher's text information. Secondly, a content-enhanced academic collaborator network is built from the similarity of the weighted text representations of researchers and the intrinsic collaboration relationships. Thirdly, each researcher is represented as a latent vector using network representation learning. Finally, the top-$k$ most similar researchers are recommended for each target researcher. Experimental results on real-world datasets show that CNEacR achieves better performance than academic collaborator recommendation baselines.

1. Introduction

In the era of big scholarly data, information overload has become a serious problem, and extracting useful information from it is challenging [1, 2]. Prior studies show that collaboration among researchers can increase their productivity and inspire new ideas [3, 4]. Academic collaborator recommendation, which aims to find proper collaborators for a target researcher, therefore plays an important role in complex academic tasks.

Academic information can be described as an academic collaborator network with attributes (as shown in Figure 1). Methods of academic collaborator recommendation fall into three categories: network-based recommendation, content-based recommendation, and hybrid recommendation. In network-based recommendation, the structure of the network is used to improve the performance of recommending researchers [5]. Probability theory and graph theory have been used to model and analyze coauthor networks [6]. Another line of network-based recommendation uses the classic random walk model, which can extract useful information from the academic collaborator network. In content-based recommendation, the interest of the researcher is an important attribute that characterizes research topics, fields, and other personalized features [7–9]. It can be mined from the papers that the researcher publishes every year, and relationships among researchers can also be established through interest detection. Compared with these methods, which consider either the network structure or the interests of the researcher alone, combining network topology and text information is more effective. In hybrid recommendation, using both text information and network structure can improve the learned latent representation of each researcher [10–12]. Existing hybrid models usually learn feature representations of each researcher from text information and from network structure independently and then combine the two into a unified latent representation. They do not exploit the complex relationships between text information and network structure [13]. Such hybrid methods improve academic collaborator recommendation, but they ignore the latent semantic relationships formed by the text information in an academic collaborator network.

To capture the latent semantic relationships and improve academic collaborator recommendation, we use the text information of each researcher to build a content-enhanced network and propose the CNEacR model. CNEacR builds a content-enhanced academic collaborator network that contains the intrinsic collaboration relationships and the latent semantic relationships formed by the text information. Firstly, CNEacR obtains the weighted text representation of each researcher and then builds a content-enhanced academic collaborator network based on the similarity of these representations. Secondly, a high-quality latent representation is obtained by network embedding. Finally, the similarity between researchers is calculated as the cosine similarity of their latent representations. Experimental results on real-world datasets demonstrate that CNEacR outperforms all baseline methods on precision, recall, F1, and normalized discounted cumulative gain (NDCG).

The main contributions of this paper are summarized as follows:
(1) A content-enhanced academic collaborator network, which contains not only intrinsic collaboration relationships but also the latent semantic relationships formed by the text information, is built using the similarity among the weighted text representations of researchers.
(2) To build the content-enhanced academic collaborator network, the weighted text representation of each researcher is obtained from the text information. Edges are then added between each node and its semantically similar nodes.
(3) Experimental results on the datasets demonstrate that CNEacR performs better than other methods of academic collaborator recommendation.

2. Problem Definitions

2.1. Academic Collaborator Recommendation

Given a researcher set $A = \{a_1, a_2, \ldots, a_n\}$, where $n$ is the number of researchers in the set, each researcher $a_i$ has text information $d_i = \{t_{i1}, t_{i2}, \ldots, t_{im}\}$, where $t_{ij}$ represents the $j$-th term for $a_i$ and $m$ is the number of terms for $a_i$. The structure of the academic collaborator network is $G = (A, E)$, where $e_{ij} \in E$ represents the relationship between researcher $a_i$ and researcher $a_j$, $i \neq j$. Academic collaborator recommendation aims to produce, for a given target researcher $a_i$, a ranked list of the most relevant researchers from the researcher collection $A$.

2.2. Content-Enhanced Academic Collaborator Network

The academic collaborator network can be denoted as $G = (A, E)$, where $A = \{a_1, a_2, \ldots, a_n\}$, $a_i$ is the $i$-th researcher, and $E = \{e_{ij}\}$. Each $e_{ij}$ indicates whether a collaborative relationship exists: $e_{ij} = 1$ denotes that there exists a collaborative relationship between researcher $a_i$ and researcher $a_j$; otherwise, it does not exist. In the academic collaborator network, we first use the TF-IDF model to evaluate the importance of each term and then embed each term into a vector with Word2vec. The resulting vector $r_i$ is the weighted text representation of researcher $a_i$. Secondly, relevant researchers are listed based on the cosine similarity between the vectors $r_i$. Finally, we get a relationship set $E' = \{e'_{ij}\}$, $i \neq j$. If the semantic relationship between $a_i$ and $a_j$ exists, $e'_{ij} = 1$. The content-enhanced academic collaborator network is then built as $G' = (A, E \cup E')$.

3. Methodology

In this section, we explain CNEacR in detail. CNEacR builds the weighted text representation of each academic researcher and constructs a content-enhanced academic collaborator network. Then, we maximize the co-occurrence probability to obtain the high-quality latent representation of each researcher. Finally, the top-$k$ researchers are recommended for a target researcher via the similarity of the latent representations. Some important notations are shown in Table 1. We summarize the framework of our proposed CNEacR in Figure 2 and show the whole algorithm framework in Algorithm 1.

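Algorithm 1: CNEacR
Input: the academic collaborator network $G$, the text information set of all researchers $D$, the number of semantic neighbors $p$, the recommendation list length $k$
Output: the top-$k$ list for a target researcher
  Network construction:
(1) for each researcher $a_i \in A$ do
(2)   calculate the weighted text representation $r_i$ by equation (3)
(3) end for
(4) for each researcher $a_i \in A$ do
(5)   for each researcher $a_j \in A$, $j \neq i$ do
(6)     calculate the similarity between $r_i$ and $r_j$ by equation (4)
(7)   end for
(8)   choose the $p$ most similar researchers for $a_i$
(9) end for
(10) construct the content-enhanced network $G'$
(11) map $G'$ into a low-dimensional space to get the latent representation X of all researchers
(12) Recommendation:
(13) for each target researcher $a_i \in A$ do
(14)   for each researcher $a_j \in A$, $j \neq i$ do
(15)     calculate the similarity between the latent representations of $a_i$ and $a_j$ by equation (4)
(16)   end for
(17)   recommend the top-$k$ most similar collaborators for $a_i$
(18) end for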

As shown in Figure 2, all nodes in the dataset belong to the test set. To validate the effectiveness of our algorithm, the real collaborative relationships among nodes are divided into two classes, collaborative relationships and unknown relationships, following [13]. The collaborative relationships are the edges of the academic collaborator network, that is, they define the structure of the network. The unknown relationships do not participate in the algorithm; they are held out and compared with the recommended top-$k$ collaborative relationships. The ratio of collaborative relationships to unknown relationships is discussed in Section 4.4.3.

3.1. Building Weighted Text Representation of the Academic Researcher

It is fundamental to represent the text in many natural language processing (NLP) tasks. There are many methods to extract the feature representation of the researcher from text information, including probabilistic latent semantic analysis (pLSA) [14], latent Dirichlet allocation (LDA) [15], Word2vec [16], and BERT [17]. Word2vec is widely used to generate more accurate feature representations based on text information in a specific scenario. So, we choose Word2vec to get weighted text representation.

Given the text information set of all researchers $D = \{d_1, d_2, \ldots, d_n\}$, where $d_i$ represents the text information of the $i$-th researcher, composed of the titles of his or her published papers, we write $d_i = \{t_{i1}, t_{i2}, \ldots, t_{im}\}$, where $t_{ij}$ is the $j$-th term in $d_i$. Standard preprocessing operations, such as segmenting, filtering, and extracting, are applied to the set $D$. Each term $t_{ij}$ is represented as a vector $v_{ij}$ using Word2vec. TF-IDF is used to describe the importance of each term to the text information of different researchers. It is defined as follows:

$$\mathrm{TF\text{-}IDF}(t_{ij}, d_i, D) = \mathrm{tf}(t_{ij}, d_i) \times \mathrm{idf}(t_{ij}, D), \qquad \mathrm{idf}(t_{ij}, D) = \log \frac{n}{|\{d \in D : t_{ij} \in d\}|}, \tag{1}$$

where $D$ is the text information set of all researchers, $n$ is the total number of researchers in the dataset, $d_i$ represents the text information of each researcher, and $t_{ij}$ represents the $j$-th term in $d_i$. $|\{d \in D : t_{ij} \in d\}|$ is the number of researchers whose text information contains the term $t_{ij}$. $\mathrm{tf}(t_{ij}, d_i)$ stands for the term frequency of the term $t_{ij}$ in the text information of the $i$-th researcher, and $\mathrm{idf}(t_{ij}, D)$ is the inverse document frequency. The frequent occurrence of a term in a researcher's text information means that this term is important to the researcher. However, if a term appears in the text information of many researchers, it is common to all texts and less important to each individual researcher. TF-IDF is therefore used to weight the importance of a term in the text information of each researcher. As described above, the weights of the terms in the text information of researcher $a_i$ can be defined as follows:

$$w_{ij} = \mathrm{TF\text{-}IDF}(t_{ij}, d_i, D). \tag{2}$$

The weighted text representation of researcher $a_i$ can be defined as follows:

$$r_i = \frac{1}{m} \sum_{j=1}^{m} w_{ij} \, v_{ij}. \tag{3}$$

Since each researcher has a different amount of text information, we normalize the weighted text representation of each researcher. Here $m$ is the number of terms in the text information of each researcher, $v_{ij}$ is the vector of the $j$-th term of the $i$-th researcher learned by Word2vec, and $w_{ij}$ is the weight of the $j$-th term of the $i$-th researcher.
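The weighting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `tfidf_weights` and `weighted_text_representation` are hypothetical, the toy `term_vectors` dictionary stands in for vectors that would come from a trained Word2vec model, and the unsmoothed inverse document frequency is an assumption.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: TF-IDF weight} dict per researcher.

    docs: list of token lists, one list per researcher (hypothetical input format).
    """
    n = len(docs)
    # Document frequency: number of researchers whose text contains the term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def weighted_text_representation(doc, term_weights, term_vectors):
    """Sum the TF-IDF-weighted term vectors and normalize by the number of terms."""
    if not doc:
        return []
    dim = len(term_vectors[doc[0]])
    rep = [0.0] * dim
    for term in doc:
        for d, value in enumerate(term_vectors[term]):
            rep[d] += term_weights[term] * value
    return [x / len(doc) for x in rep]


# Toy data: two researchers and 2-dimensional stand-ins for Word2vec vectors.
docs = [["graph", "embedding", "graph"], ["graph", "topic"]]
term_vectors = {"graph": [1.0, 0.0], "embedding": [0.0, 1.0], "topic": [1.0, 1.0]}
weights = tfidf_weights(docs)
reps = [weighted_text_representation(d, w, term_vectors) for d, w in zip(docs, weights)]
```

In this toy example the term "graph" occurs in both researchers' text, so its weight is zero and only the distinguishing terms contribute to each representation.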

3.2. Constructing the Content-Enhanced Academic Collaborator Network

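Given an academic collaborator network $G = (A, E)$, where $A$ is the researcher set and $E$ represents the collaborative relationships among researchers, we calculate the similarity of any two nodes from their weighted text representations $r_i$ and $r_j$ using the widely used cosine similarity:

$$\mathrm{sim}(r_i, r_j) = \frac{r_i \cdot r_j}{\|r_i\| \, \|r_j\|}. \tag{4}$$

We then generate relationships $E' = \{e'_{ij}\}$, $i \neq j$; each $e'_{ij}$ is defined as follows:

$$e'_{ij} = \begin{cases} 1, & a_j \in S_p(a_i), \\ 0, & \text{otherwise}, \end{cases} \tag{5}$$

where $S_p(a_i)$ is the set of the top $p$ researchers in the similarity list of researcher $a_i$, and $p$ is a hyperparameter. If $e'_{ij} = 1$, $e'_{ij}$ is a new relationship. We add these new relationships to $G$ and obtain a new academic collaborator network $G' = (A, E^{*})$, where $E^{*} = E \cup E'$, which is our content-enhanced academic collaborator network.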
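As a rough sketch of this construction step, the snippet below links each researcher to the most similar researchers by cosine similarity and merges those edges with the existing collaboration edges. The names `cosine`, `content_enhanced_edges`, and `num_neighbors` are hypothetical, edges are assumed to be undirected, and the brute-force pairwise comparison is chosen for clarity rather than efficiency.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def content_enhanced_edges(collab_edges, reps, num_neighbors):
    """Union of collaboration edges and edges to each node's most similar nodes.

    collab_edges: set of frozenset({i, j}) undirected collaboration edges.
    reps: list of weighted text representations, indexed by researcher.
    num_neighbors: how many semantically similar researchers to link per node.
    """
    edges = set(collab_edges)
    n = len(reps)
    for i in range(n):
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: cosine(reps[i], reps[j]),
            reverse=True,
        )
        for j in ranked[:num_neighbors]:
            edges.add(frozenset((i, j)))
    return edges


# Toy data: researchers 0 and 1 are semantically close, as are 2 and 3.
reps = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
collab = {frozenset((0, 2))}
enhanced = content_enhanced_edges(collab, reps, num_neighbors=1)
```

With `num_neighbors=0` the function returns the original collaboration edges unchanged, which corresponds to using the network structure alone.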

3.3. Network Embedding

The latent representation of each researcher is the input feature of many downstream tasks, such as classification, link prediction, clustering, and visualization. To map researchers into a low-dimensional space $\mathbb{R}^{h}$, $h \ll n$, network embedding aims to learn a function $f: A \to \mathbb{R}^{h}$. Let $X$ denote the embedded vectors in the latent space. $X$ should maintain as much of the original network topology information as possible. There exist many network embedding methods, such as DeepWalk [18], LINE [19], Node2Vec [20], and GCN [21]. In this paper, local information and global information are equally important for each target researcher, so DeepWalk is suitable for obtaining a high-quality latent representation.

Given a content-enhanced academic collaborator network, we use the DeepWalk model to represent the relationships of academic collaboration. Intuitively, for academic collaborator recommendation, local information (namely, the neighborhood) and global information are equally important. We use latent academic collaborative relationships obtained from random walks to learn the latent representation of each academic researcher. For each walk sequence $s = \{a_1, a_2, \ldots, a_{|s|}\}$, following skip-gram, we aim to maximize the probability of the neighbors of researcher $a_i$ in this walk sequence as follows:

$$\max_{f} \; \log \Pr\big(\{a_{i-w}, \ldots, a_{i+w}\} \setminus a_i \mid f(a_i)\big), \tag{6}$$

where $w$ is the window size, $f(a_i)$ is the current representation of researcher $a_i$, and $\{a_{i-w}, \ldots, a_{i+w}\} \setminus a_i$ is the set of local context researchers of $a_i$.
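To make the walk-and-window idea concrete, the following sketch generates uniform truncated random walks in the style of DeepWalk and extracts the (center, context) pairs that a skip-gram model would be trained on. The skip-gram training itself is omitted; the function names, the adjacency-list format, and the parameter names are all assumptions made for illustration.

```python
import random


def random_walks(adj, walks_per_node, walk_length, seed=0):
    """Generate uniform random walks from every node.

    adj: dict mapping each node to a list of its neighbors (hypothetical format).
    """
    rng = random.Random(seed)
    nodes = list(adj)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit start nodes in a fresh random order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


def context_pairs(walk, window):
    """Return (center, context) pairs within the given window of each position."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        pairs.extend((center, walk[j]) for j in range(lo, hi) if j != i)
    return pairs


# Toy path graph 0 - 1 - 2.
adj = {0: [1], 1: [0, 2], 2: [1]}
walks = random_walks(adj, walks_per_node=2, walk_length=4)
pairs = context_pairs([0, 1, 2], window=1)
```

Each walk plays the role of a sentence and each researcher the role of a word, so the resulting pairs can be passed to any skip-gram implementation to learn the embeddings.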

Finally, we use hierarchical softmax [22] to obtain the embedding vector of each researcher, $x_i \in \mathbb{R}^{h}$. The latent representation of each researcher fuses the researcher's text information and the network structure. $h$ is a hyperparameter that sets the dimension of the latent representation of the researcher. For each target researcher, we can get the top-$k$ similar collaborators according to equation (4).
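The final ranking step can be illustrated as follows: given learned embeddings, rank all other researchers by cosine similarity to the target and keep the first few. The function name `recommend_top_k` and the dictionary-of-lists embedding format are hypothetical, and the toy vectors are not learned embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend_top_k(target, embeddings, k):
    """Return the k researchers whose embeddings are most similar to the target's."""
    scored = [
        (cosine(embeddings[target], vec), other)
        for other, vec in embeddings.items()
        if other != target
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [other for _, other in scored[:k]]


# Toy embeddings for four researchers.
embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
top2 = recommend_top_k("a", embeddings, k=2)
```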

4. Experiment

In this section, we evaluate our proposed CNEacR model on two real-world datasets. We introduce datasets, baselines, evaluation criteria, and the results of experiments in detail.

4.1. Datasets

PRB (Physical Review B) from the APS (American Physical Society) (https://journals.aps.org/datasets) consists of articles on physics. First, we perform name disambiguation on authors from 1893 to 2015 based on [23]. Authors who have fewer than 2 collaborators from 2006 to 2010 are removed. Finally, we extract 34,905 authors and 14,055 papers to evaluate our proposed CNEacR. AMiner, a larger-scale dataset, is also adopted: we randomly choose 14,000 papers, which contain 20,057 researchers who have more than 10 papers. Table 2 shows the details of the datasets. Some necessary cleaning is done, such as removing excess code fragments, removing stop words, tokenization, and lemmatization.

To evaluate the performance of CNEacR, we treat all researchers in the dataset as target researchers. A ratio $R$ of the collaborator relationships of each researcher is used as the training samples, and the remaining $1 - R$ of the collaborator relationships is used as the test target, following [13]. In the experiments, we sample relationships with the ratio R 10 times to ensure that the selected relationships cover as many authors as possible. All experiments are performed on a 64-bit Linux-based operating system, Ubuntu 16.04, with a 64-duo 2.10 GHz Intel CPU and 1 TB of memory. All the programs are implemented with Python.
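A simple way to picture the training/test split is shown below. This sketch splits the whole edge list at random by a given ratio, which is a simplification: the setup described above splits the relationships of each researcher, and the exact sampling procedure is not specified here. The function name `split_relationships` and the tuple-based edge format are hypothetical.

```python
import random


def split_relationships(edges, ratio, seed=0):
    """Randomly split edges into a training part (ratio) and a held-out test part."""
    rng = random.Random(seed)
    shuffled = sorted(edges)  # sort first so the split is reproducible for a given seed
    rng.shuffle(shuffled)
    cut = int(round(ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Toy collaboration edges between researchers 0..4.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
train, test = split_relationships(edges, ratio=0.8)
```

Repeating the split with different seeds gives several training sets, in the spirit of the repeated sampling described above.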

4.2. Baselines

We compare CNEacR with the following six methods, where the first is the classic method for academic collaborator recommendation. The baselines consist of the following:
(1) MVCWalker: MVCWalker [24] is a random walk model for collaborator recommendation built on random walk with restart. It combines three academic factors: coauthor order, latest collaboration time, and times of collaboration.
(2) TNERec-G: TNERec-G is the part of TNERec that uses only the structure of the academic collaborator network to get the feature representation of the researcher for collaborator recommendation.
(3) CTPF: CTPF [25] is a probabilistic model of articles that represents researchers by their preferences for topics. It integrates two ideas: collaborative topic regression and Poisson factorization.
(4) TNERec: TNERec [13] is an academic collaborator recommendation method that learns one feature representation from the interests of the researcher based on the topic model and another from the structure of the academic collaborator network using network embedding, and then fuses them using a spectral technique for better collaborator recommendation.
(5) CNEacR-G: CNEacR-G is the part of CNEacR that uses only the structure of the academic collaborator network to get the feature representation of the researcher for collaborator recommendation (it does not use any semantic relationship).
(6) CNEacR-T: CNEacR-T is the part of CNEacR that uses only the text information of the researcher to recommend collaborators, that is, a text-based recommendation.

4.3. Evaluation Criteria

We use the most common evaluation criteria in information retrieval as the academic collaborator recommendation evaluation metrics.

Precision@k is the ratio of correctly recommended collaborators to the top-$k$ recommended candidates for the target researcher. Precision@k is defined as follows:

$$\mathrm{Precision@}k = \frac{1}{N} \sum_{a} \frac{|L_k(a) \cap T(a)|}{|L_k(a)|}. \tag{7}$$

Recall@k is the ratio of correctly recommended collaborators to all real collaborators of the target researcher in the test set. The recall value is computed as follows:

$$\mathrm{Recall@}k = \frac{1}{N} \sum_{a} \frac{|L_k(a) \cap T(a)|}{|T(a)|}, \tag{8}$$

where $N$ is the number of target researchers, $L_k(a)$ is the top-$k$ recommended researcher list for the target researcher $a$, and $T(a)$ is the set of real collaborators of the target researcher in the test set. F1 is the harmonic mean of precision and recall, and F1 is defined as follows:

$$\mathrm{F1@}k = \frac{2 \times \mathrm{Precision@}k \times \mathrm{Recall@}k}{\mathrm{Precision@}k + \mathrm{Recall@}k}. \tag{9}$$

IDCG is the discounted cumulative gain (DCG) of the best possible recommendation list, and NDCG is the DCG of the recommended list normalized by IDCG. We define $rel_i$ as the rating of the $i$-th researcher in the recommended researcher list. If $rel_i = 1$, the recommended collaborator is relevant, and $rel_i = 0$ otherwise. NDCG@k is defined as follows:

$$\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}, \qquad \mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}. \tag{10}$$
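For a single target researcher, these metrics can be computed as in the sketch below, using the standard binary-relevance definitions; averaging over all target researchers is left to the caller. The function names and the list/set input formats are hypothetical, and the ideal ranking for IDCG is assumed to place all real collaborators at the top of the list.

```python
import math


def precision_recall_f1_at_k(recommended, relevant, k):
    """Precision, recall, and F1 of the top-k recommendations for one researcher."""
    hits = sum(1 for r in recommended[:k] if r in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def ndcg_at_k(recommended, relevant, k):
    """NDCG of the top-k list with binary relevance (1 if a real collaborator)."""
    # Position i is 0-based here, so the discount is log2(i + 2).
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, r in enumerate(recommended[:k])
        if r in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


# Toy example: two of the top-3 recommendations are real collaborators.
recommended = ["b", "c", "d", "e"]
relevant = {"b", "d", "x"}
precision, recall, f1 = precision_recall_f1_at_k(recommended, relevant, k=3)
ndcg = ndcg_at_k(recommended, relevant, k=3)
```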

4.4. Experiment Results and Parameter Analysis

Table 3 shows the performance comparison: CNEacR outperforms all baselines on precision, recall, F1, and NDCG. We also present the results of CNEacR-G and CNEacR-T on PRB. CNEacR-G uses only the collaborator relationships in the network, and CNEacR-T uses only the text information of each researcher. To make the results more convincing, we also run the experiment on AMiner and compare CNEacR with the two kinds of methods, content-based recommendation and network-based recommendation. Table 4 shows the results of the experiment on AMiner.

From Tables 3 and 4, we know that CNEacR-G does not use the text information, and the results are poor. CNEacR-T does not use the network structure, and the results are not good enough. We can see that utilizing both text information and network structure plays an important role in academic collaborator recommendation. We demonstrate the performance in different recommendation lists and analyze different results when choosing different training sets of ratios in PRB. As an auxiliary experiment, we only demonstrate the performance in .

4.4.1. Parameter

We analyze the parameter used to build the semantic relationships among researchers, that is, the number of similar researchers linked to each researcher, on the two datasets. Similar to [13], we set the length of the recommendation list to 5 and set the parameter to 0, 1, 2, 3, 4, and 5, respectively, to build the content-enhanced academic collaborator network. Figure 3 shows the comparison results of CNEacR for the different parameter values on the two datasets. From Figure 3, we can see that the best value differs across datasets: CNEacR performs best with a value of 2 on PRB and with a value of 1 on AMiner. Figure 3 also shows that the parameter has a large influence on the performance of CNEacR. As the parameter increases, the number of uncertain relationships increases, which reduces the ability of CNEacR to capture real collaborative relationships.

4.4.2. Influence of the Recommendation List

We analyze the performance of CNEacR with different lengths of the recommendation list. We choose the ratio of the training set to conduct our experiment and set the dimension of the researcher vector to 100. The parameter for building semantic relationships is set to 2. Figure 4 compares our proposed model with other methods in terms of precision, recall, F1, and NDCG. As the length of the recommendation list increases, the precision of CNEacR, CNEacR-T, CNEacR-G, TNERec, TNERec-G, and CTPF shows a downward trend, whereas the precision of MVCWalker goes up at first and then goes down. The recall of all methods shows an upward trend. The F1 of all methods first increases and then decreases. The NDCG of all methods remains steady. Network-based and content-based collaborator recommendations each work well on their own, and the experimental results verify that our method, which uses both the weighted text representation and the academic collaborator network, performs well compared with all the above methods.

4.4.3. Influence of Ratio

To rule out chance in the experimental results, we use training sets of different sizes to evaluate the performance of CNEacR. We vary the ratio R from 20% to 80% and set the recommendation list size to 3. We also set the dimension of the latent representation of the researcher to 100 and set the parameter for building semantic relationships to 2. Figure 5 shows the performance compared with other methods for different R in terms of precision, recall, F1, and NDCG. CNEacR clearly outperforms the other methods on all four metrics regardless of R. From Figure 5, all methods follow the same trends except the network-based methods, including CNEacR-G and TNERec-G. CNEacR is consistently better than the network-based, content-based, and hybrid recommendation methods.

4.5. Case Study

Table 5 shows a case study of different methods for collaborator recommendation. We randomly select one researcher from the test set for demonstration and use three methods to recommend the top 5 collaborators for this target researcher. From the table, we can see that CNEacR-G correctly provides only one collaborator, which indicates that CNEacR-G captures the information of the network structure. CNEacR-T correctly recommends one additional collaborator compared with CNEacR-G, which indicates that CNEacR-T can capture the information of semantic relationships. Our method CNEacR correctly recommends four collaborators, including two additional collaborators compared with CNEacR-T. This indicates that using the weighted text representation together with the intrinsic collaborative relationships yields better performance than content-based and network-based recommendation alone. The four collaborators correctly recommended by CNEacR include the researchers recommended by both CNEacR-G and CNEacR-T, which indicates that CNEacR can capture both the semantic relationships and the collaborative relationships to recommend latent academic collaborators for the target researcher.

5. Related Work

At present, it is common for researchers to collaborate in research [26]. A researcher who collaborates with others tends to be far more productive than one who always works independently [27]. Finding proper collaborators from complex and unstructured data is therefore essential for researchers. Recently, much work has been done to help researchers find proper collaborators. This work on academic collaborator recommendation falls mainly into three categories: network-based recommendation, content-based recommendation, and hybrid recommendation.

In an academic collaborator network, academic collaborator recommendation is usually modeled as a link prediction problem. The key to predicting relationships from the structural features of the academic collaborator network is to calculate the similarity among researchers. In [28], Jeh and Widom used SimRank scores, based on a simple and intuitive graph-theoretic model, to measure the similarity between two researchers. However, SimRank cannot exploit all paths of different lengths in the network. To overcome this problem, the approach in [29] provided more accurate and faster friend recommendations by traversing all paths of limited length. Recently, new measurements such as relative entropy [30] and network motif [31] were proposed. The most popular model in the field of collaborator recommendation is the random walk [32]. Several works build on the random walk for academic collaborator recommendation, and it has been shown to be effective for calculating the rank score of researchers in the academic collaborator network [33]. These methods use the weights on edges to guide the random walker on the academic collaborator network [24, 34, 35]. The weights are derived from the affiliated institution of the researcher or from academic factors, such as coauthor order, latest collaboration time, and the times of collaboration. MVCWalker used the rich information of both nodes and links to uncover the similarity structure of the academic collaborator network based on probability [33, 36]. However, a random walker can only extract information from the structure of the academic collaborator network.

Using structural features is not sufficient for academic collaborator recommendation. The proposed models for computing similarity between two researchers were based on expertise profiles extracted from their publications and academic home pages [7]. Kong et al. held that the interest of each researcher was very important for academic collaborator recommendation. The topic model was used to mine the text information of researchers each year to obtain the topic information and then cluster the topics as the researchers’ interests [8]. The cross-domain topic learning model used topic layers to replace author layers to alleviate the sparseness issue and topic skewness for different discipline collaborations [9].

The text information and the network structural information are equally important to academic collaborator recommendation, and several hybrid recommendation models exist. One approach combined the structural information and user-generated content and introduced a generative model to help people find friends on Twitter and Flickr [10]. CCRec clustered the topics of each researcher's text information and used the structure of the academic collaborator network to find the most relevant latent collaborators [11]. A hybrid algorithm with eight measures was proposed to recommend latent academic collaborators across different disciplines [37]. Another method generated high-quality researcher profiles by integrating researchers' expertise, coauthor network characteristics, and researchers' institutional connectivity into a unified framework with SVM-rank [38]. It was applied in the ScholarMate system, which is a virtual academic community for promoting researchers' collaboration. Coauthor relationships have also been predicted based on content, social, and hybrid recommendation algorithms [12]. Kong et al. argued that fusing the topic model and academic relationships could improve the performance of academic collaborator recommendation [13]. However, the topic model gives the probability distribution of words and documents, which only reflects their implied topics. The title of a paper is usually short, but it contains the main idea of the whole paper and can distinctly express the research field of a researcher. Word2vec [16] is based on the text information (i.e., semantic and syntactic) of a researcher and can express the researcher's feature representation in specific application scenarios. In this paper, we use the weighted text representation to represent each researcher. Then, a content-enhanced network is built according to the similarity between every two researchers to predict collaborative relationships.

6. Conclusion

In this paper, we propose a novel CNEacR method to recommend academic collaborators. CNEacR utilizes the weighted text representation to build a content-enhanced academic collaborator network that contains not only intrinsic collaborative relationships but also the latent semantic relationships formed by the text information. From this network, we use network embedding to get high-quality latent representation, which captures the latent semantic relationships among researchers. Extensive experiments on the real-world datasets demonstrate the effectiveness of CNEacR and its superiority over several existing methods.

In this work, we consider only strong relationships (the paper content and the academic relationships), whereas weak-tie relationships, such as publishing in the same conference or journal, should also be considered. Because two papers from the same conference or journal usually share a research field, their authors are likely to build a collaborative relationship in the future. We will therefore take weak-tie relationships into account in future work.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Grant nos. 61876001, 61602003, and 61673020), National High Technology Research and Development Program (Grant no. 2017YFB1401903), the Provincial Natural Science Foundation of Anhui Province (Grant no. 1708085QF156), the Major Program of the National Social Science Foundation of China (Grant no. 18ZDA032), and the Recruitment Project of Anhui University for Academic and Technology Leader.