Abstract

With the increasing number of web services on the Web, selecting appropriate services to meet a developer's needs for mashup development has become a difficult task. To tackle the problem, various service recommendation methods have been proposed. However, challenges remain, including the sparsity and imbalance of features, as well as the cold-start of mashups and services. To address these challenges, in this paper we propose a Multigraph Convolutional Network enhanced Neural Factorization Machine model (MGCN-NFM) for service recommendation. It first constructs three graphs, namely, the collaborative graph, the description graph, and the tag graph. Each graph represents a different type of relation between mashups and services. Next, graph convolution is performed on the three graphs to learn the feature embeddings of mashups, services, and tags: each node iteratively aggregates information from its higher-order neighbors through message passing in each graph. Finally, the feature embeddings, as well as the description features learned by Doc2vec, are modeled by the neural factorization machine model, which captures the nonlinear and higher-order interactions between features. We conduct extensive experiments on the ProgrammableWeb dataset and demonstrate that our proposed method outperforms state-of-the-art factorization machine-based methods in service recommendation.

1. Introduction

Service-oriented computing (SOC) has become a significant paradigm for developing low-cost and reliable software applications in software engineering and cloud computing [1]. Web services are the basic building blocks of service-oriented computing, which encapsulate application functionalities and can be accessed through standard interfaces. Nowadays, an increasing number of web services, mainly in the form of RESTful Web APIs, have been published online. According to the recent statistics of ProgrammableWeb (https://www.programmableweb.com/), the largest online Web API registry, there are over 24,000 Web APIs available. However, the functionality of an individual service is limited and cannot satisfy the complex requirements of developers. As a result, it is common for developers to compose existing services and develop value-added services, also called mashups [2]. For example, a developer may create a new mashup that can display ratings and reviews of restaurants on a regional map by integrating Google Map Service and Yelp Service. However, creating a mashup can be difficult and time-consuming for an inexperienced developer due to the overwhelming number of services on the Internet. Therefore, it is vital to proactively recommend appropriate services that can satisfy the developer’s complex requirements and reduce the burden of manual selection, especially in the era of the Internet of Things and Big Data [3].

A variety of methods have been proposed to recommend services for mashup development, and they can be divided into three classes: collaborative filtering-based methods, content-based methods, and hybrid methods [4]. Different types of methods focus on modeling different sources of information. Collaborative filtering-based methods make use of the past composition history of mashups; content-based methods make use of user requirements and service descriptions; hybrid methods combine the two and additionally incorporate various kinds of auxiliary information, such as tags, categories, developers, and QoS.

In this paper, we adopt a factorization machine-based model, which is a type of hybrid method for service recommendation. Factorization Machine (FM), first proposed by Rendle in [5], is a supervised machine learning algorithm for classification, regression, and ranking tasks. It has achieved great success and is widely employed for product recommendation, as it can efficiently model high-dimensional sparse features and their interactions [6]. As a result, it has attracted considerable research attention for service recommendation. For instance, in [7], Cao et al. use FM with topic similarities, co-occurrence, and popularity as input features for service recommendation. In [8], Cao et al. propose an attentional FM model that discriminates the importance of different feature interactions with an attention mechanism. In [9], Kang et al. propose a hybrid FM model that integrates a deep neural network to capture nonlinear and higher-order feature interactions.

Although previous FM-based models have achieved great results for service recommendation, there are still challenges that can limit the performance of the existing methods:

(i) Feature sparsity and imbalance problems: the performance of factorization machine-based models relies heavily on the co-occurrence of different features. However, the features used for service recommendation are generally sparse. (1) Although there are numerous services on the web, each mashup composes only a few services, and most services are composed by mashups only a few times, leading to a sparse mashup-service composition record. Moreover, only a small portion of services are composed frequently, while the majority of services are rarely composed, resulting in the imbalance problem. It is difficult to learn latent factors for mashups and rarely composed services with such limited data. (2) Even if contextual information such as categories, tags, and providers is incorporated, the features themselves may suffer from sparsity and imbalance. For instance, most tags are used by only a few services, while a few popular tags are frequently used by a majority of services. In this case, the model struggles to learn informative latent factors for infrequently used tags, and the benefit of incorporating tag information into the model is reduced.

(ii) Cold-start problem: when a new mashup first enters the development cycle, it does not contain any component services. We refer to such mashups as "cold-start mashups". As FM can only learn latent factors for features that appear in the training data, the latent factors of a cold-start mashup cannot be learned due to the lack of composition data. Similarly, we refer to services that have never been composed by any mashup as "cold-start services"; it is likewise impossible to learn their latent factors. As a result, when a cold-start mashup or service is encountered, FM can only make predictions by setting the latent factors of the mashup or service to random values and relying on the latent factors of contextual information such as descriptions. This is unlikely to lead to good recommendation results for cold-start mashups.

To address the aforementioned issues, we aim to learn informative latent factors for sparse and cold-start mashups and services in FM. To this end, we enhance FM with a Multigraph Convolutional Network (MGCN), which exploits various relations between mashups and services. We dub our model the Multigraph Convolutional Network enhanced Neural Factorization Machine, or MGCN-NFM for short. In particular, we construct three graphs, namely, the collaborative graph, the description graph, and the tag graph. Each graph reflects a different type of relation, and the different relations complement each other to enrich the information about a mashup or service. Next, we use graph convolution on each graph to further enrich the relational features. The intuition is that by exploiting information from higher-order neighbors, the representations of sparse nodes can be enhanced. To reduce the computational complexity, we use a neighbor sampling technique that maintains a small representative set of neighbors for each node in the graph convolution process. After graph convolution, each mashup or service has three enhanced feature representations with different characteristics. These feature embeddings, together with the tag feature embeddings learned from the tag graph and the description features learned by Doc2vec, are used as the input of FM to capture fine-grained feature interactions. In this paper, we adopt the neural factorization machine model (NFM) proposed in [10]. It enhances vanilla FM by modeling higher-order and nonlinear feature interactions with a bilinear interaction pooling layer and several deep neural network layers. Different from [10], the embedding layer of NFM is obtained from the MGCN instead of an embedding lookup table. Our model effectively alleviates the feature sparsity, imbalance, and cold-start problems thanks to the multigraph construction and convolution process. For sparse features, their embeddings have more chances to be updated during training, as they are connected to more multihop nodes through different types of relations. For cold-start mashups or services, since they have (multihop) connections to other nodes in the description graph and tag graph, their feature embeddings can be learned when updating the embeddings of their neighbors.

The contributions of this work can be summarized as follows:

(i) We introduce a novel framework, MGCN-NFM, for service recommendation by enhancing the neural factorization machine model with a multigraph convolutional network that learns the latent factors of various features.

(ii) We alleviate the feature sparsity, imbalance, and cold-start problems by constructing three graphs reflecting different types of relations between mashups and services, and by using multigraph convolution to enrich the sparse relations.

(iii) We conduct extensive evaluations of our proposed model on the ProgrammableWeb dataset, and the results show that our model achieves better performance than state-of-the-art FM methods for service recommendation.

The rest of this article is organized as follows: Section 2 presents the related work of service recommendation. Section 3 introduces the details of our proposed approach. Section 4 reports the experimental results and analysis. Section 5 concludes this paper with future work.

2. Related Work

Recommending web services to developers according to their mashup requirements is a hot topic in service computing. Most works use the dataset from ProgrammableWeb, and they exploit different kinds of auxiliary information, including functional descriptions, tags, categories, providers, architectural styles, etc., as the mashup-service composition record is extremely sparse. Based on the modeling of the mashup-service composition record, existing research can be roughly divided into three categories: neighbor-based collaborative filtering (CF) methods [11–15], latent factor-based CF methods [6, 16–21], and deep learning-based methods [8, 9, 22–27].

Neighbor-based CF methods recommend services based on the identification of similar users (user-based CF) or similar items (item-based CF). They are widely used in QoS-aware service recommendation, where the Quality of Service (QoS) information of services is available [28, 29]. In the absence of QoS information, existing works on service recommendation for mashups calculate the matching degree between mashups and services from different information sources with different methods and aggregate the results. As there are multiple types of objects (mashup, service, content, tag, etc.) and rich relations among those objects, they usually build a heterogeneous graph to capture them. In [11], Cao et al. recommend services by building a social network based on social relationships among mashups, services, and their tags. In [12], Gao et al. propose to use a generalized manifold ranking algorithm on the graph of relations among mashups and services. In [13], Liang et al. measure the meta-path-based similarity between mashups under different semantic meanings based on a heterogeneous information network (HIN). Based on the HIN, in [14], Xie et al. propose a mashup group preference-based service recommendation, where mashup group preference is utilized to capture the rich interactions among mashups. In [15], Wang et al. design a knowledge graph to encode the mashup-service relations and exploit random walks with restart to assess the similarities.

Latent factor-based CF methods aim to learn the latent factors of mashups and services to discover potential features. The two most widely adopted models are the matrix factorization-based model (MF) and the factorization machine-based model (FM). MF models decompose the mashup-service composition matrix to derive the mashup feature vectors and service feature vectors (embeddings). The key is to build relevant features that are helpful for improving the recommendation performance. In [16], Xu et al. propose a social-aware service recommendation model, where a coupled matrix factorization model is used to predict the multidimensional relations among users, mashups, and services. In [17], Yao et al. propose a probabilistic MF with implicit correlation regularization, where they develop a latent variable model to uncover the coinvocation patterns of services driven by explicit textual relations and implicit similar or complementary relations. In [18], Gao et al. compute different similarity scores between services through heterogeneous functional aspects, and MF is used to learn the embeddings of mashups and services for each functional aspect. The FM model is a generalization of the linear regression model and the MF model. It is more powerful than the MF model because it can incorporate contextual features and model second-order interactions of various sparse features. In [6], Cao et al. propose a QoS-aware service recommendation based on the relational topic model (RTM) and FM, where RTM is first used to mine the latent topics derived from the relations among mashups, services, and their links, and then FM is used to predict the link relations. In [19], Li et al. integrate tag, topic, co-occurrence, and popularity factors in FM for service recommendation, where they exploit the enriched tags and topics of mashups and services derived by RTM and use the invocation times and category information of services to derive their popularity. In [20], Cao et al. extend the descriptions of services using Word2vec and derive latent topics by the hierarchical Dirichlet process (HDP). FM is then applied to train these latent topics for service recommendation. In [21], Xie et al. propose to use FM on the features of mashups and services learned from different kinds of metapaths of a HIN.

With the rapid development of deep learning in the past few years, it has become popular to adopt neural networks for service recommendation. Compared with traditional methods, deep learning-based methods can perform feature engineering automatically and provide a more powerful representation capability. In [22], Zhang et al. cluster the descriptions of services using Doc2vec and use the DeepFM model [30] to mine the higher-order composition relations. In [23], Chen et al. also adopt the DeepFM model by taking the features learned from word embeddings and the Dirichlet mixture model (DMM) as input. In [8], Cao et al. propose an attentional FM [31] that learns the importance of each input feature interaction via a neural network model. In [9], Kang et al. further combine the advantages of the DeepFM model and the attentional FM model, and propose the NAFM model that can capture nonlinear feature interactions with different degrees of importance. In [24], Xiong et al. adopt three kinds of similarity feature extractors on textual descriptions using a variety of pretrained word embeddings and integrate the mashup-service composition record and the textual descriptions with a multilayer perceptron.

Graph Neural Network (GNN) [32] has recently emerged as a popular deep learning model for extracting features from graph-structured data, and research has begun to focus on the use of various GNN models for service recommendation. In [25], Zhang et al. propose a Semantic Variational Graph Auto-Encoder model, where they construct the service graph using the composition relations in the mashup, and use a variational graph autoencoder model as a link prediction task for service recommendation. In [26], Lian and Tang propose a service recommendation method that exploits the higher-order connectivity between mashups and services based on the neural graph collaborative filtering technique. In [27], He et al. propose a service link prediction method based on a heterogeneous graph attention network, where they select five types of neighbors associated with service links, and two levels of attention are applied to learn the importance of different nodes and their associations. Our method is different from theirs in that the constructed graphs incorporate different kinds of relations, making them more comprehensive, and the learned features go through the FM model that can better exploit their hidden interactions.

3. MGCN-NFM Approach

In this section, we describe our proposed multigraph convolutional network enhanced neural factorization machine model (MGCN-NFM) in detail. First, in Section 3.1, we formulate the problem of service recommendation for mashups. Next, we present the overall framework in Section 3.2. The main components of the model, namely, graph construction, the multigraph convolutional network, and the neural factorization machine, are described in Sections 3.3, 3.4, and 3.5, respectively. Finally, the training process and model complexity are discussed in Section 3.6.

3.1. Problem Definition

We denote $M = \{m_1, m_2, \ldots, m_{|M|}\}$ as a set of mashups and $S = \{s_1, s_2, \ldots, s_{|S|}\}$ as a set of services. $R \in \{0,1\}^{|M| \times |S|}$ denotes the mashup-service invocation history, where $R_{ij} = 1$ means mashup $m_i$ has composed service $s_j$ in the past, and $R_{ij} = 0$ otherwise. Each mashup and each service is associated with a title, a description, and tags. In particular, titles and descriptions consist of a sequence of words. For each mashup $m_i$, we concatenate its title and description and denote the result as $d_{m_i} = (w_1, w_2, \ldots, w_{l_i})$, where $w_k$ is a word and $l_i$ is the length of the description. $D_M$ denotes the descriptions of all mashups in $M$. For each service $s_j$, $d_{s_j}$ can be defined in a similar manner, and $D_S$ denotes the descriptions of all services in $S$. We denote $T = \{t_1, t_2, \ldots, t_{|T|}\}$ as a set of tags. $A^M \in \{0,1\}^{|M| \times |T|}$ denotes the tagging assignment of mashups, where $A^M_{ik} = 1$ means mashup $m_i$ is annotated with tag $t_k$, and $A^M_{ik} = 0$ otherwise. Similarly, $A^S \in \{0,1\}^{|S| \times |T|}$ denotes the tagging assignment of services. The problem of service recommendation can be defined as follows: given $M$, $S$, $R$, $D_M$, $D_S$, $T$, $A^M$, and $A^S$ recorded in the service repository, and a newly developed mashup with a requirement description, specified tags, and possibly composed services, the goal is to recommend a list of appropriate services that are likely to be composed by the new mashup.
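To make the notation concrete, the following minimal sketch stores the inputs as sparse binary matrices. The toy sizes, values, and variable names are our own illustrative choices, not taken from the paper's implementation.

```python
# Toy encoding of the problem inputs; sizes and contents are illustrative.
import numpy as np
from scipy.sparse import csr_matrix

n_mashups, n_services, n_tags = 4, 5, 3

# R[i, j] = 1 iff mashup m_i has composed service s_j.
R = csr_matrix(([1, 1, 1], ([0, 0, 2], [1, 3, 4])),
               shape=(n_mashups, n_services))

# A_M[i, k] = 1 iff mashup m_i is annotated with tag t_k (A_S is analogous).
A_M = csr_matrix(([1, 1], ([0, 1], [2, 0])), shape=(n_mashups, n_tags))

# D_M: one token sequence per mashup (title concatenated with description).
D_M = [["map", "restaurant", "review"], ["photo", "share"],
       ["deal", "coupon"], ["music", "stream"]]
```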

3.2. Overall Framework

Figure 1 shows the overall framework of our proposed MGCN-NFM model. It mainly consists of three modules: graph construction, the multigraph convolutional network, and the neural factorization machine. In the graph construction module, we construct three graphs capturing the three relations between mashups and services, namely, the collaborative graph, the description graph, and the tag graph. In constructing the description graph, the descriptions of mashups and services are transformed into low-dimensional description embeddings by the Doc2vec technique. In the multigraph convolutional network module, we adopt a simplified graph convolutional network for each graph to propagate higher-order information between mashups and services and output the aggregated feature representation of each node. In the neural factorization machine module, the feature representations of mashups, services, and tags learned in each graph, as well as the description embeddings of mashups and services, are used as inputs to the neural factorization machine model, which can model nonlinear and higher-order feature interactions. Finally, the model outputs a score representing the probability of a mashup composing a service. For model training, the predicted score is compared with the actual composition record, and the error is back-propagated to update the model parameters. For service recommendation, the scores of the mashup and all candidate services are computed and ranked, and the top-scoring services are recommended.

3.3. Graph Construction

To fully exploit different kinds of information in mashups and services, we construct three graphs reflecting diverse relations between mashups and services: the collaborative graph, the description graph, and the tag graph. Figure 2 illustrates an example of the three constructed graphs.

3.3.1. Collaborative Graph

The collaborative graph reflects the historical composition relation between mashups and services. However, each mashup generally composes only a small number of services, making the composition relation extremely sparse. Inspired by the idea of collaborative filtering that mashups composed of the same set of services tend to be similar, we expand the composition relation to alleviate the sparsity problem by exploiting the collaborative signal. The invocation matrix $R$ can be regarded as a mashup-service bipartite graph $G_{ms}$, where there is an edge between mashup $m_i$ and service $s_j$, i.e., $(m_i, s_j) \in G_{ms}$, if $R_{ij} = 1$. We also consider the collaborative relations inside the mashups and inside the services. We build a mashup-mashup collaborative graph $G_{mm}$ and a service-service collaborative graph $G_{ss}$ to complement the graph $G_{ms}$. The mashup-mashup collaborative graph is constructed based on the similarity of mashups measured by the number of commonly composed services. Specifically, if mashups $m_i$ and $m_j$ have co-composed services, then there is an edge between them. The similarity between them is calculated as the number of co-composed services normalized by the number of composed services of each mashup, i.e., $\mathrm{sim}(m_i, m_j) = \frac{R_{i*} \cdot R_{j*}}{\|R_{i*}\| \, \|R_{j*}\|}$, where $R_{i*}$ and $R_{j*}$ denote the $i$-th and $j$-th rows of the matrix $R$. The normalization keeps the similarity in the range of [0,1]. In a similar manner, the service-service collaborative graph $G_{ss}$ can be constructed based on the similarity of services measured by the number of co-appearing mashups. If services $s_i$ and $s_j$ have both occurred in the same mashup, they are connected by an edge $(s_i, s_j)$. The similarity between them is defined as $\mathrm{sim}(s_i, s_j) = \frac{R_{*i} \cdot R_{*j}}{\|R_{*i}\| \, \|R_{*j}\|}$, where $R_{*i}$ and $R_{*j}$ denote the $i$-th and $j$-th columns of the matrix $R$. With the constructed mashup-mashup collaborative graph $G_{mm}$ and service-service collaborative graph $G_{ss}$, one mashup can directly utilize the information of neighboring mashups, thereby alleviating the sparsity issue of the mashup-service invocation record. We merge the three graphs to obtain an overall collaborative graph $G_c = G_{ms} \cup G_{mm} \cup G_{ss}$. A sketch of computing these similarities is given below.
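The following sketch computes the two similarity matrices under our cosine reading of the normalization above; the function and variable names are our own, and a small dense matrix stands in for the sparse invocation record.

```python
# Sketch of the mashup-mashup and service-service similarities from R.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def collaborative_similarities(R):
    """R: binary mashup-service matrix of shape (|M|, |S|)."""
    sim_mm = cosine_similarity(R)      # rows: co-composed services
    sim_ss = cosine_similarity(R.T)    # columns: co-appearing mashups
    np.fill_diagonal(sim_mm, 0.0)      # drop self-similarity (no self-loops)
    np.fill_diagonal(sim_ss, 0.0)
    return sim_mm, sim_ss

R = np.array([[1, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 0, 1]], dtype=float)
sim_mm, sim_ss = collaborative_similarities(R)
# An edge (m_i, m_j) exists in G_mm whenever sim_mm[i, j] > 0.
```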

3.3.2. Description Graph

The description graph represents the similarities between mashups and services in terms of their textual descriptions. We employ the state-of-the-art Doc2vec technique [33] to measure the description similarity. Doc2vec maps the textual descriptions of mashups and services into latent semantic embeddings $e_m$ and $e_s$. Doc2vec follows the idea of word2vec [34] for learning word representations. Specifically, each document and each word is mapped to a unique vector. The model concatenates the document vector with a sequence of word vectors from the document and predicts the following word. It can be seen as a self-supervised learning task where the input is the context words and the document, and the label is the target word. After training on a large number of documents, we can learn a good document embedding that captures the overall semantics of the document. It has been shown to achieve superior performance compared to traditional bag-of-words-based models such as LDA [35] or HDP [36]. We feed the descriptions of all mashups and services to the Doc2vec model for training, and obtain the corresponding description embeddings afterwards. We calculate the cosine similarity of the learned embeddings to represent the description similarity. Specifically, given two nodes $u$ and $v$ that could be either mashups or services in the description graph, the similarity between them is calculated as $\mathrm{sim}(u, v) = \frac{e_u \cdot e_v}{\|e_u\| \, \|e_v\|}$. However, including all pairs of nodes as edges brings a lot of noise, as most pairs of mashups and services are irrelevant. We set a threshold $\epsilon$ (empirically set as 0.1) to filter out dissimilar pairs of nodes. That is, if $\mathrm{sim}(u, v) > \epsilon$, there is an edge between nodes $u$ and $v$. In this way, we get a description graph $G_d$, whose edge set consists of the filtered node pairs.
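A minimal sketch of this step with gensim's Doc2vec follows. The vector size of 50 and the 0.1 threshold follow the paper; the toy corpus, the min_count setting, and the epoch count are illustrative assumptions.

```python
# Sketch: train Doc2vec, then keep edges whose cosine similarity > 0.1.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.metrics.pairwise import cosine_similarity

docs = {"mashup_0": ["map", "restaurant", "review"],
        "service_0": ["map", "geocoding"],
        "service_1": ["music", "stream"]}
corpus = [TaggedDocument(words, [tag]) for tag, words in docs.items()]

# min_count=1 only because the toy corpus is tiny.
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

ids = list(docs)
emb = [model.dv[i] for i in ids]
sim = cosine_similarity(emb)
edges = [(ids[i], ids[j]) for i in range(len(ids))
         for j in range(i + 1, len(ids)) if sim[i][j] > 0.1]
```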

3.3.3. Tag Graph

The tag graph represents the annotated tags of mashups and services and is denoted as $G_t$. Different from the collaborative graph and the description graph, it consists of three types of nodes: mashups, services, and tags. There is an edge between mashup $m_i$ and tag $t_k$, i.e., $(m_i, t_k) \in G_t$, if $A^M_{ik} = 1$. Similarly, an edge exists between service $s_j$ and tag $t_k$ if $A^S_{jk} = 1$.

3.4. Multigraph Convolutional Network

After constructing three types of graphs reflecting different relations between mashups and services, we employ the multigraph convolutional network to learn comprehensive mashup and service features from the three graphs. In particular, it consists of two steps: (1) Neighbor Sampling, which samples a fixed number of neighbors for each node in the graph; (2) Graph Convolution, which aggregates neighboring nodes, updates node embeddings at each layer, and combines embeddings at all layers. Figure 2 illustrates an example of the whole process.

3.4.1. Neighbor Sampling

In traditional GCN [37], each node aggregates information from all neighboring nodes in the graph. However, in the three constructed graphs, some nodes may be connected to a large number of other nodes. For instance, in the collaborative graph, a popular service may connect to a large number of mashups and services, as it may be composed by many mashups and co-composed with many services. Directly aggregating all neighboring nodes raises several issues: (1) it adds to the computational burden of the graph convolution process, as the size of the neighborhood grows exponentially with the number of layers; (2) it makes the learning process focus excessively on nodes with many neighbors and ignore nodes with only a few neighbors, causing overfitting for densely-connected nodes and underfitting for sparsely-connected nodes; (3) it causes the oversmoothing issue [38] when aggregating a large number of weakly correlated neighbors, as the learned embeddings of nodes and their neighbors become indistinguishable. Hence, we adopt a neighbor sampling strategy to control the neighborhood size before graph convolution. Specifically, for each node, we sample its neighbors with respect to the similarity between them. The sampling probability of each neighbor is computed by normalizing the similarities: first, we get a vector of similarity scores between the target node and its neighboring nodes; then, we normalize the vector into a probability distribution; finally, we sample the neighbors with respect to this distribution. We use the standard normalization function defined as

$p_i = \frac{x_i}{\sum_{j} x_j}, \qquad (1)$

where $x$ is the vector of similarity scores. Other normalization functions can be used, such as the softmax function, although no performance gain is observed. Next, we describe the neighbor sampling process for each graph in detail; a minimal sketch of the sampling step follows the list below. A graphical illustration of the sampling process for each graph is shown in Figure 3.

(i) Collaborative graph: for each mashup node, we keep all of its service neighbors. If the number of service neighbors is less than $K_c$, we sample additional neighbors from $G_{mm}$ according to the sampling probability distribution derived from the mashup collaborative similarity, until the number of neighbors reaches $K_c$. For each service node, we keep all of its mashup neighbors. If the number of mashup neighbors is less than $K_c$, we sample additional neighbors from $G_{ss}$ according to the sampling probability distribution derived from the service collaborative similarity, until the number of neighbors reaches $K_c$. However, as some popular services may be composed by a large number of mashups, if the number of mashup neighbors is greater than $K_c$, we sample $K_c$ neighboring mashups with a uniform probability distribution.

(ii) Description graph: for each node, if the number of neighbors is greater than $K_d$, we sample $K_d$ neighboring mashups and services according to the sampling probability distribution derived from the description similarity.

(iii) Tag graph: as mashups and services are only associated with a few tags, no sampling is needed for mashup and service nodes. For each tag node, we sample $K_t$ nodes from its neighboring mashups and services with a uniform probability distribution.
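The following sketch illustrates the similarity-proportional sampling for a single over-connected node, assuming a generic sample size K; the per-graph sizes and the fill-up logic for under-connected collaborative nodes are omitted for brevity.

```python
# Sketch of similarity-proportional neighbor sampling for one node.
import numpy as np

def sample_neighbors(neighbor_ids, sim_scores, K, rng):
    """Sample up to K neighbors with probability proportional to similarity."""
    neighbor_ids = np.asarray(neighbor_ids)
    if len(neighbor_ids) <= K:
        return neighbor_ids
    p = np.asarray(sim_scores, dtype=float)
    p = p / p.sum()                  # standard normalization of Eq. (1)
    return rng.choice(neighbor_ids, size=K, replace=False, p=p)

rng = np.random.default_rng(0)
print(sample_neighbors([3, 7, 9, 12], [0.9, 0.1, 0.4, 0.2], K=2, rng=rng))
```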

3.4.2. Graph Convolution

The core idea behind graph convolution is to iteratively aggregate feature information from the local neighbors of each node. A single layer of convolution aggregates feature information from a node's direct neighbors, and by stacking multiple layers, feature information can be propagated across long ranges so that node features are enhanced with sufficient higher-order neighbor information. The graph convolution process consists of three steps: neighbor aggregation, node embedding update, and layer combination. For each graph, after sampling the neighbors of each node, the next step is to aggregate the features of the node's neighbors to obtain the neighbor embedding, and the embedding of the node is updated by fusing the neighbor embedding. Denoting the center node to be updated as $v$ and the set of its sampled neighbors as $N(v)$, the neighbor aggregation and updating process for node $v$ in the $l$-th layer can be abstracted as

$h_v^{(l+1)} = \mathrm{AGG}\left(h_v^{(l)}, \{h_u^{(l)} : u \in N(v)\}\right), \qquad (2)$

where $h_v^{(l)}$ represents the feature embedding of node $v$ in the $l$-th layer, and $\mathrm{AGG}$ is the aggregation function, which is the core of graph convolution. In this paper, we adopt a simple aggregation function that drops the feature transformation and nonlinear activation used in the original graph convolution, which has been shown to achieve better performance with less model complexity [39]. The graph convolution operation is defined as

$h_v^{(l+1)} = \sum_{u \in N(v)} \frac{1}{\sqrt{|N(v)|}\sqrt{|N(u)|}} \, h_u^{(l)}. \qquad (3)$

The aggregation function is a weighted sum of the neighbor embeddings, and $\frac{1}{\sqrt{|N(v)|}\sqrt{|N(u)|}}$ is the normalization term that prevents the scale of the embeddings from increasing with graph convolution. By stacking the propagation layers defined in (3), each node is capable of receiving features from nodes within $L$ hops away, and the higher-order relations between mashups and services can be explored. After $L$ layers of graph convolution, we have $L+1$ embeddings for node $v$, and we average them to obtain the final embedding:

$h_v = \frac{1}{L+1} \sum_{l=0}^{L} h_v^{(l)}. \qquad (4)$

Note that the only model parameters in the simplified graph convolution are the embeddings at the 0-th layer, i.e., $h_v^{(0)}$. Figure 2 illustrates an example of learning the embedding of a service with graph convolutional networks in the three graphs. All three graphs share the same initial node embedding for a mashup or service, but since different graphs capture different types of relations, the final embedding of a mashup or service after graph convolution captures the unique characteristics specific to each relation. We denote the node embeddings obtained from the collaborative, description, and tag graphs as $h_v^c$, $h_v^d$, and $h_v^t$, respectively.
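A minimal sketch of this parameter-free propagation is given below. It assumes a symmetrically normalized adjacency matrix built over the sampled neighbors; the toy graph, sizes, and names are our own.

```python
# Sketch of simplified graph convolution: propagate (Eq. (3)), then
# average all layer embeddings (Eq. (4)).
import numpy as np

def normalize_adj(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2} used in Eq. (3)."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    nz = deg > 0
    d_inv_sqrt[nz] = deg[nz] ** -0.5
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def graph_convolution(adj_norm, H0, num_layers):
    H, layers = H0, [H0]
    for _ in range(num_layers):
        H = adj_norm @ H            # parameter-free neighbor aggregation
        layers.append(H)
    return np.mean(layers, axis=0)  # combine all layers by averaging

# Toy graph with 3 nodes and d = 8; only H0 is a learnable parameter.
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
H0 = np.random.default_rng(0).normal(size=(3, 8))
H_final = graph_convolution(normalize_adj(A), H0, num_layers=3)
```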

3.5. Neural Factorization Machine

After learning the embeddings of mashups and services that comprehensively model the different types of relations, we use a neural factorization machine model to capture both higher-order and nonlinear interactions between mashups, services, and tags. Figure 4 shows the overall model of the neural factorization machine. Given a pair of a mashup and a service as well as their corresponding tags and textual descriptions, NFM models both the linear interactions between each pair of features, like the traditional factorization machine, and higher-order feature interactions in a nonlinear way. The input of the NFM consists of the one-hot encoding of the mashup, the one-hot encoding of the service, and the multihot encoding of the tags annotated to the mashup and service. We concatenate the three encoding vectors, and the resulting sparse feature vector is denoted as $x \in \{0,1\}^n$, where $x_i = 1$ means the $i$-th feature exists in the input. The linear regression part of NFM is given as

$\hat{y}_{lin}(x) = w_0 + \sum_{i=1}^{n} w_i x_i, \qquad (5)$

where $w_0$ is the global bias and $w_i$ is the weight of feature $i$. To incorporate the functional semantics of the textual descriptions, we use the dense embeddings of the mashup and service learned by Doc2vec. However, since the features of the mashup, service, and tags are sparse, we need to transform them into dense representations. Different from the original neural factorization model in [10], which builds an embedding lookup table (a fully connected layer that projects each feature to a dense embedding), we use the multigraph convolutional network to project each feature to its embeddings from the different graphs. In this way, the mashup and service each have three embeddings, each representing information from a different source, and the feature interactions can be modeled in a fine-grained manner. Moreover, a cold-start mashup or service will have a meaningful embedding learned from the description graph and/or the tag graph. For tag features, it is also beneficial to learn the tag embeddings from the tag graph, as long-tailed tags obtain more meaningful embeddings through the higher-order convolution over multihop neighbors. With the enhanced feature embeddings, the feature interactions in NFM can be learned more effectively.

After transforming the sparse features of the mashup, service, and tags into dense embeddings, they are fed into the bi-interaction pooling layer to convert the multiple embedding vectors into one vector. Specifically, the input of the bi-interaction pooling layer is the set $\mathcal{V} = \{h_m^c, h_m^d, h_m^t, h_s^c, h_s^d, h_s^t, d_m, d_s\} \cup \mathcal{V}_T$, where $\mathcal{V}_T$ is a list of vectors converted from the multihot tag encoding. The bi-interaction pooling operation is defined as

$f_{BI}(\mathcal{V}) = \sum_{i=1}^{N} \sum_{j=i+1}^{N} v_i \odot v_j, \qquad (6)$

where $N$ is the number of embedding vectors in $\mathcal{V}$, and $\odot$ denotes the element-wise product of two vectors. The output of the bi-interaction pooling layer is a $d$-dimension vector that encodes the second-order interactions between features. It is worth noting that a cold-start mashup or service is represented by its initial node embedding, i.e., $h_m^c$ or $h_s^c$ equals $h_m^{(0)}$ or $h_s^{(0)}$, because there is no connection in the collaborative graph.
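The following sketch illustrates bi-interaction pooling together with the O(Nd) reformulation mentioned in Section 3.6, and checks it against the naive pairwise sum; shapes and names are illustrative.

```python
# Sketch of bi-interaction pooling (Eq. (6)) and its efficient form:
# sum_{i<j} v_i * v_j  =  ((sum_i v_i)^2 - sum_i v_i^2) / 2  (element-wise).
import numpy as np

def bi_interaction(V):
    """V: array of shape (N, d) stacking the N input embedding vectors."""
    s = V.sum(axis=0)
    return 0.5 * (s * s - (V * V).sum(axis=0))

V = np.random.default_rng(0).normal(size=(6, 8))   # N = 6 features, d = 8
naive = sum(V[i] * V[j] for i in range(6) for j in range(i + 1, 6))
assert np.allclose(bi_interaction(V), naive)
```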

Above the bi-interaction layer is a stack of fully connected layers, which learn higher-order interactions between features:

$z_1 = \sigma_1\left(W_1 f_{BI}(\mathcal{V}) + b_1\right), \quad z_l = \sigma_l\left(W_l z_{l-1} + b_l\right), \; l = 2, \ldots, L_h, \qquad (7)$

where $L_h$ denotes the number of hidden layers, $W_l$ denotes the weight matrix, and $b_l$ denotes the bias vector of layer $l$. $\sigma_l$ is the activation function. Finally, the output of the last layer is transformed into the final score:

$\hat{y}_{deep}(x) = p^{\top} z_{L_h}, \qquad (8)$

where $p$ is the prediction weight vector.

Overall, NFM estimates the target of an instance as the sum of the first-order linear regression part and the higher-order nonlinear feature interaction part. Since the target label is in {0,1}, with 1 indicating that the service is a component of the mashup and 0 otherwise, we transform the final output with the sigmoid function to obtain the probability of the service being recommended to the mashup. The sigmoid function, defined as $\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$, constrains the value between 0 and 1. To summarize, the formulation of NFM is

$\hat{y}(x) = \mathrm{sigmoid}\left(w_0 + \sum_{i=1}^{n} w_i x_i + p^{\top} z_{L_h}\right), \qquad (9)$

with parameters $\Theta = \{w_0, \{w_i\}, \{h_v^{(0)}\}, \{W_l, b_l\}, p\}$.

3.6. Model Training and Prediction

To train the MGCN-NFM model, we define the loss function using binary cross-entropy:

$\mathcal{L} = -\sum_{(x, y) \in \mathcal{D}} \left[ y \log \hat{y}(x) + (1 - y) \log\left(1 - \hat{y}(x)\right) \right] + \lambda \|\Theta\|_2^2, \qquad (10)$

where $\mathcal{D}$ is the set of training instances; each instance includes a mashup, a service, tags, the mashup description, and the service description. $y$ is the label indicating whether the mashup and service in the instance have been composed in the record, and $\hat{y}(x)$ denotes the composition probability predicted by the model for the instance. As the mashup-service composition record is extremely sparse, the number of negative instances, where the label $y = 0$, far exceeds the number of positive instances, where the label $y = 1$. Therefore, for each positive instance, we sample a fixed number of negative instances with the same mashup to mitigate the data imbalance problem. $\lambda$ controls the strength of the regularization to prevent overfitting. In addition, dropout is used in NFM to prevent overfitting, where we randomly drop a percentage of the dimensions in the fully connected layers during training. $\Theta$ denotes all trainable model parameters. We use Adaptive Moment Estimation (Adam) [40], a variant of stochastic gradient descent, to optimize the model parameters.
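A sketch of the negative sampling and the regularized objective is shown below. The instance layout and the assumption that each mashup has at least num_neg uncomposed services to draw from are our own simplifications.

```python
# Sketch of per-positive negative sampling and the loss of Eq. (10).
import numpy as np

def sample_training_instances(R, num_neg, rng):
    """Pair each positive (m, s) with num_neg uncomposed services of m."""
    instances = []
    for m, s in zip(*np.nonzero(R)):
        instances.append((m, s, 1))
        negatives = np.flatnonzero(R[m] == 0)
        for s_neg in rng.choice(negatives, size=num_neg, replace=False):
            instances.append((m, s_neg, 0))
    return instances

def bce_loss(y_true, y_pred, params, lam, eps=1e-12):
    """Binary cross-entropy with L2 regularization over all parameters."""
    data = -np.mean(y_true * np.log(y_pred + eps)
                    + (1 - y_true) * np.log(1 - y_pred + eps))
    return data + lam * sum(np.sum(p ** 2) for p in params)

R = np.array([[1, 0, 0, 1], [0, 1, 0, 0]])
instances = sample_training_instances(R, num_neg=2,
                                      rng=np.random.default_rng(0))
```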

Algorithm 1 presents the overall training process of the MGCN-NFM framework. In line 1, the three graphs are constructed from the input data. Line 2 constructs the training instance set, including both positive and sampled negative instances. Each instance includes a mashup, a service, their tags, and their descriptions. Lines 5–7 are the forward-propagation process that makes a prediction for an instance. The embeddings of the mashup, service, and tags are learned through the steps of neighbor sampling and graph convolution. The embeddings of the descriptions are obtained from Doc2vec. They are used as the input to NFM to compute the recommendation probability. Line 8 performs back-propagation and updates the model parameters based on the loss function in Equation (10).

 Input: $M$, $S$, $R$, $D_M$, $D_S$, $T$, $A^M$, $A^S$
 Output: parameter set $\Theta$
(1) Construct graphs $G_c$, $G_d$, $G_t$;
(2) Construct training instances $\mathcal{D}$;
(3) for each epoch do
(4)  for each instance composed of mashup $m$, service $s$, tags $T_m \cup T_s$ in $\mathcal{D}$ do
(5)   Compute $h_m^c$, $h_m^d$, $h_m^t$, $h_s^c$, $h_s^d$, $h_s^t$, and the tag embeddings with Equation (4);
(6)   Obtain $d_m$, $d_s$ from the Doc2vec model;
(7)   Compute $\hat{y}$ with Equation (9);
(8)   Update $\Theta$ to minimize $\mathcal{L}$ in Equation (10) with Adam;
(9)  end for
(10) end for
(11) return $\Theta$

Once the model parameters, which contain the initial embeddings of mashups, services, and tags, are learned, a forward-propagation pass is performed on the multigraph convolutional network to obtain the final embeddings of all mashups, services, and tags in the three graphs; these act as a fixed embedding lookup table. Then, the mashup to be recommended and each candidate service are used to construct the input instances of the NFM, with their embeddings retrieved from the lookup table. The NFM outputs the probability of each candidate service being composed by the mashup. We rank these predicted values and recommend the top-K services with the largest values to the mashup.
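A minimal sketch of the prediction step follows, where score_fn stands in for the trained NFM forward pass over the fixed embedding lookup table; the dummy scorer is purely illustrative.

```python
# Sketch of top-K recommendation from predicted composition probabilities.
import numpy as np

def recommend_top_k(mashup_id, candidate_services, score_fn, k=5):
    """Rank candidates by predicted composition probability, keep top k."""
    scores = np.array([score_fn(mashup_id, s) for s in candidate_services])
    top = np.argsort(-scores)[:k]
    return [candidate_services[i] for i in top]

# Example with a dummy scorer standing in for the trained model.
print(recommend_top_k(0, ["maps", "yelp", "twitter"],
                      lambda m, s: hash((m, s)) % 100 / 100, k=2))
```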

We analyze the time complexity of MGCN-NFM for model training and prediction. The construction of the collaborative, description, and tag graphs takes $O(\|R\|_0)$, $O((|M| + |S|)^2)$, and $O(\|A^M\|_0 + \|A^S\|_0)$ time, respectively, where $\|X\|_0$ denotes the number of nonzero entries in the matrix $X$. The main time cost of the MGCN model lies in the graph convolution process, whose time complexity is $O(L d (|E_c| + |E_d| + |E_t|))$, where $L$ is the number of graph convolution layers and $d$ is the embedding size. $|E_c|$, $|E_d|$, and $|E_t|$ are the numbers of edges in the collaborative, description, and tag graphs, respectively. Since we perform neighbor sampling for each node in the three graphs, the numbers of edges are upper bounded by $K_c(|M| + |S|)$, $K_d(|M| + |S|)$, and $K_t(|M| + |S| + |T|)$ for the collaborative, description, and tag graphs, respectively. Therefore, the training cost of MGCN scales linearly with the number of nodes in the graphs. For NFM, in both training and prediction, the bi-interaction pooling layer can be computed efficiently in $O(Nd)$ time with a reformulation of (6). The main time cost lies in the hidden layers of the neural network, with time complexity $O(\sum_{l=1}^{L_h} d_{l-1} d_l)$, where $d_l$ denotes the dimension of the $l$-th hidden layer.

4. Experiments

In this section, we conduct a series of experiments to evaluate our proposed model for service recommendation and present the empirical performance. All experiments were developed in Python and carried out on a personal PC with a 2.5 GHz Intel Core i7 CPU and 16 GB RAM, running macOS High Sierra. We aim to answer the following research questions:

(i) How does the proposed MGCN-NFM model perform compared with state-of-the-art factorization machine-based service recommendation methods?

(ii) How much do the collaborative graph, description graph, and tag graph influence the performance of the proposed model?

(iii) How much do the various hyperparameters affect the experimental results of the proposed model?

4.1. Dataset Description

The experimental dataset is collected from ProgrammableWeb (PW), the largest online web service and mashup repository. For each mashup, we crawl its name, description, tags, and composed services. For each service, we crawl its name, description, and tags (including the primary and secondary categories). We remove duplicate services, as well as mashups and services without textual descriptions or tags. The dataset contains 6,300 mashups and 21,474 web services, of which only 1,609 services are composed by at least one mashup. Table 1 shows the detailed statistics of the dataset.

Five-fold cross-validation is performed to evaluate the effectiveness of the model. We randomly split the mashup-service invocation records into five folds, and in each round, one fold is used as the test set while the remaining four folds are used as the training set. The results of the five rounds are averaged as the final result. In addition, to test the models under different data sparsity levels, we use 1/2/3/4 folds as the training set, which corresponds to 20/40/60/80% of the mashup-service invocation records. Figure 5(a) shows the service distribution of mashups in the full training set, and Figure 5(b) shows the service distribution of mashups when only 20% of the dataset is used for training. We can see that for the full training dataset, 52.3% of mashups compose one service, and 91.2% of mashups compose fewer than 3 services. When only 20% of the dataset is used for training, 68.6% of mashups do not have any component services, which poses a significant cold-start challenge.

4.2. Evaluation Metrics

We adopt two metrics, namely, recall and Normalized Discounted Cumulative Gain (NDCG), to evaluate the accuracy of the top-K service recommendation.

Recall@K is the proportion of services in the top-K recommendation list that are composed by the mashup, relative to the number of services composed by the mashup. It is defined as

$\mathrm{Recall@K} = \frac{1}{|M_{test}|} \sum_{m \in M_{test}} \frac{|\mathrm{rec}(m) \cap \mathrm{truth}(m)|}{|\mathrm{truth}(m)|}, \qquad (11)$

where $M_{test}$ is the set of mashups in the test set, $\mathrm{rec}(m)$ is the recommendation list of services of size K for mashup $m$, and $\mathrm{truth}(m)$ is the ground-truth list of services composed by mashup $m$ in the test set.

NDCG@K considers the ranking positions of the recommended services and assigns a different weight to each service in the top-K recommendation list: a higher-ranked service is assigned a larger weight if it is composed by the mashup. It is defined as

$\mathrm{NDCG@K} = \frac{1}{|M_{test}|} \sum_{m \in M_{test}} \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}}, \quad \mathrm{DCG@K} = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i+1)}, \qquad (12)$

where $rel_i \in \{0,1\}$ indicates whether the service at position $i$ of the ranking list is in $\mathrm{truth}(m)$, and IDCG@K is the ideal DCG score achievable for the top-K services.
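The two metrics can be computed per mashup as in the following sketch; the list and set types are our own choice, and the per-mashup values are then averaged over the test set.

```python
# Sketch of Recall@K and NDCG@K for one mashup: rec is the recommended
# list, truth the ground-truth set of composed services.
import numpy as np

def recall_at_k(rec, truth, k):
    return len(set(rec[:k]) & set(truth)) / len(truth)

def ndcg_at_k(rec, truth, k):
    dcg = sum(1.0 / np.log2(i + 2)          # position i is rank i + 1
              for i, s in enumerate(rec[:k]) if s in truth)
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(truth), k)))
    return dcg / ideal if ideal > 0 else 0.0

print(recall_at_k(["a", "b", "c"], {"b", "d"}, k=3))          # 0.5
print(round(ndcg_at_k(["a", "b", "c"], {"b", "d"}, k=3), 3))  # 0.387
```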

4.3. Baseline Methods

We compare MGCN-NFM with the following baselines that are related to our work:

(i) RTM-FM [7]: this method combines RTM and FM for service recommendation. It uses RTM to mine the latent topics of mashups and services by link prediction. In addition, it exploits the co-occurrence and popularity features of services in FM.

(ii) TR-FM [19]: this method integrates tag, topic, co-occurrence, and popularity factors, which are modeled by FM, for service recommendation. It uses the enriched tags and the topic information derived by RTM to measure the similarity between services.

(iii) ATT-FM [8]: this method uses an attentional FM for service recommendation. It uses the functional similarity, tags, and popularity of services as features, and employs an attention mechanism on feature interactions to discriminate their importance.

(iv) NAFM [9]: this is the state-of-the-art service recommendation method based on the FM model. Besides first-order and second-order linear feature interactions, it integrates a deep neural network to capture nonlinear feature interactions and an attention network to capture important feature interactions.

(v) MGCN-FM: a variant of the proposed method that uses the basic factorization machine model after learning the features of the mashup, service, and tags. The output of the bi-interaction pooling layer is summed directly, which is equivalent to setting $L_h = 0$ in MGCN-NFM.

4.4. Parameter Settings

To learn the representations of the textual descriptions of mashups and services with Doc2vec, we aggregate the textual descriptions of all 6,300 mashups and 21,474 services into one corpus and perform a series of preprocessing steps: (1) split sentences into words and transform them into lower case; (2) remove stop words and words that appear fewer than 5 times; (3) transform words into their root form. All preprocessing steps are done using the NLTK library (https://www.nltk.org). After preprocessing, we use the models.doc2vec API of the gensim library (https://radimrehurek.com/gensim/) to learn the document embeddings. All parameters are set to the default values of the gensim API except for the dimension size, which is set to 50, as the descriptions of mashups and services are relatively short. The default values of the hyperparameters of MGCN-NFM are given in Table 2.
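A sketch of the three preprocessing steps with NLTK is shown below. We assume Porter stemming for the root-form step and the standard English stop-word list, neither of which the paper specifies.

```python
# Sketch of the preprocessing pipeline: tokenize/lowercase, remove stop
# words and rare words, then stem. Requires the NLTK "punkt" and
# "stopwords" data packages (nltk.download(...)).
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def preprocess(descriptions, min_count=5):
    stop = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    token_lists = [[w.lower() for w in word_tokenize(d) if w.isalpha()]
                   for d in descriptions]
    counts = Counter(w for tokens in token_lists for w in tokens)
    return [[stemmer.stem(w) for w in tokens
             if w not in stop and counts[w] >= min_count]
            for tokens in token_lists]
```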

4.5. Performance Comparison

To assess the performance of our method and the baseline methods under different data sparsity levels, we select 20/40/60/80% of the data for constructing the collaborative graph and learning the parameters of the model. Table 3 shows the comparison of the different methods when the number of services recommended for a mashup is K = 3/5/10. It can be seen that our model MGCN-NFM consistently achieves the best performance under all percentages of training data in terms of both Recall and NDCG. TR-FM achieves better performance than RTM-FM because it considers tagging information to refine the similarity measures of services. ATT-FM outperforms the two methods that use the vanilla factorization machine, since it models the feature interactions with an attention mechanism that can discriminate the importance of each feature. NAFM attains better performance than the aforementioned three methods, since considering higher-order feature interactions brings additional improvement. The MGCN-FM variant achieves better performance than all baseline methods, but its performance is inferior to that of MGCN-NFM, especially on highly sparse training data. This shows that introducing nonlinear feature interactions can improve the expressiveness of FM. Compared with the best baseline model NAFM, our model obtains 7.28% and 7.14% improvements in top-3 recommendations for Recall and NDCG, respectively, under the full training data. The performance gain becomes even more significant when the training data becomes sparser. For instance, when only 20% of the data is used for training, our model obtains 18.13% and 14.97% improvements in top-3 recommendations for Recall and NDCG compared with NAFM. As the percentage of training data increases, both Recall and NDCG increase for all methods; however, the performance of the baseline methods drops to a greater extent than that of our method as the data becomes sparser. This shows that our method is less sensitive to the sparsity of the training data, which can be explained by the higher-order and nonlinear feature interaction modeling in the graph convolution and NFM.

4.6. Impact of Different Graphs

In MGCN-NFM, we represent different sources of information by constructing three graphs: the collaborative graph, the description graph, and the tag graph. To demonstrate the effectiveness of utilizing the three graphs in our model, we design six variants of the model for comparison: (1) CGCN-NFM, where only the collaborative graph is constructed; in this model, only $h_m^c$ and $h_s^c$ are computed in Step 5 of Algorithm 1; (2) DGCN-NFM, where only the description graph is constructed, and only $h_m^d$ and $h_s^d$ are computed; (3) TGCN-NFM, where only the tag graph is constructed, and only $h_m^t$, $h_s^t$, and the tag embeddings are computed; (4) CDGCN-NFM, where both the collaborative graph and the description graph are constructed, and the tag graph is omitted; (5) CTGCN-NFM, where both the collaborative graph and the tag graph are constructed; (6) DTGCN-NFM, where only the collaborative graph is omitted from the original model.

Figure 6 shows the results of the comparison among the different variants. We observe that MGCN-NFM achieves the best performance, confirming that all three graphs are necessary for the model, as they complement each other in enriching different aspects of the relations between mashups and services. When any one of the graphs is dropped from the original model, the performance becomes worse, which also suggests that all three graphs indeed help improve the model performance. Furthermore, incorporating two graphs into the model is better than utilizing only one graph, except for CTGCN-NFM. From this, we argue that the description graph is the most important graph for model performance, as DGCN-NFM outperforms CGCN-NFM and TGCN-NFM, and without the description graph, CTGCN-NFM performs even worse than DGCN-NFM. Therefore, we conclude that accurately modeling the functional information of mashups and services in their descriptions is vital. In addition, the collaborative graph has a stronger influence on model performance than the tag graph, as exhibited by the fact that CGCN-NFM outperforms TGCN-NFM and CDGCN-NFM outperforms DTGCN-NFM.

We further discuss the reasons and provide several case studies of how the three graphs alleviate the sparsity, imbalance, and cold-start problems in service recommendation. The collaborative graph connects rarely composed services to frequently composed ones, making them more densely connected so that their latent factors are updated more frequently. The description graph and tag graph connect cold-start services with services that are composed by mashups, so that cold-start services can be discovered and recommended by the FM model. Table 4 shows some actual examples of service recommendation results. CityPockets Daily Deals is a mashup that aggregates daily deals and coupons from various deal sites. Compared with DTGCN-NFM, which excludes the collaborative graph, MGCN-NFM successfully recommends the service 8coupons in its Top-5 list, because the services Groupon and 8coupons have been composed by different mashups a few times, which is captured by the collaborative graph. The NearPlace mashup is a free store locator and Google Maps marker. Compared with CTGCN-NFM, which excludes the description graph, MGCN-NFM successfully recommends the cold-start service MetaLocator in its Top-5 list. Although MetaLocator has never been composed by any mashup, it is a mapping and locator service for stores, vendors, and ATMs. Its functionality is therefore similar to popular services such as geocoder and Google Maps Places, which is reflected in the description graph. iEnviroWatch is a mashup that visualizes and queries environmental information in a geographical area of interest. MGCN-NFM successfully recommends the EEA Discomap service in its Top-5 list, because both are tagged with "Environment", and the tag "Sustainability" of the mashup is also close to the tag "Environment" in the tag graph.

4.7. In-Depth Analysis of Graph Construction
4.7.1. Impact of Mashup-Mashup Graph and Service-Service Graph in Collaborative Graph Construction

To demonstrate the superiority of incorporating the mashup-mashup collaborative graph and the service-service collaborative graph into the sparse mashup-service collaborative graph, we design three variants of the model: (1) $G_{ms}$, where only the mashup-service collaborative graph is constructed; (2) $G_{ms}+G_{mm}$, where the mashup-mashup collaborative graph is fused with $G_{ms}$; (3) $G_{ms}+G_{ss}$, where the service-service collaborative graph is fused with $G_{ms}$; these are compared with (4) $G_{ms}+G_{mm}+G_{ss}$, the original model. Figure 7 shows the results of the different variants. We observe that incorporating either $G_{mm}$ or $G_{ss}$ achieves better model performance than using only $G_{ms}$ as the collaborative graph, which verifies the effectiveness of modeling the collaborative relations between mashups and between services. Furthermore, the collaborative relation between mashups has a more significant impact on model performance than the collaborative relation between services. Only by combining both relations can the best result be achieved.

4.7.2. Impact of Document Embedding Techniques in Description Graph Construction

In this paper, we use Doc2vec, which learns distributed representations of documents in a self-supervised learning paradigm. We compare Doc2vec with other document modeling techniques widely used for service descriptions. Three variants are considered: (1) TF-IDF (https://en.wikipedia.org/wiki/Tf-idf), where the description documents are represented by the TF-IDF model; (2) LDA, where the topic probability distributions of the description documents are learned by the LDA model; (3) HDP, where the topic probability distributions are learned by the HDP model; these are compared with (4) Doc2vec, the original model. The latent dimensions of LDA and Doc2vec are set to 50. Figure 8 shows the results of the different variants. We can observe that Doc2vec performs the best among all methods, indicating its superior capacity for modeling mashup and service descriptions. HDP follows behind Doc2vec and performs better than LDA, which shows that HDP can derive better topic distributions than LDA, as it automatically infers the number of topics from the data. TF-IDF performs the worst, as it only uses the term-based vector space model and lexical matching between documents.

4.8. Impact of Hyperparameters
4.8.1. Impact of the Size of Sampling Neighbors

In the neighbor sampling process, to control the size of the graph, we sample a fixed-size set of neighbors for each node. We vary the numbers of sampled neighbors $K_c$, $K_d$, and $K_t$ for each node in the collaborative graph, description graph, and tag graph from 5 to 40 with a step size of 5 to find the optimal setting. The experimental results are shown in Figure 9. We only report Recall@K, as NDCG@K behaves in a similar fashion. We can see that the optimal neighborhood size differs for each graph. For the collaborative graph, the optimal value of $K_c$ is relatively small, while for the description graph, the optimal value of $K_d$ is larger. When the neighborhood size increases past the optimal value, the performance begins to decrease, which verifies the necessity of node sampling before graph convolution. For the tag graph, we observe that the performance does not change much as the neighborhood size grows. We set the optimal value of $K_t$ to 30, considering both training efficiency and recommendation accuracy.

4.8.2. Impact of the Number of Layers in Graph Convolution

We evaluate the effectiveness of graph convolution for learning higher-order mashup and service features. We vary the number of layers $L$ in the range {1,2,3,4} and evaluate the performance. The experimental results are shown in Figure 10. We can see that the optimal value of $L$ is 3. Increasing $L$ from 1 to 3 improves the performance, as stacking more layers helps nodes reach multihop neighbors that enrich the mashup and service features. However, stacking too many layers can make the node features similar and reduce the performance.

4.8.3. Impact of the Latent Dimension

We evaluate the impact of the latent dimension $d$ on model performance, varying it in the range {8,16,32,64,128}. The performance is shown in Figure 11, from which we can see that the optimal value of $d$ is 64. Setting the latent dimension too small constrains the model capacity, while setting it too large not only leads to overfitting, which hurts the recommendation accuracy, but also increases the model training time.

5. Conclusions

In this paper, we delve into three issues, namely, feature sparsity, imbalance, and cold-start, that arise when applying factorization machine models to service recommendation, and propose a novel MGCN-NFM model to address them. First, three graphs are built to represent the various types of relations between mashups and services: a collaborative graph, a description graph, and a tag graph. Next, we use graph convolutional networks to iteratively aggregate higher-order neighbors and enrich the information of sparsely-connected nodes in each graph. The resulting feature embeddings are used as latent factors in the neural factorization machine model, which predicts the probability of composition given a pair of a mashup and a service, their tags, and their textual descriptions. We perform extensive experiments on the ProgrammableWeb dataset, and the results demonstrate the superiority of our proposed method. Moreover, our model is able to incorporate other information about mashups and services by defining graphs with new relations.

In the future, we intend to investigate other potentially useful attributes and social information of mashups and services, such as QoS, developers, followers, etc. In addition, other more advanced GCN models and FM models can be explored to further improve the recommendation accuracy. Finally, as the list of recommended services should be of high coverage and diversity besides high accuracy, we plan to investigate and improve the model with respect to the two metrics.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was partially supported by the National Key R&D Program of China under grant no. 2019YFB1404802, the National Natural Science Foundation of China under grants nos. 62176231 and 62106218, the Zhejiang Public Welfare Technology Research Project under grant no. LGF20F020013, and the Wenzhou Bureau of Science and Technology of China under grant no. Y2020082.