Abstract

With the increasing volume of web services in the cloud environment, Collaborative Filtering- (CF-) based service recommendation has become one of the most effective techniques to alleviate the heavy burden on the service selection decisions of a target user. However, the service recommendation bases, that is, historical service usage data, are often distributed across different cloud platforms. Two challenges arise in such a cross-cloud service recommendation scenario. First, a cloud platform is often unwilling to share its data with other cloud platforms due to privacy concerns, which severely decreases the feasibility of cross-cloud service recommendation. Second, the historical service usage data recorded in each cloud platform may be updated over time, which significantly reduces the recommendation scalability. In view of these two challenges, a novel privacy-preserving and scalable service recommendation approach based on SimHash is proposed in this paper. Finally, through a set of experiments deployed on a real distributed service quality dataset, WS-DREAM, we validate the feasibility of our proposal in terms of recommendation accuracy and efficiency while guaranteeing privacy-preservation.

1. Introduction

With the ever-increasing volume and variety of web services in various web-based communities, it becomes a challenging task to find the web services that a target user is really interested in [1–3]. In this situation, various service recommendation techniques are introduced to alleviate the heavy burden on the service selection decisions of target users, for example, the well-adopted user-based Collaborative Filtering (i.e., UCF). According to traditional UCF, the similar friends of a target user are often employed to make recommendations to the target user [4]. Therefore, similar friend discovery is the key step to the subsequent service recommendation.

Generally, the bases for similar friend discovery, that is, historical service usage data (e.g., service quality observed by users) are centralized; in this situation, it is easy to determine the similar friends of a target user. However, in the age of IoT (Internet of Things), the quality data of various services are often monitored and collected by geographically distributed sensors and stored in different cloud platforms [5]. In this situation, the historical service usage data are not centralized, but distributed. Such a distributed service recommendation scenario calls for data sharing and collaboration between different cloud platforms. However, as work [6] indicates, this kind of cross-platform data sharing may bring additional privacy leakage risk, which decreases the feasibility of cross-cloud service recommendation severely. Besides, for the involved multiple cloud platforms, their volume of service quality data may become increasingly huge with updates over time, which leads to a frequent recalculation of user similarity and hence reduces the recommendation scalability significantly.

In view of these two challenges, a novel privacy-preserving and scalable service recommendation approach based on SimHash is put forward in this paper. Our approach can achieve good recommendation performance in terms of accuracy, efficiency, and privacy-preservation.

Generally, the contributions of this paper are threefold:
(1) To the best of our knowledge, existing research works seldom consider service recommendation in a distributed cloud environment or the resulting privacy-preservation problems. In this paper, we formalize this privacy-preserving service recommendation problem and clarify its research significance.
(2) We put forward a novel service recommendation approach based on the offline SimHash technique [7], to protect the private information of most users in different cloud platforms and meanwhile improve the service recommendation efficiency and scalability.
(3) We conduct a set of experiments based on a real distributed service quality dataset, WS-DREAM, to validate the feasibility of our proposed approach. Experiment results show that our approach achieves good performance in terms of recommendation accuracy and scalability while guaranteeing privacy-preservation.

The rest of the paper is organized as follows. Related work is presented in Section 2. Research motivation is demonstrated in Section 3. In Section 4, we introduce the details of our proposed service recommendation approach. In Section 5, a set of experiments is conducted on the WS-DREAM dataset to validate the feasibility and advantages of our proposal. Finally, in Section 6, we summarize the paper and point out future research directions.

2. Related Work

Collaborative Filtering (i.e., CF) has become one of the most effective techniques in various recommender systems. User-based CF and item-based CF are brought forth for high-quality service recommendation in [4] and [8], respectively. In order to combine their advantages, a hybrid CF recommendation approach is introduced in [9]. Experiment results show that the hybrid approach improves the recommendation performance. As the quality of a web service often depends on the service execution context (e.g., time, location), time-aware CF and location-aware CF are proposed in [10] and [11], respectively, to improve the accuracy of recommended results. However, the above approaches cannot handle the recommendation problems where historical service usage data are very sparse. In view of this drawback, a belief propagation-based approach is proposed in [12], to find the potential friends of the target user.

However, the above approaches all assume that the service recommendation bases, that is, historical service usage data, are centralized, without considering the distributed service recommendation scenarios as well as the resulting privacy leakage risk. In view of this drawback, the authors in [13] suggest that a user should release only a small portion of his/her observed service quality data to the public so that the remaining majority of user-service quality data are secure. However, the released small portion of data can still reveal part of a user’s private information. In order to protect user privacy completely, the data obfuscation technique is adopted in [14] to hide the real service quality data by adding an obfuscated data item. However, as the service quality data used to make service recommendations have been obfuscated, the recommendation accuracy is decreased accordingly; besides, additional time cost is brought by the adopted data obfuscation operation. Similarly, a segment-based data hiding approach is introduced in [15], where each piece of user-service quality data is divided into several data segments, and then the data segments are employed to calculate user similarity approximately and make further service recommendation. However, there are still two shortcomings in this approach. First, the data segmentation process often takes much time, which heavily decreases the recommendation efficiency. Second, it fails to protect some important private information appropriately, for example, the information of the service intersection commonly invoked by two users. A locality-sensitive hashing technique is employed in [16] for privacy-preservation; however, only part of the users' private information can be well protected.

In view of the drawbacks of existing approaches, a novel privacy-preserving and scalable service recommendation approach based on SimHash is proposed in this paper to cope with the service recommendation problems in the distributed cloud environment. Next, an example is presented in Section 3 to further demonstrate the research motivation of our paper.

3. Research Motivation

An intuitive example is presented in Figure 1 to motivate our paper. Here, the target user is the user to whom the Amazon platform intends to recommend services; two other users have their observed service quality data recorded in the Microsoft and IBM platforms, respectively; the services shown in Figure 1 are the candidate services for recommendation. Specifically, if a user has never invoked a service, the corresponding service quality data is null.

Next, according to traditional UCF, the first step is to calculate the similarity between the target user and each of the other two users so as to determine the similar friends of the target user. However, this user similarity calculation process involves cross-platform collaboration and hence faces the following two challenges:
(1) Generally, Microsoft and IBM are not willing to share their recorded service quality data with Amazon due to privacy concerns, which severely decreases the feasibility of cross-cloud user similarity calculation and subsequent service recommendation.
(2) In Amazon, Microsoft, and IBM, the volume of service quality data may become increasingly huge with updates over time; in this situation, the collaboration efficiency and scalability are often reduced significantly and hence cannot satisfy the quick recommendation requirements of target users.

In view of these two challenges, a privacy-preserving and scalable service recommendation approach is proposed in this paper, which will be introduced in detail in the next section.

4. A SimHash-Based Service Recommendation Approach

In this section, a privacy-preserving and scalable approach is proposed to handle the distributed service recommendation problems. The main idea behind our approach is as follows: the users who have invoked the most common services can be regarded as “probably similar” friends [17]; therefore, we first utilize SimHash to look for a small number of “probably similar” friends of the target user, in a privacy-preserving and scalable way; afterwards, we determine the target user’s “really similar” friends from the “probably similar” ones; finally, we make service recommendations to the target user based on the preferences of his/her “really similar” friends.

Concretely, our approach consists of the four steps in Box 1, whose notation covers the target user, the set of users in the multiple involved cloud platforms, the candidate service set, and the SimHash-based hash value of each user.

Step 1 (building user indexes offline based on SimHash). For each user, according to his/her historical service invocation records, we can build his/her index offline based on the SimHash technique (see Figure 2). Here, m and n denote the number of users and number of services, respectively. Next, we introduce how to obtain the index.
First, for each service, we generate a random 0-1 vector of fixed dimension, where the dimension is obtained with a ceiling operation (i.e., rounding up to the nearest integer); Figure 2 gives an example of such a 0-1 vector for a service. Then, according to the historical service invocation records, each user can be denoted by an n-dimensional vector as in (1). Next, we drop the dimensions with null value (i.e., the services that the user has never invoked) and, in the 0-1 vectors of the remaining services, replace value “0” by value “−1” (see Figure 2). Then, for the derived matrix (with at most n rows) corresponding to the user, we calculate the sum of each column. Afterwards, in the resulting vector of column sums (see Figure 2), the positive and negative values are replaced by “1” and “0”, respectively, after which a 0-1 vector (see Figure 2) is obtained. Then, according to SimHash theory [6], this 0-1 vector can be regarded as the index for the user. This way, we can build indexes for all the users in the user set.
For a user, his/her historical service invocation data are recorded by a certain cloud platform (e.g., Amazon, Microsoft, or IBM in Figure 1); therefore, the user index can be built offline beforehand by the cloud platform so as to reduce the time cost. Besides, through SimHash, each user is encapsulated into a less-sensitive user index, without revealing his/her sensitive information (e.g., whether he/she has invoked a service or not, or a service’s running quality observed by him/her) to other platforms. Therefore, user privacy is protected.
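The index-building procedure of Step 1 can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the bit length is assumed to be the ceiling of log2 of the number of services, and a column sum of exactly zero is assumed to map to bit 0 (the text only specifies positive and negative sums).

```python
import math
import random


def service_hash_vectors(n_services, seed=0):
    """Give each service a random 0/1 vector (hypothetical helper).

    Assumption: the vector length is ceil(log2(n_services)).
    """
    length = max(1, math.ceil(math.log2(n_services)))
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(length)] for _ in range(n_services)]


def build_user_index(invocations, hash_vectors):
    """Build a SimHash-style 0/1 index for one user.

    invocations: one entry per service; None means "never invoked".
    hash_vectors: one 0/1 vector per service, all of the same length.
    """
    length = len(hash_vectors[0])
    column_sums = [0] * length
    for service_id, quality in enumerate(invocations):
        if quality is None:
            continue  # drop the services the user never invoked
        for k, bit in enumerate(hash_vectors[service_id]):
            # value 0 is replaced by -1 before the column-wise sum
            column_sums[k] += 1 if bit == 1 else -1
    # positive sums become 1; negative (and, by assumption, zero) sums become 0
    return [1 if total > 0 else 0 for total in column_sums]
```

Because each platform needs only its own invocation records and the shared service vectors, a function like `build_user_index` could be run offline on every platform, with only the resulting bit vectors exchanged.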

Step 2 (finding “probably similar” friends of the target user). Using the same hash functions adopted in Step 1, we calculate the index for the target user. Next, we calculate the Hamming Distance between the index of the target user and the index of each other user. Concretely, suppose the two indexes are denoted by 0-1 vectors of the same dimension. Then the Hamming Distance can be calculated by (2), where the contribution of each dimension is a Boolean value calculated by (3). Here, symbol “⊕” denotes the XOR operation. According to SimHash [6], if the Hamming Distance is smaller than 3, then we can conclude that the services invoked by the two users are approximately the same. In other words, the other user can be regarded as a “probably similar” friend of the target user and is then put into the set of “probably similar” friends. Moreover, the size of this set is often small (much smaller than m) due to the nature of SimHash.
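The Hamming Distance filter of Step 2 can be illustrated with a short sketch. The function names and the dictionary layout are assumptions made for illustration; the threshold of 3 is the one stated in the text.

```python
def hamming_distance(index_a, index_b):
    """Count the positions where two equal-length 0/1 indexes differ (bitwise XOR)."""
    if len(index_a) != len(index_b):
        raise ValueError("indexes must have the same length")
    return sum(a ^ b for a, b in zip(index_a, index_b))


def probably_similar_friends(target_index, user_indexes, max_distance=3):
    """Return the ids of users whose index is within Hamming Distance < max_distance.

    user_indexes: dict mapping a user id to that user's 0/1 index.
    """
    return [
        user_id
        for user_id, index in user_indexes.items()
        if hamming_distance(target_index, index) < max_distance
    ]
```

For example, `hamming_distance([1, 0, 1, 1], [0, 0, 1, 0])` is 2, so such a user would be kept as a “probably similar” friend.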

Step 3 (finding “really similar” friends of the target user). The users in the set obtained in Step 2 are only “probably similar” friends of the target user, not necessarily “really similar” friends. Considering this point, in this step, we further determine the “really similar” friends of the target user from this set. Concretely, for each “probably similar” friend, we calculate his/her similarity with the target user based on the Pearson Correlation Coefficient (PCC) [18] in (4). As the set of “probably similar” friends is often small, only a small number of users take part in the user similarity calculation process in (4); as a consequence, we can protect the private service quality data observed by the remaining majority of users.
In (4), the sum is taken over the service intersection invoked by both users; the quality dimension of web services considered is, for example, response time; the terms of the sum use each service’s quality values over that dimension observed by the two users, respectively, together with each user’s average quality value over that dimension across all the services he/she has invoked. A special case of (4) applies when the service intersection is empty. Moreover, if the condition in (5) holds, the “probably similar” friend can be regarded as a “really similar” friend of the target user and is put into the set of “really similar” friends. Here, the condition in (5) compares the similarity with a predefined similarity threshold.
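A minimal sketch of the PCC-based similarity and the threshold test of Step 3 is given below. It follows the verbal description of (4): sums run over the commonly invoked services, and averages run over all services each user has invoked. Several points are assumptions rather than details confirmed by the text: the similarity is taken as 0 when the intersection is empty or a denominator is 0, the threshold test is assumed to be "greater than or equal to", and all names are hypothetical.

```python
import math


def pcc_similarity(target_quality, other_quality):
    """Pearson Correlation Coefficient over the commonly invoked services.

    Each argument maps a service id to the quality value observed by one user
    over a single quality dimension (e.g., response time).
    """
    common = set(target_quality) & set(other_quality)
    if not common:
        return 0.0  # assumption: no common service means similarity 0
    mean_t = sum(target_quality.values()) / len(target_quality)
    mean_o = sum(other_quality.values()) / len(other_quality)
    numerator = sum(
        (target_quality[s] - mean_t) * (other_quality[s] - mean_o) for s in common
    )
    norm_t = math.sqrt(sum((target_quality[s] - mean_t) ** 2 for s in common))
    norm_o = math.sqrt(sum((other_quality[s] - mean_o) ** 2 for s in common))
    if norm_t == 0 or norm_o == 0:
        return 0.0  # assumption: undefined correlation is treated as 0
    return numerator / (norm_t * norm_o)


def really_similar_friends(similarities, threshold):
    """Keep the users whose similarity reaches the threshold (assumed >=)."""
    return {user: sim for user, sim in similarities.items() if sim >= threshold}
```

Only the users that survived the Hamming Distance filter would be passed to `pcc_similarity`, which is what keeps the raw quality data of the remaining users out of the computation.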

Step 4 (service recommendation). For all the users in the set of “really similar” friends (obtained in Step 3), we rank them by their similarity with the target user (see (4)) in descending order and return the Top 3 (at most) similar friends of the target user. Afterwards, for each service never invoked by the target user, we predict its quality over the considered dimension as observed by the target user by (6), based on that service’s quality values over the same dimension observed by the Top 3 similar friends. Finally, we select the service with the optimal predicted quality and recommend it to the target user, which finishes the whole service recommendation process.
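Step 4 can be sketched as below. Since the exact form of (6) is not reproduced in this text, the sketch assumes a plain average of the quality values observed by the Top 3 friends who invoked the service; the original formula may differ (e.g., it may weight friends). The function name, argument layout, and the `higher_is_better` switch (throughput versus response time) are likewise hypothetical.

```python
def recommend_service(target_quality, friend_sims, friend_quality,
                      top_k=3, higher_is_better=True):
    """Predict quality for services the target user never invoked and pick the best.

    target_quality: dict service id -> quality observed by the target user.
    friend_sims: dict friend id -> similarity with the target user.
    friend_quality: dict friend id -> {service id -> observed quality}.
    Returns (best service id or None, dict of predictions).
    """
    # rank "really similar" friends by similarity, descending, and keep at most top_k
    top_friends = sorted(friend_sims, key=friend_sims.get, reverse=True)[:top_k]
    candidates = {
        service for friend in top_friends for service in friend_quality[friend]
    } - set(target_quality)
    predictions = {}
    for service in candidates:
        observed = [
            friend_quality[friend][service]
            for friend in top_friends
            if service in friend_quality[friend]
        ]
        # assumption: simple mean over the friends who invoked the service
        predictions[service] = sum(observed) / len(observed)
    if not predictions:
        return None, predictions  # nothing to recommend: a failure case
    pick = max if higher_is_better else min
    return pick(predictions, key=predictions.get), predictions
```

Returning `None` when no prediction can be made corresponds to the recommendation failures discussed in the experiments.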

5. Experiments

5.1. Experiment Configurations

In this section, a set of experiments is deployed on the WS-DREAM dataset [19] to validate the feasibility of our proposed recommendation approach. WS-DREAM is a real-world service quality (e.g., throughput) dataset obtained from 339 users on 5825 web services from different countries. To simulate the recommendation scenario that we focus on in this paper (i.e., recommendation in a distributed cloud environment), each country is regarded as a cloud platform.

We compare our approach with a benchmark approach, UPCC [20], and two other up-to-date privacy-preserving recommendation approaches, that is, P-UIPCC [14] and PPICF [15]. Many works, for example, [21–23], consider the time cost and the MAE as the evaluation criteria; likewise, we adopt these two criteria in this paper. In our approach, most user privacy information (e.g., whether a user has invoked a service or not and the service quality observed by a user) can be protected by the intrinsic nature of SimHash; therefore, we do not evaluate the privacy-preservation capability of our proposal here.
(1) Time cost: the consumed time for recommending a web service to the target user, which can be used to measure the recommendation efficiency and scalability.
(2) MAE: the difference between the predicted quality and real quality of recommended services (the smaller the better), which can be used to measure the recommendation accuracy.
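For reference, the MAE criterion can be computed as in the following sketch, which uses the standard mean-absolute-error definition; the function name and the list-based interface are assumptions made for illustration.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and real quality values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)
```

For instance, predictions `[1.0, 2.0, 4.0]` against real values `[1.0, 3.0, 2.0]` give absolute errors 0, 1, and 2, hence an MAE of 1.0.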

The density of the user-service quality matrix is set at 3%, and the experiments are conducted on a Lenovo laptop with a 2.40 GHz processor and 12.0 GB RAM. The laptop runs Windows 10 and Java 8. Each experiment is repeated 10 times and the average experiment results are reported.

5.2. Experiment Results and Analyses

Concretely, the following four profiles are tested and compared, respectively. Here, m and n denote the number of users and number of web services, respectively; the user similarity threshold is held fixed.

Profile 1: Recommendation Efficiency Comparison. In this profile, we test the time cost of our proposal with respect to m and n and compare it with the remaining three approaches. The experiment parameters are set as follows: m is varied from 50 to 300; n is varied from 1000 to 5000. The concrete experiment results are shown in Figure 3 (n is held fixed in Figure 3(a) and m = 300 holds in Figure 3(b)).

As can be seen from Figure 3(a), the time costs of the UPCC, P-UIPCC, and PPICF approaches all increase approximately linearly with the growth of m; this is because more time is needed to calculate user similarities when the number of users, that is, m, becomes larger. In contrast, our proposed approach outperforms those three in terms of time cost, as most jobs (e.g., building user indexes) can be finished offline before a service recommendation request arrives. Furthermore, after the hashing process, only a few “probably similar” friends of the target user are obtained; as a consequence, little time is taken to find the “really similar” friends of the target user from the small number of “probably similar” friends. For these two reasons, the recommendation efficiency and scalability of our proposed approach are improved significantly. Similar comparison results can be observed in Figure 3(b), for the same reasons as in Figure 3(a).

Profile 2: Recommendation Accuracy Comparison. Accuracy is a key criterion to evaluate the quality of a recommender system. Therefore, in this profile, we test the MAE (the smaller the better) of our proposal and compare it with the remaining three approaches. The experiment parameters are set as follows: m is varied from 50 to 300; n is varied from 1000 to 5000. The experiment results are presented in Figure 4 (n is held fixed in Figure 4(a) and m = 150 holds in Figure 4(b)).

As Figure 4 shows, the recommendation accuracy values of the P-UIPCC and PPICF approaches are often low (i.e., MAE values are high), as many approximate operations are employed in these two approaches to protect user privacy, for example, the data obfuscation technique adopted in P-UIPCC and the data segmentation-merging technique adopted in PPICF. These techniques protect the private information of users effectively but, at the same time, decrease the accuracy of recommended results. In contrast, our proposed approach achieves approximately the same service recommendation accuracy as the benchmark approach UPCC, as the SimHash technique adopted in our approach can find the “really similar” friends of a target user with high probability and thereby achieve a high recommendation accuracy.

Profile 3: Number of “Probably Similar” Friends of the Target User with respect to m and n. In our approach, a small number of “probably similar” friends of a target user are obtained. In this profile, we test the relationship between this number and m and n. Experiment parameters are set as follows: m is varied from 50 to 300; n is varied from 1000 to 5000. The concrete experiment results are presented in Figure 5.

As Figure 5(a) shows, the number of “probably similar” friends increases approximately linearly with the growth of m; this is because it is more probable to find a “probably similar” friend of the target user when the candidate user space becomes larger. As Figure 5(b) shows, the number of “probably similar” friends increases relatively slowly when n rises, for two reasons. First, more valuable recommendation information is available when the number of services, that is, n, increases; as a consequence, more “probably similar” friends of the target user can be found by our proposed approach. Second, due to the intrinsic nature of the SimHash technique adopted in our approach, the number of services, that is, n, does not directly influence the process of finding “probably similar” friends; hence, the influence of parameter n on the number of “probably similar” friends is less pronounced than that of m in Figure 5(a).

Profile 4: Recommendation Failure Rate with respect to m and n. The SimHash technique adopted in this paper is essentially a kind of probability-based similar neighbor finding approach [24]. Therefore, our proposed approach may fail to return any recommended result in certain situations, that is, a failure occurs. Considering this point, in this profile, we test the recommendation failure rate of our approach with respect to m and n. Concretely, the failure rate can be measured by the equation in (7), which is defined in terms of the number of successful service recommendations and the number of failed service recommendations. The concrete experiment parameters are set as follows: m is varied from 50 to 300; n is varied from 1000 to 5000. The experiment results are shown in Figure 6.
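Equation (7) is not reproduced in this text. A form consistent with its verbal description, using the assumed symbols $N_{s}$ and $N_{f}$ for the numbers of successful and failed service recommendations, would be:

```latex
\[
\text{failure rate} = \frac{N_{f}}{N_{s} + N_{f}}
\]
```

Under this reading, a failure rate of 0 means that every recommendation request returned a result.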

As Figure 6(a) shows, the failure rate of our approach decreases with the growth of m; this is because it is more probable to find the “probably similar” friends of a target user when the candidate space of users becomes larger. Moreover, the failure rate approaches 0 when m is large enough, for example, when m = 200, 250, or 300. Figure 6(b) shows the relationship between the failure rate of our approach and the number of services, that is, n. As indicated in Figure 6(b), the failure rate generally drops with the growth of n; this is because when the number of services increases, the probability that two users have invoked common services grows accordingly, and hence it is more probable to find the “probably similar” friends of a target user. Furthermore, as can be seen from Figure 6(b), the failure rate of our approach approaches 0 when n is large enough.

5.3. Shortcoming Analyses

Based on the experiment results, we can conclude that our approach achieves a good tradeoff among recommendation accuracy, efficiency, and failure rate while guaranteeing privacy-preservation. However, other evaluation criteria are not discussed in depth, such as the well-known consistency criterion (e.g., the inferred friend consistency) suggested in work [25]. Besides, as [26] indicates, weight plays an important role in the final evaluation results; however, we do not consider the weight of found friends in this paper for simplicity.

6. Conclusions and Future Work

In the distributed cloud environment, a cloud platform is often not willing to share its recorded user-service invocation data with other cloud platforms due to privacy concerns, which severely decreases the feasibility of cross-cloud collaborative service recommendation. Besides, the user-service invocation data recorded by each cloud platform may be updated over time, which significantly reduces the recommendation scalability. In view of these two challenges, a novel privacy-preserving and scalable service recommendation approach based on SimHash is put forward in this paper. To validate the feasibility of our proposal, we conduct a set of experiments based on a real distributed service quality dataset, WS-DREAM. Experiment results show that our approach outperforms the other up-to-date approaches in terms of recommendation accuracy and efficiency while guaranteeing privacy-preservation.

As work [27] indicates, SimHash is essentially a probability-based search technique and, hence, failure is inevitable in certain situations. Considering this point, in the future, we will continue to refine our proposal so as to further decrease the recommendation failure rate and boost the recommendation robustness. Besides, due to the inherent shortcoming of various hash-based privacy-preservation techniques suggested in [28], it is hard to evaluate the privacy-preservation performance of our proposal. In the future, we hope to find well-adopted technical criteria to evaluate the effectiveness of our proposal in terms of privacy-preservation. Moreover, work [29] proposes to utilize the semantic information to improve the retrieval performance; likewise, we hope to refine our work by adding more semantic information in the future.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This paper is partially supported by the Natural Science Foundation of China (no. 61402258, no. 61672276, no. 61373027, and no. 61672321), Key Research and Development Project of Jiangsu Province (no. BE2015154, no. BE2016120), and Open Project of State Key Laboratory for Novel Software Technology (no. KFKT2016B22).