Abstract

With the explosive growth of information resources in the age of big data, people increasingly face serious “information overload.” In the face of massive data, collaborative filtering algorithms play an important role in information filtering and refinement. However, the recommendation quality and efficiency of current collaborative filtering recommendation algorithms are low. This study combines an improved artificial bee colony algorithm with the K-means algorithm and applies them to the recommendation system, forming a new collaborative filtering recommendation algorithm. The experimental results show that the MAE value obtained with the new fitness function is 0.767 on average, the resulting clusters show good separation and compactness, and the improved algorithm exhibits high search accuracy and speed. Compared with the original collaborative filtering algorithm, the proposed algorithm has a lower mean absolute error, and its running time is only 50 s. It improves recommendation quality while ensuring recommendation efficiency, providing a new research path for the improvement of collaborative filtering recommendation algorithms.

1. Introduction

With the continuous advancement of technology and the rapid development of the Internet, rich information resources inundate everyone all the time, and both information producers and consumers face huge challenges [1]. Information producers need to deliver their information precisely to eligible target users, while information consumers have to select effective data that satisfies their needs from a redundant and complicated mass of data; hence the emergence of recommendation systems. By analyzing users’ historical data and actively providing them with news and products that meet their needs and interests, recommendation systems are able to filter information and provide users with personalized information services. They play an important role in many areas of social life, such as current business systems [2]. The collaborative filtering recommendation algorithm is a comprehensive filtering algorithm that takes the similarity of items’ attributes or users’ ratings as the basis for personalized recommendation. It can handle unstructured complex objects without first extracting the content of the item, and it is gradually occupying a central position in recommendation systems. However, current collaborative filtering recommendation algorithms face a series of significant problems due to their own algorithmic characteristics, such as difficulty in ensuring real-time performance when dealing with huge data volumes. Ortega et al. developed a hybrid recommendation algorithm with multiclass classification algorithms, executed on the basis of user rating behavior, to improve prediction and recommendation quality [3]. In addition, the artificial bee colony (ABC) algorithm has been used effectively to improve clustering performance owing to its few parameters, simplicity, ease of implementation, and global optimum-seeking capability [4]. The K-means algorithm, by contrast, is prone to falling into local extrema, relies heavily on the selection of initial points, and has high time complexity in determining cluster centers. Therefore, the second part of the study reviews the literature on collaborative filtering algorithms at home and abroad. In the first section of the third part, the artificial bee colony algorithm is improved and combined with the K-means algorithm. The second section of the third part presents a collaborative filtering recommendation method based on the improved bee colony K-means algorithm. The fourth part verifies the application effect of the proposed method. The fifth part concludes the study.

2. Related Work

In recent years, collaborative filtering recommendation algorithms and artificial bee colony K-means clustering models have received a great deal of attention among related professionals at home and abroad, and researchers have proposed many new methods for this purpose. Al-Bakri and Hassan proposed a modest approach to enhance data prediction by applying a user-based collaborative filtering algorithm to clustered data, and the results showed that the algorithm improved the scalability of the recommendation system [5]. Selvi and Sivasankar used supervised adaptive genetic networks to locate the most popular data points in clusters as a way to ensure simple and effective recommendations and reduce error rates; the effectiveness of the algorithm was demonstrated through experiments on the Netflix dataset [6]. Yang et al. proposed a time-weighted collaborative filtering algorithm with improved small-batch K-means clustering for sparse rating matrices and derived user scores with high recall and rating prediction accuracy, addressing shortcomings such as user interest bias in traditional collaborative filtering algorithms [7]. Wang et al. proposed a collaborative filtering algorithm incorporating temporal factors and applied it to score prediction and nearest neighbor selection with a time-weighted function; the results showed that the algorithm has good operational performance [8]. Najafabadi et al. performed neighbor selection for each user through user-based fuzzy clustering and a new similarity metric, and the results showed that this can improve the accuracy of recommendations to users [9]. Garanayak et al. developed a new recommendation system using K-means and item-based collaborative filtering techniques to filter out desired information segments based on people’s preferences and concerns [10].

Ashaduzzaman et al. proposed a clustering method that integrates multicriteria ratings into traditional recommender systems, rating in multidimensional situations such as auxiliary information, contextual information, and multiple criteria, and the results showed that the method was able to produce effective recommendations [11]. Ali and Wasid used the K-means algorithm for clustering and incorporated user-based rating criteria; the Mahalanobis distance metric was used to calculate clustering similarity and generate a neighborhood set. Experiments showed that the algorithm improved the quality of recommendations [12]. Li et al. proposed a collaborative filtering algorithm based on category priority to address the scalability and data sparsity problems of collaborative filtering recommendation algorithms, together with a user-item priority ratio used to calculate a priority ratio matrix. The recommendation accuracy of this algorithm was improved by 2.81% in experiments on the MovieLens dataset [13]. Onuean et al. set up recommendation items based on memory-based collaborative filtering techniques and used K-means clustering to cluster the data; the experimental results showed that the method has high accuracy in prediction and recommendation [14]. Zhu et al. proposed a fuzzy clustering-based method that evaluates prediction-driven uncertainty and classifies based on existing data; experimental results showed that the method outperforms traditional collaborative filtering recommendation algorithms [15]. Chakraborty et al. addressed the problem of large bias in clustering results caused by initial guesses of the clustering centers by combining the K-means algorithm with a volume metric algorithm and a genetic algorithm, so as to predict the optimal initial clustering centers. Experimental results showed that the algorithm improved prediction efficiency [16]. Daoudi et al. developed a new parallel K-means algorithm on a graphics processing unit that selects the initial centers of mass using an open computing language in the programming environment, with the initialization steps performed in parallel. Experimental results showed that the method reduced running time while maintaining quality [17].

In summary, the K-means algorithm has significant advantages in user clustering, and most researchers have introduced K-means clustering into traditional collaborative filtering recommendation algorithms to improve recommendation efficiency, but less research has been conducted on combining artificial bee colony algorithms with K-means. Therefore, this study improves the bee colony K-means model as a way to enhance the performance of recommendation systems.

3. Collaborative Filtering Recommendation Algorithm Based on Swarm K-Means Clustering Model

3.1. Improvement of Bee Colony Algorithm Based on K-Means Clustering Model

The artificial bee colony (ABC) algorithm is a swarm intelligence optimization algorithm developed by simulating the foraging behavior of honey bee colonies. It has the advantages of simple operation, easy implementation, fast convergence, and few control parameters, and it is widely used in optimization problems such as function optimization and data mining [18]. The K-means algorithm, a distance-based hard clustering algorithm, easily ends up at a local optimum and can be difficult to apply to data classification, despite its fast clustering speed and strong local search capability [19]. The K-means objective criterion function is the sum of the distances from each point to its cluster center; solving for the extreme value of this function yields the iterative adjustment rule. The K-means algorithm first selects the number of clusters $k$ and a dataset $X = \{x_1, x_2, \ldots, x_n\}$ containing $n$ data samples, from which $k$ initial cluster centers are selected randomly; the remaining nodes are then divided into the class of the nearest centroid according to their distance to each initial cluster center. The new centroid of each cluster is recalculated, and it is judged whether the new cluster centroids change, i.e., whether the criterion function converges. If it converges, the algorithm ends; otherwise, it continues to the next iteration, as shown in Figure 1.

The K-means algorithm divides the data samples mainly by measuring similarity, and the Euclidean distance formula for similarity between samples is given in

$$ d(x_i, x_j) = \sqrt{\sum_{l=1}^{m} \left( x_{il} - x_{jl} \right)^2} \tag{1} $$

In Equation (1), $x_i$ and $x_j$ are data samples in the dataset, while $x_{il}$ and $x_{jl}$ are the values of the two samples in the $l$th of the $m$ dimensions. The objective criterion function of the K-means algorithm is the mean squared error, as shown in

$$ E = \sum_{j=1}^{k} \sum_{x \in C_j} \left\| x - c_j \right\|^2 \tag{2} $$

In Equation (2), $x$ represents the data elements in the selected sample, $E$ represents the sum of the squared deviations of the data elements, and $c_j$ represents the center of cluster $C_j$. Cluster analysis is judged by the sum of squares of errors, as shown in Equation (3), with the aim of obtaining an optimal set of divisions that are as independent as possible between clusters and as compact as possible within clusters.

$$ J = \sum_{j=1}^{k} \sum_{i=1}^{n_j} d\left( x_i^{(j)}, \bar{x}_j \right)^2 \tag{3} $$

In Equation (3), $x_i^{(j)}$ represents a data sample in class $j$, $\bar{x}_j$ is the mean value of the data objects in cluster $j$, $n_j$ represents the number of data objects in the $j$th cluster, and $d(x_i^{(j)}, \bar{x}_j)$ represents the Euclidean distance between $x_i^{(j)}$ and $\bar{x}_j$. The artificial bee colony algorithm is then introduced, improved, and combined with the K-means algorithm. The improvement targets three levels — the fitness function, population initialization, and position update — and the conceptual correspondence between the ABC algorithm and the K-means algorithm is shown in Table 1.
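As a concrete reference for the procedure above, the following is a minimal Python (NumPy) sketch of plain K-means, using the Euclidean distance of Equation (1) and judging convergence on the error criterion of Equations (2) and (3); function and parameter names are illustrative, not the exact implementation used in the study.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=None):
    """Plain K-means: random initial centers, nearest-center assignment
    by Euclidean distance (Eq. (1)), and convergence judged on the
    sum-of-squared-errors criterion (Eqs. (2)-(3))."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    prev_sse = np.inf
    for _ in range(max_iter):
        # Eq. (1): distance of every sample to every cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)            # nearest-centroid assignment
        # Recompute each centroid as the mean of its cluster.
        centers = np.stack([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
        # Eqs. (2)-(3): sum of squared errors within clusters.
        sse = float(((X - centers[labels]) ** 2).sum())
        if abs(prev_sse - sse) < tol:            # criterion function converged
            break
        prev_sse = sse
    return labels, centers, sse
```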

The initialization process of the artificial bee colony algorithm randomly generates the nectar sources (the colony contains twice as many bees as nectar sources) and randomly generates the information of each source within a given range of values; the fitness value of each source is then calculated. The initialization formula is shown in

$$ x_{ij} = x_j^{\min} + \mathrm{rand}(0,1)\left( x_j^{\max} - x_j^{\min} \right) \tag{4} $$

In Equation (4), $x_{ij}$ is the information of nectar source $i$ in the $j$th dimension, whose fitness value is denoted $fit_i$; $\mathrm{rand}(0,1)$ is a random number between $(0, 1)$, $x_j^{\max}$ represents the upper limit of the $j$th dimension of the data, and $x_j^{\min}$ is the lower limit of the $j$th dimension. The conventional fitness function is shown in

$$ fit_i = \frac{1}{1 + \sum_{j=1}^{k} \frac{1}{n_j} \sum_{x \in C_j} d(x, c_j)} \tag{5} $$

In Equation (5), $fit_i$ is the fitness value of nectar source $i$, and $n_j$ represents the number of points in cluster $C_j$. In order to combine the improved artificial bee colony algorithm with the K-means algorithm, it is necessary to construct a fitness function that makes K-means clustering more efficient and faster. Considering the intraclass distance and the number of points contained in each cluster as influences, the new fitness formula is given in

$$ fit_i = \frac{\sum_{j=1}^{k} D_j}{1 + \sum_{j=1}^{k} \frac{1}{n_j} \sum_{x \in C_j} d(x, c_j)} \tag{6} $$

In Equation (6), $D_j$ represents the sum of the distances from the other cluster centers to the cluster center $c_j$, and $k$ represents the number of cluster centers. The hired bee performs a neighborhood search at its current location and selects a nectar source as shown in

$$ v_{ij} = x_{ij} + \phi_{ij}\left( x_{ij} - x_{kj} \right) \tag{7} $$

In Equation (7), $v_{ij}$ represents the search velocity (the new candidate position), $\phi_{ij}$ is a random number between $[-1, 1]$, and $k$ is not equal to $i$. The ability of the swarm to obtain a higher quality honey source at a faster convergence rate depends on the location update formula. The existing artificial bee colony algorithm converges slowly in the later stages, so a global bootstrap factor is introduced, and the new location update formula is shown in

$$ v_{ij} = x_{ij} + \phi_{ij}\left( x_{kj} - x_{lj} \right) + \psi_{ij}\left( x_{\mathrm{best},j} - x_{ij} \right) \tag{8} $$

In Equation (8), $v_{ij}$ represents a new location near $x_{ij}$; $j$, $k$, and $l$ are indices obtained from a random formula, with $k$ and $l$ both not equal to $i$ and mutually exclusive; $x_{\mathrm{best},j}$ represents the current highest quality food source; and $\phi_{ij}$ and $\psi_{ij}$ are random numbers in $[-1, 1]$ and $[0, 1]$, respectively. The formula for calculating the selection probability when the full hired bee search is complete is given in

$$ p_i = \frac{fit_i}{\sum_{j=1}^{SN} fit_j} \tag{9} $$

In Equation (9), $p_i$ represents the probability, $fit_i$ is the new fitness function value of source $i$, and $SN$ is the number of nectar sources. The new location update formula shows that when the optimal location in the population is far from an individual’s location, the individual’s next search iteration increases the step size and approaches the global optimal location at a faster rate; conversely, it converges slowly. Commonly used external evaluation metrics for clustering are shown in

$$ F(i) = \frac{2 \times P(i) \times R(i)}{P(i) + R(i)} \tag{10} $$

In Equation (10), $R$ is the completeness (recall) rate, $P$ is the accuracy (precision) rate, and $F$ is the external evaluation index of the clusters. The weighted average of the $F$-measure of each category gives the overall $F$, as shown in

$$ F = \sum_{i} \frac{n_i}{n} F(i) \tag{11} $$

In Equation (11), $n$ represents the number of all objects in the classification and $n_i$ the number of objects in category $i$. The flow chart of the improved algorithm is shown in Figure 2.

As can be seen from Figure 2, the improved algorithm has eight basic steps. First, the numbers of scout bees, follower bees, and leader bees are initialized, with equal numbers of followers and leaders. Second, the initial colony is clustered, and the new fitness formula is used to derive fitness values, which are arranged in descending order. Third, the leader bees search the vicinity of their current locations to obtain new food sources and choose whether to keep each new food source by comparing its fitness value with that of the original source. Step 4 is to calculate the selection probability $p_i$ for the follower bees. Step 5 is to perform a nearest neighbor search after each follower bee has selected a leader bee. Step 6 is to obtain the cluster centers corresponding to the new food sources after the neighborhood iteration and perform K-means clustering on the population. Step 7 is to check whether any food sources have not been renewed after the set number of iterations. Finally, depending on whether the number of iterations satisfies the termination condition, the algorithm either terminates or returns to step 2.
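The following Python sketch shows how these eight steps could fit together in code. It is a simplified illustration under the equation forms reconstructed above: the `fitness()` helper uses the assumed separation/compactness ratio of Equation (6), the position update follows Equation (8), and the follower (onlooker) roulette selection follows Equation (9); all names, defaults, and the scout `limit` rule are assumptions rather than the authors’ exact code.

```python
import numpy as np

def abc_kmeans(X, k, n_sources=20, max_iter=100, limit=20, seed=None):
    """Simplified bee colony K-means loop. A food source encodes k candidate
    cluster centers; leaders and followers refine the sources, scouts replace
    exhausted ones, and the best source seeds the final K-means clustering."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    dim = X.shape[1]

    def fitness(centers):
        # Assumed form of Eq. (6): inter-center separation over
        # (1 + mean intra-cluster distance), so larger is better.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        intra = sum(d[labels == j, j].mean()
                    for j in range(k) if np.any(labels == j))
        sep = sum(np.linalg.norm(centers[a] - centers[b])
                  for a in range(k) for b in range(a + 1, k))
        return sep / (1.0 + intra)

    def explore(i, best):
        # Eq. (8): gbest-guided update around source i with two random
        # neighbors m1 != m2, both different from i.
        m1, m2 = rng.choice([m for m in range(n_sources) if m != i],
                            size=2, replace=False)
        phi, psi = rng.uniform(-1, 1), rng.uniform(0, 1)
        cand = sources[i] + phi * (sources[m1] - sources[m2]) \
                          + psi * (best - sources[i])
        return np.clip(cand, lo, hi)

    # Eq. (4): random initialization of the nectar sources within [lo, hi].
    sources = lo + rng.random((n_sources, k, dim)) * (hi - lo)
    fits = np.array([fitness(s) for s in sources])
    trials = np.zeros(n_sources)

    for _ in range(max_iter):
        best = sources[fits.argmax()].copy()
        # Eq. (9): follower bees pick sources in proportion to fitness.
        total = fits.sum()
        probs = fits / total if total > 0 else np.full(n_sources, 1 / n_sources)
        followers = rng.choice(n_sources, size=n_sources, p=probs)
        # Leader (employed) phase, then follower (onlooker) phase.
        for i in list(range(n_sources)) + list(followers):
            cand = explore(i, best)
            f = fitness(cand)
            if f > fits[i]:                       # greedy selection
                sources[i], fits[i], trials[i] = cand, f, 0
            else:
                trials[i] += 1
        # Scout phase: re-seed sources not renewed within `limit` trials.
        for i in np.where(trials > limit)[0]:
            sources[i] = lo + rng.random((k, dim)) * (hi - lo)
            fits[i], trials[i] = fitness(sources[i]), 0
    return sources[fits.argmax()]    # use as initial centers for K-means
```

A final K-means pass started from the returned centers (as in the earlier sketch) then yields the user clusters used in Section 3.2.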

3.2. Recommendation Algorithm Based on Improved Clustering Model

The collaborative filtering recommendation algorithm can filter information that machines cannot automatically analyze and is able to recommend new information; it has gradually become one of the most widely used and successful recommendation algorithms in recommendation systems [20]. However, collaborative filtering recommendation algorithms face the problems of data sparsity and cold start. The sparsity problem refers to the fact that users voluntarily give few reviews, and common reviews of the same item by different users are scarcer still, so the similarity calculated between items and between users tends to have low accuracy. The cold start problem refers to the lack of historical evaluation data when new users or new items enter the recommender system, or when a brand new system has just been launched [21]. The data sparsity problem is mainly compensated for by filling in other useful information, which helps build effective models of user interests and item characteristics; alternatively, the rating data can be preprocessed with machine learning methods, such as matrix partitioning, matrix decomposition, and clustering, on the basis of existing data [22]. The cold start problem relies on incorporating trust relationships, background knowledge, and demographic information when calculating similarity, integrating item content information, and proposing new similarity metrics. The improved bee colony K-means algorithm is therefore introduced into the collaborative filtering algorithm: users are first clustered with bee colony K-means, i.e., the historical behavioral data of the initial users is analyzed and processed to find the set of users whose interest preferences are similar to those of the user to be recommended. After clustering, $k$ clusters are obtained, each with a corresponding cluster center [23]. Then, the similarity between the target user and other users is calculated, and the recommendation list for the target user is formed based on the rating information of the nearest neighbor users. The principle of the algorithm is shown in Figure 3.

As can be seen from Figure 3, in the recommendation for user $u$, user clustering is performed based on the users’ attribute information to obtain $k$ clusters; when user $u$ lies in cluster $C_j$, the nearest neighbor set $N_u$ of $u$ is obtained within cluster $C_j$ based on the user-item rating matrix and the similarity calculation [24]. The similarity between users is mainly expressed by the cosine of the angle between the user vectors, whose magnitude is positively related to the degree of user similarity. The formula for calculating the similarity between two users is given in

$$ \mathrm{sim}(u, v) = \frac{\sum_{i=1}^{m} r_{u,i}\, r_{v,i}}{\sqrt{\sum_{i=1}^{m} r_{u,i}^2}\, \sqrt{\sum_{i=1}^{m} r_{v,i}^2}} \tag{12} $$

In Equation (12), $u$ and $v$ both represent users, $r_{u,i}$ represents the rating of user $u$ in the $i$th dimension of the $m$-dimensional rating space, $r_{v,i}$ is the rating of user $v$ in the $i$th dimension, and $\mathrm{sim}(u, v)$ represents their similarity. Since rating scales differ between users, the difference in rating scales is compensated for by introducing the average user rating into the cosine similarity calculation, i.e., subtracting each user’s average score on the items. The modified cosine similarity formula is shown in

$$ \mathrm{sim}(u, v) = \frac{\sum_{i \in I_{uv}} \left( r_{u,i} - \bar{r}_u \right)\left( r_{v,i} - \bar{r}_v \right)}{\sqrt{\sum_{i \in I_{uv}} \left( r_{u,i} - \bar{r}_u \right)^2}\, \sqrt{\sum_{i \in I_{uv}} \left( r_{v,i} - \bar{r}_v \right)^2}} \tag{13} $$

In Equation (13), $r_{u,i}$ represents the rating of user $u$ on item $i$; $\bar{r}_u$ and $\bar{r}_v$ represent the averages of the ratings of all items by user $u$ and user $v$, respectively; and $I_{uv}$ represents the set of items rated by both users. The users are then sorted in descending order of similarity, and a certain number of similar neighbors at the top of the ranking are selected together to form the set of nearest neighbors $N_u$ of the target user $u$. The predicted rating of user $u$ on the items in the item set is shown in

$$ P_{u,i} = \bar{r}_u + \frac{\sum_{v \in N_u} \mathrm{sim}(u, v)\left( r_{v,i} - \bar{r}_v \right)}{\sum_{v \in N_u} \left| \mathrm{sim}(u, v) \right|} \tag{14} $$

In Equation (14), $N_u$ represents the set of nearest neighbors, $P_{u,i}$ is the predicted score of user $u$ on item $i$, and $\mathrm{sim}(u, v)$ is the similarity between user $u$ and user $v$. The predicted scores are then used to obtain a score table, and recommendations are made based on the score table to obtain a recommendation list. The performance of the recommendation system is determined by the user’s satisfaction level, which is also an indicator of the quality of the recommendation [25]. The most commonly used statistical accuracy measure is the mean absolute error (MAE), as shown in

$$ \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| p_i - r_i \right| \tag{15} $$

In Equation (15), $p_i$ is the predicted rating for the $i$th case in the test set, $r_i$ represents the corresponding actual rating, and $N$ represents the size of the test set. The MAE measures the difference between the true and predicted scores and is negatively correlated with recommendation quality; i.e., a smaller MAE value means better recommendation quality, and a larger MAE value means worse recommendation quality.
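To make Equations (13)–(15) concrete, here is a small Python sketch assuming a dense user-item matrix `R` in which 0 marks an unrated item; this layout and the helper names are illustrative assumptions, not the study’s implementation.

```python
import numpy as np

def adjusted_cosine_sim(R, u, v):
    """Eq. (13): adjusted cosine similarity between users u and v,
    computed over co-rated items only (0 = unrated in R)."""
    both = (R[u] > 0) & (R[v] > 0)
    if not both.any():
        return 0.0
    du = R[u, both] - R[u][R[u] > 0].mean()   # center by each user's mean
    dv = R[v, both] - R[v][R[v] > 0].mean()
    denom = np.linalg.norm(du) * np.linalg.norm(dv)
    return float(du @ dv / denom) if denom else 0.0

def predict(R, u, i, neighbors):
    """Eq. (14): predicted rating of user u on item i from the nearest
    neighbors' mean-centered ratings, weighted by similarity."""
    ru = R[u][R[u] > 0].mean()
    num = den = 0.0
    for v in neighbors:
        if R[v, i] > 0:
            s = adjusted_cosine_sim(R, u, v)
            num += s * (R[v, i] - R[v][R[v] > 0].mean())
            den += abs(s)
    return ru + num / den if den else ru

def mae(pred, true):
    """Eq. (15): mean absolute error between predicted and actual ratings."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    return float(np.abs(pred - true).mean())
```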

4. Application Effect Analysis of Collaborative Filtering Recommendation Algorithm Based on Bee Colony K-Means Clustering

An experimental analysis of the improved collaborative filtering recommendation algorithm was conducted. The experimental environment consists of an Intel(R) Core(TM) i5-6200U CPU @ 2.30 GHz (2.40 GHz boost), 8.0 GB of memory, a 500 GB hard disk, and the 64-bit Windows 10 operating system. The development platform is PyCharm, based on Python 3.8.4. PyCharm is a powerful Python IDE whose greatest advantage is its simple and convenient integration with multiple libraries (such as Matplotlib, pandas, and NumPy). The experimental dataset was obtained from 90,000 ratings of 986 movies by 692 users on Douban; the ratings are integers in [1, 5], with the rating positively correlated with the user’s liking, i.e., a higher rating means the user likes the movie more. The specific parameters included in the experiment and their settings are shown in Table 2.
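As an illustration of how such rating data can be prepared on this platform, the snippet below uses pandas to pivot rating triples into a 692 × 986 user-item matrix; the file name and column layout are hypothetical, since the exact export format of the Douban data is not described.

```python
import pandas as pd

# Hypothetical layout: one (user_id, movie_id, rating) triple per line.
ratings = pd.read_csv("douban_ratings.csv",
                      names=["user_id", "movie_id", "rating"])

# Pivot into a user-item matrix; 0 marks movies a user has not rated.
R = (ratings.pivot_table(index="user_id", columns="movie_id", values="rating")
            .fillna(0)
            .to_numpy())
print(R.shape)                                # expected: (692, 986)
print(f"sparsity: {1 - (R > 0).mean():.2%}")  # share of unrated cells
```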

The improved collaborative filtering recommendation algorithm proposed in the study is based on the swarm K-means clustering model. The strength of the new fitness function directly affects the performance of the algorithm, so the traditional fitness function is compared with the new fitness function to test the performance of the improved algorithm. The clustering results of both under the same data set are shown in Figure 4.

In Figure 4, the larger asterisks represent the cluster centers, and points of the same shape belong to the same class. From Figure 4(a), it can be seen that the distance from the sample points to their class center is small, achieving good intraclass compactness, but the separation between classes is not pronounced; the clustering results in Figure 4(b), by contrast, show not only better compactness but also better separation, indicating that the overall performance of the new fitness function is better than that of the traditional fitness function. The impact of the two on the recommendation results is shown in Table 3.

As can be seen from Table 3, the numbers of wins for the new fitness function and the traditional fitness function are 4 and 2, respectively, and the average MAE value is 0.767 for the new fitness function and 0.804 for the traditional one, indicating that the new fitness function outperforms the traditional fitness function in both average value and number of wins. The results indicate that the new fitness function is able to improve the quality of recommendations. To further validate the performance of the improved swarm algorithm, four commonly used test functions — Griewank, Rastrigin, Sphere, and Rosenbrock — were tested. The characteristics of the four test functions are shown in Table 4.

In Table 4, Rastrigin and Griewank have similar characteristics, both being multimodal (multipeaked) functions; Sphere is a unimodal convex function; and Rosenbrock is a unimodal non-convex function whose global minimum lies in a narrow, curved valley. All four functions have a global minimum value of 0. The iterative fitness trends of the original swarm algorithm and the improved algorithm on Rastrigin, Griewank, Rosenbrock, and Sphere are shown in Figure 5.
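For reference, the standard closed forms of the four benchmarks can be written compactly in Python:

```python
import numpy as np

def sphere(x):       # unimodal, convex; minimum 0 at the origin
    return float(np.sum(x ** 2))

def rosenbrock(x):   # narrow curved valley; minimum 0 at x = (1, ..., 1)
    return float(np.sum(100 * (x[1:] - x[:-1] ** 2) ** 2 + (1 - x[:-1]) ** 2))

def rastrigin(x):    # highly multimodal with regularly spaced local minima
    return float(np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x) + 10))

def griewank(x):     # multimodal; many shallow, widespread local minima
    i = np.arange(1, x.size + 1)
    return float(np.sum(x ** 2) / 4000 - np.prod(np.cos(x / np.sqrt(i))) + 1)
```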

From Figures 5(a) and 5(b), it can be seen that the original swarm algorithm’s iterative search for the optimal value suffers, to varying degrees, from slow convergence and local extrema on all four test functions. Figures 5(c) and 5(d) show that the original artificial bee colony algorithm requires a longer iteration time and more iterations than the improved algorithm to reach the same local optimal solution. The fitness trends in Figures 5(b) and 5(d) further show that the original algorithm’s accuracy and precision in searching for the optimal solution are poor, differing from the improved algorithm’s optimal solution by several orders of magnitude. The initialization process of the improved algorithm is purposeful and introduces a global bootstrap factor, so its convergence speed and search accuracy in the iterative search are significantly higher than those of the original algorithm. The improved swarm K-means algorithm was then tested for performance with the parameters set as shown in Table 5.

The numbers of samples in the Iris, Balance-scale, and Glass datasets were 150, 625, and 214, respectively; the numbers of attributes were 4, 4, and 10; and the numbers of classes were 3, 3, and 6. The convergence trends of the fitness values of the improved swarm K-means algorithm over 100 runs on the three datasets are shown in Figure 6.

As can be seen in Figure 6, on the Iris, Glass, and Balance-scale datasets, the change in fitness of the improved swarm K-means algorithm is small over the course of the population iterations, and the new position update formula enables the algorithm to dynamically adjust the search step and gradually approach the global optimum. On the Balance-scale dataset, the algorithm jumps out of a local optimum at around 60 iterations and reaches a position with a higher fitness value. Therefore, the algorithm is able to accurately obtain the global optimal solution in a shorter time, with fewer iterations, faster convergence, and higher stability, both on the Glass dataset with its large attribute dimensionality and on the Balance-scale dataset with its larger sample size. To verify the recommendation quality of the collaborative filtering recommendation algorithm proposed in this study, it is compared with a user-based recommendation algorithm [26], a user-based clustering algorithm [27], and the ICCFRA algorithm [28]. First, 560 users in the dataset were selected to form the training set and 321 to form the test set, and different numbers of nearest neighbors were set. The MAE results of the four recommendation algorithms are shown in Table 6.

In Table 6, “Number” represents the number of nearest neighbors. From Table 6, the MAE values of the algorithm proposed in this study are smaller than those of the user-based clustering algorithm and the user-based recommendation algorithm, and the MAE values of the ICCFRA algorithm gradually come to exceed those of the proposed algorithm as the number of nearest neighbors increases. Therefore, the proposed algorithm’s recommendation results are more reliable, and its accuracy becomes more dependable as the amount of data increases. To verify the running efficiency of the algorithm, its running time was compared with that of the other three algorithms, as shown in Figure 7.

As can be seen from Figure 7, the user-based recommendation algorithm takes significantly more time than the improved algorithm, peaking at 150 seconds when the number of nearest neighbors is 30. The difference in running time between the user-clustering-based collaborative filtering algorithm and the improved algorithm is smaller, with the improved algorithm’s highest time being 50 seconds. This is because the improved algorithm first clusters users and then builds user clusters, greatly reducing the search space for nearest neighbors. The algorithm thus improves the quality of recommendations while ensuring operational efficiency.

5. Conclusion

The collaborative filtering recommendation algorithm is the most widely used and successful recommendation algorithm in recommendation systems, but its recommendation efficiency and quality are currently low. Therefore, an improved bee colony K-means clustering model is established and applied to the collaborative filtering recommendation algorithm to optimize the recommendation system. The experimental results show that, on the same dataset, the MAE value of the new fitness function is 0.767 on average, while that of the traditional fitness function is 0.804. On the four commonly used test functions Rosenbrock, Griewank, Rastrigin, and Sphere, the improved algorithm obtains the same local optimal solution in a shorter iteration time and with fewer iterations. In the iterative optimization process, the improved algorithm has higher convergence speed and search accuracy than the original algorithm. The MAE values of the user clustering algorithm and the user recommendation algorithm are larger than those of the improved algorithm, and the accuracy of the improved algorithm becomes more reliable as the number of nearest neighbors increases. In terms of running time, the improved algorithm has a maximum time of 50 seconds and higher running efficiency. However, the user data included in the study is still relatively limited, and richer information data is needed to determine the number of nearest neighbors.

Data Availability

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The author declares that there is no conflict of interest regarding this article.