Abstract

A recommendation system delivers customized content (articles, news, images, music, movies, etc.) to its users. As interest in recommendation systems grows, we have turned our attention to movie recommendation. Most research on movie recommendation systems focuses on one of three directions: discovering the most relevant features of users; finding users who share the tastes of a given user and recommending the movies those users like; or finding users who are connected to the given user (friends, classmates, colleagues, etc.) and making recommendations based on the tastes of those related people. However, little research has focused on recommending movies based on the movies’ own features. In this paper, we present a novel approach that applies machine learning techniques to cluster movies using a distance matrix built from movie features and then makes movie recommendations in real time. We implement several clustering methods and evaluate their performance on a real movie forum website owned by one of our authors. This idea can also be applied to other types of recommendation systems, such as those for music, news, and articles.

1. Introduction

Movie recommendation systems suggest movies that a user might be interested in. The suggestions are generated by considering many aspects, such as the user’s tastes, interests, goals, social connections, and so forth.

In general, a movie recommendation system compares a user’s profile or usage data to some reference characteristics and combines this with the user’s social environment to make movie recommendations. This type of recommendation is user based. However, it may fail or produce inaccurate recommendations in the following situations. First, the user may not have a complete profile in the system; many users do not set up their profile because of laziness or privacy concerns. In this case, most recommendation systems fall back on the user’s social connections in the system, such as friends, classmates, family members, and colleagues. However, tastes in movies may vary even among best friends. What is worse, the “friends” that the user has added on the website may not be people with the same interests. For example, in a community or local network such as a university or college, like our experimental environment, users’ social connections are built mainly because they attend the same university, share a major, and are of similar age; however, their tastes in movies may be totally different, which violates the fundamental assumption of the recommendation systems mentioned above.

Because of the situation stated above, we propose using machine learning (ML) techniques (clustering) to analyze movie features and system logs (users’ voting logs) in order to make more accurate recommendations. In the proposed approach, we compute a distance matrix over the movie features and apply clustering techniques to group movies into clusters offline. For every user logged on to the system, we recommend movies from these clusters in real time, based on majority voting over the user’s voting history.

In order to compare accuracy and efficiency, we have implemented the following clustering techniques: DBSCAN (density-based clustering), affinity propagation, hierarchical clustering, and random clustering as a baseline.

We have tested this proposal with all the clustering techniques on a university social website owned by one of our authors. The website includes a subsystem that provides a movie service. We added a recommendation function in a banner and placed it at the top of that section so that users can vote. We present the results of all four clustering algorithms shuffled together and record the voting results for each algorithm separately.

The rest of the paper is organized as follows. Section 2 gives an overview of related works in which machine learning techniques have been applied to recommendation systems. Section 3 is the system overview of our recommendation system. Section 4 illuminates the distance functions we use to compute the distance matrix. Section 5 introduces all the clustering techniques used in our preclustering system. Section 6 includes the experimental setup and evaluation results. Finally, Section 7 concludes with discussions and future research.

2. Related Work

Machine learning techniques are very useful when huge amounts of data have to be classified and analyzed, which is now a very common situation in many scenarios, especially in recommendation systems. Many different machine learning techniques are used in recommendation systems, such as Naive Bayes classification [1], decision trees [2], k-means clustering and its improvements [3–6], and so forth.

Generally, recommendation systems are divided into two major categories: collaborative recommendation systems and content based recommendation systems [7]. Collaborative recommendation systems try to find users who share the tastes of the given user and recommend movies according to what those users like. For example, Tatli and Birtürk described an approach for creating music recommendations based on user-supplied tags that are augmented with a hierarchical structure extracted for top level genres from Dbpedia. In this structure, each genre is represented by its stylistic origins, typical instruments, derivative forms, subgenres, and fusion genres. They use this well-organized structure for dimensionality reduction in user profiling [8]. Yang et al. proposed a personalized web page recommendation model called PIGEON (abbreviation for PersonalIzed web paGe rEcommendatiON) via collaborative filtering and a topic-aware Markov model and proposed a graph-based iteration algorithm to discover the topics a user is interested in, based on which user similarities are measured [9]. Cai et al. also proposed a model that fully captures the bilateral role of user interactions within a social network and formulated collaborative filtering methods to enable people to people recommendation [10]. As we mentioned in Section 1, user based recommendation does not work properly in some particular settings, such as a community network system or a university network.

Content based recommendation systems try to recommend content similar to the items the user has liked. Biancalana et al. proposed two different context-aware approaches for the movie recommendation task: one is a hybrid recommender that assesses available contextual factors related to time in order to increase the performance of traditional CF approaches, and the other aims at identifying the users in a household who submitted a given rating [11]. This latter approach is based on machine learning techniques, namely, neural networks and majority voting. In our paper, we focus on content based recommendation and use majority voting and clustering techniques. We obtained better results than their research, as will be shown in Section 6.

Some of the researchers proposed hybrid recommendation approaches by combining different approaches. Ghazanfar and Prügel-Bennett proposed a unique switching hybrid recommendation approach by combining a Naive Bayes classification approach with the collaborative filtering [1]. Ujwala et al. presented and investigated an approach based on weighted Association Rule Mining Algorithm and text mining [7]. Bellogín et al. implemented decision trees and attribute selections together to build the recommendation model to find the most relevant preferences of user and system [2]. The idea of hybrid recommendation approach gives us inspiration to combine different approaches to implement the movie recommendation for future work.

3. System Overview

Figure 1 provides a high-level overview of our new movie recommendation approach. As shown in Figure 1, the new movie recommendation system comes in three parts: distance computation system, preclustering system, and real time online recommendation system.

3.1. Distance Computation System

The distance computation system computes the distance between movies based on their properties, including movie types, publish year, countries of the publishing companies, languages, directors, casts, and duration. We first compute the Jaccard distance for movie types, countries of the publishing companies, languages, directors, and casts, and then the distances for publish year and duration using the functions defined in Sections 4.2.2 and 4.2.3. We then compute the overall distance between each pair of movies by summing these distances with the weights obtained from the survey described in Section 4.1.

3.2. Preclustering System

The preclustering system separates movies into clusters before passing them to the online recommendation system. In this system, we tried five clustering algorithms: affinity propagation, DBSCAN, hierarchical clustering, spectral clustering, and a random generator. Spectral clustering was too slow to produce a result in the required time. Therefore, we sent the remaining four clustering results to the recommendation system in two data formats: one returns the cluster label for a given movie’s IMDB identifier, and the other returns the IMDB identifiers of all movies for a given cluster label. In this way, the online recommendation system in Section 3.3 can quickly get the information it needs.

3.3. Real Time Online Recommendation System

After getting the preclustering results, the website reads the history of users’ votes on movies and uses majority voting to generate recommendations. In more detail, if a user $u$ likes a movie $m$ that belongs to cluster $c$, $u$ gives one vote $v$ to $c$. On the contrary, if $u$ dislikes a movie $m$ in $c$, $u$ cancels one vote $v$ for $c$. If a cluster ends up with a negative $v$, it is reset to 0. After $u$ has voted on all the movies he/she comments on, different clusters have different numbers of votes. The movie recommendation system then recommends movies according to the votes of the clusters as follows:

$$P_i = \frac{v_i}{\sum_{j=1}^{n} v_j} \qquad (1)$$

where $P_i$ is the probability that a movie should be recommended from cluster $i$, $v_i$ is the number of votes in cluster $i$, and $n$ is the number of clusters. The process is illustrated in Figure 2.
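The voting rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `cluster_votes` and `cluster_probabilities`, the `(movie_id, vote)` history format, and the handling of the case where no cluster has a positive tally are hypothetical and are not taken from the deployed system.

```python
from collections import defaultdict

def cluster_votes(vote_history, movie_to_cluster):
    """Tally one user's votes per cluster.

    vote_history: list of (movie_id, vote) pairs, vote is +1 (like),
    -1 (dislike) or 0 (unseen). Hypothetical input format.
    """
    votes = defaultdict(int)
    for movie_id, vote in vote_history:
        cluster = movie_to_cluster.get(movie_id)
        if cluster is None or vote == 0:
            continue  # unknown movies and unseen votes do not count
        votes[cluster] += 1 if vote > 0 else -1
    # A cluster whose tally ends up negative is reset to 0.
    return {c: max(v, 0) for c, v in votes.items()}

def cluster_probabilities(votes):
    """Probability of recommending from each cluster: P_i = v_i / sum_j v_j."""
    total = sum(votes.values())
    if total == 0:
        # Assumption: with no positive tally, no cluster is preferred.
        return {c: 0.0 for c in votes}
    return {c: v / total for c, v in votes.items()}
```

A recommender could then sample a cluster according to these probabilities and pick an unvoted movie from it.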

In our evaluation, we want to make sure that a user does not vote dislike on a recommended movie simply because the movie itself is really bad. Therefore, within the clusters selected by the user’s votes, we recommend only movies with an IMDB score higher than 7.

4. Distance Computation System

4.1. Weight Survey

Movie properties form a very special space in which the dimensions carry very different weights. For example, the type of a movie is of course more important than its duration. Therefore, we cannot use the default distance functions provided by the Scikit-learn [18] package that we use. Instead, either feature selection as in [2] or a survey to determine the weights of the different dimensions should be used. In other research areas where human beings do not know which features are important, such as carcinogenicity studies, feature selection is commonly used to decide the weights of different features. In the movie recommendation area, however, we can investigate the weights with the survey shown in Figure 3, since people know why they like a movie. By carefully designing the survey and letting users vote for their top three factors, we obtained the result shown in Figure 4.

From the survey result shown in Figure 4, we can figure out that the weight of publish year is 0.0915, weight of country is 0.147, weight of type is 0.4167, weight of language is 0.0545, weight of director is 0.0896, weight of casts is 0.1792, and weight of duration is 0.0214. These weights will be used in the overall distance computation in Section 4.3.

4.2. Distance Function
4.2.1. Jaccard Distance

For the countries, languages, casts, and types of a movie, the distance is determined by how many items two sets share and how many differ, so we decided to use the Jaccard distance proposed in [12]. Jaccard distance is a statistic used for measuring dissimilarity between sample sets. The formula is shown in formula (2). According to this formula, we use intersection and union to do the calculation and easily obtain the distance for each movie feature mentioned above as follows:

$$d_J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|} \qquad (2)$$

where $A$ and $B$ are the sets of values of one feature for the two movies being compared.
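As a small illustration of the Jaccard distance on set-valued movie features, the following sketch uses Python sets. The function name `jaccard_distance` and the convention of returning 0 for two empty sets are assumptions made for this example.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two collections of feature values."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two movies with no values for a feature are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)
```

For example, two movies with types {Drama, Crime} and {Drama, Comedy} share one of three distinct types, so their type distance is 1 − 1/3.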

4.2.2. Publish Year Distance

The distance between movies in publish year is a very special case for human beings. Intuitively, the distance between movies published in the 1950s and the 1960s seems much smaller than that between movies published in 2012 and 2013. In other words, the distance in publish year is not a linear function. One reason for this phenomenon is that more and more movies have been published in recent years. The number of movies per year in our database is shown in Figure 5.

Figure 5 shows that the number of movies published each year follows an approximately exponential trend. The trend line is shown as follows:

According to formula (3), we propose a new distance function for publish year shown in the following:

4.2.3. Duration Distance

Human beings are very sensitive to small differences in time, which means that half an hour feels nearly twice as long as a quarter of an hour. Therefore, we use a linear function, shown in formula (5), to compute the duration distance.

Here, in order to normalize the distance, we consider 200 minutes as the maximum duration of a movie.
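One plausible reading of a linear duration distance normalized by a 200-minute maximum is sketched below. The exact form of formula (5) is not reproduced here, so the absolute-difference form, the clamping at 1, and the name `duration_distance` are all assumptions.

```python
MAX_DURATION = 200.0  # minutes, taken as the maximum duration of a movie

def duration_distance(minutes_a, minutes_b, max_duration=MAX_DURATION):
    """Linear, normalized duration distance (assumed form, not the paper's formula)."""
    diff = abs(minutes_a - minutes_b) / max_duration
    # Assumption: durations beyond the maximum are clamped so the distance stays in [0, 1].
    return min(diff, 1.0)
```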

4.3. Overall Distance

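We build a distance matrix over all pairs of movies in our database based on formula (6) as follows:

$$d(m_i, m_j) = \sum_{k} w_k \, d_k(m_i, m_j) \qquad (6)$$

where $k$ ranges over the seven movie features, $w_k$ is the weight of feature $k$ obtained in Section 4.1, and $d_k$ is the distance for feature $k$ defined in Section 4.2.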
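The weighted combination can be sketched as follows, using the survey weights reported in Section 4.1. The dictionary keys and the function name `overall_distance` are hypothetical, and the per-feature distances are assumed to be computed elsewhere and passed in.

```python
# Survey weights from Section 4.1; the key names are chosen for this example.
WEIGHTS = {
    "year": 0.0915,
    "country": 0.147,
    "type": 0.4167,
    "language": 0.0545,
    "director": 0.0896,
    "casts": 0.1792,
    "duration": 0.0214,
}

def overall_distance(feature_distances, weights=WEIGHTS):
    """Weighted sum of per-feature distances between two movies."""
    return sum(weights[name] * feature_distances[name] for name in weights)
```

Because the weights sum to roughly 1 and each per-feature distance lies in [0, 1], the overall distance also stays within [0, 1].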

4.4. Parallel Distance Computation

Since our distance matrix is symmetric, we can use such property to do distance computation in parallel.

To do parallel computation, two conditions must be satisfied: (1) the work assigned to different processes must be approximately equal; (2) there is no interaction or data exchange between different working processes.

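The distance matrix we need to build is a symmetric matrix in which all the elements on the main diagonal are equal to 0, formally defined as an $n \times n$ matrix $D = (d_{ij})$, where $d_{ii} = 0$ and $d_{ij} = d_{ji}$, as follows:

$$D = \begin{pmatrix} 0 & d_{12} & \cdots & d_{1n} \\ d_{21} & 0 & \cdots & d_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ d_{n1} & d_{n2} & \cdots & 0 \end{pmatrix} \qquad (7)$$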

A sketch of the distance matrix in formula (7) is shown in Figure 6. Since the distance matrix is symmetric, the areas of the lower triangle marked in Figure 6 are equal to the corresponding areas of the upper triangle. Therefore, instead of computing all of them, we compute only the upper triangle of the matrix. However, if we split the parallel tasks by rows of the matrix, the amount of work (number of elements) in each task is not equal (it ranges from $n-1$ down to 1). To overcome this problem, the work of the short rows is combined with the work of the long rows, so that unequal tasks are grouped into equal tasks. In this way, a process pool in which each process computes the distances of one task can be used to parallelize these tasks.

More specifically, we first set the main diagonal to 0. There is no need to compute the main diagonal, since we know that the distance between a movie and itself is 0. Then, for each row $i$, we combine the distance computations of $d_{i,i+1}, d_{i,i+2}, \ldots, d_{i,n}$ with those of a complementary shorter row of the upper triangle, so that the two rows together form one task. After the computation, we obtain an upper triangular matrix. We then fill the lower triangle from the upper triangle; that is, we let $d_{ji} = d_{ij}$.
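The row-pairing idea can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the specific pairing rule (longest remaining row with shortest remaining row, 0-indexed), the helper names `build_tasks`, `compute_task`, and `distance_matrix`, and the `map_fn` parameter are all hypothetical.

```python
def build_tasks(n):
    """Group rows of the upper triangle into tasks of near-equal size.

    Row i (0-indexed) holds n - 1 - i entries, so pairing the longest
    remaining row with the shortest gives n entries per paired task.
    An unpaired middle row, if any, forms a smaller task of its own.
    """
    rows = list(range(n - 1))  # the last row has no upper-triangle entries
    tasks = []
    lo, hi = 0, len(rows) - 1
    while lo < hi:
        tasks.append([rows[lo], rows[hi]])
        lo += 1
        hi -= 1
    if lo == hi:
        tasks.append([rows[lo]])
    return tasks

def compute_task(args):
    """Compute all upper-triangle distances for the rows of one task."""
    rows, n, dist = args
    return [(i, j, dist(i, j)) for i in rows for j in range(i + 1, n)]

def distance_matrix(n, dist, map_fn=map):
    """Build the symmetric matrix; map_fn may be replaced by a pool's map."""
    matrix = [[0.0] * n for _ in range(n)]  # main diagonal stays 0
    jobs = [(rows, n, dist) for rows in build_tasks(n)]
    for result in map_fn(compute_task, jobs):
        for i, j, d in result:
            matrix[i][j] = d
            matrix[j][i] = d  # mirror into the lower triangle
    return matrix
```

With the default `map_fn=map` the tasks run sequentially. Passing the `map` method of a `multiprocessing.Pool` should distribute them across processes, provided `dist` is a picklable top-level function.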

5. Preclustering System

In this section, we use four clustering methods to investigate which one works best in the final movie recommendation system. The four clustering algorithms are affinity propagation [13], DBSCAN [14], hierarchical clustering [15], and random clustering, which serves as a baseline.

5.1. Clustering Method
5.1.1. Affinity Propagation

Affinity propagation which was first proposed in [13] generates clusters by sending messages between pairs of samples until convergence or changes falling below a threshold. A dataset is then described using a small number of exemplars, which are identified as those most representative of other samples. The messages sent between pairs represent the suitability for one sample to be the exemplar of the other, which is updated in response to the values from other pairs. This updating happens iteratively until convergence, at which point the final exemplars are chosen, and hence the final clustering is given as demonstrated in Figure 7.

5.1.2. DBSCAN

The DBSCAN algorithm [14] views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be in any shape, as opposed to k-means which assumes that clusters are convex shaped. The central component to the DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure) and a set of non-core samples that are close to a core sample (but are not themselves core samples) as demonstrated in Figure 8.
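To make the notions of core and non-core samples concrete, here is a minimal pure-Python DBSCAN that works directly on a precomputed distance matrix. It is a didactic sketch rather than the Scikit-learn implementation used in our experiments; the function name `dbscan` and the parameter names `eps` and `min_samples` are chosen for this example, and a sample counts itself among its own neighbors.

```python
def dbscan(dist, eps, min_samples):
    """Return a cluster id per sample (0, 1, ...), or -1 for noise.

    dist: square matrix (list of lists) of pairwise distances.
    """
    n = len(dist)
    # Neighborhood of each sample, including the sample itself.
    neighbors = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    is_core = [len(nb) >= min_samples for nb in neighbors]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue  # already assigned, or cannot start a cluster
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            if not is_core[p]:
                continue  # non-core samples join a cluster but do not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels
```

Samples that are neither core samples nor within `eps` of a core sample keep the label -1 and are treated as noise.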

5.1.3. Hierarchical Clustering

Hierarchical clustering [15] is a general family of clustering algorithms that build nested clusters by merging them successively. This hierarchy of clusters is represented as a tree. The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample.

The beauty of this clustering method is that we do not need to provide the number of clusters in advance. Instead, we can start with a large cut height and decrease the cut height once the user has voted on most of the movies in one cluster. In this way, we can provide a larger cluster that fits the user’s interests, as shown in Figure 9.

5.1.4. Random Clustering

We first fix the number of clusters to 60 and then, for each movie, randomly pick the cluster it belongs to by generating a random number between 1 and 60.
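The random baseline is simple enough to sketch directly. The function name `random_clustering` and the optional `seed` parameter (added here only for reproducibility) are assumptions of this example.

```python
import random

def random_clustering(movie_ids, n_clusters=60, seed=None):
    """Assign every movie a uniformly random cluster label in 1..n_clusters."""
    rng = random.Random(seed)
    # randint is inclusive on both ends, matching "between 1 and 60".
    return {movie_id: rng.randint(1, n_clusters) for movie_id in movie_ids}
```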

5.2. Preclustering System Output

For efficiency, we output the clustering labels in two formats. One is used to look up the cluster label by movie ID, and the other is used to look up the IDs of all movies in a cluster given the cluster label. By writing the output as JSON, we can easily transfer data between the preclustering system in Python and the real time online recommendation system in PHP.
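The two lookup formats might be produced as in the sketch below. The function names and the exact JSON layout are assumptions; in particular, cluster labels are written as string keys because JSON object keys must be strings.

```python
import json
from collections import defaultdict

def build_lookup_tables(labels):
    """labels: dict mapping movie ID -> cluster label (hypothetical input format)."""
    movie_to_cluster = {str(m): int(c) for m, c in labels.items()}
    cluster_to_movies = defaultdict(list)
    for movie_id, cluster in movie_to_cluster.items():
        cluster_to_movies[str(cluster)].append(movie_id)
    return movie_to_cluster, dict(cluster_to_movies)

def dump_lookup_tables(labels):
    """Serialize both lookup tables as JSON strings."""
    movie_to_cluster, cluster_to_movies = build_lookup_tables(labels)
    return (
        json.dumps(movie_to_cluster, sort_keys=True),
        json.dumps(cluster_to_movies, sort_keys=True),
    )
```

On the PHP side, each string can be decoded with `json_decode` into an associative array for constant-time lookups.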

6. Evaluation

Before the evaluation starts, we created a survey and asked users to select the top three movie features they care about as mentioned in Section 4.1. After one week of data collection, we have the result which is shown in Figure 4 as mentioned.

Based on the result, we calculate the distance matrix and use it as the input to the different clustering algorithms. We implemented the four clustering techniques using the Python library Scikit-learn.

6.1. Experimental Setup

The evaluation was conducted on a university social website in China that is owned by one of our authors. We added the recommendation function in a banner and placed it at the top of the movie service so that users can vote. An overview of the voting component of the website is shown in Figure 10.

There are three voting choices: like, do not like, and unseen for each recommended movie. Voting like scores 1, voting do not like scores −1, and voting unseen scores 0. Because of the limited time of data collection, we cannot wait for users to watch the recommended movies. Therefore, we add the unseen choices for users to select if they have not seen the recommended movies and only consider the votes which are like or do not like to evaluate our approach.

6.2. Result

After two weeks of voting on the movie recommendations, we collected 138,424 votes in total. Of these, 132,360 votes were collected in the first 12 days and used for the recommendation majority voting, as mentioned in Section 3.3. The remaining 6,064 votes were collected in the two days after the recommendation system was deployed. Since only the like and do not like votes are considered eligible, we generated the results in the following subsections.

6.2.1. First Evaluation Result

In the first evaluation result, listed in Table 1, we calculated the true positives, false positives, and accuracy for each clustering technique. The number of true positives (TP) equals the number of like votes, and the number of false positives (FP) equals the number of do not like votes. Because most of the recommended movies were voted unseen, a total of 1,133 votes were counted, and 216 users participated in the voting. Consider the following:

$$\text{accuracy} = \frac{TP}{TP + FP} \qquad (8)$$
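The pooled accuracy over all eligible votes can be computed as in the following sketch. The function name `accuracy` and the encoding of votes as +1, −1, and 0 (matching the scores in Section 6.1) are conventions of this example.

```python
def accuracy(votes):
    """Share of like votes among eligible (like / do not like) votes.

    votes: iterable of +1 (like), -1 (do not like), 0 (unseen).
    Returns None when there are no eligible votes.
    """
    votes = list(votes)
    true_positive = sum(1 for v in votes if v == 1)
    false_positive = sum(1 for v in votes if v == -1)
    eligible = true_positive + false_positive
    if eligible == 0:
        return None  # unseen votes alone give no evidence either way
    return true_positive / eligible
```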

As shown by the results in Table 1, the hierarchical clustering approach is the best. However, the accuracy of every clustering approach is unsatisfactory. Therefore, we extracted the users’ voting history from our server database and analyzed the data. We found that some users made fewer than 5 votes and some made between 5 and 10 votes, while a few users made more than 20 votes. Among those who made more than 20 votes, we found that 4 users voted do not like every time. As a result, almost one third of the false positive votes were made by only 4 of the 216 users. Table 3 shows an example from one such user. The first column is the movie ID from IMDB, the second column is the voting result, and the third column is the voting time. We suspect that this user dislikes our recommendation idea and therefore made all of his/her votes negative.

Another small portion of users made more than 20 votes, most of which are negative, with only a few positive votes and no zero (unseen) votes. Table 4 shows one such example. This is probably because the user voted negative instead of unseen for movies he/she had not seen, since most users voted unseen more often than like or do not like.

6.2.2. Second Evaluation Result

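The two situations mentioned in Section 6.2.1 strongly affect our result in the first evaluation. Therefore, we made a second evaluation by calculating each user’s accuracy with formula (8) and then computing the average of all users’ accuracies. Consider the following:

$$\overline{\text{accuracy}} = \frac{1}{N} \sum_{u=1}^{N} \text{accuracy}_u \qquad (9)$$

where $\text{accuracy}_u$ is the accuracy of user $u$ and $N$ is the number of users who participated in the voting.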
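Averaging per-user accuracies limits the influence of a few users who cast many negative votes. The sketch below illustrates this; the function name `mean_user_accuracy`, the input format, and the choice to skip users with no eligible votes are assumptions of this example.

```python
def mean_user_accuracy(votes_by_user):
    """Average of per-user accuracies.

    votes_by_user: dict mapping user ID -> list of +1 / -1 / 0 votes.
    Users with no like or do-not-like votes are skipped (assumption).
    """
    per_user = []
    for votes in votes_by_user.values():
        likes = sum(1 for v in votes if v == 1)
        dislikes = sum(1 for v in votes if v == -1)
        if likes + dislikes:
            per_user.append(likes / (likes + dislikes))
    if not per_user:
        return None
    return sum(per_user) / len(per_user)
```

In a toy case with three voting users, one of whom casts 20 negative votes, the pooled accuracy would be dominated by that single user, while the per-user average weights all three equally.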

In the second evaluation, we waited another 6 hours to collect more data. There are 7,100 votes in total, of which 1,420 are considered, and 242 users participated in the voting. Based on the second evaluation, we obtained a much better result, as shown in Table 2. DBSCAN (density-based clustering), affinity propagation, and hierarchical clustering all obtained over 80% accuracy, and DBSCAN reached 84.71%.

6.3. Comparison with Related Work

In this section, we compare our result with those of several other movie recommendation approaches and systems. Pomerantz and Dudek [16] used a hierarchical Bayesian approach combined with the item similarity of the content for their movie recommender. In Biancalana et al.’s paper [11], the best result is obtained by combining three classifiers in a neural network. Basu et al. [17] presented an inductive learning approach to recommendation and reported its averaged precision.

Our best result (84.71%), achieved by DBSCAN [14] as shown in Table 2, improves on the best results of the related works mentioned above. Moreover, all these related works compute their models offline and use the stored models when testing, and these models cannot be updated in time because redoing the learning process takes too long. Our approach, on the other hand, does not need to store user models, and it updates the recommendation result in real time by preclustering movies and using a very simple learning approach (majority voting) to provide movie recommendations. Since movies are updated much less frequently than users’ votes, our approach is much more time efficient than the related works. To sum up, our approach achieves real time model updating without compromising accuracy.

7. Conclusion and Future Work

In this paper, we proposed a new distance function for publish year (year related) to compute the distance matrix, and then implemented a real time movie recommendation system based on content (movie) using preclustering and majority voting. We implemented four different clustering techniques in the preclustering process and obtained 84.71% accuracy as the best result, which is better than some of the other research papers' approaches. We realized a real time updating model with no compromises on accuracy. Our approach can even work better with special user groups such as people in colleges, universities, or professional communities.

The next step in this research is to make the system adaptable to other types of content, such as articles, news, and music, and to obtain better accuracy. One limitation of our approach is that if a cluster contains many positive votes as well as many negative ones, we will not recommend movies from this cluster. However, this situation may arise for two totally different reasons. One is that the user does not care about the movies in this cluster; in this case, making no recommendation from this cluster is a wise choice. The other is that the user really likes the movies in this cluster and has watched many of them, liking some and disliking others; ignoring the cluster in this situation will dissatisfy the user. To overcome this limitation, we can give weights to the votes from users. If a movie is voted positive by the user and is also voted positive by many other people, the weight of that vote is decreased. If the movie is voted positive by the user but voted negative by most others, the vote reveals a special taste of this user and gains a higher weight, and vice versa. More specifically, we can set the weights of positive and negative votes as functions of $p$ and $n$, where $p$ is the total number of positive votes from all users for this movie and $n$ is the total number of negative votes from all users for this movie. In this way, we can still ignore the cluster in the first situation while overcoming the limitation in the second situation.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors especially thank Dr. Khaled Rasheed for providing them with suggestions and information regarding their research. The authors also thank Mr. Hongfei Yan for the related work and discussions.