Abstract

With the rapid growth of big data, data acquisition and analysis have become active research topics, and Python-based crawler technology is currently one of the most widely used tools in data analysis. In this paper, we apply key Python crawler techniques to acquire movie ranking lists and hot-movie data from the Cat’s Eye movie website and analyze the data in the Python development environment Spyder. We use NumPy to store and process large data sets, the Chinese word segmentation tool Jieba to segment the crawled text, and the SnowNLP library to analyze text sentiment. Finally, word clouds and dynamic web charts display information such as viewers’ emotional tendencies and movie rating statistics, providing decision support for users choosing movies.

1. Introduction

With the rapid development of web technology, the amount of web information is also growing rapidly [1, 2]. To better meet the personalized needs of users, various recommendation systems have emerged. These systems automatically establish connections between users and information by studying users’ interests and preferences, thus helping users discover their potential needs in a huge amount of information. Python is an open-source, free, cross-platform, interpreted, high-level dynamic programming language, and its powerful functions and simplicity make it a preferred language for Internet application development.

Scholars at home and abroad have also made many contributions to research on film rating prediction. Karl Persson collected attribute and feature information for 3,376 Hollywood movies from the IMDB website and constructed a random forest model and a support vector machine model to predict movie ratings. Liu Changming proposed a hybrid movie score prediction model based on machine learning algorithms and movie features and used it to predict ratings of movies on Douban [3, 4]. Lu Junzhi (2018) built a movie score prediction model on the Spark MLlib machine learning framework using the random forest regression algorithm, with good results. In general, both domestic and foreign scholars have studied film rating prediction, but foreign research is more mature, and there is still room for improvement in domestic research on film prediction models.

In this paper, the movie data are obtained from the Douban website, and a total of 91,368 movie records are crawled, covering movie ratings, number of ratings, star-rating counts, genres, tags, and so on [5, 6]. The data are analyzed in Python’s integrated development environment, using NumPy for storing and processing data, pandas for data analysis, and matplotlib for plotting. The data analysis project consists of three main parts: production and quality analysis of the movie industry, analysis of Douban rating factors, and rating prediction models. Through data visualization, readers can intuitively grasp movie-viewing trends and quickly find high-quality movies that match their preferences.

2. Python-Based Movie Data Acquisition

2.1. Introduction to Python

Python is a high-level scripting language that combines interpreted, compiled, interactive, and object-oriented features; it is powerful, syntactically simple, and easy to maintain. Its highly readable design, distinctive syntax, and support for a wide range of applications make it a good language for beginners [7, 8].

2.2. Movie Ranking Acquisition Techniques

For movie ranking acquisition, we mainly use the re module and the requests HTTP client library to get the data of the Hot 100, Most Anticipated, Domestic Box Office, North American Box Office, and Top 100 lists. We log in to the Cat’s Eye movie page in a browser, view the HTML body of the page, analyze its HTML structure, extract the URL information using regular expressions, and then collect the data in the next step. The key code is as follows:

    import re
    import requests

    def getOnePage(url):
        # `header` is the request-header dict (e.g. User-Agent) defined elsewhere.
        response = requests.get(url, headers=header)
        if response.status_code == 200:
            # Capture the rank (board-index-N) and the title attribute of each <dd> entry.
            allTop = re.findall(r'<dd>.*?board-index-(\d+).*?title="(.*?)".*?/p>.*?</dd>', response.text, re.S)
            return allTop, response.text
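To make the extraction step concrete without touching the network, the following minimal sketch applies the same regular expression to a small HTML fragment. The fragment, the constant names, and the helper parse_board are hypothetical and only imitate the structure that the pattern expects; the real page layout may differ.

```python
import re

# Hypothetical HTML fragment imitating two entries of a ranking page.
SAMPLE_HTML = """
<dd>
  <i class="board-index board-index-1">1</i>
  <a href="/films/1" title="Movie A" class="image-link"></a>
  <p class="star">Cast: someone</p>
</dd>
<dd>
  <i class="board-index board-index-2">2</i>
  <a href="/films/2" title="Movie B" class="image-link"></a>
  <p class="star">Cast: someone else</p>
</dd>
"""

# Same pattern as in the key code; re.S lets "." match newlines inside each <dd>.
PATTERN = re.compile(r'<dd>.*?board-index-(\d+).*?title="(.*?)".*?/p>.*?</dd>', re.S)

def parse_board(html):
    """Return (rank, title) pairs found in one page of HTML."""
    return [(int(rank), title) for rank, title in PATTERN.findall(html)]

print(parse_board(SAMPLE_HTML))
```

The lazy quantifier `.*?` keeps each match inside a single `<dd>` entry, which is why one call returns one tuple per listed movie.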

2.3. Movie Data Fetching Techniques

Movie data acquisition mainly uses the requests HTTP client library, the json package, the random library, the csv package, the datetime module, and the re module. The re module, the requests library, and the random library are mainly used to crawl data; the json package is used to convert the acquired data to JSON format; and the datetime module and the csv package are used to store the data [9, 10]. The User-Agent request header carries the requester’s information. User-Agent strings can be collected and saved, and the User-Agent can be changed dynamically during crawling to prevent the crawl from being terminated because of frequent requests. In this paper, the movie “Spirited Away” is selected for analysis according to the hot word-of-mouth list, and 13,318 records are finally obtained with the Python crawler for visualization.
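The User-Agent rotation described above can be sketched as follows. This is only an illustration under stated assumptions: the pool of User-Agent strings, the helper names, and the placeholder URL are hypothetical, and the standard-library urllib is used here only so that the example builds a request without sending it. The same headers dictionary could be passed to requests.get(url, headers=...).

```python
import random
import urllib.request

# Hypothetical pool of User-Agent strings; in practice these would be collected and saved beforehand.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]

def random_headers():
    """Pick a User-Agent at random so consecutive requests do not all share one."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def build_request(url):
    """Build (but do not send) a request carrying a randomly chosen User-Agent."""
    return urllib.request.Request(url, headers=random_headers())

req = build_request("https://example.com/comments?page=1")  # placeholder URL
print(req.get_header("User-agent"))  # urllib stores header names in capitalized form
```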

3. Acquisition of Douban Top 250 Movie Data

3.1. Introduction of Concepts Related to Douban Top 250 Movie List

Douban.com is a sharing and review community website that provides users with information about books, movies, music, and other works, with descriptions and comments provided by users, and it is one of the most distinctive Web 2.0 websites [11, 12]. As an important part of Douban.com, Douban Movies provides users with movie-related information, such as descriptions, schedules, ticket prices, and reviews of currently released movies. The Top 250 movie list is based on the number of people who have seen each movie and the reviews it has received, and it represents the movie preferences of the majority of users.

3.2. Data Acquisition and Cleaning

In this paper, we use the Octopus collector to crawl the Douban Top 250 movie list, entering the list URL “https://movie.douban.com/top250” on the collector’s home page to start the collection process [13, 14]. A total of 250 records were crawled, and 20 fields were obtained after data cleaning with the pandas library in Python: movie name, score, number of ratings (num), number of five-star ratings (star5), number of four-star ratings (star4), number of three-star ratings (star3), number of two-star ratings (star2), number of one-star ratings (star1), short reviews (short), director (director), writer (writer), actor1, actor2, actor3, type1, type2, region of production (region), year of release (year), month of release (month), and time.

4. Exploratory Analysis of Douban Top 250 Movies

4.1. Number of Movies in Each Region

The origins of the Douban Top 250 movies are the UK, USA, Italy, France, Korea, Japan, and China, as shown in Figure 1. Since the number of movies from the remaining regions is small, they are not shown in this paper. In the Douban Top 250 movie list, there are 114 American movies, accounting for 45.6% of the total. These are followed by 41 Chinese movies, including 16 from Mainland China, 19 from Hong Kong, and 6 from Taiwan. In addition, there are 32 Japanese films, 17 British films, and 10 Korean films [15, 16]. It can be seen that Western countries have some influence on Chinese film culture. Since most Douban users are young people, the high ratings they give American movies are inseparable from the culture and dissemination of those movies. In addition, further analysis of the Japanese movies shows that most of them are anime, which is an important part of Japanese soft culture and reflects the preference of young people for anime culture.
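The shares quoted in this subsection follow directly from the counts. The short check below uses only the numbers stated in the text; the variable names are ours.

```python
TOTAL = 250  # size of the Top 250 list

# Counts as stated in the text.
counts = {
    "USA": 114, "Japan": 32, "UK": 17, "Korea": 10,
    "Mainland China": 16, "Hong Kong": 19, "Taiwan": 6,
}

# Chinese movies are the sum of the three Chinese regions.
china_total = counts["Mainland China"] + counts["Hong Kong"] + counts["Taiwan"]

# Share of each region in percent, rounded to one decimal place.
shares = {region: round(100 * n / TOTAL, 1) for region, n in counts.items()}

print(china_total, shares["USA"])
```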

4.2. The Overall Distribution of Movie Ratings

Because the Top 250 list is a selected list, the ratings of its movies are relatively high, concentrated in the range of 8 to 10. The number of movies at each score is shown in Figure 2.

As we can see from Figure 2, the lowest rating among the crawled movies is 8.3 and the highest is 9.7. A total of 93.2% of the movies are rated between 8.5 and 9.3, and most are rated 8.5 or above, indicating that the quality of these movies is relatively high. In addition, very few movies have the highest ratings, which shows that audiences have different preferences and that movies rated highly by some people are not liked by all [17, 18].

4.3. Analysis of Movie Genres

To better understand the movie genres that viewers pay attention to, we first use the wordcloud library in Python to create a word cloud of the Douban Top 250 movie genres.

The genre that appears most frequently in the list is romance. It is followed by comedy, suspense, family, and drama, while song and dance movies appear very infrequently [19, 20]. After further statistical analysis, the movies in the Douban Top 250 list fall mainly into 20 genres, such as romance, comedy, suspense, family, drama, action, and crime. Genres with more than 10 movies were then selected for analysis of the number of movies and the average rating, as shown in Figure 3.

As can be seen from Figure 3, ten genres contain more than 10 movies. Romance has the most, with 42 movies in total, followed by comedy with 32. There are 22 suspense movies, 19 family movies, 19 drama movies, and 16 action and crime movies. The fantasy, animation, science fiction, and thriller genres have relatively few movies, but each has more than 10. In addition, the average rating of most genres is around 8.9. The highest is drama, with an average rating of 9.07, and the lowest is thriller, with an average rating of 8.75.
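The per-genre counts and average ratings can be computed along the following lines. This is a minimal sketch, not the paper's code: the rows are made-up examples, we assume a movie counts toward both of its genre fields (type1 and type2), and the threshold is lowered from 10 to 2 only because the toy table is tiny.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (score, type1, type2); the real table has 250 rows.
movies = [
    (9.7, "drama", "crime"),
    (9.6, "drama", "romance"),
    (9.5, "romance", "comedy"),
    (9.4, "comedy", None),
    (9.3, "drama", "romance"),
]

def genre_stats(rows, min_count):
    """Return {genre: (count, average score)} for genres with more than min_count movies."""
    counts = Counter()
    totals = defaultdict(float)
    for score, type1, type2 in rows:
        for genre in (type1, type2):
            if genre:  # skip a missing second genre
                counts[genre] += 1
                totals[genre] += score
    return {g: (n, round(totals[g] / n, 2)) for g, n in counts.items() if n > min_count}

stats = genre_stats(movies, min_count=2)
print(stats)
```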

5. Python-Based Visual Analysis of Movie Data

5.1. Sentiment Analysis and Word Cloud Generation

SnowNLP is a Python library for processing Chinese text. Calling its sentiment classification method scores the emotional tendency of a review between 0 and 1, where a more positive tendency corresponds to a higher score, as shown in Figure 4. The user reviews of the movie “Spirited Away” mostly lean positive, which means the movie has received mostly positive comments.

In the lexicon-based sentiment analysis method, the text is analyzed with a sentiment dictionary and rules: the sentiment value is calculated by traversing the sentiment words, degree words, negation words, matching words, and exclamation words, and the resulting value serves as the basis for the sentiment tendency of the text [21–23]. The specific process is as follows: preprocess the text; segment it into words in exact mode; build a custom stop-word list and remove common words; remove single-character words; compute word frequencies over the segmented words; take the 100 most frequent words; load the positive and negative word lists; count the positive and negative words; and draw the word cloud. The word cloud shows that the audience’s overall impression of the film is that it is good and a classic, and we also found that the cities in China with the most viewers are Beijing, Guangzhou, Shenzhen, and Shanghai.
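The lexicon-based procedure can be illustrated with a small sketch. Everything here is an assumption made for illustration: the word lists are tiny English stand-ins for the real dictionaries, the input is already segmented (the paper uses Jieba for this step), and the scoring rule (degree words scale, negations flip the next sentiment word) is one simple choice among several.

```python
from collections import Counter

# Tiny hypothetical lexicons; a real run would load full word lists from files.
POSITIVE = {"good", "classic", "moving"}
NEGATIVE = {"boring", "bad"}
NEGATIONS = {"not", "never"}
DEGREES = {"very": 2.0, "slightly": 0.5}
STOPWORDS = {"the", "is", "and"}

def top_words(tokens, n):
    """Drop stop words and single-character tokens, then return the n most frequent words."""
    kept = [t for t in tokens if t not in STOPWORDS and len(t) > 1]
    return Counter(kept).most_common(n)

def polarity_counts(tokens):
    """Count how many tokens appear in the positive and negative lexicons."""
    return sum(t in POSITIVE for t in tokens), sum(t in NEGATIVE for t in tokens)

def sentiment_score(tokens):
    """Sum word polarities; degree words scale and negations flip the next sentiment word."""
    score, weight = 0.0, 1.0
    for tok in tokens:
        if tok in NEGATIONS:
            weight = -weight
        elif tok in DEGREES:
            weight *= DEGREES[tok]
        elif tok in POSITIVE or tok in NEGATIVE:
            score += weight if tok in POSITIVE else -weight
            weight = 1.0  # modifiers apply only to the next sentiment word
    return score

tokens = "the film is very classic and the music is moving and the film is not boring".split()
print(top_words(tokens, 1), polarity_counts(tokens), sentiment_score(tokens))
```

The word frequencies returned by top_words are the kind of input a word cloud generator would take; the drawing step itself is omitted here.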

5.2. Movie Star Rating Analysis

The Pie component in the Pyecharts library is imported and used to generate pie charts [24, 25]. Pandas provides data structures and tools for manipulating large data sets efficiently. The movie ratings are grouped and counted with pandas to derive the percentages, which are finally displayed as dynamic charts on web pages, as shown in Figure 5.

Three pieces of information are captured according to the HTML structure: the rating given by each account, the review text of each account, and the HTTP link to the next review page. After all the information was acquired and processed, the total count of each star rating and the total number of accounts that rated the movie were calculated [26–28]. The results show that 78.97% of viewers gave five stars, so the overall rating of the movie is high and it is worth recommending. The key code for star rating analysis is as follows:

    import pandas as pd
    from pyecharts import Pie  # Pie component for generating pie charts

    # Load the crawled comments into a DataFrame.
    df = pd.read_csv("D:comments.txt", encoding="gb18030",
                     names=["id", "NickName", "userLevel", "cityName", "content", "score", "startTime"])
    attr = ["one-star", "two-star", "three-star", "four-star", "five-star"]
    score = df.groupby("score").size()  # number of comments for each score value
    # Merge adjacent score values into the five star levels.
    value = [
        score.iloc[0] + score.iloc[1] + score.iloc[2],
        score.iloc[3] + score.iloc[4],
        score.iloc[5] + score.iloc[6],
        score.iloc[7] + score.iloc[8],
        score.iloc[9] + score.iloc[10],
    ]
    pie = Pie("Spirited Away Star Ratings Ratio", title_pos="left", width=600)
    pie.use_theme("dark")
    pie.add("rating", attr, value, center=[40, 50], radius=[25, 75], rosetype="area",
            is_legend_show=True, is_label_show=True)
    pie.render("rating-star.html")

6. Douban Top 250 Movie Rating Prediction

In this paper, the random forest regression algorithm, regression tree algorithm, and gradient boosting regression algorithm were selected to train and predict the ratings of Douban Top 250 movies. By evaluating the prediction results of the three algorithms, the gradient boosting regression algorithm was selected to predict the ratings of the experimental data [29, 30].

6.1. Introduction of the Algorithms
6.1.1. Random Forest Regression Algorithm

The random forest regression algorithm is an ensemble algorithm composed of multiple decision trees. The algorithm draws multiple samples from the sample data by sampling with replacement, constructs a decision tree for each sample subset, and then combines the predictions of all decision trees by averaging (for regression) or voting (for classification) to obtain the final prediction. The advantage of this algorithm is that it is insensitive to correlation between variables and so reduces the effect of multicollinearity [31–33].
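The two ingredients named above, sampling with replacement and aggregating per-tree predictions, can be sketched in a few lines. This is a simplified illustration, not a full random forest: tree fitting is left out, the per-tree predictions are made-up values, and all function names are ours.

```python
import random
from collections import Counter
from statistics import mean

def bootstrap_indices(n, rng):
    """Sample n row indices with replacement (some rows repeat, some are left out)."""
    return [rng.randrange(n) for _ in range(n)]

def aggregate_regression(tree_predictions):
    """Regression forests average the per-tree predictions."""
    return mean(tree_predictions)

def aggregate_classification(tree_votes):
    """Classification forests take the majority vote."""
    return Counter(tree_votes).most_common(1)[0][0]

rng = random.Random(0)
subsets = [bootstrap_indices(250, rng) for _ in range(3)]  # one sample subset per tree
print(len(set(subsets[0])), "distinct rows in the first subset")
print(aggregate_regression([8.9, 9.1, 9.0]))           # made-up per-tree ratings
print(aggregate_classification(["high", "high", "low"]))  # made-up per-tree votes
```

Because each subset is drawn with replacement, it typically contains fewer distinct rows than the original data, which is what makes the individual trees differ from one another.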

6.1.2. Regression Tree Algorithm

The decision tree regression algorithm is a common algorithm for regression and classification, in which data are partitioned by rules to construct a regression tree model. Building a regression tree requires using the observed data to establish splitting rules, with each feature attribute treated as a variable. The data are split on a variable according to a rule chosen so that the sum of squared residuals of the two resulting parts is minimized, and repeating this process yields a regression tree that fits the data well.
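The splitting criterion can be sketched as an exhaustive search over features and thresholds. The data and column meanings below are hypothetical, and the function finds only a single split; a full regression tree would apply it recursively to each part.

```python
def sse(values):
    """Sum of squared residuals of `values` around their mean."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows, targets):
    """Return (feature_index, threshold, total_sse) of the best single split."""
    best = None
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for row, t in zip(rows, targets) if row[j] <= thr]
            right = [t for row, t in zip(rows, targets) if row[j] > thr]
            total = sse(left) + sse(right)
            if best is None or total < best[2]:
                best = (j, thr, total)
    return best

# Hypothetical data: columns are (share of five-star votes, release year).
rows = [[0.60, 1994], [0.62, 2010], [0.80, 1997], [0.82, 2004]]
targets = [8.6, 8.7, 9.4, 9.5]
print(best_split(rows, targets))
```

In this toy table the five-star share separates low and high ratings cleanly, so the search picks that column rather than the release year.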

6.1.3. Gradient Boosting Regression Algorithm

The gradient boosting regression algorithm is a representative machine learning algorithm that can be used for regression or classification problems; common boosting algorithms include AdaBoost and gradient boosting. In this paper, the gradient boosting algorithm is used to predict movie ratings. Its principle is to measure how well the model fits by means of a loss function [34, 35]. Given the chosen loss function, each iteration refines the model along the direction of gradient descent, gradually reducing the value of the loss. The gradient boosting algorithm builds its model from regression trees, and the residuals of the previous tree are used as the learning target of the next, so new regression trees are constructed until the residuals of the model fall within the allowed range.
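The residual-fitting loop can be sketched as follows for the squared loss, where the negative gradient is simply the residual. This is a minimal illustration under our own assumptions: one-split stumps on a single feature stand in for regression trees, a fixed number of rounds replaces the residual-based stopping rule, and the data, learning rate, and function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals (least squares)."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm

def fit_gbr(xs, ys, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(ys) / len(ys)  # initial model: the mean rating
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient of 0.5 * (y - p)^2
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of `model` on the given points."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Hypothetical data: x is the share of five-star votes, y is the rating.
xs = [0.55, 0.60, 0.65, 0.70, 0.75, 0.80]
ys = [8.5, 8.6, 8.8, 9.0, 9.2, 9.5]
print(mse(fit_gbr(xs, ys, n_rounds=1), xs, ys), mse(fit_gbr(xs, ys, n_rounds=50), xs, ys))
```

Each round shrinks the training residuals a little, so the training error after 50 rounds is lower than after one round.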

6.2. Model Construction and Result Analysis
6.2.1. Feature Selection and Data Processing

After processing the feature information, and with reference to the related literature, this paper selects 9 variables for modeling according to the characteristics of each feature: score, num, star5, star4, star3, star2, short, time, and year, where score is the target variable and the other 8 are predictor variables [36, 37]. The 250 records are divided into a training set and a test set with train_test_split() in sklearn.model_selection; the predictor variables of both sets are then standardized with StandardScaler(), and the standardized data are used to build the prediction model.

6.2.2. Model Construction

In this paper, the random forest regressor and the gradient boosting regressor in sklearn.ensemble are used to implement the random forest algorithm and the gradient boosting regression algorithm, and the tree module in sklearn is used to implement the regression tree algorithm. The processed training set is used to construct the random forest regression model, the gradient boosting regression model, and the regression tree model, respectively, and the corresponding values, i.e., movie ratings, are predicted for the test set using the predict() method.
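A minimal sketch of this workflow with scikit-learn is given below. It is not the paper's code: the data are synthetic stand-ins for the 250 x 8 predictor table, and the 80/20 split, the random seeds, and the default hyperparameters are all assumptions. The scaler is fitted on the training set only and then applied to both sets, which is the usual way to avoid leaking test information.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the 250 x 8 predictor table; the values are made up.
rng = np.random.default_rng(0)
X = rng.random((250, 8))
y = 8.3 + 1.4 * X[:, 1] + 0.05 * rng.standard_normal(250)

# Assumed 80/20 split; the paper does not state the ratio.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_train)  # fit on training data only
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "regression tree": DecisionTreeRegressor(random_state=0),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
}
results = {}
for name, model in models.items():
    pred = model.fit(X_train_s, y_train).predict(X_test_s)
    results[name] = (mean_squared_error(y_test, pred), mean_absolute_error(y_test, pred))
    print(name, results[name])
```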

The error results obtained from the three models are shown in Table 1.

From Table 1, we can see that the gradient boosting regression model has the best evaluation result, with a prediction accuracy of 91.16%, a mean square error of only 0.5974, and a mean absolute error of 0.6268, which is clearly better than the other two models. Therefore, the gradient boosting regression model is selected in this paper to predict the ratings of the experimental movie data. The prediction results are shown in Table 2. It can be seen that the errors between the predicted and actual ratings of the five movies are small, and the prediction results are highly accurate [38].
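For reference, the two error measures are written below in their standard form, which we assume is the form used here. In this notation, n is the number of test movies, y_i is the actual rating of movie i, and \hat{y}_i is its predicted rating. The text does not state how the prediction accuracy is computed, so no formula is given for it.

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left\lvert y_i - \hat{y}_i \right\rvert
```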

7. Thinking and Countermeasures

In recent years, the movie-going habit in China has not only driven the prosperity and development of the movie market but also placed higher demands on filmmakers. China is undoubtedly a big movie country, but it is still some distance from being a strong one, and how to reach audiences abroad and further enhance the cultural influence of domestic movies is perhaps the next question to consider.

7.1. Improve the Quality of Movies

People often hope that a “good movie” with a high rating can earn a matching box office on its merits, rather than through publicity, marketing, and the participation of popular stars, so as to promote the healthy development of the film industry and encourage more high-quality films [29]. The era of winning the market through large-scale marketing and publicity has long passed, and what finally draws audiences into the theater is still a film’s arrangement and production. Although China is a big movie country, its rate of bad movies is much higher than that of other countries, so it is important for all filmmakers to adopt the right attitude and do a good job in film arrangement and production, not merely to meet market demand, but to make films that truly resonate with the public and prompt society to think [39].

7.2. Create Diversified Cinema Lines

The full boom of the movie market is bound to be accompanied by the diversification of people’s movie-going preferences [30–39]. In the future development of domestic movies, under the influence of the new film industry pattern, movie genres will become more and more abundant, and traditional comedies and romances will share the screen with newer genres such as suspense and crime. It will be difficult to summarize a movie with one or two genres; more and more movies will span multiple genres at the same time, and their structure and methods will become more mature.

8. Conclusion

In this paper, we use Python crawler technology combined with Python libraries to analyze movie information from the Douban website, clean and analyze the scattered movie data, and use word clouds and charts to visualize the data so that it is largely self-explanatory. The user evaluation data are examined along multiple dimensions and levels, and the patterns and features of the data are identified so that they have reference value for audiences choosing movies. For the movie score prediction model, three algorithms are adopted: random forest, regression tree, and gradient boosting. The model evaluation results show that the gradient boosting regression model gives the best predictions, with an accuracy of 91.16%. This model was used to score five films, and the predicted scores were very close to the actual scores. In the next step, the program will be extended into a complete film evaluation visualization system equipped with a user interface for smooth operation. At the same time, it will focus on dynamic data crawling so that it can acquire and evaluate data repeatedly and play a greater role in public opinion analysis in the future.

Data Availability

The dataset can be accessed upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the Beijing Municipal Social Science Program (Grant 20YTC032) and Beijing Municipal Education Commission and Scientific Research Program (Grant SM202110028014).