Abstract

With the continuous innovation of Internet technology and the substantial improvement of network basic conditions, e-commerce has developed rapidly. Online shopping has become the mainstream mode of e-commerce. In order to solve the problem of information overload and information loss in the selection of e-commerce online shopping platform, a personalized recommendation system using information filtering technology has come into being. An e-commerce online shopping platform recommendation model is proposed based on integrated multiple personalized recommendation algorithms: random forest, gradient boosting decision tree, and eXtreme gradient boosting. The proposed model is tested on the public data set. The experimental results of the separate model and mixed model are compared and analyzed. The results show that the proposed model reduces the recommendation sparsity and improves the recommendation accuracy.

1. Introduction

Internet technology has been innovating and developing continuously in recent years. As the representative of digital technology, it affects all fields of the economy and society and has become a strong driving force for consumption upgrading. With the growth of mobile Internet users and the rapid development of mobile payment, e-commerce transaction applications are constantly upgraded and further integrated into people's lives [1–7]. With the cooperation and integration of online and offline transactions and the efficient development of the logistics express industry, online shopping has become a major trend of economic development [8–10].

For businesses, promoting and selling products through an e-commerce platform makes it quicker and more convenient to grow the business and reach more consumer groups, and it saves considerable cost. As the state promotes the reform and upgrading of retail, more businesses sell products through integrated online and offline channels, further optimizing the user experience and improving user value. For consumers, online shopping removes the restrictions of time and place and makes it easier to compare the types and quality of goods side by side. In addition, online shopping allows customers to buy goods rarely seen in physical stores. Cross-border e-commerce, which has developed well in recent years, has met the needs of consumers in this regard.

The personalized recommendation system [11–13] is an advanced business intelligence platform based on massive data mining. It recommends goods to users and meets their personalized needs. An e-commerce recommendation system can greatly improve the turnover of an online shopping mall. Amazon has increased the sales volume of its online shopping mall by 35% through its personalized recommendation system. Compared with traditional search engines, a personalized recommendation system finds users' interest points by studying their behavior, so as to guide users to the products they are interested in faster. A good recommendation system can not only improve users' purchase efficiency but also establish a good relationship with users and improve their sense of belonging.

However, traditional recommendation algorithms, such as the collaborative filtering algorithm, still have problems such as data sparsity, cold start, poor scalability, and difficulty in extracting multimedia information features. To address these problems, scholars have introduced users' demographic information, social information, and trust into the similarity calculation, which specifically target the data sparsity and cold start problems, respectively [14,15]. How to effectively solve these two problems simultaneously requires further research. Therefore, an e-commerce online shopping platform recommendation model is proposed based on integrated multiple personalized recommendation algorithms. User behavior data from a real online shopping platform are preprocessed. After visual analysis to understand the underlying business logic, meaningful features are selected from the dimensions of users, goods, and the interactions between them. Experimental results show that the performance of the proposed fusion model is better than that of other models.

2. Personalized Recommendation and Algorithms

2.1. Recommendation System

Recommendation system can successfully solve the problems of information overload and information loss by using information filtering technology. As one of the main applications of personalized service, it has attracted extensive attention. Its typical application is in the field of e-commerce, such as Amazon, eBay, and Taobao, which have invested a lot of research and development in the recommendation system. The recommendation system can effectively improve the utilization of the platform and enhance the user’s dependence on the platform, so as to achieve greater economic benefits. At present, the widely recognized informal definition of recommendation system is “using e-commerce websites to provide users with product suggestions, help users make purchase decisions, simulate marketing, and enable users to complete the purchase process” given by Resnick and Varian in 1997 [16].

The recommendation system consists of three parts: input module, output module, and recommendation algorithm. The input module collects and records the user’s historical behavior and other information and converts it into user interest data. After the calculation of the recommendation algorithm, the items suitable for the user are given. Finally, the output module presents the recommendation results for the user. The whole recommendation process includes three basic elements: users, items, and recommendation methods. How to model user and item information and what recommendation strategy to adopt are the core issues of recommendation system. The recommendation system model is shown in Figure 1. The recommendation system models the user according to the implicit or explicit information obtained from the input module, models the items at the same time, and selects the best recommendation object to present to the user through the matching of the recommendation algorithm.

The recommendation system can be expressed mathematically as follows. Let C be the set of users and S the set of objects that can be recommended. Both sets are usually large, and S can contain any object recommended to users, such as goods, articles, advertisements, and songs. The utility function u() calculates the recommendation degree of object s to user c, which can be expressed as

$$u : C \times S \rightarrow R$$

where R is a totally ordered set of nonnegative real numbers in a certain range.

The problem of the recommendation algorithm is to find, for each user c in C, the object s in S that maximizes the recommendation degree calculated by u(), which can be expressed as

$$\forall c \in C, \quad s'_c = \arg\max_{s \in S} u(c, s)$$

The utility value of the recommended object is expressed by score, which represents a user’s preference for an object. The utility function u() changes according to the actual recommendation scenario and recommendation strategy. Because the utility value of the recommended object is not given in the whole space, how to build a recommendation engine to estimate it is the core of the recommendation problem.
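As a minimal sketch of this formulation, the snippet below picks, for each user, the object with the highest utility. The score table, the helper names `u` and `recommend`, and the default utility of 0.0 for unknown pairs are all assumptions made for illustration; in practice u() would be estimated by the recommendation engine.

```python
from typing import Callable, Hashable, Iterable

# Hypothetical utility table: (user c, object s) -> estimated score.
scores = {
    ("c1", "book"): 0.2, ("c1", "phone"): 0.9, ("c1", "song"): 0.5,
    ("c2", "book"): 0.7, ("c2", "phone"): 0.1, ("c2", "song"): 0.4,
}


def u(c: Hashable, s: Hashable) -> float:
    # Pairs without a known score get utility 0.0 (assumption of this sketch).
    return scores.get((c, s), 0.0)


def recommend(c: Hashable, S: Iterable[Hashable],
              utility: Callable[[Hashable, Hashable], float]) -> Hashable:
    """Return the object s in S that maximizes utility(c, s)."""
    return max(S, key=lambda s: utility(c, s))


S = ["book", "phone", "song"]
best = {c: recommend(c, S, u) for c in ["c1", "c2"]}
print(best)
```

Here the whole difficulty of the recommendation problem is hidden inside `u`: the argmax itself is trivial once utilities are available for every user-object pair.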

2.2. Classification of Recommendation System

The recommendation system adopts different recommendation strategies and technical means [17]. According to the degree of personalized service provided to users, it can be divided into three categories, as shown in Figure 2.
(1) Nonpersonalized recommendation system: The system does not generate recommendations according to the user's personal characteristics. Instead, it relies on the marketing strategy formulated by the background operator of the web system or on statistical analysis tools in the system background. This kind of recommendation system is not targeted at individual users and presents the same recommendation results to all users. Specific forms include advertising recommendation and sales ranking recommendation.
(2) Semipersonalized recommendation system: The system obtains the user's preference information and generates recommendations by analyzing the user's browsing behavior or current shopping data. The user's behavior data affect the recommendation results, and the degree of personalization is higher than that of the nonpersonalized recommendation system.
(3) Fully personalized recommendation system: The system retains the user's historical information from which personalized features can be extracted, such as browsing information, purchase history, registration information, scoring data, and collection list. Because this kind of recommendation system makes use of users' long-term data, it can build a relatively stable user preference model and analyze it in combination with the user's current behavior, so it can provide fully personalized recommendation services. This kind of recommendation service has the highest degree of personalization and is generally available only to registered users.

From the perspective of algorithm and implementation, recommendation systems can be divided into the following categories: content-based recommendation system, collaborative filtering algorithm-based recommendation system, association rule-based recommendation system, knowledge-based recommendation system, and hybrid recommendation system, which are also shown in Figure 2. The recommendation system based on collaborative filtering algorithm and content-based recommendation system have the longest research time and the most applications.

2.3. Recommendation Algorithms

In the recommendation of e-commerce online shopping platforms, multicriteria decision-making [18–24] is a common method, but this method has been shown to have many shortcomings, such as uncertainty and subjectivity in the decision-making process [25–28]. Researchers classify recommendation algorithms differently from different perspectives. From the perspective of information technology, they can be divided into collaborative filtering and content-based recommendation according to the algorithm and the generation mechanism of recommendation results.

Collaborative filtering is a mature recommendation algorithm recognized in the industry. Goldberg et al. [29] first proposed the collaborative filtering algorithm and applied it to the Tapestry e-mail filtering system. The algorithm mainly recommends items based on users' previous preferences and the choices of users with similar interests. Different from Tapestry's single-point filtering mechanism, Resnick et al. [30] proposed GroupLens, a cross-point and cross-system news filtering mechanism that automatically helps people find what they like among a large number of available articles. The drawback of the above nearest neighbor method is that it requires a lot of calculation. Therefore, Verbert et al. [31] proposed an item-based collaborative filtering algorithm, which replaces the nearest neighbor search over users by finding similar goods. Its online computing cost is independent of the number of users or items, so it can generate high-quality recommendations in real time on huge amounts of data. The advantages of the collaborative filtering algorithm are obvious: in engineering, the model is simple, effective, and versatile. However, as the scale of the e-commerce system expands, the rapid increase of user and item data leads to data sparsity. In addition, the algorithm has a serious cold start problem. Pereira et al. [14] introduced user demographic information into the recommendation algorithm, forming a hybrid collaborative filtering recommendation algorithm that can solve the cold start problem well. Shambour et al. [15] incorporated the idea of item rating trust into the traditional user-based collaborative filtering recommendation algorithm, abandoning the traditional method of similarity calculation. Experiments show that this algorithm can alleviate the problem of data sparsity.
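The item-based idea described above can be illustrated with a small sketch. The rating data are hypothetical, and cosine similarity with rating-weighted scoring is only one common choice, assumed here for illustration; it is not claimed to be the exact method of the cited works.

```python
import math
from collections import defaultdict

# Hypothetical user -> {item: rating} data.
ratings = {
    "u1": {"A": 5, "B": 4},
    "u2": {"A": 4, "B": 5, "C": 2},
    "u3": {"B": 1, "C": 5},
}


def item_vectors(ratings):
    """Invert the data into item -> {user: rating} vectors."""
    vecs = defaultdict(dict)
    for user, items in ratings.items():
        for item, r in items.items():
            vecs[item][user] = r
    return vecs


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def recommend(user, ratings, top_n=1):
    """Score unseen items by similarity to the items the user already rated."""
    vecs = item_vectors(ratings)
    seen = ratings[user]
    scores = {
        cand: sum(cosine(vecs[cand], vecs[i]) * r for i, r in seen.items())
        for cand in vecs if cand not in seen
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend("u1", ratings))
```

In a production setting the item-item similarities would typically be precomputed offline, which is what keeps the online cost low.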

In addition, the content-based recommendation algorithm can effectively alleviate the cold start problem of the collaborative filtering algorithm. In the collaborative filtering recommendation algorithm, an item is recommended to other relevant users only once it has been evaluated by many relevant users. In the content-based recommendation algorithm, feature vectors describing the content are established by extracting features from the content. Then, the preferences of the users are calculated against these features to determine whether to recommend the content to the users.
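A content-based scorer can be sketched in a few lines. The feature vectors, the averaging of liked items into a user profile, and the use of cosine similarity are assumptions chosen for illustration; note that a new item can be scored as soon as its features are known, without any ratings.

```python
import math

# Hypothetical item features: [electronics, clothing, discounted].
item_features = {
    "phone":  [1.0, 0.0, 0.0],
    "laptop": [1.0, 0.0, 1.0],
    "shirt":  [0.0, 1.0, 1.0],
}


def user_profile(liked_items):
    """Average the feature vectors of the items the user liked."""
    vecs = [item_features[i] for i in liked_items]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


liked = ["phone"]
profile = user_profile(liked)
candidates = {
    item: cosine(profile, feats)
    for item, feats in item_features.items() if item not in liked
}
best = max(candidates, key=candidates.get)
print(best, candidates)
```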

A review of relevant research shows that product recommendation on an online shopping platform has specific business scenarios, so an efficient recommendation algorithm needs to be selected from the perspective of accuracy, efficiency, and explainability. In addition, collaborative filtering and content-based recommendation algorithms have their own advantages and difficulties. In fact, most recommendation systems carry out a hybrid recommendation by integrating different recommendation algorithms in various forms, for example by combining the results of the respective models through weighted combination or waterfall mixing. Therefore, random forest (RF), gradient boosting decision tree (GBDT), and eXtreme gradient boosting (XGBOOST) models are selected as the basic models.

Random forest (RF) was proposed by Breiman et al. [32] on the basis of the classification and regression tree (CART). Its principle is to form different data sets by randomly sampling the original data set. A decision tree is trained on each sampled data set, and the multiple decision trees are combined to form a classifier. The specific steps are shown in Figure 3.

The first step is to preprocess the data for the RF model. Downsampling is adopted to avoid the influence of sample imbalance on the experimental results (the ratio of positive to negative samples in the original data set was 1 : 1100).
(1) The k-means clustering algorithm is used to cluster the negative samples
(2) Random subsamples are drawn from each cluster at the same ratio, and the optimal ratio is selected by testing on these subsamples
(3) The RF model is trained and used for prediction on the downsampled set
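One way to read the clustering-based downsampling above is sketched below: negatives are clustered with k-means, and the same fraction is drawn from every cluster so that the retained negatives still cover the different regions of the feature space. The function name `kmeans_downsample`, the number of clusters, the synthetic data, and the target of 5 negatives per positive are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans


def kmeans_downsample(X_neg, n_keep, n_clusters=5, seed=0):
    """Return sorted row indices of roughly n_keep negatives, drawn from
    every k-means cluster at the same sampling fraction."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X_neg)
    fraction = n_keep / len(X_neg)
    kept = []
    for k in range(n_clusters):
        members = np.flatnonzero(labels == k)
        if len(members) == 0:
            continue
        # Keep at least one sample per cluster, never more than it contains.
        size = min(len(members), max(1, round(fraction * len(members))))
        kept.extend(rng.choice(members, size=size, replace=False).tolist())
    return sorted(kept)


# Synthetic stand-in for the negative samples (not the real data set).
data_rng = np.random.default_rng(1)
X_neg = data_rng.normal(size=(2000, 3))
n_pos = 20
kept = kmeans_downsample(X_neg, n_keep=5 * n_pos)
print(len(kept))
```

Because each cluster size is rounded separately, the number of retained rows is close to, but not always exactly, `n_keep`.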

The second step is parameter tuning.
(1) The imbalance rate N/P of negative to positive samples is optimized.
(2) The number of trees in the forest is optimized.
(3) The probability threshold is set. Inputting the features of a sample into the trained model yields a predicted value, which is a probability. The default threshold is 0.5. Adjusting the threshold that the probability must reach changes the purchase prediction label assigned to each user-commodity sample.

The third step is to build the model with RandomForestClassifier() from the sklearn package and train it to generate a subset P of predicted results.
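The training and thresholding steps can be sketched as follows. The synthetic features and labels, the split sizes, and the hyperparameter values are placeholders rather than the settings used in the experiment; only `RandomForestClassifier` itself comes from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 4 features, minority positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 1.0).astype(int)
X_train, y_train, X_test = X[:300], y[:300], X[300:]

# n_estimators is the number of trees in the forest (placeholder value).
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Column 1 of predict_proba is the predicted probability of class 1.
proba = clf.predict_proba(X_test)[:, 1]

# Samples whose probability reaches the threshold form the predicted set P.
threshold = 0.5
P = np.flatnonzero(proba >= threshold)
print(len(P), "of", len(X_test), "test samples predicted as purchases")
```

Lowering `threshold` enlarges P and trades precision for recall, which is why the text treats it as a tuning parameter.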

GBDT [33] is a boosting algorithm, which differs from the bagging approach used by RF. Boosting adopts a serial model in which individual learners are strongly correlated; that is, each new model is trained on the results of the previous learners. GBDT is built from CART classification and regression trees. At each iteration, the residuals between the output of the existing model and the actual sample values are computed, and a new CART tree is generated to fit them, so samples with large residuals receive the most attention. Through this training method, residuals are constantly reduced to ensure the accuracy of the results. The specific steps are shown in Figure 4.

The first step is to use k-means clustering to cluster the negative samples. The second step is to select the optimal ratio by feeding random subsamples, drawn from each cluster at the same ratio, into GBDT. The third step is to select the best parameters of the GBDT classifier.
(1) The imbalance rate N/P of negative to positive samples is optimized
(2) The best number of trees and the best learning rate are selected
(3) The best maximum depth of the base tree, the minimum number of samples required to split an internal node, and the minimum number of samples for leaf nodes are selected
(4) The probability threshold is tuned

The fourth step is to build the model with GradientBoostingClassifier() from the sklearn package and train it to generate a subset P of predicted results.
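A small grid search illustrates how the parameters listed above might be tuned. The grid values, the synthetic data, the 3-fold cross-validation, and the choice of F1 as the selection metric are assumptions for this sketch, not the settings reported in the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with a minority positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] - X[:, 2] > 1.0).astype(int)
X_train, y_train, X_test = X[:300], y[:300], X[300:]

# Number of trees, learning rate, and base-tree depth (placeholder grid).
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
base = GradientBoostingClassifier(
    min_samples_split=4,   # minimum samples to split an internal node
    min_samples_leaf=2,    # minimum samples in a leaf node
    random_state=0,
)
search = GridSearchCV(base, param_grid, scoring="f1", cv=3)
search.fit(X_train, y_train)

# The refitted best model yields probabilities; thresholding gives the set P.
proba = search.predict_proba(X_test)[:, 1]
P = np.flatnonzero(proba >= 0.5)
print(search.best_params_, len(P))
```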

XGBOOST [34] and GBDT both belong to the boosting family, and XGBOOST is an efficient, optimized implementation of the GBDT algorithm. In the choice of the base learner, the XGBOOST model can adopt the tree model as well as other models (such as LR), so it is not limited to the CART tree defined by the GBDT algorithm. XGBOOST learns new functions and fits residuals by continuously splitting the tree on features. The specific steps are shown in Figure 5.

The first step is to tune the XGBOOST model parameters.
(1) Use a higher learning rate to determine the optimal number of trees
(2) Adjust the maximum depth value and the minimum sum of sample weights in subnodes
(3) Adjust parameter Y
(4) Adjust input and A
(5) Reduce the learning rate and repeat the cycle to obtain a more stable parameter combination

The second step is to build the model with GradientBoostingClassifier() from the sklearn package and train it to generate a subset P of predicted results.

3. Experimental Data

The product recommendation model is built on real desensitized data from an online shopping platform, taken from the Tianchi big data platform competition of Alibaba. The dataset contains users' mobile behavior data over a 30-day period. The user table contains six fields: user ID, brand ID, user location ID, product classification ID, user interaction behavior, and behavior time. The data volume is as follows: 92302 users, 65221 commodities in the commodity subset, and 22560328 commodity-user interaction records.

The experimental data include the data of Taobao's shopping festival on December 12. After analyzing all user operations in this month (browsing, collecting, adding to the shopping cart, and purchasing), it is found that purchase behavior doubled at 0:00 on the 12th, which is clearly abnormal. Therefore, the data of this day are eliminated. E-commerce website statistics also contain repeated data, abnormal data, and other problem data, so repeated values and invalid values are directly eliminated in the data cleaning process at this stage. The ratio of positive to negative samples in the original data set is about 1 : 1100. This imbalance may cause the prediction model to treat the positive samples as noise and become biased toward the negative samples, so that positive samples are more likely to be misclassified than negative ones. Therefore, the random downsampling method is adopted to avoid the above problems.
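Random downsampling of the majority class can be sketched with the standard library alone. The toy label list mirrors the 1 : 1100 ratio quoted above, while the target of 10 negatives per positive and the function name `random_downsample` are arbitrary choices made for illustration.

```python
import random


def random_downsample(labels, neg_per_pos, seed=0):
    """Keep every positive and a random subset of negatives so that the
    result holds at most neg_per_pos negatives for each positive."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), neg_per_pos * len(pos))
    return sorted(pos + rng.sample(neg, n_keep))


# Toy labels with the 1 : 1100 imbalance described in the text.
labels = [1] * 2 + [0] * 2200
kept = random_downsample(labels, neg_per_pos=10)
print(len(kept), sum(labels[i] for i in kept))
```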

3.1. Feature Construction

According to the data preprocessing results and the business characteristics of the online shopping platform, the data features are reconstructed. Feature construction is completed along three dimensions (user features, commodity features, and commodity category features) and their combinations, as shown in Figure 6. The experiment is treated as a binary classification problem, in which the output variable Y takes the label 1 (purchased) or 0 (not purchased).

The characteristics of user behavior are as follows:
(1) User activity: the sum of the user's behaviors in the most recent N days, reflecting the user's purchasing habits
(2) User conversion rate: reflects the user's habits when making a purchase decision

The commodity characteristics are as follows:
(1) Commodity activity: the total number of user operations on the commodity in the previous N days, reflecting the popularity of the product
(2) Commodity conversion rate: reflects the characteristics of the purchase decision for the commodity (such as high value and long decision time)

The characteristics of the commodity category are as follows:
(1) Activity of the commodity category: the total number of user operations on this category in the last N days, reflecting the popularity of this category
(2) Purchase conversion rate of the commodity category: reflects the characteristics of the purchase decision for the commodity category

The user-product category characteristics are as follows:
(1) Total browsing of user-product category
(2) Total purchases of user-product category
(3) Last operation date of user-product category

The commodity-commodity category characteristics are as follows:
(1) Rank of the commodity within its category by total user behavior
(2) Sales ranking of the commodity within its category
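The activity and conversion-rate features above can be computed with one grouping helper, sketched below. The log rows, the behavior names, the dates, and the definition of conversion rate as purchases divided by total operations in the window are assumptions for illustration; the source does not give exact formulas.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical log rows: (user_id, item_id, category_id, behavior, day).
log = [
    ("u1", "i1", "c1", "browse", date(2014, 12, 1)),
    ("u1", "i1", "c1", "cart",   date(2014, 12, 2)),
    ("u1", "i1", "c1", "buy",    date(2014, 12, 2)),
    ("u1", "i2", "c1", "browse", date(2014, 12, 3)),
    ("u2", "i2", "c1", "browse", date(2014, 11, 20)),
    ("u2", "i2", "c1", "browse", date(2014, 12, 3)),
]


def activity_and_conversion(log, key_index, ref_day, n_days):
    """For each key (user, item, or category), return a tuple
    (operations in the last n_days, purchases / operations)."""
    start = ref_day - timedelta(days=n_days)
    total, buys = Counter(), Counter()
    for row in log:
        behavior, day = row[3], row[4]
        if start < day <= ref_day:
            total[row[key_index]] += 1
            if behavior == "buy":
                buys[row[key_index]] += 1
    return {k: (total[k], buys[k] / total[k]) for k in total}


ref = date(2014, 12, 3)
user_feats = activity_and_conversion(log, 0, ref, n_days=7)
item_feats = activity_and_conversion(log, 1, ref, n_days=7)
cat_feats = activity_and_conversion(log, 2, ref, n_days=7)
print(user_feats, item_feats, cat_feats)
```

The same helper serves all three dimensions by changing `key_index`; the combined user-category and commodity-category features would group on a pair of columns instead.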

3.2. Division of Data Set

This experiment divides the data set into two parts, including the training set and test set. The learning and training process is shown in Figure 7.

The data preprocessing results are as follows. The user click conversion rate of this dataset is less than 1%; that is, on average it takes more than 100 interactions (including browsing, clicking, and adding products) for a user to generate one transaction. The amount of data is relatively large.

If all interactive behaviors of users over the whole time period are taken as feature data, the calculation speed of the model will be greatly reduced, and because user interest decays over time, the accuracy of the model will also be reduced. Therefore, the data set is divided into four groups with one week as the interval unit, one of which contains the abnormal data. The reason for eliminating the abnormal data has been explained above.
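The weekly grouping can be sketched as below. The start date, the 28-day window (chosen so that exactly four weekly groups result), and the helper name `split_by_week` are assumptions for illustration; the abnormal day is simply skipped inside the group it would fall into.

```python
from datetime import date, timedelta


def split_by_week(days, start_day, abnormal_days=()):
    """Group days into week-long buckets counted from start_day,
    dropping any day listed in abnormal_days."""
    groups = {}
    for day in days:
        if day in abnormal_days:
            continue
        week = (day - start_day).days // 7
        groups.setdefault(week, []).append(day)
    return groups


start = date(2014, 11, 18)                      # hypothetical start date
days = [start + timedelta(days=d) for d in range(28)]
abnormal = date(2014, 12, 12)                   # the shopping festival day
groups = split_by_week(days, start, abnormal_days={abnormal})
print({week: len(members) for week, members in groups.items()})
```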

4. Case Study and Algorithm Analysis

4.1. Algorithm Evaluation Indicator

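F1-score is a comprehensive indicator that can be considered when user ratings are sparse or missing. It is the harmonic mean of precision and recall rate. The basic meanings of these three indicators are as follows:

$$\text{precision} = \frac{tp}{tp + fp}, \quad \text{recall} = \frac{tp}{tp + fn}, \quad F1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

where some concepts of the confusion matrix are involved: tp means true positive, fp means false positive, fn means false negative, and tn means true negative.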

Combined with the recommendation scenarios of online shopping platforms, precision and recall rate can also be expressed in terms of the recommended quantity and the serviceable quantity.

Higher precision means more accurate recommendation, and higher recall rate means more acceptance of recommendation. Only when both are high, the performance of the recommendation algorithm is good. Therefore, F1-score is taken as the main indicator.
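The three indicators follow directly from the confusion-matrix counts. In the sketch below, the set-based helper `score_recommendations` and the example user-item pairs are hypothetical; it treats a recommended pair as a true positive when the user actually purchased the item.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def score_recommendations(recommended, purchased):
    """Score a set of recommended (user, item) pairs against actual purchases."""
    tp = len(recommended & purchased)
    return precision_recall_f1(tp, len(recommended) - tp, len(purchased) - tp)


# Hypothetical recommended and purchased (user, item) pairs.
recommended = {("u1", "i1"), ("u1", "i2"), ("u2", "i3"), ("u3", "i4")}
purchased = {("u1", "i1"), ("u2", "i3"), ("u4", "i5")}
p, r, f1 = score_recommendations(recommended, purchased)
print(round(p, 3), round(r, 3), round(f1, 3))
```

With 2 hits out of 4 recommendations and 3 purchases, precision is 1/2, recall is 2/3, and F1 is 4/7, which shows how F1 sits between the two but closer to the lower value.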

4.2. Experimental Environment

The experiment runs on a 64-bit Windows 10 operating system with an Intel(R) Core(TM) i5-5200U CPU @ 2.20 GHz (4 CPUs) and Python 3.7.2. It is implemented with software packages such as pandas and sklearn.

4.3. Separate Model Training

First, the LR model was trained, but the result was not ideal. When RF, GBDT, and XGBOOST were used, the effect improved significantly. Among the three ensemble algorithms, XGBOOST has only a slight advantage in F1-score over RF, but its run time is shorter than that of RF. The LR model is simple and its run time is short, but its accuracy is poor compared with the other algorithms. The three ensemble models (RF, GBDT, and XGBOOST) show strong generalization ability on data with such complex features. The training results of the separate models are shown in Table 1.

4.4. Mixed Model Training

All kinds of recommendation algorithms have their own advantages and difficulties in implementation. In fact, most recommendation systems carry out a mixed recommendation by integrating different recommendation algorithms in various forms.
(1) Weighted mixing: the weighted hybrid combination is a common approach. The recommendation results and scores of the respective models are combined by weighting to generate the final recommendation set. With a reasonable weighting method, the result of the combined algorithm is better, but there are also problems such as a large amount of computation and system complexity.
(2) Waterfall mixing: the waterfall hybrid mode adopts the principle of “filtering” and takes the filtering result of one model as the input data of another model. Taking the GBDT + LR model as an example, the output of the GBDT algorithm is used directly as the input features of LR, which combines the respective advantages of the two models and improves the utilization rate of data information and the learning efficiency of the LR model.
(3) Graded mixing: in industry, this kind of hybrid method makes the hybrid algorithm simple and efficient. Depending on the scene and the characteristics of the different algorithms, the high-precision algorithm is used first, and other algorithms are used to obtain subsequent results.
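The GBDT + LR waterfall can be sketched with sklearn as follows. Using the leaf index reached in each tree as a one-hot feature for LR is a widely used construction and is assumed here; the synthetic data, the split into separate GBDT and LR training sets, and all hyperparameter values are illustrative rather than taken from the experiment.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data with a feature interaction that LR alone would miss.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = (X[:, 0] * X[:, 1] + X[:, 2] > 0.5).astype(int)
X_gbdt, y_gbdt = X[:200], y[:200]      # trains the first stage
X_lr, y_lr = X[200:400], y[200:400]    # trains the second stage
X_test = X[400:]

gbdt = GradientBoostingClassifier(n_estimators=30, max_depth=3, random_state=0)
gbdt.fit(X_gbdt, y_gbdt)


def leaf_indices(model, X):
    # apply() returns shape (n_samples, n_estimators, 1) for binary targets.
    return model.apply(X)[:, :, 0]


# One-hot encode which leaf each sample reaches in each tree.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(leaf_indices(gbdt, X_gbdt))

# The GBDT output becomes the LR input features.
lr = LogisticRegression(max_iter=1000)
lr.fit(encoder.transform(leaf_indices(gbdt, X_lr)), y_lr)

proba = lr.predict_proba(encoder.transform(leaf_indices(gbdt, X_test)))[:, 1]
print(proba[:5])
```

Training the two stages on disjoint subsets is a design choice in this sketch that reduces the risk of the second stage overfitting to leaves memorized by the first.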

The training result of mixed model is shown in Table 2.

In the experiment, the models are fused according to the independent training results of the previous models. The experiment involves many steps of data processing, feature selection, and parameter adjustment. The results of the experiment are as follows:
(1) Under normal circumstances, the fusion of models improves performance. However, when the performance of a single model is poor, fusing it with other models reduces the effect of the fusion model.
(2) In mixed recommendation, direct fusion such as simple weighting performs poorly. A hierarchical strategy can be adopted instead, in which efficiency is improved by taking the output of one model as the input of another.
(3) When the experimental performance of the models is similar, fusion does not significantly improve the effect.
(4) The final results show that the model combining LR, GBDT, and XGBOOST has the best effect.
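For comparison with the hierarchical strategy, simple weighting of model outputs can be sketched as below. The per-model probabilities and the weights are made-up numbers, not experimental results, and the helper name `weighted_mix` is hypothetical.

```python
def weighted_mix(model_probs, weights):
    """Blend per-model purchase probabilities with a normalized weighted mean."""
    total = sum(weights[m] for m in model_probs)
    n = len(next(iter(model_probs.values())))
    return [
        sum(weights[m] * model_probs[m][i] for m in model_probs) / total
        for i in range(n)
    ]


# Hypothetical predicted probabilities for three samples from each model.
model_probs = {
    "RF":      [0.2, 0.8, 0.5],
    "GBDT":    [0.4, 0.6, 0.7],
    "XGBOOST": [0.3, 0.9, 0.6],
}
weights = {"RF": 1.0, "GBDT": 1.0, "XGBOOST": 2.0}   # illustrative weights

mixed = weighted_mix(model_probs, weights)
labels = [1 if p >= 0.5 else 0 for p in mixed]
print([round(p, 3) for p in mixed], labels)
```

Because the blend is a convex combination, it can never exceed the best single-model score on a given sample, which is consistent with the observation that weighting helps little when one component model is weak.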

5. Conclusion

With the continuous innovation of Internet technology and the substantial improvement of basic network conditions, e-commerce has developed rapidly. As the mainstream mode of e-commerce, online shopping has added great vitality to the development of the economy. This paper explores recommendation algorithms for online shopping platforms and designs and implements an e-commerce online shopping platform recommendation model based on integrated personalized recommendation. Future research can be carried out from the perspectives of real-time recommendation, visual recommendation result generation, and scene-aware mining of user interest models. The model used in the recommendation system can be further improved. In the future, user data from various sources can be integrated, and attribute tags can be added to make the personalized recommendation model more accurate. In addition, according to the different business needs of the application, the personalized recommendation system can add more recommendation scenarios to meet the needs of users more comprehensively.

In future software system development based on the model proposed in this study, the preliminary planning of the system function modules is as follows. The system is mainly divided into a foreground function module and a background function module. The foreground function module includes the system home page submodule, registration and login submodule, alternative platform display submodule, and personal center submodule. The background function module includes a platform management submodule, order management submodule, merchant management submodule, and security authority submodule. The functions of the software system are to be tested by unit testing.

Data Availability

The dataset can be accessed upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.