Abstract

In recommender systems (RSs), explicit information is often preferred over implicit information because it is more accurate than implicit or predicted information; for example, the user can enter information about his interests directly into the system, and the system will generate accurate recommendations for him. Receiving explicit information, however, may be difficult for a system. Some users may be uncomfortable providing explicit demographic information, and extremely common questions, such as race, gender, income, and age, can lead to bias and unfair recommendations. Therefore, in this study, we propose a method in which the information collected from a new user does not contain demographic information and the requested explicit information is data driven. Users’ interest in tourism activities is used to identify seven categories of tourism. The mapping between the extracted categories and the activities is established with a multilabel classification (MLC) algorithm. The user’s interest in 18 tourism activities is predicted from ratings of only seven tourism categories. Common MLC algorithms with different classifiers were used to evaluate the proposed method. The best result is obtained by binary relevance with the Naïve Bayes classifier, which also outperforms the collaborative filtering (CF) algorithms used as baseline models. The proposed method can capture users’ interests and develop their profiles without receiving demographic information. Moreover, in addition to a slight advantage in the metrics over CF, it requires only seven ratings to predict user interest in 18 activities. In contrast, CF algorithms require at least 15 user rating records to predict user interest in unknown activities (3-4 activities) with a performance close to that of the proposed method.

1. Introduction

Personalization is the ability to provide tailored content and services to users based on the knowledge about their preferences and tastes [1]. Personalization techniques are mainly related to recommender systems (RSs), which aim to filter irrelevant information and provide personalized information to each particular user [2]. RS can be defined as a personalization tool that provides people with a list of items that best fit their individual preferences, restrictions, or tastes [3]. One of the interesting applications of RS lies in the trip planning area [4].

Tourists are often confused about where to go when reaching new and unfamiliar places as there could be a wide variety of choices for consideration [5]. Besides, they typically have a limited amount of time and budget available; thus, it is almost impossible to visit all tourist attractions during a trip, especially to large cities [6]. As a result, tourists have to select the most compelling points of interest (POIs) according to their preferences. Then, they plan an itinerary, taking into account the time available to reach the POIs concerning their accessibility and opening hours [7].

The use of modern technologies such as collaborative filtering (CF), a classical recommender system approach, is considered an effective solution within the tourism industry [8]. CF helps people make their choices based on the opinions of those similar to them. The similarity between users is calculated based on the scores they have given to the list of items. Once the system finds out which people are closer to each other based on their interests and choices, the favorites of other “similar” users are suggested to the intended user. In this approach, obtaining feedback is necessary to find out which recommendations are favorable and which are not. CF systems use a user-item matrix to predict users’ interest in items. In such matrices, each row, column, and cell respectively represent a user, an item, and the rating a user has given to an item [9].

A further paradigm in cross-domain collaborative filtering (CDCF) is proposed by Yu et al. [10], whose model addresses the differing importance of the auxiliary domains for the target domain. They propose a cross-domain collaborative filtering algorithm that takes advantage of latent factors in auxiliary domains to expand user and item features. In the proposed algorithm, the recommendation is formulated as a classification problem in the target domain, where user and item location serve as features and ratings as labels. Then, Funk-SVD decomposition is employed to extract extra user and item features from user- and item-side auxiliary domains, respectively, with the purpose of expanding the two-dimensional location feature vector. In the final step, a C4.5 decision tree algorithm is used to predict missing ratings. To balance recommendation accuracy and efficiency, Yu et al. (2021) [11] examine how to select significant subsets from all the auxiliary domains. They propose a two-sided CDCF model based on selective ensemble learning that considers both accuracy and efficiency (TSSEAE). The model solves a biobjective optimization problem for selective ensemble learning, concentrating on a subset of auxiliary domains to achieve a balance between accuracy and efficiency.

Despite the high effectiveness of CF-based recommender systems in the tourism area, they suffer from two main challenges: sparsity and cold start. Sparsity occurs when many cells of the user-item matrix are empty because users have not rated the corresponding items [12, 13]. This makes the training of machine learning models challenging, especially for memory-based algorithms. As the number of items and users grows, the sparsity and the dimension of the user-item matrix become more severe and more problematic. To solve the problem of sparsity, a two-sided cross-domain collaborative filtering model has been proposed. It is assumed that there are two auxiliary domains, that is, a user-side domain and an item-side domain, where the user-side auxiliary domain shares the same aligned users and the item-side domain has the same aligned items. As a first step, the user and item features are conceptualized in the context of biorthogonal trifactorization. The recommendation problem is then converted into a classification problem, using the inferred user and item features as feature vectors and the rating as the class label. Using both user-side and item-side shared information, the model can transfer knowledge from auxiliary domains more effectively and infer domain-independent user and item features.

The cold-start problem occurs when a new user enters the system; since there is no record of the user’s interests or ratings of items, it is impossible to predict what the user would be attracted to. The same problem arises for newly added items and is likewise referred to in the literature as cold start. The cold-start problem is usually handled by using hybrid systems or by expanding user and item profiles through gathering explicit and implicit information [14]. Trust-aware recommenders are one solution for dealing with the cold-start problem. Ahmadian et al. [15] use reliability measurements to improve the accuracy of trust-aware recommender systems and remove people whose predictions are unreliable while preserving good coverage. Another study presents a variant of the profile expansion technique to alleviate the cold-start problem in recommender systems. For this purpose, the authors consider the user’s demographic information (e.g., age, gender, and occupation) and the user’s rating information to enrich the neighborhood set. In particular, two distinct strategies are used to enrich the rating profiles of users by adding some additional ratings. The proposed expansion of rating profiles significantly affects the performance of recommender systems, particularly those experiencing a cold-start issue [16].

The RSs can automatically learn the user’s preferences by analyzing their explicit or implicit feedback. Explicit data might be given by the user in different ways, for instance by requiring them to fill out a questionnaire about their preferences and interests. The system can infer implicit interests through the analysis of the user’s behavior [2].

The explicit information is often preferred over implicit information because it is more accurate than the predicted or implicit information; that is, the user can directly enter information about his interest, and then the system will generate accurate recommendations for him [17]. However, receiving explicit information could be challenging for a system. Users might feel uncomfortable providing explicit demographic information and extremely common questions, such as one’s race, gender, income, or age, could cause bias and unfair recommendations [18]. To this end, in this study, we have proposed a method, in which the information collected from a new user does not contain demographic information, and the enquired explicit information is data driven.

In this method, tourism activities are categorized by using exploratory factor analysis (EFA). New users with no rating record of tourism activities are asked to rate each of these categories on a scale of one to five points. The category ratings are then mapped to the activities by a multilabel classification (MLC) algorithm, which predicts what activities the user is likely to enjoy; in other words, it develops a tourism profile for the user. Because the proposed RS maps activities to their associated categories, respondents are required to answer fewer questions, and the RS can predict interest in tourism activities with less data about users. In terms of evaluation, the proposed RS performs better than CF-based models.

The rest of the study is structured as follows: the next section provides an overview of the multilabel classifiers and their applications in RS; the research methodology section deals with how to identify tourism activities, how to extract tourism categories, proposing algorithms to predict the user’s favorite activities, and how to evaluate the presented method. In the result section, we review and analyze the method’s ability in capturing and predicting user interests. Last, in the conclusion section, we review our method and discuss this study’s achievements, research limitations, and future research suggestions.

In machine learning, single-label classification is one of the most commonly used settings, in which each instance in the dataset is associated with a unique class label $l$ from a set of disjoint class labels $L$. Depending on the number of these classes, the problem is either a binary classification (when $|L| = 2$) or a multiclass classification (when $|L| > 2$). However, in multilabel problems, each instance can be associated with multiple classes. In such algorithms, the goal is to learn from a set of instances to predict the class or classes in $L$ to which each instance belongs [19]. MLC approaches are categorized into (a) problem transformation and (b) algorithm adaptation methods.

In problem transformation, the MLC problem is transformed into one or more single-label classification problems. Therefore, it does not need any change or adaptation to traditional algorithms, and those algorithms can be applied to the problem directly [20]. Problem transformation methods are divided into three main algorithms: binary relevance (BR), label power set (LP), and classifier chain (CC). Using these three problem transformation algorithms, this study applies five classifiers, namely, support vector machine (SVM), decision tree (DT), random forest (RF), Naïve Bayes (NB), and K-nearest neighbor (KNN). In adaptation algorithms, instead of transforming the problem, the algorithms themselves are modified to handle multilabel data. We used two adaptation algorithms, namely, binary relevance KNN (BRKNN) and multilabel K-nearest neighbor (MLKNN). Besides these approaches, ensemble learning algorithms can learn from multilabel data natively without any transformation of the base algorithms or the problem. Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions [21]. This study used random forest (RF) and extra tree (ET) classifiers as ensemble algorithm candidates.
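To make the idea of problem transformation concrete, the following minimal sketch shows how a multilabel target can be rewritten for binary relevance and for the label power set. The function names and the toy label matrix are hypothetical and serve only as an illustration; classifier chains, which additionally feed earlier labels to later classifiers as features, are not shown.

```python
def binary_relevance_split(label_rows):
    """Binary relevance: return one independent binary target column per label."""
    return [list(column) for column in zip(*label_rows)]


def label_powerset_encode(label_rows):
    """Label power set: map each distinct label combination to one class id."""
    combo_to_class = {}
    classes = []
    for row in label_rows:
        key = tuple(row)
        if key not in combo_to_class:
            # A new combination of labels becomes a new single-label class.
            combo_to_class[key] = len(combo_to_class)
        classes.append(combo_to_class[key])
    return classes, combo_to_class


# Toy label matrix: 4 users, 3 labels (1 = interested, 0 = not interested).
Y = [[1, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 0]]

br_targets = binary_relevance_split(Y)          # three binary problems
lp_classes, mapping = label_powerset_encode(Y)  # one multiclass problem
```

Binary relevance ignores any dependence between labels, whereas the label power set preserves it at the cost of a class count that can grow with every new label combination.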

MLC has many applications in various domains, including text classification [22, 23], image classification [24], bioinformatics [25], genre classification [26], and social media analysis [27]. More details can be found in references [28, 29]. Moreover, MLC has also been applied in RSs. Carrillo et al. [30] demonstrated the ability of MLC to recommend items and deal with common RS problems, including data sparsity. Zheng et al. [31] used MLC to recommend users’ contexts: instead of recommending an item to the user, the user-related contexts are predicted based on the items selected by the user and the ratings given to each item. To this end, they transformed the problem into an MLC problem and showed that MLC algorithms are more capable of recommending and predicting than the base algorithms. Rivolli et al. [32] used MLC algorithms to recommend food trucks. They obtained a dataset using a questionnaire comprising two stages: in the first stage, the user answers 21 questions, which are the attributes describing the user. These questions are viewed as predictive attributes. The second stage of the questionnaire includes 12 food alternatives, for each of which the user is asked to specify their preference. These alternatives correspond to the class labels or target attributes. The results indicate that the adaptation algorithms showed weaker performance than the transformation methods. Elhassan et al. [33] used MLC to provide remedial actions to address students’ shortcomings in “learning outcome attainment rates.” In their model, each instance is a student described by a set of characteristics such as field of study, academic level, and grades, and the labels of each student are the remedial actions assigned to them. The results show that the classifier chain method with the decision tree classifier gives the best outcome for the given dataset.

However, despite the wide range of studies on MLC applications in RSs, the capabilities of MLC have not yet received sufficient attention and evaluation. One of those capabilities is using MLC to address the cold-start problem and to reduce the amount of explicit information requested from a new user. This study is an attempt to fill these gaps and to show the performance of MLC algorithms in comparison with CF algorithms as base models. To the best of our knowledge, this is the first work that addresses the cold-start problem in developing tourism users’ profiles with data-driven explicit information and MLC algorithms.

3. Research Methodology

From a data-driven viewpoint, there are two main steps in building the proposed recommendation: the first step is to extract tourism categories by measuring users’ interests in tourism activities. The second step is to associate categories and activities with a data-driven connection. To establish such a connection, data collection and training of the MLC algorithm are required. Therefore, this connection allows the prediction of the user’s interest in activities using his ratings in extracted categories. The tourism sites and activities of Tehran, one of the tourist cities of Iran [34], have been selected as the case study of the presented method. In this section, a detailed explanation of the methodological steps and internal and external evaluation is offered. The stages of the proposed method are shown graphically in Figure 1.

3.1. Identifying Tourism Activities

To identify tourism activities in Tehran, we used previous research [34–36], analytical reports of the British Tourism Organization (http://www.visitbritain.org/archive-great-britain-tourism-survey-overnight-data), and the content of Tripadvisor (http://www.tripadvisor.com/Attractions-g293998-Contexts-Iran.html). It should be kept in mind that many of the tourism activities mentioned in the papers and in the British Tourism Organization’s analytical reports, such as nightclubs or beach tours, do not exist in Tehran. Therefore, by combining and modifying the activities in the mentioned sources, 18 types of tourism-related activities in Tehran were identified, namely, going to the cinema, theaters, museums, holy sites, historical sites, sports events, sports activities, art and book exhibitions, music events, malls, public gardens, restaurants, cafes, zoos, rural places, rivers and lakes, and mountains. It is worth noting that tourism activities can also be extracted by applying text mining approaches such as co-word analysis [37–40], topic modeling [41], and big data clustering [42] to textual sources such as online social networks.

3.2. Data Collection (Questionnaire 1)

Reviewing and rating these 18 tourism activities can be a tedious and challenging task for a user. Thus, reducing these 18 activities into fewer and more interpretable categories makes the user more comfortable in recognizing and categorizing the content. In order to generate the category layer, it was necessary to obtain data; therefore, based on the Likert scale (a scale between one and five where one indicates the slightest interest in an activity, and five denotes the most), a questionnaire was designed to measure users’ interest in each of the 18 activities. To better guide the users, we introduced several POIs in Tehran as instances for each of the mentioned activities. As an example, Saadabad Palace and Negarestan Mansion were mentioned as instances for the historical sites. After designing the questionnaire, it was distributed randomly on social media platforms, such as Facebook, Twitter, and messenger applications. A total number of 272 questionnaires were collected, and the reliability of the designed questionnaire was proved by calculating Cronbach’s alpha equal to 0.846, which was more than the cutoff required of 0.7 [43]. A schema of the data collected by the questionnaire is listed in Table 1. Due to page size limitations, only a few of the activities and scores given to each activity by participants in the questionnaire are shown.
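The reliability check mentioned above can be sketched in a few lines. The example below assumes the standard definition of Cronbach’s alpha based on sample variances; the function name and the toy responses are hypothetical and are not the questionnaire data.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])                                    # number of items
    item_variances = [variance(col) for col in zip(*responses)]
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Toy data: 4 respondents, 3 Likert items that move together perfectly.
toy_responses = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
alpha = cronbach_alpha(toy_responses)
```

Items that vary together push the variance of the total score above the sum of the item variances, which drives alpha toward one; unrelated items drive it toward zero.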

3.3. Extracting Tourism Categories (Factor Analysis of Questionnaire 1)

The primary goal of factor analysis (FA) is to summarize data so that relationships and patterns are revealed, by regrouping variables into a limited collection of clusters based on shared variance. FA uses mathematical methods to simplify interrelated measures and discover patterns in a set of variables. FA has been applied in various fields, such as behavioral and social sciences, medicine, economics, and geography [44], and is divided into two main classes, namely, exploratory factor analysis (EFA) and confirmatory factor analysis (CFA). EFA is used when the research goal is to discover the number of influencing variables or to find variables that go together. FA is useful for studies such as questionnaires based on a few to hundreds of variables, which can be reduced to a smaller set to simplify interpretation. Focusing on a smaller set of variables is not only easier than considering many variables; clustering them into categories also makes the variables more meaningful. In this study, EFA was applied to obtain meaningful categories of variables. The determinant score for our data is 0.0000135, which is more than 0.00001, and indicates a violation in the assumption of correlation of variables; in such a case, to extract the factors, it is recommended to use the principal axis factor [44]. We used the Varimax rotation method with 30 iterations based on the default value in SPSS software for rotation. To check the adequacy and suitability of the dataset for EFA, the Kaiser–Meyer–Olkin (KMO) measure and Bartlett’s test of sphericity were applied. The minimum acceptable value of the KMO index for factor analysis is 0.5; the value in our research is 0.76. Bartlett’s test is a statistical hypothesis test whose null hypothesis states that the correlation matrix is an identity matrix, that is, that there is no significant relationship between the variables. As listed in Table 2, the p value falls in the rejection region (the value of sig must be less than 0.05, and it is reported as zero for our data), so the null hypothesis is rejected and the variables are sufficiently correlated for factor analysis.
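As a worked illustration, Bartlett’s statistic can be computed directly from the determinant of the correlation matrix. The sketch below assumes the standard form of the statistic and plugs in the figures reported in this study ($n = 272$ respondents, $p = 18$ activities, and $|R| \approx 0.0000135$); because the determinant is rounded, the result is approximate and may differ slightly from the value in Table 2.

$$\chi^2 = -\left(n - 1 - \frac{2p + 5}{6}\right)\ln|R|, \qquad df = \frac{p(p-1)}{2}$$

$$\chi^2 \approx -\left(271 - \frac{41}{6}\right)\ln(0.0000135) \approx 264.17 \times 11.213 \approx 2962, \qquad df = \frac{18 \times 17}{2} = 153$$

A statistic of this size with 153 degrees of freedom corresponds to a p value far below 0.05, which is consistent with rejecting the identity-matrix null hypothesis.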

To determine the number of significant factors, Kaiser’s criterion states that only factors with eigenvalues of one or more should be retained. According to Figure 2 (scree plot), the best number of factors after rotation for this dataset is 7.

As factor naming does not follow a specific rule, here, we named each factor based on the associated variables that describe the factor (Table 3).

3.4. Data Collection (Questionnaire 2)

To train the MLC algorithms, we need data that map the connection between categories and activities. The advantage of this connection is that different states can be considered, and the interest of a new user in tourism activities can be predicted simply from ratings of the seven categories extracted by factor analysis (Figure 3). Therefore, a two-part questionnaire was designed. The first part asked the users to rate their level of interest in every category on a five-point Likert scale. The second part asked participants to indicate their fondness for each of the 18 activities using binary values of 0 for not being interested and 1 for being interested. We randomly distributed this questionnaire via social media and messaging applications to Tehran residents. In total, 578 questionnaires were collected, and the calculated Cronbach’s alpha was 0.859.

Table 4 lists a view of the data gathered from the second questionnaire. Because of space limitations, only a few categories, activities, and scores are displayed on the page.

3.5. Multilabel Classification Problem Definition

Let $X$ be the users-category matrix, and let $L = \{\lambda_1, \ldots, \lambda_k\}$ be a finite set of labels or activities. A user $x \in X$ is represented in terms of a feature vector $x = (x_1, \ldots, x_m)$, which holds the ratings the user gave to each of the $m$ extracted categories; the user is associated with a subset of labels $Y \subseteq L$. If we call this set $Y$ the set of relevant labels of $x$, then its complement $L \setminus Y$ is the set of irrelevant labels of $x$. Let the set of relevant labels $Y$ be denoted by a binary vector $y = (y_1, \ldots, y_k)$, where $y_j = 1$ if and only if $\lambda_j \in Y$, and $\mathcal{Y} = \{0, 1\}^k$ is the set of all such possible labelings. In this study, $m = 7$ and $k = 18$.

Therefore, given a training set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, $x_i \in X$, $y_i \in \mathcal{Y}$, consisting of $n$ training instances (independent and identically distributed) drawn from an unknown distribution $D$, the goal of multilabel learning is to produce a multilabel classifier $h: X \rightarrow \mathcal{Y}$ (in other words, $h: X \rightarrow 2^L$) that optimizes some specific evaluation function (i.e., loss function) [19].

This study uses the data from the second questionnaire as the training dataset for the MLC algorithms. Consequently, when a new user enters the system and rates each of the categories from 1 to 5, his/her interest in the 18 activities is predicted. In the transformation approach, all three algorithms (BR, LP, and CC) are used with the LR, DT, RF, SVM, and KNN classifiers. For adaptation algorithms, BRKNN and MLKNN are used, and for ensemble algorithms, the ET and RF classifiers are used. We implemented the MLC algorithms using Python version 3.5 with the scikit-learn and scikit-multilearn packages. All the classifiers’ hyperparameters in this study are the packages’ default values.
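The mapping from category ratings to activity labels can be illustrated with a small from-scratch sketch of binary relevance with a Naïve Bayes base classifier. This is not the scikit-learn or scikit-multilearn implementation used in the study: the classes `CategoricalNB` and `BinaryRelevance`, the Laplace smoothing, and the toy data (two categories, two activities) are assumptions made only for illustration.

```python
import math


class CategoricalNB:
    """Tiny categorical Naive Bayes with Laplace smoothing (illustrative only).

    Assumes every feature is a Likert rating taken from `values`.
    """

    def __init__(self, values=(1, 2, 3, 4, 5), alpha=1.0):
        self.values = values
        self.alpha = alpha

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.log_prior, self.log_like = {}, {}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            self.log_prior[c] = math.log(len(rows) / len(y))
            for j in range(len(X[0])):
                for v in self.values:
                    count = sum(1 for x in rows if x[j] == v)
                    smoothed = (count + self.alpha) / (
                        len(rows) + self.alpha * len(self.values)
                    )
                    self.log_like[(c, j, v)] = math.log(smoothed)
        return self

    def predict_one(self, x):
        def score(c):
            return self.log_prior[c] + sum(
                self.log_like[(c, j, v)] for j, v in enumerate(x)
            )
        return max(self.classes, key=score)


class BinaryRelevance:
    """Train one independent binary classifier per label column."""

    def __init__(self, make_classifier):
        self.make_classifier = make_classifier

    def fit(self, X, Y):
        self.models = [
            self.make_classifier().fit(X, list(column)) for column in zip(*Y)
        ]
        return self

    def predict(self, X):
        return [[m.predict_one(x) for m in self.models] for x in X]


# Toy data: ratings of 2 categories (1-5) and interest in 2 activities (0/1).
X_train = [[5, 1], [4, 2], [5, 2], [1, 5], [2, 4], [1, 4]]
Y_train = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]

model = BinaryRelevance(CategoricalNB).fit(X_train, Y_train)
predictions = model.predict([[5, 1], [1, 5]])
```

In the setting described here, the same structure would take seven category ratings as input and train 18 binary classifiers, one per activity.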

Label imbalance, shown in Figure 4, must also be considered. In none of the labels are the two classes equally represented, which may cause problems in the learning process of some algorithms. To mitigate this problem, we used the class weight-balancing feature of the scikit-learn package in all classifiers except NB, MLKNN, and BRKNN, because those cannot benefit from this technique.
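The effect of weight balancing can be sketched with the heuristic that scikit-learn documents for `class_weight="balanced"`, namely `n_samples / (n_classes * count_of_class)`. The helper name and the toy label column below are hypothetical.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class inversely to its frequency in the label column y."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Toy label column for one activity: three interested users, one not interested.
weights = balanced_class_weights([1, 1, 1, 0])
```

The minority class receives the larger weight, so misclassifying its instances costs more during training.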

3.6. Evaluation

In this research, the evaluation stage is vital in two ways: first, in the internal evaluation, in response to the first research question, we measured and compared the performance of different MLC algorithms in capturing and predicting users’ interests. Second, in the external evaluation, the performance of the proposed method was compared with that of state-of-the-art CF algorithms to address the second research question. Fivefold cross validation with three metrics was used for both. Cross validation makes it possible to train and evaluate algorithms with little data.

Moreover, since all samples are used both as training and test data in the algorithm learning process, it is favorable to compare different algorithms’ performance on classification problems. There are a few different metrics for the MLC evaluation, but it is necessary to select those that can be used for CF methods. One of our goals during the evaluation stage is to compare the proposed framework with state-of-the-art CF techniques. To assess this, we selected precision, recall, and F1-score metrics. Since the weight-balancing technique is not applicable for some classifiers (NB, MLKNN, and BRKNN), we used the macroaverage criteria that do not take label imbalance into account.

3.6.1. Metrics

Let $T$ be a multilabel dataset consisting of $n$ multilabel examples $(x_i, Y_i)$, $1 \le i \le n$, with a label set $L$, $|L| = k$. Let $h$ be a multilabel classifier and $Z_i = h(x_i)$ be the set of label memberships predicted by $h$ for the example $x_i$. Therefore:

Precision (P). The precision is the proportion of correctly predicted labels to the total number of predicted labels, averaged over all instances. In our case, precision indicates how many of the predicted activities are correct for the user [19, 38]:

$$P = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap Z_i|}{|Z_i|}$$

Recall (R). The recall is the proportion of correctly predicted labels to the total number of actual labels, averaged over all instances. In our case, recall indicates how many of the user’s favorite activities the algorithm has been able to predict [19, 38]:

$$R = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap Z_i|}{|Y_i|}$$

F1 Score. The definitions of precision and recall naturally lead to the following definition of the F1 score [19, 38]:

$$F_1 = \frac{1}{n} \sum_{i=1}^{n} \frac{2\,|Y_i \cap Z_i|}{|Y_i| + |Z_i|}$$
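The three metrics can be sketched as follows for binary label vectors. The sketch uses the instance-averaged form described above and assumes, as a convention of this example only, that a ratio with an empty denominator counts as zero; a macro-averaged variant would average the same ratios per label rather than per instance.

```python
def example_based_scores(Y_true, Y_pred):
    """Return instance-averaged precision, recall, and F1 for 0/1 label rows."""
    n = len(Y_true)
    precision = recall = f1 = 0.0
    for y, z in zip(Y_true, Y_pred):
        true_set = {j for j, v in enumerate(y) if v == 1}
        pred_set = {j for j, v in enumerate(z) if v == 1}
        hits = len(true_set & pred_set)          # correctly predicted labels
        precision += hits / len(pred_set) if pred_set else 0.0
        recall += hits / len(true_set) if true_set else 0.0
        total = len(true_set) + len(pred_set)
        f1 += 2 * hits / total if total else 0.0
    return precision / n, recall / n, f1 / n


# Toy example: two users, three activities.
Y_true = [[1, 0, 1], [0, 1, 0]]
Y_pred = [[1, 1, 0], [0, 1, 0]]
scores = example_based_scores(Y_true, Y_pred)
```

For the first user, one of two predicted activities is correct and one of two liked activities is found; the second user is predicted perfectly.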

3.6.2. The Proposed Method versus Baseline Model

In this part, the presented framework is compared with CF algorithms, which predict a user’s interest in an activity by observing his interactions with other activities. To compare the prediction ability of CF with that of our proposed method, the problem must be transformed into a CF problem. We defined a scenario in which users are given a set of activities at random and asked to indicate their interest in each one in a binary manner. As a next step, the user’s scoring record is fed into the CF algorithm, which predicts his other interests based on that record. This step was made possible by using the second questionnaire data. For the CF problem, we discarded users’ ratings of categories and retained only users’ binary ratings of activities. Finally, our method’s best score was compared with the scores of the CF algorithms described below.

CF systems generally have memory-based and model-based techniques for the recommendation. There are no assumptions on data in the memory-based technique, and it essentially depends on the nearest neighbors’ search to find the closest pairs of items or users. When the recommendation is based on measuring the similarities between items, it is an item-item method, and when it is based on measuring users’ similarities, it is a user-user method. We focus on the user-user method for our problem because the number of items (activities) is few. In the case of few items, the variance of the item-item method is low, and its bias is high; thus, personalization may suffer. On the other hand, model-based techniques rely upon assumptions made about the data and build a model that explains the interactions between users and items.

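To formulate our problem for CF systems, we only need the user-activity matrix (users’ interest in activities), without the user-category matrix (their ratings of each category). Thus, for $N$ users and $K$ activities, the user-activity matrix is defined as follows:

$$A = [a_{ui}] \in \{0, 1, ?\}^{N \times K}$$

where $a_{ui} = 1$ if user $u$ is interested in activity $i$, $a_{ui} = 0$ if not, and $a_{ui} = ?$ if the interest is unknown.

CF systems try to replace all the “question marks” in $A$ with some optimal guesses; the goal is to minimize the RMSE (root mean square error) when predicting the user interests on a test set (which is, of course, unknown during the training phase), that is, to minimize

$$\mathrm{RMSE} = \sqrt{\frac{1}{|\mathrm{Test}|} \sum_{(u, i) \in \mathrm{Test}} (r_{ui} - \hat{r}_{ui})^2}$$

where $(u, i) \in \mathrm{Test}$ if the interest of user $u$ in activity $i$ is recorded in the test set, $|\mathrm{Test}|$ is its cardinality, $r_{ui}$ is the true rating, and $\hat{r}_{ui}$ is the prediction of the recommender system [45].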
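The RMSE objective can be sketched in a few lines. The function name, the `(user, activity, rating)` tuple layout, and the constant predictor are hypothetical choices made for this illustration.

```python
import math


def rmse(test_ratings, predict):
    """Root mean square error over (user, activity, true_rating) triples."""
    squared_errors = [(r - predict(u, i)) ** 2 for u, i, r in test_ratings]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy test set with binary interests and a predictor that always returns 0.5.
toy_test = [(0, 1, 1), (1, 0, 0), (2, 2, 1)]
error = rmse(toy_test, lambda user, activity: 0.5)
```

A constant guess of 0.5 is off by exactly 0.5 on every binary rating, so it gives an RMSE of 0.5; any useful recommender should score below that on such data.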

The following is a description of the six CF algorithms that are appropriate for our data and type of problem and that are compared with the method presented in this study. We intend to compare the presented method with a variety of algorithms of varying capabilities. The first two algorithms are basic and make only simple predictions, but they are still appropriate as reference points. The next two algorithms represent the memory-based and model-based techniques. The last two algorithms are popular and powerful models based on neural networks.

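Random Predictor (RP). The random predictor predicts a random rating based on the distribution of the training set, which is assumed to be normal. The prediction $\hat{r}_{ui}$ is drawn from a normal distribution $\mathcal{N}(\hat{\mu}, \hat{\sigma}^2)$, where maximum likelihood estimation is used to estimate $\hat{\mu}$ and $\hat{\sigma}$ from the training data:

$$\hat{\mu} = \frac{1}{|R_{train}|} \sum_{r_{ui} \in R_{train}} r_{ui}, \qquad \hat{\sigma} = \sqrt{\frac{1}{|R_{train}|} \sum_{r_{ui} \in R_{train}} (r_{ui} - \hat{\mu})^2}$$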

This model aims at providing a basis for comparisons between different models and random prediction.

Baseline only. The baseline-only algorithm predicts the baseline estimate for a given user and item:

$$\hat{r}_{ui} = b_{ui} = \mu + b_u + b_i$$

By means of $b_u$ and $b_i$, we measure the observed deviations of user $u$ and activity $i$ from the average, whereas $\mu$ represents the overall rating average. If user $u$ is unknown, the bias $b_u$ is assumed to be zero. The same applies to item $i$ with $b_i$.
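A baseline estimate of this kind can be sketched with a few alternating updates of the user and item biases. The update rule below is a simple regularized averaging scheme chosen for illustration; the function names, the regularization values, and the toy ratings are assumptions and do not reproduce the configuration used in the study.

```python
from collections import defaultdict


def fit_baselines(ratings, reg_u=1.0, reg_i=1.0, n_epochs=10):
    """Estimate the global mean and the user/item biases from (u, i, r) triples."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_u, b_i = defaultdict(float), defaultdict(float)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    for _ in range(n_epochs):
        # Update item biases with user biases fixed, then the reverse.
        for i, rows in by_item.items():
            b_i[i] = sum(r - mu - b_u[u] for u, r in rows) / (reg_i + len(rows))
        for u, rows in by_user.items():
            b_u[u] = sum(r - mu - b_i[i] for i, r in rows) / (reg_u + len(rows))
    return mu, b_u, b_i


def predict_baseline(mu, b_u, b_i, u, i):
    """Baseline prediction; unknown users or items contribute a zero bias."""
    return mu + b_u.get(u, 0.0) + b_i.get(i, 0.0)


toy_ratings = [("a", "x", 1), ("a", "y", 1), ("b", "x", 0),
               ("b", "y", 1), ("c", "x", 0)]
mu, b_u, b_i = fit_baselines(toy_ratings)
cold_start_prediction = predict_baseline(mu, b_u, b_i, "new_user", "new_item")
```

For a completely new user and item the prediction falls back to the global mean, which illustrates why such baselines cannot personalize in a cold-start situation.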

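For the memory-based approach, we chose BSKNN, which takes a baseline rating into account; the baseline estimate for an unknown rating $r_{ui}$ is denoted by $b_{ui}$ and accounts for the user and item effects:

$$b_{ui} = \mu + b_u + b_i$$

The parameters $b_u$ and $b_i$ indicate the observed deviations of user $u$ and activity $i$, respectively, from the average, $\mu$ denotes the overall average rating, and $\lambda$ is the regularization term. To estimate $b_u$ and $b_i$, the following problem is minimized:

$$\min_{b_*} \sum_{r_{ui} \in R_{train}} (r_{ui} - \mu - b_u - b_i)^2 + \lambda \left( \sum_u b_u^2 + \sum_i b_i^2 \right)$$

Therefore, the prediction is set as follows:

$$\hat{r}_{ui} = b_{ui} + \frac{\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v)\,(r_{vi} - b_{vi})}{\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v)}$$

where $\mathrm{sim}(u, v)$ denotes the similarity measurement between users $u$ and $v$, and $N_i^k(u)$ only includes the neighbors of $u$ that have rated activity $i$ and for which the similarity measure is positive [46].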

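Singular value decomposition (SVD) [46], as a model-based approach, is a matrix factorization algorithm that decomposes the original sparse matrix into low-dimensional matrices with latent factors. The prediction is set as follows:

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^{T} p_u$$

If user $u$ is unknown, then the bias $b_u$ and the factors $p_u$ are assumed to be zero. The same applies to item $i$ with $b_i$ and $q_i$.

To estimate the unknown parameters, the following regularized squared error is minimized:

$$\sum_{r_{ui} \in R_{train}} (r_{ui} - \hat{r}_{ui})^2 + \lambda \left( b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2 \right)$$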
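A regularized objective of this kind is commonly minimized with stochastic gradient descent. The following is a minimal sketch under that assumption; the function names, the hyperparameter values, and the toy ratings are hypothetical and are not the settings used in the study.

```python
import math
import random


def train_svd(ratings, n_factors=2, n_epochs=200, lr=0.05, reg=0.02, seed=0):
    """Fit biases and latent factors on (user, item, rating) triples with SGD."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_u, b_i, p, q = {}, {}, {}, {}
    for u, i, _ in ratings:
        if u not in p:
            b_u[u] = 0.0
            p[u] = [rng.gauss(0, 0.1) for _ in range(n_factors)]
        if i not in q:
            b_i[i] = 0.0
            q[i] = [rng.gauss(0, 0.1) for _ in range(n_factors)]
    for _ in range(n_epochs):
        for u, i, r in ratings:
            pred = mu + b_u[u] + b_i[i] + sum(a * b for a, b in zip(p[u], q[i]))
            err = r - pred
            # Gradient steps on the regularized squared error.
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return mu, b_u, b_i, p, q


def predict_svd(model, u, i):
    """Predict a rating; unknown users or items get zero bias and zero factors."""
    mu, b_u, b_i, p, q = model
    dot = sum(a * b for a, b in zip(p.get(u, []), q.get(i, [])))
    return mu + b_u.get(u, 0.0) + b_i.get(i, 0.0) + dot


toy_ratings = [("a", "x", 1), ("a", "y", 1), ("b", "x", 0),
               ("b", "y", 1), ("c", "x", 0), ("c", "z", 0)]
model = train_svd(toy_ratings)
train_rmse = math.sqrt(
    sum((r - predict_svd(model, u, i)) ** 2 for u, i, r in toy_ratings)
    / len(toy_ratings)
)
```

On this toy data the fitted model should reproduce the training ratings more closely than the global mean does, while a user and item never seen in training still receive the global mean as their prediction.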

Neural Matrix Factorization (NeuMF). Neural matrix factorization [47] takes advantage of the flexibility and nonlinearity of neural networks to replace the dot products in matrix factorization in order to improve the model’s expressiveness. NeuMF is formulated as

$$\hat{y}_{ui} = f(P^{T} v_u^{U}, Q^{T} v_i^{I} \mid P, Q, \Theta_f)$$

where $P \in \mathbb{R}^{M \times K}$ and $Q \in \mathbb{R}^{N \times K}$ denote the latent factor matrices for users and items, respectively; $M$ and $N$ denote the number of users and items, respectively; $v_u^{U}$ and $v_i^{I}$ are feature vectors that describe user $u$ and item $i$, respectively; and $\Theta_f$ denotes the model parameters of the interaction function $f$. Since the function $f$ is defined as a multilayer neural network, it can be formulated as follows:

$$f(P^{T} v_u^{U}, Q^{T} v_i^{I}) = \phi_{out}(\phi_X(\ldots \phi_2(\phi_1(P^{T} v_u^{U}, Q^{T} v_i^{I})) \ldots))$$

where $\phi_{out}$ and $\phi_x$ respectively denote the mapping functions for the output layer and the $x$th neural collaborative filtering (CF) layer, and there are $X$ neural CF layers in total.

Standard Variational Auto-Encoder (VAE) [48]. The standard VAE considered in this study takes the user rating vector $x_u$ as input. Through the encoder function $g_\phi$, the user input is encoded to learn the mean, $\mu_u$, and standard deviation, $\sigma_u$, of the K-dimensional latent representation. The latent vector $z_u$ for each user is sampled using $\mu_u$ and $\sigma_u$. The decoder function $f_\theta$ is then used to transfer the latent vector from the K-dimensional space to a probability distribution in the original N-dimensional space. This distribution indicates the probability that each of the N activities will be liked by user $u$:

$$\pi_u = f_\theta(z_u)$$

The output $\pi_u$ is a probability distribution over the N activities. In this model, the evidence lower bound (ELBO) is used as the objective function/loss:

$$\mathcal{L} = \mathbb{E}_{q_\phi(z_u \mid x_u)}\left[\log p_\theta(x_u \mid z_u)\right] - \mathrm{KL}\left(q_\phi(z_u \mid x_u) \,\|\, p(z_u)\right)$$

where $x_u$ is the activity feature vector, while $z_u$ is its latent representation. Here, the first part of the equation considers the log-likelihood of the activity vector given its latent representation, and the second part is the Kullback–Leibler (KL) divergence measure. The log-likelihood function considered is given as follows:

$$\log p_\theta(x_u \mid z_u) = \sum_{i} x_{ui} \log \pi_{ui}$$

where the sum is taken over all the items $i$. The KL divergence is calculated for the latent state of the model, $z_u$.
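For reference, when the encoder outputs a diagonal Gaussian and the prior over the latent vector is a standard normal, the KL term of the ELBO has a well-known closed form. Both conditions are assumptions of this sketch (the text does not state the prior), and the per-dimension symbols $\mu_{uk}$ and $\sigma_{uk}$ are introduced here only for illustration:

$$\mathrm{KL}\left(\mathcal{N}(\mu_u, \mathrm{diag}(\sigma_u^2)) \,\|\, \mathcal{N}(0, I)\right) = \frac{1}{2} \sum_{k=1}^{K} \left(\mu_{uk}^2 + \sigma_{uk}^2 - 1 - \ln \sigma_{uk}^2\right)$$

The term is zero when every latent dimension has zero mean and unit variance, and it grows as the encoded distribution moves away from the prior.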

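One problem remains: the output of the mentioned algorithms lies in the range $[0, 1]$, but the desired output is binary, with one and zero denoting a user’s interest in and dislike of an activity, respectively. To solve this problem, we suggest using a threshold α:

$$l_{ui} = \begin{cases} 1, & \hat{r}_{ui} \ge \alpha \\ 0, & \text{otherwise,} \end{cases}$$

where $\hat{r}_{ui}$ is the output of the algorithm for user $u$ and activity $i$, and $l_{ui}$ is the resulting binary label.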

For each selected algorithm, to choose the best threshold value, the F1 score is calculated for different values of α, and the value with the best F1 score is chosen as the threshold.
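The threshold selection described above can be sketched as a simple sweep over candidate values. This is a minimal illustration and not the code used in the study: the helper names, the candidate grid, and the toy scores are all assumptions.

```python
def f1_score(y_true, y_pred):
    """F1 for binary lists; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def binarize(scores, alpha):
    """Map scores in [0, 1] to 1 (interest) or 0 (dislike)."""
    return [1 if s >= alpha else 0 for s in scores]


def best_threshold(y_true, scores, candidates):
    """Return the candidate with the highest F1 (first one wins on ties)."""
    best_alpha, best_f1 = None, -1.0
    for alpha in candidates:
        f1 = f1_score(y_true, binarize(scores, alpha))
        if f1 > best_f1:
            best_alpha, best_f1 = alpha, f1
    return best_alpha, best_f1


# Toy data, made up for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.35, 0.2, 0.55]
candidates = [round(0.05 * k, 2) for k in range(1, 20)]  # 0.05, 0.10, ..., 0.95
alpha, f1 = best_threshold(y_true, scores, candidates)
```

Raising the threshold removes false positives first and then starts to discard true positives, which is why F1 typically peaks at an intermediate value.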

4. Results and Discussion

This section includes two parts of evaluation: in the first part, MLC algorithms’ results are reviewed, and in the next part, the results of the best MLC algorithm are compared with the CF problem-solving approach.

4.1. MLC Results (Internal Evaluation)

The performance of the MLC algorithms is listed in Table 5. The three columns represent the precision, recall, and F1 score; the higher the value, the better the result. For MLC transformation algorithms, we use the form “algorithm-classifier” to denote a model; for example, BR-NB denotes the use of binary relevance (BR) as the transformation algorithm and Naïve Bayes as the classifier. Similarly, for ensemble methods, we use the form “ensemble-classifier.” The results demonstrate that the proposed method can capture and predict users’ interests with just a few explicit inputs.

According to the metric results, the BR algorithms perform best, while the LP algorithm shows lower metric values than the others. The closeness of the precision and recall scores for most classifiers suggests that class imbalance has not affected the learning process. Since there are 18 tourism activities, BR transforms the problem into 18 separate binary problems, regardless of the interdependence of labels. The satisfactory performance of this approach indicates that the identified activities are distinct from each other and that the algorithm maps the category layer to the activity layer well. For example, the RF classifier with the BR algorithm showed better results than the other transformation algorithms with the RF classifier.
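To make the BR decomposition concrete, the following sketch trains one independent Naïve Bayes classifier per activity on discrete category ratings. It is only an illustration under stated assumptions: the text does not specify the Naïve Bayes variant, so a categorical model with Laplace smoothing is assumed here, the class names are hypothetical, and the toy data (two categories, three activities) are made up.

```python
import math
from collections import defaultdict


class CategoricalNB:
    """Naive Bayes for one binary label over discrete features (assumed variant)."""

    def __init__(self, n_values, smoothing=1.0):
        self.n_values = n_values    # number of distinct values a rating can take
        self.smoothing = smoothing  # Laplace smoothing constant

    def fit(self, X, y):
        self.class_count = {0: 0, 1: 0}
        # counts[c][j][v] = number of class-c rows whose feature j equals v
        self.counts = {c: defaultdict(lambda: defaultdict(int)) for c in (0, 1)}
        for row, label in zip(X, y):
            self.class_count[label] += 1
            for j, v in enumerate(row):
                self.counts[label][j][v] += 1
        return self

    def _log_posterior(self, row, c):
        n = sum(self.class_count.values())
        lp = math.log((self.class_count[c] + self.smoothing) / (n + 2 * self.smoothing))
        for j, v in enumerate(row):
            num = self.counts[c][j][v] + self.smoothing
            den = self.class_count[c] + self.smoothing * self.n_values
            lp += math.log(num / den)
        return lp

    def predict_one(self, row):
        return 1 if self._log_posterior(row, 1) > self._log_posterior(row, 0) else 0


class BinaryRelevance:
    """Fit one independent binary classifier per label, ignoring label dependence."""

    def __init__(self, make_classifier):
        self.make_classifier = make_classifier

    def fit(self, X, Y):
        n_labels = len(Y[0])
        self.models = [
            self.make_classifier().fit(X, [row[k] for row in Y])
            for k in range(n_labels)
        ]
        return self

    def predict(self, X):
        return [[m.predict_one(row) for m in self.models] for row in X]


# Toy data: ratings (1 to 5) of 2 categories, interest (0/1) in 3 activities.
X = [[5, 1], [4, 1], [5, 2], [1, 5], [2, 4], [1, 4]]
Y = [[1, 1, 0], [1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 1], [0, 1, 1]]
br_nb = BinaryRelevance(lambda: CategoricalNB(n_values=5)).fit(X, Y)
preds = br_nb.predict([[5, 1], [1, 5]])
```

With seven categories and 18 activities, the same structure would hold 18 such classifiers, each reading the same seven ratings.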

The LP algorithms’ disappointing results, in contrast, are due to their problem-solving approach. LP converts the MLC problem into a multiclass classification problem with up to $2^{18}$ possible class values. Since the dataset used in this study has few records and many labels, it is not easy to train such algorithms on it. The algorithms’ outcomes and relative ranking may nevertheless change as the amount of data increases.

A high recall score reflects the classifier’s success in identifying and proposing the user’s favorite activities: the higher this score, the more of the user’s favorite tourism activities the recommender has recommended. Precision, on the other hand, indicates what percentage of the activities recommended to the user were actually the user’s favorite activities. Of course, a precision error can sometimes be welcome; that is, when an activity outside of the user’s favorite activities is recommended and the user’s feedback on it is positive, the error can be seen as a way to prevent overpersonalization. If a recommender offers all the activities to the user, its recall score will be 100%, but its precision score will be low. Conversely, if it suggests only a small number of activities, its recall will be low, and its precision score will be high. In such cases, the F1 score, the harmonic mean of the two metrics, is a valuable criterion for comparing classifier performance.

Based on the F1 score, BR-NB has the best performance among the categories. Its precision, recall, and F1 scores are 0.75, 0.74, and 0.733, respectively. The evaluation results of the two adaptive algorithms, BRKNN and MLKNN, were similar, with neither having a significant advantage over the other; the same holds for the ensemble algorithms.
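The trade-off between precision and recall can be checked with a small per-user computation. The sketch below is illustrative only: the function name is hypothetical, the liked activities are made up, and the per-user set formulation is an assumption, since the text does not state how the metrics are averaged.

```python
def precision_recall_f1(liked, recommended):
    """Precision, recall, and F1 of a recommended set against the liked set."""
    liked, recommended = set(liked), set(recommended)
    hits = len(liked & recommended)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(liked) if liked else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


all_activities = list(range(18))  # 18 activities, indexed 0..17
liked = [0, 3, 5, 7, 11, 16]      # made-up favorites of one user

# Recommending everything: perfect recall, low precision.
p_all, r_all, f_all = precision_recall_f1(liked, all_activities)

# Recommending only two sure hits: perfect precision, low recall.
p_few, r_few, f_few = precision_recall_f1(liked, [0, 3])
```

In this toy case both extremes reach the same F1 score, because the harmonic mean penalizes whichever of the two metrics is low.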

4.1.1. Feature Importance Explanation

As an aid to understanding how MLC predicts users’ interest in particular activities, Table 6 sets out a breakdown of the features’ importance for each activity. Because our best classifier, Naïve Bayes, does not provide feature importances, we chose the BR-RF algorithm for this analysis. Even though the results in Table 4 seem fairly straightforward, our case study concerns Tehran, so some of the feature importance results need to be clarified.

Cultural and fun features play an important role in theater and cinema, but their significance varies slightly. In comparison with cinema, the fun aspect of theater has decreased in importance, while the cultural aspect has increased. In fact, theater is a more cultural experience for individuals than films.

For the museum-visiting activity, aside from the cultural and historical features, the religious feature is also very influential because of the many religious museums in Tehran, such as the Quran Museum.

For activities such as visiting public gardens and parks, in addition to urban-related features, cultural and historical features are also important. Some public gardens in Tehran, such as Negarestan Garden, contain buildings with classical architecture, which gives the gardens cultural and historical significance.

The urban-related feature is the second most effective for all activities such as visiting rural areas, mountaineering, lakes, and rivers. Many tours and tourism categories introduce these activities as weekend tourism activities. As Tehran has easy access to the mentioned points of interest and many of them are within walking distance of residential areas, it is understandable for people to have a similar urban view of such activities.

4.2. BR-NB vs. BSKNN and SVD (External Evaluation)

We utilized fivefold cross validation to evaluate the CF base models. For the adjustment of the hyperparameters, we used the different values examined in other studies and compared the results by repeating the experiment. Finally, we report the highest result for each algorithm as well as the hyperparameters associated with that result. In Table 7, the values of hyperparameters for each model are listed.

Since the output of the CF base algorithms falls between 0 and 1, a threshold was used to convert it to a binary output. Figure 5 shows each algorithm’s F1 score for different α values. As expected, as α increases, the process of recommending activities to the user becomes more rigorous; consequently, the recall score decreases, and the precision score increases. Thus, using the F1 score as the harmonic mean of precision and recall provides a reasonable basis for determining a threshold value.

According to Figure 5, the F1 scores of both the SVD and the BSKNN peak for thresholds in the range of 0.45 to 0.55, and the differences in F1 within this range are negligible for both. Accordingly, a threshold of 0.5 was chosen for both algorithms. The basic models, RP and baseline only, behave differently. The RP predicts labels randomly based on an assumed normal distribution over the data. Its performance changes little up to a threshold of 0.5 and more markedly beyond it, so 0.5 is an appropriate threshold for this model. A threshold value of 0.55 is considered appropriate for the baseline-only model. A positive aspect of this model is that the gap between precision and recall is very small, regardless of the threshold value.

The VAE model is sensitive to threshold changes, and the gap between its precision and recall is considerable. Its F1 score is almost unchanged up to a threshold of 0.4; above this point, however, there is a marked decrease in performance. While adjusting the hyperparameters, we observed that with more training, this model tends to reduce its output values to near zero. Increasing the threshold therefore increases the false negatives, which decreases the recall and leads to a lower F1 score. For this model, the ideal threshold value in this study is 0.4. NeuMF exhibits similar behavior to VAE in its F1 score as the threshold changes. While adjusting the hyperparameters of this model, we observed that its outputs are more stable. In addition, the gap between precision and recall was smaller than in the VAE case. Figure 5 shows that 0.45 is the optimal threshold for this model.

To compare the proposed method with the selected CF models, we selected the BR-NB classifier, which performs best among the MLC algorithms. Figure 6 shows the precision, recall, and F1 scores for each model. The weakest results are obtained for the SVD model, which is even worse than the baseline-only model but still performs much better than random prediction. It appears that the linear model of this algorithm does not adequately describe the data collected in this study. NeuMF was proposed to overcome the shortcomings of linearity in SVD, and its results demonstrate that nonlinear modeling yields superior results to SVD. Of the two neural network-based models considered in this study, NeuMF displays more stable results than VAE and a smaller gap between precision and recall. As explained earlier, we observed that, during training on our data, the neural network-based models tend to push their outputs, which lie between 0 and 1, as close to 0 as possible. Consequently, both models have a high recall for thresholds below 0.5, which contributes to their F1 score. Although the recall is highest for the VAE model, its precision is quite poor; the situation for the NeuMF model is more favorable. Among the CF-based models, the BSKNN model produces relatively better results and the highest F1 score. Moreover, the difference between its precision and recall is negligible. In our opinion, the good performance of this model may be associated with the type of data and the manner in which it was collected. Memory-based models are in general very sensitive to outliers; however, collecting the data via a questionnaire allowed us to avoid outliers during the data collection phase. The proposed method with the BR-NB classifier outperformed all of the CF-based models. Even so, it is pertinent to note that the CF system needs only a single step of data collection, without the additional step of tourism category extraction. Yet, the advantage of our method lies in the fact that the user interacts with only seven categories; in other words, we have reduced the problem dimensions by mapping the user’s input to activities.

Another critical point is that we evaluated the CF models with fivefold cross validation. The total number of activities is 18. In each step of cross validation, four folds (nearly 15 rating records per user) are used as training data and one fold (about three rating records per user) as test data. This means that the CF models’ results are based on having about 15 records of a user’s interaction with tourism activities and predicting that user’s interest in nearly three activities. In the proposed method, by contrast, the user’s interest in all 18 tourism activities is predicted from ratings (explicit data) of the seven tourism categories, without any interaction records.
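The record counts quoted above follow from simple arithmetic, sketched below under the assumption that each user’s 18 ratings are spread evenly over the five folds. The variable names are illustrative.

```python
n_activities = 18  # activities rated by each user
n_folds = 5        # fivefold cross validation
n_categories = 7   # inputs required by the proposed method

# Average number of a user's ratings held out for testing in each step.
test_per_user = n_activities / n_folds          # 3.6, i.e. three to four

# Average number of a user's ratings available for training in each step.
train_per_user = n_activities - test_per_user   # 14.4, i.e. nearly 15

# Inputs needed per predicted activity under each setting.
cf_inputs_per_prediction = train_per_user / test_per_user  # 4.0
mlc_inputs_per_prediction = n_categories / n_activities    # about 0.39
```

Under this assumption, the CF setting uses roughly four known ratings for every activity it predicts, whereas the proposed method uses fewer than one category rating per predicted activity.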

5. Conclusion and Future Research

In this study, we present a method of receiving explicit information that addresses the cold-start problem of tourism users without including sensitive or demographic information, and in which the enquired explicit information is data driven. For this purpose, several tourism activities were identified in the city of Tehran. Then, tourism categories were obtained by applying FA to questionnaire data that measured users’ interests in the identified tourism activities. Accordingly, an MLC problem was formulated, in which new users’ ratings of each category represent the explicit input that is mapped to the identified activities. This decoding process by MLC predicts whether or not the user is interested in an activity. We used a second questionnaire to collect the data required for training and testing the MLC algorithms. In the internal evaluation phase, we compared the results of different MLC algorithms, and BR-NB performed best among the classifiers. According to the literature review and the performance of the different MLC algorithms in this study, their performance can vary depending on the problem and the data. In other words, no algorithm is definitively superior. In the external evaluation phase, we compared the best classifier in our proposed method with established CF algorithms as baseline models. Our method outperformed these algorithms, although only by a slight margin. Aside from the somewhat superior metrics, reducing the problem space from 18 activities to seven tourism categories makes profile development easier, because the user does not need to interact with all 18 activities, as would be necessary to develop a profile using the CF method. We found that our proposed method was able to capture and predict users’ interests from a small amount of explicit information provided by new users. By comparison, CF algorithms require more user rating records to achieve a performance close to that of our method.

Data Availability

The questionnaire data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.