Mathematical Problems in Engineering
Volume 2014, Article ID 457197, 10 pages
Research Article

Real-Time Forecast of Tourists Distribution Based on the Improved k-Means Method

Business School, Sichuan University, Chengdu 610064, China

Received 29 March 2014; Revised 1 May 2014; Accepted 15 May 2014; Published 5 June 2014

Academic Editor: Ker-Wei Yu

Copyright © 2014 Peiyu Ren et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Tourist distribution, a vector reflecting the number of tourists at every scenic spot in a given period of time, is the foundation on which a scenic area manager builds a scheduling scheme. This paper offers a model to forecast tourist distribution. First, an analysis of how tourist distribution changes suggests that, when the time and the tourist scale are close, a scenic spot is highly likely to show the same tourist distribution at the next time step. To make a forecast, we therefore only need to search the historical tourist distributions whose time and tourist scale are close to the current ones. Because this search is time-consuming, an improved k-means clustering method is put forward to classify the historical data into several clusters so that little time is needed to find the most similar historical data. Finally, a case study of Jiuzhai Valley illustrates the effectiveness of the forecast model.

1. Introduction

Tourism, which benefits the transportation, accommodation, catering, entertainment, and retailing sectors, has become a prosperous industry in the past few decades [1]. In 1960, international tourist arrivals were only 69.3 million, while in 2010 the number was 935 million, an average growth rate of about 5.3% per year [2]. The UNWTO forecasted that international tourism demand would double by 2020 to 1.6 billion visitors and would generate nearly 2 trillion dollars in economic activity [3]. China's tourism industry has developed faster than that of other countries as the Chinese people's living standards have improved significantly along with the rapid development of the national economy. As long as a scenic spot can successfully attract tourists and expand its bearing capacity, it will bring huge economic benefits. Therefore, in China, especially in the western area, which lacks natural resources, tourism has become a driving force of regional economic development. With the rapid development of the tourism industry and the rapid increase in tourists, scenic areas have become increasingly congested, which raises safety risks and degrades the visitor experience. Congestion can be divided into two categories, true congestion and spurious congestion. The former means that the tourist number is far beyond the capacity of the scenic spot, while the latter means that the tourist number is about to surpass the capacity and, because of scheduling delay, the tourist distribution is unbalanced; namely, some spots are crowded while others are vacant. For true congestion, the capacity should be expanded, while, for spurious congestion, effective scheduling can be used. Nowadays, shunt scheduling has two main problems. The first is delay. Under the existing scheduling mode, a scheduling instruction is issued only after a crowd has formed, so by the time the instruction takes effect the congestion level of the scenic spot has already changed, which greatly reduces its effect. The second is that present scheduling lacks hierarchy, so it fails to control congestion effectively. A good solution is for managers to forecast tourist arrivals accurately and at different resolutions. Real-time forecasts can support scheduling scheme design at different levels for each node and effectively improve the utilization rate of resources. As a result, the real-time forecast of tourist arrivals has become a research focus.

Research on forecasting tourist numbers began in the 1960s, and, in recent years, scholars have made great efforts to improve forecast accuracy. Exponential, ARIMA, and GARCH models were built to produce accurate forecasts (Kim and Ngo [4], 2001; Chen [1], 2011; Law [5, 6], 2000a, 2004; Qu and Zhang [7], 1996). Some scholars hold that nonlinear methods can improve forecasting accuracy because tourist-number data are not linear. Law and Au [8] (1999) built the first neural network model to forecast Japanese demand for travel to Hong Kong, showing that the neural network model performed well in forecasting tourism demand. Since then, neural network models have attracted much attention in forecasting, for example, Cho [9] (2003), Law [10] (2000), Taskaya-Temizel and Casey [11] (2005), and Palmer et al. [12] (2006). Empirical results demonstrated that no single forecasting model produces the best forecast in all situations (Cho [13], 2001). Furthermore, there are no established criteria for choosing the forecasting method that will perform best for a particular tourism demand forecast. For example, Smeral and Wüger [14] (2005) proposed that the ARIMA model performed better than the exponential model; however, Yang et al. [15] (2011) found that the ARIMA model performed worse than the exponential model.

The above analysis shows that many researchers have studied methods to forecast tourist arrivals, but little research addresses real-time forecasting. Qiu et al. [16] analyzed the spatial distribution features of tourist flow. Witt and Song [17] pointed out that tourist distribution is imbalanced in terms of time and space. Yan and Meng [18], Lu et al. [19], and Yan et al. [20] analyzed the temporal and spatial distribution features of tourists. Liang and Bao [21] analyzed the seasonal features of theme park visitors, their influencing factors, and the law of tourist flow fluctuation in busy seasons. These studies focused only on the characteristics of tourist distribution in terms of space and time, so in-depth dynamic forecast research on the nodes in scenic spots is still needed.

Based on an analysis of the tourist number at each scenic node in a scenic spot, we propose a clustering algorithm to search for the nearest historical tourist distribution. The algorithm is then used to forecast the future tourist distribution of a scenic spot. Finally, an empirical study of Jiuzhai Valley is carried out to verify the forecast.

Section 2 describes the problem and analyzes tourist distribution forecast in a scenic spot. In Section 3, an improved k-means algorithm is proposed to solve the forecast model. Section 4 takes Jiuzhai Valley as an example to verify the effectiveness of the tourist distribution forecast model for a scenic spot, and Section 5 draws a conclusion.

2. Problem Analysis

For a scenic spot with only one entrance, one exit, and many scenic nodes, if we can forecast the tourist number of each node at any time, then scheduling for the scenic spot becomes easy and timely. To make this real-time forecast, we need to study the historical tourist distribution data of the scenic spot. Tourist distribution is defined as

$$D_t = (d_{1t}, d_{2t}, \ldots, d_{nt}). \tag{1}$$

In (1), the tourist distribution $D_t$ at time $t$ is an $n$-dimension vector whose component $d_{it}$ is the tourist number of scenic node $i$, $i = 1, 2, \ldots, n$.

If the time and the tourist scale are close, a scenic spot is highly likely to show the same tourist distribution at the next time step. For example, let $t_1$ and $t_2$ be similar times on different days, so that the difference between their times of day is small. If their respective tourist distributions $D_{t_1}$ and $D_{t_2}$ are very similar and the tourist scales $s_{t_1}$ and $s_{t_2}$ are also close, then their next-time tourist distributions $D_{t_1+1}$ and $D_{t_2+1}$ are highly likely to be similar. So if, at time $t$, we want to forecast the tourist distribution $D_{t+1}$, we only need to find a historical tourist distribution $D_{t'}$ that is close to $D_t$; its next-time tourist distribution $D_{t'+1}$ can then be taken as the forecast value of $D_{t+1}$.

However, as the purpose is to make a real-time forecast for each minute and a scenic spot is generally open for more than 10 hours, which means about 600 data points a day, finding the similar tourist distribution $D_{t'}$ is time-consuming. To make the search efficient, we can classify the historical data into several clusters with a clustering algorithm and then judge which cluster the current tourist distribution $D_t$ belongs to. Thanks to the clusters, we can find the similar tourist distribution $D_{t'}$ within time $T/k$ on average ($k$ is the number of clusters and $T$ is the time consumed to find $D_{t'}$ without clustering).
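The saving from clustering can be illustrated with a rough count of distance comparisons. The sketch below is only a back-of-the-envelope model: it assumes equally sized clusters and also counts the comparisons against the cluster centroids, which the estimate in the text neglects. The function names and the sample sizes (13 days of about 600 one-minute samples, 9 clusters) are hypothetical values chosen for illustration.

```python
def comparisons_without_clustering(m: int) -> float:
    """Brute-force search compares the query with all m historical samples."""
    return float(m)


def comparisons_with_clustering(m: int, k: int) -> float:
    """Assumed cost model: k centroid comparisons, then the m / k samples of
    one cluster (clusters are assumed to be equally sized)."""
    return k + m / k


# Hypothetical sizes: 13 days of about 600 one-minute samples, 9 clusters.
m, k = 13 * 600, 9
brute = comparisons_without_clustering(m)
clustered = comparisons_with_clustering(m, k)
print(f"without clustering: {brute:.0f} comparisons")
print(f"with clustering:    {clustered:.0f} comparisons")
```

Under these assumptions the clustered search needs roughly one ninth of the comparisons, in line with the average saving described above.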

3. Modeling

3.1. Principles of Forecast

From the above analysis, to realize the real-time tourist distribution forecast, we should first classify the tourist distribution samples into several clusters under the principle that objects in the same cluster are similar in tourist distribution, in the time of that tourist distribution, and in the tourist scale at that time. We therefore add two elements (the time of the tourist distribution and the tourist scale at that time) to the tourist distribution $D_t$, and the new vector is defined as the time-scale distribution of the tourist distribution $D_t$.

So let $X_t = (t, d_{1t}, \ldots, d_{nt}, s_t)$ be the time-scale distribution of the tourist distribution $D_t$, in which $t$ represents the time of the tourist distribution $D_t$, $s_t$ represents the tourist scale, and $d_{it}$ represents the tourist number of node $i$. So the time-scale distribution $X_t$ and the tourist distribution $D_t$ have the following connection:

$$X_t = (t, D_t, s_t).$$

The time-scale distributions $X_t$ can be classified into many clusters by the clustering algorithm. In the same cluster, the tourist distribution $D_t$, the time of the tourist distribution, and the tourist scale are similar. So, if a certain datum $X_t$ is judged to belong to cluster $C_j$, then the next-time tourist distribution $D_{t+1}$ can be estimated ($t$ is the first element of $X_t$, so it refers to the time), because a similar distribution is highly likely to jump to a similar cluster at the next time step.
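As a minimal sketch of the construction described above, the following Python function assembles a time-scale vector with the time first, the node counts in the middle, and the tourist scale last. The function name `time_scale_vector` and the sample numbers are hypothetical; the tourist scale is passed in explicitly because the text does not state how it is measured.

```python
from typing import List, Sequence


def time_scale_vector(t: float, distribution: Sequence[float], scale: float) -> List[float]:
    """Return (t, d_1, ..., d_n, s): time first, tourist scale last."""
    return [float(t), *(float(d) for d in distribution), float(scale)]


# Hypothetical sample: minute 661 of the day, three scenic nodes.
x = time_scale_vector(661, [120, 80, 45], scale=245)
print(x)
```

Keeping time and scale in the same vector as the node counts lets a single distance function compare all three aspects at once, which is what the clustering step relies on.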

For example, in Figure 1, the tourist distribution data have three clusters $C_1$, $C_2$, and $C_3$. Any distribution $D_t$ that belongs to cluster $C_1$ can change in one of three ways. First, the next-time distribution $D_{t+1}$ may remain in the same cluster as $D_t$. Second, $D_{t+1}$ may jump to cluster $C_2$. Third, $D_{t+1}$ may jump to cluster $C_3$.

Figure 1: Changing mechanism of tourist distribution.

At time $t$, the tourist distribution is $D_t$, and the sample distribution $D_{t'}$ is the one most similar to $D_t$ in the same cluster. In this situation, the next-time tourist distribution $D_{t+1}$ should be similar to $D_{t'+1}$. So $D_{t'+1}$ can be taken as the forecast value of $D_{t+1}$. The detailed forecast process is shown in Figure 2.

Figure 2: Forecast process.
3.2. k-Means Algorithms

Let $O = \{o_1, o_2, \ldots, o_m\}$ be a set of $m$ data objects, each with $p$ features; these data can also be represented by an $m \times p$ profile data matrix whose $m$ row vectors each have $p$ dimensions and represent one data object. In this paper, a data object represents a time-scale distribution, so $O$ is the dataset of time-scale distributions. The $i$th row vector $o_i$ denotes the $i$th time-scale distribution in the dataset $O$; its first element $o_{i1}$ represents the time $t$, its last element $o_{ip}$ represents the tourist scale, and the other elements represent the tourist distribution at time $t$. Given $O$, the aim of the k-means clustering algorithm is to find a partition $C = \{C_1, C_2, \ldots, C_k\}$ of $k$ groups such that time-scale distributions in the same group are as similar to each other as possible, while time-scale distributions of different groups are as different as possible. The main idea of the k-means clustering algorithm is as follows.

Firstly, randomly select $k$ initial centroids $c_1, c_2, \ldots, c_k$ and then assign each time-scale distribution $o_i$ to its closest centroid $c_j$, $j = 1, 2, \ldots, k$. A given time-scale distribution $o_i$ is assigned to centroid $c_j$ if and only if

$$d(o_i, c_j) = \min_{1 \le l \le k} d(o_i, c_l),$$

where $d(o_i, c_j)$ is the distance between object $o_i$ and centroid $c_j$. The distance can be calculated by

$$d(o_i, c_j) = \sqrt{\sum_{q=1}^{p} (o_{iq} - c_{jq})^2},$$

where $o_{iq}$ and $c_{jq}$ denote the $q$th component or feature of the corresponding data object.

Secondly, generate the new centroids by calculating the mean of the objects assigned to each cluster. The new centroids are

$$c_j = \frac{1}{|C_j|} \sum_{o_i \in C_j} o_i,$$

where $|C_j|$ denotes the number of objects belonging to cluster $C_j$, $j = 1, 2, \ldots, k$.

Then repeat the above steps until the centroids remain unchanged. The k-means clustering algorithm is shown in Algorithm 1.

Algorithm 1: K-means clustering: .
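The assignment and update steps of Section 3.2 can be sketched in plain Python. This is an illustrative sketch rather than the authors' Algorithm 1: it assumes Euclidean distance, and the names `k_means`, `euclidean`, and `mean_vector`, the `seed` and `max_iter` parameters, and the rule of keeping the old centroid when a cluster becomes empty are all assumptions introduced here.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(points: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def k_means(objects: List[Vector], k: int, seed: int = 0,
            max_iter: int = 100) -> Tuple[List[List[float]], List[int]]:
    """Return (centroids, labels) after alternating assignment and update."""
    rng = random.Random(seed)
    centroids = [list(o) for o in rng.sample(objects, k)]  # random initial centroids
    labels = [0] * len(objects)
    for _ in range(max_iter):
        # Assignment step: each object goes to its closest centroid.
        labels = [min(range(k), key=lambda j: euclidean(o, centroids[j]))
                  for o in objects]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [o for o, lab in zip(objects, labels) if lab == j]
            new_centroids.append(mean_vector(members) if members else centroids[j])
        if new_centroids == centroids:  # centroids unchanged: converged
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical toy data: two well-separated groups of two points each.
objects = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, labels = k_means(objects, k=2)
print(centroids, labels)
```

On this toy data the two groups are recovered whichever initial centroids are drawn, but in general the result of k-means depends on the initial centroids and on the chosen `k`.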

3.3. The Improved k-Means Algorithms

In the forecast of tourist distribution, the cluster number $k$, which is the key to the k-means clustering algorithm, has an impact on forecast accuracy. However, the k-means clustering algorithm cannot determine the cluster number $k$ by itself. To solve this problem, we propose an improved k-means algorithm that determines the cluster number $k$ so as to make a better forecast.

3.3.1. Select the Two Optimal Centroids

To make sure that objects in the same group are as similar to each other as possible and objects of different groups differ as much as possible, we should select as the two initial centroids the two objects whose distance is the farthest among all pairs of objects. These two optimal centroids can be obtained through the method below.
(1) Randomly select an object $o_a$ from the object data $O$, and then calculate the distance $d(o_a, o_i)$ between object $o_a$ and every other object $o_i$.
(2) Select the object $o_b$ that is farthest from $o_a$.
(3) Calculate the distance between object $o_b$ and every other object, and then get the object $o_c$ that is farthest from $o_b$.
(4) $o_a$ and $o_b$ are the two optimal centroids if $o_c = o_a$ or $d(o_b, o_c) = d(o_a, o_b)$; else, $o_b$ and $o_c$ are the two optimal centroids.
(5) $c_1$ and $c_2$ represent the two optimal centroids.
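The seeding procedure above can be sketched as follows. The condition in step (4) is not fully legible in the text, so this sketch assumes one natural reading: keep the first pair when the second search finds nothing farther away, and otherwise take the second pair. Euclidean distance and the names `two_initial_centroids` and `farthest_from` are also assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_from(objects: List[Vector], ref: Vector) -> Vector:
    """Object with the largest distance to ref."""
    return max(objects, key=lambda o: euclidean(o, ref))


def two_initial_centroids(objects: List[Vector], seed: int = 0) -> Tuple[Vector, Vector]:
    rng = random.Random(seed)
    a = rng.choice(objects)        # step (1): random start object
    b = farthest_from(objects, a)  # step (2): farthest object from a
    c = farthest_from(objects, b)  # step (3): farthest object from b
    # Step (4), assumed reading: if c is no farther from b than a is,
    # keep (a, b); otherwise (b, c) is the more distant pair.
    if euclidean(b, c) <= euclidean(a, b):
        return a, b
    return b, c


# Hypothetical one-dimensional toy data.
points = [(0,), (1,), (4,), (9,)]
seeds = two_initial_centroids(points)
print(seeds)
```

For this toy data every possible start object leads to the two extreme points. In general the double farthest-point search is a heuristic and is not guaranteed to return the most distant pair.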

3.3.2. Confirmation of the Cluster Number

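After obtaining the two centroids, we need to determine the cluster number $k$. The method is stated as follows.
(1) Select a new centroid. By calculating the distance between each of the $k$ existing centroids ($k$ is the current cluster number) and every other object $o_i$, we can get the minimum distance $d_{\min}(o_i) = \min_{1 \le j \le k} d(o_i, c_j)$ of each object; object $o_i$ may be taken as a new centroid (named $c_{k+1}$) if $d_{\min}(o_i) = \max_{l} d_{\min}(o_l)$.
(2) Judge whether the new centroid is good or not. By assigning every other object to its closest centroid, using the assignment condition and the distance function of Section 3.2, the $k+1$ clusters can be obtained. The new centroids can be calculated by

$$c_{jq} = \frac{1}{|C_j|} \sum_{o_i \in C_j} o_{iq},$$

where $c_{jq}$ represents the $q$th feature of the centroid of cluster $C_j$.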
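The candidate selection in the first step can be sketched as below. The exact acceptance condition is not legible in the text, so the sketch assumes the common farthest-point rule: the candidate is the object whose distance to its nearest existing centroid is largest. The function name `next_centroid_candidate` and the use of Euclidean distance are assumptions, and the evaluation index used to accept or reject the candidate is not implemented here.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def next_centroid_candidate(objects: List[Vector], centroids: List[Vector]) -> Vector:
    """Object that is farthest from its nearest existing centroid (assumed rule)."""
    def nearest_centroid_distance(o: Vector) -> float:
        return min(euclidean(o, c) for c in centroids)
    return max(objects, key=nearest_centroid_distance)


# Hypothetical toy data with two existing centroids at the extremes.
points = [(0,), (1,), (4,), (9,)]
candidate = next_centroid_candidate(points, centroids=[(0,), (9,)])
print(candidate)
```

Choosing the most isolated object as the next candidate keeps new centroids away from clusters that are already well represented.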

Definition 1. Let be the intercluster distance, which represents the distance between the th cluster and the th cluster . Thus, it can be calculated by

Definition 2. Let be the intracluster distance which represents the average distance among the objects ; then it can be calculated by

Definition 3. Let be evaluation index which indicates whether the new centroid is necessary; then is defined as

A better cluster algorithm must conform with two conditions. Firstly, objects in the same cluster should be as similar as possible which means intracluster distance must be as small as possible. Secondly, objects in different clusters should be as different as possible which means intercluster distance must be as big as possible. In (7), we can find that increases while both and decrease. So can be regarded as the evaluation index to reflect whether creating a new centroid is reasonable. In this case when centroid has been created, we just need to compare with . If , then the new centroid is reasonable, and a new cluster by above method. Else, the new centroid is unreasonable and the optimal .

The improved k-means algorithm is based on the k-means algorithm but does not fix the cluster number $k$ in advance; instead, $k$ is determined by the data. So the improved k-means algorithm can perform better than the k-means algorithm. The comprehensive description of the improved k-means algorithm is shown in Algorithm 2.

Algorithm 2: Improved K-means clustering: IKM(O).

3.4. Forecast Method

In order to forecast tourist distribution, we should judge which cluster the current tourist distribution’s time-scale distribution belongs to and then find out the most similar time-scale distribution in that cluster. Therefore, we can take next-time tourist distribution of the similar tourist distribution as the forecasting value for the next minute. The detailed forecast process is shown in Figure 2.
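The forecast step can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: distances are Euclidean, the historical data are held as pairs of a time-scale vector and the tourist distribution observed one step later, and the names `forecast_next`, `history`, and `labels` are hypothetical. If the chosen cluster has no historical member, the sketch falls back to searching all samples.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def forecast_next(current: Vector,
                  history: List[Tuple[Vector, Vector]],
                  centroids: List[Vector],
                  labels: List[int]) -> List[float]:
    """history[i] = (time-scale vector, next-time distribution);
    labels[i] is the cluster index of history[i]."""
    # 1. Judge which cluster the current time-scale vector belongs to.
    cluster = min(range(len(centroids)),
                  key=lambda j: euclidean(current, centroids[j]))
    # 2. Search only the historical samples of that cluster.
    candidates = [h for h, lab in zip(history, labels) if lab == cluster]
    if not candidates:  # assumed fallback: search everything
        candidates = history
    # 3. The next-time distribution of the most similar sample is the forecast.
    _, next_distribution = min(candidates, key=lambda h: euclidean(current, h[0]))
    return list(next_distribution)


# Hypothetical data: vectors are (time, node 1, node 2, scale).
history = [((600, 10, 20, 30), (12, 22)),
           ((603, 14, 18, 32), (15, 19)),
           ((605, 50, 60, 110), (55, 58))]
centroids = [(601, 10, 20, 30), (605, 50, 60, 110)]
labels = [0, 0, 1]
forecast = forecast_next((602, 11, 21, 32), history, centroids, labels)
print(forecast)
```

Because the time, the node counts, and the scale share one distance, a feature with a large numeric range can dominate the comparison; whether and how the features are rescaled is not stated in the text.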

4. Empirical Analysis

4.1. Jiuzhaigou Introduction and Errors Evaluation Method

Jiuzhai Valley scenic spot is located in western China and is acclaimed as a world-class natural scenic spot. The tourism industry there has greatly promoted the economic development of western China. During statutory holidays, such as the Chinese National Day Golden Week, the number of tourists can reach 20,000 a day. In the past two years, as people's living standards have improved and governments have vigorously supported tourism, the tourist number has increased greatly. For example, during the National Day Golden Week in 2012, the tourist number in Jiuzhai Valley was more than 40,000 per day. The peak season poses a great challenge to tourism management, so it is important to forecast tourist distribution.

Jiuzhai Valley Scenic Spot Administration has carried out the Digital Jiuzhai Valley Comprehensive Demonstration Project (Digital Jiuzhai Valley for short). Technologies such as RFID and GIS are widely applied in the scenic spot, and digitization systems such as the Access Control System and RFID card readers have been deployed at each attraction so that all kinds of data can be collected. In this paper, Jiuzhai Valley is taken as an example to test the forecast model.

In order to evaluate the effectiveness of the forecast model, this paper takes mean absolute percentage error (MAPE) as evaluation statistics, and the calculating method is as follows:

As MAPE is the mean value of all times forecast errors, we should analyze each times error to find the major characteristics of the errors. We set relative error and its calculating method is as follows: where represents forecasting result of the tourist distribution and the truth tourists distribution at time .

Through (10), it is clear that if the forecast value is closer to the real value, then the result of the equation will be smaller. So the equation can be used to evaluate the effectiveness of the forecast model.
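The two error measures can be sketched as follows. The exact form of the per-time relative error is not recoverable from the text, so this sketch assumes it is the mean absolute percentage error across nodes at one time; MAPE is then the mean of these per-time errors, as described above. The function names and the sample numbers are hypothetical, and nodes with a true value of zero are skipped to avoid division by zero.

```python
from typing import List, Sequence


def relative_error(forecast: Sequence[float], truth: Sequence[float]) -> float:
    """Assumed form: mean absolute percentage error over nodes at one time."""
    terms = [abs(f - a) / a for f, a in zip(forecast, truth) if a != 0]
    return sum(terms) / len(terms) if terms else 0.0


def mape(forecasts: List[Sequence[float]], truths: List[Sequence[float]]) -> float:
    """Mean of the per-time relative errors."""
    errors = [relative_error(f, a) for f, a in zip(forecasts, truths)]
    return sum(errors) / len(errors)


# Hypothetical forecasts and true values for two times and two nodes.
forecasts = [[110, 190], [100, 210]]
truths = [[100, 200], [100, 200]]
print(f"MAPE = {mape(forecasts, truths):.2%}")
```

Reporting the per-time errors alongside their mean, as the paper does, shows whether a low MAPE hides a few large errors.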

4.2. Data

We collected data on Jiuzhai Valley's tourist distribution and the corresponding time-scale distribution from May 1 to May 7 in 2012 and in 2013. The data of May 7, 2013, are used to test the accuracy of the forecast model, and the other data are taken as training data to obtain the clusters and their centroids with the improved k-means clustering algorithm.

4.3. Clustering

With the help of the improved k-means clustering algorithm, we obtained 9 clusters; their corresponding centroids are shown in Table 1.

Table 1: The centroids .
4.4. Forecast

Based on the forecast method proposed in this paper, we have forecast the tourist distribution of the 120 minutes (from 11:01 to 13:00) on May 7, 2013 (see Figures 3, 4, 5, 6, 7, 8, 9, and 10), and then obtained the MAPE = 5.32% by (10). In order to analyze the errors at different times, is calculated by (11) (see Figure 11).

Figure 3: Truth value and forecast value from 11:01 to 11:15 on May 7, 2013.
Figure 4: Truth value and forecast value from 11:16 to 11:30 on May 7, 2013.
Figure 5: Truth value and forecast value from 11:31 to 11:45 on May 7, 2013.
Figure 6: Truth value and forecast value from 11:46 to 12:00 on May 7, 2013.
Figure 7: Truth value and forecast value from 12:01 to 12:15 on May 7, 2013.
Figure 8: Truth value and forecast value from 12:16 to 12:30 on May 7, 2013.
Figure 9: Truth value and forecast value from 12:31 to 12:45 on May 7, 2013.
Figure 10: Truth value and forecast value from 12:46 to 13:00 on May 7, 2013.
Figure 11: Forecast relative errors from 11:01 to 13:00 on May 7, 2013.

From Figure 11, we can find that 4 forecast errors are less than 3%, accounting for 2.9% of the total; 27 are more than 3% but less than 4%, accounting for 22.5% of the total; 20 are more than 4% but less than 5%, accounting for 16.7% of the total; 26 are more than 5% but less than 6%, accounting for 21.7% of the total; 16 are more than 6% but less than 7%, accounting for 13.3% of the total; 16 are more than 7% but less than 8%, accounting for 13.3% of the total; 7 are more than 8% but less than 9%, accounting for 5.8% of the total; and 3 (10.05%, 11.54%, and 10.22% at 11:35, 11:38, and 11:50, resp.) are more than 10%, accounting for 2.5% of the total. So we can come to the conclusion that the forecast model is effective.

5. Conclusion

Based on the improved k-means clustering algorithm, a forecast model is put forward to forecast the tourist distribution, which serves as the foundation for scheduling. Firstly, with the improved k-means clustering algorithm, the time-scale distribution data are classified into several clusters, and we then judge which cluster the current time-scale distribution $X_t$ belongs to. Because the tourist distribution $D_t$ consists of elements of $X_t$, whose first element is the time $t$, $D_t$ can be assigned to the same cluster as its time-scale distribution $X_t$. Secondly, the historical tourist distribution $D_{t'}$ most similar to $D_t$ is searched for within that cluster, and its next-time tourist distribution $D_{t'+1}$ is taken as the forecast value of the tourist distribution $D_{t+1}$. Finally, the empirical study in Jiuzhai Valley illustrates that the forecast model is effective.

Conflict of Interests

The authors (Peiyu Ren, Zhixue Liao, and Peng Ge) declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgments

This work was supported by the Major International Joint Research Programme of the National Natural Science Foundation of China (Grant no. 71020107027), the National High Technology Research and Development Major Programme of China (863 Programme) (Grant no. 2008AA04A107), the National Natural Science Foundation of China (Grant no. 71001075), the National Natural Science Foundation of China (Grant no. 71371130), the China Postdoctoral Science Foundation (Grant no. 2012M521704), Doctoral Fund of Ministry of Education of China (20110181110034), 985 and 211 Projects of Sichuan University, and Central University Fund of Sichuan University.


References

  1. K. Y. Chen, “Combining linear and nonlinear model in forecasting tourism demand,” Expert Systems with Applications, vol. 38, no. 8, pp. 10368–10376, 2011.
  2. H. Y. Song and R. J. Hyndman, “Tourism forecasting: an introduction,” International Journal of Forecasting, vol. 27, no. 3, pp. 817–821, 2011.
  3. S. E. Levy and D. E. Hawkins, “Peace through tourism: commerce based principles and practices,” Journal of Business Ethics, vol. 89, no. 4, pp. 569–585, 2009.
  4. J. H. Kim and M. T. Ngo, “Modelling and forecasting monthly airline passenger flows among three major Australian cities,” Tourism Economics, vol. 7, no. 4, pp. 397–412, 2001.
  5. R. Law, “Demand for hotel spending by visitors to Hong Kong: a study of various forecasting techniques,” Journal of Hospitality and Leisure Marketing, vol. 6, pp. 17–29, 2000.
  6. R. Law, “Initially testing an improved extrapolative hotel room occupancy rate forecasting technique,” Journal of Travel and Tourism Marketing, vol. 16, pp. 71–77, 2004.
  7. H. Qu and H. Q. Zhang, “Projecting international tourist arrivals in East Asia and the Pacific to the year 2005,” Journal of Travel Research, vol. 35, no. 1, pp. 27–34, 1996.
  8. R. Law and N. Au, “A neural network model to forecast Japanese demand for travel to Hong Kong,” Tourism Management, vol. 20, no. 1, pp. 89–97, 1999.
  9. V. Cho, “A comparison of three different approaches to tourist arrival forecasting,” Tourism Management, vol. 24, no. 3, pp. 323–330, 2003.
  10. R. Law, “Back-propagation learning in improving the accuracy of neural network-based tourism demand forecasting,” Tourism Management, vol. 21, no. 4, pp. 331–340, 2000.
  11. T. Taskaya-Temizel and M. C. Casey, “A comparative study of autoregressive neural network hybrids,” Neural Networks, vol. 18, no. 5-6, pp. 781–789, 2005.
  12. A. Palmer, J. J. Montaño, and A. Sesé, “Designing an artificial neural network for forecasting tourism time series,” Tourism Management, vol. 27, no. 5, pp. 781–790, 2006.
  13. V. Cho, “Tourism forecasting and its relationship with leading economic indicators,” Journal of Hospitality and Tourism Research, vol. 25, pp. 399–420, 2001.
  14. E. Smeral and M. Wüger, “Does complexity matter? Methods for improving forecasting accuracy in tourism: the case of Austria,” Journal of Travel Research, vol. 44, no. 1, pp. 100–110, 2005.
  15. X. Z. Yang, C. L. Gu, and Q. Wang, “Study on the driving force of tourist flows,” Geographical Research, vol. 30, pp. 23–36, 2011.
  16. Y. Q. Qiu, P. Ge, and P. Y. Ren, “A study on temporal and spatial navigation based on the load-Balance of tourists in Jiuzhaigou Valley,” Resources Science, vol. 32, pp. 118–123, 2010.
  17. S. F. Witt and H. Song, “Forecasting tourism flows,” in Tourism and Hospitality in the 21st Century, vol. 3, pp. 106–118, 2002.
  18. F. Yan and J. D. Meng, “The application of logistic growth model in forecast of tourists amount: a case study of Suiyang, Guizhou Province,” Human Geography, vol. 20, pp. 87–91, 2005.
  19. S. Lu, L. Lu, and L. Wang, “Temporal characteristics of tourist flows to ancient villages: a case study of two world cultural heritages, Xidi Village and Hongcun Village,” Geographica Sinica, vol. 24, pp. 250–256, 2004.
  20. L. Yan, X. G. Xu, and X. P. Zhang, “Analysis to temporal characteristics of tourist flows on Jiuzhaigou world natural heritage,” Acta Scientiarum Naturalium Universitatis Pekinensis, vol. 45, no. 1, pp. 171–177, 2009.
  21. Z. X. Liang and J. G. Bao, “A seasonal study on tourist flows in theme parks during golden weeks: a case of theme parks in Shenzhen Overseas Chinese Town,” Tourism Tribune, vol. 27, pp. 58–65, 2012.