Accurate taxi demand prediction can help relieve the congestion caused by the supply-demand imbalance. However, most taxi demand studies are based only on historical taxi trajectory data. In this study, we detected hotspots and proposed three methods to predict the taxi demand in hotspots. We compared the predictive performance of the random forest model (RFM), ridge regression model (RRM), and combination forecasting model (CFM), and we considered environmental and meteorological factors when predicting the taxi demand in hotspots. Finally, the importance of the indicators was analyzed; the essential elements were the time, temperature, and weather factors. The results indicate that the prediction performance of CFM is better than those of RFM and RRM. The experiment establishes the relationship between taxi demand and the environment and is helpful for taxi dispatching because it considers additional factors, such as temperature and weather.

1. Introduction

Taxis are an essential part of urban public transportation, and taxi demand differs from other travel demand because of its stochastic trajectories and its dependence on spatial location [1, 2]. However, the imbalance between the supply and demand of taxis is particularly severe because information is unevenly distributed between drivers and passengers [3]. Taxi drivers’ customer-searching behavior relies on historical experience, and passengers’ trips are random. This information asymmetry between taxis and passengers wastes limited public resources [4]. Thus, the taxi demand in hotspots should be predicted [5].

Previous studies on taxi demand prediction are generally based on historical taxi trajectory data and have shown the feasibility of obtaining predictions from such data [1, 5–23]. Methods of traffic demand prediction can be classified into three types: linear system theory (such as the autoregressive moving average model [24], Kalman filtering model, and time series model), nonlinear system theory (such as the neural network model, gray prediction model, and random forest model (RFM)), and the combination forecasting model (CFM). The first application of the time series prediction model in traffic prediction research modeled univariate traffic flow data as seasonal autoregressive integrated moving average processes [25]. Shekhar used the Kalman filter model to study univariate traffic condition predictions [2]. Alvarez-Garcia et al. proposed a system based on the hidden Markov model to predict taxi trip destinations [26]. Chang et al. mined historical taxi trajectory data and predicted the temporal and spatial distributions of taxi demand [9]. Moreira-Matias et al. introduced a new method for using traffic flow data to predict the spatial distribution of taxi passengers in the short term. A CFM combining three time series prediction methods that can effectively determine the spatiotemporal distribution of taxi passenger demand was proposed [17]. Lv et al. proposed a traffic flow prediction method based on deep learning considering spatiotemporal correlation and used an autoencoder model to learn traffic flow characteristics [27]. Zhang et al. proposed an adaptive prediction method to predict a hotspot location and its heat [22]. Zhao et al. implemented and compared three predictors for predictive algorithms that determine maximum predictability: Markov, Lempel–Ziv–Welch, and neural network predictors [13]. Davis used a time series model to predict taxi travel demand based on mobile app taxi services [28]. Zhao et al. proposed a new prediction model based on long short-term memory (LSTM) networks. The proposed LSTM network considered the spatiotemporal correlation in traffic systems [29]. Zhang et al. proposed a Dmodel based on the hidden Markov chain model for taxi prediction [21]. Yu et al. proposed a spatiotemporal recurrent convolutional network for traffic volume prediction based on the deep convolutional neural network [30]. Ou et al. proposed a method of combining the bias-corrected random forest algorithm with a data-driven feature selection strategy for short-term urban traffic flow prediction to solve the problem of unreasonable feature selection [31]. Yao et al. proposed a deep multiview spatiotemporal network framework to simulate spatiotemporal relationships based on traffic prediction models [32]. Bao et al. considered the interaction between subways and taxis based on univariate traffic prediction and applied the residual neural network to predict the taxi demand in different regions [6]. Ishiguro et al. proposed a taxi demand prediction algorithm using real-time demographic data generated by cellular networks and used a stacked denoising autoencoder to assess the impact of real-time demographic data on taxi demand prediction accuracy [12]. Markou et al. considered the information provided by unstructured data while using taxi GPS data and used machine learning techniques to predict taxi demand [11]. Xu et al. argued that the occurrence of taxi requests is related to historical traffic behavior and proposed an LSTM model that can predict taxi requests for each region of the city based on historical demand and other relevant information [19]. Past research has mostly focused on pickup points. Rodrigues et al. considered drop-off points and combined the time correlation with the spatial correlation to predict the taxi demand with an LSTM method [18]. Kuang et al. proposed two deep learning methods that combine unstructured textual information with historical taxi trip data for traffic demand prediction research [15]. Furthermore, Castro et al. conducted a review of studies on traffic GPS data and proposed a new direction based on GPS data [33].

Previous works have focused on mining the regularity of trajectory data to predict the traffic demand, but environmental data have been ignored. Furthermore, methods that combine linear and nonlinear system theory have rarely been proposed. This study aims to explore a prediction method that combines RFM and RRM for predicting taxi demand in hotspots while also considering environmental data. First, the method identifies the taxi demand hotspots in the city. Then, we predict taxi demand at various time periods using the RFM and RRM [34]. Next, we propose a CFM that combines the RFM and RRM. The forecasting method considers both environmental data and historical taxi trajectory data. This study can help traffic management authorities rebalance taxis.

The paper is organized as follows: Section 1 describes the importance of taxi demand prediction and reviews related research; Section 2 describes the data used in this study; Section 3 describes the methods; Section 4 describes the data processing; Section 5 presents and discusses the results of the experiment; and Section 6 gives the conclusions and future research.

2. Data

2.1. GPS Data

GPS data are from the Xi’an Taxi Management Office and consist of vehicle location data that are recorded every 5 s for 30 days. The dataset consists of 40 million track points. The GPS data have undergone extensive cleaning, and only error-free trip strings are used in this research (Figure 1).

2.2. Environmental Data

The purpose of this study is to accurately predict the demand for taxis in hotspots by constructing a set of affecting factors of the taxi demand. Therefore, the impacts of air quality, weather, wind speed, and temperature on demand for taxis are considered. In this study, the influencing factors of taxi demand are constructed on the basis of two types of data: air quality and meteorological data.

The air quality data are derived from the official website of Green Breathing. The detection indicators include various pollutant data, including PM2.5 and PM10, and the air quality level of the day can be defined according to the AQI. The meteorological data are from the National Meteorological Information Center. This study selects the hourly data of Xi’an, including hourly observations of temperature, pressure, humidity, wind speed, and precipitation. The air quality data used in this study have seven dimensions, and the meteorological data have five dimensions (Table 1).

3. Methods

3.1. Random Forest Model

RFM is an ensemble learning algorithm and an extension of bagging [35]. At each node of each decision tree, a subset of feature attributes is randomly selected from the feature attribute set of the node; then, the best feature attribute is selected from the subset for division (Figure 2).

3.2. Ridge Regression Model

RRM is a biased estimation method designed for the analysis of collinear data and is an improved least-squares estimation method. By abandoning the unbiasedness of least-squares estimation and losing part of the information, the regression coefficients become more realistic and reliable. An RRM fits ill-conditioned data more accurately than least-squares estimation.

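Given a dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, the simplest linear regression model defines the loss function as the square of the residual. Then, the optimization objective is expressed as follows:

$$\min_{w} \sum_{i=1}^{n} \left(y_i - w^{T} x_i\right)^2, \tag{1}$$

where $w$ is the vector of regression coefficients, $w^{T} x_i$ is the predicted value, and $y_i$ is the observed value. Formula (1) easily overfits when the sample has many features and the number of samples is relatively small. A regularization term can be added to formula (1). The $L_2$ norm regularization is introduced into the RRM as follows:

$$\min_{w} \sum_{i=1}^{n} \left(y_i - w^{T} x_i\right)^2 + \lambda \lVert w \rVert_2^2, \tag{2}$$

where $\lambda > 0$ is the regularization parameter.

We define $X = (x_1, x_2, \ldots, x_n)^{T}$ and $y = (y_1, y_2, \ldots, y_n)^{T}$. With $I$ denoting the identity matrix, the ridge estimate $\hat{w}(\lambda)$ is

$$\hat{w}(\lambda) = \left(X^{T} X + \lambda I\right)^{-1} X^{T} y. \tag{3}$$

As $\lambda$ increases, the absolute values of the elements in $\hat{w}(\lambda)$ tend to decrease, and their deviation from the correct values increases. When $\lambda$ tends to infinity, $\hat{w}(\lambda)$ tends to 0. The trajectory of $\hat{w}(\lambda)$ as $\lambda$ changes is called the ridge trace. When the ridge trace becomes stable, the corresponding $\lambda$ is taken as the optimal value. In general, the $R^2$ value of the ridge regression equation is slightly lower than that of ordinary least squares, but the significance of the regression coefficients is usually considerably higher.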
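The shrinkage behavior described above can be checked numerically. The following is a minimal sketch, assuming the standard closed-form ridge estimate $(X^{T}X + \lambda I)^{-1}X^{T}y$; the function name `ridge_closed_form` and the synthetic, nearly collinear data are illustrative assumptions and are not taken from the study.

```python
import numpy as np


def ridge_closed_form(X, y, lam):
    """Return the ridge estimate (X^T X + lam * I)^(-1) X^T y."""
    d = X.shape[1]
    # Solve the regularized normal equations rather than forming an explicit inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


# Synthetic data with two nearly collinear features (illustrative only).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=200)

lambdas = (0.0, 1.0, 100.0, 1e6)
norms = []
for lam in lambdas:
    w_hat = ridge_closed_form(X, y, lam)
    norms.append(float(np.linalg.norm(w_hat)))
    print(f"lambda={lam:g}  w_hat={w_hat}  norm={norms[-1]:.4f}")
```

Printing the coefficients for an increasing sequence of $\lambda$ values is a crude form of the ridge trace: the coefficient norm shrinks toward zero as $\lambda$ grows.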

3.3. Combination Forecasting Model

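CFM can solve special prediction problems in research by combining the characteristics of different models. The calculation can be expressed as

$$\hat{y}_{C} = \omega_1 \hat{y}_{1} + \omega_2 \hat{y}_{2}, \tag{4}$$

where $\hat{y}_{C}$ is the predicted value of the CFM, $\hat{y}_{1}$ is the predicted value of the RRM, $\hat{y}_{2}$ is the predicted value of the RFM, and $\omega_1$ and $\omega_2$ are the weight coefficients of RRM and RFM, respectively.

The core of the CFM is the determination of the weight coefficients $\omega_1$ and $\omega_2$. The inverse-variance weighting method is used to determine the weight coefficients of the CFM. The calculation equations are expressed as follows:

$$\omega_1 = \frac{e_1^{-1}}{e_1^{-1} + e_2^{-1}}, \tag{5}$$

$$\omega_2 = \frac{e_2^{-1}}{e_1^{-1} + e_2^{-1}}. \tag{6}$$

The sum of squared errors of the RRM is expressed as equation (7), and the sum of squared errors of the RFM is expressed as equation (8):

$$e_1 = \sum_{i=1}^{n} \left(y_i - \hat{y}_{1i}\right)^2, \tag{7}$$

$$e_2 = \sum_{i=1}^{n} \left(y_i - \hat{y}_{2i}\right)^2, \tag{8}$$

where $e_1$ represents the sum of squared errors of the RRM, $e_2$ represents the sum of squared errors of the RFM, $y_i$ represents the true value, $\hat{y}_{1i}$ represents the fitted value of the RRM, and $\hat{y}_{2i}$ represents the fitted value of the RFM.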
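As a worked illustration of the weighting step, the sketch below assumes the common inverse-variance form in which each model's weight is proportional to the reciprocal of its sum of squared errors on the training set. The function names and the toy numbers are hypothetical and are not results from this study.

```python
def inverse_sse_weights(y_true, fit_rrm, fit_rfm):
    """Return (weight_rrm, weight_rfm), each proportional to 1 / SSE of the model."""
    sse_rrm = sum((t - f) ** 2 for t, f in zip(y_true, fit_rrm))
    sse_rfm = sum((t - f) ** 2 for t, f in zip(y_true, fit_rfm))
    # Assumption: both SSE values are strictly positive (no perfect fit).
    inv_rrm, inv_rfm = 1.0 / sse_rrm, 1.0 / sse_rfm
    total = inv_rrm + inv_rfm
    return inv_rrm / total, inv_rfm / total


def combine(pred_rrm, pred_rfm, weight_rrm, weight_rfm):
    """Weighted combination of the two model predictions."""
    return [weight_rrm * a + weight_rfm * b for a, b in zip(pred_rrm, pred_rfm)]


# Toy training fits: the SSE of the RRM fit is 4 and the SSE of the RFM fit is 1.
y_train = [10.0, 12.0, 15.0, 11.0]
fit_rrm = [11.0, 13.0, 14.0, 12.0]
fit_rfm = [10.0, 12.0, 14.0, 11.0]

weight_rrm, weight_rfm = inverse_sse_weights(y_train, fit_rrm, fit_rfm)
combined = combine([20.0], [30.0], weight_rrm, weight_rfm)
print(weight_rrm, weight_rfm, combined)
```

The model with the smaller training error receives the larger weight, and the two weights always sum to 1.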

4. Data Processing

4.1. GPS Data Processing

The “STAT” attribute in the taxi GPS data records the driving state of the taxi, in which “4” represents a taxi carrying a passenger and “5” represents an empty taxi. A change from “4” to “5” indicates that the passenger exits the vehicle; this record is marked as a drop-off point D. A change from “5” to “4” indicates that a passenger enters the vehicle; this record is marked as a pickup point O.
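The state-transition rule above can be sketched as follows. The record layout `(timestamp, lon, lat, stat)` and the function name `extract_od_points` are assumptions made for illustration; the actual field layout of the raw GPS data is not given in the text.

```python
OCCUPIED, EMPTY = 4, 5  # STAT codes as described in the text


def extract_od_points(records):
    """Extract pickup (O) and drop-off (D) points from one taxi's track.

    `records` is a time-ordered list of (timestamp, lon, lat, stat) tuples
    (hypothetical layout).
    """
    pickups, dropoffs = [], []
    for prev, curr in zip(records, records[1:]):
        if prev[3] == EMPTY and curr[3] == OCCUPIED:
            pickups.append(curr[:3])   # 5 -> 4: passenger enters, point O
        elif prev[3] == OCCUPIED and curr[3] == EMPTY:
            dropoffs.append(curr[:3])  # 4 -> 5: passenger exits, point D
    return pickups, dropoffs


# Toy track sampled every 5 s (illustrative coordinates).
track = [
    (0, 108.94, 34.26, 5),
    (5, 108.95, 34.26, 5),
    (10, 108.95, 34.27, 4),
    (15, 108.96, 34.27, 4),
    (20, 108.97, 34.28, 5),
]
pickups, dropoffs = extract_od_points(track)
print(pickups, dropoffs)
```

In practice the records would first be grouped by vehicle and sorted by timestamp so that transitions are only detected within a single taxi's track.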

4.2. Feature Selection

Ensuring that the features are independent of one another is difficult because of their large number in the experiment. In the modeling process, two features with a strong correlation tend to introduce multicollinearity in the data. Therefore, the correlation of the experimental data features should be tested. The method chosen in this study is the Pearson correlation analysis, which measures the linear relationship between variables. The calculation is expressed as follows:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y},$$

where $\operatorname{cov}(X, Y)$ represents the covariance between the variables X and Y, $\sigma_X$ and $\sigma_Y$ represent the standard deviations of the variables X and Y, and $\rho_{X,Y}$ represents the correlation coefficient of two continuous variables; the value of $\rho_{X,Y}$ is between −1 and 1. If $\rho_{X,Y} > 0$, then the two variables are positively correlated; if $\rho_{X,Y} < 0$, then the two variables are negatively correlated. A large absolute value of $\rho_{X,Y}$ corresponds to a strong correlation. The corr function of the pandas library in Python is applied to obtain the correlation coefficient matrix (Figure 3).

Figure 3 shows that the correlation among PM2.5, PM10, and AQI is strong. A slight multicollinearity is observed between O3 and TEM (temperature), and a correlation also exists between RHU and TEM. Only indicators with severe multicollinearity are excluded. Thus, the indicators PM2.5 and PM10 are eliminated.
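The correlation screening can be sketched with pandas as follows. The data here are synthetic, and the 0.9 cutoff is a hypothetical threshold chosen for illustration; the study does not state the threshold it used.

```python
import numpy as np
import pandas as pd

# Synthetic indicators: PM10 and AQI are built to track PM2.5 closely.
rng = np.random.default_rng(1)
pm25 = rng.normal(60, 20, size=300)
df = pd.DataFrame({
    "PM2.5": pm25,
    "PM10": 1.4 * pm25 + rng.normal(0, 3, size=300),
    "AQI": 1.2 * pm25 + rng.normal(0, 3, size=300),
    "TEM": rng.normal(15, 5, size=300),
})

corr = df.corr()  # DataFrame.corr computes Pearson correlation by default

threshold = 0.9  # hypothetical cutoff for "severe" multicollinearity
cols = list(corr.columns)
high_pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)

# Mirror the choice made in the text: keep AQI, drop PM2.5 and PM10.
reduced = df.drop(columns=["PM2.5", "PM10"])
print(list(reduced.columns))
```

Which member of a correlated group to keep is a modeling decision; here AQI is retained because it summarizes the pollutant readings.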

Four indicator variables, namely, hour, wdy, week, and holiday, are also added to explore the impact of time, weekday, week, and holiday factors on the taxi demand (Table 2).

4.3. One-Hot Encoding

All data are encoded using the OneHotEncoder class in the sklearn.preprocessing module of the scikit-learn library. The week attribute is taken as an example (Figure 4).
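A minimal sketch of one-hot encoding the week attribute is given below. The coding 1 (Monday) to 7 (Sunday) is an assumption made for illustration; the actual coding used in the dataset is shown only in Figure 4.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical coding: week = 1 (Monday) ... 7 (Sunday), one value per sample.
week = np.array([[1], [2], [7], [1], [5]])

# Listing the categories explicitly guarantees seven output columns
# even when some days are absent from the sample.
encoder = OneHotEncoder(categories=[list(range(1, 8))])

# fit_transform returns a sparse matrix by default; convert it for display.
week_onehot = encoder.fit_transform(week).toarray()
print(week_onehot)
```

Each row contains a single 1 in the column of the corresponding day, so a single categorical column expands into seven binary columns.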

After the one-hot encoding, the data dimension expands to 39. Because the sample size of the dataset is small, the validation and test sets are combined when dividing the dataset. The first 23 days of April 2017 are taken as the training set, and the remaining 7 days are taken as the test set.

5. Results and Discussion

5.1. Extract Hotspots

The kernel density analysis tool of ArcGIS 10.2 is used to analyze the kernel density of the pickup and drop-off positions in three time periods on working days and rest days (Figure 5).

As shown in Figure 5, the taxi demand on working days and nonworking days is mainly distributed along the main roads of Xi’an. The taxi demand at the various peak hours is also distributed along the main roads of Xi’an. The demand-intensive areas of Xi’an are regular and show no visible spatiotemporal variation. Therefore, the heat maps of the 30 days are superimposed (Figure 6).

Hotspots are distributed in areas such as Xi’anbei Railway Station, Bell Tower, Xiaozhai, Railway Station, and City Library. Xi’anbei Railway Station and Railway Station are transportation hubs. Xiaozhai, City Library, and Bell Tower are commercial areas. In this study, two representative areas, namely, Bell Tower and Xi’anbei Railway Station, are selected (Figure 7).

5.2. Random Forest Prediction

The RFM is implemented with the random forest regression provided by the sklearn.ensemble module of Python’s scikit-learn library (Table 3).

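The main influencing factor of RFM is “n_estimators.” We use the goodness of fit to adjust the parameters of RFM. The calculation is expressed as follows:

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST},$$

$$SST = \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2, \quad SSR = \sum_{i=1}^{n} \left(\hat{y}_i - \bar{y}\right)^2, \quad SSE = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2,$$

where $n$ is the sample size, $SST$ is the total sum of squares, $SSR$ is the sum of squares of regression, $SSE$ is the sum of squared residuals, $y_i$ is the value to be fitted, $\bar{y}$ is the mean of y, and $\hat{y}_i$ is the fitted value.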

Considering the number of samples and training speed of RFM, we choose as variable span. The relation between “n_estimators” and the goodness of fit $R^2$ can be calculated (Figures 8 and 9).
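The parameter scan can be sketched as follows, assuming scikit-learn's `RandomForestRegressor` and `r2_score`. The synthetic hourly data, the two features, and the candidate values of `n_estimators` are illustrative assumptions; the search span actually used in the study is not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic hourly demand for 30 days driven by hour of day and temperature.
rng = np.random.default_rng(42)
n_days = 30
hour = np.tile(np.arange(24), n_days)
temp = rng.normal(15, 5, size=hour.size)
demand = (50 + 30 * np.sin(2 * np.pi * hour / 24) + 0.8 * temp
          + rng.normal(0, 5, size=hour.size))
X = np.column_stack([hour, temp])

# Chronological split: first 23 days for training, last 7 days held out.
split = 24 * 23
X_train, X_test = X[:split], X[split:]
y_train, y_test = demand[:split], demand[split:]

scores = {}
for n_estimators in (10, 50, 100, 200):  # hypothetical candidate values
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    model.fit(X_train, y_train)
    scores[n_estimators] = r2_score(y_test, model.predict(X_test))

best = max(scores, key=scores.get)
print(scores, best)
```

Plotting `scores` against `n_estimators` gives a curve of the kind shown in Figures 8 and 9. Because the held-out days double as the validation set here, the selected value is tuned to them and the resulting score is optimistic.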

The adjusted optimal parameters for Xi’anbei Railway Station and Bell Tower areas are shown in Tables 4 and 5.

The prediction results of RFM in Xi’anbei Railway Station and Bell Tower areas are shown in Figures 10 and 11.

RFM can score the importance of feature attributes. In the RFM, the importance of feature attributes is evaluated on the basis of the permutation principle, that is, random replacement. The reduction in the mean square residual and the prediction accuracy reflects the importance of characteristic variables. In this study, the calculation of the mean square residual reduction is used to evaluate the importance of the variables:

(1) We assume $k$ regression trees in the random forest. $OOB_j$ represents the out-of-bag data of the $j$th tree. The out-of-bag mean square errors of the trees are $MSE_1, MSE_2, \ldots, MSE_k$.

(2) We assume that the total number of variables is $p$. For each input variable $X_i$, random replacement in the out-of-bag data is conducted. $k$ new out-of-bag datasets are obtained, and the mean square error $MSE_{ij}$ of each new out-of-bag dataset is calculated. Then, an out-of-bag error matrix can be constructed as follows:

$$\begin{bmatrix} MSE_{11} & MSE_{12} & \cdots & MSE_{1k} \\ MSE_{21} & MSE_{22} & \cdots & MSE_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ MSE_{p1} & MSE_{p2} & \cdots & MSE_{pk} \end{bmatrix}$$

(3) The out-of-bag error before replacement is subtracted from the $i$th row of the out-of-bag error matrix. Then, the significance score of $X_i$ is the average of the abovementioned calculated results, as shown in the following equation:

$$\mathrm{score}(X_i) = \frac{1}{k} \sum_{j=1}^{k} \left(MSE_{ij} - MSE_j\right)$$

A large significance score corresponds to a great contribution of the variable. This study uses the feature_importances_ attribute of the RFM in the scikit-learn library to score the input variables (Figures 12 and 13).
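The permutation principle in steps (1) to (3) can be illustrated with a simplified sketch. It departs from the description in two labeled ways: it permutes each variable on a single held-out set instead of on per-tree out-of-bag data, and it uses synthetic data with hypothetical feature names. Note also that scikit-learn's `feature_importances_` attribute is impurity based, whereas permutation-based scores are available separately through `sklearn.inspection.permutation_importance`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data: one strong feature, one weaker feature, one pure-noise feature.
rng = np.random.default_rng(0)
n = 600
X = rng.normal(size=(n, 3))
y = 5.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)
names = ["strong", "weak", "noise"]  # hypothetical feature names

X_train, X_val = X[:400], X[400:]
y_train, y_val = y[:400], y[400:]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
base_mse = mean_squared_error(y_val, model.predict(X_val))

importance = {}
for i, name in enumerate(names):
    increases = []
    for _ in range(10):  # average over several random permutations
        X_perm = X_val.copy()
        X_perm[:, i] = rng.permutation(X_perm[:, i])
        perm_mse = mean_squared_error(y_val, model.predict(X_perm))
        increases.append(perm_mse - base_mse)  # error after minus error before
    importance[name] = float(np.mean(increases))

print(importance)
```

A variable whose permutation barely changes the error contributes little to the prediction, which is why the pure-noise column ends up with a score near zero.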

5.3. Ridge Regression Prediction

The RRM is implemented with the ridge regression provided by the sklearn.linear_model module of Python’s scikit-learn library (Table 6).

The two most essential parameters in the RRM are the regularization intensity (alpha) and computational solver (solver) (Table 7).
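A search over these two parameters can be sketched with scikit-learn's `Ridge` and `GridSearchCV`. The synthetic data, the candidate `alpha` values, the three solvers, and the 5-fold cross-validation are illustrative assumptions; the settings used in the study are those reported in Table 7.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
true_coef = np.array([4.0, -2.0, 0.0, 1.0, 0.5])
y = X @ true_coef + rng.normal(scale=0.5, size=300)

# Standardize features and center the target so that no intercept is needed.
# For brevity the scaling is fitted on all samples before cross-validation.
X_std = StandardScaler().fit_transform(X)
y_centered = y - y.mean()

param_grid = {
    "alpha": [0.01, 0.1, 1.0, 10.0, 100.0],  # hypothetical candidates
    "solver": ["svd", "cholesky", "lsqr"],
}
search = GridSearchCV(Ridge(fit_intercept=False), param_grid, scoring="r2", cv=5)
search.fit(X_std, y_centered)

print(search.best_params_, round(search.best_score_, 3))
print(search.best_estimator_.coef_)
```

With standardized inputs, the magnitudes of the fitted coefficients are directly comparable across indicators.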

After the RRM with the optimal parameters is constructed, the prediction results are shown in Figures 14 and 15.

After the training of the RRM, the fitted model can be output. The standardization process is performed in advance. Thus, the model has no intercept term, and each index coefficient represents the importance of the index (Figures 16 and 17).

5.4. Combination Forecasting Model

The weight coefficients of the two models in the CFM are obtained from the sums of squared errors of RFM and RRM on the training set. The weight coefficients of RFM and RRM are and , respectively. The prediction results are shown in Figures 18 and 19.

We use the mean square error, mean absolute error, and goodness of fit to test the prediction performance of the three models (Tables 8 and 9).
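For reference, the three evaluation measures can be computed as in the sketch below, which uses the standard definitions with $R^2 = 1 - SSE/SST$. The toy values are illustrative and are not results from this study.

```python
def mean_square_error(y_true, y_pred):
    """Average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def goodness_of_fit(y_true, y_pred):
    """R^2 computed as 1 - SSE / SST."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - sse / sst


# Toy demand values and predictions.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

mse_value = mean_square_error(y_true, y_pred)
mae_value = mean_absolute_error(y_true, y_pred)
r2_value = goodness_of_fit(y_true, y_pred)
print(mse_value, mae_value, r2_value)
```

Lower mean square error and mean absolute error and a higher goodness of fit indicate a better prediction.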

Figures 10, 11, 14, 15, 18, and 19 show the prediction results of taxi demand in the Xi’anbei Railway Station and Bell Tower areas obtained by RFM, RRM, and CFM. Tables 8 and 9 compare the forecast performance of the three forecasting methods. The tables indicate that CFM has the highest accuracy among the three models.

As shown in Figures 12 and 13, the most crucial factor for taxi demand in the Xi’anbei Railway Station area is the hour of the day because the station is a transport hub. This finding illustrates that taxi demand at a transport hub has a strong correlation with the time factor. Figures 12 and 13 also show that O3 is the main factor in the Bell Tower area. Ozone concentration is related to temperature, and hot weather increases the taxi demand in the commercial area. However, Figures 16 and 17 imply that the main factors of RRM in the two areas are the time factor and O3. The differences between the two areas are smaller for RRM than for RFM.

6. Conclusions

In this study, we investigated taxi demand prediction in hotspots and proposed three prediction models, namely, RFM, RRM, and CFM. We extracted hotspots of taxi demand, and the taxi demand prediction models were constructed on the basis of these hotspots. The proposed models combined time, meteorological, and environmental characteristics to explain the generation of taxi demand. The prediction results show that CFM has better robustness and smaller error than RFM and RRM in the Xi’anbei Railway Station area and the Bell Tower area. The experiment also indicates that taxi demand in the Xi’anbei Railway Station area is mainly affected by the time period. In the Bell Tower area, the importance of ozone concentration and temperature to the model is relatively high. The study concludes that the proposed model can improve prediction accuracy. The most important influencing factor of the taxi demand prediction model is the time factor. Temperature and weather indicators are also relatively important.

Some limitations in the research on taxi demand prediction still need to be addressed. For example, the impact of other similar types of traffic demand is ignored in this study. If travel demand can be met by an online car-hailing service, then taxi demand will be greatly reduced. This study also ignores the impact of land use properties on taxi demand, which will be one of our future research directions. Some environmental features are difficult to obtain. Thus, in the future, we will propose a method to predict environmental features so that taxi demand can be predicted more precisely.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.


Acknowledgments

This study was supported jointly by the Technology Project of Shaanxi Transportation Department (grant number 15-39R) and Special Fund for Basic Scientific Research of Central Colleges of Chang’an University (grant number 300102218409).