#### Abstract

Car-sharing is a popular rental model in which cars are shared among users. It has become particularly attractive due to its flexibility: a car can be rented and returned at any of the authorized parking stations. The main objective of this research is to predict car usage in parking stations and to investigate the factors that help improve the prediction, so that new strategies can be designed to put more cars on the road and leave fewer idle in the parking stations. To this end, various machine learning models, namely vector autoregression (VAR), support vector regression (SVR), eXtreme gradient boosting (XGBoost), and k-nearest neighbors (kNN), as well as deep learning models, specifically long short-term memory (LSTM), gated recurrent unit (GRU), convolutional neural network (CNN), CNN-LSTM, and multilayer perceptron (MLP), were applied to different kinds of features. These features include past usage levels, Chongqing's environmental conditions, and temporal information. After comparing the results using different metrics, we found that CNN-LSTM outperformed the other methods in predicting future car usage. Moreover, the model using all the feature categories yields more precise predictions than any model using a single feature category at a time.

#### 1. Introduction

Predicting the future is considered one of the most challenging tasks in applied sciences. Computational and statistical methods are used to deduce dependencies between past and future observed values in order to build effective predictors from historical data. Transport answers people’s desire to participate in different activities in different places [1]. Cars have become a part of the mobility ecosystem owing to the flexibility and freedom that they provide [2]. People depend on cars for both intercity and intracity transit, causing traffic congestion and parking difficulties [3]. “Looking for a parking space creates additional delays and impairs local circulation. In central areas of large cities, cruising may account for more than 10% of the local circulation as drivers can spend 20 minutes looking for a parking spot,” says Dr. Jean-Paul Rodrigue of the Department of Global Studies and Geography at Hofstra University. Many rental models have emerged to address these parking problems; one of them is the car-sharing model, which aims to distribute cars within a city for use at a low cost. In this fashion, individuals can exploit all the benefits of a private vehicle without the hassles of lease payments, maintenance, or parking. The program is one-way or round-trip, depending on whether the pick-up and drop-off stations are the same or not [4].

The car-sharing system provides an option to the many people who opt not to own a vehicle and use this system whenever a private vehicle is needed. The system usually bases the cost on a price per minute that includes variable costs such as fuel and price per kilometre, as well as the operator’s share of fixed costs such as maintenance, rebalancing, insurance, and parking [5]. Besides helping to decrease the level of congestion and to manage the lack of parking lots, car-sharing systems have many other advantages, such as the reduction of vehicle ownership, which leads to efficient use of roads and infrastructure, economic savings for the users, and a diminution of air and noise pollution [5].

However, this program faces many issues [6], one of which is the unsuitable distribution of vehicles within car-sharing systems. As a result, an excess of cars tends to accumulate in low-demand parking lots, whereas an insufficient number of vehicles is available in high-demand parking lots [6]. For car-sharing companies, this problem causes a major financial loss. To improve the car usage rate, the companies employ a variety of techniques that hold great promise for car-sharing predictions.

Over the last few years, machine learning and deep learning have proved their efficiency and gained recognition in different fields. Machine learning approaches make use of learning algorithms that make inferences from data to learn new tasks [7] and are widely adopted in a number of massive and complex data-intensive fields such as medicine, astronomy, and biology [8–11]. Deep learning models yield good results in the fields of computer vision and natural language processing [12–15], as they can automatically extract multidimensional features and effectively capture the data patterns for classification or regression [16].

In our work, a multivariate time series approach is presented, which aims to predict car usage in the short term and to investigate the factors that help improve its prediction accuracy. Multiple machine learning and deep learning models have already fulfilled their promise for multivariate time series prediction and have proved their ability to extract meaningful insights that are hard for humans to analyze and infer [5]. Those models were applied to different feature sets, including the past usage levels, Chongqing’s environmental conditions, and the temporal information.

The rest of the paper is organized as follows. Section 2 presents a literature review of current studies on time series models. Section 3 gives a description of the studied problem. A time series analysis is presented in Section 4. Section 5 demonstrates the framework of our approach. Section 6 describes the experimental framework used to evaluate the performance of the models used for the multivariate time series approach. Finally, Section 7 concludes the paper and outlines future work directions.

#### 2. Literature Review

Car-sharing has become one of the most popular research subjects in transportation. Many studies have been conducted, but to the best of our knowledge, no work in the scientific literature compares different machine learning and deep learning models in predicting the future usage of a car-sharing system and in investigating the factors that help improve its prediction accuracy. Related works of interest include the following.

Studies related to this topic can be categorized into several subgroups, including the following [5, 17]:

- User characteristics: investigating the ways users interact with the service [18, 19]
- Characterizing the service in charge of the provision and distribution of the cars around the city [20, 21]
- Car demand level prediction [22, 23]

Car demand level prediction for car-sharing systems can be formulated as a time series prediction problem. Many approaches have been applied to time series, such as the autoregressive integrated moving average (ARIMA) model, which focuses on extracting the temporal variation patterns of the traffic flow and uses them for prediction, and the support vector regression (SVR) model, which captures complex nonlinearities; [20] demonstrated that the latter generally performs better on traffic flow time series. For eXtreme gradient boosting (XGBoost), [21] showed that this model improves the prediction’s precision and efficiency. Before starting the calculation, XGBoost sorts the traffic data according to the feature values and also performs parallel computing on feature enumerations.

Recently, machine learning methods have been challenged by deep learning methods for traffic prediction. Deep learning approaches have a strong ability to express multimodal patterns in data, which reduces the overfitting problem and yields high prediction accuracy. In addition, as a traffic flow process is complicated in nature, deep learning algorithms can represent traffic features without prior knowledge, which gives good performance for traffic flow prediction.

Xu and Lim [22] used an evolutionary neural network to prove the effectiveness of this algorithm and its possible usage as a tool for forecasting the net flow of a car-sharing system, in order to offer the vehicle in the shortest time possible with the best accuracy; [23] attempted to use the deep belief network (DBN) to define a deep architecture for traffic flow prediction that learns features with limited prior knowledge.

The abovementioned models require the input length to be predefined and static, and they cannot automatically determine the optimal time lags. To remedy these problems, several works have been proposed: [24] used a long short-term memory recurrent neural network (LSTM RNN) that captures the nonlinearity and randomness of traffic flow more effectively and automatically determines the optimal time lags; [25] presented a novel long short-term memory neural network to predict travel speed using microwave detector data, where the future traffic condition is commonly relevant to previous events with long time spans; Mo et al. [26] predicted the future trajectory of a surrounding vehicle in congested traffic using a CNN-LSTM. To the best of our knowledge, no work is found in the literature on car-sharing time series prediction using CNN-LSTM.

Regarding the investigation of factors improving the prediction, [6] conducted a study on the effect of seasonal factors on car bookings in Montreal and, after analysing the results, concluded that usage scored better in the summer season.

With respect to the above works, our approach presents the following highlights:

- Comparison of various machine learning and deep learning models to predict the future number of bookings made by car-sharing users, using different metrics
- Investigation of the factors that help predict car-sharing usage by estimating the relationship between data features and model performance

#### 3. Problem Description

The aim of this study is to predict the number of vehicles that are going to be used in the parking stations at a given moment and to investigate the factors that improve the accuracy of predictions.

The number of vehicle usages at a given time is likely to be correlated with a set of features [6], which are as follows:

- The past usage: the usage history is tracked to build the prediction models. It comprises the number of car-sharing transactions, based on data from a car-sharing operator located in Chongqing, China.
- Temporal information: the time at which the past usages were acquired. Since car demand may vary over time, we partition the time period into segments to capture different temporal trends (e.g., holidays/working days, 1 h timeslots) [6].
- The environmental conditions at that time: user transportation habits are usually affected by the weather conditions.

Table 1 summarizes the description of the feature categories of the car rental time series.

##### 3.1. Car Rental Prediction Problem

Hereafter, we formulate the multivariate regression problem addressed in this paper [6]. It consists of predicting the car usage based on the values of features belonging to the three categories described above. Let *T* be the historical time period considered for training, *t*_{1}, …, *t*_{k} be the past time points in *T*, and *t*_{k+1} be the current sampling time. We will denote by usage(*t*_{j}) [1 ≤ *j* ≤ *k*] the usage level at time *t*_{j}. We use 1 h timeslots as the prediction horizon [6].

Since the future car usage is related to multiple features, the multivariate regressor is expressed as follows:

usage(*t*_{x+1}) = *f*(*U*(*t*_{x}), *C*(*t*_{x}), *W*(*t*_{x})),

where *U*(*t*_{x}) is the usage levels of cars, *C*(*t*_{x}) is the temporal information at *t*_{x}, and *W*(*t*_{x}) is the weather conditions in the area at time *t*_{x} [6].

##### 3.2. Factors Investigation Problem

Another objective of this research is to determine the factors that help to predict vehicle usage. The features considered in this study are classified into different categories, namely, the past usage, temporal information, and the environmental conditions. We studied numerous machine learning and deep learning models, merging the different feature categories or using them separately one by one, aiming to find features that improve the prediction accuracy of the models.

#### 4. Time Series Analysis

Time series is a sequential collection of recorded observations in consecutive time periods, and they can be univariate or multivariate [27, 28]. We may perform time series analysis with the aim of either predicting future values or understanding the processes driving them [29].

To address the problems stated in the previous section, multiple machine learning and deep learning models were performed.

##### 4.1. Machine Learning Models

###### 4.1.1. Vector Autoregression (VAR)

Vector autoregression is a forecasting algorithm used when two or more time series influence each other. It is an autoregressive model in which the predictors are not only the past lags of the series itself but also the lags of the other series [30]. Suppose we measure three different time series variables, denoted by *y*_{1,t}, *y*_{2,t}, *y*_{3,t}. The vector autoregression model of order 1, denoted VAR(1), is as follows [30]:

*y*_{t} = *c* + *A* *y*_{t−1} + *e*_{t}

The variable *c* is a *k*-vector of constants serving as the intercept of the model, *A* is a time-invariant (*k* × *k*) matrix, and *e*_{t} is a *k*-vector of error terms.

Each variable is a linear function of the lag 1 values for all variables in the set. In general, for a VAR(*p*) model, the first *p* lags of each variable in the system would be used as regression predictors for each variable [31, 32].
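As a sketch of the idea, a VAR(1) model can be estimated by least squares: regress each variable at time *t* on an intercept and the lag-1 values of all variables. The code below uses synthetic data purely for illustration; in the paper the series would be the usage, temporal, and weather variables.

```python
import numpy as np

# Minimal VAR(1) sketch in NumPy: y_t = c + A @ y_{t-1} + e_t.
rng = np.random.default_rng(0)

k, n = 3, 500                      # 3 variables, 500 time steps
A_true = np.array([[0.5, 0.1, 0.0],
                   [0.0, 0.4, 0.2],
                   [0.1, 0.0, 0.3]])
c_true = np.array([1.0, -0.5, 0.2])

y = np.zeros((n, k))
for t in range(1, n):
    y[t] = c_true + A_true @ y[t - 1] + 0.1 * rng.standard_normal(k)

# Least-squares estimation: regress y_t on [1, y_{t-1}].
X = np.hstack([np.ones((n - 1, 1)), y[:-1]])   # (n-1, k+1)
B, *_ = np.linalg.lstsq(X, y[1:], rcond=None)  # (k+1, k)
c_hat, A_hat = B[0], B[1:].T

forecast = c_hat + A_hat @ y[-1]               # one-step-ahead prediction
```

In practice, a library implementation (e.g., the VAR class in statsmodels, one of the packages used later in the paper) would also handle lag-order selection and diagnostics.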

###### 4.1.2. eXtreme Gradient Boosting (XGBoost)

XGBoost is an efficient and scalable implementation of the gradient boosting framework of Friedman et al. [33, 34]. The package includes an efficient linear model solver and a tree learning algorithm [35]. XGBoost fits the new model to the residuals of the previous prediction and then minimizes the loss while adding the latest prediction [36]. What makes it unique is that it uses “a more regularized model formalization to control overfitting, which gives it better performance” (Tianqi Chen). XGBoost is used for supervised learning problems, where we use the training data *x*_{i} to predict a target variable *y*_{i}. After choosing the target variable *y*_{i}, we need to define the objective function to measure how well the model fits the training data. It consists of two parts, a training loss and a regularization term:

obj(*θ*) = *L*(*θ*) + Ω(*θ*),

where *θ* denotes the parameters that we need to learn from the data, *L* is the training loss function, and Ω is the regularization term. A common choice of *L* is the mean-squared error, given by

*L*(*θ*) = Σ_{i} (*y*_{i} − *ŷ*_{i})².

The regularization term controls the complexity of the model, which helps us to avoid overfitting [27].
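The objective above can be written out directly. The snippet below is only an illustration of the formula: it uses a simple L2 penalty as a stand-in for Ω, whereas the actual XGBoost regularizer also penalizes the number of tree leaves.

```python
import numpy as np

# Illustrative computation of the objective obj(theta) = L(theta) + Omega(theta),
# with L the squared-error training loss and an L2 penalty as the
# regularization term (a simplification of XGBoost's actual regularizer).
def objective(y_true, y_pred, theta, lam=1.0):
    train_loss = np.sum((y_true - y_pred) ** 2)   # training loss L(theta)
    regularization = lam * np.sum(theta ** 2)     # complexity penalty Omega(theta)
    return train_loss + regularization

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 6.0])
theta = np.array([0.2, -0.1])

obj = objective(y_true, y_pred, theta)  # 1.5 + 0.05 = 1.55
```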

###### 4.1.3. Support Vector Regression (SVR)

The foundations of support vector machines (SVM) were laid by Vapnik and Chervonenkis, and the methodology has been gaining popularity. SVM variants that deal with classification problems are called support vector classification (SVC), and those that deal with modelling and prediction are called support vector regression (SVR) [28].

Most real-world problems cannot be modelled using linear forms [31]; the SVR methodology allows such problems to be handled through kernels. Common kernels used in SVR modelling include [25]:

1. Linear kernel: *x* · *y*
2. Polynomial kernel: [(*x* · *x*_{i}) + 1]^{*d*}
3. Radial basis function (RBF): exp{−*γ*|*x* − *x*_{i}|²}
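The three kernels listed above can be written out directly in NumPy; *d* and *γ* are kernel hyperparameters, and the vectors used here are arbitrary examples.

```python
import numpy as np

# The linear, polynomial, and RBF kernels from the list above.
def linear_kernel(x, y):
    return np.dot(x, y)

def polynomial_kernel(x, x_i, d=2):
    return (np.dot(x, x_i) + 1) ** d

def rbf_kernel(x, x_i, gamma=0.5):
    return np.exp(-gamma * np.sum((x - x_i) ** 2))

x = np.array([1.0, 2.0])
y = np.array([0.5, 1.0])

lin = linear_kernel(x, y)        # 1*0.5 + 2*1.0 = 2.5
poly = polynomial_kernel(x, y)   # (2.5 + 1)^2 = 12.25
rbf = rbf_kernel(x, y)           # exp(-0.5 * 1.25)
```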

###### 4.1.4. K-Nearest Neighbors (kNN)

K-nearest neighbors (kNN) is an efficient and intuitive method that has been used extensively for classification in pattern recognition [32]. It is a distance-based method, which implies that it implicitly presumes that the smaller the distance between two points, the more similar they are [37]. The kNN classification algorithm is by far more popular than kNN regression [37]. In the kNN regression model, the information derived from the observed data is applied to forecast the value of the predicted variable in real time [38]. In other words, it estimates the response of a testing point *X*_{t} as an average of the responses of the *k* closest training points, *X*_{(1)}, *X*_{(2)}, …, *X*_{(k)}, in the neighborhood of *X*_{t} [32]. Let *X* = {*X*_{1}, *X*_{2}, …, *X*_{M}} be a training data set consisting of *M* training points, each of which possesses *N* features [32]. The Euclidean distance is used to calculate how close each training point *X*_{i} is to the testing point *X*_{t}:

*d*(*X*_{t}, *X*_{i}) = sqrt(Σ_{n=1}^{N} (*x*_{t,n} − *x*_{i,n})²),

where *N* is the number of features, *x*_{t,n} is the *n*th feature value of the testing point *X*_{t}, and *x*_{i,n} is the *n*th feature value of the training point *X*_{i}. Other distance measures include the Manhattan, Minkowski, and Hamming distances.
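A minimal kNN regression sketch follows: the test point's response is estimated as the average response of its *k* nearest training points under the Euclidean distance. The data are illustrative only.

```python
import numpy as np

# kNN regression: average the responses of the k closest training points.
def knn_predict(X_train, y_train, x_test, k=3):
    distances = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]   # indices of the k closest points
    return y_train[nearest].mean()        # average of their responses

X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])

pred = knn_predict(X_train, y_train, np.array([1.2]), k=3)  # mean of 0, 1, 2
```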

##### 4.2. Deep Learning Models

###### 4.2.1. Long Short-Term Memory (LSTM)

The long short-term memory neural network (LSTM NN) was initially introduced by Hochreiter and Schmidhuber (1997) [21]. The primary objective of the LSTM NN is to overcome the vanishing gradients problem of the standard recurrent neural network (RNN) when dealing with long-term dependencies [39]. Its features are especially desirable for traffic prediction in the transportation domain [40]. Figure 1 shows the architecture of a long short-term memory cell. The core concept of LSTM is a network of recurrently connected memory blocks, each of which contains one or more memory cells along with three multiplicative “gate” units: an input gate *i*_{t}, a forget gate *f*_{t}, and an output gate *o*_{t}, each with its corresponding weight matrices and bias vector. These gates produce gating values from the current input *x*_{t}, the previous hidden state *h*_{t−1}, and the current cell state *c*_{t}, which decide whether to take the inputs in, forget the memory stored before, and output the state generated later, as the following equations demonstrate [39]:

*i*_{t} = *σ*(*W*_{i}*x*_{t} + *U*_{i}*h*_{t−1} + *b*_{i})
*f*_{t} = *σ*(*W*_{f}*x*_{t} + *U*_{f}*h*_{t−1} + *b*_{f})
*o*_{t} = *σ*(*W*_{o}*x*_{t} + *U*_{o}*h*_{t−1} + *b*_{o})
*c*_{t} = *f*_{t} ⊙ *c*_{t−1} + *i*_{t} ⊙ tanh(*W*_{c}*x*_{t} + *U*_{c}*h*_{t−1} + *b*_{c})
*h*_{t} = *o*_{t} ⊙ tanh(*c*_{t})

The network controls the flow of information through its sigmoid layers, which output numbers between zero and one (*S*(*t*) = 1/(1 + *e*^{−*t*})).
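One forward step of the standard LSTM gate equations can be written out in NumPy as below. The weight matrices here are random, purely for illustration of the shapes and the data flow.

```python
import numpy as np

# One forward step of a standard LSTM cell.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold the parameters of the input (i), forget (f),
    # output (o), and candidate (c) transformations.
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])
    c_t = f_t * c_prev + i_t * np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(1)
n_in, n_hid = 4, 3
W = {g: rng.standard_normal((n_hid, n_in)) for g in "ifoc"}
U = {g: rng.standard_normal((n_hid, n_hid)) for g in "ifoc"}
b = {g: np.zeros(n_hid) for g in "ifoc"}

h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid), W, U, b)
```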

###### 4.2.2. Gated Recurrent Unit (GRU)

The gated recurrent unit (GRU) architecture contains two gates: an update gate *z*_{t}, which decides how much the unit updates its activation or content, and a reset gate *r*_{t}, which allows the unit to forget the previously computed state [41]. The model is defined by the following:

*z*_{t} = *σ*(*W*_{z}*x*_{t} + *U*_{z}*h*_{t−1} + *b*_{z})
*r*_{t} = *σ*(*W*_{r}*x*_{t} + *U*_{r}*h*_{t−1} + *b*_{r})
*H*_{t} = tanh(*W*_{h}*x*_{t} + *U*_{h}(*r*_{t} ⊙ *h*_{t−1}) + *b*_{h})
*h*_{t} = (1 − *z*_{t}) ⊙ *h*_{t−1} + *z*_{t} ⊙ *H*_{t}

where *h*_{t} represents the output state vector at time *t*, *H*_{t} is the candidate state obtained with a hyperbolic tangent, *x*_{t} represents the input vector at time *t*, and the parameters of the model are *W*_{z}, *W*_{r}, *W*_{h} (the feed-forward connections), *U*_{z}, *U*_{r}, *U*_{h} (the recurrent weights), and the bias vectors *b*_{z}, *b*_{r}, *b*_{h} [42].

###### 4.2.3. Convolutional Neural Network (CNN)

Convolutional neural networks (CNNs) are analogous to traditional artificial neural networks (ANNs) as they are comprised of neurons that self-optimize through learning [43]. They were initially developed for computer vision tasks; nevertheless, there have been a few recent studies applying them to time series forecasting tasks. CNNs are comprised of three types of layers including convolutional layers, pooling layers, and fully connected layers as shown in Figure 2.

The convolutional layer will determine the output of neurons connected to local regions of the input through the calculation of the scalar product between their weights and the region connected to the input volume [43]. There are two important techniques used in the convolutional layers to accelerate the training process: local connectivity and weight sharing. The two techniques are implemented using a filter with a specific kernel size which defines the number of nodes that share weights. Their usage decreases significantly the number of learned and stored weights and allows the network to grow deeper with fewer parameters. The pooling layer is usually incorporated between two successive convolutional layers [44].

The main idea of pooling is to reduce the complexity for further layers by downsampling [45]. Max-pooling is one of the most common pooling methods, as it performs better [44]. It consists of partitioning the input into rectangular subregions and returning only the maximum value inside each subregion [45].
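The max-pooling operation described above can be sketched as follows: a 4×4 feature map is downsampled with a 2×2 window and stride 2, keeping only the maximum of each subregion.

```python
import numpy as np

# 2x2 max-pooling with stride 2 over a 2-D feature map.
def max_pool_2x2(feature_map):
    h, w = feature_map.shape
    pooled = np.zeros((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            pooled[i // 2, j // 2] = feature_map[i:i + 2, j:j + 2].max()
    return pooled

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 1, 5, 6],
                 [2, 2, 7, 8]])

pooled = max_pool_2x2(fmap)  # [[4, 2], [2, 8]]
```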

The fully connected layers are simply feed-forward neural networks. They form the last few layers in the network [46]. The input to the fully connected layer is the output from the final pooling or convolutional layer, which is flattened and then fed into the fully connected layer in order to perform the same duties found in standard ANNs [43, 46].

###### 4.2.4. CNN-LSTM Model

CNN-LSTM is a hybrid model built by combining CNN with LSTM to improve forecasting accuracy [47]. Figure 3 shows the architecture of the CNN-LSTM model. The model comprises two main components: the first consists of convolutional and pooling layers in which complicated mathematical operations are performed to filter the input data and extract the useful information. More specifically, the convolutional layers apply a convolution operation between the raw input data and the convolution kernels, producing new feature values [48]. The convolution kernel can be considered as a window that contains coefficient values in matrix form. This window slides over the input matrix, applying a convolution operation to each subregion of it. The result of all these operations is a convolved matrix that represents a feature value.

The convolutional layers are usually followed by a nonlinear activation function and then a pooling layer. A pooling layer is a subsampling technique that extracts certain values from the convolved features and produces new matrices (i.e., summarized versions of the convolved features that are produced by the convolutional layer).

The second component exploits the generated features using an LSTM, which possesses the ability to learn long-term and short-term dependencies through the utilization of feedback connections, followed by dense layers [48].

###### 4.2.5. Multilayer Perceptron (MLP)

Multilayer perceptrons (MLPs) are deep artificial neural networks and are often applied to supervised learning problems [49]. As we can see from Figure 4, a multilayer perceptron consists of three types of layers: an input layer that receives the signal, an output layer that performs the required tasks to make a decision or prediction about the input, and an arbitrary number of hidden layers that form the true computational engine [49, 50]. In an MLP, the data flow in the forward direction from the input to the output layer, and the neurons are trained with the backpropagation learning algorithm on a set of input-output pairs. The training involves adjusting the parameters, or the weights and biases, of the model in order to minimize errors [49].
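A small MLP regression example with scikit-learn (one of the packages listed in the experimental setup) illustrates the input/hidden/output structure and backpropagation training. The data and layer sizes here are purely illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic regression data: 3 input features, one target.
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(200, 3))
y = X[:, 0] + 2 * X[:, 1] - X[:, 2]       # a simple target function

model = MLPRegressor(hidden_layer_sizes=(16, 8),  # two hidden layers
                     max_iter=2000, random_state=0)
model.fit(X, y)                            # backpropagation training

pred = model.predict(X[:2])                # forward pass, input -> output
```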

##### 4.3. Vanishing Gradients Problem

Artificial neural networks often experience training problems due to vanishing and exploding gradients. This training problem is amplified exponentially in deep learning because of its complex neural network architectures [51]. The vanishing gradient is one example of unstable behavior that may be encountered during training with gradient-based methods (e.g., backpropagation) [52]. The neural network’s weights receive an update proportional to the partial derivative of the error function with respect to the current weight in each iteration of training. In some cases, the gradient tends to get smaller as we move backward through the hidden layers. This means that neurons in the earlier layers learn much more slowly than neurons in later layers, effectively preventing their weights from changing value [52].

Several approaches exist to reduce this effect in practice, for example, through careful initialization, hidden layer supervision, and batch normalization [53]. In our work, batch normalization has been used, as it was effective in augmenting the performance of the deep neural network.

#### 5. Framework of the Approach

Figure 5 shows the process of car-sharing usage prediction and factors investigation approach based on machine and deep learning models.

##### 5.1. Collecting Chongqing’s Car-Sharing Operator Data

Chongqing car-sharing operators’ data set contains more than 1 M records of car-sharing usage over 860 parking lots, from January 1st, 2017, 00:00:00 to January 31st, 2019, 23:00:00. The initial records were obtained at different time intervals, and for study purposes, the data are aggregated by hours, days, and weeks, for the whole network.

##### 5.2. Collecting Chongqing’s Weather Data

Web crawling with Selenium was used to extract hourly weather conditions data for Chongqing from January 1st, 2017, 00:00:00 to January 31st, 2019, 23:00:00 [54].

##### 5.3. Data Preprocessing

Data preprocessing is performed on the data set to improve performance [55].

###### 5.3.1. Processing Missing Values

Some values were missing in the data set from Chongqing’s car-sharing operator. Because of their numerical meaning, we replaced each missing value with the mean of the previous and next hour’s number of car-sharing usages. This method yields better results compared to the removal of rows. The detailed calculation is as follows:

*x*_{i,j,k} = (*x*_{i,j,k−1} + *x*_{i,j,k+1})/2,

where *x*_{i,j,k} represents station *i*’s missing value on the *k*th hour of the *j*th day of the year. After handling the missing values of Chongqing’s car-rental operator data set, we merged it with Chongqing’s weather conditions data set based upon date and time.
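For an isolated gap, replacing a missing value with the mean of the previous and next hour is equivalent to linear interpolation, which pandas (listed among the installed packages) performs directly:

```python
import numpy as np
import pandas as pd

# Usage counts at hours k-1, k, k+1, with the value at hour k missing.
usage = pd.Series([12.0, np.nan, 18.0])

# Linear interpolation fills the gap with (12 + 18) / 2 = 15.
filled = usage.interpolate(method="linear")
```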

###### 5.3.2. Encoding the Categorical Data

Since the final data set, combining Chongqing’s car-sharing operator and weather data, contains some categorical data such as weather condition and season, we converted the categorical data to numerical data using the one-hot encoding method. It consists of representing each categorical variable with a binary vector that has one element for each unique label, marking the class label with a 1 and all other elements with 0 [56].
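One-hot encoding of a categorical column can be done with `pandas.get_dummies`; the weather labels below are examples taken from the frequent conditions reported later in Table 2.

```python
import pandas as pd

# Each unique weather label becomes its own binary column.
df = pd.DataFrame({"weather": ["fair", "fog", "light rain", "fair"]})

encoded = pd.get_dummies(df["weather"])
# Each row carries a single 1 marking its label, e.g. the first row
# has a 1 in the "fair" column and 0 elsewhere.
```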

###### 5.3.3. Clustering Car-Sharing Parking Stations

To identify and understand the car-rental behaviors across stations and reveal the relationships between the time of day and usage [57], we organized the parking stations with similar patterns into five distinct classes:

- Class A: daily rented cars
- Class B: frequently used cars
- Class C: sometimes used cars
- Class D: occasionally used cars
- Class E: unlike the other parking stations, cars of this class are rarely used

where Classes A, B, C, D, and E have different parking station IDs such as 16, 104, 6, 28, and 25, respectively. To simplify the large data set and make it understandable, we used a grouped frequency table to cluster the 860 parking stations into five classes [58], as follows [59]:

1. The usage frequencies of the parking stations were put in order, and the range was calculated as the difference between the highest and lowest usage frequencies.
2. An approximate class width was calculated by dividing the range by the number of classes. The lowest usage frequency represents the first minimum data value.
3. The next lower class value was calculated by adding the class width to the lowest usage frequency. This step was repeated for the remaining minimum data values until the chosen number of classes was created.
4. The upper class limits (the highest values possible in each class) were calculated by subtracting 1 from the class width and adding the result to the minimum data values.
5. The list of classes was obtained by including in each class the usage frequencies that are greater than the lower class value and smaller than the upper class limit.
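The grouped-frequency steps above can be sketched as follows. The frequencies are made up for illustration; the binning simply divides the range into equal-width classes and assigns each station's usage frequency to one of them.

```python
import numpy as np

# Grouped-frequency clustering: range / number of classes gives the
# class width; each usage frequency is binned into one of five classes.
def cluster_stations(frequencies, n_classes=5):
    low, high = min(frequencies), max(frequencies)
    value_range = high - low                       # step 1: range
    width = int(np.ceil(value_range / n_classes))  # step 2: class width
    # steps 3-5: assign each frequency to its class index
    return [min((f - low) // width, n_classes - 1) for f in frequencies]

freqs = [3, 12, 25, 47, 60, 88, 95, 110]   # hypothetical usage frequencies
classes = cluster_stations(freqs)          # class index per station
```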

##### 5.4. Deseasonalization

Stationarity is an important concept for time series analysis. Some experts believe that neural networks are able to model seasonality directly and that no prior deseasonalization is required, whereas others believe the contrary. The results in [60] show that prior data processing is required to construct a forecasting model. To test whether our time series is stationary, the Augmented Dickey-Fuller test (ADF test) was conducted [61]. After the ADF test, we employed differencing to remove the seasonality from the nonstationary time series [61, 62], making the predictions of the machine learning and deep learning models more accurate.
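First-order differencing, the transform used here to remove nonstationary components, is a one-liner; its inverse (a cumulative sum plus the first value) is needed to map forecasts back to the original scale.

```python
import numpy as np

# First-order differencing: d_t = y_t - y_{t-1}.
y = np.array([10.0, 12.0, 15.0, 14.0, 18.0])

diff = np.diff(y)                   # [2., 3., -1., 4.]
restored = y[0] + np.cumsum(diff)   # invert the transform: recovers y[1:]
```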

##### 5.5. Scaling

The scaling phase is crucial to move the time series into a reasonable range. In our work, MinMaxScaler was used to scale each feature to a given range.
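MinMaxScaler rescales each feature via x' = (x − min)/(max − min); a minimal example with scikit-learn (the library used in this work):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One feature with values 10..20, scaled into [0, 1].
X = np.array([[10.0], [15.0], [20.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)           # [[0.0], [0.5], [1.0]]
X_back = scaler.inverse_transform(X_scaled)  # recover the original units
```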

##### 5.6. Splitting the Data Set

After completing the previously mentioned steps, we prepared our data set properly. We split the data into training and test sets. The training set runs from January 1st, 2017, to December 31st, 2018, and the test set from January 1st, 2019, to January 31st, 2019. Nested cross-validation with an outer loop of ten folds and an inner loop of five folds is used to calculate and compare each model’s error. All models use the same validation procedure for consistency.

#### 6. Experiments

##### 6.1. Data Set

The experiments were performed on the preprocessed Chongqing car-sharing operator data set combined with the Chongqing weather data set, to extract the features that help predict car usage and to demonstrate the effectiveness of deep learning, more precisely of CNN-LSTM compared to the other models.

We implemented the proposed models on a PC with an Intel(R) Core(TM) i7-7500U CPU running at 3.00 GHz and 8 GB of RAM, under the Windows 10 operating system and the Python 3.7 development environment. The following packages were installed: TensorFlow 1.14.0; Keras 2.2.4-tf; Pandas 0.23.4; Sklearn 0.21.1; Numpy 1.18.1; Matplotlib 3.1.0; Statsmodels 0.10.1.

The hourly weather observations include time, temperature, humidity, wind speed, pressure, precipitation, and weather conditions. To have a basic idea about what weather conditions are typically associated with the city of Chongqing, we calculated the normalized frequency distributions of every weather condition (Table 2).

From Table 2, fair is the most prevalent meteorological condition in Chongqing, closely followed by fog, light rain, partly cloudy, and cloudy.

From Table 3, July and August are the hottest months of the year with an average temperature of 28°C, and January is the coldest month of the year with an average temperature of 7°C.

##### 6.2. Evaluation Metrics

Evaluation metrics measure how closely the predictions match the historical data. They are useful for comparing prediction methods on the same data set [63].

###### 6.2.1. Mean Absolute Error (MAE)

MAE is calculated as the mean of the absolute predicted error values. The MAE is popular as it is easy to both understand and compute.

###### 6.2.2. Mean Square Error (MSE)

MSE is known for putting more weight on large errors. It is calculated as the average of the squared predicted error values.

###### 6.2.3. Root Square Mean Error (RMSE)

The mean-squared error described above is in the squared units of the predictions. It can be transformed back into the original units by taking the square root of the mean-squared error score [64]:

RMSE = sqrt((1/*n*) Σ_{i=1}^{n} (*y*_{i} − *ŷ*_{i})²)

The RMSE is chosen as an evaluation metric because it penalizes large prediction errors more than the mean absolute error (MAE) does.

###### 6.2.4. Mean Absolute Percentage Error (MAPE)

Mean absolute percentage error (MAPE) is one of the most widely used measures of forecast accuracy, due to its advantages of scale independence and interpretability [65]. It is calculated by taking the absolute error in each period, dividing it by the observed value for that period, and averaging those percentages [66]. MAPE indicates how large the prediction error is compared with the real value. The MAPE is defined by the following formula:

MAPE = (100%/*n*) Σ_{t=1}^{n} |(*A*_{t} − *F*_{t})/*A*_{t}|,

where *A*_{t} is the actual value, *F*_{t} is the forecast value, and *n* denotes the number of fitted points.

###### 6.2.5. Root-Mean-Squared Log Error (RMSLE)

The root-mean-squared log error (RMSLE) is the RMSE of the log-transformed predicted and target values [54]. RMSLE only considers the relative error between predicted and actual values, and the scale of the error is nullified by the log-transformation [67]. The formula for RMSLE is as follows:

RMSLE = sqrt((1/*n*) Σ_{i=1}^{n} (log(*p*_{i} + 1) − log(*a*_{i} + 1))²),

where *n* is the total number of observations in the data set, *p*_{i} is the prediction of the target, *a*_{i} is the actual target for *i*, and log(*x*) is the natural logarithm of *x*, log_{e}(*x*).
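The five metrics above can be written out compactly in NumPy; the actual/forecast values below are arbitrary examples.

```python
import numpy as np

# The five evaluation metrics, written out in NumPy.
def mae(a, f):   return np.mean(np.abs(a - f))
def mse(a, f):   return np.mean((a - f) ** 2)
def rmse(a, f):  return np.sqrt(mse(a, f))
def mape(a, f):  return 100.0 * np.mean(np.abs((a - f) / a))
def rmsle(a, f): return np.sqrt(np.mean((np.log1p(f) - np.log1p(a)) ** 2))

actual = np.array([10.0, 20.0, 30.0])
forecast = np.array([12.0, 18.0, 33.0])

scores = {m.__name__: m(actual, forecast) for m in (mae, mse, rmse, mape, rmsle)}
```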

##### 6.3. Experiments and Analysis

Many experiments were conducted on parking stations of different classes to determine which features improve the prediction and to predict the demands accordingly. Before applying the different models, some tests were performed on our time series.

###### 6.3.1. Granger Causality Tests

The Granger causality test is a statistical hypothesis test for determining whether one time series is useful for predicting another [68]. In other words, it is an approach that analyses the causal relationships between different variables of the time series.

After analysing the results shown in Table 4, we observed that all the given p values were smaller than the significance level (0.05). For example, the p value of 0.0003 (row 4) is the result of the Granger causality test for temperature_x causing number_rented_cars_y (p value (0.0003) < significance level of 0.05).

From the results, we can infer that all the variables are good candidates to help predict the number of rented cars.
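In practice the test can be run with statsmodels' `grangercausalitytests`; to make the underlying idea concrete, here is a minimal NumPy-only version of the nested-regression F-statistic at lag 1 (the lag order and the restriction to a single predictor are illustrative simplifications, not the paper's exact procedure).

```python
import numpy as np

def granger_f_stat(y, x, lag=1):
    """F-statistic: does lagged x improve prediction of y beyond lagged y?"""
    y_t = y[lag:]
    y_lag = y[:-lag]
    x_lag = x[:-lag]
    # Restricted model: y_t ~ const + y_lag
    Xr = np.column_stack([np.ones_like(y_lag), y_lag])
    # Unrestricted model: y_t ~ const + y_lag + x_lag
    Xu = np.column_stack([np.ones_like(y_lag), y_lag, x_lag])
    rss_r = np.sum((y_t - Xr @ np.linalg.lstsq(Xr, y_t, rcond=None)[0]) ** 2)
    rss_u = np.sum((y_t - Xu @ np.linalg.lstsq(Xu, y_t, rcond=None)[0]) ** 2)
    n, q, k = len(y_t), 1, Xu.shape[1]  # q = number of restrictions
    return ((rss_r - rss_u) / q) / (rss_u / (n - k))
```

A large F-statistic (small p value) indicates that the lagged predictor carries information about the target, which is how Table 4 is read.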

###### 6.3.2. Machine Learning Configuration

*(1) eXtreme Gradient Boosting*. A grid search was performed for the XGBoost model in order to locate the optimal hyperparameters for the data set.
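The paper does not report the searched space, so the grid below is illustrative only. A minimal sketch with scikit-learn's `GridSearchCV`, using `GradientBoostingRegressor` as a stand-in (xgboost's `XGBRegressor` drops into the same slot); `TimeSeriesSplit` is assumed here because ordinary shuffled cross-validation leaks future information with time series data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for xgboost.XGBRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Illustrative grid; the exact search space is not reported in the paper.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}

def tune_gbm(X, y):
    # TimeSeriesSplit keeps training folds strictly before validation folds.
    search = GridSearchCV(
        GradientBoostingRegressor(random_state=0),
        param_grid,
        cv=TimeSeriesSplit(n_splits=3),
        scoring="neg_mean_absolute_error",
    )
    search.fit(X, y)
    return search.best_estimator_, search.best_params_
```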

*(2) Vector Autoregression*. To select the right order of the VAR model, we iteratively fit VAR models of increasing order and picked the order that gave the model with the lowest AIC [29]. In our work, we chose the lag-4 model. Before predicting the future values of the target variable with VAR, we checked the serial correlation of the residuals to verify that our model explains the patterns in the time series. The scores in Table 5 show that our model captures the patterns without leftover serial correlation.

*(3) Support Vector Regression*. The RBF kernel was chosen in this study for its good performance and its advantages in time series prediction problems, as demonstrated in past research [69, 70]. The penalty parameters were tuned using a grid search, and predictions were computed with the optimal combination of the cost and gamma parameters.

*(4) K-Nearest Neighbors*. The most important hyperparameters for KNN are the number of neighbors (K) and the distance metric; together they determine how the nearest neighbors are chosen.

To choose the number of neighbors (K), a grid search was performed. The distance metric also plays a crucial role in the nearest neighbor algorithm. Most of the papers in the references used the Euclidean distance. Reference [71] compared the Euclidean and Manhattan distances and found that, statistically, neither metric is significantly better than the other. The Euclidean distance was selected since it is the most commonly used in time series forecasting with KNN regression.
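The K search described above can be sketched with scikit-learn; the candidate K values below are illustrative, not the paper's (note that `metric="euclidean"` matches scikit-learn's default Minkowski distance with `p=2`).

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

def tune_knn(X, y, k_values=(2, 4, 8, 16)):
    # Grid-search the number of neighbors with a time-ordered split,
    # fixing the distance metric to Euclidean as in the paper.
    search = GridSearchCV(
        KNeighborsRegressor(metric="euclidean"),
        {"n_neighbors": list(k_values)},
        cv=TimeSeriesSplit(n_splits=3),
    )
    search.fit(X, y)
    return search.best_params_["n_neighbors"]
```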

###### 6.3.3. Deep Learning Configuration

*(1) Long Short-Term Memory and Gated Recurrent Unit Models*. In this study, the LSTM and GRU models have one neuron in the output layer. Both neural network models were designed with a single hidden layer, trained for 50 epochs with a learning rate of 0.01. The numbers of input and hidden neurons for the GRU model were the same as those for the LSTM model, set to 41 and 50, respectively [56].

*(2) Convolutional Neural Network*. CNN is one of the most successful deep learning methods; its network structures include 1D CNN, 2D CNN, and 3D CNN. A 1D CNN, which is well suited to time series analysis, is used in this paper. Its detailed configuration is described as follows:

In our experiment, we used one convolutional layer with a kernel size of 2 and 64 filters, followed by a max-pooling layer; a rectified linear unit (ReLU) activation function was applied in the convolutional and output layers. To minimize the mean-squared error, the gradient descent backpropagation algorithm with the Adam optimizer was used; a dropout rate of 0.5 was employed to avoid overfitting [72], and the model was trained for 70 epochs.

*(3) CNN-LSTM*. In our implementation, we utilized a version of the CNN-LSTM model that consists of two one-dimensional convolutional layers with a kernel size of 5 and 32 and 64 filters, respectively, followed by a max-pooling layer, an LSTM layer with 50 units, and a dense layer of 32 neurons [48]. To avoid overfitting during training, the dropout rate was set to 0.2, and the model was trained for 100 epochs.
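The paper does not publish its implementation, so the following Keras sketch is only one plausible rendering of the stated architecture (kernel size 5; 32 and 64 filters; max pooling; a 50-unit LSTM; a 32-neuron dense layer; dropout 0.2). The `padding="same"` setting, the dropout placement, and the window length `n_steps` are our assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn_lstm(n_steps, n_features):
    # CNN front end reads the input window as local subsequences;
    # the LSTM then models the pooled feature sequence.
    model = keras.Sequential([
        layers.Input(shape=(n_steps, n_features)),
        layers.Conv1D(32, kernel_size=5, padding="same", activation="relu"),
        layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.LSTM(50),
        layers.Dense(32, activation="relu"),
        layers.Dropout(0.2),          # assumed placement, rate 0.2 as stated
        layers.Dense(1),              # single-value forecast
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```

Training would then call `model.fit(X, y, epochs=100)` to match the stated 100 epochs.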

*(4) Multilayer Perceptron Regressor*. MLP was applied to our multivariate time series, with ReLU as the activation function, to train the regression model. Three hidden layers were used, with 50, 35, and 10 hidden neurons, respectively. Training was performed for 50 epochs with a learning rate of 0.01.
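This configuration maps directly onto scikit-learn's `MLPRegressor`; whether the authors used this library is not stated, and equating `max_iter` with training epochs is an approximation.

```python
from sklearn.neural_network import MLPRegressor

# Sketch of the stated configuration: three ReLU hidden layers of
# 50, 35, and 10 neurons, learning rate 0.01, 50 training epochs.
mlp = MLPRegressor(
    hidden_layer_sizes=(50, 35, 10),
    activation="relu",
    learning_rate_init=0.01,
    max_iter=50,   # approximates "50 epochs"
    random_state=0,
)
```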

##### 6.4. Prediction Result

To keep the paper well organized and avoid redundancy, we only discuss the result analysis of class “A”, as the other classes exhibit the same behavior and lead to the same conclusions.

To compare the models fitted with different features, only the results of CNN-LSTM among the deep learning models and XGBoost among the machine learning models are described in our analysis; the other applied models showed the same trends.

In our investigation, the metrics MAE, MSE, RMSE, MAPE, and RMSLE are reported in that order to make the comparison. Note that the smallest errors are shown in bold text in Tables 6–13.
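The error-reduction percentages quoted in the following subsections read as relative improvements of one model's error over another's; assuming the standard formula (the paper does not state it explicitly), they can be computed as:

```python
def error_reduction(err_baseline, err_model):
    """Percentage by which err_model improves on err_baseline."""
    return 100.0 * (err_baseline - err_model) / err_baseline
```

For example, a model with MAE 2.0 against a baseline with MAE 10.0 corresponds to an 80% reduction.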

###### 6.4.1. Univariate Time Series

*(1) Machine Learning Models*. As shown in Table 6, XGBoost reduced the error by (93.14%, 99.14%, 93.36%, 69.17%, 93.76%) against VAR; (93.69%, 99.48%, 94.17%, 83.09%, 94.70%) against SVM; and (93.93%, 99.65%, 94.44%, 84.17%, 95.05%) against KNN.

*(2) Deep Learning Models*. As it can be seen from Table 7, CNN-LSTM yielded the best results and had smaller evaluation error compared to all other models. CNN-LSTM decreased the evaluation error by (44.19%, 48.37%, 34.44%, 1.03%, 24.14%) against LSTM; (51.91%, 57.97%, 35.15%, 11.71%, 40.21%) against GRU; (61.15%, 58.58%, 79.38%, 37.38%, 38.76%) against CNN; and (67.43%, 95.75%, 69%, 88.24%, 61.37%) against MLP.

###### 6.4.2. The Effect of Weekends Information

The impact of weekends must not be forgotten or underestimated when building prediction models with time series data, as weekends can introduce distinctive patterns in usage.

*(1) Machine Learning Models*. From Table 8, it can be noticed that using the XGBoost model with weekend features improved the results by (1.14%, 0.62%, 1.08%, 0.42%, 0.24%) compared to the univariate time series.

XGBoost outperformed VAR, SVM, and KNN. It reduced the evaluation error at rates of (93.20%, 99.19%, 93.34%, 69.04%, 93.76%) relative to VAR; (93.74%, 99.47%, 94.14%, 82.98%, 94.46%) relative to SVM; and (94.00%, 99.65%, 94.37%, 84.04%, 95.04%) relative to KNN.

*(2) Deep Learning Models*. The addition of weekend features improved prediction accuracy, as can be observed from Table 9: adding the weekend feature to the univariate time series of class “A” improved the results at rates of (2.4%, 4.95%, 2.54%, 4.23%, 0.43%) for CNN-LSTM.

Regarding the models’ comparison, CNN-LSTM achieved the best results, with improvement rates of (43.75%, 46.52%, 28.50%, 0.88%, 37.98%) against LSTM; (46.47%, 57.96%, 35.16%, 4.87%, 35.45%) against GRU; (40.38%, 59.34%, 67.17%, 48.70%, 40.56%) against CNN; and (65.92%, 89.22%, 67.79%, 75.46%, 60.88%) against MLP.

###### 6.4.3. The Effect of Weather Information

In addition to rental information in the city of Chongqing, we can leverage weather data to improve prediction at different times of the day.

*(1) Machine Learning Models*. Table 10 shows that applying XGBoost reduced the MAE by 2.68%; the MSE by 3.18%; the RMSE by 0.25%; the MAPE by 6.84%; and the RMSLE by 1.56% compared to the univariate time series results. Correspondingly, it reduced the MAE by 1.56%; the MSE by 2.58%; the RMSE by 0.33%; the MAPE by 6.98%; and the RMSLE by 1.32% relative to the weekend results.

XGBoost outperformed the other models as it decreased the error values by (93.28%, 99.21%, 93.35%, 76.26%, 93.72%) compared to VAR; (93.49%, 99.41%, 94.15%, 94.20%, 94.05%) compared to SVM; and (94.08%, 99.56%, 94.26%, 94.54%, 95.10%) compared to KNN.

*(2) Deep Learning Models*. From Table 11, our findings show that the CNN-LSTM model was (37.97%, 37.59%, 35.82%, 5.49%, 23.42%) more accurate than LSTM; (39.82%, 57.73%, 34.93%, 2.62%, 38.70%) more accurate than GRU; (40.21%, 58.84%, 58.46%, 43.80%, 40.76%) more accurate than CNN; and (65.25%, 82.77%, 52.96%, 76.67%, 60.67%) more accurate than MLP.

Moreover, a thorough examination showed that integrating weather data as a feature improved the prediction accuracy. When the CNN-LSTM model was applied to class “A” data with weather features, it reduced the MAE by 2.67%; the MSE by 6.04%; the RMSE by 3.03%; the MAPE by 11.10%; and the RMSLE by 5.02% compared to the univariate time series results. Similarly, it reduced the MAE by 0.28%; the MSE by 1.16%; the RMSE by 0.51%; the MAPE by 7.17%; and the RMSLE by 5.43% compared to the weekend results.

###### 6.4.4. Combined Effect of Historical, Weekends, and Weather Information

*(1) Machine Learning Models*. The evaluation error of XGBoost was lower as shown in Table 12. It reduced the error by (93.35%, 99.28%, 93.76%, 73.71%, 80.32%) relative to VAR; by (93.57%, 99.47%, 94.47%, 94.21%, 93.81%) relative to SVM; and by (94.12%, 99.59%, 94.62%, 94.54%, 95.16%) relative to KNN.

Our findings show that when all features were used together, XGBoost performed better than with history features alone, at rates of (3.96%, 12.92%, 6.65%, 6%, 2.85%). Similarly, it performed (2.85%, 12.38%, 6.72%, 6.14%, 2.62%) better than when history was combined with weekend features and (1.31%, 10.6%, 6.41%, 5.48%, 1.31%) better than when history was combined with weather features.

*(2) Deep Learning Models*. From Table 13, our findings reveal that when CNN-LSTM was fitted with all features together, it performed better at rates of (1.41%, 2.34%, 1.28%, 8.82%, 7.98%) than when only history features were used; likewise, it was (4.04%, 8.24%, 4.28%, 8.17%, 12.60%) better than when history was combined with weekend features and (1.68%, 3.47%, 1.79%, 8.33%, 12.97%) better than when history was combined with weather features.

In comparison to all other models, CNN-LSTM had a much lower evaluation error. It decreased the error by (36.80%, 23.39%, 31.18%, 46.46%, 29.66%) against LSTM; by (37.03%, 53.09%, 35.41%, 74.91%, 37.21%) against GRU; by (36.939%, 58.25%, 31.52%, 88.65%, 42.98%) against CNN; and by (38.79%, 59.17%, 36.02%, 88.52%, 31.97%) against MLP.

##### 6.5. Comparison of the Results

One of our objectives is to compare the accuracy of various machine learning and deep learning models in predicting the future number of car-sharing transactions. After analysing the results obtained in our study, the following are our findings:

###### 6.5.1. Machine Learning

First, regarding the results obtained with the machine learning models, the comparison based on the evaluation metrics shows that XGBoost gave the best results, followed by VAR, SVR, and KNN. The XGBoost model had several advantages in model prediction, such as complete feature extraction, a good fitting effect, and high prediction accuracy.

Second, the SVR prediction series failed to capture random and nonlinear patterns. Hence, it failed to perform well, while XGBoost and VAR forecast series were able to capture random walk patterns.

Third, KNN performed the worst among the machine learning models because of the high number of inputs.

###### 6.5.2. Deep Learning

After comparison of results, we can deduce that CNN-LSTM generated better outcomes followed by LSTM, GRU, CNN, and MLP.

The hybrid CNN-LSTM model yielded better performance on the strength of its capability in supporting very long input sequences: the CNN component reads them as subsequences, whose outputs are then assembled by the LSTM component.

Besides the CNN-LSTM model, the long short-term memory model achieved good results on account of its ability to learn patterns from sequential data effectively.

The key difference between the gated recurrent unit model and the long short-term memory model is that GRU is less complex than LSTM, having only two gates (reset and update) while LSTM has three (input, output, and forget). Comparing the two models using the different evaluation metrics, we conclude that the LSTM model retained longer sequences in memory better than the GRU model and outperformed it in tasks requiring the modelling of long-distance relations.

The CNN produced quite impressive results because of the ability of its convolutional layer to identify patterns across time steps. Contrary to the LSTM model, the CNN model is not recurrent and only processes the data presented to it at a particular time step.

Unlike the other models, the multilayer perceptron performed worst. The model received the inputs without treating them as sequence data, which led to the loss of temporal dependencies and sequence patterns.

###### 6.5.3. Comparison between Machine Learning and Deep Learning Models

It can be inferred from the obtained results that the deep learning models outperformed all the machine learning time series prediction models. Across the different results, CNN-LSTM gave the best performance measures and achieved the most accurate predictions. Furthermore, Figure 6 shows that the dashed line of predicted values almost coincides with that of the real values, which confirms that the hybrid CNN-LSTM model generated good results.

Figure 7 shows a comparison between the two best machine learning and deep learning models of Class A, and it can be noticed that CNN-LSTM slightly outperformed the XGBoost model.

###### 6.5.4. The Computational Time

The computational time of various machine and deep learning models can be found in the following tables:

Table 14 shows that XGBoost has the fastest computational time, while SVM is the most demanding. For the deep learning models, Table 15 shows that the computational time of CNN-LSTM is longer than that of the LSTM, GRU, CNN, and MLP models.

Machine learning models exhibit faster computational times, while deep learning models take longer because of their large number of parameters and complex mathematical operations.

##### 6.6. Features Investigations

All the used time series prediction models showed that the prediction results were more accurate when we used the different feature categories together, namely, the past usage levels feature (e.g., number of car-sharing transactions), temporal information features (e.g., season, weekdays/weekend), and environmental condition features (e.g., temperature, humidity, wind speed, pressure, precipitation, weather conditions).

It also showed that the environmental condition features dominated the other features, followed by the temporal information features. The worst results were obtained when using only the number of car-sharing transactions, based on data from a Chongqing car-sharing operator.

#### 7. Conclusions

This research paper, through applying different machine learning and deep learning models to multivariate time series, aims to predict car usage and to investigate the factors that help improve the predictions’ accuracy.

The evaluation of the different machine learning and deep learning models with MAE, MSE, RMSE, MAPE, and RMSLE reveals that the hybrid model (CNN-LSTM) gives substantially smaller errors than the standalone models. The experimental results show that applying the CNN-LSTM model to the number of car-sharing transactions together with environmental condition and temporal information features yields the highest prediction accuracy. The principal idea of the hybrid model is to efficiently combine the advantages of two deep learning techniques: it exploits the ability of convolutional layers to extract useful knowledge and learn the internal representation of time series data, as well as the effectiveness of long short-term memory (LSTM) layers in remembering events over both short and very long time spans.

Furthermore, through our experimental analysis, we conclude that even though LSTM models constitute an efficient choice for car-sharing time series prediction, their usage along with additional convolutional layers provides a significant boost in enhancing the forecasting performance. Although CNN-LSTM requires high search time due to its sensitivity to various hyperparameters and its high complexity, it shows the highest forecasting accuracy and the best performance.

All the results of the used models confirm that car-rental usage is more sensitive to environmental conditions than to temporal information, which means that the impact of weather on car-rental transportation deserves more research attention. However, our work is limited to temporal features. Future studies can add more features, such as a longer time span of data and spatiotemporal variables, and expand the model to consumers’ habits [73–76].

#### Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

#### Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this study.

#### Acknowledgments

This research work was funded by the Beijing Municipal Natural Science Foundation (Grant no. 4212026) and National Science Foundation of China (Grant no. 61772075). The authors are thankful to them for their financial support.