Research Article

A Real-Time Taxicab Recommendation System Using Big Trajectories Data

Procedure 2

Step 1. Obtaining Initial Destination Sets
() Input the historical taxicab manned trajectory data. Because these data are compared repeatedly, their storage
level is set to StorageLevel.MEMORY_ONLY so that they remain cached in memory.
Input the last manned trajectory of each real-time taxicab in Occupied Status. The HDFS file is loaded into Spark
as an initial RDD.
() Using a transformation, we compute the similarity between the last manned trajectory of each of these taxicabs
and the historical manned trajectory data.
After these operations, a new RDD in the format (similarity, destination) is obtained.
() Using the sortByKey(false) transformation, the potential destinations are sorted in descending order of similarity.
() Using the take(n) operation, we obtain the n historical manned trajectories with the highest similarity; the
destinations of these manned trajectories are regarded as a preliminary set.
() To process these data more conveniently and quickly, we change the form of this set to (destination, similarity)
by a transformation. The new set is then exported to HDFS to facilitate later filtering operations.
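The operations of Step 1 can be illustrated with a minimal plain-Python sketch that mirrors the similarity computation, the descending sort, the take(n) selection, and the final swap to (destination, similarity). It does not use Spark, and it is not the authors' implementation: the function names, the record layout (a dict with "points" and "destination"), and in particular toy_similarity are assumptions introduced only for illustration, since the similarity measure itself is not defined in this procedure.

```python
from math import hypot


def toy_similarity(a, b):
    """Hypothetical similarity: 1 / (1 + mean distance between paired points).

    Placeholder only; the paper's actual similarity measure is not given here.
    Assumes both trajectories contain at least one point.
    """
    n = min(len(a), len(b))
    # zip stops at the shorter trajectory, so exactly n pairs are summed.
    mean_dist = sum(hypot(p[0] - q[0], p[1] - q[1]) for p, q in zip(a, b)) / n
    return 1.0 / (1.0 + mean_dist)


def initial_destinations(last_trip, history, n):
    """Return the n most similar destinations as (destination, similarity) pairs."""
    # One (similarity, destination) pair per historical trajectory.
    scored = [
        (toy_similarity(last_trip, h["points"]), h["destination"]) for h in history
    ]
    # Counterpart of sortByKey(false): descending order of similarity.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Counterpart of take(n): keep the n most similar trajectories.
    top = scored[:n]
    # Swap to (destination, similarity) for the later filtering steps.
    return [(dest, sim) for sim, dest in top]
```

In a Spark job, the list comprehension, sort, and slice would presumably be replaced by RDD transformations and actions, and the resulting pairs would be written to HDFS rather than returned.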
Step 2. Forecast Final Destinations
() The exported set is loaded from HDFS and abstracted into an RDD; the similarities of the same potential
destination are then gathered by a transformation.
() Using multiple operators provided by Spark together with user-defined functions, a downsized and optimized set
is obtained in the basic and advanced models.
() We then calculate the visit frequency and the average similarity of each potential destination in this set
and export the data in the format (potential destination, (frequency, average similarity)) to HDFS.
() The new set in HDFS is abstracted as an RDD. Then, through a series of transformations and actions, including
user-defined functions, we implement three different types of clustering algorithms and output the
representatives of this set.
The initial clustering result has the format ((destinations and their attributes in Cluster A), (destinations and their attributes in Cluster B)).
() Based on a transformation of the initial clustering result, the cluster centers and the total visit frequency
of each cluster are calculated by user-defined functions.
The format of the output file is ((the cluster center and total visit frequency of Cluster A),…).
() We traverse each element of the RDD to count the total visit frequency. Then, the final result is exported
to HDFS.
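The aggregation operations of Step 2 can be sketched in plain Python as follows. The sketch covers only the gathering of similarities per destination, the (frequency, average similarity) summary, and the per-cluster center and total visit frequency; it takes the cluster assignment as given and does not implement the clustering algorithms. All function names and the data layout are hypothetical, and the use of an unweighted mean of member coordinates as the cluster center is an assumption, since the procedure does not state how the center is defined.

```python
from collections import defaultdict


def frequency_and_average(pairs):
    """Map (destination, similarity) pairs to {destination: (frequency, average similarity)}."""
    grouped = defaultdict(list)
    # Gather the similarities that belong to the same potential destination.
    for dest, sim in pairs:
        grouped[dest].append(sim)
    # Visit frequency is the number of occurrences of the destination.
    return {dest: (len(sims), sum(sims) / len(sims)) for dest, sims in grouped.items()}


def summarize_clusters(clusters):
    """Compute (center, total visit frequency) for each cluster.

    clusters: list of non-empty clusters; each member is ((x, y), frequency).
    The center is the unweighted mean of member coordinates (an assumption).
    """
    summaries = []
    for members in clusters:
        total = sum(freq for _, freq in members)
        cx = sum(point[0] for point, _ in members) / len(members)
        cy = sum(point[1] for point, _ in members) / len(members)
        summaries.append(((cx, cy), total))
    return summaries


def total_visit_frequency(summaries):
    """Sum the visit frequencies over all clusters."""
    return sum(total for _, total in summaries)
```

A frequency-weighted center would be an equally plausible choice; the unweighted mean is used here only to keep the sketch short.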