Step 1. Obtaining Initial Destination Sets
(1) Input the historical taxicab manned trajectory data. Because these data are compared repeatedly, their storage level is set to
StorageLevel.MEMORY_ONLY.
Input the last manned trajectory of each real-time taxicab in Occupied Status. The HDFS file is loaded
into Spark as an initial RDD.
(2) Using a transformation, we obtain the similarity between the last manned trajectory of these taxicabs
and the historical manned trajectory data.
This yields a new RDD with the format (similarity, destination).
(3) Using the sortByKey(false) transformation, the potential destinations are sorted in descending order of similarity.
(4) Using the take(n) operation, we obtain the n historical taxicab manned trajectories with the highest similarity; the
destinations of these trajectories are regarded as a preliminary set.
(5) To process these data more conveniently and quickly, we change the form of the preliminary set to (destination, similarity)
by a transformation. The new set is then exported to HDFS to facilitate later filtering operations.
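The flow of Step 1 can be sketched in plain Python, with lists standing in for RDDs so that the example runs without a Spark cluster. This is only an illustration under stated assumptions: the record layout (trajectory, destination), the function names, and in particular toy_similarity are hypothetical and are not the similarity measure used in this work.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
Trajectory = Sequence[Point]


def toy_similarity(a: Trajectory, b: Trajectory) -> float:
    """Hypothetical measure: share of aligned positions where both trips match."""
    if not a or not b:
        return 0.0
    matches = sum(1 for p, q in zip(a, b) if p == q)
    return matches / max(len(a), len(b))


def initial_destination_set(
    history: List[Tuple[Trajectory, str]],
    last_trip: Trajectory,
    n: int,
    similarity: Callable[[Trajectory, Trajectory], float] = toy_similarity,
) -> List[Tuple[str, float]]:
    # Transformation step: score every historical trip -> (similarity, destination).
    scored = [(similarity(last_trip, traj), dest) for traj, dest in history]
    # Counterpart of sortByKey(false): descending order of similarity.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Counterpart of take(n): keep the n most similar trips.
    top_n = scored[:n]
    # Swap to (destination, similarity) for the later filtering stage.
    return [(dest, sim) for sim, dest in top_n]


# Made-up sample data: (trajectory, destination label).
history = [
    ([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)], "A"),
    ([(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)], "B"),
    ([(9.0, 9.0), (8.0, 8.0), (7.0, 7.0)], "C"),
    ([(0.0, 0.0), (3.0, 3.0), (4.0, 4.0)], "A"),
]
last_trip = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
preliminary = initial_destination_set(history, last_trip, n=3)
```

In a Spark job the list comprehension, the sort, and the slice would be replaced by RDD operations, and the result would be written to HDFS rather than kept in a local variable.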
Step 2. Forecast Final Destinations
(1) The preliminary set is loaded from HDFS and abstracted into an RDD; the similarities of the same potential destination
are then gathered by a transformation.
(2) Using multiple operators provided by Spark and user-defined functions, a downsized and optimized set is obtained in
the basic and advanced models.
(3) We calculate the visit frequency and the average similarity of the potential destinations in this set
and export the data in the format (potential destination, (frequency, average similarity)) to HDFS.
(4) The new set in HDFS is abstracted as an RDD. Then, through a series of transformations and
actions including user-defined functions, we implement three different types of clustering
algorithms and output the resulting representatives.
The format of the initial clustering result is ((destinations and their attributes in Cluster A), (destinations and their attributes in Cluster B))
(5) Based on a transformation and the initial clustering result, the cluster centers and the total visit frequency of each cluster are calculated
by user-defined functions.
The format of the output file is ((the cluster center and total visit frequency of Cluster A),…).
(6) We traverse each element of the RDD to count the total visit frequency. Then, the ultimate
result is exported to HDFS.
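The aggregation parts of Step 2 can likewise be sketched in plain Python. The sketch covers only the per-destination statistics and the per-cluster summaries; the clustering algorithms themselves are not reproduced, so the cluster membership below is assigned by hand. The function names, the coordinate representation of a destination, and the choice of a frequency-weighted mean as the cluster center are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Coord = Tuple[float, float]


def frequency_and_mean_similarity(
    pairs: Iterable[Tuple[Coord, float]],
) -> Dict[Coord, Tuple[int, float]]:
    """(destination, similarity) pairs -> {destination: (frequency, average similarity)}."""
    grouped: Dict[Coord, List[float]] = defaultdict(list)
    for dest, sim in pairs:  # gather the similarities of the same destination
        grouped[dest].append(sim)
    return {d: (len(s), sum(s) / len(s)) for d, s in grouped.items()}


def cluster_summary(cluster: List[Tuple[Coord, int]]) -> Tuple[Coord, int]:
    """[(destination, frequency), ...] -> (cluster center, total visit frequency).

    Assumption: the center is the frequency-weighted mean of the member coordinates.
    """
    total = sum(freq for _, freq in cluster)
    cx = sum(x * freq for (x, _), freq in cluster) / total
    cy = sum(y * freq for (_, y), freq in cluster) / total
    return (cx, cy), total


# Made-up (destination, similarity) pairs, as produced by Step 1.
pairs = [
    ((0.0, 0.0), 1.0), ((0.0, 0.0), 0.5), ((0.0, 0.0), 0.75),
    ((4.0, 0.0), 0.5), ((10.0, 10.0), 0.25),
]
stats = frequency_and_mean_similarity(pairs)

# Hand-made membership, standing in for the output of a clustering algorithm.
cluster_a = [(d, stats[d][0]) for d in [(0.0, 0.0), (4.0, 0.0)]]
cluster_b = [(d, stats[d][0]) for d in [(10.0, 10.0)]]
summaries = [cluster_summary(cluster_a), cluster_summary(cluster_b)]

# Total visit frequency over all clusters.
total_frequency = sum(freq for _, freq in summaries)
```

On an RDD the grouping would typically be expressed as a by-key aggregation, and each intermediate result would be exported to HDFS as described in the steps above.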