Mobile Information Systems
Volume 2018, Article ID 1243289, 17 pages
https://doi.org/10.1155/2018/1243289
Research Article

Efficient Shared Execution Processing of k-Nearest Neighbor Joins in Road Networks

Department of Software, Kyungpook National University, 2559, Gyeongsang-daero, Sangju-si, Gyeongsangbuk-do 37224, Republic of Korea

Correspondence should be addressed to Hyung-Ju Cho; hyungju@knu.ac.kr

Received 22 September 2017; Revised 12 January 2018; Accepted 12 February 2018; Published 12 April 2018

Academic Editor: Jinglan Zhang

Copyright © 2018 Hyung-Ju Cho. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

We investigate the k-nearest neighbor (kNN) join in road networks to determine the k-nearest neighbors (NNs) from a dataset S to every object in another dataset R. The kNN join is a primitive operation and is widely used in many data mining applications. However, it is an expensive operation because it combines the kNN query and the join operation. Moreover, most existing methods assume the Euclidean distance metric. We instead consider the problem of processing kNN joins in road networks, where the distance between two points is the length of the shortest path connecting them. We propose a shared execution-based approach called the group-nested loop (GNL) method that can efficiently evaluate kNN joins in road networks by exploiting grouping and shared execution. The GNL method can be easily implemented using existing kNN query algorithms. Extensive experiments using several real-life roadmaps confirm the superior performance and effectiveness of the proposed method in a wide range of problem settings.

1. Introduction

Road networks are often represented as weighted undirected graphs by placing a graph vertex at each road intersection or terminus and connecting vertices by edges that represent each segment of a road between two vertices [1–5]. The distance between two points in a road network is the length of the shortest path between them. In this study, we investigate the k-nearest neighbor (kNN) join in road networks, which combines each object in a dataset with the k objects in another dataset that are closest to it [6–8]. The kNN join is a primitive operation, which is widely used in many data mining and analytic applications, such as kNN classification, k-means clustering, sample assessment and sample postprocessing, missing value imputation, and k-distance diagrams [6–12].
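The network-distance definition above can be made concrete with a small sketch. The adjacency-list encoding and the function name are ours, not the paper's; the shortest-path computation itself is standard Dijkstra.

```python
import heapq

def network_distance(graph, src, dst):
    """Length of the shortest path between two vertices in a weighted
    undirected graph, computed with Dijkstra's algorithm.

    graph: dict mapping vertex -> list of (neighbor, edge_weight) pairs.
    Returns float('inf') when dst is unreachable from src.
    """
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, v = heapq.heappop(heap)
        if v == dst:
            return d
        if d > dist.get(v, float('inf')):
            continue  # stale heap entry
        for u, w in graph.get(v, []):
            nd = d + w
            if nd < dist.get(u, float('inf')):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return float('inf')

# toy road network: three vertices, three weighted edges
toy = {'a': [('b', 1.0), ('c', 4.0)],
       'b': [('a', 1.0), ('c', 2.0)],
       'c': [('a', 4.0), ('b', 2.0)]}
```

On the toy graph, the distance from 'a' to 'c' is 3.0 (via 'b'), shorter than the direct edge of weight 4.0, illustrating that network distance follows the shortest path rather than any single edge.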

Figure 1 presents an example of a kNN join in a road network. For convenience, we assume that the outer objects denote hotels, represented by rectangles, and the inner objects denote tourist attractions, represented by triangles. In this example, we can consider a kNN join query for tourists, which could be "find the two tourist attractions closest to each hotel." Outer objects and inner objects typically correspond to points of interest in different categories, such as hotels, tourist attractions, and hospitals.

Figure 1: Example of a kNN join in a road network.

The kNN join is an expensive operation because it combines the kNN query and the join operation. A simple solution to the kNN join of datasets R and S scans S once for each object in R while computing the distance between each pair of objects from R and S. This simple solution is not feasible for large datasets due to its O(|R| × |S|) complexity. Therefore, many studies have been performed to improve the efficiency of the kNN join [6–12]. Most of these previous studies focused on processing the kNN join in the Euclidean space, mainly by designing elegant indexing techniques to avoid scanning the entire dataset repeatedly and to prune as many distance computations as possible. However, few studies have considered processing the kNN join in road networks.
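The simple nested-loop solution described above can be sketched as follows. The function name is ours, and dist stands for any distance function (in the paper's setting, the network distance); it scans all of S once per outer object.

```python
def nested_loop_knn_join(R, S, k, dist):
    """Naive kNN join: for every outer object r, scan all of S and keep
    the k inner objects with the smallest distance to r.
    Cost is O(|R| * |S| log |S|) with this sort-based inner scan."""
    result = []
    for r in R:
        knn = sorted(S, key=lambda s: dist(r, s))[:k]  # full scan of S
        result.extend((r, s) for s in knn)
    return result
```

For example, with R = [0, 10], S = [1, 2, 9, 12], k = 2, and absolute difference as the distance, the join returns the pairs (0, 1), (0, 2), (10, 9), and (10, 12); note the result cardinality is k × |R|, matching the property stated in Section 3.1.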

In this study, we address the problem of implementing the kNN join operator in road networks by proposing a shared execution-based approach called the group-nested loop (GNL) method, which comprises the following three steps. In the first step, the outer objects in a dataset are grouped and then converted into a set of outer segments, where each outer segment connects adjacent outer objects. A method for grouping adjacent outer objects into an outer segment is presented in Section 4.1. In the second step, at most two kNN queries are evaluated for the outer segment. In the last step, the kNN join operation based on shared execution is performed for the outer objects in the outer segment. The GNL method is efficient for the following reasons. (1) It inherits the strength of shared execution processing by avoiding the evaluation of redundant kNN queries. (2) It can effectively group neighboring outer objects and consider them as a whole while pruning away unpromising network traversals. (3) It does not require any materialized structures, such as precomputing network Voronoi polygons [13, 14] or precomputing the network distance between every pair of vertices [15]. (4) It can be implemented easily using existing kNN query algorithms (e.g., INE [16], DisBrw [17], ROAD [2], and G-tree [5]), which is highly desirable in practice.

We summarize the main contributions of this study as follows:

(i) We propose the GNL method for processing kNN joins efficiently in road networks. To the best of our knowledge, this is the first attempt to evaluate kNN joins efficiently in road networks.

(ii) The GNL method is intuitive and straightforward to implement, thereby allowing its simple integration with existing kNN query processing methods [1, 2, 5, 13, 16–19]. The GNL method employs an optimized number of kNN queries to evaluate kNN joins while exploiting grouping and shared execution.

(iii) We conduct extensive experiments with different setups to demonstrate the superior performance of the GNL method compared with conventional solutions.

The remainder of this paper is organized as follows: In Section 2, we review related research. In Section 3, we provide some background knowledge. In Section 4, we present the basic GNL method for processing kNN joins in road networks. In Section 5, we present the GNL method to avoid redundant kNN queries for processing kNN joins efficiently in road networks. In Section 6, we present empirical comparisons of the GNL method and conventional solutions with different setups. Finally, we discuss our conclusions in Section 7.

2. Related Work

2.1. kNN Search in Road Networks

The processing of kNN queries in road networks has been studied extensively [2, 3, 5, 16, 17, 19, 20]. Papadias et al. [16] introduced incremental Euclidean restriction (IER) and incremental network expansion (INE). IER exploits the Euclidean restriction principle in road networks to achieve better performance. INE conducts network expansion from the query location in a similar manner to Dijkstra's algorithm and examines the data objects in the sequence in which they are encountered. Shahabi et al. [3] developed an embedding technique for transforming a road network to a constraint-free high-dimensional Euclidean space to approximately retrieve the nearest objects using traditional Euclidean-based algorithms. Kolahdouzan and Shahabi [13, 14] utilized first-degree network Voronoi diagrams to partition the spatial network into network Voronoi polygons (NVPs), one for each data object. They indexed the NVPs using a spatial access method to reduce the problem to a point location problem in the Euclidean space; precomputing the NVPs minimizes the online network distance computation. Huang et al. [18] addressed the same problem using the island approach, where each vertex is associated with all of the data points, called islands, whose given radius covers the vertex. In their approach, they utilized restricted network expansion from the query point using the precomputed islands.
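The INE strategy of [16] can be illustrated with a minimal sketch. This simplification is ours: it assumes data objects are located at vertices (the original INE also handles objects on edges), and the names are not from the paper.

```python
import heapq

def ine_knn(graph, objects, q, k):
    """INE-style kNN sketch: expand from the query vertex q in order of
    network distance (as in Dijkstra's algorithm) and report data
    objects in the order their vertices are reached.

    graph:   dict vertex -> list of (neighbor, weight)
    objects: dict vertex -> list of object ids located at that vertex
             (simplifying assumption; real INE allows objects on edges)
    Returns up to k (object_id, distance) pairs in expansion order.
    """
    found = []
    dist = {q: 0.0}
    heap = [(0.0, q)]
    visited = set()
    while heap and len(found) < k:
        d, v = heapq.heappop(heap)
        if v in visited:
            continue
        visited.add(v)
        for obj in objects.get(v, []):  # objects at v are at distance d
            if len(found) < k:
                found.append((obj, d))
        for u, w in graph.get(v, []):
            nd = d + w
            if nd < dist.get(u, float('inf')):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return found

# tiny network: objects s3 at 'b' (distance 1), s1 at 'c' (3), s2 at 'd' (5)
g = {'a': [('b', 1.0), ('d', 5.0)],
     'b': [('a', 1.0), ('c', 2.0)],
     'c': [('b', 2.0)],
     'd': [('a', 5.0)]}
objs = {'b': ['s3'], 'c': ['s1'], 'd': ['s2']}
```

Because expansion proceeds in increasing distance order, the search can stop as soon as k objects are found, without exploring the remainder of the network.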

Huang et al. [20] and Samet et al. [17] proposed two different algorithms to address the drawbacks of data object-dependent precomputation. Huang et al. [20] introduced S-GRID for partitioning the spatial network into disjoint subnetworks and precomputing the shortest path for each pair of connected border points. To find the k-nearest neighbors (NNs), network expansion is first performed within the subnetworks before outer expansion between the border points using the precomputed information. Samet et al. [17] proposed a distance browsing (DisBrw) method where they associate a label with each edge to represent all of the vertices with a shortest path starting with a particular edge. They use these labels to traverse the shortest path quadtrees that facilitate geometric pruning to find the network distance between objects.

Lee et al. [2] proposed a new system framework called ROAD for processing location-dependent spatial queries. ROAD organizes a large road network as a hierarchy of interconnected regional subnetworks, each of which is augmented with shortcuts and object abstracts to accelerate network traversal and facilitate rapid object lookups, respectively. Inspired by the R-tree [21], Zhong et al. [5] proposed a height-balanced index called G-tree in road networks, which employs graph partitioning to efficiently compute the network distances through a hierarchy of subgraphs. Abeywickrama et al. [1] performed a thorough experimental evaluation of several kNN algorithms for road networks, where they showed that G-tree [5] typically outperformed INE [16], DisBrw [17], and ROAD [2], but not in all cases, thereby demonstrating the impact of an efficient implementation. Finally, other previous studies [15, 16] also processed the distance joins in road networks. For example, an ordered distance join operation returns a set of object pairs, which are reported in increasing order of the distance between the pairs. However, it is not appropriate to extend the existing solutions to solve our problem due to differences in problem definition.

2.2. kNN Join in the Euclidean Space

The processing of kNN join queries in the Euclidean space has been studied extensively [6–10, 12]. Böhm and Krebs [6] developed an R-tree-based method called a multipage index (MuX) to evaluate kNN joins in the Euclidean space. The MuX method organizes the input datasets with large-sized pages in order to reduce the I/O cost. The computational cost is further reduced by carefully designing a secondary structure with a much smaller size within the pages. Xia et al. [7] developed a grid partitioning-based approach called Gorder (an order based on a grid) to evaluate kNN joins in the Euclidean space. Gorder is a block-nested loop join method that employs sorting, join scheduling, and distance computation filtering and reduction to decrease both the I/O and CPU costs. Yu et al. [8] developed an index-based kNN join method called iJoin by using iDistance [22] as the underlying index structure. By splitting the two input datasets into individual sets of partitions, the iJoin method employs a B+-tree to hold the objects in each dataset using the iDistance technique and it evaluates the kNN joins based on the properties of the B+-tree. Yao et al. [10] developed z-kNN as a Z-order-based method to evaluate kNN joins in large relational databases without changing the database engine, thereby allowing the query optimizer to understand and generate the best query plan. The z-kNN method transforms the kNN join operation into a set of kNN search operations where each object in R is a query point. Recently, Lu et al. [9] and Zhang et al. [12] developed novel algorithms using MapReduce to efficiently perform parallel kNN joins on large datasets. However, due to the different problem environments, it is not possible to apply these solutions based on the Euclidean distance to the kNN join problem in road networks.
Finally, Li and Taniar [23] presented a taxonomy for distance-based spatial join queries, which are divided into the following three main categories: (1) all-range join (e.g., [24, 25]), (2) all-kNN join (e.g., [6, 26]), and (3) all-reverse NN join. Our study belongs to the all-kNN join category, and to the best of our knowledge, our current study represents the first attempt to evaluate kNN joins efficiently in road networks.

3. Preliminaries

In Section 3.1, we formally define the kNN join problem in road networks and explain its properties. In Section 3.2, we define the terms and notations used in this study.

3.1. kNN Join

We consider outer objects and inner objects in a road network, where the outer objects and inner objects typically correspond to points of interest in different categories, as shown in Figure 1. Given two points p1 and p2, dist(p1, p2) denotes the network distance between p1 and p2, which is the length of the shortest path connecting them in the road network.

Definition 1 (kNN search). Given an integer k, an outer object r ∈ R, and a set of inner objects S, the kNNs of r from S, denoted as kNN(r, S), constitute a set of k inner objects from S such that dist(r, s) ≤ dist(r, s′) holds for every s ∈ kNN(r, S) and every s′ ∈ S − kNN(r, S).

Definition 2 (kNN join). Given an integer k and datasets R and S, the kNN join of R and S returns ordered pairs ⟨r, s⟩ of two objects r and s such that s is one of the kNNs of r, where r ∈ R and s ∈ S. For simplicity, kNN(r, S) is abbreviated as kNN(r). Formally, the kNN join of R and S is the set {⟨r, s⟩ | r ∈ R and s ∈ kNN(r)}.

The kNN join has the following properties:

(i) The kNN join is not commutative; that is, the kNN join of R and S generally differs from the kNN join of S and R. Without loss of generality, we fix R as the outer dataset and S as the inner dataset in this study.

(ii) The cardinality of the result set of a kNN join is k × |R| because the kNN join returns the k inner objects that are closest to each outer object in R.

(iii) The traversal distances from an outer object to its kNNs are not known in advance, in contrast to the range join.

3.2. Definition of Terms and Notations
3.2.1. Road Network

A road network can be modeled using a weighted undirected graph G = (V, E, W), where V, E, and W indicate the vertex set, the edge set, and the edge distance matrix, respectively. Each edge in E has a nonnegative weight that represents the network distance.

3.2.2. Classification of Vertices

Vertices can be divided into three categories based on their degree. (1) If the degree of a vertex is larger than or equal to 3, the vertex is referred to as an intersection vertex. (2) If the degree is 2, the vertex is an intermediate vertex. (3) If the degree is 1, the vertex is a terminal vertex.
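The degree-based classification above is direct to implement. This sketch uses our own adjacency-list encoding; the 'isolated' label for degree-0 vertices is our addition, since the paper's three categories do not cover that case.

```python
def classify_vertex(graph, v):
    """Classify a vertex by its degree: intersection (degree >= 3),
    intermediate (degree == 2), or terminal (degree == 1).

    graph: dict vertex -> list of (neighbor, weight) pairs.
    """
    degree = len(graph.get(v, []))
    if degree >= 3:
        return 'intersection'
    if degree == 2:
        return 'intermediate'
    if degree == 1:
        return 'terminal'
    return 'isolated'  # degree 0; outside the paper's three categories

# small example: 'c' has degree 3, 'b' degree 2, 'a' and 'e' degree 1
star = {'c': [('a', 1), ('b', 1), ('d', 1)],
        'a': [('c', 1)],
        'b': [('c', 1), ('e', 1)],
        'e': [('b', 1)],
        'd': [('c', 1)]}
```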

3.2.3. Vertex Sequence and Outer Segment

A vertex sequence denotes a path between two vertices v_a and v_b such that v_a and v_b are each either an intersection vertex or a terminal vertex and the other vertices in the path are intermediate vertices. The length of a vertex sequence is the total weight of the edges in the vertex sequence. An outer segment denotes a path that connects outer objects in the same vertex sequence. Table 1 summarizes the notations used in the study. To simplify the presentation, where it causes no confusion, we denote an outer segment by its end outer objects, which lie in the same vertex sequence.

Table 1: Symbols and their meanings.

Figure 2 shows the difference between the distance and the segment length of two objects in a road network, where the numbers on the edges indicate the distance between two adjacent points. The distance between the two objects is the length of the shortest path connecting them, whereas their segment length is the length of the segment connecting them within their common vertex sequence, which may be longer. We recall that the segment length is defined if and only if the two objects are located in the same vertex sequence.

Figure 2: The distance and the segment length between two objects.

4. Basic GNL Method for kNN Joins in Road Networks

In Section 4.1, we describe the grouping of outer objects in a vertex sequence. In Section 4.2, we explain the shared execution processing of outer objects in a segment. In Section 4.3, we provide algorithms for processing kNN joins efficiently in road networks. Finally, in Section 4.4, we discuss the evaluation of an example kNN join in a road network.

4.1. Grouping of Outer Objects in a Vertex Sequence

Figure 3 presents the example of the kNN join in a road network that we consider in this section, where there are six outer objects and four inner objects in a road network with three vertex sequences. For simplicity, we consider a 2-NN join, that is, k = 2.

Figure 3: Example of kNN join in a road network.

Figure 4 shows a sample grouping of the outer objects in each vertex sequence, where three adjacent outer objects in one vertex sequence are grouped into one outer segment and two adjacent outer objects in another vertex sequence are grouped into a second outer segment. Note that the outer segments are marked by bold lines in Figure 4. Therefore, the set of outer objects R is transformed into the set of outer segments generated from the outer objects in R.

Figure 4: Grouping outer objects in a vertex sequence.
Algorithm 1: Group_outer_objects .

Algorithm 1 describes how to group neighboring outer objects into an outer segment in a vertex sequence. This algorithm comprises two steps. In the first step, a set of vertex sequences is generated from the vertex set V and the edge set E. In the second step, a set of outer segments is generated from the set of outer objects R and the set of vertex sequences. We first examine each vertex v to determine whether it is an intersection vertex or a terminal vertex. If v is either an intersection vertex or a terminal vertex, then for each edge adjacent to v, we explore the path from v until a nonintermediate vertex is found, and a new vertex sequence is added to the set of vertex sequences. We then search for outer objects in each vertex sequence; the outer objects found in a vertex sequence are grouped into an outer segment, which is added to the set of outer segments.
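The first step of the grouping algorithm, extracting vertex sequences, can be sketched as follows. The encoding is ours, and the sketch makes a simplifying assumption: cycles composed entirely of intermediate vertices are ignored, since they have no nonintermediate endpoint to start from.

```python
def vertex_sequences(graph):
    """Extract vertex sequences: maximal paths whose two endpoints are
    nonintermediate vertices (degree != 2) and whose interior vertices
    are all intermediate (degree == 2).

    graph: dict vertex -> list of (neighbor, weight) pairs.
    Returns a list of vertex lists, one per vertex sequence.
    """
    def degree(v):
        return len(graph[v])

    seqs, seen_edges = [], set()
    for v in graph:
        if degree(v) == 2:
            continue  # start only from intersection/terminal vertices
        for u, _ in graph[v]:
            if (v, u) in seen_edges:
                continue  # this sequence was found from its other end
            path, prev, cur = [v], v, u
            while degree(cur) == 2:  # walk through intermediate vertices
                path.append(cur)
                nxt = next(w for w, _ in graph[cur] if w != prev)
                prev, cur = cur, nxt
            path.append(cur)
            # mark the first and last edges so the reverse walk is skipped
            seen_edges.add((path[0], path[1]))
            seen_edges.add((path[-1], path[-2]))
            seqs.append(path)
    return seqs

# terminal t1 - intermediates m1, m2 - intersection x - terminals t2, t3
chain = {'t1': [('m1', 1)],
         'm1': [('t1', 1), ('m2', 1)],
         'm2': [('m1', 1), ('x', 1)],
         'x':  [('m2', 1), ('t2', 1), ('t3', 1)],
         't2': [('x', 1)],
         't3': [('x', 1)]}
```

On the example graph, three vertex sequences are found: one of four vertices (t1 to x through m1 and m2) and two of two vertices each (x to t2 and x to t3). The second step of Algorithm 1 would then group the outer objects lying on each sequence into an outer segment.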

4.2. Shared Execution Processing of Outer Objects in a Segment

The shared execution strategies in the basic GNL method are motivated by the observation that at most two kNN queries are sufficient to retrieve the kNNs of all outer objects in an outer segment. This observation is formalized in Lemma 1, which states a simple but important fact regarding the shared execution processing of outer objects in an outer segment. If we evaluate two kNN queries at the two outer objects located at the ends of an outer segment, then we can determine the kNNs of the other outer objects in the segment without evaluating additional kNN queries for them. Thus, the two kNN queries at the end objects are sufficient to retrieve the kNNs of all the other outer objects in a segment.

Lemma 1. Let r_s and r_e be the outer objects located at the two ends of an outer segment. For every outer object r in the segment, it holds that kNN(r, S) ⊆ kNN(r_s, S) ∪ kNN(r_e, S) ∪ S_seg, where kNN(r, S) is the set of k inner objects closest to the outer object r and S_seg is the set of inner objects located in the outer segment (e.g., in Figure 4).

Proof. We prove Lemma 1 by contradiction, where r_s and r_e denote the outer objects at the two ends of the outer segment and S_seg denotes the set of inner objects located in the segment. If the lemma does not hold, then there is an inner object s ∈ kNN(r, S) such that s ∉ kNN(r_s, S) ∪ kNN(r_e, S) ∪ S_seg. Because s ∉ kNN(r_s, S), s is more distant from r_s than the kth closest inner object of r_s. Similarly, because s ∉ kNN(r_e, S), s is more distant from r_e than the kth closest inner object of r_e. However, s ∉ S_seg, so s is not located in the outer segment. Therefore, the shortest path from r to s passes through either r_s or r_e, and the distance from r to s is determined by dist(r, s) = min(dist(r, r_s) + dist(r_s, s), dist(r, r_e) + dist(r_e, s)). From the former conditions, if the shortest path passes through r_s, every inner object in kNN(r_s, S) is closer to r than s, and if it passes through r_e, every inner object in kNN(r_e, S) is closer to r than s; in either case, at least k inner objects are closer to r than s. Hence, s cannot belong to kNN(r, S), which contradicts the assumption that an inner object s ∈ kNN(r, S) exists such that s ∉ kNN(r_s, S) ∪ kNN(r_e, S) ∪ S_seg.

We determine the k inner objects that are closest to an outer object r by using the two kNN sets kNN(r_s, S) and kNN(r_e, S) of the end objects r_s and r_e, respectively. First, we investigate the distance from the outer object r to an inner object s. In Figure 5, we assume that r corresponds to the origin of the XY coordinate system. If there is a path from r to s through r_s, then the distance along it is computed as dist(r, r_s) + dist(r_s, s), as shown in Figure 5(a). Similarly, if there is a path through r_e, then the distance along it is computed as dist(r, r_e) + dist(r_e, s), as shown in Figure 5(b). If the inner object s is located in the outer segment, then the distance along the segment is computed as the segment length between r and s, as shown in Figure 5(c). Note that min returns the minimum of the values in the input array. Thus, dist(r, s) is the length of the shortest among the considered paths between r and s, and it is computed as the minimum of the values above, where the segment length is considered only when s is located in the outer segment.
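The minimum-over-paths rule can be sketched as a small helper. The parameter names are ours: the distances from the object to the two segment ends, the distances from the ends to the candidate (infinity when the candidate is not in an end's kNN set), and the optional segment length.

```python
def candidate_distance(d_r_rs, d_r_re, d_rs_s, d_re_s, seg_r_s=None):
    """Distance from an outer object r inside a segment to a candidate
    inner object s, taken as the minimum over the possible paths:
    via the segment end r_s, via the end r_e, or along the segment
    itself when s is located in it.

    d_r_rs, d_r_re: distances from r to the segment ends r_s and r_e
    d_rs_s, d_re_s: distances from the ends to s; pass float('inf')
                    when s is absent from an end's kNN set
    seg_r_s:        segment length from r to s, or None when s is not
                    located in the outer segment
    """
    paths = [d_r_rs + d_rs_s, d_r_re + d_re_s]
    if seg_r_s is not None:
        paths.append(seg_r_s)
    return min(paths)
```

For instance, with d_r_rs = 2, d_r_re = 3, d_rs_s = 4, and d_re_s = 1, the path through r_e wins with distance 4; supplying an in-segment length of 3.5 makes the segment path the minimum instead.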

Table 2 explains how to compute the distance from the outer object r to an inner object s, where r_s and r_e are the end objects of the outer segment. An inner object s may belong to any nonempty combination of kNN(r_s, S), kNN(r_e, S), and S_seg, so seven possible cases are considered in total. Clearly, it is trivial to retrieve the set of inner objects located in the outer segment compared with retrieving the set of k inner objects closest to an outer object. Note that the value computed in Table 2 may not necessarily be the length of the shortest path from r to s, as discussed in Section 4.4.

Figure 5:
Table 2: Computation of the distance from an outer object to a candidate inner object.
Algorithm 2: Basic_GNL .
4.3. kNN Join Algorithm

Algorithm 2 describes the basic GNL algorithm, which employs the grouping of outer objects and shared execution to reduce the processing time. This algorithm comprises two steps. In the first step, the outer objects in a vertex sequence are grouped into an outer segment (line 3). In the second step, the kNNs of the outer objects in the outer segment are retrieved using the shared execution method (line 6), as explained in Algorithm 3. Each ordered pair comprising an outer object and one of its kNN inner objects is then added to the partial join result. Finally, the kNN join result set is returned after all the outer segments have been processed (line 8).

Algorithm 3: .

Algorithm 3 describes the kNN join for all outer objects in an outer segment. Three general cases are considered depending on the number of outer objects in the segment: the segment contains exactly one outer object, exactly two outer objects, or three or more outer objects. If the segment contains one outer object, then a kNN query from that object is evaluated. In this study, INE [16] is employed to evaluate the kNN query. Naturally, INE can be replaced by other kNN algorithms [1, 2, 5, 13, 17–19]. A partial join result for the outer object is generated from its kNN set and returned (lines 1–4). Similarly, if the segment contains two outer objects, then two kNN queries from the two objects are evaluated; their partial join results are generated from their kNN sets, and the union of the partial join results is returned (lines 5–10). Finally, if the segment contains three or more outer objects, then two kNN queries from the two end objects are evaluated and a search for the inner objects located in the segment is performed. The partial join results for the two end objects are generated from their kNN sets. For each remaining outer object in the segment, the set of its kNNs is retrieved from the two end kNN sets and the set of inner objects located in the segment (lines 17–19), as explained in Algorithm 4. Finally, the union of the partial join results is returned.

Algorithm 4: .

Algorithm 4 describes the kNN search for an outer object r. First, the set of kNNs of r is initialized to the empty set. The distance from r to each candidate inner object s is computed according to the conditions in Table 2 (lines 3–10). After computing the distance, we can determine whether s is added to the candidate set. If the candidate set contains fewer than k inner objects, then s is simply added to it (lines 12-13). If the candidate set contains k inner objects and s is closer to r than the kth closest inner object, then s is added to the set and the kth closest inner object is removed accordingly (lines 14-15). The set of the kNNs of r is returned after all of the candidate inner objects have been considered (line 16).
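The bounded candidate-set update described above can be sketched with a max-heap keyed on distance, so each new candidate is compared only against the current kth NN. The heap is our implementation choice; the paper does not specify the underlying data structure.

```python
import heapq

def knn_from_candidates(r, candidates, k, dist):
    """Sketch of the per-object kNN search over a candidate set:
    maintain the k closest candidates in a max-heap (negated distances),
    evicting the current kth NN whenever a closer candidate appears.
    Returns (candidate, distance) pairs sorted by increasing distance."""
    heap = []  # entries are (negated distance, tie-breaker, candidate)
    for i, s in enumerate(candidates):
        d = dist(r, s)
        if len(heap) < k:
            heapq.heappush(heap, (-d, i, s))
        elif d < -heap[0][0]:           # closer than the current kth NN
            heapq.heapreplace(heap, (-d, i, s))
    return [(s, -nd) for nd, _, s in sorted(heap, reverse=True)]
```

For example, with r = 10, candidates [1, 2, 9, 12], k = 2, and absolute difference as the distance, the two closest candidates are 9 (distance 1) and 12 (distance 2).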

4.4. Evaluation of an Example kNN Join Using the Basic GNL Method

We now discuss the evaluation of the kNN join example shown in Figure 3, where k = 2. Table 3 summarizes the computation of the example kNN join.

Table 3: Computation of the example kNN join using the basic GNL method.

As explained in Algorithm 2, the basic GNL method first groups the outer objects in each vertex sequence into an outer segment; the resulting outer segments are shown in Figure 4. We then process the outer segments in order. As explained in Algorithm 3, each outer segment is processed according to the number of outer objects it contains. The first outer segment contains three outer objects, so two kNN queries are issued from its two end objects and the inner objects located in the segment are retrieved, as shown in Table 3. The kNN sets of the two end objects directly generate their partial join results.

According to Lemma 1, we can retrieve the kNNs of the middle outer object in the first segment from its set of candidate inner objects without issuing a kNN query for it. Thus, we need to compute the distance from this outer object to each candidate inner object, as shown in Figures 6(a)–6(c) and Table 3. It should be noted that, for one candidate, the actual shortest path differs from the path assumed in Table 2, and thus the computed value overestimates the true shortest distance. However, this does not affect the correctness of the set of the kNNs of the middle outer object because that candidate does not belong to its kNNs, so the shorter path does not have to be considered. Consequently, the kNNs of the middle outer object are obtained from the candidate distances, which generates its partial join result.

Figure 6: Computation of the distance from an outer object to its candidate inner objects.

Next, we compute the partial join results for the outer objects in the second outer segment. This segment contains only two outer objects, so two kNN queries are issued from them, and, as shown in Table 3, their kNN sets directly generate their partial join results. Finally, we compute the partial join result for the remaining outer object, for which a single kNN query is simply issued; as shown in Table 3, its kNN set generates its partial join result. The kNN join result is then obtained as the union of the partial join results for all six outer objects.

5. GNL Method for kNN Joins in Road Networks

5.1. Avoiding Redundant kNN Queries

We present the GNL method to evaluate a smaller number of kNN queries than the basic GNL method. To this end, we investigate the number of outer segments in vertex sequences that are adjacent to an intersection vertex. Let n(v) be the number of outer segments in the vertex sequences that are adjacent to an intersection vertex v. If n(v) = 0, no kNN query is issued at v. If n(v) = 1, a kNN query is evaluated at the end object of that outer segment that is closer to v. If n(v) ≥ 2, a kNN query is evaluated at v instead of at the nearby segment end objects. Figure 7 illustrates a simple example of the shared query processing used to reduce the number of kNN queries when evaluating the kNN join. Observe that there are two intersection vertices, both of which are adjacent to three vertex sequences. In Figure 7(a), only one outer segment lies in a vertex sequence adjacent to the intersection vertices, so two kNN queries are evaluated at its two end objects. However, in Figure 7(b), two outer segments lie in vertex sequences that are both adjacent to the two intersection vertices. Therefore, the GNL method evaluates only two kNN queries, one at each intersection vertex, instead of evaluating four kNN queries at the four segment end objects.
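The query-placement rule described in this paragraph can be sketched as follows. The encoding is a simplification of ours: each segment is given as its two ends, where each end records the end outer object and the adjacent intersection vertex (or None when there is no shared intersection vertex at that end).

```python
from collections import Counter

def choose_query_points(segments):
    """Sketch of the GNL query-placement rule: for every intersection
    vertex v, let n(v) be the number of segment ends adjacent to v.
    When n(v) >= 2, one shared kNN query at v replaces the per-segment
    queries at the end objects next to v; otherwise the query stays at
    the end object itself.

    segments: list of ((end_obj, adj_vertex), (end_obj, adj_vertex))
              pairs, adj_vertex being None when no intersection vertex
              is shared at that end.
    Returns the set of locations at which kNN queries are evaluated.
    """
    n = Counter(v for ends in segments for _, v in ends if v is not None)
    queries = set()
    for ends in segments:
        for end_obj, v in ends:
            # share the query at v when two or more segment ends meet it
            queries.add(v if v is not None and n[v] >= 2 else end_obj)
    return queries
```

Mirroring Figure 7(b), two segments whose ends both meet intersection vertices v1 and v2 yield only two query locations (v1 and v2) instead of four; a single segment, as in Figure 7(a), keeps its queries at its own end objects.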

Figure 7: Example of shared kNN query processing in the GNL method: (a) evaluating two kNN queries at the segment end objects and (b) evaluating two kNN queries at the intersection vertices.
Algorithm 5: GNL .

Algorithm 5 describes the GNL algorithm, which employs an optimized number of kNN queries to evaluate kNN joins by exploiting grouping and shared execution. Similar to the basic GNL algorithm, this algorithm comprises two steps. In the first step, the outer objects in a vertex sequence are grouped into an outer segment (line 3). In the second step, the GNL algorithm distinguishes the cases where the outer segment contains one outer object and where it contains two or more outer objects, as shown in Figures 8 and 9, respectively, and uses shared execution to avoid redundant kNN queries.

Figure 8 shows the case where the outer segment contains only one outer object. When no query result can be shared with an adjacent outer segment, a kNN query is evaluated at the outer object itself, and its result is not reused (lines 6–9).

Figure 9 shows the case where the outer segment contains two or more outer objects, meaning that two kNN queries are required to retrieve the k inner objects closest to each outer object in the segment. As shown in Figure 9(a), when both ends of the segment are adjacent to shared intersection vertices, two kNN queries are evaluated at the two intersection vertices, and their results are reused for the other outer segments adjacent to them (lines 11–15). As shown in Figures 9(b) and 9(c), when only one end of the segment is adjacent to a shared intersection vertex, two kNN queries are evaluated, and the query result at the intersection vertex is reused for the other outer segment(s) adjacent to it (lines 16–20 and lines 21–25). Finally, as shown in Figure 9(d), when neither end is adjacent to a shared intersection vertex, two kNN queries are evaluated at the two end objects, and their results are not reused (lines 26–30).

Figure 8: Outer segment containing one outer object: cases (a)–(d).
Figure 9: Outer segment containing two or more outer objects: cases (a)–(d).
Algorithm 6: .

Algorithm 6 describes how to find the k inner objects closest to each outer object , where the kNN search of the outer object is performed using Algorithm 4.

To avoid evaluating redundant kNN queries, we present a simple heuristic so that redundant kNN queries are not evaluated at locations close to terminal vertices. For example, as shown in Figure 10, the graph has one intersection vertex and three terminal vertices. In this example, the kNN query at the intersection vertex is redundant because the kNNs of every outer object are already covered by the kNN queries evaluated near the terminal vertices.

The GNL method employs an optimized number of kNN queries to evaluate kNN joins while exploiting grouping and shared execution. For this, we consider the eight cases in Figures 8 and 9. For each case, Table 4 compares the number of kNN queries evaluated by the GNL method with the minimum number of kNN queries evaluated by a virtual optimal method. The values in parentheses indicate the locations of the kNN queries evaluated by the GNL and virtual optimal methods. In Figure 8(a), the GNL algorithm evaluates three kNN queries, one of which is redundant because the results of the other two can be reused to find the k inner objects closest to the outer object. This shows that the GNL algorithm does not always use the minimum number of kNN queries to evaluate the kNN join, as previously mentioned. However, for Figures 8(b) to 8(d) and 9(a) to 9(d), the GNL method evaluates the same number of kNN queries as the virtual optimal method.

Figure 10: An example in which the kNN query at the intersection vertex is redundant.
Table 4: Numbers of kNN queries evaluated by the GNL method and the virtual optimal method.
5.2. Evaluation of an Example kNN Join Using the GNL Method

Returning to the example kNN join in Figure 3, we reevaluate the kNN join to show that the GNL method evaluates a smaller number of kNN queries than the basic GNL method. We recall that the basic GNL method evaluates five kNN queries to compute the kNN join, as described in Section 4.4. In contrast, the GNL method evaluates only three kNN queries: one at each of the two intersection vertices and one at the isolated outer object. Specifically, each kNN query at an intersection vertex replaces the two kNN queries that the basic GNL method issues at the segment end objects adjacent to that vertex.

We process the outer segments in the same order as before and retrieve the kNNs of each outer object in a segment from its set of candidate inner objects. First, we retrieve the two NNs of each outer object in the first segment by computing the distance from the object to each of its candidate inner objects, as shown in Table 5, which generates the corresponding partial join results. In the same manner, we can compute the distances from the outer objects in the other segments to their candidate objects and determine the two inner objects closest to each of them; Table 6 summarizes these computations. Note that the two inner objects closest to the isolated outer object are retrieved directly from the kNN query at that object. Finally, using the GNL method, we obtain the same kNN join result as with the basic GNL method.

Table 5: Computation of the example kNN join using the GNL method.
Table 6: Summary of computing the example kNN join.

6. Performance Evaluation

In this section, we present our empirical analysis of the GNL method, where the experimental settings are given in Section 6.1 and the experimental results in Section 6.2.

6.1. Experimental Settings

In the experiments, we use three real-life roadmaps freely available in [27]. The first road map comprising 175,813 vertices and 179,179 edges includes major roads (e.g., highways) in North America (NA). The second road map comprising 174,956 vertices and 223,001 edges includes major roads (e.g., city streets) in San Francisco (SF), California. The third road map comprising 18,263 vertices and 23,874 edges includes major roads (e.g., city streets) in San Joaquin County (SJ), California. The numbers of vertex sequences in NA, SF, and SJ are 12,416, 192,276, and 20,040, respectively. The settings of the experimental parameters are given in Table 7.

Table 7: Experimental parameter settings.

The positions of both the outer and inner objects follow either a centroid or a uniform distribution. The centroid dataset is generated to resemble real-world data: 10 centroids are selected randomly, and the objects around each centroid follow a Gaussian distribution whose mean is the centroid and whose standard deviation is 1% of the side length of the data universe. In each experiment, we vary a single parameter within the range shown in Table 7 while keeping the other parameters at the default values shown in bold. The outer and inner objects follow a centroid distribution unless stated otherwise.
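The centroid dataset generation described above can be sketched as follows. This is a minimal sketch under stated assumptions: the side length of the data universe is not specified in this section, so the `side` default is arbitrary, and the function name and parameters are illustrative only.

```python
import random

def generate_centroid_dataset(n, side=10_000.0, n_centroids=10, seed=None):
    """Generate n points under the centroid distribution described above:
    n_centroids centroids are drawn uniformly at random over a square data
    universe, and each point is drawn from a Gaussian centered on a randomly
    chosen centroid with a standard deviation of 1% of the side length."""
    rng = random.Random(seed)
    centroids = [(rng.uniform(0, side), rng.uniform(0, side))
                 for _ in range(n_centroids)]
    sigma = 0.01 * side  # standard deviation = 1% of the side length
    return [(rng.gauss(cx, sigma), rng.gauss(cy, sigma))
            for cx, cy in (rng.choice(centroids) for _ in range(n))]
```

A uniform dataset would instead draw both coordinates directly with `rng.uniform(0, side)`, which is the other distribution used in the experiments.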

We implement and evaluate two versions of the GNL method, that is, the basic GNL and GNL methods. As a benchmark for our proposed method, we use a baseline method that computes the kNNs of every outer object using INE [16]. The three methods are implemented in C++ with the maximize speed option in Visual Studio 2015 and run on a desktop PC with the Windows 10 operating system, 32 GB of RAM, and a quad-core processor (i7-6700K) at 4 GHz. We keep the indexing structures of all techniques memory resident to ensure responsive query processing, an assumption made in many recent studies [1, 4] that is crucial to online map services and commercial navigation systems. We report average values over 10 experimental runs.

6.2. Experimental Results

Figure 11 compares the query processing times of the baseline, basic GNL, and GNL methods when evaluating kNN joins in the NA roadmap, where each chart illustrates the effect of changing one of the parameters in Table 7. The values in parentheses indicate the numbers of kNN queries evaluated by the basic GNL and GNL methods to compute the kNN joins. The corresponding numbers for the baseline method are omitted because that method always evaluates one kNN query per outer object. Figure 11(a) shows the query processing time as a function of the number of requested NNs, that is, k. The processing times of all the methods increase slightly with k. However, the GNL method shows the best performance, and its processing times are up to 80 times shorter than those of the baseline method in all cases. During the shared execution, the basic GNL method evaluates only two kNN queries to retrieve the kNNs of all the outer objects in a segment, which decreases the processing time significantly. The baseline, basic GNL, and GNL methods evaluate a total of 50,000, 1,509, and 743 kNN queries, respectively; that is, the basic GNL and GNL methods evaluate only 3% and 1.5% of the kNN queries evaluated by the baseline method, respectively. This indicates a strong relationship between the query processing times and the numbers of kNN queries evaluated to compute the kNN joins. Figure 11(b) shows the query processing time as a function of the number of outer objects, that is, . The GNL method performs better than the other methods in all cases. The number of kNN queries in the baseline method increases linearly with . However, because of the benefits of shared execution processing, the numbers of kNN queries evaluated by the basic GNL and GNL methods increase only slightly with .
Figure 11(c) shows the query processing time as a function of the number of inner objects, that is, . The GNL method performs better than the other methods in all cases. The numbers of kNN queries evaluated by the basic GNL and GNL methods are independent of . Figure 11(d) shows the query processing time for various distributions of outer objects and inner objects, where each ordered pair (i.e., , , , and ) denotes a combination of the distributions of outer objects and inner objects. The processing time is very long for the baseline method, particularly when the inner objects follow a centroid distribution (i.e., and ). The processing time is also long with the basic GNL and GNL methods for because the outer objects are widely scattered and most of the outer segments are generated with a few outer objects, which hinders the shared execution processing.

Figure 11: Comparison of the baseline, basic GNL, and GNL methods for NA: (a) varying k, (b) varying , (c) varying , and (d) varying the distribution of objects.

Figure 12 compares the query processing times of the three methods when evaluating kNN joins in the SF roadmap. Figure 12(a) shows the query processing time as a function of k between 1 and 100. The GNL method shows the best performance in all cases because it evaluates the smallest number of kNN queries among the three methods. The baseline, basic GNL, and GNL methods evaluate 50,000, 9,420, and 5,070 kNN queries, respectively. Figure 12(b) shows the query processing time as a function of between and . The processing time of the baseline method increases linearly with the value of . However, the basic GNL and GNL methods are not sensitive to increases in the value of owing to the shared execution processing. The GNL method evaluates the smallest number of kNN queries in all cases. Figure 12(c) shows the query processing time as a function of between and . The GNL method outperforms the basic GNL method in all cases because it evaluates 5,070 kNN queries compared with 9,420 for the basic GNL method. Figure 12(d) shows the query processing time for various distributions of outer and inner objects. The basic GNL and GNL methods outperform the baseline method for and . However, they show performance similar to that of the baseline method for and because the uniform distribution of outer objects obstructs the shared execution processing of the basic GNL and GNL methods.

Figure 12: Comparison of the baseline, basic GNL, and GNL methods for SF: (a) varying k, (b) varying , (c) varying , and (d) varying the distribution of objects.

Figure 13 compares the query processing times of the three methods when evaluating kNN joins in the SJ roadmap. Figure 13(a) shows the query processing time as a function of k between 1 and 50. The GNL method outperforms the other methods in all cases because it evaluates only 1,210 kNN queries, fewer than either of the other two methods. Figure 13(b) shows the query processing time as a function of between and . The processing times of the basic GNL and GNL methods increase more slowly with than that of the baseline method. Figure 13(c) shows the query processing time as a function of between and . The GNL method outperforms the other methods in all cases. Figure 13(d) shows the query processing time for various distributions of outer and inner objects. The GNL method outperforms the other methods when the outer objects follow a centroid distribution, that is, and . However, all the methods show similar performance when the outer objects follow a uniform distribution, that is, and .

Figure 13: Comparison of the baseline, basic GNL, and GNL methods for SJ: (a) varying k, (b) varying , (c) varying , and (d) varying the distribution of objects.

7. Conclusions

In this study, we investigated the kNN join problem in road networks. The kNN join combines each object of a dataset with its kNNs in another dataset and is used to facilitate data mining tasks such as clustering, classification, and outlier detection; it can also provide more meaningful query results than the range join. We proposed the GNL method as an efficient kNN join algorithm that groups adjacent outer objects and performs shared execution-based distance computations to avoid evaluating redundant kNN queries. We evaluated the performance of the GNL method on several real-life roadmaps in a wide range of problem settings. The empirical results confirmed that the GNL method is efficient, scales with the number of outer objects, and is significantly superior to the baseline method, and that it outperforms the basic GNL method, particularly when the outer objects follow a nonuniform distribution.

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

This research was supported by Kyungpook National University Bokhyeon Research Fund, 2016.

References

  1. T. Abeywickrama, M. A. Cheema, and D. Taniar, “k-nearest neighbors on road networks: a journey in experimentation and in-memory implementation,” Proceedings of the VLDB Endowment, vol. 9, no. 6, pp. 492–503, 2016.
  2. K. C. K. Lee, W.-C. Lee, B. Zheng, and Y. Tian, “ROAD: a new spatial object search framework for road networks,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 3, pp. 547–560, 2012.
  3. C. Shahabi, M. R. Kolahdouzan, and M. Sharifzadeh, “A road network embedding technique for k-nearest neighbor search in moving object databases,” GeoInformatica, vol. 7, no. 3, pp. 255–273, 2003.
  4. L. Wu, X. Xiao, D. Deng, G. Cong, A. D. Zhu, and S. Zhou, “Shortest path and distance queries on road networks: an experimental evaluation,” Proceedings of the VLDB Endowment, vol. 5, no. 5, pp. 406–417, 2012.
  5. R. Zhong, G. Li, K.-L. Tan, L. Zhou, and Z. Gong, “G-tree: an efficient and scalable index for spatial search on road networks,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 8, pp. 2175–2189, 2015.
  6. C. Böhm and F. Krebs, “The k-nearest neighbour join: turbo charging the KDD Process,” Knowledge and Information Systems, vol. 6, no. 6, pp. 728–749, 2004.
  7. C. Xia, H. Lu, B. C. Ooi, and J. Hu, “Gorder: an efficient method for kNN join processing,” in Proceedings Very Large Data Bases (VLDB), pp. 756–767, Toronto, ON, Canada, 2004.
  8. C. Yu, B. Cui, S. Wang, and J. Su, “Efficient index-based kNN join processing for high-dimensional data,” Information and Software Technology, vol. 49, no. 4, pp. 332–344, 2007.
  9. W. Lu, Y. Shen, S. Chen, and B. C. Ooi, “Efficient processing of k nearest neighbor joins using MapReduce,” Proceedings of the VLDB Endowment, vol. 5, no. 10, pp. 1016–1027, 2012.
  10. B. Yao, F. Li, and P. Kumar, “K nearest neighbor queries and kNN-joins in large relational databases (almost) for free,” in Proceedings of the International Conference on Data Engineering (ICDE), pp. 4–15, Long Beach, CA, USA, March 2010.
  11. C. Yu, R. Zhang, Y. Huang, and H. Xiong, “High-dimensional kNN joins with incremental updates,” GeoInformatica, vol. 14, no. 1, pp. 55–82, 2010.
  12. C. Zhang, F. Li, and J. Jestes, “Efficient parallel kNN joins for large data in MapReduce,” in Proceedings of the International Conference on Extending Database Technology (EDBT), pp. 38–49, Berlin, Germany, March 2012.
  13. M. R. Kolahdouzan and C. Shahabi, “Voronoi-based k nearest neighbor search for spatial network databases,” in Proceedings of the Very Large Data Bases (VLDB), pp. 840–851, Toronto, ON, Canada, August-September 2004.
  14. M. R. Kolahdouzan and C. Shahabi, “Continuous k-nearest neighbor queries in spatial network databases,” in Proceedings of the Scientific and Statistical Database Management (SSDBM), pp. 33–40, Aalborg, Denmark, June-July 2004.
  15. J. Sankaranarayanan, H. Alborzi, and H. Samet, “Distance join queries on spatial networks,” in Proceedings of the ACM International Symposium on Geographic Information Systems (GIS), pp. 211–218, Arlington, VA, USA, November 2006.
  16. D. Papadias, J. Zhang, N. Mamoulis, and Y. Tao, “Query processing in spatial network databases,” in Proceedings of the Very Large Data Bases (VLDB), pp. 802–813, Berlin, Germany, September 2003.
  17. H. Samet, J. Sankaranarayanan, and H. Alborzi, “Scalable network distance browsing in spatial databases,” in Proceedings of the ACM SIGMOD Conference, pp. 43–54, Vancouver, BC, Canada, June 2008.
  18. X. Huang, C. S. Jensen, and S. Saltenis, “The islands approach to nearest neighbor querying in spatial networks,” in Proceedings of the International Symposium on Spatial and Temporal Databases (SSTD), pp. 73–90, Hong Kong, China, August 2005.
  19. S. Nutanong and H. Samet, “Memory-efficient algorithms for spatial network queries,” in Proceedings of the IEEE International Conference on Data Engineering (ICDE), pp. 649–660, Brisbane, Australia, April 2013.
  20. X. Huang, C. S. Jensen, H. Lu, and S. Saltenis, “S-GRID: a versatile approach to efficient query processing in spatial networks,” in Proceedings of the International Symposium on Spatial and Temporal Databases (SSTD), pp. 93–111, Boston, MA, USA, July 2007.
  21. A. Guttman, “R-trees: a dynamic index structure for spatial searching,” in Proceedings of the ACM SIGMOD Conference, pp. 47–57, Boston, MA, USA, 1984.
  22. H. V. Jagadish, B. C. Ooi, K.-L. Tan, C. Yu, and R. Zhang, “iDistance: an adaptive B+-tree based indexing method for nearest neighbor search,” ACM Transactions on Database Systems, vol. 30, no. 2, pp. 364–397, 2005.
  23. L. Li and D. Taniar, “A taxonomy for distance-based spatial join queries,” International Journal of Data Warehousing and Mining, vol. 13, no. 3, pp. 1–24, 2017.
  24. L. Li, D. Taniar, M. Indrawan-Santiago, and Z. Shao, “Surrounding join query processing in spatial databases,” in Proceedings of the Australasian Database Conference (ADC), pp. 17–28, Brisbane, Australia, 2017.
  25. C. Xiao, W. Wang, and X. Lin, “Ed-join: an efficient algorithm for similarity joins with edit distance constraints,” Proceedings of the VLDB Endowment, vol. 1, no. 1, pp. 933–944, 2008.
  26. A. Corral, Y. Manolopoulos, Y. Theodoridis, and M. Vassilakopoulos, “Closest pair queries in spatial databases,” in Proceedings of the ACM SIGMOD Conference, pp. 189–200, Indianapolis, IN, USA, June 2000.
  27. “Real datasets for spatial databases,” 2005, https://www.cs.utah.edu/~lifeifei/SpatialDataset.htm.