Abstract

In this paper, we study path based continuous spatial keyword queries, which continuously report the answer set as the query point moves along a given path. Under this setting, we explore two primitive spatial keyword queries, namely the nearest neighbor query and the range query. The technical challenges are twofold: (1) qualified vertices must be retrieved efficiently in large road networks, and (2) issuing the query at every point on the path is infeasible. To overcome these challenges, we first propose a backbone road network index structure (BNI), which supports efficient distance computation and offers a global view of the whole road network. Motivated by the safe zone technique, we then transform our queries into the problem of finding event points, which capture the changes of the answer set. This transformation makes our queries simple and feasible. To answer the queries, we propose a Two-Phase Progressive (TPP) computing framework, which first computes the answer sets for some crucial vertices on the path, and then identifies the event points from the retrieved answer sets. Extensive experiments on both real and synthetic data sets are conducted to evaluate the performance of our proposed algorithms, and the results show that our algorithms outperform competitors by several orders of magnitude.

1. Introduction

With the prevalence of smartphones and other mobile devices, continuous spatial queries have gained increasing attention from the research community [1–12]. Previous works can be roughly classified into two categories by query model: (1) the snapshot based model [2, 3, 7, 8], in which snapshot query processing techniques are invoked periodically, yielding either excessive costs or outdated results, as pointed out in [6]; and (2) the safe zone based model [1, 4, 6, 9, 12, 13], in which the answer set is guaranteed not to change while the query point moves within the safe zone. This property encourages identifying the safe zone rather than periodically invoking the query processing techniques.

Traditional spatial keyword query models usually assume that the query point is static or moves randomly. In real life, however, a user (query point) may be moving along a navigation path or a public transport line. A data point is then relevant only if it covers all query keywords (capturing user needs) and lies on the query path at the same time.

Consider the scenario shown in Figure 1. There are 10 vertices on the road network; each vertex is associated with a set of keywords (the abbreviations of the keywords are defined in the top right corner of Figure 1), and each edge is associated with a real-valued distance. Given a query path (the red arrow in Figure 1), a tourist (depicted as a blue car in the figure) continuously monitors Chinese food restaurants with car parking service within 1 km while driving on path , and the demand of the tourist can be expressed by keyword set . As another example, a passenger may continuously query the nearest Chinese food restaurant with car parking service. The first example can be answered by issuing the path based continuous range query, and the second by issuing the path based continuous NN query. As shown in Figure 1, when the query point is , the answer set for the path based continuous range query is , since is the only vertex that is within the query range and covers the query keywords at the same time. As the car moves along the query path, the answer set keeps changing: when it moves to , the answer set for the path based continuous range query is . In the same way, we can find that the answer of the NN query for is , since these two points are on the query path and cover the keywords at the same time. Notice that is nearer to than and also satisfies the keyword constraint; however, it is not on the query path , and thus it is not in the answer set.

Motivated by the above examples, we propose to study continuous spatial keyword queries under a path based paradigm. In particular, we focus on two fundamental types of queries, namely the path based continuous range keyword (PCR) query and the path based continuous kNN keyword (PCkNN) query.

From the technical perspective, we encounter two major challenges. The first challenge is how to compute distances on large scale road networks, which incurs prohibitive computation overhead. Various techniques have been developed [14–17] to facilitate the distance computation. However, they either incur prohibitive pre-computation overhead or lack global insights of the whole network. To make road network distance computation more efficient, we propose a dual-index structure, namely the backbone network index (BNI). Concretely, we first reduce the original network to a backbone network using a simple but effective strategy. We then maintain the skeleton and the detailed information of the road network in two structures, namely the modified G-tree [18] (in memory) and the modified adjacent list [19] (on disk). As pointed out in [18], G-tree facilitates the distance computation by an assembly-based method, whilst offering global insights of the road network.

Another challenge arises from the infinite number of query points on the path. Obviously, issuing the query at each query point is infeasible. Inspired by the observation that the answer set does not change while the query point moves within the safe zone, we aim to find the event points on the path, which capture the changes of the answer set. This transformation makes our queries simple and feasible.

Given a query path, the work in [20] investigates continuous nearest neighbor queries. To enable efficient query processing, it precomputes NN sets for a fraction of the intersection points. Nevertheless, one may request various NN sets for different needs (expressed by different keyword sets). As a result, computing all the NN sets for different query keywords is infeasible. LARC [9] studies continuous nearest neighbor keyword queries on the road network. However, LARC suffers from two problems when it is applied to answer our queries. First, it has to precompute and store much information for each vertex, which incurs prohibitive computation and space overhead in large scale road networks. Second, the safe zone computed by LARC is approximate and redundant for our path based queries.

To answer the queries, we develop a Two-Phase Progressive query framework (TPP) on top of our proposed index structure BNI. When a query is received, we first compute and maintain the answer sets for some crucial vertices on the given path. This can be achieved by issuing only one query. Then, we identify the event points on the path progressively with the stored answer sets.

This paper is a significant extension of its preliminary conference version [21]. Compared to the preliminary version, we extend the path based continuous range keyword query to the kNN keyword query, and propose two new effective algorithms. In addition, the previous work in [21] lacks a theoretical analysis of the effectiveness and efficiency of the proposed index structure and algorithms.

Our major contributions (excluding the contributions in the conference version [21]) can be summarized as follows:
(i) We extend the spatial keyword query to the nearest neighbor query and propose two new effective algorithms to solve the Path Based Continuous kNN Query (PCkNN) (in Section 4 and Section 5).
(ii) We add detailed proofs and definitions that were not included in the conference version.
(iii) We add an analysis section that presents the theoretical analysis of the effectiveness of the proposed BNI index structure and gives the time and space complexity of all the proposed algorithms (in Section 6).
(iv) We extend the experimental part to verify the effectiveness and efficiency of our proposed PCkNN algorithms (in Section 7), and we extend the experimental results to evaluate the number of page accesses of the algorithms for both queries.

The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 formally defines the problem and introduces the backbone network. We describe the proposed index structure in Section 4 and introduce our proposed algorithms in Section 5. Section 6 offers the theoretical analysis of the proposed algorithms, and we present the experimental study in Section 7. The paper is concluded in Section 8.

2. Related Work

We review works related to path based continuous spatial keyword queries, mainly focusing on continuous/moving spatial queries and road network index structures.

2.1. Continuous/Moving Spatial Queries

Spatial queries have been extensively studied in the database community. In particular, much attention has been paid to range queries [16, 23–27] and nearest neighbor queries [2, 4–6, 9, 18]. Next, we introduce the works on each in turn.

Cheema et al. [22] explore continuous range queries in both Euclidean space and road networks. They adopt the concept of the safe zone for monitoring updates of the query position. The works in [23, 24] study the problem in the context of road networks. Du et al. [23] answer continuous spatial keyword range queries in two steps. They first build the range tree for the query, which maintains the shortest paths from the query point to all qualified objects. Then, the up-to-date results are obtained by adjusting the range tree accordingly. To alleviate the effect of frequent object updates, Wang et al. [24] develop the Shortest-Distance-based Tree (SD-Tree), which supports incremental computation of answer sets using retrieved paths. The works in [25–28] explore range queries under the assumption that objects move in Euclidean space. Specifically, [26, 28] consider an uncertainty model, in which the motion of objects is uncertain. Chung et al. [28] propose to transform moving objects into points by the Hough transform, and then index them with an R-tree. In [26], by transforming the irregular motion areas into polygon regions, the initial problem is reduced to a simplified version. Wu et al. [27] study continuous range queries with moving objects but stationary queries. They propose containment-encoded squares (CESs), which decompose query regions and store precomputed search results. Zhang et al. [25] study the predictive range query and the predictive nearest neighbor query. Their work differs from ours in two major aspects. On the one hand, they model the motion of objects as a linear function of time rather than as movement on a road network. On the other hand, they retrieve one answer set for the whole querying period, while we keep searching for the up-to-date answer set. The transformed Minkowski sum (TMS) is used to determine whether the moving objects intersect the moving query point.

Given a query path, the work in [20] investigates the problem of continuous nearest neighbor queries. To enable efficient query processing, it precomputes NN sets for a fraction of the intersection points. Nevertheless, in practice, one may require different NN sets for different needs when keywords are considered. As a result, computing all the NN sets for different query keywords is infeasible. The works in [1, 6] answer NN queries by the safe region technique, which is motivated by the V-Diagram [27]. To avoid prohibitive computation cost, Li et al. [1] utilize influential sets to restrict the safe zone. Note that this approach achieves a safe zone as large as the order-k Voronoi cell [29] while minimizing the computation overhead. In [3], Mouratidis et al. study continuous NN queries on road networks. Different from our setting, they consider both moving objects and moving queries without making assumptions on the motion patterns. Two methods are developed, namely the incremental monitoring algorithm (IMA) and the group monitoring algorithm (GMA). IMA monitors the answer set of an individual query by maintaining an expansion tree from the query point. With this tree, only object updates that may affect the answer set are considered. GMA groups all queries lying on the same edge, and then obtains their answer sets based on the answer sets of the end points of the corresponding edge and the objects lying on this edge. The work in [4] studies continuous nearest neighbor queries in the context of both Euclidean space and spatial networks. It proposes to construct the safe zone based on the objects, the query points and the search space. However, this method has to recompute the safe regions more frequently and has higher validation overhead, as pointed out by [1]. Zheng et al. [9] utilize the 2-hop label to facilitate the distance computation. However, this work suffers from extensive computation and storage overhead. More importantly, the retrieved safe regions are too large and inaccurate when applied to our problem. Mouratidis et al. [2] develop the conceptual partitioning (CPM) technique, which handles location updates only from objects that fall in the vicinity of some query (and ignores the rest). In [7], to support various scenarios, Object-Indexing and Query-Indexing are developed, both of which are based on the grid index. The works in [8, 11, 30] retrieve the NN set in a distributed environment. To adapt to dynamic environments, [8] develops the dynamic strip index (DSI), which supports merging or splitting objects dynamically. Besides, [31, 32] consider querying trajectories/routes that are composed of a sequence of geo-locations associated with text descriptions. Cong et al. [31] propose the -tree index, which integrates spatial information and text information for the top-k spatial keyword query on trajectories. Feng et al. [32] focus on the indoor top-k keyword-aware routing query.

2.2. Road Index Structure

Distance computation is a great challenge in large scale road networks. To address this problem, various index structures have been studied. In general, previous works on road network index structures can be roughly classified into two categories. (1) Some maintain the precomputed distances in a flat structure [14, 15]. The work in [14] utilizes the lab technique to facilitate the distance computation. In particular, the distance between each pair of vertices can be computed easily by looking up the precomputed results. Hu et al. [15] propose the distance signature structure, in which each vertex in the road network maintains a distance signature. Intuitively, a distance signature needs to store the distances from all other vertices to the current vertex. To minimize the storage cost, it maps the distances between objects and network nodes into categories and then encodes these categories. (2) Others consider a hierarchical structure [16–18, 33, 34]. CH [33] generates a hierarchical structure by iteratively contracting the least important node, that is, replacing such nodes with shortcuts. This strategy relies heavily on the weights of edges; thus, the whole index structure must be recomputed when the edge weights change. In [17], all pairs of distances are precomputed and the results are organized in different subsets. This is motivated by the fact that the shortest paths from vertex to all of the remaining vertices can be decomposed into subsets based on the first edges on the shortest paths to them from . The works in [16, 18] devise hierarchical structures that organize the road network by partitioning it into multiple sub-networks. In [18], the authors build a multi-level index structure called G-tree. It is constructed by iteratively partitioning the road network into equal-size sub-networks until the size of each leaf node is within the given threshold. Specifically, each node in G-tree is associated with a distance matrix that records the distances between borders (for non-leaf nodes) or the distances between borders and vertices (for leaf nodes). With G-tree, the distance between each pair of vertices can be computed by an assembly-based method.

A summary of the related works can be found in Table 1.

3. Preliminary

3.1. Problem Statement

In this work, we model a road network as an undirected and edge-weighted graph , where refers to the set of vertices and refers to the set of edges. Each vertex comprises a location and a set of keywords . Each edge connecting the vertices and is associated with a real-valued number to capture the weight, e.g., the distance or travel time, of this edge. For ease of discussion, we refer to the weight as distance hereafter. Given two vertices , , we denote by the shortest distance among all paths between them. Without loss of generality, we denote the path by a sequence of adjacent vertices on the road network, e.g., . Table 2 lists the frequently used notations. Next, we formally define our problem.

Definition 1. Path Based Continuous Range (PCR) Queries. Given a road network , a path based continuous range query is of the form , where is the query path, is the distance threshold, and is the set of query keywords. The query asks for a set of vertices for each query point on .

Definition 2. Path Based Continuous kNN (PCkNN) Queries. Given a road network , a path based continuous kNN query is of the form , where is the query path, is the retrieved size, and is the set of query keywords. The query asks for a set of vertices for each query point on . Specifically, satisfies the following three conditions:(i).(ii), , .(iii)for each , it covers the query keyword set, i.e. .

Intuitively, there are infinitely many points on the query path, which makes issuing the query at each point infeasible. We will show how to transform our queries into an equivalent problem in Section 5.

3.2. Backbone Network

We notice that existing road network index structures are built over the original road network. When the road network is large, both the index construction overhead and the query evaluation overhead are high. To address this challenge, we propose to build the index over the backbone network, based on the following observation.

3.2.1. Observation 1

For any vertex with exactly two adjacent edges, we can treat it as an internal vertex on a virtual edge.

Figure 2(a) depicts the instance of an original road network with 8 vertices, i.e., . Since , , …, only have two adjacent edges, we can introduce a virtual edge and take them as the internal vertices as shown in Figure 2(b).

3.2.2. Simplifying Strategy

Observation 1 offers a simple but powerful insight for simplifying the road network. That is, if a vertex has only two adjacent edges, we can introduce a virtual edge and take the vertex as an internal vertex on the virtual edge. This strategy is similar to CH [33]. However, CH is computationally expensive and relies heavily on the weights of edges, so it needs to be rebuilt when the edge weights change. Hereafter, we call the simplified network the backbone network. We call the remaining vertices on the backbone network backbone vertices (e.g.), and the vertices lying on the virtual edges internal vertices.

Figure 3 presents an instance of the backbone network. We denote by the black (white) circle the backbone (internal) vertex in the original network (Figure 3(a)). We present the retrieved backbone network in Figure 3(b). Furthermore, we denote by the dotted line the virtual edge. Clearly, the backbone network achieves a compact structure compared with the original road network (experimental results in Section 7 show that the size of the road network can be reduced by 50% to 95%). Note that there may be more than one edge between two vertices. For instance, there are two paths, namely and , between and (one is the virtual edge and the other is the original edge). In this case, we take the one with the smallest distance as the edge in the backbone network, whilst maintaining all edges in our index structure (to be shown later). We denote by the edge that connects the vertices . Note that is either an original edge or a virtual edge comprising multiple original edges. For two internal vertices , we denote by the direct distance between them on the corresponding virtual edge.
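The simplifying strategy can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `build_backbone`, the edge-list input format, and the returned dictionaries are assumptions made for illustration, and the sketch assumes a simple graph with comparable vertex ids (cycles consisting only of degree-2 vertices are ignored).

```python
from collections import defaultdict

def build_backbone(edges):
    """Contract every degree-2 vertex into a virtual edge (illustrative sketch).

    edges: iterable of (u, v, weight) for an undirected simple graph.
    Returns (backbone, chains):
      backbone[(a, b)]: smallest length among parallel edges between a and b (a <= b)
      chains[(a, b)]:   list of (length, [(internal_vertex, offset_from_a), ...])
    """
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    # Vertices with exactly two adjacent edges become internal vertices.
    backbone_vertices = {v for v in adj if len(adj[v]) != 2}
    backbone, chains, walked = {}, defaultdict(list), set()
    for a in backbone_vertices:
        for first in adj[a]:
            if (a, first) in walked:
                continue  # this edge was already walked from its other end
            prev, cur, length, chain = a, first, adj[a][first], []
            while cur not in backbone_vertices:
                chain.append((cur, length))  # offset measured from a
                nxt = next(n for n in adj[cur] if n != prev)
                length += adj[cur][nxt]
                prev, cur = cur, nxt
            walked.add((a, first))
            walked.add((cur, prev))
            lo, hi = a, cur
            if lo > hi:  # normalise so offsets are measured from the smaller id
                lo, hi = hi, lo
                chain = [(v, length - off) for v, off in reversed(chain)]
            # Keep the shortest parallel edge in the backbone, but remember all of them.
            backbone[(lo, hi)] = min(length, backbone.get((lo, hi), float("inf")))
            chains[(lo, hi)].append((length, chain))
    return backbone, dict(chains)
```

In this sketch every parallel edge is retained in `chains`, mirroring the statement that all edges are maintained in the index while only the shortest one is used as the backbone edge.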

Theorem 1. , the shortest distance between them is defined as:
if are on the same virtual edge , then:if are not on the same virtual edge , then:

Proof. This theorem is self-evident, so we omit the proof here due to space considerations. An illustration of this theorem can be found in Figure 4.
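Since the formulas of Theorem 1 are not reproduced in the text above, the following Python sketch shows one standard way to compute the shortest distance between two internal vertices from backbone distances; it should be read as an assumption about what the theorem states, not as its exact form. The `InternalVertex` record and the `dist` callback (shortest distance between backbone vertices, e.g. as supplied by an index) are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InternalVertex:
    a: int          # backbone endpoint from which the offset is measured
    b: int          # the other backbone endpoint
    edge_id: int    # distinguishes parallel (virtual) edges between a and b
    offset: float   # direct distance from a along the virtual edge
    length: float   # total length of the virtual edge

def shortest_distance(u, v, dist):
    """dist(x, y) is assumed to return the shortest backbone distance, with dist(x, x) == 0."""
    best = float("inf")
    # Any path that leaves a virtual edge must pass through one of its endpoints.
    for x, ux in ((u.a, u.offset), (u.b, u.length - u.offset)):
        for y, vy in ((v.a, v.offset), (v.b, v.length - v.offset)):
            best = min(best, ux + dist(x, y) + vy)
    # On the same virtual edge, the direct distance is an additional option.
    if (u.a, u.b, u.edge_id) == (v.a, v.b, v.edge_id):
        best = min(best, abs(u.offset - v.offset))
    return best
```

The same-edge case still considers the detour through the endpoints, because a shorter parallel edge may exist between the two backbone vertices.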

4. Backbone Network Index Structure

In this section, we introduce the backbone network index (BNI) structure, which is a dual-index. Specifically, it maintains the global insights of network by G-tree (in memory) and the detailed information of network by the adjacent list (on disk).

4.1. G-Tree Index Structure

Though there are various partition strategies for road networks, we adopt that of [18] in this work. This is because G-tree has the following properties: (1) it enables efficient distance computation for large networks; (2) it offers a global view of the road network. We proceed to introduce some basic notions that lay the foundation of G-tree.

Definition 3. (Backbone Network Partition). Given a road network , we denote by the corresponding backbone network. The partition of is a set of regions, i.e., ,…, such that:(i),(ii)For , and(iii), if , then .

In the partitions, there are some portal vertices that connect to the vertices of other regions. To illustrate this, we define the concept of borders as follows.

Definition 4 (Backbone Network Border). Given a backbone network , a region of . We say that is a border if and . Hereafter, we denote by all the borders of .

As shown in Figure 5, we present an instance of the partition of the backbone road network. Note that, all borders are marked with red. Correspondingly, we show the G-tree in Figure 6, which is built over the road network in Figure 5(b). Specifically, each node in G-tree corresponds to one subregion in Figure 5(b).

Each node in G-tree is of the form , where is the identifier of the node and contains some keyword information of the corresponding node (to be discussed later). We denote by the borders of , and refers to the distance matrix. Specifically, for a leaf node, it records the distances between borders and backbone vertices (e.g. ). For a non-leaf node, it records the distances between borders (e.g. ).

As with [35], we map vertices and keywords to a fixed-width bit signature by a hash function . Specifically, for each keyword , sets exactly one of the bits to 1. is the superimposition of for each , and is the superimposition of for each relevant vertex . We say is relevant to a node if lies on an edge such that or . That is, relevant vertices contain all the vertices in and the vertices on the adjacent edges to .
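The superimposed signatures can be sketched in Python as follows. The signature width `B`, the use of SHA-256 as the hash function, and the function names are assumptions for illustration only; the text does not specify them.

```python
import hashlib

B = 64  # assumed signature width in bits

def kw_bit(keyword: str) -> int:
    """Hash a keyword to a signature with exactly one of the B bits set."""
    digest = hashlib.sha256(keyword.encode("utf-8")).digest()
    return 1 << (int.from_bytes(digest[:8], "big") % B)

def signature(keywords) -> int:
    """Superimpose (bitwise OR) the signatures of a set of keywords."""
    sig = 0
    for kw in keywords:
        sig |= kw_bit(kw)
    return sig

def may_cover(sig: int, query_sig: int) -> bool:
    """True if every query bit is present; false positives are possible, false negatives are not."""
    return (sig & query_sig) == query_sig
```

A node signature is obtained in the same way, by OR-ing the signatures of its relevant vertices, so a node whose signature fails `may_cover` can be pruned safely, while a node that passes still needs its vertices checked against the actual keywords.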

For a vertex and a node , we denote by the minimum possible distance between and the vertices in . Specifically, if , then we set to 0. Otherwise, we set as the minimum distance between and the borders of , i.e., .
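As a small worked illustration of this definition, the following Python sketch computes the minimum possible distance from a vertex to a node. The function name `min_dist` and the `dist` callback (any shortest-distance oracle) are hypothetical.

```python
def min_dist(v, node_vertices, node_borders, dist):
    """Lower bound on the distance from v to any vertex of a node.

    node_vertices: set of vertices contained in the node
    node_borders:  set of border vertices of the node
    dist:          assumed oracle returning the shortest distance between two vertices
    """
    if v in node_vertices:
        return 0.0
    # Every path entering the node from outside passes through one of its borders.
    return min(dist(v, b) for b in node_borders)
```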

4.2. Generalized Adjacent List File

Adjacent lists have been widely used in the context of road networks [19, 36] for maintaining detailed information of edges. In this work, we generalize the adjacent list in two aspects. First, we maintain all edges (including both real edges and virtual edges) between two adjacent vertices and in the backbone network. Second, the road network is stored on disk based on the locality principle rather than the vertex id. The partition strategy of G-tree suggests that vertices in the same subregion have a higher probability of being accessed together. Inspired by this, we propose to store the road network sequentially from the left to the right leaf nodes, as shown by the arrow line in Figure 6.

As shown in Figure 7, the adjacent list is comprised of three components, namely the hash table, the adjacent file and the point file. We proceed to introduce them. By the hash table, each backbone vertex is mapped to a pair , where refers to the leaf node containing and records the start address of the item for in the adjacent file. For each backbone vertex , we maintain an item in the adjacent file. The item first records the number of adjacent vertices and the associated keywords for . Then, each adjacent vertex corresponds to an entry, and each edge is associated with a record file, the superimposition of keywords associated with the internal vertices, and the start address of the internal vertices on this edge. More details about the generalized adjacent list file can be found in our previous work [21].

The structure of point file is simple compared with adjacent file. In the point file, all internal vertices lying on edges between two vertices are placed in one group. The group first records the corresponding backbone vertices, and the total number of internal vertices. Then, we maintain the id, the direct distance to smaller vertex, and the associated keywords for each internal vertex.

5. Algorithms

In this section, we first describe the query framework at a high level, and then present the algorithms for PCR and PCkNN queries.

5.1. Overview of the Framework

Since there are infinitely many query points on the query path, it is infeasible to issue the query at each point. We proceed to show how to transform our queries into a simple but equivalent problem.

Theorem 2. Given a path based range query , and an edge on . For any point on , it holds that , where is the set of internal vertices on that cover .

Proof. The proof of this theorem can be found in the conference version [21]. An illustration of this theorem can be found in Figure 8.

Theorem 3. Given a path based NN query , and an edge on . For any point on , it holds that , where is the set of internal vertices on that cover .

Proof. The proof of this theorem can be derived in the same way as that of Theorem 2. We omit the proof here.

Inspired by the safe zone technique [4], we know that the answer set might not change when the query point moves within some intervals. This intuition offers us a direction for transforming our queries into a simple but equivalent problem.

Definition 5. Safe Interval. Given a query path , we define the safe interval as a disclose segment on , denoted by , where and denote the start point and end point of this interval. When moves within , the answer set is unchanged. We denote the safe interval of and as and respectively.

In practice, there are multiple safe intervals on , which are separated by some special points, i.e., the start and end points of the safe intervals. For ease of presentation, we call these points event points hereafter.

Definition 6. Backbone Path. Given a query path in , we define the corresponding backbone path as a path in such that:(i) such that locates on , and(ii) is minimum.

In the above definition, we say is minimum if there does not exist another backbone path that contains fewer vertices than and satisfies the first condition.

We answer the path based queries with a two-phase query framework. The first phase, which we call Issuing query, computes the answer set for each vertex in . Note that this can be achieved by issuing the query only once. We maintain the retrieved answer sets for the next phase. The second phase, called Identifying event point, finds the event points on . As discussed in Theorems 2 and 3, the answer sets for these points can be identified easily without issuing the query at these points.

Clearly, these two phases allow us to answer the queries easily while significantly reducing the computation cost. The Two-Phase Progressive (TPP) framework is described in Algorithm 1. The details of the algorithm can be found in the conference version [21].

  Input :
  Output:
(1)Construct the query vertices from
(2)
(3);
(4)Compute answer sets for event points in progressively
(5)Return .

5.2. PCR Query Processing

We answer the PCR query by two steps. Firstly, we get the answer sets for backbone vertices by issuing the range query at the first backbone vertex in . Then, we identify event points by these sets.

5.2.1. Issuing Range Query

Intuitively, the answer sets for these backbone vertices overlap. This motivates us to compute the answer sets progressively. The following theorem offers the insight for progressive computation.

Theorem 4. Given two query points on the query path, and a distance threshold . We denote by the range query results of . For any , it holds that .

Proof. This theorem is obvious. We omit the proof here.

Considering the fact that the distance between backbone vertices is not large, we propose to achieve the progressive computation by a filter-and-verification model. At a high level, we expand the network progressively from the first vertex in (the filter step). We know that, for each backbone vertex , the candidates must be within the range of . That is, we only need to verify the vertices within this range for the other backbone vertices (the verification step).

As presented in Algorithm 2, we retrieve the answer sets for the backbone vertices in by a filter-and-verification framework. In particular, we utilize as the filter condition for the currently visited vertex (line 7). From Theorem 4, we know that the range query results for are within the range of to . In the process of tree traversal, if is a non-leaf node, we add its child nodes that cover the query keywords to the priority queue for further exploration (lines 9–13). Otherwise, is a leaf node. We first compute the distance between and the backbone vertices in (lines 15–16). Then, for each backbone vertex , we find the results by checking the vertices lying on the adjacent edges (lines 17–19). Note that we also maintain the candidates that cover the query keywords in to enable quick verification. When the filter condition is broken, we proceed to build by verifying the candidates in (lines 20–23).

Optimizing Strategies: To speed up query processing, we investigate optimizing strategies. We observe that there is much repeated computation when computing the and performing verifications; thus, we can cache the computed distances for reuse in subsequent distance computations.
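The filter-and-verification idea can be sketched in plain Python, with Dijkstra's algorithm standing in for the G-tree traversal and BNI distance computation described above. This is a simplified illustration, not Algorithm 2: the names `dijkstra` and `progressive_range`, the adjacency format, and the use of the length travelled along the path as the filter slack are all assumptions. By the triangle inequality, any result of a path vertex lies within r plus that slack of the first path vertex.

```python
import heapq

def dijkstra(adj, source, bound=float("inf")):
    """Shortest distances from source to all vertices within `bound`."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd <= bound and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def progressive_range(adj, keywords, path, query_kw, r):
    """adj: {u: [(v, w), ...]}, keywords: {u: set}, path: list of adjacent vertices."""
    query_kw = set(query_kw)
    # Upper bound on dist(path[0], q) for every q on the path.
    along = sum(min(w for v, w in adj[a] if v == b) for a, b in zip(path, path[1:]))
    # Filter: a single expansion from the first path vertex.
    reach = dijkstra(adj, path[0], bound=r + along)
    candidates = {o for o in reach if query_kw <= keywords.get(o, set())}
    # Verification: only candidates are checked for each path vertex.
    answers = {}
    for q in path:
        near = dijkstra(adj, q, bound=r)
        answers[q] = {o for o in candidates if o in near}
    return candidates, answers
```

In this sketch the verification step recomputes distances per path vertex; caching distances, as suggested under the optimizing strategies, would avoid that repeated work.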

  Input :
  Output:,
(1)build from
(2);
(3) 0;
(4)The first vertex in ;
(5)for each do
(6)
(7)whiledo
(8)  ;
(9)  if is a non-leaf node then
(10)   for each child node of do
(11)    ifthen
(12)     compute the ;
(13)     ;
(14)  else
(15)   if is in then
(16)    InternalDist ;
(17)   else ExternalDist ;
(18)   ;
(19)   for each do
(20)    put candidates on adjacent edges into ;
(21)    put qualified candidates into ;
(22)for each in do
(23)  if is in then InternalDist ;
(24)  ;
(25)  else ExternalDist ;
(26)  ;
(27)  put all qualified candidates of into
(28)Return and ;

5.2.2. Identifying Event Point

We mainly focus on identifying event points on edges. Specifically, we distinguish event points by two types, namely IE and OE. We denote by IE the in event point, and OE the out event point. We refer to in/out from the direction of the backbone vertex that is near to . As suggested in Theorem 2, the event points on are determined by , and . Specifically, there are seven cases, as follows:

  Input :
  Output: The sequence of event points.
(1)for 1 to do
(2); ;
(3) the qualified vertices on ;
(4)for each do
(5)  event points computed by (1–4)
(6)for each do
(7)  if is not on then
(8)   if is not in then
(9)    event points computed by (5);
(10)   else
(11)    event points computed by (7);
(12)for each do
(13)  if is not on then
(14)   if is not in then
(15)    event points computed by (6)
(16)Return

Theorem 5. Given two backbone vertices on , and a vertex covering the keywords, the event points on the edge fall into the following seven cases (we only describe the first three cases according to Figure 9, and refer to our previous work [21] for more details):
(i) If on one edge between , and , then there are two event points ;
(ii) If on one edge between , and , then there are two event points .
(iii) If on one edge between , and , then there are two event points.

Algorithm 3 presents the pseudo-code for identifying the event points. For each edge on , we denote by and the answer sets for and . We first identify the event points for internal vertices on (lines 4–5). Then, for each vertex , if it is on the edge, we know that it has already been processed. Otherwise, we find out whether it is in . If yes, there might exist both in and out event points (lines 8–9). If no, we only need to compute the out event points for (lines 10–11). Then, for the vertices in , we only need to process ones that are neither on nor in . For these vertices, we need to compute the in event point for them (lines 12–15).
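
To make the notion of in and out event points concrete, the following Python sketch handles one simplified situation only: a qualified vertex that does not lie on the edge being traversed. It is an illustration under stated assumptions, not a reproduction of the seven cases of Theorem 5. The query point is assumed to move along an edge of length `length` at offset `t` from the first backbone vertex; `a` and `b` are the network distances from the two backbone vertices to the qualified vertex; `r` is the range threshold; and the distance from offset `t` is assumed to be `min(a + t, b + length - t)`. The function name and all parameter names are hypothetical, and the labels IE and OE are taken here relative to the direction of travel.

```python
from typing import List, Tuple


def range_event_points(length: float, a: float, b: float,
                       r: float) -> List[Tuple[float, str]]:
    """Return (offset, type) pairs where the vertex enters (IE) or leaves (OE)
    the range-query answer set while the query point moves from 0 to length."""
    intervals = []  # sub-intervals of [0, length] where the vertex is in range
    if a <= r:
        # Reachable through the first backbone vertex while a + t <= r.
        intervals.append((0.0, min(length, r - a)))
    if b <= r:
        # Reachable through the second backbone vertex while b + length - t <= r.
        intervals.append((max(0.0, length - (r - b)), length))
    if len(intervals) == 2 and intervals[1][0] <= intervals[0][1]:
        # The two intervals overlap: the vertex stays in range on the whole edge.
        intervals = [(0.0, length)]
    events = []
    for start, end in intervals:
        if 0.0 < start < length:
            events.append((start, "IE"))
        if 0.0 < end < length:
            events.append((end, "OE"))
    return sorted(events)


print(range_event_points(10.0, 2.0, 3.0, 5.0))
print(range_event_points(10.0, 12.0, 2.0, 5.0))
```

With the first set of values, the vertex is in range near both endpoints but not in the middle of the edge, so the sketch reports one out event point followed by one in event point.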

5.3. PCkNN Query Processing

Similarly, to answer the PCkNN queries, we first retrieve the NN sets for the backbone vertices in . Then, we identify the event points.

5.4. Issuing NN Query
  Input :
  Output:,
(1)build from ;
(2);
(3) 0; ;
(4)for each in do
(5)INFINITY ;
(6);
(7)while () do
(8)  
(9)  if is a non-leaf node then
(10)   for each child node of do
(11)    if then
(12)     compute the
(13)     ;
(14)  else
(15)   if is in then
(16)    InternalDist
(17)   else ExternalDist
(18)   ;
(19)   for each do
(20)    put candidates on adjacent edges into ;
(21)    put qualified candidates into ;
(22)    update ;
(23)for each in do
(24)  if is in then InternalDist
(25)  ;
(26)  else ExternalDist
(27)  ;
(28)  put all qualified candidates of into
(29)Return and

Theorem 6. Given two query points on the query path, we denote by the nearest neighbor query results of . For any , it holds that . Here, refers to the distance between and the k-th nearest neighbor.

Proof. This theorem is obvious, so we omit the proof here.

Theorem 6 makes it possible to find the nearest neighbors by issuing the nearest neighbor search only once. Algorithm 4 shows the pseudo-code for finding the NN sets for backbone vertices. Generally, it is similar to Algorithm 1. The major difference comes from the filter condition (line 7). We denote by the distance between and the k-th nearest neighbor. Besides, we need to update accordingly when a new result is inserted into (lines 21–22).
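
The role of the k-th nearest neighbor distance as a filter bound can be sketched in Python as follows. This is a minimal illustration, assuming a bounded max-heap keeps the current k best candidates; the class name `KnnBuffer` and its methods are hypothetical and are not taken from Algorithm 4. The bound starts at infinity (as in line 5 of the pseudo-code) and tightens whenever a closer qualified vertex is inserted.

```python
import heapq
from typing import List, Tuple


class KnnBuffer:
    """Keeps the k closest candidates and exposes the current k-th distance."""

    def __init__(self, k: int) -> None:
        self.k = k
        self._heap: List[Tuple[float, str]] = []  # max-heap via negated distances

    def bound(self) -> float:
        # Until k candidates are known, nothing can be pruned.
        if len(self._heap) < self.k:
            return float("inf")
        return -self._heap[0][0]

    def offer(self, dist: float, vertex: str) -> bool:
        """Insert a qualified vertex if it improves the answer set."""
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (-dist, vertex))
            return True
        if dist < self.bound():
            # Replace the current k-th neighbor; the bound shrinks accordingly.
            heapq.heapreplace(self._heap, (-dist, vertex))
            return True
        return False

    def results(self) -> List[Tuple[float, str]]:
        return sorted((-neg, vertex) for neg, vertex in self._heap)


buffer = KnnBuffer(k=2)
for dist, vertex in [(5.0, "a"), (3.0, "b"), (4.0, "c"), (9.0, "d")]:
    buffer.offer(dist, vertex)
print(buffer.results(), buffer.bound())
```

A search loop would stop expanding a node once its lower-bound distance exceeds `bound()`, which is the kind of filter condition the text refers to.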

5.4.1. Identifying Event Point

Compared with the range query, the NN query is more complicated. Whether a vertex is the NN result for the query is not only determined by its distance to query point but also by other vertices. This leads to the difficulty in identifying the event points for NN queries. To address this problem, we introduce the concept of distance curve, which captures the change of distance between query point and the candidate when query point moves along the edge.

Figure 10(a) illustrates the distance curve for different cases. Given the candidate vertex , for , we know that the shortest path from to or must go through or . For , it is clear that is on . Different from , we know that present cases as there are more changes. Based on the distance curve, we identify the event points as illustrated in Figure 10(b). The event points always appear at intersections and turn points. A naive method is therefore to first compute all the intersections and turn points, and then compute the NN at each turn point. Because this naive method is time consuming, we propose an enhanced method. We maintain the NN curves, that is, the curves that have the lowest distance. Then, we compute the intersections between these curves and the other curves, select the first intersection, and recompute the NN there. This procedure continues until the whole edge is explored.

  Input :
  Output: The sequence of event point.
(1)for 1 to do
(2);
(3) compute the distance curve for vertices in ;
(4) NN ; ;
(5)while is on this edge do
(6)   minimum intersection position between NN and other curves
(7)   NN recompute the NN at
(8)  if NN changes then
(9)   add event point into
(10)Return

We present the details of how to compute the event points for NN in Algorithm 5. We first compute the distance curves for vertices in and (line 3). Then, we compute the event points by accessing the next explored intersection. At this point, we update the NN by comparing the distance value of start and end points. If NN is changed, we add the event point into (lines 8–9).
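
The distance-curve idea can be illustrated with a short Python sketch of the naive method for the special case of a single nearest neighbor. It is not Algorithm 5: it simply enumerates every turn point and pairwise intersection, then re-evaluates the nearest candidate between consecutive critical points. The sketch assumes each candidate lies off the edge, so that its distance curve at offset `t` is `min(a + t, b + length - t)`, where `a` and `b` are the distances from the two edge endpoints. The function names and the candidate dictionary format are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def curve_distance(a: float, b: float, length: float, t: float) -> float:
    """Distance curve of one candidate at offset t along the edge."""
    return min(a + t, b + length - t)


def nn_event_points(length: float,
                    candidates: Dict[str, Tuple[float, float]]
                    ) -> List[Tuple[float, str, str]]:
    """Return (offset, old_nn, new_nn) wherever the nearest candidate changes.

    `candidates` must be non-empty and maps a name to its (a, b) distances.
    """
    critical = {0.0, float(length)}
    for a, b in candidates.values():
        critical.add((b + length - a) / 2.0)  # turn point of one curve
    for (a1, b1), (a2, b2) in combinations(candidates.values(), 2):
        # A rising piece (slope +1) can only cross a falling piece (slope -1).
        critical.add((b2 + length - a1) / 2.0)
        critical.add((b1 + length - a2) / 2.0)
    points = sorted(t for t in critical if 0.0 <= t <= length)

    def nearest(t: float) -> str:
        return min(candidates,
                   key=lambda o: (curve_distance(*candidates[o], length, t), o))

    events = []
    previous = None
    for left, right in zip(points, points[1:]):
        # The ordering of the curves is fixed strictly inside each interval.
        current = nearest((left + right) / 2.0)
        if previous is not None and current != previous:
            events.append((left, previous, current))
        previous = current
    return events


print(nn_event_points(10.0, {"o1": (1.0, 11.0), "o2": (9.0, 1.0)}))
```

The enhanced method described in the text avoids enumerating all pairwise intersections by following only the current NN curves; the sketch above trades that efficiency for brevity.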

6. Analysis

We proceed to offer some insights into our index structure and algorithms.

6.1. Analysis on Index Structure

As discussed in Section 3, the key point of the compact strategy is to remove some vertices from the original network and then treat them as internal vertices on the introduced virtual edges. Suppose the compression ratio is (as shown by the experimental studies, this value varies from 5% to 75%). That is, . Since an edge is removed from the original network whenever a vertex is removed, it holds that .

The whole space cost of BNI comprises three parts, namely the hash table, the G-tree, and the generalized adjacent list. In the hash table, we maintain the corresponding leaf node and address for each backbone vertex in . Thus, the hash table costs . Following the analysis of G-tree in [18], the space cost of the G-tree is ; please refer to [18] for more details. Furthermore, we also maintain the detailed information about the road network in the generalized adjacent list, which costs . We denote by and the number of partition regions for each partition and the upper bound of the number of vertices in leaf nodes. Thus, the total cost is .

6.2. Analysis on Algorithms
6.2.1. Analysis on Range Queries

(1) Time Complexity. Clearly, the time complexity of path based continuous range queries comes from two components, namely issuing range queries and identifying event points. Suppose the length of the query path is . That is, . As noted earlier, there are backbone vertices in . We need to issue the range query once and perform times verification for the other backbone vertices. Without loss of generality, we assume the range query needs to access leaf nodes (regions) before retrieving all results. In the worst case, all backbone vertices in belong to different regions, so each backbone vertex explores at most regions. Then, should be invoked at most . Since the matrix size of each node at level is and the height of the G-tree is , the cost for is . Suppose the average number of vertices maintained for each backbone vertex is . The second part of the cost is . The overall cost is . In practice, the complexity is much smaller than the worst-case complexity.

(2) Space Complexity. As discussed earlier, we need to maintain the range query answer sets for the backbone vertices in . The space complexity is . Compared with the distance matrices of the G-tree, this space cost is negligible.

6.2.2. Analysis on NN Queries

(1) Time Complexity. As shown by Algorithms 1 and 4, the only difference between them is the filter condition. That is, we need to issue the NN query once and perform times verification for the other backbone vertices. Without loss of generality, we assume the NN query needs to access leaf nodes (regions) before retrieving all results. Similarly, in the worst case, each backbone vertex explores at most regions. Then, should be invoked at most . Thus, the total cost for is . Besides, the cost for computing the event points is in the worst case. Thus, the total cost is .

(2) Space Complexity. We need to maintain the range queries answer sets and the distance curves for the backbone vertices in . Thus, the space complexity is .

7. Experiments

We evaluate the performance of our proposed algorithms on both synthetic and real life data sets. We introduce the experimental settings in Section 7.1, and present the overall performance of the algorithms in Section 7.2. Section 7.3 studies the impact of different parameters on the performance of the algorithms.

7.1. Experimental Setup
7.1.1. Comparison Algorithms

Previous studies on continuous spatial keyword queries usually exploit a modified version of the grid index [2, 8, 22] to organize the spatial objects, mainly because this index structure is easy to construct and can be applied to dynamic environments. We therefore modify the algorithms in [22], which are developed on top of the grid index, for comparison. To be fair, we partition the generated backbone network into grid cells such that the number of grid cells is as close as possible to the number of leaf nodes of the G-tree. Similarly, we maintain borders (connecting vertices), a distance matrix (distance table), and a keyword bitmap for each cell. The range query algorithm first finds all relevant cells that satisfy the given distance threshold constraint. Then, the vertices in these cells are checked against the query conditions to obtain the final answer set. The NN query algorithm progressively explores the closest unvisited cell using a modified version of Dijkstra's algorithm until qualified vertices are retrieved. During query processing, the distance matrices are exploited to facilitate the distance computation. For ease of discussion, we term our algorithms and the competitors BNI-Range, BNI-kNN, Grid-Range, and Grid-kNN, respectively.
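
The two-step filtering of the grid-based range competitor (first prune cells, then check the vertices inside the surviving cells) can be sketched in Python as follows. This is a simplified illustration, not the code used in the experiments: distances are assumed to be precomputed, each cell is assumed to carry a lower bound on its distance from the query point, and keyword sets are encoded as integer bitmaps. The names `VOCAB`, `Cell`, `to_bitmap`, and `grid_range_query` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical vocabulary: keyword -> bit position in the bitmap.
VOCAB = {"cafe": 0, "wifi": 1, "park": 2}


def to_bitmap(keywords: Iterable[str]) -> int:
    bitmap = 0
    for word in keywords:
        bitmap |= 1 << VOCAB[word]
    return bitmap


class Cell:
    """A grid cell with a distance lower bound and its vertices."""

    def __init__(self, lower_bound: float,
                 vertices: Dict[str, Tuple[float, int]]) -> None:
        self.lower_bound = lower_bound  # assumed lower bound to the query point
        self.vertices = vertices        # vertex -> (distance, keyword bitmap)
        self.bitmap = 0                 # union of the vertex keyword bitmaps
        for _, vertex_bitmap in vertices.values():
            self.bitmap |= vertex_bitmap


def grid_range_query(cells: List[Cell], query_keywords: Iterable[str],
                     r: float) -> List[str]:
    q = to_bitmap(query_keywords)
    answer = []
    for cell in cells:
        if cell.lower_bound > r:
            continue  # the whole cell lies beyond the distance threshold
        if (cell.bitmap & q) != q:
            continue  # some query keyword is absent from the entire cell
        for vertex, (dist, vertex_bitmap) in cell.vertices.items():
            if dist <= r and (vertex_bitmap & q) == q:
                answer.append(vertex)
    return sorted(answer)


cells = [
    Cell(1.0, {"v1": (2.0, to_bitmap(["cafe", "wifi"])),
               "v2": (3.0, to_bitmap(["cafe"]))}),
    Cell(9.0, {"v3": (9.5, to_bitmap(["cafe", "wifi"]))}),
    Cell(0.5, {"v4": (1.0, to_bitmap(["park"]))}),
]
print(grid_range_query(cells, ["cafe", "wifi"], 5.0))
```

Because the cell bitmap is the union of its vertex bitmaps, a cell that lacks any query keyword cannot contain a qualified vertex, so skipping it is safe in this sketch.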

7.1.2. Data and Queries

We test the algorithms on five real road networks, namely SF, NA, LA, NY, and CA. We obtain SF and NA from [37], while LA, NY, and CA are retrieved from TIGER/Line. Since the original road networks of LA, NY, and CA are unconnected, we take the corresponding maximal connected subgraph as the road network. Table 3 reports some statistical properties of the real road networks. In particular, we denote by the compression ratio for the road network. Clearly, the road network can be reduced by 25%–95%. We generate the associated keywords for each vertex as shown in [36], based on the experimental setting. We mainly evaluate the performance on five parameters: (1) the average number of keywords associated with each vertex (AK); (2) the length of the query path, i.e., the number of vertices on the path (PL); (3) the number of query keywords (QK); (4) the ratio of the query scope, that is, the distance threshold for the range query (RT); (5) the retrieved size for the NN query (K). To study the effect of these parameters, we conduct experiments on data set NA. Table 4 offers detailed information about the parameters, with the default values marked in bold. For each experiment, we randomly generate 20 queries based on the experimental setting in Table 4 and report the average performance.

All algorithms are implemented in C/C++ and run on Windows 7 on an Intel(R) Core(TM) i5-4590 CPU @ 3.30 GHz with 8 GB RAM. The index KHT and LIR-tree are disk-resident and the buffer size is set to 4 MB.

7.2. Overall Performance of Algorithms

We proceed to offer global insights into the index structures and algorithms. Figure 11 illustrates the construction time of the grid index structure and the backbone network index (BNI) structure over the real road networks. We observe that the construction times of the two index structures do not differ much on moderate-size road networks, such as NA, LA, and NY. For the large network, i.e., CA, building BNI takes much more time than building the grid index. The reason is that BNI exploits a much more complicated partition strategy than the grid index.

As shown in Figure 12, we evaluate the overall performance of the algorithms on several real road networks, namely SF, NA, LA, NY, and CA. The experimental results in Figure 12 reveal that our algorithms (both the range query algorithm and the NN query algorithm) outperform the competitors by several orders of magnitude in terms of the number of page accesses and the running time. The strength of our algorithms is obvious, especially on very large road networks, such as CA.

7.3. Effect of Parameters
7.3.1. Effect of Parameter AK

We first study the effect of AK. In particular, we generate six synthetic data sets based on NA by varying AK from 3 to 8. Figure 13 shows the experimental results as AK increases. As expected, the running time of the range query algorithms increases as AK increases (see Figure 13(a)), because more qualified vertices need to be verified. However, AK has little effect on the number of page accesses (see Figure 13(b)), since this number is mainly affected by the search range. The NN query algorithms behave differently, as shown in Figures 13(c) and 13(d). Both the running time and the number of page accesses are worst when AK = 5. Recall that we set the number of query keywords to 5 by default. That is, the false positive rate achieves its highest value in that case. Thus, the competitors spend much more time checking whether the currently visited vertex satisfies the query condition. On the other hand, BNI-kNN performs well, thanks to the global insights provided by our hierarchical index structure.

7.3.2. Effect of Parameter PL

In this experiment, we evaluate the effect of the length of the query path, i.e., the number of vertices on . Figures 14(a) and 14(b) show the results for the range query algorithms. When the length of the query path increases, the number of corresponding backbone vertices on increases accordingly. To answer the queries, many more iterations are required to build the answer sets for these backbone vertices. This increases the running time and the number of page accesses for Grid-Range. With the maintained answer sets, BNI-Range can retrieve the answer sets for backbone vertices incrementally, rather than from scratch as Grid-Range does. This explains the difference in performance. Figures 14(c) and 14(d) show the same findings and can be explained by the same reason.

7.3.3. Effect of Parameter QK

In Figure 15, we present the experimental results obtained by varying QK. Figure 15(a) reveals that the running time decreases for both Grid-Range and BNI-Range as QK increases. This is because the number of qualified vertices within the query range decreases, which incurs less verification overhead. Figure 15(b) shows that QK has little effect on the number of page accesses for both algorithms. In contrast, the running time and the number of page accesses increase greatly for Grid-kNN. This is not as strange as it seems. Note that Grid-kNN requests k qualified vertices. When QK becomes large, it is much more difficult for Grid-kNN to obtain the answer set. That is, Grid-kNN has to explore a larger proportion of the network before finding the NN results. We omit the experimental results for QK greater than 5, because the answer set cannot be found within one day. On the other hand, BNI-kNN is only slightly affected by QK, because we utilize the hierarchical structure to guide the search process.

7.3.4. Effect of Parameter RT

To study the effect of the distance threshold on the range query algorithms, we vary RT from 1% to 32%. To set RT, we first obtain the whole length of the road network, denoted by , and then set the threshold to a specific ratio of it. For instance, when RT is set to 8%, the corresponding real value is . Figures 16(a) and 16(b) show the running time and the number of page accesses for the range query algorithms, respectively. As expected, both the running time and the number of page accesses increase as RT increases, because the range query algorithms explore a larger proportion of the network. BNI-Range achieves much better performance than Grid-Range, largely owing to the effectiveness of our proposed index structure.

7.3.5. Effect of Parameter K

As shown in Figures 17(a) and 17(b), K has a considerable effect on the NN query algorithms. When K increases, more vertices are required to satisfy the query, which forces the algorithms to explore many more vertices before meeting the query conditions. Figure 17(b) shows that the number of page accesses of Grid-kNN increases almost linearly with K, whereas BNI-kNN is only slightly affected. This demonstrates the effectiveness of our proposed index structure.

8. Conclusions and Future Work

In this paper, we introduce a novel query type, called the path based continuous spatial keyword query. Different from previous studies, we study spatial keyword queries on a given query path. Under this model, we study range queries and NN queries, and propose a progressive framework to answer them. Besides, to support our queries efficiently, we propose a backbone network index structure. This dual index takes advantage of both the G-tree and the adjacent list. As verified by the experiments, our algorithms outperform the competitors by several orders of magnitude.

There are several interesting directions to be studied in the future. First, we can study the path based continuous spatial keyword query in a dynamic environment with the moving or mobile objects. Then, it is also interesting to study the problem in the distributed environment.

Data Availability

The five real road network data sets used to support this study have been cited. All of these data are available at: https://www.cs.utah.edu/~lifeifei/SpatialDataset.htm.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant No. 62002216), the Key Disciplines of Computer Science and Technology of Shanghai Polytechnic University, the Construction of Electronic Information Master Degree of Shanghai Polytechnic University, the Construction of University Enterprise Cooperation Automotive Electronics Engineering Technology Center (No. A11NH190704), the Cultural Relic Protection Science and Technology project of Zhejiang Province (No. 2018007), the Shanghai Sailing Program (No. 20YF1414400), the Shanghai Pujiang Program (No. 18PJ1433400), the Collaborative Innovation Platform of Electronic Information Master of Shanghai Polytechnic University (No. A10GY21F015), the National Natural Science Foundation of China (No. 61103213), the Innovation Program of Shanghai Municipal Education Commission (No. 14ZZ167), and the Open Foundation of Big Data Management System Engineering Research Center.