Abstract

A top-k spatial keyword (TkSK) query ranks objects based on their distance to the query location and their textual relevance to the query keywords. Several solutions have been proposed for top-k spatial keyword queries. However, most studies focus on Euclidean space or investigate only snapshot queries, where both the query and the data objects are static. A few algorithms study TkSK queries in undirected road networks, where each edge is undirected and the distance between two points is the length of the shortest path connecting them. However, TkSK queries have not been thoroughly investigated in directed and dynamic spatial networks, where each edge has a particular orientation and its weight changes according to the traffic conditions. Therefore, in this study, we address this problem by presenting a new method, called COSK, for processing continuous top-k spatial keyword queries for moving queries in directed and dynamic road networks. We first propose an efficient framework to process snapshot TkSK queries. Furthermore, we propose a safe-exit-based approach to monitor the validity of the results for moving TkSK queries. Our experimental results demonstrate that COSK significantly outperforms existing techniques in terms of query processing time and communication cost.

1. Introduction

With the popularization of geo-tagged data (e.g., geo-tagged photos, videos, check-ins, and text messages), many online location-based services such as Google Maps, Yahoo Maps, and Bing Maps have started providing useful information via location-based queries [14]. Moreover, textual descriptions of points of interest, e.g., hotels, shopping malls, and tourist attractions, are easily accessible on the Web. These developments demand techniques that efficiently process top-k spatial keyword queries, which return a ranked list of the k best facilities based on their proximity to the query location and relevance to the query keywords. Several algorithms have been proposed for processing top-k spatial keyword queries in Euclidean space [5, 6]. Although a few algorithms study keyword queries in road networks, they all focus on undirected road networks. However, in real scenarios, urban road networks are directed and dynamic: each edge has a particular orientation, and its weight changes according to traffic conditions such as traffic congestion and reversible lanes. Therefore, in this study, we investigate moving top-k spatial keyword queries in directed and dynamic road networks.

Top-k keyword queries can be used in a wide range of applications in recommendation and decision support systems. For example, tourists may want to retrieve a list of restaurants that serve Italian steak, sorted by the shortest distance from their location and by textual relevance to the query keywords. Tourists can issue a top-k spatial keyword query to a location-based service (LBS) to collect information about qualifying restaurants in their vicinity. With a moving top-k spatial keyword query, if they do not like the results, they can simply keep moving, and updated results will be provided until a desired restaurant is found. Typically, the query issuer follows the underlying road network to reach the desired location. Therefore, TkSK algorithms based on Euclidean space do not work in road networks. A road network is generally modeled as a weighted directed graph, where each edge has a direction and its weight can vary according to the traffic conditions.

Given a set of data objects D, a query location, and a set of keywords, the TkSK query returns the best k data objects from D according to their combined textual and spatial relevance to the query. We use distance function to represent the shortest network distance from q to data object d. Figure 1 presents an example of a directed road network, where rectangles represent the data objects with a textual description, and the triangle represents the query location. The number label on each edge indicates the weight of that edge such as the amount of time required to travel along it, e.g., and . Consider a scenario where a tourist is interested in finding an “Italian Restaurant.” If an undirected road network is considered, the top-1 “Italian Restaurant” is . However, in a directed road network, the shortest path from q to is . Therefore, for a directed road network, the top-1 result is because it is closer to the query location than . Now, consider that the tourist is looking for “Cafe Bakery.” The data object could score higher than data object because (“Cafe and bakery”) is more textually relevant to the query keywords than (“Cafe”), and is only marginally greater than .

Moving top-k spatial keyword queries in directed and dynamic road networks are useful for many location-based applications. However, query processing is costly because the movement of query object q may invalidate the query results. Therefore, the main challenge in processing moving TkSK queries is to maintain the freshness of the query results while the query objects move freely. A straightforward approach is to increase the update frequency of the query. However, this approach still cannot guarantee up-to-date query results between updates, and it increases the computation and communication overhead: whenever the query object changes its location, it must report the new location to the server, which increases the communication cost, and the server must recompute the results, which increases the computation cost.

To address the aforementioned challenges, we first present an efficient processing technique for snapshot TkSK queries in directed road networks. Then, we present a safe-exit-based approach for processing and monitoring moving TkSK queries where query object q is freely moving in a directed spatial network. The safe exit point of query object q represents a boundary point between the safe region and nonsafe region of q. The query result remains valid as long as the query object lies within its safe region. Therefore, the query results are recomputed only when q leaves its safe region, which significantly reduces the computation and communication costs. To the best of our knowledge, this is the first attempt to study moving top-k spatial keyword queries in directed and dynamic road networks.

Below, we summarize our contributions:
(i) We study the problem of continuous monitoring of moving top-k spatial keyword queries in directed and dynamic road networks.
(ii) We present an algorithm to monitor moving TkSK queries, which efficiently computes the safe exit points for query object q in a directed road network. The algorithm significantly reduces the computation and communication costs for moving queries.
(iii) We also propose a method that monitors the validity of the query results and the safe region when the weights of road segments are updated due to traffic conditions.
(iv) Finally, we conduct extensive experiments on real road network datasets and demonstrate the superiority of the proposed algorithm over the existing approach.

The remainder of this paper is structured as follows. Section 2 reviews the existing work on the processing of TkSK queries in Euclidean space and road networks. Section 3 provides terminology definitions and describes the problem. Section 4 elaborates on the proposed query processing technique for TkSK queries in directed road networks. In Section 5, we present our safe-exit-based technique to process moving TkSK queries. Section 7 presents a performance analysis of the proposed technique. Section 8 concludes this paper.

2. Related Work

In this section, we discuss some of the promising related studies of top-k spatial keyword queries. Our related work is divided into two sections: Section 2.1 reviews snapshot TkSK queries, and Section 2.2 presents the studies proposed to address moving TkSK queries.

2.1. Snapshot Top-k Spatial Keyword Queries

In recent years, spatial keyword queries have drawn the attention of many researchers. Several approaches have been proposed for ranking spatial data objects. Initially, Zhou et al. [7] worked on combining inverted indexes [8] and R-trees [9]. They proposed three different hybrid indexing structures. Their study demonstrated that building an inverted index on top of an R-tree provides superior performance. Hariharan et al. [10] proposed the indexing structure KR-tree by capturing the joint distribution of keywords in space. De Felipe et al. [11] proposed a data structure that combines an R-tree with text signatures. Each node of the R-tree exploits a signature to indicate the presence of keywords in the subtree of the node. However, these approaches address only Boolean keyword queries in Euclidean space.

Top-k spatial keyword queries, where data objects are ranked according to their combined textual and spatial relevance to the query, were first studied by Cong et al. [5] and Li et al. [6]. Both studies [5, 6] integrate location indexing and text indexing to generate IR-trees. These studies process top-k spatial keyword queries only in Euclidean space and are not suitable for processing top-k spatial keyword queries in road networks, where the distance between objects is determined by the shortest path connecting them. Later, Rocha et al. [12] proposed the indexing technique S2I, which maps each term in the vocabulary into a separate block or aR-tree for efficient processing of top-k spatial keyword queries. Zhang et al. [13] proposed an m-closest keyword query that returns the closest objects, in terms of distance, that match m query keywords.

Top-k spatial keyword queries in road networks were introduced by Rocha et al. [14]. In particular, they proposed three different indexing techniques (Basic Indexing, Enhanced Indexing, and Overlay Indexing) for processing spatial keyword queries in road networks.

2.2. Moving Top-k Spatial Keyword Queries

Recently, research focus has shifted to the continuous processing of spatial queries where query or data objects are arbitrarily moving in road networks, which is the most realistic scenario. Considerable research effort has been undertaken to process moving range, k nearest neighbor (kNN), and reverse k nearest neighbor (RkNN) queries [15–18]. However, there is a lack of efficient algorithms for moving top-k spatial keyword queries. Initially, Wu et al. [19] and Huang et al. [20] proposed different methods for monitoring top-k spatial keyword queries in Euclidean space. Guo et al. [21] studied moving top-k spatial keyword queries on road networks. They presented two methods for monitoring moving queries in a continuous manner that reduce the traversal of network edges. Later, Li et al. [22] proposed TPR-tree-based indexing to monitor moving top-k spatial keyword queries. In contrast to [21, 22], in this study we consider moving top-k spatial keyword queries in directed and dynamic road networks where each road segment has a particular orientation and its weight changes according to traffic conditions.

Table 1 compares our problem scenario with related work in terms of query type, space domain, and orientation of road networks.

3. Preliminaries

Section 3.1 defines the terms and notations used in this paper. Section 3.2 formulates the problem using an example that illustrates the general results of top-k spatial keyword queries.

3.1. Definition of Terms and Notations
3.1.1. Road Network

A road network is represented by a weighted directed graph G = (N, E, W), where N, E, and W denote the node set, edge set, and edge distance matrix, respectively. The network distance of an edge changes depending on the traffic conditions. Each edge is also assigned an orientation that is either undirected or directed. The undirected edge is represented by where and are the boundary nodes of an edge, whereas the directed edge is represented by or . Naturally, the arrow above the edge indicates the associated direction. We refer to as the starting node and as the ending node of an edge. For example, in Figure 1, is the starting node of edge , whereas it is the ending node for edge . The particular edge where a query object is located is called an active edge. Note that the distance between two points, and , is not symmetrical in directed road networks (i.e., ). For example, in Figure 1, the , whereas the because the shortest path from to is .
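The asymmetry of network distance in a directed graph can be illustrated with a short sketch. The toy network below is hypothetical (it is not the network of Figure 1), and the function name `shortest_distance` and the adjacency-list layout are assumptions made only for this illustration.

```python
import heapq

def shortest_distance(graph, source, target):
    """Dijkstra's algorithm on a directed graph {node: [(neighbor, weight), ...]}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return float("inf")  # target is unreachable

# Hypothetical toy network: a one-way edge n1 -> n2 and a detour back via n3.
toy_network = {
    "n1": [("n2", 2.0)],
    "n2": [("n3", 3.0)],
    "n3": [("n1", 4.0)],
}
forward = shortest_distance(toy_network, "n1", "n2")   # follows the one-way edge
backward = shortest_distance(toy_network, "n2", "n1")  # must detour through n3
```

Here the forward distance is the weight of the single edge, whereas the backward distance is the length of the detour, so the two values differ.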

3.1.2. Segment

Segment is the part of an edge between two points, and , on the edge. An edge consists of one or more segments. An edge is also considered a segment where the nodes are the end points of the edge. The weight of a segment is denoted by .

3.2. Problem Formulation

Similar to previous studies [5, 14, 23], we assume each data object has a point location in the road network and a text description . Given a query location , a set of keywords , and k number of data objects to return, the top-k spatial keyword query is defined as , which takes three arguments and returns the best k data objects from D according to a score that considers spatial proximity and text relevance. The score of a data object d is defined by the following equation:

where is the spatial relevance between and , is the textual relevance between and , and is a positive real number that determines the importance of one measure over the other. For example, if only textual relevance is considered, then . If more importance is given to spatial relevance, then .
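A minimal sketch of a score function that matches this description is given below. It assumes a ratio form in which the textual relevance is divided by one plus the weighted network distance; this form, the function name `score`, and the parameter names are assumptions for illustration and are not confirmed by the text.

```python
def score(textual_relevance, network_distance, alpha):
    """Assumed ranking function: textual relevance damped by alpha-weighted distance.

    With alpha = 0 the distance has no influence; larger alpha values give more
    weight to spatial proximity.
    """
    if alpha < 0 or network_distance < 0:
        raise ValueError("alpha and network_distance must be non-negative")
    return textual_relevance / (1.0 + alpha * network_distance)

# Hypothetical values: a perfectly matching object at network distance 4.
example_score = score(1.0, 4.0, alpha=1.0)
# With alpha = 0, only the textual relevance matters.
text_only_score = score(0.5, 10.0, alpha=0.0)
```

Under this assumed form, an object that is farther away can still outrank a closer one if its textual relevance is sufficiently higher.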

Spatial relevance is defined as the shortest distance between data objects d and q: . Thus, indicates that data object is more spatially relevant to q than data object . The textual relevance () can be computed using any popular information retrieval model, such as cosine similarity or the language model. In this study, we use the cosine similarity between and . The textual relevance is defined as follows:

The weight , where represents the frequency of term t in . The weight , where is the number of objects in D, and is the document frequency. A higher means a higher textual relevance to the query keywords. We use a variation of cosine similarity based on the significance factor of term t in a document n, where n represents the description of data object or query keywords . The significance is the normalized weight of the term in the document, taking into account the length of the document [24, 25]. Hence, the textual relevance can be rewritten as the sum, over the query terms, of the product of each term's significance in the object description and its significance in the query keywords.
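A small numeric sketch of such a cosine-style textual relevance may help. The term weights used here (a logarithmic term-frequency weight for the object description and a logarithmic inverse-document-frequency weight for the query) are a common choice and are assumptions of this sketch, as are the function and parameter names.

```python
import math
from collections import Counter

def textual_relevance(query_terms, doc_terms, doc_freq, num_objects):
    """Cosine similarity between query keywords and an object description."""
    term_freq = Counter(doc_terms)
    # Assumed weights: 1 + ln(tf) for the description, ln(1 + |D| / df) for the query.
    w_doc = {t: 1.0 + math.log(f) for t, f in term_freq.items()}
    w_query = {t: math.log(1.0 + num_objects / doc_freq[t])
               for t in set(query_terms) if doc_freq.get(t, 0) > 0}
    norm_doc = math.sqrt(sum(w * w for w in w_doc.values()))
    norm_query = math.sqrt(sum(w * w for w in w_query.values()))
    if norm_doc == 0.0 or norm_query == 0.0:
        return 0.0
    # Sum over query terms of (normalized doc weight) * (normalized query weight).
    dot = sum(w_query[t] * w_doc[t] for t in w_query if t in w_doc)
    return dot / (norm_doc * norm_query)

# Hypothetical collection of three objects.
doc_freq = {"cafe": 2, "bakery": 1, "italian": 1}
query = ["cafe", "bakery"]
rel_cafe_bakery = textual_relevance(query, ["cafe", "bakery"], doc_freq, 3)
rel_cafe_only = textual_relevance(query, ["cafe"], doc_freq, 3)
rel_unrelated = textual_relevance(query, ["italian"], doc_freq, 3)
```

In this toy collection, the description that contains both query terms receives a higher relevance than the one that contains only one of them, and a description with no query term receives zero.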

4. Query Processing System

In this section, we present the proposed query processing system that indexes the data objects and prunes the irrelevant edges for efficient query processing. In Section 4.1, we discuss the indexing framework, and in Section 4.2, we present an efficient keyword query processing algorithm for snapshot queries.

4.1. Indexing Framework

In this study, our main work focuses on moving queries in directed and dynamic road networks. We use a method similar to the enhanced technique presented in [12] as our basic framework for processing snapshot queries in directed and dynamic road networks. The indexing framework combines a road network framework [1] for storing spatial information and an inverted file for indexing data objects. For easy traversing of the network, we store the adjacent nodes of each given node by storing node id , edge id (), the direction of the edge, and the weight of the edge. The indexing framework consists of two main components: a pruning component and an inverted file component. Figure 2 illustrates the main components of the indexing framework. The pruning component first prunes the edges that contain data objects irrelevant to the query keyword. To achieve this, we introduce the highest significance of a given term t in the description of objects lying on the edge. The on an edge is retrieved by a key composed of a pair of edge id and term id (). The represents an upper-bound significance of any object lying on an edge with term t in its description. The inverted list of a term t on an edge is accessed only if the upper-bound score composed by and the minimum network distance between the starting node of the edge and query q may return a candidate data object. Naturally, the edges with upper-bound scores smaller than the score of the k-th object found so far are pruned.

We implement an inverted file for indexing data objects. The inverted file contains a vocabulary and inverted lists. The vocabulary keeps general information about each term (such as the frequency of the term), which is helpful in computing the textual relevance of the data objects. The inverted list stores the data objects located on the edge that have a term t in their description. An inverted list is identified by a key composed of . Each inverted file is a set of inverted lists. A separate inverted list is used for each term in the object description. An inverted list stores two attributes for each data object: first, the distance between the data object and the starting node ; second, the significance factor of the term in the description of the data object. Note that the network distance between two points in a directed road network is not symmetrical (i.e., ). Recall that the starting node is chosen according to the orientation of the edge such that the direction of the edge is from the node toward the data object. In Figure 1, is the starting node for . For bidirectional edges, any of the adjacent nodes can act as a starting node.
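The inverted file keyed by an (edge id, term) pair can be sketched as a small in-memory structure. The class name `EdgeInvertedFile`, its methods, and the sample entries are hypothetical; a disk-based implementation would differ.

```python
from collections import defaultdict

class EdgeInvertedFile:
    """Inverted lists keyed by (edge_id, term), with a per-key upper bound."""

    def __init__(self):
        # (edge_id, term) -> [(offset_from_start_node, object_id, significance)]
        self.lists = defaultdict(list)
        # (edge_id, term) -> highest significance of the term on that edge
        self.max_significance = {}

    def add(self, edge_id, term, object_id, offset, significance):
        key = (edge_id, term)
        self.lists[key].append((offset, object_id, significance))
        self.lists[key].sort()  # keep objects ordered by distance from the start node
        self.max_significance[key] = max(
            significance, self.max_significance.get(key, 0.0))

    def upper_bound(self, edge_id, term):
        """Highest significance of the term on the edge (0.0 if the term is absent)."""
        return self.max_significance.get((edge_id, term), 0.0)

    def postings(self, edge_id, term):
        return list(self.lists.get((edge_id, term), []))

# Hypothetical entries: two objects on edge "e2" whose descriptions contain "italian".
index = EdgeInvertedFile()
index.add("e2", "italian", "d1", offset=1.0, significance=0.7)
index.add("e2", "italian", "d4", offset=0.5, significance=0.9)
```

Storing the upper bound separately lets a query check whether an edge can contribute a candidate before its inverted list is read.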

The proposed indexing scheme has three main advantages. First, the object search relevant to query keywords is very efficient using the pair. Second, inverted files also store the network distance between the starting node and the data object, which helps in accessing the data object in the directed road network. Finally, the pruning technique allows for faster query processing by exploring fewer edges.

Table 2 presents the notations used in this study.

4.2. Query Processing Algorithm

Our algorithm traverses the road network incrementally in a similar fashion to Dijkstra’s algorithm [26]. Algorithm 1 returns the top-k data objects with the highest scores according to their joint textual and spatial relevance to the query. The algorithm begins by exploring the active edge where query object q is located, and expands the network in increasing order of distance from q. Each entry in the min-heap has the form , where indicates the anchor point in the edge. For an active edge, q becomes the anchor point. Otherwise, for directed edges, ending node becomes the anchor point. For bidirectional edges, either of the adjacent boundary nodes, i.e., or , becomes the anchor point. Let be the current set of top-k data objects and be the score of the k-th data object in . The function retrieves the candidate data objects located in an edge with a better score than . Next, the set is updated with the data objects in , and so is . The algorithm continues its expansion and inserts the adjacent edges of the boundary node until the heap is exhausted or the upper-bound score of the remaining data objects cannot exceed . The upper-bound score of node n is computed using and the maximum textual relevance (). Therefore, if , it means that even if there is an unexplored data object d matching all query keywords, its score cannot be better than that of the k-th object in because . This is certain owing to the fact that the algorithm strictly expands the node with the minimum distance to the query location.

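Input: Top-k spatial keyword query q
Output: Top-k data objects with the highest scores
candidate set ← ∅ /∗ set of candidate data objects ∗/
top-k set ← empty max-heap /∗ current top-k set ∗/
k-th score ← 0 /∗ k-th score in the top-k set ∗/
min-heap ← ∅
e ← active edge containing q, with q as the anchor point
min-heap.insert(q, e)
candidate set ← candidate data objects found on e
update the top-k set and the k-th score with the candidate set
while min-heap ≠ ∅ and the upper-bound score of the nearest unexplored node > k-th score do
    n ← node in min-heap with the minimum distance to q
    for each unexplored adjacent edge of n do
        candidate set ← candidate data objects found on the edge
        update the top-k set and the k-th score with the candidate set
        min-heap.insert(adjacent node, edge)
    end
end
return the top-k set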
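The incremental expansion described for Algorithm 1 can be sketched in runnable form. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the ratio-style score (textual relevance divided by one plus the alpha-weighted distance), handles directed edges only, takes a precomputed textual relevance per object, and uses a hypothetical toy network rather than Figure 1. All names are introduced for this example.

```python
import heapq

def top_k_spatial_keyword(graph, objects_on_edge, active_edge, q_offset, k,
                          alpha=1.0, max_relevance=1.0):
    """Incremental expansion over a directed network (illustrative sketch).

    graph:           {node: [(neighbor, weight, edge_id), ...]} for directed edges
    objects_on_edge: {edge_id: [(offset_from_start_node, object_id, relevance), ...]}
    active_edge:     (end_node, weight, edge_id) of the edge containing the query
    q_offset:        distance from the edge's starting node to the query location
    """
    end_node, weight, active_id = active_edge
    best = {}  # object_id -> best score found so far

    def kth_score():
        # Score of the k-th best object, or 0 while fewer than k objects are known.
        if len(best) < k:
            return 0.0
        return heapq.nlargest(k, best.values())[-1]

    def scan(edge_id, base_distance, min_offset=0.0):
        # Score every object on the edge that lies at or beyond min_offset.
        for offset, object_id, relevance in objects_on_edge.get(edge_id, []):
            if offset < min_offset:
                continue  # behind the query on a one-way edge
            s = relevance / (1.0 + alpha * (base_distance + offset))
            if s > best.get(object_id, 0.0):
                best[object_id] = s

    # Step 1: explore the part of the active edge ahead of the query.
    scan(active_id, -q_offset, min_offset=q_offset)

    # Step 2: expand nodes in increasing order of network distance from the query.
    heap = [(weight - q_offset, end_node)]
    settled = set()
    while heap:
        dist, node = heapq.heappop(heap)
        if node in settled:
            continue
        # No unexplored object can beat the current k-th score, so stop.
        if max_relevance / (1.0 + alpha * dist) <= kth_score():
            break
        settled.add(node)
        for neighbor, w, edge_id in graph.get(node, []):
            scan(edge_id, dist)
            if neighbor not in settled:
                heapq.heappush(heap, (dist + w, neighbor))

    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]

# Hypothetical network: the query lies on edge e1 (n1 -> n2), 1 unit past n1.
graph = {
    "n1": [("n2", 4.0, "e1")],
    "n2": [("n3", 2.0, "e2"), ("n4", 6.0, "e3")],
    "n3": [("n1", 3.0, "e4")],
}
objects_on_edge = {
    "e1": [(0.5, "d3", 1.0)],  # behind the query, reachable only via the cycle
    "e2": [(1.0, "d1", 1.0)],
    "e3": [(1.0, "d2", 0.5)],
}
top1 = top_k_spatial_keyword(graph, objects_on_edge, ("n2", 4.0, "e1"), 1.0, k=1)
top3 = top_k_spatial_keyword(graph, objects_on_edge, ("n2", 4.0, "e1"), 1.0, k=3)
```

The sketch keeps the best score per object in a dictionary instead of a max-heap, which is simpler but less efficient. Object d3 shows the effect of edge direction: it is only half a unit behind the query on the active edge, yet it can be reached only by travelling around the cycle.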

Algorithm 2 presents the procedure, which finds the candidate data objects. This procedure has two main steps. In the first step, the upper-bound score of the edges is computed using a significance factor of a term and the shortest distance between the edge and the query location. In the next step, the inverted lists of term t are fetched if their upper-bound score is greater than . In the inverted lists, the objects with score greater than are returned.

Input: Edge ID: , Term ID: , score of k-th object
Output: candidate list
compute
(4) ifthen
(5)
(6) end
(7) ifthen
(8) for  each data object in do
(9) compute
(10) end
(11) ifthen
(12)
(13) end
(14) end
(15) return
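The two steps of the candidate search (compute an upper bound for the edge, then read the inverted list only if the bound beats the k-th score) can be sketched for a single term as follows. The score form (significance divided by one plus the alpha-weighted distance), the function name `find_candidates`, and the sample values are assumptions of this sketch.

```python
def find_candidates(postings, max_significance, dist_to_start, alpha, kth_score):
    """Return (score, object_id) pairs on one edge that beat the k-th score.

    postings:         [(offset_from_start_node, object_id, significance), ...]
    max_significance: highest significance of the term on this edge
    dist_to_start:    network distance from the query to the edge's starting node
    """
    # Step 1: upper bound, as if the best object sat exactly at the starting node.
    upper_bound = max_significance / (1.0 + alpha * dist_to_start)
    if upper_bound <= kth_score:
        return []  # pruned: the inverted list is never read
    # Step 2: score the objects in the inverted list and keep the qualifying ones.
    candidates = []
    for offset, object_id, significance in postings:
        s = significance / (1.0 + alpha * (dist_to_start + offset))
        if s > kth_score:
            candidates.append((s, object_id))
    return sorted(candidates, reverse=True)

# Hypothetical inverted list for one (edge, term) pair.
postings = [(1.0, "d1", 1.0), (3.0, "d2", 0.5)]
kept = find_candidates(postings, 1.0, dist_to_start=1.0, alpha=1.0, kth_score=0.2)
pruned = find_candidates(postings, 1.0, dist_to_start=1.0, alpha=1.0, kth_score=0.6)
```

In the first call only d1 qualifies; in the second call the edge is pruned because its upper bound of 0.5 does not exceed the k-th score of 0.6.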

To understand the proposed algorithm, consider the road network presented in Figure 1. Assume that a query q generated a top-1 keyword query with q.d “Italian Restaurant.” For ease of presentation, we assume and the textual relevance is the number of occurrences of query keywords in divided by the number of keywords in the document (description of data object). For example, . The algorithm starts the network expansion from an active edge where q is the anchor point. Note that the direction of the edge is from to . Therefore, the algorithm explores only . There is no data object found in . Then, becomes the anchor point and edges , , and are inserted in min-heap. Next, the function retrieves the candidate data objects on edges , , and , whose score is better than . On edge , data object is retrieved with . Data object is inserted in the set, and the value of is set to 0.2. For edges and , there is no candidate object found because (“Cafe”) and (“Cafe and Bakery”) do not match with . The algorithm continues expanding the edges whose upper-bound score is greater than . The edge is explored next. The upper-bound score of is , which is less than . Similarly, for edge , the upper-bound score is . Therefore, the algorithm terminates and reports as the top-1 result.

5. Moving Top-k Spatial Keyword Queries

In this section, we present our method to monitor moving top-k spatial keyword queries where query objects are moving in a directed road network. Figure 3 provides an example of TkSK in road networks, where query point q issues a TkSK query at point . Note that the numbers on the arrows in the figure indicate the order of the steps. To obtain top-k results at , the server executes Algorithm 1 as mentioned in Section 4.2. Now, suppose that the query object moves to , as shown in Figure 4, and needs the top-k results at point . The simple method is to repeat the procedure executed at . However, recomputing the results whenever query q changes its location significantly increases the computation cost. Furthermore, it also increases the communication overhead because the query object must report its location whenever it moves, and the server must send the result set. To address these issues, we introduce the safe exit approach.

In the proposed framework, the server computes safe exit points for a query object. The server maintains a set of moving queries, and the query result remains valid as long as the query object remains inside its safe region, i.e., does not pass any of its safe exit points. Whenever a query object passes one of its safe exit points, the server recomputes the TkSK result and the safe exit points for the query object.
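The saving in communication can be illustrated with a simplified sketch in which the query moves along a single route and the safe region is an interval of route positions. The one-dimensional setting, the function names, and the toy server are assumptions made for this illustration only.

```python
def monitor(positions, compute_result_and_safe_interval):
    """Contact the server only when the query leaves its current safe interval."""
    requests = 0
    result, safe_interval = None, None
    results_seen = []
    for pos in positions:
        inside = (safe_interval is not None
                  and safe_interval[0] <= pos <= safe_interval[1])
        if not inside:
            # The query passed a safe exit point: report the location and refresh.
            result, safe_interval = compute_result_and_safe_interval(pos)
            requests += 1
        results_seen.append(result)
    return requests, results_seen

def toy_server(pos):
    """Hypothetical server: objects 'a' at 0 and 'b' at 10 with equal relevance."""
    if pos <= 5.0:
        return "a", (0.0, 5.0)   # 'a' is the top-1 result up to the midpoint
    return "b", (5.0, 10.0)

trajectory = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0]
requests, results_seen = monitor(trajectory, toy_server)
```

For this trajectory the query contacts the server twice, whereas reporting every location update would require seven requests.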

Next, we present our method to compute the safe exit points for a query object. The safe exit point represents a point in the segment where a safe region and nonsafe region meet. We compute the safe exit point using the divide-and-conquer technique. Before presenting the detailed methodology, we define the terminologies used in this section.

Definition 1 (safe region). A portion of a road segment that can guarantee that, as long as the query point lies in it, its top-k results remain valid.

Definition 2 (answer objects ). A data object d is called an answer object of query q if the score of data object d (), where represents any other data object in the directed road network. Similarly, we can generalize this definition for TkSK: a data object d is called an answer object of query q if the score of a data object d (), where represents the data object in the directed road network. In other words, we can state that all answer objects are top-k results of query q.

Definition 3 (nonanswer objects ). A data object d is called a nonanswer object of query q if the score of data object d (), where represents any other data object in the directed road network. Similarly, we can generalize this definition for TkSK: a data object d is called a nonanswer object of query q if the score of data object d (), where represents the k-th data object in the directed road network. In other words, none of the nonanswer objects are in the top-k results of query q.

Definition 4 (lowest answer object ). An answer object is called a lowest answer object to a point such that , where represents the score of the lowest answer object at point p. In other words, at point p, where is any other answer object in the set.

Definition 5 (highest nonanswer object ). A nonanswer object is called a highest nonanswer object to a point such that , where represents the score of the highest nonanswer object at point p. In other words, the at point p, where is any other nonanswer object in the set.

As discussed earlier, the main challenge in the continuous processing of moving TkSK is to maintain the validity of the result set because the movement of query objects can nullify the result set. To monitor the validity of the result set, we propose a safe-region-based approach.

5.1. Computation of Safe Exit Points

In this section, we present our technique to compute the safe exit points. The main goal is to find a point in the road network where the query result set will change. The result set will change when the score of highest nonanswer surpasses the score of . Generally, the textual relevance score does not change. Therefore, the score of data objects only changes because of the spatial relevance score, which can only change by the movement of query objects. The computation of the safe exit point is based on two key observations:

Observation 1. If , there is no safe exit point in the segment
Explanation. represents the set of answer objects at anchor point , whereas represents the set of answer objects at boundary node . As discussed earlier, the safe exit point is the particular point where the query results changed. If the query results at the starting node are the same as the ending node of any segment/edge, there does not exist any point where the query result is changing. Hence, we do not search the safe exit point in that segment.

Observation 2. If , there is a safe exit point in the segment
Explanation. In contrast to Observation 1, if the query results are different at the starting and ending points, then there exists a point where the query results are changing. Hence, there is a safe exit point in the segment.

To find the safe region, we observe the following cases:

Case 1 (when and the textual relevance of the highest nonanswer object and lowest answer object is the same). In this case, both the textual and spatial relevance have the same importance (i.e., ). In addition, the top-k result depends only on the spatial relevance because the textual relevance of both objects is the same. The data object that is closer to query point q becomes the answer object. For an undirected edge, the safe exit point is the center point, i.e., , between the lowest answer object and the highest nonanswer object. However, in case of a directed edge where , the safe exit point is either or . If , then the safe exit point is ; otherwise, the safe exit point is .

Case 2 (when and the textual relevance of the highest nonanswer object and lowest answer object is different). In this case, the top-k result depends on all three factors: , the spatial relevance, and the textual relevance. Clearly, for undirected edges, the midpoint between the lowest answer object and the highest nonanswer object does not provide a valid safe exit point. Therefore, we introduce the divide-and-conquer technique, which keeps dividing the search space until we reach the point where the score of the nonanswer object becomes greater than that of the answer object. Typically, the safe exit point should be closer to the data object whose score is lower. Based on this observation, we first compute the midpoint in a similar fashion to Case 1, and then we continue dividing the search space until we find the point. For directed edges, the safe exit point can be computed in a similar fashion to Case 1.

Case 2 also covers the other cases in which the safe exit point is not the midpoint between the lowest answer object and the highest nonanswer object. In these cases, the safe exit point depends on two or more functions and can be computed using the aforementioned divide-and-conquer technique. The following are the scenarios where the safe exit point can be computed using Case 2:
(a) When and textual relevance of the nearest nonanswer object and farthest answer object is different.
(b) When and textual relevance of the nearest nonanswer object and farthest answer object is same.

Case 3 (when ). This means the spatial relevance has no effect on the score of data objects. Hence, no monitoring is required for this scenario.

Algorithm 3 retrieves the safe exit points using the observations we discussed earlier. The core function in this algorithm is ComputeSafeExit, which finds the safe exit point in a segment between and . ComputeSafeExit is described in detail in Algorithm 4. First, Algorithm 4 determines and at point . Recall that is the lowest answer object to p, whereas is the highest nonanswer object to p. Algorithm 4 computes the safe exit point based on the cases we discussed earlier. There are two further scenarios for each of Cases 1 and 2. For Case 1, if , then the safe exit point is the midpoint between and . If , then the edge is directed, and therefore the safe exit point is either or . If lies on the edge , then is the safe exit point. Otherwise, is the safe exit point.

Input: Same as Algorithm 1
Output:: a set of safe exit points
/∗set of safe exit points
(4)
(5) /∗Results calculated using Algorithm 1
(6)
(7) /∗Results calculated using Algorithm 1
(8) ifthen
(9) no safe exit point /∗refer to Observation 1
(10) end
(11) ifthen
(12) /∗safe exit point exists; refer to Observation 2
(13) end
(14) return

Input: same as Algorithm 1
Output: se: safe exit point in
for each point , such that
(4) for each point , such that
(5) if  Case 1  then
(6) ifthen
(7)
(8) end
(9) ifthen
(10) or where
(11) end
(12) end
(13) if  Case 2  then
(14) ifthen
(15) closest point to such that
(16) end
(17) ifthen
(18) Same as Line (10)
(19) end
(20) end
(21) return

Similarly, for Case 2, if , then the safe exit point is computed by repeatedly halving the search space until we find the closest point such that . The safe exit point is computed in the same way as in Case 1 if .
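The halving search for Case 2 can be sketched as a bisection over the offset along a segment. The sketch assumes that the score difference between the two objects changes sign at most once along the segment and that scores follow the ratio form (relevance divided by one plus the alpha-weighted distance); the function name and the sample objects are hypothetical.

```python
def safe_exit_offset(score_answer, score_nonanswer, length, tol=1e-6):
    """Bisection for the point where the nonanswer score overtakes the answer score.

    score_answer, score_nonanswer: functions of the offset x in [0, length]
    Returns the offset of the safe exit point, or None if the result does not change.
    """
    lo, hi = 0.0, length
    if score_nonanswer(lo) > score_answer(lo):
        return lo  # the result is already different at the start of the segment
    if score_nonanswer(hi) <= score_answer(hi):
        return None  # same result at both endpoints: no safe exit point here
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if score_nonanswer(mid) > score_answer(mid):
            hi = mid  # the crossing lies in the first half
        else:
            lo = mid  # the crossing lies in the second half
    return lo

alpha = 1.0
# Hypothetical objects: the answer object is 1 unit behind the segment start and
# the nonanswer object is 7 units ahead, with different textual relevance.
def answer(x):
    return 1.0 / (1.0 + alpha * (1.0 + x))

def nonanswer(x):
    return 0.5 / (1.0 + alpha * (7.0 - x))

exit_case2 = safe_exit_offset(answer, nonanswer, length=6.0)

# With equal relevance the crossing falls on the midpoint, as in Case 1.
def nonanswer_equal(x):
    return 1.0 / (1.0 + alpha * (5.0 - x))

exit_case1 = safe_exit_offset(answer, nonanswer_equal, length=4.0)
```

Each iteration halves the interval, so the number of score evaluations grows logarithmically with the ratio of the segment length to the tolerance.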

5.2. Computation of Safe Exit Points for Example

Consider the same example in Figure 1, where the query point q issues a top-1 keyword query with q.t “Italian restaurant.” For this example, let us consider . The monitoring algorithm starts exploring from the active edge containing the query object q. Therefore, is explored first. As shown in Table 3, for , and . According to Observation 1, no safe exit point exists in this segment. Therefore, edges adjacent to are explored, and becomes the new . The edge is explored next. Similarly, the answer object at and is the same: . Therefore, a safe exit point does not exist in . The edge is explored next. As shown in Table 3, and . By Observation 2, there is a safe exit point in . As shown in Figure 1, and . Therefore, according to Case 1, the safe exit point is the midpoint between and . That is, , where and for . Consequently, , which means that the distance from to is 1.

Next, we determine a safe exit point in . As shown in Table 3, the answer object at is also the same as . Hence, no safe exit point exists in this edge. Next, is explored with . According to Table 3, and . Therefore, a safe exit point exists in this edge. This edge is directed, and for each point , the shortest distance from p to is from . Therefore, is the safe exit point.

The bold lines in Figure 5 indicate the safe region of q. The top-1 result remains valid as long as the query q lies within the safe region.

Next, we analyze the time complexity for determining a set of safe exit points using a set of qualifying objects . Note that indicates the set of k data objects that satisfy the query condition at . According to Dijkstra's algorithm [26], the time complexity for computing a set of answer objects at a query point q is . This means that holds for endpoints and . Thus, the time complexity of determining the skyline with the k-th highest score is where is the number of qualifying objects that contribute to the skyline with the k-th highest score. Therefore, the time complexity of determining a safe exit point coincides with the time complexity of determining the two skylines, i.e., the skyline with the k-th highest (or lowest) score for answer objects and the skyline with the highest score for nonanswer objects. This is because the safe exit point is found at the cross point between these skylines.

Figure 6 represents the skyline graph for in an edge . Let us draw the score function for and for the road segment where a safe exit point exists. This is because and for . For each point , the distance between and point p can be represented as . Similarly, for each point , the distance between and point p can be represented as . Let be a variable x. We can write and . Then, we can represent the score functions and as follows: for for
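To illustrate how a cross point between the two skylines could be located, the following sketch assumes that every score is a linear function of the offset x along the edge, score(x) = slope * x + intercept, and that higher scores rank first. Both are assumptions made for illustration; the function name and the tuple layout are hypothetical and not taken from the paper.

```python
from typing import List, Optional, Tuple

Line = Tuple[float, float]  # (slope, intercept): score(x) = slope * x + intercept


def first_crossing(answers: List[Line],
                   nonanswers: List[Line],
                   length: float) -> Optional[float]:
    """Smallest x in (0, length] where some nonanswer score reaches the
    lowest answer score, or None if the answers stay ahead on the edge.

    Assumes higher scores rank first and every answer scores strictly
    higher than every nonanswer at x = 0.
    """
    best: Optional[float] = None
    for a_slope, a_icpt in answers:
        for n_slope, n_icpt in nonanswers:
            gap0 = a_icpt - n_icpt        # answer lead at x = 0 (positive)
            closing = n_slope - a_slope   # rate at which the lead shrinks
            if closing <= 0:
                continue                  # this pair never crosses for x > 0
            x = gap0 / closing
            if x <= length and (best is None or x < best):
                best = x
    return best


# Hypothetical edge of length 5 with one answer and two nonanswers.
cross = first_crossing([(-1.0, 10.0)], [(1.0, 4.0), (0.0, 1.0)], 5.0)
```

Under the linearity assumption, the gap between the lowest answer score and the highest nonanswer score equals the minimum of the pairwise gaps, so the earliest pairwise crossing is also the point where the two skylines meet.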

Finally, we present the lemma to prove that safe exit points computed by COSK are correct.

Lemma 8. The COSK algorithm correctly computes a set of safe exit points.

Proof. We will prove the correctness of the COSK algorithm by contradiction. We assume that if , there is no safe exit point in a road segment . This means that, for each point p in the road segment , the query result at p equals , i.e., . However, it leads to a contradiction that when . Therefore, if , a safe exit point exists in . In addition, a safe exit point is determined using the skyline for answer objects and the skyline with the highest score for nonanswer objects when . The first skyline is a composite polyline drawn from answer objects in . The second skyline is a composite polyline drawn from nonanswer objects in .

6. Monitoring Query Results and Safe Regions in Dynamic Directed Road Networks

In this section, we discuss the monitoring of spatial keyword queries in dynamic road networks where the network distance changes depending on the traffic conditions. Updates to the weights of some edges may invalidate the query results or the safe region of q, even though the query object q remains within its safe region. Figure 7 illustrates an example of changing the weights of edges and . For convenience, we consider and q.t = “Italian restaurant.” In Figure 7(a), the top-1 result is and the bold lines show the safe region of query q. Now consider that at time the weights of two edges and change due to heavy traffic, as shown in Figure 7(b). This update may invalidate the query result or the safe region of q. Therefore, it is necessary to monitor the validity of the results and the safe region when such changes occur.

Next, we introduce a monitoring region to monitor the validity of the safe region effectively when the weight of an edge changes. The monitoring region MR contains all the points between the query point q and the lowest answer object and the highest nonanswer object. Formally, it is defined as , where is the distance between q and the lowest answer object and is the highest nonanswer object. In the given example, and . Therefore, the dotted lines in Figure 8(a) show the monitoring region of query object q.
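As a rough illustration of how such a region might be materialized, the sketch below collects the directed edges whose tail node lies within a given network distance of the query position, using a distance-bounded Dijkstra expansion. It assumes that the monitoring region can be approximated by a single radius supplied by the caller; the adjacency-list layout, the `radius` parameter, and the function name are hypothetical.

```python
import heapq
from typing import Dict, List, Set, Tuple

Graph = Dict[str, List[Tuple[str, float]]]  # node -> [(successor, weight)]


def monitoring_edges(graph: Graph, source: str,
                     radius: float) -> Set[Tuple[str, str]]:
    """Directed edges (u, v) whose tail u is within `radius` of `source`."""
    dist: Dict[str, float] = {source: 0.0}
    heap: List[Tuple[float, str]] = [(0.0, source)]
    region: Set[Tuple[str, str]] = set()
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        for v, w in graph.get(u, []):
            region.add((u, v))            # edge starts inside the radius
            nd = d + w
            if nd <= radius and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return region


# Hypothetical one-way ring a -> b -> c -> d -> a.
ring: Graph = {"a": [("b", 2.0)], "b": [("c", 3.0)],
               "c": [("d", 4.0)], "d": [("a", 1.0)]}
near = monitoring_edges(ring, "a", 5.0)
```

Because the expansion follows edge directions only, an edge that leads toward the query but cannot be reached from it is never added, which matches the directed setting considered here.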

Now at time , the updates to edges and , which are not part of the monitoring region, can safely be ignored. However, the update on segment , which is associated with the monitoring region, may invalidate the results. As shown in Figure 8(b), after the update the top-1 result becomes and the bold lines represent the new safe region of q.

Algorithm 5 monitors the validity of the result set and safe region of query object q when the weight of any edge changes. Suppose that the weight of edge changes at time . First, the algorithm checks whether edge is associated with the monitoring region. If it is not part of the monitoring region, the algorithm simply ignores the update, and the query results and safe region remain valid. In contrast, if edge is associated with the monitoring region (i.e., ), the algorithm re-evaluates the query results. Consequently, the top-k results and safe region of query q are updated. Finally, the algorithm updates the monitoring region of q.

Input: Monitoring region: MR, updated edge:
Output: none
ifthen
(4) /∗edge is not part of monitoring region
(5) ignore the change in the weight of edge
(6) end
(7) /∗set of safe exit points
(8) else
(9) /∗update set of top-k results
(10) /∗update safe exit points
(11) /∗update monitoring region
(12) end
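The control flow of Algorithm 5 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `QueryState`, `on_weight_update`, and the `recompute` callback (standing in for a full snapshot evaluation of the query) are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple

Edge = Tuple[str, str]


@dataclass
class QueryState:
    monitoring_region: Set[Edge]
    results: List[str] = field(default_factory=list)
    safe_exits: List[Tuple[Edge, float]] = field(default_factory=list)


def on_weight_update(state: QueryState, edge: Edge,
                     recompute: Callable[[], QueryState]) -> bool:
    """Return True if the update forced a re-evaluation of the query."""
    if edge not in state.monitoring_region:
        return False                      # outside MR: the update is ignored
    fresh = recompute()                   # hypothetical snapshot evaluation
    state.results = fresh.results                       # update top-k results
    state.safe_exits = fresh.safe_exits                 # update safe exit points
    state.monitoring_region = fresh.monitoring_region   # update monitoring region
    return True
```

Keeping the membership test as a set lookup means that updates outside the monitoring region cost constant time per query, which is the case the algorithm is designed to make cheap.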

7. Performance Evaluation

In this section, we evaluate the performance of COSK through simulation experiments. We describe our experimental settings in Section 7.1, and we present our experimental results for static and dynamic road networks in Sections 7.2 and 7.3, respectively.

7.1. Experimental Settings

All of our experiments were performed using real road networks, namely, Oldenburg, San Francisco, and San Joaquin. All three road networks were obtained from [27]. The original road network of San Francisco had 21,047 nodes and 21,692 edges. We reformatted the network, pruned approximately 30% of the nodes, and adjusted the edges and their weights accordingly. This resulted in a network with 14,732 nodes and 14,316 edges. Both the direction of edges and data objects on the edges were generated randomly. The description of each data object was extracted from Twitter messages [28], and we assigned one tweet per data object. Table 4 presents the characteristics of the data sets used in the experimental evaluation. We simulated moving query objects by using a spatiotemporal data generator [29]. The input to the generator was the road network of the data set used, and the output was the set of query objects moving on the road network. Each experiment had 100 moving queries, which were continuously monitored for 100 timestamps (1 timestamp = 1 second), and the average result was reported in the experiments.

As a benchmark for COSK in static road networks, we implemented the CMTkSK+ algorithm [22], which also continuously monitors moving top-k spatial keyword queries in road networks. However, this algorithm was originally designed for undirected road networks. To make a fair comparison, we modified CMTkSK+ to process top-k spatial keyword queries in directed road networks and called it CMTkSK+. Specifically, we modified the distance computation method between two points such that in directed road networks, . Since CMTkSK+ does not handle top-k spatial queries in dynamic road networks, we compared the performance of COSK with a basic algorithm that recomputes the results whenever the query object changes its location. All algorithms were implemented in Java and executed on a desktop PC with a 2.80-GHz Intel Core i5 processor and 8 GB of memory. In the experiments, we compared query processing times; edges processed, i.e., the number of edges processed for retrieving query results; and index sizes. Table 5 summarizes the parameters used in the experiments. In each experiment, we varied a single parameter within the range that is shown in Table 5 while maintaining the other parameters at the bolded default values.
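The practical consequence of computing distances in a directed network is that the distance from one point to another generally differs from the distance in the reverse direction. The sketch below illustrates this with a plain Dijkstra search that follows edge directions; it is an assumption about what the modified distance computation has to respect, not a reproduction of the modification itself, and the graph and names are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

Graph = Dict[str, List[Tuple[str, float]]]  # node -> [(successor, weight)]


def network_distance(graph: Graph, src: str, dst: str) -> float:
    """Shortest-path length from src to dst, following edge directions."""
    dist: Dict[str, float] = {src: 0.0}
    heap: List[Tuple[float, str]] = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d                      # first pop of dst is optimal
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")                   # dst is unreachable from src


# Hypothetical one-way triangle: a -> b -> c -> a.
roads: Graph = {"a": [("b", 2.0)], "b": [("c", 3.0)], "c": [("a", 4.0)]}
forward = network_distance(roads, "a", "b")    # follows a -> b
backward = network_distance(roads, "b", "a")   # must go b -> c -> a
```

In this toy network the forward distance is 2 while the backward distance is 7, so any method that assumes symmetric distances would rank objects incorrectly.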

We evaluated the performance of the algorithms by using the following measures: total amount of server CPU time, which indicates the query processing time, and total communication cost as the total number of points (i.e., the location updates sent by query objects, and the query results and safe exit points returned by the server) transferred between clients and the server. The battery power and wireless bandwidth consumption typically increase with the amount of data transferred between objects (clients) and servers. Thus, we used the amount of transferred data as a metric to evaluate the communication cost.

7.2. Experimental Results of Top-k Spatial Keyword Queries in Static Road Networks
7.2.1. Effect of k

Figure 9 shows the effect of the number of results on the query processing time and communication cost for both algorithms. Figure 9(a) indicates that the query processing time increases for both algorithms as the value of k increases. This is expected because with an increase in k, more data objects need to be explored and verified. Nevertheless, COSK significantly outperforms CMTkSK+ for two main reasons. First, a relevant object search is very efficient when using the highest significant factor; and second, COSK does not need to verify the set of answer objects as long as the query object lies in a safe region. On the other hand, the CMTkSK+ query processing time increases significantly because it has to monitor and verify the set of candidate objects periodically. In Figure 9(b), the communication costs for both algorithms increase as the number of objects increases. However, the proposed algorithm demonstrates superior performance compared to CMTkSK+ because client-server communication is not required while the query object stays within its safe region, whereas in CMTkSK+, the query object is required to report its location to the server whenever it moves.
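The difference in communication cost can be illustrated with a toy model that counts transferred points under the metric described in Section 7.1. The model is a deliberate simplification and rests on several assumptions: the query moves along a one-dimensional line, a safe region is an interval with two exit points, and the periodic baseline exchanges one location and k results at every timestamp. All function names are hypothetical.

```python
from typing import Callable, List, Tuple


def periodic_cost(positions: List[float], k: int) -> int:
    """Points transferred when the client reports at every timestamp."""
    # one location update up and k result objects down, per timestamp
    return len(positions) * (1 + k)


def safe_region_cost(positions: List[float], k: int,
                     region_of: Callable[[float], Tuple[float, float]]) -> int:
    """Points transferred when the client reports only on leaving its region."""
    cost = 0
    lo, hi = 1.0, 0.0                     # empty interval forces a first report
    for p in positions:
        if lo <= p <= hi:
            continue                      # still inside the safe region
        lo, hi = region_of(p)             # server recomputes region around p
        cost += 1 + k + 2                 # location + k results + 2 exit points
    return cost


# Hypothetical trajectory: 10 timestamps, 0.5 units per timestamp, and a
# safe region that always extends 1 unit on either side of the report point.
track = [0.5 * t for t in range(10)]
baseline = periodic_cost(track, k=1)
with_regions = safe_region_cost(track, k=1, region_of=lambda p: (p - 1.0, p + 1.0))
```

Each safe-region report is more expensive than a periodic one because it also carries the exit points, so the saving depends on how many timestamps the query spends inside a region before leaving it.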

7.2.2. Effect of Data Object Cardinality

This experiment was conducted on the San Joaquin dataset. This dataset included 19,098 data objects; therefore, we randomly generated approximately 30,000 additional data objects on different edges. In Figure 10, we evaluate the performance of COSK and CMTkSK+ by varying the cardinality of the data objects. Note that corresponds to a low density of data points, while corresponds to a high density. In Figure 10(a), it is interesting to notice that the query processing times of both algorithms decrease as the cardinality of the data objects increases. For CMTkSK+, this is because with high density, the monitoring range of a query decreases. However, for COSK, it is mainly because when the data density is high, fewer edges need to be expanded, which decreases the query processing time. In Figure 10(b), we study the influence of the cardinality of the data objects on the communication costs. The experimental results indicate that CMTkSK+ incurs almost constant communication costs regardless of data object cardinality. However, the communication costs of COSK increase in proportion to the value. This is expected because the safe region becomes smaller as the density of the data objects increases, which increases the communication costs.

7.2.3. Effect of Query Keywords (n)

Figure 11 shows the query processing time and communication cost for COSK and CMTkSK+ as a function of the number of query keywords. Figures 11(a) and 11(b) show that the performance of both algorithms degrades when the number of keywords increases. This is mainly because by increasing the number of query keywords, the number of relevant objects may also increase, resulting in a higher query processing time and communication cost. However, the safe-region-based algorithm COSK scales better than CMTkSK+ because of its less expensive monitoring technique.

7.2.4. Effect of the Query Preference Parameter

Figure 12 demonstrates the impact of the query parameter on the query processing time and on the communication cost. A small value of indicates a greater importance of textual relevance, whereas a high value of gives more preference to the spatial relevance. It is interesting to note that the query processing time is lower for higher values of , which give more weight to spatial relevance. This is mainly because when spatial relevance carries more weight, fewer edges and objects need to be explored and processed to determine the top-k data objects. Observe that in Figure 12(b), the number of messages sent by COSK decreases sharply with an increase in .

7.2.5. Effect of Speed

Figure 13(a) demonstrates the influence of the speed of the query objects on the query processing time of the COSK and CMTkSK+ algorithms. The experimental results indicate that the performance of CMTkSK+ is not significantly influenced by the speed of the query objects because the candidate objects must be monitored at regular time intervals, regardless of the speed. On the other hand, for COSK, the performance gradually degrades as the speed of the query objects increases because the objects leave their respective safe regions more frequently. Figure 13(b) shows the communication costs of COSK and CMTkSK+ with respect to the speed of the query objects. CMTkSK+ incurs almost constant communication costs because a server-initiated request to verify the candidate objects does not depend on the speed. For COSK, the query objects cross safe regions more frequently when the speed is high, which increases the communication costs.

7.2.6. Effect of Mobility

Figure 14 shows the effect of mobility (mobility refers to the percentage of query objects that are moving at any timestamp) on the performance of COSK and CMTkSK+ algorithms. As expected, the query processing time and communication costs for both algorithms increase with . Nevertheless, COSK performs better than CMTkSK+ in terms of query processing time and communication costs.

7.2.7. Effect of Directed Edges

Figure 15 shows the impact of the percentage of directed edges on the performance of the COSK and CMTkSK+ algorithms. The query processing time increases with because each algorithm needs to explore more edges to retrieve the top-k results. However, the communication cost is not significantly affected by the value of for either algorithm.

7.2.8. Effect of Datasets

Figure 16 shows the index sizes of the COSK and CMTkSK+ approaches for different datasets. As shown in Figure 16, both algorithms have similar index sizes. However, COSK has a minor space overhead because it stores additional information about the highest significance factor of edges. More importantly, this space overhead is minimal compared to the gain achieved by COSK in query processing time and communication costs.

7.3. Experimental Results of Top-k Spatial Keyword Queries in Dynamic Road Networks

In this section, we evaluate the performance of COSK and the basic algorithm for dynamic road networks. Here, indicates the percentage of all edges that change their weight at each timestamp. The length of an updated edge is randomly selected between 0.1 and 10 times the original length. Figure 17(a) depicts the query processing time of COSK and the basic algorithm. It is evident from the figure that the query processing time of the basic algorithm is not significantly affected by . This is mainly because the query objects issue top-k spatial queries at each timestamp. However, the query processing time of COSK increases with the value of because the probability that an updated edge is associated with the monitoring region of query q increases with . Therefore, when becomes large, the results need to be updated frequently, which increases the query processing time. Figure 17(b) shows the communication costs of COSK and the basic algorithm with respect to . The basic algorithm incurs almost constant communication costs regardless of the value of . In contrast, the communication cost of COSK increases with because the query results and safe regions need to be updated frequently.
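The update workload described above could be generated along the following lines. This is a sketch under stated assumptions: the text says only that the new length is randomly selected between 0.1 and 10 times the original, so the uniform distribution used here is an assumption, and the function and parameter names are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def perturb_weights(original: Dict[Edge, float], fraction: float,
                    rng: random.Random) -> Dict[Edge, float]:
    """Return new weights for a random `fraction` of the edges.

    Each chosen edge gets a weight drawn uniformly (an assumption) between
    0.1 and 10 times its original length; unchosen edges are omitted.
    """
    count = round(len(original) * fraction)
    chosen: List[Edge] = rng.sample(sorted(original), count)
    return {e: original[e] * rng.uniform(0.1, 10.0) for e in chosen}


# Hypothetical network of 10 unit-length edges with 30% updated per timestamp.
base = {(f"n{i}", f"n{i + 1}"): 1.0 for i in range(10)}
updates = perturb_weights(base, 0.3, random.Random(42))
```

Scaling from the original length rather than from the current weight keeps the weights bounded over many timestamps, which appears to be what the description intends.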

8. Conclusion

In this paper, we investigated moving top-k spatial keyword queries in directed and dynamic road networks. We presented an efficient indexing framework using inverted files that indexes the data objects on edges, allowing for the effective searching of data objects relevant to queries in terms of both textual and spatial relevance. We also presented a safe-exit-based algorithm called COSK to monitor moving top-k spatial keyword queries. We demonstrated that the query results remain valid as long as the query object resides within a safe region. Furthermore, COSK can effectively monitor the validity of query results and safe regions in dynamic road networks. Finally, an experimental evaluation conducted on real road networks demonstrated that COSK significantly reduced the query processing time and communication costs compared to the CMTkSK+ algorithm.

Data Availability

The real road network data used in this study have also been used in many previous studies. The road network data are cited in the manuscript and are available at https://www.cs.utah.edu/~lifeifei/SpatialDataset.htm. To simulate the moving queries, the authors used the spatiotemporal data generator, which has also been used in previous studies. The research article describing the generator is cited in the manuscript. The documentation and source files of the generator are available at https://iapg.jade-hs.de/personen/brinkhoff/generator/. The authors used Twitter tweets to generate the descriptions of data objects and the query keywords. The tweets used are accessible at http://followthehashtag.com/datasets/free-twitter-dataset-usa-200000-free-usa-tweets/.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

Hyung-Ju Cho was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean Government (MSIP) (NRF-2016R1A2B4009793) and this research was partially supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2016R1D1A1B03934129).