Abstract

The value of large amounts of location-based mobile data has received wide attention in many research fields, including human behavior analysis, urban transportation planning, and various location-based services. Nowadays, both scientific and industrial communities are encouraged to collect as much location-based mobile data as possible, which brings two challenges: how to efficiently process queries over big location-based mobile data, and how to reduce the cost of storage services, because it is too expensive to store several exact data replicas for fault tolerance. So far, several dedicated storage systems have been proposed to address these issues. However, they do not work well when the ranges of queries vary widely. In this work, we design a storage system based on a diverse replica scheme that not only improves query processing efficiency but also reduces the cost of storage space. To the best of our knowledge, this is the first work to investigate data storage and processing in the context of big location-based mobile data. Specifically, we conduct an in-depth theoretical and empirical analysis of the trade-offs between different spatial-temporal partitioning and data encoding schemes. Moreover, we propose an effective approach to select an appropriate set of diverse replicas, which is optimized for the expected query loads while conforming to the given storage space budget. The experimental results show that using diverse replicas can significantly improve the overall query performance and that the proposed algorithms for the replica selection problem are both effective and efficient.

1. Introduction

With the development of data collection capabilities, it has become much easier to collect huge amounts of location-based mobile data about users or objects via billions of electronic devices such as mobile phones, tablet computers, vehicle GPS navigators, and a wide variety of sensors. For example, taxi companies monitor the mobility information of taxis; telecom operators continuously record the locations of active mobile phones; location-based service (LBS) providers keep the mobile information of users whenever they use the services. Such large amounts of location-based mobile data are valuable for many research fields such as human behavior analysis [1], urban transportation planning [2], customized routing recommendation [3, 4], and location-based advertising and marketing [5].

We call these datasets location-based mobile data because they share the following three characteristics. First, all these datasets have at least three core attributes: object ID, timestamp, and location. They may also contain other attributes, called common attributes, which can vary among datasets. Second, most queries on these datasets are associated with spatial and temporal ranges. Hence, efficient indexing schemes for range data filtering are required to improve overall query performance. Third, mainstream big data storage and management systems (e.g., HDFS, parallel RDBMSs, and NoSQL databases) are not suitable for storing and processing these data. This is because these systems do not naturally lend themselves to spatial-temporal range queries, especially when the number of result records is very large. The main reason is that they cannot physically co-cluster records according to spatial and temporal proximity, which leads to too many slow random disk accesses.

In recent years, several dedicated storage systems have been proposed to store big location-based mobile data, such as TrajStore [6], CloST [7], and Panda [8]. In these systems, data are partitioned in terms of spatial and temporal attributes, and the records in the same partition are physically stored together. To process a range query, we only need to sequentially scan the partitions whose ranges intersect with the query range. It has been demonstrated that this approach is much more efficient than fetching a large number of nonconsecutive disk pages. In addition, these systems can achieve a high data compression ratio by leveraging specialized storage structures and encoding schemes.

However, the above existing dedicated systems do not work well when the ranges of queries vary widely. The fundamental reason is that there is only one set of configuration parameters to organize (i.e., partition and compress) the data. It is obvious that we cannot find a single configuration that is optimized for all possible queries. For example, consider that data are partitioned into many small partitions (whose size, in the extreme case, can be as small as a disk page). On one hand, queries with small ranges can be processed efficiently because we can prune most of the partitions. On the other hand, queries with large ranges will incur high I/O costs because a large number of partitions will be involved and locating each of them will invoke a random page access. In this context, these systems have to choose the parameters optimized for the overall performance of the expected query workloads. Note that the expected query workloads can be either derived from historical queries [7] or known as a priori knowledge [6].

In this paper, we explore the use of diverse replicas in the context of storage systems for big location-based mobile data. In big data storage systems, e.g., Hadoop HDFS, replication is mainly used for data availability and durability, but not yet for optimizing the performance of query processing. Hence, the use of diverse replicas is a novel approach. The implications of diverse replicas are twofold. First, data are partitioned and compressed in multiple ways such that different queries can pick the best-fit configuration to minimize the processing time. Second, in spite of the diversity of physical data organizations, diverse replicas can recover each other when failures occur because they share the same logical view of the data. Since we can replace the exact replicas with diverse ones, the gain in query performance does not necessarily come at the cost of more storage space. Though the potential advantages of using diverse replicas are prominent, it is nontrivial to determine which replicas to use. Concretely, given a large location-based mobile dataset, a representative workload, and a constraint on storage space, we need to find an optimal or near-optimal set of diverse replicas in terms of overall query performance. To address this problem, we make the following contributions:
(i) We propose BLOT, a system abstraction that describes an important class of location-based mobile data storage systems. Based on the BLOT system abstraction, we conduct general discussions on how to integrate diverse replicas into existing systems.
(ii) We formally define the replica selection problem that finds the optimal set of diverse replicas in the context of BLOT systems. Besides, we prove that this problem is NP-hard.
(iii) We propose two solutions to the replica selection problem, including an exact algorithm based on integer programming and an approximation algorithm based on a greedy strategy. In addition, we propose several practical approaches to reduce the input size of the problem.
(iv) We design a simple yet effective cost model to estimate the cost of an arbitrary query on an arbitrary replica configuration. The parameters of the cost model can be either calculated by closed-form formulas or measured accurately by a few low-cost experiments.
(v) We evaluate our solutions using two typical deployment environments of BLOT systems. The experimental results confirm that using diverse replicas can significantly improve the overall query performance. The results also demonstrate that the proposed algorithms for the replica selection problem are both effective and efficient.

The rest of this paper is organized as follows. Section 2 briefly summarizes the related works and Section 3 presents the common designs of BLOT systems as well as the general use of diverse replicas. Section 4 defines the replica selection problem, proves its hardness, and describes the solutions. Section 5 presents the query cost estimation model for BLOT systems. Section 6 shows the experiment results and conducts analysis and Section 7 concludes the paper.

2. Related Work

There is a plethora of work on storing spatial-temporal data and efficiently processing range queries. Early studies, dating back to the 1970s and 1980s, mainly focus on indexing individual points or trajectories. Representative works include k-d tree [9], quadtree [10], R-tree [11], and TB-tree [12]. These data structures incur many random reads, which are inefficient when the number of records in the query result is large. To address this issue, TrajStore [6] and PIST [13] attempt to co-locate data according to spatial and temporal proximity and use a relatively large partition size. Neither TrajStore nor PIST can scale to terabytes of data because they only consider nondistributed environments. CloST [7] and SpatialHadoop [14] are two Hadoop-based systems which aim at providing scalable distributed storage and parallel query processing of big location-based mobile data. SATO [15] is a spatial data partitioning framework that can quickly analyze and partition spatial data with an optimal spatial partitioning strategy for scalable query processing. Note that TrajStore, PIST, CloST, SpatialHadoop, and SATO can be viewed as concrete instances of BLOT systems without using diverse replicas.

Recommending a physical configuration for a given workload has been widely studied since 1987 [16]. Most of the existing works [17-23] propose effective methods to estimate the cost of a given workload over candidate physical configurations. However, only a few of them consider the situations where data can be replicated [20, 21]. An earlier work introduces the technique of Fractured Mirrors [24] to store data in both row fashion and column fashion. For data partitioning, it has been proved in [25] that finding the optimal vertical partitioning is an NP-hard problem. Therefore, there are a number of works that focus on heuristic algorithms for vertical partitioning optimization [26-30]. For workload size reduction, the authors of [31] propose a workload compression method to reduce the size of SQL workloads. A more scalable workload grouping method is proposed in [20]. Most of the above works are based on the relational data model, while our work is based on the BLOT data model, which is more suitable for big location-based mobile data.

3. BLOT Systems and Diverse Replicas

In this section, we introduce BLOT, a system abstraction that reflects common designs of an important class of dedicated systems for storing big location-based mobile data. We refer to such systems as BLOT systems. Figure 1 shows an overview of how data are organized and queried in BLOT systems.

BLOT systems are primarily aimed at providing a storage layer that supports efficient data filtering by spatial-temporal ranges for high-level data analytical systems such as RDBMSs and Hadoop. They can also be used as standalone systems dedicated to answering range queries. The advantages of BLOT systems have been demonstrated by a number of existing works such as PIST [13], TrajStore [6], CloST [7], and SpatialHadoop [14]. Compared with other solutions (e.g., using the original Hadoop or NoSQL databases), range queries in a BLOT system can be one to two orders of magnitude faster while using much less storage space (typically 20% or less). In the rest of this section, we will first describe the general design of BLOT systems and then explain why using diverse replicas can significantly improve the overall system performance.

3.1. Data Model

A BLOT system stores a large number of location-based mobile records. Each record is in the form of $(o, t, l, a_1, \ldots, a_k)$, where $o$ is an object ID, $t$ is a timestamp, $l$ is the location of object $o$ at time $t$, and $a_1$ through $a_k$ are other attributes that can vary among different datasets. We refer to the first three attributes as core attributes and the others as common attributes. Any dataset that naturally fits into this data model, i.e., containing and emphasizing core attributes, can be viewed as location-based mobile data.

3.2. Data Partitioning

Based on the data model, BLOT systems split a large dataset into relatively small partitions using core attributes. In TrajStore and CloST, for example, data are first partitioned by location and then further partitioned by time. Records in the same partition are stored together in a storage unit which is optimized for sequential reads. For instance, a storage unit can be an object stored in Amazon S3, a file on HDFS, a segment of a file on a local file system, etc. Typically, the size of a storage unit in BLOT systems is much larger than that of a disk page, ranging from hundreds of kilobytes to several megabytes. The advantages of using relatively large storage units are twofold. First, queries with large spatial-temporal ranges can be efficiently processed because data are mostly accessed sequentially. Second, it makes the number of storage units sufficiently small such that we can easily maintain the partitioning index, a small global data structure that indexes the spatial-temporal ranges of all data partitions.

3.3. Data Encoding

A data partition can be stored in any format. A popular approach is to store each partition as a CSV file with each line specifying a record. While this format is easy to process, the storage utilization is low. It is therefore undesirable for huge datasets, especially when using cloud storage systems that charge for every bit stored. To reduce the storage size, a BLOT system usually uses various compression techniques to encode records in a partition. For example, we can
(1) use binary format instead of text format;
(2) apply a general compression algorithm to compress the entire partition;
(3) organize the data in column fashion and then apply column-wise encoding schemes (e.g., delta encoding and run-length encoding).

Moreover, we can use the combinations of the above techniques to further reduce the storage size. Note that higher compression ratio comes at the cost of longer decompression time which may degrade the performance of query processing.
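
As an illustration of the column-wise encoding schemes mentioned above, the following is a minimal Python sketch of delta encoding followed by run-length encoding on a timestamp column. The function names and the sample timestamps are hypothetical; actual BLOT systems may use different layouts and codecs.

```python
from itertools import groupby


def delta_encode(values):
    """Keep the first value, then store successive differences."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def run_length_encode(values):
    """Collapse runs of equal values into (value, run_length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]


def run_length_decode(runs):
    """Invert run_length_encode."""
    return [v for v, n in runs for _ in range(n)]


# Hypothetical timestamp column sampled every 30 seconds.
timestamps = [1000, 1030, 1060, 1090, 1120]
deltas = delta_encode(timestamps)    # [1000, 30, 30, 30, 30]
runs = run_length_encode(deltas)     # [(1000, 1), (30, 4)]
restored = delta_decode(run_length_decode(runs))
```

Regularly sampled timestamps collapse to a handful of runs, which is why combining the two schemes can pay off; decoding has to undo both steps, which is the decompression overhead noted above.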

3.4. Query Processing

To process range queries in a BLOT system, we first search for involved partitions, i.e., the partitions whose ranges intersect with the query range. Next, we read and decompress each involved partition to extract all the records. Finally, we check the extracted records and output the ones within the query range. Note that it is straightforward to conduct parallel query processing by scanning multiple partitions simultaneously.
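
The three steps above can be sketched in Python as follows. This is only an illustrative sketch: the box layout `(x0, x1, y0, y1, t0, t1)`, the record layout `(object_id, t, x, y)`, and the half-open range convention are assumptions made for the example, and partitions are held in memory rather than read and decompressed from storage.

```python
def intersects(a, b):
    """Boxes are (x0, x1, y0, y1, t0, t1) with half-open ranges [lo, hi)."""
    return all(a[i] < b[i + 1] and b[i] < a[i + 1] for i in (0, 2, 4))


def contains(box, rec):
    """rec is (object_id, t, x, y)."""
    _, t, x, y = rec
    return (box[0] <= x < box[1] and box[2] <= y < box[3]
            and box[4] <= t < box[5])


def range_query(partitions, query):
    """partitions: list of (box, records).

    Returns the matching records and the number of involved partitions.
    """
    # Step 1: find the partitions whose range intersects the query range.
    involved = [(b, recs) for b, recs in partitions if intersects(b, query)]
    # Steps 2 and 3: scan each involved partition and filter its records.
    result = [r for _, recs in involved for r in recs if contains(query, r)]
    return result, len(involved)


# Two hypothetical partitions split along x = 5.
partitions = [
    ((0, 5, 0, 10, 0, 100), [("a", 10, 1.0, 1.0), ("b", 20, 4.0, 9.0)]),
    ((5, 10, 0, 10, 0, 100), [("c", 30, 6.0, 2.0)]),
]
query = (0, 2, 0, 2, 0, 50)
records, num_involved = range_query(partitions, query)
```

Here only the first partition is involved, yet both of its records are scanned although one matches; this scanned-but-discarded data is part of the scan cost.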

In general, the cost of processing an involved partition consists of two parts: scan cost which includes the cost of extracting and filtering the records and extra cost which includes the cost of initializing the procedure, locating the partition, loading the decoder, cleaning up the procedure, etc. In a typical BLOT system, scan cost is usually proportional to the total number of records in the partition while extra cost is usually a constant decided by the corresponding encoding scheme. Therefore, for a specific query, the query cost is determined by the total amount of records to be scanned and the total number of involved partitions. Consider three partitioning schemes and a query shown in the upper part of Figure 2. For illustration purpose, we omit the temporal dimension and highlight the partitions that are not involved. Table 1 compares the number of involved partitions and the estimated percentage of data to scan among the three cases.

In this example, it is obvious that the middle case has the lowest query cost because both the scan cost and the extra cost are the lowest. However, it is unclear whether the query cost of the left case is higher or lower than that of the right case. To answer that question, we will develop an effective cost estimation model in Section 5.
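
To see why the comparison between a fine and a coarse partitioning can go either way, consider the following Python sketch of the two-part cost described above (scan cost proportional to the number of records, plus a constant extra cost per involved partition). The parameter values are made up for illustration and this is not the cost model of Section 5.

```python
def query_cost(involved_partition_sizes, scan_cost_per_record, extra_cost):
    """Sum of per-partition costs: records * scan cost + constant extra cost."""
    return sum(n * scan_cost_per_record + extra_cost
               for n in involved_partition_sizes)


# Hypothetical query: a fine scheme touches 16 partitions of 1,000 records,
# a coarse scheme touches 2 partitions of 20,000 records. Costs are in
# arbitrary integer time units.
fine = [1000] * 16
coarse = [20000] * 2

# Expensive per-partition overhead: the coarse scheme wins.
fine_slow_seek = query_cost(fine, 1, 50000)      # 816000
coarse_slow_seek = query_cost(coarse, 1, 50000)  # 140000

# Cheap per-partition overhead: the fine scheme wins.
fine_fast_seek = query_cost(fine, 1, 500)        # 24000
coarse_fast_seek = query_cost(coarse, 1, 500)    # 41000
```

The ordering flips with the ratio between the extra cost and the scan cost, so both parameters need to be estimated for the target environment.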

3.5. Diverse Replicas

From Figure 2 we can see that the cost of a query may vary a lot with different partitioning schemes. Undoubtedly, encoding scheme also has a significant influence on query performance. Most existing BLOT systems can adaptively optimize the configuration of the physical storage organization, such as spatial and temporal partition sizes, based on analyzing the historical queries. However, in the cases when the range of queries has high variation, the optimal configuration may still be far from satisfactory in terms of overall query performance. It is intuitive that using multiple copies of data with different physical organizations can mitigate the “one-size-does-not-fit-all” problem. Traditionally, this is a typical performance tuning approach that trades space for time. However, in the context of big data storage systems where data are replicated for fault-tolerance, we can make better use of the storage space by replacing exact replicas with diverse ones. As a result, the overall query performance can be improved without necessarily using more storage space.

Figure 2 illustrates the use of diverse replicas in BLOT systems. There are two components that are key to the success of such systems. First, the system must be able to estimate the query cost both efficiently and effectively. Query cost estimation helps the system to determine which one of the existing replicas is supposed to have the least processing time for the issued query. For example, in Figure 2, the second replica is chosen to answer the given query. Besides, the estimated costs of all queries in the given workload on all candidate replicas are important inputs for the second component which selects a set of diverse replicas (and generates the actual replicas) that is optimized for a given workload under a storage constraint. The storage constraint is a hard constraint indicating the upper bound of the available storage space. It turns out that selecting the optimal set of diverse replicas in BLOT systems is a challenging problem. To the best of our knowledge, it has not been well investigated in the previous works. Therefore, we will elaborate on this problem in the next section.

4. Replica Selection Problem

Given a very large location-based mobile dataset $D$, we want to choose a set of diverse replicas which conforms to a storage size constraint and optimizes the overall performance for a given workload. In this section, we first formally define the replica selection problem and then propose several practical solutions.

4.1. Problem Definition

Before formalizing the replica selection problem, we first give the formal definitions of several important concepts mentioned in Sections 3 and 4.

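Definition 1 (partitioning scheme). Let $B(D)$ denote the spatial-temporal bounding box of $D$. A spatial-temporal partitioning scheme $P = \{p_1, p_2, \ldots, p_{|P|}\}$ is a spatial-temporal partition of $B(D)$, where $\bigcup_{i} p_i = B(D)$ and $p_i \cap p_j = \emptyset$ for all $i \neq j$. Particularly, $p_i$ is called the $i$-th spatial-temporal partition of $P$.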

Definition 2 (data partition). Given a partitioning scheme , for any partition , the corresponding data partition is the set of all records in that are spatial-temporally contained by . In addition, we define(1);(2);(3). By Definition 1, we have and Since it is usually clear from the context, we often use the term partition to indicate both spatial-temporal partition (i.e., ) and data partition (i.e., ). In addition, we use and to denote the spatial-temporal range of a spatial partition and that of a data partition , respectively.

Definition 3 (encoding scheme). Given a data partition $D_i$, an encoding scheme $E$ is an algorithm that generates a physical storage layout for $D_i$.

Definition 4 (replica and replica set). A replica $r = (P, E)$ is a physical organization of all records in $D$ in which records are partitioned by $P$ and each partition is encoded by $E$. A replica set $R$ is a set of diverse (i.e., unique) replicas.

We use $P(r)$ and $E(r)$ to indicate the partitioning scheme and the encoding scheme of $r$, respectively. Note that the above definition requires that all partitions are encoded by the same encoding scheme. Nevertheless, the essential theoretical analysis in the following can be easily generalized to BLOT systems that allow a separate encoding scheme for each partition.

Definition 5 (storage size). The storage size of a replica $r$, denoted by $S(r)$, is the size of storage space required to store all encoded partitions in $r$. The storage size of a replica set $R$, denoted by $S(R)$, is the total storage size of all replicas in $R$, i.e., $S(R) = \sum_{r \in R} S(r)$.

Definition 6 (query and workload). A (range) query $q$ is a process that extracts all records in $D$ that are contained by a cuboid whose size is specified by $(\Delta x, \Delta y, \Delta t)$ and whose centroid is specified by $(x, y, t)$. A workload $W = \{(q_1, w_1), \ldots, (q_m, w_m)\}$ is a set of unique queries with each query $q_i$ associated with a non-negative weight $w_i$.

Similar to the range notation for partitions, we use $Range(q)$ to denote the spatial-temporal range of $q$, i.e., the cuboid described in Definition 6. The weight of a query in a workload can be interpreted as the importance (frequency, priority, etc.) of the query. In some situations, the weights are normalized such that $\sum_{i} w_i = 1$. In particular, we use $Q(W)$ to denote the set of all queries in $W$, i.e., $Q(W) = \{q_1, \ldots, q_m\}$.

Below we define query cost and workload cost based on the query processing mechanism described in Section 3.4.

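Definition 7 (query cost and workload cost). Given a replica $r$ and a query $q$, the query cost of $q$ on $r$ is denoted as $Cost(q, r)$. Accordingly, for a replica set $R$ and a workload $W$, $Cost(q, R) = \min_{r \in R} Cost(q, r)$ and $Cost(W, R) = \sum_{(q_i, w_i) \in W} w_i \cdot Cost(q_i, R)$.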
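
The following Python sketch illustrates one reading of these two costs: each query is answered by the replica with the least cost in the set (as described in Section 3.5), and the workload cost is the weighted sum over all queries. The replica names, query names, and numbers are hypothetical.

```python
def query_cost_on_set(cost, q, replicas):
    """Cost of q on a replica set: the cheapest replica answers the query."""
    return min(cost[q][r] for r in replicas)


def workload_cost(cost, weights, replicas):
    """Weighted sum of per-query costs over the whole workload."""
    return sum(w * query_cost_on_set(cost, q, replicas)
               for q, w in weights.items())


# cost[q][r]: hypothetical estimated cost of query q on replica r.
cost = {
    "small": {"fine": 2, "coarse": 9},
    "large": {"fine": 12, "coarse": 5},
}
weights = {"small": 3, "large": 1}

only_fine = workload_cost(cost, weights, ["fine"])          # 3*2 + 1*12 = 18
only_coarse = workload_cost(cost, weights, ["coarse"])      # 3*9 + 1*5 = 32
both = workload_cost(cost, weights, ["fine", "coarse"])     # 3*2 + 1*5 = 11
```

Adding a replica can never increase the workload cost under this reading, because the minimum is taken over a larger set.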

Now we can formally define the problem of finding an optimal set of diverse replicas.

Definition 8 (replica selection problem). Given a dataset $D$, a workload $W$, a set of candidate replicas $R_c$, and a storage budget $b$, find a replica set $R^*$ such that (1) $R^* \subseteq R_c$; (2) $S(R^*) \le b$; (3) $Cost(W, R^*) \le Cost(W, R)$ for all $R \subseteq R_c$ such that $S(R) \le b$. In most situations, $R_c$ contains all possible replicas, i.e., if we have $n_P$ partitioning schemes and $n_E$ encoding schemes, then $|R_c| = n_P \cdot n_E$.
To find the optimal replica set $R^*$, we need to know the query cost $Cost(q, r)$ and the storage size $S(r)$ for all $q \in Q(W)$ and $r \in R_c$ in the first place. For $S(r)$, we can estimate it using the compression ratio of the corresponding encoding scheme $E(r)$. Since the compression ratio is stable in most situations, it can be effectively measured with a small sample of $D$. For $Cost(q, r)$, we will propose a highly accurate cost model in Section 5 to estimate query cost without generating actual replicas.
For the rest of this section, we assume that all $Cost(q, r)$ and $S(r)$ are already given and focus on designing practical algorithms to solve the problem.

4.2. Exact Solution

Before presenting the exact solution, we first prove the following theorem.

Theorem 9 (NP-hard). The replica selection problem is NP-Hard.

Proof. We prove the theorem by a reduction from the minimum weight set cover problem [32] to the replica selection problem. Specifically, given a set of elements $U = \{e_1, \ldots, e_m\}$ and a set of sets $\mathcal{S} = \{S_1, \ldots, S_n\}$, where $S_j \subseteq U$ and $\bigcup_{j} S_j = U$, the minimum weight set cover problem is to find a set $C$ such that $C \subseteq \mathcal{S}$ and $\bigcup_{S_j \in C} S_j = U$ and the cost of $C$ is minimum, where the cost of $C$ is defined as $\sum_{S_j \in C} c_j$, where $c_j$ is the cost (weight) of set $S_j$.
The minimum weight set cover problem is a well-known NP-hard problem [32]. In this proof we will demonstrate that we can solve any instance of the minimum weight set cover problem by constructing and solving an instance of the replica selection problem.
In correspondence to $U$, we construct a workload $W = \{(q_1, 1), \ldots, (q_m, 1)\}$, where all weights are set to $1$. In correspondence to $\mathcal{S}$, we construct a set of candidate replicas $R_c = \{r_1, \ldots, r_n\}$ where all $S(r_j)$ are set to $c_j$. The query cost is set as follows: (1) $Cost(q_i, r_j) = 0$ if $e_i \in S_j$; (2) $Cost(q_i, r_j) = 1$ if $e_i \notin S_j$. According to Definition 7, we can interpret $Cost(q_i, r_j) = 0$ as meaning that answering $q_i$ on $r_j$ requires the minimum query cost, and $Cost(q_i, r_j) = 1$ as meaning that answering $q_i$ on $r_j$ requires more query cost than the minimum.
For ease of presentation, we use problem $A$ and problem $B$ to denote the instance of the minimum weight set cover problem and the corresponding instance of the replica selection problem, respectively.
Suppose that we have found an optimal replica set $R^*$ in problem $B$. We can then construct the corresponding set $C^*$ in problem $A$. To decide whether problem $A$ is feasible, we need to discuss two cases. On one hand, if $Cost(W, R^*) = 0$ in problem $B$, then any query in $Q(W)$ can be answered instantly by some replica in $R^*$. According to our construction process from problem $A$ to problem $B$, it follows that any element in $U$ must be covered by some set in $C^*$. In this case, we can safely conclude that problem $A$ is feasible. On the other hand, if $Cost(W, R^*) > 0$ in problem $B$, we prove that problem $A$ is infeasible by contradiction. Assume $C'$ is a feasible solution to problem $A$. We can then construct a replica set $R'$ for problem $B$. We can easily verify that $Cost(W, R') = 0$, from which it follows that $Cost(W, R') < Cost(W, R^*)$. This contradicts the fact that $R^*$ is an optimal replica set in problem $B$.
Thus, we have proved that problem $A$ is feasible if and only if the optimal workload cost in the corresponding problem $B$ equals $0$. We therefore conclude that the replica selection problem is at least as hard as the set covering decision problem. This completes the proof.

Though Theorem 9 rules out finding the optimal replica set in polynomial time (unless P = NP), an exact solution is still useful when the input size is relatively small. Our exact solution is to model the original problem as a 0-1 Mixed Integer Programming (MIP) problem [33] and hand it over to an MIP solver. The challenge here is how to model the problem properly to ensure that the optimal solution of the 0-1 MIP problem is the optimal solution of the original problem.

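Let $m = |Q(W)|$ and $n = |R_c|$. For any $i \in \{1, \ldots, m\}$ and any $j \in \{1, \ldots, n\}$, let $x_j$ be a 0-1 variable indicating whether replica $r_j$ is present in the replica set $R$ and $y_{ij}$ be a 0-1 variable indicating whether query $q_i$ is processed on replica $r_j$. We first list the constraints as follows.

The constraint related to storage size is

$$\sum_{j=1}^{n} S(r_j)\, x_j \le b. \tag{14}$$

We use exactly one replica to process each query:

$$\sum_{j=1}^{n} y_{ij} = 1, \quad i = 1, \ldots, m. \tag{15}$$

Any replica that is chosen to process at least one query must be present in $R$:

$$y_{ij} \le x_j, \quad i = 1, \ldots, m, \; j = 1, \ldots, n. \tag{16}$$

We can see that (16) specifies $mn$ constraints. Because an MIP problem may become extremely difficult in the presence of too many constraints, it is preferable to use fewer constraints. Therefore, we use the following $n$ constraints instead (which are slightly relaxed but do not change the optimal solution):

$$\sum_{i=1}^{m} y_{ij} \le m\, x_j, \quad j = 1, \ldots, n. \tag{17}$$

Let $c_{ij} = w_i \cdot Cost(q_i, r_j)$; we use the following objective function:

$$\sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij}\, y_{ij}. \tag{18}$$

Putting them together, we need to minimize (18) subject to the constraints specified by (14), (15), and (17). The details are shown in the following:

$$\begin{aligned}
\min \quad & \sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij}\, y_{ij} \\
\text{s.t.} \quad & \sum_{j=1}^{n} S(r_j)\, x_j \le b, \\
& \sum_{j=1}^{n} y_{ij} = 1, \quad i = 1, \ldots, m, \\
& \sum_{i=1}^{m} y_{ij} \le m\, x_j, \quad j = 1, \ldots, n, \\
& x_j, y_{ij} \in \{0, 1\}.
\end{aligned}$$

As $x_j$ and $y_{ij}$ are 0-1 variables, this is a well-formed 0-1 MIP problem that can be solved directly by MIP solvers.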
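
For very small inputs, the optimum that an MIP solver should return can be cross-checked by exhaustive enumeration. The Python sketch below is such a brute-force stand-in, not the MIP formulation itself: it tries every subset of candidate replicas within the storage budget and keeps the one with the lowest workload cost. It assumes that each query is answered by its cheapest chosen replica; all names and numbers are hypothetical.

```python
from itertools import combinations


def workload_cost(cost, weights, chosen):
    """Weighted sum over queries of the cheapest cost among chosen replicas."""
    return sum(w * min(cost[q][r] for r in chosen)
               for q, w in weights.items())


def exact_selection(cost, weights, size, budget):
    """Enumerate all non-empty subsets within the budget (exponential time)."""
    best_set, best_cost = None, float("inf")
    names = sorted(size)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            if sum(size[r] for r in subset) > budget:
                continue  # violates the storage constraint
            c = workload_cost(cost, weights, subset)
            if c < best_cost:
                best_set, best_cost = set(subset), c
    return best_set, best_cost


cost = {
    "small": {"fine": 2, "coarse": 9, "mid": 4},
    "large": {"fine": 12, "coarse": 5, "mid": 7},
}
weights = {"small": 3, "large": 1}
size = {"fine": 4, "coarse": 3, "mid": 5}

chosen_7, cost_7 = exact_selection(cost, weights, size, budget=7)
chosen_5, cost_5 = exact_selection(cost, weights, size, budget=5)
```

With a budget of 7 the pair of fine and coarse replicas fits and gives cost 11; with a budget of 5 only a single replica fits and the fine one is best at cost 18.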

4.3. Reducing the Problem Size

In general, the computation time of solving an MIP problem grows exponentially with the problem size, i.e., the number of decision variables. The total number of decision variables in our formulation is (all and ) which could be very large even though both and are relatively small. For example, there are more than decision variables when we have partitioning schemes, encoding schemes, and queries in the given workload. Though this is a typical scenario in practice, it already makes the formulated MIP problem computationally infeasible (on up-to-date computers nowadays). Thus, to make the aforementioned solution more scalable, we propose several practical techniques that can significantly reduce the problem size.

4.3.1. Reducing the Workload Size

If we directly use all historical queries recorded in the query log to form the input workload, then the number of queries may increase too fast in a working system where new queries are issued frequently. To address this issue, we treat each query in the workload as a group of similar queries. Specifically, we use only one grouped query to represent all the queries with the same size of spatial-temporal range. Accordingly, we adjust the definition of query in Definition 6 so that a grouped query is specified by its range size only. This variation reflects the observation that queries with the same range size often occur many times in real situations. For example, it is common that users use an equal-sized grid to decompose the space and then conduct simple statistics for each grid cell. It is worth pointing out that estimating the cost of a grouped query is generally more difficult than estimating that of a single query. We will address this issue in Section 5. In addition, if the number of different range sizes is still large, we can use clustering algorithms such as $k$-means to cluster the range sizes and only use the cluster centers to construct the input workload. In this way, we have full control of the number of queries in the workload by manipulating the number of clusters.
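
A minimal Python sketch of the grouping step is given below. It assumes that each logged query is reduced to a range-size tuple `(dx, dy, dt)` and that the weight of a grouped query is the sum of the weights of its members; the latter is an assumption made for illustration, not something stated above.

```python
from collections import defaultdict


def group_by_range_size(queries):
    """queries: iterable of (range_size, weight) pairs.

    Returns one grouped query per distinct range size, with summed weights.
    """
    grouped = defaultdict(float)
    for range_size, weight in queries:
        grouped[range_size] += weight
    return dict(grouped)


# Hypothetical query log: range sizes as (dx, dy, dt), each with weight 1.
log = [
    ((1, 1, 3600), 1.0),
    ((1, 1, 3600), 1.0),
    ((10, 10, 86400), 1.0),
]
workload = group_by_range_size(log)
```

Three logged queries collapse into two grouped queries, so the workload size is bounded by the number of distinct range sizes rather than by the length of the log.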

4.3.2. Reducing the Number of Candidate Replicas

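Consider two replicas $r_1, r_2 \in R_c$ satisfying $S(r_1) \le S(r_2)$ and $Cost(q, r_1) \le Cost(q, r_2)$ for all $q \in Q(W)$; in this case, we say that replica $r_1$ dominates replica $r_2$. Obviously, if we use $R_c \setminus \{r_2\}$ instead of $R_c$ as the input candidate replicas, it will not change the optimal workload cost $Cost(W, R^*)$. Therefore, we can safely prune $r_2$ from $R_c$. In general, it is more common that a replica is dominated by a set of replicas. Concretely, given a replica $r$ and a replica set $R$, we say that replica set $R$ dominates replica $r$ if (1) $r \notin R$; (2) $S(R) \le S(r)$; (3) $Cost(q, R) \le Cost(q, r)$ for all $q \in Q(W)$.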

Ideally, we want to find a minimum dominant replica set $R_d \subseteq R_c$ such that $R_d$ dominates any replica $r \in R_c \setminus R_d$. However, as we can prove that the replica selection problem itself is NP-hard, we do not pursue a minimum $R_d$ in practice. Instead, we use a rough yet effective heuristic algorithm to find a suboptimal dominant replica set.
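
The simplest form of this pruning, where one replica is dominated by a single other replica, can be sketched in Python as follows. This is not the heuristic referred to above (which is not spelled out here); it only removes candidates that are no smaller and no cheaper on any query than some kept candidate. Names and numbers are hypothetical.

```python
def dominates(a, b, cost, size):
    """a dominates b if a is no larger and no more costly on every query."""
    return size[a] <= size[b] and all(cost[q][a] <= cost[q][b] for q in cost)


def prune_dominated(cost, size):
    """Return the candidates that are not dominated by any other candidate."""
    def sort_key(r):
        # Any dominator has smaller-or-equal size and total cost, so it is
        # visited before (or is identical to) the replica it dominates.
        return (size[r], sum(cost[q][r] for q in cost), r)

    kept = []
    for r in sorted(size, key=sort_key):
        if not any(dominates(k, r, cost, size) for k in kept):
            kept.append(r)
    return kept


cost = {
    "small": {"fine": 2, "coarse": 9, "bloated": 3},
    "large": {"fine": 12, "coarse": 5, "bloated": 12},
}
size = {"fine": 4, "coarse": 3, "bloated": 6}
candidates = prune_dominated(cost, size)
```

In this example the "bloated" replica is larger than "fine" and never cheaper, so it is pruned, while "fine" and "coarse" each win on one query and both stay.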

4.4. Approximation Solution

In this section, we propose several approximation algorithms to select a near-optimal set of replicas based on the reduced problem size as illustrated in Section 4.3. An approximation algorithm is suitable when the number of candidate replicas is still large after pruning, or when the workload changes so rapidly that the replica set must be reselected frequently.

4.4.1. Greedy Strategy

First we give a fast greedy algorithm to solve the replica selection problem. This algorithm is adapted and extended from the greedy algorithm for the minimum weighted set cover problem. As shown in Algorithm 1, we add one replica at a time to the replica set such that in each step the added replica maximizes the score, until the storage budget is exhausted or the overall workload cost cannot be further decreased by adding any one of the remaining replicas. Since each iteration adds one replica, in the worst case we need $|R_c|$ iterations until the storage space is full. In each iteration, we
(1) score all replica candidates that are not yet added to $R$;
(2) add the replica with the highest score into $R$.

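Algorithm 1
Input: $W$, $R_c$, $b$, $Cost(q, r)$ for all $q \in Q(W)$ and $r \in R_c$
Output: $R$
begin
  $R \leftarrow \emptyset$;
  while $S(R) < b$ do
    $r^* \leftarrow$ null;
    $s^* \leftarrow 0$;
    for $r \in R_c$ do
      $s \leftarrow$ score of adding $r$ to $R$;
      if $s > s^*$ and $S(R) + S(r) \le b$ then
        $s^* \leftarrow s$;
        $r^* \leftarrow r$;
    if $r^*$ is null then
      break;
    else
      $R \leftarrow R \cup \{r^*\}$;
      $R_c \leftarrow R_c \setminus \{r^*\}$;
  return $R$.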

The scoring step computes the gain of each candidate replica that may be added to $R$; in this step, all $Cost(q, r)$ are compared between the current replica set and the candidate replicas. Hence, this step takes $O(mn)$ time, and it results in an $O(\log m)$ approximation ratio, where $m$ is the size of the set of all queries $Q(W)$. The running time of this greedy algorithm is $O(mn^2)$, where $n$ is the size of the set of candidate replicas. In Section 6, we will see that the approximation ratio of the greedy algorithm is quite desirable (lower than in most cases) in practice.
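
A minimal Python sketch of a greedy selection in this spirit is shown below. Two choices in it are assumptions rather than details taken from the text: the score of a candidate is the reduction in workload cost per unit of storage, and the first replica (for which no baseline cost exists) is the feasible candidate with the lowest workload cost. Storage sizes are assumed to be positive.

```python
def workload_cost(cost, weights, chosen):
    """Weighted sum over queries of the cheapest cost among chosen replicas."""
    return sum(w * min(cost[q][r] for r in chosen)
               for q, w in weights.items())


def greedy_selection(cost, weights, size, budget):
    chosen, used = [], 0
    while True:
        feasible = [r for r in sorted(size)
                    if r not in chosen and used + size[r] <= budget]
        if not feasible:
            return chosen
        if not chosen:
            # Assumption: seed with the single replica of lowest workload cost.
            best = min(feasible, key=lambda r: workload_cost(cost, weights, [r]))
        else:
            current = workload_cost(cost, weights, chosen)

            def score(r):
                # Assumed score: cost reduction per unit of storage.
                gain = current - workload_cost(cost, weights, chosen + [r])
                return gain / size[r]

            best = max(feasible, key=score)
            if score(best) <= 0:
                return chosen  # no remaining replica lowers the cost
        chosen.append(best)
        used += size[best]


cost = {
    "small": {"fine": 2, "coarse": 9, "mid": 4},
    "large": {"fine": 12, "coarse": 5, "mid": 7},
}
weights = {"small": 3, "large": 1}
size = {"fine": 4, "coarse": 3, "mid": 5}

picked_7 = greedy_selection(cost, weights, size, budget=7)
picked_5 = greedy_selection(cost, weights, size, budget=5)
```

On this toy input the greedy choice with budget 7 is the fine replica followed by the coarse one, with workload cost 11; in general the greedy result is not guaranteed to be optimal.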

4.4.2. LP Rounding Strategy

Although the greedy strategy is simple to implement and achieves good approximate results in practice, the best we can hope for from the greedy strategy is a logarithmic approximation ratio ($O(\log m)$). As the number of queries grows, the performance guarantee degrades accordingly. In this section we introduce a constant-factor approximation algorithm based on linear programming rounding [34]. The linear programming rounding strategy consists of three stages:
(1) formulating the problem as an integer linear program;
(2) relaxing the integrality constraints and finding the optimal solution of the relaxed linear program;
(3) rounding the fractional solution of the linear program to produce an integral solution.

In the replica selection problem, the LP rounding strategy is based on the MIP proposed in Section 4.2; thus we have already finished stage 1. In stage 2, we further relax the MIP by allowing $0 \le x_j \le 1$ and $0 \le y_{ij} \le 1$. Then we can solve the LP in polynomial time, resulting in fractional $x_j$ and $y_{ij}$. In stage 3, since general rounding techniques cannot be directly adopted for the replica selection problem, we present the following rounding strategy.

Suppose we have found an optimal solution for the LP in stage 2. For any query , we define the neighborhoods of as All the replicas that serve fractionally are the neighborhoods of . Further we define cluster as a set of queries and replicas with the center . In the LP, we denote and thus the total query cost is

Now we sort queries by in ascending order and then iteratively assign each query and replica to clusters until every query and replica is assigned to a cluster. In each iteration, we pick the query with the smallest . If for any existing cluster center , we open a new cluster, add to it, and denote as the cluster center. If , we add to cluster . Then we can round the fractional solution: for each cluster, we select the cheapest replica for each cluster center in and assign the queries in this cluster to replica . The overall constant-factor approximation algorithm is shown in Algorithm 2. Theorem 10 provides the approximation ratio of the LP rounding based strategy.

Input:  , , , for all and , for all
Output:
(1) begin
(2)  ;
(3)  sort by in ascend order;
(4)  while    and    do
(5) choose from with smallest ;
(6) ;
(7) foreach    do
(8) if    then
(9) ;
(10) break;
(11) if    then
(12) ;
(13) ;
(14)  ;
(15)  foreach    do
(16) with ;
(17)  return .
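The clustering and rounding stage can be sketched in Python as follows. This is an illustrative sketch under several assumptions, not a transcription of Algorithm 2. The neighborhood of a query is taken to be the set of replicas that serve it with a positive fraction. A query joins an existing cluster when its neighborhood intersects that of the cluster center. "Cheapest" is interpreted as smallest storage size. The storage budget check in the loop of Algorithm 2 is omitted. The names `round_lp`, `x`, `cost`, and `size` are hypothetical.

```python
from typing import Dict


def round_lp(x: Dict[str, Dict[str, float]],
             cost: Dict[str, Dict[str, float]],
             size: Dict[str, float],
             eps: float = 1e-9) -> Dict[str, str]:
    """Round a fractional query-to-replica assignment x[q][r] to one replica per query."""
    # Neighborhood of q: replicas that serve q fractionally in the LP solution.
    neigh = {q: {r for r, v in x[q].items() if v > eps} for q in x}
    # Fractional cost of q in the LP solution.
    frac_cost = {q: sum(cost[q][r] * v for r, v in x[q].items()) for q in x}

    centers = []   # cluster centers in the order they were opened
    members = {}   # center -> queries assigned to its cluster
    for q in sorted(x, key=lambda name: (frac_cost[name], name)):
        # Join the first existing cluster whose center shares a replica with q.
        home = next((c for c in centers if neigh[q] & neigh[c]), None)
        if home is None:
            centers.append(q)
            members[q] = [q]
        else:
            members[home].append(q)

    assignment = {}
    for c in centers:
        # Assumption: the cheapest replica is the one with the smallest storage size.
        chosen = min(sorted(neigh[c]), key=lambda r: size[r])
        for q in members[c]:
            assignment[q] = chosen
    return assignment
```

Because queries are processed in ascending order of fractional cost, every non-center query joins a cluster whose center has a fractional cost no larger than its own, which is the property the proof of Theorem 10 relies on.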

Theorem 10. The proposed LP rounding strategy is a 3-factor approximation algorithm.

Proof. Suppose the optimal solution of the MIP is and the optimal solution of the relaxed LP is . Since is a feasible solution of , we can prove [35]. In the rounding solution, we select replicas that have a cost of at most .
Assuming that is the center of cluster , we have selected replica for any query in cluster . For , there are three types of query cost on replica : (1) is in cluster and . (2) is in cluster but . Since , queries and share some replica in common. By the triangle inequality we have . The last inequality holds because we sort in ascending order and pick the query with the smallest each time. (3) is not in cluster ; in this case, we set to and will not be queried on any replica in this cluster. The total cost of the rounding solution is which is at most triple the cost of the LP.

5. Query Cost Estimation

In this section, we propose an effective model to estimate the query cost for the replica selection problem.

We estimate the cost of a query with respect to a replica via the expectation of the running time towards the replica. Since each partition of a replica consists of a spatial range and a temporal range , we will show our estimations of the query cost in both spatial and temporal aspects.

As defined in Definition 6, in this paper, we consider as a cuboid and we use to denote the spatial-temporal range of , i.e., . To clearly show the proof, in this section, we use to denote the spatial range of , where , is the top-left point of the rectangle and and are the width and the height of the rectangle, respectively. Similarly, for each partition , we use and to denote the width and the height of the partition.

To derive the expected number of partitions that a query should scan, we assume that the queries are uniformly distributed in space, as shown in Figure 3. In Figure 3, and are the width and height of the map, respectively. The query is shown as a blue rectangle, and the top-left point of the query may only be generated in the gray area, because a query that exceeds the spatial range of the map can be treated as another query with a smaller spatial range. The top-left point is equally likely to fall anywhere in the gray area, i.e., it is uniformly distributed.

5.1. Expected Spatial Partitions

In this paper, given a workload , the probability of a spatial partition being scanned is the number of queries that overlap the partition divided by the total number of queries in . Since the queries are uniformly distributed, this probability can be written as the area within which a query may overlap the partition (the orange rectangle in Figure 4) divided by the entire area in which queries may be generated (the gray area in Figure 3).

Assuming the distances between a partition and the boundary of the map are , , , and , we define the expected spatial partitions as follows.

Theorem 11 (expected spatial partitions). Given query and replica with partitions , the expected number of spatial partitions that the query should scan is where where is the offset of the query, and

Proof. The proof of the denominator of is trivial; thus we only consider the numerator, denoted by , i.e., the area within which the queries may overlap with the partition, as shown in Figure 4. In Figure 4, the query is colored in blue, and the partition is colored in purple. The orange rectangle shows the area within which the queries may overlap with the partition.
(1) The area of the partition is smaller than the query, as shown in Figure 4(a). From observation, we have , and . Hence Theorem 11 holds.
(2) The area of the partition is larger than the query, as shown in Figure 4(b). Similar to the previous case, Theorem 11 holds.
(3) The partition is in the corner and exceeds the legal range of the query, as shown in Figure 4(c). From observation, we have . Hence Theorem 11 holds.
(4) The partition is adjacent to the boundary, as shown in Figure 4(d). From observation, we have , since ; Theorem 11 holds.
(5) The partition is adjacent to more than two boundaries. This is not possible under the spatial partitioning scheme, because the number of partitions 4.
In conclusion, Theorem 11 holds.
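The geometric argument can be checked numerically with a short Python sketch. It is not the closed form of Theorem 11. It only applies the uniform-query assumption of this section directly: along each axis, the query origin is uniform over the positions that keep the query inside the map, and the query overlaps a partition when its origin falls in a clipped interval around the partition. The function names and the `(x, y, width, height)` representation of a partition are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def overlap_probability_1d(p_start: float, p_len: float,
                           q_len: float, extent: float) -> float:
    """Probability that a query interval of length q_len, whose start is uniform on
    [0, extent - q_len], overlaps the partition interval [p_start, p_start + p_len]."""
    free = extent - q_len            # length of the range of legal query origins
    if free <= 0:                    # the query spans the whole extent
        return 1.0
    lo = max(p_start - q_len, 0.0)   # earliest origin that still reaches the partition
    hi = min(p_start + p_len, free)  # latest legal origin that still starts inside it
    return max(hi - lo, 0.0) / free


def expected_spatial_partitions(partitions: Iterable[Tuple[float, float, float, float]],
                                q_w: float, q_h: float,
                                map_w: float, map_h: float) -> float:
    """Expected number of rectangular partitions (x, y, width, height) that a
    q_w-by-q_h query overlaps, summing the per-partition overlap probabilities."""
    return sum(
        overlap_probability_1d(x, w, q_w, map_w) * overlap_probability_1d(y, h, q_h, map_h)
        for (x, y, w, h) in partitions
    )
```

The one-dimensional helper applies equally to temporal partitions, with the map extent replaced by the temporal range of all records.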

5.2. Expected Temporal Partitions

Similar to the expected spatial partitions, the probability of a temporal partition being scanned is the temporal range within which a query may overlap the partition divided by the temporal range in which queries may be generated. Assuming the intervals between a partition and the temporal range of all the records are and , we define the expected temporal partitions as follows.

Theorem 12 (expected temporal partitions). Given query and replica with partitions , the expected number of temporal partitions that the query should scan is where where is the offset of the query, and

The proof of Theorem 12 is similar to that of Theorem 11.

5.3. Expected Query Cost

As described in Section 3, to answer query on replica , a BLOT system scans (the physically stored objects of) all partitions that satisfy and then filters each record by . Based on the expected numbers of spatial and temporal partitions that a query should scan, we can combine them into the expected number of partitions to scan for a given query :

Now given the number of spatial and temporal partitions and of , respectively, we have where and are the scanning speed in terms of number of records scanned per unit time and the time before and after the actual scan process of the replica given its encoding scheme , respectively. For example, if each partition is stored contiguously as a regular file on a local disk, then is the seek time of locating the beginning of the file and is the transfer rate of the disk (assuming that the CPU always waits for I/O). As another example, if each partition is stored as an object on Amazon S3 and queries are processed on Amazon EMR (Elastic MapReduce), then is the time to initialize the map task plus the time to locate the S3 object before scanning the partition. The value of depends on the encoding scheme . In practice, a high compression ratio generally leads to a slow scan speed.

In this paper, we assume that all candidate partitioning schemes will generate non-skewed data partitions. In other words, the number of records in each is almost the same for all . Non-skewed partitioning is a desirable property when partitions are processed in parallel (e.g., in MapReduce). An example of such partitioning schemes is using a k-d tree to partition the space where data are split equally each time the space is subdivided.

Putting (34) and (35) together, we can compute the cost of any query on replica in time. It follows that the time complexity of computing all query costs is
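As a rough illustration of how these quantities combine, the following Python sketch assumes that the time to process one partition is linear in its record count (a fixed extra time plus the record count divided by the scan rate), which matches the linear fit used in Section 6.2, and that partitions are non-skewed. The exact forms of (34) and (35) are not reproduced here; the function and parameter names are hypothetical.

```python
def estimated_query_cost(expected_partitions: float,
                         total_records: int,
                         n_spatial: int,
                         n_temporal: int,
                         scan_rate: float,
                         extra_time: float) -> float:
    """Estimated cost of one query on one replica (illustrative sketch).

    expected_partitions: expected number of partitions the query scans
    scan_rate:           records scanned per unit time for the replica's encoding
    extra_time:          fixed time before and after scanning one partition
    """
    # Non-skewed partitioning: every partition holds roughly the same number of records.
    records_per_partition = total_records / (n_spatial * n_temporal)
    per_partition_time = extra_time + records_per_partition / scan_rate
    return expected_partitions * per_partition_time
```

For example, with 1,000,000 records split into 4 spatial by 5 temporal partitions, a scan rate of 100,000 records per second, and 2 seconds of extra time, a query expected to touch 4 partitions is estimated at 4 × (2 + 0.5) = 10 seconds of total scan work.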

6. Evaluation

In this section, we describe the experiment settings and present the evaluation results in detail.

6.1. Experiment Settings

We consider two typical execution environments for BLOT systems. The first is a local Hadoop cluster where each partition is stored as a separate file on HDFS. The second uses Amazon S3 to store partitions. To process a query, we launch a map-only MapReduce job, either in the local cluster or in Amazon EMR, with each mapper scanning exactly one of the involved partitions. The dataset we use is a sample of vehicle GPS logs collected from more than 4,000 taxis in Shanghai over one month. Each record contains 8 attributes (including the 3 core attributes). The total number of records is around 65 million and the total storage size in uncompressed CSV format is 3.7 GB. The latitude ranges from 30 to 32, the longitude from 120 to 122, and the time from 11/01/2007 to 11/29/2007. It is worth pointing out that although the full dataset in our working system is more than 100 GB, we only need a small portion of the data to build the cost model and select diverse replicas for the whole dataset.

For data partitioning, we first partition the space and then the time to generate equal-sized (in terms of number of records) partitions. The space is partitioned according to a k-d tree [9] index, which recursively decomposes the space by alternately splitting along each spatial dimension. The number of spatial partitions is chosen from , , , , and the number of temporal partitions is chosen from , , , , . Therefore, there are candidate spatial-temporal partitioning schemes in total. For data encoding, we store data either by row or by column (with delta encoding), optionally applying a general-purpose compression method chosen from Gzip, Snappy, and LZMA2. Since uncompressed column-store has poor performance in terms of both compression ratio and scan speed, we do not use it as a candidate encoding scheme. Therefore, there are 7 candidate encoding schemes in total. The compression ratio of each encoding scheme measured on our dataset is listed in Table 2. Putting the above partitioning schemes and the encoding schemes together, the total number of candidate replicas is .
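The equal-sized spatial split can be illustrated with a minimal Python sketch of a k-d-tree-style partitioner. It assumes two spatial dimensions, a number of leaves that is a power of two, and a median split at every level; the function name `kd_partition` and the in-memory list of points are assumptions made for illustration, not the implementation used in the experiments.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def kd_partition(points: List[Point], leaves: int, depth: int = 0) -> List[List[Point]]:
    """Split points into `leaves` groups of (nearly) equal size by recursively
    cutting at the median, alternating between the x and y dimensions."""
    if leaves <= 1:
        return [points]
    axis = depth % 2                                  # alternate x (0) and y (1)
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2                           # median split keeps halves balanced
    left = kd_partition(ordered[:mid], leaves // 2, depth + 1)
    right = kd_partition(ordered[mid:], leaves - leaves // 2, depth + 1)
    return left + right
```

Because every cut is made at the median of the records rather than at the midpoint of the space, dense regions end up with spatially smaller partitions, while the record count per partition stays balanced.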

6.2. Measuring Scan Rate and Extra Time

Since and are constant for a given encoding scheme, we conduct measurements for each of the 7 candidate encoding schemes in each execution environment.

For each measurement, we generate 5 sets of partitions with each set containing 20 partitions. The sizes of partitions within a partition set are the same while they are different across partition sets. We then launch a map-only MapReduce job with 20 mappers with each scanning a partition. After the job is finished, we compute the average processing time of all mappers and use it as the (measured) value of in (34). Accordingly, we use the corresponding partition size (in terms of number of records) as . We therefore have 5 measured points for (35). In the last step, we perform linear regression to fit the measured points and use the fitted parameters as and .
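The fitting step can be sketched with an ordinary least-squares line in plain Python. The sketch assumes that the mapper time is modeled as a fixed extra time plus the partition size divided by the scan rate, so that the slope of the fitted line is the reciprocal of the scan rate and the intercept is the extra time. The function name and the sample numbers in the usage note are hypothetical and are not measurements from this study.

```python
from typing import Sequence, Tuple


def fit_scan_parameters(sizes: Sequence[float],
                        times: Sequence[float]) -> Tuple[float, float]:
    """Fit times = extra_time + sizes / scan_rate by least squares.

    sizes: partition sizes in number of records (one per partition set)
    times: average mapper processing time for each partition set
    Returns (scan_rate, extra_time).
    """
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    slope = sxy / sxx                      # time per record
    intercept = mean_y - slope * mean_x    # fixed time per partition
    return 1.0 / slope, intercept
```

For instance, five synthetic points generated as `2.0 + size / 50_000` for sizes from 10,000 to 160,000 records recover a scan rate of about 50,000 records per unit time and an extra time of about 2.0.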

In Figure 5, the left two subfigures show all the measurement results and the right two subfigures show the fitted lines for three measurements in each of the execution environments. In addition, the measured values of and are listed in Table 3. We can see that is well-fitted by (35) especially when the size of partition is relatively large, which demonstrates the effectiveness of our cost model.

6.3. Performance of Replica Selection

To measure the effectiveness and the efficiency of our replica selection algorithms, we construct a synthetic workload containing 8 grouped queries with widely varying range sizes. We conduct all the following experiments in the Amazon S3 and EMR execution environment.

Figure 6 compares the computation time of the MIP solution for different sizes of the workload and of the candidate replica set. As the size of the workload or of the candidate replica set increases, the computation time of the MIP solution increases exponentially. Hence, when the input workload or the candidate replica set is too large, it is desirable to switch to the greedy algorithm, which runs in polynomial time.

Figure 7 compares the relative query performance for all the queries when the replica set is selected by different approaches. The storage budget is set to the storage size of 3 exact copies of the optimal single replica. The approximation ratio of each approach is shown in brackets in the figure (the ideal case is always 1.00). It is clear that as the size of the data grows, the performance of the greedy algorithm and the MIP solution is closer to the ideal case than that of a single replica; thus the advantages of using diverse replicas become increasingly prominent. Figure 8 shows the overall query performance relative to the ideal case when varying the storage budget. In this figure, the x-coordinate is the storage budget relative to the storage budget used in Figure 7. We can see that the MIP solution is close to the ideal case regardless of the storage budget and is faster than the single replica case by up to , while the approximation ratio of the greedy algorithm decreases dramatically as the storage budget increases. When the relative storage budget is greater than 1, the approximation ratio of the greedy algorithm is less than 1.2.

7. Conclusion

In this paper, we explore the use of diverse replicas in the context of storage systems for big location-based mobile data. Specifically, we propose BLOT, a system abstraction that describes an important class of location-based mobile data storage systems. Then, we formally define the replica selection problem, which seeks the optimal set of diverse replicas. We propose two solutions to this problem: an exact algorithm based on integer programming and an approximation algorithm based on a greedy strategy. In addition, we propose several practical approaches to reduce the input size of the problem. We also design a simple yet effective cost model to estimate the cost of an arbitrary query on an arbitrary replica configuration. Finally, we evaluate our solutions in two typical execution environments, Amazon S3 with EMR and a local Hadoop cluster. The results demonstrate that the proposed algorithms for the replica selection problem are both effective and efficient. In this paper, we only consider full replication of the entire data. The use of partial replication, where only frequently accessed data ranges are replicated, is left as future work.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Disclosure

An earlier version of this work appeared in the Proceedings of IEEE ICDCS [36], June 2014.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported in part by the National Key Research and Development Program of China under Grant no. 2017YFB0202201 and National Natural Science Foundation of China under Grant no. U1711261.