Abstract

Rapid development in wireless communication and radio frequency technology has enabled the Internet of Things (IoT) to enter every aspect of our lives. However, as more and more sensors are connected to the Internet, they generate huge amounts of data. Thus, widespread deployment of IoT requires solutions for analyzing these data. Top-k query processing can be applied to facilitate this task. A top-k query retrieves the k tuples with the lowest or the highest scores among all of the tuples in the database. There are many methods for answering top-k queries, among which skyline methods are efficient when all attribute values of the tuples are considered. Representative skyline methods are the sort-filter-skyline (SFS) algorithm, angle-based space partitioning (ABSP), and plane-project-parallel-skyline (PPPS). Among them, PPPS improves on ABSP by partitioning the data space into a number of subspaces using hyperplane projection. However, PPPS has a high index building time in high-dimensional databases. In this paper, we propose a new skyline method (called Grid-PPPS) for efficiently handling top-k queries in IoT applications. The proposed method first performs grid-based partitioning on the data space and then partitions it once again using hyperplane projection. Experimental results show that our method improves the index building time compared to the existing state-of-the-art methods.

1. Introduction

Rapid development in wireless communication and radio frequency technology has enabled the Internet of Things (IoT) to enter every aspect of our lives. The IoT is part of the Internet of the future and will comprise billions of intelligent communicating “things” which will have sensing, actuating, and data processing capabilities [1]. For example, the things in IoT can be smart home devices or home appliances such as refrigerators, washing machines, and air conditioners, which contain controllable devices. Restaurants, hotels, and countries can also be considered things in IoT, since they are connected and communicate with each other. However, as more and more sensors are connected to the Internet, they generate enormous amounts of data. Thus, widespread deployment of IoT requires solutions for analyzing the potentially huge amounts of data they generate [2–4]. Top-k query processing can be applied to facilitate this task.

A top-k query finds the k tuples with the lowest or the highest scores among all of the input tuples. When a database is large, it may take a long time to compute a complete answer to a query. Most users, however, are interested in just a few top results, which are ranked by a small set of attribute values, and they want to see the results immediately after they issue the query [5]. We can apply this notion to find the top-k results in the huge amounts of data in IoT applications. Example 1 presents a scenario for finding the top-k results in an IoT application.

Example 1. Consider a user, John, who wants to have dinner in an Italian restaurant. He defines the following criteria for the search: the distance of the restaurant from his home should be less than 800 meters and the price should be less than 85 US dollars. In order to find the restaurant that best suits his interest, John defines a scoring function f(t) = 0.4 × distance + 0.6 × price, where t is a tuple of the database. Here, we can think of a restaurant as a thing in IoT. All restaurants are connected to the Internet, which forms the network of IoT. Finding the top-k results among a large number of restaurants could save John’s time. The query shown below is based on the PostgreSQL syntax; in it, distance is expressed in units of 100 meters and price in units of 10 US dollars.
SELECT *
FROM Restaurant r
WHERE r.distance < 8.0 AND r.price < 8.5
ORDER BY 0.4 * r.distance + 0.6 * r.price
LIMIT k
The list of restaurants and their scores is shown in Table 1. These restaurants can be represented in two-dimensional space as shown in Figure 1(a). Alto (E) is the top-1 answer to the query with a score of 3.2, and Olive Garden (D) is the top-2 answer with a score of 4.0. Since restaurants A, F, G, I, and J exceed the distance or price limits, they do not satisfy the requirements provided by John. Thus, the scores for these restaurants are not calculated.
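The query of Example 1 can be sketched in a few lines of Python. This is only an illustration: the rows below are hypothetical values chosen so that E and D obtain the scores quoted above; they are not taken from Table 1, and the helper names `score` and `top_k` are assumptions introduced here.

```python
from heapq import nsmallest

# Hypothetical rows: (name, distance in units of 100 m, price in units of 10 USD).
# These values are illustrative only and are NOT the contents of Table 1.
restaurants = [
    ("E", 2.0, 4.0),
    ("D", 4.0, 4.0),
    ("B", 6.0, 5.0),
    ("A", 9.0, 3.0),  # fails the distance predicate
    ("F", 3.0, 9.0),  # fails the price predicate
]


def score(distance, price):
    """Scoring function of Example 1; a lower score is better."""
    return 0.4 * distance + 0.6 * price


def top_k(rows, k, max_distance=8.0, max_price=8.5):
    """Apply the WHERE predicates, then keep the k rows with the lowest scores."""
    candidates = [
        (score(d, p), name)
        for name, d, p in rows
        if d < max_distance and p < max_price
    ]
    return nsmallest(k, candidates)


result = top_k(restaurants, k=2)
for s, name in result:
    print(f"{name}: {s:.1f}")
```

Rows that fail a predicate are never scored, which mirrors the remark that scores are not calculated for restaurants outside John's limits.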
To answer top-k queries efficiently, we need to build an index so that only a subset of the database has to be accessed. The skyline methods are representative methods for answering top-k queries; they construct a skyline as an index. These methods express data tuples as objects in a d-dimensional space and then construct a skyline. Here, d is the number of attributes of the database. The skyline methods are efficient for queries on a database with a large number of attributes and a large amount of data. In Figure 1(b), the rectangular black line, composed of black points, represents the skyline. An object dominates another object if it is no worse in every attribute and strictly better in at least one; the skyline points do not dominate each other. We can answer top-k queries by reading only the skyline points, since the skyline can be considered as an index. The sort-filter-skyline (SFS) algorithm [6], which is the state-of-the-art method, presorts the objects according to the entropy value of each object. The angle-based space partitioning (ABSP) [7] and plane-project-parallel-skyline (PPPS) [8] methods partition the data space into a number of subregions in order to reduce the computing time. PPPS improves on ABSP by partitioning the data space into a number of subspaces using hyperplane projection. However, PPPS has a high index building time in high-dimensional databases.
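The notion of dominance and a skyline can be made concrete with a minimal sketch. It assumes that smaller attribute values are better (as in Example 1) and uses a naive quadratic scan; the sample points are hypothetical, and the function names are introduced here only for illustration.

```python
def dominates(p, q):
    """p dominates q: p is no worse in every attribute and strictly better in one.

    Assumption: smaller values are better in every attribute.
    """
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    """Naive O(n^2) skyline: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical two-dimensional objects (distance, price).
points = [(1, 9), (2, 4), (4, 4), (5, 1), (6, 6), (3, 7)]
sky = skyline(points)

# For a monotone scoring function with positive weights, the best-scoring
# object is always a skyline point, so the top-1 answer can be read from it.
best = min(points, key=lambda p: 0.4 * p[0] + 0.6 * p[1])
print(sky, best in sky)
```

Here (4, 4), (6, 6), and (3, 7) are all dominated by (2, 4), so only three objects need to be read to find the top-1 answer.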
In this paper, we propose a new skyline method for efficiently handling top-k queries in IoT applications. This paper focuses on the effectiveness of grid-based partitioning. More precisely, the contributions we make in this paper are as follows.
(i) We propose a new skyline method (called Grid-PPPS) for efficiently handling top-k queries in IoT applications. The proposed method first performs grid-based partitioning on the data space and then partitions it once again using hyperplane projection. This reduces the time complexity of PPPS.
(ii) We show the performance advantages of Grid-PPPS by comparing its index building time and number of domination calculations with those of PPPS.

The rest of this paper is organized as follows. Section 2 describes existing work related to this paper. Section 3 presents the proposed method for computing Grid-PPPS and Section 4 demonstrates the results of performance evaluation. Section 5 summarizes and concludes the paper.

2. Related Work

In this section, we discuss the existing literature. In Section 2.1, we review data management solutions in IoT, and, in Section 2.2, we explain the index building methods for top-k queries.

2.1. Data Management Methods in IoT

Generally speaking, all things on the IoT may generate a huge amount of data that contains different kinds of useful information. How to handle such big data and how to retrieve the valuable information have become hot research topics in recent years. Several index building methods for handling massive amounts of IoT data have been proposed. Ma et al. [9] proposed an update and query efficient index framework (UQE-Index) based on a key-value store that can support both multidimensional queries and high insert throughput. In order to effectively reduce the number of index updates and decrease the index maintenance cost, the authors proposed a dynamic data partition strategy that ensures that the data is evenly distributed across the regions in HBase and that data that is close in the time and space dimensions is usually stored in the same regions.

In order to address the problem of high dimensionality in IoT data, Huang et al. [10] proposed dynamic skyline cube (SKYCUBE) computation to efficiently balance the computation and update costs in IoT. The authors proposed an efficient grid-based algorithm, ADSCIT (algorithm for dynamic SKYCUBE computation in the Internet of Things), which consists of two modules: the continuous maintenance module (CMM), which incrementally updates the nonpseudo objects, and the progressive computation module (PCM), which can rapidly obtain the skyline cube from the updated nonpseudo objects. In order to integrate the two modules, a grid-based evaluation method that uses a regular grid index is proposed.

Elkheir et al. [11] surveyed the data management solutions that have been proposed for IoT and proposed a data management framework that takes into consideration the drawbacks of existing approaches. The proposed framework adopts a federated, data- and sources-centric approach to link diverse things with their abundance of data to the potential applications and services. Data mining technologies can also be used to discover the hidden information in the data of IoT, which can be used to improve the performance of the system or to enhance the quality of the services this new environment can provide [12]. Tsai et al. [12] surveyed research on how to connect data mining technologies, including clustering, classification, and frequent pattern mining, to the IoT from a different perspective. The authors also discuss changes, potentials, open issues, and future trends of applying data mining to the IoT.

2.2. Index Building Methods for Top-k Queries

Skyline and convex hull methods are representative methods for constructing an index efficiently. These methods construct an index as a list of layers, each of which consists of objects that are not dominated by each other. The computing cost of skyline methods is much lower than that of convex hull methods; however, the number of objects in each layer of skyline methods is much larger than that of convex hull methods. Thus, the skyline methods are mainly used in applications where insertion, update, and deletion operations occur frequently on objects. Since such applications need to reconstruct the skyline frequently, they require a small computing time. On the other hand, convex hull methods are suited to objects that are not updated often. Thus, these methods are used in applications where top-k query processing dominates the workload. This is because a layer in convex hull methods consists of a small number of objects, which results in rapid processing of top-k queries. In this paper, we focus on reducing the index construction time of skyline methods, in which data is frequently updated.
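The idea of an index made of layers can be illustrated with a short sketch that peels successive skylines off a dataset. It assumes that smaller values are better and uses a naive quadratic skyline; the sample points are hypothetical and the function names are introduced here for illustration only.

```python
def dominates(p, q):
    """p dominates q (assumption: smaller values are better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline_layers(points):
    """Peel the dataset into layers: layer 1 is the skyline of all objects,
    layer 2 is the skyline of what remains, and so on."""
    remaining = list(points)
    layers = []
    while remaining:
        layer = [p for p in remaining
                 if not any(dominates(q, p) for q in remaining)]
        layers.append(layer)
        remaining = [p for p in remaining if p not in layer]
    return layers


# Hypothetical two-dimensional objects.
points = [(1, 9), (2, 4), (4, 4), (5, 1), (6, 6), (3, 7)]
layers = skyline_layers(points)
for i, layer in enumerate(layers, start=1):
    print(i, layer)
```

Every object in layer i + 1 is dominated by some object in layer i, so a query with a monotone scoring function never needs to look deeper than the first k layers to collect its k best objects.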

2.2.1. Skyline Methods

The skyline methods are useful for answering top-k queries by accessing only a subset of the database. These methods have the advantage of a low index building cost. The skyline operation was first introduced by Börzsönyi et al. [13], and a number of variations have since been proposed. Data space partitioning is used in many skyline methods for early pruning of objects that are not included in the skyline, and several skyline construction algorithms apply it. Grid-based data space partitioning has been commonly used in distributed and parallel skyline processing [8]. The angle-based space partitioning approach (ABSP) [7] uses the hyperspherical coordinates of data objects and improves on grid-based space partitioning. Köhler et al. [8] proposed a novel approach called PPPS, which reduces the computing time of ABSP by mapping the objects using hyperplane projection.

There are also other algorithms for constructing a skyline; the representative methods are block nested loops (BNL) [13], SFS [6], and linear elimination sort for skyline (LESS) [14]. BNL sequentially reads the input relation and keeps candidate objects in a window W. When an object p is read, it is compared to the objects in W. If an object in W dominates p, BNL eliminates p. Otherwise, any objects in W that p dominates are deleted from W, and p is added to W [13]. The SFS algorithm [6] improves BNL by presorting the input relation according to the entropy value of each object. LESS is an improvement of SFS that essentially combines aspects of a number of the established algorithms [14]. LESS discards some dominated objects earlier and thus reduces the number of pairwise comparisons between objects compared with SFS. However, the number of comparisons is still large. There has also been growing interest in distributed [15, 16] and parallel [17, 18] skyline computation lately.
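The difference between BNL and SFS can be sketched as follows. This is a simplified in-memory illustration, not the implementation of [13] or [6]: it assumes that the window always fits in memory (the original BNL spills to a temporary file when it does not), that smaller values are better, and that the entropy of an object is the sum of ln(x + 1) over its nonnegative attribute values.

```python
import math


def dominates(p, q):
    """p dominates q (assumption: smaller values are better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def bnl(points):
    """Block nested loops with an unbounded in-memory window."""
    window = []
    for p in points:
        if any(dominates(w, p) for w in window):
            continue  # p is dominated by a window object: eliminate p
        # Otherwise remove the window objects that p dominates, then add p.
        window = [w for w in window if not dominates(p, w)]
        window.append(p)
    return window


def entropy(p):
    """Monotone presorting key; assumes nonnegative attribute values."""
    return sum(math.log(x + 1.0) for x in p)


def sfs(points):
    """Sort-filter-skyline: after presorting by entropy, no later object can
    dominate an earlier one, so window objects are never removed."""
    window = []
    for p in sorted(points, key=entropy):
        if not any(dominates(w, p) for w in window):
            window.append(p)
    return window


points = [(1, 9), (2, 4), (4, 4), (5, 1), (6, 6), (3, 7)]
print(bnl(points))
print(sfs(points))
```

Both functions return the same set of objects; SFS saves the window-cleaning work because the presort guarantees that every object admitted to the window is a final skyline object.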

2.2.2. Other Methods

The convex hull methods construct a layer of edge objects in a convex hull shape and discard the other objects. The layer size of convex hull methods is smaller than that of skyline methods; however, the index building time of convex hull methods is higher than that of skyline methods. The representative convex hull methods are ONION [19] and HL-Index [5]. ONION [19] builds a convex hull as an index by constructing a boundary with the edge objects. That is, the objects of the first layer encircle the other objects. ONION builds a second layer in the same manner and finally constructs a list of layers as a result. HL-Index [5] builds a convex hull as ONION does and additionally maintains sorted lists for retrieving top-k results efficiently.

In order to reduce the index building time of convex hull methods, some methods combine convex hull and skyline methods. For example, Ihm et al. [20] proposed the approximate convex skyline (AppCS) method, which constructs a skyline over all of the objects and then partitions it. AppCS then builds an approximate convex hull in each partitioned region with virtual objects. Another method that focuses on reducing the index building time of the convex hull is proposed in [21]. The authors proposed a method called approximate convex hull index (aCH-Index) that computes the skyline over the entire set of objects, partitions the region into multiple subregions to reduce the computing time of convex hull in all origins, and then computes the convex hull in each subregion.

3. Grid-PPPS

In this section, we explain the proposed method, Grid-PPPS. As explained in Section 2.2.1, PPPS [8] improves the index building time of ABSP [7]. However, PPPS has a high index building time in high-dimensional databases. Grid-PPPS reduces the time complexity of PPPS. Grid-PPPS is constructed in five steps, as shown in Figure 2: (a) approximate skylining step, (b) grid-based partitioning step, (c) hyperplane-based partitioning step, (d) local skylining step, and (e) merging step. For convenience of explanation, Figure 2 shows the Grid-PPPS procedure in a two-dimensional region. We explain each step in detail in Sections 3.1 to 3.5.

3.1. Approximate Skylining Step

In the first step, Grid-PPPS constructs an approximate skyline. This step is shown in Figure 2(a). Computing the exact skyline of the set T of all tuples can be expensive, since each tuple must be compared to many other tuples. However, we can prune many tuples with only a few comparisons. We prune the objects by calculating the entropy value of each object. We select several tuples that have low entropy values and form a small set S from them. Some tuples in T are dominated by tuples in S, and those tuples can be eliminated safely. Since we pick the tuples of S according to their entropy values, we can discard more tuples. The tuples that remain form the approximate skyline. Importantly, for a fixed size of S, the approximate skyline can be computed in linear time with a single pass over the dataset [8].
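A minimal sketch of this pruning step is given below. It rests on several assumptions that are not stated in the text: smaller values are better, attribute values are normalized to [0, 1], and the entropy of a tuple is the sum of ln(x + 1) over its attributes. The parameter `sample_size` plays the role of the fixed size of the small set, and the data are hypothetical.

```python
import math
from heapq import nsmallest


def dominates(p, q):
    """p dominates q (assumption: smaller values are better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def entropy(p):
    """Assumed entropy function for values normalized to [0, 1]."""
    return sum(math.log(x + 1.0) for x in p)


def approximate_skyline(tuples, sample_size):
    """Drop every tuple dominated by one of the lowest-entropy tuples.

    The result is a superset of the exact skyline: a tuple is removed only
    when a concrete dominating tuple has been found.
    """
    sample = nsmallest(sample_size, tuples, key=entropy)
    return [t for t in tuples if not any(dominates(s, t) for s in sample)]


# Hypothetical normalized tuples.
tuples = [(0.1, 0.9), (0.2, 0.4), (0.4, 0.4), (0.5, 0.1),
          (0.6, 0.6), (0.3, 0.7), (0.9, 0.95)]
approx = approximate_skyline(tuples, sample_size=1)
print(len(tuples), "->", len(approx))
```

With a fixed `sample_size`, the filter compares each tuple with at most `sample_size` others, so its cost grows linearly with the number of tuples; this sketch uses one extra pass to select the sample.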

3.2. Grid-Based Partitioning Step

In the grid-based partitioning step, which is the major step, Grid-PPPS partitions the data space into subspaces using the grid-based partitioning technique. A grid is a pattern of straight lines that cross each other, forming squares. Many applications use the grid-based technique, since it is simple and has a low computing cost [22–24]. The grid-based partitioning scheme is based on recursively dividing some dimension of the data space into two parts [7]. The computing time of grid-based partitioning is lower than that of other partitioning techniques, because grid-based partitioning is simple and cheap to compute. Thus, we partition the objects obtained from the approximate skylining step into spaces with the grid-based partitioning technique. Figure 3(a) shows an example of grid-based partitioning in a two-dimensional data space, and a three-dimensional example is shown in Figure 3(b).
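Grid-based partitioning can be sketched as a simple mapping from an object to a cell index. The sketch below assumes a uniform grid with the same number of cells in every dimension over the range [0, 1]; the text does not specify the grid layout. The cell-level pruning in `prune_dominated_cells` is likewise an assumption about how a grid can filter objects, not a description of the authors' implementation.

```python
from collections import defaultdict


def grid_cell(point, cells_per_dim, lo=0.0, hi=1.0):
    """Index of the grid cell that contains the point (uniform grid, assumed)."""
    width = (hi - lo) / cells_per_dim
    # Clamp so that a value equal to hi falls into the last cell.
    return tuple(min(int((x - lo) / width), cells_per_dim - 1) for x in point)


def grid_partition(points, cells_per_dim):
    """Group the points by grid cell."""
    cells = defaultdict(list)
    for p in points:
        cells[grid_cell(p, cells_per_dim)].append(p)
    return dict(cells)


def prune_dominated_cells(cells):
    """Drop a cell when a non-empty cell has a smaller index in EVERY dimension.

    Any object of the smaller-index cell is then strictly smaller in every
    attribute, so it dominates all objects of the dropped cell
    (assumption: smaller values are better).
    """
    keys = list(cells)
    return {
        c: pts for c, pts in cells.items()
        if not any(all(o < x for o, x in zip(other, c)) for other in keys)
    }


# Hypothetical normalized objects.
points = [(0.1, 0.9), (0.2, 0.4), (0.6, 0.6), (1.0, 0.0)]
cells = grid_partition(points, cells_per_dim=2)
kept = prune_dominated_cells(cells)
print(sorted(cells), sorted(kept))
```

Assigning an object to a cell costs one division per attribute, which is why the grid step is cheap compared with pairwise domination calculations.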

3.3. Hyperplane-Based Partitioning Step

In the hyperplane-based partitioning step, Grid-PPPS partitions the space into subspaces using the hyperplane-based partitioning proposed in PPPS [8]. We first calculate the equation of the hyperplane. Next, we project the tuples onto the hyperplane; (1) shows the calculation of the projection. Finally, we partition the space, which consists of the projected tuples, into subspaces.
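As a rough illustration of projecting objects onto a hyperplane and partitioning the projected objects, consider the sketch below. It assumes the hyperplane x_1 + ... + x_d = 1, whose normal vector is (1, ..., 1), and an orthogonal projection; the hyperplane and projection formula actually used by PPPS [8] may differ. Splitting the projected objects into equally sized groups along the first projected coordinate is also an assumption made only to keep the example short.

```python
def project_onto_hyperplane(p, offset=1.0):
    """Orthogonal projection of p onto the hyperplane x_1 + ... + x_d = offset.

    Assumed hyperplane with normal vector (1, ..., 1); moving p along the
    normal by (sum(p) - offset) / d lands on the hyperplane.
    """
    d = len(p)
    shift = (sum(p) - offset) / d
    return tuple(x - shift for x in p)


def partition_by_projection(points, parts):
    """Split the objects into groups by the first coordinate of their projection."""
    if not points:
        return []
    ordered = sorted(points, key=lambda p: project_onto_hyperplane(p)[0])
    size = -(-len(ordered) // parts)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


# Hypothetical normalized objects.
points = [(0.1, 0.9), (0.2, 0.4), (0.4, 0.4), (0.5, 0.1), (0.6, 0.6)]
q = project_onto_hyperplane((0.2, 0.4))
groups = partition_by_projection(points, parts=2)
print(q, [len(g) for g in groups])
```

Objects that differ only by a shift along the diagonal, such as (0.2, 0.4) and (0.1, 0.3), project to the same position and therefore fall into the same group, where the one closer to the origin dominates the other.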

3.4. Local Skylining Step

In the local skylining step, Grid-PPPS computes the local skyline in each subspace_ij. We call the local skyline in subspace_i subskyline_i and use the SFS algorithm [6] to compute each subskyline. To construct a local skyline, a domination calculation, which determines whether one object dominates another and hence whether an object belongs to the skyline, must be performed between pairs of objects. Because Grid-PPPS filters out objects in the grid-based partitioning step, the number of domination calculations decreases.
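The local skylining step can be sketched by running an SFS-style scan inside each subspace while counting domination calculations. The sketch assumes that smaller values are better and that the entropy is the sum of ln(x + 1); the two partitions and their labels are hypothetical.

```python
import math


def dominates(p, q):
    """p dominates q (assumption: smaller values are better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def local_skyline(points):
    """SFS-style local skyline; also returns the number of domination
    calculations performed."""
    window, ndc = [], 0
    for p in sorted(points, key=lambda t: sum(math.log(x + 1.0) for x in t)):
        dominated = False
        for w in window:
            ndc += 1
            if dominates(w, p):
                dominated = True
                break
        if not dominated:
            window.append(p)
    return window, ndc


# Hypothetical subspaces produced by the partitioning steps.
partitions = {
    "subspace_a": [(1, 9), (3, 7), (2, 4)],
    "subspace_b": [(5, 1), (6, 6), (4, 4)],
}
subskylines = {}
total_ndc = 0
for name, pts in partitions.items():
    subskylines[name], ndc = local_skyline(pts)
    total_ndc += ndc
print(subskylines, total_ndc)
```

Objects are compared only with objects of the same subspace, so an object such as (4, 4) can survive locally even though (2, 4) in another subspace dominates it; such objects are removed when the subskylines are combined.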

3.5. Merging Step

In the last step, Grid-PPPS combines the subskylines of all subspaces. We build a layer by merging the subskylines. Since Grid-PPPS computes the skyline over the subskyline points once again, it builds the result layer without losing tuples and without overlap.
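A minimal sketch of the merging step follows. It assumes that smaller values are better and that the subspaces are disjoint, and it recomputes the skyline over the union of the subskylines with a naive quadratic scan; the input subskylines are hypothetical.

```python
def dominates(p, q):
    """p dominates q (assumption: smaller values are better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def merge_subskylines(subskylines):
    """Union the subskylines and keep only objects no other candidate dominates."""
    candidates = [p for sub in subskylines for p in sub]
    return [p for p in candidates
            if not any(dominates(q, p) for q in candidates)]


# Hypothetical subskylines of two subspaces.
subskylines = [[(2, 4), (1, 9)], [(5, 1), (4, 4)]]
layer = merge_subskylines(subskylines)
print(layer)
```

Every global skyline object is also a skyline object of its own subspace, so no result is lost by merging only the subskylines; the final scan removes local survivors such as (4, 4) that are dominated from another subspace.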

4. Performance Evaluation

In this section, we first explain the data and environment in Section 4.1 and then present the results of experiments in Section 4.2.

4.1. Experimental Data and Environment

We have implemented the proposed method in C++. We conduct all the experiments on a Linux PC with an Intel i5-760 quad-core processor running at 2.80 GHz and 16 GB of main memory. We use uniformly distributed synthetic datasets in all of our experiments. The data sizes are 10 K, 100 K, and 1000 K tuples, and the dimensionality ranges from two to nine.

4.2. Result of Experiments

We compare the computing time and the nDC (number of domination calculations) of Grid-PPPS with those of the existing methods PPPS [8] and SFS [6]. We use the wall clock time as the measure of the computing time. We measure the computing time and the nDC on the synthetic dataset while varying the data size N and the dimension d.
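The two measures can be collected with a small harness like the one sketched below. It is not the authors' C++ benchmark: the generator, the counter class, and the presorted scan (which orders objects by the sum of their attributes) are assumptions made for illustration, and smaller values are taken to be better.

```python
import random
import time


def uniform_dataset(n, d, seed=0):
    """n synthetic objects with d attributes drawn uniformly from [0, 1)."""
    rng = random.Random(seed)
    return [tuple(rng.random() for _ in range(d)) for _ in range(n)]


class DominanceCounter:
    """Counts every domination calculation it performs."""

    def __init__(self):
        self.ndc = 0

    def dominates(self, p, q):
        self.ndc += 1
        return (all(a <= b for a, b in zip(p, q))
                and any(a < b for a, b in zip(p, q)))


def measure(points):
    """Build a skyline with a presorted scan; return (skyline, nDC, seconds)."""
    counter = DominanceCounter()
    start = time.perf_counter()  # wall clock time
    window = []
    for p in sorted(points, key=sum):
        if not any(counter.dominates(w, p) for w in window):
            window.append(p)
    elapsed = time.perf_counter() - start
    return window, counter.ndc, elapsed


data = uniform_dataset(n=200, d=3)
sky, ndc, seconds = measure(data)
print(len(sky), ndc, f"{seconds:.6f} s")
```

Unlike the wall clock time, the nDC does not depend on the machine or on implementation details, which makes it a useful companion measure when comparing index building methods.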

The skyline constructed by Grid-PPPS is exactly the same as that constructed by PPPS. Grid-PPPS improves the index building time of PPPS on large and high-dimensional datasets. When the dataset has 10 K tuples and fewer than six attributes, the index building time of Grid-PPPS is slightly higher than that of PPPS because of the additional partitioning step. The number of filtered tuples in Grid-PPPS is similar to that in PPPS on small and low-dimensional datasets. However, Grid-PPPS constructs an index much more quickly on large and high-dimensional datasets, as shown in the experiments.

Experiment 1. Computing time and nDC as data size is varied.
Figure 4(a) shows the computing time of Grid-PPPS and PPPS as N is varied from 10 K to 1000 K. The results are plotted on a log scale in Figure 4. The computing time of Grid-PPPS improves by 1.41–1.52 times over PPPS. Figure 4(b) shows the nDC of Grid-PPPS and PPPS as N is varied from 10 K to 1000 K. The nDC of Grid-PPPS improves by 1.49–2.00 times over PPPS.

Experiment 2. Computing time as dimension and data size are varied.
Figures 5(a), 5(b), and 5(c) show the computing time of Grid-PPPS and PPPS as d is varied from 2 to 9 and N is varied from 10 K to 1000 K. The results are plotted on a log scale in Figure 5. Figure 5(a) shows that the computing time of Grid-PPPS improves by 0.75–1.52 times over PPPS as d is varied and N is 10 K. Figure 5(b) shows that the computing time of Grid-PPPS improves by 0.77–1.51 times over PPPS as d is varied and N is 100 K. Figure 5(c) shows that the computing time of Grid-PPPS improves by 0.73–1.43 times over PPPS as d is varied and N is 1000 K. A factor below 1 means that Grid-PPPS is slower than PPPS for that setting. In order to show the precise difference between Grid-PPPS and PPPS, we conduct the experiments shown in Figure 6.

Experiment 3. The nDC as dimension and data size are varied.
Figures 7(a), 7(b), and 7(c) show the nDC of Grid-PPPS and PPPS as d is varied from 2 to 9 and N is varied from 10 K to 1000 K. The results are plotted on a log scale in Figure 7. Figure 7(a) shows that the nDC of Grid-PPPS improves by 1.00–2.01 times over PPPS as d is varied and N is 10 K. Figure 7(b) shows that the nDC of Grid-PPPS improves by 0.68–1.89 times over PPPS as d is varied and N is 100 K. Figure 7(c) shows that the nDC of Grid-PPPS improves by 0.65–1.49 times over PPPS as d is varied and N is 1000 K.

5. Conclusion

As more and more sensors are connected to the Internet, IoT applications generate enormous amounts of data. To address this problem, we have proposed using top-k query processing to find the best results among vast amounts of data. In order to handle top-k queries efficiently, we have proposed a new skyline method called Grid-PPPS, which first performs grid-based partitioning on the data space and then partitions it once again using hyperplane projection. We have compared the proposed method with state-of-the-art methods such as PPPS and SFS. The experimental results show that, compared with PPPS, Grid-PPPS improves the index building time by up to about 1.5 times and the nDC by up to about 2 times, with the gains appearing on large and high-dimensional datasets.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2012003797).