Abstract
Top-k query processing is widely used in many applications and mobile environments. Indexes are used to process such queries efficiently, and layer-based indexing methods are representative approaches. However, the existing methods suffer from a high index building time on multidimensional and large data, which makes them difficult to use. In this paper, we propose a new concept for constructing a layer-based index, called the unbalanced layer (UBLayer). The existing methods construct each layer as a balanced layer that consists of the outermost data and wraps the rest of the input data. In contrast, UBLayer constructs each layer as an unbalanced layer that does not wrap the rest of the data. To construct UBLayer, we first divide the dimension of the input data into divided-dimensional data and compute the convex hull in each divided-dimensional dataset. We then combine the divided-convex hulls to build UBLayer. We also propose the UBSelectAttribute algorithm for dividing the dimension according to major attributes. We demonstrate the superiority of the proposed methods through performance experiments.
1. Introduction
Recently, searching for the data that users are interested in among a large amount of data has become very important as the size of databases grows. In particular, searching for the data that users want over high-dimensional and large data is more difficult in mobile environments, because users are interested in only a small part of the massive amount of data, and the capacity for efficient retrieval from vast amounts of data is limited in the mobile environment.
Furthermore, as the attributes of the data become more varied, problems arise in index construction and query processing for high-dimensional and large databases. For example, users searching for a used car must consider various attributes such as manufacturer, model, rating, year, price, region, mileage, fuel, transmission, and color, and the search takes a long time.
Top-k query processing retrieves the k data items of greatest interest to the user from the given data, considering various attributes. It is used in various applications such as searching for hotels or cameras on websites [1–3], healthcare [4], and cloud environments [5–7]. Because top-k query processing considers various attributes over a large amount of data, scanning all data for each query is very inefficient. Therefore, it is important to build an index in order to perform top-k query processing efficiently. A number of studies construct an index by building layers over the entire database. These methods find the top-k answers by accessing just the first few layers.
The convex hull [8] is a representative index construction method for top-k query processing. The method reads all of the data and organizes it into a list of layers, where the convex hull is the outermost layer and geometrically surrounds all of the data in the remaining layers. Since the convex hull algorithm constructs the layer list based on the control points from all directions, the method has the advantage of supporting omnidirectional queries. Although the method is advantageous for query processing, the convex hull includes only a small amount of data in one layer, so the construction time is inefficient.
Figure 1 shows the experimental results of constructing the convex hull as the dimension and the size of the data are varied. As shown in Figure 1(a), the computing time increases sharply as the dimension increases, and for 10K data with more than 9 attributes the experiment is impossible. Figure 1(b) shows the computing time as the data size is varied; the construction time of the convex hull increases sharply as the data size increases, and the experiment is also impossible when the data size exceeds 1000K. Figure 1(c) shows the number of layers in the convex hull as the dimension is varied. As the dimension increases, the number of layers decreases sharply. That is, the number of data contained in one layer increases as the number of layers decreases.
(a) The index building time (dimension is varied)
(b) The index building time (data size is varied)
(c) The number of layers (dimension is varied)
As the number of attributes and the size of the data increase, the time for constructing the convex hull increases sharply and the number of layers decreases. Therefore, it is almost impossible to build the convex hull over a high-dimensional and large database.
In this paper, we propose UBLayer (unbalanced layer), a new concept of an unbalanced layer that supports query processing in all directions, together with a method for building it. This paper addresses only the construction of an index for efficient top-k query processing; the query processing itself is not covered. UBLayer is an unbalanced layer because it is constructed around the outermost points but also contains some inner points. To build UBLayer, we first identify the main attributes of the input data and then divide the input data according to those attributes. Next, we construct a divided-convex hull over each divided dataset and, finally, combine the divided-convex hulls to build the final layers. UBLayer is more efficient for top-k query processing because it builds layers much faster and has more layers than the convex hull.
More precisely, we make the following contributions in this paper:
We propose a new concept of an unbalanced layer method for high-dimensional and large databases, called UBLayer (unbalanced layer). UBLayer first creates divided datasets by dividing the dimension of the input data and then constructs a divided-convex hull in each divided dataset. The proposed method can build a list of layers over a high-dimensional and large dataset and reduces the index building time compared to the convex hull.
We propose a method, called UBSelectAttribute, that divides the dimension by selecting major attributes in order to improve the precision of UBLayer. UBSelectAttribute divides the dimension using the major attributes of the input data.
We show the performance advantages of UBLayer through various experiments. We compare the index building time, the number of layers, and the precision of UBLayer with those of the existing convex hull method.
The rest of this paper is organized as follows. Section 2 describes the background and Section 3 describes the existing work related to this paper. Section 4 explains the proposed method, UBLayer. Section 5 presents the results of performance evaluation. Section 6 summarizes and concludes the paper.
2. Background
In this section, we describe the background of this paper. We first describe top-k query processing in Section 2.1 and then describe the convex hull method for mobile computing in Section 2.2.
2.1. Top-k Query Processing
Top-k query processing retrieves the k best results from the input data according to a scoring function. The following example shows a top-k query that searches for a hotel.
Example 1. Suppose a customer, Tom, wants to book a hotel for a summer vacation and searches for one on a website. A hotel has various attributes such as rates, hotel ratings, customer ratings, distance from subway stations, distance from major attractions, and type of accommodation. Tom considers rates and distance from the subway station for his reservation, and he wants a hotel with small values for these attributes. In this example, we consider only two attributes for ease of explanation. He also sets the weights of the attributes as search conditions (0.6 for rates and 0.4 for distance). The scoring function is therefore 0.6 × rates + 0.4 × distance.
SELECT *
FROM Hotels
ORDER BY 0.6 * rates + 0.4 * distance
STOP AT 3
Table 1 shows the search result according to the query of the customer Tom. According to the scoring function, the top 1 result is hotel C, the top 2 result is hotel E, and the top 3 result is hotel B among the five hotels.
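The evaluation of such a query without any index can be sketched as follows. This is only an illustration: the attribute values below are hypothetical (they are not the values of Table 1) and were chosen so that the resulting order agrees with the order stated above, and the function names `score` and `top_k` are assumptions introduced here.

```python
import heapq

# Hypothetical (rates, distance) values, normalized to [0, 1].
# These are NOT the values of Table 1; they only illustrate the computation.
hotels = {
    "A": (0.9, 0.8),
    "B": (0.5, 0.6),
    "C": (0.2, 0.3),
    "D": (0.8, 0.9),
    "E": (0.3, 0.5),
}
weights = (0.6, 0.4)  # 0.6 for rates, 0.4 for distance


def score(attrs, weights):
    """Linear scoring function: weighted sum of the attribute values."""
    return sum(w * a for w, a in zip(weights, attrs))


def top_k(data, weights, k):
    """Return the k names with the smallest scores (smaller is better here)."""
    return heapq.nsmallest(k, data, key=lambda name: score(data[name], weights))


result = top_k(hotels, weights, 3)
print(result)  # ['C', 'E', 'B']
```

This naive evaluation scores every tuple for every query, which is the cost that layer-based indexes are meant to avoid.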
2.2. Convex Hull Method for Mobile Computing
The convex hull method constructs a list of layers from sets of outermost points. Several studies construct convex hulls for mobile computing and wireless network environments. Mobile computing requires wireless network access [9], which is much more difficult than wired communication because signal, path, noise, delays, errors, power, and so on must be considered. One of the basic issues of mobile computing and wireless networks is the energy constraint [10]. Thus, highly efficient services are needed for mobile computing. In mobile computing, the convex hull is used to retrieve the data that users are interested in and to select the sensors to connect to.
3. Related Work
In this section, we explain the convex hull method and discuss the existing work. The representative methods that use the convex hull to construct a list of layers are ONION [11], HLIndex [12, 13], and aCHIndex [14]. ONION constructs an index by making layers from the vertices of the convex hull over the objects in the multidimensional space. That is, it creates the first layer from the convex hull vertices over all objects, then creates the second layer from the convex hull vertices over the remaining objects, and so on. Finally, the set of layers becomes the layer list. HLIndex constructs a layer list with the convex hull as ONION does and then builds lists sorted in ascending or descending order of each attribute value of the objects in each layer. HLIndex keeps those lists in order to speed up query processing. aCHIndex first finds the skyline points over all objects and then partitions the skyline layer into multiple subregions. It then computes the convex hull in each subregion and combines the results. The convex hull method can perform top-k query processing from every direction. However, it suffers from high time complexity in index generation.
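The layer-peeling idea described for ONION can be sketched for the 2-dimensional case as follows. This is an illustration under stated assumptions, not the implementation of [11]: the hull is computed with Andrew's monotone chain algorithm, points lying strictly inside a hull edge are left for a later layer, and the names `convex_hull` and `onion_layers` are introduced here.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Vertices of the 2-D convex hull (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def onion_layers(points):
    """Repeatedly peel off the hull vertices of the remaining points."""
    remaining = set(points)
    layers = []
    while remaining:
        hull = convex_hull(list(remaining))
        layers.append(hull)
        remaining -= set(hull)
    return layers


# Hypothetical data: an outer square, an inner square, and a center point.
points = [(0, 0), (4, 0), (4, 4), (0, 4),
          (1, 1), (3, 1), (3, 3), (1, 3),
          (2, 2)]
layers = onion_layers(points)
for i, layer in enumerate(layers, start=1):
    print(i, sorted(layer))
```

Every iteration recomputes a hull over all remaining points, which gives an intuition for why the construction cost grows quickly with the data size and, in higher dimensions, with the cost of the hull computation itself.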
Besides, there are several other methods, such as a method that uses a tree to construct the convex hull [15], CONHEM (Convexhull Edge Method), which sets a minimum region to calculate the allowance [16], and VAICH (Visual Attention Imitation Convex Hull), which is based on the human perspective of constructing a convex hull [17]. VAICH constructs the convex hull by eliminating data from the centroid to the limit. Other research uses the GPU to speed up the convex hull computation. Tang et al. [18] proposed a method that uses both the GPU and the CPU simultaneously to construct the convex hull: on receiving the input, the method eliminates initial points with a GPU-based filter and then constructs the convex hull with a CPU-based algorithm. Reference [19] proposed a parallel algorithm called CudaHull, which constructs a 3-dimensional convex hull on the GPU using CUDA programming. gHull [20] maximizes parallelization based on the relationship between the 3-dimensional Voronoi diagram and the 3-dimensional convex hull and uses the GPU to construct the 3-dimensional convex hull. Singh et al. [21] proposed a convex hull approach in conjunction with a Gaussian mixture model in order to improve detection accuracy without much additional computation time. CHVS [22] is a fast convex hull vertices selection algorithm for online classification; it converts the convex hull problem into a linear equation problem with low computational complexity.
There are also several methods that construct convex hulls for mobile environments. Jiang et al. [23] proposed a depth-adjustment deployment algorithm based on the two-dimensional convex hull for network energy efficiency. Kundu et al. [24] proposed an on-demand power-saving routing algorithm for mobile ad hoc networks, which identifies nodes using k-means and a convex hull algorithm. Xu et al. [25] proposed a path selection algorithm that uses the convex hull to reduce the data delivery latency of mobile elements in wireless sensor networks. However, these methods use the convex hull to select nodes or data and do not study how to construct the convex hull efficiently. In addition, they do not consider high-dimensional data. Thus, in this paper, we present an efficient algorithm to construct the convex hull for high-dimensional data.
4. UBLayer (Unbalanced Layer)
In this section, we propose a new layer-based index building method for high-dimensional and large databases, called UBLayer (unbalanced layer). As mentioned in the Introduction, the construction time of the convex hull increases exponentially as the dimension of the data increases. The number of layers also decreases; thus, the convex hull is practically impossible to use for top-k query processing. UBLayer improves on the convex hull by dividing the dimension of the input data, which reduces the computing time and increases the number of layers. Figure 2 shows the concept of UBLayer. Figure 2(a) is the index constructed by the existing layer-based indexing method. It consists of balanced layers, in which the outer layer encloses the other layers. Figure 2(b) shows the unbalanced layers proposed in this research, in which the outer layer does not enclose the other layers. We call this type of layer an unbalanced layer.
(a) A balanced layer
(b) An unbalanced layer
UBLayer is built in three steps: the dimension division step, the constructing divided-convex hull step, and the combining step. First, we divide the dimension of the input data to create divided-dimensional datasets. Second, we construct a divided-convex hull over each divided-dimensional dataset. Finally, we merge the divided-convex hulls to create the final UBLayer.
4.1. Dimension Division Step
In this section, we explain the first step of constructing UBLayer, the dimension division step, which divides the dimension of the input data.
4.1.1. UBBasic Algorithm
In this section, we propose the UBBasic algorithm for dividing the dimension of the input data. UBBasic is a naïve method that divides the dimension of the input data into two divided-dimensional datasets. Given an input dataset with multiple attributes, UBBasic divides it into two divided-dimensional datasets, div_{1} and div_{2}. For example, 6-dimensional data is divided into two 3-dimensional datasets, and 5-dimensional data is divided into a 2-dimensional and a 3-dimensional dataset. Table 2 shows the process of dividing the 4-dimensional input data into two 2-dimensional datasets div_{1} and div_{2} using the UBBasic algorithm, and Figure 3 shows the resulting div_{1} and div_{2}. In Table 2, the original dataset has four attributes X_{1} to X_{4}, and UBBasic divides the input data into two divided-dimensional datasets, one with attributes X_{1} and X_{2} and the other with X_{3} and X_{4}.
(a)
(b)
Algorithm 1 shows the UBBasic algorithm for dimension division. The inputs of the algorithm are the dataset, which is a set of multidimensional data objects, and the number of dimensions, that is, the number of attributes. The result of the UBBasic algorithm is div, a set of data with divided dimensions. In lines (1) to (3), the algorithm first checks the number of attributes. If it is less than 4, the dimension cannot be divided, because each divided dataset needs at least two attributes to construct a convex hull; in that case the algorithm constructs the convex hull directly in the next line and terminates. If the dimension is greater than or equal to 4, the algorithm initializes div_{1} and div_{2} to store the divided-dimensional data in line (4). Next, if the input dataset is not empty, the algorithm divides the dimension in half in line (6) and, in the next line, saves the data of the two divided-dimensional datasets in div_{1} and div_{2}, respectively. Finally, the dataset div of the divided dimension is returned and the algorithm terminates.
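The division described above can be sketched as follows. This is a minimal illustration rather than the listing of Algorithm 1: the function name `ub_basic`, the representation of tuples as Python tuples, and the choice to return the undivided data when the dimension is below 4 (so that the caller can build an ordinary convex hull) are assumptions made here.

```python
def ub_basic(data):
    """Split every tuple into two halves of its attributes.

    Returns [div1, div2] when the data has at least 4 attributes,
    and [data] (undivided) otherwise, so the caller can fall back
    to an ordinary convex hull.
    """
    if not data:
        return []
    d = len(data[0])
    if d < 4:
        return [list(data)]
    half = d // 2  # a 5-dimensional tuple becomes a 2-d part and a 3-d part
    div1 = [row[:half] for row in data]
    div2 = [row[half:] for row in data]
    return [div1, div2]


# Hypothetical 4-dimensional input with attributes X1..X4.
data = [(1, 9, 4, 6), (3, 7, 2, 8), (5, 5, 5, 5)]
div1, div2 = ub_basic(data)
print(div1)  # [(1, 9), (3, 7), (5, 5)]
print(div2)  # [(4, 6), (2, 8), (5, 5)]
```

Row order is preserved in both halves, so the i-th entry of each divided dataset still refers to the same original tuple.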

4.1.2. UBSelectAttribute Algorithm
In this section, we also propose the UBSelectAttribute algorithm, which improves the division method of the UBBasic algorithm. The UBSelectAttribute algorithm determines the main attributes among the various attributes of the input data and divides the dimensions based on them. Table 3 shows an example of the UBSelectAttribute algorithm. The 6-dimensional input data has six attributes, X_{1} to X_{6}. If X_{1} and X_{3} are determined to be the main attributes, the proposed algorithm divides the 6-dimensional data into 2-dimensional data with the main attributes X_{1} and X_{3} and 4-dimensional data with the remaining attributes. Representative approaches for identifying key attributes include Mean Decrease Impurity (MDI), which measures the influence of variables through the average decrease in data impurity during classification. In addition, the main attributes can be determined by analyzing the attributes with a histogram method.
Algorithm 2 shows the UBSelectAttribute algorithm for dimension division based on main attributes. The inputs of the algorithm are the dataset, which is a set of multidimensional data objects, the number of dimensions, and att, which is a set of main attributes. The result of the UBSelectAttribute algorithm is div, a set of data with divided dimensions. In lines (1) to (4), the algorithm first checks the number of attributes in the same way as the UBBasic algorithm. If the dimension is less than 4, it is impossible to divide the dimension, so the algorithm constructs the convex hull in the next line and terminates. Next, the UBSelectAttribute algorithm checks the number of main attributes in line (4) and also checks the number of attributes other than the main attributes; if either number is less than 2, the division is not performed. This is because at least two attributes are needed to construct a convex hull. In these two cases, which are in lines (4) and (7), it is impossible to construct the convex hull over the divided-dimensional data. Otherwise, the UBSelectAttribute algorithm divides the input data into two divided-dimensional datasets based on the main attributes. In line (9), the algorithm initializes the resulting datasets div_{1} and div_{2} to store the divided-dimensional data. Next, if the input dataset is not empty, the algorithm divides the dimension according to the main attributes in line (12) and, in the next line, saves the data of the two divided dimensions in div_{1} and div_{2}, respectively. Finally, the dataset div of the partitioned dimension is returned and the algorithm terminates.
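The attribute-based division can be sketched in the same style. Again this is an illustration, not the listing of Algorithm 2: the function name `ub_select_attribute`, the use of 0-based attribute indices for the main attributes, and the choice to return the undivided data when a check fails are assumptions made here. How the main attributes are chosen (for example, by MDI) is outside this sketch.

```python
def ub_select_attribute(data, main_attrs):
    """Split every tuple into its main attributes and the remaining ones.

    main_attrs holds 0-based attribute indices (X1 -> 0, X3 -> 2, ...).
    Returns [div1, div2], or [data] (undivided) when a division would
    leave either part with fewer than two attributes.
    """
    if not data:
        return []
    d = len(data[0])
    main = sorted(set(main_attrs))
    rest = [i for i in range(d) if i not in main]
    # A convex hull needs at least two attributes in each divided dataset.
    if d < 4 or len(main) < 2 or len(rest) < 2:
        return [list(data)]
    div1 = [tuple(row[i] for i in main) for row in data]
    div2 = [tuple(row[i] for i in rest) for row in data]
    return [div1, div2]


# Hypothetical 6-dimensional input; X1 and X3 are assumed to be the main attributes.
data = [(1, 2, 3, 4, 5, 6), (6, 5, 4, 3, 2, 1)]
div1, div2 = ub_select_attribute(data, main_attrs=[0, 2])
print(div1)  # [(1, 3), (6, 4)]
print(div2)  # [(2, 4, 5, 6), (5, 3, 2, 1)]
```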

4.2. Constructing Divided-Convex Hull Step
In this section, we explain the second step of UBLayer, the constructing divided-convex hull step. In this step, we construct a divided-convex hull for each divided-dimensional dataset in div, which is the result of the first step. Since each divided-dimensional dataset has at least two dimensions, a convex hull can be constructed over it. Algorithm 3 shows the ConstructingDividedConvexhull algorithm for generating the divided-convex hulls. The input of the algorithm is div, the divided-dimensional data, and the output is a list of divided-convex hulls. The ConstructingDividedConvexhull algorithm constructs a local convex hull, DividedCH, over each divided-dimensional dataset div_{i} in lines (1)–(3). Finally, the algorithm returns the list of divided-convex hulls.

4.3. Combining Step
In this section, we explain the last step of UBLayer, the combining step, in which we combine the divided-convex hulls to build UBLayer. In the combining step, we build UBLayer so that no tuple appears in more than one layer. The following example explains the building process of UBLayer.
Example 2. Figure 4 shows an example of building UBLayer over a 4-dimensional dataset. Figure 4(a) shows the 4-dimensional input data, and Figure 4(b) shows the result of the dimension division step, in which the input is divided into two 2-dimensional datasets. Figure 4(c) shows the result of constructing the divided-convex hull in each of the divided-dimensional datasets div_{1} and div_{2}. In div_{1}, the divided-convex hull consists of three layers, and the divided-convex hull in div_{2} has two layers. Figures 4(d)–4(f) show the process of constructing the final layer list by combining the divided-convex hulls. In the combining step, we first construct the first layer of UBLayer by merging the first layers of the divided-convex hulls of the divided-dimensional datasets and removing duplicated data; this layer includes the data points p1, p2, p4, p5, and p7. Next, in Figures 4(e) and 4(f), we construct the second and third layers of UBLayer, respectively, and obtain the final UBLayer.
(a) An input 4-dimensional dataset
(b) The two divided-dimensional datasets
(c) The result of constructing the divided-convex hulls
(d) The result of constructing the first UBLayer
(e) The result of constructing the second UBLayer
(f) The result of constructing the third UBLayer
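One possible reading of the combining step can be sketched as follows. This is an assumption-laden illustration, not the procedure used to produce Figure 4: it takes the layer lists of the divided-convex hulls as sets of tuple identifiers, merges the layers level by level, and drops any tuple that was already placed in an earlier UBLayer layer. The identifiers and the name `combine` are hypothetical.

```python
from itertools import zip_longest


def combine(divided_hulls):
    """Merge the i-th layers of all divided-convex hulls into one layer.

    divided_hulls is a list of layer lists; each layer is a set of tuple ids.
    A tuple is kept only in the first merged layer in which it appears,
    so no tuple occurs in more than one layer of the result.
    """
    assigned = set()
    ub_layers = []
    for level in zip_longest(*divided_hulls, fillvalue=frozenset()):
        merged = set().union(*level) - assigned
        if merged:  # skip levels that contribute no new tuple
            ub_layers.append(merged)
            assigned |= merged
    return ub_layers


# Hypothetical layer lists for two divided-dimensional datasets.
hull_div1 = [{"a", "b"}, {"c", "d"}, {"e", "f"}]
hull_div2 = [{"b", "c"}, {"a", "d"}, {"e", "f"}]
ub_layers = combine([hull_div1, hull_div2])
for i, layer in enumerate(ub_layers, start=1):
    print(i, sorted(layer))
```

Under this reading, a tuple that is outermost in any divided-dimensional dataset lands in the first layer, which is why a layer can contain points that are not outermost in the full-dimensional space.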
5. Performance Evaluation
In this section, we first explain the data and environment in Section 5.1 and then present the results of the experiments in Section 5.2.
5.1. Experimental Data and Environment
In this experiment, we compare the index generation time, the total number of layers, and the accuracy of UBLayer and the convex hull [8]. The index generation time is measured as wall clock time. To calculate accuracy, we compare the data included in the layers. The equation of accuracy is shown as (1). NumOfData(CH) is the number of data included in the first layer of the convex hull. For UBLayer, we count the data in its first layer that also belong to the first layer of the convex hull.
For UBLayer, we compare all of its dimension division methods: UBBasic uses the UBBasic algorithm and UBSA uses the UBSelectAttribute algorithm to construct UBLayer. For the experiments with synthetic data, we generate the input data with the data generator from PLIndex [12]. The data size is 10K, and the dimension is varied from 2 to 8. We implemented UBLayer and the convex hull in C++. To calculate the accuracy rate, the algorithm first generates the first layer, and we compare the number of data in it; the accuracy rate is then compared together with the index generation time. We conducted all the experiments on a Linux PC with an Intel i5-760 quad core processor running at 2.80 GHz and 16 GB of main memory.
Table 4 summarizes the experiments and the variables used to measure the index generation time and precision. The variables for the experiments are the data size and the dimension.
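Equation (1) is not reproduced in this text. One reading that is consistent with the description above, and with the statement in Experiment 7 that the accuracy is 100% when all data of the first layer of the convex hull are included, is the fraction of the convex hull's first layer that also appears in the first layer of UBLayer. The sketch below computes this quantity; the function name `first_layer_accuracy` and the sample identifiers are hypothetical, and the formula is an assumption rather than the paper's stated equation.

```python
def first_layer_accuracy(ub_first_layer, ch_first_layer):
    """Percentage of the convex hull's first layer found in UBLayer's first layer.

    Assumed formula: 100 * |UB first layer ∩ CH first layer| / NumOfData(CH).
    """
    ch = set(ch_first_layer)
    if not ch:
        raise ValueError("the first layer of the convex hull must not be empty")
    return 100.0 * len(ch & set(ub_first_layer)) / len(ch)


# Hypothetical tuple ids.
ch_layer = {1, 2, 3, 4}
ub_layer = {1, 2, 5}
print(first_layer_accuracy(ub_layer, ch_layer))  # 50.0
```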
5.2. Result of Experiments
In this section, we show the experimental results for the index construction time and accuracy of the convex hull and of UBLayer built with UBBasic, the basic dimension division algorithm, and with UBSelectAttribute (UBSA), which divides the input data based on the main attributes.
Experiment 1 (the comparison of index construction time as dimension is varied (data size 10K)). Figure 5 shows the construction time, measured as wall clock time, of UBLayer and the convex hull when the dimension is varied from 4 to 8. As the dimension increases, the index construction time of the convex hull increases exponentially, whereas the index construction time of UBLayer increases at a logarithmic rate. The index construction time of UBLayer is improved by 0.74 to 99.35 times compared to the convex hull. UBLayer is slower than the convex hull when the dimension is 4 because of the cost of dividing the dimension and merging. However, as the dimension increases beyond 5, the difference becomes larger, as Figure 5(b) shows.
(a) The comparison of index construction time
(b) Representation in log scale
Experiment 2 (the comparison of the number of total layers as dimension is varied (data size 10K)). Figure 6 shows the total number of layers of UBLayer and the convex hull as the dimension is varied from 4 to 8. The convex hull constructs 11 layers for 10K input data on average; that is, one layer includes about 909 data. The total number of layers of UBLayer is 4.2 to 11 times larger than that of the convex hull; that is, one layer of UBLayer includes about 5 to 11 times less data. Therefore, UBLayer is more efficient in query processing because one layer of UBLayer contains less data than one layer of the convex hull.
Experiment 3 (the comparison of index construction time of UBLayer methods as dimension is varied (data size 10K, dimension 4 to 12)). Figure 7 shows the comparison of the index construction time between the proposed methods as the dimension is varied. We presented UBBasic and UBSA for dividing the dimension of the input data in Section 4. In Figure 7, the two UBSA variants use 4 and 5 main attributes, respectively. The index construction time of UBSA increases steadily as the total number of attributes increases. Nevertheless, UBSA builds indices faster than the UBBasic algorithm.
(a) The comparison of index construction time
(b) Representation in log scale
Experiment 4 (the comparison of the number of total layers of UBLayer methods as dimension is varied (data size 10K, dimension 4 to 12)). Figure 8 shows the comparison of the total number of layers of the two UBLayer construction methods as the dimension is varied from 4 to 12. For both the UBBasic and UBSA algorithms, the number of layers decreases as the number of attributes increases. Since top-k query processing reads the layers one at a time to retrieve the results, query processing becomes more efficient as the total number of layers increases.
Experiment 5 (the comparison of index construction time of all methods as dimension is varied (data size 10K, dimension 6 to 8)). Figure 9 shows the comparison of the index construction time of the proposed UBLayer construction methods and the convex hull as the dimension is varied from 6 to 8. The experiment was limited to 6 to 8 dimensions because the convex hull experiment with more than 9 dimensions is impossible. The results show that the proposed methods generate indices about 70 to 396 times faster than the convex hull algorithm.
(a) The comparison of index construction time
(b) Representation in log scale
Experiment 6 (the comparison of the number of total layers of all methods as dimension is varied (data size 10K, dimension 6 to 8)). Figure 10 shows the comparison of the total number of layers of the two UBLayer construction methods and the convex hull as the dimension is varied from 6 to 8. As before, we limit the experiment to 6 to 8 dimensions because the convex hull experiment with more than 9 dimensions is impossible. The proposed methods build 3.6 to 27 times more layers than the convex hull algorithm. Top-k query processing reads the layers one by one. Therefore, UBLayer is more efficient for query processing, because a large total number of layers means that the number of data to be read during query processing is small.
Experiment 7 (the comparison of accuracy of UBLayer methods as dimension is varied (data size 10K, dimension 6 to 8)). Figure 11 shows the comparison of the accuracy of the two UBLayer construction methods as the dimension is varied from 6 to 8. The accuracy is 100% if all of the data in the first layer of the constructed convex hull are included. The proposed methods show 50% accuracy on average. However, the accuracy of the proposed methods becomes higher as the dimension increases. Therefore, the proposed methods provide more accurate results on high-dimensional data.
6. Conclusion
In this paper, we have proposed UBLayer, which significantly reduces the index building time and increases the number of layers compared with the convex hull. The proposed method first divides the dimension of the input data and constructs a divided-convex hull in each divided-dimensional dataset. It then combines the divided-convex hulls to build the final UBLayer.
We have performed experiments on synthetic datasets while varying the data size and the dimension. The experimental results show that the proposed method builds an index quickly over high-dimensional data for which the convex hull could not be constructed. UBLayer is also more efficient for top-k query processing because it has more layers than the convex hull.
However, the proposed method has some limitations. First, an optimized query processing algorithm still needs to be studied, because UBLayer is only a method for constructing an index. Second, although the index construction time is greatly reduced, some of the correct answer data is missing compared to the convex hull.
As future work, we will study algorithms that address these limitations. We will first improve the index building time of the proposed method by dividing the dimension hierarchically. Second, we will improve the precision of UBLayer. Third, we will analyze the time complexity of our method and compare it with that of existing methods. Moreover, we will study an efficient top-k query processing method with UBLayer.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korean Government (MSIT) (no. R7120171007, SIAT CCTV Cloud Platform).