Abstract

A skyline query computes all the “best” elements which are not dominated by any other element and is thus very important for decision-making applications. Recently, it has been generalized to the skyband query: a $k$-skyband query returns those elements dominated by no more than $k$ other elements. To incorporate the skyband operator into the stream engine for monitoring skybands over sliding windows, space usage estimation for the skyband operator becomes a critical issue in the query optimizer. In this paper, we first introduce the skyband sketch as the cost model. Based on the cost model, we propose an approach for estimating the space usage of the skyband operator over sliding windows of data streams under the assumptions of statistical independence across dimensions, no duplicate values over each dimension, and totally ordered dimension domains. Experiments verify that our approaches can estimate the space usage effectively over arbitrarily distributed data. To the best of our knowledge, this is the first work that attempts to address the issue and proposes effective approaches to solve it.

1. Introduction

Skyline queries [1] are very important for multicriteria decision-making applications, as the queries return all the “best” elements which are not dominated by any other element. However, skyline queries may eliminate elements which are valuable but dominated by a few other elements, because the dimensions commonly cannot cover all of a user's considerations. Therefore, Papadias et al. [2] generalized the skyline to the skyband: a $k$-skyband query returns all the elements which are dominated by no more than $k$ other elements.

Using the common hotel example in the literature, assume that each hotel is described by its distance from the beach and its price, and that one prefers hotels which are cheap and close to the beach. Figure 1 demonstrates the difference between the skyline (the 0-skyband) and the 1-skyband: three hotels are returned by the skyline query, and four additional hotels are returned by the 1-skyband query because each of them is dominated by only one other element. Buchta [3] showed that the expected number of the skyline elements in a $d$-dimensional space which contains $n$ elements is $O((\ln n)^{d-1}/(d-1)!)$; therefore, low-dimensional skyline queries commonly return a small number of skyline elements to the user, and some valuable elements may be eliminated. The reason is that each element has a high probability of being dominated by other elements in a low-dimensional space. Skyband queries can return the elements which are valuable but dominated by a few other elements and hence are widely used by decision-making applications in low-dimensional spaces.

Recently, the database research community has witnessed a paradigm shift to continuous queries, and much attention has been paid to sliding-window skyline queries [4, 5] in the stream environment. However, the issue of space usage estimation, which is very important for extending the query optimizer's cost model to accommodate skyline queries in the stream engine, is still left untouched. In this paper, we propose some effective approaches to estimate the space usage of sliding-window skyband queries. Since the skyline query is a special case of skyband queries, our proposed approaches can be naturally applied to sliding-window skyline queries as well.

Monitoring sliding-window skybands requires extracting all skyband elements from the live elements in the window and continuously reporting skyband changes as the window slides. In this paper, we first introduce the skyband sketch as the cost model and present effective policies for the sketch maintenance. The skyband sketch is space efficient because it only stores the skyband elements along with the potential-skyband elements, which do not belong to the skyband currently but are not guaranteed to be excluded from the skyband in their remaining lifespan. Next, under the assumption of statistical independence across dimensions, which is commonly used by query optimizers, and the assumptions that no duplicate values exist over each dimension and that domains are all totally ordered, we propose an approach for estimating the space usage of monitoring skybands over sliding windows. An experimental study verifies that our approaches can estimate the space usage effectively over arbitrarily distributed data. To the best of our knowledge, this is the first work that attempts to address the issue of space estimation and proposes effective approaches to solve it.

The rest of this paper is organized as follows. Section 2 summarizes the related work; Section 3 introduces some preliminary knowledge; Section 4 details our approaches for estimating the space usage; experimental results are given in Section 5 and followed by our conclusions in Section 6.

2. Related Work

Many algorithms have been proposed for computing static skylines, including the non-index-based algorithms [1, 6, 7] and the index-based algorithms [8–10], where the index-based algorithms uniformly outperform the non-index-based algorithms. Skyline computation under certain conditions has also received much attention, including skyline computation with partially ordered domains [11] and low-cardinality domains [12], subspace skyline computation [13, 14], skyline cube maintenance [15–19], and skyline computation in the distributed environment [14, 20–23]. Some skyline variations have also been proposed, including the $k$-dominant skyline [24], the top-$k$ subspace skyline [25], the reverse skyline [26], the most representative skyline [27], the probabilistic skyline [28], and the skyband [2].

Under the assumptions of statistical independence across dimensions, no duplicate values over each dimension, and dimension domains being all totally ordered, the problem of estimating the number of the skyline elements, that is, the skyline cardinality, has been addressed in the works [3, 29, 30]. Chaudhuri et al. [31] relaxed the assumption of no duplicate values over each dimension by allowing two possible values (e.g., 0 and 1).

As stated before, continuous skyline queries over sliding windows in data streams [4, 5] have important applications such as environment monitoring and trend sensing. To accommodate the skyline operator in the stream processing engine, the issue of space usage estimation needs to be solved. Motivated by this need, under similar assumptions, we propose robust approaches to estimate the number of the skyband and potential-skyband elements over continuously distributed data.

3. Preliminaries

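In this section, we present some preliminary results that will be used in the next section. In addition, we describe a data structure called the skyband sketch. Theorem 3.1 characterizes the number of the elements in a finite set which satisfy exactly $m$ of $q$ given properties. It is based on the generalized form of the Inclusion-Exclusion Principle [32]. Similarly, Theorem 3.2 characterizes the number of the elements in a finite set which satisfy no more than $k$ of the properties; the theorem will be used for our theoretical analysis of the space usage in the next section.

Theorem 3.1. Suppose that $S$ is a finite set, $P_1, P_2, \ldots, P_q$ are properties, and $A_1, A_2, \ldots, A_q$ are subsets of $S$, where $A_i$ consists of all those elements in $S$ with property $P_i$. Let $N_m$ be the number of the elements in $S$ which satisfy exactly $m$ of the properties; it can be characterized as
$$N_m = \sum_{j=m}^{q} (-1)^{j-m} \binom{j}{m} W_j,$$
where $W_j$ is characterized as follows:
$$W_0 = |S|, \qquad W_j = \sum_{1 \le i_1 < i_2 < \cdots < i_j \le q} \left| A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_j} \right| \quad (1 \le j \le q).$$

Theorem 3.2. Suppose that $S$ is a finite set, $P_1, P_2, \ldots, P_q$ are properties, and $A_1, A_2, \ldots, A_q$ are subsets of $S$, where $A_i$ consists of all the elements in $S$ which satisfy $P_i$; the number of the elements in $S$ which satisfy no more than $k$ of the properties, that is, $\sum_{m=0}^{k} N_m$, can be characterized as
$$\sum_{m=0}^{k} N_m = \sum_{m=0}^{k} \sum_{j=m}^{q} (-1)^{j-m} \binom{j}{m} W_j,$$
where $W_j$ is the same as that in Theorem 3.1.

Proof. By Theorem 3.1, each $N_m$ can be characterized as
$$N_m = \sum_{j=m}^{q} (-1)^{j-m} \binom{j}{m} W_j,$$
and summing over $m = 0, 1, \ldots, k$ gives the stated expression. We have thus proved the theorem.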
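The two counting results can be checked by brute force on a small set. The following is a minimal Python sketch, assuming the standard generalized inclusion-exclusion form (the number of elements with exactly $m$ properties is the alternating sum of $\binom{j}{m}$ times the sum of all $j$-fold intersection sizes); the function names and the divisibility example are illustrative and do not come from the paper.

```python
from itertools import combinations
from math import comb

def intersection_sums(universe, subsets):
    """Return [W_0, ..., W_q]: W_j sums |A_i1 & ... & A_ij| over all j-subsets."""
    sums = []
    for j in range(len(subsets) + 1):
        total = 0
        for chosen in combinations(subsets, j):
            common = set(universe)          # empty choice gives the whole set
            for a in chosen:
                common &= a
            total += len(common)
        sums.append(total)
    return sums

def count_exactly(universe, subsets, m):
    """Number of elements that satisfy exactly m of the properties."""
    w = intersection_sums(universe, subsets)
    q = len(subsets)
    return sum((-1) ** (j - m) * comb(j, m) * w[j] for j in range(m, q + 1))

def count_at_most(universe, subsets, k):
    """Number of elements that satisfy no more than k of the properties."""
    return sum(count_exactly(universe, subsets, m) for m in range(k + 1))

# Illustrative properties on 1..30: divisible by 2, by 3, by 5.
universe = set(range(1, 31))
subsets = [{x for x in universe if x % p == 0} for p in (2, 3, 5)]
# Direct count of how many properties each element has, for comparison.
properties_per_element = [sum(x in a for a in subsets) for x in universe]

for m in range(4):
    print(m, count_exactly(universe, subsets, m), properties_per_element.count(m))
```

Each printed row compares the formula with a direct count; the two columns should agree for every $m$.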

In a $d$-dimensional space, for simplicity and without loss of generality, an element $e$ is said to dominate another element $e'$ if $e$ is smaller than or equal to $e'$ over each dimension and strictly smaller than $e'$ over at least one dimension; this is noted as $e \prec e'$. In a sliding window, if no more than $k$ other live elements dominate an element, the element is a $k$-skyband element; if an element is not a $k$-skyband element and no more than $k$ of the succeeding elements dominate it, the element is a potential-$k$-skyband element.
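The dominance relation and the static $k$-skyband can be illustrated with a short brute-force computation. This is a minimal Python sketch with quadratic cost, not the monitoring algorithm of the paper; the hotel coordinates are hypothetical (distance, price) pairs invented for the example, with smaller values preferred on both dimensions.

```python
def dominates(a, b):
    """True if a <= b on every dimension and a < b on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def k_skyband(points, k):
    """Points dominated by at most k other points (k = 0 is the skyline)."""
    result = []
    for i, p in enumerate(points):
        dominators = sum(
            1 for j, q in enumerate(points) if j != i and dominates(q, p)
        )
        if dominators <= k:
            result.append(p)
    return result

# Hypothetical hotels as (distance to beach, price).
hotels = [(1, 9), (2, 6), (4, 3), (3, 8), (5, 5), (6, 7), (8, 8)]
print("skyline  :", k_skyband(hotels, 0))
print("1-skyband:", k_skyband(hotels, 1))
```

In this example the 1-skyband adds the hotels that have exactly one dominator, which is the effect described for the skyline versus the 1-skyband.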

Now we are able to describe a data structure called the skyband sketch for keeping the $k$-skyband elements and the potential-$k$-skyband elements. The skyband sketch is a memory-resident synopsis. The potential-skyband elements are the elements which do not belong to the skyband currently but are not guaranteed to be excluded from the skyband in their remaining lifespan. Hence the skyband sketch is space efficient for monitoring skybands over sliding windows. The space usage in this paper is measured by the numbers of the skyband and the potential-skyband elements stored by the sketch.

Figure 2 shows the architecture of the skyband sketch; the sketch changes only when a new element arrives or a current skyband element expires. When a new element arrives, if no more than $k$ skyband elements dominate it, it appears to be a skyband element; otherwise, it is a potential-skyband element. If the new element appears to be a skyband element, all the skyband elements which are dominated by more than $k$ succeeding skyband elements and all the potential-skyband elements which are dominated by more than $k$ succeeding skyband and potential-skyband elements should be deleted, because they will remain dominated by those succeeding elements during their remaining lifespan; in addition, the skyband elements which are dominated by no more than $k$ succeeding skyband elements but are dominated by more than $k$ live skyband elements become potential-skyband elements. If the new element appears to be a potential-skyband element, all potential-skyband elements which are dominated by more than $k$ succeeding skyband and potential-skyband elements should be deleted. When a skyband element expires, all the potential-skyband elements which are dominated by no more than $k$ skyband and potential-skyband elements become skyband elements. In this paper, since we focus on the problem of space usage estimation, we leave out the detailed implementation issues of the skyband algorithm.
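The maintenance policy can be sketched in a few lines. The following Python class is a simplified illustration, assuming a count-based window and a single list of entries. It keeps every live element that has at most $k$ succeeding dominators, and it recomputes the skyband on demand instead of moving elements between two sets on each event as the paper describes. The class and method names are hypothetical.

```python
def dominates(a, b):
    """True if a <= b on every dimension and a < b on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

class SkybandSketch:
    """Stores the live elements dominated by at most k succeeding elements."""

    def __init__(self, window, k):
        self.window = window   # number of live elements (count-based window)
        self.k = k
        self.seq = 0           # sequence number of the newest element
        self.entries = []      # [sequence number, point, succeeding dominators]

    def insert(self, point):
        self.seq += 1
        # Drop elements that have slid out of the window.
        self.entries = [e for e in self.entries if e[0] > self.seq - self.window]
        # The new element succeeds every stored element.
        for e in self.entries:
            if dominates(point, e[1]):
                e[2] += 1
        # More than k succeeding dominators: excluded for the rest of the lifespan.
        self.entries = [e for e in self.entries if e[2] <= self.k]
        self.entries.append([self.seq, point, 0])

    def size(self):
        """Space usage: skyband plus potential-skyband elements."""
        return len(self.entries)

    def skyband(self):
        """Stored elements dominated by at most k stored elements."""
        pts = [e[1] for e in self.entries]
        return [
            p for i, p in enumerate(pts)
            if sum(dominates(q, p) for j, q in enumerate(pts) if j != i) <= self.k
        ]

sketch = SkybandSketch(window=3, k=0)
for p in [(1, 1), (5, 5), (6, 6), (7, 7), (4, 4)]:
    sketch.insert(p)
    print(p, "size:", sketch.size(), "skyband:", sketch.skyband())
```

Counting dominators only among stored elements is enough to decide skyband membership here: if an element has more than $k$ live dominators, the $k+1$ most recent of them each have at most $k$ succeeding dominators and are therefore still stored.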

4. Space Usage Estimation

In this section, we present our robust approaches for estimating the space usage of sliding-window skybands under the assumption of statistical independence across dimensions based on the preliminary results in the previous section.

4.1. Distribution-Constrained Data

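Here, we give our theoretical analysis for the space usage of sliding-window skybands over data which is distribution constrained, that is, there are no duplicate values over each dimension. By mapping the problem of evaluating the number of the elements in a finite set which satisfy no more than $k$ of the properties to the problem of evaluating the probability that no more than $k$ other elements dominate an element, Lemma 4.1 gives the probability that at most $k$ other elements in a $d$-dimensional space dominate an element. Based on Lemma 4.1, Theorem 4.2 gives the expected number of the $k$-skyband elements in a sliding window which contains $n$ $d$-dimensional live elements.

Lemma 4.1. Suppose that $e, e_1, \ldots, e_{n-1}$ are $n$ elements in a $d$-dimensional space, under the assumptions of statistical independence across dimensions, no duplicate values over each dimension, and data domains being all totally ordered; let $\mathcal{E}$ be the event that no more than $k$ of the other elements dominate $e$; then the probability of $\mathcal{E}$, that is, $\Pr(\mathcal{E})$, can be characterized as
$$\Pr(\mathcal{E}) = \sum_{m=0}^{k} \sum_{j=m}^{n-1} (-1)^{j-m} \binom{j}{m} \binom{n-1}{j} \frac{1}{(j+1)^{d}}.$$

Proof. We map the finite set, the property $P_i$, and the subset $A_i$ in Theorem 3.2 to the full probability space, the property that $e_i$ dominates $e$, and the event $e_i \prec e$, respectively; the sum $W_j$ over all $j$-fold intersections is mapped to the sum of the probabilities that $j$ chosen elements all dominate $e$, which can be characterized as
$$W_j = \sum_{1 \le i_1 < \cdots < i_j \le n-1} \Pr\left(e_{i_1} \prec e, \ldots, e_{i_j} \prec e\right).$$
Under the assumptions of statistical independence across dimensions, no duplicate values over each dimension, and domains being all totally ordered, an element has a probability of $1/(j+1)^{d}$ of being dominated by all of $j$ other elements, because on each dimension it must hold the largest of $j+1$ values, which happens with probability $1/(j+1)$; therefore, $W_j$ can be further characterized as
$$W_j = \binom{n-1}{j} \frac{1}{(j+1)^{d}}.$$
By Theorem 3.2, $\Pr(\mathcal{E})$ can be characterized as
$$\Pr(\mathcal{E}) = \sum_{m=0}^{k} \sum_{j=m}^{n-1} (-1)^{j-m} \binom{j}{m} \binom{n-1}{j} \frac{1}{(j+1)^{d}}.$$
We have thus proved the lemma.

Theorem 4.2. Suppose that there are $n$ $d$-dimensional live elements in a sliding window; under the assumptions of statistical independence across dimensions, no duplicate values over each dimension, and dimension domains being all totally ordered, the expected number of the $k$-skyband elements, that is, $S(n, d, k)$, can be directly characterized as
$$S(n, d, k) = \sum_{m=0}^{k} \sum_{j=m}^{n-1} (-1)^{j-m} \binom{j}{m} \binom{n}{j+1} \frac{1}{(j+1)^{d-1}} \tag{4.5}$$
and can be recursively characterized as
$$S(n, d, k) = S(n-1, d, k) + \frac{1}{n} S(n, d-1, k) \tag{4.6}$$
with initial conditions
$$S(n, d, k) = n \ \text{ where } n \le k+1, \qquad S(n, 1, k) = k+1 \ \text{ where } n \ge k+1.$$

Proof. By Lemma 4.1 and the linearity of expectation, $S(n, d, k) = n \Pr(\mathcal{E})$ can be characterized as
$$S(n, d, k) = n \sum_{m=0}^{k} \sum_{j=m}^{n-1} (-1)^{j-m} \binom{j}{m} \binom{n-1}{j} \frac{1}{(j+1)^{d}} = \sum_{m=0}^{k} \sum_{j=m}^{n-1} (-1)^{j-m} \binom{j}{m} \binom{n}{j+1} \frac{1}{(j+1)^{d-1}},$$
since $n \binom{n-1}{j} = (j+1) \binom{n}{j+1}$. Because $\binom{n}{j+1} - \binom{n-1}{j+1} = \binom{n-1}{j} = \frac{j+1}{n} \binom{n}{j+1}$, $S(n, d, k)$ can further be recursively characterized as
$$S(n, d, k) = S(n-1, d, k) + \frac{1}{n} S(n, d-1, k)$$
with initial conditions
$$S(n, d, k) = n \ \text{ where } n \le k+1, \qquad S(n, 1, k) = k+1 \ \text{ where } n \ge k+1.$$
We have thus proved the theorem.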
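The agreement between a direct (binomial) form and the recursive form of the expected $k$-skyband cardinality can be checked with exact rational arithmetic. The sketch below assumes the direct form $\sum_{m \le k} \sum_{j=m}^{n-1} (-1)^{j-m} \binom{j}{m} \binom{n}{j+1} / (j+1)^{d-1}$ and the recurrence $S(n,d,k) = S(n-1,d,k) + S(n,d-1,k)/n$ with $S = n$ for $n \le k+1$ and $S = k+1$ for $d = 1$, which is the recurrence used by Algorithm 1; the function names are illustrative.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb

def skyband_direct(n, d, k):
    """Expected k-skyband cardinality from the binomial double sum (exact)."""
    return sum(
        Fraction((-1) ** (j - m) * comb(j, m) * comb(n, j + 1), (j + 1) ** (d - 1))
        for m in range(k + 1)
        for j in range(m, n)
    )

@lru_cache(maxsize=None)
def skyband_recursive(n, d, k):
    """Expected k-skyband cardinality from the recurrence (exact, memoized)."""
    if n <= k + 1:          # every element is dominated by at most k others
        return Fraction(n)
    if d == 1:              # in one dimension the k+1 smallest values qualify
        return Fraction(k + 1)
    return skyband_recursive(n - 1, d, k) + skyband_recursive(n, d - 1, k) / n

for n, d, k in [(3, 2, 0), (10, 3, 1), (12, 4, 2)]:
    print(n, d, k, skyband_direct(n, d, k), skyband_recursive(n, d, k))
```

For $k = 0$ and $d = 2$ both forms reduce to the harmonic number $H_n$, the classical expected size of a two-dimensional skyline.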

Theorem 4.3 shows that there exists an inherent correlation between the expected number of the skyband elements when monitoring a $(d+1)$-dimensional $k$-skyband over a sliding window which contains $n$ elements, denoted $S(n, d+1, k)$, and the expected number of the elements stored by the skyband sketch when monitoring a $d$-dimensional $k$-skyband over a sliding window which contains $n$ elements, denoted $M(n, d, k)$, that is, $S(n, d+1, k) = M(n, d, k)$. In addition, the expected number of the potential-skyband elements when monitoring a $d$-dimensional $k$-skyband over a sliding window which contains $n$ elements equals $S(n, d+1, k) - S(n, d, k)$. Therefore, by a minor revision, Theorem 4.2 can also be used to characterize the expected number of the potential-skyband elements.

Theorem 4.3. Under the assumptions of statistical independence across dimensions, no duplicate values over each dimension, and domains being all totally ordered, the expected number of the skyband elements when monitoring a $(d+1)$-dimensional $k$-skyband over a sliding window which contains $n$ live elements, that is, $S(n, d+1, k)$, equals the expected number of the elements stored by the skyband sketch when monitoring a $d$-dimensional $k$-skyband over a sliding window which contains $n$ live elements, that is, $M(n, d, k)$.

Proof. By Lemma 4.1, $S(n, d+1, k)$ can be characterized as
$$S(n, d+1, k) = \sum_{m=0}^{k} \sum_{j=m}^{n-1} (-1)^{j-m} \binom{j}{m} \binom{n}{j+1} \frac{1}{(j+1)^{d}}.$$
To see why the theorem holds, suppose $e_1, e_2, \ldots, e_n$ are the live elements in the sliding window, which are ascendingly ordered by the element sequence number, and $e_{i_1}, e_{i_2}, \ldots, e_{i_t}$, where $1 \le i_1 < i_2 < \cdots < i_t \le n$, are the elements stored by the skyband sketch for monitoring a $d$-dimensional $k$-skyband over the sliding window, that is, the live elements dominated by no more than $k$ succeeding elements. We map each live element $e_i$ into a $(d+1)$-dimensional element $e'_i$ whose additional coordinate is determined by the sequence number $i$ of the element, so that a later arrival is smaller on that coordinate; then $e'_{i_1}, e'_{i_2}, \ldots, e'_{i_t}$ are just the $k$-skyband elements in the $(d+1)$-dimensional space.
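The relation between the sketch size and the skyband cardinality in one more dimension can also be checked numerically. The sketch below rests on one assumption beyond the text: under the stated independence assumptions, the stored element that has $i-1$ successors is kept with probability $S(i,d,k)/i$, so the expected sketch size is the sum of these terms. The recurrence is the one encoded by Algorithm 1, and the function names are illustrative.

```python
from fractions import Fraction
from functools import lru_cache

@lru_cache(maxsize=None)
def expected_skyband(n, d, k):
    """S(n, d, k) via S(n,d,k) = S(n-1,d,k) + S(n,d-1,k)/n, computed exactly."""
    if n <= k + 1:
        return Fraction(n)
    if d == 1:
        return Fraction(k + 1)
    return expected_skyband(n - 1, d, k) + expected_skyband(n, d - 1, k) / n

def expected_sketch_size(n, d, k):
    """Sum over i of the chance that an element with i - 1 successors is kept."""
    return sum(expected_skyband(i, d, k) / i for i in range(1, n + 1))

def expected_potential(n, d, k):
    """Expected number of potential-skyband elements: sketch minus skyband."""
    return expected_skyband(n, d + 1, k) - expected_skyband(n, d, k)

n, d, k = 50, 3, 2
print(float(expected_sketch_size(n, d, k)), float(expected_skyband(n, d + 1, k)))
print(float(expected_potential(n, d, k)))
```

The two values on the first printed line coincide because the sum telescopes through the recurrence, which is the content of Theorem 4.3 in this form.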

4.2. A Dynamic Programming Algorithm

In this subsection, based on the theoretical analysis proposed in the above subsection, we propose an efficient dynamic programming algorithm to estimate the space usage. Since there exist inherent correlations among the expected number of the skyband elements, the expected number of the potential-skyband elements, and the expected number of the elements stored by the skyband sketch, we only consider how to estimate the number of the skyband elements.

Estimating the number of the skyband elements using (4.5) is infeasible in most cases because combination numbers are used to characterize the expected number of the skyband elements; for example, the number of different ways of selecting 50 elements from 100 different elements cannot be stored in a 64-bit integer. Based on (4.6), we can design a recursive algorithm to estimate the number of the skyband elements, which will not encounter integer overflow. The recursive algorithm can be characterized by a binary tree with a depth of at most $n + d - k - 2$, where $n$, $d$, and $k$ are the same as those in Theorem 4.2. Therefore, estimating the number of the skyband elements using the recursive algorithm has a computational complexity of $O(2^{n+d-k})$, which is unacceptable in most cases. In fact, the binary tree contains a large number of duplicate computations; therefore, if the duplicate computations are eliminated, the computational complexity can be reduced. Algorithm 1 is a nonrecursive algorithm for estimating the number of the skyband elements, which is based on (4.6) and in which all the duplicate computations are eliminated. The algorithm is a dynamic programming algorithm [33]: although it is based on a recurrence, it is nonrecursive, and each step of the algorithm gives an exact answer for the corresponding subproblem.
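Both obstacles mentioned above can be made concrete. The following Python sketch checks that $\binom{100}{50}$ exceeds the largest unsigned 64-bit integer and counts how many calls a plain recursive evaluation of the recurrence $S(n,d,k) = S(n-1,d,k) + S(n,d-1,k)/n$ would make, compared with the number of table updates a dynamic programming evaluation needs. The helper names are hypothetical.

```python
from math import comb

UINT64_MAX = 2**64 - 1

def naive_calls(n, d, k):
    """Number of invocations made by a plain recursive evaluation."""
    if n <= k + 1 or d == 1:      # initial conditions: one call, no recursion
        return 1
    return 1 + naive_calls(n - 1, d, k) + naive_calls(n, d - 1, k)

def dp_updates(n, d, k):
    """Number of entries a bottom-up evaluation computes for n > k + 1, d > 1."""
    return (n - k - 1) * (d - 1)

print("C(100, 50) fits in 64 bits:", comb(100, 50) <= UINT64_MAX)
print("naive calls for n=20, d=6, k=0:", naive_calls(20, 6, 0))
print("DP updates for n=20, d=6, k=0:", dp_updates(20, 6, 0))
```

Python integers are arbitrary precision, so the overflow itself does not occur here; the comparison only shows that the value would not fit in the fixed-width integer of a C or C++ implementation.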

Algorithm 1
Input: n: the number of the elements
       d: the number of the dimensions
       k: the k-skyband
Output: the expected skyband cardinality
begin
  if n ≤ k+1 then return n;
  if d = 1 then return k+1;
  for i = 1 to d do α[i] ← k+1;
  ξ ← 0;
  for i = k+2 to n do
    ξ ← (ξ+1) mod 2;
    if ξ = 1 then
      β[1] ← k+1;
      for j = 2 to d do β[j] ← β[j-1]/i + α[j];
    else
      α[1] ← k+1;
      for j = 2 to d do α[j] ← α[j-1]/i + β[j];
    end
  end
  if ξ = 0 then return α[d] else return β[d];
end

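Algorithm 1 functions as follows. First, two vectors $\alpha$ and $\beta$ with size $d$ are created, and the values of $\alpha[1], \ldots, \alpha[d]$ are initialized to $S(k+1, 1, k), \ldots, S(k+1, d, k)$, respectively. According to the initial conditions, we have $S(k+1, j, k) = k+1$; hence all the values of $\alpha$ are initialized to $k+1$. Then, we evaluate $S(k+2, 1, k), \ldots, S(k+2, d, k)$ and store the values to $\beta[1], \ldots, \beta[d]$, respectively. According to the initial conditions, we have $S(k+2, 1, k) = k+1$; hence $\beta[1]$ is set to $k+1$. According to the recurrence, we have $S(k+2, 2, k) = S(k+2, 1, k)/(k+2) + S(k+1, 2, k)$, that is, $\beta[2] = \beta[1]/(k+2) + \alpha[2]$; hence we can evaluate $S(k+2, 2, k)$ and store the value to $\beta[2]$. By the same principle, we may evaluate $S(k+2, 3, k), \ldots, S(k+2, d, k)$ sequentially and store the values to $\beta[3], \ldots, \beta[d]$. We may continue to evaluate $S(k+3, 1, k), \ldots, S(k+3, d, k)$ using the values in $\beta$ and store them to $\alpha[1], \ldots, \alpha[d]$, respectively, alternating between the two vectors until we evaluate $S(n, 1, k), \ldots, S(n, d, k)$ and store the values to $\alpha$ or $\beta$. At last, the value of $\alpha[d]$ or $\beta[d]$ is returned as the value of $S(n, d, k)$. The algorithm is space and time efficient, because the space complexity and the time complexity are $O(d)$ and $O(nd)$, respectively.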
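The dynamic programming evaluation can be written compactly. The following is a minimal Python sketch of the recurrence used by Algorithm 1; as a simplification it updates a single row in place from left to right instead of alternating between the two vectors of the pseudocode, which yields the same values because each entry only needs its left neighbour in the current row and its own value from the previous row. The function name is illustrative.

```python
def skyband_cardinality(n, d, k):
    """Expected number of k-skyband elements among n d-dimensional elements."""
    if n <= k + 1:
        return float(n)
    if d == 1:
        return float(k + 1)
    # row[j - 1] holds S(i, j, k); the row starts at i = k + 1, where S = k + 1.
    row = [float(k + 1)] * d
    for i in range(k + 2, n + 1):
        # row[0] stays k + 1 (one-dimensional case); update the rest in order.
        for j in range(1, d):
            row[j] = row[j - 1] / i + row[j]
    return row[-1]

for window in (500, 750, 1000):
    print(window, round(skyband_cardinality(window, 4, 1), 2))
```

The loop performs $(n-k-1)(d-1)$ floating-point updates and stores $d$ values, in line with the space and time bounds stated above.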

5. Experiments

In this section, we verify our theoretical results on space usage estimation of the $k$-skyband operator monitoring skybands over sliding windows in the stream environment by extensive experiments. The algorithms have been implemented in C++ and run on a 2.0 GHz Intel CPU with 2 GB of memory, and the data over each dimension is generated by the GNU Scientific Library (GSL: http://www.gnu.org/software/gsl). We test the space performance in a lower-dimensional (4-dimensional) and a higher-dimensional (8-dimensional) space, respectively. According to probability theory, if the data over a dimension is continuously distributed, the probability that there are duplicate values over the dimension is zero. Therefore, for each space, we generate a dataset in which the data over the first dimension and the data over the other dimensions are both normally distributed, each with its own parameters. The sliding-window size increases from 500 to 1000 in steps of 50; for each step, we compute the maximal, average, and minimal skyband sketch size, number of the skyband elements, and number of the potential-skyband elements while the sliding window moves over one million elements. Since there is no previous work that evaluates the space usage over continuous data, we compare our theoretical results with the corresponding experimental results.
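A scaled-down version of this kind of experiment can be sketched in Python. Everything specific in the block is an assumption made for illustration: standard normal data (mean 0, standard deviation 1) on every dimension drawn with Python's `random` module instead of GSL, a small window and stream, and a simplified sketch that stores every live element with at most $k$ succeeding dominators. The estimate uses the recurrence of Algorithm 1 evaluated in $d+1$ dimensions, following Theorem 4.3.

```python
import random

def dominates(a, b):
    """True if a <= b on every dimension and a < b on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def expected_skyband(n, d, k):
    """S(n, d, k) via S(n,d,k) = S(n-1,d,k) + S(n,d-1,k)/n, one row in place."""
    if n <= k + 1:
        return float(n)
    if d == 1:
        return float(k + 1)
    row = [float(k + 1)] * d
    for i in range(k + 2, n + 1):
        for j in range(1, d):
            row[j] = row[j - 1] / i + row[j]
    return row[-1]

def simulate_sketch_sizes(window, d, k, stream_len, seed=0):
    """Return (min, average, max) sketch size once the window is full."""
    rng = random.Random(seed)
    entries = []                       # [sequence number, point, succeeding dominators]
    sizes = []
    for seq in range(1, stream_len + 1):
        point = tuple(rng.gauss(0.0, 1.0) for _ in range(d))   # assumed N(0, 1)
        entries = [e for e in entries if e[0] > seq - window]  # expire old elements
        for e in entries:
            if dominates(point, e[1]):
                e[2] += 1
        entries = [e for e in entries if e[2] <= k]            # prune hopeless ones
        entries.append([seq, point, 0])
        if seq >= window:
            sizes.append(len(entries))
    return min(sizes), sum(sizes) / len(sizes), max(sizes)

if __name__ == "__main__":
    window, d, k = 100, 2, 1
    low, avg, high = simulate_sketch_sizes(window, d, k, stream_len=1500)
    print("measured min/avg/max:", low, round(avg, 2), high)
    print("estimated sketch size:", round(expected_skyband(window, d + 1, k), 2))
```

With independent continuous dimensions the measured average should stay close to the estimate, while the minimum and maximum show how far a single window can deviate from it.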

Figures 3 and 4 show the comparisons between the experimental results and the theoretical results for the 4-dimensional space and the 8-dimensional space. We can see that the experimental results are almost the same as the theoretical predictions. Moreover, the maximal values are less than twice the minimal values, and they are all close to the theoretical results. For a given parameter $k$ ($k$-skyband) and dimensionality $d$, both the actual space usage and the estimated space usage increase with the window size, as more objects need to be evaluated. At the same time, the skyband cardinality also increases when the value of the parameter $k$ increases. The comparison between the 4-dimensional space and the 8-dimensional space, as Figures 3(a) and 4(e) show, illustrates that the skyband sketch size in the high-dimensional space is much larger than that in the low-dimensional space when the window size and the parameter $k$ are given. This is because fewer elements are likely to be dominated by other objects in a high-dimensional space than in a low-dimensional space. As there are already sufficient skyline elements for users to make a decision in the higher-dimensional space, the skyband query is most useful in low-dimensional spaces.

6. Conclusions and Discussions

The skyband query is of great importance for multicriteria decision-making applications. To support skyband queries in the stream engine, the problem of effective space usage estimation must be solved, which is important for extending the query optimizer's cost model. In this paper, under the assumptions of statistical independence [34, 35] across dimensions, no duplicate values over each dimension, and dimension domains being all totally ordered, we propose effective methods to address this issue; since the skyline query is just a special case of skyband queries, our approaches apply to sliding-window skyline queries as well. We also put forward a dynamic programming algorithm to estimate the space usage, which is space and time efficient. In addition, if only the distribution function is given, a similar approach can be used to evaluate the skyband cardinality over a space in which there are duplicate values over some dimensions. Finally, we carried out extensive experiments which verified that our proposed approaches can estimate the space usage accurately and hence can be used to extend the optimizer's cost model for incorporating the skyband operator.

Acknowledgments

This work is partially supported by China “863” Hi-tech Program (Grant no. 2007AA01Z153), Zhejiang Provincial NSF (Grant no. Y1090096), and the National Natural Science Foundation of China (NSFC) under Grant no. 60573125 and 60873264.