Abstract
Skyline query is a typical multiobjective query and optimization problem, which aims to find the information in a multidimensional data set that all users may be interested in. Multiobjective optimization has been applied in many scientific fields, including engineering, economics, and logistics. It is needed whenever an optimal decision must be made by weighing two or more conflicting objectives. Examples include maximizing the service area without changing the number of express points and, for an existing distribution of business districts, finding the area or target point set whose target attributes best match the user’s interest. Group Skyline is a further extension of the traditional definition of Skyline. It considers not a single point but a group composed of multiple points, and these point groups should not be dominated by other point groups. In the business district example, a single target point that matches the user’s interest is not the focus; what the user wants is the overall optimality of all points in the whole target area. This paper focuses on how to efficiently solve the top-k group Skyline query problem. Firstly, based on the characteristic that points on low Skyline layers dominate points on high layers, a group Skyline ranking strategy based on the Skyline layers and the number of vertices in each layer is proposed, together with the corresponding SLGS algorithm. Secondly, a group Skyline ranking strategy based on vertex coverage is proposed, together with the corresponding VCGS algorithm and its optimized version VCGS+. Finally, experiments verify the effectiveness of the method in terms of query response time and the quality of the returned results.
1. Introduction
Skyline queries are also known as maxima or Pareto queries [1] (in business management, Pareto optimality means gaining without harming the interests of others). They are also a query optimization problem. The Skyline query was proposed by Borzsonyi et al. [2], who introduced it to the database domain at the 2001 ICDE conference. Since then, the Skyline query has attracted extensive attention from researchers at home and abroad and has become one of the most difficult and active topics in database research. It has many applications in multidimensional optimization analysis, such as choosing petrol stations and hotels in a road network, selecting players in social networks, and determining targets through multiple attribute information.
The Skyline has been extended in various ways in recent years and has become a research focus in the database domain. At present, there is still much research on single-point queries based on the traditional Skyline, such as the Skyline query on data streams [3] and on subspaces [4–8]. A Skyline query on a data stream must, as the stream tuples change dynamically, find for a given constrained query the nodes that fall into the valid area or affect the result tuple set. Such queries are often applied to intelligent transportation, online monitoring, and other fields. For massive high-dimensional data, the full-space Skyline query suffers from an overly large result set and low efficiency, so the subspace Skyline query has greater research significance. To reduce the size of the result set and return some representative Skyline points, variants of the Skyline definition have been proposed, such as the dominated Skyline [9, 10], the top-k Skyline query, and distance-based classical Skylines [11]. In many cases, what we search for is not a single point but a point group made up of s points. For example, in a road network query, people want to find adjacent malls that meet their demands. These malls form a cluster and are connected along a shopping route. In turn, hotels and entertainment venues can be located within the Skyline according to the distribution density of the shopping malls, which is usually called site selection analysis. Liu et al. [12] first extended the Skyline from single points to point groups and proposed corresponding algorithms for it. In practical applications, such objective optimization problems can also be applied to path optimization [13] to calculate the minimum cost path, to mobile trajectory tracking [14, 15] to look for similar trajectories, to social networks to find close communities, and to graph correlation [16] to obtain the correlation degree of the target point.
Top-k is a typical query problem in large-scale data processing and is widely used in everyday queries, such as the analysis and summary of the top 10 query words in search engines. Top-k is introduced into the group Skyline query so that each query returns the k best Skyline point groups according to a chosen measurement index, which reduces the burden of further selection. To solve the top-k group Skyline problem, this paper proposes several efficient algorithms. The main contributions of this paper are as follows.
(1) Motivated by practical application requirements, this paper introduces the top-k group Skyline query problem, analyzes it theoretically, and puts forward a criterion for evaluating the quality of a point group. Taking the number of vertices in each Skyline layer as the basis for sorting the results, the SLGS algorithm is proposed.
(2) To complement the ranking strategy based on Skyline layers, the concept of vertex coverage is proposed to handle result point groups that receive the same rank. To avoid arbitrary selection among them, the VCGS algorithm based on vertex coverage is proposed, which further ranks all result sets and returns the top-k point groups.
(3) To optimize the algorithm and improve efficiency, the VCGS+ algorithm is proposed. By pruning Skyline layers, the number of enumerated result sets and the number of redundant traversal operations are significantly reduced, and the efficiency of the algorithm is improved.
(4) Experiments on multiple real data sets are carried out, and the performance of the different methods is compared in terms of query response time and the quality of the returned results. The validity and accuracy of the proposed algorithms are verified.
2. Background Knowledge and Related Work
2.1. Background Knowledge
Definition 1. (Dominance). Given a set P which contains n data points in a d-dimensional space, let p and q be two different points in P, and let p[i] be the value of point p in dimension i for 1 ≤ i ≤ d. If p[i] ≤ q[i] in all dimensions and p[j] < q[j] in at least one dimension j, then p dominates q.
Definition 2. (Strict dominance). Given a set P which contains n data points in a d-dimensional space, let p and q be two different points in P. If p[i] < q[i] in all dimensions i for 1 ≤ i ≤ d, then p strictly dominates q.
Definition 3. (Group dominance). Given a set P which contains n data points in a d-dimensional space, let G and G' be two different point groups, each with s points of P. We say that the group G dominates G' if we can find two permutations of the s points of G and G', G = {p_1, ..., p_s} and G' = {p'_1, ..., p'_s}, such that p_i dominates p'_i for all i (1 ≤ i ≤ s), and p_i dominates p'_i strictly for at least one i.
Definition 4. (Skyline). Given a set P which contains n data points in a d-dimensional space, the Skyline is the set of points that are not dominated by other points in P.
Definition 5. (Group Skyline). Group Skyline is a set of point groups that are not dominated by other point groups.
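As an illustration of Definitions 1 to 5, the following minimal Python sketch checks dominance, computes the Skyline of a small point set, and tests group dominance by brute force over permutations. It assumes that smaller values are preferred in every dimension and reads Definition 3 literally (dominance for every matched pair, strict dominance for at least one pair). The function names and the sample points are hypothetical and are not taken from the paper.

```python
from itertools import permutations


def dominates(p, q):
    """Definition 1: p is no worse than q everywhere and better somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def strictly_dominates(p, q):
    """Definition 2: p is better than q in every dimension."""
    return all(a < b for a, b in zip(p, q))


def skyline(points):
    """Definition 4: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def group_dominates(g, h):
    """Definition 3, checked by trying every pairing of g with h."""
    if len(g) != len(h):
        raise ValueError("groups must have the same size")
    for perm in permutations(h):
        pairs = list(zip(g, perm))
        if all(dominates(p, q) for p, q in pairs) and any(
            strictly_dominates(p, q) for p, q in pairs
        ):
            return True
    return False


points = [(1, 5), (2, 2), (4, 1), (3, 3), (5, 5)]  # hypothetical data
sky = skyline(points)
```

The permutation search is factorial in the group size, so this sketch is only meant for checking small examples by hand.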
2.2. Related Work Analysis
This paper mainly focuses on how to obtain the top-k Skyline point groups. The top-k [12, 16–20] Skyline query is a common problem in large-scale data processing. The group Skyline query computes the set of point groups that are not dominated by other point groups on a given dataset. It is a further extension of the traditional Skyline query. Up to now, there has been little research on the group Skyline query; the group Skyline is proposed and studied in Refs. [21–24]. In recent years, the effectiveness of query results [25–27] has received more attention. To reduce the size of the query result set and return more representative Skyline points, variations of the Skyline definition such as the dominated Skyline [24], the representative Skyline [28], the top-k Skyline query, and the distance-based representative Skyline [29] have been proposed. Basic algorithms for the group Skyline query include PointWise [12], UnitWise [12], and UnitWise+ [12].
In Ref. [24], the definition of group Skyline is first proposed, and the definition of group domination depends on a certain aggregated point or a representative point in a point group. Although many aggregation functions, such as summation, minimum, and maximum, can be used to calculate aggregation points, finding all group Skyline sets is not easy.
The PointWise algorithm enumerates candidate group Skylines by dynamically generating a set enumeration tree that contains the candidate groups and pruning groups that are not group Skylines. Firstly, the directed Skyline graph is preprocessed and the redundant nodes are filtered out; then, the remaining points in the graph are enumerated. The pruning strategy is as follows: if a point group is found not to be a group Skyline during the enumeration, it need not be extended, and the subtree rooted at it can be pruned. Each candidate set corresponds to a set of extension points, from which some points can be filtered out to further reduce the enumeration. The verified point groups form the final group Skyline.
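The set-enumeration idea with subtree pruning can be sketched as follows. This is not the PointWise implementation from [12]; it is a simplified illustration that assumes a point group is a group Skyline exactly when it contains every dominator of each of its points, and that smaller values are preferred. The function names and the data are hypothetical.

```python
def dominates(p, q):
    """p is no worse than q everywhere and better somewhere (smaller is better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def enumerate_groups(points, size):
    """Depth-first set enumeration of point groups of the given size."""
    # Lexicographic order guarantees that every dominator precedes the
    # points it dominates, so parents always have smaller indices.
    pts = sorted(points)
    parents = [
        {j for j in range(len(pts)) if dominates(pts[j], pts[i])}
        for i in range(len(pts))
    ]
    results = []

    def extend(group, start):
        if len(group) == size:
            results.append([pts[i] for i in group])
            return
        chosen = set(group)
        for i in range(start, len(pts)):
            # Prune: a missing parent has a smaller index and can never be
            # added later, so no group in this subtree would be valid.
            if parents[i] <= chosen:
                extend(group + [i], i + 1)

    extend([], 0)
    return results


points = [(1, 5), (2, 2), (4, 1), (3, 3), (5, 5)]  # hypothetical data
groups = enumerate_groups(points, 2)
```

In this example the point (5, 5) has four dominators, so it can never appear in a group of size 2 and every subtree containing it is cut immediately.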
The UnitWise algorithm expands candidate groups by adding unit groups one by one. Similarly, the candidate groups are enumerated by dynamically generating a set enumeration tree that contains the candidate groups. Each node in the tree is a set of unit groups. At the same time, the candidate Skyline groups are listed while pruning as many useless point groups as possible. The pruning strategy is as follows: if a candidate point group G already contains at least the required number of points, then every candidate point group in G’s subtree exceeds the required size, so the subtree can be pruned, and some points in the set of extension points corresponding to the candidate point group can be filtered out. Because the algorithm is based on unit group expansion, it reduces the number of enumerations and is more efficient than PointWise.
The UnitWise+ algorithm is an improved version of UnitWise. To delete more point groups that are not group Skylines in advance, the algorithm first processes the high-level points in the Skyline layers, enumerates larger candidate point groups early, and also filters the set of extension points corresponding to each candidate point group to reduce its size. Moreover, depth-first traversal is used to check candidate point groups so that the algorithm can terminate early, which further narrows the query range and improves the effectiveness of the result set.
Although these algorithms reduce the size of the candidate sets and the number of enumerated point groups, the result set is still large when the data size, the dimension, and the size of the point groups increase. We need to extract some appropriate point groups to return as the result. To overcome this shortcoming, the SLGS algorithm based on Skyline layers, the VCGS algorithm based on vertex coverage, and the improved algorithm VCGS+ are proposed.
3. Query Algorithm Based on Skyline Layer
3.1. Ranking Strategy
Firstly, the characteristics of the result set are discussed, and the criteria to measure the quality of the result set are put forward. The following analysis is combined with an example. As shown in Example 3.1, Table 1 is a set of hotel data sets.
The Skyline layers of the point set are constructed as follows. Firstly, the 12 points are sorted in ascending order according to the attribute value distance (users can choose the attribute according to their preferences), and each point is processed sequentially. The first point in the sorted order is placed on the first layer, and then the next point is processed. If no point on the first layer dominates it, it also belongs to the first layer. If a point on the first layer dominates it, it starts the next Skyline layer. The remaining points are processed by analogy until all points have been placed. The results are shown in Figure 1.
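The layer construction described above can be sketched in Python as follows. The sketch assumes that smaller values are preferred in every dimension and scans the layers linearly instead of using any index structure. The function names and the sample points are hypothetical and do not reproduce Table 1.

```python
def dominates(p, q):
    """p is no worse than q everywhere and better somewhere (smaller is better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline_layers(points):
    """Assign each point to the first layer in which no point dominates it."""
    layers = []
    # Sorting by the first attribute (ties broken by the remaining ones)
    # ensures that every dominator is processed before the points it dominates.
    for p in sorted(points):
        for layer in layers:
            if not any(dominates(q, p) for q in layer):
                layer.append(p)
                break
        else:
            layers.append([p])  # dominated in every existing layer
    return layers


points = [(1, 5), (2, 2), (4, 1), (3, 3), (5, 5)]  # hypothetical data
layers = skyline_layers(points)
```

Here the first layer is the Skyline of the whole set, the second layer is the Skyline of what remains after removing the first layer, and so on.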
Based on the definition of Skyline layer, the directed Skyline graph of the point set is constructed. In Figure 1, the index values of the points are omitted (the index value of point p_1 is 0, the index value of point p_2 is 1, the index value of point p_3 is 2, and so on). According to the Skyline layers, the points in the directed Skyline graph are numbered sequentially from lower to higher levels, as shown in Figure 2.
According to the definition of Skyline layer, the points on the first layer are the Skyline points of the whole point set P, which dominate the points on the other layers; the points on the second layer are the Skyline points of the subset of P that excludes the points on the first layer; that is, the points on the second layer dominate the points on all layers other than the first and the second. By analogy, it can be concluded that points at lower levels dominate points at higher levels. According to the definition of point domination, the values of low-level points on some attributes are not worse than those of high-level points, and the value of a low-level point on at least one attribute is better than that of a high-level point. Therefore, the more points from the lower levels a point group contains, the better the point group.
The number of points from each Skyline layer in a Skyline point group is used to decide which groups are better or worse, so that the Skyline point groups can be sorted and the top-k Skyline point groups can be obtained. From the above analysis, it can be seen that the number of points from the lower Skyline layers in a Skyline point group is a key factor affecting the overall quality of the group. Thus, the following definitions about Skyline point groups are derived.
Definition 6. (G is better than G'). Given two point groups G and G' in group Skyline that have no dominance relationship with each other, let n_i and n'_i represent the number of points of G and G', respectively, on the i-th Skyline layer. If n_i = n'_i, then n_{i+1} and n'_{i+1} are compared, until the number of points from the j-th layer in G is greater than that in G', or n_i is equal to n'_i on all Skyline layers. If n_j > n'_j and n_i = n'_i for all i < j, then G is better than G'.
Definition 7. (G is equivalent to G'). Given two point groups G and G' in group Skyline that have no dominance relationship with each other, if n_i = n'_i for all i (0 ≤ i ≤ layers.size − 1), then G is equivalent to G'.
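Definitions 6 and 7 amount to a lexicographic comparison of per-layer point counts, which can be sketched as follows. The point labels, the `layer_of` mapping, and the function names are hypothetical; layers are indexed from 0 as in Definition 7.

```python
def layer_counts(group, layer_of, num_layers):
    """Number of points of the group on each Skyline layer, lowest layer first."""
    counts = [0] * num_layers
    for point in group:
        counts[layer_of[point]] += 1
    return tuple(counts)


def compare(g, h, layer_of, num_layers):
    """Return 1 if g is better, -1 if h is better, 0 if they are equivalent."""
    a = layer_counts(g, layer_of, num_layers)
    b = layer_counts(h, layer_of, num_layers)
    # Tuples compare lexicographically, so the lowest layer on which the
    # counts differ decides the outcome.
    return (a > b) - (a < b)


# Hypothetical layer assignment: a and b on layer 0, c and d on layer 1, e on layer 2.
layer_of = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
g1 = ["a", "b", "c", "e"]  # counts (2, 1, 1)
g2 = ["a", "c", "d", "e"]  # counts (1, 2, 1)
g3 = ["b", "a", "d", "e"]  # counts (2, 1, 1)
```

With these groups, `g1` is better than `g2` because it has more points on the lowest layer, while `g1` and `g3` are equivalent.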
3.2. Algorithmic Description
With the ranking strategy proposed above, the k optimal groups of Skyline points can be obtained by processing the result set calculated by the UnitWise+ algorithm, but that result set is unordered. The best point group may be located anywhere in the result set, which makes the user’s choice harder. Because the first result is not guaranteed to be the best, the whole result set needs to be sorted. Under the ranking strategy, some point groups are equivalent and cannot be distinguished during ranking; these are packaged into a block, so that different blocks separate different classes of equivalent groups. That is to say, after the result set has been processed, blocks of equivalent point groups are formed. The first block contains the best point groups, the second block contains the next best point groups, and the last block contains the worst point groups. The results are thus well organized and hierarchical. Based on this idea, the SLGS query algorithm based on Skyline layers is proposed.
The basic idea of the algorithm is as follows. Given the result point groups, each point group is traversed according to the ranking strategy proposed above, and equivalent point groups are collected into blocks; the blocks are then ordered by dynamically inserting each block at its corresponding position. Finally, the top-k point groups are extracted from the ordered block set. The following is a simple flow chart of the algorithm SLGS.
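The block-and-sort idea can also be sketched in a few lines of Python. This is a simplified illustration, not the authors' Algorithm 1: it groups equivalent point groups with a dictionary and sorts the block keys once, instead of using a tag array and dynamic block insertion. The group names, the `layer_of` mapping, and the function names are hypothetical.

```python
def layer_counts(group, layer_of, num_layers):
    """Number of points of the group on each Skyline layer, lowest layer first."""
    counts = [0] * num_layers
    for point in group:
        counts[layer_of[point]] += 1
    return tuple(counts)


def slgs(groups, layer_of, num_layers, k):
    """Partition groups into blocks of equivalent groups and return the top k."""
    blocks = {}
    for name, members in groups.items():
        key = layer_counts(members, layer_of, num_layers)
        blocks.setdefault(key, []).append(name)
    # Larger count tuples (more points on low layers) come first.
    ordered_blocks = [blocks[key] for key in sorted(blocks, reverse=True)]
    top_k = [name for block in ordered_blocks for name in block][:k]
    return ordered_blocks, top_k


# Hypothetical layer assignment and point groups of size 4.
layer_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 2}
groups = {
    "G1": ["a", "b", "d", "f"],  # counts (2, 1, 1)
    "G2": ["a", "d", "e", "f"],  # counts (1, 2, 1)
    "G3": ["a", "b", "c", "d"],  # counts (3, 1, 0)
    "G4": ["b", "c", "e", "f"],  # counts (2, 1, 1)
    "G5": ["b", "d", "e", "f"],  # counts (1, 2, 1)
}
ordered_blocks, top_3 = slgs(groups, layer_of, 3, 3)
```

Point groups inside a block keep their input order here, which is exactly the ambiguity that a secondary criterion such as vertex coverage is meant to resolve.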

Example 8. Given a set of hotel data, let (the size of the result point group). Based on the constructed directed Skyline graph, all group Skyline point groups can be enumerated, including , , , , , , , , , , , and n = {p_{12}, p_{9}, p_{11}, p_{10}}.
For the results set , if the user wants to select 5 optimal groups from 12 given result sets, that is, set , then the execution process of the algorithm is as follows. First, initializing the tag array mark , which means that all point groups are not accessed, and the block set is empty. Second, traversing point group , it is found that the number of points from the first, second, and third levels is 2, 1, and 1, respectively, and is equivalent to point group and . The corresponding position of tag array mark is assigned to 1, and the block composed of three point groups is added to the head of . Then, the next unvisited point group is processed, and the number of points from the first, second, and third layers is 1, 2, and 1, respectively, which is equivalent to point group . The corresponding position of tag array mark is assigned to 1. Point groups and form a block , which is inserted into . At this time, the block already exists in the set . Because the number of point groups from the lower layer is more than that from the upper layer, the midpoint group of is better than that of . Therefore, the block is inserted at the end of the . By analogy, point groups , , and are equivalent, and they are composed of the block . Because is better than and , is inserted into the head of set . Point group is equivalent to , , and , and they formed the block . By comparing the number of points from lower level, it is found that is better than and , but worse than ; so, should be inserted in front of . At this time, all point groups have been accessed, and a complete set of 4 blocks has been obtained. Given , because the number of point groups of the first block in is 3 and less than , selecting two point groups from the second block is needed. Finally, the five point groups of , , , , and are returned.
Assuming that there are elements and blocks in the result set , the time complexity of the algorithm SLGS involves two aspects: (1) traversing elements and realizing the block processing, whose time complexity is . (2) Finding the equivalence point group has to traverse elements again. At the same time, order each block in the set of blocks by the binary search method. This time, the time complexity is . Therefore, the overall time complexity of the algorithm SLGS is .
4. Query Algorithm Based on Vertex Cover
4.1. The Basic Idea
According to the directed Skyline graph in Figure 2, different points dominate different numbers of points on other layers, so the number of points that can be dominated by different point groups is also different. For example, for a given point group of size 4, the numbers of points dominated by its four points are 0, 5, 8, and 2, respectively, so the sum of points that this point group can dominate is 15. Similarly, the sum is 18 for another given point group. Obviously, the second point group is better than the first. This means that the total number of points dominated by the points of a group also affects the overall quality of the Skyline point group. Therefore, the concept of vertex coverage is proposed.
Definition 9. (Vertex coverage). Given a point group G = {p_1, ..., p_s} in group Skyline, let c_i represent the number of points in P dominated by p_i. Then c_1 + c_2 + ... + c_s is the number of vertex covers of G, which we call the vertex coverage of G and denote by VC(G).
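As a minimal sketch of Definition 9, the vertex coverage of a group can be computed by summing, over its points, the number of points each one dominates; this matches the arithmetic 0 + 5 + 8 + 2 = 15 in the text, where per-point counts are simply added. The sketch assumes that smaller values are preferred, and the function names and data are hypothetical.

```python
def dominates(p, q):
    """p is no worse than q everywhere and better somewhere (smaller is better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def dominated_count(p, points):
    """Number of points in the data set that p dominates."""
    return sum(1 for q in points if dominates(p, q))


def vertex_coverage(group, points):
    """VC(G): sum of the per-point dominated counts (shared points count twice)."""
    return sum(dominated_count(p, points) for p in group)


points = [(1, 5), (2, 2), (4, 1), (3, 3), (5, 5)]  # hypothetical data
vc_first = vertex_coverage([(1, 5), (2, 2)], points)
vc_second = vertex_coverage([(1, 5), (4, 1)], points)
```

Under this measure the first group would be preferred, since its points dominate more points in total.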
According to the ranking strategy of Section 3.1, the characteristics of a top-k Skyline point group are as follows:
(1) The point group contains more points from the low Skyline layers.
(2) The vertex coverage of the point group is larger.
Through the above analysis, the accurate top-k groups can be obtained by further sorting the block results of SLGS, and the corresponding VCGS (Vertex Coverage Group Skyline) algorithm is proposed. The basic idea is as follows. Given the result point groups, first run the SLGS algorithm, which uses the Skyline layers and the number of vertices in each layer as the basis of ranking, to obtain a result composed of blocks. Then traverse the point groups in each block. During the traversal, the point groups in each block are compared by the size of their vertex coverage, and the equivalent point groups are reordered accordingly. After every block has been processed, the top-k point groups can be extracted. The algorithm VCGS is shown in Algorithm 2.

Example 10. Given the result set returned by example 8 is , let array store the number of points of each vertex dominants. By traversing in , we can get of each point group in per partition . of the , , and is 15,18, and 17, respectively. Because is better than , and is better than , block is reordered as . In the same way, , _{,} and are reordered as , , and . Now, . if , the result will return . if , then the result is , not like example 8.
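The reordering step of the example can be sketched as follows. This is an illustration rather than the authors' Algorithm 2: it takes the blocks produced by a layer-based ranking and sorts each block by vertex coverage in descending order. The group names are hypothetical; the values 15, 18, and 17 are the ones quoted in the example, while the values for the second block are made up for illustration.

```python
def vcgs(blocks, vc, k):
    """Reorder each block by descending vertex coverage and return the top k."""
    reordered = [
        sorted(block, key=lambda name: vc[name], reverse=True) for block in blocks
    ]
    ranked = [name for block in reordered for name in block]
    return reordered, ranked[:k]


# Blocks as produced by a layer-based ranking (best block first).
blocks = [["Ga", "Gb", "Gc"], ["Gd", "Ge"]]
# Vertex coverage per group; Gd and Ge are hypothetical values.
vc = {"Ga": 15, "Gb": 18, "Gc": 17, "Gd": 9, "Ge": 12}

reordered, top_2 = vcgs(blocks, vc, 2)
```

Because the block order is never changed, the layer-based ranking stays the primary criterion and vertex coverage only breaks ties inside a block.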
4.2. Algorithm Optimization
The algorithm is very sensitive to the size of the block set and the number of elements in each block. This is mainly because both values become very large when the size of the data set, the number of dimensions, and the size of the point groups increase, so it takes more time to traverse these elements. In fact, it is not necessary to calculate the whole result set. The algorithm UnitWise^{+} enumerates the points on all Skyline layers. Suppose instead that point groups of the required size are first enumerated only from the Skyline points on the first layer. If the number of enumerated point groups is greater than or equal to k, that is, the point groups generated from the points on the first layer are enough to find the k optimal results, then the Skyline layers higher than the first layer can be pruned, and the point groups composed of points on the first layer can be sorted directly. If the number is less than k, indicating that the point groups on the first layer are not enough to find k results, the second layer is added and the Skyline points on the first two layers are enumerated. If the enumeration number is now greater than or equal to k, the Skyline layers higher than the second layer can be pruned, the points on the first two layers are enumerated, and the result point groups are sorted to find the k optimal point groups, and so on. In this way, the stopping condition can be checked early during the execution of UnitWise^{+}, and invalid point groups are not enumerated. Based on the above analysis, the optimized algorithm VCGS^{+} is introduced and shown in Algorithm 3.
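The layer-pruning test can be sketched as follows. This is a hedged illustration, not the authors' Algorithm 3: `count_groups` stands for whatever enumeration routine reports how many point groups can be formed from the first t layers, and the symbols m (points on the first layer) and s (group size) as well as all numbers are hypothetical.

```python
from math import comb


def layers_to_keep(num_layers, k, count_groups):
    """Smallest number of leading layers whose enumeration yields at least k groups."""
    for t in range(1, num_layers + 1):
        if count_groups(t) >= k:
            return t
    return num_layers  # fewer than k groups exist, so nothing can be pruned


def first_layer_groups(m, s):
    """Number of ways to select s of the m points on the first layer."""
    return comb(m, s)


# Hypothetical numbers of enumerated point groups using the first t layers.
counts = {1: first_layer_groups(4, 3), 2: 9, 3: 20}

kept_for_k4 = layers_to_keep(3, 4, counts.get)
kept_for_k5 = layers_to_keep(3, 5, counts.get)
kept_for_k50 = layers_to_keep(3, 50, counts.get)
```

With four points on the first layer and groups of size 3, four groups are available, so a query with k = 4 never needs to look past the first layer, while k = 5 requires the second layer as well.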

4.3. Comparison and Analysis of Three Algorithms
Algorithm 1 suffers from too much enumeration and computation, and within a block of equivalent point groups it cannot decide which point group is better. Algorithm 2 can further distinguish better and worse point groups within a block, but it still requires a large amount of calculation, and many useless point groups take part in the calculation. Therefore, Algorithm 3 is optimized in three aspects: enumeration, pruning strategy, and the selection among equivalent point groups.
5. Experiment and Result Analysis
5.1. Experimental Environment
The hardware and software platform used in the experiments is an Intel Pentium CPU with a clock frequency of 2.9 GHz, a 1 TB hard disk, 4 GB of RAM, and the 64-bit Windows 7 Professional OS. All algorithms are implemented in C++ using Microsoft Visual Studio 2010.
In the experiments, four parameters are varied: the dimension of the data set, the size of the data set, the size of the required point groups, and the number k of best groups returned.
5.2. The Data Set and Evaluation Criteria
The experiments use two real datasets: NBA (http://stats.nba.com/leaders/alltime/?ls=iref:nba:gnav) player statistics and NHL (http://www.nhl.com/player/) player statistics. The description of each data set is shown in Table 2. The experiments test and compare the effects of the dimension, the data set size, the size of the point groups, and the number of returned results. The specific settings are shown in Table 3.
Dimension of the data set: the number of attributes contained in the target point set.
Data set size: the number of target points.
Size of the point group: the number of target points contained in each point group.
Number of optimal point groups returned (k): the number of elements in the result set returned to the user.
5.3. The Performance Comparison and Analysis
5.3.1. The Influence of the Size of the Point Group
As can be seen from Figure 3, the execution time of each algorithm increases with the size of the point group. When the point group is small, all three algorithms run very fast, and SLGS and VCGS have similar execution times. VCGS takes slightly longer than SLGS because it further sorts the point groups on top of the results computed by SLGS. As the point group size increases, the execution time of SLGS and VCGS grows exponentially, because the number of points on the first few Skyline layers increases sharply, which increases the number of Skyline point groups and the data scale. The improved VCGS^{+} algorithm, however, runs faster than the other two algorithms, and in the best case it is about 10 times faster.
As can be seen from Figure 4, the number of point groups enumerated by the three algorithms increases with the size of the point group. When the point group is small, the enumeration results of the three algorithms differ little, and the enumeration number of VCGS^{+} is only slightly smaller than that of the other two algorithms. Moreover, because VCGS only further ranks equivalent point groups based on the SLGS results, the enumeration results of these two algorithms are equal. As the point group size increases, the enumeration number of SLGS and VCGS grows substantially. By pruning Skyline layers, the enumeration result of VCGS^{+} is less affected by the point group size.
5.4. The Influence of Data Dimensions
Figures 5 and 6 show that, on the two different datasets, the running time of the algorithms increases and the efficiency decreases rapidly as the data dimension grows. When the dimension of the NBA dataset rises to 5 and that of the NHL dataset rises to 7, the impact on SLGS and VCGS is more severe. The reason is that, as the dimension increases, the number of Skyline points on each Skyline layer increases dramatically. These two algorithms then need more time to calculate the groups of Skyline points, so their efficiency becomes lower. Compared with the other two algorithms, VCGS^{+} is more efficient and performs better on the NHL dataset.
5.5. The Influence of Dataset’s Size
Figures 7 and 8 show that the performance of the algorithms is relatively stable as the data set size increases, so the influence of the data set size is not pronounced. The running time of the algorithms increases linearly with the data set size. The main reason is that only the points on the first few Skyline layers are used when computing the group Skyline, and the number of these points is much smaller than the size of the data set. For different data sets, the impact of the data size on the overall algorithm is different. The running time of VCGS^{+} is less than that of SLGS and VCGS, and its advantage is more pronounced on the NHL dataset. The pruning strategy of this algorithm improves its efficiency by a factor of two to three.
5.6. The Influence of the Number of Point Groups Returned
Figures 9 and 10 show that, as the number k of point groups in the result set grows, the efficiency of the three algorithms changes steadily and linearly. The number of enumerated result point groups is strongly affected by the size of the point groups, the data dimension, and the data set size, but it is independent of k. This can also be explained directly from the time complexity of the algorithms.
6. Conclusions
Aiming at the problems of large result sets and low query efficiency in existing group Skyline query algorithms, the following results are obtained.
(1) To address the large result sets and the large number of meaningless result point groups produced by existing Skyline algorithms, the top-k group Skyline query problem is formulated, and the SLGS algorithm based on Skyline layers is proposed to return the k optimal Skyline point groups. The algorithm exploits the structural property that high-level points are dominated by middle- and low-level points in the Skyline layers and gives a quantitative criterion for deciding which of two groups is better. Based on this criterion, the group Skyline results are ranked, and the top-ranked results are returned.
(2) To handle results that receive the same rank in the SLGS algorithm, a ranking strategy based on Skyline layers and vertex coverage is proposed. The vertex coverage of a point group is used as the basis of ranking, and results with the same rank are further processed. The corresponding VCGS algorithm is proposed to sort all the results, which makes the ranking more accurate. Because this algorithm adopts a traversal strategy, it is inefficient. To improve users’ satisfaction with the returned results, an improved algorithm VCGS+, which is based on VCGS, is proposed. This algorithm provides a pruning strategy on Skyline layers and avoids accessing most Skyline points. Only a few results need to be calculated to find the top-k groups of Skyline points, which reduces the number of enumerated results and the number of points that need to be traversed and thus improves the efficiency of the algorithm. The experimental results show that the algorithm can improve the efficiency by about ten times.
(3) The proposed algorithms are validated by experiments. The experimental results verify the effectiveness of the proposed method in terms of query response time and the quality of the returned results.
(4) To verify the effectiveness and feasibility of the algorithm, a simple test was carried out with the data of the 12 hotels in Table 1, and a questionnaire was developed, with 30 members of the research group and 30 family members in the laboratory as the interview targets. Firstly, the Skyline point groups of size 2 are listed; the results include 6 point groups. Table 4 shows the choices made by the target population.
The investigation results are consistent with the algorithm results, which supports the effectiveness of the proposed algorithm in practical applications. At the same time, this study can be applied to various site selection analyses, such as school district housing selection, the division of business districts, and the location of public facilities. It has high theoretical value for location-based services.
Data Availability
No data were used to support this study.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This research was partially supported by the grants from the Hebei Education Department Key Project (No. ZD2021037) and Hebei University of Environmental Engineering Key Project (No. 2020ZRZD03).