Abstract

Uncertainty is inherent in sensing data, so this paper investigates processing and optimization techniques for the Probabilistic Skyline (PS) query in wireless sensor networks (WSNs). An analysis of its properties shows that PS is not decomposable, so in-network aggregation techniques cannot be applied directly to improve performance. We propose an efficient algorithm, called Distributed Processing of Probabilistic Skyline (DPPS) query in WSNs. The algorithm divides the sensing data into candidate data (CD), irrelevant data (ID), and relevant data (RD). The ID on each sensor node can be filtered out directly to reduce the data transmission cost, because the PS result can be obtained correctly at the base station from CD and RD alone. Experimental results show that the proposed algorithm effectively reduces data transmission by filtering unnecessary data and can thereby prolong the lifetime of WSNs.

1. Introduction

Wireless sensor networks (WSNs) have an increasingly important impact on the way information is collected from the physical world and used. With the rapid development of microelectronics, communication, and embedded technologies, WSNs have attracted wide attention from industry and academia because of their commercial prospects and research value [1–3]. For example, forest fires can be prevented by monitoring temperature and humidity in real time. Influenced by many factors, such as hardware devices, sensor technology, communication quality, and the surrounding environment, the sensing data collected by sensor nodes are often inaccurate or of low confidence. That is to say, the temperature and humidity data acquired by sensor nodes are not exact. Because uncertainty is an inherent property of sensing data, sensing data are essentially uncertain data.

As one of the most important means of multiobjective decision making, skyline query [4–8] processing has attracted a large body of research, both in WSNs [9–16] and for uncertain data [17–26]. Consider a wireless sensor network that consists of a large number of sensor nodes deployed in a geographical region, where sensing data are collected by these distributed sensor nodes. Multiple sensor nodes may be deployed in certain zones to improve the precision of the uncertain data. As a result, many queries in WSNs that rarely need to transmit every piece of sensing data from the local sensor nodes have been well studied to reduce the communication cost and to speed up the computation [9–16], for instance, sliding window skylines in sensor networks [11, 12], continuous skyline monitoring in WSNs [10], probabilistic queries over uncertain data streams [18, 19], dynamic (or relative) skylines [25], and distributed uncertain skyline queries [26]. Nevertheless, most of these studies assume a centralized system setting.

In this paper, an efficient algorithm, called Distributed Processing of Probabilistic Skyline (DPPS) query in WSNs, is proposed. It addresses the problem of PS query processing in distributed WSNs in which alternative tuples exist. The basic idea is to perform data pruning and aggregation at each sensor node so that only the data required for final processing are transferred to the base station. We examine the effectiveness of DPPS by comparing its data communication cost with that of a Centralized Algorithm (CA), and we also perform sensitivity tests to observe the behavior of DPPS under various parameter settings. The results validate our ideas and show the superiority of our proposal.

In summary, the contributions of this paper are as follows:
(i) The properties of PS are analyzed, and we prove theoretically that the PS query is not decomposable.
(ii) An efficient algorithm, called Distributed Processing of Probabilistic Skyline (DPPS) query in WSNs, is proposed, which reduces the amount of in-network data transmission by filtering the irrelevant data on the sensor nodes.
(iii) The experimental results show that DPPS transmits less data in WSNs than CA.

The rest of this paper is organized as follows. Related work is introduced briefly in Section 2. Section 3 introduces the important notions and theorems. Section 4 describes DPPS in detail. Section 5 evaluates the performance of DPPS. Finally, Section 6 concludes the paper.

2. Related Work

Here, we review representative work in the areas of (1) skyline query processing in WSNs and (2) skyline query processing on uncertain data.

Skyline Query Processing in Sensor Networks. A large number of research works in this area have appeared in the literature [9–16]. Because sensor nodes have a limited energy budget, the primary issue is how to develop energy-efficient techniques that reduce communication and energy costs in the network. Wang et al. [9] analyzed the properties of the reverse skyline query and presented a skyband-based approach to answer reverse skyline queries efficiently over WSNs. Chen et al. [10] addressed the problem of continuous skyline monitoring in WSNs and presented a hierarchical threshold-based approach, MINMAX, to minimize the transmission traffic. Two papers [11, 12] investigated sliding window skylines in sensor networks. The former put forward an energy-efficient algorithm, SWSMA, to continuously maintain sliding window skylines over a wireless sensor network; the algorithm employs a tuple filter or a grid filter within each sensor to reduce the amount of data to transmit and thus save energy. The latter proposed a method, EES, which uses a mapping function to map the data into a smaller range of integers and computes the skyline of the mapped set as the mapped skyline filter (MSF). Chen et al. [13] partitioned the entire data set into disjoint subsets and returned the skyline points progressively by examining the subsets one by one. In addition, a global filter consisting of some skyline points found in the processed subsets is used to filter out unlikely skyline points from the remaining subsets before transmission. Shen et al. [14] studied location-based skyline queries in WSNs and proposed an energy-efficient approach, Ring-Skyline (RS), which divides the monitoring area into several rings and adopts in-network query processing to reduce energy consumption. In [15], Xin et al. proposed an energy-efficient multiskyline evaluation (EMSE) algorithm to evaluate multiple skyline queries effectively in WSNs. EMSE utilizes both global and local optimization mechanisms to eliminate unnecessary data transmission. In [16], a new filter-based method, called SKYFILTER, was proposed for skyline query processing. The method improves efficiency by reducing the total wireless communication between sensor nodes.

Skyline Query Processing on Uncertain Data. In [17], bottom-up and top-down algorithms are put forward to process $p$-skyline queries; a $p$-skyline contains all the objects whose skyline probabilities are at least $p$. Unqualified objects can be filtered efficiently with the help of the grid-based space division algorithm and the weight-counting algorithm. The works in [18, 19] investigated the PS query over uncertain data streams. The former proposed an approach, candidate list, to compute a PS on a large number of uncertain tuples within the sliding window, and the latter studied the problem of efficiently computing the skyline over sliding windows on uncertain data elements against probability thresholds. The all-skyline query problem over discrete uncertain data sets was first studied in [20], in which a space splitting algorithm and a dominating counting algorithm were proposed. In [21], Böhm et al. modeled the uncertainty with probability density functions (pdfs) and investigated the skyline query over pdf-modeled uncertain data. Additionally, in [22], the objects are indexed with the Gauss-tree in the parameter space to improve the pruning efficiency, where the leaf nodes store the objects with expectation and variance. Ding and Jin [23] first addressed the distributed uncertain skyline query problem and proposed the DSUD and e-DSUD algorithms to process queries over tuple-level uncertain data, in which the uncertain tuples are independent of each other. For skyline computation in highly distributed environments, Hose and Vlachou [24] provide a good survey of existing approaches, in which uncertain skyline queries and open research directions are discussed. The reverse skyline query over an uncertain database retrieves all the uncertain objects whose dynamic (or relative) skylines [25] contain a user-specified query object with a probability not less than a user-specified threshold. In [26], skyline probability computation over uncertain preferences is shown to be #P-complete, and efficient exact and approximate algorithms are proposed to tackle this problem.

In contrast to our work, these studies either ignore the uncertainty of sensing data or do not consider the particularities of the wireless sensor network environment. None of them solves the PS query processing problem effectively in WSNs.

3. Preliminaries

3.1. Problem Statement

In this section, some important concepts are defined; also, some theorems are proved to be true. The variable is the threshold of the Probabilistic Skyline and the meanings of frequently used symbols are listed in Table 1.

Consider a WSN that consists of a lot of sensor nodes deployed in a geographical region. Feature readings (e.g., temperature and humidity) are collected from these distributed sensor nodes. Multiple sensors are deployed at certain zones in order to improve monitoring quality. Figure 1 shows a wireless sensor network (with a two-tier hierarchical topology) that monitors forest temperature and humidity in different zones (denoted as different color). In this network, sensor nodes are grouped into clusters, where cluster heads are responsible for local processing and for reporting aggregated results to the base station. As shown, and denote the cluster heads for clusters A and B, correspondingly.

A table is shown in Figure 1, representing a snapshot of the temperature and humidity records collected from the sensor network. As shown, each tuple records a possible temperature and humidity for a location. The confidence value associated with a tuple indicates the existence probability of that particular temperature and humidity reading. For example, two data tuples are generated for Location A. The temperature and humidity in these two tuples are both valid (i.e., with measured confidences).

Definition 1 (possible world semantics [23]). We use $D$ to denote a $d$-dimensional space and $U$ to denote the universal set of all uncertain tuples in the $d$-dimensional space $D$. Each tuple $t \in U$ has a probability $P(t)$ ($0 < P(t) \le 1$) to occur, and $t.i$ ($1 \le i \le d$) denotes its $i$th dimension value. The tuples that cannot exist at the same time are alternatives. A possible world is instantiated by taking a set of tuples from the alternative relation.

For example, uncertain tuples and in Figure 1 are alternatives. The various dimensions numerical values of and indicate the relevant information of the region A. Due to the property of alternative tuples, both of them may occur but cannot occur simultaneously.

The aggregate confidence of is the sum of the confidence values of all its alternative tuples; that is, . For instance, corresponding to location A, ; that is, and are alternative tuple instances (or simply called alternatives) of . Consider . In the same way, we can get that and . The probability of all possible worlds in is shown in Table 2.

Definition 2 (skyline). Given a set of uncertain tuples $S$ in the $d$-dimensional space $D$, a skyline query retrieves the tuples in $S$ that are not dominated by any other tuple. For two tuples $t$ and $t'$ in $S$, tuple $t$ dominates $t'$ (denoted as $t \prec t'$) if it is not worse than $t'$ in all dimensions ($\forall i$: $t.i \preceq t'.i$) and better than $t'$ in at least one ($\exists j$: $t.j \prec t'.j$). The probability that $t$ dominates $t'$ is $t$'s existing probability, denoted as $P(t)$.
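The dominance test of Definition 2 can be sketched in a few lines of Python. This is only an illustration under an assumption the text does not state, namely that smaller values are preferred in every dimension; the function names `dominates` and `skyline` are hypothetical, and tuple probabilities are ignored at this stage.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every dimension and better in at least one.

    Assumption for this sketch: smaller values are better in every dimension.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def skyline(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(o, p) for o in points)]


points = [(1, 4), (2, 2), (3, 3), (4, 1)]
result = skyline(points)  # (3, 3) is dominated by (2, 2) and drops out
```

A tuple never dominates itself, because the strict inequality fails in every dimension, so comparing a point against the full list is safe.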

Definition 3 (skyline probability). Given a set of uncertain tuples $S$ in the $d$-dimensional space $D$, the set of possible worlds based on $S$ is denoted in the form of $W = \{w_1, w_2, \ldots, w_n\}$. Assume that there exist an uncertain tuple $t \in S$ and a possible world subset $W' \subseteq W$ such that (1) for any possible world $w \in W'$, the uncertain tuple $t$ belongs to the skyline of $w$; that is, $t \in SKY(w)$; (2) for any possible world $w \in W - W'$, the uncertain tuple $t$ does not belong to the skyline of $w$; that is, $t \notin SKY(w)$. Then, the skyline probability of the uncertain tuple $t$ is the sum of the existential probabilities of all the possible worlds in the subset $W'$; that is to say, $P_{sky}(t) = \sum_{w \in W'} P(w)$. For example, .
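Definition 3 can be checked by brute force on small inputs. The Python sketch below enumerates every possible world and adds up the probabilities of the worlds in which a tuple is a skyline tuple. The data layout (a list of alternative groups, each a list of `(point, probability)` pairs), the smaller-is-better dominance rule, and the sample values are assumptions made for illustration; they are not taken from Figure 1 or Table 2. Points are assumed to be distinct so that they can serve as dictionary keys.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, ...]
Group = List[Tuple[Point, float]]  # alternatives: at most one member exists


def dominates(a: Point, b: Point) -> bool:
    # Assumption: smaller values are better in every dimension.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline_probabilities(groups: List[Group]) -> Dict[Point, float]:
    """Sum the probabilities of the worlds in which each tuple is in the skyline."""
    result = {point: 0.0 for group in groups for point, _ in group}
    choices = []
    for group in groups:
        options: List[Tuple[Optional[Point], float]] = list(group)
        absent = 1.0 - sum(prob for _, prob in group)
        if absent > 1e-12:  # the group may contribute no tuple at all
            options.append((None, absent))
        choices.append(options)
    for world in product(*choices):  # one choice per group is one possible world
        world_prob = 1.0
        for _, prob in world:
            world_prob *= prob
        present = [point for point, _ in world if point is not None]
        for point in present:
            if not any(dominates(other, point) for other in present):
                result[point] += world_prob
    return result


groups = [
    [((1, 4), 0.6), ((3, 3), 0.4)],  # two alternatives of one location
    [((2, 2), 0.5)],                 # a single reading of another location
]
probs = skyline_probabilities(groups)
```

The enumeration is exponential in the number of groups, so it is useful only as a reference against which a faster computation can be validated.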

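Assume that there exist an uncertain tuple $t$ and an alternative tuple set $T$ in the universal set $U$. If there exists $t' \in T$ that dominates $t$ ($t' \prec t$), we say that $T$ dominates $t$ ($T \prec t$). Then, the probability that $T$ dominates $t$ can be calculated as $P(T \prec t) = \sum_{t' \in T,\, t' \prec t} P(t')$. We use $\mathcal{T}_t$ to denote the set that is composed of all $T$ in $U$ that dominate $t$; that is, $\mathcal{T}_t = \{T \mid T \prec t\}$. Consequently, the skyline probability of the uncertain tuple $t$ is the product of the existent probability of $t$ and the nonexistent probability of $\mathcal{T}_t$; that is, $P_{sky}(t) = P(t) \times \prod_{T \in \mathcal{T}_t} (1 - P(T \prec t))$.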
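The product form described above avoids enumerating possible worlds. The following Python sketch computes it directly; the data layout, the smaller-is-better dominance rule, and the sample values are illustrative assumptions. The sketch also assumes that the alternatives in a tuple's own group are skipped, since they can never coexist with the tuple.

```python
from typing import List, Tuple

Point = Tuple[float, ...]
Group = List[Tuple[Point, float]]  # alternatives: at most one member exists


def dominates(a: Point, b: Point) -> bool:
    # Assumption: smaller values are better in every dimension.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline_probability(groups: List[Group], g_idx: int, t_idx: int) -> float:
    """Existence probability of the tuple times the probability that no
    dominating alternative from any other group exists."""
    point, prob = groups[g_idx][t_idx]
    result = prob
    for k, group in enumerate(groups):
        if k == g_idx:
            continue  # own alternatives are mutually exclusive with the tuple
        dominating_mass = sum(p for other, p in group if dominates(other, point))
        result *= 1.0 - dominating_mass
    return result


groups = [
    [((1, 4), 0.6), ((3, 3), 0.4)],  # two alternatives of one location
    [((2, 2), 0.5)],                 # a single reading of another location
]
p_33 = skyline_probability(groups, 0, 1)  # 0.4 * (1 - 0.5)
p_14 = skyline_probability(groups, 0, 0)  # 0.6, nothing dominates (1, 4)
p_22 = skyline_probability(groups, 1, 0)  # 0.5, nothing dominates (2, 2)
```

Groups are treated as independent of each other, which is what allows the per-group factors to be multiplied.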

Definition 4 (Probabilistic Skyline). Given a set of uncertain tuples $S$ in the $d$-dimensional space $D$ and a threshold value $q$, the Probabilistic Skyline of $S$ contains all the uncertain tuples in $S$ whose skyline probability is bigger than $q$, denoted as $PS(S, q) = \{t \in S \mid P_{sky}(t) > q\}$.

3.2. Property Analysis

Theorem 5. Probabilistic Skyline query is not a decomposable operator.

Proof of Theorem 5. We first let represent the fact that is better than and let represent the fact that is better than . Then, we assume that the set of uncertain tuples is depicted in Figure 2(a), and the threshold value is 0.3. We can know that , , , , and by Definition 2. Also, we have the result according to Definition 3. Now, let , , illustrated in Figure 2(b), and , shown in Figure 2(c). Similarly, it can be proved that and . Only by demonstrated in Figure 2(d), in whatever way, we cannot obtain the result that ; that is to say, . Thus, PS query is not a decomposable operator.

By Theorem 5, the PS query is not a decomposable operator; thus, we cannot directly improve the efficiency of PS queries in WSNs by using in-network computing techniques [11, 15].
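The effect stated by Theorem 5 can be reproduced with a small invented data set (it is not the example of Figure 2). The Python sketch below assumes independent tuples without alternatives, that smaller values are preferred in each dimension, and a threshold of 0.3; `ps` is a hypothetical helper name. It compares the PS of the union of two subsets with the result of naively merging the two local PS results.

```python
from typing import List, Tuple

Point = Tuple[float, ...]
UTuple = Tuple[Point, float]  # (point, existence probability)


def dominates(a: Point, b: Point) -> bool:
    # Assumption: smaller values are better in every dimension.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def ps(data: List[UTuple], q: float) -> List[UTuple]:
    """Tuples whose skyline probability within `data` is bigger than q."""
    out = []
    for point, prob in data:
        sky_prob = prob
        for other, other_prob in data:
            if dominates(other, point):
                sky_prob *= 1.0 - other_prob
        if sky_prob > q:
            out.append((point, prob))
    return out


q = 0.3
s1 = [((1, 1), 0.25)]  # data held by one node
s2 = [((2, 2), 0.35)]  # data held by another node

exact = ps(s1 + s2, q)                # PS over the union of the data
naive = ps(ps(s1, q) + ps(s2, q), q)  # PS over the merged local PS results
```

Here `exact` is empty, because (2, 2) has skyline probability 0.35 × 0.75 = 0.2625 once (1, 1) is taken into account, whereas `naive` still contains (2, 2): the tuple (1, 1) was discarded locally because its own probability 0.25 is below the threshold, yet it is needed to compute the probability of (2, 2).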

Next, we will further analyze the properties of the PS query.

Theorem 6. Given a set of uncertain tuples in the -dimensional space , a tuple and a threshold value . are the subset of which contains tuples collected on the th cluster, and one uses to denote the set that is composed of . Thus, does not belong to the skyline of when it satisfies the conditions as follows:

Proof of Theorem 6. This theorem can be proved by Definitions 2 and 3 directly.

Theorem 7. Given a set of uncertain tuples in the -dimensional space , a tuple , and a threshold value , then, should be excluded when it satisfies the conditions as follows:

Proof of Theorem 7. Since , , and , then it can be deduced that . Thus, and .
Only the skyline probability of the tuples dominated by will be affected if we delete . Suppose dominated by is a tuple in another sensor node which will possibly be interleaved with tuples in at the base station, and let indicate the skyline probability of . There are two possible cases to consider.
Case 1. itself forms a new because the tuples that dominate must dominate as well. Thus, it can be deduced that and will not be judged as the skyline tuple by mistake.
Case 2. is a member of an existed that does exist in named . Due to the mutual exclusiveness of tuple members in , may appear in a possible world if and only if no other members of coexist in this possible world. By formula (2), it can be proved that . Also, will not be judged as the skyline tuple by mistake.

Theorem 6 pointed out the tuples in the subset that must not belong to the skyline of clearly; that is, it pointed out the tuples that may be the skyline tuples of . Theorem 7 evidenced that we can delete the tuples in which will not affect the calculation of the skyline of . Not all the tuples which do not belong to can be deleted. The tuples that do not satisfy the conditions above will affect the calculation of skyline probability of other tuples, so we should hold them.

4. DPPS Algorithm

In this section, we propose the notions of candidate data, irrelevant data, and relevant data according to Theorems 6 and 7. Next, we take the PS query as a test case to derive the candidate data and relevant data while pruning the irrelevant data. Thus, irrelevant data tuples pruned in local sensor nodes will never appear in the final answer set.

Definition 8 (candidate data). In the sensing data subset on sensor node, the tuples which are candidate data (CD) of the Probabilistic Skyline query satisfy the conditions:

Definition 9 (irrelevant data). In the sensing data subset on sensor node, the tuples which are irrelevant data (ID) of the Probabilistic Skyline query satisfy the conditions:

Definition 10 (relevant data). In the sensing data subset on sensor node, the tuples which are relevant data (RD) of the Probabilistic Skyline query satisfy the conditions:

Algorithm 1 sketches the process of data aggregation, data classification, and ID filtering on sensor nodes. First, the algorithm merges all the data tuples sent by the child nodes; that is, it merges CD into the candidate data set and RD into the relevant data set (Lines 4–7). Second, the algorithm adds the local data tuple to the candidate data set (Line 8). Then, the skyline probability of each tuple in the candidate data set and the relevant data set is calculated, and the tuples are classified according to the definitions, removing ID and marking RD and CD (Lines 9–33). Finally, the partial relevant data set and candidate data set are submitted to the parent node (Line 34).

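// input: the message set of the child nodes, the local sensing data,
//  the threshold value
// output: the data set which will be submitted to the parent node
For each message in the message set of the child nodes Do
  merge the CD of the message into the candidate data set;
  merge the RD of the message into the relevant data set;
end For
add the local sensing data to the candidate data set;
For each tuple in the candidate data set Do
  initialize the cumulative probability variable;
  count the alternative tuple sets that dominate the tuple; // get the number of dominating sets
  find all the alternative tuple sets that dominate the tuple; // and all sets that dominate the tuple
  For each dominating alternative tuple set Do
   calculate its dominant probability;      // get the set's domination probability
  end For
  If the tuple is ID Then
   delete the tuple from the candidate data set;    // delete ID from CD set
  Else If the tuple is RD Then
   add the tuple to the relevant data set;   // transmit RD to RD set from CD set
   delete the tuple from the candidate data set;
  end If
end For
For each tuple in the relevant data set Do
  initialize the cumulative probability variable;
  count the alternative tuple sets that dominate the tuple; // get the number of dominating sets
  find all the alternative tuple sets that dominate the tuple; // and all sets that dominate the tuple
  For each dominating alternative tuple set Do
   calculate its dominant probability;
  end For
  If the tuple is ID Then
   delete the tuple from the relevant data set;    //  delete ID from RD set
  end If
end For
return the candidate data set and the relevant data set;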

For data classification in the candidate data set, the algorithm works as follows: first, it initializes the cumulative probability variable (Line 10); second, it counts the alternative tuple sets that can dominate the tuple (Line 11); third, it finds all the alternative tuple sets that dominate the tuple (Line 12), after which the dominant probability of each such set is calculated (Lines 13–15). Then, the data tuples are classified based on the definitions above. In this procedure, tuples that are ID are deleted, while tuples that are RD are moved from the candidate data set to the relevant data set (Lines 16–22).

The process of data classification in the relevant data set is similar. First, the cumulative probability variable is initialized (Line 24); second, the number of alternative tuple sets that dominate the tuple is calculated (Line 25); third, the algorithm finds all the alternative tuple sets that dominate the tuple (Line 26); next, the dominant probability of each such set is worked out (Lines 27–29); finally, the algorithm deletes the tuple from the relevant data set if it is ID (Lines 30–33).
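The control flow of Algorithm 1 (merge, classify, filter, forward) can be sketched in Python as follows. The conditions of Definitions 8–10 are not reproduced in this sketch; instead, the classification rule is passed in as a caller-supplied function `classify`, and the toy rule used in the example is purely hypothetical. All names, the message layout `(CD list, RD list)`, and the choice to classify against a snapshot of all known tuples are assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

CD, RD, ID = "CD", "RD", "ID"
Item = Tuple[str, float]               # hypothetical tuple layout for the demo
Message = Tuple[List[Item], List[Item]]  # (candidate data, relevant data)


def process_on_node(
    child_messages: Sequence[Message],
    local_data: Sequence[Item],
    classify: Callable[[Item, List[Item]], str],
) -> Message:
    """Merge child messages, classify tuples, drop ID, and return (CD, RD)."""
    candidates: List[Item] = []
    relevant: List[Item] = []
    for child_cd, child_rd in child_messages:  # merge the child messages
        candidates.extend(child_cd)
        relevant.extend(child_rd)
    candidates.extend(local_data)              # add the local readings
    known = candidates + relevant              # snapshot used for classification

    kept_cd: List[Item] = []
    kept_rd: List[Item] = []
    for item in candidates:
        label = classify(item, known)
        if label == CD:
            kept_cd.append(item)
        elif label == RD:
            kept_rd.append(item)               # moved from the CD set to the RD set
        # ID tuples are dropped
    for item in relevant:
        if classify(item, known) != ID:        # only ID is removed from the RD set
            kept_rd.append(item)
    return kept_cd, kept_rd


def toy_classify(item: Item, known: List[Item]) -> str:
    # Hypothetical rule for the demo only: classify by a precomputed score.
    score = item[1]
    return CD if score >= 0.5 else RD if score >= 0.2 else ID


messages = [([("a", 0.9)], [("b", 0.3)]), ([("c", 0.1)], [("d", 0.05)])]
out_cd, out_rd = process_on_node(messages, [("e", 0.4)], toy_classify)
```

Keeping the classification rule separate from the merge and filter steps makes it possible to plug in the actual conditions once they are available, without touching the traversal logic.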

In consideration of the running example in Theorem 5, we assume that the WSN is a two-tier hierarchical topology network. Let tuples , , and in be collected by sensor nodes . In the meantime, let and in be collected by sensor node . According to Algorithm 1, we can firstly calculate the Local Skyline Probability (denoted as ) of the tuples and then get the result that , , , , and . Thus, the data classification on node is that is ID, is RD, and is CD. Similarly, is ID and is CD on node . As a result, tuples , on node and on node are transmitted to the base station.

The query processing on the base station is described in detail in Algorithm 2. To begin with, the algorithm merges all the data tuples sent by the child nodes; that is, it merges CD into the candidate data set and RD into the relevant data set (Lines 3–6). Second, the skyline probability of each tuple in the candidate data set is calculated, and ID are removed from the candidate data set (Lines 7–17). Finally, the remaining data tuples in the candidate data set are the final result of PS (Line 18).

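// input: the message set of the child nodes, the threshold value.
// output: the final result of the PS query
For each message in the message set of the child nodes Do
  merge the CD of the message into the candidate data set;
  merge the RD of the message into the relevant data set;
end For
For each tuple in the candidate data set Do
  initialize the cumulative probability variable;
  count the alternative tuple sets that dominate the tuple; // get the number of dominating sets
  find all the alternative tuple sets that dominate the tuple; // and all sets that dominate the tuple
  For each dominating alternative tuple set Do
   calculate its dominant probability;
  end For
  If the tuple is not CD Then
   delete the tuple from the candidate data set;    // delete ID and RD from CD set
  end If
end For
return the candidate data set;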

To remove ID and RD from the candidate data set, the algorithm first initializes the cumulative probability variable (Line 8); second, the number of alternative tuple sets that dominate the tuple is calculated (Line 9); third, it finds all the alternative tuple sets that dominate the tuple (Line 10); then, the dominant probability of each such set is calculated (Lines 11–13); last, a tuple that is not CD is removed from the candidate data set (Lines 14–17).
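A minimal Python sketch of the base-station step is given below. It assumes that each tuple is represented as `(group_id, point, probability)`, where tuples sharing a `group_id` are alternatives, that smaller values are preferred in each dimension, and that a candidate is kept when its skyline probability over all received CD and RD is bigger than the threshold. These representations, the function name `base_station`, and the sample values are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
UTuple = Tuple[str, Point, float]            # (group id, point, probability)
Message = Tuple[List[UTuple], List[UTuple]]  # (candidate data, relevant data)


def dominates(a: Point, b: Point) -> bool:
    # Assumption: smaller values are better in every dimension.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def base_station(child_messages: Sequence[Message], q: float) -> List[UTuple]:
    """Merge CD and RD from the children and keep the candidates whose
    skyline probability is bigger than q."""
    candidates: List[UTuple] = []
    relevant: List[UTuple] = []
    for cd, rd in child_messages:
        candidates.extend(cd)
        relevant.extend(rd)
    everything = candidates + relevant
    result = []
    for gid, point, prob in candidates:
        # probability mass of the dominating alternatives, per foreign group
        dom_mass: Dict[str, float] = defaultdict(float)
        for ogid, opoint, oprob in everything:
            if ogid != gid and dominates(opoint, point):
                dom_mass[ogid] += oprob
        sky_prob = prob
        for mass in dom_mass.values():
            sky_prob *= 1.0 - mass
        if sky_prob > q:
            result.append((gid, point, prob))
    return result


messages = [
    ([("B", (2, 2), 0.5)], [("A", (1, 1), 0.25)]),       # from one child node
    ([("C", (3, 3), 0.7), ("D", (0, 5), 0.9)], []),      # from another child node
]
answer = base_station(messages, 0.3)
```

In this invented example, tuple C would pass the threshold if only the candidates were considered (0.7 × 0.5 = 0.35), but the relevant tuple A lowers its skyline probability to 0.2625, so C is removed. This is the role that relevant data play at the base station.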

For example, on base station, the process of our running example above works as follows: first, tuples and are merged in candidate data set; is merged in RD. Second, we have and . Third, delete from candidate data set. Finally, we get the last result that is the skyline result, which illustrates the correctness and feasibility of our algorithm.

5. Experimental Evaluations

In our experiments, sensor nodes were generated randomly in a region with an area of ; thus, the average area of each node is 1. The communication radius between two nodes was set to be , and the maximum packet transmitted between two nodes was stipulated to be 48 bytes. All the experiments were conducted on a computer with Intel Core i7-3770 CPU 3.40 GHz and 8.00 GB RAM. We conducted our evaluation on the standard test data sets of PS query, in which the probability for each tuple was generated uniformly. The performance of the algorithm is mainly studied on independence data and anticorrelated data.

Three parameters are mainly investigated in our experiments: the number of sensor nodes, the dimensionality of the sensing data, and the threshold value of the PS query. The algorithm adjusted the values of the parameters to minimize the overall data transmission in the network. The overall data transmission is the communication cost incurred by all the sensor nodes in the network; that is, it is calculated as the dimensionality of the sensing data × the number of tuples × the hop count. The communication costs of DPPS and CA were explored with the number of sensor nodes ranging from 600 to 1000, with a default of 600. The dimensionality of the sensing data ranges from 2 to 6, with a default of 2. The threshold value of the PS query ranges from 0.1 to 0.3, with a default of 0.1.
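The cost measure can be written as a short Python function. The sketch assumes that each transmission is described by the number of tuples sent and the number of hops they travel, and that the per-transmission costs are summed over the network; the function name and the sample numbers are hypothetical.

```python
from typing import Iterable, Tuple


def communication_cost(dimensionality: int,
                       transmissions: Iterable[Tuple[int, int]]) -> int:
    """Sum of dimensionality * number of tuples * hop count over all transmissions.

    Each transmission is a (num_tuples, hop_count) pair (assumed layout).
    """
    return sum(dimensionality * num_tuples * hops
               for num_tuples, hops in transmissions)


# 10 tuples travelling 3 hops plus 4 tuples travelling 1 hop, 2-dimensional data
cost = communication_cost(2, [(10, 3), (4, 1)])  # 2*10*3 + 2*4*1 = 68
```

Under this measure, filtering a tuple close to its source saves more than filtering it near the base station, because the saved cost is proportional to the remaining hop count.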

Figure 3 shows how the data communication cost of DPPS and CA changes with the number of sensor nodes under the independent and anticorrelated distributions. A larger number of sensor nodes leads to a higher communication cost, but the cost of DPPS grows more slowly than that of CA. As the amount of sensing data increases with the number of sensor nodes, the communication cost of CA increases quickly. In DPPS, however, the unnecessary sensing data are filtered out, which directly leads to a lower communication cost and a much slower rate of increase. The communication cost under the independent distribution is close to that under the anticorrelated distribution, which indicates that the data distribution has little impact on the communication cost. In other words, the confidence of the sensing data is the primary factor affecting the communication cost.

Figure 4 shows how the data communication cost of both algorithms changes with the dimensionality of the sensing data under the two data distributions. The higher the dimensionality, the higher the communication cost. The reason is that, as the data dimensionality increases, the probability that a tuple is dominated by others decreases, which increases the number of skyline tuples and hence the data communication cost. The communication cost of DPPS is smaller than that of CA, which further verifies the effectiveness of DPPS. In addition, we can conclude that the confidence of the sensing data plays the primary role in determining the communication cost.

Figure 5 shows how the data communication cost of DPPS and CA changes with the threshold value under the two distributions. A larger threshold value usually leads to a lower communication cost. This is intuitive, since the larger the threshold value is, the smaller the PS query result set will be, which results in a lower communication cost. The communication cost of DPPS is always lower than that of CA, which again supports the effectiveness of DPPS. Similarly, the results indicate once more that the confidence is the primary factor.

All the results show that DPPS outperforms CA across all the tested numbers of sensor nodes, sensing data dimensionalities, and PS threshold values. It is therefore well suited to sensor networks, since it improves efficiency and reduces the communication cost significantly.

6. Conclusion

In this paper, we explored the requirements of PS query processing in WSNs and summarized the existing problems. According to the characteristics of applications in WSNs, we first studied the basic properties of the PS query and theoretically proved that the PS query is not decomposable. Then, an efficient algorithm, Distributed Processing of Probabilistic Skyline (DPPS) query in WSNs, was put forward. DPPS classifies the sensing data on sensor nodes and discards the irrelevant data, which do not affect the result of the PS query. Thereby, DPPS can reduce the data transmission cost significantly in WSNs. Finally, the algorithm was verified by simulation experiments, and the results showed that DPPS saves considerably more communication cost in the network than CA.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was partially supported by the National Natural Science Foundation of China under Grant nos. 61402089, 61472069, and 61100022, the Natural Science Foundation of Liaoning Province under Grant no. 2015020553, and the Fundamental Research Funds for the Central Universities under Grant nos. N141904001 and N130404014.