Abstract

In wireless sensor networks (WSNs), data provenance records the source of a data packet together with the forwarding and aggregation operations performed on it on its way to the base station (BS). To conserve energy and wireless communication bandwidth, the provenance is compressed at each node along the packet path. To perform provenance compression in resource-constrained WSNs, we present a cluster-based arithmetic coding method that not only achieves a higher compression rate but also encodes and decodes the provenance in an incremental manner; i.e., the provenance can be zoomed in and out like Google Maps. Such a decoding method raises the efficiency of both provenance decoding and data trust assessment. Furthermore, the relationship between the clustering size and the provenance size is formally analyzed, and the optimal clustering size is derived as a mathematical function of the WSN’s size. Both the simulation and the test-bed experimental results show that our scheme outperforms the known arithmetic coding based provenance compression schemes with respect to the average provenance size, the energy consumption, and the communication bandwidth consumption.

1. Introduction

Wireless sensor networks (WSNs) are composed of a large number of low-cost, low-power, and randomly distributed wireless sensor nodes (nodes, for short), which monitor physical or environmental data in the sensing areas and cooperatively pass the data to the base station (BS) or a desired actuator through wireless communication. They are deployed in many application areas, such as health monitoring, meteorology, and military operations. Because of the diversity of the environments and the large number of sensor types involved, making accurate decisions from reliable information requires evaluating the trustworthiness of the data received at the BS of a WSN. In practice, the use of faulty data has led to significant losses [1].

In a multihop WSN, provenance of each data packet presents the history of the data acquisition and the actions performed on the data while the data are transmitted to the BS. Provenance provides the knowledge about how the data come to be in its current state, including where the data originated, how it was generated, and the operations it has undergone since its generation. Therefore, provenance plays an important role in assessing the data’s trustworthiness [2, 3]. With the increase in packet transmission hops, the provenance’s size expands rapidly and sometimes the size even largely exceeds the size of the data packet itself. WSNs are a kind of resource constrained network. Because of storage and computational resource constraints, sensor nodes deployed in WSN do not have the ability to manipulate the provenance if its size is very large. Generally, sensor nodes utilize batteries to supply power, so the amount of energy is limited. Data transmission is the major part of the energy consumption. When the sensing data item is fixed, the packet size mainly depends on the provenance. Besides, the transmission channels do not have sufficient capacity for transmitting large provenance.

As a result, in large-scale WSNs, the provenances generally cannot be directly and completely transmitted due to both the bandwidth and the energy constraints on wireless sensor nodes. For the same reason, the provenance encoding schemes for wired computer networks, e.g., the works of [4–6], are not applicable to WSNs. Hence, several lightweight or compact data provenance schemes [7–9] as well as compression schemes [1, 10, 11] have been proposed. The lightweight schemes drop some less significant information from the provenance, e.g., the provenance graph’s topology, and thereby shorten the provenance size. The compression schemes decrease the provenance size through arithmetic coding [12], LZ77 [13], and LZ78 [14]. Note that for a given packet path, the achievable provenance compression is bounded by the entropy according to Shannon’s theory. Even though the dictionary-based scheme [10] achieves the highest provenance compression rate to date, the provenance overload problem is inevitable in a large-scale network.

To mitigate the increase in the average provenance size as well as utilize the provenance data efficiently, we propose a CBP (cluster-based provenance) encoding scheme for WSNs. The CBP scheme focuses on encoding the provenance so that it can be decoded incrementally at the BS (like Google Maps, it can be zoomed in and out according to the user’s requirement). The specific contributions of the paper are as follows:

(i) We propose a cluster-based lossless provenance arithmetic encoding scheme (CBP) for WSNs. Our approach not only can encode and decode the provenance incrementally but also achieves a higher average provenance compression rate.

(ii) We derive the optimal cluster size for the CBP scheme as a mathematical function of the number of nodes in a WSN.

(iii) We provide a detailed performance analysis, simulation, and experimental evaluation of our CBP scheme and a comparison with other related schemes.

The rest of the paper is organized as follows: Section 2 surveys the related works. Section 3 introduces the system model and the related background. Section 4 gives an overview of our method. Section 5 describes our proposed encoding and decoding approaches for simple and aggregated provenances, respectively. Case studies are presented in Section 6. Section 7 theoretically analyzes the performance of our method. Sections 8 and 9 show the simulation and experimental results, respectively. Section 10 concludes the paper.

2. Related Works

Shebaro et al. [9] proposed an in-packet Bloom filter (BF, for short) based lightweight secure provenance scheme, in which every node on packet’s path is embedded into an array with fixed size through a set of hash functions. A BF is a simple but space efficient randomized data structure which supports fast membership queries with false positive, where the false positive rate depends on the array’s size. Alam et al. [7] proposed an energy-efficient provenance encoding scheme based on the probabilistic incorporation of the nodes’ IDs into the provenance, in which the entire provenance is scattered into a series of packets that are transmitted along the same route. Therefore, all the sections carried by the packets have to be retrieved correctly at the BS and then the provenance can be decoded. The major advantage of such a method is that it successfully limits the size of the provenance attached to a single packet, whereas the drawback is that it has a higher decoding error rate compared to the methods that encode the entire provenance into a single packet. The above methods are all lossy provenance encoding schemes because the topology information of the WSN is not included.
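
To make the in-packet Bloom filter idea of [9] concrete, the following Python sketch embeds node IDs into a fixed-size bit array and answers membership queries. It is an illustration only: the class name, the 64-bit array size, the three hash functions, and the use of SHA-256 are assumptions made here and are not taken from [9].

```python
import hashlib


class InPacketBloomFilter:
    """Fixed-size bit array carried in the packet (hypothetical layout)."""

    def __init__(self, num_bits=64, num_hashes=3, bits=0):
        self.num_bits = num_bits      # size of the in-packet array (assumed)
        self.num_hashes = num_hashes  # number of hash functions (assumed)
        self.bits = bits              # the bit array stored as an integer

    def _positions(self, node_id):
        # Derive num_hashes bit positions from the node ID.
        for k in range(self.num_hashes):
            digest = hashlib.sha256(f"{k}:{node_id}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.num_bits

    def add(self, node_id):
        # Called by each node on the path before forwarding the packet.
        for pos in self._positions(node_id):
            self.bits |= 1 << pos

    def might_contain(self, node_id):
        # True may be a false positive; False is always correct.
        return all((self.bits >> pos) & 1 for pos in self._positions(node_id))


bf = InPacketBloomFilter()
for node in (3, 7, 12):   # nodes on a hypothetical packet path
    bf.add(node)
print(bf.might_contain(7))
```

The filter records which nodes were visited but not their order, which is why such schemes are lossy with respect to the path topology.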

Hussain et al. [1] proposed an arithmetic coding based provenance scheme (ACP). The ACP scheme assigns a global cumulative probability to each node according to its occurrence probability in the used packet paths. For a given packet path in a WSN, the ACP scheme utilizes the global cumulative probability of the source node as the initial coding interval, and then the cumulative probabilities acquired from the associated probabilities of each pair of connected nodes are used to generate the provenance; i.e., all the node IDs along the path are sequentially encoded into a half-open interval through arithmetic coding. Unlike most of the known provenance schemes, whose average provenance size grows in direct proportion to the number of packet transmission hops, the ACP scheme’s provenance size is decided by the packet path’s occurrence probability in a WSN.

Wang et al. [10] proposed a dictionary-based provenance encoding scheme (DP). The DP scheme treats every packet path as a string of symbols consisting of the node IDs along the packet path. Just like building a dictionary for symbol strings in LZ compression [13, 14], the DP scheme builds the dictionary for packet paths with the used packet paths at every node and the BS holds a copy of each node’s dictionary. Therefore, the provenance can be encoded through an index or a series of indices of the packet paths at each node along the packet path and be decoded by looking up the dictionaries at the BS. When the topology of the WSN is relatively stable, the provenance size under the DP scheme can be even shorter than the provenance’s entropy; on the contrary, the provenance compression rate increases drastically with the quick change of the WSN’s topology. Wang et al. [11] proposed a dynamic Bayesian network based provenance scheme (DBNP) for WSNs. The DBNP scheme encodes edges instead of node IDs along the packet path into the provenance by overlapped arithmetic coding. Compared to the known lossless provenance schemes, a higher provenance compression rate can be achieved by the DBNP scheme. Furthermore, such a scheme is not sensitive to the WSN’s topology changes. However, applying the overlapped arithmetic coding leads to decoding false positives [15], where extra knowledge is added to eliminate the false positives. Therefore, it is difficult to make a tradeoff between the acceptable false positive rate and the optimum compression rate.

All the approaches mentioned above focus on how to mitigate the rapid expansion of the provenance size in WSNs. However, none of these approaches supports incremental provenance encoding and decoding, which can raise the efficiency of both provenance decoding and data trust assessment.

3. Background and System Model

In this section, we provide a brief primer on arithmetic coding. We also introduce the system model applied in this paper. Some of these definitions are partly from our previous work [1, 10, 11].

3.1. Clustering Model

In WSNs, clustering management refers to selecting a series of nodes as cluster-heads according to a certain communication protocol, e.g., EEHC (Energy-Efficient Hierarchical Clustering) [16]. A cluster-head aggregates the data generated by the nodes in its cluster and then sends such data to the BS.

Figure 1 shows the two data collection modes in cluster-based WSNs, where the solid circles and the empty circles denote the member nodes of a cluster and the cluster-heads, respectively. Figure 1(a) shows an example of the single-hop communication between the cluster-heads and the BS, which is applied for small-scale WSNs; Figure 1(b) shows an example of multihop communication among the cluster-heads and the BS, which is applied for large-scale WSNs because some cluster-heads cannot reach the BS in one hop.

3.2. Provenance Model

In WSNs, the provenance of a data packet refers to where the data packet is generated and how it is transmitted to the BS [1, 2, 8]. In our provenance model, a data source is a node generating data periodically through the sensors attached to the node; a forwarder is a node that relays the packet toward the BS along the packet path; an aggregator is a node that aggregates two or more data packets received from its upstream nodes into a new one and then sends the new packet toward the BS. The aggregator nodes in our scheme are not selected in advance. While being transmitted toward the BS, only the packets that fulfill the aggregation conditions are aggregated [17]. Note that such a process results in the aggregation of the corresponding provenances accordingly.

Each packet contains the following: (i) a unique packet sequence number; (ii) a cluster-head sequence number; (iii) the data source node ID; (iv) the data value; (v) the provenance; and (vi) a message authentication code (MAC), which binds the provenance and its data together to prevent any unauthorized modification.

There are two different kinds of provenance in WSNs. Figure 2(a) presents a simple provenance, where data is generated at leaf node and forwarded by nodes and toward the BS; Figure 2(b) shows an aggregated provenance, where data are aggregated at nodes and on the way to the BS. The aggregated provenance can be presented as a tree through a recursive expression , where denotes the root and and denote the left and the right subtrees, respectively. Therefore, the aggregated provenance in Figure 2(b) has the form

Without loss of generality, the formal definition of data provenance in a WSN is as follows [11].

Definition 1 (provenance). For a given data packet , the provenance is a directed acyclic graph , each vertex , where and represents the cardinality of the set , is attributed to a specific node , and represents the provenance record for that node. One refers to this relation as ; i.e., node is the host of . Each edge represents a directed edge from vertex to , where . Meanwhile satisfies the following properties: (1) is a subgraph of the WSN; (2) for, , is a child of iff forwards to ; (3) is a set of children of iff, for each , receives data packets from .

3.3. Arithmetic Coding

Arithmetic coding [1, 18, 19] is a lossless data compression method that assigns short code words to the symbols with high occurrence probabilities and leaves the longer code words to the symbols with lower occurrence probabilities. The main idea of arithmetic coding is that each symbol of a message is represented by a half-open subinterval of the initial half-open interval [0, 1), and then each subsequent symbol in the message narrows the interval to a corresponding subinterval according to the symbol’s occurrence probability [19]. Figure 3 shows the process in which a message “bca” is encoded with the probability model in Table 1.

The encoding and decoding operations are as follows:

(1) Encoding: The initial interval is . When the first symbol “b” is encoded, it narrows the interval from to , where is the interval assigned to “b”. After the second symbol “c” is encoded, it narrows the to according to the interval assigned to “c”. Finally, the message “bca” is encoded as the interval .

(2) Decoding: The decoding algorithm utilizes the same probability model in Table 1. With the interval of being encoded, the first symbol “b” is decoded because the is a subinterval of the interval which is the interval assigned to “b”. In what follows, the subinterval of “b” is further divided in the same manner to derive the subsequent symbols until the interval of being decoded is equal to the interval of being encoded, namely, in this example.

The detailed encoding and decoding algorithms of arithmetic coding can be found in [1, 18].
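
As a minimal sketch of the interval narrowing described above, the following Python code encodes and decodes the message “bca”. Table 1 is not reproduced here, so the probability model `MODEL` is a hypothetical stand-in rather than the values used in the paper. As a further simplification, the decoder is given the message length instead of comparing intervals as its stopping rule.

```python
from fractions import Fraction as F

# Hypothetical model: symbol -> [low, high) subinterval of [0, 1).
MODEL = {"a": (F(0), F(1, 2)), "b": (F(1, 2), F(4, 5)), "c": (F(4, 5), F(1))}


def encode(message, model=MODEL):
    """Narrow [0, 1) once per symbol and return the final interval."""
    low, high = F(0), F(1)
    for symbol in message:
        s_low, s_high = model[symbol]
        width = high - low
        low, high = low + width * s_low, low + width * s_high
    return low, high


def decode(low, high, length, model=MODEL):
    """Recover `length` symbols from any value inside [low, high)."""
    value = (low + high) / 2
    symbols = []
    for _ in range(length):
        for symbol, (s_low, s_high) in model.items():
            if s_low <= value < s_high:
                symbols.append(symbol)
                # Remove the decoded symbol's effect and rescale to [0, 1).
                value = (value - s_low) / (s_high - s_low)
                break
    return "".join(symbols)


interval = encode("bca")
print(interval)               # (Fraction(37, 50), Fraction(77, 100))
print(decode(*interval, 3))   # bca
```

Exact fractions are used so that the example is free of rounding effects; sensor nodes would use integer arithmetic instead.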

4. Overview of Our Method

In a large-scale WSN, the provenance decoding load at the BS is heavy, which also lowers the efficiency of the data trust evaluation. The layered clustering management provides a good way to manage large-scale WSNs. With a multilayer cluster structure, provenance can be encoded hierarchically, where the final provenance consists of multiple segments. By decoding the provenance segments of the higher layers, the BS obtains coarse-grained provenance information and can then assess the trustworthiness of the data quickly. Whether to decode the provenance on the next layer depends on the current decoding results; i.e., if the BS determines that the data have been tampered with, the decoding stops immediately.

Compared to the ACP scheme, our scheme has the following characteristics:

(1) Because the layered clustering management is applied, the provenance on each layer can be encoded as an independent segment and the final encoded provenance is composed of a series of segments from different cluster layers. When the BS receives a provenance, it decodes the segment from the highest layer first, and then the BS obtains the provenance information on the most coarse-grained layer, which can be used for a rapid data trust evaluation. Thereafter, the BS continues to decode the provenance step by step. Finally, the BS reconstructs the accurate provenance by combining each segment’s decoding result. Therefore, our scheme can encode and decode the provenance in an incremental manner. Such a decoding method raises the efficiencies of the provenance decoding as well as the data trust evaluation.

(2) Compared to the ACP scheme, which uses global probabilities to encode provenance, our scheme encodes the provenance through local probabilities which are only valid in a cluster. By using local probabilities, our scheme not only has a higher compression rate but also can update the probability model partially, which raises the provenance’s encoding and decoding efficiencies.

A large WSN can be managed through a multilayer cluster structure, in which the cluster-heads of the same layer are the nodes of an upper layer cluster. As shown in Figure 4, we manage the WSN by different clusters and then form a two-layer cluster managing structure.

In each cluster, the local cumulative probabilities are assigned for each member node. Note that the local cumulative probability is only valid in a cluster, which is quite different from the ACP scheme using global cumulative probabilities [1]. In Figure 4, the highlighted packet path starts from a data source node in a layer-1 cluster, and then the packet passes through the nodes in a layer-2 cluster before it reaches the BS. Therefore, the provenance can be encoded into two segments for incremental decoding as shown in Figure 5.

When the BS receives the encoded provenance, it can decode the provenance incrementally and reconstruct the provenance by stepwise refinement. From the decoding result of the first provenance segment, the BS derives the packet path on layer-2, and the source node of this packet path indicates the cluster from which the packet comes. Then the second segment of the provenance is decoded and yields the packet path on layer-1. Finally, we can recover the entire packet path by combining the two decoding results. Note that if the layer-2 decoding reveals that the packet has passed through some undependable nodes, the provenance and its packet will be dropped without further decoding, which increases the efficiency of both provenance decoding and data trust assessment.
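
The stepwise refinement described above can be sketched as a loop over the provenance segments, ordered from the highest layer down, that stops as soon as an undependable node is found. The function and parameter names below are hypothetical, and the per-segment decoder is passed in as a callback because its details are given later in the paper.

```python
def incremental_decode(segments, decode_segment, untrusted):
    """Decode segments from the highest layer down.

    Returns (decoded_paths, accepted). Decoding stops at the first
    segment whose path contains an untrusted node.
    """
    decoded = []
    for segment in segments:
        path = decode_segment(segment)
        decoded.append(path)
        if any(node in untrusted for node in path):
            return decoded, False   # drop the packet, skip lower layers
    return decoded, True


# Stand-in decoder: a lookup table instead of real arithmetic decoding.
table = {"layer2": ["hA", "hB"], "layer1": ["n3", "n5", "hA"]}
paths, accepted = incremental_decode(
    ["layer2", "layer1"], table.__getitem__, untrusted={"hB"}
)
print(paths, accepted)   # [['hA', 'hB']] False
```

In the rejected case only the layer-2 segment is decoded, which is where the saving in decoding effort comes from.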

Besides, applying a local cumulative probability rather than a global cumulative probability achieves a higher compression rate compared to the ACP scheme. Furthermore, in contrast to the DBNP scheme [11], we use nonoverlapped arithmetic coding which has no false positives.

Our CBP scheme derives the encoding intervals from each node’s cumulative probability, and the cumulative probabilities do not change drastically when the WSN’s topology changes. Hence, our scheme is robust to the WSN’s topology changes.

5. Cluster-Based Provenance Encoding and Decoding

Before introducing our scheme, we first define the main symbols used in the scheme and algorithms. See Table 2 for details.

At the beginning, the BS trains the network for a certain period to get the number of times each node appears on the packets’ paths as the node’s occurrence frequency . During the training process, we let each source node in the network send a certain number of packets to the BS. Upon receiving these packets, the BS computes the occurrence frequencies for each node in the network. Then the local probabilities of each node are computed within its cluster. How long the training process takes depends on the WSN’s scale as well as the accuracy requirement. The more packets used in the training process, the more accurate the model, at the cost of a longer training time.

For each node , its occurrence probability can be computed by the following formula: , where . Thereafter, the BS computes the cumulative occurrence probability for each node by the following equation:

In what follows, the BS calculates the occurrence frequency with which appears next to as the associated frequency .

Because the number of times a node appears on all the packet paths is equal to the number of the packets that the other nodes receive from it, the total associated frequency of node is then equal to its occurrence frequency . Hence, the associated probability . At the BS, the cumulative associated probability for each node is thus derived as

The nodes of the WSNs considered in our scheme stay stationary once they are deployed. We assume that topology changes of the WSNs are slow and infrequent, so that they cannot drastically modify the occurrence frequencies and the probabilities assigned to the nodes. To keep the occurrence frequencies and the probabilities as accurate as possible, the probability model is updated periodically or on request by the BS.
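
The training step can be sketched in Python as follows. The function name `build_model`, the node labels, and the training paths are hypothetical. Two conventions are assumed here rather than taken from the paper: the cumulative probability of a node includes the node’s own probability, and the BS is appended to every path as the final next hop so that the total associated frequency of a node equals its occurrence frequency. The sketch handles a single cluster; under the CBP scheme it would be applied to each cluster separately.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def build_model(paths):
    """Derive occurrence and associated probabilities from training paths."""
    # Occurrence frequency: how often each node appears on the paths.
    occ = Counter(node for path in paths for node in path)
    total = sum(occ.values())
    op = {n: Fraction(c, total) for n, c in sorted(occ.items())}

    # Cumulative occurrence probability (inclusive, by assumption).
    cop, running = {}, Fraction(0)
    for n, p in op.items():
        running += p
        cop[n] = running

    # Associated frequency: how often `nxt` appears right after `prev`.
    assoc = defaultdict(Counter)
    for path in paths:
        hops = list(path) + ["BS"]
        for prev, nxt in zip(hops, hops[1:]):
            assoc[prev][nxt] += 1
    ap = {
        prev: {nxt: Fraction(c, occ[prev]) for nxt, c in sorted(cnt.items())}
        for prev, cnt in assoc.items()
    }
    return op, cop, ap


training = [["n1", "n2", "n3"], ["n2", "n3"], ["n1", "n3"]]
op, cop, ap = build_model(training)
print(op["n1"], cop["n3"], ap["n1"])
```

Dividing by `occ[prev]` relies on the equality between total associated frequency and occurrence frequency stated in the text.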

5.1. Simple Provenances Encoding and Decoding

Along a packet transmission path, in our simple provenance encoding algorithm (see Algorithm 1), the initial coding interval at the data source is . In what follows the interval is used to denote the provenance at the (x-1)th node. When the provenances come from different clusters, the ID of the cluster-head is attached to distinguish the provenances that may share the same coding interval.

Input:
Output:
,
IF is a source node or a cluster source node THEN
ELSE IF is a forwarder node or a cluster forwarder node receiving packet from THEN
ELSE IF is a cluster-head THEN
END IF
WHILE MSD()=MSD() DO
/MSD(x) returns the most significant digit of x/
/ShiftL(x,y) function returns x shifted by y digits in the left/
END WHILE

A cluster-head of the first layer may play the role of a data source, forwarder, or aggregator node on the second layer. The time complexity of the compression at a cluster-head is the same as that at a data source, forwarder, or aggregator node, whereas the space complexity, which depends on the layer where the cluster-head is located, is at least doubled.

(1) If node is the data source with the occurrence probability and the cumulative probability , the interval is then encoded as follows:

(2) If node is a forwarder node which receives an interval from the (x−1)th node , the provenance is then encoded as follows:

Although real numbers are used in our algorithms to represent the values of the probabilities and the intervals, at each sensor node the real numbers are replaced by integers to fit the limited computational ability. Therefore, to meet the demand for increasing precision as well as avoid transmitting duplicated data, when the most significant digits of the two numbers that define the interval are identical, the most significant digit is shifted out and stored in a buffer. For example, the interval is represented as , where is the buffer.
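
The two encoding cases and the digit shifting can be sketched as follows. This is an illustration under stated assumptions, not the paper’s Algorithm 1: the function names are hypothetical, the cumulative probabilities are taken to include the node’s own probability (so a node’s range is its cumulative probability minus its probability, up to its cumulative probability), and the integer interval uses four decimal digits with an inclusive upper end.

```python
from fractions import Fraction as F


def encode_source(op, cop):
    """Initial interval at the data source: the node's own range."""
    return cop - op, cop


def encode_forwarder(low, high, ap, cap):
    """Narrow the received interval by the forwarder's associated range."""
    width = high - low
    return low + width * (cap - ap), low + width * cap


def shift_out_common_digits(low, high, digits=4):
    """Move identical leading decimal digits of low/high into a buffer."""
    buffer = ""
    top = 10 ** (digits - 1)
    while low // top == high // top:
        buffer += str(low // top)
        low = (low % top) * 10        # shift in a 0 on the right
        high = (high % top) * 10 + 9  # shift in a 9 on the right
    return buffer, low, high


low, high = encode_source(op=F(1, 4), cop=F(1, 2))
low, high = encode_forwarder(low, high, ap=F(1, 2), cap=F(1))
print(low, high)                             # 3/8 1/2
print(shift_out_common_digits(7400, 7699))   # ('7', 4000, 6999)
```

Shifting out the shared digit restores precision for later hops, and the buffer carries the digits that are already fixed.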

Upon receiving a packet, the BS recovers the provenance through the encoded provenance ; refer to Algorithm 2. The middle point number of the and the is selected using (6) as the flag code to locate the data source node’s interval and the data source node is then retrieved. Thereafter, the data source node’s effect is removed from the interval and the next node on the packet path will be retrieved through the new flag code by (7), where and denote the cumulative occurrence probability and the occurrence probability of the node being decoded, respectively. Furthermore, the cluster-head’s ID is used to exclude the nodes that do not belong to such a cluster.

Input:
Output: Provenance P
IF THEN
ELSE
/ShiftR(x,y) function returns x shifted by y digits in the right/
END IF
/If code is equals to the one we retrieved. /
/Then we retrieve the path which includes /
FOR=1 to DO
IF THEN
BREAK
END IF
END FOR
/source decoded/
/probability range of decoded/
/lower end of probability range/
WHILE the node from which BS received packet do
/remove effect of decoded node/
FOR=1 to DO
IF THEN
BREAK
END IF
END FOR
END WHILE
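
The decoding loop can be sketched in Python as follows, as an illustration rather than a transcription of Algorithm 2. The function names, node labels, and probability ranges are hypothetical. The sketch takes the midpoint of the received interval as the flag code, locates the matching range, removes the decoded node’s effect by rescaling, and stops at the node from which the BS received the packet.

```python
from fractions import Fraction as F


def find(ranges, code):
    """Return the node whose [low, high) range contains the flag code."""
    for node, (low, high) in ranges.items():
        if low <= code < high:
            return node, low, high
    raise ValueError("flag code outside every range")


def decode_path(low, high, source_ranges, next_ranges, last_hop):
    code = (low + high) / 2                 # midpoint as the flag code
    node, lo, hi = find(source_ranges, code)
    path = [node]
    while node != last_hop:
        code = (code - lo) / (hi - lo)      # remove effect of decoded node
        node, lo, hi = find(next_ranges[node], code)
        path.append(node)
    return path


# Hypothetical local model of one cluster.
source_ranges = {
    "n1": (F(0), F(1, 4)), "n2": (F(1, 4), F(1, 2)),
    "n3": (F(1, 2), F(3, 4)), "n4": (F(3, 4), F(1)),
}
next_ranges = {"n2": {"n3": (F(0), F(1, 2)), "n4": (F(1, 2), F(1))}}

print(decode_path(F(3, 8), F(1, 2), source_ranges, next_ranges, "n4"))
# ['n2', 'n4']
```

The interval [3/8, 1/2) used here is what the source n2 followed by the forwarder n4 would produce under this hypothetical model.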

5.2. Aggregated Provenances Encoding and Decoding

Without loss of generality, at an aggregator node assume that there are two provenance intervals , . The aggregator node encodes its ID into both the and the , which yields the new intervals and , respectively. The aggregator node compares the lengths of the two intervals and and then sets the longer one as the first interval which indicates that it is the active interval. Thereafter, the aggregator node randomly chooses a real number in the shorter interval and sends with the active interval to the next hop. Finally the aggregated provenance has the form ; refer to Algorithm 3.

Input:
Output:
FOR x=1 to DO
encodes itself withusing Algorithm 1 and results
IF(x=1) THEN
END IF
IF(x>1) THEN
IF THEN
chooses a real number
ELSE
chooses a real number
END IF
END IF
END FOR
prepend to
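
The aggregation step can be sketched as follows. This is an illustration under assumptions, not the paper’s Algorithm 3: the function names and the example ranges are hypothetical, floating-point numbers stand in for the integer arithmetic used on sensor nodes, and each child’s associated range is given as a pair of associated probability and inclusive cumulative associated probability.

```python
import random


def narrow(interval, ap, cap):
    """Encode the aggregator's ID into one received interval."""
    low, high = interval
    width = high - low
    return low + width * (cap - ap), low + width * cap


def aggregate(received, ranges_for, rng=random):
    """received: list of (child_id, (low, high)) pairs.

    Returns (active_interval, picks): the longest new interval stays
    active, and one number is drawn from each shorter interval.
    """
    active, picks = None, []
    for child, interval in received:
        new = narrow(interval, *ranges_for[child])
        if active is None:
            active = new
            continue
        if new[1] - new[0] > active[1] - active[0]:
            active, new = new, active   # the longer interval becomes active
        picks.append(new[0] + (new[1] - new[0]) * rng.random())
    return active, picks


received = [("n1", (0.0, 0.25)), ("n2", (0.25, 0.75))]
ranges_for = {"n1": (0.5, 1.0), "n2": (0.5, 0.5)}   # (ap, cap) per child
active, picks = aggregate(received, ranges_for)
print(active, picks)
```

Keeping the longer interval active leaves more room for the nodes that encode themselves later, while a single number is enough to identify the branch that ends at the aggregator.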

The BS decodes an aggregated provenance as a series of simple provenances that together make up the aggregated provenance (see Algorithm 4), where the simple provenances are packet path sections: (1) from a data source to an aggregator node, (2) from an aggregator node to another aggregator node, and (3) from an aggregator node to the BS.

Input: ,
Output: Aggregated provenance AP
P=decode using Algorithm 2
FOR x=1 to DO
Let code=
P=decode code using Algorithm 2 until it reaches a node on the path retrieved from
END FOR

6. Case Studies

In this section, two cases are provided. Without loss of generality, the WSN is organized in two levels in these cases. Our provenance scheme uses hierarchical arithmetic coding to compress provenance within and among the clusters. When the packets pass through the cluster-heads, the provenance will be encoded at a higher level. Figure 6 shows the topology of a WSN composed of four clusters in layer-1. The four cluster-heads, namely, , , , and , collect packets within their clusters and then transmit the packets to the BS.

In Figure 6, the initial probabilities for the member nodes and the cluster-heads were assigned evenly; refer to Tables 3 and 4.

For simplicity, we take a simple provenance and an aggregated provenance that were generated in cluster A as the examples to illustrate how our provenance encoding scheme works.

In Figure 7, the values assigned to the directed edge represent the associated probabilities between the nodes. Based on the data from Tables 3 and 4, the cumulative associated probabilities assigned to the nodes and the cluster-heads were derived by (3).

6.1. Simple Provenance

Assume that the simple provenance to be encoded is . As the occurrence probability’s range of the data source is , then sends the provenance as to its next node , where denotes that the buffer is now empty. Node then derives the new coding interval through its associated probability and cumulative associated probability with respect to . According to Algorithm 1, the new coding interval at is equal to . Similarly, the new interval at is encoded as and then the provenance is updated as when being sent to . Because is a cluster-head, it simply adds its cluster ID to the provenance and updates the provenance as .

Because the occurrence probability’s range of the data source is , sends the provenance as with to its next node . Because is a cluster-head, it only updates the provenance at the higher level, i.e., the cluster-head level, and then sends the updated provenance in the form of , to the BS. Table 5 shows the encoding processes of the simple provenance at each node along the packet path.

When the packet arrives, the BS parses the attached provenance. Because there are two cluster levels, the provenance has two sections accordingly. For the first section, namely, , the BS extracts the interval and the buffer and then derives that which belongs to the cumulative probability of . The data source node of the first section is thus . Thereafter, the BS removes the effect of from the and then derives that is equal to 0.26. Because the associated probability of with respect to is , is yielded from the decoding. The BS stops decoding the first section because the packet was received from and yields the decoding result of layer-2 as . With the second section, the BS extracts that the cluster-head ID is , the coding interval is , and the buffer is . The BS then derives that which belongs to the occurrence probabilities of , , , and . Note that the cluster-head’s ID is ; the data source is thus . Therefore, the BS removes the ’s effect from the and derives that is equal to 0.625. Because the associated probability’s range of with respect to is , is yielded from the decoding. The BS continues decoding until the last node is yielded. The decoding result is then . Finally, the BS combines the incremental decoding results and obtains the provenance .

In order to explain the decoding process more clearly, Table 6 shows the steps of the simple provenance decoding at the BS. The BS decodes the first part of the provenance and reconstructs the data packet path on layer-2 which is composed of cluster-heads for the clusters on layer-1 (see Figure 6). According to the layer-2 decoding result, the BS knows from which cluster this data packet comes. If the BS finds any undependable information such as tampered nodes or faked packet path in the layer-2 decoding, it will discard the packet and stop decoding, which raises the provenance decoding efficiency as well as the data trust assessment efficiency.

6.2. Aggregated Provenance

We now consider the encoding and decoding of an aggregated provenance in Figure 7. According to Algorithm 3, at the beginning, the data source sends (referred to as ) with a packet to its next node . At the same time, the data source also sends as the provenance with packet to its next node , where the new coding interval (referred to as ) is derived.

At the first aggregator node , the two packets and meet. Because there are two coding intervals, i.e., and , encodes its ID into both of them and makes the new intervals as = and = , respectively. Thereafter, compares the lengths of and and chooses a random real number belonging to the shorter interval. Subsequently, updates the aggregated provenance to the form of , where , and sends it with the new aggregated packet to . Because is a cluster-head, it simply adds its ID to the provenance and updates the provenance to the form of . Finally, the provenance of the aggregated packet is at the BS. Figure 8 shows the encoding of the aggregated provenance at each node along the packet path.

When the packet arrives, first, the BS decodes the foremost part of the provenance, which yields . Second, the BS decodes 0.65 in the provenance according to Algorithm 4, because which belongs to the occurrence probability ranges of , , , and . Note that the cluster-head’s ID is ; the data source is thus decoded as . In what follows, the BS removes the ’s effect from the and yields as 0.6. Because the associated probability of with respect to is , is then yielded from the decoding. The decoding process continues until is yielded. As is a node on the path decoded from the foremost part of the provenance, the decoding process ends with the result . Figure 9 shows the decoding processes of the aggregated provenance at the BS.

7. Performance Analysis

In this section, we theoretically analyze the performance of our CBP scheme with respect to the compressed provenance length and the optimal clustering size.

7.1. Entropy of Simple Provenance

In our approach, according to Shannon’s theory, it takes bits to represent a source node and bits to represent a forwarder node , where is ’s child node. Hence, we can derive the entropy of the provenance from the number of the bits required to represent an interval ; i.e., . Note that the final interval’s size is equal to the product of the occurrence probability of the source node and the corresponding associated probabilities of the forwarder nodes; i.e., . is thus calculated as
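
The relation between the final interval’s size and the number of bits can be checked numerically with a short sketch. The function name and the probabilities are hypothetical; the computation only restates that the interval size is the product of the source node’s occurrence probability and the forwarders’ associated probabilities, and that the bit count is the negative base-2 logarithm of that size.

```python
import math


def provenance_bits(source_prob, assoc_probs):
    """Bits needed to identify an interval of the given final size."""
    interval_size = source_prob * math.prod(assoc_probs)
    return -math.log2(interval_size)


# A source with probability 1/4 followed by two forwarders with
# associated probability 1/2 each.
print(provenance_bits(0.25, [0.5, 0.5]))   # 4.0
```

Because the logarithm of a product is a sum, the same value is obtained by adding the per-node bit counts, here 2 + 1 + 1.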

7.2. Entropy of Aggregated Provenance

Assume that the number of aggregator nodes is and the number of the branches is () at an aggregator node (). Therefore, bits are needed to represent all the aggregator nodes, where is the associated probability of node with respect to its child node .

7.3. The Optimal Cluster Size

Assume that the number of the nodes in a WSN is and the number of the cluster-heads is , where and M and N are positive integers. For the given and , if the average provenance size has the minimum value compared to other values assigned to and , is then the optimal clustering size for the WSN.

Claim. With our cluster-based arithmetic coding provenance scheme (CBP), the optimal cluster size is .

Justification. Assume that the occurrence probability of each node is evenly distributed. All member nodes’ occurrence probabilities are then evenly distributed too; i.e., . Therefore, each node’s cumulative occurrence probability is , where . According to the entropies of the simple provenance and the aggregated provenance, when , the average entropy of the provenances has the minimum value, which indicates that the average provenance size reaches its minimum value.

8. Simulations

We implemented our approach with TinyOS 2.1.2 TOSSIM and PowerTOSSIMz to evaluate its performance. In the simulations, a WSN consisting of 101 nodes with IDs 0 through 100 is deployed, where the node with ID 0 is set as the BS. The network diameter varies from 2 to 14 hops. The duration of each data collection round is set to 2 seconds. We define the topology of the WSN by specifying the one-hop neighboring information between nodes. According to this topology, we randomly select some nodes as either data source nodes or aggregator nodes in TinyOS, and then we use a two-layer clustering to verify our scheme. All 100 nodes are first organized into clusters. Each cluster-head collects data from its member nodes and then sends the data toward the BS over multiple hops. At the beginning, the BS computes the occurrence probability and the cumulative occurrence probability for each node, as well as the associated probability and the cumulative associated probability for each node pair in the WSN.

For simplicity, real numbers are used in this paper to represent the probabilities and define the coding intervals. In the simulations and the experiments, integer arithmetic coding is applied instead because of the computational limitations at sensor nodes. By using 2-byte integers, the initial interval is represented as .
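To illustrate how an interval can be narrowed with integers only, the following is a minimal sketch of one integer arithmetic coding step. It assumes that the 2-byte initial interval is [0x0000, 0xFFFF] and that probabilities are given as integer counts out of a total; the function name `narrow` and the example counts are hypothetical, and the renormalization (bit shifting) that a full coder needs is omitted.

```python
LOW0, HIGH0 = 0x0000, 0xFFFF  # assumed 16-bit initial interval

def narrow(low, high, cum_lo, cum_hi, total):
    """Narrow [low, high] to the sub-interval of a symbol whose cumulative
    probability range is [cum_lo/total, cum_hi/total), using integers only."""
    width = high - low + 1
    new_high = low + (width * cum_hi) // total - 1
    new_low = low + (width * cum_lo) // total
    return new_low, new_high

# Hypothetical source node occupying the probability range [1/4, 2/4).
step1 = narrow(LOW0, HIGH0, 1, 2, 4)
# Hypothetical forwarder node occupying [0/2, 1/2) within that interval.
step2 = narrow(step1[0], step1[1], 0, 1, 2)
print(step1, step2)
```

Only additions, multiplications, and integer divisions are used, which avoids floating-point support on the mote.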

In addition, we directly compare our approach with the ACP scheme [1], which is closely related to our CBP scheme. The ACP scheme assigns each node a global cumulative probability and then uses the cumulative conditional probabilities for each connected node pair to generate the provenance through arithmetic coding.

8.1. Performance Metrics

The following performance metrics are used in the paper:

(i) Average provenance sizes (APS): When the BS receives a packet, it filtrates the and from the provenance. Assume that , , and denote the sizes of , , and , respectively. The size of the provenance is then equal to . In the simulations, we use 8 bits to denote and 16 bits to denote and , respectively.

Assume that there are packets ; APS is then defined as follows: where represents the provenance size of the packet .

(ii) Total energy consumption (TEC): Suppose that there are nodes in a WSN. TEC is defined as follows: where denotes the energy consumed by node and represents the total number of the nodes in the WSN.
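The two metrics can be computed as in the sketch below. It assumes, from the descriptions above, that APS is the arithmetic mean of the per-packet provenance sizes and that TEC is the sum of the per-node energy consumption. The function names, the units, and the sample values are hypothetical.

```python
def average_provenance_size(provenance_sizes):
    """APS: mean provenance size over all received packets (e.g., in bits)."""
    if not provenance_sizes:
        raise ValueError("at least one packet is required")
    return sum(provenance_sizes) / len(provenance_sizes)

def total_energy_consumption(node_energies):
    """TEC: sum of the energy consumed by every node in the WSN."""
    return sum(node_energies)

# Hypothetical measurements: three packets and three nodes.
aps = average_provenance_size([24, 40, 32])
tec = total_energy_consumption([1.5, 2.5, 3.0])
print(aps, tec)
```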

8.2. Simulation Results

The identical simulation environment is used to simulate the ACP scheme and our CBP scheme. Figure 10 shows the relationship between the number of the clusters and the average provenance size. The curve shows that when the number of clusters is equal to the number of the nodes in each cluster, the average provenance size reaches its minimum value. When all nodes are assigned in one cluster or each node forms an individual cluster, the average provenance size reaches its maximum value. Such simulation results also verified the conclusion in Section 7.3.

From the trends of the two curves in Figure 11(a), it can be concluded that our scheme achieves a higher compression rate than the ACP scheme. As the number of hops increases, our scheme outperforms the ACP scheme with respect to the average provenance size. Figure 11(b) shows the total energy consumption of the ACP and the CBP schemes when 100 packets are transmitted. Comparing Figures 11(a) and 11(b), we find that the curves share the same trend: the more data are transmitted over the wireless channel, the more energy is consumed. The total energy consumption of the CBP scheme at each node is lower than that of the ACP scheme (see Figure 11(b)). From Figures 11(a) and 11(b), it can be concluded that the CBP scheme performs better with respect to both total energy consumption (TEC) and average provenance size (APS).

Figure 12(a) shows the average provenance size for the ACP and the CBP schemes with respect to the number of packets sent by data source nodes. In the simulations, there are 100 nodes; 30 of them are data source nodes and 5 of them are aggregator nodes. Figure 12(b) shows the average provenance size for the ACP and the CBP schemes with respect to the number of packets sent by data source nodes. In the simulations, there are 100 nodes; 30 of them are data source nodes and 10 of them are aggregator nodes. It can be concluded that as the number of packets sent by data source nodes increases, the average provenance sizes of the ACP and the CBP schemes increase too, but the average provenance size of the CBP scheme is less than that of the ACP scheme.

Furthermore, in [1], it has been shown that the ACP scheme is better than the Bloom filter based provenance scheme (BFP) [9], the Generic secure provenance scheme (SPS) [20], and the MAC based provenance scheme (MP) [8]. The comparisons between the CBP scheme and the ACP scheme indirectly show that the CBP scheme outperforms the BFP, the SPS, and the MP schemes with respect to both the average provenance size and the energy consumption. As to the dictionary-based scheme (DP) [10], it has been compared with the ACP and the DBNP schemes in [11] and the results show that the DP scheme is sensitive to the WSN’s topology change, whereas the CBP scheme keeps stable when the WSN’s topology changes.

9. Experiments

To further evaluate the performance of our scheme, we deployed the ACP and the CBP schemes in a test-bed which included 26 sensor motes. In the experiments, the performances were evaluated by the average provenance size only.

9.1. Experimental Setup

We ported the implementation to ZigBee sensor motes. Each mote has a CC2530 microcontroller, a 2.4 GHz radio, 8 KB RAM, and 256 KB external flash for data logging (see Figure 13(a)). The motes are placed in an indoor environment in a grid topology covering a network area of 20 × 10 m² (see Figure 13(b)). The equipment also includes the IAR compiler environment and a serial port viewing tool. The mote connected to the laptop computer through a USB port is set as the BS. The I/O functions of the TinyOS simulation code are modified and then ported to the ZigBee motes. The experiments are performed with more than 100 packet transmissions in the WSN.

9.2. Experimental Results

In the test-bed experiments, Figure 14 shows the average provenance sizes for the ACP and the CBP schemes with respect to the number of packet transmission hops. The data show that the curves have similar trends compared with the simulation results in Figure 11(a), which also shows that the CBP scheme under the optimal clustering size can achieve a much higher average provenance compression rate than that of the ACP scheme.

10. Conclusions

To mitigate the rapid growth of the data provenance size in a large-scale WSN, we propose a cluster-based arithmetic coding provenance scheme (CBP). Compared to the known provenance schemes based on arithmetic coding, the CBP scheme not only yields a higher provenance compression rate but also encodes and decodes the provenance incrementally, which increases the efficiency of both the provenance decoding and the data trust assessment at the BS. Furthermore, the optimal cluster size for the CBP scheme is formally derived. Both the simulation and the experimental results show the effectiveness and the efficiency of our CBP scheme.

Data Availability

All data generated and analyzed during this study are from our simulations and experiments, rather than public repositories online. The simulator we used is TinyOS 2.1.2 TOSSIM. The energy consumption is measured through PowerTOSSIMz. We submitted our simulation code of the ACP scheme and our scheme when we submitted our article. The experiment results are generated by porting the simulation code to the ZigBee sensor nodes. The major changes in the simulation code were related to the I/O functions.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant 61672269 and the Jiangsu Provincial Science & Technology Project under Grant BA2015161.

Supplementary Materials

The supplementary material file we submitted includes the simulation code of the ACP scheme and our scheme. Before running the code, TinyOS 2.1.2 should be installed in a Linux environment (we use Ubuntu) and the related configurations should be made. Almost all the configuration steps can be found on the Internet, so we mention only the main code execution instructions. After entering the specified directory, execute the “make micaz” and “make micaz sim” instructions. After that, execute “python test.py”. The simulation results will be stored in the “log.txt” file. Once PowerTOSSIMz is successfully configured, execute “python postprocessZ.py -powercurses Energy.txt > EnergyPowerCurses.txt”. The energy consumption results will be stored in the “EnergyPowerCurses.txt” file. Before porting the simulation code to the ZigBee sensor nodes, the I/O functions should be changed according to the specific node type.