Special Issue: Security and Privacy Challenges for Internet-of-Things and Fog Computing
Research Article | Open Access
Qinbao Xu, Rizwan Akhtar, Xing Zhang, Changda Wang, "Cluster-Based Arithmetic Coding for Data Provenance Compression in Wireless Sensor Networks", Wireless Communications and Mobile Computing, vol. 2018, Article ID 9576978, 15 pages, 2018. https://doi.org/10.1155/2018/9576978
Cluster-Based Arithmetic Coding for Data Provenance Compression in Wireless Sensor Networks
Abstract
In wireless sensor networks (WSNs), data provenance records the data source and the forwarding and aggregating information of a packet on its way to the base station (BS). To conserve energy and wireless communication bandwidth, the provenance is compressed at each node along the packet path. To compress provenance in resource-tightened WSNs, we present a cluster-based arithmetic coding method that not only achieves a higher compression rate but also encodes and decodes the provenance incrementally; i.e., the provenance can be zoomed in and out like Google Maps. Such a decoding method raises the efficiency of both provenance decoding and data trust assessment. Furthermore, the relationship between the clustering size and the provenance size is formally analyzed, and the optimal clustering size is derived as a mathematical function of the WSN’s size. Both the simulation and the testbed experimental results show that our scheme outperforms the known arithmetic coding based provenance compression schemes with respect to average provenance size, energy consumption, and communication bandwidth consumption.
1. Introduction
Wireless sensor networks (WSNs) are composed of a large number of low-cost, low-power, and randomly distributed wireless sensor nodes (nodes, for short), which monitor physical or environmental data in the detecting areas and cooperatively pass the data to the base station (BS) or a desired actuator through wireless communication. They are widely deployed in many application areas, such as health monitoring, meteorology, and military operations. Because of the diversity of the environment and the large number of sensor types involved, it is essential to evaluate the trustworthiness of the data received at the BS of a WSN before using them to make a decision. In practice, there are documented cases of significant losses caused by the use of faulty data [1].
In a multihop WSN, the provenance of each data packet presents the history of the data acquisition and the actions performed on the data while they are transmitted to the BS. Provenance describes how the data came to be in their current state, including where the data originated, how they were generated, and the operations they have undergone since their generation. Therefore, provenance plays an important role in assessing the data’s trustworthiness [2, 3]. As the number of packet transmission hops increases, the provenance’s size expands rapidly and sometimes far exceeds the size of the data packet itself. WSNs are resource-constrained networks. Because of storage and computational constraints, sensor nodes cannot manipulate a provenance whose size is very large. Sensor nodes are generally powered by batteries, so their energy is limited, and data transmission accounts for the major part of the energy consumption. When the size of the sensed data item is fixed, the packet size mainly depends on the provenance. Besides, the transmission channels do not have sufficient capacity for transmitting large provenances.
As a result, in large-scale WSNs, the provenances generally cannot be transmitted directly and completely because of both the bandwidth and the energy constraints on wireless sensor nodes. For the same reason, the provenance encoding schemes for wired computer networks, e.g., the works of [4–6], are not applicable to WSNs. Hence, several lightweight or compact data provenance schemes [7–9] as well as compression schemes [1, 10, 11] have been proposed. The lightweight schemes drop some less significant information from the provenance, e.g., the provenance graph’s topology, and thereby shorten the provenance size. The compression schemes decrease the provenance size through arithmetic coding [12], LZ77 [13], and LZ78 [14]. Note that, for a given packet path, the achievable provenance compression is bounded by the entropy according to Shannon’s theory. Even though the dictionary-based scheme [10] achieves the highest provenance compression rate to date, the provenance overload problem is inevitable in a large-scale network.
To mitigate the growth of the average provenance size as well as utilize the provenance data efficiently, we propose a CBP (cluster-based provenance) encoding scheme for WSNs. The CBP scheme focuses on encoding and decoding the provenance incrementally at the BS (like Google Maps, the provenance can be zoomed in and out according to the user’s requirement). The specific contributions of the paper are as follows:
(i) We propose a cluster-based lossless provenance arithmetic encoding scheme (CBP) for WSNs. Our approach not only can encode and decode the provenance incrementally but also achieves a higher average provenance compression rate.
(ii) We derive the optimal cluster size for the CBP scheme as a mathematical function of the number of nodes in a WSN.
(iii) We provide a detailed performance analysis, simulation, and experimental evaluation of our CBP scheme and a comparison with other related schemes.
The rest of the paper is organized as follows: Section 2 surveys the related works. Section 3 introduces the system model and the related background. Section 4 gives an overview of our method. Section 5 describes our proposed encoding and decoding approaches for simple and aggregated provenances, respectively. The case study is presented in Section 6. Section 7 theoretically analyzes the performance of our method. Sections 8 and 9 show the simulation and experimental results, respectively. Section 10 concludes the paper.
2. Related Work
Shebaro et al. [9] proposed an in-packet Bloom filter (BF, for short) based lightweight secure provenance scheme, in which every node on the packet’s path is embedded into a fixed-size array through a set of hash functions. A BF is a simple but space-efficient randomized data structure that supports fast membership queries with false positives, where the false positive rate depends on the array’s size. Alam et al. [7] proposed an energy-efficient provenance encoding scheme based on the probabilistic incorporation of the nodes’ IDs into the provenance, in which the entire provenance is scattered over a series of packets that are transmitted along the same route. Therefore, all the sections carried by the packets have to be retrieved correctly at the BS before the provenance can be decoded. The major advantage of such a method is that it limits the size of the provenance attached to a single packet, whereas the drawback is that it has a higher decoding error rate than the methods that encode the entire provenance into a single packet. The above methods are all lossy provenance encoding schemes because the topology information of the WSN is not included.
Hussain et al. [1] proposed an arithmetic coding based provenance scheme (ACP). The ACP scheme assigns a global cumulative probability to each node according to its occurrence probability in the used packet paths. For a given packet path in a WSN, the ACP scheme utilizes the global cumulative probability of the source node as the initial coding interval, and then the cumulative probabilities acquired from the associated probabilities of each pair of connected nodes are used to generate the provenance; i.e., all the node IDs along the path are sequentially encoded into a half-open interval through arithmetic coding. Unlike most of the known provenance schemes, whose average provenance size grows in direct proportion to the number of packet transmission hops, the ACP scheme’s provenance size is decided by the packet path’s occurrence probability in a WSN.
Wang et al. [10] proposed a dictionary-based provenance encoding scheme (DP). The DP scheme treats every packet path as a string of symbols consisting of the node IDs along the packet path. Just like building a dictionary for symbol strings in LZ compression [13, 14], the DP scheme builds the dictionary for packet paths with the used packet paths at every node, and the BS holds a copy of each node’s dictionary. Therefore, the provenance can be encoded as an index or a series of indices of the packet paths at each node along the packet path and be decoded by looking up the dictionaries at the BS. When the topology of the WSN is relatively stable, the provenance size under the DP scheme can be even shorter than the provenance’s entropy; on the contrary, the provenance compression rate increases drastically with the quick change of the WSN’s topology. Wang et al. [11] proposed a dynamic Bayesian network based provenance scheme (DBNP) for WSNs. The DBNP scheme encodes edges instead of node IDs along the packet path into the provenance by overlapped arithmetic coding. Compared to the known lossless provenance schemes, a higher provenance compression rate can be achieved by the DBNP scheme. Furthermore, such a scheme is not sensitive to the WSN’s topology changes. However, applying the overlapped arithmetic coding leads to decoding false positives [15], and extra knowledge must be added to eliminate them. Therefore, it is difficult to make a tradeoff between the acceptable false positive rate and the optimum compression rate.
All the approaches mentioned above focus on mitigating the rapid expansion of the provenance size in WSNs. However, none of them supports incremental provenance encoding and decoding, which can raise the efficiency of both provenance decoding and data trust assessment.
3. Background and System Model
In this section, we provide a brief primer on arithmetic coding. We also introduce the system model applied in this paper. Some of these definitions are partly from our previous work [1, 10, 11].
3.1. Clustering Model
In WSNs, clustering management refers to selecting a series of nodes as clusterheads according to a certain communication protocol, e.g., EEHC (Energy-Efficient Hierarchical Clustering) [16]. A clusterhead aggregates the data generated by the nodes in its cluster and then sends such data to the BS.
Figure 1 shows the two data collection modes in cluster-based WSNs, where the solid circles and the empty circles denote the member nodes of a cluster and the clusterheads, respectively. Figure 1(a) shows an example of single-hop communication between the clusterheads and the BS, which is applied in small-scale WSNs; Figure 1(b) shows an example of multihop communication among the clusterheads and the BS, which is applied in large-scale WSNs because some clusterheads cannot reach the BS in one hop.
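[Figure 1: Data collection modes in cluster-based WSNs. (a) Single-hop communication between the clusterheads and the BS. (b) Multihop communication among the clusterheads and the BS.]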
3.2. Provenance Model
In WSNs, the provenance of a data packet refers to where the data packet is generated and how it is transmitted to the BS [1, 2, 8]. In our provenance model, a data source is a node generating data periodically through the sensors attached to the node; a forwarder is a node that relays the packet toward the BS along the packet path; an aggregator is a node that aggregates two or more data packets received from its upstream nodes into a new one and then sends the new packet toward the BS. The aggregator nodes in our scheme are not designated in advance: while packets are being transmitted toward the BS, only those that fulfill the aggregation conditions are aggregated [17]. Note that such a process aggregates the corresponding provenances accordingly.
Each packet contains the following: (i) a unique packet sequence number; (ii) a clusterhead sequence number; (iii) the data source node ID; (iv) the data value; (v) the provenance; and (vi) a message authentication code (MAC), which binds the provenance and its data together to prevent any unauthorized modification.
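The packet layout above can be sketched as a small record type. This is only an illustration under stated assumptions: the field names, the HMAC-SHA256 construction, and the byte serialization are hypothetical choices, since the text specifies only that a MAC binds the provenance and its data together.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass
class Packet:
    # Field names are hypothetical; they mirror items (i) to (vi) in the text.
    seq_no: int
    clusterhead_seq_no: int
    source_id: int
    value: float
    provenance: bytes
    mac: bytes = b""


def _mac_input(p: Packet) -> bytes:
    # Assumed serialization: header fields joined with "|", provenance last.
    header = [p.seq_no, p.clusterhead_seq_no, p.source_id, repr(p.value)]
    return b"|".join(str(f).encode() for f in header) + b"|" + p.provenance


def seal(p: Packet, key: bytes) -> Packet:
    # Bind the provenance and the data together under a shared key.
    p.mac = hmac.new(key, _mac_input(p), hashlib.sha256).digest()
    return p


def verify(p: Packet, key: bytes) -> bool:
    expected = hmac.new(key, _mac_input(p), hashlib.sha256).digest()
    return hmac.compare_digest(p.mac, expected)
```

With this sketch, changing either the data value or the provenance after sealing causes `verify` to return `False`.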
There are two different kinds of provenance in WSNs. Figure 2(a) presents a simple provenance, where data is generated at leaf node and forwarded by nodes and toward the BS; Figure 2(b) shows an aggregated provenance, where data are aggregated at nodes and on the way to the BS. The aggregated provenance can be presented as a tree through a recursive expression , where denotes the root and and denote the left and the right subtrees, respectively. Therefore, the aggregated provenance in Figure 2(b) has the form
[Figure 2: The two kinds of provenance in WSNs. (a) Simple provenance. (b) Aggregated provenance.]
Without loss of generality, the formal definition of data provenance in a WSN is as follows [11].
Definition 1 (provenance). For a given data packet , the provenance is a directed acyclic graph , each vertex , where and represents the cardinality of the set , is attributed to a specific node , and represents the provenance record for that node. One refers to this relation as ; i.e., node is the host of . Each edge represents a directed edge from vertex to , where . Meanwhile satisfies the following properties: (1) is a subgraph of the WSN; (2) for, , is a child of iff forwards to ; (3) is a set of children of iff, for each , receives data packets from .
3.3. Arithmetic Coding
Arithmetic coding [1, 18, 19] is a lossless data compression method that assigns short code words to the symbols with high occurrence probabilities and leaves the longer code words to the symbols with lower occurrence probabilities. The main idea of arithmetic coding is that each symbol of a message is represented by a half-open subinterval of the initial half-open interval [0, 1), and then each subsequent symbol in the message narrows the interval to a corresponding subinterval according to the symbol’s occurrence probability [19]. Figure 3 shows the process in which a message “bca” is encoded with the probability model in Table 1.

The encoding and decoding operations are as follows:
(1) Encoding: The initial interval is . When the first symbol “b” is encoded, it narrows the interval from to , where is the interval assigned to “b”. After the second symbol “c” is encoded, it narrows the to according to the interval assigned to “c”. Finally, the message “bca” is encoded as the interval .
(2) Decoding: The decoding algorithm utilizes the same probability model in Table 1. With the interval of being encoded, the first symbol “b” is decoded because the is a subinterval of the interval which is the interval assigned to “b”. In what follows, the subinterval of “b” is further divided in the same manner to derive the subsequent symbols until the interval of being decoded is equal to the interval of being encoded, namely, in this example.
The detailed encoding and decoding algorithms of arithmetic coding can be found in [1, 18].
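The interval narrowing described in this section can be sketched as follows. The probability model used here is a hypothetical stand-in and is not Table 1 of the paper; the decoder also assumes the message length is known, which is a simplification of the stopping rule described above (decode until the decoded interval equals the encoded one). Exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

# Hypothetical model (NOT Table 1): symbol -> half-open subinterval of [0, 1).
MODEL = {
    "a": (Fraction(0), Fraction(1, 5)),
    "b": (Fraction(1, 5), Fraction(1, 2)),
    "c": (Fraction(1, 2), Fraction(1)),
}


def encode(message, model=MODEL):
    """Narrow [0, 1) once per symbol and return the final interval."""
    low, high = Fraction(0), Fraction(1)
    for symbol in message:
        width = high - low
        s_low, s_high = model[symbol]
        low, high = low + width * s_low, low + width * s_high
    return low, high


def decode(low, high, length, model=MODEL):
    """Recover `length` symbols from a code point inside [low, high)."""
    code = (low + high) / 2
    out = []
    for _ in range(length):
        for symbol, (s_low, s_high) in model.items():
            if s_low <= code < s_high:
                out.append(symbol)
                # Remove the symbol's effect and rescale to [0, 1).
                code = (code - s_low) / (s_high - s_low)
                break
    return "".join(out)
```

Under this assumed model, "bca" narrows [0, 1) to [1/5, 1/2), then to [7/20, 1/2), and finally to [7/20, 19/50).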
4. Overview of Our Method
In a large-scale WSN, the provenance decoding load at the BS is heavy, which also lowers the efficiency of the data trust evaluation. Layered clustering management provides a good way to manage large-scale WSNs. With a multilayer cluster structure, provenance can be encoded hierarchically, where the final provenance consists of multiple segments. By decoding the provenance segments of the higher layers, the BS obtains coarse provenance information and can then assess the trustworthiness of the data quickly. Whether to decode the provenance on the next layer depends on the current decoding results; i.e., if the BS determines that the data have been tampered with, the decoding stops immediately.
Compared to the ACP scheme, our scheme has the following characteristics:
(1) Because layered clustering management is applied, the provenance on each layer can be encoded as an independent segment, and the final encoded provenance is composed of a series of segments from different cluster layers. When the BS receives a provenance, it decodes the segment from the highest layer first and thereby obtains the provenance information on the most coarse-grained layer, which can be used for a rapid data trust evaluation. Thereafter, the BS continues to decode the provenance step by step. Finally, the BS reconstructs the accurate provenance by combining each segment’s decoding result. Therefore, our scheme can encode and decode the provenance in an incremental manner. Such a decoding method raises the efficiency of both the provenance decoding and the data trust evaluation.
(2) Compared to the ACP scheme, which uses global probabilities to encode provenance, our scheme encodes the provenance through local probabilities which are only valid in a cluster. By using local probabilities, our scheme not only has a higher compression rate but also can update the probability model partially, which raises the provenance’s encoding and decoding efficiencies.
A large WSN can be managed through a multilayer cluster structure, in which the clusterheads of one layer are the member nodes of a cluster on the layer above. As shown in Figure 4, we group the WSN into clusters and then form a two-layer cluster management structure.
In each cluster, a local cumulative probability is assigned to each member node. Note that the local cumulative probability is only valid within its cluster, which is quite different from the ACP scheme using global cumulative probabilities [1]. In Figure 4, the highlighted packet path starts from a data source node in a layer-1 cluster, and then the packet passes through the nodes in a layer-2 cluster before it reaches the BS. Therefore, the provenance can be encoded into two segments for incremental decoding as shown in Figure 5.
When the BS receives the encoded provenance, it can decode the provenance incrementally and reconstruct it by stepwise refinement. From the decoding result of the first provenance segment, the BS derives the packet path on layer-2, and the source node of that path indicates the cluster from which the packet comes. Then the second segment of the provenance is decoded and yields the packet path on layer-1. Finally, the entire packet path is recovered by combining the two decoding results. Note that if the layer-2 decoding shows that the packet has passed through some undependable nodes, the provenance and its packet are dropped without further decoding, which increases the efficiency of both provenance decoding and data trust assessment.
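The stepwise refinement described above can be sketched as a loop over the provenance segments, highest layer first, with an early exit as soon as an undependable node is found. The function and parameter names are hypothetical, and `decode_segment` stands in for the per-segment arithmetic decoder of Section 5.

```python
def decode_incrementally(segments, decode_segment, untrusted):
    """Decode segments ordered from the highest layer to the lowest.

    Returns (full_path, segments_decoded); full_path is None when the
    packet is dropped because an undependable node was found.
    """
    paths = []
    for segment in segments:
        path = decode_segment(segment)
        if any(node in untrusted for node in path):
            # Stop immediately: lower layers are never decoded.
            return None, len(paths) + 1
        paths.append(path)
    # The lowest-layer path is traversed first on the way to the BS.
    full_path = [node for path in reversed(paths) for node in path]
    return full_path, len(paths)
```

In the two-layer example of Figure 4, `segments` would hold the layer-2 segment followed by the layer-1 segment, so a packet that crossed an undependable clusterhead is rejected after decoding only one segment.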
Besides, applying a local cumulative probability rather than a global cumulative probability achieves a higher compression rate compared to the ACP scheme. Furthermore, in contrast to the DBNP scheme [11], we use non-overlapped arithmetic coding, which has no false positives.
Our CBP scheme derives the encoding intervals from each node’s cumulative probability, and these cumulative probabilities do not change drastically when the WSN’s topology changes. Hence, our scheme is robust to the WSN’s topology changes.
5. ClusterBased Provenance Encoding and Decoding
Before introducing our scheme, we first define the main symbols used in the scheme and algorithms. See Table 2 for details.

At the beginning, the BS trains the network for a certain period to get the number of times each node appears on the packets’ paths as the node’s occurrence frequency . During the training process, we let each source node in the network send a certain number of packets to the BS. Upon receiving these packets, the BS computes the occurrence frequencies for each node in the network. Then the local probabilities of each node are computed within their respective clusters. How long the training process takes depends on the WSN’s scale as well as the accuracy requirement. The more packets are used in the training process, the more accurate the model becomes, at the cost of a longer training period.
For each node , its occurrence probability can be computed by the following formula: , where . Thereafter, the BS computes the cumulative occurrence probability for each node by the following equation:
In what follows, the BS calculates the occurrence frequency with which appears next to as the associated frequency .
Because the number of times a node appears on all the packet paths is equal to the number of the packets that the other nodes receive from it, the total associated frequency of node is then equal to its occurrence frequency . Hence, the associated probability . At the BS, the cumulative associated probability for each node is thus derived as
The nodes of the WSNs considered in our scheme stay stationary once they are deployed. We assume that topology changes of the WSNs are slow and infrequent, so they cannot drastically modify the occurrence frequencies and the probabilities assigned to the nodes. To keep the occurrence frequencies and the probabilities as accurate as possible, the probability model is updated periodically or on request by the BS.
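The training step can be sketched as below for a single cluster. Several points are assumptions rather than the paper's definitions: the names `op`, `cop`, `ap`, and `cap` are hypothetical labels for the occurrence, cumulative occurrence, associated, and cumulative associated probabilities; each cumulative value is taken as the sum of the probabilities of the nodes that precede the node in a fixed cluster order, so that a node owns the interval [cumulative, cumulative + probability); and the associated probabilities are normalized by the number of observed successors, which the text equates with the occurrence frequency when every hop is counted.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def build_cluster_model(paths, cluster):
    """paths: training paths (lists of node IDs, source first).
    cluster: node IDs of one cluster in a fixed order."""
    of = Counter(node for path in paths for node in path)
    total = sum(of[node] for node in cluster)
    op = {node: Fraction(of[node], total) for node in cluster}

    cop, running = {}, Fraction(0)
    for node in cluster:            # exclusive prefix sums (assumption)
        cop[node] = running
        running += op[node]

    af = defaultdict(Counter)       # af[i][j]: times j appears next to i
    for path in paths:
        for i, j in zip(path, path[1:]):
            af[i][j] += 1

    ap, cap = {}, {}
    for i, successors in af.items():
        out_total = sum(successors.values())
        ap[i], cap[i], running = {}, {}, Fraction(0)
        for j in cluster:
            if successors[j]:
                ap[i][j] = Fraction(successors[j], out_total)
                cap[i][j] = running
                running += ap[i][j]
    return op, cop, ap, cap
```

Because every probability is local to the cluster, a topology change inside one cluster only requires rebuilding that cluster's tables.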
5.1. Simple Provenances Encoding and Decoding
Along a packet transmission path, in our simple provenance encoding algorithm (see Algorithm 1), the initial coding interval at the data source is [0, 1). In what follows the interval is used to denote the provenance at the (x−1)th node. When the provenances come from different clusters, the ID of the clusterhead is attached to distinguish the provenances that may share the same coding interval.

A clusterhead of the first layer may play the role of a data source, forwarder, or aggregator node on the second layer. The time complexity of the compression at a clusterhead is the same as that at a data source, forwarder, or aggregator node, whereas the space complexity, which depends on the layer where the clusterhead is located, is at least doubled.
(1) If node is the data source with the occurrence probability and the cumulative probability , the interval is then encoded as follows:
(2) If node is a forwarder node which receives an interval from the (x−1)th node , the provenance is then encoded as follows:
Although real numbers are used in our algorithms to represent the values of the probabilities and the intervals, at each sensor node the real numbers are replaced by integers to fit the limited computational capability. Therefore, to meet the demand for increasing precision as well as avoid transmitting duplicated data, when the most significant digits of the two numbers that define the interval are identical, the most significant digit is shifted out and stored in a buffer. For example, the interval is represented as , where is the buffer.
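The digit shifting can be illustrated with the textbook integer form of arithmetic coding. This is a sketch under assumptions and not necessarily the representation used on the sensor nodes: the bounds are held as fixed-width decimal integers, the upper bound is treated as inclusive, and a 9 is shifted into the upper bound while a 0 is shifted into the lower bound.

```python
def shift_common_digits(low, high, digits=4, buffer=""):
    """Move shared leading decimal digits of low/high into a buffer.

    low and high are integers with `digits` decimal digits (low <= high);
    e.g. low=3500, high=3799 stands for the interval [0.3500, 0.3799].
    """
    scale = 10 ** (digits - 1)
    while low // scale == high // scale:
        buffer += str(low // scale)         # settled digit: store it once
        low = (low % scale) * 10            # shift in a 0
        high = (high % scale) * 10 + 9      # shift in a 9
    return low, high, buffer
```

For instance, the bounds 3500 and 3799 share the leading digit 3, so the call yields the buffer "3" and the rescaled bounds 5000 and 7999, which keeps the working precision constant however long the path grows.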
Upon receiving a packet, the BS recovers the provenance through the encoded provenance ; refer to Algorithm 2. The midpoint of the and the is selected using (6) as the flag code to locate the data source node’s interval, and the data source node is then retrieved. Thereafter, the data source node’s effect is removed from the interval and the next node on the packet path is retrieved through the new flag code by (7), where and denote the cumulative occurrence probability and the occurrence probability of the node being decoded, respectively. Furthermore, the clusterhead’s ID is used to exclude the nodes that do not belong to such a cluster.
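A round trip through the simple provenance encoding and decoding can be sketched as follows. The probability tables are hypothetical, the names `OP`, `COP`, `AP`, and `CAP` are illustrative labels, and the update rules are one reading of the description above: the source takes the interval [COP, COP + OP), each forwarder narrows the received interval by its associated probability, and the decoder uses the interval midpoint as the flag code and rescales it after each decoded node. Decoding stops when re-encoding the decoded prefix reproduces the received interval, which assumes no transition has probability exactly 1.

```python
from fractions import Fraction as F

# Hypothetical local model for one cluster (not taken from the paper).
OP = {"n1": F(1, 4), "n2": F(1, 4), "n3": F(1, 2)}
COP = {"n1": F(0), "n2": F(1, 4), "n3": F(1, 2)}
AP = {"n1": {"n2": F(1, 2), "n3": F(1, 2)},
      "n2": {"n1": F(1, 4), "n3": F(3, 4)}}
CAP = {"n1": {"n2": F(0), "n3": F(1, 2)},
       "n2": {"n1": F(0), "n3": F(1, 4)}}


def encode_path(path):
    """Encode a simple provenance, source node first."""
    low, high = COP[path[0]], COP[path[0]] + OP[path[0]]
    for prev, node in zip(path, path[1:]):
        width = high - low
        low = low + width * CAP[prev][node]
        high = low + width * AP[prev][node]
    return low, high


def decode_path(interval):
    """Recover the path from the interval using a midpoint flag code."""
    low, high = interval
    flag = (low + high) / 2
    src = next(n for n in OP if COP[n] <= flag < COP[n] + OP[n])
    flag = (flag - COP[src]) / OP[src]      # remove the source's effect
    path = [src]
    while encode_path(path) != interval:
        prev = path[-1]
        nxt = next(j for j in AP[prev]
                   if CAP[prev][j] <= flag < CAP[prev][j] + AP[prev][j])
        flag = (flag - CAP[prev][nxt]) / AP[prev][nxt]
        path.append(nxt)
    return path
```

With these tables, the path n1, n2, n3 is encoded as [1/32, 1/8), and the decoder recovers the same three nodes from the midpoint 5/64.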

5.2. Aggregated Provenances Encoding and Decoding
Without loss of generality, at an aggregator node assume that there are two provenance intervals , . The aggregator node encodes its ID into both the and the , which yields the new intervals and , respectively. The aggregator node compares the lengths of the two intervals and and then sets the longer one as the first interval which indicates that it is the active interval. Thereafter, the aggregator node randomly chooses a real number in the shorter interval and sends with the active interval to the next hop. Finally the aggregated provenance has the form ; refer to Algorithm 3.
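The aggregation step can be sketched as below. The helper `encode_id` is a hypothetical callable that narrows an interval with the aggregator's ID (as a forwarder would), and returning a pair of the active interval and the chosen number is an assumed representation of the aggregated provenance. The random number is drawn strictly inside the shorter interval.

```python
import random
from fractions import Fraction as F


def aggregate(interval_a, interval_b, encode_id, rng=random):
    """Merge two provenance intervals at an aggregator node."""
    a = encode_id(interval_a)
    b = encode_id(interval_b)
    # The longer interval becomes the active one.
    if a[1] - a[0] >= b[1] - b[0]:
        active, shorter = a, b
    else:
        active, shorter = b, a
    low, high = shorter
    # Pick a number strictly between the bounds of the shorter interval.
    point = low + (high - low) * F(rng.randrange(1, 1000), 1000)
    return active, point
```

Keeping the longer interval active is consistent with arithmetic coding in general, where a wider interval needs fewer digits to represent, while the shorter interval is collapsed to a single number that still identifies it.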

The BS decodes an aggregated provenance as the series of simple provenances of which it consists (see Algorithm 4), where the simple provenances are packet path sections: (1) from a data source to an aggregator node, (2) from an aggregator node to another aggregator node, (3) from an aggregator node to the BS.
6. Case Study
In this section, two cases are provided. Without loss of generality, the WSN is organized in two levels in these cases. Our provenance scheme uses hierarchical arithmetic coding to compress provenance within and among the clusters. When the packets pass through the clusterheads, the provenance is encoded at a higher level. Figure 6 shows the topology of a WSN composed of four clusters in layer-1. The four clusterheads, namely, , , , and , collect packets within their clusters and then transmit the packets to the BS.
In Figure 6, the initial probabilities for the member nodes and the clusterheads were assigned evenly; refer to Tables 3 and 4.


For simplicity, we take a simple provenance and an aggregated provenance that were generated in cluster A as the examples to illustrate how our provenance encoding scheme works.
In Figure 7, the values assigned to the directed edge represent the associated probabilities between the nodes. Based on the data from Tables 3 and 4, the cumulative associated probabilities assigned to the nodes and the clusterheads were derived by (3).
6.1. Simple Provenance
Assume that the simple provenance to be encoded is . As the occurrence probability’s range of the data source is , then sends the provenance as to its next node , where denotes that the buffer is now empty. Node then derives the new coding interval through its associated probability and cumulative associated probability with respect to . According to Algorithm 1, the new coding interval at is equal to . Similarly, the new interval at is encoded as and then the provenance is updated as when being sent to . Because is a clusterhead, it simply adds its cluster ID to the provenance and updates the provenance as .
Because the occurrence probability’s range of the data source is , sends the provenance as with to its next node . Because is a clusterhead, it only updates the provenance at the higher level, i.e., the clusterhead level, and then sends the updated provenance in the form of , to the BS. Table 5 shows the encoding processes of the simple provenance at each node along the packet path.

When the packet arrives, the BS parses the attached provenance. Because there are two cluster levels, the provenance has two sections accordingly. For the first section, namely, , the BS extracts the interval and the buffer and then derives that which belongs to the cumulative probability of . The data source node of the first section is thus . Thereafter, the BS removes the effect of from the and then derives that is equal to 0.26. Because the associated probability of with respect to is , is yielded from the decoding. The BS stops decoding the first section because the packet was received from and yields the decoding result of layer-2 as . With the second section, the BS extracts that the clusterhead ID is , the coding interval is , and the buffer is . The BS then derives that which belongs to the occurrence probabilities of , , , and . Note that the clusterhead’s ID is ; the data source is thus . Therefore, the BS removes the ’s effect from the and derives that is equal to 0.625. Because the associated probability’s range of with respect to is , is yielded from the decoding. The BS continues decoding until the last node is yielded. The decoding result is then . Finally, the BS combines the incremental decoding results and then knows the provenance is .
To explain the decoding process more clearly, Table 6 shows the steps of the simple provenance decoding at the BS. The BS decodes the first part of the provenance and reconstructs the data packet path on layer-2, which is composed of the clusterheads of the clusters on layer-1 (see Figure 6). According to the layer-2 decoding result, the BS knows from which cluster this data packet comes. If the BS finds any undependable information, such as tampered nodes or a faked packet path, in the layer-2 decoding, it discards the packet and stops decoding, which raises the provenance decoding efficiency as well as the data trust assessment efficiency.

6.2. Aggregated Provenance
In Figure 7, now we consider the encoding and decoding of an aggregated provenance . According to Algorithm 3, at the beginning, the data source sends (referred to as ) with a packet to its next node . At the moment, the data source also sends as the provenance with packet to its next node , where the new coding interval (referred to as ) is derived.
At the first aggregator node , the two packets and are met. Because there are two coding intervals, i.e., and , encodes its ID into both of them and makes the new intervals as = and = , respectively. Thereafter, compares the lengths of and and chooses a random real number belonging to the shorter interval. Subsequently, updates the aggregated provenance to the form of , where , and sends it with the new aggregated packet to . Because is a clusterhead, it simply adds its ID to the provenance and updates the provenance to the form of . Finally, the provenance of the aggregated packet is at the BS. Figure 8 shows the encoding of the aggregated provenance at each node along the packet path.
When the packet arrives, first, the BS decodes the foremost part of the provenance, which yields . Second, the BS decodes 0.65 in the provenance according to Algorithm 4, because which belongs to the occurrence probability ranges of , , , and . Note that the clusterhead’s ID is ; the data source is thus decoded as . In what follows, the BS removes the ’s effect from the and yields