Research Article | Open Access
Arunanshu Mahapatro, Pabitra Mohan Khilar, "An Adaptive Approach to Discriminate the Persistence of Faults in Wireless Sensor Networks", International Scholarly Research Notices, vol. 2012, Article ID 342461, 13 pages, 2012. https://doi.org/10.5402/2012/342461
An Adaptive Approach to Discriminate the Persistence of Faults in Wireless Sensor Networks
This paper presents a parametric fault detection algorithm that can discriminate the persistence (permanent, intermittent, and transient) of faults in wireless sensor networks. The main characteristic of these faults is the amount of time for which the fault appears. We adopt this state-holding time to discriminate transient from intermittent faults. A neighbor-coordination-based approach is adopted, where faulty sensor nodes are detected based on comparisons between neighboring nodes and dissemination of the decision made at each node. Simulation results demonstrate the robustness of the work at varying transient fault rates.
Node failures and environmental hazards cause frequent topology change, communication failure, and network partition. Such perturbations are far more common in wireless sensor networks (WSNs) than in traditional wireless networks. The extent of such perturbations depends on the persistence of faults. Based on persistence, faults can be classified as transient, intermittent, or permanent. A transient fault will eventually disappear without any apparent intervention, whereas a permanent one will remain unless it is removed by some external agency. After their first appearance, the rate of fault appearance is relatively high for intermittent faults, and finally the intermittent faulty nodes tend to become permanent [2, 3]. Permanent or hard faults are software or hardware faults that always produce errors when they are fully exercised.
In fact, experimental studies have shown that more than of the faults that occur in real systems are transient or intermittent faults [3, 5, 6]. These faults are more severe from both the data aggregation and the network lifetime perspectives, and they are much more problematic to diagnose and handle. In contrast, permanent faults are considerably easier to diagnose and handle. Since the effect of such faults is not always present, detection of intermittent or transient faults requires repetitive testing at discrete times, in contrast to the single test needed to detect permanent faults.
Discrimination of transient from intermittent or permanent faults is crucial, as a sensor node with a transient fault should not necessarily be isolated, although the unstable environment might warrant a temporary shutdown. A discrimination between transient and intermittent or permanent faults solves the following key problems.
Effective Bandwidth Utilization
By isolating permanent faults, the traffic generated by the permanent faulty nodes is restricted.
Effective Energy Utilization
The depletion of sensor node battery energy in forwarding the erroneous data generated by permanent faults can be avoided. Isolation of sensor nodes with transient faults will reduce available sensor nodes in the network. This in turn increases the work load of each sensor node, thus leading to faster depletion of sensor node battery energy and impacting network lifetime.
Network Coverage and Connectivity
Isolation of fault-free nodes with transient faults will reduce the available sensor nodes in the network thus impacting network coverage and connectivity.
Coverage of the Network Fault Hypothesis 
The assumption on the number of faults tolerated by the detection algorithm within a given time window is affected by isolation of fault-free nodes with transient faults.
These issues motivate the need to design an efficient fault discrimination algorithm suitable for WSNs. To discriminate transient from intermittent faults, this work is motivated by the count-and-threshold mechanism adopted in . Similar to , our approach uses two counters, namely, a reward counter () and a penalty counter (), to discriminate fault types with low latency and low energy overhead. A node detected as faulty enters the observation state. Unlike , we first tune the intertest interval to detect the presence of a fault with minimum test repetition. Second, we adopt a two-state Markov chain to model fault appearance and disappearance. We consider the time a node spends in the fault disappearance state (sojourn time) to tune the detection parameters. This paper demonstrates an effective means to discriminate faults based on their persistence by properly tuning the detection parameters. We consider the following detection parameters.
(i) Intertest Interval
The time interval of two consecutive sensor measurements.
(ii) Reward Counter Threshold
The number of diagnostic rounds in which a node under observation shows expected behavior, after which the node is reintegrated into the network.
(iii) Penalty Counter Threshold
The number of correlated diagnostic rounds, after which a node gets isolated.
(iv) Adaptive Penalty Increments
The penalties assigned after a fault is detected.
The following performance metrics are used to tune the mentioned detection parameters.
(a) Accuracy is the probability that a fault-free node with a transient fault, entering the observation phase in the error-free state, is not isolated.
(b) Coverage is the probability that an intermittent faulty node, entering the observation phase in the error-free state, is isolated.
(c) Number of test repetitions is the number of times the test is repeated to discriminate transient from intermittent or permanent faults.
The main contributions of this paper are as follows.
(a) We extend the basic design with two practical considerations. First, we propose a diagnosis scheme that identifies faults with high detection accuracy and a low false alarm rate while maintaining low latency, low energy overhead, and less dependency on the average node degree. Second, we propose a robust method to accommodate channel faults.
(b) Our approach discriminates transient from intermittent faults, which in turn maintains a low false alarm rate.
(c) We propose an adaptive increment-based scheme to reduce the detection latency.
(d) Our fault diagnosis algorithm imposes a negligible extra cost in a network where diagnostic messages are sent as the output of the routine tasks of the network.
The remainder of the paper is organized as follows. Section 2 presents related works. Section 3 presents the system model. Distributed diagnosis algorithm is investigated in Section 4. Simulation results are presented in Section 5, and finally conclusions are given in Section 6.
2. Related Work
The context of sensor networks and the nature of sensor data make the design of an efficient fault diagnosis technique more challenging. The conventional fault diagnosis techniques devised for wired interconnected networks [8–13] might not be suitable for WSNs for reasons, namely, resource constraints, random deployment, dynamic network topology, attenuation, and signal loss.
The problem of identifying faulty nodes (crashed) in WSN has been studied in . This paper proposes the WINdiag diagnosis protocol which creates a spanning tree for dissemination of diagnostic information. Chen et al.  proposed a localized fault detection algorithm to identify the faulty sensors. It uses local comparisons with a modified majority voting, where each sensor node makes a decision based on comparisons between its own sensor reading (such as temperature) and sensor reading of one-hop neighbors, while considering the confidence level of its one-hop neighbors. The performance of such an approach depends on the average node degree of the network. Jiang  claimed an improvement over the aforementioned scheme  by introducing an improved distributed fault detection scheme (DFD).
The two-phase neighbor coordination scheme is suggested by Hsin and Liu , where a node waits for its neighbors to update information concerning the faulty node in the first phase. It uses the second phase to consult with its neighbors to reach a more accurate decision. Agnostic Diagnosis (AD) , an online lightweight failure detection approach, is motivated by the fact that the system metrics (e.g., radio on time and number of packets transmitted) of sensors usually exhibit certain correlation patterns.
FIND detects nodes with data faults. It ranks the nodes based on their measurements as well as their physical distances from the event. A node is detected as faulty if there is a significant mismatch between the sensor data ranks and its readings violate the distance monotonicity significantly. Gao et al. approached WSN fault detection problems by suggesting a weighted median fault detection scheme (WMFDS), which primarily focused on soft faults. Krishnamachari et al. have presented a Bayesian fault recognition model to solve the fault-event disambiguation problem in sensor networks.
Most of the fault detection schemes [19, 23–25] are designed to detect permanent faults. In , a class of count-and-threshold mechanisms collectively named $\alpha$-count is suggested, which are able to discriminate between transient faults and intermittent faults in computing systems. The authors have presented a single-threshold scheme and a better performing double-threshold scheme. Serafini et al. proposed to use a count-and-threshold algorithm on top of the diagnostic protocol to reduce the likelihood of isolation and increase the availability of fault-free nodes in case of external transient faults. Their approach uses two values, a penalty counter and a reward counter, to discriminate transient from intermittent faults. They use a discrete-time Markov chain (DTMC) to model the behavior of the proposed algorithm. Both approaches are designed, analyzed, and tested on wired interconnected networks. Thus, the parameters tuned may not be applicable to wireless sensor networks. For instance, in , . Such a small value for cannot be adopted in WSNs since it requires frequent exchange of data, thereby impacting the network lifetime.
Lee and Choi approached WSN fault detection problems where nodes with malfunctioning sensors are allowed to act as communication nodes for routing, but they are logically isolated from the network as far as fault detection is concerned. Only those sensor nodes with a permanent fault in the transceiver (including lack of power) are to be removed from the network. Time redundancy is used to tolerate transient faults in sensing and communication. However, detection of transient faults is not addressed. Their scheme uses two thresholds to check whether a node is permanently faulty or fault-free with transient faults. Their scheme does not answer questions such as how many tests are required to discriminate the faults and what the intertest interval should be.
In summary, most of the proposed diagnosis approaches are designed to detect permanent faults in sensor networks. Some techniques are proposed to tolerate transient faults in fault detection. Though some techniques investigate the fault discrimination problem, they are designed for wired interconnected networks. To the best of our knowledge, this is the first attempt in discriminating transient from intermittent or permanent faults in WSNs.
3. System Model
3.1. Network Model
The proposed algorithm considers a network with sensor nodes nonuniformly distributed in a square area of side , which is much larger than the communication range () of the sensor nodes. Every node maintains a neighbor table . Each sensor periodically produces information as it monitors its vicinity. Similar to , nodes with malfunctioning sensors are allowed to act as communication nodes for routing. However, these nodes are asked to switch off their sensors. Only those sensor nodes with a permanent fault in the transceiver or power supply are to be removed from the network.
3.2. Energy Consumption Model
Similar to , this work assumes a simple model for the radio hardware energy dissipation where the transmitter dissipates energy to run the radio electronics and the power amplifier, and the receiver dissipates energy to run the radio electronics. Both the free space ($d^{2}$ power loss) and the multipath fading ($d^{4}$ power loss) channel models are used, depending on the distance between the transmitter and receiver. The energy spent for transmission of an $l$-bit packet over distance $d$ is
$$E_{Tx}(l, d) = \begin{cases} l E_{elec} + l \epsilon_{fs} d^{2}, & d < d_{0}, \\ l E_{elec} + l \epsilon_{mp} d^{4}, & d \ge d_{0}. \end{cases}$$
The electronics energy, $E_{elec}$, depends on factors such as the digital coding and modulation, whereas the amplifier energy, $\epsilon_{fs} d^{2}$ or $\epsilon_{mp} d^{4}$, depends on the transmission distance and the acceptable bit-error rate. To receive this message, the radio expends energy:
$$E_{Rx}(l) = l E_{elec}.$$
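A minimal sketch of this radio energy model follows, assuming the widely used first-order form in which the amplifier term grows with the square of the distance below a crossover distance and with its fourth power above it. The numeric constants and the names `E_ELEC`, `EPS_FS`, `EPS_MP`, and `D0` are illustrative assumptions, not values taken from this paper.

```python
import math

# Illustrative constants (assumptions, not the paper's simulation parameters).
E_ELEC = 50e-9        # electronics energy per bit (J/bit)
EPS_FS = 10e-12       # free-space amplifier energy (J/bit/m^2)
EPS_MP = 0.0013e-12   # multipath amplifier energy (J/bit/m^4)
D0 = math.sqrt(EPS_FS / EPS_MP)  # assumed crossover distance (m)


def tx_energy(bits: int, distance: float) -> float:
    """Energy in joules to transmit `bits` bits over `distance` metres."""
    if distance < D0:
        # Free-space model: amplifier cost grows with distance squared.
        return bits * E_ELEC + bits * EPS_FS * distance ** 2
    # Multipath model: amplifier cost grows with distance to the fourth power.
    return bits * E_ELEC + bits * EPS_MP * distance ** 4


def rx_energy(bits: int) -> float:
    """Energy in joules to receive `bits` bits (electronics only)."""
    return bits * E_ELEC
```

Under these assumed constants, a short hop costs little more than the electronics energy, while a hop beyond the crossover distance is dominated by the amplifier term.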
3.3. Fault Model
After a fault is activated, we consider that sensor nodes can either continue with the faulty behavior or alternate between periods of correct and faulty behavior. In the latter case, faults are observable for a time, termed the fault appearance duration (FAD), before they disappear. Eventually, faults may reappear either because of new transient faults or correlated intermittent faults. The time during which the fault disappears is termed the fault disappearance duration (FDD). Intermittent faults, after their first appearance, exhibit a high occurrence rate and eventually tend to become permanent. For an intermittent fault, the number of time units (sojourn time) that the node remains in the fault appearance state is less than or equal to the number of time units that it spent in the previous fault appearance state. We use this sojourn time to model subsequent failures of the sensor node over time. Similar to , a node is unhealthy if it has internal faults and fails in a permanent or intermittent manner. A node is healthy if it fails only on external intervention such as electromagnetic radiation. The proposed algorithm assumes that the sensor fault probability is uncorrelated and symmetric; that is, where is the sensor measurement (say temperature) and is the actual ambient temperature.
3.4. Channel Model
The model used for the channel is a two-state Gilbert-Elliott channel (two-state Markov channel model) [27, 28] with two states: G (good) and B (bad). This model describes errors on the bit level. In the good state, the bits are received incorrectly with probability $P_{eG}$, and in the bad state, the bits are received incorrectly with probability $P_{eB}$. For this model, it is assumed that $P_{eG} \ll P_{eB}$. The transition probabilities $P_{GB}$ and $P_{BG}$ will be small, and the probability of remaining in G or B is large. The steady-state probability of a channel being in the bad state is $P_{B} = P_{GB} / (P_{GB} + P_{BG})$. Thus, the average bit error probability of the channel is $P_{e} = P_{eB} P_{B} + P_{eG} (1 - P_{B})$. For the simulations, this work uses this model to generate error patterns independently for all channels between nodes.
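The two-state channel can be sketched as below. This is an illustration under assumed names (`p_gb`, `p_bg`, `e_good`, `e_bad` are hypothetical parameter names for the good-to-bad and bad-to-good transition probabilities and the per-state bit error probabilities), not the simulator code used in this work.

```python
import random
from typing import List


def steady_state_bad(p_gb: float, p_bg: float) -> float:
    """Long-run fraction of time the two-state chain spends in the bad state."""
    return p_gb / (p_gb + p_bg)


def average_bit_error(p_gb: float, p_bg: float, e_good: float, e_bad: float) -> float:
    """Average bit error probability, weighting each state by its steady-state share."""
    p_bad = steady_state_bad(p_gb, p_bg)
    return e_bad * p_bad + e_good * (1.0 - p_bad)


def simulate_errors(n_bits: int, p_gb: float, p_bg: float,
                    e_good: float, e_bad: float, seed: int = 0) -> List[bool]:
    """Generate a per-bit error pattern (True means the bit is corrupted)."""
    rng = random.Random(seed)
    bad = False  # start in the good state
    pattern = []
    for _ in range(n_bits):
        pattern.append(rng.random() < (e_bad if bad else e_good))
        # Move to the other state with the corresponding transition probability.
        if bad:
            bad = not (rng.random() < p_bg)
        else:
            bad = rng.random() < p_gb
    return pattern
```

Calling `simulate_errors` with a different `seed` per link gives independent error patterns for each channel.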
3.5. Definitions and Terminologies
The performance parameters used to measure the effectiveness of the proposed detection algorithm are as follows.
(i) Detection accuracy (DA) is defined as the ratio of the number of faulty sensor nodes diagnosed as faulty to the total number of faulty sensor nodes in the network.
(ii) False alarm rate (FAR) is defined as the ratio of the number of fault-free sensor nodes diagnosed as faulty to the total number of fault-free nodes in the network.
(iii) Network lifetime is the number of data-gathering rounds completed when the first node dies due to depletion of its battery.
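As a small illustration of the first two metrics, the sketch below computes DA and FAR from sets of node identifiers. The function names and the set-based representation are assumptions made for this example.

```python
from typing import Set


def detection_accuracy(diagnosed_faulty: Set[int], actual_faulty: Set[int]) -> float:
    """Fraction of truly faulty nodes that were diagnosed as faulty."""
    if not actual_faulty:
        return 1.0  # nothing to detect
    return len(diagnosed_faulty & actual_faulty) / len(actual_faulty)


def false_alarm_rate(diagnosed_faulty: Set[int], actual_faulty: Set[int],
                     all_nodes: Set[int]) -> float:
    """Fraction of fault-free nodes that were wrongly diagnosed as faulty."""
    fault_free = all_nodes - actual_faulty
    if not fault_free:
        return 0.0  # no fault-free node can be misdiagnosed
    return len(diagnosed_faulty & fault_free) / len(fault_free)
```

For ten nodes of which four are faulty, diagnosing three of the faulty nodes plus one fault-free node gives a DA of 0.75 and a FAR of 1/6.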
4. The Fault Detection Framework
The fault detection framework consists of two phases, namely, the fault detection phase and the isolation phase. The fault detection phase exploits the fact that sensor faults are likely to be stochastically unrelated, while sensor measurements are likely to be spatially correlated. In WSNs, sensors from the same region should have recorded similar sensor readings. For example, let $v_j$ be a neighbor of $v_i$, and let $x_i$ and $x_j$ be the sensor readings of $v_i$ and $v_j$, respectively. Sensor reading $x_i$ is similar to $x_j$ when $|x_i - x_j| < \delta$, where $\delta$ is application dependent. As an illustration, in bolt loosening monitoring, a sensor node and its neighbors are expected to have similar voltages. Similarly, in the case of temperature monitoring, a sensor node and its neighbors are expected to have similar temperature readings. Hence, $\delta$ is expected to be a small number. In the proposed approach, sensor nodes coordinate with their one-hop neighbors to detect faulty sensor nodes before conferring with the central node. Therefore, this design reduces communication messages and, subsequently, conserves sensor node energy. The isolation phase uses a count-and-threshold-based approach to isolate unhealthy nodes (see Algorithm 1).
4.1. Fault Detection Algorithm
In this approach, each node in the network broadcasts its sensor reading periodically by using a round-based message dissemination protocol. Upon receiving the sensor readings of one-hop neighbors, a node constructs a set () of nodes with similar reading . The node is detected fault-free if reading agrees with and the cardinality of set is greater than the threshold (). Otherwise, is marked as soft faulty. The optimal value for is , where is the number of neighbors. The node detects node as hard faulty, if does not receive the sensor reading from before . should be chosen carefully so that all the fault-free nodes must report node before . A node detected as soft faulty is not immediately isolated from the network. The node is allowed to take part in the network activities; however, the node is asked to switch off its sensor. This node next enters the observation stage. The node will be isolated if it is detected as faulty in the observation stage. Otherwise, it is reintegrated to the network and is asked to switch on its sensor. A description of fault detection is given in Algorithm 1 (see Algorithm 2).
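The local comparison step can be sketched as follows. This is a hedged illustration, not the paper's Algorithm 1: the function name `diagnose`, the tolerance `delta`, and the default threshold of half the neighbor count are assumptions, and a reading of `None` stands for a neighbor whose message did not arrive before the timeout.

```python
from typing import Dict, List, Optional, Tuple


def diagnose(own_reading: float,
             neighbor_readings: Dict[str, Optional[float]],
             delta: float,
             theta: Optional[float] = None) -> Tuple[str, List[str]]:
    """Return this node's status and the neighbors it regards as hard faulty."""
    # Neighbors that stayed silent until the timeout are treated as hard faulty.
    hard_faulty = sorted(n for n, r in neighbor_readings.items() if r is None)
    # Neighbors whose readings agree with ours within the tolerance.
    similar = [n for n, r in neighbor_readings.items()
               if r is not None and abs(own_reading - r) < delta]
    if theta is None:
        theta = len(neighbor_readings) / 2  # assumed threshold, for illustration
    status = "fault-free" if len(similar) > theta else "soft-faulty"
    return status, hard_faulty
```

A node reading 25.0 with three agreeing neighbors out of four is classified as fault-free, while a node reading 40.0 among the same neighbors is marked soft faulty and would proceed to the observation stage.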
4.2. Isolation of Unhealthy Nodes
A node detected as faulty for the first time by Algorithm 1 enters the observation stage. This phase decides whether to isolate the node from the network (intermittent faulty) or to reintegrate the node to the network (fault-free with a transient fault). In this phase, the node under observation first initializes the penalty counter to one and the reward counter to zero. The node does not take any sensor reading. At the discrete time , it receives the sensor readings of its one-hop neighbors and executes Algorithm 1. If a fault appears and is detected at time , the node first resets the reward counter. Second, it compares the present fault disappearance duration () with the preceding fault disappearance duration . If the present duration is smaller than the preceding one, the penalty counter is incremented by the adaptive penalty increment; otherwise, the penalty counter is incremented by one. This is because intermittent faults usually exhibit a relatively fast occurrence rate. If the penalty counter exceeds its threshold (), the node is isolated from the network. Similarly, if the reward counter exceeds its threshold (), the node is reintegrated to the network.
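The observation stage can be sketched as a small state machine. The class and field names below are hypothetical, the first fault seen during observation is assumed to add a penalty of one because no preceding disappearance duration exists yet, and durations are counted in test rounds; none of these details is confirmed by the paper.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Observation:
    reward_threshold: int
    penalty_threshold: int
    increment: int                      # adaptive penalty increment (assumed name)
    penalty: int = 1                    # initialised to one on entering observation
    reward: int = 0                     # initialised to zero on entering observation
    current_fdd: int = 0                # fault-free rounds since the last fault
    previous_fdd: Optional[int] = None  # preceding fault disappearance duration

    def step(self, fault_detected: bool) -> str:
        """Process one test round and return 'isolate', 'reintegrate' or 'observe'."""
        if fault_detected:
            self.reward = 0
            # A shrinking disappearance duration suggests an intermittent fault.
            if self.previous_fdd is not None and self.current_fdd < self.previous_fdd:
                self.penalty += self.increment
            else:
                self.penalty += 1
            self.previous_fdd = self.current_fdd
            self.current_fdd = 0
        else:
            self.reward += 1
            self.current_fdd += 1
        if self.penalty > self.penalty_threshold:
            return "isolate"
        if self.reward > self.reward_threshold:
            return "reintegrate"
        return "observe"


def run(obs: Observation, outcomes: Iterable[bool]) -> str:
    """Feed test outcomes until a decision is reached."""
    for fault_detected in outcomes:
        decision = obs.step(fault_detected)
        if decision != "observe":
            return decision
    return "observe"
```

With thresholds of 3 (reward) and 4 (penalty), four consecutive fault-free rounds reintegrate the node, whereas faults that recur at shrinking intervals isolate it sooner when the increment is 2 than when it is 1.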
5. Simulation Experiments
The performance of the proposed scheme is evaluated through simulations in this section. This work uses Castalia-2.3b, a state-of-the-art WSN simulator based on the OMNeT++ platform. The simulation parameters are given in Table 1. For these simulations, energy is consumed whenever a sensor transmits or receives data or performs data aggregation.
5.1. Experiment 1: DA and FAR in Regard to and
In this experiment, the performance of the diagnosis algorithm in regard to DA and FAR is evaluated. In this simulation, sensor nodes are assumed to be faulty with probabilities of 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30, respectively. The range is chosen for the sensor network to have the desired average node degree (). In this experiment, faults are assumed to be permanent. Since a faulty node will often report unusually high or low sensor measurements, all the nodes with malfunctioning sensors are momentarily assumed to show a match in comparison with a probability of 0.5 regardless of their locations.
The DA and FAR at varying fault rates and average node degrees are shown in Figure 1. A high level of DA (>0.96) is reported even when the fault rate is as high as 0.2 and (i.e., sparse network). This is because the detection algorithm wrongly detects a fault-free node as faulty only when it has more than number of faulty one-hop neighbors. In addition, the proposed approach uses an optimal value for . As expected, an improvement in DA is observed for higher values of . A very low level (<0.004) of FAR is reported for even when the fault rate is as high as . The reason is that a faulty node is detected as fault-free only when it has more than number of faulty one-hop neighbors, and all produce the same faulty reading. The probability of the mentioned number is very low and decreases for an increase in .
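The argument that a misdiagnosis needs many simultaneously faulty one-hop neighbors can be checked numerically. The sketch below assumes independent node faults with probability `p` and, purely for illustration, uses "more than half of the neighbors" as the required number; the helper name and that cut-off are assumptions, not values from the paper.

```python
from math import comb


def prob_more_than_k_faulty(n: int, k: int, p: float) -> float:
    """Binomial tail: probability that more than k of n neighbors are faulty."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1, n + 1))


# Compare a sparse and a denser neighborhood at a fault rate of 0.2.
sparse = prob_more_than_k_faulty(10, 5, 0.2)
dense = prob_more_than_k_faulty(20, 10, 0.2)
```

Under these assumptions the tail probability is already small for ten neighbors and shrinks further for twenty, which matches the trend described for increasing node degree.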
5.2. Experiment 2: Parameter Tuning
There are several design parameters in the proposed approach, namely, , , , and . In this experiment, we tune these parameters with regard to the accuracy, coverage, and detection latency. In this experiment, we have deployed 100 faulty nodes randomly . Each faulty node can exhibit the permanent, intermittent, and transient fault with probability . While conducting sensitivity analysis on each design parameter, we fix the others to the nominal values as summarized in Table 2. The transmission range of each node is chosen to have . This ensures that a fault is detected by a test (execution of the Algorithm 1) if it appears at the time of test. In addition, larger value of ensures low FAR (refer to Figure 1). However, this restriction in node degree is relaxed in subsequent experiments to observe the performance of the proposed approach in sparse networks.
The data-gathering stage is scheduled at where is an integer and is application specific. For instance, applications with a short mission time need the data to be gathered more frequently than applications where data gathering is less frequent. For applications with a long mission time, is large. Thus, to discriminate transient from intermittent faults, number of sensor measurements needs to be broadcast by each node. This in turn makes the packet grow with . Since the energy consumed by a sensor node is directly proportional to the number of bits it transmits or receives, the energy overhead will be higher for large values of and may not be practically implementable. To address this issue, we suggest sampling the interval , where each sample consists of consecutive sensor measurements. The standard deviations of the sensor measurements corresponding to each sample interval are calculated and broadcast along with the routine data. This in turn reduces the packet size and makes the algorithm energy efficient. Each node takes the decision by comparing the corresponding standard deviations of its one-hop neighbors. Use of the standard deviation instead of individual measurements does not affect the detection performance since the rate of change in sensor measurements over time is very low. In addition, a sensor often reports unusually high or low sensor measurements during FAD. Thus, the standard deviation of the sensor measurements of a sample interval with at least one incorrect measurement will be distinguishable from the corresponding standard deviations of one-hop neighbors with all true measurements. In this experiment, we assume temperature sensors.
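The sampling idea can be sketched as below: each node summarizes every sample interval by the standard deviation of its measurements and compares those summaries with a neighbor's. The function names, the use of the population standard deviation, and the `tolerance` parameter are assumptions for this illustration.

```python
from statistics import pstdev
from typing import List, Sequence


def sample_deviations(measurements: Sequence[float], samples: int) -> List[float]:
    """Split the measurements into equal samples and return one deviation per sample.

    Trailing measurements that do not fill a whole sample are ignored.
    """
    size = len(measurements) // samples
    return [pstdev(measurements[i * size:(i + 1) * size]) for i in range(samples)]


def mismatching_samples(own: Sequence[float], neighbor: Sequence[float],
                        tolerance: float) -> List[int]:
    """Indices of sample intervals whose deviations differ by more than the tolerance."""
    return [i for i, (a, b) in enumerate(zip(own, neighbor)) if abs(a - b) > tolerance]


# A healthy neighbor and a node with one spurious reading in its second sample.
healthy = [25.0, 25.1, 25.0, 25.1, 25.2, 25.1, 25.2, 25.1]
spiky = [25.0, 25.1, 25.0, 25.1, 25.2, 60.0, 25.2, 25.1]
flagged = mismatching_samples(sample_deviations(spiky, 2),
                              sample_deviations(healthy, 2), tolerance=1.0)
```

Only two values per node are exchanged here instead of eight, and the single spurious reading still makes the second sample interval stand out.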
Figure 2(a) depicts the average accuracy and coverage at varying values of . This result confirms that has a strong impact on average accuracy. It is observed that the average accuracy falls after . This is because when is excessively long, an excessively long time is required to reach the reward threshold. For instance, this time for and is hours. The mentioned period of correct operation is too long and increases with . Thus, the occurrence of subsequent transient faults will be viewed as correlated intermittent faults and the node will be isolated. It is observed that the average coverage remains unaffected by change in . However, as shown in Figure 2(b), the average latency of isolation increases with . The reason is that if is too high, then probability that the FAD coincides within the intertest interval is high. This means, the probability that a fault may appear after and subsequently disappears before increases with . This in turn increases the number of test repetitions and thus the average latency of isolation to reach the penalty threshold.
The impact of the reward threshold on the average accuracy and coverage is depicted in Figure 3(a). In the proposed detection algorithm, a node is isolated if it fails before reaching the reward threshold. If is too large, then a healthy node that enters the observation stage may be isolated. The reason is that transient faults appear to be correlated intermittent faults. This in turn affects the accuracy. If is too small, then intermittent faults will be treated as transient faults, and the affected nodes will be reintegrated to the network, causing poor coverage. This is because the value of must be greater than the average number of test repetitions required to detect the presence of a fault. Thus, proper tuning of is crucial to achieve a good discrimination. The best tradeoff for the given scenario is observed at . The time period of correct operation for () is adequate for an unhealthy node to reach the penalty threshold. In addition, as shown in Figure 3(b), accuracy is ; that is, the transient faults do not appear as correlated intermittent faults for . The average detection latency for varied values of is shown in Figure 3(b). The average latency of isolation is reported almost unaffected for . This is because coverage is reported for . The detection latency depends only on and the number of test repetitions required to reach . Thus, for , the detection latency is negligibly affected.
Figure 4(a) shows the coverage and accuracy at varying values of the penalty threshold. As discussed earlier, the penalty counter is incremented by a value if the present FDD is smaller than the preceding FDD. For smaller values of , the probability of isolating healthy nodes in the observation state is higher, as the transient faults appear to be correlated intermittent faults. As expected and shown in Figure 4(a), the average coverage is not affected by varying values of . As shown in Figure 4(b), the average latency of isolation increases with . This is because the number of test repetitions required to detect the presence of a fault for time increases with . Since the proposed approach implements an adaptive penalty increment technique and a relatively high fault occurrence rate is observed in an unhealthy node, the average detection latency grows less after .
Finally, we study the effect of on the average detection latency, the average number of test repetitions, the average coverage, and the average accuracy. When is set to 1, the proposed algorithm acts similarly to the approach proposed in , which does not consider the fault disappearance state holding time. Figure 5(a) illustrates the improvement of the detection latency with . When is greater than 2, the detection latency is lower than when . Similarly, Figure 5(b) illustrates the improvement of the number of test repetitions required to discriminate transient from intermittent faults. When is larger than 2, the number of test repetitions is lower than when . The effect of on average accuracy and coverage is depicted in Figure 5(c). A tradeoff is observed where both the average accuracy and coverage attain their highest value from to . These results suggest the importance of in discriminating transient from intermittent faults.
In summary, for wireless sensor networks, a setting of , , , and allows most transient faults to be discriminated from intermittent faults.
5.3. Experiment 3: Robustness with regard to Transient Faults
In this experiment, we estimate how well the proposed detection algorithm discriminates transient from intermittent faults. We compare the performance of the detection algorithm with the state-of-the-art detection algorithm proposed by Lee et al. in . Similar to , we redefine FAR as follows. Let , , and represent the number of good nodes, the number of good nodes with a transient fault, and the number of faulty nodes, respectively. Let be the number of nodes wrongly detected as faulty out of the good nodes. Similarly, the number of healthy nodes with a transient fault identified as faulty is denoted by . The FAR is redefined as . For better performance evaluation, we consider an equal number of permanent and intermittent faults. In this experiment, the impact of transient fault rates () on DA and FAR has been evaluated for , and .
As expected and shown in Table 3, the detection accuracy is little affected by the varying rate of transient faults in the network. This is because the proposed detection algorithm wrongly detects a faulty node as fault-free only if the node has more than number of faulty neighbors during the test at time and all are reporting the same faulty reading. The probability of the mentioned number is very low for the following reasons. (1) In this approach, all the nodes with malfunctioning sensors are momentarily assumed to show a match in comparison with a probability of 0.5 regardless of their locations. (2) The appearance of intermittent and transient faults is random in nature. At the time of the test, the probability that all the transient and intermittent faults appear is very low. (3) Simulation results shown in Experiment 1 confirm good performance in sparse networks. Similar to the proposed detection algorithm, the detection accuracy of the detection algorithm proposed by Lee et al. is less sensitive to change in .
The robustness of the fault detection algorithm to transient faults from the FAR perspective is shown in Table 4. As expected, the FAR is little affected by the varying rate of transient faults in the network. In the proposed approach, a fault-free node is detected as faulty only when it has more than number of faulty neighbors during the test. As discussed, the probability of the said number is very low. In addition, proper tuning of the detection parameters ensures efficient discrimination of transient from intermittent faults. The fault-free nodes with transient faults are effectively reintegrated into the network, which in turn keeps the FAR low. As reported in Table 4, the proposed detection algorithm outperforms Lee's approach from the FAR perspective. The reason is that the two thresholds used in Lee's scheme are not adequate to discriminate transient from intermittent or permanent faults. Thus, their approach isolates the maximum number of fault-free nodes with transient faults.
5.4. Experiment 4: Robustness with Regard to Channel Fault
In this experiment, the robustness of the detection algorithm to faults in the communication channel is analyzed by estimating DA and FAR for various channel error probabilities. In this experiment, we set . For simplicity in the simulation, is taken as and is taken as . is fixed to , and is varied to get different channel error probabilities . The channel error rate is increased in steps from to . Faults in the communication channel might cause some fault-free nodes to fail in receiving the sensor measurements from their neighbors. This in turn decreases the effective neighbor size of a sensor node and might affect the local decision. However, as discussed in Experiment 1, the detection algorithm performs well even in sparse networks. Thus, as expected and shown in Tables 5 and 6, the detection algorithm effectively tolerates faults in the communication channel. It is observed that the detection scheme proposed in also effectively tolerates faults in the communication channel.