Abstract

Wireless sensor networks are used in many diverse application scenarios that require the network designer to trade off different factors. Two such factors of importance in many wireless sensor networks are communication reliability and battery life. This paper describes an efficient, low-complexity, high-throughput channel decoder suited to decoding low-density parity-check (LDPC) codes. LDPC codes have demonstrated excellent error-correcting ability, such that a number of recent wireless standards have opted for their inclusion. Hardware realisation of practical LDPC decoders is challenging, especially when power-efficient solutions are needed. Implementation details are given for an LDPC decoding algorithm, termed adaptive threshold bit flipping (ATBF), designed for low complexity and low power operation. The ATBF decoder was implemented in 90 nm CMOS at 0.9 V using a standard cell design flow and was shown to operate at 250 MHz, achieving a throughput of 252 Gb/s/iteration. The decoder area was 0.72 mm2, with a power consumption of 33.14 mW and a very small energy/decoded-bit figure of 1.3 pJ.

1. Introduction

Wireless sensor networks (WSNs) have been deployed in many applications, ranging from oceanography, environmental monitoring, and the study of wildlife behaviour to intrusion detection and building structural analysis [1]. WSNs are composed of a number of cheap, small, resource-limited nodes with some form of sensing and wireless communication capability. As such, careful design of the overall network is necessary to ensure sufficient robustness to node failures, impaired communication links, and correlated sensor data. The radio system can be a major source of power drain compared to the other systems within a sensor node, as stated in [2]. Of the many functions performed within a radio system, channel decoding is one of the most computationally intensive and, thus, resource demanding.

Low-density parity-check (LDPC) channel codes were described by Gallager [3] in 1963 and subsequently rediscovered by MacKay [4]. LDPC codes are capacity-approaching codes, coming within 0.0045 dB of the Shannon capacity limit [5], whilst being attractive from an implementation viewpoint due to the availability of iterative decoding algorithms.

In addition to describing error-correcting codes based on low-density parity-check matrices, Gallager presented a number of practical decoding algorithms. One class was a probabilistic soft-decision message-passing algorithm, referred to as the sum-product (SP) algorithm, capable of achieving optimum performance. Although SP techniques have been shown to perform exceptionally well [5], they suffer from implementation complexity. In [6], three major implementation problems are highlighted for a fully parallel SP decoder: a routing bottleneck caused by large wire-dominated circuit delays due to mirroring the LDPC matrix connectivity, significant memory storage for messages, and a large number of processing elements due to long code lengths. Thus, partially parallel solutions based on structured codes have been proposed [7, 8] that address some of these problems but generally have lower throughput than a fully parallel implementation. An alternative class of decoding algorithms, the bit flipping (BF) methods, was designed to be less complex than SP decoding. One such method, termed adaptive threshold bit flipping (ATBF), was described in [9], designed for low power, high throughput operation.

The work in this paper reports on the implementation of the ATBF algorithm, deriving figures for throughput, power consumption, and silicon area. This paper is organised as follows. Section 2 presents a review of simplified hard and soft decision decoding algorithms. Section 3 provides details of the adaptive threshold bit flipping algorithm and a method for reducing power consumption. Section 4 describes a criterion for early stopping of decoding in the ATBF algorithm. Simulation results for the ATBF and early stopping ATBF (ES-ATBF) algorithms are compared to published bit flipping methods in Section 5. Section 6 describes the VLSI architecture of the ATBF decoder, highlighting the issue of routing and providing figures for power consumption, area, and timing analysis. Section 7 concludes this paper.

2. Background

2.1. Simplified Hard Decision Decoding

Simplified decoding methods are broadly categorised into hard and soft decision algorithms. The majority of bit flipping techniques fall into the soft decision category, and they form the basis of the novel decoding algorithm and subsequent improvements presented in this paper.

Let $\mathbf{y} = (y_1, \ldots, y_N)$ be the soft-valued received vector of the transmitted codeword $\mathbf{c}$ and $\mathbf{z} = (z_1, \ldots, z_N)$ be the binary hard decision of $\mathbf{y}$. Let $N(i) = \{j : h_{ij} = 1\}$ for $1 \le i \le M$ be the index set of nonzero elements in row $i$ and $M(j) = \{i : h_{ij} = 1\}$ for $1 \le j \le N$ be the index set of nonzero elements in column $j$ of the parity-check matrix $\mathbf{H}$. Furthermore, let $\mathbf{e}$ be the error vector associated with the received hard-decision codeword such that $\mathbf{z} = \mathbf{c} \oplus \mathbf{e}$. Let the row and column weights of the parity-check matrix be $\rho$ and $\gamma$, respectively. The syndrome, $\mathbf{s}$, of the received bits can be written as
$$\mathbf{s} = \mathbf{z}\mathbf{H}^{T}. \tag{1}$$
Since $\mathbf{z} = \mathbf{c} \oplus \mathbf{e}$ and $\mathbf{c}\mathbf{H}^{T} = \mathbf{0}$, (1) can be rewritten as
$$\mathbf{s} = \mathbf{e}\mathbf{H}^{T}. \tag{2}$$
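As a concrete illustration of (1), the following minimal Python sketch computes the syndrome of a hard-decision vector for a toy parity-check matrix (the matrix and vectors are illustrative only; the paper's code is the (504,1008) PEG construction):

import numpy as np

# Toy (7,4) parity-check matrix, for illustration only.
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])

z = np.array([1, 0, 1, 1, 0, 1, 0])  # hard decisions of the received values
s = H.dot(z) % 2                     # syndrome s = z H^T over GF(2), as in (1)
print(s)                             # all-zero syndrome: z satisfies every check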

2.1.1. Gallager Bit Flipping

The number of unsatisfied syndrome checks on a given bit gives one measure of the reliability associated with that bit. The basic steps of the hard decision Gallager bit flipping algorithm are: perform the syndrome check in (1), form a parity-check failure count for each bit, and flip the bits involved in a number of parity-check failures above a preset threshold. The complete bit flipping algorithm described by Gallager [3] can be summarised as shown in Algorithm 1.

1) For j = 1 : N,
   set $x_j = z_j$.
2) Let $s_i = \bigoplus_{j \in N(i)} x_j$ over GF(2) for all $1 \le i \le M$.
3) If $s_i = 0$ for all $i$,
   output x and stop.
4) Form a count, $f_j$, of the number of failed parity checks for all $1 \le j \le N$.
5) If the count $f_j$ for any bit exceeds a threshold $b$,
   change the sign of the bit $x_j$.
6) If the maximum number of iterations is reached,
   output x and stop,
  otherwise
   go to step 2.
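A short software sketch of Algorithm 1 may help fix the notation; the function and parameter names are ours, and the BPSK mapping of bit 0 to +1 is an assumption:

import numpy as np

def gallager_bf(H, y, b, max_iter=100):
    # Hard-decision Gallager bit flipping (a sketch of Algorithm 1).
    # H: (M, N) binary parity-check matrix; y: received soft values;
    # b: flip threshold on the failed-check count.
    x = (y < 0).astype(int)            # step 1: hard decisions in {0, 1}
    for _ in range(max_iter):
        s = H.dot(x) % 2               # step 2: syndrome over GF(2)
        if not s.any():                # step 3: all checks satisfied
            break
        f = H.T.dot(s)                 # step 4: failed-check count per bit
        x = np.where(f > b, 1 - x, x)  # step 5: flip bits whose count exceeds b
    return x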

For a binary symmetric channel (BSC), Gallager [3] derived the optimum threshold value, $b$, to be used in the bit flipping algorithm.

2.2. Simplified Soft Decision Decoding

It is intuitive that additional reliability information derived from soft decisions of the received data could improve the decoding performance of the bit flipping technique. The precise way in which this soft information is utilised in the decoding process has led to a number of different bit flipping algorithms being developed, some of which are described below.

2.2.1. Weighted Bit Flipping

In addition to describing a geometric method of constructing LDPC codes, the authors of [10] describe a method for incorporating reliability information into the one-step majority-logic decoding and Gallager's bit flipping algorithms. Let an LDPC code be described by a parity-check matrix $\mathbf{H}$ consisting of $M$ rows and $N$ columns. For $1 \le j \le N$ and $1 \le i \le M$, let
$$E_j = \sum_{i \in M(j)} (2s_i - 1)\,\big|y_i^{\min}\big|, \qquad \big|y_i^{\min}\big| = \min_{j' \in N(i)} |y_{j'}|, \tag{4}$$
where $M(j)$ is the set of parity checks on bit position $j$ and $y_j$ is the received value at bit position $j$. The term in brackets in (4) is a bipolar, $\{-1, +1\}$, representation of the syndrome value. Thus, $|y_i^{\min}|$ represents a weighting of a check sum by the least reliable received value participating in that check sum, resulting in $E_j$ taking values over the real numbers. The weighted bit flipping (WBF) algorithm can be stated as shown in Algorithm 2.

1) For j = 1 : N,
   set $x_j = z_j$.
2) Let $s_i = \bigoplus_{j \in N(i)} x_j$ over GF(2) for all $1 \le i \le M$.
3) If $s_i = 0$ for all $i$,
   output x and stop.
4) Compute $E_j$ in (4) for $1 \le j \le N$.
5) Find the bit position $k$ for which $E_k = \max_j E_j$,
   change the sign of the bit at position $k$.
6) If the maximum number of iterations is reached,
   output x and stop,
  otherwise
   go to step 2.
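In software, the metric (4) and the flip rule of step 5 reduce to a few lines; a sketch, with hard decisions x in {0, 1} and names of our choosing:

import numpy as np

def wbf_metric(H, y, x):
    # WBF inversion metric E_j from (4) for every bit position.
    s = H.dot(x) % 2                   # syndrome bits s_i
    bipolar = 2 * s - 1                # (2s_i - 1), in {-1, +1}
    # least reliable |y| taking part in each check (row-wise minimum)
    ymin = np.array([np.abs(y[H[i] == 1]).min() for i in range(H.shape[0])])
    return H.T.dot(bipolar * ymin)     # accumulate onto the bits of each check

# One WBF flip (step 5): invert the bit with the largest metric.
# k = np.argmax(wbf_metric(H, y, x)); x[k] ^= 1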

Clearly, weighted bit flipping requires handling of soft values, specifically real addition operations, which makes it an inherently more complex algorithm than Gallager bit flipping. For a WBF implementation, the number of real addition operations per iteration is given as $\kappa E$, where $\kappa$ is an implementation-dependent constant, typically less than three [11], and $E$ is the total number of edges in the parity-check matrix. Additionally, the WBF algorithm requires storage of the $|y_i^{\min}|$ values. However, the additional decoding complexity does provide a coding gain over the hard decision Gallager bit flipping algorithm.

2.2.2. Modified Weighted Bit Flipping

In the weighted bit flipping formulation given in (4), the term $|y_i^{\min}|$ weights the check sum by the least reliable received value participating in that check sum. In [12] the authors interpret (4) as a message about the hard decision $z_j$ based on the check sum values which is iteratively updated. Their observation and subsequent modification is that (4) gives an indication of the extent to which $z_j$ should change its value based on the information provided by the check values, whilst ignoring to what extent $z_j$ should maintain its value based on the information provided by $|y_j|$. The modified weighted bit flipping (MWBF) algorithm changes (4) to include the term $|y_j|$ as follows:
$$E_j = \sum_{i \in M(j)} (2s_i - 1)\,\big|y_i^{\min}\big| - |y_j|. \tag{5}$$
Clearly, as stated in [12], the appropriate weight of $|y_j|$ will vary with different values of signal-to-noise ratio and with different row weights. Thus, a weighting factor $\alpha$ is added to the $|y_j|$ term in (5) to give
$$E_j = \sum_{i \in M(j)} (2s_i - 1)\,\big|y_i^{\min}\big| - \alpha\,|y_j|, \tag{6}$$
where $\alpha$ is a positive real value. In [12], it is observed that the optimal value of $\alpha$ increases with the column weight, regardless of the increase in code length. Compared to the WBF algorithm, the MWBF algorithm shows no appreciable difference in the average number of decoding iterations for codes of short length. However, for longer codes, which tend to converge more slowly, the MWBF algorithm requires fewer iterations than the WBF algorithm [12].
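Reusing the wbf_metric sketch above, the change from (4) to (6) is a one-line modification, with alpha as the weighting factor:

import numpy as np

def mwbf_metric(H, y, x, alpha):
    # MWBF metric (6): the WBF metric of (4) penalised by alpha * |y_j|,
    # so that bits with reliable channel values resist flipping.
    return wbf_metric(H, y, x) - alpha * np.abs(y)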

2.2.3. Gradient Descent Bit Flipping

Recently, Wadayama [13] formulated a new way of calculating the decision metric for flipping a given bit, based on gradient descent. The gradient descent bit flipping (GDBF) algorithms outperform the WBF and MWBF algorithms in error-correcting ability and, more significantly, in the average number of iterations needed for successful decoding.

If the elements of the hard decision vector are represented in bipolar form, $x_j \in \{-1, +1\}$, the maximum likelihood (ML) decoder can be represented as
$$\hat{\mathbf{x}} = \arg\max_{\mathbf{x} \in C} \sum_{j=1}^{N} x_j y_j. \tag{7}$$
That is, the ML solution is that codeword which gives the largest correlation to the received vector $\mathbf{y}$. This is the basis used in [13] for deriving the correlation decoding rule, where the following objective function is defined:
$$f(\mathbf{x}) = \sum_{j=1}^{N} x_j y_j + \sum_{i=1}^{M} \prod_{j \in N(i)} x_j. \tag{8}$$
The correlation term in the objective function should be maximised as per the ML formulation. The second term is equivalent to the modulo-2 summation of the binary representation of the bits in the codeword, that is, it is the bipolar syndrome of $\mathbf{x}$. When a parity check is satisfied, the corresponding product takes its maximum value of $+1$, so the second term is maximised when all checks are satisfied. The gradient descent inversion function is based on the partial derivative of $f(\mathbf{x})$ with respect to the variable $x_k$ and can be stated as
$$\Delta_k(\mathbf{x}) = x_k y_k + \sum_{i \in M(k)} \prod_{j \in N(i)} x_j. \tag{9}$$

Based on (9) the single-step GDBF algorithm can be stated as shown in Algorithm 3. The single-step GDBF algorithm flips only a single bit in each iteration, which results in a large number of iterations before the objective function reaches a local maximum. To address this problem, a second algorithm is proposed in [13], termed the multibit flipping gradient descent (MGDBF) algorithm, which modifies the condition in step 3 such that all bits with $\Delta_k(\mathbf{x}) < \theta$ are flipped, where $\theta$ is the inversion threshold. This approach has the advantage of achieving faster convergence due to the larger step size arising from flipping more than a single bit. However, as noted in [13], a large step size is not suitable close to a local maximum, and thus the algorithm has to drop down to single-bit flipping mode. This switch in operating modes is facilitated by evaluating the behaviour of the objective function between iterations and comparing it to a threshold, as in
$$f(\mathbf{x}^{(\ell)}) \le f(\mathbf{x}^{(\ell-1)}), \tag{10}$$
that is, the decoder reverts to single-bit flipping when the objective function fails to increase. The multistep GDBF algorithm achieves better error-correcting performance than both the WBF and MWBF algorithms whilst using fewer iterations, at the cost of additional complexity.

1) For j = 1 : N,
   $x_j = \operatorname{sign}(y_j)$.
2) If $\prod_{j \in N(i)} x_j = 1$ for all $1 \le i \le M$,
   output x and stop.
3) Flip bit $x_k$, where
   $k = \arg\min_j \Delta_j(\mathbf{x})$.
4) If the maximum number of iterations is reached,
   output x and stop,
  otherwise
   go to step 2.
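A sketch of one single-step GDBF iteration, holding x in bipolar {-1, +1} form (the function name and return convention are ours):

import numpy as np

def gdbf_step(H, y, x):
    # One single-step GDBF update (sketch of Algorithm 3).
    # Bipolar syndrome: product of the bits taking part in each check.
    syn = np.array([np.prod(x[H[i] == 1]) for i in range(H.shape[0])])
    if np.all(syn == 1):              # step 2: every parity check satisfied
        return x, True
    delta = x * y + H.T.dot(syn)      # inversion function (9) for every bit
    k = np.argmin(delta)              # step 3: least reliable position
    x[k] = -x[k]                      # flip that single bit
    return x, False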

3. Adaptive Threshold Bit Flipping (ATBF)

All the soft-decision bit flipping algorithms described in the previous section have a key weakness when a very high speed or low power hardware implementation is required: they all require, in one form or another, locating a minimum (or maximum) value over the whole block length $N$. In hardware this requires sequential comparator stages and, since $N$ is large for LDPC codes, this forms a major bottleneck to practical high speed decoding and low power operation. This observation motivated the development of a novel bit flipping algorithm, as described in [9].

The gradient descent bit flipping algorithms seem to give the best error-correcting performance with the fewest decoder iterations compared to the other bit flipping algorithms in the literature. In addition, in deriving the gradient descent formulation, an elegant framework was established allowing the other major bit flipping algorithms to be mathematically represented. Hence, using the single-step GDBF algorithm as a starting point, the adaptive threshold bit flipping (ATBF) algorithm was derived as described next.

Let $\theta_j$, $1 \le j \le N$, be a negative threshold value associated with each received bit, and let $\mu$, $0 < \mu < 1$, be a constant scaling factor used to modify $\theta_j$. The adaptive threshold bit flipping algorithm can then be stated as shown in Algorithm 4, where $\theta_0$ is the initial threshold value.

1) Initialise $\theta_j = \theta_0$ for $1 \le j \le N$.
2) For j = 1 : N,
   $x_j = \operatorname{sign}(y_j)$.
3) If $\prod_{j \in N(i)} x_j = 1$ for all $1 \le i \le M$,
   output x and stop.
4) For j = 1 : N,
   if $\Delta_j(\mathbf{x}) < \theta_j$,
    flip bit $x_j$,
   otherwise
    $\theta_j = \mu\,\theta_j$.
5) If the maximum number of iterations is reached,
   output x and stop,
  otherwise
   go to step 3.

The ATBF algorithm replaces step 3 of the single-step GDBF algorithm by flipping a bit when the inversion function associated with that bit has a value below some threshold. If the inversion function value associated with a bit, $\Delta_j(\mathbf{x})$, is not below the threshold, $\theta_j$, the threshold is scaled using a constant factor, $\mu$. A beneficial consequence of thresholding on a per-bit level is that multiple bits will be flipped initially, progressing to fewer flips as most checks become satisfied. Thus, ATBF intrinsically moves from multiple-bit flipping to single-bit flipping as necessary. The proposed modification removes the need to locate a minimum value over the entire block length, instead requiring only localised operations.
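For concreteness, a minimal software sketch of Algorithm 4 is given below; the parameter values theta0 and mu are illustrative placeholders, not the values used in the hardware:

import numpy as np

def atbf_decode(H, y, theta0=-1.0, mu=0.5, max_iter=100):
    # Adaptive threshold bit flipping (sketch of Algorithm 4).
    x = np.where(y >= 0, 1.0, -1.0)    # step 2: bipolar hard decisions
    theta = np.full(len(y), theta0)    # step 1: per-bit thresholds
    for _ in range(max_iter):
        syn = np.array([np.prod(x[H[i] == 1]) for i in range(H.shape[0])])
        if np.all(syn == 1):           # step 3: all parity checks satisfied
            break
        delta = x * y + H.T.dot(syn)   # inversion function (9)
        flip = delta < theta           # step 4: purely bit-local comparisons
        x[flip] = -x[flip]             # flip every bit below its threshold
        theta[~flip] *= mu             # scale the thresholds of the rest
    return x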

3.1. Reduced Power Decoding

With each iteration of the ATBF algorithm, step 4 will result in either a change of sign of $x_j$ or the scaling of $\theta_j$. As decoding progresses, the probability of a sign change becomes smaller, leaving the scaling of $\theta_j$ as the dominant operation. Thus, detecting the point at which a particular bit processor enters this state and terminating further computation should lead to a saving in power consumption.

To determine when a bit processor is unlikely to change the sign of a bit, a flipping threshold, $\theta_f$, is introduced against which a given $\theta_j$ is compared. If $\theta_j$ has reached $\theta_f$, the bit processor is placed in a quiescent state and no further operations are undertaken. Determining the optimal flipping threshold value analytically was deemed too complex, as it is a function of the signal-to-noise ratio and the LDPC code characteristics; therefore an empirical approach was adopted. With the maximum number of permitted iterations set to a high value, 100, the variation of $\theta_j$ was averaged over at least 1000 blocks in an additive white Gaussian noise (AWGN) channel over a range of SNR values. The LDPC code used was the regular-(3,6), half-rate (504,1008) code taken from [14], referred to as PEGReg504x1008. Figure 1 shows how the mean of each of the 1008 threshold values changes as a function of SNR, averaged over at least 1000 blocks. Let $\bar{\theta}_j = (1/B)\sum_{b=1}^{B} \theta_j^{(b)}$, where $\theta_j^{(b)}$ is the final threshold value of bit $j$ in block $b$ and $B$ is the number of blocks simulated. For a given signal-to-noise ratio, we define $\bar{\theta} = (1/N)\sum_{j=1}^{N} \bar{\theta}_j$. Using the $\bar{\theta}$ values, a fourth-order least squares polynomial fit is applied to derive a relationship between the mean threshold value reached in decoding and the SNR value. A flipping threshold, $\theta_f$, based on the polynomial shown in (11) is defined as the threshold value at which a bit processor is put in a quiescent state, where $s$ is the SNR in dB:
$$\theta_f(s) = a_4 s^4 + a_3 s^3 + a_2 s^2 + a_1 s + a_0. \tag{11}$$
For a practical implementation, the scaling factor applied to the $\theta_j$ values is made a power of two, $\mu = 2^{-m}$, where $m$ is a positive integer. This permits the threshold values to be scaled in hardware by a simple shift operation. In such an implementation, the flipping threshold, $\hat{\theta}_f$, is derived from $\theta_f$ by setting it to the nearest value to $\theta_f$ achievable by scaling $\theta_0$:
$$\hat{\theta}_f = \theta_0\,\mu^{k}, \qquad k = \arg\min_{k' \in \mathbb{Z}^{+}} \big|\,\theta_0\,\mu^{k'} - \theta_f\,\big|. \tag{12}$$
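In a software model the quiescent-state test reduces to a per-bit comparison; the magnitude form below is a sketch under our reading that the thresholds decay towards the flipping threshold:

import numpy as np

def quiescent_mask(theta, theta_f_hat):
    # Marks VNPs whose threshold has reached the flipping threshold;
    # a True entry denotes a frozen bit processor that performs no
    # further updates, saving switching power.
    return np.abs(theta) <= abs(theta_f_hat)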

Figure 2 shows the average number of $\theta_j$ values that have reached $\hat{\theta}_f$ as they vary with each iteration for different SNR values. We estimate the possible power saving of the method described by taking the point in Figure 2 at which the mean number of such $\theta_j$ values becomes nonzero for a given SNR. All iterations prior to this point will use the full 1008 bit processors, and all iterations thereafter will use, at most, 1008 minus the mean number of bit processors indicated by the curve at this point. The upper limit on the number of iterations at a given SNR point is taken from Figure 5, which gives the average number of decoding iterations as a function of SNR for different decoding algorithms. Figure 3 shows the estimated percentage power saving as a function of SNR (dB) when using the power reduction method described. At 3 dB we observe a power saving of 76% over the standard ATBF algorithm, which reduces as the SNR increases. It is interesting to observe the power saving going to zero beyond 5 dB; this suggests it may be possible to reduce the maximum allowable iterations below the observed values.

4. Early Stopping of Decoding

In decoding of LDPC codes, it is well known that the majority of bits are decoded in the first few iterations, leaving a small number of bits that drive the decoding process on without successful decoding. Thus, it is desirable to detect such undecodable blocks and terminate the decoding process to save time and energy. In [15], a good overview is presented of various criteria which can be used for early stopping of turbo decoders. Some of these methods have also been used in studies into early stopping of LDPC decoders. Much of the existing literature addresses the problem of early stopping when using the belief propagation decoding algorithm. Such methods involve taking some measure of the message reliability [16, 17], at the check or variable node, or a combination of the two [18].

The previous section introduced the concept of a flipping threshold used to determine when a node processor is placed in a quiescent state. A further observation from Figure 2 should allow for early termination of decoding without significant impact on the error-correcting ability of the ATBF algorithm. The near step change seen in Figure 2, with no threshold values reaching the flipping threshold prior to 13 iterations followed by a majority of threshold values reaching it, suggests it should be possible to terminate decoding at this point. This is the case because Figure 1 gives the flipping threshold value, which is essentially the steady-state value.

To keep latency to a minimum, we want to avoid operations requiring computation over the block length, such as a summation or minimum operation. Thus, based on Figure 2, we define a simple, bit-local metric for determining when to terminate the decoding process: when any $\theta_j$ reaches $\hat{\theta}_f$, the decoding process is terminated. This is a straightforward edge detection of the transition from no threshold values having reached $\hat{\theta}_f$ to many values having done so. The ATBF algorithm is modified to incorporate the early stopping criterion as shown in Algorithm 5.

1) Initialise $\theta_j = \theta_0$ for $1 \le j \le N$.
2) For j = 1 : N,
   $x_j = \operatorname{sign}(y_j)$.
3) If $\prod_{j \in N(i)} x_j = 1$ for all $1 \le i \le M$,
   output x and stop.
4) For j = 1 : N,
   if $\Delta_j(\mathbf{x}) < \theta_j$,
    flip bit $x_j$,
   otherwise
    $\theta_j = \mu\,\theta_j$.
5) If any $\theta_j$ has reached $\hat{\theta}_f$,
   or the maximum number of iterations is reached,
   output x and stop,
  otherwise
   go to step 3.
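The added test in step 5 is bit-local and needs no summation or minimum over the block length; a sketch, under the same magnitude reading as before:

import numpy as np

def es_stop(theta, theta_f_hat):
    # ES-ATBF stopping test: terminate decoding as soon as any
    # per-bit threshold reaches the flipping threshold.
    return bool(np.any(np.abs(theta) <= abs(theta_f_hat)))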

5. Simulation Results

The PEGReg504x1008 code from [14] was used to encode a random binary sequence, which was BPSK modulated and passed through an additive white Gaussian noise (AWGN) channel. The bit error rate (BER) and average number of decoding iterations for the ATBF and ES-ATBF algorithms are compared to those of the WBF, MWBF, single-step GDBF, and multistep GDBF algorithms in Figures 4 and 5, respectively.

The BER performance of both the ATBF and ES-ATBF algorithms is very close to that of the GDBF algorithms, being within 0.25 dB of the single-step and 0.5 dB of the multistep algorithm. The advantage of correctly selecting the stopping point is evident, with the ES-ATBF algorithm showing almost no loss in performance relative to the ATBF algorithm. The ES-ATBF algorithm uses, at most, eleven iterations, compared to the other algorithms, which are permitted up to 100 iterations, a reduction of 89%.

6. ATBF Decoder Architecture

In this design exercise, a fully parallel implementation was chosen to give a worst-case figure for silicon area and power consumption whilst giving a best-case figure for throughput. A fully parallel implementation directly maps each row and each column of the parity-check matrix to a different processing unit, with all the processing units executing in parallel. In this implementation, the PEGReg504x1008 parity-check matrix has $M = 504$ check nodes and $N = 1008$ variable nodes connected by edges wherever $h_{ij} = 1$. There are $E = M\rho = N\gamma$ total connections, where $\rho$ and $\gamma$ are the weights of the check nodes and variable nodes, respectively. The PEGReg504x1008 code is a regular graph with $\rho = 6$ and $\gamma = 3$; therefore, a fully parallel implementation has a set of $2Eq = 6048q$ global wires and their associated wire length compensating circuits (i.e., routing buffers), where $q$ is the number of bits used to represent the data. The next section provides an analysis of the routing congestion.

6.1. Design Flow and Routing Congestion Analysis

In the ATBF algorithm, the sign information is the only global signal propagating between the processing elements; therefore $q = 1$ and the wiring load of the circuit is 6048 global interconnects. The interconnect complexity of the ATBF algorithm is thus $q$ times (typically 4–6 times) lower than that of equivalent SP variants, where 4–6 bits of soft information are used for message passing. In our analysis, the Synopsys IC Compiler computer-aided design (CAD) tool was used for placement, clock tree synthesis, and routing (global, track, and detailed routing). We used a standard-cell-based automatic place and route flow with a square floor plan. The decoder was developed using Verilog to describe the architecture and synthesised with Synopsys Design Compiler. The total wire length and routing CPU time were used as a pessimistic measure for estimating routing congestion. Initially, with five metal layers used for the power grid, the row utilisation was set to a high value (87%), resulting in significant routing congestion. Keeping the metal layer configuration the same, routing closure was achieved by relaxing the row utilisation, with 100% routing closure achieved at 63.75% row utilisation.

In conclusion, the measured results have shown that the ATBF algorithm is routable in a square floor plan using a 90 nm standard cell process with five metal layers and a single polysilicon layer. Figure 6 shows the completed design, with the core and chip dimensions given in the figure.

6.2. Fully Parallel ATBF Decoder Implementation

The block diagram of a fully parallel implementation of the ATBF decoder is shown in Figure 7. The two types of processing element, variable node processors and check node processors, are labelled VNP and CNP, and the wiring connections between the VNP and CNP blocks are labelled as the routing network. Input and output edges of the check node processor are labelled according to the connectivity defined in the bipartite graph corresponding to the parity-check matrix. The 4-bit soft LLR values received from the channel are passed to the VNP blocks, and these channel values are maintained between iterations by the register feedback mechanism. The final outputs are taken from the VNP modules once decoding is complete or the permissible number of iterations has been reached.

The timing diagram of a check node and a variable node update is shown in Figure 8. The check node and variable node messages are updated one after the other within one clock cycle. All check and variable node outputs are updated in parallel, thus taking one cycle to update all messages for one iteration. For a clock frequency of 250 MHz, the coded throughput for the fully parallel ATBF decoder with a code length of 1008 bits is 252 Gb/s/iteration.
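The quoted figure follows directly from the block length and the clock rate, since one iteration completes per cycle:
$$T = N \times f_{\mathrm{clk}} = 1008\ \text{bits} \times 250\ \text{MHz} = 252\ \text{Gb/s per iteration}.$$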

6.3. Variable Node Processing

Figure 9 shows the variable node processor architecture, which computes the hard decision vector as shown in Algorithm 4. This vector is routed through the routing network to the CNP blocks. Since the routing between the two node types carries only single-bit values, the routing network is smaller than would be required in an implementation of the sum-product algorithm.

The variable node processing block consists of a correlator, inversion function, threshold adaptation, sign-magnitude to 2's complement converter, and a flip-detection circuit. At the start of decoding, the received signal is supplied to the VNPs by a register-feedback arrangement as shown. The received soft values are in 4-bit signed-magnitude (SM) format. Let $(s_n, m_n)$ denote the $n$th 4-bit soft value supplied to the VNP, where $s_n$ is the inverted 1-bit hard decision value and $m_n$ is the magnitude, or reliability, of this hard decision. The SM representation makes the correlation calculation in (9) simple to implement: as shown, the correlator circuit is made up of a 1-bit multiplexer and an inverter. The VNP also performs the threshold adaptation of step 4 in Algorithm 4, which would appear to require a real-valued multiplier. However, by fixing the scaling factor $\mu$ to a power of two, the adaptation can be implemented as a simple shift operation on the value $\theta_j$. The result of the threshold comparison (the most significant bit of the subtractor in the threshold adaptation circuit) is also fed to the flip detection circuit. If a bit flip is detected, the threshold value is unchanged; otherwise the threshold is altered for the next iteration by a right shift.
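A bit-level software model illustrates why the SM format helps; the explicit (sign, magnitude) unpacking and the function names are our assumptions:

def vnp_correlate(x_sign, y_sign, y_mag):
    # Model of the VNP correlator term x_j * y_j: with the channel value
    # in sign-magnitude form, only the sign bits need comparing (a
    # multiplexer and inverter in hardware); the magnitude is unchanged.
    return y_mag if x_sign == y_sign else -y_mag

def scale_threshold(theta_fixed, m=1):
    # Threshold adaptation for mu = 2**-m: an arithmetic right shift of
    # the fixed-point threshold replaces a real-valued multiply.
    return theta_fixed >> m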

6.4. Check Node Processing Block

The function performed by the check node processor is the second term in (9): a parity check of the received bits as defined by the parity-check matrix. The CNP takes hard decision values and calculates the resulting sign, forming the bipolar syndrome as an XOR of the corresponding hard decision values. The number of inputs to each check node processor is determined by the row weight of the parity-check matrix, which for the PEGReg504x1008 code is six. The global sign signal produced by the XOR logic of each CNP block is routed using the interleaved addressing given by the edge numbering of the equivalent bipartite graph.
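The CNP itself is a short XOR reduction; a sketch (each CNP for the PEGReg504x1008 code has six inputs):

from functools import reduce

def cnp_sign(hard_bits):
    # Check node processor: XOR of the participating hard decisions.
    # A result of 0 means the parity check is satisfied, corresponding
    # to a bipolar syndrome value of +1.
    return reduce(lambda a, b: a ^ b, hard_bits)

# example with a degree-6 check: cnp_sign([1, 0, 1, 1, 0, 1]) -> 0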

6.5. Fixed-Point Analysis

A fixed-point software implementation of the ATBF decoder was written to model the ASIC bit-width precision values shown in Figure 9. Figure 10 shows the BER results in an AWGN channel compared to the floating-point result, using the same decoder parameters and with the maximum number of iterations limited to 100. Migrating from floating-point to fixed-point representation results in a small (0.25 dB) loss in BER performance, showing the ATBF algorithm to be robust to fixed-point arithmetic.

6.6. Area Analysis

The design was implemented using UMC 90 nm standard cell CMOS libraries. In a fully parallel approach, the wire length in the decoder design is an important indicator of area, especially where a limited number of metal layers is available (here, five). The total area at different stages of the design flow is shown in Table 1. During the clock tree synthesis stage, 83 clock buffers were inserted, with the subsequent increase in area shown in the second column of Table 1. Routing was completed with a standard cell utilisation of 63.75%, the final chip area of the placed and routed design being 0.729 mm2.

6.7. Power Analysis

Power measurements were averaged over 10000 blocks assuming a signal-to-noise ratio of 5.5 dB. Each received block consisted of 1008 4-bit sign-magnitude soft values. The circuit was toggled at a clock frequency of 250 MHz and all frames were subject to ten iterations. The global supply voltage was 0.9 V and kept the same for all cells.

Table 2 shows the breakdown of power usage for the clock tree and Table 3 the breakdown of power usage for the decoder. The clock tree uses approximately 10% of the total decoder power usage.

The design results for area, power, and throughput are presented in Table 4. Table 5 compares the implementation results to another bit flipping implementation described in [19] and to various SP implementations. From Table 4, the overall area occupied by the design is 0.729 mm2, operating at a clock frequency of 250 MHz whilst consuming 33.14 mW of power. Comparing this to the soft-bit-flipping (SBF) implementation of [19] in Table 5 shows the ATBF implementation to have a higher throughput, even when the SBF algorithm operates at its peak rate, which assumes all received bits were error free. Similarly, the area of the ATBF decoder is approximately 60% of that of the SBF implementation when scaled to a 90 nm fabrication process. The largest difference between the ATBF and SBF implementations is seen in the power consumption figures: the comparatively simple computations of the ATBF algorithm, together with lower row and column weights, lead to fewer and less power-hungry operations, resulting in significantly lower power consumption. The split-row design in [20] gives an exceptionally high throughput at the expense of high power consumption and large silicon area.

7. Conclusion

LDPC codes have shown near-capacity error-correcting performance, and efficient hardware implementations have led to the adoption of LDPC error coding in high throughput wired and wireless standards. For LDPC codes to be used in low power wireless sensor devices and networks, power efficiency and silicon area are important factors. This paper presented results from the hardware implementation of a low power decoding algorithm for LDPC codes. An efficient architecture suitable for implementation in CMOS was described, and throughput, power, and area results from a design implementation exercise were presented. Implementation in 90 nm CMOS showed the ATBF algorithm to operate at 250 MHz, achieving a throughput of 252 Gb/s/iteration whilst consuming 33.14 mW of power (for ten decoding iterations) and occupying an area of 0.72 mm2. The energy per decoded bit was 1.3 pJ, one of the lowest figures in the published literature.

Acknowledgments

The authors would like to thank the directors of the Telecommunications Research Laboratory (TRL), Toshiba Research Europe Ltd, and Lee Keep from Synopsys.