Abstract

Flexible channel decoding is gaining significance with the increase in the number of wireless standards and of modes within a standard. A flexible channel decoder provides interstandard and intrastandard support without changes in hardware. However, the design of efficient implementations of flexible low-density parity-check (LDPC) code decoders satisfying area, speed, and power constraints is a challenging task and still requires considerable research effort. This paper provides an overview of the state of the art in the design of flexible LDPC decoders. The published solutions are evaluated at two levels of architectural design: the processing element (PE) and the interconnection structure. A qualitative and quantitative analysis of different design choices is carried out, and a comparison is provided in terms of achieved flexibility, throughput, decoding efficiency, and area (power) consumption.

1. Introduction

In the context of channel decoding, flexibility means the ability of a decoder to support different types of codes, enabling its usage in a wide variety of situations. Much research has been devoted to this topic after the great increase in the number of standards, in standard complexity, and in code variety witnessed in recent years. Next-generation wireless standards such as DVB-S2 [1], IEEE 802.11n (WiFi) [2], IEEE 802.3an (10GBASE-T) [3], and IEEE 802.16e (WiMAX) [4] feature multiple codes (LDPC, Turbo), where each code comes with various code lengths and rates. The necessity for flexible channel decoder intellectual properties (IPs) is evident, and meeting it is challenging due to the often unforgiving throughput requirements and the narrow constraints on decoder latency, power, and area.

This work gives an overview of the most remarkable techniques in the context of flexible channel decoding. We discuss the design and implementation of the two major functional blocks of flexible decoders: the processing element (PE) and the interconnection structure. Various design choices are analyzed in terms of achieved flexibility, performance, design novelty, and area (power) consumption.

The paper is organized as follows. Section 2 provides a brief introduction to LDPC codes and decoding. Section 3 gives an overview of flexible LDPC decoders, classifying them on the basis of some important attributes, for example, parallelism, implementation platform, and decoding schedule. Sections 4 and 5 are dedicated to the PE and the interconnection structure, respectively, where we depict various design methodologies and analyze some state-of-the-art flexible LDPC decoders. Finally, Section 6 draws the conclusions.

2. LDPC Decoding

2.1. Introduction

LDPC codes [5] are a special class of linear block codes. A binary LDPC code is represented by a sparse M x N parity check matrix H, in which each element is either 0 or 1: N is the length of the codeword, and M is the number of parity bits. Each matrix row introduces one parity check constraint on the input data vector x: H * x^T = 0. The complete matrix can best be described by a Tanner graph [6], a graphical representation of the associations between code bits and parity checks. Each row of H corresponds to a check node (CN), while each column corresponds to a variable node (VN) in the graph. An edge of the Tanner graph connects a VN with a CN only if the corresponding element is a 1 in H. If the number of edges entering a node is constant for all nodes of the graph, the LDPC code is called regular; it is otherwise irregular, with variable node degree. Irregular LDPC codes yield better decoding performance than regular ones.
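As a toy illustration of the parity-check constraint just described, the following sketch builds a small hypothetical H matrix (not taken from any standard, and too small to be meaningfully sparse) and tests the condition H * x^T = 0 over GF(2):

```python
import numpy as np

# Hypothetical toy parity-check matrix: M = 3 parity bits, codeword length N = 6.
H = np.array([
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
], dtype=np.uint8)

def is_codeword(H, x):
    """x is a codeword iff every parity-check row is satisfied,
    i.e. H @ x^T = 0 over GF(2)."""
    return not np.any((H @ x) % 2)

# The all-zero vector satisfies every check; flipping one bit violates some.
zero = np.zeros(6, dtype=np.uint8)
flipped = zero.copy()
flipped[0] = 1
```

Here `is_codeword(H, zero)` holds, while `is_codeword(H, flipped)` fails because rows 0 and 2 both check bit 0.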

Next-generation wireless communication standards adopt structured LDPC codes, which hold good interconnection, memory, and scalability properties at the decoder implementation level. In these codes, the parity check matrix H is derived from a base matrix H_b, as defined in [7]:

H_b has M_b block rows and N_b block columns; it is expanded, in order to generate the H matrix, by replacing each of its entries with a z x z permutation matrix, where z is the expansion factor. The permutation matrix is formed by a series of right shifts of the identity matrix according to a determined shifting factor, equal to the value of the corresponding H_b entry. The same base matrix is used as a platform for all the different code lengths related to a selected code rate: the implementation of a full-mode decoder is thus a challenging task, due to huge variations in code parameters. For example, the current IEEE 802.16e WiMAX standard features four code rates, that is, 1/2, 2/3, 3/4, and 5/6, with H_b matrices of size 12 x 24, 8 x 24, 6 x 24, and 4 x 24, respectively. Each code rate comes with 19 different codeword sizes ranging from 576 bits (z = 24) to 2304 bits (z = 96), with a granularity of 96 bits.
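The expansion just described can be sketched as follows; the 2 x 3 base matrix, the expansion factor, and the convention that -1 marks an all-zero block are illustrative assumptions (standards additionally rescale the shift values with z):

```python
import numpy as np

def expand_base_matrix(Hb, z):
    """Expand a QC-LDPC base matrix into the full parity-check matrix H.
    An entry of -1 denotes the z-by-z all-zero block; an entry s >= 0
    denotes the identity matrix cyclically right-shifted by s positions."""
    Mb, Nb = Hb.shape
    H = np.zeros((Mb * z, Nb * z), dtype=np.uint8)
    I = np.eye(z, dtype=np.uint8)
    for r in range(Mb):
        for c in range(Nb):
            s = Hb[r, c]
            if s >= 0:
                # np.roll along axis 1 shifts the identity columns to the right.
                H[r*z:(r+1)*z, c*z:(c+1)*z] = np.roll(I, s, axis=1)
    return H

# Toy 2 x 3 base matrix, expansion factor z = 4 (purely illustrative values).
Hb = np.array([[0,  1, -1],
               [2, -1,  0]])
H = expand_base_matrix(Hb, 4)
```

Each nonnegative base entry yields a shifted-identity block, so every block row and block column of the expanded matrix keeps weight 1 per nonzero entry.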

Algorithm 1 (The Standard TPMP Algorithm). Initialization: for every edge (i, j) of the Tanner graph, set the VN-to-CN message to the channel LLR: Q_ij = lambda_i.

CN Update Rule: for each check node j and each i in V(j), do
R_ji = prod_{i' in V(j)\i} sign(Q_i'j) * Phi( sum_{i' in V(j)\i} Phi(|Q_i'j|) ), with Phi(x) = -log tanh(x/2).

VN Update Rule: for each variable node i and each j in C(i), do
Q_ij = lambda_i + sum_{j' in C(i)\j} R_j'i.

Decoding: For each bit, compute its a posteriori LLR
Lambda_i = lambda_i + sum_{j in C(i)} R_ji. The estimated codeword is x-hat, where element x-hat_i is calculated as x-hat_i = 1 if Lambda_i < 0 and x-hat_i = 0 otherwise. If H * x-hat^T = 0, then stop, with correct codeword x-hat.

2.2. LDPC Decoding Algorithms

The nature of LDPC decoding algorithms is mainly iterative. Most of these algorithms are derived from the well-known belief propagation (BP) algorithm [5]. The aim of the BP algorithm is to compute the a posteriori probability (APP) that a given bit in the transmitted codeword equals 1, given the received word y. For binary phase shift keying (BPSK) modulation over an additive white Gaussian noise (AWGN) channel with mean 1 and variance sigma^2, the reliability messages, represented as logarithmic likelihood ratios (LLRs), are computed in two steps: check node update and variable node update. This is also referred to as two-phase message passing (TPMP). For a given iteration, let Q_ij represent the message sent from variable node i to check node j, R_ji the message sent from j to i, C(i) the set of parity checks in which bit i participates, V(j) the set of variable nodes that participate in parity check j, C(i)\j the set C(i) with check j excluded, and V(j)\i the set V(j) with VN i excluded. The standard TPMP algorithm is described in Algorithm 1.
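The TPMP schedule can be sketched as a short software model; this is an illustrative sketch using the Min-Sum CN kernel instead of the full SP update and a toy parity-check matrix, not a decoder implementation from the cited works:

```python
import numpy as np

def tpmp_min_sum(H, llr, max_iter=20):
    """Two-phase message passing: a full CN update over all edges, then a
    full VN update, repeated until the syndrome is zero or max_iter is hit.
    H: binary parity-check matrix; llr: channel LLRs (positive means bit 0)."""
    M, N = H.shape
    rows, cols = np.nonzero(H)              # the edges of the Tanner graph
    Q = llr[cols].astype(float)             # VN-to-CN messages, init to channel LLRs
    R = np.zeros_like(Q)                    # CN-to-VN messages
    for _ in range(max_iter):
        # CN update: sign product and minimum magnitude over the other edges.
        for j in range(M):
            e = np.nonzero(rows == j)[0]
            for k in e:
                others = e[e != k]
                R[k] = np.prod(np.sign(Q[others])) * np.min(np.abs(Q[others]))
        # VN update: total LLR minus each check's own contribution.
        for i in range(N):
            e = np.nonzero(cols == i)[0]
            total = llr[i] + R[e].sum()
            Q[e] = total - R[e]
        # A posteriori LLRs, hard decision, and syndrome check.
        Lapp = llr + np.array([R[cols == i].sum() for i in range(N)])
        x_hat = (Lapp < 0).astype(np.uint8)
        if not np.any((H @ x_hat) % 2):
            break
    return x_hat

# Toy code and a received word with one unreliable bit (hypothetical values).
H = np.array([[1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 1]], dtype=np.uint8)
llr = np.array([2.0, 2.0, 2.0, 2.0, 2.0, -1.0])
decoded = tpmp_min_sum(H, llr)
```

With the strong positive LLRs outvoting the single negative one, the model converges to the all-zero codeword.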

As given in (4), the CN update consists of a sign update and a magnitude update, where the latter depends on the type of decoding algorithm, of which several are commonly used (Table 1). The sum-product (SP) algorithm [8] gives near-optimal results; however, the implementation of the transcendental function Phi(x) requires dedicated LUTs, leading to significant hardware complexity [9]. The Min-Sum (MS) algorithm [10] is a simple approximation of the SP: its easy implementation suffers a 0.2 dB performance loss compared to SP decoding [11]. The Normalized Min-Sum (NMS) algorithm [12] performs better than MS by multiplying the MS check node update by a positive constant smaller than 1. Offset Min-Sum (OMS) is another improvement of the standard MS algorithm, which reduces the reliability values by a positive offset: for a quantitative performance comparison of different CN updates, refer to [13, 14].
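The four CN magnitude updates can be contrasted with a minimal sketch; the normalization constant and the offset below are illustrative placeholders, since the best values are code-dependent:

```python
import math

def phi(x):
    # Transcendental kernel of the Sum-Product update: -log(tanh(x/2)).
    return -math.log(math.tanh(max(x, 1e-12) / 2.0))

def sp_magnitude(mags):
    """Sum-Product: apply phi, accumulate, apply phi again."""
    return phi(sum(phi(m) for m in mags))

def ms_magnitude(mags):
    """Min-Sum: keep only the minimum magnitude."""
    return min(mags)

def nms_magnitude(mags, alpha=0.8):
    """Normalized Min-Sum: scale the minimum by a constant < 1."""
    return alpha * min(mags)

def oms_magnitude(mags, beta=0.15):
    """Offset Min-Sum: subtract a positive offset, clamping at zero."""
    return max(min(mags) - beta, 0.0)
```

For the same incoming magnitudes, the SP result is always below the MS one, which is why NMS and OMS shrink the MS output to compensate for its overestimation.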

2.3. Layered Decoding of LDPC Codes

By modifying the VN update rule (6), we can merge the CN and VN update rules into a single operation, where the CN messages are computed from the soft outputs and the stored CN-to-VN messages. This technique is called layered decoding [15]. Layered decoding considers the H matrix as a concatenation of layers (block rows), or constituent subcodes, where the column weight of each layer is at most 1. In this way, a decoding iteration is divided into as many subiterations as there are layers. Formally, the layered decoding Min-Sum algorithm is described in Algorithm 2.

After the CN update is finished for one block row, the results are immediately used to update the VNs, whose results are then used to update the next layer of check nodes. Therefore, updated information is available to the CNs at each subiteration. Based on the same concept, the authors in [7] introduced turbo decoding message passing (TDMP) [16], using the BCJR algorithm [17] for their architecture-aware LDPC (AA-LDPC) codes. TDMP results in about a 50% decrease in the number of iterations needed to meet a certain BER, which is equivalent to an increase in throughput, and in significant memory savings compared to the standard TPMP schedule. Similar to the TDMP schedule is the vertical shuffle scheduling (VSS) [18]: while TDMP relies on horizontal divisions of the parity check matrix, VSS divides the horizontal layers into subblocks. It is a particularly efficient technique with quasicyclic LDPC codes [19], where each subblock is identified by a parity check submatrix.

Algorithm 2 (The Layered Decoding Min-Sum). Initialization: for every bit i, set the soft output to the channel LLR, SO_i = lambda_i, and set all stored CN-to-VN messages to zero, R_ji = 0.

CN Update Rule: for each layer and each check node j of the layer, do
Q_ij = SO_i - R_ji(old), for each i in V(j);
R_ji(new) = prod_{i' in V(j)\i} sign(Q_i'j) * min_{i' in V(j)\i} |Q_i'j|;
SO_i = Q_ij + R_ji(new).

The estimated codeword is x-hat, where element x-hat_i is calculated as x-hat_i = 1 if SO_i < 0 and x-hat_i = 0 otherwise. If H * x-hat^T = 0, then stop, with correct codeword x-hat.
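A software model of the layered schedule may clarify how the soft outputs are refreshed after every layer. In this sketch each check row is treated as its own layer, which is a simplifying assumption; in a QC-LDPC decoder a layer is a block row whose P rows are processed in parallel:

```python
import numpy as np

def layered_min_sum(H, llr, max_iter=20):
    """Layered Min-Sum: soft outputs are updated in place after every layer,
    so each layer sees the newest information within the same iteration."""
    M, N = H.shape
    so = llr.astype(float)                  # soft outputs, init to channel LLRs
    R = np.zeros((M, N))                    # stored CN-to-VN messages
    for _ in range(max_iter):
        for j in range(M):                  # one layer == one check row here
            idx = np.nonzero(H[j])[0]
            Q = so[idx] - R[j, idx]         # subtract the old CN messages
            for t, i in enumerate(idx):
                others = np.delete(Q, t)
                R[j, i] = np.prod(np.sign(others)) * np.min(np.abs(others))
            so[idx] = Q + R[j, idx]         # write updated soft outputs back
        x_hat = (so < 0).astype(np.uint8)
        if not np.any((H @ x_hat) % 2):     # syndrome check
            break
    return x_hat

# Same toy code and channel LLRs as in the TPMP sketch (hypothetical values).
H = np.array([[1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 1]], dtype=np.uint8)
llr = np.array([2.0, 2.0, 2.0, 2.0, 2.0, -1.0])
decoded = layered_min_sum(H, llr)
```

Because each layer consumes the soft outputs already refreshed by the previous one, convergence typically needs fewer iterations than the two-phase schedule.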

3. Flexible Decoders

3.1. Parallelism

The standard TPMP algorithm described in the previous section exploits the bipartite nature of the Tanner graph: since no direct connection is present between nodes of the same kind, all CN (or VN) operations are independent from each other and can be performed in parallel. Thus, a first broad classification of LDPC decoders can be made in terms of the degree of parallelism: the hardware implementation of an LDPC decoder can be serial, partially parallel, or fully parallel.

A serial LDPC decoder implementation is the simplest in terms of area and routing. It consists of a single check node, a single variable node, and a memory. The variable nodes are updated one at a time, and then the check nodes are updated in a serial manner. Maximum flexibility can be achieved by uploading new check matrices into memory. However, each edge of the graph must be handled separately: as a result, throughput is usually very low, insufficient for most standard applications.

A fully parallel architecture is the direct mapping of the Tanner graph to hardware. All node operations (CNs and VNs) are directly realized in hardware PEs and connected through dedicated links. This results in huge connection complexity that in extreme cases dominates the total decoder area and causes severe layout congestion; maximum throughput can, however, theoretically be reached. In [20], a 1024-bit, fully parallel decoder is presented, achieving 1 Gbps throughput with a logic density of only 50% to accommodate the complexity of interconnection: it comprises 9750 wires with 3-bit quantization. None of the parallel implementations in [20-22] grant multimode flexibility, due to their hardwired connections. In addition, almost all existing fully parallel LDPC decoders are built on custom silicon, which precludes any prospect of reprogramming. An alternative approach is the partially parallel architecture, which divides the node operations of the Tanner graph over P PEs, with P much smaller than the number of nodes. This means that each PE performs the computation associated to multiple nodes, necessitating memories to store intermediate messages between tasks. Time sharing of PEs greatly reduces the area and routing overhead. Partially parallel architectures have been studied extensively and provide a good trade-off between throughput, complexity, and flexibility, with some solutions obtaining throughputs up to 1 Gbps.

3.2. Implementation Platforms

Hardware implementation of LDPC decoders is mainly dictated by the nature of application. LDPC codes have been adopted by a number of communication (wireless, wired, and broadcast) standards and storage applications: a few of them are briefly summarized in Table 2.

In the wireless communication domain, LDPC codes are adopted in IEEE 802.16e WiMAX, a wireless metropolitan area network (WMAN) standard, and in IEEE 802.11n WiFi, a wireless local area network (WLAN) standard. Both standards have adopted LDPC codes as an optional channel coding scheme with various code lengths and code rates. LDPC codes are also used in the digital video broadcast via satellite (DVB-S2) standard, which requires very large code lengths of 64800 bits and 16200 bits with 11 different code rates, and a 90 Mb/s decoding throughput. In the wireline communication domain, LDPC codes are adopted in the 10 Gbit Ethernet over copper (10GBASE-T) standard, which specifies a high-rate LDPC code with a fixed code length of 2048 bits and a very high decoding throughput of 6.4 Gbps.

There is no standard for magnetic recording hard disks; however, they demand high code rate, low error floor, and high decoding throughput. In [23], a rate-8/9 LDPC decoder with 2.1 Gbps throughput has been reported for magnetic recording. The decoder supports four block lengths, the maximum consisting of 36864 bits.

The varied nature of applications makes the selection of a suitable hardware platform an important choice. Typical platforms for LDPC decoder implementation include programmable devices (e.g., microprocessors, digital signal processors (DSPs), and application-specific instruction set processors (ASIPs)), customized application-specific integrated circuits (ASICs), and reconfigurable devices (e.g., FPGAs).

General purpose microprocessors and DSPs exploit full programmability to achieve highly flexible LDPC decoding, allowing various code parameters to be modified at run time. Programmable devices are often used in the design, test, and performance comparison of decoding algorithms. However, they usually feature a limited number of PEs that execute in a serial manner, thus limiting the computational power to a great extent. An LDPC decoder implemented on a TMS320C64xx could yield 5.4 Mb/s throughput running at 600 MHz [24]. This performance is not sufficient to support the high data rates defined in new wireless standards.

Reconfigurable hardware platforms like FPGAs are widely used for several reasons. First, they speed up the empirical testing phases of decoding algorithms, which would be impractically slow in software. Second, they allow rapid prototyping of the decoder: once verified, the algorithm can be deployed on the same reconfigurable hardware. They also allow easy handling of different code rates and SNRs, power requirements, block lengths, and other variable parameters. However, FPGAs are suited for datapath-intensive designs and have a programmable switch matrix (PSM) optimized for local routing. High parallelism and the intrinsically low adjacency of the parity check matrix lead to long and complex routing, not fully supported by most FPGA devices. Some designs [16, 25] used time sharing of hardware and memories, which reduces the global interconnect routing at the cost of reduced throughput.

Customized ASICs are the typical choice to obtain a dedicated, high-performance IC. ASICs can fulfill the high computational requirements of LDPC decoding, delivering very high throughputs with reasonable parallelism, and the resulting IC usually meets area, power, and speed metrics. However, ASIC designs are limited in their flexibility and usually intended for single-standard applications only: flexibility, if reached at all, comes at the cost of very long design time and nonnegligible area, power, or speed sacrifices. An alternative or complementary approach is the usage of ASIPs, which greatly overcome the limitations of general purpose microprocessors and DSPs. A fully customized instruction set, pipeline, and memory architecture achieve efficient, high-performance decoding: ASIP solutions are able to provide inter- and intrastandard flexibility through limited programmability, guaranteeing average to high throughput.

3.3. Decoding Schedule

A partially parallel architecture becomes mandatory to realize flexible LDPC decoding. Generally, the functional description of a generic LDPC decoder can be broken down into two parts: (i) node processors; (ii) interconnection structure.

A partially parallel decoder with parallelism P consists of P node processors, while an interconnection structure allows various kinds of message passing according to the implemented architecture. Based on the decoding schedule, that is, TPMP or layered decoding, the datapath can be optimized accordingly. Figure 1 shows two possible datapath implementations of a partially parallel LDPC decoder, which are discussed as follows.

3.3.1. TPMP Datapath

In the TPMP structure depicted in [26] for a generic belief propagation algorithm, each VN consists of 4 dual-port RAMs: I, Sa, Sb, and E, as shown in Figure 1(a). RAM I stores the channel intrinsic information, while RAMs Sa and Sb manage the sum of extrinsic information for the previous and current iteration, respectively, and RAM E stores the extrinsic information for the current iteration. The decoding process consists of D iterations: during an iteration, the intrinsic information fetched from RAM I is added to the contents of RAM Sa. Simultaneously, the extrinsic information generated by the current parity check during the previous iteration is retrieved from RAM E and subtracted from the total. The result of the subtraction is fed to the PE, which executes the chosen CN update (see Table 1). The updated extrinsics are then accumulated with the current-iteration sums (RAM Sb) and replace the old extrinsic information in RAM E. At the next iteration, the roles of RAM Sa and RAM Sb are exchanged.

3.3.2. Layered Datapath

The layered decoding datapath described in [27] is shown in Figure 1(b). The VN structure is simplified and consists of RAM I only, which stores the soft outputs updated at each subiteration. Equation (9) is computed inside the check node, which consists of a PE, a FIFO, and RAM S, which stores the CN-to-VN messages. During each subiteration, these are subtracted from the message incoming from RAM I to generate the VN-CN message. The updated extrinsic generated by the PE is added to the corresponding input coming from the FIFO, and the resulting messages are stored in RAM S.

In both datapath architectures described above, the assignment of PEs to nodes (VNs and CNs) is determined by the given code structure and can be done efficiently by designing LDPC codes using permuted identity matrices. Considering parallelism P, the VNs are connected to the CNs through an interconnection network which has to realize the permutations of the identity matrix. Typically, a highly flexible barrel shifter allows all possible permutations (rotations) of the identity matrix. In some implementations, a single node type joining both VN and CN operations is present, thus changing the nature and function of the connections. The controller is typically composed of address generators (AGs) and a permutation RAM (PRAM). The address generators generate the addresses of the RAMs (I, Sa, Sb, and E), while the permutation RAM generates the control signals for the permutation network according to the rotation of the identity matrix. Multimode flexibility is achieved by reconfiguring the AGs and PRAMs each time a new code needs to be supported.
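The rotations applied by the permutation network can be modeled with a logarithmic barrel shifter. The sketch below assumes a power-of-two size for simplicity; real designs must support the actual expansion factors of the standard, which are generally not powers of two:

```python
def barrel_shift(data, shift):
    """Cyclically rotate `data` left by `shift` positions using log2(len)
    conditional-shift stages, as a hardware barrel shifter would.
    Assumes len(data) is a power of two and 0 <= shift < len(data)."""
    n = len(data)
    out = list(data)
    stage = 1
    while stage < n:
        if shift & stage:                     # this stage's control bit set?
            out = out[stage:] + out[:stage]   # rotate by this power of two
        stage <<= 1
    return out
```

Each stage is a row of 2-to-1 multiplexers driven by one bit of the shift value, which is exactly the control word a PRAM entry would supply.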

In order to realize an efficient LDPC decoder, optimization is required both at the PE and at the interconnection level. The overall complexity and performance of the decoder are largely determined by the characteristics of these two functional units. In the next two sections we discuss them in detail and analyze various design choices aimed at realizing high-performance flexible LDPC decoders.

4. Processing Element

The PE is the core of the decoding process, where the algorithm operations are performed. Its design is an important step that heavily affects the overall performance, complexity, and flexibility of the decoder. The PE can be designed to be serial, with internal pipelining to maximize throughput, or parallel, processing all data concurrently. Depending on this initial choice, critical design issues can arise either in latency and memory requirements or in complex interconnection structures and extended logic area.

4.1. Serial PE

As described in Section 2.1, the LDPC codes specified by the majority of standards are structured LDPC codes. Considering a decoder parallelism P equal to the expansion factor, as in state-of-the-art layered decoders, one submatrix (equal to P edges) is processed per clock cycle, with one operation completed by each PE working in a serial fashion. Figure 2(a) shows a generalized architecture for a serial PE implementing the Min-Sum algorithm. In Min-Sum decoding, out of all the LLRs considered by a CN, only two magnitudes are of interest, that is, the minimum and the second minimum. The PE works serially, maintaining three variables, namely, MIN, MIN2, and INDEX. MIN and MIN2 store the minimum and second minimum of all incoming magnitudes, respectively, whereas INDEX stores the position index of the minimum value. Each time a new VN-CN message is received, its magnitude is compared with MIN and MIN2, possibly substituting one of the two, with the corresponding position stored in INDEX. For each outgoing message, the magnitude is either MIN (when the destination position differs from INDEX) or MIN2 (when it coincides with INDEX). This method avoids storing all VN-CN messages and results in considerable memory savings in the CN kernel.
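The MIN/MIN2/INDEX bookkeeping can be sketched as a behavioral model, with one `absorb()` call standing in for one clock cycle and the sign product handled as in Min-Sum decoding; the class and method names are illustrative, not taken from any cited design:

```python
import math

class SerialMinSumPE:
    """Serial check-node PE sketch: consumes one VN-to-CN message per cycle
    and keeps only MIN, MIN2, INDEX and the running sign product, instead of
    storing all incoming messages."""
    def __init__(self):
        self.min1 = math.inf    # smallest magnitude seen so far (MIN)
        self.min2 = math.inf    # second smallest magnitude (MIN2)
        self.index = -1         # position of the minimum (INDEX)
        self.sign = 1           # product of all incoming signs
        self.count = 0

    def absorb(self, msg):
        mag = abs(msg)
        self.sign *= 1 if msg >= 0 else -1
        if mag < self.min1:
            self.min2, self.min1, self.index = self.min1, mag, self.count
        elif mag < self.min2:
            self.min2 = mag
        self.count += 1

    def emit(self, i, msg_i):
        """Outgoing magnitude toward input i: MIN2 if input i supplied the
        minimum, MIN otherwise; the sign is the product of the other signs,
        obtained by dividing out input i's own sign."""
        mag = self.min2 if i == self.index else self.min1
        s = self.sign * (1 if msg_i >= 0 else -1)
        return s * mag

pe = SerialMinSumPE()
for m in [-3.0, 1.0, 2.0]:
    pe.absorb(m)
```

After absorbing (-3, 1, 2), the PE emits +1, -2, and -1 toward inputs 0, 1, and 2, matching the Min-Sum update computed over the other two inputs each time.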

Table 3 collects some information about the different parameters of the WiMAX and WiFi standards: the number of block rows in the H_b matrix and the maximum row and column weights (i.e., CN and VN degrees) are reported. A full-mode LDPC decoder for WiMAX must support 6 code rates with row weights ranging from 7 to 20. A serial CN implementation is particularly suitable for this scenario, as it allows run-time flexibility to process any value of CN degree with the same number of comparators, allowing efficient hardware usage. However, very large values of CN degree increase the latency and limit the achievable throughput to a great extent, requiring a high degree of parallelism to achieve medium-to-high throughputs (Table 4).

4.2. Parallel PE

Realizing high-throughput decoders (supporting data rates up to a few hundred Mb/s) asks either for massive parallelism or for a high clock frequency, resulting in significant area and power overhead. However, parallelism at the CN level can bring a significant increase in throughput with affordable complexity. A parallel PE manages all VN-CN messages in parallel and writes back the results simultaneously to all connected VNs. This results in lower update latency and consequently higher throughput. A parallel Min-Sum PE for a CN degree of 6 is shown in Figure 2(b). This unit computes the minimum among the different choices of five out of six inputs, and outputs each result to the output port corresponding to the input that is not included in the set. The PE is capable of supporting all CN degrees less than or equal to 6, whereby unused inputs are initialized to the largest representable magnitude. Supporting higher degrees requires additional circuitry, which adds to the complexity and latency of the PE. As shown in the figure, the complexity of the PE is dominated by logic components (e.g., comparators) and increases almost linearly with the node degree. This type of PE architecture is mostly applied to structured LDPC codes, where the check node degrees are either fixed or show small variations throughout the decoding process. To achieve code rate flexibility, the check node PE is synthesized for the maximum check node degree required by a particular application, supporting all smaller degrees.
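The exclusion of each input from its own output, which Figure 2(b) realizes with a comparator tree, can be modeled with prefix and suffix minima; this is a behavioral sketch of the technique, not the circuit of any cited decoder:

```python
def parallel_cn_outputs(mags):
    """For every input i, compute the minimum over all the OTHER inputs,
    using prefix and suffix minima (a combinational tree in hardware).
    Unused inputs of a PE sized for the maximum degree would be tied to
    the largest representable magnitude, modeled here as +infinity."""
    n = len(mags)
    INF = float("inf")
    prefix = [INF] * (n + 1)    # prefix[i]  = min of mags[0 .. i-1]
    suffix = [INF] * (n + 1)    # suffix[i]  = min of mags[i .. n-1]
    for i in range(n):
        prefix[i + 1] = min(prefix[i], mags[i])
        suffix[n - 1 - i] = min(suffix[n - i], mags[n - 1 - i])
    # Output i excludes mags[i]: combine the minima on either side of it.
    return [min(prefix[i], suffix[i + 1]) for i in range(n)]
```

All outputs are available after a single combinational pass, which is what gives the parallel PE its low update latency compared to the serial one.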

4.3. State-of-the-Art

Flexibility as a design parameter is not always addressed as an important figure of merit, but various design techniques have been reported in the literature which can be compared in terms of throughput, complexity, and number of supported decoding modes, thus evaluating the obtainable degree of flexibility.

4.3.1. ASIC Implementations

The partially parallel decoder presented by Kuo and Willson in [32] offers a simple and tailored solution to the mobile WiMAX problem. The designed ASIC is able to work, upon reconfiguration, on all the mobile WiMAX standard LDPC codes. The quasicyclic structure of such codes allows an effective implementation of the layered decoding approach, here exploited with a variable degree of parallelism, and simple interconnection between memories and processing units. The implemented decoding algorithm is the OMS, and it is fixed. Each component of [32] is not flexible per se, but a serial architecture and programmable parallelism extend its range of usable codes to any code with parameters smaller than or equal to the WiMAX ones (block length, column and row weights, total number of exchanged information), although without guaranteeing compliance with standard throughput requirements.

In [33], intrastandard flexibility comes together with a choice between two decoding approaches, layered decoding and TPMP. Although the performance of this ASIC has been evaluated only for the structured QC-LDPC codes of WiMAX, the true benefit of the dual algorithm comes in the case of unstructured codes: layered decoding, usually the better-performing schedule, generates data collisions in the presence of superimposed submatrices in the code matrix, while such collisions are transparent to the two-phase decoding process. The central processing unit consists of a Reconfigurable Serial Processing Array (RSPA) incorporating a serial Min-Sum PE and a reconfigurable logarithmic barrel shifter. The RSPA can be dynamically reconfigured to choose between the two decoding modes according to the different block LDPC codes. With intelligent hardware reuse via modular design, the overhead due to the double decoding approach is reduced to a minimum, with an overall acceptable power consumption. The decoder operates at 260 MHz, achieving a throughput that meets the WiMAX standard specifications.

When designing an efficient multimode decoder, a typical approach is to find similarities between different modes and then implement the common parts as reusable hardware components; controlling the data flow between the reusable components guarantees multimode flexibility. One such effort is the work by Brack et al. [27]. It presents an IP core of a full-mode LDPC decoder that can be synthesized for a selected code rate specified by the WiMAX standard. The unified decoder architecture proposes two different datapaths for TPMP and layered decoding, as in [33], and combines them in a single architecture sharing the components common to both. Code rate and codeword size flexibility is achieved by realizing serial check node PEs, and the chosen decoding algorithm is Min [43].

As discussed in Section 3.3.2, a partially parallel TDMP decoder performs a serial scanning of the block rows of the H_b matrix. The check node reads the bit LLRs connected to a block row serially and stores them in a FIFO whose size is proportional to the row weight. For multirate irregular QC decoders, the utilization ratio of the FIFO is low for smaller row weights. In addition, due to the random location of nonzero submatrices and the correlations between consecutive block rows of H_b, the extrinsic information exchange can lead to memory access conflicts. These limitations were addressed by Xiang et al. in [34], where the authors presented an overlapped TDMP decoding algorithm. The design proposes a block row and column permutation criterion, in order to reduce the correlation between consecutive rows and to uniformly distribute zero and nonzero submatrices over the columns, together with a smart memory management technique. The resulting decoder is a full-mode QC-LDPC decoder for WiMAX. The decoder achieves a maximum throughput of 287 Mb/s, with support for other similar QC-LDPC codes.

An interesting way to tackle the flexibility issue is proposed in [35]. Here the different code parameters are handled via what has been called processing task arrangement. This work presents a decoder based on the decoding approach described in [44], layered message passing decoding with identical core matrices (LMPD-ICM). LMPD-ICM is a variation of the original layered decoding: the H matrix is partitioned into several layers, with each layer yielding a core matrix, which consists of the nonzero columns of that layer. The core matrix is further divided into smaller and identical tasks. Applying LMPD-ICM to a QC-LDPC code reveals that the core matrices of the layers are column-permuted versions of each other and show similarities not only among different layers of a single code, but also among different codes within the same code class. This technique is applied to the QC-LDPC codes of WiMAX in [35], and a novel task arrangement algorithm is proposed to assign the processing operations of a variety of QC-LDPC codes to different PEs. The design features four-stage pipelining for task execution, flexible address generation to support multirate decoding, and an early termination strategy that dynamically adjusts the number of iterations according to the SNR to save power. The decoder achieves a moderate throughput of 200 Mb/s at a 400 MHz frequency, utilizing a parallelism P of 4.

A single design flow is exploited in [28] to provide three different implementations, each supporting a different standard. The adaptive single-phase decoding (ASPD) [45] scheduling is enforced, which detaches the decoder's memory requirements from the weight of the rows and columns of the H matrix, leaving them dependent on the codeword length only. Though sacrificing up to 0.3 dB in BER performance, this technique accounts for a 60-80% reduction in memory bits. The Offset Min-Sum decoding algorithm is employed, and differently sized memories comply with the DVB-S2, 802.11n, and 802.16e standards. At run time, the standard serial node architecture enables intrastandard flexibility.

A classical layered scheduling is used in the DVB-S2 decoder proposed in [30]: the 360 PEs, whose architecture is detailed along with the iteration timing, are able to process a whole layer concurrently. A barrel shifter manages the interlayer communication: since the number of rows that compose a layer never changes in DVB-S2, a change of code means a different workload on the communication structure, but a very easy reconfiguration.

The work in [36] presents a reconfigurable full-mode LDPC decoder for WiMAX. A so-called phase overlapping algorithm, similar to TDMP, is proposed, which resolves the data dependencies between the CNs and VNs of consecutive subiterations and overlaps their operation. The proposed decoder features serial check nodes with a Min-Sum algorithm implementation. A parallelism of 96 yields a throughput of 105 Mb/s at 20 iterations.

In addition to serial check node architectures, the state-of-the-art in flexible LDPC decoders also reports some solutions utilizing parallel check nodes. The work in [37] proposes a reconfigurable multimode LDPC decoder for Mobile WiMAX. The authors applied the matrix reordering technique [46] to the H matrix of the rate-1/2 WiMAX code. The improved matrix reordering technique allows overlapped operation of CNs and VNs and results in a 68.75% reduction in decoding latency compared to the nonoverlapped approach. A reconfigurable address generation unit and an improved early stopping criterion help to realize a low-power flexible decoder that supports all the 19 block lengths (576-2304) of WiMAX. Parallel check nodes implementing the Min-Sum algorithm help to achieve a throughput of 222 Mb/s with a low frequency of 83.3 MHz and a parallelism of 4.

The work in [38] features a parallel check node based on a divided group comparison technique, adaptive code length assignment to improve decoding performance, and an early termination scheme. The proposed solution is run-time programmable to support arbitrary QC-LDPC codes of variable code lengths and code rates; however, no compliance with standardized codes is guaranteed. A reasonable throughput of 86 Mb/s at 125 MHz is achieved. In [47], the authors presented a parallel check node incorporating the value reuse property of the Min-Sum algorithm, for nonstandardized, rate-0.5 regular codes.

The WiMedia standard [48] requires very high throughput: the work by Alles et al. [29] manages to deliver more than 1 Gb/s for most code lengths and rates. The result is achieved via the instantiation of 3 PEs with an internal parallelism of 30: the similarity of the WiMedia codes with the QC-LDPC codes of WiMAX allows a multiple submatrix-level parallelism in the decoder.

A more technological point of view is given in [31], where low-power VLSI techniques are used in an 802.11n LDPC decoder design. The decoder exploits the TDMP VSS technique with a 12-datapath architecture: separate variable-to-check and check-to-variable memory banks are instantiated, one per type for each datapath. Each of these "macrobanks" contains 3 "microbanks," each storing 9 values per word. The internal degree of parallelism is thus effectively multiplied. The VLSI implementation is efficiently tackled in the less-than-worst case thanks to the VOS [49] and RPR [50] techniques, saving area and power consumption.

4.3.2. ASIP Implementations

Future mobile and wireless communication standards will require support for seamless service and heterogeneous interoperability: convolutional, turbo, and LDPC codes are established channel coding schemes for almost all upcoming wireless standards. To provide the aforementioned flexibility, ASIPs are potential candidates: the state of the art reports a number of design efforts in this domain, thanks to their good performance and acceptable degree of flexibility.

The work portrayed in [39] outlines a multicore architecture based on an ASIP concept. Each core is characterized by two optimized instruction sets, one for LDPC codes and one for turbo codes. The complete decoder only requires 8 cores, since each of them can handle three processing tasks at once. The simple communication network sustains this parallelism, allowing efficient memory sharing and collision avoidance. The intrinsic flexibility of the ASIP approach allows multiple standards (WiFi, WiMAX, LTE, and DVB-RCS) to be easily supported: the exploitation of the multiple datapaths allows a very high best-case throughput, while the reduced network parallelism keeps the complexity low.

A possible dual turbo/LDPC decoder architecture is described in [40]. Here, a novel approach to processing element design is proposed, allowing a high percentage of shared logic between turbo and LDPC decoding. The TDMP approach allows the usage of BCJR decoding algorithm for both codes, while the communication network can be split into smaller supercode-bound interleavers, effectively merging LDPC and turbo tasks.

In [41], the authors proposed FlexiCHAP, a flexible channel coding processor. The proposed ASIP extends the capabilities of the previous FlexiTrep design [51] and is capable of decoding convolutional codes, binary/duobinary turbo codes, and structured LDPC codes. The ASIP is based on the single-instruction multiple-datapath (SIMD) paradigm [52], that is, a single IP with an internal data parallelism greater than one, and processes a whole submatrix of the parity-check matrix with a single instruction. Multistandard, multimode functionality is achieved through a 12-stage pipeline and an elaborate memory partitioning technique. The ASIP is able to decode binary turbo codes up to 6144 information bits and duobinary turbo codes and convolutional codes up to 8192 information bits. An LDPC decoding capability of block lengths up to 3456 bits and check node degrees up to 28 is sufficient to cover all code rates of the WiMAX and WiFi standards. The ASIP achieves payloads of 237 Mb/s and 257 Mb/s for WiMAX and WiFi, respectively.

The solution proposed in [42] makes use of a single SIMD ASIP with a maximum internal parallelism of 96. The combined architecture is strictly designed for LDPC codes, but the supported ones range from binary (WiFi and WiMAX) to nonbinary (Galois field of order 9), with a turbo decoder-based decoding approach. While the decoding mode can be changed at runtime, flexibility is guaranteed at design time by instantiating rotation engines wide enough for the different LDPC submatrix sizes and a sufficient number of memories. These memories dominate the area occupation, mainly due to the nonbinary decoding process.

Tables 4 and 5 summarize the specifications of the various state-of-the-art ASIC and ASIP solutions for flexible LDPC decoders discussed above. To simplify the comparison, the area of each decoder has been scaled to a 130 nm process and reported as normalized area (A_n). A parameter called throughput-to-area ratio, defined as TAR = T/A_n, has also been included in the table to evaluate the area efficiency of the proposed decoders. Another metric named decoding efficiency, given as DE = (T · N_it)/(f_clk · Dp) [53], has been defined to allow a fair comparison regardless of the different clock frequencies. DE gives the number of decoded bits per clock cycle per iteration, while Dp is the total degree of parallelism, taking into account both the number of PEs and their possible multiple datapaths.
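As an illustration, the comparison metrics can be computed with the short Python sketch below; the formulas and the quadratic technology scaling of the area are reconstructed from the verbal definitions in the text, so the exact normalization used in the tables may differ.

```python
def decoder_metrics(t_mbps, area_mm2, tech_nm, f_clk_mhz, n_iter, dp, n_modes):
    """Sketch of the comparison metrics (assumed definitions):
    A_n : area scaled to a 130 nm process with first-order (quadratic) scaling
    TAR = T / A_n
    DE  = T * N_it / (f_clk * Dp)
    FE  = DM * DE / A_n"""
    a_n = area_mm2 * (130.0 / tech_nm) ** 2   # normalized area in mm^2
    tar = t_mbps / a_n                        # Mb/s per mm^2
    de = t_mbps * n_iter / (f_clk_mhz * dp)   # bits / cycle / iteration
    fe = n_modes * de / a_n                   # flexibility efficiency
    return a_n, tar, de, fe
```

For instance, under these assumptions a decoder delivering 200 Mb/s on 1.0 mm² in 65 nm technology at 200 MHz, with 10 iterations and Dp = 10, would score A_n = 4.0 mm² and TAR = 50 Mb/s/mm².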

To evaluate the effective flexibility of each decoder, and its cost, a metric called flexibility efficiency is introduced, computed as FE = (DM · DE)/A_n. It gives a measure of each decoder’s flexibility through its number of different decoding modes (DMs), taking into account the normalized throughput performance (DE) in relation to the normalized area occupation A_n. The metric is applied to ASIC decoders only, since in these cases the cost of flexibility is reflected in the area much more directly than in the ASIP case.

As shown in Table 4, the work in [27] dominates in terms of throughput and TAR but offers only design time flexibility: for this reason, its FE value is very low. On the contrary, the design time flexibility of [28] turns into runtime flexibility once the standard to be supported has been chosen: since the implementation of the WiMAX decoder requires a higher number of codes to be supported at the same time than DVB-S2 and WiFi, its FE is higher. The best DE value is held by the DVB-S2 decoder presented in [30], together with an average TAR; explicitly enabling the decoding of just 20 codes, however, lowers its FE. Among the serial PE-based runtime flexible solutions discussed above, the work in [34] achieves very high throughput, TAR, and DE with a small area occupation of 2.46 mm², yielding the best FE of all decoders. A full-mode reconfigurable solution based on a parallel check node in [35] achieves a respectable throughput of 200 Mb/s and the best FE among the parallel node solutions. Its normalized area of 1.416 mm² is the minimum among all the WiMAX solutions discussed above, but its very low DE mars the overall performance.

Among the ASIP solutions (Table 5), the work in [40] cannot effectively be compared to the others in terms of area, since it does not provide complete estimations. The work in [41] yields the highest TAR and DE in LDPC mode; the solution proposed in [39], however, yields the best decoding efficiency and the top TAR in turbo mode, while at the same time reaching a very good DE in LDPC mode too.

5. Interconnection Structures

As shown through the previous sections, in the great majority of current LDPC decoders some kind of intradecoder communication is necessary. Except for very few single-core implementations based on the single-instruction single-datapath (SISD) paradigm, the need for message routing or permutation is a constant throughout the wireless communication state of the art. As a first classification, two scenarios can be roughly devised:
(i) single-PE architectures: some state-of-the-art decoders propose single-core solutions with internal parallelism greater than one, relying on smart memory sharing and on programmable permutation networks. These decoders often make use of TDMP and VSS, which require either reduced communication or very regular patterns: the involved interconnection structures are simple;
(ii) partially parallel architectures: referring to the graph representation of the LDPC matrix within a selected decoding approach, it is possible to map the graph nodes onto a certain number of processing cores. In the partially parallel approach, the number of graph nodes is much higher than the number of PEs. Each node is connected to a set of other nodes distributed over the available PEs: different nodes will have different links, resulting in a widely varied PE-to-PE communication pattern. This situation calls for flexible and complex interconnection structures.

5.1. Shift and Shuffle Networks

The decoding of structured LDPC codes, regardless of the implementation, often requires shift or shuffle operations to route information between PEs or to/from memories. This is particularly true for some kinds of LDPC codes, such as QC-LDPC and shift-LDPC [54].

The barrel shifter (BS) is a well-known circuit designed to perform all the permutations of its inputs obtainable with a shift operation, thus being well suited to the circularly shifted structure of the QC-LDPC matrix.
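The behavior of a log-stage barrel shifter can be sketched in a few lines of Python: each stage conditionally rotates the data by a power of two, driven by one bit of the shift amount (an illustrative model, not any specific published design).

```python
def barrel_shift(data, shift):
    """Model of a log-stage barrel shifter: stage k rotates the vector
    left by 2**k positions when bit k of the shift amount is set."""
    n = len(data)
    shift %= n
    for k in range(shift.bit_length()):
        if (shift >> k) & 1:
            s = (1 << k) % n
            data = data[s:] + data[:s]  # rotate left by 2**k
    return data
```

This is exactly the circular shift needed to apply a z × z identity submatrix shifted by s positions to a block of z messages.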

Rovini et al. in [55] exploit the simple structure of the barrel shifter to design a circular shifting network for WiMAX codes. This network must be able to handle all the different submatrix sizes of the standard, thus effectively becoming a multisize circular shifting (MS-CS) network. The MS-CS network is composed of a bank of BSs whose size equals the greatest common divisor of all the supported block sizes. Each BS rotates all its blocks of data by the same shift amount; the blocks are subsequently rearranged by an adaptation network into the desired order according to the current submatrix size. Implementation results show that the proposed MS-CS network outperforms in terms of complexity previous similar solutions such as [56–59].

In [36], a shift network for the WiMAX standard is designed based on a self-routing technique. The network is sized to handle the largest submatrix size of the standard, 96: when decoding a smaller code, dummy messages are routed as well, marked with a dedicated flag. Two stages of barrel shifters provide the shift function to real and dummy messages alike, together with a single permutation network: a lookup engine finally selects the useful messages based on the flag bits, shift size, and submatrix size.

Barrel shifters, though providing the most immediate implementation of the shift operation, often lack the flexibility necessary to directly tackle multiple block sizes. For this reason, they are usually combined with more complex structures.

One of the most common implementations among the simplest interconnection structures is the Benes network (BeN). This kind of network is a rearrangeable nonblocking network frequently used as a permutation network. Defining N as the number of inputs and outputs, an N × N BeN can perform any permutation of the inputs, creating a one-to-one relation with the outputs, with N log2 N − N/2 2 × 2 switches. Its stand-alone use is thus confined to situations in which sets of data tend not to intertwine: its range of usage, though, can be extended through smart scheduling.
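The switch count can be sanity-checked with a short sketch, assuming the standard construction of the BeN from 2 × 2 switches:

```python
import math

def benes_switches(n):
    """Number of 2x2 switches in an n x n Benes network:
    (2*log2(n) - 1) stages of n/2 switches each,
    which equals n*log2(n) - n/2."""
    stages = 2 * int(math.log2(n)) - 1
    return stages * (n // 2)
```

For n = 128 this gives 832 switches, matching the figure quoted below for the traditional BeN.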

In [60], a flexible ASIC architecture for high-throughput shift-LDPC decoders is depicted. Shift-LDPC codes are a subclass of structured LDPC codes: their parity-check matrix is an array of square submatrices in which the leftmost submatrices are randomly row-permuted versions of the identity matrix I, and each remaining submatrix is a permutation matrix obtained from its left neighbor by cyclically shifting the columns right by a single position. The operations involved in the definition of the submatrices guarantee that each submatrix is also a copy of the one above it with the rows shifted up by one position, so all the submatrices of a row can be found by cyclically shifting the first one.
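The structure described above can be mimicked with a toy construction in Python. The permutation submatrices are stored compactly as lists (entry r holds the column of the single one in row r), and the single-position cyclic shift between neighboring submatrices is an illustrative convention, not the exact definition used in [60].

```python
import random

def build_shift_ldpc(rows, cols, L, seed=0):
    """Toy shift-LDPC parity-check matrix: a rows x cols array of L x L
    permutation submatrices. The leftmost submatrix of each block row is
    a random row permutation of the identity; every submatrix to its
    right is its left neighbour with the rows cyclically shifted up."""
    rng = random.Random(seed)
    H = []
    for _ in range(rows):
        base = list(range(L))
        rng.shuffle(base)                          # random row-permuted identity
        block_row = [base]
        for _ in range(cols - 1):
            prev = block_row[-1]
            block_row.append(prev[1:] + prev[:1])  # shift rows up by one
        H.append(block_row)
    return H
```

Since consecutive submatrices differ by a fixed shift, the VNU-to-CNU connections can be updated by a simple cyclic rotation every clock cycle, which is the property exploited by the decoder in [60].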

To best exploit the properties of shift-LDPC codes, a VSS scheme has been selected, along with a highly parallel implementation: banks of variable node units (VNUs) and check node units (CNUs) perform a whole iteration in a small, fixed number of steps. Since every submatrix is a shifted version of the previous one, the connections between VNUs and CNUs also shift cyclically every clock cycle: this observation leads to the joint design of the Benes global permutation network and the CN shuffler.

The BeN is used to define the links between VNUs and CNUs: these links are static once the code parameters and the structure of the leftmost submatrices have been fixed. Its high degree of flexibility is exploited to guarantee support over a variety of different codes. The inter-CN communication required by the VSS approach is handled by the CN shuffle network. Its function is to cyclically shift the submatrix rows assigned to each CNU: this means that while each CNU will be physically connected to the same VNU for the whole decoding, the row of the matrix it represents will change. The BeN consequently never needs to be rearranged.

The flexible decoder is able to achieve 3.6 Gb/s in 180 nm CMOS technology: the area occupation is relatively small with respect to the very high degree of parallelism, thanks to the nonuniform 4-bit quantization scheme adopted.

In [61], a SIMD-based ASIC is proposed for LDPC decoding over a wide array of standards. The ASIC is composed of 12 parallel datapaths able to decode both turbo and LDPC codes through the BCJR algorithm. Like most SIMD cores, the decoder handles communication by means of shared memories. Memory management can be challenging, especially in case the parallel datapaths are assigned to fractions of the same codeword. According to [62], it is possible to avoid collisions in such cases with ad hoc mapping of the interleaving laws, together with one (for LDPC) or two (for turbo) permutation networks to interface with the memories. These networks are implemented in [61] with BeNs, the one at the input of the extrinsic values memory being transparent in LDPC mode.

Not every supported standard requires all 12 datapaths to be active: the chosen parallelism is the minimum necessary for throughput compliance, and the same holds for the working frequency. The implementation results show full compliance with WiMAX, WiFi, 3GPP-HSDPA, and DVB-SH, at the cost of 0.9 mm² in 45 nm CMOS technology and a total power consumption of 86.1 mW.

One of the limitations of the traditional BeN is the number of its inputs and outputs, which is bound to be a power of 2. However, LDPC decoders often need permutations of different sizes: for example, WiMAX codes require shift permutations of sizes corresponding to the possible expansion factors, that is, from 24 to 96 in steps of 4. In [63], an alternative switch network is designed that makes use of differently sized switches, leading to a more hardware-efficient design; their introduction in fact allows non-power-of-two network sizes. A fully compliant WiMAX LDPC decoder shift operation can thus be implemented with the novel switch network using fewer switches than the 832 necessary for a traditional 128 × 128 BeN. Together with efficient control signal generation, this solution outperforms in terms of both complexity and flexibility other modified Benes-based decoders such as [57, 64], which exploit a secondary BeN to rearrange the first one.

In [65], Lin et al. propose an optimized Benes-based shuffle network for WiMAX. Unlike [63], the starting point of the design is the nonoptimized full-size BeN: from there, all the switches through which no signal ever passes are removed, whereas switches with fixed outputs are replaced by wires. This ad hoc trimming technique, together with an efficient algorithm for control signal creation, allows a 26.6%–71.1% area reduction with respect to previously published shift network solutions such as [27, 55, 66].

Similar to the BeN is the Banyan network (ByN) [67], which can be seen as a trade-off between the flexibility of the BeN and the simplicity of the BS. Although not nonblocking in general, the ByN is nonblocking when restricted to shift operations. Moreover, it is composed on average of half the switches of a BeN and requires fewer control signals.
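The roughly two-to-one switch-count ratio can be verified with a short sketch, assuming the standard constructions: log2(n) stages for the ByN against 2·log2(n) − 1 for the BeN, each stage holding n/2 switches.

```python
import math

def benes_switches(n):
    """2x2 switches in an n x n Benes network: (2*log2(n)-1) stages of n/2."""
    return (2 * int(math.log2(n)) - 1) * (n // 2)

def banyan_switches(n):
    """2x2 switches in an n x n Banyan network: log2(n) stages of n/2."""
    return int(math.log2(n)) * (n // 2)
```

For n = 64 this gives 352 Benes switches against 192 Banyan switches, consistent with the "roughly half" observation above.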

The work described in [68] portrays a highly parallel shuffle network based on the ByN paradigm. Like the BeN, the ByN is bound to a power-of-two number of inputs: as Oh and Parhi have done in [63], the introduction of differently sized switches here also makes it possible to handle the various submatrix sizes of the WiMAX standard. The implemented decoder guarantees a very high degree of flexibility with complexity lower than or comparable to [36, 63–65].

5.2. Networks-on-Chip

Networks-on-Chip (NoCs) [69] are versatile interconnection structures that allow communication among all the connected devices through the presence of routing elements. Recently, decoders for both turbo and LDPC codes based on the NoC paradigm have been proposed, thanks to the intrinsic flexibility of NoCs. NoC-based decoders are multiprocessor systems composed of various instantiations of the same IP, each associated to a routing element, linked in a defined pattern. This pattern can be represented with a graph in which every node corresponds to a processor and a router: the arcs are the physical links among routers, thus identifying a topology.

In [70], a De Bruijn topology [71] NoC is proposed for flexible LDPC decoders. Since the TPMP schedule has been selected, the NoC must handle the communication between VNs and CNs. The NoC design, however, is completely detached from code and decoding parameters, effectively allowing usage with any LDPC code. The router embeds a modified shortest path routing algorithm that can be executed in 1 clock cycle, together with deadlock-free and buffer-reducing arbitration policies, and is connected to its PE via the network interface. The network is synthesized and compared to other explored network topologies, such as the 2-dimensional mesh [72], Benes [73], and MDN [74]: the degree of flexibility and scalability that the proposed topology guarantees is unmatched.

The performance of another topology, the 2D toroidal mesh, is evaluated in [40]. The routing element implements a near-optimal routing algorithm for the torus/mesh [75]. A whole set of communication-centric parameters is varied in order to evaluate the impact of the network latency on the whole decoder performance. It has been shown that small PE sending periods, that is, few cycles between two available data from the processor, increase latency to unsustainable levels, especially at the smallest values. Also, the variations in throughput due to different NoC parallelisms are shaded by the impact of latency.

The work in [76] describes a flexible LDPC decoder design tackling the communication problem with two different NoC solutions. The first network is a De Bruijn NoC adopting online dynamic routing, implementing the same modified shortest path algorithm described in [70]; the second, a 2D torus, is based on a completely novel concept named zero overhead NoC (ZONoC). Given a mapping of VNs and CNs onto a topology, the message exchange pattern is deterministic, along with the status of the network at each instant. The ZONoC exploits this property by running offline simulations and storing the routing information into dedicated memories, effectively wiping out the time spent for routing and traffic control. The overall network complexity is scaled down, since no routing algorithm is necessary; FIFO lengths can be trimmed to the minimum necessary, while the router architecture is as simple as possible (Figure 3). A crossbar switch controlled by the routing memory receives messages from the FIFOs connected to its PE and to other routers, while outgoing messages are sent to as many output registers. Implementation results show a significant reduction in complexity with respect to [28, 36, 56, 72] and comparable or superior throughput.
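The ZONoC principle can be sketched as follows: since traffic is deterministic, the per-cycle crossbar configuration of every router is computed offline and stored in a routing memory, so at run time each router only performs a table lookup. The data layout below is hypothetical, for illustration only.

```python
def precompute_routing(schedule):
    """Offline phase: from a deterministic message schedule, build each
    router's routing memory. `schedule` lists tuples
    (cycle, router, in_port, out_port); the memory maps
    (router, cycle) -> {out_port: in_port} crossbar settings."""
    memory = {}
    for cycle, router, in_port, out_port in schedule:
        memory.setdefault((router, cycle), {})[out_port] = in_port
    return memory

def route_cycle(memory, router, cycle, inputs):
    """Run-time phase: apply the stored crossbar setting; no routing
    algorithm or traffic control is executed."""
    config = memory.get((router, cycle), {})
    return {out: inputs[inp] for out, inp in config.items()}
```

Because the run-time phase is a pure lookup, the router needs neither routing logic nor arbitration, which is the source of the complexity reduction reported in [76].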

5.3. Reducing the NoC Penalty

The NoC approach guarantees a very high degree of flexibility and, in theory, a NoC-based decoder can reach very high throughput, since the achievable throughput is proportional to the number of PEs; but increasing the size of the network means raising the latency, and thus degrading performance. Very few state-of-the-art solutions have managed to solve this problem, and those that do suffer from large complexity and power consumption. We have tried to overcome these shortcomings in some recent works.

5.3.1. NoC-Based WiMAX LDPC Decoder

The solution described in [76] supports the WiMAX standard LDPC codes, but does not guarantee a high enough throughput. Stemming from it, we have developed a ZONoC-based LDPC decoder fully compliant with the WiMAX standard. We designed a sequential PE implementing the normalized Min-Sum decoding algorithm and, unlike in [76], we adopted the layered decoding approach described by Hocevar in [77]: although it yields a more convoluted graph structure, it relies on a smaller number of exchanged messages and guarantees a significant gain in convergence speed. The PE architecture is independent of the code parameters, and the memory capacity sets the only limit to the size of the supported codes. Together with the PE, we devised a decoder reconfiguration technique to upload the data necessary for routing and memory management when switching between codes.
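The check node update of the normalized Min-Sum algorithm used by the PE can be sketched as follows; this is a generic formulation, and the quantization and the value of the normalization factor are implementation details not taken from [77].

```python
def norm_min_sum_check_node(llrs, alpha=0.75):
    """Normalized Min-Sum check node update: for each edge, the output
    sign is the product of all the other input signs, and the output
    magnitude is alpha times the minimum of the other magnitudes."""
    out = []
    for i in range(len(llrs)):
        others = llrs[:i] + llrs[i + 1:]
        sign = 1.0
        for v in others:
            if v < 0:
                sign = -sign
        out.append(sign * alpha * min(abs(v) for v in others))
    return out
```

A hardware PE would compute only the two smallest magnitudes and the overall sign once per check node, exploiting the value-reuse property mentioned in Section 4.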

In order to comply with the WiMAX standard throughput requirements, the size of the 2D torus mesh has been raised from 16 nodes to 25. As detailed in [78], the decoder guarantees more than 70 Mb/s for all rates and block sizes of the standard, with an area of 4.72 mm² in 130 nm CMOS technology.

5.3.2. Bandwidth and Power Reduction Methods

While the former decoder is compliant with WiMAX in the worst case, that is, when the maximum allowed number of iterations is performed, a codeword is corrected, on average, with fewer iterations: the unnecessary iterations significantly contribute to the high power consumption of the NoC. In [79], two methods aimed at reducing power and increasing throughput are studied and implemented. The first is the iteration early stopping (ES) criterion proposed in [80], which stops the decoding when all the information bits of a codeword are correct, regardless of the redundancy bits. The other is a threshold-based message stopping (MS) criterion, which reduces the traffic load on the network by avoiding the injection of values that carry information about high-probability correct bits.
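A minimal sketch of the threshold-based MS criterion follows; the threshold value and the message layout are illustrative assumptions.

```python
def filter_messages(llrs, threshold=8.0):
    """Threshold-based message stopping: messages whose magnitude exceeds
    the threshold refer to almost surely correct bits, so they are not
    injected into the NoC, reducing the traffic load."""
    return [(idx, v) for idx, v in enumerate(llrs) if abs(v) < threshold]
```

Since stopped messages break the precomputed ZONoC communication pattern, the decoder in [79] must fall back to online dynamic routing when MS is enabled.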

Various combinations of the two methods have been tried, together with different parallelisms of the 2D torus mesh. Implementation of the ES criterion requires a dedicated processing block with minimal PE modifications, while MS requires a threshold comparison block for each PE and a switch to online dynamic routing. This is necessary since stopping a message invalidates the statically computed communication pattern. While the ES method guarantees a reduction in the average energy per decoded frame regardless of the implementation, the results of the MS method change with the size of the NoC. Since stopped messages can lead to additional errors, a performance sacrifice must be accepted: among the solutions presented in [79], with a 0.3 dB BER loss, a 9-PE NoC is sufficient to support the whole WiMAX standard.

5.3.3. NoC Analysis for Turbo/LDPC Decoders

In [81], an extensive analysis of the performance of various NoC topologies is performed in the context of multiprocessor turbo decoders. Flexibility can also be explored in terms of the types of code supported: in [82], we have consequently extended the topology analysis to LDPC codes, in order to find a suitable architecture for a dual turbo/LDPC decoder. As a case study, we focused our research on the WiMAX codes.

The performance of a wide set of topologies (ring, spidergon, toroidal meshes, honeycomb, De Bruijn, and Kautz) has been evaluated in terms of achievable throughput and complexity, considering different parallelisms. Exploiting a modified version of the cycle-accurate simulation tool described in [81], a range of design parameters has been taken into consideration, including data injection rate, message collision management policies, routing algorithms, node addressing modes, and the structure of the routing element, allowing the design space to span from a completely adaptive architecture to a ZONoC-like precalculated routing.

The simulations revealed the Kautz topology [83] to be the best trade-off in terms of throughput and complexity between LDPC and turbo codes, with a partially adaptive router architecture and FIFO length-based routing. Two separate PEs for turbo and LDPC codes have been designed: different NoC and PE working frequencies allow trimming the message injection rate in the network. The full decoder, complete with separate turbo and LDPC PEs, has been synthesized in 90 nm CMOS technology: the decoder is compliant with both the turbo and LDPC throughput requirements for all the WiMAX standard codes. Worst-case throughput results outperform the latest similar solutions such as [39, 41, 61, 84], with a small area occupation and a particularly low power consumption (59 mW) in turbo mode.

6. Conclusions

A complete overview of LDPC decoders, with particular emphasis on flexibility, has been drawn. Various classifications have been depicted according to the degree of parallelism and the implementation choices, focusing on common design choices and elements for flexible LDPC decoders. An in-depth view has been given of the PE and interconnection parts of the decoders, with comparisons against the current state of the art; the latest work by the authors on NoC-based decoders has also been briefly described.