Swapnil Mhaske, Hojin Kee, Tai Ly, Ahsan Aziz, Predrag Spasojevic, "FPGA-Based Channel Coding Architectures for 5G Wireless Using High-Level Synthesis", International Journal of Reconfigurable Computing, vol. 2017, Article ID 3689308, 23 pages, 2017. https://doi.org/10.1155/2017/3689308
FPGA-Based Channel Coding Architectures for 5G Wireless Using High-Level Synthesis
Abstract
We propose strategies to achieve a high-throughput FPGA architecture for quasi-cyclic low-density parity-check (QC-LDPC) codes based on circulant-1 identity matrix construction. By splitting the node processing operation in the min-sum approximation algorithm, we achieve pipelining in the layered decoding schedule without utilizing additional hardware resources. High-level synthesis compilation is used to design and develop the architecture on the FPGA hardware platform. To validate this architecture, an IEEE 802.11n compliant 608 Mb/s decoder is implemented on the Xilinx Kintex-7 FPGA using the LabVIEW FPGA Compiler in the LabVIEW Communication System Design Suite. Architecture scalability was leveraged to accomplish a 2.48 Gb/s decoder on a single Xilinx Kintex-7 FPGA. Further, we present rapidly prototyped experimentation of an IEEE 802.16 compliant hybrid automatic repeat request (HARQ) system based on the efficient decoder architecture developed. In spite of the mixed nature of data processing (digital signal processing and finite-state machines), LabVIEW FPGA Compiler significantly reduced the time to explore the system parameter space and to optimize in terms of error performance and resource utilization. A 4x improvement in the system throughput, relative to a CPU-based implementation, was achieved to measure the error-rate performance of the system over large, realistic data sets using accelerated, in-hardware simulation.
1. Introduction
The year 2020 is slated to witness the first commercial deployment of the 5th generation of wireless technology. 5G is expected to deliver a uniform Quality of Service (QoS) of 100 Mb/s and peak data rates of up to Gb/s, with over-the-air latency of less than 1 ms [1]. All of this is to be achieved with the energy consumption of contemporary cellular systems. Channel coding is crucial to achieve good performance in a communication system. Near-capacity performing codes such as Turbo codes [2] and Low-Density Parity-Check (LDPC) codes [3] typically require high-complexity encoding and decoding methods. Today, the standardization efforts towards realizing 5G cellular systems have already begun [4]. The suitability of a particular channel coding scheme is being discussed; and for a system realization of the size of 5G, the evolution of requirements pertaining to channel coding is naturally expected. In our effort to study and design channel codes based on areas ranging from theoretical performance evaluation up to implementation complexity analysis, we have identified two main requirements in the development process. The first one is flexibility for future modifications. To facilitate this, we choose the reconfigurable FPGA platform. Moreover, for this evolving architecture, we aim to observe not only the theoretical complexity versus performance tradeoff, but also the implementation complexity versus performance tradeoff. This brings us to the second major requirement, which is real-world rapid prototyping of our methods. Figure 1 summarizes our research methodology. Even though theoretical simulations validate a novel idea, they fail to comprehensively assess its real-world impact. In an effort towards designing and developing a hardware architecture for channel coding, it is crucial to monitor the performance of the system in real time, on actual state-of-the-art hardware. This helps us keep track of parameters such as throughput, latency, and resource utilization of the system, each time a modification is made. We would also like to emphasize that rapid prototyping can be used not only for validating the design on real-world hardware platforms (Sections 5.1 and 5.2), but also for speedup of theoretical simulations (Section 5.3).
To accomplish this, in addition to the use of FPGA-based implementation, we use a High-Level Synthesis (HLS) compiler built in LabVIEW, namely, the LabVIEW FPGA Compiler [5–9], the details of which (relevant to this work) are given in Section 3. One of the main contributions of this work is the state-of-the-art HLS technology that offers an automated and systematic compilation flow, which generates an optimized hardware implementation from a user’s algorithm and design requirements. This methodology empowers domain experts with minimum hardware knowledge to leverage FPGA technology in exploring, prototyping, and verifying their complex domain-specific applications. As shown in Figure 2, our compilation flow takes an application diagram as well as high-level design requirements, such as clock rate and throughput, and produces an optimized implementation with resource and timing estimates. By simply modifying application parameters and design requirements, designers can quickly get new hardware implementations with updated estimates. High-level design (user) requests and estimates enable designers to easily evaluate the current model and requirements and plan further algorithmic exploration. This rapid design process paves the way for domain experts to successfully accomplish the optimized design solution with significant time and cost savings.
Quasi-cyclic LDPC (QC-LDPC) codes or their variants (such as accumulator-based codes [10]) that can be decoded (suboptimally) using Belief Propagation (BP) are highly likely candidates for 5G systems [4]. Insightful work on high-throughput (order of Gb/s) BP-based QC-LDPC decoders is available; however, most such works focus on an application-specific integrated circuit (ASIC) design [11, 12], which usually requires intricate customizations at the register-transfer level (RTL) and expert knowledge of very-large-scale integration (VLSI) design. A sizeable subset of the above-mentioned work caters to fully-parallel [13] or code-specific [14] architectures. From the point of view of an evolving research solution, this is not an attractive option for rapid prototyping. In the relatively less explored area of FPGA-based implementation, impressive results have recently been presented in works such as [15–17]. However, these are based on fully-parallel architectures which lack flexibility (code-specific) and are limited to small block sizes (primarily due to the inhibiting routing congestion) as discussed in the informative overview in [18]. Since our case study is based on fully automated generation of the hardware description language (HDL), we compare our results with some recent HLS-based state-of-the-art implementations [19–22] in Section 6. The main contributions of this work are as follows. First, we present a high-throughput FPGA-based IEEE 802.11n standard compliant QC-LDPC channel decoder. With the architectural technique of splitting the node processing, we achieve pipelining without utilizing additional hardware resources. Second, to demonstrate the scalability of the architecture, we present its application to a massively-parallel Gb/s USRP-based decoder implementation (also demonstrated on the exhibit floor at the 2014 IEEE GLOBECOM conference [23]). The final contribution is a method to rapidly prototype the experimentation of a hybrid automatic repeat request (HARQ) system based on the efficient decoder architecture developed, using the IEEE 802.16 standard compliant QC-LDPC code. The system comprises not only digital signal processing (DSP), but also finite-state machines (FSM). In spite of this mixed nature of data processing, LabVIEW FPGA Compiler was able to significantly reduce the time to explore the overall system parameter space and to optimize resource utilization for the error-rate performance achieved.
The remainder of this article is organized as follows. Section 2 provides a succinct introduction to the QC-LDPC code structure and the corresponding decoding algorithm considered for the architecture. The strategies for achieving high throughput for the standalone QC-LDPC decoder are explained in Section 4. The case studies for the high-throughput decoder, its application demonstrating scalability, and the rapidly prototyped HARQ experiment are detailed in Section 5. A survey of recent state-of-the-art solutions is provided in Section 6. Section 7 concludes the article.
2. Quasi-Cyclic LDPC Codes
LDPC codes are a class of linear block codes that have been shown to achieve near-capacity performance on a broad range of channels. Invented by Gallager [3] in 1962, they are characterized by a Low-Density (sparse) Parity-Check Matrix (PCM) representation. Mathematically, an LDPC code is the null space of its $m \times n$ PCM $\mathbf{H}$, where $m$ denotes the number of parity-check equations or parity bits and $n$ denotes the number of variable nodes or code bits [24]. In other words, for a rank-$m$ PCM $\mathbf{H}$, $m$ is the number of redundant bits added to the $k$ information bits, which together form the codeword of length $n = k + m$. An example of a Tanner graph representation (due to Tanner [25], who introduced a graphical representation) is shown in Figure 3. Here, the PCM $\mathbf{H}$ is the incidence matrix of a bipartite Tanner graph comprising two sets: the check node (CN) set of $m$ parity-check equations and the variable node (VN) set of $n$ variable or bit nodes; the $i$th CN is connected to the $j$th VN if $\mathbf{H}(i, j) = 1$. The column weight $d_v \ll m$ and the row weight $d_c \ll n$, where row weight and column weight are defined as the number of 1s along a row and a column, respectively. An LDPC code is called a regular code if each CN has a degree $d_c$ and each VN has a degree $d_v$ and is called an irregular LDPC code otherwise.
2.1. Parity-Check Matrix
The first LDPC codes by Gallager [3] are random, which complicates the decoder implementation, mainly because a random interconnect pattern between the VNs and CNs directly translates to a complex wire routing circuit on hardware. QC-LDPC codes [26] belong to the class of structured codes that do not significantly compromise performance relative to randomly constructed LDPC codes.
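The construction of QC-LDPC codes relies on an $m_b \times n_b$ matrix $\mathbf{H}_b$, sometimes called the base matrix, which comprises cyclically right-shifted identity and zero submatrices, both of size $z \times z$, where, for $1 \le i_b \le m_b$ and $1 \le j_b \le n_b$, the shift value is $s = \mathbf{H}_b(i_b, j_b) \in \{-1\} \cup \{0, \ldots, z-1\}$. The PCM $\mathbf{H}$ is obtained by expanding $\mathbf{H}_b$ using the mapping $s \mapsto \mathbf{I}_s$ if $s \ne -1$ and $s \mapsto \mathbf{0}$ if $s = -1$, where $\mathbf{I}_s$ is an identity matrix of size $z$ which is cyclically right-shifted by $s$ and $\mathbf{0}$ is the all-zero matrix of size $z \times z$. As $\mathbf{H}$ comprises the submatrices $\mathbf{I}_s$ and $\mathbf{0}$, it has $m = m_b z$ rows and $n = n_b z$ columns. The base matrix for the IEEE 802.11n (2012) standard [27] is shown in Table 1.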
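The expansion of a base matrix into the full PCM can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an entry of -1 is taken to mark an all-zero block, any entry from 0 to z-1 a right-shifted identity block, and the function names and the toy base matrix are hypothetical rather than taken from the standard.

```python
def shifted_identity(z, s):
    """z x z identity matrix cyclically right-shifted by s columns."""
    return [[1 if c == (r + s) % z else 0 for c in range(z)] for r in range(z)]


def expand_base_matrix(base, z):
    """Expand a base matrix into the full parity-check matrix.

    Assumed convention: -1 marks an all-zero z x z block and a value
    s in 0..z-1 marks an identity block right-shifted by s.
    """
    pcm = []
    for base_row in base:
        blocks = [
            [[0] * z for _ in range(z)] if s < 0 else shifted_identity(z, s)
            for s in base_row
        ]
        # Stitch row r of every block side by side to form one PCM row.
        for r in range(z):
            pcm.append([bit for block in blocks for bit in block[r]])
    return pcm


# Toy 2 x 3 base matrix with z = 4 (illustrative values only).
toy_base = [[0, 1, -1],
            [2, -1, 3]]
H = expand_base_matrix(toy_base, 4)
```

With this toy input the result has 2 x 4 = 8 rows and 3 x 4 = 12 columns, and every row has weight 2 because each base row contains two valid entries.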

2.2. Scaled Min-Sum Approximation Decoding
LDPC codes can be suboptimally decoded using the BP method [3, 28] on the sparse bipartite Tanner graph where the CNs and VNs communicate with each other, successively passing revised estimates of the associated log-likelihood ratio (LLR) in every decoding iteration. In this work, we have employed the efficient decoding algorithm presented in [29], with a pipelining schedule based on the row-layered decoding technique [30], detailed in Section 4.3.
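Definition 1. For $1 \le i \le m$ and $1 \le j \le n$, let $v_j$ denote the $j$th bit in the length-$n$ codeword and $y_j$ denote the corresponding received value from the channel corrupted by the noise sample $n_j$. Let the variable-to-check (VTC) message from VN $j$ to CN $i$ be $q_{ij}$ and let the check-to-variable (CTV) message from CN $i$ to VN $j$ be $r_{ij}$. Let the a posteriori probability ratio for variable node $j$ be denoted as $P_j$.
The steps of the scaled MSA are given below.
(1) Initialization. The a posteriori probability for the VN $j$ and the CTV messages are initialized as
$$P_j = \ln \frac{\Pr(v_j = 0 \mid y_j)}{\Pr(v_j = 1 \mid y_j)}, \qquad r_{ij}^{(0)} = 0.$$
(2) Iterative Process. During the $t$th decoding iteration,
$$q_{ij}^{(t)} = P_j - r_{ij}^{(t-1)}, \tag{4}$$
$$r_{ij}^{(t)} = a \cdot \prod_{j' \in N(i) \setminus \{j\}} \operatorname{sign}\left(q_{ij'}^{(t)}\right) \cdot \min_{j' \in N(i) \setminus \{j\}} \left|q_{ij'}^{(t)}\right|, \tag{5}$$
$$P_j = q_{ij}^{(t)} + r_{ij}^{(t)}, \tag{6}$$
where $j \in N(i)$, $N(i) \setminus \{j\}$ represents the set of the VN neighbors of CN $i$ excluding VN $j$, and $a$ is the scaling factor used, the rationale behind which is explained below.
(3) Decision Rule. $\hat{v}_j = 0$ if $P_j \ge 0$, and $\hat{v}_j = 1$ otherwise.
(4) Stopping Criteria. If $\hat{\mathbf{v}} \mathbf{H}^{T} = \mathbf{0}$ or $t = t_{\max}$ (maximum number of decoding iterations), declare $\hat{\mathbf{v}}$ as the decoded codeword.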
It is well known that since the MSA is an approximation of the sum-product algorithm (SPA) [3], the performance of the MSA is relatively worse than that of the SPA [24]. However, work such as [31] has shown that scaling the CTV messages can improve the performance of the MSA. Hence, we scale the CTV messages by a constant factor to compensate for the performance loss due to the min-sum approximation.
The standard BP algorithm is based on the so-called flooding or two-phase schedule where each decoding iteration comprises two phases. In the first phase, VTC messages for all the VNs are computed and, in the second phase, the CTV messages for all the CNs are computed, strictly in that order. Thus, message updates from one side of the graph propagate to the other side only in the next decoding iteration. In the algorithm given in [29], however, message updates can propagate across the graph in the same decoding iteration. This provides several advantages: a single processing unit is required for both CN and VN message updates, memory storage is reduced on account of the on-the-fly computation of the VTC messages, and the algorithm converges faster than the standard BP flooding schedule, requiring fewer decoding iterations.
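To make the difference from the flooding schedule concrete, the following Python sketch processes one check node (row) at a time and updates the a posteriori values immediately, so later rows in the same iteration already see the new values. It is a software illustration only, not the hardware architecture of this work; the function name, the dictionary-based message storage, and the scaling value 0.75 used in the example are assumptions chosen for illustration. Every check node is assumed to have degree at least two.

```python
def layered_scaled_min_sum(pcm, channel_llr, scale, max_iter=10):
    """Row-by-row (layered) scaled min-sum decoding sketch.

    pcm: list of 0/1 rows; channel_llr: one LLR per code bit, where a
    positive value means bit 0 is more likely.
    Returns (hard_decisions, iterations_used).
    """
    posterior = list(channel_llr)
    # One stored CTV message per edge, initialised to zero.
    ctv = [{j: 0.0 for j, h in enumerate(row) if h} for row in pcm]
    bits = [0 if p >= 0 else 1 for p in posterior]
    for it in range(1, max_iter + 1):
        for row_msgs in ctv:
            # VTC messages are formed on the fly from the posteriors.
            vtc = {j: posterior[j] - r for j, r in row_msgs.items()}
            for j in row_msgs:
                others = [vtc[k] for k in vtc if k != j]
                sign = -1.0 if sum(v < 0 for v in others) % 2 else 1.0
                row_msgs[j] = scale * sign * min(abs(v) for v in others)
                # Visible to the following rows within this iteration.
                posterior[j] = vtc[j] + row_msgs[j]
        bits = [0 if p >= 0 else 1 for p in posterior]
        if all(sum(h * b for h, b in zip(row, bits)) % 2 == 0 for row in pcm):
            return bits, it
    return bits, max_iter


# Toy example: (7,4) Hamming parity checks, all-zero codeword sent,
# bit 0 received with the wrong sign. Values are illustrative.
hamming = [[1, 1, 0, 1, 1, 0, 0],
           [1, 0, 1, 1, 0, 1, 0],
           [0, 1, 1, 1, 0, 0, 1]]
decoded, used = layered_scaled_min_sum(hamming, [-1.0, 2, 2, 2, 2, 2, 2], 0.75)
```

Note that no separate VTC memory is kept: each VTC value is recomputed from the current posterior and the stored CTV message, which is the storage saving mentioned above.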
3. HLS with LabVIEW FPGA Compiler
The HLS compiler in the LabVIEW Communication System Design Suite (CSDS) [32], namely, LabVIEW FPGA Compiler, aims at identifying opportunities for efficient parallelization in the application’s algorithmic description, subject to requirements set by the user. Here, we briefly describe the main techniques [5] embedded into the LabVIEW FPGA Compiler toolset that enable efficient high-throughput translation of the algorithm into a VHDL description.
3.1. Memory Dependency Analysis
Loop unrolling on FPGA platforms is a well-known compiler optimization used to exploit parallelism [33]. However, in the presence of execution dependencies between loop iterations, loop unrolling may not contribute to throughput improvement. An example is shown in Figure 4(a) where an execution dependency restricts parallelization of unrolled loops. Although loops have been unrolled by a factor of two as shown in Figure 4(b), the first loop copy waits until the second loop copy execution is finished. Due to the serialized loop execution, the overall performance is the same as that of the original loop, but at the cost of more FPGA resources used by the new loop copies.
Figure 4: (a) Timing diagram before unrolling. (b) Timing diagram after unrolling.
However, if unrolling is performed only when it improves throughput, a tradeoff between throughput and resource consumption can be achieved in the implementation. An illustrative example is provided in Figure 5, where a feedback node defines a data dependency across consecutive diagram executions. A Read-After-Write (RAW) dependency between the current memory read operation and a previous memory write operation is shown in Figure 5(a). This dependency prevents the compiler from pipelining the diagram executions and becomes a bottleneck, restricting the overall throughput as shown in Figure 5(b). However, if the compiler can determine that the read operation never accesses a memory location that is updated by the preceding write operation, then consecutive diagram executions can overlap and achieve better throughput as shown in Figure 5(c). Such an analysis is also applicable to relax Write-After-Read (WAR) and Write-After-Write (WAW) dependencies.
Figure 5: (a) Algorithm description (application diagram). (b) Without access pattern analysis. (c) With access pattern analysis.
The memory access pattern analysis in LabVIEW FPGA Compiler mainly comprises two steps. In the first step, a periodic access pattern is determined by monitoring all the stateful nodes that contribute to each memory access pattern. In the second step, access patterns of memory accessor pairs are compared, and the pairwise worst iteration distance is computed. This dependent iteration distance is used to create a relaxed inter-iteration dependency, thus allowing pipelined executions without any memory corruption.
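The notion of a worst-case iteration distance can be illustrated with a small brute-force model. This is a simplified sketch and not the compiler's actual analysis: it assumes one writer and one reader whose addresses are periodic functions of the iteration index, and the function name and the example access patterns are hypothetical.

```python
def raw_iteration_distance(write_addr, read_addr, period):
    """Smallest d >= 1 such that the read in iteration i + d touches the
    address written in iteration i, or None if that never happens.

    Both address functions are assumed to be periodic in the iteration
    index with the given period, so one period of i and d is enough.
    """
    best = None
    for i in range(period):
        for d in range(1, period + 1):
            if read_addr(i + d) == write_addr(i):
                best = d if best is None else min(best, d)
                break
    return best


# Writer sweeps an 8-word buffer; reader runs 5 addresses ahead of it.
relaxed = raw_iteration_distance(lambda i: i % 8, lambda i: (i + 5) % 8, 8)

# Reader consumes exactly what the previous iteration wrote.
tight = raw_iteration_distance(lambda i: i % 8, lambda i: (i - 1) % 8, 8)

# Reader and writer live in disjoint address ranges.
independent = raw_iteration_distance(lambda i: i % 4, lambda i: 4 + i % 4, 4)
```

In the first case the distance is 3, so up to three consecutive executions could overlap without a read observing a stale value; in the second case the distance is 1 and the executions must be serialized; in the third case there is no dependency at all.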
3.2. Memory Access Traffic Relaxation
Loop unrolling may not be effective if the memory access speed cannot keep up with the data throughput request set by the user. This is particularly true for processing intensive applications like the ones studied and implemented in this work. LabVIEW FPGA Compiler uses the following techniques to reduce memory traffic such that the performance targets set by the user are met.
3.2.1. Memory Partitioning
Memory blocks on modern FPGAs typically have only two ports, one of which is generally read-only. Implementing memories with more ports can become very resource intensive and can drastically reduce the clock rate of the design. The limited number of memory ports often causes accesses to be serialized. These serialized memory access requests often leave computational cores idle, thus reducing the system throughput [34]. Memory partitioning is the division of the original memory block into multiple smaller memory blocks. This partitioning effectively increases the number of physical memory access ports on the FPGA to allow simultaneous memory read and write operations, thus minimizing the idle time of the computational cores. Memory accessors are grouped into sets, such that accessors within one set are guaranteed to have a non-overlapping address space with members of another set, allowing the compiler to safely partition the single memory into a memory for each set of accessors. The size of each partition is the size of the address space for that set. The original memory is divided into small partitions based on min-max address ranges of memory accessor groups, and each group is mapped to a separate partition having the matched address range.
LabVIEW FPGA Compiler statically analyzes memory access patterns in a given application diagram and automatically relaxes the memory access bottleneck without impacting the execution of the high-level algorithmic description input to it. Memory traffic is thereby reduced linearly with the number of partitions at no additional memory space cost.
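The grouping by min-max address range can be sketched as an interval-merging pass. This is a minimal illustration, not the compiler's implementation; the accessor names and address ranges are hypothetical.

```python
def partition_memory(accessor_ranges):
    """Group accessors whose inclusive [lo, hi] address ranges overlap.

    accessor_ranges: dict mapping an accessor name to (lo, hi).
    Returns a list of (lo, hi, accessor_names) partitions, one per group.
    """
    partitions = []
    for name, (lo, hi) in sorted(accessor_ranges.items(), key=lambda kv: kv[1]):
        if partitions and lo <= partitions[-1][1]:
            # Overlaps the current group: extend it.
            p_lo, p_hi, names = partitions[-1]
            partitions[-1] = (p_lo, max(p_hi, hi), names + [name])
        else:
            # Disjoint from everything seen so far: start a new partition.
            partitions.append((lo, hi, [name]))
    return partitions


# Two reader/writer pairs that never touch each other's half of a
# 512-word memory (illustrative numbers).
parts = partition_memory({
    "rd_a": (0, 255), "wr_a": (0, 255),
    "rd_b": (256, 511), "wr_b": (256, 511),
})
```

Here the single 512-word memory splits into two 256-word partitions, each with its own pair of ports, and the total storage is unchanged.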
3.2.2. Memory Accessor Jamming
In many applications, memory access is sequential and predictable. When multiple accesses to a memory can be computed in parallel, the values can be accessed together in one clump rather than as many separate smaller accesses. We refer to this as memory accessor jamming. This method creates a memory accessor group such that accessor patterns are of the form $a \cdot i + b$, where $a$ is a periodic offset, $i$ is a loop indexer, and $b$ is a constant offset that is smaller than $a$. The multiple accessors in a group are jammed into a single accessor with a wide word length. This word length is the product of the original word length and the jamming factor value. Consequently, memory access traffic is decreased by the value of the jamming factor. Jamming modifies the memory layout by increasing the word length and reducing the address range by the jamming factor, but it does not need any additional memory space. Jamming is well suited for use with loop unrolling because any in-order memory access pattern inside the loop becomes a jammable access pattern after unrolling.
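The layout change made by jamming can be modelled in software as packing several narrow words into one wide word. The sketch below is only an analogy for the memory layout, with hypothetical function names; it assumes the jamming factor divides the memory depth.

```python
def jam_memory(words, factor, word_bits):
    """Pack `factor` consecutive narrow words into each wide word."""
    assert len(words) % factor == 0
    mask = (1 << word_bits) - 1
    wide = []
    for base in range(0, len(words), factor):
        value = 0
        for lane in range(factor):
            value |= (words[base + lane] & mask) << (lane * word_bits)
        wide.append(value)
    return wide


def read_jammed(wide, index, factor, word_bits):
    """A single wide read returns all `factor` lanes for loop index `index`."""
    mask = (1 << word_bits) - 1
    return [(wide[index] >> (lane * word_bits)) & mask for lane in range(factor)]


# Eight 8-bit words jammed by a factor of 4 become two 32-bit words.
narrow = list(range(8))
wide = jam_memory(narrow, factor=4, word_bits=8)
lanes = read_jammed(wide, index=1, factor=4, word_bits=8)
```

The address range shrinks from 8 to 2 while the word length grows from 8 to 32 bits, so the total number of stored bits is the same and four logical accesses are served by one physical access.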
All of the above techniques have been successfully employed by LabVIEW FPGA Compiler without any manual intervention from the user. For instance, loop unrolling is primarily employed to process algorithmic metrics described in Section 2.2 for the technique of z-fold parallelization of node metric processing as described in Section 4.2. Here, memory access analysis captures relaxed memory dependencies and achieves the reported throughput without any application-specific compiler directives. Moreover, due to the graph-based iterative decoding nature of the application considered for this work, read-write patterns that lend themselves to memory accessor jamming have been identified by the tool and successfully exploited.
The authors would like to emphasize that the algorithmic compiler (LabVIEW FPGA Compiler) translates the application’s high-level description to VHDL. The subsequent compilation of VHDL is performed by the Xilinx Vivado compiler, the details of which are beyond the scope of this work.
4. Techniques for High-Throughput
To understand the high-throughput requirements for LDPC decoding, let us first define the decoding throughput of an iterative LDPC decoder.
Definition 2. Let $f_c$ be the clock frequency, $n$ be the code length, $N_i$ be the number of decoding iterations, and $N_c$ be the number of clock cycles per decoding iteration; then the throughput of the decoder is given by $T = (f_c \cdot n)/(N_i \cdot N_c)$ b/s.
Even though $n$ and $N_i$ are functions of the code and the decoding algorithm used, $f_c$ and $N_c$ are determined by the hardware architecture. Architectural optimization such as the ability to operate the decoder at higher clock rates with minimal latency between decoding iterations can help achieve higher throughput. We have employed the following techniques to increase the throughput given by Definition 2.
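As a quick numerical check of Definition 2, the sketch below evaluates the throughput as the code length divided by the time spent on one codeword, assuming one codeword is decoded at a time. The function name and all numbers in the example are hypothetical and are not measurements from this work.

```python
def decoder_throughput(clock_hz, code_length, iterations, cycles_per_iteration):
    """Decoded code bits per second for one codeword at a time."""
    cycles_per_codeword = iterations * cycles_per_iteration
    return clock_hz * code_length / cycles_per_codeword


# Hypothetical operating point: 200 MHz clock, 1944-bit codeword,
# 8 iterations of 100 clock cycles each.
example_bps = decoder_throughput(200e6, 1944, 8, 100)
```

For these illustrative values the result is 486 Mb/s. Halving the clock cycles per iteration doubles the throughput, which is why the techniques in this section target the per-iteration cycle count and the achievable clock rate.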
4.1. Linear Complexity Node Processing
As noted in Section 2.2, separate processing units for CNs and VNs are not required unlike that for the flooding schedule. The hardware elements that process (4)–(6) are collectively referred to as the Node Processing Unit (NPU).
Careful observation reveals that, among (4)–(6), processing the CTV messages $r_{ij}$, for each CN $i$ and each neighboring VN $j \in N(i)$, is the most computationally intensive due to the calculation of the sign and the minimum value of the set of magnitudes of the VTC messages $q_{ij'}$ received from VN $j'$ to CN $i$, where $j' \in N(i) \setminus \{j\}$. As the degree of CN $i$ is $d_c$, the complexity of processing the minimum value (in terms of the comparisons required) is $O(d_c^2)$. In a straightforward algorithmic description, this translates to two nested for-loops, an outer loop that executes $d_c$ times and an inner loop that executes $d_c - 1$ times.
To achieve linear complexity for the CN message update process in our implementation, the minimum value is computed in two phases or passes. In the first (global) pass, the two smallest values for the CN are computed. These are the first and the second minimum (the smallest value in the set excluding the minimum value of the set). Subsequently, for every incident edge on the said CN, the smallest VN message that does not correspond to the considered edge is selected. In other words, if the said incident edge (for which the CN to VN message is to be sent) has the smallest value (first min), then the second smallest value (second min) obtained in the global pass is sent over this edge; otherwise, the smallest value (first min) is sent. This pass is called the second (local) pass. A similar approach is found in [11, 35]. In a straightforward algorithmic description, this translates to two separate for-loops in tandem: the first loop executes $d_c$ times, computing the first and the second minimum for the set of VTC message values, and the second loop executes $d_c$ times, assigning the appropriate minimum to each branch connecting CN $i$ and VN $j$, where $j \in N(i)$. Consequently, this reduces the complexity from $O(d_c^2)$ to $O(d_c)$. Based on the functionality of the two passes, the NPU is divided into the Global NPU (GNPU) and the Local NPU (LNPU). The algorithm to accomplish this is as follows.
(1) Global Pass. The Global NPU (GNPU) processes this pass.(i)Initialization: let denote the discrete timesteps such that and let and denote the value of the first and the second minimum at time , respectively. The initial value at time is(ii)Comparison: for , , and , note that the ordering of the set that belongs to is induced on the set that belongs to.
(2) Local Pass. The Local NPU (LNPU) at time determines the actual minimum value for each VN , as per the equivalence relation:where . Thus, the computation of the minimum value is accomplished in linear complexity . It was rightly noted by one of the reviewers that initializing the variable is unnecessary, resulting in a redundant iteration in (10). However, we would like to note that the implementation was done based on (10) as given in the algorithm.
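The two passes can be sketched in Python as follows. This is a generic two-minimum formulation written for illustration: it does not reproduce the exact initialization or time-step indexing of the algorithm above, and the function name is hypothetical. Sign handling is omitted so that only the magnitude selection is shown.

```python
def ctv_magnitudes_two_pass(vtc_magnitudes):
    """For each edge, return the minimum magnitude over all other edges."""
    # Global pass: one sweep tracking the first minimum, its position,
    # and the second minimum.
    min1 = min2 = float("inf")
    min1_pos = -1
    for pos, mag in enumerate(vtc_magnitudes):
        if mag < min1:
            min1, min2, min1_pos = mag, min1, pos
        elif mag < min2:
            min2 = mag
    # Local pass: the edge that supplied the first minimum receives the
    # second minimum; every other edge receives the first minimum.
    return [min2 if pos == min1_pos else min1
            for pos in range(len(vtc_magnitudes))]


# Check against the quadratic-time definition on a small example.
mags = [3.0, 0.5, 2.0, 0.5, 4.0]
fast = ctv_magnitudes_two_pass(mags)
slow = [min(m for k, m in enumerate(mags) if k != j) for j in range(len(mags))]
```

Both loops run once over the edges of the check node, so the number of comparisons grows linearly with the check node degree, whereas the reference computation in `slow` grows quadratically.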
4.2. z-Fold Parallelization of NPUs
The CN message computation given by (5) is repeated $m$ times in a decoding iteration, that is, once for each CN. A straightforward serial implementation of this kind is slow and undesirable. Instead, we apply a strategy based on the following understanding.
Fact 1. An arbitrary submatrix $\mathbf{I}_s$ in the PCM $\mathbf{H}$ corresponds to $z$ CNs connected to $z$ VNs on the bipartite graph, with strictly one edge between each CN and VN.
This implies that no CN in this set of $z$ CNs given by $\mathbf{I}_s$ shares a VN with another CN in the same set. Table 2 illustrates such an arbitrary submatrix in $\mathbf{H}$. This presents us with an opportunity to operate $z$ NPUs in parallel (hereafter referred to as an NPU array), resulting in a $z$-fold increase in throughput.

4.3. Layered Decoding
In the flooding schedule discussed in Section 2.2, all nodes on one side of the bipartite graph can be processed in parallel. Although such a fully-parallel implementation may seem an attractive option for achieving high-throughput performance, it has its own drawbacks. Firstly, it quickly becomes intractable in hardware due to the complex interconnect pattern between the nodes of the bipartite graph. Secondly, such an implementation usually restricts itself to a specific code structure. Although the efficient scaled MSA discussed in Section 2.2 is inherently serial in nature (as the messages are propagated across the bipartite graph more than once every decoding iteration), one can process multiple nodes at the same time if the following condition is satisfied.
Fact 2. From the perspective of CN processing, two or more CNs can be processed at the same time (i.e., they are independent of each other) if they do not have one or more VNs (code bits) in common.
The row-layering technique used in this work essentially relies on the condition in Fact 2 being satisfied. In terms of the PCM $\mathbf{H}$, an arbitrary subset of rows can be processed at the same time, provided that no two or more rows have a 1 in the same column of $\mathbf{H}$. This subset of rows is termed a row-layer (hereafter referred to as a layer). In other words, given a set of layers $\{L_1, \ldots, L_I\}$ in $\mathbf{H}$, for any two distinct rows $u, w \in L_k$ and every column $j$, $\mathbf{H}(u, j) \cdot \mathbf{H}(w, j) = 0$.
Observe that $L_k$, in general, can be any subset of rows as long as the rows satisfy the condition specified by Fact 2, implying that layers of unequal size, $|L_u| \ne |L_w|$, are possible. Owing to the structure of QC-LDPC codes, the choice of $|L_k|$ (and hence of the number of layers $I$) becomes obvious. Submatrices $\mathbf{I}_s$ in $\mathbf{H}$ (with row and column weight of 1) guarantee that, for the $z$ CNs (rows corresponding to $\mathbf{I}_s$), the condition in Fact 2 is always satisfied. Hence, in our work, we choose $|L_k| = z$.
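The layer condition in Fact 2 is easy to check mechanically: a set of PCM rows forms a valid layer if no column contains more than one 1 across those rows. The following Python sketch shows the check on two small hand-written examples; the function name is hypothetical.

```python
def is_valid_layer(pcm_rows):
    """True if no two of the given rows have a 1 in the same column."""
    return all(sum(column) <= 1 for column in zip(*pcm_rows))


# Rows taken from shifted-identity blocks never collide within a block row.
layer_ok = is_valid_layer([[1, 0, 0, 1, 0, 0],
                           [0, 1, 0, 0, 1, 0],
                           [0, 0, 1, 0, 0, 1]])

# These two rows share the middle column, so they cannot be one layer.
layer_bad = is_valid_layer([[1, 1, 0],
                            [0, 1, 1]])
```

The first example mirrors the rows of one block row built from identity blocks, which is why choosing each block row as a layer always satisfies the condition.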
From the VN or column perspective, , implies that the columns of the PCM are also divided into subsets of size (called block columns from now on) given by the set , . The VNs belonging to a block column may participate in CN equations across several layers. We call the intersection of a layer and a block column as a block. Two or more layers are said to be dependent with respect to the block column if and This is observed in Table 3, where we can see that layers , and are dependent with respect to block column . Assuming that the message update begins with layer and proceeds downward, the arrows represent the directional flow of message updates from one layer to another. For the block column , for instance, layer cannot begin updating the VNs associated with block column before layer has finished updating messages for the same set of VNs and so on.

The idea of parallelizing $z$ NPUs seen in Section 4.2 can be extended to layers, where $z$-sized arrays of NPUs can process message updates for multiple layers, provided they are independent with respect to the block column being processed. In Section 4.4, we discuss pipelining methods that allow us to overcome layer-to-layer dependency and maximize the throughput. Before discussing the pipelined processing of layers implemented in our decoder, we present in this section a novel compact (thus efficient) matrix representation leading to a significant improvement in throughput. We call the all-zero submatrices in $\mathbf{H}$ (corresponding to a $-1$ in $\mathbf{H}_b$) invalid blocks, since there are no edges between the corresponding CNs and VNs. The other submatrices are called valid blocks. In a conventional approach to scheduling, for example, in [12], message computation is done over all the valid and invalid blocks. To avoid processing invalid blocks, we propose an alternate representation of $\mathbf{H}_b$ in the form of two matrices: $\beta_I$, the block index matrix, and $\beta_S$, the block shift matrix. $\beta_I$ and $\beta_S$ hold the index locations and the shift values (and hence the connections between the CNs and VNs) corresponding to only the valid blocks in $\mathbf{H}_b$, respectively. Construction of $\beta_I$ is based on the following definition.
Definition 3. Construction of is as follows. for set for if Let denote the set of valid blocks for layer , .Let ; then, , we define the block index matrix asSimilarly, we define as
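One way to read the construction is as a per-layer filter that keeps only the valid entries of the base matrix, left-aligned. The Python sketch below follows that reading; it assumes -1 marks an invalid block and uses 0-based block column indices, and the function name and toy base matrix are hypothetical.

```python
def compact_base_matrix(base):
    """Return (block_index, block_shift) holding only valid blocks.

    block_index[l] lists the block columns of the valid blocks of layer l,
    and block_shift[l] lists the matching shift values, in the same order.
    An entry of -1 in `base` is assumed to mark an invalid block.
    """
    block_index = [[j for j, s in enumerate(row) if s >= 0] for row in base]
    block_shift = [[s for s in row if s >= 0] for row in base]
    return block_index, block_shift


# Toy base matrix with two layers and four block columns.
toy = [[5, -1, 2, -1],
       [-1, 1, -1, 0]]
idx, shf = compact_base_matrix(toy)
```

For the toy input, `idx` is `[[0, 2], [1, 3]]` and `shf` is `[[5, 2], [1, 0]]`, so a scheduler that walks these two matrices visits two blocks per layer instead of four.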
The block index (shift) matrix () is shown in Table 4 (Table 5) for the case of the IEEE 802.11n rate LDPC code. To observe the benefit of this alternate representation, let us define the following ratio.


Definition 4. Let denote the compaction ratio, which is the ratio of the number of columns of (which is the same for ) to the number of columns of . Hence, .
The compaction ratio is a measure of the compaction achieved by the alternate representation of . Compared to the conventional approach to scheduling node processing based on matrix, scheduling as per the and matrices improves throughput by times. In our case study, , thus providing a throughput gain of .
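The effect of the compact representation on the schedule length can be illustrated numerically. In the sketch below the ratio is taken as the number of columns of the compact matrix divided by the number of block columns of the base matrix, which is an assumed reading of Definition 4; the throughput gain is then the reciprocal. The function name and the toy matrix are hypothetical, and -1 is assumed to mark an invalid block.

```python
def compaction_ratio(base):
    """Columns of the compact representation over base-matrix columns.

    The compact matrix needs as many columns as the densest layer has
    valid blocks; -1 is assumed to mark an invalid block.
    """
    valid_per_layer = [sum(1 for s in row if s >= 0) for row in base]
    return max(valid_per_layer) / len(base[0])


# Toy base matrix: 6 block columns, at most 3 valid blocks per layer.
toy_matrix = [[0, -1, 2, -1, 1, -1],
              [-1, 3, -1, 0, -1, 2]]
ratio = compaction_ratio(toy_matrix)
gain = 1 / ratio
```

For this toy matrix the ratio is 0.5, so a schedule that visits only valid blocks takes half as many block-processing steps per layer as one that visits every block column.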
Remark 5. In the QCLDPC code in our case study, for all layers except layers and where it is . With the aim of minimizing hardware complexity by maintaining a static memoryaddress generation pattern (does not change from layertolayer), our implementation assumes regularity in the code. The decoder processes blocks for each layer of the matrix resulting in some throughput penalty.
4.4. Area Efficient Pipelining Architecture
In Section 4.3, we saw how dependent layers for a block column cannot be processed in parallel. For instance, in the base matrix in Table 1, VNs associated with the block column participate in CN equations associated with all the layers except layer , suggesting that there is no scope of parallelization of layer processing at all. This situation is better observed in shown in Table 4.
Fact 3. If a block column of has a particular index value appearing in more than one layer, then the layers corresponding to that value are dependent.
Proof. It follows directly by applying Fact 2 to Definition 3.
In other words, , , if , then, the layers and are dependent. It is obvious that, to process all layers in parallel ( to in Table 1), the conditionmust hold . For the structure of shown in Table 4 (by definition of the code), it is not possible to parallelize all the layers. However, a degree of parallelization can be achieved by making the layers independent with respect to a block column.
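Fact 3 suggests a simple mechanical check on the block index matrix: for each column position, look for an index value that appears in more than one layer. The Python sketch below reports such conflicts. It is an illustration only; the function name and the toy matrix are hypothetical, and layers and columns are numbered from 0.

```python
def dependent_layers(block_index):
    """Find index values repeated within a column of the block index matrix.

    Returns {column_position: {index_value: [layers...]}} containing only
    the values that occur in more than one layer at that position.
    """
    conflicts = {}
    n_cols = max(len(row) for row in block_index)
    for col in range(n_cols):
        seen = {}
        for layer, row in enumerate(block_index):
            if col < len(row):
                seen.setdefault(row[col], []).append(layer)
        repeated = {v: layers for v, layers in seen.items() if len(layers) > 1}
        if repeated:
            conflicts[col] = repeated
    return conflicts


# Toy block index matrix with three layers.
toy_index = [[0, 1, 4],
             [0, 2, 5],
             [1, 3, 4]]
found = dependent_layers(toy_index)
```

Here block column 0 is scheduled in the first position of layers 0 and 1, and block column 4 in the last position of layers 0 and 2, so those layer pairs cannot be processed at the same time without rearranging the matrix.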
To accomplish this, we rearrange the matrix elements from their original order with the following idea. If , , then stagger the execution of with respect to by moving to , . Table 6 shows one such rearrangement of (Table 4) for the QCLDPC code for our case study. However, some dependencies still remain (shown in bold and italic in Table 6). Note that if we partition into two halves, to and to , each half satisfies Fact 2 separately. In other words, , , , and , , .

We call the set of layers satisfying Fact 2 a superlayer. Figure 6(a) shows the block-level view of the NPU timing diagram without the pipelining of layers. As seen in Section 4.1, the GNPU and LNPU operate in tandem and in that order, implying that the LNPU has to wait for the GNPU updates to finish. The layer-level picture is depicted in Figure 7(a). This idling of the GNPU and LNPU can be avoided by introducing pipelined processing of blocks given by the following lemma.
Lemma 6. Within a superlayer, while the LNPU processes messages for the blocks , the GNPU can process messages for the blocks , , and .
Proof. It follows directly from the layer independence condition in Fact 2.
Figure 6(c) illustrates the block-level view of this 2-layer pipelining scheme. It is important to note that the splitting of the NPU process into two parts, namely, the GNPU and the LNPU (that work in tandem), is a necessary condition for Lemma 6 to hold. However, at the boundary of the superlayer, Lemma 6 does not hold and pipelining has to be restarted for the next layer as seen in the layer-level view shown in Figure 7(c). This is the classical pipelining overhead.
Definition 7. Without loss of generality, the pipelining efficiency is the number of layers processed per unit time per NPU array.
For the case of pipelining, two layers are shown in Figure 7(c):
Thus, we impose the following conditions on :
(1) Since two layers are processed in the pipeline at any given time,
(2) Given a QC-LDPC code, is a constant. This is to facilitate a symmetric pipelining architecture which is a scalable solution.
(3) Choice of should maximize pipelining efficiency ,
In our work, , , and . The rearranged block index matrix is shown in Table 6 and the layer-level view of the pipeline timing diagram for the same is shown in Figure 7(d).
Remark 8.
Four-Layer Pipelining. For the case of the IEEE 802.11n (2012) QC-LDPC code chosen for this work, the pipelining of four layers might suggest an increase in the throughput; however, this is not the case as depicted in Figure 8. Due to the need for two NPU arrays, the pipelining efficiency of this scheme is
Hence, we limit ourselves to pipelined processing of two layers. To achieve further gains in throughput, without loss of generality, parallel processing of multiple blocks can be performed. For details on this approach of improving throughput, the reader is referred to Appendix B.
From the perspective of memory access relaxation (Section 3.2) in LabVIEW FPGA Compiler, the proposed 2-layer pipelining is a suitable methodology for FPGA internal memory with a single pair of read/write ports. This is because the two layers running in parallel are assigned in a timely manner to the memory read and write ports. Since this approach does not have any layer execution postponed due to a resource limitation, we can achieve the theoretical maximum throughput performance. Even if pipelining more than two layers were efficient, such a method would require multiple layers to be processed in parallel. However, the number of layers in a parallel run is limited by the number of ports in the shared memory. Any layers that need processing beyond the shared memory port number would be postponed, and this would prevent us from achieving the theoretical maximum throughput. Deploying multiple decoding cores (as described in Section 5.2) is another way of improving throughput. The downside of this approach is that the memory requirement grows linearly with the number of parallel layers.
High-Level FPGA-Based Decoder Architecture. The high-level decoder architecture is shown in Figure 9. The read-only memory (ROM) holds the LDPC code parameters specified by and along with other code parameters such as the block length and the maximum number of decoding iterations. Initially, the a posteriori probability (APP) memory is set to the channel LLR values corresponding to all the VNs as per (3). The barrel shifter operates on blocks of VN APP values of size , where is the fixed-point word length used in the implementation for APP values. It circularly rotates the values in the APP block to the right by using the shift values from the matrix in the ROM, effectively implementing the connections between the CNs and VNs specified by the Tanner graph of the code. The cyclically shifted APP memory values and the corresponding CN message values for the block in question are fed to the array of NPUs. Here, the GNPUs compute VN messages as per (4) and the LNPUs compute CN messages as per (5). These messages are then stored back at their respective locations in the random-access memory (RAM) for processing the next block. At the time of writing this paper, we have successfully implemented two versions of the decoder.
(1) 1x. As the name suggests, only one layer is processed at a time by the NPU array; in other words, there is no pipelining of layers. The block-level and the layer-level view of the pipelining are illustrated in Figures 6(b) and 7(b), respectively.
(2) 2x. This version is based on the 2-layer pipeline processing. Pipelining is done in software at the algorithmic description level. The block-level and layer-level views of the pipelined processing are shown in Figures 6(d) and 7(d), respectively. Due to the pipelining overhead, . Comparing this to the 1x version with , the 2x version is times faster than the 1x version.
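The barrel shifter described in the architecture above can be illustrated with a short Python sketch of a circular right rotation applied to one block of APP values. The function names are hypothetical, and whether a "right" rotation corresponds to increasing or decreasing indices is a convention assumed here, not something taken from the hardware description.

```python
def barrel_shift_right(block, shift):
    """Circularly rotate one block of APP values to the right by `shift`."""
    s = shift % len(block)
    if s == 0:
        return list(block)
    return list(block[-s:]) + list(block[:-s])


def barrel_shift_left(block, shift):
    """Inverse rotation, used here to restore the original ordering."""
    s = shift % len(block)
    return list(block[s:]) + list(block[:s])


# Example with a block of five values (real blocks hold one value
# per row of the circulant submatrix).
rotated = barrel_shift_right([1, 2, 3, 4, 5], 2)
print(rotated)
print(barrel_shift_left(rotated, 2))
```

Rotating by the shift value of a circulant submatrix and later rotating back is one simple way to model how a cyclically shifted identity matrix connects VNs to CNs.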
5. Case Studies
The techniques for improving throughput in an efficient manner, described in Section 4, are realized on hardware using an HLS compiler. The realization is divided into three case studies, namely, an efficiently pipelined IEEE 802.11n standard [27] compliant QC-LDPC decoder, an extension of this decoder that provides a throughput of 2.48 Gb/s, and an HARQ experimentation system based on the IEEE 802.16 standard [36] QC-LDPC code. Each case study is detailed in the following sections.
5.1. IEEE 802.11n Compliant LDPC Decoder
To evaluate the proposed strategies for achieving high throughput, we have implemented the scaled-MSA-based decoder for the QC-LDPC code in the IEEE 802.11n (2012) standard. For this code, the submatrix sizes of 27, 54, and 81 result in code lengths of 648, 1296, and 1944 bits, respectively. Our implementation supports the submatrix size of and is thus capable of supporting all the block lengths for the rate code.
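For readers unfamiliar with the scaled min-sum approximation that the decoder is based on, the following Python sketch shows a textbook layered update for a single check-node row. It is not the authors' implementation and does not reproduce equations (4) and (5); the function name, the scaling factor `alpha=0.75`, and the way the two phases are loosely associated with the GNPU and LNPU are assumptions made for illustration.

```python
def scaled_min_sum_row(app, old_cn, alpha=0.75):
    """One layered scaled min-sum update for a single check-node row.

    app    : APP values of the VNs connected to this CN (at least two)
    old_cn : CN-to-VN messages from the previous iteration
    alpha  : scaling factor (0.75 is an assumed, commonly used value)
    """
    # Phase 1 (loosely, the GNPU role): VN messages plus row-wide statistics.
    vn = [a - c for a, c in zip(app, old_cn)]
    mags = [abs(v) for v in vn]
    idx_min1 = min(range(len(mags)), key=mags.__getitem__)
    min1 = mags[idx_min1]
    min2 = min(m for i, m in enumerate(mags) if i != idx_min1)
    sign_prod = -1 if sum(v < 0 for v in vn) % 2 else 1

    # Phase 2 (loosely, the LNPU role): per-edge CN messages and new APPs.
    new_cn = []
    for i, v in enumerate(vn):
        mag = min2 if i == idx_min1 else min1
        sign = sign_prod * (-1 if v < 0 else 1)
        new_cn.append(alpha * sign * mag)
    new_app = [v + c for v, c in zip(vn, new_cn)]
    return new_app, new_cn


new_app, new_cn = scaled_min_sum_row([2.0, -3.0, 4.0], [0.0, 0.0, 0.0])
print(new_app, new_cn)
```

Splitting the update into a phase that gathers the two smallest magnitudes and the sign product, and a phase that emits per-edge messages, mirrors the kind of two-stage node processing that makes layer pipelining possible.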
We represent the input LLRs from the channel and the CTV and VTC messages with 6 signed bits and 4 fractional bits. Figure 10 shows the bit-error-rate (BER) performance for the floating-point (FP) and the fixed-point (FxP) data representation with 8 decoding iterations. As expected, the fixed-point implementation suffers by about 0.5 dB compared to the floating-point version at a BER of , and the gap widens for lower BER values. The decoder algorithm was described using the LabVIEW CSDS software. LabVIEW FPGA Compiler was then used to generate the very high speed integrated circuit (VHSIC) hardware description language (VHDL) code from the graphical dataflow description. The VHDL code was synthesized, placed, and routed using the Xilinx Vivado compiler on the Xilinx Kintex-7 FPGA available on the NI PXIe-7975R FPGA board. The decoder achieves an overall throughput of 608 Mb/s at an operating frequency of 200 MHz and a latency of 5.7 μs at 8 decoding iterations with BER performance shown in Figure 10 (blue curve). Table 7 shows that the resource usage for the 2x version (almost twice as fast due to pipelining) is close to that of the 1x version. The LabVIEW FPGA Compiler chooses to use more flip-flops (FF) for data storage in the 1x version, while it uses more block RAM (BRAM) in the 2x version.
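The effect of the fixed-point representation can be explored with a small saturating quantizer such as the Python sketch below. The interpretation of the word format (a 6-bit signed integer part followed by 4 fractional bits) and the function name are assumptions; the text does not specify the rounding or saturation behavior of the hardware.

```python
def quantize(x, int_bits=6, frac_bits=4):
    """Round x to a signed fixed-point grid and saturate at the range limits.

    Assumed format: int_bits signed integer bits (two's complement,
    sign included) followed by frac_bits fractional bits.
    """
    scale = 1 << frac_bits
    lo = -(1 << (int_bits - 1)) * scale      # most negative code
    hi = (1 << (int_bits - 1)) * scale - 1   # most positive code
    code = max(lo, min(hi, round(x * scale)))
    return code / scale


for llr in (1.3, 100.0, -100.0):
    print(llr, "->", quantize(llr))
```

Under these assumptions the resolution is 1/16 and values saturate at -32 and 31.9375; changing `int_bits` and `frac_bits` gives a quick way to study the trade-off between word length and BER loss.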

Remark 9. The clock rate selection in the HLS compiler generally determines the pipeline stage depth of each primitive operation. For example, a higher target clock rate would result in a deeper pipeline stage. This requires more FPGA resources and a relatively longer compile time. Various target clock rates were tested, and the one offering the highest throughput in time with the best resource utilization was chosen for the subsequent VHDL compile (e.g., MHz for the compiles shown in Table 8). It is important to note that the HLS compiler provides an accurate throughput and resource estimation after it generates VHDL. This throughput and resource estimation time is short, as recorded in the results tables (e.g., Table 8) as Time to VHDL. The user can therefore easily find the optimal clock rate in terms of maximal throughput and optimized resource utilization.

5.2. Case Study: A 2.48 Gb/s QC-LDPC Decoder on the Xilinx Kintex-7 FPGA
On account of the scalability and reconfigurability of the decoder architecture in [37], it is possible to achieve high throughput by employing multiple decoder cores in parallel as detailed in [38]. As shown in Figure 11, the encoded bit stream is packetized into frames of equal size and distributed for decoding in a round-robin manner to the cores operating in parallel. The main contribution of this approach is the elimination of a complicated buffering and handshake mechanism which increases the development time and adds hardware overhead. This is mainly due to
(1) fixed latency of decoding the frames across all cores,
(2) time-staggered operation of cores,
(3) tightly controlled execution of the round-robin serial-parallel-serial conversion process.
To validate the multicore decoder architecture, in this case study, we chose the IEEE 802.11n (2012) QC-LDPC code for which the submatrix sizes of 27, 54, and 81 result in code lengths of 648, 1296, and 1944 bits, respectively, and a code rate . The decoder core (described in Section 5.1) was compiled for a clock rate of 200 MHz and achieves a throughput of 420 Mb/s (first column in Table 8) with pipelining as described in Section 4.4.
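The round-robin distribution with fixed latency and time-staggered cores can be sketched as a simple schedule in Python. This is only a toy timing model: the function name, the notion of one frame arriving per time slot, and the parameter values are assumptions, not details of the implementation in Figure 11.

```python
def round_robin_schedule(num_frames, num_cores, latency_slots):
    """Assign frame i to core i mod num_cores, one frame arriving per slot.

    Returns (frame, core, start_slot, done_slot) tuples. Because every core
    has the same fixed latency, frames finish in arrival order and no
    reordering buffer or handshake is needed in this model.
    """
    if latency_slots > num_cores:
        raise ValueError("a core would still be busy when its next frame arrives")
    return [(i, i % num_cores, i, i + latency_slots) for i in range(num_frames)]


schedule = round_robin_schedule(num_frames=8, num_cores=4, latency_slots=4)
for frame, core, start, done in schedule:
    print(f"frame {frame} -> core {core}, slots {start}..{done}")
```

In this model the aggregate throughput scales with the number of cores as long as the per-core latency, measured in frame slots, does not exceed the core count.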
The multicore decoder was developed in stages. The first stage is the aforementioned pipelined decoder core to which additional cores were added incrementally as per the scheme depicted in Figure 11. We have listed the resource utilization and the throughput performance for each stage in Table 8 for a qualitative comparison.
5.3. Rapid Prototyping of Hybrid-ARQ System
Hybrid-ARQ (HARQ) is a transmission technique that combines Forward Error Correction (FEC) with ARQ. In HARQ, a suitable FEC code protects the data and error-detection code bits. In its simplest form, the FEC encoded packet (referred to as a Redundancy Version (RV) in this context) is transmitted as per the ARQ mechanism protocol. If the receiver is able to decode the data, it sends an acknowledgement (ACK) back to the transmitter. However, if it fails to recover the data, the receiver sends a negative acknowledgement (NAK) or retransmission request to the transmitter. In this scenario, the FEC simply increases the probability of successful transmission, thus reducing the average number of transmissions required in an ARQ scheme. HARQ has two modes of operation: Type-I and Type-II. In Type-I, a current retransmission is Chase-combined [39] with a previously buffered (and failed) retransmission and then decoded. In Type-II HARQ, in the event of a decoding failure, additional code bits are transmitted in every subsequent retransmission. Since, in this mode, not all code bits are transmitted in every retransmission, the efficiency of this scheme is higher. However, the complexity is also higher compared to Type-I.
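The ACK/NAK exchange described above can be summarized by a transmitter-side loop such as the following Python sketch. The function name, the callback that stands in for the receiver, and the rule that RVs are cycled in order are all assumptions for illustration; the actual MAC-level behavior is the one given in Appendix A.

```python
def harq_transmit(num_rvs, max_retx, decode_ok):
    """Send redundancy versions until an ACK arrives or retries run out.

    decode_ok(attempt, rv_index) is a hypothetical stand-in for the receiver:
    it returns True when decoding succeeds (ACK) and False otherwise (NAK).
    Returns the final status and the number of transmissions used.
    """
    for attempt in range(max_retx + 1):
        rv_index = attempt % num_rvs   # assumed: RVs are sent cyclically
        if decode_ok(attempt, rv_index):
            return "ACK", attempt + 1
    return "FAIL", max_retx + 1


# Receiver model that succeeds on the third transmission.
print(harq_transmit(num_rvs=4, max_retx=3, decode_ok=lambda a, rv: a == 2))
```

A residual frame error in this model corresponds to the "FAIL" outcome, where the maximum number of retransmissions is exhausted without an ACK.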
To study the performance of the two HARQ schemes (Type-I and Type-II), we have implemented a baseband bidirectional link with two transceiver nodes. This can be compared to a downlink connection between a base station (BS) and user equipment (UE) with a data channel and a feedback channel. Each node is capable of running the HARQ protocol in its two modes. In our work, the BS (initiator of the transmission) operates in the master mode and the UE operates in the slave mode. A high-level description of the overall system with several subsystems is shown in Figure 12 and the media access control (MAC) level operation is described in Appendix A. At the initiator node, each data packet of length is encoded with an LDPC mother code of rate and the Cyclic Redundancy Check (CRC) value for it is simultaneously computed. The RV generator selects bits from the encoded data to form RVs as per the code rate adaptation algorithm [40]. The header is encoded with a rate repetition code. Finally, the RV is appended to the header and sent over the channel.
At the receiver node, header bits are decoded and the RV combiner uses the information in the header to combine the received signal values for Type-I mode or Type-II mode. CRC values from the header and the decoded data are compared to generate a feedback for the initiator node. The feedback (bit ACK/NAK) is coded with a rate repetition code before sending over the channel. We assume an error-free feedback for this experiment, which is guaranteed by the rate repetition code over the SNR range in consideration.
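The repetition coding used for the header and the feedback can be illustrated with the short Python sketch below. The repetition factor `n` is left as a parameter because the rates are not recoverable from the text, and the soft-decision convention (a non-negative LLR sum maps to bit 0) is an assumption.

```python
def rep_encode(bits, n):
    """Repeat every bit n times (a rate-1/n repetition code)."""
    return [b for b in bits for _ in range(n)]


def rep_decode_soft(llrs, n):
    """Sum the n LLRs of each bit; assumed convention: sum >= 0 means bit 0."""
    return [0 if sum(llrs[i:i + n]) >= 0 else 1
            for i in range(0, len(llrs), n)]


print(rep_encode([1, 0], 3))
print(rep_decode_soft([-1.0, 0.5, -0.2, 2.0, -0.5, 0.1], 3))
```

Summing LLRs before the hard decision is what lets a sufficiently long repetition code make the feedback effectively error-free over a given SNR range.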
5.3.1. LabVIEW FPGA Compiler for Ease of Experimentation
The HARQ system comprises subsystems that can be classified into two main categories based on the nature of the processing they perform. The bit-manipulation subsystems (akin to digital signal processing (DSP)) follow a pattern of processing that does not change significantly on a per transmission basis. In other words, they are more or less stateless. The channel encoders and decoders are examples of this category. The protocol-sensitive subsystems, on the other hand, have to perform functions that are highly sensitive to the state of the system in a given transmission. For instance, the HARQ controller, the RV generator, and the RV combiner maintain a state [40]. With a few examples, this section highlights the ease of modification in a short time that LabVIEW FPGA Compiler provides across subsystems, which is otherwise not possible for a purely HDL-based description.
Protocol-Sensitive Subsystem Modification. The HARQ controller is essentially a finite-state machine (FSM). For a reliable and efficient implementation of an FSM on an FPGA, the designer needs to take care of issues such as clock and input signal timing, state encoding scheme, and the choice of the coding style [41]. Modification to the MAC-level protocol directly affects the FSM in our work. For details on the MAC-level operation of the HARQ protocol, the reader is referred to Appendix A. For instance, during experimentation, the frame structure is likely to undergo modifications. Any modification to the frame structure affects nearly all subsystems. One such example is illustrated in Figures 13 and 14 where the field (of length bits) of the header is read to process output which is further used to process the data. The description in Figure 13 is agnostic to the modification at the HDL level and the designer can implement a change without HDL domain expertise, whereas in Figure 14 one can see that the description is specific to the subsystem in which the header is being used. Modifying the length of a field, for instance, requires modification of counter logic and adjustment of the delay value. This needs to be repeated for all subsystems that are affected by this change. In contrast, LabVIEW FPGA Compiler automatically generates the counter logic and delay values in Figure 14 by propagating the field-length values into the algorithm in Figure 13, allowing the same algorithmic description to be reused for different values of the field length.
Bit-Manipulation Subsystem Modification. The channel coding subsystems, namely, the LDPC and repetition encoders and decoders, are at the core of the HARQ system. LabVIEW FPGA Compiler also eases the implementation of bit-manipulation subsystems like these. For example, the x repetition encoder description in LabVIEW FPGA is shown in Figure 15. Comparing this to the algorithmic description shown in Figure 16, it is evident that modification without much time and effort is facilitated by LabVIEW FPGA Compiler. It is important to emphasize here that LabVIEW FPGA provides a high-level abstraction to VHDL. However, it is not the same as the algorithmic description that we refer to throughout this article. This is because LabVIEW FPGA is a lower-level description language relative to the algorithmic description that is input to LabVIEW FPGA Compiler.
5.3.2. Results
The HARQ system has been implemented on the Xilinx Kintex-7 series of FPGAs and the algorithmic description was input using LabVIEW CSDS. We chose this set of tools as the FPGA is available in the NI USRP-2943R series used for real-world prototyping of our research. At the time of writing this paper, the system performance has been evaluated for the IEEE 802.16 (2012) [36] set of QC-LDPC codes. We would like to emphasize here that owing to the ease of modification, we can, in short development cycles, replace the channel codes with other code structures being researched such as the one described in [42]. The error-rate performance for k frames of codeword size of is shown in Figure 17. The residual Frame Error-Rate (FER) accounts for the errors that the HARQ protocol failed to correct, whereas the FER accounts for errors that happen without the use of the HARQ protocol. The data throughput of the system, defined as , and the throughput averaged over the frames per SNR point are plotted in Figure 18. As expected, the performance of the system is improved with HARQ at the cost of a decrease in the throughput. The FPGA resource utilization for the same is given in Table 9.

Scalable Simulation Speedup. Each time any change in the system is made, there is a need to evaluate the performance of the system. This is especially true for testing code structures under research. Error-rate performance in excess of bits is required to observe phenomena such as the error-floor of a code [24]. This makes time-efficient simulations not only a luxury but a necessity. In our implementation, while developing a real-world prototype we also get the benefit of a 4x speedup in simulation time using a decoder without pipelining. We measured the execution time for k frames over SNR values. We used the IEEE 802.16 (2012) specified QC-LDPC code, with a and a repetition code for the header and the feedback, respectively. The decoder was set to perform decoding iterations.
On a host machine, a Dell Precision T3600 GHz Quad Core Xeon (i7) with a GB RAM, it took about min, whereas on our FPGA testbed it took about min, resulting in a 4x speedup with a one-time time-to-compile of approximately min. While the time-to-compile seems significant, once compiled, for several trials with larger datasets (orders of magnitude larger than the experimental value specified above), this time becomes insignificant.
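The amortization argument above can be made concrete with a small Python calculation. All timing values below are hypothetical placeholders chosen only so that the asymptotic ratio equals the reported 4x; they are not the measured run times or the measured compile time.

```python
def effective_speedup(t_cpu, t_fpga, t_compile, runs):
    """Speedup over `runs` experiments when the FPGA compile is paid once."""
    return (runs * t_cpu) / (t_compile + runs * t_fpga)


# Hypothetical numbers (minutes): CPU run 40, FPGA run 10, one-time compile 60.
for runs in (1, 10, 100):
    print(runs, round(effective_speedup(40.0, 10.0, 60.0, runs), 2))
```

With these placeholder values a single run is slower overall because of the compile, while the effective speedup approaches the per-run ratio as the number of runs (or the dataset size per run) grows.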
6. A Comparative Survey of State of the Art
A survey of the state of the art for channel code architectures and their implementation using HLS technology reveals that insightful work on the topic has been done. While there are a myriad of LDPC architecture designs implemented on the FPGA platform, here we restrict ourselves to a subset of those works that utilize HLS technology. In this section, we list some of the notable contemporary works that fall into this category.
The performance of an implementation depends on a host of factors such as the vendor-specific device(s) with its associated HLS technology and the type of channel code in consideration. Thus, the intent of the authors is not to claim an all-encompassing performance comparison demonstrating gains or losses with respect to each other, but to provide the reader with a qualitative survey of the state of the art. Table 10 lists works [19–22] based on the settings from each that are chosen according to the proximity of their relevance to our work.
 
n.a.: not available (i.e. not reported in the cited work). 
7. Conclusion
We use an HLS compiler that, without expert-level hardware domain knowledge, enables us to reliably prototype our research in a short amount of time. With techniques such as timing estimation, pipelining, loop unrolling, and memory inference from arrays, LabVIEW FPGA Compiler compiles untimed dataflow algorithms written with loops, arrays, and feedback into VHDL descriptions that achieve a high clock rate and high throughput. The employed HLS technology significantly reduced the time to explore the system parameter space and optimized it in terms of the error-rate performance and the resource utilization. We propose techniques to achieve a high-throughput FPGA architecture for a QC-LDPC code. The strategies are validated by implementing a standard compliant QC-LDPC decoder on an FPGA. The decoder architecture is scaled up to achieve another highly parallel realization that has a throughput of 2.48 Gb/s. The HLS compilation process is used to rapidly prototype a HARQ experimentation system using LDPC codes that not only comprises bit-manipulation subsystems but also protocol-sensitive subsystems. This facilitated the error-rate performance measurement of the system over large, realistic data sets at a 4x greater speed than the conventional CPU-based experimentation. Finally, the use of HLS and reconfigurable hardware platforms holds the promise of realizing the architecture suited for the evolving research requirements of 5G wireless technology.
Appendix
A. MAC-Level HARQ Operation
Here, we briefly discuss the operation of the protocol-sensitive subsystems (Section 5.3.1) in the HARQ system for the interested reader. Without loss of generality, for RVs and the maximum number of retransmissions set to , the MAC-level operation of the HARQ protocol is shown in Figure 19 for the master mode and Figure 20 for slave mode. For the Type-I scheme of HARQ, the RV generator does not puncture any bits and sends the whole mother codeword every transmit instance, whereas, in the Type-II scheme, it generates RVs as detailed in [40].
At the receiver, for the Type-I scheme of HARQ, the transmit instance performs , where denotes the buffer contents and with . For the Type-II scheme, the RV combiner performs , where , , and is the position of the code bit in the mother codeword determined by the puncturing method.
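The two combining modes can be sketched in Python as plain LLR accumulation into a receive buffer. The exact combining expressions are not reproduced here, so the code below reflects the usual practice (element-wise addition for Type-I, addition at the transmitted bit positions for Type-II) and should be read as an assumption; the function names and the example positions are hypothetical.

```python
def combine_type1(buffer, received):
    """Type-I: the whole codeword is resent, so add LLRs element-wise."""
    return [b + r for b, r in zip(buffer, received)]


def combine_type2(buffer, received, positions):
    """Type-II: each received LLR is added at its mother-codeword position.

    Positions never transmitted so far keep an LLR of 0 (no information).
    """
    out = list(buffer)
    for r, p in zip(received, positions):
        out[p] += r
    return out


print(combine_type1([0.0, 1.0], [0.5, -2.0]))
print(combine_type2([0.0] * 4, [1.5, -0.5], positions=[1, 3]))
```

In both cases the buffer would be cleared when a new data packet starts, and the combined values would be passed to the LDPC decoder after each transmit instance.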
B. Parallelizing Block Columns
In Section 4, it was concluded that increasing the number of layers in the pipeline to more than two provides diminishing returns in efficiency of the pipelining scheme. Here, we present a technique for a multifold increase in throughput by processing multiple blocks in a particular layer. We would like to note that this technique has not been implemented in any of the case studies provided in this article. To gain further throughput improvement, in this approach, we take advantage of the following fact. There is no message exchange across the blocks of a particular layer. In other words, message exchange (and hence dependency) happens only in the vertical direction in , where and . The matrix is defined in Section 4.4. In the pipelined version, the NPU array processes each block (within a layer) sequentially as shown in Figure 21. However, if we split the blocks into two sets and process each set independent of the other (requiring 2 NPU arrays), we can double the throughput. Owing to this fact, we call this the 4x version. Similarly, by employing 4 NPU arrays, we have the 8x version and finally, if we employ 8 NPU arrays, we have the 16x version, thus increasing throughput gradually at each stage.
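The block-splitting idea can be sketched in Python as a partition of the block columns of one layer across several NPU arrays. The interleaved assignment, the function names, and the example of eight blocks per layer are assumptions; any partition into equally sized sets would serve the same purpose.

```python
def split_blocks(block_columns, num_arrays):
    """Partition the blocks of a layer into one set per NPU array."""
    return [block_columns[k::num_arrays] for k in range(num_arrays)]


def steps_per_layer(num_blocks, num_arrays):
    """Sequential block steps needed when the sets are processed in parallel."""
    return -(-num_blocks // num_arrays)   # ceiling division


def version_label(num_arrays, pipelined_layers=2):
    """Naming used in the text: 2-layer pipelining times the number of arrays."""
    return f"{pipelined_layers * num_arrays}x"


blocks = list(range(8))   # hypothetical layer with eight non-zero blocks
for arrays in (1, 2, 4, 8):
    print(version_label(arrays), split_blocks(blocks, arrays),
          steps_per_layer(len(blocks), arrays))
```

Because no messages are exchanged between blocks of the same layer, the sets can be processed independently, at the cost of one additional NPU array per set.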
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
The authors would like to thank the Department of Electrical and Computer Engineering, Rutgers University, NJ, USA, and the National Instruments Corporation, Austin, TX, USA, for their continual support for this research work.
References
[1] B. Raaf, W. Zirwas, K.-J. Friederichs et al., “Vision for Beyond 4G broadband radio systems,” in Proceedings of the IEEE 22nd International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC '11), pp. 2369–2373, IEEE, Toronto, Canada, September 2011.
[2] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon limit error-correcting coding and decoding: turbo-codes,” in Proceedings of the IEEE International Conference on Communications, pp. 1064–1070, Geneva, Switzerland, May 1993.
[3] R. G. Gallager, “Low-density parity-check codes,” IRE Transactions on Information Theory, vol. 8, no. 1, pp. 21–28, 1962.
[4] “3GPP RAN WG1,” in 3rd Generation Partnership Project (3GPP), 2016, http://www.3gpp.org/specificationsgroups/ranplenary/ran1radiolayer1/home.
[5] H. Kee, S. Mhaske, D. Uliana et al., “Rapid and high-level constraint-driven prototyping using LabVIEW FPGA,” in Proceedings of the 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP 2014), pp. 45–49, USA, December 2014.
[6] H. Kee, T. Ly, N. Petersen, J. Washington, H. Yi, and D. Blasig, “Compile time execution,” U.S. Patent 9 081 583, 2015.
[7] T. Riche, N. Petersen, H. Kee et al., “Convergence analysis of program variables,” U.S. Patent 9 189 215, 2015.
[8] H. Kee, H. Yi, T. Ly et al., “Correlation analysis of program structures,” U.S. Patent 9 489 181, 2016.
[9] T. Ly, S. Mhaske, H. Kee, A. Arnesen, D. Uliana, and N. Petersen, “Self-addressing memory,” U.S. Patent 9 569 119, 2017.
[10] W. Ryan and S. Lin, Channel Codes: Classical and Modern, Cambridge University Press, Cambridge, 2009.
[11] Y. Sun and J. R. Cavallaro, “VLSI architecture for layered decoding of QC-LDPC codes with high circulant weight,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21, no. 10, pp. 1960–1964, 2013.
[12] K. Zhang, X. Huang, and Z. Wang, “High-throughput layered decoder implementation for quasi-cyclic LDPC codes,” IEEE Journal on Selected Areas in Communications, vol. 27, no. 6, pp. 985–994, 2009.
[13] N. Onizawa, T. Hanyu, and V. C. Gaudet, “Design of high-throughput fully parallel LDPC decoders based on wire partitioning,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 18, no. 3, pp. 482–489, 2010.
[14] T. Mohsenin, D. N. Truong, and B. M. Baas, “A low-complexity message-passing algorithm for reduced routing congestion in LDPC decoders,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 57, no. 5, pp. 1048–1061, 2010.
[15] A. Balatsoukas-Stimming and A. Dollas, “FPGA-based design and implementation of a multi-GBPS LDPC decoder,” in Proceedings of the 22nd International Conference on Field Programmable Logic and Applications (FPL 2012), pp. 262–269, Norway, August 2012.
[16] V. A. Chandrasetty and S. M. Aziz, “FPGA implementation of high performance LDPC decoder using modified 2-bit min-sum algorithm,” in Proceedings of the 2nd International Conference on Computer Research and Development (ICCRD 2010), pp. 881–885, Malaysia, May 2010.
[17] R. Zarubica, S. G. Wilson, and E. Hall, “Multi-Gbps FPGA-based low density parity check (LDPC) decoder design,” in Proceedings of the 50th Annual IEEE Global Telecommunications Conference (GLOBECOM 2007), pp. 548–552, USA, November 2007.
[18] P. Schläfer, C. Weis, N. Wehn, and M. Alles, “Design space of flexible multigigabit LDPC decoders,” VLSI Design, vol. 2012, Article ID 942893, 2012.
[19] J. Andrade, G. Falcao, and V. Silva, “Flexible design of wide-pipeline-based WiMAX QC-LDPC decoder architectures on FPGAs using high-level synthesis,” Electronics Letters, vol. 50, no. 11, pp. 839–840, 2014.
[20] F. Pratas, J. Andrade, G. Falcao, V. Silva, and L. Sousa, “Open the Gates: using high-level synthesis towards programmable LDPC decoders on FPGAs,” in Proceedings of the 2013 1st IEEE Global Conference on Signal and Information Processing (GlobalSIP 2013), pp. 1274–1277, USA, December 2013.
[21] J. Andrade, F. Pratas, G. Falcao, V. Silva, and L. Sousa, “Combining flexibility with low power: dataflow and wide-pipeline LDPC decoding engines in the Gbit/s era,” in Proceedings of the 25th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP 2014), pp. 264–269, Switzerland, June 2014.
[22] E. Scheiber, G. H. Bruck, and P. Jung, “Implementation of an LDPC decoder for IEEE 802.11n using Vivado™ high-level synthesis,” in Proceedings of Int. Conf. Electron., Signal Process. and Commun. Syst., pp. 45–48, 2013.
[23] H. Kee, D. Uliana, A. Arnesen et al., “A 2.06 Gb/s LDPC decoder (exhibit floor demonstration),” in Proceedings of IEEE Global Commun. Conf., 2014, https://www.youtube.com/watch?v=o58keqeP1A.
[24] D. Costello and S. Lin, Error Control Coding, Pearson, 2004.
[25] R. M. Tanner, “A recursive approach to low complexity codes,” IEEE Transactions on Information Theory, vol. 27, no. 5, pp. 533–547, 1981.
[26] L. Chen, J. Xu, I. Djurdjevic, and S. Lin, “Near-Shannon-limit quasi-cyclic low-density parity-check codes,” IEEE Transactions on Communications, vol. 52, no. 7, pp. 1038–1042, 2004.
[27] “IEEE Std. for information technology-telecommunications and information exchange between LAN and MAN-Part 11: Wireless LAN medium access control (MAC) and physical layer (PHY) specifications,” in IEEE P802.11-REVmb/D12, pp. 1–2910, 2011.
[28] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, “Factor graphs and the sum-product algorithm,” IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 498–519, 2001.
[29] E. Sharon, S. Litsyn, and J. Goldberger, “Efficient serial message-passing schedules for LDPC decoding,” IEEE Transactions on Information Theory, vol. 53, no. 11, pp. 4076–4091, 2007.
[30] M. M. Mansour and N. R. Shanbhag, “High-throughput LDPC decoders,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 11, no. 6, pp. 976–996, 2003.
[31] J. Chen and M. Fossorier, “Near optimum universal belief propagation based decoding of LDPC codes and extension to turbo decoding,” in IEEE Int. Symp. Inf. Theory, p. 189, June 2001.
[32] National Instruments Corp., LabVIEW Communications System Design Suite Overview, 2014, http://www.ni.com/whitepaper/52502/en/.
[33] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, 1995.
[34] Q. Liu, T. Todman, and W. Luk, “Combining optimizations in automated low power design,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, pp. 1791–1796, March 2010.
[35] K. K. Gunnam, G. S. Choi, M. B. Yeary, and M. Atiquzzaman, “VLSI architectures for layered decoding for irregular LDPC codes of WiMax,” in Proceedings of the 2007 IEEE International Conference on Communications (ICC '07), pp. 4542–4547, UK, June 2007.
[36] “IEEE standard for wireless MAN-advanced air interface for broadband wireless access systems,” IEEE Std 802.16.1-2012, 2012.
[37] S. Mhaske, H. Kee, T. Ly, A. Aziz, and P. Spasojevic, “High-throughput FPGA-based QC-LDPC decoder architecture,” in Proceedings of the 82nd IEEE Vehicular Technology Conference (VTC Fall 2015), USA, September 2015.
[38] S. Mhaske, D. Uliana, H. Kee, T. Ly, A. Aziz, and P. Spasojevic, “A 2.48 Gb/s FPGA-based QC-LDPC decoder: an algorithmic compiler implementation,” in Proceedings of the 36th IEEE Sarnoff Symposium (Sarnoff 2015), pp. 88–93, USA, September 2015.
[39] D. Chase, “Code combining: a maximum-likelihood decoding approach for combining an arbitrary number of noisy packets,” IEEE Transactions on Communications, vol. 33, no. 5, pp. 385–393, 1985.
[40] S. Mhaske, H. Kee, T. Ly, and P. Spasojevic, “FPGA-accelerated simulation of a hybrid-ARQ system using high level synthesis,” in Proceedings of the 2016 IEEE 37th Sarnoff Symposium, pp. 19–21, Newark, NJ, USA, September 2016.
[41] N. I. Rafla and B. L. Davis, “A study of finite state machine coding styles for implementation in FPGAs,” in Proceedings of the 2006 49th Midwest Symposium on Circuits and Systems (MWSCAS '06), pp. 337–341, Puerto Rico, August 2007.
[42] B. Young, S. Mhaske, and P. Spasojevic, “Rate compatible IRA codes using row splitting for 5G wireless,” in Proceedings of the 2015 49th Annual Conference on Information Sciences and Systems (CISS 2015), USA, March 2015.
Copyright
Copyright © 2017 Swapnil Mhaske et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.