Abstract

A flexible software LDPC decoder that exploits data parallelism to decode multiple code words simultaneously on a mobile device is proposed in this paper, supported by multithreading on OpenCL-based graphics processing units. By dividing the check matrix into several parts to make full use of both the local memory and the private memory of the GPU, and by properly adjusting the number of code words decoded in each batch, our implementation on a mobile phone achieves throughputs above 100 Mbps with a decoding delay below 1.6 milliseconds, which makes high-speed communication such as video calling possible. With efficient software LDPC decoding on the mobile device, the dedicated LDPC decoding hardware on the communication baseband chip could be replaced, saving cost and making it easier to upgrade the decoder so that it stays compatible with a variety of channel access schemes.

1. Introduction

Low-Density Parity-Check (LDPC) error correcting codes are a class of linear block codes, proposed by Gallager in 1962 [1] and rediscovered by MacKay and Neal in 1996 [2]. They take their name from their sparse parity-check matrix. LDPC codes are capacity-approaching codes, meaning that they allow the noise threshold to be set very close to the Shannon limit for a symmetric memoryless channel; thus, practical constructions of LDPC codes exist.

The good performance of LDPC codes comes at the cost of a very large amount of computation. However, LDPC decoding exhibits a high degree of parallelism. Current commercial LDPC decoders are based on hardware implementations, which support only a few specific LDPC codes at a time and are difficult to upgrade. There are a large number of studies using FPGAs to realize efficient LDPC decoders [3, 4]. With the rapid development of graphics processing units (GPU) on the desktop, there has also been much research using the CUDA framework for LDPC decoding [5, 6]. The LDPC code is widely used in the fourth generation of mobile telecommunications technology, which makes it important to develop efficient software LDPC decoding on mobile devices. At the same time, a software LDPC decoder can dynamically change its parameters, including code length, code rate, and the number of iterations, to quickly adapt to all kinds of network environments.

Open Computing Language (OpenCL) [7] is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, DSPs, FPGAs, and other processors or hardware accelerators. The technical specification was reviewed by the Khronos members and approved for public release in 2008. Compute Unified Device Architecture (CUDA) [8] also enables developers to write parallel computing programs for desktop GPUs. OpenCL appeared later, but it supports a wider range of platforms. With the rapid development of mobile devices, many of them, especially mobile phones, now have their own high-performance GPU chips. Vendors such as Qualcomm, Imagination PowerVR, ARM, and Vivante have begun to support OpenCL on their mobile GPUs [9], which makes it easier to develop GPU-based parallel computing programs on mobile devices. In this article, we develop an LDPC decoder on a mobile GPU based on OpenCL. The global memory on a mobile GPU is limited, however, so the performance is not as good as on a desktop GPU. We improve the decoding by making full use of the local memory of each compute unit and the private memory of each processing element. At the same time, we properly reduce the number of threads per code word and increase the number of code words decoded at once, which yields better performance. In our experiments, the best result of the decoder reaches a throughput of 160 Mbps, which can satisfy current mobile wireless communication in many cases, with a delay of less than 2 milliseconds (ms), which can satisfy many real-time applications such as video calling.

2. MSA for LDPC Decoding

The belief propagation (BP) algorithm is an important message passing algorithm, often used in the field of artificial intelligence [9]. The algorithm transfers belief information between the nodes of a graph. For example, the belief information sent from bit node BN_n to check node CN_m depends on the observation of BN_n and on all the check nodes connected to BN_n, except CN_m itself. Similarly, the belief information sent from check node CN_m to bit node BN_n depends on all the bit nodes connected to CN_m, except BN_n itself. As a BP algorithm, the Min Sum Algorithm (MSA) is a very efficient LDPC decoding algorithm [10]. It is based on the belief propagation between nodes connected as indicated by the edges of the Tanner graph [11]. Figure 1 shows the Tanner graph of a particular 4 × 8 H matrix. MSA, proposed by Gallager, operates in the logarithmic probabilistic domain.

An LDPC code is a special form of linear (N, K) block code, defined by a sparse binary parity-check matrix H of dimension M × N, where M = N − K. We assume that the channel is an additive white Gaussian noise (AWGN) channel with mean 0 and variance σ². BPSK modulation maps a code word c = (c_1, ..., c_N) onto the sequence x = (x_1, ..., x_N) according to x_n = (−1)^{c_n}. The received sequence is y = (y_1, ..., y_N), with y_n = x_n + w_n, where w_n is the Gaussian noise sample. On receiving y_n, the logarithmic a priori probability (log-likelihood ratio, LLR) of c_n is Lp_n = 2y_n/σ². MSA is as shown in Figure 2.

Before entering the loop iteration, we use the received sequence to initialize the prior LLRs of the bit-to-check messages as follows: Lq_{nm}^{(0)} = Lp_n = 2y_n/σ², for every edge connecting bit node BN_n to a check node CN_m.
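
As a small illustration, this initialization could be written as the following C routine; the identifiers (msa_init, Lp, Lq0, sigma2) are ours and not taken from the original implementation.

```c
/* Sketch: initialize prior LLRs and the initial bit-to-check messages.
   y      - received sequence (N samples), sigma2 - AWGN noise variance
   Lp     - prior LLRs, Lq0 - initial bit-to-check message per bit node
   (identifiers are illustrative, not the paper's actual names) */
void msa_init(const float *y, float sigma2, int N, float *Lp, float *Lq0)
{
    for (int n = 0; n < N; n++) {
        Lp[n]  = 2.0f * y[n] / sigma2;  /* Lp_n = 2*y_n / sigma^2            */
        Lq0[n] = Lp[n];                 /* Lq_{nm}^{(0)} = Lp_n on every edge */
    }
}
```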

In this algorithm, we do not compute the a posteriori probabilities of the code bits directly; instead, we compute the messages Lr_{mn} and Lq_{nm} transferred between the check nodes and bit nodes, as well as the a posteriori LLRs LQ_n used before hard decoding.

In the step of updating the messages Lr_{mn} sent from check nodes to bit nodes, for the i-th iteration, H is accessed in row-major order, and the message sent from CN_m to BN_n is updated according to all bit nodes connected to CN_m in the Tanner graph, except BN_n itself. The update process, called the minimum step, is as follows:

Lr_{mn}^{(i)} = ( prod_{n' in N(m)\{n}} sign(Lq_{n'm}^{(i-1)}) ) × min_{n' in N(m)\{n}} |Lq_{n'm}^{(i-1)}|,

where N(m) denotes the set of bit nodes connected to CN_m.

Using the H matrix and Tanner graph in Figure 1, for instance, the message Lr_{mn} shown in Figure 3 is updated from the messages of BN1 and BN2, the other bit nodes connected to that check node.

The a posteriori LLR of BN_n is updated from its prior LLR and from all the check nodes connected to BN_n:

LQ_n^{(i)} = Lp_n + sum_{m in M(n)} Lr_{mn}^{(i)},

where M(n) denotes the set of check nodes connected to BN_n.

Similarly, in the step of updating the messages Lq_{nm} sent from bit nodes to check nodes, for the i-th iteration, H is accessed in column-major order, and the message sent from BN_n to CN_m is updated according to all check nodes connected to BN_n in the Tanner graph, except CN_m itself. The update process, called the sum step, is

Lq_{nm}^{(i)} = Lp_n + sum_{m' in M(n)\{m}} Lr_{m'n}^{(i)}.

Using the Tanner graph in Figure 1, for instance, the message Lq_{nm} shown in Figure 4 is updated only by CN2, the other check node connected to that bit node.

Actually, the order of updating LQ_n and Lq_{nm} can be exchanged. If we update LQ_n first, its result can be reused to update Lq_{nm} through Lq_{nm}^{(i)} = LQ_n^{(i)} − Lr_{mn}^{(i)}, which avoids repeating the summation.
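
The following plain C sketch summarizes the minimum step for a single check node row; the identifiers (check_node_update, row_bn, row_deg, Lq, Lr) are illustrative and not the names used in the original implementation. The posterior LQ_n and the bit-to-check messages can then be obtained as LQ_n = Lp_n + Σ Lr_{mn} and Lq_{nm} = LQ_n − Lr_{mn}, as described above.

```c
#include <float.h>   /* FLT_MAX */
#include <math.h>    /* fabsf   */

/* Min-sum update for one check node m (illustrative sketch).
   row_bn  - indices of the bit nodes connected to CN m, row_deg of them
   Lq[n]   - message from BN n to CN m on this row
   Lr[n]   - message from CN m to BN n, written by this routine */
void check_node_update(const int *row_bn, int row_deg,
                       const float *Lq, float *Lr)
{
    for (int i = 0; i < row_deg; i++) {
        int n = row_bn[i];
        float sgn = 1.0f, min_abs = FLT_MAX;
        for (int j = 0; j < row_deg; j++) {   /* all bit nodes except BN n */
            if (j == i) continue;
            int n2 = row_bn[j];
            if (Lq[n2] < 0.0f) sgn = -sgn;
            float a = fabsf(Lq[n2]);
            if (a < min_abs) min_abs = a;
        }
        Lr[n] = sgn * min_abs;                /* minimum step */
    }
}
```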

The final hard decoding is performed at the end of an iteration: each bit is decided from the sign of its a posteriori LLR, with the decoded bit set to 0 if LQ_n ≥ 0 and to 1 otherwise.

The iterative procedure stops when the decoded word c satisfies all parity-check equations, that is, c·H^T = 0 (mod 2), or when the maximum number of iterations is reached.
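
A minimal C sketch of this stopping test, assuming the sign-based decision rule above (identifiers are illustrative):

```c
/* Hard decision and parity-check test (illustrative sketch).
   H is the M x N parity-check matrix stored row-major as 0/1 bytes. */
int hard_decode_and_check(const float *LQ, const unsigned char *H,
                          int M, int N, unsigned char *c_hat)
{
    for (int n = 0; n < N; n++)
        c_hat[n] = (LQ[n] < 0.0f) ? 1 : 0;   /* negative LLR -> bit 1 */

    for (int m = 0; m < M; m++) {            /* c * H^T must be all-zero mod 2 */
        unsigned char parity = 0;
        for (int n = 0; n < N; n++)
            parity ^= (H[m * N + n] & c_hat[n]);
        if (parity) return 0;                /* a parity equation fails */
    }
    return 1;                                /* valid code word: stop iterating */
}
```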

The implementation of the decoder follows a flood scheduling algorithm [12]. It guarantees that the bit nodes do not interfere with each other during their update step, and likewise that the check nodes do not interfere with each other during theirs. This principle allows true parallel execution of the MSA for LDPC decoding based on a stream-based computing method.

3. OpenCL for Mobile GPU

Modern GPUs provide extremely high parallel computing capability and a programmable pipeline, and their stream processors can perform general-purpose computation [13]. GPUs achieve better floating-point performance than CPUs, especially for single instruction multiple data (SIMD) workloads and compute-intensive tasks, in which data processing takes far more time than data scheduling and data transmission [14].

Unlike dedicated GPUs for desktop computers, a mobile GPU is typically integrated into an application processor, which also includes a multicore CPU, an image processing engine, DSPs, and other accelerators [15]. Recently, modern mobile GPUs such as the Qualcomm Adreno [16], the Imagination PowerVR, the ARM Mali, and the Vivante GPGPU have tended to integrate more compute units per chip. Mobile GPUs have gained general-purpose parallel computing capability thanks to their multicore architecture and to emerging frameworks such as OpenCL, and they are likely to offer flexibility similar to that of vendor-specific solutions designed for desktop computers, such as Nvidia's CUDA.

OpenCL is a programming framework designed for heterogeneous computing across various platforms [17]. In OpenCL, a host processor (typically a CPU) manages the OpenCL context and can offload parallel tasks to several compute devices (for instance, GPUs).

The parallel jobs can be divided into work-groups, each of which consists of many work-items, the basic processing units that execute a kernel in parallel.

OpenCL defines a hierarchical memory model: a large global memory with long latency; a small but fast local memory that can be shared by the work-items of the same work-group; and, in addition, a private memory owned by each work-item, which is not shared with other work-items and has the fastest access.
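
This hierarchy maps directly onto the OpenCL C address-space qualifiers, as in the following fragment (kernel and variable names are ours):

```c
/* Illustrative kernel fragment: the three levels of the OpenCL memory model. */
__kernel void memory_levels(__global const float *in,   /* global: large, slow          */
                            __global float *out,
                            __local  float *shared)     /* local: shared by a work-group */
{
    int gid = get_global_id(0);     /* index of this work-item in the NDRange   */
    int lid = get_local_id(0);      /* index inside its work-group              */

    float priv = in[gid];           /* private: visible to this work-item only  */
    shared[lid] = priv;             /* stage data in fast local memory          */
    barrier(CLK_LOCAL_MEM_FENCE);   /* make it visible to the whole work-group  */

    out[gid] = shared[lid];
}
```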

To efficiently and fully utilize the limited computation resources of a mobile processor, we partition the tasks between the CPU and the GPU and exploit the algorithmic parallelism; memory access optimization also needs to be considered carefully.

Handling such various tasks on embedded platforms is becoming a trend. The OpenCL embedded profile describes a subset of the OpenCL specification for handheld and embedded platforms.

The OpenCL embedded profile has some restrictions; for instance, support for 3D images is optional and 64-bit integers are not supported. The details of the OpenCL embedded profile can be found on Khronos's website [17].

Despite these specification restrictions, OpenCL can be used to accelerate programs on mobile devices. When compute-intensive computation on the mobile device is offloaded to the GPU or another device supporting OpenCL, not only can these tasks be performed more efficiently, but the CPU is also freed to handle the tasks it is good at. LDPC decoding is precisely this kind of compute-intensive computation.
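
As an illustration of this offloading pattern, a minimal OpenCL host-side sketch might look as follows; error handling and resource release are omitted, and the kernel name ldpc_decode, the helper's signature, and the two-buffer argument list are assumptions rather than the paper's actual code.

```c
#include <CL/cl.h>

/* Minimal host sketch: pick a GPU, build a kernel and launch it once.
   decoder_src is assumed to hold the OpenCL C source of the decoding kernel. */
void offload_to_gpu(const char *decoder_src, const float *llr, float *out,
                    size_t n_floats, size_t global, size_t local)
{
    cl_platform_id platform;  cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &decoder_src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "ldpc_decode", NULL);  /* name assumed */

    cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  n_floats * sizeof(float), (void *)llr, NULL);
    cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                  n_floats * sizeof(float), NULL, NULL);
    clSetKernelArg(k, 0, sizeof(cl_mem), &d_in);
    clSetKernelArg(k, 1, sizeof(cl_mem), &d_out);

    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);
    clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, n_floats * sizeof(float), out,
                        0, NULL, NULL);
    /* releasing the OpenCL objects is omitted in this sketch */
}
```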

4. Parallel MSA LDPC Decoding on Mobile GPU

MSA is computationally intensive and should be processed either in a high-performance dedicated computing engine or in a highly parallel programmable device. On a mobile device, the GPU is a good choice. This general model, supported by the GPU through OpenCL, executes kernels in parallel on several multiprocessors, each composed of several cores that dispatch multiple threads. In this section, a parallel scheme that stores the information of matrix H in the work-items is presented. In order to save private memory, each work-item keeps only the compressed information related to its own computation. After that, the specific parallel algorithm of the OpenCL kernel is introduced. Given an (N, K) LDPC code, it is important to organize the computation so as to reduce the overhead of parallel programming. Instead of using N × w_c work-items (where w_c is the maximum column weight of matrix H), the model uses M work-items in each work-group, and each work-item updates the messages of one check node, which means the M work-items serve the M check nodes, respectively.
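
In OpenCL terms this mapping follows directly from the work-item identifiers; the fragment below is an illustrative sketch, not the actual kernel:

```c
/* Illustrative fragment: one work-group decodes one code word, and each of
   its M work-items is responsible for one check node. */
__kernel void msa_skeleton(__global const float *llr_in /* ...other buffers... */)
{
    int m    = get_local_id(0);   /* check node handled by this work-item (0..M-1) */
    int word = get_group_id(0);   /* code word decoded by this work-group          */
    /* horizontal and vertical processing for check node m of code word `word`
       would follow here, reading the LLRs of that code word from llr_in. */
}
```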

4.1. Compact Representation of the Tanner Graph

The Tanner graph of an LDPC code is defined by its parity-check matrix H. We represent it in two separate data structures, namely, H_BN and H_CN. This is because one iteration of the LDPC decoder can be decomposed into horizontal and vertical processing, in which we update the messages sent from check nodes to bit nodes and the messages sent from bit nodes to check nodes, respectively.

The data structure used in the horizontal step is defined as H_BN. It is generated by scanning the matrix H in row-major order and mapping only the bit node edges associated with non-null elements of H used by a single check node equation, that is, in the same row. Algorithm 1 describes this procedure in detail for a matrix with M rows and N columns. H_BN is stored in private memory. Because each work-item updates the messages of a whole row, its part of H_BN does not need to be accessed by any other work-item.

(1) for the m-th work-item in a work-group: do
(2)   k = 0
(3)   for all BN_n (columns n in row m of H): do
(4)     if H_mn = 1 then
(5)       H_BN(k) = n; k++

The data structure H_CN is used in the vertical processing step. It can be defined as a sequential representation of the edges associated with non-null values in H, generated by scanning the H matrix in column-major order. H_CN is also stored in private memory. Because each work-item updates the messages of only its own subset of bit nodes (columns), its part of H_CN does not need to be accessed by other work-items either.
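
For illustration, both compact structures could be built on the host as in the following C sketch; the flat array layout and the names H_BN, H_CN, wr, and wc are our assumptions (in the actual decoder each work-item builds only its own part in private memory, as described above and in Algorithms 1 and 2).

```c
/* Build compact edge lists from a dense M x N parity-check matrix H (0/1 bytes).
   H_BN[m*wr + k] lists, for check node m, the bit nodes of its row   (row-major scan).
   H_CN[n*wc + k] lists, for bit node n, the check nodes of its column (column-major scan).
   wr/wc are the maximum row/column weights; unused slots are set to -1. */
void build_compact(const unsigned char *H, int M, int N,
                   int wr, int wc, int *H_BN, int *H_CN)
{
    for (int m = 0; m < M; m++) {              /* Algorithm 1: row-major scan    */
        int k = 0;
        for (int n = 0; n < N; n++)
            if (H[m * N + n]) H_BN[m * wr + k++] = n;
        while (k < wr) H_BN[m * wr + k++] = -1;
    }
    for (int n = 0; n < N; n++) {              /* Algorithm 2: column-major scan */
        int k = 0;
        for (int m = 0; m < M; m++)
            if (H[m * N + n]) H_CN[n * wc + k++] = m;
        while (k < wc) H_CN[n * wc + k++] = -1;
    }
}
```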

4.2. Programming the MSA on the OpenCL Grid

Each work-group contains M work-items, which represent M threads. Instead of storing the whole matrix H, each work-item saves only the necessary part of H_BN in its private memory, which makes the accesses performed during the Lr_{mn} update faster. Again, the same principle applies to H_CN and the update of the Lq_{nm} messages.

According to the LDPC code length, the CPU on the mobile device allocates memory on the GPU, including global memory for storing the check matrix H, the input data, and the output data, and local memory for storing the messages sent from the bit nodes to the check nodes, denoted Lq, and from the check nodes to the bit nodes, denoted Lr (Algorithm 3).
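
A host-side sketch of this allocation, assuming a context and kernel created as in Section 3; the argument order, names, and sizes are illustrative only:

```c
#include <CL/cl.h>

/* Illustrative allocation for one launch (error handling and release omitted).
   M x N parity-check matrix, W code words per batch, wr = maximum row weight. */
void setup_buffers(cl_context ctx, cl_kernel kernel,
                   const unsigned char *H, const float *llr_in,
                   int M, int N, int W, int wr)
{
    cl_mem d_H   = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  (size_t)M * N, (void *)H, NULL);
    cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  (size_t)W * N * sizeof(float), (void *)llr_in, NULL);
    cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, (size_t)W * N, NULL, NULL);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_H);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_in);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_out);
    /* local-memory message buffers: pass only a size and a NULL pointer */
    clSetKernelArg(kernel, 3, (size_t)M * wr * sizeof(float), NULL);  /* Lq */
    clSetKernelArg(kernel, 4, (size_t)M * wr * sizeof(float), NULL);  /* Lr */
}
```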

In step (2), the compact H_BN and H_CN structures are generated in private memory by Algorithms 1 and 2.

Algorithm 2 (generation of the compact H_CN structure):

(1) for the m-th work-item in a work-group: do
(2)   for offset = 0 to N/M − 1: do
(3)     n = m + offset × M; k = 0
(4)     for all CN_m' (rows m' in column n of H): do
(5)       if H_m'n = 1 then
(6)         H_CN(n, k) = m'; k++

Algorithm 3 (the parallel MSA decoding kernel):

(1) Initialize the work-group size (the number of work-items per work-group).
(2) Generate the compact H_BN and H_CN from matrix H.
(3) while the stop criterion is not met: do
(4)   for the m-th work-item in a work-group: do
(5)     for all BN_n connected to CN_m (row m of H_BN): do
(6)       read Lq_{nm} and Lr_{mn} from local memory
(7)       update the message Lr_{mn} sent from CN_m to BN_n
(8)       update the message Lq_{nm} sent from BN_n to CN_m
(9)     Synchronize all threads
(10)    for offset = 0 to N/M − 1: do
(11)      n = m + offset × M
(12)      for all CN_m' connected to BN_n (column n of H_CN): do
(13)        update the posterior probability LQ_n of BN_n
(14)    Synchronize all threads
(15)    perform hard decoding

As in the normal MSA, the loop starting at step (3) ends when the output code word is correct, that is, it satisfies all parity-check equations, or when the maximum number of iterations is reached.

Steps (5)–(9) perform a horizontal processing, a vertical processing, and a synchronization of all threads. In general, all threads should be synchronized after both the horizontal and the vertical processing. In this algorithm, however, every work-item takes charge of its own check node, and the Lr_{mn} data written in the horizontal step is read again only by the same work-item in the following vertical step, so the synchronization after the horizontal processing can be cancelled to improve performance. The Lr_{mn} data is still shared by all work-items when the posterior probabilities are computed afterwards, so the synchronization after the vertical processing, step (9), is retained.

After this synchronization, the posterior probabilities LQ_n are calculated, and every work-item deals with N/M bit nodes, as in steps (10)–(13). After the second synchronization, hard decoding is performed on the posterior probabilities, according to the method described in Section 2.
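
To make the structure of Algorithm 3 concrete, the following OpenCL C sketch shows how such a kernel might look. It is illustrative only: the compact structures are passed in ready-made instead of being generated on the device by Algorithms 1 and 2, the parity-check stopping test is replaced by a fixed iteration count, the bit-to-check messages are derived on the fly from LQ (one of several possible organisations), and all identifiers and the message layout are our assumptions.

```c
/* Illustrative OpenCL C sketch of Algorithm 3 (not the original implementation).
   One work-group decodes one code word; work-item m handles check node m.
   H_BN[m*WR + i] : i-th bit node of check node m               (-1 = unused slot)
   H_CN[n*WC + j] : index into Lr of the j-th edge of bit node n (-1 = unused)
   Lr (local)     : per-edge check-to-bit messages, LQ (local) : posterior LLRs. */
#define MAX_WR 32   /* assumed compile-time bound on the row weight (WR <= MAX_WR) */

__kernel void ldpc_msa(__global const int   *H_BN,
                       __global const int   *H_CN,
                       __global const float *Lp_all,
                       __global uchar       *out_all,
                       __local  float       *Lr,
                       __local  float       *LQ,
                       const int M, const int N,
                       const int WR, const int WC,
                       const int max_iter)
{
    const int m    = get_local_id(0);            /* check node of this work-item */
    const int word = get_group_id(0);            /* code word of this work-group */
    __global const float *Lp  = Lp_all  + word * N;
    __global uchar       *out = out_all + word * N;

    for (int n = m; n < N; n += M) LQ[n] = Lp[n];   /* LQ(0) = Lp */
    for (int i = 0; i < WR; i++)   Lr[m * WR + i] = 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int it = 0; it < max_iter; it++) {
        /* steps (5)-(8): horizontal + vertical processing of row m */
        float Lq[MAX_WR];
        float sgn = 1.0f, min1 = FLT_MAX, min2 = FLT_MAX;
        for (int i = 0; i < WR; i++) {
            int n = H_BN[m * WR + i];
            if (n < 0) break;
            Lq[i] = LQ[n] - Lr[m * WR + i];          /* Lq(nm) = LQ(n) - Lr(mn) */
            float a = fabs(Lq[i]);
            if (Lq[i] < 0.0f) sgn = -sgn;
            if (a < min1) { min2 = min1; min1 = a; } else if (a < min2) min2 = a;
        }
        for (int i = 0; i < WR; i++) {               /* minimum step (excludes BN n) */
            int n = H_BN[m * WR + i];
            if (n < 0) break;
            float a = fabs(Lq[i]);
            float s = (Lq[i] < 0.0f) ? -sgn : sgn;
            Lr[m * WR + i] = s * ((a == min1) ? min2 : min1);
        }
        barrier(CLK_LOCAL_MEM_FENCE);                /* step (9) */

        /* steps (10)-(13): posterior update, N/M bit nodes per work-item */
        for (int n = m; n < N; n += M) {
            float q = Lp[n];
            for (int j = 0; j < WC; j++) {
                int e = H_CN[n * WC + j];
                if (e < 0) break;
                q += Lr[e];
            }
            LQ[n] = q;
        }
        barrier(CLK_LOCAL_MEM_FENCE);                /* step (14) */
    }

    for (int n = m; n < N; n += M)                   /* step (15): hard decoding */
        out[n] = (LQ[n] < 0.0f) ? 1 : 0;
}
```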

True parallel execution is thus obtained, and the overall processing time required to decode a code word can be significantly reduced, as will be seen in the next section. More data parallelism is exploited by decoding several code words simultaneously, with each work-group decoding its own code word, as described in the next section.

5. Implementation and Experimental Results

The experimental setup used to evaluate the performance of the proposed parallel LDPC decoder consists of a mobile phone whose GPU is a PowerVR G6200 with 256 MB of global memory and 4 KB of local memory; the decoder was programmed in the C language with the OpenCL programming interface (version 1.1). In this algorithm, each code word is decoded in one work-group. Because of the limited local memory, only small LDPC codes can be used on this test mobile phone. However, the number of work-groups can be large thanks to the relatively large global memory of the GPU.

We decode a batch of code words whose total size is 1 Mbit; the variation in performance across runs is minimal, so Figure 5 shows only the best results achieved. For the 144 × 576 parity-check matrix, the number of work-items per work-group equals its number of rows, which means we use 144 work-items per work-group and 1000 work-groups in this experiment.
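
With this configuration, launching the decoder reduces to choosing the NDRange sizes; a sketch, assuming the queue and kernel objects created as in Section 3:

```c
/* Illustrative launch for the 144 x 576 code: 1000 work-groups of 144 work-items,
   i.e. one code word per work-group and one check node per work-item. */
size_t local_size  = 144;           /* M = number of rows of H                  */
size_t global_size = 144 * 1000;    /* work-items per group x number of groups  */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size,
                       0, NULL, NULL);
clFinish(queue);                    /* wait for the decoding to complete        */
```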

The decoding times reported in Figure 5 are global processing times, including both the data transmission time and the decoding time. The decoding time increases linearly with the number of iterations, and the computation capacity of the GPU is fully used. The throughput decreases as the number of iterations increases when the size of the data to be decoded remains the same.

On a mobile device we attach as much importance to the delay as to the throughput. Figure 6 shows the decoding delay for speeds from 10 Kbps to 100 Mbps. As the speed increases exponentially, the delay increases only slowly. At low speeds, the amount of data decoded on the GPU in one decoding cycle is so small that part of the GPU capacity is wasted and the benefit of parallelism is not obvious.

The delay obviously increases when the number of code words sent to the GPU for decoding increases. However, the duration of one decoding cycle, which is the dominant part of the delay, grows only slowly thanks to the fuller use of the computation capacity, and the mean decoding time per code word decreases at higher speeds. The decoder can therefore serve both high-speed mobile services, such as large file transfers, and delay-sensitive services such as video calling.

6. Conclusion

This paper proposes a parallel LDPC decoder that decodes multiple code words on a mobile GPU using OpenCL. LDPC is widely used in the fourth generation of mobile telecommunications technology, so it is important to realize high-speed LDPC decoding on mobile devices.

For instance, the popular video calling software Skype publishes its bandwidth requirements on its official website [18]; the bandwidth required depends on the type of call. The minimum speeds required for normal video calling or screen sharing, high-quality video calling, and HD video calling are 0.3 Mbps, 0.5 Mbps, and 1.5 Mbps, respectively. At these speeds, the decoding delays in Figure 6 are 0.84 ms, 0.86 ms, and 0.98 ms. Even HD video calling incurs less than 1 ms of decoding delay, so the decoder clearly meets these requirements.

With a software realization of LDPC decoding on mobile devices, the decoder can dynamically change its parameters, including code length, code rate, and the number of iterations. All of them can be switched quickly and dynamically on an OpenCL device, which allows the decoder to adapt rapidly to all kinds of network environments: when the network is bad, the code rate can be reduced to improve the error-correction capability, and it can be raised again when the network is good. Compared with traditional hardware decoding, the proposed software decoder running on the mobile GPU is more flexible, because it can be reconfigured at any time according to the actual environment.

Competing Interests

The authors declare that they have no competing interests.