Abstract
Channel simulators are powerful tools that permit performance tests of the individual parts of a wireless communication system. This is relevant when new communication algorithms are tested, because it allows us to determine whether they fulfill the requirements of the communications standard. One of these tests consists of evaluating the system performance when a communication channel is considered. In this sense, it is possible to model the channel as an FIR filter with time-varying random coefficients. If the number of coefficients is increased, a better approximation of real scenarios can be achieved; however, the computational complexity also increases. In order to address this issue, a design methodology for computing the time-varying coefficients of fading channel simulators using consumer-designed graphics processing units (GPUs) is proposed. With the use of GPUs and the proposed methodology, users who are not specialists in parallel computing can accelerate their simulation developments compared to conventional software. Implementation results show that the proposed approach allows the easy generation of communication channels while reducing the processing time. Finally, the GPU-based implementation is preferable to the CPU-based implementation, due to the scattered nature of the channel.
1. Introduction
Currently, the high demand for integrated services (voice, data, and video) means that new data transmission schemes have to be developed to deal with high transmission data rates while offering high levels of quality of service. The fourth generation (4G) of mobile communication systems is still under development; its main goal is to provide a digital communication network (land, mobile, and satellite) with peak data rates of 100 Mbps for high-mobility devices and 1 Gbps for users or devices in low-mobility environments or stationary conditions. The main technologies used in 4G include techniques based on multiple-input multiple-output (MIMO) antennas, turbo decoding, adaptive modulation, coding schemes and error correction, and orthogonal FDMA, which is based on orthogonal frequency-division multiplexing (OFDM) [1, 2]. Current versions of standards that incorporate 4G are LTE-A (Long Term Evolution-Advanced) and IEEE 802.16m mobile WiMAX (Worldwide Interoperability for Microwave Access). Therefore, the new issues imposed by the standards require new processing algorithms to be tested in high-mobility environments affected by Doppler shifts (time-selective channels) and multipath propagation (frequency-selective channels). The temporal channel variability occurs when the characteristics of the transmission medium change over time or when there is relative motion between the receiver and the transmitter, as in communication systems such as LTE and WiMAX. The frequency selectivity appears when multiple copies of the transmitted signal arrive at the receiver due to physical mechanisms such as multipath propagation.
Moreover, measuring the behavior or performance of a mobile communication system under real conditions (in situ test) can be very expensive, owing, among other issues, to the transfer of the communications system and test equipment to the place under study. Additionally, the system behavior cannot be tested repeatedly under the same propagation conditions, due to the random nature of the communication channel. Faced with this problem, an economical alternative is to use mathematical models that represent the radio channels under consideration. In this sense, we can define a channel simulator as a software tool that reproduces the behavior or the propagation conditions of a mobile communications channel under controlled or laboratory conditions.
On the other hand, GPU-accelerated computing is the use of a graphics processing unit (GPU) together with a CPU in order to accelerate scientific, engineering, and business applications [3]. Recently, several works in the wireless communication area that use GPU devices have been published [4–7]. Those works follow an implementation strategy that handles the channel complexity using multiple cores. For example, in [4] a wireless channel simulator is implemented. In that work, the potential of GPU-based processing is studied in order to improve the runtime performance of computationally intensive, accurate wireless network simulation. In [5], the use of general-purpose GPUs is investigated in order to provide the computational capabilities required for performing the radio frequency path loss computation. A discussion of the acceleration of wireless channel simulation using GPUs is provided in [6]. In addition, in [7], an implementation of a parallel lattice-reduction-aided 2 × 2 MIMO detector using GPUs for the WiMAX standard is presented.
Although several works related to the use of GPUs in communication systems exist, there are currently no works that describe in detail the implementation of a fading channel simulator based on GPUs. In this paper, the methodology for implementing a fading channel simulator (time and frequency selective) via GPU computing is presented.
The proposed methodology considers the use of common GPU software libraries that permit users who are not specialists in GPU programming to easily implement the proposed simulator. The generation of the Rayleigh fading variates is achieved using the filtering method [8–10]. In this case, the filtering method is carried out in the time domain by using a finite impulse response (FIR) filter for coloring Gaussian noise samples. Furthermore, it is well known that if the filter order is increased, the accuracy of the channel statistics can be improved, though at the cost of increasing the computational complexity. Therefore, in this work, we take advantage of GPUs for handling such computational complexity (multiplication and addition operations) in order to implement an accurate communication channel for SISO systems. Moreover, this methodology paves the way for implementing MIMO channel simulators in the future.
The rest of this paper is organized as follows. In Section 2, the background of the wireless communication system is stated, specifically as regards the communication channel model. In Section 3, the simulation of the communication channel is explained. Next, in Section 4, the GPU implementation of the fading channel simulator is detailed. Section 5 is devoted to presenting the implementation results when a WiMAX scenario is considered. Finally, the conclusions are presented in Section 6.
2. Communication System
Consider a single-input single-output (SISO) communication system where the transmission of in-phase and quadrature signals modulated by orthogonal carriers and , respectively, is assumed, which are mixed for obtaining . This signal is propagated through the communication channel , which is considered to be a causal time-varying linear system. The signal filtered by the channel reaches the receiver, where a noisy version is detected. It can be expressed mathematically as in (1), where , and is a time variable. The impulse response states the response of the channel in the instant when a stimulus is applied in , which reflects the time variability of the channel impulse response. Likewise, is the aggregated stochastic noise. This received signal is demodulated in order to obtain the in-phase and quadrature signals and .
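For orientation, the input-output relation of a causal, time-varying linear channel with additive noise can be written in a commonly used notation. The symbols below are assumptions introduced here for illustration (transmitted signal $x(t)$, received signal $y(t)$, impulse response $h(t,\tau)$ with delay variable $\tau$, and noise $n(t)$); they are not necessarily those of the original equation.

```latex
% Assumed notation: h(t, \tau) is the response at time t to an impulse applied at time t - \tau.
% Causality restricts the integration to non-negative delays.
y(t) = \int_{0}^{\infty} h(t,\tau)\, x(t-\tau)\, d\tau + n(t)
```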
For the sake of simplicity, if and , where is any carrier frequency and is any phase, the system becomes the well-known single-carrier communication system. It is important to emphasize that an OFDM system implemented with IFFT/FFT produces a baseband signal that is modulated as in a single-carrier system.
If we consider that both signals and are band limited to a maximum frequency of and (this condition is always satisfied in real communication systems), it is easy to demonstrate [11, 12] with the aid of the Hilbert transform the existence of baseband equivalent signals , , , and for , , , and , respectively. In general, these equivalent baseband signals are complex, where the real part corresponds to the in-phase component and the imaginary part to the quadrature component; thus, and for . The relations between the original passband signals and their baseband equivalents are given by (2) [12], where is the real part of the complex number in parentheses. Considering (2), the baseband equivalent of (1) is given by (3), which can be interpreted as a collection of multiple paths (scatterers) along which the transmitted signal is propagated. The fact that these paths have different lengths and pass through different conditions of propagation causes the received signal from a specific path to be a delayed, attenuated, and phase-shifted version of the . In this sense, for a specific time and a specific delay , the channel coefficient will be a complex variable, where the magnitude represents the attenuation factor and the phase represents the phase-shift factor. On the other hand, due to the constant changes in the environment and the possible relative movement between transmitter and receiver, these factors are time dependent. According to [12], can be modeled as a complex stochastic process composed of the sum of a deterministic part (the ensemble average of ) and a random part (a zero-mean random process). From this point, we will only consider the random part (an assumption generally accepted when a channel simulator is developed). The autocorrelation function of this random process is given by (4), where is the expectation operator and represents the complex conjugate. This channel model is difficult to implement; nevertheless, some assumptions can be made that simplify the model.
The first is the absence of correlation between the different scatterers, and the second is that each scatterer is a wide-sense stationary process; together, these comprise the well-known wide-sense stationary uncorrelated scattering (WSSUS) model. Therefore, (4) transforms into (5), where , , and is the autocorrelation function with respect to the time difference variable for the scatterer located in the delay variable . From (5), it is possible to calculate the scattering function, which is defined as the Fourier transform of the correlation function with respect to the time difference variable , as given by (6), where is the Fourier transform operator. This scattering function indicates the shape of the Doppler spectrum for a given delay value in the variable .
In many communication standards, a discrete number of scatterers is considered instead of a continuum, as suggested in the previous equations. If this assumption is considered, then the channel takes the form (7), where is an index variable that enumerates the discrete scatterers and is a complex variable that encloses the gain and phase-shift factor of such a scatterer. If a WSSUS channel is considered, the correlation function of (7) is given by (8), with the scattering function given by (9).
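As a reading aid, the discrete-path WSSUS model described above can be written out in one commonly used notation. All symbols here are assumptions chosen for illustration ($K$ paths, complex path processes $a_k(t)$, path delays $\tau_k$, time difference $\Delta t$, Doppler frequency $\nu$); the conjugation and sign conventions of the original equations may differ.

```latex
% Assumed notation for a channel with K discrete, mutually uncorrelated, wide-sense stationary paths.
\begin{align}
  h(t,\tau) &= \sum_{k=1}^{K} a_k(t)\,\delta(\tau-\tau_k), \\
  R_h(\Delta t;\tau) &= \sum_{k=1}^{K} R_{a_k}(\Delta t)\,\delta(\tau-\tau_k), \\
  S_h(\nu;\tau) &= \sum_{k=1}^{K} S_{a_k}(\nu)\,\delta(\tau-\tau_k),
  \qquad S_{a_k}(\nu) = \mathcal{F}_{\Delta t}\{R_{a_k}(\Delta t)\}.
\end{align}
```

In this form, each path carries its own Doppler spectrum $S_{a_k}(\nu)$, located at its own delay $\tau_k$.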
3. Channel Simulation
In order to perform a computational simulation of the communication channel, it is necessary to deal with the discrete version of the baseband equivalent channel presented in (7). This discrete channel results from band-limiting and sampling (7) in the time and time-delay domains at a rate of . Thus, it is defined as in (10), where , , the symbol represents the convolution operator, and is a function for band-limiting the channel to , which, for practical purposes, could be a time-windowed cardinal sine function. Substituting (7) into (10) results in (11), where corresponds to the coefficients of the FIR filter for simulating the communication channel, enumerates the samples in the time domain, and enumerates the taps of the filter. Likewise, can be calculated as , where is the maximum delay of the paths in the channel , and is the length of the filter . This filter could be anticausal; nevertheless, it is possible to introduce a delay in order to convert it into a causal, and therefore physically realizable, filter.
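The role of the band-limiting function can be illustrated with a minimal sketch, assuming that each FIR tap is a sum of the path processes weighted by the band-limiting pulse evaluated at the tap time minus the path delay. The function names, the use of an unwindowed cardinal sine, and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def sinc(x):
    """Normalized cardinal sine, sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0.0 else math.sin(math.pi * x) / (math.pi * x)


def tap_matrix(path_delays, sample_period, num_taps, tap_offset=0):
    """Return P with P[l][k] = sinc((l - tap_offset) - delay_k / sample_period).

    Row l corresponds to one FIR tap and column k to one propagation path.
    tap_offset delays the pulse so that an anticausal filter becomes causal.
    """
    return [[sinc((l - tap_offset) - d / sample_period) for d in path_delays]
            for l in range(num_taps)]


# Hypothetical values, chosen only for illustration.
delays = [0.0, 1.0e-6, 2.5e-6]   # path delays in seconds
P = tap_matrix(delays, sample_period=1.0e-6, num_taps=6, tap_offset=1)
```

A path whose delay falls exactly on a sampling instant contributes to a single tap, whereas a path with a fractional delay (the third one here) spreads over all taps; this is why the filter length grows with the pulse duration.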
In order to implement (11), it is necessary to generate uncorrelated discrete Gaussian stochastic complex processes at rate . Many algorithms for obtaining these stochastic processes have been reported in the literature, as mentioned in [13–16] and references therein. Such processes must be filtered (colored) in order to obtain the desired scattering function. It is important to note that these filters only affect the frequency components below a maximum Doppler frequency ; therefore, it is possible to generate the samples at a rate of at least , where typically , and then to use any upsampling technique to reach the rate.
The impulse response of the filter for coloring the th process is the discrete version (at rate ) of the following expression:
Finally, an interpolation technique such as splines, polynomial, or basis expansion is used for obtaining the samples at rate. The entire process is presented in Figure 1 and summarized in Algorithm 1.

4. GPU Implementation
The emergence of GPUs has allowed complex algorithms to be executed almost in real time. A GPU is conceptualized as a set of streaming multiprocessors (SMs), where each SM is characterized by a single-instruction, multiple-data (SIMD) architecture. Therefore, in each clock cycle, each processor of the multiprocessor executes the same instruction, operating on multiple data streams. Each of these processors can access a shared memory (common to all processors belonging to the same SM) and a local cache memory. In addition, all the processors have access to the global GPU (device) memory. Figure 2 illustrates the GPU hardware architecture.
Our strategy for implementing the fading channel simulator is aimed at improving the overall performance by chaining software functions (called kernels) representing each communication step. In order to implement the parallel fading simulator as illustrated in Figure 3, we distinguish five stages in the GPU design methodology as follows.
4.1. Gaussian Random Number Generator
In this stage, the CUDA Random Number Generation (cuRAND) library [17] is employed in order to obtain Gaussian random numbers (GRN) by means of efficient generation of high-quality pseudorandom numbers. In particular, the curand_init function is called to initialize the random number generator states in a massively parallel scheme. There are seven types of random number generators in cuRAND; in this study, we have selected the XORWOW algorithm, which is a member of the xorshift family of pseudorandom number generators, with customized parameters for operating on GPUs.
The curand_normal2 function generates two normally distributed pseudorandom numbers in each call. Because the underlying algorithm is based on the Box-Muller transform, it is suitable for generating random complex numbers; that is, each call generates real and imaginary parts at the same time.
A CUDA kernel computes a set of independent GRN vectors. Each vector corresponds to a path, which is computed in chunks by the GPU multiprocessors and then stored in device global memory. The implementation of the GRN generator is presented in Algorithm 2, where the function setup_kernel initializes the threads of the same block with a different sequence number but the same seed and offset (zero offset). Furthermore, generate_normal_kernel computes several pseudorandom values with Gaussian distribution by calling curand_normal2.
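The idea of producing the real and imaginary parts of a complex Gaussian sample in one step can be sketched on the CPU with the Box-Muller transform. This is not the cuRAND implementation; it is a minimal illustration in plain Python, and the helper names and the per-path seeding scheme are assumptions made here.

```python
import math
import random


def box_muller_pair(rng):
    """Map two uniform variates to two independent N(0, 1) variates."""
    u1 = 1.0 - rng.random()          # lies in (0, 1], so log(u1) is finite
    u2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


def complex_gaussian_paths(num_paths, num_samples, seed=0):
    """Return one row of complex Gaussian samples per path (hypothetical helper)."""
    paths = []
    for path_index in range(num_paths):
        # Hypothetical seeding: a common base seed and a separate generator per path.
        rng = random.Random(seed * 1000003 + path_index)
        paths.append([complex(*box_muller_pair(rng)) for _ in range(num_samples)])
    return paths


noise = complex_gaussian_paths(num_paths=6, num_samples=2000, seed=1)
```

Each call to box_muller_pair yields one complex sample, which mirrors the way a single curand_normal2 call is described as providing both components.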

4.2. Parallel Doppler FIR Filter
The Doppler filter uses the coefficients obtained by sampling (12) and the random numbers generated in the previous subsection. Since the filter coefficients are fixed for all channel realizations and paths, they are stored in the constant memory of the GPU. This memory is devoted to storing and broadcasting read-only data to all threads on the GPU. In addition, the generated GRN are stored in shared memory, since many threads must access them simultaneously. The filtering is conceptualized as a convolution, so a kernel that performs the convolution in parallel is used.
There is a set of independent 1D signal convolutions to be computed, one for each path. Rather than launching them separately, the filtering is performed using the NVIDIA Performance Primitives (NPP) library [18]; specifically, one of the nppiFilterRow functions is used, which performs 1D filtering on 2D data, each row being a channel path.
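The row-wise filtering described above can be sketched as follows: the same FIR coefficients are applied independently to every row, with one row per path. The three-tap filter is a hypothetical stand-in for the sampled Doppler filter, and the function names are illustrative; this is a sequential CPU sketch, not the NPP routine.

```python
def fir_filter_row(samples, coeffs):
    """Causal FIR filter: y[n] = sum over m of coeffs[m] * samples[n - m]."""
    out = []
    for n in range(len(samples)):
        acc = 0j
        for m, c in enumerate(coeffs):
            if n - m >= 0:           # samples before the start are taken as zero
                acc += c * samples[n - m]
        out.append(acc)
    return out


def fir_filter_rows(rows, coeffs):
    """Apply the same coefficients to every row (one row per channel path)."""
    return [fir_filter_row(row, coeffs) for row in rows]


# Hypothetical 3-tap smoothing filter standing in for the Doppler filter.
coeffs = [0.25, 0.5, 0.25]
rows = [[1 + 0j, 0j, 0j, 0j], [1j, 1j, 1j, 1j]]
colored = fir_filter_rows(rows, coeffs)
```

Because the rows do not interact, every output sample of every row can be computed independently, which is the property a parallel implementation exploits.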
4.3. Path Gain Implementation
The path gain is implemented with a multiplication function: the colored noise resulting from the previous stage is multiplied by a scalar. This could be carried out with a specific kernel or by using a standard library, such as CUDA Basic Linear Algebra Subroutines (cuBLAS) [19] or NPP. The proposed implementation uses the nppiMulC function of the NPP library.
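As a small worked example of the scaling step, the sketch below converts relative path powers given in dB into linear amplitude factors and multiplies each path by its factor. It assumes one scalar per path; the power values and function names are hypothetical and do not correspond to any particular channel profile.

```python
import math


def db_to_amplitude(power_db):
    """Convert a relative path power in dB to a linear amplitude factor."""
    return math.sqrt(10.0 ** (power_db / 10.0))


def apply_path_gains(rows, powers_db):
    """Scale each path (row) by the amplitude matching its relative power."""
    return [[db_to_amplitude(p) * s for s in row]
            for row, p in zip(rows, powers_db)]


# Hypothetical relative powers in dB, not the values of any specific standard.
powers_db = [0.0, -3.0, -10.0]
rows = [[1 + 1j, 2 + 0j], [1 + 1j, 2 + 0j], [1 + 1j, 2 + 0j]]
scaled = apply_path_gains(rows, powers_db)
```

The square root appears because the dB values describe power, whereas the noise samples are amplitudes.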
4.4. Upsampler
The upsampler stage is responsible for generating noise samples at the rate , and it is implemented as an interpolation. The usual interpolation available for GPUs is the linear interpolation offered by texture memory; NPP offers other methods for more accurate results. In this case, the nppiResize function with cubic interpolation is used. It returns the interpolated value for a given coordinate between two known noise values.
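To make the interpolation step concrete, the following sketch upsamples a sequence by an integer factor using Catmull-Rom cubic interpolation. Catmull-Rom is used here only as one common cubic kernel; the kernel used by the NPP routine may differ, and the function names and edge handling are assumptions.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Cubic Catmull-Rom interpolation between p1 and p2 for 0 <= t <= 1."""
    return 0.5 * ((2 * p1)
                  + (-p0 + p2) * t
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
                  + (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3)


def upsample_cubic(samples, factor):
    """Insert factor - 1 interpolated values between consecutive samples."""
    n = len(samples)
    out = []
    for i in range(n - 1):
        p0 = samples[max(i - 1, 0)]       # clamp neighbor indices at the edges
        p3 = samples[min(i + 2, n - 1)]
        for k in range(factor):
            out.append(catmull_rom(p0, samples[i], samples[i + 1], p3, k / factor))
    out.append(samples[-1])
    return out


low_rate = [0j, 1 + 1j, 2 + 2j, 3 + 3j]
high_rate = upsample_cubic(low_rate, factor=4)
```

The original samples are preserved at every multiple of the factor, and each new value depends on only four neighboring samples, so all output samples can be computed independently.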
4.5. Tap Generator
Up to this point, the multiple paths have been treated separately. In this stage, they are combined using predefined (computed offline) coefficients according to (11). This operation can be seen as the multiplication of the upsampled, scaled, colored noise (paths) by the coefficient matrix . This could be carried out with a programmer’s own implementation or by using a standard library such as cuBLAS. This proposal uses the cublasSgemm routine, which performs a matrix-matrix multiplication with optional scalar scaling.
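The tap generation can be sketched as a plain matrix product, assuming a coefficient matrix with one row per tap and one column per path, and a path matrix with one row per path and one column per time sample. The matrix sizes, the values, and the helper name are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of a (rows x inner) and b (inner x cols)."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


# Hypothetical sizes: 2 taps, 3 paths, 4 time samples.
coeff = [[1.0, 0.5, 0.0],
         [0.0, 0.5, 1.0]]
paths = [[1 + 0j, 1j, -1 + 0j, -1j],
         [2 + 0j, 2 + 0j, 2 + 0j, 2 + 0j],
         [0j, 0j, 1 + 1j, 0j]]
taps = matmul(coeff, paths)   # taps[l][n] is tap l at time sample n
```

The sketch uses complex entries directly. Since cublasSgemm operates on real single-precision data, the real and imaginary parts would presumably be handled as separate real arrays in a GPU version; this is an assumption, as the text does not state how the complex samples are laid out.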
5. Implementation Results
In order to corroborate the functionality of the proposed fading channel simulator in modern communication systems such as WiMAX, it was configured with the following parameters [20, page 404]: a maximum Doppler frequency Hz and a sample rate Msps, . In addition, the vehicular class B ITU multipath channel model was considered, which consists of six discrete paths with relative power dB at delay time nsec, respectively. For implementing the filter , a raised cosine function with a roll-off factor of and a duration of sec was considered. This duration results in the generation of taps. In Figure 4, a resulting GPU-based realization of the fading channel according to the specified parameters for time samples is presented. It is important to note that the offline computed data (see Figure 3) are transferred to the GPU simulator via text files.
The simulation was carried out using an iMac computer with the following specifications: OS X 10.9.4 (Mavericks), Intel Core i5 processor (3.4 GHz), 16 GB of RAM, and a GeForce GTX 780M graphics card with 4 GB of RAM and 1536 CUDA cores.
For evaluating the time performance, the parameters used in the previous test have been maintained; however, the parameter was fixed to samples. In this sense, Table 1 presents the average, maximum, and minimum time consumption for a CPU-based implementation (Matlab) versus the proposed GPU-based methodology (CUDA). It is clear that the GPU methodology has gains of fold (mean value) when compared with the CPU-based implementation, which is attractive if parallel versions of the channel simulator are required, as could be the case in MIMO applications.
Table 2 reports the percentage of time spent on each task of the channel simulator in the GPU. It should be noted that the reading of the input data and the device memory allocation, which are the most time-consuming tasks, are not considered in this table. These tasks are performed only once, at the initialization stage of the simulation.
On the other hand, Table 3 and Figure 5 present the overall time consumption in milliseconds for the CPU-based and GPU-based implementations when the number of samples is fixed to = 5120, 10240, 20480, 81920, 327680, 655360, 1000000, and samples. This shows that while the time consumption of the CPU-based implementation increases exponentially, it remains almost linear in the GPU-based implementation.
Similarly, the good performance achieved with the GPU implementation with respect to the CPU implementation can be observed in the x-fold gain reported in Table 3. This gain is calculated as the quotient of the CPU time consumption and the GPU time consumption, and it is reported for each of the sample counts stated in the previous paragraph.
Finally, it is important to emphasize that the presented approach can deal with several path realizations. This suggests that the developed fading channel simulator can be considered for generating large MIMO channels, which represents a new simulation paradigm.
6. Conclusions
The principal result of this study is the introduction of a methodology for designing fading channel simulators via GPU devices. Such a methodology permits users who are not specialists in parallel computing to easily implement channel simulators in parallel. As was shown, the use of GPUs in the development of fading channel simulators saves considerable simulation time when channel realizations are generated for testing communication systems. Moreover, a case study for WiMAX systems demonstrated the functionality of the implemented channel simulator. We believe that the proposed parallel channel simulator can aid in testing mobile communication systems based on LTE and WiMAX. Additionally, the presented GPU-based approach will allow the design of more sophisticated simulators of complex channel models such as triply selective MIMO fading channels (i.e., time, frequency, and space selective).
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work was supported by the Programa para el Desarrollo Profesional Docente (PRODEP) 2014 and CONACYT, Ciencia Básica, 2014 (CB2014241272), Mexico.