Abstract
A methodology for implementing a triply selective multipleinput multipleoutput (MIMO) simulator based on graphics processing units (GPUs) is presented. The resulting simulator is based on the implementation of multiple doubleselective singleinput singleoutput (SISO) channel generators, where the multiple inputs and the multiple received signals have been transformed in order to supply the corresponding space correlation of the channel under consideration. A direct consequence of this approach is the flexibility provided, which allows different propagation statistics to each SISO channel to be specified and thus more complex environments to be replicated. It is shown that under some specific constraints, the statistics of the triply selective MIMO simulator are the same as those reported in the state of art. Simulation results show the computational time improvement achieved, up to 650fold for an 8 8 MIMO channel simulator when compared with sequential implementations. In addition to the computational improvement, the proposed simulator offers flexibility for testing a variety of scenarios in vehicletovehicle (V2V) and vehicletoinfrastructure (V2I) systems.
1. Introduction
With the growing demand by users of rapid transfer of high amounts of data, it has been necessary to develop new transmissions strategies and, consequently, to verify their performance in order to accomplish the established goals. In this sense, multipleinput multipleoutput (MIMO) simulator has been considered in recent years a fundamental part of new communication standards embraced by longterm evolutionvehicle (LTEV) and vehicletovehicle (V2V) technologies. This is due to the fact that MIMO can take advantage of space diversity and multipath scattering for increasing the data transmission rate considerably in comparison with singleinput singleoutput (SISO) communication systems [1]. Because of its relevance, several mathematical channel models and hence diverse channel simulators/emulators have been proposed in order to prove the performance of MIMObased communication systems.
In this sense, as summarized by [2], MIMO channel models can be classified in diverse ways; the broadest classification considers physical models and analytical models. Physical models take into account the electromagnetic propagation and the environment under study for obtaining a channel model. Moreover, such models can be categorized as deterministic models [3], geometricbased stochastic models [4], or stochastic models [5, 6]. Analytical models, on the other hand, abstract the complex electromagnetic propagation mechanisms into tractable channel impulse responses for modeling the MIMO channels. Moreover, analytical models can be classified as propagationbased models and correlationbased models, where the distinguished Kronecker model [7] and the Weichselberger model [8] fit in the latter classification. These models are the most commonly used models for simulating MIMO channels due to their simplicity and the conceptualization of the propagation environment.
This paper considers an analytical model based on correlation, which assumes a triply selectivity; that is, the propagation channel for each transmitter and receiver antenna pair presents both time and frequency selectivity and a spatial correlation. This model was chosen to capture the nature of the propagation channel with greater precision. As expected, the implementation of this triply selective channel simulator is highly complex; for example, if a MIMO system made up of transmitter antennas and receiver antennas is considered, then it is necessary to implement independent SISO channel simulators in parallel. This number of SISO channels increases as well as the number of antennas is increased. Therefore, graphics processing units and GPUaccelerated computing techniques can help to manage the computational complexity in the MIMO channel simulator.
The GPUaccelerated computing techniques use the GPU together with CPUs, which are available in affordable computers or servers. The purpose is to accelerate scientific computation and engineering calculations, among others. As a result, several works related to wireless channel simulators have been presented in order to handle the computational complexity in the implementation of channel simulators [9–12] and high demanding digital signal processing algorithms [13, 14].
In [9], the authors present a GPUbased implementation, which uses the filtering method for developing a doubly selective SISO channel simulator (frequency and time selective). In [10], a full 3D GPUbased beamtracing method is presented for propagation modeling in complex indoor environments. Likewise, the study in [11] presents an improved path loss simulation incorporating a threedimensional terrain model using parallel coprocessors or GPUs. In addition, in [12] and references therein, a selection of GPUbased implementations is presented, which improves the computational processing time and reports GPUbased implementations with significant speedups. However, even though many GPUbased implementations have been reported in the state of the art, a complete triply selective fading channel simulator has not been reported in the open literature. Examples of complete MIMO fading channel simulators and emulators can be found in [15–17], but these architectures do not exploit the triple selectivity at the same time, or they configure the hardware with a low number of antennas.
This paper presents a triply selective MIMO channel simulation methodology using GPU techniques; the methodology will be based on SISO channel generators presented in [9]. This simulator includes the phenomenology of the propagation environment (time, frequency and space selectivities), which has been left out of the stateofart simulators due to the computational complexity. Likewise, the introduced methodology exhibits enough flexibility for implementing channel simulators with different MIMO channel configurations.
1.1. Notation
Bold upper (lower) case letters are used for denoting matrices (vectors); , , , and denote transpose, complex transpose (Hermitian), the ceil function, and the expectation operator, respectively. denotes the element in the th row and th column of . is the reordering of the columns of into a single column vector. is a diagonal matrix whose elements are those from vector . denotes an identity matrix of length . Finally, stands for the Kronecker product of two matrices.
1.2. Organization
The paper is organized as follows: in Section 2, the mathematical model of the triply selective channel is analyzed. The methodology for implementing this mathematical model using GPUs is described in Section 3. Implementation results assuming diverse scenarios are presented in Section 4. Finally, some concluding remarks in Section 5 close this paper.
2. Triply Selective Channel Model
Assume a baseband discrete time MIMO communication channel that presents time, frequency, and spatial selectivity (tripleselective channel). Moreover, consider that this communication system is conformed of transmit antennas and receive antennas as depicted in Figure 1. This system can be envisaged as an array of SISO channels where a transmitter and a receiver correlation stage are included in order to provide the corresponding spatial correlation statistics. Without loss of generality, if it is stated that all the SISO channels are modeled as FIR filters of coefficients, then the MIMO system at time index can be expressed mathematically as follows:where with is a vector containing the samples received from each antenna, is the matrix that provides the spatial correlation due to the receive antennas, and , correlates the transmitted samples according to the correlation of the transmit antennas.
Additionally, , where is a vector composed of the actual and past samples of each transmitter, where . The matrix is defined as follows:where is the discrete timevarying channel impulse response from the transmitter to the receiver . The value is selected as , where is the maximum delay corresponding to the subchannel from transmitter to receiver , is the impulse response duration of the combined transmitter and receiver filters, and is the symbol period.
Thus, it is possible to consider the MIMO channel aswhose autocorrelation function can be stated as follows:
This channel model can be interpreted as a form of the Kronecker model reviewed in [7, 18], where time and frequency selectivity have also been considered. In order to simplify the calculations, autocorrelation of is analyzed instead, which does not affect the statistics of the channel under analysis. Thus, the autocorrelation of iswhere it has already made use of and the fact that ; in addition, is the autocorrelation function of . If it is assumed that all the subchannels in are(i)uncorrelated among themselves,(ii)widesense stationary uncorrelated scattering (WSSUS),(iii)Rayleigh,
then, can be expressed asFrom (6), is a zeromean circularly symmetric complex Gaussian random vector with an autocorrelation function:where and are both diagonal matrices, such that is the power of the th path of the subchannel from antenna to antenna with delay and autocorrelation function evaluated at the instant difference for , where is the number of paths of the subchannel from to .
Let be a block diagonal matrix, where the elements of each block are given bywhere is the impulse response of the band limiting filter, for example, sinc or raised cosine filter with duration , , and .
From (6), is given bywhere are both block diagonal matrices.
Function (11) is the autocorrelation function of the proposed model (3). It is possible to observe that this model offers great flexibility by allowing the subchannels to have different statistics, for example, different power delay profile (PDP) and power Doppler spectrum (PDS). As a special case, if all the paths of all subchannels are forced to have the same PDS, then , where is the autocorrelation function of all the paths. As a particular case, if Jakes’ model is considered, where the different scatters propagate in the azimuth plane and arrive at the receivers with an angle of arrival distributed uniformly between , then , where is the Bessel function of order 0 and is the maximum Doppler frequency. If it is also considered that all the subchannels have the same PDP composed of paths, then and , where is a diagonal matrix containing the gains of the paths and is calculated as in (8). However, assuming that the paths of all the subchannels have the same delay, then (11) transforms into (12), which coincides with the model proposed in [16]:
3. GPU Implementation
The simulator implementation takes advantage of the parallel capabilities of GeneralPurpose Computing on GPU (GPGPU) and the use of the VexCL library [19]. VexCL is an OpenCL/CUDA library developed in C++ by Denis Demidov. It supports multidevice, multiplatform computations and provides functions for floatingpoint vector/matrix operations. VexCL uses vector expressions, which are automatically processed in parallel across all devices.
The simulation process comprises two stages: channel coefficient generation and data frame processing. These stages are depicted in Figure 2. The first stage corresponds to the generation of the elements of matrix of (2); the second stage represents (1).
3.1. Complex Operations
GPU frameworks, like OpenCL and CUDA, do not provide complex data types; the closest data type is the 2vector type cl_double2 of OpenCL. Hence, custom vex functions were implemented to perform addition and multiplication of complex numbers. These functions are used in vector expressions. The codes implemented for performing the multiplication and addition of complex numbers are presented in Listings 1 and 2, respectively. The previous functions receive two 2vectors used as complex numbers.


3.2. Channel Coefficient Generation
The channel coefficients require random numbers, which are filtered for generating the channel taps with specific PDS and correlation.
3.2.1. Random Number Generation
VexCL provides templates to generate random numbers; they are based on Random123 [20, 21]. The templates support the Philox generator family (based on integer multiplication) or the Threefry family (based on Threefish encryption). It is possible to generate uniformly distributed random numbers with the Random template. Additionally, there are RandomNormal templates that use the BoxMuller transform to generate normally distributed random numbers. In this implementation, vex::RandomNormal<cl_double2, vex::random::threefry > random_numbers
are used as a C++ functor. A double random number vector is generated with noise=random_numbers(vex::element_index(), 123),
where noise is a vector of of pairs of double precision floatingpoint numbers, vex::element_index() is the function to get the th random number, and 123 is the seed for the generator. Instead of using doubles, the vector noise is composed of cl_double2 numbers; for example, a pair of double precision numbers is used as complex numbers.
3.2.2. Doppler Filter
The generated complex random numbers are passed through a filter which provides the corresponding time domain statistics. The transfer function of this filter is the square root of the autocorrelation function in the time domain for each path. As a particular case where all the paths in all the subchannels follow the Jakes model, the impulse response of this filter is , where is the gamma function, is the length of the filter, and is an index that enumerates the coefficients [22]. In order to perform the noise filtering, a custom function, named convolution, was coded; it receives pointers to arrays of 2vectors corresponding to the noise and Doppler filter coefficients, respectively. The code used for this convolution is presented in Listing 3, where i is the OpenCL’s element index, x is the noise, y is the filter, tF is the length of the filter, fS is the number of samples to be filtered, nS is the number of samples, and nP is the number of paths. The convolution function processes in parallel the complete noise vector. It executes the function body for each noise sample.

3.2.3. Path Gains
They are implemented as the product of a vector and a scalar. Each path has its own gain.
3.2.4. Upsampling
The upsampling is a quadratic interpolation to a desired number of values of the noise vector.
3.2.5. Tap Generation
The resulting noise vector, representing the paths, is correlated with a matrix (e.g., raised cosine). This correlation is implemented as a matrixmatrix product. A MIMO system has multiple channels; the correlation is done for each channel.
3.3. Data Frame Processing
This section describes the implementation of Figure 1 and (1): a transmitter and a receiver correlation stage and the discrete timevarying channel impulse response from the transmitter to the receiver .
3.3.1. Transmitter Correlation
The correlation on the transmitter side is implemented as a product of the data and correlation matrices ; the latter matrix contains the coefficients that correlate data frames and the transmit antennas.
3.3.2. FIR Filter
The transmitted data frames are filtered with the channel coefficients, , described in Section 3.2. In order to achieve this goal, a convolution is implemented. The code used for this convolution is presented in Listing 4, where the variable i is the OpenCL’s element index, the variable x is a pointer to the data to be processed, the variable y is a pointer to the generated channel coefficients, the variable tRc is the number of taps of the raised cosine, the variable nD is the data frame size, the variable uS is the size of the path, the variable nT is the number of transmitters, and the variable nR is the number of receivers.

3.3.3. Receiver Correlation
The correlation on the receiver side is implemented as a product of the received data and the correlation matrix ; the latter matrix contains the coefficients that correlate data frames and the receive antennas.
4. Results
In this section, the time performance of the proposed simulator is evaluated. In order to achieve this goal, a MIMO channel is considered, where all the subchannels have the same PDS with Jakes shape and Hz. If a carrier frequency of 5.9 GHz is considered, then the assumed corresponds to a scenario where the speed of the mobile is 360 km/h. Likewise, the PDP of each SISO subchannel is fixed as described in Table 1. The PDPs are selected in order to prove the functionality of the simulator, with the PDPs of and corresponding to the vehicular test environment channel A and channel B as defined in [23] respectively. The values of matrices and are fixed in such a way that they make and , respectively [24]. It is assumed that the transmitter and the receiver filters are both a square root raised cosine with a rolloff factor of 0.5 and the duration of its convolution is , where the period symbol is fixed at μs. According to the maximum delay values presented in Table 1, as well as the values of and , the parameter is equal to taps. Finally, the MIMO simulator generates channels assuming data frames composed of 1024 symbols.
Figure 3 shows a parallel realization of each MIMO subchannel using the proposed GPUbased simulator, where each of the subfigures presents the timevariation of the corresponding filter coefficients, which are associated with the assigned PDS. Moreover, it is possible to observe that each subchannel satisfies the specifications of the PDP assigned. Thus, PDPs with large delays correspond to filters with more coefficients, as observed in the channel realizations and , respectively.
The time performance evaluation is carried out using a personal computer (PC) with the following specifications:(i)Fedora 25, 64 bits(ii)Intel core i7920 (2.66 GHz)(iii)12 GB DDR3 RAM(iv)Graphics card Asus GTX Titan Black with 6 GB of RAM and 2880 CUDA cores.
In order to evaluate the time performance, 15 different scenarios of triply selective MIMO channels are considered: , , , , , , , , , , , , , , and . For all the cases, it is assumed that all the subchannels are configured with the same PDP (vehicular test environment channel B) and PDS (Jakes with Hz) statistics; moreover, the matrices and have identical values which make for and . The rest of the simulation parameters have the same values as described previously.
The simulations are carried out over 100 frames and the elapsed processing time per frame is obtained by averaging all the frames. Table 2 shows the time required to execute each of the simulator blocks.
Figure 4 presents the comparison of the time required to execute all the considered cases when the GPUbased simulator is used, as well as the sequential implementation of this simulator using C language. For a better appreciation, the results are plotted in logarithmic scale.
Table 3 summarizes the time consumption and also presents the fold time gain. This gain is calculated as the quotient of the time required by the sequential implementation divided by the GPU implementation. It is possible to observe that for the cases from to , as the complexity of simulator increments, the gain increments too. This is due to the fact that as more operations are required, better utilization of the parallel resources of the GPU is achieved. In the remaining cases, the gain is around 680; starting at the case, the simulator simultaneously uses the available resources in the device. As a matter of fact, with the GPU considered for testing the simulator, the case is the most complex MIMO channel that can be simulated; nevertheless, if the hardware is upgraded, then this inconvenience can be addressed, opening the possibility for the simulation of more complex scenarios, for example, massive MIMO.
4.1. Discussion
Recently, several channel simulator approaches have been introduced in the open literature. However, these implementations are based on field programmable gate array (FPGA) devices, and they are restricted to using a single configuration in the developed simulator; that is, the time and frequencyselective correlations are fixed. As a result, setting new channel statistics in the simulator (new channel propagation conditions) implies the redesign of the implemented hardware. For example, in the channel simulator introduced by [16], the time selectivity is generated by using a sum of sinusoids, which approximates a Jakes PDS. This simulator uses uniform random number generators for defining the configuration of the parameters of these sinusoids. Thus, if one wishes to generate a channel with a different PDS, then this will imply changing the probability density function of the random number generator block. In contrast, for changing the statistics of the simulator proposed in this paper, it is necessary only to upgrade the coefficients of the filters, which allows for a quick redesign of the simulator.
As regards performance, even though the reported FPGA implementations can operate in real time, they are constrained in terms of the number of operations that they can execute. For example, in [16] it is reported that the MIMO channels which can be simulated should be limited to total taps. Meanwhile, the tests performed on the proposed simulator with the selected GPU device reveal that the most complex channel that can be simulated contains 118656 total taps (). Moreover, the implementations presented in the open literature for achieving the reported performance consider that the paths of the PDP occur in multiples of the symbol period and all the subchannels share the same PDP. This contrasts with our GPUbased implementation, which can deal with paths with arbitrary positions and distinct PDPs for each subchannel. This is noteworthy because communication standards recommend PDPs with paths whose allocations are symbol period independent, while the flexibility of setting different PDPs for each subchannel makes it possible to simulate more complex scenarios. If the constrains assumed in the reported simulators were considered, then the proposed simulator could generate more total taps and consequently MIMO channels with larger numbers of antennas.
5. Conclusions
This paper presents a methodology for implementing a triply selective MIMO simulator based on GPUs. The proposed simulator considers the use of multiple SISO simulators, which are also implemented with GPUs, together with input and output correlation matrices in order to provide the corresponding space selectivity. This approach allows for great flexibility because each SISO subchannel can be set with different statistics and, therefore, more complex environments can be modeled. It is shown that under some considerations, the scheme fits with the more specific model proposed in the state of the art. Moreover, the implementations with GPUs considerably reduce the execution time compared with sequential implementations. Implementation results show a 688fold gain for MIMO when compared with sequential implementations. This suggests that this approach will enable more complex MIMO channels to be simulated with a greater number of antennas and a minimum penalty of time execution; in this way, communication systems can be tested in less time. Likewise, the considered model allows for the simulation of different scenarios, which are not forced to have the same statistics in each subchannel. Therefore, it becomes possible to simulate propagation environments that are more adequate for V2I and V2V communication systems.
Appendix
Implemented Codes
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.
Acknowledgments
This work was financed by the Mexican Council for Science and Technology (CONACYT) through the SEPCONACYT Basic Research Program (no. 241272).