#### Abstract

In this paper, a configurable superimposed training (ST)/data-dependent ST (DDST) transmitter and an architecture based on array processors (APs) for DDST channel estimation are presented. Both architectures, designed under a full-hardware paradigm, were described in Verilog HDL, targeted to a Xilinx Virtex-5 FPGA, and compared with existing approaches. The synthesis results show an FPGA slice consumption of 1% for the transmitter and 3% for the estimator, with operating frequencies of 160 and 115 MHz, respectively. The signal-to-quantization-noise ratio (SQNR) of the transmitter is about 82 dB while supporting 4/16/64-QAM modulation. A Monte Carlo simulation demonstrates that the mean square error (MSE) of the channel estimator implemented in hardware is practically the same as the one obtained with the floating-point golden model. The high performance and reduced hardware of the proposed architectures lead to the conclusion that the DDST concept can be applied in current communications standards.

#### 1. Introduction

Presently, there is a need to develop communications systems capable of transmitting and receiving various types of information (data, voice, video, etc.) at high speed. Nevertheless, designing these systems is an extremely difficult task, and, therefore, the system must be broken down into several stages, each with a specific task. The complexity of each stage is higher when the system operates in a wireless environment because of the additional challenges posed by the complex nature of the channel and its susceptibility to several types of interference.

Since the influence of the channel on the transmitted data cannot be avoided, one option is to characterize the channel parameters precisely enough that their effects can be reversed at the receiver. For that reason, the channel estimation stage is a key part of any reliable wireless system, because accurate channel estimation leads to a reduction of the bit error rate (BER). The channel estimator must deal with multiple phenomena, such as multipath propagation and Doppler frequency shift (due to the mobility of the users). To deal with these problems, current communication standards specify the transmission of pilot signals that are known at the receiver, which eases the estimation of the communication channel. The ways of transmitting such pilot signals can be classified into two major branches: pilot-assisted transmission (PAT), where pilot and data signals are multiplexed in time, frequency, code, space, or a combination of these domains, and implicit training (IT), a more recently proposed technique where the pilot signal is hidden in the transmitted data. PAT is the technique implemented in current standards, such as WiMAX, WiFi, and Bluetooth. Its advantage is that pilots and data lie in orthogonal subspaces, which allows them to be separated easily at the receiver; however, part of the bandwidth available for data must be given up in order to transmit the pilot signal. IT overcomes this problem because data and pilot signals are transmitted all the time; nevertheless, the two signals then occupy nonorthogonal subspaces. Despite this drawback, IT has been recognized as a feasible alternative for future communication standards [1].

The simplest way to carry out IT is to add (superimpose) the pilot signal to the data. This approach is known as superimposed training (ST), first proposed in [2] and enhanced by several authors, whose results are summarized in [3, Ch. 6]. A refinement of ST known as data-dependent superimposed training (DDST) was presented in [4–8]. This technique makes it possible to null the interference of the data during the estimation process by adding a new training sequence, which depends on the transmitted data, to the data and the ST sequence.

Because of the benefits that ST/DDST offer, efficient implementations of these algorithms are needed. Although these techniques have been widely studied, few practical implementations have been reported in the literature so far, and almost all of them are floating-point, software-based approximations. In [9], the algorithms are programmed in a digital signal processor (DSP) for a low-rate communication system, while in [10] the proposed implementation runs on an embedded microprocessor with hardware accelerators inside an FPGA. At ReConFig 2011, we presented a full-hardware architecture (with high throughput, low hardware consumption, and a high degree of reusability) for the channel estimation stage of an ST/DDST receiver [11]. Its novelty was that a single systolic array processor (AP) performed the entire estimation process instead of two separate signal processing modules. This paper is an extended version of that work: a hardware-efficient architecture for a configurable ST/DDST transmitter that supports 4/16/64-QAM constellations complements the results presented in [11], because all transmitted data in each Monte Carlo trial are now generated by the proposed transmitter hardware instead of the transmitter simulation model programmed in Matlab.

The rest of the paper is organized as follows. Section 2 presents the system model, the ST/DDST transmitter structure, the channel estimation algorithm, and the reformulation of the cyclic mean for systolic APs. Section 3 describes in detail the full-hardware architecture of the configurable ST/DDST transmitter. Section 4 proposes an architecture based on a systolic AP for the DDST channel estimator. Section 5 evaluates the performance of the proposed architectures. Conclusions are given in Section 6.

*Notation 1*. Lowercase (uppercase) bold letters denote column vectors (matrices). Operators , , and , denote the Hermitian, transpose, and inverse operations of matrix . represents a column vector of length with all its elements equal to one; similarly, represents an all-zeros column vector of length . is the identity matrix of size . denotes the th element of vector . denotes a vector conformed with the elements of as follows: . represents Kronecker product. Finally, represents the expectation operator.

#### 2. System Model

This section introduces the DDST algorithm mentioned previously. Consider a single-carrier, baseband communication system based on DDST such as the one presented in Figure 1. The transmitted signal is formed by the sum of the data sequence , the training sequence and the data-dependent training sequence . The index enumerates the samples of these signals, which are transmitted at a rate equal to . , is a periodic sequence with period equal to and power equal to [12]. It is assumed that the data sequence is a zero-mean, stationary stochastic process with power equal to , where the symbols of this process come from an equiprobable alphabet. The sequence is constructed as described in [5]. is propagated through the communication channel, whose impulse response is the convolution of the impulse responses of the system filters and of the propagation medium (all of them assumed to be time-invariant). Such a channel can be modeled as a finite impulse response (FIR) filter with at most time-invariant coefficients. Finally, the signal distorted by the channel is contaminated with the noise to form the received signal . is a zero-mean white Gaussian noise with variance equal to . The transmission of blocks of symbols, each preceded by a cyclic prefix of length , is assumed. Perfect block synchronization, which allows to fix it is also assumed. For ease of implementation, it is assumed that is a multiple of and is a power of two.

Thus, the received signal after removing the cyclic prefix can be expressed in a matrix form as follows: where is a circulant matrix whose first row is given by , where is a vector containing the coefficients of the channel impulse response (CIR). Similarly, , , , , and are vectors equal to , , , , and , respectively, with .

##### 2.1. Digital Transmitter with ST/DDST Included

Figure 2 depicts the discrete-time baseband block diagram of the (data-dependent) superimposed training transmitter. This is a modified version of the IT transmitter presented in [3]. Figure 2 shows that the key component of the transmitter is the sequence transformation block. It implicitly embeds the training sequences into the data sequence by the affine transformation expressed as where represents the complex baseband discrete-time transmitted signal, is a precoding matrix, and refers to the vector obtained by replicating times one period of the training signal of size , that is, where , and such sequence is given by [12]: with when is odd, if is even, and .

The precoding matrix makes it possible to select the training technique according to with where and are matrices of sizes and , respectively.

In the DDST case, the -length vector containing the data-dependent sequence (DDS) can be obtained from (2) and using (5)–(6b) as follows:

##### 2.2. Channel Estimation Using DDST

Owing to the periodicity of , will contain an embedded periodic signal with a period equal to . Taking advantage of this characteristic, an estimate of the cyclic mean of the received signal is used to estimate the channel. Such a cyclic mean estimator can be defined as: where is a column vector of length whose elements are the estimated coefficients of the cyclic mean of and given by According to [4], the estimate of the CIR is given by where is a vector containing the estimated CIR coefficients, is a matrix formed by the first rows of , and is a circulant matrix of size formed by vector .
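
As a rough illustration of the estimator described above, the following Python sketch runs a noiseless DDST block through a circular channel and recovers the taps from the cyclic mean. Several things here are assumptions made for the example and are not taken from the paper: the function names, the block sizes, the chirp used as training sequence, the sign convention of the data-dependent sequence, and the use of a DFT-domain division in place of a multiplication by a precomputed inverse circulant matrix (the two are mathematically equivalent for a circulant system).

```python
import cmath
import random

def chirp_training(P, sigma_c=1.0):
    """One period of a constant-modulus chirp (assumed training sequence)."""
    v = 1 if P % 2 else 2
    return [sigma_c * cmath.exp(1j * cmath.pi * k * (k + v) / P) for k in range(P)]

def cyclic_mean(x, P):
    """Average the len(x)/P consecutive length-P segments of x."""
    n_p = len(x) // P
    return [sum(x[i * P + k] for i in range(n_p)) / n_p for k in range(P)]

def circ_conv(h, s):
    """Circular convolution; models the channel once the cyclic prefix is removed."""
    n = len(s)
    return [sum(h[l] * s[(k - l) % n] for l in range(len(h))) for k in range(n)]

def dft(x, inverse=False):
    """Plain O(n^2) DFT, sufficient for the short vectors used here."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
           for k in range(n)]
    return [val / n for val in out] if inverse else out

def estimate_cir(x, c, M):
    """Solve the circulant system (training) * h = cyclic mean, keep the first M taps."""
    y = cyclic_mean(x, len(c))
    ratio = [yk / ck for yk, ck in zip(dft(y), dft(c))]
    return dft(ratio, inverse=True)[:M]

random.seed(0)
P, N, M = 8, 64, 4                      # hypothetical period, block length, channel length
b = [random.choice([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / 2 ** 0.5 for _ in range(N)]
c = chirp_training(P)
b_mean = cyclic_mean(b, P)
# DDST block: data + periodic training - replicated cyclic mean of the data (assumed sign)
s = [b[k] + c[k % P] - b_mean[k % P] for k in range(N)]
h = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(M)]
x = circ_conv(h, s)                     # noiseless received block
h_hat = estimate_cir(x, c, M)
print(max(abs(a - t) for a, t in zip(h_hat, h)))
```

Without noise, the printed error is at the level of floating-point round-off, because the data-dependent sequence removes the data contribution from the cyclic mean.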

##### 2.3. Cyclic Mean Algorithm Using Array Processors and Partitioning

The next analysis describes how the cyclic mean is obtained using a systolic array that computes a matrix-vector multiplication (MVM). Consider (8), where it is not possible to perform directly the MVM operation due to the Kronecker product involved. To avoid this cumbersome operator, the same equation can be reformulated as follows:
where is a matrix of size which is defined as follows:
An AP-based architecture for computing (11) would be impractical in terms of hardware consumption because it would need processor elements (PEs). This is known as the *problem-size-dependent array* problem, where the algorithm requires a systolic AP whose size depends on the size of the problem to be solved. However, it is possible to map the cyclic mean algorithm onto a smaller systolic AP using the partitioning method [13, Ch. 12]. Consider to be partitioned into blocks of size , chosen to match the systolic array size; then (12) becomes
where
In a similar way, is partitioned into unitary vectors . Substituting (13) into (11), the cyclic mean with *partitioning* is concisely expressed as

Therefore, the array of PEs will process one pair of and blocks after another in a sequential manner together with partial results.
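
The equivalence between the full matrix-vector form and the partitioned form can be checked numerically. The sketch below is only an illustration under assumptions: the selection matrix is taken to be a row of identity blocks (the Kronecker product of an all-ones row vector with an identity matrix), the block size is assumed equal to the training period, and all function names are hypothetical.

```python
def selection_matrix(n_p, P):
    """P x (n_p*P) matrix [I_P I_P ... I_P], i.e. ones(1, n_p) kron I_P."""
    return [[1 if col % P == row else 0 for col in range(n_p * P)] for row in range(P)]

def matvec(A, x):
    """Plain matrix-vector product."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def cyclic_mean_full(x, P):
    """Cyclic mean as one large MVM (needs an array as wide as the whole block)."""
    n_p = len(x) // P
    return [v / n_p for v in matvec(selection_matrix(n_p, P), x)]

def cyclic_mean_partitioned(x, P):
    """Same result computed block by block on a P x P array, accumulating partials."""
    n_p = len(x) // P
    J = selection_matrix(n_p, P)
    acc = [0] * P
    for i in range(n_p):
        J_i = [row[i * P:(i + 1) * P] for row in J]   # i-th P x P block of the matrix
        x_i = x[i * P:(i + 1) * P]                    # matching sub-vector of the input
        acc = [a + p for a, p in zip(acc, matvec(J_i, x_i))]
    return [a / n_p for a in acc]

x = list(range(12))
print(cyclic_mean_full(x, 4))          # [4.0, 5.0, 6.0, 7.0]
print(cyclic_mean_partitioned(x, 4))   # [4.0, 5.0, 6.0, 7.0]
```

Because every block in this assumed form is an identity matrix, each partial product is trivial, which is consistent with the multiplier bypass described for the PEs in Section 4.1.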

#### 3. A Configurable ST/DDST Transmitter Architecture

Based on the description in Section 2.1, the architecture shown in Figure 3 is proposed for the transmitter. It is composed of five hardware modules: the symbol adecuator, the mapper, the data sequence transformer, the Tx_control, and the Tx_AGU. The reconfigurability of the architecture makes it possible to switch between two operating modes, ST or DDST, in order to send data blocks with a cyclic prefix attached. In both modes, the transmitter hardware supports 4/16/64-QAM constellations.

The next subsections give additional details about the main transmitter modules.

##### 3.1. Symbol Adecuator

The design of this module is largely conditioned by the features of the mapper. A key aspect exploited in the mapper design is that the 4-QAM and 16-QAM constellations are contained in the Gray-coded 64-QAM one, as shown in Figure 4. The symbol adecuator is therefore necessary because the same point number is not mapped to the same complex output symbol in the three constellations. For example, while point number 2 of the 4-QAM constellation is mapped to symbol, 16-QAM will map this point number to and 64-QAM will map to .

##### 3.2. Mapper

As stated in Section 3.1, the mapper design requires only the 64-QAM constellation. In this work, a memory-efficient scheme is proposed to build that constellation: the eight possible per-axis values (1, 3, 5, 7, −1, −3, −5, and −7) are stored in the *constellation LUT*. Additionally, the mapper has to normalize the complex symbols based on two criteria: the constellation order and the power assigned to each of the sequences involved. Thus, a normalization constant *Norm_Mapp_Cte* that combines the two criteria is given by
where
with

The mapper architecture is depicted in Figure 5. The *constellation LUT* was implemented as a dual-port ROM with a depth of eight memory locations. The *normalization LUT*, in contrast, uses a ROM with a depth of 16 locations.
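
The idea of building a 64-QAM constellation from an eight-entry axis table can be sketched in Python as follows. This is only an illustration: the way the 6-bit point number is split into two 3-bit indices, the resulting bit labeling, and the normalization to a target average power are assumptions of this example, not the paper's *Norm_Mapp_Cte* expression or the labeling of Figure 4.

```python
import math

# Eight axis values in the order listed in the text; shared by both axes.
AXIS_LUT = [1, 3, 5, 7, -1, -3, -5, -7]

def map_64qam(point, power=1.0):
    """Map a 6-bit point number to a complex symbol with the given average power."""
    if not 0 <= point < 64:
        raise ValueError("point must be in 0..63")
    i_idx = (point >> 3) & 0b111      # assumed: upper 3 bits select the real part
    q_idx = point & 0b111             # assumed: lower 3 bits select the imaginary part
    norm = math.sqrt(power / 42.0)    # 42 is the mean energy of unnormalized 64-QAM
    return complex(AXIS_LUT[i_idx], AXIS_LUT[q_idx]) * norm

symbols = [map_64qam(p) for p in range(64)]
avg_power = sum(abs(s) ** 2 for s in symbols) / len(symbols)
print(round(avg_power, 6))            # 1.0
```

Reading the same small table twice per symbol is what a dual-port ROM allows in hardware, so the 64 points never need to be stored individually.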

##### 3.3. Data Sequence Transformer

The data sequence transformer is the most complex module of the transmitter. Its design was therefore broken down into three submodules, whose individual architectures are described in the following paragraphs.

###### 3.3.1. Training Sequence Generator

From (4), it can be seen that the parameters , , and , needed to generate the training sequence, are known in advance and remain constant while the transmitter operates. Hence, the values of the training sequence can be calculated off-line, quantized, and stored in an LUT. This LUT is read times in order to expand the training sequence length, as indicated in (3), so that it can be superimposed, element by element, on the data sequence by the complex adder.

###### 3.3.2. ST Cyclic Prefix Insertion Submodule

Several problems arise from the way in which the cyclic prefix is generated and from the position where it is attached to the ST sequence. (i) Since the cyclic prefix is formed by the last data of the ST sequence, it can only be generated once this sequence has been completely processed. (ii) Given that, among all the data to be transmitted, the first data correspond to the cyclic prefix, a memory buffer is needed to store the remaining data (ST sequence) and thus prevent data loss.

A dual-port RAM (*RAM_CP*) of depth was used in the design of the ST cyclic prefix insertion submodule. The *RAM_CP* has two independent address buses, one for data reading (*addr_rd_st_cp*) and one for data writing (*addr_wr_st_cp*). This feature makes it possible to read from and write to the *RAM_CP* simultaneously. The process for generating and attaching a cyclic prefix to the ST sequence can be summarized in the following steps (Figure 6). (I) When the th datum is stored in *RAM_CP*, the previous datum stored is addressed by *addr_rd_st* bus. (II) During clock cycles, the ST sequence is stored in and read from the *RAM_CP*. (III) The storing of the ST sequence in the *RAM_CP* is stopped. However, the data reading will continue for cycles.
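
At a purely functional level, the effect of the steps above is that the last samples of the buffered block are sent first, followed by the whole block. The following Python sketch shows only that input/output behavior; it does not model the concurrent read and write timing of *RAM_CP*, and the function names and sizes are hypothetical.

```python
def insert_cyclic_prefix(block, cp_len):
    """Return the frame: last cp_len samples of the block, then the block itself."""
    if not 0 < cp_len <= len(block):
        raise ValueError("cp_len must be between 1 and the block length")
    ram = list(block)                 # stands in for the buffered ST sequence
    return ram[-cp_len:] + ram

def remove_cyclic_prefix(frame, cp_len):
    """Receiver-side counterpart: drop the first cp_len samples."""
    return frame[cp_len:]

block = [10, 11, 12, 13, 14, 15, 16, 17]
frame = insert_cyclic_prefix(block, 3)
print(frame)   # [15, 16, 17, 10, 11, 12, 13, 14, 15, 16, 17]
```

The need for a buffer is visible here: the first output sample depends on samples near the end of the block, so nothing can be sent until those have been produced.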

###### 3.3.3. Data-Dependent Sequence Generator

The operation of this submodule is based on (6a)–(7), which requires two processing-intensive operations: an MVM and a Kronecker product. Moreover, as in the cyclic prefix insertion case, the DDS can only be generated once the data sequence has been completely processed. Consequently, the following adaptations are made to the original equations in order to ease their mapping to hardware. (I) The sequence is rearranged into a matrix of size , according to (II) The mean of each row of the matrix is obtained. (III) The mean results are replicated times in order to obtain the vector and data for DDST cyclic prefix purposes.
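
The three steps above can be sketched in floating point as follows. The matrix layout (one row per position within the training period), the negative sign applied to the replicated means, and the names used are assumptions of this example; the sign is chosen so that the cyclic mean of data plus DDS vanishes, which is the stated purpose of DDST.

```python
def data_dependent_sequence(b, P):
    """Rearrange b into P rows, average each row, replicate, and negate (assumed sign)."""
    n_p = len(b) // P
    if n_p * P != len(b):
        raise ValueError("block length must be a multiple of P")
    rows = [[b[i * P + k] for i in range(n_p)] for k in range(P)]   # step (I)
    row_means = [sum(r) / n_p for r in rows]                        # step (II)
    return [-row_means[k % P] for k in range(len(b))]               # step (III)

b = [1 + 1j, -1 + 1j, 1 - 1j, 1 + 1j, -1 - 1j, -1 + 1j, 1 + 1j, 1 + 1j]
e = data_dependent_sequence(b, 4)
# Cyclic sum of data plus DDS: every entry should be zero.
residual = [sum(b[i * 4 + k] + e[i * 4 + k] for i in range(2)) for k in range(4)]
print(residual)
```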

Figure 7 shows the hardware architecture of the DDS generator. Its design avoids rearranging the sequence by using the loop-back shift register *lb_delay_dds*. This register generates a symbol delay in order to align the data of each matrix row, so the rows can be added "on the fly" by the complex adder without stopping the input data stream. The sum results are stored in the *RAM_DDS*; afterwards, its entire contents are read times and each datum is divided by in the *shifter* block. Finally, the results are sent to the DDS generator outputs.
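
A behavioral sketch of this streaming accumulation, written in Python with integer arithmetic, is given below. It assumes that the delay line is one training period long, that the number of periods per block is a power of two (so the division reduces to an arithmetic right shift), and that real and imaginary parts are handled as separate integers; the names and sizes are hypothetical and the model is not cycle-accurate.

```python
from collections import deque

def dds_streaming(samples, P, log2_np):
    """Accumulate a stream of (re, im) integer samples through a length-P loop-back delay."""
    if len(samples) != P << log2_np:
        raise ValueError("expected P * 2**log2_np samples")
    delay = deque([(0, 0)] * P)                 # stands in for lb_delay_dds
    for re, im in samples:
        acc_re, acc_im = delay.popleft()        # partial sum for this position in the period
        delay.append((acc_re + re, acc_im + im))
    # Division by the number of periods as an arithmetic right shift (truncation).
    return [(s_re >> log2_np, s_im >> log2_np) for s_re, s_im in delay]

samples = [(n, -n) for n in range(16)]
means = dds_streaming(samples, P=4, log2_np=2)
print(means)   # [(6, -6), (7, -7), (8, -8), (9, -9)]
```

No rearrangement of the input is needed: each incoming sample meets the running sum of the samples that share its position within the period.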

#### 4. Systolic Channel Estimator Architecture

This section introduces an architecture for the DDST-based channel estimation process. Its design is based on the MVM operation, which is carried out in a systolic manner on an AP. The main idea of the design is to reuse the same systolic array for computing the cyclic mean of the received data. The proposed architecture, called in this paper the "systolic DDST channel estimator" (SYSDCE), is depicted in Figure 8(a). Four functional units can be identified: a modified systolic matrix-vector multiplier (MSYSMVM), a data input feeder (DATINF), an inverse C look-up table (ICLUT), and a control unit (CU). Broadly speaking, the SYSDCE operation can be divided into three phases: input sequence storage, cyclic mean computation, and CIR estimation.

As soon as the *start* signal is asserted, the data samples (vector and cyclic prefix, resp.) can be read from the input port . After excluding the samples corresponding to the cyclic prefix, the rest of the samples are rearranged and stored in the memory bank of DATINF. When this process is finished, the CU configures the MSYSMVM unit and during cycles it reads parallel data per cycle from DATINF and computes the cyclic mean . Once this phase is finished, the obtained vector together with the ICLUT data are fed to the MSYSMVM again to perform the product expressed in (10). Finally, after cycles, the *done* flag is asserted and the estimated channel coefficients are sent one by one to the bus *H_OUT*. It is worth mentioning that the SYSDCE can be configured to compute only the cyclic mean if the *mode* input control signal is set to zero. In this case, the *cm_flag* output is asserted to indicate that valid results are available on the *CM_OUT* bus. The channel estimator is then ready to process another data sequence. Each component of the SYSDCE architecture is explained in more detail in the following subsections.

##### 4.1. Modified Systolic Matrix-Vector Multiplier (MSYSMVM)

The fundamental operation performed by the SYSDCE is a matrix-vector multiplication, which is demanding in terms of processing time. The hardware design for this operation is the most critical part of the architecture. The obvious strategy for accelerating the MVM consists of computing as many operations as possible in parallel, at the cost of a large consumption of FPGA resources. Therefore, this paper proposes a modification of the systolic MVM presented in [14, Ch. 3] in order to obtain good performance with reasonable resource consumption. This modification makes it possible to compute the cyclic mean using the *partitioning* method with the same systolic array. Figure 8(b) shows the processor element (PE), which is the atomic digital signal processing module in the MSYSMVM. It processes three flows: the data flow from the ICLUT or DATINF, the input register values, and the data produced by the previous adjacent PE.

In the MSYSMVM design, the number of PEs needed (*AP* size) is , which matches the dimensions of matrix and vector , respectively. The projection vector (see details in [14]) was used with a schedule vector . The pipelining period for this design is equal to 1 and the computing time for the full MVM is clock periods.

To compute the cyclic mean using the MSYSMVM module, the original structure of the PE was modified with an additional multiplexer. As a result, the PE can perform all trivial multiplications by bypassing the data from the input of the complex multiplier directly to the complex adder.
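
The role of the bypass multiplexer can be illustrated with a small functional model in Python. This is a sketch under assumptions, not the paper's PE: it ignores the systolic data skew and register timing, treats each loop iteration as one cycle, and uses hypothetical names. In multiply mode a PE accumulates coefficient times data; in bypass mode it accumulates the data directly, which is all the cyclic mean needs.

```python
def pe_step(acc, coeff, data, bypass):
    """One multiply-accumulate step; the bypass flag skips the multiplier."""
    operand = data if bypass else coeff * data
    return acc + operand

def mvm(A, v):
    """Matrix-vector product: PE r accumulates row r, one column per step."""
    acc = [0] * len(A)
    for j, v_j in enumerate(v):
        acc = [pe_step(acc[r], A[r][j], v_j, bypass=False) for r in range(len(A))]
    return acc

def cyclic_sum(x, P):
    """Cyclic sum on the same PEs: PE k receives x[i*P + k] at step i, multiplier bypassed."""
    acc = [0] * P
    for i in range(len(x) // P):
        acc = [pe_step(acc[k], None, x[i * P + k], bypass=True) for k in range(P)]
    return acc

print(mvm([[1, 2], [3, 4]], [5, 6]))      # [17, 39]
print(cyclic_sum(list(range(8)), 4))      # [4, 6, 8, 10]
```

Both functions reuse `pe_step`, mirroring how one array serves the cyclic mean and the final product with the inverse matrix.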

##### 4.2. Data Input Feeder (DATINF)

As with almost any systolic array, the data fed to each PE of the MSYSMVM must be arranged in a defined order before processing. In the proposed approach, the DATINF module is responsible for this task. It is made up of an array of memories, each with a depth of , organized as a memory bank as shown in Figure 9. DATINF reads data from bus; it identifies and removes the first data corresponding to . Subsequently, this module rearranges this sequence (corresponding to ) in blocks of size in order to form . Therefore, the stored data can be viewed as a matrix, where each individual memory in the bank stores one column of each block and the blocks are stored consecutively one after another, as depicted in Figure 9.

Each datum of the input sequence has three associated addresses that define its location inside the memory bank: block number (*blk_num*), memory number (*mem_num*), and memory address (*mem_addr*). The DATINF must generate these addresses using the following expressions: where is the th element of and denotes the *floor* operator.

In order to minimize the hardware consumption, a "hard-wired" addressing approach was built for the memory bank. As shown in Figure 10, the bits of the DATINF *address bus* are split into three parts. The first most significant bits (MSB) are used for block selection, the next MSB are used to select a particular memory in the bank, and the remaining bits are used to individually address each of the locations in the selected memory.
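
The hard-wired scheme amounts to slicing one address word into three bit fields, so no arithmetic is needed to evaluate floor and modulo expressions when the sizes are powers of two. The Python sketch below illustrates the idea with hypothetical field widths; the actual widths and the exact address expressions of the paper are not reproduced here.

```python
def split_address(addr, blk_bits, mem_bits, loc_bits):
    """Split an address into (block, memory, location) fields, MSB first."""
    loc = addr & ((1 << loc_bits) - 1)
    mem = (addr >> loc_bits) & ((1 << mem_bits) - 1)
    blk = (addr >> (loc_bits + mem_bits)) & ((1 << blk_bits) - 1)
    return blk, mem, loc

# Hypothetical widths: 2 block bits, 3 memory-select bits, 4 location bits.
addr = 0b10_011_0101
blk, mem, loc = split_address(addr, blk_bits=2, mem_bits=3, loc_bits=4)
print(blk, mem, loc)   # 2 3 5

# The same fields obtained with floor division and modulo.
assert blk == addr // 2 ** 7
assert mem == (addr // 2 ** 4) % 2 ** 3
assert loc == addr % 2 ** 4
```

In hardware the slicing costs only wiring, which is the point of the "hard-wired" approach described above.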

##### 4.3. Inverse C Look-Up Table (ICLUT)

The values of the circulant matrix are constants that can be precomputed once off-line and stored in an LUT. Only the values of the first column are necessary because the remaining columns are shifted versions of the first one. Consequently, the number of ROM locations required for the LUT is just . With a traditional design, the LUT would be built as a multiport ROM of locations, but the synthesis tool would implement it as an array of single-port ROMs, which increases the number of memory locations to . Instead, a solution based on an array of registers operating as a circular buffer was designed. This is called the "inverse C look-up table" (ICLUT) and it saves memory locations. The first row values of are stored in the registers. Next, one rotation is applied at each clock tick to change the register outputs, as indicated in Figure 11.
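
The circular-buffer idea can be sketched in a few lines of Python: only one row of a circulant matrix is stored, and every other row is obtained by rotating the registers once per clock tick. The rotation direction and the example values are assumptions of this sketch; the direction depends on how the circulant matrix is defined.

```python
from collections import deque

def iclut_rows(first_row):
    """Yield the rows of a circulant matrix by rotating a register array once per tick."""
    regs = deque(first_row)
    for _ in range(len(first_row)):
        yield list(regs)
        regs.rotate(1)          # one circular shift to the right per clock tick

first = [1, 2, 3, 4]            # hypothetical stored row
rows = list(iclut_rows(first))
for r in rows:
    print(r)
```

With this direction, entry `j` of row `r` equals `first[(j - r) % len(first)]`, so the full matrix is reproduced from a number of registers equal to its dimension.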

#### 5. Results

In this section, the proposed architectures are evaluated. First, the hardware utilization and throughput of the ST/DDST transmitter implementation are presented. Then, its functional performance is analyzed in terms of the signal-to-quantization-noise ratio (SQNR). Next, the FPGA resource consumption and throughput of the SYSDCE implementation are reported. Finally, the functional results of the SYSDCE, in terms of the MSE of the estimated channel and the SQNR, are obtained by Monte Carlo simulations using the transmitter hardware in DDST mode.

##### 5.1. Implementation and Simulation of the Transmitter

The configurable ST/DDST transmitter architecture was implemented at RTL level using the Verilog hardware description language. It is able to transmit ST or DDST data blocks of length with . The power of the training sequence is set to with a period . The configurable transmitter was synthesized and targeted to a Xilinx Virtex-5 XC5VLX110T FPGA. Default settings and no "user constraints" were selected in the EDA tool Xilinx ISE v11. No IP cores or predesigned components were used. All signals are represented in signed fixed-point two's complement, and no rounding scheme was applied.

Table 1 summarizes the synthesis results for the proposed ST/DDST transmitter. The table shows an operating frequency of 160 MHz with a very small FPGA resource utilization, which indicates a good balance between area and frequency.

On the other hand, it is difficult to compare the proposed transmitter and channel estimator directly with those previously presented in [9, 10] because of the differences in technology, paradigms used, and testing conditions. In [9], a DDST communication system was implemented under a full-software philosophy on a TMS320C6713 DSP with a 300 MHz external clock. A hybrid software-hardware FPGA implementation of the DDST receiver is described in [10]. For both of these DDST implementations, a comparison against our transmitter was not possible: in the former because the transmitter was fully software based, and in the latter because only the DDST receiver was implemented.

The correct operation of the transmitter is shown in Figure 12. The first graph (Figure 12(a)) shows clearly that the transmitter hardware has embedded the training sequence into . It can be noted that the data sequence energy is spread over all frequency components. In contrast, the training sequence energy is concentrated only in equispaced frequency components. Similar behavior occurs in the DDST mode (Figure 12(b)), but now the pilot signals also have the same energy. This confirms that the transmitter architecture is properly superimposing and into .

The SQNR obtained over 100 Monte Carlo trials is monitored in order to quantify the difference between the sequence produced by the hardware transmitter and the one produced by the floating-point transmitter golden model. The histogram of Figure 13 summarizes the results of this test. Most of the occurrences are concentrated at 84 dB.
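
For reference, an SQNR of this kind can be computed as the ratio between the energy of the floating-point sequence and the energy of its difference from the fixed-point sequence. The Python sketch below illustrates the metric on synthetic data; the word lengths, the test signal, and the use of truncation toward minus infinity (as a two's complement datapath without rounding would produce) are assumptions of the example and do not reproduce the paper's test.

```python
import math
import random

def truncate(value, frac_bits):
    """Quantize to frac_bits fractional bits by truncation (floor), no rounding."""
    scale = 1 << frac_bits
    return math.floor(value * scale) / scale

def quantize(z, frac_bits):
    """Quantize real and imaginary parts independently."""
    return complex(truncate(z.real, frac_bits), truncate(z.imag, frac_bits))

def sqnr_db(reference, quantized):
    """10*log10 of signal energy over quantization-error energy."""
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - q) ** 2 for r, q in zip(reference, quantized))
    return 10 * math.log10(signal / noise)

random.seed(1)
ref = [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(2000)]
sqnr_8 = sqnr_db(ref, [quantize(z, 8) for z in ref])
sqnr_16 = sqnr_db(ref, [quantize(z, 16) for z in ref])
print(round(sqnr_8, 1), round(sqnr_16, 1))
```

Each additional fractional bit improves the SQNR by roughly 6 dB, so the gap between the two printed values is close to 48 dB.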

##### 5.2. Implementation and Simulation of the Channel Estimator

The SYSDCE architecture was implemented using the same considerations and design parameters as the transmitter. The systolic channel estimator was also synthesized and targeted to the same FPGA.

Table 2 summarizes the synthesis results for the proposed estimator. The values in parentheses for each feature indicate the total amount of the corresponding resource available in the FPGA. The results in Table 2 reveal an operating frequency of 115.247 MHz with a minimal consumption (except for DSP48Es) with respect to the total resources of the FPGA.

Again, it was not possible to compare the SYSDCE against the existing approaches. In the channel estimation module of [10], only the arithmetic mean was accelerated by a dedicated coprocessor. In that work, the input sequence length was assumed (although not explicitly mentioned) to be symbols. The MVM operation described in (9) was implemented in software. Also, no results are presented in terms of the mean square error (MSE) of the estimated channel or the SQNR.

Another important parameter of the proposed estimator is the number of cycles required to perform the estimation tasks. In particular, the cyclic mean requires The first term in (21) corresponds to the input storage phase and the second to the MVM operations involved in the cyclic mean task. Furthermore, the number of cycles required for the CIR estimator is Consider the set of metrics listed in Table 3 to compare the performance of the SYSDCE system. The processing time (PT) is the time elapsed from the beginning of the cyclic mean or channel estimation process until its computation has finished. The throughput (TP) per area is another useful metric; a higher value of this ratio indicates a more efficient implementation. As can be seen from Table 3, the proposed architecture provides better performance than the arithmetic mean coprocessor used in [10].

The validity of the proposed architectures is verified by comparing their results with the floating-point simulation golden model programmed in Matlab, in terms of channel estimation error versus signal-to-noise ratio (SNR). To this end, the following scenario (similar to that used in [6]) was considered. The hardware transmitter was configured in DDST mode in order to send data blocks of symbols obtained from a 4-QAM constellation. The channel is randomly generated at each Monte Carlo trial and it is assumed to be Rayleigh with length . The power of the training sequence is set to with a period equal to .

Figure 14 shows the MSE of the estimated channel, averaged over 300 Monte Carlo simulations for each value of SNR. Note that the MSE of the hardware estimator is very close to the theoretical line [4] and almost indistinguishable from that of the golden model. On the other hand, Figure 15 presents the probability density function (PDF) of the SQNR of the SYSDCE hardware, obtained for the same Monte Carlo trials. This PDF shows that the fixed-point performance is about 68 dB on average in terms of SQNR.

#### 6. Conclusions

In this paper, digital architectures for the transmitter and channel estimation stages of ST/DDST communications systems have been presented. To the best of our knowledge, these architectures represent the first implementations under the full-hardware philosophy for wireless systems based on ST/DDST. Both architectures present high throughput and reduced FPGA resource consumption, achieving a good trade-off between performance and area utilization. The proposed transmitter architecture is configurable enough to generate two types of training using three constellation orders. The SYSDCE hardware offers considerable flexibility and reusability because the same systolic array is used for two different tasks (operations): cyclic mean and channel estimation. Also, the SYSDCE design can be easily modified (by means of the partitioning strategy) to process channels of different lengths. The validity and performance of these approaches have been verified by Monte Carlo simulations, where average SQNRs of 82 dB and 68 dB are achieved for the transmitter and the SYSDCE, respectively. At the same time, both architectures show insignificant differences in performance when compared with their respective floating-point golden models. The provided results show that ST/DDST concepts can be effectively utilized in current and future wireless communications standards.

#### Acknowledgments

This work was supported by PROMEP ITSON-92, CONACYT-181962, and Mixbaal 158899 Research Grants.