Future 4th Generation (4G) wireless multiuser communication systems will have to provide advanced multimedia services to an increasing number of users, making good use of the scarce spectrum resources. Thus, 4G system design should pursue both higher transmission bit rates and higher spectral efficiencies. To achieve this goal, multiple antenna systems are expected to play a crucial role. In this contribution we address the implementation in FPGAs of a multiple-input multiple-output (MIMO) decoder embedded in a prototype of a 4G mobile receiver. This MIMO decoder is part of a multicarrier code-division multiple-access (MC-CDMA) radio system, equipped with multiple antennas at both ends of the link, that is able to handle up to 32 users and provides raw transmission bit rates up to 125 Mbps. The task of the MIMO decoder is to appropriately combine the signals simultaneously received on all antennas to construct an improved signal, free of interference, from which to estimate the transmitted symbols. A comprehensive explanation of the complete design process is provided, including architectural decisions, floating-point to fixed-point translation, and a description of the validation procedure. We also report implementation results using FPGA devices of the Xilinx Virtex-4 family.
1. Introduction
The aim of the 4MORE Project (4G
MC-CDMA Multiple Antenna System-on-Chip for Radio Enhancements) is to
complement worldwide research efforts on MIMO systems, MC-CDMA, and other
advanced signal processing techniques that will provide the high data rates
and spectral efficiencies expected from 4G wireless multiuser communication
systems. In order to investigate the real performance and feasibility of
implementation of these technologies, a complete hardware demonstrator of a
broadband mobile terminal (MT) has been designed and is being constructed
within the 4MORE project [1]. The demonstrator will focus on an MT with two
antennas, but a base station (BS) emulator with four antennas will also be
built, since it is required for validation of the MT.
Multi-carrier CDMA, based on the serial combination of
direct sequence CDMA and OFDM, has been considered for the physical layer in
the downlink because it derives benefits from both technologies: OFDM, with
appropriate carrier spacing and guard interval, provides robustness against
multipath, avoiding intersymbol interference; whereas the use of CDMA with
orthogonal spreading codes provides frequency diversity and multiple-user
flexibility [2].
The use of multiple antennas
is another enabling technology for 4G systems, which helps to exploit spatial
diversity, to increase capacity and to mitigate the effects of fading. In our
system the space-time block code for two transmit antennas designed by Alamouti
[3] is employed. This option has been favoured over other MIMO technologies,
such as beam-forming or layered space-time coding (BLAST) because it provides
the maximum attainable diversity order for the number of antennas employed
using a simple decoding algorithm.
To achieve good bit error rate (BER) performance,
state-of-the-art channel coding techniques, including duo-binary turbo codes [4]
for the uplink, and convolutional and low density parity check codes [5] for
the downlink, are employed in the 4MORE demonstrator.
The joint use of all these sophisticated technologies
greatly increases the complexity of the transceiver. To deal with the constraints
of VLSI design, the demonstrator includes ASICs as well as FPGAs. From the
outset of the project it was clear that the demonstrator would make use of some
well-established algorithms that could be implemented on ASICs, but the
flexibility provided by FPGAs was required to accommodate the more
innovative algorithms to be investigated, bearing in mind that design and
implementation tasks would partially overlap in time.
The rest of the paper describes the design and implementation in FPGAs of the hardware module that performs MIMO decoding in the MT, and is organized as
follows. In Section 2 a brief overview of the complete downlink system is
given, where focus is on the receiver. The basis of the Alamouti MIMO decoding
scheme is reviewed in Section 3. Sections 4 and 5, respectively, describe the
architecture of the MIMO decoder and detail its fixed-point translation. We
discuss implementation details and results in Section 6, before we finally draw
our conclusions.
2. Overview of the Downlink System
2.1. Transmitting Base Station
A simplified diagram of the transmitting BS is shown
in Figure 1. Data bits to be transmitted to each active user are independently
channel encoded and mapped onto symbols of the appropriate constellation (QPSK,
16-QAM or 64-QAM).
Figure 1: Simplified
diagram of the BS transmitter.
Each modulated symbol is multiplied by the spreading
code of the corresponding user, and the spread symbols of the active users
are added together to be simultaneously transmitted over the same set of subcarriers,
which constitutes an MC-CDMA symbol. In our system, the spreading factor in
frequency is , and the number of users must be in the range of .
An OFDM symbol consists of contiguous
MC-CDMA symbols, so that information is simultaneously transmitted over subcarriers.
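The spreading and user-multiplexing step described above can be illustrated with a minimal Python sketch. It assumes Walsh-Hadamard codes as the orthogonal spreading codes, a hypothetical spreading factor of 4 chosen only to keep the example small, and an ideal channel; the function names are illustrative and not taken from the demonstrator.

```python
def walsh_hadamard(n):
    """Return the n x n Walsh-Hadamard matrix (n a power of two) as rows of +1/-1."""
    rows = [[1]]
    while len(rows) < n:
        rows = [r + r for r in rows] + [r + [-x for x in r] for r in rows]
    return rows

def spread(symbols, codes):
    """Add the spread symbols of all active users: one chip per subcarrier."""
    length = len(codes[0])
    return [sum(symbols[u] * codes[u][k] for u in range(len(symbols)))
            for k in range(length)]

def despread(chips, code):
    """Recover one user's symbol by correlation with that user's code."""
    return sum(c * x for c, x in zip(code, chips)) / len(code)

codes = walsh_hadamard(4)                    # hypothetical spreading factor
user_symbols = [1 + 1j, -1 + 1j, 1 - 1j]     # three active users, QPSK-like
mc_cdma_symbol = spread(user_symbols, codes)
recovered = [despread(mc_cdma_symbol, codes[u]) for u in range(len(user_symbols))]
```

Because the codes are mutually orthogonal, correlation with one user's code removes the contributions of all other users when the channel is flat; on a frequency-selective channel this holds only approximately, which is why equalization precedes despreading in the receiver.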
Data is prepared for multiantenna transmission by the
MIMO encoding module. According to the Alamouti scheme [3], a pair of OFDM
symbols , , also known as a space-time block, is transmitted
employing two antennas over two consecutive symbol periods. During the first
symbol period, is transmitted
from the first antenna, and simultaneously is transmitted
from the second one. During the next symbol interval, the first antenna outputs , while the second one transmits , with standing for
complex conjugate and for the symbol
epoch. Small bold letters denote vectors with elements,
corresponding to the number of data subcarriers in an OFDM symbol.
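As an illustration of the space-time mapping, the following Python sketch encodes one space-time block. The sign convention used here (negated conjugate on the first antenna during the second period) is the usual textbook form of the Alamouti code and is an assumption; the function name and data layout are hypothetical.

```python
def alamouti_encode(s0, s1):
    """Map a pair of OFDM symbols onto two antennas and two symbol periods.

    s0, s1: lists with one complex value per data subcarrier.
    Returns [(ant1, ant2) for period 0, (ant1, ant2) for period 1].
    """
    first = (list(s0), list(s1))
    # Assumed convention: antenna 1 sends -conj(s1), antenna 2 sends conj(s0).
    second = ([-x.conjugate() for x in s1], [x.conjugate() for x in s0])
    return [first, second]

block = alamouti_encode([1 + 1j, 1 - 1j], [-1 + 1j, -1 - 1j])
```

With this mapping, the two sequences sent by the two antennas over one block are orthogonal on every subcarrier, which is the property that the linear combining at the receiver relies on.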
Before OFDM modulation, the framing module interleaves
pilot symbols in the data stream, in order to aid channel estimation at the
receiver. One IFFT operation per transmit antenna is required for OFDM
modulation, to convert data to the time domain. The IFFT size is , and the sampling rate is .
Each stream of complex OFDM symbols is finally
IQ-modulated, power amplified by independent RF front-ends, and radiated in the 5-GHz band.
2.2. Receiving Mobile Terminal
A simplified diagram of the MT receiver is depicted in
Figure 2. Analog signals received by the two antennas of the MT are
downconverted to baseband by twin zero-IF RF front-ends, and then sampled at . After automatic gain control (AGC) and correction of
RF impairments caused by the zero-IF architecture of the front-ends, time and
frequency synchronization must be performed in order to minimize misalignments
with the transmitting BS.
Figure 2: Simplified
diagram of the MT receiver.
One FFT operation per antenna branch is required to
recover the symbols in the frequency domain (OFDM demodulation).
Next, pilots are split from information symbols by the
de-framing module. By interpolation of pilot symbols in time and frequency, the
MIMO channel estimator provides the MIMO decoder with channel state information
(CSI), which is combined with two contiguously received OFDM symbols to build
the improved signal from which to estimate the modulated symbols.
However, the output stream of the MIMO decoder further
requires module equalization [6] and despreading (separation of users by
correlation with their spreading codes) before detection of the desired user
can take place. The output of the soft demapper is finally sent to the channel
decoder to make decisions about the transmitted information bits.
3. MIMO Decoding Principle
The fact that during each symbol period both antennas
simultaneously transmit different information implies that a linear combination
of symbols, affected by the channel frequency response of the different paths,
will be received at each antenna of the MT. Due to the intelligent way in which
spatial diversity is introduced, a simple linear processing of the signals
received by the two antennas during a space-time block eliminates the co-antenna interference (CAI) artificially created by MIMO transmission.
For each space-time block, the MIMO decoder must perform the following linear combination:
where is the
estimated frequency response of the channel between transmit antenna and receive
antenna at the th subcarrier
() during the th OFDM symbol
period, is the signal
obtained after OFDM demodulation at antenna branch , and is the combined
output signal. Assuming ideal channel estimation, and a constant channel response
during one space-time block, it can be shown that this combining scheme
provides full diversity order and cancels CAI [3], leading to this simple model
for the combined signal:
where is the th element of
vector , and is a Gaussian
noise term. Equation (2) is valid for all , but the equivalent channel has slightly
different expressions for even and odd :
According to (2), information could be now
recovered from by zero-forcing
equalization (dividing by the real factor ) or by MMSE
equalization [6].
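The combining principle can be checked numerically with a short Python sketch for a single subcarrier. It assumes the textbook Alamouti sign convention, perfect channel knowledge, a channel that is constant over the block, and no noise; the names `alamouti_combine`, `h`, `r_even`, and `r_odd` are illustrative only.

```python
import random

def alamouti_combine(r_even, r_odd, h):
    """Combine one subcarrier of one space-time block.

    r_even[j], r_odd[j]: samples received on antenna j in the two periods.
    h[i][j]: channel estimate from transmit antenna i to receive antenna j.
    Returns (z0, z1, gain), where gain is the real equivalent channel.
    """
    n_rx = len(r_even)
    z0 = sum(h[0][j].conjugate() * r_even[j] + h[1][j] * r_odd[j].conjugate()
             for j in range(n_rx))
    z1 = sum(h[1][j].conjugate() * r_even[j] - h[0][j] * r_odd[j].conjugate()
             for j in range(n_rx))
    gain = sum(abs(h[i][j]) ** 2 for i in range(2) for j in range(n_rx))
    return z0, z1, gain

rng = random.Random(0)
h = [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2)] for _ in range(2)]
s0, s1 = 1 + 1j, -1 + 1j
# Noiseless received samples under the assumed transmit convention.
r_even = [h[0][j] * s0 + h[1][j] * s1 for j in range(2)]
r_odd = [-h[0][j] * s1.conjugate() + h[1][j] * s0.conjugate() for j in range(2)]
z0, z1, gain = alamouti_combine(r_even, r_odd, h)
s0_hat, s1_hat = z0 / gain, z1 / gain    # zero-forcing: divide by the real gain
```

In this idealized setting the cross terms cancel exactly, so each combined output equals the transmitted symbol scaled by the sum of the squared magnitudes of all four channel paths.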
4. Architecture of the MIMO Decoder
The MIMO decoder must implement (1) to obtain the MIMO-combined signal , and (3) to obtain the equivalent channel , required by the equalizer.
The memory of the Alamouti scheme is one OFDM symbol.
Throughout the paper we have used the pair to refer to the
OFDM symbol and subcarrier indices. After OFDM demodulation, information
received on all subcarriers is converted from parallel to serial, so we recover
a single (complex) stream per antenna branch, that is, the pair of indices
is equivalent to a single-time index . Hence, a straightforward implementation of the
decoder would require the storage of a whole OFDM symbol for every input and
output signal (real and imaginary parts of the received signal on each antenna,
those of the estimates for the MIMO channel,
those of the combined output signal, and the equivalent channel), making a
total of samples.
However, if all complex signals in (1) are split in their real and imaginary parts
(superscripts and ), after some
algebra and careful grouping of terms, we arrive at expressions that
suggest a much more efficient implementation. For example, for the real part of we get:
where we have
defined:
Equation (5) is valid for all , and corresponds to memoryless arithmetic operators that
will run continuously, while all memory effects have been included in (4). The
architecture inferred from these equations is shown in Figure 3, where all
signals are real. All arithmetic resources are arranged so that they are fully
utilized (100%), including the programmable adder/subtractor A3 at
the output of the module. The whole structure works as a pipeline running at
clock speed and, although not explicitly shown in Figure 3, adders and
multipliers have registered outputs. The even/odd signal indicates whether the current OFDM symbol is
even or odd, and is used to control the multiplexer and to change between
addition and subtraction in the programmable adder A3. Slotted rectangles are
used to represent multibit shift-registers, which do not need to be resettable.
We observe that memory requirements for evaluation of are samples, and that the total latency is equal to clock periods.
Figure 3: Architecture for the MIMO decoder (real part
). Signal ranges and
wordlengths displayed are for the fixed-point implementation option Q2 (see Section 5 and Table 2).
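The split into memoryless real arithmetic followed by a final add/subtract stage can be sketched in Python as follows. This is only a functional illustration under the textbook Alamouti sign convention, not a model of the pipeline or of its shift registers; the function names and the partial sums `a` and `b` are hypothetical.

```python
def real_part_terms(h, r):
    """Memoryless real products and sums for one time instant.

    h[i][j] and r[j] are complex, but only their real and imaginary parts
    are used. Returns (a, b): the partial sums formed with the estimates of
    transmit antenna 0 and transmit antenna 1, respectively.
    """
    n_rx = len(r)
    a = sum(h[0][j].real * r[j].real + h[0][j].imag * r[j].imag for j in range(n_rx))
    b = sum(h[1][j].real * r[j].real + h[1][j].imag * r[j].imag for j in range(n_rx))
    return a, b

def real_part_outputs(h, r_even, r_odd):
    """Real parts of both combined outputs of one space-time block."""
    a_even, b_even = real_part_terms(h, r_even)
    a_odd, b_odd = real_part_terms(h, r_odd)
    # Final stage: one addition and one subtraction, as a programmable
    # adder/subtractor would perform on alternate symbols.
    return a_even + b_odd, b_even - a_odd
```

Under these assumptions, `real_part_terms` needs eight real multiplications per time instant with two receive antennas, four first-level sums, and two second-level sums. This is consistent with the 24 multipliers and 21 adders reported in Section 6 for three submodules of similar structure.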
We do not show the full details of the architectures
used to evaluate and because they
are very similar to that shown in Figure 3, just placing the appropriate
signals at the inputs. For evaluation of , the major difference is that first-level adders A1
are replaced by subtractors, while for , the programmable adder/subtractor A3 is replaced by
a simple unsigned adder, the rest of the adders being unsigned as well. Thus,
the MIMO decoder comprises three submodules very much like the one shown in
Figure 3, and we therefore reduce the total memory
requirements of the complete module to samples.
This architecture can be easily and efficiently adapted to a different number of antennas at the receiver. To this end, the arithmetic
blocks surrounded by dotted lines in Figure 3 should be replicated, both in the upper and lower branches of the architecture, and the two-input adders A2 should be replaced by cascaded adders to handle more than two inputs. While deploying more
than two antennas at the MT is impractical, this architecture could also be
used for MIMO decoding in the uplink, where a BS with four or more receive
antennas is feasible.
5. Fixed-Point Translation
The fixed-point translation of the architectural
design described in the previous section was accomplished following three
steps.
(a) Determine the range of each input, output, and intermediate signal involved in the MIMO decoder.
(b) Obtain the number of bits (precision) required for each signal.
(c) Test the robustness of the design by performing BER simulations.
Following this
process, similar to that described in [7], we seek to obtain a low-cost, performance-effective
implementation for the hardware module.
5.1. Estimation of Signal Ranges
This task was accomplished with the help of the
SystemC-based floating-point software simulator that has been developed within
the 4MORE Project, which accurately models the behaviour of all the modules in
the demonstrator and includes a realistic MIMO channel model. It is possible
with this simulator to obtain traces of the signals at any point in the
communication link.
We show in Table 1 the most important parameters of
the different working modes that have been implemented in the demonstrator.
While the range for the channel estimates is independent
of the mode, the range for the received signals depends on the
modulation type and on the number of users. The widest signal range will be
attained when 64-QAM modulation is combined with the maximum number of users.
By careful examination of histograms of large records of data obtained running
the SystemC simulator with these parameters, we found that the range for the
real and imaginary parts of the received signals lay with high
probability in the interval while for the
channel estimates the range was
found to be . The histograms observed for all signals were almost
Gaussian in shape. To be on the safe side we decided to include an extra
margin, and considered the ranges for and to be for the design.
By doing so we try to take outliers into account, and some of the variability
of the channel which might not have been captured in our data records. Bear in
mind that the channel variability greatly affects the amplitude of the received
signals, and that the MIMO channel model is quite complex, its behaviour being
influenced by many physical and statistical parameters.
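The histogram-based range estimation can be approximated by a simple percentile rule, sketched below in Python. The coverage threshold, the 50% margin, and the Gaussian test trace are hypothetical choices made for illustration; they are not the values used in the design.

```python
import math
import random

def covering_range(samples, coverage=0.999):
    """Smallest symmetric bound A such that at least `coverage` of |x| <= A."""
    magnitudes = sorted(abs(x) for x in samples)
    index = min(len(magnitudes) - 1, math.ceil(coverage * len(magnitudes)) - 1)
    return magnitudes[index]

def integer_bits(bound):
    """Integer bits (sign included) of a two's-complement format covering [-bound, bound)."""
    return max(1, math.ceil(math.log2(bound)) + 1)

rng = random.Random(0)
trace = [rng.gauss(0.0, 0.3) for _ in range(10000)]   # stand-in for a simulator trace
bound = 1.5 * covering_range(trace)                   # hypothetical 50% safety margin
bits = integer_bits(bound)
```

The returned number of integer bits fixes the position of the binary point; the fractional bits are then chosen separately from the precision requirements.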
Table 1: Parameters of the modes implemented in the
demonstrator.
Once the ranges for input signals were known, those of
intermediate and output signals could be obtained taking into account the
theoretical margins that result when operating with inputs whose range is
already known. Nevertheless, this would lead to an overdimensioned module, due
to the existence of hidden correlations between the inputs. After all, each of
the received signals is a linear
combination of the data multiplied by
the channel paths . Therefore, we resorted to histogram observation to
determine those ranges. The results are all shown in parentheses in Figure 3
and also in Table 2.
Table 2: Fixed-point quantization rules.
5.2. Word-Length Optimization
To ease this task we developed a simple software model
of the MIMO decoder, identical to the module included in the floating-point
SystemC simulator of the whole chain, but much faster and more practical, since all
unnecessary burdens were removed. This new software model can be quickly
modified to include fixed-point conversion effects in any of its parts.
As a performance metric we used the
signal-to-quantization noise ratio (SQNR) at the outputs of the MIMO decoder,
measured by comparing the outputs of the floating-point version of the
module with those obtained after including quantization effects in some signal,
or in all of them. By doing so we seek to keep the power of quantization noise
much lower than that of additive white Gaussian noise (AWGN), hence
guaranteeing a negligible effect of the former on performance.
Fixed-point conversion effects were introduced one
signal at a time, and simulations were run in parallel with both versions of
the MIMO decoder. The number of bits assigned to the fractional part of the
signal under study was then adjusted and simulations repeated until a target
value for the SQNR was reached.
Next, fixed-point effects were removed from that
point, and we proceeded to optimize the word-length of another signal in the
module.
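The one-signal-at-a-time search can be sketched in Python as follows. The sketch assumes rounding to nearest without saturation, and uses a hypothetical single-multiplier `toy_model` as a stand-in for the decoder; the 40 dB target is an arbitrary example value.

```python
import math
import random

def quantize(x, frac_bits):
    """Round x to a grid of step 2**-frac_bits (no saturation modelled)."""
    scale = 2 ** frac_bits
    return round(x * scale) / scale

def sqnr_db(reference, quantized):
    """Signal-to-quantization-noise ratio in dB between two sequences."""
    signal = sum(v * v for v in reference)
    noise = sum((v - q) ** 2 for v, q in zip(reference, quantized))
    return math.inf if noise == 0 else 10 * math.log10(signal / noise)

def min_frac_bits(model, inputs, target_db, max_bits=32):
    """Smallest number of fractional bits at one probe point meeting the target."""
    reference = [model(x, None) for x in inputs]      # floating-point run
    for bits in range(max_bits + 1):
        if sqnr_db(reference, [model(x, bits) for x in inputs]) >= target_db:
            return bits
    raise ValueError("target SQNR not reached")

def toy_model(x, frac_bits):
    """Hypothetical stand-in for the decoder: one product, optionally quantized."""
    product = 0.75 * x
    return product if frac_bits is None else quantize(product, frac_bits)

rng = random.Random(1)
inputs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
bits = min_frac_bits(toy_model, inputs, target_db=40.0)
```

Each additional fractional bit improves the SQNR by roughly 6 dB, so a linear search over the number of bits terminates quickly in practice.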
Nevertheless, for those signals that share the same
statistics, quantization effects were simultaneously analysed. For instance,
optimization of the number of bits at the output of all multipliers M1 in
Figure 3 was done simultaneously, running simulations with all multipliers
substituted by their fixed-point counterparts, all of them with the same number
of bits. For the same reason, all first-level adders A1 were simultaneously
optimized, as well as all second-level adders A2.
Following this procedure we obtained three sets of
quantization rules, to which we will refer as Q1, Q2, and Q3 from now on, each
of them established aiming at a different goal. The final parameters for these
quantization rules are shown in Table 2 (and for Q2, they are also embedded in
Figure 3). The number of bits displayed for all signals includes integer plus
fractional part.
Quantization rule Q1 was deliberately overdimensioned to
ensure that it would work with every mode of the demonstrator. Quantization
rule Q2, slightly less resource-consuming than Q1, was tried for 64-QAM, but
final results were not good enough. As will be shown in the next section, the
64-QAM constellation is very sensitive to even small noise increments. Finally,
Q3 was designed to work only with QPSK modulation, using the minimum number
of resources.
Signal traces to run the tests were obtained from the
complete SystemC simulator, always setting , since in this case the range of the inputs is the
smallest and therefore the required precision is
the highest. We used 64-QAM signals for Q1 and Q2, and QPSK for
Q3. The target value for SQNR was set to be greater than when designing
Q1, with Q2, and with Q3.
As will be shown later (see Figure 4), the
demonstrator may require values of the signal-to-noise ratio (SNR) per
information bit () at the input
of the receiver as high as to obtain a low
BER, the limiting case being that of 64-QAM modulation with 32 users. This is
tantamount to a value of the per-carrier signal-to-noise ratio () of
approximately , since and are related [6] by the following equation:
Measurements
with signal traces obtained running the simulator in this limiting case
resulted in the higher value at the output of
the MIMO decoder, the increase being due to the combining process.
Figure 4: BER degradation comparing the floating-point version of the
MIMO decoder (solid lines with marker “o”) and its fixed-point counterpart
implementation Q1 (dashed lines with marker “x”).
At the end of the word-length optimization process we
ran a final simulation to compare the floating-point version with the optimized
fixed-point one, including all quantization effects simultaneously. The measured
SQNR value was about for Q1, safely
bigger than , and output fell only from to when including
quantization effects. For Q2, the final SQNR was about , while fell to . For Q3, losses in were
negligible.
5.3. Validation in Terms of BER Performance
As a final step, the SystemC simulator was used to
validate in terms of BER performance the final decisions concerning signal
ranges and word-length optimization. For this purpose a complete fixed-point
software model of the MIMO decoder was developed, which is bit-accurate with
the VHDL source code to be implemented in the FPGAs. By substitution of the
original floating-point MIMO decoding module by its fixed-point counterpart in
the complete SystemC simulation chain, and including appropriate
floating/fixed-point interfaces to the neighbouring modules, we verified the
degradation in BER performance introduced by the fixed-point MIMO decoder. This
can be checked in Figures 4–6, where the BER versus performance has
been evaluated for different modes of the demonstrator.
Figure 5: BER degradation comparing the floating-point version of the
MIMO decoder (solid lines with marker “o”) and its fixed-point counterpart
implementation Q2 (dashed lines with marker “x”).
Figure 6: BER degradation comparing the floating-point version of the
MIMO decoder (solid lines with marker “o”) and its fixed-point counterpart
implementation Q3 (dashed lines with marker “x”). In the zoomed area, results
for the fixed-point implementation Q2 are also shown for comparison (dotted
lines with marker “”).
As can be seen in Figure 4, quantization Q1 is
suitable for every mode, with a maximum loss of about at for 64-QAM
(negligible with 16-QAM and QPSK). From Figure 5, quantization Q2 can be
considered for 16-QAM with a loss up to , but not for 64-QAM, where losses reach . Finally, according to Figure 6, Q3 is suitable for
QPSK with negligible losses, while it worsens by for 16-QAM, a
loss twice that obtained using Q2.
6. Implementation and Results
The following tools were used during the design:
Xilinx ISE 7.1 and the XST engine were used for VHDL synthesis and
place-and-route, while Mentor ModelSim SE 6.0d was used to run functional and
post place-and-route simulations. The target FPGAs considered for the
implementation are Xilinx Virtex-4, since they are most suitable for
implementation of wireless systems [8]. Specifically, model XC4VLX100-12 units
are included in the demonstrator.
Table 3 shows the synthesis results for the MIMO
decoder using the three different fixed-point implementations discussed in
Section 5 and summarized in Table 2.
Table 3: Synthesis results for the MIMO decoding module.
The second column, labelled
“DSP48,” refers to an option of the synthesis tool which can take three
different values: “no” means that no DSP blocks are allowed; “yes” tells the
synthesis tool to use as many of them as required; and “auto” triggers a free
use of the DSP blocks, depending on the best trade-off found by the tool.
The value of that option has a very significant effect
on the column “DSP slices” since the architecture of the MIMO decoder needs 24 multipliers.
When using “auto” for the “DSP48” option, these are made available as DSP
blocks by the synthesis tool, whereas when the “yes” option is selected, the
tool also maps the 21 adders (including 15 adders, 4 subtractors, and 2
programmable adders/subtractors) and other elements in DSP blocks, finally
getting 49 DSP slices used, and consequently reducing the number of LUTs in the
column “Logic” (from 3163 to 92 for Q2, while shift registers keep the same
size).
The column “LUTs” can be obtained by adding the
following three: “Logic,” LUTs used for logic functions and arithmetic;
“Route-through” for routing paths between slices; and “Shift registers.” The
data in this last column are very relevant for our design, since shift
registers are large components in the architecture and consume the greatest
part of the resources (except in the case of value “no” for “DSP48”). They
affect the slice count, since the width of the registers is reduced when
changing to more severe quantizations (from Q1 to Q3).
Considering the total number of slices, there is a
reduction of 23% from quantization Q1 to Q2 (“auto”), while it is only 7.5%
from Q2 to Q3.
The column “Flip-flops” includes the registers needed
in the control unit and also those used for the pipeline. This excludes the
registers that follow the arithmetic units mapped to DSP blocks, since they are
directly taken from the blocks, and not from the slices.
The last column is the minimum clock cycle inferred by
the synthesis tool with a timing constraint of , which is the clock frequency available in the
demonstrator. It is worth noting that the use of DSP blocks results in a
slower design, due to the additional routing needed to reach the (fixed)
positions of those components in the FPGA. In this regard, the fastest
implementation (and also the largest in area) is the one using quantization
rules Q2 selecting “no” for the “DSP48” option.
Quantized outputs of the deframing and channel
estimation modules (see Figure 2) obtained from the floating-point SystemC
simulator were used as realistic input test patterns to perform the functional
validation of the hardware implementation. The outputs of the VHDL simulations
driven by these patterns were compared for equality with those obtained by the
bit-accurate fixed-point software model of the MIMO decoder, when driven by
those same input patterns.
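The equality check between the two output traces can be expressed with a small Python helper. It assumes both traces have already been read into lists of integer words; the function name is hypothetical and the file handling of the actual test bench is not modelled.

```python
def first_mismatch(expected, actual):
    """Index of the first differing sample, or None if the traces are identical.

    expected: output words of the bit-accurate software model.
    actual:   output words captured from the VHDL simulation.
    """
    for index, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return index
    if len(expected) != len(actual):
        return min(len(expected), len(actual))   # one trace ends early
    return None
```

Reporting the position of the first discrepancy, rather than a simple pass/fail flag, makes it easier to locate the corresponding clock cycle in the simulation waveform.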
7. Conclusions
We have presented the design methodology used in the
implementation of a MIMO decoder within a 4G radio system. The architecture of
the system has been optimized to comply with the throughput requirements while
reducing implementation area.
Given the random nature of the inputs, the design of
wireless systems demands a simulation-based fixed-point translation approach
for word-length optimization. A robust simulation framework, able to deal both
with floating-point and fixed-point descriptions, has proven to be essential in
the design.
Several quantization versions have been developed,
synthesized with different options, in order to check the trade-offs between
accuracy and use of resources in different conditions.
Our implementation results using Xilinx Virtex-4
devices show that the MIMO decoder requires a limited number of FPGA resources,
while achieving high performance.
Acknowledgments
This work has been supported by European FP6 IST 2002 507039 Project 4MORE and by the Spanish Ministry of
Science and Technology under Project TEC2006-13067-C03-03.