#### Abstract

Data rates in the upcoming 3G long term evolution (LTE) standard will be many times higher than those of the current universal mobile telecommunications system. Implementing receivers that conform to the high-capacity transmission techniques is challenging due to the complexity and computational requirements of the algorithms. In this study, software defined radio (SDR) is targeted and the four essential baseband functions of the 3G LTE receiver, namely, list sphere decoding, fast Fourier transform, QR decomposition, and turbo decoding, are implemented as application specific processors (ASPs). As a result, the design space that describes the essential computational challenges of 3G LTE receivers is clarified and estimates of area, power, and interprocessor communication buffer requirements are presented.

#### 1. Introduction

The upcoming 3G long term evolution (LTE) standard will support data rates up to 100 Mbps [1]. Such a high data rate will be achieved in 20 MHz bandwidth by using transmission techniques like orthogonal frequency division multiplexing (OFDM) [2], multiple-input multiple-output (MIMO) [3], that is, the use of multiple antennas, and an efficient forward error correction method, turbo coding [4]. As these techniques are applied, the receiver needs to realize very sophisticated algorithms. The difficulty of implementing such algorithms calls for flexible software-based solutions, that is, software defined radio (SDR). On the other hand, the computational complexity of the algorithms advocates dedicated hardware accelerators for maximizing the performance. Thus, the implementation technique of choice should possess the benefits of both approaches.

High throughput and efficiency can be achieved with a highly parallel hardware accelerator designed for the application at hand. As a drawback, the design process is time consuming and any later changes can be difficult with unprogrammable fixed hardware. Programmable processor-based implementations tend to suffer from lower throughput, unused resources, and memory throughput bottlenecks, but they allow a shorter development time and higher flexibility due to their programmability. A solution that strives to achieve the benefits of both hardware accelerators and processor-based implementations is to use application-specific processors (ASPs) with highly parallel computing resources. With proper tools, ASPs can be designed and programmed rapidly, yet high throughput can be obtained with highly parallel computing resources. Flexibility and efficiency are obtained with accurate control at software level.

Compared with implementing a single function, even a couple of interoperating functions complicate the design. For example, the number of clock domains and the most suitable clock frequencies must be determined for all the functions. In addition, there is always a tradeoff between area and throughput. Furthermore, even if the throughput is adequate, the delay can be too long. Thus, the dimensions of the design space include clock frequency, area, power, parallelism, number of processors, clock domains, and so forth. To find answers to the multivariable and multiobjective design problems, the design space must be explored by focusing on promising candidates, that is, design alternatives, and analyzing them. Naturally, such analysis is far from the evaluation of a fully functional system-on-chip (SoC), but it provides invaluable insight into the design problem at hand.

In this paper, efficient ASPs, whose performance rivals pure hardware implementations, are applied to the 3G LTE baseband processing. The targeted essential and computationally demanding baseband functions are list sphere decoding (LSD), fast Fourier transform (FFT), QR decomposition, and turbo decoding. Baseband functions are separated from system level operations as the area and power analysis focuses on the core computations. The assisting interprocessor communication (IPC) is analyzed in terms of data buffer requirements of ideal IPC links. The presented work forecasts how demanding the implementation of these baseband functions of the 3G LTE receiver would be, and what would be the number of logic gate equivalents (GE), power, number of processors, and IPC requirements with realistic clock frequencies. The results also show how strongly an efficient symbol detection method dominates the total complexity.

The next section introduces some previous implementation techniques and fundamentals of the addressed functions and system. In Section 3, a high-level description of the targeted receiver is presented. The applied ASP implementations are presented in Section 4. Multiprocessing requirements and complexity are analyzed in Section 5 before the conclusions.

#### 2. Previous Work

The upcoming 3G LTE, MIMO-OFDM, and the main transmission parameters are discussed in depth in [1]. In [5], the fundamentals of MIMO communications, including the capacity gain, channel model, and receiver algorithms, are explained. As an example of the high potential of MIMO-OFDM systems with sophisticated symbol detection, a MIMO-OFDM system with maximum likelihood (ML) detection achieves over 1 Gbps throughput in [6]. MIMO-OFDM is also applied in 4G telecommunications systems and WLANs. The entire baseband processing chain of a 4G SDR is addressed in [7]. A hardware implementation of a MIMO-OFDM system for WLANs is presented in [8], and implementations of two vital functions, the matrix decomposition and symbol detection with a sphere decoder, are considered in [9].

Typical DSPs like TI's C64x [10] are tempting candidates for baseband processing as they have parallel computing resources and special instructions suitable for many of the required tasks. For example, the FFT can be computed with an off-the-shelf library routine [11]. Alternatively, a dedicated FFT processor can be used [12], and with FPGAs, off-the-shelf IP cores can be used for the FFT [13]. In this paper, we have applied the FFT implementation presented in [14] for complexity and power estimations.

There are many alternative techniques and algorithms for QR decomposition. Since the MIMO receiver requires a relatively small matrix, extensively parallel systolic array processors [15, 16] can be oversized solutions. The QR decomposition requires the computation of a highly nonlinear operation, namely division by a norm or, alternatively, multiplication with an inverse of square root operation. One approach is to carry out the computations in domain as in [17]. A Nios processor with CORDIC accelerators on FPGA is used in [18]. In [19], a scalable architecture using squared Givens rotations is presented. In this paper, the QR decomposition implemented in [20] has been applied.

In many practical MIMO systems, the ML symbol detection can be too complex. Alternatively, for example, zero forcing or linear minimum mean square error (LMMSE) principles can be applied [21]. In this paper, LSD is assumed as it approximates the ML detection with reduced computational complexity. There are several LSD variants. A $K$-best LSD is assumed in this study and in [22–24], where architectures for the algorithm are presented. The $K$-best LSD processor used in this paper is presented in detail in [25].

A turbo decoder can be implemented, for example, as a coprocessor of a DSP as in [26], as a hardware accelerator [27], or as an ASP [28]. Naturally, there are variants of the algorithm, and the level of parallelism and clock frequency mainly determine the throughput. In this paper, a programmable turbo decoder presented in [29] is applied.

The ASP template applied in this paper uses the transport triggered architecture (TTA) [30]. There exist many multiprocessor systems applying TTA processors. In [31], a simple asynchronous communication link between TTA processors is enabled with units containing a FIFO buffer. TTA and LEON3 processors are connected with an AMBA bus in [32]. In contrast to a shared bus, a network-on-chip approach has been applied in [33], where two Coffee RISC processors, a TTA processor, and a shared memory are connected with a network. A bioinspired multiprocessor system is presented in [34, 35], where TTA processors are abstracted as cells of a biological system. In this paper, the IPC requirements of a multiprocessor system are analyzed, and an abstract multiprocessor system using shared memory banks as communication links is assumed.

Several essential building blocks for baseband processing are presented in the aforementioned references. Rather than focusing solely on one particular function without practical motivation for the achieved throughput, we focus on a baseband processing chain consisting of FFT, QR decomposition, LSD, and turbo decoder, and we derive the processing requirements from the 100 Mbps peak data rate of the upcoming 3G LTE systems. In this paper, we consider especially the ASPs in [14, 20, 25, 29] and their applicability for baseband processing. In order to obtain realistic estimates, the considered ASPs are resynthesized for the prevailing operating conditions, and complexity and power estimates are given for a 3G LTE compliant system configuration.

#### 3. System Model

A high-level description of the targeted 2-antenna MIMO-OFDM receiver is presented in Figure 1. The input ports are connected to radio-frequency functions of the receiver. The functional block diagram is only a high-level model: it does not suggest how the functions should be mapped to the processors, how data is passed between the functions, or whether the data vectors have serial or parallel presentations. In the following, the targeted transmission techniques are presented briefly.

##### 3.1. Orthogonal Frequency Division Multiplexing

OFDM uses the frequency spectrum efficiently as the used frequency band is divided into several orthogonal subcarriers. OFDM uses the FFT and inverse FFT (IFFT) for efficient conversions between the time and frequency domains. On the transmitter side, the time domain signal, $x(n)$, is generated with the inverse transform

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{j 2 \pi k n / N}, \tag{1}$$

that is, data belonging to several parallel subcarriers is fed to the IFFT. On the receiver side, the parallel subcarriers, $X(k)$, are extracted from the time domain signal with

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2 \pi k n / N}. \tag{2}$$

To ease timing synchronization, a cyclic prefix is inserted into the signal. Channel estimation can be eased with pilot symbols.

On the receiver side, the distortion of the channel can be equalized conveniently in the frequency domain by multiplying the received symbols with equalizing factors. Before the FFT, the cyclic prefix must be removed from the signal, and timing synchronization is responsible for feeding the time domain signal, whose length equals the FFT length, with the correct timing offset to the FFT block.
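The transmit and receive steps described above can be sketched as follows. This is a minimal illustration, not the receiver implementation of the paper: it uses a direct $O(N^2)$ DFT instead of an FFT for brevity, and the function names `ofdm_modulate` and `ofdm_demodulate` as well as the prefix length are hypothetical.

```python
import cmath


def idft(freq):
    """Inverse DFT: subcarrier symbols to time-domain samples."""
    n = len(freq)
    return [
        sum(freq[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def dft(time):
    """Forward DFT: time-domain samples to subcarrier symbols."""
    n = len(time)
    return [
        sum(time[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def ofdm_modulate(symbols, cp_len):
    """Transmitter side: inverse transform, then prepend the cyclic prefix."""
    body = idft(symbols)
    prefix = body[len(body) - cp_len:]  # copy of the tail of the OFDM symbol
    return prefix + body


def ofdm_demodulate(samples, cp_len):
    """Receiver side: drop the cyclic prefix, then apply the forward transform."""
    return dft(samples[cp_len:])
```

With ideal timing and no channel distortion, demodulating the modulated signal returns the original subcarrier symbols up to rounding error.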

##### 3.2. Multiple-Input Multiple-Output

In a spatial multiplexing MIMO system, multiple antennas are used to transmit independent data streams. Spatial multiplexing gain, that is, increase in capacity, is proportional to the number of antennas and it requires neither extra power nor extra bandwidth. Two transmit and two receive antennas are a highly probable configuration for the first 3G LTE systems, since a higher number of antennas increases the computational requirements of symbol detection significantly.

The computational complexity of ML detection of transmitted symbols depends exponentially on the number of spatial channels. Therefore, even with a modest number of antennas, simpler approximative methods must be used. The usage of list sphere decoding algorithms is tempting as they can achieve higher performance than LMMSE [36], even though they are computationally demanding. The sphere detector restricts the search space by evaluating only the symbols inside a sphere centered at the received symbol. In the system model in Figure 1, a $K$-best LSD is assumed. The $K$-best LSD operates by gradually increasing the dimension of the symbol vector. At each level, a list of the $K$ best partial solutions is selected for continued processing.

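In principle, a MIMO system with a complex-valued channel matrix, $\mathbf{H}$, noise vector, $\mathbf{n}$, transmitted symbol, $\mathbf{s}$, and received symbol, $\mathbf{y}$, can be described with

$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n}. \tag{3}$$

The number of receive and transmit antennas equals the numbers of rows and columns of $\mathbf{H}$, respectively. The transmitted symbol can be estimated by ML detection by solving

$$\hat{\mathbf{s}}_{\mathrm{ML}} = \arg\min_{\mathbf{s}} \left\| \mathbf{y} - \mathbf{H}\mathbf{s} \right\|^2, \tag{4}$$

which gives the optimal result. However, solving (4) is intractable with multiple antennas and large constellations.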

Instead of solving (4), the symbol estimation can be simplified by using the QR decomposition of the channel matrix, $\mathbf{H} = \mathbf{Q}\mathbf{R}$. With this practice, the computational complexity is lowered. Instead of ML detection, a substitute

$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s}} \left\| \mathbf{Q}^H \mathbf{y} - \mathbf{R}\mathbf{s} \right\|^2 \tag{5}$$

is used, where $\mathbf{y}$ is the received symbol. As $\mathbf{R}$ is in upper triangular form, approximation of $\hat{\mathbf{s}}$ is computationally simpler with the aid of (5). The simplified approximation is based on computing the Euclidean distance in (5) by gradually increasing the dimensions of the symbol vector. Basically, there will be partial solutions which are too far away from the received symbols, and when such partial solutions are discarded, the search space is efficiently limited. The $K$-best LSD applies the aforementioned principles by maintaining a $K$-length list of the best partial solutions found so far.
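The level-by-level search described above can be sketched as follows. This is a minimal floating-point illustration under stated assumptions, not the LSD processor of [25]: it assumes a real-valued upper triangular matrix `R`, an already rotated received vector `z` (the product of the transposed orthogonal matrix and the received symbol), and a small real-valued alphabet. The function name `k_best_detect` is hypothetical.

```python
def k_best_detect(R, z, alphabet, K):
    """Breadth-first K-best search over an upper triangular system.

    R        -- n x n upper triangular matrix (list of rows)
    z        -- rotated received vector of length n
    alphabet -- candidate values of each real-valued symbol
    K        -- number of partial solutions kept at each level
    Returns a list of (distance, symbols) sorted by ascending distance.
    """
    n = len(z)
    candidates = [(0.0, [])]  # (partial Euclidean distance, symbols s[i+1..n-1])
    for i in range(n - 1, -1, -1):  # start from the last row of R
        expanded = []
        for ped, partial in candidates:
            # Contribution of the symbols that were fixed on earlier levels.
            interference = sum(R[i][j] * partial[j - i - 1] for j in range(i + 1, n))
            for a in alphabet:
                err = z[i] - interference - R[i][i] * a
                expanded.append((ped + err * err, [a] + partial))
        expanded.sort(key=lambda c: c[0])
        candidates = expanded[:K]  # discard all but the K best partial solutions
    return candidates
```

Keeping only `K` candidates per level bounds the work per level to `K` times the alphabet size, which is what makes the throughput of the detector predictable.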

##### 3.3. Forward Error Correction

The function of forward error correction is to introduce redundancy in the transmitted signal in order to facilitate error detection and correction. In 3G LTE, turbo coding similar to that of the contemporary 3G systems will be used. The only difference is the definition of the interleaving function [37, 38]. The new interleaving function covers longer code blocks and it is simpler to implement than the contemporary 3G interleaving. Naturally, the longer code block size affects the memory requirements.

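Turbo decoding is an iterative process, which runs a soft-in soft-out (SISO) component decoder several times. The arguments of the component decoder are the extrinsic information, $\mathbf{L}_a$, the systematic bit vector, $\mathbf{y}^s$, and the parity bit vector, $\mathbf{y}^p$. As a result, it generates new extrinsic information, $\mathbf{L}_e$, and soft bit estimate vectors, $\mathbf{L}$, that is,

$$(\mathbf{L}_e, \mathbf{L}) = f_{\mathrm{SISO}}(\mathbf{L}_a, \mathbf{y}^s, \mathbf{y}^p). \tag{6}$$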

The *a posteriori* information is generated on
the previous half iteration, and used as *a priori* information on the
next half iteration. The information exchange takes place by passing the
extrinsic information between the component decoder processes. The main
difference between the half iterations is that every second half iteration
processes data related to the interleaved systematic bits.

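The applied turbo decoder processor in Section 4.6 uses the max-log-MAP algorithm for SISO decoding. In principle, the max-log-MAP algorithm generates the forward path metric, $\alpha_k(s)$, at state $s$ at trellis stage $k$, recursively as

$$\alpha_k(s) = \max_{s'} \left( \alpha_{k-1}(s') + \gamma_k(s', s) \right), \tag{7}$$

where $\gamma_k(s', s)$ is the branch metric of the transition from state $s'$ to state $s$. The backward path metric is defined in the same way as

$$\beta_k(s) = \max_{s'} \left( \beta_{k+1}(s') + \gamma_{k+1}(s, s') \right). \tag{8}$$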

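The soft output, $L(\hat{u}_k)$, is a function of the forward, backward, and branch metrics, that is,

$$L(\hat{u}_k) = \max_{(s', s):\, u_k = 1} \left( \alpha_{k-1}(s') + \gamma_k(s', s) + \beta_k(s) \right) - \max_{(s', s):\, u_k = 0} \left( \alpha_{k-1}(s') + \gamma_k(s', s) + \beta_k(s) \right). \tag{9}$$

In (9), the first maximum corresponds to the state transitions where the transmitted systematic bit $u_k = 1$, and the second maximum is computed based on all the state transitions where $u_k = 0$. The signum function is used to calculate the final hard bit estimates based on $L(\hat{u}_k)$. The new extrinsic information is computed with the aid of the received soft systematic bit, $y_k^s$, *a priori* information, $L_a(u_k)$, and $L(\hat{u}_k)$, that is,

$$L_e(u_k) = L(\hat{u}_k) - y_k^s - L_a(u_k). \tag{10}$$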
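The forward recursion, backward recursion, and soft output computation of the max-log-MAP algorithm can be sketched for a generic trellis as follows. This is a floating-point illustration under assumptions and not the processor implementation of [29]: the trellis is passed as a hypothetical list of `(previous_state, next_state, bit)` tuples, the branch metrics are supplied by the caller, the trellis is assumed to start in state 0 and to end unterminated, and stage indices are zero-based.

```python
NEG_INF = float("-inf")


def max_log_map(gammas, transitions, num_states):
    """Return one soft output per trellis stage.

    gammas      -- gammas[k][(s_prev, s_next)] is the branch metric at stage k
    transitions -- list of (s_prev, s_next, bit) tuples describing the trellis
    num_states  -- number of trellis states
    """
    stages = len(gammas)

    # Forward path metrics; the trellis is assumed to start in state 0.
    alpha = [[NEG_INF] * num_states for _ in range(stages + 1)]
    alpha[0][0] = 0.0
    for k in range(stages):
        for s_prev, s_next, _ in transitions:
            metric = alpha[k][s_prev] + gammas[k][(s_prev, s_next)]
            alpha[k + 1][s_next] = max(alpha[k + 1][s_next], metric)

    # Backward path metrics; all end states are assumed equally likely.
    beta = [[NEG_INF] * num_states for _ in range(stages + 1)]
    beta[stages] = [0.0] * num_states
    for k in range(stages - 1, -1, -1):
        for s_prev, s_next, _ in transitions:
            metric = beta[k + 1][s_next] + gammas[k][(s_prev, s_next)]
            beta[k][s_prev] = max(beta[k][s_prev], metric)

    # Soft output: best path with bit 1 minus best path with bit 0.
    soft = []
    for k in range(stages):
        best = {0: NEG_INF, 1: NEG_INF}
        for s_prev, s_next, bit in transitions:
            metric = alpha[k][s_prev] + gammas[k][(s_prev, s_next)] + beta[k + 1][s_next]
            best[bit] = max(best[bit], metric)
        soft.append(best[1] - best[0])
    return soft
```

Only additions and maximum selections are needed, which is why add-compare-select hardware suits the algorithm well. The sign of each returned value gives the hard bit decision.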

#### 4. Transport Triggered Architecture Processor Implementations

The targeted baseband functions are implemented on a customizable ASP template. The implementations are presented briefly in the following sections.

##### 4.1. Principles of Transport Triggered Architecture Processors

In this paper, TTA [30] has been used as the architecture template for ASPs. Processors with similar efficiency and performance could be implemented also with some other ASP templates supporting sufficient parallelism and customizability. Since there exists up-to-date tool support for TTA processors [39], we have exploited the template and the baseband functions have been implemented with TTA processors.

The main difference when compared to pure hardware solutions is that the TTA processors are fully programmable. TTA resembles a VLIW machine, but the interconnection is exposed to the programmer unlike in traditional processors. TTA is one form of application-specific instruction set processor where the instruction set of the processor is tailored for the given application. In this sense, code for a customized TTA processor is not compatible with another TTA processor. In TTA, the computations are triggered by data transported to the computing unit, which is the opposite of conventional operation-triggered architectures. The processor is programmed with data transports, which exposes the architecture to the programmer. The maximum number of parallel data transports is determined by the number of buses of the interconnection network. As the interconnection network connecting the computing resources is visible to the programmer, there is accurate control of all the operations.

The modularity of TTA processors allows them to be tailored by including only the necessary function units (FU). Application-specific functions are implemented as user defined special FUs (SFU), which are utilized in a similar way as conventional FUs, that is, by transporting data on assembly level or by using function-like macros in C language. Due to frequent direct data transports between the FUs or SFUs, the register pressure is very low. However, the modularity of the processor allows a variable number of register files (RF) with variable numbers of input and output ports. In Figure 2, a high-level example of a TTA processor is given. The figure highlights the modular and customizable structure of the processor by denoting the variable numbers of the respective resources. The control unit (CU) in Figure 2 allows data transports to access the program counter and the return address register, which is required for jump or call operations.
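The transport-triggered programming model can be illustrated with a toy simulator. This is a conceptual sketch only: the class `FunctionUnit`, the `unit.port` naming of sources and destinations, and the sequential execution of one transport at a time are assumptions made for illustration, whereas a real TTA processor executes several transports in parallel on its buses.

```python
class FunctionUnit:
    """A function unit with an operand port, a trigger port, and a result port."""

    def __init__(self, operation):
        self.operation = operation
        self.operand = 0
        self.result = 0

    def write(self, port, value):
        if port == "operand":
            self.operand = value  # plain latch, nothing is computed
        elif port == "trigger":
            # Transporting data to the trigger port starts the operation.
            self.result = self.operation(self.operand, value)
        else:
            raise ValueError(f"unknown port: {port}")


def run(moves, units, registers):
    """Execute a list of (source, destination) data transports in order."""
    for source, destination in moves:
        if isinstance(source, (int, float)):
            value = source  # immediate value
        else:
            unit, port = source.split(".")
            value = registers[port] if unit == "rf" else units[unit].result
        unit, port = destination.split(".")
        if unit == "rf":
            registers[port] = value
        else:
            units[unit].write(port, value)
    return registers


units = {
    "add": FunctionUnit(lambda a, b: a + b),
    "mul": FunctionUnit(lambda a, b: a * b),
}
# Computes (3 + 4) * 5; the sum moves directly from one unit to the next.
program = [
    (3, "add.operand"),
    (4, "add.trigger"),
    ("add.result", "mul.operand"),
    (5, "mul.trigger"),
    ("mul.result", "rf.r0"),
]
registers = run(program, units, {})
```

The intermediate sum never visits a register file, which mirrors the low register pressure mentioned above.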

The load on the buses of the interconnection network can be lowered by excluding the unnecessary connections if the work load of the processor is known beforehand. In this case, the targeted application program determines which connections are used. Typically, one application requires only a fraction of all the possible connections between the computing resources. If any other application is run on the same processor, it must be able to use the same connections. As a consequence of the limited connectivity and lowered load on the buses, the maximum clock frequency of the interconnection network is raised.

##### 4.2. Multiprocessor Systems with TTA Processors

There exist many multiprocessor systems applying TTA processors, as listed in Section 2. However, the required number of processors for baseband processing in Section 5 is far higher than the number of processors in [31–33]. In addition to the bioinspired abstraction of multiple TTA processors [34, 35], multiple processors could also be abstracted as a hierarchical structure where the SFUs would be comprised of TTA subprocessors. Another way would be to combine all the TTA processors into a set of loosely connected clusters inside a single TTA processor. However, assembly programming of such a processor would be error prone due to the extremely long instruction word, and the scheme would limit the control flow of the clusters very strictly to a single combined flow. Regardless of the applied structure of the multiprocessor system, generating and controlling a multiprocessor system consisting of dozens of processors would be a demanding task.

Since it would be uneconomical to produce results of computations faster than they can be transferred to the next stage, shared memory banks or RFs running with the same clock frequency, $f_{\text{clk}}$, as the processors must be assumed for the IPC at the lowest level. Fortunately, the applied TTA processor template has flexible memory interfaces, which can simplify the IPC. For example, simple point-to-point connections between two processors could be implemented with an SFU interfacing a shared single- or dual-port memory. Furthermore, if complex address generation or bank selection is required, it can be included in the same SFU, which slightly raises the abstraction level of the IPC visible to the programmer. Such an incorporation of all the memory related logic into the same unit could enable a seamless IPC.

##### 4.3. FFT Processor

The applied FFT TTA processor is presented in detail in [14]. The processor implements a mixed-radix FFT consisting of radix-2 and radix-4 computations and it supports several power-of-two transform sizes. It has 11 RFs containing 25 general-purpose registers and three Boolean registers, 17 buses in the interconnection network, a conventional adder, a comparison unit, and two load/store units. The main computations are carried out with the following SFUs.

*Complex Adder Unit*

It supports
four different summations composed of four alternative operands.

*Complex Multiplier*

It accelerates the butterfly operation with four real multipliers and two real adders.

*Address Generator Unit*

It generates
two addresses with bitwise reversal and rotation operations.

*Coefficient Generator*

It generates the
twiddle factors instead of loading them from a memory.

The processor applies a complex-valued number presentation where the real and imaginary parts both take 16 bits. Data is stored in single-port memory banks and the kernel loop applies the principles of software pipelining. Code compression is applied to enhance the code density and lower the power consumption.
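Two of the SFU operations listed above can be sketched in software. This is an illustration under assumptions, not the hardware of [14]: `complex_multiply` shows how one complex product maps onto four real multiplications and two real additions, and `bit_reverse` shows plain radix-2 bit-reversed addressing, whereas the mixed-radix address generation and the rotation operations of the actual unit are not modeled. Both function names are hypothetical.

```python
def complex_multiply(a_re, a_im, b_re, b_im):
    """Complex product using four real multiplications and two real additions."""
    re = a_re * b_re - a_im * b_im
    im = a_re * b_im + a_im * b_re
    return re, im


def bit_reverse(index, num_bits):
    """Reverse the num_bits least significant bits of index."""
    reversed_index = 0
    for _ in range(num_bits):
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index
```

For an 8-point transform, `[bit_reverse(i, 3) for i in range(8)]` gives the familiar reordering of a radix-2 FFT.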

##### 4.4. QR Decomposition Processor

The QR TTA processor presented in [20] is based on the modified Gram-Schmidt algorithm [40]. With complex-valued arithmetic units, the processor can compute both the complex- and real-valued decompositions equally well. The only conventional units of the processor are the two load/store units and an RF consisting of five general purpose registers. The interconnection network contains seven buses. The applied SFUs are as follows.

*Complex Adder/subtractor Unit*

It is for
native complex-valued computations.

*Complex Multiplier Unit*

It can
optionally conjugate the other input. The conjugation is required for the
computation of the real-valued norm.

*$1/\sqrt{x}$ Unit*

It is for a fast estimation of the highly nonlinear function $1/\sqrt{x}$. The function is used in the QR decomposition to avoid division operations.

As the processor has a bit accurate complex multiplier, it can also be used for other tasks where the accuracy of a 16-bit fixed-point number system is sufficient. The $1/\sqrt{x}$ unit and the multiplier can also be used for the computation of the square root as $\sqrt{x} = x \cdot (1/\sqrt{x})$.
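The modified Gram-Schmidt procedure with an inverse square root in place of division can be sketched as follows. This is a real-valued floating-point illustration under assumptions, not the 16-bit fixed-point complex-valued processor of [20]; the function names `rsqrt` and `mgs_qr` are hypothetical, and `rsqrt` simply stands in for the fast estimation unit.

```python
def rsqrt(x):
    """Stand-in for the inverse square root unit."""
    return x ** -0.5


def mgs_qr(A):
    """Modified Gram-Schmidt QR decomposition of a real matrix.

    A is a list of rows (m x n). Returns (Q, R) as lists of rows,
    where Q is m x n with orthonormal columns and R is n x n upper triangular.
    """
    m, n = len(A), len(A[0])
    columns = [[A[i][j] for i in range(m)] for j in range(n)]
    q_columns = []
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        norm_sq = sum(v * v for v in columns[j])
        inv_norm = rsqrt(norm_sq)
        R[j][j] = norm_sq * inv_norm  # sqrt(x) computed as x * (1 / sqrt(x))
        q = [v * inv_norm for v in columns[j]]  # multiply instead of divide
        q_columns.append(q)
        # Remove the component along q from all remaining columns.
        for k in range(j + 1, n):
            R[j][k] = sum(q[i] * columns[k][i] for i in range(m))
            columns[k] = [columns[k][i] - R[j][k] * q[i] for i in range(m)]
    Q = [[q_columns[j][i] for j in range(n)] for i in range(m)]
    return Q, R
```

The only nonlinear operation is the inverse square root, once per column; everything else is multiplication and addition.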

##### 4.5. K-Best LSD Processor

The LSD TTA processor in [25] generates a 16-element list of candidate solutions to approximate the transmitted symbol in (5). The processor uses 16-bit arithmetic and it is targeted for a $2 \times 2$ antenna configuration and 64-quadrature amplitude modulation (QAM). Instead of a complex-valued matrix, a real-valued matrix with doubled dimensions is processed. Therefore, a real-valued QR decomposition is required for the LSD. The interconnection network is very sparse and contains 16 buses. The arithmetic operations are computed with two addition units, a subtraction unit, a multiplier, and a squaring unit. The following SFUs are targeted for the applied $K$-best algorithm.

*Insertion Sorter Unit*

It sorts a
list of 16 samples according to the partial Euclidean distances (PED).
Internally, the list is kept in a shift register and the new value is inserted
to the register pointed by comparison logic.

*PED Extractor Unit*

It extracts
the PED from the internal storage format, that is, the unit accesses bits by
hardwiring.

*Multiplexer and Look-up-Table Unit*

It consists
of a multiplexer selecting the bits, which index the look-up-table. In
principle, the unit converts a bit pattern to fixed-point format.

*Storage Format Composer Unit*

It composes
a 28-bit word consisting of symbol information and the corresponding PED.

There are three RFs of sizes 16, 10, and 4 registers. In contrast to conventional processors, the LSD TTA processor has neither load/store units nor data memory, since there is no need for accessing large arrays. The input data is passed via two RFs and the results of the computations are available in the registers of the insertion sorter SFU.
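The behavior of the insertion sorter unit can be modeled in software as follows. This is a behavioral sketch under assumptions, not the shift register hardware of [25]: the class name `InsertionSorter` and the `(ped, symbols)` entry layout are hypothetical, and the 28-bit storage format is not modeled.

```python
class InsertionSorter:
    """Keeps the `size` entries with the smallest PED, in ascending order."""

    def __init__(self, size=16):
        self.size = size
        self.entries = []  # list of (ped, symbols), smallest PED first

    def insert(self, ped, symbols):
        # The comparison logic finds the slot of the new value.
        position = 0
        while position < len(self.entries) and self.entries[position][0] <= ped:
            position += 1
        if position < self.size:
            # Entries behind the slot shift by one; the last one drops out
            # when the list is already full.
            self.entries.insert(position, (ped, symbols))
            del self.entries[self.size:]
```

A candidate whose PED is larger than every stored PED is ignored once the list is full, so the list never grows beyond its fixed length.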

##### 4.6. Turbo Decoder Processor

The turbo decoder TTA processor is presented in [29]. It has a sparsely connected interconnect network of 30 buses and the high number of buses is a consequence of high parallelism. The only conventional FUs are the addition and comparison units. There are only two RFs, both of them containing one general purpose register. As there are not many conventional FUs, the applied max-log-MAP algorithm is computed solely with the following SFUs.

*Control Unit*

It generates
a control word which is used as an argument to all the other SFUs.

*Address Generator*

It generates
addresses for accessing the branch metric buffer.

*Forward Process Unit*

It computes
forward path metrics according to (7).

*Backward Process Unit*

It computes
backward path metrics as defined in (8) and extrinsic information and soft
output bit estimates according to (10) and (9), respectively.

*Branch Metric Generator*

It generates and
buffers the branch metrics for the forward and backward processes.

The turbo decoder TTA processor applies high parallelism as it processes one trellis stage in 1.016 clock cycles on average, that is, both forward and backward path metrics are computed in one clock cycle. Such a high parallelism also requires a high memory throughput. Therefore, the processor does not have conventional load/store units. Instead, the SFUs access the memory interfaces of the processor directly. As (7)–(10) indicate, the main computations in the SFUs are carried out with basic arithmetic, add-compare-select, and maximum operations. The processor includes memory bank selection, address generation, and access buffer logic to allow parallel interleaved accesses of the extrinsic information with four single-port memory banks. The interleaving function is excluded from the processor and it is accessed via an external interface of the processor.

#### 5. Processing Requirements and Complexity

The number of processors, their total area and memory requirements, and interprocessor communication requirements are derived from the targeted 100 Mbps throughput.

##### 5.1. Time and Throughput Requirements

There are seven OFDM symbols per transmit antenna in a 0.5 millisecond time frame in the 3G LTE downlink. Thus, the processing time requirement, $t_{\text{symbol}} = 0.5/7$ millisecond $\approx 71.4$ microseconds, also includes the additional time contributed by the cyclic prefix of the OFDM symbol. The FFT must be computed for both antennas.

The QR decomposition must be processed in the coherence time, $t_c$, of the channel. If a bullet-train speed, $v$, is assumed for the receiver, the coherence time, which is on the order of a millisecond, follows from $v$, the speed of light, $c$, and the carrier frequency, $f_c$. However, with a more rapidly varying channel, the QR decomposition must be computed more frequently, that is, a shorter $t_c$ must be used in (12). A single QR decomposition combines information from all the antennas. In other words, the matrix and vector sizes of the QR decomposition depend on the number of antennas.

The LSD must be computed for each subcarrier. So, its time requirement equals the time requirement of the FFT. However, even if the maximum length of the FFT is 2048, only 1201 subcarriers are in use. A single LSD processes the signals of both antennas, that is, it outputs estimates of symbols transmitted from both antennas.

Since the turbo decoder processes soft bits instead of QAM symbols, it is meaningful to express throughput as data rate. The throughput requirement of turbo decoding equals the maximum data rate of 100 Mbps. Naturally, with code rate $R = 1/2$ and 64-QAM symbols, the data rate on the LSD side is 200 Mbps and the symbol rate is 33.3 Msps.
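The timing and rate figures of this section can be reproduced with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the code rate of 1/2 is inferred from the ratio of the 100 Mbps and 200 Mbps figures, and the variable names are illustrative.

```python
FRAME_TIME_S = 0.5e-3        # duration of one time frame
OFDM_SYMBOLS_PER_FRAME = 7   # OFDM symbols per transmit antenna in a frame
DATA_RATE_BPS = 100e6        # peak data rate after turbo decoding
CODE_RATE = 0.5              # inferred from the 100 Mbps and 200 Mbps figures
BITS_PER_QAM_SYMBOL = 6      # 64-QAM

# Processing time available for one OFDM symbol, cyclic prefix included.
t_symbol = FRAME_TIME_S / OFDM_SYMBOLS_PER_FRAME

# Rate of coded soft bits that the LSD side must deliver.
coded_bit_rate = DATA_RATE_BPS / CODE_RATE

# Rate of detected 64-QAM symbols.
qam_symbol_rate = coded_bit_rate / BITS_PER_QAM_SYMBOL
```

Here `t_symbol` is about 71.4 microseconds, `coded_bit_rate` is 200 Mbps, and `qam_symbol_rate` is about 33.3 Msps.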

##### 5.2. Required Number of Clock Cycles

The FFT TTA processor in [14] takes 12332 clock cycles for the 2048-point transform and the transform must be computed for both antennas. So, the required clock cycles of the FFT task are

$$C_{\text{FFT}} = 2 \cdot 12332 = 24664. \tag{11}$$

The QR decomposition algorithm is of order $O(n^3)$ and the QR decomposition TTA processor in [20] takes 139 clock cycles for a $4 \times 4$ matrix. The dimensions of the decomposed matrix are doubled, since the LSD TTA processor applies real-valued computation. Since the matrix $\mathbf{Q}^H$ is the argument of the matrix-vector product in (5), the products are mapped to the same processor. The products must be computed continuously for each received symbol vector, but the QR decomposition only once in the coherence time. So, the average number of clock cycles in time period $t_{\text{symbol}}$ for both computations is approximately

$$C_{\text{QR}_{\text{avg}}} \approx 1201 \left( 139\, \frac{t_{\text{symbol}}}{t_c} + 16 \right), \tag{12}$$

where the matrix multiplication takes 16 clock cycles and $t_c$ is the coherence time. Naturally, with a more rapidly varying channel, $C_{\text{QR}_{\text{avg}}}$ increases as $t_c$ must be decreased. The products take approximately 59% of $C_{\text{QR}_{\text{avg}}}$. The maximum number of clock cycles is spent when the decomposition of a new channel matrix is computed for each subcarrier, that is,

$$C_{\text{QR}_{\text{max}}} = 1201\, (139 + 16) = 186155. \tag{13}$$

The average number of clock cycles, $C_{\text{QR}_{\text{avg}}}$, is only 17% of the maximum, $C_{\text{QR}_{\text{max}}}$.

The LSD TTA processor in [25] takes 441 clock cycles for processing one symbol vector. Thus, in time period $t_{\text{symbol}}$ the number of required clock cycles for the LSD, $C_{\text{LSD}}$, is approximately

$$C_{\text{LSD}} = 1201 \cdot 441 = 529641. \tag{14}$$

Fortunately, the LSD can be parallelized among the subcarriers.

In order to compare turbo decoding with the other baseband functions, the clock cycles of turbo decoding must be normalized to clock cycles, $C_{\text{Turbo}}$, taken in the $t_{\text{symbol}}$ time frame. The turbo decoder TTA processor in [29] takes 1.016 clock cycles per trellis stage processed in a half iteration. With six iterations, each trellis stage is processed 12 times. Therefore,

$$C_{\text{Turbo}} = 100\ \text{Mbps} \cdot t_{\text{symbol}} \cdot 2 \cdot 6 \cdot 1.016 \approx 87086, \tag{15}$$

where the first multiplication expresses how many bits are processed in $t_{\text{symbol}}$. Turbo decoding can be parallelized to several processors with block-by-block pipelining where each processor decodes a code block of its own independently.

The required number of clock cycles of all the four functions are illustrated in Figure 3. The figure shows clearly how the LSD dominates the computation load. Obviously, the requirements cannot be met with single-processor systems with currently achievable clock frequencies.
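The clock cycle requirements of the four functions can be tabulated with the short sketch below. The per-operation cycle counts are the ones quoted in this section; the function names are illustrative, and the coherence time is left as a parameter because the averaged QR figure depends on it. The form of `qr_avg_cycles` is an assumption: one decomposition per subcarrier per coherence time plus one matrix-vector product per subcarrier per OFDM symbol.

```python
T_SYMBOL_S = 0.5e-3 / 7     # processing time per OFDM symbol
USED_SUBCARRIERS = 1201


def fft_cycles():
    """One 2048-point FFT per antenna, two antennas."""
    return 2 * 12332


def lsd_cycles():
    """One symbol vector per used subcarrier."""
    return USED_SUBCARRIERS * 441


def turbo_cycles(data_rate_bps=100e6, iterations=6, cycles_per_stage=1.016):
    """Cycles needed to decode the bits that arrive during one OFDM symbol."""
    bits = data_rate_bps * T_SYMBOL_S
    return bits * 2 * iterations * cycles_per_stage  # two half iterations each


def qr_max_cycles():
    """A new decomposition and one product for every subcarrier."""
    return USED_SUBCARRIERS * (139 + 16)


def qr_avg_cycles(coherence_time_s):
    """Decomposition once per coherence time, product once per OFDM symbol."""
    return USED_SUBCARRIERS * (139 * T_SYMBOL_S / coherence_time_s + 16)
```

Comparing `lsd_cycles()` with the other three values shows directly how strongly the symbol detection dominates the load.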

##### 5.3. Number of Processors

The required minimum number of processors is determined by the throughput per processor, the clock frequency, $f_{\text{clk}}$, and the parallelization scheme of the targeted functions. If a task can be parallelized to several processors and the throughput is directly proportional to the number of processors, then the minimum required number of processors, $N_i$, of the task $i$ taking $C_i$ clock cycles in time frame $t_{\text{symbol}}$ is

$$N_i = \left\lceil \frac{C_i}{t_{\text{symbol}}\, f_{\text{clk}}} \right\rceil. \tag{16}$$

The utilization, $U_i$, of the processors dedicated to task $i$ tells how efficiently the computing resources are used. It can be defined in a similar way as

$$U_i = \frac{C_i}{N_i\, t_{\text{symbol}}\, f_{\text{clk}}}. \tag{17}$$

Naturally, $1 - U_i$ tells what fraction of the time the processors idle. For the QR decomposition and matrix-vector product task, the average number of clock cycles, $C_{\text{QR}_{\text{avg}}}$, is used to calculate the minimum number of processors and the utilization. The total utilization of the whole processing chain can be computed as

$$U_{\text{tot}} = \frac{\sum_i C_i}{t_{\text{symbol}}\, f_{\text{clk}} \sum_i N_i}, \tag{18}$$

where the sums are computed for all the elements of the task set $\{\text{FFT}, \text{QR}_{\text{avg}}, \text{LSD}, \text{Turbo}\}$. The total utilization in (18) expresses the ratio between the required execution cycles of all the tasks and the available execution cycles of all the processors.
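The processor count and utilization definitions can be evaluated with the sketch below. The function names are illustrative, and the formulas assume that throughput scales linearly with the number of processors, as stated above. The example task set omits the QR task because its averaged cycle count depends on the coherence time.

```python
import math

T_SYMBOL_S = 0.5e-3 / 7  # processing time per OFDM symbol


def min_processors(task_cycles, clock_hz, period_s=T_SYMBOL_S):
    """Smallest number of processors that completes the task in the period."""
    return math.ceil(task_cycles / (period_s * clock_hz))


def utilization(task_cycles, processors, clock_hz, period_s=T_SYMBOL_S):
    """Required cycles divided by the cycles available on the processors."""
    return task_cycles / (processors * period_s * clock_hz)


def total_utilization(cycles_by_task, clock_hz, period_s=T_SYMBOL_S):
    """Utilization of the whole chain with a minimal processor count per task."""
    total_cycles = sum(cycles_by_task.values())
    total_processors = sum(
        min_processors(cycles, clock_hz, period_s)
        for cycles in cycles_by_task.values()
    )
    return total_cycles / (total_processors * period_s * clock_hz)


# Example at a hypothetical 200 MHz clock, without the QR task.
example_tasks = {"FFT": 24664, "LSD": 529641, "Turbo": 87086}
example_counts = {
    name: min_processors(cycles, 200e6) for name, cycles in example_tasks.items()
}
```

At the assumed 200 MHz clock the example needs 2, 38, and 7 processors for the FFT, LSD, and turbo decoding tasks, respectively.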

##### 5.4. Delay

The delay of a task depends on the maximum size of the processed data vector and the scheduling. Except for the first half iteration, the turbo decoder requires that the whole code block is received before decoding. The maximum code block length is 6144 [37], which is about 20% longer than in the current 3G systems. With code rate $R$, the required number of soft bits is naturally $6144/R$. For two OFDM symbols, the LSD generates symbol candidate lists, which can be converted to soft bit estimates with 64-QAM (6 bits per symbol). Since the number of soft bits exceeds the required number for the maximum code block length, the analysis of the delay of FFT and LSD can be limited to the processing of two OFDM symbols.

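With at maximum two processors, the delay of the FFT is simply

$$D_{\mathrm{FFT}} = \frac{C_{\mathrm{FFT}}}{N_{\mathrm{FFT}}\, f_{\mathrm{clk}}}, \tag{19}$$

where $C_{\mathrm{FFT}}$ is the number of clock cycles required for the two OFDM symbols, $N_{\mathrm{FFT}}$ is the number of FFT processors, and $f_{\mathrm{clk}}$ is the clock frequency. In a similar way, the delay of the LSD is

$$D_{\mathrm{LSD}} = \frac{C_{\mathrm{LSD}}}{N_{\mathrm{LSD}}\, f_{\mathrm{clk}}}, \tag{20}$$

where the cycle count is divided by the number of LSD processors, $N_{\mathrm{LSD}}$, as the LSD can be parallelized among the subcarriers. The QR decomposition processor has two tasks, the QR decomposition and the matrix-vector products, of which the QR decomposition is computed only once in the coherence time, millisecond. Thus, the worst-case delay when both tasks are computed is

$$D_{\mathrm{QR}} = \frac{C_{\mathrm{QR}} + C_{\mathrm{MVP}}}{N_{\mathrm{QR}}\, f_{\mathrm{clk}}}, \tag{21}$$

where $C_{\mathrm{QR}}$ and $C_{\mathrm{MVP}}$ are the clock cycles of the decompositions and the matrix-vector products, and the sum is divided by the number of QR processors, $N_{\mathrm{QR}}$, as the decompositions and multiplications can be parallelized among the subcarriers. For an average delay, $C_{\mathrm{QR\_avg}}$ can be used in a similar way. The delay of turbo decoding is determined by the maximum code block size, 6144. Thus, the delay with six turbo iterations is

$$D_{\mathrm{Turbo}} = \frac{2 \cdot 6 \cdot 6144 \cdot 1.016}{f_{\mathrm{clk}}}, \tag{22}$$

where the factor of two accounts for the two half iterations of each turbo iteration and processing one trellis stage with the turbo decoder TTA processor takes on average 1.016 clock cycles. Distributing the turbo decoding to several processors with block-by-block pipelining would affect only the throughput but not the delay and, therefore, the number of processors is omitted from (22).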
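The turbo decoding delay can be checked numerically. The sketch below assumes that each turbo iteration consists of two half iterations and that every half iteration traverses all trellis stages of the code block once. The 250 MHz clock frequency is a hypothetical value chosen only for the example, and the function names are illustrative.

```python
def turbo_cycles(block_len=6144, iterations=6, cycles_per_stage=1.016):
    """Clock cycles needed to decode one code block.

    Assumption: two half iterations per iteration, each processing
    block_len trellis stages at cycles_per_stage cycles on average.
    """
    half_iterations = 2 * iterations
    return half_iterations * block_len * cycles_per_stage

def turbo_delay_s(f_clk_hz, **kwargs):
    """Decoding delay in seconds at the given clock frequency."""
    return turbo_cycles(**kwargs) / f_clk_hz

# Hypothetical clock frequency, not the one used in Table 1.
delay = turbo_delay_s(250e6)
print(f"{turbo_cycles():.0f} cycles, {delay * 1e6:.1f} us")
```

Because the iterations are strictly sequential, adding processors does not change this figure; only a higher clock frequency or fewer cycles per trellis stage does.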

##### 5.5. TTA Processor Configurations as Function of Clock Frequency

Utilization, delay, and number of processors are analyzed in Figure 4 as functions of clock frequency. The total utilization in Figure 4(a) is always greater than 0.93 in the explored clock frequency range. High utilization is easy to obtain, since the LSD dominates the computational load and can be parallelized with very fine granularity. In other words, since the utilization of the LSD task is always high, the utilization of the whole processing chain is also relatively high. The peaks in the utilization occur when the number of processors of some task can be decremented, in which case the utilization grows. Conversely, if the number of processors remains unchanged and the clock frequency is increased, the utilization decreases. The discontinuities of the delay in Figure 4(b) originate from the same phenomenon. The greatest discontinuity, at 229 MHz, takes place as the QR decomposition is remapped from three to two processors. The number of processors in Figure 4(c) decreases quite steadily, since it is dominated by the LSD task, which requires the largest number of processors.

**Figure 4.** (a) Total utilization, (b) delay, and (c) number of processors as functions of clock frequency.
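The sawtooth shape of the utilization curve can be reproduced with a small frequency sweep. The following sketch uses hypothetical cycle demands for two tasks (not the values behind Figure 4) and integer arithmetic in MHz and microseconds to avoid rounding artifacts at the points where a processor count changes.

```python
def sweep(task_cycles, frame_us, freqs_mhz):
    """Return (f_MHz, total processors, total utilization) per frequency."""
    rows = []
    for f in freqs_mhz:
        budget = f * frame_us  # cycles per processor: MHz * us
        # Integer ceiling division gives the minimum processors per task.
        counts = [-(-c // budget) for c in task_cycles.values()]
        total_n = sum(counts)
        u_tot = sum(task_cycles.values()) / (total_n * budget)
        rows.append((f, total_n, u_tot))
    return rows

# Hypothetical demands per 1000 us frame (assumed for illustration only).
demo_cycles = {"LSD": 2_500_000, "QR_avg": 450_000}
rows = sweep(demo_cycles, frame_us=1000, freqs_mhz=[210, 220, 225])
for f, n_total, u in rows:
    print(f, n_total, round(u, 3))
```

Between 210 and 220 MHz the processor count stays the same and the utilization drops. At 225 MHz the second task fits on one processor fewer, so the utilization jumps back up, which is the same mechanism as the peaks described for Figure 4(a).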

##### 5.6. Analysis

An example configuration of a TTA processor-based baseband processing chain is presented in Table 1. A single clock domain with MHz is applied and the processors have been synthesized with 0.13 µm technology for obtaining complexity and power estimates. The area and power estimates exclude the memories. The power estimates are scaled with the number of respective processors and their utilization in the ninth column of Table 1. The results in Table 1 show that since the LSD task takes only 441 clock cycles per subcarrier and can be computed for each subcarrier independently, the task can easily be divided among several processors to achieve a high utilization. In contrast, it is more difficult to obtain very high utilization for both the FFT and the QR processors with the same clock frequency, as the granularity of these tasks is coarser. As a second remark, the delay of the QR decomposition is long when compared to the other functions, even though the other functions are more complex. However, the QR decomposition must be computed only once in the coherence time millisecond, that is, the delay in Table 1 is the worst-case delay. On average, the delay of the QR decomposition and the matrix-vector products is only 17% of the delay in Table 1.

In principle, the FFT and QR tasks could be mapped to the same processor. The processor should be formed as a hybrid of both processors in this case. Since both functions require complex arithmetic, the same resources could be shared efficiently. With MHz, both tasks could be mapped to two hybrid FFT/QR TTA processors and a utilization, , would be obtained.

Mapping the turbo decoding and some other function to the same processor could not benefit as much from sharing the resources, since the turbo decoding requires mostly real-valued add-compare-select operations. Shortening the delay of the turbo decoding is difficult for two reasons. Firstly, turbo decoding is an iterative process where the previous iteration must be finished before the next one can begin. Secondly, the component decoder applying the radix-2 algorithm processes at maximum one trellis stage in one clock cycle. The next path metrics cannot be computed according to (7) and (8) before the previous ones are computed. For these reasons, increasing the clock frequency or applying the radix-4 algorithm are the only ways to shorten the delay of the turbo decoding task in Table 1.

To illustrate the computational requirements of the baseband processing in more depth, example configurations consisting of other implementations are shown in Tables 2–4. As the respective implementations in Tables 2–4 are not necessarily targeted to the 3G LTE system, or are not designed to operate with each other, Tables 1–4 should not be considered as comparisons of TTA processors and other implementations. Instead, the tables show indicative example configurations of baseband processing chains.

For some implementations, not all the required information is available, or it is given in different units. The area is reported if it has been given in GEs. For some implementations, the performance data is not available for the targeted configuration of 2048-length FFT, antennas, 64-QAM, and list length 16. For this reason, alternative MIMO-OFDM configurations with a lower data rate, 68 Mbps, have been used. Shorter code blocks are assumed for turbo coding in Tables 2 and 3. With shorter code blocks, the delay of the FFT can be limited to processing one OFDM symbol per antenna.

In the configuration in Table 2, the hardware implementations presented in [9] are used for the matrix decomposition and symbol detection. For the FFT and turbo decoding, TI's C6416 DSP has been applied, as it can compute the FFT with an efficient software library routine and it includes a turbo coprocessor which runs at half the clock frequency. Since the core DSP and the turbo coprocessor are mapped to the same device, the number of required processors is determined by the dominating task, that is, turbo decoding. The idling of the DSP core during turbo decoding is taken into account when the utilization in Table 2 is calculated; therefore, the utilization in Table 2 is low even though several processors are required. The hardware implementations for QR and symbol detection in Table 2 are targeted for MIMO-OFDM systems [9]. However, the sphere detector applies a different algorithm than the $K$-best LSD used in the TTA processor implementations.

In Table 3 a 1024-point FFT is applied. The applied turbo decoder processor also supports Viterbi decoding. The list length of the $K$-best LSD is 10 symbols. In principle, a complex-valued $K$-best LSD with 64-QAM, 2 antennas, and must process nodes and with 16-QAM, 4 antennas, and it must process nodes during the symbol detection. Thus, the processing requirements of different symbol detectors can be characterized by the number of visited nodes during a tree traversal of the algorithm. The applied QR decomposition hardware accelerator is presented in [8] as a part of MIMO-OFDM transceiver for WLANs. The decomposition takes 65 clock cycles for matrix.

In Table 4, the workload of 4G baseband processing with 100 Mbps is presented in terms of required execution cycles on a SODA architecture [7]. For each task a realistic clock frequency is assumed and the tasks are divided among separate processors. Furthermore, it is assumed that the LDPC error correction decoding task can be parallelized to several processors. Table 4 shows that the LDPC task clearly dominates the workload.

In conclusion, the results in Tables 2–4 show that in addition to the data rate, the computational requirements depend heavily on the applied algorithms and on the parameters of the algorithms. Furthermore, efficiency in terms of high utilization requires that the tasks can be mapped among the processors or hardware units in a flexible way.

##### 5.7. Memory Requirements

The area estimates in Table 1 exclude the memories, and the memory requirements are reported separately in Table 5. In other words, the area in terms of logic GEs expresses the complexity of the actual computations of baseband processing. The separation eases future comparisons, since the memory requirements depend heavily on the targeted data vector lengths and technology. For example, long code blocks are preferred in turbo decoding, as they enhance the error correction performance. A second reason for separating the memories is that the IPC also requires memories, and therefore the total area with all the memories of the whole baseband processing chain would depend on the implementation method of the IPC.

The data memory requirements in Table 5 show that due to the small matrix size, the QR decomposition requires a very small memory. The LSD processor has no memory requirements at all, as the data is stored in registers. On the other hand, the turbo decoder and the FFT processors require large memories as they have to process long data vectors. The memory of the FFT is divided into two banks and a memory interface hides the banking structure from the programmer, that is, the memory system imitates dual-port memory.

##### 5.8. Interprocessor Communication Requirements

As the analyzed processors lack extra facilities for IPC, only requirements but not costs can be stated. Many IPC methods exist for SoCs, but they are beyond the scope of this paper, which is the complexity of computing the main baseband functions. Therefore, the effects of using some particular method or SoC platform are not considered. In Table 6, the IPC requirements are tabulated for an assumed system using shared memory banks between the processors.

The FFT processor uses an in-place algorithm, that is, the result overwrites the input vector and processing does not require additional memory. However, passing the data to and from the FFT processors requires buffer memories. In practice, there must be an extra input buffer which is written while the data in the main memory is processed in place. In a similar way, there must be an extra output buffer, from which the previous result can be read at the same time. The first two buffers in Table 6 are dedicated to such IPC. The roles of the three memory banks, that is, the input buffer, the output buffer, and the processing memory, can be interchanged after every two completed OFDM symbols.
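The interchange of bank roles can be pictured as a rotation of three banks. The following sketch is a behavioural model only, with hypothetical bank names and a hypothetical class name; it does not describe the memory interface of the actual processors.

```python
from collections import deque

class RotatingBanks:
    """Three memory banks whose roles rotate after each processing period.

    Rotation order: input -> processing -> output -> input.
    """
    ROLES = ("input", "processing", "output")

    def __init__(self, banks=("bank0", "bank1", "bank2")):
        self._banks = deque(banks)

    def roles(self):
        """Map each role to the bank currently holding it."""
        return dict(zip(self.ROLES, self._banks))

    def rotate(self):
        # The filled input bank becomes the processing memory, the
        # processed bank becomes the output buffer, and the drained
        # output buffer is reused as the next input buffer.
        self._banks.rotate(1)

banks = RotatingBanks()
print(banks.roles())
banks.rotate()  # e.g. after two completed OFDM symbols
print(banks.roles())
```

No data is copied when the roles change; only the assignment of banks to roles moves, which is why the scheme needs three equally sized banks.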

The QR processor generates the triangular matrix, $\mathbf{R}$, with 10 nonzero elements and a 4-element vector for each subcarrier. The results are written to one buffer. The other, identical buffer holds the previous results, which are passed to the LSD processors at the same time. Since there are several QR and LSD processors, the buffer must be divided into several banks that can be accessed in parallel. Again, the roles of the buffers can be interchanged on OFDM symbol boundaries.

The turbo decoder processors require an additional input buffer which is filled with the soft bits while the decoders are processing. There is no need for an additional output buffer, since the decoder overwrites the previous output only on the last half iteration. The buffer size of the turbo decoder input in Table 6 allows code rate with the maximum block size. The input word length of the applied turbo decoder TTA processor is 7 bits [29], but all the other applied TTA processors use 16 bits for the real or imaginary parts.
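The word lengths quoted above allow a rough buffer size estimate. The sketch below is a back-of-the-envelope calculation under stated assumptions, not a reproduction of Table 6: it assumes that all 10 matrix elements and 4 vector elements are stored as complex numbers, that one buffer holds the results of 1200 subcarriers (a hypothetical count), and that the turbo decoder input holds three soft bits per information bit (an assumed rate-1/3 code, tail bits ignored).

```python
def qr_buffer_bits(n_subcarriers, r_elements=10, vec_elements=4, part_bits=16):
    """Bits in one QR output buffer.

    Assumption: every element is complex, stored as a real part and an
    imaginary part of part_bits each.
    """
    return n_subcarriers * (r_elements + vec_elements) * 2 * part_bits

def turbo_input_buffer_bits(block_len=6144, soft_bits_per_info_bit=3,
                            soft_bit_width=7):
    """Bits in the turbo decoder input buffer.

    Assumption: soft_bits_per_info_bit = 3 models a rate-1/3 code.
    """
    return block_len * soft_bits_per_info_bit * soft_bit_width

# Hypothetical subcarrier count, used only to make the example concrete.
print(qr_buffer_bits(1200))
print(turbo_input_buffer_bits())
```

Since the buffers are duplicated so that one copy can be written while the other is read, the total IPC memory would be at least twice these figures under the same assumptions.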

In general, the complexity of IPC buffers depends on the sizes of the memory banks, their throughput or clock frequency, and the number of memory banks, as each bank requires interfacing logic. In addition, the IPC also increases the computational load, which is not included in Tables 1–3. Therefore, if a fully functional SoC were designed, full utilization should not be targeted when solely the core computations are analyzed. Instead, with lower utilization, computing capacity would also be reserved for the IPC. Also, the total delay in Tables 1–3 excludes the effect of IPC. As it is assumed that one buffer is written while the other is read in a pipelined fashion, it can be assumed that the IPC has a constant delay.

Since the workloads of the processors depend only on the applied block lengths, static scheduling could be applied, which would ease synchronization of the tasks. Even if the number of processors is very high, in principle, similar IPC requirements would also be met with a smaller number of processors if they applied higher parallelism internally or a higher clock frequency. The first option would require parallel IPC links, and the second option would require a smaller number of IPC links but higher throughput for each link.

#### 6. Conclusions

The main baseband functions of a 3G LTE conforming MIMO-OFDM receiver were considered in this paper, and ASP implementations were assumed for each function. The main emphasis was on the complexity of the actual computations, that is, the data path, of the functions implemented with the ASPs. The complexity was derived by estimating the required number of respective processors and the clock frequency needed to meet real-time requirements. The area and power estimates of the functions processed with the ASPs showed the demands of the baseband processing with the current technology. It was shown that the LSD in particular dominates the computational load. However, due to the fine granularity and convenient parallelization of the LSD, it can be distributed among several processors and high utilization can be achieved. Other processors and hardware accelerators for the addressed functions were also analyzed to further illustrate the computational demands and costs. The IPC requirements were estimated with a block-by-block processing model in which the processors are connected via shared memory banks.

#### Acknowledgment

This work has been supported by the Finnish Funding Agency for Technology and Innovation under research funding decision 40163/07.