#### Abstract

Digital multimedia broadcasting is available in more and more countries in various forms. One of the most successful forms is Digital Video Broadcasting for Terrestrial (DVB-T), which has been deployed in most countries of the world for years. In order to bring digital multimedia broadcasting services to battery-powered handheld receivers in a mobile environment, Digital Video Broadcasting for Handheld (DVB-H) has been formally adopted by ETSI. More advanced and complex digital multimedia broadcasting systems are under development, for example, the next generation of DVB-T, known as DVB-T2. Current commercial DVB-T/H receivers are usually built upon dedicated application-specific integrated circuits (ASICs). However, ASICs are not flexible enough to accommodate evolving standards, and they are less area-efficient overall when a DVB-T/H receiver is integrated into a mobile phone, because they cannot be efficiently reused and shared among different radio standards. This paper presents an example implementation of a DVB-T/H receiver on the prototype of Infineon Technologies' Software-Defined Radio (SDR) platform called MuSIC (Multiple SIMD Cores), which is a DSP-centered and accelerator-assisted architecture and aims at battery-powered mass-market handheld terminals.

#### 1. Introduction

The DVB-T system was developed and agreed in 1997 by the DVB Project [1]. In the last ten years, DVB-T systems have been successfully deployed not only in Europe but also in the rest of the world. Now the next generation of DVB-T called DVB-T2 is under development. In order to bring the digital video broadcasting service to battery-powered mobile receivers, for example, mobile phone handheld devices, the DVB-T standard is extended to Digital Video Broadcasting-Handheld (DVB-H) with additional features such as 4K mode, in-depth interleaving, time-slicing, and additional forward error correction [2].

Recently, more radio systems, such as GSM, WCDMA, HSPA, GPS, FM radio, Bluetooth, WiFi, and DVB-H, have been integrated into mobile handset terminals because these “all in one” mobile terminals give the end user easy “anytime, anywhere” access to information. In fact, the market already demands this kind of terminal and some manufacturers have promptly reacted, for example, Nokia with its N95 and Apple with its iPhone 3G. This creates new challenges for the semiconductor manufacturers, which must meet the stringent requirements for power consumption, silicon area, and time-to-market together with the increasing requirements for throughput and number of coexisting radio standards in a mobile phone terminal. Infineon Technologies provides its innovative SDR platform MuSIC (Multiple SIMD Cores) to meet these requirements. MuSIC is a DSP-centered and accelerator-assisted architecture [3]. The key attraction of the SDR concept is the ability to support multiple standards on the same chip by changing the software only, and the possibility of sharing the processing resources among several standards when they are not executed simultaneously. In this work, we present an example implementation of a DVB-T/H receiver on the prototype of Infineon Technologies’ Software-Defined Radio platform MuSIC, which aims at mass-market handset terminals.

The paper is organized as follows. Section 2 provides a system overview of a DVB-T/H receiver showing the main algorithmic functions comprising the baseband processing chain. The architecture of MuSIC and its programming model are introduced in Section 3. Section 4 investigates the computational requirements of DVB-T/H and its potential for parallelization on MuSIC. In Section 5, some conclusions and hints for the future work are given.

#### 2. DVB-T/H Receiver Algorithms

In this work, we only consider the physical layer processing of the DVB-T/H receiver and omit all analog components, higher layer protocols, and application processing. The functional block diagram in Figure 1 shows a conventional DVB-T/H receiver structure. The RF signal is received by the receiver antenna, downconverted by the tuner circuitry, scaled by an Automatic Gain Control (AGC) circuitry, and then digitized by the Analog-to-Digital Converter (ADC). The baseband processor receives the digitized signal as complex samples from the ADC and delivers the descrambled MPEG transport stream to a higher layer protocol and application processor.

##### 2.1. Synchronization and Fast Fourier Transformation

In a typical receiver, a pre-FFT acquisition stage is required to obtain the OFDM symbol timing, the OFDM symbol length, that is, the size of the Fast Fourier Transformation (FFT), and the cyclic prefix length, where the latter two parameters are adjustable by the transmitter. The principle of the pre-FFT synchronization is based on the availability of the cyclic prefix in the OFDM symbols [4]. In addition, the carrier frequency offset (CFO) existing between the transmitter and the receiver is partly estimated in this functional block, that is, only its fractional part, and compensated in the time domain. Hence, only CFOs which are integer multiples of the subcarrier spacing remain in the OFDM signal. The recovered OFDM symbols are transformed by means of the FFT, which acts as a matched filter for the OFDM signal. The post-FFT synchronization estimates the integer part of the CFO and the initial sampling clock frequency offset (SCFO) in the frequency domain. Both are then compensated in the time-domain OFDM signal in order to reduce Inter-Carrier Interference (ICI).
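The cyclic-prefix principle described above can be illustrated with a minimal sketch. It is not the implementation used on MuSIC: the function names, the toy signal (one noise-free symbol with a 64-point FFT and a 16-sample prefix surrounded by zeros), and the exhaustive search are assumptions made purely for illustration. The correlation between the prefix and the symbol tail peaks at the symbol start, and the phase of the peak yields the fractional CFO.

```python
import cmath
import math
import random


def cp_correlation(samples, start, fft_size, cp_len):
    """Correlate cp_len samples at `start` with the samples fft_size later."""
    return sum(
        samples[start + i].conjugate() * samples[start + i + fft_size]
        for i in range(cp_len)
    )


def estimate_timing_and_fractional_cfo(samples, fft_size, cp_len):
    """Return (symbol start index, fractional CFO in subcarrier spacings)."""
    n_candidates = len(samples) - fft_size - cp_len + 1
    start = max(
        range(n_candidates),
        key=lambda d: abs(cp_correlation(samples, d, fft_size, cp_len)),
    )
    gamma = cp_correlation(samples, start, fft_size, cp_len)
    # A CFO of eps subcarrier spacings rotates the phase by 2*pi*eps over
    # fft_size samples, so only eps in (-0.5, 0.5] can be resolved here.
    return start, cmath.phase(gamma) / (2 * math.pi)


def make_toy_signal(fft_size=64, cp_len=16, offset=10, cfo=0.2, seed=1):
    """Hypothetical test input: one noise-free symbol with prefix, zeros, CFO."""
    rng = random.Random(seed)
    body = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(fft_size)]
    signal = [0j] * offset + body[-cp_len:] + body + [0j] * offset
    return [
        s * cmath.exp(2j * math.pi * cfo * n / fft_size)
        for n, s in enumerate(signal)
    ]


start, eps = estimate_timing_and_fractional_cfo(make_toy_signal(), 64, 16)
print(start, round(eps, 3))
```

Because the phase of the correlation wraps at $\pm\pi$, this estimator cannot see offsets of whole subcarrier spacings, which is why the integer part of the CFO is left to the post-FFT stage.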

After the acquisition phase, the pre-FFT synchronization is inactive, and the post-FFT synchronization is switched to tracking mode. Due to the instability and drift of the oscillator at the receiver, the CFO and SCFO vary during the data reception phase. Therefore, it is necessary to track the small residual SCFO and CFO to ensure the orthogonality of the OFDM subcarriers and accurate timing after the acquisition phase. The detectors of the residual CFO and SCFO are based on the temporal correlation of the OFDM signal in the frequency domain. For the investigated DVB-T/H receiver, the continual pilot signals are used to estimate the residual SCFO and CFO. The principle of the tracking methods is described in [5]. Figure 2 shows an example behavior of the implemented SCFO tracking algorithm under an additive white Gaussian noise (AWGN) channel with 4 dB channel SNR and an SCFO of 30 ppm. Our simulations show that in tracking mode, the residual SCFO and CFO, depending on the reception channel conditions, remain very small after a number of OFDM symbols.

A more detailed analysis of the synchronization strategies for an OFDM-based receiver is provided in [6]. It should be noted that most of the synchronization tasks are only active during the acquisition phase. Because of the use of time-slicing in DVB-H, it is all the more important to minimize the overall synchronization time.

##### 2.2. Channel Estimation, Equalization, and TPS Decoding

After the FFT, the impairment of the transmission channel, especially the Doppler frequency drift caused by high-speed mobility of the handheld terminal, is mitigated through the channel equalization, which is characterized as

$$\hat{X}_{l,k} = \frac{Y_{l,k}}{\hat{H}_{l,k}}, \tag{1}$$

where $k$ is the subcarrier index in the OFDM symbols, and $l$ is the OFDM symbol index in the OFDM frame. $\hat{X}_{l,k}$ and $Y_{l,k}$ are the equalized and received OFDM symbols, respectively. $\hat{H}_{l,k}$ are the coefficients of the channel transfer function (CTF). In order to estimate the coefficients $\hat{H}_{l,k}$, two cascaded interpolation filters are used as proposed in [7].

The principle of the channel estimation is depicted in Figure 3. At the scattered pilot positions, the transmitted value $X_{l,k}$ is known beforehand at the receiver and $Y_{l,k}$ is the received value. Therefore, at a first stage the CTF coefficients $\hat{H}_{l,k}$ for the scattered pilots can be readily estimated using (1). For subcarriers sharing the same index $k$, a 6-tap time-direction interpolation filter is used in which the six closest scattered pilots with the same index $k$ are used as supporting points, thereby obtaining the coefficients $\tilde{H}_{l,k}$. Afterwards, both $\hat{H}_{l,k}$ and $\tilde{H}_{l,k}$ are used as the supporting points in a 12-tap frequency-direction filter to calculate the CTF for the remaining subcarriers. Thus, those positions for which a coefficient $\tilde{H}_{l,k}$ is computed can be seen as “pseudo” pilots, which are marked in Figure 3. The interpolation filters use the scattered pilots in each OFDM symbol as a priori knowledge. Therefore, the positions of the scattered pilots have to be determined before the channel estimation can be started.
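The three classes of subcarriers used by the two interpolation stages can be enumerated with a short sketch. It assumes the scattered-pilot pattern $k = 3\,(l \bmod 4) + 12p$ with active carriers $0 \ldots 6816$ in 8k mode, as we recall it from the DVB-T specification; the function names are hypothetical.

```python
K_MAX_8K = 6816  # assumed highest active carrier index in 8k mode (lowest is 0)


def scattered_pilots(symbol_index, k_max=K_MAX_8K):
    """Carrier indices of the scattered pilots in OFDM symbol `symbol_index`."""
    return list(range(3 * (symbol_index % 4), k_max + 1, 12))


def classify_carriers(symbol_index, k_max=K_MAX_8K):
    """Return (#scattered pilots, #pseudo pilots, #remaining carriers)."""
    pilots = set(scattered_pilots(symbol_index, k_max))
    # Every third carrier holds a scattered pilot in one of four consecutive
    # symbols, so these columns are covered by the time-direction filter.
    pilot_columns = set(range(0, k_max + 1, 3))
    pseudo = pilot_columns - pilots
    remaining = set(range(k_max + 1)) - pilot_columns
    return len(pilots), len(pseudo), len(remaining)


for l in range(4):
    print(l, classify_carriers(l))
```

Under these assumptions, symbols with $l \bmod 4 \neq 0$ contain 568 scattered pilots, 1705 pseudo pilots, and 4544 remaining subcarriers, while symbols with $l \bmod 4 = 0$ contain one scattered pilot more and one pseudo pilot fewer.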

In a regular DVB-T receiver the exact positions of the scattered pilots are determined after decoding the transmission parameter signaling (TPS) bits. The TPS bits also provide information about other system parameters required to demodulate and decode the received signal. They are organized into blocks of 68 bits, and each bit is transmitted on several subcarriers within one OFDM symbol. Hence, 68 symbols, that is, one OFDM frame in DVB-T/H, are required to transmit a whole TPS block. The TPS bits are modulated using Differential Binary Phase Shift Keying (DBPSK) in the time direction. This allows recovery of the bits without channel equalization.
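Why DBPSK in the time direction works without equalization can be seen from a minimal sketch. The number of carriers, the all-ones reference symbol, and the channel gains are toy assumptions, not the values of the standard. Multiplying each TPS carrier by the conjugate of its value in the previous OFDM symbol cancels the unknown channel phase, as long as the channel changes little between consecutive symbols.

```python
import cmath


def dbpsk_encode(bits, n_carriers):
    """Per-symbol TPS carrier values; symbol 0 is the (toy) reference symbol."""
    symbols = [[1 + 0j] * n_carriers]
    for bit in bits:
        sign = -1 if bit else 1  # a 1 flips the phase relative to the last symbol
        symbols.append([sign * value for value in symbols[-1]])
    return symbols


def apply_channel(symbols, gains):
    """Multiply every carrier by its static complex channel gain."""
    return [[h * v for h, v in zip(gains, sym)] for sym in symbols]


def decode_tps_bit(prev_symbol, curr_symbol):
    """Combine all TPS carriers of two consecutive symbols into one decision."""
    metric = sum(
        (c * p.conjugate()).real for p, c in zip(prev_symbol, curr_symbol)
    )
    return 1 if metric < 0 else 0


tps_bits = [0, 1, 1, 0, 1]
gains = [0.8 * cmath.exp(1.0j), 0.3 * cmath.exp(-2.0j), 1.2 * cmath.exp(0.4j)]
received = apply_channel(dbpsk_encode(tps_bits, len(gains)), gains)
decoded = [
    decode_tps_bit(received[i], received[i + 1]) for i in range(len(tps_bits))
]
print(decoded)
```

Summing the metric over all carriers that repeat the same bit weights each carrier by its received power, so strongly attenuated carriers contribute little to the decision.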

The OFDM frame synchronization using TPS bits implies a synchronization time ranging from 17 up to 84 OFDM symbols depending on the very first received OFDM symbol in the frame. However, such a long synchronization time is not acceptable in a DVB-H receiver due to time slicing, where the receiver is allowed to power off when it is inactive leading to significant power savings. To solve this problem a fast scattered pilot synchronization scheme has been proposed in [8], which needs only one OFDM symbol to identify the scattered pilot and is based on the temporally repetitive structure of the scattered pilots. This method also makes it possible to execute the channel estimation and equalization prior to TPS bit decoding, thereby improving the robustness of the TPS decoding itself under extremely bad transmission conditions [9]. After DBPSK decoding of the TPS bits, the frame synchronization is required to determine the position of the respective TPS bit in one TPS block, as the semantics of each TPS bit defined in [10] depends on its position in one TPS block.

##### 2.3. Demapping

With the decoded TPS signal, the constellation parameters used by the transmitter for the QAM Gray mapping are made available to the receiver. The QAM demapper can then recover data bits from the data subcarriers, that is, one QAM symbol is demapped onto one $v$-bit binary word according to the Gray mapping and the constellation parameters, where $v$ equals 2 for QPSK, 4 for 16-QAM, and 6 for 64-QAM. Soft decisions must be available to the Viterbi decoder. They improve the error-correction capability and make the correct interpretation of depunctured bits possible [11]. Every bit of the $v$-bit binary word is represented as one $n$-bit binary word, called a softbit, which represents the signed distance between the constellation point and the decision border. A higher resolution of the soft decision increases not only the gain of the Viterbi decoder but also the implementation complexity. A tradeoff between performance and implementation complexity has to be found by means of simulation. Our simulation results show that a 5-bit resolution is a proper choice.
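The softbit idea can be sketched for 16-QAM as follows. The sketch rests on several assumptions that are not taken from the receiver described here: unnormalized constellation levels $\pm 1, \pm 3$ per axis, a bit order in which the real part carries the first and third bit, the convention that a positive softbit means bit 0, and a quantizer full scale of 4. The function names are hypothetical.

```python
def soft_bits_16qam_axis(x):
    """Signed distances to the decision borders for the two bits on one axis.

    The first bit has its border at 0; the second bit has borders at +/-2.
    """
    return x, abs(x) - 2.0


def quantize(value, n_bits=5, full_scale=4.0):
    """Map `value` to a signed n-bit integer, saturating at the range limits."""
    levels = 1 << (n_bits - 1)  # 16 for 5 bits -> range [-16, 15]
    q = int(round(value / full_scale * levels))
    return max(-levels, min(levels - 1, q))


def demap_16qam(symbol, n_bits=5):
    """Return four quantized softbits for one equalized 16-QAM symbol."""
    re0, re1 = soft_bits_16qam_axis(symbol.real)
    im0, im1 = soft_bits_16qam_axis(symbol.imag)
    return [quantize(v, n_bits) for v in (re0, im0, re1, im1)]


print(demap_16qam(3 + 1j))
print(demap_16qam(10 + 0j))
```

A larger magnitude means a more reliable bit, and a value near zero tells the Viterbi decoder that the bit is close to a decision border. A depunctured bit can be inserted as a softbit of exactly zero.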

##### 2.4. Inner Deinterleaving and Viterbi Decoding

In order to improve the correction capability for long burst errors, several interleaving stages are specified in the DVB-T/H standard. Because of the bit-wise sequential address generation algorithm and the large range of the symbol deinterleaving (e.g., the interleaving over 6048 QAM symbols in 8k mode), a software implementation of the deinterleaving on an SIMD core is not efficient. To avoid the sequential address generation, a lookup table can be used, but it would be too large for on-chip memory. The inner deinterleaving is better done on a scalar processor or in a hardware accelerator. Following the deinterleaver, a Viterbi decoder with soft decision is used to perform convolutional decoding.

##### 2.5. Outer Deinterleaving, Reed-Solomon Decoding, and Descrambling

The bit stream obtained after the convolutional decoding is reorganized as a byte stream and further processed by the outer deinterleaver, thereby improving the long burst correction capabilities of the receiver. Afterwards, the byte stream is processed by the Reed-Solomon decoder and descrambled at byte level.

The functions marked in gray in Figure 1 are core functions demanding most of the computing power and storage capacity and are therefore the critical functions for a software implementation on an SDR platform. We have analyzed their computational requirements and investigated the possibilities for a parallel realization on the MuSIC platform, as it will be discussed in Section 4. To achieve best performance under various network conditions and coverage scenarios, the DVB-T/H standard provides the network operators numerous system parameters which are listed in Table 1. The combinations of these parameters derive a net bit rate from 3.11 Mbit/s at 5 MHz channel bandwidth, QPSK, 1/2 code rate, 1/4 guard interval, up to 31.67 Mbit/s at 8 MHz channel bandwidth, 64-QAM, 7/8 code rate, 1/32 guard interval. The receiver must comply with all the possible combinations of parameters the transmitter may use. At this point it is reasonable to analyze only the worst case scenario, that is, a transmission at 31.67 Mbit/s.

The whole DVB-T/H receiver described has a good system performance. Our simulation results show that for AWGN channel with a channel SNR of more than 4 dB, or for typical urban channel with vehicular speed of 6 km/h (TU6) and a channel SNR of more than 10 dB, user data can be decoded with a bit error rate smaller than .

#### 3. Architecture and Programming Model

The intensive work carried out at Infineon Technologies resulted in a versatile processor architecture which is able to cope with the performance, power, and area requirements of a multistandard SDR approach [12, 13]. This processor mainly consists of a cluster of four single-instruction multiple-data (SIMD) DSP cores (see Figure 4). Each SIMD core contains four processing elements (PEs) and operates with a clock frequency of 300 MHz. To relax the timing requirements for the memory and to resolve pipeline hazards, each core runs four threads which are switched by a fixed time multiplexing mechanism. This is equivalent to running 16 threads at 75 MHz each.

Long instruction words (LIWs) of the PE array contain memory, arithmetic, and communication components. The SIMD core controller is in fact a 32-bit general purpose processor (GP). The GP communicates with the other units via instruction and data FIFOs.

The cluster of the SIMD cores is accompanied by dedicated configurable hardware accelerators for coding/decoding and for filtering operations. In addition, there is an ARM processor for the execution of the protocol stacks. For a more detailed discussion about the MuSIC processor, we refer to [14–16].

The programming model for this architecture is multithreaded programming in C. Wrapped into functions called by threads, the purely data-parallel parts (associated with SIMD cores) are programmed in a data-parallel language extension of C.

To support multithreaded programming, the Infineon Lightweight Operating System (ILTOS) has been developed. It provides the means to create and synchronize threads, to asynchronously send and query messages between them, and to allocate and free shared memory.

Functions to be executed on an SIMD core are written in Data Parallel C Extension (DPCE) language, a superset of the C language [17]. DPCE offers parallel data types and operations on them. A compiler which takes a DPCE source and produces synchronized C code for GP core and DMA transfers (to be translated further by a C compiler for the GP) as well as PE assembly was developed at Infineon. This compiler is not yet optimized, though. To achieve best performance, inline assembly is used for the PE array and explicit DMA configuration. What remains is mainly the C language with some intrinsic functions for PE and DMA control plus an assembly source code library for the PE. Implementations can be done completely without PE and DMA by writing pure C programs. These will then run on the GP core alone. This feature is of importance for testing assembly implementations.

Moreover, a virtual prototype of the entire MuSIC platform based on SystemC has been developed at Infineon. The virtual prototype is a cycle- and bit-accurate software-based simulator. It contains models of all processors, accelerators, busses, memories, and peripherals which will be available in the real hardware. The same software can be run on both the virtual prototype and the real hardware.

#### 4. Computational Requirements for an Implementation on MuSIC

##### 4.1. Fast Fourier Transformation (FFT)

The FFT is the core function of an OFDM receiver. The most commonly used algorithm for FFT calculation is the well-known butterfly algorithm. The computational requirement of the FFT is very high, especially for the large FFT block, which in DVB-T/H can be 8192 complex samples. The theoretical computational complexity of an FFT with $N$ complex samples based on radix-2 algorithms is given as follows:

$$\frac{N}{2}\log_2 N \ \text{complex multiplications}, \qquad N\log_2 N \ \text{complex additions}.$$

If $N$ equals 8192, and 4 instructions are needed for 1 complex multiplication and 2 instructions for 1 complex addition, this leads to $53248 \times 4 + 106496 \times 2 = 425984$ instructions for 8k mode based on radix-2 without data-level parallelism. In the ideal parallel case of 4 data paths, that is, 4 PEs in an SIMD core, the cycle count can be reduced from 425984 to 106496. An example implementation of the 8k FFT on MuSIC was measured. It shows that about 20 percent overhead in cycles is needed for data transfer and temporary storage. This shows that the MuSIC architecture largely exploits the data-level parallelism of the FFT algorithm.
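The instruction count can be reproduced with a few lines. This is only a worked version of the arithmetic in the text, assuming the textbook radix-2 operation counts of $(N/2)\log_2 N$ complex multiplications and $N\log_2 N$ complex additions; the function names are hypothetical.

```python
def radix2_fft_instructions(n, instr_per_cmul=4, instr_per_cadd=2):
    """Instruction count of a radix-2 FFT of size n (n must be a power of two)."""
    stages = n.bit_length() - 1          # log2(n) for a power of two
    complex_muls = (n // 2) * stages
    complex_adds = n * stages
    return complex_muls * instr_per_cmul + complex_adds * instr_per_cadd


def ideal_simd_cycles(instructions, data_paths=4):
    """Cycle count if the work spreads perfectly over the data paths."""
    return instructions // data_paths


total = radix2_fft_instructions(8192)
print(total, ideal_simd_cycles(total))
```

The measured overhead of about 20 percent then has to be added on top of the ideal SIMD cycle count.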

##### 4.2. Channel Estimation and Equalization

Section 2.2 describes the channel estimation in our DVB-T/H receiver. Its computational complexity is analyzed here. Let $\hat{H}_{l,k}$ denote the CTF estimate at subcarrier $k$ of OFDM symbol $l$, and let $\tilde{H}_{l,k}$ denote the time-interpolated estimate at a pseudopilot position. The first step, that is, the interpolation in time direction, is carried out for those subcarriers for which a scattered pilot is available (see Figure 3). It can be described as follows:

$$\tilde{H}_{l,k} = \sum_{i=1}^{6} a_i \, \hat{H}_{l_i,k},$$

where $l_1, \ldots, l_6$ are the six closest OFDM symbols that carry a scattered pilot on subcarrier $k$.

The second step is based on the scattered pilots and the estimations computed in time direction within one OFDM symbol, namely

$$\hat{H}_{l,k} = \sum_{j=1}^{12} b_j \, \bar{H}_{l,k_j},$$

where $\bar{H}_{l,k_j}$ is the supporting point at subcarrier $k_j$, that is, either a scattered-pilot estimate or a time-interpolated estimate, and $a_i$ and $b_j$ are the filter coefficients, which depend on the distance between the supporting point and the estimated point, as also shown in Figure 3. Calculating $\hat{H}_{l,k}$ at a scattered pilot needs 2 real multiplications (RMs). There are 568 scattered pilots in one 8k OFDM symbol. For each pseudopilot $\tilde{H}_{l,k}$, 12 real multiplication-accumulations (rMACs) are required. There are 1705 pseudopilots in one 8k OFDM symbol. 24 rMACs are required for each of the remaining subcarriers to calculate its CTF coefficient $\hat{H}_{l,k}$. There are 4544 such subcarriers in one 8k OFDM symbol. The channel correction requires 2 RMs for every payload subcarrier. Because separating the TPS and continual pilots from the payload subcarriers in software is even more expensive than equalizing them, we equalize all the subcarriers except the scattered pilots, that is, 6249 subcarriers need to be corrected.

In total, the channel estimation and correction need 13634 RMs and 129516 rMACs per OFDM symbol in 8k mode, which equals 143150 instructions without parallel processing. Because the time-direction interpolation is not causal, FIFO buffering is required to delay the input OFDM symbols. In the case of the 6-tap time-direction filter, all subcarriers within 12 OFDM symbols and the CTFs at all scattered pilot positions within 23 OFDM symbols need to be stored, which makes the channel estimation the most memory-demanding function of the DVB-T/H receiver.
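The totals above can be checked with a small worked calculation. The per-carrier costs are those given in the text, except for the cost of one scattered-pilot estimate, which is assumed here to be 2 RMs because this value reproduces the stated total of 13634 RMs. All names are hypothetical.

```python
SCATTERED_PILOTS = 568      # per 8k OFDM symbol
PSEUDO_PILOTS = 1705        # filled by the 6-tap time-direction filter
REMAINING_CARRIERS = 4544   # filled by the 12-tap frequency-direction filter

RM_PER_PILOT_ESTIMATE = 2   # assumption, see lead-in
RM_PER_CORRECTION = 2       # channel correction of one subcarrier
RMAC_PER_PSEUDO_PILOT = 12  # 6 real taps on a complex value
RMAC_PER_REMAINING = 24     # 12 real taps on a complex value


def channel_estimation_ops():
    """Return (real multiplications, real MACs) per 8k OFDM symbol."""
    corrected = PSEUDO_PILOTS + REMAINING_CARRIERS  # all but scattered pilots
    rms = (SCATTERED_PILOTS * RM_PER_PILOT_ESTIMATE
           + corrected * RM_PER_CORRECTION)
    rmacs = (PSEUDO_PILOTS * RMAC_PER_PSEUDO_PILOT
             + REMAINING_CARRIERS * RMAC_PER_REMAINING)
    return rms, rmacs


rms, rmacs = channel_estimation_ops()
print(rms, rmacs, rms + rmacs)
```

The frequency-direction filter accounts for roughly three quarters of the total, so it is the first candidate for any optimization of this function.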

##### 4.3. Demapping

Gray mapping is used in DVB-T/H. The demodulation of one subcarrier, in the case of 64-QAM, needs 8 operations. The quantization of one soft decision needs 3 operations, and in the case of 64-QAM, each payload subcarrier implies 6 soft decisions. So $6048 \times (8 + 6 \times 3) = 157248$ operations are needed for the demapping and quantization of the 6048 payload subcarriers of one 8k OFDM symbol.

The computational complexity of each function is listed in Table 2 as required million operations per second (MOPS), where one OFDM symbol duration is 924 microseconds, which determines the real-time processing requirement. The Viterbi and Reed-Solomon decoders demand enormous computing power and are therefore implemented as hardware accelerators. In this manner the computational requirements can be reduced from formerly 5678 MOPS down to 786 MOPS. For 4 data paths in data-level parallelism the real-time processing requirement is reduced to 197 million cycles per second. With use of thread level parallelism, this computing requirement can be affordably met with 3 of 16 threads at 75 MHz on MuSIC in the case of about 15 percent implementation overhead and can be sufficiently met with 4 threads, that is, one SIMD core, in the case of about 50 percent implementation overhead.
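The relation between operation counts, MOPS, and thread budget can be made explicit with a short calculation. It assumes that the software part consists of the FFT, the channel estimation and correction, and the demapping, and that the demapping covers 6048 payload subcarriers per 8k OFDM symbol. Table 2 may contain further entries, so this is a plausibility check rather than a reproduction of the table. The function names are hypothetical.

```python
SYMBOL_DURATION_US = 924  # 8k mode, 8 MHz channel, guard interval 1/32


def mops(ops_per_symbol, symbol_us=SYMBOL_DURATION_US):
    """Operations per OFDM symbol -> million operations per second."""
    return ops_per_symbol / symbol_us  # ops per microsecond equals MOPS


def headroom(total_mops=786, data_paths=4, threads=3, thread_mhz=75):
    """Fraction of cycles left for implementation overhead."""
    required_mcycles = total_mops / data_paths  # ideal 4-way SIMD
    available_mcycles = threads * thread_mhz
    return available_mcycles / required_mcycles - 1


fft_ops = 425984
channel_ops = 143150
demap_ops = 6048 * (8 + 6 * 3)  # assumed payload carrier count, see lead-in

print(round(mops(fft_ops + channel_ops + demap_ops)))
print(round(headroom(threads=3), 3), round(headroom(threads=4), 3))
```

Under these assumptions, three threads leave a little less than 15 percent of the cycles for overhead and four threads leave a little more than 50 percent, in line with the figures quoted above.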

#### 5. Conclusions

In this paper, we first analyzed the example algorithms of a DVB-T/H receiver in detail, and then gave a brief introduction to the architecture and programming model of the prototype of the Infineon SDR platform MuSIC. Based on the algorithms, the hardware architecture, and our implementation on MuSIC, we estimated and partly measured the computational requirements of the relevant functions for a DVB-T/H receiver on MuSIC. The results show that it is feasible to implement a DVB-T/H receiver on MuSIC with one of the four SIMD cores. It should be noted that the control functions for the respective functional blocks have not been considered so far. They will be studied in future work.

#### Acknowledgments

The authors gratefully acknowledge fruitful discussions with Mirko Sauermann, Mathias Richter, Dominik Langen, Reinhard Rueckriem, and Professor Ulrich Ramacher. This work has been supported by the German BMBF (Bundesministerium für Bildung und Forschung) project MxMobile.