Abstract

Cyclic prefix (CP) in multicarrier modulation systems has been considered as an alternative to the training sequences to track channel estimates. In this paper, two new algorithms are developed that exploit CP from their data detection part and employ systolic block Householder transformation recursive least squares (SBHT-RLS) algorithms for channel tracking in multicarrier systems. The new methods are compared with the existing CP exploiting correlation matrix based block RLS (CMB-RLS) channel tracking approach to outline their relative advantages. Aspects of computational complexity and parallel implementation are addressed, and the algorithms are tested in terms of their channel estimation and tracking capabilities. Performance of the algorithms is also evaluated for varying forgetting factor parameter values, constellation size, and word lengths. Floating-point and fixed-point simulations are tailored to illustrate pertinent tradeoffs.

1. Introduction

Over the last two decades, multicarrier modulation has received considerable interest for its use in wireless and wireline communication systems [14]. It has been adopted in many communication standards, including digital audio broadcasting (DAB) [5], digital video broadcasting (DVB) [6], high-speed modems over digital subscriber lines (xDSLs) [4], and local area mobile wireless broadband [7].

Most multicarrier systems use coherent detection of data symbols, which requires reliable estimation of channel at the receiver. Channel state information is also necessary for techniques such as channel shortening [8], adaptive modulation/loading, and/or power control [9]. In applications such as discrete multitone (DMT) xDSL [4], channel is estimated through some initial training process, and retraining is required to track the channel variation. To avoid the system overhead due to retraining and thus to track the channel more efficiently, in [10], a correlation matrix based block recursive least-squares (CMB-RLS) algorithm is proposed. The algorithm takes advantage of the inherent redundancy introduced by the cyclic prefix (CP) to blindly estimate the channel. In [11], performance of the algorithm is analyzed considering both the effect of channel noise and decision error. The algorithm is further explored in [12], where its performance is analyzed considering the impact of exponential forgetting factor values, constellation size, and channel nulls. Also, in [13], the method is used in single-carrier (SC) modulation with frequency domain equalization (FDE) to maintain both system performance and throughput.

While CP-based CMB-RLS approach is standard complaint, there are two basic problems that make it unsuitable for real-time implementation. First, it relies on computation of inverse of the correlation matrix Φ per time update. The computational cost of performing the required matrix inversion in real time can be prohibitively high for a system with a large channel length (To reduce the computational complexity and thus processing power, this inversion cannot be done recursively using Matrix inversion Lemma (such as in conventional RLS (CRLS) algorithm [14]).). Second, the direct inversion and recursive inversion approaches are known to severely limit parallelism and pipelining that can effectively be applied in the practical implementation.

Usually, to minimize the round off error, matrix inversions are done with general-purpose digital signal processing (DSP) devices/processors using floating-point arithmetic. A disadvantage of this approach, however, is severe processing power limitation due to small number of floating-point processing units commonly available per device. Specialized hardware with high-processing power is therefore required to execute requisite computations in real time. An appealing alternative for implementation is not to do this inversion explicitly and solve the problem through a computationally cheaper approach that works directly with data matrix and is realizable on the systolic array architecture offering large amounts of parallelism for high-speed very large scale integration (VLSI) implementation. In VLSI implementation, floating-point arithmetic units are more complex than those of fixed-point arithmetic, involving extra hardware overhead and more clock cycles [15]. Hence, the bit-level systolic architecture must be implemented with fixed-point arithmetic.

The QR decomposition (QRD) approaches for RLS problem have played an important role in adaptive signal processing, adaptive equalization, and adaptive spectrum estimation [16]. It is generally agreed that QRD-RLS algorithms are one of the most promising RLS algorithms, due to their numerical stability [17, 18] and suitability for VLSI implementation [19, 20]. There are three approaches to QRD-RLS problem, namely, Givens rotation (GR), modified Gram-Schmidt (MGS), and Householder transformation (HT) method. These methods have been successfully applied to the development of the QRD-RLS systolic array [16, 2124]. Because HT generally outperforms GR and MGS methods under finite precision computations (see the references in [16]), and in the context of our application the channel needs to be updated for each block input data matrix, we focus our attention to the QRD-RLS algorithm based on block HT. Notice that HT is a well-known rank-𝑣 update approach and is one of the most efficient methods to compute QRD (Rank-1 updating fast QRD-RLS algorithms (where QRD is updated after the original data matrix has been modified by the addition and deletion of a row or column) [14] are not suitable here in particular due to high throughput (here the term throughput is used to indicate total number of data vectors at the input of the RLS algorithm) and speed requirements.). In [24, 25], Liu et al. investigated one such QRD-RLS algorithm using block HT. The work in [24] describes the block HT implementation on a systolic array and its application to RLS algorithm called systolic block HT-RLS (SBHT-RLS). So far, SBHT-RLS is used in beamforming and linear predication applications but has not been applied for channel tracking in high-throughput multicarrier applications. The algorithm is well known for its computational efficiency, very good numerical properties, and parallel processing implementation advantages.

In this paper, we develop two new CP exploiting SBHT-RLS approaches for adaptive channel estimation in multicarrier systems.

The first approach is based on SBHT-RLS approach of Liu et al.. In its original form, the SBHT-RLS does not provide access to channel weights, as its use has been limited to the problem seeking an estimate of output error signal. In the context of our application, the proposed approach finds the channel explicitly. In order to differentiate between the two techniques, the new method will be referred to as CP-based Direct SBHT-RLS approach. The proposed scheme is computationally efficient and can be mapped to triangular systolic arrays for efficient parallel implementation. Unfortunately, the scheme suffers from a major drawback, namely, back substitution, which is a costly operation to perform in array structure [26, 27].

The second approach relies on inverse factorizations to calculate least squares channel coefficients (weight vector) without back substitution. This approach also employs SBHT to recursively update the channel coefficients and thus preserves the inherent stability property of SBHT-RLS approach. The derivation of the inverse factorization method in this paper is done by generalizing the Extended QRD-RLS algorithm to block RLS case [28]. For this reason, this method will be referred to as CP-based Extended SBHT-RLS approach. We underscore here that this simple and straightforward derivation is different than the previous challenging work on block RLS using inverse factorizations in [29, 30]. Computational complexity of this scheme is equivalent to the first proposed scheme, but unlike the first scheme it is fully amenable to VLSI implementation and also results in improved steady-state performance.

For the sake of brevity, in the rest of this paper, we refer to the CP based CMB-RLS as CPE1, Direct SBHT-RLS as CPE2, and Extended SBHT-RLS as CPE3. Also, for uniformity, we closely follow the notation that appears in [10].

The paper is organized as follows. In the next section, we provide an overview of the DMT system model [10]. Section 3 explains the newly proposed algorithms, followed by a discussion on their computational complexity and systolic array implementation in Section 4. In Section 5, illustrating floating- and fixed-point simulations are conducted, while conclusions are drawn in Section 6. Some results contained in this paper have been presented/accepted for presentation in [31, 32].

Notation 1. ()𝑇,(), and 𝐸[] denote transpose, complex conjugate, and expectation operation. The Matlab notation 𝐗(,𝑚𝑚) is used to to denote the submatrix of 𝐗 that contains the columns 𝑚 to 𝑚. 𝐱(𝑛𝑛) denotes the subvector of 𝐱 comprising of entries 𝑛 through 𝑛. 𝐈𝑛 denotes identity matrix of size 𝑛, 𝟎 denotes the all zeros matrix of appropriate dimensions, and 𝑗=1. The meaning of other variables will be clear from the context.

2. System Model

We consider a high-speed DMT data transmission system over digital subscriber lines, shown in Figure 1. The system has 𝑚/2 complex parallel subchannels and illustrates the typical CP based adaptive channel estimation task, which is our main concern in this paper. Let {𝑠𝑛} represent the data sequence to be transmitted over the channel. This input data is buffered to blocks, and each data block is divided into 𝑚/2 bit streams and then mapped to quadrature amplitude modulation (QAM) constellation points 𝑋𝑖,𝑘,𝑖=0,,𝑚/21 at time 𝑘. After 𝑚-point inverse fast Fourier transform (IFFT) on the 𝑘th DMT block 𝐗𝑘=[𝑋0,𝑘,𝑋1,𝑘,,𝑋𝑚1,𝑘]𝑇 (here the last 𝑚/2 samples are just the conjugates of the first 𝑚/2 samples), the modulated real valued time domain signal is 𝐱𝑘=[𝑥0,𝑘,𝑥1,𝑘,,𝑥𝑚1,𝑘]𝑇. A CP 𝐱𝑘(𝑓)=[𝑥𝑚𝑣,𝑘,,𝑥𝑚1,𝑘]𝑇, where 𝑥𝑖,𝑘=𝑥𝑚𝑖,𝑘 and 𝑖=1,,𝑣, is then appended in front of 𝐱𝑘 before transmission through the channel 𝐻(𝑧)=𝑣𝑙=0𝑙,𝑘𝑧𝑙, having impulse response 𝐡𝑘=[0,𝑘,1,𝑘,,𝑣,𝑘]𝑇 of length 𝑟=𝑣+1. At the receiver, the prefix part 𝐲𝑘(𝑓)=[𝑦𝑣,𝑘,,𝑦1,𝑘]𝑇 is removed.

The relationship between prefix part 𝐲𝑘(𝑓) and the transmitted signal may be expressed as [10]𝐲𝑘(𝑓)=𝐀𝑘𝐡𝑘+𝐧𝑘(𝑓),(1) where 𝐀𝑘=𝑥𝑣,𝑘𝑥𝑚1,𝑘1𝑥𝑚𝑣,𝑘1𝑥1,𝑘𝑥𝑣,𝑘𝑥𝑚1,𝑘1=𝐚0,𝑘,𝐚1,𝑘,,𝐚𝑣,𝑘,(2)𝐚𝑗,𝑘 is the 𝑗th column of 𝐀𝑘, 𝐧𝑘(𝑓)=[𝑛𝑣,𝑘,,𝑛1,𝑘]𝑇, and 𝑛𝑖,𝑘𝒩(0,𝜎2) is the channel noise.

After the FFT operation on 𝐲𝑘=[𝑦0,𝑘,𝑦1,𝑘,,𝑦𝑚1,𝑘]𝑇, the demodulated signal is 𝐘𝑘=[𝑌0,𝑘,𝑌1,𝑘,,𝑌𝑚1,𝑘]𝑇. The CP removes interblock interference (IBI) between 𝐗𝑘’s. The received symbols can thus be written as𝑌𝑖,𝑘=𝑋𝑖,𝑘𝑖,𝑘+𝑁𝑖,𝑘,𝑖=0,,𝑚1,(3) where 𝑖,𝑘=𝑣𝑙=0𝑙,𝑘𝑒𝑗((2𝜋𝑖𝑙)/𝑚) is the channel frequency response and 𝑁𝑖,𝑘=(1/𝑚)𝑚1𝑙=0𝑛𝑙,𝑘𝑒𝑗((2𝜋𝑖𝑙)/𝑚)𝒩(0,𝜎2) is the noise of the 𝑖th subchannel.

To get the estimation of 𝑋𝑖,𝑘 from 𝑌𝑖,𝑘, a one-tap minimum mean square error (MMSE) equalizer 𝑤𝑖,𝑘=(Γ𝑖1/2𝑖,𝑘)/(Γ𝑖𝑖,𝑘2+𝜎2𝑖), where 𝑖=0,,𝑚1 and Γ𝑖=𝐸[𝑋𝑖,𝑘2], is then employed at the 𝑖th channel. The estimated data is then 𝑋𝑖,𝑘=𝑌𝑖,𝑘𝑤𝑖,𝑘. The decision is then made on 𝑋𝑖,𝑘 to get the final output 𝑋𝑖,𝑘𝑋=𝑞(𝑖,𝑘), where 𝑞() is the decision operation.

3. CP-Based SBHT-RLS Algorithms

3.1. CP-Based Direct SBHT-RLS Algorithm (CPE2)

Based on the CP data model (1), we define 𝑛𝑣×𝑟 weighted data matrix and the 𝑛𝑣×1 weighted received vector in a recursive manner as ̇𝐀𝑘𝐀=Λ𝑘(𝑛1)𝐀𝑘𝑛𝐀𝑘=𝜆1/2̇𝐀𝑘1𝐀𝑘̇𝐲,(4)𝑘(𝑓)𝐲=Λ(𝑓)𝑘(𝑛1)𝐲(𝑓)𝑘𝑛𝐲𝑘(𝑓)=𝜆1/2̇𝐲(𝑓)𝑘1𝐲𝑘(𝑓),(5) where Λ is an 𝑛𝑣×𝑛𝑣 block-diagonal forgetting matrix of the form𝜆𝚲=(𝑛1)/2𝐈𝑣𝟎𝟎𝟎𝜆1/2𝐈𝑣𝟎𝟎𝟎𝐈𝑣,(6) with forgetting factor across blocks 0<𝜆1. The forgetting factor 𝜆 is incorporated in the scheme to avoid overflow in the processors as well as to facilitate nonstationary data updating [25].

Suppose that at the (𝑘1)th update we have QRD𝐐𝑘1̇𝐀𝑘1=𝐑𝑘1𝟎,(7) where 𝐐𝑘1 is an (𝑛1)𝑣×(𝑛1)𝑣 orthogonal matrix and 𝐑𝑘1 is a 𝑟×𝑟 upper triangular matrix.

Now by denoting 𝐐𝑘1=𝐐𝑘1𝟎𝟎𝑇𝐈𝑣, we then have𝐐𝑘1̇𝐀𝑘=𝐑𝑘1𝟎𝐀𝑘.(8)

A 𝑛×𝑛 HT matrix 𝐓 is of the form 𝐓=𝐈𝑛𝛽𝐯𝐯𝑇, where 𝛽=2/𝐯𝑇𝐯=2/𝑣2. When a vector 𝐱=[𝑥1,𝑥2,,𝑥𝑛]𝑇 is multiplied by 𝐓, it is reflected in the hyperplane defined by span{𝐯}. Choosing 𝐯=𝐱±𝐱2𝐞1, where 𝐞1=[1,0,0,,0]𝑇, then 𝐱 is reflected onto 𝐞1 by 𝐓 as: 𝐓𝐱=±𝐱2𝐞1.

A series of HTs are then used to zero out 𝐀𝑘 in the right-hand side of (8). Let 𝐇𝑘=𝐇𝑘(𝑟)𝐇𝑘(𝑟1)𝐇𝑘(1) (a sequence of 𝑟-ordered matrix multiplications), where 𝐇𝑘(𝑖) denotes the 𝑖th HT matrix (which zeroes out 𝑖th column of updated 𝐀𝑘) given as𝐇𝑘(𝑖)=𝐇(𝑖)𝑘,11𝟎𝐇(𝑖)𝑘,12𝟎𝐈(𝑛1)𝑣𝑟𝟎𝐇(𝑖)𝑘,21𝟎𝐇(𝑖)𝑘,22,(9) where 𝐇(𝑖)𝑘,11 is 𝑟×𝑟 identity matrix except for the 𝑖th diagonal entry, 𝐇(𝑖)𝑘,12 is 𝑟×𝑣 zero matrix except for the 𝑖th row, 𝐇(𝑖)𝑘,12=𝐇(𝑖)𝑘,21, and 𝐇(𝑖)𝑘,22 is a symmetric 𝑣×𝑣 matrix.

It is thus we have 𝐇𝑘𝐐𝑘1̇𝐀𝑘=𝐑𝑘𝟎 and 𝐐𝑘=𝐇𝑘𝐐𝑘1. Now with𝐐𝑘̇𝐀𝑘̇𝐲𝑘(𝑓)=𝐑𝑘𝐮𝑘𝟎𝐯𝑘,(10) where 𝐮𝑘=[𝑢0,𝑘,𝑢1,𝑘,,𝑢𝑣,𝑘]𝑇 and𝐑𝑘=𝑟(0,0),𝑘𝑟(0,1),𝑘𝑟(0,𝑣),𝑘0𝑟(1,1),𝑘𝑟(1,𝑣),𝑘00𝑟(𝑣,𝑣),𝑘,(11) the optimal solution is thus obtained by solving the upper triangular system 𝐑𝑘̂𝐡𝑘=𝐮𝑘 by back substitution operation as follows:𝑖,𝑘=𝑢𝑖,𝑘𝑣𝑗=𝑖+1𝑟(𝑖,𝑗),𝑘𝑗,𝑘𝑟(𝑖,𝑖),𝑘,𝑖=𝑣,,0.(12)

The matrix ̇𝐀𝑘1 can be uniquely QR factorized only if it is full column rank (i.e., rank ̇𝐀𝑘1=𝑟). Therefore, the minimum number of rows in ̇𝐀𝑘1 must be at least large as the number of columns. To satisfy this requirement and thus to reduce the number of received blocks needed by CPE2 (and CPE3 in Section 3.2), in step (10), we seṫ𝐀𝑘=𝜆1/2𝐑𝑘1𝐀𝑘,̇𝐲𝑘(𝑓)𝐲=Λ(𝑓)𝑘2(𝑣)=𝑦1,𝑘2𝐲(𝑓)𝑘1𝐲𝑘(𝑓)̇𝐲=Λ(𝑓)𝑘1𝐲𝑘(𝑓).(13)

Based on the above discussion, CPE2 algorithm is summarized in Table 1.

3.2. CP-Based Extended SBHT-RLS Algorithm (CPE3)

In this section, we propose an alternative approach by appending one more column to the matrices of CPE2 algorithm. To simplify the derivation, we combine the first column of (10) and the new column to construct the formula𝐐𝑘𝜆𝐑𝑘1𝐑𝑇𝑘1/𝜆𝐀𝑘𝟎𝑇=𝐑𝑘𝐑𝑘𝑇𝟎𝐖𝑘.(14) We next define a lemma, known as the matrix factorization lemma [33] that is very elegant tool in the development of QRD-RLS algorithms.

Lemma 1. If 𝐀 and 𝐁 are any two 𝑁×𝑀(𝑁𝑀) matrices, then 𝐀𝑇𝐀=𝐁𝑇𝐁,(15) if and only if there exists an 𝑁×𝑁 unitary matrix 𝐐(𝐐𝑇𝐐=𝐈) such that 𝐐𝐀=𝐁.(16)

Applying Lemma 1 to (14), we obtain𝐑𝑘𝑇𝐑𝑇𝑘=𝐑𝑇𝑘1𝐑𝑇𝑘1=𝐈𝑟.(17) This shows that 𝐑𝑘𝑇 obtained is the correct inverse transposition of 𝐑𝑇𝑘 and can be updated by using the same orthonormal transformation 𝐐𝑘.

Next, we combine the second column of (10) and the new column to construct the formula𝐐𝑘𝜆̇𝐲(𝑓)𝑘1𝐑𝑇𝑘1/𝜆𝐲𝑘(𝑓)𝟎𝑇=𝐮𝑘𝐑𝑘𝑇𝐯𝑘𝐖𝑇𝑘.(18) Now by applying Lemma 1 to (18) yields𝐑𝑘1𝐮𝑘+𝐖𝑘𝐯𝑘=𝐑1𝑘1𝐮𝑘1.(19) From (19), we establish a simple recursion to compute the channel vector𝐡𝑘=𝐡𝑘1𝐖𝑘𝐯𝑘.(20) This recursion can be written in component form as𝑖,𝑘=𝑖,𝑘1𝐰𝑇𝑖,𝑘𝐯𝑘,𝑖=0,,𝑣,(21) where 𝐰𝑖,𝑘 is the 𝑖th column of the matrix 𝐖𝑇𝑘.

Based on the above discussion, CPE3 is formulated in Table 2.

Remarks
(i) Both algorithms are initialized in a training mode, the algorithms then switch to a decision-directed mode for channel tracking. Note that, in step (1), based on the previous channel estimate ̂𝐡𝑘1, the previous frequency response 𝑘1=[0,𝑘1,,𝑚1,𝑘1] is computed. In step (2), 𝑘1 is then used to compute equalization coefficients. The decision-directed data vector 𝐗𝑘 is then computed in step (3). In step (4), symbol estimates are projected onto the finite alphabet (FA), and the estimated transmitted CP data 𝐱𝑘(𝑓) is obtained by performing partial FFT on the decision-directed projected samples 𝐗𝑘. In steps (5) through (7), the new channel estimate is then obtained by treating the resulting symbol estimates as the known symbols. The process of alternating between channel and symbol estimation steps is applied repeatedly.
(ii) In [29], Sakai has derived a method for extracting weight coefficients based on the inverse factorization method of Pan and Plemmons [34] and Liu’s SBHT-RLS algorithm. The time updating formula for channel coefficients is obtained by first generalizing the inverse factorizations for the block case and then deriving a formula for updating the channel coefficients. The complicated and challenging derivation gets rid of matrix operations by exploiting the relation between a priori and posteriori error vectors. Based on [35] and suggested by its author, Sakai has also presented a simpler derivation for updating the channel vector in [30]. In contrast, in the above discussion, the same result is derived by following a straightforward approach by generalizing the Extended QRD-RLS algorithm of Yang and Bohme [28] to the block RLS case.

4. Computational Complexity and Systolic Array Implementation

4.1. Computational Complexity

The CPE1, CPE2, and CPE3 algorithms are similar in the CP estimation part (i.e., steps (1) through (4)), we therefore compare their complexities in the channel estimation part. The CPE1 channel estimation stage requires 𝒪(𝑟3) computations to update 𝐡. In contrast, due to absence of any matrix inversion as opposed to CPE1, it is possible to implement channel estimation parts of both the algorithms with 𝒪(𝑟2) operations per time update. This indicates that the proposed algorithms are computationally superior than the CPE1.

4.2. Systolic Array Implementation

The detection part of both proposed algorithms (comprising of steps (1) through (4)) is particularly simple for which many efficient systolic array architectures have been proposed. We therefore limit our discussion to possible implementation architectures for channel estimation part of the proposed algorithms.

The systolic array implementation of channel estimation section of CPE2 and its processing cells are shown in Figure 2, where adaptive filtering triangular update part (comprising of step (6)) is realized on a triangular vectorial systolic array as in [24] for 𝐑𝑘 and 𝐮𝑘 extraction. It consists of two sections: the upper triangular array (shown in part (a) of Figure 2), which stores and updates 𝐑𝑘 and the right-hand column of cells (shown in part (b) of Figure 2), which stores and updates 𝐮𝑘. The input data are fed from top and propagate to the bottom of the array. The rotation angles are calculated in left boundary cells, and propagate from left to right. The resulting 𝐑𝑘 and 𝐮𝑘 updates in step (6) are subsequently used in the linear bidirectional systolic array section [19] (shown in part (c) of Figure 2) to obtain the channel estimate using back substitution operation. Unfortunately, a critical obstruction appears because the process of the triangular-updates runs from the upper-left corner to the lower-right corner of the array, while the process of the back substitution runs in exactly the opposite direction. It is therefore pipelining of the two steps (the triangular update and back substitution) that seems impossible on a triangular array. Back substitution may be implemented as a separate operation on a parallel two-dimensional array [36]. Nevertheless, the two-dimensional array can become quite large for long channel lengths, requiring a substantial area for VLSI implementation. On the other hand, comparatively simpler linear array structure shown in Figure 2 is highly sequential, thus involving more time delay due to increased clock cycles to compute the channel coefficients. For these reasons, the back substitution in CPE2 is a costly operation to perform.

The CPE3 approach involves a time recursive QR solution to compute the channel vector 𝐡. The channel estimation part of CPE3 algorithm can be implemented by a fully pipelined rhombic systolic array obtained by combining lower triangular array with an upper triangular array. This implementation has been performed by Sakai in [29, 30] and is reproduced in Figure 3. The components of 𝐑𝑘 are updated in the upper triangular part (a) of Figure 3. Also, the components of 𝐮𝑘 are updated in part (b) in the same fashion as the off-diagonal components of 𝐑𝑘, with the input data 𝐲𝑘 from the top of this column and the output 𝐯𝑘 from the bottom of this column. Notice that systolic implementation in upper section of part (c) is similar to that in part (a), except that the array is now lower triangular, and each element is divided with 𝛽=𝜆 before updating, and the input to the array is provided from the top in the form of a zero vector. A systolic array performing (20) is shown in lower portion of part (c) of Figure 3, where the cells in the bottom line, shown by small circles; perform (20) for calculating the tap coefficients. Each column of the lower triangular array whose cells are shown by diamonds perform 𝐑𝑘𝑇 updating. The cells also calculate each column of 𝐔𝑇𝑘, appearing from the last diamond cell. Notice that due to absence of back substitution, the CPE3 algorithm is rich in parallel operations and therefore leads to more efficient and simple implementation on systolic processors.

5. Simulation Results

In this section, floating-point and fixed-point simulation results are presented to examine and compare the performance of the CPE1, CPE2, and CPE3 approaches. All simulations were carried out in a typical asymmetric digital subscriber line (ADSL) environment with perfect block synchronization, FFT size 𝑚=512, the CP length 𝑣=32, 𝜆=0.75, 𝛿=1𝑒3, and 4-QAM constellation for modulation, unless otherwise stated. For a fair comparison, for CPE1 we set forgetting factor across blocks 𝜇1=𝜆 and forgetting factor within blocks 𝜇2=1. The mismatch performance is evaluated by averaged mean-square-error (MSE) per subchannel err=𝑖𝑈𝑋𝑖𝑋𝑖/|𝑈|, where 𝑈 is the set of indexes corresponding to the 𝑈 used subchannels and |𝑈| is the number of all the used subchannels [10]. The transmit power of all used subchannels is same (i.e., 𝜎2𝑖=𝜎2) and the noise power was set such that SNR= 30 dB (a typical value of SNR in ADSL environments).

The discrete channel impulse response with transfer function 𝐻0(𝐷) for carrier service loop area (CSA) loop # 1 was obtained from the Matlab DMTTEQ Toolbox [37] and sampled at 2.208 MHz. For simulation purposes, the shorter channel was generated by subsampling. 𝐻0(𝐷) was perturbed to obtain another test channel 𝐻1(𝐷) (to mimic small variation in 𝐻0(𝐷)). Corresponding frequency responses for the two test channels are shown in Figure 4. Initially, the channel transfer function is 𝐻0(𝐷), which remains unchanged for the first 400 data blocks. At data block 401, the channel is switched from 𝐻0(𝐷) to 𝐻1(𝐷). For all adaptive schemes, only the first DMT symbol was sent as pure training sequence to identify the initial channel for fast convergence. Also, the inverse of the correlation matrix in CPE1 is initialized to a constant multiple of the identity matrix.

Example 1
Figure 5 shows typical learning curves of the three algorithms, with adaptation factor parameter 𝜆 values of 0.75 (top plots) and 0.55 (bottom plots), under double-precision floating-point implementation (using IEEE standard for floating-point arithmetic (IEEE 754)). It can be seen that all the schemes are able to converge and can track the channel variation. The learning curves of CPE1 and CPE2 are overlaid and both the algorithms converge faster than CPE3. As compared to CPE3, the two algorithms are also seen to have greater uneven performance. In contrast, although CPE3 convergence is slower, it is seen to demonstrate superior steady-state performance. A close examination of CEP2 algorithm shows that the back substitution operation involves decision-feedback computation of channel coefficients. If a channel coefficient suffers from an error, this error weights heavily in the estimation of the next and subsequent channel coefficients. The erroneous estimated channel causes the next detection error. This decision error further propagates and causes subsequent decision errors. Consequently, CPE2 encounters performance loss. In contrast, channel is recursively updated without back substitution in CPE3. CPE3 is therefore seen to yield better performance.
A close observation of top and bottom plots of Figure 5 also indicates that convergence rate and steady-state performance of the three algorithms can be improved by lowering the value of 𝜆. The price paid in growth is uneven performance which can be reduced and thus numerical stability can be improved by increasing the data block size (i.e., with the increased CP length), while the system latency is increased.

Example 2. Without giving a rigorous stability analysis, we verify the stability of the CPE1, CPE2, and CPE3 algorithms experimentally through a long-time simulation with 5×103 data blocks (considerably large number of samples). Corresponding results in Figure 6 show that the three algorithms do not show any sign of divergence and have very stable performance.

Example 3. The more complex the modulation alphabet, the narrower the gap between the symbol decision space and the higher the probability of error in detecting the signal [38]. Since the three algorithms rely on the FA property of source symbols, high-performance degradation is expected as the constellation size increases. It is therefore the three algorithms that may not be suitable for rate adaptation. To verify this, in this simulation example, we repeat Example 1 with 16-QAM and 64-QAM constellation sizes. Corresponding simulation results in Figure 7 show that the three algorithms take the same number of data blocks to converge. However, as expected, their performance degrades when the constellation size is increased.

Example 4. In this section, due to inherent parallelism and thus suitability for fixed-point VLSI implementation, we examine the fixed-point performance of the CPE2 and CPE3 algorithms with 16, 24, and 32 bit data word length 𝑊𝐿 implementations for both data and channel coefficients. These 𝑊𝐿𝑠 are selected as a reasonable approximation as these data lengths are suitable for many applications. For fixed-point simulations, routines in Matlab are developed to mimic the operations of fixed-point arithmetic, and all quantities in the algorithms are represented with finite bits. The fixed-point representation requires 𝑊𝐿 = (𝑙𝑖 bits for integer part) + (𝑙 bits for the fractional part) + (1 bit for sign). For real number 𝑥, its quantized value 𝑥𝑞 is obtained as follows. With 𝑊𝐿 bits, the largest integers that can be represented are ±2(𝑊𝐿1)1. When the value of 𝑥 falls outside the interval [+2(𝑊𝐿1)1,2(𝑊𝐿1)1], the saturation occurs, and the 𝑥𝑞 is then taken as one of the boundary values, +2(𝑊𝐿1)1 or 2(𝑊𝐿1)1. On the other hand, if 𝑥 lies within the interval [+2(𝑊𝐿1)1,2(𝑊𝐿1)1], then the 𝑙𝑖 bits are computed to represent the integer part of 𝑥, and the remaining bits 𝑙 are used to represent the fractional part of 𝑥. It is important to note here that for the above choice of 𝑊𝐿𝑠, the thresholds are sufficiently larger than signal values involved in both the algorithms. The quantizer is therefore always expected to operate on values that are much lesser than the boundary values, and therefore no saturation errors are expected. The only errors that are introduced by finite precision approximations are the round-off errors.
Figure 8 provides performance plots of CPE2 (top) and CPE3 (bottom) with different 𝑊𝐿 choices and floating-point performance. From the performance curves, we infer that both algorithms are able to track the channel without numerical stability issues with 𝑊𝐿𝑠 24 and 32. The performance curves with 𝑊𝐿 of 16 bits indicate unacceptable performance or breakdown caused by quantization errors for both the algorithms. For both the algorithms, increasing 𝑊𝐿 above 24 bits does not result in any improvement and performance curves of their 24-bit and floating-point implementations are overlaid (there is no visible difference). It is therefore 24-bit finite word implementation is a reasonable approximation of their floating-point computation.

6. Conclusion

In this paper, by using numerically robust block HTs, two CP-based adaptive channel estimation algorithms have been presented for multicarrier systems. Conceptually, the new schemes maintain the same spirit of the CP based CMB-RLS channel tracking scheme. More precisely, the basic idea is to utilize CP data from the data detection part for adaptive channel estimation. The new approaches achieve the same purpose by replacing the computationally expensive CMB-RLS channel estimation part with the computationally cheaper SBHT-RLS alternatives. Among the two schemes, the method called CP based Direct SBHT-RLS is based upon Liu’s algorithm in the channel estimation part but adaptively updates channel vector instead of the error vector. The second method called CP based Extended SBHT-RLS is based upon Sakai’s algorithm in the channel estimation part but uses an independent and simpler derivation.

Floating-point performance curves indicate that all the three schemes are able to converge and can track channel variation without any stability problems. CPE1 and CPE2 exhibit identical stable performance, whereas CPE3 outperforms both the CPE1 and CPE2 techniques. In contrast to CPE1, what is remarkable here is that the CPE2 and CPE3 algorithms achieve their performance at lower computational complexity, enhanced parallelism, and pipelining for systolic array/VLSI implementation. All the three algorithms are seen to converge faster and perform better with lower values of forgetting factor parameter 𝜆. Our simulation results suggest that such advantages come at the price of greater uneven performance. Hence, moderate values of forgetting factor would be preferred where a balance in both performance and stablility is required. The three techniques also show reduction in performance with the increase in modulation constellation size. Hence, these techniques are more appealing when the constellation size is small and may not be suitable for rate adaptation. It is also shown that in terms of finite word length behavior, 24-bit finite word implementation is a reasonable approximation of their typical floating-point computation (In practice, the word lengths are optimized with respect to the actual system requirements (i.e., chip area, latency, power consumption, FFT size, throughput), noise, channel length, and desired acceptable performance.).

Systolic array structures that allow efficient parallel implementations of the schemes with VLSI technology in real time were considered. The CPE2 approach is partially concurrent due to costly back substitution operation, whereas, CPE3 approach is highly concurrent due to the absence of back substitution operation and therefore lead to more efficient implementation on systolic processors.

The methods proposed in this paper are well suited for applications where good numerical properties, computational saving, and parallel processing implementation advantages (with improved performance (in case of CPE3 only)) are desired. Although a real baseband DMT case is the main focus of this paper, the proposed approaches can also be applied to the complex baseband case (wireless multicarrier systems). In such case, a further improvement in performance is possible by including forward error correction (FEC) decoding in the reliable reconstruction of transmitted symbols. Future interesting directions include studying hardware implementation problems, fine grain implementation/architecture of processing elements to workout total cost of operators (adders, multipliers, dividers, memory elements (delay elements), etc.) and algorithm latencies, modifications of the schemes to achieve reduced complexity, performance improvement, and stable implementations with reduced word lengths.

Acknowledgment

The author wishes to express his sincere thanks to the reviewers for their constructive comments and useful suggestions towards improving this paper.