Abstract

In this paper, a novel processing-efficient architecture for a group of inexpensive and computationally incapable small platforms is proposed for a parallely distributed adaptive signal processing (PDASP) operation. The proposed architecture runs computationally expensive procedures, such as the complex adaptive recursive least square (RLS) algorithm, cooperatively. The proposed PDASP architecture operates properly even if perfect time alignment among the participating platforms is not available. An RLS algorithm applied to MIMO channel estimation is deployed on the proposed architecture. The complexity and processing time of the PDASP scheme with the MIMO RLS algorithm are compared with those of the sequentially operated MIMO RLS algorithm and the linear Kalman filter. It is observed that the PDASP scheme exhibits a much lower parallel computational complexity than the sequential MIMO RLS algorithm as well as the Kalman filter. Moreover, at a low Doppler rate, the proposed architecture provides a substantial reduction in parallel processing time compared to both the sequentially operated Kalman filter and the MIMO RLS algorithm; likewise, at a high Doppler rate, it entails a reduced processing time compared to the Kalman and RLS algorithms.

1. Introduction

Adaptive filtering techniques play a very important role in the emerging fields of science and technology [1–3], and the last two decades have witnessed tremendous research in the field of adaptive filtering aimed at improving convergence and complexity requirements [4–6]. However, achieving fast convergence on an energy-constrained and computationally incapable platform remains elusive in spite of the remarkable advancements in Integrated Circuit (IC) technologies. For instance, echo cancellation in video conferencing requires a high definition adaptive filtering algorithm to achieve robust convergence while tracking the time varying uncertainties present in the communication link. Nevertheless, such a high definition adaptive algorithm cannot be run on an energy-constrained, computationally incapable, and inexpensive platform. The following lines present a brief review of the literature in which significant efforts have been made to propose low complexity and distributed solutions to this problem.

In [7, 8], the Banachiewicz inversion formula is used to perform the matrix inversion for MIMO-OFDM based software defined radio (SDR) signal detection. The inversion of a matrix is divided into four submatrices, which reduces the number of computational operations. Likewise, the authors in [9] derive a low complexity algorithm for Hermitian positive-definite recursive matrix inversion that provides lower computational complexity than [7, 8] by reducing the number of operations needed to find the matrix inverse. However, the matrix inversion concept of [7–9] does not exhibit a significant impact on the computational cost of high definition adaptive filtering [10]. In [11], Xiao et al. introduce an LR-MMSE algorithm based on QR decomposition and complex lattice reduction (CLR) which provides lower computational complexity than the MMSE based scheme [12]; however, the reduced computational cost still cannot meet the demands of high data rate communications. A low complexity reduced rank linear interference suppression scheme is proposed in [13, 14], based on the use of polynomial expansion (PE). The matrix inverse is represented by a matrix polynomial of a given order [13, 15–17], and selecting this order is a tedious task involving a trade-off between complexity and detection performance. In [18], a comparison is made among renowned subband adaptive filtering (SAF) structures with a parallel arrangement of multirate filter banks. The SAF technique exhibits reduced complexity through the use of the least mean square (LMS) adaptive filtering algorithm in an acoustic noise environment. However, due to phase, aliasing, and amplitude distortions and the extra processing delay, these systems may be ruled out for real-time implementation. Another architectural configuration for reducing runtime is MMSE signal estimation using wireless sensor nodes [19]. In this architecture, the authors use the distributed adaptive node-specific signal estimation (DANSE) technique to estimate the channel coefficients by following the Wiener-Hopf equation. However, the DANSE technique only follows the MMSE criterion rather than running an adaptive filtering algorithm, which makes DANSE incapable of tracking time varying channel conditions. To the best of our knowledge, there is no parallel structure of recursive adaptive filtering algorithms in the literature where a complex adaptive algorithm runs in a parallel fashion over computationally incapable platforms without perfect time alignment. Nevertheless, software parallelism is provided by MATLAB [20] and LabVIEW [21], which are available on various architectures of present-day fast processing computers with perfectly time-aligned processors. These software packages provide parallel processing toolboxes to divide large problems into smaller computations, hence requiring reduced running time. Likewise, a graphics processing unit (GPU) enables running high definition graphics on a personal computer (PC) by exploiting hundreds of cores [22]. Furthermore, the compute unified device architecture (CUDA) [23] is NVIDIA's GPU architecture which supports multithreaded applications in which cores can communicate and exchange information with each other. However, these cores have not been used to run adaptive algorithms in parallel with nonaligned time indexes.

In this paper, our objective is to provide a novel low complexity solution for inexpensive and computationally incapable platforms by proposing a parallely distributed adaptive signal processing (PDASP) operation that makes them run computationally expensive procedures cooperatively.

The implementation of the proposed PDASP technique using recursive least square (RLS) filtering makes inexpensive and computationally incapable platforms work in parallel, even with nonaligned time indexes, while providing a much lower parallel processing time than the sequential Kalman [24] and RLS [25, 26] algorithms. RLS is a special case of the Kalman filter and is one of the most popular filters in the adaptive filtering domain; it offers a superior convergence rate, especially in time varying environments, at the price of an increased nonlinear computational cost.

The idea behind the distributed signal processing is based on parallel processing and the "divide and conquer" paradigm. In this paradigm, the algorithm is divided into subalgorithms which are then passed to other processing nodes to provide an efficient low complexity solution.

The rest of the paper is organized in the following manner. Section 2 describes the system model. Section 3 presents the proposed PDASP scheme for computationally incapable platforms. Complexity analysis is introduced in Section 4. In Section 5, simulation based results are presented and Section 6 draws the conclusions.

2. System Model

In this section, we discuss the working procedure of our proposed parallely operated recursive least square (RLS) filter in the light of its conventional sequential operation.

In the conventional RLS adaptive algorithm and its variants, all filter subparts are interdependent and operate sequentially. Before introducing the proposed parallel RLS operation over individual platforms with different clock systems, we define some timing variables, illustrated in Figure 1, where a single iteration of an RLS algorithm consists of sequential blocks.

(i) Computational Time. This is the time taken by the processor for a single computation. It can be calculated simply from the speed of the processor.

(ii) Block Processing Time. This is the processing time of one block of the algorithm. It depends on the number of computations involved in the block and can thus be a multiple of the computational time.

(iii) Fetch Time. This is the time in which one block fetches information from another block, usually its predecessor.

(iv) Algorithm Step Time. This is the processing time of a complete iteration of the algorithm; the sketch after this list summarizes how these timing quantities relate.
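As a minimal sketch, with assumed symbols that do not appear in the original text ($T_c$ for the computational time, $T_{b,i}$ and $T_{f,i}$ for the processing and fetch times of block $i$, $n_i$ for its number of computations, and $T_s$ for the algorithm step time), these quantities relate as

$$T_{b,i} = n_i\,T_c, \qquad T_s = \sum_{i}\bigl(T_{f,i} + T_{b,i}\bigr).$$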

If RLS filtering is operated on a single computationally capable platform, all algorithm blocks are executed sequentially, as shown in Figure 1, with a single fetch time between consecutive blocks. However, if the same RLS filtering is operated on a group of computationally incapable platforms using the proposed PDASP architecture, different algorithm blocks are executed parallely on various individual platforms with varying fetch times, depending upon the media among the nodes, as shown in Figure 2.

The only possible way to operate the RLS algorithm in a parallel fashion on individual platforms with different clock systems is to allow the time indexes to be nonaligned. While allowing this time nonalignment, two things must be taken into account. First, it must be ensured that the filter does not show any uncertain behavior in whatever application it is implemented. Secondly, all the filter subparts must be able to work in a parallel manner with favorable fetch times relative to the block processing times. In this way, the sequential structure is able to work parallely even with nonaligned time indexes. In Figure 2, the cooperative parallely operated RLS filtering architecture consists of four processing nodes, which are pairwise interlinked as shown in the figure while also being connected to themselves. All the processing nodes first share information with one another and then work out the desired process. The processing time of each block differs from the others and is known to all nodes; therefore, every processing node that completes its task earlier than the others waits until the processing of the block with the maximum processing time ends. In this way, the inexpensive and computationally incapable platforms work in parallel towards a combined goal.
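To make the wait-for-the-slowest-block behavior concrete, the following is a minimal Python sketch, not the authors' implementation; the block names, the timings, and the use of a thread barrier are illustrative assumptions.

import threading
import time

# Illustrative per-block processing times (seconds); in the PDASP
# architecture these are known in advance to all nodes.
BLOCK_TIMES = {"covariance": 0.004, "gain": 0.003,
               "estimate": 0.001, "update": 0.002}

# A barrier makes every node wait until the slowest block of the current
# step has finished, emulating the "wait for the block of maximum
# processing time" rule described above.
barrier = threading.Barrier(len(BLOCK_TIMES))

def node(block_name, steps=3):
    for step in range(steps):
        time.sleep(BLOCK_TIMES[block_name])   # emulate block processing
        # ...exchange results with the linked nodes here (fetch time)...
        barrier.wait()                        # idle until the slowest block ends

threads = [threading.Thread(target=node, args=(name,)) for name in BLOCK_TIMES]
for t in threads:
    t.start()
for t in threads:
    t.join()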

3. Proposed PDASP Technique for RLS with Nonaligned Time Indexes

In adaptive filtering, all the filter subparts are interdependent on one another. Due to this cascaded fashion, the algorithm accumulates the processing times of all subparts while attaining its convergence under uncertain channel conditions. By using the PDASP technique, the RLS algorithm runs in a parallel manner even with nonaligned time indexes while providing a low parallel processing time at each machine or processing node. The flow diagram of the PDASP technique using RLS is shown in Figure 3, where each block is labeled with the time used for the computation performed inside it. Let the error covariance matrix update, the Kalman gain computation, the received signal estimation, the estimation error computation, and the filter coefficient matrix update each have their own block processing time.
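For reference, the five blocks listed above correspond to the standard RLS recursions; the notation below ($\mathbf{P}$ for the error covariance matrix, $\mathbf{k}$ for the Kalman gain, $\mathbf{W}$ for the filter coefficient matrix, $\lambda$ for the forgetting factor, and $\mathbf{x}(n)$, $\mathbf{d}(n)$ for the input and desired signals) is assumed here and need not match the symbols of the original figure:

$$\mathbf{k}(n) = \frac{\mathbf{P}(n-1)\,\mathbf{x}(n)}{\lambda + \mathbf{x}^{H}(n)\,\mathbf{P}(n-1)\,\mathbf{x}(n)}, \qquad \hat{\mathbf{d}}(n) = \mathbf{W}^{H}(n-1)\,\mathbf{x}(n), \qquad \mathbf{e}(n) = \mathbf{d}(n) - \hat{\mathbf{d}}(n),$$
$$\mathbf{W}(n) = \mathbf{W}(n-1) + \mathbf{k}(n)\,\mathbf{e}^{H}(n), \qquad \mathbf{P}(n) = \lambda^{-1}\bigl[\mathbf{P}(n-1) - \mathbf{k}(n)\,\mathbf{x}^{H}(n)\,\mathbf{P}(n-1)\bigr].$$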

Therefore, the total time taken by the whole algorithm running in cascaded fashion is the sum of these block processing times together with the associated fetch times.
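Using the assumed notation from the recursions above ($T_{\mathbf{P}}$, $T_{\mathbf{k}}$, $T_{\hat{\mathbf{d}}}$, $T_{\mathbf{e}}$, and $T_{\mathbf{W}}$ for the block processing times and $T_f$ for a fetch time), a sketch of this total is

$$T_{\text{alg}} = T_{\mathbf{P}} + T_{\mathbf{k}} + T_{\hat{\mathbf{d}}} + T_{\mathbf{e}} + T_{\mathbf{W}} + \sum T_f.$$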

The maximum of these block processing times is that of the error covariance matrix update, since it involves more multiplications than any of the other blocks. In order to operate the RLS algorithm parallely while distributing the various blocks over individual nodes with nonaligned time indexes, the strict and sufficient conditions for a fast convergence rate, in terms of multiplication and addition computations, can thus be written as a constraint on the block processing and fetch times.
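One plausible form of such a condition, stated only as a sketch in the assumed notation, is that the parallel step time, bounded by the slowest block plus its fetch time, stays below the sequential sum:

$$\max_i T_{f,i} + \max\{T_{\mathbf{P}}, T_{\mathbf{k}}, T_{\hat{\mathbf{d}}}, T_{\mathbf{e}}, T_{\mathbf{W}}\} \;<\; T_{\mathbf{P}} + T_{\mathbf{k}} + T_{\hat{\mathbf{d}}} + T_{\mathbf{e}} + T_{\mathbf{W}}.$$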

Due to the nonaligned time indexes, the mismatch between the aligned and nonaligned operation can be written as the difference between the error of the sequential algorithm and the error of the PDASP algorithm with nonaligned time indexes. The proposed architecture can also be run in a sequential format for convergence calibration. The sequential implementation of the PDASP RLS algorithm with nonaligned time indexes is nearly the same as that of a conventional RLS algorithm run on a single machine. Steps of this sequential format are shown in Algorithm 1.
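As a sketch, writing $e_{\text{seq}}(n)$ for the error of the sequential algorithm and $e_{\text{pdasp}}(n)$ for the error of the PDASP algorithm (assumed symbols), the mismatch could read

$$\Delta e(n) = e_{\text{seq}}(n) - e_{\text{pdasp}}(n).$$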

Initialize:
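The full listing of Algorithm 1 is not reproduced above; the following Python sketch of one sequential MIMO RLS iteration is an illustration based on the standard recursions quoted in Section 3, not the authors' exact pseudocode, and the function and variable names (rls_init, rls_step, lam) are hypothetical.

import numpy as np

def rls_init(num_tx, num_rx, delta=100.0):
    """Initialize filter coefficients W and error covariance P."""
    W = np.zeros((num_tx, num_rx), dtype=complex)   # MIMO channel estimate
    P = delta * np.eye(num_tx, dtype=complex)       # large initial covariance
    return W, P

def rls_step(W, P, x, d, lam=0.99):
    """One sequential RLS iteration: gain, estimate, error, update, covariance."""
    Px = P @ x                                       # P(n-1) x(n)
    k = Px / (lam + np.conj(x) @ Px)                 # Kalman gain vector
    d_hat = W.conj().T @ x                           # estimated received signal
    e = d - d_hat                                    # estimation error
    W = W + np.outer(k, np.conj(e))                  # coefficient update
    P = (P - np.outer(k, np.conj(x) @ P)) / lam      # covariance update
    return W, P, e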

4. Complexity Analysis

The linear Kalman filter requires a number of multiplications and additions per iteration that grows with the dimension of the filter order. Likewise, the RLS algorithm, which is a special case of the Kalman filter, entails its own count of multiplications and additions per iteration. The implementation of the proposed PDASP technique on the RLS algorithm exhibits a much lower computational cost for each parallely distributed entity block, since the technique with nonaligned time indexes entails, at maximum, only the multiplications and additions of a single block per iteration on any node. The proposed parallel technique thus requires a much lower processing time than the sequential Kalman and RLS algorithms.
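The exact operation counts are not reproduced here; as a rough, assumption-laden sketch based on the standard RLS recursions quoted earlier (and not on the paper's own figures), one can tally the multiplications of each block for a filter of order M and see that the error covariance update dominates, so the maximum per-node cost in the PDASP arrangement stays near the cost of that single block rather than the full sequential sum.

def rls_block_multiplications(M):
    """Rough multiplication counts per block for one RLS iteration of order M
    (derived from the standard recursions; illustrative, not the paper's figures)."""
    counts = {
        "gain":       M * M + 2 * M,   # P x, x^H (P x), division by scalar
        "estimate":   M,               # W^H x (per output)
        "error":      0,               # subtraction only
        "update":     M,               # k e^H (per output)
        "covariance": 3 * M * M,       # x^H P, k (x^H P), scaling by 1/lambda
    }
    counts["sequential_total"] = sum(counts.values())
    counts["pdasp_max_block"] = max(v for name, v in counts.items()
                                    if name != "sequential_total")
    return counts

print(rls_block_multiplications(4))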

5. Simulation Results

In this section, Monte Carlo simulations with binary phase shift keying (BPSK) are performed on a MIMO communication system to validate the proposed PDASP architecture. The forgetting factor is set to the same value for both the proposed PDASP and the sequential RLS algorithms.
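As a usage illustration only, the following sketch wires the hypothetical rls_init/rls_step functions above into a BPSK Monte Carlo loop; the 4 × 4 MIMO size matches the 4 × 4 matrix mentioned later in this section, while the forgetting factor, SNR, and iteration count are illustrative assumptions rather than the paper's settings.

import numpy as np

rng = np.random.default_rng(0)
num_tx, num_rx, iters, snr_db = 4, 4, 500, 20

# Time-invariant true channel for this simple sketch.
H = (rng.standard_normal((num_tx, num_rx)) +
     1j * rng.standard_normal((num_tx, num_rx))) / np.sqrt(2)
noise_std = 10 ** (-snr_db / 20)

W, P = rls_init(num_tx, num_rx)
mse = []
for n in range(iters):
    x = rng.choice([-1.0, 1.0], size=num_tx).astype(complex)   # BPSK symbols
    d = H.conj().T @ x + noise_std * (rng.standard_normal(num_rx)
                                      + 1j * rng.standard_normal(num_rx))
    W, P, e = rls_step(W, P, x, d, lam=0.99)
    mse.append(np.mean(np.abs(e) ** 2))                        # MSE learning curve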

The proposed parallel technique implemented on MIMO RLS is then compared with the sequential MIMO RLS adaptive algorithm and the Kalman filter in terms of computational complexity, mean square error (MSE), and processing time with nonaligned time indexes. The proposed PDASP scheme is implemented using the MIMO RLS algorithm, and its performance in terms of computational complexity is compared with the sequentially operated, nondistributed Kalman and RLS adaptive filtering algorithms. The parallel technique provides a much lower parallel computational complexity than the sequential Kalman and MIMO RLS algorithms. Figure 4 presents the multiplication and addition complexity comparison of the proposed PDASP technique and the sequentially operated Kalman and MIMO RLS algorithms. It is observed that the proposed PDASP technique using nonaligned time indexes provides a much lower parallel multiplication and addition complexity than the sequential Kalman and MIMO RLS algorithms. Figures 5 and 6 show the MSE performance at a low Doppler rate and a high Doppler rate, respectively, and Figure 7 shows the MSE difference between the proposed PDASP MIMO RLS and the sequentially operated algorithms. It is observed that the difference in convergence performance between the proposed PDASP scheme run with nonaligned time indexes and the sequential Kalman and MIMO RLS algorithms amounts to only a small number of iterations at the low Doppler spread and a somewhat larger number at the relatively high Doppler spread. Considering the difference in Figure 7, it can be seen that, due to the initialization of the algorithm parameters of the PDASP technique, the error difference, which is small at the start, gradually increases, then decreases again, and eventually becomes zero after a certain number of iterations.

The Fast Ethernet speed of 125 Mbits/s is taken as the reference peak bit rate for the wired communication. In the MIMO PDASP scheme, a matrix of at most 4 × 4 entries is to be transmitted from one machine node to another through wired communication. Each entry of the MIMO matrix consists of four bytes, of which two significant bytes are before the decimal point and two significant bytes are after the decimal point. The fetch time corresponding to the resulting number of bits for the 4 × 4 matrix amounts to a few microseconds. Therefore, the processing time comparison and the processing time difference at this fetch time among the sequential algorithms and the proposed PDASP technique are presented in Figures 8 and 9, respectively. It is clear that the proposed PDASP technique provides a much lower processing time than the sequentially operated Kalman filter and MIMO RLS algorithm. The percentage improvement in processing time is shown in Tables 1 and 2. At the low Doppler rate, the proposed PDASP MIMO RLS algorithm converges with the addition of the fetch time at each iteration but still utilizes less processing time than the sequential Kalman and MIMO RLS algorithms. Likewise, at the high Doppler rate, the proposed PDASP MIMO RLS takes additional iterations for its convergence compared to the sequential Kalman and MIMO RLS algorithms, yet it still entails less processing time than the sequentially operated Kalman and MIMO RLS algorithms.
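As a worked check of the fetch time, using only the quantities stated above (a 4 × 4 matrix, 4 bytes per entry, and a 125 Mbit/s link):

$$16 \times 4 \times 8 = 512 \ \text{bits}, \qquad T_f = \frac{512 \ \text{bits}}{125 \times 10^{6} \ \text{bits/s}} \approx 4.1\ \mu\text{s}.$$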

6. Conclusions

In this paper, a novel low complexity architecture for the parallely distributed adaptive signal processing (PDASP) operation of inexpensive and computationally incapable small platforms has been proposed. The proposed architecture makes such inexpensive and computationally incapable devices run computationally expensive procedures, such as the complex adaptive Kalman and RLS algorithms, cooperatively. The operation of the proposed PDASP architecture has been evaluated in the presence of time nonalignment in the execution of its parallel block entities. The complexity and processing time of the proposed PDASP scheme with the RLS algorithm have been compared with those of the sequentially operated Kalman and RLS algorithms. It has been observed that the PDASP scheme exhibits a much lower parallel computational complexity and processing time than the sequentially operated Kalman and RLS algorithms. The proposed PDASP technique with nonaligned time indexes requires, in parallel, at most the multiplications and additions of a single block per iteration. Likewise, the proposed technique utilizes less processing time than the sequential Kalman and MIMO RLS algorithms at the low Doppler rate, and it similarly entails a reduced processing time compared to the sequentially operated Kalman and MIMO RLS algorithms at the high Doppler rate. In a nutshell, the processing time and parallel complexity of the proposed PDASP based MIMO RLS scheme have been observed to be much lower than those of the Kalman and RLS adaptive filtering algorithms when these are operated sequentially on a single unit.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.