Abstract

An offline single channel acoustic echo cancellation (AEC) scheme based on the gradient based adaptive least mean squares (LMS) algorithm is proposed, targeting a major practical application of echo cancellation systems: enhancing recorded echo corrupted speech data. The unavailability of a reference signal makes single channel adaptive echo cancellation extremely difficult. Moreover, continuous feedback of the echo corrupted signal to the input microphone can significantly degrade the quality of the original speech signal and may even result in howling. To overcome these problems, the proposed scheme uses a delayed version of the echo corrupted speech signal as the reference. An objective function is formulated accordingly and a modified LMS update equation is derived, which is shown to converge to the optimum Wiener-Hopf solution. The performance of the proposed method is evaluated using both subjective and objective measures through extensive experimentation on several real-life echo corrupted signals, and very satisfactory performance is obtained.

1. Introduction

The phenomenon of acoustic echo occurs when the output speech signal from a loudspeaker gets reflected from different surfaces, like ceilings, walls, and floors, attenuated, and then fed back to the microphone. In real-life scenarios, such as a lecture in a large conference hall or in the public address system of a trade fair, acoustic echo is a very common phenomenon [1, 2]. Severe echo in these scenarios may degrade the quality of the speech signal to a great extent leading to complete loss of intelligibility and thereby cause public annoyance and produce severe sound pollution. A continuous acoustic feedback of a significant proportion of the sound energy transmitted by the loudspeaker back to the microphone may result in extreme howling [1, 3].

An acoustic echo canceller is usually incorporated in the design of a communication channel or a conference room environment. As communication links are mostly dual channel, adaptive filter algorithms, which in principle require two separate channels, are widely used for acoustic echo cancellation (AEC) in communication links [4–8]. In these AEC systems, the near-end signal, which is available at hand, is fed to the adaptive filter as a reference to cancel the far-end echoed signal [9]. Here, the channel acoustic response parameters are updated adaptively to produce an estimate of the echo. Among different adaptive filter algorithms, the gradient based least mean squares (LMS) algorithm and its modifications, such as the normalized LMS (NLMS) and variable step size LMS (VLMS) algorithms, are widely used because of their satisfactory performance, low computational burden, and ease of implementation [2, 9, 10]. A faster converging algorithm is the recursive least squares (RLS) algorithm, which is, however, computationally expensive [9].

A rather different problem of single channel echo cancellation arises in large room environments where there is only one available channel for signal transmission [1]. The echo corrupted output of the loudspeakers may get echoed again and added to the input speech, producing extreme corruption of the speech data. In this case, handling the AEC problem with an adaptive filter is much more difficult, as a desired reference signal is not available [11]. Instead, echo suppressors, earphones, and directional microphones have conventionally been used to combat single channel acoustic echo in large room environments. However, these instruments generally place restrictions on the talkers' movement [3]. Acoustic engineers also employ different manual measures, such as sound absorbers and manually tuned filters, to suppress room echoes. Some digital filtering based approaches are also found in the literature [12–14]. A more critical problem is to cancel echo from recorded echo corrupted speech data, for example, speech recorded in a conference room environment. In this case, the echo cancellation operation needs to be carried out in offline mode, where the only available data is the highly echo corrupted speech signal. Moreover, the level of echo corruption is very severe, as the echo corrupted output of the loudspeaker is echoed back to the microphone again and again. The problem of developing an offline single channel acoustic echo canceller based on adaptive filter algorithms is rarely addressed in the literature.

Considering the unavailability of a separate channel, an offline single channel AEC scheme based on the gradient based adaptive LMS algorithm is developed in this paper. Assuming the worst realistic case of continuous feedback of echo corrupted sound from the loudspeaker to the microphone, a delayed version of the echo corrupted signal is utilized as the reference to the adaptive filter for cancelling echo from the current sample. The major advantage of this procedure is that effective echo suppression is achieved without requiring a separate reference channel. Starting from a modified objective function, the LMS update equation is derived and convergence of the resulting update to the optimum Wiener-Hopf solution is proved. Finally, considering various acoustic environments, the echo cancellation performance of the proposed method is evaluated in terms of both subjective and objective measures.

2. Proposed Echo Cancellation Method

A simplified block diagram representing the echo generation process in an acoustic room environment is shown in Figure 1. In this figure, the $n$th samples of the input speech and echo are denoted by $s(n)$ and $x(n)$, respectively, and the corresponding echo corrupted speech is written as

$$y(n) = s(n) + x(n). \tag{1}$$

It is evident from the figure that the echo $x(n)$ is actually a reflected and attenuated version of $y(n)$. For an echo generating acoustic environment, the conventional practice is to model the room response as a finite impulse response (FIR) filter with a predefined flat delay $D$. The minimum time required for a speech sample to travel directly from the loudspeaker to the microphone is termed the flat delay [15]. Therefore, the coefficients of the FIR filter representing the flat delay portion of the room response can be assumed to be zero. Now, if the room response filter is considered to have $M$ unknown coefficients, the echo produced by the acoustic environment can be expressed as

$$x(n) = \begin{cases} 0, & n < D, \\ \mathbf{a}^T \mathbf{s}_D(n), & D \le n < 2D, \\ \mathbf{a}^T \mathbf{y}_D(n), & n \ge 2D, \end{cases} \tag{2}$$

where $\mathbf{y}_D(n) = [y(n-D), y(n-D-1), \ldots, y(n-D-M+1)]^T$ is a vector of previous values of $y(n)$ with predefined flat delay $D$, $\mathbf{s}_D(n)$ is the corresponding vector of previous values of $s(n)$, and $\mathbf{a} = [a_1, a_2, \ldots, a_M]^T$ is the vector of the unknown room response filter coefficients. The number of unknown attenuation coefficients $M$ depends on the characteristics of the room. Note that in (2), for echo generation, three different cases are considered with respect to time in order to ensure that the most critical case of repeated feedback is incorporated. During the flat delay period (equivalently, $D$ samples) there is no echo generation. Just after the first flat delay period, only the delayed version of the original signal $s(n)$ (with respect to the current input) is played through the loudspeaker and reflected back, which generates the echo given in (2) for $D \le n < 2D$. During this period the microphone receives the current original speech input along with the echo, that is, $y(n)$. Thus, when $y(n)$ is next played through the loudspeaker, for $n \ge 2D$, the echo is generated by the delayed version of $y(n)$ instead of $s(n)$. It is to be noted that, for $n \ge 2D$, at every instant of echo generation the echo corrupted input is involved instead of the original input, which makes the echo cancellation problem extremely difficult.
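The repeated feedback described above can be illustrated with a short simulation. The following is only a minimal sketch under stated assumptions: the function name `simulate_feedback_echo`, the argument names `s` (input speech), `a` (room attenuation coefficients), and `D` (flat delay in samples), and the toy values are all hypothetical and not taken from the paper. Because the microphone signal equals the clean speech until the first echo arrives, a single recursion on the echo corrupted signal covers the no-echo period, the first echo of clean speech, and the subsequent echoes of already corrupted speech.

```python
def simulate_feedback_echo(s, a, D):
    """Return (y, x): echo corrupted signal and echo under repeated feedback.

    s : list of input speech samples
    a : list of M room attenuation coefficients (hypothetical values)
    D : flat delay in samples
    """
    N, M = len(s), len(a)
    x = [0.0] * N  # echo reaching the microphone
    y = [0.0] * N  # echo corrupted signal picked up by the microphone
    for n in range(N):
        echo = 0.0
        for k in range(M):
            idx = n - D - k  # index of the delayed sample
            if idx >= 0:     # nothing was played before n = 0
                echo += a[k] * y[idx]
        x[n] = echo
        y[n] = s[n] + echo   # microphone signal: speech plus echo
    return y, x


# Toy check: a unit impulse, one coefficient 0.5, flat delay of 2 samples.
y, x = simulate_feedback_echo([1.0, 0.0, 0.0, 0.0, 0.0], [0.5], 2)
print(y)  # [1.0, 0.0, 0.5, 0.0, 0.25]
```

In the toy output, the impulse returns after two samples attenuated by 0.5, and that echo is itself echoed two samples later with amplitude 0.25, which is the repeated feedback case.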

The task of the proposed AEC system is to cancel the echo $x(n)$ from the echo corrupted signal $y(n)$ and retrieve the input speech signal $s(n)$ adaptively. In Figure 2, the schematic diagram of the proposed model is depicted. It can be seen from the figure that the echo corrupted signal $y(n)$ is fed to an adaptive filter block as an input. The task of the adaptive filter block is to produce an estimate $\hat{x}(n)$ of the echo signal $x(n)$, which is present in $y(n)$. The estimate $\hat{x}(n)$ is then subtracted from $y(n)$ to produce the error signal $e(n)$ given by

$$e(n) = y(n) - \hat{x}(n). \tag{3}$$

The adaptive filter block should generate an accurate estimate of the echo such that the error signal $e(n)$ contains the original signal $s(n)$. However, the resulting echo suppressed signal, namely, $e(n)$, may contain $x(n) - \hat{x}(n)$, known as residual echo. Finally, $e(n)$ is transmitted to the loudspeaker.

To produce an estimate of the echo signal $x(n)$, the adaptive filter requires an input signal, which is $y(n)$ in this case, and a reference. Since there is no scope to provide a separate reference in the single channel AEC problem, we propose to utilize delayed versions of the input echo corrupted signal $y(n)$ as the reference signal. This is because, from the echo generation process, it is evident that the added echo is actually a delayed and attenuated version of the previous echo corrupted signal; that is, echo is generated when samples of the previous echo corrupted signal pass through the room response filter. Therefore, the estimated echo $\hat{x}(n)$, generated by the adaptive filter block, can be expressed as

$$\hat{x}(n) = \mathbf{w}^T(n)\, \mathbf{y}_D(n), \tag{4}$$

where $\mathbf{y}_D(n) = [y(n-D), y(n-D-1), \ldots, y(n-D-M+1)]^T$ is the delayed reference vector and $\mathbf{w}(n) = [w_1(n), w_2(n), \ldots, w_M(n)]^T$ consists of the current estimates of the room attenuation parameters to be estimated by the adaptive filter block.

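The objective function required to obtain $\mathbf{w}(n)$ can be defined as the mean square value of the error function $e(n)$; namely,

$$J = E[e^2(n)] = E[s^2(n)] + E[(x(n) - \hat{x}(n))^2] + 2E[s(n)(x(n) - \hat{x}(n))]. \tag{5}$$

The last term on the right-hand side can be neglected based on the assumption that the correlation between the input speech and its delayed version is negligible at higher lags. As a result, the objective function in (5) reduces to

$$J \approx E[s^2(n)] + E[(x(n) - \mathbf{w}^T \mathbf{y}_D(n))^2]. \tag{6}$$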

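Minimizing the objective function (6) with respect to $\mathbf{w}$, that is, considering $\partial J / \partial \mathbf{w} = \mathbf{0}$, one can obtain

$$-2E[\mathbf{y}_D(n)(x(n) - \mathbf{y}_D^T(n)\mathbf{w})] = \mathbf{0}, \tag{7}$$

which can be expressed as

$$E[\mathbf{y}_D(n)\mathbf{y}_D^T(n)]\,\mathbf{w} = E[\mathbf{y}_D(n)x(n)]. \tag{8}$$

The above equation is similar to the Wiener-Hopf equation and its solution can be written as

$$\mathbf{w}_{\mathrm{opt}} = \mathbf{R}^{-1}\mathbf{p}, \tag{9}$$

where $\mathbf{p} = E[\mathbf{y}_D(n)x(n)]$ is the cross-correlation vector between $\mathbf{y}_D(n)$ and $x(n)$, while $\mathbf{R} = E[\mathbf{y}_D(n)\mathbf{y}_D^T(n)]$ is the autocorrelation matrix of $\mathbf{y}_D(n)$. The Wiener-Hopf solution $\mathbf{w}_{\mathrm{opt}}$ is the optimum solution in the mean square sense. Hence, it is shown that even for a single channel AEC problem the optimum solution can be achieved, based on the assumption that the cross-correlation terms in (5) are negligible.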
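As an illustration of the Wiener-Hopf solution in (9), the sketch below estimates the autocorrelation matrix and the cross-correlation vector from the echo corrupted signal alone and solves the resulting linear system. It is a minimal sketch, not the authors' implementation: the helper names `solve_linear` and `wiener_from_delayed_reference`, the coefficients `a_true`, and the delay `D = 8` are hypothetical. White Gaussian noise stands in for speech, so the assumption of negligible correlation between the input and its delayed version holds by construction; real speech satisfies it only approximately. The cross-correlation is computed against the observed signal instead of the unobservable echo, which is equivalent under that same assumption.

```python
import random


def solve_linear(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        acc = aug[i][n] - sum(aug[i][j] * w[j] for j in range(i + 1, n))
        w[i] = acc / aug[i][i]
    return w


def wiener_from_delayed_reference(y, D, M):
    """Accumulate sample correlations from y alone and return R^{-1} p."""
    R = [[0.0] * M for _ in range(M)]
    p = [0.0] * M
    for n in range(D + M - 1, len(y)):
        ref = [y[n - D - k] for k in range(M)]  # delayed reference vector
        for i in range(M):
            p[i] += ref[i] * y[n]               # cross-correlation with y(n)
            for j in range(M):
                R[i][j] += ref[i] * ref[j]      # autocorrelation of reference
    return solve_linear(R, p)


# Toy data (hypothetical values): white noise input with repeated feedback.
random.seed(0)
a_true, D = [0.5, -0.3], 8
s = [random.gauss(0.0, 1.0) for _ in range(20000)]
y = []
for n in range(len(s)):
    echo = sum(a_true[k] * y[n - D - k]
               for k in range(len(a_true)) if n - D - k >= 0)
    y.append(s[n] + echo)

w_opt = wiener_from_delayed_reference(y, D, len(a_true))
print(w_opt)  # expected to lie close to a_true
```

The correlation sums are left unnormalized because a common scale factor cancels in the product of the inverse autocorrelation matrix and the cross-correlation vector.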

In order to avoid the correlation measurements and matrix inversion involved in (9), an adaptive LMS algorithm is employed. The update equation of the weight vector of the conventional LMS adaptive algorithm is generally expressed as [9]

$$\mathbf{w}(n+1) = \mathbf{w}(n) - \mu \nabla J(n), \tag{10}$$

where $\mu$ is the step factor controlling the stability and rate of convergence, $J(n)$ is the cost function, and $\nabla$ is the gradient operator. Unlike the objective function defined in (5), the conventional LMS algorithm simply approximates the cost function as the square of the instantaneous error $e(n)$; that is, $J(n) = e^2(n)$, where $e(n)$ is defined in (3). The gradient of $J(n)$, therefore, can be written as

$$\nabla J(n) = -2e(n)\,\mathbf{y}_D(n). \tag{11}$$

Thus, the update equation (10) can be expressed as

$$\mathbf{w}(n+1) = \mathbf{w}(n) + 2\mu\, e(n)\,\mathbf{y}_D(n). \tag{12}$$

For the $k$th unknown filter parameter at the $n$th iteration the update equation can be written as

$$w_k(n+1) = w_k(n) + 2\mu\, e(n)\, y(n-D-k+1), \tag{13}$$

where $k = 1, 2, \ldots, M$.
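The LMS update with a delayed reference can be sketched in a few lines. This is a minimal illustration under stated assumptions and not the authors' implementation: the function name `lms_delayed_reference`, the step size `mu = 0.0005`, the coefficients `a_true`, the delay `D = 8`, and the use of white Gaussian noise in place of speech are all hypothetical choices made so that the example is self-contained.

```python
import random


def lms_delayed_reference(y, D, M, mu):
    """Single channel LMS echo canceller using delayed y as the reference.

    Returns (e, w): echo suppressed signal and final weight vector.
    """
    w = [0.0] * M
    e = [0.0] * len(y)
    for n in range(len(y)):
        # Reference vector: y(n-D), y(n-D-1), ..., zero before the start.
        ref = [y[n - D - k] if n - D - k >= 0 else 0.0 for k in range(M)]
        x_hat = sum(wk * rk for wk, rk in zip(w, ref))  # echo estimate
        e[n] = y[n] - x_hat                             # error signal
        # Weight update: w <- w + 2 * mu * e(n) * reference
        w = [wk + 2.0 * mu * e[n] * rk for wk, rk in zip(w, ref)]
    return e, w


# Toy data (hypothetical values): white noise input with repeated feedback.
random.seed(1)
a_true, D = [0.5, -0.3], 8
s = [random.gauss(0.0, 1.0) for _ in range(20000)]
y = []
for n in range(len(s)):
    echo = sum(a_true[k] * y[n - D - k]
               for k in range(len(a_true)) if n - D - k >= 0)
    y.append(s[n] + echo)

e, w = lms_delayed_reference(y, D, M=len(a_true), mu=0.0005)
print(w)  # expected to fluctuate around a_true
```

Because the error signal always contains the input itself, the weights do not settle exactly but fluctuate around the optimum; a smaller step size reduces this fluctuation at the cost of slower convergence.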

3. Convergence Analysis of the Proposed LMS Update

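Taking the expectation of both sides of (12), and treating $\mathbf{w}(n)$ as statistically independent of $\mathbf{y}_D(n)$, as is customary in LMS analysis, one can obtain

$$\underline{\mathbf{w}}(n+1) = \underline{\mathbf{w}}(n) + 2\mu\left(E[y(n)\,\mathbf{y}_D(n)] - \mathbf{R}\,\underline{\mathbf{w}}(n)\right). \tag{14}$$

Here an underline beneath $\mathbf{w}(n)$ is introduced to represent the expected value $E[\mathbf{w}(n)]$, and $\mathbf{R} = E[\mathbf{y}_D(n)\mathbf{y}_D^T(n)]$. Based on the assumption that the cross-correlation terms are negligible, as stated earlier, $E[y(n)\,\mathbf{y}_D(n)] \approx E[x(n)\,\mathbf{y}_D(n)] = \mathbf{p}$, and (14) can be written as

$$\underline{\mathbf{w}}(n+1) = (\mathbf{I} - 2\mu\mathbf{R})\,\underline{\mathbf{w}}(n) + 2\mu\,\mathbf{p}. \tag{15}$$

Evaluating the homogeneous and particular solutions of (15), the total solution can be obtained as (see Appendix)

$$\underline{w}'_k(n) = C_k (1 - 2\mu\lambda_k)^n + \frac{p'_k}{\lambda_k}, \tag{16}$$

where $\underline{w}'_k(n)$ is the $k$th element of $\underline{\mathbf{w}}'(n) = \mathbf{Q}^T \underline{\mathbf{w}}(n)$, $C_k$ is a constant, $\lambda_k$ is the $k$th diagonal element of the eigenvalue matrix $\mathbf{\Lambda}$ obtained by eigenvalue decomposition of $\mathbf{R}$, and $p'_k$ is the $k$th element of $\mathbf{p}' = \mathbf{Q}^T\mathbf{p}$, with the matrix $\mathbf{Q}$ consisting of the eigenvectors corresponding to the eigenvalues. Since in the iterative update procedure the homogeneous part diminishes with iterations, which requires $|1 - 2\mu\lambda_k| < 1$ for every $k$, (16) in matrix form can be expressed as

$$\lim_{n\to\infty} \underline{\mathbf{w}}(n) = \mathbf{Q}\mathbf{\Lambda}^{-1}\mathbf{Q}^T\mathbf{p} = \mathbf{R}^{-1}\mathbf{p} = \mathbf{w}_{\mathrm{opt}}. \tag{17}$$

Thus, it is found that, with an increasing number of iterations, the average value of the weight vector converges to the Wiener-Hopf solution, which is the optimum solution.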
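The convergence of the averaged weight vector can be checked numerically by iterating the mean recursion, in which the averaged weights move by twice the step size times the difference between the cross-correlation vector and the autocorrelation matrix applied to the current weights. The sketch below is illustrative only: the 2 by 2 matrix `R`, the vector `p`, and the step size `mu` are hypothetical values chosen by hand and not data from the paper, and the helper names are assumptions.

```python
import math


def mean_weight_recursion(R, p, mu, n_iter):
    """Iterate w <- w + 2*mu*(p - R w) for a 2x2 system, starting from zero."""
    w = [0.0, 0.0]
    for _ in range(n_iter):
        Rw = [R[0][0] * w[0] + R[0][1] * w[1],
              R[1][0] * w[0] + R[1][1] * w[1]]
        w = [w[i] + 2.0 * mu * (p[i] - Rw[i]) for i in range(2)]
    return w


def wiener_2x2(R, p):
    """Closed-form R^{-1} p for a 2x2 matrix."""
    det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
    return [(R[1][1] * p[0] - R[0][1] * p[1]) / det,
            (R[0][0] * p[1] - R[1][0] * p[0]) / det]


def eigenvalues_sym_2x2(R):
    """Eigenvalues of a symmetric 2x2 matrix, largest first."""
    mean = 0.5 * (R[0][0] + R[1][1])
    dev = math.hypot(0.5 * (R[0][0] - R[1][1]), R[0][1])
    return mean + dev, mean - dev


# Hypothetical correlation quantities and step size.
R = [[2.0, 0.5], [0.5, 1.0]]
p = [1.0, 0.5]
mu = 0.1

lams = eigenvalues_sym_2x2(R)
decay = [abs(1.0 - 2.0 * mu * lam) for lam in lams]  # per-mode decay factors
w_inf = mean_weight_recursion(R, p, mu, 2000)
w_opt = wiener_2x2(R, p)
print(decay, w_inf, w_opt)
```

Each mode of the homogeneous part shrinks by its decay factor at every iteration, so convergence requires every factor to be below one, that is, a positive step size smaller than the reciprocal of the largest eigenvalue.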

4. Experimental Results

The performance of the proposed AEC algorithm is investigated in terms of objective and subjective measures, considering different echo corrupted speech signals under various acoustic room environments. Speech samples uttered by several male and female speakers available in the TIMIT database are utilized for performance evaluation [16]. An acoustic room environment is simulated using an FIR filter of length $D + M$, where, as per conventional approaches, the filter coefficients during the flat delay portion are assumed to be zero. The flat delay time can be precalculated based on the distance between the microphone and the speaker [15]. Because of the implicit zeros corresponding to the flat delay, only a small number of unknown coefficients have to be determined. In the proposed method, a small step size is used in order to obtain smooth convergence.

The echo corrupted signals recorded in different acoustic environments and the corresponding echo suppressed signals produced by the proposed method are successively played to five individual listeners. From the overall response of the listeners in terms of mean opinion score (MOS), a very satisfactory performance of the proposed method is obtained even under severe echo generating conditions in noise.

Next, two objective measures are employed to quantify the improvement in speech quality. One of the objective measures, the echo return loss enhancement (ERLE) in dB, represents the amount of attenuation of the echo introduced by the adaptive filter alone. It is the ratio of the instantaneous power of the input echo to that of the residual echo remaining in the echo suppressed signal and is expressed in dB as [1]

$$\mathrm{ERLE}(n) = 10\log_{10}\frac{P_x(n)}{P_r(n)}, \tag{18}$$

where $P_x(n)$ and $P_r(n)$ denote the instantaneous powers of the echo $x(n)$ and the residual echo $x(n) - \hat{x}(n)$, respectively. The average value of $\mathrm{ERLE}(n)$ in dB over time is used as the criterion of performance evaluation in this experiment. Another objective evaluation criterion, the signal to distortion ratio (SDR) in dB, is a measure of the ratio of signal power to the power of the distortion introduced. The SDR is computed at the input and output sides of the AEC system, and the difference between these SDR values, namely, the SDR improvement (SDRI), is considered as an indicator of the system performance and defined as

$$\mathrm{SDRI} = \mathrm{SDR}_{\mathrm{out}} - \mathrm{SDR}_{\mathrm{in}}, \tag{19}$$

where

$$\mathrm{SDR}_{\mathrm{in}} = 10\log_{10}\frac{P_s}{P_x}, \qquad \mathrm{SDR}_{\mathrm{out}} = 10\log_{10}\frac{P_s}{P_d}. \tag{20}$$

Here, $P_s$ is the overall power of the original signal $s(n)$, $P_x$ is the overall power of the echo at the microphone input, and $P_d$ is the total power of the distortion (residual noise and echo) present in the echo suppressed output signal. The higher the value of SDRI, the better the performance of the echo canceller.
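The two objective measures can be computed as in the following sketch. It is a minimal illustration under stated assumptions: the function names, the frame length used to approximate instantaneous power, the small constant `eps` guarding against division by zero, and the toy signals are all hypothetical. The distortion in the output is taken here as the difference between the echo suppressed signal and the original speech, which is one plausible reading of the definition and requires access to the clean signal, as in a simulation.

```python
import math


def power(sig):
    """Mean square value of a sequence."""
    return sum(v * v for v in sig) / len(sig)


def average_erle_db(x, x_hat, frame=4, eps=1e-12):
    """Average frame-wise ERLE in dB: echo power over residual echo power."""
    values = []
    for start in range(0, len(x) - frame + 1, frame):
        echo = x[start:start + frame]
        est = x_hat[start:start + frame]
        resid = [xi - xh for xi, xh in zip(echo, est)]
        ratio = (power(echo) + eps) / (power(resid) + eps)
        values.append(10.0 * math.log10(ratio))
    return sum(values) / len(values)


def sdri_db(s, x, e, eps=1e-12):
    """SDR improvement in dB: output SDR minus input SDR."""
    p_s = power(s)                                   # original speech power
    p_x = power(x)                                   # echo power at the input
    p_d = power([ei - si for ei, si in zip(e, s)])   # output distortion power
    sdr_in = 10.0 * math.log10((p_s + eps) / (p_x + eps))
    sdr_out = 10.0 * math.log10((p_s + eps) / (p_d + eps))
    return sdr_out - sdr_in


# Toy signals (hypothetical): the echo estimate captures 90% of the echo.
s = [1.0, -1.0] * 4
x = [1.0] * 8
x_hat = [0.9] * 8
e = [si + xi - xh for si, xi, xh in zip(s, x, x_hat)]

erle = average_erle_db(x, x_hat)
sdri = sdri_db(s, x, e)
print(erle, sdri)  # both about 20 dB for this toy case
```

In the toy case the residual echo amplitude is one tenth of the echo amplitude, so both measures come out near 20 dB.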

The proposed algorithm has been tested on several different sentences taken from the TIMIT database. As a typical example, a sample utterance “good service should be rewarded by big tips” taken from the TIMIT database is shown in Figure 3(a) [16]. The echo corrupted speech signal is shown in Figure 3(b). The outcome of the proposed adaptive filter method, the echo suppressed speech signal, is plotted in Figure 3(c). The marked areas in the figure help to visualize the AEC performance of the proposed method by comparing the waveform of the echo suppressed signal to those of the echo corrupted and the original input signals. It can be seen that the echo suppressed signal closely resembles the original input signal and the effect of echo is significantly reduced.

For a better understanding of the performance of the echo canceller, a spectral analysis of the AEC performance is introduced. The spectrograms of the original speech signal $s(n)$, the echo corrupted signal $y(n)$, and the echo suppressed signal $e(n)$ are depicted, respectively, in Figures 4(a), 4(b), and 4(c). It is observed that the spectrum of the echo corrupted signal is quite different from that of the original signal at different frequency levels because of the presence of unwanted components resulting from the echo (as can be seen from some sample marked regions). In particular, the effect of noise corruption in the high frequency regions of the spectrogram is quite prominent at different time instants. The main reason behind this effect is the continuous feedback of echo corrupted speech samples to the input microphone. To the human ear, this noise is a high pitched sound present alongside the utterance. This noise may grow to a high magnitude if the echo power is increased in a conference room due to bad acoustics and may result in total loss of intelligibility. However, it is clearly evident in the spectrum of the echo suppressed signal obtained by the proposed method that the unwanted high pitched noise has been drastically reduced.

For further experimentation, another sample utterance “she had your dark suit in greasy wash water all year” from the TIMIT database is considered and shown in Figure 5(a). The echo corrupted speech signal and the echo suppressed speech signal for this utterance are shown in Figures 5(b) and 5(c), respectively. Moreover, the spectrograms of the second utterance, the echo corrupted signal, and the echo suppressed signal are shown in Figures 6(a), 6(b), and 6(c), respectively. From these figures it is evident that the proposed single channel AEC scheme can enhance speech quality by suppressing echo from the echo corrupted recorded speech.

In order to demonstrate the effect of various acoustic environments on the echo cancellation performance obtained by the proposed method, the number of room response filter coefficients is varied while generating the recorded data. In Tables 1 and 2, the performance variation of the proposed method for both utterances with the change in the number of nonzero filter coefficients is reported in terms of SDRI (dB) and average ERLE (dB), respectively. Performance is evaluated for different numbers of unknown coefficients ranging from 4 to 12. It is found from the tables that the performance of the proposed method remains consistently good irrespective of the number of filter coefficients. Hence, the proposed method exhibits robust performance in different acoustic echo generating environments.

5. Conclusion

A practical approach to single channel acoustic echo cancellation from recorded echo corrupted data using the gradient based adaptive LMS algorithm is developed in this paper. The major problem of the unavailability of the reference signal required for successful adaptation of the adaptive filter has been overcome by using the delayed version of the echo corrupted input signal as the reference. This calls for a modification of the conventional objective function of the gradient based adaptive LMS algorithm, and a theoretical proof of convergence of the adaptive scheme to the optimal solution is given. The echo cancellation performance of the proposed method is evaluated by means of subjective and objective measures, in both the time and frequency domains, under different acoustic recording conditions. It is found that the proposed offline AEC algorithm is capable of providing very satisfactory echo cancellation performance for recorded echo corrupted speech in terms of average ERLE (dB) and SDR improvement (dB) under various acoustic room environments. Keeping the proposed framework intact, it may be possible in the future to incorporate approaches other than the LMS algorithm, such as block LMS, RLS, and blind deconvolution, to obtain a single channel echo cancellation system. Also, the proposed scheme may be extended to online echo and noise cancellation, which would be an interesting direction for future work.

Appendix

Derivation of the Solution of the LMS Update

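In order to obtain a homogeneous solution of (15) one can consider

$$\underline{\mathbf{w}}(n+1) = (\mathbf{I} - 2\mu\mathbf{R})\,\underline{\mathbf{w}}(n). \tag{A.1}$$

For the correlation matrix $\mathbf{R}$, using eigenvalue decomposition, we obtain

$$\mathbf{R} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^T, \tag{A.2}$$

where $\mathbf{\Lambda}$ is a diagonal eigenvalue matrix and $\mathbf{Q}\mathbf{Q}^T = \mathbf{I}$. Now, multiplying both sides of (A.1) by $\mathbf{Q}^T$, we get

$$\underline{\mathbf{w}}'(n+1) = (\mathbf{I} - 2\mu\mathbf{\Lambda})\,\underline{\mathbf{w}}'(n), \tag{A.3}$$

where $\underline{\mathbf{w}}'(n) = \mathbf{Q}^T\underline{\mathbf{w}}(n)$. The $k$th coefficient of the weight vector can be expressed as

$$\underline{w}'_k(n+1) = (1 - 2\mu\lambda_k)\,\underline{w}'_k(n). \tag{A.4}$$

Hence, the homogeneous solution can be obtained as

$$\underline{w}'_{k,h}(n) = C_k (1 - 2\mu\lambda_k)^n, \tag{A.5}$$

where $C_k$ is a constant. Next, the particular solution for the $k$th coefficient, based on (15), is obtained from

$$\underline{w}'_k(n+1) = (1 - 2\mu\lambda_k)\,\underline{w}'_k(n) + 2\mu\, p'_k, \tag{A.6}$$

where $p'_k$ is the $k$th element of $\mathbf{p}' = \mathbf{Q}^T\mathbf{p}$. For a particular solution $\underline{w}'_{k,p}(n) = K_k$, with $K_k$ a constant, one can obtain

$$K_k = (1 - 2\mu\lambda_k)\,K_k + 2\mu\, p'_k, \tag{A.7}$$

and thereby (A.6) reduces to

$$K_k = \frac{p'_k}{\lambda_k}. \tag{A.8}$$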

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.