Abstract

This article aims to carry out a comparative study between discrete-time and discrete-frequency Kalman filters. In order to assess the performance of both methods for speech reconstruction, we measured the output segmental signal-to-noise ratio and the Itakura-Saito distance provided by each algorithm over 25 different voice signals. The results show that although the two algorithms performed very similarly regarding noise reduction, the discrete-time Kalman filter produced smaller spectral distortion on the estimated signals when compared with the discrete-frequency Kalman filter.

1. Introduction

Even with the advent of the Internet, voice transmission remains one of the most important means of communication. The quality and intelligibility of speech signals play a key role in the ease and precision of information exchange. In almost all voice transmission applications, quality can be degraded by factors such as ambient noise, losses introduced by digital link encoding, and interference from other conversations or even from other signal sources [1].

To overcome these harmful effects, digital speech processing techniques can be employed to reduce or even eliminate them. In recent years, techniques such as spectral subtraction, Kalman filtering, psychoacoustic modeling, and wavelet transforms have gained prominence, especially in noise reduction, and considerable research effort has been devoted to improving them.

In [2, 3], the authors enhance speech quality by removing the musical noise introduced by spectral subtraction. In [1], the authors combined spectral subtraction and wavelets on a prefiltering approach for noise reduction in speech signals and used the result as an initial guess for a Kalman filter. When compared to Kalman filtering using only wavelets or spectral subtraction alone to produce the initial guess, their method showed the least spectral distortion and a similar segmental output signal-to-noise ratio.

Since wavelet-based denoising is highly dependent on thresholding the approximation and detail coefficients, recent research in this area focuses on new thresholds [4, 5].

Shao and Chang [6] concatenated a Kalman filter with a bank of wavelet filters and a perceptual weighting filter, deriving the weighting filter from a psychoacoustic masking model. According to the authors, that work brought two contributions. The first was a wavelet-based auditory model with a perceptual wavelet filter bank that maps the frequency response of the human auditory system through subband decomposition. The second was a Kalman filter using a speech state-space model in the wavelet domain, whose computational cost was lower than that of the discrete-time Kalman filter. They were able to reduce the noise in different environments with low signal degradation.

Dhivya and Justin [7] proposed a noise reduction technique that applies spectral subtraction to the wavelet approximation coefficients and soft thresholding to the detail coefficients. They used five wavelet filters and compared them according to their output signal-to-noise ratios. Besides the output SNR, they also considered the correlation coefficient and the perceptual evaluation of speech quality (PESQ) criteria.

However, although these algorithms represent significant advances in noise removal, most of them neither evaluate spectral distortion nor attempt to minimize it. Since the method in [1] provided low spectral distortion, this article presents a comparative study between discrete-time and discrete-frequency Kalman filters using simply the noisy signal as the initial estimate. According to Fujimoto and Ariki [8], the main difference between the two approaches is that the Kalman filter operates more efficiently, in computational terms, in the frequency domain than in the time domain.

On the other hand, transforming the set of Kalman filter equations to and from the frequency domain introduces significant distortion in the estimated signal. We therefore also applied prefiltering based on spectral subtraction to reduce this distortion. To assess the performance of the algorithms, we measured both the output segmental signal-to-noise ratio and the Itakura-Saito distance.

This article is structured as follows: Sections 2 and 3 describe the discrete-time and discrete-frequency Kalman filtering algorithms, respectively; Section 4 presents the experimental results; and Section 5 presents the conclusions.

2. Discrete-Time Kalman Filtering (DTKF)

In 1960, Rudolf Emil Kalman published the paper “A New Approach to Linear Filtering and Prediction Problems”, describing a recursive solution to the discrete-time linear filtering problem [1]. Since then, thanks to the major advances in digital computing, Kalman filtering has become a very important technique in several areas, such as navigation, process monitoring, economics, and signal reconstruction from noisy samples.

In this article, the Kalman filter development follows the formulation described by Vaseghi [9]. The speech signal is modeled as an autoregressive (AR) process of order $p$:

$$x(n) = \sum_{k=1}^{p} a_k\, x(n-k) + e(n),$$

where $a_k$ are the order-$p$ linear prediction coefficients, $e(n)$ is the prediction error associated with the excitation of the source-filter model of speech production, and $x(n)$ is the $n$-th sample of the speech signal.
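As an illustration, the sketch below shows one way the order-$p$ LPC coefficients $a_k$ could be estimated from a speech frame by solving the Yule-Walker equations. NumPy is used here, and the estimator and function names are illustrative assumptions, not necessarily the procedure used by the authors.

```python
import numpy as np

def lpc_yule_walker(x, p):
    """Estimate p-th order AR (LPC) coefficients a_1..a_p of the model
    x(n) = sum_k a_k x(n-k) + e(n) by solving the Yule-Walker equations
    with biased autocorrelation estimates."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    # Biased autocorrelation estimates r(0)..r(p)
    r = np.array([np.dot(x[:N - k], x[k:]) / N for k in range(p + 1)])
    # Toeplitz system R a = [r(1), ..., r(p)]^T
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, r[1:p + 1])
    sigma2_e = r[0] - np.dot(a, r[1:p + 1])  # prediction-error variance
    return a, sigma2_e
```

For a windowed frame, a call such as `a, q = lpc_yule_walker(frame, 10)` would return both the coefficient vector and the excitation variance later needed by the filter.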

It can be observed that, in the acquisition of audio and speech signals, most signals are captured in the presence of some type of additive noise. Consequently, the noisy signal can be modeled as

$$y(n) = x(n) + v(n),$$

where $y(n)$ is the noisy speech signal and $v(n)$ is additive white Gaussian noise.
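For reference, the following sketch corrupts a clean signal with white Gaussian noise at a prescribed overall SNR; the experiments in Section 4 adjust the segmental input SNR instead, so this routine is only an illustrative assumption.

```python
import numpy as np

def add_white_noise(x, snr_db, rng=None):
    """Return y(n) = x(n) + v(n), where v is zero-mean white Gaussian
    noise scaled so the overall SNR equals snr_db."""
    rng = np.random.default_rng() if rng is None else rng
    p_signal = np.mean(x ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    v = rng.normal(0.0, np.sqrt(p_noise), size=len(x))
    return x + v
```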

From the AR model and the noisy-signal model above, we can set up a state-space model [9]:

$$\mathbf{x}(n) = \mathbf{A}\,\mathbf{x}(n-1) + \mathbf{e}(n),$$
$$\mathbf{y}(n) = \mathbf{H}\,\mathbf{x}(n) + \mathbf{v}(n),$$

where $\mathbf{x}(n)$ is the state vector at time $n$; $\mathbf{A}$ is the $p \times p$ state transition matrix that relates the current time $n$ to the past time $n-1$; $\mathbf{e}(n)$ is the input vector of the state equation, modeled as white noise; $\mathbf{y}(n)$ is the observation vector; $\mathbf{H}$ is the channel distortion matrix of appropriate dimensions; and $\mathbf{v}(n)$ is an additive white noise vector [9].
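One common way to realize this state-space model is the companion (controllable canonical) form, in which the state stacks the last $p$ speech samples. The source does not spell out its exact choice of $\mathbf{A}$ and $\mathbf{H}$, so the construction below is a standard sketch rather than the authors' implementation.

```python
import numpy as np

def ar_state_space(a):
    """Companion-form state-space matrices for an AR(p) model.
    State x(n) = [x(n), x(n-1), ..., x(n-p+1)]^T, so that
    x(n) = A x(n-1) + e(n) and y(n) = H x(n) + v(n)."""
    p = len(a)
    A = np.zeros((p, p))
    A[0, :] = a                 # first row holds the AR coefficients a_1..a_p
    A[1:, :-1] = np.eye(p - 1)  # sub-diagonal shifts the past samples down
    H = np.zeros((1, p))
    H[0, 0] = 1.0               # only the current sample is observed
    return A, H
```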

According to Vaseghi [9], $\mathbf{e}(n)$ and $\mathbf{v}(n)$ are assumed to be independent white noise processes, so that

$$E\bigl[\mathbf{v}(n)\,\mathbf{v}^{T}(m)\bigr] = \mathbf{R}\,\delta(n-m), \qquad E\bigl[\mathbf{e}(n)\,\mathbf{e}^{T}(m)\bigr] = \mathbf{Q}\,\delta(n-m),$$

where $\mathbf{R}$ and $\mathbf{Q}$ are diagonal covariance matrices related, respectively, to the additive noise and the prediction error.

The Kalman filter estimates a process using a form of feedback control: first, the filter estimates the state of the process at a given time; then, feedback is obtained in the form of a new measurement.

Brown and Hwang [10] and Vaseghi [9] divide the Kalman filter equations into two groups: the time-update (prediction) equations and the measurement-update (correction) equations. The time-update equations are

$$\hat{\mathbf{x}}(n \mid n-1) = \mathbf{A}\,\hat{\mathbf{x}}(n-1 \mid n-1),$$
$$\mathbf{P}(n \mid n-1) = \mathbf{A}\,\mathbf{P}(n-1 \mid n-1)\,\mathbf{A}^{T} + \mathbf{Q},$$

while the measurement-update equations are

$$\mathbf{K}(n) = \mathbf{P}(n \mid n-1)\,\mathbf{H}^{T}\bigl[\mathbf{H}\,\mathbf{P}(n \mid n-1)\,\mathbf{H}^{T} + \mathbf{R}\bigr]^{-1},$$
$$\hat{\mathbf{x}}(n \mid n) = \hat{\mathbf{x}}(n \mid n-1) + \mathbf{K}(n)\bigl[\mathbf{y}(n) - \mathbf{H}\,\hat{\mathbf{x}}(n \mid n-1)\bigr],$$
$$\mathbf{P}(n \mid n) = \bigl[\mathbf{I} - \mathbf{K}(n)\,\mathbf{H}\bigr]\,\mathbf{P}(n \mid n-1),$$

where $\mathbf{P}$ is the error covariance matrix; $\mathbf{K}(n)$ is the Kalman gain matrix, responsible for minimizing the error covariance; and $\hat{\mathbf{x}}(n \mid n-1)$ is the state estimate at time $n$ given the previous observations of $\mathbf{y}$.
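A compact sketch of the resulting per-frame DTKF recursion, under the companion-form assumptions above (scalar observation, excitation entering only the first state component), might look as follows. Variable names, initialization, and the use of the filtered first state component as the output are choices of this illustration.

```python
import numpy as np

def dtkf_frame(y, A, H, q, r):
    """Run the discrete-time Kalman recursion over one frame of noisy
    samples y and return the estimated clean samples.
    q: prediction-error (excitation) variance, r: additive-noise variance."""
    p = A.shape[0]
    Q = np.zeros((p, p)); Q[0, 0] = q   # excitation drives only the first state
    R = np.array([[r]])
    x_hat = np.zeros((p, 1))
    P = np.eye(p)
    out = np.empty(len(y))
    for n, yn in enumerate(y):
        # time update (prediction)
        x_pred = A @ x_hat
        P_pred = A @ P @ A.T + Q
        # measurement update (correction)
        S = H @ P_pred @ H.T + R
        K = P_pred @ H.T @ np.linalg.inv(S)
        x_hat = x_pred + K @ (np.array([[yn]]) - H @ x_pred)
        P = (np.eye(p) - K @ H) @ P_pred
        out[n] = x_hat[0, 0]            # filtered estimate of the current sample
    return out
```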

3. Discrete-Frequency Kalman Filtering (DFKF)

Fujimoto and Ariki [8] introduced discrete-frequency Kalman filtering (DFKF) in 2000 to provide a more computationally efficient algorithm. This is accomplished by transforming the Kalman filter equations so that they are iterated in the frequency domain and then inverse transforming the estimated spectrum back to the time domain to obtain the estimated signal. To do so, the signal is divided into frames such that, in the $l$-th frame, $X_l(k)$ is the complex spectrum of the noiseless signal and $v_l(n)$ is the white Gaussian noise. Thus, the noise-corrupted signal is given by [8]

$$y_l(n) = x_l(n) + v_l(n).$$

Since $x_l(n)$ can be replaced by the Inverse Discrete Fourier Transform (IDFT) of $X_l(k)$, we have

$$y_l(n) = \frac{1}{N}\sum_{k=0}^{N-1} X_l(k)\, e^{j 2 \pi k n / N} + v_l(n),$$

which, in matrix notation, can be written simply as

$$y_l(n) = \mathbf{w}(n)\,\mathbf{X}_l + v_l(n),$$

where $n$ represents time within the $l$-th frame, $N$ is the number of samples in the frame, $\mathbf{w}(n)$ is the row vector containing the Discrete Fourier Transform (DFT) basis (scaled by $1/N$) evaluated at time $n$, and $\mathbf{X}_l$ is the complex spectrum vector of the $l$-th frame. Since time evolution has no meaning for $\mathbf{X}_l$, there is no state transition matrix in the frequency-domain Kalman equations, so the computational effort of the DFKF is significantly reduced.

Analogous to the DTKF, the DFKF can be represented by the following equations, where the superscript $H$ denotes the complex conjugate (Hermitian) transpose:

$$\mathbf{K}_l(n) = \mathbf{P}_l(n-1)\,\mathbf{w}^{H}(n)\bigl[\mathbf{w}(n)\,\mathbf{P}_l(n-1)\,\mathbf{w}^{H}(n) + \sigma_v^{2}\bigr]^{-1},$$
$$\hat{\mathbf{X}}_l(n) = \hat{\mathbf{X}}_l(n-1) + \mathbf{K}_l(n)\bigl[y_l(n) - \mathbf{w}(n)\,\hat{\mathbf{X}}_l(n-1)\bigr],$$
$$\mathbf{P}_l(n) = \bigl[\mathbf{I} - \mathbf{K}_l(n)\,\mathbf{w}(n)\bigr]\,\mathbf{P}_l(n-1),$$

where $\hat{\mathbf{X}}_l(n)$ is the estimate of the complex spectrum of the $l$-th frame after the $n$-th observation, $\mathbf{K}_l(n)$ is the Kalman gain, $\mathbf{P}_l(n)$ is the error covariance matrix, and $\sigma_v^{2}$ is the variance of the additive noise.

In order to obtain the estimated signal of the Kalman filter in the time domain, we must apply the Inverse Discrete Fourier Transform (IDFT) to the final spectrum estimate $\hat{\mathbf{X}}_l$ of each frame.
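A rough sketch of how such a per-frame frequency-domain recursion could be implemented in NumPy is given below. The sample-by-sample processing order, the use of a full covariance matrix, and the prior spectral variance used to initialize it are assumptions of this illustration rather than details taken from [8].

```python
import numpy as np

def dfkf_frame(y, sigma_v2, sigma_x2=1.0):
    """Frequency-domain Kalman sketch for one frame y of length N.
    The state X is the frame's complex spectrum; it is assumed constant
    within the frame, so there is no state-transition step.
    sigma_v2: additive-noise variance; sigma_x2: assumed prior spectral variance."""
    N = len(y)
    X = np.zeros(N, dtype=complex)              # state estimate (complex spectrum)
    P = sigma_x2 * np.eye(N, dtype=complex)     # state error covariance
    k = np.arange(N)
    for n in range(N):
        w = np.exp(2j * np.pi * k * n / N) / N  # n-th row of the IDFT matrix
        S = (w @ P @ w.conj()).real + sigma_v2  # scalar innovation variance
        K = (P @ w.conj()) / S                  # Kalman gain (length-N vector)
        X = X + K * (y[n] - w @ X)              # correction step
        P = P - np.outer(K, w @ P)              # covariance update: (I - K w) P
    x_hat = np.real(np.fft.ifft(X))             # IDFT back to the time domain
    return x_hat
```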

4. Results

In order to compare the performance of the studied techniques, we used 25 different recorded speech signals sampled at 22050 Hz and coded with 16 bits per sample. Each signal was segmented with a 512-sample Hamming window with 50% overlap. All tests were performed in MATLAB R2013b on a computer with a Core i7 processor and 8 GB of RAM.
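As a reference for the framing step, a minimal sketch of Hamming windowing with 50% overlap and a simple overlap-add reconstruction is shown below; window normalization is omitted and the parameter names are illustrative.

```python
import numpy as np

def frames_hamming(x, frame_len=512, overlap=0.5):
    """Split x into Hamming-windowed frames with the given overlap."""
    hop = int(frame_len * (1.0 - overlap))
    win = np.hamming(frame_len)
    starts = range(0, len(x) - frame_len + 1, hop)
    return np.array([x[s:s + frame_len] * win for s in starts]), hop, win

def overlap_add(frames, hop, n_total):
    """Reassemble processed frames by overlap-add (no window normalization here)."""
    y = np.zeros(n_total)
    for i, f in enumerate(frames):
        y[i * hop:i * hop + len(f)] += f
    return y
```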

The quality of the estimated speech signal at the output of each filter was evaluated using the segmental signal-to-noise ratio (SNRseg). We chose the SNRseg because it is computed over short segments of the speech signal, balancing the weights assigned to segments of higher and lower signal strength. The SNRseg is given by [11]

$$\mathrm{SNRseg} = \frac{1}{M}\sum_{m=0}^{M-1} 10 \log_{10} \left( \frac{\sum_{n=N_m}^{N_m+N-1} x^{2}(n)}{\sum_{n=N_m}^{N_m+N-1} \bigl[x(n) - \hat{x}(n)\bigr]^{2}} \right),$$

where $N_m$ and $N_m + N - 1$ are the limits of the $m$-th of the $M$ frames of length $N$, $x(n)$ is the clean signal, and $\hat{x}(n)$ is the estimated signal. To carry out the tests, the signals were contaminated by additive white noise and the input segmental signal-to-noise ratio (SNRI) was adjusted to 3 dB.
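A direct implementation of the SNRseg formula above might look like the following sketch; the small `eps` guard against silent frames is an addition of this illustration.

```python
import numpy as np

def snr_seg(clean, estimate, frame_len=512, eps=1e-12):
    """Segmental SNR: average of the per-frame SNRs (in dB) between the
    clean signal and its estimate."""
    n_frames = len(clean) // frame_len
    snrs = []
    for m in range(n_frames):
        s = clean[m * frame_len:(m + 1) * frame_len]
        e = s - estimate[m * frame_len:(m + 1) * frame_len]
        snrs.append(10.0 * np.log10(np.sum(s ** 2) / (np.sum(e ** 2) + eps) + eps))
    return float(np.mean(snrs))
```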

As reported by Rabiner and Schafer [12], a suitable way to measure spectral variations is the Itakura-Saito distance. This measure can be calculated as

$$d_{IS}(\mathbf{a}_x, \mathbf{a}_{\hat{x}}) = \log\!\left( \frac{\mathbf{a}_{\hat{x}}^{T}\,\mathbf{R}_x\,\mathbf{a}_{\hat{x}}}{\mathbf{a}_x^{T}\,\mathbf{R}_x\,\mathbf{a}_x} \right),$$

where $\mathbf{a}_x$ and $\mathbf{a}_{\hat{x}}$ are the linear prediction coefficient (LPC) vectors of the original and estimated signals, respectively, and $\mathbf{R}_x$ is the autocorrelation matrix of the original signal. The closer the spectra of the original and estimated signals are to each other, the smaller $d_{IS}$ is; thus, an Itakura-Saito distance equal to zero indicates that the spectra are equal [12].
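The sketch below computes the log-ratio distance above, reusing the `lpc_yule_walker` routine from the Section 2 sketch; the biased autocorrelation estimator and the model order are illustrative choices, not necessarily those of [12].

```python
import numpy as np

def autocorr_matrix(x, p):
    """(p+1) x (p+1) Toeplitz autocorrelation matrix of x."""
    N = len(x)
    r = np.array([np.dot(x[:N - k], x[k:]) / N for k in range(p + 1)])
    return np.array([[r[abs(i - j)] for j in range(p + 1)] for i in range(p + 1)])

def itakura_distance(clean, estimate, p=10):
    """Log-ratio spectral distance between the LPC models of the clean and
    estimated signals (zero when the LPC spectra coincide)."""
    a_c, _ = lpc_yule_walker(clean, p)      # from the earlier LPC sketch
    a_e, _ = lpc_yule_walker(estimate, p)
    ac = np.concatenate(([1.0], -a_c))      # error-filter vectors [1, -a_1, ..., -a_p]
    ae = np.concatenate(([1.0], -a_e))
    R = autocorr_matrix(clean, p)
    return float(np.log((ae @ R @ ae) / (ac @ R @ ac)))
```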

The DTKF algorithm was employed in the first test, which used the utterance elétrica ("electrical" in Portuguese). The results are shown in Figures 1, 2, and 3.

Figures 2 and 3 show the noise reduction, especially during the silent parts of the signal. The output segmental signal-to-noise ratio (SNRO) in this case was 10 dB and the Itakura-Saito distance was 0.3250.

The second test kept the same parameters as the first but used the DFKF. The results are shown in Figures 4 and 5. The SNRO was 8 dB, and the comparison of Figures 4 and 5 shows a considerable reduction in the noise. However, the Itakura-Saito distance was 0.3782, which indicates a larger distortion introduced by the filtering.

Therefore, in this test, the DTKF algorithm produced smaller spectral distortion than the DFKF and also provided a larger SNRO.

The results of the tests for the 25 words are presented in Figures 6 and 7. Figure 6 shows that the SNRO was almost always the same for the DTKF and the DFKF, with an average of 9 dB.

Figure 7 shows that the DTKF algorithm produced smaller spectral distortion in all tests. Thus, we can state that the DTKF is more suitable than the DFKF for speech processing.

Tests were also performed after prefiltering the noisy signals. The prefiltering was based on spectral subtraction, as in [1]. All results showed that the DTKF produced smaller spectral distortion than the DFKF. The spectral distortions for the 25 words are shown in Figure 8 for an SNRI of 3 dB.
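For completeness, a very small magnitude spectral-subtraction sketch is shown below. It is not the exact prefilter of [1]; the over-subtraction factor, spectral floor, and the noise-magnitude estimate `noise_mag` are assumptions of this illustration.

```python
import numpy as np

def spectral_subtraction(y, noise_mag, frame_len=512, alpha=1.0, beta=0.01):
    """Minimal magnitude spectral subtraction: per frame, subtract an
    estimate of the noise magnitude spectrum, apply a spectral floor,
    and keep the noisy phase. noise_mag: average |V(k)| from noise-only frames."""
    hop = frame_len // 2
    win = np.hamming(frame_len)
    out = np.zeros(len(y))
    for s in range(0, len(y) - frame_len + 1, hop):
        Y = np.fft.fft(y[s:s + frame_len] * win)
        mag = np.maximum(np.abs(Y) - alpha * noise_mag, beta * np.abs(Y))
        X = mag * np.exp(1j * np.angle(Y))
        out[s:s + frame_len] += np.real(np.fft.ifft(X))  # overlap-add
    return out
```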

The comparison of Figures 7 and 8 indicates that prefiltering yielded only a slight additional reduction in the spectral distortion obtained with the DTKF algorithm.

5. Conclusions

This paper presented a comparative study between discrete-time and discrete-frequency Kalman filtering algorithms. Tests were carried out with 25 different words using Itakura-Saito distance to measure the spectral distortion and the segmental signal-to-noise ratio to evaluate the noise reduction.

Although the two algorithms performed very similarly regarding noise reduction, discrete-time Kalman filtering produced smaller spectral distortion on the estimated signals in all tests. This shows that discrete-time Kalman filtering is more suitable than discrete-frequency Kalman filtering for the reconstruction of speech signals corrupted by additive white noise.

Data Availability

The voice data (.wav files) used to support the results of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.