Abstract

Speech enhancement has gained considerable attention in applications such as speech transmission over communication channels, speaker identification, speech-based biometric systems, video conferencing, hearing aids, mobile phones, voice conversion, and microphones. Processing the background noise is essential to designing a successful speech enhancement system. In this work, a new speech enhancement technique based on the Stationary Bionic Wavelet Transform (SBWT) and the Minimum Mean Square Error (MMSE) Estimate of Spectral Amplitude is proposed. In the first step, the SBWT is applied to the noisy speech signal in order to obtain eight noisy wavelet coefficients. Each of those coefficients is then denoised with the method based on the MMSE Estimate of Spectral Amplitude. Finally, the inverse SBWT (SBWT⁻¹) is applied to the denoised stationary wavelet coefficients to obtain the enhanced speech signal. The performance of the proposed technique is evaluated by computing the Signal to Noise Ratio (SNR), the Segmental SNR (SSNR), and the Perceptual Evaluation of Speech Quality (PESQ).

1. Introduction

In many speech-related applications, an input speech signal is frequently corrupted by environmental noise and needs further processing with a speech enhancement technique to improve its quality before being used [1]. Generally, speech enhancement techniques can be grouped into two categories: supervised and unsupervised. Unsupervised techniques include spectral subtraction (SS) [2–4], Wiener filtering [5, 6], short-time spectral amplitude (STSA) estimation [7], and short-time log-spectral amplitude estimation (logSTSA) [8]. Supervised speech enhancement techniques employ a training set to learn separate models for noisy and clean speech signals; examples include codebook-based methods [9] and Hidden Markov Model (HMM)-based techniques [10]. Classical speech enhancement techniques usually process a noisy utterance frame by frame, that is, they enhance each short-time segment of the utterance almost independently. Some research works showed that considering the inter-frame variation over a relatively long span of time can contribute to superior performance in enhancing speech [1]. Well-known approaches along this direction include modulation-domain spectral subtraction [11], Kalman filtering, and modulation-domain Wiener filtering [12, 13]. Moreover, whereas the Fourier transform (FT) takes only the frequency content into consideration, the discrete wavelet transform (DWT) [14] takes both the temporal and the frequency characteristics of the analyzed signal into consideration. The DWT has become a well-known method in speech analysis. In Wavelet Thresholding Denoising (WTD) [15], the wavelet transform is applied to split the time-domain signal into sub-bands. After that, the obtained wavelet coefficients (sub-bands) are thresholded.
In [16], the DWT [17, 18] was applied to the speech signal to keep only the obtained approximation portion, which simultaneously attains data compression and noise robustness in recognition. In [1], the DWT was employed to analyze the spectrogram of a noisy utterance along the temporal axis, and the resulting detail portion was then attenuated with the expectation of reducing the noise effect and thereby improving speech quality. Despite its ease of implementation, the preliminary evaluation results indicate that the technique proposed in [1] provides the input signals with better perceptual quality. It was also shown that this technique can be paired with many well-known speech enhancement approaches to achieve even better performance [1]. In this work, a novel speech enhancement technique based on the Stationary Bionic Wavelet Transform (SBWT) [19–21] and the Minimum Mean Square Error (MMSE) Estimate of Spectral Amplitude [22] is proposed. In this paper, this approach is evaluated and compared to four other speech enhancement approaches, which are as follows:
(i) Unsupervised speech denoising via perceptually motivated robust principal component analysis [23].
(ii) The speech enhancement technique based on MSS-SMPO [24, 25].
(iii) The denoising technique based on MMSE Estimate of Spectral Amplitude [22].
(iv) Our previous speech enhancement technique based on LWT and Artificial Neural Network (ANN) and using MMSE Estimate of Spectral Amplitude [26].

The fourth technique, which is based on LWT and ANN [27–29] and uses the MMSE Estimate of Spectral Amplitude [26], can be summarized by the following steps:
(i) First step: applying the LWT to the noisy speech signal to obtain two noisy detail coefficients, cD1 and cD2, and one approximation coefficient.
(ii) Second step: denoising cD1 and cD2 by soft thresholding, which requires suitable thresholds. Those thresholds are determined by an Artificial Neural Network (ANN). This soft thresholding yields two denoised detail coefficients.
(iii) Third step: applying the denoising approach based on the MMSE Estimate of Spectral Amplitude [22] to the approximation coefficient to obtain a denoised approximation coefficient.
(iv) Fourth step: applying the inverse LWT to the two denoised detail coefficients and the denoised approximation coefficient to finally obtain the enhanced signal.
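The soft thresholding used in the second step can be sketched as follows. This is only an illustration of the standard soft-thresholding rule: the coefficient values and the threshold are made up for the example, and the ANN that selects the thresholds in [26] is not reproduced.

```python
import numpy as np

def soft_threshold(coeffs, threshold):
    """Standard soft thresholding: shrink magnitudes by `threshold`, zero the rest."""
    coeffs = np.asarray(coeffs, dtype=float)
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - threshold, 0.0)

# Hypothetical detail coefficients and a hand-picked threshold
# (in [26] the threshold is determined by an ANN, which is not modeled here).
cD1 = np.array([-3.0, -0.5, 0.2, 1.5, 4.0])
denoised_cD1 = soft_threshold(cD1, threshold=1.0)
print(denoised_cD1)
```

Coefficients whose magnitude is below the threshold are set to zero, and the remaining ones are pulled toward zero by the threshold value.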

As future work, we will develop a novel speech enhancement approach using ANN [30–36] or deep learning [37, 38] for thresholding the noisy stationary bionic wavelet coefficients. Those coefficients are obtained by applying the SBWT to the noisy speech signal.

In Section 2 of this paper, materials and methods are presented. Section 2.4 describes the speech enhancement technique proposed in this work. In Section 3, results and discussion are presented. Finally, Section 4 concludes the paper.

2. Materials and Methods

2.1. The Stationary Bionic Wavelet Transform (SBWT)

In [19], the SBWT was proposed as a novel wavelet transform. This transform was initially introduced to solve the perfect reconstruction problem that exists with the Bionic Wavelet Transform (BWT). It has been applied to speech enhancement [19, 20] and also to ECG denoising [21].

2.2. The MMSE Estimate of Spectral Amplitude

In the literature, it was proposed to estimate the noise power spectral density by means of MMSE (Minimum Mean Square Error) optimal estimation [22]. It was shown that the resulting estimator can be interpreted as a VAD (Voice Activity Detector)-based noise power estimator, in which the noise power is updated only when speech absence is detected, together with a required bias compensation [22]. It was also shown that the bias compensation is not needed if the VAD is replaced by a soft SPP (Speech Presence Probability) with fixed priors [22]. Choosing fixed priors has the benefit of decoupling the noise power estimator from subsequent steps in a speech enhancement algorithm, such as the estimation of the speech power and of the clean speech [22]. Gerkmann and Richard [22] showed that the proposed SPP approach maintains the quick noise tracking performance of the bias-compensated MMSE-based technique while exhibiting less overestimation of the spectral noise power and an even lower computational complexity.
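A minimal sketch of an SPP-based noise power update of this kind is given below. It is not the reference implementation of [22]: the fixed a priori SNR of 15 dB, the equal priors, the smoothing constants (0.8 and 0.9), the 0.99 cap, and the initialization from the first frames are all assumptions chosen for illustration, and the function name `spp_noise_psd` is hypothetical.

```python
import numpy as np

def spp_noise_psd(noisy_power, xi_h1_db=15.0, alpha=0.8, n_init_frames=5):
    """Track the noise power from |Y|^2 (shape: frames x bins) with a soft SPP.

    All default parameter values are assumptions made for this sketch.
    """
    noisy_power = np.asarray(noisy_power, dtype=float)
    xi = 10.0 ** (xi_h1_db / 10.0)           # fixed a priori SNR under speech presence
    noise_psd = noisy_power[:n_init_frames].mean(axis=0)  # assume noise-only start
    smoothed_spp = np.full(noisy_power.shape[1], 0.5)
    estimates = np.empty_like(noisy_power)
    for l, y2 in enumerate(noisy_power):
        post_snr = y2 / np.maximum(noise_psd, 1e-12)
        exponent = np.minimum(post_snr * xi / (1.0 + xi), 200.0)
        # Posterior speech presence probability with equal priors.
        spp = 1.0 / (1.0 + (1.0 + xi) * np.exp(-exponent))
        # Guard against stagnation when the SPP stays close to one.
        smoothed_spp = 0.9 * smoothed_spp + 0.1 * spp
        spp = np.where(smoothed_spp > 0.99, np.minimum(spp, 0.99), spp)
        # Soft decision: mix the observation and the previous estimate.
        noise_periodogram = (1.0 - spp) * y2 + spp * noise_psd
        noise_psd = alpha * noise_psd + (1.0 - alpha) * noise_periodogram
        estimates[l] = noise_psd
    return estimates

# Usage on synthetic noise-only periodograms (hypothetical data).
rng = np.random.default_rng(0)
power = rng.exponential(scale=2.0, size=(400, 64))
tracked = spp_noise_psd(power)
print(tracked[200:].mean())
```

Because the speech presence probability weights the update continuously, no hard speech/no-speech decision is taken, which is the property the text above attributes to the SPP variant.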

2.3. Signal Model

In [22], Gerkmann and Richard considered frame-by-frame processing of time-domain signals, where the Discrete Fourier Transform (DFT) is applied to each frame. Let the complex spectral noise and speech coefficients be given, respectively, by $N_k(l)$ and $S_k(l)$, where $l$ is the time frame index and $k$ is the frequency bin index [22]. In [22], it was assumed that noise and speech are additive in the short-time Fourier domain. Therefore, the complex spectral noisy observation has the following expression:

$$Y_k(l) = S_k(l) + N_k(l)$$

In [22], it was supposed that the noise and speech signals have zero mean and are independent, so that

$$E\left(|Y_k(l)|^2\right) = E\left(|S_k(l)|^2\right) + E\left(|N_k(l)|^2\right)$$

where $E(\cdot)$ denotes the statistical expectation operator.

The spectral noise and speech powers are expressed as follows:

$$\sigma_N^2 = E\left(|N_k(l)|^2\right), \qquad \sigma_S^2 = E\left(|S_k(l)|^2\right)$$

Then, the a posteriori SNR and the a priori SNR are expressed as follows:

$$\gamma = \frac{|Y_k(l)|^2}{\sigma_N^2}, \qquad \xi = \frac{\sigma_S^2}{\sigma_N^2}$$

All details about MMSE-based noise power estimation are given in [22].
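As a small numerical illustration of this signal model, the sketch below draws independent zero-mean complex Gaussian speech and noise coefficients, forms the additive noisy observation, and computes the a posteriori SNR (noisy periodogram divided by noise power) and the a priori SNR (speech power divided by noise power). The variances, array sizes, and variable names are hypothetical, and sample means stand in for the expectation operator.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_bins = 2000, 4
speech_var, noise_var = 4.0, 1.0   # hypothetical speech and noise powers

def complex_gaussian(var, shape):
    """Zero-mean circular complex Gaussian samples with power `var`."""
    return np.sqrt(var / 2.0) * (rng.standard_normal(shape)
                                 + 1j * rng.standard_normal(shape))

S = complex_gaussian(speech_var, (n_frames, n_bins))   # speech coefficients
N = complex_gaussian(noise_var, (n_frames, n_bins))    # noise coefficients
Y = S + N                                              # additive noisy observation

# Sample estimates of the spectral powers, per frequency bin.
sigma_s2 = np.mean(np.abs(S) ** 2, axis=0)
sigma_n2 = np.mean(np.abs(N) ** 2, axis=0)
sigma_y2 = np.mean(np.abs(Y) ** 2, axis=0)

a_priori_snr = sigma_s2 / sigma_n2            # one value per bin
a_posteriori_snr = np.abs(Y) ** 2 / sigma_n2  # one value per frame and bin

print(sigma_y2, sigma_s2 + sigma_n2)
print(a_priori_snr, a_posteriori_snr.mean(axis=0))
```

Under the independence assumption the noisy power is close to the sum of the speech and noise powers, and the mean a posteriori SNR is close to one plus the a priori SNR.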

2.4. The Proposed Speech Enhancement Technique

The speech enhancement technique introduced in this work is based on the SBWT [19–21] and the MMSE Estimate of Spectral Amplitude [22]. The novelty of this approach consists in applying the speech enhancement method based on the MMSE Estimate of Spectral Amplitude [1, 22] in the SBWT domain. More precisely, this technique [22] is applied to each noisy stationary bionic wavelet coefficient in order to denoise it. Those noisy coefficients are obtained by applying the SBWT to the noisy speech signal. Then, the inverse SBWT (SBWT⁻¹) is applied to the denoised coefficients in order to finally obtain the enhanced speech signal. Figure 1 illustrates the flowchart of the proposed technique.

According to Figure 1, the first step of the proposed approach is to apply the SBWT to the noisy speech signal to obtain eight noisy stationary bionic wavelet coefficients. Each of those coefficients is denoised by the speech enhancement technique based on the MMSE Estimate of Spectral Amplitude [1, 22], which yields eight denoised coefficients (Figure 1). The inverse SBWT (SBWT⁻¹) is then applied to those denoised coefficients in order to finally obtain the enhanced signal.
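The structure of this pipeline (decompose into eight coefficients, denoise each one, then invert) can be sketched as follows. The SBWT itself is not implemented here: as a clearly labeled stand-in, the signal is split into eight full-length frequency bands with FFT masks, chosen only because that split is undecimated and sums back to the input exactly. The per-band denoiser is passed in as a function and is a placeholder for the MMSE-based method of [22]; all function names are hypothetical.

```python
import numpy as np

def split_into_bands(signal, n_bands=8):
    """Stand-in for the forward transform (NOT the SBWT): FFT-mask band split."""
    signal = np.asarray(signal, dtype=float)
    spectrum = np.fft.rfft(signal)
    edges = np.linspace(0, spectrum.size, n_bands + 1).astype(int)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        masked = np.zeros_like(spectrum)
        masked[lo:hi] = spectrum[lo:hi]
        # Each band keeps the full signal length, like a stationary transform.
        bands.append(np.fft.irfft(masked, n=signal.size))
    return bands

def merge_bands(bands):
    """Stand-in for the inverse transform: the bands sum back to the signal."""
    return np.sum(bands, axis=0)

def enhance(noisy, denoise_band, n_bands=8):
    """Decompose, denoise every coefficient independently, then reconstruct."""
    bands = split_into_bands(noisy, n_bands)
    return merge_bands([denoise_band(band) for band in bands])

# Usage: with an identity "denoiser" the pipeline reconstructs its input.
rng = np.random.default_rng(0)
noisy = rng.standard_normal(1000)
reconstructed = enhance(noisy, denoise_band=lambda band: band)
print(np.max(np.abs(reconstructed - noisy)))
```

Checking reconstruction with an identity denoiser is a convenient sanity test for any transform plugged into this structure, since the SBWT was introduced precisely to guarantee perfect reconstruction.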

2.5. Minimum Mean Square Error (MMSE) Estimate of Spectral Amplitude in the SBWT Domain

In general, classical speech enhancement approaches based on thresholding in the wavelet transform domain can introduce some distortions to the original speech signal. This particularly occurs for unvoiced sounds. Consequently, a great number of speech enhancement techniques based on wavelet transforms employ other tools such as spectral subtraction (SS), Wiener filtering, and MMSE-STSA estimation [39, 40]. This is the reason why we apply the Minimum Mean Square Error (MMSE) Estimate of Spectral Amplitude in the SBWT domain in our speech enhancement system. Applying the SBWT solves the perfect reconstruction problem that exists when the BWT is applied [19]. Furthermore, the SBWT, like other wavelet transforms [41, 42], tends to decorrelate the data [43], which facilitates noise suppression. Applying the MMSE Estimate of Spectral Amplitude [22] to each noisy stationary bionic coefficient allows the speech and noise estimates to adapt better than when this technique [22] is applied to the whole noisy speech signal.

2.6. Unsupervised Speech Denoising via Perceptually Motivated Robust Principal Component Analysis [23]

To overcome the shortcomings of existing sparse and low-rank speech denoising techniques, namely that auditory perceptual properties are not fully exploited and that the speech degradation is easily perceived, a perceptually motivated robust principal component analysis (ISNRPCA) technique was presented. In order to reflect the non-linear frequency perception of the basilar membrane, the cochleagram is employed as the input of ISNRPCA. The latter employs the perceptually meaningful Itakura–Saito measure as its optimization objective function. Furthermore, non-negative constraints are imposed to regularize the decomposed terms with respect to their physical meaning [23]. In [23], Min et al. proposed an alternating direction method of multipliers (ADMM) for solving the optimization problem of ISNRPCA. The latter is completely unsupervised, and neither the noise model nor the speech model needs to be trained beforehand. Experimental results under diverse kinds of noise and different SNRs show that ISNRPCA gives promising results for speech denoising [23].

2.7. The Speech Enhancement Technique Based on MSS-SMPO [25]

In [25], a two-step enhancement technique based on spectral subtraction and phase spectrum compensation was presented for noisy speech in diverse environments involving non-stationary noise and medium to low levels of SNR. In the first step of the technique proposed in [25], the magnitude of the noisy speech spectrum is modified by a spectral subtraction technique, for which a noise estimation approach was introduced. The latter is based on the low-frequency information of the noisy speech and is able to estimate the non-stationary noise precisely. In the second step, the phase spectrum of the noisy speech is modified by phase spectrum compensation, where an SNR-dependent technique is incorporated to determine the amount of compensation to be imposed on the phase spectrum [25]. A modified complex spectrum is obtained by combining the magnitude from the spectral subtraction step and the modified phase spectrum from the phase compensation step, which is found to be a better representation of the enhanced speech spectrum.

3. Results and Discussion

In this work, the proposed technique is evaluated by applying it to ten Arabic speech sentences pronounced by a male speaker and ten others pronounced by a female speaker (Table 1). Those speech signals are artificially degraded by additive noise at different values of SNR (before denoising). To corrupt those speech signals (Table 1), we have chosen four kinds of noise: white Gaussian, car, F16, and tank noises. Those twenty speech signals are listed in Table 1.
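Degrading a clean signal with additive noise at a prescribed SNR can be sketched as below. This is a generic recipe rather than the authors' exact procedure: the function name `add_noise_at_snr` is hypothetical, the signals are synthetic, and the noise recording is assumed to be at least as long as the clean signal.

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + scaled noise has the requested SNR in dB."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)[: clean.size]  # assumes len(noise) >= len(clean)
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

# Usage with synthetic stand-ins for a speech sentence and a noise recording.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 0.01 * np.arange(4000))
noise = rng.standard_normal(5000)
noisy = add_noise_at_snr(clean, noise, snr_db=0.0)

achieved = 10.0 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(achieved)
```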

Also, to evaluate the proposed technique, it is compared with three other speech enhancement approaches, which are as follows:
(i) The denoising approach based on MMSE Estimate of Spectral Amplitude [22].
(ii) The unsupervised speech denoising technique via perceptually motivated robust principal component analysis [23].
(iii) The speech enhancement approach based on MSS-SMPO [24].

This evaluation is performed through the computation of the SNR (Signal to Noise Ratio), the Segmental SNR (SSNR), and the PESQ (Perceptual Evaluation of Speech Quality). The results obtained from these computations are presented in Tables 2–16.
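The two waveform-based measures can be sketched as follows. The definitions used here are common textbook forms and may differ in detail from those used to produce the tables: the frame length of 256 samples and the clamping of per-frame values to the range from -10 dB to 35 dB are assumptions, and the signals are expected to be time-aligned and at least one frame long. PESQ is a standardized perceptual measure (ITU-T P.862) and is not reproduced here.

```python
import numpy as np

def snr_db(clean, enhanced):
    """Global SNR in dB between a clean reference and an enhanced signal."""
    clean = np.asarray(clean, dtype=float)
    error = clean - np.asarray(enhanced, dtype=float)
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(error ** 2))

def segmental_snr_db(clean, enhanced, frame_len=256, lo=-10.0, hi=35.0):
    """Mean of per-frame SNRs in dB, each clamped to [lo, hi] (assumed limits)."""
    clean = np.asarray(clean, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    n_frames = clean.size // frame_len  # trailing partial frame is ignored
    values = []
    for i in range(n_frames):
        seg = slice(i * frame_len, (i + 1) * frame_len)
        signal_energy = np.sum(clean[seg] ** 2)
        error_energy = np.sum((clean[seg] - enhanced[seg]) ** 2)
        ratio = 10.0 * np.log10((signal_energy + 1e-12) / (error_energy + 1e-12))
        values.append(np.clip(ratio, lo, hi))
    return float(np.mean(values))

# Usage on a synthetic example: the residual error is 10% of the clean amplitude.
clean = np.sin(2 * np.pi * 0.01 * np.arange(1024))
enhanced = 1.1 * clean
print(snr_db(clean, enhanced), segmental_snr_db(clean, enhanced))
```

The clamping keeps silent or nearly perfect frames from dominating the average, which is why the segmental measure is usually reported alongside the global SNR.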

According to these tables, the best results are shown in italics, and nearly all of them are obtained with the proposed technique. Therefore, this technique outperforms the other speech enhancement approaches [22–25] used in this evaluation.

Figure 2 illustrates an example of speech enhancement in which the proposed technique is applied to the clean speech signal (Figure 2(a)) corrupted additively by car noise (Volvo) with SNR = 0 dB (Figure 2(b)). According to this figure, the technique considerably reduces the noise and yields an enhanced speech signal (Figure 2(c)) with little distortion, despite the low value of the SNR (0 dB). Figure 3 illustrates the spectrograms of the clean, noisy, and enhanced speech signals.

The spectrogram in Figure 3(b) shows that the noise corrupting the speech signal is localized in the low-frequency region. The spectrogram in Figure 3(c) shows that the car noise is considerably reduced by the proposed speech enhancement technique. Moreover, this technique yields an enhanced speech signal with little distortion relative to the clean speech signal (Figure 2(a)).

In the following, we compare the proposed technique with our previous speech enhancement approach, which is based on LWT and ANN and uses MMSE [26]. The first difference between the technique proposed in this work and our previous approach is that they use two completely different wavelet transforms: the SBWT for the technique proposed in this paper and the LWT for our previous approach proposed in [26]. The second difference is that, in the technique proposed in this paper, the denoising approach based on the MMSE Estimate of Spectral Amplitude [22] is applied to all stationary bionic wavelet coefficients, whereas in our previous technique [26] this approach [22] is applied only to the approximation coefficient. The latter technique also uses an Artificial Neural Network (ANN), which further differentiates it from the technique proposed in this paper. These two techniques are also compared in terms of SNR, SSNR, and PESQ. They are applied to a speech signal degraded by car noise at diverse values of SNR before denoising. Tables 17–19 present the results obtained from the computation of SNR, SSNR, and PESQ for the two techniques.

According to these tables, the best results are the values in italics and they are obtained from the application of the proposed technique. Therefore, this technique outperforms the other speech enhancement approach proposed in [26].

4. Conclusion

In this paper, we propose a new speech enhancement technique based on the SBWT and the MMSE Estimate of Spectral Amplitude. In the first step of this technique, the SBWT is applied to the noisy speech signal to obtain eight noisy stationary bionic wavelet coefficients. Each of those coefficients is then denoised with the approach based on the MMSE Estimate of Spectral Amplitude. Finally, the inverse SBWT is applied to the denoised stationary wavelet coefficients to obtain the enhanced speech signal. This technique is evaluated by comparing it with four other speech enhancement approaches. The first one is the denoising technique based on the MMSE Estimate of Spectral Amplitude. The second one is the speech enhancement technique based on MSS-SMPO. The third one is the unsupervised speech denoising approach through perceptually motivated robust principal component analysis. The fourth one is the speech enhancement technique based on LWT and ANN and using the MMSE Estimate of Spectral Amplitude. This evaluation is performed through the computation of the Signal to Noise Ratio (SNR), the Segmental SNR (SSNR), and the Perceptual Evaluation of Speech Quality (PESQ). The results obtained from these computations show that the proposed technique outperforms the other techniques mentioned above. Furthermore, the technique proposed in this work considerably reduces the noise corrupting the clean speech signal and yields an enhanced speech signal with good perceptual quality.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.