Abstract

This paper describes a new speech enhancement approach which employs the minimum mean square error (MMSE) estimator based on a generalized gamma distribution of the short-time spectral amplitude (STSA) of a speech signal. In the proposed approach, the human perceptual auditory masking effect is incorporated into the speech enhancement system. The algorithm is based on a criterion by which the audible noise may be masked rather than attenuated, thereby reducing the chance of speech distortion. A performance assessment shows that our proposal achieves more significant noise reduction than the perceptual modification of Wiener filtering and the gamma-based MMSE estimator.

1. Introduction

Speech enhancement is concerned with improving the quality and intelligibility of a speech signal. The need to enhance speech signals arises in many situations in which the speech signal originates from a noisy location or is affected by noise over a communication channel.

Speech enhancement methods are employed more and more often in applications such as mobile telephony, speech recognition, and human-machine communication systems [1–10]. Speech enhancement algorithms can, for example, be used as a preprocessor in speech-coding systems employed in cellular phones. In the case of a speech recognition system, the noisy speech signal can be preprocessed by a speech enhancement method before being fed to the speech recognizer. In an air-ground communication scenario, as well as in similar communication systems used by the military, it is more desirable to enhance the intelligibility rather than the quality of speech [11]. A further possible application is the enhancement of mating sounds and bioacoustic signals before their analysis. Here, algorithms based on Wiener filtering or spectral subtraction can be used to eliminate background noise and other natural sounds not produced by the animal, followed, for example, by nonlinear time-series analysis methods to analyze the dynamics of the sound-producing apparatus of the animal [12].

In this paper, single-microphone speech enhancement is studied. One of the main approaches of speech enhancement algorithms is to obtain the best possible estimate of the short-time spectral amplitude of a speech signal from the given noisy speech. The performance of a speech enhancement algorithm is characterized by a tradeoff between the amount of noise reduction, the speech distortion, and the level of musical residual noise.

Several methods have been proposed to reduce the residual noise. Ephraim and Malah [1, 2] used the conventional hypothesis that, for speech enhancement in the discrete Fourier transform (DFT) domain, the distribution of the complex speech DFT coefficients is Gaussian. Nowadays, super-Gaussian models of the DFT coefficients are used because they lead to estimators with improved performance compared to those based on a Gaussian model [3–5]. For example, the minimum mean square error estimators for the amplitudes, assuming a one-sided generalized gamma distribution, are studied in [6–9]. Experimental results showed that the gamma-based estimator had higher preference scores compared to the Gaussian-based estimator for various types of noise and at different noise levels.

It is very difficult to suppress residual noise without decreasing intelligibility and without introducing speech distortion and musical residual noise [13]. Several methods [13–17] attempted to reduce the musical residual noise by emulating the human auditory system, based on the fact that the human ear cannot perceive additive noise when the noise level falls below the auditory masking threshold (AMT). These methods are predominantly based on spectral subtraction and Wiener filtering, which exploit the masking properties of the human auditory system. However, they do not perform well at very low and time-varying signal-to-noise ratios (SNRs) and introduce a perceptually disturbing musical noise, especially with colored and nonstationary noise.

In this work, the human perceptual auditory masking effect is incorporated into the estimator based on the gamma model in order to obtain a more accurate estimate and achieve an effective suppression of noise as well as minimal musical tones in the residual signal. This study is followed by numerical simulations of these algorithms and an objective evaluation using a corpus of speech.

The rest of this paper is organized as follows. Section 2 describes the gamma-based short-time spectral amplitude estimator with some details. Section 3 presents our proposed enhancement method, and Section 4 demonstrates our implementations and results. Conclusions are finally drawn in Section 5.

2. Gamma-Based Short-Time Spectral Amplitude Estimator

The noisy speech signal is given by

$$y(n) = x(n) + d(n), \qquad (1)$$

where $x(n)$ is the clean speech signal, which is assumed to be independent of the additive noise $d(n)$. Their representation in the short-time Fourier transform (STFT) domain is given by

$$Y(k,l) = X(k,l) + D(k,l), \qquad (2)$$

where $Y(k,l)$, $X(k,l)$, and $D(k,l)$ are the STFT coefficients of the noisy speech, the clean speech, and the noise signal, respectively. The index $k$ corresponds to the frequency bins and the index $l$ to the time frames of the STFT. Since DFT coefficients from different time frames and frequency indices are assumed to be independent, the indices $k$ and $l$ will sometimes be omitted for simplicity. We can write $X = A e^{j\theta_X}$ and $Y = R e^{j\theta_Y}$, where the random variables $A$ and $R$ represent the magnitudes of the clean speech DFT coefficient and the noisy speech DFT coefficient, respectively, and $\theta_X$ and $\theta_Y$ represent the corresponding phase values. We use upper-case letters to denote random variables and the corresponding lower-case letters to denote their realizations.
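For concreteness, the following minimal Python sketch shows one way to compute the STFT coefficients $Y(k,l)$ with the framing used later in Section 4 (256-sample Hann-windowed frames, 50% overlap); the function name and structure are illustrative, not taken from the paper.

```python
import numpy as np

def stft(signal, frame_len=256, hop=128):
    """Return the one-sided STFT Y(k, l) of a real signal.

    Rows index time frames l, columns index frequency bins k.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[l * hop : l * hop + frame_len] * window
                       for l in range(n_frames)])
    return np.fft.rfft(frames, axis=1)
```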

In this paper, we focus on the minimum mean square error estimation of the clean magnitude $A$. The MMSE estimate $\hat{a}$ of $A$ is the expectation of the clean magnitude conditional on the noisy magnitude $R$. With Bayes' formula we can express $\hat{a}$ as follows:

$$\hat{a} = E\{A \mid R = r\} = \frac{\int_0^\infty a\, p(r \mid a)\, p(a)\, da}{\int_0^\infty p(r \mid a)\, p(a)\, da}. \qquad (3)$$

The estimation of the clean magnitude requires some assumptions about the distribution of the speech and the noise. The speech has usually been assumed Gaussian, for example, [1, 2], but in recent times estimators based on super-Gaussian speech assumptions such as Laplacian or gamma distributions have been derived [8]. A similar development has been seen for the noise assumptions; most commonly, the noise is assumed Gaussian [8].

With the zero-mean Gaussian distribution assumption on the noise DFT coefficients, $p(r \mid a)$ can be written as follows [18]:

$$p(r \mid a) = \frac{2r}{\sigma_D^2} \exp\!\left(-\frac{r^2 + a^2}{\sigma_D^2}\right) I_0\!\left(\frac{2ar}{\sigma_D^2}\right), \qquad (4)$$

where $I_0(\cdot)$ is the 0th-order modified Bessel function of the first kind, and $\sigma_D^2$ is the noise spectral variance.

In the gamma-based MMSE estimators of the speech DFT magnitudes, we assume that the speech DFT magnitudes are distributed according to a one-sided generalized gamma prior density of the form

$$p(a) = \frac{\gamma \beta^{\nu}}{\Gamma(\nu)}\, a^{\gamma\nu - 1} \exp\!\left(-\beta a^{\gamma}\right), \qquad a \ge 0, \qquad (5)$$

where $\Gamma(\cdot)$ is the Gamma function and the random variable $A$ represents the DFT magnitudes, with the constraints $\gamma > 0$, $\nu > 0$, and $\beta > 0$ on the parameters.
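As an illustration, the two densities (4) and (5) can be evaluated numerically as follows. This is a sketch using SciPy; the use of the exponentially scaled Bessel function to avoid overflow is our choice, not the paper's.

```python
import numpy as np
from scipy.special import gammaln, i0e

def rician_likelihood(r, a, sigma_d2):
    """p(r | a) of Eq. (4); i0e(x) = exp(-|x|) I_0(x), so the
    exponentials are combined as exp(-(r - a)^2 / sigma_d2)."""
    return (2.0 * r / sigma_d2) * np.exp(-((r - a) ** 2) / sigma_d2) \
           * i0e(2.0 * a * r / sigma_d2)

def generalized_gamma_prior(a, gamma_p, nu, beta):
    """One-sided generalized gamma density of Eq. (5), for a > 0."""
    log_p = (np.log(gamma_p) + nu * np.log(beta) - gammaln(nu)
             + (gamma_p * nu - 1.0) * np.log(a) - beta * a ** gamma_p)
    return np.exp(log_p)
```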

The gamma-based MMSE magnitude estimators for the cases $\gamma = 1$ and $\gamma = 2$ have been derived in [6, 7, 19]. We will use the case $\gamma = 2$, as the related estimator can be derived without any approximations, and the maximum achievable performance for both cases is about the same.

Inserting (5) with $\gamma = 2$ and (4) into (3) gives

$$\hat{a} = \frac{\int_0^\infty a^{2\nu} \exp\!\left(-\left(\beta + \frac{1}{\sigma_D^2}\right) a^2\right) I_0\!\left(\frac{2ar}{\sigma_D^2}\right) da}{\int_0^\infty a^{2\nu - 1} \exp\!\left(-\left(\beta + \frac{1}{\sigma_D^2}\right) a^2\right) I_0\!\left(\frac{2ar}{\sigma_D^2}\right) da}. \qquad (6)$$

Using [20, Theorem 6.643.2], the integrals can be solved for $\nu > 0$. After inserting the relation between $\beta$ and the second moment $E\{A^2\}$, which for this case is $E\{A^2\} = \nu/\beta$, with $\sigma_X^2 = E\{A^2\}$, the estimator is as follows [7]:

$$\hat{a} = \sqrt{\frac{\xi}{\gamma_k(\nu + \xi)}}\; \frac{\Gamma\left(\nu + \frac{1}{2}\right)}{\Gamma(\nu)}\; \frac{\Phi\left(\nu + \frac{1}{2}, 1; v\right)}{\Phi(\nu, 1; v)}\; r, \qquad (7)$$

where $v = \frac{\xi}{\nu + \xi}\gamma_k$; $\xi = \sigma_X^2/\sigma_D^2$ is called the a priori SNR; $\gamma_k = r^2/\sigma_D^2$ is the a posteriori SNR; and $M_{\lambda,\mu}(\cdot)$ is recognized as the Whittaker function, expressible in terms of the confluent hypergeometric function $\Phi(\cdot,\cdot;\cdot)$ [20, Equations 9.210.1 and 9.220.2]:

$$M_{\lambda,\mu}(z) = z^{\mu + \frac{1}{2}}\, e^{-z/2}\, \Phi\!\left(\mu - \lambda + \tfrac{1}{2},\, 2\mu + 1;\, z\right), \qquad (8)$$

$$\Phi(a, c; z) = 1 + \frac{a}{c}\,\frac{z}{1!} + \frac{a(a+1)}{c(c+1)}\,\frac{z^2}{2!} + \cdots. \qquad (9)$$
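A minimal sketch of the gain implied by (7), assuming SciPy's `hyp1f1` for the confluent hypergeometric function; we apply Kummer's transformation $\Phi(a,c;z) = e^z \Phi(c-a,c;-z)$ so that the ratio stays numerically stable for large $v$.

```python
import numpy as np
from scipy.special import gamma as gamma_fn, hyp1f1

def gamma_mmse_gain(xi, gamma_post, nu=1.0):
    """Gain of the gamma-based estimator (7).

    Kummer's transformation turns Phi(nu+1/2,1;v)/Phi(nu,1;v) into
    Phi(1/2-nu,1;-v)/Phi(1-nu,1;-v), avoiding exp(v) overflow.
    nu = 1 reduces to the classical MMSE-STSA gain of [1].
    """
    v = xi * gamma_post / (nu + xi)
    ratio = hyp1f1(0.5 - nu, 1.0, -v) / hyp1f1(1.0 - nu, 1.0, -v)
    return (gamma_fn(nu + 0.5) / gamma_fn(nu)
            * np.sqrt(xi / (gamma_post * (nu + xi))) * ratio)

# usage: A_hat = gamma_mmse_gain(xi, gamma_post, nu=1.0) * R
```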

The special case $\nu = 1$ is the traditional MMSE-STSA estimator derived in [1].

In order to evaluate the previous gain functions, we must first estimate the noise power spectrum $\sigma_D^2$. This is often done during periods of speech absence as determined by a voice activity detector (VAD), by using a noise-estimation algorithm such as the minimum statistics approach [21, 22], or by using the true noise signal in comparative studies.
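As one simple possibility, a VAD-driven recursive average (rather than full minimum statistics [21, 22]) could track the noise power spectrum as follows; the smoothing constant is an assumption of ours.

```python
import numpy as np

def update_noise_psd(noise_psd, Y_frame, speech_present, alpha_n=0.9):
    """Recursively update the noise PSD during speech absence only;
    a lightweight stand-in for the estimators cited above."""
    if not speech_present:
        noise_psd = alpha_n * noise_psd + (1.0 - alpha_n) * np.abs(Y_frame) ** 2
    return noise_psd
```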

The a posteriori SNR estimate $\hat{\gamma}_k(k,l)$ is the ratio of the squared noisy magnitude and the estimated noise power spectrum. Furthermore, we use the decision-directed approach for the estimation of the a priori SNR, as in [1, 2, 23]. Thus, $\hat{\xi}(k,l)$ is given by

$$\hat{\xi}(k,l) = \alpha\,\frac{\hat{A}^2(k, l-1)}{\hat{\sigma}_D^2(k, l-1)} + (1 - \alpha)\max\{\hat{\gamma}_k(k,l) - 1,\, 0\}, \qquad (10)$$

where the smoothing factor satisfies $0 \le \alpha < 1$. A value of $\alpha = 0.98$ was used in the implementation, and the lower limit $\xi_{\min}$ recommended in [23] is similar to the use of the spectral floor in the basic spectral subtraction method [24]. A lower limit of at least −15 dB is recommended.
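A direct transcription of (10) with the −15 dB floor might look as follows; the function name is ours.

```python
import numpy as np

def decision_directed_xi(A_prev, noise_psd, gamma_post,
                         alpha=0.98, xi_min=10 ** (-15 / 10)):
    """Decision-directed a priori SNR estimate, Eq. (10), with a
    -15 dB lower limit acting as a spectral floor."""
    xi = alpha * (A_prev ** 2) / noise_psd \
         + (1.0 - alpha) * np.maximum(gamma_post - 1.0, 0.0)
    return np.maximum(xi, xi_min)
```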

3. Proposed Enhancement Technique

Figure 1 contains the flowchart of the proposed speech enhancement scheme, which consists of the following steps (a code skeleton of the processing loop is sketched after this list).

(i) Spectral decomposition: windowing + fast Fourier transform (FFT).
(ii) Speech/noise detection and noise estimation.
(iii) Rough estimation of the speech magnitude spectra $|\hat{X}(k,l)|$.
(iv) Calculation of the auditory masking threshold $T(k,l)$.
(v) Calculation of the enhanced speech magnitude spectra $\hat{A}(k,l)$ using (7)–(9).
(vi) Calculation of the enhanced speech signal based on the following equation:

$$\hat{x}(n) = \text{IFFT}\left\{\hat{A}(k,l)\, e^{j\theta_Y(k,l)}\right\}, \qquad (11)$$

followed by overlap-add synthesis.
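The skeleton below illustrates how steps (i)–(vi) fit together in an analysis-modification-synthesis loop. It is a sketch under our own naming: `rough_spectral_subtraction`, `auditory_masking_threshold`, `decision_directed_xi`, and `perceptual_gamma_mmse` are placeholders for the steps described in this and the following subsections, not the authors' implementation.

```python
import numpy as np

def enhance(noisy, noise_psd, frame_len=256, hop=128):
    """Analysis-modification-synthesis loop of the proposed scheme."""
    window = np.hanning(frame_len)  # Hann analysis window, 50% overlap
    out = np.zeros(len(noisy))
    A_prev = np.zeros(frame_len // 2 + 1)
    for start in range(0, len(noisy) - frame_len + 1, hop):
        Y = np.fft.rfft(window * noisy[start:start + frame_len])    # (i)
        R = np.abs(Y)
        gamma_post = R ** 2 / noise_psd                             # a posteriori SNR
        X_rough = rough_spectral_subtraction(R, noise_psd)          # (iii)
        T = auditory_masking_threshold(X_rough)                     # (iv)
        xi = decision_directed_xi(A_prev, noise_psd, gamma_post)
        A = perceptual_gamma_mmse(R, xi, gamma_post, T)             # (v)
        x_frame = np.fft.irfft(A * np.exp(1j * np.angle(Y)))        # (vi), Eq. (11)
        out[start:start + frame_len] += x_frame                     # overlap-add
        A_prev = A
    return out
```

With a Hann analysis window at 50% overlap, the window contributions sum to (approximately) a constant, so no synthesis window is needed in this sketch.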

3.1. Auditory Masking Threshold (AMT) Calculation

The auditory masking threshold is obtained by modeling the frequency selectivity of the human ear and its masking property. This paper considers only simultaneous masking. Before computing the auditory masking threshold, the speech spectra must be estimated. Spectral subtraction or a Wiener filter is used to obtain a rough estimate of the speech spectra. The speech magnitude spectra estimated by spectral subtraction are given by

$$|\hat{X}(k,l)|^2 = \max\left\{|Y(k,l)|^2 - \hat{\sigma}_D^2(k,l),\; \varepsilon\,|Y(k,l)|^2\right\}, \qquad (12)$$

where $\varepsilon$ is a small positive value.
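A direct transcription of (12), with the floor $\varepsilon |Y|^2$ keeping the rough estimate strictly positive (the default value of $\varepsilon$ is our assumption):

```python
import numpy as np

def rough_spectral_subtraction(Y_mag, noise_psd, eps=1e-3):
    """Rough speech magnitude estimate of Eq. (12): power spectral
    subtraction floored at eps * |Y|^2."""
    power = np.maximum(Y_mag ** 2 - noise_psd, eps * Y_mag ** 2)
    return np.sqrt(power)
```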

Once $|\hat{X}(k,l)|$ is obtained, the auditory masking threshold $T(k,l)$ can be calculated based on the Johnston model [25]; then the gamma-based short-time spectral amplitude estimator can be applied.
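The Johnston model [25] involves critical-band analysis, convolution with a spreading function, a tonality-dependent offset, and renormalization. The much-simplified sketch below conveys only the flow; the Bark mapping, Schroeder spreading constants, fixed noise-like offset, and 18-band default are textbook choices of ours (renormalization and the absolute hearing threshold are omitted), not values from the paper.

```python
import numpy as np

def auditory_masking_threshold(X_mag, fs=8000, n_bands=18):
    """Simplified Johnston-style masking threshold from a rough
    speech magnitude spectrum X_mag (one-sided, linear scale)."""
    # (1) group the power spectrum into Bark-like critical bands
    freqs = np.linspace(0, fs / 2, len(X_mag))
    bark = 13 * np.arctan(0.00076 * freqs) \
           + 3.5 * np.arctan((freqs / 7500.0) ** 2)
    band = np.minimum(bark.astype(int), n_bands - 1)
    B = np.array([np.sum(X_mag[band == i] ** 2) for i in range(n_bands)])
    # (2) spread energy across bands (Schroeder spreading function, dB)
    i, j = np.meshgrid(np.arange(n_bands), np.arange(n_bands), indexing="ij")
    dz = i - j
    spread_db = 15.81 + 7.5 * (dz + 0.474) \
                - 17.5 * np.sqrt(1.0 + (dz + 0.474) ** 2)
    C = (10.0 ** (spread_db / 10.0)) @ B
    # (3) subtract a fixed offset (noise-like masker assumed)
    T = C * 10.0 ** (-5.5 / 10.0)
    # (4) map the per-band thresholds back to frequency bins
    return T[band]
```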

4. Performance Evaluation

This section presents the performance evaluation of the proposed enhancement algorithm as well as a comparison with two other estimators. The first one is the gamma-based MMSE estimator presented in Section 2, which does not take the masking properties into account. The second one is the estimator based on a perceptual modification of generalized Wiener filtering, proposed by Lin et al. in [15].

For evaluation purposes, we used noisy speech signals taken from the Noizeus database [26], which consists of 30 speech signals sampled at 8 kHz, contaminated by eight different real-world noises (babble, car, exhibition hall, restaurant, street, airport, train station, train) at different SNRs. The frame size is 256 samples with an overlap of 50%, the analysis window is a Hanning window, and the critical-band decomposition follows the Johnston model [25]. The enhanced signal was synthesized using the overlap-add method. The initial noise variance was estimated from 0.64 seconds of noise only, preceding speech activity.

MATLAB implementations available from [27] have been used to evaluate the confluent hypergeometric functions.

To measure the quality of the enhanced signal, we have used the segmental SNR, the weighted spectral slope measure (WSS), and the perceptual evaluation of speech quality (PESQ) [28–30]. All of these measures show high correlation with subjective quality assessments.

The WSS measure is based on an auditory model and computes a weighted difference between the spectral slopes in each band. The magnitude of each weight reflects whether the band is near a spectral peak or valley and whether the peak is the largest in the spectrum. One implementation of the WSS measure can be defined as follows:

$$d_{\text{WSS}} = \frac{1}{M} \sum_{m=1}^{M} \frac{\sum_{j=1}^{K} W(j,m)\left(S_c(j,m) - S_p(j,m)\right)^2}{\sum_{j=1}^{K} W(j,m)}, \qquad (13)$$

where $K$ is the number of bands, $M$ is the total number of frames, and $S_c(j,m)$ and $S_p(j,m)$ are the spectral slopes (typically the spectral differences between neighboring bands) of the $j$th band in the $m$th frame for the clean and processed speech signals, respectively. The $W(j,m)$ are weights, which can be calculated as shown by Klatt in [28]. The highest 5% of the WSS values were discarded, as suggested in [29], to exclude unrealistically high spectral distance values. The lower the WSS measure for an enhanced speech signal, the better its perceived quality.
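In code, the per-frame weighted slope distance of (13) and the 5% trimming might look like this; computing the slopes and the Klatt weights [28] is assumed to be done upstream.

```python
import numpy as np

def wss(slopes_clean, slopes_proc, weights):
    """WSS measure, Eq. (13). All inputs are (M frames, K bands)
    arrays; lower values indicate better perceived quality."""
    num = np.sum(weights * (slopes_clean - slopes_proc) ** 2, axis=1)
    den = np.sum(weights, axis=1)
    d = num / den
    d = np.sort(d)[: int(0.95 * len(d))]  # discard highest 5% of frames [29]
    return np.mean(d)
```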

Segmental SNR is based on the classical SNR and is one of the most widely used measures for testing enhancement algorithms. Since the correlation of the classical SNR with subjective quality is poor, we instead use the frame-based segmental SNR, obtained by averaging the frame-level SNR estimates [29]:

$$\text{SNRseg} = \frac{10}{M} \sum_{m=0}^{M-1} \log_{10} \frac{\sum_{n=Nm}^{Nm+N-1} x^2(n)}{\sum_{n=Nm}^{Nm+N-1} \left(x(n) - \hat{x}(n)\right)^2}, \qquad (14)$$

where $x(n)$ and $\hat{x}(n)$ represent the clean and processed speech samples, respectively, $N$ denotes the frame length, and $M$ is the number of frames. The lower and upper thresholds are selected to be −10 dB and +35 dB, respectively.
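A straightforward implementation of (14) with the per-frame clamping:

```python
import numpy as np

def snr_seg(clean, processed, frame_len=256, lo=-10.0, hi=35.0):
    """Frame-based segmental SNR, Eq. (14), with each frame-level
    SNR clamped to [lo, hi] dB before averaging."""
    n_frames = len(clean) // frame_len
    snrs = []
    for m in range(n_frames):
        s = clean[m * frame_len:(m + 1) * frame_len]
        e = s - processed[m * frame_len:(m + 1) * frame_len]
        snr = 10.0 * np.log10(np.sum(s ** 2) / (np.sum(e ** 2) + 1e-12))
        snrs.append(np.clip(snr, lo, hi))
    return np.mean(snrs)
```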

The perceptual evaluation of speech quality (PESQ) measure, described in [30], was adopted as ITU-T Recommendation P.862 [31]. The PESQ measure is one of the most commonly used measures to predict the subjective opinion score of degraded or enhanced speech. In the PESQ measure, a reference signal and the enhanced signal are first aligned in both time and level. This is followed by a range of perceptually significant transforms, which include Bark spectral analysis, frequency equalization, gain variation equalization, and loudness mapping. The difference between the loudness spectra, termed the disturbance, is computed and averaged over time and frequency to produce the prediction of the subjective MOS score. The PESQ score ranges from 1.0 (worst) to 4.5 (best), with higher scores indicating better quality [32].

Figures 2, 3, and 4 show plots of mean results in terms of segmental SNR, PESQ, and WSS measures, for 30 Noizeus sentences corrupted by white, babble, and car noise, respectively, at 0–15 dB SNR.

As can be seen, the proposed method outperforms the two other estimators in terms of segmental SNR, PESQ, and WSS measures for all SNR values. The improvement at higher input SNRs is more noticeable because of the more accurate AMT calculation.

Table 1 presents an example of the objective results obtained for noisy speech and for speech enhanced with the three estimators, in the case of white, babble, and car noise at 0 dB and 5 dB SNR.

From the objective results (Figures 2, 3, and 4, Table 1), it can be seen that the proposed estimator has higher preference scores compared to the two other estimators for all noise levels from 0 dB to 15 dB SNR. Furthermore, informal listening tests confirmed that the proposed estimator yields better quality with significantly lower noise distortion than the gamma-based estimator and comparable quality with the perceptual Wiener estimator.

5. Conclusions

In this paper, a gamma-based minimum mean square error estimator for speech enhancement incorporating masking properties was proposed. We showed an increase in the quality of the enhanced speech with different noise types.

Results, in terms of objective measures and listening tests, indicated that the proposed approach achieves a better tradeoff between the amount of noise reduction, the speech distortion, and the level of musical residual noise than the perceptual Wiener filter and the gamma-based estimator.

The implementation of the estimator based on gamma speech modeling requires the evaluation of the Gamma and confluent hypergeometric functions, in addition to the AMT computation in the proposed estimator. In a real-time implementation, these functions can be stored in a lookup table. The runtime computation of these estimators will then not be much more complex than that of the perceptual Wiener estimator.

Based on these findings, using noise masking properties to adapt the speech enhancement system proved beneficial in improving the minimum mean square error amplitude estimator under the generalized gamma distribution.

In the future, we plan to evaluate its possible application in preprocessing for new communication systems and hearing aids.