Journal of Electrical and Computer Engineering
Volume 2012 (2012), Article ID 282019, 12 pages
http://dx.doi.org/10.1155/2012/282019
Research Article

Application of Perceptual Filtering Models to Noisy Speech Signals Enhancement

1LRSITI, Département Génie Electrique, Ecole Nationale des Ingénieurs de Tunis, BP 37, 1002 Le Belvédère, Tunisia
2Département de Génie Physique et Instrumentations, Institut National des Sciences Appliquées et de Technologies, Centre Urbain Nord, BP 676, 1080 Tunis Cedex, Tunisia

Received 20 March 2012; Revised 24 May 2012; Accepted 30 May 2012

Academic Editor: Raj Senani

Copyright © 2012 Novlene Zoghlami and Zied Lachiri. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

This paper describes a new speech enhancement approach using perceptually based noise reduction. The proposed approach is based on the application of two perceptual filtering models to noisy speech signals: the gammatone and the gammachirp filter banks, with nonlinear resolution according to the equivalent rectangular bandwidth (ERB) scale. The perceptual filtering yields a number of subbands that are individually spectrally weighted and modified according to two different noise suppression rules. An accurate noise estimate is important for reducing the musical noise artifacts that appear in the processed speech after a classic subtractive process. In this context, we use continuous noise estimation algorithms. The performance of the proposed approach is evaluated on speech signals corrupted by real-world noises. Using objective tests based on the perceptual quality PESQ score and the quality ratings of signal distortion (SIG), noise distortion (BAK), and overall quality (OVRL), and a subjective test based on automatic speech recognition (ASR), we demonstrate that our speech enhancement approach using filter banks modeling the human auditory system outperforms the conventional spectral modification algorithms in improving the quality and intelligibility of the enhanced speech signal.

1. Introduction

High-quality speech in real environments is very important for automatic speech processing systems and human-machine interfaces. However, the performance of these systems can be degraded by background noise. Thus, there is a strong need to improve the performance of these applications in high-noise environments by applying effective speech enhancement techniques able to suppress the undesirable noise. These techniques aim to improve perceptual aspects of degraded speech, namely its quality and intelligibility. Many methods have been developed to remove the background noise while retaining speech intelligibility, based on short-time spectral estimation of the clean speech. These methods are able to reduce the noise and improve the quality, but at the expense of introducing speech distortion, which results in a loss of intelligibility. Hence, the main challenge in designing effective speech enhancement algorithms is to suppress the noise without introducing any perceptible speech distortion. The spectral modification methods are historically among the first algorithms proposed for noise reduction, and the generalized spectral subtraction is the most popular of them [1]. This method reduces the background noise by estimating the short-time spectral magnitude of the speech signal, subtracting the noise estimate from the noisy speech. The spectral subtraction technique is flexible and simple to implement. However, its major drawback is that it introduces into the enhanced speech a residual noise with an unnatural structure, called “musical noise”, which is composed of tones at random frequencies.
The unnatural structure of the musical noise is perceived as nonstationary noise artifacts that depend, on the one hand, on the time and frequency changes of the noise and, on the other hand, on the way the human auditory system perceives these artifacts. The minimum mean-square error (MMSE) noise reduction rule proposed by Ephraim and Malah [2] exploits the average spectral estimation of the speech signal based on prior knowledge of the noise variance, with the goal of masking and reducing the residual noise. In [3–8], the noise is reduced with subtractive-type algorithms using a multiband and nonlinear spectral process. In [9–11], the authors exploit the human perceptual masking properties to improve the quality and intelligibility of the speech signal without introducing speech distortion. The difficulty with these approaches is that an estimate of the clean speech itself is necessary in order to calculate the masking threshold.

The solution proposed in this paper aims to achieve high noise reduction with efficient residual noise elimination while preserving the speech components. This is done by imposing several requirements on the speech analysis/synthesis system, based on knowledge of human perception properties. We propose to adapt the spectral modification algorithms to a multiband analysis using perceptual filter bank models of the human ear, built on the critical band concept and a nonlinear frequency resolution. This makes it possible to find the best tradeoff, from a perceptual point of view, between the amount of noise reduction, the speech distortion, and the level of musical noise, and to overcome the limitations of spectral modification algorithms for speech enhancement in real-world listening situations, where the background noise level and characteristics are constantly changing.

The paper is organized as follows: in Section 2, the principle of common spectral modification algorithms reviewed in the speech enhancement literature is described. In Section 3, the proposed enhancement approach is presented. Finally, an objective and subjective evaluation is performed in Section 4.

2. Spectral Modification Principle

The spectral modification techniques operate in the frequency domain. These methods are widely used for the enhancement of speech signals corrupted by additive noise with constant or slowly varying spectral characteristics. The basic idea is to manipulate the magnitude of the noisy speech spectrum using a fixed, uniformly spaced frequency transformation. Consider a speech signal $x(n)$ degraded by additive background noise $d(n)$. The noisy speech $y(n)$ can be expressed as

$$y(n) = x(n) + d(n). \tag{1}$$

The signal is divided into uniform frames using an adequate analysis window and is processed in the frequency domain. The spectral analysis and synthesis are usually performed by a discrete Fourier transform and its inverse with the overlap-add technique. The noise suppression process is a multiplication of the short-time spectral magnitude of the noisy speech $|Y(p,w)|$ by a gain function $G(p,w)$:

$$|X(p,w)| = G(p,w)\,|Y(p,w)| \quad \text{with } 0 \le G(p,w) \le 1, \tag{2}$$

where $p$ is the frame index, $w$ is the frequency index, and $|X(p,w)|$ is the magnitude spectrum of the processed speech. Each gain function corresponds to a given noise suppression rule that changes depending on the characteristics of the noisy signal spectrum and the estimated noise spectrum.
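As an illustration of the gain multiplication in (2), the following minimal Python sketch applies a per-bin gain to one frame. It is not the authors' implementation: the function names `dft`, `idft`, and `apply_gain` are hypothetical, a naive transform stands in for an FFT, and windowing and overlap-add are left out.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def apply_gain(frame, gain):
    """Scale the magnitude spectrum of one frame by gain[w], as in (2).

    Multiplying the complex bin Y by a real gain in [0, 1] scales |Y| and
    leaves the phase of Y unchanged.
    """
    spectrum = dft(frame)
    if len(gain) != len(spectrum):
        raise ValueError("one gain value per frequency bin is required")
    if any(g < 0.0 or g > 1.0 for g in gain):
        raise ValueError("gains must satisfy 0 <= G <= 1")
    enhanced = [g * y for g, y in zip(gain, spectrum)]
    # The real part is kept; for a real output the gain should be symmetric,
    # i.e. gain[k] == gain[N - k].
    return [v.real for v in idft(enhanced)]
```

With a gain of 1 in every bin the frame is returned unchanged, and with a gain of 0 it is muted, which gives a quick sanity check of the analysis/synthesis chain.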

3. Using Perceptual Filtering Models for Speech Enhancement

The spectral modification techniques perform noise reduction using short-time spectral analysis based on a fixed and uniform speech decomposition. This processing, however, creates small isolated fluctuations in the spectrum occurring at random frequency locations in each frame. Converted to the time domain, these fluctuations sound like tones with frequency peaks that change randomly from frame to frame. These artifacts, described as residual noise, consist of tonal remnant noise components that are significantly disagreeable to the ear. By focusing on perceptual processing based on how human listeners process tones and bands of noise, it is possible to suppress the background noise and completely attenuate the random peaks in the structure of the musical noise. The human auditory system is sensitive to abrupt changes and transient components in the noisy speech signal, which it analyzes in time and frequency with the nonlinear frequency selectivity of the basilar membrane. Thus, the human hearing process is modeled as a series of transformations of the acoustic signal via an array of overlapping band-pass filters known as perceptual filters. These filters are distributed along the basilar membrane and increase the frequency selectivity of the human ear. Hence, the speech component can be identified and the selectivity can be amplified. The idea is that embedding psychoacoustic models of the human auditory system in perceptual filter banks may improve the intelligibility and perceptual quality of speech. Moreover, it is known that humans are capable of detecting the desired speech in a noisy environment without any prior information on the noise type.
Taking into account psychoacoustic analysis and human perception properties, it is possible to build a successful speech enhancement system by combining a suitable perceptual model, which yields nonuniform filter banks representing the processing of the human ear, with an appropriate spectral modification approach, such as the generalized spectral subtraction (GSS) technique or the minimum mean-square error (MMSE) rule, for spectral enhancement of each band output of the nonuniform filter bank.

The proposed enhancement scheme is presented in Figure 1.

Figure 1: Proposed speech enhancement method based on perceptual filtering model.

Step 1. Speech decomposition via perceptual filter-bank analysis stage.

Step 2. Speech enhancement process: multibands noise suppression process.

Step 3. Continuous noise estimation.

Step 4. Speech synthesis via perceptual filter banks synthesis stage.

3.1. Perceptual Filtering Models

The aim of perceptual modeling is to find a mathematical model that represents some physiological and perceptual aspects of the human auditory system. Perceptual modeling is very useful, since the sound wave can be analyzed according to the behavior of the human ear, with a good model. The simplest way to model the frequency resolution of the basilar membrane is to perform the analysis using filter banks. The simplest and most realistic model is the gammatone filter bank [12]. Its impulse response is based on psychoacoustic measurements, providing a more accurate approximation to the perceptual frequency response, and it is represented by a gammatone function defined in the time domain by the following expression:

$$\mathrm{gt}(t) = A\,t^{n-1}\exp\bigl(-2\pi b B(f_c)\,t\bigr)\cos\bigl(2\pi f_c t + \varphi\bigr), \tag{3}$$

where $A$ is the magnitude normalization parameter, $n$ is the filter order, $f_c$ is the center frequency of the filter, $B$ is the filter bandwidth, and $bB(f_c)$ determines the filter envelope. The gammachirp filter bank is another perceptual model [13]. It is an extension of the popular gammatone filter with an additional frequency-modulation term that produces an asymmetric amplitude spectrum. Its complex impulse response is also based on psychoacoustic measurements, providing a more accurate approximation to the perceptual frequency response, and it is given in the time domain as

$$\mathrm{gc}(t) = A\,t^{n-1}\exp\bigl(-2\pi b B(f_c)\,t\bigr)\cos\bigl(2\pi f_c t + c\ln t + \varphi\bigr), \tag{4}$$

where time $t>0$, $A$ is the amplitude, $n$ and $b$ are parameters defining the envelope of the gamma distribution, $f_c$ is the asymptotic frequency, $c$ is a parameter for the frequency modulation ($c=3$), $\varphi$ is the initial phase, $\ln t$ is the natural logarithm of time, and ERB($f_c$), which sets the bandwidth $B(f_c)$, is the equivalent rectangular bandwidth of the perceptual filter at $f_c$.
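The two impulse responses in (3) and (4) differ only by the $c\ln t$ term, so a single routine can sample both. The sketch below is illustrative only. It assumes that $B(f_c)$ is the Glasberg and Moore ERB in Hz, $24.7(4.37 f_c/1000 + 1)$, and that $b = 1.019$, a value often used with gammatone filters. Neither value is stated in the text, and the function names and peak normalization are hypothetical choices.

```python
import math


def erb_bandwidth_hz(fc_hz):
    """ERB in Hz at fc_hz (assumed form of B(fc); not given in the text)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)


def gamma_impulse_response(fc_hz, fs_hz, duration_s=0.025, order=4,
                           b=1.019, c=0.0, phase=0.0):
    """Sample (3) when c == 0 (gammatone) or (4) when c != 0 (gammachirp)."""
    n_samples = int(round(duration_s * fs_hz))
    decay = 2.0 * math.pi * b * erb_bandwidth_hz(fc_hz)
    response = []
    for i in range(1, n_samples + 1):  # start at t > 0 so that ln(t) is defined
        t = i / fs_hz
        envelope = t ** (order - 1) * math.exp(-decay * t)
        carrier = math.cos(2.0 * math.pi * fc_hz * t + c * math.log(t) + phase)
        response.append(envelope * carrier)
    peak = max(abs(v) for v in response)
    # The amplitude A is chosen here so that the peak magnitude is 1.
    return [v / peak for v in response]


# Fourth-order filters centered at 1 kHz for a 16 kHz sampling rate.
gammatone = gamma_impulse_response(1000.0, 16000.0)
gammachirp = gamma_impulse_response(1000.0, 16000.0, c=3.0)
```

Other normalizations of $A$, such as unit gain at $f_c$, are equally possible. The peak normalization is used here only to keep the sketch short.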

The frequency resolution of human hearing is a complex phenomenon which depends on many factors, such as frequency, signal bandwidth, and signal level. Despite the fact that our ear is very accurate in single-frequency analysis, broadband signals are analyzed using a quite sparse frequency resolution. The equivalent rectangular bandwidth (ERB) scale is an accurate way to explain the frequency resolution of human hearing with broadband signals. The expression used to convert a frequency $f$ in Hz into its value in ERB is

$$\mathrm{ERB}(f) = 21.4\,\log_{10}\!\left(\frac{4.37 f}{1000} + 1\right). \tag{5}$$
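A short Python sketch of the conversion in (5) and of its inverse is given below, together with one possible way of placing 27 center frequencies uniformly on the ERB scale. The function names and the 50 Hz to 8000 Hz range are assumptions made for illustration; the text does not state the band edges used.

```python
import math


def hz_to_erb(f_hz):
    """Frequency in Hz to its ERB-scale value, following (5)."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)


def erb_to_hz(erb):
    """Inverse of hz_to_erb."""
    return (10.0 ** (erb / 21.4) - 1.0) * 1000.0 / 4.37


def erb_center_frequencies(f_low_hz, f_high_hz, n_bands=27):
    """Center frequencies equally spaced on the ERB scale."""
    low, high = hz_to_erb(f_low_hz), hz_to_erb(f_high_hz)
    step = (high - low) / (n_bands - 1)
    return [erb_to_hz(low + k * step) for k in range(n_bands)]


# Hypothetical range: 50 Hz up to the Nyquist frequency of 16 kHz speech.
centers = erb_center_frequencies(50.0, 8000.0)
```

Because the spacing is uniform in ERB units, the resulting center frequencies are dense at low frequencies and sparse at high frequencies.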

Figure 2 shows the correspondence between frequencies in Hz and its values in ERB and the frequency response of the gammatone and the gammachirp filter banks with 𝑘=27 ERB bands.

Figure 2: Frequency and ERB-scale correspondence (a) and the frequency response of the gammatone filter banks (b) and the gammachirp filter banks (c) with 𝑘=27 ERB bands.

3.2. Multibands Perceptual Process Using Perceptual Filter Bank

The proposed speech enhancement method is based on a nonuniform decomposition of the degraded input waveform $y(n)$. The processing is done by dividing the incoming noisy speech into separate bands $y_{k,\mathrm{gt}}(n)$ that can be individually manipulated using spectral modification algorithms to improve the quality and intelligibility of the overall signal. The analysis filter banks consist of 27 fourth-order gammatone filters and of 27 fourth-order gammachirp filters that cover the frequency range of the signal.

The filter bandwidths change according to the equivalent rectangular bandwidth (ERB) scale. The output of the $k$th filter of the analysis gammatone filter bank can be expressed as

$$y_{k,\mathrm{gt}}(n) = y(n) * \mathrm{gt}_k(n), \tag{6}$$

where $\mathrm{gt}_k(n)$ is the impulse response of the $k$th fourth-order gammatone filter and $*$ denotes convolution. The output of the $k$th filter of the analysis gammachirp filter bank can be expressed as

$$y_{k,\mathrm{gc}}(n) = y(n) * \mathrm{gc}_k(n), \tag{7}$$

where $\mathrm{gc}_k(n)$ is the impulse response of the $k$th fourth-order gammachirp filter.

Each band $k$ obtained by the nonuniform gammatone decomposition, $y_{k,\mathrm{gt}}(n)$, or by the nonuniform gammachirp decomposition, $y_{k,\mathrm{gc}}(n)$, is divided into frames (10 ms–30 ms in length) by multiplication with a sliding window $F(n)$. The nonuniform subband signals $y_{k,\mathrm{gt/gc}}(n,p)$ are transformed into the frequency domain with the fast Fourier transform (FFT) and manipulated using the spectral gain given by the generalized spectral subtraction rule (GSS), on the one hand, and the Ephraim and Malah spectral rule (MMSE), on the other hand.

3.2.1. Perceptual Generalized Spectral Subtraction Technique

The gain function of the generalized spectral subtraction rule is applied in a multirate system (Figure 3). The subband spectra of the noisy signal are multiplied by the general weights $G^{\mathrm{GSS}}_{k,\mathrm{gt/gc}}(p,w)$ in each subband $k$.

Figure 3: Proposed perceptual generalized spectral subtraction technique applied in a multirate system.

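The multiband weights are calculated from the subband magnitude spectrum of the noisy speech signal and the noise estimate in each frame $p$ and for each frequency $w$. Using the generalized spectral subtraction technique, the enhanced speech spectrum $|X^{\mathrm{GSS}}_{k,\mathrm{gt}}(p,w)|$ in each gammatone subband signal is given by

$$\bigl|X^{\mathrm{GSS}}_{k,\mathrm{gt}}(p,w)\bigr| = G^{\mathrm{GSS}}_{k,\mathrm{gt}}(p,w)\,\bigl|Y_{k,\mathrm{gt}}(p,w)\bigr|. \tag{8}$$

In each gammachirp subband signal, the enhanced speech spectrum $|X^{\mathrm{GSS}}_{k,\mathrm{gc}}(p,w)|$ is given by

$$\bigl|X^{\mathrm{GSS}}_{k,\mathrm{gc}}(p,w)\bigr| = G^{\mathrm{GSS}}_{k,\mathrm{gc}}(p,w)\,\bigl|Y_{k,\mathrm{gc}}(p,w)\bigr|, \tag{9}$$

where the gain functions $G^{\mathrm{GSS}}_{k,\mathrm{gt}}(p,w)$ and $G^{\mathrm{GSS}}_{k,\mathrm{gc}}(p,w)$ are expressed in each subband $k$ as

$$G^{\mathrm{GSS}}_{k,\mathrm{gt/gc}}(p,w) =
\begin{cases}
\left(1 - \alpha\,\dfrac{|D_{k,\mathrm{gt/gc}}(p,w)|^{2}}{|Y_{k,\mathrm{gt/gc}}(p,w)|^{2}}\right)^{1/2} & \text{if } |X_{k,\mathrm{gt/gc}}(p,w)|^{2} > \beta\,|D_{k,\mathrm{gt/gc}}(p,w)|^{2},\\[2ex]
\left(\beta\,|D_{k,\mathrm{gt/gc}}(p,w)|^{2}\right)^{1/2} & \text{otherwise,}
\end{cases} \tag{10}$$

where $|D_{k,\mathrm{gt/gc}}(p,w)|^{2}$ and $|Y_{k,\mathrm{gt/gc}}(p,w)|^{2}$ are, respectively, the power spectrum of the noise estimate and of the noisy speech signal in each nonuniform gammatone (gt) and gammachirp (gc) subband $k$, $\alpha$ is the over-subtraction factor ($\alpha \ge 1$), and $\beta$ ($0<\beta<1$) is the spectral floor.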
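A per-bin gain with over-subtraction and a spectral floor can be sketched in Python as follows. This is only an illustration. It uses the common Berouti-style formulation [1], in which the floor is expressed as a gain relative to the noisy power, so it may differ in detail from (10). The name `gss_gain` and the default values of `alpha` and `beta` are hypothetical and are not the values of Table 1.

```python
import math


def gss_gain(noisy_power, noise_power, alpha=2.0, beta=0.01):
    """Spectral subtraction gain for one bin of one subband.

    noisy_power is |Y|^2, noise_power is the estimate |D|^2, alpha >= 1 is the
    over-subtraction factor, and 0 < beta < 1 is the spectral floor.
    """
    if noisy_power <= 0.0:
        return 0.0
    ratio = noise_power / noisy_power
    subtracted = 1.0 - alpha * ratio  # |X|^2 / |Y|^2 after over-subtraction
    floor = beta * ratio              # floored |X|^2 / |Y|^2
    gain_squared = subtracted if subtracted > floor else floor
    return min(1.0, math.sqrt(gain_squared))  # keep 0 <= G <= 1


# High local SNR: the gain stays close to 1.
strong = gss_gain(noisy_power=10.0, noise_power=1.0)
# Noise-dominated bin: the floor applies instead of a negative power.
weak = gss_gain(noisy_power=1.0, noise_power=1.0)
```

The floor prevents negative power estimates and leaves a small amount of broadband noise, which helps mask the isolated peaks that are heard as musical noise.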

3.2.2. Perceptual MMSE Spectral Modification

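In this section, we are interested in using the spectral gain $G^{\mathrm{mmse}}_{k,\mathrm{gt/gc}}(p,w)$ given by the spectral modification according to the Ephraim and Malah rule (MMSE) in each frame $p$ and for each frequency $w$ (Figure 4) to obtain the enhanced speech spectrum $|X^{\mathrm{mmse}}_{k,\mathrm{gt/gc}}(p,w)|$. In each gammatone subband signal, it is given by

$$\bigl|X^{\mathrm{mmse}}_{k,\mathrm{gt}}(p,w)\bigr| = G^{\mathrm{mmse}}_{k,\mathrm{gt}}(p,w)\,\bigl|Y_{k,\mathrm{gt}}(p,w)\bigr|. \tag{11}$$

In each gammachirp subband signal, the enhanced speech spectrum $|X^{\mathrm{mmse}}_{k,\mathrm{gc}}(p,w)|$ is given by

$$\bigl|X^{\mathrm{mmse}}_{k,\mathrm{gc}}(p,w)\bigr| = G^{\mathrm{mmse}}_{k,\mathrm{gc}}(p,w)\,\bigl|Y_{k,\mathrm{gc}}(p,w)\bigr|, \tag{12}$$

where the gain functions $G^{\mathrm{mmse}}_{k,\mathrm{gt}}(p,w)$ and $G^{\mathrm{mmse}}_{k,\mathrm{gc}}(p,w)$ are expressed in each subband $k$ as

$$G^{\mathrm{mmse}}_{k,\mathrm{gt/gc}}(p,w) = \frac{\sqrt{\pi}}{2}\sqrt{\frac{1}{1+R^{\mathrm{post}}_{k,\mathrm{gt/gc}}}\times\frac{R^{\mathrm{priori}}_{k,\mathrm{gt/gc}}}{1+R^{\mathrm{priori}}_{k,\mathrm{gt/gc}}}}\;F\!\left[\bigl(1+R^{\mathrm{post}}_{k,\mathrm{gt/gc}}\bigr)\times\frac{R^{\mathrm{priori}}_{k,\mathrm{gt/gc}}}{1+R^{\mathrm{priori}}_{k,\mathrm{gt/gc}}}\right]. \tag{13}$$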

Figure 4: Perceptual MMSE spectral modification technique applied in multirate system.

The a posteriori and a priori signal-to-noise ratios in the current frame $p$ and in each gammatone (gt) and gammachirp (gc) subband are defined as

$$R^{\mathrm{post}}_{k,\mathrm{gt/gc}}(p,w) = \frac{|Y_{k,\mathrm{gt/gc}}(p,w)|^{2}}{|D_{k,\mathrm{gt/gc}}(p,w)|^{2}}, \qquad R^{\mathrm{priori}}_{k,\mathrm{gt/gc}}(p,w) = (1-\eta)\,R^{\mathrm{post}}_{k,\mathrm{gt/gc}}(p,w)\times\frac{|Y_{k,\mathrm{gt/gc}}(p-1,w)|^{2}}{|D_{k,\mathrm{gt/gc}}(p,w)|^{2}}, \tag{14}$$

where $\eta$ is a parameter defined as $0 \le \eta \le 1$.

$|Y_{k,\mathrm{gt/gc}}(p-1,w)|^{2}$ is the power spectral density in the frame $(p-1)$, and $R^{\mathrm{post}}$ is the a posteriori signal-to-noise ratio defined in each frame $p$ and for each frequency $w$.

The time-domain enhanced speech signal $\hat{x}_{k,\mathrm{gt/gc}}(n)$ in each subband $k$ is estimated using the inverse Fourier transform and the overlap-add technique. Based on the assumption that phase distortion is not perceived by the human ear, the phase of the noisy speech is not processed: the enhanced speech signal in each subband $k$ is obtained from the enhanced magnitude spectrum and the phase of the noisy speech signal.

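The final enhanced output speech signals $\hat{x}_{\mathrm{gt}}(n)$ from the gammatone synthesis filter bank and $\hat{x}_{\mathrm{gc}}(n)$ from the gammachirp synthesis filter bank are obtained by summing the subband signals after processing:

$$\hat{x}_{\mathrm{gt}}(n) = \sum_{k=1}^{M}\hat{x}_{k,\mathrm{gt}}(n), \qquad \hat{x}_{\mathrm{gc}}(n) = \sum_{k=1}^{M}\hat{x}_{k,\mathrm{gc}}(n), \tag{15}$$

where $\hat{x}_{k,\mathrm{gt}}(n)$ and $\hat{x}_{k,\mathrm{gc}}(n)$ are given by

$$\hat{x}_{k,\mathrm{gt}}(n) = \mathrm{IFFT}\Bigl(\bigl|X_{k,\mathrm{gt}}(p,w)\bigr|\,e^{j\phi(Y_{k,\mathrm{gt}}(p,w))}\Bigr), \qquad \hat{x}_{k,\mathrm{gc}}(n) = \mathrm{IFFT}\Bigl(\bigl|X_{k,\mathrm{gc}}(p,w)\bigr|\,e^{j\phi(Y_{k,\mathrm{gc}}(p,w))}\Bigr). \tag{16}$$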

The noise estimate can have an important impact on the quality and intelligibility of the enhanced signal. If the noise estimate is too low, residual noise will be audible; if the noise estimate is too high, speech will be distorted, resulting in a loss of intelligibility. In the spectral subtraction algorithm, the noise spectrum estimate is updated during the silent periods of the signal. Although this approach might give satisfactory results with stationary noise, it will not in more realistic environments where the spectral characteristics of the noise change constantly. Hence, there is a need to update the noise spectrum continuously over time. Several noise-estimation algorithms have been proposed for speech enhancement applications [14]. In [15], the minimum statistics (MS) method for estimating the noise spectrum is based on tracking the minimum of the noisy speech over a finite window. As the minimum is typically smaller than the mean, unbiased estimates of the noise spectrum are computed by introducing a bias factor based on the statistics of the minimum estimates. In [16], a minima controlled recursive averaging (MCRA) algorithm is proposed; it updates the noise estimate by tracking the noise-only regions of the noisy speech spectrum. These regions are found by comparing the ratio of the noisy speech to the local minimum against a threshold. In the improved minima controlled recursive averaging (IMCRA) approach [17], a different method is used to track the noise-only regions of the spectrum, based on the estimated speech-presence probability. This probability, however, is also controlled by the minima. More recently, a new noise estimation algorithm (MCRA2) was introduced [18], in which the noise estimate is updated in each frame based on voice activity detection. The speech presence decision made in each frame is based on the ratio of the noisy speech spectrum to its local minimum. In our work, the noise power spectrum is continuously estimated using these algorithms.
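The idea shared by these algorithms, tracking the minimum of the smoothed noisy power over a finite window, can be sketched in a few lines of Python. This is a deliberately simplified illustration and not an implementation of MS, MCRA, IMCRA, or MCRA2: it has no bias compensation and no speech-presence probability, and the function name, window length, and smoothing constant are hypothetical.

```python
def track_noise_minimum(noisy_power_frames, window=5, smoothing=0.8):
    """Noise power estimate per frame for a single frequency bin.

    The noisy power is recursively smoothed, and the noise estimate is the
    minimum of the smoothed power over the last `window` frames.
    """
    smoothed = []
    estimates = []
    previous = None
    for power in noisy_power_frames:
        if previous is None:
            previous = power
        else:
            previous = smoothing * previous + (1.0 - smoothing) * power
        smoothed.append(previous)
        estimates.append(min(smoothed[-window:]))
    return estimates


# A short burst of speech energy (10.0) on top of a noise floor of 1.0.
noise_estimate = track_noise_minimum([1.0, 1.0, 10.0, 10.0, 1.0, 1.0])
```

In this toy example the estimate stays at the noise floor during the burst, because the window still contains frames without speech. A burst longer than the window would pull the estimate upward, which is one reason the published algorithms add further control.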

4. Results and Evaluation

The speech signals are taken from the TIMIT corpus. The sentences are sampled at 16 kHz. This database is used because it contains phonetically balanced sentences with relatively low word-context predictability. Noise from the AURORA database, including multitalker babble and car noise, is added to the original speech signal at different signal-to-noise ratios (0 dB, 5 dB, 10 dB, and 15 dB). To cover the frequency range of the signal, the analysis stage used in the multiband subtraction consists of filter banks of 27 fourth-order gammatone/gammachirp filters spaced according to the ERB scale. The parameters used in the noise suppression algorithms are listed in Table 1.
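Adding noise at a prescribed signal-to-noise ratio amounts to scaling the noise before summation. The Python sketch below shows one common way to do this, using the average power over the whole utterance. It is not taken from the paper, and the function name is hypothetical.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise scaled so that the mixture has the given SNR."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(v * v for v in noise) / len(noise)
    if speech_power == 0.0 or noise_power == 0.0:
        raise ValueError("speech and noise must have nonzero power")
    # Scale so that speech_power / (scale^2 * noise_power) = 10^(snr_db / 10).
    scale = math.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + scale * v for s, v in zip(speech, noise)]


clean = [1.0, -1.0, 1.0, -1.0]
babble = [0.5, 0.5, -0.5, -0.5]
noisy = mix_at_snr(clean, babble, snr_db=10.0)
```

Measuring the power of `noisy` minus `clean` against the power of `clean` recovers the requested 10 dB, which is a useful check before running an evaluation.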

Table 1: Experimental parameters used in the noise suppression process.

The performance of the proposed speech enhancement methods is evaluated using continuous noise estimation algorithms based on the MCRA method (mcra), the IMCRA method (imcra), the MCRA2 method (mcra2), and the minimum statistics method (ms). The methods are the generalized spectral subtraction rule implemented on ERB gammatone/gammachirp filter banks (GSS_GTFB/GSS_GCFB) and the Ephraim and Malah spectral modification rule implemented on ERB gammatone/gammachirp filter banks (MMSE_GTFB/MMSE_GCFB). They are compared with the basic generalized spectral subtraction (GSS) and Ephraim and Malah (MMSE) spectral modification techniques.

4.1. Objective Evaluation

In order to evaluate the performance, we measure the perceptual evaluation of speech quality (PESQ) score [13]. The PESQ score is able to predict subjective quality with good correlation in a very wide range of conditions. The original and degraded signals are mapped onto an internal representation using a perceptual model in order to predict the perceived speech quality of the degraded signal. The subjective experiments used in the development of PESQ use the absolute category rating opinion scale.

According to the results illustrated in Figure 5, the approach based on nonuniform filter bank decomposition using two different models of human perceptual behavior performs well in speech enhancement. We observe that the PESQ score is consistent with the subjectively perceived improvement in speech quality obtained with the proposed speech enhancement approach over the spectral modification (GSS) algorithm alone.

Figure 5: PESQ score for the proposed perceptual speech enhancement method based on the gammatone and the gammachirp filter banks decomposition at different signal-to-noise ratio for babble and car noise and compared to the classic algorithms.

This improvement is particularly significant in the case of car noise at 15 dB, where we obtain a score of 3.26 for the proposed GSS_GT (using gammatone decomposition) compared with 2.73 for the GSS alone. The PESQ improvement is also observed using the GSS_GT at 0 dB (2.22) for babble noise continuously estimated with the MCRA2, compared with the GSS (1.78). On the other hand, the gammachirp filter bank decomposition in association with the MMSE spectral modification rule (MMSE_GC_MCRA) contributes significantly to the enhancement of speech signals corrupted by car noise (PESQ score of 3.54 at 15 dB). In order to strengthen the objective evaluation, we measure the scores defined by the ITU-T P.835 standard [19].

This standard rates the enhanced speech signal successively on the distortion of the speech signal alone, using a five-point scale of signal distortion (SIG), on the noise distortion, using a five-point scale of background intrusiveness (BAK), and on the overall quality (OVRL). This process is designed to integrate the effects of both the signal and the background in the rating of overall quality.

Figures 6 and 7 show, at different signal-to-noise ratios, the SIG, BAK, and OVRL scores for babble noise, where the overall quality (OVRL) measure integrates the naturalness of speech (SIG) and the intrusiveness of background noise (BAK). Figures 8 and 9 show the SIG, BAK, and OVRL scores for car noise. The proposed perceptual spectral modification using different continuous noise estimation algorithms performed significantly better than the classic spectral subtractive algorithms.

Figure 6: SIG-BAK-OVRL scores for proposed speech enhancement with the gammachirp decomposition compared to the GSS and the MMSE at different SNR input for babble noise.
Figure 7: SIG-BAK-OVRL scores for proposed speech enhancement with the gammachirp decomposition compared to the GSS and the MMSE at different SNR input for babble noise.
Figure 8: SIG-BAK-OVRL scores for proposed speech enhancement with the gammatone decomposition compared to the GSS and the MMSE at different SNR input for car noise.
Figure 9: SIG-BAK-OVRL scores for proposed speech enhancement with the gammachirp decomposition compared to the GSS and the MMSE at different SNR input for car noise.

Lower signal distortion (higher SIG score) is observed with the proposed approach in most conditions, with significant differences at 10 dB for car noise: the SIG score of 3.09 given by the GSS is improved to 4.27 by the GSS_GT using IMCRA noise estimation, and a score of 4.62 is obtained by the proposed MMSE_GC with MS noise estimation. This demonstrates the ability of our approach, based on nonuniform gammatone/gammachirp filter bank decomposition, to reduce the perceptibility of the background noise and minimize the signal distortion. We also notice that incorporating continuous noise estimation, in particular the IMCRA and MCRA algorithms, in the perceptual spectral modification approach performs better than the generalized spectral subtraction and the MMSE rules in terms of overall quality improvement.

This indicates that the proposed perceptual spectral modification for speech enhancement is sensitive to the noise spectrum estimate.

4.2. Subjective Evaluation

Significant gains in noise reduction are accompanied by a decrease in speech intelligibility. A formal subjective test is the best indicator of the achieved overall quality. The subjective evaluation used in our work is based on an automatic speech recognition (ASR) system developed under the HTK platform [20]. We used a standard continuous-density HMM recognizer with 3 Gaussian mixtures per state, diagonal covariance matrices, and 5 emitting states per word model.

The parameterization step consists of 12 MFCC coefficients.

Tables 2 and 3 show the word recognition rate in percent (%) at different SNRs for the proposed approach using the two auditory filtering models compared to the classic spectral modification rules.

Table 2: Recognition rate for the proposed perceptual generalized spectral subtraction method based on the gammatone and the gammachirp filter banks decomposition (GSS_GT and GSS_GC) compared with the spectral subtraction rule (GSS).
Table 3: Recognition rate for the proposed perceptual MMSE spectral modification method based on the gammatone and the gammachirp filter banks decomposition (MMSE_GT and MMSE_GC) compared with the spectral modification rule (MMSE).

We observe that the proposed multiband approach gives the best word recognition rate. The improvement is significant, especially in the case of car noise at different levels of degradation.

5. Conclusion

In this paper, we proposed a new speech enhancement method which integrates psychoacoustic properties of the human auditory system, in particular perceptual filter modeling. It is based on decomposing the input signal into nonuniform subbands using analysis/synthesis gammatone and gammachirp filter banks. Each nonuniform subband is then processed with the generalized spectral subtraction process and the MMSE spectral modification technique. We found that using the two perceptual filter bank models with a frequency resolution according to the ERB scale gives better results, in terms of both perceived and vocal quality, than the classic spectral modification algorithms in improving the quality and intelligibility of the enhanced speech signal.

References

  1. M. Berouti, R. Schwartz, and J. Makhoul, “Enhancement of speech corrupted by acoustic noise,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 208–211, April 1979.
  2. Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error-log-spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443–445, 1985.
  3. S. Kamath and P. Loizou, “A multi-band spectral subtraction method for enhancing speech corrupted by colored noise,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '02), vol. 4, pp. 4160–4164, May 2002.
  4. R. M. Udrea, S. Ciochinǎ, and D. N. Vizireanu, “Multi-band bark scale spectral over-subtraction for colored noise reduction,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '05), pp. 311–314, Iasi, Romania, July 2005.
  5. M. Klein and P. Kabal, “Signal subspace speech enhancement with perceptual post-filtering,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '02), pp. 537–540, May 2002.
  6. R. M. Udrea, N. Vizireanu, S. Ciochina, and S. Halunga, “Nonlinear spectral subtraction method for colored noise reduction using multi-band Bark scale,” Signal Processing, vol. 88, no. 5, pp. 1299–1303, 2008.
  7. R. M. Udrea, N. D. Vizireanu, and S. Ciochina, “An improved spectral subtraction method for speech enhancement using a perceptual weighting filter,” Digital Signal Processing, vol. 18, no. 4, pp. 581–587, 2008.
  8. M. Udrea and S. Ciochina, “Speech enhancement using spectral over-subtraction and residual noise reduction,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '03), vol. 2, pp. 311–314, 2003.
  9. D. E. Tsoukalas, J. N. Mourjopoulos, and G. Kokkinakis, “Speech enhancement based on audible noise suppression,” IEEE Transactions on Speech and Audio Processing, vol. 5, no. 6, pp. 497–514, 1997.
  10. N. Virag, “Single channel speech enhancement based on masking properties of the human auditory system,” IEEE Transactions on Speech and Audio Processing, vol. 7, no. 2, pp. 126–137, 1999.
  11. A. Amehraye, D. Pastor, A. Tamtaoui, and D. Aboutajdine, “From maskee to audible noise in perceptual speech enhancement,” International Journal of Signal Processing, vol. 5, article 2, 2009.
  12. V. Hohmann, “Frequency analysis and synthesis using a Gammatone filterbank,” Acta Acustica United with Acustica, vol. 88, no. 3, pp. 433–442, 2002.
  13. T. Irino and M. Unoki, “An analysis/synthesis perceptual filterbank based on an IIR gammachirp filter,” in Computational Models of Perceptual Function, S. Greenberg and M. Slaney, Eds., vol. 312 of NATO ASI Series, IOS Press, 2001.
  14. P. Loizou, “Speech Enhancement: Theory and Practice,” CRC Press, Boca Raton, Fla, USA.
  15. R. Martin, “Noise power spectral density estimation based on optimal smoothing and minimum statistics,” IEEE Transactions on Speech and Audio Processing, vol. 9, no. 5, pp. 504–512, 2001.
  16. I. Cohen and B. Berdugo, “Noise estimation by minima controlled recursive averaging for robust speech enhancement,” IEEE Signal Processing Letters, vol. 9, no. 1, pp. 12–15, 2002.
  17. I. Cohen, “Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 5, pp. 466–475, 2003.
  18. S. Rangachari and P. C. Loizou, “A noise-estimation algorithm for highly non-stationary environments,” Speech Communication, vol. 48, no. 2, pp. 220–231, 2006.
  19. ITU-T P.835, “Subjective test methodology for evaluating speech communication systems that include noise suppression algorithm,” International Telecommunication Union ITU-T Recommendation P.835, 2003.
  20. S. J. Young, The HTK Book 3.1, Entropic, 2002.