Abstract

One of the major challenges in the field of Automatic Speech Recognition (ASR) is the development of solutions that work reliably in adverse acoustic conditions, such as in the presence of additive noise and/or in reverberant rooms. Recently, attention has been paid to integrating the noise suppressor deeply into the feature extraction pipeline. In this paper, different single-channel MMSE-based noise reduction schemes are implemented in both the frequency and cepstral domains, and the related recognition performances are evaluated on the AURORA2 and AURORA4 databases, thereby providing a useful reference for the scientific community.

1. Introduction

Automatic Speech Recognition (ASR) is a challenging task that has been largely addressed by the scientific community in the last two decades. In recent years, notable interest has arisen in the study and development of solutions that are robust to acoustic nonidealities [1], for example, background noise, simultaneous speakers, and reverberation. As a result of these efforts, a rich literature on environment-robust ASR techniques has accumulated. As highlighted in [2], it can be classified into feature-domain (FD) and model-based (MB) algorithms. The latter class encompasses all methodologies aimed at adapting the acoustic model (HMM) parameters in order to maximize the match between the system and the distorted environment. Related contributions propose Bayesian compensation frameworks [3], and most of them typically rely on a Vector Taylor Series approximation for the estimation of the compensation parameters [2].

The FD algorithm class directly enhances the quality of the speech features, trying to make them as close as possible to those of the clean-speech condition. Among the proposed methodologies, we can cite the log-spectral amplitude MMSE suppression rules, thanks to their ability to reduce noise at the cost of a low level of distortion [4, 5]. Recently, these rules have been implemented in the cepstral domain, thus operating close to the backend [6, 7]. Even though the MB approach achieves better performance [2], the FD one still has some advantages. First, independence from the backend: all modifications are applied to the feature vectors rather than to the HMM parameters, which has significant practical value. Second, ease of implementation: the algorithm parameterization is much simpler than in the MB class and no adaptation is required. This explains why FD methodologies are still widely adopted and implemented in ASR-based products.

In this paper, the authors have selected some of the most reliable and best-performing techniques in this field and compared them in terms of recognition performance on the AURORA2 and AURORA4 databases. The noise reduction algorithms involved are all MMSE-based, both in the frequency [4, 5] and in the cepstral [7] domain. Some optimizations originally developed in the frequency domain [8, 9] have been generalized to the cepstral domain, leading to new and effective algorithmic solutions with respect to what has appeared in the literature [7, 10]. Moreover, in the authors' view, the comparative simulation results provided here represent a useful reference for further studies.

The paper is organized as follows. Sections 2 and 3 are devoted to the description of the noise reduction algorithms in the frequency and cepstral domains, respectively, together with the related optimization procedures. In Section 4, the results of the computer simulations are reported and discussed. Section 5 concludes the work and outlines some ideas for future work.

2. Background on Frequency-Domain MMSE Algorithms

Let $x(n)$ and $d(n)$ denote the speech and the uncorrelated additive noise signals, respectively, where $n$ is a discrete-time index. The observed noisy signal is given by $y(n) = x(n) + d(n)$. Let $X(k,l)$, $D(k,l)$, and $Y(k,l)$ denote the short-time Fourier transforms of $x(n)$, $d(n)$, and $y(n)$, respectively, where $k$ is the frequency bin index and $l$ is the time frame index. Our purpose is to find an estimate of the clean speech spectrum $X(k,l)$, denoted with $\hat{X}(k,l)$, given the noisy observation $Y(k,l)$. In [4], such an estimate is obtained using a log-spectrum MMSE estimator and is given by

$$\hat{X}(k,l) = G(k,l)\,Y(k,l). \quad (1)$$

The gain function $G(k,l)$ is computed as

$$G(k,l) = \frac{\xi(k,l)}{1+\xi(k,l)} \exp\!\left\{ \frac{1}{2} \int_{v(k,l)}^{+\infty} \frac{e^{-t}}{t}\,dt \right\}, \quad (2)$$

where $\xi(k,l)$ and $\gamma(k,l)$ are the a priori and a posteriori SNR, respectively,

$$\xi(k,l) = \frac{\lambda_x(k,l)}{\lambda_d(k,l)}, \qquad \gamma(k,l) = \frac{|Y(k,l)|^{2}}{\lambda_d(k,l)}, \quad (3)$$

with $\lambda_x(k,l)$ and $\lambda_d(k,l)$ denoting the speech and noise spectral variances, and $v(k,l) = \xi(k,l)\,\gamma(k,l)/(1+\xi(k,l))$.

The noise variance $\lambda_d(k,l)$ can be estimated using the well-known minima controlled recursive averaging (MCRA) algorithm [5] or the improved MCRA (IMCRA) [11]. Both options have been tested, and recognition results are provided in Section 4. The a priori SNR is estimated using the popular decision-directed algorithm [4]. In Section 4, this algorithm will be denoted as LSA.
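For illustration, a minimal Python sketch of the LSA suppression rule is reported below. It is not the authors' implementation: it assumes that the noise variance $\lambda_d(k,l)$ is already available (e.g., from MCRA or IMCRA), and the decision-directed constants alpha_dd and xi_min are typical values rather than the ones used in the simulations of Section 4.

```python
# Sketch of the LSA rule of (1)-(3); the symbols follow Section 2.
import numpy as np
from scipy.special import exp1  # exponential integral: E1(x) = int_x^inf e^-t / t dt

def lsa_gain(xi, gamma):
    """LSA gain of (2) for arrays of a priori (xi) and a posteriori (gamma) SNRs."""
    v = xi / (1.0 + xi) * gamma
    return xi / (1.0 + xi) * np.exp(0.5 * exp1(np.maximum(v, 1e-10)))

def decision_directed_xi(gain_prev, gamma_prev, gamma, alpha_dd=0.98, xi_min=1e-3):
    """Decision-directed a priori SNR estimate [4]; alpha_dd and xi_min are
    typical values, not necessarily those adopted in this paper."""
    xi = alpha_dd * (gain_prev ** 2) * gamma_prev \
         + (1.0 - alpha_dd) * np.maximum(gamma - 1.0, 0.0)
    return np.maximum(xi, xi_min)

# For each frame: X_hat = lsa_gain(xi, gamma) * Y, as in (1).
```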

2.1. Gain Modification Based on Soft Decision
2.1.1. Optimally Modified LSA Estimator (OMLSA)

In [12], the optimally modified log-spectral amplitude (OMLSA) estimator has been proposed. The optimal spectral gain function is obtained as a weighted geometric mean of the hypothetical gains associated with the speech presence uncertainty. The exponential weight of each hypothetical gain is its corresponding probability, conditional on the observed signal. This probability is estimated for each frame and each subband via a soft-decision approach, which exploits the strong correlation of speech presence in neighboring frequency bins of consecutive frames. The modified gain function takes the following form:

$$G(k,l) = \bigl\{ G_{H_1}(k,l) \bigr\}^{p(k,l)}\, G_{\min}^{\,1-p(k,l)}, \quad (4)$$

where $G_{H_1}(k,l)$ is the same as (2), $p(k,l)$ is the speech presence probability (SPP), and $G_{\min}$ is a lower threshold [12]. The speech presence probability is computed as

$$p(k,l) = \left\{ 1 + \frac{q(k,l)}{1-q(k,l)} \bigl(1+\xi(k,l)\bigr)\, e^{-v(k,l)} \right\}^{-1}, \quad (5)$$

where $q(k,l)$ is the a priori speech absence probability estimated using a soft-decision approach [5].
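Under the same assumptions as the previous sketch, the OMLSA gain of (4)-(5) can be computed as follows once the LSA gain $G_{H_1}$ and the a priori speech absence probability $q$ (estimated as in [5]) are available; the lower threshold of -25 dB is a common choice, not necessarily the one adopted here.

```python
# Sketch of the OMLSA gain of (4)-(5).
import numpy as np

def omlsa_gain(xi, gamma, q, g_h1, g_min=10 ** (-25 / 20)):
    """Weighted geometric mean of the gain under speech presence (g_h1, eq. (2))
    and the lower threshold g_min, weighted by the SPP of (5)."""
    q = np.clip(q, 0.0, 1.0 - 1e-4)          # avoid division by zero
    v = xi / (1.0 + xi) * gamma
    p = 1.0 / (1.0 + q / (1.0 - q) * (1.0 + xi) * np.exp(-v))  # eq. (5)
    return (g_h1 ** p) * (g_min ** (1.0 - p))                  # eq. (4)
```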

2.1.2. Gain Modification Based on Global SNR (gSNR)

In [8], a soft-decision gain modification was introduced to improve the efficacy of the Ephraim and Malah (E&M) algorithm when used as a preprocessing stage for ASR. The estimates of $\xi(k,l)$ and $\gamma(k,l)$ are modified by means of a noise overestimation factor $\alpha$ and a spectral floor $\beta$, both of which depend on the global SNR as in [8]. The global SNR itself is computed recursively, smoothing the frame-level SNR estimate over time [8].

With this modification, the gain function takes greater values when the global SNR is around 20 dB, while it remains unchanged when it is around 0 dB [8]. In [8], this approach was applied to LSA; in this paper, we propose its application also to OMLSA.
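A hedged sketch of this mechanism is reported below. The recursion constant mu and the linear mapping from the global SNR to the overestimation factor alpha and the spectral floor beta are purely illustrative assumptions of the sketch; the actual dependence is the one defined in [8].

```python
# Illustrative sketch of a global-SNR-driven overestimation/flooring scheme.
import numpy as np

def update_global_snr(snr_glob_prev, speech_power, noise_power, mu=0.9):
    """Recursive (smoothed) global SNR estimate; mu is an assumed constant."""
    frame_snr = speech_power / max(noise_power, 1e-10)
    return mu * snr_glob_prev + (1.0 - mu) * frame_snr

def alpha_beta_from_gsnr(snr_glob_db, snr_lo=0.0, snr_hi=20.0):
    """Hypothetical mapping: stronger overestimation and a higher floor at low
    global SNR, no overestimation at high global SNR (see [8] for the real one)."""
    t = np.clip((snr_glob_db - snr_lo) / (snr_hi - snr_lo), 0.0, 1.0)
    alpha = 2.0 - t           # assumed: 2.0 at 0 dB, 1.0 at 20 dB
    beta = 0.02 * (1.0 - t)   # assumed spectral floor
    return alpha, beta

# Usage: alpha, beta = alpha_beta_from_gsnr(10 * np.log10(snr_glob))
```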

2.2. Gain Function Smoothing

In [9], a frequency-domain smoothing of the gain function is proposed to improve the perceived quality of speech signals processed by the E&M algorithm. Smoothing is performed by dividing the frequency range into critical bands and computing the median of the gain function within each band.

The smoothed gain function is obtained from the median of the gain over the $N_i$ frequency bins of the $i$th critical band, and a smoothing factor controls the residual musical noise and the perceived clearness of the output signal [9]. In [9], this approach was applied to LSA; in this paper, we propose its application also to OMLSA.
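The sketch below illustrates the idea on a single frame. The grouping of the FFT bins into critical bands and the convex combination controlled by the smoothing factor are assumptions of this sketch; the exact rule is the one defined in [9].

```python
# Illustrative sketch of per-band median smoothing of the gain.
import numpy as np

def smooth_gain(gain, band_edges, beta=0.5):
    """gain: per-bin gain of one frame; band_edges: list of (start, stop) bin
    indices of the critical bands; beta: assumed smoothing factor blending the
    original gain with the band median."""
    smoothed = gain.copy()
    for start, stop in band_edges:
        band_median = np.median(gain[start:stop])
        smoothed[start:stop] = beta * band_median + (1.0 - beta) * gain[start:stop]
    return smoothed
```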

3. Background on Cepstral Domain MMSE Algorithms

The MFCC-MMSE noise reduction algorithm proposed in [7] operates directly on the MFCC coefficients, that is, in the same domain as the decoding algorithm. The approach is similar to E&M [4] but differs in two respects: the algorithm is applied to the power spectral magnitude of the filter bank outputs instead of the DFT spectral amplitude, and the noise variance takes into account the phase difference between the noise and the clean speech.

As reported in [7], the MMSE estimate of the $i$th MFCC coefficient is given by the following expression:

$$\hat{c}_x(i,l) = \sum_{j} C(i,j)\, \ln\!\bigl( G(j,l)\, m_y(j,l) \bigr), \quad (6)$$

where $G(j,l)$ has the same form as (2) with the frequency bin $k$ replaced by the channel index $j$. The terms $C(i,j)$ are the discrete cosine transform coefficients, and $m_y(j,l)$ is the output of the $j$th Mel-frequency filter at frame $l$.

The a priori and a posteriori SNRs $\xi(j,l)$ and $\gamma(j,l)$ are defined analogously to the frequency-domain case: the terms $\sigma_d^2(j,l)$ and $\sigma_x^2(j,l)$ are the variances of the noise and of the speech signal, respectively, and $\sigma_\alpha^2(j,l)$, the variance of the phase difference between noise and speech, enters their computation as detailed in [7].

As in the frequency-domain algorithms, the a priori SNR has been estimated with the decision-directed approach. In [7], the noise variance has been estimated with the MCRA algorithm; in this paper, both the MCRA and IMCRA algorithms have been tested, and results are shown in Section 4. The variance $\sigma_\alpha^2(j,l)$ is computed as in [7]. In Section 4, this algorithm will be denoted as C-LSA.
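As an illustration, the following sketch computes the estimate of (6) for one frame, assuming that the per-channel gain $G(j,l)$ has already been obtained; the DCT normalization is a choice of the sketch and may differ from the one used in the actual front-end.

```python
# Sketch of the cepstral-domain MMSE estimate of (6).
import numpy as np
from scipy.fftpack import dct

def mfcc_mmse(mel_power, gain, num_ceps=13):
    """mel_power: Mel filter-bank outputs m_y(j,l) of one frame;
    gain: per-channel gain G(j,l); returns the estimated MFCC vector."""
    enhanced = np.maximum(gain * mel_power, 1e-10)   # floor to avoid log(0)
    return dct(np.log(enhanced), type=2, norm='ortho')[:num_ceps]
```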

3.1. Cepstral Domain Gain Modification Based on Soft Decision
3.1.1. Optimally Modified LSA Estimator (C-OMLSA)

Here, we propose an adaptation of the OMLSA algorithm to the cepstral domain. The speech presence uncertainty can be taken into account in the cepstral domain as well. However, the basic idea of exploiting the strong correlation between neighboring frequency bins no longer holds for cepstral coefficients. This problem can be overcome by computing the SPP in the frequency domain and then taking, for each Mel channel, the median of the SPP values of the frequency bins belonging to that channel as its probability value. The MMSE estimate of the MFCC coefficients is then obtained by applying (4).
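A minimal sketch of this adaptation is given below; the mapping from frequency bins to Mel channels (channel_bins) and the value of the lower threshold are assumptions of the sketch.

```python
# Sketch of the C-OMLSA channel-level SPP and gain.
import numpy as np

def channel_spp(p_freq, channel_bins):
    """p_freq: per-bin SPP of one frame (eq. (5)); channel_bins: for each Mel
    channel, the indices of the frequency bins it covers."""
    return np.array([np.median(p_freq[bins]) for bins in channel_bins])

def c_omlsa_gain(g_h1, p_channel, g_min=10 ** (-25 / 20)):
    # OMLSA rule of (4) applied per Mel channel with the channel-level SPP
    return (g_h1 ** p_channel) * (g_min ** (1.0 - p_channel))
```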

3.1.2. Gain Modification Based on Global SNR (gSNR)

Here, we propose an adaptation of the gain modification of Section 2.1.2 to the cepstral-domain algorithm. The gain function is modified by computing the global SNR and applying the same noise overestimation and spectral flooring, with the frequency-domain SNRs $\xi(k,l)$ and $\gamma(k,l)$ replaced by their cepstral-domain counterparts $\xi(j,l)$ and $\gamma(j,l)$, respectively.

A similar approach has been proposed in [10]: in that paper, the gain value is directly modified on the basis of the estimated noise power, while here a smoothed global SNR is used.

3.2. Cepstral Domain Gain Function Smoothing

Here, we propose smoothing the gain function in the cepstral domain. A similar approach for the MFCC-MMSE algorithm has been proposed in [10]: in that paper, the smoothed gain was calculated as a weighted average between the gain values of the current and of the previous frame. Differently, in this paper, smoothing is computed as in Section 2.2, by calculating the median of the gain values, with the frequency index $k$ replaced by the channel index $j$; the low degree of correlation between neighboring Mel-coefficients led us to compute the median over a small interval of neighboring channels.
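The following sketch illustrates the cepstral-domain smoothing on one frame; the window half-width over neighboring Mel channels is an assumption of the sketch.

```python
# Illustrative sketch of median smoothing over neighboring Mel channels.
import numpy as np

def smooth_channel_gain(gain, half_width=1):
    """gain: per-channel gain of one frame; the median is taken over a small
    window of neighboring channels (assumed half-width of 1)."""
    smoothed = np.empty_like(gain)
    for j in range(len(gain)):
        lo, hi = max(0, j - half_width), min(len(gain), j + half_width + 1)
        smoothed[j] = np.median(gain[lo:hi])
    return smoothed
```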

4. Computer Simulations

Evaluation has been conducted on the AURORA2 [13] and AURORA4 [14] databases. AURORA2 consists of a subset of TIDigits utterances downsampled to 8 kHz, with noise and channel distortion added. AURORA4 is based on the Wall Street Journal database, to which various noise types have been added at different SNR levels. Experiments on AURORA4 have been conducted on the 8 kHz version of the official selection of the test set, composed of 166 utterances.

Recognition has been performed using the Hidden Markov Model Toolkit (HTK) [15]. The acoustic model structure and the recognition parameters are the same as in [13] for the AURORA2 tests and as in [14] for AURORA4. Feature vectors are composed of 13 MFCCs (with C0 and without energy) and their first and second derivatives. Pre-emphasis and cepstral mean normalization have been included in the feature extraction pipeline.

Recognition results are expressed as word accuracy percentages. For AURORA2, averages are computed over the 0–20 dB SNR range [13]; for AURORA4, averages are computed over the noisy test sets recorded with the Sennheiser microphone (sets 2–7) and over the test sets recorded with the second microphone (sets 9–14) [14]. For both databases, clean and multicondition acoustic models have been created using the provided training sets. Results obtained with the clean and multicondition acoustic models are reported in the columns denoted with “C” and “M”, respectively.

Table 1 shows the values of the parameters used in the computer simulations. For the sake of conciseness, only the parameters whose values differ from those reported in the reference papers are listed. Unless otherwise stated, the same values are used in both the frequency and cepstral domains.

4.1. Comparison of MCRA and IMCRA Algorithms

In order to evaluate the impact of the noise estimation algorithm, the LSA and C-LSA algorithms have been tested with both the MCRA and the IMCRA estimators on the AURORA2 database. The overall average accuracy for the LSA algorithm is 80.21% using MCRA and 82.27% using IMCRA; for the C-LSA algorithm, accuracy is 79.15% using MCRA and 82.27% using IMCRA.

These results clearly show that the accuracy of the noise variance estimate is crucial for both the frequency- and cepstral-domain algorithms. In the following experiments, the IMCRA estimator has been used for all noise reduction algorithms.

4.2. Results and Discussion

Tables 2 and 3 show the results on the AURORA2 database for the frequency- and cepstral-domain algorithms, respectively, while Tables 4 and 5 show the results on AURORA4. Tables 2 and 4 also show the results obtained without noise reduction (baseline). The best accuracy values are shown in bold.

Frequency-domain results on AURORA2 show that the LSA algorithm produces a remarkable improvement in recognition accuracy and that the gSNR modification gives a further increase of about 2% on average. In contrast, the gain smoothing does not produce significant improvements, and OMLSA does not give any advantage over its LSA counterpart.

Cepstral domain LSA results are comparable to frequency domain ones, and the introduced modifications produce similar accuracy values. Cepstral domain OMLSA slightly degrades accuracy with the clean acoustic model, while it slightly improves it with the multicondition one.

On AURORA4, frequency-domain results show that the LSA algorithm alone degrades recognition performance, and an accuracy increase is obtained only by introducing the gSNR modification. OMLSA gives a performance improvement over LSA and shows a similar behavior when gSNR is used. Similar considerations apply to the cepstral-domain algorithms; differently from the AURORA2 results, the gSNR modification increases accuracy by about 4% on average over C-OMLSA.

In both the AURORA2 and AURORA4 results, cepstral-domain algorithms perform better than frequency-domain ones when the multicondition acoustic model is used. This is consistent with the results obtained in [7] on the AURORA3 database, whose training set is always multicondition.

A further advantage of cepstral-domain algorithms over frequency-domain ones is their reduced computational complexity. Typically, frequency-domain noise reduction algorithms operate on about 256 frequency bins, whereas many ASR systems use 23 Mel channels. This reduces the complexity by about one order of magnitude and makes cepstral-domain algorithms the best choice for applications with limited computational resources or with strict real-time constraints. In [7], experiments were conducted on the AURORA3 database: this paper completes those results with AURORA2 and AURORA4 experiments and proposes alternative solutions to the algorithmic improvements presented in [10].

5. Conclusions

In this paper, a comparative evaluation of single-channel MMSE-based feature enhancement schemes in terms of speech recognition performance has been carried out. As mentioned above, such an approach has some advantages with respect to MB techniques for robust ASR and is therefore appealing for practical implementations. Different algorithms have been considered, both in the frequency and in the cepstral domain, including some innovative improvements to existing approaches; in this sense, the contribution is intended as a useful reference for researchers in the field. The computer simulations have been performed using HTK on the AURORA2 and AURORA4 databases, completing the results in [7]. The obtained results allow us to conclude that cepstral-domain solutions perform similarly to their log-spectral-domain counterparts with clean-condition acoustic models and better with multicondition ones, while attaining a significant reduction of the computational burden. Future efforts will be oriented toward analyzing the impact of further optimizations on the addressed algorithms and toward the extension to the multichannel case.