ISRN Mechanical Engineering
Volume 2012 (2012), Article ID 919234, 9 pages
http://dx.doi.org/10.5402/2012/919234
Review Article

Single Channel Speech Enhancement Techniques in Spectral Domain

Department of Systems Innovations, Graduate School of Engineering Science, Osaka University, 1–3 Machikaneyama, Osaka, Toyonaka 560-8531, Japan

Received 13 February 2012; Accepted 30 April 2012

Academic Editor: D. Aggelis

Copyright © 2012 Arata Kawamura et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

This paper presents single-channel speech enhancement techniques in the spectral domain. One of the best-known single-channel speech enhancement techniques is the spectral subtraction method proposed by S. F. Boll in 1979. In this method, an estimated speech spectrum is obtained by simply subtracting a pre-estimated noise spectrum from the observed one. Hence, the spectral subtraction method does not take speech spectral properties into account. It is well known that the spectral subtraction method produces an annoying artificial noise in the extracted speech signal. On the other hand, recent successful speech enhancement methods actively exploit speech properties and achieve efficient speech enhancement. This paper presents a historical review of several speech estimation techniques and explicitly states the differences between their theoretical backgrounds. Moreover, to evaluate their speech enhancement capabilities, we perform computer simulations. The results show that an adaptive speech enhancement method based on MAP estimation gives the best noise reduction capability among the speech enhancement methods presented in this paper.

1. Introduction

In recent years, speech enhancement has been required in a wide range of applications, including mobile communication and speech recognition systems; a major example is the cell phone, as shown in Figure 1. Many speech enhancement methods have been established over the decades [1–15]. These techniques can be classified into time domain methods and spectral domain methods. Most recent speech enhancement techniques are spectral domain methods, which are preferred in cell phones. In this paper, we focus on spectral domain speech enhancement techniques that employ a single microphone.

Figure 1: Application of speech enhancement.

The spectral subtraction method [3] is one of the most popular of the numerous noise reduction techniques in the spectral domain. This method achieves noise reduction by simply subtracting a pre-estimated noise spectral amplitude from the observed spectral amplitude; the spectral phase is not processed. The spectral subtraction method is easy to implement and effectively reduces stationary noise. However, it introduces an artificial noise, called musical noise, which is caused by speech estimation errors. Because the spectral subtraction method does not use speech spectral information, it often produces estimation errors. Ephraim and Malah proposed the MMSE-STSA (Minimum Mean Square Error Short-Time Spectral Amplitude) method [4], which utilizes a speech PDF (Probability Density Function) and a noise PDF. In [4], the speech and noise PDFs were modeled by Rayleigh and Gaussian density functions, respectively. This method gives an optimal estimate of the speech signal in the MMSE-STSA sense (the solution reduces to the Wiener filter [5] if we assume Gaussian distributions for both the speech and noise PDFs). Although the MMSE-STSA method gives an estimated speech signal with less musical noise, it requires more complicated computations; for example, the solution requires evaluating modified Bessel functions. Moreover, as pointed out by some researchers, real speech histograms do not fit the Rayleigh function employed in [4].

A more efficient method based on maximum a posteriori (MAP) estimation was established by Lotter and Vary [11]. Lotter and Vary modeled the speech PDF by a parametric super-Gaussian function controlled by two shape parameters. The parametric super-Gaussian function was developed from a histogram made from a large amount of real speech data in a single narrow SNR (Signal to Noise Ratio) interval. The noise suppression capability of this method is superior to that of the Wiener filter. However, residual noise is still persistently perceived. Andrianakis and White recognized that the speech PDF may change across SNR intervals [12]. They utilized three histograms made from speech signals in three different narrow SNR intervals and approximated them with Gamma density functions. As reported in [12], switching among these three speech PDFs according to the SNR can improve the noise reduction capability. While Andrianakis and White change the speech PDF discretely, Tsukamoto et al. change the speech PDF continuously according to the SNR [13]. They employed the parametric super-Gaussian function proposed in [11] and adaptively changed its shape parameters according to the SNR. Recently, Thanhikam et al. [16] refined this approach by constructing and evaluating many real speech histograms made from various narrow SNR intervals. As shown in [16], this method has a very strong noise reduction capability in comparison to other traditional speech enhancement methods, and hence it is especially effective in low SNR environments.

In the following sections, we present a historical review of the speech enhancement methods mentioned above and compare their speech enhancement capabilities by computer simulations.

2. Speech Enhancement in Spectral Domain

This section presents several speech enhancement techniques, including both traditional and recent methods. In particular, we explain the differences between them.

2.1. General Speech Enhancement System

First, we explain a general single-channel speech enhancement system in the spectral domain.

We assume that the observed signal is the sum of a speech signal and a noise signal:

$$y(t) = x(t) + d(t), \tag{1}$$

where $y(t)$ is the observed signal at time $t$, and $x(t)$ and $d(t)$ denote the speech signal and the noise signal, respectively. We assume throughout the paper that $x(t)$ is uncorrelated with $d(t)$. Segmenting (1) into frames and taking the DFT of each windowed frame, we have

$$Y_k(n) = \sum_{t=0}^{N-1} y(nQ+t)\,h(t)\,e^{-j2\pi tk/N} \quad (k = 0, 1, \ldots, N-1), \tag{2}$$

where $N$, $n$, and $k$ denote the frame length, the frame index, and the frequency bin index, respectively. The analysis frame is shifted by $Q$ samples, where $Q = N/2$ is used throughout the paper. The function $h(t)$ denotes the analysis window, for which the Hanning window of size $N$ is used. The DFT spectrum $Y_k(n)$ can be rewritten as

$$Y_k(n) = X_k(n) + D_k(n), \tag{3}$$

where $X_k(n)$ and $D_k(n)$ are the $k$th spectral components of $x(t)$ and $d(t)$, respectively. The enhanced speech spectrum $\hat{X}_k(n)$ is given as

$$\hat{X}_k(n) = G_k(n)\,Y_k(n), \tag{4}$$

where $G_k(n)$ is a spectral gain. That is, the enhanced speech spectrum is the observed spectrum $Y_k(n)$ multiplied by the spectral gain $G_k(n)$. Hence, the speech enhancement capability depends only on the spectral gain.
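
The framing and gain steps of (2) and (4) can be sketched as follows. This is only a minimal illustration in plain Python, assuming a periodic Hann window and a direct (slow) DFT; the names `hann`, `frame_spectrum`, and `apply_gain` are hypothetical and do not come from the paper, and a practical system would use an FFT.

```python
import cmath
import math

def hann(N):
    # Periodic Hann window of length N (an assumed definition; the paper
    # only states that a Hanning window of size N is used).
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * t / N) for t in range(N)]

def frame_spectrum(y, n, N):
    """DFT of frame n as in (2), with frame shift Q = N/2."""
    Q = N // 2
    h = hann(N)
    seg = [y[n * Q + t] * h[t] for t in range(N)]
    return [sum(seg[t] * cmath.exp(-2j * math.pi * t * k / N) for t in range(N))
            for k in range(N)]

def apply_gain(Y, G):
    # Enhanced spectrum of (4): X_hat_k = G_k * Y_k.
    # A real-valued gain changes only the amplitude, not the phase.
    return [g * yk for g, yk in zip(G, Y)]
```

Any of the spectral gains discussed in the following sections could be passed to `apply_gain` bin by bin; the enhanced waveform would then be recovered by an inverse DFT and overlap-add, which is not shown here.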

A general speech enhancement system is illustrated in Figure 2, where the value of the spectral gain $G_k(n)$ depends on the speech enhancement algorithm employed. We see from (3) and (4) that the ideal spectral gain is given as

$$G_{k,\mathrm{opt}}(n) = 1 - \frac{D_k(n)}{Y_k(n)}. \tag{5}$$

This spectral gain recovers the original speech spectrum exactly, since $G_{k,\mathrm{opt}}(n)\,Y_k(n) = Y_k(n) - D_k(n) = X_k(n)$. Since the ideal spectral gain cannot be obtained directly from $Y_k(n)$, we have to approximate it by introducing additional assumptions about the speech or the noise signal.

Figure 2: General speech enhancement system.

In the following sections, we give some typical spectral gains, each derived from its own assumptions about the speech or the noise. To avoid cluttered notation, we omit the indices $n$ and $k$ when they do not play an important role.

2.2. Spectral Subtraction

The simplest and best-known speech enhancement technique is the spectral subtraction method proposed by Boll in 1979 [3]. This method simply subtracts a pre-estimated noise spectral amplitude from the observed one to obtain the estimated speech spectral amplitude. In the spectral subtraction method, the spectral phase is not modified; that is, the estimated speech spectral phase is identical to the observed one. This is based on the fact that the spectral phase is less important than the spectral amplitude in human speech perception [17]. The spectral subtraction method is achieved by using the following spectral gain:

$$G_{\mathrm{SS}} = 1 - \frac{|\hat{D}|}{|Y|}, \tag{6}$$

where $|\hat{D}|$ is the pre-estimated noise spectral amplitude. Usually, we choose $|\hat{D}| = E[|D|]$. We note that (6) is the amplitude version of (5).
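
The gain in (6) can be sketched as a one-line function. Note that (6) becomes negative whenever the noise estimate exceeds the observed amplitude; the clipping at `floor` below is an assumption added for this illustration and is not stated in the text, and the function name is hypothetical.

```python
def spectral_subtraction_gain(y_mag, d_mag, floor=0.0):
    """G_SS = 1 - |D_hat| / |Y| as in (6).

    The result is clipped below at `floor` (an assumption, not part of (6)),
    so that the gain never becomes negative.
    """
    if y_mag <= 0.0:
        # No observed energy in this bin: return the floor to avoid dividing by zero.
        return floor
    return max(1.0 - d_mag / y_mag, floor)
```

For example, an observed amplitude of 2.0 with a noise estimate of 0.5 gives a gain of 0.75, while any bin whose noise estimate exceeds the observation is set to the floor. Such hard decisions on isolated bins are one way to picture how the isolated spectral peaks described below arise.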

The spectral subtraction method does not take speech spectral properties into account. As a result, the estimated speech signal includes many estimation errors. Each estimation error produces an isolated spectral component in the estimated speech signal. This noise is called "musical noise," and it is perceived as an annoying sound by human listeners. To obtain an estimated speech signal with less musical noise, we should introduce speech properties into the speech enhancement scheme. In the following sections, we present some speech enhancement methods that take the probabilistic properties of speech into account.

2.3. Wiener Filter

In this section, we explain the Wiener filter [5], which utilizes the probabilistic properties of both the speech and the noise spectra. It is well known that the Wiener filter provides an estimated speech signal with less musical noise than the spectral subtraction method.

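To derive the Wiener filter, we assume that the speech spectrum $X$ is uncorrelated with the noise spectrum $D$ and that $E[X]=0$, $E[|X|^2]=\sigma_x^2$, $E[D]=0$, and $E[|D|^2]=\sigma_d^2$, where $E[\cdot]$ denotes the expected value. The Wiener filter is obtained by minimizing the following cost function:

$$J = E\left[|X-\hat{X}|^2\right] = E\left[|X-GY|^2\right]. \tag{7}$$

We can rewrite $J$ as

$$J = E\left[|X|^2\right] + |G|^2 E\left[|Y|^2\right] - G\,E[X^{*}Y] - G^{*}E[XY^{*}] = \sigma_x^2 + |G|^2\left(\sigma_x^2+\sigma_d^2\right) - G\sigma_x^2 - G^{*}\sigma_x^2, \tag{8}$$

where $^{*}$ denotes complex conjugation. Differentiating $J$ with respect to $G^{*}$ gives

$$\frac{\partial J}{\partial G^{*}} = G\left(\sigma_x^2+\sigma_d^2\right) - \sigma_x^2. \tag{9}$$

Setting (9) to zero and solving for $G$, we obtain the spectral gain of the Wiener filter:

$$G_{\mathrm{Wiener}} = \frac{\sigma_x^2}{\sigma_x^2+\sigma_d^2} = \frac{\xi}{1+\xi}, \tag{10}$$

where $\xi=\sigma_x^2/\sigma_d^2$ is the a priori SNR. The Wiener filter requires either the single parameter $\xi$ or the two variances $\sigma_x^2$ and $\sigma_d^2$.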
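
As a small sketch of (10), the two equivalent forms of the Wiener gain can be written as follows; the function names are hypothetical and chosen only for this illustration.

```python
def wiener_gain(xi):
    """G_Wiener = xi / (1 + xi), with xi the a priori SNR of (10)."""
    return xi / (1.0 + xi)

def wiener_gain_from_variances(var_x, var_d):
    """The same gain written with the speech and noise variances."""
    return var_x / (var_x + var_d)
```

At an a priori SNR of 0 dB ($\xi = 1$) the gain is 0.5, and it approaches 1 as $\xi$ grows and 0 as $\xi$ falls, so the filter attenuates each bin smoothly rather than making a hard decision.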

2.4. MMSE-STSA Method

In this section, we explain a historically important speech enhancement method, that is, the MMSE-STSA method [4] proposed by Ephraim and Malah in 1984. Ephraim and Malah have proposed not only an efficient spectral gain, but also an efficient estimation technique to get the a priori SNR.

The MMSE-STSA method is derived by minimizing a conditional mean square value of the short-time spectral amplitude. The cost function to be minimized is given by

$$J_{\mathrm{MMSE}} = E\left[|X-\hat{X}|^2 \,\middle|\, Y\right] = \int |X|^2 p(X|Y)\,dX + |\hat{X}|^2 - \hat{X}\int X^{*} p(X|Y)\,dX - \hat{X}^{*}\int X\,p(X|Y)\,dX, \tag{11}$$

where $p(X|Y)$ denotes the conditional PDF of $X$ given $Y$. The estimated speech spectrum which minimizes $J_{\mathrm{MMSE}}$ is given as

$$\hat{X}_{\mathrm{MMSE}} = \int X\,p(X|Y)\,dX = E[X|Y]. \tag{12}$$

As shown in [6], when we assume that $p(X)$ and $p(D)$ are Gaussian functions, (12) yields the Wiener filter again. On the other hand, Ephraim and Malah considered the PDFs of the speech spectral amplitude and phase, that is, $p(|X|)$ and $p(\angle X)$. They assumed that $p(|X|)$ and $p(\angle X)$ are the Rayleigh distribution and the uniform distribution, respectively [18]. They assumed that $p(D)$ is a Gaussian function, where the noise variance $\sigma_d^2$ is assumed to split equally between the real and imaginary parts. These PDFs are expressed as

$$p(|X|) = \frac{2|X|}{\sigma_x^2}\exp\left(-\frac{|X|^2}{\sigma_x^2}\right), \tag{13}$$

$$p(\angle X) = \frac{1}{2\pi}, \tag{14}$$

$$p(Y|X) = \frac{1}{\pi\sigma_d^2}\exp\left(-\frac{|Y-X|^2}{\sigma_d^2}\right), \tag{15}$$

where $p(Y|X)$ corresponds to $p(D)$. Assuming $p(X)=p(|X|)\,p(\angle X)$, we can calculate (12) by using the relation $p(X|Y)=p(Y|X)\,p(X)/p(Y)$. After tedious and complex computations, the spectral gain is given as [4]

$$G_{\mathrm{MMSE}} = \frac{(\pi v)^{1/2}}{2\gamma}\exp\left(-\frac{v}{2}\right)\left[(1+v)\,I_0\!\left(\frac{v}{2}\right) + v\,I_1\!\left(\frac{v}{2}\right)\right], \tag{16}$$

where $I_i(\cdot)$ is the modified Bessel function of order $i$ and

$$v = \frac{\xi}{1+\xi}\,\gamma, \qquad \gamma = \frac{|Y|^2}{\sigma_d^2}. \tag{17}$$

Here, $\gamma$ is called the a posteriori SNR. As shown in [4], the optimal spectral phase in the MMSE-STSA sense is identical to the observed one. Hence, $G_{\mathrm{MMSE}}$ is also a real value. The MMSE-STSA solution, $G_{\mathrm{MMSE}}$, is completely characterized by $\sigma_d^2$, $\xi$, and $\gamma$. When the noise variance $\sigma_d^2$ is known or can be estimated, $\gamma$ is simply obtained from the observed spectrum. On the other hand, estimating the a priori SNR $\xi$ is difficult, although it is required by many other spectral speech enhancers as well. One of the valuable contributions of [4] is a useful estimation method for $\xi$, called the decision-directed method. We describe it and use it to estimate $\xi$ in Section 3.
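
To give a feeling for the cost of (16), the following sketch evaluates the gain in plain Python. The helper `bessel_i` is a hypothetical routine that sums a truncated power series for $I_0$ and $I_1$; it is an assumption made for this illustration, is adequate only for moderate arguments, and assumes $\gamma > 0$. A numerical library with scaled Bessel functions would be preferable in practice.

```python
import math

def bessel_i(order, x, terms=60):
    """Modified Bessel function I_order(x), order 0 or 1, by its power series.

    I_nu(x) = sum_m (x/2)^(2m+nu) / (m! (m+nu)!), truncated after `terms` terms.
    """
    total = 0.0
    term = (x / 2.0) ** order / math.factorial(order)  # m = 0 term
    for m in range(terms):
        total += term
        # Ratio of consecutive series terms.
        term *= (x * x / 4.0) / ((m + 1) * (m + 1 + order))
    return total

def mmse_stsa_gain(xi, gamma):
    """Spectral gain of (16) with v defined as in (17)."""
    v = xi / (1.0 + xi) * gamma
    bracket = (1.0 + v) * bessel_i(0, v / 2.0) + v * bessel_i(1, v / 2.0)
    return math.sqrt(math.pi * v) / (2.0 * gamma) * math.exp(-v / 2.0) * bracket
```

For $\xi = 10$ and $\gamma = 11$ this sketch gives a gain of about 0.93, slightly above the Wiener gain $\xi/(1+\xi) \approx 0.91$ for the same a priori SNR.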

2.5. MAP Estimation Method

As confirmed in the literature, the spectral gain $G_{\mathrm{MMSE}}$ derived in the previous section is superior to that of the spectral subtraction method. However, $G_{\mathrm{MMSE}}$ is not easy to implement because of its high computational complexity. In fact, we can obtain a more theoretically relevant and reasonable spectral gain from the same cost function (11). The MMSE-STSA method chooses $\hat{X}=E[X|Y]$ to minimize (11). Here, we note that $E[X|Y]$ is the best choice when the PDF is an even function such as a Gaussian function. Because the Rayleigh distribution is an asymmetric function, $\hat{X}=E[X|Y]$ is not appropriate. The MAP estimation method [6] instead takes, as the best choice for (11), the speech spectrum that maximizes $p(X|Y)$.

To illustrate the difference between the MMSE-STSA solution and the MAP solution, we show an example with specific PDFs. Figures 3(a) and 3(b) show the Gaussian and Rayleigh distributions, respectively. Here, the horizontal axis denotes the value of the argument $x$ and the vertical axis denotes the PDF $p(x)$. The vertical dotted lines mark the argument values at the mean and at the maximum of $p(x)$, respectively. The former value corresponds to the MMSE-STSA solution and the latter to the MAP solution. As shown in Figure 3(a), the MMSE-STSA solution is identical to the MAP solution for the Gaussian distribution, which is an even function. On the other hand, the two solutions differ for the asymmetric Rayleigh distribution, as shown in Figure 3(b). Obviously, we should choose the MAP solution rather than the MMSE-STSA solution to minimize the cost function (11).

Figure 3: Maximum and mean values for the specific PDFs.
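
The gap between the mean and the maximum of the Rayleigh PDF can also be checked numerically. The sketch below assumes the parameterization of (13) with $\sigma_x^2 = 1$ and uses a simple midpoint rule on a fine grid; the function names, the grid size, and the integration range are arbitrary choices made for this illustration.

```python
import math

def rayleigh_pdf(x, var_x=1.0):
    """Rayleigh PDF in the form of (13): (2x / var_x) exp(-x^2 / var_x)."""
    return 2.0 * x / var_x * math.exp(-x * x / var_x)

def mean_and_mode(pdf, upper=10.0, steps=100000):
    """Approximate the mean (midpoint rule) and the mode (grid search) of a PDF on [0, upper]."""
    dx = upper / steps
    xs = [(i + 0.5) * dx for i in range(steps)]
    mean = sum(x * pdf(x) for x in xs) * dx
    mode = max(xs, key=pdf)
    return mean, mode
```

With $\sigma_x^2 = 1$, the mean is $\sqrt{\pi}/2 \approx 0.886$ while the maximum lies at $1/\sqrt{2} \approx 0.707$, so the two estimators pick clearly different values for this asymmetric PDF.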

To obtain the MAP solution, we have to maximize the conditional PDF $p(X|Y)$. Based on Bayes' rule, we have [6]

$$p(X|Y) = \frac{p(Y|X)\,p(X)}{p(Y)} \propto p(Y|X)\,p(X). \tag{18}$$

The MAP estimation finds the argument $X$ which maximizes $p(X|Y)$, that is,

$$\hat{X} = \arg\max_{X}\, p(X|Y) = \arg\max_{X}\, p(Y|X)\,p(X) = \arg\max_{X}\, \ln\{p(Y|X)\,p(X)\}. \tag{19}$$

We assume the same PDFs as in (13) to (15), and $p(X)=p(|X|)\,p(\angle X)$. After calculating $\ln\{p(Y|X)\,p(X)\}$ and differentiating it with respect to $|X|$ (or $\angle X$), we set the obtained derivative to zero and solve it for $|X|$ (or $\angle X$). Then, we have [6]

$$G_{\mathrm{MAP}} = \frac{\xi + \sqrt{\xi^2 + 2(1+\xi)(\xi/\gamma)}}{2(1+\xi)}. \tag{20}$$

Since the MAP solution of $\angle X$ is identical to the observed spectral phase, $G_{\mathrm{MAP}}$ is also a real value. We see that $G_{\mathrm{MAP}}$ consists of $\xi$ and $\gamma$ only; thus its computational complexity is extremely low in comparison to (16).
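
The low cost of (20) is easy to see in code: a minimal sketch needs only one square root per bin. The function name is hypothetical, and $\gamma > 0$ is assumed.

```python
import math

def map_gain(xi, gamma):
    """G_MAP of (20): (xi + sqrt(xi^2 + 2 (1 + xi) xi / gamma)) / (2 (1 + xi))."""
    return (xi + math.sqrt(xi * xi + 2.0 * (1.0 + xi) * xi / gamma)) / (2.0 * (1.0 + xi))
```

As $\gamma$ grows, the second term under the square root vanishes and the expression tends to $\xi/(1+\xi)$, the Wiener gain of (10).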

2.6. Lotter's Spectral Gain

In the previous section, we obtained a MAP solution for speech enhancement under the assumption that the PDF of the speech spectral amplitude can be modeled as the Rayleigh distribution. However, some researchers have pointed out that other speech PDFs are more appropriate [8–11]. In 2005, Lotter and Vary proposed an original speech spectral amplitude PDF. This PDF was derived from a real speech histogram made from a large amount of real speech data. In the same manner as in the previous section, the speech spectral amplitude and phase were modeled separately in [11]. The PDF of the spectral phase was again modeled as the uniform distribution defined in (14). Lotter and Vary modeled the PDF of the speech spectral amplitude as a super-Gaussian function represented by

$$p(|X|) = \frac{\mu^{\nu+1}}{\Gamma(\nu+1)}\,\frac{|X|^{\nu}}{\sigma_x^{\nu+1}}\exp\left(-\mu\frac{|X|}{\sigma_x}\right), \tag{21}$$

where $\Gamma(\cdot)$ is the Gamma function and $\mu$ and $\nu$ are the shape parameters which determine the shape of the above PDF. Using (21), (14), and (15), the same procedure as in the previous section gives the MAP solution expressed as

$$G_{\mathrm{LMAP}} = u + \sqrt{u^2 + \frac{\nu}{2\gamma}}, \tag{22}$$

$$u = \frac{1}{2} - \frac{\mu}{4\sqrt{\gamma\xi}}. \tag{23}$$

The MAP solution of the speech spectral phase is again identical to the observed one, and thus $G_{\mathrm{LMAP}}$ is a real value. Lotter and Vary reported in [11] that the most appropriate shape parameters are $\mu=1.74$ and $\nu=0.126$. The spectral gain $G_{\mathrm{LMAP}}$ also consists of $\xi$ and $\gamma$ only; hence it is easy to implement.
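
A minimal sketch of (22) and (23) is given below, with the shape parameters reported in [11] as default arguments. The function name is hypothetical, and both $\xi$ and $\gamma$ are assumed to be strictly positive.

```python
import math

def lotter_map_gain(xi, gamma, mu=1.74, nu=0.126):
    """G_LMAP of (22), with u from (23)."""
    u = 0.5 - mu / (4.0 * math.sqrt(gamma * xi))
    return u + math.sqrt(u * u + nu / (2.0 * gamma))
```

For $\xi = \gamma = 1$ and the default shape parameters, $u = 0.065$ and the gain is about 0.324. The gain stays positive for any $\nu > 0$, even when $u$ is negative.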

2.7. Adaptive Speech PDF Method

In [11], the shape parameters of the speech spectral amplitude PDF, $\mu$ and $\nu$, were derived from a large amount of speech data in a single narrow SNR interval. In a practical situation, however, a speech signal includes both active segments and pause segments. Since the speech spectral amplitude is always zero in the pause segments, we expect that its PDF can be modeled there as a delta function. In the active speech segments, on the other hand, the PDF of the speech spectral amplitude follows other functions. Tsukamoto et al. noticed this fact and investigated an adaptive method that changes the PDF of the speech spectral amplitude according to the SNR [13]. They chose Lotter's PDF defined in (21) as the adaptive PDF, because its shape is easily controlled by $\nu$ and $\mu$. Examples of Lotter's PDF with different shape parameters are shown in Figure 4. We see from this figure that the PDF can fit the exponential distribution and the Rayleigh distribution by adjusting the shape parameters. Using real speech histograms, Tsukamoto et al. derived adaptive shape parameters and showed their effectiveness through computer simulations [13]. This basic idea is useful for speech enhancement in practical situations. Unfortunately, the reliability of the derived adaptive shape parameters is comparatively low, because they were derived from only two speech histograms.

Figure 4: Shape examples of the PDF in (21) with different parameters.

To refine Tsukamoto's adaptive shape parameters, Thanhikam et al. constructed and evaluated many real speech histograms in various narrow SNR intervals [16]. They fitted the speech histograms with (21) and revealed an interesting curve of the shape parameters across the narrow SNR intervals. The shape parameters obtained by fitting and the derived curves are shown in Figures 5(a) and 5(b), where the narrow SNR was calculated as $P = 10\log_{10}\xi$ [dB]. The lines in the figures denote the curves obtained by the least mean square method. These curves describe the relation between the shape parameters and $P$. Table 1 shows the formulations of the derived shape parameter functions of $P$, where we denote the derived shape parameters by $R_{\mu_k}(n)$ and $R_{\nu_k}(n)$, and

$$F[x] = \begin{cases} x, & x > 0, \\ 0, & \text{otherwise}. \end{cases} \tag{24}$$

Table 1: Instantaneous shape parameter functions $R_{\mu_k}(n)$ and $R_{\nu_k}(n)$.

Figure 5: Shape parameter fitting result for the SNR.

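Thanhikam et al. used averaged values of $R_{\mu_k}(n)$ and $R_{\nu_k}(n)$ to determine the present PDF shape of the speech spectral amplitude. Their "adaptive" MAP solution is as follows:

$$G_k(n) = u_k(n) + \sqrt{u_k^2(n) + \frac{\nu_k(n)}{2\gamma_k(n)}}, \tag{25}$$

$$u_k(n) = \frac{1}{2} - \frac{\mu_k(n)}{4\sqrt{\gamma_k(n)\,\hat{\xi}_k(n)}}, \tag{26}$$

$$\mu_k(n) = \alpha\,\mu_k(n-1) + (1-\alpha)\,R_{\mu_k}(n), \tag{27}$$

$$\nu_k(n) = \alpha\,\nu_k(n-1) + (1-\alpha)\,R_{\nu_k}(n), \tag{28}$$

where $\alpha$ is the forgetting factor and $\mu_k(n)$ and $\nu_k(n)$ are the adaptive shape parameters. In [16], the settings were $\alpha=0.98$, $\mu_k(0)=20$, and $\nu_k(0)=0$. This paper also uses these settings.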
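
The recursion in (25) to (28) can be sketched per frequency bin as follows. The instantaneous values $R_{\mu_k}(n)$ and $R_{\nu_k}(n)$ are defined in Table 1, which is not reproduced in this text, so the sketch simply takes them as inputs; the function names are hypothetical.

```python
import math

def update_shape(prev, instantaneous, alpha=0.98):
    """First-order recursive average used in both (27) and (28)."""
    return alpha * prev + (1.0 - alpha) * instantaneous

def adaptive_map_gain(xi_hat, gamma, mu, nu):
    """G_k(n) of (25) with u_k(n) of (26), for the current shape parameters."""
    u = 0.5 - mu / (4.0 * math.sqrt(gamma * xi_hat))
    return u + math.sqrt(u * u + nu / (2.0 * gamma))
```

With $\alpha = 0.98$, each frame moves the shape parameters only 2% of the way toward the instantaneous values, so the PDF shape follows the SNR slowly. For example, starting from $\mu_k(0) = 20$, an instantaneous value of 0 gives 19.6 after one frame.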

In the next section, we compare the speech enhancement capabilities of the spectral gains presented in this paper.

3. Speech Enhancement Simulation

To compare the speech enhancement capabilities of the spectral gains derived in this paper, we first explain the conditions common to the speech enhancement simulations. After that, we show the simulation results and discuss them.

3.1. Common Conditions

The speech enhancement methods explained in this paper commonly require the noise variance $\sigma_{d,k}^2(n)$, the a priori SNR $\xi_k(n)$, and the a posteriori SNR $\gamma_k(n)$. To obtain these parameters, the following estimation methods were used.

First, the noise variance was calculated by using the weighted noise estimator proposed in [19]. This method can update the estimated noise variance even when a speech signal is present. The weighted noise estimator calculates an instantaneous noise power by using the weight $W_k(n)$ shown in Figure 6. Here, $\theta$ and $\hat{\gamma}_H$ are constants; [19] recommends $\theta=7$ and $\hat{\gamma}_H=10$. As shown in Figure 6, $W_k(n)$ is a function of $\hat{\gamma}_k(n)$, which is given as

$$\hat{\gamma}_k(n) = 10\log_{10}\frac{|Y_k(n)|^2}{\sigma_{d,k}^2(n-1)}. \tag{29}$$

The noise variance $\sigma_{d,k}^2(n)$ is updated as

$$\sigma_{d,k}^2(n) = \beta\,\sigma_{d,k}^2(n-1) + (1-\beta)\,W_k(n)\,|Y_k(n)|^2, \tag{30}$$

where $\beta$ is a forgetting factor; $\beta=0.92$ was used.

Figure 6: Weighting function.
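
The update in (29) and (30) can be sketched as below. The exact shape of the weighting function in Figure 6 is not described in the text, so this sketch does not attempt to reproduce it: the weight is passed in by the caller, and the function names are hypothetical.

```python
import math

def instantaneous_snr_db(y_pow, prev_noise_var):
    """gamma_hat_k(n) of (29): 10 log10(|Y_k(n)|^2 / sigma_d,k^2(n-1)), in dB."""
    return 10.0 * math.log10(y_pow / prev_noise_var)

def update_noise_variance(prev_noise_var, y_pow, weight, beta=0.92):
    """sigma_d,k^2(n) of (30), for a weight W_k(n) chosen by the caller."""
    return beta * prev_noise_var + (1.0 - beta) * weight * y_pow
```

A weight of 1 makes (30) an ordinary recursive average of the observed power, while a weight of 0 lets the previous estimate decay by the factor $\beta$. Intermediate weights therefore control how much of a frame that may contain speech is allowed to enter the noise estimate.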

Next, the a posteriori SNR was directly calculated as

$$\gamma_k(n) = \frac{|Y_k(n)|^2}{\sigma_{d,k}^2(n)}. \tag{31}$$

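Lastly, the a priori SNR was calculated by using the decision-directed method proposed in [4]. The decision-directed method is given by

$$\hat{\xi}_k(n) = \alpha_{\mathrm{snr}}\frac{|\hat{X}_k(n-1)|^2}{\sigma_{d,k}^2(n-1)} + \left(1-\alpha_{\mathrm{snr}}\right)F\left[\gamma_k(n)-1\right], \tag{32}$$

where $\alpha_{\mathrm{snr}}$ is a forgetting factor; $\alpha_{\mathrm{snr}}=0.98$ was used according to [4].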
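
A minimal per-bin sketch of (31) and (32) follows, with the half-wave rectifier $F[\cdot]$ of (24) written as `max(..., 0.0)`. The function names are hypothetical.

```python
def a_posteriori_snr(y_pow, noise_var):
    """gamma_k(n) of (31): |Y_k(n)|^2 / sigma_d,k^2(n)."""
    return y_pow / noise_var

def decision_directed_xi(prev_xhat_pow, prev_noise_var, gamma, alpha_snr=0.98):
    """xi_hat_k(n) of (32), combining the previous frame's estimate with F[gamma - 1]."""
    return (alpha_snr * prev_xhat_pow / prev_noise_var
            + (1.0 - alpha_snr) * max(gamma - 1.0, 0.0))
```

Because $\alpha_{\mathrm{snr}}$ is close to 1, the estimate is dominated by the enhanced spectrum of the previous frame, and the instantaneous term $F[\gamma_k(n)-1]$ contributes only 2% per frame.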

The common speech enhancement system is shown in Figure 7, where the numbers denote the order of the estimation procedures. The spectral gain estimation, of course, depends on the speech enhancement method employed. In the simulations, the observed signal $y(t)$ was a female speech signal $x(t)$ corrupted by a real tunnel noise $d(t)$ at SNR = 0 dB, where the noise was recorded in a tunnel on an expressway in Japan. All the signals used in the simulations were sampled at 8 kHz, and the DFT size was 256 (the FFT was used to compute the DFT). For objective evaluation, we utilized the SNR defined as

$$\mathrm{SNR} = 10\log_{10}\frac{\sum_{t=0}^{L-1}|x(t)|^2}{\sum_{t=0}^{L-1}|x(t)-\hat{x}(t)|^2}, \tag{33}$$

where $L$ denotes the number of samples in the time domain and $\hat{x}(t)$ is the enhanced speech signal. We also utilized another evaluation function given as [17]

$$\mathrm{LR} = \frac{1}{J}\sum_{n=0}^{J-1}\frac{1}{N}\sum_{k=0}^{N-1}\left(\log\frac{|\hat{X}_k(n)|}{|X_k(n)|} + \frac{|X_k(n)|}{|\hat{X}_k(n)|} - 1\right), \tag{34}$$

where $J$ is the number of frames. The LR (Likelihood Ratio) denotes a spectral distance between the original speech and the estimated one; hence a perfect speech estimate gives LR = 0.
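
The output SNR of (33) can be computed with a short function such as the following sketch, which assumes that the clean and enhanced signals are time-aligned sequences of equal length; the function name is hypothetical.

```python
import math

def output_snr_db(x, x_hat):
    """SNR of (33) in dB: clean-speech energy over the energy of the estimation error."""
    signal = sum(abs(a) ** 2 for a in x)
    error = sum(abs(a - b) ** 2 for a, b in zip(x, x_hat))
    return 10.0 * math.log10(signal / error)
```

For instance, an estimate whose samples are all 10% too small in amplitude gives 20 dB, since the error energy is then one hundredth of the signal energy.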

Figure 7: Speech enhancement system.
3.2. Simulation Results

Speech enhancement simulations were carried out to compare the presented speech enhancement methods. The chosen methods were the spectral subtraction method [3] and Wiener filter [5] as traditional methods, Lotter's spectral gain [11] as a MAP method using a fixed speech PDF, and the adaptive speech PDF method [16] as the recent method.

Table 2 shows the results of the objective evaluation for each method, where both the best SNR and the best LR results were obtained with the adaptive speech PDF method proposed by Thanhikam et al. [16]. We see from this table that the Wiener filter and Lotter's method also gave comparatively good SNR and LR results in comparison to the spectral subtraction method. The waveforms of the simulation results are shown in Figures 8(a)–8(e), and the respective spectrograms are shown in Figures 9(a)–9(e). From Figures 8(b) and 9(b), we see that the spectral subtraction method left a large amount of residual noise. The main reason may be that the spectral subtraction method does not use any speech spectral information. The residual noise is perceived as an annoying musical noise. From Figures 8(c) and 9(c), we see that the Wiener filter is superior to the spectral subtraction method for speech enhancement. The Wiener filter gave an estimated speech signal with less musical noise, although the amount of residual noise was comparatively large. From the waveform shown in Figure 8(d), we can confirm that Lotter's spectral gain method effectively reduces the noise in some segments. However, its spectrogram in Figure 9(d) shows that Lotter's spectral gain method emphasized isolated spectral components, that is, musical noise. As a result, it also causes a perceptual problem. In Figures 8(e) and 9(e), such estimation errors are not observed. This implies that the adaptive PDF method proposed by Thanhikam et al. is appropriate for reducing the noise in speech pause segments. However, in the speech activity segments, we can confirm that some speech spectral components also vanished. The output speech quality of the adaptive speech PDF method may be improved by adjusting the forgetting factor in the adaptive shape parameters of the speech PDF.

Table 2: Objective evaluation results for noisy signal with SNR = 0 dB.

Figure 8: Waveforms of speech enhancement results.

Figure 9: Spectrograms of speech enhancement results.

4. Conclusion

Single-channel speech enhancement methods have been extensively studied over the decades. This paper has presented several spectral gain methods selected from these numerous studies. Of course, there exist various noisy situations, and hence we cannot single out the best speech enhancement system among them. We simply tried to state explicitly the theoretical backgrounds of the chosen speech enhancement methods. The noise reduction capabilities of the speech enhancement methods were roughly compared for one example of noisy speech, although the simulation results may change slightly when different noise and speech signals are used. From the obtained simulation results, we confirmed that the MAP estimation methods gave good noise reduction performance. In particular, the recently proposed adaptive speech PDF method reduced the noise signal strongly and hence did not produce musical noise in speech pause segments. In the speech activity segments, however, we perceived a low-level musical noise and a degradation of the speech. Such degradation tends to become larger as the noise increases. Future work in speech enhancement includes the development of an effective noise reduction method that performs well for noisy speech signals with SNR less than 0 dB.

References

  1. M. Muneyasu and A. Taguchi, Nonlinear Digital Signal Processing, Asakura Publishing, Tokyo, Japan, 1999.
  2. A. Kawamura, Y. Iiguni, and Y. Itoh, “A noise reduction method based on linear prediction with variable step-size,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E88-A, no. 4, pp. 855–861, 2005.
  3. S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.
  4. Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, 1984.
  5. B. Widrow, J. G. R. Glover Jr., J. M. Mccool, et al., “Adaptive noise cancelling: principles and applications,” Proceedings of The IEEE, vol. 63, no. 12, pp. 1692–1716, 1975.
  6. P. J. Wolfe and S. J. Godsill, “Efficient alternatives to the Ephraim and Malah suppression rule for audio signal enhancement,” Eurasip Journal on Applied Signal Processing, vol. 2003, no. 10, pp. 1043–1051, 2003.
  7. R. J. McAulay and M. L. Malpass, “Speech enhancement using a soft-decision noise suppression filter,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 2, pp. 137–145, 1980.
  8. B. Chen and P. C. Loizou, “Speech enhancement using a MMSE short time spectral amplitude estimator with laplacian speech modeling,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), pp. I1097–I1100, March 2005.
  9. R. Martin, “Speech enhancement based on minimum mean-square error estimation and supergaussian priors,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 845–856, 2005.
  10. S. Gazor and W. Zhang, “Speech enhancement employing laplacian-gaussian mixture,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 896–904, 2005.
  11. T. Lotter and P. Vary, “Speech enhancement by MAP spectral amplitude estimation using a super-Gaussian speech model,” Eurasip Journal on Applied Signal Processing, vol. 2005, no. 7, pp. 1110–1126, 2005.
  12. I. Andrianakis and P. R. White, “Speech spectral amplitude estimators using optimally shaped Gamma and Chi priors,” Speech Communication, vol. 51, no. 1, pp. 1–14, 2009.
  13. Y. Tsukamoto, A. Kawamura, and Y. Iiguni, “Speech enhancement based on MAP estimation using a variable speech distribution,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E90-A, no. 8, pp. 1587–1593, 2007.
  14. A. Kawamura, W. Thanhikam, and Y. Iiguni, “A speech spectral estimator using adaptive speech probability density function,” in Proceedings of the EUSIPCO 2010, pp. 1549–1552, August 2010.
  15. W. Thanhikam, A. Kawamura, and Y. Iiguni, “Speech enhancement using speech model parameters refined by two-step technique,” in Proceedings of the 2nd APSIPA Annual Summit and Conference, p. 11, December 2010.
  16. W. Thanhikam, A. Kawamura, and Y. Iiguni, “Speech enhancement based on real-speech PDF in various narrow SNR intervals,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E95-A, no. 3, pp. 623–630, 2012.
  17. S. Furui, Digital Speech Processing, Tokai University Press, Tokyo, Japan, 1985.
  18. S. L. Miller and D. G. Childers, Probability and Random Processes, Elsevier/Academic Press, 2004.
  19. M. Kato, A. Sugiyama, and M. Serizawa, “Noise suppression with high speech quality based on weighted noise estimation and MMSE STSA,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E85-A, no. 7, pp. 1710–1718, 2002.