Abstract

This paper proposes a bias compensation method for minimum statistics (MS) noise estimation that uses a nonlinear function and the a priori speech absence probability (SAP) for speech enhancement in highly nonstationary noisy environments. The MS method is a well-known technique for noise power estimation in nonstationary noisy environments; however, it tends to bias the noise estimate below the true noise level. The proposed method combines an adaptive parameter based on a sigmoid function with the a priori SAP to reduce residual noise. Additionally, our method uses an autocontrol parameter to manage the trade-off between speech distortion and residual noise. We evaluate the estimation of noise power in highly nonstationary and varying noise environments. The improvement is confirmed in terms of signal-to-noise ratio (SNR) and the Itakura-Saito Distortion Measure (ISDM).

1. Introduction

Noise estimation algorithms are essential components of speech enhancement in many modern mobile communication, speech recognition, and human-computer interaction systems [1, 2]. They are generally included as part of speech enhancement to improve the intelligibility or quality of a speech signal corrupted by noise. However, it is difficult to reduce noise without distorting speech, because the performance of any noise estimation algorithm usually depends on a trade-off between speech distortion and noise reduction.

Current single-microphone speech enhancement methods fall into two groups: time domain methods, such as the subspace method, and frequency domain methods, such as spectral subtraction (SS) [3] and the minimum mean square error (MMSE) estimator [4]. Both groups have their own advantages and drawbacks. Subspace methods provide a mechanism to control the trade-off between speech distortion and residual noise, but at the cost of a heavy computational load [5]. Frequency domain methods, on the other hand, usually consume fewer computational resources but do not have a theoretically established mechanism to control the trade-off between speech distortion and residual noise. Among them, SS is computationally efficient and has a simple mechanism to control this trade-off, but it suffers from a notorious artifact known as musical noise [6]. These spectral noise reduction algorithms require an estimate of the noise spectrum, which can be obtained from speech-absent frames indicated by a voice activity detector (VAD) or, alternatively, with the minimum statistics (MS) method [7], that is, by tracking spectral minima in each frequency band.

Several recent studies have proposed noise estimation schemes for unknown noise signals [114]. Among them, the minimum statistics (MS) algorithm proposed by Martin [7] works well in nonstationary noisy environments, and its ability to track varying noise levels is its most prominent feature. The noise estimate is obtained as the minimum of a smoothed power estimate of the noisy signal, multiplied by a factor that compensates for the bias. However, the MS algorithm still tends to bias the noise estimate below the true noise level, regardless of the number of frames [8]. Therefore, it leaves residual noise in speech-absent frames and in frames where the noise characteristics vary in highly nonstationary noisy environments.

To solve this problem, we propose a combined adaptive factor based on a sigmoid function and a priori speech absence probability (SAP) estimation [9] for biased compensation. Specifically, we apply the adaptive factor as a posteriori SNR. When the a posteriori SNR decreases, increases but is constrained to take a value between and . Thus, the proposed adaptive biased compensation factor approaches at times when the SNR is low. In addition, when the a priori SAP equals unity, the adaptive biased compensation factor also approaches in each frequency bin and vice versa. Furthermore, our method uses another adaptive parameter to control the trade-off between speech distortion and residual noise for suppressing the estimated noise in highly nonstationary and various noisy environments. The autocontrol parameter is controlled by a posteriori signal-to-noise ratio (SNR) as the variation of the noise level.

We evaluate the performance of the proposed algorithm in nonstationary and varying noise environments. The improvement is confirmed by the segmental SNR and the Itakura-Saito Distortion Measure (ISDM) [15]. The results show that the proposed method outperforms the conventional MS approach. The structure of the paper is as follows. Section 2 reviews the minimum statistics and the a priori SAP estimation algorithms. Section 3 presents the proposed noise estimation, which combines the sigmoid function driven by the a posteriori SNR with the a priori SAP estimation for robust bias compensation, and the noise suppression using a linear function. Section 4 discusses the experimental results, and Section 5 concludes the paper.

2. Minimum Statistics (MS) and Speech Absence Probability (SAP)

2.1. Review of MS

The noisy speech signal can be represented as $y(n) = x(n) + d(n)$, where $x(n)$ is the clean speech signal and $d(n)$ is the noise signal. Dividing the signal into overlapping frames using a window function and applying the short-time Fourier transform (STFT) [16] to each frame yield the time-frequency representation $Y(k,l) = X(k,l) + D(k,l)$, where $k$ is the frequency bin index and $l$ is the time frame index. It can be shown that
$$|Y(k,l)|^2 \approx |X(k,l)|^2 + |D(k,l)|^2, \tag{1}$$
where $|Y(k,l)|^2$, $|X(k,l)|^2$, and $|D(k,l)|^2$ are the power spectra of the noisy speech signal, clean speech, and noise, respectively.
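The framing and STFT step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the 256-sample Hamming window with 50% overlap and the 8 kHz sampling rate are taken from the experimental setup in Section 4, while the function name `stft_power` and the white-noise test signal are assumptions made for the example.

```python
import numpy as np

def stft_power(y, frame_len=256, hop=128):
    """Return the noisy power spectrum, shape (num_bins, num_frames).

    frame_len and hop follow the 256-sample, 50%-overlap Hamming
    framing reported in Section 4; the function name is hypothetical.
    """
    window = np.hamming(frame_len)
    num_frames = 1 + (len(y) - frame_len) // hop
    # Slice overlapping frames and apply the analysis window to each one.
    frames = np.stack(
        [y[i * hop : i * hop + frame_len] * window for i in range(num_frames)],
        axis=1,
    )
    # One-sided FFT along the sample axis gives frame_len // 2 + 1 bins.
    spectrum = np.fft.rfft(frames, axis=0)
    return np.abs(spectrum) ** 2

# Illustrative input: one second of white noise at 8 kHz (not real speech).
rng = np.random.default_rng(0)
noisy = rng.standard_normal(8000)
power = stft_power(noisy)
print(power.shape)  # (129, 61)
```

Each column of `power` is one frame and each row is one frequency bin, which is the layout assumed by the other sketches in this paper.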

The MS algorithm relies on the fact that the noisy power spectrum often becomes equal to the noise power spectrum during speech pauses [7, 13, 17]. Therefore, an estimate of the noise power spectrum is obtained by separately tracking the minimum of the noisy speech power in each frequency bin. In addition, because the minimum is biased towards lower values, an unbiased estimate may be obtained through multiplication by a bias factor, which is derived from the statistics of the local minimum. To search for the minimum, we take the first-order recursive average of the noisy power spectrum:
$$P(k,l) = \alpha P(k,l-1) + (1-\alpha)\,|Y(k,l)|^2, \tag{2}$$
where $P(k,l)$ is the smoothed periodogram and $\alpha$ is the smoothing factor. The smoothing factor used in (2) must be close to 1 to keep the variance of the minimum tracking as small as possible. Hence, time and frequency dependence are required to determine if speech is present or absent. The smoothing factor is therefore derived by minimizing the mean square error between $P(k,l)$ and $\sigma_D^2(k,l)$:
$$E\left\{\left(P(k,l) - \sigma_D^2(k,l)\right)^2 \,\middle|\, P(k,l-1)\right\}, \tag{3}$$
where $\sigma_D^2(k,l)$ is the noise variance and
$$P(k,l) = \alpha(k,l)\,P(k,l-1) + \left(1-\alpha(k,l)\right)|Y(k,l)|^2. \tag{4}$$
In (4), the time-frequency dependent smoothing factor $\alpha(k,l)$ is used instead of the fixed $\alpha$ defined in (2). Substituting (4) into (3) and setting the first derivative with respect to $\alpha(k,l)$ to 0, we find the optimum value for $\alpha(k,l)$:
$$\alpha_{\mathrm{opt}}(k,l) = \frac{1}{1 + \left(P(k,l-1)/\sigma_D^2(k,l) - 1\right)^2}. \tag{5}$$
According to (5), the smoothing factor can vary between 0 and 1, but such a smoothing factor is not practical [15]. The value of $\alpha_{\mathrm{opt}}(k,l)$ becomes progressively smaller for a large a posteriori SNR (speech present). However, smoothing is required even during periods of speech because the speech power spectrum also contains a percentage of noise. Hence, the smoothing factor has a floor of $\alpha_{\min}$ (0.3), which results in a maximum of only $1-\alpha_{\min}$ (70%) of the original spectrum remaining within any one frame. Conversely, when the a posteriori SNR is low (speech is absent), $\alpha_{\mathrm{opt}}(k,l)$ tends towards 1, which causes the smoothed output to lock onto the previous value. To eliminate this, (5) is multiplied by $\alpha_{\max}$. From (5), we note that $\alpha_{\mathrm{opt}}(k,l)$ depends on the true noise variance $\sigma_D^2(k,l)$, which is unknown. In practice, we can replace $\sigma_D^2(k,l)$ with the latest estimated value $\hat{\sigma}_D^2(k,l-1)$. In general, however, this lags the true noise variance, and hence the estimated smoothing factor may be too small or too large. Problems may arise when $\alpha_{\mathrm{opt}}(k,l)$ is close to 1 because $P(k,l)$ will not respond fast enough to changes in the noise. Thus, tracking errors were monitored in [7] by comparing the average short-term smoothed periodogram to the estimated noise variance. After including the correction factor [7], the final smoothing factor is also smoothed over time [7].
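The recursive smoothing with a time-frequency dependent factor can be illustrated with the short sketch below. It is a simplified reading of (4) and (5), under several assumptions: the noise variance is held fixed per bin instead of being replaced by the latest estimate, the tracking-error correction factor of [7] is omitted, and the value 0.96 for the upper limit is a commonly used setting that is not stated in this paper. The function and parameter names are hypothetical.

```python
import numpy as np

def smooth_periodogram(power, noise_var, alpha_min=0.3, alpha_max=0.96):
    """First-order recursive smoothing with an adaptive smoothing factor.

    power:     noisy power spectrum, shape (num_bins, num_frames)
    noise_var: noise variance per bin, shape (num_bins,); assumed known
               and fixed here, which a real tracker would not do
    alpha_max: assumed upper limit (not given in the text)
    """
    smoothed = np.empty_like(power, dtype=float)
    smoothed[:, 0] = power[:, 0]
    for l in range(1, power.shape[1]):
        # Smoothed a posteriori SNR of the previous frame.
        ratio = smoothed[:, l - 1] / noise_var
        # Optimal factor scaled by the upper limit, then floored.
        alpha = alpha_max / (1.0 + (ratio - 1.0) ** 2)
        alpha = np.maximum(alpha, alpha_min)
        smoothed[:, l] = alpha * smoothed[:, l - 1] + (1.0 - alpha) * power[:, l]
    return smoothed
```

When the previous smoothed value is far above the noise variance, the factor drops to its floor and the output follows the current frame closely; when the two are nearly equal, the factor stays near its upper limit and the output changes slowly.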

The estimated noise power based on the MS algorithm [7] is obtained by searching for a minimum within a finite window of length $M$ of the smoothed power estimates $P(k,l)$:
$$P_{\min}(k,l) = \min\left\{P(k,l),\, P(k,l-1),\, \ldots,\, P(k,l-M+1)\right\}. \tag{8}$$
Because the minimum power estimate obtained through the time-varying smoothing factor is smaller than the mean value, the MS algorithm requires a bias compensation for the unbiased noise power estimate as detailed in [7]:
$$\hat{\sigma}_D^2(k,l) = B_{\min}(k,l)\,P_{\min}(k,l), \tag{9}$$
where $\hat{\sigma}_D^2(k,l)$ is the unbiased noise power estimate. The quantity $B_{\min}(k,l)$ is the bias compensation factor.
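The minimum search and the bias compensation can be sketched as below. This is a deliberately simple version under stated assumptions: the bias compensation factor is treated as a single constant (`bias`), whereas in [7] it is derived from the statistics of the minimum, and the window length of 96 frames is an illustrative choice rather than a value from this paper.

```python
import numpy as np

def min_statistics_noise(smoothed, window_len=96, bias=1.5):
    """Track the minimum of the smoothed periodogram and compensate its bias.

    smoothed:   smoothed power estimates, shape (num_bins, num_frames)
    window_len: number of past frames searched (illustrative value)
    bias:       constant stand-in for the bias compensation factor
                (hypothetical; [7] computes it from the data)
    """
    num_frames = smoothed.shape[1]
    noise = np.empty_like(smoothed, dtype=float)
    for l in range(num_frames):
        # Search the current frame and up to window_len - 1 previous frames.
        start = max(0, l - window_len + 1)
        noise[:, l] = bias * smoothed[:, start : l + 1].min(axis=1)
    return noise
```

A brute-force search such as this costs one pass over the window per frame; practical implementations usually split the window into sub-windows to reduce that cost.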

2.2. Review of Speech Absence Probability

The two-state model of speech events can be represented as a binary hypothesis model [9, 15, 17]:where and represent the absence and presence of speech, in the th frequency bin of the th frame, respectively, and whereis the a priori probability that speech will be absent. An efficient estimator is derived for the a priori SAP using a soft-decision approach based on the estimated a priori SNR [9]. A recursive average of this can be defined aswhere is a time constant. The decision-directed method proposed by Ephraim and Malah [4] provides a useful estimation scheme for the a priori SNR: where is a smoothing factor, is a function that prevents negative values, and represents the a posteriori SNR [9]. The local and global averaging window are then applied to (13) [9], resulting in where the subscript may denote either “local” or “global” window and is a normalized window of size . We define two parameters and , which represent the relationship between the above averages and the likelihood of speech in the th frequency bin of the th frame. These parameters are given as [9]where and are empirical constants, maximized to attenuate noise while leaving weak speech components unaffected. The third parameter , which is required to attenuate more noise in speech-absent frames, is based on the speech energy in neighboring frames [9]:If thenif thenelseElse,

where is an average in the frequency domain, represents a soft transition from speech to noise, is a confined peak value of , and and are empirical constants that determine the delay of the transition, as defined in [9]. Finally, the a priori SAP can be defined as [9]Accordingly, is larger if either previous frames or recent neighboring frequency bins do not contain speech. Therefore, when SAP goes to , the speech presence probability goes to .

3. Noise Estimation and Suppression Using Linear and Nonlinear Function

3.1. Combining Adaptive Factor Based on Sigmoid Function and A Priori SAP

In this section, we propose a method that combines the adaptive factor based on the sigmoid function and the a priori SAP estimation [9] to achieve biased compensation.

First, we can detect the adaptive factor by requiring the smoothed power spectrum be equal to the updated noise power estimator during speech absence region. In particular, we can determine the adaptive factor by minimizing the mean squared error (MSE) between and as follows: where we assume that the updated noise power estimator during the speech absence region isSubstituting (18) into (17) then after taking the first derivative of the MSE with respect to and setting it equal to zero, we get the adaptive factor for : where is the unbiased noise power estimate in (9). We apply the adaptive factor based on the sigmoid function to the biased compensation factor of the MS algorithm according to the a posteriori SNR: where is derived from the slope factor and the empirical constant for . The a posteriori SNR is where is the Euclidean length of a vector. The adaptive factor is controlled by the a posteriori SNR. When the a posteriori SNR decreases, increases but is constrained to take a value between and . Thus, the proposed adaptive biased compensation factor approaches at times when the SNR is low. In addition, when the a priori SAP equals unity, the adaptive biased compensation factor is also equal to in each frequency bin and vice versa. The adaptive factor is shown to be a biased compensation in Figure 1. It shows, as suggested by (20) and (21), that as the a posteriori SNR increases, decreases but maintains a value between and . Thus, the adaptive factor approaches when the SNR is close to 20 dB. Simulation results show that an increase in the is good for noisy signals with a low SNR of less than 5 dB and that a decrease in is good for noisy signals with a relatively high SNR greater than 10 dB. We can thus control the trade-off between speech distortion and residual noise in the frame index using . In (22), let be the updated noise power estimate according to the combined a priori SAP and the adaptive factor: The term is the a priori SAP in (16). 
When becomes 1, the adaptive biased compensation factor is equal to . Therefore, the speech absence region is efficiently compensated by combining the a priori SAP and the adaptive factor in the th frequency bin of the th frame. As a result, the updated noise power estimator for the optimal smoothing factor of is deduced from (7) as
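The qualitative behaviour of a sigmoid-based adaptive factor, decreasing as the a posteriori SNR rises while staying between two bounds, can be illustrated with the sketch below. It is not the paper's formula: the bounds, slope, and midpoint are hypothetical placeholders, the a posteriori SNR is assumed to be expressed in dB, and the combination with the a priori SAP is not modelled.

```python
import numpy as np

def adaptive_factor(snr_db, lower=1.0, upper=2.0, slope=0.5, midpoint_db=10.0):
    """Bounded sigmoid that decreases with the a posteriori SNR (in dB).

    All four parameters are hypothetical illustration values, not the
    slope factor or empirical constant used in the paper.
    """
    snr_db = np.asarray(snr_db, dtype=float)
    # Logistic curve: near `upper` at low SNR, near `lower` at high SNR.
    return lower + (upper - lower) / (1.0 + np.exp(slope * (snr_db - midpoint_db)))

# Low SNR gives a larger compensation factor than high SNR.
print(adaptive_factor([0.0, 10.0, 20.0]))
```

With these placeholder settings the factor is strictly decreasing in the SNR and never leaves the open interval between `lower` and `upper`, which mirrors the constrained behaviour described in the text.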

3.2. Estimated Noise Suppression Using Linear Function

In this subsection, our method uses another adaptive parameter to control the trade-off between speech distortion and residual noise for suppressing the estimated noise in a highly nonstationary and varying noisy environment. The autocontrol parameter is controlled by a posteriori signal-to-noise ratio (SNR) as the variation of the noise level.

The estimated clean speech power spectrum can be represented as shown in (28). One haswhere is the oversubtraction factor, is the slope, and is the offset. The constants , and , respectively [3]. The adaptive linear factor affects the amount of speech distortion caused by the spectral subtraction in (28). The factor offers a large amount of flexibility to the modified spectral subtraction (MSS) scheme. The in (24) is the a posteriori SNR in frequency bin. The estimated clean speech signal can then be transformed back to the time domain by taking the inverse STFT and synthesizing using the overlap-add method.
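A spectral subtraction step with an SNR-dependent linear oversubtraction factor can be sketched as follows. The linear rule, its offset and slope, the SNR clipping range, and the spectral floor are assumptions in the style of classical oversubtraction schemes; the constants actually used in this paper are not reproduced here, and all names are hypothetical.

```python
import numpy as np

def oversubtraction_factor(snr_db, offset=4.0, slope=0.15, snr_range=(-5.0, 20.0)):
    """Linear oversubtraction factor that shrinks as the SNR grows.

    offset, slope, and snr_range are illustrative values only.
    """
    snr = np.clip(snr_db, snr_range[0], snr_range[1])
    return offset - slope * snr

def spectral_subtract(noisy_power, noise_power, snr_db, floor=0.01):
    """Subtract the scaled noise estimate and apply a spectral floor."""
    factor = oversubtraction_factor(snr_db)
    clean = noisy_power - factor * noise_power
    # The floor avoids negative power and limits isolated spectral peaks.
    return np.maximum(clean, floor * noisy_power)
```

A larger factor removes more noise at low SNR at the price of more speech distortion, while a factor near one at high SNR preserves speech, which is the trade-off the adaptive parameter is meant to steer.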

4. Experimental Results and Discussion

The noisy signals used in our evaluation were taken from the NOIZEUS database [15]. We used 30 test utterances, spoken by three male and three female speakers. The analyzed signal was sampled at 8 kHz and short-time Fourier-transformed using 50% overlapping Hamming windows of 256 samples. Figure 2 shows how both the MS [7] and the proposed methods track the minimum of the noisy speech to update the noise estimate. The MS estimate is obtained by tracking the minimum of the noisy power spectrum over a specified number of frames. Thus, the MS noise estimate tends to be biased below the true noise level, regardless of the number of frames. Our proposed method efficiently compensates the speech absence region by combining the adaptive bias compensation factor and the a priori SAP. This implies that the proposed method is more accurate than the conventional one and could improve residual noise reduction.

Figure 3 shows the clear superiority of the proposed method in highly nonstationary noisy environments. The conventional method [7] does not work well from the initial frame to frame 20 of the car noise or from frame 110 to frame 130, and it also suffers from residual noise. The difference is visible in the region marked by the red circle in Figure 3. In particular, the figure demonstrates the robustness of the proposed method to variations in the noisy environment. Thus, compared with the conventional method, we can estimate the noise level more accurately and reduce residual noise in highly nonstationary noisy environments.

The spectrum of the clean signal is given in Figure 4(a), and the spectrum of the noisy speech signal for speech enhancement using the MS plus spectral subtraction (SS) (MS + SS) [3, 7] method is given in Figure 4(b). We can also observe the minimum controlled recursive averaging (MCRA) with SS in Figure 4(c). There is residual noise in Figure 4(c) from and at , partly because of the inability of the noise estimation algorithm to bias below the true noise level. The spectrogram of the proposed methods for noise reduction is shown in Figure 4(d). In contrast, panel Figure 4(d) shows that the residual noise is more clearly reduced than the conventional methods.

Tables 1 and 2 summarize the averaged results of the segmental SNR and the Itakura-Saito Distortion Measure (ISDM) [15]. The segmental SNR can be evaluated in either the time or frequency domain. The time domain measure is perhaps one of the simplest objective measures used to evaluate speech enhancement method. For this measure to be meaningful it is important that the original and processed signals be aligned in time and that any phase error present be corrected [15]. For various noise types with an input SNR ranging from 0 to 15 dB, the segmental SNR after processing was clearly better for the proposed method compared to conventional ones [7], except for the case of (highlighted in bold). We can also confirm that our methods work well to control the trade-off between speech distortion and residual noise for suppressing the estimated noise in highly nonstationary and various noisy environments.
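A time-domain segmental SNR can be computed along the lines of the sketch below. It assumes the clean and processed signals are already time-aligned, as required above. The frame length and hop reuse the analysis settings of this section, and the clipping of per-frame values to the range from -10 dB to 35 dB is a common convention assumed here, not a setting stated in the paper.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, hop=128, limits=(-10.0, 35.0)):
    """Average per-frame SNR in dB between time-aligned signals.

    limits: assumed clipping range for per-frame values, used to keep
            silent or perfectly matched frames from dominating the mean
    """
    eps = np.finfo(float).eps
    values = []
    for start in range(0, len(clean) - frame_len + 1, hop):
        x = clean[start : start + frame_len]
        err = x - enhanced[start : start + frame_len]
        snr = 10.0 * np.log10((np.sum(x ** 2) + eps) / (np.sum(err ** 2) + eps))
        values.append(np.clip(snr, limits[0], limits[1]))
    return float(np.mean(values))
```

For example, an output equal to 0.9 times the clean signal leaves an error of one tenth of the signal amplitude in every frame, which corresponds to a segmental SNR of 20 dB.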

The ISDM has been shown to correlate well with subjective intelligibility measures, specifically the diagnostic acceptability measure (DAM). It therefore provides a meaningful objective test that reflects both distortion and noise reduction [15]. Here, we can confirm that the proposed method yields better ISDM results than the conventional methods, except for the MS method with SS on the street noise signal at 10 dB.
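For reference, the Itakura-Saito distortion between two power spectra can be written as a short function. This sketch uses the spectral form of the measure applied directly to power spectra, which is an assumption: the evaluation in [15] may rely on an LPC-based variant, and the small constant `eps` is added only to avoid division by zero.

```python
import numpy as np

def itakura_saito(clean_power, enhanced_power, eps=1e-12):
    """Mean Itakura-Saito distortion between two power spectra.

    Uses the spectral form ratio - log(ratio) - 1, averaged over all
    bins and frames; eps is a hypothetical numerical guard.
    """
    ratio = (np.asarray(clean_power) + eps) / (np.asarray(enhanced_power) + eps)
    return float(np.mean(ratio - np.log(ratio) - 1.0))
```

The measure is zero when the two spectra are identical and positive otherwise, and it is not symmetric in its two arguments, so the order of the clean and enhanced spectra matters.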

5. Conclusion

We presented a modified noise estimation and suppression algorithm that combines a nonlinear function and a priori SAP estimation for bias compensation. Moreover, our method uses another adaptive parameter to control the trade-off between speech distortion and residual noise when suppressing the estimated noise in highly nonstationary and varying noisy environments. The performance of the new algorithm was evaluated by measuring the segmental SNR and the ISDM. We showed that the proposed algorithm was generally superior to conventional methods, reducing both residual noise and speech distortion in nonstationary noisy environments. In the future, we plan to evaluate its possible application as a preprocessing step in signal processing applications.

Competing Interests

The authors declare no competing interests.

Acknowledgments

This research was supported by NRF (2013R1A1A2012536).