Abstract

This paper presents a novel data-adaptive thresholding approach to single-channel speech enhancement. The noisy speech signal and fractional Gaussian noise (fGn) are combined to produce a complex signal. The fGn is generated using a noise variance roughly estimated from the noisy speech signal. Bivariate empirical mode decomposition (bEMD) is employed to decompose the complex signal into a finite number of complex-valued intrinsic mode functions (IMFs). The real and imaginary parts of the IMFs represent the IMFs of the observed speech and the fGn, respectively. Each IMF is divided into short time frames for local processing. The variance of the fGn IMF calculated within a frame is used as the reference to classify the corresponding noisy speech frame as noise dominant or signal dominant. Only the noise dominant frames are soft-thresholded to reduce the noise effects. Then, all the frames and all the speech IMFs are combined, yielding the enhanced speech signal. The experimental results show improved performance of the proposed algorithm compared with recently reported methods.

1. Introduction

Research on speech enhancement is motivated by the rapidly growing market for speech communication applications, such as teleconferencing, hands-free telephony, hearing aids, and speech recognition. In hands-free communication systems, the microphones are typically placed at some distance from the speaker. In an adverse acoustic environment, various noise sources corrupt the speech signal. Although the human auditory system is remarkably robust in most adverse situations, noise heavily affects the performance of automatic speech recognition (ASR) systems. The performance of an ASR system trained in one specific environment drops considerably when it is used in another acoustic environment [1].

Several approaches have already been proposed for speech enhancement. Although microphone-array-based approaches give better results, the speech processing research community is also trying to reduce the number of microphones (channels). Spectral subtraction is one of the earliest methods for reducing noise effects in observed speech signals. In this method, noise reduction is achieved by appropriately adjusting the set of spectral magnitudes [2]. Its basic requirement is the noise spectrum, which is determined from the nonspeech segments [3]. In such single-channel speech enhancement systems, residual noise is a common issue. It decreases speech intelligibility, and hence further processing is required to reduce it. The subband approach to single-channel speech enhancement is another potential method. The Fourier transform and the wavelet transform are the dominant methods in subband-based speech enhancement techniques. However, the Fourier transform is not suitable for analyzing nonstationary signals such as speech. There are several approaches to using the wavelet transform in subband decomposition. The decomposition results vary with the parameters, for example, the basis wavelet and the number of decomposition levels. Moreover, the selection of parameters also depends on the data being analyzed. Therefore, a data-adaptive tool for analyzing nonstationary and nonlinear signals is highly desirable [4].

In the previous study [5], an empirical mode decomposition (EMD)-based data-adaptive thresholding algorithm was introduced. EMD was developed by Huang et al. [6] to decompose any nonstationary signal into a finite set of bases called intrinsic mode functions (IMFs). In [5], the variance of each IMF, instead of that of the speech signal, is used to determine the adaptive threshold, and hence better performance is achieved. Its main drawback is the need to find the speechless part in order to determine the noise variance. The performance of this method depends on the efficiency of voice activity detection (VAD), and hence it is not convenient for practical applications. In this study, bivariate EMD (bEMD), the generalized extension of traditional EMD [7, 8], is employed to resolve this problem. Fractional Gaussian noise (fGn) has interesting characteristics under EMD [9, 10]. EMD acts on fGn as a dyadic filter bank [10]. The energies of the IMFs decrease almost linearly with increasing order, which implies that the higher frequency IMFs contain more energy than the lower frequency ones [11, 12]. The speech signal under analysis and the fGn are decomposed together with bEMD. The noise variance is determined from the individual IMFs of the fGn, which is used here as the reference signal.

The bEMD is applied to the complex signal composed of the noisy speech and the fGn as its real and imaginary components, respectively. Each of the obtained IMFs is divided into frames. The energy of each speech frame is compared with that of the corresponding fGn frame to classify the frame as noise dominant or speech dominant. Note that the real and imaginary parts of any complex-valued IMF correspond to the IMFs of the speech and the noise, respectively. Soft-thresholding is applied only to the noise dominant speech frames, with an optimal adaptive threshold [5]. Then, the real components (i.e., the IMFs of speech) of the processed IMFs are summed to reconstruct the enhanced speech.

This paper is organized as follows: the application of bEMD to speech and noise signals is described in Section 2, the noise variance estimation process is explained in Section 3, the proposed speech enhancement method using bEMD is described in Section 4, experimental results are illustrated in Section 5, and finally, Section 6 contains some concluding remarks.

2. BEMD of Speech and Reference Signals

The univariate EMD (uEMD) decomposes any signal into a finite set of basis waveforms modulated in both amplitude and frequency (AM-FM). The underlying idea is that a signal consists of fast oscillations superimposed on slow oscillations. The EMD is designed to define a local low frequency component as the local trend, supporting a local high frequency component as a zero-mean oscillation. The principle of the uEMD technique is to decompose a signal into a sum of band-limited functions termed IMFs. Each IMF satisfies two basic conditions: (i) in the whole data set, the number of extrema and the number of zero crossings must be the same or differ at most by one; (ii) at any point, the mean value of the envelope defined by the local maxima and the envelope defined by the local minima is zero.
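Condition (i) above can be checked numerically. The following is a minimal sketch, not the authors' implementation; the function names and the two test signals are illustrative assumptions. It counts strict local extrema and sign changes and compares the two counts.

```python
import math

def count_extrema(x):
    """Count strict interior local maxima and minima of a sequence."""
    n = 0
    for i in range(1, len(x) - 1):
        is_max = x[i] > x[i - 1] and x[i] > x[i + 1]
        is_min = x[i] < x[i - 1] and x[i] < x[i + 1]
        if is_max or is_min:
            n += 1
    return n

def count_zero_crossings(x):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(x, x[1:]) if a * b < 0)

def satisfies_count_condition(x):
    """IMF condition (i): extrema and zero-crossing counts differ by at most one."""
    return abs(count_extrema(x) - count_zero_crossings(x)) <= 1

# Hypothetical test signals: a zero-mean tone and the same tone on a DC offset.
tone = [math.sin(2 * math.pi * 5 * n / 200 + 0.1) for n in range(200)]
riding = [v + 1.5 for v in tone]

print(satisfies_count_condition(tone))    # zero-mean oscillation
print(satisfies_count_condition(riding))  # oscillation that never crosses zero
```

The offset tone has the same extrema as the zero-mean tone but no zero crossings, so it fails the count condition, which is the situation the sifting process is meant to remove.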

The first condition is similar to the narrowband requirement for a stationary Gaussian process. The second condition is a local requirement induced from the global one; it is necessary to ensure that the instantaneous frequency does not have redundant fluctuations induced by asymmetric waveforms. There exist many approaches to computing the uEMD [6, 9, 10].

In order to handle bivariate time series, the uEMD is extended to a complex-valued EMD called bivariate EMD [7]. The main difference between the bEMD and the complex-EMD [8] is that the latter uses uEMD to decompose the real and imaginary parts of complex signals, whereas the bEMD adapts the rationale underlying the EMD to a bivariate framework. In bEMD, two variables are decomposed simultaneously (without losing their mutual dependency) based on their rotating properties. The bEMD algorithm [7] is summarized as follows:
(1) Let the complex signal be $x(t)$ with time index $t$.
(2) For each direction $\varphi_n = 2\pi n/N$, $1 \le n \le N$:
(a) project the complex signal onto direction $\varphi_n$ using the unit complex number $e^{-j\varphi_n}$, that is, $p_{\varphi_n}(t) = \operatorname{Re}\{e^{-j\varphi_n} x(t)\}$, where $\operatorname{Re}\{\cdot\}$ denotes the real part of a complex signal;
(b) find the locations $\{t_i^n\}$ corresponding to the maxima of $p_{\varphi_n}(t)$;
(c) interpolate between the maxima points $(t_i^n, x(t_i^n))$ to obtain the partial envelope curve in direction $\varphi_n$, named $e_{\varphi_n}(t)$.
(3) Compute the mean of all the envelope curves: $m(t) = \frac{1}{N}\sum_{n=1}^{N} e_{\varphi_n}(t)$.
(4) Subtract the mean from the input signal to obtain $d(t) = x(t) - m(t)$.
(5) Test whether $d(t)$ is an IMF:
(a) if yes, repeat the procedure from Step (2) on the residual signal;
(b) if no, replace $x(t)$ with $d(t)$ and repeat the procedure from Step (2).
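One pass through Steps (2) to (4) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of [7]: the function names are hypothetical, piecewise-linear interpolation stands in for the interpolation used in practice, the end values are simply held constant, and the IMF test of Step (5) is omitted.

```python
import cmath
import math

def local_maxima(p):
    """Indices of interior local maxima of a real sequence."""
    return [i for i in range(1, len(p) - 1) if p[i] > p[i - 1] and p[i] >= p[i + 1]]

def interpolate(knots, values, n):
    """Piecewise-linear interpolation of complex `values` given at integer
    `knots` onto samples 0..n-1; end values are held outside the knot range.
    (Assumption: a simple stand-in for smoother interpolation.)"""
    out, j = [], 0
    for t in range(n):
        if t <= knots[0]:
            out.append(values[0])
        elif t >= knots[-1]:
            out.append(values[-1])
        else:
            while knots[j + 1] < t:
                j += 1
            a = (t - knots[j]) / (knots[j + 1] - knots[j])
            out.append((1 - a) * values[j] + a * values[j + 1])
    return out

def mean_envelope(z, n_dirs=8):
    """Steps (2)-(3): average the partial envelopes over n_dirs directions."""
    n = len(z)
    total, used = [0j] * n, 0
    for k in range(1, n_dirs + 1):
        phi = 2 * math.pi * k / n_dirs
        rot = cmath.exp(-1j * phi)
        proj = [(rot * v).real for v in z]           # projection on direction phi
        knots = local_maxima(proj)                   # maxima of the projection
        if len(knots) < 2:
            continue                                 # too few maxima to interpolate
        env = interpolate(knots, [z[i] for i in knots], n)
        total = [a + b for a, b in zip(total, env)]
        used += 1
    return [v / used for v in total] if used else [0j] * n

def sift_once(z, n_dirs=8):
    """Step (4): subtract the mean envelope from the input signal."""
    m = mean_envelope(z, n_dirs)
    return [a - b for a, b in zip(z, m)]

# Hypothetical example: a rotating component riding on a constant complex offset.
offset = 2 + 1j
z = [offset + cmath.exp(1j * math.pi * t / 20) for t in range(400)]
m = mean_envelope(z)
d = sift_once(z)
```

In this example the partial envelope in each direction sits on the rotating circle at that direction, so their mean recovers the slowly varying part (here the constant offset) and the sifted signal keeps the rotation.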

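The bEMD of a signal $x(t)$ can be expressed as $x(t) = \sum_{k=1}^{K} c_k(t) + r_K(t)$, where $K$ is the total number of IMFs, $c_k(t)$ is the kth IMF, and $r_K(t)$ is the final residue. In bivariate EMD, the input signal is represented as $x(t) = y(t) + j\,w(t)$ [13], where $j = \sqrt{-1}$, $w(t)$ is the fGn, and $y(t)$ is the observed noisy speech. The additive noise $v(t)$ is combined with the clean speech $s(t)$, yielding $y(t)$ defined as $y(t) = s(t) + v(t)$. Note that the real and imaginary parts of any IMF $c_k(t)$ are denoted by $c_k^{\mathrm{re}}(t)$ and $c_k^{\mathrm{im}}(t)$, representing the IMFs of the speech and the fGn, respectively.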

Fractional Gaussian noise (fGn) is a generalization of ordinary white noise. It is a versatile model of homogeneously spreading broadband noise without any dominant frequency band. The statistical properties of fGn are entirely determined by its second-order structure, which depends solely on one single scalar parameter, the Hurst exponent ($H$). In discrete time, the fGn corresponds to a time series indexed by a real-valued parameter $0 < H < 1$. The energies of the subbands (IMFs) of fGn obtained by uEMD behave almost linearly as a function of the IMF index [10].
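The dependence of fGn on the Hurst exponent alone can be illustrated with its standard autocovariance, $\gamma(k) = \frac{\sigma^2}{2}\left(|k+1|^{2H} - 2|k|^{2H} + |k-1|^{2H}\right)$. The sketch below is not taken from the paper: the function names are hypothetical, and the Cholesky-based synthesis is chosen only because it is short and exact, which limits it to short sequences.

```python
import math
import random

def fgn_autocov(lag, hurst, sigma2=1.0):
    """Autocovariance of fGn at integer `lag` for Hurst exponent `hurst`."""
    k = abs(lag)
    h2 = 2.0 * hurst
    return 0.5 * sigma2 * ((k + 1) ** h2 - 2.0 * k ** h2 + abs(k - 1) ** h2)

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][m] * low[j][m] for m in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def generate_fgn(n, hurst, sigma2=1.0, seed=0):
    """Draw n samples of fGn by colouring white Gaussian noise (O(n^3) cost)."""
    rng = random.Random(seed)
    cov = [[fgn_autocov(i - j, hurst, sigma2) for j in range(n)] for i in range(n)]
    low = cholesky(cov)
    white = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(low[i][j] * white[j] for j in range(i + 1)) for i in range(n)]

# H = 0.5 reduces to white noise: the autocovariance vanishes at nonzero lags.
print([round(fgn_autocov(k, 0.5), 12) for k in range(4)])
reference = generate_fgn(64, 0.8)
```

For $H > 0.5$ adjacent samples are positively correlated and for $H < 0.5$ negatively correlated; faster synthesis schemes exist for long signals but are outside the scope of this sketch.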

The decomposition produces two separate sets of IMFs corresponding to the individual signals, as shown in Figure 1. The Fourier spectrum of the IMFs of the fGn, illustrated in Figure 2(a), exhibits the property of dyadic filter banks [9, 10]; that is, the center frequency of any IMF is half of that of the IMF extracted just before it. The energies of the IMFs decrease as their order increases, as shown in Figure 2(b). This implies that the lower order (higher frequency) IMFs contribute higher energies in the fGn. This agrees with the assumption that the speech signal contains more noise at higher frequencies, and hence it is justified to use the fGn as the reference signal to determine the noise variance. Moreover, more IMFs are generated when the fGn is combined with the speech for bEMD. The amplitude of the fGn is adjusted according to the noise variance of the observed speech signal.

3. Estimation of Noise Variance

Effective estimation of the noise variance of the noisy speech plays a vital role in the performance of the proposed bEMD-based speech enhancement algorithm. The speech signal is considered to be a smoothly varying signal with additive zero-mean Gaussian noise, the two being uncorrelated with each other. The signal can be transformed through a low-order polynomial of a given degree. The purpose of the transformation is to find the variance of the random variable that results from propagating the noisy speech signal through the nonlinear polynomial function. The Taylor series expansion is used to obtain an exact expression for the variance of the distribution of this random variable [14]. The low-order polynomial terms are suppressed using a filter based on a finite difference expression. The initial random variable can therefore be written as the sum of its expectation value and a zero-mean random variable. The polynomial is then expanded in a Taylor series about the expectation value, and the first-order moment is determined in terms of the nth-order moments of the zero-mean random variable. The variance of a random variable is its second central moment, the expected value of the squared deviation from the mean [15]. Applying the Taylor series expansion again gives the expression for the noise variance. The overall noise variance obtained in this way is used to adjust the energy of the fGn prior to the decomposition. The resulting fGn is then combined with the noisy speech to form the complex signal to be decomposed by bEMD.
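The role of the finite difference filter and the subsequent scaling of the reference can be illustrated with a simple stand-in. The sketch below is not the estimator derived in this section: it assumes white additive noise and uses the second-order difference $y(t-1) - 2y(t) + y(t+1)$, which cancels constant and linear trends and whose variance for white noise of variance $\sigma^2$ is $(1 + 4 + 1)\sigma^2 = 6\sigma^2$. All function names and the test signal are hypothetical.

```python
import math
import random

def estimate_noise_variance(y):
    """Rough white-noise variance estimate from second-order differences.
    Assumes the underlying signal is smooth compared with the noise."""
    d = [y[i - 1] - 2.0 * y[i] + y[i + 1] for i in range(1, len(y) - 1)]
    return sum(v * v for v in d) / (6.0 * len(d))

def scale_to_variance(w, target_var):
    """Remove the mean of the reference and rescale it to the target variance."""
    mean = sum(w) / len(w)
    var = sum((v - mean) ** 2 for v in w) / len(w)
    gain = math.sqrt(target_var / var)
    return [gain * (v - mean) for v in w]

def make_complex(y, w):
    """Noisy speech as the real part, reference noise as the imaginary part."""
    return [complex(a, b) for a, b in zip(y, w)]

# Hypothetical data: a slow sinusoid plus white Gaussian noise of variance 0.01.
rng = random.Random(1)
n = 4000
smooth = [0.5 * math.sin(2 * math.pi * t / 1000) for t in range(n)]
noisy = [s + rng.gauss(0.0, 0.1) for s in smooth]

sigma2_hat = estimate_noise_variance(noisy)
# White Gaussian noise is used as the reference here (fGn with H = 0.5).
reference = scale_to_variance([rng.gauss(0.0, 1.0) for _ in range(n)], sigma2_hat)
z = make_complex(noisy, reference)
```

The estimate degrades when the signal itself varies quickly from sample to sample, which is why it is only described as a rough estimate used to set the level of the reference.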

4. Speech Enhancement Method

On a frame-by-frame basis, the variance of the fGn IMF (the imaginary part of each complex IMF) is compared with the variance of the corresponding speech IMF (the real part) to classify each frame as signal dominant or noise dominant. The variance of the fGn is used here as the reference noise variance. Thresholding is applied only to the noise dominant frames. This process overcomes the limitation of computing the noise variance from the silent part of the signal, as required by the traditional EMD-based approach [16].

4.1. Frame Classification

A threshold is required to classify the frames as signal dominant or noise dominant. The boundary is set at the case where the noise and speech variances within a frame are equal. If the speech and the noise are independent, the covariance between the two is zero [4]. Hence, for any frame, we can write $\sigma_y^2 = \sigma_s^2 + \sigma_v^2$, where $\sigma_y^2$ indicates the variance of a frame of the noisy speech. For equal noise and signal power, we get $\sigma_y^2 = 2\sigma_v^2$, where $\sigma_s^2$ indicates the frame variance of the clean speech and $\sigma_v^2$ indicates the frame variance of the noise [4]. Therefore, for equal noise and signal power and under the independence assumption, the variance of a frame equals twice the noise variance. The variances as a function of frame index for different IMFs are compared in Figure 3. The variance of a signal dominant frame is always higher than that of a noise dominant frame. Each IMF is divided into 4 ms frames for local processing in the time domain. No frequency analysis is performed on these short frames, so end effects do not arise. Each frame of the real IMF (speech) is processed using the reference noise variance estimated from the corresponding frame of the imaginary IMF (fGn). Soft-thresholding is applied to each noise dominant frame of the real IMF. The classification condition (9) for the rth frame of the kth real IMF is defined as $\sigma_{k,r}^2 > 2P_{k,r}$, where $\sigma_{k,r}^2$ is the variance of the rth frame of the kth real IMF and $P_{k,r}$ is the average power of the rth frame, of length $L$, of the kth imaginary IMF. It is calculated as $P_{k,r} = \frac{1}{L}\sum_{q=1}^{L}\left[c_{k,r}^{\mathrm{im}}(q)\right]^2$, where $c_{k,r}^{\mathrm{im}}(q)$ denotes the qth sample of that frame. The proposed adaptive thresholding technique provides an effective boundary for the frame classification. If the condition given in (9) is satisfied, the frame is classified as speech dominant; otherwise, it is classified as noise dominant. Note that each IMF is a signal with zero local mean, and hence its average energy corresponds directly to its variance. The comparison used in (9) is therefore reasonable.
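The frame classification can be sketched as follows, assuming the boundary described above (speech-frame variance greater than twice the average power of the reference frame). The function names, the non-overlapping framing, and the toy data are illustrative assumptions; the default frame length of 64 samples corresponds to 4 ms at 16 kHz.

```python
def split_frames(x, length):
    """Non-overlapping frames of `length` samples; a trailing remainder is dropped."""
    return [x[i:i + length] for i in range(0, len(x) - length + 1, length)]

def variance(frame):
    mean = sum(frame) / len(frame)
    return sum((v - mean) ** 2 for v in frame) / len(frame)

def average_power(frame):
    return sum(v * v for v in frame) / len(frame)

def classify_frames(imf, frame_len=64):
    """Label each frame of a complex IMF as 'speech' or 'noise' dominant.
    Real part: IMF of the noisy speech. Imaginary part: IMF of the reference."""
    labels = []
    for frame in split_frames(imf, frame_len):
        speech_var = variance([v.real for v in frame])
        noise_power = average_power([v.imag for v in frame])
        labels.append("speech" if speech_var > 2.0 * noise_power else "noise")
    return labels

# Hypothetical complex IMF: a strong first frame followed by a weak second frame,
# with a reference of constant power 0.04 in the imaginary part.
strong = [complex((-1) ** q * 1.0, (-1) ** q * 0.2) for q in range(64)]
weak = [complex((-1) ** q * 0.1, (-1) ** q * 0.2) for q in range(64)]
labels = classify_frames(strong + weak)
```

Because the reference has the same power in both frames, the decision depends only on how far the speech-frame variance rises above twice that level.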

4.2. Adaptive Soft-Thresholding

It is more efficient to reduce the noise components of each IMF adaptively by using a frame-based soft-thresholding strategy [16]. The soft-thresholding strategy proposed in [17] is a powerful speech enhancement technique for a wide range of input SNRs. It thresholds only the noise dominant frames and leaves the signal dominant frames unchanged. Soft-thresholding is carried out adaptively on each noise dominant frame of each real IMF. After proper suppression of the noise by soft-thresholding, all the real IMFs are summed to obtain the enhanced speech signal. For any given rth frame of a real IMF, each thresholded coefficient is calculated from the qth sample of the frame, the noise level of the frame, and an adaptive threshold function. The adaptive threshold function is the product of the threshold factor and the sorted index of the sample. The threshold factor is varied adaptively for each individual IMF based on its variance. An estimate of the threshold factor is obtained from the noise variance of the IMF, the adaptation factor, and the frame length (in samples). The optimum adaptation factor is determined experimentally, as in [5], by fitting to nine data points, each pairing an input SNR with the optimum value of the adaptation factor (the value that gives the maximum output SNR) for the training speech data.
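The selective thresholding step can be illustrated with the standard soft-thresholding operator, $\hat{x} = \operatorname{sign}(x)\max(|x| - T, 0)$. This is a simplified sketch: a single fixed threshold per frame is assumed here in place of the sorted-index adaptive threshold function described above, and the function names and values are hypothetical.

```python
import math

def soft_threshold(frame, threshold):
    """Shrink every sample toward zero by `threshold`; samples whose
    magnitude does not exceed the threshold become zero."""
    return [math.copysign(max(abs(v) - threshold, 0.0), v) for v in frame]

def threshold_noise_frames(frames, labels, thresholds):
    """Soft-threshold only the frames labelled 'noise'; copy the others as-is."""
    out = []
    for frame, label, thr in zip(frames, labels, thresholds):
        if label == "noise":
            out.append(soft_threshold(frame, thr))
        else:
            out.append(list(frame))
    return out

# Hypothetical frames: the same samples, once labelled speech and once noise.
frames = [[0.5, -0.2, 0.05, -1.0], [0.5, -0.2, 0.05, -1.0]]
processed = threshold_noise_frames(frames, ["speech", "noise"], [0.1, 0.1])
```

Leaving the signal dominant frames untouched is what limits speech distortion; only the frames judged to be noise dominant are attenuated.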

4.3. Proposed Algorithm

The proposed bEMD-based speech enhancement algorithm can be summarized as follows.
(i) The overall noise variance is approximately estimated from the observed speech signal by (6), and this variance is used to adjust the amplitude of the fGn.
(ii) The noisy speech signal and the fGn are combined to produce the complex signal.
(iii) bEMD is used to decompose the complex signal into complex-valued IMFs, in which the real and imaginary parts correspond to the IMFs of the speech and the fGn, respectively.
(iv) Each IMF is divided into 4 ms frames to perform very local soft-thresholding. The frame variance of the fGn IMF and the energy of the corresponding speech frame are computed.
(v) The frame variance of the fGn is used as the data-adaptive reference energy for binary classification of the corresponding speech frame as noise dominant or signal dominant.
(vi) Only the noise dominant frames are processed using data-adaptive soft-thresholding. The optimum adaptation factor is computed using (13). The signal dominant frames are left untouched.
(vii) All the processed IMFs of the speech signal are summed to obtain the enhanced speech.

5. Experimental Results and Discussions

The effectiveness of the proposed algorithm is tested by computer simulation with 10 male and 10 female utterances (English sentences) randomly selected from the TIMIT database. The sampling frequency of all the speech signals is 16 kHz. White noise is added to the clean speech to obtain noisy speech signals at different noise levels, and the simulation is performed on those noisy speech signals. The denoising results of the proposed method are illustrated in Figure 4. The waveforms and spectrograms of the clean speech and of the noisy speech at 10 dB SNR are shown in Figures 4(a) and 4(b). The outputs of the uEMD-based approach [5] and of the proposed bEMD-based approach are presented in Figures 4(c) and 4(d), respectively. Figure 4(c) shows that, with the uEMD, a small amount of noise remains in the enhanced speech in the low frequency regions. Figure 4(d) shows that the bEMD removes a considerable amount of noise, indicating better performance.
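The preparation of noisy test signals at a prescribed input SNR, and the overall SNR used to score the output, can be sketched as follows. The function names and the synthetic tone standing in for a TIMIT utterance are illustrative assumptions, not part of the experimental setup.

```python
import math
import random

def add_white_noise(clean, snr_db, seed=0):
    """Add white Gaussian noise scaled so that the input SNR equals snr_db."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    p_signal = sum(v * v for v in clean) / len(clean)
    p_noise = sum(v * v for v in noise) / len(noise)
    gain = math.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(clean, noise)]

def overall_snr_db(clean, test):
    """Overall SNR in dB of `test` measured against the clean reference."""
    num = sum(v * v for v in clean)
    den = sum((a - b) ** 2 for a, b in zip(clean, test))
    return 10.0 * math.log10(num / den)

# Hypothetical stand-in for an utterance: a 200 Hz tone sampled at 16 kHz.
fs = 16000
clean = [math.sin(2 * math.pi * 200 * t / fs) for t in range(fs // 10)]
noisy = add_white_noise(clean, snr_db=10.0)
print(round(overall_snr_db(clean, noisy), 6))
```

Scaling the noise by its measured power, rather than its nominal variance, makes the realized input SNR match the target for each individual utterance.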

The overall output SNR (for white noise) of the proposed method is compared in Table 1 with three algorithms: spectral weighting-based speech enhancement (SWm) [3], the hard and soft-thresholding (HST) technique [17], and univariate EMD-based soft-thresholding (uEMD) [5]. Both the SWm [3] and HST [17] algorithms suffer from speech degradation at higher SNRs (above 20 dB). We conclude from Table 1 and Figure 4 that the proposed bEMD yields a high speech enhancement score and clear sound without loss of speech content. The principal limitation of uEMD is the need to determine the speechless part in order to estimate the noise variance, which is inefficient for continuous speech processing.

Although the overall SNR is a good measure for quantifying performance, it has little perceptual meaning. A better measure is the segmental SNR (segSNR), calculated within frames of short duration. In this experiment, a frame length of 20 ms is used with a 13.75 ms overlap between adjacent frames. Figure 5 compares the input and output segSNR for white noise as a function of frame index, obtained by the uEMD-based [5] and the proposed bEMD-based methods. Figure 5 shows that the segmental output SNR is higher than the input SNR over all the frames of the 0 dB and 5 dB noisy speech. Hence, the noise dominant frames are classified properly by the proposed method, and the noise is removed from those frames. Note that the classification of frames using the IMFs of the fGn is performed effectively over the whole speech signal. Since the segmental SNR correlates highly with subjective results, the proposed bEMD-based adaptive thresholding algorithm works well in this respect.
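The segmental SNR with the framing stated above (20 ms frames with 13.75 ms overlap, that is, 320-sample frames with a 100-sample hop at 16 kHz) can be sketched as follows. The function name, the handling of empty frames, and the toy signals are illustrative assumptions.

```python
import math

def segmental_snr(clean, test, fs=16000, frame_ms=20.0, overlap_ms=13.75):
    """Per-frame SNR in dB over overlapping frames.
    Frames with zero signal or zero error energy are skipped."""
    frame_len = int(round(fs * frame_ms / 1000.0))
    hop = frame_len - int(round(fs * overlap_ms / 1000.0))
    values = []
    for start in range(0, len(clean) - frame_len + 1, hop):
        ref = clean[start:start + frame_len]
        err = [a - b for a, b in zip(ref, test[start:start + frame_len])]
        num = sum(v * v for v in ref)
        den = sum(v * v for v in err)
        if num > 0.0 and den > 0.0:
            values.append(10.0 * math.log10(num / den))
    return values

# Hypothetical check: a tone and a copy attenuated to 90% of its amplitude.
# The error is then 10% of the signal in every frame.
fs = 16000
clean = [math.sin(2 * math.pi * 200 * t / fs) for t in range(1600)]
test_signal = [0.9 * v for v in clean]
seg = segmental_snr(clean, test_signal)
average_seg_snr = sum(seg) / len(seg)
```

Returning the per-frame values keeps the frame-index view used in Figure 5 available; averaging them gives a single segSNR figure.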

The experiment is also carried out on these noisy signals to evaluate the algorithm in terms of the perceptual evaluation of speech quality (PESQ) [18]. Figure 6 shows the speech enhancement performance of the proposed method and compares it with that of uEMD in terms of PESQ. The values 4 and 0 of the PESQ measurement represent the highest and lowest perceptual quality of the speech, respectively. Considering the theoretical background and the experimental results, the proposed bEMD-based algorithm exhibits better speech enhancement results while minimizing the loss of speech intelligibility.

6. Conclusions

In this paper, a data-adaptive soft-thresholding algorithm is introduced to suppress the background noise components of a noisy speech signal effectively. Determining the noise variance and the adaptive threshold is particularly challenging in single-channel speech enhancement. In the previous study [5], the noise variance is calculated from the speechless part of the single-channel speech using a speech/silence detector. The performance of the thresholding approach depends strongly on how accurately the noise variance and the threshold are determined. The newly developed bivariate EMD (bEMD) is applied here to resolve this problem in a more challenging situation. In bEMD, the speech signal and the reference signal, that is, the fGn, are treated as the real and imaginary components and decomposed simultaneously, yielding the same number of IMFs for both signals. The decomposition produced by the bEMD is more legible than that of the ordinary EMD, and it gives a better analysis of speech signals. An adaptation factor is used in the adaptive threshold function. The optimal value of the adaptation factor is computed from the estimated input SNR. The experimental results show that the proposed speech enhancement algorithm works efficiently for a wide range of input SNRs. The performance (both quantitative and qualitative) of the algorithm is tested with speech contaminated by white and nonwhite noise. The proposed method works well for white noise. Although the uEMD-based method performs slightly better for nonwhite noise, the major advantage of bEMD is that it does not require detection of the silent part of the speech signal. Further research is required to improve the performance of the proposed method with different types of noise.