Abstract

The learning-based speech recovery approach using statistical spectral conversion has been applied to several kinds of distorted speech, such as alaryngeal speech and body-conducted (or bone-conducted) speech. This approach attempts to recover clean (undistorted) speech from noisy (distorted) speech by converting the statistical models of noisy speech into those of clean speech, without prior knowledge of the characteristics and distributions of the noise source. So far, this approach has attracted few researchers in general noisy speech enhancement because of two major problems: the difficulty of noise adaptation and the lack of noise-robust synthesizable features across different noisy environments. In this paper, we adapt methods from state-of-the-art voice conversion and from speaker adaptation in speech recognition to the proposed speech recovery approach and apply it to different kinds of noisy environments, especially adverse environments with combined additive and convolutive noises. We propose decorrelated wavelet packet coefficients as a low-dimensional, robust, synthesizable feature under noisy environments. We also propose a noise adaptation scheme for speech recovery built around an "eigennoise," analogous to the eigenvoice in voice conversion. The experimental results show that the proposed approach greatly outperforms traditional nonlearning-based approaches.

1. Introduction

Speech is the most common form of information in telecommunication systems, and speech processing has therefore been studied by numerous researchers. The quality and intelligibility of speech are degraded by different distortion sources, such as background noise (commonly modeled as additive noise), channel noise (commonly modeled as convolutive noise), and distortions caused by speech disorders. Thus, recovering clean (undistorted) speech is critical for speech communications.

Present single-microphone noisy speech enhancement algorithms have been used efficiently for additive noise but are inefficient for convolutive noise, because only additive noise can be easily modeled as an independent Gaussian noise [1–4]. Moreover, the quality and intelligibility of speech are greatly degraded in adverse environments with combined additive and convolutive noises, and efficient methods for this problem are still lacking.

Although multimicrophone models outperform single-microphone models [5], the requirement of more than one microphone is not always practical. Therefore, developing a method for speech recovery under additive and convolutive noise, and especially under combined additive and convolutive noise, when only one microphone source is available is a critical and interesting research topic.

In the literature, there are only a few studies on learning-based speech enhancement [6–13]. Among them, a learning-based speech enhancement approach using statistical spectral conversion has been proposed for alaryngeal speech caused by speech disorders [8, 9], body-conducted speech [10], NAM-captured speech [11], and bone-conducted speech [12]. This approach is adapted from the concept of voice conversion and can be applied to both additive and convolutive noises using only a single microphone. It can also be applied to other kinds of distortion, such as those caused by speech disorders. Therefore, it might serve as a general learning-based approach to speech enhancement for all kinds of distortion. However, general learning-based approaches have still attracted few researchers in the field of noisy speech enhancement, due to two main problems: the inefficiency of adaptation techniques and the lack of a low-dimensional, robust, synthesizable speech feature for different noisy environments.

In this paper, we propose a learning-based noisy speech enhancement approach that we call the "eigennoise" approach, by analogy with "eigenface" in face recognition [14] and "eigenvoice" in voice conversion [15]. The proposed approach addresses the two drawbacks of learning-based speech enhancement using spectral conversion: we propose a low-dimensional, robust, synthesizable wavelet-based feature, and a noise-independent model combined with a noise adaptation method. We compared the proposed method with other spectral-conversion-based methods and with traditional nonlearning-based methods under different kinds of noise, including additive noise, convolutive noise, and combined additive and convolutive noise, with SNRs from ultra-low to high. The experimental results show that the proposed approach greatly outperforms traditional nonlearning-based approaches.

This paper is organized as follows. Section 2 briefly describes noise modeling in speech; Section 3 presents the GMM-based statistical spectral conversion used in the proposed noisy speech enhancement approach; Section 4 describes the wavelet-based robust and synthesizable speech features used in the proposed method. The generalized learning-based speech enhancement approach using spectral conversion is described and discussed in Section 5, and Section 6 presents the implementation and evaluations. Finally, our work is summarized in the last section.

2. Noise Modeling in Speech

A noisy environment can be modeled by a background noise n(t) and/or a distortion channel h(t). In the ideal case, background noise is assumed to be additive while the distortion channel is convolutive [16].

Assume that the clean speech is x(t) and the noisy speech is y(t). In the ideal case with a convolutive channel noise source h(t), the noisy speech is determined as in (1) and Figure 1(a):

y(t) = x(t) * h(t),  (1)

where * denotes convolution.

In the ideal case with an additive background noise source n(t), the noisy speech is determined as in (2) and Figure 1(b):

y(t) = x(t) + n(t).  (2)

Real noise can combine both the background noise n(t) and the channel noise h(t), and the noisy speech can be modeled as in (3), (4), and Figures 1(c) and 1(d):

y(t) = x(t) * h(t) + n(t),  (3)

y(t) = (x(t) + n(t)) * h(t).  (4)
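As a concrete illustration of these ideal models, the short NumPy sketch below applies an assumed toy channel response and an assumed white background noise to a stand-in clean signal; it is purely illustrative and does not reproduce the conditions used in our experiments.

```python
import numpy as np

def simulate_noisy_speech(x, h=None, n=None):
    """Simulate the ideal noise models of Section 2.

    x : clean speech samples
    h : impulse response of the distortion channel (convolutive noise)
    n : background noise samples (additive noise)
    """
    y = x.copy()
    if h is not None:                          # convolutive channel, as in (1)
        y = np.convolve(y, h, mode="full")[:len(x)]
    if n is not None:                          # additive background noise, as in (2)/(3)
        y = y + n[:len(y)]
    return y

# toy example: a short assumed channel response plus white noise
rng = np.random.default_rng(0)
x = rng.standard_normal(8000)                  # stands in for 1 s of 8 kHz speech
h = np.array([1.0, 0.0, 0.4, 0.0, 0.2])        # assumed toy channel response
n = 0.1 * rng.standard_normal(8000)
y = simulate_noisy_speech(x, h, n)
```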

Noise is also classified into stationary and nonstationary noise. For stationary noise, the noise spectrum level does not change over time or position. In contrast, for nonstationary noise, the spectrum level changes over time without a trend-like behavior. Most research on noise reduction, including this paper, is based on the assumption that the noise is stationary.

3. GMM-Based Statistical Spectral Conversion

3.1. Learning Stationary Information in Speech

In most learning-based speech applications, a "stationarity" assumption helps us to avoid learning from very large amounts of data. An example is speaker recognition, where speaker identity is characterized by the variation of short-time spectral parameters, so that a speaker can be recognized from short, unsupervised training and testing utterances [17]. This approach is based on the fact that speaker individuality is "stationary" information that is fully represented in short utterances. Under such a "stationarity" assumption, an application can be trained and tested on short utterances.

In this section, we review some popular learning methods that can be used to learn different kinds of "stationary" information in speech from short training data: the neural network, the HMM, and the GMM.

Rosenblatt [18] developed the "perceptron," which was modeled after neurons and is the starting point for numerous later works on neural networks. The performance of neural networks has improved greatly since then. However, neural networks still have a high computational cost compared with statistical learning methods.

Statistical machine learning has been proposed with many advantages compared with neural networks [19]. Among the many statistical machine learning methods and algorithms, the two most popular for speech applications are the Gaussian Mixture Model (GMM) and the Hidden Markov Model (HMM). Probabilistic HMM modeling is suitable for text-dependent speech applications such as speech recognition and synthesis [20]. However, in text-independent speech applications such as speaker recognition or spectral conversion, the sequencing of sounds found in the training data does not necessarily reflect the sound sequences found in the testing data [21]. This is also supported by experimental results in [22], which found that text-independent performance was unaffected by discarding transition probabilities in HMM models.

Therefore, GMM might be one of the most suitable learning methods for training with big speech data in text-independent applications such as noisy speech enhancement. In this paper, GMM is used for training, integrated with a sparse low-dimensional speech feature.

3.2. GMM-Learning in Spectral Conversion

As mentioned in the previous section, the GMM appears to be one of the most efficient statistical learning methods for training on speech data in text-independent speech applications. It is also the most popular training method used in spectral conversion [15, 21]. In this subsection, we briefly present the training and conversion procedures of GMM-based statistical voice conversion as used in the proposed noisy speech enhancement method.

3.2.1. Training Procedure

The time-aligned source feature is represented by a time sequence X = [x_1, x_2, ..., x_T], where T is the number of frames. The time-aligned target feature is represented by a time sequence Y = [y_1, y_2, ..., y_T], where x_t and y_t are the D-dimensional feature vectors of the t-th frame. Using a parallel training dataset consisting of the time-aligned source and target features z_t = [x_t^⊤, y_t^⊤]^⊤, where ⊤ denotes transposition of the vector, a GMM of the joint probability density is trained in advance as follows:

λ^(z) = argmax_λ ∏_{t=1}^{T} P(z_t | λ),

where λ^(z) denotes the model parameters. The joint probability density is written as

P(z_t | λ^(z)) = Σ_{m=1}^{M} w_m N(z_t; μ_m^(z), Σ_m^(z)),

where M is the number of Gaussian mixtures. N(z; μ_m^(z), Σ_m^(z)) denotes the 2D-dimensional normal distribution with mean μ_m^(z) and covariance matrix Σ_m^(z).

The m-th mixture weight w_m is the prior probability of the joint vector and satisfies Σ_{m=1}^{M} w_m = 1 and w_m ≥ 0. The parameters of the joint density can be estimated using the expectation-maximization (EM) algorithm.
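For concreteness, the sketch below fits such a joint-density GMM with scikit-learn's GaussianMixture; the source and target matrices here are random stand-ins for time-aligned noisy and clean feature sequences, and the sizes T, D, and M are illustrative only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# X_src, Y_tgt: time-aligned source/target features, shape (T, D) each.
# Random stand-ins for DTW-aligned noisy/clean feature sequences.
T, D, M = 2000, 17, 15
rng = np.random.default_rng(1)
X_src = rng.standard_normal((T, D))
Y_tgt = X_src + 0.3 * rng.standard_normal((T, D))

Z = np.hstack([X_src, Y_tgt])          # joint vectors z_t = [x_t, y_t]
gmm = GaussianMixture(n_components=M, covariance_type="full", max_iter=200)
gmm.fit(Z)                             # EM estimation of weights, means, covariances
```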

3.2.2. Conversion Procedure

The transformation function that converts the source features X into the target features Y is based on the maximization of the following likelihood function:

P(Y | X, λ^(z)) = Σ_{all m} P(m | X, λ^(z)) P(Y | X, m, λ^(z)),

where m = {m_1, m_2, ..., m_T} is a mixture sequence.

At the t-th frame, P(m_t | x_t, λ^(z)) and P(y_t | x_t, m_t, λ^(z)) are given by

P(m_t | x_t, λ^(z)) = w_{m_t} N(x_t; μ_{m_t}^(x), Σ_{m_t}^(xx)) / Σ_{n=1}^{M} w_n N(x_t; μ_n^(x), Σ_n^(xx)),

P(y_t | x_t, m_t, λ^(z)) = N(y_t; E_{m_t,t}^(y), D_{m_t}^(y)),

where

E_{m,t}^(y) = μ_m^(y) + Σ_m^(yx) (Σ_m^(xx))^{-1} (x_t − μ_m^(x)),

D_m^(y) = Σ_m^(yy) − Σ_m^(yx) (Σ_m^(xx))^{-1} Σ_m^(xy).

A time sequence of the converted features ŷ_t is computed as follows:

ŷ_t = argmax_{y_t} P(y_t | x_t, λ^(z)).

The converted features can be estimated by the EM algorithm.
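The sketch below illustrates the conversion step with the simpler frame-wise conditional-expectation (MMSE) mapping derived from the joint GMM, rather than the full likelihood maximization with the EM algorithm described above; it assumes a scikit-learn GaussianMixture trained on joint vectors as in Section 3.2.1.

```python
import numpy as np
from scipy.stats import multivariate_normal

def convert_mmse(gmm, X_src, D):
    """Frame-wise MMSE conversion from a joint-density GMM.

    gmm   : GaussianMixture fitted on joint vectors [x_t, y_t] of size 2D
    X_src : source (noisy) feature sequence, shape (T, D)
    """
    M = gmm.n_components
    mu_x, mu_y = gmm.means_[:, :D], gmm.means_[:, D:]
    S_xx = gmm.covariances_[:, :D, :D]
    S_yx = gmm.covariances_[:, D:, :D]

    Y_hat = np.zeros((len(X_src), D))
    for t, x in enumerate(X_src):
        # mixture posteriors P(m | x_t) from the source marginals
        px = np.array([gmm.weights_[m] * multivariate_normal.pdf(x, mu_x[m], S_xx[m])
                       for m in range(M)])
        post = px / px.sum()
        # conditional means E_{m,t} of y given x for each mixture
        E = np.array([mu_y[m] + S_yx[m] @ np.linalg.solve(S_xx[m], x - mu_x[m])
                      for m in range(M)])
        Y_hat[t] = post @ E
    return Y_hat
```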

3.2.3. Universal Background Model

There are one-to-one, many-to-one, and one-to-many VC systems [15]. In many-to-one VC, full training with all sources is expensive and sometimes impossible, and the same holds for full training with all targets in one-to-many VC. Therefore, a source- (target-) independent model, called the universal background model (UBM), was introduced in the GMM-UBM speaker verification system [16], where a single, speaker-independent background model is used. The UBM is a large GMM trained to represent the speaker-independent distribution of features. The independent UBM is then used as a representative target in one-to-many VC and as a representative source in many-to-one VC.

There are two main approaches to obtaining the UBM. The first is to simply pool all the data and train the UBM via the EM algorithm (Figure 2(a)). The second is to train individual UBMs over the subpopulations in the data and then pool the subpopulation models together (Figure 2(b)). In this paper, the first approach is used to train the GMM-UBM due to its simplicity. With this approach, the training procedure is the same as presented in Section 3.2.1, but the training data are pooled from many noisy speech conditions and environments.

3.2.4. MAP Adaptation

Using a GMM-UBM is useful for one-to-many and many-to-one VC. However, to improve the estimation of the model, maximum a posteriori (MAP) adaptation, also known as Bayesian learning estimation, is used to adapt the UBM to the required source (many-to-one) or target (one-to-many) models [16]. In the proposed noisy speech enhancement framework, MAP adaptation is used to adapt the noise-independent model to the models of specific noisy conditions.

Although all weights, means, and variances of the GMM can be adapted using MAP, experiments show that adapting only the means of the GMM gives the best performance [16]. The MAP adaptation of the GMM means is described below.

Given a UBM and training vectors X = {x_1, ..., x_T}, we first determine the probabilistic alignment of the training vectors with the UBM mixture components. That is, for the m-th mixture in the UBM, we compute

Pr(m | x_t) = w_m p_m(x_t) / Σ_{j=1}^{M} w_j p_j(x_t),

where p_m(·) is the m-th Gaussian component density.

We then use Pr(m | x_t) and x_t to compute the sufficient statistics for the mean parameter:

n_m = Σ_{t=1}^{T} Pr(m | x_t),

E_m(x) = (1 / n_m) Σ_{t=1}^{T} Pr(m | x_t) x_t.

Finally, these new sufficient statistics from the training data are used to update the old UBM statistics for mixture m to create the adapted mean of the m-th mixture:

μ̂_m = α_m E_m(x) + (1 − α_m) μ_m.

The adaptation coefficient α_m controls the balance between the old and new estimates.
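A minimal sketch of this mean-only adaptation is given below; it assumes the UBM is a fitted scikit-learn GaussianMixture and uses a fixed adaptation coefficient, as in Section 6.2.

```python
import numpy as np

def map_adapt_means(ubm, X, alpha=0.5):
    """Mean-only MAP adaptation of a UBM (sketch of Section 3.2.4).

    ubm   : fitted sklearn GaussianMixture acting as the noise-independent UBM
    X     : adaptation vectors from one noisy condition, shape (T, D_feat)
    alpha : adaptation coefficient balancing old and new estimates (fixed here)
    """
    post = ubm.predict_proba(X)                     # Pr(m | x_t), shape (T, M)
    n_m = post.sum(axis=0)                          # soft counts n_m per mixture
    E_m = (post.T @ X) / np.maximum(n_m[:, None], 1e-10)   # E_m(x)
    return alpha * E_m + (1.0 - alpha) * ubm.means_        # adapted means
```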

4. Noise Robust Synthesizable Wavelet-Based Features

4.1. Noise Robustness of Traditional Speech Features

It is known that the noisy speech signal varies greatly with different kinds of noise. In learning-based speech applications, noisy speech can be recognized and synthesized as well as in a clean environment if the noisy environment of the training data is identical to that of the testing data. Unfortunately, the noisy environment of the testing data is seldom known in advance, and it is difficult to train on data from all possible noisy environments. When the noisy environments of the training data differ from the noisy environment of the testing data, recognition and synthesis systems typically perform much worse. Therefore, it is necessary to understand and eliminate the variance in the speech signal due to environmental changes, and thus ultimately avoid the need for extensive training in different noisy environments. Consequently, learning-based speech applications in noisy environments require robust features that are insensitive to the noise environment.

State-of-the-art speech recognition is based on the source/filter model, which extracts vocal tract features or spectral envelope features separately from source features. The two most popular spectral envelope features are linear prediction coefficients (LPC) and Mel-frequency cepstral coefficients (MFCC). The most popular source feature is the fundamental frequency (F0).

While LPC has been shown to be sensitive to noise, MFCC is robust to noisy environments and is the state-of-the-art, standard feature for speech recognition in both clean and noisy environments [23]. The robustness of MFCC is due largely to the perceptual Mel scale integrated into MFCC. The nonlinear Mel scale follows a psychoacoustic model that matches human hearing [24]. Because humans are capable of detecting the desired speech in a noisy environment without prior knowledge of the noise, modeling speech features to be close to human hearing has improved the performance of speech applications in noisy environments.

However, MFCC is built on the short-time Fourier transform (STFT), in which a fixed-length window is used for analysis. The basis vectors of MFCC cover all frequency bands, so corruption of any frequency band of speech by noise affects all MFCC coefficients. Therefore, researchers still attempt to improve the noise robustness of MFCC and to propose other noise-robust features for speech applications in noisy environments.

4.2. Synthesizability of Traditional Speech Features

Feature extraction is a critical analysis stage for both recognition and synthesis systems. In recognition tasks, the features can be any parameters that characterize speech. In synthesis tasks, however, the speech features are usually required to be invertible, or synthesizable.

LPC and MFCC are two indirectly synthesizable features. To synthesize speech, LPC or MFCC must be combined with F0 in a vocoder, a popular source/filter synthesizer widely used in speech coding and synthesis [20].

In a vocoder, F0 and random noise are used for the source excitation. In the literature, many studies show that vocoders produce "buzzy" synthetic speech [25]. Therefore, the required combination with F0, which makes MFCC only indirectly synthesizable, limits the efficiency of MFCC in speech synthesis. Currently, the MLSA filter, a kind of vocoder using MFCC and F0, is still used in state-of-the-art TTS [20]. The use of directly synthesizable features without the source/filter model is expected to solve the "buzzy" problem of vocoders in speech synthesis.

4.3. Perceptual Wavelet Packet

The discrete wavelet transform (DWT) of a general continuous signal x(t) is the family of coefficients d_{j,k} defined in (15):

d_{j,k} = ∫ x(t) ψ_{j,k}(t) dt,  with  ψ_{j,k}(t) = 2^{−j/2} ψ(2^{−j} t − k),  (15)

where ψ(t) is the mother wavelet. The indexes j and k are the scale and translation indexes, and the d_{j,k} are called wavelet coefficients.

The inverse DWT (IDWT) reconstructs the signal from the wavelet coefficients as in (16):

x(t) = Σ_j Σ_k d_{j,k} ψ_{j,k}(t).  (16)

The detailed mathematical formulation of the DWT and IDWT can be found in [26].

In wavelet analysis (wavelet decomposition), using the DWT, a signal is split into an approximation and a detail. The approximation is then itself split into a second-level approximation and detail, and the process is repeated. Wavelet synthesis (wavelet reconstruction) is the inverse process of the wavelet analysis.
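As a minimal illustration of this analysis/synthesis pair, the PyWavelets sketch below performs a four-level DWT and reconstructs the signal exactly; the wavelet (db4) and the depth are arbitrary choices, not the perceptual settings of Section 4.3.

```python
import numpy as np
import pywt

rng = np.random.default_rng(0)
x = rng.standard_normal(512)                 # stand-in for a speech segment
coeffs = pywt.wavedec(x, "db4", level=4)     # [cA4, cD4, cD3, cD2, cD1]
x_rec = pywt.waverec(coeffs, "db4")          # inverse multilevel DWT
assert np.allclose(x, x_rec[:len(x)])        # perfect reconstruction
```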

In wavelet packet analysis, the details as well as the approximations can be split; therefore, the subband structure can be customized with a user-defined wavelet tree. Recent research on wavelets shows that integrating the wavelet packet and the psychoacoustic model into the perceptual wavelet packet transform (PWPT) may improve the performance of speech applications [27–29].

It has been shown in the literature that PWPT performs significantly better than the conventional wavelet transform for noisy speech recognition and speech coding.

In a psychoacoustic model, the frequency components of sounds are integrated into critical bands, which refer to the bandwidths at which subjective responses become significantly different. One widely used critical band scale is the Mel scale used in MFCC [24]; another popular scale is the Bark scale [30].

The Mel scale m can be approximately expressed in terms of the linear frequency f as in (17):

m = 2595 log10(1 + f / 700).  (17)

The Bark scale is approximately expressed as in (18):

Bark = 13 arctan(0.00076 f) + 3.5 arctan((f / 7500)^2).  (18)

In (17) and (18), f is the linear frequency in Hertz.
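Both scales can be computed directly from these approximations, as in the short sketch below (the constants are those of the widely used approximations given above, and the test frequencies are arbitrary).

```python
import numpy as np

def hz_to_mel(f):
    # Eq. (17): widely used approximation of the Mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def hz_to_bark(f):
    # Eq. (18): common approximation of the Bark scale
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

f = np.array([100.0, 500.0, 1000.0, 2000.0, 4000.0])
print(hz_to_mel(f))    # nonlinear compression of high frequencies
print(hz_to_bark(f))
```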

The nonlinear Mel-scale and Bark scale can be used to design the wavelet tree for the perceptual wavelet packet. PWPT has been used to extract robust and synthesizable features in the literature.

4.4. Noise Robustness and Synthesizability of Wavelet-Based Features

The wavelet transform has fine time and frequency resolution, and the effects of noise on speech are localized in specific subbands. Therefore, the wavelet transform is expected to be an efficient tool for noise-robust feature extraction.

In the literature, many noise-robust wavelet-based speech features have been proposed. These features can be grouped into two main categories.

The first category computes the sum (or weighted sum) of energies in each subband to form the whole feature [27, 28].

The second category simply uses the wavelet coefficients, retaining the time information, to form the feature [29]. There are also some mixed categories.

In the first category, the time information in the wavelet subbands is lost when the coefficients are collapsed into subband energies. Moreover, this kind of feature is noninvertible, or nonsynthesizable.

The use of wavelet coefficients in the second category is simple while keeping the noise robustness of the wavelet analysis. Moreover, the inverse wavelet transform can perfectly reconstruct the signal from the wavelet coefficients. Therefore, a feature built from wavelet coefficients is robust to noise and synthesizable, and this feature is used in the noisy speech enhancement method proposed in this paper.

4.5. Features Decorrelation and Compression with DCT

Although the feature using wavelet coefficients is robust to noise and synthesizable, the simple feature obtained by concatenating all wavelet coefficients from all subbands is very high-dimensional. Moreover, wavelet coefficients are correlated within and between subbands [31]. Therefore, to use a wavelet-coefficient feature in both recognition and synthesis tasks, we need to decorrelate and compress the feature vector. Although principal component analysis (PCA) [32] is the most popular method for reducing feature dimension, it is difficult to reconstruct the original feature for synthesis tasks. The most popular invertible decorrelating transform is the DCT [33].

The most common DCT definition for a 1D sequence x(n) of length N [33] is

C(k) = α(k) Σ_{n=0}^{N−1} x(n) cos[π(2n + 1)k / (2N)]

for k = 0, 1, ..., N − 1. Similarly, the inverse transformation is defined as

x(n) = Σ_{k=0}^{N−1} α(k) C(k) cos[π(2n + 1)k / (2N)]

for n = 0, 1, ..., N − 1. α(k) is defined as

α(k) = √(1/N) for k = 0 and α(k) = √(2/N) for k ≠ 0.

The DCT is often used in signal processing because of its strong "energy compaction" property [34]: most of the signal information tends to be concentrated in a few low-frequency DCT components. In this paper, the DCT is used to decorrelate the wavelet-based feature in the proposed noisy speech enhancement method.
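The sketch below demonstrates this transform pair and its energy compaction using SciPy's orthonormal DCT-II/DCT-III routines; the sequence length and the number of retained coefficients are arbitrary choices.

```python
import numpy as np
from scipy.fft import dct, idct

x = np.random.default_rng(0).standard_normal(64)
c = dct(x, type=2, norm="ortho")          # orthonormal DCT-II, as defined above
x_rec = idct(c, type=2, norm="ortho")     # inverse transform
assert np.allclose(x, x_rec)              # the pair is exactly invertible

# energy compaction: keep only the first 16 coefficients
c_trunc = np.where(np.arange(len(c)) < 16, c, 0.0)
x_approx = idct(c_trunc, type=2, norm="ortho")
```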

5. Eigennoise Speech Recovery Framework

The approach of using eigenfaces for recognition was developed by Sirovich and Kirby [14]. A set of eigenfaces can be generated by PCA on a large set of images depicting different human faces. Informally, eigenfaces can be considered a set of “standardized face ingredients,” derived from statistical analysis of many pictures of faces.

Developed from the concept of eigenfaces, Ohtani et al. proposed an eigenvoice-GMM voice conversion [15].

In this paper, we call the UBM trained over a large set of noisy environments the "eigennoise," and we propose a speech recovery approach using GMM-UBM-MAP, in the spirit of joint factor analysis, that we call the "eigennoise" approach.

The noise-robust synthesizable feature analysis and synthesis are described in Figure 3. We use the PWPT to extract wavelet coefficients from the input speech. The coefficients in each subband are highly correlated and carry much redundant information; in particular, the high bands contain many small or zero coefficients. Therefore, the DCT is applied in each subband to remove within-band correlations. After concatenating the coefficients from all bands into the whole feature vector, the DCT is applied again to remove between-band correlations. Both the PWPT and the DCT are completely invertible; thus, the speech can be perfectly reconstructed.
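A simplified sketch of this analysis/synthesis chain is given below. It assumes a uniform four-level wavelet packet tree with the db4 wavelet and periodization boundary handling, rather than the perceptual tree actually used, but it shows the per-band DCT, the global DCT, and the exact invertibility of the whole chain.

```python
import numpy as np
import pywt
from scipy.fft import dct, idct

WAVELET, LEVEL = "db4", 4    # assumed uniform tree; the paper uses a perceptual tree

def analyze(frame):
    """Analysis path of Figure 3: wavelet packet -> per-band DCT -> global DCT."""
    wp = pywt.WaveletPacket(frame, WAVELET, mode="periodization", maxlevel=LEVEL)
    nodes = wp.get_level(LEVEL, order="freq")              # subbands, low to high
    bands = [dct(n.data, norm="ortho") for n in nodes]     # within-band decorrelation
    feature = dct(np.concatenate(bands), norm="ortho")     # between-band decorrelation
    return feature, [len(b) for b in bands], [n.path for n in nodes]

def synthesize(feature, lengths, paths):
    """Inverse path: global IDCT -> per-band IDCT -> inverse wavelet packet."""
    concat = idct(feature, norm="ortho")
    wp = pywt.WaveletPacket(None, WAVELET, mode="periodization", maxlevel=LEVEL)
    start = 0
    for path, L in zip(paths, lengths):
        wp[path] = idct(concat[start:start + L], norm="ortho")
        start += L
    return wp.reconstruct(update=False)

frame = np.random.default_rng(0).standard_normal(240)      # one 30 ms frame at 8 kHz
feat, lens, paths = analyze(frame)
rec = synthesize(feat, lens, paths)
assert np.allclose(frame, rec[:len(frame)])                # the chain is invertible
```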

The eigennoise training and conversion are presented in Figure 4. Noisy speech from several noisy environments is used to train the noise-independent eigennoise model (GMM-UBM), as shown in Figure 4(a) and presented in Section 3.2.3. The noise-independent model is then adapted to each noise-dependent model, as shown in Figure 4(b) and presented in Section 3.2.4. Clean speech is converted from the corresponding noisy speech using the noise-dependent model, as shown in Figure 4(c) and presented in Section 3.2.2.

6. Implementation and Evaluations

6.1. Data Preparation

The clean speech data used in our evaluation was the well-known English MOCHA-TIMIT corpus. The noise database was NOISEX-92. We created artificial noisy environments simulating additive background noise, convolutive channel noise, and mixed noise. The noise sources were selected from NOISEX-92.

We simulated a practical open-dataset test, in which the testing noise condition does not match the training conditions. The noise input of the artificial noisy environments used for training and testing was factory noise. The signal-to-noise ratios (SNRs) of the noisy speech used for training were −5, 5, 15, 25, and 35 dB, whereas those used for testing were −10, 0, 10, 20, and 30 dB.

All enhanced noisy speech was evaluated with the objective tests, while only the mixed noisy speech with an SNR of approximately −10 dB, which is closest to severe real-world noise, was evaluated with the subjective test.

6.2. Implementation Parameters

In all experiments, the number of Gaussian components M, which should be chosen large enough when sufficient training data are available, was set to 15. The adaptation coefficient α was initially set to 0.5. The speech data used for all evaluations were resampled at 8 kHz, yielding a bandwidth of 4 kHz, which covers approximately the first 17 of the 25 critical bands of the Bark scale. Therefore, the wavelet-coefficient feature had 17 coefficients. The order of the LP analysis was also chosen as 17. The frame size was 30 ms and the overlap was 15 ms.
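For reference, the framing arithmetic implied by these parameters is sketched below; it is purely illustrative, and no analysis window is applied.

```python
import numpy as np

FS = 8000                        # sampling rate (Hz)
FRAME_LEN = int(0.030 * FS)      # 30 ms frame  -> 240 samples
HOP = int(0.015 * FS)            # 15 ms overlap -> 120-sample hop

def frame_signal(x):
    """Split a signal into overlapping analysis frames (no windowing shown)."""
    n_frames = 1 + max(0, (len(x) - FRAME_LEN) // HOP)
    return np.stack([x[i * HOP : i * HOP + FRAME_LEN] for i in range(n_frames)])
```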

6.3. Objective Evaluations for Speech Quality

To evaluate the proposed "eigennoise" approach with the two features, LP [13] and wavelet, we implemented and compared our methods against the standard nonlearning-based spectral subtraction [4] and Wiener filter [5] methods. We used the peak signal-to-noise ratio (PSNR) for the objective evaluation. The average PSNR results are shown in Figures 5(a), 5(b), and 5(c). The results reveal that the quality of speech enhanced with the nonlearning-based methods depended almost linearly on the input SNR; it was acceptable for additive noise but very poor for convolutive and mixed noise. In contrast, the performance of the learning-based methods was largely independent of the input SNR as well as the kind of noise. The proposed "eigennoise" speech recovery with wavelet-GMM outperformed the "eigennoise" speech recovery with LP-GMM. In general, the learning-based noisy speech enhancement methods greatly outperformed the nonlearning-based methods.
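One common way of computing the PSNR between a clean reference and an enhanced signal is sketched below; it is a generic definition, and the exact formulation behind Figure 5 may differ in details such as the choice of peak value.

```python
import numpy as np

def psnr_db(clean, enhanced):
    """Peak signal-to-noise ratio in dB between a clean reference and an enhanced signal."""
    n = min(len(clean), len(enhanced))
    mse = np.mean((clean[:n] - enhanced[:n]) ** 2)
    peak = np.max(np.abs(clean[:n]))
    return 10.0 * np.log10(peak ** 2 / mse)
```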

6.4. Subjective Evaluations for Speech Intelligibility

Speech signals of 100 English words, in clean, noisy, and enhanced versions, were played in random order to 5 native English-speaking subjects. The subjects were asked to listen to each word only once and to write down what they heard. Speech intelligibility was evaluated as the average recognition accuracy over all subjects. The results are shown in Figure 6. The subjective evaluation also shows that the nonlearning-based methods reduced the intelligibility of speech, whereas the learning-based methods greatly improved it. In addition, the "eigennoise" method with wavelet-GMM outperformed the one with LP-GMM.

7. Conclusions and Discussions

For adverse environments with combined additive and convolutive noises, one of the biggest challenges in noisy speech enhancement, the learning-based approach using spectral conversion proposed in this paper is a promising candidate among the few available approaches.

However, the proposed framework and methods still have some remaining issues that need to be studied in the future. The two biggest issues are the hardware performance required for training and the need for efficient training methods for big data.

There are many kinds of real noise. Thus, building a practical learning-based noisy speech enhancement system usually requires training on several noisy speech conditions corresponding to the various real noise environments. Therefore, very high hardware performance is required to cope with training on a huge corpus. Such performance was not available at the turn of the millennium. Fortunately, hardware performance has developed rapidly in recent years. While the first Intel 8080 processor had a clock rate of 2 MHz, the speed of modern processors exceeds 8 GHz [35]. In addition, the processing performance of computers has increased through multicore processors, which can handle numerous asynchronous events, interrupts, and so forth. Recent advances in computer hardware research and applications reduce the difficulty of training on a gigantic corpus. Therefore, the first limitation of learning-based noisy speech enhancement can be overcome with the newest hardware technologies.

With the rapid development of statistical learning methods, learning-based noisy speech enhancement could become much more efficient than these initial results. In this paper, we proposed a wavelet-GMM method that we call the eigennoise speech recovery method. GMM-based training methods have been shown to be efficient with big speech data. However, the computational cost still needs to be reduced in the future.

Another disadvantage of the proposed method is its speaker-dependent requirement, which was not addressed in this paper. In the future, we will also compare deep neural models with the proposed model and evaluate the proposed method with a noisy speech recognition system to confirm its efficiency.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.