Abstract

A nonlinear multiband spectral subtraction method is investigated in this study to reduce the colored electronic noise in millimeter wave (MMW) radar conducted speech. Because the over-subtraction factor of each Bark frequency band can be adaptively adjusted, the nonuniform effects of colored noise on the spectrum of the MMW radar speech can be taken into account in the enhancement process. The results of both the time-frequency distribution analysis and the perceptual evaluation test suggest that the method achieves better noise reduction across the whole frequency range and efficiently reduces the perceptually annoying musical noise, with little distortion of the speech information, compared with the other standard speech enhancement algorithms.

1. Introduction

Speech enhancement is an important problem in many speech processing applications, such as mobile communications, speech recognition, and speech coding. The main objective of speech enhancement is to improve the quality and intelligibility of the signal. During the last decades, various approaches have been proposed to solve this problem, such as spectral subtraction methods [1–3], subspace methods [4], hidden Markov modeling [5], and wavelet-based methods [6, 7].

The spectral subtraction method is a well-known and widely used enhancement method for all types of speech, chosen for its simplicity of implementation and low computational load. Additionally, it offers high flexibility in varying the subtraction parameters. The method attempts to estimate the short-time spectral magnitude of the speech by subtracting a noise estimate from the noisy speech. The phase of the noisy speech remains unchanged, since it is assumed that phase distortion is not perceived by the human ear.

However, a serious drawback of this method is that the enhanced speech is accompanied by an unpleasant musical noise artifact, which is characterized by tones with random frequencies. Although many solutions have been proposed to reduce the musical noise in subtractive-type algorithms [3, 8–14], the results obtained with these algorithms show that there is a need for further improvement.

A novel speech detecting method has been developed in our laboratory by using millimeter wave radar technology. Because of the special attributes of the millimeter wave, this method may considerably extend the capabilities of traditional speech detecting methods [15]. However, radar speech is substantially degraded by additive combined noises that include radar harmonic noise, electrocircuit noise, and ambient noise [16].

Unlike white Gaussian noise, which has a flat spectrum, the spectrum of MMW radar noise is not flat. The noise signal does not affect the speech signal uniformly over the whole spectrum. Some frequencies are affected more adversely than others. This means that this kind of noise is colored. Subtracting a constant ratio of the noise spectrum over the whole frequency spectrum may also remove parts of the speech signal. In order to prevent destructive subtraction of the speech while removing most of the residual noise, it is necessary to propose a nonlinear approach to improve the subtraction procedure.

Therefore, this investigation is motivated by the need to improve millimeter wave conducted speech. A nonlinear multi-band spectral subtraction algorithm is proposed that takes into account the variation of the signal-to-noise ratio (SNR) across the speech spectrum by using a different over-subtraction factor in each frequency band to reduce colored noise. Recent studies proposed a nonlinear spectral subtraction method [3, 17] that takes into account the variation of the SNR across the entire speech spectrum, but this method has not been applied to MMW radar conducted speech. Furthermore, to improve the performance of that algorithm for MMW radar speech, this study extends its filter banks to nonlinear Bark-scaled frequency spacing, because the sensitivity of the human ear is a nonlinear function of frequency. The proposed method attempts to find the best tradeoff between speech distortion and noise reduction, based on properties closely related to human perception.

Section 2 introduces the new MMW-conducted speech detecting method, outlines the experimental method (including the experiments, the data set, the added background noise models, and the evaluation metrics), and presents the proposed multi-band spectral subtraction method. The results of these experiments are presented and discussed in Section 3, followed by overall conclusions in Section 4.

2. Method

2.1. The Description of the System

The schematic diagram of the speech-detection system is shown in Figure 1. A phase-locked oscillator generates a very stable MMW at 34 GHz with an output power of 50 mW. The output of the amplifier is fed through a 6 dB directional coupler, a variable attenuator, and a circulator, and then to a flat antenna. The 6 dB directional coupler branches out 1/4 of the amplifier output to provide a reference signal for the mixer. The variable attenuator controls the power level of the microwave signal radiated by the antenna, which is usually kept at about 10–20 mW. The flat antenna radiates a microwave beam aimed at the human subject standing or sitting directly in front of the antenna. The echo signal, a 34 GHz MMW signal modulated by the speech produced by the larynx of the subject, is received by the same antenna. This signal is then mixed with the reference signal in a double-balanced mixer. The mixing produces low-frequency signals, which are amplified by a signal processor and then passed through an A/D converter before reaching a computer, where further processing is done. For more details of the system, the reader is referred to [18, 19].

2.2. Experiments

Ten healthy volunteer speakers, 6 males and 4 females, participated in the radar speech experiment. All of the subjects were native speakers of Mandarin Chinese; their ages varied from 20 to 35, with a mean of 28.1 and standard deviation (SD) of 12.05. All of the experiments were conducted with the informed consent of the volunteers, who signed a consent form in accordance with the Declaration of Helsinki (BMJ 1991; 302: 1194).

The distance between the radar antenna and the human subject ranged from 2 m to 20 m. Ten Mandarin Chinese sentences were used as the speech materials for acoustic analysis and acceptability evaluation; the sentences varied in length from 6 words (5.6 seconds) to 30 words (15 seconds). Each participant produced the sentences in a quiet experimental environment. The speakers were instructed to read the speech materials at normal loudness and speaking rate.

To test the effectiveness of the proposed method, two kinds of background noise, white Gaussian noise and speech babble noise, both taken from the Noisex-92 database, were added to the MMW radar speech, since these two representative noises resemble actual talking conditions more closely than the other noises do. The noises were added to the original radar speech signal at several SNR levels, including 0, 5, and 10 dB, where the SNR is evaluated as

$$\mathrm{SNR} = 10\log_{10}\frac{\sum_{n=1}^{N} s^{2}(n)}{\sum_{n=1}^{N}\left[y(n)-s(n)\right]^{2}}, \tag{2.1}$$

where $y(n)$ is the noisy speech, $s(n)$ is the clean speech, and $N$ is the number of samples in the speech signals.
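The noise-mixing step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `snr_db` and `add_noise_at_snr`, the synthetic sine "speech", and the random noise are all assumptions made for the example, and the noise recording is assumed to be at least as long as the speech.

```python
import numpy as np

def snr_db(clean, noisy):
    """SNR in dB: clean-signal energy over the energy of (noisy - clean)."""
    clean = np.asarray(clean, dtype=float)
    noisy = np.asarray(noisy, dtype=float)
    noise = noisy - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

def add_noise_at_snr(clean, noise, target_snr_db):
    """Scale `noise` so that clean + scaled noise has the requested SNR.

    Assumes `noise` has at least as many samples as `clean`.
    """
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(clean)]
    clean_energy = np.sum(clean ** 2)
    noise_energy = np.sum(noise ** 2)
    gain = np.sqrt(clean_energy / (noise_energy * 10.0 ** (target_snr_db / 10.0)))
    return clean + gain * noise

# Synthetic stand-ins for a speech recording and a noise recording (illustrative only).
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 200 * np.arange(8000) / 8000)
white = rng.standard_normal(8000)
noisy = add_noise_at_snr(clean, white, 5.0)
```

Calling `snr_db(clean, noisy)` on the result recovers the requested level, which is a quick sanity check before running any enhancement algorithm on the mixture.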

For the perceptual experiment, ten listeners were selected to evaluate the acceptability of each sentence based on the mean opinion score (MOS), which is a five-point scale (1: bad; 2: poor; 3: fair; 4: good; 5: excellent). All of the listeners were native speakers of Mandarin Chinese, had no reported history of hearing problems, and were unfamiliar with radar speech. Their ages varied from 21 to 35, with a mean age of 25.26 (SD = 4.37). The listening tasks took place in a soundproof room, and the speech samples were presented to the listeners at a comfortable loudness level (60 dB SPL) via high-quality headphones. A 4-second pause was inserted before each speech sample to allow the listeners to respond, and the order in which the speech samples were presented was randomized to avoid rehearsal effects.

2.3. Bark (Critical) Band

The sensitivity of the human ear varies nonlinearly across the frequency spectrum [12]: the perception by the auditory system of a signal at a particular frequency is influenced by the energy of a perturbing signal in a critical band around this frequency. The bandwidth of this critical band, furthermore, varies with frequency. A commonly used scale for the critical bands is the Bark scale, which divides the audible frequency range of 0–16 kHz into 24 abutting bands. An approximate analytical expression for the relationship between linear frequency $f$ (in Hz) and critical-band number $z$ (in Bark) is [12]

$$z(f) = 13\arctan(0.00076 f) + 3.5\arctan\left[\left(\frac{f}{7500}\right)^{2}\right]. \tag{2.2}$$

In this paper, the frequency range of the radar speech is from 0 to 5000 Hz, so the total number of critical bands is 19. Figure 2 illustrates the relationship between the frequency in hertz and the critical-band rate in Bark.
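As a quick check of the band count quoted above, the following sketch evaluates a Hz-to-Bark mapping at the upper frequency limit. It assumes the widely used approximation $z = 13\arctan(0.00076 f) + 3.5\arctan[(f/7500)^2]$; the helper names `hz_to_bark`, `bark_band_index`, and `count_bark_bands` are illustrative and not taken from the paper.

```python
import math

def hz_to_bark(f_hz):
    """Approximate critical-band rate (Bark) for a frequency in Hz (assumed formula)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def bark_band_index(f_hz):
    """Zero-based Bark band containing f_hz (band i spans i <= z < i + 1)."""
    return int(math.floor(hz_to_bark(f_hz)))

def count_bark_bands(f_max_hz):
    """Number of Bark bands needed to cover 0..f_max_hz."""
    return bark_band_index(f_max_hz) + 1

print(count_bark_bands(5000.0))  # prints 19
```

At 5000 Hz the mapping gives roughly 18.5 Bark, so 19 bands are needed to cover the radar speech bandwidth, in agreement with the text.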

2.4. Multi-Band Spectral Subtraction Method

The multi-band spectral subtraction method is based on the assumption that the additive noise is stationary and uncorrelated with the clean speech signal. If $y(n)$, the noisy speech, is composed of the clean speech signal $s(n)$ and the uncorrelated additive noise signal $d(n)$, then

$$y(n) = s(n) + d(n). \tag{2.3}$$

The power spectrum of the corrupted speech can be approximately estimated as

$$|Y(k)|^{2} \approx |S(k)|^{2} + |\hat{D}(k)|^{2}, \tag{2.4}$$

where $|Y(k)|^{2}$, $|S(k)|^{2}$, and $|\hat{D}(k)|^{2}$ represent the noisy speech short-time spectrum, the clean speech short-time spectrum, and the noise power spectrum estimate, respectively.

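Most subtractive-type algorithms exist in several variants that allow flexibility in how the spectral subtraction is carried out. Berouti et al. (1979) [2] proposed the generalized spectral subtraction scheme, which is described as follows:

$$|\hat{S}(k)|^{\gamma} = \begin{cases} |Y(k)|^{\gamma} - \alpha|\hat{D}(k)|^{\gamma}, & \text{if } |Y(k)|^{\gamma} > (\alpha+\beta)|\hat{D}(k)|^{\gamma}, \\ \beta|\hat{D}(k)|^{\gamma}, & \text{otherwise}, \end{cases} \tag{2.5}$$

where $\alpha$ is the over-subtraction factor [2], which is a function of the segmental SNR, $\beta$ is the spectral floor, and $\gamma$ is the exponent determining the transition sharpness. Here we set $\gamma = 2$ (power spectral subtraction) and $\beta = 0.002$.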
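The generalized subtraction rule with over-subtraction and a spectral floor can be sketched for a single frame as follows. This is an illustrative reading of the scheme in [2], not the authors' implementation: the function name and the toy input values are assumptions, the floor is taken relative to the noise estimate, and the defaults `beta=0.002` and `gamma=2.0` assume power-domain subtraction.

```python
import numpy as np

def generalized_spectral_subtraction(noisy_mag, noise_mag, alpha, beta=0.002, gamma=2.0):
    """Subtract alpha times the noise estimate in the gamma-power domain.

    Bins where the subtraction result falls below the spectral floor
    (beta times the noise estimate) are set to that floor instead.
    Returns the enhanced magnitude spectrum.
    """
    noisy_pow = np.asarray(noisy_mag, dtype=float) ** gamma
    noise_pow = np.asarray(noise_mag, dtype=float) ** gamma
    subtracted = noisy_pow - alpha * noise_pow
    floor = beta * noise_pow
    enhanced_pow = np.where(subtracted > floor, subtracted, floor)
    return enhanced_pow ** (1.0 / gamma)

# Toy frame: one bin well above the noise, one equal to it, one below it.
noisy = np.array([2.0, 1.0, 0.5])
noise = np.array([1.0, 1.0, 1.0])
enhanced = generalized_spectral_subtraction(noisy, noise, alpha=1.0)
```

In the toy frame only the first bin survives the subtraction; the other two are clamped to the floor rather than set to zero, which is what limits isolated spectral peaks (musical noise) in the output.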

This implementation assumes that the noise affects the speech spectrum uniformly; the over-subtraction factor $\alpha$, furthermore, subtracts an over-estimate of the noise over the whole spectrum. However, the noise in the MMW conducted speech, which is produced by the MMW radar, may be colored and does not affect the speech signal uniformly over the entire spectrum. Figure 3 shows the estimated segmental SNR for five frequency bands (0–300 Hz (Band 1), 300 Hz–1 kHz (Band 2), 1–2 kHz (Band 3), 2–3 kHz (Band 4), and 3–5 kHz (Band 5)) of radar speech corrupted by radar noise. It can be seen from Figure 3 that the SNR of the low-frequency bands (Bands 1 and 2) was significantly higher than the SNR of the high-frequency bands (Bands 3–5). The largest SNR difference between bands was about 25 dB. This suggests that the noise signal does not affect the speech signal uniformly over the whole spectrum; therefore, subtracting a constant factor of the noise spectrum over the whole frequency spectrum may also remove speech.

To take into account the fact that colored noise affects the speech spectrum differently at different frequencies, it becomes imperative to estimate a suitable factor that subtracts just the necessary amount of the noise spectrum from each frequency subband. In this study, the speech spectrum was divided into N (N = 19) nonoverlapping Bark bands, and spectral subtraction was performed independently in each band. Hence the estimate of the clean speech spectrum in the $i$th band is obtained by

$$|\hat{S}_i(k)|^{2} = |Y_i(k)|^{2} - \alpha_i\,\delta_i\,|\hat{D}_i(k)|^{2}, \quad b_i \le k \le e_i, \tag{2.6}$$

where $\alpha_i$ is the over-subtraction factor of the $i$th frequency band, and $\delta_i$ is a tweaking factor that can be individually set for each frequency band to customize the noise removal properties. $b_i$ and $e_i$ are the beginning and ending frequency bins of the $i$th frequency band. The whole algorithm using these parameters is shown in Figure 4.
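One way to obtain the beginning and ending bins of the nonoverlapping Bark bands is sketched below. The FFT size (512) and sampling rate (10 kHz, chosen so that the spectrum spans 0 to 5000 Hz) are assumptions for illustration, as are the helper names and the assumed Hz-to-Bark approximation; the paper does not state its analysis parameters.

```python
import math

def hz_to_bark(f_hz):
    """Approximate critical-band rate (Bark) for a frequency in Hz (assumed formula)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def bark_band_edges(n_fft, fs_hz):
    """Return bin edges [b_0, b_1, ..., b_N] of nonoverlapping Bark bands.

    Band i covers one-sided FFT bins edges[i] .. edges[i + 1] - 1.
    """
    n_bins = n_fft // 2 + 1
    band_of_bin = [int(math.floor(hz_to_bark(k * fs_hz / n_fft))) for k in range(n_bins)]
    edges = [0]
    for k in range(1, n_bins):
        # A new band starts wherever the integer Bark index changes.
        if band_of_bin[k] != band_of_bin[k - 1]:
            edges.append(k)
    edges.append(n_bins)
    return edges

# Hypothetical analysis settings: 512-point FFT at a 10 kHz sampling rate.
edges = bark_band_edges(n_fft=512, fs_hz=10000.0)
num_bands = len(edges) - 1
```

With these assumed settings the bin spacing is about 19.5 Hz, which is narrower than the narrowest Bark band, so every band receives at least one bin and `num_bands` comes out as 19.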

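The band-specific over-subtraction factor $\alpha_i$ is a function of the segmental noisy signal-to-noise ratio $\mathrm{SSNR}_i$ of the $i$th frequency band, which is calculated as

$$\mathrm{SSNR}_i\ (\mathrm{dB}) = 10\log_{10}\left(\frac{\sum_{k=b_i}^{e_i}|Y_i(k)|^{2}}{\sum_{k=b_i}^{e_i}|\hat{D}_i(k)|^{2}}\right). \tag{2.7}$$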

According to the $\mathrm{SSNR}_i$ value calculated in (2.7), the over-subtraction factor $\alpha_i$ is calculated as:

The use of this over-subtraction factor $\alpha_i$ provides a degree of control over the noise subtraction level in each band. Another factor, $\delta_i$, which appears in (2.6), can be used to provide an additional degree of control within each band. Since most of the speech energy is present in the lower frequencies, smaller values of $\delta_i$ were used for the low-frequency bands in order to minimize speech distortion. The values of $\delta_i$ were empirically determined and set to

The factor $\delta_i$ can be adjusted for each band under different speech conditions to obtain better speech quality.
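The per-band procedure described in this section can be sketched for one frame as follows. This is a hedged illustration under stated assumptions rather than the paper's algorithm: the linear mapping in `alpha_from_ssnr` and its breakpoints (-5 dB, 20 dB) and limits (5 and 1) are hypothetical placeholders, the `deltas` values in the example are arbitrary, and the spectral floor is assumed to be `beta` times the noise estimate.

```python
import numpy as np

def band_ssnr_db(noisy_pow, noise_pow, start, stop):
    """Segmental SNR (dB) of one band: noisy power over noise power, bins start..stop-1."""
    return 10.0 * np.log10(np.sum(noisy_pow[start:stop]) / np.sum(noise_pow[start:stop]))

def alpha_from_ssnr(ssnr_db, lo_db=-5.0, hi_db=20.0, alpha_max=5.0, alpha_min=1.0):
    """HYPOTHETICAL mapping: alpha falls linearly from alpha_max to alpha_min
    as the band SNR rises from lo_db to hi_db, and is clamped outside that range."""
    clipped = np.clip(ssnr_db, lo_db, hi_db)
    frac = (clipped - lo_db) / (hi_db - lo_db)
    return alpha_max - frac * (alpha_max - alpha_min)

def multiband_subtract(noisy_pow, noise_pow, band_edges, deltas, beta=0.002):
    """Apply power spectral subtraction independently in each band.

    band_edges must start at 0 and end at len(noisy_pow) so every bin is covered.
    """
    noisy_pow = np.asarray(noisy_pow, dtype=float)
    noise_pow = np.asarray(noise_pow, dtype=float)
    enhanced = np.empty_like(noisy_pow)
    for i in range(len(band_edges) - 1):
        start, stop = band_edges[i], band_edges[i + 1]
        alpha = alpha_from_ssnr(band_ssnr_db(noisy_pow, noise_pow, start, stop))
        subtracted = noisy_pow[start:stop] - alpha * deltas[i] * noise_pow[start:stop]
        # Assumed floor: never go below beta times the noise estimate.
        enhanced[start:stop] = np.maximum(subtracted, beta * noise_pow[start:stop])
    return enhanced

# Toy frame with two bands: a high-SNR band followed by a low-SNR band.
noisy_pow = np.array([10.0, 10.0, 2.0, 2.0])
noise_pow = np.array([1.0, 1.0, 1.0, 1.0])
enhanced = multiband_subtract(noisy_pow, noise_pow, band_edges=[0, 2, 4], deltas=[1.0, 1.0])
```

In the toy frame the high-SNR band (10 dB) gets a mild factor of 2.6, while the low-SNR band (about 3 dB) gets a larger factor and is clamped to the floor. A single fixed factor could not treat the two bands differently in this way.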

2.5. Noise Estimation

The noise in the radar speech, which includes the harmonics of the electromagnetic wave (EMW), the channel noise, the ambient noise, and so on, is highly nonstationary. Thus, it is imperative to update the estimate of the noise spectrum frequently. This study adopted the minimum-statistics method proposed by Cohen and Berdugo (2002) [20] for noise estimation, since this method is computationally efficient, robust with respect to the input signal-to-noise ratio (SNR), and able to quickly follow abrupt changes in the noise spectrum. The minimum tracking is based on a recursively smoothed spectrum, which is estimated using first-order recursive averaging:

$$|\hat{D}(k,l+1)|^{2} = \alpha_d\,|\hat{D}(k,l)|^{2} + (1-\alpha_d)\,|Y(k,l)|^{2}, \tag{2.10}$$

where $|\hat{D}(k,l)|^{2}$ and $|Y(k,l)|^{2}$ are the $k$th components of the noise spectrum and the noisy speech spectrum at frame $l$, and $\alpha_d$ is a smoothing parameter. Let $p'(k,l)$ denote the conditional signal presence probability in Cohen and Berdugo (2002) [20]; then (2.10) implies

$$|\hat{D}(k,l+1)|^{2} = \tilde{\alpha}_d(k,l)\,|\hat{D}(k,l)|^{2} + \left[1-\tilde{\alpha}_d(k,l)\right]|Y(k,l)|^{2}, \quad \tilde{\alpha}_d(k,l) = \alpha_d + (1-\alpha_d)\,p'(k,l), \tag{2.11}$$

where $\tilde{\alpha}_d(k,l)$ is a time-varying smoothing parameter. Therefore, the noise spectrum can be estimated by averaging past spectral power values. For more details of this algorithm, the reader is referred to [20, 21].
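The recursive noise update with a time-varying smoothing parameter can be sketched per frame as below. The sketch assumes the speech presence probability is supplied by the caller (its estimation from tracked spectral minima, as in [20], is not shown). The function name, the symbol names, and the default `alpha_d=0.95` are illustrative assumptions.

```python
import numpy as np

def update_noise_estimate(noise_pow, noisy_pow, speech_prob, alpha_d=0.95):
    """One frame of probability-weighted first-order recursive averaging.

    The smoothing parameter rises toward 1 where speech is likely present,
    which freezes the noise estimate in those bins; where speech is absent,
    the estimate moves toward the current noisy power at rate (1 - alpha_d).
    """
    noise_pow = np.asarray(noise_pow, dtype=float)
    noisy_pow = np.asarray(noisy_pow, dtype=float)
    speech_prob = np.asarray(speech_prob, dtype=float)
    alpha_tilde = alpha_d + (1.0 - alpha_d) * speech_prob
    return alpha_tilde * noise_pow + (1.0 - alpha_tilde) * noisy_pow

# Two bins: speech absent in the first, speech certainly present in the second.
previous = np.array([1.0, 1.0])
current = np.array([3.0, 3.0])
updated = update_noise_estimate(previous, current, speech_prob=np.array([0.0, 1.0]))
```

In the example the first bin moves from 1.0 to 1.1 toward the observed power, while the second bin stays at 1.0 because speech is assumed present there. This is how the update avoids absorbing speech energy into the noise estimate.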

3. Results and Discussion

This section presents the performance evaluation of the proposed enhancement algorithm, as well as a comparison with other algorithms. To analyze the time-frequency distribution of the enhanced speech, speech spectrograms are presented, which give accurate information about residual noise and speech distortion. For the perceptual experiment, a subjective measure of speech quality, the mean opinion score (MOS), is also used to evaluate the acceptability of the enhanced speech. For comparison, two other algorithms are also evaluated: the traditional spectral subtraction method [2] and the noise estimation algorithm [22].

Figure 5 shows the spectrograms of the original radar speech (a), the enhanced speech using the spectral subtraction algorithm (b), the enhanced speech using the noise estimation algorithm (c), and the enhanced speech using the proposed nonlinear multi-band spectral subtraction algorithm (d). It can be seen from Figure 5(a) that a certain amount of combined noise exists in the original radar speech; this is due to the harmonics of the MMW, the electrocircuit noise, and the ambient noise combined in the MMW radar speech. These noises are clearly visible during speech pauses and are mainly concentrated in the low-frequency components, roughly below 3 kHz. Figures 5(b) and 5(c) show that the spectral subtraction algorithm and the noise estimation algorithm are effective in reducing the combined radar noises, in both the speech and the nonspeech sections. However, there is still too much remnant noise in the enhanced speech, especially in the frequency range in which the noise is concentrated, suggesting that the noise reduction is not satisfactory. Figure 5(d) shows that the proposed multi-band spectral subtraction algorithm can not only greatly reduce the low-frequency noise but also eliminate the high-frequency noise completely. It can also be seen from the figure that the residual noise in the speech-pause regions is almost eliminated, suggesting that the multi-band spectral subtraction algorithm achieves a better reduction of the whole-frequency noise than the spectral subtraction algorithm.

Perceptual evaluation results for the original speech and the enhanced noisy speech are shown in Figure 6. Mean opinion scores (MOS) were collected for the 100 sentences produced by the ten volunteer speakers, as well as for the same sentences corrupted by white and babble noise at an SNR of 0 dB. It can be seen from the figure that the score of the enhanced speech obtained with the proposed nonlinear multi-band algorithm is the highest, followed by that obtained with the noise estimation algorithm. This is true for both the original speech and the noisy speech, suggesting that the proposed method is better suited to MMW radar speech than the other methods mentioned above.

Informal listening tests also indicated that the multi-band approach yielded very good speech quality with very little trace of musical noise and with minimal, if any, speech distortion. This is because the over-subtraction factor can be adaptively adjusted in each Bark band; the Bark band also takes into account the frequency-domain masking properties of the human auditory system, thus preventing deterioration of the speech quality during the spectral subtraction process.

Because the subtraction parameters are fixed for a given frame, the traditional spectral subtraction algorithm cannot reduce the noise effectively, especially colored noise. These limitations become more severe when enhancing MMW speech corrupted by combined electronic noise. In the multi-band spectral subtraction algorithm, the over-subtraction factor of each frequency band can be adjusted, so that the algorithm can realize a good tradeoff between reducing noise, increasing intelligibility, and keeping the distortion acceptable to a human listener. The results also indicate that the proposed algorithm can not only reduce the residual noise but also improve the low-frequency deficit of MMW radar speech.

Moreover, the proposed multi-band spectral subtraction algorithm is flexible enough to adapt to complicated speech environments for users of radar speech devices: because the over-subtraction factor of each frequency band can be adaptively adjusted, the algorithm can be fitted to different or complex speech environments. This makes it possible to obtain better speech quality through speech enhancement in harsh speech environments.

The performance of the proposed algorithm depends strongly on an important factor: the noise estimation approach. Considering the varying features of the radar noises, this study uses the minimum-statistics noise estimation method because it can quickly follow abrupt changes in the noise spectrum. Recent related spectral subtraction algorithms [13, 14], which were developed for traditional microphone speech enhancement, used a traditional noise estimation approach, namely, averaging the noisy signal over nonspeech sections. This approach is quite simple, but the reliability of the speech pause detection deteriorates severely for weak speech components and low input SNR. Therefore, to improve the performance of this nonlinear spectral subtraction method, it is important to select a noise estimation approach that is appropriate to the noise characteristics. In addition, when the total number of bands is one, the multi-band spectral subtraction algorithm reduces to the traditional power spectral subtraction approach.

4. Conclusions

To remove the colored electronic noise from MMW radar speech, a nonlinear spectral subtraction method, the multi-band spectral subtraction algorithm, is investigated in this study to take into account the nonuniform effect of colored noise on the spectrum of radar speech. Both the objective (time-frequency distribution analysis) and subjective (MOS) test results suggest that adaptively adjusting the over-subtraction factor, which is otherwise set to a constant value, can significantly reduce the colored noise and the musical noise and improve the speech quality. Furthermore, the proposed algorithm is flexible enough to adapt to complicated, harsh speech environments by adjusting the over-subtraction factor of each Bark frequency band.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (NSFC, no. 60571046), and the National postdoctoral Science Foundation of China (no. 20070411131). The authors also want to thank the participants from the E.N.T. Department, the Xi Jing Hospital, and the Fourth Military Medical University, for helping with data acquisition and analysis.