Department of Electrical Engineering and Information Sciences, Ruhr-Universität Bochum, 44801 Bochum, Germany
A new block-based noise reduction system is proposed which focuses on the preservation of transient sounds like stops or speech onsets. The power level of consonants has been shown to be important for speech intelligibility. In single-channel noise reduction systems, however, these sounds are frequently severely attenuated. The main reasons for this are an insufficient temporal resolution of transient sounds and a delayed tracking of important control parameters. The key idea of the proposed system is the detection of non-stationary input data. Depending on that decision, a pair of spectral analysis-synthesis windows is selected which provides either high temporal or high spectral resolution. Furthermore, the decision-directed approach for the estimation of the a priori SNR is modified so that speech onsets are tracked more quickly without sacrificing performance in stationary signal regions. The proposed solution shows significant improvements in the preservation of stops with an overall system delay (input-output, excluding group delay of noise reduction filter) of only 10 milliseconds.
1. Introduction
A large class of speech enhancement algorithms is realized in the spectral domain. Since their performance depends on the quality of the spectral representation of the noisy data, systems for a reliable and precise spectral analysis are required. Apart from filter bank implementations, a common approach is to compute the discrete Fourier transform (DFT) on short overlapping time domain segments [1]. Short-time DFT systems with frame overlap are attractive because of their aliasing robustness and ease of implementation [2]. The data length of a short-time segment determines the frequency resolution that is achieved after transformation: the longer the time domain segment, the higher the spectral resolution. A short data length, on the other hand, is required for a good temporal resolution. Noise reduction systems usually use a fixed data length for the short-time spectral analysis, thus making a compromise between the required spectral resolution and the minimal admissible temporal resolution [3, page 469]. This concept, however, has a major drawback: in order to achieve a sufficiently high frequency resolution, many noise reduction systems use short-time segments that are longer than the duration of stationarity of the time domain signal, so that segments span nonstationary signal sections. An example of this is a segment that contains both speech-pause and speech-active samples. As a consequence, the short-time spectrum is an average over the different statistics of the current time domain signal section. Since the spectral representation is then less pronounced, suboptimal noise reduction performance results. Using shorter data segments for the DFT would solve this problem only at the cost of a reduced spectral resolution. The resolution in this case might still be sufficient to represent relatively flat spectra such as those of burst-like signals. However, spectra that convey many details would not be sufficiently well resolved when short data segments are used for the DFT.
This trade-off between spectral and temporal resolution has been addressed in recent algorithm developments. In [4] the data length that contributes to the spectral representation is adaptively grown or shrunk according to the stationarity range of the current signal section. In another approach [5] that focuses on audio restoration, the frequency resolution is improved using an extrapolation of the time domain data prior to the computation of the short-time DFT. The disadvantages of this method are its high computational demands and the fact that the extrapolation requires perfect modeling of the signal, which is in general difficult to achieve. Furthermore, random noise cannot be properly extrapolated. In audio coding, analysis windows of different lengths and shapes are switched in a signal-dependent fashion [6, 7] in order to reduce pre-echo effects that may appear after decoding.
In many application fields like telecommunications or hearing instruments the system delay is of great importance. The group delay of a hearing instrument can produce a noticeable or even objectionable coloration of the hearing aid wearer's own voice. In [8] it is reported that a delay of 3 to 5 milliseconds was noticeable to most of a group of normal hearing listeners while a delay of longer than 10 milliseconds was objectionable. In [9] asymmetric windows are presented as a way to reduce the delay in spectral analysis. However, spectral synthesis is not discussed and would become difficult with the proposed asymmetric windows if perfect or nearly perfect reconstruction is required. The delay issue has also been addressed recently in [10] where a warped analysis-synthesis filter bank for speech enhancement is presented which achieves a very low system delay. In a DFT-based analysis-synthesis system using overlap-add for signal synthesis, the delay is given by the frame length of the synthesis window, the frame advance and the group delay of a possible spectral modification filter.
In this contribution we propose an analysis-synthesis overlap-add framework that uses different analysis-synthesis window pairs. They differ in their length (before zero-padding) and their shape. Depending on the stationarity of the current time domain signal, a proper window pair is selected for the analysis and synthesis. Data that is stationary over a relatively long span is analyzed using a long window in order to allow for a high spectral resolution, while short-time stationary data, like bursts of stops or speech onsets, are analyzed with a short data window so that the energy burst is well preserved in the spectral representation. The reduced spectral resolution that results from using a short analysis window is not considered a limitation, since for the latter class of short burst-like signals we expect relatively broadband spectra with few spectral details. The proposed system achieves perfect reconstruction and produces the same low delay irrespective of the analysis-synthesis window pair that is currently in use.
The signal dependent selection of an analysis-synthesis window pair to be used requires the knowledge of the span of stationarity in the signal. In order to find the boundaries of signal stationarity in [4] an iterative window growing algorithm is proposed that is based on a probabilistic framework. Since a necessary condition for stationarity is an invariant power of the random process, the temporal evolution of the mean power of consecutive frames is observed. Based on a likelihood ratio test a decision is made whether a neighboring frame contains data that originates from the same statistical process or not. The method requires a look-ahead over several frames of data in order to be able to determine the parameters of parameterized probability density functions (pdfs). It is thus not suited for very low delay applications. In an alternative approach [11] the detection of stationarity changes is based on an autoregressive signal model. For the reliable estimation of the model parameters a look-ahead of 20 ms is required which again is not permissible for very low delay applications that this paper focuses on. The approach presented here allows the detection of stationarity changes with a very low delay of about 2 ms.
Finally, we propose and evaluate a noise reduction system that integrates the switching of the analysis-synthesis window pair based on the detection of stationarity changes in the time domain signal. The information on stationarity boundaries can be used to additionally improve the preservation of stops and speech onsets: we propose a change of the decision-directed a priori SNR estimator [12] and the amplification of plosive-like sounds. The latter is motivated by the fact that the improvement of the consonant-vowel intensity ratio was shown to be important for improving speech intelligibility [13–15].
In Section 2 we introduce the concept of switchable analysis-synthesis window pairs and estimate the benefit and the computational cost of the approach. Then, in Section 3, we introduce a detector for stationarity changes in the time domain signal that is based on a likelihood ratio test (LRT). The analysis of the properties of the likelihood ratio helps to set a proper threshold for the LRT. In Section 4 a noise reduction system is proposed and analyzed that makes use of the nonstationarity detection. Apart from switching the analysis and synthesis windows, we propose two measures that aim at improving speech intelligibility by a preservation or amplification of speech onsets and burst-like sounds. Finally, in Section 5 we present experimental results.
2. Analysis-Synthesis Window Sets
In this section we define the spectral analysis-synthesis system that provides spectral data to the frequency domain noise reduction algorithm and synthesizes the time domain signal after possible spectral modifications.
The main idea in this section is to provide an analysis system with long and short analysis windows that are arbitrarily switchable. This allows a signal dependent selection of the appropriate analysis window. Each analysis window is matched with a particular synthesis window that guarantees perfect reconstruction for each window pair.
2.1. DFT-Based Analysis Synthesis System
We assume a sampled noisy signal that is the sum of a speech signal and uncorrelated noise.
The data are indexed by a discrete time index and sampled with sampling frequency fs.
We consider a block-based analysis system with a fixed number of frequency bins and a frame advance of R samples. If we restrict the system to uniform frequency resolution, the discrete Fourier transform (DFT) can be used and efficiently implemented by means of a fast Fourier transform (FFT) algorithm. The spectral coefficients of the sampled time domain data are then obtained by weighting each frame with an analysis window and computing a DFT of fixed length.
The coefficients are indexed by the subsampled (frame) index and the discrete frequency bin index.
The spectral coefficients may then be weighted with a spectral gain before the signal synthesis is performed via IDFT, multiplication with a synthesis window, and a subsequent overlap-add operation [1].
2.2. Switchable Analysis-Synthesis Window Sets
In [16] a system with switchable analysis-synthesis window pairs is proposed which achieves perfect reconstruction and can provide spectral or temporal resolution in a flexible manner while always realizing the same small delay. The main ideas that are underlying the window design are the following.
(i) Since the spectral and temporal resolution of an analysis system is governed by the length of the analysis window, analysis windows of different lengths have to be provided for a system with maximum flexibility.
(ii) The delay in an overlap-add system is basically determined by the length of the synthesis window. Therefore, in order to realize the same short delay for all window pairs in a switchable analysis-synthesis system, the synthesis windows have to be of the same length regardless of the length of the associated analysis window.
(iii) In order to allow for an arbitrary frame-by-frame switching between different analysis-synthesis window pairs, in an overlap-add system the product of analysis and associated synthesis window has to be the same for all window pairs.
(iv) The analysis-synthesis system should be perfectly reconstructing whenever no processing is applied.
(v) The windows shall have reasonable frequency responses to avoid aliasing and imaging distortions.
For the subsequent investigations we use the window set example in Figure 1 [16]. It is designed for a DFT of fixed length with a frame advance of R = 32 samples at 16 kHz sampling frequency and consists of two analysis-synthesis window pairs. The first window pair uses a zero-padded square-root Hann window of length 128 for both the analysis and the synthesis window. The product of analysis and synthesis window is a length-128 Hann window. The second window pair provides a long asymmetric analysis window with square-root Hann slopes, which is padded with zeros to alleviate spectral aliasing. The associated short synthesis window is designed such that the product of analysis and synthesis window again results in the same length-128 Hann window as for the short window pair. Therefore, arbitrary switching between the two window pairs is possible without violating perfect reconstruction, provided that the signal is not modified otherwise.
Figure 1: Example for a low delay switchable analysis-synthesis window set guaranteeing perfect reconstruction. The window pair in the left column has good temporal resolution while the pair in the right column provides a good spectral resolution. The asymmetry of the long analysis window emphasizes most recent samples.
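The perfect-reconstruction property of the short window pair can be checked numerically. The following is a minimal sketch, not the authors' implementation: it uses the length-128 square-root Hann pair and the frame advance of 32 samples at 16 kHz described above, while the DFT length `NFFT = 512`, the position of the zero-padding, and the function names are assumptions made only for illustration. The delay figure assumes that the synthesis window length and the frame advance simply add, and it excludes the group delay of any spectral modification.

```python
import numpy as np

FS = 16000      # sampling frequency in Hz (from the text)
WIN_LEN = 128   # length of the short square-root Hann windows (from the text)
HOP = 32        # frame advance R in samples (2 ms at 16 kHz)
NFFT = 512      # ASSUMED DFT length, chosen only for illustration


def sqrt_hann(length):
    """Periodic square-root Hann window."""
    n = np.arange(length)
    return np.sqrt(0.5 - 0.5 * np.cos(2.0 * np.pi * n / length))


def analysis_synthesis(x):
    """Windowed DFT analysis followed by IDFT and overlap-add synthesis."""
    w = sqrt_hann(WIN_LEN)
    # The overlapping Hann products sum to sum(w**2) / HOP; rescale to unity.
    scale = HOP / np.sum(w ** 2)
    y = np.zeros(len(x))
    for start in range(0, len(x) - WIN_LEN + 1, HOP):
        frame = np.zeros(NFFT)
        frame[:WIN_LEN] = w * x[start:start + WIN_LEN]  # zero-padded frame
        spec = np.fft.rfft(frame)          # spectral coefficients (unmodified)
        out = np.fft.irfft(spec, NFFT)[:WIN_LEN]
        y[start:start + WIN_LEN] += scale * w * out     # synthesis window + OLA
    return y


rng = np.random.default_rng(0)
x = rng.standard_normal(2048)
y = analysis_synthesis(x)

# Only samples covered by all overlapping frames are reconstructed exactly.
interior = slice(WIN_LEN, len(x) - WIN_LEN)
max_err = np.max(np.abs(y[interior] - x[interior]))
delay_ms = 1000.0 * (WIN_LEN + HOP) / FS
print(max_err, delay_ms)
```

With these values the delay evaluates to 10 ms, which agrees with the overall system delay stated for the proposed system.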
2.3. Analysis of Energy Gain Using Switched Windows
As mentioned before, short analysis windows provide a high temporal resolution. This implies that the energy of nonstationary signal sections, like bursts of plosive speech sounds, is better captured with a short analysis window than with a long one. In the following, we quantify this effect.
A gain can be defined as the ratio of the signal power captured under the short analysis window to the power that would be captured under the long analysis window.
The windows in the numerator and denominator of the gain (3) are normalized to unit power. Since the short window is zero-padded, the outer sum in the numerator needs to cover only the nonzero samples of that window.
In the best case the nonstationarity (e.g., a speech onset) occurs like a step function and coincides exactly with the limits of the short window, therefore maximizing the power that can be captured under the short window. This scenario is illustrated in Figure 2, which shows the power of the noisy signal, the speech power, and the noise power.
Figure 2: Idealized scenario of a speech onset in noise. The speech onset coincides with the short spectral analysis window (the most recent samples). The analysis window is either (a) rectangular or (b) tapered.
Speech is assumed to be statistically independent of the noise process and appears only during the most recent samples. If additionally the short and the long window are assumed to be rectangular (Figure 2(a)), (3) simplifies to an estimate of the maximal achievable gain that depends only on the a priori SNR and on the ratio of the two window lengths.
In Figure 3 this expression is evaluated as a function of the a priori SNR and for several ratios of the length of the short window to the length of the long window. The solid lines show the result for the assumed rectangular windows; the dashed lines show the expected gain if the proposed tapered windows of Figure 2(b) are used instead. The gain of the tapered windows is always smaller than that obtained with rectangular windows. We find that using the proposed short window instead of the proposed long window during a burst-like speech sound at 15 dB a priori SNR improves the spectral representation by roughly 4 dB. Rectangular windows would yield a gain of about 5.6 dB under these conditions. Due to their unfavorable spectral properties, rectangular analysis windows are not an alternative; they serve here only as an upper bound for the possible gain. Note that an increase in the spectral representation of only a few dB may already change the filter behavior in a noise reduction application in a way that stops are better preserved.
Figure 3: Expected spectral power gain versus a priori SNR for using a short analysis window versus a long window during speech onsets. The curves are labeled with the ratio of the lengths of the short and long windows. The solid lines show the results for rectangular analysis windows, the dashed lines the results for the proposed window set. The envelope of the time domain signal is idealized as a step function where the section with increased power coincides with the boundaries of the short window.
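The rectangular-window upper bound can be sketched from the idealized scenario alone. Under the assumptions stated in the text (noise present over the whole long window, speech present only under the short window, unit-power rectangular windows), the short window captures the speech power plus the noise power, whereas the long window captures the noise power plus the speech power scaled by the length ratio. The function name and the length ratio of 0.25 used below are illustrative assumptions, not values taken from the paper.

```python
import numpy as np


def max_gain_db(snr_db, length_ratio):
    """Upper bound of the power gain (short vs. long rectangular window).

    snr_db       : a priori SNR during the burst, in dB
    length_ratio : short window length divided by long window length (<= 1)
    """
    xi = 10.0 ** (snr_db / 10.0)
    # Short window: speech + noise. Long window: diluted speech + noise.
    gain = (xi + 1.0) / (length_ratio * xi + 1.0)
    return 10.0 * np.log10(gain)


# ASSUMED length ratio of 1/4, chosen only for illustration.
g = max_gain_db(15.0, 0.25)
print(round(g, 2))
```

For this assumed ratio the sketch gives about 5.64 dB at 15 dB a priori SNR, close to the rectangular-window value of about 5.6 dB quoted above. The gain vanishes when both windows have the same length and grows with the SNR.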
In the preceding analysis we assumed that the nonstationarity coincides exactly with the limits of the short window. If this is not the case, the gain decreases. Therefore it is advisable to operate the analysis system with a small frame advance R. For the proposed window set in Figure 1, a choice of R = 32 samples turned out to be a good compromise between computational complexity and a sufficiently high temporal resolution. In terms of the filter-bank interpretation of a DFT analysis system, a small frame advance corresponds to an oversampled system, which is also frequently used to reduce aliasing effects [1, page 339].
2.4. Computational Complexity
For the ease of use of a flexible spectral analysis-synthesis system it is desirable that the system behaves transparently to the spectral domain application when switching from one to another window pair. The system is therefore required to always provide the same number of spectral components no matter which window set is active. For this reason all analysis windows are zero-padded to the same length so that a DFT of one and the same length can be computed. Zero-padding the short window does not increase the spectral resolution but corresponds to an interpolation of the spectral data of the short window. This increases the computational complexity as compared to the case of a standard system with a short analysis window without zero-padding but allows for a more flexible allocation of temporal and spectral resolution. Compared to a standard long window for analysis and synthesis the proposed solution is less complex, more flexible in terms of spectrotemporal resolution and has a lower delay.
With these considerations, the total number of multiplications can be estimated for the following three cases.
(a) Standard system with symmetric short analysis and short synthesis windows, for example, square-root Hann windows.
(b) Proposed flexible analysis-synthesis system with a set of two window pairs:
(1) short analysis window and short synthesis window,
(2) long (asymmetric) analysis window and short (asymmetric) synthesis window.
(c) Standard system with symmetric long analysis and long synthesis windows, for example, square-root Hann windows.
Complexity is determined here in terms of the number of real-valued multiplications. These have been determined in [17] for the calculation of an N-point FFT or IFFT.
If, as in the classical analysis-synthesis system, only a short window is used without zero-padding, the complexity is that of an FFT of the short window length. Padding this window to the common DFT length increases the number of multiplications to that of an FFT of the longer length. However, as most of the input data to the FFT is then zero, advanced techniques for pruned FFTs may reduce the complexity [18].
Further multiplications are required when weighting the input data with the analysis window and the processed data with the synthesis window. Here, only the nonzero samples of every window have to be multiplied with the data.
Table 1 reports the computational complexity relative to the complexity of case (a). Furthermore, the temporal resolution, the frequency bin spacing, and the system delays are indicated for a sampling frequency of fs = 16 kHz and for a frame advance of R = 32 samples, which corresponds to 2 milliseconds. The relative computational complexity of the proposed solution varies between 3.9 and 4.7 depending on how frequently the short analysis window or the long asymmetric analysis window is used. In case (a), the system delay is only 10 milliseconds. However, when the short window pair of case (a) is applied in a noise reduction system, the denoised speech and the residual noise sound harsh and unnatural. In case (b), the system delay is also low, but during longer stationary signal sections like vowels or speech pauses the long asymmetric window can be used, resulting in more natural sounding speech and residual noise. The short window pair in case (b) should be applied during transitions or during the burst of a stop. Since in speech, and in particular in speech pauses, stationary signal sections dominate transient sections, the long analysis window will be used more frequently than the short one, so that an effective relative computational complexity close to that of the long window pair can be expected. While the computational complexity increases when using the proposed solution (b) instead of (a), it provides a considerably improved frequency bin spacing, which in principle allows pitch harmonics to be resolved. A similarly high resolution is obtained in case (c) only at the price of a much greater system delay and an even slightly higher computational complexity.
Table 1: Comparison of proposed window set (center column) with a standard analysis-synthesis system using short (left column) or long (right column) standard analysis and synthesis windows. The values are indicated for a sampling frequency fs = 16 kHz and a frame advance of R = 32 samples (2 ms). The effective complexity of the proposed solution varies between 3.9 and 4.7 depending on the rate of use of either the short or the long analysis window. Typically, the long analysis window will be used more frequently than the short analysis window.
3. Detecting Stationarity Boundaries
In this section we develop a detector for stationarity boundaries which controls the selection of the windows to be used for the spectral analysis of the current segment. Since for a real-time application this decision has to be made frame by frame, the detector is optimized for decisions with very low latency. This is an important aspect in which our solution differs from other approaches, which either use statistical models whose free parameters need to be estimated over several frames [4] or require a number of samples corresponding to at least 20 ms [11]. The algorithm presented in the following operates on the sampled time domain data. It also indicates how reliable the stationarity decision is.
3.1. Task and Hypotheses
Given a stream of time domain sampled data (see Figure 4) we want to decide whether the latest K2 samples (block 2) are likely to originate from the same statistical process as the preceding K1 samples (block 1). Thus, we have the following hypotheses:
Figure 4: Definition of blocks 1 and 2, consisting of K1 and K2 samples, respectively.
H0: the samples in block 2 originate from the same statistics as those in the preceding block 1, that is, the data is stationary over both blocks;
H1: the samples in block 2 are supposed to follow different statistics than the samples in block 1 (detection of a stationarity boundary between the two blocks of data).
The lengths K1 and K2 can be set arbitrarily. A necessary condition for stationarity is that the mean power of the process is constant over time. We inherently assume ergodicity of the random processes in the respective blocks of data, since we replace the ensemble mean by the mean over the consecutive observations (e.g., squared time domain samples) within the respective blocks.
Furthermore, it is assumed that the samples within each of the two blocks are independent identically distributed (i.i.d.) and are wide-sense stationary within each block. This assumption may be violated, for example, during voiced speech or when the boundary between block 1 and block 2 does not coincide with the stationarity boundary. In practice, however, it turns out that with a proper parameter setting of the detector the stationarity detection works well even in these cases.
3.2. Likelihood-based Hypothesis Test
The hypothesis H0 is tested with a likelihood ratio test (LRT). This requires knowledge of the probability density function (pdf) that describes the distribution of the squared samples in block 2 under hypothesis H0 or H1.
Assuming that the observed time domain samples are realizations of a zero-mean Gaussian random variable, the squared observations are chi-square distributed with one degree of freedom [19, Equations (5.33), (5.65)].
Their pdf is zero for negative arguments. The mean of the squared random variable is the variance of the noisy time-domain samples.
Given hypothesis H0, the data in both blocks originate from the same statistical process, so that the pdf describing the distribution of the squared samples in block 2 can be formulated using the variance of the noisy time-domain samples in block 1.
If, on the other hand, the data in block 2 originate from different statistics than the data in block 1 (hypothesis H1), the mean power has to be defined using only the data of block 2.
Both conditional pdfs are zero for negative arguments.
The variances of the noisy observations in blocks 1 and 2 constitute random variables, which may be approximated by their respective maximum likelihood estimates (10) and (11).
Given the squared observations in block 2, a likelihood ratio (LR) test is defined by comparing the LR (13) with a decision threshold.
The threshold is to be set to a reasonable value from the interval of possible LR values.
The LR value indicates whether the observed mean energy could have originated from both distributions with equal or similar likelihood (LR close to one) or whether the statistics are significantly different. In the latter case we reject H0 and decide that a stationarity boundary has been detected; in the former case we accept H0, as we have no sufficient evidence that stationarity has been violated. The LR value itself gives information on the reliability of the decision. The more it approaches zero, the more reliable is the decision for H1. Accordingly, values close to one indicate a highly reliable decision for H0.
The value of the decision threshold controls the trade-off between detection and false alarm rates. The higher the threshold, the more stationarity boundaries are detected, at the cost of an increased false alarm rate. In the next section we analyze the LR expression and investigate the relation between the threshold and the probabilities of detection and false alarm.
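The test can be sketched in a few lines. The closed form used below is an assumption: it treats the mean energy of block 2 as a single observation with a one-degree-of-freedom chi-square density, evaluated once with the block 1 variance (H0) and once with the block 2 variance (H1). With r denoting the ratio of the two variance estimates, this gives LR = sqrt(r) * exp((1 - r) / 2), which lies between zero and one and equals one for r = 1. The function names and the threshold value are hypothetical.

```python
import numpy as np


def likelihood_ratio(block1, block2):
    """LR of the block-2 mean energy under H0 versus H1 (ASSUMED closed form)."""
    var1 = np.mean(np.square(block1))  # ML variance estimate, block 1
    var2 = np.mean(np.square(block2))  # ML variance estimate, block 2
    r = var2 / var1                    # requires nonzero power in block 1
    return np.sqrt(r) * np.exp(0.5 * (1.0 - r))


def is_stationarity_boundary(block1, block2, threshold):
    """Reject H0 (declare a boundary) when the LR falls below the threshold."""
    return likelihood_ratio(block1, block2) < threshold


# Deterministic illustration: equal power, then a fourfold power increase.
same = likelihood_ratio(np.ones(128), np.ones(32))
rise = likelihood_ratio(np.ones(128), 2.0 * np.ones(32))
print(same, rise, is_stationarity_boundary(np.ones(128), 2.0 * np.ones(32), 0.5))
```

A higher threshold makes the detector trigger on smaller power changes, which mirrors the trade-off between detection rate and false alarm rate described above.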
3.3. Analysis of the Likelihood Ratio
In the sequel an expression will be derived for the likelihood ratio as a function of the SNR in the first block of data and the change of the SNR at the transition from block 1 to block 2. The analysis of the expected LR values and their variance helps to properly set the detection threshold.
3.3.1. Expected LR Value
Assuming that speech and noise are statistically independent random variables, the variance of the observed signal is the sum of the speech variance and the noise variance. Therefore, the variances in blocks 1 and 2 may each be written in terms of the noise variance and the a priori SNR of the respective block (14).
With these relations the LR (13) can be written as a function of the a priori SNR in block 1 and the change of the SNR at the transition from block 1 to block 2 (15).
If the additive noise is stationary over both blocks, that is, the noise variance is the same in blocks 1 and 2, the likelihood ratio simplifies to (16).
Figure 5 illustrates the LR (16). The following conclusions can be drawn.
Figure 5: Contour plot of the simplified likelihood ratio (16). The noise is assumed to be stationary. The axes denote the a priori SNR in block 1 and the step of the a priori SNR at the transition from block 1 to block 2.
(i) Detection of an SNR increase (positive SNR step):
(1) the lower the SNR in block 1, the higher the SNR step has to be in order to produce noticeably small LR values. However, the steepness of the LR function, that is, the decrease of the LR value as a function of the SNR step at a constant SNR in block 1, is similar for all SNR values;
(2) at higher SNR in block 1, the LR shows similar sensitivity to SNR increases.
(ii) Detection of an SNR decrease (negative SNR step):
(1) below 0 dB SNR a detection of an SNR decrease is impossible. This is plausible, as an SNR decrease of a signal that is already severely disturbed does not considerably lower the power of the disturbed signal, on which the detection is based;
(2) at higher SNR the LR values decrease in a similar manner over the SNR step, but less steeply than for the detection of an SNR increase;
(3) we observe a saturation of the LR values at a level that increases with decreasing SNR in block 1. For example, there is an SNR in block 1 at which an expected LR value of less than 0.48 is not possible, irrespective of the magnitude of the SNR drop.
In (16) (cf. Figure 5) the noise is assumed to be stationary over blocks 1 and 2, which is not always the case, for example for babble or cafeteria noise. Figure 6 shows the LR function for an assumed noise power increase of 6 dB at the transition from block 1 to block 2. During a speech pause the SNR is already very low, and a noise burst further degrades the SNR. In this case the LR function returns smaller values than in the case of stationary noise (cf. Figure 5). Therefore, depending on the level of the decision threshold, the detector might trigger on the noise burst. This example illustrates that the detector responds to any nonstationarity and cannot distinguish between speech and noise.
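The saturation for SNR decreases can be made concrete with a small sketch. It assumes stationary noise and the same hypothetical closed form as before, LR = sqrt(r) * exp((1 - r) / 2), where r is the ratio of the block 2 variance to the block 1 variance. With independent speech and noise, r = (xi1 * step + 1) / (xi1 + 1), where xi1 is the linear a priori SNR in block 1 and step is the linear SNR change. If the speech vanishes completely in block 2, r tends to 1 / (xi1 + 1), which bounds the LR from below. The function names and the evaluated SNR values are illustrative assumptions.

```python
import numpy as np


def expected_lr(snr1_db, snr_step_db):
    """Expected LR for stationary noise (ASSUMED closed form)."""
    xi1 = 10.0 ** (snr1_db / 10.0)
    step = 10.0 ** (snr_step_db / 10.0)
    r = (xi1 * step + 1.0) / (xi1 + 1.0)  # block-2 over block-1 variance
    return np.sqrt(r) * np.exp(0.5 * (1.0 - r))


def lr_floor(snr1_db):
    """Smallest expected LR reachable by an arbitrarily large SNR drop."""
    xi1 = 10.0 ** (snr1_db / 10.0)
    r = 1.0 / (xi1 + 1.0)                 # speech disappears, noise remains
    return np.sqrt(r) * np.exp(0.5 * (1.0 - r))


for snr1_db in (0.0, 10.0, 20.0):
    print(snr1_db, round(float(lr_floor(snr1_db)), 3))
```

Under these assumptions the floor is about 0.91 at 0 dB, which is in line with the observation that SNR decreases are practically undetectable at low SNR, and about 0.475 at 10 dB, which is close to the saturation value of 0.48 mentioned in the text.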
Figure 6: Contour plot of the likelihood ratio (15) assuming a noise power rise from block 1 to block 2. The axes denote the a priori SNR in block 1 and the step of the a priori SNR at the transition from block 1 to block 2.
3.3.2. Variance of the LR Values
The variances of the modelled random processes in blocks 1 and 2 have to be estimated from the given data (cf. (10), (11)). As a consequence, the LR is a random variable with a mean and a variance. Since the LR is a transcendental function of the two variance estimates, an analytic expression for the pdf of the LR is difficult to derive. In the following we therefore simulate the LR values for normally distributed input data and determine the histograms of the LR for a given SNR in block 1 and for a given SNR step at the transition from block 1 to block 2. In Figure 7 five histograms are plotted for five SNR steps and a constant SNR in block 1. We observe that the variance of the LR is particularly large for one of the SNR steps (light blue) and small for two others (green and dark blue).
Figure 7: Histograms of the LR values for a constant SNR in block 1. The amplitudes of the time-domain signal are Gaussian i.i.d. The distribution is broadest for SNR increases at which the LR takes intermediate values.
We measured the variance of the distributions of the LR values not only for the five exemplary distributions in Figure 7, but for each pair of SNR in block 1 and SNR step, with a resolution of 1 dB. The result is presented as a contour plot in Figure 8. The crosses indicate the five SNR combinations for which the distributions in Figure 7 have been shown. We notice that the variance is highest (about 0.045) for SNR increases and LR values close to 0.5 (compare with Figure 5), while for an SNR decrease the variance of the estimated LR is about one order of magnitude smaller. Therefore, the distributions of the LR values that are associated with the upper right quadrant in Figure 5 are relatively broad compared to those associated with the lower right quadrant (see also Figure 7). The impact of this observation on detection and false alarm probabilities will be discussed in Section 3.4.
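Such a simulation can be sketched as follows. This is not the authors' experiment: the block lengths `k1 = 128` and `k2 = 32`, the number of trials, the SNR values, and the closed form LR = sqrt(r) * exp((1 - r) / 2) are all assumptions made for illustration. The noise variance is normalized to one, and speech plus noise is modelled as a single Gaussian process with variance one plus the a priori SNR.

```python
import numpy as np


def simulate_lr(snr1_db, snr_step_db, k1=128, k2=32, trials=20000, seed=0):
    """Draw LR values for Gaussian i.i.d. data (block lengths are ASSUMED)."""
    rng = np.random.default_rng(seed)
    xi1 = 10.0 ** (snr1_db / 10.0)
    xi2 = xi1 * 10.0 ** (snr_step_db / 10.0)
    # Unit noise variance; total variance is 1 + a priori SNR in each block.
    block1 = rng.standard_normal((trials, k1)) * np.sqrt(1.0 + xi1)
    block2 = rng.standard_normal((trials, k2)) * np.sqrt(1.0 + xi2)
    r = np.mean(block2 ** 2, axis=1) / np.mean(block1 ** 2, axis=1)
    return np.sqrt(r) * np.exp(0.5 * (1.0 - r))


lr_up = simulate_lr(10.0, 6.0)     # SNR increase
lr_down = simulate_lr(10.0, -6.0)  # SNR decrease
hist_up, edges = np.histogram(lr_up, bins=50, range=(0.0, 1.0))
print(np.var(lr_up), np.var(lr_down))
```

In this sketch the LR variance for the SNR increase is several times larger than for the SNR decrease, which matches the qualitative behavior described above. The absolute values depend on the assumed block lengths.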
Figure 8: Contour plot of the variance of the LR estimates. The variance has been determined empirically. The axes denote the SNR in block 1 and the change in SNR at the transition from block 1 to block 2. The crosses illustrate the points for which the LR histograms are plotted in Figure 7.
3.3.3. Optimal Block Lengths K1 and K2
A result from the preceding section is that for robust decisions the block lengths and should be as large as possible in order to reduce the variance of the estimates (10) and (11), therefore reducing the variance of the LR (13). At the same time block should be short enough to span (in the majority of cases) data from only one statistical process. If block contains data from more than one statistical process the power measurement via (11) would be misleading, resulting in a wrong estimate of the SNR change.
For the low-delay detection of stops, for example, the duration of block should not exceed a few ms. This is the typical duration of the brief burst that is produced after release of the vocal tract occlusion [20, Section 3.4.7]. Therefore, we set the duration of block to milliseconds.
In a frame-based implementation with a frame shift of samples we extend the length by the latest samples whenever (stationarity) was accepted in the preceding frame. By this, the variance of the maximum likelihood estimate (10) that is required in (13) can be reduced, leading to more robust decisions. Whenever in the preceding frame shift a nonstationarity boundary has been detected, this extension is stopped and the data which is based on is reset to only the latest samples.
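The block-length rule described above can be sketched as follows. This is only an illustration under stated assumptions: the names `update_block_length`, `k_min`, `k_max`, and `frame_shift` are hypothetical and not the paper's notation, and the upper limit `k_max` is taken from the maximum length mentioned in Section 4.1. The numeric values are made up.

```python
def update_block_length(current_len, frame_shift, k_min, k_max,
                        nonstationarity_detected):
    """Return the length (in samples) of the growing block for the next frame.

    All parameter names are illustrative assumptions, not the paper's symbols.
    """
    if nonstationarity_detected:
        # A boundary was found: restart from only the latest k_min samples.
        return k_min
    # Stationarity was accepted: append the latest frame_shift samples,
    # but never exceed the upper limit k_max.
    return min(current_len + frame_shift, k_max)


# Hypothetical numbers, chosen only for illustration.
lengths = []
length = 32
for detected in [False, False, False, True, False]:
    length = update_block_length(length, frame_shift=16, k_min=32, k_max=64,
                                 nonstationarity_detected=detected)
    lengths.append(length)
```

In this toy run the block grows while stationarity is accepted, saturates at the cap, and falls back to its initial length right after a detection.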
3.4. Detection Probability and False Alarm Probability
The proposed detector can also be characterized by its detection and false alarm probabilities. Using the probability density function (pdf) of the LR values, for a given SNR and a given change of the SNR, we define the following.
(i) False alarm: a nonstationarity is detected although the signals in block and originate from the same statistical process, that is, the expected SNR difference is dB. We denote the probability associated with this event
(ii) Missed detection: although the data in block 1 and block 2 originate from different statistics, that is, the expected SNR difference is dB, a nonstationarity is not detected. The associated probability is denoted by
The detection probability is defined as . In the sequel we determine the detection probability and the false alarm probability of the proposed detector. The pdf is again approximated with histograms.
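As a rough sketch of how these probabilities can be read off empirical LR samples, the snippet below counts how often the detector would fire. It assumes, as in the example of Section 3.5, that the detector fires when the LR falls below the decision threshold. The function name `firing_rate`, the threshold value, and all sample values are made up for illustration.

```python
def firing_rate(lr_values, threshold):
    """Fraction of LR values below the threshold, i.e. how often the detector fires."""
    if not lr_values:
        raise ValueError("need at least one LR value")
    return sum(1 for lr in lr_values if lr < threshold) / len(lr_values)


# Made-up LR samples, for illustration only.
lr_same_process = [0.95, 0.90, 0.99, 0.40, 0.97, 0.93, 0.98, 0.91, 0.96, 0.94]
lr_snr_step = [0.10, 0.30, 0.05, 0.60, 0.20, 0.15, 0.45, 0.02, 0.25, 0.35]

threshold = 0.5  # hypothetical decision threshold
p_false_alarm = firing_rate(lr_same_process, threshold)  # no true SNR change
p_detection = firing_rate(lr_snr_step, threshold)        # true SNR change present
p_miss = 1.0 - p_detection
```

With histograms instead of raw samples, the same quantities are obtained by summing the relative frequencies of all bins below the threshold.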
As an example let us first consider the detection of SNR increases (e.g., bursts or speech onsets) of dB at dB SNR. We ask for the decision threshold that is necessary to detect 95% of these SNR rises. The top plot in Figure 9 shows for every SNR change ( dB resolution) the distribution of the LR values for the given SNR dB. The natural logarithm of the relative frequencies is mapped to gray levels. The dashed lines show the 5%- and the 95%-percentile of the distributions. The distributions are broadest for dB. For dB the variances of the distributions are very small.
Figure 9: (a) Distributions of the LR for dB. The gray scale represents the natural logarithm of the measured relative frequencies of the LR values. The variance of the histograms is high during the decrease of the LR mean (dotted line). (b) Detection probabilities for three decision thresholds as a function of the observed SNR change .
The lower plot in Figure 9 shows the detection probabilities as a function of the SNR step for three thresholds . With a threshold of (thin line) almost 100% of the SNR steps greater than dB are detectable. However, the false alarm rate which is found at dB is %, which is unacceptably high.
With a threshold of about 95% of the SNR steps dB can be detected while detections at dB are expected with probability . Although this false alarm probability is relatively small, we see that for every small SNR step in the interval dB detections occur with a considerable rate. In order to detect mainly those SNR steps that exceed a certain SNR threshold, the decision threshold has to be decreased. The thick solid line shows the detection probability for . In this case SNR increments between and dB SNR do not result in a significant detection rate. Only if the SNR rise is larger than 5 dB does the detection rate increase, attaining 95% for dB. The low threshold is thus advantageous if only considerable changes in the SNR of at least five to ten dB should be detected.
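The question of which decision threshold detects a given share of SNR rises amounts to reading a percentile from the LR distribution for that SNR step. A minimal sketch, assuming the detector fires for LR values at or below the threshold and using made-up LR samples (the function name is hypothetical):

```python
def threshold_for_detection_percent(lr_values, target_percent):
    """Smallest observed LR value t such that at least target_percent percent
    of the samples satisfy lr <= t (the detector is taken to fire for lr <= t)."""
    if not lr_values:
        raise ValueError("need at least one LR value")
    ordered = sorted(lr_values)
    # Integer ceiling of target_percent * n / 100, avoiding float rounding.
    needed = -(-target_percent * len(ordered) // 100)
    needed = max(needed, 1)
    return ordered[needed - 1]


# Made-up LR samples for one fixed SNR and SNR step.
lr_at_snr_step = [i / 20 for i in range(20)]
threshold = threshold_for_detection_percent(lr_at_snr_step, 95)
detected = sum(1 for lr in lr_at_snr_step if lr <= threshold)
```

A threshold chosen this way says nothing about false alarms; those follow from applying the same threshold to the LR distribution for an unchanged SNR.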
While Figure 9 shows the detection probability for an exemplary SNR dB, in the same way the detection probabilities for a given threshold can be determined for all dB. The result for and dB resolution is shown in Figure 10 as a contour plot. The dotted red line indicates those cases where the detection probability equals . Additionally, SNR decreases are detected in the same fashion and can be distinguished from SNR rises by comparison of the estimated variances (10), (11) in block and .
Figure 10: Detection probability, , as a function of the SNR in block and the SNR change from block to block . The dotted red line highlights the cases where .
3.5. Example
In Figure 11 we show the use of the detector for the detection of strong phoneme onsets in continuously spoken disturbed speech. The assumed Gaussianity of the pdfs of speech and noise is approximately fulfilled, in particular during unvoiced speech, like stops. The clean speech [21] was mixed with speech-shaped noise to an SNR of dB (bottom plot). The phonetic labels are printed on the plot. In the upper part of the figure the LR values are given. The duration of block was set to milliseconds. Whenever the LR falls below (dashed line) the detector fires. In this example bursts of stops are detected robustly and in time. The phoneme [k] shortly before 0.6 seconds is not detected. An analysis of the SNR reveals that dB and the SNR increase is dB during the burst of the phoneme. Regarding the preceding analysis of the detector it is clear that a detection under these severe conditions is not possible with the given threshold (cf. Figure 5). The decisions are obtained within only ms delay (, sampling rate kHz).
Figure 11: Example usage of the detector for low-delay detection of instationarities in continuously spoken disturbed speech (bottom, “complexity of complete marketing planning”, [21]). The LR values are plotted on top; the decision threshold is represented by the dash-dotted line.
3.6. Evaluation of Detection Performance
The proposed detector was used in a framework to verify its performance. A total number of 4200 clean speech sentences from the TIMIT database [21] have been disturbed with stationary speech-shaped noise, each at a mean segmental SNR of 10 dB. Then, using the phonetic labels of the TIMIT database, the number of occurrences of each phoneme was counted. For each occurrence of a phoneme it was recorded whether it was detected by the proposed detector or not. In case of a detection, the SNR increase, , and the SNR during the detection have been recorded. If the detector did not fire, the maximum SNR increase within the boundaries of the phoneme, , and the respective SNR, , have been recorded in order to document, at which SNR increase the detector failed to fire. The detection threshold is set to .
Given these data, histograms of the occurrences of a phoneme in the plane spanned by and can be created. This is illustrated for the stop “t” in Figure 12. In the same manner the detection counts and the missed detections are illustrated in Figures 13 and 14, respectively.
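The bookkeeping behind these histograms can be sketched as below. The record layout (SNR in dB, SNR increase in dB, detection flag), the 1 dB bin width, and all values are assumptions made for illustration only.

```python
import math
from collections import Counter


def histogram_2d(points, bin_width_db=1.0):
    """Count points per (SNR bin, SNR-increase bin); bins are indexed by floor."""
    counts = Counter()
    for snr_db, delta_snr_db in points:
        key = (math.floor(snr_db / bin_width_db),
               math.floor(delta_snr_db / bin_width_db))
        counts[key] += 1
    return counts


# Made-up records for one phoneme: (SNR in dB, SNR increase in dB, detected?)
records = [(-4.2, 21.7, True), (-4.8, 21.1, True),
           (-12.5, 6.3, False), (3.4, 15.0, True)]

all_hist = histogram_2d([(s, d) for s, d, _ in records])
detected_hist = histogram_2d([(s, d) for s, d, hit in records if hit])
missed_hist = histogram_2d([(s, d) for s, d, hit in records if not hit])
```

By construction, the detected and missed histograms add up bin by bin to the histogram of all occurrences.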
Figure 12: Occurrences of phoneme “t” in sentences disturbed at 10 dB mean segmental SNR.
Figure 13: Detected occurrences of phoneme “t” ().
Figure 14: Missed detections of phoneme “t” ().
It can be concluded from Figure 12 that under the given measurement conditions during the closure of the stop “t” the SNR is roughly dB and the SNR rise during the burst is around 40 dB. If the SNR increase leads to an SNR close to or less than 0 dB the stop cannot be detected (Figure 14). In this case a multichannel spatial preprocessing (e.g. [22]) can help to improve the SNR prior to the detection.
Occurrences of the stop “t” whose coordinates in this plane correspond to a small LR value can be robustly detected (Figure 13). The histogram thus confirms the theoretical considerations of the preceding subsections.
The experiment could be repeated for a higher or a lower input noise level. This would make the histograms shift towards lower, respectively, higher SNR , so that fewer, respectively, more phonemes would be detected.
Since the detector is sensitive to any transient, we also expect detections of noise bursts in nonstationary environments such as cafeteria noise. If this is to be prevented, the detection threshold could be lowered in nonstationary environments. In a hearing instrument this can be triggered by manually selecting a situation-specific hearing-aid program, or it could be controlled by an automatic classification system as used in state-of-the-art hearing instruments [23].
4. Modifications to the Noise Reduction System
With the ability of detecting instationarities in disturbed speech the classical noise reduction system is extended as illustrated in Figure 15. The detection of instationarities is based on the highpass-filtered input signal, . As many noise types show a lowpass characteristic, highpass filtering improves the SNR prior to the detection and hence helps to improve the detection rate. Given the likelihood ratio, , at the output of the detector, in the following paragraphs we discuss three possible measures that can be applied on their own or in combination.
Figure 15: Overview of a noise reduction system with nonstationarity detection. The stationarity decision controls the choice of the spectral analysis-synthesis window and the estimation of the clean speech spectral coefficients.
In short we propose to
(i) switch the analysis (and synthesis) window of the spectral analysis system for a better temporal resolution during transitional segments;
(ii) adapt the decision-directed estimator for the a priori SNR [12] to allow for a faster and more precise tracking of the a priori SNR during transitions;
(iii) amplify a segment that has been classified as transitional to improve speech intelligibility [14, 15].
4.1. Window Switching
Figure 16 illustrates how the nonstationarity detection is applied to the spectral analysis-synthesis window sets presented in Section 2. Block has a length of samples ( ms), centered on the short analysis window, . Block is initially also of length samples but grows by samples per frame shift as long as no nonstationarity is detected and a maximum length of samples is not exceeded. As argued before, this strategy effectively reduces the variance of the LR estimate. If a nonstationarity is detected, block is reset to the last samples.
Figure 16: Stationarity detection applied to the analysis system with frame advance R and short windows of length . In the example, the analysis window is switched from long asymmetric to short symmetric. Block consists of samples. Block has initially length but grows by samples as long as no nonstationarity is detected and until an upper limit for the length is reached.
4.2. Modified Decision-Directed Approach
In [12] the decision-directed estimator for the a priori SNR is proposed. It estimates the a priori SNR via a weighted sum of the current maximum a posteriori (MAP) estimate of the a priori SNR and an estimate which is built from the speech spectrum estimated in the preceding frame, :
The first estimate after a speech onset is ruled by the a posteriori SNR, , in the second term in (19) since the feedback term is small due to the speech pause in the preceding frame. Since the second term in (19) is weighted with , which is typically of the order of dB to dB (), the a priori SNR estimate considerably underestimates the true a priori SNR during speech onsets [24]. As a consequence, stops, which are normally of low intensity, are often severely attenuated by noise reduction filters based on the decision-directed approach.
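The underestimation at speech onsets can be made concrete with a small numerical sketch. It assumes that (19) has the standard decision-directed form of [12], that is, a feedback term weighted with the smoothing parameter plus an instantaneous term (a posteriori SNR minus one, floored at zero) weighted with one minus that parameter. The value 0.98 for the smoothing parameter is a commonly used setting and is an assumption here, as are all names and numbers.

```python
import math


def decision_directed_snr(prev_speech_power, noise_power, posterior_snr,
                          alpha=0.98):
    """A priori SNR estimate (linear scale) in the standard decision-directed form.

    alpha=0.98 is a common choice and an assumption, not taken from the text.
    """
    feedback = prev_speech_power / noise_power      # from the preceding frame
    instantaneous = max(posterior_snr - 1.0, 0.0)   # from the current frame
    return alpha * feedback + (1.0 - alpha) * instantaneous


def to_db(x):
    return 10.0 * math.log10(x)


# Speech onset: no speech was estimated in the preceding frame, while the
# current frame has a true a priori SNR of 10 (10 dB), so the a posteriori
# SNR is about 11.
xi_onset = decision_directed_snr(prev_speech_power=0.0, noise_power=1.0,
                                 posterior_snr=11.0)
xi_onset_db = to_db(xi_onset)
```

In this example the first estimate after the onset is about -7 dB although the true a priori SNR is 10 dB, which illustrates why a Wiener-type gain attenuates the burst.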
By lowering the parameter , the response time to fast changes of the SNR can be improved, but only at the price of an increased distortion of the residual noise (musical noise). Therefore it was proposed to make the parameter of the decision-directed approach time-dependent [25] or time- and frequency-dependent [26]. In [27] the response time of the a priori SNR estimator to SNR increases is improved with a recursion step in which, per frame advance, a preestimate of the clean speech spectrum is computed which is then used to determine the decision-directed estimate of the a priori SNR.
While in [25, 26] the parameter is modified frame-by-frame, we propose to change it only if a speech onset is detected. Whenever a significant power increase is reliably detected (LR less than a threshold, ), is reduced for those frequency bins where speech activity is likely. The latter is important, as broadband reduction of leads to audible musical noise in those frequency bands that are not masked by the speech.
To realize the desired behavior of the maximum likelihood estimate of the a priori SNR is smoothed along frequency and is then linearly mapped to the range of values of where is typically . Estimates of the a priori SNR greater or equal to dB very likely indicate the presence of speech and are therefore mapped to to preserve the speech presence in those frequency bins. Estimates less or equal to dB are very likely dominated by noise and are therefore mapped to .
This procedure is applied for three consecutive frames after the onset detection. After this time the feedback of the estimated clean speech spectra in (19) will have established more robust estimates, , so that can be increased again to until the next onset will be detected.
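A possible realization of this mapping is sketched below. It is not the authors' implementation: the smoothing (a three-bin moving average applied to dB values), the SNR limits, and the two values of the smoothing parameter are hypothetical placeholders. Following the description above, bins with likely speech activity receive the reduced value and noise-dominated bins keep the regular value.

```python
def smooth_along_frequency(values, half_width=1):
    """Moving average over neighbouring bins (the window shrinks at the edges)."""
    n = len(values)
    smoothed = []
    for k in range(n):
        lo = max(0, k - half_width)
        hi = min(n, k + half_width + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def map_snr_to_alpha(snr_db, snr_low_db, snr_high_db, alpha_min, alpha_max):
    """Linear map: noise-dominated bins get alpha_max, speech bins alpha_min."""
    if snr_db <= snr_low_db:
        return alpha_max
    if snr_db >= snr_high_db:
        return alpha_min
    frac = (snr_db - snr_low_db) / (snr_high_db - snr_low_db)
    return alpha_max - frac * (alpha_max - alpha_min)


def alpha_per_bin(ml_snr_db, frames_since_onset, snr_low_db, snr_high_db,
                  alpha_min, alpha_max, hold_frames=3):
    """Per-bin smoothing parameter; reduced only for hold_frames after an onset."""
    if frames_since_onset is None or frames_since_onset >= hold_frames:
        return [alpha_max] * len(ml_snr_db)
    smoothed = smooth_along_frequency(ml_snr_db)
    return [map_snr_to_alpha(s, snr_low_db, snr_high_db, alpha_min, alpha_max)
            for s in smoothed]


# Hypothetical values: three noise-dominated bins, three speech-dominated bins.
ml_snr_db = [-20.0, -20.0, -20.0, 20.0, 20.0, 20.0]
params = dict(snr_low_db=0.0, snr_high_db=10.0, alpha_min=0.5, alpha_max=0.98)
alphas_onset = alpha_per_bin(ml_snr_db, frames_since_onset=0, **params)
alphas_later = alpha_per_bin(ml_snr_db, frames_since_onset=3, **params)
```

Smoothing before the mapping avoids isolated bins switching to the reduced value, which would otherwise favor musical noise in bins not masked by speech.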
4.3. Amplification of Transients
In [13] the effect of adjusting the consonant-vowel intensity ratio on consonant recognition by hearing impaired subjects was investigated. The recognition of stops was significantly improved when the release burst of the stop was amplified. The improvement reached a maximum when the consonant-vowel intensity ratio was amplified by roughly 8 to 14 dB (depending on the stop, the vowel environment, audiogram configuration, etc.). While the results of this study relate to the undisturbed case, in [14] speech material was used that was disturbed to 6 dB SNR with a 12-talker babble. The effects of three modifications are compared: (1) increasing the duration of consonants, (2) increasing the consonant-vowel intensity ratio by 10 dB, and (3) a combination of (1) and (2). The most significant improvements are obtained from increasing the consonant-vowel intensity ratio. Similar results are obtained in [15] where bursts of plosives are amplified by 12 dB. As opposed to the studies presented before, in [15] sentence material was also used as stimulus. In this case smaller improvements from the amplification of the consonantal region were observed compared to the case where consonant-vowel-consonant stimuli were used. The clean speech was disturbed with speech-shaped noise at 5, 0 and 5 dB SNR.
Based on these findings, in our proposed system, in addition to the window switching, we amplify the samples of those frames that most probably contain a speech onset. To this end, the frame data is amplified with a gain , whenever the LR (13) falls below a threshold . In the cited works, the point in time and the duration of a consonant are perfectly known, since annotated speech was used in the investigations. In our case, speech onsets have to be detected in the disturbed signal. To account for the uncertainty of the detection we let the gain linearly increase with increasing reliability of the nonstationarity decision, that is the smaller the values of the higher is , cf. Figure 17. As soon as the LR exceeds the threshold , we let the gain decay exponentially to with a time constant of roughly ms. This was found to be perceptually advantageous over an abrupt decrease of the gain. Strong consonant amplification as proposed in the previously cited works results in unnatural-sounding speech. Limiting the maximum gain to 3.5 dB results in a clearly perceptible amplification of transient sounds like bursts of stops, but preserves the naturalness of the speech. It is important to note that the proposed increase of the consonant-vowel intensity ratio becomes feasible only with short analysis windows. The amplification of the data captured under a classical long analysis window can produce audible noise prior to the amplified speech onset if the onset occurs only in the most recent samples of the frame. With the concept of switched windows, however, the short analysis and synthesis windows will be used whenever a speech onset is detected, hence preventing audible prenoise.
Figure 17: Mapping the likelihood ratio to a frame amplification .
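The gain rule can be sketched as follows. Several details are assumptions made for illustration: the gain is interpolated linearly in dB, it reaches the 3.5 dB maximum at an LR of zero, it decays towards 0 dB (unity gain), and the threshold, frame shift, and time constant are placeholders. The exact shape of the mapping in Figure 17 may differ.

```python
import math


def onset_gain_db(lr, lr_threshold, max_gain_db=3.5):
    """Gain in dB that grows linearly as the LR drops below the threshold."""
    if lr >= lr_threshold:
        return 0.0
    lr = max(lr, 0.0)
    return max_gain_db * (lr_threshold - lr) / lr_threshold


def track_gain_db(lr_sequence, lr_threshold, frame_shift_ms, decay_ms,
                  max_gain_db=3.5):
    """Per-frame gain in dB with exponential release once the LR recovers."""
    decay = math.exp(-frame_shift_ms / decay_ms)
    gain_db = 0.0
    gains = []
    for lr in lr_sequence:
        if lr < lr_threshold:
            gain_db = onset_gain_db(lr, lr_threshold, max_gain_db)
        else:
            gain_db *= decay  # smooth release instead of an abrupt drop
        gains.append(gain_db)
    return gains


# Hypothetical LR trajectory: stationary, strong onset, then stationary again.
gains_db = track_gain_db([0.9, 0.0, 0.9, 0.9], lr_threshold=0.4,
                         frame_shift_ms=2.0, decay_ms=10.0)
amplitude_factors = [10.0 ** (g / 20.0) for g in gains_db]
```

The amplitude factors are what would multiply the frame samples; the release keeps the gain from dropping abruptly in the frames after the onset.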
5. Results
5.1. Example of Estimated Speech
To illustrate the consequences of the measures proposed in Section 4, speech disturbed with speech-shaped noise has been denoised using a frequency domain Wiener filter and decision-directed estimation of the a priori SNR. The spectral analysis is realized using either permanently the asymmetric long window, , or the short and long analysis window set, , , switched according to the nonstationarity decisions taken by the detector presented in Section 3. In another case, not only the window set is switched, but also the parameter of the decision-directed approach is modified as proposed in 4.2.
Time-domain signals of the utterance “Poach the apples in ” are given in Figure 18. Figure 18(a) shows the clean and the noisy signal at 10 dB SNR (speech-shaped stationary noise). Figure 18(b) contains the output of a Wiener filter single-channel noise reduction. Since only the long spectral analysis window is used, the stops at 0.05 seconds or at 0.75 seconds are considerably distorted. In Figure 18(c) the result obtained with a signal dependent switching between long and short analysis windows is shown. At the bottom, the window decision is plotted. By using the short analysis window during transient sounds the distortion of these sounds in the filtered output can be reduced. Finally, in Figure 18(d) the result with additionally modified decision-directed approach is plotted. It shows considerable improvements of the transients. In particular the two stops at 0.05 s and at 0.75 s are very well preserved in the filtered output.
Figure 18: Time-domain signals of input and the results obtained after noise reduction with Wiener filter and the methods proposed in Sections 4.1 and 4.2.
In Figure 19 the spectrograms of the same example are given. The spectra are obtained using a 128-point DFT of the data weighted with a Hann window and 75 percent overlap. As before we observe a better preservation of the phonemes [p]. Additionally, the speech onset at frame index is better preserved when the analysis window is switched to the short window (Figures 19(d) and 19(e)) and is even better preserved when the modified decision-directed approach is used (Figure 19(e)).
Figure 19: Spectrograms (dB) of input signals and the results obtained after noise reduction with Wiener filter and the methods proposed in Sections 4.1 and 4.2. To create the spectrograms a 128-point DFT with 32 samples frame advance at 16 kHz sampling frequency and a Hann data window was used.
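For readers who want to reproduce such spectrograms, the following sketch uses the stated parameters (128-point DFT, Hann window, frame advance of 32 samples, 16 kHz sampling frequency). It uses only the Python standard library with a direct DFT for clarity; the periodic Hann definition and the 1 kHz test tone are assumptions.

```python
import cmath
import math


def hann(n_fft):
    """Periodic Hann window (one of several common definitions)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_fft) for n in range(n_fft)]


def dft_power_db(frame, floor=1e-12):
    """Power spectrum in dB for bins 0 .. N/2 via a direct DFT."""
    n_fft = len(frame)
    spectrum_db = []
    for k in range(n_fft // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_fft)
                  for n, x in enumerate(frame))
        spectrum_db.append(10.0 * math.log10(max(abs(acc) ** 2, floor)))
    return spectrum_db


def spectrogram_db(signal, n_fft=128, hop=32):
    """List of dB spectra, one per frame; hop=32 gives 75 percent overlap."""
    window = hann(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        segment = [w * x for w, x in zip(window, signal[start:start + n_fft])]
        frames.append(dft_power_db(segment))
    return frames


fs = 16000  # sampling frequency in Hz
tone = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(512)]
spec = spectrogram_db(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

With these settings each bin spans 125 Hz, so the 1 kHz test tone peaks at bin 8. In practice an FFT routine would replace the direct DFT.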
In Figure 20 we show sample spectra of the denoised speech during articulation of the phoneme [p] in the word “poach.” For comparison, the spectra of the clean speech, (thick solid green line), and of the noisy observation, (dotted black line), are also plotted.
Figure 20: Spectra of the clean speech, the noisy speech (speech-shaped additive noise), and the estimated clean speech for the phoneme [p] in “poach.” The estimated clean speech is obtained by Wiener filtering using a spectral analysis based on either the asymmetric long window only or the short and long window set from Figure 1. If the window set is used, the window decision is based on the nonstationarity decision. The thin solid black line shows the result if, in addition to the switched window set, the proposed modification of the decision-directed approach is applied. Frame amplification (cf. Section 4.3) is not shown here. The above spectra have been created using a 128-point DFT and Hann windowed data (sampling frequency kHz).
At frequency bins and the speech spectrum is better preserved when using the switched window set (dashed blue line) as compared to the results obtained with the long asymmetric window only (red solid line). The maximum gain observed in this example is about 4 dB. If additionally the proposed modification of the decision-directed estimator of the a priori SNR is realized (thin solid black line), the estimated speech spectrum, on average, much better preserves the actual speech spectrum. As a consequence, the phoneme sounds sharper than without modification of the decision-directed SNR estimator and without window switching.
5.2. Instrumental Evaluation
In our experiment 4132 clean speech utterances [21] disturbed with additive speech-shaped noise at 10 dB SNR have been processed with a Wiener filter single-channel noise reduction using either square-root Hann windows (length 8 ms) for spectral analysis and synthesis or the proposed system. In terms of delay the square-root Hann window is comparable with the proposed system (cf. Table 1, A versus B). Then, for every occurrence of a phoneme the intelligibility-weighted [28] mean log-spectral distortion has been determined [29]. The mean is computed over frames with a segmental SNR greater than dB. A measurement frame is only ms long in order to be able to resolve the short bursts of stops. Finally, the differences between the spectral distortion produced by the proposed system and the distortion produced by the square-root Hann windows were determined. Figure 21 shows the histogram of the differences for the example of the phoneme [d]. A negative value signifies that the distortion obtained with the proposed system is less than in the case of the square-root Hann windows. Below the histograms the mean and the - and -significance levels of the three distributions are indicated. Using the long window without switching to the short window (thick solid red line) produces on average a similar distortion as obtained with the reference window. We observe slightly less distortion when the window is switched to the short window (thick blue dashed line). When additionally the decision-directed approach is modified (thin solid black line) the average distortion is considerably reduced (about 2.8 dB less than the reference). The distribution becomes bimodal because not all occurring phonemes [d] are detected.
Figure 21: Relative frequencies of the improvements of log-spectral distortion in 2792 occurrences of the phoneme [d]. Values less than zero signify less distortion of the proposed system as compared to a system using a square-root Hann window for spectral analysis-synthesis that produces the same small system delay.
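The distortion difference plotted in Figure 21 can be illustrated with a small sketch. It uses one common definition of the log-spectral distortion (root of the weighted mean squared difference of the log power spectra). The weights of [28] and the exact measure of [29] are not reproduced here: the weights default to uniform, and the SNR floor and all spectra are made-up placeholders.

```python
import math


def log_spectral_distortion_db(clean_power, est_power, weights=None, floor=1e-12):
    """Weighted RMS difference of two log power spectra, in dB."""
    if weights is None:
        weights = [1.0] * len(clean_power)  # uniform weights (assumption)
    acc = 0.0
    for s, e, w in zip(clean_power, est_power, weights):
        diff = 10.0 * math.log10(max(s, floor)) - 10.0 * math.log10(max(e, floor))
        acc += w * diff * diff
    return math.sqrt(acc / sum(weights))


def mean_distortion_db(clean_frames, est_frames, seg_snr_db, snr_floor_db,
                       weights=None):
    """Mean distortion over the frames whose segmental SNR exceeds the floor."""
    values = [log_spectral_distortion_db(c, e, weights)
              for c, e, snr in zip(clean_frames, est_frames, seg_snr_db)
              if snr > snr_floor_db]
    if not values:
        raise ValueError("no frame exceeds the segmental SNR floor")
    return sum(values) / len(values)


# Made-up power spectra for one phoneme occurrence (two frames, two bins).
clean = [[1.0, 1.0], [1.0, 1.0]]
seg_snr_db = [5.0, -30.0]            # the second frame is excluded
reference_est = [[0.1, 0.1], [1e-6, 1e-6]]
proposed_est = [[0.5, 0.5], [1e-6, 1e-6]]

d_reference = mean_distortion_db(clean, reference_est, seg_snr_db, -10.0)
d_proposed = mean_distortion_db(clean, proposed_est, seg_snr_db, -10.0)
difference_db = d_proposed - d_reference  # negative: less distortion
```

Collecting `difference_db` over all occurrences of a phoneme yields the kind of histogram of differences discussed above.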
5.3. Listening Tests
Informal experiments conducted with four expert listeners confirmed the improved reproduction of stops with the proposed modification of the noise reduction system. Stationary speech-shaped noise and cafeteria babble were used at 5 and 10 dB SNR. The amplification of transient frames (see Section 4.3) was limited to dB because this resulted in natural-sounding speech. Note that in [13–15] stronger amplifications of about 10 dB are proposed to achieve a higher speech intelligibility.
6. Conclusion
In this paper a new system for block-based speech enhancement is proposed. The focus is on the preservation of stops, since their clarity is crucial for the preservation of speech intelligibility. The main idea is to detect nonstationary data in the signal segment under investigation. Given this information, a signal adapted spectral analysis and synthesis is performed. A short analysis window is used during plosive sounds. It ensures a high temporal resolution and thus helps to keep the impulsive energy of burst-like sounds concentrated in their spectrotemporal representation. A long analysis window is used when the signal is stationary. The high spectral resolution obtained with that window allows performing noise reduction in between spectral pitch harmonics.
In addition to switching the window set for spectral analysis and synthesis, the decision-directed SNR estimator [12] is modified to yield less distortion of speech onsets and stops. With the nonstationarity decision at hand, the amplification of stops also becomes possible, which has been shown to improve intelligibility [13–15].
To control the switching of the spectral analysis and synthesis windows, a low-latency likelihood-based detector for instationarities has been derived. Its properties have been analyzed and the detection performance was verified experimentally. The examples of the time-domain and spectral representation of signals denoised with the proposed system demonstrate that the signal-dependent selection of the spectral analysis-synthesis window set allows stops and speech onsets to be better preserved. Similarly, a considerably improved reproduction of stops has been shown for the proposed modification of the decision-directed SNR estimator. This is also confirmed by informal listening tests. For the future, formal listening tests are planned to check the proposed approach for intelligibility and quality improvements.
Acknowledgment
This work was sponsored in part by grants from the German High-Tech Initiative, 03FPB00097.