Research Letter | Open Access
Y. A. Mahgoub, R. M. Dansereau, "Time Domain Method for Precise Estimation of Sinusoidal Model Parameters of Co-Channel Speech", Journal of Electrical and Computer Engineering, vol. 2008, Article ID 364674, 5 pages, 2008. https://doi.org/10.1155/2008/364674
Time Domain Method for Precise Estimation of Sinusoidal Model Parameters of Co-Channel Speech
A time-domain method for precise estimation of the sinusoidal model parameters of co-channel speech is presented. The method requires neither the calculation of the Fourier transform nor multiplication by a window function. It incorporates a least-squares estimator and an iterative technique to model and separate the co-channel speech into its individual speakers. Application of the method to speech data demonstrates its effectiveness in separating co-channel speech signals at different target-to-interference ratios. Compared to existing algorithms, the method produces accurate and robust parameter estimates in low signal-to-noise-ratio conditions.
1. Introduction
Separation of mixed speech signals is still one of the major challenges in speech processing. This problem is commonly referred to as co-channel speech separation. The main goal of co-channel speech separation is to automatically process the mixed signal in order to recover each talker's original speech. Minimizing artifacts in the processed speech is a key concern, especially if the final goal is to use the recovered speech in machine-based applications such as automatic speech recognition and speaker identification systems.
Several previous studies have developed signal-processing algorithms for modeling and separating co-channel speech. The primary approaches take the harmonic structure of voiced speech as the basis for separation and use either frequency-domain spectral analysis and reconstruction [1–3] or time-domain filtering. One promising approach to co-channel speech separation is to exploit a speech analysis/synthesis system based on sinusoidal modeling of speech. For example, in [1, 2] a voiced segment of co-channel speech is modeled as the sum of harmonically related sine waves with constant amplitudes, frequencies, and phases. In this sinusoidal modeling approach, the speech parameters of the individual talkers are estimated by applying a high-resolution short-time Fourier transform (STFT) to the windowed speech waveform. The frequencies of the underlying sine waves are either assumed to be known a priori from the individual speech waveforms or determined by a simple frequency-domain peak-picking algorithm. The amplitudes and phases of the component waves are then estimated at these frequencies with a least-squares (LS) algorithm. This technique has two drawbacks: (1) the accuracy of the estimate is limited by the frequency resolution of the STFT; (2) error is introduced by edge effects of the window function used for the STFT.
This paper presents a time-domain method to precisely estimate the sinusoidal model parameters of co-channel speech. The method requires neither the calculation of the STFT nor multiplication by a window function. It incorporates a time-domain least-squares estimator and an adaptive technique to model and separate the co-channel speech into its individual speakers. The performance of the proposed method is evaluated on a database consisting of a wide variety of mixed male and female speech signals at different target-to-interference ratios (TIRs).
This paper is organized as follows. In Section 2, the sinusoidal model of co-channel speech consisting of K speakers is presented. The proposed time-domain method for estimating the sinusoidal model parameters is discussed in Section 3. In Section 4, experimental results and comparisons with other techniques are reported and discussed. Finally, the results are summarized and conclusions are given in Section 5.
2. Sinusoidal Modeling of Co-Channel Speech
According to the speech analysis/synthesis approach based on the sinusoidal model [1], a short segment of co-channel speech (about 20 to 30 milliseconds) can be represented as the sum of harmonically related sinusoidal waves with constant amplitudes, frequencies, and phases as follows:

s(n) = \sum_{k=1}^{K} \sum_{l=1}^{L_k} A_{k,l} \cos(l\omega_k n + \phi_{k,l}), (1)

or, equivalently, in quadrature form,

s(n) = \sum_{k=1}^{K} \sum_{l=1}^{L_k} [ a_{k,l} \cos(l\omega_k n) + b_{k,l} \sin(l\omega_k n) ], (2)

where n is the discrete time index, K is the number of talkers, \omega_k is the fundamental frequency for that segment of the kth talker, and A_{k,l}, l\omega_k, and \phi_{k,l} denote the amplitude, frequency, and phase, respectively, of the lth harmonic of the kth talker. The total number of harmonics in each talker's model is given as L_k for k = 1, ..., K. The quadrature amplitudes a_{k,l} and b_{k,l} in (2) are related to A_{k,l} and \phi_{k,l} in (1) as follows:

A_{k,l} = \sqrt{a_{k,l}^2 + b_{k,l}^2}, \quad \phi_{k,l} = -\tan^{-1}(b_{k,l}/a_{k,l}). (3)

A precise estimate of the sinusoidal-model parameters is essential for separating the co-channel speech into its individual components. The basic problem addressed in this paper can be stated as follows. Given N real observed samples of the co-channel speech sequence s(n), n = 0, 1, ..., N-1, find the parameters \omega_k, a_{k,l}, b_{k,l}, and L_k that form the model sequence

\hat{s}(n) = \sum_{k=1}^{K} \sum_{l=1}^{L_k} [ \hat{a}_{k,l} \cos(l\hat{\omega}_k n) + \hat{b}_{k,l} \sin(l\hat{\omega}_k n) ] (4)

that best fits s(n) by minimizing the mean-squared error (MSE)

E = \frac{1}{N} \sum_{n=0}^{N-1} [ s(n) - \hat{s}(n) ]^2. (5)

In the following sections, we will consider the case of two talkers (K = 2) to represent the co-channel speech without loss of generality.
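As a concrete illustration of (1)–(3), the following sketch synthesizes a two-talker frame and checks the quadrature identity numerically; the sampling rate, fundamentals, and harmonic amplitudes are illustrative values, not taken from the paper.

```python
import numpy as np

fs = 16000                       # sampling rate (Hz); illustrative
N = 480                          # 30 ms frame at 16 kHz
n = np.arange(N)

def synth(talkers):
    """Sum of harmonically related sinusoids, as in (1)."""
    s = np.zeros(N)
    for w0, harmonics in talkers:
        for l, (A, phi) in enumerate(harmonics, start=1):
            s += A * np.cos(l * w0 * n + phi)
    return s

# two talkers with fundamentals 120 Hz and 165 Hz, three harmonics each
talkers = [
    (2 * np.pi * 120 / fs, [(1.0, 0.3), (0.5, -0.1), (0.25, 0.7)]),
    (2 * np.pi * 165 / fs, [(0.8, 0.0), (0.4, 0.5), (0.2, -0.4)]),
]
s = synth(talkers)

# quadrature identity of (2)-(3): A cos(t + phi) = a cos(t) + b sin(t)
# with a = A cos(phi) and b = -A sin(phi)
A, phi = 0.5, -0.1
t = 2 * 2 * np.pi * 120 / fs * n
a, b = A * np.cos(phi), -A * np.sin(phi)
assert np.allclose(A * np.cos(t + phi), a * np.cos(t) + b * np.sin(t))
assert np.isclose(A, np.hypot(a, b)) and np.isclose(phi, -np.arctan(b / a))
```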
3. Time-Domain Estimation of Model Parameters
3.1. Estimation Setup
In matrix notation, we may write (4) as

\hat{\mathbf{s}} = \mathbf{H}\mathbf{c}, (6)

where \hat{\mathbf{s}} = [\hat{s}(0), \hat{s}(1), ..., \hat{s}(N-1)]^T is the N x 1 model vector and \mathbf{c} is the 2(L_1 + L_2) x 1 coefficient vector given as

\mathbf{c} = [\mathbf{c}_1^T \ \mathbf{c}_2^T]^T, (7)

with

\mathbf{c}_k = [a_{k,1}, b_{k,1}, a_{k,2}, b_{k,2}, ..., a_{k,L_k}, b_{k,L_k}]^T, (8)

and \mathbf{H} is an N x 2(L_1 + L_2) matrix of the form

\mathbf{H} = [\mathbf{H}_1 \ \mathbf{H}_2], (9)

where the elements of each block \mathbf{H}_k are given as

[\mathbf{H}_k]_{n,2l-1} = \cos(l\omega_k n), \quad [\mathbf{H}_k]_{n,2l} = \sin(l\omega_k n), (10)

with n = 0, 1, ..., N-1 and l = 1, 2, ..., L_k. With the observation vector

\mathbf{s} = [s(0), s(1), ..., s(N-1)]^T, (11)

the MSE in (5) can now be written as

E = \frac{1}{N} (\mathbf{s} - \hat{\mathbf{s}})^T (\mathbf{s} - \hat{\mathbf{s}}). (12)

Substituting (6) into (12) gives

E = \frac{1}{N} (\mathbf{s} - \mathbf{H}\mathbf{c})^T (\mathbf{s} - \mathbf{H}\mathbf{c}) (13)
  = \frac{1}{N} ( \mathbf{s}^T\mathbf{s} - 2\mathbf{c}^T\mathbf{H}^T\mathbf{s} + \mathbf{c}^T\mathbf{H}^T\mathbf{H}\mathbf{c} ). (14)

The estimation criterion is to seek the minimization of (14) over the parameters \omega_k, a_{k,l}, b_{k,l}, and L_k.
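The matrix formulation can be checked numerically: stacking the cos/sin columns into a basis matrix H makes the quadrature double sum collapse to a single matrix-vector product. The frame length, fundamentals, and coefficients below are illustrative values.

```python
import numpy as np

fs, N = 16000, 64
n = np.arange(N)
w = [2 * np.pi * 120 / fs, 2 * np.pi * 165 / fs]   # two example fundamentals
L = [3, 3]                                          # harmonics per talker

# build the cos/sin columns and the matching coefficient vector
cols, coeffs = [], []
rng = np.random.default_rng(0)
for k in range(2):
    for l in range(1, L[k] + 1):
        cols += [np.cos(l * w[k] * n), np.sin(l * w[k] * n)]
        coeffs += list(rng.normal(size=2))          # a_{k,l}, b_{k,l}
H = np.column_stack(cols)                           # N x 2(L1+L2)
c = np.array(coeffs)

# direct evaluation of the quadrature double sum, for comparison
s_direct = np.zeros(N)
i = 0
for k in range(2):
    for l in range(1, L[k] + 1):
        s_direct += c[i] * np.cos(l * w[k] * n) + c[i + 1] * np.sin(l * w[k] * n)
        i += 2

assert np.allclose(H @ c, s_direct)                 # matrix form equals the sum
assert H.shape == (N, 2 * sum(L))
```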
The most important and difficult part of the estimation process is estimating the fundamental frequencies. Unfortunately, without a priori knowledge of the frequency parameters, direct minimization of (14) is a highly nonlinear problem that is very difficult to solve. If these frequencies were known a priori, or could be estimated precisely, the optimum values of the other parameters could easily be found accordingly.
3.2. Estimating the Number of Harmonics
If the fundamental frequencies are assumed to be known, the total number of harmonics in each signal can be estimated simply as

L_k = \lfloor \pi / \omega_k \rfloor, (15)

that is, the number of harmonics that fit below half the sampling frequency. In practice, L_k is chosen much smaller than the value calculated by (15), since most of the energy of voiced speech is concentrated below 2 kHz. This assumption dramatically reduces the computational complexity of the system.
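A small numerical check of the harmonic count (expressed here in Hz rather than normalized frequency) and of the practical 2 kHz restriction; the 160 Hz fundamental is an illustrative value.

```python
import math

fs = 16000                    # sampling rate in Hz (illustrative)

def num_harmonics(f0_hz, band_hz):
    """Number of harmonics of f0 that fit below band_hz, cf. (15)."""
    return math.floor(band_hz / f0_hz)

assert num_harmonics(160.0, fs / 2) == 50    # full band up to Nyquist
assert num_harmonics(160.0, 2000.0) == 12    # restricted to 2 kHz in practice
```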
3.3. Estimating the Amplitude Parameters
The optimum values of the quadrature parameters a_{k,l} and b_{k,l} can be estimated directly (assuming the availability of the fundamental frequencies) by finding the standard linear LS solution to (14) as follows [5]:

\mathbf{c}_{opt} = \mathbf{R}^{-1}\mathbf{p}, (16)

where

\mathbf{R} = \mathbf{H}^T\mathbf{H}, \quad \mathbf{p} = \mathbf{H}^T\mathbf{s}. (17)

The minimum MSE corresponding to \mathbf{c}_{opt} is given by substituting (16) into (14) to give

E_{min} = \frac{1}{N} ( \mathbf{s}^T\mathbf{s} - \mathbf{p}^T\mathbf{R}^{-1}\mathbf{p} ). (18)
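A minimal sketch of the LS step in (16)–(17), assuming the two fundamental frequencies are known: a two-talker mixture is generated from known quadrature coefficients and a small amount of noise, and the normal equations recover the coefficients. Frame length, fundamentals, and noise level are illustrative.

```python
import numpy as np

fs, N = 16000, 480
n = np.arange(N)

def basis(w0, L):
    """N x 2L block of cos/sin columns for one talker, as in (10)."""
    cols = []
    for l in range(1, L + 1):
        cols += [np.cos(l * w0 * n), np.sin(l * w0 * n)]
    return np.column_stack(cols)

w1, w2 = 2 * np.pi * 120 / fs, 2 * np.pi * 165 / fs   # assumed known here
L1 = L2 = 3
H = np.hstack([basis(w1, L1), basis(w2, L2)])

rng = np.random.default_rng(0)
c_true = rng.normal(size=2 * (L1 + L2))
s = H @ c_true + 0.01 * rng.normal(size=N)            # mixture + small noise

R, p = H.T @ H, H.T @ s                               # (17)
c_opt = np.linalg.solve(R, p)                         # (16)
assert np.allclose(c_opt, c_true, atol=0.05)

# amplitude and phase of harmonic 1 of talker 1 via (3)
A11 = np.hypot(c_opt[0], c_opt[1])
phi11 = -np.arctan2(c_opt[1], c_opt[0])
```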
3.4. Estimating the Fundamental Frequencies
Since, in practical applications, the fundamental frequencies of the individual speech waveforms are not known a priori, they must be estimated from the mixed data. A direct approach to this problem is to search the two-dimensional MSE surface for its minimum with respect to the fundamental frequencies. The initial estimate can be determined either from the previous frame or by applying a simple rough multipitch estimation method, such as a time-domain method based on the average magnitude difference function. After finding an initial guess for the fundamental frequencies, their optimum values can be estimated by searching the MSE surface of (18) using the method of steepest descent [7]. Using the weight vector \mathbf{w}(i) = [\omega_1(i), \omega_2(i)]^T, we describe the steepest descent algorithm by

\mathbf{w}(i+1) = \mathbf{w}(i) - \mu \nabla E(i), (19)

where i is the iteration index,

\nabla E(i) = [\partial E_{min}/\partial\omega_1, \ \partial E_{min}/\partial\omega_2]^T, (20)

and \mu is a positive scalar that controls both the stability and the speed of convergence. The gradient of the MSE is calculated by differentiating (18) with respect to each fundamental frequency as follows:

\frac{\partial E_{min}}{\partial \omega_k} = -\frac{1}{N} \left( 2 \frac{\partial \mathbf{p}^T}{\partial \omega_k} \mathbf{R}^{-1}\mathbf{p} - \mathbf{p}^T\mathbf{R}^{-1} \frac{\partial \mathbf{R}}{\partial \omega_k} \mathbf{R}^{-1}\mathbf{p} \right), (21)

where

\frac{\partial \mathbf{p}}{\partial \omega_k} = \frac{\partial \mathbf{H}^T}{\partial \omega_k}\mathbf{s}, \quad \frac{\partial \mathbf{R}}{\partial \omega_k} = \frac{\partial \mathbf{H}^T}{\partial \omega_k}\mathbf{H} + \mathbf{H}^T \frac{\partial \mathbf{H}}{\partial \omega_k}. (22)

Differentiating (10) and substituting into (21) give

\frac{\partial [\mathbf{H}_k]_{n,2l-1}}{\partial \omega_k} = -ln \sin(l\omega_k n), \quad \frac{\partial [\mathbf{H}_k]_{n,2l}}{\partial \omega_k} = ln \cos(l\omega_k n). (23)

The fundamental frequencies are updated iteratively using (19). After each iteration, the optimum amplitude parameters corresponding to the estimated frequencies are calculated using (16). Note that, even using (19), the final estimates of the fundamental frequencies may still have small inaccuracies, because the frequencies may vary slightly within the speech frame. The use of the exact gradient to update the fundamental frequencies in (19) is an advantage compared to approaches in which an approximation of the gradient is used. Gradient calculation is an integrated part of the time-domain method, since the components on the right-hand side of (23) are already computed in the previous steps of the algorithm.
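A simplified numerical sketch of the pitch refinement in (19): steepest descent on the minimum-MSE surface of (18), here for a single talker, a noise-free synthetic frame, and a finite-difference gradient standing in for the analytic expression of (21)–(23). Step size and spacing are hand-tuned for this example.

```python
import numpy as np

fs, N, L = 16000, 480, 3          # illustrative values; K = 1 talker here
n = np.arange(N)

def basis(w0):
    """Cos/sin basis columns for harmonics l = 1..L, as in (10)."""
    cols = []
    for l in range(1, L + 1):
        cols += [np.cos(l * w0 * n), np.sin(l * w0 * n)]
    return np.column_stack(cols)

def e_min(w0, s):
    """Minimum MSE of (18) for a candidate fundamental frequency."""
    H = basis(w0)
    c = np.linalg.solve(H.T @ H, H.T @ s)
    r = s - H @ c
    return r @ r / N

rng = np.random.default_rng(1)
w_true = 2 * np.pi * 165 / fs
s = basis(w_true) @ rng.normal(size=2 * L)   # noise-free test frame

w = 2 * np.pi * 162 / fs                     # rough initial guess (3 Hz off)
e0 = e_min(w, s)
mu, dw = 5e-7, 1e-7                          # hand-tuned step and FD spacing
for _ in range(200):
    g = (e_min(w + dw, s) - e_min(w - dw, s)) / (2 * dw)
    w -= mu * g

assert e_min(w, s) < e0                        # descent reduced the MSE
assert abs(w - w_true) < 2 * np.pi * 1.0 / fs  # refined to within ~1 Hz
```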
An example of the MSE surface obtained for the single-talker case (K = 1) is shown in Figure 1. Figure 1(a) shows a 30-millisecond speech frame for a single talker, while Figure 1(b) shows the corresponding MSE surface using (18) as the cost function. From Figure 1(b), the optimal fundamental frequency is approximately 165 Hz. For the two-talker case (K = 2), the MSE surface would instead be two-dimensional.
3.5. The Ill-Conditioned Estimation Problem
In some instances, the harmonics of the two speakers can be very close to each other. When harmonics overlap, the matrix in (17) becomes singular, and the parameter estimation in (16) becomes ill-conditioned. To handle this problem, the spacing between adjacent harmonics is continuously monitored. If two adjacent harmonics are found to be closely spaced, that is, less than 25 Hz apart, only one sinusoid is used to represent the pair. The amplitude parameters of this single component are then estimated and shared equally between the two speakers.
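The overlap guard can be sketched as a simple spacing check over the two harmonic combs; the fundamentals and harmonic counts below are illustrative values.

```python
import numpy as np

def close_pairs(f1, f2, L1, L2, threshold_hz=25.0):
    """Harmonic index pairs (l1, l2) whose frequencies fall within threshold_hz."""
    h1 = f1 * np.arange(1, L1 + 1)
    h2 = f2 * np.arange(1, L2 + 1)
    return [(i + 1, j + 1)
            for i, a in enumerate(h1)
            for j, b in enumerate(h2)
            if abs(a - b) < threshold_hz]

# 150 Hz vs 148 Hz: corresponding harmonics collide (|l*150 - l*148| = 2l Hz)
pairs = close_pairs(150.0, 148.0, 5, 5)
assert (1, 1) in pairs and (5, 5) in pairs

# 120 Hz vs 165 Hz: all harmonic pairs are at least 30 Hz apart
assert close_pairs(120.0, 165.0, 3, 3) == []
```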
4. Simulation Results
The performance of the proposed method is evaluated using a speech database consisting of 200 frames of mixed speech. All-voiced speech segments of length 30 milliseconds were randomly chosen from the TIMIT corpus [8] for male and female speakers and mixed at different TIRs. The speech data were sampled at a rate of 16 kHz.
Two sets of simulations were conducted to compare the performance of the proposed method with the frequency-sampling approach presented in [2]. As suggested by the authors, a Hann window and a high-resolution STFT were used in the frequency-domain technique. To avoid errors due to the multipitch detection algorithm, the initial guess of the fundamental frequency of each talker was calculated directly from the original speech frames before mixing, using a simple autocorrelation method.
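The "simple autocorrelation method" used to seed the fundamental-frequency search can be sketched as a peak pick on the autocorrelation within a plausible pitch-lag range; the test signal here is synthetic, not TIMIT data.

```python
import numpy as np

fs, N = 16000, 480
n = np.arange(N)
f0 = 165.0
# synthetic voiced frame: three harmonics of 165 Hz (illustrative values)
s = sum(a * np.cos(2 * np.pi * f0 * l / fs * n + p)
        for l, (a, p) in enumerate([(1.0, 0.0), (0.5, 0.4), (0.3, -0.2)], 1))

def autocorr_pitch(s, fs, fmin=80.0, fmax=400.0):
    """Rough pitch estimate: strongest autocorrelation lag in [fs/fmax, fs/fmin]."""
    r = np.correlate(s, s, mode='full')[len(s) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(r[lo:hi + 1])
    return fs / lag

est = autocorr_pitch(s, fs)
assert abs(est - f0) < 5.0     # within a few Hz of the true fundamental
```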
In the first set of simulations, the comparison was carried out in terms of the signal-to-distortion ratio (SDR) versus TIR, as shown in Figure 2 for TIRs ranging from −5 to +15 dB. The SDR measure is defined as [9]

SDR = 10 \log_{10} \frac{\sum_{n} s^2(n)}{\sum_{n} [s(n) - \hat{s}(n)]^2}, (24)

where s(n) is the original target signal before mixing and \hat{s}(n) is the reconstructed signal after separation from the mixture. Each point in the plot of Figure 2 represents the ensemble average of the SDRs over all 200 test frames. Two cases are considered for each algorithm. In case 1, precise estimation of the fundamental frequencies is done using (19); in case 2, only the initial guess of the fundamental frequencies is used. Plots SDR-TD1 and SDR-TD2 are the results for the proposed time-domain algorithm in case 1 and case 2, respectively, while plots SDR-FD1 and SDR-FD2 depict the results for the frequency-domain method. As can be seen from Figure 2, the SDR increases monotonically with TIR for both algorithms in all cases.
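The SDR measure above reduces to a ratio of target energy to residual energy in dB; the signals below are synthetic stand-ins for a separated frame.

```python
import numpy as np

def sdr_db(target, estimate):
    """Signal-to-distortion ratio in dB: target energy over residual energy."""
    err = target - estimate
    return 10 * np.log10(np.sum(target ** 2) / np.sum(err ** 2))

t = np.sin(2 * np.pi * 0.02 * np.arange(480))
assert sdr_db(t, t + 0.01) > 20                            # mild distortion, high SDR
assert np.isclose(sdr_db(t, 0.5 * t), 10 * np.log10(4.0))  # half-amplitude: err = t/2
```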
More importantly, we see from Figure 2 that the proposed technique outperforms the frequency-domain technique in both case 1 and case 2. At TIR = −5 dB, SDR-TD1 and SDR-TD2 are greater than SDR-FD1 and SDR-FD2 by about 2 and 1 dB, respectively, and the difference grows for larger TIRs. As suggested in Section 1, analysis of the resulting estimates for voiced speech segments revealed that the discrepancies are due to the limited frequency resolution of the STFT (even at high resolution) and to the edge effects of the chosen window function. Other window functions, such as rectangular and Hamming windows, showed similar discrepancies when tested.
The robustness against background noise was examined in a second set of simulations using MSE versus signal-to-noise ratio (SNR). Speech segments were corrupted by additive white Gaussian noise (AWGN) with the SNR varied from 0 to 15 dB. The results are presented in Figure 3. As shown in the figure, the proposed algorithm performs considerably better at low SNR than the frequency-domain technique. The AWGN causes additional frequency-resolution problems even after a high-resolution STFT, whereas the proposed time-domain estimation approach is far less affected.
5. Conclusion
A time-domain method to precisely estimate the sinusoidal model parameters of co-channel speech has been presented. The method requires neither calculation of the STFT nor multiplication by a window function for the primary model parameters. The proposed method incorporates a least-squares estimator and an adaptive technique to model and separate the co-channel speech into its individual speakers, entirely in the time domain.
Application of this time-domain method to real data demonstrates its effectiveness in separating co-channel speech signals at different TIRs. Overall, an improvement of 1–3 dB in SDR is obtained over the frequency-domain method, depending on the accuracy of the fundamental-frequency estimates of the talkers in the tested two-talker scenario. Note that these time-domain results are compared against the frequency-domain approach using a high-resolution STFT; changes in the STFT length would affect the precision of the frequency-domain estimates.
We also note that the time-domain method is not as sensitive to additive white Gaussian noise as is the frequency-domain method for sinusoidal modeling. This result is particularly true for lower-SNR situations.
The authors wish to thank the National Capital Institute of Telecommunications (NCIT) for partially funding this research.
- T. F. Quatieri and R. G. Danisewicz, “An approach to co-channel talker interference suppression using a sinusoidal model for speech,” IEEE Transactions on Acoustics, Speech, Signal Processing, vol. 38, no. 1, pp. 56–69, 1990.
- F. M. Silva and L. B. Almeida, “Speech separation by means of stationary least-squares harmonic estimation,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '90), vol. 2, pp. 809–812, Albuquerque, NM, USA, April 1990.
- D. P. Morgan, E. B. George, L. T. Lee, and S. M. Kay, “Co-Channel speaker separation by harmonic enhancement and suppression,” IEEE Transactions on Speech and Audio Processing, vol. 5, no. 5, pp. 407–424, 1997.
- A. Bánhalmi, K. Kovács, A. Kocsor, and L. Tóth, “Fundamental frequency estimation by least-squares harmonic model fitting,” in Proceedings of the 9th European Conference on Speech Communication and Technology (Interspeech'05), pp. 305–308, Lisbon, Portugal, September 2005.
- P. Stoica, H. Li, and J. Li, “Amplitude estimation of sinusoidal signals: survey, new results and an application,” IEEE Transactions on Signal Processing, vol. 48, no. 2, pp. 338–352, 2000.
- A. de Cheveigné, “A mixed speech estimation algorithm,” in Proceedings of the 2nd European Conference on Speech Communication and Technology (Eurospeech '91), pp. 445–448, Genova, Italy, September 1991.
- S. Haykin, Adaptive Filter Theory, Prentice-Hall, Upper Saddle River, NJ, USA, 3rd edition, 1996.
- J. Garofolo, L. Lamel, W. Fisher et al., “TIMIT Acoustic-Phonetic Continuous Speech Corpus,” 1993, Linguistic Data Consortium.
- E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
Copyright © 2008 Y. A. Mahgoub and R. M. Dansereau. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.