Abstract

A time-domain method to precisely estimate the sinusoidal model parameters of co-channel speech is presented. The method requires neither the calculation of the Fourier transform nor multiplication by a window function. It incorporates a least-squares estimator and an iterative technique to model and separate the co-channel speech into its individual speakers. Application to speech data demonstrates the effectiveness of the method in separating co-channel speech signals at different target-to-interference ratios. Compared with existing algorithms, the method produces accurate and robust parameter estimates in low signal-to-noise ratio conditions.

1. Introduction

Separation of mixed speech signals is still one of the major challenges in speech processing. This problem is commonly referred to as co-channel speech separation. The main goal of co-channel speech separation is to automatically process the mixed signal in order to recover each talker's original speech. Minimizing artifacts in the processed speech is a key concern, especially if the final goal is to use the recovered speech in machine-based applications such as automatic speech recognition and speaker identification systems.

Several previous studies have developed signal processing algorithms for modeling and separating co-channel speech. The primary approaches have taken the harmonic structure of voiced speech as the basis for separation and have used either frequency-domain spectral analysis and reconstruction [1–3] or time-domain filtering [4]. One promising approach to co-channel speech separation is to exploit a speech analysis/synthesis system based on sinusoidal modeling of speech. For example, in [1, 2] a voiced segment of co-channel speech is modeled as the sum of harmonically related sine waves with constant amplitudes, frequencies, and phases. In the sinusoidal modeling approach, the speech parameters of the individual talkers are estimated by applying a high-resolution short-time Fourier transform (STFT) to the windowed speech waveform. The frequencies of the underlying sine waves are assumed to be known a priori from the individual speech waveforms, or they are determined by a simple frequency-domain peak-picking algorithm. The amplitudes and phases of the component waves are then estimated at these frequencies by a least-squares (LS) algorithm. This technique has two drawbacks: (1) the accuracy of the estimate is limited by the frequency resolution of the STFT; (2) error is introduced by the edge effects of the window function used for the STFT.

This paper presents a time-domain method to precisely estimate the sinusoidal model parameters of co-channel speech. The method requires neither the calculation of the STFT nor multiplication by a window function. It incorporates a time-domain least-squares estimator and an adaptive technique to model and separate the co-channel speech into its individual speakers. The performance of the proposed method is evaluated on a database consisting of a wide variety of mixed male and female speech signals at different target-to-interference ratios (TIRs).

This paper is organized as follows. In Section 2, the sinusoidal model of co-channel speech consisting of $K$ speakers is presented. The proposed time-domain method for estimating the sinusoidal model parameters is described in Section 3. In Section 4, experimental results and comparisons with other techniques are reported and discussed. Finally, the results are summarized and conclusions are given in Section 5.

2. Sinusoidal Modeling of Co-Channel Speech

According to the speech analysis/synthesis approach based on the sinusoidal model [1], a short segment of co-channel speech (about 20 to 30 milliseconds) can be represented as the sum of harmonically related sinusoidal waves with constant amplitudes, frequencies, and phases as follows:

$$x(n) = \sum_{k=1}^{K} \sum_{\ell=1}^{L_k} c_\ell^{(k)} \cos\bigl(\ell\omega_k n - \phi_\ell^{(k)}\bigr) \qquad (1)$$

$$\phantom{x(n)} = \sum_{k=1}^{K} \sum_{\ell=1}^{L_k} \Bigl[ a_\ell^{(k)} \cos(\ell\omega_k n) + b_\ell^{(k)} \sin(\ell\omega_k n) \Bigr], \qquad (2)$$

where $n = 0, \ldots, N-1$ is the discrete time index, $\omega_k$ is the fundamental frequency for that segment of the $k$th talker, and $c_\ell^{(k)}$, $\ell\omega_k$, and $\phi_\ell^{(k)}$ denote the amplitude, frequency, and phase, respectively, of the $\ell$th harmonic of the $k$th talker. The total number of harmonics in each talker's model is $L_k$ for $k = 1, \ldots, K$. The quadrature amplitudes $a_\ell^{(k)}$ and $b_\ell^{(k)}$ in (2) are related to $c_\ell^{(k)}$ and $\phi_\ell^{(k)}$ in (1) as follows [1]:

$$c_\ell^{(k)} = \sqrt{\bigl(a_\ell^{(k)}\bigr)^2 + \bigl(b_\ell^{(k)}\bigr)^2}, \qquad \phi_\ell^{(k)} = \tan^{-1}\!\left(\frac{b_\ell^{(k)}}{a_\ell^{(k)}}\right). \qquad (3)$$

A precise estimate of the sinusoidal-model parameters is essential for separating the co-channel speech into its individual components. The basic problem addressed in this paper can be stated as follows. Given $N$ real observed samples of the co-channel speech sequence $x(n)$, find the parameters $L_k$, $\hat{\omega}_k$, $\{\hat{a}_\ell^{(k)}\}_{\ell=1}^{L_k}$, and $\{\hat{b}_\ell^{(k)}\}_{\ell=1}^{L_k}$ that form the sequence

$$\hat{x}(n) = \sum_{k=1}^{K} \sum_{\ell=1}^{L_k} \Bigl[ \hat{a}_\ell^{(k)} \cos(\ell\hat{\omega}_k n) + \hat{b}_\ell^{(k)} \sin(\ell\hat{\omega}_k n) \Bigr] \qquad (4)$$

that best fits $x(n)$ by minimizing the mean-squared error (MSE)

$$E = \frac{1}{N} \sum_{n=0}^{N-1} \bigl[ x(n) - \hat{x}(n) \bigr]^2. \qquad (5)$$

In the following sections, we consider the case of two talkers ($K = 2$) without loss of generality.
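As a concrete illustration of the model in (1)–(2), the following minimal Python sketch synthesizes a two-talker frame; all parameter values (fundamentals, amplitudes, phases) are hypothetical and chosen only for illustration.

```python
import numpy as np

def synthesize_cochannel(amps, phases, omegas, N):
    """Synthesize co-channel speech per Eq. (1): each talker k is a sum of
    L_k harmonics of its fundamental omega_k (radians/sample)."""
    n = np.arange(N)
    x = np.zeros(N)
    for c, phi, w in zip(amps, phases, omegas):
        for l, (c_l, phi_l) in enumerate(zip(c, phi), start=1):
            x += c_l * np.cos(l * w * n - phi_l)
    return x

# Two hypothetical talkers: 100 Hz and 160 Hz fundamentals at fs = 16 kHz.
fs = 16000
omegas = [2 * np.pi * 100 / fs, 2 * np.pi * 160 / fs]
amps = [[1.0, 0.6, 0.3], [0.8, 0.5]]       # c_l^(k)
phases = [[0.0, 0.4, 1.1], [0.2, 0.9]]     # phi_l^(k)
x = synthesize_cochannel(amps, phases, omegas, N=480)  # 30 ms frame
```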

3. Time-Domain Estimation of Model Parameters

3.1. Estimation Setup

In matrix notation, we may write (4) as

$$\hat{\mathbf{x}} = \mathbf{Q}\mathbf{h}, \qquad (6)$$

where $\hat{\mathbf{x}}$ is the vector

$$\hat{\mathbf{x}} = \bigl[ \hat{x}(0), \hat{x}(1), \ldots, \hat{x}(N-1) \bigr]^T, \qquad (7)$$

and $\mathbf{h}$ is given as

$$\mathbf{h} = \begin{bmatrix} \mathbf{h}^{(1)} \\ \mathbf{h}^{(2)} \end{bmatrix}, \qquad (8)$$

with

$$\mathbf{h}^{(k)} = \Bigl[ \hat{a}_1^{(k)}, \hat{a}_2^{(k)}, \ldots, \hat{a}_{L_k}^{(k)}, \hat{b}_1^{(k)}, \hat{b}_2^{(k)}, \ldots, \hat{b}_{L_k}^{(k)} \Bigr]^T. \qquad (9)$$

$\mathbf{Q}$ is a matrix of the form

$$\mathbf{Q} = \begin{bmatrix} \mathbf{Q}^{(1)} & \mathbf{Q}^{(2)} \end{bmatrix}, \qquad (10)$$

where the matrix elements are given as

$$Q_{ij}^{(k)} = \begin{cases} \cos\bigl(i j \hat{\omega}_k\bigr) & \text{for } j = 1, 2, \ldots, L_k, \\ \sin\bigl(i (j - L_k) \hat{\omega}_k\bigr) & \text{for } j = L_k + 1, \ldots, 2L_k, \end{cases} \qquad (11)$$

with $i = 0, 1, \ldots, N-1$ and $k = 1, 2$. The MSE in (5) can now be written (dropping the constant factor $1/N$, which does not affect the minimization) as

$$E = \bigl\| \mathbf{x} - \hat{\mathbf{x}} \bigr\|_2^2 = \mathbf{x}^T\mathbf{x} + \hat{\mathbf{x}}^T\hat{\mathbf{x}} - 2\hat{\mathbf{x}}^T\mathbf{x}, \qquad (12)$$

where

$$\mathbf{x} = \bigl[ x(0), x(1), \ldots, x(N-1) \bigr]^T. \qquad (13)$$

Substituting (6) into (12) gives

$$E = \mathbf{x}^T\mathbf{x} + \mathbf{h}^T\mathbf{Q}^T\mathbf{Q}\mathbf{h} - 2\mathbf{h}^T\mathbf{Q}^T\mathbf{x}. \qquad (14)$$

The estimation criterion is to minimize (14) over the parameters $L_k$, $\hat{\omega}_k$, $\{\hat{a}_\ell^{(k)}\}_{\ell=1}^{L_k}$, and $\{\hat{b}_\ell^{(k)}\}_{\ell=1}^{L_k}$.
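The structure of (10)–(11) maps directly to code. A minimal sketch, assuming the fundamentals are supplied as normalized radian frequencies $\hat{\omega}_k = 2\pi f_k / f_s$:

```python
import numpy as np

def basis_matrix(omegas, L, N):
    """Build Q = [Q^(1) Q^(2)] of Eqs. (10)-(11): for talker k, columns are
    cos(l*omega_k*n) for l = 1..L_k, followed by sin(l*omega_k*n)."""
    n = np.arange(N)[:, None]
    blocks = []
    for w, Lk in zip(omegas, L):
        l = np.arange(1, Lk + 1)[None, :]
        blocks.append(np.cos(n * l * w))
        blocks.append(np.sin(n * l * w))
    return np.hstack(blocks)

def mse(x, Q, h):
    """Eq. (14): E = x^T x + h^T Q^T Q h - 2 h^T Q^T x."""
    return x @ x + h @ (Q.T @ Q) @ h - 2 * h @ (Q.T @ x)
```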

The most important and difficult part of the estimation process is estimating the fundamental frequencies $\{\hat{\omega}_k\}_{k=1,2}$. Unfortunately, without a priori knowledge of the frequency parameters, direct minimization of (14) is a highly nonlinear problem that is very difficult to solve. If these frequencies were known a priori, or could be estimated precisely, the optimum values of the remaining parameters could easily be found accordingly.

3.2. Estimating the Number of Harmonics

If the fundamental frequencies $\hat{\omega}_k$ are assumed to be known, the total number of harmonics in each signal can be estimated simply as

$$L_k = \left\lfloor \frac{\pi}{\hat{\omega}_k} \right\rfloor. \qquad (15)$$

In practice, $L_k$ is chosen much smaller than the value given by (15), since most of the energy of voiced speech is concentrated below 2 kHz. This choice dramatically reduces the computational complexity of the system.
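For instance, under the assumed values below (a hypothetical 120 Hz fundamental at a 16 kHz sampling rate), (15) allows 66 harmonics up to the Nyquist frequency, while the 2 kHz restriction keeps only 16:

```python
import numpy as np

fs = 16000                       # sampling rate (Hz)
f0 = 120.0                       # assumed fundamental (Hz)
omega = 2 * np.pi * f0 / fs      # radians/sample

L_full = int(np.pi // omega)           # Eq. (15): harmonics up to Nyquist -> 66
L_used = min(L_full, int(2000 // f0))  # keep only harmonics below ~2 kHz -> 16
```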

3.3. Estimating the Amplitude Parameters

The optimum values of the quadrature parameters $\{\hat{a}_\ell^{(k)}\}_{\ell=1}^{L_k}$ and $\{\hat{b}_\ell^{(k)}\}_{\ell=1}^{L_k}$ can be estimated directly (assuming the availability of the fundamental frequencies) as the standard linear LS solution to (14) [5]:

$$\mathbf{h}_{\mathrm{opt}} = \bigl( \mathbf{Q}^T\mathbf{Q} \bigr)^{-1}\mathbf{Q}^T\mathbf{x} = \mathbf{R}^{-1}\mathbf{P}, \qquad (16)$$

where

$$\mathbf{R} = \mathbf{Q}^T\mathbf{Q}, \qquad \mathbf{P} = \mathbf{Q}^T\mathbf{x}. \qquad (17)$$

The minimum MSE corresponding to $\mathbf{h}_{\mathrm{opt}}$ is obtained by substituting (16) into (14):

$$E_{\min} = \mathbf{x}^T\mathbf{x} - \mathbf{P}^T\mathbf{R}^{-1}\mathbf{P}. \qquad (18)$$
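A sketch of the LS step (16)–(18), using a linear solve rather than an explicit matrix inverse; it assumes $\mathbf{R}$ is nonsingular (see Section 3.5 for the ill-conditioned case):

```python
import numpy as np

def ls_amplitudes(Q, x):
    """Eqs. (16)-(18): h_opt = R^{-1} P with R = Q^T Q, P = Q^T x,
    plus the residual E_min = x^T x - P^T R^{-1} P."""
    R = Q.T @ Q
    P = Q.T @ x
    h_opt = np.linalg.solve(R, P)   # assumes R nonsingular; cf. Section 3.5
    E_min = x @ x - P @ h_opt
    return h_opt, E_min
```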

3.4. Estimating the Fundamental Frequencies

Since in practical applications the fundamental frequencies of the individual speech waveforms are not known a priori, they must be estimated from the mixed data. A direct approach is to search the $K$-dimensional MSE surface for its minimum with respect to the fundamental frequencies. The initial estimate can be determined either from the previous frame or by applying a simple, rough multipitch estimation method such as the one proposed in [6], a time-domain method based on the average magnitude difference function. After finding an initial guess for the $\hat{\omega}_k$, the optimum fundamental frequencies can be estimated by searching the MSE surface of (18) with the method of steepest descent [7]. Using a weight vector $\mathbf{w} = [\hat{\omega}_1, \hat{\omega}_2]^T$, the steepest descent algorithm is

$$\mathbf{w}(i+1) = \mathbf{w}(i) - \frac{1}{2}\mu\,\nabla\mathbf{E}(i), \qquad (19)$$

where

$$\nabla\mathbf{E}(i) = \begin{bmatrix} \nabla E^{(1)}(i) \\ \nabla E^{(2)}(i) \end{bmatrix}, \qquad (20)$$

and $\mu$ is a positive scalar that controls both the stability and the speed of convergence. The gradient of the MSE is calculated by differentiating (18) with respect to each fundamental frequency:

$$\nabla E^{(k)} = \frac{\partial E_{\min}}{\partial \hat{\omega}_k} = -\mathbf{x}^T\dot{\mathbf{Q}}\mathbf{h}_{\mathrm{opt}} + \mathbf{h}_{\mathrm{opt}}^T\dot{\mathbf{R}}\mathbf{h}_{\mathrm{opt}} - \mathbf{h}_{\mathrm{opt}}^T\dot{\mathbf{Q}}^T\mathbf{x}, \qquad (21)$$

where

$$\dot{\mathbf{Q}} = \frac{\partial \mathbf{Q}}{\partial \hat{\omega}_k}, \qquad \dot{\mathbf{R}} = \frac{\partial \mathbf{R}}{\partial \hat{\omega}_k}. \qquad (22)$$

Differentiating (10) and substituting into (21) gives

$$\nabla E^{(k)} = -2\bigl( \mathbf{x} - \mathbf{Q}\mathbf{h}_{\mathrm{opt}} \bigr)^T\dot{\mathbf{Q}}^{(k)}\mathbf{h}_{\mathrm{opt}}^{(k)}. \qquad (23)$$

The fundamental frequencies are updated iteratively using (19). After each iteration, the optimum amplitude parameters corresponding to the estimated frequencies are calculated using (16). Note that even with (19), the final estimates of the fundamental frequencies may still have small inaccuracies, because the frequencies may vary slightly within the speech frame. The use of the exact gradient in (19) gives an advantage over [1], where an approximation of the gradient is used. Gradient calculation is an integrated part of the time-domain method, since the components on the right-hand side of (23) are already available from the previous steps of the algorithm.
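A sketch of the update (19) with the exact gradient (23), continuing the sketches above (it reuses `basis_matrix` and `ls_amplitudes`); the step size `mu` is a placeholder that would need tuning per frame:

```python
import numpy as np

def grad_E(x, omegas, L, N):
    """Eq. (23): dE_min/d omega_k = -2 (x - Q h_opt)^T Qdot^(k) h_opt^(k)."""
    Q = basis_matrix(omegas, L, N)          # sketch from Section 3.1
    h_opt, _ = ls_amplitudes(Q, x)          # sketch from Section 3.3
    r = x - Q @ h_opt                       # residual
    n = np.arange(N)[:, None]
    g = np.zeros(len(omegas))
    col = 0
    for k, (w, Lk) in enumerate(zip(omegas, L)):
        l = np.arange(1, Lk + 1)[None, :]
        # derivative of the [cos | sin] block of Q^(k) w.r.t. omega_k
        Qdot = np.hstack([-n * l * np.sin(n * l * w),
                           n * l * np.cos(n * l * w)])
        hk = h_opt[col:col + 2 * Lk]
        g[k] = -2.0 * r @ (Qdot @ hk)
        col += 2 * Lk
    return g

def steepest_descent(x, omegas, L, N, mu=1e-9, iters=50):
    """Eq. (19): w(i+1) = w(i) - (mu/2) * grad E(i); mu is hypothetical."""
    w = np.asarray(omegas, float)
    for _ in range(iters):
        w -= 0.5 * mu * grad_E(x, w, L, N)
    return w
```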

An example of the MSE surface for the single-talker ($K = 1$) case is shown in Figure 1. Figure 1(a) shows a 30-millisecond speech frame for a single talker, while Figure 1(b) shows the corresponding MSE surface using (18) as the cost function. From Figure 1(b), the optimal fundamental frequency is approximately 165 Hz. For the two-talker case ($K = 2$), the MSE surface is instead two-dimensional.

3.5. The Ill-Conditioned Estimation Problem

In some instances, the harmonics of the two speakers can be very close to each other. When harmonics overlap, the matrix $\mathbf{R}$ in (17) becomes singular, and the parameter estimation in (16) is ill-conditioned. To handle this problem, the spacing between adjacent harmonics is continuously monitored. If two adjacent harmonics are found to be closely spaced, that is, less than 25 Hz apart, a single sinusoid is used to represent both. The amplitude parameters of this single component are then estimated and shared equally between the two speakers [1].
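A sketch of the spacing check only: it flags cross-talker harmonic pairs closer than 25 Hz, leaving the bookkeeping of replacing the flagged columns of $\mathbf{Q}$ with a single shared sinusoid to the estimator.

```python
import numpy as np

def close_harmonic_pairs(omegas, L, fs, min_sep_hz=25.0):
    """Find (i, j) harmonic-index pairs of talkers 1 and 2 whose
    frequencies are closer than min_sep_hz; such pairs are modeled by a
    single sinusoid whose amplitude is later split equally [1]."""
    f1 = np.arange(1, L[0] + 1) * omegas[0] * fs / (2 * np.pi)
    f2 = np.arange(1, L[1] + 1) * omegas[1] * fs / (2 * np.pi)
    return [(i, j) for i, fi in enumerate(f1)
                    for j, fj in enumerate(f2)
                    if abs(fi - fj) < min_sep_hz]
```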

4. Simulation Results

The performance of the proposed method was evaluated using a speech database consisting of 200 frames of mixed speech. All-voiced speech segments of length 30 milliseconds were randomly chosen from the TIMIT database [8] for male and female speakers and mixed at different TIRs. The speech data were sampled at a rate of 16 kHz.
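The mixing step can be reproduced with a short helper that scales the interferer to a prescribed TIR; this is a generic sketch, not necessarily the exact procedure used to build the database:

```python
import numpy as np

def mix_at_tir(target, interferer, tir_db):
    """Scale the interferer so that 10*log10(P_target / P_interferer)
    equals the requested TIR in dB, then sum the two frames."""
    p_t = np.mean(target ** 2)
    p_i = np.mean(interferer ** 2)
    g = np.sqrt(p_t / (p_i * 10 ** (tir_db / 10)))
    return target + g * interferer
```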

Two sets of simulations were conducted to compare the performance of the proposed method with the frequency-sampling approach presented in [1]. As suggested by its authors, a Hann window and a high-resolution STFT of length $M = 4096$ were used in the frequency-domain technique. To avoid errors due to the multipitch detection algorithm, the initial guess of the fundamental frequency of each talker was calculated directly from the original speech frames before mixing, using a simple autocorrelation method.
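A minimal version of such an autocorrelation-based initial guess, with a hypothetical 60–400 Hz search range:

```python
import numpy as np

def autocorr_f0(frame, fs, fmin=60.0, fmax=400.0):
    """Rough single-talker pitch estimate: peak of the autocorrelation
    within the lag range corresponding to [fmin, fmax] Hz."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(r[lo:hi + 1])
    return fs / lag
```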

In the first set of simulations, the comparison was carried out in terms of the signal-to-distortion ratio (SDR) versus TIR, as shown in Figure 2 for TIRs ranging from −5 to +15 dB. The SDR measure is defined as [9]

$$\mathrm{SDR}_{\mathrm{dB}} = 10\log_{10}\!\left( \frac{\sum_n s(n)^2}{\sum_n \bigl[ s(n) - \hat{s}(n) \bigr]^2} \right), \qquad (24)$$

where $s(n)$ is the original target signal before mixing, and $\hat{s}(n)$ is the reconstructed signal after separation from the mixture $x(n)$. Each point in Figure 2 represents the ensemble average of the SDRs over all 200 test frames. Two cases are considered for each algorithm: in case 1, the fundamental frequencies are estimated precisely using (19); in case 2, only the initial guess of the fundamental frequencies is used. Plots SDR-TD1 and SDR-TD2 are the results for the proposed time-domain algorithm in case 1 and case 2, respectively, while plots SDR-FD1 and SDR-FD2 depict the corresponding results for the frequency-domain method. As can be seen from Figure 2, the SDR increases monotonically with TIR for both algorithms in all cases.
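The SDR of (24) is straightforward to compute; note that $s$ and $\hat{s}$ are assumed to be time-aligned:

```python
import numpy as np

def sdr_db(s, s_hat):
    """Eq. (24): SDR = 10*log10( sum s^2 / sum (s - s_hat)^2 )."""
    return 10 * np.log10(np.sum(s ** 2) / np.sum((s - s_hat) ** 2))
```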

More importantly, Figure 2 shows that the proposed technique outperforms the frequency-domain technique in both case 1 and case 2. At TIR = −5 dB, SDR-TD1 and SDR-TD2 exceed SDR-FD1 and SDR-FD2 by about 2 and 1 dB, respectively, and the difference grows at larger TIRs. As suggested in Section 1, analysis of the resulting estimates on voiced speech segments revealed that the discrepancies are due to the limited frequency resolution of the STFT (even with $M = 4096$) and to the edge effects of the window function. Other window functions, such as rectangular and Hamming windows, produced similar discrepancies when tested.

Robustness against background noise was examined in a second set of simulations in terms of MSE versus signal-to-noise ratio (SNR). Speech segments were corrupted by additive white Gaussian noise (AWGN) with the SNR varied from 0 to 15 dB. The results are presented in Figure 3. As shown in the figure, the proposed algorithm performs better at low SNR than the frequency-domain technique. The AWGN causes additional frequency-resolution problems even after a high-resolution STFT, whereas its effect on the proposed time-domain estimation approach is not as severe.

5. Conclusions

A time-domain method to precisely estimate the sinusoidal model parameters of co-channel speech has been presented. The method requires neither calculation of the STFT nor multiplication by a window function in estimating the model parameters. The proposed method incorporates a least-squares estimator and an adaptive technique to model and separate the co-channel speech into its individual speakers, entirely in the time domain.

The application of this time-domain method to real data demonstrates its effectiveness in separating co-channel speech signals at different TIRs. Overall, an improvement of 1–3 dB in SDR is obtained over the frequency-domain method, depending on the accuracy of the fundamental frequency estimates of the talkers in the tested two-talker scenario. Note that these time-domain results are compared against the frequency-domain approach with an $M = 4096$-point STFT; changes in $M$ would affect the precision of the frequency-domain estimates.

We also note that the time-domain method is not as sensitive to additive white Gaussian noise as the frequency-domain method for sinusoidal modeling. This advantage is most pronounced in lower-SNR situations.

Acknowledgment

The authors wish to thank the National Capital Institute of Telecommunications (NCIT) for partially funding this research.