Department of Systems and Computer Engineering, Carleton University, Ottawa, Ontario K1S 5B6, Canada
A time-domain method to precisely estimate the sinusoidal model parameters of co-channel speech is presented. The method requires neither the calculation of the Fourier transform nor multiplication by a window function. It incorporates a least-squares estimator and an iterative technique to model and separate the co-channel speech into its individual speakers. Application of the method to speech data demonstrates its effectiveness in separating co-channel speech signals at different target-to-interference ratios. Compared to other existing algorithms, the method produces accurate and robust parameter estimates in low signal-to-noise ratio situations.
1. Introduction
Separation of mixed speech
signals is still one of the major challenges in speech processing. This problem
is commonly referred to as co-channel speech separation. The main goal of
co-channel speech separation is to automatically process the mixed signal in
order to recover each talker's original speech. Minimizing artifacts in the
processed speech is a key concern, especially if the final goal is to use the
recovered speech in machine-based applications such as automatic speech
recognition and speaker identification systems.
Several previous studies have developed signal
processing algorithms for modeling and separating co-channel speech. The primary
approaches have taken the harmonic structure of voiced speech as the basis for
separation and have used either frequency-domain spectral analysis and
reconstruction [1–3] or time-domain filtering [4]. One promising approach to
address co-channel speech separation is to exploit a speech analysis/synthesis
system based on sinusoidal modeling of speech. For example, in [1, 2] a voiced segment of
co-channel speech is modeled as the sum of harmonically related sine waves with
constant amplitudes, frequencies, and phases. In the sinusoidal modeling
approach, the speech parameters of individual talkers are estimated by applying
a high-resolution short-time Fourier transform (STFT) to the windowed speech
waveform. The frequencies of underlying sine waves are assumed to be known a
priori from the individual speech waveforms, or they are determined by using a
simple frequency domain peak-picking algorithm. The amplitudes and phases of
the component waves are then estimated at these frequencies by performing a least-squares
(LS) algorithm. This technique has the following drawbacks: (1) the accuracy of
the estimate is limited by the frequency resolution of the STFT; (2) error is
introduced due to edge effects of the window function used for the STFT. This paper
presents a time domain method to precisely estimate the sinusoidal model
parameters of co-channel speech. The method does not require the calculation of
the STFT nor the multiplication by a window function. It incorporates a
time-domain least-squares estimator and an adaptive technique to model and
separate the co-channel speech into its individual speakers. The performance of
the proposed method is evaluated using a database consisting of a wide variety
of mixed male and female speech signals at different target-to-interference
ratios (TIRs).
This paper is organized as follows. In Section 2, the
sinusoidal model of co-channel speech consisting of $K$ speakers is
presented. The proposed time-domain method for estimating the sinusoidal model
parameters is discussed in Section 3. In Section 4, experimental results and
comparisons with other techniques are reported and discussed. Finally, the
results are summarized and conclusions given in Section 5.
2. Sinusoidal Modeling of Co-Channel Speech
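According to the speech analysis/synthesis approach
based on the sinusoidal model [1], a short segment of co-channel speech (about 20 to 30
milliseconds) can be represented as the sum of harmonically related sinusoidal
waves with constant amplitudes, frequencies, and phases as
follows:
$$x(n) = \sum_{k=1}^{K} \sum_{l=1}^{L_k} A_{k,l} \cos(\omega_{k,l} n + \phi_{k,l}) \tag{1}$$
$$x(n) = \sum_{k=1}^{K} \sum_{l=1}^{L_k} \left[ a_{k,l} \cos(l \omega_k n) + b_{k,l} \sin(l \omega_k n) \right], \tag{2}$$
where $n$ is the discrete
time index, $\omega_k$ is the
fundamental frequency (in radians per sample) for that segment of the $k$th talker, and $A_{k,l}$, $\omega_{k,l} = l\omega_k$, and $\phi_{k,l}$ denote the
amplitude, frequency, and phase, respectively, of the $l$th harmonic of the $k$th talker. The
total number of harmonics in each talker's model is given as $L_k$ for $k = 1, \ldots, K$. The quadrature amplitudes $a_{k,l}$ and $b_{k,l}$ in (2) are
related to $A_{k,l}$ and $\phi_{k,l}$ in (1) as
follows [1]:
$$a_{k,l} = A_{k,l} \cos \phi_{k,l}, \qquad b_{k,l} = -A_{k,l} \sin \phi_{k,l}. \tag{3}$$
A precise estimate of the sinusoidal-model parameters is essential for separating the co-channel speech into
its individual components. The basic problem addressed in this paper can be
stated as follows. Given the $N$ real observed samples of the
co-channel speech sequence $y(n)$, $n = 0, \ldots, N-1$, find the
parameters $a_{k,l}$, $b_{k,l}$, $\omega_k$, and $L_k$ that form the sequence
$$\hat{x}(n) = \sum_{k=1}^{K} \sum_{l=1}^{L_k} \left[ a_{k,l} \cos(l \omega_k n) + b_{k,l} \sin(l \omega_k n) \right] \tag{4}$$
that best fits $y(n)$ by minimizing
the mean-squared error (MSE)
$$E = \frac{1}{N} \sum_{n=0}^{N-1} \left[ y(n) - \hat{x}(n) \right]^2. \tag{5}$$
In the following sections, we
will consider the case of two talkers ($K = 2$) to represent
the co-channel speech without loss of generality.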
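The equivalence between the amplitude/phase form (1) and the quadrature form (2) can be checked numerically. The following is a minimal sketch, not the authors' code: the function names, the two example talkers, and the 8 kHz sampling rate are hypothetical, and the conversion assumes the standard identity $a = A\cos\phi$, $b = -A\sin\phi$.

```python
import math

def synth_polar(n_samples, talkers):
    """Sum of harmonics in amplitude/phase form.

    talkers: list of (w0, amps, phases), w0 in rad/sample (assumed layout).
    """
    x = [0.0] * n_samples
    for w0, amps, phases in talkers:
        for l, (A, phi) in enumerate(zip(amps, phases), start=1):
            for n in range(n_samples):
                x[n] += A * math.cos(l * w0 * n + phi)
    return x

def polar_to_quadrature(amps, phases):
    """Convert (A, phi) pairs to quadrature amplitudes (a, b)."""
    a = [A * math.cos(phi) for A, phi in zip(amps, phases)]
    b = [-A * math.sin(phi) for A, phi in zip(amps, phases)]
    return a, b

def synth_quadrature(n_samples, talkers_q):
    """Sum of harmonics in quadrature (cosine/sine) form."""
    x = [0.0] * n_samples
    for w0, a, b in talkers_q:
        for l, (al, bl) in enumerate(zip(a, b), start=1):
            for n in range(n_samples):
                x[n] += al * math.cos(l * w0 * n) + bl * math.sin(l * w0 * n)
    return x

# Two hypothetical talkers at 8 kHz with 120 Hz and 210 Hz fundamentals.
FS = 8000.0
talkers = [
    (2 * math.pi * 120 / FS, [1.0, 0.5, 0.25], [0.3, -1.0, 2.0]),
    (2 * math.pi * 210 / FS, [0.8, 0.4], [1.2, 0.1]),
]
talkers_q = [(w0, *polar_to_quadrature(A, p)) for w0, A, p in talkers]
x1 = synth_polar(240, talkers)
x2 = synth_quadrature(240, talkers_q)
max_diff = max(abs(u - v) for u, v in zip(x1, x2))
```

The quadrature form is the useful one for estimation because, for fixed fundamental frequencies, the model is linear in the unknown amplitudes.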
3. Time-Domain Estimation of Model Parameters
3.1. Estimation Setup
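In matrix notation, we may write (4)
as
$$\hat{\mathbf{x}} = \mathbf{W}\mathbf{a}, \tag{6}$$
where $\hat{\mathbf{x}}$ is the $N \times 1$ vector
$$\hat{\mathbf{x}} = [\hat{x}(0), \hat{x}(1), \ldots, \hat{x}(N-1)]^{T}, \tag{7}$$
and $\mathbf{a}$, which collects the quadrature amplitudes of the two talkers, is given as
$$\mathbf{a} = [\mathbf{a}_1^{T}, \mathbf{b}_1^{T}, \mathbf{a}_2^{T}, \mathbf{b}_2^{T}]^{T}, \tag{8}$$
with $\mathbf{a}_k = [a_{k,1}, \ldots, a_{k,L_k}]^{T}$ and $\mathbf{b}_k = [b_{k,1}, \ldots, b_{k,L_k}]^{T}$. $\mathbf{W}$ is an $N \times 2(L_1 + L_2)$ matrix of the form
$$\mathbf{W} = [\mathbf{C}_1, \mathbf{S}_1, \mathbf{C}_2, \mathbf{S}_2], \tag{9}$$
where the matrix elements are
given as
$$[\mathbf{C}_k]_{n,l} = \cos(l \omega_k n), \qquad [\mathbf{S}_k]_{n,l} = \sin(l \omega_k n), \tag{10}$$
with $n = 0, \ldots, N-1$ and $l = 1, \ldots, L_k$. The MSE in (5) can now be written as
$$E = \frac{1}{N} \mathbf{e}^{T}\mathbf{e}, \tag{11}$$
where
$$\mathbf{e} = \mathbf{y} - \hat{\mathbf{x}}, \tag{12}$$
$$\mathbf{y} = [y(0), y(1), \ldots, y(N-1)]^{T}. \tag{13}$$
Substituting (6) into (12), and the result into (11),
gives
$$E = \frac{1}{N} (\mathbf{y} - \mathbf{W}\mathbf{a})^{T} (\mathbf{y} - \mathbf{W}\mathbf{a}). \tag{14}$$
The estimation criterion is to minimize (14) over the parameters $a_{k,l}$, $b_{k,l}$, $\omega_k$, and $L_k$.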
The most important and difficult part of the
estimation process is to estimate the fundamental frequencies $\omega_k$. Unfortunately, without a priori knowledge of the
frequency parameters, direct minimization of (14) is a highly nonlinear problem
that is very difficult to solve. If these frequencies were known a priori or
could be estimated precisely, one could easily find the optimum values of the other
parameters accordingly.
3.2. Estimating the Number of Harmonics
If the fundamental frequencies $\omega_k$ are assumed to
be known, the total number of harmonics in each signal can be estimated simply
as
$$L_k = \left\lfloor \frac{\pi}{\omega_k} \right\rfloor, \tag{15}$$
with $\omega_k$ expressed in radians per sample. Practically, $L_k$ is chosen much
smaller than the value calculated by (15) since most of the energy of voiced
speech is concentrated below 2 kHz. This restriction can dramatically
reduce the computational complexity of the system.
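As a rough illustration of the saving, the sketch below counts harmonics up to the Nyquist frequency and up to a 2 kHz cutoff. It assumes that (15) is the usual Nyquist bound, that is, the largest integer $L$ with $L f_0 \le f_s/2$; the function name and the optional `f_max_hz` argument are hypothetical. The example values (a 165 Hz fundamental and a 16 kHz sampling rate) are the ones quoted elsewhere in this paper.

```python
import math

def num_harmonics(f0_hz, fs_hz, f_max_hz=None):
    """Number of harmonics of f0 that fit below Nyquist, or below f_max_hz if given."""
    limit = fs_hz / 2.0 if f_max_hz is None else min(f_max_hz, fs_hz / 2.0)
    return int(math.floor(limit / f0_hz))

full = num_harmonics(165.0, 16000.0)             # every harmonic up to 8 kHz
reduced = num_harmonics(165.0, 16000.0, 2000.0)  # only harmonics up to 2 kHz
```

Since each harmonic contributes one cosine and one sine amplitude, the number of unknowns per talker drops from `2 * full` to `2 * reduced` in this example.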
3.3. Estimating the Amplitude Parameters
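The optimum values of the quadrature parameters $a_{k,l}$ and $b_{k,l}$ can be
estimated directly (assuming the availability of the fundamental frequencies)
by finding the standard linear LS solution to
(14) as follows [5]:
$$\hat{\mathbf{a}} = \mathbf{R}^{-1}\mathbf{P}, \tag{16}$$
where
$$\mathbf{R} = \mathbf{W}^{T}\mathbf{W}, \qquad \mathbf{P} = \mathbf{W}^{T}\mathbf{y}. \tag{17}$$
The minimum MSE corresponding to $\hat{\mathbf{a}}$ is given by
substituting (16) into (14) to give
$$E_{\min} = \frac{1}{N} \left( \mathbf{y}^{T}\mathbf{y} - \mathbf{P}^{T}\mathbf{R}^{-1}\mathbf{P} \right). \tag{18}$$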
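The linear LS step can be sketched as follows for two talkers with known fundamental frequencies. This is an illustrative sketch under stated assumptions, not the authors' implementation: the column ordering of the matrix (cosine columns then sine columns, talker by talker), the helper names `build_w`, `solve`, and `ls_amplitudes`, and the synthetic test frame are all hypothetical. The normal equations are solved by plain Gaussian elimination so that the example needs only the standard library.

```python
import math

def build_w(n_samples, fundamentals, n_harm):
    """Rows n = 0..N-1; per talker, L cosine columns followed by L sine columns."""
    rows = []
    for n in range(n_samples):
        row = []
        for w0, L in zip(fundamentals, n_harm):
            row += [math.cos(l * w0 * n) for l in range(1, L + 1)]
            row += [math.sin(l * w0 * n) for l in range(1, L + 1)]
        rows.append(row)
    return rows

def solve(R, P):
    """Solve R a = P by Gaussian elimination with partial pivoting."""
    m = len(P)
    M = [list(R[i]) + [P[i]] for i in range(m)]
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, m):
            f = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= f * M[c][k]
    a = [0.0] * m
    for i in reversed(range(m)):
        s = sum(M[i][k] * a[k] for k in range(i + 1, m))
        a[i] = (M[i][m] - s) / M[i][i]
    return a

def ls_amplitudes(y, fundamentals, n_harm):
    """Return the LS quadrature amplitudes and the resulting MSE."""
    W = build_w(len(y), fundamentals, n_harm)
    m = len(W[0])
    R = [[sum(row[i] * row[j] for row in W) for j in range(m)] for i in range(m)]
    P = [sum(row[i] * yn for row, yn in zip(W, y)) for i in range(m)]
    a_hat = solve(R, P)
    fit = [sum(w * a for w, a in zip(row, a_hat)) for row in W]
    mse = sum((yn - fn) ** 2 for yn, fn in zip(y, fit)) / len(y)
    return a_hat, mse

# Synthetic noise-free 30 ms frame at 16 kHz, two talkers, two harmonics each.
FS, N = 16000.0, 480
w = [2 * math.pi * 100.0 / FS, 2 * math.pi * 260.0 / FS]
L = [2, 2]
a_true = [1.0, 0.5, -0.3, 0.2, 0.7, -0.4, 0.1, 0.6]
y = [sum(c * a for c, a in zip(row, a_true)) for row in build_w(N, w, L)]
a_hat, mse = ls_amplitudes(y, w, L)
```

Once the amplitudes are available, each talker's signal can be resynthesized from its own block of columns, which is the separation step. A library least-squares routine would normally be preferred to the hand-written solver for numerical robustness.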
3.4. Estimating the Fundamental Frequencies
Since, in practical applications, the fundamental
frequencies of the individual speech waveforms are not known a priori, they
must be estimated from the mixed data. A direct approach to solve this problem
is to search the $K$-dimensional
MSE surface (one dimension per talker) for its minimum with respect to the fundamental frequencies. The
initial estimate can be determined either from the previous frame or by
applying a simple rough multipitch estimation method such as the one proposed
in [6], which is a
time-domain method that depends on the average magnitude difference function.
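After finding an initial guess for the $\omega_k$, the optimum fundamental frequencies can be estimated
by searching the MSE surface of (18) by the method of steepest descent
[7]. Using a weight
vector $\boldsymbol{\omega} = [\omega_1, \omega_2]^{T}$, we describe the steepest descent algorithm by
$$\boldsymbol{\omega}^{(i+1)} = \boldsymbol{\omega}^{(i)} - \mu \nabla^{(i)}, \tag{19}$$
where
$$\nabla^{(i)} = \left[ \frac{\partial E_{\min}}{\partial \omega_1}, \frac{\partial E_{\min}}{\partial \omega_2} \right]^{T} \Bigg|_{\boldsymbol{\omega} = \boldsymbol{\omega}^{(i)}} \tag{20}$$
and $\mu$ is a positive
scalar that controls both the stability and the speed of convergence. The
gradient of the MSE is calculated by differentiating (18) with respect to each
fundamental frequency as follows:
$$\frac{\partial E_{\min}}{\partial \omega_k} = -\frac{2}{N} \mathbf{e}^{T} \frac{\partial \mathbf{W}}{\partial \omega_k} \hat{\mathbf{a}}, \tag{21}$$
where
$$\mathbf{e} = \mathbf{y} - \mathbf{W}\hat{\mathbf{a}}. \tag{22}$$
Differentiating (10) and substituting into (21) give
$$\frac{\partial E_{\min}}{\partial \omega_k} = \frac{2}{N} \sum_{n=0}^{N-1} e(n) \sum_{l=1}^{L_k} l\, n \left[ \hat{a}_{k,l} \sin(l \omega_k n) - \hat{b}_{k,l} \cos(l \omega_k n) \right]. \tag{23}$$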
The fundamental frequencies are
updated iteratively using (19). After each iteration, the optimum amplitude
parameters corresponding to the estimated frequencies are calculated using
(16). Note that even with (19), the final estimates of the fundamental frequencies
may still have small inaccuracies because the frequencies may vary slightly within
the speech frame. Using the exact gradient to update the fundamental
frequencies in (19) is an advantage over [1], where an approximation of
the gradient is used. Gradient calculation is an integrated process in the
time-domain method since the components on the right-hand side of (23)
are already part of the previous steps in the algorithm.
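The alternation between the LS amplitude step and the frequency update can be sketched for a single talker as follows. This is a simplified illustration and not the authors' implementation: it handles one fundamental frequency instead of two, the helper names `fit`, `gradient`, and `refine` are hypothetical, and the gradient expression is the standard result for a least-squares residual whose amplitudes are re-optimized at every frequency (the residual weighted by the derivative of the cosine and sine columns). The step size `mu` and the iteration count were hand-picked for this synthetic frame and would need tuning on real data.

```python
import math

def solve(R, P):
    """Solve R a = P by Gaussian elimination with partial pivoting."""
    m = len(P)
    M = [list(R[i]) + [P[i]] for i in range(m)]
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, m):
            f = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= f * M[c][k]
    a = [0.0] * m
    for i in reversed(range(m)):
        a[i] = (M[i][m] - sum(M[i][k] * a[k] for k in range(i + 1, m))) / M[i][i]
    return a

def fit(y, w0, L):
    """LS amplitudes, residual, and MSE for one talker at fundamental w0."""
    N, m = len(y), 2 * L
    W = [[math.cos(l * w0 * n) for l in range(1, L + 1)] +
         [math.sin(l * w0 * n) for l in range(1, L + 1)] for n in range(N)]
    R = [[sum(r[i] * r[j] for r in W) for j in range(m)] for i in range(m)]
    P = [sum(r[i] * yn for r, yn in zip(W, y)) for i in range(m)]
    a = solve(R, P)
    e = [yn - sum(w * c for w, c in zip(r, a)) for r, yn in zip(W, y)]
    return a, e, sum(v * v for v in e) / N

def gradient(y, w0, L):
    """Exact dE/dw0 evaluated at the LS-optimal amplitudes for this w0."""
    a, e, _ = fit(y, w0, L)
    g = 0.0
    for n, en in enumerate(e):
        for l in range(1, L + 1):
            g += en * l * n * (a[l - 1] * math.sin(l * w0 * n)
                               - a[L + l - 1] * math.cos(l * w0 * n))
    return 2.0 * g / len(y)

def refine(y, w0, L, mu, iters):
    """Fixed-step steepest descent on the fundamental frequency."""
    for _ in range(iters):
        w0 -= mu * gradient(y, w0, L)
    return w0

# Synthetic 30 ms frame at 16 kHz: 200 Hz fundamental, three harmonics.
FS, N, L = 16000.0, 480, 3
w_true = 2 * math.pi * 200.0 / FS
amps, phases = [1.0, 0.5, 0.25], [0.4, -1.1, 2.0]
y = [sum(A * math.cos(l * w_true * n + p)
         for l, (A, p) in enumerate(zip(amps, phases), start=1))
     for n in range(N)]
w_init = 2 * math.pi * 202.0 / FS            # initial guess, 2 Hz too high
w_est = refine(y, w_init, L, mu=1e-5, iters=40)
f_est = w_est * FS / (2 * math.pi)
```

The residual and the amplitudes needed by the gradient are by-products of the LS fit, so the gradient adds little cost per iteration. For two talkers the same update would be applied to each fundamental frequency, with both talkers' columns placed in one matrix.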
An example of the MSE surface obtained for the single
talker ($K = 1$) case is shown
in Figure 1. Figure 1(a) shows a 30-millisecond speech frame for a single
talker, while Figure 1(b) shows the corresponding MSE surface using (18) as the
cost function. From Figure 1(b), the optimal fundamental frequency is
approximately 165 Hz. For the two-talker case ($K = 2$), the MSE
surface would instead be two-dimensional.
Figure 1: An example of the MSE surface for a single talker: (a) 30-millisecond single voiced speech segment in the time domain, (b) MSE performance versus fundamental frequency based on (18).
3.5. The Ill-Conditioned Estimation Problem
In some instances, the harmonics of the two speakers
can be very close to each other. When the harmonics overlap, the matrix $\mathbf{R} = \mathbf{W}^{T}\mathbf{W}$ in (17) will be
singular, and the parameter estimation process in (16) becomes ill-conditioned.
To handle this problem, the spacing between adjacent harmonics is continuously
calculated. If two adjacent harmonics are found to be closely spaced, that is,
less than 25 Hz apart, only one sinusoid is used to represent these two
harmonics. The amplitude parameters of this single component are then estimated
and shared equally between the two speakers [1].
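The spacing check can be sketched as below. The function names are hypothetical, the 25 Hz threshold is the one stated above, and `split_shared` reflects one reading of "shared equally" (each talker receives half of the estimated quadrature amplitudes), which is an assumption rather than a detail confirmed by the text.

```python
def close_harmonic_pairs(f0_a, f0_b, n_harm_a, n_harm_b, min_gap_hz=25.0):
    """Return (l_a, l_b) harmonic-index pairs spaced closer than min_gap_hz."""
    pairs = []
    for la in range(1, n_harm_a + 1):
        for lb in range(1, n_harm_b + 1):
            if abs(la * f0_a - lb * f0_b) < min_gap_hz:
                pairs.append((la, lb))
    return pairs

def split_shared(a_shared, b_shared):
    """Assumed rule: give each talker half of the single component's amplitudes."""
    half = (a_shared / 2.0, b_shared / 2.0)
    return half, half

# Hypothetical talkers at 100 Hz and 150 Hz: harmonics collide at 300 and 600 Hz.
pairs = close_harmonic_pairs(100.0, 150.0, 6, 4)
talker_a, talker_b = split_shared(0.8, -0.4)
```

For each flagged pair, one of the two colliding column pairs would be dropped from the matrix before solving (16), which removes the near-duplicate columns that make it ill-conditioned.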
4. Simulation Results
The performance of the proposed method is evaluated
using a speech database consisting of 200 frames of mixed speech. All-voiced
speech segments of length 30 milliseconds were randomly chosen from the TIMIT
dataset [8] for male
and female speakers and mixed at different TIRs. The speech data were sampled
at a rate of 16 kHz.
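Mixing two frames at a prescribed TIR can be sketched as follows. The sketch assumes that the TIR is the ratio of target power to interferer power in dB over the frame, and that the interferer (not the target) is rescaled; the function names and the two synthetic tones standing in for speech frames are hypothetical.

```python
import math

def power(x):
    """Mean power of a frame."""
    return sum(v * v for v in x) / len(x)

def mix_at_tir(target, interferer, tir_db):
    """Scale the interferer so that 10*log10(P_target / P_interferer) = tir_db."""
    gain = math.sqrt(power(target) / (power(interferer) * 10 ** (tir_db / 10.0)))
    scaled = [gain * v for v in interferer]
    return [t + s for t, s in zip(target, scaled)], scaled

FS, N = 16000.0, 480  # one 30 ms frame at 16 kHz
target = [math.cos(2 * math.pi * 120 * n / FS) for n in range(N)]
interf = [0.3 * math.cos(2 * math.pi * 210 * n / FS + 1.0) for n in range(N)]
mixture, scaled = mix_at_tir(target, interf, -5.0)
achieved_tir = 10 * math.log10(power(target) / power(scaled))
```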
Two sets of simulations were conducted to compare the
performance of the proposed method with the frequency sampling approach
presented in [1]. As
suggested by the authors, a Hann window and a high-resolution STFT were used in
the frequency-domain technique. To avoid errors due to the multipitch detection
algorithm, the initial guess of the fundamental frequency of each talker was
calculated directly from the original speech frames before mixing, using a
simple autocorrelation method.
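A simple autocorrelation pitch estimate of the kind mentioned above might look as follows. The text does not specify the variant used, so this is only a generic sketch: the search range of 120 to 400 Hz, the normalization by the number of overlapping samples, and the function name are assumptions, and a pure tone stands in for a voiced speech frame.

```python
import math

def autocorr_f0(x, fs, f_min=120.0, f_max=400.0):
    """Pick the lag with the largest normalized autocorrelation in [fs/f_max, fs/f_min]."""
    lag_min = int(fs / f_max)
    lag_max = int(fs / f_min)
    best_lag, best_val = lag_min, -float("inf")
    for lag in range(lag_min, lag_max + 1):
        terms = len(x) - lag
        r = sum(x[n] * x[n + lag] for n in range(terms)) / terms
        if r > best_val:
            best_lag, best_val = lag, r
    return fs / best_lag

FS = 16000.0
frame = [math.sin(2 * math.pi * 200.0 * n / FS) for n in range(480)]
f0_guess = autocorr_f0(frame, FS)
```

The estimate is quantized to integer lags, so it is only a coarse starting point; the descent of (19) is what refines it.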
In the first set of simulations, the comparison was
carried out in terms of the signal-to-distortion ratio (SDR) versus TIR as
shown in Figure 2 for TIRs ranging from −5 to +15 dB. The SDR measure is
defined as [9]
$$\mathrm{SDR} = 10 \log_{10} \frac{\sum_{n} s^{2}(n)}{\sum_{n} \left[ s(n) - \hat{s}(n) \right]^{2}}, \tag{24}$$
where $s(n)$ is the original
target signal before mixing, and $\hat{s}(n)$ is the
reconstructed signal after separation from the mixture $y(n)$. Each point in the plot of Figure 2 represents the
ensemble average of the SDRs over all 200 test frames. Two cases are considered
for each algorithm. In case 1, precise estimation of the fundamental
frequencies is done using (19), and in case 2 only the initial guess of the
fundamental frequencies is used. Plots SDR-TD1 and SDR-TD2 are the results for
the proposed time-domain algorithm in case 1 and case 2, respectively, while
the plots SDR-FD1 and SDR-FD2 depict the results for the frequency-domain
method. As can be seen from Figure 2, the SDR increases monotonically for both
algorithms with an increase of the TIR in all cases.
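The SDR measure can be computed per frame as in the sketch below, which assumes the usual definition: the energy of the original target divided by the energy of the reconstruction error, in dB. The function name and the toy four-sample signals are hypothetical.

```python
import math

def sdr_db(target, estimate):
    """Signal-to-distortion ratio in dB between a target frame and its reconstruction."""
    num = sum(s * s for s in target)
    den = sum((s - e) ** 2 for s, e in zip(target, estimate))
    return 10.0 * math.log10(num / den)

s = [1.0, 0.0, -1.0, 0.0]
s_hat = [0.9, 0.0, -0.9, 0.0]   # 10% amplitude error on every nonzero sample
value = sdr_db(s, s_hat)        # close to 20 dB
```

A perfect reconstruction makes the denominator zero, so a practical implementation would guard that case before averaging over frames.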
Figure 2: SDR results: SDR-TD1 and SDR-TD2 for the proposed time-domain method, and SDR-FD1 and SDR-FD2 for the frequency-domain method [1], with precise and initial estimates of the fundamental frequencies, respectively.
More importantly, we see from Figure 2 that the
proposed technique outperforms the frequency-domain technique in both case 1
and case 2. At TIR = −5 dB, SDR-TD1 and SDR-TD2 are greater than SDR-FD1 and
SDR-FD2 by about 2 and 1 dB, respectively. This difference is greater at
larger TIRs. As suggested in Section 1, analysis of the resulting estimates
using voiced speech segments has revealed that the discrepancies are due to the
limited frequency resolution of the STFT (even at the high resolution used) and to
the choice of window function and the resulting edge effects. Other window
functions, such as rectangular and Hamming windows, showed similar discrepancies
when tested.
The robustness against background noise was examined
in a second set of simulations using MSE versus signal-to-noise ratio (SNR).
Speech segments were corrupted by additive white Gaussian noise (AWGN) with the SNR
varied from 0 to 15 dB. The results are presented in Figure 3. As shown in the
figure, the proposed algorithm performs better at low SNR than
the frequency-domain technique. The AWGN causes additional frequency
resolution problems even after a high-resolution STFT. If the proposed
time-domain estimation approach is used instead, the effect of the AWGN is
not as severe.
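Corrupting a frame with AWGN at a prescribed SNR can be sketched as follows. The sketch rescales the generated noise by its measured power so that the requested SNR is met exactly on the frame, which is one common convention and an assumption here; the function name, the fixed seed, and the synthetic tone are hypothetical.

```python
import math
import random

def add_awgn(x, snr_db, seed=0):
    """Add white Gaussian noise scaled to give exactly snr_db on this frame."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in x]
    p_x = sum(v * v for v in x) / len(x)
    p_n = sum(v * v for v in noise) / len(noise)
    gain = math.sqrt(p_x / (p_n * 10 ** (snr_db / 10.0)))
    noise = [gain * v for v in noise]
    return [a + b for a, b in zip(x, noise)], noise

FS, N = 16000.0, 480
clean = [math.cos(2 * math.pi * 165.0 * n / FS) for n in range(N)]
noisy, noise = add_awgn(clean, 0.0)
p_clean = sum(v * v for v in clean) / N
p_noise = sum(v * v for v in noise) / N
achieved_snr = 10 * math.log10(p_clean / p_noise)
```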
Figure 3: MSE results under AWGN for the proposed time-domain technique compared with the standard frequency-domain method [1].
5. Conclusions
A time-domain method to precisely estimate the
sinusoidal model parameters of co-channel speech is presented. The method does
not require calculation of the STFT nor multiplication by a window for the
primary model parameters. The proposed method incorporates a least-squares
estimator and an adaptive technique to model and separate the co-channel speech
into its individual speakers, all in the time domain.
The application of this time-domain method to real
data demonstrates its effectiveness in separating co-channel
speech signals at different TIRs. Overall, an improvement of 1–3 dB in SDR is
obtained over the frequency-domain method, depending on the accuracy of the
fundamental frequency estimates of the talkers in the tested two-talker
scenario. Note that these time-domain results are compared with the
frequency-domain approach at a single, fixed STFT length.
Changes in the STFT length would affect
the precision of the estimates of the frequency-domain technique.
We also note that the time-domain method is not as
sensitive to additive white Gaussian noise as is the frequency-domain method
for sinusoidal modeling. This result is particularly true for lower-SNR
situations.
Acknowledgment
The authors wish to thank the National Capital
Institute of Telecommunications (NCIT) for partially funding this research.