Abstract

This paper describes a novel algorithm for underdetermined speech separation problem based on compressed sensing which is an emerging technique for efficient data reconstruction. The proposed algorithm consists of two steps. The unknown mixing matrix is firstly estimated from the speech mixtures in the transform domain by using -means clustering algorithm. In the second step, the speech sources are recovered based on an autocalibration sparse Bayesian learning algorithm for speech signal. Numerical experiments including the comparison with other sparse representation approaches are provided to show the achieved performance improvement.

1. Introduction

In recent years, compressed sensing (CS) theory [1, 2] has attracted a great deal of attention for various applications. It is a novel concept to directly sample the signals in a compressed manner and the signals in a lossless or robust manner, under the assumption that the signals have sparse or compressible representation in a particular domain [2, 3]. In particular, the sensing procedure in CS can preserve useful information embedded in the high-dimensional signals and the CS recovering procedure can robustly reconstruct the original sparse signals from these collected low-dimensional samples [3]. In this manner, both sensing and storage costs can be substantially saved. It provides potentially a powerful framework for computing a sparse representation of signals. The key factor allowing the success of the CS technique is proper exploitation and utilization of sparsity. In practical applications, fortunately, sparsity of the signal widely exists in various applications.

Speech separation refers to the process of separating source signals from their mixtures [4, 5]. When the number of mixtures is greater than or equal to the number of sources, independent component analysis (ICA) [6] based methods are widely used. However, for the case of underdetermined separation, where the number of mixtures is less than the number of sources, ICA based methods generally fail to separate the sources. In this context, the sparsity of the signal is often utilized to separate the source signals [5, 7, 8]. A signal is considered to be sparse if most of its samples are zero [4]. Since signals such as speech are more sparse in the time frequency (TF) domain compared to that in the time domain, several algorithms have been proposed for the separation of the signals in their TF domain [710]. In the received mixtures, a single source point is defined as any TF point that is associated with only one source signal. If all the TF points are single source points, the sources are known to be W-disjoint. Assuming W-disjoint sources, the degenerate unmixing estimation technique (DUET) [7] first estimates the feature vector consisting of TF points. The extracted feature vectors are then clustered to separate the sources.

In the underdetermined speech separation problem, the underdetermined mixture is a form of compressed sampling, and therefore CS theory can be utilized to solve the problem. The similarities between CS and source separation are shown in [11]. Xu and Wang developed a framework for this problem based on CS using fixed dictionary [12], while they proposed a multistage method for underdetermined speech separation using block-based CS [13]. However, all these mentioned methods ignore the error brought in after calculating the mixing matrix. Different from the previously reported work, our proposed approach can be considered to be parametric and is particularly tailored to solve the speech recovery with inaccurate estimation of the mixing matrix. The problem is formulated in a sparse Bayesian framework and solved by Bayesian inference technique due to the privileges of the sparse Bayesian algorithm [1416]. It operates in a statistical alternating fashion, where both the estimation and uncertainty information are utilized. Moreover, for calibrating the inaccurate mixing matrix, this framework facilitates parameter learning procedures.

The rest of the paper is organized as follows. In Section 2, the underdetermined speech separation problem is formulated into a compressed sensing framework. In Sections 3 and 4, our sparse Bayesian algorithm is proposed and speech recovery algorithm which deals with both mixing error and speech recovery is described. Numerical experiments and conclusion are given in Sections 5 and 6, respectively.

2. The CS Framework of Underdetermined Separation

The task of speech separation is to recover the sources using the observable signals. The noise-free instantaneous mixing model can be described as follows:where the mixing matrix is unknown, is the observed data vector at discrete time instant , is the unknown source vector, is the number of the microphones, and is the number of the sources. In this paper, we focus on the underdetermined speech separation; that is, .

Let us expand (1) aswhere stands for discrete time instants, , , is the th mixed signal at time instant , is the th element in mixing matrix , and , , is the th source signal at time instant . We carry out separation frame by frame with the window length , usually , and adjacent frames are overlapped.

Let us define some notations as follows: denotes matrix, where denotes a diagonal matrix, andWe also define every frame of the mixed and source signal as column vectors.where , denotes a frame of the th mixed signal and , denotes a frame of the th source signal.

For every frame, (4) can be converted into the form

We assume that the source has a sparse representation on some dictionary where is the sparse coefficient vector and is the dictionary on which has a sparse representation. Then can be sparsely represented bywhereis a dictionary composed of and

Thenwhere can be recovered by measurements using an optimization processand denotes the -norm. For a general CS problem, obtaining the sparsest solution to an underdetermined system (11) is known to be a NP-hard problem, which requires intractable computations [1]. This has led to considerable efforts in developing tractable approximations to find the sparse solutions. In general, most of the sparse recovery algorithms can be categorized into one of the following three categories.(i)The first one is generally known as greedy algorithms. These algorithms approximate the signals’ support and amplitude iteratively. Orthogonal matching pursuit (OMP) [17] is a classical representative in this category.(ii)The second category is associated with regularized optimization method, which can be considered as the tightest convex relaxation of norm. The basis pursuit (BP) [18] and basis pursuit denoising (BPDN) [19] are the classical regularized optimization methods to recover sparse signals in noiseless and noisy environments, respectively.(iii)The third category is based on the sparse Bayesian methodology. The problem is formulated as learning and inference in a probabilistic model. By properly choosing the hierarchical prior for the signals, sparsity can be imposed statistically [14, 20]. Sparse Bayesian learning is a classical method to recover the sparse signals by formulating a scaled Gaussian mixtures model [14]. The main advantages of the sparse Bayesian methods are their desirable statistical characteristics and flexibility in imposing prior information.To solve from (11), the observation , mixing matrix , and dictionary are required, respectively. Many methods for dictionary training and estimating the mixing matrix have been reported [7, 21, 22]. For convenience and without losing generality, we utilize the K-SVD [23] dictionary composition and -means unmixing estimation technique [12] that is to be described in Section 3. Then the method to solve is described in Section 4.

The detailed procedures are summarized as follows.

Algorithm 1 (procedure for dictionary training and mixing matrix estimation). (1)Every speaker’s speech sample is taken as training data using K-SVD method. are obtained.(2)Mixing matrix is estimated in frequency domain by -means unmixing estimation technique. The estimate for , that is,, is obtained.(3)Format and into and according to the dimension of and the selected frame window size.(4) is the mixed speech frame.(5)The separated speech signal can be recovered by solving (10) subject to (11) in a frame by frame manner.

3. Estimation of the Mixing Matrix

In TF domain, the mixing model in (1) can be written aswhere and contain the STFT coefficients of and , respectively. At every TF point , we havewhere is the th element in mixing matrix and they can be complex numbers as well. Denote ; then is noninvertible. The sources are generally estimated under the assumption that the source signals are W-disjoint. Defining as the set of TF points in frequency bin where is the dominant source, that is, for , the mixing model in (13) can then be simplified asThe above equation implies that, given , the vector can be estimated up to an amplitude and phase ambiguity. Without loss of generality, this ambiguity is resolved by assuming that is of unit norm with the first element being a positive and real value [10]. This can be achieved by normalizing the mixture sample vector aswhere is the phase of the first entry of and denotes the -norm. The normalized can now be clustered into clusters so that centroid of the th cluster corresponds to the estimate of [7, 10].

Conventional algorithms reported in [810] assume that the approximation in (14) holds for all the TF points. This, however, may not be true in a real environment. Instead of assuming that (14) applies for all the TF points, our proposed algorithm introduces a single source measure to quantify the validity of (14) for each TF point. Only TF points with a high value of confidence are used to estimate , based on which a more accurate mask can be computed to separate the sources.

3.1. The Proposed TF Points Selection

From (14), the corresponding autocorrelation matrix can be expressed aswhere is the expectation operator, , and is the conjugate operator. Therefore for single source points, we haveConsidering that speech utterances are locally stationary [25],where specifies the number of neighboring TF points used to estimate and is adjustable according to the time duration within which the source signals are considered to be stationary. It may not be proper to direct use as the single source TF point measure because the energy of nondominant sources is not always zero. To deal with this issue, a modified TF point selection method is provided.

Assume that at a particular TF point , source signals to , have nonzero energy and the signal is assumed to be the dominant source which is dB higher than the other sources in terms of energy at ; that is,

Assuming the sources are uncorrelated, the autocorrelation matrix for this TF point can then be expressed aswhereand , is the th row of the mixing matrix . Note that the diagonal elements of the matrix are nonzero. It is not proper to directly use (17) to determine single source point. Alternatively a continuous measureis also not feasible since and are unknown in the speech separation problem.

Considering that the decomposition in (20) is similar to the singular value decomposition (SVD) of , the ratio of singular values of is proposed as the single source TF point measure (SSTFM); that is,where is the th singular value of and is the maximum singular value of . Application of SVD to detect single source points is valid since the separation problem at each single source TF point is an overdetermined problem. The TF points after selection are as illustrated in Figure 1.

Since the SSTFM provides a measure of being a single source point, only those with a high SSTFM should be used by the clustering algorithm to estimate . This can be achieved by selecting which has a SSTFM value above a threshold. This threshold can be predetermined or simply select the median value of which will prevent cases of too few identified as single source point, which may degrade the performance of using clustering algorithm to estimate . As a result, for each frequency bin , the subsetis used to estimate , that is, in (11).

3.2. Estimating of Mixing Matrix

After selecting the TF points, the next stage is the estimation of the mixing matrix. Here we are using the -means clustering techniques [26]. It is noted that this may not be the best algorithm to cluster the samples as other algorithms can also be used. Since only the selected TF points are used in this step, the scatter plot has a clear orientation towards the directions of the column vectors in the mixing matrix. Hence these points are clustered into groups. After clustering, the column vectors of the mixing matrix are determined by calculating the centroid of each cluster. Note that the points lying in the left-hand side of the vertical axis in the scatter diagram are mapped to the right-hand side (by changing their sign) before calculating the centroid.

3.3. Obtaining the Dictionary

We assume that before separating sources we have some speech samples as training data. In our approach, we directly train K-SVD dictionary on these speech samples using the source-trained strategy described in [27].

4. Speech Recovery

The matrix in (11) is not precisely known. The matrix obtained from the mixing matrix estimation step (Procedure () in Algorithm 1) contains errors due to time frequency point selection. From numerical experiments, recovering speech with the estimated matrix brings crosstalk residuals [1113]. In theory, (11) with is degenerated intowhere solved from (25) is not the sparse solution with and observation . Instead, it is a sparse solution to a linear combination of trained dictionaries.

Thus to get the correct sparse solution, (10) should be alternatively expressed aswhere is diagonal matrix to represent the difference between accurate and estimated . In order to solve from (26), we introduce a sparse Bayesian model as follows.

4.1. Bayesian Model

Sparse Bayesian model [20] was derived from the research area of machine learning and has become a popular method for sparse signal recovery in CS. In this model, the sparse signal recovery problem is formulated from a Bayesian perspective while the sparsity information is exploited by assuming a sparse prior for the signal of interest. For instance, a Laplace prior [28] corresponds to the norm which has been widely studied in existing optimization approaches. Since exact Bayesian inference is typically intractable, approximation approaches to Bayesian inference have been adopted in [28]. One advantage of Bayesian CS compared to other CS methods is the flexibility of modeling sparse signals. It can promote the sparsity of its solution and exploit additionally known structures of the sparse signal [29] as in our case. Thus a sparse Bayesian model is selected to solve (26).

In our approach, we use the sparse Bayesian model in [28] to recover the sparse signals for sparser solutions. In other words, a sparer solution can be obtained by calibrating in (26). The mixing matrix difference is assumed to be an independent and identically distributed Gaussian noise and an empirical variance . Probabilistic model is used for convenient inference. According to (26), the observation is assumed to obey the following distribution:where is a noise variance between and recovered signal and assumed to be known a prior [28].

Probabilistic model is used for the signal to impose the sparsity-inducing Laplace prior and enable convenient inference [28]. In the first stage of the signal model, the probability density function of the speech sources, , is given aswhere and denotes the length of combined dictionary . The hyperparameter, for , is modeled as an independent Gamma distribution

Based on (28) and (29), the marginalized distribution obeys a Laplace distribution [28]. Furthermore, the parameter, , controls the shape of the Laplace distribution and determines the sparsity of the signal . To conveniently learn , a Gamma distribution is assumed aswhere is a parameter to be tuned and often set to be a small value as suggested in [28].

According to (28)–(30), the joint probability distribution conditioned on can be derived as

4.2. Proposed Methods

An expectation maximization (EM) algorithm is implemented to solve (31). The EM algorithm requires the knowledge of the posterior distribution [30]where is obtained in (31) and .

Subsequently, a distribution , where denotes the hidden variables, is assumed to approximate the true posterior, which minimizes the Kullback-Leibler (KL) divergence between and the true posterior [31]:

Then the hidden variable and parameter can be iteratively updated by the following steps.

4.2.1. Expectation Stage

In this stage, the method assumes that has a factorization form:According to [31], the optimal distribution that minimizes (33) can be expressed aswhere denotes the expectation with respect to and represents the set without . By substituting , , and into (35), respectively, the approximation is obtained from the following procedures.

(i) For , we haveBy substituting (27) and (28) into (36), obeys the Gaussian distribution with a mean and a covariance matrix as

(ii) For , we obtainBy substituting (28) and (29) into (38), obeys a generalized inverse Gaussian distribution whose th moment is expressed as [32]where is a Bessel function of the second kind.

(iii) For , we haveBy substituting (29) and (30) into (40), it is shown that obeys a Gamma distribution with a meanThe optimal approximated distribution can be obtained by iteratively calculating the above steps until convergence and each hidden variable is updated once before proceeding to the next stage.

4.2.2. Maximization Stage

According to [31], is estimated by maximizing the expected log-likelihood asThere exists a closed-form solution to (42) for updating ; that is,where represents the row of matrix .

In summary, this proposed algorithm jointly estimates and to achieve sparsity and the procedures are listed in Algorithm 2.

Algorithm 2 (proposed method for solving ). (1) Input: .(2) While converge do (3)Update and by (37).(4)Update by (39)(5)Update by (41)(6)Update by (43)(7) end while (8) Output: (9)The separated signals are recovered by .

5. Numerical Experiments

In this section we firstly describe the setting of our experiments and then comparisons are made in terms of the obtained performances by the proposed method and the other ones reported in the literature.

5.1. Estimating the Mixing Matrix

To compare our proposed modified unmixing method with algorithm described in [13], we use the mixing matrix randomly generated and mix the clean speeches from [33] to get the speech mixtures. The average normalized mean square error (NMSE) over 50 trials is presented in Figure 2(a) where the median value of SSTFM values is used as threshold rather than a predetermined one, and the value of in (18) is 2. Figure 2(a) shows that the proposed TF selection method has a smaller NMSE with various dimensions of mixing matrix. However the exact mixing matrix is not obtained because the NMSEs obtained by both methods are always larger than zero, which motivates the model reported in (26). Figure 2(b) illustrates two source signals at frequency bin and the estimated SSTFM using the mixtures in a simulation trial. It can been seen that the proposed method is effective in identifying the single source points with unfixed SSTFM, as marked by the dashed ovals.

5.2. Speech Signal Recovery

The speech sample is downloaded from the database of the source separation evaluation campaign [33]. For every speaker, we have sentences for testing and training separately. A sparse dictionary is prepared for every speaker by using K-SVD Algorithm [23]. Figure 3 illustrates the sparse coefficients obtained from the source signals. The parameters for K-SVD are in Table 1.

The mixing matrix is randomly generated. Two examples of random and mixed matrices and the corresponding estimated matrices by modified method are shown as reference with permutation ambiguity ignored.

Figures 5 and 6 show the recovery signal in STFT domain using sparse Bayesian [14] and proposed method, respectively. Compared with original sources in Figure 4, the significant differences are marked by the highlighted ovals. In Figure 5(b), the interference inside the green ovals mainly comes from source 1 (also see Figure 4(a)) while the points in red ovals are mainly from source 3 (also see Figure 4(a)). With our proposed method, these interferences are avoided by calibrating the mixing matrix while solving the sparse solution.

The separation performance is quantified in terms of source-to-interference ratio (SIR), source-to-distortion ratio (SDR), and source-to-artifacts ratio (SAR). Without loss of generality, we assume the separated output corresponding to the source signal for ease presentation. The three performance measures first decompose the th source estimate using orthogonal projections as [34]where is the portion attributing to , is the interference from other sources, is the artifacts introduced by the separation algorithm, and is the noise effect. The SIR, SDR, and SAR for source are defined as Since SDR considers both interference and artifacts, it is expected to be a more comprehensive criterion compared with SIR and SAR [34]. All the above measures can be computed using the BSS-EVAL Toolbox [35]. Note that in our noiseless simulations, , which will not affect the three criteria defined above.

The performance gain of the proposed algorithm compared to other algorithms is illustrated in Table 2.

From Table 2, it is seen that in the underdetermined situation (mixing matrix = 2 3 and 2 4), the proposed method increases both SIR and SAR, which means that, by introducing autocalibrating to (11) to obtain a sparser representation, both interferences and artifacts are effectively suppressed. The proposed algorithm achieves worse performance than that obtained by using a known mixing matrix, as shown in the last column of Table 2 because that is solved by using not . Note that even using a known mixing matrix, the recovery signal still have interference, distortion, and artifacts for two reasons. Firstly for mixtures separation, is solved by using rather than the trained dictionary directly. Secondly, the speech sources used for dictionary training are different from the those used for separation evaluation.

6. Conclusions

We have presented the compressed sensing based algorithm to solve the problem of instantaneous underdetermined speech separation. Since exact mixing matrix is unreachable, the sparse Bayesian learning model is used and the separated speeches are recovered from approximated posterior distribution derived with the EM method. The proposed one operates in a statistical manner to achieve a sparser estimation. Numerical experimental results show that the proposed algorithm provides better estimation performance than the other methods reported in the literature.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This research work is supported by the National Natural Science Foundation of China (no. 61571174) and Zhejiang Provincial Natural Science Foundation of China (no. LY15F010010).