Mathematical Problems in Engineering

Volume 2016, Article ID 3982360, 13 pages

http://dx.doi.org/10.1155/2016/3982360

## Underdetermined Separation of Speech Mixture Based on Sparse Bayesian Learning

^{1}School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
^{2}School of Electronic Engineering and Automation, City College of Dalian University of Technology, Dalian, China
^{3}School of Information Science and Engineering, Hangzhou Normal University, Hangzhou, China

Received 31 March 2016; Revised 1 September 2016; Accepted 19 September 2016

Academic Editor: Eric Feulvarch

Copyright © 2016 Zhe Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

This paper describes a novel algorithm for the underdetermined speech separation problem based on compressed sensing, an emerging technique for efficient data reconstruction. The proposed algorithm consists of two steps. First, the unknown mixing matrix is estimated from the speech mixtures in the transform domain using the K-means clustering algorithm. In the second step, the speech sources are recovered by an autocalibrating sparse Bayesian learning algorithm tailored to speech signals. Numerical experiments, including comparisons with other sparse representation approaches, demonstrate the achieved performance improvement.

#### 1. Introduction

In recent years, compressed sensing (CS) theory [1, 2] has attracted a great deal of attention in various applications. It is a novel concept that directly samples signals in a compressed manner and recovers them in a lossless or robust manner, under the assumption that the signals have a sparse or compressible representation in a particular domain [2, 3]. In particular, the sensing procedure in CS preserves the useful information embedded in high-dimensional signals, and the CS recovery procedure can robustly reconstruct the original sparse signals from the collected low-dimensional samples [3]. In this manner, both sensing and storage costs can be substantially reduced, providing a potentially powerful framework for computing sparse representations of signals. The key factor behind the success of the CS technique is the proper exploitation and utilization of sparsity. Fortunately, sparsity widely exists in practical signals.

Speech separation refers to the process of separating source signals from their mixtures [4, 5]. When the number of mixtures is greater than or equal to the number of sources, independent component analysis (ICA) based methods [6] are widely used. However, for the case of underdetermined separation, where the number of mixtures is less than the number of sources, ICA based methods generally fail to separate the sources. In this context, the sparsity of the signal is often utilized to separate the source signals [5, 7, 8]. A signal is considered to be sparse if most of its samples are zero [4]. Since signals such as speech are sparser in the time-frequency (TF) domain than in the time domain, several algorithms have been proposed to separate the signals in the TF domain [7–10]. In the received mixtures, a single source point is defined as any TF point that is associated with only one source signal. If all the TF points are single source points, the sources are said to be W-disjoint. Assuming W-disjoint sources, the degenerate unmixing estimation technique (DUET) [7] first estimates a feature vector at each TF point. The extracted feature vectors are then clustered to separate the sources.

In the underdetermined speech separation problem, the underdetermined mixture is a form of compressed sampling, and therefore CS theory can be utilized to solve the problem. The similarities between CS and source separation are shown in [11]. Xu and Wang developed a framework for this problem based on CS with a fixed dictionary [12] and later proposed a multistage method for underdetermined speech separation using block-based CS [13]. However, all these methods ignore the error introduced when estimating the mixing matrix. Different from the previously reported work, our proposed approach can be considered parametric and is particularly tailored to recover speech under an inaccurately estimated mixing matrix. The problem is formulated in a sparse Bayesian framework and solved by Bayesian inference, owing to the advantages of sparse Bayesian algorithms [14–16]. The algorithm operates in a statistically alternating fashion in which both the estimates and their uncertainty information are utilized. Moreover, this framework facilitates parameter learning procedures for calibrating the inaccurate mixing matrix.

The rest of the paper is organized as follows. In Section 2, the underdetermined speech separation problem is formulated within a compressed sensing framework. In Sections 3 and 4, our sparse Bayesian algorithm is proposed, and the speech recovery procedure, which handles both the mixing-matrix error and the speech recovery, is described. Numerical experiments and conclusions are given in Sections 5 and 6, respectively.

#### 2. The CS Framework of Underdetermined Separation

The task of speech separation is to recover the sources using the observable signals. The noise-free instantaneous mixing model can be described as follows:
$$\mathbf{x}(t) = \mathbf{A}\mathbf{s}(t), \tag{1}$$
where the mixing matrix $\mathbf{A} \in \mathbb{R}^{M \times N}$ is unknown, $\mathbf{x}(t) = [x_1(t), \dots, x_M(t)]^T$ is the observed data vector at discrete time instant $t$, $\mathbf{s}(t) = [s_1(t), \dots, s_N(t)]^T$ is the unknown source vector, $M$ is the number of microphones, and $N$ is the number of sources. In this paper, we focus on underdetermined speech separation; that is, $M < N$.

Let us expand (1) as
$$x_i(t) = \sum_{j=1}^{N} a_{ij} s_j(t), \tag{2}$$
where $t = 1, \dots, T$ stands for discrete time instants, $x_i(t)$, $i = 1, \dots, M$, is the $i$th mixed signal at time instant $t$, $a_{ij}$ is the $(i,j)$th element of the mixing matrix $\mathbf{A}$, and $s_j(t)$, $j = 1, \dots, N$, is the $j$th source signal at time instant $t$. We carry out separation frame by frame with window length $L$, and adjacent frames are overlapped.
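The mixing and framing steps above can be sketched in NumPy as follows; the signal sizes, the 50% overlap, and the random mixing matrix are our illustrative choices, not values fixed by the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

M, N, T = 2, 3, 8000                  # 2 microphones, 3 sources, 8000 samples
A = rng.standard_normal((M, N))       # unknown mixing matrix (random here for illustration)
s = rng.laplace(size=(N, T))          # sparse-ish source signals

x = A @ s                             # noise-free instantaneous mixtures, x(t) = A s(t)

# frame the mixtures with window length L and 50% overlap (illustrative choices)
L, hop = 512, 256
frames = np.stack([x[:, i:i + L] for i in range(0, T - L + 1, hop)])
print(frames.shape)                   # (num_frames, M, L)
```

Each frame is then processed independently, which is what makes the per-frame CS formulation below possible.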

Let us define some notation as follows: $\bar{\mathbf{A}}$ denotes the $ML \times NL$ block matrix
$$\bar{\mathbf{A}} = \begin{bmatrix} \mathbf{A}_{11} & \cdots & \mathbf{A}_{1N} \\ \vdots & \ddots & \vdots \\ \mathbf{A}_{M1} & \cdots & \mathbf{A}_{MN} \end{bmatrix}, \tag{3}$$
where $\mathbf{A}_{ij} = a_{ij}\mathbf{I}_L$ denotes a diagonal matrix. We also define every frame of the mixed and source signals as column vectors
$$\mathbf{x}_i = [x_i(1), \dots, x_i(L)]^T, \qquad \mathbf{s}_j = [s_j(1), \dots, s_j(L)]^T, \tag{4}$$
where $\mathbf{x}_i$, $i = 1, \dots, M$, denotes a frame of the $i$th mixed signal and $\mathbf{s}_j$, $j = 1, \dots, N$, denotes a frame of the $j$th source signal.

For every frame, (4) can be converted into the form
$$\mathbf{x} = \bar{\mathbf{A}}\mathbf{s}, \tag{5}$$
where $\mathbf{x} = [\mathbf{x}_1^T, \dots, \mathbf{x}_M^T]^T$ and $\mathbf{s} = [\mathbf{s}_1^T, \dots, \mathbf{s}_N^T]^T$.

We assume that each source frame $\mathbf{s}_j$ has a sparse representation on some dictionary,
$$\mathbf{s}_j = \mathbf{D}_j \boldsymbol{\theta}_j, \tag{6}$$
where $\boldsymbol{\theta}_j$ is the sparse coefficient vector and $\mathbf{D}_j$ is the dictionary on which $\mathbf{s}_j$ has a sparse representation. Then $\mathbf{s}$ can be sparsely represented by
$$\mathbf{s} = \mathbf{D}\boldsymbol{\theta}, \tag{7}$$
where
$$\mathbf{D} = \begin{bmatrix} \mathbf{D}_1 & & \\ & \ddots & \\ & & \mathbf{D}_N \end{bmatrix} \tag{8}$$
is a block-diagonal dictionary composed of $\mathbf{D}_1, \dots, \mathbf{D}_N$ and $\boldsymbol{\theta} = [\boldsymbol{\theta}_1^T, \dots, \boldsymbol{\theta}_N^T]^T$.
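To make the block structure concrete, the following NumPy sketch builds the expanded mixing matrix (whose $(i,j)$ block is $a_{ij}\mathbf{I}_L$, i.e., a Kronecker product) and the block-diagonal dictionary. All sizes are tiny illustrative choices, and `scipy.linalg.block_diag` is assumed to be available:

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(1)
M, N, L, K = 2, 3, 4, 6                 # tiny illustrative sizes; K atoms per dictionary

A = rng.standard_normal((M, N))
A_bar = np.kron(A, np.eye(L))           # block (i, j) equals A[i, j] * I_L

# per-source dictionaries D_j (L x K); D is block-diagonal
D_list = [rng.standard_normal((L, K)) for _ in range(N)]
D = block_diag(*D_list)                 # (N*L) x (N*K)

theta = rng.standard_normal(N * K)      # stacked sparse coefficients
s = D @ theta                           # stacked source frames, s = D theta
x = A_bar @ s                           # stacked mixture frames, x = A_bar s
print(A_bar.shape, D.shape, x.shape)
```

The Kronecker product is a compact way to realize the diagonal-block expansion without writing the loop explicitly.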

Then
$$\mathbf{x} = \bar{\mathbf{A}}\mathbf{D}\boldsymbol{\theta}, \tag{9}$$
where $\boldsymbol{\theta}$ can be recovered from the measurements $\mathbf{x}$ using the optimization process
$$\min_{\boldsymbol{\theta}} \|\boldsymbol{\theta}\|_0 \tag{10}$$
$$\text{subject to } \mathbf{x} = \bar{\mathbf{A}}\mathbf{D}\boldsymbol{\theta}, \tag{11}$$
where $\|\cdot\|_0$ denotes the $\ell_0$-norm. For a general CS problem, obtaining the sparsest solution to the underdetermined system (11) is known to be an NP-hard problem, which requires intractable computations [1]. This has led to considerable efforts in developing tractable approximations to find the sparse solutions. In general, most sparse recovery algorithms fall into one of the following three categories.

(i) The first category is generally known as greedy algorithms. These algorithms approximate the signal's support and amplitudes iteratively. Orthogonal matching pursuit (OMP) [17] is a classical representative of this category.

(ii) The second category is associated with $\ell_1$-regularized optimization methods, where the $\ell_1$-norm can be considered the tightest convex relaxation of the $\ell_0$-norm. Basis pursuit (BP) [18] and basis pursuit denoising (BPDN) [19] are the classical $\ell_1$-regularized methods for recovering sparse signals in noiseless and noisy environments, respectively.

(iii) The third category is based on the sparse Bayesian methodology. The problem is formulated as learning and inference in a probabilistic model. By properly choosing a hierarchical prior for the signals, sparsity can be imposed statistically [14, 20]. Sparse Bayesian learning is a classical method to recover sparse signals by formulating a scaled Gaussian mixture model [14]. The main advantages of the sparse Bayesian methods are their desirable statistical characteristics and their flexibility in imposing prior information.

To solve (10)-(11) for $\boldsymbol{\theta}$, the observations $\mathbf{x}$, the mixing matrix $\mathbf{A}$, and the dictionary $\mathbf{D}$ are required. Many methods for dictionary training and for estimating the mixing matrix have been reported [7, 21, 22]. For convenience and without loss of generality, we utilize the K-SVD [23] dictionary composition and the K-means unmixing estimation technique [12], which is described in Section 3. The method to solve for $\boldsymbol{\theta}$ is then described in Section 4.
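As an illustration of the greedy category (i), the following is a minimal OMP sketch; it is not the paper's recovery algorithm (which is sparse Bayesian), and the function name, toy sizes, and random data are our own:

```python
import numpy as np

def omp(Phi, y, n_nonzero, tol=1e-10):
    """Greedy orthogonal matching pursuit: approximately solves
    min ||theta||_0  subject to  y = Phi @ theta."""
    residual = y.copy()
    support = []
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_nonzero):
        # pick the column most correlated with the current residual
        idx = int(np.argmax(np.abs(Phi.T @ residual)))
        if idx not in support:
            support.append(idx)
        # least-squares refit on the current support
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
        if np.linalg.norm(residual) < tol:
            break
    theta[support] = coef
    return theta

# toy check: recover a 2-sparse vector from 20 random measurements
rng = np.random.default_rng(2)
Phi = rng.standard_normal((20, 30))
true = np.zeros(30)
true[[3, 17]] = [1.5, -2.0]
est = omp(Phi, Phi @ true, n_nonzero=2)
print(np.flatnonzero(est))
```

With this many measurements and such low sparsity, OMP recovers the support with very high probability, which is why it serves well as a baseline in CS experiments.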

The detailed procedures are summarized as follows.

*Algorithm 1 (procedure for dictionary training and mixing matrix estimation).*
(1) Every speaker's speech sample is taken as training data for the K-SVD method; the dictionaries $\mathbf{D}_j$ are obtained.
(2) The mixing matrix is estimated in the frequency domain by the K-means unmixing estimation technique; the estimate of $\mathbf{A}$, that is, $\hat{\mathbf{A}}$, is obtained.
(3) Format $\hat{\mathbf{A}}$ and $\mathbf{D}_j$ into $\bar{\mathbf{A}}$ and $\mathbf{D}$ according to the dimensions of $\hat{\mathbf{A}}$ and the selected frame window size.
(4) Take $\mathbf{x}$ as the mixed speech frame.
(5) The separated speech signal can be recovered by solving (10) subject to (11) in a frame-by-frame manner.
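Steps (3)-(5) of the procedure can be sketched as below. This is a hedged skeleton: a plain least-squares solver stands in where the paper's sparse Bayesian recovery would go, and the function name `separate_frames`, the sizes, and the random stand-in data are our own:

```python
import numpy as np

def separate_frames(x_frames, A_hat, D_blocks, solve_sparse):
    """Frame-by-frame recovery skeleton (steps (3)-(5)).

    x_frames     : (num_frames, M*L) stacked mixture frames
    A_hat        : (M, N) estimated mixing matrix
    D_blocks     : list of N dictionaries, each (L, K)
    solve_sparse : solver for x = Phi @ theta returning theta
    """
    L = D_blocks[0].shape[0]
    A_bar = np.kron(A_hat, np.eye(L))              # step (3): expand A_hat
    n_cols = sum(Dj.shape[1] for Dj in D_blocks)
    D = np.zeros((len(D_blocks) * L, n_cols))
    col = 0
    for j, Dj in enumerate(D_blocks):              # block-diagonal dictionary
        D[j * L:(j + 1) * L, col:col + Dj.shape[1]] = Dj
        col += Dj.shape[1]
    Phi = A_bar @ D
    out = []
    for x in x_frames:                             # steps (4)-(5): per-frame recovery
        theta = solve_sparse(Phi, x)
        out.append(D @ theta)                      # stacked source frames s = D theta
    return np.array(out)

# stand-in solver: plain least squares (a real system would use OMP/BPDN/SBL)
ls = lambda Phi, x: np.linalg.lstsq(Phi, x, rcond=None)[0]

rng = np.random.default_rng(3)
M, N, L, K = 2, 3, 8, 4
A_hat = rng.standard_normal((M, N))
D_blocks = [rng.standard_normal((L, K)) for _ in range(N)]
x_frames = rng.standard_normal((5, M * L))
s_frames = separate_frames(x_frames, A_hat, D_blocks, ls)
print(s_frames.shape)                              # (num_frames, N*L)
```

The point of the skeleton is the data flow: the same expanded operator `Phi` is reused across frames, and only the per-frame coefficient solve changes with the chosen recovery algorithm.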

#### 3. Estimation of the Mixing Matrix

In the TF domain, the mixing model in (1) can be written as
$$\mathbf{X}(k,l) = \mathbf{A}\mathbf{S}(k,l), \tag{12}$$
where $\mathbf{X}(k,l)$ and $\mathbf{S}(k,l)$ contain the STFT coefficients of $\mathbf{x}(t)$ and $\mathbf{s}(t)$, respectively. At every TF point $(k,l)$, we have
$$X_i(k,l) = \sum_{j=1}^{N} a_{ij} S_j(k,l), \tag{13}$$
where $a_{ij}$ is the $(i,j)$th element of the mixing matrix $\mathbf{A}$, and the elements can be complex numbers as well. Denote $\mathbf{A} = [\mathbf{a}_1, \dots, \mathbf{a}_N]$; since $M < N$, $\mathbf{A}$ is noninvertible. The sources are generally estimated under the assumption that the source signals are W-disjoint. Defining $\Omega_j(k)$ as the set of TF points in frequency bin $k$ where $S_j$ is the dominant source, the mixing model in (13) can then be simplified as
$$\mathbf{X}(k,l) \approx \mathbf{a}_j S_j(k,l), \quad (k,l) \in \Omega_j(k). \tag{14}$$
The above equation implies that, given $\Omega_j(k)$, the vector $\mathbf{a}_j$ can be estimated up to an amplitude and phase ambiguity. Without loss of generality, this ambiguity is resolved by assuming that $\mathbf{a}_j$ is of unit norm with the first element being a positive and real value [10]. This can be achieved by normalizing the mixture sample vector as
$$\bar{\mathbf{X}}(k,l) = \frac{\mathbf{X}(k,l)\, e^{-\mathrm{j}\arg(X_1(k,l))}}{\|\mathbf{X}(k,l)\|_2}, \tag{15}$$
where $\arg(X_1(k,l))$ is the phase of the first entry of $\mathbf{X}(k,l)$ and $\|\cdot\|_2$ denotes the $\ell_2$-norm. The normalized $\bar{\mathbf{X}}(k,l)$ can now be clustered into $N$ clusters so that the centroid of the $j$th cluster corresponds to the estimate of $\mathbf{a}_j$ [7, 10].
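One possible sketch of this normalize-then-cluster estimator is given below. The function name `estimate_mixing`, the feature construction (stacking real and imaginary parts before clustering), and the perfectly W-disjoint toy data are our own choices; scikit-learn's `KMeans` is assumed to be available:

```python
import numpy as np
from sklearn.cluster import KMeans

def estimate_mixing(X_tf, n_sources):
    """Estimate the columns a_j of A by clustering normalized TF mixture vectors.

    X_tf : (M, num_tf_points) complex STFT coefficients of the mixtures.
    Returns an (M, n_sources) estimate with unit-norm columns.
    """
    # remove the phase of the first entry, then scale to unit l2-norm
    Xn = X_tf * np.exp(-1j * np.angle(X_tf[0])) / np.linalg.norm(X_tf, axis=0)
    # cluster real/imag parts; centroids approximate the normalized columns a_j
    feats = np.concatenate([Xn.real, Xn.imag], axis=0).T
    km = KMeans(n_clusters=n_sources, n_init=10, random_state=0).fit(feats)
    C = km.cluster_centers_.T
    M = X_tf.shape[0]
    A_hat = C[:M] + 1j * C[M:]
    return A_hat / np.linalg.norm(A_hat, axis=0)

# toy W-disjoint mixture: every TF point carries exactly one source
rng = np.random.default_rng(4)
A = rng.standard_normal((2, 3)) + 1j * rng.standard_normal((2, 3))
A[0] = np.abs(A[0])                       # first row real and positive
A /= np.linalg.norm(A, axis=0)            # unit-norm columns a_j
labels = rng.integers(0, 3, 2000)
amp = rng.standard_normal(2000) + 1j * rng.standard_normal(2000)
X_tf = A[:, labels] * amp
A_hat = estimate_mixing(X_tf, 3)
```

On this idealized data every normalized point collapses exactly onto one of the columns $\mathbf{a}_j$, so the recovered centroids match the true columns up to a permutation; real speech mixtures only approximate this, which motivates the single-source-point selection below.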

Conventional algorithms reported in [8–10] assume that the approximation in (14) holds for all the TF points. This, however, may not be true in a real environment. Instead of assuming that (14) applies to all the TF points, our proposed algorithm introduces a single source measure to quantify the validity of (14) at each TF point. Only TF points with a high confidence value are used to estimate $\mathbf{a}_j$, based on which a more accurate mask can be computed to separate the sources.

##### 3.1. The Proposed TF Points Selection

From (14), the corresponding autocorrelation matrix can be expressed as
$$\mathbf{R}(k,l) = E\left[\mathbf{X}(k,l)\mathbf{X}^H(k,l)\right], \tag{16}$$
where $E[\cdot]$ is the expectation operator and $(\cdot)^H$ denotes the conjugate (Hermitian) transpose. Therefore, for single source points, we have
$$\mathbf{R}(k,l) = E\left[|S_j(k,l)|^2\right] \mathbf{a}_j \mathbf{a}_j^H. \tag{17}$$
Considering that speech utterances are locally stationary [25],
$$\hat{\mathbf{R}}(k,l) = \frac{1}{2Q+1} \sum_{q=-Q}^{Q} \mathbf{X}(k,l+q)\mathbf{X}^H(k,l+q), \tag{18}$$
where $Q$ specifies the number of neighboring TF points used to estimate $\mathbf{R}(k,l)$ and is adjustable according to the time duration within which the source signals are considered to be stationary. It may not be proper to use $\hat{\mathbf{R}}(k,l)$ directly as the single source TF point measure, because the energy of nondominant sources is not always zero. To deal with this issue, a modified TF point selection method is provided.

Assume that at a particular TF point $(k,l)$, source signals $s_1$ to $s_P$, $P \le N$, have nonzero energy, and the signal $s_J$, $J \in \{1, \dots, P\}$, is assumed to be the dominant source, whose energy is at least $\lambda$ dB higher than that of the other sources at $(k,l)$; that is,
$$10\log_{10}\frac{E\left[|S_J(k,l)|^2\right]}{E\left[|S_j(k,l)|^2\right]} \ge \lambda, \quad j \ne J. \tag{19}$$

Assuming the sources are uncorrelated, the autocorrelation matrix for this TF point can then be expressed as
$$\mathbf{R}(k,l) = \mathbf{A}\boldsymbol{\Lambda}\mathbf{A}^H, \tag{20}$$
where
$$\boldsymbol{\Lambda} = \mathrm{diag}\left(E\left[|S_1(k,l)|^2\right], \dots, E\left[|S_P(k,l)|^2\right], 0, \dots, 0\right) \tag{21}$$
and $\mathbf{a}_j$, $j = 1, \dots, P$, is the $j$th column of the mixing matrix $\mathbf{A}$. Note that $P$ diagonal elements of the matrix $\boldsymbol{\Lambda}$ are nonzero, so it is not proper to directly use (17) to determine a single source point. A continuous measure built directly from the source energies in (21) is also not feasible, since $\boldsymbol{\Lambda}$ and $\mathbf{A}$ are unknown in the speech separation problem.

Considering that the decomposition in (20) is similar to the singular value decomposition (SVD) of $\mathbf{R}(k,l)$, the ratio of the singular values of $\mathbf{R}(k,l)$ is proposed as the single source TF point measure (SSTFM); that is,
$$\mathrm{SSTFM}(k,l) = \frac{\sigma_{\max}(k,l)}{\sum_{i=1}^{M} \sigma_i(k,l)}, \tag{22}$$
where $\sigma_i(k,l)$ is the $i$th singular value of $\mathbf{R}(k,l)$ and $\sigma_{\max}(k,l)$ is the maximum singular value of $\mathbf{R}(k,l)$. Application of the SVD to detect single source points is valid since the separation problem at each single source TF point is an overdetermined problem. The TF points after selection are illustrated in Figure 1.
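The measure can be sketched on synthetic data as follows. We assume the ratio takes the form of the dominant singular value divided by the sum of all singular values (one reading of the SSTFM definition, close to 1 when the local autocorrelation is rank one); the local autocorrelation is averaged over neighboring TF points as described above, and the function name `sstfm` and the toy sizes are our own:

```python
import numpy as np

def sstfm(X_tf, Q=10):
    """Dominant-singular-value ratio of the local autocorrelation matrix.

    X_tf : (M, K, L) complex STFT coefficients (mics x freq bins x time frames).
    Returns a (K, L) array; values near 1 indicate single-source TF points.
    """
    M, K, L = X_tf.shape
    measure = np.zeros((K, L))
    for k in range(K):
        for l in range(L):
            lo, hi = max(0, l - Q), min(L, l + Q + 1)
            seg = X_tf[:, k, lo:hi]                  # neighboring TF points
            R = seg @ seg.conj().T / seg.shape[1]    # local autocorrelation estimate
            sv = np.linalg.svd(R, compute_uv=False)
            measure[k, l] = sv[0] / sv.sum()         # ~1 when R is rank one
    return measure

# synthetic check: bin 0 carries one source, bin 1 carries a two-source mixture
rng = np.random.default_rng(5)
a1 = np.array([1.0, 0.5]); a1 /= np.linalg.norm(a1)
a2 = np.array([0.3, 1.0]); a2 /= np.linalg.norm(a2)
T = 41
s1 = rng.standard_normal(T) + 1j * rng.standard_normal(T)
s2 = rng.standard_normal(T) + 1j * rng.standard_normal(T)
X = np.empty((2, 2, T), dtype=complex)
X[:, 0, :] = np.outer(a1, s1)                        # single-source bin
X[:, 1, :] = np.outer(a1, s1) + np.outer(a2, s2)     # two-source bin
m = sstfm(X, Q=10)
print(m[0, 20], m[1, 20])
```

The single-source bin scores essentially 1 while the mixed bin scores visibly lower, which is exactly the gap a threshold on the SSTFM exploits when selecting reliable TF points.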