Abstract

Single-microphone speech enhancement algorithms based on nonnegative matrix factorization can exploit only the temporal and spectral diversity of the received signal, so their noise-suppression performance degrades rapidly in complex environments. Microphone arrays offer spatial selectivity and higher signal gain, which makes them suitable for adverse noise conditions. In this paper, we present a new speech enhancement algorithm that combines two microphones with nonnegative matrix factorization. The adopted method models the interchannel characteristics of each nonnegative matrix factorization basis, such as the amplitude ratios and phase differences between channels. Experimental results confirm that the proposed algorithm outperforms other dual-microphone speech enhancement algorithms.

1. Introduction

To improve the quality and intelligibility of noisy signals, speech enhancement is widely applied in many fields, including speech communication, speech coding, and speech recognition. According to the number of microphones used, speech enhancement methods can be divided into two classes: single-microphone and microphone-array methods.

Many single-microphone speech enhancement algorithms have been presented in the past, including statistical-model methods, spectral subtraction, subspace decomposition, and other typical algorithms. These algorithms suppress noise well under stationary conditions, but they discard a priori information about the clean speech and noise, which limits their performance in complex environments.

Recently, a matrix decomposition algorithm called nonnegative matrix factorization (NMF) [1] has been successfully used to solve a variety of problems in many fields. NMF is a powerful method for machine learning and hidden-data discovery; its basic idea is to decompose one nonnegative matrix into the product of two nonnegative matrices without making any statistical hypothesis about the data. Compared with traditional matrix decomposition algorithms, it has a clear physical interpretation, small storage requirements, and is simple to implement. It has been widely and effectively applied to various problems, including pattern clustering and classification tasks [2–5], source separation [6], and speech enhancement [7]. In speech applications, a priori information can be obtained by applying NMF to training data instead of requiring the clean signal.

Currently, according to the machine-learning approach used, single-microphone speech enhancement methods based on NMF can be categorized into unsupervised and supervised learning algorithms [7]. Unsupervised methods are simple and easy to implement without any prior information on the speech or noise; their main difficulty is estimating the noise power spectral density (PSD) [8], especially in a complex environment.

For supervised methods, selecting a proper model requires considering not only the characteristics of the speech and noise signals but also the estimation of the model parameters from training samples of those signals. One advantage of these methods is that the noise PSD can be estimated without resorting to other algorithms. Studies have shown that, compared with unsupervised methods in complex environments, supervised methods are an effective way of obtaining better enhanced speech.

To address the mismatch between training data and testing data, supervised NMF-based speech enhancement algorithms incorporate prior information such as temporal continuity [9] and the statistical distribution of the data [10]. More recently, aiming at improving general subspace constraints, an improved NMF algorithm was proposed by introducing additional terms into the objective function [11]. A framework for decreasing the computational complexity of NMF using the extreme learning machine (ELM) was designed in [12]. ELM and its variants have been widely applied in many fields because of their good scalability and strong generalization performance [13]. With the continuing development of human-computer interaction, higher requirements for speech recognition and computer vision in complex environments have emerged. In [14–16], control schemes for improving the convergence speed were developed to optimize system performance.

In [17], a speech enhancement method based on regularized nonnegative matrix factorization was presented to solve the difficult problem of manual mode selection. In practice, however, speech signals also possess spatial characteristics (the spatial diversity induced by reverberation), which a single-microphone system cannot capture. A single microphone can perform well in a speech enhancement system, but it exploits only the temporal and spectral information of the signal and lacks spatial information.

The two-microphone system has attracted much attention because of its small size and low computational cost, in line with the trend toward device miniaturization. A dual-microphone speech enhancement algorithm based on the coherence function was proposed in [18]. In [19], an improved method that combines the coherence function with a Kalman filter is used to obtain the enhanced speech signal. These algorithms are essentially unsupervised methods. We therefore propose a novel β-NMF for dual-microphone speech enhancement, in which the interchannel characteristic of each NMF basis is modeled by exploiting the spatial diversity of the speech signals.

The paper is organized as follows: Section 2 reviews the objective function of standard NMF with the β divergence. Section 3 extends it to the dual-microphone system for the NMF basis. Section 4 presents the two-channel speech signal model and details the proposed speech enhancement framework. Section 5 presents simulation results, and Section 6 concludes the paper.

2. Nonnegative Matrix Factorization with β Divergence

In a single-microphone system, let $x(t)$ be the observed signal of one microphone over a specific time duration. By applying the short-time Fourier transform (STFT) to $x(t)$, we obtain a complex matrix $\mathbf{X} \in \mathbb{C}^{F \times T}$ ($F$ denotes the number of frequency bins and $T$ the number of time frames). In the standard NMF, the magnitude spectrogram $\mathbf{V} = |\mathbf{X}|$ (or, equivalently, the power spectrogram) is analyzed [1]. The NMF-based algorithm seeks a locally optimal decomposition, which is defined as

$$\mathbf{V} \approx \mathbf{W}\mathbf{H}, \quad (1)$$

where $\mathbf{W} \in \mathbb{R}_{+}^{F \times K}$ is a basis matrix, $\mathbf{H} \in \mathbb{R}_{+}^{K \times T}$ is a coefficient matrix, and $K$ is the number of basis vectors.

To find two nonnegative matrices such that the difference between $\mathbf{V}$ and the product $\mathbf{W}\mathbf{H}$ is minimized, a measure function is defined to obtain the optimal decomposition

$$\min_{\mathbf{W} \ge 0,\, \mathbf{H} \ge 0} D(\mathbf{V} \,\|\, \mathbf{W}\mathbf{H}), \quad (2)$$

where $D(\cdot \,\|\, \cdot)$ denotes the error divergence between the observed data $\mathbf{V}$ and the reconstructed data $\hat{\mathbf{V}} = \mathbf{W}\mathbf{H}$. Different probability models can be derived from (2), and different types of cost functions are then obtained by maximum likelihood. Selecting an appropriate objective function is the key step in formulating the NMF algorithm. Here, the objective function uses a parametric divergence measure, namely, the β divergence [20],

$$d_{\beta}(x \,\|\, y) = \frac{1}{\beta(\beta - 1)}\left(x^{\beta} + (\beta - 1)\,y^{\beta} - \beta\, x\, y^{\beta - 1}\right), \quad (3)$$

applied element-wise, where β reflects the reconstruction penalty. The selection of β depends on the statistical distribution of the data and requires prior knowledge. When β = 2, the divergence is the squared Euclidean distance (ED); in the limit β → 1, it yields the Kullback-Leibler (KL) divergence; and in the limit β → 0, it yields the Itakura-Saito (IS) divergence.
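
Written out element-wise, these limiting cases take the familiar explicit forms (standard expressions, reproduced here only for reference):

$$d_{\beta}(x \,\|\, y) =
\begin{cases}
\dfrac{1}{2}\,(x - y)^{2}, & \beta = 2 \;\; \text{(squared Euclidean distance)},\\[1mm]
x \log \dfrac{x}{y} - x + y, & \beta = 1 \;\; \text{(Kullback-Leibler divergence)},\\[1mm]
\dfrac{x}{y} - \log \dfrac{x}{y} - 1, & \beta = 0 \;\; \text{(Itakura-Saito divergence)}.
\end{cases}$$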

$\mathbf{W}$ and $\mathbf{H}$ are estimated by applying multiplicative iterative update rules as described in [21]; the update rules are given as

$$\mathbf{W} \leftarrow \mathbf{W} \otimes \frac{\left((\mathbf{W}\mathbf{H})^{\beta - 2} \otimes \mathbf{V}\right)\mathbf{H}^{T}}{(\mathbf{W}\mathbf{H})^{\beta - 1}\,\mathbf{H}^{T}}, \quad (4)$$

$$\mathbf{H} \leftarrow \mathbf{H} \otimes \frac{\mathbf{W}^{T}\left((\mathbf{W}\mathbf{H})^{\beta - 2} \otimes \mathbf{V}\right)}{\mathbf{W}^{T}(\mathbf{W}\mathbf{H})^{\beta - 1}}, \quad (5)$$

where the operator $\otimes$ represents element-wise multiplication, the quotient line denotes element-wise division, powers are taken element-wise, and the superscript $T$ is the matrix transpose. As for the initialization of $\mathbf{W}$ and $\mathbf{H}$, positive random numbers are often used.
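
As an illustration of (4) and (5), the following is a minimal NumPy sketch of the multiplicative updates; the small epsilon floor, the iteration count, and the random initialization are implementation choices rather than part of the original formulation.

import numpy as np

def beta_nmf(V, K, beta=2.0, n_iter=50, eps=1e-9, seed=0):
    # Factorize a nonnegative matrix V (F x T) as W (F x K) times H (K x T)
    # by minimizing the beta-divergence with multiplicative updates (4)-(5).
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, K)) + eps   # positive random initialization
    H = rng.random((K, T)) + eps
    for _ in range(n_iter):
        V_hat = W @ H + eps
        # H update, Eq. (5): multiply element-wise by the ratio of gradient parts
        H *= (W.T @ (V_hat ** (beta - 2) * V)) / (W.T @ V_hat ** (beta - 1) + eps)
        V_hat = W @ H + eps
        # W update, Eq. (4)
        W *= ((V_hat ** (beta - 2) * V) @ H.T) / (V_hat ** (beta - 1) @ H.T + eps)
    return W, H

For example, W, H = beta_nmf(np.abs(X), K=30, beta=1.0) factorizes a magnitude spectrogram under the KL divergence.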

3. The Dual-Microphone Model for NMF Basis with β Divergence

This section proposes an extension of the standard NMF. Compared with multichannel speech enhancement, dual-channel speech enhancement has advantages in many respects. Assume that the observations of the 1st and 2nd microphones are given in the time-frequency domain. In [22], an interchannel matrix is defined that represents the spatial characteristics between the two channels, while the two channels share the common nonnegative matrices $\mathbf{W}$ and $\mathbf{H}$ to model the multichannel observations.

3.1. Preprocessing and Modeling

Note first that the standard NMF algorithm for speech enhancement considers only the magnitude observations in the time-frequency domain. To fully reflect the interchannel characteristic, the observation of the 1st channel is multiplied element-wise by its own complex conjugate and acts as a reference, and the same operation, using the complex conjugate of the 1st channel, is then applied to the observation of the 2nd channel.

According to the above preprocessing, the preprocessed observation of the 1st channel is a nonnegative matrix, whereas that of the 2nd channel is a complex matrix. Hence, an accurate model for the first channel is obtained by using (3), and an accurate model for the second channel is then obtained by additionally introducing an interchannel matrix, which is initialized randomly. This interchannel matrix carries the spatial information of the 2nd channel.
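
A minimal sketch of this preprocessing is given below. It assumes, as the text suggests, that each channel's STFT is multiplied element-wise by the complex conjugate of the reference (1st) channel, so that the reference becomes a real nonnegative power spectrogram while the 2nd channel becomes a complex cross-spectrum carrying the amplitude ratio and phase difference; the exact operations used in [22] may differ in detail.

import numpy as np
from scipy.signal import stft

def preprocess_two_channels(y1, y2, fs=8000, nperseg=512, noverlap=384):
    # Reference power spectrogram and cross-spectrum of a two-channel recording
    # (a sketch of the assumed preprocessing, not the paper's exact equations).
    _, _, X1 = stft(y1, fs=fs, nperseg=nperseg, noverlap=noverlap)
    _, _, X2 = stft(y2, fs=fs, nperseg=nperseg, noverlap=noverlap)
    V1 = np.real(X1 * np.conj(X1))   # nonnegative reference matrix |X1|^2
    C12 = X2 * np.conj(X1)           # complex matrix: interchannel amplitude and phase
    return V1, C12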

3.2. Maximum Likelihood Estimation and Its Cost Function

Using the dual-channel probabilistic model, the likelihood is written as in (8), where the data are assumed to follow the corresponding probability distribution. Minimizing the negative log-likelihood of (8), up to irrelevant constant terms, yields the cost function (9). The former term of (9) was explained in Section 2; the latter term is given by (10).

The gradient of the cost function with respect to each variable (the subscript of the second cost term is omitted for convenience) is expressed as the difference of two positive terms, as in (11).

The solution can then be expressed by applying the general heuristic multiplicative update rule, in which each variable is multiplied by the ratio of the negative and positive parts of its gradient, as in (12).
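
In generic form, if the gradient of a cost $C$ with respect to a nonnegative variable $\theta$ is split into two positive parts, $\nabla_{\theta} C = \nabla_{\theta}^{+} C - \nabla_{\theta}^{-} C$, the heuristic multiplicative rule is

$$\theta \leftarrow \theta \otimes \frac{\nabla_{\theta}^{-} C}{\nabla_{\theta}^{+} C},$$

which keeps $\theta$ nonnegative, increases it where the negative part of the gradient dominates, decreases it otherwise, and leaves it unchanged exactly where the gradient vanishes.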

The derivatives of the second cost term in (10) with respect to the interchannel matrix, the basis matrix, and the coefficient matrix are given in (13).

This leads to the following update rules based on the cost function (9): the complex interchannel matrix and the nonnegative matrices $\mathbf{W}$ and $\mathbf{H}$ are estimated using the update rule of [21], and the gradient of the cost function is rewritten accordingly in (14), (15), and (16), where $\mathbf{1}$ denotes a matrix of ones. As shown by Formulas (14), (15), and (16), they reduce to the single-channel counterparts (4) and (5) if only one microphone is used and the interchannel matrix is the identity matrix.

4. Proposed NMF-Based Speech Enhancement Algorithm

Assume that two microphones are set up in a complex environment and that the noise and target speech signals are spatially separated. Let $s(t)$ be the target speech; the noisy speech signal at the $m$-th microphone can then be written as

$$y_{m}(t) = s(t) * h_{m}(t) + n_{m}(t), \quad m = 1, 2, \quad (17)$$

where $*$ is the convolution operator, $m$ is the microphone index, $t$ is the sample index, and $h_{m}(t)$ and $n_{m}(t)$ represent the room reverberation (impulse response) and the noise at the $m$-th microphone, respectively. The block diagram of the proposed algorithm is shown in Figure 1; it mainly includes two parts: the training stage and the enhancement stage.
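
A short sketch of the signal model (17); the impulse responses and noise sequences passed in are placeholders for illustration only.

import numpy as np

def simulate_two_mic_mixture(s, h1, h2, n1, n2):
    # Noisy two-microphone observations y_m = s * h_m + n_m, Eq. (17).
    # s: clean speech; h1, h2: room impulse responses; n1, n2: noise signals
    # (assumed to be at least as long as s).
    y1 = np.convolve(s, h1)[:len(s)] + n1[:len(s)]
    y2 = np.convolve(s, h2)[:len(s)] + n2[:len(s)]
    return y1, y2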

4.1. Training Stage

By applying the STFT, (17) can be approximated in the time-frequency domain as

$$Y_{m}(f, t) \approx S(f, t)\,H_{m}(f) + N_{m}(f, t), \quad (18)$$

where $Y_{m}$, $S$, $H_{m}$, and $N_{m}$ denote the time-frequency representations of the noisy speech, the clean speech, the room transfer function, and the noise, respectively.

At the training stage, the magnitude spectra of the clean speech and of the noise from the database are chosen as the data matrices for β-NMF processing, and the speech and noise basis matrices are produced by applying the multiplicative update rules (4) and (5) to the corresponding training data separately. The two basis matrices are concatenated and saved as a joint dictionary matrix, which serves as a priori information for the enhancement stage.
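
A minimal training-stage sketch using scikit-learn's multiplicative-update NMF solver is shown below; the STFT settings mirror the experimental setup in Section 5, while the variable names (speech_train, noise_train) and the use of scikit-learn instead of a custom β-NMF routine are assumptions made only for illustration.

import numpy as np
from scipy.signal import stft
from sklearn.decomposition import NMF

def train_basis(signal, fs=8000, n_basis=30, beta_loss='kullback-leibler'):
    # Learn an NMF basis matrix from the magnitude spectrogram of training data.
    _, _, X = stft(signal, fs=fs, nperseg=512, noverlap=384)
    V = np.abs(X)
    model = NMF(n_components=n_basis, solver='mu', beta_loss=beta_loss,
                init='random', max_iter=50, random_state=0)
    model.fit(V.T)                 # scikit-learn factorizes rows, so frames act as samples
    return model.components_.T     # basis matrix of shape (frequency_bins, n_basis)

# Joint dictionary: speech bases concatenated with noise bases
# W_joint = np.hstack([train_basis(speech_train), train_basis(noise_train)])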

4.2. Enhancement Stage

The proposed enhancement stage consists of three parts: beamforming, signal gain estimation, and speech signal reconstruction, which are explained in the following subsections.

4.2.1. Beamforming

Beamforming is one of the most popular techniques and forms the basis of microphone-array speech enhancement. In general, the most common fixed beamformers are the delay-and-sum and superdirective beamformers. In this paper, we use the delay-and-sum beamformer,

$$y_{\mathrm{DS}}(t) = \sum_{m=1}^{2} w_{m}\, y_{m}\left(t - \tau_{m}\right), \quad (19)$$

where $w_{m}$ represents the weight of the $m$-th channel and $\tau_{m}$ denotes the time-delay compensation obtained by estimation.
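
A two-channel delay-and-sum sketch corresponding to (19); the integer-sample delay and the equal weights are simplifying assumptions, and the delay estimate itself (e.g., from cross-correlation) is not shown.

import numpy as np

def delay_and_sum(y1, y2, delay_samples, w=(0.5, 0.5)):
    # Align the 2nd channel to the 1st by an integer sample delay and form
    # the weighted sum of Eq. (19).
    y2_aligned = np.roll(y2, -int(delay_samples))
    n = min(len(y1), len(y2_aligned))
    return w[0] * y1[:n] + w[1] * y2_aligned[:n]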

4.2.2. Signal Gain Estimation

Firstly, the two noisy speech signals, after delay compensation, are used as the input of this stage, and their magnitude spectra are obtained by applying the STFT. Next, they are factorized via the proposed extension of NMF with the joint dictionary matrix fixed to the one derived in the training stage, using the update rules given in (15) and (16). Accordingly, the magnitude spectra can be approximately decomposed into an interchannel matrix and a coefficient matrix.

(1) Based on the above results, the gain function of the 1st channel (the reference channel) is obtained, defined by (20).

(2) By using the interchannel matrix, the gain function of the 2nd channel is likewise obtained, as given by (21).

(3) The final gain function is then obtained from Formulas (20) and (21); the gain estimate is computed by (22), where the division is performed element-wise.
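
Since (20)-(22) are not reproduced here, the sketch below only illustrates the typical NMF Wiener-type gain for the reference channel: with the joint dictionary fixed, only the activations are updated, and the gain is the ratio of the speech part of the reconstruction to the whole reconstruction. Combining this with the 2nd-channel gain through the interchannel matrix, as in (20)-(22), is specific to the proposed method and is not shown.

import numpy as np

def nmf_wiener_gain(V, W_joint, n_speech_basis, beta=2.0, n_iter=50, eps=1e-9):
    # Wiener-type gain from a magnitude spectrogram V with a fixed joint
    # dictionary W_joint = [W_speech, W_noise]; only the activations H are updated.
    rng = np.random.default_rng(0)
    H = rng.random((W_joint.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        V_hat = W_joint @ H + eps
        H *= (W_joint.T @ (V_hat ** (beta - 2) * V)) / (W_joint.T @ V_hat ** (beta - 1) + eps)
    speech_part = W_joint[:, :n_speech_basis] @ H[:n_speech_basis, :]
    return speech_part / (W_joint @ H + eps)   # element-wise gain in [0, 1]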

4.2.3. NMF-Based Signal Reconstruction

This stage is similar to a Wiener filtering process: the gain function obtained from (22) acts as a Wiener filter. First, the magnitude spectrum of the beamformer output is obtained by the STFT, and then the magnitude spectrum of the enhanced speech is approximated by applying this gain to it, as in (23).

Therein, the enhanced speech waveform is estimated by using the inverse STFT.
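
A reconstruction sketch: the estimated gain multiplies the STFT of the beamformed signal (its phase is kept), and the inverse STFT gives the enhanced waveform. The assumption that the gain is applied to the beamformer output, and the STFT settings, follow the description above but represent one plausible realization rather than the paper's exact pipeline.

import numpy as np
from scipy.signal import stft, istft

def reconstruct(y_beamformed, gain, fs=8000, nperseg=512, noverlap=384):
    # Apply the estimated time-frequency gain (Eq. (22)) and resynthesize the
    # enhanced waveform with the inverse STFT; gain must match the STFT shape.
    _, _, Y = stft(y_beamformed, fs=fs, nperseg=nperseg, noverlap=noverlap)
    S_hat = gain * Y                   # Wiener-like filtering, noisy phase retained
    _, s_hat = istft(S_hat, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return s_hat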

5. Experimental Results

In this section, we conduct experiments to evaluate the performance of these methods with respect to quality and intelligibility. We compare the proposed method with the coherence-based speech enhancement algorithm in [18] and with the standard NMF method. Performance is evaluated using the perceptual evaluation of speech quality (PESQ) [23], the source-to-distortion ratio (SDR) [24], and the segmental SNR (SSNR) as objective measures, where a higher value indicates a better result.

5.1. Experimental Setup

The clean speech and the noise are taken from the TIMIT database [25] and the NOISEX database [26], respectively, and all signals are downsampled to 8 kHz. In this study, the training set for the clean speech contains 20 sentences (60 seconds) pronounced by 10 males and 10 females. Each test signal for the speech enhancement task is one sentence. Two background noises are selected: the Hfchannel and Factory1 noises. Training data and test data are disjoint. In the proposed framework, the window function, frame size, and frame shift are a Hamming window, 512 samples, and 128 samples, respectively. The numbers of clean speech and noise basis vectors are each set to 30, and the maximum number of iterations is 50. The noisy signals picked up by the two microphones, spaced 4 cm apart, were generated by convolving the target and noise sources with a set of HRTFs measured inside a mildly reverberant room with dimensions 4.3 × 3.8 × 2.3 m³ (length × width × height), and by adding the noise to the clean test speech at four signal-to-noise ratios (SNRs): −10, −5, 0, and 5 dB. The distance between the target source and the midpoint of the two microphones is 1.2 m. The direction of arrival (DOA) of the target source is varied across the experiments. The squared Euclidean distance is used for simplicity.
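
As one concrete example of the objective measures, the segmental SNR can be computed as below; the frame length and the conventional clipping range of [-10, 35] dB are common choices rather than values taken from the paper, while PESQ and SDR are computed with standard toolkits and are not re-implemented here.

import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, lo=-10.0, hi=35.0):
    # Segmental SNR in dB, averaged over frames and clipped to [lo, hi].
    n = min(len(clean), len(enhanced))
    clean, enhanced = clean[:n], enhanced[:n]
    snrs = []
    for start in range(0, n - frame_len + 1, frame_len):
        s = clean[start:start + frame_len]
        e = s - enhanced[start:start + frame_len]
        snr = 10.0 * np.log10((np.sum(s ** 2) + 1e-12) / (np.sum(e ** 2) + 1e-12))
        snrs.append(float(np.clip(snr, lo, hi)))
    return float(np.mean(snrs))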

Figure 2 shows the PESQ, SDR, and SSNR results as the DOA of the target source varies, with the input SNR set to 0 dB under the Factory1 noise condition. As can be seen, the DOA of the target source has little influence on the PESQ metric for these methods but a considerable effect on the other metrics. In the following experiments, we ultimately chose 60° for consistency and simplicity. Figure 2 also indicates that, when the source angle is set to 60°, the proposed method suppresses the background noise effectively while remaining competitive on the other measures.

5.2. Speech Quality and Intelligibility Evaluation

To investigate the achievable gain estimation performance, we chose two background noises representative of a complex environment: the Hfchannel and Factory1 noises. Figures 3 and 4 compare the noisy signal with the signals enhanced by the different methods, where the parameter β is set to 2.

From Figures 3 and 4, we find that the proposed method yields higher PESQ, SDR, and SSNR scores than the coherence-based method [18] and the standard NMF algorithm [7] in almost all cases, which shows that the algorithm markedly improves both the quality and the intelligibility of the speech signals. The PESQ scores show that the method in [18] is stable and scarcely affected by the SNR, but it performs worse on the other metrics, while the standard NMF method tends to introduce more distortion. It can also be seen that the advantage of these algorithms becomes less evident as the SNR increases. Compared with the coherence-based method, the proposed NMF-based method still attains an improvement in the objective measures.

Figure 5 shows the PESQ, SDR, and SSNR results as the angle of incidence changes under different settings of the parameter β. From Figure 5, we can see that the angle of incidence has a significant influence on the performance of the proposed method. Judging from the SSNR values, some settings of β give the proposed method better scores, but at the expense of speech quality and intelligibility; for other settings, an optimum angle of incidence can be found. Besides, a comparative analysis of the PESQ and SDR values with different parameters shows that β influences speech quality more than speech intelligibility under the same angle conditions. With a suitable choice of β, the method not only preserves accuracy but also suppresses the background noise effectively without introducing much distortion.

The performance of the proposed algorithm with different β divergences and noises is shown in Figure 6. The PESQ, SDR, and SSNR scores improve as the SNR increases under the same conditions. For the same SNR conditions, an optimum is obtained with β = 1, where the β divergence reduces to the KL divergence. This observation can be interpreted as meaning that the proposed method based on the KL divergence improves speech quality and intelligibility more than the other parameter settings. Moreover, the results indicate that, under the same conditions, the proposed method yields a clear improvement in PESQ, SDR, and SSNR scores, especially at low SNR. Hence, the proposed method improves both the perceptual quality and the intelligibility of noisy speech.

5.3. Signal Spectrogram

By comparing the intensity patterns of the speech spectrograms, we can observe the structure of the residual noise and of the speech distortion. The spectrograms of the different signals are presented in Figure 7. They show that the proposed method performs better than the compared methods. For Factory1 noise at an input SNR of 0 dB, it is easy to see that the proposed NMF-based method exhibits less speech distortion and fewer residual components in the restored spectrogram than the traditional coherence-based method and the standard NMF method.

Besides, the parameter β influences the SDR scores. In this paper, the method based on the KL divergence is shown to be superior to the one based on the squared Euclidean distance in speech enhancement capability. Finally, the proposed speech enhancement framework based on KL-NMF provides a significant improvement in both quality and intelligibility, as justified by the higher evaluation scores.

6. Conclusions

In this paper, we have proposed a dual-microphone speech enhancement framework based on β-NMF. The method extends single-microphone NMF-based speech enhancement by introducing the interchannel matrix into the cost function, so that the interchannel characteristic of each NMF basis is well represented using a priori information. Experimental results show that the proposed method is effective under nonstationary and low-SNR conditions.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Scientific Public Welfare Research of Liaoning Province (20170056), National Natural Science Foundation of China (no. 60901063), the Program for Liaoning Innovative Research Team in University under Grant LT2016006, and the Program for Distinguished Professor of Liaoning Province.