Complexity

Volume 2018, Article ID 6153451, 13 pages

https://doi.org/10.1155/2018/6153451

## Speech Enhancement Control Design Algorithm for Dual-Microphone Systems Using *β*-NMF in a Complex Environment

School of Electronic and Information Engineering, Liaoning University of Technology, Jinzhou, Liaoning 121001, China

Correspondence should be addressed to Dong-xia Wang; dxwang_lg@126.com

Received 15 April 2018; Revised 3 July 2018; Accepted 26 July 2018; Published 9 September 2018

Academic Editor: Junpei Zhong

Copyright © 2018 Dong-xia Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Single-microphone speech enhancement algorithms based on nonnegative matrix factorization can exploit only the temporal and spectral diversity of the received signal, so their noise suppression performance degrades rapidly in a complex environment. Microphone arrays offer spatial selectivity and high signal gain, which makes them suitable for adverse noise conditions. In this paper, we present a new speech enhancement algorithm based on two microphones and nonnegative matrix factorization. The proposed method models the interchannel characteristics of each nonnegative matrix factorization basis, such as the amplitude ratios and the phase differences between channels. Experimental results confirm that the proposed algorithm outperforms other dual-microphone speech enhancement algorithms.

#### 1. Introduction

To improve the quality and intelligibility of noisy signals, speech enhancement is widely applied in many fields, including speech communication, speech coding, and speech recognition. According to the number of microphones, speech enhancement methods fall into two classes: single-microphone methods and microphone-array methods.

Many single-microphone speech enhancement algorithms have been presented in the past, including statistical-model methods, spectral subtraction, subspace decomposition, and other typical algorithms. These algorithms suppress noise well under stationary conditions, but they do not use a priori information about the clean speech and noise, which limits their performance in a complex environment.

Recently, a matrix decomposition method called nonnegative matrix factorization (NMF) [1] has been used successfully to solve a variety of problems in many fields. NMF is a powerful method for machine learning and hidden data discovery; its basic idea is to decompose one nonnegative matrix into the product of two nonnegative matrices without making any statistical assumption about the data. Compared with traditional matrix decomposition algorithms, it has a clear physical interpretation, requires little storage, and is simple to implement. It has been widely used to solve problems including pattern clustering and classification [2–5], source separation [6], and speech enhancement [7]. In speech applications, a priori information can be obtained by applying NMF to training data instead of the clean signal.

Currently, according to the machine learning approach used, single-channel speech enhancement methods based on NMF can be categorized into unsupervised and supervised learning algorithms [7]. Unsupervised methods are simple and easy to implement and need no prior information on the speech or noise; their main difficulty is estimating the noise power spectral density (PSD) [8], especially in a complex environment.

For supervised methods, selecting a proper model requires considering not only the characteristics of the speech and noise signals but also how the model parameters are estimated from training samples of those signals. One advantage of these methods is that the noise PSD is estimated without the need for other algorithms. Studies have shown that, in a complex environment, supervised methods yield better enhanced speech than unsupervised methods.
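The supervised idea described above can be sketched in a few lines. The following is a generic illustration, not the algorithm proposed in this paper: it assumes KL-divergence NMF, a Wiener-like soft mask, and random matrices standing in for real magnitude spectrograms. The names `kl_nmf` and `enhance` are hypothetical helpers introduced only for this example.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero


def kl_nmf(V, K=None, W_fixed=None, n_iter=100, seed=0):
    """KL-divergence NMF with multiplicative updates.

    If W_fixed is given, the bases stay fixed and only the activations H are estimated.
    """
    rng = np.random.default_rng(seed)
    W = W_fixed if W_fixed is not None else rng.random((V.shape[0], K)) + EPS
    H = rng.random((W.shape[1], V.shape[1])) + EPS
    ones = np.ones_like(V)
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + EPS))) / (W.T @ ones + EPS)
        if W_fixed is None:
            W *= ((V / (W @ H + EPS)) @ H.T) / (ones @ H.T + EPS)
    return W, H


def enhance(V_noisy, W_speech, W_noise, n_iter=100):
    """Estimate the speech magnitude from a noisy magnitude using pretrained bases."""
    W = np.hstack([W_speech, W_noise])            # concatenated, fixed dictionary
    _, H = kl_nmf(V_noisy, W_fixed=W, n_iter=n_iter)
    k_s = W_speech.shape[1]
    speech = W_speech @ H[:k_s]                   # speech part of the reconstruction
    noise = W_noise @ H[k_s:]                     # noise part of the reconstruction
    mask = speech / (speech + noise + EPS)        # soft mask with values in [0, 1]
    return mask * V_noisy


# Toy usage: random nonnegative matrices stand in for real training spectrograms.
rng = np.random.default_rng(0)
V_speech_train = rng.random((64, 200))
V_noise_train = rng.random((64, 200))
W_speech, _ = kl_nmf(V_speech_train, K=8)         # training stage: speech bases
W_noise, _ = kl_nmf(V_noise_train, K=4)           # training stage: noise bases
V_noisy = rng.random((64, 50))
V_enhanced = enhance(V_noisy, W_speech, W_noise)  # enhancement stage
```

Because the noise activations are estimated jointly with the speech activations, the noise spectrum comes out of the same factorization, which is the advantage noted above.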

To address the mismatch between the characteristics of training data and testing data, supervised NMF-based speech enhancement algorithms have been proposed that incorporate prior information, including temporal continuity [9] and the statistical distribution of the data [10]. More recently, to improve on general subspace constraints, an improved NMF algorithm that introduces additional terms into the objective function was proposed [11]. A framework that uses the extreme learning machine (ELM) to decrease the computational complexity of NMF is designed in [12]. ELM and its variants have been widely applied in different fields because of their good scalability and strong generalization performance [13]. With the continuing development of human-computer interaction, higher requirements are placed on speech recognition and computer vision in complex environments. In [14–16], control schemes that improve the convergence speed are developed to optimize system performance.

In [17], a regularized nonnegative matrix factorization algorithm is applied to speech enhancement to solve the difficult problem of manual mode selection. In practical applications, however, speech signals have spatial characteristics (spatial diversity arising from reverberation) that a single-microphone system cannot capture. A single microphone can perform well in a speech enhancement system, but it uses only the temporal and spectral information of the signal and lacks spatial information.

The two-microphone system has attracted much attention for its small size and low computational cost, which is in line with the trend toward device miniaturization. A dual-microphone speech enhancement algorithm based on the coherence function is proposed in [18]. In [19], an improved method that combines the coherence function with the Kalman filter is used to obtain the enhanced speech signal. These algorithms are, in a sense, unsupervised methods. We therefore propose a novel *β*-NMF method for dual-microphone speech enhancement, which models the interchannel characteristic of each NMF basis by exploiting the spatial diversity of speech signals.

The paper is organized as follows: Section 2 reviews the objective function of standard NMF with *β* divergence. Section 3 extends it to the dual-microphone system for the NMF basis. Section 4 presents a two-channel speech signal model and details the proposed speech enhancement framework. Section 5 presents simulation results, and Section 6 concludes the paper.

#### 2. Nonnegative Matrix Factorization with *β* Divergence

In a single-microphone system, let $x(t)$ be the signal observed by one microphone over a specific time duration. Applying the short-time Fourier transform (STFT) to $x(t)$ yields a complex matrix $X \in \mathbb{C}^{F \times T}$, where $F$ denotes the number of frequency bins and $T$ the number of time frames. The standard NMF [1] analyzes the amplitude matrix $V = |X|$, or equivalently its entries $v_{ft} = |x_{ft}|$. The NMF-based algorithm then seeks a locally optimal decomposition

$$V \approx WH, \tag{1}$$

where $W \in \mathbb{R}_{+}^{F \times K}$ is a basis matrix, $H \in \mathbb{R}_{+}^{K \times T}$ is a coefficient matrix, and $K$ is the number of basis vectors.
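As a minimal sketch of this preprocessing step, the amplitude matrix can be computed with a plain NumPy STFT. The frame length, hop size, and Hann window below are assumptions chosen for illustration (the source does not state them), and `stft_magnitude` is a hypothetical helper name.

```python
import numpy as np


def stft_magnitude(x, frame_len=512, hop=256):
    """Return the complex STFT X (F x T) and its magnitude V = |X|."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames (one frame per column).
    frames = np.stack(
        [x[t * hop : t * hop + frame_len] * window for t in range(n_frames)],
        axis=1,
    )
    X = np.fft.rfft(frames, axis=0)  # F = frame_len // 2 + 1 frequency bins
    return X, np.abs(X)


rng = np.random.default_rng(0)
x = rng.standard_normal(16000)       # stand-in for one second of audio at 16 kHz
X, V = stft_magnitude(x)
print(X.shape, V.dtype)
```

The nonnegative matrix `V` is what the factorization operates on; the phase of `X` is kept aside and reused when the enhanced signal is resynthesized.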

To find two nonnegative matrices $W$ and $H$ such that the difference between $V$ and the product $WH$ is minimized, a measure function is defined to obtain the optimal decomposition:

$$\min_{W \ge 0,\, H \ge 0} D(V \mid WH), \tag{2}$$

where $D(V \mid WH)$ denotes the error divergence function between the observed data $V$ and the reconstructed data $WH$. Different probabilistic models correspond to different choices of the divergence in (2), and the associated cost functions follow from maximum likelihood estimation. Selecting an appropriate objective function is the key step in formulating the NMF algorithm. Here, the objective function is derived from a parametric divergence measure, namely, the $\beta$ divergence [20]:

$$d_{\beta}(x \mid y) = \frac{x^{\beta} + (\beta - 1)\, y^{\beta} - \beta\, x\, y^{\beta - 1}}{\beta(\beta - 1)}, \qquad D(V \mid WH) = \sum_{f,t} d_{\beta}\big(v_{ft} \mid [WH]_{ft}\big), \tag{3}$$

where $\beta$ controls the reconstruction penalty. The choice of $\beta$ depends on the statistical distribution of the data and requires prior knowledge. When $\beta = 2$, (3) reduces to the squared Euclidean distance (ED), up to a factor of 1/2; as $\beta \to 1$, it tends to the Kullback-Leibler (KL) divergence; and as $\beta \to 0$, it tends to the Itakura-Saito divergence.
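A short numerical sketch can make the role of the parameter concrete. It assumes the standard definition of the β divergence from the NMF literature, with the KL and Itakura-Saito cases handled as closed-form limits; `beta_divergence` is a hypothetical helper name, and the inputs are assumed strictly positive.

```python
import numpy as np


def beta_divergence(V, V_hat, beta):
    """Sum of element-wise beta divergences d_beta(V | V_hat) for strictly positive inputs."""
    V = np.asarray(V, dtype=float)
    V_hat = np.asarray(V_hat, dtype=float)
    if beta == 1:   # Kullback-Leibler limit
        return np.sum(V * np.log(V / V_hat) - V + V_hat)
    if beta == 0:   # Itakura-Saito limit
        ratio = V / V_hat
        return np.sum(ratio - np.log(ratio) - 1.0)
    # General case, valid for any other real beta.
    num = V**beta + (beta - 1) * V_hat**beta - beta * V * V_hat ** (beta - 1)
    return np.sum(num / (beta * (beta - 1)))


V_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
V_hat_demo = np.array([[1.5, 1.0], [2.0, 5.0]])
for b in (0, 1, 2):
    print(b, beta_divergence(V_demo, V_hat_demo, b))
```

With `beta=2` the value equals half the squared Euclidean distance, and values of `beta` close to 1 or 0 approach the KL and Itakura-Saito results, which matches the limiting behavior stated above.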

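$W$ and $H$ are estimated with the multiplicative iterative update rules described in [21]:

$$W \leftarrow W \odot \frac{\big((WH)^{\cdot(\beta - 2)} \odot V\big) H^{\mathsf{T}}}{(WH)^{\cdot(\beta - 1)} H^{\mathsf{T}}}, \tag{4}$$

$$H \leftarrow H \odot \frac{W^{\mathsf{T}} \big((WH)^{\cdot(\beta - 2)} \odot V\big)}{W^{\mathsf{T}} (WH)^{\cdot(\beta - 1)}}, \tag{5}$$

where $\odot$ represents element-wise multiplication, the quotient line and the dotted exponents denote element-wise division and element-wise powers, and the superscript $\mathsf{T}$ denotes the matrix transpose. $W$ and $H$ are usually initialized with positive random numbers.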
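The multiplicative updates can be sketched as follows. This is a minimal illustration of the standard β-NMF updates under stated assumptions: the small constant `eps`, the iteration count, and the function name `beta_nmf` are not taken from the source and are introduced only for numerical safety and readability.

```python
import numpy as np


def beta_nmf(V, K, beta=1.0, n_iter=200, eps=1e-12, seed=0):
    """Factorize a nonnegative matrix V (F x T) as W @ H using beta-divergence updates."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, K)) + eps   # positive random initialization
    H = rng.random((K, T)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        # Update the coefficients: ratio of negative to positive gradient parts.
        H *= (W.T @ (WH ** (beta - 2) * V)) / (W.T @ WH ** (beta - 1) + eps)
        WH = W @ H + eps
        # Update the bases with the refreshed reconstruction.
        W *= ((WH ** (beta - 2) * V) @ H.T) / (WH ** (beta - 1) @ H.T + eps)
    return W, H


# Toy usage on an exactly low-rank nonnegative matrix.
rng = np.random.default_rng(1)
V_toy = rng.random((20, 3)) @ rng.random((3, 30)) + 1e-3
W_est, H_est = beta_nmf(V_toy, K=3, beta=2.0)
print(np.linalg.norm(V_toy - W_est @ H_est) / np.linalg.norm(V_toy))
```

Since every factor in the updates is nonnegative, `W` and `H` stay nonnegative without any explicit projection, which is the main practical appeal of the multiplicative form.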

#### 3. The Dual-Microphone Model for NMF Basis with *β* Divergence

This section proposes an extension of the standard NMF. Compared with multichannel speech enhancement, dual-channel speech enhancement has advantages such as smaller device size and lower computational cost. Let $X_1$ and $X_2$ denote the observations of the 1st and 2nd microphones in the time-frequency domain, respectively. In [22], a new interchannel matrix is defined that represents the spatial characteristics between the two channels, while both channels share the common nonnegative matrices $W$ and $H$ to model the multichannel observations.

##### 3.1. Preprocessing and Modeling

The standard NMF algorithm for speech enhancement considers only the amplitude of the observations in the time-frequency domain. The observation of the 1st channel is obtained and used as a reference where $(\cdot)^{*}$ denotes the complex conjugate; to fully reflect the interchannel characteristic, the same is then done for the 2nd channel with the expression of

According to the above preprocessing principle, the preprocessed observations comprise not only a nonnegative matrix but also a complex matrix. Hence, the first channel is modeled accurately by using (3), and the second channel is modeled accurately by introducing an interchannel matrix, which is randomly initialized. The interchannel characteristic contains the spatial information of the 2nd channel.
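To give a feel for what an interchannel characteristic can carry, the sketch below computes the amplitude ratio and phase difference of the 2nd channel relative to the 1st for every time-frequency bin. It is a generic illustration and does not reproduce the paper's preprocessing equations or its interchannel matrix; the names `X1`, `X2`, and `interchannel_features` are assumptions made for this example.

```python
import numpy as np


def interchannel_features(X1, X2, eps=1e-12):
    """Amplitude ratio and phase difference of channel 2 relative to channel 1."""
    rel = X2 * np.conj(X1)                     # cross term with channel 1 as reference
    amp_ratio = np.abs(X2) / (np.abs(X1) + eps)
    phase_diff = np.angle(rel)                 # radians, in (-pi, pi]
    return amp_ratio, phase_diff


# Toy data: channel 2 is an attenuated, phase-shifted copy of channel 1.
rng = np.random.default_rng(0)
X1 = rng.standard_normal((257, 10)) + 1j * rng.standard_normal((257, 10))
X2 = 0.5 * np.exp(-1j * 0.3) * X1
amp_ratio, phase_diff = interchannel_features(X1, X2)
print(amp_ratio.mean(), phase_diff.mean())
```

In this toy case every bin shows the same ratio and phase shift; for a real source the values vary with frequency and encode the direction and distance of the source relative to the two microphones.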

##### 3.2. Maximum Likelihood Estimation and Its Cost Function

Using the dual-channel probabilistic model, the likelihood is written as where we assume that the data follows the probability distribution. Thus, the maximum negative log-likelihood solution of (8) is represented as where represents equality up to irrelevant constant terms. The former term is explained in Section 2, and now the latter term is given by

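The gradient of the cost function $C(\theta)$ with respect to a variable $\theta$ (the subscript of the 2nd term of the cost function is omitted for convenience) is expressed as the difference of two positive terms, $\nabla_{\theta}^{+} C(\theta)$ and $\nabla_{\theta}^{-} C(\theta)$:

$$\nabla_{\theta} C(\theta) = \nabla_{\theta}^{+} C(\theta) - \nabla_{\theta}^{-} C(\theta). \tag{11}$$

The solution is then obtained by applying the general heuristic multiplicative update rule

$$\theta \leftarrow \theta \cdot \frac{\nabla_{\theta}^{-} C(\theta)}{\nabla_{\theta}^{+} C(\theta)}. \tag{12}$$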
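As a worked illustration of this heuristic, consider the single-channel case rather than the dual-channel cost treated in this section. The notation is assumed here: $V$ is the amplitude matrix, $W$ and $H$ are the basis and coefficient matrices, and the per-element $\beta$ divergence has derivative $\partial d_{\beta}(x \mid y) / \partial y = y^{\beta - 1} - x\, y^{\beta - 2}$. Applying the chain rule to the coefficient matrix splits the gradient into two nonnegative parts:

$$\nabla_{H} D(V \mid WH) = \underbrace{W^{\mathsf{T}} (WH)^{\cdot(\beta - 1)}}_{\nabla_{H}^{+}} - \underbrace{W^{\mathsf{T}} \big((WH)^{\cdot(\beta - 2)} \odot V\big)}_{\nabla_{H}^{-}}$$

Multiplying $H$ element-wise by the ratio of the negative part to the positive part then gives

$$H \leftarrow H \odot \frac{W^{\mathsf{T}} \big((WH)^{\cdot(\beta - 2)} \odot V\big)}{W^{\mathsf{T}} (WH)^{\cdot(\beta - 1)}}$$

The ratio equals one exactly where the gradient vanishes, so stationary points are fixed points of the update, and nonnegativity is preserved because every factor is nonnegative.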

The derivatives of the 2nd term of the cost function in (10) with respect to the interchannel matrix, $W$, and $H$ are given by

Using the cost function of (9) and the update rule of [21], the complex interchannel matrix and the nonnegative matrices $W$ and $H$ are then estimated; the gradient of the cost function is rewritten to give the following update rules, where $\mathbf{1}$ is a matrix of ones. Formulas (14), (15), and (16) derived above reduce to their single-channel counterparts (4) and (5) when only one microphone is used and the interchannel matrix is a unit matrix.

#### 4. Proposed NMF-Based Speech Enhancement Algorithm

Assume that two microphones are set up in a complex environment and that the noise and target speech signals are spatially separated. Let $s(t)$ be the target speech; the noisy speech signal at the $i$th microphone is then defined as

$$x_i(t) = s(t) * h_i(t) + n_i(t), \quad i = 1, 2, \tag{17}$$

where $*$ is the convolution operator, $i$ is the microphone index, $t$ is the sample index, and $h_i(t)$ and $n_i(t)$ represent the room reverberation and the noise at the $i$th microphone, respectively. The block diagram of the proposed algorithm is shown in Figure 1; it mainly includes two parts: the training stage and the enhancement stage.
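The two-channel signal model can be simulated with a short sketch. Everything specific here is an assumption for illustration: the room impulse responses are synthetic decaying noise rather than measured responses, the noise is white Gaussian, and `simulate_dual_mic` is a hypothetical helper name.

```python
import numpy as np


def simulate_dual_mic(s, rirs, noise_std=0.05, seed=0):
    """Return an array of shape (len(rirs), len(s)) with one noisy reverberant channel per row."""
    rng = np.random.default_rng(seed)
    channels = []
    for h in rirs:
        reverberant = np.convolve(s, h)[: len(s)]        # speech convolved with the room response
        noise = noise_std * rng.standard_normal(len(s))  # additive noise at this microphone
        channels.append(reverberant + noise)
    return np.stack(channels)


rng = np.random.default_rng(1)
s = rng.standard_normal(8000)                  # stand-in for a target speech signal
decay = np.exp(-np.arange(400) / 80.0)         # exponentially decaying envelope
rirs = [decay * rng.standard_normal(400) for _ in range(2)]
x = simulate_dual_mic(s, rirs)
print(x.shape)
```

Each row of `x` can then be transformed to the time-frequency domain and passed to the training or enhancement stage.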