Mathematical Problems in Engineering

Volume 2017, Article ID 1981280, 11 pages

https://doi.org/10.1155/2017/1981280

## Compressive Sensing Based Source Localization for Controlled Acoustic Signals Using Distributed Microphone Arrays

^{1}School of Physics and Technology, Nanjing Normal University, Nanjing 210097, China

^{2}Jiangsu Key Laboratory of Meteorological Observation and Information Processing, Nanjing University of Information Science & Technology, Nanjing 210044, China

^{3}Jiangsu Center for Collaborative Innovation in Geographical Information Resource Development and Application, Nanjing 210023, China

Correspondence should be addressed to Wei Ke; wkykw@sina.com and Jianhua Shao; shaojianhua@njnu.edu.cn

Received 19 October 2016; Revised 9 January 2017; Accepted 19 February 2017; Published 8 August 2017

Academic Editor: Laurent Bako

Copyright © 2017 Wei Ke et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

In order to enhance the accuracy of sound source localization in noisy and reverberant environments, this paper proposes an adaptive sound source localization method based on distributed microphone arrays. Since sound sources lie at only a few points in the discrete spatial domain, our method can exploit this inherent sparsity to convert the localization problem into a sparse recovery problem based on the compressive sensing (CS) theory. In this method, a two-step discrete cosine transform- (DCT-) based feature extraction approach is utilized to cover both the short-time and long-time properties of acoustic signals and to reduce the dimensions of the sparse model. In addition, an online dictionary learning (DL) method is used to adjust the dictionary to match the changes of the audio signals, so that the sparse solution better represents the location estimates. Moreover, we propose an improved block-sparse reconstruction algorithm using approximate ℓ0-norm minimization to enhance the reconstruction performance for sparse signals in low signal-to-noise ratio (SNR) conditions. The effectiveness of the proposed scheme is demonstrated by simulation and experimental results, which show that substantial improvements in localization performance can be obtained in noisy and reverberant conditions.

#### 1. Introduction

In the past decade, microphone array based sound source localization techniques have developed rapidly and have been widely applied to video conference systems [1, 2], robot audition [3, 4], speech enhancement [5], and speech recognition [6]. Since distributed microphone arrays have a larger array aperture than traditional microphone arrays [7, 8], they can provide better sound source localization performance, and they have therefore become an active research area.

Existing sound source localization methods can generally be divided into three types: beamforming methods [9], high-resolution spectral estimation methods [10], and time delay estimation methods [11–14]. However, unlike positioning technology based on sonar and radar arrays, speech signals are wideband signals with only short-term stationarity. Moreover, sound source localization is easily affected by reverberant and noisy environments [15, 16]. Hence, there is still considerable room to improve the performance of sound source localization.

In recent years, the rapid development of compressive sensing (CS) theory [17, 18] has brought a revolutionary influence to traditional sound source localization methods. Since sound sources can be regarded as point sources and the number of sound sources in the positioning space is quite limited, the sound source localization problem inherently involves spatial sparsity. Exploiting this natural sparsity, Cevher and Baraniuk modeled sound source localization as a sparse approximation problem [19] based on the sound propagation model and obtained better positioning performance. In [20], the CS theory was utilized to estimate the time difference of arrival (TDOA), achieving higher estimation accuracy than the generalized cross-correlation (GCC) method. Unlike the above methods, [21] proposed a source localization algorithm based on a two-level Fast Fourier Transform- (FFT-) based feature extraction method and spatial sparsity; the proposed feature extraction method leads to a sparse representation of audio signals. Further, Simard and Antoni exploited Green's functions to establish a sparse localization model, which resulted in better sound source identification and localization performance than traditional beamforming methods [22]. In addition, [23] extended the sparse constraint to sound source imaging, which can not only locate sound source targets but also realize single-target imaging. The studies above show that the CS theory has broad application prospects in sound source localization.

However, just like the traditional sound source localization methods, the CS-based sound source localization approaches are still confronted with challenges from complex acoustic environments, mainly ambient noise and reverberation. On one hand, the sparse reconstruction performance is seriously affected by noise. At present, it is common to use the ℓ1 norm instead of the ℓ0 norm to enforce the sparse constraint, since ℓ0-norm optimization is NP-hard. However, under the ℓ1-norm constraint the objective function penalizes large coefficients more heavily in order to ensure convergence of the cost function. In positioning applications, however, large coefficients generally correspond to targets, while small coefficients may correspond to noise. Thus, ordinary reconstruction algorithms such as the greedy algorithm [24] and the convex-optimization algorithm [25] may weaken the contribution of large coefficients in the reconstruction process, while leaving small coefficients effectively unconstrained. The reconstruction accuracy may therefore decrease significantly, and in low SNR conditions noise may even be treated as a target. On the other hand, reverberation is ubiquitous in indoor environments, and the reflection and scattering of sound waves between walls and ceilings are hard to predict. When these occur, there are errors between the sparse model and the real measurements, which also degrade localization performance.
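The asymmetry described above can be illustrated numerically. The sketch below compares the ℓ1 penalty with one common smooth approximate-ℓ0 surrogate, the Gaussian family 1 − exp(−x²/2σ²); this particular surrogate and the value σ = 0.5 are illustrative assumptions, not necessarily the exact function used in this paper.

```python
import numpy as np

def l1_penalty(x):
    # The l1 penalty grows linearly, so large (target) coefficients
    # are penalized far more than small (noise) coefficients.
    return np.abs(x)

def approx_l0_penalty(x, sigma=0.5):
    # A smooth surrogate for the l0 "norm": it saturates near 1 for
    # |x| >> sigma, treating every significant coefficient equally.
    return 1.0 - np.exp(-x**2 / (2.0 * sigma**2))

large, small = 5.0, 0.1   # e.g., a target coefficient vs. a noise coefficient
print(l1_penalty(large) / l1_penalty(small))              # l1: target costs 50x more
print(approx_l0_penalty(large), approx_l0_penalty(small)) # ~1.0 vs ~0.02
```

Under the ℓ1 penalty the target coefficient dominates the cost and tends to be shrunk, whereas under the saturating surrogate all significant coefficients incur roughly the same cost, which is the behavior the approximate-ℓ0 formulation exploits.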

To improve the performance of sound source localization based on distributed microphone arrays in noisy and reverberant environments, a sound source localization method is proposed in this paper. This method exploits the inherent spatial sparsity to convert the localization problem into a sparse recovery problem based on the CS theory. Meanwhile, inspired by [21], this paper also uses sparse features in the transform domain to construct the CS-based localization model. Since the DCT has a strong energy-compaction property and keeps the computation in the real domain, we propose a two-step DCT-based feature extraction method to cover both the short-time and long-time properties of the signal and to reduce the dimension of the sparse model. In addition, we propose an improved block-sparse reconstruction algorithm based on approximate ℓ0-norm minimization to enhance the reconstruction performance under low SNR conditions. The novel feature of this method is the use of the approximate ℓ0 norm to promote interblock sparsity; the optimization problem is solved by a sequential procedure in conjunction with a conjugate-gradient method for fast reconstruction. Moreover, a dictionary learning (DL) method is used to adjust the dictionary to match the changes of the audio signals, so that the sparse solution better represents the location estimates.

The remainder of the paper is organized as follows. Section 2 describes the system model. In Section 3, we propose a novel two-step DCT-based feature extraction approach. The details of the proposed localization method are addressed in Section 4. Simulation and experimental results are given in Sections 5 and 6, respectively. Finally, Section 7 concludes the paper.

#### 2. System Model

Let us consider a spatial distribution of I sensors, each accommodating M microphones. For the sake of notational simplicity and with no loss of generality, let us assume that each sensor has the same number of microphones. In this paper, we assume that all microphones begin to work at the same time and then keep working until each experiment ends. Moreover, the propagation medium of the sound signals is isotropic. Let r_{ij}, i = 1, …, I, j = 1, …, M, be the coordinates of each microphone, which are generally known in the localization system, while the acoustic sources are located at unknown positions s_k = [x_k, y_k]^T. Here (·)^T means transposition. The whole localization area is uniformly divided into N grid points (generally N is much larger than the number of sources). Each microphone receives signals from the sound sources and transmits them to the positioning center, where the localization algorithm is performed.
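The geometric setup can be sketched as follows. All numerical values here (a 4 m × 4 m area, I = 2 sensors with M = 4 microphones each, an 8 × 8 grid) are assumed example parameters, not values from the paper; the point is only that the grid-to-microphone distances are computable in advance from the known coordinates.

```python
import numpy as np

# Assumed example geometry: 8 microphones (2 sensors x 4 mics) at known
# positions inside a 4 m x 4 m area, and N = 64 candidate grid points.
rng = np.random.default_rng(0)
mics = rng.uniform(0.0, 4.0, size=(8, 2))            # known microphone coordinates (x, y)
gx, gy = np.meshgrid(np.linspace(0.25, 3.75, 8),
                     np.linspace(0.25, 3.75, 8))
grid = np.column_stack([gx.ravel(), gy.ravel()])     # uniform grid of candidate positions

# d[q, n]: distance from grid point n to microphone q; precomputable because
# both the grid and the microphone coordinates are known to the system.
d = np.linalg.norm(mics[:, None, :] - grid[None, :, :], axis=2)
print(d.shape)   # (8, 64)
```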

Since sound sources occupy only one or a few grid points in the localization area, the positions of the sound sources in the discrete spatial domain can be accurately represented as a sparse vector θ. The elements of θ are equal to one if a sound source occupies the corresponding grid point, while all other elements are equal to zero. In this case, the localization problem is converted into determining the nonzero elements of θ and their specific locations from the received signals. Based on the CS theory, the sparse localization model can be represented as

y = Φθ, (1)

where y = [y_{11}^T, …, y_{IM}^T]^T represents the entire feature vector, y_{ij} corresponds to the feature vector of the signals received by the jth microphone of the ith sensor, and Φ is the dictionary matrix. The feature extraction method will be introduced in the next section. Since the dictionary is another key factor for the sparse reconstruction, this paper first sets up an initial dictionary based on the sound signal propagation model in [19] and then exploits the DL technique to modify the initial dictionary so as to overcome model errors. According to [19], the signal coming from a sound source located at grid point n and received by the jth microphone of the ith sensor can be expressed as

x_{ij,n}(t) = (1/d_{ij,n}) s(t − d_{ij,n}/c) + v_{ij}(t), (2)

where c is the speed of sound, d_{ij,n} is the distance between the nth grid point and the jth microphone of the ith sensor, s(t) represents the sound signal, and v_{ij}(t) represents noise. Since the positions of the grid points and microphones are already known, the distances d_{ij,n} can be calculated in advance. Hence, under noise-free conditions the dictionary can be calculated as

Φ = [Φ_1, Φ_2, …, Φ_N], (3)

where each block Φ_n is built from the feature vectors

y_{ij,n} = T(x_{ij,n}(t)), (4)

in which y_{ij,n} represents the feature vector of the signal from a sound source located at the nth grid point and received by the jth microphone of the ith sensor, and T(·) represents the feature extraction operation. Correspondingly, θ = [θ_1^T, θ_2^T, …, θ_N^T]^T, where the block θ_n is all-one if a source occupies grid point n and all-zero otherwise. Since each sensor has M microphones, it should be noted that θ is a block-sparse vector whose nonzero elements occur in clusters. Unlike the conventional CS framework, the sparsity patterns of neighboring coefficients are related to each other in the block-sparse model, and this block sparsity has the potential to encourage structured-sparse solutions.
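A toy instance of the block-sparse model in (1) can make the structure concrete. The sizes below (N = 16 grid points, blocks of M = 4 coefficients) and the random placeholder dictionary are illustrative assumptions; the real dictionary columns would be the DCT-domain feature vectors described in the next section.

```python
import numpy as np

# Assumed toy sizes: N = 16 grid points, M = 4 coefficients per block,
# feature dimension 64; random placeholder dictionary instead of DCT features.
N, M = 16, 4
rng = np.random.default_rng(1)
Phi = rng.standard_normal((64, N * M))   # one M-column block per grid point

theta = np.zeros(N * M)
for source_grid in (3, 11):              # two sources -> two all-one blocks
    theta[source_grid * M:(source_grid + 1) * M] = 1.0

y = Phi @ theta                          # noise-free measurement vector

# Recovering theta from y amounts to finding the few nonzero blocks; the
# block (grid) indices of those blocks are the source location estimates.
active = np.unique(np.nonzero(theta)[0] // M)
print(active)   # [ 3 11]
```

The clustering of nonzero entries into whole blocks is exactly the structure that block-sparse recovery algorithms exploit, as opposed to treating each coefficient independently.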

Taking two sources as an example, we substitute the concrete forms of the vectors y and θ and the matrix Φ into (1) and obtain

y = [Φ_1, Φ_2, …, Φ_N][0^T, …, 1^T, …, 1^T, …, 0^T]^T, (5)

where 0 and 1 represent the all-zero and all-one column vectors, respectively. If there are no noises and no reverberation, the specific locations of the nonzero blocks in θ point out the accurate positions of the sound sources according to the corresponding grid points. However, the reflection and scattering of sound signals and noises in indoor environments are unavoidable. Therefore, there are errors between the above sparse localization model and the real measurements, which may ultimately degrade the localization performance. In order to account for the differences between the predefined dictionary and the actual cases, the sparse model (1) is modified as

y = Ψθ + n = (Φ + ΔΦ)θ + n, (6)

where n denotes the noise, Ψ denotes the actual dictionary, and ΔΦ is the model error due to both reverberation and noise, which is time-varying and unknown in advance. To obtain accurate position estimates of the sound sources in real environments, a DL method is used to dynamically adjust the dictionary to match the changes of the actual signals. Meanwhile, an improved block-sparse reconstruction algorithm using approximate ℓ0-norm minimization is proposed to enhance the reconstruction performance under noisy conditions.

#### 3. Feature Extraction

In sound source localization applications, many localization algorithms use long speech signals to calculate source locations [9, 21]. However, given the high dimensionality of the received signals, the computational complexity is high. Moreover, many interferences and noises are embedded in the long acoustic signals, which may affect the final localization performance. Hence, instead of using acoustic signals in the time domain, this paper proposes a two-step DCT-based feature extraction method to overcome these problems.

Considering the properties of audio signals, the long signals received by the microphones are partitioned into successive frames using a small-size window function (a Hamming window is used here). In the first step, the received signal vector x = [x(1), x(2), …, x(L)]^T is partitioned into short frames; that is,

X(p, l) = w(l) x((p − 1)W + l), p = 1, …, P, l = 1, …, W, (7)

where x(l) is the lth element of the received signal vector, L is the length of the entire input signal, and w(l) is the window function. X is a P × W matrix whose rows are the windowed frames of the input signal, W is the frame (window) length, and P is the number of frames.

Next, the type-II DCT is applied to the speech signal frames:

C(p, k) = Σ_{l=1}^{W} X(p, l) cos[π(2l − 1)(k − 1)/(2W)], k = 1, …, W, (8)

where (8) represents the discrete cosine transform and C is the output of the DCT. Then, the amplitudes of the DCT coefficients in each frame are normalized by dividing them by their maximum value, and all normalized frames are averaged; that is,

z(k) = (1/P) Σ_{p=1}^{P} |C(p, k)| / max_k |C(p, k)|. (9)
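The framing, per-frame DCT, normalization, and averaging steps above can be sketched in a few lines. The frame length W = 256 and the use of non-overlapping frames are assumed example choices, not parameters stated by the paper.

```python
import numpy as np
from scipy.fft import dct

def short_time_features(x, W=256):
    """Framing + per-frame type-II DCT + max-normalization + averaging."""
    P = len(x) // W                                   # number of frames
    frames = x[:P * W].reshape(P, W) * np.hamming(W)  # windowed, non-overlapping frames
    C = np.abs(dct(frames, type=2, axis=1, norm='ortho'))  # type-II DCT per frame
    C /= C.max(axis=1, keepdims=True)                 # normalize each frame by its max
    return C.mean(axis=0)                             # average over all frames

z = short_time_features(np.random.default_rng(2).standard_normal(4096))
print(z.shape)   # (256,) -- one value per DCT bin, regardless of signal length
```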

Although the acoustic signals are generally nonstationary over a relatively long time interval, in this constructed short-time feature space the statistical properties of the resulting data become almost constant, so the features can be treated as an approximately stationary process. In contrast, noise does not share the short-time properties of acoustic signals and is usually assumed to obey a certain distribution with zero mean, so noise may be mitigated by the averaging procedure. In the next step, the long-time properties of the audio signals are considered by applying the DCT to the vector z again:

y = DCT(z). (10)

It should be noted that y is still a W-dimensional vector, so the two-step DCT-based feature extraction method reduces the dimension from the number of samples L to the frame length W and thus decreases the computational load of the sparse reconstruction. Meanwhile, since the DCT has a strong energy-compaction property, the large coefficients in the DCT domain are concentrated at certain locations of the vector y, while its other coefficients are very small. In other words, the vector y is approximately sparse in the DCT domain, which benefits the subsequent sparse reconstruction. The feature extraction operation in (4) thus represents the cascade of operations (7) to (10). The proposed feature extraction process is summarized in Figure 1, and the localization algorithm will be proposed in the next section.
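Putting both steps together, the complete feature extractor and its dimension-reduction effect can be sketched as follows; again, W = 256 and the 16384-sample input are assumed example values.

```python
import numpy as np
from scipy.fft import dct

def extract_features(x, W=256):
    """Two-step DCT feature extraction: the output length is W for any input length."""
    P = len(x) // W
    frames = x[:P * W].reshape(P, W) * np.hamming(W)       # step 1: framing
    C = np.abs(dct(frames, type=2, axis=1, norm='ortho'))  # per-frame type-II DCT
    z = (C / C.max(axis=1, keepdims=True)).mean(axis=0)    # normalize and average
    return dct(z, type=2, norm='ortho')                    # step 2: DCT of the average

y = extract_features(np.random.default_rng(3).standard_normal(16384))
print(len(y))   # 256 -- reduced from 16384 time-domain samples
```

In the localization model, one such W-dimensional vector would be computed per microphone and stacked to form the measurement vector, in place of the raw time-domain samples.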