Multiple Sound Source Localization and Counting Using One Pair of Microphones in Noisy and Reverberant Environments
A multiple sound source localization and counting method based on an angular spectrum is proposed in this paper. Local signal-to-noise ratio tracking, onset detection, and a coherence test are introduced to filter the generalized cross-correlation angular spectrum in the time-frequency domain for multiple sound source localization and counting in noisy and reverberant environments. Then, dual-width matching pursuit is introduced to replace peak search as the method of localization and counting. A comprehensive comparison of two statistical indicators, mean precision and mean absolute estimated error, indicates that the proposed localization and counting algorithm using both the filtered angular spectrum and dual-width matching pursuit method is more robust and accurate than the classic counterpart, especially in environments with low signal-to-noise ratio, strong reverberation, and abundant sound sources.
1. Introduction
In the field of array signal processing, multiple sound source localization and counting is a critical issue for applications such as indoor conferences, building security, and virtual reality. Multiple sound source localization is usually subject to ambient noise, reverberation caused by confined space and obstacles, and mutual interference between multiple sound signals. Many heuristic algorithms have been studied over the past two decades to suppress the effects of these negative factors and achieve robust and accurate performance. The main branch of research is based on time-frequency (TF) processing, which can be divided into three categories: clustering [3–6], histogram [7–10], and angular spectrum [6, 11–14].
The first category clusters the TF bins associated with each sound source on the basis of a criterion, such as interphase difference (IPD) and direction estimation of mixing matrix (DEMIX). This method, which obtains localization and counting results directly by iterative processing, is sensitive to the initial clustering parameter. In the second category, a weighted histogram using one pair of microphones is computed to solve the underdetermined problem. By using a relatively more complex topology, such as a unit circular array and a soundfield microphone, the redundant information of the circular integrated cross spectrum (CICS) and the smoothed histogram can improve the localization and counting accuracy while expanding the estimable range of the direction of arrival (DOA). These histogram-based algorithms are sensitive to the spatial aliasing ambiguity in some widely spaced microphone arrays because of the local TF computation. In this paper, we focus on the third category, angular spectrum, which consists of two steps: (i) angular spectrum construction and (ii) sound source localization and counting.
First, an enumerated function related to all possible DOAs in each TF bin is accumulated to construct the angular spectrum, such as the classical generalized cross-correlation phase transform (GCC-PHAT, abbreviated as GCC in the following) [11, 12] and the kernel density estimator (KDE) [13, 14]. GCC, which achieves a certain degree of noise robustness through the phase weighting of the cross-correlation function, is robust to the spatial aliasing problem [6, 15]. However, in environments where noise, reverberation, and mutual interference between multiple sound sources coexist, its performance deteriorates substantially because of the limitations of the ideal single-source propagation model. KDE has better antireverberation performance than GCC when the number of sound sources is relatively small (e.g., 2). The spatial aliasing ambiguity can be suppressed by the embedded frequency-dependent weighting factors in the kernel function, but the suppression is sensitive to the choice of kernel bandwidth.
Second, the DOAs and the corresponding number of sound sources are obtained through an exhaustive search of the angular spectrum for the optimal parameter values. Traditional peak search (PS), which is based on single-point peak amplitude, implements source localization and counting by comparing the peak amplitude with a cut-off threshold. The cut-off threshold can be made adaptive by using the previous peak. Because the angular spectrum is seriously distorted in adverse environments, the DOA estimated by PS may have a large offset, resulting in instability. Matching pursuit (MP) can be used to improve the performance of PS by calculating the maximum inner product, but the choice of matching structure and atom width needs careful consideration. On the basis of source contributions, iterative contribution removal (ICR) filters out the TF bins associated with the currently estimated sound source during each iteration and then reconstructs the angular spectrum from the remainder for the next source localization. The reconstruction can restrain the distortion of the angular spectrum, but the search of the TF bins is computationally expensive, and an incorrect earlier source localization may exert considerable influence on the following iterations.
In this paper, we improve each step of the angular spectrum-based algorithm. GCC is used because of its applicability to any microphone spacing [6, 15]. The main innovations are as follows. In the angular spectrum construction step, three TF filtering modules, local signal-to-noise ratio (SNR) tracking [15, 21], onset detection [22, 23], and coherence test [24, 25], are introduced to extract the TF bins that are less disturbed by noise, reverberation, and mutual interference between sound sources, respectively. The filtered GCC is termed GCCTF. In the localization and counting step, PS with the single-point amplitude is replaced by MP [8, 20] with the inner product of the atom from the perspective of contribution removal. The dual-width structure and the source merging module are used to improve the iteration efficiency and to remove the repeatedly estimated results, respectively.
The remainder of this paper is organized as follows. In Section 2, the classic GCC is constructed based on the signal propagation model of multiple sound sources; then, three TF filtering modules are introduced to produce the GCCTF spectrum. In Section 3, the dual-width structure and the source merging module are presented during the description of MP. Numerical comparisons between the proposed GCCTF-MP algorithm and the classical ones are given in Section 4. Finally, Section 5 concludes the paper.
2. GCC Angular Spectrum and TF Filtering
2.1. Signal Propagation Model of Multiple Sound Sources
The propagation model of multiple sound sources is shown in Figure 1. In the sound field space composed of $N$ independent sound sources, there is a pair of omnidirectional microphones with spacing $d$. The DOA of the $n$-th sound source, $\theta_n$, is defined in an anticlockwise manner, with $0^\circ$ being the direction perpendicular to the line connecting the pair of microphones.
$\Theta = \{\theta_1, \ldots, \theta_N\}$ is indicated as the set of DOAs. The elements of $\Theta$ and the corresponding number $N = |\Theta|$ are unknown, where $|\cdot|$ is the operator used to measure the number of elements in the set. In the approximately far field, the signal propagation model of multiple sound sources can be expressed as
$$x_m(t) = \sum_{n=1}^{N} h_{mn}(t) * s_n(t) + v_m(t), \tag{1}$$
where $m = 1, 2$, $x_m$ denotes the observed signal of the $m$-th microphone, $h_{mn}$ denotes the impulse response between the $n$-th sound source and the $m$-th microphone, $s_n$ denotes the discrete time signal of the $n$-th sound source with sampling rate $f_s$, $*$ denotes convolution, and $v_m$ is additive white Gaussian noise independent of the sources and the impulse responses.
With a $K$-point short-time Fourier transform (STFT), the expression in the discrete TF domain can be obtained as
$$X_m(l,k) = \sum_{n=1}^{N} H_{mn}(k)\, S_n(l,k) + V_m(l,k), \tag{2}$$
where $X_m(l,k)$ and $S_n(l,k)$ denote the STFT coefficients of the observed signal and the sound source signal corresponding to the $l$-th frame and the $k$-th discrete frequency, respectively; $V_m(l,k)$ denotes the additive complex noise; and $H_{mn}(k)$ is the transfer function between the $n$-th sound source and the $m$-th microphone. Under the assumption of diffuse reverberation [26, 27], $H_{mn}(k)$ can be decomposed as the direct wave component $H^{d}_{mn}(k)$ and the reverberation component $H^{r}_{mn}(k)$, that is,
$$H_{mn}(k) = H^{d}_{mn}(k) + H^{r}_{mn}(k), \qquad H^{d}_{mn}(k) = a_{mn}\, e^{-j 2\pi f_k \tau_{mn}}, \tag{3}$$
$$X_m(l,k) = \sum_{n=1}^{N} \left( a_{mn}\, e^{-j 2\pi f_k \tau_{mn}} + H^{r}_{mn}(k) \right) S_n(l,k) + V_m(l,k), \tag{4}$$
where $a_{mn}$ and $\tau_{mn}$ denote the propagation attenuation and arrival time of the direct wave from the $n$-th sound source to the $m$-th microphone, respectively, and $f_k$ denotes the frequency in the $k$-th frequency bin.
2.2. GCC Angular Spectrum Construction
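In an ideal environment, noise and reverberation do not exist, and the assumption of W-disjoint orthogonality (W-DO) is satisfied. Both the noise term $V_m(l,k)$ and the reverberation component $H^{r}_{mn}(k)$ in equation (4) are 0, and at most one sound source dominates the power in each TF bin. In this case, equation (4) can be simplified as
$$X_m(l,k) = a_{mn^*}\, e^{-j 2\pi f_k \tau_{mn^*}}\, S_{n^*}(l,k), \tag{5}$$
where $X_m(l,k)$ is the STFT coefficient of the $m$-th microphone in frame $l$ and frequency bin $k$, and $n^*$ denotes the index of the dominant sound source in each TF bin. Then, the IPD, which indicates the phase difference between the observed signals of the pair of microphones, can be obtained as
$$\phi(l,k) = \angle\{ X_1(l,k)\, X_2^{*}(l,k) \} = 2\pi f_k \tau(l,k) - 2\pi p(l,k), \tag{6}$$
where $\angle\{\cdot\}$ is the operator to find the phase of a complex number, $\tau(l,k) = \tau_{2n^*} - \tau_{1n^*}$ is the time difference of arrival (TDOA) between the pair of microphones in each TF bin, and
$$p(l,k) = \mathrm{round}\big( f_k \tau(l,k) \big) \tag{7}$$
denotes the wrapping factor, where $\mathrm{round}(\cdot)$ is the operator for obtaining the retained integer after rounding. The simulation results showed that wider microphone spacing can bring better resolution for localization. However, the widening of $d$ is bound to break the limit of half the minimum wavelength, $\lambda_{\min}/2$, thus making $p(l,k) \neq 0$, resulting in spatial aliasing ambiguity.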
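As a small worked example of the half-wavelength limit, the sketch below computes the largest alias-free spacing and the wrapping factor for a given spacing. The function names, the far-field relation TDOA = d·sin(θ)/c, the rounding rule, and the example spacings of 0.02 m and 0.20 m are assumptions made for illustration; only the sound velocity of 344 m/s is taken from the numerical analysis of this paper.

```python
import math


def max_alias_free_spacing(c, f_max):
    """Half of the minimum wavelength c / f_max, in metres."""
    return c / (2.0 * f_max)


def wrapping_factor(f, d, theta_deg, c=344.0):
    """Number of whole 2*pi wraps in the IPD at frequency f (Hz).

    Assumes a far-field source, so the TDOA is d * sin(theta) / c.
    """
    tau = d * math.sin(math.radians(theta_deg)) / c
    return round(f * tau)


if __name__ == "__main__":
    # Assumed upper analysis frequency of 8 kHz.
    print(max_alias_free_spacing(344.0, 8000.0))
    # A closely spaced pair does not wrap at 4 kHz; a widely spaced pair does.
    print(wrapping_factor(4000.0, 0.02, 90.0))
    print(wrapping_factor(4000.0, 0.20, 90.0))
```

A nonzero wrapping factor means that several candidate DOAs produce the same observed IPD in that frequency bin, which is the ambiguity that histogram-based methods working on local TF estimates have to resolve.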
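On the basis of the IPD $\phi(l,k)$ obtained in equation (6), the local GCC related to the unknown DOA $\theta$ in each TF bin can be expressed as
$$g(l,k,\theta) = \mathrm{Re}\left\{ e^{j\phi(l,k)}\, e^{-j 2\pi f_k \tau(\theta)} \right\} = \cos\big( 2\pi f_k [\tau(l,k) - \tau(\theta)] \big), \tag{8}$$
where $\tau(\theta) = d \sin\theta / c$ ($c$ denotes the atmospheric sound velocity) and $\mathrm{Re}\{\cdot\}$ denotes the real part of the complex argument. Because $e^{-j 2\pi p(l,k)} = 1$ for the integer wrapping factor $p(l,k)$, as shown in equation (8), the wrapping factor can be eliminated. By accumulating the local GCC in equation (8) across all TF bins, the GCC angular spectrum can be obtained as
$$G_{\mathrm{GCC}}(\theta) = \sum_{(l,k) \in \Omega} g(l,k,\theta), \qquad \theta \in \Theta_g, \tag{9}$$
where $\Theta_g$ denotes the linear space of the DOA range with angle grid $\Delta\theta$ and $\Omega$ denotes the set of all TF bins.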
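The accumulation described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name, the 1° grid over [-90°, 90°], the sign convention (microphone 2 delayed relative to microphone 1 for positive angles), and the synthetic single-source test signal are all assumptions.

```python
import cmath
import math


def gcc_angular_spectrum(X1, X2, freqs, d, c=344.0, grid_deg=None):
    """Accumulate the phase-weighted local GCC over all TF bins.

    X1, X2: lists of frames; each frame is a list of complex STFT
            coefficients, one per entry of freqs (Hz).
    d:      microphone spacing in metres.
    """
    if grid_deg is None:
        grid_deg = list(range(-90, 91))  # assumed 1-degree grid
    taus = [d * math.sin(math.radians(th)) / c for th in grid_deg]
    spectrum = [0.0] * len(grid_deg)
    for frame1, frame2 in zip(X1, X2):
        for f, a, b in zip(freqs, frame1, frame2):
            cross = a * b.conjugate()
            mag = abs(cross)
            if mag == 0.0:
                continue  # empty bin carries no phase information
            phat = cross / mag  # keep the phase only (PHAT weighting)
            for i, tau in enumerate(taus):
                spectrum[i] += (phat * cmath.exp(-2j * math.pi * f * tau)).real
    return grid_deg, spectrum


if __name__ == "__main__":
    # One ideal far-field source at 30 degrees, one frame, 40 frequency bins.
    freqs = [100.0 * i for i in range(1, 41)]
    tau0 = 0.1 * math.sin(math.radians(30)) / 344.0
    X1 = [[1 + 0j for _ in freqs]]
    X2 = [[cmath.exp(-2j * math.pi * f * tau0) for f in freqs]]
    grid, spec = gcc_angular_spectrum(X1, X2, freqs, d=0.1)
    print(grid[spec.index(max(spec))])
```

Because only the unit-modulus phase term enters the sum, the integer wrapping factor drops out automatically, which is why this construction tolerates wide microphone spacing.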
2.3. TF Filtered GCC Angular Spectrum
In practice, noise and reverberation are inevitable, and sources are more likely to overlap in the TF domain when the number of simultaneously occurring sound sources increases [9, 10]. Therefore, three modules, local SNR tracking, onset detection, and coherence test, are used in the GCC angular spectrum construction step to extract the TF bins that are less affected by these problems. A block diagram of GCC and its filtered variant, GCCTF, is shown in Figure 2.
Local SNR tracking: stronger noise in the observed signal generates a higher angular spectrum floor and therefore a less recognizable spectrum. It is thus necessary to enhance the observed signal, which is often realized by tracking the noise floor. In this paper, a simple SNR tracking method is used to obtain the SNR in each TF bin, namely, the local SNR, which can be expressed as where denotes the local noise power and denotes the operator used to obtain the minimum of the set.
Assume the ideal case in which the whole observed signal starts with a section of pure noise. Then, the first frames are used to measure the initial local noise power. The number of initial frames should not be set too large since the local noise power changes slowly over time. It is usually set to 2 or 3 empirically. The initial local noise power can be expressed as
Then, the local noise power increases slowly during signal frames and decreases slowly during noise frames. It can be expressed as where is the updating factor.
The TF bins with local SNR above a user-defined threshold are extracted. The set of TF bins that satisfy the local SNR tracking module can be expressed as
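The tracking and selection steps above can be sketched as follows. This is an illustrative approximation under stated assumptions, not the exact rule of the paper: the multiplicative update (raise the noise estimate by a factor 1 + alpha when the bin power exceeds it, lower it by 1 - alpha otherwise), the default values of `n_init`, `alpha`, and `threshold_db`, and the function names are all hypothetical. Taking the minimum over the two microphones follows the minimum operator mentioned in the text.

```python
import math


def track_local_snr_db(power, n_init=3, alpha=0.05, floor=1e-12):
    """Per-bin local SNR in dB for one microphone.

    power[t][k] is |X(t, k)|^2. The first n_init frames are assumed to be
    noise only and give the initial noise power per frequency bin.
    """
    n_bins = len(power[0])
    noise = [max(sum(power[t][k] for t in range(n_init)) / n_init, floor)
             for k in range(n_bins)]
    snr_db = []
    for t, frame in enumerate(power):
        row = []
        for k, p in enumerate(frame):
            if t >= n_init:
                # Assumed slow update: up on signal frames, down on noise frames.
                noise[k] *= (1.0 + alpha) if p > noise[k] else (1.0 - alpha)
            row.append(10.0 * math.log10(max(p / noise[k], floor)))
        snr_db.append(row)
    return snr_db


def select_high_snr_bins(snr1_db, snr2_db, threshold_db=5.0):
    """TF bins whose worse-microphone local SNR exceeds the threshold."""
    return {(t, k)
            for t, (r1, r2) in enumerate(zip(snr1_db, snr2_db))
            for k, (a, b) in enumerate(zip(r1, r2))
            if min(a, b) > threshold_db}


if __name__ == "__main__":
    # Three noise-only frames, then a strong component in bin 0.
    p = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [100.0, 1.0]]
    snr = track_local_snr_db(p)
    print(select_high_snr_bins(snr, snr))
```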
Onset detection: in real environments, strong reverberation will bring serious angular spectrum distortion, which causes incorrect localization and counting results. Regardless of the special case that a new sound is very weak in each frequency band, the onset of a new sound, which is related to the direct wave component, is often accompanied by a sudden rise in signal amplitude (energy) within some frequency bands. To detect this rise, the parameter of onset detection in each TF bin is set as follows: where is the half-wave rectifier function.
For the -th microphone, once the peak of is detected in each frequency band, the corresponding to the same TF bin is considered as the onset. The threshold is set as and is then gradually attenuated as the time frame moves forward: where is the decaying factor empirically decided according to the experimental environment. The set of TF bins that satisfy the onset detection module can be expressed as
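A rough sketch of this kind of onset detector is given below. It is an assumption-laden illustration, not the paper's exact formulation: the onset strength is taken as the half-wave rectified frame-to-frame rise in magnitude, the threshold is reset to the strength of each detected onset and multiplied by `decay` otherwise, and both the function name and the default decay value are hypothetical.

```python
def onset_bins(mag, decay=0.9):
    """TF bins flagged as onsets for one microphone.

    mag[t][k] is |X(t, k)|. A bin is an onset when its rectified rise
    exceeds a per-frequency threshold that decays between onsets.
    """
    n_bins = len(mag[0])
    threshold = [0.0] * n_bins
    selected = set()
    for t in range(1, len(mag)):
        for k in range(n_bins):
            rise = max(mag[t][k] - mag[t - 1][k], 0.0)  # half-wave rectifier
            if rise > threshold[k]:
                selected.add((t, k))
                threshold[k] = rise  # a new onset resets the threshold
            else:
                threshold[k] *= decay  # threshold fades as frames advance
    return selected


if __name__ == "__main__":
    # A strong rise, a small fluctuation, then a second strong rise.
    print(sorted(onset_bins([[0.0], [5.0], [5.5], [5.5], [10.0]])))
```

The decaying threshold is what lets the detector ignore the reverberant tail that follows a direct-path onset while still reacting to the next genuine onset.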
Coherence test: in practice, when the number of simultaneously occurring sound sources increases, the probability of sources with comparable power overlapping in the same TF bin increases. The W-DO assumption cannot be strictly met. So the assumption is appropriately relaxed: when accumulating the angular spectrum, there are TF bins with only one dominant sound source. Then, the coherence test module can be used to extract these TF bins effectively, which mitigates the effect of simultaneously occurring sources. The coherence test parameter is set as follows: where denotes the complex conjugate operator, denotes the average expectation of the consecutive time frames, and
TF bins with above the user-defined threshold are considered to contain only one dominant sound source. Then, the corresponding set can be expressed as
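One common form of such a test is the magnitude of the normalized cross-correlation between the two channels, averaged over a few consecutive frames. The sketch below uses that form as an assumption; the function names, the averaging length `n_avg`, and the threshold of 0.9 are hypothetical and are not taken from the paper.

```python
import cmath
import math


def coherence(X1, X2, t, k, n_avg=3):
    """Inter-channel coherence in bin (t, k), averaged over up to n_avg frames.

    X1[t][k] and X2[t][k] are complex STFT coefficients of the two microphones.
    """
    frames = range(max(0, t - n_avg + 1), t + 1)
    count = len(frames)
    r12 = sum(X1[u][k] * X2[u][k].conjugate() for u in frames) / count
    r11 = sum(abs(X1[u][k]) ** 2 for u in frames) / count
    r22 = sum(abs(X2[u][k]) ** 2 for u in frames) / count
    denom = math.sqrt(r11 * r22)
    return abs(r12) / denom if denom > 0.0 else 0.0


def single_source_bins(X1, X2, threshold=0.9, n_avg=3):
    """TF bins whose coherence exceeds the (assumed) threshold."""
    return {(t, k)
            for t in range(len(X1))
            for k in range(len(X1[0]))
            if coherence(X1, X2, t, k, n_avg) > threshold}


if __name__ == "__main__":
    # A single source keeps a constant inter-channel phase across frames.
    a = [[1 + 1j], [2 + 0j], [-1j]]
    b = [[x[0] * cmath.exp(-0.7j)] for x in a]
    print(coherence(a, b, 2, 0))
```

When a single source dominates, the inter-channel phase is stable across the averaged frames and the value approaches 1; overlapping sources make the phase fluctuate and pull the value down.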
On the basis of the three TF filtering modules (local SNR tracking, onset detection, and coherence test), combined with the derivation of GCC, the GCCTF angular spectrum can be obtained as where is the set of TF bins after TF filtering.
3. Dual-Width Matching Pursuit Method
After the angular spectrum construction in the previous section, the angular spectrum vector with the length is used to realize the multiple sound source localization and counting. It can be expressed as where the subscript “str” represents the string “GCC” or “GCCTF.” Without loss of generality, is simplified to for convenience.
PS realizes localization and counting through the extraction of the spectrum peak above the cut-off threshold, while MP uses the inner-product comparison and the iterative source contribution removal. Consider an atom with one signal pulse, which can be approximately seen as the basic unit of the angular spectrum vector. The set of all the atoms can be defined as where denotes the operator of a circular shift to the right, denotes the row vector of shifted to the right by bits, and can be expressed as where denotes the -norm operator of a vector, denotes the row vector of shifted to the left by bits, and can be expressed as where is a Blackman window with width and denotes one half of the window width.
The choice of must consider a compromise: in the same noisy and reverberant environment, an excessively wide width may incorrectly estimate the DOAs of sound sources with small angular intervals, and an excessively narrow width may reduce the iteration efficiency.
Therefore, a dual-width structure is proposed, where a narrower width is used to localize and a wider width is used to process the iterative contribution removal. To differentiate, subscripts “1” and “2” are added to parameters , , , and to indicate the atoms with narrower and wider window widths, respectively. A block diagram of the proposed MP method is shown in Figure 3, where the initial value is set as . The corresponding steps are as follows:
(1) DOA estimation: the inner product of each atom with a narrower window width in set and the i-th angular spectrum vector can be expressed as where denotes the inner-product operator. Then, through the search of the maximum inner product, the estimated DOA in the -th iteration can be obtained as where is the shifted bit of when the maximum inner product in the -th iteration is obtained.
(2) Contribution measurement: on the basis of the maximum inner product in equation (25) and the corresponding atom with wider window width, the contribution vector of the estimated sound source in the -th iteration can be measured as
(3) Stop judgement: give two conditional expressions: where is the user-defined threshold, denotes the maximum number of iterations, and denotes the contribution corresponding to the contribution vector in equation (27), where denotes the operator for the summation of the vector elements. If either equation (28) or equation (29) is satisfied, the loop stops.
(4) Residual calculation: after removing the contribution vector from the angular spectrum vector in the -th iteration, the residual used in the -th iteration can be calculated as
Set the number of iterations when the loop stops as ; then, the set of the estimated DOAs after the loop body can be represented as
Due to the limited window width, the contribution of a certain sound source may not be completely removed from the angular spectrum in each iteration, resulting in closely located sound sources. Extra counts generated by these sources will cause some deviations in counting results. Thus, the postprocessing step called source merging is used to merge the closely located sources into only one source in case that redundantly estimated sound sources are counted.
(5) Source merging: set as the minimum angular interval. Any two estimated DOAs whose angular interval is smaller than should be merged into one sound source according to their corresponding initial inner products. If , where and are two estimated DOAs, the source merging process can be expressed as
Then, the closely located sources are merged, and the final localization and counting results can be obtained as where is the set of the final estimated DOAs, is the -th element of , and is the number of the estimated DOAs.
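The five steps above can be sketched as follows. Several details are assumptions made for illustration rather than the paper's exact definitions: the contribution vector is taken as the maximum narrow-atom inner product times the unit-norm wide atom at the same shift, the stop rule compares the summed contribution with `delta` times that of the first iteration, merging keeps the estimate with the larger inner product, and all function names and default parameter values are hypothetical.

```python
import math


def blackman_atom(length, half_width):
    """Unit-norm Blackman pulse of width 2*half_width + 1 centred on index 0.

    Requires half_width >= 1; the pulse wraps circularly around the vector.
    """
    m = 2 * half_width + 1
    win = [0.42 - 0.5 * math.cos(2 * math.pi * n / (m - 1))
           + 0.08 * math.cos(4 * math.pi * n / (m - 1)) for n in range(m)]
    norm = math.sqrt(sum(v * v for v in win))
    atom = [0.0] * length
    for n, v in enumerate(win):
        atom[(n - half_width) % length] = v / norm
    return atom


def shift_right(vec, s):
    """Circular shift of vec to the right by s positions."""
    n = len(vec)
    return [vec[(i - s) % n] for i in range(n)]


def dual_width_mp(spectrum, w_narrow=4, w_wide=6, delta=0.5, max_iter=10):
    """Return [(grid index, inner product)] for each accepted source."""
    n = len(spectrum)
    narrow, wide = blackman_atom(n, w_narrow), blackman_atom(n, w_wide)
    residual = list(spectrum)
    found, first = [], None
    for _ in range(max_iter):
        # (1) DOA estimation with the narrow atom.
        products = [sum(r * a for r, a in zip(residual, shift_right(narrow, s)))
                    for s in range(n)]
        best = max(range(n), key=products.__getitem__)
        # (2) Contribution measurement with the wide atom (assumed form).
        contribution = [products[best] * v for v in shift_right(wide, best)]
        total = sum(contribution)
        if first is None:
            first = total
        # (3) Stop judgement (assumed relative threshold).
        if total < delta * first:
            break
        found.append((best, products[best]))
        # (4) Residual calculation.
        residual = [r - c for r, c in zip(residual, contribution)]
    return found


def merge_sources(found, min_gap):
    """(5) Source merging: keep the stronger of any two close estimates."""
    kept = []
    for s, ip in sorted(found, key=lambda item: -item[1]):
        if all(abs(s - k) >= min_gap for k, _ in kept):
            kept.append((s, ip))
    return sorted(k for k, _ in kept)


if __name__ == "__main__":
    peak = blackman_atom(181, 6)
    spec = [a + 0.8 * b
            for a, b in zip(shift_right(peak, 60), shift_right(peak, 120))]
    print(merge_sources(dual_width_mp(spec), min_gap=10))
```

With a 1° grid starting at -90°, a grid index `s` corresponds to a DOA of `s - 90` degrees. Localizing with the narrow atom keeps closely spaced peaks distinguishable, while subtracting the wide atom clears more of each peak's skirt per iteration.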
4. Numerical Analysis
To verify the performance of the proposed GCCTF angular spectrum and dual-width MP method, the image-source model [27, 29] is used to generate the observed data, where the room size is and the sound velocity is 344 m/s. The plane schematic diagram of the room is shown in Figure 4, where the heights of the microphones and sources are all set to 1.3 m. A pair of omnidirectional microphones parallel to the x-axis is located at the center of the room with spacing . sound sources are distributed on a semicircle with as the centroid and microphone-source distance as the radius.
The DOA distribution when the true number of sources varies from 2 to 6 is presented in Table 1. 8 male and 8 female voices taken from the TIMIT dataset are used as sound sources , with a sampling rate . The total number of simulations is set to 200. In each simulation, segments of length 1.024 s from different voices are randomly extracted and then preprocessed to have the same average power.
From equation (1), the observed signal is produced by convolving the source signal with the impulse response generated by the image-source model and then adding white Gaussian noise with where , , and denote the average power of the two microphones and and the additive noise, respectively. We discuss the performance in three scenarios with different reverberation times and : (i) , ; (ii) , ; and (iii) , . The direct-to-reverberation ratio (DRR), which indicates the reverberation level, is defined as follows: where the numerator and denominator of the logarithmic function represent the total power of the direct wave component and the reverberation component of the impulse response between the -th sound source and the -th microphone, respectively. Then, the average DRRs of all the sources when in scenario (i), scenario (ii), and scenario (iii) are , , and , respectively.
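The two quantities above can be computed as in the following sketch. It assumes that the SNR is the ratio of the mean power of the two microphone signals to the noise power, and that the direct wave occupies the first `n_direct` samples of the impulse response; both conventions, the function names, and the toy signals are assumptions made for illustration.

```python
import math
import random


def noise_std_for_snr(x1, x2, snr_db):
    """Noise standard deviation that yields the target SNR in dB."""
    p1 = sum(v * v for v in x1) / len(x1)
    p2 = sum(v * v for v in x2) / len(x2)
    noise_power = 0.5 * (p1 + p2) / (10.0 ** (snr_db / 10.0))
    return math.sqrt(noise_power)


def add_white_noise(x, std, seed=0):
    """Add zero-mean white Gaussian noise with the given standard deviation."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, std) for v in x]


def drr_db(h, n_direct):
    """Direct-to-reverberation ratio of impulse response h in dB.

    The first n_direct samples are treated as the direct wave (assumption).
    """
    direct = sum(v * v for v in h[:n_direct])
    reverb = sum(v * v for v in h[n_direct:])
    return 10.0 * math.log10(direct / reverb)


if __name__ == "__main__":
    x1, x2 = [1.0, -1.0, 1.0, -1.0], [2.0, -2.0, 2.0, -2.0]
    std = noise_std_for_snr(x1, x2, snr_db=10.0)
    print(std, add_white_noise(x1, std))
    print(drr_db([1.0, 0.5, 0.5], n_direct=1))
```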
4.1. Comparison of Angular Spectrum Recognition Degree
The parameter configuration in the angular spectrum construction step is presented in Table 2, where is 5 dB, as suggested in , and other parameters are set empirically through the previous experiment. Figure 5 shows the normalized angular spectra of GCC and GCCTF when and from scenario (i) to scenario (iii) in a single simulation, where the arrows in each subfigure indicate the true DOAs. In scenario (i) and scenario (ii), the peaks of both angular spectra almost coincide with the true DOAs. The local SNR tracking module in GCCTF efficiently lowers the angular spectrum floor, so that the deviations of its peaks from the true DOAs are smaller than those of GCC. In scenario (iii), the growth of shortens the distance between the direct wave peak and the first reverberant peak, which aggravates the reverberation level. The spectrum floor is higher than in scenario (ii), and the false peak marked by “x” in GCC has a high amplitude compared to the peak at the true DOA, resulting in an incorrectly estimated DOA. The onset detection module in GCCTF prevents this phenomenon, so that GCCTF retains the correct estimation.
GCCTF forms a more recognizable angular spectrum than GCC from scenario (i) to scenario (iii) in Figure 5. To quantify the recognition degree from a statistical perspective, we introduce the mean precision (MPRE), which can be expressed as follows [17, 18]: where and denote the number of correctly estimated DOAs and the number of spectrum peaks whose normalized amplitudes are greater than 0.2 in the -th simulation, respectively. A DOA is judged to be estimated correctly if the interval between the true value and the estimated value is less than , which is , as suggested in [9, 15].
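A minimal sketch of this indicator is shown below, assuming that the precision of one simulation is the number of true DOAs matched by a qualifying peak divided by the number of qualifying peaks, and that MPRE is the average of this ratio over all simulations. The peak-picking rule, the function names, and the tolerance used in the example are assumptions; only the 0.2 amplitude floor comes from the text.

```python
def precision_one_run(spectrum, grid_deg, true_doas, tol_deg, min_amp=0.2):
    """Precision of one simulation from its angular spectrum."""
    top = max(spectrum)
    norm = [v / top for v in spectrum]
    # Local maxima of the normalized spectrum above the amplitude floor.
    peaks = [grid_deg[i] for i in range(1, len(norm) - 1)
             if norm[i] > min_amp
             and norm[i] > norm[i - 1] and norm[i] >= norm[i + 1]]
    if not peaks:
        return 0.0
    correct = sum(1 for d in true_doas
                  if any(abs(p - d) < tol_deg for p in peaks))
    return correct / len(peaks)


def mean_precision(per_run_precisions):
    """Average precision over all simulations."""
    return sum(per_run_precisions) / len(per_run_precisions)


if __name__ == "__main__":
    grid = [-20, -10, 0, 10, 20, 30, 40]
    spec = [0.1, 1.0, 0.1, 0.5, 0.1, 0.6, 0.1]
    # Three qualifying peaks, two of which match the true DOAs.
    print(precision_one_run(spec, grid, true_doas=[-10, 30], tol_deg=5))
```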
Figure 6 shows the MPREs of GCC and GCCTF versus SNR from scenario (i) to scenario (iii). In each subfigure of Figure 6, the MPREs improve as the SNR grows, mainly because a lower noise floor makes the angular spectrum more recognizable. A horizontal comparison of the subfigures in Figure 6 shows that the MPREs tend to decrease from scenario (i) to scenario (iii), mainly because stronger reverberation strengthens the effect of false peaks and thus decreases the recognition degree. GCCTF performs better than GCC under the same noisy and reverberant environment, which shows that TF filtering indeed yields a more recognizable spectrum and therefore benefits the subsequent localization and counting performance.
4.2. Comparison of the Localization and Counting Performances
The mean absolute estimated error (MAEE) is used to measure the localization and counting performance [9, 10]: where and denote the true and estimated DOA of the -th sound source in the -th simulation, respectively; and , where and denote the numbers of true and estimated sound sources in the -th simulation, respectively. Since is not greater than , the excessively estimated DOAs are set to meet the following expression:
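The sketch below shows one way such an error could be computed when the estimated and true source counts differ. The greedy nearest-neighbour pairing, the fixed penalty charged for every missed or surplus source, the penalty value in the example, and the function names are assumptions made for illustration; they stand in for the paper's own rule for excess estimates and are not taken from it.

```python
def abs_error_one_run(true_doas, est_doas, penalty_deg):
    """Mean absolute DOA error of one simulation, in degrees.

    Each true DOA is paired with its nearest unused estimate; unmatched
    true or estimated sources are charged penalty_deg each.
    """
    remaining = list(est_doas)
    errors = []
    for t in true_doas:
        if remaining:
            nearest = min(remaining, key=lambda e: abs(e - t))
            remaining.remove(nearest)
            errors.append(abs(nearest - t))
        else:
            errors.append(penalty_deg)  # missed source
    errors += [penalty_deg] * len(remaining)  # surplus estimates
    return sum(errors) / len(errors) if errors else 0.0


def mean_abs_estimated_error(runs, penalty_deg):
    """Average of the per-simulation errors; runs is [(true, estimated), ...]."""
    return sum(abs_error_one_run(t, e, penalty_deg) for t, e in runs) / len(runs)


if __name__ == "__main__":
    # Two true sources, three estimates: errors 2 and 1, plus one penalty.
    print(abs_error_one_run([-30, 30], [-28, 31, 70], penalty_deg=45))
```

Charging a penalty for count mismatches lets a single number reflect both localization and counting quality, which is the role the indicator plays in the comparisons that follow.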
4.2.1. Choice of the Window Width in the MP Method
When using the MP method, the final localization and counting performance is affected by the choice of the dual width. Through a large number of simulations, we find that when , one half of the narrower window width, is chosen from 4 to 6, the maximum inner product can accurately determine the DOAs. In this case, the choice of , the wider counterpart, exhibits a regular change with . To illustrate this regularity, without loss of generality, we present the MAEE with fixed at 4 while varies from 4 to 8 (in intervals of 2), where the GCCTF angular spectrum is used. Based on a comprehensive consideration of algorithm efficiency and accuracy, the maximum number of loops is set to 10, and the cut-off thresholds in scenario (i), scenario (ii), and scenario (iii) are 0.51, 0.53, and 0.60, respectively.
Figures 7 and 8 show the MAEEs of GCCTF-MP in three scenarios when and , respectively. (single-width) performs the best when from scenario (i) to scenario (iii); then, the performance declines as the width increases.
In scenario (i), when is 4, 5, and 6, the best performance is obtained with set to 6, 8, and 10, respectively. Thus, in a low reverberant environment (e.g., scenario (i)), a wider window width improves the performance of MP when there are more sound sources.
The trends in scenario (ii) are similar, except that when and , performs the best. In scenario (iii), when and , performs the best; when and is 4, 5, and 6, the best performance is obtained with set to 4, 4, and 6, respectively. Thus, increasing the window width results in sensitivity to the adverse environment with relatively strong reverberation and low SNR. However, the wider width shows an improvement when there are more sound sources in the same noisy and reverberant environment.
4.2.2. Comparison of the Localization and Counting Performances
To provide a comprehensive comparison of the localization and counting performance, three algorithms, GCC-PS, GCCTF-PS, and GCC-MP, are obtained similarly to GCCTF-MP by combining the angular spectrum with the localization and counting method. Based on the previous analysis, where is set to 4, Table 3 presents the parameter configuration when using the PS and MP methods from scenario (i) to scenario (iii).
In Figure 9, the MAEE ranges from to . As increases, the MAEE increases because the mutual interference between sound sources is aggravated. In each subfigure, the MAEE gradually decreases with increasing SNR. GCC-PS and GCCTF-MP perform the worst and the best among the four algorithms, respectively. When , the two algorithms using the MP method have lower MAEE than those using PS, and the two algorithms using the GCCTF angular spectrum have lower MAEE than those using GCC. When , the MAEE of GCCTF-PS is lower than that of GCC-MP, which shows that the SNR tracking module can effectively reduce the TF bins with relatively low local SNR.
In Figure 10, medium reverberation degrades the performance of the four algorithms, with MAEE between and . The trend is similar to scenario (i), except when , GCCTF-PS performs better than GCC-MP. Compared to the turning point of 0 dB in scenario (i), in a medium reverberant environment (e.g., scenario (ii)), the onset detection module may have a positive impact on the final localization and counting.
In Figure 11, strong reverberation results in the worst performance for the four algorithms, with MAEE between and . The trend is similar to scenario (ii), except when and SNR varies from 0 to 15 dB, GCCTF-PS performs better than GCC-MP. Compared to the turning point of 10 dB in scenario (ii), in a relatively strong reverberant environment (e.g., scenario (iii)), the coherence test module may have a positive impact on the final localization and counting.
Based on the results from Figures 9 to 11, GCCTF provides a more robust angular spectrum than GCC, especially in acoustically adverse environments. This is because the TF bins seriously affected by low SNR, strong reverberation, and abundant sound sources have been removed by the three filtering modules. The distortion of the angular spectrum is thus effectively alleviated, and the increase in the false peak amplitude is suppressed at the same time. MP performs better than PS because PS may produce unstable performance with a severely distorted angular spectrum. GCCTF-MP, which uses both the GCCTF angular spectrum and the MP method, is the most robust and accurate multiple sound source localization and counting algorithm among the four combined algorithms.
5. Conclusions
In this paper, GCCTF-MP, an algorithm for multiple sound source localization and counting, is proposed for noisy and reverberant environments. Three modules, local SNR tracking, onset detection, and a coherence test, are used to filter the GCC angular spectrum; the dual-width MP method is used to replace the amplitude comparison with the inner product and contribution removal. On the basis of the statistical indicators MPRE and MAEE, MP is shown to be a more accurate method than PS, and GCCTF is shown to be a more recognizable and robust angular spectrum, especially in environments with low SNR, strong reverberation, and abundant sound sources. The proposed GCCTF-MP, which uses both the GCCTF angular spectrum and MP method, is thus a robust and accurate multiple sound source localization and counting algorithm. Furthermore, we find that the final localization and counting performance when using the MP method is affected by the choice of the dual-width. A brief comparison when using a fixed narrower width and a different wider counterpart is presented. In practice, the environmental parameters are difficult to determine, and the number of sound sources is unknown. How to implement the width in an adaptive manner is a challenging problem that warrants further study.
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
This research was supported by the National Natural Science Foundation of China (nos. 61171167 and 61401203) and the Scientific Research Foundation of Jinling Institute of Technology (no. JIT-040520400101).
K. Wu and A. Khong, “Sound source localization and tracking,” Context Aware Human-Robot and Human-Agent Interaction, Springer International Publishing, Cham, Switzerland, 1st edition, 2016.
J. Benesty, J. Chen, and Y. Huang, Microphone Array Signal Processing, Springer International Publishing, Cham, Switzerland, 2008.
H. Sawada, S. Araki, R. Mukai, and S. Makino, “Grouping separated frequency components by estimating propagation model parameters in frequency-domain blind source separation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 5, pp. 1592–1604, 2007.
J. Escolano, N. Xiang, J. M. Perez-Lorenzo, M. Cobos, and J. J. Lopez, “A Bayesian direction-of-arrival model for an undetermined number of sources using a two-microphone array,” The Journal of the Acoustical Society of America, vol. 135, no. 2, pp. 742–753, 2014.
Y. Fang and Z. Xu, “A robust algorithm for unambiguous TDOA estimation of multiple sound sources under indoor environment,” Journal of Electronics & Information Technology, vol. 38, no. 5, pp. 1143–1150, 2016.
B. Loesch and B. Yang, “Source number estimation and clustering for underdetermined blind source separation,” in Proceedings of the 11th International Workshop on Acoustic Echo and Noise Control (IWAENC 2008), Seattle, WA, USA, September 2008.
S. Dixon, “Onset detection revisited,” in Proceedings of the 9th International Conference on Digital Audio Effects, Montreal, Quebec, Canada, September 2006.
N. Tho, S. Zhao, and D. Jones, “Robust DOA estimation of multiple speech sources,” in Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP 2014), Florence, Italy, May 2014.
J. Garofolo, L. Lamel, W. Fisher et al., TIMIT Acoustic-Phonetic Continuous Speech Corpus, Linguistic Data Consortium, 1993.