Abstract

This paper proposes an unambiguous signal processing algorithm for simultaneously localizing and counting multiple active sound sources when a wide intermicrophone distance is used. Based on the kernel density estimator, a multistage structure in the time-frequency domain first suppresses the influence of spatial aliasing; the pooled angular spectrum is then combined with a peak search method that uses an updated cut-off threshold and a source merging module. The combination of these two steps realizes complete source localization and counting. Simulation results show that the proposed method is more robust than its classic counterparts, especially in adverse environments with spatial aliasing, reverberation, and interference between different sound sources.

1. Introduction

Multiple sound source localization is a key element in many applications ranging from national security to video conferencing [1, 2]. Especially in far-field speech interaction, accurately localizing multiple active sound sources is an essential prerequisite for high-quality speech enhancement and recognition [3, 4]. In these applications, ambient noise, reverberation, and mutual interference between different sources often seriously degrade the localization performance.

In recent years, based on the W-disjoint orthogonality (WDO) [5, 6] of the signals observed at pairwise microphones, multiple sound source localization methods can be divided into clustering, histogram, and angular spectrum methods.

Clustering methods, such as interphase difference (IPD) [7] and direction estimation of mixing matrix (DEMIX) [8], obtain localization results directly by iteratively clustering the time-frequency (TF) bins associated with each sound source [9]; they are sensitive to the initial clustering parameters [10]. Histogram methods compute a weighted histogram, such as the circular integrated cross spectrum (CICS) [11] or the smooth histogram [12], to achieve high estimation accuracy. However, the required array topology is relatively difficult to implement and popularize. Moreover, most clustering and histogram methods do not consider the spatial aliasing [13] that may arise in multiple sound source localization.

According to the spatial sampling theorem, the intermicrophone distance should be smaller than half the minimum signal wavelength (e.g., about 2 cm for a sampling rate of 16 kHz). When the distance exceeds half the wavelength, it can be regarded as a wide distance, and spatial aliasing generated by the high-frequency part of the signal will occur. The wider the distance, the more serious the aliasing, and an excessively wide distance significantly degrades sound source localization. To obtain a high resolution in the time difference of arrival (TDOA) space, however, the microphones are often arranged with a distance far exceeding half the wavelength, especially when only a few microphones are available, so the influence of aliasing is inevitable [14]. Hence, we focus on the angular spectrum method because of its applicability to a wide intermicrophone distance [10, 15]. It consists of two steps: (i) spectrum construction and (ii) localization and counting.
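
To make the aliasing condition concrete, the following small numerical sketch (not taken from the paper; it assumes a sound speed of about 343 m/s and a 16 kHz sampling rate) tabulates the maximum unambiguous frequency c/(2d) for several spacings and flags when it falls below the signal's Nyquist frequency.

```python
# Illustrative sketch: relation between intermicrophone distance d and the
# maximum unambiguous frequency f_u = c / (2 * d). The sound speed and
# sampling rate below are assumed values, not figures from the paper.
c = 343.0           # speed of sound in m/s (assumed)
fs = 16000.0        # sampling rate in Hz (assumed)
nyquist = fs / 2.0  # highest signal frequency after sampling

for d_cm in (2, 4, 8, 16, 32):
    d = d_cm / 100.0
    f_u = c / (2.0 * d)        # maximum spatially unambiguous frequency
    aliased = f_u < nyquist    # True once the spacing counts as "wide"
    print(f"d = {d_cm:2d} cm -> f_u = {f_u / 1000.0:5.2f} kHz, aliasing above f_u: {aliased}")
```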

In the first step, the angular spectrum is constructed by accumulating, in each TF bin, a local function evaluated over all possible angles. Classical generalized cross-correlation (GCC) [16] computes the cross-power spectrum phase [17] of the pairwise microphone signals. Because of the limitations of the ideal single-source propagation model [18], the GCC spectrum is severely distorted in environments where reverberation and spatial aliasing coexist. Derived from the generalized state coherence transform (GSCT) [19, 20], the approximate kernel density estimator (KDE) using a nonlinear Gaussian kernel [20, 21] yields a significantly improved angular spectrum compared with GCC in reverberant environments. The frequency-dependent weighting factor of the Gaussian kernel can suppress spatial aliasing to a certain extent, but the effect is limited [14].

Existing sound source localization methods often regard the number of sound sources as prior information [7, 22]. However, during a mixed speech utterance, some sources may only be active for a short period of time and the number of sources often changes dynamically [23], which makes it difficult to determine the number of sources in advance.

Therefore, the second step requires the active sources to be located and counted simultaneously [8] from the constructed spectrum. Current source localization and counting methods mainly rely on iterative search, using, for example, the single-point peak amplitude in peak search (PS) [24, 25], the inner product in matching pursuit (MP) [26, 27], or the source contribution rate in iterative contribution removal (ICR) [15]. Each iteration selects the optimal value satisfying the corresponding condition, removes the components of the current sound source from the spectrum, and starts the next iteration. MP and ICR are more accurate than PS when the spectrum is not obviously distorted. However, for a high-resolution spectrum, the computational cost of the inner product and the contribution rate is much higher, which is not conducive to tracking the number of sources in real time. When the spectrum is seriously distorted by reverberation and spatial aliasing, an incorrect estimate in the current iteration deteriorates the reconstructed spectrum, which may considerably affect the following iterations [28].

In this paper, to effectively suppress spatial aliasing when a wide intermicrophone distance is used, an unambiguous angular spectrum-based algorithm is proposed for simultaneously localizing and counting multiple active sound sources. In the spectrum construction step, the local KDE spectrum in each TF bin of the pairwise microphone signals is generated from a Gaussian kernel function; a multistage (MS) structure [14] then divides the entire band into several sub-bands according to the maximum unambiguous frequency. The sub-band spectra are pooled together across all the TF bins to suppress the influence of spatial aliasing. In the localization and counting step, PS is used for its simplicity and real-time processing. An updated cut-off threshold that follows the current peak amplitude replaces the fixed threshold of traditional PS, which strengthens the flexibility of counting. The preliminary estimates are then passed through a source merging module [15] to eliminate duplicate sound sources.

The remainder of this paper is organized as follows: in Section 2, the KDE spectrum is constructed on the basis of the signal propagation model of multiple sound sources. In Section 3, the MS structure is used to construct the KDEMS spectrum, which is then combined with PS to form the complete localization and counting process. Numerical comparisons between the proposed KDEMS-PS and classical methods are given in Section 4. Finally, Section 5 concludes the paper.

2. Signal Propagation Model of Multiple Sound Sources and KDE Spectrum Construction

Consider a number of sound source signals at the transmitting end and the signals observed by the microphones at the receiving end, sampled at a fixed rate. In the approximate far field, the discrete-time signal propagation model of multiple sound sources can then be written as in equation (1): each observed signal is the sum, over all sources, of the source signal convolved with a finite-length impulse response between that source and the microphone, plus additive white Gaussian noise that is independent of the source signals and the impulse responses.

Applying the short-time Fourier transform (STFT), the corresponding expression in the discrete TF domain is obtained as equation (2), in which the frame and frequency indices identify each TF bin, the STFT coefficients correspond to the observed and source signals, a complex vector corresponds to the noise, and the transfer function between each source and each microphone comprises a direct-wave component and a reverberation component. Since the impulse responses are time-invariant, the transfer functions depend only on frequency.

When the intensity of the reflected waves is uniformly distributed over all possible propagation directions, the reverberation component can be regarded as spatially diffuse noise [29, 30]. The direct-wave component can be decomposed into a propagation attenuation and a propagation time at each frequency. Under anechoic, noise-free, and WDO conditions [5], only the direct-wave component remains and at most one sound source dominates the energy of each TF bin; equation (2) can then be simplified so that only the dominant sound source of the current TF bin appears.

Considering one pair of microphones in the array, the normalized cross-power spectrum (NCS) of the observed signals can be expressed as in equation (6) [21], whose phase is determined by the true TDOA of the dominant sound source between the two microphones. Without loss of generality, the TF-bin subscript can be omitted, and the TDOA can be estimated by regarding it as a random variable that follows a given probability distribution. A Gaussian kernel function is then introduced as in equation (7) [20], whose argument is the Euclidean distance between the NCS and the phase term associated with a candidate TDOA, and whose bandwidth is given by equation (8) in terms of the maximum possible TDOA, determined by the maximum spacing between adjacent microphones across the given pairs in the array and the velocity of sound propagation, and a factor that controls the resolution in the TDOA space. If this factor is set too large, the kernel function may generate many burrs, which can distort the spectrum; if it is set too small, the main lobe becomes too wide, which is not conducive to locating the sources correctly. The factor is set to 20 empirically. Substituting equation (8) into equation (7), the local KDE spectrum in each TF bin is obtained.
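
As a rough illustration of the local spectrum described above, the sketch below evaluates a Gaussian kernel of the Euclidean distance between the observed NCS value and the phase term of each candidate TDOA. The function name, the exact kernel argument, and the bandwidth handling are assumptions for illustration; they are not the paper's exact equations (7)-(8).

```python
import numpy as np

def local_kde_spectrum(ncs, f_hz, tau_grid, sigma):
    """Local KDE-style angular spectrum for one TF bin (illustrative only).

    ncs      : complex NCS value of the microphone pair at this TF bin
    f_hz     : center frequency of the bin in Hz
    tau_grid : candidate TDOAs in seconds (1-D array)
    sigma    : kernel bandwidth (assumed; the paper ties it to the maximum
               possible TDOA and a resolution factor)
    """
    # Distance between the observed phase term and the phase term predicted
    # by each candidate TDOA, fed through a Gaussian kernel.
    predicted = np.exp(-2j * np.pi * f_hz * tau_grid)
    dist = np.abs(ncs - predicted)
    return np.exp(-dist**2 / (2.0 * sigma**2))

# Example: a 1 kHz bin whose NCS corresponds to a true TDOA of 0.1 ms
tau_grid = np.linspace(-2.5e-4, 2.5e-4, 101)
spec = local_kde_spectrum(np.exp(-2j * np.pi * 1000.0 * 1e-4), 1000.0, tau_grid, 0.05)
print(tau_grid[np.argmax(spec)])   # ~1e-4 s
```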

Pooling the local spectra across all the TF bins [10, 21], the total KDE spectrum can be constructed as in equation (10), where each narrow-band KDE spectrum corresponds to one frequency bin after pooling the local spectra across all time frames. The estimated TDOAs of the sound sources can then be obtained from the peaks of the total spectrum.

The kernel contains a frequency-dependent weighting factor, which means that the higher the frequency, the more the amplitude of the local spectrum is suppressed. Spatial aliasing, which is mainly caused by the high-frequency part, is thus weakened to a certain degree, but it cannot be completely eliminated [14], especially in a strongly reverberant environment where the spectrum is severely distorted. Hence, a multistage structure is used in the following to suppress the influence of spatial aliasing more effectively.

3.1. Spectrum Construction with Multistage Structure

When calculating equation (11), the indices of the lowest and highest frequency bins are denoted by $k_L$ and $k_H$, respectively, and the corresponding center frequencies by $f_L$ and $f_H$.

According to the spatial Nyquist sampling theorem, the unambiguous condition can be expressed as $d < \lambda_{\min}/2$ (equation (12)), where $d$ is the distance between the two microphones of the pair and $\lambda_{\min}$ represents the minimum signal wavelength corresponding to $f_H$, which can be written as $\lambda_{\min} = c/f_H$ (equation (13)) with $c$ the velocity of sound propagation.

Substituting equation (13) into equation (12), the condition can be rewritten as $f_H < f_u = c/(2d)$ (equation (14)), where $f_u$ denotes the maximum unambiguous frequency. If equation (14) is fulfilled, no spatial aliasing exists. However, to produce a higher-resolution angular spectrum that can distinguish different sound sources, it is necessary to use a wider $d$, which inevitably makes $f_H$ exceed $f_u$.

If $f_H > f_u$, the entire frequency band can be divided into the two sub-bands $[f_L, f_u]$ and $(f_u, f_H]$, where a single sub-band may contain one or more consecutive frequency bins. If $f_H > 2f_u$, the second sub-band can be further divided into $(f_u, 2f_u]$ and $(2f_u, f_H]$. By analogy, if $f_H > 2^{i-1}f_u$ with $i \ge 2$, the sub-bands created at the $i$-th stage are $(2^{i-2}f_u, 2^{i-1}f_u]$ and $(2^{i-1}f_u, f_H]$. The entire band is finally divided into a total of $I = \lceil \log_2(f_H/f_u) \rceil + 1$ sub-bands (equation (15)), where $\lceil\cdot\rceil$ denotes the round-up operator.
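
The sketch below illustrates one reading of this sub-band division rule: the band $[f_L, f_H]$ is split at $f_u$, $2f_u$, $4f_u$, and so on until $f_H$ is reached. The function and the example spacing are illustrative assumptions, not values from the paper.

```python
import math

def split_subbands(f_low, f_high, f_u):
    """Divide [f_low, f_high] into sub-bands bounded by f_u, 2*f_u, 4*f_u, ...

    Illustrative reading of the multistage rule: the first sub-band is the
    alias-free part up to f_u; each further stage doubles the upper edge.
    """
    edges = [f_low]
    edge = f_u
    while edge < f_high:
        edges.append(edge)
        edge *= 2.0
    edges.append(f_high)
    return list(zip(edges[:-1], edges[1:]))

# Example: a 0.2-4 kHz band with an 8 cm spacing, f_u = 343 / (2 * 0.08) Hz
bands = split_subbands(200.0, 4000.0, 343.0 / (2 * 0.08))
print(len(bands), [(round(lo), round(hi)) for lo, hi in bands])
# Consistent with I = ceil(log2(f_H / f_u)) + 1 sub-bands
print(math.ceil(math.log2(4000.0 / (343.0 / 0.16))) + 1)
```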

Then, by pooling across all the frequency bins contained in each sub-band, the sub-band KDE spectrum is obtained as in equation (16), where the index of the lowest frequency bin contained in the $i$-th sub-band is given by equation (17) and the index of the highest frequency bin by equation (18).

In equations (17) and (18), $k_u$ is the index corresponding to $f_u$, which is obtained as in equation (19), where $\lfloor\cdot\rfloor$ is the round-down operator. For the first sub-band the frequencies do not exceed $f_u$, so there is no interference from spatial aliasing; for $i \ge 2$, equation (14) is no longer fulfilled, and the $i$-th sub-band spectrum has at most one more false peak than the previous sub-band.

The MS structure is then used for sub-band processing, and the output of the first stage is simply the first, alias-free sub-band spectrum.

Multiplying this output by the second sub-band spectrum gives the output of the second stage, in which the false peaks of the second sub-band spectrum are suppressed by the first-stage output. The weighted output is then passed to the next stage. By analogy, the output of the $I$-th stage, that is, the KDEMS spectrum, can be deduced as in equation (22).
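
The following sketch shows the stage-by-stage combination as described: the running output is multiplied element-wise by the next sub-band spectrum, so only TDOAs supported by every sub-band keep a large amplitude. This is an illustrative reading of the MS rule, not the paper's exact equation (22).

```python
import numpy as np

def multistage_combine(subband_spectra):
    """Combine sub-band angular spectra stage by stage (illustrative).

    subband_spectra : list of 1-D arrays sampled on the same TDOA grid,
                      ordered from the lowest (alias-free) sub-band upwards.
    """
    out = np.asarray(subband_spectra[0], dtype=float)
    for spectrum in subband_spectra[1:]:
        # The previous output weights the next sub-band, suppressing peaks
        # that the alias-free lower band does not support.
        out = out * np.asarray(spectrum, dtype=float)
    return out

# Example: a false peak present only in the upper sub-band is attenuated
low  = np.array([0.1, 1.0, 0.1, 0.2, 0.1])
high = np.array([0.1, 0.9, 0.1, 1.0, 0.1])
print(multistage_combine([low, high]))   # the peak at index 3 is suppressed
```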

The spectrum of a lower sub-band has a wider main lobe and lower resolution but is less affected by spatial aliasing, whereas the spectrum of a higher sub-band has a narrower main lobe and higher resolution but is subject to spatial aliasing. Combining the two through the MS structure retains the advantages of each while compensating for their shortcomings, yielding a final output spectrum with both strong ambiguity suppression and high resolution.

Comparing equations (11) and (22), it can be seen that the MS structure does not increase the computational complexity: the total numbers of operations for KDE and KDEMS can both be approximated by the same expression, stated in terms of the number of operations required for one kernel evaluation and the number of samples contained in the TDOA space.

3.2. Multiple Sound Source Localization and Counting by Peak Search

Once the angular spectrum has been constructed, an improved PS is used to realize multiple sound source localization and counting. Without loss of generality, and to facilitate the subsequent processing, the spectrum is min-max normalized as in equation (24), using its minimum and maximum values, where the spectrum may be constructed by KDE, KDEMS, or another angular spectrum-based method (e.g., GCC). The sequence containing all the peaks of the normalized spectrum can then be written as in equation (25), whose entries are the TDOAs of the peaks on the sampled TDOA grid and whose length is the total number of peaks. The elements of equation (25) are sorted in descending order of peak amplitude. When the number of sound sources is known and does not exceed the number of peaks, only the first elements of the sequence need to be extracted to form the set of estimated TDOAs. In most real cases, however, the number of sources is unknown. Hence, it is necessary to count the number of sound sources to obtain effective localization results.
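
A minimal sketch of the normalization and peak extraction just described is given below; the simple three-point peak detector is an assumption used for illustration rather than the paper's exact procedure.

```python
import numpy as np

def normalize_spectrum(spec):
    """Min-max normalize an angular spectrum to [0, 1]."""
    spec = np.asarray(spec, dtype=float)
    return (spec - spec.min()) / (spec.max() - spec.min())

def sorted_peaks(spec, tau_grid):
    """Return (peak TDOAs, peak amplitudes) in descending amplitude order.

    A peak is simply a sample larger than both neighbours; this stands in
    for the peak search over the sampled TDOA grid.
    """
    spec = np.asarray(spec, dtype=float)
    idx = np.where((spec[1:-1] > spec[:-2]) & (spec[1:-1] > spec[2:]))[0] + 1
    order = idx[np.argsort(spec[idx])[::-1]]
    return np.asarray(tau_grid)[order], spec[order]

tau_grid = np.linspace(-1.0, 1.0, 9)
spec = normalize_spectrum([0.2, 0.9, 0.3, 0.2, 0.6, 0.1, 0.4, 0.2, 0.1])
print(sorted_peaks(spec, tau_grid))
```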

In traditional PS, a fixed cut-off threshold is compared with the elements of the sorted peak sequence one by one until an element is not greater than the threshold. If the threshold is set too large or too small, it leads to an excessive missed-alarm rate or false-alarm rate, respectively. The threshold is therefore made to vary with the peak amplitudes to improve the counting flexibility. It is generally assumed that there is at least one active sound source, so the first (largest) peak is always accepted. The amplitude of the first peak is then introduced into the threshold setting, and the threshold used in the $j$-th iteration is updated from the amplitudes of the peaks accepted so far, as in equation (26). The iteration continues as long as the current peak exceeds the updated threshold and the iteration index does not exceed the maximum number of iterations (equation (27)); when either condition is no longer fulfilled, the iteration stops and the preliminary set of estimated TDOAs is obtained.
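
One plausible reading of this updated-threshold rule is sketched below: the largest peak is always accepted, and each subsequent peak must exceed a fixed fraction of the most recently accepted amplitude, with the number of examined peaks capped. The update rule, the parameter names, and the defaults are assumptions for illustration, not the paper's exact equations (26)-(27).

```python
def count_sources(peak_taus, peak_amps, xi=0.35, j_max=10):
    """Accept sorted peaks until one falls below an adaptively updated cut-off.

    peak_taus, peak_amps : peaks sorted in descending amplitude order
    xi                   : relative cut-off factor (assumed role)
    j_max                : maximum number of peaks examined (assumed role)
    """
    accepted = [peak_taus[0]]         # at least one active source is assumed
    threshold = xi * peak_amps[0]     # cut-off tied to the current peak amplitude
    for tau, amp in zip(peak_taus[1:j_max], peak_amps[1:j_max]):
        if amp <= threshold:
            break
        accepted.append(tau)
        threshold = xi * amp          # update the cut-off with the accepted peak
    return accepted

print(count_sources([-0.75, 0.0, 0.5], [1.0, 0.625, 0.1]))   # -> [-0.75, 0.0]
```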

Due to the interference of reverberation and noise, there may be several peaks with similar amplitudes near the peak corresponding to the true TDOA in a distorted angular spectrum, which leads to repeated estimation of the same sound source and thus to undesirable counting results. To address this problem, the duplicate estimates must be merged. Source merging is implemented as follows: for any two estimated TDOAs, the condition of equation (29) compares their distance with the allowed minimum distance between the estimated TDOAs of two different sound sources, which is derived from a minimum angular distance set empirically for the DOA interval considered [15]. When equation (29) is fulfilled, the estimate with the smaller peak amplitude is reassigned the value of the estimate with the larger amplitude, according to equation (31).
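
A minimal sketch of the merging step follows: estimates are visited in descending amplitude order and an estimate is dropped when it lies within the minimum allowed TDOA distance of one already kept, which has the same effect as reassigning the weaker duplicate to the stronger one. The function and its arguments are illustrative assumptions.

```python
import numpy as np

def merge_sources(taus, amps, tau_min):
    """Merge TDOA estimates closer than tau_min, keeping the stronger peak."""
    order = np.argsort(amps)[::-1]            # strongest estimates first
    kept = []
    for i in order:
        if all(abs(taus[i] - taus[j]) >= tau_min for j in kept):
            kept.append(i)
    return [taus[i] for i in sorted(kept)]

# Example: two estimates 0.02 ms apart collapse into the stronger one
print(merge_sources([1.0e-4, 1.2e-4, -0.8e-4], [1.0, 0.7, 0.9], tau_min=5e-5))
```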

After the reassignment, all estimated TDOAs sharing the same value are merged into a single one. The final localization and counting results are then the set of distinct estimated TDOAs, whose elements are the estimated TDOAs of the individual sound sources obtained by PS with the source merging module and whose cardinality is the estimated number of sound sources.

A complete localization and counting method is obtained by combining the constructed angular spectrum with PS, where the combination is marked with "-" in this paper. Hence, when the angular spectrum is generated by GCC, KDE, or KDEMS, the corresponding localization and counting method is called GCC-PS, KDE-PS, or KDEMS-PS, respectively. The block diagram of the proposed KDEMS-PS is shown in Figure 1, and the steps can be summarized as follows:
(i) STFT: Transform the observed signals of one pair of microphones into the corresponding STFT coefficients in each TF bin.
(ii) Narrowband KDE spectrum calculation: Calculate the NCS from equation (6), and then pool the local KDE spectra of equation (10) across all time frames at each frequency bin to obtain the narrowband KDE spectra.
(iii) Multistage processing: Determine the number of sub-bands and the index corresponding to the maximum unambiguous frequency from equations (15) and (19), respectively, and then obtain the sub-band KDE spectra from equation (16).
(iv) Pooling and normalization: Pool all the sub-band spectra to construct the KDEMS angular spectrum from equation (22), and then obtain its normalized form from equation (24).
(v) Peak search: Search the peaks of the normalized spectrum, sort them in descending order of amplitude, and compare them with the updated cut-off threshold of equation (26) to obtain the preliminary set of estimated TDOAs.
(vi) Source merging: Merge the repeatedly estimated sound sources according to equation (31) to obtain the final localization and counting results.

4. Numerical Analysis

In this section, the KDEMS spectrum is first compared with the classic GCC and KDE spectra. The localization and counting performance of the combined GCC-PS, KDE-PS, and KDEMS-PS methods is then discussed.

The sound sources are taken from 16 clean utterances, spoken by 8 male and 8 female speakers, in the TIMIT database [31]. Each speech segment lasts 2 seconds and is preprocessed to have the same average power. The room geometry is shown in the plan-view diagram of Figure 2. A pair of omnidirectional microphones with spacing $d$ parallel to one axis of the room is located at the center of the room, where $d$ is set to an integer multiple of a base spacing. The sound sources are distributed on a semicircle centered at the midpoint of the microphone pair, where the direction of arrival (DOA) of each sound source is defined in an anticlockwise manner, with the direction perpendicular to the line connecting the two microphones taken as the reference. The true TDOA then follows from the DOA, the spacing $d$, the sound velocity, and the fixed microphone-source distance. In Figure 2, the height coordinates of the sound sources and the microphones are all set to the same value.
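
For reference, the usual far-field DOA-to-TDOA relation for one microphone pair is sketched below, assuming that 90 degrees corresponds to the broadside direction; the sound speed and the spacing in the example are assumed values, not the paper's settings.

```python
import numpy as np

def doa_to_tdoa(theta_deg, d, c=343.0):
    """Far-field DOA-to-TDOA conversion for one microphone pair.

    Assumes tau = d * cos(theta) / c with theta measured so that 90 degrees
    is broadside (perpendicular to the line connecting the microphones).
    """
    return d * np.cos(np.deg2rad(theta_deg)) / c

for theta in (60.0, 90.0, 120.0):
    print(f"theta = {theta:5.1f} deg -> tau = {doa_to_tdoa(theta, 0.08) * 1e3:+.3f} ms")
```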

The reverberation time and the number of sound sources are varied over the considered ranges, and 200 simulations are performed for each scenario. In each simulation, the sound sources are randomly selected from the 16 candidate segments, convolved with room impulse responses (RIRs) generated by the image-source model [32, 33], and mixed with additive white Gaussian noise at a fixed signal-to-noise ratio (SNR), defined from the average powers of the noise-free observed signals at the two microphones and the average power of the additive noise at the receiving end. In the simulation scenarios, multiple sound sources with equal angular intervals are used; the true DOA distributions for 2, 4, and 6 sources are listed in Table 1.

The observed signals are then passed through bandpass filters between 0.2 kHz and 4 kHz [14] and transformed into TF coefficients with a 1024-point Hamming-weighted STFT with a fixed frame shift. Thirty consecutive frames are extracted to pool the spectrum. The parameters are chosen empirically for the current environment: in the Gaussian kernel function used by KDE and KDEMS, the resolution factor is set to 20; the TDOA space is sampled uniformly; and in the localization and counting step, the cut-off threshold is set to 0.35 and the maximum number of iterations to 10.
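
The sketch below reproduces this front-end processing approximately, using scipy as a stand-in; the 50% frame shift and the random test signal are assumptions, since the exact shift is not restated here.

```python
import numpy as np
from scipy.signal import stft

fs = 16000                       # sampling rate assumed for the speech material
x = np.random.randn(2 * fs)      # stand-in for a 2 s observed microphone signal

# 1024-point Hamming-weighted STFT; a 50% frame shift is assumed here.
f, t, X = stft(x, fs=fs, window='hamming', nperseg=1024, noverlap=512)

# Keep the 0.2-4 kHz band used for spectrum construction and pool the
# angular spectrum over 30 consecutive frames.
band = (f >= 200.0) & (f <= 4000.0)
X_band = X[band, :30]
print(X_band.shape)              # (frequency bins in band, 30 frames)
```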

For a representative scenario with two active sound sources and a given reverberation time, the normalized spectra generated by GCC, KDE, and KDEMS with four increasing values of $d$ are shown in Figure 3. In each subfigure, the three curves colored red, green, and blue correspond to GCC, KDE, and KDEMS, respectively. For intuitive and uniform graphic presentation, a weighting factor is introduced to give a unified abscissa for the different values of $d$. The true TDOAs of the two sound sources are marked with two vertical dashed lines, one of which lies at 0.5 on this normalized axis.

In Figure 3(a), since equation (12) is fulfilled for the narrowest spacing, there is no spatial aliasing, and the KDEMS spectrum coincides with KDE. However, the low resolution produced by the narrow $d$ makes the peaks too wide to discriminate well between the sound sources in the TDOA space. GCC shows only one peak, clearly deviated from the true TDOAs, which makes a correct estimation difficult.

In Figure 3(b), when $d$ is widened, the resolution in the TDOA space improves and all three spectra form two peaks with the two highest amplitudes near the true TDOAs. However, the main lobe corresponding to each sound source is still too wide and a certain degree of deviation remains.

In Figure 3(c), when $d$ is widened further, sharper peaks close to the true TDOAs make the deviation smaller than in Figure 3(b). However, the spatial aliasing introduced at the same time causes false peaks of noticeable amplitude to appear in GCC, for example around 0.2 and 0.9 on the normalized axis. Owing to the suppression of the high-frequency components by the frequency-dependent weighting of the Gaussian kernel, the amplitudes of the false peaks in KDE are lower than in GCC, and with the further MS processing the false peaks are almost completely suppressed in KDEMS.

In Figure 3(d), with the widest spacing, the number of false peaks in GCC is obviously larger than in KDE and KDEMS, and the overall amplitudes of the false peaks become higher than in Figure 3(c). One false-peak amplitude in GCC is almost equal to the amplitude at 0.5 corresponding to the true TDOA, which easily produces an incorrect estimation. KDE and KDEMS, however, maintain sharp peaks near the true TDOAs and still suppress the false peaks well, and thus discriminate the sources better than GCC.

The normalized spectra for a longer reverberation time are shown in Figure 4. The stronger reverberation makes the spectrum distortion more severe. In Figure 4(a), GCC, KDE, and KDEMS still suffer from low resolution. In Figure 4(b), none of the three spectra can estimate the TDOAs effectively because of the deviation. Compared with Figure 3(c), the amplitudes of the false peaks are higher in Figure 4(c), where the false-peak amplitude near 0.2 in GCC is almost equal to the amplitude at 0.5. In Figure 4(d), the amplitudes of the false peaks at 0.3 and 0.7 exceed the amplitude at 0.5 in GCC, and the amplitude at 0.7 is close to that at 0.5 in KDE, whereas the KDEMS spectrum remains relatively flat away from the true TDOAs, which shows that KDEMS is more robust than GCC and KDE under strong reverberation.

After the spectrum analysis of a single simulation, the source localization and counting performance of GCC-PS, KDE-PS, and KDEMS-PS is evaluated in terms of three measures: the recall rate $R$ and the precision rate $P$, which quantify missed detections and false alarms, respectively, and the F-score $F$, which combines $R$ and $P$. Denoting the number of correctly estimated sound sources by $N_c$, the true number of sources by $N$, and the estimated number by $\hat{N}$, the measures are $R = N_c/N$, $P = N_c/\hat{N}$, and $F = 2PR/(P + R)$. An estimate is counted as correct when the DOA corresponding to the estimated TDOA deviates from the true DOA by no more than a given angular tolerance.
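
The three measures can be computed per trial as in the sketch below; the matching tolerance and the greedy one-to-one matching are assumptions used for illustration.

```python
def localization_scores(est_doas, true_doas, tol_deg=5.0):
    """Recall, precision, and F-score for one trial (illustrative).

    An estimate counts as correct when it lies within tol_deg of a not yet
    matched true DOA; each true DOA can be matched at most once.
    """
    remaining = list(true_doas)
    correct = 0
    for est in est_doas:
        hit = next((t for t in remaining if abs(est - t) <= tol_deg), None)
        if hit is not None:
            remaining.remove(hit)
            correct += 1
    recall = correct / len(true_doas)
    precision = correct / len(est_doas) if est_doas else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall > 0 else 0.0)
    return recall, precision, f_score

print(localization_scores([59.0, 118.0, 30.0], [60.0, 120.0]))  # (1.0, 0.667, 0.8)
```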

With $d$ fixed, the average values of these three measures versus SNR for two sound sources and for a larger number of sound sources are shown in Figures 5 and 6, respectively; each measure reaches its best value at 1 and its worst at 0. As the SNR decreases from 30 to −10 dB, the measures of GCC-PS, KDE-PS, and KDEMS-PS gradually decline. In Figure 5(a), since only two sound sources are used, the difference in recall between the three methods is not very obvious. However, the false peaks in the GCC and KDE spectra produce more false alarms than KDEMS, and with decreasing SNR the raised spectrum floor further aggravates the influence of the false peaks, so the precision of KDEMS-PS is significantly better than that of the other two methods. The behavior of the F-score is similar to that of the precision.

Because of the increased number of sound sources, missed detections become more frequent, so the recall of KDEMS-PS is significantly better than that of the other two methods in Figure 6(a). False alarms still exist in Figure 6(b) but are alleviated to a certain extent by the added sources; the precision of KDEMS-PS is still the best of the three methods. The F-score behaves similarly.

It can be seen from Figures 5 and 6 that when the number of sound sources is small, the precision better distinguishes the performance of the different methods, whereas when the number of sound sources is large, the recall is more discriminative; the F-score, as the harmonic mean of precision and recall, evaluates the performance comprehensively. Hence, without loss of generality, the F-score is used to compare the different methods. With the SNR set to 20 dB, the average F-score versus $d$ for two reverberation conditions is shown in Figures 7 and 8, respectively.

In Figure 7(a), when $d$ is small, the deviation caused by the low resolution keeps the number of correct estimates small, giving a low F-score, which then gradually rises as $d$ widens. Beyond a certain spacing the growth disappears, and GCC-PS and KDE-PS begin to decline gradually; this is due to the increase in the amplitude of the false peaks caused by spatial aliasing, which increases the number of false alarms. Thanks to the MS processing, however, the F-score of KDEMS-PS is much higher than that of the other two methods and stabilizes around 0.9 with no obvious downward trend as $d$ keeps increasing, which demonstrates its capability to effectively suppress the influence of spatial aliasing.

The overall F-score declines to a certain extent as the number of sound sources increases. For wide spacings, KDEMS-PS maintains an F-score of around 0.8 and 0.6 in Figures 7(b) and 7(c), respectively, without an obvious decline, as in Figure 7(a). This is because a larger number of sources requires a higher resolution to distinguish them, so an appropriately increased $d$ brings a higher F-score.

The trend in Figure 8 is similar to that in Figure 7. The stronger reverberation reduces the number of correct estimates on the one hand and increases the number of false alarms on the other, so the F-scores of all the methods decline to a certain degree. The overall performance is still KDEMS-PS > KDE-PS > GCC-PS. For wide spacings, KDEMS-PS still maintains F-scores above 0.5 and 0.4 for the larger numbers of sound sources, much higher than KDE-PS and GCC-PS, which shows the robustness of the method under strong reverberation and interference between different sound sources.

5. Conclusion

To correctly locate sound sources in a real scene, their number needs to be estimated simultaneously. Based on the kernel density estimator, a multiple sound source localization and counting method called KDEMS-PS is proposed in this paper. By dividing the entire frequency band into several sub-bands and processing them stage by stage, the KDEMS angular spectrum is constructed. PS with an updated threshold and a source merging module is then combined with it to locate and count multiple sound sources. As shown in the computer simulations using spectrum analysis and F-score comparison, the KDEMS spectrum can effectively weaken the interference caused by false peaks, and KDEMS-PS is a robust multiple sound source localization and counting method with good spatial aliasing suppression when using a wide intermicrophone distance. In the experiments, we found that the MS structure is mainly suited to sound sources with more low-frequency energy and more uniform spectral coverage (e.g., speech). When the spectral components are concentrated in a higher frequency range (e.g., bird sound), the sub-band processing may face the problem that the unambiguous low-frequency spectrum has very weak energy, which may degrade localization performance due to insufficient peak energy at the source location. Whether spectrum enhancement techniques or other unambiguous methods can extend the scope of application is a question worthy of further study. Theoretically, the angular spectrum-based method built on pairwise microphones places no additional restrictions on the number of microphones or the array topology. In this paper, the applied microphone array consists of only two microphones in order to focus on the improvement brought by the MS spectrum. More microphone pairs with different topologies (e.g., planar, spherical) can be added for specific applications in future research.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported by The National Natural Science Foundation of China (Nos. 61171167 and 61401203) and The Scientific Research Foundation of Jinling Institute of Technology (No. JIT-040520400101).