Abstract
In this paper, a method to solve the localization of concurrent multiple acoustic sources in large open spaces is presented. The problem of the multisource localization in farfield conditions is to correctly associate the direction of arrival (DOA) estimated by a network array system to the same source. The use of systems implementing a Bayesian filter is a traditional approach to address the problem of localization in multisource acoustic scenario. However, in a real noisy open space the acoustic sources are often discontinuous with numerous shortduration events and thus the filtering methods may have difficulty to track the multiple sources. Incident signal power comparison (ISPC) is proposed to compute DOAs association. ISPC is based on identifying the incident signal power (ISP) of the sources on a microphone array using beamforming methods and comparing the ISP between different arrays using spectral distance (SD) measurement techniques. This method solves the ambiguities, due to the presence of simultaneous sources, by identifying sounds through a minimization of an error criterion on SD measures of DOA combinations. The experimental results were conducted in an outdoor real noisy environment and the ISPC performance is reported using different beamforming techniques and SD functions.
1. Introduction
The sensory capacity to analyze acoustic space is a very important function of an auditory system. The need for the development of an understanding of the sound environment has attracted many researchers over the past twenty years to build sensory systems that are capable of locating acoustic sources in space. Acoustic source localization (ASL) is an important task in a growing number of applications. Fields of application in which identification of the location of acoustic sources is desired include audio surveillance, teleconferencing systems, handsfree acquisition in car, system monitoring, humanmachine interaction, musical control interfaces, videogames, virtual reality systems, voice recognition, fault analysis of machinery, autonomous robots, processors for digital hearing aids, highquality recording, multiparty telecommunications, dictation systems, and acoustic scene analysis. The aim of an ASL system is to estimate the position of sound sources in space by analyzing the sound field with a microphone array, a set of microphones arranged to capture the spatial information of sound.
Several application areas that may potentially provide advantages in using the acoustic location have led to the development of many signal processing algorithms, which mostly consider the specific acoustic environment, the signal properties, and the localization goal.
ASL can be performed by two basic methods: indirect and direct. The indirect approach is used to estimate source positions by implementing the following two steps: in the first one, a set of time difference of arrivals (TDOAs) are estimated using measurements across various combinations of microphones, and in the second one, when the position of the sensors and the speed of sound are known, the source positions can be estimated using geometric considerations and approximate estimators: closedformed estimators based on a least squares solution [1–7] (for an overview on closedform estimators, see [8]) and iterative maximum likelihood estimators [9–15]. The direct approach involves the search space by constructing a spatial energy map and estimating, for each possible point of interest, the values that maximize a specific likelihood function that provides a coherent value from the entire system of arrays. The position of the sources can be estimated directly and spatial likelihood functions can be defined [16–20].
In nearfield conditions, since the sources radiate the sound in spherical waves, a hyperboloid describes all of the possible points of an acoustic source that generates the same TDOA to an array of two microphones. Indirect methods aim at estimating TDOAs for microphone pairs, typically using the generalized crosscorrelation (GCC) [21] and the adaptive eigenvalue decomposition (AED) [22] based on the blind system identification, which focuses on the impulse responses between the source and the microphones. The extension of the AED in the case of multiple microphones was proposed in [23], and it is efficiently performed with a normalized multichannel frequencydomain least mean square algorithm [24, 25]. However, the steered response power (SRP) is a direct method based on maximizing the power output of a beamformer. Beamforming is a combination of the delayed signals from each microphone in a manner in which an expected pattern of radiation is preferentially observed. In general, the SRP is computed in frequencydomain using the fast Fourier transformer on a signal portion, calculating the response power on each frequency bin, and subsequently fusing these estimates to obtain the final result. The conventional SRP is performed with the delay and sum beamformer [26]; it consists of the synchronization of signals that steer the array in a certain direction, and it sums the signals to estimate the power of the spatial filter. The SRP phase transform (SRPPHAT) [18] is a widely used filtered beamforming. PHAT filter [21] places equal importance on each frequency by dividing the spectrum by its magnitude. It normalizes the amplitude of the spectral density using only the phase information with the advantage to improve performance in case of moderate noise and reverberation. SRPPHAT is deeply used due to the fact that it can be efficiently computed by coherent summing the GCCPHAT from all of the microphone pairs for each possible point of interest. The highresolution SRP has been developed to improve the performance of the spatial filter, and the adaptive beamformer is called the minimum variance distortionless response (MVDR) due to Capon [27]. The multiple signal classification (MUSIC) algorithm is based on an eigen subspace decomposition method [28, 29], and the estimation of signal parameters via rotational invariance techniques (ESPRIT) is based on subspace decomposition exploiting the rotational invariance [30–32].
In farfield conditions, we are no longer able to detect the spherical wavefront in relationship with the distance of source from an array and the size of the array, and the wavefront is approximate to a plane. In this condition, with an array of microphones, we are able to estimate only the direction of arrival (DOA) of the source but not its distance from the array. In the farfield case the hyperboloid, which is the locus of points that generates the same TDOA to a microphone pair, can be approximated with the cone whose vertex is located at the midpoint of the array. Thus, we need a network of arrays to perform the localization of a source (at least two arrays for twodimensional space). Hence, the position estimation is computed by intersection of lines and by an approximated solution for overdetermined systems using the linear least squares method. In the case of an array containing microphones () the DOA estimation can be computed with the indirect method of multichannel crosscorrelation coefficient (MCCC) [33, 34], which is based on TDOAs estimation using GCC, and on the use of the spatial prediction error to measure the correlation among multiple signals. It has the advantage of using the redundant information between microphones to estimate the DOA in a more robust manner under a reverberant and noisy condition. The family of SRP direct methods with an array is used to estimate the DOAs of sources by picking the values corresponding to the principal peaks of the steered response power of a beamforming.
Recently, more sophisticated algorithms have been proposed for time delay estimation that use minimum entropy [35, 36] and blind source separation [37–39]. In [39], the authors demonstrate that the broadband independent component analysis methods are more robust against high background noise levels compared with the conventional GCCPHAT approach.
Both indirect and direct methods have been tested in many single source scenarios; however, in multiple sources cases, they require new considerations. Several works address the problem of multiple sources using a Bayesian approach based on the tracking of the sources and using Kalman filter [40–47] and Particle filter [19, 48–53]. Some studies consider an approach without tracking in reverberant environments [39, 54–57].
In a real open space, the traditional techniques based on Bayesian filters (Kalman and Particle filters) are difficult to apply for localization of concurrent multiple acoustic sources, because sources are often discontinuous with numerous shortduration events and the spatial resolution may be poor in some areas of analysis. Note that in practical applications the localization in a open space needs a reduced number of arrays, due to limited space for installing it and not to invade the monitoring spaces in an excessive way. Besides, methods based on movement tracking can fail in some specific situations: during the initialization phase of the filter, in the presence of sources with unpredictable trajectory (e.g., in the case of rapid changes of the velocity vector), and when two sources have intersecting trajectories.
As a solution to this problem, we present the approach based on the incident signal power comparison (ISPC). A preliminary work was proposed in [58, 59]. This paper describes a detailed stepbystep ISPC algorithm introducing a diagonal loading (DL) [60, 61] for MVDR beamforming, which gives more stable ISP estimation, and reporting new experimental results in a real scenario. ISPC is designed for a distributed array system, and it is based on source extraction and on a verification of similarity among sound sources. The first step consists of source extraction using beamforming techniques and estimation of the incident signal power (ISP) of every source captured on the array. The second step involves the comparison of the ISP spectrum from different arrays using a spectral distance (SD) measure. The ISP spectrum permits identification of sounds so that the spectrum power distance minimizes an error criterion. Therefore, the identification of the correct combination of DOAs is estimated by identifying the minor value of SD measures.
The location in a freefield outdoor environment can be employed for audio surveillance, sound monitoring, and analysis of acoustic scenes. In particular, Section 5 describes a prototype system for multiple source localization in a public space for monitoring a large area with a joint audiovideo system, in which the positional estimates by acoustic analysis are used to steer a videocamera consequently.
The paper is organized as follows. After presenting the signal model in Section 2, the multiple sources localization problem is described in Section 3. In Section 4 the ISPC algorithm is presented. Finally, Section 5 illustrates experimental results obtained in a realworld scenario.
2. Signal Model
We assume acoustic sources and arrays, each composed of microphones, and consider the omnidirectional characteristics of both the sources and the microphones. We will refer to the model of discretetime obtained by performing a sampling operation on the continuoustime signal with a uniform sampling period . A discretetime signal is expressed by where is the sample time index and is the sampling frequency. As usual, we will allow the sample period to remain implicit and refer to it simply as .
The freefield discretetime signal received by the th microphone of the th array can be modeled as where is the attenuation of the sound propagation (inversely proportional to the distance from source to microphone of array ), are the unknown uncorrelated source signals, is the propagation time from the unknown source to the reference sensor of array , is the TDOA of the signal between the th microphone and the reference of the th array, and is the additive noise signal at the sensor of array , assumed to be uncorrelated with not only all of the source signals but also with the noise observed at the other sensors.
In farfield case the relationship between TDOA and DOA can be solved easily with geometrical considerations. Therefore, for a generic pair of microphones with TDOA , DOA estimate is obtained as where is the speed of sound and the distance between microphones.
The vector for each source , considering the signal model (2), is defined by which contains the DOAs of the acoustic source by each array. In the case of sources and arrays, we can write the matrix , which contains all DOAs of distributed array network as The estimated DOAs angles, obtained for each array , are written with the following vector: where we consider the DOA values in ascending order (, etc.). Next, the estimated sorted DOAs matrix is defined as The position of the source can be calculated by combining the DOAs estimated by the arrays for that source.
3. Multiple Sources Localization
The multiple sources localization problem is to correctly assign the DOAs values to the source . In some applications, situations arise for which we cannot assign unambiguously TDOAs or DOAs to the same source. The example in Figure 1 shows the case of two sources with a configuration of two arrays for the 2D location. As we can see, the combination of incorrect DOAs leads to an incorrect position estimation. The two DOAs calculated by the two arrays can be combined following two different configurations: , ; , . The first configuration implies the correct localization of the sound sources, whereas the second leads to an incorrect localization of both the sources.
In general, the goal is to get the matrix to properly order the values of (7). Considering as the th DOA of array , the assignment of the correct value of the DOA for the unknown sources can be ambiguous; namely the exact position of the elements in the matrix of (6) cannot be uniquely determined: The possible combinations of the DOAs of matrix (7) are .
4. Incident Signal Power Comparison (ISPC)
Incident signal power comparison (ISPC) combines the DOAs from different arrays by considering the similarity criterion among acoustic sources. To check for this similarity, we can estimate for each array the ISP referring to all estimated DOAs using beamforming techniques. Once the ISPs are obtained, we can define an efficient error criterion for comparing the different possible combinations of the DOAs using a SD measure of ISPs pair between different arrays.
DOA estimation is a crucial step of ASL systems. In a freefield environment for farfield cases, it can be calculated by means of MCCC and SRP methods. After obtaining an estimation of the sorted DOAs matrix (the matrix of (7)), the steps of the ISPC algorithm are source extraction using beamforming techniques and estimation of ISPs for each DOA, ISPC using SD measurement between ISPs of different array, calculation of all DOAs combinations, and verification of the most consistent target combinations minimizing an error criterion on SD measurements. Finally, the localization of multiple sources can be computed by considering the estimated DOAs combination. Figure 2 illustrates the ISPC steps.
4.1. Incident Signal Power Estimation
The ISP is the power spectral density of the beamformer output that is steered to a specified direction. The SRP is based on maximizing the power output of a beamformer. Beamforming is a multichannel signal processing techniques that enhance the acoustic signals coming from a specific steered position, while reducing the signals coming from other directions. In the frequency domain, the output of a generic beamformer of th array in matrix notation can be written as where , and are the discrete Fourier transform of the signals, is the vector of the beamformer weights for steering and filtering the data on the direction , is the frequency bin index, and the superscript represents the Hermitian (complex conjugate) transpose.
The ISP of the beamformer output for a generic frequency is given by where is the crossspectral density matrix, which is square and symmetric, and denotes mathematical expectation. We consider a power spectrum calculated with , where and are the index values of a specific frequency range (FR), which defines the range interesting for the optimal performance of the ISPC algorithm. Note that beamformer pattern function is frequency dependent; then the main lobe narrows with increasing frequency and spatial aliasing can occur (for a comprehensive dissertation, refer to [62]).
Several beamforming techniques exist (a review can be found in [63]); however, the spatial filter methods that are used for comparing ISPC experimental results are the SRP based on delay and sum beamforming, the SRP with the DolphChebyshev window (SRPDC), and the MVDR with DL.
Hence, the ISP corresponding to delay and sum SRP can be written from (10) as where is the steering vector corresponding to direction . For a uniform linear array with microphone distance , the steering vector takes the form
The SRPDC is obtained from (11) introducing the DolphChebyshev window : where denotes elementwise multiplication.
The adaptive MVDR beamforming [27] is based on minimization problem of the following equation The aim of the MVDR filter is to minimize the noise and sources coming from different directions, while keeping a fixed gain on the desired direction. Solving (14) using the method of Lagrange multipliers, we can write In practical applications, the inverse of the crossspectral density matrix can be calculated using the MoorePenrose pseudoinverse, defined as where is the singular value decomposition of the matrix . Moreover, if the crossspectral density matrix is illconditioned, the spatial spectrum may not exist. Therefore, a DL [60, 61] method is adopted to calculate the inverse matrix in a stable way. The ISP with MVDR filter and DL becomes where is the identity matrix and is the loading level: where denotes the trace of the squared matrix and is the normalized loading constant. Typically, the values are , , [64].
Therefore, we can define the matrix containing all the ISPs related to the matrix (7): which has a dimension of , where the total number of ISPs is and .
4.2. Spectral Distance Estimation
To compare the ISPs of different arrays, spectral distance (SD) functions are used. Distance measures produce measurements of the dissimilarity of two sound spectra. We define the SD estimation between the and the of two DOAs of different arrays as where , and are the index labels of the array, , and are the index labels for the sorted DOAs of array, and is the SD function to measure the dissimilarity of spectra. We consider the four most common SD functions to verify how our system performance varies as a function of different equations. A classic spectral estimation method is linear prediction (LP) [65], for which we insert a negative one to standardize the minimum to zero as all functions The other functions are the ItakuraSaito (IS) distance measure [66] the root mean square (RMS) log [67] and the COSH measure [68] The total number of SD measures between all the ISPs pair of different arrays is .
4.3. DOA Combinations Calculation
Let us represent the sorted matrix of the DOAs using the graph theory to better understand the DOAs combinations calculation and the verification of the most consistent target combination minimizing an error criterion. Then, we can express the matrix (7) and all of its combinations as being composed of nodes and edges, connecting pairs of vertices. An example of three arrays and three sources is shown in Figure 3. Each row of the graph contains the sorted DOAs of an array: , , and . Each DOA is a node of graph and the edges represent the possible connections between nodes with the values , which is the estimated SD between the ISPs on array of DOA and on array of DOA . The combination of incorrect DOAs leads to an incorrect position estimation (see Figure 1). Thus, if we represent a combination of DOAs as a sum of values of the edges that connect the nodes, we expect that the minimum value of different sums corresponds to the correct combination. To calculate the possible combinations of DOAs between the arrays, it is helpful to introduce a matrix labeling of DOAs (7): in which the generic element is expressed as with and . The matrix label associates the position of the DOAs referring to the sorted matrix . Estimating the minimum error of an SD combination, we can obtain the matrix with the correct position of the DOAs, in which each column contains the DOAs of the source .
Furthermore, we can represent the graph representation of DOAs and the SDs as the adjacency matrix , which is an matrix of SD values. The entry in row () and column () is defined as an SD estimation if there is an edge connecting vertex and vertex in the graph, or it is defined as zero otherwise. The relationships between DOAs and SDs can be expressed by the following equation of the adjacency matrix element: The symmetric adjacency matrix results in the following equation:
These SD values are weights of the edges of the graph. An example of three arrays and three sources is presented in Figure 3; in this example, we have 27 total SD comparisons (3 for each source).
To calculate all possible combinations of DOAs, we can work on the label matrix . Considering that the first row of related to the first array remains unchanged, we can compute the combinations in two steps. In the first step, the permutations of the labels of for each row (for each array) are calculated. The number of permutations for each row is . Thus, we define the permutation matrix as where where is the vector of th permutation () of th array (), which contains the DOAs label of row . The matrix has a dimension of . In the second step, the combinations of column indices of matrix with value from 2 to give the possible combinations. We consider the combinations of groups, each one composed by elements (the permutation), assuming that one member (the index column of matrix ) from each of the groups is used in each combination and assuming that the order is not a distinguishing factor. We define a matrix of dimension , which stores the combinations of groups of column indices of matrix : The generic element with and can be calculated with the following equations: where and .
Hence, a combination label matrix of dimension is used to store the DOA label of all combinations: where is the vector, which contains the DOA labels of combination :
4.4. Minimum SD Measure Estimator
For each source, identified by nodes (the arrays), we have edges; then, the number of edges for a combination of DOAs is . The values of matrix are used to calculate the SD estimation of all combinations. Thus, we can define the SD estimation of the generic combination as the sum of the weights of all the edges: where , and . Accordingly, we define the SD vector of all combinations
Finally, the index of the minimum value of the vector identifies the target combination as and the DOAs matrix is estimated by ordering the label matrix with the combination .
4.5. Overall Procedure
The processing steps of the full ISPC algorithm are summarized in Algorithm 1. After the DOAs estimation and creation of the matrix defined by (7) the ISPC algorithm is applied if multiple sources are detected. In practice, the matrix does not always present all the DOA values. In these cases, the missing value of array can be represented with a zero value in the label matrix (25). The overall procedure is composed by the following steps: building of the label matrix (25) and calculation of ISPs and the matrix (19); estimation of the SD measurements between all ISP pairs of arrays and creation of the adjacency matrix (28); calculation of the permutations matrix (29) and the all DOA combination matrix (34); calculation of the vector (37) that contains the SD estimation for each DOAs combination and finding the minimum value of (38), for using the index value in the matrix to properly order the matrix and estimate the matrix .

5. Experimental Results
The experimental results were conducted in an outdoor real noisy environment and the ISPC performance is reported using different beamforming techniques (SRP, SRPDC, MVDR) and SD estimations (LP, IS, RMS, COSH). A prototype system for twodimensional localization has been installed on the roof of the building that houses the Computer Science Department in Udine University (Figure 4).
5.1. System Setup
The acoustic localization prototype includes two linear arrays, each composed of four omnidirectional microphones. Very small sized arrays are used because a real application of such systems would require that the public spaces are not invaded in an excessive way; therefore, there might not be enough space to install the arrays. The arrays are located 11.4 m apart at a height of 12.1 m above the plane. The sample rate of the digital system is 48 kHz, and the microphone distance is 25 cm. The system consists of two parallel processing lines, corresponding to the Array 1 and Array 2 (Figure 5).
The first processing step is the DOAs estimation. SRPPHAT is used for the DOAs estimation. The values corresponding to the principal peaks of the SRPPHAT function (in practice, those peaks which are above a given threshold) allow the DOAs estimation of the acoustic sources. The assumed DOA range is −90° +90°, where zero is in front of the array and the microphone reference is the first from left.
In the second step, the twodimensional coordinates of the source can be estimated by combining the DOAs at the arrays. If more than one source is identified, a beamformer and an SD comparison provide a guide to solve the problem of associating the DOAs of the Array 1 with those of the Array 2. The calculation of the twodimensional position of the source is a simple triangulation problem. However, we must consider that the two arrays are not coincidental with the plane of interest but are placed at a certain height. We must consider that the possible points identified by the DOA are located on a cone surface whose vertex is placed in the array and whose axis is the straight line joining the two arrays. Every array represents a cone: the intersection of the two cones is represented by a circumference. The intersection point between the circumference and the plane of interest is the estimation of the source distance from arrays (see Figure 6). Hence, we consider to be the distance of the arrays, to be the height of arrays above the plane of interest, and and to be the DOA estimated on the Array 1 and Array 2, and we obtain The spatial resolution of the system depends on the distance between the microphones, the distance between the arrays, and the sample frequency of digital system. Figure 7 shows the possible coordinates of the considered area of analysis. The zero of the axes reference is located in the middle of the distance between the two arrays. The spatial resolution tends to decrease with an increasing distance from the arrays and an increasing angle from the axis perpendicular to the array.
5.2. Experiment Setup
Experiments were conducted that consider the area of analysis of m shown in Figure 8, that is, the parking lots of the University. Twenty zones of acoustic source positioning are considered. They are labeled with a number as we can see in Figure 8. The sources used are a human voice (), a hammer striking an iron bar (), a motor car (), and a honk car (). The hammer striking an iron bar and the honk car are shortduration event sounds.
Two types of experiments were performed. The first type used sounds with different spectral content, named Test 1. The second type, instead, used sounds with similar spectral content, named Test 2. Test 1 is composed of thirtytwo parts (), each one with three sources placed in different positions (see Table 1). In various parts of Test 1, the sources were positioned at increasing distances along the axis and the axis. Table 1 also reported the sound pressure level (SPL) of each source. The environmental noise was in a range of 40–50 dB(A). In Test 2, two car sounds were used. The test was performed by placing two car sources in 1 and 7, as shown in Figure 8.
5.3. System Localization Evaluation
An evaluation of the system localization using a single source for each position was computed. Table 2 shows the real coordinates of the source points and the root mean square (RMS) error of the estimation using SRPPHAT method. We can see that the estimation error increases in distant areas and when the angle of incidence on the array is large.
5.4. ISPC Evaluation
Tables 3 and 4 summarize the results for Test 1 and Test 2, respectively, comparing the localization success rate (as a percentage) with different beamforming algorithms (SRP, SRPDC, and MVDR) and SD functions (LP, IS, RMS, and COSH).
The localization success rate is the ratio between the number of correct combinations and the number of ambiguities (NOA). NOA is the number of frames in which we have ambiguity to properly associate the DOAs to the sources; that is, the associations are incorrect in practice. The audio signal frame was divided into 17.5 ms overlapping and a Hannwindowed with a length of 140 ms. The parking area, where the tests were conducted, is a public area. Thus, we must consider that there are other sources in the acoustic scene: other sounds of cars that are moving in the parking area and in the nearby streets.
Table 3 summarizes the results of all thirtytwo tests (Test 1). The number of NOA is 750, and the three frequency ranges (FR) for the SD estimation are 20–675 Hz, 20–2000 Hz, and 20–8000 Hz. The frequency value of 675 Hz takes into account the spatial aliasing limit, which, in our case, is Hz. The phenomena of spatial aliasing implies that the main lobe of the beamformer has a set of identical copies, called grating lobes. The appearance of grating lobes is a function of both microphone spacing and incident frequency. When fully visible, a grating lobe is equal in amplitude to the main lobe of the array. This fact reduces the array response, and, therefore, by defining the spatial sampling requirement and removing the grating lobes, we obtain a greater efficiency in the ISPC.
Table 4 depicts the results of Test 2 with an FR of 20–675 Hz and a NAM of 100. We can note that the accuracy decreases, especially with regard to the RMS and COSH functions, and this result highlights the limitation of the proposed approach in the case of spectrally similar sources.
The best results for Test 1 were obtained with the RMS log SD function and FR = [20–675] Hz. MVDR has the greatest capacity for location with 90.1% of successful DOAs association.
6. Conclusions
The novel incident signal power comparison algorithm is used to solve the ambiguous problem of correctly linking the DOAs from different arrays to the same source in a farfield condition with concurrent sources. Experimental results have shown that this approach can be a solution for a multisource localization that requires a frametoframe analysis, that is, in those cases in which the traditional filtering approach can not be applied. An evaluation of the system in a real scenario is reported, installing a hardware/software prototype on the roof of the University building and analyzing the results comparing three types of beamforming and four functions for the SD estimation. The interest in locating in a farfield outdoor context may be attractive for audio surveillance, sound monitoring, and the analysis of acoustic scenes. The ISPC is successfully used in a joint audiovideo system for monitoring a large area. The best performances are obtained with RMS SD measure on frequency range between 20 Hz and the spatial aliasing frequency limit. We achieved a success rate of 90.1% using MVDR beamforming. We showed the limitation of the proposed algorithm in case of sources that have a similar spectral content.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
The authors would like to thank Rudolf Rabenstein (Professor, University of ErlangenNurnberg, Germany) for his valuable advice and helpful comments.