Abstract

Wireless acoustic sensor networks (WASNs) are formed by a distributed group of acoustic-sensing devices featuring audio playing and recording capabilities. Current mobile computing platforms offer great possibilities for the design of audio-related applications involving acoustic-sensing nodes. In this context, acoustic source localization is one of the application domains that have attracted the most attention of the research community over the last few decades. In general terms, the localization of acoustic sources can be achieved by studying energy, temporal, and/or directional features of the sound arriving at different microphones and using a suitable model that relates those features to the spatial location of the source (or sources) of interest. This paper reviews common approaches for source localization in WASNs that are focused on different types of acoustic features, namely, the energy of the incoming signals, their time of arrival (TOA) or time difference of arrival (TDOA), the direction of arrival (DOA), and the steered response power (SRP) resulting from combining multiple microphone signals. Additionally, we discuss methods not only aimed at localizing acoustic sources but also designed to locate the nodes themselves in the network. Finally, we discuss current challenges and frontiers in this field.

1. Introduction

With the rapid development in fields like circuit design and manufacturing, wireless nodes incorporating a variety of sensors, communication interfaces, and compact microprocessors have become economical resources for the design of innovative monitoring systems. Networks of such devices, referred to as wireless sensor networks (WSNs) [1, 2], have become widespread and are used in many fields, with applications ranging from surveillance and military deployments to industrial and health-care systems [3]. When the nodes incorporate acoustic transducers and the processing involves the manipulation of audio signals, the resulting network is usually referred to as a wireless acoustic sensor network (WASN). A WASN consists of a set of sensor nodes interconnected via a wireless medium [4]. Each node has one or several sensors (microphones), a processing unit, a wireless communication module, and, sometimes, also one or several actuators (loudspeakers) [5].

During the last decade, the use of location information and its potential for the development of ambient intelligence applications has promoted the design of local positioning systems based on WSNs [6]. Using WSNs to perform localization tasks has always been desirable since, besides being considerably cheap, they are easily deployable. Localization and ranging in WSNs have typically been addressed by measuring the received signal strength (RSS) or time of arrival (TOA) of radio signals [7]. However, the RSS approach, while inexpensive, incurs significant errors due to channel fading, long distances, and multipath. In the context of acoustic signal processing, WASNs also provide advantages with respect to traditional (wired) microphone arrays [8]. For example, they enable increased spatial coverage by distributing microphone nodes over a larger volume, a scalable structure, and possibly better signal-to-noise ratio (SNR) properties. In fact, since the ranging accuracy depends on both the signal propagation speed and the precision of the TOA measurement, acoustic signals may be preferred over radio signals [9].

There are two typical localization tasks in WASNs: the localization of one or more sound sources of interest and the localization of the nodes that make up the network. The first task may involve locating unknown sound sources, for example, talkers inside a room or unexpected acoustic events, or sometimes other devices emitting known beacon signals. The second task is usually related to the self-calibration or automatic ranging of the nodes themselves.

To estimate the locations of the sound sources that are active in an acoustic environment monitored by a WASN, different methods exist in the literature. Usually, a centralized scheme is adopted where a dedicated node, known as the fusion center, is responsible for performing the localization task based on information it receives from the rest of the nodes. The sensor network itself poses many limitations and challenges that must be considered when designing a localization approach in order to facilitate its use in practical scenarios. Such challenges include the bandwidth usage, which limits the amount of information that can be transmitted in the network, and the limited processing power of the nodes, which prevents them from carrying out very complex and computationally expensive operations. Moreover, each node has its own clock for sampling the signals and, since the nodes operate individually, the resulting audio streams in the network will not be synchronized.

A taxonomy of sound source localization methods can be built upon the nature of the sensor information that is utilized to estimate the locations. Hence, the WASN can estimate the locations of the acoustic sources based on (i) energy readings, (ii) time-of-arrival (TOA) measurements, (iii) time-difference-of-arrival (TDOA) measurements, and (iv) direction-of-arrival (DOA) estimates or (v) by utilizing the steered response power (SRP) function.

In DOA-based approaches, each node estimates the DOA of the sources it can detect and transmits the DOA estimates to the fusion center. Although such approaches require increased computational power and multiple microphones in each node, they can attain very low bandwidth usage, as only the DOA estimates need to be transmitted. Also, since the DOA estimation is carried out in each node individually, the audio signals at different nodes need not be synchronized: DOA-based approaches can tolerate unsynchronized input as long as the sources move at a rather slow rate relative to the analysis frame. The location estimators are generally based on estimating the location of a sound source as the intersection of lines emanating from the nodes in the direction of the estimated DOAs. For multiple sources, however, several challenges arise: the number of detected sources (and thus the number of DOA estimates) at each time instant can vary across the nodes due to missed detections (i.e., a source is not detected by a node) or false alarms (i.e., overestimation of the number of detected sources), and an association procedure is needed to find the DOA combinations that correspond to the same source. This is known as the data-association problem and is crucial for the localization task.

The TDOA is related to the difference in the time of flight (TOF) of the wavefront produced by the source at a pair of microphones in the same node. TDOAs can be estimated at a moderate computational cost through the generalized cross correlation (GCC) [10] of the signals acquired by the microphones in the pair. The source location estimate is obtained by combining TDOA measurements coming from multiple sensors. Notice that, as for the DOA, only the TDOA measurements must be transmitted over the wireless network, with clear advantages in terms of transmission power and required bandwidth. Though suitable for WASNs, in practical scenarios (reverberant environments, presence of noise, and interferers), TDOA measurements are prone to errors, which in turn lead to erroneous localization. In order to mitigate the impact of these adverse phenomena, several techniques have been presented with the aim of identifying and removing outliers in the TDOA set [11–13].

A TDOA measurement constrains the source to lie on a branch of a hyperbola whose foci are at the microphone positions and whose aperture is determined by the TDOA value. When two (three in 3D) measurements from different pairs are available, the source can be localized through the intersection of hyperbolas. The resulting cost function, however, is strongly nonlinear, and therefore its minimization is difficult and prone to errors. Linearized cost functions have been proposed to overcome this difficulty [14–16]. It is important to notice, however, that the linearized cost functions require the presence of a reference microphone, with respect to which the remaining microphones in all the sensors must be synchronized. This poses technological constraints that, in some cases, hinder the use of such techniques. More recently, methodologies that include the synchronization offset in the optimization of the cost function have been proposed to overcome this problem [12, 17].

TOA measurements are obtained by detecting the time instant at which the source signal arrives at the microphones present in the network. Since in passive source localization the source emission time is unknown, the TOA is not equivalent to the TOF of the signal, preventing a direct mapping from TOAs to source-to-node distances. While some applications involve sound-emitting nodes that allow localization to be performed by means of trilateration techniques, TDOA localization methods are usually chosen otherwise. In this case, although the source emission time does not need to be known, the registered TOAs need to be referenced to a common clock, requiring precise timing hardware and synchronization mechanisms.

Energy-based localization relies on averaged energy readings computed over windows of signal samples acquired by the microphones incorporated in the nodes [18]. Compared to TDOA and DOA methods, energy-based approaches are attractive because they do not require the use of multiple microphones at the nodes and, unlike TOA-based approaches, are free of synchronization issues. However, TDOA- and DOA-based methods, considered signal-based approaches, generally offer better performance than energy-based methods. This is due to the fact that the information conveyed by all the samples of the signal is directly exploited instead of their average, at the expense of more sophisticated capturing devices and transmission resources [18, 19].

SRP approaches are beamforming-based techniques that compute the output power of a filter-and-sum beamformer steered to a set of candidate source locations defined over a predefined spatial grid. Since the computation of the SRP involves the accumulation of GCCs from multiple microphone pairs, the synchronization requirements are usually the same as those of TDOA-based methods. The set of SRPs obtained at the different points of the grid makes up the SRP power map, where the point accumulating the highest value corresponds to the estimated source location. When using unsynchronized nodes with multiple microphones, the SRP power maps computed at each node can be used to obtain their corresponding DOAs. Alternatively, the SRP power maps from the different sensors can be accumulated at a central node to obtain a combined SRP power map, identifying the true source location by its maximum.

Besides the localization of acoustic sources, approaches for localizing the nodes in the network are also of high interest within a WASN context. Based on estimated TOAs and TDOAs, algorithms for self-localization of the sensor nodes usually assume sources at known positions playing known probe signals. In practical scenarios, each sensor node does not have any information regarding the other nodes, and there is no synchronization between the sensor and the source. Under these assumptions, all processing can take place on the node itself. Several issues, for example, reverberation, asynchrony between the sound source and the sensor, poor estimation of the speed of sound, and noise, need to be considered for robust self-localization methods. In addition, the processing needs to be computationally inexpensive in order to be run on the sensor node itself.

Some state-of-the-art solutions for acoustic sensor localization detail the challenges facing these algorithms and methods to tackle such problems in a unified context [20, 21]. Furthermore, recent methodologies have been proposed for probe signal design aimed at improving TOF estimation [22], the joint localization of sensors and sources in an ad hoc array by using low-rank approximation methods [23], and an iterative peak matching algorithm for fast node autocalibration [24].

The paper is structured as follows. Section 2 discusses some general considerations regarding a general WASN structure and the notation used throughout this paper. Section 3 presents the fundamentals of energy-based source localization methods. Section 4 discusses TOA-based localization approaches. Methods for TDOA-based localization are presented in Section 5. Section 6 discusses the use of DOA measurements to perform localization of one or several sound sources. In Section 7, the fundamentals of the conventional and modified SRP methods are explained. Section 8 reviews some recent methods for the self-localization of the nodes in the network. Some future directions in the field are discussed in Section 9. Finally, the conclusions of this review are summarized in Section 10.

2. General Considerations

In order to clarify the notation used throughout this paper and the type of location cues used to perform the localization task, Figure 1 shows a general WASN with a set of wireless nodes and an emitting sound source. It is assumed that the network consists of $N$ nodes and that each node incorporates $M$ microphones; the example shown in Figure 1 depicts one such configuration. The nodes are assumed to be located at positions $\mathbf{q}_n$, $n = 1, \ldots, N$, while the microphone locations are denoted as $\mathbf{q}_m^{(n)}$, $m = 1, \ldots, M$, where the superscript identifies the node at which the microphone is located. The source position is denoted as $\mathbf{x}_s$, while a general point in space is denoted as $\mathbf{x}$. Note that all these location vectors are referenced to the same absolute coordinate system. The distance from any microphone to the sound source is denoted as $d_m^{(n)} = \|\mathbf{x}_s - \mathbf{q}_m^{(n)}\|$, while the time it takes the sound wave to travel from the source to a microphone, that is, the time of flight (TOF), is denoted by $t_m^{(n)}$. The time instant at which the source signal arrives at a given microphone, that is, the TOA, is denoted as $T_m^{(n)}$. TDOAs are denoted by $\tau_{ij}$ and correspond to the observed TOA differences between pairs of microphones $(i, j)$. It is a common practice to identify different pairs of microphones by using a single index $p = 1, \ldots, P$, where $P$ is the total number of microphone pairs involved in the localization task. The DOA corresponds to the angle that identifies the direction, relative to the node microphone array, pointing to the sound source and is denoted by $\theta_n$. Finally, the energy values of the source signal measured at the nodes are denoted as $y_n$. Energy differences between microphones located at the same node are negligible (especially if the nodes are relatively far from the source), so it is usually assumed that only one energy reading is obtained for each node.

3. Energy-Based Source Localization

Traditionally, most localization methods for WASNs have focused on sound energy measurements. This is motivated by the fact that the acoustic power emitted by targets usually varies slowly in time. Thus, the acoustic energy time series does not require as high a sampling rate as the raw signal time series, avoiding the need for high transmission rates and accurate synchronization. The energy-based model was first presented in [25]. In this model, the acoustic energy decreases approximately as the inverse of the square of the source-to-sensor distance. Without loss of generality, it will be assumed in this section that each node incorporates only one microphone, so that the node locations determine the microphone positions. Note that, as opposed to time delay measurements, the differences in energy measurements obtained from different microphones at the same node are negligible.

3.1. Energy-Decay Model

Assuming that only one source is active, the acoustic intensity signal received by the $n$th sensor at time $t$ is modeled as [25] $y_n(t) = g_n s_n(t) + \varepsilon_n(t)$, with $s_n(t) = S(t - t_n)/d_n^2(t)$, where $s_n(t)$ is the source intensity at the sensor location, $g_n$ is a sensor gain factor, $S(t)$ denotes the intensity of the source signal at a distance of one meter from the source, $t_n$ is the propagation delay from the source to the sensor, $d_n(t)$ is the distance between the sensor and the source, and $\varepsilon_n(t)$ is an additive noise component modeled as Gaussian noise. In practice, for each time interval $k$, a set of $W$ samples is used to obtain an energy reading at the sensor, $\hat{y}_n(k) = \frac{1}{W}\sum_{w=0}^{W-1} x_n^2(kW + w)$, where $x_n(\cdot)$ are the samples obtained from the microphone of the $n$th node. In the case when several microphones are available at each node, the final energy reading is obtained by averaging the energies computed from each of the microphones.

By assuming that the maximum propagation delay between any pair of sensors is small compared to the length of the averaging interval and taking into account the averaging effect, the propagation delay $t_n$ can be neglected in the energy-decay function, so that $\hat{y}_n(k) \approx g_n \frac{S(k)}{d_n^2} + \varepsilon_n(k)$, where $S(k)$ and $\varepsilon_n(k)$ denote the source energy and the noise energy over the $k$th interval, respectively.
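
As a simple illustration of the model above, the following Python sketch computes a gain-compensated energy reading from a window of microphone samples and the energy predicted by the inverse-square decay assumption; the function names and the assumption of a unit-gain, noise-free model are illustrative only.

```python
import numpy as np

def energy_reading(frame, gain=1.0):
    """Average energy of one analysis window of microphone samples (gain-compensated)."""
    frame = np.asarray(frame, dtype=float)
    return np.mean(frame ** 2) / gain

def expected_energy(source_power_1m, node_pos, source_pos):
    """Energy expected at a node under the inverse-square decay model."""
    d = np.linalg.norm(np.asarray(node_pos, float) - np.asarray(source_pos, float))
    return source_power_1m / d ** 2
```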

3.2. Energy Ratios

Considering the energy measurements obtained by a group of sensor nodes, the energy ratio between the $i$th and the $j$th sensors is defined as $\kappa_{ij} \triangleq \left(\frac{\hat{y}_i/g_i}{\hat{y}_j/g_j}\right)^{1/2} = \frac{d_j}{d_i} = \frac{\|\mathbf{x}_s - \mathbf{q}_j\|}{\|\mathbf{x}_s - \mathbf{q}_i\|}$ (4), where $\mathbf{x}_s$ is the location of the source and $\mathbf{q}_i$ and $\mathbf{q}_j$ are the locations of the two microphones. For $\kappa_{ij} \neq 1$, all the possible source coordinates that satisfy (4) must reside on a hypersphere (sphere in 3D or circle in 2D) described by $\|\mathbf{x}_s - \mathbf{c}_{ij}\|^2 = \rho_{ij}^2$ (5), where the center and the radius of this hypersphere are $\mathbf{c}_{ij} = \frac{\mathbf{q}_j - \kappa_{ij}^2 \mathbf{q}_i}{1 - \kappa_{ij}^2}$ and $\rho_{ij} = \frac{\kappa_{ij}\,\|\mathbf{q}_i - \mathbf{q}_j\|}{\left|1 - \kappa_{ij}^2\right|}$.

In the limiting case, when $\kappa_{ij} = 1$, the solution of (4) forms a hyperplane between $\mathbf{q}_i$ and $\mathbf{q}_j$: $\boldsymbol{\alpha}_{ij}^{T}\mathbf{x}_s = \beta_{ij}$ (7), where $\boldsymbol{\alpha}_{ij} = 2(\mathbf{q}_j - \mathbf{q}_i)$ and $\beta_{ij} = \|\mathbf{q}_j\|^2 - \|\mathbf{q}_i\|^2$.

By using the energy ratio registered at a pair of sensors, the potential target location can be restricted to a hypersphere with center and radius that are functions of the energy ratio and the two sensor locations. If more sensors are used, more hyperspheres can be determined. If all the sensors that receive the signal from the same target are used, the corresponding target location hyperspheres must intersect at a particular point that corresponds to the source location. This is the basic idea underlying energy-based source localization. In the absence of noise, it can be shown that for $N$ sensor measurements only $N-1$ of the total $N(N-1)/2$ ratios are independent, and all the corresponding hyperspheres intersect at a single point for four or more sensor readings. For noisy measurements, however, more than $N-1$ ratios may be used for robustness, and the unknown source location is estimated by solving a nonlinear least squares problem, as explained in the next subsection. As an example, Figure 2 shows a 2D setup with three sensors and the circles resulting from noisy energy ratios.
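
The following minimal sketch derives the circle (sphere in 3D) implied by the energy ratio of two nodes, assuming the inverse-square decay model and the gain-compensated ratio definition used above; the function name and the degenerate-case handling are implementation choices.

```python
import numpy as np

def energy_ratio_locus(q_i, q_j, y_i, y_j, g_i=1.0, g_j=1.0):
    """Center and radius of the locus of source positions implied by the energy
    ratio of two nodes (circle of Apollonius), assuming kappa = sqrt((y_i/g_i)/(y_j/g_j)) = d_j/d_i."""
    q_i, q_j = np.asarray(q_i, float), np.asarray(q_j, float)
    kappa = np.sqrt((y_i / g_i) / (y_j / g_j))
    if np.isclose(kappa, 1.0):
        raise ValueError("kappa ~ 1: the locus degenerates into the bisecting hyperplane")
    k2 = kappa ** 2
    center = (q_j - k2 * q_i) / (1.0 - k2)
    radius = kappa * np.linalg.norm(q_i - q_j) / abs(1.0 - k2)
    return center, radius
```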

3.3. Localization through Energy Ratios

Given a set of sensor nodes providing energy ratios, a least squares optimization problem can be formulated whose cost function accumulates the squared deviations of the candidate source location from each of the hyperspheres and hyperplanes (the latter corresponding to ratios close to 1) defined by the measured ratios. Note that this cost function is nonlinear with respect to the source location, resulting in the energy-ratio nonlinear least squares (ER-NLS) problem. It can be shown that minimizing this cost function leads to an approximate solution for the maximum likelihood (ML) estimate. A set of strategies was proposed in [26] to minimize it by using the complete set of measured energy ratios. A popular approach to solve the problem is the unconstrained least squares method: from every pair of hyperspheres (with double indices replaced by a single pair index for the sake of brevity), a hyperplane can be determined by eliminating the common quadratic terms (10). The combination of (7) with (10) leads to a least squares optimization problem without inconvenient nonlinear terms, known as the energy-ratio least squares (ER-LS) method, whose cost function is linear in the unknown source location.

The closed-form solution of the above unconstrained least squares formulation makes this method computationally attractive; however, it does not reach the Cramer-Rao bound. In [27], energy-based localization is formulated as a constrained least squares problem, and some well-known least squares methods for closed-form TDOA-based location estimation are applied, namely, linear intersection [28], spherical intersection [29], spherical interpolation (SI) [30], and subspace minimization [31]. In [32], an algebraic closed-form solution is presented that reaches the Cramer-Rao bound for Gaussian measurement noise as the SNR tends to infinity. The authors in [33] formulated the localization problem as a convex feasibility problem and proposed a distributed version of the projection onto convex sets method. A discussion of least squares approaches is provided in [19], presenting a low-complexity weighted direct least squares formulation. A recent review of energy-based acoustic source localization methods can be found in [18].

4. TOA-Based Localization

Typically, a WASN sound source localization setup assumes that there is a sound-emitting source and a collection of fixed microphone anchor nodes placed at known positions. When the sound source emits a given signal, the different microphone nodes will estimate the time of arrival (TOA) or time of flight (TOF) of the sound. These two terms may not be equivalent in some situations. The TOF measures the time that it takes for the emitted signal to travel the distance between the source and the receiving node; that is, $t_n = \frac{d_n}{c} = \frac{\|\mathbf{x}_s - \mathbf{q}_n\|}{c}$, where $c$ is the speed of sound.

In fact, when utilizing TOA measurements for source localization, it is often assumed that the source and sensor nodes cooperate such that the signal propagation time can be detected at the sensor nodes. However, such collaboration between sources and sensor nodes is not always available. Thus, without knowing the initial signal transmission time at the source, the sensor is unable to determine the signal propagation time from the TOA alone.

In the more general situation, when unknown acoustic signals such as speech or unexpected sound events are to be localized, the relation between distances and TOAs can be modeled as $T_n = t_0 + \frac{d_n}{c} + e_n$, where $t_0$ is an unknown transmission time instant and $e_n$ is the TOA measurement noise. Note that the term $t_0$ appears because typical sound sources do not encode a time stamp in their transmitted signal to indicate the starting transmission time to the sensor nodes and, moreover, there is not any underlying synchronization mechanism. Hence, the sensor nodes can only measure the signal arrival time instead of the propagation time or TOF. One way to tackle this problem is to exploit the difference of pairwise TOA measurements, that is, the time difference of arrival (TDOA), for source localization (see Section 5). Although the dependence on the initial transmission time is eliminated by TDOA, the measurement subtraction amplifies the noise. To overcome such problems, some works propose methods to estimate both the source location and the initial transmission time jointly [34].

When the TOA and the TOF are equivalent (i.e., $t_0 = 0$), for example, because there are synchronized sound-emitting nodes, the source-to-node distances can be calculated using the propagation speed of sound [35]. This may require an initial calibration process to determine factors that have a strong influence on the speed of sound. The computation of the source location can be carried out in a central node by using the estimated distances and solving the trilateration problem [36]. Trilateration is based on the formulation of one equation for each anchor that represents the surface of the sphere (or circle) centered at its position, with a radius equal to the estimated distance to the sound source. The solution to this system of equations is the point where all the spheres intersect. For 2D localization, at least three sensors are needed, while one more sensor is necessary to obtain a 3D location estimate.

4.1. Trilateration

Let us consider a set of $N$ sensor TOA measurements that are transformed to distances by assuming a known propagation speed, $d_n = c\,T_n$, where $c$ is the speed of sound (approximately 343 m/s). Then, the following system of equations can be formulated: $\|\mathbf{x}_s - \mathbf{q}_n\|^2 = d_n^2$, $n = 1, \ldots, N$ (16).

Each equation in (16) represents a circle in 2D or a sphere in 3D, centered at $\mathbf{q}_n$ with a radius $d_n$. Note that the problem is the same as the one given by (5). Thus, solving the above system is equivalent to finding the intersection point/points of a set of circles/spheres. Again, the trilateration problem is not straightforward to solve due to the nonlinearity of (16) and the errors in the measured distances [37]. A number of algorithms have been proposed in the literature to solve the trilateration problem, including both closed-form [37, 38] and numerical solutions. Closed-form solutions have low computational complexity when the solution of (16) actually exists. However, most closed-form solutions only solve for the intersection points of the minimum number of spheres. They do not attempt to determine the intersection point when more measurements are available, where small errors can easily cause the involved spheres not to intersect at one point [39]. It is then necessary to find a good approximation that minimizes the errors contained in (16) by considering the nonlinear least squares cost function $J(\mathbf{x}) = \sum_{n=1}^{N}\left(\|\mathbf{x} - \mathbf{q}_n\| - d_n\right)^2$.

Numerical methods are in general necessary to estimate the source location. However, compared with closed-form solutions, numerical methods have higher computational complexity. Some numerical methods are based on a linearization of the trilateration problem [40–42], introducing additional errors into the position estimation. Common numerical techniques such as Newton's method or steepest descent have also been proposed [38, 40, 41]. However, most of these search algorithms are very sensitive to the choice of the initial guess, and global convergence towards the desired solution is not guaranteed [39].
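
As a minimal numerical example of the nonlinear least squares formulation above, the following sketch solves the trilateration problem with SciPy's general-purpose least squares solver; the anchor centroid used as the initial guess is an arbitrary (and, as noted above, not always safe) choice.

```python
import numpy as np
from scipy.optimize import least_squares

def trilaterate(anchors, distances, x0=None):
    """Estimate a source position by minimizing sum_n (||x - q_n|| - d_n)^2."""
    anchors = np.asarray(anchors, float)      # shape (N, dim), anchor positions q_n
    distances = np.asarray(distances, float)  # shape (N,), measured ranges d_n
    if x0 is None:
        x0 = anchors.mean(axis=0)             # centroid as a simple initial guess
    residuals = lambda x: np.linalg.norm(anchors - x, axis=1) - distances
    return least_squares(residuals, x0).x
```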

4.2. Estimating TOAs of Acoustic Events

As already discussed, localizing acoustic sources from TOA measurements only is not possible due to the unknown source emission time of acoustic events. If the sensors are synchronized, the differences of TOA measurements can be used to cancel out the common term $t_0$, so that a set of TDOAs is obtained and used as discussed in Section 5. A low-complexity method to estimate the TOAs of acoustic events in WASNs was proposed in [43], where the cumulative-sum (CUSUM) change detection algorithm is used to estimate the source onset times at the nodes. The CUSUM method is a low-complexity algorithm that estimates the change instant as the time index maximizing the accumulated log-likelihood ratio between the post-change and pre-change sample distributions.

The probability density function of each sample is parameterized by $\theta$, a deterministic parameter (not to be confused with the DOA of the source). The occurrence of an event is modeled by an instantaneous change in $\theta$, so that $\theta = \theta_0$ before the event and $\theta = \theta_1$ after it. To simplify calculations at the nodes, the samples before the acoustic event are assumed to belong exclusively to a Gaussian noise component of variance $\sigma_0^2$, while the samples after the event are also normally distributed with variance $\sigma_1^2$. These variances are estimated from the beginning and tail of a window of samples where the nodes have strong evidence that an acoustic event has happened. The advantage of this approach is twofold. On the one hand, the estimation of the distribution parameters is more accurate. On the other hand, the CUSUM change detection algorithm only needs to be run when an acoustic event has actually occurred, allowing significant battery savings in the nodes. The detection of acoustic events is performed by assuming that at least one of the nodes has sufficient SNR to detect the event by a simple amplitude threshold. The node (or nodes) detecting the presence of the event will notify the rest by sending an event warning alert in order to let them know that they must run the CUSUM algorithm over a window of samples (see Figure 3). The amplitude threshold selection is carried out by setting either the probability of detection or the probability of false alarm given an initial estimate of the ambient noise variance $\sigma_0^2$. Note that synchronization issues still persist (all TOAs must refer to a common time reference), so the nodes exchange synchronization information by using MAC layer time stamping in the deployment discussed in [43].
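
The sketch below illustrates a one-sided CUSUM test for a variance increase in zero-mean Gaussian samples, consistent with the assumptions described above; the threshold value, the returned onset convention, and the function name are implementation choices rather than the exact procedure of [43].

```python
import numpy as np

def cusum_onset(x, sigma0, sigma1, threshold):
    """One-sided CUSUM for a variance increase from sigma0^2 to sigma1^2.

    Returns (alarm_index, onset_index), or (None, None) if no alarm is raised.
    The onset estimate is the last sample at which the cumulative statistic
    was zero before the alarm, which serves as the TOA of the acoustic event.
    """
    x = np.asarray(x, float)
    # Per-sample log-likelihood ratio of N(0, sigma1^2) against N(0, sigma0^2)
    llr = np.log(sigma0 / sigma1) + x ** 2 * (1.0 / (2 * sigma0 ** 2) - 1.0 / (2 * sigma1 ** 2))
    g, onset = 0.0, 0
    for k, s in enumerate(llr):
        g = max(0.0, g + s)
        if g == 0.0:
            onset = k + 1          # candidate change point so far
        if g > threshold:
            return k, onset        # alarm time and estimated onset sample
    return None, None
```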

5. TDOA-Based Localization

When each sensor consists of multiple microphones, localization can be accomplished in an efficient way by delegating as much of the processing as possible to each node and then combining measurements to yield the location estimate at the central node. When nodes are connected through low bitrate channels and no synchronization of the internal clocks is guaranteed, this strategy becomes a must. Among all the possible measurements, a suitable choice is the time difference of arrival (TDOA).

5.1. TDOA and Generalized Cross Correlation

Consider the presence of $N$ nodes in the network, each equipped, for simplicity of notation, with $M$ microphones. The TDOA refers to the difference of propagation time from the source location to a pair of microphones. If the source is located at $\mathbf{x}_s$, and the $i$th microphone of the $n$th sensor is at $\mathbf{q}_i^{(n)}$, the TDOA is related to the difference of the ranges from the source to the microphones $i$ and $j$ through $\tau_{ij}^{(n)} = \frac{\|\mathbf{x}_s - \mathbf{q}_i^{(n)}\| - \|\mathbf{x}_s - \mathbf{q}_j^{(n)}\|}{c}$ (19).

Throughout the rest of this subsection we will consider pairs of microphones within the same node, so we will omit the superscript of the sensor. The estimate of the TDOA can be obtained by performing the generalized cross correlation (GCC) between the signals acquired by the microphones at $\mathbf{q}_i$ and $\mathbf{q}_j$, as detailed in the following. Under the assumption of working in an anechoic scenario and in a single source context, the discrete-time signal acquired by the $i$th microphone is $x_i(t) = a_i s(t - t_i) + w_i(t)$ (20), where $a_i$ is a microphone-dependent attenuation term that accounts for the propagation losses and air absorption, $s(t)$ is the source signal, $t_i$ is the propagation delay between the source and the $i$th microphone, and $w_i(t)$ is an additive noise signal. In the discrete-time Fourier transform (DTFT) domain, the microphone signals can be written as $X_i(\omega) = a_i S(\omega) e^{-j\omega t_i} + W_i(\omega)$, where $S(\omega)$ and $W_i(\omega)$ are the DTFTs of $s(t)$ and $w_i(t)$, respectively, and $\omega$ denotes normalized angular frequency.

Given the pair of microphones $i$ and $j$, with $i \neq j$, the GCC between $x_i(t)$ and $x_j(t)$ can be written as [10] $R_{ij}(\tau) = \frac{1}{2\pi}\int_{-\pi}^{\pi}\Psi_{ij}(\omega)\,X_i(\omega)\,X_j^{*}(\omega)\,e^{j\omega\tau}\,d\omega$, where $(\cdot)^{*}$ is the conjugate operator and $\Psi_{ij}(\omega)$ is a suitable weighting function.

The TDOA for the pair $(i, j)$ is estimated as $\hat{\tau}_{ij} = \frac{1}{f_s}\,\arg\max_{\tau} R_{ij}(\tau)$ (23), where $f_s$ is the sampling frequency. The goal of $\Psi_{ij}(\omega)$ is to make $R_{ij}(\tau)$ sharper so that the estimate in (23) becomes more accurate. One of the most common choices is the PHAse Transform (PHAT) weighting function; that is, $\Psi_{ij}(\omega) = \frac{1}{\left|X_i(\omega)\,X_j^{*}(\omega)\right|}$.
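
The following minimal sketch estimates a TDOA with GCC-PHAT using FFT-based cross correlation; the zero-padding length, the small regularization constant added to the PHAT denominator, and the optional lag limit are implementation choices. The returned delay is that of the first signal relative to the second.

```python
import numpy as np

def gcc_phat(x_i, x_j, fs, max_tau=None):
    """GCC-PHAT between two microphone signals; returns (tdoa_seconds, gcc)."""
    n = len(x_i) + len(x_j)                       # FFT length with zero-padding
    X_i = np.fft.rfft(x_i, n)
    X_j = np.fft.rfft(x_j, n)
    cross = X_i * np.conj(X_j)
    cross /= np.abs(cross) + 1e-12                # PHAT weighting
    gcc = np.fft.irfft(cross, n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    gcc = np.concatenate((gcc[-max_shift:], gcc[:max_shift + 1]))  # lags in [-max_shift, max_shift]
    tau = (np.argmax(np.abs(gcc)) - max_shift) / fs
    return tau, gcc
```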

In an array of $M$ microphones, the complete set of TDOAs counts $M(M-1)/2$ measurements. If these are not affected by any sort of measurement error, it can easily be demonstrated that only $M-1$ of them are independent, the others being linear combinations of them. It is common practice, therefore, to adopt a reference microphone in the array and to measure TDOAs with respect to it. We refer to the resulting set of $M-1$ measurements as the reduced TDOA set. Without any loss of generality, for the reduced TDOA set we can assume the reference microphone to be the one with index 1, and the TDOAs in the reduced set, for compactness of notation, are denoted with $\tau_i = \tau_{i1}$.

5.2. TDOA Measurement in Adverse Environments

It is important to stress that TDOA measurements are very sensitive to reverberation, noise, and the presence of possible interferers: in a reverberant environment, for some locations and orientations of the source, the peak of the GCC corresponding to a reflective path could overcome that of the direct path. Moreover, in a noisy scenario, for some time instants the noise level could exceed the signal level, thus making the TDOAs unreliable. As a result, some TDOAs must be considered outliers and must be discarded from the measurement set before localization (as in the example shown in Figure 4).

Several solutions have been developed in order to alleviate the impact of outliers in TDOAs. It has been observed that GCCs affected by reverberation and noise do not exhibit a single sharp peak. In order to identify outliers, therefore, some works analyze the shape of the GCC from which they were extracted. In [12], the authors propose a reliability function to detect GCCs affected by outliers. More specifically, its numerator sums the “power” of the GCC samples within an interval centered around the candidate TDOA and compares it with the energy of the remaining samples. When this ratio overcomes a prescribed threshold, the TDOA is considered reliable. Two metrics of the GCC shape have been proposed in [13, 44]. The first one considers the value of the GCC at the maximum peak location, while the second compares the highest peak with the second highest one. When both metrics overcome prescribed thresholds, the GCC is considered reliable.

Another possible route to follow is described in [11] and is based on the observation that TDOAs along closed paths of microphones must sum to zero (zero-sum condition) and that there is a relationship between the local maxima of the autocorrelation and cross correlation of the microphone signals (raster condition). The zero-sum condition on minimum length paths of three microphones with indexes $i$, $j$, and $k$, in particular, states that $\tau_{ij} + \tau_{jk} + \tau_{ki} = 0$. By imposing zero-sum and raster conditions, the authors demonstrate that they are able to disambiguate TDOAs in the case of multiple sources in reverberant environments.
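
As a small illustration of the zero-sum condition, the following sketch flags microphone triples whose closed-path TDOA sum exceeds a tolerance; the matrix-based TDOA representation and the tolerance parameter are assumptions of this example.

```python
import numpy as np
from itertools import combinations

def zero_sum_violations(tdoa, tol):
    """Check tau_ij + tau_jk + tau_ki = 0 on all microphone triples.

    tdoa is a skew-symmetric matrix with tdoa[i, j] = tau_ij.
    Returns the triples whose closed-path sum exceeds the tolerance.
    """
    bad = []
    for i, j, k in combinations(range(tdoa.shape[0]), 3):
        if abs(tdoa[i, j] + tdoa[j, k] + tdoa[k, i]) > tol:
            bad.append((i, j, k))
    return bad
```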

In [17] the authors combine different approaches. A redundant set of candidate TDOAs is selected by identifying local maxima of the GCCs. A first selection is performed by discarding TDOAs that do not honor the zero-sum condition. A second selection step is based on three quality metrics related to the shape of the GCC. The third and final step is based on the inspection of the residuals of the source localization cost function: all the measurements whose residuals overcome a prescribed threshold are discarded.

It is important to notice that all the referenced techniques for TDOA outlier removal do not involve the cooperation of multiple nodes, with clear advantages in terms of data to be transmitted.

5.3. Localization through TDOAs

In this section we will consider the problem of source localization by combining measurements coming from different nodes. In order to identify the sensor from which measurements have been extracted, we will use the superscript $(n)$. From a geometric standpoint, given a TDOA estimate $\hat{\tau}_{ij}^{(n)}$, the source is bound to lie on a branch of a hyperbola (hyperboloid in 3D), whose foci are at $\mathbf{q}_i^{(n)}$ and $\mathbf{q}_j^{(n)}$, and whose vertices are $c\,\hat{\tau}_{ij}^{(n)}$ apart. If source and microphones are coplanar, the location of the source can be ideally obtained by intersecting two or more hyperbolas [14], as in Figure 5, and some early solutions for source localization rely on this idea. It is important to notice that when the source is sufficiently far from the node, the branch of the hyperbola can be confused with its asymptote: in this case the TDOA is informative only with respect to the direction towards which the source is located and not its distance from the array. In this context it is more convenient to work with DOAs (see Section 6).

In general, intersecting hyperbolas is a strongly nonlinear problem, with obvious negative consequences on the computational cost and the sensitivity to noise in the measurements. In [12] a solution is proposed to overcome this issue, which relies on a projective representation. The hyperbola derived from the TDOA at microphones $i$ and $j$ is written in implicit form as a quadratic constraint (27) on the source location, whose coefficients are determined in closed form by $\hat{\tau}_{ij}$, $\mathbf{q}_i$, and $\mathbf{q}_j$. In the presence of noise, the constraint is not honored exactly, and a residual can be defined for each measurement, weighted by 1 for all the TDOAs that have been found reliable and by 0 otherwise. The residuals are stacked in a vector, and the source is localized by minimizing the cost function given by its squared norm (29). If TDOA measurements are affected by additive white Gaussian noise, it is easy to demonstrate that (29) is proportional to the ML cost function.
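
The following sketch is not the projective formulation of [12]; it is a generic nonlinear least squares localizer that minimizes the mismatch between measured TDOAs and the range differences predicted by (19), which conveys the same basic idea of combining hyperbolic constraints from several microphone pairs.

```python
import numpy as np
from scipy.optimize import least_squares

def tdoa_localize(mic_pairs, tdoas, x0, c=343.0):
    """Nonlinear least squares source localization from pairwise TDOAs.

    mic_pairs: list of (q_i, q_j) microphone position pairs,
    tdoas:     measured TDOAs tau_ij (seconds) for those pairs,
    x0:        initial guess for the source position.
    """
    pairs = [(np.asarray(qi, float), np.asarray(qj, float)) for qi, qj in mic_pairs]

    def residuals(x):
        return [(np.linalg.norm(x - qi) - np.linalg.norm(x - qj)) / c - tau
                for (qi, qj), tau in zip(pairs, tdoas)]

    return least_squares(residuals, np.asarray(x0, float)).x
```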

It has been demonstrated that a simplification of the localization cost function can be obtained if a reference microphone is adopted, at the cost of the nodes sharing a common clock. Without loss of generality, we assume the reference to be the first microphone in the first sensor, and we set the origin of the reference frame to coincide with the reference microphone. Moreover, we can drop the array index upon assigning a global index to the microphones in the different sensors. In this context, it is possible to linearize the localization cost function, as shown in the next paragraphs. By rearranging the terms in (19) with respect to the reference microphone, each measured range difference imposes a constraint (30) relating the source position to its range from the reference microphone. In [45] it has been proposed to represent the source localization problem in the space-range reference frame: a point in space is mapped onto the 3D space-range domain by appending, as an additional coordinate, its range difference relative to the source between the reference microphone and the microphone under consideration (in the absence of noise and measurement errors). With this representation, (30) can be interpreted as the equation of a negative half-cone in the space-range reference frame, whose apex is at the considered microphone, and the source point must lie on its surface. Equation (30) can be iterated for all the microphones, and the source is bound to lie on the surface of all the cones. Moreover, the range of the source from the reference microphone must honor an additional constraint (33), which is the equation of a cone whose apex is at the origin. The ML source estimate, therefore, is the point on the surface of the cone in (33) closest to all the cones defined by (30), as represented in Figure 6.

By squaring both members of (30) and recognizing the common quadratic terms, (30) can be rewritten as the equation of a plane in the space-range reference frame (35), on which the source is bound to lie. Under errors in the range difference estimates, (35) is not satisfied exactly, and therefore a residual can be defined for each microphone. Based upon these residuals, the LS spherical cost function is given by the sum of their squares [46]. Most LS estimation techniques adopt this cost function and differ only in the additional constraints. The Unconstrained Least Squares (ULS) estimator [16, 30, 31, 47, 48] localizes the source by minimizing this cost function with no additional constraints. It is important to notice that the absence of any explicit constraint relating the source position and its range from the reference microphone leads in many applications to poor localization accuracy. Constrained Least Squares techniques, therefore, reintroduce this constraint (39). Based on (39), Spherical Intersection [29], Least Squares with Linear Correction [49], and Squared Range Difference Least Squares [15] estimators have been proposed, which differ in the minimization procedure, ranging from iterative to closed-form solutions. It is important to notice, however, that all these techniques assume the presence of a global reference microphone and synchronization valid for all the nodes. Alternative solutions that overcome this technological constraint have been proposed in [17, 50]. Here the concept of cone propagation in the space-range reference frame has been exploited. In particular, in [50], the propagation cone is defined slightly differently from the one in (30): the apex is at the source. As a consequence, in the absence of synchronization errors, all the points associated with the microphone measurements must lie on the surface of the propagation cone. If a sensor exhibits a clock offset, its measurements will be shifted along the range axis. The shift can be expressed as a function of the source location, and therefore it can be included in the localization cost function at the cost of some nonlinearity. The extension to the 3D localization cost function was then proposed in [17].

6. DOA-Based Localization

When each node in the WASN incorporates multiple microphones, the location of an acoustic source can be estimated based on direction of arrival (DOA), also known as bearing, measurements. Although such approaches require increased computational complexity in the nodes in order to perform the DOA estimation, they can attain very low bandwidth usage as only DOA measurements need to be transmitted in the network. Moreover, they can tolerate unsynchronized input given that the sources are static or that they move at a rather slow rate relative to the analysis frame. DOA measurements describe the direction from which sound is propagating with respect to a sensor at each time instant and are an attractive approach to location estimation also due to the ease with which such estimates can be obtained: a variety of broadband DOA estimation methods for acoustic sources are available in the literature, such as the broadband MUSIC algorithm [51], the ESPRIT algorithm [52], Independent Component Analysis (ICA) methods [53], or Sparse Component Analysis (SCA) methods [54]. When the microphones at the nodes follow a specific geometry, for example, circular, methods such as Circular Harmonics Beamforming (CHB) [55] can also be applied.

In the sequel, we will first review DOA-based localization approaches when a single source is active in the acoustic environment. Then, we will present approaches for localization of multiple simultaneously active sound sources. Finally, we will discuss methods to jointly estimate the locations as well as the number of sources, a problem which is known as source counting.

6.1. Single Source Localization through DOAs

In the single source case, the location can be estimated as the intersection of bearing lines (i.e., lines emanating from the locations of the sensors in the directions of the sensors' estimated DOAs), a method which is known as triangulation. An example of triangulation is illustrated in Figure 7. The problem closely relates to that of target motion analysis, where the goal is to estimate the position and velocity of a target from DOA measurements acquired by a single moving observer or by multiple observers. Hence, many of the methods reviewed here were originally proposed for the target motion analysis problem but are outlined in the context of sound source localization in WASNs.

Considering a WASN of $N$ nodes at locations $\mathbf{q}_n$, the function that relates a location $\mathbf{x} = [x, y]^T$ with its true azimuthal DOA at node $n$ is $\theta_n(\mathbf{x}) = \operatorname{atan2}\left(y - q_{n,y},\, x - q_{n,x}\right)$ (40), where $\operatorname{atan2}$ is the four-quadrant inverse tangent function. Note that we deal with the two-dimensional location estimation problem; that is, only the azimuthal angle is needed. When information about the elevation angle is also available, location estimation can be extended to the three-dimensional space.

In any practical case, however, the DOA estimates $\hat{\theta}_n$, $n = 1, \ldots, N$, will be contaminated by noise and triangulation will not be able to produce a unique solution, calling for statistical estimators to optimally tackle the triangulation problem. This scenario is illustrated in Figure 8. When the DOA noise is assumed to be Gaussian, the ML location estimator can be derived by minimizing the nonlinear cost function [56, 57] $J_{ML}(\mathbf{x}) = \sum_{n=1}^{N} \frac{\left(\hat{\theta}_n - \theta_n(\mathbf{x})\right)^2}{\sigma_n^2}$ (41), where $\sigma_n^2$ is the variance of the DOA noise at the $n$th sensor.

As information about the DOA error variance at the sensors is rarely available in practice, (41) is usually modified to $J_{NLS}(\mathbf{x}) = \sum_{n=1}^{N}\left(\hat{\theta}_n - \theta_n(\mathbf{x})\right)^2$ (42), which is termed the nonlinear least squares (NLS) [58] cost function. Minimizing (42) results in the ML estimator when the DOA noise variance is assumed to be the same at all sensors.

While asymptotically unbiased, the nonlinear nature of the above cost functions requires numerical search methods for minimization, which come with increased computational complexity compared to closed-form solutions and can become vulnerable to convergence problems under bad initialization, poor geometry between sources and sensors, high noise, or an insufficient number of measurements. To overcome some of these problems, some methods form geometrical constraints between the measured data and achieve better convergence properties than the maximum likelihood estimator [59] or try to directly minimize the mean squared location error [60] instead of minimizing the total bearing error in (41) and (42).

Other approaches aim at linearizing the above nonlinear cost functions. Stansfield [61] developed a weighted linear least squares estimator based on the cost function of (41) under the assumption that range information is available and DOA errors are small. Under the small DOA error assumption, the sine of each bearing error can be approximated by the error itself, and the ML cost function can be rewritten as a weighted linear least squares problem, whose weights depend on the sensor-to-source ranges, which is linear and has a closed-form solution.

When range information is not available, the weight matrix can be replaced by the identity matrix. In this way, the Stansfield estimator is transformed into the orthogonal vectors estimator, also known as the pseudolinear estimator (46) [62].
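
The sketch below implements the standard pseudolinear (orthogonal vectors) triangulation in closed form: each bearing line is rewritten as a linear equation in the source coordinates and the stacked system is solved by ordinary least squares. Function and variable names are illustrative.

```python
import numpy as np

def pseudolinear_triangulation(node_pos, doas):
    """Closed-form triangulation from azimuthal DOAs (radians).

    A bearing line through node position p_n with angle theta_n satisfies
    sin(theta_n) * x - cos(theta_n) * y = sin(theta_n) * p_nx - cos(theta_n) * p_ny,
    so the source is the least squares solution of the stacked linear system.
    """
    node_pos = np.asarray(node_pos, float)          # shape (N, 2)
    doas = np.asarray(doas, float)                  # shape (N,)
    A = np.column_stack((np.sin(doas), -np.cos(doas)))
    b = np.sum(A * node_pos, axis=1)
    x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x_hat
```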

While simple in their implementation and computationally very efficient due to their closed-form solution, these linear estimators suffer from increased estimation bias [63]. A comparison between the Stansfield estimator and the ML estimator in [64] reveals that the Stansfield estimator provides biased estimates. Moreover, the bias does not vanish even for a large number of measurements. To reduce this bias, various methods have been proposed based on instrumental variables [65–67] or total least squares [57, 68].

Motivated by the need for computational efficiency, the intersection point method [69] is based on finding the location of a source by taking the centroid of the intersections of pairs of bearing lines. The centroid is simply the mean of the set of intersection points and minimizes the sum of squared distances between itself and each point in the set. To increase robustness in poor geometrical conditions, the method incorporates a scheme of identifying and excluding outliers that occur from the intersection of pairs of bearing lines that are almost parallel. Nonetheless, the performance of the method is very similar to that of the pseudolinear estimator.

To attain the accuracy of nonlinear least squares estimators while improving their computational complexity, the grid-based (GB) method [70, 71] makes the search space discrete by constructing a grid of $G$ points over the localization area. Moreover, as the measurements are angles, the GB method proposes the use of the angular distance (taking values in the range $[0, \pi]$) as a more proper measure of similarity than the absolute distance of (42). The GB method estimates the source location by finding the grid point whose DOAs most closely match the estimated DOAs from the sensors by solving $\hat{\mathbf{x}} = \arg\min_{\mathbf{g} \in \mathcal{G}} \sum_{n=1}^{N}\left[d_A\!\left(\hat{\theta}_n, \theta_n(\mathbf{g})\right)\right]^2$ (47), where $d_A(\cdot, \cdot)$ denotes the angular distance between the two arguments and $\mathcal{G}$ is the set of grid points.

To eliminate the location error introduced by the discrete nature of this approach, a very dense grid (high $G$) is required. The search for the best grid point is performed in an iterative manner: it starts with a coarse grid (low value of $G$) and, once the best grid point is found according to (47), a new grid centered on this point is generated, with a smaller spacing between grid points but also a smaller scope. Then, the best grid point in this new grid is found and the procedure is repeated until the desired accuracy is obtained, while keeping the complexity under control, as it does not require an exhaustive search over a large number of grid points. In [70] it is shown that the GB method is much more computationally efficient than the nonlinear least squares estimators and attains the same accuracy.
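
A single level of the grid search in (47) could look like the following sketch, where the angular distance is computed by wrapping the angle difference; the iterative refinement described above would simply call this routine repeatedly on progressively finer, narrower grids. Names and the grid layout are assumptions of this example.

```python
import numpy as np

def angular_distance(a, b):
    """Smallest absolute difference between two angles, in [0, pi]."""
    return np.abs(np.angle(np.exp(1j * (a - b))))

def grid_search_doa(node_pos, doas, grid):
    """Return the grid point whose predicted DOAs best match the measured ones."""
    node_pos = np.asarray(node_pos, float)   # (N, 2) node positions
    grid = np.asarray(grid, float)           # (G, 2) candidate locations
    costs = np.zeros(len(grid))
    for p, th in zip(node_pos, doas):
        pred = np.arctan2(grid[:, 1] - p[1], grid[:, 0] - p[0])
        costs += angular_distance(pred, th) ** 2
    return grid[np.argmin(costs)]
```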

6.2. Multiple Source Localization

When considering multiple sources, a fundamental problem is that the correct association of DOAs from the nodes to the sources is unknown. Hence, in order to perform triangulation, one must first estimate the correct DOA combinations from the nodes that correspond to the same source. The use of DOAs that belong to different sources will result in “ghost” sources, that is, locations that do not correspond to real sources, thus severely affecting localization performance. This is known as the data-association problem. The data-association problem is illustrated with an example in Figure 9: in a WASN with two nodes (blue circles) and two active sound sources (red circles), let the solid lines show the DOAs to the first source and the dashed lines show the DOAs to the second source. Intersecting the bearing lines will result in four intersection points: the red circles, which correspond to the true sources' locations and are estimated by using the correct DOA combinations (i.e., the DOAs from the nodes that correspond to the same source), and the white circles, which are the result of using erroneous DOA combinations (“ghost” sources).

Also, with multiple active sources, some arrays might underestimate their number, especially when some nodes are far from some sources or when the sources are close together in terms of their angular separation [54]. Thus, missed detections can occur, meaning that the DOAs of some sources from some arrays may be missing. As illustrated in [72], missed detections can occur very often in practice. In the sequel, we review approaches for the data-association and localization problem of multiple sources whose number is assumed to be known.

Some approaches tried to tackle the data-association problem by enumerating all possible DOA combinations from the sensors and deciding on the correct DOA combinations based on the resulting location estimates from all combinations. In general, if $N$ denotes the number of sensors and each of them detects $S$ sources, the number of possible DOA combinations is $S^N$.

The position nonlinear least squares (P-NLS) estimator developed in [73] incorporates the association procedure in the ML cost function, which takes the form of (50), where $\hat{\theta}_{n,k}$ is the $k$th DOA estimate of sensor $n$. To minimize (50), initial locations are estimated (one for each DOA combination) using a linear least squares estimator, such as the pseudolinear transform of (46). Then, the cost function (50) is minimized, using numerical search methods, as many times as there are initial location estimates, each time starting from a different one. Each time, for each sensor, the DOA closest to the DOA of the initial location estimate is used to take part in the minimization procedure. In that way, for all initial locations, the estimator is expected to converge to the location of a true source. However, as illustrated in [70], in the presence of missed detections and high noise the approach is not able to completely eliminate “ghost” sources.

The multiple source grid-based method [70] estimates an initial location for each possible DOA combination from the sensors by solving (47). It then decides which of the initial location estimates correspond to a true source, heuristically by selecting the estimated initial locations whose DOAs are closer to the DOAs from the combination used to estimate that location.

Other approaches focus on solving the data-association problem prior to the localization procedure. In this way, the correct association of DOAs from the sensors to the sources is estimated beforehand and the multiple source localization problem decomposes into multiple single source localization problems. In [74] the data-association problem is viewed as an assignment problem and is formulated as a statistical estimation problem which involves the maximization of the ratio of the likelihood that the measurements come from the same target to the likelihood that the measurements are false-alarms. Since the proposed solution becomes NP-hard for more than three sensors, suboptimal solutions tried to solve the same problem in pseudopolynomial time [75, 76].

An approach based on clustering of intersections of bearing lines in scenarios with no missed detections is discussed in [77]. It is based on the observation that intersections between pairs of bearing lines that correspond to the same source will be close to each other. Hence, intersections between bearing lines will cluster around the true sources, revealing the correct DOA associations, while intersections from erroneous DOA combinations will be randomly distributed in space.

Permitting the transmission of low-bandwidth additional information from the sensors can lead to more efficient approaches to the data-association problem. The idea is that the sensors can extract and transmit features associated with each source they detect. Appropriate features for the data-association problem must possess the property of being “similar” for the same source in the different sensors. Then, the correct association of DOAs to the sources can be found by comparing the corresponding features.

In [78] such features are extracted using Blind Source Separation. The features are binary masks [79] in the frequency domain for each detected source that, when applied to the captured signals, perform source separation. The extraction of such features relies on the W-disjoint orthogonality assumption [80], which states that in a given time-frequency point only one source is active, an assumption which has been shown to be valid especially for speech signals [81]. The association algorithm works by finding the binary masks from the different arrays that correlate the most. However, the method is designed for scenarios with no missed detections and, as illustrated in [72], performance significantly drops when missed detections occur. Moreover, the association algorithm is designed for the case of two sensors.

The design of association features that are robust to missed detections is considered in [72], along with a greedy association algorithm that can work with an arbitrary number of sources and sensors. The association features describe how the frequencies of the captured signals are distributed among the sources. To do that, the method estimates a DOA $\hat{\theta}(\omega, t)$ in each time-frequency point, where $\omega$ and $t$ denote the frequency and time frame index, respectively. Then, a time-frequency point is assigned to source $s$ if the following conditions are met: $s = \arg\min_{s'} d_A\!\left(\hat{\theta}(\omega, t), \hat{\theta}_{s'}(t)\right)$ (51) and $d_A\!\left(\hat{\theta}(\omega, t), \hat{\theta}_{s}(t)\right) \leq \theta_T$ (52), where $\hat{\theta}_{s}(t)$ is the DOA estimate at time frame $t$ for the $s$th source at the sensor of interest and $\theta_T$ is a predefined threshold. Equations (51) and (52) imply that each frequency is assigned to the source whose DOA is closest to the estimated DOA in this frequency, as long as their distance does not exceed a certain threshold $\theta_T$. The second condition (see (52)) adds robustness to missed detections as it rejects the frequencies with DOA estimates whose distance from the detected sources' DOAs is significantly large.
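
The assignment rule of (51) and (52) could be sketched as follows, where each narrowband DOA estimate is matched to the closest detected source and rejected if no source lies within the threshold; the vectorized layout and variable names are assumptions of this example.

```python
import numpy as np

def assign_tf_points(tf_doas, source_doas, threshold):
    """Assign time-frequency DOA estimates to detected sources per (51)-(52).

    tf_doas:     narrowband DOA estimates (radians), one per TF point,
    source_doas: broadband DOAs of the sources detected in the current frame,
    returns:     array of source indices, with -1 for rejected TF points.
    """
    tf_doas = np.asarray(tf_doas, float)
    source_doas = np.asarray(source_doas, float)
    # angular distance between every TF point and every source DOA
    diff = np.abs(np.angle(np.exp(1j * (tf_doas[:, None] - source_doas[None, :]))))
    labels = np.argmin(diff, axis=1)                   # condition (51): closest source
    labels[np.min(diff, axis=1) > threshold] = -1      # condition (52): reject distant points
    return labels
```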

6.3. Source Counting

Assuming that the number of sources is also unknown and can vary arbitrarily in time, other approaches were developed to jointly solve the source counting and location estimation problems. In these approaches, the central idea is to utilize narrowband DOA estimates (one for each time-frequency point) from the nodes in order to compute narrowband location estimates. Appropriate processing of the narrowband location estimates can infer the number and locations of the sound sources. The location for each time-frequency point is estimated using triangulation based on the corresponding narrowband DOA estimates from the sensors at that time-frequency point. Figure 10 shows an example of such narrowband location estimates and their corresponding histogram, which also describes the plausibility that a source is at a given location. The processing of these narrowband location estimates is usually done by statistical modeling methods: in [82], the narrowband location estimates are modeled by a Gaussian Mixture Model (GMM), where the number of Gaussian components corresponds to the number of sources, while the means of the Gaussians determine the sources' locations. A variant of the Expectation-Maximization (EM) algorithm is proposed that incorporates empirical criteria for removing and merging Gaussian components in order to determine the number of sources as well. A Bayesian view of the Gaussian Mixture Modeling is adopted in [83, 84], where a variant of the K-means algorithm is utilized that is able to determine both the number of clusters (i.e., the number of sources) and the cluster centroids (i.e., the sources' locations) using split and merge operations on the Gaussian components.
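
As a rough stand-in for the EM variant of [82] and the Bayesian split-and-merge schemes of [83, 84], the following sketch fits GMMs of increasing order to the narrowband location estimates and selects the order by the Bayesian information criterion; the use of scikit-learn and BIC here is an assumption of this example, not the procedure of those works.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def count_and_localize(points, max_sources=5):
    """Select the GMM order (source count) by BIC; component means give the locations."""
    points = np.asarray(points, float)          # (num_estimates, 2) narrowband locations
    best = min((GaussianMixture(n_components=k, random_state=0).fit(points)
                for k in range(1, max_sources + 1)),
               key=lambda gmm: gmm.bic(points))
    return best.n_components, best.means_
```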

7. SRP-Based Localization

Approaches based on the steered response power (SRP) have attracted the attention of many researchers due to their robustness in noisy and reverberant environments. In particular, the SRP-PHAT algorithm is today one of the most popular approaches for acoustic source localization using microphone arrays [85–87]. Basically, the goal of SRP methods is to maximize the power of the received sound source signal using a steered filter-and-sum beamformer. To this end, the method uses a grid-search procedure where candidate source locations are explored by computing a functional that relates spatial location to the TDOA information extracted from multiple microphone pairs. The power map resulting from the values computed at all candidate source locations (also known as the Global Coherence Field [88]) will show a peak at the estimated source location.

Since SRP approaches are based on the exploitation of TDOA information, synchronization issues also arise when applying SRP in WASNs. As in the case of source localization using DOAs or TDOAs, SRP-based approaches for WASNs have been proposed considering that multiple microphones are available at each node [89–91]. In these cases, the SRP method can be used for acquiring DOA estimates at each node or for collecting source location estimates that are merged by a central node. Next, we describe the fundamentals of SRP-PHAT localization.

7.1. Conventional SRP-PHAT (C-SRP)

Consider a set of $M$ different microphones capturing the signal arriving from a sound source located at a spatial position $\mathbf{x}_s$ in an anechoic scenario, following the model of (20). The SRP is defined as the output power of a filter-and-sum beamformer steered to a given spatial location. DiBiase [85] demonstrated that the SRP at a spatial location $\mathbf{x}$, calculated over a time interval of $T$ samples, can be efficiently computed in terms of GCCs:
P(\mathbf{x}) = \sum_{i=1}^{M} \sum_{j=i+1}^{M} R_{ij}\bigl(\tau_{ij}(\mathbf{x})\bigr) + \sum_{i=1}^{M} R_{ii}(0),   (53)
where $\tau_{ij}(\mathbf{x})$ is the time difference of arrival (TDOA) that a sound source located at $\mathbf{x}$ would produce; that is,
\tau_{ij}(\mathbf{x}) = \frac{\|\mathbf{x} - \mathbf{x}_i\| - \|\mathbf{x} - \mathbf{x}_j\|}{c},   (54)
where $\mathbf{x}_i$ and $\mathbf{x}_j$ are the locations of microphones $i$ and $j$ and $c$ is the speed of sound.

The last summation term in (53) is usually ignored, since it is a power offset independent of the steering location. When GCCs are computed with PHAT, the resulting SRP is known as SRP-PHAT.

In practice, the method is implemented by discretizing the location space using a search grid $\mathcal{G}$ consisting of candidate source locations within the volume of interest and computing the functional of (53) at each grid position. The estimated source location is the one providing the maximum functional value:
\hat{\mathbf{x}}_s = \arg\max_{\mathbf{x} \in \mathcal{G}} P(\mathbf{x}).   (55)
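
The following Python sketch illustrates the conventional SRP-PHAT grid search of (53)–(55) on a simulated anechoic scene (pure sample delays applied to a noise signal). The geometry, sampling rate, grid resolution, and signal generation are illustrative assumptions, not part of the original formulation.

import numpy as np

def gcc_phat(x1, x2, fs):
    # GCC-PHAT of two equal-length signals; returns correlation and lags (s).
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12                     # PHAT weighting
    cc = np.fft.irfft(cross, n)
    cc = np.concatenate((cc[-n // 2:], cc[:n // 2]))   # put zero lag in the middle
    lags = (np.arange(n) - n // 2) / fs
    return cc, lags

def srp_phat(signals, mics, grid, fs, c=343.0):
    # Conventional SRP-PHAT: sum the pairwise GCCs at the TDOAs of (54).
    power = np.zeros(len(grid))
    for i in range(len(signals)):
        for j in range(i + 1, len(signals)):
            cc, lags = gcc_phat(signals[i], signals[j], fs)
            tdoa = (np.linalg.norm(grid - mics[i], axis=1)
                    - np.linalg.norm(grid - mics[j], axis=1)) / c
            idx = np.clip(np.searchsorted(lags, tdoa), 0, len(lags) - 1)
            power += cc[idx]
    return power

# Simulated anechoic scenario: 4 microphones, one source, pure delays.
fs, c = 16000, 343.0
mics = np.array([[0, 0, 1.5], [4, 0, 1.5], [4, 3, 1.5], [0, 3, 1.5]], float)
source = np.array([1.0, 2.0, 1.5])
base = np.random.default_rng(0).standard_normal(4096)
delays = np.linalg.norm(mics - source, axis=1) / c
signals = [np.roll(base, int(round(d * fs))) for d in delays]

# Search grid (10 cm spacing) at the source height.
gx, gy = np.meshgrid(np.arange(0, 4.01, 0.1), np.arange(0, 3.01, 0.1))
grid = np.column_stack([gx.ravel(), gy.ravel(), np.full(gx.size, 1.5)])
print("C-SRP estimate:", grid[np.argmax(srp_phat(signals, mics, grid, fs, c))])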

7.2. Modified SRP-PHAT (M-SRP)

Reducing the computational cost of SRP is an important issue in WASNs, since power-related constraints in the nodes may render its implementation impractical in real-world applications. Most of the modified solutions based on SRP are aimed at reducing the computational cost of the grid-search step [92, 93]. A problem of these methods is that they are prone to discard part of the available information, leading to some performance degradation. Other recent approaches are based on analyzing the volume surrounding each point of the grid of candidate source locations [87, 94]. By taking this volume into account, the methods are able to accommodate the expected range of TDOAs at each point in order to increase the robustness of the algorithm and relax its computational complexity. The modified SRP-PHAT collects and uses the TDOA information related to the volume surrounding each point of the search grid by considering a modified functional [87]:
P'(\mathbf{x}) = \sum_{i=1}^{M} \sum_{j=i+1}^{M} \sum_{\tau = L^{l}_{ij}(\mathbf{x})}^{L^{u}_{ij}(\mathbf{x})} R_{ij}(\tau),   (56)
where $L^{l}_{ij}(\mathbf{x})$ and $L^{u}_{ij}(\mathbf{x})$ are the lower and upper accumulation limits of GCC delays, which depend on the spatial location $\mathbf{x}$.

The accumulation limits can be calculated beforehand in an exact way by exploring the boundaries separating the regions corresponding to the points of the grid. Alternatively, they can be selected by considering the spatial gradient of the TDOA, $\nabla\tau_{ij}(\mathbf{x}) = [\nabla_x, \nabla_y, \nabla_z]^T$, where each component of the gradient is
\nabla_{\gamma} = \frac{1}{c}\left(\frac{\gamma - \gamma_i}{\|\mathbf{x} - \mathbf{x}_i\|} - \frac{\gamma - \gamma_j}{\|\mathbf{x} - \mathbf{x}_j\|}\right), \qquad \gamma \in \{x, y, z\},   (57)
with $\gamma_i$ and $\gamma_j$ denoting the corresponding coordinates of microphones $i$ and $j$.

For a rectangular grid where neighboring points are separated by a distance $d$, the lower and upper accumulation limits are given by
L^{l}_{ij}(\mathbf{x}) = \tau_{ij}(\mathbf{x}) - \delta_{ij}(\mathbf{x}), \qquad L^{u}_{ij}(\mathbf{x}) = \tau_{ij}(\mathbf{x}) + \delta_{ij}(\mathbf{x}),   (58)
where $\delta_{ij}(\mathbf{x}) = \frac{d}{2}\,\|\nabla\tau_{ij}(\mathbf{x})\|\left(|\sin\theta\cos\phi| + |\sin\theta\sin\phi| + |\cos\theta|\right)$, and the gradient direction angles $\theta$ and $\phi$ are given by
\theta = \cos^{-1}\!\left(\frac{\nabla_z}{\|\nabla\tau_{ij}(\mathbf{x})\|}\right), \qquad \phi = \tan^{-1}\!\left(\frac{\nabla_y}{\nabla_x}\right).   (59)

The estimated source location is again obtained as the point in the search grid providing the maximum functional value:
\hat{\mathbf{x}}_s = \arg\max_{\mathbf{x} \in \mathcal{G}} P'(\mathbf{x}).   (60)
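
Continuing the previous sketch (it reuses gcc_phat, signals, mics, fs, and c defined there), the fragment below evaluates the modified functional of (56) using the gradient-based accumulation limits of (57)–(59); the coarse grid spacing is an arbitrary choice.

import numpy as np

def msrp_phat(signals, mics, grid, d, fs, c=343.0):
    # M-SRP: accumulate GCC values over the TDOA interval associated with
    # the cubic volume of side d surrounding each grid point.
    power = np.zeros(len(grid))
    for i in range(len(signals)):
        for j in range(i + 1, len(signals)):
            cc, lags = gcc_phat(signals[i], signals[j], fs)
            di, dj = grid - mics[i], grid - mics[j]
            ri = np.linalg.norm(di, axis=1, keepdims=True)
            rj = np.linalg.norm(dj, axis=1, keepdims=True)
            tdoa = (ri - rj).ravel() / c
            grad = (di / ri - dj / rj) / c                 # TDOA gradient, (57)
            delta = 0.5 * d * np.abs(grad).sum(axis=1)     # interval half-width
            lo = np.clip(np.searchsorted(lags, tdoa - delta), 0, len(cc) - 1)
            hi = np.clip(np.searchsorted(lags, tdoa + delta), 1, len(cc))
            hi = np.maximum(hi, lo + 1)                    # at least one sample
            csum = np.concatenate(([0.0], np.cumsum(cc)))
            power += csum[hi] - csum[lo]
    return power

# Coarse 20 cm grid over the same scenario as before.
d = 0.2
gx, gy = np.meshgrid(np.arange(0, 4.01, d), np.arange(0, 3.01, d))
grid_c = np.column_stack([gx.ravel(), gy.ravel(), np.full(gx.size, 1.5)])
print("M-SRP estimate (coarse grid):",
      grid_c[np.argmax(msrp_phat(signals, mics, grid_c, d, fs, c))])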

Figure 11 shows the normalized SRP power maps obtained by C-SRP using two different grid resolutions and the one obtained by M-SRP using a coarse spatial grid. In (a), the fine search grid shows clearly the hyperbolas intersecting at the true source location. However, when the number of grid points is reduced in (b), the SRP power map does not provide a consistent maximum. As shown in (c), M-SRP is able to fix this situation, showing a consistent maximum even when a coarse spatial grid is used.

An iterative variant of the M-SRP method was described in [95], where the M-SRP is initially evaluated using a coarse spatial grid. Then, the volume surrounding the point of highest value is iteratively decomposed by using a finer spatial grid. This approach allows obtaining almost the same accuracy as a fine-grid search with a substantially reduced number of functional evaluations, as sketched below.
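
A coarse-to-fine loop in this spirit can be written as follows (again reusing msrp_phat and the scenario above; the number of iterations and the shrink factor are arbitrary illustrative choices).

def iterative_msrp(signals, mics, center, half_size, d0, fs, c=343.0,
                   iters=4, shrink=0.5):
    # Evaluate M-SRP on a coarse grid, then repeatedly zoom into the
    # volume around the current maximum with a finer grid.
    center = np.asarray(center, float)
    half, d = half_size, d0
    for _ in range(iters):
        ax = np.arange(-half, half + 1e-9, d)
        gx, gy = np.meshgrid(center[0] + ax, center[1] + ax)
        grid = np.column_stack([gx.ravel(), gy.ravel(),
                                np.full(gx.size, center[2])])
        center = grid[np.argmax(msrp_phat(signals, mics, grid, d, fs, c))]
        half, d = half * shrink, d * shrink        # refine around the peak
    return center

print("iterative M-SRP estimate:",
      iterative_msrp(signals, mics, [2.0, 1.5, 1.5], 2.0, 0.4, fs, c))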

Finally, recent works are also focusing on hardware aspects in the nodes with the aim of efficiently computing the SRP. In this context, the use of graphics processing units (GPUs) for implementing SRP-based approaches is especially promising [96, 97]. In [98], SRP-PHAT is analyzed over a massive multichannel processing framework in a multi-GPU system, evaluating its performance as a function of the number of microphones and the available computational resources. Note, however, that the performance of SRP approaches is also related to the properties of the sound sources, such as their bandwidth or their low-pass/pass-band nature [99, 100].

8. Self-Localization of Acoustic Sensor Nodes

Methods for sound source localization discussed in previous sections assume that the locations of the acoustic sensor nodes, or those of the individual microphones, are known to the system and fixed in time. In practical situations, however, the precise locations of the sensor nodes or the microphones may not be known (e.g., in the deployment of ad hoc WASNs). Furthermore, in some WASN applications, the node locations may change over time. For these reasons, self-calibration for adjusting the known node/microphone locations, or self-localization of unknown nodes of the WASN, becomes necessary.

The methods for self-localization of WASNs can be divided into three categories [101]. The first one uses nonacoustic sensors such as accelerometers and magnetometers, the second one uses the signal strength, and the third one uses the TOA or TDOA of acoustic signals. In this section, we focus on the last category, since it has been shown to enable fine-grained, centimeter-level localization [21] and requires only the acoustic sensors.

The TOA/TDOA-based algorithms can be further divided into two types depending upon whether the source positions are known for self-localization. Most of the early works addressed this problem as microphone array calibration [102–105] with known source locations. The general problem of joint source and sensor localization was addressed by [5] as a nonlinear optimization problem. The work in [106] presented a solution to explicitly tackle the synchronization problem. In [107], a solution considering multiple sources and sensors per device was described.

This section focuses on algorithms that assume sources at known positions emitting known probe signals, without knowledge of the sensor positions or of the synchronization between the sensors and the sources. This approach allows all processing to take place on the sensor node for self-localization. The system illustration for this problem is given in Figure 12. The remainder of this section describes the TOA/TDOA-based methods for acoustic sensor localization, first modeling the inaccurate TOA/TDOA measurements for robust localization and then presenting some recent approaches.

8.1. Problem Formulation

Consider a WASN comprised of $K$ sources and $N$ nodes. The microphone locations at node $i$ can be determined from the node location, its orientation, and its microphone configuration, so in this section we consider the problem of finding the microphone location to be the same as finding the node location. Without loss of generality, we can consider the case with only one node and one microphone ($N = 1$, single microphone), because each node determines its location independently from the others. In addition, we consider that the sources are the loudspeakers of the system, with fixed and known locations in this scenario.

Let $\mathbf{m}$ and $\mathbf{s}_k$ be the single microphone position and the position of the $k$th source, for $k = 1, \ldots, K$, respectively. The goal is to find $\mathbf{m}$ by means of the received acoustic signals emitted by the $K$ sources, where the location of each source is known.

The TOF from the $k$th loudspeaker to the sensor is defined as
t_k = \frac{\|\mathbf{s}_k - \mathbf{m}\|}{c},   (61)
where $c$ is the speed of sound. Note that this equation is equivalent to (13), except that we consider a single-microphone, multisource case; thus the TOF is indexed with respect to the source index $k$ instead of the microphone index. From (61), it is evident that $\mathbf{m}$ can be found if a sufficient number of TOFs are known. In practice, we need to rely on TOAs instead of TOFs due to measurement errors.

In order to remove the effect of such unknown factors (e.g., a common playback delay shared by the sources), the TDOA can be used instead of the TOA, which is given by
t_{kl} = t_k - t_l = \frac{\|\mathbf{s}_k - \mathbf{m}\| - \|\mathbf{s}_l - \mathbf{m}\|}{c}   (62)
regarding a pair of sources $(k, l)$. Please note that the subscripts for the TDOA indicate source indexes, which are different from the sensor indexes defined in (19).

Since the probe signals generated by the sources, along with their locations, are assumed to be known to the sensor nodes, the derivations of the self-localization methods hereafter rely on the GCC between the probe signal and the signal received at the sensor. Provided that a direct line of sight between the source and the sensor is guaranteed, the time delay found by the GCC in (22) between the probe signal and the signal received at the sensor provides the TOA information.

8.2. Modeling of Time Measurement Errors

Two main factors—asynchrony and the sampling frequency mismatch between sources and sensors—can be considered for the modeling of time measurement errors. When there exists asynchrony between a source and a sensor, the TOA can be modeled as
\hat{t}_k = t_k + b_k,   (63)
where $t_k$ is the true TOF from the $k$th source to the sensor and $b_k$ is the bias caused by the asynchrony. If there exists sampling frequency mismatch, then the sampling frequency at the sensor can be modeled as $f'_s = (1 + \epsilon) f_s$, where $f_s$ is that of the source. Considering these and ignoring the rounding of the discrete-time index, the relationship between the discrete-time TOA $n_k$ and the actual TOF is given by
n_k = f'_s (t_k + b_k),   (64)
which can be rewritten, in terms of the TOA measured on the source time scale, as
\hat{t}_k = \alpha t_k + \beta_k,   (65)
where $\hat{t}_k = n_k / f_s$, $\alpha = f'_s / f_s = 1 + \epsilon$, and $\beta_k = \alpha b_k$. If the sources are connected to a playback system with the same clock, such that the sources share a common playback delay, then $b_k = b$ for all $k$. Therefore,
\hat{t}_k = \alpha t_k + \beta,   (66)
where $\beta = \alpha b$ is common to all sources.
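
The following toy example illustrates the effect of these two factors on the measured TOAs, following the model reconstructed in (63)–(66); all numeric values (geometry, bias, mismatch) are arbitrary assumptions.

import numpy as np

c, fs = 343.0, 48000
sources = np.array([[0, 0, 2], [4, 0, 2], [4, 3, 2], [0, 3, 2], [2, 1.5, 2.5]], float)
mic = np.array([1.2, 2.1, 1.0])

tof = np.linalg.norm(sources - mic, axis=1) / c   # true TOFs t_k
alpha = 1.0 + 40e-6                               # 40 ppm sampling mismatch
b = 0.050                                         # 50 ms common playback delay
n_meas = np.round(alpha * fs * (tof + b))         # discrete-time TOAs at the sensor
toa = n_meas / fs                                 # measured TOAs, ~ alpha*t_k + beta
print("true TOFs (ms):       ", np.round(1e3 * tof, 3))
print("measured TOAs (ms):   ", np.round(1e3 * toa, 3))
print("TDOAs, bias cancelled:", np.round(1e3 * (toa - toa[0]), 3))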

8.3. Least Squares Method

Given a sufficient number of TOA or TDOA estimates, the least squares (LS) method can be used to estimate the sensor position. The TOA-based and the TDOA-based LS methods proposed in [20, 21] are described in this subsection.

8.3.1. TOA-Based Formulation

Motivated by the relation in (66), the localization error corresponding to the $k$th loudspeaker can be defined as
e_k = \|\mathbf{s}_k - \mathbf{m}\| - \frac{c\,(\hat{t}_k - \beta)}{\alpha}.   (67)
If we define the error vector as $\mathbf{e} = [e_1, \ldots, e_K]^T$, with unknown parameters $\mathbf{m}$, $\alpha$, and $\beta$, then the cost function can be defined as
J(\mathbf{m}, \alpha, \beta) = \mathbf{e}^T \mathbf{e} = \sum_{k=1}^{K} e_k^2.   (68)
Then the localization problem is formulated as
(\hat{\mathbf{m}}, \hat{\alpha}, \hat{\beta}) = \arg\min_{\mathbf{m}, \alpha, \beta} J(\mathbf{m}, \alpha, \beta),   (69)
which does not have a closed-form solution.
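
Since (69) has no closed-form solution, a generic numerical solver can be used. The sketch below minimizes the reconstructed cost over $(\mathbf{m}, \alpha, \beta)$ with SciPy's least_squares, continuing the toy data of the previous example; the initial guess and the solver choice are assumptions, and $\alpha$ is only weakly observable at these short ranges, so mainly the position estimate is meaningful here.

import numpy as np
from scipy.optimize import least_squares

def toa_residuals(params, sources, toa, c=343.0):
    # Residuals e_k of (67) for the unknowns m (3D), alpha, and beta.
    m, alpha, beta = params[:3], params[3], params[4]
    return np.linalg.norm(sources - m, axis=1) - c * (toa - beta) / alpha

x0 = np.concatenate([sources.mean(axis=0), [1.0, toa.min()]])  # rough initial guess
sol = least_squares(toa_residuals, x0, args=(sources, toa))
print("estimated microphone position:", np.round(sol.x[:3], 2))
print("estimated beta (s):", round(sol.x[4], 4))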

8.3.2. TDOA-Based Formulation

We can consider the first loudspeaker ($k = 1$) as the reference loudspeaker, whose TOA and position, $\hat{t}_1$ and $\mathbf{s}_1$, serve as the reference values. Given a set of TDOAs in the LS framework, they can be used with (66) as
\hat{t}_{k1} = \hat{t}_k - \hat{t}_1 = \alpha (t_k - t_1) = \frac{\alpha}{c}\left(\|\mathbf{s}_k - \mathbf{m}\| - \|\mathbf{s}_1 - \mathbf{m}\|\right),   (70)
where $\hat{t}_{k1}$ is the TDOA between the $k$th and the first loudspeakers. With the first loudspeaker as the reference, we can define the length-$(K-1)$ vectors $\mathbf{e} = [e_2, \ldots, e_K]^T$ and $\hat{\mathbf{t}} = [\hat{t}_{21}, \ldots, \hat{t}_{K1}]^T$, with $e_k = \|\mathbf{s}_k - \mathbf{m}\| - \|\mathbf{s}_1 - \mathbf{m}\| - c\,\hat{t}_{k1}/\alpha$; the cost function can be defined as
J(\mathbf{m}, \alpha) = \mathbf{e}^T \mathbf{e},   (71)
and the LS problem can be written as
(\hat{\mathbf{m}}, \hat{\alpha}) = \arg\min_{\mathbf{m}, \alpha} J(\mathbf{m}, \alpha).   (72)
Note that the TDOA-based approach does not depend on the parameter $\beta$, unlike the TOA-based approach.

8.3.3. LS Solutions

For both the TOA- and TDOA-based approaches, the error vector can be formulated as
\mathbf{e} = \mathbf{A}\boldsymbol{\theta} - \mathbf{b},   (73)
where the elements of the vector $\boldsymbol{\theta}$ are unknown and those of the matrix $\mathbf{A}$ and the vector $\mathbf{b}$ are both known to the system. In this formulation, $\boldsymbol{\theta}$ collects the microphone position $\mathbf{m}$ together with additional terms depending on the unknown timing parameters; the detailed expressions of $\mathbf{A}$, $\mathbf{b}$, and $\boldsymbol{\theta}$ for both approaches are given in [21].

Although the elements of the vector $\boldsymbol{\theta}$ are not independent from one another, the constraints among them can be dropped for computational efficiency [15, 21]; then the nonlinear problems in (69) and (72) can be relaxed to the unconstrained least squares (ULS) problem
\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \|\mathbf{A}\boldsymbol{\theta} - \mathbf{b}\|^2,   (74)
which has the closed-form solution given by
\hat{\boldsymbol{\theta}} = (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{b}.   (75)
It has been shown that a minimum number of sources is required to find the closed-form solution for both approaches [21].

For the case when $\alpha = 1$, that is, no sampling frequency mismatch and a known speed of sound, the TDOA-based approach can be further simplified, leading to a formulation that is closely related to the methods developed for the sensor localization problem [15, 108].
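
For this simplified case, the linear-in-parameters formulation and its closed-form solution (75) can be illustrated with a generic squared-range-difference system in the spirit of [15, 108] (not necessarily the exact system of [21]). It reuses sources and toa from the earlier toy example and assumes $\alpha = 1$; the unknowns are the microphone position and its range to the reference source.

import numpy as np

c = 343.0
rd = c * (toa[1:] - toa[0])                        # range differences w.r.t. source 1
A = np.hstack([2.0 * (sources[1:] - sources[0]),   # rows: [2(s_k - s_1)^T, 2*rd_k]
               2.0 * rd[:, None]])
b_vec = np.sum(sources[1:] ** 2, axis=1) - np.sum(sources[0] ** 2) - rd ** 2
theta, *_ = np.linalg.lstsq(A, b_vec, rcond=None)  # closed form (A^T A)^{-1} A^T b
print("ULS microphone estimate:", np.round(theta[:3], 2))
print("range to reference source (m):", round(theta[3], 2))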

The self-localization results of the LS approaches are highly sensitive to the estimated values of TOA/TDOA. If they are estimated poorly, then the localization accuracy may significantly suffer from those inaccurate estimates. In order to address this issue, a sliding window technique is proposed to improve the accuracy of TOA/TDOA estimates [21].

8.4. Other Approaches

More recently, several papers have tackled the problem of how to design good probe signals between source and sensor and how to improve TOF estimation. In [22], a probe signal design based on pulse compression techniques and hyperbolic frequency-modulated signals is presented, which is capable of localizing an acoustic source and estimating its velocity and direction if it is moving. A matching pursuit-based algorithm for TOF estimation is described in [109] and refined in [101]. The joint localization of sensors and sources in an ad hoc array by using low-rank approximation methods has been addressed in [23]. In [24], an iterative peak-matching algorithm is described for the calibration of an unsynchronized wireless acoustic sensor network by means of a fast calibration process. The method is valid for nodes that incorporate a microphone and a loudspeaker and is based on a set of orthogonal probe signals assigned to the nodes of the network. The correlation properties of pseudonoise sequences are exploited to simultaneously estimate the relative TOAs from multiple acoustic nodes, substantially reducing the total calibration time. In a final step, synchronization issues are removed by following a BeepBeep strategy [106, 110], providing range estimates that are converted to absolute node positions by means of multidimensional scaling [104].
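
The core correlation step of such pseudonoise-based ranging can be sketched as follows. This is a generic matched-filtering illustration of how relative TOAs from simultaneously emitted, nearly orthogonal probe sequences can be picked from one recording, not the specific algorithm of [24]; the sequence length, delays, and noise level are assumptions.

import numpy as np

rng = np.random.default_rng(3)
fs, c = 48000, 343.0
K, L = 4, 4095                                   # number of emitting nodes, probe length
probes = rng.choice([-1.0, 1.0], size=(K, L))    # pseudonoise-like +/-1 sequences
true_delay = np.array([120, 233, 341, 478])      # arrival delays in samples (assumed)

# Received signal at the listening node: sum of delayed probes plus noise.
N = L + 600
rx = 0.05 * rng.standard_normal(N)
for k in range(K):
    rx[true_delay[k]:true_delay[k] + L] += probes[k]

# Matched filtering: correlate the recording with each known probe sequence.
est = [int(np.argmax(np.correlate(rx, probes[k], mode="valid"))) for k in range(K)]
print("estimated relative TOAs (samples):", est)
# In practice, emission-time offsets would be removed by a BeepBeep-like step.
print("naive range conversion (m):", np.round(np.array(est) / fs * c, 2))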

9. Challenges and Future Directions

9.1. Practical Challenges

Some real-world challenges arise in the design of localization systems using WASNs. To build a robust and accurate localization system, it is necessary to ensure a tradeoff among aspects related to cost, energy effectiveness, ease of calibration, deployment difficulty, and precision. Achieving such a tradeoff is not straightforward and encompasses many practical challenges, as discussed below.

9.1.1. Cost-Effectiveness

The potential of WASNs to provide high-accuracy acoustic localization is highly dependent on the underlying hardware technologies in the nodes. For example, localization techniques based on DOA, TDOA, or SRP need intensive in-node processing for computing GCCs, as well as input resources permitting multichannel audio recording. With the advent of powerful single-board computers, high-performance in-node signal processing can be easily achieved. Nonetheless, important aspects should be considered regarding cost and energy dissipation, especially for battery-powered nodes deployed at large scale.

9.1.2. Deployment Issues

Localization methods usually require a predeployment configuration process. For example, TOA-based methods usually need to properly set up synchronization mechanisms before starting to localize targets. Similarly, energy-based methods need nodes with calibrated gains in order to obtain high-quality energy-ratio measurements. All these tasks are usually complex and time-consuming. Moreover, they tend to need human supervision during an offline profiling phase. Such a predeployment phase can become even more complex in some application environments where nodes can be accessed by unauthorized subjects. Moreover, WASNs are also applied outside closed buildings; thus, they are subject to daily and seasonal temperature variations and the corresponding variations of the speed of sound [35]. To cope with this shortcoming, calibration needs to be automated and made environment-adaptive.

9.1.3. System Resiliency

Besides predeployment issues, a WASN should also implement self-configuration mechanisms to deal with network dynamics such as node failures. In this context, the system design must take into account the number of anchor nodes needed in the deployment and their placement strategy. The system should ensure that, if some of the nodes drop out of the network, the rest are still able to provide location estimates appropriately. To this end, it is important to maximize the coverage area while minimizing the number of required anchor nodes in the system.

9.1.4. Scalability

Depending on the specific application, the WASN that needs to be deployed can vary from a very small and simple network of a few nodes to very large WASNs with tens or hundreds of nodes and complex network topologies. For example, in wildlife monitoring applications a very large number of sensors are utilized to acoustically monitor very large environments, while the topology of the network can be constantly changing due to sensors being displaced by the wind or by passing animals. The challenge in such applications is to design localization and self-configuration methods that can easily scale to complex WASNs.

9.1.5. Measurement Errors

It is well known that RF-based localization in WSNs is prone to errors due to irregular propagation patterns induced by environmental conditions (pressure and temperature) and random multipath effects such as reflection, refraction, diffraction, and scattering. In the case of WASNs, acoustic signals are also subject to similar distortions caused by environmental changes and by effects produced by noise and interfering sources, reflected echoes, obstructing objects, or signal diffraction. Other errors are related to the aforementioned predeployment process, resulting in synchronization errors or inaccuracies in the positions of anchor nodes. These errors must be analyzed in order to filter out measurement noise and improve the accuracy of location estimates.

9.1.6. Benchmarking

In terms of performance evaluation, so far there are no specific benchmarking methodologies and datasets for the location estimation problem in WASNs. Due to their heterogeneity—in terms of node hardware, number of sensors and microphones, topology, and so on—works comparing different localization methods using a common sensor setup are difficult to find in the literature. The definition of formal methodologies for evaluating localization performance and the recording of evaluation datasets using real-life WASNs still remain a major challenge.

9.2. Future Directions
9.2.1. Real-Life Application

Nowadays, the need for real-life realizations of WASNs with sound source localization capabilities is becoming more and more evident. An important direction for the future will thus be the application of the localization methodologies to real-life WASNs. In this direction, the integration of methodologies from a diverse range of scientific fields will be of paramount importance. Such fields include networking (e.g., to design the communication and synchronization protocols), network administration (e.g., to organize the nodes of the network and to identify and handle potential failures), signal processing (e.g., to estimate the sources' locations, with many potential applications), and hardware design (e.g., to design acoustic nodes that can operate individually, featuring communication and multichannel audio processing capabilities in a power-efficient way). While many of these fields have flourished individually, the practical issues that will arise from their integration in practical WASNs remain largely unexplored, and the need for benchmarking and for methodologies supporting their efficient integration is becoming more and more urgent.

9.2.2. Machine Learning-Based Approaches

One of the practical challenges for the deployment of real-life applications is the huge variability of the acoustic signals received at the WASN, due to acoustic signal propagation in the physical domain as well as inaccuracies introduced at the system level and uncertainties associated with TOA and TDOA measurements. With the help of large datasets and the vastly increased computational power of off-the-shelf processors, this variability can be learned from data in order to design more robust algorithms.

10. Conclusion

Sound source localization through WASNs offers great potential for the development of location-aware applications. Although many methods for locating acoustic sources have been proposed during the last decades, most of them assume synchronized input signals acquired by a traditional microphone array. As a result, when designing WASN-oriented applications, many assumptions of traditional localization approaches have to be revisited. This paper has presented a review of sound source localization methods using measurements commonly available in WASNs, namely, energy, direction of arrival (DOA), time of arrival (TOA), time difference of arrival (TDOA), and steered response power (SRP). Moreover, since most algorithms assume perfect knowledge of the node locations, self-localization methods used to estimate the location of the nodes in the network are also of high interest in a WASN context. The practical challenges and future directions arising in the deployment of WASNs have also been discussed, emphasizing important aspects to be considered in the design of real-world applications relying on acoustic localization systems.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Authors’ Contributions

All authors contributed equally to this work.