Abstract

We propose a new acoustic self-localization and orientation estimation algorithm for smartphone networks composed of commercial off-the-shelf devices equipped with two microphones and a speaker. Each smartphone acts as an acoustic transceiver, which emits and receives acoustic signals. Node locations are found by combining estimates of the range and direction of arrival (DoA) between node pairs using a maximum likelihood (ML) estimator. A tailored optimization algorithm is proposed to simultaneously solve the DoA uncertainty problem that arises from the use of only 2 microphones per node and obtain the azimuthal orientation of each node without requiring an electronic compass.

1. Introduction

Locating the nodes in wireless networks is an essential step for many applications, where the location of the sensors gives meaning to the collected data. However, accurate knowledge about the nodes’ locations and orientations is often not readily available. In indoor scenarios, where classic positioning systems such as GPS are not viable because of a lack of coverage or limited precision, it is common to resort to relative node distance and/or position measurements from acoustic, infrared, or radio frequency (RF) signals that are exchanged among devices. The most common measurements are time of arrival (ToA), direction of arrival (DoA), or angle of arrival and received signal strength (RSS) [1]. However, the use of these measurements is not straightforward because of the random component introduced by time-varying errors (e.g., additive noise and interferences) and environment-dependent errors (wall reflections, furniture obstructions, etc.). Traditional approaches for node localization rely on beacon nodes (sometimes called anchor nodes), whose position is known a priori to a certain degree. With the beacon nodes, the locations of the remaining sensors are estimated using multilateration or multiangulation techniques [2, 3]. However, in ad hoc networks such as an opportunistic network formed by smartphones, the probability of having beacon nodes is low because of their dynamic nature. Without the beacons, relative locations can be estimated using an arbitrary coordinate frame of reference, which is commonly called node self-localization. The relative location of the nodes provides us with sufficient spatial information to implement a wireless microphone array (WMA). WMAs have many potential applications in distributed audio processing, such as speech enhancement [4], blind source separation and echo cancelation [5], speaker localization and tracking [6, 7], and voice activity detection [8].

Current-generation smartphones pack sufficient hardware so that a group of devices with the correct software can be used for many applications such as indoor positioning, pedestrian tracking, smart cities, teleconferencing, and assistive technology for the hearing impaired [9]. However, despite their potential applications, smartphones have several hardware and software limitations that must be considered, such as the limited number of available specialized sensors, their limited sampling rate, the lack of optimization of the operating system for real-time applications, and restricted hardware access.

Typically, the hardware in commercial smartphones is sufficient for different approaches to node self-localization. Some examples in the literature are the use of the RSS of RF signals [10, 11], a combination of RF and ultrasonic signals [12, 13], and different data fusion schemes [14, 15]. In Höflinger et al. [16], the authors propose an acoustic-based system using high-pitch chirps and at least 3 known-location receivers to achieve a localization error of approximately 30 cm. Node orientations are commonly obtained using an electronic compass, which is composed of a magnetometer and an accelerometer; both sensors are readily available in most smartphones. Unfortunately, magnetometer measurements are sensitive to disturbances from electric equipment (and even large metallic objects) and must be frequently calibrated to avoid large errors [17]. Typically, RF-based solutions are intended for large areas (e.g., an entire building) because they can cover wider distances at the cost of localization errors in the range of meters, whereas acoustic-based methods are used for localization within a room and achieve errors in the tens of centimeters.

When the localization procedure is based only on acoustic signals, we can discuss array geometry calibration [18]. This field encompasses different scenarios, of which distributed array configuration calibration is the most relevant for node localization because its objective is to infer the location and orientation of distributed microphone arrays with known local geometry (i.e., nodes with more than 1 microphone) using DoA measurements. A common approach is to assume a two-dimensional (2D) scenario as seen in Jacob et al. [19], where 4 arrays with 2 microphones are located to a precision up to 5 cm, which assumes that the nodes are located along the walls of a room of known dimensions. Similarly, in Plinge and Fink [20], 3 arrays with 5 microphones embedded on a table and synchronized to μs are calibrated with a precision up to 1.2 cm and 1.3° using 300 s of white noise. In Anwar et al.’s work [21], nodes with 3 microphones and RSS measurement capabilities are located within an error of 11 cm and 1.7°. These proposals have in common the use of ad hoc hardware and all of them require 3 or more microphones per node to resolve 360° azimuthal orientation.

There are different types of self-localization methods such as those based on ToA measurements. Usually these approaches involve a number of acoustic sources and microphones at unknown locations from which the Time of Flight (ToF) between source-microphone pairs is obtained. The method described in Crocco et al.’s work [22] reports localization errors in the centimeter range. It represents an improvement upon classic methods such as Thrun [23] by introducing a closed-form solution as the initial state for the error function minimization.

In this work, we propose an algorithm for node self-localization and orientation estimation for smartphone networks using acoustic signals and assuming that each node is a state-of-the-art off-the-shelf smartphone with two microphones and a speaker. The algorithm is an extension of the ideas proposed in Ayllon et al.’s work [24], particularly a modification of the Maximum Likelihood-based Distributed Optimization for Node Localization (ML-DONL) algorithm in the said work. This modification does not require previous knowledge about node orientations. The main advantage of our proposal is that we avoid the error introduced by an uncalibrated compass, which is often in excess of 15° [17]. Both the location and orientation estimates are based on closed-form expressions; however an optimization algorithm is used to resolve the DoA uncertainty needed to obtain 360° orientation estimates using only 2 microphones per node. RF signals are required for data exchange in the network, and the method assumes that the nodes are static during the localization procedure, which takes a few seconds. The proposed approach is intended for the localization of acoustic nodes in a room (there is line of sight between nodes) to create a WMA.

2. Problem Formulation

Let us consider a fully connected network composed of $N$ nodes, where each node contains a microphone array of known geometry. If we also consider a set of acoustic sources that emit from unknown locations, we can obtain a series of DoA estimates from each node to each source, so that the network geometry can be found as a combination of all estimated angles by solving a minimization problem. However, DoA-based algorithms can only find the relative geometry, and additional information is required to scale it.

In our particular case, each node is a smartphone equipped with two microphones and a speaker. Figure 1 represents a typical smartphone configuration that acts as the $i$th node. It shows the distance between the microphone pair, the distance and angle between the center of the array and the speaker, and the orientation of the node.

Our goal is to find the location and orientation of the nodes that form the network. We focus on the 2D case, where all nodes lie on the same plane, which is the typical scenario where several smartphones are resting on a table. A 3D generalization would require more than two microphones per node, which is an uncommon feature in current devices. Because we use an active approach in which the nodes themselves are the sound sources, we define a node location as the location of its speaker. Then, the localization problem is reduced to the estimation of speaker coordinates and orientations (azimuth) based on the combination of DoA and range estimates between node pairs.

2.1. DoA and Range Estimation

The proposed localization algorithm is based on the combination of DoA ($\hat{\theta}_{ij}$) and range ($\hat{r}_{ij}$) estimates, where each ($\hat{\theta}_{ij}$, $\hat{r}_{ij}$) pair is an estimate of the relative location of the $j$th speaker with respect to the $i$th node in polar coordinates.

Let us consider the microphones of node $i$ as a linear array, so that if we assume that a source (the $j$th speaker) is in the far field of the array, a plane wavefront impinges on it with an angle $\theta_{ij}$. The DoA is obtained from the Time Difference of Arrival (TDoA) between the two sensors (see Figure 2), which is given by $\tau_{ij} = d\cos(\theta_{ij})/c$, where $d$ is the intermicrophone distance and $c$ is the speed of sound. Unfortunately, a linear array (1D) in a 2D scenario can only discern DoAs between $0$ and $\pi$ radians, which leads to a problem known as DoA uncertainty. Because $\cos(\theta_{ij}) = \cos(-\theta_{ij})$, for every $\tau_{ij}$, there are two potential DoAs. Then, the measurement of the angle between node pairs is biased by the node orientation and affected by DoA uncertainty and measurement errors. Thus, we can define the estimated angle between node pairs ($\hat{\alpha}_{ij}$) as

$$\hat{\alpha}_{ij} = \psi_i + u_{ij}\,\theta_{ij} + \epsilon_{ij}, \qquad (1)$$

where $\psi_i$ is the orientation of the $i$th node, $u_{ij} \in \{-1, +1\}$ is the DoA uncertainty correction variable, and $\epsilon_{ij}$ is the DoA measurement error. Note that, in Jacob et al.’s work [19], the DoA uncertainty is not considered a problem because the 2-microphone arrays are always located along a wall, which eliminates the possibility of any sound impinging from the “back” of the array.
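The far-field relation and the resulting two-fold ambiguity can be illustrated with a minimal Python sketch. It assumes the angle model in which the global angle equals the node orientation plus or minus the measured DoA. The function names, the speed of sound (343 m/s), and the 14 cm microphone spacing are illustrative assumptions, not values taken from this work.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for room temperature


def doa_from_tdoa(tdoa_s, mic_distance_m, c=SPEED_OF_SOUND):
    """Far-field DoA in [0, pi] for a two-microphone array."""
    # Clip to the valid range so that noisy TDoAs cannot break acos.
    cos_theta = max(-1.0, min(1.0, c * tdoa_s / mic_distance_m))
    return math.acos(cos_theta)


def candidate_angles(theta, orientation):
    """Both global angles compatible with one DoA (u = +1 and u = -1)."""
    two_pi = 2.0 * math.pi
    return tuple((orientation + u * theta) % two_pi for u in (+1, -1))


# Hypothetical measurement: 0.2 ms TDoA on a 14 cm array rotated by 30 degrees.
theta = doa_from_tdoa(tdoa_s=0.0002, mic_distance_m=0.14)
front, back = candidate_angles(theta, orientation=math.radians(30.0))
```

Both candidates produce the same TDoA, so a single node cannot tell them apart. Only the combination of measurements across the network can select the correct sign.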

To obtain the DoA ($\hat{\theta}_{ij}$) and range ($\hat{r}_{ij}$) estimates between node pairs, each node emits a reference acoustic signal, which is received by every node in the network. Let these reference signals be known and denoted by $s_j(t)$, where $j$ indicates the emitter node. In this work, we use the Generalized Cross-Correlation with PHAse Transform (GCC-PHAT) to obtain the DoA estimates because of its robustness to reverberation [25]. Let $X_{i1}(f)$ and $X_{i2}(f)$ be the Fourier transforms of the signals received by the microphones of node $i$ and let $S_j(f)$ be the Fourier transform of the reference signal emitted by node $j$. The GCC-PHAT of each microphone signal and the reference signal is given by

$$R_{ijk}(\tau) = \int_{-\infty}^{\infty} \frac{X_{ik}(f)\,S_j^{*}(f)}{\left|X_{ik}(f)\,S_j^{*}(f)\right|}\, e^{2\pi\sqrt{-1}\,f\tau}\, df, \quad k = 1, 2, \qquad (2)$$

where $f$ is the frequency and $\tau$ is the time lag. The time difference between the two signals corresponds to the point where the value of the GCC-PHAT function is at its maximum:

$$\hat{t}_{ijk} = \arg\max_{\tau} R_{ijk}(\tau). \qquad (3)$$

Because we correlate with a known signal, $\hat{t}_{ij1}$ and $\hat{t}_{ij2}$ are the times of arrival (ToA) of that signal at each microphone. Then, the TDoA between microphones can be easily computed as the difference between ToAs, $\hat{\tau}_{ij} = \hat{t}_{ij1} - \hat{t}_{ij2}$, from which the DoA is directly estimated as

$$\hat{\theta}_{ij} = \arccos\!\left(\frac{c\,\hat{\tau}_{ij}}{d}\right). \qquad (4)$$

The range between node pairs is measured using ToF. Assuming that the nodes are synchronized, that is, every node in the network shares a common timebase and an identical sampling frequency $f_s$, the problem of range estimation is reduced to

$$\hat{r}_{ij} = c\left(\frac{\hat{t}_{ij1} + \hat{t}_{ij2}}{2} - t_j\right), \qquad (5)$$

where $t_j$ is the time when the $j$th node emits its signal and $\hat{t}_{ij1}$ and $\hat{t}_{ij2}$ are the time instants when the $i$th node receives that signal at both of its microphones (ToA). Notice that because the nodes are equipped with two microphones, we take the average of the ToAs to obtain the ToA at the center of the array. The specific methods to obtain internode synchronization fall outside the scope of this paper, although there are multiple solutions in the literature, for example, Sur et al. [26].
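The measurement chain described above (ToA per microphone from a PHAT-weighted correlation with the known reference, then TDoA, DoA, and range) can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the function names, the speed of sound (343 m/s), the 14 cm microphone spacing, the sign convention of the TDoA, and the synthetic delays are all hypothetical.

```python
import numpy as np


def gcc_phat_toa(received, reference, fs):
    """Delay (s) of `reference` inside `received` via PHAT-weighted correlation."""
    n = len(received) + len(reference)      # zero-pad to avoid circular wrap
    X = np.fft.rfft(received, n=n)
    S = np.fft.rfft(reference, n=n)
    cross = X * np.conj(S)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting keeps only the phase
    corr = np.fft.irfft(cross, n=n)
    return int(np.argmax(corr)) / fs


def doa_and_range(toa_mic1, toa_mic2, emit_time, mic_distance, c=343.0):
    """DoA from the TDoA and range from the mean ToA (synchronized nodes)."""
    tdoa = toa_mic1 - toa_mic2
    theta = float(np.arccos(np.clip(c * tdoa / mic_distance, -1.0, 1.0)))
    distance = c * (0.5 * (toa_mic1 + toa_mic2) - emit_time)
    return theta, distance


# Synthetic check: the same reference delayed by 300 and 310 samples.
fs = 44100
gen = np.random.default_rng(0)
reference = gen.standard_normal(4096)
mic1 = 0.01 * gen.standard_normal(8192)
mic2 = 0.01 * gen.standard_normal(8192)
mic1[300:300 + 4096] += reference
mic2[310:310 + 4096] += reference

toa1 = gcc_phat_toa(mic1, reference, fs)
toa2 = gcc_phat_toa(mic2, reference, fs)
theta_hat, range_hat = doa_and_range(toa1, toa2, emit_time=0.0, mic_distance=0.14)
```

In a reverberant room the largest correlation peak may belong to a reflection instead of the direct path, in which case both the DoA and the range from this sketch would be biased.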

3. Proposed Node Localization Method

In this section, we explain how the DoA and range estimates taken by the nodes are combined in order to obtain their locations and orientations.

3.1. ML Estimator of Node Locations

Let us consider that a full set of estimates of the range and incidence angle between node pairs is available, and each estimate has an error with standard deviation $\sigma_r$ and $\sigma_\alpha$, respectively. The objective is to estimate the vector of node positions from the measurements, taking their standard deviations into account. Each polar measurement (azimuth and distance pair) is transformed into Cartesian coordinates $(\hat{x}_{ij}, \hat{y}_{ij})$, where $\hat{x}_{ij} = \hat{r}_{ij}\cos(\hat{\alpha}_{ij})$ and $\hat{y}_{ij} = \hat{r}_{ij}\sin(\hat{\alpha}_{ij})$, for all $i$ and $j$ from 1 to $N$.

Let us also consider the joint probability density function (PDF) of the measurements in Cartesian coordinates as a multivariate normal distribution. In Ayllon et al.’s work [24], an expression for this PDF is proposed, referred to here as (6). It depends on the covariance matrix of the PDF related to the measurement vector of the $i$th node to the $j$th node and on a column vector that contains the coordinates of the latter. It is possible to obtain the most likely node locations using a maximum likelihood estimator, where the log-likelihood of a given geometry is calculated using expression (7). Plugging (6) into (7) and simplifying, expression (8) is obtained.

Assuming that all the covariance matrices are equal and proportional to the identity matrix, so that $\Sigma_{ij} = \sigma^2 \mathbf{I}$, we can obtain the solution with a closed-form expression, referred to here as (9). This is equivalent to assuming that the variables of the PDF are independent and their standard deviation is constant. This way, every estimate has the same weight and the value of $\sigma$ has no effect on the localization result. Please refer to Ayllon et al. [24] for a complete description of the ML location estimator. In this work, we use the method denominated “Naive Covariance Matrix Estimation.”
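Under equal, isotropic covariances the ML problem reduces to ordinary least squares, which admits a closed form. The sketch below is our own derivation of that least-squares solution and is not claimed to be the exact expression from [24]. It assumes that each measurement $z_{ij}$ approximates $p_j - p_i$ in a common frame and that the origin is placed at the centroid of the nodes; the function names and test geometry are hypothetical.

```python
import math


def polar_to_cartesian(ranges, angles):
    """Turn r[i][j], alpha[i][j] into relative vectors z[i][j] = (x, y)."""
    n = len(ranges)
    return [[(ranges[i][j] * math.cos(angles[i][j]),
              ranges[i][j] * math.sin(angles[i][j])) for j in range(n)]
            for i in range(n)]


def least_squares_positions(z):
    """Minimizer of sum ||z_ij - (p_j - p_i)||^2 with the centroid at the origin.

    Setting the gradient to zero gives p_k = sum_i (z_ik - z_ki) / (2 N).
    """
    n = len(z)
    positions = []
    for k in range(n):
        x = sum(z[i][k][0] - z[k][i][0] for i in range(n) if i != k) / (2 * n)
        y = sum(z[i][k][1] - z[k][i][1] for i in range(n) if i != k) / (2 * n)
        positions.append((x, y))
    return positions


# Noise-free check with three nodes whose centroid is the origin.
true_positions = [(-1.0, -1.0), (2.0, 0.0), (-1.0, 1.0)]
n = len(true_positions)
ranges = [[math.dist(true_positions[i], true_positions[j]) for j in range(n)]
          for i in range(n)]
angles = [[math.atan2(true_positions[j][1] - true_positions[i][1],
                      true_positions[j][0] - true_positions[i][0])
           for j in range(n)] for i in range(n)]
estimated = least_squares_positions(polar_to_cartesian(ranges, angles))
```

Because every pair contributes two measurements (one in each direction), an error in a single link is averaged with the remaining links instead of being propagated directly.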

Most self-localization methods (including Jacob et al. [19] and Plinge and Fink [20]) use some kind of iterative optimization algorithm to find the node locations. It is common to minimize a pairwise distance error function such as

$$E = \sum_{i=1}^{N}\sum_{j \neq i}\left(\hat{r}_{ij} - \tilde{r}_{ij}\right)^2, \qquad (10)$$

where $\hat{r}_{ij}$ is the measured distance (range) between nodes $i$ and $j$ (obtained either directly, e.g., from ToF, or indirectly, e.g., from TDoA triangulation) and $\tilde{r}_{ij}$ is the distance between their estimated locations. However, it is important to note that our ML estimator is a closed-form method.

3.2. Orientation Estimation

To obtain the angle between node pairs $\hat{\alpha}_{ij}$ from the DoA estimate $\hat{\theta}_{ij}$, first, we must know the orientation of the $i$th node and solve the DoA uncertainty as shown in (1). Any error in the orientation estimation is directly added to $\hat{\alpha}_{ij}$, which poses a problem for the estimation of the node locations. Because the digital compass in smartphones is commonly uncalibrated, it introduces a large error that frequently outweighs that of the DoA estimation. Thus, we decided to estimate the orientation of the nodes using the available information instead of relying on an imprecise measurement.

Let us consider that the nodes have their sound source at the center of their microphone array and that we know the value of the true angle between node pairs (i.e., the actual value without any error). In this scenario, the angles measured in both directions between a node pair differ by $\pi$ rad. Now, if we introduce the approximation from (1) and relax the first assumption to the condition that the distance between the center of the array and the speaker is much smaller than the distance between the nodes, we arrive at the generalization referred to as expression (11).

Figure 3 shows the angular relations between node pairs. Notice that when the distance between the nodes is sufficiently large, the error introduced by the speaker not being located at the array center is negligible.

Taking expression (11) into the complex plane, after exponentiation and some operations, it becomes expression (12).

Now, to estimate the orientations, we can force a relative orientation reference in which the orientation of the first node is zero, arriving at expression (13). Plugging expression (13) into (12), we obtain the final expressions for the orientation estimation, referred to as (14).

To obtain each node orientation, several estimates are available, the quality of which is directly related to the errors in the DoA estimates and in the DoA uncertainty correction variables. Since the latter are unknown and must also be estimated, they are the most unreliable. During the optimization process discussed in the next section, each orientation estimate is obtained by taking the trimmed mean of the available estimates, which makes the results more robust against outliers created by erroneous correction values.

With the orientation of the first node fixed at zero, we establish a relative coordinate system. The points in this space are translated and rotated with respect to the true geometry; it suffices to know the actual position and orientation of one of the nodes (i.e., having a beacon node) to transform the results to a global coordinate system.
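The idea of fixing the first node as the orientation reference can be sketched as follows. This is a simplified illustration, not the estimator of this section: it assumes the angle model "global angle = node orientation + sign × DoA", assumes the speaker sits at the array center, and uses only the direct link with the reference node instead of a trimmed mean over all available estimates. All names and the test geometry are hypothetical.

```python
import math


def wrap(angle):
    """Wrap an angle to [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def orientations_from_reference(theta, u):
    """Orientation of every node relative to node 0, whose orientation is 0.

    Uses alpha_0i = u_0i * theta_0i and alpha_i0 = alpha_0i + pi, so that
    psi_i = alpha_0i + pi - u_i0 * theta_i0.
    """
    psi = [0.0]
    for i in range(1, len(theta)):
        psi.append(wrap(u[0][i] * theta[0][i] + math.pi - u[i][0] * theta[i][0]))
    return psi


# Noise-free synthetic network used to check the relation.
positions = [(0.0, 0.0), (1.0, 0.5), (-0.5, 2.0)]
true_psi = [0.0, 0.7, -2.0]
n = len(positions)
theta = [[0.0] * n for _ in range(n)]
u = [[1] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i == j:
            continue
        alpha = math.atan2(positions[j][1] - positions[i][1],
                           positions[j][0] - positions[i][0])
        relative = wrap(alpha - true_psi[i])   # angle seen in the node frame
        theta[i][j] = abs(relative)            # a 2-mic array only sees [0, pi]
        u[i][j] = 1 if relative >= 0 else -1   # the sign the array cannot observe

estimated_psi = orientations_from_reference(theta, u)
```

With wrong signs in `u` the recovered orientations change, which is why the orientation estimate and the uncertainty correction have to be solved jointly.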

3.3. Uncertainty Solution

At this point, all other quantities are assumed to be known, so the estimation of the node locations depends on a given DoA uncertainty correction matrix $\mathbf{U}$. However, its actual value is unknown, and we must work with the estimate $\hat{\mathbf{U}}$ (composed of $N^2$ values). Because the uncertainty correction is a binary variable, there are $2^{N^2}$ possible values for $\hat{\mathbf{U}}$, which makes it unfeasible to test every single value. Thus, we decided to use a Genetic Algorithm (GA) to find the solution. It is important to highlight that the main diagonal of $\hat{\mathbf{U}}$ is of no interest (the case when $i = j$) and does not need to be estimated, which reduces the maximum number of combinations to $2^{N(N-1)}$.
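To give a sense of scale, the following short sketch counts the candidate correction matrices once the main diagonal is ignored. The function name is only illustrative.

```python
def search_space_size(n_nodes):
    """Number of sign matrices with N(N - 1) free off-diagonal binary entries."""
    return 2 ** (n_nodes * (n_nodes - 1))


sizes = {n: search_space_size(n) for n in (3, 5, 10)}
# sizes[3] == 64, sizes[5] == 1048576, sizes[10] == 2 ** 90
```

A network of three nodes can still be searched exhaustively, but the count for ten nodes is far beyond what an exhaustive search can cover, which motivates a heuristic search.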

We have found a clear relation between the log-likelihood for a certain $\hat{\mathbf{U}}$ and the localization error. Thus, we propose using expression (8) as the fitness function. Figure 4 shows the relation between the log-likelihood and the pairwise node distance error for all possible values of $\hat{\mathbf{U}}$ in an example network, which confirms that the selected fitness metric is directly related to the location error.

To improve the convergence of the optimization algorithm with respect to the total number of fitness evaluations, instead of using a single GA and several runs (standard scheme), we use an elimination tournament of small GAs. We start with a set of 64 small GAs (the first stage of the tournament), each with the same population size and number of generations. The best solutions of the first round are then paired, generating a new population for every two winners, which are set to compete in the next round. The process is repeated until a global winner is obtained. For illustrative purposes, Figure 5 shows an example of the elimination tournament. In our case, the tournament starts from 64 GAs and the number of GAs is halved in every round, which gives 7 rounds and 127 GAs in total; this configuration empirically gave us good convergence results.

The GA is divided into 7 steps:
(1) The algorithm is initialized by creating a population of individuals. Each individual contains one gene for every off-diagonal element of $\hat{\mathbf{U}}$. In the first round of the tournament, the genes are randomly selected; for every subsequent round, they are created by reproduction and mutation from the previous stage winners (steps (4) and (5)).
(2) The population is evaluated. For every individual, node orientations are estimated as described in the previous section, and then the log-likelihood (fitness function) is computed with (8).
(3) The individuals are sorted according to their fitness level in descending order. A fixed top-performing fraction is selected to breed a new generation. The remainder of the population is discarded.
(4) The population is regenerated via the reproduction of successful individuals. For every new individual, two parents are selected at random, each of which randomly provides half of its genes.
(5) Except for the best performer, the full population is mutated by selecting a fixed fraction of their genes at random and inverting their value. Since the probability of a change in $u_{ij}$ involving a change in $u_{ji}$ is very high, a fixed share of the mutations change the sign of both genes. After mutation takes place, the new generation is complete.
(6) If the iteration counter is lower than the number of generations, the algorithm returns to step (2), and the iteration counter is increased; otherwise, it continues to the last step.
(7) The best individual is selected as a candidate and is set to compete in the next round of the tournament.

After the GA tournament is completed, the best individual becomes the final estimate $\hat{\mathbf{U}}$ and is used to estimate the final node positions and orientations.
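The elimination tournament of small GAs can be sketched in a few lines of Python. This is a simplified illustration under several assumptions: the population size, number of generations, kept fraction, and mutation rate are hypothetical values; uniform crossover stands in for the "half of the genes from each parent" rule; the paired mutation of symmetric genes is omitted; and a toy fitness (agreement with a hidden sign vector) replaces the log-likelihood of expression (8).

```python
import random


def run_small_ga(fitness, n_genes, seed_population=None, pop_size=20,
                 generations=15, keep_fraction=0.25, mutation_rate=0.1, rng=None):
    """One small GA over sign vectors in {-1, +1}; returns its best individual."""
    rng = rng or random.Random()
    # Step 1: random genes, or the individuals handed over by the tournament.
    population = seed_population or [
        [rng.choice((-1, 1)) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):                      # step 6: iteration counter
        # Steps 2-3: evaluate, sort in descending order, keep the top fraction.
        population.sort(key=fitness, reverse=True)
        parents = population[:max(2, int(keep_fraction * pop_size))]
        children = [parents[0][:]]                    # best performer is not mutated
        while len(children) < pop_size:
            # Step 4: every gene comes from one of two randomly chosen parents.
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]
            # Step 5: invert a random subset of genes.
            child = [-g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
    return max(population, key=fitness)               # step 7: candidate


def tournament(fitness, n_genes, first_round=8, rng=None):
    """Elimination tournament; `first_round` must be a power of two."""
    rng = rng or random.Random()
    winners = [run_small_ga(fitness, n_genes, rng=rng) for _ in range(first_round)]
    while len(winners) > 1:
        # Every two winners seed the population of one GA in the next round.
        winners = [run_small_ga(fitness, n_genes, seed_population=[a[:], b[:]], rng=rng)
                   for a, b in zip(winners[0::2], winners[1::2])]
    return winners[0]


# Toy problem: 12 genes, as in a 4-node network with N(N - 1) free signs.
rng = random.Random(1)
hidden = [rng.choice((-1, 1)) for _ in range(12)]
best = tournament(lambda genes: sum(g * h for g, h in zip(genes, hidden)), 12, rng=rng)
```

Each small GA in a round is independent of the others, so the calls inside each list comprehension could be distributed across nodes without changing the logic.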

It is important to highlight that, although the computational cost of the optimization algorithm is quite high, the small GAs can be distributed among the nodes of the network, since the parallelization of the elimination tournament is trivial. The tournament is composed of 127 GAs divided among $N$ nodes, so in the worst scenario a node has to take care of $\lceil 127/N \rceil$ GAs. As a rough approximation, if the closed-form expressions of the ML estimator and the orientation estimator are counted together as a single operation, the number of operations that each node computes is the product of the number of GAs it runs, the number of iterations per GA, and the population size. The computational load of the optimization algorithm (for one node) is therefore much higher than that of the estimations using the closed-form expressions alone. Note that the need for an iterative algorithm is a direct consequence of the DoA uncertainty. If each node were capable of resolving 360° DoAs (by having 3 or more microphones arranged in a 2D array), the solution to the problem would be found directly.

4. Experiments and Results

To evaluate the proposed algorithm, we generated a realistic database of acoustic signals, which contains 300 different scenarios including both reverberation and background noise. Reverberation was controlled by the absorption coefficient of the walls. Background noise was added as additive white noise controlled by the signal-to-noise ratio (SNR). Each scenario contained 10 randomly located and oriented nodes and was generated with a random combination of the next parameters: room dimensions of 6–12 m long/wide and 2-3 m high, absorption coefficient of 0.5–1, and SNR of 5–20 dB. The positions of the nodes were restricted as if they were on a table of dimensions of  m (a medium-sized conference table) with a minimum distance between nodes of 15 cm. The acoustic signals received by the microphones were generated using a room impulse response generator, which was computed using the simple image method described in Allen and Berkley’s work [27] at a sampling frequency of 44100 Hz.

The reference acoustic signal emitted by the nodes is a band-limited white noise signal (500 Hz–16 kHz) with a length of 4096 samples, or 92.9 ms at a sampling frequency of 44100 Hz. Each device has its unique reference signal, which is known by every node in the network. The selected frequency range is related to the frequency response of typical smartphone speakers, whereas the time duration is a tradeoff between computational complexity and robustness against the SNR. Notice that a short time duration has the added benefit of making the localization process less disturbing to users who are exposed to the reference signals.
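A reference signal of this kind can be generated with a few lines of NumPy. The sketch below is one possible construction and is not claimed to be the one used in the experiments: it applies an ideal (brick-wall) band-pass in the frequency domain, normalizes the peak amplitude, and uses a per-node seed to obtain distinct signals. The function name and the seeding scheme are assumptions.

```python
import numpy as np


def band_limited_noise(n_samples=4096, fs=44100, f_low=500.0, f_high=16000.0, seed=0):
    """White Gaussian noise restricted to [f_low, f_high] Hz, peak-normalized."""
    rng = np.random.default_rng(seed)
    spectrum = np.fft.rfft(rng.standard_normal(n_samples))
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    spectrum[(freqs < f_low) | (freqs > f_high)] = 0.0   # brick-wall band-pass
    signal = np.fft.irfft(spectrum, n=n_samples)
    return signal / np.max(np.abs(signal))


# One distinct reference per node, identified here by the seed.
references = [band_limited_noise(seed=node) for node in range(3)]
duration_ms = 1000.0 * len(references[0]) / 44100
```

A smoother band-pass (for example, a windowed filter) would reduce ringing at the band edges; the ideal mask is used here only to keep the sketch short.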

Because achieving tight time synchronization between smartphones is not trivial, the synchronicity between nodes was also set at random. All nodes shared an identical sampling frequency , but their clock starting point was biased using a uniform distribution to simulate a loose synchronization between nodes. This clock jitter translates into a range estimation error in meters. For the experiments, the standard deviation of the range estimation was fixed at 3 different values,  m,  m, and  m, depending on the synchronization jitter.

The last consideration is the coordinate system. The origin of coordinates was set at the center of mass of the node locations in the localization process; however, we can assume without loss of generality that the first node is located at the origin of the coordinates. Then, the translated locations were found by subtracting the coordinates of the first node. Hence, with the condition set for the orientation estimation, the localization results are provided in relation to the first node. In the localization example in Figure 6, we observe that when this reference system is used, the estimated and true locations of the first node are identical.

In order to set a comparison with the proposed method, we have implemented 2 of the methods available in the literature, namely, Jacob et al. [19] and Crocco et al. [22].

The method presented in Jacob et al.’s work [19] is based on angle measurements alone. In order to adapt it for the use of range measurements, the solution is scaled to minimize the difference with the measured range values as described in Schmalenstroeer et al.’s work [28]. It is important to highlight that this method only works without DoA uncertainty (3 or more microphones per node) and so, in order to obtain the results, we assumed that the nodes were capable of measuring 360° DoAs using only 2 microphones, which is physically impossible.

The method described in Crocco et al.’s work [22] only uses range measurements, since it is intended for nodes with a single microphone. This method is not capable of discerning between reflected solutions and so, in order to obtain the results, we considered all the possible reflections. Notice that we obtain the range estimates by averaging the ToAs at both microphones; thus this method is not capable of obtaining orientation estimations. In case the ToAs were obtained at each microphone, it should be possible to also estimate the orientations by adding some constraints (known distance between same node microphones), although in [22] this is not considered.

4.1. Result Discussion

Table 1 shows the mean, the standard deviation (Std.), and the trimmed mean (trim) of the localization error obtained with the proposed algorithm and those obtained with Jacob et al. [19] and Crocco et al. [22], all of them working without previous knowledge about the node orientations. Please notice that Crocco et al. [22] do not consider node orientations and that Jacob et al. [19] use DoA estimates covering 360°, while the presented method is based on 180° DoA estimates. Of these methods, the proposed method obtains the best overall results except for , where Crocco et al.’s method [22] is better for large network sizes () due to convergence problems on the DoA uncertainty estimation. This effect can be noticed by looking at the trim and Std. for the proposed method. It is possible to see that while the trimmed mean follows a descending trend when the network size is enlarged, the Std. grows larger.

The performance of Crocco et al.’s method [22] is affected by the range estimation error derived from the synchronization lag. The sensitivity of this method to range estimation errors is evident when comparing the results obtained with increasing values of the range standard deviation. Jacob et al.’s method [19] is less affected by the range error, since the geometry is found using the DoAs and the range is only employed to scale the solution. It is worthwhile to highlight again that Jacob et al.’s method [19] is not capable of solving the DoA uncertainty problem, so it cannot be used with only 2 microphones per node.

Table 2 shows the mean, Std., and trim of the localization error obtained with the ML-DONL algorithm presented in Ayllon et al.’s work [24] with known orientations (assumed to be obtained with an electronic compass), with and without orientation measurement error. Comparing this table with the previous one, we observe that the error obtained with the presented method using orientation estimates is between those obtained using measured orientations: it is larger than that of the ideal case but much smaller than when the typical error of an uncalibrated compass is introduced.

Table 3 shows the mean, Std., and trim of the orientation estimation error obtained with the proposed algorithm and those obtained by Jacob et al. [19]. The proposed method yields larger errors than Jacob et al.’s method [19], although it is worth recalling that the latter does not have to deal with DoA uncertainty. From the table, we can observe that orientation estimation is independent of the range estimation error, since it only uses DoA measurements. With both methods, the orientation estimation error is lower than that of a typical digital compass, which renders the compass unnecessary for this particular application. Figure 7 shows a box plot of the results obtained for all the tested algorithms in one of the tested configurations, from which it is easier to see how the different algorithms perform.

A deeper analysis of the ML location estimation has revealed that large localization errors are associated with large DoA estimation errors, that is, those instances when the largest peak of the correlation corresponds to a reflection instead of the direct signal. Some proposals in the literature use outlier detection techniques to reduce the effect of spurious measurements. Jacob et al. [19] used random sample consensus (RANSAC) for the minimization algorithm (not implemented in our version); in Plinge and Fink’s work [20], outliers were detected by applying a threshold to the estimation error. Our current implementation does not contemplate outlier detection; therefore, the obtained errors have large variances.

Regarding the number of nodes, localization accuracy usually increases with larger networks. This result is expected because there is more information available; thus, it is easier to compensate for large local estimation errors (either DoA or range) in one or several nodes. However, due to DoA uncertainty, the proposed method has some convergence problems with large networks that need to be addressed.

To the best of our knowledge, our proposal is (together with our previous work in Ayllon et al. [24]) the only method capable of 2D DoA-based distributed array configuration calibration using nodes equipped with only 2 microphones.

5. Conclusions

In this paper, we have presented a new self-localization algorithm for wireless smartphone networks composed of commercial off-the-shelf devices that are equipped with two microphones and a speaker. The entire localization process is based on DoA and range estimates between node pairs obtained with acoustic signals. The main novelty of this work is a modification of the previously presented ML-DONL algorithm, which enables us to locate the nodes even without prior knowledge about their orientation. Thus, we eliminate the requirement for an electronic compass. The nodes are located by finding the position of their speaker and estimating their orientation while solving the DoA uncertainty problem, which arises from the use of only 2 microphones per node. The obtained localization error is lower than that obtained when an uncalibrated electronic compass is used, which is the most common scenario for off-the-shelf smartphones. In summary, the proposed algorithm improves on the localization accuracy of other methods that require reference nodes or additional sensors, and its accuracy is on the same scale as that of other DoA-based algorithms without requiring ad hoc hardware. In addition, the computational cost of the algorithm is affordable for current mobile processors. However, solving the DoA uncertainty with a GA tournament adds a significant computational load, which makes it worthwhile to explore more efficient solutions. Future work will address spurious measurements using outlier detection techniques and will study different approaches to the DoA uncertainty estimation, because these are the main sources of error.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work has been funded by the Spanish Ministry of Economy and Competitiveness/FEDER under Project TEC2015-67387-C4-4-R.