Abstract

The use of microphone arrays for sound-source localization is a well-researched topic. The response of such sensor arrays depends on the number of microphones in the array. A higher number of microphones, however, increases the computational demand, making real-time response challenging. In this paper, we present a Filter-and-Sum based architecture and several acceleration techniques to provide accurate sound-source localization in real-time. Experiments demonstrate how an accurate sound-source localization is obtained within a couple of milliseconds, independently of the number of microphones. Finally, we also propose different strategies to further accelerate the sound-source localization while offering increased angular resolution.

1. Introduction

Most of the signal processing needed by microphone arrays is traditionally done using general-purpose processors. However, the computational demand is directly related to the number of microphones in the array, and this number is increasing drastically now that low-cost MEMS microphones are readily available. Current FPGAs are a potential solution thanks to their high computational power and low-latency response. In fact, FPGAs have already been considered by other researchers, mainly for converting the analogue or digital microphone signals into an audio format [1, 2] without further signal processing. We believe that FPGAs are not only able to manage relatively large microphone arrays, but also enable a faster response when compared to general-purpose processors.

In order to satisfy the most time-stringent sound-source localization applications, which also use an increasing number of microphones, we propose a flexible, scalable, and real-time architecture. The main targets are the performance, scalability, and accuracy of the system when detecting the direction of sound sources in real-time. Furthermore, we propose several techniques built on our architecture to accelerate the sound-source localization and guarantee real-time detection.

The architecture presented in this paper is an improved and more detailed version of the one presented in [3]. Because this novel architecture is designed to be part of an embedded system, resource and power consumption are analysed together with the performance of the system. A frequency analysis is also done based on design parameters such as the number of microphones or the number of orientations. Altogether, this leads to an architecture whose frequency response satisfies the basic needs of an application requiring real-time sound-source localization.

The main contributions of this work can be summarized as follows:
(i) A Filter-and-Sum based architecture for fast sound-source localization.
(ii) A complete frequency and performance analysis of the system.
(iii) Strategies to speed up the overall execution time.

This paper is organized as follows. Section 2 presents related work. The principles used for the sound-source localization are introduced in Section 3. In Section 4 our proposed architecture is detailed. A complete time analysis and different strategies to increase performance are presented in Section 5. In Section 6 the proposed architecture is analysed. Finally, the conclusions are drawn in Section 7.

2. Related Work

The use of microphone arrays for sound-source localization is a well-researched problem, whose complexity increases with the number of microphones involved and the required response time of the application. The response time is crucial for applications such as counter-sniper systems [4, 5]. Such military systems are composed of microphone arrays mounted on top of a soldier's helmet and connected to an FPGA for signal processing. A similar approach is applied in [6], where the authors present a hat-type hearing system composed of an array of 48 digital MEMS microphones with an FPGA as the computational component. Their main target is a hearing aid system which amplifies the sound coming from a certain direction by up to 10 dB. This type of application demands a fast response of the system while being power efficient.

Indoor applications, such as videoconferencing, home surveillance, and patient care, also make use of microphone arrays for speech detection [1, 7]. The work in [1] describes the design and implementation on an FPGA of an eight-element digital MEMS microphone array for distant speech recognition. In [8] the authors propose a beamforming-based acoustic system for localization of the dominant noise source. The signal acquisition consists of a microphone array composed of up to 33 MEMS microphones, whereas the PDM demodulation and the beamforming are implemented in an FPGA. The implementation in the FPGA is completed with delay-and-sum beamforming, measuring 60 angles and generating a polar map for directivity pattern presentation. Another example is proposed in [9], in which the sound-source localization is obtained by using distributed microphone arrays in a wireless sensor network (WSN). The distributed information collected by the nodes is transferred and processed using data-fusion techniques in order to locate and profile the sound sources. Despite the fact that they implement most of the processing components on an FPGA, the 64k-FFT component becomes so large and resource hungry that it is not suitable for low- and middle-end FPGAs. In both publications, however, the solutions are neither scalable nor adaptable to dynamic acoustic environments. Furthermore, they do not provide information about how fast their systems can be. Instead, we present a detailed description and analysis of a flexible, scalable, and real-time architecture.

3. Sound-Source Localization

Our microphone array is designed to spatially sample its surrounding sound field in order to detect and to locate certain types of sound sources. A 360° sound power scan is performed for a configurable number of orientations. A beamforming technique focuses the array in one specific direction or orientation, by amplifying all sounds coming from that direction and by suppressing sounds coming from other directions. A polar power plot is obtained from which the lobes can be used to estimate the nearby sound sources. Figure 1 shows the functional elements required to locate the sound-source, which involve several filters, a beamformer, and a relative sound power estimator.

3.1. Microphone Array Description

The sensor array is composed of 52 digital MEMS microphones and designed for far-field and nondiffuse sound fields [9]. The array pattern consists of four concentric subarrays of 4, 8, 16, and 24 MEMS microphones mounted on a 20 cm circular printed circuit board (Figure 2). Each subarray is positioned differently in order to facilitate the capture of spatial acoustic information using a beamforming technique. Furthermore, the sensor array response is dynamically modified by individually activating or deactivating subarrays. This distributed geometry allows adapting the sensor to different sound sources. For instance, not all the subarrays need to be active to detect a particular sound-source. The computational requirements drastically decrease and the sensor array becomes more power efficient if only a few subarrays are active.

3.2. Filters

The selected digital MEMS microphones are the ADMP521 microphones designed by Analog Devices, which offer an omnidirectional polar response and a wide-band frequency response ranging from 100 Hz up to 16 kHz [10]. These digital MEMS microphones output a multiplexed pulse density modulation (PDM) signal. The PDM signals are generated by an analogue-to-digital converter (ADC) based on a sigma-delta converter. The sigma-delta conversion technique uses an embedded integrator-comparator circuit to sample the analogue signal and outputs a 1-bit signal [11]. The ADMP521 microphones use a fourth-order sigma-delta converter, which reduces the added noise in the audio frequency spectrum by shifting it to higher frequency ranges. This undesirable high-frequency noise needs to be removed. The ADMP521 microphones require a clock input of around 1 to 3 MHz as sampling frequency ($f_S$). This range of $f_S$ is chosen to oversample the audio signal in order to have sufficient audio quality and to generate the PDM output signal. Therefore, the PDM signal needs not only to be filtered to remove the noise, but also to be downsampled to convert the audio signal to a Pulse-Code Modulation (PCM) format. The target audible frequency range, from $F_{min}$ to $F_{max}$, determines the decimation factor ($D$) needed to properly downsample the PDM signal while satisfying the Nyquist theorem:

$$D = \frac{f_S}{2 \cdot F_{max}} \qquad (1)$$

The usual range of $D$ is from a few tens up to hundreds when targeting audible frequency ranges. For instance, $D$ needs to be 83 to recover an audio signal oversampled at 2.49 MHz for a target $F_{max}$ of 15 kHz.
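As a quick illustration of (1), the following Python sketch (the helper name is ours, not the paper's) computes the decimation factor for a given PDM clock and target bandwidth:

    # Sketch of (1): decimation factor D for a PDM stream sampled at f_S,
    # targeting a maximum audio frequency F_max (Nyquist).
    def decimation_factor(f_s_hz: float, f_max_hz: float) -> int:
        return round(f_s_hz / (2.0 * f_max_hz))

    # Example from the text: a 2.49 MHz PDM stream and a 15 kHz target give D = 83.
    assert decimation_factor(2.49e6, 15e3) == 83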

3.3. Filter-and-Sum Beamforming

The beamforming technique applied in our proposed architecture is based on Filter-and-Sum beamforming [12]. The original Filter-and-Sum beamforming applies an independent weight to each microphone output before summing them. The overall effect is an amplification of the signal coming from a target orientation while suppressing signals from other orientations. A variant of the Filter-and-Sum recovers the audio signal from the PDM signal, applies the same low-pass FIR filter, and delays the filter output of each microphone by a specific amount of time ($\Delta_m$) before adding all the output signals together (Figure 3). The time delay $\Delta_m$ for a microphone $m$ is determined by the focus direction $\theta$, the position vector $\vec{r}_m$ of microphone $m$, and the speed of sound ($c$):

$$\Delta_m = \frac{\vec{u}(\theta) \cdot \vec{r}_m}{c} \qquad (2)$$

where the unitary vector $\vec{u}(\theta)$ defines the direction vector of a far-field propagating signal with a focus direction $\theta$. The total output $o(t)$ of the array can be expressed based on the signal output $s_m(t)$ of each microphone in the time domain and the number of microphones in the array ($M$):

$$o(t) = \sum_{m=1}^{M} s_m\left(t - \Delta_m\right) \qquad (3)$$

The response of the Filter-and-Sum beamforming, however, is usually represented in the frequency domain due to its dependence on the signal frequency. Let $S_m(\omega)$ be the output signal of each microphone at angular frequency $\omega$ and $M$ the number of microphones in the array. The total output $O(\omega, \theta)$ is defined as in [13]:

$$O(\omega, \theta) = \sum_{m=1}^{M} S_m(\omega)\, e^{-j \omega \Delta_m} \qquad (4)$$

which can be simplified by assuming a monochromatic acoustic wave as $O(\omega_0, \theta) = S(\omega_0)\, W(\omega_0, \theta, \phi)$, with

$$W(\omega_0, \theta, \phi) = \sum_{m=1}^{M} e^{-j \omega_0 \left(\vec{u}(\theta) - \vec{u}(\phi)\right) \cdot \vec{r}_m / c} \qquad (5)$$

where $S(\omega_0)$ is the output signal of the monochromatic wave, $\omega_0$ is the incoming monochromatic angular frequency, $\phi$ is its direction, and $\theta$ is the array focus. $W(\omega, \theta, \phi)$ is known as the array pattern, which determines the amplification or gain of the array output. For instance, when $\theta = \phi$, which occurs when the array is focusing in the direction of the incoming monochromatic wave, the gain reaches its maximum $W = M$, equal to the number of microphones.
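A minimal time-domain sketch of (2) and (3), assuming planar far-field waves, 2D microphone coordinates, and sample-aligned delays (all names and helper functions are ours):

    import numpy as np

    C = 343.2  # speed of sound, m/s

    def steering_delays(mic_xy: np.ndarray, theta: float) -> np.ndarray:
        """Delta_m of (2), in seconds, for focus direction theta (radians)."""
        u = np.array([np.cos(theta), np.sin(theta)])  # unit vector u(theta)
        return mic_xy @ u / C

    def delay_and_sum(signals: np.ndarray, mic_xy: np.ndarray,
                      theta: float, fs: float) -> np.ndarray:
        """o(t) of (3): sum the M signals (rows) after delaying each one."""
        delays = steering_delays(mic_xy, theta)
        lags = np.round((delays - delays.min()) * fs).astype(int)  # non-negative
        n = signals.shape[1] - lags.max()
        return sum(signals[m, lag:lag + n] for m, lag in enumerate(lags))

In hardware the same shifts are realized with ring buffers rather than array slicing, as described in Section 4.2.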

3.4. Polar Steered Response Power

The direction of the sound-source is located by measuring the relative sound power per horizontal direction, which is done by a 360° sweep of the surrounding sound field. The directional power output of a microphone array, defined here as the polar steered response power (P-SRP), corresponds to the array's directional response to sound sources present in a sound field (Figure 4). The P-SRP is obtained by considering multiple broadband sources coming from different directions, for instance, human speech.

The output power when the microphone array is exposed to a broadband sound-source with an angle of incidence $\phi$ can be modelled as

$$o(t, \theta) = \sum_{n=1}^{N} A_n\, W(\omega_n, \theta, \phi)\, e^{j \omega_n t} \qquad (6)$$

where $A_n$ with $n \in \{1, \ldots, N\}$ is the amplitude of one of the $N$ frequency components of the source. The equation can be generalized to consider a sound field composed of $K$ broadband sound sources at different locations and with uncorrelated noise $\eta(t)$:

$$o(t, \theta) = \sum_{k=1}^{K} \sum_{n=1}^{N} A_{n,k}\, W(\omega_n, \theta, \phi_k)\, e^{j \omega_n t} + \eta(t) \qquad (7)$$

The array's power output can be expressed as

$$P(\theta) = \frac{1}{T} \int_{T} \left| o(t, \theta) \right|^2 dt \qquad (8)$$

since the power of a signal is the square of its amplitude. Finally, the normalized power output is defined as the P-SRP:

$$\text{P-SRP}(\theta) = \frac{P(\theta)}{\sum_{\theta_i \in \Theta} P(\theta_i)} \qquad (9)$$

The comparison of $\text{P-SRP}(\theta)$ for different values of $\theta$ determines in which direction the sound-source is located, since the maximum power is obtained when the focus corresponds to the location of a sound-source.

The calculation of the P-SRP is usually defined in the frequency domain [14, 15], which requires the computation of a Fourier transform. Instead, we propose applying Parseval's theorem, which states that the sum of the squares of a function is equal to the sum of the squares of its transform. This theorem drastically simplifies the calculations since the P-SRP can be computed in the time domain. Let us define the sensing time ($t_s$) as the time the array is registering the previously defined sound field for each orientation. Therefore, the power per orientation can be expressed as follows:

$$P(\theta) = \frac{1}{t_s} \sum_{t=0}^{t_s} o^2(t, \theta) \qquad (10)$$

Consequently, the P-SRP can be expressed in the time domain by

$$\text{P-SRP}(\theta) = \frac{P(\theta)}{\sum_{\theta_i \in \Theta} P(\theta_i)} \qquad (11)$$
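The time-domain computation of (10) and (11) reduces, per orientation, to accumulating squared beamformer outputs. A compact sketch (our names, not the paper's):

    import numpy as np

    def p_srp(beamformed: dict) -> dict:
        """Map each orientation theta -> normalized power, per (10)-(11).

        beamformed maps theta to the beamformed output samples o(t, theta)
        registered during the sensing time t_s.
        """
        power = {th: float(np.mean(o ** 2)) for th, o in beamformed.items()}
        total = sum(power.values())
        return {th: p / total for th, p in power.items()}

The estimated bearing is then simply the orientation with the maximum normalized power.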

3.5. Sensor Array Evaluation

The defined P-SRP allows estimating the direction of arrival of multiple sound sources under different sound field conditions. Nevertheless, the precision and accuracy of its estimation can be determined by different quality metrics.

The Filter-and-Sum beamforming is applied to a discrete number of orientations or angles. The angular resolution of the microphone array is determined by the number of measurements per 360° sweep. A higher number of measurements increases the resolution of the P-SRP displayed as a polar power map (Figure 5) and decreases the location error of the sound-source. The lobes of this polar power map can then be used to estimate the bearing of nearby sound sources in nondiffuse sound field conditions. In fact, the characteristics of the main lobe when considering a single sound-source scenario determine the directivity of the microphone array. A definition of array directivity for broadband signals, $D_P$, is proposed in [16]. The authors propose the use of $D_P$ as a metric of the quality of the array since $D_P$ depends on the main lobe shape and its capacity to unambiguously point to a specific bearing. The definition of array directivity presented in [16] is adapted for 2D polar coordinates in [9] as follows:

$$D_P(\theta, \omega) = \frac{P(\theta, \omega)}{\frac{1}{2\pi} \int_{0}^{2\pi} P(\phi, \omega)\, d\phi} \qquad (12)$$

where $P(\theta, \omega)$ is the output power of the array when pointing to the direction $\theta$ and the denominator is the average of the output power over all other directions. $D_P$ can be expressed as the ratio between the area of a circle whose radius is the maximum power of the array and the total area of the power output. Consequently, $D_P$ defines the quality of the microphone array and can be used to specify a certain threshold for the microphone array. For instance, if $D_P$ equals 8, the main lobe is eight times slimmer than the unit circle and offers a confident estimation of a sound-source within half a quadrant.
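For a sampled polar power map, (12) collapses to the ratio between the peak and the mean of the measured powers. A small helper (assumed name) in the same spirit as the threshold above:

    import numpy as np

    def directivity(p_theta: np.ndarray) -> float:
        """D_P of (12) for a power map sampled at N_o equally spaced angles."""
        return float(p_theta.max() / p_theta.mean())

    # A map with directivity(p) >= 8 pins the source down to about half a quadrant.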

Whereas $D_P$ is usually considered for broadband sound sources, other metrics are necessary to profile the array's response for different types of sound sources. Figure 6 depicts the maximum side lobe (MSL) and the half-power beamwidth, which are two complementary metrics used to characterize the response of arrays to narrowband sound sources. The half-power beamwidth is the angular extent by which the power response has fallen to half of the maximum level of the main lobe. Since the half-power point coincides with a 3 dB drop in power level, it is often called the 3 dB beamwidth ($BW_{3dB}$). This metric determines the angular ratio between the power signal level which is at least 50% of the peak power level and the remaining circle. By contrast, the MSL is another important parameter used to represent the impact of the side lobes when characterizing arrays. The MSL is the normalized ratio between the highest side lobe and the power level of the main lobe, expressed in dB. Both metrics, the MSL and the $BW_{3dB}$, are desired to be as low as possible, whereas $D_P$ should be as high as possible to guarantee a precise sound-source location.

4. A Filter-and-Sum Based Architecture

The proposed architecture uses a Filter-and-Sum based beamforming technique to locate a sound-source with an array of digital MEMS microphones. Many applications, however, demand a certain scalability and flexibility when locating the sound-source. With such requirements in mind, the proposed architecture has some additional features to support a dynamic response targeting applications with real-time demands. The proposed architecture is also designed to be power efficient for battery-powered operation and to operate in a streaming fashion to achieve the fastest possible response.

One of the features of the ADMP521 microphone is its low-power sleep mode capability. When no clock signal is provided, the ADMP521 microphone enters a low-power sleep mode (<1 μA), which makes this sound-source locator suitable for battery-powered implementations. The PCB of the MEMS microphone array is designed to exploit this capability. Figure 2 depicts the subarray distribution of the MEMS microphones. Since each subarray is fed with an individual clock signal, subarrays can be activated or deactivated through their clocks. This flexibility allows disabling not only subarrays of microphones, but also the associated computational components, decreasing the computational demand and the power consumption. The proposed architecture is properly designed to support such flexibility.

The array computes its response as fast as possible to reach real-time sound-source localization. The proposed architecture is designed to process in a streaming fashion and is mainly composed of three cascaded stages operating as a pipeline (Figure 7). The first stage is the filter chain, which is composed of the minimum number of components required to recover the audio signal in the target frequency range. The second stage computes the Filter-and-Sum beamforming operation. The final stage obtains $P(\theta)$ for the focused orientation. A polar power map is obtained once a complete steering loop is completed. The different stages are discussed in more detail in the following subsections. Table 1 summarizes the most relevant parameters of the proposed architecture.

4.1. Filter Stage

The filter stage contains a PDM demultiplexer and as many filter chain blocks as MEMS microphones (Figure 8). Each microphone of the array is associated with a filter chain composed of a couple of cascaded filters. The full-capacity design supports up to 52 filter chain blocks working in parallel, but their number is defined by the number of active microphones. The unnecessary filter chain blocks are disabled at runtime.

The microphones' clock determines the input rate and, therefore, how fast the filter stage should operate. This frequency is low for current FPGAs, which allows interesting power savings [17].

Every pair of microphones has its PDM output signal multiplexed in time. Thus, at every edge of the clock cycle the output is the sampled data from one of the microphones. The PDM demultiplexing is the first operation to obtain the individual sampled data from each microphone. This task is done in the PDM splitter block.

The next component consists of a cascade of filters to filter and to downsample each microphone signal. Traditional digital filters such as the Finite Impulse Response (FIR) type of filters are a good solution to reduce the signal bandwidth and to remove the higher frequency noise. Once the signal is filtered it can be decimated to decrease the oversampling to a reasonable audio quality rate (e.g., 48 kHz). However, this filter consumes many adders and dedicated multipliers (DSPs) from the FPGA resources, particularly if its order increases.

The Cascaded Integrator-Comb (CIC) filter is an alternative low-pass filtering technique which has been developed in [18, 19] and involves only additions and subtractions. This type of filter consists of three stages: the integrator section, the decimator, and the comb section. PDM samples are recursively added in the integrator section and recursively subtracted, with a differential delay, in the comb section. The number of recursive operations in both the integrator and comb sections determines the order of the filter ($N_{CIC}$), which should at least be equal to the order of the sigma-delta converter of the microphones' ADC. After the CIC filter, the signal growth ($G$) is proportional to the decimation factor ($D_{CIC}$) and the differential delay ($M$) and is exponential in the filter order [19]:

$$G = \left(D_{CIC} \cdot M\right)^{N_{CIC}} \qquad (13)$$

The output bit width grows proportionally to $\log_2(G)$. Denote by $B_{in}$ the number of input bits; then the number of output bits $B_{out}$ is as follows:

$$B_{out} = \left\lceil N_{CIC} \cdot \log_2\left(D_{CIC} \cdot M\right) \right\rceil + B_{in} \qquad (14)$$
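Equation (14) is Hogenauer's register-growth bound, so the required word width can be computed ahead of synthesis. A one-line sketch (our helper name):

    import math

    def cic_output_bits(b_in: int, order_n: int, dec_d: int, diff_m: int) -> int:
        """B_out of (14) for an N-th order CIC with decimation D and delay M."""
        return b_in + math.ceil(order_n * math.log2(dec_d * diff_m))

    # e.g. a 4th-order CIC (matching the 4th-order sigma-delta ADC), D = 16,
    # M = 1 and a 1-bit PDM input: 1 + 4*log2(16) = 17 output bits.
    assert cic_output_bits(1, 4, 16, 1) == 17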

The proposed CIC decimation filter eliminates higher frequency noise components and decimates the signal by $D_{CIC}$ at the same time. However, a major disadvantage of this filter is its nonflat frequency response in the desired audio frequency range. In order to improve the flatness of the frequency response, a CIC filter with a lower decimation factor followed by a compensation FIR filter is often chosen, as in [20–22].

The CIC filter is followed by an averager, which is used to cancel out the microphones' DC offset; this offset would otherwise lead to a constant bias in the beamforming values. This block improves the dynamic range, reducing the bit width required to represent the data after the CIC.

The last component of each filter chain is a low-pass compensation FIR filter based on a Kaiser window. This filter equalises the passband droop usually introduced by CIC filters [19]. It additionally performs a low rate change. The proposed filter needs a cut-off frequency of $F_{max}$ at a sampling rate of $f_S / D_{CIC}$, which is the sampling rate obtained after the CIC decimator filter with a decimation factor of $D_{CIC}$. This low-pass FIR filter is designed in a serial fashion to reduce the resource consumption. In fact, the FIR filter order is also determined by $D_{CIC}$: thanks to the streaming nature of the architecture, the CIC filter is able to generate an output value every clock cycle, but due to the decimation factor only one output value per $D_{CIC}$ input values is propagated to the low-pass FIR filter. Therefore, the FIR filter has $D_{CIC}$ clock cycles to compute each input value, which determines its maximum order. The filtered signal is then further decimated by a factor $D_{FIR}$ to obtain the minimum bandwidth of the audio signal satisfying the Nyquist theorem. The overall decimation factor $D$ can be expressed based on the low rate change of each filter:

$$D = D_{CIC} \cdot D_{FIR} \qquad (15)$$

4.2. Beamforming Stage

As detailed before, the main purpose of the beamforming operation is to focus the MEMS microphone array in one particular direction. The detection of sound sources is possible by continuously steering in loops of 360°. The number of orientations, $N_o$, determines the angular resolution. Higher angular resolutions demand not only a larger execution time per steering loop, but also more FPGA memory resources to store the precomputed delays per orientation.

The beamforming stage depends on the number of microphones and subarrays. Although Filter-and-Sum beamforming assumes a fixed number of microphones and a fixed geometry, our scalable solution satisfies those restrictions while offering a flexible geometry. Figure 9 shows our proposed Filter-and-Sum based beamformer. This stage is basically composed of FPGA blocks of memory (BRAM) organized as ring buffers that properly delay the filtered microphone signals. The delay values at a given moment depend on the focus orientation at that moment and are determined by the array pattern from (5). The delay for a given microphone is determined by its position on the array and by the focus orientation. All possible delay values per microphone for each beamed orientation are precomputed, grouped per orientation, and stored in ROMs during compilation time. During execution time, the delay values of each microphone when pointing to a certain orientation are fetched from this precomputed table.
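The compile-time delay table can be derived directly from (2). A sketch of its construction (the coordinates, rate, and names are assumptions for illustration):

    import numpy as np

    C = 343.2  # speed of sound, m/s

    def delay_rom(mic_xy: np.ndarray, n_orientations: int, fs_out: float):
        """Per-orientation, per-microphone delays in samples at rate fs_out."""
        table = np.empty((n_orientations, mic_xy.shape[0]), dtype=np.int32)
        for o in range(n_orientations):
            theta = 2 * np.pi * o / n_orientations
            u = np.array([np.cos(theta), np.sin(theta)])
            d = mic_xy @ u / C                           # Delta_m of (2), seconds
            table[o] = np.round((d - d.min()) * fs_out)  # non-negative samples
        return table

Each row of the table corresponds to one ROM group: the delays applied to the ring buffers when the array is steered to that orientation.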

The beamforming stage is designed to support a variable number of microphones. This is enabled by grouping the input signals following their subarray structure. Therefore, instead of implementing one single Filter-and-Sum of 52 microphones, there are four Filter-and-Sum operations in parallel for the 4, 8, 16, and 24 microphones. Their sum operation is first done locally within each subarray and afterwards between subarrays. The only restriction of this modular beamforming is the synchronization of the outputs in order to have them properly delayed. Therefore, the easiest solution is to delay all the subarrays by the maximum delay of the subarrays. Although the output of some subarrays is already properly delayed, additional delays, shown in the Sums section of Figure 9, are inserted to assure that the proper delay of each subarray has been obtained. This is achieved by using the valid output signals of each subarray beamforming, without additional resource cost. Consequently, only the Filter-and-Sum beamforming modules linked to active subarrays are enabled. The inactive beamformers are set to zero in order to avoid any negative impact on the beamforming operation.

A side benefit of this modular approach is a reduction of the memory resource consumption. Since each subarray has its ring-buffer memory properly dimensioned to its maximum sample delay, the portion of underused memory regions remains significantly low.

4.3. Power Stage

Figure 10 shows the components of the power stage. Once the filtered data has been properly delayed and added for a particular orientation $\theta$, $P(\theta)$ is calculated following (10). The P-SRP is obtained after a steering loop, allowing the determination of the sound sources. The sound-source is estimated to be located in the direction shown by the peak of the polar power map, which corresponds to the orientation with the maximum $P(\theta)$.

5. Performance Analysis of the Filter-and-Sum Based Architecture

A performance analysis of the proposed architecture is presented in this section. The analysis shows how design parameters such as the filters' characteristics affect the final execution time of the sound-source locator. The links between performance and design parameters are explained, followed by the description of the different acceleration strategies. These strategies can be applied standalone or combined under certain timing constraints. The advantages of these strategies are presented later, in Section 6.

5.1. Time Parameters

The overall execution time of the proposed architecture is defined by the latency of the main components. A detailed analysis of the implementation of the components and the latency that they incur provides a good insight into the speed of the system (Table 2). The operation frequency of the design can be assumed to be the same as the sampling frequency. Let us define $t_{P\text{-}SRP}$ as the overall execution time in clock cycles required to obtain the P-SRP. Thus, $t_{P\text{-}SRP}$ is defined as

$$t_{P\text{-}SRP} = N_o \cdot t_o = N_o \cdot \left(t_F + t_B + t_P\right) \qquad (16)$$

where $t_o$ is the execution time of one orientation and is determined by the execution time of the filter stage ($t_F$), the execution time of the beamforming ($t_B$), and the execution time of the power stage ($t_P$), which are the main components of the system as explained in the previous section. The proposed architecture pipelines each stage, overlapping the execution of each component of the design. Therefore, only the initial latency or initiation interval (II) of the components needs to be considered, since it corresponds to the system group delay.

Let us assume that the design operates at the same frequency as the microphones; then (16) can be rearranged as follows:

$$t_{P\text{-}SRP} = N_o \cdot \left(t_{II} + t_s\right) \qquad (17)$$

where $t_{II}$ is the latency of the system, determined by the initiation interval of the filter stage ($II_F$), the initiation interval of the beamforming stage ($II_B$), and the initiation interval of the power stage ($II_P$). The time during which the microphone array is monitoring one particular orientation is known as $t_s$. This is the time required to calculate a certain number of output samples ($N_s$). As previously detailed, the digital microphones oversample the audio signal by operating at $f_S$. The reconstruction of the audio signal in the target range demands a certain level of decimation $D$. This decimation is done by the CIC and the FIR filters of the filter stage, with decimation factors $D_{CIC}$ and $D_{FIR}$, respectively. Based on $D$ as defined in (1), the time $t_s$ is expressed as follows:

$$t_s = \frac{N_s \cdot D_{CIC} \cdot D_{FIR}}{f_S} = \frac{N_s}{2 \cdot F_{max}} \qquad (18)$$

The latency of each stage of the implementation can be further decomposed based on the latencies of its components, where $II_i$ is the initiation interval of each component $i$. Therefore, $t_{II}$ is defined as the sum of all the initiation intervals:

$$t_{II} = \sum_i II_i = II_F + II_B + II_P \qquad (19)$$

Equation (16) can be rearranged (see Figure 11) as

$$t_{P\text{-}SRP} = N_o \cdot \left(II_F + II_B + II_P + t_s\right) \qquad (20)$$

The execution time is determined by $t_s$ and $t_{II}$, since the level of decimation is determined by the target frequency range and $t_{II}$ is determined by the components' design. Although most of the latency of each component of the design is hidden thanks to the pipelined operation, there are still some cycles dedicated to initializing the components. A detailed analysis of $t_{II}$ provides valuable information about the performance leaks.

CIC. The initiation interval of the CIC filter represents the time required to fill the integrator and comb stages. Therefore, the order of the CIC ($N_{CIC}$) determines $II_{CIC}$.

DC. The component which removes the DC level of the signal introduces a minor initial latency due to its internal registers. Since it needs at least two input values to calculate the DC level, its initiation interval also depends on $D_{CIC}$.

FIR. The initiation interval of the FIR filter is determined by the order of this filter ($N_{FIR}$). Since the filter operation is basically a convolution, the initial output values are not correct until at least the $N_{FIR}$-th input value has entered the filter. Because the filters are cascaded, $D_{CIC}$ also affects $II_{FIR}$.

Therefore, $II_F$ is expressed as follows:

$$II_F = II_{CIC} + II_{DC} + II_{FIR} \qquad (21)$$

Delay. The beamforming operation is done through memories, which properly delay the audio samples for a particular orientation. The maximum number of delayed samples determines the minimum size of these delay memories. This value reflects the maximum distance between a pair of microphones for a certain microphone array distribution and may vary per orientation. The initiation interval of the Filter-and-Sum beamformer is therefore expressed as the maximum delay between pairs of microphones for a particular orientation:

$$II_{Delay} = \left\lceil \Delta_{max}(\theta) \cdot \frac{f_S}{D} \right\rceil \qquad (22)$$

where $\Delta_{max}(\theta)$ is the maximum time delay of the active microphones for the beamed orientation $\theta$. Therefore, $II_{Delay}$ is mainly determined by the microphone array distribution, $f_S$, and the target frequencies, which determine $D$. Due to the symmetry of the microphone array and for the sake of simplicity, it is assumed that each orientation has the same $\Delta_{max}$. Notice this does not need to be true for different array configurations.
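In other words, the ring buffer of each subarray only needs as many entries as the largest sample delay it can be asked to produce. Given a precomputed table such as delay_rom above, the depth follows directly (assumed helper):

    import numpy as np

    def ring_buffer_depth(delay_table: np.ndarray) -> int:
        """Minimum entries per delay memory: the largest sample delay plus one."""
        return int(delay_table.max()) + 1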

Sum. The proposed beamforming is composed of not only a set of delay memories but also a sum tree. The initiation interval of this component is defined by the number of active microphones ($M$).

Therefore, $II_B$ is expressed as follows:

$$II_B = II_{Delay} + II_{Sum} \qquad (23)$$

Power. The final component is the calculation of the power per orientation. This simple component has a constant latency of a couple of clock cycles.

The timing analysis of the initiation interval of each component of the architecture gives an idea of the design parameters with the highest impact. The definition of the filters, mainly their order, is determined by the application specifications, so it should not be modified to reduce the overall execution time. On the other hand, the distribution of the microphones in the array affects not only the frequency response of the system, but also the execution time. Notice, however, that the total number of microphones has no timing impact: only the number of active microphones has a minor impact, in the order of a couple of clock cycles. Nevertheless, (20) already shows that the dominant parameters are $t_s$ and $N_o$.
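The decomposition of (20) can be turned into a quick back-of-the-envelope latency model. The sketch below (all parameter names are ours, standing in for the paper's symbols) returns the steering-loop time in seconds:

    def t_psrp_seconds(n_orient: int, n_samples: int, f_s: float, d_total: int,
                       ii_filter: int, ii_beam: int, ii_power: int) -> float:
        """t_P-SRP of (20): N_o * (II_F + II_B + II_P + t_s), in seconds."""
        t_ii = (ii_filter + ii_beam + ii_power) / f_s  # initiation latency
        t_sense = n_samples * d_total / f_s            # t_s of (18)
        return n_orient * (t_ii + t_sense)

The initiation intervals are paid for every orientation, which is precisely what the strategies of Section 5.3 avoid.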

5.2. Sensitive Parameters

The timing analysis provides an indication of the parameters dominating the execution time. Some parameters, like the microphone array distribution, which determines the beamforming latency, are fixed, while others, like $t_s$ or the number of orientations $N_o$, are variable.

Orientations. Figure 5 depicts how an increment of $N_o$ leads to a better sound-source localization. This resolution, however, has a high repercussion on the response time. A simple strategy is to maintain the angular resolution only where it is needed while quickly exploring the surrounding sound field. For instance, the authors in [3] propose a strategy to reduce the beamforming exploration to 8 orientations with an angular separation of 45 degrees. Once a steering loop ends, the orientations are rotated one position, which represents a shift operation in the precomputed orientation table. Therefore, all 64 supported orientations are monitored after 8 steering loops. Although this strategy intends to accelerate the peak detection by monitoring the minimum $N_o$, the overall $t_{P\text{-}SRP}$ remains the same for achieving the equivalent angular resolution.

Sensing Time. The sensing time is a well-known parameter of radio frequency applications. A longer sensing time is known to strengthen the robustness against noise [23]. In our case, the time a receiver is monitoring the surrounding sound field determines the probability of properly detecting a sound-source. Consequently, a higher $t_s$ is needed to detect and locate sound sources under low Signal-to-Noise Ratio (SNR) conditions. Although $t_s$ could be modified at runtime to adapt the sensing of the array to an estimated SNR, this would demand a continuous SNR estimation, which is out of the scope of this paper.

To conclude, Table 2 summarizes the timing definitions. On one hand, $t_s$ determines the number of processed acoustic samples and therefore directly affects the sensing of the system. On the other hand, $N_o$ determines the angular resolution of the sound-source search and influences the accuracy. There is thus a trade-off between $t_s$, $N_o$, and the quality of the sound-source location.

5.3. Strategies for Time Reduction

The following three strategies are proposed to accelerate the sound-source localization without any impact on the frequency response and accuracy of the architecture. An additional strategy is proposed especially for dynamic acoustic environments, at a certain accuracy cost.

5.3.1. Continuous Processing

The proposed architecture resets the filter and beamforming stages after each $t_o$ due to the orientation transition. Since the beamforming stage is located after the filter stage, the system can continue processing during a reset: the filter stage does not need to stop, and no input data is lost, because the filtered input values are stored in the beamforming stage. Furthermore, the initialization of the beamforming stage can also be eliminated since the stored data from the previous orientation can be reused for the calculation of the new one. With this approach, (17) becomes as follows:

$$t_{P\text{-}SRP} = t_{II} + N_o \cdot t_s \qquad (24)$$

5.3.2. Time Multiplexing

Nowadays, FPGAs can operate at clock speeds of hundreds of MHz. Despite the fact that the power consumption is significantly lower when operating at low frequency [17], the proposed architecture is able to operate at a much higher frequency than the data sampling rate. This capability provides the opportunity to parallelize the beamforming computations without any additional resource consumption. Instead of consuming more logic resources by replicating the main operations, the proposed strategy, similar to Time-Division Multiplexing in communications, consists in time multiplexing these parallel operations. Because the input data is oversampled audio, the selection of the operations to be time multiplexed is limited. Based on (20), the candidates to be parallelized are $t_s$ and $N_o$. Since the input data rate is determined by $f_S$, (18) shows that $t_s$ cannot be reduced without decreasing $N_s$ or changing the target frequency range. Nevertheless, since the computation of each orientation is data independent, the orientations can be parallelized. The simultaneous computation of multiple orientations is only possible after the beamforming operation. Let us define $t_{II_B}$ as the monitoring time before being able to process multiple orientations in parallel:

$$t_{II_B} = \frac{II_F + II_B}{f_S} \qquad (25)$$

After $t_{II_B}$, the delay memories which compose the Filter-and-Sum beamforming stage have stored enough audio data to start locating the sound-source. Because the beamforming operation relies on delaying the recovered audio signal, multiple orientations can be computed in parallel by accessing the content of the delay memories at a higher speed than the sampling rate of the input data. This basically multiplexes the beamforming output computations over time. The frequency required to parallelize all $N_o$ orientations in this architecture is defined as follows:

$$F_P = N_o \cdot \frac{f_S}{D} \qquad (26)$$

Due to (1), $F_P$ can also be expressed based on the target frequency range:

$$F_P = N_o \cdot 2 \cdot F_{max} \qquad (27)$$

Notice that the frequency required to multiplex the computation of the orientations in time does not depend on the number of microphones in the array. Figure 12 shows the clock domains when applying this strategy. While the front-end, consisting of the microphone array and the filter stage, operates at $f_S$, the output of the beamforming is processed at $F_P$. The additional cost in terms of resources is the extension of the register for the power-per-angle calculation: a memory of $N_o$ positions is required instead of the single register used to store the accumulated power values. This strategy allows fully parallelizing the computation of all the orientations. Thus, $t_{P\text{-}SRP}$ is mainly limited by $t_s$ and by the maximum reachable frequency of the design, since $f_S$ is determined by the microphones' operational frequency and by the frequency range of the target sound-source. In fact, $F_P$ determines how many orientations can be processed in parallel.
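The budget in (26) and (27) is easy to evaluate ahead of implementation. A sketch (assumed names) of both directions of the trade-off:

    def f_p_hz(n_orientations: int, f_max_hz: float) -> float:
        """F_P of (27): clock needed to time multiplex all orientations."""
        return n_orientations * 2.0 * f_max_hz

    def max_parallel_orientations(f_clk_hz: float, f_max_hz: float) -> int:
        """Orientations a given FPGA clock can sustain, inverting (27)."""
        return int(f_clk_hz // (2.0 * f_max_hz))

For example, with 64 orientations and a 16 kHz target bandwidth, f_p_hz(64, 16e3) gives a very modest 2.048 MHz, which is why this strategy costs essentially no extra logic.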

5.3.3. Parallel Time Multiplexing

This proposed strategy is an extension of the previous one. The frequency $F_P$ is limited by the maximum attainable operating frequency of the implementation, which is determined by many factors, from the technology to the available resources on the FPGA. For instance, if the beamformed output rate $f_S/D$ equals 60 kHz and the maximum attainable operating frequency is 100 MHz, then up to 1666 orientations could be computed in parallel. However, if not all the resources of the FPGA are consumed, especially the internal blocks of memory (BRAM), there is still room for improvement. With the time multiplexing strategy, the memories of the beamforming stage are fully accessed, since in each clock cycle there is at least one memory access, or even two when new data is stored. Therefore, more memory resources can be used to further accelerate the computation of the P-SRP. The simple replication of the beamforming stage, preconfigured for different orientations, is enough to double the number of processed orientations while maintaining the same $t_{P\text{-}SRP}$. This strategy mainly consumes BRAMs. Nevertheless, due to the value of $\Delta_{max}$ at the decimated rate $f_S/D$ for our microphone array, only a few audio samples are needed to complete the beamforming. This fact drastically reduces the memory consumption, which enables the potential computation of thousands of orientations by applying both strategies.

All strategies can be applied independently, although some only pay off when combined. Not all strategy combinations are beneficial. For instance, a dynamic angular resolution should only be combined with the time multiplexing of the orientations when $F_P$ is higher than $f_S$. Otherwise, the reduction of $N_o$ by dynamically readjusting the target orientations does not provide any acceleration and would only degrade the response of the system.

6. Results

The proposed architecture is evaluated in this section. Our analysis starts by evaluating different design solutions based on the timing analysis introduced in Section 5.1. One representative configuration is then evaluated in terms of frequency response and accuracy by using the metrics described in Section 3.5. This evaluation also considers sensitive parameters such as the number of active subarrays and the relevance of $N_o$, already introduced in Section 5.2. The resource and power consumption for a Zynq 7020 target FPGA are also presented. Finally, the strategies presented in Section 5.3 are applied to the representative design.

6.1. General Performance Analysis

The proposed performance analysis from the previous section is applied here to a concrete example. The explored design parameters are $f_S$ and $F_{max}$, keeping $N_s$ and $N_o$ both constant at 64. Whereas $f_S$ is determined by the microphone's sampling frequency, $F_{max}$ is determined by the target application. For our design space exploration, we consider an $F_{max}$ from 10 kHz to 16 kHz in steps of 125 Hz, and an $f_S$ ranging from 1.25 MHz to 3.072 MHz as specified in [10].

Equations (16) to (23) are used to obtain $t_{P\text{-}SRP}$. The performance analysis starts by obtaining $D$ for every possible value of $f_S$ and $F_{max}$. All possible combinations of $D_{CIC}$ and $D_{FIR}$ are considered based on (15). The low-pass FIR filter parameters are its order, which is determined by $D_{CIC}$, and $F_{max}$ as the cut-off frequency. Each possible low-pass FIR filter is generated considering a transition band of 2 kHz and an attenuation of at least 60 dB at the stop band. If the minimum order of the filter is higher than the order supported by $D_{CIC}$, the filter is discarded. We consider these parameters realistic constraints for low-pass FIR filters. Furthermore, a minimum value of 4 is defined as the threshold for $D_{FIR}$. Thus, some values of $D$ are discarded because $D$ is a prime number or its decomposition leads to a $D_{FIR}$ below 4. Each low-pass FIR filter is generated and evaluated in Matlab 2016b.

Figure 13 depicts the minimum timings of the design space exploration that the proposed Filter-and-Sum architecture needs to compute one orientation. $t_o$ is slightly reduced when varying $f_S$; for instance, it is reduced from 5.03 ms to 3.97 ms for a fixed $F_{max}$. A higher $f_S$ means a faster sampling, which is in fact the limiting factor of the operational frequency. Furthermore, a larger decrement of $t_o$ is produced when increasing $f_S$ and $D_{CIC}$. Higher values of $f_S$ allow higher values of $D_{CIC}$, which can greatly reduce the computational complexity of the narrowband low-pass filtering. However, too high values of $D_{CIC}$ lead to such low rates that, although a higher order low-pass FIR filter is supported, the filter cannot satisfy the low-pass filtering specifications. Notice how the number of possible solutions decreases while increasing $F_{max}$. Due to the $f_S$ and $F_{max}$ ranges, the values of $D$ vary between 39 and 154. Still, as previously explained, many values cannot be considered since they are either prime numbers or their decomposition in factors leads to a $D_{FIR}$ below 4. Because higher values of $F_{max}$ lead to low values of $D$ for low $f_S$, those values cannot satisfy the specifications of the low-pass FIR filter.

Finally, relatively low values of $t_o$ are obtained for $F_{max}$ values from 10 kHz to 10.65 kHz and $f_S$ ranging from 2.7 MHz to 3.072 MHz. This is produced by high values of $D_{CIC}$, which means that a higher order low-pass FIR filter is supported. As expected, high values of $D_{CIC}$ lead to high order low-pass FIR filters and a lower $D_{FIR}$. A lower $D_{FIR}$ avoids unnecessary computations, since fewer samples are decimated after the low-pass FIR filter.

6.2. Analysis of a Design

As shown in Figure 13, several design considerations drastically affect the final performance. However, most of these design decisions do not have a significant impact on the system response when compared to other factors such as the number of active microphones or the number of orientations. The analysis of the impact of these parameters on the system's response and performance is done over one particular design.

Table 3 summarizes the configuration of the architecture. The design considers an $f_S$ which is an intermediate value within the range of clock frequencies supported by the ADMP521 microphones [10]; it is the clock for the microphones and the functional frequency of the design. The selected cut-off frequency is $F_{max}$, which determines the overall decimation factor $D$ following (1). In this example design the CIC filter operates with a decimation factor of 16 and a differential delay of 32. The chosen FIR filter has a beta factor of 2.7 and a cut-off frequency of $F_{max}$ at a sampling rate of $f_S/D_{CIC}$, which is the sampling rate obtained after the CIC decimator filter. The filtered signal is then further decimated by a factor $D_{FIR}$ to obtain an audio signal sampled at $2 \cdot F_{max}$.

The architecture is designed to support a complete steering loop of up to 64 orientations, which represents an angular resolution of 5.625°. On the other hand, the subarray approach allows activating up to 52 microphones when all 4 subarrays are active. The final results are obtained by assuming a speed of sound of ≈343.2 m/s.

6.2.1. Frequency Response

The waterfall diagrams of Figure 14 show the power output of the combined subarrays in all directions for all frequencies. In our case, the results are calculated with a single sound-source whose frequency sweeps the band of interest in uniform steps and which is placed at 180°. All results are normalized per frequency. Every waterfall shows a clear distinctive main lobe. When only subarray 1 is active, there are side lobes at certain frequencies which impede the sound-source location at those frequencies. The frequency response of the subarrays improves when they are combined, since their frequency responses are superposed. Each additional combination of subarrays (1 and 2; 1, 2, and 3; all four) further lowers the minimum detectable frequency. These minimum values are clearly depicted in Figure 15, using a threshold of 8 for $D_P$, which indicates that the main lobe's surface corresponds to maximally half of a quadrant. The frequency response of the combination of subarrays has a strong variation at the main lobe and, therefore, in $D_P$. Figure 15 depicts the evolution of $D_P$ when increasing the angular resolution and when combining subarrays. The angular resolution determines the upper bound to which $D_P$ converges, which is dependent on the number of orientations. The number of active microphones, on the other hand, influences how fast $D_P$ converges to this upper limit. Consequently, the number of active microphones determines the minimum frequency which can be located when considering a threshold of 8 for $D_P$.

Alongside the directivity, other metrics such as the main lobe beamwidth and the MSL are also calculated to properly evaluate the quality of the array's response. Figure 16 depicts the MSL when varying the number of active subarrays and the number of orientations. A low angular resolution leads to a lower resolution of the waterfall diagrams, but only the metrics can show the impact. At frequencies between 1 and 3 kHz the main lobe converges to a unit circle, which can be explained by the lack of any side lobe. Higher frequencies present secondary lobes, especially when only the inner subarray is active, which increases the MSL values independently of the angular resolution. A low angular resolution leads to unexpectedly low values of MSL, since the secondary lobes are not detected. On the other hand, a higher number of active microphones leads to lower values of MSL, independently of the angular resolution.

Figure 17 depicts the $BW_{3dB}$ for a similar analysis of the number of microphones and the angular resolution. On one hand, a higher number of microphones produces a faster decrement of the $BW_{3dB}$, reflected as a thinner main lobe. Nevertheless, the $BW_{3dB}$ of each subarray converges to a minimum, which is only reached at higher frequencies. The angular resolution determines this minimum, which ranges from 90° to 11.25° when 8 or 64 orientations are considered, respectively.

6.2.2. Resource Consumption and Power Analysis

Table 4 summarizes the resource consumption when combining subarrays. The consumed resources are divided into the resources for the filter stage, the beamforming stage, and the total consumption per group of subarrays. The filter stage mostly consumes DSPs, while the beamforming stage mainly demands BRAMs. Most of the resource consumption is dominated by the filter stage, since a filter chain is dedicated to each MEMS microphone. The overall resource consumption is therefore determined by the number of active subarrays.

The flexibility of our architecture allows the creation of heterogeneous sound-source locators. Thus, the architecture can be scaled down for small FPGAs based on the target sound-source profile or a desired power consumption. For instance, the combination of the two inner subarrays would use 12 microphones while consuming less than 10% of the available resources. The LUTs are the limiting resource due to the internal registers of the filters. In fact, when all the subarrays are used, around 80% of the available LUTs are required. Nevertheless, any subarray can be disabled at runtime, which directly deactivates its associated filter and beamforming components. Although this does not affect the resource consumption, it has a direct impact on the power consumption. Table 5 shows the power consumption in mW based on the number of active subarrays. The power consumption of the microphones is also considered, since the FPGA and the microphone array are powered from the same source, and the overall power consumption matters because the architecture is designed for an embedded system. The MEMS microphones are powered with 3.3 volts, which represents a per-microphone power consumption in the μW range when inactive and in the mW range when active. Notice how the power consumption increases with the number of active subarrays. There is a turning point when 3 or 4 subarrays are active; thus, the microphone array consumes more power than the FPGA when all the subarrays are active.

6.2.3. Timing Analysis

The timing analysis of the design under evaluation, based on Section 5, is summarized in Table 6. A complete steering loop requires around 169 ms, while $t_o$ rounds to 2.6 ms. Notice that the initialization ($t_{II}$) consumes around 21.5% of the execution time. Fortunately, this initialization can be almost completely removed by applying the first strategy described in Section 5.3.1.

Table 7 summarizes the timing results when applying the first strategies proposed in Section 5. The elimination of the initialization after each orientation transition slightly reduces $t_{P\text{-}SRP}$; in this case, $t_{P\text{-}SRP}$ is expressed as in (24). The main improvement is obtained after time multiplexing the computation of the power per orientation. In this case $F_P$, the operational frequency of the beamforming computation needed to process all $N_o$ orientations in parallel, equals $f_S$ as expressed in (27). Because $F_P$ and $f_S$ have the same value, there is no need for a different clock for the beamforming operation, since the spacing between output filtered values from the filter stage is large enough. By combining the first two strategies, $t_{P\text{-}SRP}$ rounds to 2 ms, and only the first steering loop needs 2.6 ms due to the initiation intervals. In this case, $t_{P\text{-}SRP}$ is expressed as follows:

$$t_{P\text{-}SRP} = t_{II} + t_s \qquad (28)$$

The other two strategies proposed in Section 5.3 are designed to fully exploit the FPGA resources and to overcome time constraints when considering a high angular resolution. In the first case, since the design under evaluation has a small angular resolution ($N_o = 64$), there is no need for a higher $F_P$ when applying the time multiplexing strategy. However, a higher angular resolution can be obtained by using the unconsumed resources without additional timing cost. Table 8 shows how the combination of strategies increases the angular resolution without additional time penalty. The operational frequency ($F_P$) determines at what speed the FPGA can operate. Following (27), the beamforming operation can be further exploited by increasing $F_P$ up to the maximum attainable frequency, which increases the number of supported orientations as well:

$$N_o = \frac{F_P}{2 \cdot F_{max}} \qquad (29)$$

Many thousands of orientations can be computed in parallel when combining all strategies, since the beamforming stage can be replicated as many times as the remaining available resources allow. Of course, this estimation is optimistic, since the achievable frequency drops when the resource consumption increases. Nevertheless, it provides an upper bound for $N_o$. For instance, when only the inner subarray is considered, the DSPs are the limiting component; even then, up to 53 beamforming stages could theoretically be placed in parallel. When more subarrays are active, the BRAMs become the constraining component. Notice how the number of supported orientations increases as the number of active subarrays decreases. This has, however, an impact on the frequency response and the accuracy of the system, as shown in Section 6.2.1. Nevertheless, tens of thousands of orientations can be computed in parallel in only around 2 ms by operating at the highest $F_P$ and by replicating the beamforming stage to exploit all the available resources.

7. Conclusions

In this paper we have presented a scalable and flexible architecture for fast sound-source localization. On one hand, the architecture can flexibly disable the sections of the microphone array that are not needed, for instance to respect power restrictions. The modular approach of the architecture allows scaling the system to a larger or smaller number of microphones. Such capabilities do not impact the frequency response or the accuracy of our sound-source locator. On the other hand, several strategies to offer real-time sound-source localization have been presented and evaluated. These strategies not only accelerate $t_{P\text{-}SRP}$ but also provide solutions for time-stringent applications with a high angular resolution demand. Thousands of angles can be monitored in parallel, offering a high-resolution sound-source localization in a couple of milliseconds.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the European Regional Development Fund (ERDF) and the Brussels-Capital Region-Innoviris within the framework of the Operational Programme 2014–2020 through the ERDF-2020 Project ICITY-RDI.BRU.