Abstract

The current Microelectromechanical Systems (MEMS) technology enables the deployment of relatively low-cost wireless sensor networks composed of MEMS microphone arrays for accurate sound source localization. However, the evaluation and the selection of the most accurate and power-efficient network’s topology are not trivial when considering dynamic MEMS microphone arrays. Although software simulators are usually considered, they consist of high-computational intensive tasks, which require hours to days to be completed. In this paper, we present an FPGA-based platform to emulate a network of microphone arrays. Our platform provides a controlled simulated acoustic environment, able to evaluate the impact of different network configurations such as the number of microphones per array, the network’s topology, or the used detection method. Data fusion techniques, combining the data collected by each node, are used in this platform. The platform is designed to exploit the FPGA’s partial reconfiguration feature to increase the flexibility of the network emulator as well as to increase performance thanks to the use of the PCI-express high-bandwidth interface. On the one hand, the network emulator presents a higher flexibility by partially reconfiguring the nodes’ architecture in runtime. On the other hand, a set of strategies and heuristics to properly use partial reconfiguration allows the acceleration of the emulation by exploiting the execution parallelism. Several experiments are presented to demonstrate some of the capabilities of our platform and the benefits of using partial reconfiguration.

1. Introduction

Wireless sensor networks (WSN) composed of microphone arrays are becoming popular [1, 2] thanks to the relatively low cost of Microelectromechanical Systems (MEMS) sensors. However, validation and verification of these networks, using simulations, are time consuming procedures. Furthermore, before the deployment of a WSN composed of microphone arrays, the network must be tested in adapted environments such as anechoic chambers to avoid undesired reflections, possible distortions, or acoustic artifacts. Simulators offer usually a solution since they quickly provide information about the capabilities of a network. For instance, they can be used to explore the effects of different node’s architectures, network topologies, or network synchronization strategies. However, simulation processes are computationally intensive tasks which usually require hours or days to complete. Due to the inherent parallelism that microphone arrays present, we believe that FPGAs can accelerate the simulation of such type of networks. Here we present an extended version of the microphone array network emulator () presented in [3, 4], which mimics the node’s response, combines the response of the network’s nodes, and provides an estimation of the network’s response under a certain acoustic scenario. Therefore, instead of a pure software-based like the one presented in [1], the proposed uses an FPGA to accelerate the node’s computation by implementing exactly the same HDL code that is going to be deployed in the nodes of a real network. From one side, an improved version of the sound source locator proposed in [5] and accelerated in [6] is used as nodes of the network. From the other side, the uses partial reconfiguration () to adapt the network topology and the node’s configuration to increase accuracy of the sound source location. As a result, the functionalities of our are distributed between the host and the FPGA, using a high-bandwidth PCI-express () interface for the communication and .

This paper extends the work and results presented in [3, 4]. On the one hand, this paper presents a more detailed description of the platform, by providing low-level details of the node’s architectures under evaluation and detailed use of through . On the other hand, the use of through is exploited to not only extend the capabilities of the but also accelerate the emulation of multiple network’s topologies. The main contributions of this work can be summarized as follows:(i)A fully detailed architecture description of an FPGA-based , providing low-level details of the platform and how the via is used to reconfigure the network’s characteristics.(ii)Strategies and heuristics to exploit the use of through to further accelerate the computations on the FPGA.

This paper is organized as follows. The motivation of including as part of our system is done in Section 2. Section 3 presents related work. The description of the architecture, the data fusion technique, and the is done in Section 4. How is used to expand the supported nodes’ configurations of the and how to accelerate executions of the emulator by using is described in Section 5. In Section 6, the proposed is used to evaluate certain network’s configurations. Finally, our conclusions are presented in Section 7.

2. Why Partial Reconfiguration?

is a unique feature of FPGAs which allows us to change the functionality of a part of the reconfigurable logic in runtime. This results not only in a better reuse of area but also in a potential increment of performance when properly applied. Streaming applications demanding multiple individual computations of similar tasks but with different configurations are the ideal candidates. Furthermore, the execution of streaming applications on FPGAs can exploit parallelism by means of pipelining.

Reconfigurable resources are divided into static and dynamic parts when applying . The resources for the communication interfaces or for the control are usually in the static part. The dynamic part supports at runtime to allocate different mutually exclusive functionalities, known as reconfigurable modules (). The reserved logic resources for the dynamic part are denoted as reconfigurable partitions (). Thus, each is dimensioned to provide enough logic resources to support several .

2.1. Beyond the Available Resources

PR allows a higher level of resource reuse because functionality can be multiplexed in time on the reconfigurable logic. Such feature allows the allocation of different tasks in the same . The only condition is that their associated must be multiplexed in time by partially reconfiguring the .

The benefit of for efficient resources’ management has been exploited in different ways in recent years. For instance, the architecture presented in [7] supports 80 distinct hardware architectures, with different levels and precisions, of DCT computations. Other applications, instead, use to switch between applications’ modes in order to reduce the consumed FPGA resources and the overall power consumption. An example is presented in [8], where different image processing operations are switched in runtime to more power-efficient modes. The runtime management of the FPGA’s resources through allows self-adaptive or self-repairing systems such as the one presented in [9]. Further examples of how improves the resource utilization and increases the flexibility of the system are detailed in [10].

2.2. Performance Opportunities

has area and time cost. On the one hand, additional area is dedicated to support . On the other hand, the of a requires a certain amount of time, which is directly related to the size of the and the available to load the bitmap. Everything is slightly different when considering over . is usually exploited for small FPGAs or FPGA-based . In such devices, the logic resources are very limited, which evidences the benefit of by multiplexing in time the available resources. An FPGA board with is typically a high-end FPGA offering a large amount of resources to support high-performance applications. Thus, the sizes of the are usually larger to exploit the available resources and/or to allocate complex applications demanding many logic resources. As a consequence, the corresponding bitmaps of the are bigger, consuming more external on-board memory. Fortunately, since over is supported, the bitmap files can be located on the host side and be loaded through . The use of a high-bandwidth interface such as not only allows the reduction of the ’s area overhead but also provides new opportunities to the use of FPGAs as hardware accelerators. However, a proper placement and scheduling of the tasks to be executed on the FPGA is mandatory to compensate the remaining time overhead.

2.3. Our Approach

offers several benefits in the context of a microphone array network emulator. As mentioned above, the reuse of the logic resources increases the flexibility of the by modifying the functionalities in runtime. For instance, allows the evaluation of different node’s architectures in runtime as has been shown in [3, 4]. Thanks to , the area reuse not only increases the system’s flexibility but can be also used to increase performance when exploiting the level of parallelism of the tasks of the accelerated on the FPGA. Figure 1 depicts the main idea, where a is configured to support a scalable algorithm, which processes several inputs in parallel (e.g., a convolutional filter with several kernel sizes). The performance doubles thanks to partially reconfiguring the to support two instances of the algorithm while consuming the same area. Of course, the overall performance by using through only increases if the time overhead is reduced. As extension of the platform is presented in [3], we propose several heuristics to properly place and schedule the nodes’ configurations. The overall result is a flexible platform optimized to achieve the highest area and performance efficiency.

In this section, we provide an overview of similar previous work and explain the relations and differences compared to our work.

FPGAs have been already used as emulators for WSN. The authors in [11] propose an FPGA-based WSN emulator for the design, simulation, and evaluation of WSNs. Similarly, the authors in [12] present an FPGA-based wide-band wireless channel emulator able to generate white Gaussian noise and multipath and Doppler fading effects. Both works are complementary to ours as we focus on a detailed emulation of a network node while simplifying the wireless communication aspect. Furthermore, the of our emulator provides a higher dynamism and flexibility, which is not exploited in the mentioned emulators.

The use of has been thoroughly explored and proposed during the last decades. From one side, the use of to change the configuration of a node in a network has been already considered. For example, a LUT-based is proposed in [13] as part of an adaptive beamformer and in [5] to obtain a dynamic angular resolution of their acoustic beamforming. Our , however, considers the complete reconfiguration of the node and not only a minor component. Thanks to this additional flexibility different architectures can be evaluated on the nodes.

From the other side, the use of induces certain area and time overhead. Specially critical is the time overhead, which must be overcome in order to achieve high performance. Several authors have proposed strategies to mitigate the impact of this overhead. The approach proposed in [14] minimizes the total reconfiguration time when distributing the tasks onto the target architecture, through a proper placement and scheduling. Despite the authors exploiting task’s similarities, they do not use such characteristic to further exploit the resources. The optimal placement and scheduling are an NP-problem which is usually solved as an Integer-Linear Programming () problem. Thus, the authors in [15] present their model together with an heuristic to exploit techniques such as module reuse to reduce the number of reconfigurations. However, their approach does not consider the use of to increment the resource sharing of the . Our approach, instead, not only considers the module reuse during execution time but also takes advantage of task’s similarities to share logic resources of s. A more similar work to the one presented here is presented in [16]. The authors propose the resource sharing of by merging tasks of streaming applications thanks to identifying similarities between tasks. Although our approach addresses similar applications, our strategy prioritizes the maximum area reuse of s while reducing the number of reconfigurations on a PCIe-based FPGA. Despite the fact that none of the mentioned works uses , through has been already targeted in [17, 18]. Our proposed presents a more complex application which benefits from the current state-of-the-art technology [19]. As far as we are aware, the presented is one of the first applications using the recently introduced Xilinx MCAP [20] to partially reconfiguring the FPGA through PCIe.

4. Network Emulator Description

The main purpose of the (Figure 2) is to mimic the functionality of a network composed of microphone array nodes and to evaluate the network’s response for certain acoustic scenarios. This network increases the accuracy of the sound source location by combining the response of each node. This information is used as an early estimation about how the network would react in real-world scenarios and allows a fast design space exploration in order to target priorities like overall power consumption or the accuracy of the sound source localization. Our is flexible enough to support multiple network topologies, different sound source detection methods, or a variable number of nodes and sound sources.

Figure 3 summarizes the execution steps of the . One or multiple sound sources are generated for a target scenario composed of a variable number of nodes. Each execution consists of several iterations to compute all the necessary nodes on the available . The data collected from the nodes, the polar steering response maps, are fused and used to estimate the position of the sound sources. An evaluation of the error is done by considering the estimated position where the sound source is located and the known position of the sound source. Then, based on the target strategy under evaluation, the network is readjusted by partially reconfiguring the nodes with different configurations. The overall network power consumption and the accuracy of the sound source location are examples of potential strategies.

4.1. Distributed Functionality

The is built using the node’s architecture described in the previous section. Thanks to the scalability and flexibility of the architecture, each node of the network can present a different configuration. The network is designed to preserve this flexibility in order to adapt its response for the variances in the acoustic environment. Therefore, the must support multiple node’s possible configurations [6]. Some configurations are supported thanks to control signals, to disable certain microphones, or through , especially when evaluating the use of different architectures. Further details regarding the supported node’s configurations are provided in Section 4.4.

Figure 2 depicts the main components of the , which are distributed between the host and the FPGA.

4.1.1. Host

The host contains the sound source generator, the data fusion of the polar maps, and the evaluation of the data fusion. A graphical user interface (Figure 4) abstracts the user from these computations and from the host-FPGA communication and . The graphical user interface consists of a front-end generated in Matlab that communicates with the FPGA back-end through a middleware. The front-end is capable of simulating a sound field with multiple sound sources and nodes. Each sound source can have different frequency bands and each node can have different array configurations and calculation methods. Multiple sound sources are converted to PDM format in order to be compatible with the expected input data format of the node. The front-end is also capable of generating probability maps with the polar steering response map produced by the nodes on the FPGA.

The front-end uses data fusion to locate sound sources. Data fusion techniques combine the information gathered by different sensors measuring the same process to enhance the understanding of that process. In the context of this article, data fusion is performed by aggregating and combining the acoustic directivity information, represented as a polar steering response map, gathered by each node to produce a probability map of the location of the observed sound sources in a two-dimensional field. This technique is originally presented in [1] and has been used to validate the capacity of their microphone array design to locate sound sources (Figure 5).

Based on the data fusion results, the front-end calculates three error parameters: the localization error (in meters), the number of undetected sound sources, and the number of phantom sound sources.

The front-end allows us to create a network of nodes and to validate our architecture with a permutation of different scenarios: array architecture, detection method, sound spectrum, and sound source positions.

Finally, a middleware abstracts the application from the acceleration on the FPGA by managing the nodes’ configurations, the FPGA’s , and the host-FPGA communication. Further details about the heuristics for the placement and schedule of the nodes are detailed in Section 5.

4.1.2. PCIe Communication

The communication between the host and the FPGA uses the Xilinx PCIe DMA driver available in [21]. This driver enables the interaction of the software running on the host with the DMA endpoint IP via PCIe.

4.1.3. FPGA

On the FPGA side, the uses the IP core DMA subsystem for PCIe [22]. This IP core provides support for different types of reconfiguration through PCIe, such as Tandem, Tandem with Field Updates, and PR. In our case, the IP core DMA subsystem for PCIe with PR support is used. The DMA capability of this core is configured to act like an AXI4 bridge, operating at 125 MHz and with an AXI4 stream interface of 256 bitwidth. The HDL code of each node is encapsulated in an AXI4-Stream Wrapper in order to be compatible with AXI4-Stream. This AXI4-Stream Wrapper interfaces an input AXI4-Stream FIFO, both integrated in a Node Emulator entity. The is composed of 4 Node Emulators operating at 62.5 MHz and with a 64-bits AXI4-Stream interface each. Finally, the output AXI4-streams of the Node Emulators are combined in a 256-bits AXI4-Stream to interface the PCIe DMA Subsystem IP core.

4.2. Node Description

The original architecture proposed in [5] has been improved in [6] by rearranging the detection method () in a modular fashion and by reducing the control management. The filter stage has been also modified to operate uninterruptedly during the beamed orientation transition. The implementation of the node’s architecture on the FPGA (Figure 6) is done in VHDL and designed to process in stream fashion. Moreover, the nodes of the are composed of several cascaded stages operating in pipeline for a fast sound source location.

4.2.1. Microphone Array

The audio data is acquired by the microphone arrays and expressed as a multiplexed pulse density modulation (PDM). The microphone array is composed of four concentric subarrays of 4, 8, 16, and 24 digital MEMS microphones [23] mounted on a 20-cm circular printed board (Figure 7). Each subarray is dynamically activated or deactivated in order to facilitate the capture of spatial acoustic information using a beamforming technique. The distributed geometry of the subarrays allows the adaptation of the sensor to different sound sources. Therefore, the computational demand is adapted to the surrounding acoustic field, making the sensor array more power efficient if only a necessary number of subarrays are active. The emulation of the microphone array is partially done at the host side by the sound generator. The sound wave corresponding to the sound sources is generated in a PDM format for each microphone based on the node to be emulated and the position of the microphone in the node. The frequency band of the audio sources ranges from 100 Hz up to 15 kHz. In order to respect the technical specifications of the ADMP521 MEMS microphones, the generated audio signals are oversampled at 2 MHz.

4.2.2. Filter Stage

The single-bit PDM signal needs to be filtered to remove the high-frequency noise and to be downsampled to retrieve the audio signal in a Pulse-Code Modulation (PCM) format. The removal of the undesired high-frequency noise and the downsampling are done at the filter stage. Thus, each microphone signal has one cascade of filters to downsample and to remove the high-frequency noise. The first filter is a order low pass Cascaded Integrated-Comb (CIC) decimator filter with a decimation factor of 16. This type of filter only involves additions and subtractions [24], which significantly reduces the resource consumption. The CIC filter is followed by a 32-bits running average block to reduce the microphone DC offset and by a 15th order serial low pass FIR filter with a cut-off frequency of 12 kHz and a decimation factor of 4 completes the filter chain. The serial design of the FIR filter drastically reduces the resource consumption but limits the maximum order of the filter, which must be equal to the decimation factor of the CIC filter. The data representation is a signed 32-bits fixed point representation with 16 bits as fractional part. The filter’s coefficients are represented with 16 bits. However, the bitwidth is higher in the filter to keep the proper data resolution due to some internal filter operations, but the interfilter data representation is set to constant by applying the proper adjustment at the output of each filter.

4.2.3. Beamforming Stage

Beamforming techniques focus the array to a specific orientation by amplifying the sound coming from the predefined direction, while suppressing the sound coming from other directions. Therefore, the directional variations of the surrounding sound field are measured by continuously steering the focus direction in a 360° sweep. Our Delay-and-Sum based beamformer is applied to 64 orientations, which represents an angular resolution of 5.625 degrees.

The filtered signal of each microphone is delayed by a specific amount of time determined by the focus direction, the position vector of the microphone, and the speed of sound. All possible delays are precomputed, grouped based on the supported beamed orientations, and stored in block RAMs (BRAM) during the compilation time. Therefore, the 32-bits filtered audio of each microphone is delayed based on the precomputed values and grouped following their subarray structure to support a variable number of active microphones. Thus, instead of implementing one simple beamforming operation of 52 microphones, there are four beamforming operations in parallel for the 4, 8, 16, and 24 microphones. Only the beamforming block linked to an active subarray is enabled, while the disabled beamformers are set to zero.

4.2.4. Detection Stage

The polar steering response maps are obtained at this stage. The output data is normalized based on the maximum output value for each complete loop. The normalized outputs need to be represented with at least 16 bits to avoid errors due to the data representation.

We distinguish here 2 different . Both methods, already proposed in [3], can be available by partially reconfiguring the node’s architecture based on the active subarrays.

Polar Steered Response Power. The original architecture in [1] proposed the Delay-and-Sum beamforming technique. This technique uses the added beamformed values to calculate the output power of the signal per orientation in the time domain. The computation of this output power for different beamed orientations defines the Polar Steered Response Power (). The informs in which direction the sound source is located since the maximum power is obtained when the focus corresponds to the location of a sound source.

Cross-Correlation. The cross-correlation () method is based on the cross-correlated pairs of microphones. Thus, the time-differences-of-arrival () is the lag associated with the maximum measured correlation. The method is more robust to reverberation and noise effects [25] since it considers all available information. Nevertheless, we propose the alternative implementation of where all the global information is used and the difference between beamed orientations is amplified. Once the audio is beamformed, the audio data of all microphones are cross-correlated with each other and accumulated. Thus, once the audio data is properly delayed, the maximum of the positive values determines the location of a sound source. This method, however, demands a high number of multiplications because all possible pairs of microphones need to be correlated. The total number of multiplications () depends on the number of active microphones () and is expressed as follows:

Unfortunately, drastically increases when increasing the number of active microphones. For instance, when only the inner subarray, composed of 4 microphones, is active. Furthermore, increases from 66 to 1326 when activating the first two inner subarrays (12 microphones) or all subarrays (52 microphones), respectively. This fact has a significant impact in the resource consumption, since not only a large number of DSPs are consumed but also the LUTs are used to keep the fixed point precision. Each multiplication extends the 32-bit fixed point data representation to 64 bits, which is adjusted to 32 bits again before the next multiplication in the multiplication tree [3].

The method promises better accuracy when using a lower number of microphones. The theoretical implementation needs 66 multiplications in order to reach all possible combinations of the 12 active microphones. However, in order to save resources while maintaining the maximum flexibility, the implementation under evaluation only considers the combinations between the microphones of a subarray and not the combinations between the microphones of different subarrays. Therefore, the number of multiplications is reduced to 32, with 6 and 28 multiplications for the 4 microphones of the inner subarray and the 8 microphones of the second subarray, respectively. Because this modular promises higher accuracy, it is an interesting candidate to replace the method when a low number of microphones is active. The analysed method in our experiments only considers the use of the two inner subarrays.

4.3. Accuracy

The effective dynamic range of the floating point data representation provides a high accuracy at the cost of a high resource consumption and a performance cost, which discourages the use of floating point operations in the node’s architecture. The alternative fixed point data representation, however, induces undesired errors [26, 27]. The most sensitive blocks of the node’s architecture are located in the filter and the detection stages. A variable fixed point representation is applied at each node’s stage to minimize the errors induced by this type of data representation. The internal operations in the filters are scaled in order to provide enough bits for the data representation. However, in order to reduce the overall resource consumption, the output of each block is rescaled to signed 32-bit fixed point representation with 16 bits of fractional part. The evaluation of the impact in the accuracy of the node’s response has been performed for each supported frequency by comparing the results with our reference model programmed in Matlab which mimics the node’s architecture and is already used in [36]. As a result, the fixed point data representation at each stage of the node’s architecture guarantees an average relative error of , compared to floating point data representation.

4.4. Design Space Exploration

Table 1 summarizes the most relevant parameters that can be evaluated in our platform. Some parameters are related to the node’s architecture while others are relevant at the network perspective. Although the node’s parameters like the impact of the number of orientations or the sensing time have been discussed in [6] from the performance point of view, the allows the evaluation of the network’s configurations. For instance, different data fusion techniques or network topologies for a lower network’s power consumption can be evaluated. Moreover, the error induced by the data desynchronization from the nodes can be estimated. Notice that, due to the distribution of the functionality, the node’s parameters affect the FPGA design while the network’s parameters affect the code running on the host. Therefore, in order to focus on how can be exploited by our platform, only a couple of node’s parameters are considered for the design space exploration.

The node’s architecture permits the modification of in runtime (Figure 6) to adapt the node’s response to the dynamic behavior of acoustic environments. However, has a significant impact in the area consumption and in the node’s power consumption as detailed in [6]. This fact makes an interesting parameter to be evaluated from a network point of view because while a lower leads to a lower node’s power consumption, a lower accuracy is the price to pay. Further details about the node’s architecture, demanded hardware resources, and the impact of the supported configurations are detailed in [6].

Although different node’s configurations are supported in runtime, this does not occur with the , since only one of the proposed can be allocated on the FPGA. As a consequence, the switch between the proposed is only possible when using . Nevertheless, the evaluation of the is fully supported in our for the two inner subarrays. Both parameters are used to evaluate networks composed of nodes with variable values of and (Table 2). The experimental results are presented in Section 6.

5. Strategies for Exploiting Partial Reconfiguration

The middleware, located in the host side, is not only in charge of the host-FPGA communication through but also controlling and to optimizing the . This layer abstracts the front-end from the back-end configuration. While the user only needs to configure the topology of the network, the nodes’ configuration, and the sound sources, the middleware optimizes the execution of the by merging and scheduling the nodes in the available . Thus, the user does not need any knowledge about the configurations or in what a particular node is emulated.

One execution of the consists of the emulation of a certain number of nodes under an acoustic context. The middleware distributes the nodes between the available . Several iterations are needed in case the number of nodes to be emulated () is higher than the number of (). Thus, the number of iterations () is defined asThanks to the independence of the nodes, there are no data dependencies. This fact simplifies the middleware decisions. Thus, the middleware uses some heuristics to optimize the merging of compatible node’s configurations in order to exploit the available resources in one and also optimizes the scheduling of the computation of the merged node’s configurations. As a result, the middleware not only allows the abstraction of the user about the NE internal configuration but also exploits PR to increase performance.

5.1. Increasing Network Capabilities

The dynamism required to reduce the estimation error under unpredictable acoustic scenarios is enhanced thanks to . A clear example occurs when minimizing the overall network power consumption while offering accurate sound sources location. The overall power consumption and the accuracy are directly related to the . Thus, a trade-off is needed in order to get the highest accurate estimation with the minimum power consumption. has a role when considering alternative architectures to enhance the quality of results. That is the case of the method, which is only applicable for a lower number of microphones where it promises better accuracy. In that case, allows the dynamic modification of the network configuration in runtime to satisfy power constraints. Such evaluation in the would not be possible without . Otherwise, the platform had to be completely reprogrammed and a reboot would be needed in order to let the operating system identify the reconfigured device. Our experimental results in Section 6 cover this example.

5.2. Heuristics to Increase Performance

Figure 8 shows the heuristics used by the middleware to increase performance. can only be exploited for higher performance by properly placing and scheduling the nodes to be accelerated. Furthermore, we propose the use of to exploit the configurations’ compatibilities in order to better use the available resources.

An existing cost table (), like Table 3, is used to decide at each step of the heuristics. Such table is elaborated at design time and contains information about the relative area cost of the configurations, related to the most area demanding configuration, and information about configurations’ compatibilities. From one side, the relative area cost reflects the configuration’s level of parallelism (), which is used to merge compatible node’s configurations. In fact, is the inverse of the relative area cost since it represents the number of nodes with a certain configuration that can be executed in parallel per . From the other side, the configurations’ compatibilities relate a certain configuration with its supported . Thanks to the flexible architecture of the nodes, the number of active microphones varies from 4 to 52 MICs. Their activation is in runtime through control signals and does not require any type of . As a consequence, certain support multiple node’s configurations depending on the dedicated logic resources. Thus, the not only supports 52 microphones but also 28, 12, or 4. Such flexibility is used to reduce the number of in order to achieve higher performance. A summary of the abbreviations used for the description is done in Table 4.

5.2.1. Classification

The nodes are classified based on an existing like Table 3. This classification identifies the compatible and the that can be achieved based on their type. Algorithm 1 details the operations involved during the nodes’ classification. is composed of multiple nodes’ configurations, which can be optimally parallelized and scheduled to minimize the execution time. The relative area cost of each node is used as reference to sort the nodes’ list in decreasing order. Lately, the nodes’ list is evaluated to identify the compatible per configuration.

input: Nodes’ configuration list ,
output: Sorted and classified node list
(1) begin
(2)    Sort nodes based on their area cost (, );
(3)    Find Compatible (, );
(4) end
5.2.2. Merging

The merging of the nodes’ configurations consists of grouping the maximum number of compatible configurations in one in order to exploit the otherwise unused resources. For instance, in case the nodes with and can be computed in parallel in configured with and , respectively (Table 3). Since the are associated with the nodes’ configurations after the merging heuristic, for the previous example results in . Notice that is reduced in one unit thanks to this merging (2). In fact, the reduction of is the main objective of this heuristic.

Algorithm 2 shows how nodes are merged based on their to place in each as many nodes as possible. This merging intends to reduce and, potentially, the overall execution time. Algorithm 2 starts scanning the list of configurations in increasing relative area cost order. There are three possibilities:(i)If the configuration consumes a complete , which occurs when its cost equals 1, the configuration cannot share any . Thus, this configuration is allocated to the largest compatible in order to maximize the reuse of this . Otherwise, if the demanded relative area cost of the configuration is lower, it can be evaluated for sharing a .(ii)In case the addition of the configuration’s cost and the accumulated area cost of the already allocated nodes is higher than one , this is locked. Firstly, the grouped configurations are moved to the configurations’ list since they cannot longer share the resources of one . Secondly, the new unassigned node’s configuration is assigned to the smallest compatible to limit the sharing to the most constrained situation.(iii)In case area cost of the configuration allows the addition of this node’s configuration to the existing configurations’ group, the accumulated area cost (which represents the percentage of occupancy of the ) is incremented by the maximum area cost of new node’s configuration. In this way, the area cost of the largest node’s configuration dominates and thus unfeasible situations are avoided.

  input: Nodes’ configuration list
  output: Merged configuration list
(1) begin
(2)   ;
(3)   ;
(4)   for do
(5)     if AreaCost() then
(6)       ;
(7)       AccAreaCost() AreaCost();
(8)       CompatibleRMs() SmallestRM();
(9)        InsertIn();
(10)     end
(11)     else
(12)       if AreaCost() + AccAreaCost() then
(13)          InsertIn();
(14)         ;
(15)         AccAreaCost() AreaCost();
(16)         CompatibleRMs() SmallestRM();
(17)       end
(18)       else
(19)          InsertIn();
(20)         AccAreaCost() AccAreaCost() +
            max(AreaCost(;
(21)        end
(22)      end
(23)   end
(24) end

As a result, the configurations are categorized in the compatible which maximize the area reuse and potentially increment the overall performance by decreasing .

5.2.3. Scheduling

Once the nodes have been merged, they need to be properly scheduled in order to minimize the number of reconfigurations (). The strategy consists in maximizing the reuse of the RP’s previous configurations between iterations of one execution.

Algorithm 3 details how the merged configurations in are scheduled based on the configuration of the in each iteration. The RP’s configuration of the previous iteration is used as initial RP’s configuration of the iteration under scheduling. Thus, the list is traverse looking for the same loaded in the target . If found, the node’s configuration is assigned to that at that particular iteration. The process continues for the next until all the available are assigned. If there is no compatible node’s configuration with the available at a certain iteration, it could be possible that either all the nodes have been allocated or is needed. Based on the number of unallocated nodes, it is possible to distinguish how to proceed. On the one hand, if is needed, the most frequent configuration of the unallocated nodes is selected. This configuration is obtained through the calculation of a histogram and maximizes the potential reuse of this configuration over the remaining iterations. On the other hand, the remaining unassigned keep their configuration from the previous iteration to avoid additional and unnecessary if there are no more unallocated nodes. The process continues this way until no compatible tasks are available. The of some is then mandatory to compute the remaining tasks. Finally, it might be possible that the computation of some is not required. This occurs when is not an integer number. In that case, the maintain their configuration from the previous iteration to avoid additional and unnecessary .

  input: Merged configuration list
  output: Scheduled configuration set
(1) begin
(2)    ;
(3)    ;
(4)     InitialRPsConfig;
(5)    for do
(6)      ;
(7)      for do
(8)        for Size() do
(9)          if Config() = Config() then
(10)             InsertConfigInRP();
(11)             MarkAsConfigured();
(12)             RemoveConfigfromList(, );
(13)            break;
(14)          end
(15)        end
(16)      end
(17)      if NofElements() then
(18)        for NotConfigured() do
(19)          if NofElements() 0 then
(20)             CalcHistogram();
(21)             FindMostFreqConfig();
(22)             InsertConfigInRP(();
(23)             MarkAsConfigured();
(24)             RemoveConfigfromList(, );
(25)          end
(26)          else
(27)             Config();
(28)             MarkAsConfigured();
(29)          end
(30)        end
(31)      end
(32)    end
(33) end

For the sake of understanding, the scheduling heuristic is applied to the previous example. We assume some initial RPs’ conditions and the previous values of : The scheduling heuristic distributes the elements of between the required based on the previous iterations RPs’ configurations. Therefore, the execution order to minimize results as follows: where − represents an unused in one particular iteration. Thanks to both heuristics, has been reduced to 2 iterations, multiple configurations are computed in parallel, and there is no need for .

5.3. Partial Reconfiguration over PCIe

Despite the minimization of between iterations due to our scheduler, might be unavoidable due to the initial configuration of the and the list of nodes to be executed. Our uses the Media Configuration Access Port (MCAP) [19], which is a new configuration interface available for UltraScale devices, that provides a dedicated connection to the from one specific block per device. This interface is integrated into the hard block and provides access to the FPGA configuration logic through the hard block when enabled. The bitstream loading across the to configure the of the is detailed in [20]. The detailed process is only applicable for UltraScale architectures since these architectures need clearing bitstreams in order to prepare the for the new configuration. Consequently, each new reconfiguration of one of the requires a clearing operation before being reconfigured. Otherwise, the subsequent cannot be initialized [19]. It demands a knowledge of what is configured at each . This task is done at the host side by the middleware, which monitors the status of the nodes and applies the proper clearing reconfiguration before each PR of a node.

5.3.1. Cost Table

The of the shown in Table 3 has been partially defined at design time. It lists the different node’s configurations to be executed on the FPGA, their relative area cost, and the compatibility between the defined . The current considers 4 different configurations of the nodes based on the active microphones and using the method. Thus, the microphones of all subarrays are active when , which is the most area demanding configuration. Thanks to the flexibility of the nodes, can be modified at runtime without the need of . Therefore, the largest configuration is able to support all considered node’s configurations. To exploit the unused resources of a RP when considering low area-demanding node’s configurations ( < 12), several node’s configurations are placed in parallel per RP. For instance, when up to 4 instances can be placed in a dimensioned for the node’s configuration. Here is where has a role for performance acceleration. Of course, such acceleration is determined by the overhead due to partially reconfiguring a and the time cost of each configuration. The time cost shown in Table 3 is the averaged execution time experimentally measured after 100 executions.

5.3.2. Defining Reconfigurable Partitions

Figure 9 depicts the 4 available on the . The sizes of the are determined by the nodes’ configuration sizes and the dedicated I/O channels. Firstly, the RPs’ size must be large enough to support different node’s architectures based on or as detailed in Table 2. Nevertheless, the have a fixed dimension independent of the node’s parameter under evaluation since hierarchical is not supported [28]. Thus, our are designed to fit the most demanding resource node’s configuration, which is based on Table 3. Secondly, the 64-bit dedicated AXI4-Stream interface also limits the maximum per . Due to the characteristics of the node emulation, each normalized output needs to be represented with at least 16 bits. This fact is limited to 16, the number of nodes that can be simultaneously allocated on the FPGA. Finally, notice that the better adjust the available resources to the most area demanding node’s configuration with respect to [3, 4].

Figure 10 depicts the list of supported based on the experiments presented in Section 6. Notice that each RP supports the same number of . For instance, there are 4 different based on the number of active subarrays when using the method. Table 5 details the dimensions of the and their percentage of occupancy when configured with . Their values slightly vary since not all have exactly the same size, containing a different number and type of slices.

5.3.3. Partial Reconfiguration Overhead

The reconfiguration time per , including the cleaning operation and the , rounds to 1.09 s. It represents a relatively slow when considering the theoretical bandwidth and the size of the bitstream files, which rounds to 4.762 MB. Nevertheless, the time values have been experimentally obtained at the driver side.

6. Experimental Results

A couple of experiments are detailed in this section to demonstrate some of the capabilities of the and the benefits of . The sound field simulation used in the front-end has been optimized for a two-dimensional open field where sound attenuation, caused by propagation, has been assumed to be negligible. All the experiments demand a involving one or more . The first experiment demonstrates how is used to evaluate different for several node’s configurations and sound source profiles. The main purpose is to show how the capabilities of the are extended thanks to . Thus, different can be evaluated in runtime, which could not be possible without . Our strategies and heuristics are evaluated in the second experiment, which exemplifies how the use of can lead to a significant performance improvement. The increment in performance comes from the better resource exploitation. As a result, executions are accelerated when the network is composed of a minimum number of nodes.

All the supported node’s configurations (Table 2) are implemented and stored in the host side. As a result, the bitmaps required to run the experiments detailed in this section (Figure 10) are loaded to the by using through when needed. Because no bitmap compression technique has been applied, all the bitmaps associated with one have the same size (Table 5). The bitmaps are grouped based on the experiment where they are used. However, some bitmaps like are used in both experiments. Finally, a static bitmap file contains the static logic and the initial ’ configuration.

The FPGA card used for the implementation of the is a Xilinx Kintex Ultrascale from Alpha Data (ADM-PCIE-KU3), whose available resources are detailed in Table 5. It provides a Gen3 connection, supporting up to two x8 controllers. Vivado 2016.4 has been used to develop the DMA Subsystem and the through design flow. The system has been implemented in an Ubuntu 14.04.1 machine that uses C/C++ code, bash scripts, and Matlab 2016b.

6.1. Impact of the Detection Methods

The feature of the provides the capability of evaluating different node’s configurations. The following experiment intends to demonstrate how the allows the comparison of two different from the network point of view.

Figure 11 shows the average error in the estimation of the sound source when applying data fusion of 4 nodes using the two inner subarrays. The values have been obtained for a random position of a standalone sound source at 3 different frequencies (4, 8, and 10 kHz). The RMs are partially reconfigured to switch between both methods. The evaluation also considers the permutations of all possible combinations of the two inner subarrays. Thus, the top left error value corresponds to the 4 nodes with only the inner subarray active while the top right corresponds to all the nodes with two inner subarrays active. The results show that the method does not offer a significant improvement compared to the method. Despite offering a lower estimation error, its implementation in a distributed network of microphone arrays is not completely justified considering the additional resource consumption due to the required multiplications. Nevertheless, further experiments must be done with different sound source frequencies, with real measurements and in an anechoic room before discarding completely the method.

6.2. Partial Reconfiguration for Higher Performance

The use of to increment performance is evaluated through multiple executions with and random per node. All the experimental results explained here have been obtained after 10000 executions of up to 100 random nodes per execution. Only the method is considered in this experiment. The only difference in the node’s configuration is , which directly affects the node’s resource consumption. Therefore, the performance increases thanks to an increment in the number of node’s configurations executed in one iteration, which is done by allocating on each multiple node’s configurations with small .

Figure 12 depicts the required based on the with and without merging nodes. is, by default, expressed in (2). This value can be decreased thanks to exploiting the unused resources per . Thus, the merging of nodes to share resources of one leads to a lower needed per execution. Since determines the execution time, the merging of the nodes is expected to directly benefit the performance. Both graphs present a stepped shape because at least nodes are executed per iteration.

Figure 13 depicts the average execution time (left figure) and the overall speedup (right figure). The nonheuristic strategy (None) is used as reference. This strategy does not need to partially reconfigure the and, therefore, does not benefit from the use of the heuristics. Each is configured with the same in order to support all the node’s configurations under evaluation. Consequently, the strategy time-multiplexed the nodes in the available without any merging or scheduling. The other two strategies under evaluation consider the proposed heuristic for merging of the node’s configurations as standalone (Merge) and combined with the proposed heuristic for scheduling (). Both strategies have a random initial configuration of the , which are randomly asserted after each execution when using . This is the expected behavior of the since the final configuration of the after one execution is unknown in advance, at least not before the execution of the heuristics.

The results depicted in Figure 13 reflect that although a lower obtained by merging the node’s configuration in the same should lead to a higher performance, the time overhead decreases the overall performance. Moreover, the use of without a proper scheduling induces performance decrements, which is reflected in Figure 13. Although the merging of nodes increases the parallelism and diminishes , the overhead dominates when the nodes are not properly scheduled. Executions demanding a low are specially sensitive to this overhead, because the time overhead represents a large percentage of the overall execution time. As a consequence, the proper scheduling of the nodes is not beneficial unless a certain must be computed per execution. The rightmost figure in Figure 13 represents the overall speedup when only merging or also scheduling. Both graphs are saw waves for the same reason the graphs in Figure 12 present a stepped shape. Because several nodes’ configurations are computed in parallel, remains constant while incrementing . Thus, the speedup increases when increasing computed in the same amount of time, but suddenly decreases when an additional iteration must be computed. The proper merging and scheduling of the nodes’ configurations are only beneficial in average when is higher than 45.

7. Conclusion

The presented has shown the capacity to evaluate different WSN configurations thanks to its ability to mimic the node’s response to several sound sources. The use of PR through PCIe not only allows us to obtain a flexible NE capable of exploring multiple configuration scenarios in runtime but also allows us to accelerate the ’s execution by exploiting the available resources and the inherent parallelism of the node’s emulation. We believe our approach for the not only provides an interesting case study of how can be used to increment performance but can also be extended for other streaming applications such as video processing, where similar convolutional filters must be applied to different image sources. Nevertheless, it will also be interesting to explore other strategies like bitstream compression in order to further reduce the PR time cost. Finally, in the current version of our emulator, the user is able to select the node and its configuration at every moment. Although the current version of the emulator is managed by the user, we consider that certain level of intelligence can be added in the control automation to determine, in real-time, the best strategies to evolve the network configuration to obtain the lowest power consumption with the lowest estimation error.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the European Regional Development Fund (ERDF) and the Brussels-Capital Region-Innoviris within the framework of the Operational Programme 2014–2020 through the ERDF-2020 Project ICITYRDI.BRU. This work is also a result of the CORNET project “DynamIA: Dynamic Hardware Reconfiguration in Industrial Applications” [29] which was funded by IWT Flanders with reference number 140389. Finally, the authors would like to thank Xilinx for the provided software and hardware under the University Program donation.