International Journal of Reconfigurable Computing The latest articles from Hindawi Publishing Corporation © 2016 , Hindawi Publishing Corporation . All rights reserved. An FPGA-Based Quantum Computing Emulation Framework Based on Serial-Parallel Architecture Thu, 07 Apr 2016 07:59:21 +0000 Hardware emulation of quantum systems can mimic more efficiently the parallel behaviour of quantum computations, thus allowing higher processing speed-up than software simulations. In this paper, an efficient hardware emulation method that employs a serial-parallel hardware architecture targeted for field programmable gate array (FPGA) is proposed. Quantum Fourier transform and Grover’s search are chosen as case studies in this work since they are the core of many useful quantum algorithms. Experimental work shows that, with the proposed emulation architecture, a linear reduction in resource utilization is attained against the pipeline implementations proposed in prior works. The proposed work contributes to the formulation of a proof-of-concept baseline FPGA emulation framework with optimization on datapath designs that can be extended to emulate practical large-scale quantum circuits. Y. H. Lee, M. Khalil-Hani, and M. N. Marsono Copyright © 2016 Y. H. Lee et al. All rights reserved. FPGA-Based Real-Time Moving Target Detection System for Unmanned Aerial Vehicle Application Mon, 04 Apr 2016 11:35:00 +0000 Moving target detection is the most common task for Unmanned Aerial Vehicle (UAV) to find and track object of interest from a bird’s eye view in mobile aerial surveillance for civilian applications such as search and rescue operation. The complex detection algorithm can be implemented in a real-time embedded system using Field Programmable Gate Array (FPGA). This paper presents the development of real-time moving target detection System-on-Chip (SoC) using FPGA for deployment on a UAV. The detection algorithm utilizes area-based image registration technique which includes motion estimation and object segmentation processes. The moving target detection system has been prototyped on a low-cost Terasic DE2-115 board mounted with TRDB-D5M camera. The system consists of Nios II processor and stream-oriented dedicated hardware accelerators running at 100 MHz clock rate, achieving 30-frame per second processing speed for 640 × 480 pixels’ resolution greyscale videos. Jia Wei Tang, Nasir Shaikh-Husin, Usman Ullah Sheikh, and M. N. Marsono Copyright © 2016 Jia Wei Tang et al. All rights reserved. Modules for Pipelined Mixed Radix FFT Processors Tue, 22 Mar 2016 10:44:59 +0000 A set of soft IP cores for the Winograd -point fast Fourier transform (FFT) is considered. The cores are designed by the method of spatial SDF mapping into the hardware, which provides the minimized hardware volume at the cost of slowdown of the algorithm by times. Their clock frequency is equal to the data sampling frequency. The cores are intended for the high-speed pipelined FFT processors, which are implemented in FPGA. Anatolij Sergiyenko and Anastasia Serhienko Copyright © 2016 Anatolij Sergiyenko and Anastasia Serhienko. All rights reserved. How to Efficiently Reconfigure Tunable Lookup Tables for Dynamic Circuit Specialization Mon, 21 Mar 2016 07:22:53 +0000 Dynamic Circuit Specialization is used to optimize the implementation of a parameterized application on an FPGA. Instead of implementing the parameters as regular inputs, in the DCS approach these inputs are implemented as constants. When the parameter values change, the design is reoptimized for the new constant values by reconfiguring the FPGA. This allows faster and more resource-efficient implementation but investigations have shown that reconfiguration time is the major limitation for DCS implementation on Xilinx FPGAs. The limitation arises from the use of inefficient reconfiguration methods in conventional DCS implementation. To address this issue, we propose different approaches to reduce the reconfiguration time drastically and improve the reconfiguration speed. In this context, this paper presents the use of custom reconfiguration controllers and custom reconfiguration software drivers, along with placement constraints to shorten the reconfiguration time. Our results show an improvement in the reconfiguration speed by at least a factor 14 by using Xilinx reconfiguration controller along with placement constraints. However, the improvement can go up to a factor 40 with the combination of a custom reconfiguration controller, custom software drivers, and placement constraints. We also observe depreciation in the system’s performance by at least 6% due to placement constraints. Amit Kulkarni and Dirk Stroobandt Copyright © 2016 Amit Kulkarni and Dirk Stroobandt. All rights reserved. On-Chip Reconfigurable Hardware Accelerators for Popcount Computations Thu, 10 Mar 2016 14:32:46 +0000 Popcount computations are widely used in such areas as combinatorial search, data processing, statistical analysis, and bio- and chemical informatics. In many practical problems the size of initial data is very large and increase in throughput is important. The paper suggests two types of hardware accelerators that are (1) designed in FPGAs and (2) implemented in Zynq-7000 all programmable systems-on-chip with partitioning of algorithms that use popcounts between software of ARM Cortex-A9 processing system and advanced programmable logic. A three-level system architecture that includes a general-purpose computer, the problem-specific ARM, and reconfigurable hardware is then proposed. The results of experiments and comparisons with existing benchmarks demonstrate that although throughput of popcount computations is increased in FPGA-based designs interacting with general-purpose computers, communication overheads (in experiments with PCI express) are significant and actual advantages can be gained if not only popcount but also other types of relevant computations are implemented in hardware. The comparison of software/hardware designs for Zynq-7000 all programmable systems-on-chip with pure software implementations in the same Zynq-7000 devices demonstrates increase in performance by a factor ranging from 5 to 19 (taking into account all the involved communication overheads between the programmable logic and the processing systems). Valery Sklyarov, Iouliia Skliarova, and João Silva Copyright © 2016 Valery Sklyarov et al. All rights reserved. An Efficient Evolutionary Task Scheduling/Binding Framework for Reconfigurable Systems Mon, 07 Mar 2016 12:06:42 +0000 Several embedded application domains for reconfigurable systems tend to combine frequent changes with high performance demands of their workloads such as image processing, wearable computing, and network processors. Time multiplexing of reconfigurable hardware resources raises a number of new issues, ranging from run-time systems to complex programming models that usually form a reconfigurable operating system (ROS). In this paper, an efficient ROS framework that aids the designer from the early design stages all the way to the actual hardware implementation is proposed and implemented. An efficient reconfigurable platform is implemented along with novel placement/scheduling algorithms. The proposed algorithms tend to reuse hardware tasks to reduce reconfiguration overhead, migrate tasks between software and hardware to efficiently utilize resources, and reduce computation time. A supporting framework for efficient mapping of execution units to task graphs in a run-time reconfigurable system is also designed. The framework utilizes an Island Based Genetic Algorithm flow that optimizes several objectives including performance, area, and power consumption. The proposed Island Based GA framework achieves on average 55.2% improvement over a single-GA implementation and an 80.7% improvement over a baseline random allocation and binding approach. A. Al-Wattar, S. Areibi, and G. Grewal Copyright © 2016 A. Al-Wattar et al. All rights reserved. XOR-FREE Implementation of Convolutional Encoder for Reconfigurable Hardware Tue, 23 Feb 2016 13:25:36 +0000 This paper presents a novel XOR-FREE algorithm to implement the convolutional encoder using reconfigurable hardware. The approach completely removes the XOR processing of a chosen nonsystematic, feedforward generator polynomial of larger constraint length. The hardware (HW) implementation of new architecture uses Lookup Table (LUT) for storing the parity bits. The design implements architectural reconfigurability by modifying the generator polynomial of the same constraint length and code rate to reduce the design complexity. The proposed architecture reduces the dynamic power up to 30% and improves the hardware cost and propagation delay up to 20% and 32%, respectively. The performance of the proposed architecture is validated in MATLAB Simulink and tested on Zynq-7 series FPGA. Gaurav Purohit, Kota Solomon Raju, and Vinod Kumar Chaubey Copyright © 2016 Gaurav Purohit et al. All rights reserved. A Scalable Unsegmented Multiport Memory for FPGA-Based Systems Thu, 31 Dec 2015 13:20:44 +0000 On-chip multiport memory cores are crucial primitives for many modern high-performance reconfigurable architectures and multicore systems. Previous approaches for scaling memory cores come at the cost of operating frequency, communication overhead, and logic resources without increasing the storage capacity of the memory. In this paper, we present two approaches for designing multiport memory cores that are suitable for reconfigurable accelerators with substantial on-chip memory or complex communication. Our design approaches tackle these challenges by banking RAM blocks and utilizing interconnect networks which allows scaling without sacrificing logic resources. With banking, memory congestion is unavoidable and we evaluate our multiport memory cores under different memory access patterns to gain insights about different design trade-offs. We demonstrate our implementation with up to 256 memory ports using a Xilinx Virtex-7 FPGA. Our experimental results report high throughput memories with resource usage that scales with the number of ports. Kevin R. Townsend, Osama G. Attia, Phillip H. Jones, and Joseph Zambreno Copyright © 2015 Kevin R. Townsend et al. All rights reserved. Exploring Trade-Offs between Specialized Dataflow Kernels and a Reusable Overlay in a Stereo Matching Case Study Wed, 30 Dec 2015 11:48:26 +0000 FPGAs are known to permit huge gains in performance and efficiency for suitable applications but still require reduced design efforts and shorter development cycles for wider adoption. In this work, we compare the resulting performance of two design concepts that in different ways promise such increased productivity. As common starting point, we employ a kernel-centric design approach, where computational hotspots in an application are identified and individually accelerated on FPGA. By means of a complex stereo matching application, we evaluate two fundamentally different design philosophies and approaches for implementing the required kernels on FPGAs. In the first implementation approach, we designed individually specialized data flow kernels in a spatial programming language for a Maxeler FPGA platform; in the alternative design approach, we target a vector coprocessor with large vector lengths, which is implemented as a form of programmable overlay on the application FPGAs of a Convey HC-1. We assess both approaches in terms of overall system performance, raw kernel performance, and performance relative to invested resources. After compensating for the effects of the underlying hardware platforms, the specialized dataflow kernels on the Maxeler platform are around 3x faster than kernels executing on the Convey vector coprocessor. In our concrete scenario, due to trade-offs between reconfiguration overheads and exposed parallelism, the advantage of specialized dataflow kernels is reduced to around 2.5x. Tobias Kenter, Henning Schmitz, and Christian Plessl Copyright © 2015 Tobias Kenter et al. All rights reserved. AC_ICAP: A Flexible High Speed ICAP Controller Thu, 17 Dec 2015 09:49:10 +0000 The Internal Configuration Access Port (ICAP) is the core component of any dynamic partial reconfigurable system implemented in Xilinx SRAM-based Field Programmable Gate Arrays (FPGAs). We developed a new high speed ICAP controller, named AC_ICAP, completely implemented in hardware. In addition to similar solutions to accelerate the management of partial bitstreams and frames, AC_ICAP also supports run-time reconfiguration of LUTs without requiring precomputed partial bitstreams. This last characteristic was possible by performing reverse engineering on the bitstream. Besides, we adapted this hardware-based solution to provide IP cores accessible from the MicroBlaze processor. To this end, the controller was extended and three versions were implemented to evaluate its performance when connected to Peripheral Local Bus (PLB), Fast Simplex Link (FSL), and AXI interfaces of the processor. In consequence, the controller can exploit the flexibility that the processor offers but taking advantage of the hardware speed-up. It was implemented in both Virtex-5 and Kintex7 FPGAs. Results of reconfiguration time showed that run-time reconfiguration of single LUTs in Virtex-5 devices was performed in less than 5 μs which implies a speed-up of more than 380x compared to the Xilinx XPS_HWICAP controller. Luis Andres Cardona and Carles Ferrer Copyright © 2015 Luis Andres Cardona and Carles Ferrer. All rights reserved. Dynamic Task Distribution Model for On-Chip Reconfigurable High Speed Computing System Thu, 10 Dec 2015 07:08:20 +0000 Modern embedded systems are being modeled as Reconfigurable High Speed Computing System (RHSCS) where Reconfigurable Hardware, that is, Field Programmable Gate Array (FPGA), and softcore processors configured on FPGA act as computing elements. As system complexity increases, efficient task distribution methodologies are essential to obtain high performance. A dynamic task distribution methodology based on Minimum Laxity First (MLF) policy (DTD-MLF) distributes the tasks of an application dynamically onto RHSCS and utilizes available RHSCS resources effectively. The DTD-MLF methodology takes the advantage of runtime design parameters of an application represented as DAG and considers the attributes of tasks in DAG and computing resources to distribute the tasks of an application onto RHSCS. In this paper, we have described the DTD-MLF model and verified its effectiveness by distributing some of real life benchmark applications onto RHSCS configured on Virtex-5 FPGA device. Some benchmark applications are represented as DAG and are distributed to the resources of RHSCS based on DTD-MLF model. The performance of the MLF based dynamic task distribution methodology is compared with static task distribution methodology. The comparison shows that the dynamic task distribution model with MLF criteria outperforms the static task distribution techniques in terms of schedule length and effective utilization of available RHSCS resources. Mahendra Vucha and Arvind Rajawat Copyright © 2015 Mahendra Vucha and Arvind Rajawat. All rights reserved. An Improved Diffusion Based Placement Algorithm for Reducing Interconnect Demand in Congested Regions of FPGAs Wed, 11 Nov 2015 12:47:11 +0000 An FPGA has a finite routing capacity due to which a fair number of highly dense circuits fail to map on slightly underresourced architecture. The high-interconnect demand in the congested regions is not met by the available resources as a result of which the circuit becomes unroutable for that particular architecture. In this paper, we present a new placement approach which is based on a natural process called diffusion. Our placer attempts to minimize the routing congestion by evenly disseminating the interconnect demand across an FPGA chip. For the 20 MCNC benchmark circuits, our algorithm reduced the channel width for 15 circuits. The results showed on average ~33% reduction in standard deviation of interconnect usage at an expense of an average ~13% penalty on critical path delay. Maximum channel width gain of ~33% was also observed. Ali Asghar and Husain Parvez Copyright © 2015 Ali Asghar and Husain Parvez. All rights reserved. High Efficiency Generalized Parallel Counters for Look-Up Table Based FPGAs Mon, 19 Oct 2015 09:12:02 +0000 Generalized parallel counters (GPCs) are used in constructing high speed compressor trees. Prior work has focused on utilizing the fast carry chain and mapping the logic onto Look-Up Tables (LUTs). This mapping is not optimal in the sense that the LUT fabric is not fully utilized. This results in low efficiency GPCs. In this work, we present a heuristic that efficiently maps the GPC logic onto the LUT fabric. We have used our heuristic on various GPCs and have achieved an improvement in efficiency ranging from 33% to 100% in most of the cases. Experimental results using Xilinx 5th-, 6th-, and 7th-generation FPGAs and Stratix IV and V devices from Altera show a considerable reduction in resources utilization and dynamic power dissipation, for almost the same critical path delay. We have also implemented GPC-based FIR filters on 7th-generation Xilinx FPGAs using our proposed heuristic and compared their performance against conventional implementations. Implementations based on our heuristic show improved performance. Comparisons are also made against filters based on integrated DSP blocks and inherent IP cores from Xilinx. The results show that the proposed heuristic provides performance that is comparable to the structures based on these specialized resources. Burhan Khurshid and Roohie Naaz Mir Copyright © 2015 Burhan Khurshid and Roohie Naaz Mir. All rights reserved. Leakage Immune Modified Pass Transistor Based 8T SRAM Cell in Subthreshold Region Wed, 30 Sep 2015 07:11:37 +0000 The paper presents a novel 8T SRAM cell with access pass gates replaced with modified PMOS pass transistor logic. In comparison to 6T SRAM cell, the proposed cell achieves 3.5x higher read SNM and 2.4x higher write SNM with 16.6% improved SINM (static current noise margin) distribution at the expense of 7x lower WTI (write trip current) at 0.4 V power supply voltage, while maintaining similar stability in hold mode. The proposed 8T SRAM cell shows improvements in terms of 7.735x narrower spread in average standby power, 2.61x less in average (write access time), and 1.07x less in average (read access time) at supply voltage varying from 0.3 V to 0.5 V as compared to 6T SRAM equivalent at 45 nm technology node. Thus, comparative analysis shows that the proposed design has a significant improvement, thereby achieving high cell stability at 45 nm technology node. Priya Gupta, Anu Gupta, and Abhijit Asati Copyright © 2015 Priya Gupta et al. All rights reserved. Core-Level Modeling and Frequency Prediction for DSP Applications on FPGAs Thu, 03 Sep 2015 10:20:57 +0000 Field-programmable gate arrays (FPGAs) provide a promising technology that can improve performance of many high-performance computing and embedded applications. However, unlike software design tools, the relatively immature state of FPGA tools significantly limits productivity and consequently prevents widespread adoption of the technology. For example, the lengthy design-translate-execute (DTE) process often must be iterated to meet the application requirements. Previous works have enabled model-based, design-space exploration to reduce DTE iterations but are limited by a lack of accurate model-based prediction of key design parameters, the most important of which is clock frequency. In this paper, we present a core-level modeling and design (CMD) methodology that enables modeling of FPGA applications at an abstract level and yet produces accurate predictions of parameters such as clock frequency, resource utilization (i.e., area), and latency. We evaluate CMD’s prediction methods using several high-performance DSP applications on various families of FPGAs and show an average clock-frequency prediction error of 3.6%, with a worst-case error of 20.4%, compared to the best of existing high-level prediction methods, 13.9% average error with 48.2% worst-case error. We also demonstrate how such prediction enables accurate design-space exploration without coding in a hardware-description language (HDL), significantly reducing the total design time. Gongyu Wang, Greg Stitt, Herman Lam, and Alan George Copyright © 2015 Gongyu Wang et al. All rights reserved. Representing Tactics for Fault Recovery: A Reconfigurable, Modular, and Hierarchical Approach Tue, 16 Jun 2015 10:02:34 +0000 We show the advantages of modular and hierarchical design in obtaining fault-tolerant software. Modularity enables the identification of faulty software units simplifying key operations, like software removal and replacement. We describe three approaches to repair faulty software based on replication, namely, Passive Replication, N-Version Replication, and Active Replication, based on modular components. We show that the key construct to represent these tactics is the ability to make ad hoc changes in software topologies. We consider hierarchical mobility as a useful operation to introduce new software units for replacing faulty ones. For illustration purposes, we use connecton, a hierarchical, modular, and self-modifying software specification formalism, and its implementation in the Desmos framework. Fernando J. Barros Copyright © 2015 Fernando J. Barros. All rights reserved. Low Latency Network-on-Chip Router Microarchitecture Using Request Masking Technique Sun, 15 Mar 2015 08:48:59 +0000 Network-on-Chip (NoC) is fast emerging as an on-chip communication alternative for many-core System-on-Chips (SoCs). However, designing a high performance low latency NoC with low area overhead has remained a challenge. In this paper, we present a two-clock-cycle latency NoC microarchitecture. An efficient request masking technique is proposed to combine virtual channel (VC) allocation with switch allocation nonspeculatively. Our proposed NoC architecture is optimized in terms of area overhead, operating frequency, and quality-of-service (QoS). We evaluate our NoC against CONNECT, an open source low latency NoC design targeted for field-programmable gate array (FPGA). The experimental results on several FPGA devices show that our NoC router outperforms CONNECT with 50% reduction of logic cells (LCs) utilization, while it works with 100% and 35%~20% higher operating frequency compared to the one- and two-clock-cycle latency CONNECT NoC routers, respectively. Moreover, the proposed NoC router achieves 2.3 times better performance compared to CONNECT. Alireza Monemi, Chia Yee Ooi, and Muhammad Nadzir Marsono Copyright © 2015 Alireza Monemi et al. All rights reserved. Optimization of Lookup Schemes for Flow-Based Packet Classification on FPGAs Sun, 08 Mar 2015 11:25:26 +0000 Packet classification has become a key processing function to enable future flow-based networking schemes. As network capacity increases and new services are deployed, both high throughput and reconfigurability are required for packet classification architectures. FPGA technology can provide the best trade-off among them. However, to date, lookup stages have been mostly developed as independent schemes from the classification stage, which makes their efficient integration on FPGAs difficult. In this context, we propose a new interpretation of the lookup problem in the general context of packet classification, which enables comparing existing lookup schemes on a common basis. From this analysis, we recognize new opportunities for optimization of lookup schemes and their associated classification schemes on FPGA. In particular, we focus on the most appropriate candidate for future networking needs and propose optimizations for it. To validate our analysis, we provide estimation and implementation results for typical lookup architectures on FPGA and observe their convenience for different lookup and classification cases, demonstrating the benefits of our proposed optimization. Carlos A. Zerbini and Jorge M. Finochietto Copyright © 2015 Carlos A. Zerbini and Jorge M. Finochietto. All rights reserved. Using Genetic Algorithms for Hardware Core Placement and Mapping in NoC-Based Reconfigurable Systems Mon, 02 Feb 2015 13:38:30 +0000 Mapping of cores has been an important activity in NoC-based system design aimed to find the best topological location onto the NoC, such that the metrics of interest can be greatly optimized. In the last years, partial reconfigurable systems (PRSs) have included Networks-on-Chips (NoCs) as their communication structure, adding complexity to the problem of mapping. Several works have proposed specific and robust NoC architectures for PRSs, forming indirect and irregular networks, in which cases the mapping and placement problems must be treated altogether. The placement deals with the physical positioning of those cores inside the reconfigurable device. Up to now, to the best of our knowledge, the mapping-placement problem for those kinds of architectures has not been addressed yet. In this work, the problem formalization for the design-time hardware core placement and mapping in PRS-NoCs is proposed and methodologies for solving it with genetic algorithms (GAs) are presented. Several GA crossovers and methodologies are compared for obtaining the best solution. Results have shown that best GA solution obtained, in average, communication costs with 4% of penalty when compared with global minimum cost, obtained in a semiexhaustive approach. In addition, the algorithm presents low execution times. Jonas Gomes Filho, Marius Strum, and Wang Jiang Chau Copyright © 2015 Jonas Gomes Filho et al. All rights reserved. Scalable Fixed Point QRD Core Using Dynamic Partial Reconfiguration Sun, 14 Dec 2014 00:10:17 +0000 A Givens rotation based scalable QRD core which utilizes an efficient pipelined and unfolded 2D multiply and accumulate (MAC) based systolic array architecture with dynamic partial reconfiguration (DPR) capability is proposed. The square root and inverse square root operations in the Givens rotation algorithm are handled using a modified look-up table (LUT) based Newton-Raphson method, thereby reducing the area by 71% and latency by 50% while operating at a frequency 49% higher than the existing boundary cell architectures. The proposed architecture is implemented on Xilinx Virtex-6 FPGA for any real matrices of size , where and by dynamically inserting or removing the partial modules. The evaluation results demonstrate a significant reduction in latency, area, and power as compared to other existing architectures. The functionality of the proposed core is evaluated for a variable length adaptive equalizer. Gayathri R. Prabhu, Bibin Johnson, and J. Sheeba Rani Copyright © 2014 Gayathri R. Prabhu et al. All rights reserved. Multi-Softcore Architecture on FPGA Thu, 27 Nov 2014 06:22:28 +0000 To meet the high performance demands of embedded multimedia applications, embedded systems are integrating multiple processing units. However, they are mostly based on custom-logic design methodology. Designing parallel multicore systems using available standards intellectual properties yet maintaining high performance is also a challenging issue. Softcore processors and field programmable gate arrays (FPGAs) are a cheap and fast option to develop and test such systems. This paper describes a FPGA-based design methodology to implement a rapid prototype of parametric multicore systems. A study of the viability of making the SoC using the NIOS II soft-processor core from Altera is also presented. The NIOS II features a general-purpose RISC CPU architecture designed to address a wide range of applications. The performance of the implemented architecture is discussed, and also some parallel applications are used for testing speedup and efficiency of the system. Experimental results demonstrate the performance of the proposed multicore system, which achieves better speedup than the GPU (29.5% faster for the FIR filter and 23.6% faster for the matrix-matrix multiplication). Mouna Baklouti and Mohamed Abid Copyright © 2014 Mouna Baklouti and Mohamed Abid. All rights reserved. Low-Cost Fault Tolerant Methodology for Real Time MPSoC Based Embedded System Wed, 19 Nov 2014 08:50:55 +0000 We are proposing a design methodology for a fault tolerant homogeneous MPSoC having additional design objectives that include low hardware overhead and performance. We have implemented three different FT methodologies on MPSoCs and compared them against the defined constraints. The comparison of these FT methodologies is carried out by modelling their architectures in VHDL-RTL, on Spartan 3 FPGA. The results obtained through simulations helped us to identify the most relevant scheme in terms of the given design constraints. Mohsin Amin, Muhammad Shakir, Aqib Javed, Muhammad Hassan, and Syed Ali Raza Copyright © 2014 Mohsin Amin et al. All rights reserved. TreeBASIS Feature Descriptor and Its Hardware Implementation Mon, 10 Nov 2014 12:56:39 +0000 This paper presents a novel feature descriptor called TreeBASIS that provides improvements in descriptor size, computation time, matching speed, and accuracy. This new descriptor uses a binary vocabulary tree that is computed using basis dictionary images and a test set of feature region images. To facilitate real-time implementation, a feature region image is binary quantized and the resulting quantized vector is passed into the BASIS vocabulary tree. A Hamming distance is then computed between the feature region image and the effectively descriptive basis dictionary image at a node to determine the branch taken and the path the feature region image takes is saved as a descriptor. The TreeBASIS feature descriptor is an excellent candidate for hardware implementation because of its reduced descriptor size and the fact that descriptors can be created and features matched without the use of floating point operations. The TreeBASIS descriptor is more computationally and space efficient than other descriptors such as BASIS, SIFT, and SURF. Moreover, it can be computed entirely in hardware without the support of a CPU for additional software-based computations. Experimental results and a hardware implementation show that the TreeBASIS descriptor compares well with other descriptors for frame-to-frame homography computation while requiring fewer hardware resources. Spencer Fowers, Alok Desai, Dah-Jye Lee, Dan Ventura, and James Archibald Copyright © 2014 Spencer Fowers et al. All rights reserved. Architecture and Application-Aware Management of Complexity of Mapping Multiplication to FPGA DSP Blocks in High Level Synthesis Tue, 21 Oct 2014 10:01:40 +0000 Multiplication is a common operation in many applications and there exist various types of multiplication operations. Current high level synthesis (HLS) flows generally treat all multiplication operations equally and indistinguishable from each other leading to inefficient mapping to resources. This paper proposes algorithms for automatically identifying the different types of multiplication operations and investigates the ensemble of these different types of multiplication operations. This distinguishes it from previous works where mapping strategies for an individual type of multiplication operation have been investigated and the type of multiplication operation is assumed to be known a priori. A new cost model, independent of device and synthesis tools, for establishing priority among different types of multiplication operations for mapping to on-chip DSP blocks is also proposed. This cost model is used by a proposed analysis and priority ordering based mapping strategy targeted at making efficient use of hard DSP blocks on FPGAs while maximizing the operating frequency of designs. Results show that the proposed methodology could result in designs which were at least 2× faster in performance than those generated by commercial HLS tool: Vivado-HLS. Sharad Sinha and Thambipillai Srikanthan Copyright © 2014 Sharad Sinha and Thambipillai Srikanthan. All rights reserved. FPGA-Based Implementation of All-Digital QPSK Carrier Recovery Loop Combining Costas Loop and Maximum Likelihood Frequency Estimator Sun, 31 Aug 2014 12:05:48 +0000 This paper presents an efficient all digital carrier recovery loop (ADCRL) for quadrature phase shift keying (QPSK). The ADCRL combines classic closed-loop carrier recovery circuit, all digital Costas loop (ADCOL), with frequency feedward loop, maximum likelihood frequency estimator (MLFE) so as to make the best use of the advantages of the two types of carrier recovery loops and obtain a more robust performance in the procedure of carrier recovery. Besides, considering that, for MLFE, the accurate estimation of frequency offset is associated with the linear characteristic of its frequency discriminator (FD), the Coordinate Rotation Digital Computer (CORDIC) algorithm is introduced into the FD based on MLFE to unwrap linearly phase difference. The frequency offset contained within the phase difference unwrapped is estimated by the MLFE implemented just using some shifter and multiply-accumulate units to assist the ADCOL to lock quickly and precisely. The joint simulation results of ModelSim and MATLAB show that the performances of the proposed ADCRL in locked-in time and range are superior to those of the ADCOL. On the other hand, a systematic design procedure based on FPGA for the proposed ADCRL is also presented. Kaiyu Wang, Zhiming Song, Xianwei Qi, Qingxin Yan, and Zhenan Tang Copyright © 2014 Kaiyu Wang et al. All rights reserved. Self-Awareness in Computer Networks Mon, 25 Aug 2014 08:24:11 +0000 The Internet architecture works well for a wide variety of communication scenarios. However, its flexibility is limited because it was initially designed to provide communication links between a few static nodes in a homogeneous network and did not attempt to solve the challenges of today’s dynamic network environments. Although the Internet has evolved to a global system of interconnected computer networks, which links together billions of heterogeneous compute nodes, its static architecture remained more or less the same. Nowadays the diversity in networked devices, communication requirements, and network conditions vary heavily, which makes it difficult for a static set of protocols to provide the required functionality. Therefore, we propose a self-aware network architecture in which protocol stacks can be built dynamically. Those protocol stacks can be optimized continuously during communication according to the current requirements. For this network architecture we propose an FPGA-based execution environment called EmbedNet that allows for a dynamic mapping of network protocols to either hardware or software. We show that our architecture can reduce the communication overhead significantly by adapting the protocol stack and that the dynamic hardware/software mapping of protocols considerably reduces the CPU load introduced by packet processing. Ariane Keller, Daniel Borkmann, Stephan Neuhaus, and Markus Happe Copyright © 2014 Ariane Keller et al. All rights reserved. Design Patterns for Self-Adaptive RTE Systems Specification Mon, 14 Jul 2014 09:07:55 +0000 The development of self-adaptive real-time embedded (RTE) systems is an increasingly hard task due to the growing complexity of both hardware and software and the high variability of the execution environment. Different approaches, platforms, and middleware have been proposed in the field, from low to high abstraction level. However, there is still a lack of generic and reusable designs for self-adaptive RTE systems that fit different system domains, lighten designers’ task, and decrease development cost. In this paper, we propose five design patterns for self-adaptive RTE systems modeling resulting from the generalization of relevant existing adaptation-related works. Combined together, the patterns form the design of an adaptation loop composed of five adaptation modules. The proposed solution offers a modular, reusable, and flexible specification of these modules and enables the separation of concerns. It also permits dealing with concurrency, real-time features, and adaptation cost relative to the adaptation activities. To validate our solution, we applied it to a complex case study, a cross-layer self-adaptive object tracking system, to show patterns utilization and prove the solution benefits. Mouna Ben Said, Yessine Hadj Kacem, Mickaël Kerboeuf, Nader Ben Amor, and Mohamed Abid Copyright © 2014 Mouna Ben Said et al. All rights reserved. Practical Education Fostered by Research Projects in an Embedded Systems Course Sun, 29 Jun 2014 08:56:04 +0000 The very nature of universities makes them unique environments for research and teaching. Although both activities constantly borrow from each other, a deeper level of interaction is not always achieved for several reasons. This paper presents a successful experience on conducting an undergraduate course on embedded systems, based on strong interaction with related research activities previously conducted by the authors. Known for being everywhere, embedded systems are constantly expanding in both complexity and volume production. In addition, heterogeneous systems are becoming prevalent in modern applications, standing as an additional difficulty to students in this area. In this context, this paper presents experiences in teaching embedded systems using a project-based learning pedagogical approach, with strong emphasis on mobile robotic applications previously developed by MSc and PhD students. As a result, it has been observed that undergraduate students have the opportunity to build a strong background and feel better prepared to face the challenges to be found in their future professional activities. Vanderlei Bonato, Marcio M. Fernandes, Joao M. P. Cardoso, and Eduardo Marques Copyright © 2014 Vanderlei Bonato et al. All rights reserved. Simple Hybrid Scaling-Free CORDIC Solution for FPGAs Thu, 24 Apr 2014 12:14:15 +0000 COordinate Rotation DIgital Computer (CORDIC) is an effective method that is used in digital signal processing applications for computing various trigonometric, hyperbolic, linear, and transcendental functions. This paper presents the theoretical basis and practical implementation of circular (sine-cosine) CORDIC-based generator. Synthesis results of this generator based on Altera Stratix III FPGA (EP3SL340F1517C2) using Quartus II version 9.0 show that the proposed hybrid FPGA architecture significantly reduces latency (42% reduction) with a small area overhead, compared to the conventional version. The proposed algorithm has been simulated for sine and cosine function evaluation, and it has been verified that the accuracy is comparable with conventional algorithm. Leonid Moroz, Shinobu Nagayama, Taras Mykytiv, Ihor Kirenko, and Taras Boretskyy Copyright © 2014 Leonid Moroz et al. All rights reserved. An FPGA Task Placement Algorithm Using Reflected Binary Gray Space Filling Curve Wed, 16 Apr 2014 13:55:51 +0000 With the arrival of partial reconfiguration technology, modern FPGAs support tasks that can be loaded in (removed from) the FPGA individually without interrupting other tasks already running on the same FPGA. Many online task placement algorithms designed for such partially reconfigurable systems have been proposed to provide efficient and fast task placement. A new approach for online placement of modules on reconfigurable devices, by managing the free space using a run-length based representation. This representation allows the algorithm to insert or delete tasks quickly and also to calculate the fragmentation easily. In the proposed FPGA model, the CLBs are numbered according to reflected binary gray space filling curve model. The search algorithm will quickly identify a placement for the incoming task based on first fit mode or a fragmentation aware best fit mode. Simulation experiments indicate that the proposed techniques result in a low ratio of task rejection and high FPGA utilization compared to existing techniques. Senoj Joseph Olakkenghil and K. Baskaran Copyright © 2014 Senoj Joseph Olakkenghil and K. Baskaran. All rights reserved.