International Journal of Reconfigurable Computing The latest articles from Hindawi Publishing Corporation © 2014 , Hindawi Publishing Corporation . All rights reserved. Self-Awareness in Computer Networks Mon, 25 Aug 2014 08:24:11 +0000 The Internet architecture works well for a wide variety of communication scenarios. However, its flexibility is limited because it was initially designed to provide communication links between a few static nodes in a homogeneous network and did not attempt to solve the challenges of today’s dynamic network environments. Although the Internet has evolved to a global system of interconnected computer networks, which links together billions of heterogeneous compute nodes, its static architecture remained more or less the same. Nowadays the diversity in networked devices, communication requirements, and network conditions vary heavily, which makes it difficult for a static set of protocols to provide the required functionality. Therefore, we propose a self-aware network architecture in which protocol stacks can be built dynamically. Those protocol stacks can be optimized continuously during communication according to the current requirements. For this network architecture we propose an FPGA-based execution environment called EmbedNet that allows for a dynamic mapping of network protocols to either hardware or software. We show that our architecture can reduce the communication overhead significantly by adapting the protocol stack and that the dynamic hardware/software mapping of protocols considerably reduces the CPU load introduced by packet processing. Ariane Keller, Daniel Borkmann, Stephan Neuhaus, and Markus Happe Copyright © 2014 Ariane Keller et al. All rights reserved. Design Patterns for Self-Adaptive RTE Systems Specification Mon, 14 Jul 2014 09:07:55 +0000 The development of self-adaptive real-time embedded (RTE) systems is an increasingly hard task due to the growing complexity of both hardware and software and the high variability of the execution environment. Different approaches, platforms, and middleware have been proposed in the field, from low to high abstraction level. However, there is still a lack of generic and reusable designs for self-adaptive RTE systems that fit different system domains, lighten designers’ task, and decrease development cost. In this paper, we propose five design patterns for self-adaptive RTE systems modeling resulting from the generalization of relevant existing adaptation-related works. Combined together, the patterns form the design of an adaptation loop composed of five adaptation modules. The proposed solution offers a modular, reusable, and flexible specification of these modules and enables the separation of concerns. It also permits dealing with concurrency, real-time features, and adaptation cost relative to the adaptation activities. To validate our solution, we applied it to a complex case study, a cross-layer self-adaptive object tracking system, to show patterns utilization and prove the solution benefits. Mouna Ben Said, Yessine Hadj Kacem, Mickaël Kerboeuf, Nader Ben Amor, and Mohamed Abid Copyright © 2014 Mouna Ben Said et al. All rights reserved. Practical Education Fostered by Research Projects in an Embedded Systems Course Sun, 29 Jun 2014 08:56:04 +0000 The very nature of universities makes them unique environments for research and teaching. Although both activities constantly borrow from each other, a deeper level of interaction is not always achieved for several reasons. This paper presents a successful experience on conducting an undergraduate course on embedded systems, based on strong interaction with related research activities previously conducted by the authors. Known for being everywhere, embedded systems are constantly expanding in both complexity and volume production. In addition, heterogeneous systems are becoming prevalent in modern applications, standing as an additional difficulty to students in this area. In this context, this paper presents experiences in teaching embedded systems using a project-based learning pedagogical approach, with strong emphasis on mobile robotic applications previously developed by MSc and PhD students. As a result, it has been observed that undergraduate students have the opportunity to build a strong background and feel better prepared to face the challenges to be found in their future professional activities. Vanderlei Bonato, Marcio M. Fernandes, Joao M. P. Cardoso, and Eduardo Marques Copyright © 2014 Vanderlei Bonato et al. All rights reserved. Simple Hybrid Scaling-Free CORDIC Solution for FPGAs Thu, 24 Apr 2014 12:14:15 +0000 COordinate Rotation DIgital Computer (CORDIC) is an effective method that is used in digital signal processing applications for computing various trigonometric, hyperbolic, linear, and transcendental functions. This paper presents the theoretical basis and practical implementation of circular (sine-cosine) CORDIC-based generator. Synthesis results of this generator based on Altera Stratix III FPGA (EP3SL340F1517C2) using Quartus II version 9.0 show that the proposed hybrid FPGA architecture significantly reduces latency (42% reduction) with a small area overhead, compared to the conventional version. The proposed algorithm has been simulated for sine and cosine function evaluation, and it has been verified that the accuracy is comparable with conventional algorithm. Leonid Moroz, Shinobu Nagayama, Taras Mykytiv, Ihor Kirenko, and Taras Boretskyy Copyright © 2014 Leonid Moroz et al. All rights reserved. An FPGA Task Placement Algorithm Using Reflected Binary Gray Space Filling Curve Wed, 16 Apr 2014 13:55:51 +0000 With the arrival of partial reconfiguration technology, modern FPGAs support tasks that can be loaded in (removed from) the FPGA individually without interrupting other tasks already running on the same FPGA. Many online task placement algorithms designed for such partially reconfigurable systems have been proposed to provide efficient and fast task placement. A new approach for online placement of modules on reconfigurable devices, by managing the free space using a run-length based representation. This representation allows the algorithm to insert or delete tasks quickly and also to calculate the fragmentation easily. In the proposed FPGA model, the CLBs are numbered according to reflected binary gray space filling curve model. The search algorithm will quickly identify a placement for the incoming task based on first fit mode or a fragmentation aware best fit mode. Simulation experiments indicate that the proposed techniques result in a low ratio of task rejection and high FPGA utilization compared to existing techniques. Senoj Joseph Olakkenghil and K. Baskaran Copyright © 2014 Senoj Joseph Olakkenghil and K. Baskaran. All rights reserved. Using Statistical Assertions to Guide Self-Adaptive Systems Sun, 13 Apr 2014 16:34:37 +0000 Self-adaptive systems need to monitor themselves, to check their internal behaviour and design assumptions about runtime inputs and conditions. This kind of monitoring for self-adaptive systems can include collecting statistics about such systems themselves which can be computationally intensive (for detailed statistics) and hence time consuming, with possible negative impact on self-adaptive response time. To mitigate this limitation, we extend the technique of in-circuit runtime assertions to cover statistical assertions in hardware. The presented designs implement several statistical operators that can be exploited by self-adaptive systems; a novel optimization is developed for reducing the number of pairwise operators from to . To illustrate the practicability and industrial relevance of our proposed approach, we evaluate our designs, chosen from a class of possible application scenarios, for their resource usage and the tradeoffs between hardware and software implementations. Tim Todman, Stephan Stilkerich, and Wayne Luk Copyright © 2014 Tim Todman et al. All rights reserved. Distance-Ranked Fault Identification of Reconfigurable Hardware Bitstreams via Functional Input Mon, 17 Mar 2014 07:06:32 +0000 Distance-Ranked Fault Identification (DRFI) is a dynamic reconfiguration technique which employs runtime inputs to conduct online functional testing of fielded FPGA logic and interconnect resources without test vectors. At design time, a diverse set of functionally identical bitstream configurations are created which utilize alternate hardware resources in the FPGA fabric. An ordering is imposed on the configuration pool as updated by the PageRank indexing precedence. The configurations which utilize permanently damaged resources and hence manifest discrepant outputs, receive lower rank are thus less preferred for instantiation on the FPGA. Results indicate accurate identification of fault-free configurations in a pool of pregenerated bitstreams with a low number of reconfigurations and input evaluations. For MCNC benchmark circuits, the observed reduction in input evaluations is up to 75% when comparing the DRFI technique to unguided evaluation. The DRFI diagnosis method is seen to isolate all 14 healthy configurations from a pool of 100 pregenerated configurations, and thereby offering a 100% isolation accuracy provided the fault-free configurations exist in the design pool. When a complete recovery is not feasible, graceful degradation may be realized which is demonstrated by the PSNR improvement of images processed in a video encoder case study. Naveed Imran and Ronald F. DeMara Copyright © 2014 Naveed Imran and Ronald F. DeMara. All rights reserved. Hardware-Efficient Design of Real-Time Profile Shape Matching Stereo Vision Algorithm on FPGA Mon, 24 Feb 2014 08:50:44 +0000 A variety of platforms, such as micro-unmanned vehicles, are limited in the amount of computational hardware they can support due to weight and power constraints. An efficient stereo vision algorithm implemented on an FPGA would be able to minimize payload and power consumption in microunmanned vehicles, while providing 3D information and still leaving computational resources available for other processing tasks. This work presents a hardware design of the efficient profile shape matching stereo vision algorithm. Hardware resource usage is presented for the targeted micro-UV platform, Helio-copter, that uses the Xilinx Virtex 4 FX60 FPGA. Less than a fifth of the resources on this FGPA were used to produce dense disparity maps for image sizes up to 450 × 375, with the ability to scale up easily by increasing BRAM usage. A comparison is given of accuracy, speed performance, and resource usage of a census transform-based stereo vision FPGA implementation by Jin et al. Results show that the profile shape matching algorithm is an efficient real-time stereo vision algorithm for hardware implementation for resource limited systems such as microunmanned vehicles. Beau Tippetts, Dah Jye Lee, Kirt Lillywhite, and James K. Archibald Copyright © 2014 Beau Tippetts et al. All rights reserved. A Top-Down Optimization Methodology for Mutually Exclusive Applications Mon, 17 Feb 2014 08:07:33 +0000 Proliferation of mutually exclusive applications on circuits and the higher cost of silicon make the resource sharing more and more important. The state-of-the-art synthesis tools may often be unsatisfactory. Their efficiency may depend on the hardware description style. Nevertheless, today, different applications in a circuit can be developed by different developers. This paper proposes an efficient method to improve resource sharing between mutually exclusive applications with no dependence on the coding style. It takes the advantage of the possibility of resource sharing as done in FPGA and of predefined multiple functions as in ASIC. Alp Kilic, Zied Marrakchi, and Habib Mehrez Copyright © 2014 Alp Kilic et al. All rights reserved. IP-Enabled C/C++ Based High Level Synthesis: A Step towards Better Designer Productivity and Design Performance Wed, 08 Jan 2014 08:59:07 +0000 Intellectual property (IP) core based design is an emerging design methodology to deal with increasing chip design complexity. C/C++ based high level synthesis (HLS) is also gaining traction as a design methodology to deal with increasing design complexity. In the work presented here, we present a design methodology that combines these two individual methodologies and is therefore more powerful. We discuss our proposed methodology in the context of supporting efficient hardware synthesis of a class of mathematical functions without altering original C/C++ source code. Additionally, we also discuss and propose methods to integrate legacy IP cores in existing HLS flows. Relying on concepts from the domains of program recognition and optimized low level implementations of such arithmetic functions, the described design methodology is a step towards intelligent synthesis where application characteristics are matched with specific architectural resources and relevant IP cores in a transparent manner for improved area-delay results. The combined methodology is more aware of the target hardware architecture than the conventional HLS flow. Implementation results of certain compute kernels from a commercial tool Vivado-HLS as well as proposed flow are also compared to show that proposed flow gives better results. Sharad Sinha and Thambipillai Srikanthan Copyright © 2014 Sharad Sinha and Thambipillai Srikanthan. All rights reserved. Efficient FPGA Hardware Reuse in a Multiplierless Decimation Chain Sun, 05 Jan 2014 08:27:27 +0000 In digital communications, an usual reception chain requires many stages of digital signal processing for filtering and sample rate reduction. For satellite on board applications, this need is hardly constrained by the very limited hardware resources available in space qualified FPGAs. This short paper focuses on the implementation of a dual chain of 14 stages of cascaded half band filters plus 2 : 1 decimators for complex signals (in-phase and quadrature) with minimal hardware resources, using a small portion of an UT6325 Aeroflex FPGA, as a part of a receiver designed for a low data rate command and telemetry channel. Guillermo A. Jaquenod, Javier Valls, and Javier Siman Copyright © 2014 Guillermo A. Jaquenod et al. All rights reserved. Performance Modeling for FPGAs: Extending the Roofline Model with High-Level Synthesis Tools Thu, 26 Dec 2013 13:26:14 +0000 The potential of FPGAs as accelerators for high-performance computing applications is very large, but many factors are involved in their performance. The design for FPGAs and the selection of the proper optimizations when mapping computations to FPGAs lead to prohibitively long developing time. Alternatives are the high-level synthesis (HLS) tools, which promise a fast design space exploration due to design at high-level or analytical performance models which provide realistic performance expectations, potential impediments to performance, and optimization guidelines. In this paper we propose the combination of both, in order to construct a performance model for FPGAs which is able to visually condense all the helpful information for the designer. Our proposed model extends the roofline model, by considering the resource consumption and the parameters used in the HLS tools, to maximize the performance and the resource utilization within the area of the FPGA. The proposed model is applied to optimize the design exploration of a class of window-based image processing applications using two different HLS tools. The results show the accuracy of the model as well as its flexibility to be combined with any HLS tool. Bruno da Silva, An Braeken, Erik H. D’Hollander, and Abdellah Touhafi Copyright © 2013 Bruno da Silva et al. All rights reserved. Impact of Dual Placement and Routing on WDDL Netlist Security in FPGA Tue, 24 Dec 2013 18:22:56 +0000 The wave dynamic differential logic (WDDL) has been identified as a promising countermeasure to increase the robustness of cryptographic devices against differential power attacks (DPA). However, to guarantee the effectiveness of WDDL technique, the routing in both the direct and complementary paths must be balanced. This paper tackles the problem of unbalance of dual-railsignals in WDDL design. We describe placement techniques suitable for tree-based and mesh-based FPGAs and quantify the gain they confer. Then, we introduce a timing-balance-driven routing algorithm which is architecture independent. Our placement and routing techniques proved to be very promising. In fact, they achieve a gain of 95%, 93%, and 85% in delay balance in tree-based, simple mesh, and cluster-based mesh architectures, respectively. To reduce further the switch and delay unbalance in Mesh architecture, we propose a differential pair routing algorithm that is specific to cluster-based mesh architecture. It achieves perfectly balanced routed signals in terms of wire length and switch number. Emna Amouri, Habib Mehrez, and Zied Marrakchi Copyright © 2013 Emna Amouri et al. All rights reserved. Self-Adaptive On-Chip System Based on Cross-Layer Adaptation Approach Mon, 23 Dec 2013 09:12:57 +0000 The emergence of mobile and battery operated multimedia systems and the diversity of supported applications mount new challenges in terms of design efficiency of these systems which must provide a maximum application quality of service (QoS) in the presence of a dynamically varying environment. These optimization problems cannot be entirely solved at design time and some efficiency gains can be obtained at run-time by means of self-adaptivity. In this paper, we propose a new cross-layer hardware (HW)/software (SW) adaptation solution for embedded mobile systems. It supports application QoS under real-time and lifetime constraints via coordinated adaptation in the hardware, operating system (OS), and application layers. Our method relies on an original middleware solution used on both global and local managers. The global manager (GM) handles large, long-term variations whereas the local manager (LM) is used to guarantee real-time constraints. The GM acts in three layers whereas the LM acts in application and OS layers only. The main role of GM is to select the best configuration for each application to meet the constraints of the system and respect the preferences of the user. The proposed approach has been applied to a 3D graphics application and successfully implemented on an Altera FPGA. Kais Loukil, Nader Ben Amor, Mohamed Abid, and Jean Philippe Diguet Copyright © 2013 Kais Loukil et al. All rights reserved. Frequency Optimization Objective during System Prototyping on Multi-FPGA Platform Tue, 17 Dec 2013 14:39:45 +0000 Multi-FPGA hardware prototyping is becoming increasingly important in the system on chip design cycle. However, after partitioning the design on the multi-FPGA platform, the number of inter-FPGA signals is greater than the number of physical connections available on the prototyping board. Therefore, these signals should be time-multiplexed which lowers the system frequency. The way in which the design is partitioned affects the number of inter-FPGA signals. In this work, we propose a set of constraints to be taken into account during the partitioning task. Then, the resulting inter-FPGA signals are routed with an iterative routing algorithm in order to obtain the best multiplexing ratio. Indeed, signals are grouped and then routed using the intra-FPGA routing algorithm: Pathfinder. This algorithm is adapted to deal with the inter-FPGA routing problem. Many scenarios are proposed to obtain the most optimized results in terms of prototyping system frequency. Using this technique, the system frequency is improved by an average of 12.8% compared to constructive routing algorithm. Mariem Turki, Zied Marrakchi, Habib Mehrez, and Mohamed Abid Copyright © 2013 Mariem Turki et al. All rights reserved. An Impulse-C Hardware Accelerator for Packet Classification Based on Fine/Coarse Grain Optimization Thu, 24 Oct 2013 10:02:25 +0000 Current software-based packet classification algorithms exhibit relatively poor performance, prompting many researchers to concentrate on novel frameworks and architectures that employ both hardware and software components. The Packet Classification with Incremental Update (PCIU) algorithm, Ahmed et al. (2010), is a novel and efficient packet classification algorithm with a unique incremental update capability that demonstrated excellent results and was shown to be scalable for many different tasks and clients. While a pure software implementation can generate powerful results on a server machine, an embedded solution may be more desirable for some applications and clients. Embedded, specialized hardware accelerator based solutions are typically much more efficient in speed, cost, and size than solutions that are implemented on general-purpose processor systems. This paper seeks to explore the design space of translating the PCIU algorithm into hardware by utilizing several optimization techniques, ranging from fine grain to coarse grain and parallel coarse grain approaches. The paper presents a detailed implementation of a hardware accelerator of the PCIU based on an Electronic System Level (ESL) approach. Results obtained indicate that the hardware accelerator achieves on average 27x speedup over a state-of-the-art Xeon processor. O. Ahmed, S. Areibi, R. Collier, and G. Grewal Copyright © 2013 O. Ahmed et al. All rights reserved. Rainbow: An Operating System for Software-Hardware Multitasking on Dynamically Partially Reconfigurable FPGAs Tue, 01 Oct 2013 08:34:23 +0000 Dynamic Partial Reconfiguration technology coupled with an Operating System for Reconfigurable Systems (OS4RS) allows for implementation of a hardware task concept, that is, an active computing object which can contend for reconfigurable computing resources and request OS services in a way software task does in a conventional OS. In this work, we show a complete model and implementation of a lightweight OS4RS supporting preemptable and clock-scalable hardware tasks. We also propose a novel, lightweight scheduling mechanism allowing for timely and priority-based reservation of reconfigurable resources, which aims at usage of preemption only at the time it brings benefits to the performance of a system. The architecture of the scheduler and the way it schedules allocations of the hardware tasks result in shorter latency of system calls, thereby reducing the overall OS overhead. Finally, we present a novel model and implementation of a channel-based intertask communication and synchronization suitable for software-hardware multitasking with preemptable and clock-scalable hardware tasks. It allows for optimizations of the communication on per task basis and utilizes point-to-point message passing rather than shared-memory communication, whenever it is possible. Extensive overhead tests of the OS4RS services as well as application speedup tests show efficiency of our approach. Krzysztof Jozwik, Shinya Honda, Masato Edahiro, Hiroyuki Tomiyama, and Hiroaki Takada Copyright © 2013 Krzysztof Jozwik et al. All rights reserved. Design and Implementation of an Embedded NIOS II System for JPEG2000 Tier II Encoding Wed, 19 Jun 2013 18:20:38 +0000 This paper presents a novel implementation of the JPEG2000 standard as a system on a chip (SoC). While most of the research in this field centers on acceleration of the EBCOT Tier I encoder, this work focuses on an embedded solution for EBCOT Tier II. Specifically, this paper proposes using an embedded softcore processor to perform Tier II processing as the back end of an encoding pipeline. The Altera NIOS II processor is chosen for the implementation and is coupled with existing embedded processing modules to realize a fully embedded JPEG2000 encoder. The design is synthesized on a Stratix IV FPGA and is shown to out perform other comparable SoC implementations by 39% in computation time. John M. McNichols, Eric J. Balster, William F. Turri, and Kerry L. Hill Copyright © 2013 John M. McNichols et al. All rights reserved. Hardware Accelerators Targeting a Novel Group Based Packet Classification Algorithm Thu, 30 May 2013 11:07:45 +0000 Packet classification is a ubiquitous and key building block for many critical network devices. However, it remains as one of the main bottlenecks faced when designing fast network devices. In this paper, we propose a novel Group Based Search packet classification Algorithm (GBSA) that is scalable, fast, and efficient. GBSA consumes an average of 0.4 megabytes of memory for a 10 k rule set. The worst-case classification time per packet is 2 microseconds, and the preprocessing speed is 3 M rules/second based on an Xeon processor operating at 3.4 GHz. When compared with other state-of-the-art classification techniques, the results showed that GBSA outperforms the competition with respect to speed, memory usage, and processing time. Moreover, GBSA is amenable to implementation in hardware. Three different hardware implementations are also presented in this paper including an Application Specific Instruction Set Processor (ASIP) implementation and two pure Register-Transfer Level (RTL) implementations based on Impulse-C and Handel-C flows, respectively. Speedups achieved with these hardware accelerators ranged from 9x to 18x compared with a pure software implementation running on an Xeon processor. O. Ahmed, S. Areibi, and G. Grewal Copyright © 2013 O. Ahmed et al. All rights reserved. Selected Papers from the Symposium on Integrated Circuits and Systems Design (SBCCI 2011) Wed, 29 May 2013 16:38:20 +0000 Massimo Conti, Elmar Melcher, Jürgen Becker, Alisson Brito, and Oliver Sander Copyright © 2013 Massimo Conti et al. All rights reserved. Selected Papers from the 2011 International Conference on Reconfigurable Computing and FPGAs (ReConFig 2011) Wed, 20 Mar 2013 08:14:54 +0000 René Cumplido, Peter Athanas, and Jürgen Becker Copyright © 2013 René Cumplido et al. All rights reserved. Runtime Scheduling, Allocation, and Execution of Real-Time Hardware Tasks onto Xilinx FPGAs Subject to Fault Occurrence Mon, 18 Mar 2013 07:50:12 +0000 This paper describes a novel way to exploit the computation capabilities delivered by modern Field-Programmable Gate Arrays (FPGAs), not only towards a higher performance, but also towards an improved reliability. Computation-specific pieces of circuitry are dynamically scheduled and allocated to different resources on the chip based on a set of novel algorithms which are described in detail in this article. These algorithms consider most of the technological constraints existing in modern partially reconfigurable FPGAs as well as spontaneously occurring faults and emerging permanent damage in the silicon substrate of the chip. In addition, the algorithms target other important aspects such as communications and synchronization among the different computations that are carried out, either concurrently or at different times. The effectiveness of the proposed algorithms is tested by means of a wide range of synthetic simulations, and, notably, a proof-of-concept implementation of them using real FPGA hardware is outlined. Xabier Iturbe, Khaled Benkrid, Chuan Hong, Ali Ebrahim, Tughrul Arslan, and Imanol Martinez Copyright © 2013 Xabier Iturbe et al. All rights reserved. Analysis of Fast Radix-10 Digit Recurrence Algorithms for Fixed-Point and Floating-Point Dividers on FPGAs Thu, 07 Mar 2013 15:57:43 +0000 Decimal floating point operations are important for applications that cannot tolerate errors from conversions between binary and decimal formats, for instance, commercial, financial, and insurance applications. In this paper we present five different radix-10 digit recurrence dividers for FPGA architectures. The first one implements a simple restoring shift-and-subtract algorithm, whereas each of the other four implementations performs a nonrestoring digit recurrence algorithm with signed-digit redundant quotient calculation and carry-save representation of the residuals. More precisely, the quotient digit selection function of the second divider is implemented fully by means of a ROM, the quotient digit selection function of the third and fourth dividers are based on carry-propagate adders, and the fifth divider decomposes each digit into three components and requires neither a ROM nor a multiplexer. Furthermore, the fixed-point divider is extended to support IEEE 754-2008 compliant decimal floating-point division for decimal64 data format. Finally, the algorithms have been synthesized on a Xilinx Virtex-5 FPGA, and implementation results are given. Malte Baesler and Sven-Ole Voigt Copyright © 2013 Malte Baesler and Sven-Ole Voigt. All rights reserved. A Heuristic Scheduler for Port-Constrained Floating-Point Pipelines Wed, 27 Feb 2013 10:20:52 +0000 We describe a heuristic scheduling approach for optimizing floating-point pipelines subject to input port constraints. The objective of our technique is to maximize functional unit reuse while minimizing the following performance metrics in the generated circuit: (1) maximum multiplexer fanin, (2) datapath fanout, (3) number of multiplexers, and (4) number of registers. For a set of systems biology markup language (SBML) benchmark expressions, we compare the resource usages given by our method to those given by a branch-and-bound enumeration of all valid schedules. Compared with the enumeration results, our heuristic requires on average 33.4% less multiplexer bits and 32.9% less register bits than the worse case, while only requiring 14% more multiplexer bits and 4.5% more register bits than the optimal case. We also compare our results against those given by the state-of-art high-level synthesis tool Xilinx AutoESL. For the most complex of our benchmark expressions, our synthesis technique requires 20% less FPGA slices than AutoESL. Zheming Jin and Jason D. Bakos Copyright © 2013 Zheming Jin and Jason D. Bakos. All rights reserved. Fully Pipelined Implementation of Tree-Search Algorithms for Vector Precoding Thu, 21 Feb 2013 19:15:04 +0000 The nonlinear vector precoding (VP) technique has been proven to achieve close-to-capacity performance in multiuser multiple-input multiple-output (MIMO) downlink channels. The performance benefit with respect to its linear counterparts stems from the incorporation of a perturbation signal that reduces the power of the precoded signal. The computation of this perturbation element, which is known to belong in the class of NP-hard problems, is the main aspect that hinders the hardware implementation of VP systems. To this respect, several tree-search algorithms have been proposed for the closest-point lattice search problem in VP systems hitherto. Nevertheless, the optimality of these algorithms has been assessed mainly in terms of error-rate performance and computational complexity, leaving the hardware cost of their implementation an open issue. The parallel data-processing capabilities of field-programmable gate arrays (FPGA) and the loopless nature of the proposed tree-search algorithms have enabled an efficient hardware implementation of a VP system that provides a very high data-processing throughput. Maitane Barrenechea, Mikel Mendicute, and Egoitz Arruti Copyright © 2013 Maitane Barrenechea et al. All rights reserved. Transparent Runtime Migration of Loop-Based Traces of Processor Instructions to Reconfigurable Processing Units Thu, 21 Feb 2013 13:17:46 +0000 The ability to map instructions running in a microprocessor to a reconfigurable processing unit (RPU), acting as a coprocessor, enables the runtime acceleration of applications and ensures code and possibly performance portability. In this work, we focus on the mapping of loop-based instruction traces (called Megablocks) to RPUs. The proposed approach considers offline partitioning and mapping stages without ignoring their future runtime applicability. We present a toolchain that automatically extracts specific trace-based loops, called Megablocks, from MicroBlaze instruction traces and generates an RPU for executing those loops. Our hardware infrastructure is able to move loop execution from the microprocessor to the RPU transparently, at runtime, and without changing the executable binaries. The toolchain and the system are fully operational. Three FPGA implementations of the system, differing in the hardware interfaces used, were tested and evaluated with a set of 15 application kernels. Speedups ranging from 1.26 to 3.69 were achieved for the best alternative using a MicroBlaze processor with local memory. João Bispo, Nuno Paulino, João M. P. Cardoso, and João Canas Ferreira Copyright © 2013 João Bispo et al. All rights reserved. An Asynchronous FPGA Block with Its Tech-Mapping Algorithm Dedicated to Security Applications Tue, 05 Feb 2013 09:13:43 +0000 This paper presents an FPGA tech-mapping algorithm dedicated to security applications. The objective is to implement—on a full-custom asynchronous FPGA—secured functions that need to be robust against side-channel attacks (SCAs). The paper briefly describes the architecture of this FPGA that has been designed and prototyped in CMOS 65 nm to target various styles of asynchronous logic including 2-phase and 4-phase communication protocols and 1-of-n data encoding. This programmable architecture is designed to be electrically balanced in order to fit the security requirements. It allows fair comparisons between different styles of asynchronous implementations. In order to illustrate the FPGA flexibility and security, a case study has been implemented in 2-phase and 4-phase Quasi-Delay-Insensitive (QDI) logic. Taha Beyrouthy and Laurent Fesquet Copyright © 2013 Taha Beyrouthy and Laurent Fesquet. All rights reserved. Development of a SoC for Digital Television Set-Top Box: Architecture and System Integration Issues Wed, 09 Jan 2013 11:53:05 +0000 This work presents the integration of several IPs to generate a system-on-chip (SoC) for digital television set-top box compliant to the SBTVD standard. Embedded consumer electronics for multimedia applications like video processing systems require large storage capacity and high bandwidth memory. Also, those systems are built from heterogeneous processing units, designed to perform specific tasks in order to maximize the overall system efficiency. A single off-chip memory is generally shared between the processing units to reduce power and save costs. The external memory access is one bottleneck when decoding high-definition video sequences in real time. In this work, a four-level memory hierarchy was designed to manage the decoded video in macroblock granularity with low latency. The use of the memory hierarchy in the system design is challenging because it impacts the system integration process and IP reuse in a collaborative design team. Practical strategies used to solve integration problems are discussed in this text. The SoC architecture was validated and is being progressively prototyped using a Xilinx Virtex-5 FPGA board. André Borin Soares, Alexsandro Cristóvão Bonatto, and Altamiro Amadeu Susin Copyright © 2013 André Borin Soares et al. All rights reserved. Open SystemC Simulator with Support for Power Gating Design Sun, 30 Dec 2012 18:42:54 +0000 Power gating is one of the most efficient power consumption reduction techniques. However, when applied in several different parts of a complex design, functional verification becomes a challenge. Lately, the verification process of this technique has been executed in a Register-Transfer Level (RTL) abstraction, based on the Common Power Format (CPF) and the Unified Power Format (UPF). The purpose of this paper is to present an OSCI SystemC simulator with support to the power gating design. This simulator is an alternative to assist the functional verification accomplishment of systems modeled in RTL. The possibility of controlling the retention and isolation of power gated functional block (PGFB) is presented in this work, turning the simulations more stable and accurate. Two case studies are presented to demonstrate the new features of that simulator. George Sobral Silveira, Alisson V. Brito, Helder F. de A. Oliveira, and Elmar U. K. Melcher Copyright © 2012 George Sobral Silveira et al. All rights reserved. Configurable Transmitter and Systolic Channel Estimator Architectures for Data-Dependent Superimposed Training Communications Systems Thu, 27 Dec 2012 08:59:10 +0000 In this paper, a configurable superimposed training (ST)/data-dependent ST (DDST) transmitter and architecture based on array processors (APs) for DDST channel estimation are presented. Both architectures, designed under full-hardware paradigm, were described using Verilog HDL, targeted in Xilinx Virtex-5 and they were compared with existent approaches. The synthesis results showed a FPGA slice consumption of 1% for the transmitter and 3% for the estimator with 160 and 115 MHz operating frequencies, respectively. The signal-to-quantization-noise ratio (SQNR) performance of the transmitter is about 82 dB to support 4/16/64-QAM modulation. A Monte Carlo simulation demonstrates that the mean square error (MSE) of the channel estimator implemented in hardware is practically the same as the one obtained with the floating-point golden model. The high performance and reduced hardware of the proposed architectures lead to the conclusion that the DDST concept can be applied in current communications standards. E. Romero-Aguirre, R. Parra-Michel, Roberto Carrasco-Alvarez, and A. G. Orozco-Lugo Copyright © 2012 E. Romero-Aguirre et al. All rights reserved.