International Journal of Reconfigurable Computing The latest articles from Hindawi Publishing Corporation © 2015 , Hindawi Publishing Corporation . All rights reserved. Low Latency Network-on-Chip Router Microarchitecture Using Request Masking Technique Sun, 15 Mar 2015 08:48:59 +0000 Network-on-Chip (NoC) is fast emerging as an on-chip communication alternative for many-core System-on-Chips (SoCs). However, designing a high performance low latency NoC with low area overhead has remained a challenge. In this paper, we present a two-clock-cycle latency NoC microarchitecture. An efficient request masking technique is proposed to combine virtual channel (VC) allocation with switch allocation nonspeculatively. Our proposed NoC architecture is optimized in terms of area overhead, operating frequency, and quality-of-service (QoS). We evaluate our NoC against CONNECT, an open source low latency NoC design targeted for field-programmable gate array (FPGA). The experimental results on several FPGA devices show that our NoC router outperforms CONNECT with 50% reduction of logic cells (LCs) utilization, while it works with 100% and 35%~20% higher operating frequency compared to the one- and two-clock-cycle latency CONNECT NoC routers, respectively. Moreover, the proposed NoC router achieves 2.3 times better performance compared to CONNECT. Alireza Monemi, Chia Yee Ooi, and Muhammad Nadzir Marsono Copyright © 2015 Alireza Monemi et al. All rights reserved. Optimization of Lookup Schemes for Flow-Based Packet Classification on FPGAs Sun, 08 Mar 2015 11:25:26 +0000 Packet classification has become a key processing function to enable future flow-based networking schemes. As network capacity increases and new services are deployed, both high throughput and reconfigurability are required for packet classification architectures. FPGA technology can provide the best trade-off among them. However, to date, lookup stages have been mostly developed as independent schemes from the classification stage, which makes their efficient integration on FPGAs difficult. In this context, we propose a new interpretation of the lookup problem in the general context of packet classification, which enables comparing existing lookup schemes on a common basis. From this analysis, we recognize new opportunities for optimization of lookup schemes and their associated classification schemes on FPGA. In particular, we focus on the most appropriate candidate for future networking needs and propose optimizations for it. To validate our analysis, we provide estimation and implementation results for typical lookup architectures on FPGA and observe their convenience for different lookup and classification cases, demonstrating the benefits of our proposed optimization. Carlos A. Zerbini and Jorge M. Finochietto Copyright © 2015 Carlos A. Zerbini and Jorge M. Finochietto. All rights reserved. Using Genetic Algorithms for Hardware Core Placement and Mapping in NoC-Based Reconfigurable Systems Mon, 02 Feb 2015 13:38:30 +0000 Mapping of cores has been an important activity in NoC-based system design aimed to find the best topological location onto the NoC, such that the metrics of interest can be greatly optimized. In the last years, partial reconfigurable systems (PRSs) have included Networks-on-Chips (NoCs) as their communication structure, adding complexity to the problem of mapping. Several works have proposed specific and robust NoC architectures for PRSs, forming indirect and irregular networks, in which cases the mapping and placement problems must be treated altogether. The placement deals with the physical positioning of those cores inside the reconfigurable device. Up to now, to the best of our knowledge, the mapping-placement problem for those kinds of architectures has not been addressed yet. In this work, the problem formalization for the design-time hardware core placement and mapping in PRS-NoCs is proposed and methodologies for solving it with genetic algorithms (GAs) are presented. Several GA crossovers and methodologies are compared for obtaining the best solution. Results have shown that best GA solution obtained, in average, communication costs with 4% of penalty when compared with global minimum cost, obtained in a semiexhaustive approach. In addition, the algorithm presents low execution times. Jonas Gomes Filho, Marius Strum, and Wang Jiang Chau Copyright © 2015 Jonas Gomes Filho et al. All rights reserved. Scalable Fixed Point QRD Core Using Dynamic Partial Reconfiguration Sun, 14 Dec 2014 00:10:17 +0000 A Givens rotation based scalable QRD core which utilizes an efficient pipelined and unfolded 2D multiply and accumulate (MAC) based systolic array architecture with dynamic partial reconfiguration (DPR) capability is proposed. The square root and inverse square root operations in the Givens rotation algorithm are handled using a modified look-up table (LUT) based Newton-Raphson method, thereby reducing the area by 71% and latency by 50% while operating at a frequency 49% higher than the existing boundary cell architectures. The proposed architecture is implemented on Xilinx Virtex-6 FPGA for any real matrices of size , where and by dynamically inserting or removing the partial modules. The evaluation results demonstrate a significant reduction in latency, area, and power as compared to other existing architectures. The functionality of the proposed core is evaluated for a variable length adaptive equalizer. Gayathri R. Prabhu, Bibin Johnson, and J. Sheeba Rani Copyright © 2014 Gayathri R. Prabhu et al. All rights reserved. Multi-Softcore Architecture on FPGA Thu, 27 Nov 2014 06:22:28 +0000 To meet the high performance demands of embedded multimedia applications, embedded systems are integrating multiple processing units. However, they are mostly based on custom-logic design methodology. Designing parallel multicore systems using available standards intellectual properties yet maintaining high performance is also a challenging issue. Softcore processors and field programmable gate arrays (FPGAs) are a cheap and fast option to develop and test such systems. This paper describes a FPGA-based design methodology to implement a rapid prototype of parametric multicore systems. A study of the viability of making the SoC using the NIOS II soft-processor core from Altera is also presented. The NIOS II features a general-purpose RISC CPU architecture designed to address a wide range of applications. The performance of the implemented architecture is discussed, and also some parallel applications are used for testing speedup and efficiency of the system. Experimental results demonstrate the performance of the proposed multicore system, which achieves better speedup than the GPU (29.5% faster for the FIR filter and 23.6% faster for the matrix-matrix multiplication). Mouna Baklouti and Mohamed Abid Copyright © 2014 Mouna Baklouti and Mohamed Abid. All rights reserved. Low-Cost Fault Tolerant Methodology for Real Time MPSoC Based Embedded System Wed, 19 Nov 2014 08:50:55 +0000 We are proposing a design methodology for a fault tolerant homogeneous MPSoC having additional design objectives that include low hardware overhead and performance. We have implemented three different FT methodologies on MPSoCs and compared them against the defined constraints. The comparison of these FT methodologies is carried out by modelling their architectures in VHDL-RTL, on Spartan 3 FPGA. The results obtained through simulations helped us to identify the most relevant scheme in terms of the given design constraints. Mohsin Amin, Muhammad Shakir, Aqib Javed, Muhammad Hassan, and Syed Ali Raza Copyright © 2014 Mohsin Amin et al. All rights reserved. TreeBASIS Feature Descriptor and Its Hardware Implementation Mon, 10 Nov 2014 12:56:39 +0000 This paper presents a novel feature descriptor called TreeBASIS that provides improvements in descriptor size, computation time, matching speed, and accuracy. This new descriptor uses a binary vocabulary tree that is computed using basis dictionary images and a test set of feature region images. To facilitate real-time implementation, a feature region image is binary quantized and the resulting quantized vector is passed into the BASIS vocabulary tree. A Hamming distance is then computed between the feature region image and the effectively descriptive basis dictionary image at a node to determine the branch taken and the path the feature region image takes is saved as a descriptor. The TreeBASIS feature descriptor is an excellent candidate for hardware implementation because of its reduced descriptor size and the fact that descriptors can be created and features matched without the use of floating point operations. The TreeBASIS descriptor is more computationally and space efficient than other descriptors such as BASIS, SIFT, and SURF. Moreover, it can be computed entirely in hardware without the support of a CPU for additional software-based computations. Experimental results and a hardware implementation show that the TreeBASIS descriptor compares well with other descriptors for frame-to-frame homography computation while requiring fewer hardware resources. Spencer Fowers, Alok Desai, Dah-Jye Lee, Dan Ventura, and James Archibald Copyright © 2014 Spencer Fowers et al. All rights reserved. Architecture and Application-Aware Management of Complexity of Mapping Multiplication to FPGA DSP Blocks in High Level Synthesis Tue, 21 Oct 2014 10:01:40 +0000 Multiplication is a common operation in many applications and there exist various types of multiplication operations. Current high level synthesis (HLS) flows generally treat all multiplication operations equally and indistinguishable from each other leading to inefficient mapping to resources. This paper proposes algorithms for automatically identifying the different types of multiplication operations and investigates the ensemble of these different types of multiplication operations. This distinguishes it from previous works where mapping strategies for an individual type of multiplication operation have been investigated and the type of multiplication operation is assumed to be known a priori. A new cost model, independent of device and synthesis tools, for establishing priority among different types of multiplication operations for mapping to on-chip DSP blocks is also proposed. This cost model is used by a proposed analysis and priority ordering based mapping strategy targeted at making efficient use of hard DSP blocks on FPGAs while maximizing the operating frequency of designs. Results show that the proposed methodology could result in designs which were at least 2× faster in performance than those generated by commercial HLS tool: Vivado-HLS. Sharad Sinha and Thambipillai Srikanthan Copyright © 2014 Sharad Sinha and Thambipillai Srikanthan. All rights reserved. FPGA-Based Implementation of All-Digital QPSK Carrier Recovery Loop Combining Costas Loop and Maximum Likelihood Frequency Estimator Sun, 31 Aug 2014 12:05:48 +0000 This paper presents an efficient all digital carrier recovery loop (ADCRL) for quadrature phase shift keying (QPSK). The ADCRL combines classic closed-loop carrier recovery circuit, all digital Costas loop (ADCOL), with frequency feedward loop, maximum likelihood frequency estimator (MLFE) so as to make the best use of the advantages of the two types of carrier recovery loops and obtain a more robust performance in the procedure of carrier recovery. Besides, considering that, for MLFE, the accurate estimation of frequency offset is associated with the linear characteristic of its frequency discriminator (FD), the Coordinate Rotation Digital Computer (CORDIC) algorithm is introduced into the FD based on MLFE to unwrap linearly phase difference. The frequency offset contained within the phase difference unwrapped is estimated by the MLFE implemented just using some shifter and multiply-accumulate units to assist the ADCOL to lock quickly and precisely. The joint simulation results of ModelSim and MATLAB show that the performances of the proposed ADCRL in locked-in time and range are superior to those of the ADCOL. On the other hand, a systematic design procedure based on FPGA for the proposed ADCRL is also presented. Kaiyu Wang, Zhiming Song, Xianwei Qi, Qingxin Yan, and Zhenan Tang Copyright © 2014 Kaiyu Wang et al. All rights reserved. Self-Awareness in Computer Networks Mon, 25 Aug 2014 08:24:11 +0000 The Internet architecture works well for a wide variety of communication scenarios. However, its flexibility is limited because it was initially designed to provide communication links between a few static nodes in a homogeneous network and did not attempt to solve the challenges of today’s dynamic network environments. Although the Internet has evolved to a global system of interconnected computer networks, which links together billions of heterogeneous compute nodes, its static architecture remained more or less the same. Nowadays the diversity in networked devices, communication requirements, and network conditions vary heavily, which makes it difficult for a static set of protocols to provide the required functionality. Therefore, we propose a self-aware network architecture in which protocol stacks can be built dynamically. Those protocol stacks can be optimized continuously during communication according to the current requirements. For this network architecture we propose an FPGA-based execution environment called EmbedNet that allows for a dynamic mapping of network protocols to either hardware or software. We show that our architecture can reduce the communication overhead significantly by adapting the protocol stack and that the dynamic hardware/software mapping of protocols considerably reduces the CPU load introduced by packet processing. Ariane Keller, Daniel Borkmann, Stephan Neuhaus, and Markus Happe Copyright © 2014 Ariane Keller et al. All rights reserved. Design Patterns for Self-Adaptive RTE Systems Specification Mon, 14 Jul 2014 09:07:55 +0000 The development of self-adaptive real-time embedded (RTE) systems is an increasingly hard task due to the growing complexity of both hardware and software and the high variability of the execution environment. Different approaches, platforms, and middleware have been proposed in the field, from low to high abstraction level. However, there is still a lack of generic and reusable designs for self-adaptive RTE systems that fit different system domains, lighten designers’ task, and decrease development cost. In this paper, we propose five design patterns for self-adaptive RTE systems modeling resulting from the generalization of relevant existing adaptation-related works. Combined together, the patterns form the design of an adaptation loop composed of five adaptation modules. The proposed solution offers a modular, reusable, and flexible specification of these modules and enables the separation of concerns. It also permits dealing with concurrency, real-time features, and adaptation cost relative to the adaptation activities. To validate our solution, we applied it to a complex case study, a cross-layer self-adaptive object tracking system, to show patterns utilization and prove the solution benefits. Mouna Ben Said, Yessine Hadj Kacem, Mickaël Kerboeuf, Nader Ben Amor, and Mohamed Abid Copyright © 2014 Mouna Ben Said et al. All rights reserved. Practical Education Fostered by Research Projects in an Embedded Systems Course Sun, 29 Jun 2014 08:56:04 +0000 The very nature of universities makes them unique environments for research and teaching. Although both activities constantly borrow from each other, a deeper level of interaction is not always achieved for several reasons. This paper presents a successful experience on conducting an undergraduate course on embedded systems, based on strong interaction with related research activities previously conducted by the authors. Known for being everywhere, embedded systems are constantly expanding in both complexity and volume production. In addition, heterogeneous systems are becoming prevalent in modern applications, standing as an additional difficulty to students in this area. In this context, this paper presents experiences in teaching embedded systems using a project-based learning pedagogical approach, with strong emphasis on mobile robotic applications previously developed by MSc and PhD students. As a result, it has been observed that undergraduate students have the opportunity to build a strong background and feel better prepared to face the challenges to be found in their future professional activities. Vanderlei Bonato, Marcio M. Fernandes, Joao M. P. Cardoso, and Eduardo Marques Copyright © 2014 Vanderlei Bonato et al. All rights reserved. Simple Hybrid Scaling-Free CORDIC Solution for FPGAs Thu, 24 Apr 2014 12:14:15 +0000 COordinate Rotation DIgital Computer (CORDIC) is an effective method that is used in digital signal processing applications for computing various trigonometric, hyperbolic, linear, and transcendental functions. This paper presents the theoretical basis and practical implementation of circular (sine-cosine) CORDIC-based generator. Synthesis results of this generator based on Altera Stratix III FPGA (EP3SL340F1517C2) using Quartus II version 9.0 show that the proposed hybrid FPGA architecture significantly reduces latency (42% reduction) with a small area overhead, compared to the conventional version. The proposed algorithm has been simulated for sine and cosine function evaluation, and it has been verified that the accuracy is comparable with conventional algorithm. Leonid Moroz, Shinobu Nagayama, Taras Mykytiv, Ihor Kirenko, and Taras Boretskyy Copyright © 2014 Leonid Moroz et al. All rights reserved. An FPGA Task Placement Algorithm Using Reflected Binary Gray Space Filling Curve Wed, 16 Apr 2014 13:55:51 +0000 With the arrival of partial reconfiguration technology, modern FPGAs support tasks that can be loaded in (removed from) the FPGA individually without interrupting other tasks already running on the same FPGA. Many online task placement algorithms designed for such partially reconfigurable systems have been proposed to provide efficient and fast task placement. A new approach for online placement of modules on reconfigurable devices, by managing the free space using a run-length based representation. This representation allows the algorithm to insert or delete tasks quickly and also to calculate the fragmentation easily. In the proposed FPGA model, the CLBs are numbered according to reflected binary gray space filling curve model. The search algorithm will quickly identify a placement for the incoming task based on first fit mode or a fragmentation aware best fit mode. Simulation experiments indicate that the proposed techniques result in a low ratio of task rejection and high FPGA utilization compared to existing techniques. Senoj Joseph Olakkenghil and K. Baskaran Copyright © 2014 Senoj Joseph Olakkenghil and K. Baskaran. All rights reserved. Using Statistical Assertions to Guide Self-Adaptive Systems Sun, 13 Apr 2014 16:34:37 +0000 Self-adaptive systems need to monitor themselves, to check their internal behaviour and design assumptions about runtime inputs and conditions. This kind of monitoring for self-adaptive systems can include collecting statistics about such systems themselves which can be computationally intensive (for detailed statistics) and hence time consuming, with possible negative impact on self-adaptive response time. To mitigate this limitation, we extend the technique of in-circuit runtime assertions to cover statistical assertions in hardware. The presented designs implement several statistical operators that can be exploited by self-adaptive systems; a novel optimization is developed for reducing the number of pairwise operators from to . To illustrate the practicability and industrial relevance of our proposed approach, we evaluate our designs, chosen from a class of possible application scenarios, for their resource usage and the tradeoffs between hardware and software implementations. Tim Todman, Stephan Stilkerich, and Wayne Luk Copyright © 2014 Tim Todman et al. All rights reserved. Distance-Ranked Fault Identification of Reconfigurable Hardware Bitstreams via Functional Input Mon, 17 Mar 2014 07:06:32 +0000 Distance-Ranked Fault Identification (DRFI) is a dynamic reconfiguration technique which employs runtime inputs to conduct online functional testing of fielded FPGA logic and interconnect resources without test vectors. At design time, a diverse set of functionally identical bitstream configurations are created which utilize alternate hardware resources in the FPGA fabric. An ordering is imposed on the configuration pool as updated by the PageRank indexing precedence. The configurations which utilize permanently damaged resources and hence manifest discrepant outputs, receive lower rank are thus less preferred for instantiation on the FPGA. Results indicate accurate identification of fault-free configurations in a pool of pregenerated bitstreams with a low number of reconfigurations and input evaluations. For MCNC benchmark circuits, the observed reduction in input evaluations is up to 75% when comparing the DRFI technique to unguided evaluation. The DRFI diagnosis method is seen to isolate all 14 healthy configurations from a pool of 100 pregenerated configurations, and thereby offering a 100% isolation accuracy provided the fault-free configurations exist in the design pool. When a complete recovery is not feasible, graceful degradation may be realized which is demonstrated by the PSNR improvement of images processed in a video encoder case study. Naveed Imran and Ronald F. DeMara Copyright © 2014 Naveed Imran and Ronald F. DeMara. All rights reserved. Hardware-Efficient Design of Real-Time Profile Shape Matching Stereo Vision Algorithm on FPGA Mon, 24 Feb 2014 08:50:44 +0000 A variety of platforms, such as micro-unmanned vehicles, are limited in the amount of computational hardware they can support due to weight and power constraints. An efficient stereo vision algorithm implemented on an FPGA would be able to minimize payload and power consumption in microunmanned vehicles, while providing 3D information and still leaving computational resources available for other processing tasks. This work presents a hardware design of the efficient profile shape matching stereo vision algorithm. Hardware resource usage is presented for the targeted micro-UV platform, Helio-copter, that uses the Xilinx Virtex 4 FX60 FPGA. Less than a fifth of the resources on this FGPA were used to produce dense disparity maps for image sizes up to 450 × 375, with the ability to scale up easily by increasing BRAM usage. A comparison is given of accuracy, speed performance, and resource usage of a census transform-based stereo vision FPGA implementation by Jin et al. Results show that the profile shape matching algorithm is an efficient real-time stereo vision algorithm for hardware implementation for resource limited systems such as microunmanned vehicles. Beau Tippetts, Dah Jye Lee, Kirt Lillywhite, and James K. Archibald Copyright © 2014 Beau Tippetts et al. All rights reserved. A Top-Down Optimization Methodology for Mutually Exclusive Applications Mon, 17 Feb 2014 08:07:33 +0000 Proliferation of mutually exclusive applications on circuits and the higher cost of silicon make the resource sharing more and more important. The state-of-the-art synthesis tools may often be unsatisfactory. Their efficiency may depend on the hardware description style. Nevertheless, today, different applications in a circuit can be developed by different developers. This paper proposes an efficient method to improve resource sharing between mutually exclusive applications with no dependence on the coding style. It takes the advantage of the possibility of resource sharing as done in FPGA and of predefined multiple functions as in ASIC. Alp Kilic, Zied Marrakchi, and Habib Mehrez Copyright © 2014 Alp Kilic et al. All rights reserved. IP-Enabled C/C++ Based High Level Synthesis: A Step towards Better Designer Productivity and Design Performance Wed, 08 Jan 2014 08:59:07 +0000 Intellectual property (IP) core based design is an emerging design methodology to deal with increasing chip design complexity. C/C++ based high level synthesis (HLS) is also gaining traction as a design methodology to deal with increasing design complexity. In the work presented here, we present a design methodology that combines these two individual methodologies and is therefore more powerful. We discuss our proposed methodology in the context of supporting efficient hardware synthesis of a class of mathematical functions without altering original C/C++ source code. Additionally, we also discuss and propose methods to integrate legacy IP cores in existing HLS flows. Relying on concepts from the domains of program recognition and optimized low level implementations of such arithmetic functions, the described design methodology is a step towards intelligent synthesis where application characteristics are matched with specific architectural resources and relevant IP cores in a transparent manner for improved area-delay results. The combined methodology is more aware of the target hardware architecture than the conventional HLS flow. Implementation results of certain compute kernels from a commercial tool Vivado-HLS as well as proposed flow are also compared to show that proposed flow gives better results. Sharad Sinha and Thambipillai Srikanthan Copyright © 2014 Sharad Sinha and Thambipillai Srikanthan. All rights reserved. Efficient FPGA Hardware Reuse in a Multiplierless Decimation Chain Sun, 05 Jan 2014 08:27:27 +0000 In digital communications, an usual reception chain requires many stages of digital signal processing for filtering and sample rate reduction. For satellite on board applications, this need is hardly constrained by the very limited hardware resources available in space qualified FPGAs. This short paper focuses on the implementation of a dual chain of 14 stages of cascaded half band filters plus 2 : 1 decimators for complex signals (in-phase and quadrature) with minimal hardware resources, using a small portion of an UT6325 Aeroflex FPGA, as a part of a receiver designed for a low data rate command and telemetry channel. Guillermo A. Jaquenod, Javier Valls, and Javier Siman Copyright © 2014 Guillermo A. Jaquenod et al. All rights reserved. Performance Modeling for FPGAs: Extending the Roofline Model with High-Level Synthesis Tools Thu, 26 Dec 2013 13:26:14 +0000 The potential of FPGAs as accelerators for high-performance computing applications is very large, but many factors are involved in their performance. The design for FPGAs and the selection of the proper optimizations when mapping computations to FPGAs lead to prohibitively long developing time. Alternatives are the high-level synthesis (HLS) tools, which promise a fast design space exploration due to design at high-level or analytical performance models which provide realistic performance expectations, potential impediments to performance, and optimization guidelines. In this paper we propose the combination of both, in order to construct a performance model for FPGAs which is able to visually condense all the helpful information for the designer. Our proposed model extends the roofline model, by considering the resource consumption and the parameters used in the HLS tools, to maximize the performance and the resource utilization within the area of the FPGA. The proposed model is applied to optimize the design exploration of a class of window-based image processing applications using two different HLS tools. The results show the accuracy of the model as well as its flexibility to be combined with any HLS tool. Bruno da Silva, An Braeken, Erik H. D’Hollander, and Abdellah Touhafi Copyright © 2013 Bruno da Silva et al. All rights reserved. Impact of Dual Placement and Routing on WDDL Netlist Security in FPGA Tue, 24 Dec 2013 18:22:56 +0000 The wave dynamic differential logic (WDDL) has been identified as a promising countermeasure to increase the robustness of cryptographic devices against differential power attacks (DPA). However, to guarantee the effectiveness of WDDL technique, the routing in both the direct and complementary paths must be balanced. This paper tackles the problem of unbalance of dual-railsignals in WDDL design. We describe placement techniques suitable for tree-based and mesh-based FPGAs and quantify the gain they confer. Then, we introduce a timing-balance-driven routing algorithm which is architecture independent. Our placement and routing techniques proved to be very promising. In fact, they achieve a gain of 95%, 93%, and 85% in delay balance in tree-based, simple mesh, and cluster-based mesh architectures, respectively. To reduce further the switch and delay unbalance in Mesh architecture, we propose a differential pair routing algorithm that is specific to cluster-based mesh architecture. It achieves perfectly balanced routed signals in terms of wire length and switch number. Emna Amouri, Habib Mehrez, and Zied Marrakchi Copyright © 2013 Emna Amouri et al. All rights reserved. Self-Adaptive On-Chip System Based on Cross-Layer Adaptation Approach Mon, 23 Dec 2013 09:12:57 +0000 The emergence of mobile and battery operated multimedia systems and the diversity of supported applications mount new challenges in terms of design efficiency of these systems which must provide a maximum application quality of service (QoS) in the presence of a dynamically varying environment. These optimization problems cannot be entirely solved at design time and some efficiency gains can be obtained at run-time by means of self-adaptivity. In this paper, we propose a new cross-layer hardware (HW)/software (SW) adaptation solution for embedded mobile systems. It supports application QoS under real-time and lifetime constraints via coordinated adaptation in the hardware, operating system (OS), and application layers. Our method relies on an original middleware solution used on both global and local managers. The global manager (GM) handles large, long-term variations whereas the local manager (LM) is used to guarantee real-time constraints. The GM acts in three layers whereas the LM acts in application and OS layers only. The main role of GM is to select the best configuration for each application to meet the constraints of the system and respect the preferences of the user. The proposed approach has been applied to a 3D graphics application and successfully implemented on an Altera FPGA. Kais Loukil, Nader Ben Amor, Mohamed Abid, and Jean Philippe Diguet Copyright © 2013 Kais Loukil et al. All rights reserved. Frequency Optimization Objective during System Prototyping on Multi-FPGA Platform Tue, 17 Dec 2013 14:39:45 +0000 Multi-FPGA hardware prototyping is becoming increasingly important in the system on chip design cycle. However, after partitioning the design on the multi-FPGA platform, the number of inter-FPGA signals is greater than the number of physical connections available on the prototyping board. Therefore, these signals should be time-multiplexed which lowers the system frequency. The way in which the design is partitioned affects the number of inter-FPGA signals. In this work, we propose a set of constraints to be taken into account during the partitioning task. Then, the resulting inter-FPGA signals are routed with an iterative routing algorithm in order to obtain the best multiplexing ratio. Indeed, signals are grouped and then routed using the intra-FPGA routing algorithm: Pathfinder. This algorithm is adapted to deal with the inter-FPGA routing problem. Many scenarios are proposed to obtain the most optimized results in terms of prototyping system frequency. Using this technique, the system frequency is improved by an average of 12.8% compared to constructive routing algorithm. Mariem Turki, Zied Marrakchi, Habib Mehrez, and Mohamed Abid Copyright © 2013 Mariem Turki et al. All rights reserved. An Impulse-C Hardware Accelerator for Packet Classification Based on Fine/Coarse Grain Optimization Thu, 24 Oct 2013 10:02:25 +0000 Current software-based packet classification algorithms exhibit relatively poor performance, prompting many researchers to concentrate on novel frameworks and architectures that employ both hardware and software components. The Packet Classification with Incremental Update (PCIU) algorithm, Ahmed et al. (2010), is a novel and efficient packet classification algorithm with a unique incremental update capability that demonstrated excellent results and was shown to be scalable for many different tasks and clients. While a pure software implementation can generate powerful results on a server machine, an embedded solution may be more desirable for some applications and clients. Embedded, specialized hardware accelerator based solutions are typically much more efficient in speed, cost, and size than solutions that are implemented on general-purpose processor systems. This paper seeks to explore the design space of translating the PCIU algorithm into hardware by utilizing several optimization techniques, ranging from fine grain to coarse grain and parallel coarse grain approaches. The paper presents a detailed implementation of a hardware accelerator of the PCIU based on an Electronic System Level (ESL) approach. Results obtained indicate that the hardware accelerator achieves on average 27x speedup over a state-of-the-art Xeon processor. O. Ahmed, S. Areibi, R. Collier, and G. Grewal Copyright © 2013 O. Ahmed et al. All rights reserved. Rainbow: An Operating System for Software-Hardware Multitasking on Dynamically Partially Reconfigurable FPGAs Tue, 01 Oct 2013 08:34:23 +0000 Dynamic Partial Reconfiguration technology coupled with an Operating System for Reconfigurable Systems (OS4RS) allows for implementation of a hardware task concept, that is, an active computing object which can contend for reconfigurable computing resources and request OS services in a way software task does in a conventional OS. In this work, we show a complete model and implementation of a lightweight OS4RS supporting preemptable and clock-scalable hardware tasks. We also propose a novel, lightweight scheduling mechanism allowing for timely and priority-based reservation of reconfigurable resources, which aims at usage of preemption only at the time it brings benefits to the performance of a system. The architecture of the scheduler and the way it schedules allocations of the hardware tasks result in shorter latency of system calls, thereby reducing the overall OS overhead. Finally, we present a novel model and implementation of a channel-based intertask communication and synchronization suitable for software-hardware multitasking with preemptable and clock-scalable hardware tasks. It allows for optimizations of the communication on per task basis and utilizes point-to-point message passing rather than shared-memory communication, whenever it is possible. Extensive overhead tests of the OS4RS services as well as application speedup tests show efficiency of our approach. Krzysztof Jozwik, Shinya Honda, Masato Edahiro, Hiroyuki Tomiyama, and Hiroaki Takada Copyright © 2013 Krzysztof Jozwik et al. All rights reserved. Design and Implementation of an Embedded NIOS II System for JPEG2000 Tier II Encoding Wed, 19 Jun 2013 18:20:38 +0000 This paper presents a novel implementation of the JPEG2000 standard as a system on a chip (SoC). While most of the research in this field centers on acceleration of the EBCOT Tier I encoder, this work focuses on an embedded solution for EBCOT Tier II. Specifically, this paper proposes using an embedded softcore processor to perform Tier II processing as the back end of an encoding pipeline. The Altera NIOS II processor is chosen for the implementation and is coupled with existing embedded processing modules to realize a fully embedded JPEG2000 encoder. The design is synthesized on a Stratix IV FPGA and is shown to out perform other comparable SoC implementations by 39% in computation time. John M. McNichols, Eric J. Balster, William F. Turri, and Kerry L. Hill Copyright © 2013 John M. McNichols et al. All rights reserved. Hardware Accelerators Targeting a Novel Group Based Packet Classification Algorithm Thu, 30 May 2013 11:07:45 +0000 Packet classification is a ubiquitous and key building block for many critical network devices. However, it remains as one of the main bottlenecks faced when designing fast network devices. In this paper, we propose a novel Group Based Search packet classification Algorithm (GBSA) that is scalable, fast, and efficient. GBSA consumes an average of 0.4 megabytes of memory for a 10 k rule set. The worst-case classification time per packet is 2 microseconds, and the preprocessing speed is 3 M rules/second based on an Xeon processor operating at 3.4 GHz. When compared with other state-of-the-art classification techniques, the results showed that GBSA outperforms the competition with respect to speed, memory usage, and processing time. Moreover, GBSA is amenable to implementation in hardware. Three different hardware implementations are also presented in this paper including an Application Specific Instruction Set Processor (ASIP) implementation and two pure Register-Transfer Level (RTL) implementations based on Impulse-C and Handel-C flows, respectively. Speedups achieved with these hardware accelerators ranged from 9x to 18x compared with a pure software implementation running on an Xeon processor. O. Ahmed, S. Areibi, and G. Grewal Copyright © 2013 O. Ahmed et al. All rights reserved. Selected Papers from the Symposium on Integrated Circuits and Systems Design (SBCCI 2011) Wed, 29 May 2013 16:38:20 +0000 Massimo Conti, Elmar Melcher, Jürgen Becker, Alisson Brito, and Oliver Sander Copyright © 2013 Massimo Conti et al. All rights reserved. Selected Papers from the 2011 International Conference on Reconfigurable Computing and FPGAs (ReConFig 2011) Wed, 20 Mar 2013 08:14:54 +0000 René Cumplido, Peter Athanas, and Jürgen Becker Copyright © 2013 René Cumplido et al. All rights reserved.