International Journal of Reconfigurable Computing http://www.hindawi.com The latest articles from Hindawi Publishing Corporation © 2013 , Hindawi Publishing Corporation . All rights reserved. Hardware Accelerators Targeting a Novel Group Based Packet Classification Algorithm Thu, 30 May 2013 11:07:45 +0000 http://www.hindawi.com/journals/ijrc/2013/681894/ Packet classification is a ubiquitous and key building block for many critical network devices. However, it remains as one of the main bottlenecks faced when designing fast network devices. In this paper, we propose a novel Group Based Search packet classification Algorithm (GBSA) that is scalable, fast, and efficient. GBSA consumes an average of 0.4 megabytes of memory for a 10 k rule set. The worst-case classification time per packet is 2 microseconds, and the preprocessing speed is 3 M rules/second based on an Xeon processor operating at 3.4 GHz. When compared with other state-of-the-art classification techniques, the results showed that GBSA outperforms the competition with respect to speed, memory usage, and processing time. Moreover, GBSA is amenable to implementation in hardware. Three different hardware implementations are also presented in this paper including an Application Specific Instruction Set Processor (ASIP) implementation and two pure Register-Transfer Level (RTL) implementations based on Impulse-C and Handel-C flows, respectively. Speedups achieved with these hardware accelerators ranged from 9x to 18x compared with a pure software implementation running on an Xeon processor. O. Ahmed, S. Areibi, and G. Grewal Copyright © 2013 O. Ahmed et al. All rights reserved. Selected Papers from the Symposium on Integrated Circuits and Systems Design (SBCCI 2011) Wed, 29 May 2013 16:38:20 +0000 http://www.hindawi.com/journals/ijrc/2013/942021/ Massimo Conti, Elmar Melcher, Jürgen Becker, Alisson Brito, and Oliver Sander Copyright © 2013 Massimo Conti et al. All rights reserved. Selected Papers from the 2011 International Conference on Reconfigurable Computing and FPGAs (ReConFig 2011) Wed, 20 Mar 2013 08:14:54 +0000 http://www.hindawi.com/journals/ijrc/2013/597323/ René Cumplido, Peter Athanas, and Jürgen Becker Copyright © 2013 René Cumplido et al. All rights reserved. Runtime Scheduling, Allocation, and Execution of Real-Time Hardware Tasks onto Xilinx FPGAs Subject to Fault Occurrence Mon, 18 Mar 2013 07:50:12 +0000 http://www.hindawi.com/journals/ijrc/2013/905057/ This paper describes a novel way to exploit the computation capabilities delivered by modern Field-Programmable Gate Arrays (FPGAs), not only towards a higher performance, but also towards an improved reliability. Computation-specific pieces of circuitry are dynamically scheduled and allocated to different resources on the chip based on a set of novel algorithms which are described in detail in this article. These algorithms consider most of the technological constraints existing in modern partially reconfigurable FPGAs as well as spontaneously occurring faults and emerging permanent damage in the silicon substrate of the chip. In addition, the algorithms target other important aspects such as communications and synchronization among the different computations that are carried out, either concurrently or at different times. The effectiveness of the proposed algorithms is tested by means of a wide range of synthetic simulations, and, notably, a proof-of-concept implementation of them using real FPGA hardware is outlined. Xabier Iturbe, Khaled Benkrid, Chuan Hong, Ali Ebrahim, Tughrul Arslan, and Imanol Martinez Copyright © 2013 Xabier Iturbe et al. All rights reserved. Analysis of Fast Radix-10 Digit Recurrence Algorithms for Fixed-Point and Floating-Point Dividers on FPGAs Thu, 07 Mar 2013 15:57:43 +0000 http://www.hindawi.com/journals/ijrc/2013/453173/ Decimal floating point operations are important for applications that cannot tolerate errors from conversions between binary and decimal formats, for instance, commercial, financial, and insurance applications. In this paper we present five different radix-10 digit recurrence dividers for FPGA architectures. The first one implements a simple restoring shift-and-subtract algorithm, whereas each of the other four implementations performs a nonrestoring digit recurrence algorithm with signed-digit redundant quotient calculation and carry-save representation of the residuals. More precisely, the quotient digit selection function of the second divider is implemented fully by means of a ROM, the quotient digit selection function of the third and fourth dividers are based on carry-propagate adders, and the fifth divider decomposes each digit into three components and requires neither a ROM nor a multiplexer. Furthermore, the fixed-point divider is extended to support IEEE 754-2008 compliant decimal floating-point division for decimal64 data format. Finally, the algorithms have been synthesized on a Xilinx Virtex-5 FPGA, and implementation results are given. Malte Baesler and Sven-Ole Voigt Copyright © 2013 Malte Baesler and Sven-Ole Voigt. All rights reserved. A Heuristic Scheduler for Port-Constrained Floating-Point Pipelines Wed, 27 Feb 2013 10:20:52 +0000 http://www.hindawi.com/journals/ijrc/2013/849545/ We describe a heuristic scheduling approach for optimizing floating-point pipelines subject to input port constraints. The objective of our technique is to maximize functional unit reuse while minimizing the following performance metrics in the generated circuit: (1) maximum multiplexer fanin, (2) datapath fanout, (3) number of multiplexers, and (4) number of registers. For a set of systems biology markup language (SBML) benchmark expressions, we compare the resource usages given by our method to those given by a branch-and-bound enumeration of all valid schedules. Compared with the enumeration results, our heuristic requires on average 33.4% less multiplexer bits and 32.9% less register bits than the worse case, while only requiring 14% more multiplexer bits and 4.5% more register bits than the optimal case. We also compare our results against those given by the state-of-art high-level synthesis tool Xilinx AutoESL. For the most complex of our benchmark expressions, our synthesis technique requires 20% less FPGA slices than AutoESL. Zheming Jin and Jason D. Bakos Copyright © 2013 Zheming Jin and Jason D. Bakos. All rights reserved. Fully Pipelined Implementation of Tree-Search Algorithms for Vector Precoding Thu, 21 Feb 2013 19:15:04 +0000 http://www.hindawi.com/journals/ijrc/2013/496013/ The nonlinear vector precoding (VP) technique has been proven to achieve close-to-capacity performance in multiuser multiple-input multiple-output (MIMO) downlink channels. The performance benefit with respect to its linear counterparts stems from the incorporation of a perturbation signal that reduces the power of the precoded signal. The computation of this perturbation element, which is known to belong in the class of NP-hard problems, is the main aspect that hinders the hardware implementation of VP systems. To this respect, several tree-search algorithms have been proposed for the closest-point lattice search problem in VP systems hitherto. Nevertheless, the optimality of these algorithms has been assessed mainly in terms of error-rate performance and computational complexity, leaving the hardware cost of their implementation an open issue. The parallel data-processing capabilities of field-programmable gate arrays (FPGA) and the loopless nature of the proposed tree-search algorithms have enabled an efficient hardware implementation of a VP system that provides a very high data-processing throughput. Maitane Barrenechea, Mikel Mendicute, and Egoitz Arruti Copyright © 2013 Maitane Barrenechea et al. All rights reserved. Transparent Runtime Migration of Loop-Based Traces of Processor Instructions to Reconfigurable Processing Units Thu, 21 Feb 2013 13:17:46 +0000 http://www.hindawi.com/journals/ijrc/2013/340316/ The ability to map instructions running in a microprocessor to a reconfigurable processing unit (RPU), acting as a coprocessor, enables the runtime acceleration of applications and ensures code and possibly performance portability. In this work, we focus on the mapping of loop-based instruction traces (called Megablocks) to RPUs. The proposed approach considers offline partitioning and mapping stages without ignoring their future runtime applicability. We present a toolchain that automatically extracts specific trace-based loops, called Megablocks, from MicroBlaze instruction traces and generates an RPU for executing those loops. Our hardware infrastructure is able to move loop execution from the microprocessor to the RPU transparently, at runtime, and without changing the executable binaries. The toolchain and the system are fully operational. Three FPGA implementations of the system, differing in the hardware interfaces used, were tested and evaluated with a set of 15 application kernels. Speedups ranging from 1.26 to 3.69 were achieved for the best alternative using a MicroBlaze processor with local memory. João Bispo, Nuno Paulino, João M. P. Cardoso, and João Canas Ferreira Copyright © 2013 João Bispo et al. All rights reserved. An Asynchronous FPGA Block with Its Tech-Mapping Algorithm Dedicated to Security Applications Tue, 05 Feb 2013 09:13:43 +0000 http://www.hindawi.com/journals/ijrc/2013/517947/ This paper presents an FPGA tech-mapping algorithm dedicated to security applications. The objective is to implement—on a full-custom asynchronous FPGA—secured functions that need to be robust against side-channel attacks (SCAs). The paper briefly describes the architecture of this FPGA that has been designed and prototyped in CMOS 65 nm to target various styles of asynchronous logic including 2-phase and 4-phase communication protocols and 1-of-n data encoding. This programmable architecture is designed to be electrically balanced in order to fit the security requirements. It allows fair comparisons between different styles of asynchronous implementations. In order to illustrate the FPGA flexibility and security, a case study has been implemented in 2-phase and 4-phase Quasi-Delay-Insensitive (QDI) logic. Taha Beyrouthy and Laurent Fesquet Copyright © 2013 Taha Beyrouthy and Laurent Fesquet. All rights reserved. Development of a SoC for Digital Television Set-Top Box: Architecture and System Integration Issues Wed, 09 Jan 2013 11:53:05 +0000 http://www.hindawi.com/journals/ijrc/2013/783501/ This work presents the integration of several IPs to generate a system-on-chip (SoC) for digital television set-top box compliant to the SBTVD standard. Embedded consumer electronics for multimedia applications like video processing systems require large storage capacity and high bandwidth memory. Also, those systems are built from heterogeneous processing units, designed to perform specific tasks in order to maximize the overall system efficiency. A single off-chip memory is generally shared between the processing units to reduce power and save costs. The external memory access is one bottleneck when decoding high-definition video sequences in real time. In this work, a four-level memory hierarchy was designed to manage the decoded video in macroblock granularity with low latency. The use of the memory hierarchy in the system design is challenging because it impacts the system integration process and IP reuse in a collaborative design team. Practical strategies used to solve integration problems are discussed in this text. The SoC architecture was validated and is being progressively prototyped using a Xilinx Virtex-5 FPGA board. André Borin Soares, Alexsandro Cristóvão Bonatto, and Altamiro Amadeu Susin Copyright © 2013 André Borin Soares et al. All rights reserved. Open SystemC Simulator with Support for Power Gating Design Sun, 30 Dec 2012 18:42:54 +0000 http://www.hindawi.com/journals/ijrc/2012/793190/ Power gating is one of the most efficient power consumption reduction techniques. However, when applied in several different parts of a complex design, functional verification becomes a challenge. Lately, the verification process of this technique has been executed in a Register-Transfer Level (RTL) abstraction, based on the Common Power Format (CPF) and the Unified Power Format (UPF). The purpose of this paper is to present an OSCI SystemC simulator with support to the power gating design. This simulator is an alternative to assist the functional verification accomplishment of systems modeled in RTL. The possibility of controlling the retention and isolation of power gated functional block (PGFB) is presented in this work, turning the simulations more stable and accurate. Two case studies are presented to demonstrate the new features of that simulator. George Sobral Silveira, Alisson V. Brito, Helder F. de A. Oliveira, and Elmar U. K. Melcher Copyright © 2012 George Sobral Silveira et al. All rights reserved. Configurable Transmitter and Systolic Channel Estimator Architectures for Data-Dependent Superimposed Training Communications Systems Thu, 27 Dec 2012 08:59:10 +0000 http://www.hindawi.com/journals/ijrc/2012/236372/ In this paper, a configurable superimposed training (ST)/data-dependent ST (DDST) transmitter and architecture based on array processors (APs) for DDST channel estimation are presented. Both architectures, designed under full-hardware paradigm, were described using Verilog HDL, targeted in Xilinx Virtex-5 and they were compared with existent approaches. The synthesis results showed a FPGA slice consumption of 1% for the transmitter and 3% for the estimator with 160 and 115 MHz operating frequencies, respectively. The signal-to-quantization-noise ratio (SQNR) performance of the transmitter is about 82 dB to support 4/16/64-QAM modulation. A Monte Carlo simulation demonstrates that the mean square error (MSE) of the channel estimator implemented in hardware is practically the same as the one obtained with the floating-point golden model. The high performance and reduced hardware of the proposed architectures lead to the conclusion that the DDST concept can be applied in current communications standards. E. Romero-Aguirre, R. Parra-Michel, Roberto Carrasco-Alvarez, and A. G. Orozco-Lugo Copyright © 2012 E. Romero-Aguirre et al. All rights reserved. High-Level Design Space and Flexibility Exploration for Adaptive, Energy-Efficient WCDMA Channel Estimation Architectures Sun, 23 Dec 2012 10:07:24 +0000 http://www.hindawi.com/journals/ijrc/2012/961950/ Due to the fast changing wireless communication standards coupled with strict performance constraints, the demand for flexible yet high-performance architectures is increasing. To tackle the flexibility requirement, software-defined radio (SDR) is emerging as an obvious solution, where the underlying hardware implementation is tuned via software layers to the varied standards depending on power-performance and quality requirements leading to adaptable, cognitive radio. In this paper, we conduct a case study for representatives of two complexity classes of WCDMA channel estimation algorithms and explore the effect of flexibility on energy efficiency using different implementation options. Furthermore, we propose new design guidelines for both highly specialized architectures and highly flexible architectures using high-level synthesis, to enable the required performance and flexibility to support multiple applications. Our experiments with various design points show that the resulting architectures meet the performance constraints of WCDMA and a wide range of options are offered for tuning such architectures depending on power/performance/area constraints of SDR. Zoltán Endre Rákossy, Zheng Wang, and Anupam Chattopadhyay Copyright © 2012 Zoltán Endre Rákossy et al. All rights reserved. Object Recognition and Pose Estimation on Embedded Hardware: SURF-Based System Designs Accelerated by FPGA Logic Thu, 06 Dec 2012 15:46:33 +0000 http://www.hindawi.com/journals/ijrc/2012/368351/ State-of-the-art object recognition and pose estimation systems often utilize point feature algorithms, which in turn usually require the computing power of conventional PC hardware. In this paper, we describe two embedded systems for object detection and pose estimation using sophisticated point features. The feature detection step of the “Speeded-up Robust Features (SURF)” algorithm is accelerated by a special IP core. The first system performs object detection and is completely implemented in a single medium-size Virtex-5 FPGA. The second system is an augmented reality platform, which consists of an ARM-based microcontroller and intelligent FPGA-based cameras which support the main system. Michael Schaeferling, Ulrich Hornung, and Gundolf Kiefer Copyright © 2012 Michael Schaeferling et al. All rights reserved. A Programmable Look-Up Table-Based Interpolator with Nonuniform Sampling Scheme Tue, 04 Dec 2012 14:10:05 +0000 http://www.hindawi.com/journals/ijrc/2012/647805/ Interpolation is a useful technique for storage of complex functions on limited memory space: some few sampling values are stored on a memory bank, and the function values in between are calculated by interpolation. This paper presents a programmable Look-Up Table-based interpolator, which uses a reconfigurable nonuniform sampling scheme: the sampled points are not uniformly spaced. Their distribution can also be reconfigured to minimize the approximation error on specific portions of the interpolated function’s domain. Switching from one set of configuration parameters to another set, selected on the fly from a variety of precomputed parameters, and using different sampling schemes allow for the interpolation of a plethora of functions, achieving memory saving and minimum approximation error. As a study case, the proposed interpolator was used as the core of a programmable noise generator—output signals drawn from different Probability Density Functions were produced for testing FPGA implementations of chaotic encryption algorithms. As a result of the proposed method, the interpolation of a specific transformation function on a Gaussian noise generator reduced the memory usage to 2.71% when compared to the traditional uniform sampling scheme method, while keeping the approximation error below a threshold equal to 0.000030518. Élvio Carlos Dutra e Silva Júnior, Leandro Soares Indrusiak, Weiler Alves Finamore, and Manfred Glesner Copyright © 2012 Élvio Carlos Dutra e Silva Júnior et al. All rights reserved. HwPMI: An Extensible Performance Monitoring Infrastructure for Improving Hardware Design and Productivity on FPGAs Sun, 02 Dec 2012 17:57:25 +0000 http://www.hindawi.com/journals/ijrc/2012/162404/ Designing hardware cores for FPGAs can quickly become a complicated task, difficult even for experienced engineers. With the addition of more sophisticated development tools and maturing high-level language-to-gates techniques, designs can be rapidly assembled; however, when the design is evaluated on the FPGA, the performance may not be what was expected. Therefore, an engineer may need to augment the design to include performance monitors to better understand the bottlenecks in the system or to aid in the debugging of the design. Unfortunately, identifying what to monitor and adding the infrastructure to retrieve the monitored data can be a challenging and time-consuming task. Our work alleviates this effort. We present the Hardware Performance Monitoring Infrastructure (HwPMI), which includes a collection of software tools and hardware cores that can be used to profile the current design, recommend and insert performance monitors directly into the HDL or netlist, and retrieve the monitored data with minimal invasiveness to the design. Three applications are used to demonstrate and evaluate HwPMI’s capabilities. The results are highly encouraging as the infrastructure adds numerous capabilities while requiring minimal effort by the designer and low resource overhead to the existing design. Andrew G. Schmidt, Neil Steiner, Matthew French, and Ron Sass Copyright © 2012 Andrew G. Schmidt et al. All rights reserved. A Hardware-Accelerated ECDLP with High-Performance Modular Multiplication Wed, 28 Nov 2012 10:20:54 +0000 http://www.hindawi.com/journals/ijrc/2012/439021/ Elliptic curve cryptography (ECC) has become a popular public key cryptography standard. The security of ECC is due to the difficulty of solving the elliptic curve discrete logarithm problem (ECDLP). In this paper, we demonstrate a successful attack on ECC over prime field using the Pollard rho algorithm implemented on a hardware-software cointegrated platform. We propose a high-performance architecture for multiplication over prime field using specialized DSP blocks in the FPGA. We characterize this architecture by exploring the design space to determine the optimal integer basis for polynomial representation and we demonstrate an efficient mapping of this design to multiple standard prime field elliptic curves. We use the resulting modular multiplier to demonstrate low-latency multiplications for curves secp112r1 and P-192. We apply our modular multiplier to implement a complete attack on secp112r1 using a Nallatech FSB-Compute platform with Virtex-5 FPGA. The measured performance of the resulting design is 114 cycles per Pollard rho step at 100 MHz, which gives 878 K iterations per second per ECC core. We extend this design to a multicore ECDLP implementation that achieves 14.05 M iterations per second with 16 parallel point addition cores. Lyndon Judge, Suvarna Mane, and Patrick Schaumont Copyright © 2012 Lyndon Judge et al. All rights reserved. Multidimensional Costas Arrays and Their Enumeration Using GPUs and FPGAs Mon, 19 Nov 2012 13:07:48 +0000 http://www.hindawi.com/journals/ijrc/2012/196761/ The enumeration of two-dimensional Costas arrays is a problem with factorial time complexity and has been solved for sizes up to 29 using computer clusters. Costas arrays of higher dimensionality have recently been proposed and their properties are beginning to be understood. This paper presents, to the best of our knowledge, the first proposed implementations for enumerating these multidimensional arrays in GPUs and FPGAs, as well as the first discussion of techniques to prune the search space and reduce enumeration run time. Both GPU and FPGA implementations rely on Costas array symmetries to reduce the search space and perform concurrent explorations over the remaining candidate solutions. The fine grained parallelism utilized to evaluate and progress the exploration, coupled with the additional concurrency provided by the multiple instanced cores, allowed the FPGA (XC5VLX330-2) implementation to achieve speedups of up to 30× over the GPU (GeForce GTX 580). Rafael A. Arce-Nazario and José Ortiz-Ubarri Copyright © 2012 Rafael A. Arce-Nazario and José Ortiz-Ubarri. All rights reserved. Adaptive Multiclient Network-on-Chip Memory Core: Hardware Architecture, Software Abstraction Layer, and Application Exploration Sun, 18 Nov 2012 15:28:17 +0000 http://www.hindawi.com/journals/ijrc/2012/298561/ This paper presents the hardware architecture and the software abstraction layer of an adaptive multiclient Network-on-Chip (NoC) memory core. The memory core supports the flexibility of a heterogeneous FPGA-based runtime adaptive multiprocessor system called RAMPSoC. The processing elements, also called clients, can access the memory core via the Network-on-Chip (NoC). The memory core supports a dynamic mapping of an address space for the different clients as well as different data transfer modes, such as variable burst sizes. Therefore, two main limitations of FPGA-based multiprocessor systems, the restricted on-chip memory resources and that usually only one physical channel to an off-chip memory exists, are leveraged. Furthermore, a software abstraction layer is introduced, which hides the complexity of the memory core architecture and which provides an easy to use interface for the application programmer. Finally, the advantages of the novel memory core in terms of performance, flexibility, and user friendliness are shown using a real-world image processing application. Diana Göhringer, Lukas Meder, Stephan Werner, Oliver Oey, Jürgen Becker, and Michael Hübner Copyright © 2012 Diana Göhringer et al. All rights reserved. Novel Dynamic Partial Reconfiguration Implementation of K-Means Clustering on FPGAs: Comparative Results with GPPs and GPUs Wed, 14 Nov 2012 08:32:47 +0000 http://www.hindawi.com/journals/ijrc/2012/135926/ K-means clustering has been widely used in processing large datasets in many fields of studies. Advancement in many data collection techniques has been generating enormous amounts of data, leaving scientists with the challenging task of processing them. Using General Purpose Processors (GPPs) to process large datasets may take a long time; therefore many acceleration methods have been proposed in the literature to speed up the processing of such large datasets. In this work, a parameterized implementation of the K-means clustering algorithm in Field Programmable Gate Array (FPGA) is presented and compared with previous FPGA implementation as well as recent implementations on Graphics Processing Units (GPUs) and GPPs. The proposed FPGA has higher performance in terms of speedup over previous GPP and GPU implementations (two orders and one order of magnitude, resp.). In addition, the FPGA implementation is more energy efficient than GPP and GPU (615x and 31x, resp.). Furthermore, three novel implementations of the K-means clustering based on dynamic partial reconfiguration (DPR) are presented offering high degree of flexibility to dynamically reconfigure the FPGA. The DPR implementations achieved speedups in reconfiguration time between 4x to 15x. Hanaa M. Hussain, Khaled Benkrid, Ali Ebrahim, Ahmet T. Erdogan, and Huseyin Seker Copyright © 2012 Hanaa M. Hussain et al. All rights reserved. DMPDS: A Fast Motion Estimation Algorithm Targeting High Resolution Videos and Its FPGA Implementation Tue, 13 Nov 2012 15:46:04 +0000 http://www.hindawi.com/journals/ijrc/2012/186057/ This paper presents a new fast motion estimation (ME) algorithm targeting high resolution digital videos and its efficient hardware architecture design. The new Dynamic Multipoint Diamond Search (DMPDS) algorithm is a fast algorithm which increases the ME quality when compared with other fast ME algorithms. The DMPDS achieves a better digital video quality reducing the occurrence of local minima falls, especially in high definition videos. The quality results show that the DMPDS is able to reach an average PSNR gain of 1.85 dB when compared with the well-known Diamond Search (DS) algorithm. When compared to the optimum results generated by the Full Search (FS) algorithm the DMPDS shows a lose of only 1.03 dB in the PSNR. On the other hand, the DMPDS reached a complexity reduction higher than 45 times when compared to FS. The quality gains related to DS caused an expected increase in the DMPDS complexity which uses 6.4-times more calculations than DS. The DMPDS architecture was designed focused on high performance and low cost, targeting to process Quad Full High Definition (QFHD) videos in real time (30 frames per second). The architecture was described in VHDL and synthesized to Altera Stratix 4 and Xilinx Virtex 5 FPGAs. The synthesis results show that the architecture is able to achieve processing rates higher than 53 QFHD fps, reaching the real-time requirements. The DMPDS architecture achieved the highest processing rate when compared to related works in the literature. This high processing rate was obtained designing an architecture with a high operation frequency and low numbers of cycles necessary to process each block. Gustavo Sanchez, Felipe Sampaio, Marcelo Porto, Sergio Bampi, and Luciano Agostini Copyright © 2012 Gustavo Sanchez et al. All rights reserved. Selected Papers from the International Conference on Reconfigurable Computing and FPGAs (ReConFig'10) Wed, 19 Sep 2012 13:27:42 +0000 http://www.hindawi.com/journals/ijrc/2012/319827/ Claudia Feregrino, Miguel Arias, Kris Gaj, Viktor K. Prasanna, Marco D. Santambrogio, and Ron Sass Copyright © 2012 Claudia Feregrino et al. All rights reserved. Efficient Execution of Networked MPSoC Models by Exploiting Multiple Platform Levels Tue, 18 Sep 2012 16:08:13 +0000 http://www.hindawi.com/journals/ijrc/2012/729786/ Novel embedded applications are characterized by increasing requirements on processing performance as well as the demand for communication between several or many devices. Networked Multiprocessor System-on-Chips (MPSoCs) are a possible solution to cope with this increasing complexity. Such systems require a detailed exploration on both architectures and system design. An approach that allows investigating interdependencies between system and network domain is the cooperative execution of system design tools with a network simulator. Within previous work, synchronization mechanisms have been developed for parallel system simulation and system/network co-simulation using the high level architecture (HLA). Within this contribution, a methodology is presented that extends previous work with further building blocks towards a construction kit for system/network co-simulation. The methodology facilitates flexible assembly of components and adaptation to the specific needs of use cases in terms of performance and accuracy. Underlying concepts and made extensions are discussed in detail. Benefits are substantiated by means of various benchmarks. Christoph Roth, Joachim Meyer, Michael Rückauer, Oliver Sander, and Jürgen Becker Copyright © 2012 Christoph Roth et al. All rights reserved. Algorithm and Hardware Design of a Fast Intra Frame Mode Decision Module for H.264/AVC Encoders Sun, 02 Sep 2012 08:38:26 +0000 http://www.hindawi.com/journals/ijrc/2012/813023/ In the rate-distortion optimization (RDO), the process of choosing the best prediction mode is performed through exhaustive executions of the whole encoding process, increasing significantly the encoder computational complexity. Considering H.264/AVC intra frame prediction, there are several modes to encode a macroblock (MB). This work proposes an algorithm and the hardware design for a fast intra frame mode decision module for H.264/AVC encoders. The application of the proposed algorithm reduces in more than 10 times the number of encoding iterations for choosing the best intramode when compared with RDO-based decision. The architecture was synthesized to FPGA and achieved an operation frequency of 98 MHz processing more than 300 HD1080p frames per second. With this approach, we achieved one order-of-magnitude performance improvement compared with RDO-based approaches, which is very important not only from the performance but also from the energy consumption perspective for battery-operated devices. In order to compare the architecture with previously published works, we also synthesized it to standard cells. Compared with the best previous results reported, the implemented architecture achieves a complexity reduction of five times, a processing capability increase of 14 times, and a reduction in the number of clock cycles per MB of 11 times. Daniel Palomino, Guilherme Corrêa, Cláudio Diniz, Sergio Bampi, Luciano Agostini, and Altamiro Susin Copyright © 2012 Daniel Palomino et al. All rights reserved. Modeling and Implementation of a Power Estimation Methodology for SystemC Mon, 27 Aug 2012 09:41:37 +0000 http://www.hindawi.com/journals/ijrc/2012/439727/ This work describes a methodology to model power consumption of logic modules. A detailed mathematical model is presented and incorporated in a tool for translation of models written in VHDL to SystemC. The functionality for implicit power monitoring and estimation is inserted at module translation. The translation further implements an approach to wrap RTL to TLM interfaces so that the translated module can be connected to a system-level simulator. The power analysis is based on a statistical model of the underlying HW structure and an analysis of input data. The flexibility of the C++ syntax is exploited, to integrate the power evaluation technique. The accuracy and speed-up of the approach are illustrated and compared to a conventional power analysis flow using PPR simulation, based on Xilinx technology. Matthias Kuehnle, Andre Wagner, Alisson V. Brito, and Juergen Becker Copyright © 2012 Matthias Kuehnle et al. All rights reserved. An Optimization-Based Reconfigurable Design for a 6-Bit 11-MHz Parallel Pipeline ADC with Double-Sampling S&H Thu, 16 Aug 2012 15:11:31 +0000 http://www.hindawi.com/journals/ijrc/2012/786205/ This paper presents a 6 bit, 11 MS/s time-interleaved pipeline A/D converter design. The specification process, from block level to elementary circuits, is gradually covered to draw a design methodology. Both power consumption and mismatch between the parallel chain elements are intended to be reduced by using some techniques such as double and bottom-plate sampling, fully differential circuits, RSD digital correction, and geometric programming (GP) optimization of the elementary analog circuits (OTAs and comparators) design. Prelayout simulations of the complete ADC are presented to characterize the designed converter, which consumes 12 mW while sampling a 500 kHz input signal. Moreover, the block inside the ADC with the most stringent requirements in power, speed, and precision was sent to fabrication in a CMOS 0.35 μm AMS technology, and some postlayout results are shown. Wilmar Carvajal and Wilhelmus Van Noije Copyright © 2012 Wilmar Carvajal and Wilhelmus Van Noije. All rights reserved. Cellular Automata-Based Parallel Random Number Generators Using FPGAs Thu, 02 Aug 2012 11:53:47 +0000 http://www.hindawi.com/journals/ijrc/2012/219028/ Cellular computing represents a new paradigm for implementing high-speed massively parallel machines. Cellular automata (CA), which consist of an array of locally connected processing elements, are a basic form of a cellular-based architecture. The use of field programmable gate arrays (FPGAs) for implementing CA accelerators has shown promising results. This paper investigates the design of CA-based pseudo-random number generators (PRNGs) using an FPGA platform. To improve the quality of the random numbers that are generated, the basic CA structure is enhanced in two ways. First, the addition of a superrule to each CA cell is considered. The resulting self-programmable CA (SPCA) uses the superrule to determine when to make a dynamic rule change in each CA cell. The superrule takes its inputs from neighboring cells and can be considered itself a second CA working in parallel with the main CA. When implemented on an FPGA, the use of lookup tables in each logic cell removes any restrictions on how the super-rules should be defined. Second, a hybrid configuration is formed by combining a CA with a linear feedback shift register (LFSR). This is advantageous for FPGA designs due to the compactness of the LFSR implementations. A standard software package for statistically evaluating the quality of random number sequences known as Diehard is used to validate the results. Both the SPCA and the hybrid CA/LFSR were found to pass all the Diehard tests. David H. K. Hoe, Jonathan M. Comer, Juan C. Cerda, Chris D. Martinez, and Mukul V. Shirvaikar Copyright © 2012 David H. K. Hoe et al. All rights reserved. Redesigned-Scale-Free CORDIC Algorithm Based FPGA Implementation of Window Functions to Minimize Area and Latency Wed, 01 Aug 2012 10:44:39 +0000 http://www.hindawi.com/journals/ijrc/2012/185784/ One of the most important steps in spectral analysis is filtering, where window functions are generally used to design filters. In this paper, we modify the existing architecture for realizing the window functions using CORDIC processor. Firstly, we modify the conventional CORDIC algorithm to reduce its latency and area. The proposed CORDIC algorithm is completely scale-free for the range of convergence that spans the entire coordinate space. Secondly, we realize the window functions using a single CORDIC processor as against two serially connected CORDIC processors in existing technique, thus optimizing it for area and latency. The linear CORDIC processor is replaced by a shift-add network which drastically reduces the number of pipelining stages required in the existing design. The proposed design on an average requires approximately 64% less pipeline stages and saves up to 44.2% area. Currently, the processor is designed to implement Blackman windowing architecture, which with slight modifications can be extended to other widow functions as well. The details of the proposed architecture are discussed in the paper. Supriya Aggarwal and Kavita Khare Copyright © 2012 Supriya Aggarwal and Kavita Khare. All rights reserved. Performance Analysis Techniques for Multi-Soft-Core and Many-Soft-Core Systems Thu, 26 Jul 2012 10:37:00 +0000 http://www.hindawi.com/journals/ijrc/2012/736347/ Multi-soft-core systems are a viable and interesting solution for embedded systems that need a particular tradeoff between performance, flexibility and development speed. As the growing capacity allows it, many-soft-cores are also expected to have relevance to future embedded systems. As a consequence, parallel programming methods and tools will be necessarily embraced as a part of the full system development process. Performance analysis is an important part of the development process for parallel applications. It is usually mandatory when you want to get a desired performance or to verify that the system is meeting some real-time constraints. One of the usual techniques used by the HPC community is the postmortem analysis of application traces. However, this is not easily transported to the embedded systems based on FPGA due to the resource limitations of the platforms. We propose several techniques and some hardware architectural support to be able to generate traces on multiprocessor systems based on FPGAs and use them to optimize the performance of the running applications. David Castells-Rufas, Eduard Fernandez-Alonso, and Jordi Carrabina Copyright © 2012 David Castells-Rufas et al. All rights reserved. HoneyComb: An Application-Driven Online Adaptive Reconfigurable Hardware Architecture Thu, 19 Jul 2012 14:51:16 +0000 http://www.hindawi.com/journals/ijrc/2012/832531/ Since the introduction of the first reconfigurable devices in 1985 the field of reconfigurable computing developed a broad variety of architectures from fine-grained to coarse-grained types. However, the main disadvantages of the reconfigurable approaches, the costs in area, and power consumption, are still present. This contribution presents a solution for application-driven adaptation of our reconfigurable architecture at register transfer level (RTL) to reduce the resource requirements and power consumption while keeping the flexibility and performance for a predefined set of applications. Furthermore, implemented runtime adaptive features like online routing and configuration sequencing will be presented and discussed. A presentation of the prototype chip of this architecture designed in 90 nm standard cell technology manufactured by TSMC will conclude this contribution. Alexander Thomas, Michael Rückauer, and Jürgen Becker Copyright © 2012 Alexander Thomas et al. All rights reserved.