<rss  version="2.0">
 <channel>
    <title>International Journal of Reconfigurable Computing</title>
    <link>https://www.hindawi.com</link>
    <description>The latest articles from Hindawi</description>
   
    <copyright>&#169; 2024, Hindawi Limited. All rights reserved.</copyright>
   
        <item>
            <title>Hardware Obfuscation Based Watermarking Technique for IPR Ownership Identification</title>
            <pubDate>Mon, 03 Jul 2023 10:50:00 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2023/4550758/</link>
            <description>As the reuse of IP cores or the development of frequently used hardware modules is gaining more attention in the semiconductor industry, the misappropriation of the owner&#x2019;s identity is a rising concern. Therefore, imprinting the owner&#x2019;s identity in the form of a watermark or signature on the IP core is essential to avoid intellectual property right (IPR) infringement. In view of this, a watermarking technique is proposed in the present manuscript. A constraint-based dynamic watermarking method to generate the owner&#x2019;s signature is proposed in conjunction with the logic encryption-based hardware obfuscation method. The method formulated in this manuscript consciously makes use of a basic switching component for embedding a watermark with IP core and hardware obfuscation, to achieve a lower overhead budget. Through the switching mechanism, the embedded watermark can be made detectable to legitimate end users off chip via test pin. The logic encryption-based method is set for accessing the watermark. Furthermore, an encrypted functionality is set as the signature generator module for generating owner&#x2019;s signature. This provides hardware obfuscation and two-stage authentication mechanism for the generation of owner&#x2019;s signature, and as a result of this, double-layer protection is achieved. Furthermore, a novel method to configure input key for signature generation module and to formulate owner&#x2019;s signature is proposed. The viability of the present watermark technique for real-life application is checked on the ground of transparency, security, reliability, performance overhead, and robustness. Since the watermark in the proposed method is embedded outside the IP core, it does not cause any latency for the IP core functionality. 
Thus, even with significantly lower area overhead (&#x223c;&#x3c;1.4&#x0025;), the proposed method is able to provide higher robustness in terms of lower probability of coincidence (PC&#x2009;&#x3d;&#x2009;4.68e&#x2212;97).</description>
            <Author>Priyanka Bagul and Vandana Inamdar</Author>
            <copyright>Copyright &#xa9; 2023 Priyanka Bagul and Vandana Inamdar. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>Open-Source Ethernet MAC IP Cores for FPGAs: Overview and Evaluation</title>
            <pubDate>Tue, 23 May 2023 08:35:00 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2023/9222318/</link>
            <description>Field-Programmable Gate Arrays (FPGAs) can be found in an increasing number of application domains, such as the telecom industry, the automotive electronics sector, or automation technology as well as in the area of reconfigurable computing. In recent years, the open-source idea, long known from the software domain, has also become popular in the world of hardware and FPGA design. In the era of the Internet of Things, many of today&#x2019;s electronic devices implement some kind of network interface, with Ethernet being one of the most widely used network standards. Consequently, there is a high demand for available Ethernet implementations for FPGA platforms. The goal of this work is to survey available open-source Ethernet MAC IP cores, evaluate existing designs in terms of performance, resource utilization, code quality, or maturity, and present and summarize the evaluation results. Furthermore, advantages of commercial solutions and related published work are discussed. To the authors&#x2019; best knowledge, this is the first publication that evaluates and compares existing open-source Ethernet MAC IP cores on a large scale. This work should help designers select an appropriate open-source Ethernet MAC for an FPGA design and shows possible pitfalls and things to pay attention to when using an open-source IP core in general. Finally, the authors would like to show that the open-source community can also be very helpful in the world of hardware in terms of design reuse or time to market.</description>
            <Author>Christian Fibich, Patrick Schmitt, Roland H&#xf6;ller, and Peter R&#xf6;ssler</Author>
            <copyright>Copyright &#xa9; 2023 Christian Fibich et al. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>A Methodology for an FPGA Implementation of a Programmable Logic Controller to Control an Atomic Layer Deposition System</title>
            <pubDate>Fri, 06 May 2022 12:20:00 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2022/8827417/</link>
            <description>In this work, we present an industrial cold walled Atomic Layer Deposition (ALD) system, which can be controlled by either a traditional programmable logic controller (PLC) system or a field-programmable gate array (FPGA) prototyping board. This work presents an FPGA controlled system that takes ladder diagram (LD) control for a PLC and converts this control to Verilog HDL and programs an FPGA such that the FPGA prototyping board is used to control a real industrial application. We explore this approach since FPGA implementation of LD control could significantly reduce the cost of implementing these controllers with other potential advantages such as the improved granularity of timing control from milliseconds to nanoseconds, additional available pins for inputs and outputs far exceeding that of microprocessors, and lower power consumption for control. In this work, we provide details and descriptions of our industrial system (ALD), the LD control of this system and its implementation, our software flow to convert LDs to Verilog HDL, and our FPGA prototype board design to replace the existing electronic controller. We show how our LD-Verilog HDL converter in conjunction with FPGAs matches a PLC and demonstrate some of the benefits of using an FPGA.</description>
            <Author>Peter Jamieson, Donald Blank, Janelle Ghanem, Tyler McGrew, and Giancarlo Corti</Author>
            <copyright>Copyright &#xa9; 2022 Peter Jamieson et al. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>A Decision-Making Method Providing Sustainability to FPGA-Based SoCs by Run-Time Structural Adaptation to Mode of Operation, Power Budget, and Die Temperature Variations</title>
            <pubDate>Wed, 01 Sep 2021 06:50:01 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2021/5512938/</link>
            <description>One of the growing areas of application of embedded systems in robotics, aerospace, military, etc. is autonomous mobile systems. Usually, such embedded systems have multitask multimodal workloads. These systems must sustain the required performance of their dynamic workloads in presence of varying power budget due to rechargeable power sources, varying die temperature due to varying workloads and/or external temperature, and varying hardware resources due to occurrence of hardware faults. This paper proposes a run-time decision-making method, called Decision Space Explorer, for FPGA-based Systems-on-Chip (SoCs) to support changing workload requirements while simultaneously mitigating unpredictable variations in power budget, die temperature, and hardware resource constraints. It is based on the concept of Run-Time Structural Adaptation (RTSA); whenever there is a change in a system&#x2019;s set of constraints, Explorer selects a suitable hardware processing circuit for each active task at an appropriate operating frequency such that all the constraints are satisfied. Explorer has been experimentally deployed on the ARM Cortex-A9 core of Xilinx Zynq XC7Z020 SoC. Its worst-case decision-making time for different scenarios ranges from tens to hundreds of microseconds. Explorer is thus suitable for enabling RTSA in systems where specifications of multiple objectives must be maintained simultaneously, making them self-sustainable.</description>
            <Author>Dimple Sharma and Lev Kirischian</Author>
            <copyright>Copyright &#xa9; 2021 Dimple Sharma and Lev Kirischian. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>A Service-Oriented Component-Based Framework for Dynamic Reconfiguration Modeling Targeting SystemC/TLM</title>
            <pubDate>Tue, 03 Aug 2021 10:05:00 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2021/5584391/</link>
            <description>To deal with the complex design issues of Dynamically Reconfigurable Systems-on-Chip (DRSoCs), it is extremely relevant to raise the abstraction level in which models are expressed. A high abstraction level allows great flexibility and reusability while bypassing low-level implementation details. In this context, model-driven engineering (MDE) provides support to build and transform precise and structured models for a particular purpose at different levels of abstraction. Indeed, high-level models are successively refined to low-level models until reaching the executable ones. Thus, this paper presents an MDE-based framework for DRSoCs design enabling the transformation of UML/MARTE specifications to SystemC/TLM implementation. To achieve a high degree of expressiveness for modeling dynamic reconfiguration, we use a suitable software engineering approach based on service-oriented component architecture. Since MARTE does not cover the common features of dynamic reconfiguration domain and service orientation concepts, new stereotypes are created by refinement to add missing capabilities to the profile. Likewise, SystemC does not provide native support for dynamic reconfiguration, thus leading us to adopt a design pattern based solution for DRSoCs implementation in compliance with standards. The proposed framework is validated through a reconfigurable active 3-way crossover case study in which we demonstrate the practicability of the approach by gradual model transformations with reduced implementation effort and significant design productivity gain.</description>
            <Author>Khaled Allem, El-Bay Bourennane, and Youcef Khelfaoui</Author>
            <copyright>Copyright &#xa9; 2021 Khaled Allem et al. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>A Method for Run-Time Prediction of On-Chip Thermal Conditions in Dynamically Reconfigurable SOPCs</title>
            <pubDate>Thu, 07 Jan 2021 16:50:00 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2021/8818788/</link>
            <description>Autonomous mobile systems nowadays deploy FPGA-based System on Programmable Chips (SoPCs) for supporting their dynamic multitask multimodal workloads. For such field-deployed systems, activation times, execution periods of tasks, and variations in environmental conditions are usually difficult to predict. These dynamic variations result in a new challenge of dynamic thermal cycling stress on the SoPC die, which can result in transient and even permanent hardware faults in the computing system. This paper proposes the approach of run-time structural adaptation (RTSA) to mitigate dynamic thermal cycling stress on the SoPC dies. RTSA assumes the tasks to have multiple implementation variants, called Application Specific Processing (ASP) circuit variants, which vary in hardware resources, operating frequency, and power consumption. Dynamically reconfiguring appropriate ASP circuit variants of tasks allows systems to maintain their die temperature in the desired range while taking into account variations in power budget and modes of operation. This means the essence of RTSA is a decision-making mechanism which can select, at run-time, a suitable system configuration (set of ASP circuit variants of active tasks), whenever needed, to meet the die temperature constraints. To do so, run-time die temperature prediction for potential system configurations using an estimation model is required. This paper presents a generic method to derive an analytical model for any SoPC that can estimate the die temperature in real time and thus support the decision-making mechanism. To develop this method, the thermal behavior of the SoPC die under different task scenarios is studied and the relation of die temperature to frequency, resource utilization, and power consumption is analyzed. An RTSA-enabled experimental platform is set up on Xilinx Zynq XC7Z020 SoPC for this purpose. 
Experimental results also demonstrate that the proposed method can be used to derive a model in run-time, thus enabling systems to self-derive and dynamically update the model in run-time.</description>
            <Author>Dimple Sharma and Lev Kirischian</Author>
            <copyright>Copyright &#xa9; 2021 Dimple Sharma and Lev Kirischian. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>FPGAs for Domain Experts</title>
            <pubDate>Tue, 27 Oct 2020 12:20:00 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2020/2725809/</link>
            <description></description>
            <Author>Wim Vanderbauwhede, Sven-Bodo Scholz, and Martin Margala</Author>
            <copyright>Copyright &#xa9; 2020 Wim Vanderbauwhede et al. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>FPGA Implementation of A* Algorithm for Real-Time Path Planning</title>
            <pubDate>Mon, 17 Aug 2020 09:20:01 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2020/8896386/</link>
            <description>The traditional A* algorithm is time-consuming due to a large number of iteration operations to calculate the evaluation function and sort the OPEN list. To achieve real-time path-planning performance, a hardware accelerator architecture called the A* accelerator has been designed and implemented on a field-programmable gate array (FPGA). A specially designed 8-port cache and OPEN list array are introduced to tackle the calculation bottleneck. The system-on-a-chip (SOC) design is implemented on a Xilinx Kintex-7 FPGA to evaluate the A* accelerator. Experiments show that the hardware accelerator achieves 37–75 times performance enhancement relative to a software implementation. It is suitable for real-time path-planning applications.</description>
            <Author>Yuzhi Zhou, Xi Jin, and Tianqi Wang</Author>
            <copyright>Copyright &#xa9; 2020 Yuzhi Zhou et al. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>Dynamic Reliability Management for FPGA-Based Systems</title>
            <pubDate>Sat, 13 Jun 2020 18:35:01 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2020/2808710/</link>
            <description>Radiation tolerance in FPGAs is an important field of research particularly for reliable computation in electronics used in aerospace and satellite missions. The motivation behind this research is the degradation of reliability in FPGA hardware due to single-event effects caused by radiation particles. Redundancy is a commonly used technique to enhance the fault-tolerance capability of radiation-sensitive applications. However, redundancy comes with an overhead in terms of excessive area consumption, latency, and power dissipation. Moreover, the redundant circuit implementations vary in structure and resource usage with the redundancy insertion algorithms as well as number of used redundant stages. The radiation environment varies during the operation time span of the mission depending on the orbit and space weather conditions. Therefore, the overheads due to redundancy should also be optimized at run-time with respect to the current radiation level. In this paper, we propose a technique called Dynamic Reliability Management (DRM) that utilizes the radiation data, interprets it, selects a suitable redundancy level, and performs the run-time reconfiguration, thus varying the reliability levels of the target computation modules. DRM is composed of two parts. The design-time tool flow of DRM generates a library of various redundant implementations of the circuit with different magnitudes of performance factors. The run-time tool flow, while utilizing the radiation/error-rate data, selects a required redundancy level and reconfigures the computation module with the corresponding redundant implementation. Both parts of DRM have been verified by experimentation on various benchmarks. 
The most significant finding from this experimentation is that performance can be improved severalfold by using the partial reconfiguration feature of DRM; e.g., 7.7 and 3.7 times better performance results were obtained for our data sorter and matrix multiplier case studies compared with static reliability management techniques. Therefore, DRM allows for maintaining a suitable trade-off between computation reliability and performance overhead during the run-time of an application.</description>
            <Author>Jahanzeb Anwer, Sebastian Meisner, and Marco Platzner</Author>
            <copyright>Copyright &#xa9; 2020 Jahanzeb Anwer et al. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>SIFO: Secure Computational Infrastructure Using FPGA Overlays</title>
            <pubDate>Fri, 06 Dec 2019 15:50:00 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2019/1439763/</link>
            <description>Secure Function Evaluation (SFE) has received recent attention due to the massive collection and mining of personal data, but remains impractical due to its large computational cost. Garbled Circuits (GC) is a protocol for implementing SFE which can evaluate any function that can be expressed as a Boolean circuit and obtain the result while keeping each party&#x2019;s input private. Recent advances have led to a surge of garbled circuit implementations in software for a variety of different tasks. However, these implementations are inefficient, and therefore GC is not widely used, especially for large problems. This research investigates, implements, and evaluates secure computation generation using a heterogeneous computing platform featuring FPGAs. We have designed and implemented SIFO: secure computational infrastructure using FPGA overlays. Unlike traditional FPGA design, a coarse-grained overlay architecture is adopted which supports mapping SFE problems that are too large to map to a single FPGA. Host tools provided include SFE problem generator, parser, and automatic host code generation. Our design allows repurposing an FPGA to evaluate different SFE tasks without the need for reprogramming and fully explores the parallelism for any GC problem. Our system demonstrates an order of magnitude speedup compared with an existing software platform.</description>
            <Author>Xin Fang, Stratis Ioannidis, and Miriam Leeser</Author>
            <copyright>Copyright &#xa9; 2019 Xin Fang et al. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>From FPGA to Support Cloud to Cloud of FPGA: State of the Art</title>
            <pubDate>Thu, 05 Dec 2019 15:50:00 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2019/8085461/</link>
            <description>Field Programmable Gate Arrays (FPGAs) draw significant attention from both industry and academia by accelerating computationally expensive applications and achieving low power consumption. FPGAs are interesting due to the flexibility and reconfigurability of their device. Cloud computing has become a major trend towards the dematerialization of infrastructure and computing resources. It provides &#x201c;unlimited&#x201d; storage capacities and a large number of data and applications that make collaboration easier between multiple (not domain specific) designers. Many papers in the literature have surveyed Cloud and FPGA separately and, more precisely, their services and challenges. The acceleration of applications by FPGAs and the unlimited capacities of the cloud are expected to be more and more pervasive. As more and more FPGAs are being deployed in traditional clouds, it is appropriate to clarify what the cloud FPGA is and which drawbacks of using FPGAs locally are resolved. We present a survey of the cloud FPGA works that have been proposed to exploit the advantages of using FPGAs in the cloud. We classify these studies into three services to highlight their benefits and limitations. This survey aims at motivating further research in cloud FPGA.</description>
            <Author>Rym Skhiri, Virginie Fresse, Jean Paul Jamont, Benoit Suffran, and Jihene Malek</Author>
            <copyright>Copyright &#xa9; 2019 Rym Skhiri et al. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>Automatic Pipelining and Vectorization of Scientific Code for FPGAs</title>
            <pubDate>Mon, 18 Nov 2019 13:05:04 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2019/7348013/</link>
            <description>There is a large body of legacy scientific code in use today that could benefit from execution on accelerator devices like GPUs and FPGAs. Manual translation of such legacy code into device-specific parallel code requires significant manual effort and is a major obstacle to wider FPGA adoption. We are developing an automated optimizing compiler TyTra to overcome this obstacle. The TyTra flow aims to compile legacy Fortran code automatically for FPGA-based acceleration, while applying suitable optimizations. We present the flow with a focus on two key optimizations, automatic pipelining and vectorization. Our compiler frontend extracts patterns from legacy Fortran code that can be pipelined and vectorized. The backend first creates fine and coarse-grained pipelines and then automatically vectorizes both the memory access and the datapath based on a cost model, generating an OpenCL-HDL hybrid working solution for FPGA targets on the Amazon cloud. Our results show up to 4.2&#xd7; performance improvement over baseline OpenCL code.</description>
            <Author>Syed Waqar Nabi and Wim Vanderbauwhede</Author>
            <copyright>Copyright &#xa9; 2019 Syed Waqar Nabi and Wim Vanderbauwhede. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>ViPar: High-Level Design Space Exploration for Parallel Video Processing Architectures</title>
            <pubDate>Thu, 14 Nov 2019 03:05:03 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2019/4298013/</link>
            <description>Embedded video applications are now involved in sophisticated transportation systems like autonomous vehicles and driver assistance systems. As silicon capacity increases, the design productivity gap grows for the currently available design tools. Hence, high-level synthesis (HLS) tools emerged in order to reduce that gap by shifting the design efforts to higher abstraction levels. In this paper, we present ViPar as a tool for exploring different video processing architectures at a higher design level. First, we propose a parametrizable parallel architectural model dedicated to video applications. Second, targeting this architectural model, we developed the ViPar tool with two main features: (1) An empirical model is introduced to estimate the power consumption based on hardware utilization and operating frequency. In addition, we derive the equations for estimating the hardware utilization and execution time for each design point during the space exploration process. (2) By defining the main characteristics of the parallel video architecture, such as the parallelism level, the number of input/output ports, the pixel distribution pattern, and so on, the ViPar tool can automatically generate the dedicated architecture for hardware implementation. In the experimental validation, we used the ViPar tool to automatically generate an efficient hardware implementation of a Multiwindow Sum of Absolute Difference stereo matching algorithm on a Xilinx Zynq ZC706 board. We succeeded in increasing the design productivity by converging rapidly to the appropriate designs that fit our system constraints in terms of power consumption, hardware utilization, and frame execution time.</description>
            <Author>Karim M. A. Ali, Rabie Ben Atitallah, Abdessamad Ait El Cadi, Nizar Fakhfakh, and Jean-Luc Dekeyser</Author>
            <copyright>Copyright &#xa9; 2019 Karim M. A. Ali et al. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>Dimension Reduction Using Quantum Wavelet Transform on a High-Performance Reconfigurable Computer</title>
            <pubDate>Mon, 11 Nov 2019 00:07:40 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2019/1949121/</link>
            <description>The high resolution of multidimensional space-time measurements and enormity of data readout counts in applications such as particle tracking in high-energy physics (HEP) is becoming nowadays a major challenge. In this work, we propose combining dimension reduction techniques with quantum information processing for application in domains that generate large volumes of data such as HEP. More specifically, we propose using quantum wavelet transform (QWT) to reduce the dimensionality of high spatial resolution data. The quantum wavelet transform takes advantage of the principles of quantum mechanics to achieve reductions in computation time while processing exponentially larger amount of information. We develop simpler and optimized emulation architectures than what has been previously reported, to perform quantum wavelet transform on high-resolution data. We also implement the inverse quantum wavelet transform (IQWT) to accurately reconstruct the data without any losses. The algorithms are prototyped on an FPGA-based quantum emulator that supports double-precision floating-point computations. Experimental work has been performed using high-resolution image data on a state-of-the-art multinode high-performance reconfigurable computer. The experimental results show that the proposed concepts represent a feasible approach to reducing dimensionality of high spatial resolution data generated by applications such as particle tracking in high-energy physics.</description>
            <Author>Naveed Mahmud and Esam El-Araby</Author>
            <copyright>Copyright &#xa9; 2019 Naveed Mahmud and Esam El-Araby. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>Translating Timing into an Architecture: The Synergy of COTSon and HLS (Domain Expertise&#x2014;Designing a Computer Architecture via HLS)</title>
            <pubDate>Sun, 03 Nov 2019 00:07:10 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2019/2624938/</link>
            <description>Translating a system requirement into a low-level representation (e.g., register transfer level or RTL) is the typical goal of the design of FPGA-based systems. However, the Design Space Exploration (DSE) needed to identify the final architecture may be time consuming, even when using high-level synthesis (HLS) tools. In this article, we illustrate our hybrid methodology, which uses a frontend for HLS so that the DSE is performed more rapidly by using a higher level abstraction, but without losing accuracy, thanks to the HP-Labs COTSon simulation infrastructure in combination with our DSE tools (MYDSE tools). In particular, this proposed methodology proved useful to achieve an appropriate design of a whole system in a shorter time than trying to design everything directly in HLS. Our motivating problem was to deploy a novel execution model called data-flow threads (DF-Threads) running on yet-to-be-designed hardware. For that goal, directly using the HLS was too premature in the design cycle. Therefore, a key point of our methodology consists in defining the first prototype in our simulation framework and gradually migrating the design into the Xilinx HLS after validating the key performance metrics of our novel system in the simulator. To explain this workflow, we first use a simple driving example consisting in the modelling of a two-way associative cache. Then, we explain how we generalized this methodology and describe the types of results that we were able to analyze in the AXIOM project, which helped us reduce the development time from months/weeks to days/hours.</description>
            <Author>Roberto Giorgi, Farnam Khalili, and Marco Procaccini</Author>
            <copyright>Copyright &#xa9; 2019 Roberto Giorgi et al. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>An FPGA-Based Hardware Accelerator for CNNs Using On-Chip Memories Only: Design and Benchmarking with Intel Movidius Neural Compute Stick</title>
            <pubDate>Tue, 22 Oct 2019 11:05:01 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2019/7218758/</link>
            <description>In recent years, convolutional neural networks have been used for different applications, thanks to their ability to carry out tasks by using a reduced number of parameters when compared with other deep learning approaches. However, power consumption and memory footprint constraints, typical of edge and portable applications, usually collide with accuracy and latency requirements. For such reasons, commercial hardware accelerators have become popular, thanks to their architecture designed for the inference of general convolutional neural network models. Nevertheless, field-programmable gate arrays represent an interesting perspective since they offer the possibility to implement a hardware architecture tailored to a specific convolutional neural network model, with promising results in terms of latency and power consumption. In this article, we propose a full on-chip field-programmable gate array hardware accelerator for a separable convolutional neural network, which was designed for a keyword spotting application. We started from the model implemented in a previous work for the Intel Movidius Neural Compute Stick. For our goals, we appropriately quantized such a model through a bit-true simulation, and we realized a dedicated architecture exclusively using on-chip memories. A benchmark was carried out comparing the results on different field-programmable gate array families by Xilinx and Intel with the implementation on the Neural Compute Stick. The analysis shows that the FPGA solution can achieve better inference time and energy per inference with comparable accuracy, at the expense of a higher design effort and development time.</description>
            <Author>Gianmarco Dinelli, Gabriele Meoni, Emilio Rapuano, Gionata Benelli, and Luca Fanucci</Author>
            <copyright>Copyright &#xa9; 2019 Gianmarco Dinelli et al. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>Implementing and Evaluating an Heterogeneous, Scalable, Tridiagonal Linear System Solver with OpenCL to Target FPGAs, GPUs, and CPUs</title>
            <pubDate>Sun, 13 Oct 2019 00:07:11 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2019/3679839/</link>
            <description>Solving diagonally dominant tridiagonal linear systems is a common problem in scientific high-performance computing (HPC). Furthermore, it is becoming more commonplace for HPC platforms to utilise a heterogeneous combination of computing devices. Whilst it is desirable to design faster implementations of parallel linear system solvers, power consumption concerns are increasing in priority. This work presents the oclspkt routine. The oclspkt routine is a heterogeneous OpenCL implementation of the truncated SPIKE algorithm that can use FPGAs, GPUs, and CPUs to concurrently accelerate the solving of diagonally dominant tridiagonal linear systems. The routine is designed to solve tridiagonal systems of any size and can dynamically allocate optimised workloads to each accelerator in a heterogeneous environment depending on the accelerator&#x2019;s compute performance. The truncated SPIKE FPGA solver is developed first for optimising OpenCL device kernel performance, global memory bandwidth, and interleaved host to device memory transactions. The FPGA OpenCL kernel code is then refactored and optimised to best exploit the underlying architecture of the CPU and GPU. An optimised TDMA OpenCL kernel is also developed to act as a serial baseline performance comparison for the parallel truncated SPIKE kernel since no FPGA tridiagonal solver capable of solving large tridiagonal systems was available at the time of development. The individual GPU, CPU, and FPGA solvers of the oclspkt routine are 110&#x0025;, 150&#x0025;, and 170&#x0025; faster, respectively, than comparable device-optimised third-party solvers and applicable baselines. Assessing heterogeneous combinations of compute devices, the GPU&#x2009;&#x2b;&#x2009;FPGA combination is found to have the best compute performance and the FPGA-only configuration is found to have the best overall estimated energy efficiency.</description>
            <Author>Hamish J. Macintosh, Jasmine E. Banks, and Neil A. Kelson</Author>
            <copyright>Copyright &#xa9; 2019 Hamish J. Macintosh et al. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>A Real-Time Capable Dynamic Partial Reconfiguration System for an Application-Specific Soft-Core Processor</title>
            <pubDate>Sun, 22 Sep 2019 00:06:02 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2019/4723838/</link>
            <description>Modern FPGAs (Field Programmable Gate Arrays) are becoming increasingly important when it comes to embedded system development. Within these FPGAs, soft-core processors are often used to solve a wide range of different tasks. Soft-core processors are a cost-effective and time-efficient way to realize embedded systems. When using the full potential of FPGAs, it is possible to dynamically reconfigure parts of them during run time without the need to stop the device. This feature is called dynamic partial reconfiguration (DPR). If the DPR approach is to be applied in a real-time application-specific soft-core processor, an architecture must be created that ensures strict compliance with the real-time constraint at all times. In this paper, a novel method that addresses this problem is introduced, and its realization is described. In the first step, an application-specializable soft-core processor is presented that is capable of solving problems while adhering to hard real-time deadlines. This is achieved by the full design time analyzability of the soft-core processor. Its special architecture and other necessary features are discussed. Furthermore, a method for the optimized generation of partial bitstreams for the DPR as well as its practical implementation in a tool is presented. This tool is able to minimize given bitstreams with the help of a differential frame bitmap. Experiments that realize the DPR within the soft-core framework are presented, with respect to the need for hard real-time capability. Those experiments show a significant resource reduction of about 40&#x0025; compared to a functionally equivalent non-DPR design.</description>
            <Author>Michael Kirchhoff, Philipp Kerling, Detlef Streitferdt, and Wolfgang Fengler</Author>
            <copyright>Copyright &#xa9; 2019 Michael Kirchhoff et al. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>FPGA Implementation of an Improved Reconfigurable FSMIM Architecture Using Logarithmic Barrier Function Based Gradient Descent Approach</title>
            <pubDate>Mon, 01 Apr 2019 07:05:45 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2019/3727254/</link>
            <description>Recently, the Reconfigurable FSM has drawn the attention of researchers for multistage signal processing applications. The optimal synthesis of the Reconfigurable finite state machine with input multiplexing (Reconfigurable FSMIM) architecture is done by the iterative greedy heuristic based Hungarian algorithm (IGHA). The major problem with IGHA is that it is not integrated with a state encoding technique. This paper proposes the integration of IGHA with state assignment using a logarithmic barrier function based gradient descent approach to reduce the hardware consumption of Reconfigurable FSMIM. Experiments performed using MCNC FSM benchmarks illustrate a significant area and speed improvement over other architectures during field programmable gate array (FPGA) implementation.</description>
            <Author>Nitish Das and Aruna Priya P</Author>
            <copyright>Copyright &#xa9; 2019 Nitish Das and Aruna Priya P. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>Exposing End-to-End Delay in Software-Defined Networking</title>
            <pubDate>Mon, 04 Mar 2019 10:05:20 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2019/7363901/</link>
            <description>Software-Defined Networking (SDN) promises fast and cost-effective deployment of demanding services. Until now, most SDN use cases have been deployed in enterprise/campus networks and data center networks. However, when SDN is applied to large-scale networks, such as a Wide Area Network (WAN), the end-to-end delay of packet traversal is suspected to be very large and needs further investigation. Moreover, a stringent time constraint is the cornerstone of real-time applications in SDN. Understanding the packet delay in SDN-based large networks is crucial for the proper design of switch architecture and the optimization of network algorithms such as flow control algorithms. In this paper, we present a thorough, systematic exploration of the end-to-end delay in an SDN consisting of multiple nodes, fully exposing the components that contribute to the long delay. We show that SDN switches cannot completely avoid flow setup even in proactive mode, and we conduct data mining on the probability of flow setup. We propose an analytical model for the end-to-end delay. This model takes into account the impact of the different rule installation times on different switches. Because the delay in switches accounts for a large proportion of the entire delay, we conduct various measurements of the delay of a single switch. Results for the delay at different flow setup rates and with different rule priority patterns are presented. Furthermore, we study the impact on packet delay of ternary content addressable memory (TCAM) updates. We measure the parameters of the delay model and find that if SDN is deployed in all segments of a WAN, the delay of packet traversal increases by up to 27.95 times in the worst case in our experimental settings, compared with the delay in a conventional network. Such a high delay may eventually cause end-to-end connections to fail if no additional measures are taken.</description>
            <Author>Ting Zhang and Bin Liu</Author>
            <copyright>Copyright &#xa9; 2019 Ting Zhang and Bin Liu. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>AsyncBTree: Revisiting Binary Tree Topology for Efficient FPGA-Based NoC Implementation</title>
            <pubDate>Wed, 20 Feb 2019 12:05:32 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2019/7239858/</link>
            <description>Binary tree topology generally fails to attract network on chip (NoC) implementations due to its low bisection bandwidth. Fat trees are proposed to alleviate this issue by using increasingly thicker links to connect switches towards the root node. This scheme is very efficient in interconnected networks such as computer networks, which use generic switches for interconnection. In an NoC context, especially for field programmable gate arrays (FPGAs), fat trees require more complex switches as we move higher in the hierarchy. This restricts the maximum clock frequency at which the network operates and offsets the higher bandwidth achieved through using fatter links. In this paper, we discuss the implementation of a binary tree-based NoC, which achieves better bandwidth by varying the clock frequency between the switches as we move higher in the hierarchy. This scheme enables using simpler switch architecture, thus supporting higher maximum frequency of operation. The effect on bandwidth and resource requirement of this architecture is compared with other FPGA-based NoCs for different network sizes and traffic patterns.</description>
            <Author>Kizheppatt Vipin</Author>
            <copyright>Copyright &#xa9; 2019 Kizheppatt Vipin. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>On a Real-Time Blind Signal Separation Noise Reduction System</title>
            <pubDate>Tue, 04 Dec 2018 07:04:24 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2018/3721756/</link>
            <description>Blind signal separation has been studied extensively in order to tackle the cocktail party problem. It explores spatial diversity of the received mixtures of sources by different sensors. By using the kurtosis measure, it is possible to select the source of interest out of a number of separated BSS outputs. Further noise cancellation can be achieved by adding an adaptive noise canceller (ANC) as postprocessing. However, the computation is rather intensive and an online implementation of the overall system is not straightforward. This paper intends to fill the gap by developing an FPGA hardware architecture to implement the system. Subband processing is explored and detailed functional operations are profiled carefully. The final proposed FPGA system is able to handle signals with sample rate over 20000 samples per second.</description>
            <Author>Ka Fai Cedric Yiu and Siow Yong Low</Author>
            <copyright>Copyright &#xa9; 2018 Ka Fai Cedric Yiu and Siow Yong Low. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>Algorithm and Architecture Optimization for 2D Discrete Fourier Transforms with Simultaneous Edge Artifact Removal</title>
            <pubDate>Mon, 06 Aug 2018 08:37:48 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2018/1403181/</link>
            <description>Two-dimensional discrete Fourier transform (DFT) is an extensively used and computationally intensive algorithm, with a plethora of applications. 2D images are, in general, nonperiodic but are assumed to be periodic while calculating their DFTs. This leads to cross-shaped artifacts in the frequency domain due to spectral leakage. These artifacts can have critical consequences if the DFTs are being used for further processing, specifically for biomedical applications. In this paper we present a novel FPGA-based solution to calculate 2D DFTs with simultaneous edge artifact removal for high-performance applications. Standard approaches for removing these artifacts, using apodization functions or mirroring, either involve removing critical frequencies or necessitate a surge in computation by significantly increasing the image size. We use a periodic plus smooth decomposition-based approach that was optimized to reduce DRAM access and to decrease 1D FFT invocations. 2D FFTs on FPGAs also suffer from the so-called “intermediate storage” or “memory wall” problem, which is due to limited on-chip memory, increasingly large image sizes, and strided column-wise external memory access. We propose a “tile-hopping” memory mapping scheme that significantly improves the bandwidth of the external memory for column-wise reads and can reduce the energy consumption up to . We tested our proposed optimizations on a PXIe-based Xilinx Kintex 7 FPGA system communicating with a host PC, which gives us the advantage of further expanding the design for biomedical applications such as electron microscopy and tomography. We demonstrate that our proposed optimizations can lead to  reduced FPGA and DRAM energy consumption when calculating high-throughput  2D FFTs with simultaneous edge artifact removal. We also used our high-performance 2D FFT implementation to accelerate filtered back-projection for reconstructing tomographic data.</description>
            <Author>Faisal Mahmood, M&#xe4;rt Toots, Lars-G&#xf6;ran &#xd6;fverstedt, and Ulf Skoglund</Author>
            <copyright>Copyright &#xa9; 2018 Faisal Mahmood et al. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>Modelling and Assertion-Based Verification of Run-Time Reconfigurable Designs Using Functional Programming Abstractions</title>
            <pubDate>Tue, 10 Jul 2018 00:00:00 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2018/3276159/</link>
            <description>With the increasing design and production costs and long time-to-market of Application Specific Integrated Circuits (ASICs), implementing digital circuits on reconfigurable hardware is becoming a more common practice. Reconfigurable hardware combines the flexibility of the software domain with the high performance of the hardware domain and provides flexible life cycle management for the product at a lower cost. This article proposes a complete design and assertion-based verification flow for Run-Time Reconfigurable (RTR) designs using the functional programming abstractions of Haskell, with partially reconfigurable hardware as the implementation platform. The proposed flow includes modelling of RTR designs at high levels of abstraction by using higher-order functions and polymorphism in Haskell, as well as their implementation on partially reconfigurable Field Programmable Gate Arrays (FPGAs). Assertion-based verification (ABV) is used as the verification approach and is integrated in the early stages of the design flow. Assertions can be used to verify specifications of designs in different verification methods such as simulation-based and formal verification. A partitioning algorithm is proposed for clustering the assertion-checker circuits so that the verification circuits can be implemented in a limited reconfigurable area in the target FPGA. The proposed flow is evaluated by using example designs on a Zynq FPGA as the hardware/software implementation platform.</description>
            <Author>Bahram N. Uchevler and Kjetil Svarstad</Author>
            <copyright>Copyright &#xa9; 2018 Bahram N. Uchevler and Kjetil Svarstad. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>Design of FPGA-Based Accelerator for Convolutional Neural Network under Heterogeneous Computing Framework with OpenCL</title>
            <pubDate>Mon, 02 Jul 2018 00:00:00 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2018/1785892/</link>
            <description>CPUs have insufficient resources to compute convolutional neural networks (CNNs) efficiently, especially in embedded applications. Therefore, heterogeneous computing platforms such as GPUs, FPGAs, and ASICs are widely used to accelerate CNN tasks. Among these, an FPGA can accelerate the computation by mapping the algorithm to parallel hardware, whereas a CPU cannot fully exploit the parallelism. By fully using the parallelism of the neural network&#x2019;s structure, an FPGA can reduce computing costs and increase computing speed. However, FPGA development requires considerable design skill. As a heterogeneous development platform, OpenCL offers advantages such as a high abstraction level, a short development cycle, and strong portability, which can make up for the lack of skilled designers. This paper uses Xilinx SDAccel to realize the parallel acceleration of a CNN task, and it also proposes an optimization strategy for a single convolutional layer to accelerate the CNN. Simulation results show that the calculation speed could be improved by adopting the proposed optimization strategy. Compared with the baseline design, the single convolutional layer strategy could increase the computing speed 14 times. Performance of the whole CNN task could be improved 2 times more than before, and the speed of image classification could reach more than 48 fps.</description>
            <Author>Li Luo, Yakun Wu, Fei Qiao, Yi Yang, Qi Wei, Xiaobo Zhou, Yongkai Fan, Shuzheng Xu, Xinjun Liu, and Huazhong Yang</Author>
            <copyright>Copyright &#xa9; 2018 Li Luo et al. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>Corrigendum to &#x201c;An Impulse-C Hardware Accelerator for Packet Classification Based on Fine/Coarse Grain Optimization&#x201d;</title>
            <pubDate>Thu, 21 Jun 2018 00:00:00 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2018/6075043/</link>
            <description></description>
            <Author>O. Ahmed, S. Areibi, R. Collier, and G. Grewal</Author>
            <copyright>Copyright &#xa9; 2018 O. Ahmed et al. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>Exploiting Partial Reconfiguration through PCIe for a Microphone Array Network Emulator</title>
            <pubDate>Wed, 02 May 2018 00:00:00 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2018/3214679/</link>
            <description>Current Microelectromechanical Systems (MEMS) technology enables the deployment of relatively low-cost wireless sensor networks composed of MEMS microphone arrays for accurate sound source localization. However, evaluating and selecting the most accurate and power-efficient network topology is not trivial when considering dynamic MEMS microphone arrays. Although software simulators are usually considered, they involve computationally intensive tasks that require hours to days to complete. In this paper, we present an FPGA-based platform to emulate a network of microphone arrays. Our platform provides a controlled simulated acoustic environment, able to evaluate the impact of different network configurations such as the number of microphones per array, the network&#x2019;s topology, or the detection method used. Data fusion techniques, combining the data collected by each node, are used in this platform. The platform is designed to exploit the FPGA&#x2019;s partial reconfiguration feature to increase the flexibility of the network emulator, and to increase performance thanks to the PCI-express high-bandwidth interface. On the one hand, the network emulator gains flexibility by partially reconfiguring the nodes&#x2019; architecture at run time. On the other hand, a set of strategies and heuristics for the proper use of partial reconfiguration accelerates the emulation by exploiting execution parallelism. Several experiments are presented to demonstrate some of the capabilities of our platform and the benefits of using partial reconfiguration.</description>
            <Author>Bruno da Silva, An Braeken, Federico Dom&#xed;nguez, and Abdellah Touhafi</Author>
            <copyright>Copyright &#xa9; 2018 Bruno da Silva et al. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>Toward the Implementation of an ASIC-Like System on FPGA for Real-Time Video Processing with Power Reduction</title>
            <pubDate>Sun, 22 Apr 2018 00:00:00 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2018/2843582/</link>
            <description>Driven by the importance of energy consumption as an evaluation factor in system-on-chip design, this paper presents a system-level design methodology to optimize power consumption on an ARM-based architecture for real-time video processing. The proposed design flow is based on the interaction between tool optimizations and user optimizations. The tool optimizations are the options and best practices available in the integrated design environment for the Xilinx technology and the target Zynq-7000 architecture. The user optimizations are methods proposed by the user to optimize power consumption. For these, we used the principles of voltage scaling and frequency scaling, two techniques that allow energy to be consumed in proportion to the work to be done. The suggested flow is applied to a real-time video processing system. The results show power savings of up to 60&#x25; while respecting performance and real-time constraints.</description>
            <Author>Lilia Kechiche, Lamjed Touil, and Bouraoui Ouni</Author>
            <copyright>Copyright &#xa9; 2018 Lilia Kechiche et al. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>RP-Ring: A Heterogeneous Multi-FPGA Accelerator</title>
            <pubDate>Wed, 04 Apr 2018 00:00:00 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2018/6784319/</link>
            <description>To reduce the cost of designing new specialized FPGA boards for a direct-summation MOND (Modified Newtonian Dynamics) simulator, we propose a new heterogeneous architecture built from existing FPGA boards, called RP-ring (reconfigurable processor ring). This design can be expanded conveniently with any available FPGA board and requires only low communication bandwidth between FPGA boards. The communication protocol is simple and can be implemented with limited hardware/software resources. To avoid an overall performance loss caused by the slowest board, we build a mathematical model to divide the workload among the FPGAs. The division of the workload is based on the logic resources, memory access bandwidth, and communication bandwidth of each FPGA chip. Our accelerator can achieve a speedup of two orders of magnitude compared with a CPU implementation.</description>
            <Author>Shuaizhi Guo, Tianqi Wang, Linfeng Tao, Teng Tian, Zikun Xiang, and Xi Jin</Author>
            <copyright>Copyright &#xa9; 2018 Shuaizhi Guo et al. All rights reserved.</copyright>
           
          
        </item>
        <item>
            <title>Corrigendum to &#x201c;Hardware Accelerators Targeting a Novel Group Based Packet Classification Algorithm&#x201d;</title>
            <pubDate>Sun, 01 Apr 2018 00:00:00 +0000</pubDate>
            <link>https://www.hindawi.com/journals/ijrc/2018/3489169/</link>
            <description></description>
            <Author>O. Ahmed, S. Areibi, and G. Grewal</Author>
            <copyright>Copyright &#xa9; 2018 O. Ahmed et al. All rights reserved.</copyright>
           
          
        </item>
 </channel>
</rss>