VLSI Design http://www.hindawi.com The latest articles from Hindawi Publishing Corporation © 2013, Hindawi Publishing Corporation. All rights reserved. A Prototype-Based Gate-Level Cycle-Accurate Methodology for SoC Performance Exploration and Estimation Thu, 16 May 2013 12:08:03 +0000 http://www.hindawi.com/journals/vlsi/2013/529150/ A prototype-based SoC performance estimation methodology was proposed for consumer electronics design. Traditionally, prototypes are used for system verification before SoC tapeout, without accurate SoC performance exploration and estimation. This paper attempted to carefully model the SoC prototype as a performance estimator and explore the environment of SoC performance. The prototype met the gate-level cycle-accurate requirement, which covered the effects of the embedded processor, on-chip bus structure, IP design, embedded OS, GUI systems, and application programs. The prototype configuration, chip post-layout simulation results, and the measured parameters of SoC prototypes were merged to model a target SoC design. The system performance was examined according to the proposed estimation models, the profiling results of the application programs ported to the prototypes, and the timing parameters from the post-layout simulation of the target SoC. The experimental results showed that the proposed method had an average error of only 2.08% for an MPEG-4 decoder SoC at simple profile level 2 specifications. Ching-Lung Su, Tse-Min Chen, and Kuo-Hsuan Wu Copyright © 2013 Ching-Lung Su et al. All rights reserved. LDPC Decoder with an Adaptive Wordwidth Datapath for Energy and BER Co-Optimization Thu, 09 May 2013 16:24:44 +0000 http://www.hindawi.com/journals/vlsi/2013/913018/ An energy efficient low-density parity-check (LDPC) decoder using an adaptive wordwidth datapath is presented. The decoder switches between a Normal Mode and a reduced wordwidth Low Power Mode.
Signal toggling is reduced as variable node processing inputs change in fewer bits. The duration of time that the decoder stays in a given mode is optimized for power and BER requirements and the received SNR. The paper explores different Low Power Mode algorithms to reduce the wordwidth and their implementations. Analysis of the BER performance and power consumption from fixed-point numerical and post-layout power simulations, respectively, is presented for a fully parallel 10GBASE-T LDPC decoder in 65 nm CMOS. A 5.10 mm2 low power decoder implementation achieves 85.7 Gbps while operating at 185 MHz and dissipates 16.4 pJ/bit at 1.3 V with early termination. At 0.6 V the decoder throughput is 9.3 Gbps (greater than 6.4 Gbps required for 10GBASE-T) while dissipating an average power of 31 mW. This is 4.6 lower than the state of the art reported power with an SNR loss of 0.35 dB at . Tinoosh Mohsenin, Houshmand Shirani-mehr, and Bevan M. Baas Copyright © 2013 Tinoosh Mohsenin et al. All rights reserved. High-Accuracy Programmable Timing Generator with Wide-Range Tuning Capability Thu, 09 May 2013 15:53:13 +0000 http://www.hindawi.com/journals/vlsi/2013/803616/ In this paper, a high-accuracy programmable timing generator with wide-range tuning capability is proposed. With the aid of a dual delay-locked loop (DLL), both the coarse- and fine-tuning mechanisms operate in a precise closed-loop scheme to lessen the effects of ambient variations. The timing generator can provide sub-gate resolution and instantaneous switching capability. The circuit is implemented and simulated in TSMC 0.18 μm 1P6M technology. The test chip occupies 1.9 mm2. The reference clock cycle can be divided into 128 bins by interpolation to obtain 14 ps resolution with the clock rate at 550 MHz. The INL and DNL are within −0.21~+0.78 and −0.27~+0.43 LSB, respectively. Ting-Li Chu, Sin-Hong Yu, and Chorng-Sii Hwang Copyright © 2013 Ting-Li Chu et al. All rights reserved.
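The 14 ps resolution quoted in the timing generator abstract above can be checked from the two figures it gives: a 550 MHz reference clock whose cycle is divided into 128 bins. The short sketch below only redoes that arithmetic; the function name `interpolation_resolution_ps` is a hypothetical helper and is not taken from the paper.

```python
def interpolation_resolution_ps(clock_hz: float, bins: int) -> float:
    """Width of one timing bin in picoseconds (hypothetical helper).

    One clock period is 1 / clock_hz seconds; dividing it into `bins`
    equal slices gives the nominal step of the timing generator.
    """
    period_ps = 1e12 / clock_hz  # clock period converted to picoseconds
    return period_ps / bins


# Figures from the abstract: 550 MHz reference clock, 128 bins per cycle.
resolution = interpolation_resolution_ps(550e6, 128)
print(f"{resolution:.1f} ps per bin")
```

The period of a 550 MHz clock is about 1818 ps, so each of the 128 bins is roughly 14.2 ps wide, consistent with the stated 14 ps resolution.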
A 0.6-V to 1-V Audio Modulator in 65 nm CMOS with 90.2 dB SNDR at 0.6-V Wed, 08 May 2013 17:27:02 +0000 http://www.hindawi.com/journals/vlsi/2013/353080/ This paper presents a discrete time, single loop, third order modulator. The input feed forward technique combined with 5-bit quantizer is adopted to suppress swings of integrators. Harmonic distortions as well as the noise mixture due to the nonlinear amplifier gain are prevented. The design of amplifiers is hence relaxed. To reduce the area and power cost of the 5-bit quantizer, the successive approximation quantizer with only a single comparator instead of traditional flash quantizer is employed. Fabricated in 65 nm CMOS, the modulator achieves 95 dB peak SNDR at 1-V supply with 24 kHz. Thanks to low swing circuit techniques and low threshold voltages of devices, the peak SNDR maintains 90.2 dB under 0.6-V low supply. The total power dissipation is 371 μW at 1-V and drops to only 133 μW at 0.6-V. Liyuan Liu, Dongmei Li, and Zhihua Wang Copyright © 2013 Liyuan Liu et al. All rights reserved. Fast and Near-Optimal Timing-Driven Cell Sizing under Cell Area and Leakage Power Constraints Using a Simplified Discrete Network Flow Algorithm Tue, 07 May 2013 15:45:05 +0000 http://www.hindawi.com/journals/vlsi/2013/474601/ We propose a timing-driven discrete cell-sizing algorithm that can address total cell size and/or leakage power constraints. We model cell sizing as a “discretized” mincost network flow problem, wherein available sizes of each cell are modeled as nodes. Flow passing through a node indicates the choice of the corresponding cell size, and the total flow cost reflects the timing objective function value corresponding to these choices. Compared to other discrete optimization methods for cell sizing, our method can obtain near-optimal solutions in a time-efficient manner. 
We tested our algorithm on ISCAS’85 benchmarks, and compared our results to those produced by an optimal dynamic programming- (DP-) based method. The results show that compared to the optimal method, the improvement to an initial sizing solution obtained by our method is only 1% (3%) worse when using a 180 nm (90 nm) library, while being 40–60 times faster. We also obtained results for ISPD’12 cell-sizing benchmarks, under a leakage power constraint, and compared them to those of a state-of-the-art approximate DP method (optimal DP runs out of memory for the smallest of these circuits). Our results show that we are only 0.9% worse than the approximate DP method, while being more than twice as fast. Huan Ren and Shantanu Dutt Copyright © 2013 Huan Ren and Shantanu Dutt. All rights reserved. A High-Efficiency Monolithic DC-DC PFM Boost Converter with Parallel Power MOS Technique Thu, 02 May 2013 13:24:21 +0000 http://www.hindawi.com/journals/vlsi/2013/643293/ This paper presents a high-efficiency monolithic dc-dc PFM boost converter designed with a standard TSMC 3.3/5 V 0.35 μm CMOS technology. The proposed boost converter combines the parallel power MOS technique with the pulse-frequency modulation (PFM) technique to achieve high efficiency over a wide load current range, extending battery life and reducing the cost of portable systems. The proposed parallel power MOS controller and load current detector determine the size of the power MOS to increase power conversion efficiency at different loads. Postlayout simulation results of the designed circuit show that the power conversion efficiency is 74.9–90.7% over a load range from 1 mA to 420 mA with a 1.5 V supply. Moreover, the proposed boost converter has a smaller area and lower cost than existing boost converter circuits. Hou-Ming Chen, Robert C. Chang, and Kuang-Hao Lin Copyright © 2013 Hou-Ming Chen et al. All rights reserved.
Low Complexity Submatrix Divided MMSE Sparse-SQRD Detection for MIMO-OFDM with ESPAR Antenna Receiver Tue, 30 Apr 2013 09:53:15 +0000 http://www.hindawi.com/journals/vlsi/2013/206909/ Multiple input multiple output-orthogonal frequency division multiplexing (MIMO-OFDM) with an electronically steerable passive array radiator (ESPAR) antenna receiver can improve the bit error rate performance and obtain additional diversity gain without increasing the number of radio frequency (RF) front-end circuits. However, due to the large size of the channel matrix, the computational cost required for the detection process using Vertical-Bell Laboratories Layered Space-Time (V-BLAST) detection is too high to be implemented. Using the minimum mean square error sparse-sorted QR decomposition (MMSE sparse-SQRD) algorithm for the detection process, the average computational cost can be considerably reduced but is still higher than that of a conventional MIMO-OFDM system without an ESPAR antenna receiver. In this paper, we propose to use a low complexity submatrix divided MMSE sparse-SQRD algorithm for the detection process of MIMO-OFDM with an ESPAR antenna receiver. The computational cost analysis and simulation results show that on average the proposed scheme can further reduce the computational cost and achieve a complexity comparable to the conventional MIMO-OFDM detection schemes. Diego Javier Reinoso Chisaguano and Minoru Okada Copyright © 2013 Diego Javier Reinoso Chisaguano and Minoru Okada. All rights reserved. Verification of Mixed-Signal Systems with Affine Arithmetic Assertions Sun, 28 Apr 2013 17:16:45 +0000 http://www.hindawi.com/journals/vlsi/2013/239064/ Embedded systems include an increasing share of analog/mixed-signal components that are tightly interwoven with the functionality of digital HW/SW systems. A challenge for verification is that even small deviations in analog components can lead to significant changes in system properties.
In this paper we propose the combination of range-based, semisymbolic simulation with assertion checking. We show that this approach combines the advantages, as well as some limitations, of multirun simulations and formal techniques. The efficiency of the proposed method is demonstrated by several examples. Carna Radojicic, Christoph Grimm, Florian Schupfer, and Michael Rathmair Copyright © 2013 Carna Radojicic et al. All rights reserved. Design of Low Power Multiplier with Energy Efficient Full Adder Using DPTAAL Thu, 21 Mar 2013 11:10:18 +0000 http://www.hindawi.com/journals/vlsi/2013/157872/ Asynchronous adiabatic logic (AAL) is a novel low-power design technique which combines the energy saving benefits of asynchronous systems with adiabatic benefits. In this paper, an energy efficient full adder using double pass transistor with asynchronous adiabatic logic (DPTAAL) is used to design a low power multiplier. Asynchronous adiabatic circuits are very low power circuits that preserve energy for reuse, which reduces the amount of energy drawn directly from the power supply. In this work, a multiplier using DPTAAL is designed and simulated, which exhibits low power and reliable logical operations. To improve the circuit performance at reduced voltage level, double pass transistor logic (DPL) is introduced. The power results of the proposed multiplier design are compared with the conventional CMOS implementation. Simulation results show significant improvement in power for clock rates ranging from 100 MHz to 300 MHz. A. Kishore Kumar, D. Somasundareswari, V. Duraisamy, and T. Shunbaga Pradeepa Copyright © 2013 A. Kishore Kumar et al. All rights reserved.
A High-Speed and Low-Energy-Consumption Processor for SVD-MIMO-OFDM Systems Mon, 18 Mar 2013 12:29:16 +0000 http://www.hindawi.com/journals/vlsi/2013/625019/ A processor design for singular value decomposition (SVD) and compression/decompression of feedback matrices, which are mandatory operations for SVD multiple-input multiple-output orthogonal frequency-division multiplexing (MIMO-OFDM) systems, is proposed and evaluated. SVD-MIMO is a transmission method for suppressing multistream interference and improving communication quality by beamforming. An application specific instruction-set processor (ASIP) architecture is adopted to achieve flexibility in terms of operations and matrix size. The proposed processor realizes a high-speed/low-power design and real-time processing by the parallelization of floating-point units (FPUs) and arithmetic instructions specialized in complex matrix operations. Hiroki Iwaizumi, Shingo Yoshizawa, and Yoshikazu Miyanaga Copyright © 2013 Hiroki Iwaizumi et al. All rights reserved. Discrete Wavelet Transform on Color Picture Interpolation of Digital Still Camera Tue, 26 Feb 2013 18:42:20 +0000 http://www.hindawi.com/journals/vlsi/2013/738057/ Many people use digital still cameras to take photographs in contemporary society. Significant amounts of digital information have led to the emergence of a digital era. Because of the small size and low cost of the product hardware, most image sensors use a color filter array to obtain image information. However, employing a color filter array results in the loss of image information; thus, a color interpolation technique must be employed to retrieve the original picture. Numerous researchers have developed interpolation algorithms in response to various image problems. The method proposed in this study involves integrating discrete wavelet transform (DWT) into the interpolation algorithm. 
The method was developed based on edge weight and partial gain characteristics and uses the basic wavelet function to enhance the edge performance and processes of the nearest or larger and smaller direction gradients. The experiment results were compared to those of other methods to verify that the proposed method can improve image quality. Yu-Cheng Fan and Yi-Feng Chiang Copyright © 2013 Yu-Cheng Fan and Yi-Feng Chiang. All rights reserved. A General Design Methodology for Synchronous Early-Completion-Prediction Adders in Nano-CMOS DSP Architectures Thu, 17 Jan 2013 18:38:41 +0000 http://www.hindawi.com/journals/vlsi/2013/785281/ Synchronous early-completion-prediction adders (ECPAs) are used for high clock rate and high-precision DSP datapaths, as they allow a dominant amount of single-cycle operations even if the worst-case carry propagation delay is longer than the clock period. Previous works have also demonstrated ECPA advantages for average leakage reduction and NBTI effects reduction in nanoscale CMOS technologies. This paper illustrates a general systematic methodology to design ECPA units, targeting nanoscale CMOS technologies, which is not available in the current literature yet. The method is fully compatible with standard VLSI macrocell design tools and standard adder structures and includes automatic definition of critical test patterns for postlayout verification. A design example is included, reporting speed and power data superior to previous works. Mauro Olivieri and Antonio Mastrandrea Copyright © 2013 Mauro Olivieri and Antonio Mastrandrea. All rights reserved. Energy-Efficient Hardware Architectures for the Packet Data Convergence Protocol in LTE-Advanced Mobile Terminals Tue, 15 Jan 2013 15:32:49 +0000 http://www.hindawi.com/journals/vlsi/2013/369627/ In this paper, we present and compare efficient low-power hardware architectures for accelerating the Packet Data Convergence Protocol (PDCP) in LTE and LTE-Advanced mobile terminals. 
Specifically, our work proposes the design of two cores: a crypto engine for the Evolved Packet System Encryption Algorithm (128-EEA2) that is based on the AES cipher and a coprocessor for the Least Significant Bit (LSB) encoding mechanism of the Robust Header Compression (ROHC) algorithm. With respect to the former, first we propose a reference architecture, which reflects a basic implementation of the algorithm, then we identify area and power bottle-necks in the design and finally we introduce and compare several architectures targeting the most power-consuming operations. With respect to the LSB coprocessor, we propose a novel implementation based on a one-hot encoding, thereby reducing hardware’s logic switching rate. Architectural hardware analysis is performed using Faraday’s 90 nm standard-cell library. The obtained results, when compared against the reference architecture, show that these novel architectures achieve significant improvements, namely, 25% in area and 35% in power consumption for the 128-EEA2 crypto-core, and even more important reductions are seen for the LSB coprocessor, that is, 36% in area and 50% in power consumption. Shadi Traboulsi, Valerio Frascolla, Nils Pohl, Josef Hausner, and Attila Bilgic Copyright © 2013 Shadi Traboulsi et al. All rights reserved. A Graph-Based Approach to Optimal Scan Chain Stitching Using RTL Design Descriptions Thu, 20 Dec 2012 17:31:57 +0000 http://www.hindawi.com/journals/vlsi/2012/312808/ The scan chain insertion problem is one of the mandatory logic insertion design tasks. The scanning of designs is a very efficient way of improving their testability. But it does impact size and performance, depending on the stitching ordering of the scan chain. In this paper, we propose a graph-based approach to a stitching algorithm for automatic and optimal scan chain insertion at the RTL. Our method is divided into two main steps. 
The first one builds graph models for inferring logical proximity information from the design, and then the second one uses classic approximation algorithms for the traveling salesman problem to determine the best scan-stitching ordering. We show how this algorithm allows the decrease of the cost of both scan analysis and implementation, by measuring total wirelength on placed and routed benchmark designs, both academic and industrial. Lilia Zaourar, Yann Kieffer, and Chouki Aktouf Copyright © 2012 Lilia Zaourar et al. All rights reserved. Absolute Difference and Low-Power Bus Encoding Method for LCD Digital Display Interfaces Thu, 06 Dec 2012 14:23:40 +0000 http://www.hindawi.com/journals/vlsi/2012/657897/ Power dissipation has been an inevitable problem of LCD systems for years. To ease the problem, many encoding methods have been developed, such as the methods of transition minimized differential signaling, the most popular one in use for DVI to date, chromatic encoding, and limited intraword transition. In this paper, the authors present the absolute difference and low-power encoding method for the serial transmission of LCD digital DVI display interface. In regard to the LCD digital display interface with UMC 90 nm technology, the proposed method minimizes the architectural complexity and reduces the power dissipation by about 67% and 12%, respectively, compared with the transition minimized differential signaling and limited intraword transition. In short, the proposed method is an efficient bus encoding method to largely decrease the dynamic and total power dissipation of the LCD digital display interfaces. Chia-Hao Fang, I-tao Lung, and Chih-Peng Fan Copyright © 2012 Chia-Hao Fang et al. All rights reserved. A ±6 ms-Accuracy, 0.68 mm2, and 2.21 μW QRS Detection ASIC Thu, 22 Nov 2012 16:09:52 +0000 http://www.hindawi.com/journals/vlsi/2012/809393/ Healthcare issues arose from population aging. 
Meanwhile, the electrocardiogram (ECG) is a powerful measurement tool. The first step of ECG analysis is to detect QRS complexes. A state-of-the-art QRS detection algorithm was modified and implemented as an application-specific integrated circuit (ASIC). With a dedicated architecture design, the proposed ASIC has a 0.68 mm2 core area and 2.21 μW power consumption. It is the smallest QRS detection ASIC based on 0.18 μm technology. In addition, the sensitivity is 95.65% and the positive prediction of the ASIC is 99.36% based on the MIT/BIH arrhythmia database certification. Sheng-Chieh Huang, Hui-Min Wang, and Wei-Yu Chen Copyright © 2012 Sheng-Chieh Huang et al. All rights reserved. A Novel Framework for Applying Multiobjective GA and PSO Based Approaches for Simultaneous Area, Delay, and Power Optimization in High Level Synthesis of Datapaths Sun, 18 Nov 2012 07:47:05 +0000 http://www.hindawi.com/journals/vlsi/2012/273276/ High-Level Synthesis deals with the translation of algorithmic descriptions into an RTL implementation. It is highly multi-objective in nature, necessitating trade-offs between mutually conflicting objectives such as area, power and delay. Thus design space exploration is integral to the High Level Synthesis process for early assessment of the impact of these trade-offs. We propose a methodology for multi-objective optimization of Area, Power and Delay during High Level Synthesis of data paths from Data Flow Graphs (DFGs). The technique performs scheduling and allocation of functional units and registers concurrently. A novel metric-based technique is incorporated into the algorithm to estimate the likelihood that a schedule yields low-power solutions. A true multi-objective evolutionary technique, “Nondominated Sorting Genetic Algorithm II” (NSGA II), is used in this work. Results on standard DFG benchmarks indicate that the NSGA II based approach is much faster than a weighted sum GA approach.
It also yields superior solutions in terms of diversity and closeness to the true Pareto front. In addition, a framework for applying another evolutionary technique, Weighted Sum Particle Swarm Optimization (WSPSO), is also reported. It is observed that compared to WSGA, WSPSO shows considerable improvement in execution time with comparable solution quality. D. S. Harish Ram, M. C. Bhuvaneswari, and Shanthi S. Prabhu Copyright © 2012 D. S. Harish Ram et al. All rights reserved. Line Search-Based Inverse Lithography Technique for Mask Design Mon, 15 Oct 2012 15:01:36 +0000 http://www.hindawi.com/journals/vlsi/2012/589128/ As feature size is much smaller than the wavelength of the illumination source of lithography equipment, resolution enhancement technology (RET) has been increasingly relied upon to minimize image distortions. In advanced process nodes, pixelated masks become essential for RET to achieve an acceptable resolution. In this paper, we investigate the problem of pixelated binary mask design in a partially coherent imaging system. Similar to previous approaches, the mask design problem is formulated as a nonlinear program and is solved by gradient-based search. Our contributions are four novel techniques to achieve significantly better image quality. First, to transform the original bound-constrained formulation to an unconstrained optimization problem, we propose a new noncyclic transformation of mask variables to replace the well-known cyclic one. As our transformation is monotonic, it enables better control in flipping pixels. Second, based on this new transformation, we propose a highly efficient line search-based heuristic technique to solve the resulting unconstrained optimization. Third, to simplify the optimization, instead of using a discretization regularization penalty technique, we directly round the optimized gray mask into a binary mask for pattern error evaluation.
Fourth, we introduce a jump technique in order to jump out of a local minimum and continue the search. Xin Zhao and Chris Chu Copyright © 2012 Xin Zhao and Chris Chu. All rights reserved. Digital Noise Generator Design Using Inverted 1D Tent Chaotic Map Thu, 04 Oct 2012 17:44:55 +0000 http://www.hindawi.com/journals/vlsi/2012/849120/ This paper shows a digital noise generator designed in FPGA, based on a variant of the one-dimensional (1D) chaotic tent map (T-1D). The T-1D map is a piecewise linear 1D chaotic map that defines the statistical behavior of the generated sequences using its control parameter. In this way, the proposed noise generator is a highly competitive alternative in cryptographic systems when the statistical behavior of the sequences is closer to the uniform statistical distribution. The proposed system uses the inverted tent chaotic map (IT-1D), which has the same statistical behavior as the T-1D map. The fundamental algorithm used in this system was developed based on a 64-bit double precision format according to the numerical representation of floating point numbers defined in the IEEE-754 standard. The proposed system is analyzed using mechanical statistic tools and some statistical tests defined in the NIST 800-22SP (USA) standard. The main contribution of this work is the possibility of generating binary sequences of pseudorandom appearance by a procedure implemented in an FPGA device that translates real numbers to natural numbers, preserving the statistical properties of sequences of real numbers that can be generated with the tent chaotic map in its original definition domain. Leonardo Palacios-Luengas, Gonzalo Isaac Duchen-Sánchez, José Luis Aragón-Vera, and Rubén Vázquez-Medina Copyright © 2012 Leonardo Palacios-Luengas et al. All rights reserved.
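To make the tent-map idea in the noise generator abstract above concrete, here is a minimal floating-point sketch. Several things in it are assumptions rather than details from the paper: the inverted map is taken to be the ordinary tent map flipped vertically (1 minus the tent value), bits are extracted by thresholding the state at 0.5, and the names `tent`, `inverted_tent`, and `noise_bits` are hypothetical. The paper's FPGA procedure works on natural numbers, which this sketch does not attempt to reproduce.

```python
def tent(x: float, mu: float) -> float:
    """Classic tent map on [0, 1]: slope +mu below 0.5, slope -mu above."""
    return mu * x if x < 0.5 else mu * (1.0 - x)


def inverted_tent(x: float, mu: float) -> float:
    """Assumed inverted form: the tent map flipped vertically (1 - tent)."""
    return 1.0 - tent(x, mu)


def noise_bits(x0: float, mu: float, count: int) -> list[int]:
    """Iterate the inverted map; emit 1 when the new state is >= 0.5, else 0.

    The 0.5 threshold is an illustrative choice, not the paper's method.
    """
    bits, x = [], x0
    for _ in range(count):
        x = inverted_tent(x, mu)
        bits.append(1 if x >= 0.5 else 0)
    return bits


# mu is kept slightly below 2: with mu exactly 2, binary floating-point
# orbits of the tent map tend to collapse onto a fixed point quickly.
bits = noise_bits(x0=0.3, mu=1.9999, count=64)
print("".join(map(str, bits)))
```

A sequence produced this way would still need to pass a statistical test suite before being considered for any cryptographic use; the sketch only shows the iteration itself.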
VLSI Circuits, Systems, and Architectures for Advanced Image and Video Compression Standards Tue, 18 Sep 2012 11:25:34 +0000 http://www.hindawi.com/journals/vlsi/2012/102585/ Maurizio Martina, Muhammad Shafique, and Andrey Norkin Copyright © 2012 Maurizio Martina et al. All rights reserved. Design of an All-Digital Synchronized Frequency Multiplier Based on a Dual-Loop (D/FLL) Architecture Tue, 18 Sep 2012 08:17:25 +0000 http://www.hindawi.com/journals/vlsi/2012/546212/ This paper presents a new architecture for a synchronized frequency multiplier circuit. The proposed architecture is an all-digital dual-loop delay- and frequency-locked loops circuit, which has several advantages, namely, it does not have the jitter accumulation issue that is normally encountered in PLL and can be adapted easily for different FPGA families as well as implemented as an integrated circuit. Moreover, it can be used in supplying a clock reference for distributed digital processing systems as well as intra/interchip communication in system-on-chip (SoC). The proposed architecture is designed using the Verilog language and synthesized for the Altera DE2-70 development board. The experimental results validate the expected phase tracking as well as the synthesizing properties. For the measurement and validation purpose, an input reference signal in the range of 1.94–2.62 MHz was injected; the generated clock signal has a higher frequency, and it is in the range of 124.2–167.9 MHz with a frequency step (i.e., resolution) of 0.168 MHz. The synthesized design requires 330 logic elements using the above Altera board. Maher Assaad and Mohammed H. Alser Copyright © 2012 Maher Assaad and Mohammed H. Alser. All rights reserved. Flexible Radio Design: Trends and Challenges in Digital Baseband Implementation Tue, 11 Sep 2012 14:00:45 +0000 http://www.hindawi.com/journals/vlsi/2012/549768/ Guido Masera, Amer Baghdadi, Frank Kienle, and Christophe Moy Copyright © 2012 Guido Masera et al. 
All rights reserved. Design Space Exploration of Deeply Nested Loop 2D Filtering and 6 Level FSBM Algorithm Mapped onto Systolic Array Wed, 29 Aug 2012 14:12:53 +0000 http://www.hindawi.com/journals/vlsi/2012/268402/ The high integration density in today's VLSI chips offers enormous computing power to be utilized by the design of parallel computing hardware. The implementation of computationally intensive algorithms represented by 𝑛-dimensional (𝑛-D) nested loop algorithms, onto parallel array architecture is termed as mapping. The methodologies adopted for mapping these algorithms onto parallel hardware often use heuristic search that requires a lot of computational effort to obtain near optimal solutions. We propose a new mapping procedure wherein a lower dimensional subspace (of the 𝑛-D problem space) of inner loop is identified, in which lies the computational expression that generates the output or outputs of the 𝑛-D problem. The processing elements (PE array) are assigned to the identified sub-space and the reuse of the PE array is through the assignment of the PE array to the successive sub-spaces in consecutive clock cycles/periods (CPs) to complete the computational tasks of the 𝑛-D problem. The above is used to develop our proposed modified heuristic search to arrive at optimal design and the complexity comparisons are given. The MATLAB results of the new search and the design space trade-off analysis using the high-level synthesis tool are presented for two typical computationally intensive nested loop algorithms—the 6D FSBM and the 4D edge detection alternatively known as the 2D filtering algorithm. B. Bala Tripura Sundari Copyright © 2012 B. Bala Tripura Sundari. All rights reserved. 
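The "6 level" structure of FSBM (full-search block matching) mentioned in the abstract above is easiest to see as six nested loops: two over block positions, two over candidate displacements, and two over the pixels of a block. The sketch below is a plain software reference of that loop nest using the sum of absolute differences (SAD) as the matching cost; it is not the paper's systolic-array mapping, and the function name `fsbm` and its parameters (`n` for block size, `p` for search range) are hypothetical.

```python
def fsbm(cur, ref, n, p):
    """Return {(by, bx): (dy, dx)}: the displacement with minimum SAD
    for every n x n block of `cur`, searched within +/- p pixels in `ref`."""
    h, w = len(cur), len(cur[0])
    vectors = {}
    for by in range(0, h - n + 1, n):              # loop 1: block row
        for bx in range(0, w - n + 1, n):          # loop 2: block column
            best = None
            for dy in range(-p, p + 1):            # loop 3: vertical shift
                for dx in range(-p, p + 1):        # loop 4: horizontal shift
                    y0, x0 = by + dy, bx + dx
                    # Skip candidates that fall outside the reference frame.
                    if y0 < 0 or x0 < 0 or y0 + n > h or x0 + n > w:
                        continue
                    sad = 0
                    for i in range(n):             # loop 5: pixel row
                        for j in range(n):         # loop 6: pixel column
                            sad += abs(cur[by + i][bx + j] - ref[y0 + i][x0 + j])
                    if best is None or sad < best[0]:
                        best = (sad, dy, dx)
            vectors[(by, bx)] = (best[1], best[2])
    return vectors


# Tiny synthetic frames: a ramp image, and the same ramp with every pixel
# raised by 9, which equals the ramp sampled one row down and one column right.
ref = [[y * 8 + x for x in range(8)] for y in range(8)]
cur = [[v + 9 for v in row] for row in ref]
vectors = fsbm(cur, ref, n=4, p=2)
print(vectors[(0, 0)])
```

The two innermost loops form the lower-dimensional inner computation that produces one output (a SAD value), which is the kind of subspace a processing-element array could be assigned to; how the paper actually partitions the loops is not specified in the abstract.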
Automatic Generation of Optimized and Synthesizable Hardware Implementation from High-Level Dataflow Programs Thu, 16 Aug 2012 11:33:05 +0000 http://www.hindawi.com/journals/vlsi/2012/298396/ In this paper, we introduce the Reconfigurable Video Coding (RVC) standard based on the idea that video processing algorithms can be defined as a library of components that can be updated and standardized separately. MPEG RVC framework aims at providing a unified high-level specification of current MPEG coding technologies using a dataflow language called Cal Actor Language (CAL). CAL is associated with a set of tools to design dataflow applications and to generate hardware and software implementations. Before this work, the existing CAL hardware compilers did not support high-level features of the CAL. After presenting the main notions of the RVC standard, this paper introduces an automatic transformation process that analyses the non-compliant features and makes the required changes in the intermediate representation of the compiler while keeping the same behavior. Finally, the implementation results of the transformation on video and still image decoders are summarized. We show that the obtained results can largely satisfy the real time constraints for an embedded design on FPGA as we obtain a throughput of 73 FPS for MPEG 4 decoder and 34 FPS for coding and decoding process of the LAR coder using a video of CIF image size. This work resolves the main limitation of hardware generation from CAL designs. Khaled Jerbi, Mickaël Raulet, Olivier Déforges, and Mohamed Abid Copyright © 2012 Khaled Jerbi et al. All rights reserved. 
Homogeneous and Heterogeneous MPSoC Architectures with Network-On-Chip Connectivity for Low-Power and Real-Time Multimedia Signal Processing Tue, 14 Aug 2012 17:50:01 +0000 http://www.hindawi.com/journals/vlsi/2012/450302/ Two multiprocessor system-on-chip (MPSoC) architectures are proposed and compared in the paper with reference to audio and video processing applications. One architecture exploits a homogeneous topology; it consists of 8 identical tiles, each made of a 32-bit RISC core enhanced by a 64-bit DSP coprocessor with local memory. The other MPSoC architecture exploits a heterogeneous-tile topology with on-chip distributed memory resources; the tiles act as application specific processors supporting a different class of algorithms. In both architectures, the multiple tiles are interconnected by a network-on-chip (NoC) infrastructure, through network interfaces and routers, which allows parallel operations of the multiple tiles. The functional performances and the implementation complexity of the NoC-based MPSoC architectures are assessed by synthesis results in submicron CMOS technology. Among the large set of supported algorithms, two case studies are considered: the real-time implementation of an H.264/MPEG AVC video codec and of a low-distortion digital audio amplifier. The heterogeneous architecture ensures a higher power efficiency and a smaller area occupation and is more suited for low-power multimedia processing, such as in mobile devices. The homogeneous scheme allows for a higher flexibility and easier system scalability and is more suited for general-purpose DSP tasks in power-supplied devices. Sergio Saponara and Luca Fanucci Copyright © 2012 Sergio Saponara and Luca Fanucci. All rights reserved. 
FastRoute: An Efficient and High-Quality Global Router Thu, 09 Aug 2012 13:07:28 +0000 http://www.hindawi.com/journals/vlsi/2012/608362/ Modern large-scale circuit designs have created great demand for fast and high-quality global routing algorithms to resolve the routing congestion at the global level. The rip-up and reroute scheme, which iteratively resolves the congestion by recreating the routing path based on current congestion, is employed by the majority of academic and industrial global routers today. This method has proved to be the most practical routing framework. However, the traditional iterative maze routing technique converges very slowly and easily gets stuck at locally optimal solutions. In this work, we propose a very efficient and high-quality global router, FastRoute. FastRoute integrates several novel techniques: fast congestion-driven via-aware Steiner tree construction, 3-bend routing, virtual capacity adjustment, multi-source multi-sink maze routing, and spiral layer assignment. These techniques not only address the routing congestion measured at the edges of global routing grids but also minimize the total wirelength and via usage, which is critical for subsequent detailed routing, yield, and manufacturability. Experimental results show that FastRoute is highly effective and efficient at solving the ISPD07 and ISPD08 global routing benchmark suites. The results outperform recently published academic global routers in both routability and runtime. In particular, for ISPD07 and ISPD08 global routing benchmarks, FastRoute generates 12 congestion-free solutions out of 16 benchmarks with a speed significantly faster than other routers. Min Pan, Yue Xu, Yanheng Zhang, and Chris Chu Copyright © 2012 Min Pan et al. All rights reserved. 𝑁 Point DCT VLSI Architecture for Emerging HEVC Standard Wed, 08 Aug 2012 09:22:03 +0000 http://www.hindawi.com/journals/vlsi/2012/752024/ This work presents a flexible VLSI architecture to compute the 𝑁-point DCT.
Since HEVC supports different block sizes for the computation of the DCT, that is, 4×4 up to 32×32, the design of a flexible architecture to support them helps reduce the area overhead of hardware implementations. The hardware proposed in this work is partially folded to save area and to achieve the speed needed for large video sequence sizes. The proposed architecture relies on the decomposition of the DCT matrices into sparse submatrices in order to reduce the number of multiplications. Finally, multiplications are completely eliminated using the lifting scheme. The proposed architecture sustains real-time processing for a 1080P HD video codec running at 150 MHz. Ashfaq Ahmed, Muhammad Usman Shahid, and Ata ur Rehman Copyright © 2012 Ashfaq Ahmed et al. All rights reserved. A Systematic Methodology for Reliability Improvements on SoC-Based Software Defined Radio Systems Tue, 17 Jul 2012 09:58:02 +0000 http://www.hindawi.com/journals/vlsi/2012/784945/ Shrinking silicon technologies and increasing logic densities and clock frequencies lead to a rapid rise in power density. Increased power density results in higher on-chip temperature, which creates numerous problems closely tied to reliability degradation. Since typical low-power design has proved inefficient at tackling the temperature increase by itself, device architects are facing the challenge of developing new methodologies to guarantee timing, power, and thermal integrity of the chip. In this paper, we propose a thermal-aware exploration framework targeting the elimination of temperature hotspots through the efficient exploration of multiple microarchitecture selections over the temperature-area trade-off curve. By carefully planning at design time which resources of the initial microarchitecture should be replicated, the proposed methodology optimizes the system's thermal profile and flattens on-chip temperature under various design constraints. 
The introduced framework does not impose any architectural or compiler modification, and it is orthogonal to any other thermal-aware methodology. For evaluation purposes, we employ a software-defined radio application executed on a thermal-aware instance of the LEON3 processor. Based on experimental results, we found that our methodology leads to an architecture that exhibits a temperature reduction of 17 K, which yields an improvement of about 14% against aging phenomena, with a controllable silicon area overhead of about 15%, compared to the initial LEON3 instance. Dionysios Diamantopoulos, Kostas Siozios, Sotiris Xydis, and Dimitrios Soudris Copyright © 2012 Dionysios Diamantopoulos et al. All rights reserved. Flexible LDPC Decoder Architectures Tue, 26 Jun 2012 09:08:06 +0000 http://www.hindawi.com/journals/vlsi/2012/730835/ Flexible channel decoding is gaining importance as the number of wireless standards, and of modes within a standard, increases. A flexible channel decoder is a solution providing interstandard and intrastandard support without change in hardware. However, the design of efficient implementations of flexible low-density parity-check (LDPC) code decoders satisfying area, speed, and power constraints is a challenging task and still requires considerable research effort. This paper provides an overview of the state of the art in the design of flexible LDPC decoders. The published solutions are evaluated at two levels of architectural design: the processing element (PE) and the interconnection structure. A qualitative and quantitative analysis of different design choices is carried out, and a comparison is provided in terms of achieved flexibility, throughput, decoding efficiency, and area (power) consumption. Muhammad Awais and Carlo Condo Copyright © 2012 Muhammad Awais and Carlo Condo. All rights reserved. 
Hardware Design Considerations for Edge-Accelerated Stereo Correspondence Algorithms Thu, 31 May 2012 13:58:47 +0000 http://www.hindawi.com/journals/vlsi/2012/602737/ Stereo correspondence is a popular algorithm for the extraction of depth information from a pair of rectified 2D images. Hence, it has been used in many computer vision applications that require knowledge about depth. However, stereo correspondence is a computationally intensive algorithm and requires high-end hardware resources in order to achieve real-time processing speed in embedded computer vision systems. This paper presents an overview of the use of edge information as a means to accelerate hardware implementations of stereo correspondence algorithms. The presented approach restricts the stereo correspondence algorithm only to the edges of the input images rather than to all image points, thus resulting in a considerable reduction of the search space. The paper highlights the benefits of the edge-directed approach by applying it to two stereo correspondence algorithms: an SAD-based fixed-support algorithm and a more complex adaptive support weight algorithm. Furthermore, we present design considerations about the implementation of these algorithms on reconfigurable hardware and also discuss issues related to the memory structures needed, the amount of parallelism that can be exploited, the organization of the processing blocks, and so forth. The two architectures (fixed-support based versus adaptive-support weight based) are compared in terms of processing speed, disparity map accuracy, and hardware overheads, when both are implemented on a Virtex-5 FPGA platform. Christos Ttofis and Theocharis Theocharides Copyright © 2012 Christos Ttofis and Theocharis Theocharides. All rights reserved.
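The edge-directed idea in the stereo correspondence abstract above can be illustrated with a minimal software sketch. This is not the authors' hardware design: the function names (`horizontal_edges`, `sad`, `edge_disparity`), the central-difference edge detector, the border clamping, and the default parameters are all assumptions chosen for illustration. It shows only how restricting an SAD-based fixed-support search to edge pixels shrinks the search space.

```python
def horizontal_edges(img, threshold):
    """Mark pixels whose horizontal central difference exceeds threshold.

    Hypothetical edge detector; the paper's detector is not specified here.
    """
    h, w = len(img), len(img[0])
    edges = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(1, w - 1):
            if abs(img[y][x + 1] - img[y][x - 1]) > threshold:
                edges[y][x] = True
    return edges


def sad(left, right, y, x, d, radius):
    """Sum of absolute differences over a square fixed-support window.

    Pixel (y, x) in the left image is compared with (y, x - d) in the
    right image; coordinates are clamped at the image borders.
    """
    h, w = len(left), len(left[0])
    total = 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy = min(max(y + dy, 0), h - 1)
            xl = min(max(x + dx, 0), w - 1)
            xr = min(max(x + dx - d, 0), w - 1)
            total += abs(left[yy][xl] - right[yy][xr])
    return total


def edge_disparity(left, right, max_disp, radius=1, threshold=10):
    """Compute disparities only at edge pixels of the left image.

    Non-edge pixels are left as None, so the SAD search runs on a
    fraction of the image points.
    """
    h, w = len(left), len(left[0])
    edges = horizontal_edges(left, threshold)
    disp = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if edges[y][x]:
                candidates = range(0, min(max_disp, x) + 1)
                disp[y][x] = min(
                    candidates,
                    key=lambda d: sad(left, right, y, x, d, radius),
                )
    return disp


if __name__ == "__main__":
    # Synthetic rectified pair: the right view is the left view shifted by 2.
    row = [0, 0, 0, 0, 0, 50, 100, 30, 0, 0, 0, 0]
    left = [row[:] for _ in range(3)]
    right = [[row[min(x + 2, len(row) - 1)] for x in range(len(row))]
             for _ in range(3)]
    for line in edge_disparity(left, right, max_disp=4):
        print(line)
```

In this sketch a sparse disparity map is produced; a full system would still need a step to fill in disparities between edges, which the sketch does not attempt.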