Adder implementation in LUT-based FPGA technologies. (a) Xilinx Virtex FPGA slice. (b) Lattice XPGA cell.

Source publication

Highly efficient, limited range multipliers for LUT-based FPGA architectures

Article

Nov 2004

A novel design technique for deriving highly efficient multipliers that operate on a limited range of multiplier values is presented. Using the technique, Xilinx Virtex field programmable gate array (FPGA) implementations for a discrete cosine transform and poly-phase filter were derived with area reductions of 31%-70% and speed increases of 5%-35%...

Context 1

... FPGAs such as the Altera Stratix [6], the Xilinx Virtex [7] and the Lattice XPGAs [10] are composed of dedicated blocks of logic such as the fast carry logic, connected to LUTs as shown in Fig. 1 for the Virtex and XPGA technologies. Each circuit adds bits A, B, and Ci and produces bits So and Co using the fast carry logic and, in the case of the Xilinx Virtex in Fig. 1(a), the 4-input LUT is used to implement the remaining XOR gate. This process occurs in synthesis tools which opt (quite correctly) to utilize the fast carry logic ...

Context 2

... the Altera Stratix [6], the Xilinx Virtex [7] and the Lattice XPGAs [10] are composed of dedicated blocks of logic such as the fast carry logic, connected to LUTs as shown in Fig. 1 for the Virtex and XPGA technologies. Each circuit adds bits A, B, and Ci and produces bits So and Co using the fast carry logic and, in the case of the Xilinx Virtex in Fig. 1(a), the 4-input LUT is used to implement the remaining XOR gate. This process occurs in synthesis tools which opt (quite correctly) to utilize the fast carry logic rather than implement the adder using the slower LUT hardware. These unused LUT inputs can be used to implement the rc mux as shown in Fig. 2, thereby increasing cell ...
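
The adder cell described in this excerpt can be sketched as a small behavioural model. The excerpt only states that the LUT implements "the remaining XOR gate" while the fast carry logic does the rest. The split used here (a carry multiplexer plus a dedicated sum XOR) is an assumption about the carry chain, and the names `lut_xor`, `carry_cell` and `ripple_add` are hypothetical.

```python
def lut_xor(a: int, b: int) -> int:
    """Function placed in the LUT: the propagate signal P = A xor B."""
    return a ^ b


def carry_cell(a: int, b: int, ci: int) -> tuple[int, int]:
    """One adder bit: returns (So, Co) for inputs A, B and carry-in Ci.

    Assumed structure of the fast carry logic (not taken from Fig. 1):
      - a carry multiplexer passes Ci when P = 1, otherwise A
      - a dedicated XOR combines P with Ci to form the sum bit
    """
    p = lut_xor(a, b)
    co = ci if p else a  # when P = 0, A equals B, so either one is the carry
    so = p ^ ci
    return so, co


def ripple_add(x: int, y: int, width: int, ci: int = 0) -> tuple[int, int]:
    """Chain `width` cells, least significant bit first."""
    total, carry = 0, ci
    for i in range(width):
        so, carry = carry_cell((x >> i) & 1, (y >> i) & 1, carry)
        total |= so << i
    return total, carry
```

In this model only two inputs of a 4-input LUT are needed for the XOR, which is consistent with the excerpt's remark that unused LUT inputs remain available for additional logic.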

Context 3

... outputs Yn are generated as shown by (1) where X is the input, vector Yn is the output, S selects the slot and a(S), b(S), and c(S) are the coefficients. This forms part of the poly-phase filter example. Table VIII gives three sets of two coefficients that are used and the circuit is shown in Fig. 10. The design goal here is not only to share subexpressions within taps (as in the previous section) but to share terms between ...

Context 4

... two of which, 2 and 3, are identified in other coefficients. This leaves a pattern in a(0) and a(1) which is covered by grouping 1 and a pattern in c(0) which is covered by grouping 4. The uncovered Table IX. The shaded areas indicate that some groupings can be made across SSDs that do not have the same input source. This is because cell 5 in Fig. 10 has a mux input which can accommodate this. Columns 2^9 and 2^4 in c(0) and c(1) are grouped together in cell 6 and columns 2^5 and 2^1 in a(0) and a(1) are grouped together in cell 7. The term in 2^0 in c(0) and c(1) is delayed to the second layer as ...

Context 5

... circuit in Fig. 10 is one component of a poly-phase filter (Fig. 11) that was designed using this technique. A full poly-phase filter [17] was implemented that had an interpolation of 1:3 and a filter length of 55. This design was used as not all the filter taps are used in the computation of each output sample, therefore a number of ...
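
The remark that not all filter taps contribute to each output sample can be illustrated with a generic poly-phase interpolator. This is a minimal sketch that reads "interpolation of 1:3" as upsampling by a factor of 3 (an assumption). The function names and the placeholder coefficients in the usage note are hypothetical and do not reproduce the filter or the architecture of the source publication.

```python
def interpolate_direct(x, h, factor):
    """Reference: insert factor-1 zeros after each sample, then run the full FIR."""
    up = []
    for sample in x:
        up.append(sample)
        up.extend([0] * (factor - 1))
    y = []
    for n in range(len(up)):
        acc = 0
        for k, hk in enumerate(h):
            if n - k >= 0:
                acc += hk * up[n - k]
        y.append(acc)
    return y


def interpolate_polyphase(x, h, factor):
    """Poly-phase form: output n = m*factor + p uses only taps h[p], h[p+factor], ..."""
    phases = [h[p::factor] for p in range(factor)]
    y = []
    for m in range(len(x)):
        for p in range(factor):
            acc = 0
            for j, coeff in enumerate(phases[p]):
                if m - j >= 0:
                    acc += coeff * x[m - j]
            y.append(acc)
    return y
```

With a factor of 3 and 55 taps, the three phases hold 19, 18 and 18 coefficients, so each output sample needs roughly one third of the multiplications of the direct form. For example, `interpolate_polyphase(x, list(range(1, 56)), 3)` gives the same result as `interpolate_direct` on the same inputs.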

Minimum adder-delay architecture of 8/16/32-point DCT based on fixed-rotation adaptive CORDIC

Article

May 2018

In this paper, a minimum adder-delay Discrete Cosine Transform (DCT) architecture is proposed using the Adaptive CORDIC (ACor) algorithm with fixed-rotation implementations. The proposed method has six versions that differ in the number of DCT points, i.e., 8-point (8p), 16-point (16p), and 32-point (32p), and the number of ACor stages, i.e...

A Brief Introduction to Time-to-Digital and Digital-to-Time Converters

Article

Apr 2010

This paper presents a short review of time-to-digital and digital-to-time converters (TDCs and DTCs, respectively) adopting a time-mode signal-processing perspective. The primary definitions, operating principles, and basic building blocks are presented. The discussion applies to most, if not all, DTCs and TDCs. A series of voltage-controlled delay...

Automating Reconfiguration Chain Generation for SRL-Based Run-Time Reconfiguration

Conference Paper

Mar 2012

Run-time reconfiguration (RTR) of FPGAs is mainly done using the configuration interface. However, for a certain group of designs, RTR using the shift register functionality of the LUTs is a much faster alternative than conventional RTR using the ICAP. This method requires the creation of reconfiguration chains connecting the run-time reconfigurabl...

Design and implementation of an improved arbitrary waveform generator based on Walsh functions

Article

Mar 2012

Zulfikar Zulfikar

The design and implementation of a new method of generating arbitrary signals was attempted. This new system is based upon the use of Walsh functions, which are derived from Rademacher functions. The VHDL modeling and the Xilinx field programmable gate arrays (FPGA) implementation of the proposed circuit were made. Two Walsh circuits realized using...

Built-in-Self-Test of FPGAs With Provable Diagnosabilities and High Diagnostic Coverage With Application to Online Testing

Article

Jan 2008

We present novel and efficient methods for built-in-self-test (BIST) of FPGAs for detection and diagnosis of permanent faults in current as well as emerging technologies that are expected to have high fault densities. Our basic BIST methods can be used in both on-line as well as off-line testing scenarios, though we focus on the former in this pape...

New Approach of Unsigned and Signed Binary Numbers Multiplications

Technical Report

Jan 2021

Sarifuddin Madenda

This research report explains three new models of binary multiplication. The first model can do two types of binary multipliers: unsigned multiplied by signed positive numbers and unsigned multiplied by signed negative numbers. The second model can process two types of binary multipliers: signed positive multiplied by unsigned numbers and signed negative multiplied by unsigned numbers. The last model can handle four types of binary multipliers: signed positive multiplied by signed positive numbers; signed positive multiplied by signed negative numbers; signed negative multiplied by signed positive numbers; and signed negative multiplied by signed negative numbers. Each model is formulated mathematically, has a low complexity algorithm, and is easy to implement in the form of software coding and in integrated circuits. These proposed multipliers are more powerful compared to Baugh-Wooley's models.

Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs

Article

Nov 2017

George Walters, E., III

Multiplication by a constant is a common operation for many signal, image, and video processing applications that are implemented in field-programmable gate arrays (FPGAs). Constant-coefficient multipliers (KCMs) are often implemented in the logic fabric using lookup tables (LUTs), reserving embedded hard multipliers for general-purpose multiplication. This paper describes a two-operand addition circuit from previous work and shows how it can be used to generate and add pre-computed partial products to implement KCMs. A novel method for pre-computing partial products for KCMs with a negative constant is also presented. These KCMs are then extended to have two to eight coefficients that may be selected by a control signal at runtime to implement time-multiplexed multiple-constant multiplication. Synthesis results show that proposed pipelined KCMs use 27.4% fewer LUTs on average and have a median LUT-delay product that is 12% lower than comparable LogiCORE IP KCMs. Proposed pipelined KCMs with two to eight selectable coefficients use 46% to 70% fewer LUTs than the best LogiCORE IP based alternative and most are faster than using a LogiCORE IP multiplier with a coefficient lookup function. They also outperform the state-of-the-art in the literature, using 22% to 57% fewer slices than the smallest pipelined adder graph (PAG) fusion designs and operate 7% to 30% faster than the fastest PAG fusion designs for the same operand size and number of selectable coefficients. For KCMs and KCMs with selectable coefficients of a given operand size, the placement and routing of LUTs remains the same for all positive and negative constant values, which is advantageous for runtime partial reconfiguration.

Optimal Shift Reassignment in Reconfigurable Constant Multiplication Circuits

Article

Jul 2017
IEEE T COMPUT AID D

This paper presents a new method called optimal shift reassignment (OSR), used for reconfigurable multiplication circuits. These circuits consist of adders, subtracters, shifts and multiplexers. They calculate the multiplication of an input number by one out of several constants which can be selected dynamically during run-time. The OSR method is based on the idea that shifts can be placed at different positions along the circuit, while the calculated output constant stays the same. This differs from previous approaches, which were limited by the fact that all constants within the constant multiplier were forced to be odd. The OSR method subsequently releases this restriction. As a result, the number of required multiplexers in the circuit can be reduced. This happens when the shift reassignment aligns the shift values of different inputs of a multiplexer. Experimental results show multiplexer savings of up to 50 % and average savings between 11 % and 16 % using the OSR method compared to previous approaches.

FPGA Implementation of Crypto-System based Wireless communication system

Article

Apr 2016

Multiplierless Design of Folded DSP Blocks

Article

Nov 2014
ACM T DES AUTOMAT EL

This article addresses the problem of minimizing the implementation cost of the time-multiplexed constant multiplication (TMCM) operation that realizes the multiplication of an input variable by a single constant selected from a set of multiple constants at a time. It presents an efficient algorithm, called ORPHEUS, that finds a multiplierless TMCM design by sharing logic operators, namely adders, subtractors, adders/subtractors, and multiplexors (MUXes). Moreover, this article introduces folded design architectures for the digital signal processing (DSP) blocks, such as finite impulse response (FIR) filters and linear DSP transforms, and describes how these folded DSP blocks can be efficiently realized using TMCM operations optimized by ORPHEUS. Experimental results indicate that ORPHEUS can find better solutions than existing TMCM algorithms, yielding TMCM designs requiring less area. They also show that the folded architectures lead to alternative designs with significantly less area, but incurring an increase in latency and energy consumption, compared to the parallel architecture.

Self-Reconfigurable Constant Multiplier for FPGA

Article

Oct 2013

Constant multipliers are widely used in signal processing applications to implement the multiplication of signals by a constant coefficient. However, in some applications, this coefficient remains invariable only during an interval of time, and then, its value changes to adapt to new circumstances. In this article, we present a self-reconfigurable constant multiplier suitable for LUT-based FPGAs able to reload the constant in runtime. The pipelined architecture presented is easily scalable to any multiplicand and constant sizes, for unsigned and signed representations. It can be reprogrammed in 16 clock cycles, equivalent to less than 100 ns in current FPGAs. This value is significantly smaller than FPGA partial configuration times. The presented approach is more efficient in terms of area and speed when compared to generic multipliers, achieving up to 91% area reduction and up to 102% speed improvement for the case-study circuits tested. The power consumption of the proposed multipliers is in the range of that of slice-based multipliers provided by the vendor.

Power Efficient, FPGA Implementations of Transform Algorithms for Radar-Based Digital Receiver Applications

Article

Aug 2013
IEEE T IND INFORM

A key challenge in defense and security systems is to implement functionality within a power budget. We show how data bandwidth redundancy and the need to change performance are exploited to achieve power efficient, field programmable gate array realizations with improved sampling rates. A unified methodology is given for the implementation of a key function, the fast Fourier transform, for a Radar-based digital receiver. Locality of data, temporal and spatial resource usage are examined from first principles, leading to an algorithmic approach that demonstrates substantial industrial benefits in terms of power, performance and resource usage. A power saving of 18% is achieved over a Cooley-Tukey design with a 100% speed improvement; the work is extended to other cyclical fast algorithms.

Mapping Decidable Signal Processing Graphs into FPGA Implementations

Article

May 2013

Roger Woods

Field programmable gate arrays (FPGAs) are examples of complex programmable system-on-chip (PSoC) platforms and comprise dedicated DSP hardware resources and distributed memory. They are ideal platforms for implementing computationally complex DSP systems in image processing and radar, sonar and signal processing. The chapter describes how decidable signal processing graphs are mapped into such platforms and shows how parallelism and pipelining can be controlled from a high level representation to achieve the required speed using minimal hardware resources. The process is demonstrated using a number of simple examples, namely a finite impulse response (FIR) filter, a lattice filter, and a more complex adaptive signal processing design, a least mean squares (LMS) filter.

Implementation of Digital Circuits Using Neuro - Swarm Based on FPGA.

Article

Jun 2010

Neural Network Implementation Using FPGA: Issues and Application

Article

Nov 2007

Hardware realization of a Neural Network (NN) depends to a large extent on the efficient implementation of a single neuron. FPGA-based reconfigurable computing architectures are suitable for hardware implementation of neural networks. FPGA realization of ANNs with a large number of neurons is still a challenging task. This paper discusses the issues involved in the implementation of a multi-input neuron with linear/nonlinear excitation functions using an FPGA. An implementation method with a resource/speed tradeoff is proposed to handle signed decimal numbers. The VHDL code developed is tested using a Xilinx XC V50hq240 chip. To improve the speed of operation, a lookup table method is used. The problems involved in using a lookup table (LUT) for a nonlinear function are discussed. The percentage saving in resources and the improvement in speed with an LUT for a neuron are reported. An attempt is also made to derive a generalized formula for a multi-input neuron that helps estimate the approximate total resource requirement and speed achievable for a given multilayer neural network. This helps the designer choose the FPGA capacity for a given application. Using the proposed implementation method, a neural network based application, namely a space vector modulator for a vector-controlled drive, is presented.
