Fig 1 - uploaded by Roger Woods
Adder implementation in LUT-based FPGA technologies. (a) Xilinx Virtex FPGA slice. (b) Lattice XPGA cell.

Source publication
Article
Full-text available
A novel design technique for deriving highly efficient multipliers that operate on a limited range of multiplier values is presented. Using the technique, Xilinx Virtex field programmable gate array (FPGA) implementations for a discrete cosine transform and poly-phase filter were derived with area reductions of 31%-70% and speed increases of 5%-35%...

Contexts in source publication

Context 1
... FPGAs such as the Altera Stratix [6], the Xilinx Virtex [7] and the Lattice XPGAs [10] are composed of dedicated blocks of logic such as the fast carry logic, connected to LUTs as shown in Fig. 1 for the Virtex and XPGA technologies. Each circuit adds bits A, B and Ci and produces bits So and Co using the fast carry logic and, in the case of the Xilinx Virtex in Fig. 1(a), the 4-input LUT is used to implement the remaining XOR gate. This process occurs in synthesis tools which opt (quite correctly) to utilize the fast carry logic ...
Context 2
... the Altera Stratix [6], the Xilinx Virtex [7] and the Lattice XPGAs [10] are composed of dedicated blocks of logic such as the fast carry logic, connected to LUTs as shown in Fig. 1 for the Virtex and XPGA technologies. Each circuit adds bits A, B and Ci and produces bits So and Co using the fast carry logic and, in the case of the Xilinx Virtex in Fig. 1(a), the 4-input LUT is used to implement the remaining XOR gate. This process occurs in synthesis tools which opt (quite correctly) to utilize the fast carry logic rather than implement the adder using the slower LUT hardware. These unused LUT inputs can be used to implement the rc mux as shown in Fig. 2, thereby increasing cell ...
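The split between LUT and fast carry logic described in the contexts above can be illustrated with a small behavioural model. This is a minimal sketch, not the circuit from Fig. 1: it assumes the LUT computes the XOR of A and B, a dedicated XOR forms So, and a carry multiplexer forms Co. The function names `lut_xor`, `carry_mux`, `full_adder` and `ripple_add` are hypothetical.

```python
def lut_xor(a: int, b: int) -> int:
    """LUT configured as a 2-input XOR (assumed role of the LUT)."""
    return a ^ b


def carry_mux(p: int, a: int, ci: int) -> int:
    """Carry multiplexer: forward Ci when p is 1, otherwise forward A."""
    return ci if p else a


def full_adder(a: int, b: int, ci: int) -> tuple:
    """Add bits A, B and Ci, returning (So, Co)."""
    p = lut_xor(a, b)         # computed in the LUT
    so = p ^ ci               # dedicated XOR next to the carry chain
    co = carry_mux(p, a, ci)  # fast carry logic
    return so, co


def ripple_add(x: int, y: int, width: int) -> int:
    """Chain `width` one-bit cells; the result includes the final carry."""
    carry, total = 0, 0
    for i in range(width):
        so, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= so << i
    return total | (carry << width)
```

When p is 0 the two operand bits are equal, so forwarding A as the carry gives the same result as the majority function of A, B and Ci.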
Context 3
... outputs Yn are generated as shown by (1) where X is the input, vector Yn is the output, S selects the slot and a(S), b(S), and c(S) are the coefficients. This forms part of the poly-phase filter example. Table VIII gives three sets of two coefficients that are used and the circuit is shown in Fig. 10. The design goal here is not only to share subexpressions within taps (as in the previous section) but to share terms between ...
Context 4
... two of which, 2 and 3, are identified in other coefficients. This leaves a pattern in a(0) and a(1) which is covered by grouping 1 and a pattern in c(0) which is covered by grouping 4. The uncovered Table IX. The shaded areas indicate that some groupings can be made across SSDs that do not have the same input source. This is because cell 5 in Fig. 10 has a mux input which can accommodate this. Columns 2^9 and 2^4 in c(0) and c(1) are grouped together in cell 6 and columns 2^5 and 2^1 in a(0) and a(1) are grouped together in cell 7. The term in 2^0 in c(0) and c(1) is delayed to the second layer as ...
Context 5
... circuit in Fig. 10 is one component of a poly-phase filter (Fig. 11) that was designed using this technique. A full poly-phase filter implementation [17] was implemented that had an interpolation of 1:3 and a filter length of 55. This design was used as not all the filter taps are used in the computation of each output sample, therefore a number of ...
Context 6
... circuit in Fig. 10 is one component of a poly-phase filter (Fig. 11) that was designed using this technique. A full poly-phase filter implementation [17] was implemented that had an interpolation of 1:3 and a filter length of 55. This design was used as not all the filter taps are used in the computation of each output sample, therefore a number of taps time-share a ...

Similar publications

Article
Full-text available
In this paper, a minimum adder-delay Discrete Cosine Transform (DCT) architecture is proposed using the Adaptive CORDIC (ACor) algorithm with fixed-rotation implementations. The proposed method has six different versions that differ in the number of DCT points, i.e., 8-point (8p), 16-point (16p), and 32-point (32p), and the number of ACor stages, i.e...
Article
Full-text available
This paper presents a short review of time-to-digital and digital-to-time converters (TDCs and DTCs, respectively) adopting a time-mode signal-processing perspective. The primary definitions, operating principles, and basic building blocks are presented. The discussion applies to most, if not all, DTCs and TDCs. A series of voltage-controlled delay...
Conference Paper
Full-text available
Run-time reconfiguration (RTR) of FPGAs is mainly done using the configuration interface. However, for a certain group of designs, RTR using the shift register functionality of the LUTs is a much faster alternative than conventional RTR using the ICAP. This method requires the creation of reconfiguration chains connecting the run-time reconfigurabl...
Article
Full-text available
The design and implementation of a new method of generating arbitrary signals was attempted. This new system is based upon the use of Walsh functions, which are derived from Rademacher functions. The VHDL modeling and the Xilinx field programmable gate arrays (FPGA) implementation of the proposed circuit were made. Two Walsh circuits realized using...
Article
Full-text available
We present novel and efficient methods for built-in-self-test (BIST) of FPGAs for detection and diagnosis of permanent faults in current as well as emerging technologies that are expected to have high fault densities. Our basic BIST methods can be used in both on-line as well as off-line testing scenarios, though we focus on the former in this pape...

Citations

... Multiplication models of a variable by an integer constant that minimize the number of adders used were proposed in [8,9], and in [10,14] a technique for implementing constant coefficient multiplier (CCM) models in Xilinx FPGAs was developed. In addition, in [15,16] a look-up table architecture model was used to implement constant coefficient multiplication. Another constant multiplication model that has been developed is multiple constant multiplication (MCM) [17][18][19][20][21][22][23][24][25]. ...
... Mathematically, binary multiplication of Y = B×A is given in equation (5). The shift-and-add structure of this equation is given in figure 2. [Figure: partial-product arrays b_i a_j for A (K = 8) × B (N = 8), with column weights 2^15 to 2^0 and product bits Y15 to Y0] It should be noted and underlined that the SNN-by-UNS multiplication of Baugh-Wooley's model cannot be used for SPN-by-UNS multiplication and UNS-by-SNN multiplication is unusable for UNS-by-SPN multiplication. These can be proven by using four examples in Figure 3. ...
... [Figure: partial-product array b_i a_j including the fractional bits a-1 and a-2, with column weights 2^15 to 2^-4] ... multiplication bits has carry-out equals "1" at the bit position of 2^(K+N-1), then the sign bit of Y is carry-out + borrow = "1" + "-1" = "0". ...
Technical Report
Full-text available
This research report explains three new models of binary multiplication. The first model can do two types of binary multipliers: unsigned multiplied by signed positive numbers and unsigned multiplied by signed negative numbers. The second model can process two types of binary multipliers: signed positive multiplied by unsigned numbers and signed negative multiplied by unsigned numbers. The last model can handle four types of binary multipliers: signed positive multiplied by signed positive numbers; signed positive multiplied by signed negative numbers; signed negative multiplied by signed positive numbers; and signed negative multiplied by signed negative numbers. Each model is formulated mathematically, has a low complexity algorithm, and is easy to implement in the form of software coding and in integrated circuits. These proposed multipliers are more powerful compared to Baugh-Wooley's models.
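The shift-and-add structure mentioned in the excerpts above can be sketched for the plain unsigned case. This is a generic illustration under the assumption of unsigned operands only; it does not reproduce the signed models of the report, and the function name `shift_add_multiply` is hypothetical.

```python
def shift_add_multiply(a: int, b: int, n_bits: int = 8) -> int:
    """Unsigned shift-and-add: sum the partial products b_i * a * 2^i."""
    y = 0
    for i in range(n_bits):
        if (b >> i) & 1:   # bit b_i selects whether row i contributes
            y += a << i    # row i is A shifted left by i positions
    return y
```

Each loop iteration corresponds to one row of the partial-product array, so an N-bit multiplier B contributes at most N shifted copies of A.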
... Sub-expressions can be shared to further reduce the number of add/subtract operations [8,9]. Turner and Woods present a technique to design reduced coefficient multipliers (RCMs) that operate on a limited set of coefficients [10], exploiting the observation that LUTs used to implement add/subtract operations have unused inputs. This is also known as time-multiplexed multiple-constant multiplication, where a variable input is multiplied by one of several constants selected by a control input to produce a single output. ...
... Turner and Woods present a reduced-coefficient multiplier (RCM) that can operate on a limited set of coefficients, selectable at run-time [10]. Their multipliers use canonical signed digit (CSD) recoding and sub-expression elimination to reduce the number of add/subtract operations. ...
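To illustrate why canonical signed digit (CSD) recoding reduces add/subtract operations, the sketch below recodes a non-negative constant into digits from {-1, 0, 1} with no two adjacent non-zero digits, then multiplies by shifting and adding or subtracting. It is a textbook-style illustration rather than the procedure of the cited work; `to_csd` and `csd_multiply` are hypothetical names.

```python
def to_csd(n: int) -> list:
    """Return CSD digits of n >= 0, least significant digit first."""
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)  # +1 when n % 4 == 1, -1 when n % 4 == 3
            n -= d           # clears the low bit; may propagate a carry
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def csd_multiply(x: int, c: int) -> int:
    """Multiply x by constant c using one add or subtract per non-zero digit."""
    return sum(d * (x << i) for i, d in enumerate(to_csd(c)))
```

For example, the constant 7 is 111 in binary (three non-zero bits) but 8 - 1 in CSD (two non-zero digits), so one fewer add/subtract operation is needed.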
Article
Full-text available
Multiplication by a constant is a common operation for many signal, image, and video processing applications that are implemented in field-programmable gate arrays (FPGAs). Constant-coefficient multipliers (KCMs) are often implemented in the logic fabric using lookup tables (LUTs), reserving embedded hard multipliers for general-purpose multiplication. This paper describes a two-operand addition circuit from previous work and shows how it can be used to generate and add pre-computed partial products to implement KCMs. A novel method for pre-computing partial products for KCMs with a negative constant is also presented. These KCMs are then extended to have two to eight coefficients that may be selected by a control signal at runtime to implement time-multiplexed multiple-constant multiplication. Synthesis results show that proposed pipelined KCMs use 27.4% fewer LUTs on average and have a median LUT-delay product that is 12% lower than comparable LogiCORE IP KCMs. Proposed pipelined KCMs with two to eight selectable coefficients use 46% to 70% fewer LUTs than the best LogiCORE IP based alternative and most are faster than using a LogiCORE IP multiplier with a coefficient lookup function. They also outperform the state-of-the-art in the literature, using 22% to 57% fewer slices than the smallest pipelined adder graph (PAG) fusion designs and operate 7% to 30% faster than the fastest PAG fusion designs for the same operand size and number of selectable coefficients. For KCMs and KCMs with selectable coefficients of a given operand size, the placement and routing of LUTs remains the same for all positive and negative constant values, which is advantageous for runtime partial reconfiguration.
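The idea of generating and adding pre-computed partial products can be sketched as follows. This is a generic software model, assuming an unsigned input split into 4-bit chunks and a non-negative constant; it does not reproduce the circuit or the negative-constant method of the abstract above, and `build_table` and `kcm` are hypothetical names.

```python
def build_table(c: int, chunk: int = 4) -> list:
    """Pre-compute c * v for every possible chunk value v."""
    return [c * v for v in range(1 << chunk)]


def kcm(x: int, table: list, width: int = 8, chunk: int = 4) -> int:
    """Multiply unsigned x by the constant behind `table` using lookups."""
    mask = (1 << chunk) - 1
    y = 0
    for shift in range(0, width, chunk):
        v = (x >> shift) & mask   # next chunk of the input
        y += table[v] << shift    # looked-up partial product, aligned
    return y
```

Changing the constant only changes the table contents, not the structure of the lookups and additions.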
... While all methods result in adder graphs in which intermediate results are reused and reconfiguration is done by multiplexers (see Fig. 1), the construction of the RCM follows different methodologies. In [7], [8], [11] a basic computation kernel was defined which perfectly fits into the FPGA's logic. This kernel was used to generate larger RCM circuits. ...
Article
This paper presents a new method called optimal shift reassignment (OSR), used for reconfigurable multiplication circuits. These circuits consist of adders, subtracters, shifts and multiplexers. They calculate the multiplication of an input number by one out of several constants which can be selected dynamically during run-time. The OSR method is based on the idea that shifts can be placed at different positions along the circuit, while the calculated output constant stays the same. This differs from previous approaches, which were limited by the fact that all constants within the constant multiplier were forced to be odd. The OSR method subsequently releases this restriction. As a result, the number of required multiplexers in the circuit can be reduced. This happens when the shift reassignment aligns the shift values of different inputs of a multiplexer. Experimental results show multiplexer savings of up to 50 % and average savings between 11 % and 16 % using the OSR method compared to previous approaches.
... generation and is predicted to play a vital role in 4G wireless systems. The prototyping of SISO systems using field programmable gate arrays (FPGAs) or ASICs presents an alternative test environment for SISO systems [2]. A crucial challenge for the SISO generation will be the design of the transmitter and receiver sections, which include complicated algorithms in each section [4]. ...
... The PN sequence is generated by using a linear feedback shift register. Here two linear feedback shift registers are used, since the main input key is split into two equal parts and each split key is then replaced by the PN sequence generated by LFSR [1] and LFSR [2]. The output of LFSR [1] and LFSR [2] is connected to a SIPO whose parallel output replaces the MSB and LSB of mask1 and mask2 to generate a new mask1 and mask2, so that each time a different subkey is generated, which is XORed with the input to generate the encrypted data. ...
... Here two linear feedback shift registers are used, since the main input key is split into two equal parts and each split key is then replaced by the PN sequence generated by LFSR [1] and LFSR [2]. The output of LFSR [1] and LFSR [2] is connected to a SIPO whose parallel output replaces the MSB and LSB of mask1 and mask2 to generate a new mask1 and mask2, so that each time a different subkey is generated, which is XORed with the input to generate the encrypted data. ...
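As a reminder of how a linear feedback shift register produces a PN sequence, here is a minimal sketch of a single 4-bit Fibonacci LFSR. The register width, tap positions and seed are assumptions chosen for illustration and are not taken from the cited work; `lfsr4_step` and `pn_bits` are hypothetical names.

```python
def lfsr4_step(state: int) -> int:
    """Advance a 4-bit Fibonacci LFSR by one clock (taps on bits 0 and 1)."""
    feedback = (state ^ (state >> 1)) & 1   # XOR of the two tapped bits
    return (state >> 1) | (feedback << 3)   # shift right, insert at the top


def pn_bits(seed: int, count: int) -> list:
    """Return `count` output bits (the bit shifted out on each clock)."""
    bits, state = [], seed
    for _ in range(count):
        bits.append(state & 1)
        state = lfsr4_step(state)
    return bits
```

With these taps any non-zero seed cycles through all 15 non-zero states before repeating; the all-zero state is a fixed point and must be avoided.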
... In the last decade, many efficient algorithms were introduced for the minimization of the design complexity in TMCM operations targeting the mux-add architecture and the application specific integrated circuit (ASIC) and field programmable gate arrays (FPGA) design platforms [Aksoy et al. 2013b;2014;Chen and Chang 2009;Demirsoy et al. 2007;Sidahao et al. 2004;Tummeltshammer et al. 2007;Turner and Woods 2004]. However, the exact method of [Sidahao et al. 2004] can only be applied to a small number of constants and the solution quality of the approximate methods [Aksoy et al. 2013b;Chen and Chang 2009;Demirsoy et al. 2007;Sidahao et al. 2005;Tummeltshammer et al. 2007;Turner and Woods 2004] depends heavily on the TMCM instance. ...
... In the last decade, many efficient algorithms were introduced for the minimization of the design complexity in TMCM operations targeting the mux-add architecture and the application specific integrated circuit (ASIC) and field programmable gate arrays (FPGA) design platforms [Aksoy et al. 2013b;2014;Chen and Chang 2009;Demirsoy et al. 2007;Sidahao et al. 2004;Tummeltshammer et al. 2007;Turner and Woods 2004]. However, the exact method of [Sidahao et al. 2004] can only be applied to a small number of constants and the solution quality of the approximate methods [Aksoy et al. 2013b;Chen and Chang 2009;Demirsoy et al. 2007;Sidahao et al. 2005;Tummeltshammer et al. 2007;Turner and Woods 2004] depends heavily on the TMCM instance. Recently, we introduced an approximate algorithm ORPHEUS [Aksoy et al. 2014] which combines efficient heuristics from both MCM and TMCM techniques and yields better solutions than previously proposed algorithms. ...
... The TMCM algorithms of [Demirsoy et al. 2007;Sidahao et al. 2004;Turner and Woods 2004] target the FPGA design platform. In [Demirsoy et al. 2007;Turner and Woods 2004], the basic structure consists of an adder, a subtractor, or an adder/subtractor, which may include a 2-to-1 MUX at one of its inputs that requires no additional hardware in an FPGA. ...
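The basic mux-add structure described above can be sketched with a single adder whose second operand passes through a 2-to-1 multiplexer. The constants 5 and 9 are assumptions picked for illustration, and `mux_add_cell` is a hypothetical name; the cited algorithms build larger networks from cells of this kind.

```python
def mux_add_cell(x: int, sel: int) -> int:
    """One adder with a 2-to-1 mux on one input.

    sel = 0 selects x << 2, giving 5 * x.
    sel = 1 selects x << 3, giving 9 * x.
    """
    shifted = (x << 3) if sel else (x << 2)  # the 2-to-1 mux
    return x + shifted                       # the single shared adder
```

Both constants share the same adder, so selecting between them at run time costs only the multiplexer, which the excerpt notes requires no additional hardware in an FPGA.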
Article
Full-text available
This article addresses the problem of minimizing the implementation cost of the time-multiplexed constant multiplication (TMCM) operation that realizes the multiplication of an input variable by a single constant selected from a set of multiple constants at a time. It presents an efficient algorithm, called ORPHEUS, that finds a multiplierless TMCM design by sharing logic operators, namely adders, subtractors, adders/subtractors, and multiplexors (MUXes). Moreover, this article introduces folded design architectures for the digital signal processing (DSP) blocks, such as finite impulse response (FIR) filters and linear DSP transforms, and describes how these folded DSP blocks can be efficiently realized using TMCM operations optimized by ORPHEUS. Experimental results indicate that ORPHEUS can find better solutions than existing TMCM algorithms, yielding TMCM designs requiring less area. They also show that the folded architectures lead to alternative designs with significantly less area, but incurring an increase in latency and energy consumption, compared to the parallel architecture.
... However, in some applications, these constants may change in certain time steps, which prevents the use of standard constant multipliers [Bosí et al. 1999;Bouganis et al. 2009;Huang et al. 2008;Shoufan et al. 2010]. Some researchers [Chen and Chang 2009;Demirsoy et al. 2007;Turner and Woods 2004] have addressed this problem when the constant changes to several predefined values, as it does in FFT, DCT, filters, and many others. Other authors have proposed reconfigurable architectures for specific applications (i.e., FIR filters [Mahesh and Vinod 2010;Park et al. 2004]) that enable the use of a priori unknown constants. ...
Article
Constant multipliers are widely used in signal processing applications to implement the multiplication of signals by a constant coefficient. However, in some applications, this coefficient remains invariable only during an interval of time, and then, its value changes to adapt to new circumstances. In this article, we present a self-reconfigurable constant multiplier suitable for LUT-based FPGAs able to reload the constant in runtime. The pipelined architecture presented is easily scalable to any multiplicand and constant sizes, for unsigned and signed representations. It can be reprogrammed in 16 clock cycles, equivalent to less than 100 ns in current FPGAs. This value is significantly smaller than FPGA partial configuration times. The presented approach is more efficient in terms of area and speed when compared to generic multipliers, achieving up to 91% area reduction and up to 102% speed improvement for the case-study circuits tested. The power consumption of the proposed multipliers is in the range of that of slice-based multipliers provided by the vendor.
... The 'reconfigurable Mux' concept was used as an approach to highlight areas that would be dynamically swapped in, and out, and the mux would represent reconfiguration and not be implemented. However, by deliberately identifying a few common blocks and implementing the mux as an actual multiplexer, we are able to dynamically map the algorithmic requirements to underlying resource as proposed in [22]. ...
Article
Full-text available
A key challenge in defense and security systems is to implement functionality within a power budget. We show how data bandwidth redundancy and the need to change performance are exploited to achieve power efficient, field programmable gate array realizations with improved sampling rates. A unified methodology is given for the implementation of a key function, the fast Fourier transform, for a Radar-based digital receiver. Locality of data, temporal and spatial resource usage are examined from first principles, leading to an algorithmic approach that demonstrates substantial industrial benefits in terms of power, performance and resource usage. A power saving of 18% is achieved over a Cooley-Tukey design with a 100% speed improvement; the work is extended to other cyclical fast algorithms.
... In this era, the focus was to exploit the nature of some fixed DSP operations and transforms to allow fixed coefficient multipliers to be derived. These fixed coefficient structures could be implemented efficiently using a small number of adders and evolutions such as the reduced coefficient multipliers (RCMs) [10]; very efficient multipliers which could operate on a number of multiplicands rather than a single one, could then be created. ...
... • The availability of dedicated multipliers, adders and memory elements suggests a direct mapping from the processing graphs to FPGAs where each function can be implemented by a separate processing element thereby allowing high levels of parallelism. Historically, a lot of effort had been dedicated to implementing multiplicative functionality using the LUT-based programming element (Fig. 2) [2,9,10] but this has now been superseded by the recent developments in FPGA architectures, namely the DSP blocks. • The plethora of small, distributed memories suggests a highly pipelined approach is applicable in FPGA. ...
[Figure caption: Configurations for Xilinx CLB LUT [7]]
Article
Full-text available
Field programmable gate arrays (FPGAs) are examples of complex programmable system-on-chip (PSoC) platforms and comprise dedicated DSP hardware resources and distributed memory. They are ideal platforms for implementing computationally complex DSP systems in image processing and radar, sonar and signal processing. The chapter describes how decidable signal processing graphs are mapped into such platforms and shows how parallelism and pipelining can be controlled from a high level representation to achieve the required speed using minimal hardware resource. The process is demonstrated using a number of simple examples, namely a finite impulse response (FIR) filter, a lattice filter and a more complex adaptive signal processing design, a least mean squares (LMS) filter.
... FPGA realization of an ANN with a large number of neurons is still not an easy task because the ANN algorithm is rich in multiplication, which is relatively expensive to realize. Various works reported in this area include new multiplication algorithms for ANNs, NNs with some constraints to achieve higher processing speed at lower cost, and multichip realization [16][17][18]. ...
... FPGA realization of ANNs with a large number of neurons is still a challenging task because ANN algorithms are "multiplication-rich" and multiplication is relatively expensive to implement. Various works reported in this area include new multiplication algorithms for NNs [8], NNs with some constraints to achieve higher speed of operation at lower cost [9] and multi-chip realization [10]. ...
... The maximum sampling interval t_s for a given system can be obtained from the time constant of the system. From t_s the maximum number of layers (ii) can be determined using either equations (5) and (6) or equations (7) and (8). Let the number of layers be 'L'. ...
Article
Hardware realization of a Neural Network (NN) depends to a large extent on the efficient implementation of a single neuron. FPGA-based reconfigurable computing architectures are suitable for hardware implementation of neural networks. FPGA realization of ANNs with a large number of neurons is still a challenging task. This paper discusses the issues involved in the implementation of a multi-input neuron with linear/nonlinear excitation functions using an FPGA. An implementation method with a resource/speed tradeoff is proposed to handle signed decimal numbers. The VHDL code developed is tested using a Xilinx XC V50hq240 chip. To improve the speed of operation, a lookup table method is used. The problems involved in using a lookup table (LUT) for a nonlinear function are discussed. The percentage saving in resources and the improvement in speed with an LUT for a neuron are reported. An attempt is also made to derive a generalized formula for a multi-input neuron that allows the total resource requirement and achievable speed to be estimated approximately for a given multilayer neural network. This helps the designer to choose the FPGA capacity for a given application. Using the proposed method of implementation, a neural network based application, namely a space vector modulator for a vector-controlled drive, is presented.