Figure 2: A static multiplexer provides flexibility to the router: in this case, 8 inputs are available to route one signal (a); a dynamic multiplexer offers no flexibility to the router, as all of its inputs are used and routes must be computed for the dynamic control signals as well (b).
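To make the caption's distinction concrete, here is a minimal Python sketch (my own toy model, not taken from the paper): a static routing multiplexer's select comes from configuration bits, so the router only needs to bring one net to any of its 8 inputs and keeps the rest as spare choices, whereas a dynamic multiplexer's data inputs all carry distinct nets and its select wires must themselves be routed.

import math

# Toy model (illustration only): routing demand of a static vs. a dynamic
# 8:1 multiplexer in an FPGA routing network.

def static_mux_demand(num_inputs=8):
    """Select driven by configuration bits: the router brings one signal to any
    one of the inputs, leaving the remaining inputs as spare choices."""
    nets_to_route = 1                 # the single routed signal
    spare_inputs = num_inputs - 1     # flexibility left to the router
    return nets_to_route, spare_inputs

def dynamic_mux_demand(num_inputs=8):
    """Select driven by user logic: every data input carries a distinct signal
    and the select wires must be routed as well, so nothing is left spare."""
    select_nets = math.ceil(math.log2(num_inputs))
    nets_to_route = num_inputs + select_nets
    spare_inputs = 0
    return nets_to_route, spare_inputs

if __name__ == "__main__":
    print("static mux :", static_mux_demand())   # (1, 7)
    print("dynamic mux:", dynamic_mux_demand())  # (11, 0)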

Source publication
Conference Paper
Full-text available
In floating-point datapaths synthesized on FPGAs, the shifters that perform mantissa alignment and normalization consume a disproportionate number of LUTs. Shifters are implemented using several rows of small multiplexers; unfortunately, multiplexer-based logic structures map poorly onto LUTs. FPGAs, meanwhile, contain a large number of multiplexer...
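As an illustration of the abstract's point that shifters decompose into rows of small multiplexers (my own sketch, not the paper's LUT mapping), a logarithmic right shifter of width w needs ceil(log2(w)) rows of 2:1 multiplexers, one row per bit of the shift amount; it is exactly these w x log2(w) small multiplexers that map poorly onto LUTs.

# Behavioural model of a logarithmic (barrel) right shifter: ceil(log2(width))
# rows of 2:1 multiplexers, row i shifting by 2**i when bit i of the shift
# amount is set.

def mux2(sel, a, b):
    """2:1 multiplexer primitive: b if sel else a."""
    return b if sel else a

def barrel_shift_right(bits, shamt):
    """bits: list of 0/1 with index 0 = LSB; returns bits >> shamt, zero-filled."""
    width = len(bits)
    n_rows = (width - 1).bit_length()            # number of multiplexer rows
    out = list(bits)
    for row in range(n_rows):
        sel = (shamt >> row) & 1                 # one control bit per row
        step = 1 << row
        out = [mux2(sel, out[i], out[i + step] if i + step < width else 0)
               for i in range(width)]
    return out                                   # width * n_rows 2:1 muxes in total

if __name__ == "__main__":
    value = 0b10110100
    bits = [(value >> i) & 1 for i in range(8)]
    shifted = barrel_shift_right(bits, 3)
    print(sum(b << i for i, b in enumerate(shifted)) == value >> 3)  # True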

Similar publications

Conference Paper
Full-text available
FPGAs have reached densities that can implement floating point applications, but floating-point operations still require a large amount of FPGA resources. One major component of IEEE compliant floating-point computations is variable length shifters. They account for over 30% of a double-precision floating-point adder and 25% of a double-precision m...
Article
Full-text available
End-users of integrated circuits need models to anticipate and solve conducted emission issues at board level in a short time. The standard IEC62433-2 integrated circuit emission model—conducted emissions (ICEM-CE) has been proposed to respond to this demand. Although the standard proposes methods to extract circuit models from measurements, they c...
Conference Paper
Full-text available
In this paper we evaluate a new multilevel hierarchical FPGA (MFPGA). The specific architecture includes two unidirectional programmable networks: A downward network based on the butterfly-fat-tree topology, and a special upward network. New tools are developed to place and route several benchmark circuits on this architecture. Comparison with the...
Article
Full-text available
In real-time catheter based 3D ultrasound imaging applications, gathering data from the transducer arrays is difficult as there is a restriction on cable count due to the diameter of the catheter. Although area and power hungry multiplexing circuits integrated at the catheter tip are used in some applications, these are unsuitable for use in small...
Article
Full-text available
Binarized neural networks (BNNs) are gaining interest in the deep learning community due to their significantly lower computational and memory cost. They are particularly well suited to reconfigurable logic devices, which contain an abundance of fine-grained compute resources and can result in smaller, lower power implementations, or conversely in...

Citations

... 2) large priority encoders and barrel shifters are used for combinational implementation of the Leading One Detector (LOD) and for extracting fractional parts [24]. However, these ASIC-oriented approaches are a poor fit for FPGAs because they rely on layers of multiplexers, which map poorly onto LUTs [25]. Targeting FPGAs, the authors of [26] proposed an 8-bit LOD that uses consecutive levels of LUTs to implement each bit of the LOD output, since each output bit is a function of all bits of the input. ...
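For context (a behavioural sketch of my own, not one of the cited [24]-[26] designs), a leading-one detector reports the position of the most significant set bit, and in normalization that position directly gives the required left-shift amount:

# Behavioural leading-one detector (LOD): position of the most significant set
# bit, plus the left-shift amount needed to normalize a mantissa. Hardware
# versions use priority encoders or LUT trees; this is only a reference model.

def leading_one_position(x, width=8):
    """Return the index of the leading one (width-1 = MSB), or None if x == 0."""
    for pos in range(width - 1, -1, -1):
        if (x >> pos) & 1:
            return pos
    return None

def normalization_shift(x, width=8):
    """Left-shift amount that moves the leading one into the MSB position."""
    pos = leading_one_position(x, width)
    return None if pos is None else (width - 1) - pos

if __name__ == "__main__":
    x = 0b00010110
    print(leading_one_position(x))                              # 4
    print(normalization_shift(x))                               # 3
    print(format((x << normalization_shift(x)) & 0xFF, "08b"))  # 10110000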
... The work dedicated to FPGA implementations of arithmetic units and their shifters mainly concerns logarithmic and cyclic shifters [3][4][5][6][7][8][9][10][11]. In [3][4][5][6][7][8], FPGA resources are insufficiently adapted to the implementation of shifters. At the same time, [3] assesses the effect of augmenting FPGA logic slices with hardware shifters. ...
... [4] notes that cyclic shifters could be implemented more effectively on an FPGA if it offered a sufficient number of tri-state buffers. [5] discusses the possibility of reducing the chip area occupied by the shifter by using the multiplexers of the FPGA routing resources. ...
Article
Full-text available
When floating-point arithmetic units are implemented on a Field Programmable Gate Array, the implementation of shifters poses particular challenges. This work compares two approaches to building the basic shifter blocks: as selectors using carry chains and as multi-input multiplexers. Both approaches use only the FPGA's programmable logic. The work shows that multiplexer-based basic blocks require less than half as many FPGA logic slices and deliver 10-20% better performance than the selector-based ones.
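A back-of-the-envelope sketch (my own assumptions and example numbers, not the paper's measurements) of why wider basic blocks help: a shifter covering w positions needs ceil(log_r(w)) rows of r:1 multiplexer blocks, so moving from 2:1 selector-style blocks to 4:1 multiplexer blocks halves the number of rows on the critical path.

import math

# Level count for a logarithmic shifter built from radix-r multiplexer rows.
# Example width of 54 positions is my own choice (double-precision-sized).

def shifter_rows(width, radix):
    """Number of multiplexer rows for a `width`-position shift with r:1 blocks."""
    return math.ceil(math.log(width, radix))

if __name__ == "__main__":
    for radix, name in [(2, "2:1 selector-style blocks"),
                        (4, "4:1 multiplexer blocks")]:
        print(f"{name}: {shifter_rows(54, radix)} rows for a 54-position shift")
    # 2:1 blocks -> 6 rows, 4:1 blocks -> 3 rows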
... However, such blocks are more difficult to compose into larger units than DSP blocks. For the shifts, a better idea, investigated by Moctar et al. [53], would be to perform them in the reconfigurable routing network: that network is built from multiplexers whose control signals come from configuration bits. Allowing some of these multiplexers to optionally take their control signal from another wire would enable cheaper shifts. ...
Article
Full-text available
An often overlooked way to increase the efficiency of HPC on FPGA is to tailor, as tightly as possible, the arithmetic to the application. An ideally efficient implementation would, for each of its operations, toggle and transmit just the number of bits required by the application at this point. Conventional microprocessors, with their word-level granularity and fixed memory hierarchy, keep us away from this ideal. FPGAs, with their bit-level granularity, have the potential to get much closer. Therefore, reconfigurable computing should systematically investigate, in an application-specific way, non-standard precisions, but also non-standard number systems and non-standard arithmetic operations. The purpose of this chapter is to review these opportunities.
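One concrete instance of tailoring arithmetic to the application (a sketch using my own conventions, not a FloPoCo interface) is deriving, from the fixed-point formats of two operands, the exact format of their product, so the operator only toggles and transmits bits that can actually carry information.

# Application-tailored operator sizing (ad-hoc conventions): given two unsigned
# fixed-point formats (msb, lsb), compute the exact format of their product so
# no wider datapath than necessary is built.

def product_format(msb_a, lsb_a, msb_b, lsb_b):
    """Formats are bit-weight ranges: value = sum of bits * 2**weight, with
    weights from lsb to msb inclusive. Returns (msb, lsb, width) of an
    exact (unrounded) product."""
    lsb_p = lsb_a + lsb_b            # smallest weight that can appear
    msb_p = msb_a + msb_b + 1        # +1 accounts for the product carry-out
    return msb_p, lsb_p, msb_p - lsb_p + 1

if __name__ == "__main__":
    # e.g. an 8-bit input in [0, 1) times a 12-bit coefficient in [0, 4):
    print(product_format(-1, -8, 1, -10))   # (1, -18, 20) -> a 20-bit product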
Chapter
A shifter is a digital circuit that has two inputs and one output. It shifts one of the inputs by a number of bits defined by the second input. Although not strictly speaking arithmetic components themselves, shifters are pervasive in application-specific arithmetic. Their main uses are to multiply by a power of two, to convert fixed-point to floating-point or the other way around, to align the binary representations of two numbers before adding them, and to normalize a floating-point result. In the latter case, the shift amount is the result of a leading bit count. This chapter covers all these use cases, studying their requirements and proposing relevant architectures.
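These use cases can be tied together in a small behavioural model (an idealized toy with ad-hoc parameters, not one of the chapter's architectures): adding or subtracting two normalized magnitudes involves an alignment right shift of the smaller operand and a normalization left shift of the result, whose amount comes from a leading-zero count.

# Idealized model of the two shifts in a floating-point add: value = mant * 2**exp,
# with mant normalized into [2**(MANT_BITS-1), 2**MANT_BITS).

MANT_BITS = 8   # toy precision; IEEE formats use 24 or 53 bits

def fp_add_mag(exp_a, mant_a, exp_b, mant_b, subtract=False):
    """Add or subtract two normalized magnitudes; returns a renormalized (exp, mant)."""
    if (exp_a, mant_a) < (exp_b, mant_b):       # ensure |a| >= |b|
        exp_a, mant_a, exp_b, mant_b = exp_b, mant_b, exp_a, mant_a
    mant_b >>= (exp_a - exp_b)                  # alignment shift (right)
    s = mant_a - mant_b if subtract else mant_a + mant_b
    if s == 0:
        return 0, 0
    shift = MANT_BITS - s.bit_length()          # >0: left shift after cancellation
    if shift >= 0:                              # <0: one-bit right shift after carry-out
        s <<= shift
    else:
        s >>= -shift                            # truncating; rounding is omitted here
    return exp_a - shift, s

if __name__ == "__main__":
    print(fp_add_mag(0, 0b10010000, 0, 0b10000000))                 # carry-out: (1, 136)
    print(fp_add_mag(0, 0b10010000, 0, 0b10000000, subtract=True))  # cancellation: (-3, 128)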
Article
Despite the many advantages of Field-Programmable Gate Arrays (FPGAs), they fail to take over the IC design market from Application-Specific Integrated Circuits (ASICs) for high-volume and even medium-volume applications, as FPGAs come with significant cost in area, delay, and power consumption. There are two main reasons that FPGAs have a huge efficiency gap with ASICs: (1) FPGAs are extremely flexible, as they have fully programmable soft-logic blocks and routing networks, and (2) FPGAs have hard-logic blocks that are only usable by a subset of applications. In other words, current FPGAs have a heterogeneous structure comprised of flexible soft-logic and efficient hard-logic blocks, which suffer from inefficiency and inflexibility, respectively. The inefficiency of the soft-logic is a challenge for any application that is mapped to FPGAs, and the lack of flexibility in the hard-logic results in a waste of resources when an application cannot use the hard-logic. In this thesis, we approach the inefficiency problem of FPGAs by bridging the efficiency/flexibility gap of the hard- and soft-logic. The main goal of this thesis is to compromise on efficiency of the hard-logic for flexibility, on the one hand, and to compromise on flexibility of the soft-logic for efficiency, on the other hand. In other words, this thesis deals with two issues: (1) adding more generality to the hard-logic of FPGAs, and (2) improving the soft-logic by adapting it to the generic requirements of applications. In the first part of the thesis, we introduce new techniques that expand the functionality of the FPGA hard-logic. The hard-logic includes the dedicated resources that are tightly coupled with the soft-logic (i.e., adder circuitry and carry chains) as well as the stand-alone ones (i.e., DSP blocks). These specialized resources are intended to accelerate critical arithmetic operations that appear in the pre-synthesis representation of applications; we introduce mapping and architectural solutions which enable both types of hard-logic to support additional arithmetic operations. We first present a mapping technique that extends the application of FPGA carry chains to carry-save arithmetic, and then, to increase the generality of the hard-logic, we introduce novel architectures; using these architectures, more applications can take advantage of the FPGA hard-logic. In the second part of the thesis, we improve the efficiency of the FPGA soft-logic by exploiting the circuit patterns that emerge after logic synthesis, i.e., connection and logic patterns. Using these patterns, we design new soft-logic blocks that have less flexibility, but more efficiency, than current ones. In this part, we first introduce logic chains: fixed connections that are integrated between the soft-logic blocks of FPGAs and are well-suited to the long chains of logic that appear post-synthesis. Logic chains provide fast and low-cost connectivity, increase the bandwidth of the logic blocks without changing their interface with the routing network, and improve the logic density of soft-logic blocks. In addition to logic chains, and as a complementary contribution, we present a non-LUT soft-logic block that comprises simple, pre-connected cells. The structure of this logic block is inspired by the logic patterns that appear post-synthesis. This block has a complexity that is only linear in the number of inputs, it offers the potential for multiple independent outputs, and its delay is only logarithmic in the number of inputs.
Although this new block is less flexible than a LUT, we show (1) that effective mapping algorithms exist, (2) that, due to its simplicity, poor utilization is less of an issue than with LUTs, and (3) that a few LUTs can still be used in the rare unfavorable cases. In summary, to bridge the gap between FPGAs and ASICs, we approach the problem from two complementary directions, which balance flexibility and efficiency of the logic blocks of FPGAs. However, we were able to explore only a few design points in this thesis, and future work could focus on further exploration of the design space.
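The carry-save arithmetic mentioned above (mapped to carry chains in the thesis) can be illustrated by its textbook form, sketched below in Python: a row of full adders compresses three addends into a sum word and a carry word with no carry propagation between bit positions, so only one final carry-propagate addition is needed.

# Generic carry-save addition (textbook formulation, not the thesis's mapping):
# a row of 3:2 compressors turns three addends into a sum word and a carry word.

def carry_save_add(a, b, c, width=16):
    """Bitwise 3:2 compression of three `width`-bit integers."""
    mask = (1 << width) - 1
    sum_word = (a ^ b ^ c) & mask                              # per-bit sum
    carry_word = (((a & b) | (a & c) | (b & c)) << 1) & mask   # per-bit carry, weight x2
    return sum_word, carry_word

if __name__ == "__main__":
    a, b, c = 1234, 567, 89
    s, cy = carry_save_add(a, b, c)
    print((s + cy) == (a + b + c))   # True: one carry-propagate add resolves the result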
Chapter
An often overlooked way to increase the efficiency of HPC on FPGA is to exploit the bit-level flexibility of the target to match the arithmetic to the application. The ideal operator, for each elementary computation, should toggle and transmit just the number of bits required by the application at this point. FPGAs have the potential to get much closer to this ideal than microprocessors. Therefore, reconfigurable computing should systematically investigate non-standard precisions, but also non-standard number systems and non-standard operations which can be implemented efficiently on reconfigurable hardware. This chapter attempts to review these opportunities systematically.
Conference Paper
The rising complexity of verification has led to an increase in the use of FPGA prototyping, which can run at significantly higher operating frequencies and achieve much higher coverage than logic simulation. However, a key challenge is observability into these devices, which can be solved by embedding trace-buffers to record on-chip signal values. Rather than connecting a predetermined subset of circuit signals to dedicated trace-buffer inputs at compile-time, in this work we propose building a virtual overlay network to multiplex all on-chip signals to all on-chip trace-buffers. Subsequently, at debug-time, the designer can choose a signal subset for observation. To minimize its overhead, we build this network out of unused routing multiplexers and, by using optimal bipartite graph matching techniques, we show that any subset of on-chip signals can be connected to 80-90% of the maximum trace-buffer capacity in less than 50 seconds.
Article
FPGA technology is commonly used to prototype new digital designs before entering fabrication. Whilst these physical prototypes can operate many orders of magnitude faster than a logic simulator, a fundamental limitation is their lack of on-chip visibility when debugging. To counter this, trace-buffer-based instrumentation can be installed into the prototype, allowing designers to capture a predetermined window of signal data during live operation for offline analysis. However, instead of requiring the designer to recompile the entire circuit every time the window is modified, this article proposes that an overlay network be constructed using only spare FPGA routing multiplexers to connect all circuit signals through to the trace instruments. Thus, during debugging, designers only need to reconfigure this network instead of finding a new place-and-route solution. Furthermore, we describe how this network can deliver signals to both the trigger and trace units of these instruments, which are implemented simultaneously using dual-port RAMs. Our results show that new network configurations connecting any subset of signals to 80-90% of the available RAM capacity can be computed in less than 70 seconds, for a 100,000 LUT circuit, as many times as necessary. Our tool, QuickTrace, is available for download.
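To show the role that bipartite matching plays here (a generic augmenting-path matcher over an invented connectivity table, not the QuickTrace algorithm itself): each selected signal can reach only those trace-buffer inputs that spare routing multiplexers connect it to, and a maximum matching determines how many of the selected signals can be observed at once.

# Generic maximum bipartite matching (augmenting paths) applied to a toy
# connectivity table: which trace-buffer inputs each signal can reach through
# spare routing multiplexers. Illustrative only.

def max_matching(reachable):
    """reachable: dict signal -> set of trace-buffer inputs it can reach.
    Returns dict signal -> assigned input for a maximum matching."""
    owner = {}                                   # trace input -> signal using it

    def try_assign(sig, seen):
        for ti in reachable[sig]:
            if ti in seen:
                continue
            seen.add(ti)
            if ti not in owner or try_assign(owner[ti], seen):
                owner[ti] = sig
                return True
        return False

    for sig in reachable:
        try_assign(sig, set())
    return {sig: ti for ti, sig in owner.items()}

if __name__ == "__main__":
    # Hypothetical reachability of 4 signals to 3 trace-buffer inputs:
    reach = {"s0": {"t0", "t1"}, "s1": {"t0"}, "s2": {"t1", "t2"}, "s3": {"t2"}}
    assignment = max_matching(reach)
    print(len(assignment), "of", len(reach), "signals observable:", assignment)  # 3 of 4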
Conference Paper
While FPGA programmable routing networks are designed to connect logic block output pins to input pins, FPGA users and architects sometimes become motivated to create connections between pins and specific wires in an FPGA. We call these pin-to-wire connections, and they are motivated by several needs: first, a desire to employ routing-by-abutment, as commonly done in custom VLSI, to build modular, pre-laid-out systems; second, partial reconfiguration of FPGAs often requires that circuits in the FPGA connect by abutment; third, pin-to-wire routing is required to make use of resources that reside within the routing network itself, such as the plentiful multiplexers in the network, or even the configuration bits themselves. In this paper we attempt to understand and measure how difficult it is to form such pin-to-wire connections. We show, for example, under an experimental scenario close to routing-by-abutment, that the total routed wirelength increases by about 6% compared to a flat placement of the complete system, that the critical path delay increases by 15%, and that the router effort goes up by a factor of 3.5. To achieve this result, it is important to be careful in selecting the specific target wires. Overall, we demonstrate that while pin-to-wire connections definitely impose increased stress on the routing architecture and router, it is possible to route a reasonable number of them, so they can be used under some circumstances.