Xilinx FPGA slice utilizing 8 6-input LUTs: 7 2-to-1 multiplexers.
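
The multiplexer count in the caption follows from pairing LUT outputs level by level. The following is a minimal sketch, assuming the 2-to-1 multiplexers form a binary tree over the LUT outputs (MUXF7 combining two LUTs, MUXF8 combining two MUXF7 outputs, MUXF9 combining two MUXF8 outputs); the function name `mux_tree_levels` is illustrative and not taken from the source.

```python
def mux_tree_levels(num_luts):
    """Return the number of 2-to-1 muxes at each level of a binary tree
    that combines `num_luts` LUT outputs (assumed to be a power of two)."""
    levels = []
    signals = num_luts
    while signals > 1:
        signals //= 2           # each mux merges two signals into one
        levels.append(signals)  # muxes needed at this level
    return levels

ultrascale = mux_tree_levels(8)    # full slice: 8 LUTs
seven_series = mux_tree_levels(4)  # half slice: 4 LUTs

print(ultrascale, sum(ultrascale))
print(seven_series, sum(seven_series))
```

Under this assumption an 8-LUT slice needs 4 + 2 + 1 = 7 multiplexers, and a 4-LUT slice needs 2 + 1 = 3.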


Source publication
Article
Full-text available
There is strong interest in developing high-performance hardware sorting systems which can sort a set of elements as quickly as possible. The fastest of the current FPGA systems are sorting networks, in which sets of 2-sorters operate in parallel in each series stage of a multi-stage sorting process. A 2-sorter is a single-stage hardware block whic...

Contexts in source publication

Context 1
... these families, the basic logic blocks used for our N-sorter/N-filter designs are slices, such as the slice logic block shown in Fig. 10 [19]. The Ultrascale and Ultrascale+ slice block is the full block shown in Fig. 10. This slice has 8 6-input lookup tables (LUTs), and a set of 7 2-to-1 multiplexers which are used to combine LUT output signals. The 7-series slice has 4 LUTs and 3 2-to-1 multiplexers [20]. The bottom half of Fig. 10, without the MUXF9, shows the design resources available for 7-series designs. All of the designs detailed in this work ...
Context 2
... slices, such as the slice logic block shown in Fig. 10 [19]. The Ultrascale and Ultrascale+ slice block is the full block shown in Fig. 10. This slice has 8 6-input lookup tables (LUTs), and a set of 7 2-to-1 multiplexers which are used to combine LUT output signals. The 7-series slice has 4 LUTs and 3 2-to-1 multiplexers [20]. The bottom half of Fig. 10, without the MUXF9, shows the design resources available for 7-series designs. All of the designs detailed in this work only need the resources available in the 4-LUT 7-series slice, so they can be implemented in both 7-series devices and in devices in the Ultrascale and Ultrascale+ FPGA ...
Context 3
... are other hardware structures found in the slice blocks, but not shown in Fig. 10. These additional structures are not normally used in the design of N-sorters and N-filters, and are not discussed ...
Context 4
... bit of an output port value is implemented using the minimal number of Fig. 10 Since the 2-sorter and 3-sorter output bit multiplexers only require sorter input data and comparison result signals as their inputs, these two sorters have the minimum 2 series slices. Refer to rows 1 to 5 in Table 4 for data covering the number of series slice blocks used by each FPGA N ...
Context 5
... need for intermediate In_X_goes_to_Out_Y signals. The code in Fig. 9 creates a fast 4-max filter. The ge32 signal in Fig. 9 separates the Out_3 assignment into two sections. Each section has 3 ge* signals and 3 input port signals, so each section can be fit into a 6-input LUT. The outputs of the two LUTs become the inputs to a MUXF7, as shown in Fig. 10, and signal ge32 becomes the MUXF7 select line signal.

    Input Values =>            2  3  4  5  6  7  8  9
1   Comparison Signals Blk     X  X  X  X  X  X  X  X
2   1st MUX Select Line Blk                X  X  X  X
3   2nd MUX Select Line Blk                         X
4   Output MUX Block           X  X  X  X  X  X  X  X
    -------------------------------------------------
5   Total Series Slices        2  2  2  2  3  3  3  4
6   N-Max Equivalent Stages    1  1  1  1  ...
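
The MUXF7 split described in Context 5 can be illustrated with a small behavioral model. This is a sketch only: the code in Fig. 9 is not reproduced on this page, so the assignment of signals to the two sections (ge31, ge30, ge10 with In_3, In_1, In_0 when ge32 is true; ge21, ge20, ge10 with In_2, In_1, In_0 otherwise) is an assumption that is merely consistent with "3 ge* signals and 3 input port signals" per section. The function names are hypothetical.

```python
from itertools import product

def lut_section(ge_x1, ge_x0, ge10, x, in_1, in_0):
    """Maximum of (x, in_1, in_0), selected by three comparison results.
    For one output bit this depends on 3 ge signals and 3 data bits,
    which is why it can fit a single 6-input LUT."""
    if ge_x1 and ge_x0:
        return x
    return in_1 if ge10 else in_0

def max4(in_3, in_2, in_1, in_0):
    """Behavioral model of a 4-max filter split by ge32 (assumed layout)."""
    ge32 = in_3 >= in_2
    ge10 = in_1 >= in_0
    # Section used when ge32 is true: In_3 competes with In_1 and In_0.
    hi = lut_section(in_3 >= in_1, in_3 >= in_0, ge10, in_3, in_1, in_0)
    # Section used when ge32 is false: In_2 competes with In_1 and In_0.
    lo = lut_section(in_2 >= in_1, in_2 >= in_0, ge10, in_2, in_1, in_0)
    # The MUXF7 picks one LUT output, with ge32 as its select line.
    return hi if ge32 else lo

# Exhaustive check over small values, including ties.
assert all(max4(*v) == max(v) for v in product(range(4), repeat=4))
print(max4(5, 9, 2, 7))
```

Both sections are evaluated in parallel and ge32 only steers the final multiplexer, which mirrors the way two LUT outputs feed a MUXF7 in the slice.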

Similar publications

Article
Full-text available
Aiming at reducing hardware complexity and power consumption of multicarrier-based broadband transceivers, this contribution discusses and analyzes the implementation of clustered-orthogonal frequency division multiplexing (OFDM)-based transceiver in a field programmable gate array (FPGA) device. In this sense, several OFDM schemes covering baseban...

Citations

... The goal of the FPGA-based power-optimized impulse denoising image filter architecture is to reduce the switching activities as well as glitching activities during each iteration of the filtered output finding process [16]. The multi-kernel logic architecture is employed to pump several pixels to the sorting module or rank-generating module as pixels stream through the entire design [22]. ...
Article
Full-text available
In modern measurement systems, preserving accurate data integrity is essential for reliable and precise measurements. The presence of impulse noise in real-world image data can severely degrade the performance of vision algorithms. Median filtering has proven effective in attenuating impulse noise while retaining edge details. Nevertheless, traditional Median Filter architectures often exhibit high power consumption and hardware resource utilization, limiting their practical applicability in resource-constrained applications. Median filtering has proven effective for impulse noise removal while retaining critical measurement features. Nonetheless, existing architectures for Median Filters may suffer from high power consumption and hardware resource utilization, limiting their suitability for smart measurement systems. To address these limitations, we propose an energy-efficient Median Filter architecture tailored for impulse noise removal in smart measurement systems. The proposed design leverages novel Accumulation of Parallel Computing techniques to minimize pixel movements during the filtering process, leading to significant power savings. By utilizing parallel ring counters, selected pixels are kept stationary in specified cells, enabling efficient and rapid median computation. The proposed architecture is implemented using HDL Verilog and thoroughly evaluated on the ZYNQ FPGA platform. Through extensive simulation and synthesis, we assess the architecture’s power consumption and processing speed. The results demonstrate a remarkable 60% reduction in power dissipation, 52% reduction in area and a 2X increase in processing speed compared to existing architectures, making the design well-suited for power-constrained measurement applications. The proposed energy-efficient Median Filter architecture offers an effective and reliable solution for impulse noise removal in smart measurement systems. 
The design’s optimized power consumption and enhanced processing speed make it highly suitable for real-time and energy-efficient measurements, ensuring the integrity and accuracy of data in a range of measurement applications.
... The proposed classifier and sorter unit provide better hardware resource utilization and accuracy. The results show that the proposed filter is superior compared with recent techniques (Mahmoud et al. 2023a, 2023b; Kent and Pattichis 2021). ...
Article
Full-text available
The image quality is degraded during transmission and reception of images due to the impulse noise. Existing median filters are not compatible and may not give satisfactory results at high impulse noise densities for real-time image impulse de-noising applications. The proposed modular neural network using an Intelligent Impulse De-noising and Impulse Classifier-based hybrid median filter controls the various impulse noise densities by replacing the impulse pixel with a processed central pixel in the filtering window of the image. For removing high levels of noise, the proposed architecture is processed, and the left adjacent pixel is considered an approximation of the median output using min–max-average pooling. This paper presents four novel techniques in the newly proposed deep learning architecture to reduce the computation complexity. First, a technique is proposed to reduce the number of comparators. Second, the impulse classifier is based on a neural network-based MIN–MAX-average pooling scheme for high-level noise density. Third, the entire architecture is designed with both parallel and pipeline structures to improve the throughput. Finally, approximate adders eliminate the speed versus accuracy trade-off by simplifying the average operation of the optimized adder used by the mean filter. According to the implementation findings based on the targeted programmable VLSI-based device named ZYNQ FPGA, the proposed architecture could enable a hybrid median filter at a throughput of 3.75 ns and a power consumption of 478 mW.
... Table 1 shows that a 2-sorter's three distinct orderings are fully verified by the $2^2 = 4$ 0-1 vectors for the 2-sorter. The In_1==In_0 ordering is tested twice, in vectors 1 and 4. The authors have recently described the design of single-stage hardware N-sorters, with N > 2 [4], [5], and sorting networks which use these N-sorters [6]. Like the 2-sorter shown in Fig. 1, an N-sorter design is fixed, independent of the values of its inputs, and is therefore data-oblivious. ...
... The all-inputs-equal ordering becomes more serious for the single-stage N-sorters defined in [4], [5], with N> 2, as the input-to-output mapping is dependent on multiple comparisons, not just one comparison. In fact, it can be shown that the sort-3a 3-sorter in [18] fails a full all-inputs-equal test. ...
... The authors' data-oblivious sorting devices [4]-[6] are not limited to a particular type of hardware, but the target hardware used in these papers was modern AMD-Xilinx FPGAs. Traditional sorting networks, such as Kenneth Batcher's Odd-Even Merge Sort and Bitonic Merge Sort [7], are also easily implemented in these FPGAs. ...
Article
Full-text available
The well-known 0-1 principle for traditional data-oblivious sorting networks states that a network with L inputs can be fully validated with an input vector set consisting of all $2^L$ unique vectors containing only the values 0 and 1 in each vector’s L inputs. Researchers providing proofs of this principle tend to ignore the fact, which is emphasized here, that 0-1 vectors provide all distinct orderings of the two inputs to the 2-sorters which perform all of the sorting operations in the networks. The authors have recently described single-stage N-sorters, with N > 2, and multiway merge sorting networks which use these N-sorters. The new N-sorters and their networks are also data-oblivious. It is easily shown that 0-1 vectors are not sufficient to fully verify even a single-stage 3-sorter, the smallest such N-sorter, as the 0-1 vectors are unable to produce all distinct orderings of the 3 inputs. In order to verify these N-sorters, the authors propose the 1-N principle, which states that testing an N-sorter with all distinct orderings of N input values is necessary and sufficient to prove the N-sorter’s correctness. An algorithm is defined which generates a vector set consisting of all distinct orderings of N inputs, which is then used to fully verify the associated single-stage N-sorter. In order to validate the authors’ L-input sorting networks which use these N-sorters, methods have been created which produce validation vector sets that are dramatically reduced in size versus the unsorted vector sets they are derived from. For example, the ratios of the number of L! permutation vectors to the equivalent reduced vectors are >1,000,000 for L=12, and are much higher as L increases.
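
The gap between 0-1 vectors and "all distinct orderings" can be counted directly. The sketch below is not the authors' generation algorithm, which is not reproduced on this page. It assumes that a "distinct ordering" means a distinct pattern of <, =, > relations among the inputs, and it enumerates these by brute force over vectors with values 1 to N; the function names are illustrative.

```python
from itertools import product

def ordering(vector):
    """Canonical form of a vector's ordering: replace each value by its
    dense rank, so vectors with the same <, =, > relations coincide."""
    ranks = {v: r for r, v in enumerate(sorted(set(vector)), start=1)}
    return tuple(ranks[v] for v in vector)

def distinct_orderings(n):
    """All distinct orderings of n inputs (ties included), found by
    brute force over every vector with values 1..n."""
    return {ordering(v) for v in product(range(1, n + 1), repeat=n)}

def orderings_from_01(n):
    """Orderings that the 2**n vectors of 0s and 1s can produce."""
    return {ordering(v) for v in product((0, 1), repeat=n)}

for n in (2, 3):
    print(n, len(orderings_from_01(n)), len(distinct_orderings(n)))
```

Under this interpretation the 0-1 vectors reach all 3 orderings of 2 inputs but only 7 of the 13 orderings of 3 inputs, since two values cannot express a strict ordering of three distinct inputs.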
... The architecture of the insertion cell has changed as well. The solution for the modified insertion cell was inspired by [11] that presents a single-stage N sorter based on a comparison counting matrix. The suggested N -sorter does not require calculating the rank of each element, it is built according to the equations derived from the comparison counting result. ...
Preprint
With the current increase in the data produced by the Large Hadron Collider (LHC) at CERN, it becomes important to process this data in a corresponding manner. To begin with, to efficiently select events that contain relevant information from a massive flow of data. This is the task of the tau lepton decay triggering algorithm. The implementation is based on the High-Level Synthesis (HLS) approach that allows generating a hardware description of the design from the algorithm written in a high-level programming language like C++. HLS tools are intended to decrease the time and complexity of hardware design development, however, their capabilities are limited. The development of an efficient application requires substantial knowledge of the hardware design and HLS specifics. This paper presents the optimizations introduced to the algorithm that improved latency and area and more importantly solved the problems with the routing, making it possible to implement the algorithm on the FPGA fabric.
... However, the authors have recently introduced single-stage N-sorters, with 16 ≥ N ≥ 3, which are significantly faster than the previous state-of-the-art 2-sorter networks [2], [3]. For example, a single-stage 16-sorter design was shown to have a speedup of over 4.0 versus the equivalent custom-built 2-sorter network [3]. ...
... Single-stage devices have one set of inputs, one set of outputs, and whatever internal logic is required to present the output values in sorted order [2], [3]. Their speed superiority is due to the fact that they use only a single stage operation to sort an input list, versus the series of 2-sorter stages required by the comparable manually-designed sorting network. ...
... However, the various sorting devices used in these networks do not have the same propagation delay, as was shown for the AMD-Xilinx xcvu9p FPGA analyzed in our earlier work [2], [3]. An N_cols sorting network uses N-sorters from 2-sorters up to N_cols-sorters, and the [2], [3] data showed that N-sorter propagation delay tends to increase as N increases. In the following section, we will use our N-sorters from [3] to plot propagation delay vs N_total, using synthesis data for UCMS networks implemented in the same target FPGA used in [3]. ...
Preprint
Full-text available
In hardware such as FPGAs, Kenneth Batcher’s Odd-Even Merge Sort and Bitonic Merge Sort are the state-of-the-art methodologies used to quickly sort a list of more than 16 input values. Both sorting networks feature merges of 2 sorted input lists into a single sorted output list. For both, a full sort of 64 and 512 input values requires 21 and 45 serial stages, respectively. Multiway merge sorting networks described here require significantly fewer serial stages. For example, 8-way merge networks fully sort 64 and 512 input values in 9 and 20 serial stages, less than half the number of the respective 2-way networks. When the multiway merge sorting networks utilize the single-stage N-sorters recently defined by the authors, they are considerably faster than Batcher’s networks. In the AMD-Xilinx Ultrascale+ xcvu9p FPGA, the two 8-way merge networks have speedups of 1.85 and 1.74 versus the comparable 2-way networks. A fully pipelined 3-way merge network in this FPGA is capable of fully sorting 500 million lists of 729 unsorted 32-bit values in one second. In software, multiway merge methods are used to find the median of certain pixel rectangles in images, since the median can be determined in fewer stages than are required to fully sort the rectangle. However, the software still requires a series of many 2-sorter operations to find a median. These multiway merge median methods are dramatically sped up in hardware, where the authors’ new single-stage N-sorters and N-filters operate in parallel in each stage of the merge process.
... Carranza et al. [54] were able to maximize throughput with optimized 2D-FFT libraries and vector-based memory I/O. Kent et al. [63] increased the speed of 2x2 and 3x3 max-pooling with the help of FPGAs. ...
Preprint
Audio recordings of collaborative learning environments contain a constant presence of cross-talk and background noise. Dynamic speech recognition between Spanish and English is required in these environments. To eliminate the standard requirement of large-scale ground truth, the thesis develops a simulated dataset by transforming audio transcriptions into phonemes and using 3D speaker geometry and data augmentation to generate an acoustic simulation of Spanish and English speech. The thesis develops a low-complexity neural network for recognizing Spanish and English phonemes (available at github.com/muelitas/keywordRec). When trained on 41 English phonemes, 0.099 PER is achieved on Speech Commands. When trained on 36 Spanish phonemes and tested on real recordings of collaborative learning environments, a 0.7208 LER is achieved. Slightly better than Google's Speech-to-text 0.7272 LER, which used anywhere from 15 to 1,635 times more parameters and trained on 300 to 27,500 hours of real data as opposed to 13 hours of simulated audios.
... However, the authors have recently introduced single-stage N-sorters, with 16 ≥ N ≥ 3, which are significantly faster than the previous state-of-the-art 2-sorter networks [2], [3]. For example, a single-stage 16-sorter design was shown to have a speedup of over 4.0 versus the equivalent custom-built 2-sorter network [3]. ...
... Single-stage devices have one set of inputs, one set of outputs, and whatever internal logic is required to present the output values in sorted order [2], [3]. Their speed superiority is due to the fact that they use only a single stage operation to sort an input list, versus the series of 2-sorter stages required by the comparable manually-designed sorting network. ...
... If all of the sorting devices in the Fig. 6 sorting networks had the same propagation delay, Fig. 6 would also display the behavior of total propagation delay vs N total . However, the various sorting devices used in these networks do not have the same propagation delay, as was shown for the AMD-Xilinx xcvu9p FPGA analyzed in our earlier work [2], [3]. ...
Article
Full-text available
In hardware such as FPGAs, Kenneth Batcher’s Odd-Even Merge Sort and Bitonic Merge Sort are the state-of-the-art methodologies used to quickly sort a list of more than 16 input values. Both sorting networks feature merges of 2 sorted input lists into a single sorted output list. For both, a full sort of 64 and 512 input values requires 21 and 45 serial stages, respectively. Multiway merge sorting networks described here require significantly fewer serial stages. For example, 8-way merge networks fully sort 64 and 512 input values in 9 and 20 serial stages, less than half the number of the respective 2-way networks. When the multiway merge sorting networks utilize the single-stage $N$-sorters recently defined by the authors, they are considerably faster than Batcher’s networks. In the AMD-Xilinx Ultrascale+ xcvu9p FPGA, the two 8-way merge networks have speedups of 1.85 and 1.74 versus the comparable 2-way networks. A fully pipelined 3-way merge network in this FPGA is capable of fully sorting 500 million lists of 729 unsorted 32-bit values in one second. In software, multiway merge methods are used to find the median of certain pixel rectangles in images, since the median can be determined in fewer stages than are required to fully sort the rectangle. However, the software still requires a series of many 2-sorter operations to find a median. These multiway merge median methods are dramatically sped up in hardware, where the authors’ new single-stage $N$-sorters and $N$-filters operate in parallel in each stage of the merge process.
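
The 2-way stage counts quoted in this abstract match the standard depth of Batcher's networks, k(k+1)/2 serial stages for 2^k inputs. The sketch below checks that arithmetic and compares it with the 8-way figures, which are copied from the abstract rather than derived, since the 8-way stage formula is not given on this page. The function name is illustrative.

```python
def batcher_stages(num_inputs):
    """Serial stages in Batcher's odd-even or bitonic merge sort for a
    power-of-two input count: k * (k + 1) / 2 with k = log2(num_inputs)."""
    k = num_inputs.bit_length() - 1
    if 2 ** k != num_inputs:
        raise ValueError("input count must be a power of two")
    return k * (k + 1) // 2

# 8-way merge stage counts as reported in the abstract (not computed here).
reported_8way = {64: 9, 512: 20}

for n, stages_8way in reported_8way.items():
    two_way = batcher_stages(n)
    # Print: inputs, 2-way stages, 8-way stages, "less than half?" check.
    print(n, two_way, stages_8way, stages_8way < two_way / 2)
```

For 64 and 512 inputs this gives 21 and 45 stages for the 2-way networks, and both reported 8-way counts are below half of those values.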
... The authors have recently published a system for designing fast, stable, single-stage hardware sorting devices, which sort 3 or more input values [1]. A single-stage device has one set of input ports, one set of output ports, and whatever internal logic is needed in order to produce a fully sorted list of the input values at the output ports. ...
... Although the general design system introduced in [1] is not hardware-specific, a design logic block (LB) common to two FPGA families was used to show how the sorting devices are constructed. Using synthesis results for two products in those two families, it was shown that the single-stage sorters were significantly faster than the multistage sorting networks considered at that time to be the fastest state-of-the-art hardware sorting devices. ...
... that will produce faster single-stage sorting devices than the comparable fastest sorting networks. This work strives to build even faster and larger devices than were created in [1], and to build the new devices with a second hardware type, carry chain logic. To this end, the N-sorter/N-filter design system from [1] is now expanded, with the addition of product term splitting and Sum-of-Products (SOP) output multiplexer (mux) equations, and is then used to build fast sorting devices using carry chain logic. ...
Article
Full-text available
The authors’ recently published design system for the creation of single-stage N-sorter/N-filter sorting devices, which were implemented using a particular example hardware block, is here expanded and applied to a second hardware type, FPGA carry chain logic. Although several researchers have published applications which use FPGA carry chain logic, most do not use carry chain logic as is done here, and none of the applications target sorting devices. Prior to the introduction of the single-stage N-sorter/N-filter design system, the fastest state-of-the-art hardware devices which sorted more than 2 input values were multistage sorting networks, in which the sorting process is performed by one or more 2-sorters and 2-max/2-min filters, operating in each sequential stage. Using the authors’ original design system, single-stage N-sorters and N-filters were shown to be faster than the fastest comparable sorting networks when sorting 3 to 9 inputs. Here, product term splitting and a new Sum-of-Products output multiplexer equation are added to the design system, and this expanded design system is then implemented in carry chain logic to build faster and larger N-sorters, and much larger and still fast N-max/N-min filters. The new carry chain N-sorters are implemented in the FPGA used in the Amazon AWS EC2 F1 instance, which is one of the two example FPGAs utilized in the previous publication. A carry chain logic 16-sorter, not practical when using the original hardware block, has a speedup of 4.61 relative to the fastest 16-network. An example of the new, very large single-stage carry chain N-max filters is the 125-max $5\times 5\times 5$ CNN video max pooling filter, which operates in only 2.075 ns. A 2-stage 1024-max network, using single-stage 32-max carry chain filters, has a speedup of 2.85 versus the existing state-of-the-art 10-stage network of 2-max filters.
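
The stage counts in the last sentence of this abstract (2 stages of 32-max filters versus 10 stages of 2-max filters for 1024 inputs) can be reproduced with a simple software model of a max-filter tree. This is a behavioral sketch with an illustrative function name, not the authors' hardware design, and it models stage count only, not propagation delay.

```python
def max_network(values, fan_in):
    """Reduce `values` to their maximum using stages of `fan_in`-input
    max filters; return (maximum, number_of_stages)."""
    values = list(values)
    stages = 0
    while len(values) > 1:
        # One stage: every group of `fan_in` values feeds one max filter.
        values = [max(values[i:i + fan_in])
                  for i in range(0, len(values), fan_in)]
        stages += 1
    return values[0], stages

data = list(range(1024))
print(max_network(data, 32))  # stages needed with 32-max filters
print(max_network(data, 2))   # stages needed with 2-max filters
```

The model gives 2 and 10 stages respectively. The reported speedup of 2.85 is lower than the 5x reduction in stage count, which is consistent with each 32-max stage being slower than a 2-max stage.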