Xilinx FPGA slice utilizing 8 6-input LUTs: 7 2-to-1 multiplexers.

Source publication

FIGURE 4. N-sorter SystemVerilog port signal definitions: 7-sorter...

FIGURE 10. Xilinx FPGA slice utilizing 8 6-input LUTs: 7 2-to-1...

FIGURE 11. FPGA LUT N-sorter design and data flow.

FIGURE 12. 8-bit N-sorter vs N-network speedup and resource increase....

FIGURE 14. 8-bit N-max vs network filters speedup and resource...

Design, Implementation, and Analysis of High-Speed Single-Stage N-Sorters and N-Filters

Article

Dec 2020

There is strong interest in developing high-performance hardware sorting systems which can sort a set of elements as quickly as possible. The fastest of the current FPGA systems are sorting networks, in which sets of 2-sorters operate in parallel in each series stage of a multi-stage sorting process. A 2-sorter is a single-stage hardware block whic...

Context 1

... these families, the basic logic blocks used for our N-sorter/N-filter designs are slices, such as the slice logic block shown in Fig. 10 [19]. The Ultrascale and Ultrascale+ slice block is the full block shown in Fig. 10. This slice has 8 6-input lookup tables (LUTs) and a set of 7 2-to-1 multiplexers, which are used to combine LUT output signals. The 7-series slice has 4 LUTs and 3 2-to-1 multiplexers [20]. The bottom half of Fig. 10, without the MUXF9, shows the design resources available for 7-series designs. All of the designs detailed in this work ...

Context 2

... slices, such as the slice logic block shown in Fig. 10 [19]. The Ultrascale and Ultrascale+ slice block is the full block shown in Fig. 10. This slice has 8 6-input lookup tables (LUTs) and a set of 7 2-to-1 multiplexers, which are used to combine LUT output signals. The 7-series slice has 4 LUTs and 3 2-to-1 multiplexers [20]. The bottom half of Fig. 10, without the MUXF9, shows the design resources available for 7-series designs. All of the designs detailed in this work only need the resources available in the 4-LUT 7-series slice, so they can be implemented both in 7-series devices and in devices in the Ultrascale and Ultrascale+ FPGA ...
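The LUT and multiplexer counts quoted in this excerpt are consistent with a full binary tree of 2-to-1 multiplexers over the LUT outputs. The short Python sketch below only checks that arithmetic. The function name `mux_tree_levels` is hypothetical, and the level name MUXF8 is assumed from the usual Xilinx naming (the excerpts themselves mention only MUXF7 and MUXF9).

```python
def mux_tree_levels(num_luts):
    """Count the 2-to-1 muxes at each level of a full binary tree that
    combines `num_luts` LUT outputs (assumed to be a power of two)."""
    levels = []
    signals = num_luts
    while signals > 1:
        signals //= 2              # each 2-to-1 mux merges two signals into one
        levels.append(signals)
    return levels

# Assumed level names, following the usual Xilinx MUXF7/MUXF8/MUXF9 naming.
LEVEL_NAMES = ["MUXF7", "MUXF8", "MUXF9"]

for family, luts in [("7-series", 4), ("Ultrascale/Ultrascale+", 8)]:
    levels = mux_tree_levels(luts)
    detail = ", ".join(f"{count} x {name}" for count, name in zip(levels, LEVEL_NAMES))
    print(f"{family}: {luts} LUTs, {sum(levels)} muxes ({detail})")
```

Under these assumptions the 4-LUT slice needs 3 multiplexers and the 8-LUT slice needs 7, matching the figures given in the excerpt.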

Context 3

... are other hardware structures found in the slice blocks, but not shown in Fig. 10. These additional structures are not normally used in the design of N-sorters and N-filters, and are not discussed ...

Context 4

... bit of an output port value is implemented using the minimal number of Fig. 10 ... Since the 2-sorter and 3-sorter output bit multiplexers only require sorter input data and comparison result signals as their inputs, these two sorters have the minimum 2 series slices. Refer to rows 1 to 5 in Table 4 for data covering the number of series slice blocks used by each FPGA N ...

Context 5

... need for intermediate In_X_goes_to_Out_Y signals. The code in Fig. 9 creates a fast 4-max filter. The ge32 signal in Fig. 9 separates the Out_3 assignment into two sections. Each section has 3 ge* signals and 3 input port signals, so each section can be fit into a 6-input LUT. The outputs of the two LUTs become the inputs to a MUXF7, as shown in Fig. 10, and signal ge32 becomes the MUXF7 select line signal.

    Input Values =>             2  3  4  5  6  7  8  9
1   Comparison Signals Blk      X  X  X  X  X  X  X  X
2   1st MUX Select Line Blk                 X  X  X  X
3   2nd MUX Select Line Blk                          X
4   Output MUX Block            X  X  X  X  X  X  X  X
    --------------------------------------------------
5   Total Series Slices         2  2  2  2  3  3  3  4
6   N-Max Equivalent Stages     1  1  1  1  ...
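The two-LUT-plus-MUXF7 split described in this excerpt can be modelled in software. The Python sketch below is not the code from Fig. 9, which is not reproduced here. It assumes that Out_3 is the maximum output, that `geXY` means `In_X >= In_Y`, and that each section picks the largest of three inputs. The helper names `lut_section` and `max4` are hypothetical.

```python
from itertools import product

def lut_section(ge_t1, ge_t0, ge_10, bit_t, bit_1, bit_0):
    """Model of one 6-input section: 3 comparison signals plus 3 data bits
    select one output bit (the bit of the largest of In_T, In_1, In_0)."""
    if ge_t1 and ge_t0:
        return bit_t                    # In_T is at least as large as both others
    return bit_1 if ge_10 else bit_0    # otherwise the max is In_1 or In_0

def max4(in_0, in_1, in_2, in_3, width=8):
    """Bit-level model of a single-stage 4-max filter (assumed structure)."""
    # All pairwise comparison signals are computed in parallel.
    ge10, ge20, ge30 = in_1 >= in_0, in_2 >= in_0, in_3 >= in_0
    ge21, ge31 = in_2 >= in_1, in_3 >= in_1
    ge32 = in_3 >= in_2
    out = 0
    for k in range(width):
        b0, b1, b2, b3 = ((v >> k) & 1 for v in (in_0, in_1, in_2, in_3))
        lut_hi = lut_section(ge31, ge30, ge10, b3, b1, b0)  # used when In_3 >= In_2
        lut_lo = lut_section(ge21, ge20, ge10, b2, b1, b0)  # used when In_2 > In_3
        out |= (lut_hi if ge32 else lut_lo) << k            # ge32 plays the MUXF7 select
    return out

# Exhaustive check on 2-bit inputs.
assert all(max4(*v, width=2) == max(v) for v in product(range(4), repeat=4))
```

Each call to `lut_section` takes exactly six Boolean inputs, which is why each section would fit in one 6-input LUT under these assumptions.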

The top level architecture of the OFDM transceiver on the FPGA platform

The implementation of the clustered-OFDM-based transceiver on an FPGA device: A comprehensive comparison

Article

Feb 2021

Aiming at reducing hardware complexity and power consumption of multicarrier-based broadband transceivers, this contribution discusses and analyzes the implementation of clustered-orthogonal frequency division multiplexing (OFDM)-based transceiver in a field programmable gate array (FPGA) device. In this sense, several OFDM schemes covering baseban...

Energy-Efficient Median Filter Core Architecture for Impulse Noise Removal in Smart Measurement Systems

Article

Jan 2024

In modern measurement systems, preserving accurate data integrity is essential for reliable and precise measurements. The presence of impulse noise in real-world image data can severely degrade the performance of vision algorithms. Median filtering has proven effective in attenuating impulse noise while retaining edge details. Nevertheless, traditional Median Filter architectures often exhibit high power consumption and hardware resource utilization, limiting their practical applicability in resource-constrained applications. Median filtering has proven effective for impulse noise removal while retaining critical measurement features. Nonetheless, existing architectures for Median Filters may suffer from high power consumption and hardware resource utilization, limiting their suitability for smart measurement systems. To address these limitations, we propose an energy-efficient Median Filter architecture tailored for impulse noise removal in smart measurement systems. The proposed design leverages novel Accumulation of Parallel Computing techniques to minimize pixel movements during the filtering process, leading to significant power savings. By utilizing parallel ring counters, selected pixels are kept stationary in specified cells, enabling efficient and rapid median computation. The proposed architecture is implemented using HDL Verilog and thoroughly evaluated on the ZYNQ FPGA platform. Through extensive simulation and synthesis, we assess the architecture’s power consumption and processing speed. The results demonstrate a remarkable 60% reduction in power dissipation, 52% reduction in area and a 2X increase in processing speed compared to existing architectures, making the design well-suited for power-constrained measurement applications. The proposed energy-efficient Median Filter architecture offers an effective and reliable solution for impulse noise removal in smart measurement systems. 
The design’s optimized power consumption and enhanced processing speed make it highly suitable for real-time and energy-efficient measurements, ensuring the integrity and accuracy of data in a range of measurement applications.

Efficient VLSI implementation of modular neural network based hybrid median filter

Article

Jul 2023
SOFT COMPUT

Image quality is degraded during transmission and reception by impulse noise. Existing median filters are not compatible and may not give satisfactory results at high impulse noise densities for real-time image impulse de-noising applications. The proposed modular neural network using an Intelligent Impulse De-noising and Impulse Classifier-based hybrid median filter controls the various impulse noise densities by replacing the impulse pixel with a processed central pixel in the filtering window of the image. For removing high levels of noise, the proposed architecture is processed, and the left adjacent pixel is considered an approximation of the median output using min–max-average pooling. This paper presents four novel techniques in the newly proposed deep learning architecture to reduce the computational complexity. First, a technique is proposed for reducing the number of comparators. Second, the impulse classifier is based on a neural network-based MIN–MAX-average pooling scheme for high-level noise density. Third, the entire architecture is designed with both parallel and pipeline structures to improve the throughput. Finally, approximate adders eliminate the speed versus accuracy trade-off by simplifying the average operation of the optimized adder used by the mean filter. According to the implementation findings based on the targeted programmable VLSI-based device named ZYNQ FPGA, the proposed architecture could enable a hybrid median filter at a throughput of 3.75 ns and a power consumption of 478 mW.

Beyond 0-1: The 1-N Principle and Fast Validation of N-Sorter Sorting Networks

Article

Jan 2023

The well-known 0-1 principle for traditional data-oblivious sorting networks states that a network with L inputs can be fully validated with an input vector set consisting of all 2^L unique vectors containing only the values 0 and 1 in each vector’s L inputs. Researchers providing proofs of this principle tend to ignore the fact, which is emphasized here, that 0-1 vectors provide all distinct orderings of the two inputs to the 2-sorters which perform all of the sorting operations in the networks. The authors have recently described single-stage N-sorters, with N > 2, and multiway merge sorting networks which use these N-sorters. The new N-sorters and their networks are also data-oblivious. It is easily shown that 0-1 vectors are not sufficient to fully verify even a single-stage 3-sorter, the smallest such N-sorter, as the 0-1 vectors are unable to produce all distinct orderings of the 3 inputs. In order to verify these N-sorters, the authors propose the 1-N principle, which states that testing an N-sorter with all distinct orderings of N input values is necessary and sufficient to prove the N-sorter’s correctness. An algorithm is defined which generates a vector set consisting of all distinct orderings of N inputs, which is then used to fully verify the associated single-stage N-sorter. In order to validate the authors’ L-input sorting networks which use these N-sorters, methods have been created which produce validation vector sets that are dramatically reduced in size versus the unsorted vector sets they are derived from. For example, the ratios of the number of L! permutation vectors to the equivalent reduced vectors are >1,000,000 for L=12, and are much higher as L increases.
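The claim that 0-1 vectors cannot produce every ordering of 3 inputs can be checked by brute force. The Python sketch below is not the authors' generation algorithm. It assumes that an "ordering" is characterized by the pairwise relations (less than, equal, greater than) among the inputs, so ties count as distinct orderings, and that test values are drawn from 1 to N. The function names are hypothetical.

```python
from itertools import product

def ordering_pattern(vector):
    """Map a vector to its dense-rank pattern, e.g. (5, 2, 5) -> (1, 0, 1).
    Two vectors share a pattern exactly when all pairwise relations agree."""
    ranks = {value: rank for rank, value in enumerate(sorted(set(vector)))}
    return tuple(ranks[value] for value in vector)

def distinct_orderings(n, values):
    """All distinct ordering patterns reachable with n inputs drawn from values."""
    return {ordering_pattern(v) for v in product(values, repeat=n)}

n = 3
all_orderings = distinct_orderings(n, range(1, n + 1))   # values 1..N (assumption)
zero_one_orderings = distinct_orderings(n, (0, 1))       # classic 0-1 vectors
missing = sorted(all_orderings - zero_one_orderings)

print(len(all_orderings), len(zero_one_orderings))
print(missing)
```

Under this definition, 3 inputs have 13 distinct orderings, while the eight 0-1 vectors reach only 7 of them. The 6 that are missed are the orderings in which all three inputs differ.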

HLS-based Optimization of Tau Triggering Algorithm for LHC: a case study

Preprint

Dec 2022

With the current increase in the data produced by the Large Hadron Collider (LHC) at CERN, it becomes important to process this data in a corresponding manner. The first step is to efficiently select events that contain relevant information from a massive flow of data. This is the task of the tau lepton decay triggering algorithm. The implementation is based on the High-Level Synthesis (HLS) approach that allows generating a hardware description of the design from the algorithm written in a high-level programming language like C++. HLS tools are intended to decrease the time and complexity of hardware design development; however, their capabilities are limited. The development of an efficient application requires substantial knowledge of the hardware design and HLS specifics. This paper presents the optimizations introduced to the algorithm that improved latency and area and, more importantly, solved the problems with the routing, making it possible to implement the algorithm on the FPGA fabric.

Design of High-Speed Multiway Merge Sorting Networks Using Fast Single-Stage N-Sorters and N-Filters

Preprint

May 2022

In hardware such as FPGAs, Kenneth Batcher’s Odd-Even Merge Sort and Bitonic Merge Sort are the state-of-the-art methodologies used to quickly sort a list of more than 16 input values. Both sorting networks feature merges of 2 sorted input lists into a single sorted output list. For both, a full sort of 64 and 512 input values requires 21 and 45 serial stages, respectively. Multiway merge sorting networks described here require significantly fewer serial stages. For example, 8-way merge networks fully sort 64 and 512 input values in 9 and 20 serial stages, less than half the number of the respective 2-way networks. When the multiway merge sorting networks utilize the single-stage N-sorters recently defined by the authors, they are considerably faster than Batcher’s networks. In the AMD-Xilinx Ultrascale+ xcvu9p FPGA, the two 8-way merge networks have speedups of 1.85 and 1.74 versus the comparable 2-way networks. A fully pipelined 3-way merge network in this FPGA is capable of fully sorting 500 million lists of 729 unsorted 32-bit values in one second. In software, multiway merge methods are used to find the median of certain pixel rectangles in images, since the median can be determined in fewer stages than are required to fully sort the rectangle. However, the software still requires a series of many 2-sorter operations to find a median. These multiway merge median methods are dramatically sped up in hardware, where the authors’ new single-stage N-sorters and N-filters operate in parallel in each stage of the merge process.

Spanish and English Phoneme Recognition by Training on Simulated Classroom Audio Recordings of Collaborative Learning Environments

Preprint

Feb 2022

Mario Esparza

Audio recordings of collaborative learning environments contain a constant presence of cross-talk and background noise. Dynamic speech recognition between Spanish and English is required in these environments. To eliminate the standard requirement of large-scale ground truth, the thesis develops a simulated dataset by transforming audio transcriptions into phonemes and using 3D speaker geometry and data augmentation to generate an acoustic simulation of Spanish and English speech. The thesis develops a low-complexity neural network for recognizing Spanish and English phonemes (available at github.com/muelitas/keywordRec). When trained on 41 English phonemes, 0.099 PER is achieved on Speech Commands. When trained on 36 Spanish phonemes and tested on real recordings of collaborative learning environments, a 0.7208 LER is achieved. This is slightly better than the 0.7272 LER of Google's Speech-to-text, which used anywhere from 15 to 1,635 times more parameters and was trained on 300 to 27,500 hours of real data, as opposed to 13 hours of simulated audio.

Design of High-Speed Multiway Merge Sorting Networks Using Fast Single-Stage N-Sorters and N-Filters

Article

Jan 2022

In hardware such as FPGAs, Kenneth Batcher’s Odd-Even Merge Sort and Bitonic Merge Sort are the state-of-the-art methodologies used to quickly sort a list of more than 16 input values. Both sorting networks feature merges of 2 sorted input lists into a single sorted output list. For both, a full sort of 64 and 512 input values requires 21 and 45 serial stages, respectively. Multiway merge sorting networks described here require significantly fewer serial stages. For example, 8-way merge networks fully sort 64 and 512 input values in 9 and 20 serial stages, less than half the number of the respective 2-way networks. When the multiway merge sorting networks utilize the single-stage $N$-sorters recently defined by the authors, they are considerably faster than Batcher’s networks. In the AMD-Xilinx Ultrascale+ xcvu9p FPGA, the two 8-way merge networks have speedups of 1.85 and 1.74 versus the comparable 2-way networks. A fully pipelined 3-way merge network in this FPGA is capable of fully sorting 500 million lists of 729 unsorted 32-bit values in one second. In software, multiway merge methods are used to find the median of certain pixel rectangles in images, since the median can be determined in fewer stages than are required to fully sort the rectangle. However, the software still requires a series of many 2-sorter operations to find a median. These multiway merge median methods are dramatically sped up in hardware, where the authors’ new single-stage $N$-sorters and $N$-filters operate in parallel in each stage of the merge process.
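The 2-way stage counts quoted in this abstract follow from the standard depth formula for Batcher's networks, k(k+1)/2 stages for 2^k inputs. The Python sketch below checks the quoted numbers. The 8-way stage counts are copied from the abstract, not derived, since the abstract gives no formula for them. The function name `batcher_stages` is hypothetical.

```python
def batcher_stages(num_inputs):
    """Serial stages of Batcher's odd-even or bitonic merge sort
    for a power-of-two number of inputs: k * (k + 1) / 2 with k = log2(n)."""
    k = num_inputs.bit_length() - 1
    if num_inputs < 2 or (1 << k) != num_inputs:
        raise ValueError("num_inputs must be a power of two, at least 2")
    return k * (k + 1) // 2

# 8-way merge stage counts as quoted in the abstract (not derived here).
EIGHT_WAY_STAGES = {64: 9, 512: 20}

for n, multiway in EIGHT_WAY_STAGES.items():
    two_way = batcher_stages(n)
    print(f"{n} inputs: 2-way {two_way} stages, 8-way {multiway} stages, "
          f"ratio {multiway / two_way:.2f}")

# Throughput implied by "500 million lists of 729 values in one second".
values_per_second = 500_000_000 * 729
print(f"{values_per_second:,} values per second")
```

Both ratios come out below 0.5, which agrees with the abstract's "less than half" statement, and the quoted pipelined rate corresponds to 364.5 billion 32-bit values per second.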

Use of Carry Chain Logic and Design System Extensions to Construct Significantly Faster and Larger Single-Stage N-Sorters and N-Filters

Article

Jan 2022

The authors’ recently published design system for the creation of single-stage N-sorter/N-filter sorting devices, which were implemented using a particular example hardware block, is here expanded and applied to a second hardware type, FPGA carry chain logic. Although several researchers have published applications which use FPGA carry chain logic, most do not use carry chain logic as is done here, and none of the applications target sorting devices. Prior to the introduction of the single-stage N-sorter/N-filter design system, the fastest state-of-the-art hardware devices which sorted more than 2 input values were multistage sorting networks, in which the sorting process is performed by one or more 2-sorters and 2-max/2-min filters, operating in each sequential stage. Using the authors’ original design system, single-stage N-sorters and N-filters were shown to be faster than the fastest comparable sorting networks when sorting 3 to 9 inputs. Here, product term splitting and a new Sum-of-Products output multiplexer equation are added to the design system, and this expanded design system is then implemented in carry chain logic to build faster and larger N-sorters, and much larger and still fast N-max/N-min filters. The new carry chain N-sorters are implemented in the FPGA used in the Amazon AWS EC2 F1 instance, which is one of the two example FPGAs utilized in the previous publication. A carry chain logic 16-sorter, not practical when using the original hardware block, has a speedup of 4.61 relative to the fastest 16-network. An example of the new, very large single-stage carry chain N-max filters is the 125-max $5\times 5\times 5$ CNN video max pooling filter, which operates in only 2.075 ns. A 2-stage 1024-max network, using single-stage 32-max carry chain filters, has a speedup of 2.85 versus the existing state-of-the-art 10-stage network of 2-max filters.
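The stage counts in the last sentence of this abstract are consistent with a simple reduction tree, in which each stage of K-max filters reduces the number of candidate values by a factor of K. The Python sketch below checks only that arithmetic. It says nothing about delay, and the quoted speedup of 2.85 is taken from the abstract, not derived. The function name `filter_stages` is hypothetical.

```python
def filter_stages(num_inputs, fan_in):
    """Stages needed to reduce num_inputs values to one maximum when every
    stage uses max filters that each accept up to fan_in values."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    stages = 0
    remaining = num_inputs
    while remaining > 1:
        remaining = -(-remaining // fan_in)   # ceiling division: filters in this stage
        stages += 1
    return stages

print(filter_stages(1024, 2))    # network of 2-max filters
print(filter_stages(1024, 32))   # network of single-stage 32-max filters
print(filter_stages(125, 125))   # one single-stage 125-max filter (5 x 5 x 5 window)
```

With these inputs the model gives 10 stages for 2-max filters and 2 stages for 32-max filters, matching the figures in the abstract.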

Pruning Binarized Neural Networks Enables Low-Latency, Low-Power FPGA-Based Handwritten Digit Classification

Conference Paper

Sep 2023
