Figure 2 - uploaded by Dave Strenski
Virtex-4 FPGA logic slice: LUTs, RAM...

Source publication
Article
Full-text available
FPGA hardware and tools (VHDL, Viva, MitrionC and CHiMPS) are described. FPGA performance is evaluated on two Cray XD1 systems (Virtex-II Pro 50 and Virtex-4 LX160) for human genome (DNA and protein) sequence comparisons for a computational biology code (FASTA). Scalable FPGA speedups of 50X (Virtex-II) and 100X (Virtex-4) over a 2.2 GHz Opteron we...

Context in source publication

Context 1
... reconfigurable, DNA, RNA, Smith-Waterman, Cray, FASTA, XD1, Virtex, OpenFPGA

This paper describes Field-Programmable Gate Arrays (FPGAs), several tools used to program them, and how a 100X speedup was achieved for a science application. Remarkable innovations in computer technology [1-2] are fulfilling NASA future projections [3] for faster science and engineering computations. One innovation in the forefront is to harness FPGAs to accelerate High-Performance Computing (HPC) applications by one or more orders of magnitude over traditional microprocessors. FPGAs were invented in 1984 by Ross Freeman, co-founder of Xilinx Inc. FPGAs are extremely flexible and dominated by interconnections to thousands of embedded functions (Fig. 1, left) like adders, multipliers, memory, and logic slices (Fig. 2). They perform high-speed computations and communications (e.g., HyperTransport) in parallel via digital logic: LookUp Tables (LUTs), registers, RAM, etc. Unlike “fixed” microprocessors, FPGAs are reconfigurable “on the fly” by users in the “field”, thus “field programmable”. The Virtex-4 is available with one or more PowerPC (PPC) processors on the chip (Fig. 1, right). The rapidly growing (15-20%/year) $2B FPGA market (focused on high-volume communications) is dominated by Xilinx and Altera. Aerospace and High-Performance Embedded Computing (HPEC) users are rapidly expanding their FPGA use. Although HPC sales are small (< 1%), FPGA designers are open to HPC requirements for their next-generation designs.

FPGA layout is extremely regular compared to microprocessors, simplifying fabrication and allowing FPGAs to be among the first chips to reduce feature sizes (90nm => 65nm => 45nm). For space and flight use, this regularity, together with triply-redundant code, limits radiation damage (e.g., the NASA Mars Rovers). At each clock cycle, FPGA algorithms (when coded to maximize the number of parallel operations) use nearly 100% of their silicon, compared to less efficient microprocessors, which use < 2% of their silicon while drawing 10x the FPGA's power to perform only one or two operations. Figure 3 shows several key FPGA characteristics. Unlike microprocessors, FPGAs continue to advance at Moore's Law rates and have far to go before reaching logic-cell and speed limits (Fig. 3, left). FPGA clock speeds (often 100-200 MHz) have far to go before facing the heating issues that drove microprocessors to multi-core chips with reduced clock speeds. When FPGA applications are programmed to maximize parallelism, their computation speed far exceeds that of microprocessors (Fig. 3, right). Being high-speed communications devices, their memory and IO bandwidths also significantly exceed those of microprocessors (Fig. 3).

As FPGAs were developed by logic designers, they are traditionally programmed using circuit-design languages such as VHDL and Verilog. These languages require the knowledge and training of a logic designer, take months to learn, and take far longer to code efficiently. Even once this skill is acquired, VHDL or Verilog coding is extremely arduous, taking months to develop early prototypes and often years to perfect and optimize. Unlike compilation for HPC systems, FPGA code development is greatly slowed by the additional lengthy steps required to synthesize, place, and route the circuit. Once the time is taken to code an application in VHDL, though, its FPGA performance is excellent.
In particular, applications using basic integer or logic operations (compare, add, multiply), such as DNA sequence comparisons, cryptography, or chess logic, run extremely well on FPGAs. As floating-point and double-precision applications rapidly exhausted the number of slices available on early FPGAs, FPGAs were often avoided for high-precision calculations. However, this situation has changed for current Xilinx FPGAs (Virtex-4 and Virtex-5), which have sufficient logic to fit about 80 64-bit multiply units [2]. While early FPGAs had sufficient capability to be well suited for special-purpose HPEC, their use for general-purpose HPC was initially restricted to a first generation of low-end reconfigurable supercomputers (e.g., Starbridge Systems, SRC, Cray XD1). A lack of high-speed IO and of the infrastructure (compilers, libraries) needed to support general-purpose supercomputer applications, including legacy codes, is typical of this early generation. However, this situation is rapidly changing with the latest generation of reconfigurable supercomputers and the FPGAs they use. DRC Computer, XtremeData, and Xilinx (in collaboration with Intel) provide modules with the latest FPGAs that fit in microprocessor sockets and use the same high-speed communications links. Cray selected DRC's module (Fig. 4) to accelerate its XT line of supercomputers.

Accelerating HPC applications is so critical that many alternatives have entered the marketplace. Even though many legacy physics-based codes were written in sequential Fortran over 30 years ago, they have remarkably survived several HPC generations: vector (via compilers), parallel (via MPI, OpenMP), and now the first stages of multi-core microprocessors. Some surmise they may suffer severe performance degradation or even require significant rewrites to fully exploit 8 or more cores/chip. Major chip vendors (Intel and AMD) have vigorous efforts to accommodate accelerators, with their primary focus on FPGAs as a way to regain performance. As multi-core microprocessors face looming power, cooling, size, and IO challenges, FPGAs are increasingly attractive.

Accelerator Options: Three other accelerator options are available to HPC architects: Cell (IBM), graphics (GPU), and array (ClearSpeed) processors. Like FPGAs, Cell and GPUs have vast commercial markets (video games and graphics) driving down costs, promoting competition, and stimulating advances, making them increasingly attractive to HPC. Array processors, however, are custom devices requiring amortization over relatively few users. GPUs require significant power/cooling and have complex programming and data-precision issues to solve before they can enter the HPC market. Coding the 8+1 Cell processors is likely to be considerably more difficult than programming FPGAs in VHDL or Verilog, which already have a large user base. As FPGA hardware advances, tools and software are simplifying their use for HPC: Viva, MitrionC, and Xilinx's CHiMPS (all discussed later), as well as DSPlogic, ImpulseC, Celoxica, Aldec, and others. The authors are testing CHiMPS for HPC ...
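Since the paper's FASTA workload is built on exactly these integer compare/add/max operations, a small software sketch helps show what gets mapped onto FPGA logic. Below is a minimal Smith-Waterman local-alignment scoring kernel in Python; it is not the authors' FPGA implementation, and the match/mismatch/gap scores are illustrative assumptions.

```python
# Minimal Smith-Waterman scoring kernel (linear gap penalty). Each cell
# update is a few integer adds and compares, the kind of operation an
# FPGA can replicate massively in parallel. Scores are illustrative.

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]      # scoring matrix
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])          # best local-alignment score
    return best

print(smith_waterman_score("GATTACA", "GCATGCU"))
```

Every cell update is a handful of integer additions and comparisons, which is why an FPGA coded to maximize parallelism can instantiate many such cells and update them all in a single clock cycle.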

Similar publications

Article
Full-text available
In this paper, the concept of Alamouti's transmit diversity technique in a MIMO system is implemented based on a Hardware Description Language (HDL), using a Xilinx Field Programmable Gate Array (FPGA). The proposed design is based on Alamouti's transmit diversity scheme, a space-time block code (STBC) with two transmit antennas and four receive an...
Article
Full-text available
This paper presents a new fast motion estimation (ME) algorithm targeting high-resolution digital videos, together with its efficient hardware architecture design. The new Dynamic Multipoint Diamond Search (DMPDS) algorithm is a fast algorithm which increases the ME quality when compared with other fast ME algorithms. The DMPDS achieves a better digital video...
Thesis
Full-text available
The goal of this work is to implement a fuzzy controller, synthesized on an FPGA (Field Programmable Gate Array) architecture, to be deployed in a liquid-level control system. The system is described using the VHDL hardware description language.
Conference Paper
Full-text available
This article presents an analysis of the assembly instructions of the SHA-3 BLAKE algorithm running on an ARM® processor, with the aim of developing a BLAKE-specific processor in an FPGA. For this purpose, we used a C implementation of the algorithm, from which we could discover which instructions were executed and how frequently t...
Conference Paper
Full-text available
MultiBand OFDM (MB-OFDM) UWB [1] is a promising short-range wireless technology for high-data-rate communications up to 480 Mbps. The UWB receiver uses a Viterbi decoder that must support the highest data rate of 480 Mbps. To achieve such high data rates, a sliding-block Viterbi decoder is a good design candidate. In this paper, we analyze the trade...

Citations

... A prescribed triplet in the query sequence is then matched against the triplets in the database sequence that score at least 11 when the three sets of amino acids are compared [12][13][14][15]. In the second stage, using amino-acid substitution scores, BLAST extends the initial word matches into HSPs. ...
Article
Full-text available
With the rapid development of DNA sequencing, the rate of DNA data generation exceeds the rate at which the data can be computationally processed. Standard sequence alignment techniques on existing computational machines cannot keep up with the exponentially growing requirements. Accelerating the algorithms on FPGAs improves performance in comparison to other platforms. This paper defines and categorizes the present sequence alignment algorithms and implements them on FPGA boards. We also present a comparison of different types of sequence alignment algorithms, survey the current alternatives, and make the case for further accelerating sequence alignment on FPGAs.
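The first BLAST stage quoted in the citing context above (word seeding against a score threshold) can be sketched compactly. The Python sketch below illustrates that stage only; the identity-based pair_score is a hypothetical stand-in for the BLOSUM62 substitution matrix real BLAST uses, and the threshold of 11 comes from the quoted context.

```python
# Sketch of BLAST's word-seeding stage: find database 3-mers whose
# substitution score against a query 3-mer meets the threshold. The
# scorer below is a toy stand-in, not a real substitution matrix.

def pair_score(x, y):
    # Hypothetical stand-in for a BLOSUM62 lookup.
    return 5 if x == y else -1

def word_score(w1, w2):
    return sum(pair_score(x, y) for x, y in zip(w1, w2))

def seed_hits(query, db, k=3, threshold=11):
    hits = []
    for qi in range(len(query) - k + 1):
        qword = query[qi:qi + k]
        for di in range(len(db) - k + 1):
            if word_score(qword, db[di:di + k]) >= threshold:
                hits.append((qi, di))   # seed for stage-2 HSP extension
    return hits

print(seed_hits("MKVLA", "AMKVT"))      # [(0, 1)]: "MKV" matches at db offset 1
```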
... These signal processing applications, which require intense computation and simultaneous processing of large amounts of data in real time, can make use of a large-scale field-programmable gate array (FPGA) platform for hardware acceleration. A modern high-capacity FPGA is an attractive alternative for accelerating scientific and engineering applications [15] due to the possible utilization of massive parallelism. ...
Article
Full-text available
This paper presents a novel real-time compressive sensing (CS) reconstruction which employs a high-density field-programmable gate array (FPGA) for hardware acceleration. Traditionally, CS can be implemented using a high-level computer language on a personal computer (PC) or on multicore platforms such as graphics processing units (GPUs) and digital signal processors (DSPs). However, reconstruction algorithms are computationally demanding, and software implementations of these algorithms are extremely slow and power consuming. In this paper, the orthogonal matching pursuit (OMP) algorithm is refined to solve the sparse decomposition optimization for a partial Fourier dictionary, which is commonly adopted in radar imaging and detection applications. OMP reconstruction can be divided into two main stages: an optimization stage that finds the most closely correlated vectors, and a least-squares problem. For a large-scale dictionary, the implementation of the correlation is time consuming, since it often requires a large number of matrix multiplications. Solving the least-squares problem likewise needs a scalable matrix decomposition operation. To solve these problems efficiently, the correlation optimization is implemented with the fast Fourier transform (FFT), and the large-scale least-squares problem is solved with the Conjugate Gradient (CG) technique. The proposed method is verified by an FPGA (Xilinx Virtex-7 XC7VX690T) realization, revealing its effectiveness in real-time applications.
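To make the two stages named in this abstract concrete, here is a minimal software sketch of OMP in Python: a correlation step (which the paper accelerates with FFTs by exploiting the partial Fourier dictionary) and a least-squares step solved with conjugate gradients on the normal equations. The dense random dictionary and the problem sizes are illustrative assumptions, not the paper's FPGA design.

```python
# Minimal OMP sketch: greedy atom selection by correlation, then a CG
# solve of the least-squares normal equations. A plain matrix product
# stands in for the paper's FFT-based correlation.
import numpy as np
from scipy.sparse.linalg import cg

def omp(A, y, sparsity):
    support, residual = [], y.astype(complex)
    x = np.zeros(0, dtype=complex)
    for _ in range(sparsity):
        corr = A.conj().T @ residual        # FFT-accelerated in the paper
        support.append(int(np.argmax(np.abs(corr))))
        As = A[:, support]
        G = As.conj().T @ As                # normal equations: G x = b
        b = As.conj().T @ y
        x, _ = cg(G, b)                     # CG instead of direct factorization
        residual = y - As @ x
    coeffs = np.zeros(A.shape[1], dtype=complex)
    coeffs[support] = x
    return coeffs

# Tiny demo: recover a 2-sparse vector from noiseless measurements.
rng = np.random.default_rng(0)
n, m = 64, 128
A = (rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))) / np.sqrt(2 * n)
x_true = np.zeros(m, dtype=complex)
x_true[[5, 40]] = [2.0, -1.0j]
y = A @ x_true
print(np.round(omp(A, y, 2)[[5, 40]], 2))   # approximately [2.+0.j, -0.-1.j]
```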
... Given the large performance increases attainable using Cluster Computing and Reconfigurable Computing, it is not surprising that interest has developed in using a combination of both techniques, i.e., the creation of clusters where the compute nodes are augmented with reconfigurable hardware in order to improve performance [1][2][3][4]. The practicality of this approach has improved in recent years with the availability of HyperTransport-enabled reconfigurable computing boards that can plug directly into unused processor slots on AMD Opteron motherboards [5] (Cray, Inc. has been marketing supercomputers that incorporate these boards since 2004 [6]). ...
Article
Full-text available
The addition of reconfigurable hardware (FPGAs) to the nodes of Beowulf-style clusters has the potential to accelerate a variety of parallel applications through a combination of parallel programming and reconfigurable computing techniques. However, making efficient use of the computational resources available places a significant burden on the application developer due to the lack of support for reconfigurable computing and task heterogeneity in standard message-passing libraries. This paper describes Accessible Reconfigurable Computing (ARC), a metacomputing environment designed to address these issues. The architecture, implementation, and operation of the system are described in detail.
... There have been several research works that have targeted the problem of parallelizing wavefront problems, whether on distributed-memory architectures [10], on heterogeneous architectures such as SIMD-enabled accelerators [11], or on the Cell/BE architecture [12]. In all these works, the authors have focused on vectorization optimizations and on studying the appropriate work distribution for each architecture, forcing the programmer to deal with several low-level programming details. ...
Article
Full-text available
This paper analyzes the applicability of the task programming model to the parallelization of the wavefront pattern. Computations on this type of problem are characterized by a data dependency pattern across a data space, whose traversal can produce a variable number of independent tasks. We explore several implementations of this pattern based on current state-of-the-art threading libraries that support tasks. For each implementation, we discuss the particularities from a programmer's point of view, highlighting the advantageous features in each case. We conduct several experiments to identify the factors that can limit performance in each implementation. Moreover, we propose and evaluate some optimizations (task recycling, prioritization of tasks based on locality hints, and tiling) that the programmer can exploit to reduce the overhead in some cases.
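As a concrete illustration of the dependency pattern this abstract describes, here is a minimal Python sketch of a wavefront traversal: each cell depends only on previously computed neighbors (north, west, northwest), so all cells on one anti-diagonal are independent and can be issued as parallel tasks. The thread pool and the toy cell function are illustrative assumptions, not the paper's task-library implementations.

```python
# Minimal wavefront sketch: cells on anti-diagonal d (where i + j == d)
# are mutually independent, so each diagonal is issued as a batch of tasks.
from concurrent.futures import ThreadPoolExecutor

def wavefront(rows, cols, cell):
    G = [[0] * cols for _ in range(rows)]
    with ThreadPoolExecutor() as pool:
        for d in range(rows + cols - 1):                  # sweep anti-diagonals
            cells = [(i, d - i) for i in range(rows) if 0 <= d - i < cols]
            futures = [pool.submit(cell, G, i, j) for i, j in cells]
            for (i, j), f in zip(cells, futures):
                G[i][j] = f.result()                      # barrier per diagonal
    return G

def toy_cell(G, i, j):
    # Toy recurrence standing in for a real wavefront computation.
    north = G[i - 1][j] if i > 0 else 0
    west = G[i][j - 1] if j > 0 else 0
    return max(north, west) + 1

print(wavefront(4, 5, toy_cell)[3][4])                    # 8 for this recurrence
```

The per-diagonal barrier is what limits available parallelism at the start and end of the sweep; the optimizations the paper evaluates (recycling, locality hints, tiling) all target the overhead of creating and scheduling these many small tasks.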
... This locality is the basis for pipelining the computation. The pipelining idea (for a non-affine gap variant) is described in [8]. ...
... Another goal of this work is to evaluate the HaSCoL high-level hardware description language for high-performance computing (HPC) with FPGAs, and for sequence alignment in particular. Another solution that accelerates pairwise sequence alignment by using a high-level language to generate an FPGA configuration is proposed in [8]; it uses the MitrionC language [1]. ...
Article
Full-text available
The paper describes the experience of creating a hardware implementation of a pairwise sequence alignment algorithm in a high-level hardware description language. The implementation is designed to run on an FPGA with a high-latency interface to a PC (Ethernet). Thus, a lot of control logic is implemented in hardware together with the main pipeline. We use the HaSCoL hardware description language for that purpose and discuss the pros and cons of this approach compared to a software implementation of the control logic on an embedded processor. We also discuss how the language helps to describe hardware, and how it could help further.
... Most of the implementations follow the second method, in which FPGAs are only used to find the maximum value after filling the matrix [7], [8], [9], [10], [11]. ...
... Our solution is better than the traditionally used solutions [13], [7], [9] in the sense that it gives the best alignment between the unknown sequence and the known sequences in the database after a single scan through the database, with no need to repeat the sequence alignment for some smaller subset. Secondly, the whole solution is based on the FPGA, so there is no need to maintain the solution in two different places, which is simpler. ...
Conference Paper
Full-text available
The Smith-Waterman (SW) algorithm is the only optimal local sequence alignment algorithm. There are many SW implementations on FPGAs, which show speedups of up to 100x compared to a general-purpose processor (GPP). In this paper, we propose a design of the SW traceback which is done in parallel with the matrix-fill stage and which gives the optimal alignment after a single scan through the whole database. Besides that, we propose a hardware design for the RVEP SW FPGA implementation, which demonstrates that this solution can be realized with off-the-shelf FPGA boards.
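For readers unfamiliar with the traceback stage being moved into hardware here, the following Python sketch shows what that stage computes: the matrix fill records a back-pointer per cell, and the traceback walks those pointers from the best-scoring cell to recover the alignment. This plain sequential version only illustrates the recovered result; overlapping traceback with the fill, as the paper proposes, is the hardware contribution. Scoring parameters are illustrative assumptions.

```python
# Smith-Waterman with back-pointers: the fill records which neighbor
# produced each score ('D' diagonal, 'U' up, 'L' left); the traceback
# walks the pointers from the best cell until a zero cell ends the
# local alignment.

def sw_align(a, b, match=2, mismatch=-1, gap=-1):
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    P = [[None] * cols for _ in range(rows)]      # back-pointers
    bi = bj = 0                                   # best-scoring cell
    for i in range(1, rows):
        for j in range(1, cols):
            cand = [
                (0, None),                        # local alignment can restart
                (H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch), 'D'),
                (H[i-1][j] + gap, 'U'),
                (H[i][j-1] + gap, 'L'),
            ]
            H[i][j], P[i][j] = max(cand, key=lambda t: t[0])
            if H[i][j] > H[bi][bj]:
                bi, bj = i, j
    out_a, out_b, i, j = [], [], bi, bj           # walk pointers backwards
    while i and j and P[i][j]:
        d = P[i][j]
        out_a.append(a[i-1] if d != 'L' else '-')
        out_b.append(b[j-1] if d != 'U' else '-')
        i, j = (i-1, j-1) if d == 'D' else (i-1, j) if d == 'U' else (i, j-1)
    return ''.join(reversed(out_a)), ''.join(reversed(out_b))

print(sw_align("GATTACA", "GCATGCU"))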
... Typical parallel solutions to genome assembly have tightly coupled alignment with the other stages of the assembly. These highly specific assemblers have relied on batch processing, complex MPI programming, or specialized hardware such as BlueGene/L [73], FPGAs [116], and the Cell processor [110] to speed up alignment, but the new modular approaches are agnostic to the mechanisms of the individual modules. This presents a perfect opportunity for distributed abstractions to be supplied as modules. ...
... The classic way to parallelize this algorithm is to work along anti-diagonals [19]. Fig. 4 shows an anti-diagonal, and it should be clear from the dependency region shown in Fig. 2 that each diagonal item can be calculated independently of the others. ...
Article
Full-text available
Finding regions of similarity between two very long data streams is a computationally intensive problem referred to as sequence alignment. Alignment algorithms must allow for imperfect sequence matching with different starting locations and some gaps and errors between the two data sequences. Perhaps the most well known application of sequence matching is the testing of DNA or protein sequences against genome databases. The Smith–Waterman algorithm is a method for precisely characterizing how well two sequences can be aligned and for determining the optimal alignment of those two sequences. Like many applications in computational science, the Smith–Waterman algorithm is constrained by the memory access speed and can be accelerated significantly by using graphics processors (GPUs) as the compute engine. In this work we show that effective use of the GPU requires a novel reformulation of the Smith–Waterman algorithm. The performance of this new version of the algorithm is demonstrated using the SSCA#1 (Bioinformatics) benchmark running on one GPU and on up to four GPUs executing in parallel. The results indicate that for large problems a single GPU is up to 45 times faster than a CPU for this application, and the parallel implementation shows linear speed up on up to 4 GPUs.
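The anti-diagonal idea named in the citing context above, which underlies this GPU reformulation, can be sketched briefly: every cell on an anti-diagonal of the Smith-Waterman matrix depends only on the two preceding diagonals, so an entire diagonal can be updated at once (on a GPU, one thread per cell). In the Python sketch below, NumPy vectorization stands in for the GPU kernel; the scoring parameters are illustrative assumptions.

```python
# Anti-diagonal Smith-Waterman: all cells with i + j == d are computed
# together, since they depend only on diagonals d-1 and d-2.
import numpy as np

def sw_antidiagonal(a, b, match=2, mismatch=-1, gap=-1):
    n, m = len(a), len(b)
    A = np.frombuffer(a.encode(), dtype=np.uint8)
    B = np.frombuffer(b.encode(), dtype=np.uint8)
    H = np.zeros((n + 1, m + 1), dtype=np.int32)
    for d in range(2, n + m + 1):                      # anti-diagonal index i+j
        i = np.arange(max(1, d - m), min(n, d - 1) + 1)
        j = d - i
        sub = np.where(A[i - 1] == B[j - 1], match, mismatch)
        H[i, j] = np.maximum.reduce([
            np.zeros(len(i), dtype=np.int32),          # local restart at 0
            H[i - 1, j - 1] + sub,                     # two diagonals back
            H[i - 1, j] + gap,                         # previous diagonal
            H[i, j - 1] + gap,                         # previous diagonal
        ])
    return int(H.max())

print(sw_antidiagonal("GATTACA", "GCATGCU"))           # same score as a row fill
```

Note the trade-off the abstract alludes to: the diagonals start short, grow, and shrink again, so parallel efficiency varies across the sweep, and memory access along diagonals is strided rather than sequential, which is part of why an effective GPU version requires reformulating the algorithm.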
... Kalyanaraman et al. later reported an approach that could process 47 million maize candidate alignments in under 2 hours using 1024 processors of an IBM Blue-Gene/L [12]. More recent work has explored using FPGAs [24] and the Cell processor [21] to speed up alignment, which would provide up to a 100X speedup. ...
Conference Paper
Full-text available
Bioinformatics researchers need efficient means to process large collections of sequence data. One application of interest, genome assembly, has great potential for parallelization; however, most previous attempts at parallelization require uncommon high-end hardware. This paper introduces a scalable modular genome assembler that can achieve significant speedup using large numbers of conventional desktop machines, such as those found in a campus computing grid. The system is based on the Celera open-source assembly toolkit, and replaces two independent sequential modules with scalable replacements: a scalable candidate selector exploits the distributed memory capacity of a campus grid, while the scalable aligner exploits the distributed computing capacity. For large problems, these modules provide robust task and data management while also achieving speedup with high efficiency on several scales of resources. We show results for several datasets ranging from 738 thousand to over 121 million alignments using campus grid resources ranging from a small cluster to more than a thousand nodes spanning three institutions. Our largest run so far achieves a 927x speedup with 71.3 percent efficiency.
... FPGAs offer the parallelism inherent in logic integrated circuits combined with programmability, and have therefore become increasingly popular for many applications. In particular, there has been a lot of interest in FPGA-accelerated scientific computing [6]. Until recently, a major drawback was the lack of high-level programming solutions for FPGAs. ...
Conference Paper
Full-text available
Processing large volumes of information generally requires massive amounts of computational power, which consumes a significant amount of energy. An emerging challenge is the development of “environmentally friendly” systems that are not only efficient in terms of time, but also energy efficient. In this poster, we outline our initial efforts at developing greener filtering systems by employing Field Programmable Gate Arrays (FPGAs) to perform the core information processing task. FPGAs enable code to be executed in parallel at the chip level, while consuming only a fraction of the power of a standard (von Neumann-style) processor. On a number of test collections, we demonstrate that the FPGA filtering system performs 10-20 times faster than the Itanium-based implementation, resulting in considerable energy savings.