Figure 2 - uploaded by Dave Strenski
Virtex-4 FPGA logic slice: LUTs, RAM...

Source publication
Article
Full-text available
FPGA hardware and tools (VHDL, Viva, MitrionC and CHiMPS) are described. FPGA performance is evaluated on two Cray XD1 systems (Virtex-II Pro 50 and Virtex-4 LX160) for human genome (DNA and protein) sequence comparisons for a computational biology code (FASTA). Scalable FPGA speedups of 50X (Virtex-II) and 100X (Virtex-4) over a 2.2 GHz Opteron we...

Context in source publication

Context 1
... reconfigurable, DNA, RNA, Smith-Waterman, Cray, FASTA, XD1, Virtex, OpenFPGA

This paper describes Field-Programmable Gate Arrays (FPGAs), several tools used to program them, and how a 100X speedup was achieved for a science application. Remarkable innovations in computer technology [1-2] are fulfilling NASA future projections [3] for faster science and engineering computations. One innovation in the forefront is to harness FPGAs to accelerate High-Performance Computing (HPC) applications by one or more orders of magnitude over traditional microprocessors. FPGAs were invented in 1984 by Ross Freeman, co-founder of Xilinx Inc. FPGAs are extremely flexible and dominated by interconnections to thousands of embedded functions (Fig. 1, left) like adders, multipliers, memory, and logic slices (Fig. 2). They perform high-speed computations and communications (e.g., HyperTransport) in parallel via digital logic: LookUp Tables (LUTs), registers, RAM, etc. Unlike “fixed” microprocessors, FPGAs are reconfigurable “on the fly” by users in the “field”, thus “field programmable”. The Virtex-4 is available with one or more PowerPC (PPC) processors on the chip (Fig. 1, right). The rapidly growing (15-20%/year) $2B FPGA market (focused on high-volume communications) is dominated by Xilinx and Altera. Aerospace and High-Performance Embedded Computing (HPEC) users are rapidly expanding their FPGA use. Although HPC sales are small (< 1%), FPGA designers are open to HPC requirements for their next-generation designs.

FPGA layout is extremely regular compared to microprocessors, simplifying fabrication and allowing FPGAs to be among the first chips to reduce feature sizes (90nm => 65nm => 45nm). For space and flight use, this regularity, together with triply-redundant code, limits radiation damage (e.g., the NASA Mars Rovers). At each clock cycle, FPGA algorithms (when coded to maximize the number of parallel operations) use nearly 100% of their silicon, compared to less efficient microprocessors, which use < 2% of their silicon while drawing 10x the FPGA's power to perform only one or two operations. Figure 3 shows several key FPGA characteristics. Unlike microprocessors, FPGAs continue to advance at Moore's Law rates and have far to go before reaching logic-cell and speed limits (Fig. 3, left). FPGA clock speeds (often 100-200 MHz) have far to go before facing the heating issues that drove microprocessors to multi-core chips with reduced clock speeds. When FPGA applications are programmed to maximize parallelism, their computation speed far exceeds that of microprocessors (Fig. 3, right). Being high-speed communications devices, their memory and IO bandwidths also significantly exceed those of microprocessors (Fig. 3).

As FPGAs were developed by logic designers, they are traditionally programmed using circuit-design languages such as VHDL and Verilog. These languages require the knowledge and training of a logic designer, take months to learn, and take far longer to code efficiently. Even once this skill is acquired, VHDL or Verilog coding is extremely arduous, taking months to develop early prototypes and often years to perfect and optimize. Unlike compilation for HPC systems, FPGA code development is greatly slowed by the additional lengthy steps required to synthesize, place, and route the circuit. Once the time is taken to code an application in VHDL, though, its FPGA performance is excellent.
In particular, applications using basic integer or logic operations (compare, add, multiply), such as DNA sequence comparisons, cryptography, or chess logic, run extremely well on FPGAs. As floating-point and double-precision applications rapidly exhausted the number of slices available on early FPGAs, FPGAs were often avoided for high-precision calculations. However, this situation has changed for current Xilinx FPGAs (Virtex-4 and Virtex-5), which have sufficient logic to fit about 80 64-bit multiply units [2]. While early FPGAs had sufficient capability to be well suited for special-purpose HPEC, their use for general-purpose HPC was initially restricted to a first generation of low-end reconfigurable supercomputers (e.g., Starbridge Systems, SRC, Cray XD1). A lack of high-speed IO and of the infrastructure (compilers, libraries) needed to support general-purpose supercomputer applications, including legacy codes, is typical of this early generation. However, this situation is rapidly changing with the latest generation of reconfigurable supercomputers and the FPGAs they use. DRC Computer, XtremeData, and Xilinx (in collaboration with Intel) provide modules with the latest FPGAs that fit in microprocessor sockets and use the same high-speed communications links. Cray selected DRC's module (Fig. 4) to accelerate its XT line of supercomputers.

Accelerating HPC applications is so critical that many alternatives have entered the marketplace. Even though many legacy physics-based codes were written in sequential Fortran over 30 years ago, they have remarkably survived several HPC generations: vector (via compilers), parallel (via MPI, OpenMP), and now the first stages of multi-core microprocessors. Some surmise they may suffer severe performance degradation or even require significant rewrites to fully exploit 8 or more cores/chip. Major chip vendors (Intel and AMD) have vigorous efforts to accommodate accelerators, with their primary focus on FPGAs as a way to regain performance. As multi-core microprocessors face looming power, cooling, size, and IO challenges, FPGAs are increasingly attractive.

Accelerator Options: Three other accelerator options are available to HPC architects: Cell (IBM), graphics (GPU), and array (ClearSpeed) processors. Like FPGAs, Cell and GPUs have vast commercial markets (video games and graphics) driving down costs, promoting competition, and stimulating advances, making them increasingly attractive to HPC. Array processors, however, are custom devices requiring amortization over relatively few users. GPUs require significant power/cooling and have complex programming and data-precision issues to solve before they can enter the HPC market. Coding the 8+1 Cell processors is likely to be considerably more difficult than programming FPGAs in VHDL or Verilog, which already have a large user base. As FPGA hardware advances, tools and software are simplifying their use for HPC: Viva, MitrionC, and Xilinx's CHiMPS (all discussed later), as well as DSPlogic, ImpulseC, Celoxica, Aldec, and others. The authors are testing CHiMPS for HPC ...
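Since the paper's FASTA workload is built on exactly these integer compare/add/max operations, a small software sketch helps show what gets mapped onto FPGA logic. Below is a minimal Smith-Waterman local-alignment scoring kernel in Python; it is not the authors' FPGA implementation, and the match/mismatch/gap scores are illustrative assumptions.

```python
# Minimal Smith-Waterman scoring kernel (linear gap penalty). Each cell
# update is a few integer adds and compares, the kind of operation an
# FPGA can replicate massively in parallel. Scores are illustrative.

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]      # scoring matrix
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])          # best local-alignment score
    return best

print(smith_waterman_score("GATTACA", "GCATGCU"))
```

Every cell update is a handful of integer additions and comparisons, which is why an FPGA coded to maximize parallelism can instantiate many such cells and update them all in a single clock cycle.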

Similar publications

Article
Full-text available
In this paper, the concept of Alamouti's transmit diversity technique in a MIMO system is implemented based on a Hardware Description Language (HDL), using a Xilinx Field Programmable Gate Array (FPGA). The proposed design is based on Alamouti's transmit diversity scheme, a space-time block code (STBC) with two transmit antennas and four receive an...
Article
Full-text available
This paper presents a new fast motion estimation (ME) algorithm targeting high-resolution digital videos, together with its efficient hardware architecture design. The new Dynamic Multipoint Diamond Search (DMPDS) algorithm is a fast algorithm which increases the ME quality when compared with other fast ME algorithms. The DMPDS achieves a better digital video...
Thesis
Full-text available
The goal of this work is to implement a fuzzy controller, synthesized on an FPGA (Field Programmable Gate Array) architecture, to be deployed in a liquid-level control system. The system is described using the VHDL hardware description language.
Conference Paper
Full-text available
This article presents an analysis of the assembly instructions of the SHA-3 BLAKE algorithm running on an ARM® processor, with the aim of developing a BLAKE-specific processor in an FPGA. For this purpose, we used a C implementation of the algorithm, from which we could discover which instructions were executed and how frequently t...
Conference Paper
Full-text available
MultiBand OFDM (MB-OFDM) UWB [1] is a promising short-range wireless technology for high-data-rate communications up to 480 Mbps. The UWB receiver uses a Viterbi decoder that must support the highest data rate of 480 Mbps. To achieve such high data rates, a sliding-block Viterbi decoder is a good design candidate. In this paper, we analyze the trade...

Citations

... A prescribed triplet in the query sequence is then matched against the triplets in the database sequence that score at least 11 when the three sets of amino acids are compared [12][13][14][15]. In the second stage, using amino-acid substitution scores, BLAST extends the initial word matches into HSPs. ...
Article
Full-text available
With the rapid development of DNA sequencing, the rate of DNA data generation exceeds the rate at which the data can be computationally processed. Standard sequence alignment techniques on existing computational machines cannot keep up with the exponentially growing requirements. Accelerating the algorithms on FPGAs improves performance in comparison to other platforms. This paper defines and categorizes the present sequence alignment algorithms and implements them on FPGA boards. We also present a comparison of different types of sequence alignment algorithms, survey the current alternatives, and make the case for further accelerating sequence alignment on FPGAs.
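The first BLAST stage quoted in the citing context above (word seeding against a score threshold) can be sketched compactly. The Python sketch below illustrates that stage only; the identity-based pair_score is a hypothetical stand-in for the BLOSUM62 substitution matrix real BLAST uses, and the threshold of 11 comes from the quoted context.

```python
# Sketch of BLAST's word-seeding stage: find database 3-mers whose
# substitution score against a query 3-mer meets the threshold. The
# scorer below is a toy stand-in, not a real substitution matrix.

def pair_score(x, y):
    # Hypothetical stand-in for a BLOSUM62 lookup.
    return 5 if x == y else -1

def word_score(w1, w2):
    return sum(pair_score(x, y) for x, y in zip(w1, w2))

def seed_hits(query, db, k=3, threshold=11):
    hits = []
    for qi in range(len(query) - k + 1):
        qword = query[qi:qi + k]
        for di in range(len(db) - k + 1):
            if word_score(qword, db[di:di + k]) >= threshold:
                hits.append((qi, di))   # seed for stage-2 HSP extension
    return hits

print(seed_hits("MKVLA", "AMKVT"))      # [(0, 1)]: "MKV" matches at db offset 1
```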
... These signal processing applications, which require intense computation and simultaneous processing of large amounts of data in real time, can make use of a large-scale field-programmable gate array (FPGA) platform for hardware acceleration. A modern high-capacity FPGA is an attractive alternative for accelerating scientific and engineering applications [15] due to the possible utilization of massive parallelism. ...
Article
Full-text available
This paper presents a novel real-time compressive sensing (CS) reconstruction which employs a high-density field-programmable gate array (FPGA) for hardware acceleration. Traditionally, CS can be implemented using a high-level computer language on a personal computer (PC) or on multicore platforms such as graphics processing units (GPUs) and digital signal processors (DSPs). However, reconstruction algorithms are computationally demanding, and software implementations of these algorithms are extremely slow and power consuming. In this paper, the orthogonal matching pursuit (OMP) algorithm is refined to solve the sparse decomposition optimization for a partial Fourier dictionary, which is commonly adopted in radar imaging and detection applications. OMP reconstruction can be divided into two main stages: an optimization stage that finds the most closely correlated vectors, and a least-squares problem. For a large-scale dictionary, the implementation of the correlation is time consuming, since it often requires a large number of matrix multiplications. Solving the least-squares problem likewise needs a scalable matrix decomposition operation. To solve these problems efficiently, the correlation optimization is implemented with the fast Fourier transform (FFT), and the large-scale least-squares problem is solved with the Conjugate Gradient (CG) technique. The proposed method is verified by an FPGA (Xilinx Virtex-7 XC7VX690T) realization, revealing its effectiveness in real-time applications.
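To make the two stages named in this abstract concrete, here is a minimal software sketch of OMP in Python: a correlation step (which the paper accelerates with FFTs by exploiting the partial Fourier dictionary) and a least-squares step solved with conjugate gradients on the normal equations. The dense random dictionary and the problem sizes are illustrative assumptions, not the paper's FPGA design.

```python
# Minimal OMP sketch: greedy atom selection by correlation, then a CG
# solve of the least-squares normal equations. A plain matrix product
# stands in for the paper's FFT-based correlation.
import numpy as np
from scipy.sparse.linalg import cg

def omp(A, y, sparsity):
    support, residual = [], y.astype(complex)
    x = np.zeros(0, dtype=complex)
    for _ in range(sparsity):
        corr = A.conj().T @ residual        # FFT-accelerated in the paper
        support.append(int(np.argmax(np.abs(corr))))
        As = A[:, support]
        G = As.conj().T @ As                # normal equations: G x = b
        b = As.conj().T @ y
        x, _ = cg(G, b)                     # CG instead of direct factorization
        residual = y - As @ x
    coeffs = np.zeros(A.shape[1], dtype=complex)
    coeffs[support] = x
    return coeffs

# Tiny demo: recover a 2-sparse vector from noiseless measurements.
rng = np.random.default_rng(0)
n, m = 64, 128
A = (rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))) / np.sqrt(2 * n)
x_true = np.zeros(m, dtype=complex)
x_true[[5, 40]] = [2.0, -1.0j]
y = A @ x_true
print(np.round(omp(A, y, 2)[[5, 40]], 2))   # approximately [2.+0.j, -0.-1.j]
```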
... Given the large performance increases attainable using Cluster Computing and Reconfigurable Computing, it is not surprising that interest has developed in using a combination of both techniques, i.e., the creation of clusters where the compute nodes are augmented with reconfigurable hardware in order to improve performance [1][2][3][4]. The practicality of this approach has improved in recent years with the availability of HyperTransport-enabled reconfigurable computing boards that can plug directly into unused processor slots on AMD Opteron motherboards [5] (Cray, Inc. has been marketing supercomputers that incorporate these boards since 2004 [6]). ...
Article
Full-text available
The addition of reconfigurable hardware (FPGAs) to the nodes of Beowulf-style clusters has the potential to accelerate a variety of parallel applications through a combination of parallel programming and reconfigurable computing techniques. However, making efficient use of the computational resources available places a significant burden on the application developer due to the lack of support for reconfigurable computing and task heterogeneity in standard message-passing libraries. This paper describes Accessible Reconfigurable Computing (ARC), a metacomputing environment designed to address these issues. The architecture, implementation, and operation of the system are described in detail.
... There have been several research works that have targeted the problem of parallelizing wavefront problems, whether on distributed-memory architectures [10], on heterogeneous architectures such as SIMD-enabled accelerators [11], or on the Cell/BE architecture [12]. In all these works, the authors have focused on vectorization optimizations and on studying the appropriate work distribution for each architecture, forcing the programmer to deal with several low-level programming details. ...
Article
Full-text available
This paper analyzes the applicability of the task programming model to the parallelization of the wavefront pattern. Computations on this type of problem are characterized by a data dependency pattern across a data space, whose traversal can produce a variable number of independent tasks. We explore several implementations of this pattern based on current state-of-the-art threading libraries that support tasks. For each implementation, we discuss the particularities from a programmer's point of view, highlighting the advantageous features in each case. We conduct several experiments to identify the factors that can limit performance in each implementation. Moreover, we propose and evaluate some optimizations (task recycling, prioritization of tasks based on locality hints, and tiling) that the programmer can exploit to reduce the overhead in some cases.
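As a concrete illustration of the dependency pattern this abstract describes, here is a minimal Python sketch of a wavefront traversal: each cell depends only on previously computed neighbors (north, west, northwest), so all cells on one anti-diagonal are independent and can be issued as parallel tasks. The thread pool and the toy cell function are illustrative assumptions, not the paper's task-library implementations.

```python
# Minimal wavefront sketch: cells on anti-diagonal d (where i + j == d)
# are mutually independent, so each diagonal is issued as a batch of tasks.
from concurrent.futures import ThreadPoolExecutor

def wavefront(rows, cols, cell):
    G = [[0] * cols for _ in range(rows)]
    with ThreadPoolExecutor() as pool:
        for d in range(rows + cols - 1):                  # sweep anti-diagonals
            cells = [(i, d - i) for i in range(rows) if 0 <= d - i < cols]
            futures = [pool.submit(cell, G, i, j) for i, j in cells]
            for (i, j), f in zip(cells, futures):
                G[i][j] = f.result()                      # barrier per diagonal
    return G

def toy_cell(G, i, j):
    # Toy recurrence standing in for a real wavefront computation.
    north = G[i - 1][j] if i > 0 else 0
    west = G[i][j - 1] if j > 0 else 0
    return max(north, west) + 1

print(wavefront(4, 5, toy_cell)[3][4])                    # 8 for this recurrence
```

The per-diagonal barrier is what limits available parallelism at the start and end of the sweep; the optimizations the paper evaluates (recycling, locality hints, tiling) all target the overhead of creating and scheduling these many small tasks.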
... This locality is the basis for pipelining the computation. The pipelining idea (for a non-affine gap variant) is described in [8]. ...
... Another goal of this work is to evaluate the HaSCoL high-level hardware description language for high-performance computing (HPC) with FPGAs, and for sequence alignment in particular. Another solution that accelerates pairwise sequence alignment by using a high-level language to generate an FPGA configuration is proposed in [8]; it uses the MitrionC language [1]. ...
Article
Full-text available
The paper describes the experience of creating a hardware implementation of a pairwise sequence alignment algorithm in a high-level hardware description language. The implementation is designed to run on an FPGA with a high-latency interface to a PC (Ethernet). Thus, a lot of control logic is implemented in hardware together with the main pipeline. We use the HaSCoL hardware description language for that purpose and discuss the pros and cons of this approach compared to a software implementation of the control logic on an embedded processor. We also discuss how the language helps to describe hardware, and how it could help further.
... Most of the implementations follow the second method, in which FPGAs are only used to find the maximum value after filling the matrix [7], [8], [9], [10], [11]. ...
... Our solution is better than the traditionally used solutions [13], [7], [9] in the sense that it gives the best alignment between the unknown sequence and the known sequences in the database after a single scan through the database, with no need to repeat the sequence alignment for some smaller subset. Secondly, the whole solution is based on the FPGA, so there is no need to maintain the solution in two different places, which is simpler. ...
Conference Paper
Full-text available
The Smith-Waterman (SW) algorithm is the only optimal local sequence alignment algorithm. There are many SW implementations on FPGAs, which show speedups of up to 100x compared to a general-purpose processor (GPP). In this paper, we propose a design of the SW traceback which is done in parallel with the matrix-fill stage and which gives the optimal alignment after a single scan through the whole database. Besides that, we propose a hardware design for the RVEP SW FPGA implementation, which demonstrates that this solution can be realized with off-the-shelf FPGA boards.
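For readers unfamiliar with the traceback stage being moved into hardware here, the following Python sketch shows what that stage computes: the matrix fill records a back-pointer per cell, and the traceback walks those pointers from the best-scoring cell to recover the alignment. This plain sequential version only illustrates the recovered result; overlapping traceback with the fill, as the paper proposes, is the hardware contribution. Scoring parameters are illustrative assumptions.

```python
# Smith-Waterman with back-pointers: the fill records which neighbor
# produced each score ('D' diagonal, 'U' up, 'L' left); the traceback
# walks the pointers from the best cell until a zero cell ends the
# local alignment.

def sw_align(a, b, match=2, mismatch=-1, gap=-1):
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    P = [[None] * cols for _ in range(rows)]      # back-pointers
    bi = bj = 0                                   # best-scoring cell
    for i in range(1, rows):
        for j in range(1, cols):
            cand = [
                (0, None),                        # local alignment can restart
                (H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch), 'D'),
                (H[i-1][j] + gap, 'U'),
                (H[i][j-1] + gap, 'L'),
            ]
            H[i][j], P[i][j] = max(cand, key=lambda t: t[0])
            if H[i][j] > H[bi][bj]:
                bi, bj = i, j
    out_a, out_b, i, j = [], [], bi, bj           # walk pointers backwards
    while i and j and P[i][j]:
        d = P[i][j]
        out_a.append(a[i-1] if d != 'L' else '-')
        out_b.append(b[j-1] if d != 'U' else '-')
        i, j = (i-1, j-1) if d == 'D' else (i-1, j) if d == 'U' else (i, j-1)
    return ''.join(reversed(out_a)), ''.join(reversed(out_b))

print(sw_align("GATTACA", "GCATGCU"))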
... Typical parallel solutions to genome assembly have tightly coupled alignment with the other stages of the assembly. These highly specific assemblers have relied on batch processing, complex MPI programming, or specialized hardware such as BlueGene/L [73], FPGAs [116], and the Cell processor [110] to speed up alignment, but the new modular approaches are agnostic to the mechanisms of the individual modules. This presents a perfect opportunity for distributed abstractions to be supplied as modules. ...
... The classic way to parallelize this algorithm is to work along anti-diagonals [19]. Fig. 4 shows an anti-diagonal, and it should be clear from the dependency region shown in Fig. 2 that each diagonal item can be calculated independently of the others. ...
Article
Full-text available
Finding regions of similarity between two very long data streams is a computationally intensive problem referred to as sequence alignment. Alignment algorithms must allow for imperfect sequence matching with different starting locations and some gaps and errors between the two data sequences. Perhaps the most well known application of sequence matching is the testing of DNA or protein sequences against genome databases. The Smith–Waterman algorithm is a method for precisely characterizing how well two sequences can be aligned and for determining the optimal alignment of those two sequences. Like many applications in computational science, the Smith–Waterman algorithm is constrained by the memory access speed and can be accelerated significantly by using graphics processors (GPUs) as the compute engine. In this work we show that effective use of the GPU requires a novel reformulation of the Smith–Waterman algorithm. The performance of this new version of the algorithm is demonstrated using the SSCA#1 (Bioinformatics) benchmark running on one GPU and on up to four GPUs executing in parallel. The results indicate that for large problems a single GPU is up to 45 times faster than a CPU for this application, and the parallel implementation shows linear speed up on up to 4 GPUs.
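The anti-diagonal idea named in the citing context above, which underlies this GPU reformulation, can be sketched briefly: every cell on an anti-diagonal of the Smith-Waterman matrix depends only on the two preceding diagonals, so an entire diagonal can be updated at once (on a GPU, one thread per cell). In the Python sketch below, NumPy vectorization stands in for the GPU kernel; the scoring parameters are illustrative assumptions.

```python
# Anti-diagonal Smith-Waterman: all cells with i + j == d are computed
# together, since they depend only on diagonals d-1 and d-2.
import numpy as np

def sw_antidiagonal(a, b, match=2, mismatch=-1, gap=-1):
    n, m = len(a), len(b)
    A = np.frombuffer(a.encode(), dtype=np.uint8)
    B = np.frombuffer(b.encode(), dtype=np.uint8)
    H = np.zeros((n + 1, m + 1), dtype=np.int32)
    for d in range(2, n + m + 1):                      # anti-diagonal index i+j
        i = np.arange(max(1, d - m), min(n, d - 1) + 1)
        j = d - i
        sub = np.where(A[i - 1] == B[j - 1], match, mismatch)
        H[i, j] = np.maximum.reduce([
            np.zeros(len(i), dtype=np.int32),          # local restart at 0
            H[i - 1, j - 1] + sub,                     # two diagonals back
            H[i - 1, j] + gap,                         # previous diagonal
            H[i, j - 1] + gap,                         # previous diagonal
        ])
    return int(H.max())

print(sw_antidiagonal("GATTACA", "GCATGCU"))           # same score as a row fill
```

Note the trade-off the abstract alludes to: the diagonals start short, grow, and shrink again, so parallel efficiency varies across the sweep, and memory access along diagonals is strided rather than sequential, which is part of why an effective GPU version requires reformulating the algorithm.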
... Kalyanaraman et al. later reported an approach that could process 47 million maize candidate alignments in under 2 hours using 1024 processors of an IBM Blue-Gene/L [12]. More recent work has explored using FPGAs [24] and the Cell processor [21] to speed up alignment, which would provide up to a 100X speedup. ...
Conference Paper
Full-text available
Bioinformatics researchers need efficient means to process large collections of sequence data. One application of interest, genome assembly, has great potential for parallelization; however, most previous attempts at parallelization require uncommon high-end hardware. This paper introduces a scalable modular genome assembler that can achieve significant speedup using large numbers of conventional desktop machines, such as those found in a campus computing grid. The system is based on the Celera open-source assembly toolkit, and replaces two independent sequential modules with scalable replacements: a scalable candidate selector exploits the distributed memory capacity of a campus grid, while the scalable aligner exploits the distributed computing capacity. For large problems, these modules provide robust task and data management while also achieving speedup with high efficiency on several scales of resources. We show results for several datasets ranging from 738 thousand to over 121 million alignments using campus grid resources ranging from a small cluster to more than a thousand nodes spanning three institutions. Our largest run so far achieves a 927x speedup with 71.3 percent efficiency.
... FPGAs offer the parallelism inherent in logic integrated circuits combined with programmability, and have therefore become increasingly popular for many applications. In particular, there has been a lot of interest in FPGA-accelerated scientific computing [6]. Until recently, a major drawback was the lack of high-level programming solutions for FPGAs. ...
Conference Paper
Full-text available
Processing large volumes of information generally requires massive amounts of computational power, which consumes a significant amount of energy. An emerging challenge is the development of “environmentally friendly” systems that are not only efficient in terms of time, but also energy efficient. In this poster, we outline our initial efforts at developing greener filtering systems by employing Field Programmable Gate Arrays (FPGAs) to perform the core information processing task. FPGAs enable code to be executed in parallel at the chip level, while consuming only a fraction of the power of a standard (von Neumann-style) processor. On a number of test collections, we demonstrate that the FPGA filtering system performs 10-20 times faster than the Itanium-based implementation, resulting in considerable energy savings.