Table 2 - uploaded by Wojciech Frohmberg
The maximum sequence length depending on the GPU RAM and window size parameter

Source publication
Article
Full-text available
Pairwise sequence alignment methods are widely used in biological research. The increasing number of sequences is perceived as one of the upcoming challenges for sequence alignment methods in the near future. To overcome this challenge, several GPU (Graphics Processing Unit) computing approaches have been proposed lately. These solutions show a g...

Context in source publication

Context 1
... the question is: what is the length of the longest sequences that can be processed by our program? Table 2 shows the maximum lengths of sequences that can be aligned by the algorithm, depending on the amount of RAM available on the graphics card and the value of the window size parameter. For example, to utilize the resources of the GeForce GTX 280 with 1 GB of RAM properly, it is sufficient to set the window size parameter to 80. ...

Similar publications

Conference Paper
Full-text available
The recent evolution of many-core architectures has resulted in chips where the number of processor elements (PEs) is in the hundreds and continues to increase every day. In addition, many-core processors are more and more frequently characterized by the diversity of their resources and the way the sharing of those resources is arbitrated. On such...

Citations

... They suggest utilizing both the intra- and inter-query parallelism based on a pre-determined sequence length threshold [42]. Similar approaches can be found in GSWABE [46] and gpu-pairalign [14]. In addition, focusing on all-to-all patterns for protein DB alignments, query profiling optimization [47] and CPU-GPU hybrid parallelism [49] have been suggested. ...
Preprint
Full-text available
Sequence alignment forms an important backbone in many sequencing applications. A commonly used strategy for sequence alignment is an approximate string matching with a two-dimensional dynamic programming approach. Although some prior work has been conducted on GPU acceleration of a sequence alignment, we identify several shortcomings that limit exploiting the full computational capability of modern GPUs. This paper presents SaLoBa, a GPU-accelerated sequence alignment library focused on seed extension. Based on the analysis of previous work with real-world sequencing data, we propose techniques to exploit the data locality and improve workload balancing. The experimental results reveal that SaLoBa significantly improves the seed extension kernel compared to state-of-the-art GPU-based methods.
... Due to these differences, GPU accelerated database search cannot be used to accelerate the alignment step in NGS data processing programs. gpu-pairAlign [19] and GSWABE [20] present only all-to-all pairwise local alignment of sequences. All-to-all alignment is easier to accelerate on GPU. ...
Article
Full-text available
Background: Due to the computational complexity of sequence alignment algorithms, various accelerated solutions have been proposed to speed up this analysis. NVBIO is the only available GPU library that accelerates sequence alignment of high-throughput NGS data, but has limited performance. In this article we present GASAL2, a GPU library for aligning DNA and RNA sequences that outperforms existing CPU and GPU libraries. Results: The GASAL2 library provides specialized, accelerated kernels for local, global and all types of semi-global alignment. Pairwise sequence alignment can be performed with and without traceback. GASAL2 outperforms the fastest CPU-optimized SIMD implementations such as SeqAn and Parasail, as well as NVIDIA's own GPU-based library known as NVBIO. GASAL2 is unique in performing sequence packing on GPU, which is up to 750x faster than NVBIO. Overall on Geforce GTX 1080 Ti GPU, GASAL2 is up to 21x faster than Parasail on a dual socket hyper-threaded Intel Xeon system with 28 cores and up to 13x faster than NVBIO with a query length of up to 300 bases and 100 bases, respectively. GASAL2 alignment functions are asynchronous/non-blocking and allow full overlap of CPU and GPU execution. The paper shows how to use GASAL2 to accelerate BWA-MEM, speeding up the local alignment by 20x, which gives an overall application speedup of 1.3x vs. CPU with up to 12 threads. Conclusions: The library provides high performance APIs for local, global and semi-global alignment that can be easily integrated into various bioinformatics tools.
... Since three kinds of pairwise sequence alignment (global, semi-global and local) have the same framework and differ only in details, techniques of speeding up one can be applied to the other two with tiny modifications. Different kinds of high-performance platforms, especially accelerators, such as FPGAs [5,6] and GPUs [7-16], are used to reduce their execution time. ...
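The shared framework mentioned above can be illustrated with a minimal, score-only sketch. It is not taken from any of the cited implementations: the function name `align_score`, the linear gap penalty, and the scoring values are assumptions chosen for illustration, and the semi-global variant shown (free end gaps in both sequences) is only one of several possible definitions. The three modes differ only in how the borders are initialized, whether scores are clamped at zero, and where the final score is read.

```python
def align_score(a, b, mode="global", match=1, mismatch=-1, gap=-1):
    """Score-only pairwise alignment with a linear gap penalty (illustrative)."""
    if mode not in ("global", "semiglobal", "local"):
        raise ValueError(f"unknown mode: {mode}")
    free_ends = mode != "global"      # detail 1: border initialization
    m = len(b)
    prev = [0 if free_ends else j * gap for j in range(m + 1)]
    best_any = 0                      # best score over all cells (local)
    best_last_col = prev[m]           # best score in the last column (semi-global)
    for i, ca in enumerate(a, start=1):
        cur = [0 if free_ends else i * gap]
        for j, cb in enumerate(b, start=1):
            score = max(prev[j - 1] + (match if ca == cb else mismatch),
                        prev[j] + gap,
                        cur[j - 1] + gap)
            if mode == "local":       # detail 2: clamp at zero
                score = max(score, 0)
            cur.append(score)
        best_any = max(best_any, max(cur))
        best_last_col = max(best_last_col, cur[m])
        prev = cur
    # detail 3: where the final score is read
    if mode == "global":
        return prev[m]                # bottom-right corner
    if mode == "local":
        return best_any               # anywhere in the matrix
    return max(best_last_col, max(prev))  # last row or last column


print(align_score("TTACGTT", "ACG", mode="global"))
print(align_score("TTACGTT", "ACG", mode="semiglobal"))
print(align_score("TTACGTT", "ACG", mode="local"))
```

Because the inner recurrence is identical in all three modes, an optimization of that loop carries over to each of them, which is the point the passage makes.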
... However, the method is not described clearly. gpu-pairAlign [12] proposed to store the alignment moves in four Boolean backtracking matrices during the first stage and retrieve the four Boolean backtracking matrices instead of the score matrices. This group of implementations obtains the optimal alignment in linear time, but the disadvantage is that their space complexity is quadratic. ...
... The backtracking matrix in GATK HC is helpful during backtracking. It is much easier to identify the next move compared with other methods since it does not need to jump among several backtracking matrices (shown in Fig. 1) or calculate the next move based on the current move [12]. Moreover, the lengths of the consecutive deletion(s) and consecutive insertion(s) are given by the element of the btrack matrix. ...
Article
Full-text available
Background: Pairwise sequence alignment is widely used in many biological tools and applications. Existing GPU accelerated implementations mainly focus on calculating optimal alignment score and omit identifying the optimal alignment itself. In GATK HaplotypeCaller (HC), the semi-global pairwise sequence alignment with traceback has so far been difficult to accelerate effectively on GPUs. Results: We first analyze the characteristics of the semi-global alignment with traceback in GATK HC and then propose a new algorithm that allows for retrieving the optimal alignment efficiently on GPUs. For the first stage, we choose intra-task parallelization model to calculate the position of the optimal alignment score and the backtracking matrix. Moreover, in the first stage, our GPU implementation also records the length of consecutive matches/mismatches in addition to lengths of consecutive insertions and deletions as in the CPU-based implementation. This helps efficiently retrieve the backtracking matrix to obtain the optimal alignment in the second stage. Conclusions: Experimental results show that our alignment kernel with traceback is up to 80x and 14.14x faster than its CPU counterpart with synthetic datasets and real datasets, respectively. When integrated into GATK HC (alongside a GPU accelerated pair-HMMs forward kernel), the overall acceleration is 2.3x faster than the baseline GATK HC implementation, and 1.34x faster than the GATK HC implementation with the integrated GPU-based pair-HMMs forward algorithm. Although the methods proposed in this paper are intended to improve the performance of GATK HC, they can also be used in other pairwise alignments and applications.
... Thread-level parallelization for the inter-sequence layout can be applied if many alignments need to be computed, which is a common use case in bioinformatics pipelines. The workload can be easily partitioned in chunks and then computed concurrently on the different cores of the multi-core processor, manycore processor or accelerator (Blazewicz et al., 2011; Daily, 2016; Rognes, 2011). For the thread-level, intra-sequence layout, strategies have been implemented that are similar to the intra-sequence vectorization layout, including a wavefront-based model progressing along the minor diagonal (Edmiston et al., 1988; Liu et al., 2001), a striped (Li et al., 2012a) or a sequential layout (Khajeh-Saeed et al., 2010). ...
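The chunk-based partitioning described above can be sketched as follows. This is only an illustration of the inter-sequence idea, not the cited implementations: the helper names (`local_score`, `chunked`, `align_chunk`, `align_all`), the scoring values, and the chunk size are all assumptions. Note also that in CPython a thread pool mainly demonstrates the partitioning pattern; pure-Python dynamic programming is limited by the global interpreter lock, so real speedups would need processes or native kernels.

```python
from concurrent.futures import ThreadPoolExecutor


def local_score(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman score with a linear gap penalty (illustrative scoring)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            s = match if ca == cb else mismatch
            cur.append(max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


def chunked(items, size):
    """Split the list of alignment tasks into consecutive chunks."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def align_chunk(pairs):
    """One worker processes a whole chunk of independent alignments."""
    return [local_score(a, b) for a, b in pairs]


def align_all(pairs, chunk_size=2, workers=4):
    """Distribute chunks over a pool; map() returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(align_chunk, chunked(pairs, chunk_size))
        return [score for chunk in results for score in chunk]


pairs = [("ACGT", "ACGT"), ("AAAA", "TTTT"), ("GATTACA", "TTAC")]
print(align_all(pairs))
```

Since the alignments are independent, no synchronization is needed between workers; only the chunk size affects load balance.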
Article
Motivation: Pairwise sequence alignment is undoubtedly a central tool in many bioinformatics analyses. In this paper, we present a generically accelerated module for pairwise sequence alignments applicable for a broad range of applications. In our module, we unified the standard dynamic programming kernel used for pairwise sequence alignments and extended it with a generalized inter-sequence vectorization layout, such that many alignments can be computed simultaneously by exploiting SIMD (single instruction multiple data) instructions of modern processors. We then extended the module by adding two layers of thread-level parallelization, where we (a) distribute many independent alignments on multiple threads and (b) inherently parallelize a single alignment computation using a work stealing approach producing a dynamic wavefront progressing along the minor diagonal. Results: We evaluated our alignment vectorization and parallelization on different processors, including the newest Intel® Xeon® (Skylake) and Intel® Xeon Phi™ (KNL) processors, and use cases. The instruction set AVX512-BW (Byte and Word), available on Skylake processors, can genuinely improve the performance of vectorized alignments. We could run single alignments 1600 times faster on the Xeon Phi™ and 1400 times faster on the Xeon® than executing them with our previous sequential alignment module. Availability and implementation: The module is programmed in C++ using the SeqAn (Reinert et al., 2017) library and distributed with version 2.4 under the BSD license. We support SSE4, AVX2, AVX512 instructions and included UME::SIMD, a SIMD-instruction wrapper library, to extend our module for further instruction sets. We thoroughly test all alignment components with all major C++ compilers on various platforms. Supplementary information: Supplementary data are available at Bioinformatics online.
... Note that GPUs have a memory hierarchy distinct from that of CPUs and care must be taken in order not to incur excessive latency by suboptimal memory access and transfers (Farber, 2011). In this regard, our approach employs a number of optimization techniques previously proposed in the literature (Blazewicz et al., 2011; Manavski and Valle, 2008; Schatz et al., 2007) for dynamic-programming-based sequence alignment on GPUs, such as efficient use of shared memory, coalesced access to global memory and compact encoding of backtracking information. To accelerate the pairwise distance computation, we adopted our prior work on GPU-based acceleration of distance calculation (Lee et al., 2012) and extended it to multi-GPU and CPU co-processing. ...
... The GPU memory utilization explained above was inspired by the prior work on the GPU-based parallelization of dynamic programming (Blazewicz et al., 2011; Manavski and Valle, 2008; Schatz et al., 2007). The model construction is done in the host with the pairwise alignment results coming from the device. ...
Article
Motivation: Metagenomic sequencing has become a crucial tool for obtaining a gene catalogue of operational taxonomic units (OTUs) in a microbial community. A typical metagenomic sequencing produces a large amount of data (often in the order of terabytes or more), and computational tools are indispensable for efficient processing. In particular, error correction in metagenomics is crucial for accurate and robust genetic cataloging of microbial communities. However, many existing error-correction tools take a prohibitively long time and often bottleneck the whole analysis pipeline. Results: To overcome this computational hurdle, we analyzed and exploited the data-level parallelism that exists in the error-correction procedure and proposed a tool named MUGAN that exploits both multi-core central processing units (CPUs) and multiple graphics processing units (GPUs) for co-processing. According to the experimental results, our approach reduced not only the time demand for denoising amplicons from approximately 59 hours to only 46 minutes, but also the overestimation of the number of OTUs, estimating 6.7 times fewer species-level OTUs than the baseline. In addition, our approach provides web-based intuitive visualization of results. Given its efficiency and convenience, we anticipate that our approach would greatly facilitate denoising efforts in metagenomics studies. Availability: http://data.snu.ac.kr/pub/mugan. Contact: sryoon@snu.ac.kr. Supplementary information: Supplementary data are available at Bioinformatics online.
... Blazewicz et al. [19] propose a protein alignment algorithm with a backtracking routine on a GPU platform. They evaluate and compare the performance of the parallel algorithm in single-GPU and multi-GPU cases. ...
... In the SIMD manner, the GPU threads execute the same sequence alignment procedure, but each thread aligns a separate database sequence. This parallel manner is similar to the one adopted in [17] and [19]. Another classic way to parallelize alignment algorithms is to compute along anti-diagonals of the DP matrix [16,18,20], which is based on the idea that computations of cells on the same anti-diagonal are independent and can be processed in parallel. ...
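The anti-diagonal idea can be made concrete with a small sketch. It is a sequential illustration only, with assumed function names and scoring values, and it does not reproduce the cited GPU kernels. Cell (i, j) depends on (i-1, j-1), (i-1, j) and (i, j-1), all of which lie on the two preceding anti-diagonals, so every cell with the same i + j could be assigned to its own thread; the running maximum would then become a parallel reduction.

```python
def sw_antidiagonal(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman score, filling the DP matrix one anti-diagonal at a time."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for d in range(2, n + m + 1):          # d = i + j identifies the anti-diagonal
        i_lo = max(1, d - m)
        i_hi = min(n, d - 1)
        # The iterations of this inner loop are mutually independent:
        # each one reads only cells from anti-diagonals d-1 and d-2.
        for i in range(i_lo, i_hi + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


def sw_rowmajor(a, b, match=2, mismatch=-1, gap=-1):
    """Reference implementation with the usual row-by-row traversal."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


# Both traversal orders fill the same matrix, so the scores agree.
print(sw_antidiagonal("GATTACA", "GCATGCT"), sw_rowmajor("GATTACA", "GCATGCT"))
```

The number of cells per anti-diagonal grows from 1 to min(n, m) and shrinks again, which is why this layout keeps fewer threads busy at the start and end of the matrix than the one-alignment-per-thread layout described first.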
Article
Full-text available
In biological research, alignment of protein sequences by computer is often needed to find similarities between them. Although results can be computed in a reasonable time for the alignment of two sequences, solving massive sequence alignment problems such as protein database search still consumes a great deal of central processing unit (CPU) time. In this paper, an optimized protein database search method is presented and tested with Swiss-Prot database on graphic processing unit (GPU) devices, and further, the power of CPU multi-threaded computing is also involved to realize a GPU-based heterogeneous parallelism. In our proposed method, a hybrid alignment approach is implemented by combining the Smith–Waterman local alignment algorithm with the Needleman–Wunsch global alignment algorithm, and parallel database search is realized with the compute unified device architecture (CUDA) parallel computing framework. In the experiment, the algorithm is tested on a lower-end and a higher-end personal computer equipped with GeForce GTX 750 Ti and GeForce GTX 1070 graphics cards, respectively. The results show that the parallel method proposed in this paper can achieve a speedup of up to 138.86 times over the serial counterpart, improving efficiency and convenience of protein database search significantly.
... With the emergence of dedicated hardware accelerators, several GPU solutions have arrived [10]-[16]. Most notably Blazewicz [15] demonstrated a pure Smith-Waterman implementation on a single Nvidia GTX280 GPU, yielding a performance of ∼5 GCUPS for sequences of length 51, lessening to ∼4 GCUPS for the longest testing sequence length of 459. Liu [17] demonstrates a hybrid CPU-GPU implementation called CUDASW++ 3.0 attaining a performance of 83.3 GCUPS on an Nvidia GTX680 GPU aided by a CPU. ...
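For readers unfamiliar with the unit, GCUPS counts billions of DP-matrix cells updated per second. The sketch below shows the arithmetic; the function names are hypothetical, and the example assumes that both sequences in a pair have the quoted length, which the passage does not state explicitly.

```python
def gcups(len_a, len_b, n_alignments, seconds):
    """Billions of DP cells updated per second for a batch of alignments."""
    cells = len_a * len_b * n_alignments
    return cells / seconds / 1e9


def alignments_per_second(gcups_value, len_a, len_b):
    """Invert the definition: how many pairs per second a given GCUPS implies."""
    return gcups_value * 1e9 / (len_a * len_b)


# One thousand 1000 x 1000 alignments finished in one second is exactly 1 GCUPS.
print(gcups(1000, 1000, 1000, 1.0))

# Assumption: both sequences have length 459. At about 4 GCUPS this
# corresponds to roughly 19,000 pairwise alignments per second.
print(round(alignments_per_second(4.0, 459, 459)))
```

Because the cell count grows with the product of the two lengths, GCUPS figures are only comparable when the sequence lengths and the work included (score only versus score plus backtracking) are comparable as well.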
... An improvement to the backtracking procedure was proposed in [18]. The method used four Boolean matrices to store the directions of backward moves during the backtracking process. ...
Article
Full-text available
Objectives: To design a new framework to efficiently parallelize the steps of the VLASPD algorithm using a hybridized apriori and fp-growth on GPU; to implement the existing and proposed framework in CUDA; to improve performance factors such as computational time, memory and CPU utilization. Methods/Statistical Analysis: This paper proposes the acceleration of Protein-Protein Interactions (PPIs) prediction on Graphics Processing Units (GPUs). A GPU can provide more processing cores and computational power at the same cost as a CPU. Findings: The frequently occurring patterns in the protein sequences can be used for PPIs prediction. Moving the approaches from fixed length to variable length increases computational complexity but is also found to be advantageous. Applications/Improvements: Since sequence biology is being researched by various computer engineers, GPUs can be employed for predicting various sequence interactions such as DNA-protein interactions. Since the GPU runs parallel code efficiently, the methodology can be further improved if efficiently parallelized.
... Fine-grained proposals are the ones that use more than one PE (Processing Element) to compute the same DP matrix; otherwise, the proposal is classified as coarse-grained. Even though there are several multi-PE coarse-grained approaches for SW in the literature ([28,9,22,6]), we will provide in this section a discussion of fine-grained approaches, which are more closely related to our work. Table 1 lists fine-grained SW implementations for platforms composed of multiple PEs. ...
Article
Full-text available
This paper proposes and evaluates CUDAlign 4.0, a parallel strategy to obtain the optimal alignment of huge DNA sequences in multi-GPU platforms, using the exact Smith-Waterman (SW) algorithm. In the first phase of CUDAlign 4.0, a huge Dynamic Programming (DP) matrix is computed by multiple GPUs, which asynchronously communicate border elements to the right neighbor in order to find the optimal score. After that, the traceback phase of SW is executed. The efficient parallelization of the traceback phase is very challenging because of the high amount of data dependency, which particularly impacts the performance and limits the application scalability. In order to obtain a multi-GPU highly parallel traceback phase, we propose and evaluate a new parallel traceback algorithm called Incremental Speculative Traceback (IST), which pipelines the traceback phase, speculating incrementally over the values calculated so far, producing results in advance. With CUDAlign 4.0, we were able to calculate SW matrices with up to 60 Peta cells, obtaining the optimal local alignments of all Human and Chimpanzee homologous chromosomes, whose sizes range from 26 Millions of Base Pairs (MBP) up to 249 MBP. As far as we know, this is the first time such comparison was made with the SW exact method. We also show that the IST algorithm is able to reduce the traceback time from 2.15× up to 21.03×, when compared with the baseline traceback algorithm. The human×chimpanzee chromosome 5 comparison (180 MBP×183 MBP) attained 10,370.00 GCUPS (Billions of Cells Updated per Second) using 384 GPUs, with a speculation hit ratio of 98.2 percent.
... Initially, programming these graphics chips for bioinformatics applications still required programming with shaders using languages such as OpenGL [24]. The release of CUDA in 2007 made the use of GPUs for general purpose computing more accessible and subsequently a number of CUDA-enabled Smith-Waterman implementations have been presented in recent years [4,25-33]. A number of MPI-based solutions for progressive multiple sequence alignments are targeted towards PC clusters [34-37]. ...
Article
Full-text available
Background: Computing alignments between two or more sequences are common operations frequently performed in computational molecular biology. The continuing growth of biological sequence databases establishes the need for their efficient parallel implementation on modern accelerators. Results: This paper presents new approaches to high performance biological sequence database scanning with the Smith-Waterman algorithm and the first stage of progressive multiple sequence alignment based on the ClustalW heuristic on a Xeon Phi-based compute cluster. Our approach uses a three-level parallelization scheme to take full advantage of the compute power available on this type of architecture; i.e. cluster-level data parallelism, thread-level coarse-grained parallelism, and vector-level fine-grained parallelism. Furthermore, we re-organize the sequence datasets and use Xeon Phi shuffle operations to improve I/O efficiency. Conclusions: Evaluations show that our method achieves a peak overall performance up to 220 GCUPS for scanning real protein sequence databanks on a single node consisting of two Intel E5-2620 CPUs and two Intel Xeon Phi 7110P cards. It also exhibits good scalability in terms of sequence length and size, and number of compute nodes for both database scanning and multiple sequence alignment. Furthermore, the achieved performance is highly competitive in comparison to optimized Xeon Phi and GPU implementations. Our implementation is available at https://github.com/turbo0628/LSDBS-mpi.