Parallel protein multiple sequence alignment approaches: a systematic literature review

The Journal of Supercomputing
https://doi.org/10.1007/s11227-022-04697-9
SergioH.Almanza‑Ruiz1· ArturoChavoya1 · HectorA.Duran‑Limon1
Accepted: 28 June 2022
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022
Abstract
Multiple sequence alignment approaches refer to algorithmic solutions for the alignment
of biological sequences. Since multiple sequence alignment has exponential time
complexity when a dynamic programming approach is applied, a substantial number of
parallel computing approaches have been implemented in the last two decades to improve
their performance. In this paper, we present a systematic literature review of parallel
computing approaches applied to multiple sequence alignment algorithms for proteins,
published in the open literature from 1988 to 2022; we extracted articles from four
scientific databases: ACM Digital Library, IEEE Xplore, Science Direct and SpringerLink,
and four journals: Bioinformatics, PLOS Computational Biology, PLOS ONE, and Scientific
Reports. Additionally, in order to cover other potential databases and journals, we
performed a transversal search through Google Scholar. We conducted a selection process
that yielded 106 research articles; then, we analyzed these articles and defined a
classification framework. Additionally, we point out some directions and trends for
parallel computing approaches for multiple sequence alignment, as well as some unsolved
problems.
Keywords Systematic review · Multiple sequence alignment · Parallel programming · Protein
* Arturo Chavoya
achavoya@cucea.udg.mx
Sergio H. Almanza‑Ruiz
sergio.almanza3518@alumnos.udg.mx
Hector A. Duran‑Limon
hduran@cucea.udg.mx
1 Department of Information Systems, CUCEA - Universidad de Guadalajara, Periférico Norte
799, Zapopan 45100, Jalisco, Mexico
S.H.Almanza-Ruiz et al.
1 3
1 Introduction
Bioinformatics as defined in [1] consists of the application of tools of computation
and analysis to the retrieval and interpretation of biological data. It is an
interdisciplinary field, which harnesses computer science, mathematics, physics, and
biology. The handling and analysis of biological sequences remain one of the prime tasks
of bioinformatics [2]. The most basic biological sequence analysis is to ask if two
nucleotide or protein sequences are related [3]. Pairwise alignment consists of an
arrangement of two nucleotide or amino acid sequences obtained from successively
reordering and comparing the sequences, residue by residue, with a scoring that
models matches, mismatches, and gaps as evolutionary events and where the score
is optimized. Thus, from a pairwise alignment, inferences regarding similarity of
function, structural motifs, or discernible evolutionary relationships can be drawn.
On the other hand, a multiple sequence alignment (MSA) is a rearrangement of a
set of three or more DNA, RNA, or protein sequences which are aligned to make
the residues from different sequences line up in vertical columns in such a manner
that best explains structural, functional, or evolutionary relationships. This topic is
usually divided into two parts: functional genomics and comparative genomics; in
the former, researchers seek to determine the role of the sequences in the living cell,
whereas in the latter the aim is to determine ancestries and correlations by comparing
sequences from different organisms, or even individuals [2].
Figure 1 shows an example of two different MSAs of the same three proteins,
where each letter in the sequences represents a specific amino acid from the short
protein sequences, and the gaps in the proposed alignments are represented by
dashes. For the sake of simplicity, the scoring function was defined as follows: when
aligning a pair of symbols, the score is $1$ if the symbols match, $-1$ if they mismatch,
$-2$ if a gap and a symbol are aligned, and $0$ when two gaps are aligned. It is important
to mention that in practice, protein sequence alignments are scored using substitution
matrices, such as BLOSUM62 [4]. For the example, let $S_1$, $S_2$ and $S_3$ be three
protein sequences defined as $S_1 = \mathrm{EGGMF}$, $S_2 = \mathrm{GMMFG}$, and
$S_3 = \mathrm{MGEEF}$. In the first alignment, which is shown in Fig. 1a, the score for
the first column is the sum of $\mathrm{score}(E, -) = -2$, $\mathrm{score}(E, M) = -1$,
and $\mathrm{score}(-, M) = -2$, which gives a sum of pairs score of $-5$ for that column;
the scores for the rest of the columns in the alignment are computed analogously. The sum
of pairs (SP) for the multiple sequence alignment is calculated as the aggregation of the
individual scores for each column, which for the first alignment is $-13$. The sum of pairs
score for the alternative MSA shown in Fig. 1b was also computed and yielded a value of
$-24$. From these results, we could conclude that the first alignment is better than the
second one, as it has a higher score. In general, the sum of pairs score for a multiple
sequence alignment depends on how the residues have been rearranged through
the insertion of gaps in certain positions, and different alignments for the same
sequences usually yield different sum of pairs scores.
Fig. 1 Simple example of two alternative multiple sequence alignments of the same three proteins, where
letters indicate specific amino acids in the protein sequences, dashes represent gaps, and SP is the sum of
pairs score for the alignment
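The simplified scoring scheme described above can be sketched in Python as follows. This is
only an illustrative sketch: the function names are our own choices, and the three-row
alignment in the usage lines is a hypothetical arrangement of the same sequences, not one of
the alignments of Fig. 1; only its first column (E, gap, M) is taken from the text.

```python
from itertools import combinations

GAP = "-"


def pair_score(a: str, b: str) -> int:
    """Simplified scheme from the text: match 1, mismatch -1, gap/symbol -2, gap/gap 0."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return -2
    return 1 if a == b else -1


def column_score(column) -> int:
    """Sum the pair scores over every unordered pair of symbols in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sum_of_pairs(alignment) -> int:
    """Sum of pairs (SP) score: the column scores added over all columns."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(column_score(col) for col in zip(*alignment))


# First column discussed in the text: E, gap, M.
print(column_score("E-M"))  # -5

# Hypothetical alignment of S1, S2 and S3 (assumption, not taken from Fig. 1).
hypothetical = ["EGGMF-", "-GMMFG", "MGEEF-"]
print(sum_of_pairs(hypothetical))
```

In practice the `pair_score` function would be replaced by a lookup in a substitution matrix
such as BLOSUM62 together with a gap penalty model.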
Multiple sequence alignment is an NP-complete problem when a brute force
approach is applied, whereas when using dynamic programming the complexity is
$O(L^N)$, where $L$ is the length of the sequences and $N$ is the number of sequences
[5]. These facts have motivated researchers to improve the efficiency of MSA algorithms
by heuristic and metaheuristic approaches, as well as to decrease the computational
cost by means of parallel implementations. We focused the present systematic
review on protein MSA algorithms, since this type of alignment is generally
more accurate when based on amino acids than on their corresponding nucleotides.
This is for the following reasons: first, the small size of the DNA alphabet
in the case of nucleotide sequences makes it more likely to find alignments due to
randomness rather than to similarity of function and structure, and second, because
sequence similarity degrades more rapidly at the DNA level than at the amino acid
level, a mutation in an amino acid sequence is likely to be more meaningful in terms
of functionality [6, 7]. In addition, we considered that it was more likely to find
relevant research for the present review if the search was focused on parallel
implementations of algorithms for protein MSA, as currently there are more efforts in
this direction.
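To give a feeling for the $O(L^N)$ growth mentioned above, the following sketch counts the
cells of the $N$-dimensional dynamic programming lattice. It assumes that all $N$ sequences
have the same length $L$; the function name is illustrative, and the factor of $2^N - 1$
predecessor cells per lattice cell is a standard property of the exact recurrence rather than
something stated in the text, so the second figure should be read as an upper bound on the
number of cell updates.

```python
def dp_lattice_cost(seq_len: int, num_seqs: int) -> tuple[int, int]:
    """Return (lattice cells, upper bound on predecessor look-ups) for exact MSA.

    Assumes num_seqs sequences that all have length seq_len.
    """
    cells = (seq_len + 1) ** num_seqs   # one cell per combination of prefixes
    predecessors = 2 ** num_seqs - 1    # each cell looks at up to 2^N - 1 neighbours
    return cells, cells * predecessors


# Growth with the number of sequences for short sequences of length 100.
for n in range(2, 7):
    cells, updates = dp_lattice_cost(100, n)
    print(f"N={n}: {cells:.3e} cells, up to {updates:.3e} look-ups")
```

Even for short sequences the lattice quickly becomes too large to store or traverse, which is
the practical motivation for the heuristic and parallel approaches covered in this review.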
This paper presents a systematic literature review of parallel implementations
of protein multiple sequence alignment algorithms that were published in the open
literature during the period 1988 to 2022. The main goal of this review is to analyze
and characterize the parallel implementations of MSA algorithms. In order to
achieve the aforementioned goal, this review aims to yield a useful classification of
the parallel implementation of multiple sequence alignment algorithms. A second
aim is to find the parallel programming approaches that have been used to implement
the parallelization of MSA algorithms. The third aim is to find the multiple
sequence alignment algorithms which were most frequently parallelized. And
finally, a fourth aim is to point out some unsolved problems regarding the parallel
implementations of MSA algorithms.
The paper is organized as follows. The Methods section describes the systematic
literature review methodology and how we conducted it. In the Results and discussion
section, we present the answers to the questions formulated for this systematic
literature review and therein we also propose a classification framework for parallel
implementations of protein MSA algorithms, which was used to classify every
implementation in the reviewed articles. Lastly, in the Conclusion section we present
our closing remarks of the present review.
S.H.Almanza-Ruiz et al.
1 3
2 Methods
2.1 Systematic review method
A systematic literature review is a defined and methodical way to identify, evaluate,
and synthesize all available research relevant to a particular research question, subject,
or phenomenon to understand the current direction and status of research or provide
background to identify research challenges [8–15]. There exist several methodologies
to extract information from a set of consulted articles. The methodology that is used in
the present review is based on the approaches followed by Flores-Contreras et al. [15]
and Mahdavi-Hezavehi et al. [16].
The goal for this systematic literature review was formulated using the goal part
formulation from the Goal–Question–Metric perspective (purpose, issue, object,
viewpoint) as described in [16], whereas we conducted the rest of the protocol as in [15].
The components of the goal formulation for the review are the following:
Purpose. Analyze and characterize.
Issue. Improvement in performance.
Object. Parallel implementations of multiple sequence alignment algorithms for
proteins.
Viewpoint. Point of view of the researcher.
In other words, the goal of this systematic literature review of parallel implementations
of multiple sequence alignment algorithms for proteins has the purpose of analyzing
and characterizing the articles in the literature, where the issue to be analyzed is the
improvement in performance from the point of view of the researcher in the field.
Once the review goal was defined, we briefly describe the components of the protocol
that we followed (Research questions, Keywords and search string, and Study
selection), as in [15]; further details on the application of the protocol will be given in
the corresponding subsections.
Research questions. The research questions express the motivation about a literature
review and determine the information to be extracted from the reviewed articles.
Keywords and search string. In order to collect information based on the questions,
a number of keywords have to be obtained; thus, the search string is designed
using these keywords.
Study selection. The study selection determines two main elements in the review
protocol: the period of time of the published literature and the databases or individual
journals from which articles are to be extracted.
2.2 Research questions
Specifying the research questions is the most important part of any systematic
literature review, since they drive the search process of the articles and the data
extraction. After the research questions have been specified, we can then proceed
to answer them and perform the data analysis and synthesis, as pointed out in [11,
15]. The research questions relevant to the field of parallel implementations of multiple
sequence alignment algorithms used in this systematic literature review are presented
in Table 1.
The first research question RQ1 has the objective of yielding a useful synthesis of
the information obtained through analyzing the collected articles, via a classification
framework. The second question RQ2 aims to identify the parallel programming
approaches applied to parallelize multiple sequence alignment algorithms. The third
question RQ3 seeks to identify which multiple sequence alignment algorithms have
been parallelized and the underlying reasons for such trends. Finally, RQ4 serves to
point out some of the unsolved problems in the field.
2.3 Keywords andsearch string
In order to build the search string, we extracted the keywords from the nouns of
the research questions as described in Flores-Contreras et al. [15] and we obtained
the keywords as presented in Table 2. We simplified compound nouns, for example,
“parallel programming approaches” was simplified to “parallel,” whereas “multiple
sequence alignment algorithms” was simplified to “multiple sequence alignment.”
Therefore, these keywords were simplified to three keywords: multiple sequence
alignment, parallel, and protein.
Afterward, we included synonyms and alternative spellings of the three keywords:
multiple sequence alignment, parallel, and protein, as shown in Table 3. In
the case of the parallel keyword, we included words such as: faster, reconfigurable,
accelerated, and optimization as part of the search string, in order to expand the
search for relevant articles to the parallel programming approaches.
Consequently, we divided the construction of the search string into three parts:
Multiple sequence alignment field part: "multiple sequence alignment" OR
"MSA" OR "multiple biological sequence alignment"
Table 1 Research questions used for collecting data
Question ID   Research question
RQ1           How can we yield a useful classification of the parallel implementations of multiple sequence alignment algorithms?
RQ2           What parallel programming approaches have been used to enhance performance of multiple sequence alignment algorithms?
RQ3           What protein multiple sequence alignment algorithms are the most frequently parallelized and how were they parallelized?
RQ4           What are some of the unsolved problems of parallel implementations of multiple sequence alignment algorithms?
S.H.Almanza-Ruiz et al.
1 3
Parallel computing field part: "parallel" OR "parallelization" OR "parallelisation"
OR "distributed" OR "parallel algorithm" OR "high performance
computing" OR "accelerated" OR "HPC" OR "supercomputing" OR "cloud
computing" OR "supercomputer" OR "reconfigurable" OR "multi-core" OR
"multicore" OR "grid computing" OR "grid computation" OR "optimization"
OR "optimisation" OR "cluster" OR "FPGA" OR "faster"
Protein sequences part: "amino acid" OR "protein"
Finally, we connected each part with an AND logical operator and obtained the
following search string:
("multiple sequence alignment" OR "MSA" OR "multiple biological sequence
alignment") AND ("parallel" OR "parallelization" OR "parallelisation" OR
"distributed" OR "parallel algorithm" OR "high performance computing" OR
"accelerated" OR "HPC" OR "supercomputing" OR "cloud computing" OR
"supercomputer" OR "reconfigurable" OR "multi-core" OR "multicore" OR "grid
computing" OR "grid computation" OR "optimization" OR "optimisation" OR
"cluster" OR "FPGA" OR "faster") AND ("amino acid" OR "protein").
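The construction of the search string from its three parts can be reproduced with a short
script. The sketch below is only an illustration of the OR/AND composition described above;
the function and variable names are our own, and the exact syntax accepted by each database
search engine may differ from this generic form.

```python
MSA_TERMS = [
    "multiple sequence alignment", "MSA", "multiple biological sequence alignment",
]
PARALLEL_TERMS = [
    "parallel", "parallelization", "parallelisation", "distributed",
    "parallel algorithm", "high performance computing", "accelerated", "HPC",
    "supercomputing", "cloud computing", "supercomputer", "reconfigurable",
    "multi-core", "multicore", "grid computing", "grid computation",
    "optimization", "optimisation", "cluster", "FPGA", "faster",
]
PROTEIN_TERMS = ["amino acid", "protein"]


def or_group(terms):
    """Quote each term and join the terms of one part with OR."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_search_string(*parts):
    """Join the OR-groups of all parts with AND."""
    return " AND ".join(or_group(part) for part in parts)


print(build_search_string(MSA_TERMS, PARALLEL_TERMS, PROTEIN_TERMS))
```

Keeping the term lists separate from the composition step makes it easy to apply only some of
the parts to a given search field, as was necessary for several of the databases.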
Table 2 Keywords extracted from research questions
Question ID   Keywords
RQ1           Parallel implementations, multiple sequence alignment
RQ2           Parallel programming approaches, performance of multiple sequence alignment algorithms
RQ3           Protein, multiple sequence alignment algorithms, parallel
RQ4           Parallel implementations, multiple sequence alignment algorithms
Table 3 Synonyms or alternative spellings for keywords
Keyword                       Synonyms or alternative spelling
Multiple sequence alignment   Multiple sequence alignment, MSA, multiple biological sequence alignment
Parallel                      Parallel, parallelization, parallelisation, distributed, parallel algorithm, high performance computing, accelerated, HPC, supercomputing, cloud computing, supercomputer, reconfigurable, multi-core, multicore, grid computing, grid computation, optimization, optimisation, cluster, FPGA, faster
Protein                       Amino acid, protein
2.4 Study selection
To collect the articles, we chose four scientific databases: ACM Digital Library,
IEEE Xplore, Science Direct and SpringerLink, as in [15]; additionally, we collected
articles from Bioinformatics, PLOS Computational Biology, PLOS ONE
and Scientific Reports because these are journals that can contain articles focused
on bioinformatics and are of high quality, as measured by their h-index. We
applied our search string in the Advanced search section of the aforementioned
scientific databases and journals. Since the search engine varies among different
scientific databases and journals, we had to apply different search approaches
depending on the scientific database or journal.
In the case of the ACM Digital Library, the full search string was applied to
the Full text and Abstract fields, whereas for the Title field, the first and second
parts of the search string were applied; 28 research articles were obtained.
For the IEEE Xplore database, the search string was applied to the Full text
only and All Metadata fields, and we obtained 107 research articles.
Regarding the Science Direct database, the three parts of the search string
were applied to the Search Anywhere field, whereas the multiple sequence alignment
and parallel computing components were applied to the Title, Abstract or
Author-specified keywords fields. From this database, we retrieved 70 articles.
In the case of the SpringerLink database, due to the limitations of the search
engine, we proceeded as follows. We applied our search string to the Full text
field and obtained a total of 4559 entries, from which 2859 journal articles and
489 conference proceedings articles were retrieved; the rest of the entries were
not articles. The aforementioned entries were downloaded in the form of csv
(comma-separated values) files which contained only the title of the articles.
Since the information retrieved only contained the title of the articles, a filter with
the two parts shown below was implemented. This filter contained more terms
than the previous search string for other databases, since in this case we were
discriminating using only the title, and it was required to add enough flexibility to
the search. It should be pointed out that this filter was refined after several iterations
of search and analysis of the results obtained. The filter parts are as follows:
The multiple sequence alignment part contained the following terms and we
applied an OR between each term: "multiple sequence alignment", "biological
sequence alignment", "progressive alignment", "MSA", "Clustal", "MAFFT",
"MUSCLE", "T-COFFEE", "PROBCONS", "PASTA", "SATE", and "MSAProbs".
The parallel programming part contained the following terms and again
we applied an OR between each term: "parallel", "parallelization",
"parallelisation", "distributed", "parallel algorithm", "high performance computing",
"accelerated", "HPC", "supercomputing", "cloud computing", "supercomputer",
"reconfigurable", "multi-core", "multicore", "grid computing",
"grid computation", "optimization", "optimisation", "cluster", "FPGA", and
"faster".
We applied an AND between the previous parts.
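A title filter of this kind can be sketched as follows. This is not the filter actually used
for the review, whose implementation is not described; it is a minimal sketch that assumes
case-insensitive substring matching on a list of titles, and the function names and example
titles are hypothetical.

```python
MSA_TITLE_TERMS = [
    "multiple sequence alignment", "biological sequence alignment",
    "progressive alignment", "MSA", "Clustal", "MAFFT", "MUSCLE", "T-COFFEE",
    "PROBCONS", "PASTA", "SATE", "MSAProbs",
]
PARALLEL_TITLE_TERMS = [
    "parallel", "parallelization", "parallelisation", "distributed",
    "parallel algorithm", "high performance computing", "accelerated", "HPC",
    "supercomputing", "cloud computing", "supercomputer", "reconfigurable",
    "multi-core", "multicore", "grid computing", "grid computation",
    "optimization", "optimisation", "cluster", "FPGA", "faster",
]


def contains_any(title, terms):
    """True if at least one term occurs in the title (assumed case-insensitive)."""
    lowered = title.lower()
    return any(term.lower() in lowered for term in terms)


def title_passes_filter(title):
    """OR within each part, AND between the two parts."""
    return (contains_any(title, MSA_TITLE_TERMS)
            and contains_any(title, PARALLEL_TITLE_TERMS))


# Hypothetical titles, used only to exercise the filter.
titles = [
    "A parallel implementation of Clustal on a GPU",
    "Protein structure prediction with deep learning",
    "Accelerated MAFFT on FPGA",
]
print([t for t in titles if title_passes_filter(t)])
```

Plain substring matching is deliberately permissive: short terms such as "SATE" or "MSA" can
match inside unrelated words, so a filter like this is best followed by manual screening, as
was done in the selection process.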
S.H.Almanza-Ruiz et al.
1 3
After applying the aforementioned step to the SpringerLink database, we obtained
20 articles from journals and 35 from conference proceedings, totaling 55 articles.
Regarding the Bioinformatics journal, the full search string was applied to the
Full text, Title and Abstract fields of the advanced search section; we obtained 551
articles.
As for the PLOS journals, the full search string was applied to the Title and
Abstract fields of the advanced search section; we obtained 358 articles. It should
be noted that we searched within all the PLOS journals, but only obtained relevant
results from PLOS Computational Biology and PLOS ONE.
Lastly, for the Scientific Reports journal, the full search string was applied to the
Title and Terms fields of the advanced search section; we obtained 500 articles. It is
worth noting that Scientific Reports is part of the Nature journals and we searched
through all of them, but only obtained relevant results from the Scientific Reports
journal.
In summary, the search in the scientific databases and journals rendered a total of
1669 articles, as can be seen in Table 4.
2.5 Selection process
The selection process allowed us to identify the articles that were more related to
our review. This process had the following steps: The first step consisted of reading
the abstract of every single article, from which we obtained information, such as
whether the article was focused on parallel implementations of protein MSA algorithms,
and a general perspective of how the research was conducted; we selected
only those papers whose main focus was directly related to the subject of this review.
The second step was to read all the content of the articles previously selected, as
well as their reference section to select additional articles. We discarded the articles
that were not pertinent to the review. The third step was assessing the quality of
the articles by using the h-index and the CORE ranking. The final step consisted of
assessing the relevance of the articles to this review according to a score defined for
this purpose.
Table 4 Number of articles after applying the search string
Repository                   Number of articles after applying the search string
ACM digital library          28
IEEE Xplore                  107
Science direct               70
SpringerLink                 55
Bioinformatics               551
PLOS computational biology   39
PLOS ONE                     319
Scientific reports           500
Total                        1669
2.5.1 Reading the abstracts
We read the 1669 abstracts from the articles that were collected after applying the
search string, and we selected 241 articles whose topic was related to the focus of
the present review, as presented in Table 5.
2.5.2 Reading the content of the articles
As the second step of the selection process we read the content of the 241 articles
selected in the previous step, and we discarded articles that were not directly
related to the topic of parallel implementations of protein MSA algorithms
according to the subject matter of the article. In addition, we read the reference
section of these articles, and after selecting those articles that seemed pertinent to
the review from their title, we applied the first two steps of the selection process
(reading the abstract and reading the content) and obtained 18 additional articles.
After this step, we obtained a total of 147 articles, as shown in Table 6.
2.5.3 Quality assessment
The third step of the selection process consisted in testing the 147 research articles
obtained in the previous step for their quality, considering their h-index [17]
to select articles that were published in a journal, and the CORE ranking [18]
to select articles published in conference proceedings. We selected those articles
whose journals had an h-index greater than or equal to 20 and included all articles
that had a CORE ranking. We applied the aforementioned criteria and discarded
33 articles from conference proceedings and 2 articles from journals; thus,
after this process we obtained 112 articles, as presented in Table 7.
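The quality criteria described above amount to a simple rule per venue type, which can be
sketched as follows. The record layout, field names, and function name are assumptions made
for illustration; only the thresholds (journal h-index of at least 20, any CORE ranking for
conference proceedings) come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Article:
    title: str
    venue_type: str                  # "journal" or "conference" (assumed encoding)
    h_index: Optional[int] = None    # h-index of the journal, if known
    core_rank: Optional[str] = None  # CORE ranking of the conference, if any


def passes_quality(article: Article, min_h_index: int = 20) -> bool:
    """Journal articles need h-index >= 20; conference articles need a CORE ranking."""
    if article.venue_type == "journal":
        return article.h_index is not None and article.h_index >= min_h_index
    if article.venue_type == "conference":
        return article.core_rank is not None
    return False


# Hypothetical records, used only to exercise the rule.
candidates = [
    Article("Journal paper A", "journal", h_index=35),
    Article("Journal paper B", "journal", h_index=12),
    Article("Conference paper C", "conference", core_rank="B"),
    Article("Conference paper D", "conference"),
]
print([a.title for a in candidates if passes_quality(a)])
```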
Table 5 Number of articles remaining after reading the abstracts
Repository                   After applying the search string   After reading the abstract
ACM digital library          28                                 11
IEEE Xplore                  107                                81
Science direct               70                                 22
SpringerLink                 55                                 49
Bioinformatics               551                                50
PLOS computational biology   39                                 1
PLOS ONE                     319                                22
Scientific reports           500                                5
Total                        1669                               241
S.H.Almanza-Ruiz et al.
1 3
2.5.4 Relevance assessment
This was the last step of the selection process and enabled us to fine-tune the
identification of the more relevant articles to the focus of this review. We defined
6 qualitative questions as presented in Table 8. Question 1 from Table 8 is mainly
based on RQ2, whereas Questions 2 to 4 from Table 8 aim to answer how parallel
approaches were applied to multiple sequence alignment algorithms, as well as
to provide a useful classification for the papers; finally, Questions 5 and 6 from
Table 8 have to do with the reproducibility of the articles and hence with their
relevance for the field.
The numerical score for the answers to these 6 qualitative questions was
defined as follows: the answer Yes had an assigned score of 1, the answer Partially
had a score of 0.5, and No had a score of 0.
We describe next the criteria that we defined to give an answer of Yes, Partially
or No when each of the qualitative Questions 1 through 6 from Table 8 was
applied to a given article.
Table 6 Number of articles remaining after reading the content of the articles
Repository                   After reading the abstract   After reading the content
ACM digital library          11                           4
IEEE Xplore                  81                           57
Science direct               22                           8
SpringerLink                 49                           36
Bioinformatics               50                           19
PLOS computational biology   1                            1
PLOS ONE                     22                           2
Scientific reports           5                            2
Referenced                   18                           18
Total                        259                          147
Table 7 Number of articles after assessing quality
Repository                   After reading the content   After CORE ranking and h-index ≥ 20
ACM digital library          4                           2
IEEE Xplore                  57                          39
Science direct               8                           8
SpringerLink                 36                          23
Bioinformatics               19                          19
PLOS computational biology   1                           1
PLOS ONE                     2                           2
Scientific reports           2                           2
Referenced                   18                          16
Total                        147                         112
Question 1 had an answer of Yes when the main topic of the article was an implementation
of a parallel approach to multiple sequence alignment, and No otherwise.
For Question 2, we answered Yes when it was explicitly stated in the text which
subroutines were parallelized, Partially when this information had to be derived from the
context, supplementary material or another article, and we answered No when there
was no information about the parallelization of subroutines. Similarly, for Questions 3
and 4 we answered Yes when it was explicitly stated in the text which parallelization
technique or platform was used, Partially when it had to be derived, and No when it
was not clear.
As for Question 5, we answered Yes when the experimental results, such as tables or
speedup comparisons, were solid and clearly presented, Partially when the results
were poorly presented, and No in the absence of experimental results, as in the case of
articles that presented only a proposal of implementation.
Regarding Question 6, we answered Yes when the parallel implementation, as well
as the MSA algorithm, were clearly presented. We answered Partially when one or
more of the Questions from 2 to 5 were answered Partially and the rest were answered
Yes, and we answered No when at least one of the Questions from 1 to 5 was No.
We discarded all articles with a relevance score less than 3, as in [15]. The percentage
of articles according to their relevance score is shown in Table 9.
After applying the relevance criteria to the 112 articles that remained after the step
of assessing quality, we obtained 104 articles, as presented in Table 10, where we also
specify separately the number of journal and conference articles obtained.
Table 8 Questions used to assess the relevance of articles
Question                                                                                           Answer
1. Was the parallelization of a multiple sequence alignment algorithm the main goal of the paper?  Yes / No
2. Was it clear which subroutines the proposed approach parallelized?                              Yes / Partially / No
3. Was it clear which parallelization techniques were used?                                        Yes / Partially / No
4. Was it clear what platform was used to parallelize the algorithm?                               Yes / Partially / No
5. Were the experimental results solid?                                                            Yes / Partially / No
6. Is it possible to replicate the algorithm and experiments with the information provided?        Yes / Partially / No
Table 9 Percentage of the articles according to their relevance score r
Score        Very poor   Poor        Fair        Good        Very good
             r < 2       2 ≤ r < 3   3 ≤ r < 4   4 ≤ r < 5   5 ≤ r ≤ 6
Percentage   0%          7%          5%          50%         38%
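The relevance score and its bands can be expressed compactly in code. The following is a
minimal sketch under the scoring rules stated in the text (Yes = 1, Partially = 0.5, No = 0,
articles with a score below 3 discarded) and the bands of Table 9; the function names and the
example answers are hypothetical, and the sketch does not enforce that Question 1 only admits
Yes or No.

```python
ANSWER_SCORES = {"Yes": 1.0, "Partially": 0.5, "No": 0.0}


def relevance_score(answers):
    """Add up the scores of the answers to the six qualitative questions."""
    if len(answers) != 6:
        raise ValueError("exactly six answers are expected")
    return sum(ANSWER_SCORES[answer] for answer in answers)


def relevance_band(score):
    """Map a relevance score r to the bands of Table 9."""
    if score < 2:
        return "Very poor"
    if score < 3:
        return "Poor"
    if score < 4:
        return "Fair"
    if score < 5:
        return "Good"
    return "Very good"


def is_kept(score, threshold=3.0):
    """Articles with a relevance score below the threshold are discarded."""
    return score >= threshold


# Hypothetical answers for one article (Questions 1 to 6).
answers = ["Yes", "Partially", "Yes", "Yes", "Partially", "Partially"]
score = relevance_score(answers)
print(score, relevance_band(score), is_kept(score))  # 4.5 Good True
```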
S.H.Almanza-Ruiz et al.
1 3
2.5.5 Search in Google Scholar
In order to consider other possible sources of articles in addition to the aforementioned
databases and journals, we performed a transversal search using Google Scholar.
Considering the limitations of the Google Scholar search engine, as with the SpringerLink
database, we applied our search string to the available text of the articles and obtained
a total of 2084 entries, from which 1281 were journal articles and 180 were conference
proceedings articles, whereas the rest of the entries were not articles. From the entries
obtained, we excluded all those articles that were previously found in the four databases
and four journals mentioned in the preceding sections and then applied the selection
steps as described above. A summary of the results of the search in Google Scholar is
presented in Table 11.
The two remaining articles from the search in Google Scholar were journal articles,
which when added to the 104 articles found in the previous selection process give a
total of 106 articles relevant to the review, as summarized in Table 12.
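The counts reported across the selection steps can be cross-checked with a few lines of
arithmetic. The sketch below is only a consistency check on the figures quoted in the text
and tables; the variable names are illustrative.

```python
# Articles per repository after applying the search string (Table 4).
search_hits = {
    "ACM digital library": 28,
    "IEEE Xplore": 107,
    "Science direct": 70,
    "SpringerLink": 55,
    "Bioinformatics": 551,
    "PLOS computational biology": 39,
    "PLOS ONE": 319,
    "Scientific reports": 500,
}
total_hits = sum(search_hits.values())

# Figures quoted in the selection steps.
after_abstracts = 241
from_references = 18
after_content = 147
discarded_conference, discarded_journal = 33, 2
after_quality = after_content - discarded_conference - discarded_journal
after_relevance = 104
from_google_scholar = 2
final_total = after_relevance + from_google_scholar

print(total_hits)                         # 1669
print(after_abstracts + from_references)  # 259
print(after_quality)                      # 112
print(final_total)                        # 106
```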
Table 10 Number of articles from journals and conference proceedings after assessing quality and relevance
Repository                   Journal   Conference   Total
ACM digital library          0         2            2
IEEE Xplore                  4         30           34
Science direct               8         0            8
SpringerLink                 13        7            20
Bioinformatics               19        0            19
PLOS computational biology   1         0            1
PLOS ONE                     2         0            2
Scientific reports           2         0            2
Referenced                   14        2            16
Total                        63        41           104
Table 11 Number of articles found through the search in Google Scholar
Selection step                                      Number of articles
After applying the search string                    1461
After excluding duplicates from previous searches   1451
After reading the abstracts                         135
After reading the content                           25
After assessing quality                             10
After assessing relevance                           2
3 Results and discussion
In the following paragraphs, we present the answers to the research questions based
on the 106 articles obtained in the selection process.
3.1 RQ1: How can we yield a useful classification of the parallel implementations
of multiple sequence alignment algorithms?
In order to manage the variety of parallel approaches that we have found in the
literature related to multiple sequence alignment algorithms, we propose the classification
framework presented in Fig. 2.
For this classification framework, the MSA algorithm approach category
describes the multiple sequence alignment algorithm as commonly found in the
literature, whereas the rest of the categories (Spectrum, Parallelization scope, HPC
strategy and Platform type) describe how its parallel implementation was achieved.
3.1.1 Multiple sequence alignment approach terminology
The terminology used in this classification framework to describe multiple sequence
alignment approaches is compliant with the one commonly used in the literature.
Multiple sequence alignment algorithms were divided into Exact, Heuristic and
Metaheuristic approaches.
MSA algorithm approach. Here are included all MSA algorithm approaches as
frequently found in the literature.
Exact. This subcategory encompasses dynamic programming algorithms that
guarantee to find the optimal multiple sequence alignment.
Heuristic. This category consists of non-evolutionary strategies that search for high-score multiple sequence alignments but do not guarantee the optimal alignment. This category involves progressive, iterative, stochastic, and alternative subcategories.
Progressive. Progressive approaches incrementally build a multiple sequence
alignment, where the most related pairs of sequences are aligned first, and
then, a series of pairwise alignments are executed for successively less closely
related sequences.
Table 12 Number of articles from journals and conference proceedings for the review
Repository Journal Conference Total
Databases and journals 63 41 104
Google Scholar 2 0 2
Total 65 41 106
S.H.Almanza-Ruiz et al.
1 3
Iterative. This subcategory encompasses approaches that iterate an algorithm that produces a multiple sequence alignment until no further improvement can be made.
Stochastic. Approaches under this subcategory obtain a probabilistic profile alignment in order to achieve a multiple sequence alignment.
Alternative. This subcategory consists of heuristic approaches that do not fall under any of the aforementioned categories.
Metaheuristic. This category includes evolutionary algorithmic approaches where an initial population of multiple sequence alignments evolves over time and improves the MSA.
Fig. 2 A classification framework of parallel multiple sequence alignment algorithms with the number of
articles in each category indicated by the number between parentheses
3.1.2 Parallel implementation terminology
For the parallel implementation terminology, we chose to describe the range of application of the parallel implementation by the Spectrum category, which indicates whether the implementation is only intended for a specific algorithm or for every algorithm of a certain MSA approach. The Parallelization scope describes whether the parallel implementation was achieved via parallelization of subroutines of the MSA algorithm or via parallelization of the whole algorithm. The HPC strategy and Platform type categories describe the parallel programming approach and the hardware used, respectively, according to the terminology found in the literature.
Spectrum. Some papers presented a parallel implementation applicable to all
algorithms of a specific MSA approach, whereas some others developed a parallel
implementation for a specific algorithm.
Parallelization scope. The purpose of this category is to describe whether the parallel implementation of the MSA algorithm was subdivided into subroutines and, if so, how the parallelization was implemented.
Subroutine. This subcategory comprises articles in which the parallel implementation of the MSA algorithm was divided into subroutines.
Critical steps. This subcategory describes approaches to solve MSAs where only critical steps of an algorithm to solve an MSA are manually programmed.
All steps. This subcategory describes approaches to solve MSAs where all the stages of an algorithm to solve an MSA are manually programmed.
Whole algorithm. This subcategory describes articles where the approach to
find an MSA has only one stage and this stage is parallelized or has several stages
optimized by a compiler.
HPC (High-Performance Computing) strategy. In this category, articles are distinguished by the parallel computing strategy used for implementation. The HPC strategy category involves the following subcategories:
Parallel programming oriented. This subcategory describes articles where a parallel programming approach was implemented and no special hardware was built. This subcategory involves the following subcategories:
Multiprocessing. This subcategory encompasses implementations where the algorithm or some of its stages use multiple processes with separate memory.
Multithreading. This subcategory encompasses implementations where all the processing threads of the algorithm or some of its stages share the same memory.
Divide and conquer. This subcategory consists of the so-called embarrassingly parallel algorithms.
S.H.Almanza-Ruiz et al.
1 3
Intra-task parallelization. This subcategory comprises parallel implementations for the GPU platform where each task is assigned to exactly one thread block (i.e., a group of threads) and executed separately in parallel.
Hybrid. Parallel implementations under this subcategory employ several
approaches of parallel programming.
Vectorization. This subcategory encompasses parallel implementations that
employed either vector or array operations.
Special hardware oriented. This subcategory describes articles where a special
or custom hardware was built to implement a parallelization.
Platform type. This subcategory describes all platforms that have been used to
implement a given parallel approach for MSA.
The 106 selected articles were categorized according to the classification framework shown in Fig. 2, and more detailed information is provided in Table 13.
3.2 RQ2: What parallel programming approaches have been used to enhance performance of multiple sequence alignment algorithms?
The earliest parallel approach for multiple sequence alignment found by this review dates from 1988, as shown in Table 14; we also found that the average number of relevant published articles on parallel implementations for protein MSA algorithms was 3.7 per publication year.
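The average can be reproduced from the counts in Table 14. The short sketch below assumes that "publication year" means a year in which at least one selected article appeared; the variable names are illustrative.

```python
# Articles per publication year, copied from Table 14.
ARTICLES_PER_YEAR = {
    1988: 1, 1993: 2, 1995: 1, 1996: 1, 1997: 1, 1998: 1, 1999: 1, 2000: 1,
    2002: 2, 2003: 4, 2004: 3, 2005: 8, 2006: 9, 2007: 4, 2008: 3, 2009: 5,
    2010: 7, 2011: 7, 2012: 4, 2013: 10, 2014: 3, 2015: 8, 2016: 5, 2017: 7,
    2018: 3, 2019: 1, 2020: 2, 2021: 1, 2022: 1,
}

total_articles = sum(ARTICLES_PER_YEAR.values())   # all selected articles
publication_years = len(ARTICLES_PER_YEAR)         # years with at least one article
average = total_articles / publication_years

print(total_articles, publication_years, round(average, 1))
```

Under this assumption, the 106 articles are spread over 29 publication years, which rounds to the reported 3.7 articles per year.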
The first parallel approaches for MSA were deployed on early supercomputing equipment, as in [19], in which a dynamic programming approach was used. Then, from 1993 to 2000, there were some efforts to develop parallel implementations for stochastic and metaheuristic approaches of multiple sequence alignment algorithms, as in [21] and [26].
From 2001 to 2005, research focused mostly on improving performance for progressive alignment, mainly using clusters, but varying the parallel programming strategy used for the implementations. In the year 2003, Kuo Bin Li [32] implemented a multiprocessing parallelization of ClustalW, where the three stages of the algorithm were parallelized and the implementation used 16 processors; the implementation achieved a speedup of 4.3 for a dataset with 500 amino acid sequences with an average length of 1000.
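To put this speedup in perspective, it can be related to the number of processors through the usual notion of parallel efficiency (speedup divided by processor count). The efficiency value is not reported in the review; it is derived here only from the two figures quoted above, and the function name is hypothetical.

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Fraction of the ideal linear speedup that was actually obtained."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return speedup / processors


# Figures quoted above for [32]: speedup of 4.3 on 16 processors.
print(f"{parallel_efficiency(4.3, 16):.2f}")
```

A value well below 1 suggests that a sizeable part of the run time was not reduced by adding processors, which is one reason why later works explored other parallelization strategies.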
Around the year 2005, the first parallel implementations for an MSA algorithm that applied Field Programmable Gate Arrays (FPGA) appeared in some works, as in [37] and [40]; they mainly focused on speeding up the pairwise sequence alignment step of the so-called progressive approach. The parallel MSA algorithm implementation in [37] achieved an overall speedup of 11.8 with respect to ClustalW and was tested for datasets of 1000 amino acid sequences with an average length of 446.
Also in the year 2005, along with the appearance of the first commercial dual-core processors, research began to explore parallel implementations for the MSA progressive approach using multicore processors. One of these implementations can be found in [45], in which the authors used multithreading programming to parallelize critical steps of ClustalW; this implementation achieved an overall speedup
Table 13 Selected articles and their classification
Reference Approach Spectrum^a Scope^b Strategy^c Platform^d
[19] Exact Needleman–Wunsch WA VEC CPU
[20] Alternative Hierarchical clustering CS DAC CPU
[21] Stochastic Simulated Annealing WA MPR CPU
[22] Iterative Berger–Munson WA DAC CPU
[23] Stochastic Unnamed WA VEC CPU
[24] Iterative Berger–Munson WA MPR CPU
[25] Iterative Berger–Munson CS MPR CPU
[26] Metaheuristic Unnamed WA MPR CPU
[27] Metaheuristic Island Parallel GA WA MPR CPU
[28] Progressive ClustalW WA MPR CPU
[29] Progressive PRALINE AS MPR CPU
[30] Progressive ClustalW CS HYB CPU
[31] Progressive ClustalW AS MPR CPU
[32] Progressive ClustalW AS MPR CPU
[33] Progressive ClustalW CS MPR CPU
[34] Progressive ClustalW CS DAC CPU
[35] Iterative PhylTree CS MPR CPU
[36] Iterative DIALIGN P WA MPR CPU
[37] Progressive ClustalW CS SHO FPGA
[38] Progressive ClustalW CS MPR CPU
[39] Progressive ClustalW CS MPR CPU
[40] Progressive ClustalW CS SHO FPGA
[41] Progressive ClustalW CS SHO FPGA
[42] Stochastic Unnamed CS MPR CPU
[43] Progressive ClustalW CS HYB CPU
[44] Iterative PhylTree AS DAC CPU
[45] Progressive ClustalW CS MTH CPU
[46] Progressive ClustalW WA MTH CPU
[47] Iterative MUSCLE AS MTH CPU
[48] Progressive ClustalW CS MPR CPU
[49] Progressive ClustalW CS SHO FPGA
[50] Progressive ClustalW WA DAC CPU
[51] Progressive All algorithms WA DAC CPU
[52] Progressive ClustalW CS DAC CPU
[53] Iterative PhylTree CS MPR CPU
[54] Exact Unnamed CS MPR CPU
[55] Progressive ClustalW CS HYB GPU
[56] Alternative Unnamed WA DAC CPU
[57] Progressive T‑Coffee CS MPR CPU
[58] Exact Needleman–Wunsch WA DAC CPU
[59] Progressive All algorithms CS SHO GPU
[60] Alternative Sample‑Align‑D WA DAC CPU
S.H.Almanza-Ruiz et al.
1 3
Table 13 (continued)
Reference Approach Spectrum
a
Scope
b
Strategy
c
Platform
d
[61] Stochastic Unnamed CS MPR CPU
[62] Progressive ClustalW AS MTH GPU
[63] Progressive ClustalW AS MTH GPU
[64] Progressive All algorithms CS DAC CPU
[65] Progressive ClustalW CS VEC CPU
[66] Progressive T‑Coffee AS MPR CPU
[67] Progressive ClustalW CS MTH CPU
[68] Iterative MAFFT AS MTH CPU
[69] Progressive ClustalW CS VEC GC
[70] Stochastic MSAProbs AS MTH CPU
[71] Metaheuristic iiGA WA MPR CPU
[72] Progressive ClustalW CS VEC CPU
[73] Iterative DIALIGN P CS HYB CPU
[74] Progressive All algorithms CS SHO FPGA
[75] Progressive All algorithms AS SHO FPGA
[76] Progressive T‑Coffee CS MPR CPU
[77] Progressive T‑Coffee WA MPR CC
[78] Progressive Clustal Omega WA HYB CPU
[79] Metaheuristic AlineaGA WA MPR CPU
[80] Alternative GPU‑REMuSiC v1.0 CS ITP CC
[81] Progressive ClustalW CS SHO FPGA
[82] Progressive All algorithms AS MTH CPU
[83] Progressive T‑Coffee CS DAC CPU
[84] Progressive T‑Coffee AS SHO GPU
[85] Iterative DIALIGN TX CS MPR CPU
[86] Progressive ClustalW CS MPR CPU
[87] Alternative PE2A* WA MTH CPU
[88] Iterative MAFFT AS MTH CPU
[89] Progressive T‑Coffee CS MTH CPU
[90] Progressive T‑Coffee CS MPR CPU
[91] Iterative MAFFT WA MPR CPU
[92] Progressive ClustalW CS SHO FPGA
[93] Progressive All algorithms WA DAC CPU
[94] Progressive ClustalW CS MTH CPU
[95] Stochastic QuickProbs CS ITP GPU
[96] Alternative GPU‑REMuSiC v1.0 CS ITP GPU
[97] Progressive All algorithms CS VEC CPU
[98] Progressive ClustalW CS ITP GPU
[99] Alternative PASTA CS MTH CPU
[100] Stochastic UPP CS DAC CPU
[101] Progressive T‑Coffee CS MPR CPU
[102] Exact A‑star WA MTH CPU
of 2.12 with respect to ClustalW and was tested for datasets of 1000 amino acid
sequences with a fixed length of 800.
Since the year 2006, with the appearance of the GPU and CUDA programming models, researchers began to develop parallel implementations using this technology to speed up several steps of the progressive approach for multiple sequence alignment, as in [55]. In the year 2007, Zola et al. [57] developed the first parallel implementation of T-Coffee.
Around 2010, along with MSAProbs and parallel implementations for MSA algorithms such as T-Coffee and MAFFT, researchers began to explore metaheuristic and stochastic approaches for MSA algorithms with their parallel implementations. The implementation of MSAProbs [70], a stochastic approach for MSA, was parallelized using multithreading programming and a
Table 13 (continued)
Reference Approach Spectrum^a Scope^b Strategy^c Platform^d
[103] Metaheuristic MSA‑GA CS MTH CPU
[104] Iterative MAFFT CS HYB GPU
[105] Metaheuristic MSA‑GA AS MTH CPU
[106] Progressive FAMSA CS MTH CPU
[107] Stochastic MSAProbs CS MPR CPU
[108] Progressive ClustalW CS HYB CPU
[109] Progressive HAMSA CS VEC CPU
[110] Alternative PASTA CS MTH CC
[111] Progressive All algorithms AS HYB GPU
[112] Stochastic QuickProbs CS MTH GPU
[113] Progressive All algorithms CS SHO FPGA
[114] Alternative POA WA DAC CC
[115] Alternative HAlign WA MPR CC
[116] Metaheuristic M2Align WA MTH CPU
[117] Iterative MAFFT AS MTH CPU
[118] Exact A‑star WA MTH CPU
[119] Progressive All algorithms WA DAC CC
[120] Progressive KALIGN AS VEC CPU
[121] Metaheuristic Sequoya WA MPR CPU
[122] Alternative MAGUS CS MTH CPU
[123] Alternative MAGUS AS DAC CPU
[124] Progressive All algorithms CS HYB CPU
a This column contains either the specific name of the parallel MSA algorithm, or all algorithms as specified in the Spectrum category
b CS (Critical steps), AS (All steps), WA (Whole algorithm)
c MPR (Multiprocessing), MTH (Multithreading), DAC (Divide and conquer), ITP (Intra-task parallelization), HYB (Hybrid), VEC (Vectorization), SHO (Special-hardware oriented)
d CPU (Central processing unit), GPU (Graphics processing unit), FPGA (Field programmable gate array), GC (Grid computing), CC (Cloud computing)
S.H.Almanza-Ruiz et al.
1 3
CPU platform; the implementation showed results where MSAProbs was better in accuracy than MSA algorithms such as ClustalW, MAFFT and Probalign, for datasets extracted from BAliBASE, PREFAB, SABmark and OXBENCH. In 2011, a parallel implementation of T-Coffee appeared in [76], with experimental results that yielded a speedup of over 68% while preserving the accuracy.
In the year 2014, a stochastic MSA approach and its parallelization were published in the same article [95], in which the critical steps were parallelized and the implementation used a GPU platform.
Two parallelizations of approaches for MSA that applied cloud computing appeared in the year 2017: PASTASpark [110] and HAlign-II [115]; it is noteworthy that these cloud computing approaches managed to align sequence files up to 3.4 GB and 15 GB, respectively, and preserved the accuracy of the original algorithms.
In the year 2020, we found two works: Sequoya [121], an approach that used multi-objective metaheuristics for MSA, and MAGUS [122], an approach that used graph clustering to combine disjoint alignments; the latter algorithm is similar to PASTA but improves its accuracy and performance.
3.3 RQ3: What protein multiple sequence alignment algorithms are the most frequently parallelized and how were they parallelized?
We discuss next the underlying reasons for the number of articles for parallel implementations of the MSA approaches, as presented in Fig. 2, as well as describe the frequencies of appearance for the rest of categories in the classification framework.
Table 14 Articles per year
Year Number of articles Year Number of articles
1988 1 2009 5
1993 2 2010 7
1995 1 2011 7
1996 1 2012 4
1997 1 2013 10
1998 1 2014 3
1999 1 2015 8
2000 1 2016 5
2002 2 2017 7
2003 4 2018 3
2004 3 2019 1
2005 8 2020 2
2006 9 2021 1
2007 4 2022 1
2008 3
3.3.1 Parallelized exact MSA algorithms
The earliest parallel implementation of an MSA algorithm [19] was a parallelization of the exact MSA approach. It has been proved that exact MSA is an exponential problem when dynamic programming is applied [5]. Consequently, in order to improve computational time, researchers focused on developing parallel implementations with heuristic and metaheuristic approaches. This may explain why only five parallel implementations of the exact MSA approach were found in this systematic review. Nevertheless, the heuristic and metaheuristic approaches do not guarantee an optimal solution, and exact MSA algorithms have been applied to test the accuracy of other MSA approaches.
3.3.2 Parallelized heuristic MSA algorithms
According to data collected from the articles between the years 1988 and 2022, 53.77% of the articles considered in the review are parallel implementations of the progressive approach for multiple sequence alignment. Thus, the progressive approach is the most frequently parallelized approach. This is shown in Fig. 2, which gives the total number of articles for every approach considered in the review. Two important facts explain the frequency of parallelization of the progressive approach. The first is that the progressive MSA approach was one of the earliest adopted approaches for protein multiple sequence alignment, whose first implementation dates from 1994 [125], and some of its earliest parallel implementations were published around 2003, as in [31, 32]; the second is that the progressive approach has three stages that can clearly be parallelized. The first fact motivated researchers to improve previous implementations, and the second provided opportunities to test several combinations of parallelization approaches.
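The reported share can be checked against the Approach column of Table 13. In the sketch below, the per-approach counts are our own tally of that column (the review states the exact, alternative, metaheuristic and heuristic totals explicitly; the iterative and stochastic counts are derived from the table rows), and the variable names are illustrative.

```python
# Articles per MSA approach, tallied from the Approach column of Table 13.
ARTICLES_PER_APPROACH = {
    "Exact": 5,
    "Progressive": 57,
    "Iterative": 15,
    "Stochastic": 9,
    "Alternative": 12,
    "Metaheuristic": 8,
}

total = sum(ARTICLES_PER_APPROACH.values())
progressive_share = 100 * ARTICLES_PER_APPROACH["Progressive"] / total

# The Heuristic category groups the four non-evolutionary subcategories.
heuristic = sum(
    ARTICLES_PER_APPROACH[name]
    for name in ("Progressive", "Iterative", "Stochastic", "Alternative")
)

print(total, round(progressive_share, 2), heuristic)
```

With these counts, 57 of the 106 articles belong to the progressive approach, which corresponds to the 53.77% quoted above.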
We found that the second most frequently parallelized approach was the iterative approach, as presented in Fig. 2. The iterative approach is a refinement of the progressive approach that relies on the use of dynamic programming or another heuristic to realign a subset of the original sequences to the final alignment. This strategy has the advantage of improving accuracy, but increases the computational cost. These refinement strategies are specific for each algorithm and in many cases are difficult to parallelize.
As for stochastic MSA algorithms, they have the advantage of improving the accuracy of MSAs. This improvement in accuracy is achieved by probabilistically and statistically estimating MSA accuracy, as in [61] and [70]; however, this kind of approach has a high computational cost. Consequently, stochastic MSA algorithms have been parallelized to make them computationally feasible. In this sense, it is noteworthy that none of the stochastic MSA algorithms found in this systematic review had a prior serial implementation. Furthermore, some of the parallel implementations of stochastic MSA algorithms such as QuickProbs 2 [112] have outperformed progressive and iterative algorithms in accuracy for hundreds or even thousands of sequences. However, the parallel progressive MSA algorithm Clustal Omega [78] and the parallel iterative MSA algorithm MAFFT [117] still outperform stochastic approaches in performance and accuracy when obtaining MSAs of tens or hundreds of thousands of sequences.
Finally, the algorithms that were categorized under the alternative approach either applied a combination of strategies or subroutines that were employed in other heuristic approaches or implemented a new one. We found that in many of the articles, the aim was to improve the accuracy for a larger number of sequences compared to the rest of the algorithms, as in [78] and [99]. Another interesting finding was that 3 out of 12 works under this category were implemented using cloud computing, which can provide an advantage because when more computational resources are needed to process an MSA, these resources can be requested as a service.
3.3.3 Parallelized metaheuristic MSA algorithms
The parallel implementations of metaheuristic MSA algorithms improve not only the performance of their serial counterparts but also the quality of multiple sequence alignments by assessing the accuracy of alignments using multiple criteria, such as maximizing the sum of pairs score, maximizing the totally conserved columns, minimizing the number of gaps, or maximizing structural information-based scores. This improvement in accuracy can be achieved by evolutionary multiobjective optimization techniques, as in [116] and [121]. Nevertheless, the metaheuristic MSA algorithms need to employ previous multiple alignments as initial solutions, and these alignments are in most cases obtained by previously aligning with other algorithms such as ClustalW or MAFFT; thus, the accuracy and performance of metaheuristic MSA algorithms initially depend on other aligners. In addition, metaheuristic approaches have presented difficulties when processing medium to large MSA alignments, as pointed out by Zambrano-Vega et al. [116]. Thus, the latter two issues can explain why we found only eight parallel implementations in the Metaheuristic category, whereas there were 93 articles with the Heuristic approach.
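Three of the criteria mentioned above can be sketched for a small alignment as follows. This is only an illustration: the scoring values (match 1, mismatch 0, gap -1, gap-gap pairs ignored), the function names and the toy alignment are assumptions, and the cited tools use substitution matrices and more elaborate gap models.

```python
from itertools import combinations

GAP = "-"


def sum_of_pairs(alignment, match=1, mismatch=0, gap=-1):
    """Sum-of-pairs score under a toy scoring scheme (values are assumptions)."""
    score = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == GAP and b == GAP:
                continue              # gap-gap pairs are ignored in this sketch
            if a == GAP or b == GAP:
                score += gap
            elif a == b:
                score += match
            else:
                score += mismatch
    return score


def totally_conserved_columns(alignment):
    """Number of gap-free columns in which every residue is identical."""
    return sum(
        1 for column in zip(*alignment)
        if GAP not in column and len(set(column)) == 1
    )


def gap_count(alignment):
    """Total number of gap characters in the alignment."""
    return sum(row.count(GAP) for row in alignment)


# Toy alignment of three short sequences (all rows have equal length).
ALIGNMENT = ["MKV-L", "MKVAL", "MRV-L"]
print(sum_of_pairs(ALIGNMENT), totally_conserved_columns(ALIGNMENT), gap_count(ALIGNMENT))
```

A multiobjective metaheuristic would treat such functions as separate objectives (the first two to be maximized, the last to be minimized) instead of combining them into a single score.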
3.3.4 How the approaches were parallelized
In this section, we discuss the reasons for the number of articles in every subcategory of the categories Spectrum, Parallelization scope, HPC strategy and Platform type, from the classification framework shown in Fig. 2. These categories and their subcategories aim to explain how the MSA approaches that were classified in the MSA algorithm approach category were parallelized.
3.3.4.1 Spectrum category We obtained 94 articles where a specific algorithm was parallelized, whereas there were only 12 where the focus was on providing a more general parallel implementation for all algorithms of the approach. This can be explained by the fact that it is more difficult to identify a general solution that can be applied to all algorithms of an approach than it is to explore different approaches of parallelization for a specific MSA algorithm that has already been parallelized, such as ClustalW.
3.3.4.2 Parallelization scope category We found 77 implementations of MSA algorithms that fell into the Subroutine subcategory, whereas only 29 in the Whole algorithm subcategory. The number of articles found in the Subroutine subcategory can be explained by the fact that the progressive and iterative MSA algorithms have steps or subroutines that can be parallelized, whereas this is not the case for most of the exact and metaheuristic MSA algorithms, which fell into the Whole algorithm subcategory. We also found that, in general, the parallel implementations that fell in the Whole algorithm subcategory applied an approach based on a strategy that splits the data. We also found among the alternative algorithms some implementations which were not divided into subroutines that can be parallelized, as in [114], and other alternative algorithms as in [115], where one of the goals was to separate the parallel implementation from the alignment algorithm, and this was achieved by splitting the data into chunks to process them in a map-reduce fashion along several stages of the algorithm. In addition, some iterative and progressive MSA algorithms fit into the Whole algorithm subcategory, in particular, implementations where the aim was to yield a general solution for all the algorithms of the progressive approach, as in [93].
Regarding the Critical steps and the All steps subcategories, we found that in the earliest implementations of the MSA progressive approach the most time-consuming subroutines were identified; consequently, researchers focused more on parallelizing critical steps than on parallelizing all the steps.
3.3.4.3 HPC strategy category This category was divided into the Parallel-programming oriented subcategory, which had 95 articles, and the Special-hardware oriented subcategory, in which there were only 11 articles. The number of articles in the aforementioned subcategories can be explained by the fact that the parallel implementations of MSA algorithms in the Special-hardware oriented subcategory require specific hardware to be built for their deployment; in contrast, the implementations in the Parallel-programming oriented subcategory can be deployed on a wider variety of platforms. Thus, many researchers focused their efforts on developing a parallel MSA algorithm rather than on developing specific hardware to deploy a parallelization of an MSA algorithm.
Within the Parallel-programming oriented subcategory, we found 33 articles under the Multiprocessing subcategory, which is the most frequent approach used in the HPC strategy category. Among these articles, we found 12 implementations from the Progressive subcategory; in many of these implementations, the subroutines with the most expensive usage of computing resources were identified, and the data were evenly distributed among processors. Regarding the Multithreading subcategory, we found 24 articles, making it the second most frequent subcategory of the HPC strategy category, where 13 out of 24 articles involved algorithms using a CPU platform. The main advantage of the multithreading approach is that it can save communication time among processes, as in [68, 82]. However, one of the disadvantages of using multithreading is that it is harder to subdivide the MSA algorithm subroutines into processes in terms of coding, than to evenly divide the data among processors.
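The idea of identifying an expensive subroutine and distributing its data evenly among workers can be sketched as follows for an all-against-all pairwise comparison. The sketch is an assumption-laden illustration, not the method of any cited article: `mismatch_fraction` is a toy stand-in for a real pairwise alignment, and all function names are hypothetical. A thread pool is used, so the workers share memory, as in the Multithreading subcategory.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from itertools import combinations


def mismatch_fraction(pair):
    """Toy distance between two sequences (a stand-in for pairwise alignment)."""
    a, b = pair
    length = max(len(a), len(b))
    if length == 0:
        return 0.0
    same = sum(x == y for x, y in zip(a, b))
    return 1.0 - same / length


def split_evenly(items, parts):
    """Split items into contiguous blocks whose sizes differ by at most one."""
    base, extra = divmod(len(items), parts)
    blocks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        blocks.append(items[start:start + size])
        start += size
    return blocks


def score_block(block):
    """Work done by one worker: score every pair in its block."""
    return [mismatch_fraction(pair) for pair in block]


def pairwise_distances(sequences, executor: Executor, workers: int):
    """Distances for all sequence pairs, computed block by block in parallel."""
    pairs = list(combinations(sequences, 2))
    blocks = split_evenly(pairs, workers)
    results = executor.map(score_block, blocks)   # results keep the block order
    return [distance for block in results for distance in block]


SEQS = ["MKVL", "MKIL", "MRVL", "AAAA"]
with ThreadPoolExecutor(max_workers=2) as pool:
    DISTANCES = pairwise_distances(SEQS, pool, workers=2)
print(DISTANCES)
```

Because `pairwise_distances` only depends on the `Executor` interface, a `ProcessPoolExecutor` could be passed instead, which would correspond to the Multiprocessing subcategory (separate memory per process). In standard CPython builds, threads do not speed up CPU-bound code of this kind, so the thread pool here serves only to illustrate the structure.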
We found 17 articles in the Divide and conquer subcategory; in the implementations of this approach a specific heuristic strategy was developed to split the data and apply the MSA algorithm to the split data, which can make the writing of implementation code cleaner, since the algorithm does not have to be divided into parallelizable subroutines. Thus, it is worth observing that the Divide and conquer approach was applied to nine parallel implementations of MSA algorithms from the Whole algorithm subcategory, as in [93]. However, the Divide and conquer approach could depend on large processing capabilities such as those provided by the Cloud computing platform, as in [114, 119].
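The split-and-apply structure described above can be sketched as follows. The aligner is passed in as a black box, which is why the algorithm itself does not need to be divided into subroutines. Everything here is hypothetical: `toy_align` merely pads sequences with gaps and is not a real MSA algorithm, fixed-size chunking stands in for the specific heuristic splitting strategies of the cited works, and the merging of sub-alignments, which those works also have to address, is omitted.

```python
def split_into_chunks(sequences, chunk_size):
    """Split the input into consecutive chunks of at most chunk_size sequences."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [sequences[i:i + chunk_size] for i in range(0, len(sequences), chunk_size)]


def toy_align(chunk):
    """Placeholder aligner: right-pads with gaps to equal length (not a real MSA)."""
    width = max(len(sequence) for sequence in chunk)
    return [sequence.ljust(width, "-") for sequence in chunk]


def align_by_chunks(sequences, chunk_size, align=toy_align):
    """Apply the whole (unmodified) aligner independently to each chunk."""
    return [align(chunk) for chunk in split_into_chunks(sequences, chunk_size)]


SEQS = ["MKVL", "MK", "MRVLA", "AA", "MKV"]
SUB_ALIGNMENTS = align_by_chunks(SEQS, 2)
print(SUB_ALIGNMENTS)
```

Since the chunks are independent, the list comprehension in `align_by_chunks` could be replaced by a parallel map over processes or cloud workers without touching the aligner.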
Concerning the Intra-task parallelization subcategory, which is a parallelization model for the GPU platform, we found only four implementations. The main advantage of applying Intra-task parallelization is that it allows thousands of threads to be executed on hundreds of cores. Intra-task parallelization was applied to parallelize the subroutines of several MSA approaches, as in [96], where the pairwise comparison subroutine of an alternative MSA algorithm was parallelized using this approach, or in [95], a stochastic MSA algorithm, where the posterior probability matrix calculation was computed 24.7 times faster with Intra-task parallelization than with the CPU-parallel implementation. A downside of Intra-task parallelization resides in its dependence on a GPU platform, which negatively affects the portability of the algorithm.
With regard to the Hybrid approach, its implementations applied a combination of parallelization strategies, thus making it possible to apply the most suitable parallel implementation for a given subroutine of the MSA algorithm, as in [108], where cluster-level data parallelism, thread-level coarse-grained parallelism, vector-level parallelism and fine-grained parallelism were applied. We also found that some of the implementations of the Hybrid approach managed to process tens of thousands of sequences, as in [108, 111]. However, the combination of several parallelization techniques can produce complex code that is difficult to read and maintain.
Regarding the articles in the Vectorization subcategory, it should be observed that the advantage of this approach is that it minimizes the number of processors required, since some of the MSA operations such as the computation of the pairwise distance matrix are vectorized, as in [65, 97]; additionally, it is worth mentioning that there was an implementation under this subcategory that achieved state-of-the-art accuracy for tens of thousands of sequences [120]. However, although some of the MSA operations are already supported by instruction set extensions such as AVX, others have to be adapted in terms of the vectorized set of operations. The latter makes the code of these approaches cumbersome to read and can have a negative impact on the portability of the application. The aforementioned disadvantages could explain why we found only eight implementations for this subcategory.
3.3.4.4 Platform type category The Platform type category was subdivided into the CPU, GPU, FPGA, Grid computing and Cloud computing subcategories. We found 80 articles under the CPU platform subcategory. It is worth clarifying that all parallel MSA algorithm implementations before 2005 under the CPU category were deployed on platforms where two or more CPUs were physically separated, in contrast with implementations published starting in 2005, which made use of a multicore platform that contained two or more CPUs on the same physical unit. The CPU platform is widely available and, unlike the GPU or FPGA platforms, there is no need to acquire or develop special hardware.
Regarding the GPU platform, we found 11 articles under this subcategory. The main advantage of the GPU platform resides in the fact that GPUs have a larger number of cores than multicore CPUs, and thus, the former have greater performance capabilities than the latter. An implementation with GPUs was used to parallelize two stochastic MSA algorithms [95, 112], which are known to have high accuracy but also high computational cost. As mentioned earlier, a downside of deploying on a GPU platform is that this type of platform negatively affects the portability of the implementation. In addition, in some cases, an extra effort is needed to port algorithms originally written for CPU platforms.
In regard to the FPGA platform, we found nine articles under this subcategory; the main advantages of using an FPGA platform for MSA algorithm parallelization are the minimization of the computational cost of communication among processing elements and the implementation of fine-grained parallelism, as in [41, 81]. The downsides are the higher economic cost of FPGAs and the possible low availability of this very specialized hardware, which could negatively impact portability.
As for the Grid computing platform, we only found one article that used this type of platform; the main advantage of employing grid computing in [69] was the use of a distributed file-allocation system that made it possible to manage large databases of sequences, which were in the order of around six million sequences, by processing batches of smaller datasets containing between 150,000 and 280,000 sequences. One of the advantages of grid computing over cloud computing resides in its performance, since the former does not use virtualization. However, the application of grid computing to implement a parallel algorithm is technically more difficult than it is for cloud computing.
With respect to the Cloud computing platform, we found five articles in this subcategory. The Cloud computing platform offers the advantage that a wide variety of CPU or GPU configurations with large processing capabilities are available as a service to test and deploy a given parallel MSA algorithm. The aforementioned large processing capabilities were suitable for implementations such as [114, 115], where the data were split and the whole algorithm was applied to chunks of data. However, in the case of databases of tens of gigabytes, the increased time required to upload such an amount of data is a drawback for this approach.
3.4 RQ4: What are some of the unsolved problems of parallel implementations of multiple sequence alignment algorithms?
One unsolved problem that we observed, based on information from the articles selected for this systematic review, is that the parallel implementations of every MSA approach have difficulties regarding accuracy and efficiency when processing ultra-large datasets, which are sets consisting of tens or hundreds of thousands of sequences generated by the rapid development of modern sequencing technology. There have been attempts to scale up the parallel implementation of multiple sequence alignment algorithms to accurately process ultra-large datasets, as in [115, 122]. However, the problem has not been solved satisfactorily, as mentioned in [123], where the author reported that MUSCLE and Clustal Omega could not process 1,000,000 sequences, in contrast with UPP, Recursive MAGUS and MAGUS, which could process this number of sequences in a single run. Nevertheless, the same author [123] also reported that MAGUS, the prior version of Recursive MAGUS, was more accurate than the latest version for sets of between 10,000 and 50,000 sequences. It is worth mentioning that UPP and Recursive MAGUS are parallel alternative MSA algorithms and have achieved better processing capability than their progressive counterparts. With regard to parallel stochastic MSA algorithms, they have not achieved the large-scale processing capability of their progressive counterparts; however, they have better accuracy for sets with a smaller number of sequences.
In the case of parallel metaheuristic MSA approaches, it has been pointed out in
[126] that finding an optimal multiple sequence alignment requires multiple
objectives; this has been achieved via multiobjective genetic algorithms. However,
as far as this systematic review is concerned, no work on metaheuristic MSA
algorithms reports handling tens of thousands of sequences, nor is there evidence
that these algorithms have clearly outperformed the accuracy of their heuristic
equivalents.
On the other hand, the progressive parallel MSA implementations have difficulty
scaling up to ultra-large datasets because of the number of comparisons required
to build the guide tree, whereas in the case of stochastic approaches we believe
that scaling up is difficult because the numerical precision needed to compute
probabilities becomes harder to maintain as the set of sequences grows.
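Both scaling difficulties can be made concrete with a small numerical sketch. It rests on two assumptions that are ours rather than taken from the reviewed tools: the guide tree is built from a full all-against-all distance matrix, so that n sequences need n(n - 1)/2 comparisons, and probabilities are stored as double-precision floats and multiplied directly. The function names are hypothetical.

```python
import math
from typing import Iterable


def pairwise_comparisons(n_sequences: int) -> int:
    """Number of distinct sequence pairs, n * (n - 1) / 2."""
    return n_sequences * (n_sequences - 1) // 2


def naive_product(probabilities: Iterable[float]) -> float:
    """Multiply probabilities directly; long products underflow to 0.0."""
    result = 1.0
    for p in probabilities:
        result *= p
    return result


def log_product(probabilities: Iterable[float]) -> float:
    """Sum log-probabilities instead, which stays representable."""
    return sum(math.log(p) for p in probabilities)


if __name__ == "__main__":
    for n in (1_000, 100_000, 1_000_000):
        print(n, pairwise_comparisons(n))

    many_small = [0.01] * 400
    print(naive_product(many_small))  # underflows in double precision
    print(log_product(many_small))    # finite log-probability
```

Under these assumptions, one million sequences already require roughly 5 x 10^11 pairwise comparisons, and a direct product of a few hundred small probabilities collapses to zero, which is one reason probabilistic methods commonly work in log space.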
Since some of the parallel MSA implementations that process large-scale datasets,
such as Clustal Omega and PASTA, apply a pre-clustering step, we also suggest as a
possible direction for future work the application of reinforcement learning to
adaptively guide a pre-clustering algorithm that splits up the data in order to
improve performance and accuracy. Another possible direction for future work is
to adapt stochastic algorithms such as QuickProbs 2 for divide-and-conquer
parallelization, splitting the data and using a cloud computing platform in order
to scale up their processing of large-scale datasets.
Regarding the application of quantum computing to multiple sequence alignment, it
is worth mentioning that some researchers have proposed algorithms that use this
approach to improve the performance of pairwise sequence alignment, which as
mentioned earlier is the basis of MSA algorithms, as in [127]. This quantum
approach consists of the use of dot-matrix plotting and quantum pattern
recognition to align a pair of sequences. The authors of this work conducted
simulations and a complexity analysis in order to compare the quantum algorithm
against its classical computing counterparts in terms of speed and computational
resources. They claim that the obtained results strongly suggest that an actual
implementation of their quantum pairwise sequence alignment algorithm would
outperform its classical computing counterparts in terms of time and space
complexity. Nonetheless, the same authors observed that technical issues must
still be resolved before this quantum pairwise alignment algorithm can actually be
implemented. This approach could
also be applied to stochastic MSA approaches by doing quantum sampling, in order
to circumvent the issues of numerical stability of stochastic simulation.
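For readers unfamiliar with dot-matrix plotting, the following is a purely classical sketch of the idea; it does not reproduce the quantum algorithm of [127]. The function names are hypothetical, and the diagonal-run measure is only one simple way to read a dot matrix.

```python
from typing import List


def dot_matrix(seq_a: str, seq_b: str) -> List[List[int]]:
    """Cell (i, j) is 1 when residue i of seq_a equals residue j of seq_b."""
    return [[1 if a == b else 0 for b in seq_b] for a in seq_a]


def longest_diagonal_run(matrix: List[List[int]]) -> int:
    """Length of the longest run of 1s along any top-left to bottom-right diagonal."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    run = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(rows):
        for j in range(cols):
            if matrix[i][j]:
                previous = run[i - 1][j - 1] if i > 0 and j > 0 else 0
                run[i][j] = previous + 1
                best = max(best, run[i][j])
    return best


if __name__ == "__main__":
    matrix = dot_matrix("GATTACA", "ATTAC")
    for row in matrix:
        print("".join("*" if cell else "." for cell in row))
    print(longest_diagonal_run(matrix))
```

Long diagonal runs in the matrix correspond to stretches where the two sequences match residue by residue, which is the kind of pattern that a pattern-recognition step would look for.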
4 Conclusion
We carried out a selection process for the present review in which the initial
search yielded 1669 research articles. We read the abstracts of these 1669
articles and selected 241 articles related to the review topic, whose reference
sections yielded 18 additional articles. Thus, in summary, we selected 259
articles related to the topic of the review. Next, we read the content of those
259 articles, and after this process, 147 articles remained. Afterward, we
assessed the quality and relevance of these 147 articles and obtained a total of
104 articles relevant for the review. Subsequently, we performed a transversal
search through Google Scholar using the same selection criteria as for the
databases and journals, and found two additional articles, giving a grand total
of 106 articles.
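The counts reported in this selection process can be cross-checked with a few lines of arithmetic. The variable names below are our own labels for the stages described in the text.

```python
# Counts reported for each stage of the selection process.
initial_hits = 1669
selected_by_abstract = 241
added_from_references = 18
after_full_read = 147
after_quality_assessment = 104
added_from_google_scholar = 2

# Derived totals stated in the text.
topic_related = selected_by_abstract + added_from_references
final_total = after_quality_assessment + added_from_google_scholar

if __name__ == "__main__":
    print("Articles related to the topic:", topic_related)
    print("Articles in the final selection:", final_total)
    print("Share of initial hits retained: {:.1%}".format(final_total / initial_hits))
```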
Based on the results of the systematic review, we propose a classification
framework for parallel implementations of protein MSA algorithms. The proposed
classification framework can help researchers identify which combinations of
protein MSA algorithms and parallel computing approaches have been used and to
what extent, by examining the number of implementations for each particular
approach.
As for the best approach to use, that depends entirely on the goals that the
researchers are trying to fulfill and the drawbacks they are willing to accept.
For instance, if high accuracy is the main goal irrespective of the time taken by
the algorithm, then one of the exact approaches could be the best choice; on the
other hand, if the datasets are very large and no local high-performance hardware
is available, then one of the cloud computing approaches might be the best option.
In general, as far as this systematic literature review is concerned, no single
parallel MSA approach or implementation outperforms the others in all relevant
aspects, such as accuracy and the capacity to process large datasets. As mentioned
in previous sections, some parallel MSA approaches such as MAGUS or Clustal Omega
outperform the rest of the MSA aligners for processing ultra-large datasets,
whereas parallel stochastic MSA implementations outperform most MSA approaches in
accuracy up to a limit in the size of the dataset. Therefore, there is room for
research into improving parallel MSA approaches, either for processing ultra-large
datasets, for obtaining higher accuracy, or both.
Finally, we identified some areas of opportunity for future work. The first area
consists of improving the capacity of parallel MSA algorithms to process
ultra-large datasets. A possible approach to scaling up an algorithm could be to
adaptively split the data using reinforcement learning in order to improve the
performance of the MSA algorithm. A second area for future work could consist of
developing parallel deep learning-based MSA algorithms. A third area could be the
use of quantum computing for improving the performance and accuracy of parallel
S.H.Almanza-Ruiz et al.
1 3
MSA algorithms, even though there are still technical issues for its deployment that
must be solved.
Supplementary Information The online version contains supplementary material available at
https://doi.org/10.1007/s11227-022-04697-9.
Funding Sergio H. Almanza‑Ruiz is receiving a full‑time scholarship for his graduate studies from the
Mexican National Council for Science and Technology (CONACyT).
Data availability All data generated or analyzed during this study are included in this published article
and its supplementary information file.
Declarations
Competing interests The authors have no competing interests to declare that are relevant to the content
of this article.
References
1. Bayat A (2002) Bioinformatics. BMJ 324(7344):1018–1022. https://doi.org/10.1136/bmj.324.7344.1018
2. Ramsden J (2009) Bioinformatics: An Introduction, 2nd edn. Springer, London, England. https://doi.org/10.1007/978-1-84800-257-9
3. Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, England. https://doi.org/10.1017/CBO9780511790492
4. Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Nat Acad Sci U. S. A. 89(22):10915–10919. https://doi.org/10.1073/pnas.89.22.10915
5. Bonizzoni P, Vedova GD (2001) The complexity of multiple sequence alignment with SP-score that is a metric. Theor Comput Sci 259(1):63–79. https://doi.org/10.1016/S0304-3975(99)00324-2
6. Wernersson R, Pedersen AG (2003) RevTrans: multiple alignment of coding DNA from aligned amino acid sequences. Nucleic Acids Res 31(13):3537–3539. https://doi.org/10.1093/nar/gkg609
7. Abascal F, Zardoya R, Telford MJ (2010) TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations. Nucleic Acids Res 38(supp 2):7–13. https://doi.org/10.1093/nar/gkq291
8. Kitchenham BA, Charters S (2007) Guidelines for performing systematic literature reviews in software engineering. Technical Report EBSE 2007-001, Keele University and Durham University Joint Report. https://www.elsevier.com/__data/promis_misc/525444systematicreviewsguide.pdf
9. Chen L, Ali Babar M (2011) A systematic review of evaluation of variability management approaches in software product lines. Inf Softw Technol 53(4):344–362. https://doi.org/10.1016/j.infsof.2010.12.006. Special Section: Software Engineering track of the 24th Annual Symposium on Applied Computing
10. Salleh N, Mendes E, Grundy J (2011) Empirical studies of pair programming for CS/SE teaching in higher education: a systematic literature review. IEEE Trans Softw Eng 37(4):509–525. https://doi.org/10.1109/TSE.2010.59
11. Hall T, Beecham S, Bowes D, Gray D, Counsell S (2012) A systematic literature review on fault prediction performance in software engineering. IEEE Trans Softw Eng 38(6):1276–1304. https://doi.org/10.1109/TSE.2011.103
12. Galster M, Weyns D, Tofan D, Michalik B, Avgeriou P (2014) Variability in software systems–a systematic literature review. IEEE Trans Softw Eng 40(3):282–306. https://doi.org/10.1109/TSE.2013.56
13. de Freitas Junior M, Fantinato M, Sun V (2015) Improvements to the function point analysis method: a systematic literature review. IEEE Trans Eng Manag 62(4):495–506. https://doi.org/10.1109/TEM.2015.2453354
14. Hujainah F, Bakar RBA, Abdulgabber MA, Zamli KZ (2018) Software requirements prioritisation: a systematic literature review on significance, stakeholders, techniques and challenges. IEEE Access 6:71497–71523. https://doi.org/10.1109/ACCESS.2018.2881755
15. Flores-Contreras J, Duran-Limon HA, Chavoya A, Almanza-Ruiz SH (2021) Performance prediction of parallel applications: a systematic literature review. J Supercomput 77(4):4014–4055. https://doi.org/10.1007/s11227-020-03417-5
16. Mahdavi-Hezavehi S, Galster M, Avgeriou P (2013) Variability in quality attributes of service-based software systems: a systematic literature review. Inf Softw Technol 55(2):320–343. https://doi.org/10.1016/j.infsof.2012.08.010. Special Section: Component-Based Software Engineering (CBSE), 2011
17. Bornmann L, Daniel H-D (2007) What do we know about the h index? J Am Soc Inf Sci Technol 58(9):1381–1385. https://doi.org/10.1002/asi.20609
18. Welcome to CORE (2022). https://www.core.edu.au. Accessed 2022-01-26
19. Tajima K (1988) Multiple DNA and protein sequence alignment on a workstation and a supercomputer. Bioinformatics 4(4):467–471. https://doi.org/10.1093/bioinformatics/4.4.467
20. Date S, Kulkarni R, Kulkarni B, Kulkarni-Kale U, Kolaskar AS (1993) Multiple alignment of sequences on parallel computers. Bioinformatics 9(4):397–402. https://doi.org/10.1093/bioinformatics/9.4.397
21. Ishikawa M, Toya T, Hoshida M, Nitta K, Ogiwara A, Kanehisa M (1993) Multiple sequence alignment by parallel simulated annealing. Bioinformatics 9(3):267–273. https://doi.org/10.1093/bioinformatics/9.3.267
22. Yap TK, Munson PJ, Frieder O, Martino RL (1995) Parallel multiple sequence alignment using speculative computation. In: Proceedings of the 1995 International Conference on Parallel Processing ICPP
23. Hughey R, Krogh A (1996) Hidden Markov models for sequence analysis: extension and analysis of the basic method. Bioinformatics 12(2):95–107. https://doi.org/10.1093/bioinformatics/12.2.95
24. Martino RL, Yap TK, Suh EB (1997) Parallel algorithms in molecular biology. In: Hertzberger B, Sloot P (eds) High-Performance Computing and Networking. Springer, Berlin, Heidelberg, pp 232–240
25. Yap TK, Frieder O, Martino RL (1998) Parallel computation in biological sequence analysis. IEEE Trans Paral Distrib Syst 9(3):283–294. https://doi.org/10.1109/71.674320
26. Anbarasu LA, Narayanasamy P, Sundararajan V (1999) Multiple sequence alignment using parallel genetic algorithms. In: McKay B, Yao X, Newton CS, Kim J-H, Furuhashi T (eds) Simulated Evolution and Learning. Springer, Berlin, Heidelberg, pp 130–137
27. Anbarasu LA, Narayanasamy P, Sundararajan V (2000) Multiple molecular sequence alignment by island parallel genetic algorithm. Curr Sci 78(7):858–863
28. Catalyurek U, Stahlberg E, Ferreira R, Kurc T, Saltz J (2002) Improving performance of multiple sequence alignment analysis in multi-client environments. In: Proceedings 16th International Parallel and Distributed Processing Symposium, p. 8. https://doi.org/10.1109/IPDPS.2002.1016584
29. Kleinjung J, Douglas N, Heringa J (2002) Parallelized multiple alignment. Bioinformatics 18(9):1270–1271. https://doi.org/10.1093/bioinformatics/18.9.1270
30. Catalyurek U, Gray M, Kurc T, Saltz J, Stahlberg E, Ferreira R (2003) A component-based implementation of multiple sequence alignment. In: Proceedings of the 2003 ACM Symposium on Applied Computing. SAC ’03, pp. 122–126. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/952532.952559
31. Cheetham J, Dehne F, Pitre S, Rau-Chaplin A, Taillon PJ (2003) Parallel CLUSTAL W for PC clusters. In: Kumar V, Gavrilova ML, Tan CJK, L’Ecuyer P (eds) International Conference on Computational Science and Its Applications - ICCSA 2003, pp. 300–309. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44843-8_32
32. Li K-B (2003) ClustalW-MPI: ClustalW analysis using distributed and parallel computing. Bioinformatics 19(12):1585–1586. https://doi.org/10.1093/bioinformatics/btg192
33. Zhihua D, Feng L (2003) Parallel computation for multiple sequence alignments. In: Fourth International Conference on Information, Communications and Signal Processing, 2003 and the Fourth Pacific Rim Conference on Multimedia. Proceedings of the 2003 Joint, vol. 1, pp. 300–3031. https://doi.org/10.1109/ICICS.2003.1292464
S.H.Almanza-Ruiz et al.
1 3
34. Ebedes J, Datta A (2004) Multiple sequence alignment in parallel on a workstation cluster. Bioinformatics 20(7):1193–1195. https://doi.org/10.1093/bioinformatics/bth055
35. Parmentier G, Trystram D, Zola J (2004) Cache-based parallelization of multiple sequence alignment problem. In: Danelutto M, Vanneschi M, Laforenza D (eds) Euro-Par 2004 Parallel Processing. Springer, Berlin, Heidelberg, pp 1005–1012. https://doi.org/10.1007/978-3-540-27866-5_135
36. Schmollinger M, Nieselt K, Kaufmann M, Morgenstern B (2004) DIALIGN P: Fast pair-wise and multiple sequence alignment using parallel processors. BMC Bioinf 5(1):128. https://doi.org/10.1186/1471-2105-5-128
37. Lin X, Peiheng Z, Dongbo B, Shengzhong F, Ninghui S (2005) To accelerate multiple sequence alignment using FPGAs. In: Eighth International Conference on High-Performance Computing in Asia-Pacific Region (HPCASIA’05), pp. 5–180. https://doi.org/10.1109/HPCASIA.2005.96
38. Lopes HS, Moritz GL (2005) A distributed approach for a multiple sequence alignment algorithm using a parallel virtual machine. In: 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference, pp. 2843–2846. https://doi.org/10.1109/IEMBS.2005.1617066
39. Luo J, Ahmad I, Ahmed M, Paul R (2005) Parallel multiple sequence alignment with dynamic scheduling. In: International Conference on Information Technology: Coding and Computing (ITCC’05) - Volume II, vol. 1, pp. 8–131. https://doi.org/10.1109/ITCC.2005.223
40. Oliver T, Schmidt B, Maskell D, Nathan D, Clemens R (2005) Multiple sequence alignment on an FPGA. In: 11th International Conference on Parallel and Distributed Systems (ICPADS’05), vol. 2, pp. 326–330. https://doi.org/10.1109/ICPADS.2005.202
41. Oliver T, Schmidt B, Nathan D, Clemens R, Maskell D (2005) Using reconfigurable hardware to accelerate multiple sequence alignment with ClustalW. Bioinformatics 21(16):3431–3432. https://doi.org/10.1093/bioinformatics/bti508
42. Rajasekaran S, Thapar V, Dave H, Huang C-H (2005) Randomized and parallel algorithms for distance matrix calculations in multiple sequence alignment. J Clin Monit Comput 19(4):351–359. https://doi.org/10.1007/s10877-005-0680-3
43. Tan G, Feng S, Sun N (2005) Parallel multiple sequences alignment in SMP cluster. In: Eighth International Conference on High-Performance Computing in Asia-Pacific Region (HPCASIA’05), pp. 6–431. https://doi.org/10.1109/HPCASIA.2005.70
44. Trystram D, Zola J (2005) Parallel multiple sequence alignment with decentralized cache support. In: Cunha JC, Medeiros PD (eds) Euro-Par 2005 Parallel Processing. Springer, Berlin, Heidelberg, pp 1217–1226. https://doi.org/10.1007/11549468_133
45. Chaichoompu K, Kittitornkun S, Tongsima S (2006) MT-ClustalW: multithreading multiple sequence alignment. In: Proceedings 20th IEEE International Parallel Distributed Processing Symposium, p. 8. https://doi.org/10.1109/IPDPS.2006.1639537
46. Chaichoompu K, Kittitornkun S (2006) Multithreaded ClustalW with improved optimization for Intel multi-core processor. In: 2006 International Symposium on Communications and Information Technologies, pp. 590–594. https://doi.org/10.1109/ISCIT.2006.340018
47. Deng X, Li E, Shan J, Chen W (2006) Parallel implementation and performance characterization of MUSCLE. In: Proceedings 20th IEEE International Parallel Distributed Processing Symposium, p. 7. https://doi.org/10.1109/IPDPS.2006.1639616
48. Du Z, Lin F (2006) pNJTree: A parallel program for reconstruction of neighbor-joining tree and its application in ClustalW. Paral Comput 32(5):441–446. https://doi.org/10.1016/j.parco.2006.05.001
49. Oliver T, Schmidt B, Maskell D, Nathan D, Clemens R (2006) High-speed multiple sequence alignment on a reconfigurable platform. Int J Bioinf Res Appl 2(4):394–406. https://doi.org/10.1504/IJBRA.2006.011038
50. Rezaei S, Monwar MM (2006) Divide-and-Conquer algorithm for ClustalW-MPI. In: 2006 Canadian Conference on Electrical and Computer Engineering, pp. 717–720. https://doi.org/10.1109/CCECE.2006.277630
51. Rezaei S, Monwar MM, Bai J (2006) Performance comparison of MPI-based parallel multiple sequence alignment algorithm using single and multiple guide trees. In: 2006 5th IEEE International Conference on Cognitive Informatics, vol. 1, pp. 595–600. https://doi.org/10.1109/COGINF.2006.365552
52. Tan G, Peng L, Feng S, Sun N (2006) Load balancing and parallel multiple sequence alignment with tree accumulation. In: Nagel WE, Walter WV, Lehner W (eds) Euro-Par 2006 Parallel Processing. Springer, Berlin, Heidelberg, pp 1138–1147. https://doi.org/10.1007/11823285_120
53. Zola J, Trystram D, Tchernykh A, Brizuela C (2006) Parallel multiple sequence alignment with local phylogeny search by simulated annealing. In: Proceedings 20th IEEE International Parallel Distributed Processing Symposium, p. 8. https://doi.org/10.1109/IPDPS.2006.1639536
54. Lin CY, Huang CT, Chung Y-C, Tang CY (2007) Efficient parallel algorithm for optimal three-sequences alignment. In: 2007 International Conference on Parallel Processing (ICPP 2007), pp. 14–14. https://doi.org/10.1109/ICPP.2007.38
55. Liu W, Schmidt B, Voss G, Muller-Wittig W (2007) Streaming algorithms for biological sequence alignment on GPUs. IEEE Trans Paral Distrib Syst 18(9):1270–1281. https://doi.org/10.1109/TPDS.2007.1069
56. Low DHP, Veeravalli B, Bader DA (2007) On the design of high-performance algorithms for aligning multiple protein sequences on mesh-based multiprocessor architectures. J Paral Distrib Comput 67(9):1007–1017. https://doi.org/10.1016/j.jpdc.2007.03.007
57. Zola J, Yang X, Rospondek A, Aluru S (2007) PARALLEL-TCOFFEE: A parallel multiple sequence aligner. In: Proceedings of the ISCA 20th International Conference on Parallel and Distributed Computing Systems, September 24-26, 2007, Las Vegas, Nevada, USA, pp. 248–253
58. Helal M, El-Gindy H, Mullin L, Gaeta B (2008) Parallelizing optimal multiple sequence alignment by dynamic programming. In: 2008 IEEE International Symposium on Parallel and Distributed Processing with Applications, pp. 669–674. https://doi.org/10.1109/ISPA.2008.93
59. Manavski SA, Valle G (2008) CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinf 9(2):10. https://doi.org/10.1186/1471-2105-9-S2-S10
60. Saeed F, Khokhar A (2008) Sample-Align-D: A high performance multiple sequence alignment system using phylogenetic sampling and domain decomposition. In: 2008 IEEE International Symposium on Parallel and Distributed Processing, pp. 1–9. https://doi.org/10.1109/IPDPS.2008.4536174
61. Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L (2009) Fast statistical alignment. PLOS Comput Biol 5(5):1–15. https://doi.org/10.1371/journal.pcbi.1000392
62. Liu Y, Schmidt B, Maskell DL (2009) MSA-CUDA: Multiple sequence alignment on graphics processing units with CUDA. In: 2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors, pp. 121–128. https://doi.org/10.1109/ASAP.2009.14
63. Liu Y, Schmidt B, Maskell DL (2009) Parallel reconstruction of neighbor-joining trees for large multiple sequence alignments using CUDA. In: 2009 IEEE International Symposium on Parallel Distributed Processing, pp. 1–8. https://doi.org/10.1109/IPDPS.2009.5160923
64. Saeed F, Khokhar A (2009) A domain decomposition strategy for alignment of multiple biological sequences on multiprocessor platforms. J Paral Distrib Comput 69(7):666–677. https://doi.org/10.1016/j.jpdc.2009.03.006
65. Wirawan A, Schmidt B, Kwoh CK (2009) Pairwise distance matrix computation for multiple sequence alignment on the cell broadband engine. In: Allen G, Nabrzyski J, Seidel E, van Albada GD, Dongarra J, Sloot PMA (eds) Computational Science - ICCS 2009. Springer, Berlin, Heidelberg, pp 954–963
66. Di Tommaso P, Orobitg M, Guirado F, Cores F, Espinosa T, Notredame C (2010) Cloud-Coffee: implementation of a parallel consistency-based multiple alignment algorithm in the T-Coffee package and its benchmarking on the Amazon Elastic-Cloud. Bioinformatics 26(15):1903–1904. https://doi.org/10.1093/bioinformatics/btq304
67. Isaza S, Sanchez F, Gaydadjiev G, Ramirez A, Valero M (2010) Scalability analysis of progressive alignment on a multicore. In: 2010 International Conference on Complex, Intelligent and Software Intensive Systems, pp. 889–894. https://doi.org/10.1109/CISIS.2010.149
68. Katoh K, Toh H (2010) Parallelization of the MAFFT multiple sequence alignment program. Bioinformatics 26(15):1899–1900. https://doi.org/10.1093/bioinformatics/btq224
69. Kim T, Joo H (2010) ClustalXeed: a GUI-based grid computation version for high performance and terabyte size multiple sequence alignment. BMC Bioinf 11(1):467. https://doi.org/10.1186/1471-2105-11-467
70. Liu Y, Schmidt B, Maskell DL (2010) MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities. Bioinformatics 26(16):1958–1964. https://doi.org/10.1093/bioinformatics/btq338
71. Miranda LA, Caetano MAF, Melo ACMA, Correa JM, Bordim JL (2010) Multiple biological sequence alignment with a parallel island injection genetic algorithm. In: 2010 IEEE 12th International Conference on High Performance Computing and Communications (HPCC), pp. 314–321. https://doi.org/10.1109/HPCC.2010.31
72. Wirawan A, Kwoh CK, Schmidt B (2010) Multi-threaded vectorized distance matrix computation on the CELL/BE and x86/SSE2 architectures. Bioinformatics 26(10):1368–1369. https://doi.org/10.1093/bioinformatics/btq135
73. de Araujo Macedo E, Magalhaes Alves de Melo AC, Pfitscher GH, Boukerche A (2011) Hybrid MPI/OpenMP strategy for biological multiple sequence alignment with DIALIGN-TX in heterogeneous multicore clusters. In: 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, pp. 418–425. https://doi.org/10.1109/IPDPS.2011.169
74. Lloyd S, Snell QO (2011) Accelerated large-scale multiple sequence alignment. BMC Bioinf 12(1):466. https://doi.org/10.1186/1471-2105-12-466
75. Nguyen KD, Pan Y, Nong G (2011) Parallel progressive multiple sequence alignment on reconfigurable meshes. BMC Genom 12(5):4. https://doi.org/10.1186/1471-2164-12-S5-S4
76. Orobitg M, Guirado F, Notredame C, Cores F (2011) Exploiting parallelism on progressive alignment methods. J Supercomput 58(2):186–194. https://doi.org/10.1007/s11227-009-0359-5
77. Rius J, Cores F, Solsona F, van Hemert JI, Koetsier J, Notredame C (2011) A user-friendly web portal for T-Coffee on supercomputers. BMC Bioinf 12(1):150. https://doi.org/10.1186/1471-2105-12-150
78. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 7(1):539. https://doi.org/10.1038/msb.2011.75
79. da Silva FJM, Pérez JMS, Pulido JAG, Rodríguez MAV (2011) Parallel Niche Pareto AlineaGA - an evolutionary multiobjective approach on multiple sequence alignment. J Integr Bioinf 8(3):57–72. https://doi.org/10.1515/jib-2011-174
80. Lin Y-S, Lin C-Y, Chung Y-C (2012) GPU-based cloud service for multiple sequence alignments with regular expression constrains. In: 4th IEEE International Conference on Cloud Computing Technology and Science Proceedings, pp. 741–746. https://doi.org/10.1109/CloudCom.2012.6427565
81. Mahram A, Herbordt MC (2012) FMSA: FPGA-accelerated ClustalW-based multiple sequence alignment through pipelined prefiltering. In: 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines, pp. 177–183. https://doi.org/10.1109/FCCM.2012.38
82. Marucci EA, Zafalon GFD, Momente JC, Pinto AR, Amazonas JRA, Shiyou Y, Sato LM, Machado JM (2012) Using threads to overcome synchronization delays in parallel multiple progressive alignment algorithms. Curr Res Bioinf 1:50–63. https://doi.org/10.3844/ajbsp.2012.50.63
83. Orobitg M, Cores F, Guirado F, Kemena C, Notredame C, Ripoll A (2012) Enhancing the scalability of consistency-based progressive multiple sequences alignment applications. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pp. 71–82. https://doi.org/10.1109/IPDPS.2012.17
84. Blazewicz J, Frohmberg W, Kierzynka M, Wojciechowski P (2013) G-MSA - A GPU-based, fast and accurate algorithm for multiple sequence alignment. J Paral Distrib Comput 73(1):32–41. https://doi.org/10.1016/j.jpdc.2012.04.004
85. de Araujo Macedo E, Magalhaes Alves de Melo AC, Pfitscher GH, Boukerche A (2013) Multiple biological sequence alignment in heterogeneous multicore clusters with user-selectable task allocation policies. J Supercomput 63(3):740–756. https://doi.org/10.1007/s11227-012-0768-8
86. Esteban FJ, Díaz D, Hernández P, Caballero JA, Dorado G, Gálvez S (2013) Direct approaches to exploit many-core architecture in bioinformatics. Future Gener Comput Syst 29(1):15–26. https://doi.org/10.1016/j.future.2012.03.018. Including Special section: AIRCC-NetCoM 2009 and Special section: Clouds and Service-Oriented Architectures
87. Hatem M, Ruml W (2013) External memory best-first search for multiple sequence alignment. Proc AAAI Conf Artif Intell 27(1):409–416
88. Katoh K, Standley DM (2013) MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol Biol Evol 30(4):772–780. https://doi.org/10.1093/molbev/mst010
89. Montañola A, Roig C, Guirado F, Hernández P, Notredame C (2013) Performance analysis of computational approaches to solve multiple sequence alignment. J Supercomput 64(1):69–78. https://doi.org/10.1007/s11227-012-0751-4
90. Orobitg M, Lladós J, Guirado F, Cores F, Notredame C (2013) Scalability and accuracy improvements
of consistency‑based multiple sequence alignment tools. In: Proceedings of the 20th European MPI
Users’ Group Meeting. EuroMPI ’13, pp. 259–264. Association for Computing Machinery, New York,
NY, USA. https:// doi. org/ 10. 1145/ 24885 51. 24885 83
91. Tzanoudakis T, Papaefstathiou I, Manifavas C (2013) Parallelizing bioinformatics and security applica
tions on a low‑cost multi‑core system. In: 2013 ACS International Conference on Computer Systems
and Applications (AICCSA), pp. 1–4. https:// doi. org/ 10. 1109/ AICCSA. 2013. 66164 52
92. Yilmaz C, Gök M (2013) System designs to perform bioinformatics sequence alignment. Turkish J Electr
Eng Comput Sci 21(1):246–262. https:// doi. org/ 10. 3906/ elk‑ 1105‑ 22
93. Zhu X, Li K, Salah A (2013) A data parallel strategy for aligning multiple biological sequences on multi‑
core computers. Comput Biol Med 43(4):350–361. https:// doi. org/ 10. 1016/j. compb iomed. 2012. 12.
009
94. Díaz D, Esteban FJ, Hernández P, Caballero JA, Guevara A, Dorado G, Gálvez S (2014) MC64‑
ClustalWP2: A highly‑parallel hybrid strategy to align multiple sequences in many‑core architectures.
PLOS ONE 9(4):1–12. https:// doi. org/ 10. 1371/ journ al. pone. 00940 44
95. Gudyś A, Deorowicz S (2014) QuickProbs–A fast multiple sequence alignment algorithm designed for
graphics processors. PLOS ONE 9(2):1–18. https:// doi. org/ 10. 1371/ journ al. pone. 00889 01
96. Lin CY, Lin YS (2014) Efficient parallel algorit hm for multiple sequence alignments with regular expres
sion constraints on graphics processing units. Int J Comput Sci Eng 9(1–2):11–20. https:// doi. org/ 10.
1504/ IJCSE. 2014. 058687
97. Al‑Neama MW, Reda NM, Ghaleb FFM (2015) Fast vectorized distance matrix computation for multi
ple sequence alignment on multi‑cores. Int J Biomath 08(06):1550084. https:// doi. org/ 10. 1142/ S1793
52451 55008 49
98. Hung C‑L, Lin Y‑S, Lin C‑Y, Chung Y‑C, Chung Y‑F (2015) CUDA ClustalW: an efficient parallel algo
rithm for progressive multiple sequence alignment on Multi‑GPUs. Comput Biol Chem 58:62–68.
https:// doi. org/ 10. 1016/j. compb iolch em. 2015. 05. 004
99. Mirarab S, Nguyen N, Guo S, Wang L-S, Kim J, Warnow T (2015) PASTA: Ultra-large multiple sequence alignment for nucleotide and amino-acid sequences. J Comput Biol 22(5):377–386. https://doi.org/10.1089/cmb.2014.0156 (PMID: 25549288)
100. Nguyen N-pD, Mirarab S, Kumar K, Warnow T (2015) Ultra-large alignments using phylogeny-aware profiles. Genome Biol 16(1):124. https://doi.org/10.1186/s13059-015-0688-z
101. Orobitg M, Guirado F, Cores F, Llados J, Notredame C (2015) High performance computing improvements on bioinformatics consistency-based multiple sequence alignment tools. Paral Comput 42:18–34. https://doi.org/10.1016/j.parco.2014.09.010
102. Sundfeld D, Teodoro G, Magalhaes Alves de Melo AC (2015) Parallel A-Star multiple sequence alignment with locality-sensitive hash functions. In: 2015 Ninth International Conference on Complex, Intelligent, and Software Intensive Systems, pp. 342–347. https://doi.org/10.1109/CISIS.2015.50
103. Zafalon GFD, Visotaky JMV, Amorim AR, Valêncio CR, Neves LA, de Souza RCG, Machado JM (2015) A parallel approach of COFFEE objective function to multiple sequence alignment. J Phys: Conf Ser 633:012084. https://doi.org/10.1088/1742-6596/633/1/012084
104. Zhu X, Li K, Salah A, Shi L, Li K (2015) Parallel implementation of MAFFT on CUDA-enabled graphics hardware. IEEE/ACM Trans Comput Biol Bioinf 12(1):205–218. https://doi.org/10.1109/TCBB.2014.2351801
105. Amorim AR, Visotaky JMV, de Godoi Contessoto A, Neves LA, Gratão De Souza RC, Valêncio CR, Zafalon GFD (2016) Performance improvement of genetic algorithm for multiple sequence alignment. In: 2016 17th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), pp. 69–72. https://doi.org/10.1109/PDCAT.2016.029
106. Deorowicz S, Debudaj-Grabysz A, Gudyś A (2016) FAMSA: fast and accurate multiple sequence alignment of huge protein families. Sci Rep 6(1):33964. https://doi.org/10.1038/srep33964
107. González-Domínguez J, Liu Y, Touriño J, Schmidt B (2016) MSAProbs-MPI: parallel multiple sequence aligner for distributed-memory systems. Bioinformatics 32(24):3826–3828. https://doi.org/10.1093/bioinformatics/btw558
108. Lan H, Chan Y, Xu K, Schmidt B, Peng S, Liu W (2016) Parallel algorithms for large-scale biological sequence alignment on Xeon-Phi based clusters. BMC Bioinf 17(9):267. https://doi.org/10.1186/s12859-016-1128-0
109. Reda NM, Al-Neama M, Ghaleb FFM (2016) HAMSA: highly accelerated multiple sequence aligner. Int J Adv Comput Sci Appl. https://doi.org/10.14569/IJACSA.2016.070661
110. Abuín JM, Pena TF, Pichel JC (2017) PASTASpark: multiple sequence alignment meets Big Data. Bioinformatics 33(18):2948–2950. https://doi.org/10.1093/bioinformatics/btx354
S.H.Almanza-Ruiz et al.
1 3
111. Araujo E, Stefanes MA, O. Ferlete Vd, Rozante LCS (2017) Multiple sequence alignment using hybrid parallel computing. In: 2017 IEEE 17th International Conference on Bioinformatics and Bioengineering (BIBE), pp. 175–180. https://doi.org/10.1109/BIBE.2017.00-59
112. Gudyś A, Deorowicz S (2017) QuickProbs 2: towards rapid construction of high-quality alignments of large protein families. Sci Rep 7(1):41553. https://doi.org/10.1038/srep41553
113. Liu P, Hemani A, Paul K, Weis C, Jung M, Wehn N (2017) 3D-stacked many-core architecture for biological sequence analysis problems. Int J Paral Program 45(6):1420–1460. https://doi.org/10.1007/s10766-017-0495-0
114. Neehal N, Karim DZ, Islam A (2017) Cloud-POA: A cloud-based map only implementation of PO-MSA on Amazon multi-node EC2 Hadoop Cluster. In: 2017 20th International Conference of Computer and Information Technology (ICCIT), pp. 1–6. https://doi.org/10.1109/ICCITECHN.2017.8281808
115. Wan S, Zou Q (2017) HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing. Algorithms Mol Biol 12(1):25. https://doi.org/10.1186/s13015-017-0116-x
116. Zambrano-Vega C, Nebro AJ, García-Nieto J, Aldana-Montes JF (2017) M2Align: parallel multiple sequence alignment with a multi-objective metaheuristic. Bioinformatics 33(19):3011–3017. https://doi.org/10.1093/bioinformatics/btx338
117. Nakamura T, Yamada KD, Tomii K, Katoh K (2018) Parallelization of MAFFT for large-scale multiple sequence alignments. Bioinformatics 34(14):2490–2492. https://doi.org/10.1093/bioinformatics/bty121
118. Sundfeld D, Razzolini C, Teodoro G, Boukerche A, de Melo ACMA (2018) PA-Star: a disk-assisted parallel A-Star strategy with locality-sensitive hash for multiple sequence alignment. J Paral Distrib Comput 112:154–165. https://doi.org/10.1016/j.jpdc.2017.04.014
119. Welivita A, Perera I, Meedeniya D, Wickramarachchi A, Mallawaarachchi V (2018) Managing complex workflows in bioinformatics: An interactive toolkit with GPU acceleration. IEEE Trans NanoBiosci 17(3):199–208. https://doi.org/10.1109/TNB.2018.2837122
120. Lassmann T (2019) Kalign 3: multiple sequence alignment of large datasets. Bioinformatics 36(6):1928–1929. https://doi.org/10.1093/bioinformatics/btz795
121. Benítez-Hidalgo A, Nebro AJ, Aldana-Montes JF (2020) Sequoya: multiobjective multiple sequence alignment in Python. Bioinformatics 36(12):3892–3893. https://doi.org/10.1093/bioinformatics/btaa257
122. Smirnov V, Warnow T (2020) MAGUS: multiple sequence alignment using graph clUStering. Bioinformatics 37(12):1666–1672. https://doi.org/10.1093/bioinformatics/btaa992
123. Smirnov V (2021) Recursive MAGUS: scalable and accurate multiple sequence alignment. PLOS Comput Biol 17(10):1–17. https://doi.org/10.1371/journal.pcbi.1008950
124. Ishaq M, Khan A, Su’ud MM, Alam MM, Bangash JI, Khan A (2022) An improved strategy for task scheduling in the parallel computational alignment of multiple sequences. Comput Math Methods Med 2022:8691646. https://doi.org/10.1155/2022/8691646
125. Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22(22):4673–4680. https://doi.org/10.1093/nar/22.22.4673
126. Chowdhury B, Garai G (2017) A review on multiple sequence alignment from the perspective of genetic algorithm. Genomics 109(5):419–431. https://doi.org/10.1016/j.ygeno.2017.06.007
127. Prousalis K, Konofaos N (2019) A quantum pattern recognition method for improving pairwise sequence alignment. Sci Rep 9(1):7226. https://doi.org/10.1038/s41598-019-43697-3
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
... Other specifically designed structural benchmarks are HOMSTRAD, PREFAB, and SABmark, which, unlike BAliBASE, are not generated by hand annotation of protein alignments. Reference sets are also available for RNA structures [41,106]. ...
... These approaches offer the advantage of solving large-scale cases with reduced computational resources, including memory and processing time [24], thereby effectively addressing complex optimization challenges in bioinformatics. While alternative methods exist for optimizing these issues, the versatility of bioinspired-based methods renders them invaluable tools capable of generating "high-quality" solutions within reasonable computing timeframes [41]. An inherent advantage of employing bioinspired-based methods in bioinformatics lies in their ability to effectively address MSA problems, which typically involve large-scale, NP-hard optimization, posing significant constraints on classical optimization techniques [7]. ...
Multiple Sequence Alignment (MSA) plays a pivotal role in bioinformatics, facilitating various critical biological analyses, including the prediction of unknown protein structures and functions. While numerous methods are available for MSA, bioinspired algorithms stand out for their efficiency. Despite the growing research interest in addressing the MSA challenge, only a handful of comprehensive reviews have been undertaken in this domain. To bridge this gap, this study conducts a thorough analysis of bioinspired-based methods for MSA through a systematic literature review (SLR). By focusing on publications from 2010 to 2024, we aim to offer the most current insights into this field. Through rigorous eligibility criteria and quality standards, we identified 45 relevant papers for review. Our analysis predominantly concentrates on bioinspired-based techniques within the context of MSA. Notably, our findings highlight Genetic Algorithm and Memetic Optimization as the most commonly utilized algorithms for MSA. Furthermore, benchmark datasets such as BAliBASE and SABmark are frequently employed in evaluating MSA solutions. Structural-based methods emerge as the preferred approach for assessing MSA solutions, as revealed by our systematic literature review. Additionally, this study explores current trends, challenges, and unresolved issues in the realm of bioinspired algorithms for MSA, offering practitioners and researchers valuable insights and comprehensive understanding of the field.
... In addition to the previously mentioned approaches, several parallelization strategies have been employed to tackle the challenges associated with MSA [55]. Some of these strategies focus on the parallelization of dynamic programming algorithms, such as in [56]. ...
... Recently, many studies utilized GPU acceleration for MSA [59]. The optimization of parallel MSA is characterized by continuous innovation in algorithmic design and adaptation to emerging hardware architectures [55]. In our study, we employ various parallel computing approaches to enhance the basic dynamic programming approach for MSA, as shown in the next Section 3. ...
Multiple sequence alignment (MSA) stands as a critical tool for understanding the evolutionary and functional relationships among biological sequences. Obtaining an exact solution for MSA, termed exact-MSA, is a significant challenge due to the combinatorial nature of the problem. Solving MSA with dynamic programming is recognized as highly computationally complex. To cope with these computational demands, parallel computing offers the potential for significant speedup. In this study, we investigated the utilization of parallelization to solve the exact-MSA using three proposed novel approaches. In these approaches, we used multi-threading techniques to improve the performance of the dynamic programming algorithms in solving the exact-MSA. We developed and employed three parallel approaches, named diagonal traversing, blocking, and slicing, to improve MSA performance. The proposed method accelerated the exact-MSA algorithm by around 4×. These approaches could serve as foundational elements, offering potential integration with existing techniques for comprehensive MSA enhancement.
... In contrast, dynamic programming has a complexity of O(L^N), where L is the sequence length and N is the number of sequences. Researchers aim to enhance MSA efficiency via heuristic and metaheuristic approaches and parallel implementations to reduce computational costs [35]. ...
This study aimed to create a genetic information processing technique for the problem of multiple alignment of genetic sequences in bioinformatics. The objective was to take advantage of the computer hardware's capabilities and analyze the results obtained regarding quality, processing time, and the number of evaluated functions. The methodology was based on developing a genetic algorithm in Java, which resulted in four different versions: Gp1, Gp2, Gp3 and Gp4 . A set of genetic sequences were processed, and the results were evaluated by analyzing numerical behavior profiles. The research found that algorithms that maintained diversity in the population produced better quality solutions, and parallel processing reduced processing time. It was observed that the time required to perform the process decreased, according to the generated performance profile. The study concluded that conventional computer equipment can produce excellent results when processing genetic information if algorithms are optimized to exploit hardware resources. The computational effort of the hardware used is directly related to the number of evaluated functions. Additionally, the comparison method based on the determination of the performance profile is highlighted as a strategy for comparing the algorithm results in different metrics of interest, which can guide the development of more efficient genetic information processing techniques.
... In this context, multiple sequence alignment (MSA) is an essential step for identifying conserved regions (CR) among sequences (Sperchneide 2010). However, MSA is a highly complex problem and is classified among NP-Complete computational problems; this establishes the intractability of MSA under a score function of mathematical interest (Bonizzoni and Vedova 2001; Benítez-Hidalgo et al. 2020; Almanza-Ruiz et al. 2023). ...
Main conclusion: Mfind is a tool to analyze the impact of microsatellite presence on DNA barcode specificity. We found a significant correlation between barcode entropy and microsatellite count in angiosperms. Abstract: Genetic barcodes and microsatellites are some of the identification methods in taxonomy and biodiversity research. It is important to establish a relationship between microsatellite quantification and genetic information in barcodes. In order to clarify the association between the genetic information in barcodes (expressed as Shannon’s Measure of Information, SMI) and microsatellite count, a total of 330,809 DNA barcodes from the BOLD database (Barcode of Life Data System) were analyzed. A parallel sliding-window algorithm was developed to compute the Shannon entropy of the barcodes, and this was compared with the quantification of microsatellites like (AT)n, (AC)n, and (AG)n. The microsatellite search method utilized an algorithm developed in the Java programming language, which systematically examined the genetic barcodes from an angiosperm database. For this purpose, a computational tool named Mfind was developed, and its search methodology is detailed. This comprehensive study revealed a broad overview of microsatellites within barcodes, unveiling an inverse correlation between the sum of microsatellite counts and barcode information. The utilization of the Mfind tool demonstrated that the presence of microsatellites impacts the barcode information when considering entropy as a metric. This effect might be attributed to the concise length of DNA barcodes and the repetitive nature of microsatellites, resulting in a direct influence on the entropy of the barcodes.
Task scheduling in parallel multiple sequence alignment (MSA) through improved dynamic programming optimization speeds up alignment processing. The increased importance of multiple matching sequences also calls for the use of parallel processor systems. This dynamic algorithm proposes improved task scheduling in the case of parallel MSA. Specifically, the alignment of several tertiary structured proteins is computationally more complex than simple word-based MSA. Parallel task processing is computationally more efficient for protein structure-based superposition. The basic condition for the application of dynamic programming is also fulfilled, because the task scheduling problem has multiple possible solutions or options. Search space reduction for speedy processing of this algorithm is carried out through a greedy strategy. Performance in terms of better results is ensured through computationally expensive recursive and iterative greedy approaches. Optimal scheduling schemes show better performance on heterogeneous resources using CPUs or GPUs.
Multiple sequence alignment tools struggle to keep pace with rapidly growing sequence data, as few methods can handle large datasets while maintaining alignment accuracy. We recently introduced MAGUS, a new state-of-the-art method for aligning large numbers of sequences. In this paper, we present a comprehensive set of enhancements that allow MAGUS to align vastly larger datasets with greater speed. We compare MAGUS to other leading alignment methods on datasets of up to one million sequences. Our results demonstrate the advantages of MAGUS over other alignment software in both accuracy and speed. MAGUS is freely available in open-source form at https://github.com/vlasmirnov/MAGUS.
Motivation The estimation of large multiple sequence alignments (MSAs) is a basic bioinformatics challenge. Divide-and-conquer is a useful approach that has been shown to improve the scalability and accuracy of MSA estimation in established methods such as SATé and PASTA. In these divide-and-conquer strategies, a sequence dataset is divided into disjoint subsets, alignments are computed on the subsets using base MSA methods (e.g., MAFFT), and then merged together into an alignment on the full dataset. Results We present MAGUS, Multiple sequence Alignment using Graph clUStering, a new technique for computing large-scale alignments. MAGUS is similar to PASTA in that it uses nearly the same initial steps (starting tree, similar decomposition strategy, and MAFFT to compute subset alignments), but then merges the subset alignments using the Graph Clustering Merger (GCM), a new method for combining disjoint alignments that we present in this study. Our study, on a heterogeneous collection of biological and simulated datasets, shows that MAGUS produces improved accuracy and is faster than PASTA on large datasets, and matches it on smaller datasets. Availability MAGUS: https://github.com/vlasmirnov/MAGUS Supplementary information Supplementary data are available at Bioinformatics online.
Different techniques for estimating the execution time of parallel applications have been studied for the last 25 years. These approaches have proposed different methods for predicting the performance behaviour of applications. Most of these methods rely on analysing one or more of the following aspects: system workload, application structure, platform system, and the computing resources that the application needs to perform its operations. These elements are used and applied by different methods such as analytic and non-analytic methods. However, no wide-ranging survey of these approaches exists at the time of writing. This paper presents a systematic review of performance prediction methods for parallel applications, which were published in the open literature during the period 2005–2020. We define a classification framework to categorise the reviewed approaches. In addition, we identify some directions and trends in performance prediction as well as some unsolved issues.
Motivation: Kalign is an efficient multiple sequence alignment (MSA) program capable of aligning thousands of protein or nucleotide sequences. However, current alignment problems involving large numbers of sequences are exceeding Kalign's original design specifications. Here we present a completely re-written and updated version to meet current and future alignment challenges. Results: Kalign now uses a SIMD accelerated version of the bit-parallel Gene Myers algorithm to estimate pairwise distances, and adopts a sequence embedding strategy and the bisecting K-means algorithm to rapidly construct guide trees for thousands of sequences. The new version maintains high alignment accuracy on both protein and nucleotide alignments and scales better than other MSA tools. Availability: The source code of Kalign and code to reproduce the results are found here: https://github.com/timolassmann/kalign.
Quantum pattern recognition techniques have recently attracted attention as potential candidates for analyzing vast amounts of data. The need for faster ways to process data is imperative where data generation is rapid. The ever-growing size of sequence databases caused by the development of high-throughput sequencing is unprecedented. Current alignment methods have blossomed overnight, but there is still a need for more efficient methods that preserve high accuracy. In this work, a complex method is proposed to treat the alignment problem better than its classical counterparts by means of quantum computation. The basic principle of the standard dot-plot method is combined with a quantum algorithm, giving insight into the effect of quantum pattern recognition on pairwise alignment. The central feature of quantum algorithms, quantum parallelism, and the diffraction patterns of X-rays are synthesized to provide a clever array indexing structure on the growing sequence databases. A completely different approach is considered in contrast to contemporary conventional aligners, and a variety of competitive classical counterparts are classified and organized in order to compare with the quantum setting. The proposed method seems to exhibit high alignment quality and prevail among the others in terms of time and space complexity.
As one of the gatekeepers of quality software systems, requirements prioritisation (RP) is often used to select the most important requirements as perceived by system stakeholders. To date, many RP techniques that adopt various approaches have been proposed in the literature. To identify the strengths, opportunities and limitations of these existing approaches, this work studied and analysed the RP field in terms of its significance in the software development process based on the standard review guidelines by Kitchenham. By a rigorous study selection strategy, 122 relevant studies were selected to address the defined research questions. Findings indicated that RP plays a vital role in ensuring the development of a quality system with defined constraints. The stakeholders involved in RP were reported, and new categories of the participating stakeholders were proposed. Additionally, 108 RP techniques were identified and analysed with respect to their benefits, prioritisation criteria, size of requirements, types in terms of automation level and their limitations. Eighty-four (84) prioritisation criteria were disclosed with their frequency usages in prioritising the requirements. The study revealed that existing techniques suffer from serious limitations in terms of scalability, lack of quantification and prioritisation of the participating stakeholders, time consumption, requirement interdependencies and the need for highly professional human intervention. These findings are useful for researchers and practitioners in improving the current state of the art and state of practices.
Duchene muscular dystrophy is inherited X-linked recessive disorder. Females will typically be carriers for the disease while males will be affected. Dystrophin is essential for cell membrane stability. Deficiency leads to reduction in three glycol proteins in the dystrophin associated protein complex that link dystrophin to laminin with cell membranes. This occurs in people without a known family history of the condition. Because of the way the disease is inherited, males are more likely to develop symptoms than are women. In this Clinical study enlightening about 2 cases of Duchene muscular dystrophy, the concept in Ayurveda the entities like Astimajjagatavata and Pakkharoga, and the treatment carried out in this disease are Sarvanga Abhyanga (whole body massage) Musthadiyapanabasti followed by Shamanayogas(palliative medicines) like Ajamamsa Rasayana, Balarishta and cap Bontone followed by Physiotherapy.
Motivation: Multiple sequence alignment (MSA) consists of finding the optimal alignment of three or more biological sequences to identify highly conserved regions that may be the result of similarities and relationships between the sequences. MSA is an optimisation problem with NP-hard complexity, because the time needed to find optimal alignments raises exponentially along with the number of sequences and their length. Furthermore, the problem becomes multi-objective when more than one score is considered to assess the quality of an alignment, such as maximising the percentage of totally conserved columns and minimising the number of gaps. Our motivation is to provide a Python tool for solving MSA problems using evolutionary algorithms, a non exact stochastic optimisation approach that has proven to be effective to solve multi-objective problems. Results: The software tool we have developed, called Sequoya, is written in the Python programming language, which offers a broad set of libraries for data analysis, visualisation, and parallelism. Thus, Sequoya offers a graphical tool to visualise the progress of the optimisation in real time, the ability to guide the search towards a preferred region in run-time, parallel support to distribute the computation amongst nodes in a distributed computing system, and a graphical component to assist in the analysis of the solutions found at the end of the optimisation. Availability and implementation: Sequoya can be freely obtained from the Python Package Index (pip) or, alternatively, it can be downloaded from Github at https://github.com/benhid/Sequoya. Supplementary information: Supplementary data are available at Bioinformatics online.
Bioinformatics research continues to advance at an increasing scale with the help of techniques such as next generation sequencing and the availability of tool support to automate bioinformatics processes. With this growth, a large amount of biological data accumulates at an unprecedented rate, demanding high performance and high throughput computing technologies for processing such datasets. Hardware accelerators such as Graphics Processing Units (GPUs) and distributed computing accelerate the processing of big data in high performance computing environments. They enable higher degrees of parallelism to be achieved, thereby increasing the throughput. In this paper, we introduce BioWorkflow, an interactive workflow management system to automate bioinformatics analyses with the capability of scheduling parallel tasks using GPU-accelerated and distributed computing. The paper describes a case study carried out to evaluate the performance of a complex workflow with branching executed by BioWorkflow. The results indicate gains of 2.89× by utilizing GPUs and average gains in speed of 2.832× (over n=5 scenarios) by parallel execution of graph nodes during multiple sequence alignment (MSA) calculations. Combined speedups reached 1.71× for complex workflows. This confirms the expected higher speedups from parallelism through GPU acceleration and concurrent execution of workflow nodes compared with mainstream sequential workflow execution. The tool also provides a comprehensive user interface with better interactivity for managing complex workflows; a System Usability Scale score of 82.9 confirmed high usability for the system.