
Parallel protein multiple sequence alignment approaches: a systematic literature review

July 2022
The Journal of Supercomputing 79(7344):1-34


DOI:10.1007/s11227-022-04697-9

Authors:

Sergio Almanza

University of Guadalajara

Arturo Chavoya

University of Guadalajara

Hector A. Duran-Limon

University of Guadalajara

Multiple sequence alignment approaches refer to algorithmic solutions for the alignment of biological sequences. Since multiple sequence alignment has exponential time complexity when a dynamic programming approach is applied, a substantial number of parallel computing approaches have been implemented in the last two decades to improve their performance. In this paper, we present a systematic literature review of parallel computing approaches applied to multiple sequence alignment algorithms for proteins, published in the open literature from 1988 to 2022; we extracted articles from four scientific databases: ACM Digital Library, IEEE Xplore, Science Direct and SpringerLink, and four journals: Bioinformatics, PLOS Computational Biology, PLOS ONE, and Scientific Reports. Additionally, in order to cover other potential databases and journals, we performed a transversal search through Google Scholar. We conducted a selection process that yielded 106 research articles; then, we analyzed these articles and defined a classification framework. Additionally, we point out some directions and trends for parallel computing approaches for multiple sequence alignment, as well as some unsolved problems.


Accepted: 28 June 2022



Keywords Systematic review · Multiple sequence alignment · Parallel programming · Protein

* Arturo Chavoya
achavoya@cucea.udg.mx

Sergio H. Almanza-Ruiz
sergio.almanza3518@alumnos.udg.mx

Hector A. Duran-Limon
hduran@cucea.udg.mx

1 Department of Information Systems, CUCEA - Universidad de Guadalajara, Periférico Norte 799, Zapopan 45100, Jalisco, Mexico

S.H.Almanza-Ruiz et al.

1 3

1 Introduction

Bioinformatics, as defined in [1], consists of the application of tools of computation and analysis to the retrieval and interpretation of biological data. It is an interdisciplinary field, which harnesses computer science, mathematics, physics, and biology. The handling and analysis of biological sequences remain one of the prime tasks of bioinformatics [2]. The most basic biological sequence analysis is to ask if two nucleotide or protein sequences are related [3]. Pairwise alignment consists of an arrangement of two nucleotide or amino acid sequences obtained from successively reordering and comparing the sequences, residue by residue, with a scoring that models matches, mismatches, and gaps as evolutionary events and where the score is optimized. Thus, from a pairwise alignment, inferences regarding similarity of function, structural motifs, or discernible evolutionary relationships can be drawn. On the other hand, a multiple sequence alignment (MSA) is a rearrangement of a set of three or more DNA, RNA, or protein sequences which are aligned to make the residues from different sequences line up in vertical columns in such a manner that best explains structural, functional, or evolutionary relationships. This topic is usually divided into two parts: functional genomics and comparative genomics; in the former, researchers seek to determine the role of the sequences in the living cell, whereas in the latter the aim is to determine ancestries and correlations by comparing sequences from different organisms, or even individuals [2].
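To make the notion of an optimized pairwise score concrete, the following is a minimal sketch of the standard global alignment recurrence (commonly attributed to Needleman and Wunsch, which the text above does not name). The function name and the default scores (match 1, mismatch -1, gap -2) are illustrative assumptions; in practice a substitution matrix would replace the match and mismatch constants.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Return the optimal global alignment score of sequences a and b.

    Illustrative sketch: linear gap penalty, constant match/mismatch scores.
    """
    # prev[j] is the best score for aligning the current prefix of a with b[:j];
    # initially the prefix of a is empty, so every symbol of b faces a gap.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = prev[j - 1] + pair  # align a[i-1] with b[j-1]
            up = prev[j] + gap             # align a[i-1] with a gap
            left = curr[j - 1] + gap       # align b[j-1] with a gap
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]
```

Only the score is returned here; recovering the alignment itself would additionally require storing the full table and tracing back from the last cell.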

Figure 1 shows an example of two different MSAs of the same three proteins, where each letter in the sequences represents a specific amino acid from the short protein sequences, and the gaps in the proposed alignments are represented by dashes. For the sake of simplicity, the scoring function was defined as follows: when aligning a pair of symbols, the score is $1$ if the symbols match, $-1$ if they mismatch, $-2$ if a gap and a symbol are aligned, and $0$ when two gaps are aligned. It is important to mention that in practice, protein sequence alignments are scored using substitution matrices, such as BLOSUM62 [4]. For the example, let $S_1$, $S_2$, and $S_3$ be three protein sequences defined as $S_1 = \mathrm{EGGMF}$, $S_2 = \mathrm{GMMFG}$, and $S_3 = \mathrm{MGEEF}$. In the first alignment, which is shown in Fig. 1a, the score for the first column is the sum of $\mathrm{score}(E,-) = -2$, $\mathrm{score}(E,M) = -1$, and $\mathrm{score}(-,M) = -2$, which gives a sum of pairs score of $-5$ for that column; the scores for the rest of the columns in the alignment are computed analogously. The sum of pairs (SP) for the multiple sequence alignment is calculated as the sum of the individual scores for each column, which for the first alignment is $-13$. The sum of pairs score for the alternative MSA shown in Fig. 1b was also computed and yielded a value of $-24$. From these results, we could conclude that the first alignment is better than the second one, as it has a higher score. In general, the sum of pairs score for a multiple sequence alignment depends on how the residues have been rearranged through the insertion of gaps in certain positions, and different alignments for the same sequences usually yield different sum of pairs scores.

Fig. 1 Simple example of two alternative multiple sequence alignments of the same three proteins, where letters indicate specific amino acids in the protein sequences, dashes represent gaps, and SP is the sum of pairs score for the alignment
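The simplified scoring scheme described above can be sketched in a few lines. The function names are illustrative, and the alignment used at the end is a hypothetical arrangement of the three example sequences; it is not claimed to be either of the alignments in Fig. 1, whose layout is not reproduced in the text.

```python
from itertools import combinations

GAP = "-"


def pair_score(a: str, b: str) -> int:
    """Score one pair of aligned symbols with the simplified scheme."""
    if a == GAP and b == GAP:
        return 0            # two gaps aligned
    if a == GAP or b == GAP:
        return -2           # a gap aligned with a symbol
    return 1 if a == b else -1  # match or mismatch


def column_score(column) -> int:
    """Sum the pair scores over all unordered pairs in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sum_of_pairs(alignment) -> int:
    """Sum of pairs (SP) score: the sum of all column scores."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(column_score(column) for column in zip(*alignment))


# Hypothetical alignment of S1 = EGGMF, S2 = GMMFG, S3 = MGEEF (assumption,
# chosen only so that its first column is E, -, M as in the worked example).
hypothetical_alignment = ["EGGMF-", "-GMMFG", "MGEEF-"]
first_column_score = column_score("E-M")
total_score = sum_of_pairs(hypothetical_alignment)
```

Here `column_score("E-M")` reproduces the value of $-5$ derived in the text for a column containing E, a gap, and M.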

Multiple sequence alignment is an NP-complete problem, and an exact solution by dynamic programming has time complexity $O(L^N)$, where $L$ is the length of the sequences and $N$ is the number of sequences [5]. These facts have motivated researchers to improve the efficiency of MSA algorithms by heuristic and metaheuristic approaches, as well as to decrease the computational cost by means of parallel implementations. We focused the present systematic review on protein MSA algorithms, since alignments are generally more accurate when based on amino acids than on their corresponding nucleotides. There are two reasons for this: first, the small size of the DNA alphabet makes it more likely to find alignments of nucleotide sequences that are due to randomness rather than to similarity of function and structure; second, because sequence similarity degrades more rapidly at the DNA level than at the amino acid level, a mutation in an amino acid sequence is likely to be more meaningful in terms of functionality [6, 7]. In addition, we considered that it was more likely to find relevant research for the present review if the search was focused on parallel implementations of algorithms for protein MSA, as currently there are more efforts in this direction.
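As a rough illustration of why exact dynamic programming becomes impractical, the sketch below counts the cells of the $N$-dimensional table for $N$ sequences of length $L$. The counts $(L+1)^N$ cells and $2^N - 1$ predecessor cells per cell are standard properties of this formulation rather than figures quoted from [5], and the function names are illustrative.

```python
def dp_lattice_cells(length: int, n_sequences: int) -> int:
    """Cells in the N-dimensional table: one axis of L + 1 positions per sequence."""
    return (length + 1) ** n_sequences


def predecessors_per_cell(n_sequences: int) -> int:
    """Each interior cell is reached from 2^N - 1 neighbours
    (every non-empty subset of sequences may advance by one residue)."""
    return 2 ** n_sequences - 1


def dp_work_estimate(length: int, n_sequences: int) -> int:
    """Upper bound on predecessor lookups for filling the whole table."""
    return dp_lattice_cells(length, n_sequences) * predecessors_per_cell(n_sequences)
```

For sequences of length 100, moving from two to three sequences raises the cell count from 10,201 to 1,030,301, and every further sequence multiplies it by 101 again.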

This paper presents a systematic literature review of parallel implementations of protein multiple sequence alignment algorithms that were published in the open literature during the period 1988 to 2022. The main goal of this review is to analyze and characterize the parallel implementations of MSA algorithms. To achieve this goal, the review has four aims. The first aim is to yield a useful classification of the parallel implementations of multiple sequence alignment algorithms. The second aim is to find the parallel programming approaches that have been used to parallelize MSA algorithms. The third aim is to find the multiple sequence alignment algorithms that were most frequently parallelized. Finally, the fourth aim is to point out some unsolved problems regarding the parallel implementations of MSA algorithms.

The paper is organized as follows. The Methods section describes the systematic literature review methodology and how we conducted it. In the Results and discussion section, we present the answers to the questions formulated for this systematic literature review and propose a classification framework for parallel implementations of protein MSA algorithms, which was used to classify every implementation in the reviewed articles. Lastly, in the Conclusion section we present our closing remarks on the present review.

S.H.Almanza-Ruiz et al.

1 3

2 Methods

2.1 Systematic review method

A systematic literature review is a defined and methodical way to identify, evaluate, and synthesize all available research relevant to a particular research question, subject, or phenomenon, in order to understand the current direction and status of research or to provide background for identifying research challenges [8–15]. There exist several methodologies to extract information from a set of consulted articles. The methodology used in the present review is based on the approaches followed by Flores-Contreras et al. [15] and Mahdavi-Hezavehi et al. [16].

The goal for this systematic literature review was formulated using the goal part formulation from the Goal–Question–Metric perspective (purpose, issue, object, viewpoint) as described in [16], whereas we conducted the rest of the protocol as in [15]. The components of the goal formulation for the review are the following:

• Purpose. Analyze and characterize.
• Issue. Improvement in performance.
• Object. Parallel implementations of multiple sequence alignment algorithms for proteins.
• Viewpoint. Point of view of the researcher.

In other words, the goal of this systematic literature review of parallel implementations of multiple sequence alignment algorithms for proteins is to analyze and characterize the articles in the literature, where the issue to be analyzed is the improvement in performance from the point of view of the researcher in the field.

Having defined the review goal, we briefly describe the components of the protocol that we followed (Research questions, Keywords and search string, and Study selection), as in [15]; further details on the application of the protocol are given in the corresponding subsections.

• Research questions. The research questions express the motivation for a literature review and determine the information to be extracted from the reviewed articles.
• Keywords and search string. In order to collect information based on the questions, a number of keywords have to be obtained; the search string is then designed using these keywords.
• Study selection. The study selection determines two main elements in the review protocol: the period of time of the published literature and the databases or individual journals from which articles are to be extracted.

2.2 Research questions

Specifying the research questions is the most important part of any systematic literature review, since they drive the search process of the articles and the data extraction. After the research questions have been specified, we can then proceed to answer them and perform the data analysis and synthesis, as pointed out in [11, 15]. The research questions relevant to the field of parallel implementations of multiple sequence alignment algorithms used in this systematic literature review are presented in Table 1.

The first research question RQ1 has the objective of yielding a useful synthesis of the information obtained through analyzing the collected articles, via a classification framework. The second question RQ2 aims to identify the parallel programming approaches applied to parallelize multiple sequence alignment algorithms. The third question RQ3 seeks to identify which multiple sequence alignment algorithms have been parallelized and the underlying reasons for such trends. Finally, RQ4 serves to point out some of the unsolved problems in the field.

2.3 Keywords and search string

In order to build the search string, we extracted the keywords from the nouns of the research questions as described in Flores-Contreras et al. [15] and obtained the keywords presented in Table 2. We simplified compound nouns; for example, "parallel programming approaches" was simplified to "parallel," whereas "multiple sequence alignment algorithms" was simplified to "multiple sequence alignment." As a result, the keywords were reduced to three: multiple sequence alignment, parallel, and protein.

Afterward, we included synonyms and alternative spellings of these three keywords, as shown in Table 3. In the case of the parallel keyword, we included words such as faster, reconfigurable, accelerated, and optimization as part of the search string, in order to expand the search for articles relevant to the parallel programming approaches.

Table 1 Research questions used for collecting data

| Question ID | Research question |
| --- | --- |
| RQ1 | How can we yield a useful classification of the parallel implementations of multiple sequence alignment algorithms? |
| RQ2 | What parallel programming approaches have been used to enhance performance of multiple sequence alignment algorithms? |
| RQ3 | What protein multiple sequence alignment algorithms are the most frequently parallelized and how were they parallelized? |
| RQ4 | What are some of the unsolved problems of parallel implementations of multiple sequence alignment algorithms? |

Consequently, we divided the construction of the search string into three parts:

• Multiple sequence alignment field part: "multiple sequence alignment" OR "MSA" OR "multiple biological sequence alignment"
• Parallel computing field part: "parallel" OR "parallelization" OR "parallelisation" OR "distributed" OR "parallel algorithm" OR "high performance computing" OR "accelerated" OR "HPC" OR "supercomputing" OR "cloud computing" OR "supercomputer" OR "reconfigurable" OR "multi-core" OR "multicore" OR "grid computing" OR "grid computation" OR "optimization" OR "optimisation" OR "cluster" OR "FPGA" OR "faster"
• Protein sequences part: "amino acid" OR "protein"

Finally, we connected each part with an AND logical operator and obtained the following search string:

("multiple sequence alignment" OR "MSA" OR "multiple biological sequence alignment") AND ("parallel" OR "parallelization" OR "parallelisation" OR "distributed" OR "parallel algorithm" OR "high performance computing" OR "accelerated" OR "HPC" OR "supercomputing" OR "cloud computing" OR "supercomputer" OR "reconfigurable" OR "multi-core" OR "multicore" OR "grid computing" OR "grid computation" OR "optimization" OR "optimisation" OR "cluster" OR "FPGA" OR "faster") AND ("amino acid" OR "protein").
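The construction of the search string from its three parts can be reproduced programmatically. The sketch below is only an illustration of the OR/AND composition described above; the variable and function names are assumptions, and the exact syntax accepted by each database search engine may differ.

```python
# Terms taken from the three parts of the search string described in the text.
MSA_TERMS = [
    "multiple sequence alignment", "MSA", "multiple biological sequence alignment",
]
PARALLEL_TERMS = [
    "parallel", "parallelization", "parallelisation", "distributed",
    "parallel algorithm", "high performance computing", "accelerated", "HPC",
    "supercomputing", "cloud computing", "supercomputer", "reconfigurable",
    "multi-core", "multicore", "grid computing", "grid computation",
    "optimization", "optimisation", "cluster", "FPGA", "faster",
]
PROTEIN_TERMS = ["amino acid", "protein"]


def or_group(terms) -> str:
    """Quote each term and join the terms of one part with OR."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_search_string(*parts) -> str:
    """Join the parenthesized parts with AND."""
    return " AND ".join(or_group(part) for part in parts)


SEARCH_STRING = build_search_string(MSA_TERMS, PARALLEL_TERMS, PROTEIN_TERMS)
```

Keeping the terms in lists makes it straightforward to apply only some parts of the string to a given field, as was done for the Title field in some databases.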

Table 2 Keywords extracted from research questions

| Question ID | Keywords |
| --- | --- |
| RQ1 | Parallel implementations, multiple sequence alignment |
| RQ2 | Parallel programming approaches, performance of multiple sequence alignment algorithms |
| RQ3 | Protein, multiple sequence alignment algorithms, parallel |
| RQ4 | Parallel implementations, multiple sequence alignment algorithms |

Table 3 Synonyms or alternative spellings for keywords

| Keyword | Synonyms or alternative spelling |
| --- | --- |
| Multiple sequence alignment | Multiple sequence alignment, MSA, multiple biological sequence alignment |
| Parallel | Parallel, parallelization, parallelisation, distributed, parallel algorithm, high performance computing, accelerated, HPC, supercomputing, cloud computing, supercomputer, reconfigurable, multi-core, multicore, grid computing, grid computation, optimization, optimisation, cluster, FPGA, faster |
| Protein | Amino acid, protein |


2.4 Study selection

To collect the articles, we chose four scientific databases: ACM Digital Library, IEEE Xplore, Science Direct and SpringerLink, as in [15]; additionally, we collected articles from Bioinformatics, PLOS Computational Biology, PLOS ONE and Scientific Reports because these are journals that can contain articles focused on bioinformatics and are of high quality, as measured by their h-index. We applied our search string in the Advanced search section of the aforementioned scientific databases and journals. Since the search engine varies among scientific databases and journals, we had to apply a different search approach depending on the database or journal.

In the case of the ACM Digital Library, the full search string was applied to the Full text and Abstract fields, whereas for the Title field, the first and second parts of the search string were applied; 28 research articles were obtained.

For the IEEE Xplore database, the search string was applied to the Full text only and All Metadata fields, and we obtained 107 research articles.

Regarding the Science Direct database, the three parts of the search string were applied to the Search Anywhere field, whereas the multiple sequence alignment and parallel computing components were applied to the Title, Abstract or Author-specified keywords fields. From this database, we retrieved 70 articles.

In the case of the SpringerLink database, due to the limitations of the search engine, we proceeded as follows. We applied our search string to the Full text field and obtained a total of 4559 entries, from which 2859 journal articles and 489 conference proceedings articles were retrieved; the rest of the entries were not articles. The aforementioned entries were downloaded in the form of csv (comma-separated values) files which contained only the title of the articles. Since the information retrieved only contained the title of the articles, a filter with the two parts shown below was implemented. This filter contained more terms than the previous search string for other databases, since in this case we were discriminating using only the title, and it was required to add enough flexibility to the search. It should be pointed out that this filter was refined after several iterations of search and analysis of the results obtained. The filter parts are as follows:

• The multiple sequence alignment part contained the following terms and we applied an OR between each term: "multiple sequence alignment," "biological sequence alignment," "progressive alignment," "MSA," "Clustal," "MAFFT," "MUSCLE," "T-COFFEE," "PROBCONS," "PASTA," "SATE," and "MSAProbs."
• The parallel programming part contained the following terms and again we applied an OR between each term: "parallel," "parallelization," "parallelisation," "distributed," "parallel algorithm," "high performance computing," "accelerated," "HPC," "supercomputing," "cloud computing," "supercomputer," "reconfigurable," "multi-core," "multicore," "grid computing," "grid computation," "optimization," "optimisation," "cluster," "FPGA," and "faster."
• We applied an AND between the previous parts.
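The two-part title filter listed above could be sketched as follows. The matching rule (case-insensitive substring search) is an assumption, since the text does not state how terms were matched against titles, and the function names and example titles are hypothetical.

```python
MSA_TITLE_TERMS = [
    "multiple sequence alignment", "biological sequence alignment",
    "progressive alignment", "MSA", "Clustal", "MAFFT", "MUSCLE",
    "T-COFFEE", "PROBCONS", "PASTA", "SATE", "MSAProbs",
]
PARALLEL_TITLE_TERMS = [
    "parallel", "parallelization", "parallelisation", "distributed",
    "parallel algorithm", "high performance computing", "accelerated", "HPC",
    "supercomputing", "cloud computing", "supercomputer", "reconfigurable",
    "multi-core", "multicore", "grid computing", "grid computation",
    "optimization", "optimisation", "cluster", "FPGA", "faster",
]


def contains_any(title: str, terms) -> bool:
    """True if at least one term occurs in the title (OR between terms).

    Assumption: case-insensitive substring matching.
    """
    lowered = title.lower()
    return any(term.lower() in lowered for term in terms)


def title_passes_filter(title: str) -> bool:
    """AND between the MSA part and the parallel programming part."""
    return contains_any(title, MSA_TITLE_TERMS) and contains_any(title, PARALLEL_TITLE_TERMS)


# Hypothetical titles, used only to exercise the filter.
example_titles = [
    "A parallel progressive alignment method on GPU clusters",
    "A faster genome assembler",
    "Multiple sequence alignment with hidden Markov models",
]
selected_titles = [t for t in example_titles if title_passes_filter(t)]
```

Substring matching is deliberately permissive (for instance, "Clustal" also matches "ClustalW"), which is consistent with the stated need for flexibility but can admit false positives that later selection steps must remove.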

S.H.Almanza-Ruiz et al.

1 3

After applying the aforementioned step to the SpringerLink database, we obtained 20 articles from journals and 35 from conference proceedings, totaling 55 articles.

Regarding the Bioinformatics journal, the full search string was applied to the Full text, Title and Abstract fields of the advanced search section; we obtained 551 articles.

As for the PLOS journals, the full search string was applied to the Title and Abstract fields of the advanced search section; we obtained 358 articles. It should be noted that we searched within all the PLOS journals, but only obtained relevant results from PLOS Computational Biology and PLOS ONE.

Lastly, for the Scientific Reports journal, the full search string was applied to the Title and Terms fields of the advanced search section; we obtained 500 articles. It is worth noting that Scientific Reports is part of the Nature journals and we searched through all of them, but only obtained relevant results from the Scientific Reports journal.

In summary, the search in the scientific databases and journals rendered a total of 1669 articles, as can be seen in Table 4.

2.5 Selection process

The selection process allowed us to identify the articles that were most closely related to our review. This process had the following steps. The first step consisted of reading the abstract of every article, from which we obtained information such as whether the article was focused on parallel implementations of protein MSA algorithms, and a general perspective of how the research was conducted; we selected only those papers whose main focus was directly related to the subject of this review. The second step was to read the full content of the articles previously selected, as well as their reference sections to select additional articles. We discarded the articles that were not pertinent to the review. The third step was assessing the quality of the articles by using the h-index and the CORE ranking. The final step consisted of assessing the relevance of the articles to this review according to a score defined for this purpose.

Table 4 Number of articles after applying the search string

| Repository | Number of articles after applying the search string |
| --- | --- |
| ACM Digital Library | 28 |
| IEEE Xplore | 107 |
| Science Direct | 70 |
| SpringerLink | 55 |
| Bioinformatics | 551 |
| PLOS Computational Biology | 39 |
| PLOS ONE | 319 |
| Scientific Reports | 500 |
| Total | 1669 |


2.5.1 Reading the abstracts

We read the 1669 abstracts from the articles that were collected after applying the search string, and we selected 241 articles whose topic was related to the focus of the present review, as presented in Table 5.

2.5.2 Reading the content of the articles

As the second step of the selection process we read the content of the 241 articles selected in the previous step, and we discarded articles that were not directly related to the topic of parallel implementations of protein MSA algorithms according to the subject matter of the article. In addition, we read the reference section of these articles, and after selecting those articles that seemed pertinent to the review from their title, we applied the first two steps of the selection process (reading the abstract and reading the content) and obtained 18 additional articles. After this step, we obtained a total of 147 articles, as shown in Table 6.

2.5.3 Quality assessment

The third step of the selection process consisted in testing the quality of the 147 research articles obtained in the previous step, considering the h-index [17] for articles published in a journal and the CORE ranking [18] for articles published in conference proceedings. We selected those articles whose journals had an h-index greater than or equal to 20 and included all articles that had a CORE ranking. We applied the aforementioned criteria and discarded 33 articles from conference proceedings and 2 articles from journals; thus, after this process we obtained 112 articles, as presented in Table 7.
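The quality criterion stated above amounts to a simple rule per venue type. The sketch below is one way to express it; the function signature and the string labels for venue types are assumptions made for illustration.

```python
from typing import Optional

H_INDEX_THRESHOLD = 20  # journals need an h-index of at least 20


def passes_quality(venue_type: str,
                   journal_h_index: Optional[int] = None,
                   core_rank: Optional[str] = None) -> bool:
    """Apply the quality criterion to one article.

    venue_type is "journal" or "conference" (labels are illustrative).
    Journal articles pass when the journal h-index is >= 20;
    conference articles pass when the conference has any CORE ranking.
    """
    if venue_type == "journal":
        return journal_h_index is not None and journal_h_index >= H_INDEX_THRESHOLD
    if venue_type == "conference":
        return core_rank is not None
    raise ValueError(f"unknown venue type: {venue_type!r}")
```

For example, `passes_quality("journal", journal_h_index=19)` is rejected, whereas a conference article with any CORE rank is accepted regardless of the rank value.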

Table 5 Number of articles remaining after reading the abstracts

| Repository | Number of articles after applying the search string | After reading the abstract |
| --- | --- | --- |
| ACM Digital Library | 28 | 11 |
| IEEE Xplore | 107 | 81 |
| Science Direct | 70 | 22 |
| SpringerLink | 55 | 49 |
| Bioinformatics | 551 | 50 |
| PLOS Computational Biology | 39 | 1 |
| PLOS ONE | 319 | 22 |
| Scientific Reports | 500 | 5 |
| Total | 1669 | 241 |

S.H.Almanza-Ruiz et al.

1 3

2.5.4 Relevance assessment

This was the last step of the selection process and enabled us to fine-tune the identification of the articles most relevant to the focus of this review. We defined 6 qualitative questions, as presented in Table 8. Question 1 from Table 8 is mainly based on RQ2, whereas Questions 2 to 4 aim to answer how parallel approaches were applied to multiple sequence alignment algorithms, as well as to provide a useful classification for the papers; finally, Questions 5 and 6 have to do with the reproducibility of the articles and hence with their relevance for the field.

The numerical score for the answers to these 6 qualitative questions was defined as follows: the answer Yes had an assigned score of 1, the answer Partially had a score of 0.5, and No had a score of 0.

We describe next the criteria that we defined to give an answer of Yes, Partially or No when each of the qualitative Questions 1 through 6 from Table 8 was applied to a given article.

Table 6 Number of articles remaining after reading the content of the articles

| Repository | After reading the abstract | After reading the content |
| --- | --- | --- |
| ACM Digital Library | 11 | 4 |
| IEEE Xplore | 81 | 57 |
| Science Direct | 22 | 8 |
| SpringerLink | 49 | 36 |
| Bioinformatics | 50 | 19 |
| PLOS Computational Biology | 1 | 1 |
| PLOS ONE | 22 | 2 |
| Scientific Reports | 5 | 2 |
| Referenced | 18 | 18 |
| Total | 259 | 147 |

Table 7 Number of articles after assessing quality

| Repository | After reading the content | After CORE ranking and h-index ≥ 20 |
| --- | --- | --- |
| ACM Digital Library | 4 | 2 |
| IEEE Xplore | 57 | 39 |
| Science Direct | 8 | 8 |
| SpringerLink | 36 | 23 |
| Bioinformatics | 19 | 19 |
| PLOS Computational Biology | 1 | 1 |
| PLOS ONE | 2 | 2 |
| Scientific Reports | 2 | 2 |
| Referenced | 18 | 16 |
| Total | 147 | 112 |


Question 1 had an answer of Yes when the main topic of the article was an implementation of a parallel approach to multiple sequence alignment, and No otherwise. For Question 2, we answered Yes when it was explicitly stated in the text which subroutines were parallelized, Partially when this information had to be derived from the context, supplementary material or another article, and No when there was no information about the parallelization of subroutines. Similarly, for Questions 3 and 4 we answered Yes when it was explicitly stated in the text which parallelization technique or platform was used, Partially when it had to be derived, and No when it was not clear.

As for Question 5, we answered Yes when the experimental results, such as speedup comparisons or tables, were solid and clearly presented, Partially when the results were poorly presented, and No in the absence of experimental results, as in the case of articles that presented only a proposal of implementation.

Regarding Question 6, we answered Yes when the parallel implementation, as well as the MSA algorithm, were clearly presented. We answered Partially when one or more of Questions 2 to 5 was answered Partially and the rest were answered Yes, and we answered No when at least one of Questions 1 to 5 was answered No.

We discarded all articles with a relevance score less than 3, as in [15]. The percentage of articles according to their relevance score is shown in Table 9.
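The relevance score is the sum of the six answer scores, and articles scoring below 3 were discarded. A minimal sketch of this computation is given below; the function names are illustrative, and the category boundaries assume that the score ranges of Table 9 are contiguous (for example, Fair covering scores from 3 up to, but not including, 4), which is an inference rather than something stated explicitly.

```python
ANSWER_SCORES = {"Yes": 1.0, "Partially": 0.5, "No": 0.0}
MIN_RELEVANCE = 3.0  # articles with a score below 3 were discarded


def relevance_score(answers) -> float:
    """Sum the scores of the answers to Questions 1 through 6."""
    if len(answers) != 6:
        raise ValueError("exactly six answers are expected")
    return sum(ANSWER_SCORES[answer] for answer in answers)


def is_retained(answers) -> bool:
    """An article is kept when its relevance score is at least 3."""
    return relevance_score(answers) >= MIN_RELEVANCE


def score_category(score: float) -> str:
    """Map a score to a Table 9 category (contiguous ranges assumed)."""
    if score < 2:
        return "Very poor"
    if score < 3:
        return "Poor"
    if score < 4:
        return "Fair"
    if score < 5:
        return "Good"
    return "Very good"
```

For instance, the answers Yes, Partially, Yes, Yes, No, No give a score of 3.5, so the article would be retained in the Fair category.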

After applying the relevance criteria to the 112 articles that remained after the quality assessment step, we obtained 104 articles, as presented in Table 10, where we also specify separately the number of journal and conference articles obtained.

Table 8 Questions used to assess the relevance of articles

| Question | Answer |
| --- | --- |
| 1. Was the parallelization of a multiple sequence alignment algorithm the main goal of the paper? | Yes / No |
| 2. Was it clear which subroutines the proposed approach parallelized? | Yes / Partially / No |
| 3. Was it clear which parallelization techniques were used? | Yes / Partially / No |
| 4. Was it clear what platform was used to parallelize the algorithm? | Yes / Partially / No |
| 5. Were the experimental results solid? | Yes / Partially / No |
| 6. Is it possible to replicate the algorithm and experiments with the information provided? | Yes / Partially / No |

Table 9 Percentage of the articles according to their relevance score r

| Score | Very poor | Poor | Fair | Good | Very good |
| --- | --- | --- | --- | --- | --- |
| Range | r < 2 | 2 ≤ r < 3 | 3 ≤ r < 4 | 4 ≤ r < 5 | 5 ≤ r ≤ 6 |

Percentage

50%

38%

S.H.Almanza-Ruiz et al.

1 3

2.5.5 Search in Google Scholar

In order to consider other possible sources of articles in addition to the aforementioned databases and journals, we performed a transversal search using Google Scholar. Considering the limitations of the Google Scholar search engine, as with the SpringerLink database, we applied our search string to the available text of the articles and obtained a total of 2084 entries, of which 1281 were journal articles and 180 were conference proceedings articles, whereas the rest of the entries were not articles. From the entries obtained, we excluded all those articles that were previously found in the four databases and four journals mentioned in the preceding sections and then applied the selection steps as described above. A summary of the results of the search in Google Scholar is presented in Table 11.

The two remaining articles from the search in Google Scholar were journal articles, which when added to the 104 articles found in the previous selection process give a total of 106 articles relevant to the review, as summarized in Table 12.
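The counts reported across the selection steps can be cross-checked with simple arithmetic. The sketch below only restates figures given in the text and in Tables 4 to 12; the variable names are illustrative.

```python
# Articles returned by the search string in each repository (Table 4).
SEARCH_HITS = {
    "ACM Digital Library": 28,
    "IEEE Xplore": 107,
    "Science Direct": 70,
    "SpringerLink": 55,
    "Bioinformatics": 551,
    "PLOS Computational Biology": 39,
    "PLOS ONE": 319,
    "Scientific Reports": 500,
}

after_search = sum(SEARCH_HITS.values())           # articles whose abstracts were read
after_abstracts = 241                               # kept after reading the abstracts
from_references = 18                                # added from reference sections
after_content = 147                                 # kept after reading the content
discarded_for_quality = 33 + 2                      # conference + journal articles
after_quality = after_content - discarded_for_quality
after_relevance = 104                               # kept after relevance assessment
from_google_scholar = 2                             # journal articles from Google Scholar
final_total = after_relevance + from_google_scholar

# Google Scholar entries that were articles: journal + conference proceedings.
google_scholar_articles = 1281 + 180
```

The derived values agree with the reported totals: 1669 articles after the search, 112 after the quality assessment, 1461 Google Scholar articles, and 106 articles in the final set.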

Table 10 Number of articles from journals and conference proceedings after assessing quality and relevance

| Repository | Journal | Conference | Total |
| --- | --- | --- | --- |
| ACM Digital Library | 0 | 2 | 2 |
| IEEE Xplore | 4 | 30 | 34 |
| Science Direct | 8 | 0 | 8 |
| SpringerLink | 13 | 7 | 20 |
| Bioinformatics | 19 | 0 | 19 |
| PLOS Computational Biology | 1 | 0 | 1 |
| PLOS ONE | 2 | 0 | 2 |
| Scientific Reports | 2 | 0 | 2 |
| Referenced | 14 | 2 | 16 |
| Total | 63 | 41 | 104 |

Table 11 Number of articles found through the search in Google Scholar

| Selection step | Number of articles |
| --- | --- |
| After applying the search string | 1461 |
| After excluding duplicates from previous searches | 1451 |
| After reading the abstracts | 135 |
| After reading the content | 25 |
| After assessing quality | 10 |
| After assessing relevance | 2 |


3 Results anddiscussion

In the following paragraphs, we present the answers to the research questions based

on the 106 articles obtained in the selection process.

3.1 RQ1: How can we yield a useful classification of the parallel implementations of multiple sequence alignment algorithms?

In order to manage the variety of parallel approaches that we have found in the literature related to multiple sequence alignment algorithms, we propose the classification framework presented in Fig. 2.

For this classification framework, the MSA algorithm approach category describes the multiple sequence alignment algorithm as commonly found in the literature, whereas the remaining categories (Spectrum, Parallelization scope, HPC strategy and Platform type) describe how its parallel implementation was achieved.

3.1.1 Multiple sequence alignment approach terminology

The terminology used in this classification framework to describe multiple sequence alignment approaches follows the one commonly used in the literature. Multiple sequence alignment algorithms were divided into Exact, Heuristic and Metaheuristic approaches.

MSA algorithm approach. This category includes all MSA algorithm approaches as frequently found in the literature.

• Exact. This subcategory encompasses dynamic programming algorithms that guarantee finding the optimal multiple sequence alignment.

• Heuristic. This category consists of non-evolutionary strategies that search for high-scoring multiple sequence alignments but do not guarantee the optimal alignment. It comprises the progressive, iterative, stochastic, and alternative subcategories.

  – Progressive. Progressive approaches incrementally build a multiple sequence alignment, where the most closely related pairs of sequences are aligned first, and then a series of pairwise alignments is executed for successively less closely related sequences.

Table 12 Number of articles from journals and conference proceedings for the review

Repository              Journal  Conference  Total
Databases and journals       63          41    104
Google Scholar                2           0      2
Total                        65          41    106

S.H.Almanza-Ruiz et al.

1 3

  – Iterative. This subcategory encompasses approaches that iterate an algorithm that produces a multiple sequence alignment until no further improvement can be made.

  – Stochastic. Approaches under this subcategory obtain a probabilistic profile alignment in order to achieve a multiple sequence alignment.

  – Alternative. This subcategory consists of heuristic approaches that do not fall under any of the aforementioned subcategories.

• Metaheuristic. This category includes evolutionary algorithmic approaches where an initial population of multiple sequence alignments evolves over time and improves the MSA.
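To make the Exact subcategory concrete in its simplest case, the following is a minimal sketch of a Needleman–Wunsch style dynamic programming recurrence for two sequences only. The scoring values (match = 1, mismatch = -1, gap = -1) and the function name are assumptions chosen for illustration; they are not taken from any of the reviewed articles, and real protein aligners typically use substitution matrices and affine gap penalties instead.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Return the optimal global alignment score of two sequences.

    Illustrative scoring only (linear gap penalty, flat match/mismatch).
    Only two rows of the dynamic programming table are kept in memory.
    """
    # Row 0: aligning a prefix of b against the empty prefix of a.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("MKVLA", "MKLA"))
```

Every cell depends only on its three neighbours, which is why cells on the same anti-diagonal can in principle be computed independently of one another.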

Fig. 2 A classiﬁcation framework of parallel multiple sequence alignment algorithms with the number of

articles in each category indicated by the number between parentheses


3.1.2 Parallel implementation terminology

For the parallel implementation terminology, the Spectrum category describes the range of application of the parallel implementation, that is, whether it is intended only for a specific algorithm or for every algorithm of a certain MSA approach. The Parallelization scope category describes whether the parallel implementation was achieved via parallelization of subroutines of the MSA algorithm or via parallelization of the whole algorithm. The HPC strategy and Platform type categories describe the parallel programming approach and the hardware used, respectively, according to the terminology found in the literature.

Spectrum. Some papers presented a parallel implementation applicable to all

algorithms of a speciﬁc MSA approach, whereas some others developed a parallel

implementation for a speciﬁc algorithm.

Parallelization scope. This category describes whether the parallel implementation of the MSA algorithm was subdivided into subroutines and, if so, how the parallelization was implemented.

• Subroutine. This subcategory comprises articles in which the parallel implementation of the MSA algorithm was divided into subroutines.

  – Critical steps. This subcategory describes approaches in which only the critical steps of an MSA algorithm are manually programmed.

  – All steps. This subcategory describes approaches in which all the stages of an MSA algorithm are manually programmed.

• Whole algorithm. This subcategory describes articles where the approach to find an MSA either has only one stage, which is parallelized, or has several stages that are optimized by a compiler.

HPC (High-Performance Computing) strategy. In this category, articles are distinguished by the parallel computing strategy used for implementation. The HPC strategy category involves the following subcategories:

• Parallel programming oriented. This subcategory describes articles where a parallel programming approach was implemented and no special hardware was built. It involves the following subcategories:

  – Multiprocessing. This subcategory encompasses implementations where the algorithm or some of its stages use multiple processes with separate memory.

  – Multithreading. This subcategory encompasses implementations where all the processing threads of the algorithm or some of its stages share the same memory.

  – Divide and conquer. This subcategory consists of the so-called embarrassingly parallel algorithms.

S.H.Almanza-Ruiz et al.

1 3

  – Intra-task parallelization. This subcategory comprises parallel implementations for the GPU platform where each task is assigned to exactly one thread block (i.e., a group of threads) and executed separately in parallel.

  – Hybrid. Parallel implementations under this subcategory employ several parallel programming approaches.

  – Vectorization. This subcategory encompasses parallel implementations that employ either vector or array operations.

• Special hardware oriented. This subcategory describes articles where special or custom hardware was built to implement a parallelization.
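As an illustration of the Multithreading subcategory, the sketch below computes a pairwise distance matrix with a pool of threads that all read the same in-memory list of sequences. The k-mer Jaccard distance, the function names and the worker count are assumptions made for this example only; they do not reproduce the distance measure of any reviewed algorithm. Note also that in CPython the global interpreter lock limits the speedup of pure Python CPU-bound threads, so the sketch shows the shared-memory structure rather than a performance result.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def kmer_set(seq: str, k: int = 2) -> set:
    """Return the set of overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a: str, b: str, k: int = 2) -> float:
    """Toy distance: 1 minus the Jaccard similarity of the k-mer sets."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def distance_matrix(seqs: list, workers: int = 4) -> list:
    """Fill a symmetric distance matrix, one independent task per pair."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]

    def work(pair):
        i, j = pair
        # All threads read the same `seqs` list: shared memory, no copying.
        return i, j, kmer_distance(seqs[i], seqs[j])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i, j, d in pool.map(work, combinations(range(n), 2)):
            matrix[i][j] = matrix[j][i] = d
    return matrix


print(distance_matrix(["MKVLA", "MKVLA", "GGGGG"]))
```

A multiprocessing variant would follow the same task decomposition, but each worker process would receive its own copy of the sequences it needs.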

Platform type. This category describes all platforms that have been used to implement a given parallel approach for MSA.

The 106 selected articles were categorized according to the classification framework shown in Fig. 2, and more detailed information is provided in Table 13.

3.2 RQ2: What parallel programming approaches have been used to enhance performance of multiple sequence alignment algorithms?

The earliest parallel approach for multiple sequence alignment found by this review dates from 1988, as shown in Table 14; we also found that the average number of relevant published articles on parallel implementations for protein MSA algorithms was 3.7 per publication year.
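The reported average can be checked directly against the yearly counts. The short sketch below transcribes the counts from Table 14 and assumes that "publication year" means a year with at least one selected article, which is the interpretation under which the figure of 3.7 is recovered.

```python
# Article counts per publication year, transcribed from Table 14.
articles_per_year = {
    1988: 1, 1993: 2, 1995: 1, 1996: 1, 1997: 1, 1998: 1, 1999: 1, 2000: 1,
    2002: 2, 2003: 4, 2004: 3, 2005: 8, 2006: 9, 2007: 4, 2008: 3, 2009: 5,
    2010: 7, 2011: 7, 2012: 4, 2013: 10, 2014: 3, 2015: 8, 2016: 5, 2017: 7,
    2018: 3, 2019: 1, 2020: 2, 2021: 1, 2022: 1,
}

total_articles = sum(articles_per_year.values())
years_with_articles = len(articles_per_year)
# Assumption: the average is taken over years that have at least one article.
average_per_year = total_articles / years_with_articles

print(total_articles, years_with_articles, round(average_per_year, 1))
```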

The first parallel approaches for MSA were deployed on early supercomputing equipment, as in [19], in which a dynamic programming approach was used. Then, from 1993 to 2000, there were some efforts to develop parallel implementations for stochastic and metaheuristic approaches of multiple sequence alignment algorithms, as in [21] and [26].

From 2001 to 2005, research focused mostly on improving performance for progressive alignment, mainly using clusters, but varying the parallel programming strategy used for the implementations. In 2003, Kuo Bin Li [32] implemented a multiprocessing parallelization of ClustalW, where the three stages of the algorithm were parallelized and the implementation used 16 processors; the implementation achieved a speedup of 4.3 for a dataset with 500 amino acid sequences with an average length of 1000.
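To put this figure in context, speedup can be related to the number of processors through the usual notion of parallel efficiency, that is, speedup divided by processor count. The sketch below applies this standard definition to the values reported for [32]; the function name is hypothetical and the efficiency value is our own derived quantity, not one stated in the reviewed article.

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Return parallel efficiency, defined as speedup / processors."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return speedup / processors


# Values reported in the text for [32]: speedup of 4.3 on 16 processors.
efficiency = parallel_efficiency(4.3, 16)
print(f"{efficiency:.2f}")
```

An efficiency well below 1 indicates that a considerable part of the run time is spent in serial work, communication or load imbalance rather than in useful parallel computation.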

Around 2005, the first parallel implementations of an MSA algorithm on Field Programmable Gate Arrays (FPGAs) appeared, as in [37] and [40]; they mainly focused on speeding up the pairwise sequence alignment step of the so-called progressive approach. The parallel MSA algorithm implementation in [37] achieved an overall speedup of 11.8 with respect to ClustalW and was tested on datasets of 1000 amino acid sequences with an average length of 446.

Also in 2005, along with the appearance of the first commercial dual-core processors, research began to explore parallel implementations of the progressive MSA approach using multicore processors. One of these implementations can be found in [45], in which the authors used multithreading programming to parallelize critical steps of ClustalW; this implementation achieved an overall speedup of 2.12 with respect to ClustalW and was tested on datasets of 1000 amino acid sequences with a fixed length of 800.

Since 2006, with the appearance of the GPU and CUDA programming models, researchers began to develop parallel implementations using this technology to speed up several steps of the progressive approach for multiple sequence alignment, as in [54]. In 2007, Zola et al. [57] presented the first parallel implementation of T-Coffee.

Around 2010, along with MSAProbs and parallel implementations for MSA algorithms such as T-Coffee and MAFFT, researchers began to explore metaheuristic and stochastic approaches for MSA algorithms with their parallel implementations. The implementation of MSAProbs [70], a stochastic approach for MSA, was parallelized using multithreading programming and a GPU platform; the reported results showed that MSAProbs was more accurate than MSA algorithms such as ClustalW, MAFFT and Probalign on datasets extracted from BAliBASE, PREFAB, SABmark and OXBENCH. In 2011, a parallel implementation of T-Coffee appeared in [76], with experimental results that yielded a speedup of over 68% while preserving the accuracy.

In 2014, a stochastic MSA approach and its parallelization were published in the same article [95], in which the critical steps were parallelized and the implementation used a GPU platform.

Two parallelizations of MSA approaches that applied cloud computing appeared in 2017: PASTASpark [110] and HAlign-II [115]; it is noteworthy that these cloud computing approaches managed to align sequence files of up to 3.4 GB and 15 GB, respectively, and preserved the accuracy of the original algorithms.

In 2020, we found two works: Sequoya [121], an approach that used multi-objective metaheuristics for MSA, and MAGUS [122], an approach that used graph clustering to combine disjoint alignments; the latter algorithm is similar to PASTA but improves on its accuracy and performance.

Table 13 Selected articles and their classification

Reference  Approach       Spectrum (a)             Scope (b)  Strategy (c)  Platform (d)
[19]       Exact          Needleman–Wunsch         WA         VEC           CPU
[20]       Alternative    Hierarchical clustering  CS         DAC           CPU
[21]       Stochastic     Simulated Annealing      WA         MPR           CPU
[22]       Iterative      Berger–Munson            WA         DAC           CPU
[23]       Stochastic     Unnamed                  WA         VEC           CPU
[24]       Iterative      Berger–Munson            WA         MPR           CPU
[25]       Iterative      Berger–Munson            CS         MPR           CPU
[26]       Metaheuristic  Unnamed                  WA         MPR           CPU
[27]       Metaheuristic  Island Parallel GA       WA         MPR           CPU
[28]       Progressive    ClustalW                 WA         MPR           CPU
[29]       Progressive    PRALINE                  AS         MPR           CPU
[30]       Progressive    ClustalW                 CS         HYB           CPU
[31]       Progressive    ClustalW                 AS         MPR           CPU
[32]       Progressive    ClustalW                 AS         MPR           CPU
[33]       Progressive    ClustalW                 CS         MPR           CPU
[34]       Progressive    ClustalW                 CS         DAC           CPU
[35]       Iterative      PhylTree                 CS         MPR           CPU
[36]       Iterative      DIALIGN P                WA         MPR           CPU
[37]       Progressive    ClustalW                 CS         SHO           FPGA
[38]       Progressive    ClustalW                 CS         MPR           CPU
[39]       Progressive    ClustalW                 CS         MPR           CPU
[40]       Progressive    ClustalW                 CS         SHO           FPGA
[41]       Progressive    ClustalW                 CS         SHO           FPGA
[42]       Stochastic     Unnamed                  CS         MPR           CPU
[43]       Progressive    ClustalW                 CS         HYB           CPU
[44]       Iterative      PhylTree                 AS         DAC           CPU
[45]       Progressive    ClustalW                 CS         MTH           CPU
[46]       Progressive    ClustalW                 WA         MTH           CPU
[47]       Iterative      MUSCLE                   AS         MTH           CPU
[48]       Progressive    ClustalW                 CS         MPR           CPU
[49]       Progressive    ClustalW                 CS         SHO           FPGA
[50]       Progressive    ClustalW                 WA         DAC           CPU
[51]       Progressive    All algorithms           WA         DAC           CPU
[52]       Progressive    ClustalW                 CS         DAC           CPU
[53]       Iterative      PhylTree                 CS         MPR           CPU
[54]       Exact          Unnamed                  CS         MPR           CPU
[55]       Progressive    ClustalW                 CS         HYB           GPU
[56]       Alternative    Unnamed                  WA         DAC           CPU
[57]       Progressive    T-Coffee                 CS         MPR           CPU
[58]       Exact          Needleman–Wunsch         WA         DAC           CPU
[59]       Progressive    All algorithms           CS         SHO           GPU
[60]       Alternative    Sample-Align-D           WA         DAC           CPU
[61]       Stochastic     Unnamed                  CS         MPR           CPU
[62]       Progressive    ClustalW                 AS         MTH           GPU
[63]       Progressive    ClustalW                 AS         MTH           GPU
[64]       Progressive    All algorithms           CS         DAC           CPU
[65]       Progressive    ClustalW                 CS         VEC           CPU
[66]       Progressive    T-Coffee                 AS         MPR           CPU
[67]       Progressive    ClustalW                 CS         MTH           CPU
[68]       Iterative      MAFFT                    AS         MTH           CPU
[69]       Progressive    ClustalW                 CS         VEC           GC
[70]       Stochastic     MSAProbs                 AS         MTH           CPU
[71]       Metaheuristic  iiGA                     WA         MPR           CPU
[72]       Progressive    ClustalW                 CS         VEC           CPU
[73]       Iterative      DIALIGN P                CS         HYB           CPU
[74]       Progressive    All algorithms           CS         SHO           FPGA
[75]       Progressive    All algorithms           AS         SHO           FPGA
[76]       Progressive    T-Coffee                 CS         MPR           CPU
[77]       Progressive    T-Coffee                 WA         MPR           CC
[78]       Progressive    Clustal Omega            WA         HYB           CPU
[79]       Metaheuristic  AlineaGA                 WA         MPR           CPU
[80]       Alternative    GPU-REMuSiC v1.0         CS         ITP           CC
[81]       Progressive    ClustalW                 CS         SHO           FPGA
[82]       Progressive    All algorithms           AS         MTH           CPU
[83]       Progressive    T-Coffee                 CS         DAC           CPU
[84]       Progressive    T-Coffee                 AS         SHO           GPU
[85]       Iterative      DIALIGN TX               CS         MPR           CPU
[86]       Progressive    ClustalW                 CS         MPR           CPU
[87]       Alternative    PE2A*                    WA         MTH           CPU
[88]       Iterative      MAFFT                    AS         MTH           CPU
[89]       Progressive    T-Coffee                 CS         MTH           CPU
[90]       Progressive    T-Coffee                 CS         MPR           CPU
[91]       Iterative      MAFFT                    WA         MPR           CPU
[92]       Progressive    ClustalW                 CS         SHO           FPGA
[93]       Progressive    All algorithms           WA         DAC           CPU
[94]       Progressive    ClustalW                 CS         MTH           CPU
[95]       Stochastic     QuickProbs               CS         ITP           GPU
[96]       Alternative    GPU-REMuSiC v1.0         CS         ITP           GPU
[97]       Progressive    All algorithms           CS         VEC           CPU
[98]       Progressive    ClustalW                 CS         ITP           GPU
[99]       Alternative    PASTA                    CS         MTH           CPU
[100]      Stochastic     UPP                      CS         DAC           CPU
[101]      Progressive    T-Coffee                 CS         MPR           CPU
[102]      Exact          A-star                   WA         MTH           CPU
[103]      Metaheuristic  MSA-GA                   CS         MTH           CPU
[104]      Iterative      MAFFT                    CS         HYB           GPU
[105]      Metaheuristic  MSA-GA                   AS         MTH           CPU
[106]      Progressive    FAMSA                    CS         MTH           CPU
[107]      Stochastic     MSAProbs                 CS         MPR           CPU
[108]      Progressive    ClustalW                 CS         HYB           CPU
[109]      Progressive    HAMSA                    CS         VEC           CPU
[110]      Alternative    PASTA                    CS         MTH           CC
[111]      Progressive    All algorithms           AS         HYB           GPU
[112]      Stochastic     QuickProbs               CS         MTH           GPU
[113]      Progressive    All algorithms           CS         SHO           FPGA
[114]      Alternative    POA                      WA         DAC           CC
[115]      Alternative    HAlign                   WA         MPR           CC
[116]      Metaheuristic  M2Align                  WA         MTH           CPU
[117]      Iterative      MAFFT                    AS         MTH           CPU
[118]      Exact          A-star                   WA         MTH           CPU
[119]      Progressive    All algorithms           WA         DAC           CC
[120]      Progressive    KALIGN                   AS         VEC           CPU
[121]      Metaheuristic  Sequoya                  WA         MPR           CPU
[122]      Alternative    MAGUS                    CS         MTH           CPU
[123]      Alternative    MAGUS                    AS         DAC           CPU
[124]      Progressive    All algorithms           CS         HYB           CPU

(a) This column contains either the specific name of the parallel MSA algorithm, or all algorithms as specified in the Spectrum category
(b) CS (Critical steps), AS (All steps), WA (Whole algorithm)
(c) MPR (Multiprocessing), MTH (Multithreading), DAC (Divide and conquer), ITP (Intra-task parallelization), HYB (Hybrid), VEC (Vectorization), SHO (Special-hardware oriented)
(d) CPU (Central processing unit), GPU (Graphics processing unit), FPGA (Field programmable gate array), GC (Grid computing), CC (Cloud computing)

3.3 RQ3: What protein multiple sequence alignment algorithms are the most frequently parallelized and how were they parallelized?

Next, we discuss the underlying reasons for the number of articles on parallel implementations of each MSA approach, as presented in Fig. 2, and describe the frequencies of appearance for the remaining categories in the classification framework.

Table 14 Articles per year

Year  Number of articles    Year  Number of articles
1988   1                    2009   5
1993   2                    2010   7
1995   1                    2011   7
1996   1                    2012   4
1997   1                    2013  10
1998   1                    2014   3
1999   1                    2015   8
2000   1                    2016   5
2002   2                    2017   7
2003   4                    2018   3
2004   3                    2019   1
2005   8                    2020   2
2006   9                    2021   1
2007   4                    2022   1
2008   3


3.3.1 Parallelized exact MSA algorithms

The earliest parallel implementation of an MSA algorithm [19] was a parallelization of the exact MSA approach. It has been proved that exact MSA has exponential time complexity when dynamic programming is applied [5]. Consequently, in order to improve computational time, researchers focused on developing parallel implementations with heuristic and metaheuristic approaches. This may explain why only five parallel implementations of the exact MSA approach were found in this systematic review. Nevertheless, the heuristic and metaheuristic approaches do not guarantee an optimal solution, and exact MSA algorithms have been applied to test the accuracy of other MSA approaches.
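The exponential growth can be made tangible with a small calculation. Under the standard formulation of exact MSA as dynamic programming over an N-dimensional lattice, which we assume here for illustration, aligning N sequences requires one cell per combination of prefix lengths, and each cell has up to 2^N - 1 predecessor cells. The function names below are hypothetical.

```python
def dp_lattice_cells(lengths: list) -> int:
    """Number of cells in the dynamic programming lattice: prod(L_i + 1)."""
    cells = 1
    for length in lengths:
        cells *= length + 1
    return cells


def predecessors_per_cell(num_sequences: int) -> int:
    """Maximum number of predecessor cells: every non-empty subset of
    sequences may advance by one residue, giving 2**N - 1 choices."""
    return 2 ** num_sequences - 1


for n in (2, 3, 5):
    print(n, dp_lattice_cells([100] * n), predecessors_per_cell(n))
```

Even for sequences of only 100 residues, going from two to five sequences increases the number of cells from about ten thousand to about ten billion, which is consistent with the small number of exact approaches found in the review.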

3.3.2 Parallelized heuristic MSA algorithms

According to data collected from the articles published between 1988 and 2022, 53.77% of the articles considered in the review are parallel implementations of the progressive approach for multiple sequence alignment. Thus, the progressive approach is the most frequently parallelized approach. This can be seen in Fig. 2, which shows the total number of articles for every approach considered in the review. Two facts explain the frequency of parallelization of the progressive approach. The first is that the progressive MSA approach was one of the earliest approaches adopted for protein multiple sequence alignment: its first implementation dates from 1994 [125], and some of its earliest parallel implementations were published around 2003, as in [31, 32]. The second is that the progressive approach has three stages that can clearly be parallelized. The first fact motivated researchers to improve previous implementations, and the second provided opportunities to test several combinations of parallelization approaches.

We found that the second most frequently parallelized approach was the iterative approach, as presented in Fig. 2. The iterative approach is a refinement of the progressive approach that relies on dynamic programming or another heuristic to realign a subset of the original sequences to the final alignment. This strategy has the advantage of improving accuracy, but it increases the computational cost. These refinement strategies are specific to each algorithm and in many cases are difficult to parallelize.

As for stochastic MSA algorithms, they have the advantage of improving the accuracy of MSAs. This improvement is achieved by probabilistically and statistically estimating MSA accuracy, as in [61] and [70]; however, this kind of approach has a high computational cost. Consequently, stochastic MSA algorithms have been parallelized to make them computationally feasible. In this sense, it is noteworthy that none of the stochastic MSA algorithms found in this systematic review had a prior serial implementation. Furthermore, some parallel implementations of stochastic MSA algorithms, such as QuickProbs 2 [112], have surpassed the accuracy of progressive and iterative algorithms for hundreds or even thousands of sequences. However, the parallel progressive MSA algorithm Clustal Omega [78] and the parallel iterative MSA algorithm MAFFT [117] still outperform stochastic approaches in performance and accuracy when obtaining MSAs of tens or hundreds of thousands of sequences.

Finally, the algorithms categorized under the alternative approach either combined strategies or subroutines employed in other heuristic approaches or implemented a new one. We found that in many of the articles the aim was to improve accuracy for a larger number of sequences compared to the rest of the algorithms, as in [78] and [99]. Another interesting finding was that 3 out of 12 works under this category were implemented using cloud computing, which can be advantageous because, when more computational resources are needed to process an MSA, these resources can be requested as a service.

3.3.3 Parallelized metaheuristic MSA algorithms

The parallel implementations of metaheuristic MSA algorithms improve not only the performance of their serial counterparts but also the quality of multiple sequence alignments by assessing the accuracy of alignments using multiple criteria, such as maximizing the sum of pairs score, maximizing the number of totally conserved columns, minimizing the number of gaps, or maximizing structural information-based scores. This improvement in accuracy can be achieved by evolutionary multiobjective optimization techniques, as in [116] and [121]. Nevertheless, metaheuristic MSA algorithms need previous multiple alignments as initial solutions, and in most cases these alignments are obtained by first aligning with other algorithms such as ClustalW or MAFFT; thus, the accuracy and performance of metaheuristic MSA algorithms initially depend on other aligners. In addition, metaheuristic approaches have presented difficulties when processing medium to large MSA alignments, as pointed out by Zambrano-Vega et al. [116]. These two issues may explain why we found only eight parallel implementations in the Metaheuristic category, whereas there were 93 articles with the Heuristic approach.
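Three of the criteria listed above can be computed directly from an alignment, as the following minimal sketch shows. The scoring scheme (match = 1, mismatch = 0, residue against gap = -1, gap against gap = 0) is an assumption for illustration; the reviewed algorithms generally rely on substitution matrices and their own gap models, and the function names are hypothetical.

```python
from itertools import combinations

GAP = "-"


def _columns(alignment: list):
    """Yield alignment columns after checking all rows have equal length."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    return zip(*alignment)


def sum_of_pairs(alignment: list, match: int = 1, mismatch: int = 0,
                 gap_penalty: int = -1) -> int:
    """Sum the pairwise scores of every pair of symbols in every column."""
    score = 0
    for column in _columns(alignment):
        for x, y in combinations(column, 2):
            if x == GAP and y == GAP:
                continue  # gap against gap contributes nothing
            if x == GAP or y == GAP:
                score += gap_penalty
            elif x == y:
                score += match
            else:
                score += mismatch
    return score


def totally_conserved_columns(alignment: list) -> int:
    """Count gap-free columns in which all residues are identical."""
    return sum(1 for column in _columns(alignment)
               if GAP not in column and len(set(column)) == 1)


def gap_count(alignment: list) -> int:
    """Count gap symbols over the whole alignment."""
    return sum(row.count(GAP) for row in alignment)


msa = ["MK-L", "MKVL", "MR-L"]
print(sum_of_pairs(msa), totally_conserved_columns(msa), gap_count(msa))
```

A multiobjective metaheuristic would treat such values as separate objectives to be traded off, rather than combining them into a single number.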

3.3.4 How theapproaches were parallelized

In this section, we discuss the reasons for the number of articles in every subcategory of the Spectrum, Parallelization scope, HPC strategy and Platform type categories of the classification framework shown in Fig. 2. These categories and their subcategories aim to explain how the MSA approaches classified in the MSA algorithm approach category were parallelized.

3.3.4.1 Spectrum category We obtained 94 articles where a specific algorithm was parallelized, whereas there were only 12 where the focus was on providing a more general parallel implementation for all algorithms of the approach. This can be explained by the fact that it is more difficult to identify a general solution that can be applied to all algorithms of an approach than it is to explore different parallelization approaches for a specific MSA algorithm that has already been parallelized, such as ClustalW.


3.3.4.2 Parallelization scope category We found 77 implementations of MSA algorithms that fell into the Subroutine subcategory, whereas only 29 fell into the Whole algorithm subcategory. The number of articles found in the Subroutine subcategory can be explained by the fact that the progressive and iterative MSA algorithms have steps or subroutines that can be parallelized, whereas this is not the case for most of the exact and metaheuristic MSA algorithms, which fell into the Whole algorithm subcategory. We also found that, in general, the parallel implementations in the Whole algorithm subcategory applied a strategy that splits the data. Among the alternative algorithms, we found some implementations that were not divided into parallelizable subroutines, as in [114], and others, as in [115], where one of the goals was to separate the parallel implementation from the alignment algorithm; this was achieved by splitting the data into chunks and processing them in a map-reduce fashion along several stages of the algorithm. In addition, some iterative and progressive MSA algorithms fit into the Whole algorithm subcategory, in particular implementations where the aim was to yield a general solution for all the algorithms of the progressive approach, as in [93].

Regarding the Critical steps and the All steps subcategories, we found that in the earliest implementations of the MSA progressive approach the most time-consuming subroutines were identified; consequently, researchers focused more on parallelizing critical steps than on parallelizing all the steps.

3.3.4.3 HPC strategy category This category was divided into the Parallel-programming oriented subcategory, which had 95 articles, and the Special-hardware oriented subcategory, which had only 11 articles. This difference can be explained by the fact that the parallel implementations of MSA algorithms in the Special-hardware oriented subcategory require building specific hardware for their deployment; in contrast, the implementations in the Parallel-programming oriented subcategory can be deployed on a wider variety of platforms. Thus, many researchers focused their efforts on developing a parallel MSA algorithm rather than on developing specific hardware to deploy a parallelization of an MSA algorithm.

Within the Parallel-programming oriented subcategory, we found 33 articles under the Multiprocessing subcategory, which is the most frequent approach used in the HPC strategy category. Among these articles, we found 12 implementations from the Progressive subcategory; in many of these implementations, the subroutines with the most expensive usage of computing resources were identified, and the data were evenly distributed among processors. Regarding the Multithreading subcategory, we found 24 articles, making it the second most frequent subcategory of the HPC strategy category, where 13 out of 24 articles involved algorithms using a CPU platform. The main advantage of the multithreading approach is that it can save communication time among processes, as in [68, 82]. However, one disadvantage of multithreading is that, in terms of coding, it is harder to subdivide the MSA algorithm subroutines into processes than to evenly divide the data among processors.

We found 17 articles in the Divide and conquer subcategory; in the implementations of this approach, a specific heuristic strategy was developed to split the data and apply the MSA algorithm to the split data, which can make the implementation code cleaner, since the algorithm does not have to be divided into parallelizable subroutines. Thus, it is worth observing that the Divide and conquer approach was applied to nine parallel implementations of MSA algorithms from the Whole algorithm subcategory, as in [93]. However, the Divide and conquer approach could depend on large processing capabilities such as those provided by the Cloud computing platform, as in [114, 119].
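The data-splitting idea can be sketched as follows. The example partitions a list of sequences into fixed-size chunks and applies the same routine to every chunk independently. The function `toy_align` is a hypothetical stand-in that merely pads sequences with gaps; it is not a real aligner, and the chunking rule, names and worker count are assumptions. The step that merges the sub-alignments into a single MSA is algorithm specific and is deliberately left out.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(sequences: list, chunk_size: int) -> list:
    """Partition the input into consecutive chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [sequences[i:i + chunk_size]
            for i in range(0, len(sequences), chunk_size)]


def toy_align(chunk: list) -> list:
    """Hypothetical placeholder for an MSA algorithm applied to one chunk.

    It only right-pads every sequence with gaps to a common length, so that
    the example stays self-contained; a real aligner would be called here.
    """
    width = max(len(seq) for seq in chunk)
    return [seq.ljust(width, "-") for seq in chunk]


def align_chunks(sequences: list, chunk_size: int, workers: int = 4) -> list:
    """Apply the whole (placeholder) algorithm to each chunk independently."""
    chunks = split_into_chunks(sequences, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(toy_align, chunks))


print(align_chunks(["MK", "MKV", "G", "GGGG"], chunk_size=2))
```

Because the chunks do not communicate, the alignment routine itself needs no changes, which reflects the cleaner code noted above; the quality of the final MSA then depends on how the data are split and merged.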

Concerning the Intra-task parallelization subcategory, which is a parallelization model for the GPU platform, we found only four implementations. The main advantage of applying Intra-task parallelization is that it allows thousands of threads to run on hundreds of cores. Intra-task parallelization was applied to parallelize the subroutines of several MSA approaches, as in [96], where the pairwise comparison subroutine of an alternative MSA algorithm was parallelized using this approach, or in [95], a stochastic MSA algorithm, where the posterior probability matrix calculation was computed 24.7 times faster with Intra-task parallelization than with the CPU-parallel implementation. A downside of Intra-task parallelization resides in its dependence on a GPU platform, which negatively affects the portability of the algorithm.

With regard to the Hybrid approach, its implementations applied a combination of parallelization strategies, thus making it possible to apply the most suitable parallel implementation for a given subroutine of the MSA algorithm, as in [108], where cluster-level data parallelism, thread-level coarse-grained parallelism, vector-level parallelism and fine-grained parallelism were applied. We also found that some of the implementations of the Hybrid approach managed to process tens of thousands of sequences, as in [108, 111]. However, the combination of several parallelization techniques can produce complex code that is difficult to read and maintain.

Regarding the articles in the Vectorization subcategory, the advantage of this approach is that it reduces the number of processors needed, since some of the MSA operations, such as the computation of the pairwise distance matrix, are vectorized, as in [65, 97]; additionally, one implementation under this subcategory achieved state-of-the-art accuracy for tens of thousands of sequences [120]. However, although some of the MSA operations are already supported by vector instruction set extensions such as AVX, others have to be adapted to the available set of vectorized operations. This makes the code of these approaches cumbersome to read and can have a negative impact on the portability of the application. These disadvantages could explain why we found only eight implementations in this subcategory.

3.3.4.4 Platform type category The Platform type category was subdivided into the CPU, GPU, FPGA, Grid computing and Cloud computing subcategories. We found 80 articles under the CPU platform subcategory. It is worth clarifying that all parallel MSA algorithm implementations under the CPU category published before 2005 were deployed on platforms where two or more CPUs were physically separated, in contrast with implementations published from 2005 onward, which made use of a multicore platform that contained two or more CPUs on the same physical unit. The CPU platform is widely available and, unlike the GPU or FPGA platforms, there is no need to acquire or develop special hardware.

Regarding the GPU platform, we found 11 articles under this subcategory. The main advantage of the GPU platform resides in the fact that GPUs have a larger number of cores than multicore CPUs, and thus the former have greater performance capabilities than the latter. An implementation with GPUs was used to parallelize two stochastic MSA algorithms [96, 112], which are known to have high accuracy but also high computational cost. As mentioned earlier, a downside of deploying on a GPU platform is that this type of platform negatively affects the portability of the implementation. In addition, in some cases, an extra effort is needed to port algorithms originally written for CPU platforms.

With regard to the FPGA platform, we found nine articles under this subcategory; the main advantages of using an FPGA platform for MSA algorithm parallelization are the minimization of the computational cost of communication among processing elements and the implementation of fine-grained parallelism, as in [41, 81]. The downsides are the higher economic cost of FPGAs and the possible low availability of this very specialized hardware, which could negatively impact portability.

As for the Grid computing platform, we found only one article that used this type of platform. The main advantage of employing grid computing in [69] was the use of a distributed file-allocation system that made it possible to manage large databases of sequences, on the order of six million sequences, by processing batches of smaller datasets containing between 150,000 and 280,000 sequences. One of the advantages of grid computing over cloud computing resides in its performance, since the former does not use virtualization. However, applying grid computing to implement a parallel algorithm is technically more difficult than it is for cloud computing.
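The batch-wise processing described for [69] can be pictured with a minimal Python sketch. It is not the implementation used in that work; the function name `batches`, the toy data and the batch size are hypothetical and only illustrate how a large collection can be split into smaller datasets that are handled independently.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batches(sequences: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists holding at most batch_size sequences each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(sequences)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Toy stand-in for a large sequence database (hypothetical data).
toy_db = [f"SEQ{i}" for i in range(10)]
batch_sizes = [len(b) for b in batches(toy_db, 4)]
print(batch_sizes)  # [4, 4, 2]
```

Because the input is consumed lazily, the same helper would work on a stream read from disk without loading the whole database into memory.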

With respect to the Cloud computing platform, we found five articles in this subcategory. The Cloud computing platform offers the advantage that a wide variety of CPU or GPU configurations with large processing capabilities are available as a service to test and deploy a given parallel MSA algorithm. These large processing capabilities were suitable for implementations such as [114, 115], where the data were split and the whole algorithm was applied to chunks of data. However, in the case of databases of tens of gigabytes, the increased time required to upload such an amount of data is a drawback of this approach.
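To give a rough sense of the upload drawback, the following sketch estimates an ideal transfer time. The 50 GB database size and the 100 Mbit/s sustained bandwidth are assumed values chosen for illustration, not figures from the reviewed articles, and protocol overhead is ignored.

```python
def upload_time_seconds(size_gb: float, bandwidth_mbit_s: float) -> float:
    """Ideal transfer time for size_gb gigabytes (10**9 bytes each)
    over a link sustaining bandwidth_mbit_s megabits per second."""
    if bandwidth_mbit_s <= 0:
        raise ValueError("bandwidth must be positive")
    total_bits = size_gb * 8e9
    return total_bits / (bandwidth_mbit_s * 1e6)


# Assumed scenario: 50 GB uploaded at 100 Mbit/s.
seconds = upload_time_seconds(50, 100)
print(f"{seconds:.0f} s = {seconds / 3600:.2f} h")  # 4000 s = 1.11 h
```

Under these assumptions the upload alone takes more than an hour, which can be comparable to or larger than the compute time saved by the cloud resources.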

3.4 RQ4: What are some of the unsolved problems of parallel implementations of multiple sequence alignment algorithms?

One unsolved problem that we observed, based on the articles selected for this systematic review, is that the parallel implementations of every MSA approach have difficulties regarding accuracy and efficiency when processing ultra-large datasets, that is, sets consisting of tens or hundreds of thousands of sequences generated by the rapid development of modern sequencing technology. There have been attempts to scale up the parallel implementation of multiple sequence alignment algorithms to accurately process ultra-large datasets, as in [115, 122]. However, the problem has not been solved satisfactorily, as mentioned in [123], where the author reported that MUSCLE and Clustal Omega could not process 1,000,000 sequences, in contrast with UPP, Recursive MAGUS and MAGUS, which could process this number of sequences in a single run. Nevertheless, the same author [123] also reported that MAGUS, the prior version of Recursive MAGUS, was more accurate than the latest version for sets of between 10,000 and 50,000 sequences. It is worth mentioning that UPP and Recursive MAGUS are parallel alternative MSA algorithms and have achieved better processing capability than their progressive counterparts. With regard to parallel stochastic MSA algorithms, they have not achieved the large-scale processing capability of their progressive counterparts; however, they have better accuracy for sets with a smaller number of sequences.

In the case of parallel metaheuristic MSA approaches, it has been pointed out in [126] that finding an optimal multiple sequence alignment requires multiple objectives; this has been achieved via multiobjective genetic algorithms. However, as far as this systematic review is concerned, there is no work on metaheuristic MSA algorithms that can manage tens of thousands of sequences, nor is there evidence that they have clearly outperformed the accuracy of their heuristic equivalents.

On the other hand, progressive parallel MSA implementations have difficulty scaling up to ultra-large datasets due to the number of comparisons that are necessary to build the guide tree, whereas in the case of stochastic approaches we believe the difficulties of scaling up arise because a larger set of sequences makes it harder to maintain the numerical precision needed to compute probabilities.
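As an illustration of why the guide tree can become a bottleneck, assume that the tree is built from a full all-against-all distance matrix, so that N sequences require N(N-1)/2 pairwise comparisons. This is a simplifying assumption; individual tools may reduce this cost, and the helper name below is hypothetical.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered sequence pairs in a set of n sequences."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n * (n - 1) // 2


for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9,} sequences -> {pairwise_comparisons(n):,} comparisons")
# 1,000 sequences     -> 499,500 comparisons
# 100,000 sequences   -> 4,999,950,000 comparisons
# 1,000,000 sequences -> 499,999,500,000 comparisons
```

The quadratic growth means that a thousandfold increase in the number of sequences leads to roughly a millionfold increase in comparisons.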

Since some of the parallel MSA implementations that process large-scale datasets, such as Clustal Omega and PASTA, apply a pre-clustering step, a possible direction for future work is the application of reinforcement learning to adaptively guide a pre-clustering algorithm that splits up the data in order to improve performance and accuracy. Another possible direction for future work is to adapt stochastic algorithms such as QuickProbs 2 for divide-and-conquer parallelization, splitting the data and using a cloud computing platform in order to scale up their processing of large-scale datasets.

Regarding the application of quantum computing to multiple sequence alignment, it is worth mentioning that some researchers have proposed algorithms that use this approach to improve the performance of pairwise sequence alignment algorithms, which, as mentioned earlier, are the basis of MSA algorithms, as in [127]. This quantum approach consists of using dot-matrix plotting and quantum pattern recognition to align a pair of sequences. The authors of this work conducted simulations and a complexity analysis in order to compare the quantum algorithm against its classical computing counterparts in terms of speed and computational resources. They claim that the results strongly suggest that an actual implementation of their quantum pairwise sequence alignment algorithm would outperform its classical counterparts in terms of time and space complexity. Nonetheless, the same authors observed that there are still technical issues to overcome before this quantum pairwise alignment algorithm can actually be implemented. This approach could also be applied to stochastic MSA approaches by performing quantum sampling, in order to circumvent the numerical stability issues of stochastic simulation.
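The numerical stability issue mentioned above can be illustrated with a small, generic sketch that is not taken from any of the reviewed tools. Multiplying many small probabilities in double precision underflows to zero, whereas summing their logarithms stays representable; `log_sum_exp` is a commonly used helper for adding probabilities in log space. The probability values are hypothetical.

```python
import math


def log_sum_exp(log_values):
    """Return log(sum(exp(v) for v in log_values)) without underflow."""
    m = max(log_values)
    if m == float("-inf"):
        return m  # all terms have probability zero
    return m + math.log(sum(math.exp(v - m) for v in log_values))


# Hypothetical example: 100 independent events, each with probability 1e-5.
probs = [1e-5] * 100

direct = 1.0
for p in probs:
    direct *= p  # true value 1e-500 is below the double-precision range

log_prob = sum(math.log(p) for p in probs)  # about -1151.29

# Adding two such path probabilities while staying in log space.
log_total = log_sum_exp([log_prob, log_prob])

print(direct)     # 0.0
print(log_prob)
print(log_total)  # log_prob + log(2)
```

Working in log space is one common remedy on classical hardware; it does not remove the cost of the computation, only the loss of precision.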

4 Conclusion

We carried out a selection process for the present review in which the initial search yielded 1669 research articles. We read the abstracts of these 1669 articles and selected 241 articles related to the review topic; searching their reference sections yielded 18 additional articles. Thus, we selected 259 articles related to the topic of the review. Next, we read the content of those 259 articles, after which 147 articles remained. Afterward, we assessed the quality and relevance of these 147 articles and obtained a total of 104 articles relevant to the review. Subsequently, we performed a transversal search through Google Scholar using the same selection criteria as for the databases and journals and found two additional articles, giving a grand total of 106 articles.

Based on the results of the systematic review, we propose a classification framework for parallel implementations of protein MSA algorithms. The proposed classification framework can help researchers identify which combinations of protein MSA algorithms and parallel computing approaches have been used and to what extent, by examining the number of implementations for each particular approach.

As for the best approach to use, that depends entirely on the goals that researchers are trying to fulfill and the drawbacks they are willing to accept. For instance, if high accuracy is the main goal irrespective of the time taken by the algorithm, then one of the exact approaches could be the best choice; on the other hand, if working with very large datasets and no local high-performance hardware is available, then one of the cloud computing approaches might be the best option. In general, as far as this systematic literature review is concerned, there is no single parallel MSA approach or implementation that outperforms the rest of the approaches in all relevant aspects, such as accuracy and the capacity to process large datasets. As mentioned in previous sections, some parallel MSA approaches such as MAGUS or Clustal Omega outperform the rest of the MSA aligners for processing ultra-large datasets, whereas parallel stochastic MSA implementations outperform most MSA approaches in accuracy up to a limit in the size of the dataset. Therefore, there is room for research into improving parallel MSA approaches, either for processing ultra-large datasets, for obtaining higher accuracy, or both.

Finally, we identified some areas of opportunity for future work. The first area consists of improving the capacity of parallel MSA algorithms to process ultra-large datasets. A possible approach to scaling up an algorithm could be to adaptively split the data using reinforcement learning in order to improve the performance of the MSA algorithm. A second area for future work could consist of developing parallel deep learning-based MSA algorithms. A third area could be the use of quantum computing to improve the performance and accuracy of parallel MSA algorithms, even though there are still technical issues for its deployment that must be solved.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/s11227-022-04697-9.

Funding Sergio H. Almanza‑Ruiz is receiving a full‑time scholarship for his graduate studies from the

Mexican National Council for Science and Technology (CONACyT).

Data availability All data generated or analyzed during this study are included in this published article

and its supplementary information ﬁle.

Declarations

Competing interests The authors have no competing interests to declare that are relevant to the content

of this article.

References

1. Bayat A (2002) Bioinformatics. BMJ 324(7344):1018–1022. https://doi.org/10.1136/bmj.324.7344.1018
2. Ramsden J (2009) Bioinformatics: An Introduction, 2nd edn. Springer, London, England. https://doi.org/10.1007/978-1-84800-257-9
3. Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, England. https://doi.org/10.1017/CBO9780511790492
4. Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Nat Acad Sci U.S.A. 89(22):10915–10919. https://doi.org/10.1073/pnas.89.22.10915
5. Bonizzoni P, Vedova GD (2001) The complexity of multiple sequence alignment with SP-score that is a metric. Theor Comput Sci 259(1):63–79. https://doi.org/10.1016/S0304-3975(99)00324-2
6. Wernersson R, Pedersen AG (2003) RevTrans: multiple alignment of coding DNA from aligned amino acid sequences. Nucleic Acids Res 31(13):3537–3539. https://doi.org/10.1093/nar/gkg609
7. Abascal F, Zardoya R, Telford MJ (2010) TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations. Nucleic Acids Res 38(supp 2):7–13. https://doi.org/10.1093/nar/gkq291
8. Kitchenham BA, Charters S (2007) Guidelines for performing systematic literature reviews in software engineering. Technical Report EBSE 2007-001, Keele University and Durham University Joint Report. https://www.elsevier.com/__data/promis_misc/525444systematicreviewsguide.pdf
9. Chen L, Ali Babar M (2011) A systematic review of evaluation of variability management approaches in software product lines. Inf Softw Technol 53(4):344–362. https://doi.org/10.1016/j.infsof.2010.12.006. Special Section: Software Engineering track of the 24th Annual Symposium on Applied Computing
10. Salleh N, Mendes E, Grundy J (2011) Empirical studies of pair programming for CS/SE teaching in higher education: a systematic literature review. IEEE Trans Softw Eng 37(4):509–525. https://doi.org/10.1109/TSE.2010.59
11. Hall T, Beecham S, Bowes D, Gray D, Counsell S (2012) A systematic literature review on fault prediction performance in software engineering. IEEE Trans Softw Eng 38(6):1276–1304. https://doi.org/10.1109/TSE.2011.103
12. Galster M, Weyns D, Tofan D, Michalik B, Avgeriou P (2014) Variability in software systems–a systematic literature review. IEEE Trans Softw Eng 40(3):282–306. https://doi.org/10.1109/TSE.2013.56


13. de Freitas Junior M, Fantinato M, Sun V (2015) Improvements to the function point analysis method: a systematic literature review. IEEE Trans Eng Manag 62(4):495–506. https://doi.org/10.1109/TEM.2015.2453354
14. Hujainah F, Bakar RBA, Abdulgabber MA, Zamli KZ (2018) Software requirements prioritisation: a systematic literature review on significance, stakeholders, techniques and challenges. IEEE Access 6:71497–71523. https://doi.org/10.1109/ACCESS.2018.2881755
15. Flores-Contreras J, Duran-Limon HA, Chavoya A, Almanza-Ruiz SH (2021) Performance prediction of parallel applications: a systematic literature review. J Supercomput 77(4):4014–4055. https://doi.org/10.1007/s11227-020-03417-5
16. Mahdavi-Hezavehi S, Galster M, Avgeriou P (2013) Variability in quality attributes of service-based software systems: a systematic literature review. Inf Softw Technol 55(2):320–343. https://doi.org/10.1016/j.infsof.2012.08.010. Special Section: Component-Based Software Engineering (CBSE), 2011
17. Bornmann L, Daniel H-D (2007) What do we know about the h index? J Am Soc Inf Sci Technol 58(9):1381–1385. https://doi.org/10.1002/asi.20609
18. Welcome to CORE (2022). https://www.core.edu.au. Accessed 2022-01-26
19. Tajima K (1988) Multiple DNA and protein sequence alignment on a workstation and a supercomputer. Bioinformatics 4(4):467–471. https://doi.org/10.1093/bioinformatics/4.4.467
20. Date S, Kulkarni R, Kulkarni B, Kulkarni-Kale U, Kolaskar AS (1993) Multiple alignment of sequences on parallel computers. Bioinformatics 9(4):397–402. https://doi.org/10.1093/bioinformatics/9.4.397
21. Ishikawa M, Toya T, Hoshida M, Nitta K, Ogiwara A, Kanehisa M (1993) Multiple sequence alignment by parallel simulated annealing. Bioinformatics 9(3):267–273. https://doi.org/10.1093/bioinformatics/9.3.267
22. Yap TK, Munson PJ, Frieder O, Martino RL (1995) Parallel multiple sequence alignment using speculative computation. In: Proceedings of the 1995 International Conference on Parallel Processing ICPP
23. Hughey R, Krogh A (1996) Hidden Markov models for sequence analysis: extension and analysis of the basic method. Bioinformatics 12(2):95–107. https://doi.org/10.1093/bioinformatics/12.2.95
24. Martino RL, Yap TK, Suh EB (1997) Parallel algorithms in molecular biology. In: Hertzberger B, Sloot P (eds) High-Performance Computing and Networking. Springer, Berlin, Heidelberg, pp 232–240
25. Yap TK, Frieder O, Martino RL (1998) Parallel computation in biological sequence analysis. IEEE Trans Paral Distrib Syst 9(3):283–294. https://doi.org/10.1109/71.674320
26. Anbarasu LA, Narayanasamy P, Sundararajan V (1999) Multiple sequence alignment using parallel genetic algorithms. In: McKay B, Yao X, Newton CS, Kim J-H, Furuhashi T (eds) Simulated Evolution and Learning. Springer, Berlin, Heidelberg, pp 130–137
27. Anbarasu LA, Narayanasamy P, Sundararajan V (2000) Multiple molecular sequence alignment by island parallel genetic algorithm. Curr Sci 78(7):858–863
28. Catalyurek U, Stahlberg E, Ferreira R, Kurc T, Saltz J (2002) Improving performance of multiple sequence alignment analysis in multi-client environments. In: Proceedings 16th International Parallel and Distributed Processing Symposium, p. 8. https://doi.org/10.1109/IPDPS.2002.1016584
29. Kleinjung J, Douglas N, Heringa J (2002) Parallelized multiple alignment. Bioinformatics 18(9):1270–1271. https://doi.org/10.1093/bioinformatics/18.9.1270
30. Catalyurek U, Gray M, Kurc T, Saltz J, Stahlberg E, Ferreira R (2003) A component-based implementation of multiple sequence alignment. In: Proceedings of the 2003 ACM Symposium on Applied Computing. SAC ’03, pp. 122–126. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/952532.952559
31. Cheetham J, Dehne F, Pitre S, Rau-Chaplin A, Taillon PJ (2003) Parallel CLUSTAL W for PC clusters. In: Kumar V, Gavrilova ML, Tan CJK, L’Ecuyer P (eds) International Conference on Computational Science and Its Applications - ICCSA 2003, pp. 300–309. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44843-8_32
32. Li K-B (2003) ClustalW-MPI: ClustalW analysis using distributed and parallel computing. Bioinformatics 19(12):1585–1586. https://doi.org/10.1093/bioinformatics/btg192
33. Zhihua D, Feng L (2003) Parallel computation for multiple sequence alignments. In: Fourth International Conference on Information, Communications and Signal Processing, 2003 and the Fourth Pacific Rim Conference on Multimedia. Proceedings of the 2003 Joint, vol. 1, pp. 300–3031. https://doi.org/10.1109/ICICS.2003.1292464

S.H.Almanza-Ruiz et al.

1 3

34. Ebedes J, Datta A (2004) Multiple sequence alignment in parallel on a workstation cluster. Bioinformatics 20(7):1193–1195. https://doi.org/10.1093/bioinformatics/bth055
35. Parmentier G, Trystram D, Zola J (2004) Cache-based parallelization of multiple sequence alignment problem. In: Danelutto M, Vanneschi M, Laforenza D (eds) Euro-Par 2004 Parallel Processing. Springer, Berlin, Heidelberg, pp 1005–1012. https://doi.org/10.1007/978-3-540-27866-5_135
36. Schmollinger M, Nieselt K, Kaufmann M, Morgenstern B (2004) DIALIGN P: Fast pair-wise and multiple sequence alignment using parallel processors. BMC Bioinformatics 5(1):128. https://doi.org/10.1186/1471-2105-5-128
37. Lin X, Peiheng Z, Dongbo B, Shengzhong F, Ninghui S (2005) To accelerate multiple sequence alignment using FPGAs. In: Eighth International Conference on High-Performance Computing in Asia-Pacific Region (HPCASIA’05), pp. 5–180. https://doi.org/10.1109/HPCASIA.2005.96
38. Lopes HS, Moritz GL (2005) A distributed approach for a multiple sequence alignment algorithm using a parallel virtual machine. In: 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference, pp. 2843–2846. https://doi.org/10.1109/IEMBS.2005.1617066
39. Luo J, Ahmad I, Ahmed M, Paul R (2005) Parallel multiple sequence alignment with dynamic scheduling. In: International Conference on Information Technology: Coding and Computing (ITCC’05) - Volume II, vol. 1, pp. 8–131. https://doi.org/10.1109/ITCC.2005.223
40. Oliver T, Schmidt B, Maskell D, Nathan D, Clemens R (2005) Multiple sequence alignment on an FPGA. In: 11th International Conference on Parallel and Distributed Systems (ICPADS’05), vol. 2, pp. 326–330. https://doi.org/10.1109/ICPADS.2005.202
41. Oliver T, Schmidt B, Nathan D, Clemens R, Maskell D (2005) Using reconfigurable hardware to accelerate multiple sequence alignment with ClustalW. Bioinformatics 21(16):3431–3432. https://doi.org/10.1093/bioinformatics/bti508
42. Rajasekaran S, Thapar V, Dave H, Huang C-H (2005) Randomized and parallel algorithms for distance matrix calculations in multiple sequence alignment. J Clin Monit Comput 19(4):351–359. https://doi.org/10.1007/s10877-005-0680-3
43. Tan G, Feng S, Sun N (2005) Parallel multiple sequences alignment in SMP cluster. In: Eighth International Conference on High-Performance Computing in Asia-Pacific Region (HPCASIA’05), pp. 6–431. https://doi.org/10.1109/HPCASIA.2005.70
44. Trystram D, Zola J (2005) Parallel multiple sequence alignment with decentralized cache support. In: Cunha JC, Medeiros PD (eds) Euro-Par 2005 Parallel Processing. Springer, Berlin, Heidelberg, pp 1217–1226. https://doi.org/10.1007/11549468_133
45. Chaichoompu K, Kittitornkun S, Tongsima S (2006) MT-ClustalW: multithreading multiple sequence alignment. In: Proceedings 20th IEEE International Parallel Distributed Processing Symposium, p. 8. https://doi.org/10.1109/IPDPS.2006.1639537
46. Chaichoompu K, Kittitornkun S (2006) Multithreaded ClustalW with improved optimization for Intel multi-core processor. In: 2006 International Symposium on Communications and Information Technologies, pp. 590–594. https://doi.org/10.1109/ISCIT.2006.340018
47. Deng X, Li E, Shan J, Chen W (2006) Parallel implementation and performance characterization of MUSCLE. In: Proceedings 20th IEEE International Parallel Distributed Processing Symposium, p. 7. https://doi.org/10.1109/IPDPS.2006.1639616
48. Du Z, Lin F (2006) pNJTree: A parallel program for reconstruction of neighbor-joining tree and its application in ClustalW. Paral Comput 32(5):441–446. https://doi.org/10.1016/j.parco.2006.05.001
49. Oliver T, Schmidt B, Maskell D, Nathan D, Clemens R (2006) High-speed multiple sequence alignment on a reconfigurable platform. Int J Bioinf Res Appl 2(4):394–406. https://doi.org/10.1504/IJBRA.2006.011038
50. Rezaei S, Monwar MM (2006) Divide-and-Conquer algorithm for ClustalW-MPI. In: 2006 Canadian Conference on Electrical and Computer Engineering, pp. 717–720. https://doi.org/10.1109/CCECE.2006.277630
51. Rezaei S, Monwar MM, Bai J (2006) Performance comparison of MPI-based parallel multiple sequence alignment algorithm using single and multiple guide trees. In: 2006 5th IEEE International Conference on Cognitive Informatics, vol. 1, pp. 595–600. https://doi.org/10.1109/COGINF.2006.365552
52. Tan G, Peng L, Feng S, Sun N (2006) Load balancing and parallel multiple sequence alignment with tree accumulation. In: Nagel WE, Walter WV, Lehner W (eds) Euro-Par 2006 Parallel Processing. Springer, Berlin, Heidelberg, pp 1138–1147. https://doi.org/10.1007/11823285_120


53. Zola J, Trystram D, Tchernykh A, Brizuela C (2006) Parallel multiple sequence alignment with local phylogeny search by simulated annealing. In: Proceedings 20th IEEE International Parallel Distributed Processing Symposium, p. 8. https://doi.org/10.1109/IPDPS.2006.1639536
54. Lin CY, Huang CT, Chung Y-C, Tang CY (2007) Efficient parallel algorithm for optimal three-sequences alignment. In: 2007 International Conference on Parallel Processing (ICPP 2007), pp. 14–14. https://doi.org/10.1109/ICPP.2007.38
55. Liu W, Schmidt B, Voss G, Muller-Wittig W (2007) Streaming algorithms for biological sequence alignment on GPUs. IEEE Trans Paral Distrib Syst 18(9):1270–1281. https://doi.org/10.1109/TPDS.2007.1069
56. Low DHP, Veeravalli B, Bader DA (2007) On the design of high-performance algorithms for aligning multiple protein sequences on mesh-based multiprocessor architectures. J Paral Distrib Comput 67(9):1007–1017. https://doi.org/10.1016/j.jpdc.2007.03.007
57. Zola J, Yang X, Rospondek A, Aluru S (2007) PARALLEL-TCOFFEE: A parallel multiple sequence aligner. In: Proceedings of the ISCA 20th International Conference on Parallel and Distributed Computing Systems, September 24-26, 2007, Las Vegas, Nevada, USA, pp. 248–253
58. Helal M, El-Gindy H, Mullin L, Gaeta B (2008) Parallelizing optimal multiple sequence alignment by dynamic programming. In: 2008 IEEE International Symposium on Parallel and Distributed Processing with Applications, pp. 669–674. https://doi.org/10.1109/ISPA.2008.93
59. Manavski SA, Valle G (2008) CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinf 9(2):10. https://doi.org/10.1186/1471-2105-9-S2-S10
60. Saeed F, Khokhar A (2008) Sample-Align-D: A high performance multiple sequence alignment system using phylogenetic sampling and domain decomposition. In: 2008 IEEE International Symposium on Parallel and Distributed Processing, pp. 1–9. https://doi.org/10.1109/IPDPS.2008.45361
61. Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L (2009) Fast statistical alignment. PLOS Comput Biol 5(5):1–15. https://doi.org/10.1371/journal.pcbi.1000392
62. Liu Y, Schmidt B, Maskell DL (2009) MSA-CUDA: Multiple sequence alignment on graphics processing units with CUDA. In: 2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors, pp. 121–128. https://doi.org/10.1109/ASAP.2009.14
63. Liu Y, Schmidt B, Maskell DL (2009) Parallel reconstruction of neighbor-joining trees for large multiple sequence alignments using CUDA. In: 2009 IEEE International Symposium on Parallel Distributed Processing, pp. 1–8. https://doi.org/10.1109/IPDPS.2009.5160923
64. Saeed F, Khokhar A (2009) A domain decomposition strategy for alignment of multiple biological sequences on multiprocessor platforms. J Paral Distrib Comput 69(7):666–677. https://doi.org/10.1016/j.jpdc.2009.03.006
65. Wirawan A, Schmidt B, Kwoh CK (2009) Pairwise distance matrix computation for multiple sequence alignment on the cell broadband engine. In: Allen G, Nabrzyski J, Seidel E, van Albada GD, Dongarra J, Sloot PMA (eds) Computational Science - ICCS 2009. Springer, Berlin, Heidelberg, pp 954–963
66. Di Tommaso P, Orobitg M, Guirado F, Cores F, Espinosa T, Notredame C (2010) Cloud-Coffee: implementation of a parallel consistency-based multiple alignment algorithm in the T-Coffee package and its benchmarking on the Amazon Elastic-Cloud. Bioinformatics 26(15):1903–1904. https://doi.org/10.1093/bioinformatics/btq304
67. Isaza S, Sanchez F, Gaydadjiev G, Ramirez A, Valero M (2010) Scalability analysis of progressive alignment on a multicore. In: 2010 International Conference on Complex, Intelligent and Software Intensive Systems, pp. 889–894. https://doi.org/10.1109/CISIS.2010.149
68. Katoh K, Toh H (2010) Parallelization of the MAFFT multiple sequence alignment program. Bioinformatics 26(15):1899–1900. https://doi.org/10.1093/bioinformatics/btq224
69. Kim T, Joo H (2010) ClustalXeed: a GUI-based grid computation version for high performance and terabyte size multiple sequence alignment. BMC Bioinf 11(1):467. https://doi.org/10.1186/1471-2105-11-467
70. Liu Y, Schmidt B, Maskell DL (2010) MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities. Bioinformatics 26(16):1958–1964. https://doi.org/10.1093/bioinformatics/btq338
71. Miranda LA, Caetano MAF, Melo ACMA, Correa JM, Bordim JL (2010) Multiple biological sequence alignment with a parallel island injection genetic algorithm. In: 2010 IEEE 12th International Conference on High Performance Computing and Communications (HPCC), pp. 314–321. https://doi.org/10.1109/HPCC.2010.31

72. Wirawan A, Kwoh CK, Schmidt B (2010) Multi-threaded vectorized distance matrix computation on the CELL/BE and x86/SSE2 architectures. Bioinformatics 26(10):1368–1369. https://doi.org/10.1093/bioinformatics/btq135
73. de Araujo Macedo E, Magalhaes Alves de Melo AC, Pfitscher GH, Boukerche A (2011) Hybrid MPI/OpenMP strategy for biological multiple sequence alignment with DIALIGN-TX in heterogeneous multicore clusters. In: 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, pp. 418–425. https://doi.org/10.1109/IPDPS.2011.169
74. Lloyd S, Snell QO (2011) Accelerated large-scale multiple sequence alignment. BMC Bioinf 12(1):466. https://doi.org/10.1186/1471-2105-12-466
75. Nguyen KD, Pan Y, Nong G (2011) Parallel progressive multiple sequence alignment on reconfigurable meshes. BMC Genom 12(5):4. https://doi.org/10.1186/1471-2164-12-S5-S4
76. Orobitg M, Guirado F, Notredame C, Cores F (2011) Exploiting parallelism on progressive alignment methods. J Supercomput 58(2):186–194. https://doi.org/10.1007/s11227-009-0359-5
77. Rius J, Cores F, Solsona F, van Hemert JI, Koetsier J, Notredame C (2011) A user-friendly web portal for T-Coffee on supercomputers. BMC Bioinf 12(1):150. https://doi.org/10.1186/1471-2105-12-150
78. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 7(1):539. https://doi.org/10.1038/msb.2011.75
79. da Silva FJM, Pérez JMS, Pulido JAG, Rodríguez MAV (2011) Parallel Niche Pareto AlineaGA - an evolutionary multiobjective approach on multiple sequence alignment. J Integr Bioinf 8(3):57–72. https://doi.org/10.1515/jib-2011-174
80. Lin Y-S, Lin C-Y, Chung Y-C (2012) GPU-based cloud service for multiple sequence alignments with regular expression constrains. In: 4th IEEE International Conference on Cloud Computing Technology and Science Proceedings, pp. 741–746. https://doi.org/10.1109/CloudCom.2012.6427565
81. Mahram A, Herbordt MC (2012) FMSA: FPGA-accelerated ClustalW-based multiple sequence alignment through pipelined prefiltering. In: 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines, pp. 177–183. https://doi.org/10.1109/FCCM.2012.
82. Marucci EA, Zafalon GFD, Momente JC, Pinto AR, Amazonas JRA, Shiyou Y, Sato LM, Machado JM (2012) Using threads to overcome synchronization delays in parallel multiple progressive alignment algorithms. Curr Res Bioinf 1:50–63. https://doi.org/10.3844/ajbsp.2012.50.63
83. Orobitg M, Cores F, Guirado F, Kemena C, Notredame C, Ripoll A (2012) Enhancing the scalability of consistency-based progressive multiple sequences alignment applications. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pp. 71–82. https://doi.org/10.1109/IPDPS.2012.17
84. Blazewicz J, Frohmberg W, Kierzynka M, Wojciechowski P (2013) G-MSA - A GPU-based, fast and accurate algorithm for multiple sequence alignment. J Paral Distrib Comput 73(1):32–41. https://doi.org/10.1016/j.jpdc.2012.04.004
85. de Araujo Macedo E, Alves Magalhaes, de Melo AC, Pfitscher GH, Boukerche A (2013) Multiple biological sequence alignment in heterogeneous multicore clusters with user-selectable task allocation policies. J Supercomput 63(3):740–756. https://doi.org/10.1007/s11227-012-0768-8
86. Esteban FJ, Díaz D, Hernández P, Caballero JA, Dorado G, Gálvez S (2013) Direct approaches to exploit many-core architecture in bioinformatics. Future Gener Comput Syst 29(1):15–26. https://doi.org/10.1016/j.future.2012.03.018. Including Special section: AIRCC-NetCoM 2009 and Special section: Clouds and Service-Oriented Architectures
87. Hatem M, Ruml W (2013) External memory best-first search for multiple sequence alignment. Proc AAAI Conf Artif Intell 27(1):409–416
88. Katoh K, Standley DM (2013) MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol Biol Evol 30(4):772–780. https://doi.org/10.1093/molbev/mst010
89. Montañola A, Roig C, Guirado F, Hernández P, Notredame C (2013) Performance analysis of computational approaches to solve multiple sequence alignment. J Supercomput 64(1):69–78. https://doi.org/10.1007/s11227-012-0751-4


90. Orobitg M, Lladós J, Guirado F, Cores F, Notredame C (2013) Scalability and accuracy improvements

of consistency‑based multiple sequence alignment tools. In: Proceedings of the 20th European MPI

Users’ Group Meeting. EuroMPI ’13, pp. 259–264. Association for Computing Machinery, New York,

NY, USA. https:// doi. org/ 10. 1145/ 24885 51. 24885 83

91. Tzanoudakis T, Papaefstathiou I, Manifavas C (2013) Parallelizing bioinformatics and security applica‑

tions on a low‑cost multi‑core system. In: 2013 ACS International Conference on Computer Systems

and Applications (AICCSA), pp. 1–4. https:// doi. org/ 10. 1109/ AICCSA. 2013. 66164 52

92. Yilmaz C, Gök M (2013) System designs to perform bioinformatics sequence alignment. Turkish J Electr

Eng Comput Sci 21(1):246–262. https:// doi. org/ 10. 3906/ elk‑ 1105‑ 22

93. Zhu X, Li K, Salah A (2013) A data parallel strategy for aligning multiple biological sequences on multi-core computers. Comput Biol Med 43(4):350–361. https://doi.org/10.1016/j.compbiomed.2012.12.009

94. Díaz D, Esteban FJ, Hernández P, Caballero JA, Guevara A, Dorado G, Gálvez S (2014) MC64-ClustalWP2: A highly-parallel hybrid strategy to align multiple sequences in many-core architectures. PLOS ONE 9(4):1–12. https://doi.org/10.1371/journal.pone.0094044

95. Gudyś A, Deorowicz S (2014) QuickProbs–A fast multiple sequence alignment algorithm designed for graphics processors. PLOS ONE 9(2):1–18. https://doi.org/10.1371/journal.pone.0088901

96. Lin CY, Lin YS (2014) Efficient parallel algorithm for multiple sequence alignments with regular expression constraints on graphics processing units. Int J Comput Sci Eng 9(1–2):11–20. https://doi.org/10.1504/IJCSE.2014.058687

97. Al-Neama MW, Reda NM, Ghaleb FFM (2015) Fast vectorized distance matrix computation for multiple sequence alignment on multi-cores. Int J Biomath 08(06):1550084. https://doi.org/10.1142/S1793524515500849

98. Hung C-L, Lin Y-S, Lin C-Y, Chung Y-C, Chung Y-F (2015) CUDA ClustalW: an efficient parallel algorithm for progressive multiple sequence alignment on Multi-GPUs. Comput Biol Chem 58:62–68. https://doi.org/10.1016/j.compbiolchem.2015.05.004

99. Mirarab S, Nguyen N, Guo S, Wang L-S, Kim J, Warnow T (2015) PASTA: Ultra-large multiple sequence alignment for nucleotide and amino-acid sequences. J Comput Biol 22(5):377–386. https://doi.org/10.1089/cmb.2014.0156 (PMID: 25549288)

100. Nguyen N-pD, Mirarab S, Kumar K, Warnow T (2015) Ultra-large alignments using phylogeny-aware profiles. Genome Biol 16(1):124. https://doi.org/10.1186/s13059-015-0688-z

101. Orobitg M, Guirado F, Cores F, Llados J, Notredame C (2015) High performance computing improvements on bioinformatics consistency-based multiple sequence alignment tools. Paral Comput 42:18–34. https://doi.org/10.1016/j.parco.2014.09.010

102. Sundfeld D, Teodoro G, Magalhaes Alves de Melo AC (2015) Parallel A-Star multiple sequence alignment with locality-sensitive hash functions. In: 2015 Ninth International Conference on Complex, Intelligent, and Software Intensive Systems, pp. 342–347. https://doi.org/10.1109/CISIS.2015.50

103. Zafalon GFD, Visotaky JMV, Amorim AR, Valêncio CR, Neves LA, de Souza RCG, Machado JM (2015) A parallel approach of COFFEE objective function to multiple sequence alignment. J Phys: Conf Ser 633:012084. https://doi.org/10.1088/1742-6596/633/1/012084

104. Zhu X, Li K, Salah A, Shi L, Li K (2015) Parallel implementation of MAFFT on CUDA-enabled graphics hardware. IEEE/ACM Trans Comput Biol Bioinf 12(1):205–218. https://doi.org/10.1109/TCBB.2014.2351801

105. Amorim AR, Visotaky JMV, de Godoi Contessoto A, Neves LA, Gratão De Souza RC, Valêncio CR, Zafalon GFD (2016) Performance improvement of genetic algorithm for multiple sequence alignment. In: 2016 17th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), pp. 69–72. https://doi.org/10.1109/PDCAT.2016.029

106. Deorowicz S, Debudaj-Grabysz A, Gudyś A (2016) FAMSA: fast and accurate multiple sequence alignment of huge protein families. Sci Rep 6(1):33964. https://doi.org/10.1038/srep33964

107. González-Domínguez J, Liu Y, Touriño J, Schmidt B (2016) MSAProbs-MPI: parallel multiple sequence aligner for distributed-memory systems. Bioinformatics 32(24):3826–3828. https://doi.org/10.1093/bioinformatics/btw558

108. Lan H, Chan Y, Xu K, Schmidt B, Peng S, Liu W (2016) Parallel algorithms for large-scale biological sequence alignment on Xeon-Phi based clusters. BMC Bioinf 17(9):267. https://doi.org/10.1186/s12859-016-1128-0

109. Reda NM, Al-Neama M, Ghaleb FFM (2016) HAMSA: highly accelerated multiple sequence aligner. Int J Adv Comput Sci Appl. https://doi.org/10.14569/IJACSA.2016.070661

110. Abuín JM, Pena TF, Pichel JC (2017) PASTASpark: multiple sequence alignment meets Big Data. Bioinformatics 33(18):2948–2950. https://doi.org/10.1093/bioinformatics/btx354

S.H.Almanza-Ruiz et al.

1 3

111. Araujo E, Stefanes MA, Ferlete VdO, Rozante LCS (2017) Multiple sequence alignment using hybrid parallel computing. In: 2017 IEEE 17th International Conference on Bioinformatics and Bioengineering (BIBE), pp. 175–180. https://doi.org/10.1109/BIBE.2017.00-59

112. Gudyś A, Deorowicz S (2017) QuickProbs 2: towards rapid construction of high-quality alignments of large protein families. Sci Rep 7(1):41553. https://doi.org/10.1038/srep41553

113. Liu P, Hemani A, Paul K, Weis C, Jung M, Wehn N (2017) 3D-stacked many-core architecture for biological sequence analysis problems. Int J Paral Program 45(6):1420–1460. https://doi.org/10.1007/s10766-017-0495-0

114. Neehal N, Karim DZ, Islam A (2017) Cloud-POA: A cloud-based map only implementation of PO-MSA on Amazon multi-node EC2 Hadoop Cluster. In: 2017 20th International Conference of Computer and Information Technology (ICCIT), pp. 1–6. https://doi.org/10.1109/ICCITECHN.2017.82818

115. Wan S, Zou Q (2017) HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing. Algorithms Mol Biol 12(1):25. https://doi.org/10.1186/s13015-017-0116-x

116. Zambrano-Vega C, Nebro AJ, García-Nieto J, Aldana-Montes JF (2017) M2Align: parallel multiple sequence alignment with a multi-objective metaheuristic. Bioinformatics 33(19):3011–3017. https://doi.org/10.1093/bioinformatics/btx338

117. Nakamura T, Yamada KD, Tomii K, Katoh K (2018) Parallelization of MAFFT for large-scale multiple sequence alignments. Bioinformatics 34(14):2490–2492. https://doi.org/10.1093/bioinformatics/bty121

118. Sundfeld D, Razzolini C, Teodoro G, Boukerche A, de Melo ACMA (2018) PA-Star: a disk-assisted parallel A-Star strategy with locality-sensitive hash for multiple sequence alignment. J Paral Distrib Comput 112:154–165. https://doi.org/10.1016/j.jpdc.2017.04.014

119. Welivita A, Perera I, Meedeniya D, Wickramarachchi A, Mallawaarachchi V (2018) Managing complex workflows in bioinformatics: An interactive toolkit with GPU acceleration. IEEE Trans NanoBiosci 17(3):199–208. https://doi.org/10.1109/TNB.2018.2837122

120. Lassmann T (2019) Kalign 3: multiple sequence alignment of large datasets. Bioinformatics 36(6):1928–1929. https://doi.org/10.1093/bioinformatics/btz795

121. Benítez-Hidalgo A, Nebro AJ, Aldana-Montes JF (2020) Sequoya: multiobjective multiple sequence alignment in Python. Bioinformatics 36(12):3892–3893. https://doi.org/10.1093/bioinformatics/btaa2

122. Smirnov V, Warnow T (2020) MAGUS: multiple sequence alignment using graph clUStering. Bioinformatics 37(12):1666–1672. https://doi.org/10.1093/bioinformatics/btaa992

123. Smirnov V (2021) Recursive MAGUS: scalable and accurate multiple sequence alignment. PLOS Comput Biol 17(10):1–17. https://doi.org/10.1371/journal.pcbi.1008950

124. Ishaq M, Khan A, Su’ud MM, Alam MM, Bangash JI, Khan A (2022) An improved strategy for task scheduling in the parallel computational alignment of multiple sequences. Comput Math Methods Med 2022:8691646. https://doi.org/10.1155/2022/8691646

125. Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22(22):4673–4680. https://doi.org/10.1093/nar/22.22.4673

126. Chowdhury B, Garai G (2017) A review on multiple sequence alignment from the perspective of genetic algorithm. Genomics 109(5):419–431. https://doi.org/10.1016/j.ygeno.2017.06.007

127. Prousalis K, Konofaos N (2019) A quantum pattern recognition method for improving pairwise sequence alignment. Sci Rep 9(1):7226. https://doi.org/10.1038/s41598-019-43697-3

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
