Genome-wide identification and comparative analysis of alternative splicing across four legume species

  Institute of Grassland Research, Chinese Academy of Agricultural Sciences


Main conclusion: Alternative splicing events were identified genome-wide in four legume species, and an analysis of nitrogen fixation-related gene families and their evolution was also performed.

Alternative splicing (AS) is a key regulatory mechanism that contributes to transcriptome and proteome diversity. Investigating genome-wide conserved AS events across species will improve our understanding of the evolution of functional diversity in legumes and support their genetic improvement. Genome-wide identification and characterization of AS were performed using the publicly available mRNA, EST, and RNA-Seq data for four important legume species. A total of 15,165 AS genes in Glycine max, 6077 in Cicer arietinum, 7240 in Medicago truncatula, and 7358 in Lotus japonicus were identified. Intron retention (IntronR) was the dominant AS type among the identified events, ranging from 43.91% of events in C. arietinum to 53.76% in M. truncatula. We identified 1159 AS genes that were conserved among the four species. Furthermore, nine nitrogen fixation-related gene families with 237 genes were identified, and 80 of these genes were AS genes, with the proportion of AS genes ranging from 27.78% in C. arietinum to 43.48% in G. max. An evolutionary analysis showed that these AS genes tended to be located adjacent to each other in the evolutionary tree and were unevenly distributed among sub-families. This study provides a foundation for future studies on transcriptome complexity, evolution, and the role of AS in plant functional regulation.
Zan Wang · Han Zhang · Wenlong Gong
Keywords Alternative splicing · Cicer arietinum · Glycine max · Legume · Lotus japonicus · Medicago truncatula
AltA Alternative 3′ acceptor sites
AltD Alternative 5′ donor sites
AS Alternative splicing
ExonS Exon skipping
IntronR Intron retention
MXEs Mutually exclusive exons
Alternative splicing (AS) is a regulated process in which more than one mRNA transcript is generated from a single precursor mRNA (pre-mRNA) transcript (Staiger and Brown 2013). AS is a widespread mechanism that greatly increases transcriptome diversity, and the alternatively spliced transcripts may encode distinct proteins, thus expanding the coding capacity of genes and contributing to the proteome complexity of higher organisms (Marquez et al. 2012). In humans, more than 95% of genes have been reported to undergo AS (Wang et al. 2008). A lower frequency of AS (42–61%) has been reported in plants (Filichkin et al. 2010; Marquez et al. 2012), and it is likely that additional studies using advanced computational tools will identify many more genes with AS as transcriptomes of plants grown under stress are evaluated (Shang et al. 2017). Relative to the predominant transcript isoform, AS can be divided into four main types: intron retention (IntronR), alternative 3′ acceptor sites (AltA), alternative 5′ donor sites (AltD), and exon skipping (ExonS) (Wang and Brendel 2006).
1 Institute of Animal Science, Chinese Academy of Agricultural Science, Beijing 100193, China
ExonS is the predominant AS form in animals (Wang et al. 2008), whereas IntronR is observed primarily in plants (Wang and Brendel 2006; Filichkin et al. 2010; Marquez et al. 2012). AS participates in many important processes during the life cycle of plants (Staiger and Brown 2013) and occurs in response to various abiotic stressors (Mastrangelo et al. 2012), including salt (Feng et al. 2015), drought (Liu et al. 2017; Thatcher et al. 2016), and heat stress (Liu et al. 2013; Jiang et al. 2017; Keller et al. 2016).
Despite the important roles AS plays in plants, the evolution and conservation of AS events are not well understood in legume species. Most large-scale, cross-species AS comparisons in leguminous species have been limited to identifying conserved AS events using cDNA and expressed sequence tags (ESTs), and these comparative studies have reported few conserved events between species (Wang and Brendel 2006; Baek et al. 2008; Wang et al. 2008). Fabaceae, the legume family, contains species important to humans for both consumption and atmospheric nitrogen fixation, as nitrogen is a main limiting factor for plant growth. Many legume species are also economically important and are a particularly important source of protein. Soybean (Glycine max) is one of the most economically important legume species and is the dominant source of protein for animal feed and vegetable oil (Hartman et al. 2011). Chickpea (Cicer arietinum L.) has one of the best nutritional compositions among the dry edible legumes, ranking third in worldwide legume production and first in the Mediterranean basin (Adams et al. 2009). The tribe Trifolieae includes the predominant forage legumes alfalfa (Medicago sativa) and clover (Trifolium sp.) as well as the model plant M. truncatula. Medicago truncatula is used as a model plant to study the functional genomics of legumes because it is self-fertile, has a small diploid genome, and has high transformation efficiency (Young et al. 2005). Lotus japonicus, in the tribe Loteae, is another model diploid legume plant due to its small genome, short life cycle, and ease of Agrobacterium-mediated transformation (Handberg and Stougaard 1992).
In this study, we compared the AS event landscape and the functional diversity of AS genes in four legume species: G. max, C. arietinum, M. truncatula, and L. japonicus. Understanding the conservation of AS events among these legumes helps to elucidate important aspects of the different AS types. This work increases our knowledge of AS in legumes and provides a platform for further investigation.
Materials and Methods
Sequence collection
The expressed sequence tags (ESTs) and mRNA sequences of G. max, C. arietinum, M. truncatula, and L. japonicus were downloaded from the nucleotide repository of the National Center for Biotechnology Information (NCBI). The sequences were filtered using SeqClean 2 (Chen et al. 2007) with the universal vector database as the default parameter. In addition, the public RNA-Seq raw reads of these four species were downloaded (Suppl. Table S1) and subsequently cleaned using Trimmomatic v0.33 (Bolger et al. 2014) with the following parameters: LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:25. The filtered reads were first aligned to the corresponding reference genome using HISAT v2.0.4 (Kim et al. 2015), and duplicate reads were removed with Picard v1.115 MarkDuplicates (http://broad insti tute.githu d/). Finally, only the unique alignments for single-end reads and the concordant unique alignments for paired-end reads were kept for further analysis (Suppl.
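The read-cleaning parameters above can be illustrated with a small sketch. This is a simplified re-implementation of the general idea behind the four trimming steps, not Trimmomatic's actual code; the function name `trim_read` and the representation of a read as a list of Phred scores are assumptions made only for illustration.

```python
from typing import List, Optional


def trim_read(
    quals: List[int],
    leading: int = 3,       # LEADING:3
    trailing: int = 3,      # TRAILING:3
    window: int = 4,        # SLIDINGWINDOW:4:15 (window size)
    min_avg: float = 15.0,  # SLIDINGWINDOW:4:15 (required mean quality)
    min_len: int = 25,      # MINLEN:25
) -> Optional[range]:
    """Return the retained base positions, or None if the read is discarded.

    Simplified sketch: the steps are applied in the order in which the
    parameters are listed in the text.
    """
    start, end = 0, len(quals)

    # LEADING: remove low-quality bases from the 5' end.
    while start < end and quals[start] < leading:
        start += 1

    # TRAILING: remove low-quality bases from the 3' end.
    while end > start and quals[end - 1] < trailing:
        end -= 1

    # SLIDINGWINDOW: scan from the 5' end and cut at the first window
    # whose mean quality falls below the required value.
    for i in range(start, end - window + 1):
        if sum(quals[i:i + window]) / window < min_avg:
            end = i
            break

    # MINLEN: discard reads that are too short after trimming.
    if end - start < min_len:
        return None
    return range(start, end)
```

For example, a read with two poor leading bases, thirty good bases, and a low-quality tail keeps only its high-quality core, while a ten-base read is dropped by the length filter.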
Transcript assembly and identification of AS
The cleaned EST/mRNA sequences were assembled using CAP3 with the following parameters: -p 95 -o 50 -g 3 -y 50 -t 1000 (Huang and Madan 1999). To maximize the detection of AS, we used three strategies to assemble the clean RNA-Seq data. The first was Cufflinks v2.2.1 (Trapnell et al. 2012) with the parameters “-GTF-guide -max-intron-length -b -F 0.05-.” The “-max-intron-length” was set to 15,000 in G. max, 20,000 in C. arietinum, 10,000 in M. truncatula, and 15,000 in L. japonicus. The second strategy was genome-guided Trinity 2.0.4 (Haas et al. 2013) with the parameter “-genome_guided_max_intron,” which was set to the same values as for Cufflinks v2.2.1 for the four studied species. The third strategy was StringTie v1.0.0 (Pertea et al. 2015) with the parameters “-G -f 0.05 -j 2.” The sequences assembled via the three methods, together with the filtered ESTs/mRNAs, were merged and aligned back to the corresponding reference genome with GMAP (Wu and Watanabe 2005) and clustered with PASA 2.0 (Haas et al. 2003) to remove redundancy. We used BLASTN to compare our transcripts with the reference transcript sequences in the corresponding database, with the alignment parameters -evalue 0.00001 -perc_identity 95, and the number of aligned sequences was counted at 50%, 70%, and 90% coverage (Table S2). Compared with the reference transcripts, our assembled transcripts contained many new transcripts, which helps us to fully exploit potential alternative splicing events. The assembly results from PASA were compared with the gene annotations from the reference genome (Suppl. Table S3) using PASA with the parameter “-A –annots_gff3” to obtain AS information from the reference gene annotation. AS genes with a fragments per kilobase of exon per million reads mapped (FPKM) value < 0.1 were not retained. In total, five types of AS events, including IntronR, AltA, AltD, ExonS, and mutually exclusive exons (MXEs), were considered in this study. AS events and types were obtained with Astalavista (Foissac and Sammeth 2007) for each legume species.
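The numeric settings in this section can be collected in one place, as in the minimal sketch below. The intron lengths, the FPKM cutoff, and the coverage levels come from the text; the function names, the input data structures, and the choice of inclusive comparisons (`>=`) are assumptions for illustration and do not reproduce the authors' scripts.

```python
from typing import Dict, Iterable, Tuple

# Maximum intron length used for Cufflinks and genome-guided Trinity,
# as stated in the text for each species.
MAX_INTRON_LENGTH: Dict[str, int] = {
    "Glycine max": 15000,
    "Cicer arietinum": 20000,
    "Medicago truncatula": 10000,
    "Lotus japonicus": 15000,
}

# AS genes with FPKM below this value were not retained.
MIN_FPKM = 0.1


def retain_expressed(fpkm_by_gene: Dict[str, float],
                     min_fpkm: float = MIN_FPKM) -> Dict[str, float]:
    """Keep genes whose FPKM is at least min_fpkm (hypothetical input format)."""
    return {gene: fpkm for gene, fpkm in fpkm_by_gene.items() if fpkm >= min_fpkm}


def count_by_coverage(coverages: Iterable[float],
                      thresholds: Tuple[float, ...] = (0.5, 0.7, 0.9)) -> Dict[float, int]:
    """Count aligned transcripts reaching each coverage level (fractions 0-1)."""
    values = list(coverages)
    return {t: sum(1 for c in values if c >= t) for t in thresholds}
```

A call such as `count_by_coverage([0.95, 0.72, 0.55, 0.3])` tallies how many transcripts reach each of the three coverage levels, which is the kind of summary the text attributes to Table S2.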
Identification of conserved AS events
The identification of conserved AS events followed the methods described by Chamala et al. (2015), and four AS event types (IntronR, AltA, AltD, and ExonS) were considered here. First, the OrthoMCL software (Li et al. 2003) was used to identify potential orthologous gene families (orthogroups) among the four legume species, using protein sequences from the longest isoform of each gene as input; each orthologous gene family was called a cluster (Chamala et al. 2015). For each AS event, 30–300 bp of sequence from the upstream and downstream exons immediately flanking an intron defining the alternative junctions were extracted. These flanking sequences that define splice junctions are termed flanking exon sequence tags (FESTs). Therefore, each AS event is represented by a pair of FESTs. FESTs from all species were divided into four datasets, one for each AS event type. Each FEST in one dataset was searched against all other FESTs in the same dataset by WU-BLASTN (cutoff E-value 1E–5) (http://blast.wustl.edu). An AS event between two genes was considered conserved when both genes belonged to the same orthogroup and the pair of FESTs of one gene aligned well with the pair of FESTs of the other gene (Chamala et al. 2015). Venn diagrams of the conserved AS pairs were drawn using the R programming language (http://www.r-
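The conservation rule described above can be sketched as a small predicate. This is an illustrative reading of the criterion, not the pipeline of Chamala et al. (2015): the `ASEvent` class, the `orthogroup_of` mapping, and the `fest_hits` set (assumed to hold FEST pairs that already passed the E-value cutoff) are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass(frozen=True)
class ASEvent:
    event_id: str
    gene: str
    as_type: str          # "IntronR", "AltA", "AltD" or "ExonS"
    upstream_fest: str    # identifier of the upstream FEST
    downstream_fest: str  # identifier of the downstream FEST


def is_conserved(a: ASEvent, b: ASEvent,
                 orthogroup_of: Dict[str, str],
                 fest_hits: Set[Tuple[str, str]]) -> bool:
    """Apply the two conditions from the text to a pair of AS events."""
    # FESTs are only compared within the dataset of one AS event type.
    if a.as_type != b.as_type:
        return False

    # Condition 1: both genes belong to the same orthogroup.
    group_a = orthogroup_of.get(a.gene)
    if group_a is None or group_a != orthogroup_of.get(b.gene):
        return False

    # Condition 2: both FESTs of one event align to the FESTs of the other.
    def hit(x: str, y: str) -> bool:
        return (x, y) in fest_hits or (y, x) in fest_hits

    return hit(a.upstream_fest, b.upstream_fest) and hit(a.downstream_fest, b.downstream_fest)
```

Treating hits as symmetric is a design choice of this sketch; an all-against-all search reports each pair in both directions in most cases, so the symmetric lookup simply avoids depending on the order in which hits were recorded.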
Enrichment analysis of conservatively expressed
An enrichment analysis was performed to annotate genes that contained AS events. First, we used the protein sequence of the longest transcript of each gene to generate Pfam (Finn et al. 2014) annotations with HMMER v3.1b2 (Cheng 2014). Based on the best BLASTP hits from the NR database, the Blast2GO program (Conesa et al. 2005) was used to make GO annotations. Fisher's exact tests were used to conduct an enrichment analysis of GO terms. We considered GO terms to be significantly enriched when the corrected P < 0.01.
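A GO enrichment test of this kind can be sketched with the standard library alone. The one-sided Fisher's exact test below is the usual over-representation form; the text does not state which multiple-testing correction was used, so the Benjamini-Hochberg adjustment shown here is an assumption, as are the function names.

```python
from math import comb
from typing import List


def fisher_enrichment_p(k: int, n: int, K: int, N: int) -> float:
    """One-sided Fisher's exact test for over-representation.

    k: study genes annotated with the term
    n: study genes in total
    K: background genes annotated with the term
    N: background genes in total
    Returns P(X >= k) under the hypergeometric distribution.
    """
    upper = min(K, n)
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numerator / comb(N, n)


def benjamini_hochberg(pvals: List[float]) -> List[float]:
    """Adjusted P values (assumed correction; not specified in the text)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_terms(pvals: List[float], alpha: float = 0.01) -> List[int]:
    """Indices of terms whose corrected P value is below alpha (0.01 in the text)."""
    return [i for i, p in enumerate(benjamini_hochberg(pvals)) if p < alpha]
```

Each GO term contributes one raw P value from `fisher_enrichment_p`; the list of raw values is then corrected together before applying the 0.01 threshold.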
Nitrogen fixation-related AS gene
Nitrogen fixation-related genes from dicotyledons were downloaded from the protein database in NCBI (Suppl. Table S4). BLASTP alignment (cutoff E-value 1E–5) was performed using all the protein sequences of the four species as references and the downloaded sequences as queries (Suppl. Table S1). Conserved domain annotations were also conducted using the CDD database in NCBI (Marchler-Bauer et al. 2011). Multiple sequence alignments were performed with MUSCLE (Edgar 2004). A phylogenetic tree was constructed using PhyML (Guindon et al. 2009) for each of the nitrogen-fixing-related gene families with 1000 bootstrap replicates.
Results and discussion
AS identification in four legume species
To explore and compare AS patterns in the four legume species using the data gathered from NCBI, we assembled putative unique transcripts: 288,953 in G. max, 109,960 in C. arietinum, 182,201 in M. truncatula, and 148,103 in L. japonicus (Table 1; Suppl. Table S1). Five types of AS events, IntronR, AltA, AltD, ExonS, and MXEs, were considered in this study (Table 2). IntronR is the most prevalent AS type in all four species, ranging from 43.91% of AS events in C. arietinum to 53.76% in M. truncatula (Table 2). These results are consistent with previous findings in plants (Chamala et al. 2015; Filichkin et al. 2010; Marquez et al. 2012; Walters et al. 2013; Wang and Brendel 2006). On average, close to half of the AS events are IntronR (48.1%), followed by AltA (25.4%), AltD (14.9%), and ExonS (10.1%), with MXEs (1.5%) being the least frequent type of AS event.
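The averages quoted above can be reproduced from the per-species event percentages in Table 2, as the short sketch below shows. The assignment of each row of percentages to an AS type is inferred from the values quoted in the text (the row labels are not legible in the extracted table) and should be read as an assumption; the variable and function names are illustrative only.

```python
from typing import Dict, List

# Percentage of AS events per type, in the order
# G. max, C. arietinum, M. truncatula, L. japonicus (Table 2).
EVENT_PERCENT: Dict[str, List[float]] = {
    "IntronR": [45.96, 43.91, 53.76, 48.92],
    "AltA":    [25.65, 28.39, 22.67, 24.74],
    "AltD":    [14.42, 15.78, 14.41, 14.80],
    "ExonS":   [12.74, 10.71, 8.36, 8.61],
    "MXEs":    [1.23, 1.21, 0.79, 2.93],
}


def mean_percent(table: Dict[str, List[float]]) -> Dict[str, float]:
    """Unweighted mean across species, rounded to one decimal place."""
    return {as_type: round(sum(vals) / len(vals), 1) for as_type, vals in table.items()}


print(mean_percent(EVENT_PERCENT))
```

Note that this is an unweighted mean over species; weighting by the number of events per species would give slightly different figures because G. max contributes far more events than the other three species.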
In total, 41,919 AS events were identified in G. max, 12,853 in C. arietinum, 17,339 in M. truncatula, and 16,266 in L. japonicus (Table 2). The percentages of multi-exon genes with at least one AS event were the highest in G. max
Table 1 Summary of raw sequence and assembly data for four legume species
Species              EST/mRNA   Cleaned EST/mRNA  Raw reads      Clean reads    PASA assembly
Glycine max          1,558,403  1,429,801         1,009,119,760  1,008,528,351  288,953
Cicer arietinum      86,267     82,801            750,427,845    732,429,047    109,960
Medicago truncatula  348,535    337,221           1,037,215,452  1,036,743,325  182,201
Lotus japonicus      254,589    250,543           1,078,427,701  1,054,215,538  148,103
with 38.87%, followed by C. arietinum (33.70%), L. japonicus (30.07%), and M. truncatula (28.39%) (Table 2). The percentages of multi-exon genes with at least one AS event were similar to those found in Vitis vinifera (30%) (Vitulo et al. 2014), Populus trichocarpa (36%) (Bao et al. 2013), and Sonneratia (Yang et al. 2018) and were lower than those in Arabidopsis (61%) (Marquez et al. 2012). The percentages identified in our study might be an underestimate because
Table 2 Genome-wide AS event distributions and patterns
AS type  Measure     Glycine max     Cicer arietinum  Medicago truncatula  Lotus japonicus
MXEs     Events (%)  516 (1.23)      156 (1.21)       137 (0.79)           476 (2.93)
MXEs     Genes (%)   366 (2.41)      122 (2.01)       101 (1.40)           328 (4.46)
IntronR  Events (%)  19,264 (45.96)  5644 (43.91)     9322 (53.76)         7958 (48.92)
IntronR  Genes (%)   9219 (60.79)    3359 (55.27)     4780 (66.02)         4427 (60.17)
ExonS    Events (%)  5339 (12.74)    1376 (10.71)     1450 (8.36)          1401 (8.61)
ExonS    Genes (%)   3563 (23.49)    1093 (17.99)     1094 (15.11)         1028 (13.97)
AltD     Events (%)  6046 (14.42)    2028 (15.78)     2499 (14.41)         2407 (14.80)
AltD     Genes (%)   4389 (28.94)    1587 (26.11)     1880 (25.97)         1870 (25.41)
AltA     Events (%)  10,754 (25.65)  3649 (28.39)     3931 (22.67)         4024 (24.74)
AltA     Genes (%)   7142 (47.10)    2652 (43.64)     2882 (39.80)         2988 (40.61)
Total    Events      41,919          12,853           17,339               16,266
Total    Genes (%)   15,165 (38.87)  6077 (33.70)     7240 (28.39)         7358 (30.07)
Fig. 1 Proportions of alternative splicing events in four Leguminosae plants. The pie charts next to each species (Glycine max, Cicer arietinum, Medicago truncatula, and Lotus japonicus) indicate their proportions of AS events
our analysis is restricted to only five types of AS events (AltA, AltD, ExonS, IntronR, and MXEs). A previous comprehensive AS study in Arabidopsis reported that 61.2% of expressed multi-exonic genes exhibit AS based on investigations into ten AS types (Marquez et al. 2012). In addition, a previous study on two of the taxa included in this study, G. max and M. truncatula, found higher percentages of multi-exon genes with at least one AS event (50.2% for G. max and 44.9% for M. truncatula) than those found in this study (Chamala et al. 2015). This may be due to the difference in the amount of sequence data for the respective species used for analysis.
Identification of conserved AS in four legume species
Classification of the conserved AS events provides a framework for understanding the evolution of functional genes and their regulation at the transcriptional level, which may initiate cross-talk among the evolution of the AS genes, the transcriptional environment, and ecological adaptation (Wang and Brendel 2006). Conserved AS events among the four legume species were identified and classified into 6,895 conserved AS event clusters (Suppl. Table S5). There are 10,939 conserved AS events between at least two of the four legume species included in this study, involving 2612 clusters and 7616 genes (Table 3). This is the second largest number of conserved AS events reported to date (Mei et al. 2017; Yang et al. 2018). Chamala et al. (2015) identified 27,120 conserved AS events between at least two of nine angiosperm taxa, which is the largest number of conserved AS events reported to date. As expected, the number of conserved events decreased as the number of species sharing them increased, with the most (5824) conserved events identified between only two species and only a modest number (1966) conserved across all four species.
The overall statistics of the shared and unique AS events for the four legume species are shown in Fig. 2. The largest number of conserved AS events was observed between M. truncatula and G. max (5773), followed by L. japonicus and G. max (4854) and C. arietinum and G. max (4663). The smallest number was observed between C. arietinum and L. japonicus (3129) (Fig. 2; Suppl. Table S6). Glycine max had a relatively high level of conserved AS events with the three other species, whereas L. japonicus had a relatively low level of conserved AS events with the other species, except for L. japonicus versus G. max (Suppl. Table S6). The varying conservation levels among the species pairs can be attributed mainly to the genetic uniqueness of each species. In addition, differences in the quantity of publicly available data for the respective species and in the tissues from which the sequence data were produced may also contribute. AS events have been reported to be tissue specific (Wang et al. 2016, 2018). With regard to the three-species analysis, the largest number of conserved AS events (2934) was detected among L. japonicus, M. truncatula, and G. max, followed by C. arietinum, L. japonicus, and M. truncatula (2229), and C. arietinum, M. truncatula, and G. max (1317). The smallest number of conserved AS
Table 3 Conserved AS between four Leguminosae plants
AS type  Measure   Two species  Three species  Four species  Total
IntronR  Clusters  1021         253            73            1347
IntronR  Events    2897         1583           927           5407
IntronR  Genes     2209         983            483           3675
ExonS    Clusters  156          25             16            197
ExonS    Events    419          141            166           726
ExonS    Genes     340          91             83            514
AltD     Clusters  296          70             16            382
AltD     Events    716          332            135           1183
AltD     Genes     636          248            84            968
AltA     Clusters  702          238            101           1041
AltA     Events    1792         1093           738           3623
AltA     Genes     1526         864            575           2965
Total    Clusters  1882         540            190           2612
Total    Events    5824         3149           1966          10,939
Total    Genes     4396         2061           1159          7616
Fig. 2 Conserved alternative splicing events in four Leguminosae (Venn diagram for Cicer arietinum, Lotus japonicus, Medicago truncatula, and Glycine max)
events (601) was identified among C. arietinum, L. japonicus, and G. max (Suppl. Table S6). Among all four species, 1966 conserved AS events were identified from 1159 genes (Suppl. Table S6). Among the four AS types, IntronR is the most common conserved AS event type, accounting for 49.4% of all events, followed by AltA (33.1%), AltD (10.8%), and ExonS (6.6%).
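The shares of conserved events by AS type follow directly from the event counts in Table 3, as the sketch below illustrates. As with Table 2, the mapping of rows to AS types is inferred from the percentages quoted in the text and is an assumption; the names used in the code are illustrative.

```python
from typing import Dict

# Conserved AS events per type, split by the number of species sharing
# the event (Table 3).
CONSERVED_EVENTS: Dict[str, Dict[str, int]] = {
    "IntronR": {"two": 2897, "three": 1583, "four": 927},
    "ExonS":   {"two": 419,  "three": 141,  "four": 166},
    "AltD":    {"two": 716,  "three": 332,  "four": 135},
    "AltA":    {"two": 1792, "three": 1093, "four": 738},
}


def type_shares(table: Dict[str, Dict[str, int]]) -> Dict[str, float]:
    """Percentage of all conserved events contributed by each AS type."""
    totals = {as_type: sum(counts.values()) for as_type, counts in table.items()}
    grand_total = sum(totals.values())
    return {as_type: round(100 * total / grand_total, 1) for as_type, total in totals.items()}


def events_by_species_count(table: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Column sums: events shared by exactly two, three, or four species."""
    return {k: sum(counts[k] for counts in table.values()) for k in ("two", "three", "four")}


print(type_shares(CONSERVED_EVENTS))
print(events_by_species_count(CONSERVED_EVENTS))
```

The column sums also provide a quick consistency check on the table, since they should match the totals of 5824, 3149, and 1966 events reported for two, three, and four species.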
Functional enrichment of conserved AS genes
Functional annotation of the conserved AS transcripts yields a mechanistic overview of the effects that AS exerts on a particular domain and on domain-mediated regulation of AS (Walters et al. 2013). A total of 1159 conserved AS genes among the four species identified in the present study were functionally annotated for putative protein domains and Gene Ontologies (GOs). Among the four species, 202 protein domains with conserved genes were identified, including the protein kinase domain, protein tyrosine kinase, PAN-like domain, S-locus glycoprotein family, and D-mannose binding lectin (Suppl. Table S7). Our analysis demonstrated that AS genes in legume plants encode diverse protein families that play important roles in various biological processes. Self-incompatibility (SI) is one of the mechanisms evolved by higher plants to promote outbreeding. The cell wall-localized S-locus glycoprotein (SLG) family is thought to recognize a pollen factor that leads to the rejection of self-pollen (Cui et al. 2005; Watanabe et al. 2012). In this study, a total of 54 AS SLG-related genes were observed in the four legume species, including 22 in M. truncatula, 19 in G. max, seven in C. arietinum, and six in L. japonicus (Suppl. Table S7). To date, there have been no reports on the AS mechanism of the SLG family.
The GO analysis revealed broad coverage of the major biological process and molecular function categories. In this study, GO classification revealed the functional information of the genes presenting conserved AS events among the four legume species (Suppl. Table S8). In total, 38 GO terms were significantly overrepresented (P < 0.01, Table 4). Of them, 20, two, and 16 terms belonged to the categories of biological process, cellular component, and molecular function, respectively (Table 4; Fig. 3). In the category of molecular function, 13 of the 16 enriched terms were annotated as playing a critical role in the adaptation of cellular response to environmental stimulus (Suppl. Table S9). In the GO term nucleotide binding (GO:0000166), the gene AT2G43130 encodes a small GTP-binding protein (ARA-4) (Suppl. Table S10). This protein has been shown to be predominantly localized in Golgi-derived vesicles, Golgi cisternae, and the trans-Golgi network in Arabidopsis and can be induced by heat shock (Ueda et al. 1996). In the GO term phosphatidic acid binding (GO:0070300), the gene AT4G21534 encodes sphingosine kinase (SPHK2). Six SphK genes were identified in the Arabidopsis genome (Worrall et al. 2008; Guo et al. 2011), and SPHK1, SPHK2/phyto-S1P, and PLDα1A are co-dependent in amplification of the response to ABA, mediating stomatal closure in Arabidopsis (Coursol et al. 2005; Worrall et al. 2008; Michaelson et al. 2009; Guo et al. 2011). Gene AT2G44640 encodes the TriGalactosylDiacylglycerol protein TGD4. Four genes, TGD1, 2, 3, and 4, identified in a genetic mutant screen, encode proteins that are involved in ER-to-chloroplast lipid transfer in Arabidopsis (Xu et al. 2003; Awai et al. 2006; Lu et al. 2007). The TGD1, -2, and -3 proteins form a putative ATP-binding cassette (ABC) transporter that transports ER-derived lipids through the inner envelope membrane of the chloroplast, while TGD4 binds phosphatidic acid (PtdOH) and resides in the outer chloroplast envelope (Wang et al. 2012). The gene AT1G10940 encodes the serine/threonine-protein kinase SRK2A, which is associated with abscisic acid, salt, and osmotic stress (Suppl. Table S10). In the GO term protein serine/threonine kinase activity (GO:0004674), gene AT1G27190 encodes the leucine-rich repeat receptor kinase BIR3, which negatively regulates BAK1 receptor complexes: BIR3 interacts with BAK1 and inhibits ligand-binding receptors to prevent BAK1 receptor complex formation (Imkampe et al. 2017). Gene AT4G20940 encodes a plasma membrane receptor kinase (GHR1). GHR1 is reported to be a fundamental component of the ABA and H2O2 signaling pathways. Because the ABA signaling pathway greatly affects the plant response to drought, genetic modification of GHR1 and related proteins might be used to increase drought tolerance (Hua et al. 2012).
Nitrogen-fixing-related gene AS and evolutionary analysis
Biological nitrogen fixation, the conversion of atmospheric N2 to NH3, plays an important role in the global nitrogen cycle and in agriculture worldwide (Falkowski 1997). Legumes (Fabaceae or Leguminosae) are unique among cultivated plants for their ability to carry out endosymbiotic nitrogen fixation with rhizobial bacteria (Wang et al. 2013). Most biological nitrogen fixation is catalyzed by molybdenum-dependent nitrogenase, which is distributed within bacteria and archaea. This enzyme is composed of two component proteins, the MoFe protein and the Fe protein. Molybdenum-dependent nitrogenase is an O2-labile metalloenzyme composed of the NifDK and NifH proteins, and its biosynthesis requires a number of nif gene products (Rubio and Ludden 2008). Previous biochemical and genetic studies have revealed that approximately 20 nif genes in a 24-kb region in Klebsiella pneumoniae contribute to the synthesis and maturation of nitrogenase (Hu and Ribbe 2011). In this study, we identified a total of 237 nitrogen-fixing-related genes from nine gene families in the four legume species from NCBI, including nitrogenase-related genes (NifL, NifS, NifU, and NifV), NODULIN 21-like, early nodulin-like, the mitogen-activated protein kinase family (MAPK, MAPKK, and MAPKKK, represented by MAPA), nitrogen regulation (NR), and glutamine synthetase (GS) (Suppl. Table S11, Table S4). Eighty of these nitrogen-fixing genes were identified as AS genes (Suppl. Table S11). At the species level, G. max had the highest number of nitrogen-fixing-related genes (69), and 30 of them were AS genes. The percentage of AS genes among the nitrogen-fixing-related genes was the highest in G. max (43.48%) and the lowest in C. arietinum (27.78%) (Suppl. Table S11). Among the 60 nitrogen-fixing-related genes in M. truncatula, 18 were AS genes. Although
Table 4 Gene ontology (GO) enrichment analysis of evolutionarily conserved AS genes among four legume species
GO terms Function Conserved AS P value
Biological process
GO:0051716 Cellular response to stimulus 31 0
GO:0009875 Pollen-pistil interaction 18 0
GO:0048544 Recognition of pollen 17 1.26432E−12
GO:0006468 Protein phosphorylation 43 1.27286E−11
GO:0006355 Regulation of transcription, DNA-templated 33 2.00114E−10
GO:0016044 Membrane organization 14 3.04323E−08
GO:0002376 Immune system process 16 5.82958E−08
GO:0043631 RNA polyadenylation 4 6.53879E−08
GO:0071366 Cellular response to indolebutyric acid stimulus 6 1.4626E−07
GO:0015692 Lead ion transport 6 4.6021E−07
GO:0042407 Cristae formation 3 5.42869E−07
GO:0045595 Regulation of cell differentiation 7 8.77532E−07
GO:0033500 Carbohydrate homeostasis 5 1.06874E−06
GO:0007231 Osmosensory signaling pathway 5 1.06874E−06
GO:0043067 Regulation of programmed cell death 14 1.70116E−06
GO:0009630 Gravitropism 14 2.66945E−06
GO:0010033 Response to organic substance 20 2.679E−06
GO:0080022 Primary root development 9 2.93328E−06
GO:0043407 Negative regulation of MAP kinase activity 6 3.01921E−06
GO:0006423 Cysteinyl-tRNA aminoacylation 3 5.36266E−06
Cellular component
GO:0016607 Nuclear speck 6 2.19409E−08
GO:0000151 Ubiquitin ligase complex 8 6.90138E−07
Molecular function
GO:0003700 Transcription factor activity, sequence-specific DNA binding 29 8.99536E−12
GO:0005524 ATP binding 71 7.36843E−11
GO:0004965 G-protein coupled GABA receptor activity 5 6.81111E−08
GO:0000166 Nucleotide binding 33 3.37187E−07
GO:0043565 Sequence-specific DNA binding 15 4.32622E−07
GO:0005217 Intracellular ligand-gated ion channel activity 5 4.3695E−07
GO:0042299 Lupeol synthase activity 6 4.49489E−07
GO:0004970 Ionotropic glutamate receptor activity 5 5.88629E−07
GO:0016174 NAD(P)H oxidase activity 5 7.79051E−07
GO:0005515 Protein binding 79 2.45294E−06
GO:0070300 Phosphatidic acid binding 6 3.01783E−06
GO:0004674 Protein serine/threonine kinase activity 18 4.78645E−06
GO:0004817 Cysteine-tRNA ligase activity 3 7.00689E−06
GO:0015079 Potassium ion transmembrane transporter activity 4 7.97547E−06
GO:0015416 Organic phosphonate transmembrane-transporting ATPase activity 6 9.9574E−06
GO:0008569 Minus-end-directed microtubule motor activity 4 1.05586E−05
only two genes of the NifV family were found in M. truncatula, both were AS, which led to a decreased protein diversity caused by a small number of genes. There was no AS gene in either the NifS or the NODULIN 21-like family in L. japonicus. At the nitrogen-fixing-related gene family level, there was no AS gene in the NODULIN 21-like family in any species. The NifL family had the highest proportion of AS genes, accounting for 50.00%. The largest number of AS genes was observed in the nitrogen regulation gene family.
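The AS proportions for the nitrogen-fixing-related genes are simple ratios of the counts given in the text, as the following sketch shows. Only the counts stated in this section are used (30 of 69 genes in G. max, 18 of 60 in M. truncatula, and 80 of 237 overall); the function name is an illustrative assumption, and the per-species counts for C. arietinum and L. japonicus are not given here and are therefore left out.

```python
def as_percentage(n_as_genes: int, n_genes: int) -> float:
    """Percentage of genes with AS, rounded to two decimal places."""
    return round(100 * n_as_genes / n_genes, 2)


# Counts stated in the text.
print(as_percentage(30, 69))   # G. max
print(as_percentage(18, 60))   # M. truncatula
print(as_percentage(80, 237))  # all four species combined
```

The G. max value reproduces the 43.48% quoted in the text, and the combined value gives the overall share of nitrogen-fixing-related genes that undergo AS.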
Phylogenetic trees for the nine nitrogen-fixing-related gene families were constructed separately, and the AS gene of one species was typically located adjacent to the AS genes of the remaining species in the individual trees (Suppl. Fig. S1A-I). We further divided the larger gene families into different phylogenetic groups. For example, the NifU gene family was divided into four groups. The numbers of AS genes in groups III and IV were relatively large, while the number of AS genes in groups I and II was small, with one gene per species in each group (Suppl. Fig. S1C). Nitrogen regulation-related genes were clustered into seven groups, with most of them (III to VII) having AS genes (Suppl. Fig. S1E). Early nodulin-like genes formed six groups, and each of them carried AS genes. The glutamine synthetase genes formed four groups with AS genes among the four species (Suppl. Fig. S1I). Although most of the groups of specific nitrogen-fixing genes included AS genes, the distribution of AS genes was not balanced across the genes and
This is the first report of AS events associated with nitrogen-fixing-related genes. The results contribute to a better understanding of the complexity of biological nitrogen fixation processes, paving the way for the full use of legume nitrogen fixation capacity in agricultural production.
The present study investigated the genome-wide conserved AS events in four of the most important leguminous species using the publicly available mRNA, EST, and RNA-Seq data. Our findings provide a basis for understanding the AS events that have occurred among different species, particularly across legumes. This resource on conserved AS identifies an additional layer between genotype and phenotype that may impact future efforts to improve legumes.
Author contribution statement ZW designed the study. ZW, HZ, and WLG collected and analyzed the data. ZW drafted the manuscript. All authors reviewed the manuscript.
Fig. 3 GO annotation classification for conserved genes (panels: molecular function, cellular component, and biological process; axes: percentage of genes and number of genes)
Compliance with ethical standards
Conflict of interest The authors declare that they have no competing interests.
Availability of data and materials All the sequence data used in the
