
Will there ever be a tree of life that systematists can agree on?

Authors: LU LiMin, CHEN ZhiDuan, LU AnMin

Abstract

Since Charles Darwin raised the concept of a "Tree of Life", research in this field has not only contributed to our understanding of phylogenetic relationships among taxa but has also significantly accelerated the development of related subjects in biological science. The evolutionary biologist Dobzhansky once remarked that "nothing in biology makes sense except in the light of evolution", a view largely echoed by later biologists. Indeed, reconstructing an accurate phylogeny of the living world is very important for biological classification and nomenclature, and is also crucial for elucidating the origin and diversification of life. Tree of Life reconstruction has passed through three major phases over the past century. Prior to the 1990s, taxonomists published classification systems that depended largely on morphological characters. DNA sequencing technology, facilitated by the development of polymerase chain reaction (PCR) techniques, then allowed systematists to reconstruct phylogenetic relationships using molecular data. More recently, the rapid development of next-generation sequencing tools has brought the Tree of Life into a phylogenomic era by enabling the construction of phylogenies from hundreds or thousands of loci from organellar and nuclear genomes. However, as the number of loci used for phylogenetic analyses has grown, significant conflicts have been detected in phylogenies of various organisms. Given the level of conflict in some data sets, some researchers have begun to doubt the accuracy and congruence of the Tree of Life and its applications in related biological fields. So, will there ever be a Tree of Life that systematists can agree on?

In this paper, we highlight three reasons why researchers cannot retrieve a totally congruent tree that reflects the real evolutionary history of life, despite significant improvements in morphological, molecular, and statistical methods. The situation is analogous to our inability to restore a collapsed building even when all the bricks and other building materials remain. (i) Sampling limitations: we cannot sample all the species in the world because a large percentage of species have become extinct throughout Earth's history and many species are currently facing extinction or have not yet been recognized by scientists (especially life in the oceans); (ii) Biological processes: hybridization/introgression, incomplete lineage sorting, gene duplication and/or loss, horizontal gene transfer, and other biological events that have occurred during evolutionary history have frequently resulted in gene tree heterogeneity; and (iii) Systematic biases and models for tree reconstruction: phylogenetic noise in the data, such as evolutionary saturation and compositional bias, can lead to incorrect phylogenies, and no algorithm for reconstructing phylogenetic trees can fully simulate the real processes of organic evolution. Furthermore, the biological factors contributing to discordance become even more complicated when phylogenies are reconstructed from phylogenomic data sets. Generally, incomplete lineage sorting and hybridization/introgression occur among closely related species, whereas phylogenetic discrepancies at the level of family, order, or above usually reflect the combined effects of gene duplication and/or loss, recombination, and genome duplication. Therefore, it is always important to understand the mechanisms causing incongruence and to explore approaches that better model the processes generating discordance. In recent decades, new models and methods in phylogenomic studies have been developed and have shed light on the species trees of some candidate groups. Thus, we still look to a bright future for the Tree of Life and its applications in related biological sciences, even though we cannot achieve a completely congruent tree of the living world.
Chinese Science Bulletin, 2016, Vol. 61, No. 9: 958–963
Citation: Lu L M, Chen Z D, Lu A M. Will there ever be a tree of life that systematists can agree on? (in Chinese). Chin Sci Bull, 2016, 61: 958–963, doi: 10.1360/N972015-01355
© 2016 Science China Press www.scichina.com csb.scichina.com
Interpretive series on the 125 frontier scientific questions posed by Science
Will there ever be a tree of life that systematists can agree on?
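LU LiMin, CHEN ZhiDuan*, LU AnMin
State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, Beijing 100093, China
* Corresponding author, E-mail: zhiduan@ibcas.ac.cn
Received 2015-12-07; revised 2016-01-11; accepted 2016-01-13; published online 2016-02-22
Supported by the General Program of the National Natural Science Foundation of China (31270268), the Young Scientists Fund of the National Natural Science Foundation of China (31500179), the China–US Dimensions of Biodiversity international cooperation project (31461123001), and the Major Program of the National Natural Science Foundation of China (31590822)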
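Abstract  Since Darwin proposed the concept of the Tree of Life, research in this field has not only helped us understand the origin of organisms and the relationships among taxa, but has also greatly advanced related disciplines in the life sciences. However, as more and more conflicts have been found in Tree of Life reconstruction, its reliability and its application in other disciplines have been called into question. This paper briefly introduces the concept of the Tree of Life and its history, and reviews why a fully congruent Tree of Life cannot be obtained: (i) insufficient sampling caused by species extinction and the limits of human knowledge; (ii) events such as hybridization/introgression, incomplete lineage sorting, gene duplication and loss, and horizontal gene transfer during evolution; and (iii) the inability of tree-building methods to simulate the evolutionary process faithfully. Finally, we look ahead to the broad prospects for developing and applying the Tree of Life, and point out that although a unique Tree of Life is very hard to obtain in practice, this does not diminish its vitality or its integration with other disciplines.
Keywords  tree of life, gene tree, species tree, phylogenetic incongruence, phylogenomics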
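Tree of Life research is currently a very active field in the life sciences. As a bridge between disciplines, it has promoted integration and innovation across related fields. As the well-known American evolutionary biologist David Hillis pointed out: "You pick up any biological journal, it doesn't matter what field it is, and it will have phylogenetic data."[1] Dobzhansky[2] also remarked: "Nothing in biology makes sense except in the light of evolution." More and more scholars now also accept that "Evolutionary biology makes much more sense in the light of phylogeny, the tree of life." Therefore, understanding the concept of the Tree of Life and building reliable phylogenetic relationships is not only the basis of biological classification and nomenclature but also a prerequisite for clarifying the origin and dispersal of taxa and for integrative, multidisciplinary research[3]. However, as more and more data are used for Tree of Life reconstruction, conflicts among topologies have become increasingly common, and doubts have inevitably arisen about the congruence of the Tree of Life and its application in other disciplines. So, can systematists ultimately obtain a fully congruent Tree of Life? Do conflicts among topologies affect the application of the Tree of Life in other disciplines?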
1 What is the Tree of Life?
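The Tree of Life (TOL) was proposed by Darwin in his enduring work On the Origin of Species, and is a concept fully consistent with his theory of evolution[4]. Darwin held that everything living on Earth, including all forms of life (flowers, grasses, insects, fish, and humans themselves), was not created by God but evolved step by step, from simple to complex and from lower to higher, over a long geological history[4]. Every species therefore has ancestors, and all species can be traced back to a common ancestor; that is, all organisms on Earth share common descent. The great apes share the most recent common ancestor with humans; the body plan of fish foreshadows that of humans; some slowly evolving genes and gene families of bacteria remain very similar to those of humans; bacteria use the same genetic code as humans; and prototypes of the molecular regulatory networks formed by human genes can often be found in bacterial cells.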
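With Darwin's theory of evolution in mind, the concept of the Tree of Life is not hard to grasp: the living world is like a towering tree with roots, a trunk, branches, and leaves. Every species, whether it arose early or late, has its place on this tree and can be traced to its ancestors. Some old roots and branches have died, new roots have grown from old roots, and new buds have sprouted from withered branches; the living world and the Earth's environment interact, ever green and ever new, and keep evolving. The Tree of Life is most commonly depicted as a bifurcating tree diagram that represents the relationships among groups of organisms (species, genera, families, orders, classes, phyla, kingdoms). Branching relationships are determined by comparing homologous characters: the more homologous characters two groups share, the more closely related they are, and vice versa.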
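As a small illustration of the bifurcating tree diagram described above, the following sketch stores a rooted tree as nested Python tuples and lists the clades (the sets of tips descending from each internal node) that it implies. The tuple encoding, the function name `clades`, and the four-taxon example are assumptions made purely for illustration; they are not taken from the paper, and a real analysis would normally rely on a dedicated phylogenetics library.

```python
def clades(tree):
    """Return (tips, clade_list) for a rooted tree given as nested tuples.

    A tip is a string; an internal node is a tuple of child subtrees.
    """
    if isinstance(tree, str):
        return frozenset([tree]), []
    tips = frozenset()
    found = []
    for child in tree:
        child_tips, child_clades = clades(child)
        tips |= child_tips
        found.extend(child_clades)
    found.append(tips)  # the clade defined by this internal node
    return tips, found


# Hypothetical example tree: (((human, chimpanzee), gorilla), orangutan)
species_tree = ((("human", "chimpanzee"), "gorilla"), "orangutan")
_, all_clades = clades(species_tree)
for clade in all_clades:
    print(sorted(clade))
```

Two trees for the same set of tips agree exactly when they imply the same set of clades, which is one simple way to make "congruence" between trees concrete.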
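For more than 100 years after Darwin, biologists built the Tree of Life mainly from morphological characters, but such characters are usually limited in number, and their homology is hard to assess among distantly related groups. In the 1980s and 1990s, the application of PCR to sequencing greatly raised sequencing efficiency, and reconstructing the Tree of Life from molecular characters such as DNA and protein sequences became mainstream[5]. As sequencing technology matured and its cost fell sharply, the DNA and protein sequences stored in public databases grew exponentially, laying the foundation for a global Tree of Life. For example, trees of life for major groups such as angiosperms and birds have been completed within the last 10 years[6,7]; Hinchliff et al.[8] synthesized a "global Tree of Life" of 2.3 million species from published morphological or molecular data. In recent years, next-generation sequencing has further raised sequencing efficiency, and more and more transcriptome and even whole-genome data are being used in Tree of Life reconstruction[6,9–11]. Against the background of the genomic era, phylogenomics emerged as a new interdisciplinary field integrating two important disciplines of the life sciences, phylogenetics and genomics. At the same time, it has become clear that gene trees obtained from one or a few molecular fragments sometimes do not truly reflect the evolutionary history of species[12–14]. Consequently, investigating the causes and mechanisms of conflict between gene trees and species trees, resolving relationships among groups in conflict, and developing models and software for species tree inference have become a focus of Tree of Life research[15–21].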
2 Can systematists obtain a fully congruent Tree of Life?
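With the massive growth of molecular data, increasingly complete taxon sampling, and ever richer and more effective tree-building tools, we cannot help asking: can biologists ultimately obtain a congruent, that is unique, Tree of Life? The answer is no, for the following reasons: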
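(i) Sampling limitations. About 1.7 million species on Earth have been described and named to date[22]. Because of the limits of human knowledge and of environmental conditions, accurately estimating the number of species on Earth remains a challenge. Mora et al.[23] predicted a total of 8.7 million species on Earth, of which about 86% of terrestrial organisms and 91% of marine organisms have yet to be named. In addition, vast numbers of species went extinct over the long course of geological history; the diversity of extant species is estimated at less than 10% of all the biodiversity that has ever appeared in geological history. Over the past century, destruction of the natural environment and ecosystem imbalance caused by human activities have driven many species extinct before they could be discovered and described[24]. Every species that ever existed had its own unique genome, and every organism is a link in the chain of evolution; the extinction of a species means that its genome has vanished entirely from the Earth and that this evolutionary link is missing forever. Therefore, a Tree of Life reconstructed from morphological and molecular data cannot include all the species that have ever appeared and lived on Earth. Although in practice our efforts will give us an ever better understanding of the course of evolution, no tree built by humans can be the true Tree of Life that existed on Earth. Reconstructing the Tree of Life is like rebuilding a collapsed skyscraper: even if we can collect the bricks, stones, tiles, and other materials left by the collapse, and even find the original blueprints, we can never restore the skyscraper that once stood.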
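(ii) Biological factors. Different genomes and genes usually have their own independent evolutionary histories. Biological factors such as hybridization/introgression, incomplete lineage sorting, gene duplication and/or gene loss, and horizontal gene transfer can all cause discordance between gene trees and species trees[13]. Plants have three genomes (chloroplast, mitochondrial, and nuclear) and a relatively high incidence of hybridization and introgression; these complex evolutionary histories are hidden in lineages, so that trees of life for the same species reconstructed from different genes or genomes disagree[25]. Incomplete lineage sorting means that the gene lineages within a population (or species) fail to form a monophyletic group, with some gene lineages instead coalescing first with lineages from other populations[26]. Incomplete lineage sorting often accompanies rapid radiation: in general, the shorter the interval between speciation events and the larger the population, the more likely incomplete lineage sorting is. For example, it is widely accepted that chimpanzees (Pan troglodytes) share the most recent common ancestor with humans. However, by comparing the genomes of humans and extant great apes, Scally et al.[27] found three kinds of signal within the genome: most genes support chimpanzees as the closest relatives of humans, 15% of genes support chimpanzees as closest to gorillas (Gorilla), and another 15% support gorillas as closer to humans. Rogers and Gibbs[28] pointed out that the heterogeneity of gene trees among the three species is most likely caused by incomplete lineage sorting and gene flow. Gene duplication and/or gene loss is relatively easy to understand: if a gene was duplicated in an ancestor to form two copies, and the two ancestral copies were then differentially lost in descendants, the gene tree built from that gene will not match the species tree. Horizontal gene transfer refers to the transfer of genetic material between different species or different genomes. Early studies held that horizontal gene transfer occurs mostly in prokaryotes, but over the past 10 years or so it has also been reported in eukaryotes, including plants[29]. For example, Davis and Wurdack[30] found that a phylogeny reconstructed from the mitochondrial gene nad1B-C supports a close relationship between Rafflesiaceae and their hosts, Tetrastigma Planch. (Vitaceae), whereas phylogenies based on the nuclear loci 18S rDNA and PHYC both place Rafflesiaceae in Malpighiales. They inferred that this conflict may arise because Rafflesia, which lacks roots, stems, and leaves, acquired some mitochondrial genes from its host by horizontal gene transfer to sustain its vegetative growth. In reality, conflicts between gene trees and species trees often result from the combined action of the factors above. Given the complexity of evolution, some scholars hold that the intricate relationships of the living world are hard to represent with a simple tree-like structure, and have proposed the concept of a "forest of life"[31].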
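The link between speciation interval, population size, and incomplete lineage sorting can be made concrete with a standard three-species result from coalescent theory. This result is not part of the paper and is used here only as an assumed illustration: if the internal branch of the species tree lasts t generations in a diploid population of effective size Ne, two gene lineages fail to coalesce on that branch with probability exp(-t / (2 Ne)), and in that case two of the three equally likely gene tree topologies disagree with the species tree. The sketch below, with hypothetical function names and hypothetical parameter values, computes this probability and inverts it for the roughly 30% of discordant genes (15% + 15%) cited above. It ignores gene flow and gene tree estimation error, both of which may also contribute.

```python
import math


def discordance_probability(t_generations, effective_size):
    """Probability that a gene tree for three species disagrees with the
    species tree under the neutral coalescent with no gene flow."""
    branch_length = t_generations / (2.0 * effective_size)  # coalescent units
    return (2.0 / 3.0) * math.exp(-branch_length)


def branch_length_from_discordance(p_discordant):
    """Internal branch length (coalescent units) implied by a discordance rate."""
    if not 0.0 < p_discordant <= 2.0 / 3.0:
        raise ValueError("discordance rate must lie in (0, 2/3]")
    return -math.log(1.5 * p_discordant)


# Shorter speciation intervals or larger populations give more discordance.
for t, ne in [(500_000, 50_000), (100_000, 50_000), (100_000, 200_000)]:
    print(t, ne, round(discordance_probability(t, ne), 3))

# About 30% discordant genes implies a branch of about 0.8 coalescent units.
print(round(branch_length_from_discordance(0.30), 2))
```

Under these assumptions the discordance rate can never exceed two thirds, and it falls off exponentially as the internal branch lengthens, which is why rapid radiations are where incomplete lineage sorting matters most.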
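(iii) Systematic error and tree-building models. In Tree of Life reconstruction, how well a tree-building method models the true evolutionary process of the groups concerned strongly affects the accuracy of the resulting tree. Over the past 30 years, limited by data and computing power, researchers have mainly used concatenated analyses to reconstruct the tree of life for organisms on Earth[32]. Concatenated analyses mainly involve tree-building methods such as distance-based methods, maximum parsimony, maximum likelihood, and Bayesian inference. Each algorithm rests on particular evolutionary assumptions. For example, maximum parsimony takes the tree requiring the fewest evolutionary steps as optimal; maximum likelihood selects the tree with the highest likelihood under a specified nucleotide substitution model; Bayesian inference can likewise use a specified substitution model and takes the tree sampled most frequently in the Markov chain Monte Carlo (MCMC) run as optimal[33]. Because their assumptions differ, different researchers applying different algorithms to the same matrix may obtain different topologies. For example, maximum parsimony cannot correct for parallel changes on long branches, so long-branch attraction often appears in its topologies, yielding trees that differ from those of maximum likelihood and Bayesian inference[34]. Even with model-based methods, if the model used fails to capture evolutionary rate heterogeneity, or if the data are saturated with substitutions, a strongly supported but incorrect phylogeny may result[35,36]. Since phylogenomics was proposed, coalescent-based species tree methods have become popular[16,37,38]. These methods account for heterogeneity among gene trees, but their assumption that there is no recombination within loci has provoked much controversy[12,39,40]. Although the data used for Tree of Life reconstruction keep growing and the models keep improving, no model, however sophisticated and reasonable, can reproduce the true evolutionary history of organisms in nature.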
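The parsimony criterion described above (prefer the tree that needs the fewest changes) can be sketched with Fitch's small-parsimony algorithm for a single character on a rooted binary tree. The nested-tuple tree encoding, the function name `fitch_steps`, and the toy character states below are illustrative assumptions, not data or code from the paper.

```python
def fitch_steps(tree, states):
    """Return (state_set, steps) for one character on a rooted binary tree.

    tree   : a tip name (str) or a 2-tuple of subtrees
    states : dict mapping tip name -> observed state, e.g. 'A'
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch_steps(left, states)
    right_set, right_steps = fitch_steps(right, states)
    shared = left_set & right_set
    if shared:
        # The children can agree on a state: no extra change is needed here.
        return shared, left_steps + right_steps
    # The children disagree: keep the union and count one change.
    return left_set | right_set, left_steps + right_steps + 1


# Hypothetical site pattern: taxa a and b share 'A', taxa c and d share 'G'.
site = {"a": "A", "b": "A", "c": "G", "d": "G"}
candidate_trees = {
    "((a,b),(c,d))": (("a", "b"), ("c", "d")),
    "((a,c),(b,d))": (("a", "c"), ("b", "d")),
    "((a,d),(b,c))": (("a", "d"), ("b", "c")),
}
for label, tree in candidate_trees.items():
    print(label, fitch_steps(tree, site)[1])
```

For this site the first topology needs one change and the other two need two, so parsimony prefers grouping a with b. The same scoring also shows where the method can go wrong: if two unrelated taxa on long branches acquire the same state by parallel change, counting steps alone favours grouping them, which is the long-branch attraction problem mentioned above.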
3 Outlook
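In summary, mature sequencing technology and the capacity of computers to handle big data have brought unprecedented opportunities for Tree of Life reconstruction and have raised our understanding of evolutionary relationships in nature, and of the molecular basis of heritable variation, to an unprecedented level. However, because species extinction leaves evolutionary links missing, because different genes and genomes have different evolutionary histories, and because tree-building methods rest on different evolutionary assumptions, different researchers reconstructing relationships for the same taxon (even with the same data set) cannot obtain an identical, that is unique, Tree of Life. In phylogenomics, the causes of phylogenetic conflict are even more complex: incomplete lineage sorting and hybridization/introgression usually occur among closely related species, whereas conflicts at the level of family, order, or above often result from the combined action of evolutionary events such as gene or genome duplication, gene loss, and recombination. As molecular data accumulate and more mechanisms of conflict are discovered, new tree-building models and methods are developing rapidly, offering hope of building a species tree close to the true one. Although a unique Tree of Life is not yet attainable, this does not diminish the vitality of the Tree of Life or its integration with other disciplines, because different disciplines need different things from it: some aim only at a broad evolutionary pattern, for which a certain amount of missing or biased data will not alter the overall trend, whereas others study the evolutionary histories of different genes to understand the evolutionary patterns and functions of gene families. In the coming decades, the Tree of Life will not only further promote the vigorous development of systematic and evolutionary biology and raise human understanding of nature and awareness of biodiversity conservation, but will also permeate other fields of biology more widely, promoting the development of the biological, medical and health, and tourism industries, and thereby improving people's daily lives[41].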
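Acknowledgments  We thank Dr. Russell L. Barrett of the Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia, and Opal R. Leonard of Middle Tennessee State University, USA, for help in revising the English abstract.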
References
1 Pennisi E. Modernizing the tree of life. Science, 2003, 300: 1692–1697
2 Dobzhansky T. Nothing in biology makes sense except in the light of evolution. Am Biol Teacher, 1973, 35: 125–129
3 Soltis D E, Soltis P S, Chase M W, et al. Angiosperm phylogeny inferred from 18S rDNA, rbcL, and atpB sequences. Bot J Linn Soc,
2000, 133: 381–461
4 Darwin C. On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life.
London: John Murray, 1859
5 Hillis D M, Huelsenbeck J P, Cunningham C W. Application and accuracy of molecular phylogenies. Science, 1994, 264: 671–677
6 Jetz W, Thomas G H, Joy J B, et al. The global diversity of birds in space and time. Nature, 2012, 491: 444–448
7 Zanne A E, Tank D C, Cornwell W K, et al. Three keys to the radiation of angiosperms into freezing environments. Nature, 2014, 506:
89–92
8 Hinchliff C E, Smith S A, Allman J F, et al. Synthesis of phylogeny and taxonomy into a comprehensive tree of life. Proc Natl Acad Sci
USA, 2015, 112: 12764–12769
9 Liu L, Xi Z, Wu S, et al. Estimating phylogenetic trees from genome-scale data. Ann N Y Acad Sci, 2015, 1360: 36–53
10 Wickett N J, Mirarab S, Nguyen N, et al. Phylotranscriptomic analysis of the origin and early diversification of land plants. Proc Natl
Acad Sci USA, 2014, 111: E4859–E4868
11 Wen J, Xiong Z, Nie Z L, et al. Transcriptome sequences resolve deep relationships of the grape family. PLoS One, 2013, 8: e74394
12 Springer M S, Gatesy J. The gene tree delusion. Mol Phylogenet Evol, 2016, 94: 1–33
13 Szöllősi G J, Tannier E, Daubin V, et al. The inference of gene trees with species trees. Syst Biol, 2015, 64: e42–e62
14 Zou X H, Ge S. Conflicting gene trees and phylogenomics (in Chinese). J Syst Evol, 2008, 46: 795–807 [邹新慧, 葛颂. 基因树冲突与
系统发育基因组学研究. 植物分类学报, 2008, 46: 795–807]
15 Jockusch E L, Martínez-Solano I, Timpe E K. The effects of inference method, population sampling, and gene sampling on species tree
inferences: An empirical study in slender salamanders (Plethodontidae: Batrachoseps). Syst Biol, 2015, 64: 66–83
16 Liu L, Xi Z, Davis C C. Coalescent methods are robust to the simultaneous effects of long branches and incomplete lineage sorting. Mol
Biol Evol, 2015, 32: 791–805
17 Liu L, Wu S Y, Yu L. Coalescent methods for estimating species trees from phylogenomic data. J Syst Evol, 2015, 53: 380–390
18 Capella-Gutierrez S, Kauff F, Gabaldón T. A phylogenomics approach for selecting robust sets of phylogenetic markers. Nucleic Acids
Res, 2014, 42: e54
19 Mirarab S, Reaz R, Bayzid M S, et al. ASTRAL: Genome-scale coalescent-based species tree estimation. Bioinformatics, 2014, 30:
i541–i548
20 Larget B R, Kotha S K, Dewey C N, et al. BUCKy: Gene tree/species tree reconciliation with Bayesian concordance analysis.
Bioinformatics, 2010, 26: 2910–2911
21 Liu L. BEST: Bayesian estimation of species trees under the coalescent model. Bioinformatics, 2008, 24: 2542–2543
22 The World Conservation Union. IUCN Red List of Threatened Species, version 2014.3. Summary of statistics for globally threatened species, Table 1: Numbers of threatened species by major groups of organisms (1996–2014). 2014
23 Mora C, Tittensor D P, Adl S, et al. How many species are there on earth and in the ocean? PLoS Biol, 2011, 9: e1001127
24 Costello M J, May R M, Stork N E. Can we name earth’s species before they go extinct? Science, 2013, 339: 413–416
25 Sun M, Soltis D E, Soltis P S, et al. Deep phylogenetic incongruence in the angiosperm clade Rosidae. Mol Phylogenet Evol, 2015, 83:
156–166
26 Degnan J H, Rosenberg N A. Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol Evol, 2009, 24:
332–340
27 Scally A, Dutheil J Y, Hillier L W, et al. Insights into hominid evolution from the gorilla genome sequence. Nature, 2012, 483: 169–175
28 Rogers J, Gibbs R A. Comparative primate genomics: Emerging patterns of genome content and dynamics. Nat Rev Genet, 2014, 15: 347–359
29 Brown J R. Ancient horizontal gene transfer. Nat Rev Genet, 2003, 4: 121–132
30 Davis C C, Wurdack K J. Host-to-parasite gene transfer in flowering plants: Phylogenetic evidence from Malpighiales. Science, 2004,
305: 676–678
31 Koonin E V, Wolf Y I, Puigbò P. The phylogenetic forest and the quest for the elusive tree of life. Cold Spring Harb Symp Quant Biol,
2009, 74: 205–213
32 Pirie M D. Phylogenies from concatenated data: Is the end nigh? Taxon, 2015, 64: 421–423
33 Hall B G. Phylogenetic Trees Made Easy: A How-to Manual. 4th ed. Sunderland: Sinauer Associates, 2011
34 Bergsten J. A review of long-branch attraction. Cladistics, 2005, 21: 163–193
35 Cox C J, Li B, Foster P G, et al. Conflicting phylogenies for early land plants are caused by composition biases among synonymous
substitutions. Syst Biol, 2014, 63: 272–279
36 Liu Y, Cox C J, Wang W, et al. Mitochondrial phylogenomics of early land plants: Mitigating the effects of saturation, compositional
heterogeneity, and codon-usage bias. Syst Biol, 2014, 63: 862–878
37 Roch S, Warnow T. On the robustness to gene tree estimation error (or lack thereof) of coalescent-based species tree methods. Syst Biol,
2015, 64: 663–676
38 Mirarab S, Bayzid M S, Warnow T. Evaluating summary methods for multi-locus species tree estimation in the presence of incomplete
lineage sorting. Syst Biol, 2014. doi: 10.1093/sysbio/syu063
39 Springer M S, Gatesy J. Land plant origins and coalescence confusion. Trends Plant Sci, 2014, 19: 267–269
40 Gatesy J, Springer M S. Concatenation versus coalescence versus “concatalescence”. Proc Natl Acad Sci USA, 2013, 110: E1179
41 Lu L M, Sun M, Zhang J B, et al. Tree of life and its applications (in Chinese). Biodiv Sci, 2014, 22: 3–20 [鲁丽敏, 孙苗, 张景博, 等. 生命之树及其应用. 生物多样性, 2014, 22: 3–20]
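CHEN ZhiDuan
Professor and doctoral supervisor at the Institute of Botany, Chinese Academy of Sciences. Chen graduated from the Department of Biology, Shandong University, in 1985, received a PhD from the Institute of Botany, Chinese Academy of Sciences, in 1992, and has worked at the institute since then. Between 1995 and 2006, Chen carried out collaborative research at the University of Florida (USA), the University of Zurich (Switzerland), the Royal Botanic Gardens, Kew (UK), Harvard University (USA), and the Smithsonian Institution, Washington (USA). Recent work has centered on reconstruction of the plant tree of life and on biogeography, with projects funded by the National Major Scientific Research Program, the "863" Program of the Ministry of Science and Technology, the Major Program of the National Natural Science Foundation of China, the Knowledge Innovation Program of the Chinese Academy of Sciences, the Visiting Scholars Program for Developing Countries, and the Overseas Science and Education Base Construction Program. Current research focuses on: (1) using evidence from genes, genomes, morphology, and biogeography to investigate the phylogeny and evolution of major angiosperm groups; and (2) using evolutionary developmental biology to investigate the evolution of key innovations by studying the relationship between trait-related genes and plant phylogeny.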
Will there ever be a tree of life that systematists can agree on?
LU LiMin, CHEN ZhiDuan & LU AnMin
State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, Beijing 100093, China
Keywords: tree of life, gene tree, species tree, phylogenetic incongruence, phylogenomics
doi: 10.1360/N972015-01355
... It is often difficult in practice to determine whether systematic biases or biological processes have led to phylogenetic incompatibility within a specific group (e.g. Sanderson et al., 2000;Rokas et al., 2003;Philippe et al., 2005;Burleigh and Mathews, 2007;Lu et al., 2016a;Springer and Gatesy, 2016). ...
Article
Full-text available
Evolutionary rate heterogeneity and rapid radiations are common phenomena in organismal evolution and represent major challenges for reconstructing deep-level phylogenies. Here we detected substantial conflicts in and among data sets as well as uncertainty concerning relationships among lineages of Vitaceae from individual gene trees, supernetworks and tree certainty values. Congruent deep-level relationships of Vitaceae were retrieved by comprehensive comparisons of results from optimal partitioning analyses, multispecies coalescent approaches and the Bayesian concordance method. We found that partitioning schemes selected by PartitionFinder were preferred over those by gene or by codon position, and the unpartitioned model usually performed the worst. For a data set with conflicting signals, however, the unpartitioned model outperformed models that included more partitions, demonstrating some limitations to the effectiveness of concatenation for these data. For a transcriptome data set, fast coalescent methods (STAR and MP-EST) and a Bayesian concordance approach yielded congruent topologies with trees from the concatenated analyses and previous studies. Our results highlight that well-resolved gene trees are critical for the effectiveness of coalescent-based methods. Future efforts to improve the accuracy of phylogenomic analyses should emphasize the development of new methods that can accommodate multiple biological processes and tolerate missing data while remaining computationally tractable.
Article
Full-text available
Reconstructing the origin and evolution of land plants and their algal relatives is a fundamental problem in plant phylogenetics, and is essential for understanding how critical adaptations arose, including the embryo, vascular tissue, seeds, and flowers. Despite advances in molecular systematics, some hypotheses of relationships remain weakly resolved. Inferring deep phylogenies with bouts of rapid diversification can be problematic; however, genome-scale data should significantly increase the number of informative characters for analyses. Recent phylogenomic reconstructions focused on the major divergences of plants have resulted in promising but in- consistent results. One limitation is sparse taxon sampling, likely resulting from the difficulty and cost of data generation. To address this limitation, transcriptome data for 92 streptophyte taxa were generated and analyzed along with 11 published plant genome sequences. Phylogenetic reconstructions were conducted using up to 852 nuclear genes and 1,701,170 aligned sites. Sixty-nine analyses were performed to test the robustness of phylogenetic inferences to permutations of the data matrix or to phylogenetic method, including supermatrix, supertree, and coalescent-based approaches, maximum-likelihood and Bayesian methods, partitioned and unpartitioned analyses, and amino acid versus DNA alignments. Among other results, we find robust support for a sister-group relationship between land plants and one group of streptophyte green algae, the Zygnematophyceae. Strong and robust support for a clade comprising liverworts and mosses is inconsistent with a widely accepted view of early land plant evolution, and suggests that phylogenetic hypotheses used to understand the evolution of fundamental plant traits should be reevaluated.
Article
Full-text available
With more and more sequence data available, it has been a widespread practice to apply multiple genes to reconstruct phylogenies at different hierarchical levels. The phenomenon of conflicting gene trees has accordingly become a remarkable and difficult problem. It is increasingly understood that the difference between gene tree and species tree and the causes behind should be fully appreciated in molecular phylogenetic studies. In this paper, we have explored the major causes resulting in conflicting gene trees, including stochastic errors, systematic errors and biological factors. We also introduced a newly developed discipline, phylogenomics, and demonstrated its power and great potential in resolving difficult phylogenetic problems using our recent phyloge-nomic study of Oryza as an example. Furthermore, we discussed some strategies and approaches in elucidating conflicting gene trees and provided some suggestions and recommendations for molecular phylogenetic studies using multiple genes.
Article
Full-text available
Significance Scientists have used gene sequences and morphological data to construct tens of thousands of evolutionary trees that describe the evolutionary history of animals, plants, and microbes. This study is the first, to our knowledge, to apply an efficient and automated process for assembling published trees into a complete tree of life. This tree and the underlying data are available to browse and download from the Internet, facilitating subsequent analyses that require evolutionary trees. The tree can be easily updated with newly published data. Our analysis of coverage not only reveals gaps in sampling and naming biodiversity but also further demonstrates that most published phylogenies are not available in digital formats that can be summarized into a tree of life.
Article
Full-text available
Gene trees from independent molecular markers often differ. Simple data matrix concatenation cannot represent the various biologically meaningful processes that underlie these differences, and in an age of high-throughput DNA sequencing and coalescent-based species tree inference methods, the approach seems increasingly quaint. I argue that concatenation still has its place in our suite of approaches, but that care should be taken when deciding which data might be combined under what circumstances. I present recommendations for avoiding the worst pitfalls of the approach.
Article
Full-text available
The heterogeneity of signals in the genomes of diverse organisms poses challenges for traditional phylogenetic analysis. Phylogenetic methods known as "species tree" methods have been proposed to directly address one important source of gene tree heterogeneity, namely the incomplete lineage sorting that occurs when evolving lineages radiate rapidly, resulting in a diversity of gene trees from a single underlying species tree. Here we review theory and empirical examples that help clarify conflicts between species tree and concatenation methods, and misconceptions in the literature about the performance of species tree methods. Considering concatenation as a special case of the multispecies coalescent model helps explain differences in the behavior of the two methods on phylogenomic data sets. Recent work suggests that species tree methods are more robust than concatenation approaches to some of the classic challenges of phylogenetic analysis, including rapidly evolving sites in DNA sequences and long-branch attraction. We show that approaches, such as binning, designed to augment the signal in species tree analyses can distort the distribution of gene trees and are inconsistent. Computationally efficient species tree methods incorporating biological realism are a key to phylogenetic analysis of whole-genome data. © 2015 New York Academy of Sciences.
Article
Higher-level relationships among placental mammals are mostly resolved, but several polytomies remain contentious. Song et al. (2012) claimed to have resolved three of these using shortcut coalescence methods (MP-EST, STAR) and further concluded that these methods, which assume no within-locus recombination, are required to unravel deep-level phylogenetic problems that have stymied concatenation. Here, we reanalyze Song et al.'s (2012) data and leverage these re-analyses to explore key issues in systematics including the recombination ratchet, gene tree stoichiometry, the proportion of gene tree incongruence that results from deep coalescence versus other factors, and simulations that compare the performance of coalescence and concatenation methods in species tree estimation. Song et al. (2012) reported an average locus length of 3.1 kb for the 447 protein-coding genes in their phylogenomic dataset, but the true mean length of these loci (start codon to stop codon) is 139.6 kb. Empirical estimates of recombination breakpoints in primates, coupled with consideration of the recombination ratchet, suggest that individual coalescence genes (c-genes) approach ∼12 bp or less for Song et al.'s (2012) dataset, three to four orders of magnitude shorter than the c-genes reported by these authors. This result has general implications for the application of coalescence methods in species tree estimation. We contend that it is illogical to apply coalescence methods to complete protein-coding sequences. Such analyses amalgamate c-genes with different evolutionary histories (i.e., exons separated by >100,000 bp), distort true gene tree stoichiometry that is required for accurate species tree inference, and contradict the central rationale for applying coalescence methods to difficult phylogenetic problems. 
In addition, Song et al.'s (2012) dataset of 447 genes includes 21 loci with switched taxonomic names, eight duplicated loci, 26 loci with non-homologous sequences that are grossly misaligned, and numerous loci with >50% missing data for taxa that are misplaced in their gene trees. These problems were compounded by inadequate tree searches with nearest neighbor interchange branch swapping and inadvertent application of substitution models that did not account for among-site rate heterogeneity. Sixty-six gene trees imply unrealistic deep coalescences that exceed 100 million years (MY). Gene trees that were obtained with better justified models and search parameters show large increases in both likelihood scores and congruence. Coalescence analyses based on a curated set of 413 improved gene trees and a superior coalescence method (ASTRAL) support a Scandentia (treeshrews) + Glires (rabbits, rodents) clade, contradicting one of the three primary systematic conclusions of Song et al. (2012). Robust support for a Perissodactyla + Carnivora clade within Laurasiatheria is also lost, contradicting a second major conclusion of this study. Song et al.'s (2012) MP-EST species tree provided the basis for circular simulations that led these authors to conclude that the multispecies coalescent accounts for 77% of the gene tree conflicts in their dataset, but internal branches of their MP-EST tree are stunted by an order of magnitude due to wholesale gene tree reconstruction errors. An independent assessment of branch lengths suggests the multispecies coalescent accounts for ⩽ 15% of the conflicts among Song et al.'s (2012) 447 gene trees. Unfortunately, Song et al.'s (2012) flawed phylogenomic dataset has been used as a model for additional simulation work that suggests the superiority of shortcut coalescence methods relative to concatenation. 
Investigator error was passed on to the subsequent simulation studies, which also incorporated further logical errors that should be avoided in future simulation studies. Illegitimate branch length switches in the simulation routines unfairly protected coalescence methods from their Achilles' heel, high gene tree reconstruction error at short internodes. These simulations therefore provide no evidence that shortcut coalescence methods out-compete concatenation at deep timescales. In summary, the long c-genes that are required for accurate reconstruction of species trees using shortcut coalescence methods do not exist and are a delusion. Coalescence approaches based on SNPs that are widely spaced in the genome avoid problems with the recombination ratchet and merit further pursuit in both empirical systematic research and simulations. Copyright © 2015. Published by Elsevier Inc.
Article
The estimation of species trees using multiple loci has become increasingly common. Because different loci can have different phylogenetic histories (reflected in different gene tree topologies) for multiple biological reasons, new approaches to species tree estimation have been developed that take gene tree heterogeneity into account. Among these causes, incomplete lineage sorting (ILS), modeled by the multi-species coalescent, is potentially the most common cause of gene tree heterogeneity, and much of the focus of the recent literature has been on how to estimate species trees in the presence of ILS. Despite progress in developing statistically consistent techniques for estimating species trees when gene trees can differ due to ILS, there is substantial controversy in the systematics community as to whether to use the new coalescent-based methods or the traditional concatenation methods. One of the key issues that has been raised is understanding the impact of gene tree estimation error on coalescent-based methods that operate by combining gene trees. Here we explore the mathematical guarantees of coalescent-based methods when analyzing estimated rather than true gene trees. Our results provide some insight into the differences between the promise of coalescent-based methods in theory and their performance in practice.
Article
Genome-scale sequence data have become increasingly available in phylogenetic studies for understanding the evolutionary histories of species. However, it is challenging to develop probabilistic models that account for the heterogeneity of phylogenomic data. The multispecies coalescent model describes gene trees as independent random variables generated from a coalescence process occurring along the lineages of the species tree. Since the multispecies coalescent model allows gene trees to vary across genes, coalescent-based methods have been widely used to account for heterogeneous gene trees in phylogenomic data analysis. In this paper, we summarize and evaluate the performance of coalescent-based methods for estimating species trees from genome-scale sequence data. We investigate the effects of deep coalescence and mutation on the performance of species tree estimation methods. We found that the coalescent-based methods perform well in estimating species trees for a large number of genes, regardless of the degree of deep coalescence and mutation. The performance of the coalescent methods is negatively correlated with the lengths of internal branches of the species tree.
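As a rough illustration of how gene trees can vary under the multispecies coalescent, the sketch below treats the simplest case: a rooted three-taxon species tree ((A,B),C) with one sampled lineage per species. It relies on the standard result that the two lineages from A and B fail to coalesce within an internal branch of length t (in coalescent units) with probability exp(-t), after which each of the three possible pairings is equally likely. The function names and the parameter values are illustrative assumptions and do not come from the abstract above.

```python
import math
import random


def discordance_probability(t):
    """Analytic probability that a gene tree topology differs from the
    species tree ((A,B),C), where t is the internal branch length in
    coalescent units (assumes one sampled lineage per species)."""
    return (2.0 / 3.0) * math.exp(-t)


def simulate_discordance(t, n_loci, seed=0):
    """Monte Carlo estimate of the same probability from n_loci
    independent loci (hypothetical helper for illustration)."""
    rng = random.Random(seed)
    discordant = 0
    for _ in range(n_loci):
        # Waiting time for the A and B lineages to coalesce is
        # exponential with rate 1 in coalescent units.
        if rng.expovariate(1.0) < t:
            continue  # coalesced inside the internal branch: concordant
        # Otherwise three lineages reach the root population and each of
        # the three pairs is equally likely to coalesce first; only one
        # of those pairings (A with B) matches the species tree.
        if rng.randrange(3) != 0:
            discordant += 1
    return discordant / n_loci


if __name__ == "__main__":
    for t in (0.1, 0.5, 1.0, 2.0):
        analytic = discordance_probability(t)
        simulated = simulate_discordance(t, n_loci=20000)
        print(f"t={t}: analytic={analytic:.3f} simulated={simulated:.3f}")
```

In this toy setting the expected fraction of discordant gene trees approaches two thirds as the internal branch shrinks toward zero and decays exponentially as it lengthens. The sketch covers topology only and ignores mutation and gene tree estimation error.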
Article
A phylogenetic analysis of a combined data set for 560 angiosperms and seven outgroups based on three genes, 18S rDNA (1855 bp), rbcL (1428 bp), and atpB (1450 bp) representing a total of 4733 bp is presented. Parsimony analysis was expedited by use of a new computer program, the RATCHET. Parsimony jackknifing was performed to assess the support of clades. The combination of three data sets for numerous species has resulted in the most highly resolved and strongly supported topology yet obtained for angiosperms. In contrast to previous analyses based on single genes, much of the spine of the tree and most of the larger clades receive jackknife support ≥50%. Some of the noneudicots form a grade followed by a strongly supported eudicot clade. The early-branching angiosperms are Amborellaceae, Nymphaeaceae, and a clade of Austrobaileyaceae, Illiciaceae, and Schisandraceae. The remaining noneudicots, except Ceratophyllaceae, form a weakly supported core eumagnoliid clade comprising six well-supported subclades: Chloranthaceae, monocots, Winteraceae/Canellaceae, Piperales, Laurales, and Magnoliales. Ceratophyllaceae are sister to the eudicots. Within the well-supported eudicot clade, the early-diverging eudicots (e.g. Proteales, Ranunculales, Trochodendraceae, Sabiaceae) form a grade, followed by the core eudicots, the monophyly of which is also strongly supported. The core eudicots comprise six well-supported subclades: (1) Berberidopsidaceae/Aextoxicaceae; (2) Myrothamnaceae/Gunneraceae; (3) Saxifragales, which are the sister to Vitaceae (including Leea) plus a strongly supported eurosid clade; (4) Santalales; (5) Caryophyllales, to which Dilleniaceae are sister; and (6) an asterid clade. The relationships among these six subclades of core eudicots do not receive strong support. This large data set has also helped place a number of enigmatic angiosperm families, including Podostemaceae, Aphloiaceae, and Ixerbaceae.
This analysis further illustrates the tractability of large data sets and supports a recent, phylogenetically based, ordinal-level reclassification of the angiosperms based largely, but not exclusively, on molecular (DNA sequence) data.