Acta Agronomica Sinica (作物学报) 2016, 42(7): 945–956 http://zwxb.chinacrops.org/
ISSN 0496-3490; CODEN TSHPA9 E-mail: xbzw@chinajournal.net.cn
This work was supported by the National Natural Science Foundation of China (31301004) and the Fundamental Research Funds for the Central Universities (KJQN201422).
* Corresponding author: ZHANG Yuan-Ming, E-mail: soyzhang@mail.hzau.edu.cn; Tel: 13505161564
First author contact: E-mail: fengjianying@njau.edu.cn
Received: 2015-07-08; Accepted: 2016-05-09; Published online: 2016-05-11.
URL: http://www.cnki.net/kcms/detail/11.1809.S.20160511.1551.002.html
DOI: 10.3724/SP.J.1006.2016.00945
Advances on Methodologies for Genome-wide Association Studies in Plants
FENG Jian-Ying1, WEN Yang-Jun1, ZHANG Jin1, and ZHANG Yuan-Ming2,*
1 State Key Laboratory of Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing 210095, China; 2 College of Plant
Science and Technology, Huazhong Agricultural University, Wuhan 430070, China
Abstract: Genome-wide association studies (GWAS) have been widely used in human, animal and plant genetics, and many new approaches and software packages have been developed in recent years. To make better use of GWAS methods in applied research, we summarize the advances in GWAS methodologies and software. First, LD score regression is introduced to investigate the effect of population structure on GWAS. Then, the main approaches and software packages for GWAS in plants are reviewed, including single-locus models, multi-locus models, epistasis detection, and the joint analysis of multiple correlated traits. Finally, we discuss future developments in GWAS. It should be noted that, in current real data analysis, genome-wide single-marker scans that control for polygenic background and population structure are widely used, and their results are complementary to those derived from nonparametric approaches, which have a high false positive rate. However, future approaches for GWAS should be based on multi-locus genetic models, QTN-by-environment interaction, epistasis detection and multivariate analysis. Our purpose is to provide useful information for theoretical and applied research.
Keywords: Genome-wide association study; Epistasis; Mixed linear model; Multi-locus model
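Association analysis based on linkage disequilibrium (LD) between molecular markers and the genes underlying complex traits is a basic method for the genetic dissection of complex human diseases. In recent years, it has also been widely reported in the genetic analysis of quantitative traits in animals and plants.
In association analysis of complex human diseases, pedigree data and case-control data were the first to be analyzed. Since Risch and Merikangas proposed the genome-wide association study (GWAS)[1], and especially since software packages such as PLINK[2] (http://pngu.mgh.harvard.edu/~purcell/plink/) and BOOST[3] (http://bioinformatics.ust.hk/BOOST.html) were developed, a large number of GWAS papers have appeared[4-6], opening a new chapter in the genetics of complex human diseases and in human genomics. In plant genetics, Thornsberry et al.[7] used statistical methods that account for population structure to study Dwarf8 polymorphisms associated with flowering time in maize, and Hansen et al.[8] applied GWAS to the genetic study of the bolting gene B in sea beet. In particular, after Zhang et al.[9] and Yu et al.[10] established mixed linear model methods for association analysis, researchers have used different statistical models and parametric and nonparametric estimation methods to carry out extensive genome-wide research on single-marker scans, fast computational algorithms, multi-locus models, epistasis detection and the joint analysis of multiple correlated traits. At present, the most widely used methods are fast algorithms based on the mixed linear model; most of them are single-marker scans that control for population structure and polygenic background, for example EMMA[11], CMLM[12] and their extensions. Nonparametric tests are complementary to these methods, but their false positive rate is often high[13]. The future direction lies in methods for multi-locus models, QTN-by-environment interaction, epistasis detection and the joint analysis of multiple correlated traits, such as mrMLM[14], FarmCPU[15] and QTXNetwork[16].
Many association analysis methods for complex plant traits are now available, each with its own features and strengths. To promote their wider use, this paper reviews the progress of association analysis methods and software for plants in terms of statistical models, methods and conditions of application, and discusses future trends, so that practitioners can more easily choose and apply these methods.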
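1 Effect of population structure on association analysis
A major problem in association analysis is spurious association between the target trait and unrelated genes caused by population structure, which leads to a high false positive rate. To address this, genomic control[17-18], structured association[19-21], principal component analysis[22] and multidimensional scaling[23] have been proposed. In fact, association analysis is also affected by polygenic background effects. Bulik-Sullivan et al.[24] argued that existing methods cannot effectively distinguish the influence of population structure from that of polygenic background, and therefore proposed the LD (linkage disequilibrium) score regression method.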
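In an association mapping population, the $\chi_j^2$ statistic for the test of independence between the j-th marker and a dichotomous trait is taken as the dependent variable, and the LD score $l_j$ of that marker as the independent variable, in a linear regression, where $l_j = \sum_{k=1}^{M} r_{jk}^2$ and $r_{jk}$ is the correlation coefficient between markers j and k. If the regression intercept does not differ significantly from 1, population structure does not affect the association results and no correction for population structure is needed[24]. If the trait is continuous rather than dichotomous, equal proportions of individuals with extremely large and extremely small values can be sampled to convert it into a dichotomous trait, after which the method can be applied[25].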
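The regression described above can be sketched in a few lines. This is a minimal illustration and not the LD score regression software itself: the helper names (`pearson_r`, `ld_scores`, `ols_intercept_slope`), the toy genotypes coded 0/1/2 and the made-up chi-square values are all assumptions, and the LD score sums $r_{jk}^2$ over all markers including the marker itself.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ld_scores(genotypes):
    """LD score l_j = sum over k of r_jk^2; genotypes is a list of marker columns."""
    return [sum(pearson_r(gj, gk) ** 2 for gk in genotypes) for gj in genotypes]

def ols_intercept_slope(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

# Toy data: 3 markers (columns) scored on 6 individuals, coded 0/1/2.
markers = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 2, 1],
    [2, 0, 1, 1, 0, 2],
]
scores = ld_scores(markers)
# Hypothetical chi-square statistics, constructed to be exactly linear in the LD score.
chi2 = [1.0 + 0.5 * s for s in scores]
intercept, slope = ols_intercept_slope(scores, chi2)
```

An intercept close to 1, as in this constructed example, corresponds to the case in which no correction for population structure is needed. A formal test of the intercept would also require its standard error, which this sketch omits.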
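2 Single-locus association analysis
With advances in genome sequencing technology and falling costs, association analysis is increasingly used in crop genetics and breeding research. A large number of association analysis methods and software packages have now emerged.
2.1 Mixed linear model (MLM)
Using cultivar pedigrees, molecular marker information and quantitative trait observations, Zhang et al. first proposed a mixed linear model method for association analysis in plant germplasm populations[9]. The model is
$$Y = Xb + Zu + Wv + \varepsilon \quad (1)$$
where $Y = (y_1, \ldots, y_n)^T$ is the vector of quantitative trait observations for all individuals; $b$ is the vector of fixed effects and $X$ its coefficient matrix; $\varepsilon$ is the random error following the normal distribution $N(0, I\sigma_e^2)$; the gene effects $u$ of the k founder parents follow $N(0, \sigma_u^2)$; the polygenic effects $v$ follow $N(0, \sigma_v^2)$; the expectation of $ZZ^T$ is the IBD (identity-by-descent) matrix of the alleles at the locus among cultivars; and the expectation of $WW^T$ is the polygenic kinship IBD matrix among cultivars (equivalent to the coancestry matrix K among cultivars). It follows that
$$\mathrm{Var}(Y) = E(ZZ^T)\sigma_u^2 + E(WW^T)\sigma_v^2 + I\sigma_e^2 \quad (2)$$
Using cultivar pedigrees, $E(ZZ^T)$ and $E(WW^T)$ are calculated by a Monte Carlo method; the QTL (quantitative trait locus), polygenic and error variance components are estimated by variance component analysis; and the likelihood that a QTL exists is tested at every putative position on the genome. If a QTL is present at a position, the QTL allelic effects are predicted by BLUP (best linear unbiased prediction). The method can be used for association analysis in self-pollinated crops such as rice and soybean and in inbred lines of cross-pollinated crops such as maize; it can detect a large amount of genetic variation with high precision and small error.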
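If the cultivar pedigree is incomplete, the IBD values cannot be calculated; if the pedigree is inaccurate, the results are unreliable. Moreover, population structure can also cause spurious associations. To overcome these shortcomings, Yu et al.[10] used molecular marker information to calculate the polygenic kinship matrix K among cultivars in place of the polygenic IBD matrix, and introduced the population structure matrix Q, in order to detect molecular markers associated with quantitative traits. Simulations based on human and maize data confirmed the effectiveness of the new method in improving QTL detection power and controlling the type I error rate. Since then, mixed linear model association analysis has developed considerably and been widely applied.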
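To make the idea of a marker-based kinship matrix concrete, the sketch below computes one simple estimator from centered genotypes. The estimator (averaged cross-products of column-centered marker scores) is an assumption chosen for illustration and is not necessarily the one used in [10]; the function name `marker_kinship` and the toy data are hypothetical.

```python
def marker_kinship(genotypes):
    """Return an n x n kinship-like matrix from an n x m genotype matrix (rows = cultivars).

    Each marker column is centered on its mean, and K[i][j] is the average
    cross-product of the centered genotypes of cultivars i and j over the m markers.
    """
    n, m = len(genotypes), len(genotypes[0])
    means = [sum(row[k] for row in genotypes) / n for k in range(m)]
    centered = [[row[k] - means[k] for k in range(m)] for row in genotypes]
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / m for j in range(n)]
        for i in range(n)
    ]

# Toy example: 4 inbred cultivars, 3 markers coded 0 or 2.
G = [
    [0, 0, 2],
    [0, 0, 2],
    [2, 2, 0],
    [2, 0, 0],
]
K = marker_kinship(G)
```

Cultivars 1 and 2 carry identical genotypes, so their off-diagonal entry equals their diagonal entries, whereas cultivars with opposite genotypes receive a negative value.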
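2.2 Fast detection methods
With advances in sequencing technology and falling sequencing costs, the use of SNP (single nucleotide polymorphism) markers has become routine in plant association analysis. However, large numbers of SNP markers increase computing time, so new fast detection methods are increasingly favored by users.
The mixed linear model methods above require three variance components to be estimated, so computing time is long when there are many markers. Zhang et al.[12] therefore proposed the compressed mixed linear model (CMLM) method. Compared with the method of Yu et al.[10], CMLM treats QTL effects as fixed; clusters cultivars into groups, determines the optimal number of groups, and replaces the kinship among cultivars with the kinship among groups; and introduces the P3D (population parameters previously determined) algorithm, which fixes the ratio of polygenic variance to error variance. This increases detection power and saves computing time. Searching for the best combination among eight clustering methods and three algorithms for group kinship further improves the power of CMLM; this is the enriched compressed mixed linear model (ECMLM) method[26].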
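The compression step, in which kinship among groups replaces kinship among cultivars, can be illustrated as follows. Averaging over member pairs is only one possible summary and is assumed here for illustration; the function name `group_kinship`, the group labels and the kinship values are hypothetical, and the clustering that produces the groups is not shown.

```python
def group_kinship(K, groups):
    """Average individual kinship over all member pairs of each pair of groups.

    K      : n x n individual kinship matrix (list of lists)
    groups : list of length n giving the group label of each individual
    """
    labels = sorted(set(groups))
    members = {g: [i for i, gi in enumerate(groups) if gi == g] for g in labels}
    return [
        [
            sum(K[i][j] for i in members[a] for j in members[b])
            / (len(members[a]) * len(members[b]))
            for b in labels
        ]
        for a in labels
    ]

# Toy kinship among 4 cultivars that have been clustered into groups A and B.
K = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.6],
    [0.0, 0.1, 0.6, 1.0],
]
KG = group_kinship(K, ["A", "A", "B", "B"])
```

The 4 x 4 matrix is reduced to a 2 x 2 matrix, which is the source of the saving in computing time when many cultivars fall into few groups.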
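Kang et al.[11] treated the QTL effect $u$ in model (1) as a fixed effect and, writing $\beta = (b^T, u^T)^T$, obtained
$$Y = X\beta + Wv + \varepsilon \quad (3)$$
where $\mathrm{Var}(Y) = WKW^T\sigma_g^2 + I\sigma_e^2$. Clearly, only the polygenic and error variance components need to be estimated. In addition, the eigenvalues obtained by spectral decomposition of the matrices simplify matrix inversion and, compared with the EM and Newton-Raphson algorithms, simplify the iterative process of maximum likelihood estimation and shorten computing time. It should be noted that the QTL effects are obtained by best linear unbiased prediction. This method is called efficient mixed model association (EMMA). To further increase speed, Kang et al.[27] noted that the polygenic variance $\sigma_g^2$ and the error variance $\sigma_e^2$ need not be re-estimated when estimating each QTL effect, which gives EMMAX (EMMA eXpedited). Svishcheva et al.[28] further proposed the GRAMMAR-Gamma method, which increases computing speed and offers high detection power and flexible application. Although the P3D, EMMAX and GRAMMAR-Gamma algorithms effectively reduce computing time, they are approximate because the ratio of polygenic variance to error variance is fixed. In response, Zhou and Stephens[29] proposed GEMMA, an exact counterpart of EMMA that is much faster than EMMA. In recent years, fast computation has advanced rapidly and new methods continue to emerge, such as FaST-LMM[30], FaST-LMM-Select[31] and BOLT-LMM[32]. Most recently, Wang et al.[33] combined the CMLM and FaST-LMM algorithms to produce the even faster SUPER method.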
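The computational gain from the spectral decomposition can be seen in a short derivation. It is a simplified sketch of the idea rather than EMMA's full (restricted) likelihood: it assumes $W = I$ in model (3), and the symbols $U$, $\Lambda$, $\lambda_i$, $\tilde{Y}$ and $\tilde{X}$ are introduced here for illustration only.

```latex
\begin{aligned}
K &= U \Lambda U^{T}, \qquad U^{T}U = I, \quad \Lambda = \mathrm{diag}(\lambda_1,\ldots,\lambda_n) \\
\mathrm{Var}(Y) &= K\sigma_g^2 + I\sigma_e^2 = U\left(\Lambda\sigma_g^2 + I\sigma_e^2\right)U^{T} \\
\tilde{Y} &= U^{T}Y, \quad \tilde{X} = U^{T}X
  \;\Rightarrow\; \mathrm{Var}(\tilde{Y}) = \Lambda\sigma_g^2 + I\sigma_e^2 \\
-2\log L &= n\log(2\pi) + \sum_{i=1}^{n}\log\left(\lambda_i\sigma_g^2 + \sigma_e^2\right)
  + \sum_{i=1}^{n}\frac{\left(\tilde{y}_i - \tilde{x}_i^{T}\beta\right)^2}{\lambda_i\sigma_g^2 + \sigma_e^2}
\end{aligned}
```

Because the covariance of the rotated data is diagonal, each likelihood evaluation needs only the n eigenvalues rather than a fresh matrix inverse, and the decomposition of K is computed once.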
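These methods alleviate the high computational complexity and slow speed of association analysis with massive numbers of SNPs. After analyzing maize and Arabidopsis data, Kang et al.[11] concluded that EMMA effectively reduces the high false positive rate caused by population structure and gives more stable and precise results. Zhao et al.[34] applied EMMA to association analysis of 413 rice varieties from 82 countries worldwide and established an open platform for genome-wide association analysis in rice. In addition, Wen et al.[35] used the EMMA and P3D methods for a genome-wide association analysis of sudden death syndrome in soybean.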
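Nonparametric methods have also been applied in plant association analysis. For association mapping populations in which the phenotypic distribution of the quantitative trait is asymmetric, QTN (quantitative trait nucleotide) effects are moderate and the frequency of the allele of interest is very low, Yang et al.[36] applied the nonparametric Anderson-Darling test to association analysis and imputed missing SNP genotypes with an IBD-based K-nearest-neighbor method. After analyzing 17 quantitative traits in maize, they concluded that the results complement those of commonly used association methods and help reveal significant QTNs that conventional methods do not readily detect.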
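3 Multi-locus association analysis
The mixed model methods and fast algorithms described above are single-marker analyses that control for population structure and polygenic background. Linkage analysis has shown that multi-QTL mapping is an effective way to improve the power and precision of QTL detection. Multi-locus association analysis methodology has therefore long attracted attention.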
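3.1 Generalized linear model (GLM) method
The general form of the generalized linear model can be written as
$$E(y_i \mid \eta_i) = h^{-1}(\eta_i) \quad (4)$$
$$\eta_i = \beta_0 + \sum_{j=1}^{k} x_{ij}\beta_j + \varepsilon_i \quad (5)$$
where $y_i$ and $\eta_i$ are the observed phenotype and the latent variable of the i-th cultivar, respectively; $h$ is the link function and $h^{-1}$ its inverse; $E$ denotes expectation; $\beta_0$ is the vector of fixed effects containing the population mean and population structure; $\beta_j$ is the effect of the j-th marker and $x_{ij}$ the corresponding dummy variable; and $\varepsilon_i$ is the random error.
Because $h$ links the observed phenotype $y_i$ to the latent variable $\eta_i$, the generalized linear model offers a new approach to the genetic analysis of quantitative traits and discrete resistance traits, and it can also handle non-normal errors $\varepsilon_i$. McCullagh and Nelder[37] systematically set out the theory of generalized linear models. These theories and methods are widely used in biology and medicine and have advanced quantitative genetics.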
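As a concrete instance of equations (4) and (5), the sketch below uses the logit link for a binary resistance trait. The choice of the logit link, the function names and the numerical values are assumptions for illustration, and the random error term of equation (5) is omitted so that the result is deterministic.

```python
from math import exp

def linear_predictor(beta0, betas, x_row):
    """eta_i = beta_0 + sum_j x_ij * beta_j, i.e. equation (5) without the error term."""
    return beta0 + sum(x * b for x, b in zip(x_row, betas))

def inverse_logit(eta):
    """h^{-1}(eta) for the logit link h(mu) = log(mu / (1 - mu))."""
    return 1.0 / (1.0 + exp(-eta))

# Hypothetical values: intercept, two marker effects, and one cultivar's dummy variables.
beta0, betas = -1.0, [2.0, 0.5]
x_i = [1, 0]
eta_i = linear_predictor(beta0, betas, x_i)   # -1.0 + 2.0*1 + 0.5*0
expected_y = inverse_logit(eta_i)             # E(y_i | eta_i) as in equation (4)
```

Other links (for example the probit link, or the identity link for a normally distributed trait) change only `inverse_logit`; the linear predictor is unchanged.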
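To improve the genetic analysis, the effect $\beta_j$ can be treated as a continuous random variable with probability density function $f_1(\beta_j \mid a)$, and the parameter $a$ as a continuous random variable with probability density function $f_2(a \mid b, c)$, where $b$ and $c$ may be specified by the user or be unknown random variables. These hierarchical hyperparameters $a$, $b$ and $c$ are determined from their posterior distributions. Examples include the hierarchical generalized linear model of Yi et al.[38] for detecting rare alleles and epistatic interactions between loci in complex diseases, the hierarchical generalized linear model of Feng et al.[39] for association analysis of ordinal resistance traits in cultivar populations, and the BLUP-based generalized linear mixed model of Wang et al.[40] for pathway analysis in association studies.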
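3.2 Bayesian methods
Iwata et al.[41] proposed a Bayesian association analysis method for detecting multiple QTL in germplasm populations, with the statistical model
$$y_i = \sum_{j=1}^{J} q_{ij}\alpha_j + \sum_{k=1}^{K} x_{ik}\gamma_k\beta_k + \varepsilon_i \quad (6)$$
where $y_i$ is the quantitative trait observation; $q_{ij}$ is the element in row i and column j of the population structure matrix Q, and $\alpha_j$ is the effect of the j-th subpopulation; $x_{ik}$ is the genotype value of marker k in the i-th cultivar; every QTN is assumed to lie on a marker, and $\gamma_k$ is an indicator variable, with $\gamma_k = 1$ if marker k carries a QTN of effect $\beta_k$ and $\gamma_k = 0$ otherwise; and $\varepsilon_i$ is the error following the normal distribution $N(0, \sigma^2)$. The priors include a zero-mean normal distribution, a distribution for $\gamma_k$ governed by $p_k$, and a scaled inverse chi-square distribution governed by $v$ and $s^2$, where $p_k$, $v$ and $s^2$ are hyperparameters, together with a constant (flat) prior. From these, the conditional posterior distributions of the parameters are derived, and parameter estimates are obtained by Markov chain Monte Carlo. Simulations and rice data analysis show that the method has a low false positive rate and small bias in QTN effect estimates, but it converges slowly and takes a long time to compute. It remains useful when there are only a few hundred molecular markers. Iwata et al.[42] extended the method to Bayesian association analysis for detecting multiple QTL for discrete resistance traits; the main idea is to use a threshold model to convert the observed resistance trait into a latent continuous variable.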
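The role of the indicator variables in model (6) can be illustrated with a small deterministic sketch. It only evaluates the fitted value for one cultivar under two different sets of indicators; it does not perform any MCMC sampling, it follows the reading of model (6) as a structure term plus indicator-weighted marker terms, and all names and numbers are hypothetical.

```python
def fitted_value(q_row, alpha, x_row, gamma, beta):
    """Model (6) without the error term.

    q_row : membership of cultivar i in each of J subpopulations (row i of Q)
    alpha : subpopulation effects
    x_row : genotype values of cultivar i at K markers
    gamma : 0/1 indicators, 1 if the marker is included as a QTN
    beta  : QTN effects
    """
    structure = sum(q * a for q, a in zip(q_row, alpha))
    qtn = sum(g * x * b for g, x, b in zip(gamma, x_row, beta))
    return structure + qtn

q_i = [0.8, 0.2]            # membership proportions over two subpopulations
alpha = [10.0, 12.0]
x_i = [1, -1, 1]            # genotype values at three markers
beta = [0.5, 2.0, -1.0]
all_in = fitted_value(q_i, alpha, x_i, [1, 1, 1], beta)
only_first = fitted_value(q_i, alpha, x_i, [1, 0, 0], beta)
```

Setting an indicator to 0 removes that marker from the model without changing the dimension of the parameter vector, which is what allows a sampler to move between models containing different sets of QTNs.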
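To shorten the computing time of Bayesian methods, Zhang and Xu[43] combined the prior density function of the Bayesian method with the likelihood function to construct a penalized likelihood function; the resulting penalized maximum likelihood method can be used for linkage analysis. Similarly, Hoggart et al.[44] proposed a penalized logistic regression method for case-control data. Although both estimate model parameters with a penalized likelihood function, the former targets linkage analysis of continuous variables and the latter association analysis of case-control data. Both methods are feasible when the number of variables in the model does not exceed 10 times the sample size. However, their power to detect small-effect QTL needs improvement.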
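3.3 Mixed linear model methods
For complex traits in structured populations, Segura et al.[45] proposed a multi-locus mixed-model association method based on forward and backward stepwise regression. At each step of variable selection, the polygenic variance $\sigma_g^2$ and the residual variance $\sigma_e^2$ are first estimated, giving the generalized least squares effect estimate and its probability for each SNP; the most significant SNP is then added to the mixed model as a covariate, and a genome-wide conditional analysis is performed to obtain the P value of the F test. This process is repeated to complete the forward regression, followed by backward elimination. During variable selection, the Gram-Schmidt process is used to speed up computation. Simulations confirm that the method has higher power and a lower false positive rate than single-marker analysis, and new associated loci were identified in real human and Arabidopsis data. The FarmCPU method of Liu et al.[15], which iterates between a fixed-effect model and a random-effect model, is similar in spirit to the method of Segura et al. and can also detect more known genes. It should be noted that FarmCPU mainly uses the idea of bins to greatly reduce the number of variables in the model and to save storage. The GCTA method of Yang et al.[46] estimates variance components from all SNPs on a chromosome or across the whole genome to study the effect of all QTNs on a trait.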
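Most widely used association methods treat SNP effects as fixed. However, Goddard et al.[47] argued that it is better to treat SNP effects as random, which shrinks the effects of SNPs unrelated to the target trait toward 0 and maximizes the correlation between observed and predicted phenotypes. They did not, however, provide a method for estimating SNP effects. Wang et al.[14] therefore combined a multi-locus model, a new matrix transformation and fast algorithms to propose a multi-locus random-SNP-effect mixed linear model method. Because of its multi-locus nature, no correction for multiple testing is required. Simulations show that it has higher QTN detection power and more accurate effect estimates than EMMA; analysis of six flowering-time traits in Arabidopsis shows that it detects more known genes.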
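3.4 Integration of Bayesian and mixed model methods
The mixed model assumes a large number of small-effect QTNs, whereas Bayesian methods assume a small number of large-effect QTNs. Zhou et al.[48] argued that, in real data analysis, it is impossible to know which assumption better suits the data. They therefore suggested combining the two and proposed the Bayesian sparse linear mixed model. The QTN effect $\beta_k$ is assumed to follow the mixture of normal distributions $\beta_k \sim q\,N(0, (\sigma_a^2 + \sigma_b^2)/(p\tau)) + (1-q)\,N(0, \sigma_b^2/(p\tau))$, where p is the number of SNPs and $\tau^{-1}$ is the residual variance. If $q = 0$, this is the mixed model method; if $\sigma_b^2 = 0$, it is the Bayesian method. Simulations showed that, in estimating the phenotypic variation explained by individual QTNs, the new method combines the strengths of the mixed model and Bayesian methods, and in predicting breeding values it outperforms both.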
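The two limiting cases of the mixture prior can be checked numerically. The sketch below draws SNP effects from a two-component normal mixture written as in Zhou et al.[48], with p SNPs and residual precision $\tau$ (set to 1 by default); the function names, the seed and the parameter values are assumptions made for illustration, and nothing here fits the model to data.

```python
import random

def draw_effects(p, q, sigma_a2, sigma_b2, tau=1.0, seed=0):
    """Draw p SNP effects: with probability q from the large-variance component,
    otherwise from the small-variance component."""
    rng = random.Random(seed)
    sd_large = ((sigma_a2 + sigma_b2) / (p * tau)) ** 0.5
    sd_small = (sigma_b2 / (p * tau)) ** 0.5
    return [rng.gauss(0.0, sd_large if rng.random() < q else sd_small) for _ in range(p)]

def prior_variance(p, q, sigma_a2, sigma_b2, tau=1.0):
    """Variance of a single effect under the mixture: (q*sigma_a2 + sigma_b2) / (p*tau)."""
    return (q * sigma_a2 + sigma_b2) / (p * tau)

# q = 0: every effect comes from the same small-variance normal (mixed-model case).
lmm_like = draw_effects(p=1000, q=0.0, sigma_a2=0.5, sigma_b2=0.5)
# sigma_b2 = 0: effects are exactly zero unless drawn from the large component (sparse case).
sparse_like = draw_effects(p=1000, q=0.01, sigma_a2=0.5, sigma_b2=0.0)
n_nonzero = sum(1 for b in sparse_like if b != 0.0)
```

With `q=0.0` all 1000 effects share one normal distribution, whereas with `sigma_b2=0.0` only the few effects drawn from the large component are non-zero, which mirrors the two special cases named in the text.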
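Moser et al.[49] proposed a similar approach, the Bayesian mixture model. It assumes that SNP effects follow a mixture of four normal distributions in which the relative variance of each component is fixed: $\beta_k \mid p, \sigma_g^2 \sim p_1 N(0, 0 \times \sigma_g^2) + p_2 N(0, 10^{-4}\sigma_g^2) + p_3 N(0, 10^{-3}\sigma_g^2) + p_4 N(0, 10^{-2}\sigma_g^2)$, where the mixing proportions $p_i$ satisfy $\sum_{i=1}^{4} p_i = 1$ and $\sigma_g^2$ is the additive genetic variance explained by all SNPs. Its purpose is to combine gene detection, estimation of SNP contributions, study of the genetic architecture of complex traits and prediction of phenotypes. Analysis of human disease data indicated that more than 96% of SNP effects are tiny; that the proportion of phenotypic variance explained by large-effect loci varies among traits; and, from prediction analysis, that Bayesian methods perform better for traits controlled by large-effect loci.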
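4 Epistasis and multi-trait association analysis
4.1 Epistatic association analysis
Research on epistatic association analysis has further enriched quantitative genetics. However, the problems of oversaturated linear models and big data are more pronounced, and computational complexity increases markedly. Current research is concentrated in human genetics and uses both parametric and nonparametric detection methods.
Among parametric methods, Zhang and Liu[50] used Bayesian principles and Markov chain Monte Carlo to propose Bayesian epistasis association mapping (BEAM) for case-control data, which simultaneously detects main-effect and epistatic QTNs in order to infer SNPs significantly associated with disease. Their simulations showed that it can handle 100,000 SNPs and improve QTN detection power. Tang et al.[51] combined a Bayesian marker partition model with Gibbs sampling to detect epistatic QTNs. Cho et al.[52] proposed an elastic-net regularization method based on a penalized logistic model, achieving epistatic association analysis in two steps: variable screening and elastic net. Among nonparametric methods, Han et al.[53] proposed the DASSO-MB algorithm, and Han et al.[54] proposed FEPI-MB, a Markov blanket-based method for detecting epistatic interactions, which reduces the search space, runs faster and has higher power than BEAM. Li et al.[55] proposed a two-step nonparametric independence screening method to identify main-effect and epistatic loci potentially associated with a trait, followed by penalized regression such as LASSO to obtain the main-effect and epistatic loci significantly associated with the trait. They argued that their model is more general, can also capture interactions between loci without main effects, and better reveals the gene network controlling the trait. Compared with the method of Yang et al.[36], its lower false positive rate results from adding shrinkage estimation on top of the nonparametric method, and it can estimate the effects of main-effect QTNs and epistatic interactions.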
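In plant genetics, Wang et al.[56] proposed detecting epistasis with the adaptive mixed LASSO; Lü et al.[57] proposed an empirical Bayes method for epistasis detection; Zhang et al.[16] proposed a mixed model method based on graphics processing unit (GPU) computing to detect QTNs with main effects, gene-by-environment interactions and gene-by-gene interactions, greatly increasing computing speed; and Wen et al.[58] proposed an epistasis detection method based on the EBLASSO algorithm and analyzed the contributions of different genetic components to heterosis in a partial NCII mating design. In the first two methods, the number of variables in the model should not exceed 10 times the sample size; the latter two introduce variables into the model dynamically, can accommodate more variables and can handle massive numbers of variables. It should be noted that epistatic association methods require further study to improve computing speed and the power to detect interactions among small-effect genes.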
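The scale of the oversaturated-model problem in epistasis detection is easy to quantify. The sketch below counts the pairwise interaction terms for a given number of markers and builds the interaction columns for one individual; coding an interaction as the product of two marker scores is an assumption for illustration, and the function names are hypothetical.

```python
from itertools import combinations
from math import comb

def n_pairwise_terms(n_markers):
    """Number of marker pairs, i.e. candidate two-locus interaction terms."""
    return comb(n_markers, 2)

def interaction_columns(genotype_row):
    """Products x_j * x_k for all marker pairs j < k of one individual."""
    return [a * b for a, b in combinations(genotype_row, 2)]

pairs_100k = n_pairwise_terms(100_000)      # pairs among 100,000 SNPs
row = interaction_columns([1, -1, 1, -1])   # 6 products for 4 markers
```

With 100,000 SNPs there are 4,999,950,000 candidate pairs, far more than any realistic sample size, which is why screening steps or methods that add variables to the model dynamically are needed.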
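4.2 Association analysis of multiple correlated traits
Single-objective breeding is a thing of the past; high yield, good quality and multiple resistances are today's breeding goals. To link genetic analysis more closely with crop breeding, joint association analysis of multiple correlated traits is needed. The most obvious approaches are principal component analysis[59], canonical correlation analysis[60], linear regression with multiple dependent variables[61], meta-analysis[62] and partial least squares[63]. The mixed model nevertheless remains the most common approach in association analysis, so mixed model methods for the joint analysis of multiple correlated traits are more readily accepted by users. The main methods are GCTA[46], MTMM[64], GEMMA (mvLMMs)[65], mtSet[66] and mvLMM[67], of which GCTA can analyze only two correlated traits. These studies all show that joint analysis of multiple correlated traits has higher power and precision than single-trait analysis. However, software packages with a Windows interface have yet to be developed.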
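5 Software packages for plant association analysis
Association analysis is now widely used in human, animal and plant genetics, and theoreticians continue to propose new methods and software (Table 1). For convenience, the main software packages are briefly introduced here.
PLINK[2] (http://pngu.mgh.harvard.edu/~purcell/plink/) was one of the first freely available association analysis packages. It can be used for data management, population structure assessment, and association analysis of complex traits and case-control data, and it can handle large genotype and phenotype data sets.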
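TASSEL[69] (http://tassel.bitbucket.org/), developed by the Buckler laboratory at Cornell University, is written in Java and runs on all mainstream operating systems. It has been updated to version 5.0 and mainly covers association analysis, evolutionary analysis and linkage analysis; it can also calculate and plot linkage disequilibrium statistics. In 2012, the laboratory released GAPIT (http://zzlab.net/GAPIT), an R-based genome association and prediction integrated tool, which has since been updated to GAPIT v2[70]. It includes new association methods such as FaST-LMM, ECMLM, FaST-LMM-Select and SUPER, and its genomic prediction includes gBLUP methods based on CMLM, ECMLM and SUPER. The new version adds functions such as phenotype simulation, power analysis and cross-validation.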
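QTXNetwork (http://ibi.zju.edu.cn/software/QTXNetwork/)[16,71], developed by the laboratory of Professor Zhu Jun at Zhejiang University, is a GPU-based association analysis package that can handle large-scale omics data for complex traits. It comprises four functional modules: QTL linkage analysis, QTS GWAS, QTT/P/M association analysis, and GMDR data filtering for genome-wide association analysis. It can detect main-effect genes, gene-by-environment interactions and gene-by-gene interactions; the phenotype data can be either quantitative trait observations or omics data; and it runs on a heterogeneous CPU and GPU computing platform.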
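mrMLM (https://cran.r-project.org/web/packages/mrMLM/index.html) is an R package implementing the multi-locus random-SNP-effect mixed linear model method proposed by Wang et al.[14]. It provides a graphical interface under Windows within R, and the package loaded in R can also run on other operating systems. In addition to multi-locus association analysis, it produces Manhattan plots for screening significant markers and QQ (quantile-quantile) plots for evaluating method performance.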
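Information on the other software mentioned in the text is given in Table 1. We believe that new methods and software will continue to emerge. Users can choose among methods according to their needs, or analyze the same data set with as many methods as possible and then use stepwise regression to select the optimal set of associated markers.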
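6 Prospects
With continual advances in biological omics data, computer science and technology, and statistical algorithms, and especially given the needs of genetic analysis of plant quantitative traits, it is necessary to build a technical platform for plant association analysis in order to dissect the genetic basis of quantitative traits and to advance crop molecular design breeding and molecular biology research (Fig. 1).
Fig. 1 Technical framework for genome-wide association studies in plants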
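6.1 Algorithms and software for fast, high-precision association analysis with massive numbers of markers
Methodological research on plant association analysis has developed rapidly and its scope keeps broadening, which has accelerated the application of these methods in plant genetics. However, plant quantitative traits are complex and the number of SNPs far exceeds the number of individuals in the mapping population, which poses a great challenge for GWAS, particularly for polygene detection, gene-by-environment interaction analysis[72] and epistasis mapping. Research on association methods therefore needs breakthroughs in the statistical theory of parameter estimation for oversaturated linear models, in fast computing technology, and in fast matrix algorithms. Statistical methods, numerical algorithms and computer technology need to be combined effectively to develop new association methods that are efficient, fast and able to handle massive numbers of markers. For these new methods to be widely used, software packages for different platforms need to be developed[73].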
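6.2 Combining association analysis with crop breeding
Conventional breeding selects for important agronomic traits on the basis of phenotype and the breeder's experience; it is inefficient and slow. Genotype-based selection and efficient, accurate molecular-assisted techniques have opened a new direction for crop breeding. The purpose of association analysis of important plant traits is to discover favorable alleles for use in crop breeding. In crop breeding, association analysis can rapidly uncover elite allelic variation in germplasm resources, which can then be introduced into breeding material through pyramiding or other molecular design breeding methods[74-76]. However, because of the complexity of markers and the influence of genetic background and environment, the application of association results in marker-assisted selection still needs to be improved. In addition, association analysis helps reveal the position, genetic effect and gene network of target genes, so that target traits can be improved through molecular biology or molecular breeding operations[77].
Different association methods can be chosen for different breeding objectives. For breeding homozygous cultivars, the methods described above can be used; for breeding hybrid cultivars, breeding populations can be used for genetic analysis, and the results can be used to predict the best cross combinations[25,58]. To improve precision, genomic prediction is an available approach[78].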
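6.3 Combining association analysis with molecular biology and omics research
Although association analysis can uncover more alleles for use in crop breeding and provide useful information for gene function analysis and functional marker development, the biological functions of these genes are not well understood, so it can serve only as preliminary work for plant molecular biology. At present, transcriptomic, proteomic and metabolomic research is very active. If these omics data are treated as complex traits, the corresponding association analyses can also be performed, as has been done in Arabidopsis[79] and maize[80-81]. However, work in this area needs to be strengthened further.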
References
[1] Risch N, Merikangas K. The future of genetic studies of com-
plex human diseases. Science, 1996, 273: 1516–1517
[2] Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira M,
Bender D, Maller J, Sklar P, De Bakker P, Daly M, Sham P C.
PLINK: a tool set for whole-genome association and popula-
tion-based linkage analyses. Am J Hum Genet, 2007, 81:
559–575
[3] Wan X, Yang C, Yang Q, Xue H, Fan X D, Tang N L S, Yu W C.
BOOST: a fast approach to detecting gene-gene interactions in
genome-wide case-control studies. Am J Hum Genet, 2010, 87:
325–340
[4] Takeuchi F, Serizawa M, Yamamoto K, Fujisawa T, Nakashima
E, Ohnaka K, Ikegami H, Sugiyama T, Katsuya T, Miyagishi M,
Nakashima N, Nawata H, Nakamura J, Kono S, Takayanagi R,
Kato N. Confirmation of multiple risk loci and genetic impacts
by a genome-wide association study of type 2 diabetes in the
Japanese population. Diabetes, 2009, 58: 1690–1699
[5] Michailidou K, Beesley J, Lindstrom S, Canisius S, Dennis J,
Lush M J, Maranian M J, Bolla M K, Wang Q, Shah M, Per-
kins B J, Czene K, Eriksson M, Darabi H, Brand J S, Bojesen S
E, Nordestgaard B G, Flyger H, Nielsen S F, Rahman N,
Turnbull C, BOCS, Fletcher O, Peto J, Gibson L, dos-Santos-
Silva I, Chang-Claude J, Flesch-Janys D, Rudolph A, Eilber U,
Behrens S, Nevanlinna H, Muranen T A, Aittomäki K, Blom-
qvist C, Khan S, Aaltonen K, Ahsan H, Kibriya M G, Whitte-
more A S, John E M, Malone K E, Gammon M D, Santella R
M, Ursin G, Makalic E, Schmidt D F, Casey G, Hunter D J,
Gapstur S M, Gaudet M M, Diver W R, Haiman C A,
Schumacher F, Henderson B E, Le Marchand L, Berg C D,
Chanock S J, Figueroa J, Hoover R N, Lambrechts D, Neven P,
Wildiers H, van Limbergen E, Schmidt M K, Broeks A, Ver-
hoef S, Cornelissen S, Couch F J, Olson J E, Hallberg E, Va-
chon C, Waisfisz Q, Meijers-Heijboer H, Adank M A, van der
Luijt R B, Li J, Liu J, Humphreys K, Kang D, Choi J Y, Park S
K, Yoo K Y, Matsuo K, Ito H, Iwata H, Tajima K, Guénel P,
Truong T, Mulot C, Sanchez M, Burwinkel B, Marme F, Su-
rowy H, Sohn C, Wu A H, Tseng C C, Van Den Berg D, Stram
D O, González-Neira A, Benitez J, Zamora M P, Perez J I, Shu
X O, Lu W, Gao Y T, Cai H, Cox A, Cross S S, Reed M W,
Andrulis I L, Knight J A, Glendon G, Mulligan A M, Sawyer E
J, Tomlinson I, Kerin M J, Miller N, kConFab Investigators,
AOCS Group, Lindblom A, Margolin S, Teo S H, Yip C H,
Taib N A, Tan G H, Hooning M J, Hollestelle A, Martens J W,
Collée J M, Blot W, Signorello L B, Cai Q, Hopper J L,
Southey M C, Tsimiklis H, Apicella C, Shen C Y, Hsiung C N,
Wu P E, Hou M F, Kristensen V N, Nord S, Alnaes G I, NBCS,
Giles G G, Milne R L, McLean C, Canzian F, Trichopoulos D,
Peeters P, Lund E, Sund M, Khaw K T, Gunter M J, Palli D,
Mortensen L M, Dossus L, Huerta J M, Meindl A, Schmutzler
R K, Sutter C, Yang R, Muir K, Lophatananon A, Stewart-
Brown S, Siriwanarangsan P, Hartman M, Miao H, Chia K S,
Chan C W, Fasching P A, Hein A, Beckmann M W, Haeberle L,
Brenner H, Dieffenbach A K, Arndt V, Stegmaier C, Ashworth
A, Orr N, Schoemaker M J, Swerdlow A J, Brinton L, Garcia-
Closas M, Zheng W, Halverson S L, Shrubsole M, Long J,
Goldberg M S, Labrèche F, Dumont M, Winqvist R, Pylkäs K,
Jukkola-Vuorinen A, Grip M, Brauch H, Hamann U, Brüning T;
GENICA Network, Radice P, Peterlongo P, Manoukian S,
Bernard L, Bogdanova N V, Dörk T, Mannermaa A, Kataja V,
Kosma V M, Hartikainen J M, Devilee P, Tollenaar R A,
Seynaeve C, Van Asperen C J, Jakubowska A, Lubinski J, Ja-
worska K, Huzarski T, Sangrajrang S, Gaborieau V, Brennan P,
McKay J, Slager S, Toland A E, Ambrosone C B, Yannoukakos
D, Kabisch M, Torres D, Neuhausen S L, Anton-Culver H,
Luccarini C, Baynes C, Ahmed S, Healey C S, Tessier D C,
Vincent D, Bacot F, Pita G, Alonso M R, Álvarez N, Herrero D,
Simard J, Pharoah P P, Kraft P, Dunning A M,
Chenevix-Trench G, Hall P, Easton D F. Genome-wide asso-
ciation analysis of more than 120 000 individuals identifies 15
new susceptibility loci for breast cancer. Nat Genet, 2015, 47:
373–380
[6] Scuteri A, Sanna S, Chen W M, Uda M, Albai G, Strait J,
Najjar S, Nagaraja R, Orrú M, Usala G, Dei M, Lai S, Maschio
A, Busonero F, Mulas A, Ehret G B, Fink A A, Weder A B,
Cooper R S, Galan P, Chakravarti A, Schlessinger D, Cao A,
Lakatta E, Abecasis G R. Genome-wide association scan
shows genetic variants in the FTO gene are associated with
obesity-related traits. PLoS Genet, 2007, 3(7): e115
[7] Thornsberry J M, Goodman M M, Doebley J, Kresovich S,
Nielsen D, Buckler E S. Dwarf8 polymorphisms associate with
variation in flowering time. Nat Genet, 2001, 28: 286–289
[8] Hansen M, Kraft T, Ganestam S, Säll T, Nilsson N O. Linkage
disequilibrium mapping of the bolting gene in sea beet using
AFLP markers. Genet Res, 2001, 77: 61–66
[9] Zhang Y M, Mao Y C, Xie C Q, Smith H, Luo L, Xu S. Map-
ping quantitative trait loci using naturally occurring genetic
variance among commercial inbred lines of maize (Zea mays
L.). Genetics, 2005, 169: 2267–2275
[10] Yu J, Pressoir G, Briggs W H, Vroh Bi I, Yamasaki M, Doebley
J F, McMullen M D, Gaut B S, Nielsen D M, Holland J B,
Kresovich S, Buckler E S. A unified mixed-model method for
association mapping that accounts for multiple levels of rela-
tedness. Nat Genet, 2006, 38: 203–208
[11] Kang H M, Zaitlen N A, Wade C M, Kirby A, Heckerman D,
Daly M J, Eskin E. Efficient control of population structure in
model organism association mapping. Genetics, 2008, 178:
1709–1723
[12] Zhang Z, Ersoz E, Lai C Q, Todhunter R J, Tiwari H K, Gore
M A, Bradbury P J, Yu J M, Arnett D K, Ordovas J M, Buckler
E S. Mixed linear model approach adapted for genome-wide
association studies. Nat Genet, 2010, 42: 355–360
[13] Atwell S, Huang Y S, Vilhjálmsson B J, Willems G, Horton M,
Li Y, Meng D, Platt A, Tarone A M, Hu T T, Jiang R, Muliyati
N W, Zhang X, Amer M A, Baxter I, Brachi B, Chory J, Dean
C, Debieu M, de Meaux J, Ecker J R, Faure N, Kniskern J M,
Jones J D, Michael T, Nemri A, Roux F, Salt D E, Tang C, To-
desco M, Traw M B, Weigel D, Marjoram P, Borevitz J O,
Bergelson J, Nordborg M. Genome-wide association study of
107 phenotypes in Arabidopsis thaliana inbred lines. Nature, 2010, 465: 627–631
[14] Wang S B, Feng J Y, Ren W L, Huang B, Zhou L, Wen Y J,
Zhang J, Dunwell J M, Xu S Z, Zhang Y M. Improving power and
accuracy of genome-wide association studies via a multi-locus
mixed linear model methodology. Sci Rep, 2016, 6: 19444
[15] Liu X L, Huang M, Fan B, Buckler E S, Zhang Z W. Iterative
usage of fixed and random effect models for powerful and
efficient genome-wide association studies. PLoS Genet, 2016,
12(2): e1005767
[16] Zhang F T, Zhu Z H, Tong X R, Zhu Z X, Qi T, Zhu J. Mixed
linear model approaches of association mapping for complex
traits based on omics variants. Sci Rep, 2015, 5: 10298
[17] Devlin B, Roeder K. Genomic control for association studies.
Biometrics, 1999, 55: 997–1004
[18] Song M, Hao W, Storey J D. Testing for genetic associations in
arbitrarily structured populations. Nat Genet, 2015, 47:
550–556
[19] Pritchard J K, Stephens M, Donnelly P. Inference of population
structure using multilocus genotype data. Genetics, 2000, 155:
945–959
[20] Wilson L M, Whitt S R, Ibáñez A M, Rocheford T R, Goodman
M M, Buckler E S. Dissection of maize kernel composition
and starch production by candidate gene associations. Plant
Cell, 2004, 16: 2719–2733
[21] Sabatti C, Service S K, Hartikainen A L, Pouta A, Ripatti S,
Brodsky J, Jones C G, Zaitlen N A, Varilo T, Kaakinen M, So-
vio U, Ruokonen A, Laitinen J, Jakkula E, Coin L, Hoggart C,
Collins A, Turunen H, Gabriel S, Elliot P, McCarthy M I, Daly
M J, Järvelin M R, Freimer N B, Peltonen L. Genome-wide
association analysis of metabolic traits in a birth cohort from a
founder population. Nat Genet, 2009, 41: 35–46
[22] Price A L, Patterson N J, Plenge R M, Weinblatt M E, Shadick
N A, Reich D. Principal components analysis corrects for
stratification in genome-wide association studies. Nat Genet,
2006, 38: 904–909
[23] Lee A B, Luca D, Klei L, Devlin B, Roeder K. Discovering
genetic ancestry using spectral graph theory. Genet Epidemiol,
2010, 34: 51–59
[24] Bulik-Sullivan B K, Loh P R, Finucane H K, Ripke S, Yang J,
Schizophrenia Working Group of the Psychiatric Genomics
Consortium, Patterson N, Daly M J, Price A L, Neale B M. LD
score regression distinguishes confounding from polygenicity
in genome-wide association studies. Nat Genet, 2015, 47:
291–295
[25] Bu S H, Zhao X W, Yi C, Wen J, Tu J X, Zhang Y M. Inter-
acted QTL mapping in partial NCII design provides evidences
for breeding by design. PLoS One, 2015, 10(3): e0121034
[26] Li M, Liu X L, Bradbury P, Yu J M, Zhang Y M, Todhunter R J,
Buckler E S, Zhang Z W. Enrichment of statistical power for
genome-wide association studies. BMC Biol, 2014, 12: 73–82
[27] Kang H M, Sul J H, Service S K, Zaitlen N A, Kong S Y,
Freimer N B, Sabatti C, Eskin E. Variance component model to
account for sample structure in genome-wide association
studies. Nat Genet, 2010, 42: 348–354
[28] Svishcheva G R, Axenovich T I, Belonogova N M, van Duijn
C M, Aulchenko Y S. Rapid variance components-based
method for whole-genome association analysis. Nat Genet,
2012, 44: 1166–1170
[29] Zhou X, Stephens M. Genome-wide efficient mixed-model
analysis for association studies. Nat Genet, 2012, 44: 821–826
[30] Lippert C, Listgarten J, Liu Y, Kadie C M, Davidson R I,
Heckerman D. Fast linear mixed models for genome-wide as-
sociation studies. Nat Methods, 2011, 8: 833–835
[31] Listgarten J, Lippert C, Kadie C M, Davidson R I, Eskin E,
Heckerman D. Improved linear mixed models for genome-
wide association studies. Nat Methods, 2012, 9: 525–526
[32] Loh P R, Tucker G, Bulik-Sullivan B K, Vilhjálmsson B J,
Finucane H K, Salem R M, Chasman D I, Ridker P M, Neale B
M, Berger B, Patterson N, Price A L. Efficient Bayesian
mixed-model analysis increases association power in large
cohorts. Nat Genet, 2015, 47: 284–290
[33] Wang Q, Tian F, Pan Y, Buckler E S, Zhang Z. A SUPER
powerful method for genome wide association study. PLoS
One, 2014, 9: e107684
[34] Zhao K, Tung C W, Eizenga G C, Wright M H, Ali M L, Price
A H, Norton G J, Islam M R, Reynolds A, Mezey J, McClung
A M, Bustamante C D, McCouch S R. Genome-wide associa-
tion mapping reveals a rich genetic architecture of complex
traits in Oryza sativa. Nat Commun, 2011, 2: 467–476
[35] Wen Z X, Tan R J, Yuan J Z, Bales C, Du W Y, Zhang S C,
Chilvers M I, Schmidt C, Song Q J, Cregan P B, Wang D C.
Genome-wide association mapping of quantitative resistance
to sudden death syndrome in soybean. BMC Genomics, 2014,
15: 809–819
[36] Yang N, Lu Y L, Yang X H, Huang J, Zhou Y, Ali F, Wen W W,
Liu J, Li J S, Yan J B. Genome wide association studies using a
new nonparametric model reveal the genetic architecture of 17
agronomic traits in an enlarged maize association panel. PLoS
Genet, 2014, 10(9): e1004573
[37] McCullagh P, Nelder J A. Generalized Linear Models, 2nd edn.
London: Chapman and Hall, 1989
[38] Yi N, Liu N J, Zhi D G, Li J. Hierarchical generalized linear
models for multiple groups of rare and common variants:
jointly estimating group and individual-variant effects. PLoS
Genet, 2011, 7(12): e1002382
[39] Feng J Y, Zhang J, Zhang W J, Wang S B, Han S F, Zhang Y M.
An efficient hierarchical generalized linear mixed model for
mapping QTL of ordinal traits in crop cultivars. PLoS One,
2013, 8(4): e59541
[40] Wang L, Jia P, Wolfinger R D, Chen X, Grayson B L, Aune T
M, Zhao Z. An efficient hierarchical generalized linear mixed
model for pathway analysis of genome-wide association
studies. Bioinformatics, 2011, 27(5): 686–692
[41] Iwata H, Uga Y, Yoshioka Y, Ebana K, Hayashi T. Bayesian
association mapping of multiple quantitative trait loci and its
application to the analysis of genetic variation among (Oryza
sativa L.) germplasms. Theor Appl Genet, 2007, 114: 1437–1449
[42] Iwata H, Ebana K, Fukuoka S, Jannink J L, Hayashi T. Bayes-
ian multilocus association mapping on ordinal and censored
traits and its application to the analysis of genetic variation
among (Oryza sativa L.) germplasms. Theor Appl Genet, 2009,
118: 865–880
[43] Zhang Y M, Xu S. A penalized maximum likelihood method
for estimating epistatic effects of QTL. Heredity, 2005, 95:
96–104
[44] Hoggart C J, Whittaker J C, De Iorio M, Balding D J. Simul-
taneous analysis of all SNPs in genome-wide and resequencing
association studies. PLoS Genet, 2008, 4: e1000130
[45] Segura V, Vilhjálmsson B J, Platt A, Korte A, Seren Ü, Long Q,
Nordborg M. An efficient multi-locus mixed-model approach
for genome-wide association studies in structured populations.
Nat Genet, 2012, 44: 825–830
[46] Yang J, Lee S H, Goddard M E, Visscher P M. GCTA: a tool
for genome-wide complex trait analysis. Am J Hum Genet,
2011, 88: 76–82
[47] Goddard M E, Wray N R, Verbyla K, Visscher P M. Estimating
effects and making predictions from genomewide marker data.
Stat Sci, 2009, 24: 517–529
[48] Zhou X, Carbonetto P, Stephens M. Polygenic modeling with
Bayesian sparse linear mixed models. PLoS Genet, 2013, 9(2):
e1003264
[49] Moser G, Lee S H, Hayes B J, Goddard M E, Wray N R, Viss-
cher P M. Simultaneous discovery, estimation and prediction
analysis of complex traits using a Bayesian mixture model.
PLoS Genet, 2015, 11(4): e1004969
[50] Zhang Y, Liu J S. Bayesian inference of epistatic interactions
in case-control studies. Nat Genet, 2007, 39: 1167–1173
[51] Tang W W, Wu X B, Jiang R. Epistatic module detection for
case-control studies: a Bayesian model with a Gibbs sampling
strategy. PLoS Genet, 2009, 5(5): e1000464
[52] Cho S, Kim H, Oh S, Kim K, Park T. Elastic-net regularization
approaches for genome-wide association studies of rheumatoid
arthritis. BMC Proc, 2009, 3(suppl 7): S25
[53] Han B, Park M, Chen X W. A Markov blanket-based method
for detecting causal SNPs in GWAS. BMC Bioinformatics,
2010, 11(suppl 3): S5
[54] Han B, Chen X W, Talebizadeh Z. FEPI-MB: identifying
SNPs-disease association using a Markov blanket-based ap-
proach. BMC Bioinformatics, 2011, 12(Suppl 12): S3
[55] Li J, Dan J, Li C L, Wu R L. A model-free approach for de-
tecting interactions in genetic association studies. Brief Bioin-
form, 2014, 15: 1057–1068
[56] Wang D, Eskridge K M, Crossa J. Identifying QTLs and epis-
tasis in structured plant populations using adaptive mixed
LASSO. J Agric Biol Environ Stat, 2011, 16: 170–184
[57] Lü H Y, Liu X F, Wei S P, Zhang Y M. Epistatic association
mapping in homozygous crop cultivars. PLoS One, 2011, 6(3):
e17773
[58] Wen J, Zhao X W, Wu G R, Xiang D, Liu Q, Bu S H, Yi C,
Song Q J, Dunwell J M, Tu J X, Zhang T Z, Zhang Y M. Ge-
netic dissection of heterosis using epistatic association map-
ping in a partial NCII mating design. Sci Rep, 2015, 5: 18376
[59] Aschard H, Vilhjálmsson B J, Greliche N, Morange P E,
Trégouët D A, Kraft P. Maximizing the power of principal-
component analysis of correlated phenotypes in genome-wide
association studies. Am J Hum Genet, 2014, 94: 662–676
[60] Ferreira M A, Purcell S M. A multivariate test of association.
Bioinformatics, 2009, 25: 132–133
[61] Bottolo L, Chadeau-Hyam M, Hastie D I, Zeller T, Liquet B,
Newcombe P, Yengo L, Wild P S, Schillert A, Ziegler A, Niel-
sen S F, Butterworth A S, Ho W K, Castagné R, Munzel T,
Tregouet D, Falchi M, Cambien F, Nordestgaard B G, Fumeron
F, Tybjærg-Hansen A, Froguel P, Danesh J, Petretto E,
Blankenberg S, Tiret L, Richardson S. GUESS-ing polygenic
associations with multiple phenotypes using a GPU-Based
evolutionary stochastic search algorithm. PLoS Genet, 2013,
9(8): e1003657
[62] Bolormaa S, Pryce J E, Reverter A, Zhang Y, Barendse W,
Kemper K, Tier B, Savin K, Hayes B J, Goddard M E. A
multi-trait, meta-analysis for detecting pleiotropic polymor-
phisms for stature, fatness and reproduction in beef cattle.
PLoS Genet, 2014, 10: e1004198
[63] Xu Y, Hu W M, Yang Z F, Xu C W. A multivariate partial least
squares approach to joint analysis for multiple correlated traits.
Crop J, 2016, 4(1): 21–29
[64] Korte A, Vilhjálmsson B J, Segura V, Platt A, Long Q, Nord-
borg M. A mixed-model approach for genome-wide association
studies of correlated traits in structured populations. Nat Genet,
2012, 44: 1066–1071
[65] Zhou X, Stephens M. Efficient algorithm for multivariate
linear mixed models in genome-wide association studies. Nat
Methods, 2014, 11: 407–409
[66] Casale F P, Rakitsch B, Lippert C, Stegle O. Efficient set tests
for the genetic analysis of correlated traits. Nat Methods, 2015,
12: 755–758
[67] Furlotte N A, Eskin E. Efficient multiple-trait association and
estimation of genetic correlation using the matrix-variate
linear mixed model. Genetics, 2015, 200: 59–68
[68] Wan X, Yang C, Yang Q, Xue H, Tang N L S, Yu W C. Predic-
tive rule inference for epistatic interaction detection in ge-
nome-wide association studies. Bioinformatics, 2010, 26:
30–37
[69] Bradbury P J, Zhang Z, Kroon D E, Casstevens T M, Ramdoss
Y, Buckler E S. TASSEL: software for association mapping of
complex traits in diverse samples. Bioinformatics, 2007,
23: 2633–2635
[70] Tang Y, Liu X, Wang J, Li M, Wang Q, Tian F, Su Z, Pan Y,
Liu D, Lipka A E, Buckler E S, Zhang Z. GAPIT Version 2:
Enhanced integrated tool for genomic association and prediction.
Plant Genome, 2016, 9(2). doi: 10.3835/plantgenome2015.11.0120
[71] 张福涛. 遗传分析方法的 GPU 并行计算与优化研究. 浙江
大学博士学位论文, 浙江杭州, 2014. pp 89–97
Zhang F T. Parallelization and Optimization of GPU Computa-
tion for Genetic Analysis Methods. PhD Dissertation of
Zhejiang University, Hangzhou, China, 2014. pp 89–97 (in
Chinese with English abstract)
[72] Sul J H, Bilow M, Yang W Y, Kostem E, Furlotte N, He D,
Eskin E. Accounting for population structure in gene-by-
environment interactions in genome-wide association studies
using mixed models. PLoS Genet, 2016, 12(3): e1005849
[73] Zhang W, Dai X, Wang Q, Xu S, Zhao P X. PEPIS: a pipeline
for estimating epistatic effects in quantitative trait locus
mapping and genome-wide association studies. PLoS Comput
Biol, 2016, 12(5): e1004925
[74] Collard B C Y, Mackill D J. Marker-assisted selection: an ap-
proach for precision plant breeding in the twenty-first century.
Philos Trans R Soc Lond B Biol Sci, 2008, 363(1491): 557–572
[75] Andersen J R, Lübberstedt T. Functional markers in plants.
Trends Plant Sci, 2003, 8: 554–560
[76] 杨小红, 严建兵, 郑艳萍, 余建明, 李建生. 植物数量性状
关联分析研究进展. 作物学报, 2007, 33: 523–530
Yang X H, Yan J B, Zheng Y P, Yu J M, Li J S. Reviews of
association analysis for quantitative traits in plants. Acta Agron
Sin, 2007, 33: 523–530 (in Chinese)
[77] 谭贤杰, 吴子恺, 程伟东, 王天宇, 黎裕. 关联分析及其在
植物遗传学研究中的应用. 植物学报, 2011, 46: 108–118
Tan X J, Wu Z K, Cheng W D, Wang T Y, Li Y. Association
analysis and its application in plant genetic research. Chin Bull
Bot, 2011, 46: 108–118 (in Chinese)
[78] 布素红. 多亲本群体 QTL 定位和优异杂交组合预测. 南京
农业大学博士学位论文, 江苏南京, 2015. pp 57–68
Bu S H. Mapping of Quantitative Trait Loci and Prediction of
Elite Hybrid Combination in Multi-parental Populations. PhD
Dissertation of Nanjing Agricultural University, Nanjing,
China, 2015. pp 57–68 (in Chinese with English abstract)
[79] Chan E K F, Rowe H C, Kliebenstein D J. Understanding the
evolution of defense metabolites in Arabidopsis thaliana using
genome-wide association mapping. Genetics, 2010, 185:
991–1007
[80] Riedelsheimer C, Lisec J, Czedik-Eysenberg A, Sulpice R, Flis
A, Grieder C, Altmann T, Stitt M, Willmitzer L, Melchinger A
E. Genome-wide association mapping of leaf metabolic pro-
files for dissecting complex traits in maize. Proc Natl Acad Sci
USA, 2012, 109: 8872–8877
[81] Wen W W, Li D, Li X, Gao Y Q, Li W Q, Li H H, Liu J, Liu H
J, Chen W, Luo J, Yan J B. Metabolome-based genome-wide
association study of maize kernel leads to novel biochemical
insights. Nat Commun, 2014, 5: 3438–3447
... Genetic analysis using genomic markers has gradually become the default method in plant and animal breeding owing to its high prediction accuracy (Morris et al., 2011; Zhang et al., 2013; Feng et al., 2016; Martin et al., 2016; Owens et al., 2019) and ability to identify genetic markers underlying the traits of interest. However, no study has yet used genomic markers to investigate the genetic mechanism of hypermelanosis in Chinese tongue sole. ...
Chinese tongue sole (Cynoglossus semilaevis) is an economically important marine fish in China. Generally, the eyeless side of the Chinese tongue sole is white and the side with eyes is brown after metamorphosis. However, hypermelanosis may still occur on the eyeless side in certain individuals after metamorphosis, which greatly decreases consumer acceptance and market price. To study the possibility of genetic improvement, we determined genomic markers in Chinese tongue sole using the genotyping-by-sequencing method and analyzed their association with hypermelanosis area. Genetic analysis showed that hypermelanosis was a complicated quantitative trait, and the estimated heritabilities for hypermelanosis incidence and area ratio were 0.16 and 0.21, respectively. Genomic selection analysis showed that selection based on hypermelanosis incidence and area ratio had similar reliabilities and prediction accuracies, indicating the feasibility of genetic improvement. Nine loci were significantly associated with hypermelanosis, few of which included or flanked genes potentially associated with skin disease, indicating potentially complicated genetic mechanisms underlying hypermelanosis in the Chinese tongue sole.
... To improve yield-determining traits (YDTs) and better understand their genetic basis and diversity, genome-wide association studies (GWAS) have recently been used extensively to dissect complex traits in crops. Most earlier findings relied on single-locus GWAS, such as the mixed linear model (MLM) [162,163], while various new MLM-based models have been introduced recently [164]. More comprehensively, these novel strategies have various applications in the genetic integration of novel and omics-related traits, facilitating the recent breakthrough in the generation of bioinformatics and sequencing strategies. ...
Yield is one of the most important agronomic traits for the breeding of rapeseed (Brassica napus L), but its genetic dissection for the formation of high yield remains enigmatic, given the rapid population growth. In the present review, we review the discovery of major loci underlying important agronomic traits and the recent advancement in the selection of complex traits. Further, we discuss the benchmark summary of high-throughput techniques for the high-resolution genetic breeding of rapeseed. Biparental linkage analysis and association mapping have become powerful strategies to comprehend the genetic architecture of complex agronomic traits in crops. The generation of improved crop varieties, especially rapeseed, is greatly urged to enhance yield productivity. In this sense, the whole-genome sequencing of rapeseed has become achievable to clone and identify quantitative trait loci (QTLs). Moreover, the generation of high-throughput sequencing and genotyping techniques has significantly enhanced the precision of QTL mapping and genome-wide association study (GWAS) methodologies. Furthermore, this study demonstrates the first attempt to identify novel QTLs of yield-related traits, specifically focusing on ovule number per pod (ON). We also highlight the recent breakthrough concerning single-locus-GWAS (SL-GWAS) and multi-locus GWAS (ML-GWAS), which aim to enhance the potential and robust control of GWAS for improved complex traits.
... The most widely used NGS is single nucleotide polymorphisms (SNPs) which allow better detection power for markers associated with agronomic traits [60]. Previously have been developed to reveal the genetic architecture of complex traits in crops [68,69]. In earlier studies, mostly SL-GWAS methods were adopted to dissect complex traits, but only few SNPs for each trait have been identified due to its procedural limitations. ...
Background: Wheat is a staple food crop worldwide. Plant height is a key factor in plant architecture as it plays a crucial role in lodging and thus affects yield and quality. Genome-wide studies are widely applied in crop plants owing to advanced genotyping technologies, identification of novel loci, and improved statistical approaches. Results: In this study, the population was genotyped using the Illumina iSelect 90K single nucleotide polymorphism (SNP) assay, and 22,905 high-quality SNPs were used to perform a genome-wide association study (GWAS) for plant architectural traits employing four multi-locus GWAS (ML-GWAS) and three single-locus GWAS (SL-GWAS) models. As a result, 174 and 97 significant SNPs controlling plant architectural traits were detected by the four ML-GWAS and three SL-GWAS methods, respectively. Among these SNP markers, 43 SNPs were commonly detected, including seven across multiple environments and thirty-six across multiple methods. Interestingly, the five most stable SNPs (Kukri_c34553_89, RAC875_c8121_1490, wsnp_Ex_rep_c66315_64480362, Ku_c5191_340, and tplb0049a09_1302), consistently detected across multiple environments and methods, possibly play a role in modulating plant height and flag leaf length. When comparing ML-GWAS methods, pLARmEB was the most powerful, accounting for the detection of 49 significant SNPs, most of which contributed to plant height (36 SNPs). However, among the SL-GWAS methods, the FarmCPU model detected most of the significant SNPs. Moreover, a total of 152 candidate genes were found that are likely to be involved in plant growth and development, which may provide insightful information related to plant architectural traits. Conclusion: Altogether, our results reveal 174 and 97 significant SNPs controlling plant architectural traits using four ML-GWAS and three SL-GWAS methods, respectively.
The stable loci detected across multiple environments and methods possibly play a role in modulating plant architectural traits in hexaploid wheat and will contribute to the discovery of valuable SNP loci for marker-assisted selection (MAS) in wheat molecular breeding.
... Since the establishment of quantitative molecular genetics, numerous association mapping techniques have been developed to dissect the most important but complex traits in crops (Feng et al., 2016). However, the previously reported literature was concentrated on the SL-GWAS methods, based on a fixed-SNP-effect MLM under a polygenic background and population structure controls. ...
Ovule number (ON), seed per silique (SS), and thousand seed weight (TSW) are the most important but complex traits affecting yield. This study undertook genome-wide association studies (GWAS) on 521 accessions of rapeseed genotyped with the Brassica 60 K SNP array by six multi-locus GWAS (ML-GWAS) and four single-locus GWAS (SL-GWAS) methods. The findings of our study showed that 280 and 31 significant quantitative trait nucleotides/loci (QTNs/QTLs) were detected by the six multi-locus and four single-locus models, respectively. Among these, 74 common significant QTNs were repeatedly detected by more than three ML-GWAS models and in multiple environments. Among the QTNs, 26 were detected via multiple environments, while 13 were detected via multiple methods and environments. Interestingly, 119 QTNs were detected by a single model (pLARmEB), demonstrating that this model is largely reliable and stable. However, in SL-GWAS, the GLM model detected the highest number of QTNs. The distribution of the superior allele results showed that, among 74 common significant QTNs, 28 QTNs were > 50%, while 45 QTNs were < 50%. The highest percentage of superior alleles indicates their probable involvement in yield-determining traits (YDTs) improvement, which may facilitate marker-assisted selection. Furthermore, on the basis of common significant QTNs, 42 candidate genes were detected. We strongly believe that the genetic manipulation of these putative genes may further improve rapeseed molecular breeding for the creation of eco-friendly cultivars with improved yield. The results obtained by this strategy may lead to a breakthrough in rapeseed production at the industrial level.
... With the rapid development of high-throughput sequencing and molecular quantitative genetics, many GWAS methods have appeared for the genetic decryption of complex quantitative traits in plants (Feng et al., 2016). However, SL-GWAS approaches, which are based on a fixed-SNP-effect MLM, were mainly applied in previous studies. ...
Upland cotton (Gossypium hirsutum L.) is the most important source of natural fiber in the world. Early-maturity upland cotton varieties are commonly planted in China. Nevertheless, the lint yield of early-maturity upland cotton varieties is strikingly lower than that of middle- and late-maturity ones. How to effectively improve the lint yield of early-maturing cotton has become a focus of cotton research. Here, based on 72,792 high-quality single nucleotide polymorphisms of 160 early-maturing upland cotton accessions, we performed genome-wide association studies (GWASs) for lint percentage (LP), one of the most important lint-yield component traits, applying one single-locus method and six multi-locus methods. A total of 4 and 45 significant quantitative trait nucleotides (QTNs), respectively, were identified as associated with LP. Interestingly, in two of four planting environments, two of these QTNs (A02_74713290 and A02_75551547) were simultaneously detected via both one single-locus and three or more multi-locus GWAS methods. Among the 42 genes within a genomic region (A02: 74.31–75.95 Mbp) containing the above two peak QTNs, Gh_A02G1269, Gh_A02G1280, and Gh_A02G1295 had the highest expression levels in ovules during seed development from 20 to 25 days post anthesis, whereas Gh_A02G1278 was preferentially expressed in the fibers rather than other organs. These results imply that the four potential candidate genes might be closely related to cotton LP by regulating the proportion of seed weight and fiber yield. The QTNs and potential candidate genes for LP identified in this study provide a valuable resource for cultivating novel cotton varieties with earliness and high lint yield in the future.
... Therefore, it is essential for a comprehensive understanding of the genetic mechanisms underlying complex traits in future work. So far, a large number of association analysis methods have emerged for the genetic dissection of complex traits in plants including single-locus model, multi-locus model, epistasis, and multiple correlated traits (Feng et al. 2016). In our study, however, only a single-locus model was used to search for major effects of NFFB and HNFFB, which might miss other key genetic variants specific to environmental factors and cannot provide reliable estimates for genetic effects. ...
Improving early maturity in upland cotton (Gossypium hirsutum L.) is an important target in breeding. The node of the first fruiting branch (NFFB) and its height (HNFFB) are two important indexes to measure early maturity in cotton. To facilitate breeding for early maturity traits in upland cotton and reveal the genetic control underlying the two traits, a genome-wide association study was performed using 53,848 high-quality single nucleotide polymorphisms (SNPs) out of the 77,774 on the recently developed CottonSNP80K array. A total of 55 target trait-associated SNPs were detected, of which 12 SNPs were for NFFB and 43 were for HNFFB. Two SNPs for NFFB and 22 SNPs for HNFFB were repeatedly detected in at least two environments and/or by two models. These 24 SNPs also exhibited high phenotypic contributions of more than 10% and could be used for marker-assisted selection in future breeding programs. Furthermore, 89 candidate genes were identified in the genome sequence of upland cotton. These genes were categorized through Gene Ontology analysis. Gh_A05G1482 might be a potential candidate gene for improving the early maturation of cotton. These findings reveal the genetic control underlying NFFB and HNFFB and provide insight into genetic improvements for early maturity in upland cotton.
... In the era of cotton functional genomics [38], GWAS is a preferred tool to dissect the genetic basis of cotton traits [20,23,39,40], and several software packages and association models can be applied to study genome-wide associations [41]. Because inadequate association models will result in false positive associations (Type I error) or false negative associations (Type II error) [42,43], the selection of an association method is important. ...
Plant architecture traits influence crop yield. An understanding of the genetic basis of cotton plant architecture traits is beneficial for identifying favorable alleles and functional genes and breeding elite cultivars. We collected 121 cotton accessions including 100 brown-fiber and 21 white-fiber accessions, genotyped them by whole-genome resequencing, and phenotyped them in multiple environments. This genome-wide association study (GWAS) identified 11 quantitative trait loci (QTL) for two plant architecture traits: plant height and fruit spur branch number. Negative-effect alleles were enriched in the elite cultivars. Based on these QTL, gene annotation information, and published QTL, candidate genes and natural genetic variations in four QTL were identified. Ghir_D02G017510 and Ghir_D02G017600 were identified as candidate genes for qD02-FSBN-1, and a premature start codon gain variation was found in Ghir_D02G017510. Ghir_A12G026570, the candidate gene of qA12-FSBN-2, belongs to the pectin lyase-like superfamily, and a significantly associated SNP, A12_105366045 (T/C), in this gene represents an amino acid change. The QTL, candidate genes, and associated natural variations in this study are expected to lay a foundation for studying functional genes and developing breeding programs for desirable architecture in brown-fiber cotton.
Identification of genes associated with bruchid resistance variations in cowpea accessions would help breeders to generate new cowpea cultivars with improved resistance and quality. In this work, 107 cowpea collections from various areas in six countries were phenotyped for their responses to Callosobruchus maculatus and genotyped with single nucleotide polymorphism (SNP) markers. Six multi-locus genome-wide association study models (mrMLM, FASTmrMLM, pKWmEB, pLARmEB, FASTmrEMMA and ISIS EM-BLASSO) were used to associate the genotype data with phenotypic cowpea resistance traits: Percentage of Bruchid Emergence (PBE); Percentage of Weight Loss (PWL); Median Development Period (MDP); Dobie Susceptibility Index (DSI); Number of Eggs Laid (NEL) and Mean Number of Holes (MNH). Out of 14 QTNs, three were associated with more than one trait and were associated with 11 candidate genes located within 10–30 kb of the QTNs. These candidate genes exhibit functionalities associated with cowpea resistance mechanisms. All these results could contribute to the gene networks in resistant cowpea varieties. The result of this study could also increase our knowledge of genetic resistance of cowpea to bruchids and could be useful for molecular breeding.
The present work on walnut drew on the rich genetic resources available at INRAE Nouvelle-Aquitaine-Bordeaux in order to provide tools that can be used in a new breeding programme led by the CTIFL, Lanxade operational centre. Given the strong economic development of walnut, the range of cultivars available in France does not seem sufficient to meet future constraints such as global competition and climate change. The prospecting work, owed mainly to Éric Germain's team, brought together at the Toulenne experimental unit (UEA) most of the species of the genus Juglans and many accessions of cultivated walnut, Juglans regia L. Exploiting the archives accumulated over 30 years made it possible to publish substantial chronological phenotyping data on this collection. These data showed that the phenology of the two control cultivars, 'Lara' and 'Franquette', has advanced in connection with climate change. Using a set of 13 SSR markers, alleles specific to Juglans species were identified and the structure of the collection was studied. This structure shows two main subgroups, one comprising accessions from Eastern Europe and Asia and the other accessions from Western Europe and the United States. A core collection was also defined to carry out GWAS on the main traits of agronomic interest, from flower to fruit, using a 600,000-SNP array developed by the University of California, Davis. Associations between SNPs and several phenology-related traits were revealed, using both the archive data and newly acquired data.
A SNP strongly linked to the budbreak date of leaves and female flowers was identified on chromosome 1 and co-localizes with a QTL detected in parallel in an F1 progeny. A KASP marker was validated with plant material from the University of California, Davis. Other associations were also identified for dichogamy type and fruiting type, a trait that directly affects yield, and led to the definition of candidate genes. Further GWAS analyses were conducted on fruit-related traits, such as nut size, nut weight, cracking yield, and the force required to break the shell. In parallel, methods using robust phenotyping techniques were developed, such as X-ray microtomography to measure all morphological traits without cracking the nut. Finally, the efficiency of the two marker types used in this work, SSR and SNP, was compared. The results show that the 13 SSR markers give results similar to those of several thousand SNPs for structure determination and core collection construction, which are essential steps in the management of genetic resources. Ultimately, the results of this work will make it possible to initiate marker-assisted selection for the creation of new varieties, within a new breeding programme to be led by the CTIFL. These new varieties will be able to meet the criteria sought in the coming years, taking climate change into account.
Genome-Wide Association Studies (GWAS) are widely used in the genetic dissection of complex traits. Most existing methods are based on single-marker association in genome-wide scans with population structure and polygenic background controls. To control the false positive rate, the Bonferroni correction for multiple tests is frequently adopted. This stringent correction results in the exclusion of important loci, especially for GWAS in crop genetics. To address this issue, multi-locus GWAS methodologies have been recommended, i.e., FASTmrEMMA, ISIS EM-BLASSO, mrMLM, FASTmrMLM, pLARmEB, pKWmEB and FarmCPU. In this Research Topic, our purpose is to clarify some important issues in the application of multi-locus GWAS methods. Here we discuss the following subjects: First, we discuss the advantages of new multi-locus GWAS methods over the widely-used single-locus GWAS methods in the genetic dissection of complex traits, metabolites and gene expression levels. Secondly, large experimental error in the field measurement of phenotypic values for complex traits in crop genetics results in relatively large P-values in GWAS, indicating the existence of a small number of significantly associated SNPs. To solve this issue, a less stringent critical P-value is often adopted, i.e., 0.001, 0.0001 and 1/m (m is the number of markers). Although lowering the stringency with which an association is made could identify more hits, confidence in these hits would significantly drop. In this Research Topic we propose a new threshold for significant QTNs (LOD=3.0 or P-value=2.0e-4) in multi-locus GWAS to balance high power and low false positive rate. Thirdly, missing heritability in GWAS is a common phenomenon, and a series of scientists have explained the reasons why the heritability is missing. In this Research Topic, we also add one additional reason and propose the joint use of several GWAS methodologies to capture more QTNs. Thus, overall estimated heritability would be increased.
Finally, we discuss how to select and use these multi-locus GWAS methods.
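The significance thresholds named in this abstract can be related numerically. The following is a minimal Python sketch, assuming that a LOD score is converted to a P-value through a likelihood-ratio statistic with one degree of freedom; this is an assumption about how the two quoted values correspond, not a description of any package's implementation. The function names and the marker count m = 50,000 are hypothetical and chosen only for illustration.

```python
import math


def lod_to_p(lod):
    """Convert a LOD score to a P-value, assuming a 1-df likelihood-ratio test.

    LRT = 2 * ln(10) * LOD, and for a chi-square variable with one degree of
    freedom the upper-tail probability is erfc(sqrt(LRT / 2)).
    """
    lrt = 2.0 * math.log(10.0) * lod
    return math.erfc(math.sqrt(lrt / 2.0))


def bonferroni_threshold(m, alpha=0.05):
    """Per-marker critical P-value under Bonferroni correction for m tests."""
    return alpha / m


def one_over_m_threshold(m):
    """The less stringent 1/m critical P-value (m is the number of markers)."""
    return 1.0 / m


if __name__ == "__main__":
    m = 50_000  # hypothetical number of markers
    print(f"LOD 3.0 -> P = {lod_to_p(3.0):.2e}")
    print(f"Bonferroni (alpha = 0.05): {bonferroni_threshold(m):.2e}")
    print(f"1/m: {one_over_m_threshold(m):.2e}")
```

Under this assumption, LOD = 3.0 corresponds to a P-value of roughly 2.0e-4, which is less stringent than a Bonferroni threshold at alpha = 0.05 for any panel of more than about 250 markers.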
Asian rice, Oryza sativa, is a cultivated, inbreeding species that feeds over half of the world's population. Understanding the genetic basis of diverse physiological, developmental, and morphological traits provides the basis for improving yield, quality and sustainability of rice. Here we show the results of a genome-wide association study based on genotyping 44,100 SNP variants across 413 diverse accessions of O. sativa collected from 82 countries that were systematically phenotyped for 34 traits. Using cross-population-based mapping strategies, we identified dozens of common variants influencing numerous complex traits. Significant heterogeneity was observed in the genetic architecture associated with subpopulation structure and response to environment. This work establishes an open-source translational research platform for genome-wide association studies in rice that directly links molecular variation in genes and metabolic pathways with the germplasm resources needed to accelerate varietal development and crop improvement.
Genome-wide association studies (GWAS) and large-scale replication studies have identified common variants in 79 loci associated with breast cancer, explaining ~14% of the familial risk of the disease. To identify new susceptibility loci, we performed a meta-analysis of 11 GWAS, comprising 15,748 breast cancer cases and 18,084 controls together with 46,785 cases and 42,892 controls from 41 studies genotyped on a 211,155-marker custom array (iCOGS). Analyses were restricted to women of European ancestry. We generated genotypes for more than 11 million SNPs by imputation using the 1000 Genomes Project reference panel, and we identified 15 new loci associated with breast cancer at P < 5 × 10⁻⁸. Combining association analysis with ChIP-seq chromatin binding data in mammary cell lines and ChIA-PET chromatin interaction data from ENCODE, we identified likely target genes in two regions: SETBP1 at 18q12.3 and RNF115 and PDZK1 at 1q21.1. One association appears to be driven by an amino acid substitution encoded in EXO1.
The term epistasis refers to interactions between multiple genetic loci. Genetic epistasis is important in regulating biological function and is considered to explain part of the ‘missing heritability,’ which involves marginal genetic effects that cannot be accounted for in genome-wide association studies. Thus, the study of epistasis is of great interest to geneticists. However, estimating epistatic effects for quantitative traits is challenging due to the large number of interaction effects that must be estimated, thus significantly increasing computing demands. Here, we present a new web server-based tool, the Pipeline for estimating EPIStatic genetic effects (PEPIS), for analyzing polygenic epistatic effects. The PEPIS software package is based on a new linear mixed model that has been used to predict the performance of hybrid rice. The PEPIS includes two main sub-pipelines: the first for kinship matrix calculation, and the second for polygenic component analyses and genome scanning for main and epistatic effects. To accommodate the demand for high-performance computation, the PEPIS utilizes C/C++ for mathematical matrix computing. In addition, the modules for kinship matrix calculations and main and epistatic-effect genome scanning employ parallel computing technology that effectively utilizes multiple computer nodes across our networked cluster, thus significantly improving the computational speed. For example, when analyzing the same immortalized F2 rice population genotypic data examined in a previous study, the PEPIS returned identical results at each analysis step with the original prototype R code, but the computational time was reduced from more than one month to about five minutes. These advances will help overcome the bottleneck frequently encountered in genome-wide epistatic genetic effect analysis and enable accommodation of the high computational demand. The PEPIS is publicly available at http://bioinfo.noble.org/PolyGenic_QTL/.
Most human diseases and agriculturally important traits are complex. Dissecting their genetic architecture requires continued development of innovative and powerful statistical methods. Corresponding advances in computing tools are critical to efficiently use these statistical innovations and to enhance and accelerate biomedical and agricultural research and applications. The genome association and prediction integrated tool (GAPIT) was first released in 2012 and became widely used for genome-wide association studies (GWAS) and genomic prediction. The GAPIT implemented computationally efficient statistical methods, including the compressed mixed linear model (CMLM) and genomic prediction by using genomic best linear unbiased prediction (gBLUP). New state-of-the-art statistical methods have now been implemented in a new, enhanced version of GAPIT. These methods include factored spectrally transformed linear mixed models (FaST-LMM), enriched CMLM (ECMLM), FaST-LMM-Select, and settlement of mixed linear models under progressively exclusive relationship (SUPER). The genomic prediction methods implemented in this new release of the GAPIT include gBLUP based on CMLM, ECMLM, and SUPER. Additionally, the GAPIT was updated to improve its existing output display features and to add new data display and evaluation functions, including new graphing options and capabilities, phenotype simulation, power analysis, and cross-validation. These enhancements make the GAPIT a valuable resource for determining appropriate experimental designs and performing GWAS and genomic prediction. The enhanced R-based GAPIT software package uses state-of-the-art methods to conduct GWAS and genomic prediction. The GAPIT also provides new functions for developing experimental designs and creating publication-ready tabular summaries and graphs to improve the efficiency and application of genomic research.
Although genome-wide association studies (GWASs) have discovered numerous novel genetic variants associated with many complex traits and diseases, those genetic variants typically explain only a small fraction of phenotypic variance. Factors that account for phenotypic variance include environmental factors and gene-by-environment interactions (GEIs). Recently, several studies have conducted genome-wide gene-by-environment association analyses and demonstrated important roles of GEIs in complex traits. One of the main challenges in these association studies is to control effects of population structure that may cause spurious associations. Many studies have analyzed how population structure influences statistics of genetic variants and developed several statistical approaches to correct for population structure. However, the impact of population structure on GEI statistics in GWASs has not been extensively studied, nor have methods been designed to correct for population structure on GEI statistics. In this paper, we show both analytically and empirically that population structure may cause spurious GEIs and use both simulation and two GWAS datasets to support our finding. We propose a statistical approach based on mixed models to account for population structure on GEI statistics. We find that our approach effectively controls population structure on statistics for GEIs as well as for genetic variants.
False positives in a Genome-Wide Association Study (GWAS) can be effectively controlled by a fixed effect and random effect Mixed Linear Model (MLM) that incorporates population structure and kinship among individuals to adjust association tests on markers; however, the adjustment also compromises true positives. The modified MLM method, Multiple Loci Linear Mixed Model (MLMM), incorporates multiple markers simultaneously as covariates in a stepwise MLM to partially remove the confounding between testing markers and kinship. To completely eliminate the confounding, we divided MLMM into two parts, a Fixed Effect Model (FEM) and a Random Effect Model (REM), and use them iteratively. FEM contains testing markers, one at a time, and multiple associated markers as covariates to control false positives. To avoid the model over-fitting problem in FEM, the associated markers are estimated in REM by using them to define kinship. The P values of testing markers and the associated markers are unified at each iteration. We named the new method Fixed and random model Circulating Probability Unification (FarmCPU). Both real and simulated data analyses demonstrated that FarmCPU improves statistical power compared to current methods. Additional benefits include an efficient computing time that is linear in both the number of individuals and the number of markers. Now, a dataset with half a million individuals and half a million markers can be analyzed within three days.
Genome-wide association studies (GWAS) have been widely used in the genetic dissection of complex traits. However, common methods are all based on a fixed-SNP-effect mixed linear model (MLM) and single-marker analysis, such as efficient mixed model analysis (EMMA). These methods require Bonferroni correction for multiple tests, which is often too conservative when the number of markers is extremely large. To address this concern, we proposed a random-SNP-effect MLM (RMLM) and a multi-locus RMLM (MRMLM) for GWAS. The RMLM simply treats the SNP effect as random, but it allows a modified Bonferroni correction to be used to calculate the threshold P value for significance tests. The MRMLM is a multi-locus model including markers selected from the RMLM method with a less stringent selection criterion. Due to the multi-locus nature, no multiple test correction is needed. Simulation studies show that the MRMLM is more powerful in QTN detection and more accurate in QTN effect estimation than the RMLM, which in turn is more powerful and accurate than EMMA. To demonstrate the new methods, we analyzed six flowering-time-related traits in Arabidopsis thaliana and detected more genes than previously reported using EMMA. Therefore, the MRMLM provides an alternative for multi-locus GWAS.
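To make the remark about conservativeness concrete, the short sketch below computes the standard Bonferroni threshold (the significance level divided by the number of tests) for several marker counts. The function name `bonferroni_threshold`, the level 0.05 and the marker counts are illustrative assumptions; the modified correction used by the RMLM is not specified in this abstract and is not reproduced here.

```python
import math

def bonferroni_threshold(alpha, n_markers):
    """Per-marker P-value threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_markers

# Illustrative marker counts: the threshold shrinks in proportion to the count.
for n_markers in (10_000, 500_000, 5_000_000):
    threshold = bonferroni_threshold(0.05, n_markers)
    print(n_markers, threshold, -math.log10(threshold))
```

With half a million markers a single test must reach a P value of about 1e-7 to be declared significant, which is why small-effect loci are easily missed in single-marker scans.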
We describe a model-based clustering method for using multilocus genotype data to infer population structure and assign individuals to populations. We assume a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more populations if their genotypes indicate that they are admixed. Our model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers, provided that they are not closely linked. Applications of our method include demonstrating the presence of population structure, assigning individuals to populations, studying hybrid zones, and identifying migrants and admixed individuals. We show that the method can produce highly accurate assignments using modest numbers of loci—e.g., seven microsatellite loci in an example using genotype data from an endangered bird species. The software used for this article is available from http://www.stats.ox.ac.uk/~pritch/home.html.
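The probabilistic assignment of an individual to populations can be illustrated with a simplified calculation. The sketch below assumes that the allele frequencies of each population are already known, that loci are biallelic and independent, that genotypes follow Hardy-Weinberg proportions, and that the prior over populations is uniform. The method described in the abstract infers the frequencies and the assignments from the data, so this is only the assignment step under those assumptions; the function name `membership_posterior` and the example numbers are hypothetical.

```python
import math

def membership_posterior(genotype, freqs):
    """Posterior probability that an individual comes from each population.

    genotype: copies (0, 1 or 2) of a reference allele at each locus.
    freqs: one list per population with the reference-allele frequency at
           each locus; every frequency must lie strictly between 0 and 1.
    """
    log_liks = []
    for pop in freqs:
        ll = 0.0
        for g, p in zip(genotype, pop):
            # Binomial(2, p) genotype probability under Hardy-Weinberg proportions
            ll += math.log(math.comb(2, g) * p ** g * (1 - p) ** (2 - g))
        log_liks.append(ll)
    # Normalize on the log scale to avoid underflow with many loci.
    top = max(log_liks)
    weights = [math.exp(ll - top) for ll in log_liks]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical example: three loci, two populations.
posterior = membership_posterior([2, 2, 0], [[0.9, 0.8, 0.1], [0.2, 0.3, 0.7]])
print(posterior)
```

Even with three loci the genotype in this example is assigned almost entirely to the first population, which is consistent with the abstract's point that modest numbers of informative loci can give accurate assignments.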