Figure - available from: Frontiers in Genetics
Comparison of unadjusted phenotypes for infection type (IT) and disease severity (SEV) over years and locations in the diversity panel and breeding line training populations using Kruskal–Wallis test. Significant differences were based on p-values “*” < 0.05, “**” < 0.01, and “***” < 0.001.
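The comparison described in the caption can be sketched as follows. This is a minimal illustration, not the authors' analysis: the scores and the environment names (`env_A`, `env_B`, `env_C`) are made up, and `significance_stars` is a hypothetical helper that only encodes the thresholds quoted in the caption.

```python
from scipy.stats import kruskal


def significance_stars(p_value):
    """Map a p-value to the star notation quoted in the caption."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return "ns"  # not significant at the 0.05 level


# Hypothetical infection type (IT) scores for three year-location
# environments; these values are invented for illustration only.
scores_by_env = {
    "env_A": [2, 3, 3, 4, 5, 3, 2, 4],
    "env_B": [5, 6, 7, 6, 8, 7, 6, 7],
    "env_C": [3, 4, 4, 5, 4, 3, 5, 4],
}

# Kruskal-Wallis is rank-based, so it does not assume normally
# distributed scores, which suits ordinal ratings such as IT.
statistic, p_value = kruskal(*scores_by_env.values())
stars = significance_stars(p_value)
print(f"H = {statistic:.2f}, p = {p_value:.4g} ({stars})")
```

A significant result only says that at least one environment differs from the others; identifying which ones would need a follow-up pairwise comparison.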

Source publication
Article
Full-text available
Most genomic prediction models are linear regression models that assume continuous and normally distributed phenotypes, but responses to diseases such as stripe rust (caused by Puccinia striiformis f. sp. tritici) are commonly recorded in ordinal scales and percentages. Disease severity (SEV) and infection type (IT) data in germplasm screening nurs...

Similar publications

Article
Full-text available
Ear diameter (ED) is a critical component of grain yield (GY) in maize (Zea mays L.). Studying the genetic basis of ED in maize is of great significance in enhancing maize GY. Against this backdrop, this study was framed to (1) map the ED-related quantitative trait locus (QTL) and SNPs associated with ED; and (2) identify putative functional genes...
Article
Full-text available
Eleusine coracana (L.) Gaertn., commonly known as finger millet, is a multipurpose crop used for food and feed. Genomic tools are required for the characterization of crop gene pools and their genomics-led breeding. High-throughput sequencing-based characterization of finger millet germplasm representing diverse agro-ecologies was considered an eff...
Article
Full-text available
The native, perennial shrub American hazelnut ( Corylus americana ) is cultivated in the Midwestern United States for its significant ecological benefits, as well as its high‐value nut crop. Implementation of modern breeding methods and quantitative genetic analyses of C. americana requires high‐quality reference genomes, a resource that is current...
Article
Full-text available
Phalaenopsis spp. represent the most popular orchids worldwide. Both P. equestris and P. aphrodite are the two important breeding parents with the whole genome sequence available. However, marker–trait association is rarely used for floral traits in Phalaenopsis breeding. Here, we analyzed markers associated with aesthetic traits of Phalaenopsis or...
Article
Full-text available
The eastern-most members of the Salmo trutta species complex in the Aralo-Caspian Sea region were studied to infer their population genetic structure and biogeographic origin. A total of 68 individuals collected from Iranian endorheic inland basins (Namak and Urmia lakes), tributaries of the Caspian (Haraz, Kura, Samur, Volga, and Ural river draina...

Citations

... Several statistical models have been developed for GS, and while they differ in their assumptions and complexity, they all aim to predict the phenotypic value of individuals with their genomic and other available kinds of information (Meuwissen et al. 2001; Gianola and Kaam 2008; Habier et al. 2011). The most developed models are for quantitative traits, which assume continuous and normally distributed phenotypes (Montesinos-López et al. 2015c; Merrick et al. 2022). However, for non-Gaussian phenotypes, there are fewer developed models available. ...
Article
Full-text available
Key message Genomic prediction models for quantitative traits assume continuous and normally distributed phenotypes. In this research, we proposed a novel Bayesian discrete lognormal regression model. Abstract Genomic selection is a powerful tool in modern breeding programs that uses genomic information to predict the performance of individuals and select those with desirable traits. It has revolutionized animal and plant breeding, as it allows breeders to identify the best candidates without labor-intensive and time-consuming phenotypic evaluations. While several statistical models have been developed, most of them have been for quantitative continuous traits and only a few for count responses. In this paper, we propose a discrete lognormal regression model in the Bayesian context, with a Gibbs sampler to explore the corresponding posterior distribution and make the predictions. Two wheat disease resistance datasets are used to evaluate the model against the traditional Gaussian model and a lognormal model. The results indicate the proposed model is a competitive and natural model for predicting count genomic traits.
... In animal science, ordinal and continuous data were compared, and the use of threshold traits resulted in markedly lower accuracy than a linear model (Kizilkaya et al. 2014). More recently, Merrick et al. (2022) reported that using machine learning methods led to higher predictive accuracy for the classification and prediction of traits with skewed distributions. However, most of these studies have focused on the predictive performance rather than estimating genetic parameters of key interest, such as selection gain and marker effects. ...
Article
Full-text available
Key message An approach for handling visual scores with potential errors and subjectivity in scores was evaluated in simulated and blueberry recurrent selection breeding schemes to assist breeders in their decision-making. Abstract Most genomic prediction methods are based on assumptions of normality due to their simplicity and ease of implementation. However, in plant and animal breeding, continuous traits are often visually scored as categorical traits and analyzed as a Gaussian variable, thus violating the normality assumption, which could affect the prediction of breeding values and the estimation of genetic parameters. In this study, we examined the main challenges of visual scores for genomic prediction and genetic parameter estimation using mixed models, Bayesian, and machine learning methods. We evaluated these approaches using simulated and real breeding data sets. Our contribution in this study is a five-fold demonstration: (i) collecting data using an intermediate number of categories (1–3 and 1–5) is the best strategy, even considering errors associated with visual scores; (ii) Linear Mixed Models and Bayesian Linear Regression are robust to the normality violation, but marginal gains can be achieved when using Bayesian Ordinal Regression Models (BORM) and Random Forest Classification; (iii) genetic parameters are better estimated using BORM; (iv) our conclusions using simulated data are also applicable to real data in autotetraploid blueberry; and (v) a comparison of continuous and categorical phenotypes found that investing in the evaluation of 600–1000 categorical data points with low error, when it is not feasible to collect continuous phenotypes, is a strategy for improving predictive abilities. Our findings suggest the best approaches for effectively using visual scores traits to explore genetic information in breeding programs and highlight the importance of investing in the training of evaluator teams and in high-quality phenotyping.
... Theoretically, it is not reasonable to use the method of analyzing linear traits to analyze ordered categorical traits. If the ordered categorical traits are treated as linear traits, the following assumptions are not valid in the linear model of GS: (1) there is a linear relationship between genotype and phenotype; (2) the phenotype is normally distributed; (3) the variance is constant rather than a function of the expected value [7][8][9]. Therefore, the linear model of GS used in aquaculture breeding does not satisfy the assumptions required for ordered categorical traits. ...
Article
Full-text available
Ordered categorical traits are commonly used in fish breeding programs as they are easier to obtain than continuous observations. However, most studies treat ordered categorical traits as linear traits and analyze them using linear models, which can lead to a serious reduction in prediction accuracy by violating the basic assumptions of linear models. The aim of this study was to evaluate the advantages of Bayesian threshold model and machine learning method in genomic prediction of ordered categorical traits in fish. The study was based on the analyses of simulated data and real data of Atlantic salmon. Ordinal categorical traits were simulated with varying numbers of categories (2, 3 and 4) and levels of heritabilities (0.1, 0.3 and 0.5). Linear and threshold models with BayesA and BayesCπ methods, as well as a machine learning method, support vector regression with default (SVRdef) and tuning (SVRtuning) hyperparameters were used to investigate their prediction abilities. The results showed that Bayesian threshold models yielded 2.1%, 2.6% and 2.9% higher prediction accuracies on average for 2-, 3- and 4-category traits, respectively, than Bayesian linear models. Furthermore, SVRtuning produced higher prediction accuracy compared with SVRdef and Bayesian threshold models in all scenarios. For real data, Bayesian threshold models yielded 1.2% higher prediction accuracy than Bayesian linear models, and SVRdef and SVRtuning yielded 3.3% and 6.6% higher prediction accuracies than Bayesian methods, respectively. In conclusion, the use of Bayesian threshold model and machine learning method was beneficial for genomic prediction of ordered categorical traits in fish.
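One common way to simulate an ordered categorical trait at a given heritability is a liability threshold model, sketched below. The abstract does not say how its traits were simulated, so everything here is an assumption for illustration: the function name `simulate_ordinal_trait`, the random 0/1/2 genotypes, the normally distributed marker effects, and the equal-frequency thresholds.

```python
import numpy as np


def simulate_ordinal_trait(n_individuals, n_markers, heritability,
                           n_categories, seed=0):
    """Simulate an ordinal trait by thresholding a latent liability.

    Hypothetical sketch: genotypes, marker effects, and thresholds
    are illustrative assumptions, not the study's settings.
    """
    rng = np.random.default_rng(seed)

    # Random genotypes coded as 0, 1, or 2 copies of an allele.
    genotypes = rng.integers(0, 3, size=(n_individuals, n_markers)).astype(float)
    marker_effects = rng.normal(size=n_markers)

    # Standardize genetic values, then scale so that their sample
    # variance equals the target heritability.
    genetic_values = genotypes @ marker_effects
    genetic_values = (genetic_values - genetic_values.mean()) / genetic_values.std()
    genetic_values *= np.sqrt(heritability)

    # Residual variance is 1 - h2, so the liability variance is about 1.
    residuals = rng.normal(scale=np.sqrt(1.0 - heritability), size=n_individuals)
    liability = genetic_values + residuals

    # Assumed equal-frequency thresholds on the liability scale.
    quantiles = np.linspace(0.0, 1.0, n_categories + 1)[1:-1]
    thresholds = np.quantile(liability, quantiles)

    # Categories are labeled 1..n_categories in liability order.
    categories = np.digitize(liability, thresholds) + 1
    return genotypes, liability, categories


genotypes, liability, categories = simulate_ordinal_trait(
    n_individuals=500, n_markers=200, heritability=0.3, n_categories=3
)
print(np.bincount(categories)[1:])  # individuals per category
```

Treating `categories` as a continuous response in a linear model discards the fact that the observed scores are a coarse, non-linear view of `liability`, which is the situation threshold models are designed for.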
... In animal science, ordinal and continuous data were compared, and the use of threshold traits resulted in markedly lower accuracy than a linear model (Kizilkaya et al. 2014). More recently, Merrick et al. (2022) reported that using machine learning methods led to higher predictive accuracy for the classification and prediction of traits with skewed distributions. ...
Preprint
Full-text available
Most genomic prediction methods are based on assumptions of normality due to their simplicity and ease of implementation. However, in plant and animal breeding, traits are often collected as categorical data, thus violating the normality assumption, which could affect the prediction of breeding values and the estimation of genetic parameters. In this study, we examined the main challenges of categorical phenotypes in genomic prediction and genetic parameter estimation using mixed models, Bayesian and machine learning methods. We evaluated these approaches using simulated and real breeding data sets. Our contribution in this study is a five-fold demonstration: (i) collecting data using an intermediate number of categories (1 to 3 and 1 to 5) is the best strategy, even considering errors associated with visual scores; (ii) Linear Mixed Models and Bayesian Linear Regression are robust to the normality violation, but marginal gains can be achieved when using Bayesian Ordinal Regression Models (BORM) and Random Forest Classification; (iii) genetic parameters are better estimated using BORM; (iv) our conclusions using simulated data are also applicable to real data in autotetraploid blueberry; and (v) a comparison of continuous and categorical phenotypes found that investing in the evaluation of 600–1000 categorical data points with low error, when it is not feasible to collect continuous phenotypes, is a strategy for improving predictive abilities. Our findings suggest the best approaches for effectively using categorical traits to explore genetic information in breeding programs and highlight the importance of investing in the training of evaluator teams and in high-quality phenotyping.
... Regression analysis is a statistical method to determine the quantitative relationship between two or more variables. Based on observational data, regression analysis could establish appropriate dependencies between variables and analyze the inherent rules of data (Merrick et al., 2022). It is widely used for forecasting in the healthcare field. ...
Article
Full-text available
Artificial intelligence (AI) based on the perspective of data elements is widely used in the healthcare informatics domain. Large amounts of clinical data from electronic medical records (EMRs), electronic health records (EHRs), and electroencephalography records (EEGs) have been generated and collected at an unprecedented speed and scale. For instance, the new generation of wearable technologies makes it easy to collect people’s daily health data such as blood pressure, blood glucose, and physiological data, and the application of EHRs documents large amounts of patient data. The cost of acquiring and processing health big data is expected to fall dramatically with the help of AI technologies and open-source big data platforms such as Hadoop and Spark. The application of AI technologies in health big data presents new opportunities to discover the relationships among living habits, sports, inheritance, diseases, symptoms, and drugs. Meanwhile, with the fast development of AI technologies, many promising methodologies have recently been proposed in the healthcare field. In this paper, we review and discuss the application of machine learning (ML) methods in health big data in two major aspects: (1) special features of health big data, including multimodality, incompleteness, time validation, redundancy, and privacy; and (2) ML methodologies in the healthcare field, including classification, regression, clustering, and association. Furthermore, we review the recent progress and breakthroughs of automatic diagnosis in health big data and summarize the challenges, gaps, and opportunities to improve and advance automatic diagnosis in the health big data field.
Preprint
Full-text available
Genomic selection is a powerful tool in modern breeding programs that uses genomic information to predict the performance of individuals and select those with desirable traits. It has revolutionized animal and plant breeding, as it allows breeders to identify the best candidates without labor-intensive and time-consuming phenotypic evaluations. While several statistical models have been developed, most of them have been for quantitative continuous traits and only a few for count responses. In this paper, we propose a discrete lognormal regression model in the Bayesian context, with inference by a Gibbs sampler that explores the corresponding posterior distribution and makes the predictions. A wheat disease resistance data set is used to evaluate the model against the traditional Gaussian model and a lognormal model over the located response. The results indicate the proposed model is a competitive and natural model for predicting count genomic traits.