
Ascertaining Genetics of Beta-Thalassemia and Sickle Cell Disease using Molecular Techniques and Machine Learning Heuristics

Authors:
  • Amrita Vishwa Vidyapeetham, Amrita school of Biotechnology
  • NITTE University, India

Abstract

Hemoglobinopathies are a group of disorders in which the hemoglobin molecule is produced abnormally or has an abnormal structure. Sickle cell disease (SCD) is a blood disorder that affects the hemoglobin molecules in red blood cells (RBCs), and thalassemia is one of the major monogenic disorders that reduce hemoglobin production. This disorder results in a large number of red blood cells being destroyed, leading to anemia. India bears a huge burden of hemoglobinopathies, of which thalassemia is the most prevalent. Since SCD is genetic and a person either has it or not at birth, there is no way to prevent it. However, a blood test can be used to check for the illness, even during pregnancy. A key component of thalassemia prevention is a successful screening procedure to identify thalassemia carriers. Effective screening programs face numerous obstacles, especially in resource-limited settings. Machine learning (ML) has been used to solve technical and domain-specific problems in a variety of prognostic and diagnostic medical tasks. The objectives of this study are to identify and analyze the most common mutations of beta-thalassemia and sickle cell disease in the north Indian population, and to apply ML-based algorithms to thalassemia screening in order to accurately predict the beta-thalassemia carrier state from a simple blood test and to predict pathogenic hemoglobin variants in a group of individuals. This study contributes to the validation of the models based on data from several individuals and hemoglobinopathies.
Aswathi P, Anjana SR, Somesh Kumar, R Shyamprasad Rao, Seema Kapoor,
Prashanth Suravajhala and Sunil Kumar Polipalli
INTRODUCTION
Hemoglobinopathies are a group of blood disorders that affect red blood cells.
Hemoglobin, a protein found in red blood cells, transports oxygen throughout the body and
absorbs carbon dioxide. A hemoglobinopathy can result in an abnormal level of
hemoglobin production or an aberrant structure of the protein. Common types of
hemoglobinopathy are sickle cell disease (SCD), thalassemia, hemoglobin C disease, and
hemoglobin E/D disease (Kohne, 2011).
Thalassemia
Thalassemia, a name derived from the Greek words for ‘sea’ and ‘blood’, is an inherited
hematological disorder. This genetic abnormality reduces the production of hemoglobin, which
is the key molecule transporting oxygen in the human body. Red blood cells
(RBCs) carry oxygen to all body cells, and oxygen is a kind of fuel that cells use to
function. When there are not enough red blood cells, not enough oxygen is delivered
to the body's other cells, which may cause a person to feel tired, weak, or short of breath.
This condition is called anemia. People with thalassemia may have mild or severe anemia.
Severe anemia can damage organs and lead to death (Bajwa et al., 2022).
The first case series, of five children with thalassemia, was presented by Dr. Thomas Cooley of
Children's Hospital in Michigan, USA. In 1926, Cooley described a clinical
syndrome in children of Mediterranean descent characterized by severe anemia, yellowish
discoloration of the eyes (jaundice), and failure to thrive. Although the disease was first
recognized among people of Mediterranean origin, thalassemia is seen in every ethnic group and
geographic location. It is more common in the Mediterranean region (including
Italy and Greece), Sub-Saharan Africa, the Middle East, the Indian subcontinent, and East and
Southeast Asia than in other parts of the world, so these regions are traditionally
known as the ‘thalassemia belt’.
In this genetic disease, one of the two types of globin gene (alpha or beta) whose products
make up the hemoglobin tetramer (four protein subunits) is mutated or deleted. Normal adult
hemoglobin molecules contain haem, two alpha-globin subunits, and two beta-globin subunits.
If either the alpha or the beta subunit is not made, the body cannot make normal amounts of
hemoglobin, which reduces oxygen delivery to the tissues, where oxygen is needed for energy
production and metabolism. Reduction or absence of the alpha-globin chain leads to
alpha-thalassemia, and reduction or absence of the beta-globin chain leads to
beta-thalassemia. The synthesis of alpha-globin chains is controlled by two linked alpha
genes on each chromosome (four alleles in total). The clinical picture is a spectrum that
correlates with the severity of the genetic abnormality, which ranges from a mutation of a
single allele to a mutation of all four alleles. This constellation of clinical pictures is known
as alpha thalassemia.
Beta thalassemia is caused by mutations in one or both alleles of the Hemoglobin Subunit
Beta (HBB) gene and is inherited in an autosomal recessive manner. The synthesis of the beta
chain is controlled by the beta-globin gene cluster on chromosome 11. A "carrier" form known
as beta thalassemia minor, with moderate hypochromic microcytic anemia, results when just one
allele is mutated. Individuals with thalassemia minor can be asymptomatic, which makes them
hard to identify, but they pass the mutant HBB gene from generation to generation. More severe
symptoms result from the homozygous state. There are also other associated anemias, such as
aplastic anemia, which we at the Systems Genomics Lab have worked on, but these are beyond the
scope of this work.
Beta Thalassemia
Beta thalassemias are a group of hereditary blood disorders characterized by reduced or absent
beta-globin chain synthesis, which results in lower levels of hemoglobin (Hb) in red blood
cells (RBCs), lower RBC production, and anemia. Beta thalassemia can be classified into
three types: thalassemia major, also known as "Cooley's anemia"; thalassemia intermedia;
and thalassemia minor, also known as "beta-thalassemia carrier", "beta-thalassemia trait",
or "heterozygous beta-thalassemia". Beta thalassemia can occur together with Hb variants
such as HbC, HbE, and HbS. It is also associated with trichothiodystrophy and
X-linked thrombocytopenia. The degree of globin chain imbalance in beta-thalassemia is
determined by the nature of the mutation in the beta gene. Beta-globin chains are encoded by
the HBB gene on chromosome 11, and each person carries two alleles of this gene, one
inherited from each parent. The severity of beta thalassemia depends firstly on
the number of alleles affected, and then on whether the lesion is a gene deletion or a
non-deletional mutation (Figure 3). If there is:
  • One mutated allele: the condition is called thalassemia minor, and it generates only
    mild symptoms.
  • Two mutated alleles: the condition is known as thalassemia major, and it has severe
    symptoms.
A child born with two defective beta-globin alleles is usually healthy at birth but
develops symptoms within the first two years of life. A milder form of beta thalassemia
is known as thalassemia intermedia.
The rate of synthesis of beta polypeptide chains is reduced, to a degree ranging from mild to
severe, if the mutation is β+, while there is no detectable beta-globin production from the
mutated allele in β0 mutations. The heterozygous genotypes of beta thalassemia are β/β+ and
β/β0. These are usually clinically mild conditions that result in mild anemia and microcytosis.
Severe mutations in the β0/β0, β+/β+, or β+/β0 genotypes result in
transfusion-dependent thalassemia major, whereas
milder β+ mutations in the β+/β+ genotype may result in thalassemia intermedia.
β-thalassemias are inherited in an autosomal recessive manner. When both parents are
carriers, each sibling of an affected individual has, at conception, a 25% chance of being
affected, a 50% chance of being an asymptomatic carrier, and a 25% chance of being
unaffected and not a carrier. Heterozygous carriers may have mild anemia, but this is
clinically insignificant. Carriers are often referred to as having thalassemia minor.
Testing of at-risk individuals (including family members, gamete donors, and members of
at-risk ethnic groups) is possible. If two different mutated alleles are present at the
same gene locus, the individual is referred to as compound heterozygous; this can result in
beta thalassemia major, which can cause severe anemia.
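The 25% / 50% / 25% figures for the children of two carrier parents follow from counting the four equally likely allele combinations. The sketch below performs that count under simple Mendelian inheritance. The labels "A" (normal HBB allele) and "b" (mutated HBB allele) and the function name are illustrative assumptions, not notation used in this study.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def offspring_distribution(parent1, parent2):
    """Probability of each offspring genotype, given two parental genotypes.

    Each parent is a two-character string of alleles; one allele is
    passed on from each parent with equal probability.
    """
    counts = Counter(
        "".join(sorted(a1 + a2)) for a1, a2 in product(parent1, parent2)
    )
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


# Illustrative labels: "A" = normal allele, "b" = mutated allele.
carrier = "Ab"
distribution = offspring_distribution(carrier, carrier)
for genotype, probability in sorted(distribution.items()):
    print(genotype, probability)
```

For two carriers this gives 1/4 "AA" (unaffected, not a carrier), 1/2 "Ab" (carrier), and 1/4 "bb" (affected).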
Figure 1: Difference between alpha and beta-thalassemia (image courtesy of
https://3billion.io/blog/rare-disease-series-3-thalassemia, last accessed on May 23, 2023)
Sickle Cell Disease (SCD)
SCD affects the hemoglobin molecules in red blood cells (RBCs). When blood flows
through the lungs, hemoglobin binds to oxygen, enabling red blood cells to transport oxygen
throughout the body. The ability of
hemoglobin to deliver oxygen is compromised when it is abnormal, as in SCD, and the abnormal
hemoglobin can also cause the RBCs to curve and become rigid, giving the cells a “sickle”
shape. When a person inherits two hemoglobin "S" genes, they have sickle cell anemia, the
most prevalent and dangerous form of SCD. Children with the disease may experience slow
growth or delayed development, and some patients may experience chronic (long-term) pain.
The brain, kidneys, liver, lungs, eyes, heart, spleen, genitals, joints, and skin are just a few of
the organs that sickle cell disease may damage over time (Rees, 2010).
Figure 2: Difference between Normal Red Blood cell and Sickle cell, (image courtesy of
https://www.topdoctors.co.uk/medical-dictionary/sickle-cell-disease last accessed on June 16,
2023)
Hemoglobin variants
These are caused by a qualitative defect in the genetic code that leads to structural changes in
the hemoglobin molecule. Most alpha and beta globin chain variants are clinically silent and
are discovered incidentally or during the screening of family members of a patient. A few
variant hemoglobins are capable of causing severe disease, especially in the homozygous
state (eg: HbS) or when inherited in conjunction with another variant or a thalassemia
mutation. Common examples of variant hemoglobins in India include HbS, HbE, and HbD
(Thom, 2013).
Hemoglobin S (HbS)
Sickle hemoglobin (HbS), a beta-globin gene variant, is the main cause of sickle cell
disease. The disease is autosomal recessive: two copies of HbS, or one copy of
HbS plus another beta-globin variant (such as HbC), are needed for it to develop. Symptoms
include chronic anemia, acute chest syndrome, stroke, splenic and renal dysfunction, pain
crises, and susceptibility to bacterial infections. Pediatric mortality is primarily due to
bacterial infection and stroke. In adults, specific causes of mortality are more varied, but
individuals with more symptomatic disease may exhibit early mortality. In recent years,
newborn screening, better medical care, parent education, and penicillin prophylaxis have
successfully reduced morbidity and mortality due to HbS (Ashley-Koch, 2000).
Hemoglobin E (HbE)
Hemoglobin E (HbE) disease is a mild, inherited blood disorder characterized by an abnormal
form of hemoglobin, called hemoglobin E. People with this condition may have very mild
anemia, but the condition typically does not cause any symptoms. It is inherited in an
autosomal recessive manner and is caused by a genetic change in the HBB gene. The genetic
change that causes HbE disease primarily occurs in Southeast Asian populations, and rarely
in Chinese populations (Hirsch, 2016).
Machine Learning and Predictive Modelling
"Machine learning" (ML) is defined as a technique in which a machine can learn to perform
tasks from given information without being explicitly programmed. Deep learning is a newer
type of machine learning that uses multiple layers of neural networks to detect complex
patterns in data, including non-linear ones. Today, computers have achieved human-level
performance in image object recognition tasks using convolutional neural networks (CNNs) and
artificial neural networks (ANNs). Deep learning also performs well on certain narrow natural
language processing tasks, including speech recognition and the analysis of natural language
text to develop predictive models. In medicine, deep learning models have achieved
physician-level accuracy for many diagnostic tasks, including detecting melanoma, diabetic
retinopathy, and cardiovascular risk, detecting breast lesions on mammography, and analyzing
the spine with MRI. One deep learning model has even been reported to be effective across a
range of diagnostic tasks. In this study, four
machine learning algorithms are used to accurately predict beta-thalassemia carriers.
Machine learning is a branch of artificial intelligence (AI) that focuses on the development of
algorithms and models that enable computers to learn from and make predictions or decisions
based on data. It involves creating computer systems that automatically learn and improve
from experience without being explicitly programmed. In machine learning, algorithms are
designed to analyze and interpret patterns in data, identify relationships, and make predictions
or take actions based on those patterns. The learning process involves training a model on a
large amount of labeled or unlabeled data, allowing it to recognize patterns and make
informed decisions or predictions when presented with new, unseen data.
The main goal of machine learning is to develop algorithms that can generalize from the
training data and perform well on new, unseen data. This ability to learn and adapt from data
enables machine learning models to solve complex problems, make accurate predictions, and
automate decision-making processes (Tarca et al., 2007). Machine learning algorithms can be
broadly categorized into supervised, unsupervised, and reinforcement learning. Supervised
learning involves training models with labeled data, where the algorithm learns from
input-output pairs. Unsupervised learning focuses on discovering patterns or structures in
unlabeled data. Reinforcement learning involves training models to interact with an
environment and learn through feedback signals. Machine learning has numerous
applications in various fields, including image and speech recognition, natural language
processing, recommendation systems, fraud detection, medical diagnosis, and autonomous
vehicles. It continues to advance rapidly, driven by advancements in computing power,
availability of large datasets, and improvements in algorithmic techniques.
Deep learning is a subset of machine learning that focuses on training multi-layered artificial
neural networks to learn and make intelligent decisions or predictions based on complex data.
It is inspired by the structure and function of the human brain and its interconnected network
of neurons. In deep learning, neural networks consist of multiple layers of interconnected
nodes called artificial neurons or units. Each neuronal layer processes and transforms the
input data, progressively extracting higher-level features and representations as the
information passes through the network. Layers are typically organized into an input layer,
one or more hidden layers, and an output layer. The main advantage of deep learning is its
ability to automatically learn hierarchical representations of data, which enables more
efficient and accurate feature extraction compared to traditional machine learning algorithms.
Deep learning models can learn directly from unprocessed raw data without manually
designing features. Deep learning models are trained using large amounts of data, where the
parameters of the neural network are adjusted through a process called backpropagation.
Backpropagation involves computing the gradient of the loss function with respect to the
model's parameters and using optimization algorithms such as stochastic gradient descent to
update the parameters so as to minimize that loss. Deep learning has revolutionized
several fields, including computer vision, natural language processing, speech recognition,
and many others. It has achieved remarkable results in tasks such as image and object
recognition, speech synthesis and recognition, language translation, and even playing
complex games such as Go and chess. The success of deep learning is largely due to the
availability of large data sets, advances in computing power, and deep neural network
architectures such as convolutional neural networks (CNNs) for computer vision and recurrent
neural networks (RNNs) for the sequential processing of data. These advances have made it
possible to train deep-learning models with millions of parameters, allowing them to learn
complex patterns and make sophisticated predictions.
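The parameter update described above (compute the gradient of the loss, then step against it) can be illustrated on the smallest possible case: a single sigmoid unit trained by stochastic gradient descent. This is a sketch, not the network used in this study; the toy data (the logical OR function), the learning rate, and the epoch count are assumptions chosen for illustration. Backpropagation applies the same update layer by layer via the chain rule.

```python
import math

# Toy data set (an assumption for illustration): the logical OR function.
INPUTS = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
TARGETS = [0.0, 1.0, 1.0, 1.0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Output of one sigmoid unit for a two-feature input."""
    return sigmoid(weights[0] * x[0] + weights[1] * x[1] + bias)


def mean_loss(weights, bias):
    """Mean binary cross-entropy over the toy data set."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, t in zip(INPUTS, TARGETS):
        p = min(max(predict(weights, bias, x), eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(INPUTS)


def train(epochs=2000, learning_rate=0.5):
    """Stochastic gradient descent: one parameter update per example."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, t in zip(INPUTS, TARGETS):
            # For a sigmoid output with cross-entropy loss, the gradient
            # with respect to the pre-activation is (prediction - target).
            error = predict(weights, bias, x) - t
            weights[0] -= learning_rate * error * x[0]
            weights[1] -= learning_rate * error * x[1]
            bias -= learning_rate * error
    return weights, bias


initial_loss = mean_loss([0.0, 0.0], 0.0)
weights, bias = train()
final_loss = mean_loss(weights, bias)
predictions = [int(predict(weights, bias, x) >= 0.5) for x in INPUTS]
print(initial_loss, final_loss, predictions)
```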
MATERIALS AND METHODS
Figure 3: Workflow used to perform analysis of mutations.
Clinical Samples: The clinical samples were collected directly from the Genome Sequencing
lab, Lok Nayak Hospital, MAMC, Delhi, India. Since this is a retrospective study, a priori
ethics clearance was obtained (F.1/IEC/MAMC/70/05/2019/No 543). Thirty samples were obtained
to analyze sickle cell disease and 90 samples to analyze beta-thalassemia. Another 250
samples were used to predict hemoglobin variants, and 370 samples were used for the
prediction of beta-thalassemia carriers using machine learning algorithms.
DNA EXTRACTION: We used the phenol-chloroform extraction method to isolate
DNA. Each sample was collected and lysed to break open the cells, and the DNA was
separated from the other cellular components. Purification methods such as magnetic bead
separation were then employed to separate and concentrate the DNA. After
purification, the quantity and quality of the DNA were measured using spectrophotometry to
ensure that it was suitable for downstream applications.
GEL ELECTROPHORESIS: Gel electrophoresis involves loading the DNA
sample into wells in a gel matrix, typically made of agarose or polyacrylamide. An electric
current was applied to the gel, causing the DNA fragments to migrate through the matrix
according to their size and charge. Smaller fragments migrated further through the gel, while
larger fragments moved more slowly. After electrophoresis, the DNA fragments were visualized
using stains such as ethidium bromide or SYBR Safe. This makes it possible both to
confirm the presence of the desired DNA fragments and to estimate their size by comparison
with other fragments in the samples. By performing gel electrophoresis before PCR, we
ensured that the DNA sample was of high quality and contained the target DNA fragment. This
helped increase the accuracy and specificity of the PCR amplification and subsequent
downstream analysis (Figure 4).
PCR AND GEL ELECTROPHORESIS: PCR was performed after gel electrophoresis to
selectively amplify specific DNA fragments from a complex mixture of DNA fragments. The PCR
method is highly versatile and can be used for various applications, including DNA
sequencing, genotyping, gene expression analysis, and molecular cloning. The technique has
revolutionized many areas of molecular biology and has become an essential tool for research
and diagnostic laboratories. During PCR, DNA is amplified exponentially, but the reaction
can also generate small amounts of non-specific amplification products or contaminants. Gel
electrophoresis allowed us to visualize the amplified DNA and determine whether the desired
product had been amplified.
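The exponential amplification mentioned above is commonly idealized as N = N0 (1 + E)^n, where N0 is the starting number of template copies, n is the number of cycles, and E is the per-cycle efficiency (E = 1 means perfect doubling). The sketch below evaluates that textbook model; the function name and the example values are illustrative assumptions, not measurements from this study.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Expected number of copies after a given number of PCR cycles.

    efficiency is the fraction of templates duplicated in each cycle:
    1.0 is ideal doubling, and lower values model a less efficient reaction.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


# Hypothetical example: 100 template copies, 30 cycles.
ideal = pcr_copies(100, 30)
realistic = pcr_copies(100, 30, efficiency=0.9)
print(f"ideal: {ideal:.3e} copies, at 90% efficiency: {realistic:.3e} copies")
```

Even a modest drop in per-cycle efficiency compounds over many cycles, which is one reason the product is checked on a gel rather than assumed.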
ExoSAP WASHING: ExoSAP-IT is a commonly used enzymatic method for cleaning up
PCR products prior to downstream applications such as sequencing or cloning. The
ExoSAP-IT kit contains Exonuclease I and Shrimp Alkaline Phosphatase, which degrade
and remove excess primers and dNTPs from the PCR product. It was used to reduce the
potential for carryover contamination in the downstream application.
SEQUENCING PCR: We amplified a specific region of DNA for sequencing analysis. The
PCR product was analyzed on an agarose gel to confirm the size and quality of the product
before being purified and sequenced. The sequencing data was then analyzed to identify
features of interest in the DNA sequence.
CLEANING BY MAGNETIC BEADS: Cleaning DNA samples using magnetic beads is a
standard method for purifying DNA samples and removing contaminants from them. We used it
because it is efficient, selective, scalable, and gentle on the sample, which makes it a
popular choice in molecular biology and genomic workflows.
SANGER SEQUENCING: Sanger sequencing is commonly used for smaller-scale
sequencing projects, such as verifying the presence of specific mutations, sequencing targeted
gene regions, or analyzing individual DNA fragments. The process includes DNA template
preparation, primer annealing, the sequencing reaction (DNA synthesis and termination),
fragment separation, visualization, and analysis. The sequence is read from the
positions of the terminating ddNTPs, which correspond to specific nucleotides in the DNA
template.
Datasets and Methodology for the Prediction of beta-thalassemia
carrier state using full blood count
A database of 370 cases from the Genetic and Genome Sequencing lab (2022-2023) was
used to train and test the machine learning tools to accurately predict beta-thalassemia
carriers. The most commonly used parameters are hemoglobin (HGB), mean cell
volume (MCV), mean corpuscular hemoglobin concentration (MCHC), and mean
corpuscular hemoglobin (MCH). The methodology of the current study is an adaptation of
industry-proven, evidence-based standards that are used in the data science industry
as well as in academic research and medicine. An outline of the steps used in the study
methodology is given below. Although the steps are listed in sequence for convenience, the
process is bi-directional, iterative, and flexible. The steps are as follows (Figure 4):
1) Problem analysis
2) Identifying data sources
3) Data acquisition and data understanding
4) Modeling using different algorithms
5) Evaluation of the model
6) Prediction
Figure 4: Methodology used for the prediction of beta-thalassemia carrier state using full
blood count
Data Collection
India has a genetically heterogeneous population, which results in a range of thalassemia
types and degrees of severity. Both beta and alpha thalassemia occur in India, with
beta-thalassemia major (Cooley's anemia) being more common. As consanguineous marriages
are common in some Indian communities, the probability of inheriting thalassemia, and hence
its frequency, can be higher in those communities. The regional distribution can
also be influenced by factors such as cultural customs, awareness, and access to
healthcare, so the prevalence of thalassemia differs between Indian states and groups.
In recognition of the substantial effects of thalassemia in India,
the government has been trying to raise awareness, create prevention programs, improve
diagnosis and treatment facilities, and provide support to affected individuals and
families. These programs aim to minimize the impact of thalassemia and
enhance the quality of life of those who have the condition. The need for a better
screening procedure is one of the primary topics this study addresses. The survey highlights
the importance of an efficient, centrally monitored screening procedure in order to
reduce new thalassemia births, which is the ultimate goal of thalassemia prevention.
The main objective of the study is to create and analyze data for inferring mutations, and
to provide a tool that can be used at the first point of contact or at the level of the
referring physician. Such a tool allows a thalassemia carrier state to be identified, or the
risk of being a carrier to be predicted with confidence, and allows the doctor to prioritize
patients who need the gold-standard genetic testing that is currently required for the
diagnosis of the beta-thalassemia carrier state. A machine learning tool that can be deployed
on existing devices, including mobile devices, will ensure that more people can be screened
for thalassemia carrier status. The tool has two potential advantages: it takes less time and
imposes less economic burden (Asmarian et al., 2022).
Identifying key input variables
Full blood count (FBC) and hemoglobin variant percentages from hemoglobin
electrophoresis findings were used as the input variables or parameters needed to train and
evaluate the models. All of the variables used in the analysis are listed in the table below
along with their corresponding units of measurement.
| Feature | Description |
|---|---|
| Full Blood Count | Mean cell hemoglobin (MCH) in picograms (pg) |
| | Mean cell hemoglobin concentration (MCHC) in grams per decilitre (g/dL) |
| | Hemoglobin (Hb) in grams per decilitre (g/dL) |
| | Mean cell volume (MCV) in femtoliters (fL) |
| Demographic Data | Age |
| | Sex |
| Final Finding/Phenotype | Carrier (1) or Normal (0) |

Table 1: Features used for the machine learning analysis
Identifying Data Source
The following are the main criteria by which the data source was chosen:
1) Validity of the data: Whether the selected data set represents the problem
being solved was assessed.
2) Accessibility of the data: Whether the data were available in an
accessible format, and whether they could be collected for analysis within the time allotted
for the study, was evaluated.
3) Accuracy of the data: The accuracy of the input data was taken into account, especially
that of the target variables. Because the target variables are the "labels" or baselines
from which the model learns, their accuracy is critical to the accuracy and performance of
the final model. Consideration was given to whether the diagnosis obtained from the
gold-standard examination was present and whether the diagnosis was confirmed by a qualified
consultant hematologist.
The total number of samples in the dataset was 370 (N = 370). Machine learning model training
for supervised learning tasks needs specific outcomes or target variables, commonly referred
to as ‘labels’, to establish the ground truths. In this study, two labels were identified:
1) Beta-thalassemia carriers (those carrying a single gene allele mutation)
2) Normal individuals
Data Acquisition
For data collection, data were manually entered from paper-based records into a Microsoft
Excel file, which provides a source that is easy to analyze using modern programming
languages. No personal data were collected. The input data were saved in "comma-separated
values" format (.csv extension) so that the file could be read into a programming
environment for analysis and modeling conveniently and quickly.
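Reading such a .csv file into a feature matrix and a label vector might look like the sketch below. It uses only the Python standard library so that it is self-contained; `pandas.read_csv` would be the equivalent with the Pandas library named in this study. The column names and the two data rows are hypothetical placeholders, not records from the study dataset.

```python
import csv
import io

# Hypothetical file contents: column names and values are placeholders only.
SAMPLE_CSV = """age,sex,hb,mcv,mch,mchc,carrier
25,F,10.8,64.0,20.1,31.4,1
31,M,14.6,88.0,29.5,33.5,0
"""

NUMERIC_COLUMNS = ("age", "hb", "mcv", "mch", "mchc")
FBC_COLUMNS = ("hb", "mcv", "mch", "mchc")


def load_records(handle):
    """Parse CSV rows into dictionaries with numeric fields converted."""
    records = []
    for row in csv.DictReader(handle):
        record = {name: float(row[name]) for name in NUMERIC_COLUMNS}
        record["sex"] = row["sex"]
        record["carrier"] = int(row["carrier"])  # 1 = carrier, 0 = normal
        records.append(record)
    return records


# With a real file this would be: open("data.csv", newline="")
records = load_records(io.StringIO(SAMPLE_CSV))
features = [[r[name] for name in FBC_COLUMNS] for r in records]
labels = [r["carrier"] for r in records]
print(features, labels)
```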
All erythrocyte indices were initially entered as input variables. The important features were
then identified using the model itself. The random forest algorithm allowed the relative
importance of the input variables to be interpreted by plotting them in a bar chart. Then, by
trial and error, starting with the least significant variable, the performance of the model was
evaluated after each variable was removed. The threshold at which the model performance
does not degrade was determined and the final set of input variables was decided.
Hemoglobin electrophoresis parameters were also used in the modeling process and their
significance was evaluated by the method described above. In addition to the above variables,
age, and gender were included in the exploratory analysis.
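The importance-guided elimination described above can be sketched with scikit-learn as follows. This is one possible reading of the procedure, not the study's actual code: the data are synthetic (generated with `make_classification`), the feature names are placeholders, and the cross-validation setup and the 0.01 tolerance that defines "does not degrade" are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder names and synthetic data; NOT the study dataset.
FEATURE_NAMES = ["hb", "mcv", "mch", "mchc", "age", "sex"]
X, y = make_classification(
    n_samples=370, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
TOLERANCE = 0.01  # assumed: largest accuracy drop still counted as "no degradation"


def cv_accuracy(columns):
    """Mean 5-fold cross-validated accuracy using only the given columns."""
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(model, X[:, columns], y, cv=5).mean()


# Rank the features by random forest importance, least important first.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
ranking = np.argsort(forest.feature_importances_)

kept = list(range(X.shape[1]))
baseline = cv_accuracy(kept)
for idx in ranking:
    if len(kept) == 1:
        break
    candidate = [c for c in kept if c != idx]
    if cv_accuracy(candidate) < baseline - TOLERANCE:
        break  # performance degraded: stop and keep this variable
    kept = candidate

print("selected features:", [FEATURE_NAMES[i] for i in kept])
```

The bar chart mentioned in the text would plot `forest.feature_importances_` against the feature names.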
Modeling
The Python programming language, with the data science libraries NumPy (version 1.24), Pandas
(version 1.1.2), Matplotlib (version 3.3.2), and Scikit-learn (version 0.23.2) and the deep
learning library Keras (version 3.2), provided the means to develop a neural network machine
learning model to predict carrier states based solely on the results of a complete blood count.
To train and test the first three algorithms which are Random Forest, Logistic Regression, and
Support Vector Machine, Scikit-learn was used while Keras was the main library used to train
and test neural networks. The integrated development environment for the data analysis is
Jupyter Notebooks. The available data was divided into two parts: Training and Testing data.
This study attempted to compare four machine learning algorithms for the classification of
thalassemia carrier data. The techniques used and the reasons for their selection are as
follows:
Random Forest Algorithm
Random Forest (RF) is a machine learning algorithm from the ensemble learning family.
It combines multiple individual decision trees to make predictions or
classifications. The name "Random Forest" comes from the idea that each decision tree in the
ensemble is trained on a random subset of the training data and a random subset of the input
features. It uses a bagging technique in which each decision tree is trained on a bootstrap
sample of the data and the tree outputs are combined by majority vote or averaging. In this
way, it reduces the overfitting to which individual decision trees (DT) are prone and reduces
variance, thus improving accuracy. Random Forest works with both categorical and continuous
variables. Random forests are convenient because they rely on decision rules and therefore
do not require feature scaling, and they are less sensitive to outliers and noise. Another
important advantage of random forests is that they can be interpreted through the
visualization of decision trees, which sheds light on the important input (independent)
variables.
Figure 5: Conceptual visualization of a random forest, which is made up of several decision
trees, and the final output is taken in a majority voting or average manner. Image courtesy
(https://en.wikipedia.org/wiki/Random_forest, last accessed on June 2, 2023)
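The two sources of randomness described above (a bootstrap sample of the rows and a random subset of the features for each tree) and the majority vote can be shown in a few lines of plain Python. This is a deliberately simplified sketch, not Scikit-learn's implementation: each "tree" is a one-split decision stump, and the eight (MCV, MCH)-like rows are made-up values for illustration.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature_indices):
    """Find the single-feature threshold rule with the fewest training errors."""
    best = None
    for f in feature_indices:
        for threshold in sorted({row[f] for row in rows}):
            for low_label in (0, 1):
                errors = sum(
                    (low_label if row[f] <= threshold else 1 - low_label) != y
                    for row, y in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, low_label)
    return best[1:]


def predict_stump(stump, row):
    f, threshold, low_label = stump
    return low_label if row[f] <= threshold else 1 - low_label


def fit_forest(rows, labels, n_trees=25, n_sub_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: draw as many rows as the data set has, with replacement.
        sample = [rng.randrange(len(rows)) for _ in rows]
        # Random feature subset for this tree.
        feats = rng.sample(range(len(rows[0])), n_sub_features)
        forest.append(
            fit_stump([rows[i] for i in sample], [labels[i] for i in sample], feats)
        )
    return forest


def predict_forest(forest, row):
    """Majority vote over all stumps (the tree count is odd, so no ties)."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Made-up rows of (MCV, MCH); label 1 = carrier, 0 = normal.
ROWS = [(62.0, 19.5), (65.0, 20.3), (68.0, 21.0), (70.0, 22.4),
        (84.0, 28.0), (88.0, 29.1), (90.0, 30.2), (93.0, 31.0)]
LABELS = [1, 1, 1, 1, 0, 0, 0, 0]

forest = fit_forest(ROWS, LABELS)
print(predict_forest(forest, (50.0, 15.0)), predict_forest(forest, (100.0, 35.0)))
```

Individual stumps trained on unlucky bootstrap samples can be wrong, but the vote across many stumps averages those errors out, which is the variance reduction the text refers to.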
Logistic Regression
Logistic regression (LR) is one of the most widely used machine learning algorithms in the
supervised learning category. It is used to predict a categorical dependent variable
from a given set of independent variables. LR is used to solve classification
problems, and the most common use case is binary logistic regression, where the outcome is
binary (yes or no). Logistic regression is applied in many real-world fields. In
healthcare, LR can be used to predict diseases. In our study, LR was used to predict
beta-thalassemia carriers (Figure 6).
Figure 6: Conceptual visualization of logistic regression. Image courtesy
(https://blog.devgenius.io/develop-a-logistic-regression-machine-learning-model-64d2be403ba3,
last accessed on June 2, 2023)
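For reference, the standard form of the binary logistic regression model is given below. The symbols are assumptions introduced here for illustration: x_1, ..., x_k denote the input variables (for example, the blood count indices), the β terms are the fitted coefficients, and 0.5 is the conventional default decision threshold rather than a value reported in this study.

```latex
\[
P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-z}},
\qquad
z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k
\]
\[
\log\frac{P(y = 1 \mid \mathbf{x})}{1 - P(y = 1 \mid \mathbf{x})} = z,
\qquad
\hat{y} =
\begin{cases}
1 \ (\text{carrier}) & \text{if } P(y = 1 \mid \mathbf{x}) \ge 0.5 \\
0 \ (\text{normal}) & \text{otherwise}
\end{cases}
\]
```

The second line shows why the coefficients are interpretable: each β_i is the change in the log-odds of being a carrier per unit change in x_i.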
Support Vector Machines
The support vector machine (SVM) is one of the most popular supervised learning techniques
for solving classification and regression problems, although in machine learning it is mostly
employed for classification. The goal of the SVM algorithm is to establish
the optimal decision boundary (hyperplane) that divides the n-dimensional feature space into
classes so that subsequent data points can be quickly assigned to the appropriate class.
Classification and prediction of biological data is one of the most important aspects of
bioinformatics. With the rapid increase in the size of biological databases, it is important
to use computer programs to automate the classification process. SVMs are well suited to
predicting diseases because they are designed
to maximize the margin between two classes, so that the trained model can generalize
effectively to unseen data (Figure 7).
Figure 7: Conceptual visualization of SVM. Image courtesy
(https://www.spiceworks.com/tech/big-data/articles/what-is-support-vector-machine/ , last
accessed on June 2, 2023)
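The margin that an SVM maximizes can be made concrete with a small calculation. For a linear boundary w·x + b = 0 and labels y in {-1, +1}, the geometric margin is the smallest value of y (w·x + b) / ||w|| over the training points. The sketch below compares two candidate boundaries on four made-up 2-D points. It only evaluates margins and does not solve the SVM optimization problem; the points and both candidate boundaries are illustrative assumptions.

```python
import math


def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the boundary w.x + b = 0.

    A positive result means every point lies on the correct side.
    """
    norm = math.hypot(w[0], w[1])
    return min(
        y * (w[0] * x[0] + w[1] * x[1] + b) / norm
        for x, y in zip(points, labels)
    )


# Made-up, linearly separable 2-D points with labels -1 / +1.
POINTS = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
LABELS = [-1, -1, 1, 1]

# Two candidate boundaries that both separate the classes.
margin_a = geometric_margin((1.0, 1.0), -5.5, POINTS, LABELS)  # x1 + x2 = 5.5
margin_b = geometric_margin((1.0, 0.0), -3.0, POINTS, LABELS)  # x1 = 3

print(f"margin A = {margin_a:.3f}, margin B = {margin_b:.3f}")
```

Both boundaries classify the four points correctly, but an SVM would prefer A over B because it leaves a wider gap between the classes.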
Artificial Neural Network (ANN)
None of the previous studies used a "deep" neural network with multiple layers to
classify thalassemia carriers; a neural network was chosen here because it can identify
complex non-linear patterns in data. The model in this study is a feed-forward multilayer
neural network trained with back-propagation. An artificial neural network (ANN) is a
computational model that tries to account for the parallel nature of the human brain. It is
a parallel network of highly connected processing elements (neurons), modeled on the nervous
system of living things. The field of medical diagnosis using artificial
intelligence systems, especially artificial neural networks and deep learning computer
diagnostics, is currently a very active area of research. This area is expected to become more
common in biomedical systems. The development of brain network strategies for clinical
discovery is heavily considered because they are ideal for disease detection using filters.
Brain networks work like visual signals, so there is no need for nuances related to disease
detection( Figure 12). Finally, using the above four machine learning algorithms, the best
algorithm which has a high accuracy rate was chosen to predict beta-thalassemia carriers.
Figure 8: Conceptual visualization of ANN. Note that this has four nodes in the input layer
(which correspond to input variables), and the output layer is an example where there are two
diagnostic categories. Image courtesy(https://www.javatpoint.com/artificial-neural-network ,
last accessed on June 2, 2023)
2.2.6. Evaluation
Model performance was assessed on the validation dataset. Three metrics were used to
evaluate the models on empirical data:
a) General Accuracy
b) Confusion Matrix - A confusion matrix summarizes the predictions in matrix form. It shows
how many predictions for each class are correct and incorrect (Figure 13).
Figure 9: Image showing Confusion Matrix,Image Courtesy
(https://www.analyticsvidhya.com/blog/2020/04/confusion-matrix-machine-learning/, last
accessed on June 2, 2023)
c) F1 Score - The F1 score is an evaluation metric that combines the precision and recall of
a model into a single value (their harmonic mean).
Of the three, the F1 score and the confusion matrix were treated as the most important
outcome measures because our main goal is to accurately predict beta-thalassemia carriers,
which is reflected in the true positive rate (sensitivity) and the positive predictive value.
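The relationship between these measures can be sketched in a few lines of Python. The function below derives precision (positive predictive value), recall (sensitivity) and the F1 score from confusion-matrix counts; the counts in the example are hypothetical and are not results from this study.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) from confusion-matrix counts.

    precision = TP / (TP + FP)   (positive predictive value)
    recall    = TP / (TP + FN)   (sensitivity, true positive rate)
    F1        = harmonic mean of precision and recall
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts, for illustration only.
precision, recall, f1 = precision_recall_f1(tp=40, fp=10, fn=10)
```

The sketch assumes at least one true positive; with no true positives the divisions are undefined and would need explicit handling.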
Methodology for the Prediction of Hemoglobin variants using Machine
Learning Algorithms
We employed three machine learning algorithms to check and predict the pathogenic
variants (Figure 5):
K-Nearest Neighbors (KNN)
Naive Bayes
Decision Trees
Figure 10: Workflow used to perform machine learning predictions.
K-Nearest Neighbors (KNN): The K-nearest neighbors algorithm (KNN) is a non-parametric,
supervised learning classifier that uses proximity to classify or predict the grouping of an
individual data point. Although it can be applied to classification or regression problems,
it is most commonly used as a classification algorithm, relying on the assumption that
similar points are found close to one another (Figure 6). KNN is a relatively simple method,
and its performance depends on the accuracy and representativeness of the data (Ray, 2019).
Figure 11: Visualization of KNN algorithm (image courtesy of
https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning last accessed
on June 15, 2023)
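The proximity idea behind KNN can be illustrated with a minimal Python sketch: the query point receives the majority label of its k closest training points. The two-feature points, the labels and the choice of Euclidean distance are assumptions made for illustration; they are not the data or settings of this study.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training label with its Euclidean distance to the query.
    distances = sorted(
        ((math.dist(point, query), label)
         for point, label in zip(train_points, train_labels)),
        key=lambda pair: pair[0],
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical (MCV, MCH) pairs with hypothetical labels.
train_points = [(65.0, 20.0), (68.0, 21.0), (70.0, 22.0),
                (88.0, 29.0), (90.0, 30.0), (92.0, 31.0)]
train_labels = ["variant", "variant", "variant",
                "no variant", "no variant", "no variant"]

prediction_a = knn_predict(train_points, train_labels, (67.0, 21.5))
prediction_b = knn_predict(train_points, train_labels, (89.0, 29.5))
```

In practice the features are usually scaled before distances are computed, so that no single attribute dominates the vote.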
Naive Bayes: It is a probabilistic machine learning algorithm that can be used for various
classification tasks. Using probability, we can use the Naive Bayes classifier to predict a class
or category based on a given set of features (Figure 7).
Some commonly used Naive Bayes algorithms are,
Gaussian Naive Bayes
Multinomial Naive Bayes
Complement Naive Bayes
Bernoulli Naive Bayes
Categorical Naive Bayes
Out-Of-Core Naive Bayes model fitting
Figure 12: Visualization of Naive Bayes algorithm (image courtesy of
https://towardsdatascience.com/introduction-to-na%C3%AFve-bayes-classifier-fa59e3e24aaf
last accessed on June 15, 2023)
Decision Trees (DT): A decision tree (DT) is a simple yet powerful form of multivariable
analysis that searches for different ways of splitting the data into branch-like segments.
In a DT, the data are divided according to the categories of the input variables, which helps
in understanding how a decision is reached. The objective is to develop a model that
predicts the value of a target variable by learning simple decision rules inferred from the
data features. The DT algorithm can solve both regression and classification problems.
Figure 13: Visualization of Decision Tree algorithm (image courtesy of
https://towardsdatascience.com/from-a-single-decision-tree-to-a-random-forest-b9523be65147,
last accessed on June 15, 2023)
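To make the idea of "searching for ways of splitting the data" concrete, the sketch below scores every candidate threshold on a single numeric feature by weighted Gini impurity and keeps the best one. The thesis does not state which splitting criterion was used, so Gini impurity is an assumption here, and the feature values and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best 'value <= threshold' split."""
    best = (None, float("inf"))
    # The largest value is skipped because it would leave the right branch empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical feature values and labels (1 = variant present, 0 = absent).
values = [62.0, 66.0, 70.0, 85.0, 90.0, 95.0]
labels = [1, 1, 1, 0, 0, 0]
threshold, impurity = best_threshold(values, labels)
```

A full decision tree repeats this search over all features and then recursively on each resulting branch until a stopping rule is met.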
DATA COLLECTION: Machine learning algorithms require large data sets to perform well. About
250 records were collected from Lok Nayak Hospital; this step involved gathering the details
required for the prediction.
DATA PREPARATION, TRAINING AND EVALUATION: Data preparation was carried out on the
previously generated data and included data cleaning, in which unnecessary and erroneous
records were removed, followed by thorough exploration and analysis of the data to identify
patterns or new outcomes in the data set. We used three supervised learning algorithms (KNN,
NB, and DTs), and once each model was trained its performance was measured using the
confusion matrix together with the precision and accuracy metrics. Using the performance on
the training and test sets, we compared the accuracy of the three models on the following
attributes (Table 1).
ATTRIBUTES USED
Hemoglobin count
MCH (Mean Corpuscular Hemoglobin)
MCHC (Mean Corpuscular Hemoglobin Concentration)
MCV (Mean Corpuscular Volume)
HbS (Hemoglobin S)
HbE (Hemoglobin E)
Table 2: Attributes used for an ML algorithm
Exploratory Data Analysis (EDA)
The datasets were explored using EDA, including data visualization, as the first step in the
data analysis process. EDA helped in identifying a suitable model or hypothesis and improved
our understanding of the variables in the data set. The features identified through EDA can
be used for more sophisticated data analysis or modeling. The EDA was completed in full, and
conclusions were drawn from it (Figure 8).
Figure 14: EDA technique was used to evaluate the variables.
RESULTS
The most common beta-thalassemia mutation was identified: IVS mutations
were found in 48 of 90 samples
From the data available for individual samples, 48 of 90 individuals had IVS 1-5 (G>C) and
the remaining 42 had other mutations, such as the 619 bp deletion, IVS 1:1 (G>T), Cd 41/42
(-TCTT), Cd 30 (G>C) and Cd 26 (G>A). Molecular testing was performed on samples from
patients positive for beta-thalassemia. Molecular analysis of the mutation patterns in the
90 beta-thalassemia cases studied showed that IVS 1-5 (G>C) was the most common mutation
identified (Table 2).
S.No. | Type of mutation detected | Mother | Father | Total no. of mutations detected | Clinical report | Technique used
1 | IVS 1:1 (G>T), 619 bp deletion | 619 bp deletion | IVS 1:1 (G>T) | 4 | Compound heterozygous | PCR amplification, SS
2 | NA | IVS 1:5 (G>C) | No mutation | 1 | Carrier; check for anemia | PCR amplification, SS
3 | IVS 1:5 (G>C) | No mutation | IVS 1:5 (G>C) | 2 | Carrier; check for anemia | PCR amplification, SS
4 | No mutation | Cd 41/42 (-TCTT) | IVS 1:5 (G>C) | 2 | Parents are carriers, child unaffected; check for anemia | PCR amplification, SS
5 | Cd 30 (G>C) | Cd 30 (G>C) | Cd 30 (G>C) | 3 | Homozygous affected | PCR amplification, SS
6 | IVS 1:5 (G>C) | IVS 1:5 (G>C) | IVS 1:5 (G>C) | 3 | Homozygous affected | PCR amplification, SS
7 | NA | IVS 1:5 (G>C) | No mutation | 1 | Mother is carrier; check for anemia | PCR amplification, SS
8 | Cd 41/42 (-TCTT) | Cd 41/42 (-TCTT) | Cd 41/42 (-TCTT) | 3 | Homozygous affected | PCR amplification, SS
9 | IVS 1:5 (G>C), Cd 15 (G>A) | Cd 15 (G>A) | IVS 1:5 (G>C) | 4 | Compound heterozygous | PCR amplification, SS
10 | IVS 1:5 (G>C) | IVS 1:5 (G>C) | IVS 1:5 (G>C) | 3 | Homozygous affected | PCR amplification, SS
11 | IVS 1:5 (G>C) | IVS 1:5 (G>C) | IVS 1:5 (G>C) | 3 | Homozygous affected | PCR amplification, SS
12 | Cd 26 (G>A), Cd 41/42 (-TCTT) | Cd 41/42 (-TCTT) | Cd 26 (G>A) | 4 | Compound heterozygous | PCR amplification, SS
13 | IVS 1:5 (G>C), Cd 16 (-C) | IVS 1:5 (G>C) | Cd 16 (-C) | 4 | Compound heterozygous | PCR amplification, SS
14 | Cd 26 (G>A), Cd 41/42 (-TCTT) | Cd 41/42 (-TCTT) | Cd 26 (G>A) | 4 | Compound heterozygous | PCR amplification, SS
15 | IVS 1:5 (G>C), Cd 16 (-C) | IVS 1:5 (G>C) | Cd 16 (-C) | 4 | Compound heterozygous | PCR amplification, SS
16 | NA | Cd 16 (-C) | No mutation | 1 | Mother is carrier; check for anemia | PCR amplification, SS
17 | NA | IVS 1:5 (G>C) | IVS 1:5 (G>C) | 2 | Both parents are carriers; check for anemia | PCR amplification, SS
18 | NA | IVS 1:5 (G>C) | Cd 26 (G>A) | 2 | Both parents are carriers | PCR amplification, SS
19 | NA | IVS 1:5 (G>C) | IVS 1:5 (G>C) | 2 | Both parents are carriers; check for anemia | PCR amplification, SS
20 | NA | IVS 1:5 (G>C) | No mutation | 1 | Mother is carrier; check for anemia | PCR amplification, SS
21 | NA | Cd 8/9 (+G) | No mutation | 1 | Mother is carrier; check for anemia | PCR amplification, SS
22 | IVS 1:5 (G>C) | IVS 1:5 (G>C) | IVS 1:5 (G>C) | 3 | Homozygous affected | PCR amplification, SS
23 | NA | IVS 1:5 (G>C) | IVS 1:5 (G>C) | 2 | Both parents are carriers; check for anemia | PCR amplification, SS
24 | Cd 41/42 (-TCTT), Cd 26 (G>A) | Cd 41/42 (-TCTT) | Cd 26 (G>A) | 4 | Compound heterozygous | PCR amplification, SS
25 | NA | IVS 1:5 (G>C) | IVS 1:5 (G>C) | 2 | Both parents are carriers; check for anemia | PCR amplification, SS
26 | NA | IVS 1:5 (G>C) | IVS 1:5 (G>C) | 2 | Both parents are carriers; check for anemia | PCR amplification, SS
27 | NA | IVS 1:5 (G>C) | IVS 1:5 (G>C) | 2 | Both parents are carriers; check for anemia | PCR amplification, SS
28 | IVS 1:1 (G>T), Cd 8/9 (+G) | IVS 1:1 (G>T) | Cd 8/9 (+G) | 4 | Compound heterozygous | PCR amplification, SS
29 | 619 bp deletion | 619 bp deletion | No mutation | 2 | Carriers; check for anemia | PCR amplification, SS
30 | NA | IVS 1:5 (G>C) | 619 bp deletion | 2 | Both parents are carriers; check for anemia | PCR amplification, SS
31 | IVS 1:5 (G>C) | IVS 1:5 (G>C) | IVS 1:5 (G>C) | 3 | Homozygous affected | PCR amplification, SS
32 | NA | IVS 1:5 (G>C) | IVS 1:5 (G>C) | 2 | Both parents are carriers; check for anemia | PCR amplification, SS
33 | NA | Cd 8/9 (+G) | Cd 30 (G>C) | 2 | Both parents are carriers; check for anemia | PCR amplification, SS
34 | IVS 1:5 (G>C) | IVS 1:5 (G>C) | IVS 1:5 (G>C) | 3 | Homozygous affected | PCR amplification, SS
35 | IVS 1:5 (G>C) | IVS 1:5 (G>C) | No mutation | 2 | Carriers; check for anemia | PCR amplification, SS
Table 2: Results showing mutations that were screened for the current study
Sl. No | Patient Id | Sex | Variants | Mutations
1 | 138 | M | HbS | CD.20(A>T)
2 | 149 | F | HbS | CD.20(A>T)
3 | 190 | F | HbS | CD.20(A>T)
4 | 252 | F | HbS | CD.20(A>T)
5 | 257 | F | HbS | CD.20(A>T)
6 | 291 | F | HbS | CD.20(A>T)
7 | 328 | F | HbS | CD.20(A>T)
8 | 360 | F | HbE | C.79G>A
9 | 410 | F | HbS | CD.20(A>T)
10 | 414 | F | HbS | CD.20(A>T)
11 | 415 | F | HbS | CD.20(A>T)
12 | 469 | F | HbS | CD.20(A>T)
13 | 486 | F | HbS | CD.20(A>T)
14 | 500 | F | HbS | CD.20(A>T)
15 | 533 | F | HbS | CD.20(A>T)
16 | 541 | F | HbS | CD.20(A>T)
17 | 589 | M | HbS | CD.20(A>T)
18 | 597 | M | HbS | CD.20(A>T)
19 | 854 | M | HbE | CD.26(G>A)
20 | 892 | M | HbS | CD.20(A>T)
21 | 902 | M | HbS | CD.20(A>T)
22 | 909 | M | HbS | CD.20(A>T)
23 | 916 | F | HbS | CD.20(A>T)
24 | 924 | F | HbE | CD.26(G>A)
25 | 1046 | F | HbE | C.79G>A
26 | 1104 | F | HbE | CD.26(G>A)
27 | 1106 | F | HbS | CD.20(A>T)
28 | 1211 | F | HbE | CD.26(G>A)
29 | 1413 | F | HbS | CD.20(A>T)
30 | 1463 | F | HbE | C.79G>A
The mutations were identified and analyzed using PCR and Sanger sequencing. The clinical
reports comprised compound heterozygous, homozygous affected, and carrier cases; carriers
were referred for anemia testing, and genetic counseling was offered to all subjects.
Figure 14 is a graph showing the most common mutation and Figure 15 is the SnapGene view of
the mutation IVS 1-5 (G>C). The chromosome position of the mutation is Chr 11:5248155 C>G in
the HBB gene and the gene position is g.2471(G>C).
Figure 15: Identified IVS 1:5 (G>C) as the most common mutation
Figure 16: Validating IVS 1:5 (G>C) mutation using the Sanger Sequencing
Random Forest proved to be the most accurate classifier to
predict Beta-Thalassemia Carriers
The material collected from the laboratory contained information on 370 people. Of these,
200 were "index patients", that is, the person of reference for each request for thalassemia
testing sent to the laboratory, and 170 were family members of the index patients who
completed the family survey. Sex was coded as 0 for male and 1 for female; overall, there
were 161 males and 209 females (Figure 16). One patient with hemoglobin H disease and two
patients with beta-thalassemia major, based on genetic diagnosis, were present in the
collected material. They were excluded from the final dataset for modeling because the study
focuses on thalassemia carriers (so the final dataset had a sample size of 370). In addition,
the material contained the main diagnostic categories, called "phenotypes". These phenotypes
were the outcomes predicted by the models through correlation with the input variables.
An Exploratory Data Analysis (EDA) was done to understand and analyze the data which is
listed in the following table (Table 3)
Parameter | Maximum | Minimum | Average
HGB (g/dL) | 16.9 | 4.8 | 13.05
MCH (pg) | 30 | 16 | 22.51
MCV (fL) | 101.1 | 69 | 84.64
MCHC (g/dL) | 32.5 | 27.5 | 30.21
Table 4: Exploratory Data Analysis
Figure 17: Comparison of HGB, MCH, MCHC and MCV levels of Carriers and Normal
Figure 18: Boxplot of EDA
Figure 19: A correlation matrix of all the input variables in the dataset. This contains
Pearson's Correlation Coefficient (r) of each variable with all the other variables
(including the same variable).
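For reference, each entry of such a correlation matrix is a Pearson coefficient r, which can be computed as in the minimal Python sketch below. The MCV and MCH values are hypothetical and serve only to illustrate the calculation; the diagonal of the matrix (a variable paired with itself) is always 1.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)

# Hypothetical values, for illustration only.
mcv = [70.0, 75.0, 80.0, 85.0, 90.0]
mch = [21.0, 23.5, 25.0, 27.5, 30.0]

r_self = pearson_r(mcv, mcv)    # a variable correlated with itself
r_cross = pearson_r(mcv, mch)   # two different variables
```

Values of r close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate little linear relationship.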
Predictive Modeling
Four different algorithms were used to train and test the model: RF, LR, SVM and ANN,
referred to as models 1, 2, 3, and 4, respectively. Each model was tested using the
validation data, which was split off using stratified random sampling to account for class
imbalance in the dataset. The performance of each model was evaluated using its overall
accuracy (1 - error rate) and its F1 score, which combines the sensitivity and positive
predictive value of the model.
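Stratified random sampling can be sketched as follows: the samples are shuffled and split within each class separately, so that the training and test sets keep similar class proportions. This is a minimal standard-library illustration; the class balance, the random seed and the function name are hypothetical, while the 70:30 ratio and the total of 370 samples follow the text.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_indices, test_indices), split separately within each class."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    train_indices, test_indices = [], []
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_indices.extend(indices[:n_test])
        train_indices.extend(indices[n_test:])
    return sorted(train_indices), sorted(test_indices)

# Hypothetical class balance: 190 carriers (1) and 180 normal individuals (0).
labels = [1] * 190 + [0] * 180
train_indices, test_indices = stratified_split(labels)
carriers_in_test = sum(labels[i] for i in test_indices)
```

With 370 samples and a 30% test fraction, this yields 259 training and 111 test samples, the same sizes as reported for the models here.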
Modeling Using Random Forest Algorithm(RF)
Random Forest (RF) has a number of advantages: it generally achieves high accuracy, it can
handle missing values in the data, and it estimates which variables are important for
classification. A random forest is a method that predicts the response of an observation by
combining the predictions of several decision trees. In this study, a total of 370 cases
were used for the prediction, of which 259 were used for training and 111 for testing.
There are four main quantities for measuring the performance of machine learning
algorithms. True positive (TP) is the number of beta-thalassemia samples that are correctly
diagnosed. True negative (TN) is the number of correctly diagnosed non-beta-thalassemia
samples. False negative (FN) is the number of beta-thalassemia samples that are
misdiagnosed. False positive (FP) is the number of misdiagnosed non-beta-thalassemia samples
(Figure 18). From the values in the confusion matrix, we can calculate the accuracy,
precision, and recall. Accuracy is the proportion of all samples that the system classifies
correctly. Precision is the proportion of predicted positives that are truly positive.
Recall, in contrast, is the proportion of actual positives that are correctly predicted as
positive.
Figure 20 : Confusion matrix of RF classifier
Modeling Using Logistic Regression (LR)
The accuracy, F1 score, precision, and sensitivity of the model were computed. 259 samples
were used for training and 111 for testing. The confusion matrix shows the True Positive,
True Negative, False Positive, and False Negative values (Figure 19).
Figure 21: The confusion matrix of the Logistic Regression Algorithm. ‘True label’
represents the actual diagnostic values and ‘Predicted label’ represents the values predicted
by the model. 0 represents “Normal” and 1 represents “Beta-thalassemia carriers”.
Modeling using Support Vector Machine (SVM)
The accuracy, F1 score, precision, and sensitivity of the model were computed. The training
and testing datasets are in the ratio 70:30, the same as for RF and LR. The confusion matrix
shows the True Positive, True Negative, False Positive, and False Negative values (Figure
19).
Figure 22: The confusion matrix of the SVM Algorithm. ‘True label’ represents the actual
diagnostic values and ‘Predicted label’ represents the values predicted by the model. 0
represents “Normal” and 1 represents “Beta-thalassemia carriers”.
Modeling Using Artificial Neural Network
The ANN achieved an accuracy of 80.3% and an F1 score of 80%. The artificial neural network
model used is “Sequential”. The Adam optimizer was used to improve the accuracy and reduce
the total loss. The ANN had 4 hidden layers. The first three layers of the ANN model use
ReLU as the activation function and the fourth layer uses sigmoid. Because the dataset is
small, the ANN could not deliver promising performance.
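The layer arrangement described above (three ReLU layers followed by a sigmoid layer) can be illustrated with a forward pass in plain Python. This sketch does not reproduce the trained model or the optimizer: the layer widths, the random untrained weights and the input values are all hypothetical, and only the order of activation functions follows the text.

```python
import math
import random

def dense(inputs, weights, biases):
    """One fully connected layer; `weights` holds one row per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, layers):
    """Apply ReLU after every layer except the last, which uses a sigmoid."""
    activations = inputs
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]
    return sigmoid(dense(activations, weights, biases)[0])

def random_layers(sizes, seed=0):
    """Hypothetical, untrained weights for the layer widths given in `sizes`."""
    rng = random.Random(seed)
    return [
        ([[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes, sizes[1:])
    ]

# Four inputs (MCV, MCH, MCHC, Hb), three ReLU layers, one sigmoid output.
layers = random_layers([4, 8, 8, 8, 1])
probability = forward([0.2, -0.5, 0.1, -1.0], layers)
```

The sigmoid output lies between 0 and 1 and can be read as the estimated probability of the carrier class; training would adjust the weights to reduce the loss.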
Algorithm | True Positive | False Positive | True Negative | False Negative | Precision | Sensitivity
Random Forest | 49 | 5 | 48 | 9 | 84% | 90%
Logistic Regression | 49 | 9 | 44 | 9 | 84% | 84%
Support Vector Machine | 49 | 11 | 42 | 9 | 84.4% | 81.6%
Artificial Neural Network | 45 | 11 | 44 | 11 | 80% | 80.36%
Table 5: Summary of TP, TN, FN, and FP across all the algorithms
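As a worked check, overall accuracy can be recomputed from the confusion-matrix counts as (TP + TN) / (TP + FP + TN + FN). The short Python sketch below does this for the Random Forest and Logistic Regression rows of Table 5; the function and variable names are illustrative.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Overall accuracy: correctly classified samples over all samples."""
    return (tp + tn) / (tp + fp + tn + fn)

# Counts taken from the Random Forest and Logistic Regression rows of Table 5.
table5_counts = {
    "Random Forest": dict(tp=49, fp=5, tn=48, fn=9),
    "Logistic Regression": dict(tp=49, fp=9, tn=44, fn=9),
}

accuracies = {name: accuracy(**counts) for name, counts in table5_counts.items()}
for name, value in accuracies.items():
    print(f"{name}: {value:.2%}")
```

Both rows sum to 111 test samples, and the recomputed values (87.39% and 83.78%) agree with the accuracies reported for these two models.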
The Random Forest algorithm yielded an accuracy of 87.39%, with an F1 score of 87%.
The precision of the algorithm is 84% and its sensitivity is 90%. Precision, accuracy, and
sensitivity were calculated from the confusion matrix.
Machine Learning Algorithms | Input Variables (Predictors) | Output Categories (Target) | Performance of the model on the test set
Logistic Regression | Full Blood Count: MCV, MCH, MCHC, Hb | Beta-thalassemia carrier (1); “Normal” individuals (0) | Accuracy: 83.78%; F1 Score: 84%
Random Forest | Full Blood Count: MCV, MCH, MCHC, Hb | Beta-thalassemia carrier (1); “Normal” individuals (0) | Accuracy: 87.39%; F1 Score: 87%
Support Vector Machine | Full Blood Count: MCV, MCH, MCHC, Hb | Beta-thalassemia carrier (1); “Normal” individuals (0) | Accuracy: 82.9%; F1 Score: 81%
Artificial Neural Network | Full Blood Count: MCV, MCH, MCHC, Hb | Beta-thalassemia carrier (1); “Normal” individuals (0) | Accuracy: 80.3%; F1 Score: 80%
Table 6: Summary of the results showing accuracy and F1 score of each algorithm
Results Obtained from Mutation analysis and variant prediction
Significant HbS and HbE variants were found
We identified 77% HbS and 23% HbE among 30 samples: 23 samples carried HbS variants and 7
carried HbE variants. This suggests that HbS is the most prevalent variant in our cohort.
Two well-known mutations were identified, the majority being HbS: CD.20(A>T) is the HbS or
sickle cell mutation, and CD.26(G>A); C.79G>A is the HbE or Hemoglobin E mutation (Table 2;
Figure 9).
Table 7: List of patients’ data from our Lok Nayak Hospital Cohort where mutations were
detected using PCR amplification
Figure 23: After analysis, the Presence of HbS and HbE was identified
HbS mutations were validated using Sanger sequencing
We validated the CD.20(A>T); CD.26(G>A); C.79G>A mutations using Sanger sequencing. The NCBI
Sequence Viewer (SV) is a visualization tool for the Nucleotide and Protein databases that
is used to access, analyze, and distribute sequence data. Using this tool, we located the
position of the C.79G>A mutation on chromosome 11. Using the ClinVar database, we were able
to determine whether the mutations are missense or frameshift variants. The SV displays a
single selected sequence molecule, which makes it suitable for viewing and analyzing the
data (Figure 10). All the mutations here, CD.20(A>T); CD.26(G>A); C.79G>A, are on chromosome
11; the location was chr11:5248232T>A for CD.20(A>T) and chr11:5248173C>T for CD.26(G>A).
Figure 24: C.79G>A mutation was visualized from SV.
Decision trees yielded good accuracy
When the data were subjected to machine learning heuristics, the Decision Tree (DT)
performed best among all the algorithms, with an accuracy of 96%, while the Naive Bayes (NB)
algorithm showed 93% and KNN showed 80%. In the confusion matrix, 0 represents the presence
of variants and 1 represents the absence of variants. We plotted the True Positive, True
Negative, False Positive, and False Negative values. True Positive showed 51% and True
Negative showed 29%, while False Positive and False Negative showed 2% and 1%, respectively.
We calculated the accuracy from the confusion matrix using the formula
(TP+TN)/(TP+TN+FP+FN) (Figures 11 and 12).
Figure 25: A pie chart depicting the accuracy of machine learning algorithms.
Figure 26: Confusion matrix shows the best accuracy for DTs
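The accuracy formula, (TP + TN) divided by the sum of all four confusion-matrix entries, can be checked with a few lines of Python. This is a minimal sketch that assumes the four reported values (51, 29, 2 and 1) are used directly as the confusion-matrix entries; the function name is illustrative.

```python
def accuracy(tp: float, tn: float, fp: float, fn: float) -> float:
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Values reported for the Decision Tree confusion matrix.
dt_accuracy = accuracy(tp=51, tn=29, fp=2, fn=1)
print(f"Decision Tree accuracy: {dt_accuracy:.3f}")
```

Under that assumption the result is about 0.964, in line with the Decision Tree accuracy listed in Table 8.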
KNN showed less accuracy
A small value of K gives the most flexible fit, with low bias but high variance, and the
decision boundaries are more jagged. A higher K, in contrast, is more resilient to outliers
because more points contribute to each prediction. Larger K values smooth the decision
boundaries, which reduces variance but increases bias. Table 8 lists the scores and
accuracies of DTs, NB and KNN. The DT models maintained high performance, whereas KNN showed
lower accuracy than the other two. The best-performing model predicted the presence of
variants with 96% accuracy (Figure 13).
Figure 27: K Neighbors Classifier scores for different K values
Sl. No | Algorithms | Scores | Accuracy
1 | Decision Tree | 0.96 | 0.9638
2 | Naive Bayes | 0.931 | 0.933
3 | KNN | 0.847 | 0.801
Table 8: Accuracy and Scores of three different algorithms
DISCUSSION
The first study aimed to identify and analyze the most common beta-thalassemia mutation in
the north Indian population. Hemoglobinopathies and thalassemias are genetic diseases that
are common worldwide. The most commonly observed hemoglobin variants in different parts of
the world are Hb E, Hb S, Hb D, Hb C, etc. Beta-thalassemia is found among residents of the
northern region of India, and in this study its molecular basis was investigated in people
from this region. The IVS 1-5 (G>C) beta-thalassemia mutation was the most common mutation
among the studied subjects (46.6%). This is consistent with many previous studies in India
and the Indian subcontinent. Cd 41/42 (-TCTT) was the second most common mutation, and Cd 26
(G>A) was the third. The 619 bp deletion, Cd 30 (G>C) and IVS 1:1 (G>T) were the other
mutations observed.
The second and main objective of the first study was to test the accuracy of machine learning
tools for improving the screening process for beta-thalassemia carriers and to show the
potential of machine learning in developing a tool that can then be adapted to similar
problems both within and beyond this setting. The lack of an effective screening process has
been one of the main problems behind the increase in beta-thalassemia carriers. A predictive
tool based on machine learning could also shorten screening, because such tools run as
computer software and can process inputs much faster than manual testing, saving the time
and cost of traditional surveys. However, using such a tool in a practical clinical context
is not easy. Although machine learning models have proven accurate in many clinical
problems, including tabular data processing, few have been translated into real-world
applications. Machine learning models have inherent limitations: they are only intelligent
within a limited range and cannot extend their intelligence to things they have not seen.
The real world is often messy, and if the inputs are not of the expected quality, models can
produce incorrect results. This also implies that when the model is validated in a
real-world setting in the future, it can be exposed to fresh data and trained further, which
should increase its accuracy and dependability. Therefore, experimenting with these
technologies where traditional methods are insufficient can provide unique solutions.
According to surveys, approximately five crore Indians are thalassemia carriers. In our
study, we aimed to find such a solution to a disease with a significant global burden, and
the tool was used to predict beta-thalassemia carriers.
Another objective of this study was to discover the mutations in SCD pertaining to the North
Indian registry. When machine learning heuristics were applied, the DT models showed strong
performance and have the potential to be used as a tool for hemoglobinopathy identification
in medical laboratory workflows. These findings suggest that, when provided with sufficient
data, our machine learning algorithms could predict a variety of hemoglobin variants. Our
models, however, need to be evaluated on a larger number of datasets covering a wider range
of patient data associated with hemoglobinopathies.
Given the significant rise in the number of people with sickle cell disease in India over
the years, it is critical to identify these genetic abnormalities. Well-designed
community-based studies are an essential public health measure for controlling genetic
diseases. In our studies, we tried to predict hemoglobin variants from our dataset, using
machine learning techniques to determine whether a variant is present or not. Understanding
and identifying the pathophysiology is very important for treatment and disease prediction.
Prenatal diagnosis and awareness programs are the only way to prevent the occurrence of this
type of rare genetic disease.
CONCLUSION
As the number of patients with hemoglobinopathies (especially β-thalassemia and sickle cell
disease) increases, so does the need to identify hemoglobin variants and β-thalassemia
carriers at an early stage. Current methods for detecting these are expensive and
time-consuming. To address this, our first objective was to identify mutations in a cohort
of beta-thalassemia and sickle cell disease patients from North Indian registries, and we
then attempted to propose an ensemble classifier for β-thalassemia carrier screening and to
analyze hemoglobin variants. The dataset used in this work was compiled from whole blood
analysis tests.
REFERENCES
Arica V, Arica S, Özer C, Çevik M. Serum Lipid Values in Children with Beta
Thalassemia Major. Pediat Therapeut. 2012; 2:130.
Asmarian N, Kamalipour A, Hosseini-Bensenjan M, Karimi M, Haghpanah S.
Prediction of Heart and Liver Iron Overload in β-Thalassemia Major Patients Using
Machine Learning Methods. Hemoglobin. 2022 Nov;46(6):303-307.
Aszhari FR, Rustam Z, Subroto F, Semendawai AS. Classification of thalassemia data
using random forest algorithm. J Phys Conf Ser. 2020 Mar;1490:012050
Aydinok Y, Kattamis A, Viprakasit V. Current approach to iron chelation in children.
Br J Haematol 2014;745–755.
Bajwa H, Basit H. Thalassemia. 2022 Aug 8. In: StatPearls [Internet]. Treasure Island
(FL): StatPearls Publishing; 2023 Jan.
Barnhart-Magen G, Gotlib V, Marilus R, Einav Y. Differential Diagnostics of
Thalassemia Minor by Artificial Neural Networks Model. J Clin Lab Anal. 2013 Nov
11;27(6):481–6.
Brancaleoni V, Di Pierro E, Motta I, Cappellini MD. Laboratory diagnosis of
thalassemia. International Journal of laboratory hematology. 2016 May;38:32-40.
Brittenham GM, Griffith PM, Nienhuis AW, et al. Efficacy of deferoxamine in
preventing complications of iron overload in patients with thalassemia major. N Engl J
Med 1994;331:567-573.
Cao A, Galanello R. Beta-thalassemia. Genet Med. 2010 Feb;12(2):61-76. doi:
10.1097/GIM.0b013e3181cd68ed.
Chapin J, Giardina PJ. Thalassemia syndromes. In: Hematology 2018 Jan 1 (pp.
546-570). Elsevier.
Chong SC, Metassan S, Yusof N, Idros R, Johari N, Zulkipli IN, Ghani H, Lim MA,
Taib S, Lu ZH, Abdul-Hamid MRW. Thalassemia in Asia 2021 Thalassemia in Brunei
Darussalam. Hemoglobin. 2022 Jan;46(1):15-19.
Choudhry VP. Thalassemia Minor and Major: Current Management. Indian J Pediatr.
2017 Aug;84(8):607-611. doi: 10.1007/s12098-017-2325-1. Epub 2017 Apr 24.
Colah RB, Seth T. Thalassemia in India. Hemoglobin. 2022 Jan;46(1):20-26.
Cousens NE, Gaff CL, Metcalfe SA, Delatycki MB. Carrier screening for
Beta-thalassaemia: a review of international practice. Eur J Hum Genet. 2010
Oct;18(10):1077–83.
Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, et al. A
guide to deep learning in healthcare. Nature Medicine 2019 Jan;25(1):24–9.
Elshami EH, Alhalees AM. Automated diagnosis of thalassemia based on data mining
classifiers. In: The International Conference on Informatics and Applications (ICAI
2012) 2012 (pp. 440-445).
El Hasbani G, Musallam KM, Uthman I, Cappellini MD, Taher AT. Thalassemia and
autoimmune diseases: Absence of evidence or evidence of absence? Blood Rev. 2022
Mar;52:100874. doi: 10.1016/j.blre.2021.100874. Epub 2021 Aug 14.
Feng P, Li Y, Liao Z, Yao Z, Lin W, Xie S, Hu B, Huang C, Liu W, Xu H, Liu M, Gan
W. An online alpha-thalassemia carrier discrimination model based on random forest
and red blood cell parameters for low HbA2 cases. Clin Chim Acta. 2022 Jan
15;525:1-5. doi: 10.1016/j.cca.2021.12.003. Epub 2021 Dec 6.
Galanello R, Origa R. Beta-thalassemia. Orphanet J Rare Dis. 2010 May 21;5:11. doi:
10.1186/1750-1172-5-11.
Grady RW. The development of new drugs for use in iron chelation therapy. Birth
Defects Orig Artic Ser 1976;12:161–175.
Hagag AA, Elfrargy MS, Elfatah MA, et al. Comparative Study of Deferiprone and
Silymarin versus Deferiprone and Placebo as Iron Chelators in Children with Beta
Thalassemia with Iron Overload. J Leuk (Los Angel). 2014; 2:130.
Kacian DL, Gambino R, Dow LW, et al. Decreased globin messenger RNA in
thalassemia detected by molecular hybridization. Proc Natl Acad Sci USA
1973;70:1886–1890.
Kohne E. Hemoglobinopathies: clinical manifestations, diagnosis, and treatment.
Dtsch Arztebl Int. 2011 Aug;108(31-32):532-40. doi: 10.3238/arztebl.2011.0532.
Epub 2011 Aug 8.
Kumar R, Sagar C, Sharma D, Kishor P. β-globin genes: mutation hot-spots in the
global thalassemia belt. Hemoglobin. 2015;39(1):1-8. doi:
10.3109/03630269.2014.985831. Epub 2014 Dec 19.
Langlois S, Ford JC, Chitayat D; CCMG PRENATAL DIAGNOSIS COMMITTEE;
SOGC GENETICS COMMITTEE. Carrier screening for thalassemia and
hemoglobinopathies in Canada. J Obstet Gynaecol Can. 2008 Oct;30(10):950-959.
English, French. doi: 10.1016/S1701-2163(16)32975-9. PMID: 19038079.
Loukopoulos D. Haemoglobinopathies in Greece: prevention programme over the
past 35 years. Indian Journal of Medicine Research. 2011 Oct;134(4):572–6.
Mehta S, Medicherla KM, Gulati S, et al. Whole exome sequencing of adult Indians
with apparently acquired Aplastic Anemia: initial experience at tertiary care hospital.
Research Square; 2023. DOI: 10.21203/rs.3.rs-2836149/v1.
Monalisha Saikia Borah and Prasanta Kumar Bhattacharya and Mauchumi Saikia
Pathak. Study of IVS 1-5 (G→C) Mutation in the Beta Thalassaemia Patients of a
Tertiary Care Hospital of North East India. 2015.
Muhammad LJ, Al-Shourbaji I, Haruna AA, Mohammed IA, Ahmad A, Jibrin MB.
Machine Learning Predictive Models for Coronary Artery Disease. SN Comput Sci.
2021;2(5):350. doi: 10.1007/s42979-021-00731-4. Epub 2021 Jun 22.
Muncie HL Jr, Campbell J. Alpha and beta thalassemia. Am Fam Physician. 2009
Aug 15;80(4):339-44. PMID: 19678601.
Ohba Y, Hattori Y, Harano T, Harano K, Fukumaki Y, Ideguchi H. beta-thalassemia
mutations in Japanese and Koreans. Hemoglobin. 1997 Mar;21(2):191-200. doi:
10.3109/03630269708997524. Erratum in: Hemoglobin 1997 Jul;21(4):389.
Origa R. β-Thalassemia. Genet Med. 2017 Jun;19(6):609-619. doi:
10.1038/gim.2016.173. Epub 2016 Nov 3.
Rustam F, Ashraf I, Jabbar S, Tutusaus K, Mazas C, Barrera AEP, de la Torre Diez I.
Prediction of [Formula: see text]-Thalassemia carriers using complete blood count
features. Sci Rep. 2022 Nov 21;12(1):19999. doi: 10.1038/s41598-022-22011-8.
Sabath DE. Molecular Diagnosis of Thalassemias and Hemoglobinopathies: An
ACLPS Critical Review. Am J Clin Pathol. 2017 Jul 1;148(1):6-15. doi:
10.1093/ajcp/aqx047. PMID: 28605432.
Saboor M, Qudsia F, Qamar K, et al. Levels of Calcium, Corrected Calcium, Alkaline
Phosphatase and Inorganic Phosphorus in Patients’ Serum with β-Thalassemia Major
on Subcutaneous Deferoxamine. J Hematol Thromb Dis. 2014; 2:130.
Shine I, Lal S. A strategy to detect beta-thalassaemia minor. Lancet Lond Engl. 1977
Mar 26;1(8013):692–4.
Tarca AL, Carey VJ, Chen XW, Romero R, Drăghici S. Machine learning and its
applications to biology. PLoS Comput Biol. 2007 Jun;3(6):e116. doi:
10.1371/journal.pcbi.0030116. PMID: 17604446; PMCID: PMC1904382.
Thacker N. Prevention of thalassemia in India. Indian Pediatr. 2007 Sep;44(9):647-8.
PMID: 17921552.
Topol E. Deep Medicine: How Artificial Intelligence Can Make Healthcare Human
Again. Basic Books; 2019. 373 p.
Viprakasit V, Ekwattanakit S. Clinical Classification, Screening and Diagnosis for
Thalassemia. Hematol Oncol Clin North Am. 2018 Apr;32(2):193-211. doi:
10.1016/j.hoc.2017.11.006. PMID: 29458726.
Webb S. Deep learning for biology. Nature. 2018 Feb 22;554(7693):555-557. doi:
10.1038/d41586-018-02174-z. Erratum in: Nature. 2018 Mar 22;555(7697):547.
PMID: 29469107.
Wongseree W, Chaiyaratana N, Vichittumaros K, Winichagoon P, Fucharoen S.
Thalassaemia classification by neural networks and genetic programming. Inf Sci Int
J. 2007 Feb 1;177(3):771–786.
Yadav SS, Panchal P, Menon KC. Prevalence and Management of β-Thalassemia in
India. Hemoglobin. 2022 Jan;46(1):27-32. doi: 10.1080/03630269.2021.2001346.
Epub 2022 Feb 7. PMID: 35129043.