Unveiling the role of Tdark genes in genetic diseases and
phenotypes through bioinformatics-based functional
enrichment and network analyses
Doris Kafita1*, Kevin Dzobo2, Panji Nkhoma1, Musalula Sinkala1,3
1University of Zambia, School of Health Sciences, Department of Biomedical Sciences,
Lusaka, Zambia.
2Medical Research Council-SA Wound Healing Unit, Hair and Skin Research Laboratory,
Division of Dermatology, Department of Medicine, Groote Schuur Hospital, Faculty of
Health Sciences University of Cape Town, Anzio Road, Observatory, Cape Town 7925,
South Africa.
3University of Cape Town, Faculty of Health Sciences, Institute of Infectious Disease and
Molecular Medicine, Computational Biology Division, Cape Town, South Africa.
*Corresponding author
E-mail: kafitadorisk@gmail.com (DK)
bioRxiv preprint doi: https://doi.org/10.1101/2024.01.13.575505; this version posted January 15, 2024. The copyright holder for this
preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in
perpetuity. It is made available under a CC-BY-NC 4.0 International license.
Abstract
In recent years, there has been growing interest in understanding the role of dark genes in
genetic diseases and phenotypes. Despite their lack of functional characterisation, dark
genes account for a significant portion of the human genome and are believed to play a role
in regulating gene expression and cellular processes. We investigated the role of dark genes
in genetic diseases and phenotypes by conducting integrative network analyses and
functional enrichment studies across multiple large-scale molecular datasets. Our
investigation revealed a predominant association of both dark and light genes with psoriasis.
Furthermore, we found that the transcription factors UBTF and NFE2L2 are potential
regulators of both dark and light genes associated with tuberculosis. In contrast, the
transcription factors SUZ12 and TP63 are potential regulators of both dark and light genes
associated with interstitial cystitis. Further network analysis of dark genes, including
CALHM6, HCP5, PRRG4, DDX60L and RASA2, revealed a notably high weighted degree
of association with genetic diseases and phenotypes. Moreover, our analysis revealed that
numerous genetic diseases and phenotypes, including psoriasis, Pick disease, tuberculosis,
ulcerative colitis, interstitial cystitis, and Crohn's disease, exhibited shared gene linkages.
Additionally, we conducted a protein-protein interaction analysis to reveal 16 dark genes
that encode hub proteins, including R3HDM2, RPUSD4, FASTKD5, and MRPL15, that could
play a role in many genetic diseases and phenotypes, and are widely expressed across
body tissues. Our findings contribute to the understanding of the genetic basis of diseases
and provide potential therapeutic targets for future research. Identifying dysregulated dark
genes in disease states can lead to new strategies for prevention, diagnosis, and treatment,
thus advancing our understanding of disease mechanisms.
Introduction
Genetic diseases and phenotypes can manifest as monogenic, polygenic, or complex
conditions resulting from a combination of genetic and environmental factors [1–4].
Numerous clinical phenotypes are linked to mutations in specific genes, and disruptions in
the regulatory mechanisms governing gene expression can also contribute to these
conditions [5–10]. Diseases stemming from alterations in the human genetic code pose a
substantial burden, with recognised genetic diseases affecting over 5% of live births and
more than two-thirds of miscarriages [11]. In addition to highly penetrant monogenic
disorders and significant chromosomal changes, the heritability of common diseases
strongly suggests a genetic basis for prevalent disorders, such as cardiovascular disease
[11–14].
Recent advances in genomic sequencing technology have led to the discovery of thousands
of genetic variants and genes associated with genetic diseases and phenotypes [15–17].
The implementation of high-throughput next-generation sequencing techniques has enabled
the swift sequencing of a vast number of complete genomes, resulting in an ever-growing
wealth of genomic information that is becoming increasingly accessible [18–21]. However,
over one-third of all protein-coding genes have functions that are either poorly understood
or completely unknown [22–26]. These genes, often referred to as the “dark genome” or
“Tdark genes”, have not been extensively studied, so literature coverage is limited and
their biological significance remains difficult to assess [22]. Recent research has revealed
that targeting the dark genome could be a promising approach for cancer treatment since it
shares similar genetic and pharmacological dependencies with the light genome [27].
However, our understanding of the roles of dark genes in other genetic diseases and
phenotypes remains limited.
To better understand the dark genome, the Illuminating the Druggable Genome (IDG)
project was launched in 2014 to identify knowledge gaps in human genome-encoded
proteins and to explore understudied proteins that may be potential drug targets [24–26].
This project aims to gather comprehensive information on protein families, their associations
with diseases and drugs, and their structural and functional properties [24,25]. Currently, the
IDG project has generated vast amounts of data on over 20,000 protein-coding genes
[24,25]. These datasets contain information about the dark genome or dark genes, which
have been curated based on the extent to which genes are studied. Furthermore, several
other biomedical projects have curated information on protein-protein interaction [28–32],
gene dependencies [33], genes involved in disease [34,35], and gene expression and
regulation in human tissue [36].
The available large-scale datasets compiled by various biomedical databases present new
opportunities to apply integrative analyses that could shed light on the dark genome and
discover the possible roles of dark genes in human diseases. Here, we integrated data on
dark and light gene annotations together with curated protein-protein interaction information
to examine the connection between light and dark genes and genetic diseases and
phenotypes. Additionally, we investigated the potential biological processes and pathways
in which dark genes might participate, as well as their tissue-specific expression patterns.
Overall, we illuminated various aspects of the biological functions involving the dark genome
and offered insights into its potential involvement in genetic diseases and phenotypes.
Methods
The study protocol was approved by the University of Zambia’s Health Sciences Research
Ethics Committee (IRB00011000). The analyses conducted in this study utilised
publicly available datasets collected by the IDG, GTEx and HPA projects, which were made
accessible through their respective project databases. The methods employed in this study
adhered to the pertinent policies, regulations, and guidelines established by the IDG, GTEx
and HPA projects for the analysis of their datasets and the reporting of findings.
Determination of dark and light gene association with genetic diseases
To investigate the association between dark and light genes and genetic
diseases/phenotypes, we obtained a list of genetic diseases, disease genes, and their
associations from Pharos (version 3.15.1) (https://pharos.nih.gov/). Pharos serves as an
integrated web-based informatics platform for analysis of data aggregated by the
Illuminating the Druggable Genome (IDG) Knowledge Management Centre from various
databases, including UniProt, GWAS, and DisGeNET, among others [25]. Genes/proteins
in Pharos are classified into four target development levels, encompassing light genes (Tbio,
Tclin and Tchem) and dark genes (Tdark).
Tbio designates targets in the early stages of development, exhibiting a well-documented
Mendelian disease phenotype in the Online Mendelian Inheritance in Man (OMIM) database,
Gene Ontology (GO) leaf term annotations supported by experimental evidence, or
satisfying at least two of the following three criteria: a fractional PubMed publications count
exceeding five, three or more annotations from the National Centre for Biotechnology
Information (NCBI) Gene Reference Into Function (RIF), or 50 or more commercial
antibodies, as indicated by data accessible from the Antibodypedia database [24,26].
Conversely, Tchem pertains to targets with available chemical tools for investigation,
possessing bioactivity information in databases, with varying potency criteria depending on
protein type (e.g., kinases, GPCRs) [26,37]. Tclin refers to well-characterised genes that are
linked to drug mechanisms of action (MoA) and have established clinical relevance [24,26]. In
contrast, Tdark designates targets with minimal information available, failing to meet any of
the criteria set for Tclin, Tchem or Tbio [23,24].
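The target development level rules described above can be written as a small decision function. The following is a minimal sketch, not the Pharos implementation: the record type `TargetEvidence`, its field names, and the order in which the levels are tested (Tclin, then Tchem, then Tbio, then Tdark) are assumptions made for illustration, and the family-specific Tchem potency thresholds are reduced to a single boolean flag.

```python
from dataclasses import dataclass


@dataclass
class TargetEvidence:
    """Hypothetical evidence record for one gene/protein (illustrative field names)."""
    has_drug_moa: bool = False              # linked to a drug mechanism of action
    has_potent_bioactivity: bool = False    # meets the family-specific potency criterion
    has_omim_phenotype: bool = False        # documented Mendelian phenotype in OMIM
    has_experimental_go_leaf: bool = False  # GO leaf term with experimental evidence
    fractional_pubmed_count: float = 0.0
    generif_count: int = 0
    antibody_count: int = 0


def classify_tdl(e: TargetEvidence) -> str:
    """Assign a target development level; the precedence order is an assumption."""
    if e.has_drug_moa:
        return "Tclin"
    if e.has_potent_bioactivity:
        return "Tchem"
    # Tbio "two of three" rule taken from the text above.
    soft_criteria = sum([
        e.fractional_pubmed_count > 5,
        e.generif_count >= 3,
        e.antibody_count >= 50,
    ])
    if e.has_omim_phenotype or e.has_experimental_go_leaf or soft_criteria >= 2:
        return "Tbio"
    return "Tdark"  # none of the Tclin, Tchem or Tbio criteria were met


if __name__ == "__main__":
    print(classify_tdl(TargetEvidence(generif_count=4, antibody_count=60)))
    print(classify_tdl(TargetEvidence(generif_count=1)))
```

Under this reading, a gene with a single satisfied literature criterion and no OMIM or experimental GO annotation remains Tdark.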
We analysed the distribution of genetic diseases and their associated genes across each
target development level and generated a heatmap to display the association of the top 40
genetic diseases with dark and light genes.
Enrichment analysis of dark and light genes associated with selected
common diseases
To gain insight into the regulatory mechanisms that control the expression and function of
these genes, we conducted enrichment analyses separately on the light and dark genes
linked to each of the top ten diseases. To accomplish this, we utilised
eXpression2Kinases (X2K), a web tool developed by the Ma’ayan Laboratory (available at
https://maayanlab.cloud/X2K/), which performs gene set enrichment analysis using various
transcription factor gene set libraries, including integrated target genes for transcription
factors determined by ChIP-seq experiments (ChEA). In addition, X2K employs the
Genes2Networks (G2N) algorithm to identify proteins that interact physically with the
transcription factors and performs kinase enrichment analysis (KEA) on the list of identified
transcription factors and proteins using gene set libraries from kinase-substrate interaction
databases [38]. Transcription factors and kinases with a hypergeometric p-value of less than
0.05 were considered statistically significant.
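For readers unfamiliar with the hypergeometric test behind this threshold, the sketch below computes the upper-tail probability of seeing at least a given overlap between an input gene list and a transcription factor's target set. It illustrates the statistic only and is not the X2K code; the function name, argument names, and the background size used in the example are assumptions.

```python
from math import comb


def hypergeom_enrichment_p(universe: int, targets: int, sample: int, overlap: int) -> float:
    """P(X >= overlap) when `sample` genes are drawn without replacement from a
    background of `universe` genes, `targets` of which belong to the TF's gene set."""
    upper = min(targets, sample)
    total = comb(universe, sample)
    tail = sum(
        comb(targets, k) * comb(universe - targets, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers: 20,000 background genes, a TF with 1,500 targets,
    # an input list of 200 genes, 35 of which are targets of the TF.
    p = hypergeom_enrichment_p(20_000, 1_500, 200, 35)
    print(f"p = {p:.3g}", "significant" if p < 0.05 else "not significant")
```

Exact integer arithmetic is used so that very small tail probabilities are not lost to rounding before the final division.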
Construction of the diseasome bipartite network for dark genes and
their associated genetic diseases and phenotypes
We utilised Gephi (version 0.9.7), a robust network visualisation software [39], to further
evaluate the role of dark genes and how they contribute to disease development using a
bipartite network representing the diseasome. We used the circle pack layout with nodes
grouped according to their category to visualise the network. The gene-diseasome network
was made up of 2,704 nodes and 5,639 undirected edges. The nodes in this network belong
to two different categories: diseases and genes. A connection between a disease and a
gene was established if the gene was known to play a role in the disease/phenotype.
Additionally, we manually classified each disease or phenotype into one of the 17 categories
based on the physiological system affected. Furthermore, using the gene-diseasome
bipartite network as a starting point, we used multimode network projection to create two
biologically relevant network projections: the gene-disease network (GDN) and the disease-
gene network (DGN). The GDN was visualised with nodes sized based on their degree
within the range of 10 to 80. Nodes were coloured according to their associated
disease/phenotype category. The Force Atlas 3D layout with default settings was applied to
generate the network. Furthermore, to eliminate unconnected nodes, the giant component
was used as a filter. Within the GDN, the nodes represented diseases, and two diseases
were linked if they shared at least one gene that was associated with both. This network
projection was composed of 503 diseases, with 5,262 unique connections. Furthermore, we
used the Circle Pack layout that grouped nodes based on their weighted degree to visualise
the DGN. Node size was determined by the weighted degree, ranging from 10 to 80. In
addition, nodes were colour-coded according to their betweenness centrality. The DGN was
composed of nodes that represented genes, with connections between two nodes occurring
if the corresponding genes were both implicated in the same disease. The size of each node
was proportional to the number of diseases in which the gene was involved. This network
projection encompassed 2,147 genes with 467,344 unique connections. Additionally, we
utilised Gephi to detect smaller clusters, or modules, within the bipartite network of the
diseasome. To accomplish this, we employed the Louvain method, a widely used approach
for cluster identification in network analysis [40]. To determine the significance of nodes in
the networks, we calculated several network centrality metrics, such as degree, closeness
centrality, betweenness, and eccentricity.
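The two projections described in this subsection can be illustrated without Gephi. The sketch below takes gene-disease pairs from a bipartite network and derives either a disease projection (two diseases linked if they share a gene) or a gene projection (two genes linked if they share a disease), with edge weights counting the shared neighbours. It is an illustrative reimplementation under these assumptions, not the multimode projection routine used in the study, and the gene labels in the example are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def project(edges, onto="disease"):
    """Project (gene, disease) pairs onto one node type.

    Returns {frozenset({a, b}): weight}, where weight is the number of
    shared neighbours of the other node type.
    """
    neighbours = defaultdict(set)
    for gene, disease in edges:
        if onto == "disease":
            neighbours[gene].add(disease)   # diseases grouped by a shared gene
        else:
            neighbours[disease].add(gene)   # genes grouped by a shared disease
    weights = defaultdict(int)
    for members in neighbours.values():
        for a, b in combinations(sorted(members), 2):
            weights[frozenset((a, b))] += 1
    return dict(weights)


def weighted_degree(weights):
    """Sum of incident edge weights for every node in a projection."""
    degree = defaultdict(int)
    for pair, weight in weights.items():
        for node in pair:
            degree[node] += weight
    return dict(degree)


if __name__ == "__main__":
    # Toy bipartite network with hypothetical gene labels.
    edges = [
        ("GENE_A", "psoriasis"), ("GENE_A", "tuberculosis"),
        ("GENE_B", "psoriasis"), ("GENE_B", "tuberculosis"),
        ("GENE_C", "psoriasis"),
    ]
    print(project(edges, onto="disease"))
    print(weighted_degree(project(edges, onto="gene")))
```

Because every disease contributes a clique among its genes, the gene projection grows roughly quadratically with the number of genes per disease, which is consistent with the large edge count reported for that projection.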
Identification of hub dark genes
Further network analyses of the disease-gene network allowed us to identify key genes
involved in disease processes. Using a list of the 2,144 dark genes, we obtained known
protein-protein interaction of these genes from the University of California, Santa Cruz
(UCSC) [28], ChIP Enrichment Analysis (ChEA) [29], Kinase Enrichment Analysis (KEA)
[30], Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) [31] and the
Biological General Repository for Interaction Datasets (BioGRID) [32] to create an
interaction network of dark genes. The network, visualised in Cytoscape
(http://apps.cytoscape.org/), had 149 nodes and 216 edges. To identify the hub dark
genes in the network, we used the CytoHubba (http://apps.cytoscape.org/apps/cytohubba)
plugin in Cytoscape. Considering that biological networks are heterogeneous, we used more
than one method to identify essential proteins [41]. Therefore, we used three algorithms to
reflect the status of the node in the entire network from different aspects. We then obtained
the top 20 genes from the protein-protein network ranked by three different algorithms:
Closeness, Degree, and Maximal Clique Centrality (MCC). Furthermore, to observe the
intersections of the results produced by the three algorithms, a Venn network was generated
using Evenn (http://www.ehbio.com/test/venn) to identify the commonly predicted significant
hub genes. Finally, to demonstrate the association of the identified hub dark genes with
genetic diseases and phenotypes, a diseasome bipartite subnetwork was constructed using
Gephi software.
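The consensus step, taking the top-ranked nodes under several centrality measures and keeping only those common to all rankings, can be sketched as follows. This is an illustration under stated assumptions rather than the CytoHubba procedure: only degree and one common definition of closeness (the sum of reciprocal shortest-path distances) are computed here, Maximal Clique Centrality is omitted for brevity, and the toy graph is hypothetical.

```python
from collections import deque


def degree_scores(adj):
    """Number of direct neighbours of each node."""
    return {node: len(nbrs) for node, nbrs in adj.items()}


def closeness_scores(adj):
    """Sum of reciprocal shortest-path distances; unreachable nodes contribute 0."""
    scores = {}
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:                      # breadth-first search from src
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        scores[src] = sum(1 / d for d in dist.values() if d > 0)
    return scores


def top_k(scores, k):
    """The k highest-scoring nodes; ties are broken alphabetically."""
    ranked = sorted(scores, key=lambda n: (-scores[n], n))
    return set(ranked[:k])


def consensus_hubs(score_tables, k=20):
    """Nodes that appear in the top k of every ranking."""
    return set.intersection(*(top_k(s, k) for s in score_tables))


if __name__ == "__main__":
    # Toy undirected graph given as an adjacency mapping.
    adj = {
        "A": {"B", "C", "D"},
        "B": {"A"},
        "C": {"A"},
        "D": {"A", "E"},
        "E": {"D"},
    }
    print(consensus_hubs([degree_scores(adj), closeness_scores(adj)], k=2))
```

A third score table, for example MCC values exported from another tool, could be appended to the list passed to `consensus_hubs` without changing the function.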
Gene ontology analysis of the identified hub dark genes
To elucidate the molecular mechanisms and biological processes underlying hub dark
genes, we analysed hub genes at the functional level. Gene ontology (GO) enrichment [42]
and Reactome pathway analysis [43] were performed using the Enrichr web tool [44]
(https://maayanlab.cloud/Enrichr/). Pathways and GO terms with a P-value < 0.05 were
considered significant.
Tissue expression of hub dark genes
To gain insight into the molecular mechanisms by which the prominent dark genes we
identified are associated with genetic diseases, we utilised the Genotype-Tissue Expression
(GTEx) [45] database (version 8) (https://gtexportal.org/home/). Specifically, to analyse the
expression patterns of the 16 hub dark genes across different tissues, we created a heatmap
using the GTEx data. This allowed us to visualise the expression levels of the genes in
different tissues and identify any patterns or correlations between gene expression and
disease development. Additionally, we searched for expression quantitative trait loci
(eQTLs) within the dark genes to better understand how mutations within these genes may
contribute to disease development. By identifying eQTLs associated with the Tdark hub
genes and the particular tissues in which they are expressed, we can begin to understand
the biological pathways and mechanisms involved in disease development. By combining
the eQTL data with our knowledge of disease-gene associations, we hoped to gain a more
comprehensive understanding of the molecular mechanisms underlying genetic diseases
and identify potential therapeutic targets. Furthermore, we employed the Open Targets
Genetics (https://genetics.opentargets.org), an open-access integrative resource that
combines human GWAS and functional genomics data [46] to ascertain if any of the eQTLs
are linked to phenotypes specific to tissues in which they are expressed.
Tissue distribution and specificity analysis of dark and light gene
expression
We performed the tissue distribution and specificity analysis of dark and light gene
expression to gain insights into the functional characteristics and potential roles of these
genes in different tissues. To achieve this, we obtained the gene expression data from the
Human Protein Atlas (HPA) [47] (version 23.0) (https://www.proteinatlas.org/). This dataset
provided us with information on tissue distribution and tissue specificity of gene expression.
HPA classifies genes based on tissue distribution and tissue specificity [48]. Tissue
distribution is assessed based on the number of tissues with detectable mRNA levels above
a specified cut-off and is classified into five distinct categories [48]: (a) Detected in single:
only one tissue exhibits detectable levels. (b) Detected in some: more than one, but less
than one-third of the tissues display detectable levels. (c) Detected in many: at least one-
third, but not all tissues show detectable levels. (d) Detected in all: all 37 tissues exhibit
detectable mRNA levels. (e) Not detected: None of the 37 tissues demonstrate detectable
levels.
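The five tissue distribution categories listed above amount to counting the tissues in which a gene is detected. A minimal sketch follows; the function name, the default cut-off value of 1.0, and the decision to treat a level equal to the cut-off as detected are assumptions for illustration, since the text only refers to "a specified cut-off".

```python
def tissue_distribution(levels, cutoff=1.0):
    """Classify a gene from a mapping of tissue name -> mRNA level.

    `cutoff` is an assumed detection threshold; levels >= cutoff count as detected.
    """
    n_total = len(levels)
    n_detected = sum(1 for value in levels.values() if value >= cutoff)
    if n_detected == 0:
        return "Not detected"
    if n_detected == n_total:
        return "Detected in all"
    if n_detected == 1:
        return "Detected in single"
    if 3 * n_detected < n_total:      # fewer than one-third of the tissues
        return "Detected in some"
    return "Detected in many"         # at least one-third, but not all


if __name__ == "__main__":
    # Hypothetical gene detected in 5 of 37 tissues.
    levels = {f"tissue_{i}": (8.0 if i < 5 else 0.1) for i in range(37)}
    print(tissue_distribution(levels))
```

With 37 tissues, this places genes detected in 2 to 12 tissues in "Detected in some" and genes detected in 13 to 36 tissues in "Detected in many".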
The classification of tissue specificity is determined by evaluating the fold-change in mRNA
expression levels across 37 analysed tissues and organs, and it is categorised into five
distinct specificity groups [48]: (a) Tissue enriched: signified by at least a fourfold higher
mRNA level in one tissue compared to all other tissues. (b) Group enriched: characterised
by a group of 2–5 tissues exhibiting at least a fourfold higher mRNA level compared to all
other tissues. (c) Tissue enhanced: designated when one tissue demonstrates at least a
fourfold higher mRNA level compared to the average level in all other tissues. (d) Low tissue
specificity: identified when at least one tissue has mRNA levels above the cut-off, but the
gene does not fit into any of the above categories. (e) Not detected: applicable when all
tissues exhibit mRNA levels below the cut-off.
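The specificity categories above can likewise be expressed as an ordered set of checks on the sorted expression values. The sketch below is one reading of the definitions in the text, not the HPA pipeline: the cut-off value, the use of the lowest group member against the highest non-member for the group test, and the function name are all assumptions.

```python
def tissue_specificity(levels, cutoff=1.0, fold=4.0, max_group=5):
    """Classify a gene from a mapping of tissue name -> mRNA level.

    `cutoff` is an assumed detection threshold; `fold` is the fourfold rule.
    """
    values = sorted(levels.values(), reverse=True)
    if values[0] < cutoff:
        return "Not detected"
    # (a) The top tissue exceeds every other tissue by the fold threshold.
    if len(values) > 1 and values[0] >= fold * values[1]:
        return "Tissue enriched"
    # (b) A group of 2..max_group tissues exceeds every tissue outside the group.
    for size in range(2, max_group + 1):
        if (size < len(values)
                and values[size - 1] >= cutoff
                and values[size - 1] >= fold * values[size]):
            return "Group enriched"
    # (c) The top tissue exceeds the mean of all other tissues.
    if len(values) > 1:
        mean_rest = sum(values[1:]) / (len(values) - 1)
        if values[0] >= fold * mean_rest:
            return "Tissue enhanced"
    return "Low tissue specificity"


if __name__ == "__main__":
    # Hypothetical profile: one tissue at 10, all 36 others at 1.
    levels = {f"tissue_{i}": 1.0 for i in range(37)}
    levels["tissue_0"] = 10.0
    print(tissue_specificity(levels))
```

The checks run from the strictest category to the weakest, so a gene that satisfies several definitions is assigned the most specific one.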
In addition, we used the expression predictions of the Xpresso model, as presented by
Agarwal and Shendure [49], to determine whether they are associated with the tissue
distribution and tissue specificity categories from HPA. Furthermore, we used information from
the Pharos database (version 3.15.1), including gene symbols and target development
levels (Tbio, Tchem, Tclin and Tdark) for integration with tissue specificity and tissue
distribution data.
Data analysis
We utilised a combination of computational analyses, employing MATLAB R2021a and
various bioinformatics tools, to conduct all analyses. Statistical significance was considered
when p-values were < 0.05 for single comparisons or when the q-values (Benjamini-
Hochberg adjusted p-values) were < 0.05 for multiple comparisons.
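For reference, the Benjamini-Hochberg adjustment mentioned above can be computed in a few lines. The study itself used MATLAB; the Python function below is an independent sketch of the standard step-up procedure, with a hypothetical function name and example p-values.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


if __name__ == "__main__":
    pvals = [0.01, 0.04, 0.03, 0.005]   # hypothetical raw p-values
    for p, q in zip(pvals, benjamini_hochberg(pvals)):
        print(f"p = {p:<6} q = {q:.3f}", "*" if q < 0.05 else "")
```

Each q-value is the smallest false discovery rate at which the corresponding test would be called significant.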
Results
Both dark and light genes are associated with genetic diseases and
phenotypes
To investigate dark and light gene associations with genetic diseases and phenotypes, we
utilised information from the Pharos [26] (version 3.15.1) web portal (https://pharos.nih.gov).
Pharos provides gene classifications based on their target development levels (TDLs) (see
Methods section for the description of TDLs). First, we evaluated the number of genes in
each Pharos TDL category and found that most genes in Pharos belonged to the Tbio
category (12,058, 59.07%), followed by Tdark (5,679, 27.82%), Tchem (1,971, 9.67%), and
Tclin (704, 3.45%) (see S1a Fig).
We then integrated the information on TDLs with the genetic diseases and phenotypes
(excluding cancers) associated with various genes. Here, our analysis revealed a significant
variation in the number of genes and genetic traits across different TDLs. Specifically, Tbio
genes accounted for 7,642 (66.18%) of the disease-associated genes, followed by Tdark
genes (2,146, 18.58%), Tchem genes (1,327, 11.49%), and Tclin genes (432, 3.74%), as
illustrated in Fig 1a. To view a comprehensive list of genes in each TDL, refer to S1 File.
Furthermore, we analysed the number of genetic diseases and phenotypes linked to genes
within each TDL. We found that the Tbio category was associated with most diseases and
phenotypes (7,704, 48.99% diseases and phenotypes), followed by Tchem (3,926, 24.96%
diseases and phenotypes), Tclin (3,540, 22.51% diseases and phenotypes), and Tdark
(557, 3.54% diseases and phenotypes), as shown in Fig 1b. The distribution of genetic
diseases and phenotypes in each TDL is also available in S1 File.
Fig 1. Distribution of genes and genetic diseases/phenotypes in each target development level (TDL).
a. Frequency of genes. b. Frequency of genetic diseases and phenotypes. c. Distribution of genes linked to
the top 40 genetic diseases and phenotypes. CLE: Cutaneous lupus erythematosus, COPD: Chronic
obstructive pulmonary disease, DMD: Duchenne muscular dystrophy, PSP: Progressive supranuclear palsy,
EDMD: Emery-Dreifuss muscular dystrophy, SLE: Systemic lupus erythematosus, FSHD:
Facioscapulohumeral muscular dystrophy.
Additionally, we observed that Tclin exhibited the highest average number of diseases per
gene (8.194 diseases/gene), implying that these genes are associated with a broad
spectrum of diseases. Conversely, Tbio and Tchem had lower averages (1.008 and 2.959
diseases/gene, respectively), indicating that these genes are less frequently implicated in
disease. Furthermore, Tdark demonstrated the lowest average (0.259 diseases/gene),
indicating that its genes are less comprehensively characterised (S1b Fig). Notably, Tdark
exhibited a higher genes-to-disease ratio (3.853) compared to other TDLs, suggesting that,
on average, there are approximately four genes associated with each disease in this
category. This highlights the notable abundance of Tdark genes relative to identified
diseases and underscores the need for continuous research to elucidate the roles of Tdark
genes, potentially paving the way for innovative therapies across various diseases (S1c Fig).
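The per-category averages quoted in this paragraph follow from the gene and disease counts reported earlier in this section. The short calculation below reproduces them as a check; the variable names are illustrative, and small differences in the last decimal place can arise from rounding.

```python
# Counts of disease-associated genes and of linked diseases/phenotypes per TDL,
# as reported in the text above.
genes = {"Tbio": 7642, "Tchem": 1327, "Tclin": 432, "Tdark": 2146}
diseases = {"Tbio": 7704, "Tchem": 3926, "Tclin": 3540, "Tdark": 557}

# Average number of diseases per gene, and the inverse genes-to-disease ratio.
diseases_per_gene = {tdl: diseases[tdl] / genes[tdl] for tdl in genes}
genes_per_disease = {tdl: genes[tdl] / diseases[tdl] for tdl in genes}

if __name__ == "__main__":
    for tdl in genes:
        print(f"{tdl}: {diseases_per_gene[tdl]:.3f} diseases/gene, "
              f"{genes_per_disease[tdl]:.3f} genes/disease")
```

Tdark is the only category in which genes outnumber linked diseases, which is why its genes-to-disease ratio exceeds 1 while the other three fall below it.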
In our analysis, we found that psoriasis had notable associations across the TDL categories:
Tbio (3,194 associations), Tchem (629 associations), Tclin (213 associations), and Tdark
(801 associations). Similarly, ulcerative colitis was significantly associated with Tbio (1,391
associations), Tchem (327 associations), Tclin (121 associations), and Tdark (201
associations). These were among the top genetic diseases and phenotypes associated with
TDL gene categories (Fig 1c, also see S2 Fig). Interestingly, we found that some diseases,
such as Galloway-Mowat syndrome 2 (X-linked), Bardet-Biedl syndrome 18 and
chromosome 19q13.11 deletion syndrome, were exclusively associated with dark genes
(see S1 File). Overall, these findings indicate a significant potential for uncovering the roles
of Tdark genes in disease development, suggesting vast, untapped insights into their
contributions to genetic disorders.
Transcription factor and kinase enrichment analysis of dark and light
genes associated with genetic diseases and phenotypes
To gain insight into the regulatory mechanisms involved in gene expression, we identified
the top 10 genetic diseases and phenotypes with strong associations with both dark and
light genes. These include conditions such as psoriasis, ulcerative colitis, intellectual
disability, tuberculosis, and Crohn's disease. Subsequently, we performed transcription
factor and kinase enrichment analyses for genes in each TDL category, specifically targeting
the top ten genetic traits, to better understand the regulatory mechanisms influencing gene
expression. We first identified light genes associated with each disease and then assessed
the enrichment of transcription factors and kinases specific to these genes. In parallel, we
used the same method for the associated dark genes. This approach enabled us to delve
into the distinct regulatory mechanisms for TDL gene categories in relation to each genetic
trait (see Methods section).
Employing in silico Chromatin Immunoprecipitation Enrichment Analysis (ChEA) [29],
we predicted significant enrichment (p < 0.05) of transcription factors potentially regulating
the expression of light genes linked to each of the top ten genetic traits (refer to S2 File).
Notably, rheumatoid arthritis and psoriasis stood out, with 36 and 35 predicted enriched
transcription factors, respectively. The top-ranked transcription factors for light genes
associated with psoriasis included SUZ12, which regulates 467 light genes (p = 1.5 × 10⁻⁸),
SALL4, which regulates 130 light genes (p = 8.4 × 10⁻⁸) and GATA1, which regulates 235
light genes (p = 5.2 × 10⁻⁶) (Fig 2a). The top-ranked transcription factors for light genes
associated with rheumatoid arthritis included RELA, which regulates 61 light genes
(p = 2.7 × 10⁻¹⁰), UBTF, which regulates 134 light genes (p = 1.0 × 10⁻⁸) and GATA1, which
regulates 77 light genes (p = 1.1 × 10⁻⁷) (S3a Fig, also see S2 File for a list of regulated genes from
the input gene list).
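As a rough illustration of the kind of over-representation test that underlies enrichment p-values such as those above, the sketch below computes a one-sided hypergeometric (Fisher-type) tail probability for the overlap between a query gene list and a transcription factor's target set. It assumes this statistic for illustration only; the exact scoring used by ChEA [29] should be taken from the tool itself. The function name and the toy counts are hypothetical and are not drawn from S2 File.

```python
from math import comb


def hypergeom_enrichment_p(overlap, n_targets, n_query, n_background):
    """One-sided enrichment p-value P(X >= overlap).

    X is the number of transcription factor targets found when n_query genes
    are drawn without replacement from n_background genes, of which
    n_targets are targets of the factor.
    """
    max_overlap = min(n_targets, n_query)
    total = comb(n_background, n_query)
    tail = sum(
        comb(n_targets, k) * comb(n_background - n_targets, n_query - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Toy example (hypothetical counts): 10 background genes, 4 of them targets,
# 3 query genes, 2 of which are targets. The tail is (36 + 4) / 120 = 1/3.
p_toy = hypergeom_enrichment_p(overlap=2, n_targets=4, n_query=3, n_background=10)
```

A small p-value from such a test indicates that the query list contains more targets of the factor than expected by chance; it does not by itself establish regulation.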
To pinpoint the regulatory kinases for psoriasis and rheumatoid arthritis, we created a
protein-protein interaction (PPI) subnetwork that maps the transcription factors and their
interacting proteins (refer to Fig 2b and S3b Fig, respectively). We then employed the Kinase
Enrichment Analysis (KEA) [30] to link the proteins in the PPI subnetwork to the protein
kinases that likely phosphorylate them. This analysis yielded a ranked list of protein
kinases that likely regulate the transcriptome signature of psoriasis and rheumatoid arthritis.
For psoriasis, the top-ten ranked kinases included CSNK2A1 (p = 6.7 × 10⁻¹²), HIPK2
(p = 3.0 × 10⁻¹⁰), CDK1 (p = 4.8 × 10⁻¹⁰), and AKT1 (p = 4.6 × 10⁻⁹; Fig 2c, also see S2 File).
For rheumatoid arthritis, the top-ten ranked kinases included MAPK1 (p = 1.4 × 10⁻¹³),
CK2ALPHA (p = 4.7 × 10⁻¹³), CSNK2A1 (p = 4.8 × 10⁻¹³), and MAPK3 (p = 2.0 × 10⁻¹²; S3c
Fig, also see S2 File).
Fig 2. ChEA and KEA analyses of light genes associated with psoriasis. a. The top-20 predicted regulatory
transcription factors. b. A subnetwork of connected transcription factors and their interacting proteins: the sub-
network has 60 nodes with 526 edges. Transcription factors are pink nodes, whereas the proteins that connect
them are grey. c. The top-20 predicted regulatory kinases.
Conversely, focusing on the dark genes associated with the top ten genetic traits, the
ChEA [29] analysis predicted significant transcription factors (p < 0.05) only for select traits,
including ulcerative colitis, tuberculosis, interstitial cystitis, Pick disease, diabetes mellitus,
Sjogren syndrome, and cystic fibrosis, as presented in S2 File. Among these, the dark genes
linked to Pick disease and tuberculosis displayed the most substantial enrichment, each
being associated with ten transcription factors. This contrast underscores how the regulatory
landscape of dark genes differs across diseases.
The top-ranked transcription factors for dark genes associated with tuberculosis included
ELF1, which regulates 41 dark genes (p = 1.4 × 10⁻³); TCF7L2, which regulates 14 dark
genes (p = 3.3 × 10⁻³); and UBTF, which regulates 28 dark genes (p = 5.5 × 10⁻³) (Fig 3a).
The top-ranked transcription factors for dark genes associated with Pick disease included
TCF7L2, which regulates 12 dark genes (p = 4.4 × 10⁻³); BRCA1, which regulates 39 dark
genes (p = 1.0 × 10⁻²); and PBX3, which regulates 19 dark genes (p = 1.1 × 10⁻²) (S4a Fig,
also see S2 File for a list of regulated genes from the input gene list).
To identify the regulatory kinases of tuberculosis and Pick disease, we created a PPI
subnetwork of connected transcription factors and their interacting proteins (see Fig 3b and
S4b Fig, respectively). As before, we utilised KEA [30] to link the proteins in the PPI
subnetwork to the protein kinases that likely phosphorylate them. This analysis yielded a
ranked list of protein kinases that likely regulate the transcriptome signature of tuberculosis
and Pick disease. The top-ten ranked kinases for tuberculosis included CK2ALPHA (p =
1.3 × 10⁻¹⁴), CSNK2A1 (p = 7.2 × 10⁻¹¹), MAPK1 (p = 4.7 × 10⁻¹⁰), and HIPK2 (p = 1.3 × 10⁻⁹;
Fig 3c, also see S2 File). The top-ten ranked kinases for Pick disease included
CSNK2A1 (p = 2.4 × 10⁻¹²), MAPK1 (p = 3.2 × 10⁻¹²), CDK1 (p = 2.3 × 10⁻¹¹), and CDK4 (p =
1.9 × 10⁻¹⁰; S4c Fig, also see S2 File).
Fig 3. ChEA and KEA analyses of dark genes associated with tuberculosis. a. The top-20 predicted
regulatory transcription factors. b. A subnetwork of connected transcription factors and their interacting
proteins: the sub-network has 41 nodes with 393 edges. Transcription factors are pink nodes, whereas the
proteins that connect them are grey. c. The top-20 predicted regulatory kinases.
Interestingly, we discovered that several transcription factors potentially regulating gene
expression of both light and dark genes in genetic diseases are shared. For instance, UBTF
and NFE2L2 were identified as potential regulators of both the dark genes and light genes
(S5 Fig) linked to tuberculosis. Similarly, we found that SUZ12 and TP63 were potential
regulators of both the light and dark genes related to interstitial cystitis (see S2 File).
Analysis of the diseasome bipartite network of dark genes and the
associated diseases and phenotypes
To shed further light on the role of dark genes in disease development, we utilised the
disease/phenotype-gene associations to construct a diseasome bipartite network (see
Methods section). The bipartite network encompassed 557 genetic traits and 2,147 dark
genes, of which 1,233 dark genes were linked to multiple diseases and phenotypes. The
top-ranking dark genes associated with the most genetic traits were TSEN34, with 31
associations, followed by PLPPR1, with 29, and LRMDA, with 21. Furthermore, we found
that most genetic traits in our diseasome were associated with the nervous system (Fig 4).
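The construction described above can be sketched in a few lines: a bipartite network is stored as two adjacency maps (trait to genes, gene to traits), from which genes linked to multiple traits and the top-ranking genes follow directly. This is a minimal sketch, not the pipeline used in the study; the trait-gene pairs and the placeholder gene names (GENE_A and so on) are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy associations (trait, gene); not taken from the study's data.
associations = [
    ("psoriasis", "GENE_A"),
    ("psoriasis", "GENE_B"),
    ("tuberculosis", "GENE_A"),
    ("Crohn's disease", "GENE_A"),
    ("Crohn's disease", "GENE_C"),
]


def build_bipartite(pairs):
    """Return the two adjacency maps of a trait-gene bipartite network."""
    trait_to_genes = defaultdict(set)
    gene_to_traits = defaultdict(set)
    for trait, gene in pairs:
        trait_to_genes[trait].add(gene)
        gene_to_traits[gene].add(trait)
    return trait_to_genes, gene_to_traits


trait_to_genes, gene_to_traits = build_bipartite(associations)

# Genes linked to more than one trait.
multi_trait_genes = sorted(g for g, traits in gene_to_traits.items() if len(traits) > 1)

# Genes ranked by number of associated traits (ties broken alphabetically).
ranked_genes = sorted(gene_to_traits, key=lambda g: (-len(gene_to_traits[g]), g))
```

Using sets means that a duplicated association in the input is counted only once, so a gene's degree equals its number of distinct traits.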
Fig 4. Gene-diseasome bipartite network. The bipartite network is composed of two disjoint sets of nodes
with different sizes: the larger ones correspond to genetic diseases, whereas the smaller ones correspond to
all genes. A link occurs between a disease and a gene if the gene is linked with the disease/phenotype. There
are 18 categories in the diseasome bipartite network, as labelled in the legend. The links between disease
pairs are shown in grey colour. Nodes are shaded grey if the corresponding genes are associated with more
than one disease type. Disease/phenotype-specific gene nodes are coloured according to each
disease/phenotype category, with “skin, connective tissue” having the highest number of disease/phenotype-
specific genes. Similarly, disease/phenotype nodes are coloured based on the disease/phenotype category
to which they belong, and most genetic diseases/phenotypes belong to the nervous system category. The
names of diseases and genes with more than 20 connections were labelled in the network.
In addition, we employed multimode network projection on the gene-diseasome bipartite
network to create two biologically meaningful network projections: the gene-disease network
(GDN) for a disease-centred perspective (Fig 5) and the disease-gene network (DGN) for a
gene-centred view of the gene-diseasome (Fig 6). Next, we analysed the GDN by calculating
various network statistics, including the weighted degree centrality, betweenness centrality,
closeness centrality and eccentricity (see S3 File). Our analyses spotlighted psoriasis
as the central hub of the network, demonstrating strong associations with most dark genes.
Moreover, our degree distribution analysis indicated that most genetic traits (503 in total),
such as psoriasis, Pick disease, tuberculosis, ulcerative colitis, interstitial cystitis, and
Crohn's disease, exhibited shared gene associations with other diseases and
phenotypes within the network. This observation underscores the complex genetic interplay
underlying various diseases and phenotypes (Fig 5, also see S3 File).
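To make the projection step concrete, the following sketch derives a disease-centred network from a bipartite trait-gene map: two traits are connected when they share at least one gene, the edge weight is the number of shared genes, and a trait's weighted degree is the sum of its edge weights. It is a simplified illustration under those assumptions; the trait labels T1 to T4 and gene labels g1 to g5 are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical bipartite map: trait -> set of associated genes.
trait_to_genes = {
    "T1": {"g1", "g2", "g3"},
    "T2": {"g2", "g3"},
    "T3": {"g3", "g4"},
    "T4": {"g5"},
}


def project_traits(trait_to_genes):
    """Project onto traits: edge weight = number of genes shared by the pair."""
    weights = {}
    for a, b in combinations(sorted(trait_to_genes), 2):
        shared = len(trait_to_genes[a] & trait_to_genes[b])
        if shared:
            weights[(a, b)] = shared
    return weights


def weighted_degree(weights):
    """Sum of incident edge weights for every node that has at least one edge."""
    totals = defaultdict(int)
    for (a, b), w in weights.items():
        totals[a] += w
        totals[b] += w
    return dict(totals)


edge_weights = project_traits(trait_to_genes)
trait_weighted_degree = weighted_degree(edge_weights)
```

In this toy example T4 shares no gene with any other trait, so it has no edges in the projection; traits of this kind are the ones without shared gene associations.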
Fig 5. Gene-disease network (GDN). The GDN is the projection of the gene-diseasome bipartite network, in
which nodes correspond to diseases/phenotypes, and two diseases/phenotypes are connected if there is at
least one gene that is linked to both. The width of a link is proportional to the number of genes that are linked
to both diseases. The size of a node is proportional to the number of genes linked to that disease. Different
node colours are associated with different disease categories. The names of diseases with > 20 associated
genes are labelled in the network. There are 17 disease categories in the GDN, as
labelled in the legend. The links between disease pairs are shown in grey colour.
Furthermore, we calculated the network statistics on the DGN, including the weighted
degree centrality, betweenness centrality, closeness centrality and eccentricity (see S3 File),
which enabled us to identify dark genes with notably high weighted degree centrality,
including CALHM6, HCP5, PRRG4, DDX60L and RASA2. These genes potentially play
pivotal roles within the network, exerting strong influence over various genetic traits [50].
We also pinpointed dark genes with high betweenness centrality, including PLBD1,
CTRB2, C6orf15, DPM3, and CALHM6. These genes act as crucial connectors within the
network, enhancing the flow of genetic information and possibly influencing the regulation
of genetic traits [51,52]. This analysis deepens our understanding of the genetic foundations
of complex traits and may pave the way for further research (Fig 6, also see S3 File).
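Betweenness centrality counts how often a node lies on shortest paths between other nodes. The sketch below implements Brandes' algorithm for an unweighted, undirected graph as one way to compute it; this is a simplification, since the DGN is weighted and the software settings behind S3 File are not restated here. The three-gene path graph and the gene labels are hypothetical.

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph.

    adj maps each node to the set of its neighbours. Returns unnormalised
    betweenness, with each unordered node pair counted once.
    """
    centrality = {v: 0.0 for v in adj}
    for source in adj:
        order = []                        # nodes in order of non-decreasing distance
        preds = {v: [] for v in adj}      # predecessors on shortest paths
        n_paths = {v: 0 for v in adj}     # number of shortest paths from source
        dist = {v: -1 for v in adj}
        n_paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:                      # breadth-first search from source
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        dependency = {v: 0.0 for v in adj}
        for w in reversed(order):         # accumulate from the farthest nodes back
            for v in preds[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]
    # Every unordered pair was visited from both endpoints, so halve the totals.
    return {v: c / 2 for v, c in centrality.items()}


# Hypothetical path graph G1 - G2 - G3: only G2 lies between two other genes.
toy_graph = {"G1": {"G2"}, "G2": {"G1", "G3"}, "G3": {"G2"}}
toy_betweenness = betweenness(toy_graph)
```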
Fig 6. Disease-gene network (DGN). In the DGN, each node is a gene, with two genes being connected if
they are implicated in the same disease. The size of the nodes was based on the average weighted degree
(the larger the size, the higher the degree). The node colour indicates betweenness centrality (the darker the colour,
the higher the betweenness): light orange is low, and red is high. The names of genes with a weighted
degree range > 1119.0 are labelled in the network.
Additionally, we identified seven smaller network clusters within the diseasome bipartite
network and calculated a modularity score of 0.429 for these clusters. This positive score
indicates the presence of a modular structure [53,54] and is an average value for this
network. This observation implies that genetic traits and their associated dark genes do not
interact randomly but tend to group into specific clusters. This finding provides
further support for the notion that diseases sharing common genetic or biological
foundations tend to cluster together, suggesting potential shared pathways or therapeutic
targets [55–57]. Notably, modularity classes 2, 3, and 6 were the largest,
representing distinct hub clusters with unique characteristics. The average degree for
modularity class 2 was calculated as 3.064. The genetic traits heart disease, attention deficit
hyperactivity disorder (inattentive type), and endogenous depression were highly connected,
with 82, 81 and 80 connections, respectively. Disease frequency in modularity
class 2 showed that mostly nervous system diseases, followed by musculoskeletal diseases,
belonged to this class. The gene JRK had the highest degree, with 17 connections to various
diseases (Fig 7; S3 File). The average degree for modularity class 3 was calculated as
3.975. The genetic traits ulcerative colitis, tuberculosis, interstitial cystitis, Pick disease and
Sjogren syndrome were highly connected, with 159, 144, 135, 128, and 120 connections,
respectively. Disease frequency in modularity class 3 indicated that mostly congenital,
hereditary, and neonatal diseases, followed by nervous system diseases, were in this class.
The gene C6orf15 had the highest degree, with 14 connections to various diseases (Fig
7; S3 File). Furthermore, the average degree for modularity class 6 was calculated as 2.373.
The genetic traits psoriasis and diabetes mellitus were highly connected, with 409 and 88
connections, respectively. Disease frequency in modularity class 6 indicated that mostly
nervous system diseases, followed by congenital, hereditary, and neonatal diseases, were in
this class. The gene PLPPR1 had the highest degree, with 28 connections to various
diseases (Fig 7; S3 File).
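For readers unfamiliar with the modularity score, the sketch below evaluates the standard Newman modularity Q of a fixed partition of an unweighted, undirected graph: for each cluster, the fraction of edges falling inside it minus the fraction expected from node degrees alone. It assumes this standard definition; the community-detection algorithm and settings that produced the seven clusters are not restated here, and the toy graph (two triangles joined by one edge) is hypothetical.

```python
def modularity(edges, community):
    """Newman modularity Q for an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its cluster label.
    Q = sum over clusters of (intra_edges / m) - (degree_sum / (2 * m)) ** 2,
    where m is the total number of edges.
    """
    m = len(edges)
    intra_edges = {}
    degree_sum = {}
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            intra_edges[cu] = intra_edges.get(cu, 0) + 1
    return sum(
        intra_edges.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Hypothetical toy graph: two triangles joined by a single bridging edge.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
two_clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_cluster = {node: 0 for node in two_clusters}

q_two = modularity(toy_edges, two_clusters)   # 2 * (3/7 - 1/4) = 5/14
q_one = modularity(toy_edges, one_cluster)    # 7/7 - 1 = 0
```

A partition that places all nodes in one cluster scores zero, whereas the partition that separates the two triangles scores above zero, which is the sense in which a positive score signals modular structure.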
Fig 7. Cluster identification. The gene-diseasome bipartite network was partitioned into seven distinct clusters,
each highlighting a group of closely interconnected diseases and their corresponding genes that exhibit similar
patterns of connections and associations within the network. The names of diseases and genes with > 20
connections are labelled in the network.
Overall, these findings indicate different levels of interconnectivity between the clusters in
the network. This suggests that genetic traits within the same cluster may share similarities
in terms of genetic factors and the potential underlying mechanisms [55–57]. We also noted
variations in centrality metrics, including degree, betweenness, and closeness centrality,
across the clusters, with certain clusters exhibiting higher centrality values compared to
others (S3 File). Furthermore, the clear structural boundaries between clusters indicate that
these clusters represent distinct subgroups within the network. Understanding the
interconnectivity and centrality can assist in identifying key diseases and genes that are
crucial in the context of disease interactions, pathways, and potential therapeutic targets
[56,58–60].
Identification of hub dark genes
We aimed to identify the key dark genes involved in disease processes. Out of the 2,147
dark genes we analysed, 1,998 dark genes could not be mapped to STRING [31], UCSC
[28], ChEA [29], KEA [30], and BioGRID [32] databases and were therefore excluded from
further analysis (see Methods section). The inability to map 1,998 of the dark genes to
established databases underscores the existing gaps in understanding and the sparse
representation of these genes within contemporary scientific literature and databases.
Subsequently, we visualised the PPI network using Cytoscape [61] (version 3.8.0). The
resulting PPI network of dark genes consisted of 149 nodes and 216 edges (Fig 8). The
successful construction of the PPI network for the remaining dark genes suggests that
despite their obscurity, these genes are not isolated entities and may play interconnected
roles in biological systems.
Fig 8. The protein-protein interaction network of dark genes was visualised in Cytoscape. The nodes
indicate proteins, and the edges indicate the number of interactions. The PPI network had 149 nodes and 216
edges.
Moreover, we employed the cytoHubba [41] plugin to screen for the top 20 hub dark genes
in the PPI network using three approaches: Maximal Clique Centrality (MCC), Degree and
Closeness (see Methods section and S6 Fig). A total of 16 of the most significant genes
emerged from the overlap of the results obtained from the three algorithms. Noteworthy
among them were RPUSD4, FASTKD5, MRPL15, MRPL9, DCAF15, R3HDM2, and
CEP162 (S6d Fig). Notably, R3HDM2 was the most interconnected, associating with nine
genetic traits, including heart disease, psoriasis, and juvenile arthritis. Furthermore, MRPL9
was linked with six traits, including myocardial infarction, allergic rhinitis, and inflammatory
bowel disease (Fig 9).
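The overlap step described above amounts to intersecting the top-ranked gene sets produced by each scoring method. The sketch below shows that step only, assuming per-gene scores have already been computed by each method; it does not reimplement MCC or any other cytoHubba [41] algorithm. The helper names, gene labels and scores are hypothetical.

```python
def top_k(scores, k):
    """Return the k highest-scoring genes (ties broken alphabetically)."""
    ranked = sorted(scores, key=lambda gene: (-scores[gene], gene))
    return set(ranked[:k])


def consensus_hubs(scores_by_method, k):
    """Genes that appear in the top k of every ranking method."""
    top_sets = [top_k(scores, k) for scores in scores_by_method.values()]
    return set.intersection(*top_sets)


# Hypothetical scores for four genes under three ranking methods.
scores_by_method = {
    "MCC": {"A": 10, "B": 8, "C": 6, "D": 1},
    "Degree": {"A": 5, "B": 4, "D": 3, "C": 2},
    "Closeness": {"B": 0.9, "A": 0.8, "C": 0.7, "D": 0.1},
}

hubs = consensus_hubs(scores_by_method, k=3)
```

Because the intersection can only shrink as methods are added, the consensus set is never larger than k, which is consistent with 16 genes emerging from three top-20 lists.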
Fig 9. Diseasome bipartite subnetwork showing the interaction of genetic diseases with the identified hub
dark genes. Node size represents degree, and node colour shows category (diseases coloured red and genes
coloured blue).
Functional enrichment of hub dark genes
The high interaction of hub genes and their association with various diseases and
phenotypes indicates that they may play important roles in biological processes. Therefore,
to gain insights into the potential biological functions and pathways in which these genes
may be involved, we performed functional enrichment analysis using Enrichr [44] (see
Methods section). Our Gene Ontology (GO) analysis showed that the hub genes are
significantly enriched for biological processes, including
mitochondrial translational elongation (p = 1.39 × 10⁻¹⁵) and mitochondrial translational
termination (p = 1.39 × 10⁻¹⁵) (Fig 10a); molecular functions, including RNA binding (p = 2.97
× 10⁻⁷) and large ribosomal subunit rRNA binding (p = 3.99 × 10⁻³) (Fig 10b); and cellular
components, including mitochondrial inner membrane (p = 5.51 × 10⁻¹¹) and organelle inner
membrane (p = 8.44 × 10⁻¹¹) (Fig 10c). As for biological pathways, the Reactome pathway
analysis showed that hub genes are enriched for mitochondrial translation termination (p
= 3.05 × 10⁻¹¹), mitochondrial translation elongation (p = 3.05 × 10⁻¹¹), and mitochondrial
translation initiation (p = 3.05 × 10⁻¹¹) (Fig 10d). The detailed results are shown in S4 File.
Fig 10. Gene ontology and pathway analysis bar charts. The term at the top has the most significant overlap
with the input query gene set. The enriched terms for the hub dark genes are displayed based on -log10(p-value),
with the actual p-value shown next to each term. Coloured bars correspond to terms with significant p-values
(< 0.05). An asterisk (*) next to a p-value indicates the term also has a significant adjusted p-value (<
0.05). (a) GO Biological Process (b) GO Molecular Function (c) GO Cellular Component (d) Reactome
Pathways.
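The two quantities mentioned in the caption, the log-scaled p-value used for bar length and the adjusted p-value, can be illustrated as follows. The sketch assumes the Benjamini-Hochberg procedure for the adjustment, which is a common choice for enrichment results; whether the reported adjusted p-values were obtained in exactly this way should be checked against S4 File. The function names and the four toy p-values are hypothetical.

```python
from math import log10


def neg_log10(p):
    """Bar length used in enrichment plots: larger means more significant."""
    return -log10(p)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four terms.
toy_p = [0.01, 0.04, 0.03, 0.5]
toy_adjusted = benjamini_hochberg(toy_p)
significant_after_adjustment = [p < 0.05 for p in toy_adjusted]
```

In this toy case three terms have raw p-values below 0.05 but only one remains below 0.05 after adjustment, which is the distinction the asterisk in the figure is meant to convey.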
Overall, these results suggest that hub dark genes are closely linked to mitochondrial
functions, particularly in the context of protein synthesis. This information can be invaluable
for understanding the molecular mechanisms underlying various diseases and phenotypes
associated with hub dark genes, as well as for potential therapeutic targeting in the future.
Tissue expression of hub dark genes
We utilised the GTEx [45] database to investigate tissue-specific expression patterns of the
16 hub dark genes. Additionally, we searched for expression quantitative trait loci (eQTLs)
within the dark genes to gain a better understanding of how genetic variants in these genes
might contribute to disease development (see Methods section).
Our findings revealed that RPUSD4 exhibited the highest median expression in skeletal
muscle (median TPM = 79.60). Similarly, MRPL15 demonstrated the highest median
expression in skeletal muscle (median TPM = 109.8). Furthermore, MRPL9 displayed the
highest median expression in the uterus (median TPM = 77.75). The highest median
expression of R3HDM2 was observed in brain-cerebellar tissue (median TPM = 81.38), while DCAF15 had
the highest median expression in the testis (median TPM = 103.3) (Fig 11a).
We also investigated the effects of eQTLs on the expression of hub dark genes in specific
tissues. Interestingly, we found that 5,784 eQTLs significantly affected the expression of hub
dark genes (p < 0.05) in various tissues, including the brain, oesophagus, and skin (see S5
File). For example, the genetic variant rs143649318 (located at chr8_53538523_G_A_b38)
significantly decreases the expression of MRPL15 in the brain-amygdala (p = 6.33 × 10⁻⁸;
normalised effect size [NES] = -2.36), brain-hypothalamus (p = 9.07 × 10⁻⁹; NES = -2.19)
and brain-frontal cortex (p = 1.00 × 10⁻⁶; NES = -1.99) (Fig 11b, also see S5 File).
Furthermore, rs1104890 (located at chr11_1941470_C_T_b38) significantly decreases the
expression of MRPL23 in different tissues, including the colon-transverse (p = 3.24 × 10⁻⁵⁷;
NES = -1.05), oesophagus-muscularis (p = 1.36 × 10⁻⁷⁸; NES = -1.15), and stomach (p =
3.38 × 10⁻⁵³; NES = -1.19) (Fig 11d, also see S5 File). Additionally, rs55808806 (located at
chr19_13976645_T_G_b38) significantly increases the expression of DCAF15 in the artery-aorta
(p = 2.44 × 10⁻¹⁵; NES = 0.37), artery-tibial (p = 3.17 × 10⁻²⁰; NES = 0.27) and heart-atrial
appendage (p = 4.62 × 10⁻⁷; NES = 0.19) (Fig 11e, also see S5 File).
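The variant identifiers quoted above follow the pattern chromosome_position_reference_alternate_build, and the sign of the NES gives the direction of the effect of the alternate allele relative to the reference allele. A minimal sketch of reading both is shown below; the class and function names are hypothetical helpers and are not part of any GTEx [45] tool.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    build: str


def parse_variant_id(variant_id):
    """Split an identifier of the form chrom_pos_ref_alt_build into its fields."""
    chrom, pos, ref, alt, build = variant_id.split("_")
    return Variant(chrom, int(pos), ref, alt, build)


def effect_direction(nes):
    """Direction of the alternate allele's effect on expression."""
    if nes > 0:
        return "increases"
    if nes < 0:
        return "decreases"
    return "no change"


# Identifier and NES values as quoted in the text for rs143649318 and rs55808806.
mrpl15_variant = parse_variant_id("chr8_53538523_G_A_b38")
mrpl15_direction = effect_direction(-2.36)
dcaf15_direction = effect_direction(0.37)
```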
Fig 11. Tissue enrichments of Tdark hub genes. a. Expression of Tdark hub genes in various tissues. b-e. eQTLs
and tissue expression of Tdark hub genes: b. MRPL15, c. MRPL24, d. MRPL23, e. DCAF15.
To further highlight the correlation between specific genetic variants and tissue-restricted
phenotypes resulting from their impact on gene expression, we employed the Open Targets
Genetics [46] web portal (https://genetics.opentargets.org) (see Methods section). Notably,
we observed that rs1196456 (located at chr1_151764873_T_A_b38), which increases the
expression of MRPL9 in the heart (p = 1.10 × 10⁻⁶; NES = 0.23), exhibited an association
with myocardial infarction (p = 1.80 × 10⁻⁹, Beta = 0.0578). Additionally, rs686646 (located
at chr11_126189527_A_C_b38), which decreases RPUSD4 expression in the pancreas (p
= 7.90 × 10⁻⁷; NES = -0.2), displayed an association with diabetes (p = 4.20 × 10⁻³, Beta =
0.352). Similarly, rs1104890 (located at chr11_1941470_C_T_b38), which reduces
MRPL23 expression in skeletal muscle (p = 3.00 × 10⁻¹¹⁹; NES = -1.1), showed an
association with intervertebral disc disorder (p = 7.60 × 10⁻⁵, Beta = 0.0569) (see S6 File for
more details). Our findings demonstrate that understanding the tissue-specific expression
patterns of genes is important for identifying genetic variants that contribute to disease
susceptibility in specific tissues or organs.
Tissue distribution and tissue specificity analysis of dark and light gene
expression
We employed the Human Protein Atlas (HPA) [47] data and predictions from the Xpresso
model [49] to conduct a comprehensive analysis of tissue distribution and tissue specificity
in the context of dark and light gene expression. In terms of tissue distribution, our focus
was specifically on categorising records based on the number of tissues in which mRNA
transcripts were detected, namely, detected in all tissues, in many tissues, in some tissues,
in a single tissue, or not detected in any tissue. For mRNA tissue specificity, we assessed
records across different categories, including those that were group enriched, displayed low
tissue specificity, were not detected, showed tissue enrichment, or exhibited tissue
enhancement (see Methods section for the detailed description of categories). This
approach allowed us to gain insights into the distinct tissue distribution and specificity of
dark and light genes.
In the analysis of RNA tissue distribution, encompassing both dark and light genes, we
examined 17,978 records. We found that the expression of mRNA transcripts from dark and
light genes tends to vary across different tissues, with some transcripts being present in all
tissues while others are limited to only a few. Notably, more than half of these records (9,154;
50.95%) revealed the presence of mRNA transcripts in all tissues. Meanwhile, a smaller
fraction of the records demonstrated that mRNA transcripts were detected in many tissues
(5,205; 28.98%) or some tissues (2,928; 16.29%) (Fig 12a and S7 File). This observation
underscores the diverse tissue-specific expression patterns of dark and light genes.
Furthermore, the number of records broken down by TDL classes revealed that Tbio genes
exhibited the highest number of RNA tissue distribution records detected in all tissues (6,437
records; 56%). Similarly, Tdark genes (1,502 records; 39%) and Tchem genes (1,002
records; 52%) also demonstrated the highest number of RNA tissue distribution records
detected in all tissues. In contrast, Tclin genes displayed the highest number of RNA tissue
distribution records detected in many tissues (299 records; 44%) (Fig 12b, also see S7a
Fig). Additionally, we found that only a small percentage of records exhibited limited RNA
tissue distribution, with 538 records (2.99%) detected in a single tissue and 153 records
(0.85%) not detected in any tissue, suggesting restricted expression (Fig 12a). Notably,
Tdark genes accounted for the largest share of such records: 338 of the 538 single-tissue
records (63%) and 135 of the 153 records not detected in any tissue (88%), as depicted
in Fig 12b and S7b Fig.
Furthermore, we observed variations in RNA tissue distribution across distinct TDL classes
linked to disease. Our findings indicate that, among the TDL classes associated with
disease, Tbio and Tdark genes demonstrated the highest prevalence of RNA tissue
distribution records detected in all tissues (38% each), followed by Tchem genes (26%).
Additionally, Tdark exhibited the highest number of RNA tissue distribution records detected
in many tissues (23%), followed by Tbio (20%) and Tchem (17%). However, the proportion
of genes identified in a single tissue was notably lower, ranging from 0% to 9%, with Tdark
having the highest representation (Fig 12c). These results imply a substantial overlap in
RNA tissue distribution across different tissues. Many genes are expressed in multiple
tissues and may play a role in multiple disease pathways. However, some genes are tissue-
specific and may play a more targeted role in disease development. Overall, these findings
provide important insights into the molecular mechanisms underlying disease and highlight
the importance of considering RNA tissue distribution in the development of targeted
therapeutics.
Fig 12. Tissue distribution of dark genes and light genes expression. a. The sum of the number of records
on RNA tissue distribution. b. The sum of the number of records broken down by target development level
(TDL). c. Percentage of the total number of records on RNA tissue distribution across distinct Target
Development Level (TDL) classes linked to disease. The colour shows details about RNA tissue distribution.
The size shows the sum of the number of records. The marks are labelled by the sum of the number of records.
In our analysis of RNA tissue specificity, encompassing both dark and light genes, we
examined a total of 17,978 records. Our findings indicated that expression of these genes
in most tissues (7,919 records; 44.05%) exhibited low tissue specificity, followed by tissue-
enhanced (6,002 records; 33.39%) and tissue-enriched (2,431 records; 13.52%) categories
(Fig 13a and S7 File). Furthermore, among the TDL classes, Tbio genes had the highest
number of records with low tissue specificity (5,540 records; 48%), followed by tissue-
enhanced and tissue-enriched categories [3,760 (33%) - tissue-enhanced; 1,327 (12%) -
tissue-enriched]. Similarly, Tdark genes [1,418 (37%) - low tissue specificity; 1,237 (32%) -
tissue-enhanced; 729 (19%) - tissue-enriched] and Tchem genes [803 (42%) - low tissue
specificity; 716 (37%) - tissue-enhanced and 234 (12%) - tissue-enriched] displayed a
similar pattern to the Tbio gene class. In contrast, Tclin genes had the highest number of
records in the tissue-enhanced category, with 289 records (43%) (Fig 13b and S8a Fig).
Overall, within each category of RNA tissue specificity, the number of records was
consistently highest in Tbio, followed by Tdark, Tchem and, lastly, Tclin. The exception
was the “Not detected” category, in which Tdark had the most records
(135 records; 88%) (Fig 13b and S8b Fig). Additionally, records with low tissue specificity
were the most numerous in Tbio, Tchem, and Tdark, suggesting that these genes may
have more generalised functions that are not confined to specific biological contexts,
potentially playing crucial roles in multiple tissues or biological processes.
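The percentages quoted for the overall tissue-specificity categories follow directly from the record counts. As a minimal arithmetic check, the sketch below recomputes them from the three counts and the total reported in the text; the variable and function names are illustrative and are not taken from the authors' analysis. The three categories listed do not sum to the total, because the remaining categories are not itemised in that sentence.

```python
# Record counts per RNA tissue-specificity category, as reported in the text.
TOTAL_RECORDS = 17_978
counts = {
    "low tissue specificity": 7_919,
    "tissue enhanced": 6_002,
    "tissue enriched": 2_431,
}


def share(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


# Percentage of all records falling in each listed category.
shares = {category: share(n, TOTAL_RECORDS) for category, n in counts.items()}

for category, pct in shares.items():
    print(f"{category}: {pct}%")
```

The same helper can be applied within a single TDL class by substituting that class's own total for TOTAL_RECORDS.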
Moreover, we have identified distinctions in RNA tissue specificity among diverse TDL
classes linked to disease. Notably, Tclin exhibited the highest share of genes displaying
tissue enhancement (51%). Furthermore, our analysis revealed that among Tdark genes,
37% displayed low tissue specificity, 32% showed tissue enhancement, 19% were tissue-
enriched, 8% were group-enriched, and 4% were not detected (Fig 13c). These findings
suggest that a significant portion of Tdark genes is expressed in particular tissues or groups
of tissues, implying potential involvement in specific physiological processes. However, a
considerable number of Tdark genes exhibited low tissue specificity, suggesting that their
functions may be less tissue-specific or not yet fully understood. These findings underscore
the significance of further exploration of the role of Tdark genes in disease development and
tissue-specific processes.
Fig 13. Tissue specificity of dark genes and light genes expression. a. The sum of the number of records
on RNA tissue specificity. b. The sum of the number of records broken down by target development level
(TDL). c. Percentage of the total number of records on RNA tissue specificity across distinct Target
Development Level (TDL) classes linked to disease. The colour shows details about RNA tissue specificity.
The size shows the sum of the number of records. The marks are labelled by the sum of the number of records.
We also examined RNA tissue distribution and tissue specificity jointly for dark and light
genes. Among mRNA transcripts detected in all tissues, records with low tissue specificity
were the most numerous, with Tbio genes being the most prominent (55%), followed by
Tdark genes (13%). Among mRNA transcripts detected in many tissues, a subset of records
showed tissue enhancement, with the highest proportion in Tbio genes (39%),
followed by Tdark genes (11%) and Tchem genes (7%). By contrast, among mRNA
transcripts detected in single tissues and categorised as tissue-enriched, the highest
proportion was in Tdark genes (45%), followed by Tbio genes (29%). These findings contribute
to our understanding of gene expression regulation and provide insights into the distinct
functions associated with specific tissues (see S9 Fig).
Overall, these findings collectively underscore the importance of exploring the distinct roles
of Tdark genes in disease development and tissue-specific processes. While there are
similarities, there are notable differences in the RNA tissue distribution and RNA tissue
specificity between dark and light genes, emphasising their unique contributions to biological
functions and disease pathways.
Discussion
We have conducted integrative analyses to illuminate the potential roles of Tdark genes and
their possible contribution to genetic diseases and phenotypes. Our findings reveal that the
number of genes associated with genetic traits varies depending on the target development
level. Tdark genes display a relatively high average count of gene associations per disease,
indicating the complex or multifaceted roles these dark genes may play in disease
pathogenesis. Additionally, Tdark genes have the lowest number of diseases per gene,
possibly due to limited research or understanding [23,25]. Further research is required to
understand the function and potential significance of Tdark genes in various diseases and
phenotypes. Furthermore, our analysis highlights psoriasis as the most prevalent disease
across all development levels, emphasising the importance of studying Tdark genes in
relation to this condition.
We identified certain transcription factors as potential regulators of both light and dark genes
associated with specific genetic diseases. For instance, SUZ12 and TP63 were identified as
potential regulators of dark and light gene expression in interstitial cystitis, while UBTF and
NFE2L2 were identified as potential regulators of both dark and light genes associated with
tuberculosis. These transcription factors have been implicated in various biological
processes [62–66], and their identification in both light and dark genes suggests their
important role in the pathogenesis of genetic diseases and phenotypes.
Our analysis of the gene-disease network (GDN) revealed the presence of connections
among genetic diseases and phenotypes. Consistent with prior research [55], we found that
out of 557 genetic traits examined, 503 belonged to a giant component, suggesting shared
genetic origins to some extent [55,67]. Furthermore, there was considerable variation in the
number of genes associated with different diseases and phenotypes, ranging from just a few
to dozens, as seen in conditions like psoriasis, Pick disease, tuberculosis, ulcerative colitis,
interstitial cystitis, and Crohn's disease. This variation may suggest the presence of shared
genetic pathways or biological mechanisms [68–71]. The connections among disease genes
in the disease-gene network (DGN) indicate their level of phenotypic similarity and
involvement in shared disease pathways. Integrating this network with other interaction
types (e.g., protein-protein interactions, transcription factor-promoter interactions, and
metabolic reactions) can unveil novel genetic interactions and provide a more
comprehensive understanding of disease mechanisms [55]. Notably, CALHM6, HCP5,
PRRG4, DDX60L, and RASA2 were identified as major hubs in the network, indicating their
key roles in disease pathways [72,73]. Additionally, PLBD1, CTRB2, C6orf15, DPM3, and
CALHM6 exhibited high betweenness centrality, signifying their importance in regulating
information flow among genes in the network [56,74].
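The two network quantities discussed above, the giant component and the weighted degree of a gene, can be illustrated on a toy gene-disease network. The sketch below is not the authors' pipeline: the edge weights and the GENE_X/TRAIT_Y pair are hypothetical placeholders, and only the gene and disease names are borrowed from the text. It builds an undirected weighted graph, finds connected components by breadth-first search, and ranks genes by the sum of their edge weights.

```python
from collections import defaultdict, deque

# Hypothetical (gene, disease, weight) edges. The weights are placeholders
# chosen for illustration and are not values from the study.
edges = [
    ("CALHM6", "psoriasis", 3.0),
    ("CALHM6", "tuberculosis", 2.0),
    ("HCP5", "psoriasis", 4.0),
    ("HCP5", "ulcerative colitis", 2.0),
    ("RASA2", "ulcerative colitis", 2.0),
    ("RASA2", "Crohn's disease", 1.0),
    ("GENE_X", "TRAIT_Y", 1.0),  # hypothetical pair outside the giant component
]

# Undirected weighted adjacency map: node -> {neighbour: weight}.
adjacency = defaultdict(dict)
for gene, disease, weight in edges:
    adjacency[gene][disease] = weight
    adjacency[disease][gene] = weight


def weighted_degree(node):
    """Sum of the weights of all edges incident to the node."""
    return sum(adjacency[node].values())


def connected_components():
    """Return every connected component as a set of nodes (BFS)."""
    seen, components = set(), []
    for start in list(adjacency):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), {start}
        while queue:
            current = queue.popleft()
            for neighbour in adjacency[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


# The giant component is the largest connected component.
giant = max(connected_components(), key=len)

# Rank genes by weighted degree (ties broken alphabetically).
genes = {gene for gene, _, _ in edges}
ranking = sorted(genes, key=lambda g: (-weighted_degree(g), g))

print(len(giant), ranking)
```

In this toy network, diseases that share a gene fall into the same component, which is the sense in which a giant component points to shared genetic origins.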
We identified the 16 most significant hub dark genes from the PPI network: RPUSD4,
FASTKD5, MRPL15, MRPL9, MRPL22, MRPL2, MRPL50, MRPL24, MRPL23, DCAF15,
C8orf82, KIAA0355, CCDC90B, MRPS14, R3HDM2, and CEP162. The interactions among
most of these genes reinforce their potential functional significance, offering clues about
their biological roles [75], and suggesting their possible involvement in the same biological
pathways. Upon further analysis of hub genes through Enrichr, we found that the hub genes
are significantly associated with GO (Gene Ontology) biological processes related to
mitochondrial translation, such as translational elongation and termination. Therefore, we
suggest that these genes may regulate and maintain mitochondrial protein synthesis, which
is essential for cellular function [76,77]. The hub genes are also enriched in GO cellular
component terms such as mitochondrial inner membrane, and in GO molecular function terms
such as RNA binding and large ribosomal subunit rRNA binding. Additionally, the hub dark genes
participate in mitochondrial translation pathways, encompassing initiation, elongation, and
termination. These findings suggest that the hub genes help to preserve the
structural and functional integrity of subcellular compartments and to support gene
expression and protein synthesis within cells. Recent studies have linked several of these
hub dark genes to disease, indicating their involvement
in various pathologies and their potential as therapeutic targets [78–81].
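Enrichment of a gene list for a GO term is commonly assessed with an over-representation test. As a minimal sketch, the function below computes the upper tail of the hypergeometric distribution, which is equivalent to a one-sided Fisher's exact test. It is offered as an illustration of the general technique, not as a reproduction of the Enrichr computation used in this study, and the term size, overlap and background size in the example are hypothetical placeholders.

```python
from math import comb


def hypergeom_upper_tail(overlap, gene_set_size, query_size, background_size):
    """P(X >= overlap) when query_size genes are drawn without replacement
    from a background that contains gene_set_size annotated genes."""
    max_overlap = min(gene_set_size, query_size)
    total = comb(background_size, query_size)
    tail = sum(
        comb(gene_set_size, k)
        * comb(background_size - gene_set_size, query_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical example: 16 query genes, 8 of which carry an annotation that
# is held by 100 genes in a background of 20,000 genes.
p_value = hypergeom_upper_tail(
    overlap=8, gene_set_size=100, query_size=16, background_size=20_000
)
print(f"p = {p_value:.3e}")
```

A small p-value indicates that the overlap is larger than expected by chance under these assumptions; in practice it would also be corrected for the number of terms tested.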
We conducted a detailed analysis of the expression patterns of the 16 hub dark genes across
diverse tissues. The findings reveal that these genes are expressed in multiple tissues,
suggesting that they may contribute to a wide range of biological processes. For example, we found
that RPUSD4 is highly expressed in skeletal muscle, whereas MRPL9 is expressed
in EBV-transformed lymphocytes, cultured fibroblasts, and the uterus. In addition, MRPL15
is expressed in skeletal muscle and the left ventricle of the heart. Interestingly,
expression quantitative trait loci (eQTLs) significantly influenced the expression of hub dark genes
across various tissues, indicating that these genetic variants are instrumental in modulating
the expression of hub dark genes in diverse biological contexts [82]. The discovery that
specific eQTLs, such as rs1104890 and rs143649318, have a significant impact on the
expression of specific hub dark genes in different tissues highlights the importance of
understanding the tissue-specific regulation of gene expression [83]. Furthermore, these
eQTLs provide potential markers for further investigation into the mechanisms underlying
the regulation of hub dark genes [84]. Investigating the impact of these eQTLs on gene
expression and protein function can provide insights into the molecular pathways regulating
hub dark genes and potentially identify therapeutic targets for diseases associated with
these genes [85–87].
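An eQTL effect is often summarised as the slope of a linear regression of expression on genotype dosage (0, 1 or 2 copies of the alternative allele), fitted separately in each tissue. The sketch below shows that calculation with ordinary least squares; the dosages, expression values and tissue labels are hypothetical placeholders and do not correspond to rs1104890, rs143649318 or any data from this study.

```python
def eqtl_slope(dosages, expression):
    """Ordinary least-squares fit of expression on genotype dosage.

    Returns (slope, intercept); the slope is the estimated change in
    expression per additional copy of the alternative allele.
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical genotype dosages for six individuals at one variant.
dosages = [0, 0, 1, 1, 2, 2]

# Hypothetical normalised expression of one gene in two tissues.
expression_by_tissue = {
    "tissue_A": [1.0, 1.2, 1.9, 2.1, 3.0, 3.2],  # expression rises with dosage
    "tissue_B": [2.0, 2.1, 2.0, 2.1, 2.0, 2.1],  # no dosage effect
}

# Fit one regression per tissue and keep the slope as the effect size.
effects = {
    tissue: eqtl_slope(dosages, values)[0]
    for tissue, values in expression_by_tissue.items()
}
print(effects)
```

A variant with a clear slope in one tissue and a near-zero slope in another is the kind of pattern described as tissue-specific regulation; a full analysis would also include covariates and a significance test.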
In conclusion, our research demonstrates that dark genes are crucial in the pathogenesis of
genetic diseases and phenotypes, revealing their multifaceted involvement in disease
mechanisms. Moreover, network analysis exposes a common genetic foundation across
various diseases, with certain genes acting as key hubs within these networks. Importantly,
these hub dark genes are frequently involved in essential mitochondrial functions. The
distinct expression patterns of these genes and the modulation by eQTLs emphasise the
complexity of gene regulation in different biological contexts. Exploring dark genes further
can enhance our understanding of genetic diseases and lead to the discovery of new
therapeutic avenues.
Data availability
All relevant data are within the manuscript and its supporting information files.
Competing interests
The authors declare that they have no conflicts of interest.
Author contributions
Conceptualisation: Doris Kafita, Kevin Dzobo and Musalula Sinkala.
Methodology: Doris Kafita, Panji Nkhoma, Kevin Dzobo and Musalula Sinkala.
Formal analysis: Doris Kafita, Panji Nkhoma, Kevin Dzobo and Musalula Sinkala.
Visualisation: Doris Kafita, Kevin Dzobo and Musalula Sinkala.
Writing – original draft: Doris Kafita and Musalula Sinkala.
Writing – review & editing: Doris Kafita, Panji Nkhoma, Kevin Dzobo and Musalula
Sinkala.
References
1. Babar U. Monogenic disorders: An overview. Int J Adv Res (Indore). 2017;5: 1398–
1424. doi:10.21474/IJAR01/3294
2. Cleynen I, Halfvarsson J. How to approach understanding complex trait genetics -
inflammatory bowel disease as a model complex trait. United European
gastroenterology journal. NLM (Medline); 2019. pp. 1426–1430.
doi:10.1177/2050640619891120
3. Apgar TL, Sanders CR. Compendium of causative genes and their encoded proteins
for common monogenic disorders. Protein Science. 2022;31: 75–91.
doi:10.1002/pro.4183
4. Brandes N, Weissbrod O, Linial M. Open problems in human trait genetics. Genome
Biology. BioMed Central Ltd; 2022. doi:10.1186/s13059-022-02697-9
5. Lee TI, Young RA. Transcriptional regulation and its misregulation in disease. Cell.
Elsevier B.V.; 2013. pp. 1237–1251. doi:10.1016/j.cell.2013.02.014
6. Salem MSZ. Pathogenetics. An introductory review. Egyptian Journal of Medical
Human Genetics. Egyptian Society of Human Genetics; 2016. pp. 1–23.
doi:10.1016/j.ejmhg.2015.07.002
7. Kammenga JE. The background puzzle: how identical mutations in the same gene
lead to different disease symptoms. FEBS Journal. Blackwell Publishing Ltd; 2017. pp.
3362–3373. doi:10.1111/febs.14080
8. Jackson M, Marks L, May GHW, Wilson JB. The genetic basis of disease. Essays in
Biochemistry. Portland Press Ltd; 2018. pp. 643–723. doi:10.1042/EBC20170053
9. Banoon SR, Salih TS, Ghasemian A. Genetic mutations and major human disorders: A
review. Egypt J Chem. 2022;65: 571–589. doi:10.21608/EJCHEM.2021.98178.4575
10. Casamassimi A, Ciccodicola A, Rienzo M. Transcriptional regulation and its
misregulation in human diseases. International Journal of Molecular Sciences.
Multidisciplinary Digital Publishing Institute (MDPI); 2023. doi:10.3390/ijms24108640
11. Roth TL, Marson A. Genetic disease and therapy. Annual Review of Pathology:
Mechanisms of Disease. Annual Reviews Inc.; 2021. pp. 145–166.
doi:10.1146/annurev-pathmechdis-012419-032626
12. Lahm H, Jia M, Dreßen M, Wirth F, Puluca N, Gilsbach R, et al. Congenital heart
disease risk loci identified by genome-wide association study in European patients.
Journal of Clinical Investigation. 2021;131. doi:10.1172/JCI141837
13. Abraham G, Rutten-Jacobs L, Inouye M. Risk prediction using polygenic risk scores for
prevention of stroke and other cardiovascular diseases. Stroke. Wolters Kluwer
Health; 2021. pp. 2983–2991. doi:10.1161/STROKEAHA.120.032619
14. Rachamadugu SI, Miller KA, Lee IH, Zou YS. Genetic detection of congenital heart
disease. Gynecology and Obstetrics Clinical Medicine. KeAi Communications Co.;
2022. pp. 109–123. doi:10.1016/j.gocm.2022.07.005
15. Leiserson MDM, Vandin F, Wu HT, Dobson JR, Eldridge J V., Thomas JL, et al. Pan-
cancer network analysis identifies combinations of rare somatic mutations across
pathways and protein complexes. Nat Genet. 2015;47: 106–114.
doi:10.1038/ng.3168
16. Claussnitzer M, Cho JH, Collins R, Cox NJ, Dermitzakis ET, Hurles ME, et al. A brief
history of human disease genetics. Nature. Nature Research; 2020. pp. 179–189.
doi:10.1038/s41586-019-1879-7
17. Sinkala M, Elsheikh SSM, Mbiyavanga M, Cullinan J, Mulder NJ. A genome-wide
association study identifies distinct variants associated with pulmonary function
among European and African ancestries from the UK Biobank. Commun Biol. 2023;6.
doi:10.1038/s42003-023-04443-8
18. Petrikin JE, Willig LK, Smith LD, Kingsmore SF. Rapid whole genome sequencing and
precision neonatology. Seminars in Perinatology. W.B. Saunders; 2015. pp. 623–631.
doi:10.1053/j.semperi.2015.09.009
19. Kingsmore SF, Henderson A, Owen MJ, Clark MM, Hansen C, Dimmock D, et al.
Measurement of genetic diseases as a cause of mortality in infants receiving whole
genome sequencing. NPJ Genom Med. 2020;5. doi:10.1038/s41525-020-00155-8
20. Gupta N, Verma VK. Next-generation sequencing and its application: Empowering in
public health beyond reality. 2019. pp. 313–341. doi:10.1007/978-981-13-8844-6_15
21. Satam H, Joshi K, Mangrolia U, Waghoo S, Zaidi G, Rawool S, et al. Next-generation
sequencing technology: Current trends and advancements. Biology. Multidisciplinary
Digital Publishing Institute (MDPI); 2023. doi:10.3390/biology12070997
22. Pandey AK, Lu L, Wang X, Homayouni R, Williams RW. Functionally enigmatic genes:
A case study of the brain ignorome. PLoS One. 2014;9.
doi:10.1371/journal.pone.0088889
23. Oprea TI. Exploring the dark genome: implications for precision medicine.
Mammalian Genome. Springer New York LLC; 2019. pp. 192–200.
doi:10.1007/s00335-019-09809-0
24. Sheils T, Mathias SL, Siramshetty VB, Bocci G, Bologa CG, Yang JJ, et al. How to
illuminate the druggable genome using Pharos. Curr Protoc Bioinformatics. 2020;69.
doi:10.1002/cpbi.92
25. Nguyen DT, Mathias S, Bologa C, Brunak S, Fernandez N, Gaulton A, et al. Pharos:
Collating protein information to shed light on the druggable genome. Nucleic Acids
Res. 2017;45: D995–D1002. doi:10.1093/nar/gkw1072
26. Oprea TI, Bologa CG, Brunak S, Campbell A, Gan GN, Gaulton A, et al. Unexplored
therapeutic opportunities in the human genome. Nature Reviews Drug Discovery.
Nature Publishing Group; 2018. pp. 317–332. doi:10.1038/nrd.2018.14
27. Kafita D, Nkhoma P, Dzobo K, Sinkala M. Shedding light on the dark genome: Insights
into the genetic, CRISPR-based, and pharmacological dependencies of human cancers
and disease aggressiveness. Roemer K, editor. PLoS One. 2023;18: e0296029.
doi:10.1371/journal.pone.0296029
28. Lee BT, Barber GP, Benet-Pagès A, Casper J, Clawson H, Diekhans M, et al. The UCSC
Genome Browser database: 2022 update. Nucleic Acids Res. 2022;50: D1115–D1122.
doi:10.1093/nar/gkab959
29. Lachmann A, Xu H, Krishnan J, Berger SI, Mazloom AR, Ma’ayan A. ChEA:
Transcription factor regulation inferred from integrating genome-wide ChIP-X
experiments. Bioinformatics. 2010;26: 2438–2444.
doi:10.1093/bioinformatics/btq466
30. Lachmann A, Ma’ayan A. KEA: Kinase enrichment analysis. Bioinformatics. 2009;25:
684–686. doi:10.1093/bioinformatics/btp026
31. Szklarczyk D, Gable AL, Nastou KC, Lyon D, Kirsch R, Pyysalo S, et al. The STRING
database in 2021: Customizable protein-protein networks, and functional
characterization of user-uploaded gene/measurement sets. Nucleic Acids Res.
2021;49: D605–D612. doi:10.1093/nar/gkaa1074
32. Oughtred R, Rust J, Chang C, Breitkreutz BJ, Stark C, Willems A, et al. The BioGRID
database: A comprehensive biomedical resource of curated protein, genetic, and
chemical interactions. Protein Science. 2021;30: 187–200. doi:10.1002/pro.3978
33. Hart T, Chandrashekhar M, Aregger M, Steinhart Z, Brown KR, MacLeod G, et al. High-
resolution CRISPR screens reveal fitness genes and genotype-specific cancer
liabilities. Cell. 2015;163: 1515–1526. doi:10.1016/j.cell.2015.11.015
34. Amberger JS, Bocchini CA, Scott AF, Hamosh A. OMIM.org: Leveraging knowledge
across phenotype-gene relationships. Nucleic Acids Res. 2019;47: D1038–D1043.
doi:10.1093/nar/gky1151
35. Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, Ellrott K, et al. The
cancer genome atlas pan-cancer analysis project. Nat Genet. 2013;45: 1113–1120.
doi:10.1038/ng.2764
36. Carithers LJ, Moore HM. The Genotype-Tissue Expression (GTEx) Project.
Biopreservation and Biobanking. Mary Ann Liebert Inc.; 2015. pp. 307–308.
doi:10.1089/bio.2015.29031.hmm
37. Lin Y, Mehta S, Küçük-McGinty H, Turner JP, Vidovic D, Forlin M, et al. Drug target
ontology to classify and integrate drug discovery data. J Biomed Semantics. 2017;8.
doi:10.1186/s13326-017-0161-x
38. Clarke DJB, Kuleshov M V., Schilder BM, Torre D, Duffy ME, Keenan AB, et al.
EXpression2Kinases (X2K) Web: Linking expression signatures to upstream cell
signaling networks. Nucleic Acids Res. 2018;46: W171–W179.
doi:10.1093/nar/gky458
39. Bastian M, Heymann S, Jacomy M. Gephi: An open source software for exploring and
manipulating networks visualization and exploration of large graphs. Available:
www.aaai.org
40. Salarikia SR, Kashkooli M, Taghipour MJ, Malekpour M, Negahdaripour M.
Identification of hub pathways and drug candidates in gastric cancer through systems
biology. Sci Rep. 2022;12. doi:10.1038/s41598-022-13052-0
41. Chin CH, Chen SH, Wu HH, Ho CW, Ko MT, Lin CY. cytoHubba: Identifying hub objects
and sub-networks from complex interactome. BMC Syst Biol. 2014;8.
doi:10.1186/1752-0509-8-S4-S11
42. Carbon S, Douglass E, Good BM, Unni DR, Harris NL, Mungall CJ, et al. The Gene
Ontology resource: Enriching a GOld mine. Nucleic Acids Res. 2021;49: D325–D334.
doi:10.1093/nar/gkaa1113
43. Fabregat A, Jupe S, Matthews L, Sidiropoulos K, Gillespie M, Garapati P, et al. The
Reactome Pathway Knowledgebase. Nucleic Acids Res. 2018;46: D649–D655.
doi:10.1093/nar/gkx1132
44. Kuleshov M V., Jones MR, Rouillard AD, Fernandez NF, Duan Q, Wang Z, et al. Enrichr:
a comprehensive gene set enrichment analysis web server 2016 update. Nucleic
Acids Res. 2016;44: W90–W97. doi:10.1093/nar/gkw377
45. Lonsdale J, Thomas J, Salvatore M, Phillips R, Lo E, Shad S, et al. The Genotype-Tissue
Expression (GTEx) project. Nature Genetics. 2013. pp. 580–585. doi:10.1038/ng.2653
46. Ghoussaini M, Mountjoy E, Carmona M, Peat G, Schmidt EM, Hercules A, et al. Open
Targets Genetics: Systematic identification of trait-associated genes using large-scale
genetics and functional genomics. Nucleic Acids Res. 2021;49: D1311–D1320.
doi:10.1093/nar/gkaa840
47. Thul PJ, Lindskog C. The human protein atlas: A spatial map of the human proteome.
Protein Science. 2018;27: 233–244. doi:10.1002/pro.3307
48. Digre A, Lindskog C. The Human Protein Atlas—Spatial localization of the human
proteome in health and disease. Protein Science. 2021;30: 218–233.
doi:10.1002/pro.3987
49. Agarwal V, Shendure J. Predicting mRNA abundance directly from genomic sequence
using deep convolutional neural networks. Cell Rep. 2020;31.
doi:10.1016/j.celrep.2020.107663
50. Song M, Han NG, Kim YH, Ding Y, Chambers T. Correction: Discovering implicit entity
relation with the gene-citation-gene network PLoS ONE. PLoS ONE. Public Library of
Science; 2014. doi:10.1371/annotation/b6121731-7357-4657-9845-82eb0c937f89
51. Jalan S, Sarkar C, Center P, Hans S. Complex Networks: an emerging branch of
science. Available: https://www.researchgate.net/publication/331438502
52. Durón C, Pan Y, Gutmann DH, Hardin J, Radunskaya A. Variability of betweenness
centrality and its effect on identifying essential genes. Bull Math Biol. 2019;81: 3655–
3673. doi:10.1007/s11538-018-0526-z
53. Newman MEJ. Modularity and community structure in networks. 2006. Available:
www.pnas.org/cgi/doi/10.1073/pnas.0601602103
54. Alcalá-Corona SA, Sandoval-Motta S, Espinal-Enríquez J, Hernández-Lemus E.
Modularity in biological networks. Frontiers in Genetics. Frontiers Media S.A.; 2021.
doi:10.3389/fgene.2021.701331
55. Goh K-I, Cusick ME, Valle D, Childs B, Vidal M, Barabási A-L. The human disease
network. 2007. Available: www.pnas.org/cgi/content/full/
56. Barabási AL, Gulbahce N, Loscalzo J. Network medicine: A network-based approach to
human disease. Nature Reviews Genetics. 2011. pp. 56–68. doi:10.1038/nrg2918
57. Choobdar S, Ahsen ME, Crawford J, Tomasoni M, Fang T, Lamparter D, et al.
Assessment of network module identification across complex diseases. Nat Methods.
2019;16: 843–852. doi:10.1038/s41592-019-0509-5
58. Caldera M, Buphamalai P, Müller F, Menche J. Interactome-based approaches to
human disease. Current Opinion in Systems Biology. Elsevier Ltd; 2017. pp. 88–94.
doi:10.1016/j.coisb.2017.04.015
59. Maiorino E, Baek SH, Guo F, Zhou X, Kothari PH, Silverman EK, et al. Discovering the
genes mediating the interactions between chronic respiratory diseases in the human
interactome. Nat Commun. 2020;11. doi:10.1038/s41467-020-14600-w
60. Muhiuddin G, Samanta S, Aljohani AF, Alkhaibari AM. A study on graph centrality
measures of different diseases due to DNA sequencing. Mathematics. 2023;11: 3166.
doi:10.3390/math11143166
61. Otasek D, Morris JH, Bouças J, Pico AR, Demchak B. Cytoscape Automation:
Empowering workflow-based network analysis. Genome Biol. 2019;20.
doi:10.1186/s13059-019-1758-4
62. Margueron R, Reinberg D. The Polycomb complex PRC2 and its mark in life. Nature.
2011. pp. 343–349. doi:10.1038/nature09784
63. Lee SR, Roh YG, Kim SK, Lee JS, Seol SY, Lee HH, et al. Activation of EZH2 and SUZ12
regulated by E2F1 predicts the disease progression and aggressive characteristics of
bladder cancer. Clinical Cancer Research. 2015;21: 5391–5403. doi:10.1158/1078-
0432.CCR-14-2680
64. Zhang J, Zhang J, Liu W, Ge R, Gao T, Tian Q, et al. UBTF facilitates melanoma
progression via modulating MEK1/2-ERK1/2 signalling pathways by promoting GIT1
transcription. Cancer Cell Int. 2021;21. doi:10.1186/s12935-021-02237-8
65. Denicola GM, Karreth FA, Humpton TJ, Gopinathan A, Wei C, Frese K, et al.
Oncogene-induced Nrf2 transcription promotes ROS detoxification and
tumorigenesis. Nature. 2011;475: 106–110. doi:10.1038/nature10189
66. Melino G. P63 is a suppressor of tumorigenesis and metastasis interacting with
mutant p53. Cell Death and Differentiation. 2011. pp. 1487–1499.
doi:10.1038/cdd.2011.81
67. Almasi SM, Hu T. Measuring the importance of vertices in the weighted human
disease network. PLoS One. 2019;14. doi:10.1371/journal.pone.0205936
68. Li Y, Agarwal P. A pathway-based view of human diseases and disease relationships.
PLoS One. 2009;4. doi:10.1371/journal.pone.0004346
69. Bauer-Mehren A, Bundschus M, Rautschka M, Mayer MA, Sanz F, Furlong LI. Gene-
disease network analysis reveals functional modules in mendelian, complex and
environmental diseases. PLoS One. 2011;6. doi:10.1371/journal.pone.0020284
70. Lüscher-Dias T, Siqueira Dalmolin RJ, de Paiva Amaral P, Alves TL, Schuch V, Franco
GR, et al. The evolution of knowledge on genes associated with human diseases.
iScience. 2022;25. doi:10.1016/j.isci.2021.103610
71. Mi Z, Guo B, Yin Z, Li J, Zheng Z. Disease classification via gene network integrating
modules and pathways. R Soc Open Sci. 2019;6. doi:10.1098/rsos.190214
72. Liu Y, Gu HY, Zhu J, Niu YM, Zhang C, Guo GL. Identification of Hub Genes and Key
Pathways Associated With Bipolar Disorder Based on Weighted Gene Co-expression
Network Analysis. Front Physiol. 2019;10. doi:10.3389/fphys.2019.01081
73. Hasankhani A, Bahrami A, Sheybani N, Fatehi F, Abadeh R, Ghaem Maghami Farahani
H, et al. Integrated network analysis to identify key modules and potential hub genes
involved in bovine respiratory disease: A systems biology approach. Front Genet.
2021;12. doi:10.3389/fgene.2021.753839
74. De R, Hu T, Moore JH, Gilbert-Diamond D. Characterizing gene-gene interactions in a
statistical epistasis network of twelve candidate genes for obesity. BioData Min.
2015;8. doi:10.1186/s13040-015-0077-x
75. Zhou L, Ding L, Gong Y, Zhao J, Xin G, Zhou R, et al. Identification of hub genes
associated with the pathogenesis of diffuse large B-cell lymphoma subtype one
characterized by host response via integrated bioinformatic analyses. PeerJ. 2020;8.
doi:10.7717/peerj.10269
76. Gotoh K, Kunisaki Y, Mizuguchi S, Setoyama D, Hosokawa K, Yao H, et al.
Mitochondrial protein synthesis is essential for terminal differentiation of CD45–
TER119– erythroid and lymphoid progenitors. iScience. 2020;23.
doi:10.1016/j.isci.2020.101654
77. Li Q, Hoppe T. Role of amino acid metabolism in mitochondrial homeostasis.
Frontiers in Cell and Developmental Biology. Frontiers Media S.A.; 2023.
doi:10.3389/fcell.2023.1127618
78. Zaganelli S, Rebelo-Guiomar P, Maundrell K, Rozanska A, Pierredon S, Powell CA, et
al. The pseudouridine synthase RPUSD4 is an essential component of mitochondrial
RNA granules. Journal of Biological Chemistry. 2017;292: 4519–4532.
doi:10.1074/jbc.M116.771105
79. Di Nottia M, Marchese M, Verrigni D, Mutti CD, Torraco A, Oliva R, et al. A
homozygous MRPL24 mutation causes a complex movement disorder and affects the
mitoribosome assembly. Neurobiol Dis. 2020;141. doi:10.1016/j.nbd.2020.104880
80. Das S, Guha P, Nath M, Das S, Sen S, Sahu J, et al. A comparative cross-platform
analysis to identify potential biomarker genes for evaluation of teratozoospermia and
azoospermia. Genes (Basel). 2022;13. doi:10.3390/genes13101721
81. Nuzhat N, Van Schil K, Liakopoulos S, Bauwens M, Dueñas Rey A, Käseberg S, et al.
CEP162 deficiency causes human retinal degeneration and reveals a dual role in
ciliogenesis and neurogenesis. Journal of Clinical Investigation. 2023.
doi:10.1172/jci161156
82. Sonawane AR, Platig J, Fagny M, Chen CY, Paulson JN, Lopes-Ramos CM, et al.
Understanding tissue-specific gene regulation. Cell Rep. 2017;21: 1077–1088.
doi:10.1016/j.celrep.2017.10.001
83. He Y, Chhetri SB, Arvanitis M, Srinivasan K, Aguet F, Ardlie KG, et al. Sn-spMF: Matrix
factorization informs tissue-specific genetic regulation of gene expression. Genome
Biol. 2020;21. doi:10.1186/s13059-020-02129-6
84. Rouskas K, Katsareli EA, Amerikanou C, Dimopoulos AC, Glentis S, Kalantzi A, et al.
Identifying novel regulatory effects for clinically relevant genes through the study of
the Greek population. BMC Genomics. 2023;24. doi:10.1186/s12864-023-09532-w
85. Gibson G, Powell JE, Marigorta UM. Expression quantitative trait locus analysis for
translational medicine. Genome Medicine. BioMed Central Ltd.; 2015.
doi:10.1186/s13073-015-0186-7
86. Parisien M, Khoury S, Chabot-Doré AJ, Sotocinal SG, Slade GD, Smith SB, et al. Effect
of human genetic variability on gene expression in dorsal root ganglia and
association with pain phenotypes. Cell Rep. 2017;19: 1940–1952.
doi:10.1016/j.celrep.2017.05.018
87. Li S, Schmid KT, de Vries DH, Korshevniuk M, Losert C, Oelen R, et al. Identification of
genetic variants that impact gene co-expression relationships using large-scale
single-cell data. Genome Biol. 2023;24. doi:10.1186/s13059-023-02897-x