The UCSC Xena Platform for cancer genomics data visualization and interpretation
Mary Goldman¹, Brian Craft¹, Angela Brooks¹, Jing Zhu¹, and David Haussler¹
¹UCSC Genomics Institute, UC Santa Cruz
Abstract
UCSC Xena is a web-based visual integration and exploration tool for multi-omic data
and associated clinical and phenotypic annotations. The investigator-driven platform
consists of a web-based Xena Browser and turn-key Xena Hubs. Xena showcases
seminal cancer genomics datasets from TCGA, Pan-Cancer Atlas, PCAWG, ICGC,
GTEx, and the GDC; a total of more than 1500 datasets across 50 cancer types. We
support virtually any functional genomics data modality, including SNPs,
INDELs, large structural variants, CNV, gene and other types of expression, DNA
methylation, and clinical and phenotypic annotations. A researcher can host their own data
securely via private hubs running on a laptop or behind a firewall, with visual and
analytical integration occurring only within the Xena Browser. Browser features include
the high-performance Visual Spreadsheet, dynamic Kaplan-Meier survival analysis,
powerful filtering and subgrouping, charts, statistical analyses, genomic signatures, and
bookmarks.
Paper
Introduction
Cancer is a complex disease arising from diverse genomic mechanisms (Hanahan
2011). Deciphering how normal cells transform into cancerous ones requires vast
amounts of data (Ledford 2010). Recent advances, including high-throughput, whole-
genome, and single-cell technologies, are now producing increasingly large and
comprehensive genomics datasets (Mardis 2008). By 2025, the total amount of
genomics data is expected to match the sum of the three other major producers of large
data: astronomy, YouTube, and Twitter (Stephens 2015).
Cancer researchers need powerful, interactive visualization and analysis tools to extract
knowledge from this deluge of information (Schroeder 2015). As the volume of data
reaches the level of many petabytes, it is challenging but imperative that tools continue
to respond rapidly to researcher inquiries. Adding to the data volume, the variety of data
modalities, such as gene expression and somatic mutation, is also expanding. Each of
these modalities provides a unique window into the genome; in concert, they give a
comprehensive picture of an individual cancer's biology. Integration across these
modalities is challenging but necessary to gain a more in-depth understanding of the
mechanisms for tumorigenesis, metastasis, and treatment response.
bioRxiv preprint first posted online May 18, 2018; doi: http://dx.doi.org/10.1101/326470. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
The highly distributed nature of this data also poses a significant challenge. Most
investigators generate relatively small-scale datasets which are only used by a small
number of individuals. In contrast, organized large consortia, such as TCGA (The
Cancer Genome Atlas) (Chin 2011, Chin 2011) and the GDC (Genomic Data
Commons) (Grossman 2016), generate or host large, high-value datasets that are used
by researchers all over the world. Despite data sharing efforts, these two sources of
data tend to be “siloed” and cannot be easily integrated or connected.
The UCSC Xena system was developed to enable cancer researchers of all
computational backgrounds to explore large public datasets, as well as their own data,
no matter where the data is located (Cieślik 2018, Langmead 2018). Xena hosts
datasets from landmark cancer genomics resources including TCGA, ICGC
(International Cancer Genome Consortium) (The International Cancer Genome
Consortium 2010), and the GDC. The system easily supports tens of thousands of
samples and has been tested with up to a million cells. The simple and flexible
architecture supports a variety of common and uncommon genomic and clinical data
types. Xena's unique visualizations integrate gene-centric and genomic-coordinate-
centric views across multiple data modalities, providing a deep, comprehensive view of
genomic events within a cohort of tumors.
Results
UCSC Xena (http://xena.ucsc.edu) is a visual integration and exploration tool for multi-
omic data. The Xena platform has two components: the web-based Xena Browser and
the back-end Xena Hubs (Figure 1). The Xena Browser empowers biologists to explore
data in multiple Xena Hubs using a variety of visualizations and analyses. Each Xena
Hub is able to host tens of thousands of samples and supports virtually every modality
of functional genomics data. These turn-key hubs can be configured to be public or
private and can be installed on laptops, public servers or those behind a firewall, or in
the cloud (Figure 1). Any number of Xena Hubs can be simultaneously connected to the
Xena Browser with integration occurring in the browser. This allows for data to be
distributed across multiple Xena Hubs.
The decoupled front-end Xena Browser and back-end Xena Hubs architecture has
several advantages. First, researchers can easily view their own private data by
installing their own Xena Hub. Data across multiple hubs are combined on the Xena
Browser, allowing users to view data from separate hubs as a coherent data resource
while keeping private data secure. More importantly, the Xena platform scales easily. As
more datasets are generated, more Xena Hubs are added to the network, effectively
growing with expanding genomics resources.
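To make the browser-side integration concrete, the following is a minimal sketch, not Xena's actual implementation. It assumes each hub returns one column of values keyed by sample ID; the two dictionaries and the function names are hypothetical stand-ins for hub responses. The join happens entirely on the client, so the private column never has to leave the user's machine.

```python
# Hypothetical stand-ins for data returned by two hubs: {sample_id: value}.
PUBLIC_HUB = {"S1": 2.5, "S2": 0.1, "S3": 1.7}  # e.g. gene expression
PRIVATE_HUB = {"S2": "mutant", "S3": "wild-type", "S4": "mutant"}  # e.g. mutation status


def union_samples(*columns):
    """Return the ordered union of sample IDs across all hub columns."""
    seen = {}
    for column in columns:
        for sample_id in column:
            seen.setdefault(sample_id, None)
    return list(seen)


def merge_columns(sample_ids, *columns):
    """Join one column per hub on sample ID; missing values become None."""
    return [[sid] + [column.get(sid) for column in columns] for sid in sample_ids]


rows = merge_columns(union_samples(PUBLIC_HUB, PRIVATE_HUB), PUBLIC_HUB, PRIVATE_HUB)
for row in rows:
    print(row)
```

Samples present in only one hub keep a placeholder in the other column, which is one simple way to show a cohort that spans several hubs as a single grid.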
Public data resources
Cancer genomics research is increasingly multi-omic. Today, studies commonly collect
data on somatic mutations, copy number and gene expression, with other data types
such as structural variants and protein expression being relatively rare. However, as
genomics technology advances, we expect these rarer data types to increase in
frequency and new data types to be produced. With this in mind, we designed Xena to
load any tabular or matrix-formatted data, giving us exceptional flexibility in
the data types we can visualize. Currently supported data modalities include somatic and
germline SNPs, INDELs, large structural variants, copy number variation, gene,
transcript, exon, and protein expression, DNA methylation, phenotype, clinical data, and
sample annotations. Additionally, our wrangled data can be downloaded in bulk and
used for downstream analyses. We also offer programmatic access to slices of data
through the Xena Python package (https://github.com/ucscXena/xenaPython), which
can be used independently or in a Jupyter Notebook to access any of the public Xena
Hubs.
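As an illustration of what a "slice" of matrix-formatted data means, here is a minimal sketch using only the Python standard library. It does not use the xenaPython package. The layout (sample IDs in the first row, identifiers in the first column, tab-delimited) and the values are assumptions made for the example.

```python
import csv
import io

# A tiny made-up matrix: first row holds sample IDs, first column holds gene names.
MATRIX_TSV = "gene\tS1\tS2\tS3\nTP53\t1.2\t0.4\t2.0\nTERT\t0.0\t3.1\t0.5\n"


def load_matrix(text):
    """Parse a tab-delimited matrix into (sample_ids, {identifier: [values]})."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    samples = header[1:]
    values = {row[0]: [float(v) for v in row[1:]] for row in reader if row}
    return samples, values


def slice_matrix(samples, values, identifier, wanted):
    """Return {sample: value} for one identifier over a subset of samples."""
    index = {sample: i for i, sample in enumerate(samples)}
    return {s: values[identifier][index[s]] for s in wanted if s in index}


samples, values = load_matrix(MATRIX_TSV)
print(slice_matrix(samples, values, "TERT", ["S2", "S3"]))
```

The same two-step pattern, load once and then request one identifier over a chosen set of samples, applies to bulk downloads used for downstream analyses.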
To showcase seminal cancer genomics data resources, we have deployed eight public
Xena Hubs in a cloud-computing environment. Together, they host 1512 datasets
from more than 50 cancer types, including the latest from TCGA, Pan-Cancer Atlas
(Hoadley 2018), ICGC, PCAWG (Pan-Cancer Analysis of Whole Genomes, Campbell
2017), and the GDC (Table 1). Our TCGA hub hosts data from TCGA, the most
comprehensive cancer genomics dataset to date, with a full set of data modalities for
12,000+ samples across 30+ cancer types. The Xena TCGA hub hosts all public-tier
TCGA-derived datasets, including somatic mutation, copy number variation, gene and exon
expression, and more. Our PancanAtlas hub hosts data from the latest TCGA project,
the Pan-Cancer Atlas, which conducted an integrative molecular analysis of all the
tumors in TCGA. In addition to uniformly analyzed data, this hub hosts highly curated
datasets such as molecular subtypes and multiple survival endpoints. Our ICGC hub
hosts data from the ICGC project, a global effort to create a comprehensive description
of genomic, transcriptomic and epigenomic changes in 50 different tumor types. Our
PCAWG hub supports the PCAWG project, an analysis of 2,600 ICGC whole-cancer
genomes and their matching normal tissues across 39 distinct tumour types (Campbell
2017). Its datasets include somatic mutation data from the whole genome, large
structural variants, RNAseq-based data analysis, mutational signatures, curated
histology, and more. Our GDC hub hosts data from the GDC, where TCGA and TARGET
data was uniformly recomputed using state-of-the-art pipelines and the latest human
genome assembly, hg38. In addition to these well-known resources, we also host
results from the Toil RNAseq analysis pipeline, a uniformly re-aligned and re-called
gene and transcript expression dataset for all TCGA, TARGET and GTEx samples
(Vivian 2017). This dataset allows users to compare gene and transcript expression of
TCGA 'tumor' samples to corresponding GTEx 'normal' samples. Our Treehouse hub
combines data from the Toil hub with Treehouse project
(https://treehousegenomics.soe.ucsc.edu/explore-our-data/) pediatric samples to
facilitate interpretation of a pediatric sample in a larger context. The UCSC Public hub
has data we curated from various literature publications such as CCLE (Cancer Cell
Line Encyclopedia, Barretina 2012). Xena complements each of these data sources by
not storing any raw sequencing data, but instead, loading derived datasets, such as
gene expression estimation and somatic mutation calls, and providing powerful
interactive visualizations for these data. All public Xena hubs
(https://xenabrowser.net/hub/) are open access, with no account or login required.
Turn-key Xena Hub
Xena Hubs are designed to be turn-key, allowing users who may not be computationally
savvy to install and run a Xena Hub on their personal computer. Hubs are easily
initiated using a point-and-click interface or through the command line
(http://xena.ucsc.edu/private-hubs/). Xena Hubs run on most operating systems,
including Windows, macOS and Linux. A dockerized version of the Xena Hub can be used
as part of an automated workflow pipeline to visualize computational results.
Xena Hubs can be configured to be private or public. Hubs running on a laptop are
private as they only allow connections from the user's own Xena Browser. Users can
use a laptop hub to quickly and securely view their own data. Xena Hubs started in the
cloud or on a server can be kept private by using a firewall. This enables easy sharing
of pre-publication data within a lab, institution, or as part of a larger collaboration. They
also can be configured to be public, making the data accessible to the larger community
after investigation and publication.
Performance is critical for an interactive visualization tool, especially on the web. As the
sample size for genomic experiments steadily increases, this has become a challenge
for many tools. Knowing this, we optimized Xena to support data queries on tens of
thousands of samples and more, serving slices of genomic and clinical data within a few
seconds.
Xena Browser
The Xena Browser is an online visual exploration tool for data in one or more Xena
Hubs. Our visualizations and analyses include the Xena Visual Spreadsheet, survival
analysis, scatter plots, bar graphs, statistical tests and genomic signatures.
Sophisticated filtering and searching allow the comparison of dynamically generated
subgroups. In addition to offering its own views, the Xena Browser connects with a variety of
complementary visualization tools. It also supports dynamic genomic signatures, allowing
users to explore correlations between a score and other -omic data. Its shareable
bookmarks and high-resolution PDFs enhance collaboration and the dissemination of results.
We support modern web browsers such as Chrome, Firefox or Safari.
Visual Spreadsheet
With multi-omic data, it is essential to view different types of data, such as gene
expression, copy number, and mutations, on genes or genomic segments side-by-side.
Integration across these diverse data modalities provides a more biologically complete
understanding of a genomic event. We designed our primary visualization, the Xena
Visual Spreadsheet, to facilitate this integration. Analogous to an office spreadsheet
application, it is a visual representation of a data grid where each column is a slice of
genomic or phenotypic data (e.g. gene expression), and each row is a single entity (e.g.
a bulk tumor sample, cell line, or single cell) (Figure 2). Rows of these entities are
dynamically sorted by the columns of genomics data, enabling the discovery of patterns
and correlations among genomic parameters. Xena's Visual Spreadsheet data grid
displays multiple user-requested slices of genomic and phenotypic data over a cohort of
samples. It excels at integrating diverse sets of genomics data and dynamically sorting
rows and columns to explore patterns, even when the data are hosted across multiple
data hubs.
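The dynamic row sorting described above can be sketched as a multi-key sort: rows are ordered by the first data column and ties are broken by each subsequent column. This is a minimal sketch under stated assumptions, not the browser's code; the sample records, column names, and ascending sort direction are all hypothetical.

```python
# Hypothetical cohort: each row is a sample, each key after "sample" is a column.
rows = [
    {"sample": "S1", "copy_number": -1, "mutated": False, "expression": 5.2},
    {"sample": "S2", "copy_number": 0, "mutated": True, "expression": 1.1},
    {"sample": "S3", "copy_number": -1, "mutated": True, "expression": 7.9},
    {"sample": "S4", "copy_number": 0, "mutated": False, "expression": 3.3},
]


def sort_rows(rows, column_order):
    """Sort by the first column, breaking ties with each subsequent column."""
    return sorted(rows, key=lambda row: tuple(row[c] for c in column_order))


ordered = sort_rows(rows, ["copy_number", "mutated", "expression"])
print([row["sample"] for row in ordered])

# Moving a column to the front changes the primary sort key and regroups the rows.
reordered = sort_rows(rows, ["mutated", "copy_number"])
print([row["sample"] for row in reordered])
```

Because the whole row moves as a unit, patterns that co-occur across columns (for example a deletion together with a mutation) end up visually adjacent.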
Xena's Visual Spreadsheet displays genomic data in both gene-centric and coordinate-
centric views. Gene-centric views show data mapped to a gene or portion of a gene,
such as exons, transcripts, or specific CpG islands, and can display only exonic regions
or include data mapped to introns. The coordinate-centric view shows data along the
genomic coordinate, displaying genomic events such as copy number variation, simple
mutations or structural variants (Supplemental Figure 1). Genomic intervals, from base
level up to an entire chromosome, can be viewed through entering specific coordinates
or through dynamic zooming. Both gene- and coordinate-centric views support coding
and non-coding regions (Supplemental Figure 2). Links to the UCSC Genome Browser
give genomic context to any chromosome region. In addition to these various genomics
views, we also visualize phenotype and clinical data such as age, gender, expression
signatures, cell types, and subtype classifications. These crucial phenotypic data enable
users to go beyond the genomic-only discoveries. All these different columns and views
can be placed side-by-side in a single Xena Visual Spreadsheet.
The power of the Visual Spreadsheet is its data integration. Integration across different
data modalities, such as copy number variation, gene expression and DNA methylation,
gives users a more comprehensive view of a genomic event in a tumor sample. For
example, higher expression for a gene can be driven by copy number amplification or
by promoter hypomethylation. Integration across gene- and coordinate-centric views
helps users examine these events in different genomic contexts. For example, a copy
number amplification event can be part of chromosomal arm duplication or a focal
amplification. Integration across genomic and clinical data gives users the ability to
make connections between genomic patterns and clinically relevant phenotypes such
as subtype and survival. For example, a mutation in a gene can lead to lower
recurrence-free survival. These diverse integrations help researchers harness the
power of comprehensive genomics studies, driving discovery and a deeper
understanding of cancer biology.
More browser visualizations and functionalities
In addition to the Visual Spreadsheet, we support several other powerful views and
analyses. Our Kaplan-Meier analysis allows users to statistically assess survival
stratification by any genomic or phenotypic data (Figure 3). Bar charts, box plots and
scatter plots, all with statistical tests automatically computed (chi-squared, t-test, or
ANOVA as appropriate), provide additional insights into the data (Supplemental Figure
3). The Transcript View enables easy comparison of transcript-level expression
between two groups, such as TCGA 'tumor' vs. GTEx 'normal', for all the transcripts of a
gene (Figure 4). We also provide context-dependent links to complementary
visualizations such as the Tumor Map (https://tumormap.ucsc.edu) (Newton 2017) and
MuPIT in CRAVAT (http://mupit.icm.jhu.edu/) (Niknafs 2013), enabling users to easily
see a genomic pattern from a different perspective.
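For readers unfamiliar with the Kaplan-Meier estimate, the following is a minimal sketch of the standard product-limit calculation. It is not the code of our Kaplan-Meier module, it omits the statistical comparison between groups, and the follow-up times and event flags are made up for illustration.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each time with at least one event.

    times: follow-up time per sample.
    events: True if the event (e.g. death) was observed, False if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all samples that share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Censored samples leave the risk set without lowering the curve.
        at_risk -= removed
    return curve


curve = kaplan_meier([5, 8, 8, 12, 20], [True, False, True, True, False])
print([(t, round(s, 3)) for t, s in curve])
```

Running the function once per subgroup (for example per histological subtype) gives one curve per group, which is the stratified view shown in a Kaplan-Meier plot.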
Gene-expression signatures have been developed to differentiate distinct subtypes of
tumors, identify important cellular responses to their environment (e.g. hypoxia), and predict
clinical outcomes in cancer (Sotiriou 2009). Xena’s genomic signature functionality
allows users to enter a signature expressed as a weighted sum of a marker gene set, a
common form seen in publications, and dynamically build a new spreadsheet column of
the resulting scores. This functionality allows researchers to test existing signatures or
build new ones, allowing comparison of the signature score with other genomic and
phenotypic data.
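A signature expressed as a weighted sum of marker genes reduces to one score per sample. The sketch below shows that arithmetic only; the gene names, weights, and expression values are hypothetical, and it does not reproduce the syntax users type into the browser.

```python
# Hypothetical signature: one weight per marker gene (illustrative values only).
SIGNATURE = {"GENE_A": 1.0, "GENE_B": -0.5, "GENE_C": 2.0}

# Hypothetical expression values: {sample: {gene: value}}.
EXPRESSION = {
    "S1": {"GENE_A": 2.0, "GENE_B": 4.0, "GENE_C": 1.0},
    "S2": {"GENE_A": 0.0, "GENE_B": 2.0, "GENE_C": 3.0},
}


def signature_scores(signature, expression):
    """Score each sample as the weighted sum of its marker-gene expression."""
    return {
        sample: sum(weight * genes[gene] for gene, weight in signature.items())
        for sample, genes in expression.items()
    }


scores = signature_scores(SIGNATURE, EXPRESSION)
print(scores)
```

The resulting dictionary plays the role of the new spreadsheet column: one value per sample that can be sorted on or compared with other columns.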
Our powerful text-based search allows users to dynamically highlight, filter and group
samples (Figure 5). Researchers can search the data on the screen similar to the ‘find’
functionality in Microsoft Word. Samples are matched and highlighted in real-time as the
user types. Researchers can filter, focusing the visualization to their samples of interest,
or dynamically build a new binary column, where samples are marked as 'true' or 'false',
depending on whether they meet the researcher's search criteria. The new two-group column
behaves like any other column and can be used in a Kaplan-Meier analysis, box plot, or
other statistical analysis to compare the two sample groups. The find-highlight-new-
column functionality is a powerful way to dynamically construct two sub-populations of
samples for comparison and analysis.
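The find-highlight-new-column idea can be sketched as follows. A plain case-insensitive substring match stands in for the browser's search syntax, which is richer than this; the sample records and function names are hypothetical.

```python
# Hypothetical on-screen data: one record per sample.
samples = [
    {"sample": "S1", "histology": "astrocytoma"},
    {"sample": "S2", "histology": "oligodendroglioma"},
    {"sample": "S3", "histology": "oligoastrocytoma"},
    {"sample": "S4", "histology": "astrocytoma"},
]


def matches(row, term):
    """True if any value in the row contains the search term (case-insensitive)."""
    needle = term.lower()
    return any(needle in str(value).lower() for value in row.values())


def binary_column(rows, term):
    """Build a true/false column marking which samples meet the search."""
    return [matches(row, term) for row in rows]


def filter_rows(rows, term):
    """Keep only the samples that meet the search."""
    return [row for row in rows if matches(row, term)]


column = binary_column(samples, "oligo")
print(column)
print([row["sample"] for row in filter_rows(samples, "oligo")])
```

The true/false list splits the cohort into two groups, which is the input a two-group comparison such as a survival analysis or box plot needs.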
Being able to share and distribute biological insights is crucial in this era of collaborative
genomics. Xena’s bookmark functionality enables the sharing of live views. With a
single click, users can generate a URL of their current view, which will take researchers
back to the live browser session. The URL can be shared with colleagues or included in
reports or publications. If a view contains data from a non-public Xena Hub, we allow
users to download the current visualization state as a file. This file can then be
appropriately shared and imported into the Xena Browser to recreate the live view. By
giving users a file instead of a URL, we ensure that we never have the user's private
data on our servers. In addition to bookmarks, researchers can generate a high-resolution
PDF figure of their current visualization for presentations and publications.
To assist researchers in building a Visual Spreadsheet, we developed a short three-step
guided wizard. This ensures that even new users who are unfamiliar with Xena can
build basic visualizations. We also provide links to live examples that showcase useful
and scientifically interesting visualizations, highlighting the power of Xena. We focus on
supporting our users by developing training videos, online and in-person workshops,
and help documentation. We keep users up-to-date on new features and datasets
through our social media accounts, mailing lists and monthly newsletter.
Discussion
UCSC Xena is a tool for cancer researchers to explore, visualize, and analyze
functional genomics data. We host many large public datasets, such as TCGA, Pan-
Cancer Atlas, PCAWG, GDC, and ICGC, helping to make these powerful resources
accessible to investigators. The Visual Spreadsheet, sophisticated filtering and
subgrouping, and Kaplan-Meier analysis enable researchers of all computational
backgrounds to investigate complex genomics datasets. We support virtually all data
modalities including mutations, copy number, expression, phenotype and clinical data
as well as rare data types such as non-coding mutations, large structural variants, and
RNAseq-based gene fusion calls. Integration across different data modalities and
visualizations, as well as between genomic and clinical data, yields insightful views into
cancer biology.
UCSC Xena complements existing tools including the cBio Portal
(http://www.cbioportal.org/, Cerami 2012), ICGC Portal (https://dcc.icgc.org/, Zhang
2011), GDC Portal (https://portal.gdc.cancer.gov/, Jensen 2017), IGV
(http://software.broadinstitute.org/software/igv/, Thorvaldsdóttir 2013), and St. Jude
Cloud (https://stjude.cloud/, Ma 2018), by focusing on providing a federated platform to
visualize data across multiple hubs simultaneously. This enables biologists to easily
view their own data as well as consortium data while still maintaining data privacy.
Additionally, Xena focuses on simultaneous, integrative visualization of multi-omics
datasets across different genomic contexts, including gene, genomic element, or any
genomic region for both coding and non-coding parts of the genome. Finally, Xena is
built for performance. It can easily visualize tens of thousands of samples in a few
seconds and has been tested with up to a million cells. With single-cell technology becoming
more common, functional genomics datasets will grow to be several orders of
magnitude larger than those from traditional bulk tumor samples. Xena is well-positioned to rise to
the challenge.
While it is widely recognized that data sharing is key to advancing cancer research, how
it is shared can impact the ease of data access. UCSC Xena is designed for cancer
researchers both with and without computational expertise to share and access data.
Users without a strong computational background can explore their own data by
installing a Xena Hub on their personal computer using our point-and-click interface.
Bioinformaticians can install a private or public Xena Hub on a server or in the cloud or
as part of an analysis pipeline, making the generated data available in a user-friendly
manner that requires little extra effort. Data sharing has advanced, and will continue to
advance, cancer biology, and Xena is part of the technological ecosystem that helps support this
priority.
UCSC Xena is a scalable solution to the rapidly expanding and decentralized cancer
genomics data. Xena's architecture, with its detached data hubs and web-browser-
based visualization, allows new projects to easily add their data to the growing
compendium that we support. Additionally, by maintaining a flexible tabular and matrix
data input format, we support many different data modalities, both current and future. In
this age of expanding data resources, Xena's design supports the ongoing needs of the
cancer research community.
Xena excels at viewing cohorts of samples, cells, or cell lines and showing trends
across those entities, whether they be human or a model organism. While we have
focused on cancer genomics, the platform is general enough to host any functional
genomics data. We hope to expand Xena to continue to help serve the genomics needs
of the biomedical community.
Acknowledgements
Research reported in this publication was supported by the National Cancer Institute of the
National Institutes of Health under award numbers 5U24CA180951-04 and
5U24CA210974-02. The content is solely the responsibility of the authors and does not
necessarily represent the official views of the National Institutes of Health. We would
also like to thank AWS Cloud Credits for Research as well as Google Summer of Code.
Methods
Data
Functional genomics data are downloaded from each respective source, wrangled into
a generic tabular or matrix format, and then loaded into a corresponding hub. Specific
wrangling steps, along with any normalization that was performed, are listed for each
dataset in the Xena Browser dataset pages (https://xenabrowser.net/datapages/). The
wrangled data is available for download from the dataset pages.
Xena Hub
The Xena Hub is a JVM-based application, written in Clojure, that serves functional
genomic data (http://xena.ucsc.edu/private-hubs/) over HTTP. It exposes a relational
query API for metadata and data slicing. Using a query language instead of REST
allows us to decouple the client and server. To support interactive visualization, REST
APIs have to be denormalized for performance (e.g. by joining related objects, and
projecting the result). This creates a tight coupling between the REST endpoints and
particular views: a novel view requires a novel denormalization. A query language
allows us to fetch exactly the data we need, and only the data we need, for quickly
evolving visualizations and data shapes, without redeployment of the hubs. This is
similar in motivation to Facebook's GraphQL, and Netflix's Falcor, but predates them.
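The decoupling argument can be illustrated with a toy query evaluator. This is a sketch of the general idea only: the dictionary-based query shape, the table contents, and the function name are hypothetical and do not represent the hub's actual query syntax or schema.

```python
# Hypothetical hub contents: one table of per-dataset metadata.
TABLES = {
    "dataset": [
        {"name": "expr", "kind": "matrix", "samples": 500},
        {"name": "mut", "kind": "mutation", "samples": 480},
        {"name": "clin", "kind": "matrix", "samples": 510},
    ]
}


def run_query(query, tables=TABLES):
    """Evaluate {"select": [...], "from": name, "where": {field: value}} generically."""
    rows = tables[query["from"]]
    where = query.get("where", {})
    return [
        {field: row[field] for field in query["select"]}
        for row in rows
        if all(row[key] == value for key, value in where.items())
    ]


# Two different views issue two different queries; the server code is unchanged.
names = run_query({"select": ["name"], "from": "dataset"})
matrices = run_query(
    {"select": ["name", "samples"], "from": "dataset", "where": {"kind": "matrix"}}
)
print(names)
print(matrices)
```

With fixed REST endpoints, each of these two result shapes would typically need its own server-side handler; here a new view only needs a new query on the client.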
Internally, Xena Hubs use the H2 database for storage. Data is stored in opaque blocks
in a column orientation, which allows fast retrieval of a field for all samples of a dataset,
or a subset of samples. A hub can be installed either via the command line or via the
point-and-click install4j GUI.
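The benefit of the column orientation mentioned above can be sketched with plain Python lists. This is an illustration under stated assumptions, not the hub's storage code: the block size, record layout, and function names are hypothetical, and no database is involved.

```python
BLOCK_SIZE = 2  # samples per block; deliberately tiny for illustration


def to_column_blocks(rows, fields):
    """Turn row-oriented records into {field: [block, block, ...]}."""
    store = {}
    for field in fields:
        column = [row[field] for row in rows]
        store[field] = [
            column[i:i + BLOCK_SIZE] for i in range(0, len(column), BLOCK_SIZE)
        ]
    return store


def read_field(store, field, sample_indices=None):
    """Read one field for all samples, or only the values for a subset."""
    blocks = store[field]
    if sample_indices is None:
        return [value for block in blocks for value in block]
    return [blocks[i // BLOCK_SIZE][i % BLOCK_SIZE] for i in sample_indices]


rows = [
    {"sample": "S1", "expr": 1.0, "age": 40},
    {"sample": "S2", "expr": 2.0, "age": 55},
    {"sample": "S3", "expr": 3.0, "age": 61},
]
store = to_column_blocks(rows, ["expr", "age"])
print(read_field(store, "expr"))
print(read_field(store, "expr", [2, 0]))
```

Reading "expr" never touches the "age" blocks, which is why a per-field layout suits requests for one field across many samples.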
Xena Browser
The Xena Browser is a JavaScript application to visualize and analyze functional
genomics data stored in one or more Xena Hubs. The primary technologies are React,
the 2D canvas API, and RxJS. Babel is used for ES6 support, and webpack for the build.
The application architecture is an asynchronous model similar to redux-observable
(https://redux-observable.js.org/), with semantic actions that update application state,
and action side-effects creating Rx streams that will dispatch later actions. The redux
(https://redux.js.org/), or Om (https://github.com/omcljs/om), pattern of immutable,
single-atom state makes it simple to keep multiple views in sync, and provides “time
travel” debugging during development.
We prefer the canvas API to SVG libraries such as D3, because it performs better at our
data scale. With the advances in JavaScript JIT compilers, we find that optimized loops
over canvas pixel buffers out-perform geometric drawing primitives, such as rect(), and
stroke(), when rendering dense views of large data.
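The pixel-buffer approach can be sketched as a loop that writes colors directly into a flat RGBA byte array, the same row-major layout a canvas image buffer uses. The sketch is in Python rather than JavaScript and makes no performance claim; the red-to-blue color mapping and the function name are assumptions for illustration.

```python
def render_column(values, width, lo, hi):
    """Fill an RGBA buffer with one pixel row per sample, `width` pixels wide.

    High values are drawn red and low values blue.
    """
    buf = bytearray(len(values) * width * 4)
    span = (hi - lo) or 1.0
    for row, value in enumerate(values):
        # Clamp the value into [0, 1] before mapping it to a color.
        frac = min(max((value - lo) / span, 0.0), 1.0)
        red, blue = int(255 * frac), int(255 * (1 - frac))
        for col in range(width):
            i = (row * width + col) * 4
            buf[i] = red        # R
            buf[i + 1] = 0      # G
            buf[i + 2] = blue   # B
            buf[i + 3] = 255    # A (opaque)
    return buf


buf = render_column([0.0, 1.0], width=2, lo=0.0, hi=1.0)
print(len(buf), list(buf[:4]), list(buf[8:12]))
```

Each sample costs a few array writes instead of one drawing call per rectangle, which is the trade-off the paragraph above describes for dense views.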
The jsverify property-based testing library (http://jsverify.github.io/) is used for unit and
integration testing. Property-based, or "generative" testing is similar to fuzzing --
generating random test cases, and asserting invariants over the results -- but on failure,
attempts to find a minimal failing test case. This usually results in more tractable failure
cases. Property-based testing allows us to test a much larger portion of the input space
than conventional "known-answer" unit tests, and frequently identifies failure cases that
we would never think to test.
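The generate-then-shrink loop of property-based testing can be sketched without any library. This is a hand-rolled illustration, not jsverify and not our test suite; the deliberately buggy function, the shrinking strategy, and all names are hypothetical.

```python
import random


def buggy_max(xs):
    """Deliberately wrong: ignores the last element of a multi-element list."""
    best = xs[0]
    for x in xs[:-1]:
        best = max(best, x)
    return best


def prop_max(xs):
    """The invariant under test: the result equals the true maximum."""
    return buggy_max(xs) == max(xs)


def shrink(xs):
    """Yield smaller candidates: drop one element, or halve one element toward zero."""
    for i in range(len(xs)):
        if len(xs) > 1:
            yield xs[:i] + xs[i + 1:]
    for i, x in enumerate(xs):
        if x != 0:
            smaller = x // 2 if x > 0 else -((-x) // 2)
            yield xs[:i] + [smaller] + xs[i + 1:]


def find_minimal_failure(prop, trials=200, seed=0):
    """Generate random cases; on the first failure, greedily shrink it."""
    rng = random.Random(seed)
    for _ in range(trials):
        case = [rng.randint(-100, 100) for _ in range(rng.randint(1, 8))]
        if prop(case):
            continue
        shrunk = True
        while shrunk:
            shrunk = False
            for candidate in shrink(case):
                if not prop(candidate):
                    case, shrunk = candidate, True
                    break
        return case
    return None


print(find_minimal_failure(prop_max))
```

The random failing list is reduced to a two-element case with small values, which points much more directly at the off-by-one in the loop than the original random input would.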
All of our code is open source and available for reuse under an Apache 2.0 license
(https://github.com/ucscXena). We also have contributed two JavaScript modules to
BioJS (Gómez 2013), including a Kaplan-Meier module
(https://github.com/ucscXena/kaplan-meier) to compute Kaplan-Meier statistics, and a
static-interval-tree library (https://github.com/ucscXena/static-interval-tree) to efficiently
find overlapping intervals in one dimension.
Public Xena Hubs
We deploy all public-facing hubs in a cloud-computing environment.
Figures and Tables
Figure 1
Figure 1. Diagram of the UCSC Xena platform architecture. Multiple Xena Hubs (each
shown as a database icon) are connected to the Xena Browser simultaneously. Public
Hubs are in green and private Hubs are in red. In this example, private data from an
independent research collaboration (in red) can be loaded into their own private Xena
Hubs, on their servers or on a researcher's computer. Data integration occurs within the
Xena Browser on the user's computer. The lock icon indicates that only authorized
users have access to the private Xena Hubs. This design achieves data integration
across both public and private resources while maintaining each hub’s data
confidentiality.
Figure 2
Figure 2. A Xena Visual Spreadsheet showing genetic separation of TCGA lower grade
gliomas into two distinct groups with the bottom group being characterized by loss of
chromosome arms 1p and 19q, and relatively high expression of TERT and the top one
by TP53 and ATRX mutations. Each row in the view corresponds to a single sample.
The first column shows the scale of samples and is followed by data columns. The rows
are sorted by the leftmost data column (column B) and sub-sorted on subsequent
columns. Starting on the left, data columns are chromosome 1 and 19 copy number
segments, TP53 and ATRX mutation status, TERT gene expression, and sample
histology. Copy number segments (columns B, C) display genes in dark blue at the top.
Amplifications are in red and deletions in blue. Mutations (columns D, E) show a gene
diagram at the top with exons in grey boxes with coding regions being taller and
untranslated regions being shorter. The position of each mutation is marked in relation
to the gene diagram and colored by its functional impact. Deleterious mutations are in
red, missense mutations and in-frame indels are in blue, and synonymous mutations
are in green. Gene expression (column F) is colored red to blue for high to low
expression. Samples at the top of the view are enriched for the astrocytoma histological
subtype, while the samples at the bottom tend to be the oligodendroglioma subtype.
https://xenabrowser.net/heatmap/?bookmark=6b236d77538128421ee9c32911ce173a
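The row ordering described in the Figure 2 caption (sort by the leftmost data column, then sub-sort on each subsequent column) amounts to a lexicographic sort. The sketch below is a minimal illustration only, not the Xena Browser's implementation; the sample records, column choices, and ascending sort direction are assumptions.

```python
# Hypothetical samples: (sample_id, chr1p_copy_number, tp53_mutated, tert_expression)
samples = [
    ("S1", -0.8, False, 9.1),
    ("S2", 0.0, True, 2.3),
    ("S3", -0.8, False, 7.4),
    ("S4", 0.0, False, 3.0),
]


def sort_rows(rows):
    """Sort by the first data column, breaking ties with each later column."""
    return sorted(rows, key=lambda r: (r[1], r[2], r[3]))


ordered_ids = [row[0] for row in sort_rows(samples)]
```

Samples S1 and S3 tie on the first two data columns, so their relative order is decided by the third.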
Figure 3
Figure 3. A Xena Kaplan-Meier analysis of overall survival for the TCGA lower grade
glioma histological subtypes. Black boxes in the figure highlight, top to bottom, the
ability to generate a PDF, statistical analysis results, the ability to select different
survival endpoints such as overall or recurrence-free survival, and a custom survival
time cutoff (currently set to 3,650 days, or 10 years). This figure shows that patients
characterized as having the astrocytoma histological subtype have significantly worse
10-year overall survival compared to the oligodendroglioma and oligoastrocytoma
subtypes (p < 0.05).
https://xenabrowser.net/heatmap/?bookmark=2f9d783982879594dd0f52564058372d
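To make the survival time cutoff concrete, the sketch below computes a Kaplan-Meier (product-limit) curve after censoring follow-up at 3,650 days. It is a minimal illustration under stated assumptions, not Xena's implementation: the follow-up data are hypothetical, and treating samples followed beyond the cutoff as censored at the cutoff is an assumption about how such a cutoff is typically applied. The significance test reported in the figure is not reproduced here.

```python
def apply_cutoff(times, events, cutoff):
    """Censor follow-up at `cutoff` days; later events become censored."""
    out_times, out_events = [], []
    for t, e in zip(times, events):
        if t > cutoff:
            out_times.append(cutoff)
            out_events.append(False)
        else:
            out_times.append(t)
            out_events.append(e)
    return out_times, out_events


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time."""
    survival = 1.0
    curve = []
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical follow-up in days; True = death observed, False = censored.
times = [400, 1200, 1200, 2500, 4000, 5000]
events = [True, True, False, True, True, False]
t10, e10 = apply_cutoff(times, events, cutoff=3650)
curve = kaplan_meier(t10, e10)
```

With this toy input, the death at day 4,000 falls outside the 10-year window, so it contributes to the at-risk counts but not to the drop in the curve.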
Figure 4
Figure 4. A Xena Transcript View showing four KRAS transcripts' expression for TCGA
pancreatic adenocarcinoma and GTEx normal pancreas tissue. To generate a view, a
researcher enters a gene and selects two populations. The visualization will display, for
all transcripts for that gene, a double (top and bottom) density distribution of transcript
expression in each population. We see that for KRAS, transcript ENST00000311936.7
(second from the top), has higher expression in pancreatic tumors (TCGA) compared to
the normal pancreas tissue (GTEx).
http://dev.xenabrowser.net/transcripts/?bookmark=80d9c57b471b654cc569d4ceb44e6591
Figure 5
Figure 5. Xena Browser text-based find, highlight, filter, and subgroup samples
functionality. (a) Finding and highlighting samples in the TCGA lung adenocarcinoma cohort
that have a splice mutation in MET. Similar to the ‘find in document’ feature in Microsoft
Word, users can search all data on the screen. In this figure, the Xena Browser
searched all columns for the user's search term 'splice' and highlighted those samples
with a 'splice' mutation with black tick marks (indicated by the black box). Other
supported operations include 'AND', 'OR', '>', '<', '=', and more. Users can dynamically
filter, zoom, and create subgroups based on the search results. Columns from left to
right are MET gene expression, MET exon expression and MET somatic mutation
status.
https://xenabrowser.net/heatmap/?bookmark=d145daa18123218d4a37af0ec214b95c
(b) The same view after creating a new column with two subgroups. Columns left to
right are the same as in (a), with an additional user-generated column on the left
created from the search. Samples that matched the query 'splice' were assigned a value
of "true" and those that did not, "false". The researcher has zoomed to the few samples
at the top for a more detailed view. The figure shows that samples that have the splice
site mutation (orange, column E) have lower expression of exon 14 within the MET gene
(column D). The splice mutation causes exon 14 skipping and results in the activation
of MET (Kong-Beltran 2006, The Cancer Genome Atlas Research Network 2014).
https://xenabrowser.net/heatmap/?bookmark=2239102135132b7fbd482204aa723a95
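The search-then-subgroup workflow in Figure 5 can be sketched as follows. This is a hedged illustration rather than the Xena Browser's search engine: the rows, column names, and helper functions are hypothetical, and case-insensitive substring matching is an assumption made for the example.

```python
# Hypothetical on-screen data: one dict per sample, one key per column.
rows = [
    {"sample": "A", "MET_expression": 11.2, "MET_mutation": "splice"},
    {"sample": "B", "MET_expression": 9.8, "MET_mutation": "missense"},
    {"sample": "C", "MET_expression": 12.0, "MET_mutation": ""},
    {"sample": "D", "MET_expression": 10.1, "MET_mutation": "Splice_Site"},
]


def matches(row, term):
    """True if any data column's value contains `term` (case-insensitive)."""
    needle = term.lower()
    return any(
        needle in str(value).lower()
        for key, value in row.items()
        if key != "sample"
    )


def add_subgroup_column(rows, term, name="subgroup"):
    """Return new rows with a "true"/"false" column recording the match."""
    return [
        {**row, name: "true" if matches(row, term) else "false"}
        for row in rows
    ]


grouped = add_subgroup_column(rows, "splice")
labels = [row["subgroup"] for row in grouped]
```

The new column is added to copies of the rows, so the original data stay unchanged and the subgroup can be discarded or recomputed with a different search term.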
Table 1
Data Hub | Samples | List of Data Types
TCGA | 12,811 | copy number, gene-, exon-, miRNA-, and protein-expression, somatic mutation, DNA methylation, survival, and clinical data
Pancan Atlas | 12,591 | copy number, gene-, miRNA-, and protein-expression, somatic mutation, DNA methylation, molecular subtypes, curated survival, and clinical data
ICGC | 17,677 | copy number, gene expression, somatic coding mutation, and somatic whole-genome mutation (non-TCGA only)
PCAWG | 3,205 | whole-genome copy number, somatic mutations, large structural variants, gene- and miRNA-expression, RNAseq based gene fusion, alternative promoter usage, RNAseq based splicing events, purity, ploidy, mutational signature, survival, and curated histology
Toil | 19,340 | TCGA, TARGET and GTEx gene and transcript expression
GDC | 20,157 | copy number, somatic mutations, gene and miRNA expression, overall survival, and phenotypes
Treehouse | 11,078 | TCGA, TARGET and Treehouse partnering clinical studies gene and transcript expression
UCSC Public | * | somatic mutation, gene expression, copy number, and clinical data on adult and pediatric cancer, cell lines and mouse models
Table 1. Summary of data hosted on Public Xena Hubs. *Number of samples is not
listed for the UCSC Public Hub because each of the 44 cohorts is an individual study.
Supplemental Figures
Supplemental Figure 1
Supplemental Figure 1. Visualization of large structural variants. This figure shows the
frequent ERG fusion in PCAWG prostate cancer detected by both RNA-seq and DNA-
seq analysis. Columns, starting at the left, correspond to histology, ERG gene
expression, and ERG fusion based on RNA-seq data. Gene expression is colored red to
green for high to low expression. In the ERG fusion column samples that have a fusion
are marked with 1 and those that do not are marked with 0. The next three columns
show structural variant calls made using whole-genome DNA-seq data for ERG,
TMPRSS2, and SLC45A3. Precise breakpoints are mapped to gene annotations. A grey
bar indicates an external piece of DNA that is fused at the breakpoint. Gene names on
the grey bars show the origin of the external DNA that is joined. This figure shows that
TMPRSS2 and SLC45A3 are fusion partners for ERG, and that these fusions correlate
with over-expression of ERG. Fusions detected by RNA-seq and whole-genome
sequencing are not always consistent. Here, even using a consensus of DNA-based
detection methods, one fusion detected by a consensus of RNA-based detectors is
missed, and the converse is also seen. This example shows that an integrated
visualization across multiple data types and algorithms provides a more accurate model
of a genomic event.
https://xenabrowser.net/heatmap/?bookmark=24ad428d0f3bf3bf3205bcffab64d276
Supplemental Figure 2
Supplemental Figure 2. Visualization of both coding and non-coding mutations from a
gene-centric perspective in ICGC lymphoma. The columns left to right are cancer type,
BCL2, BCL7A, BIRC3 and ST6GAL1 mutation status, respectively. Gene diagrams are
shown at the top of each column with exons in light and dark grey boxes with coding
regions being taller and untranslated regions being shorter. The position of each
mutation is marked in relation to the gene diagram and colored by its functional impact,
with deleterious mutations in red, missense mutations and in-frame indels in blue,
synonymous mutations in green, splice site mutations in orange, and mutations with an
unknown functional impact in grey. This figure shows the frequent intronic mutations in
these genes, which are visible via our dynamic toggle to show or hide introns. These
mutation 'pile-ups' would not be visible if viewing exomes only. While the majority of
the intronic mutations in this view have an unknown impact (shown in grey), they do
overlap with known enhancer regions (Mathelier 2015).
https://xenabrowser.net/heatmap/?bookmark=d2a79e46e22456036a732c49c2e4c5b3
Supplemental Figure 3
Supplemental Figure 3. Xena Chart View showing a box plot of TERT expression for
each of the TCGA lower grade glioma histological subtypes. Columns created in the
Visual Spreadsheet (Figure 2) are used to construct the chart. Statistical analyses are
automatically computed. This view shows a significant expression difference for TERT
between oligodendroglioma, oligoastrocytoma and astrocytoma histologies (one-way
ANOVA, p < 0.05).
https://xenabrowser.net/heatmap/?bookmark=ae1a7f260e8d95bd3c1742503de32192
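For readers unfamiliar with the test named in the caption, the sketch below computes the one-way ANOVA F statistic for three groups. It is an illustrative calculation, not the code behind the Xena Chart View, and the expression values are hypothetical. Converting F to a p-value requires the F distribution, which is omitted here; a library routine such as `scipy.stats.f_oneway` would return both.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual values around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical TERT expression values per histological subtype.
astrocytoma = [1.0, 2.0, 3.0]
oligoastrocytoma = [2.0, 3.0, 4.0]
oligodendroglioma = [5.0, 6.0, 7.0]
f_stat, df_b, df_w = one_way_anova_f(
    [astrocytoma, oligoastrocytoma, oligodendroglioma]
)
```

A large F indicates that the differences between subtype means are large relative to the spread within each subtype.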
References
Barretina, J., Caponigro, G., Stransky, N., Venkatesan, K., Margolin, A. A., et al. The
Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug
sensitivity. Nature 483, 603-607 (2012).
Campbell, P. J., Getz, G., Stuart, J. M., Korbel, J. O., Stein, L. D., et al. Pan-cancer
analysis of whole genomes. Preprint at
https://www.biorxiv.org/content/early/2017/07/12/162784 (2017).
The Cancer Genome Atlas Research Network. Comprehensive molecular profiling of
lung adenocarcinoma. Nature 511, 543–550 (2014).
Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., et al. The cBio Cancer
Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics
Data. Cancer Discovery 2, 401-404 (2012).
Chin, L., Hahn, W.C., Getz, G. & Meyerson, M. Making sense of cancer genomic data.
Genes & Development 25, 534-555 (2011).
Chin, L., Andersen, J.N. & Futreal, P.A. Cancer genomics: from discovery science to
personalized medicine. Nature Medicine 17, 297-303 (2011).
Cieślik, M. & Chinnaiyan, A. M. Cancer transcriptome profiling at the juncture of clinical
translation. Nature Reviews Genetics 19, 93–109 (2018).
Gómez, J., García, L. J., Salazar, G. A., Gore, J. V. S., García, A., et al. BioJS: an open
source JavaScript framework for biological data visualization. Bioinformatics 29,
1103-1104 (2013).
Grossman, R.L., Heath, A. P., Ferretti, V., Varmus, H. E., Lowy, D. R., et al. Toward a
Shared Vision for Cancer Genomic Data. New England Journal of Medicine 375, 1109-
1112 (2016).
Hanahan, D. & Weinberg, R. A. Hallmarks of Cancer: The Next Generation. Cell 144(5),
646-674 (2011).
Hoadley, K. A., Yau, C., Hinoue, T., Wolf, D. M., Lazar, A. J., et al. Cell-of-Origin
Patterns Dominate the Molecular Classification of 10,000 Tumors from 33 Types of
Cancer. Cell 173(2), 291–304 (2018).
The International Cancer Genome Consortium. International network of cancer genome
projects. Nature 464, 993–998 (2010).
Jensen, M. A., Ferretti, V., Grossman, R. L. & Staudt, L. M. The NCI Genomic Data
Commons as an engine for precision medicine. Blood 130, 453-459 (2017).
Kong-Beltran, M., Seshagiri, S., Zha, J., Zhu, W., Bhawe, K., et al. Somatic mutations
lead to an oncogenic deletion of met in lung cancer. Cancer Research 66, 283-289
(2006).
Langmead, B. & Nellore, A. Cloud computing for genomic data analysis and
collaboration. Nature Reviews Genetics 19, 208–219 (2018).
Ledford, H. Big science: The cancer genome challenge. Nature 464, 972-974 (2010).
Ma, X., Liu, Y., Liu, Y., Alexandrov, L. B., Edmonson, M.N., et al. Pan-cancer genome
and transcriptome analyses of 1,699 paediatric leukaemias and solid tumours. Nature
555, 371-376 (2018).
Mardis, E. R. The impact of next-generation sequencing technology on genetics. Trends
in Genetics 24, 133-141 (2008).
Mathelier, A., Lefebvre, C., Zhang, A. W., Arenillas, D. J., Ding, J., et al. Cis-regulatory
somatic mutations and gene-expression alteration in B-cell lymphomas. Genome
Biology 16, 84 (2015).
Newton, Y., Novak, A. M., Swatloski, T., McColl, D. C., Chopra, S., et al. TumorMap:
Exploring the Molecular Similarities of Cancer Samples in an Interactive Portal. Cancer
Research 77, e111–114 (2017).
Niknafs, N., Kim, D., Kim, R., Diekhans, M., Ryan, M., et al. MuPIT interactive:
webserver for mapping variant positions to annotated, interactive 3D structures. Human
Genetics 132, 1235–1243 (2013).
Schroeder, M. P., Gonzalez-Perez, A. & Lopez-Bigas, N. Visualizing multidimensional
cancer genomics data. Genome Medicine 5, 9 (2013).
Sotiriou, C. & Pusztai, L. Gene-Expression Signatures in Breast Cancer. New England
Journal of Medicine 360, 790-800 (2009).
Stephens, Z. D., Lee, S. L., Faghri, F., Campbell, R. H., Zhai, C., et al. Big Data:
Astronomical or Genomical? PLOS Biology (2015).
Thorvaldsdóttir, H., Robinson, J. T., & Mesirov, J. P. Integrative Genomics Viewer
(IGV): high-performance genomics data visualization and exploration. Briefings in
Bioinformatics 14, 178–192 (2013).
Vivian, J., Rao, A. A., Nothaft, F. A., Ketchum, C., Armstrong, J., et al. Toil enables
reproducible, open source, big biomedical data analyses. Nature Biotechnology 35,
314-316 (2017).
Zhang, J., Baran, J., Cros, A., Guberman, J. M., Haider, S., et al. International Cancer
Genome Consortium Data Portal—a one-stop shop for cancer genomics data.
Database 2011, (2011).
.CC-BY-NC-ND 4.0 International licensepeer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not. http://dx.doi.org/10.1101/326470doi: bioRxiv preprint first posted online May. 18, 2018;
.CC-BY-NC-ND 4.0 International licensepeer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not. http://dx.doi.org/10.1101/326470doi: bioRxiv preprint first posted online May. 18, 2018;
.CC-BY-NC-ND 4.0 International licensepeer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not. http://dx.doi.org/10.1101/326470doi: bioRxiv preprint first posted online May. 18, 2018;
.CC-BY-NC-ND 4.0 International licensepeer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not. http://dx.doi.org/10.1101/326470doi: bioRxiv preprint first posted online May. 18, 2018;
.CC-BY-NC-ND 4.0 International licensepeer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not. http://dx.doi.org/10.1101/326470doi: bioRxiv preprint first posted online May. 18, 2018;
.CC-BY-NC-ND 4.0 International licensepeer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not. http://dx.doi.org/10.1101/326470doi: bioRxiv preprint first posted online May. 18, 2018;
.CC-BY-NC-ND 4.0 International licensepeer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not. http://dx.doi.org/10.1101/326470doi: bioRxiv preprint first posted online May. 18, 2018;
... All datasets used in our application are publicly available. Gene expression data for 12,747 patient tumors from >40 cancer types were downloaded from the Tumor Compendium v11 Public PolyA, a database maintained by The Treehouse Childhood Cancer Initiative at the UCSC Genomics Institute [57]. These patient tumors include samples from Treehouse clinical sites, the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) project, and The Cancer Genome Atlas (TCGA) [14]. ...
... [17]. The gene expression data for both patient tumors and cell lines were pre-processed using established pipelines, as described in [17,57]. Additionally, the data scaling pipeline outlined in Section 2.3.3 was applied due to the large difference in sample size between patients and cell lines. ...
Preprint
The pursuit of precision oncology heavily relies on large-scale genomic and pharmacological data garnered from preclinical cancer model systems such as cell lines. While cell lines are instrumental in understanding the interplay between genomic programs and drug response, it well-established that they are not fully representative of patient tumors. Development of integrative methods that can systematically assess the commonalities between patient tumors and cell-lines can help bridge this gap. To this end, we introduce the Integrative Principal Component Regression (iPCR) model which uncovers both joint and model-specific structured variations in the genomic data of cell lines and patient tumors through matrix decompositions. The extracted joint variation is then used to predict patient drug responses based on the pharmacological data from preclinical models. Moreover, the interpretability of our model allows for the identification of key driver genes and pathways associated with the treatment-specific response in patients across multiple cancers. We demonstrate that the outputs of the iPCR model can assist in inferring both model-specific and shared co-expression networks between cell lines and patients. We show that iPCR performs favorably compared to competing approaches in predicting patient drug responses, in both simulation studies and real-world applications, in addition to identifying key genomic drivers of cancer drug responses.
... cBioPortal represents vast and comprehensive database that is significant for exploring cancer genomics through visualization, download and analysis of large cancer datasets (Gao et al., 2013). Another integrative, multiomics database UCSC Xena Browser (Goldman et al., 2018), provides a large resource for multiomics data analysis as well as clinical and phenotypes datasets. UK Biobank is a large-scale biomedical database accessible to global researchers undertaking vital research into life-threatening diseases which includes in-depth genetic and health information from around half a million UK participants (Sudlow et al., 2015). ...
... Unfortunately, this source of clinical data did not include reliable information on the extent of resection for these TCGA patients. Gene expression data from these TCGA patients for the four genes used to calculate the GPI were obtained from the University of California, Santa Clara (UCSC) Xena data portal [52], from an experiment conducted by the University of North Carolina TCGA genomic characterization center with an Agilent 244K custom gene expression microarray (G4502A_07_2). To make the two datasets as similar as possible, we excluded patients less than 18 years of age, those with IDH1 mutations, or those with an unknown MGMT promotor methylation status. ...
Article
Full-text available
The aim of this study was to investigate gene expression alterations associated with overall survival (OS) in glioblastoma (GBM). Using the Nanostring nCounter platform, we identified four genes (COL1A2, IGFBP3, NGFR, and WIF1) that achieved statistical significance when comparing GBM with non-neoplastic brain tissue. The four genes were included in a multivariate Cox Proportional Hazard model, along with age, extent of resection, and O6-methylguanine-DNA methyltransferase (MGMT) promotor methylation, to create a unique glioblastoma prognostic index (GPI). The GPI score inversely correlated with survival: patient with a high GPI had a median OS of 7.5 months (18-month OS = 9.7%) whereas patients with a low GPI had a median OS of 20.1 months (18-month OS = 54.5%; log rank p-value = 0.004). The GPI score was then validated in 188 GBM patients from The Cancer Genome Atlas (TCGA) from a national data base; similarly, patients with a high GPI had a median OS of 10.5 months (18-month OS = 12.4%) versus 16.9 months (18-month OS = 41.5%) for low GPI (log rank p-value = 0.0003). We conclude that this novel mRNA-based prognostic index could be useful in classifying GBM patients into risk groups and refine prognosis estimates to better inform treatment decisions or stratification into clinical trials.
... Expected read count by RSEM was normalized as variance stabilizing transformation by DESeq2, and subsequently used for the calculation of risk score. Clinical information was retrieved from phenotype data of GDC TCGA stomach cancer cohort in UCSC Xena [72]. Survival data, the information of molecular subtypes, and genetic mutation data were retrieved from TCGA PanCancer data [73]. ...
Article
Full-text available
Background Molecular analysis of advanced tumors can increase tumor heterogeneity and selection bias. We developed a robust prognostic signature for gastric cancer by comparing RNA expression between very rare early gastric cancers invading only mucosal layer (mEGCs) with lymph node metastasis (Npos) and those without metastasis (Nneg). Methods Out of 1003 mEGCs, all Npos were matched to Nneg using propensity scores. Machine learning approach comparing Npos and Nneg was used to develop prognostic signature. The function and robustness of prognostic signature was validated using cell lines and external datasets. Results Extensive machine learning with cross-validation identified the prognostic classifier consisting of four overexpressed genes (HDAC5, NPM1, DTX3, and PPP3R1) and two downregulated genes (MED12 and TP53), and enabled us to develop the risk score predicting poor prognosis. Cell lines engineered to high-risk score showed increased invasion, migration, and resistance to 5-FU and Oxaliplatin but maintained sensitivity to an HDAC inhibitor. Mouse models after tail vein injection of cell lines with high-risk score revealed increased metastasis. In three external cohorts, our risk score was identified as the independent prognostic factor for overall and recurrence-free survival. Conclusion The risk score from the 6-gene classifier can successfully predict the prognosis of gastric cancer.
Article
Feature screening is an important tool in analysing ultrahigh‐dimensional data, particularly in the field of Omics and oncology studies. However, most attention has been focused on identifying features that have a linear or monotonic impact on the response variable. Detecting a sparse set of variables that have a nonlinear or nonmonotonic relationship with the response variable is still a challenging task. To fill the gap, this paper proposed a robust model‐free screening approach for right‐censored survival data by providing a new perspective of quantifying the covariate effect on the restricted mean survival time, rather than the routinely used hazard function. The proposed measure, based on the difference between the restricted mean survival time of covariate‐stratified and overall data, is able to identify comprehensive types of associations including linear, nonlinear, nonmonotone and even local dependencies like change points. The sure screening property is established, and a more flexible iterative screening procedure is developed to increase the accuracy of the variable screening. Simulation studies are carried out to demonstrate the superiority of the proposed method in selecting important features with a complex association with the response variable. The potential of applying the proposed method to handle interval‐censored failure time data has also been explored in simulations, and the results have been promising. The method is applied to a breast cancer dataset to identify potential prognostic factors, which reveals potential associations between breast cancer and lymphoma.
Article
Full-text available
Recent advancements in deep learning (DL) have played a crucial role in aiding experts to develop personalized healthcare services, particularly in drug response prediction (DRP) for cancer patients. The DL’s techniques contribution to this field is significant, and they have proven indispensable in the medical field. This review aims to analyze the diverse effectiveness of various DL models in making these predictions, drawing on research published from 2017 to 2023. We utilized the VOS-Viewer 1.6.18 software to create a word cloud from the titles and abstracts of the selected studies. This study offers insights into the focus areas within DL models used for drug response. The word cloud revealed a strong link between certain keywords and grouped themes, highlighting terms such as deep learning, machine learning, precision medicine, precision oncology, drug response prediction, and personalized medicine. In order to achieve an advance in DRP using DL, the researchers need to work on enhancing the models’ generalizability and interoperability. It is also crucial to develop models that not only accurately represent various architectures but also simplify these architectures, balancing the complexity with the predictive capabilities. In the future, researchers should try to combine methods that make DL models easier to understand; this will make DRP reviews more open and help doctors trust the decisions made by DL models in cancer DRP.
Chapter
A small number of the cell’s genes have expression levels that directly affect the functional or regulatory functions of the cell. Gene expression time series (GETS) keep track of each gene activity, which reveals underlying cellular dynamics. High-throughput GETS (HTGETS) investigations need the grouping of genes according to their temporal expression patterns, frequently done using unsupervised machine learning approaches. Nevertheless, most clustering methods either need to improve their ability to consider the temporal structure of the data or need to be improved by the short duration of time series for gene expression (GE). The innovative machine learning (ML)-based architecture known as boosting with similarity matrix (B-SM) provided can address these problems for grouping GETS and related difficulties. B-SM originally visualizes time series data to provide more detailed data interpretations. The produced pictures are then subjected to deep neural clustering. Studies using biological and the advantages of this innovative approach over traditional clustering techniques are shown via synthetic datasets. Additionally, we use an enrichment approach to show that the B-SM clusters are biologically plausible.
Article
Full-text available
Radiotherapy (RT) is considered immunogenic, but clinical data demonstrating RT-induced T cell priming are scarce. Here, we show in a mouse tumor model representative of human lymphocyte-depleted cancer that RT enhanced spontaneous priming of thymus-derived (FOXP3+Helios+) Tregs by the tumor. These Tregs acquired an effector phenotype, populated the tumor, and impeded tumor control by a simultaneous, RT-induced CD8+ cytotoxic T cell (CTL) response. Combination of RT with CTLA-4 or PD-1 blockade, which enables CD28 costimulation, further increased this Treg response and failed to improve tumor control. We discovered that upon RT, the CD28 ligands CD86 and CD80 differentially affected the Treg response. CD86, but not CD80, blockade prevented the effector Treg response, enriched the tumor-draining lymph node migratory conventional DCs that were positive for PD-L1 and CD80 (PD-L1+CD80+), and promoted CTL priming. Blockade of CD86 alone or in combination with PD-1 enhanced intratumoral CTL accumulation, and the combination significantly increased RT-induced tumor regression and OS. We advise that combining RT with PD-1 and/or CTLA-4 blockade may be counterproductive in lymphocyte-depleted cancers, since these interventions drive Treg responses in this context. However, combining RT with CD86 blockade may promote the control of such tumors by enabling a CTL response.
Article
We aimed to identify an effective metabolic subtype and risk score to predict survival and immunotherapy response in head and neck squamous cell carcinoma (HNSCC). Data were obtained from an online database. We screened significant prognostic metabolism-related genes between the normal and tumor groups using a series of bioinformatics methods. Based on the selected prognostic genes, we conducted a subtype analysis to identify significantly different subtypes in HNSCC. We then investigated survival, immune features, and hallmark differences among different subtypes. LASSO was utilized to identify optimal genes for the risk score model construction. Finally, distribution of the risk score samples was analyzed for different subtypes. A total of 32 significantly prognostic metabolism-related genes were screened, and all samples were grouped into two subtypes: cluster 1 and cluster 2. Cluster 1 had worse survival. Different immune cell infiltration (CD8 T cells, macrophages, and regulatory T cells) and immune checkpoint gene expression (PD-1 and CLAT-4) were observed between the two clusters. Twelve optimal genes were involved in risk score model, and high-risk group had poorer survival. Cluster 1 contained more high-risk samples (60%). Finally, four genes CAV1, GGT6, PYGL, and HS3ST1 were identified as significantly related to immune cells, and these genes were differentially expressed in the normal oral epithelial cells and HNSCC cells. The subtypes and risk score model in the study provide a promising biomarker for prognosis and immunotherapy response.
Chapter
The tumor cell population in cancer tissue has distinct molecular characteristics and exhibits different phenotypes, thus, resulting in different subpopulations. This phenomenon is known as Intratumor Heterogeneity (ITH), a major contributor to drug resistance, poor prognosis, etc. Therefore, quantifying the levels of ITH in cancer patients is essential, and many algorithms do so in different ways, using different types of omics data. DEPTH2 algorithm utilizes transcriptomic data to assess ITH scores and exhibits promising performance. However, it quantifies ITH using all genes, limiting the identification of ITH-related prognostic genes. We hypothesize that a subset of key genes is sufficient to quantify the ITH level, and this subset of key genes could be ITH-related prognostic genes. To prove our hypothesis, we propose an unsupervised deep learning-based framework using Concrete Autoencoder (CAE) to select a subset of cancer-specific key genes for ITH evaluation. For the experiment, we used gene expression profile data of breast, kidney, and lung cancer tumor cohorts from the TCGA repository. Multi-run CAE identified three sets of key genes for each cancer cohort. Comparing ITH scores derived from all genes and CAE-selected key genes showed similar prognostic outcomes. Subtypes of lung cancer displayed consistent ITH distributions for both gene sets. Based on these observations, it can be concluded that a subset of key genes, instead of all, is sufficient for ITH quantification. Our results also showed that many key genes are prognostically significant and can be used as therapeutic targets.
Article
Cancer is driven by genetic change, and the advent of massively parallel sequencing has enabled systematic documentation of this variation at the whole-genome scale1,2,3. Here we report the integrative analysis of 2,658 whole-cancer genomes and their matching normal tissues across 38 tumour types from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). We describe the generation of the PCAWG resource, facilitated by international data sharing using compute clouds. On average, cancer genomes contained 4–5 driver mutations when combining coding and non-coding genomic elements; however, in around 5% of cases no drivers were identified, suggesting that cancer driver discovery is not yet complete. Chromothripsis, in which many clustered structural variants arise in a single catastrophic event, is frequently an early event in tumour evolution; in acral melanoma, for example, these events precede most somatic point mutations and affect several cancer-associated genes simultaneously. Cancers with abnormal telomere maintenance often originate from tissues with low replicative activity and show several mechanisms of preventing telomere attrition to critical levels. Common and rare germline variants affect patterns of somatic mutation, including point mutations, structural variants and somatic retrotransposition. A collection of papers from the PCAWG Consortium describes non-coding mutations that drive cancer beyond those in the TERT promoter⁴; identifies new signatures of mutational processes that cause base substitutions, small insertions and deletions and structural variation5,6; analyses timings and patterns of tumour evolution⁷; describes the diverse transcriptional consequences of somatic mutation on splicing, expression levels, fusion genes and promoter activity8,9; and evaluates a range of more-specialized features of cancer genomes8,10,11,12,13,14,15,16,17,18.
Article
We conducted comprehensive integrative molecular analyses of the complete set of tumors in The Cancer Genome Atlas (TCGA), consisting of approximately 10,000 specimens and representing 33 types of cancer. We performed molecular clustering using data on chromosome-arm-level aneuploidy, DNA hypermethylation, mRNA, and miRNA expression levels and reverse-phase protein arrays, of which all, except for aneuploidy, revealed clustering primarily organized by histology, tissue type, or anatomic origin. The influence of cell type was evident in DNA-methylation-based clustering, even after excluding sites with known preexisting tissue-type-specific methylation. Integrative clustering further emphasized the dominant role of cell-of-origin patterns. Molecular similarities among histologically or anatomically related cancer types provide a basis for focused pan-cancer analyses, such as pan-gastrointestinal, pan-gynecological, pan-kidney, and pan-squamous cancers, and those related by stemness features, which in turn may inform strategies for future therapeutic development.
Article
Genomic landscape of childhood cancers. The genetic alterations that give rise to childhood cancer are less well studied than those that give rise to adult cancers. Two papers in this issue report some of the first pan-cancer analyses of childhood cancers. Stefan Pfister and colleagues studied germline and somatic genomes from 914 young cancer patients, including children, adolescents and young adults. The tumour samples comprised 24 distinct molecular cancer types, including the most frequent and clinically relevant childhood cancers. The team characterized somatic mutation frequencies, genomic alterations, including structural variations and copy-number analysis, and mutational signatures. They found signatures associated with deficiencies of double-stranded break repair across all cancer types. Additionally, 7.6% of patients carried a likely pathogenic germline variant in a candidate cancer predisposition gene. Jinghui Zhang and colleagues analysed the genomes, exomes and transcriptomes of 1,699 paediatric leukaemias and solid tumours. They identified 142 driver genes in paediatric cancers, over half of which were specific to a single histotype. They also characterized copy number alterations and structural variation and identified 11 mutational signatures. Together, these papers provide a comprehensive resource for genomic alterations across common paediatric tumours, and highlight differences compared with the genomic alterations seen in adult cancers.
Article
Next-generation sequencing has made major strides in the past decade. Studies based on large sequencing data sets are growing in number, and public archives for raw sequencing data have been doubling in size every 18 months. Leveraging these data requires researchers to use large-scale computational resources. Cloud computing, a model whereby users rent computers and storage from large data centres, is a solution that is gaining traction in genomics research. Here, we describe how cloud computing is used in genomics for research and large-scale collaborations, and argue that its elasticity, reproducibility and privacy features make it ideally suited for the large-scale reanalysis of publicly available archived data, including privacy-protected data.
Article
Methodological breakthroughs over the past four decades have repeatedly revolutionized transcriptome profiling. Using RNA sequencing (RNA-seq), it has now become possible to sequence and quantify the transcriptional outputs of individual cells or thousands of samples. These transcriptomes provide a link between cellular phenotypes and their molecular underpinnings, such as mutations. In the context of cancer, this link represents an opportunity to dissect the complexity and heterogeneity of tumours and to discover new biomarkers or therapeutic strategies. Here, we review the rationale, methodology and translational impact of transcriptome profiling in cancer.
Article
Vast amounts of molecular data are being collected on tumor samples, which provide unique opportunities for discovering trends within and between cancer subtypes. Such cross-cancer analyses require computational methods that enable intuitive and interactive browsing of thousands of samples based on their molecular similarity. We created a portal called TumorMap to assist in exploration and statistical interrogation of high-dimensional complex “omics” data in an interactive and easily interpretable way. In the TumorMap, samples are arranged on a hexagonal grid based on their similarity to one another in the original genomic space and are rendered with Google's Map technology. While the important feature of this public portal is the ability for the users to build maps from their own data, we pre-built genomic maps from several previously published projects. We demonstrate the utility of this portal by presenting results obtained from The Cancer Genome Atlas project data. Cancer Res; 77(21); e111–4. ©2017 AACR.
Article
The National Cancer Institute Genomic Data Commons (GDC) is an information system to store, analyze and share genomic and clinical data from patients with cancer. The recent high-throughput sequencing of cancer genomes and transcriptomes has produced a "big data" problem that precludes many cancer biologists and oncologists from gleaning knowledge from these data regarding the nature of the malignant process and the relationship between tumor genomic profiles and treatment response. The GDC aims to democratize access to cancer genomic data and to foster the sharing of these data to promote precision medicine approaches to the diagnosis and treatment of cancer.
Article
The Genomic Data Commons will initially house raw genomic data and diagnostic, histologic, and clinical outcome data from National Cancer Institute–funded projects. A harmonization process will align sequencing data to the genome and identify mutations and alterations.