2014, pages 1–2
BIOINFORMATICS APPLICATIONS NOTE doi:10.1093/bioinformatics/btu185
Genome analysis Advance Access publication April 9, 2014
Rgb: a scriptable genome browser for R
Sylvain Mareschal (1,2,3,*), Sydney Dubois (1,2,3), Thierry Lecroq (2,3,4) and Fabrice Jardin (1,2,3,*)
(1) Centre Henri Becquerel, INSERM UMR 918, 76038 Rouen Cedex 1, France
(2) Normandy University, University of Rouen, 76821 Mont-Saint-Aignan, France
(3) Institute for Research and Innovation in Biomedicine (IRIB), Haute-Normandie, 76183 Rouen Cedex, France
(4) LITIS, INSA EA 4108, 76801 Saint-Etienne-du-Rouvray, France
Associate Editor: Inanc Birol
ABSTRACT
Summary: Thanks to its free licensing and the development of initiatives like Bioconductor, R has become an essential part of the bioinformatics toolbox in the past years and is more and more confronted with genomically located data. While separate solutions are available to manipulate and visualize such data, no R package currently offers the efficiency required for computationally intensive tasks such as interactive genome browsing. The package proposed here fulfills this specific need, providing a multilevel interface suitable for most needs, from a completely interfaced genome browser to low-level classes and methods. Its time and memory efficiency have been challenged in a human dataset, where it outperformed existing solutions by several orders of magnitude.
Availability and implementation: R sources and packages are freely available at the CRAN repository and dedicated Web site: http://bioinformatics.ovsa.fr/Rgb. Distributed under the GPL 3 license, compatible with most operating systems (Windows, Linux, Mac OS) and architectures.
Contact: maressyl@gmail.com or fabrice.jardin@chb.unicancer.fr
Supplementary information: Supplementary data are available at Bioinformatics online.
Received on December 20, 2013; revised on March 3, 2014; accepted on April 2, 2014
1 INTRODUCTION
The growing demand from the biology community for statistically robust approaches has made the R statistics-oriented scripting language an essential part of the bioinformatics toolbox. Its graphical capabilities make it a valuable tool to produce publication-grade complex figures, whereas its computational efficiency allows it to handle huge datasets, as currently required in fields like transcriptomics or next-generation sequencing. These qualities come with open-source licensing and various operating system ports that make it available virtually everywhere. Thanks to the Bioconductor initiative (Gentleman et al., 2004), a large amount of software is freely available as R packages for tasks as diverse as microarray processing (Smyth, 2005; Venkatraman and Olshen, 2007), feature annotation (Zhang et al., 2003) or sequence analysis (Anders and Huber, 2010).
Much of this software generates genomic data, i.e. lists of chromosome regions defined by starting and ending coordinates. Such data are usually subset using chromosomal coordinates rather than row indexes, a paradigm R was not developed to deal with. Bioconductor historically addressed this issue with the RangedData class from the IRanges package, handling genomic regions as ranges of integers (base positions). Its flexibility and efficiency were extended a few years later by the GenomicRanges package, making direct use of IRanges components for the subsetting. For visualization, Bioconductor provides two solutions: rtracklayer and GViz. The former sends the data to be visualized to the UCSC web genome browser (Kent et al., 2002), a model that implies frequent round trips between programs and a consequent network burden. The latter is more integrated within Bioconductor and produces static graphics from the classes described above, but with delays incompatible with user interactivity and intensive computing.
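To make the distinction between row-index and coordinate-based subsetting concrete, here is a minimal sketch in base R. The table, its column names and the helper `extract_region` are hypothetical illustrations, not part of Rgb, IRanges or GenomicRanges; the sketch only assumes that two closed intervals overlap when each one starts before the other ends.

```r
# Toy annotation table: one row per genomic feature (hypothetical data)
features <- data.frame(
  chrom = c("1", "1", "1", "2"),
  start = c(100L, 500L, 900L, 150L),
  end   = c(300L, 700L, 1200L, 400L),
  name  = c("exonA", "exonB", "exonC", "exonD"),
  stringsAsFactors = FALSE
)

# Row-index subsetting, the paradigm R natively supports
by_index <- features[2:3, ]

# Coordinate-based subsetting: keep the features overlapping chrom:from-to.
# Two closed intervals overlap when each one starts before the other ends.
extract_region <- function(tab, chrom, from, to) {
  keep <- tab$chrom == chrom & tab$start <= to & tab$end >= from
  tab[keep, , drop = FALSE]
}

hits <- extract_region(features, "1", 250L, 600L)
print(hits$name)
```

This naive version scans every row for each query, which is the kind of cost that becomes noticeable when many tracks must be subset at each browsing step.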
The package described here reconciles these two aspects in a coherent way, thus offering an interactive interface responsive enough for comfortable browsing and atomic operations suitable for computer-intensive algorithms.
2 IMPLEMENTATION
Rgb is implemented as an R package, providing various classes and methods for scripts. It makes use of the Reference class system, which offers an object-oriented framework similar to what can be found in other languages such as Java or C++. It is flexible enough to suit the needs of four categories of R users:
• R beginners will find in Rgb a complete graphical user interface (GUI) allowing them to convert and visualize genomic data without any command line. Stand-alone builds of Rgb can even make the R dependency totally transparent.
• Script writers, needing an exclusive Command Line Interface to automate analysis, will find in Rgb classes and methods able to handle their genomic data and produce high-quality graphics from them.
• Console users, who use R as an exploratory tool without a special need for reproducibility, will find Rgb’s ability to mix the two interfaces particularly time saving.
• Package developers may be interested in extending classes to handle new data storage modes or representations, a process greatly facilitated by the object-oriented design of Rgb.
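As an illustration of what the Reference class system offers to package developers, the following sketch defines a small class and extends it. The class names `ToyTrack` and `ToyGeneTrack`, their fields and their methods are hypothetical and are not the classes shipped with Rgb; only `setRefClass` and `callSuper` from the standard `methods` package are assumed.

```r
library(methods)  # Reference classes are provided by the 'methods' package

# Hypothetical toy class, not part of Rgb
ToyTrack <- setRefClass(
  "ToyTrack",
  fields = list(name = "character", regions = "data.frame"),
  methods = list(
    slice = function(chrom, from, to) {
      "Return the regions overlapping chrom:from-to"
      keep <- regions$chrom == chrom & regions$start <= to & regions$end >= from
      regions[keep, , drop = FALSE]
    },
    addRegion = function(chrom, start, end) {
      "Append a region, modifying the object in place"
      regions <<- rbind(
        regions,
        data.frame(chrom = chrom, start = start, end = end, stringsAsFactors = FALSE)
      )
      invisible(.self)
    }
  )
)

# A developer can inherit from the class and override a method
ToyGeneTrack <- setRefClass(
  "ToyGeneTrack",
  contains = "ToyTrack",
  methods = list(
    slice = function(chrom, from, to) {
      "Same extraction as the parent class, returned sorted by start"
      out <- callSuper(chrom, from, to)
      out[order(out$start), , drop = FALSE]
    }
  )
)

track <- ToyGeneTrack$new(
  name = "genes",
  regions = data.frame(
    chrom = c("1", "1"), start = c(500L, 100L), end = c(700L, 300L),
    stringsAsFactors = FALSE
  )
)
track$addRegion("1", 350L, 450L)   # the object is modified by reference
res <- track$slice("1", 200L, 600L)
print(res$start)
```

Unlike ordinary R objects, Reference class objects are modified in place, which is why `addRegion` changes `track` without any reassignment.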
The core of the subsetting system consists of a collection of C functions making direct use of R libraries, and the graphical interface relies on Tcl/Tk. Both of them are natively managed by R, and have been successfully compiled and tested on Windows and Linux operating systems. BAM file support can be added by the Rsamtools package, which is part of Bioconductor.
*To whom correspondence should be addressed.
© The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
Bioinformatics Advance Access published April 21, 2014
3 PERFORMANCES
To assess the suitability of Rgb’s track.table class for computer-intensive tasks such as responsive genome browsing or overlap computation, its time and maximal memory consumption on usual atomic tasks were compared with those of existing solutions: the standard R data.frame, the more efficient data.table, IRanges’ RangedData, GenomicRanges’ GRanges and GViz’s AnnotationTrack. Three tasks were monitored as the most common atomic operations on such data: genomic extraction by chromosomal coordinates, extraction of consecutive rows by indexes and modification of consecutive rows by indexes. The 291 128 exons recorded in the CCDS database (Pruitt et al., 2009) for the human genome were used as a common dataset. Benchmarking was performed on a mid-range desktop computer (3.1 GHz Intel i3-2100 processor with 8 GB RAM) running R 3.0.2 in a 64-bit Fedora 18 distribution. Each measure was made with the R functions proc.time and gc in a fresh session, to normalize garbage collection effects (scripts and dataset available as Supplementary Data).
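The measurement principle can be sketched as follows, using only the base R functions proc.time and gc named above. This is not the benchmarking script distributed as Supplementary Data: the helper `measure`, the synthetic exon table and the chosen row ranges are assumptions made for illustration, and all three tasks run in a single session here rather than in a fresh session each.

```r
# Synthetic stand-in for an exon table (hypothetical data, not the CCDS set)
set.seed(1)
n <- 100000L
starts <- sort(sample.int(1e8, n))
exons <- data.frame(
  chrom = sample(c(1:22, "X", "Y"), n, replace = TRUE),
  start = starts,
  end   = starts + sample.int(2000L, n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: elapsed time and peak memory of one expression
measure <- function(expr) {
  invisible(gc(reset = TRUE))      # reset the "max used" counters
  t0 <- proc.time()
  value <- expr                    # the lazy argument is evaluated here
  elapsed <- (proc.time() - t0)[["elapsed"]]
  mem <- gc()
  # The "(Mb)" column immediately following "max used" holds the peak usage
  maxCol <- which(colnames(mem) == "max used") + 1L
  list(value = value, seconds = elapsed, peakMb = sum(mem[, maxCol]))
}

# Task 1: genomic extraction by chromosomal coordinates
t1 <- measure(exons[exons$chrom == "1" & exons$start <= 2e6 & exons$end >= 1e6, ])

# Task 2: extraction of consecutive rows by indexes
t2 <- measure(exons[5001:6000, ])

# Task 3: modification of consecutive rows by indexes
t3 <- measure({
  exons$end[5001:6000] <- exons$end[5001:6000] + 1L
  TRUE
})

for (task in list(t1, t2, t3)) {
  cat(sprintf("%.3f s, %.1f Mb peak\n", task$seconds, task$peakMb))
}
```

Calling `gc(reset = TRUE)` before each task is what makes the reported peak specific to that task; timings from a sketch like this depend on the machine and R version and are not comparable to the published figures.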
With the genomic extraction task (Fig. 1), GViz and RangedData proved unsuitable for intensive tasks, with computing times approaching one second for a single extraction (typical genome browser representations contain several tracks to be subset) and excessive memory usage (hundreds of MiB for a dataset of 30 MiB). GRanges and generic R solutions perform better, but are still outperformed by Rgb by a factor of 50–1500 on small genomic extractions.
Similar results can be observed with the modification task, with Rgb outperforming R generic classes by a factor of 100 in time consumption and 10 in memory usage. On the remaining task, extraction of consecutive rows by indexes, Rgb still outperforms the bioinformatic solutions but is overtaken by generic solutions such as data.frame and data.table. However, as this kind of extraction is far less justified in a genomic context than extraction by coordinates, this average performance is acceptable.
4 CONCLUSION
Rgb provides several entry points to a consistent genome browsing system, which may prove equally useful to users looking for an interactive genome browser able to handle their R datasets and to developers needing efficient genomic subsetting capabilities for their scripts. Its object-oriented paradigm and open-source licensing make it easily extensible, and its performance advantage over existing solutions has been demonstrated on a real genomic dataset.
ACKNOWLEDGEMENTS
The authors thank Christian Bastard for his helpful comments and feedback during Rgb development and the many CGH-array datasets he provided for testing.
Funding: This work was supported by a PhD grant from the Région Haute-Normandie (France).
Conflict of Interest: none declared.
REFERENCES
Anders,S. and Huber,W. (2010) Differential expression analysis for sequence count data. Genome Biol., 11, R106.
Gentleman,R.C. et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol., 5, R80.
Kent,W.J. et al. (2002) The human genome browser at UCSC. Genome Res., 12, 996–1006.
Pruitt,K.D. et al. (2009) The consensus coding sequence (CCDS) project: identifying a common protein-coding gene set for the human and mouse genomes. Genome Res., 19, 1316–1323.
Smyth,G.K. (2005) limma: linear models for microarray data. In: Gentleman,R. et al. (eds) Bioinformatics and Computational Biology Solutions Using R and Bioconductor, Statistics for Biology and Health. Springer, New York, NY, pp. 397–420.
Venkatraman,E.S. and Olshen,A.B. (2007) A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics, 23, 657–663.
Zhang,J. et al. (2003) An extensible application for assembling annotation for genomic data. Bioinformatics, 19, 155.
Fig. 1. Performance comparison of Rgb’s track.table class with existing software. Three atomic tasks were benchmarked on a common dataset, in terms of maximal memory usage and computing time, as recorded by R