Rgb: A scriptable genome browser for R

April 2014
Bioinformatics 30(15)

DOI:10.1093/bioinformatics/btu185

Authors:

Sylvain Mareschal

Karolinska Institutet

Sydney Dubois

Centre Henri Becquerel

Thierry Lecroq

Université de Rouen Normandie

Fabrice Jardin

Université de Rouen

Thanks to its free licensing and the development of initiatives like Bioconductor, R has become an essential part of the bioinformatics toolbox in the past years and is more and more confronted with genomically located data. While separate solutions are available to manipulate and visualize such data, no R package currently offers the efficiency required for computationally intensive tasks such as interactive genome browsing. The package proposed here fulfills this specific need, providing a multilevel interface suitable for most needs, from a completely interfaced genome browser to low-level classes and methods. Its time and memory efficiency have been challenged in a human dataset, where it outperformed existing solutions by several orders of magnitude. R sources and packages are freely available at the CRAN repository and dedicated Web site: http://bioinformatics.ovsa.fr/Rgb. Distributed under the GPL 3 license, compatible with most operating systems (Windows, Linux, Mac OS) and architectures. maressyl@gmail.com or fabrice.jardin@chb.unicancer.fr SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

2014, pages 1–2

BIOINFORMATICS APPLICATIONS NOTE doi:10.1093/bioinformatics/btu185

Genome analysis Advance Access publication April 9, 2014

Rgb: a scriptable genome browser for R

Sylvain Mareschal 1,2,3,*, Sydney Dubois 1,2,3, Thierry Lecroq 2,3,4 and Fabrice Jardin 1,2,3

1 Centre Henri Becquerel, INSERM UMR 918, 76038 Rouen Cedex 1, France

2 Normandy University, University of Rouen, 76821 Mont-Saint-Aignan, France

3 Institute for Research and Innovation in Biomedicine (IRIB), Haute-Normandie, 76183 Rouen Cedex, France

4 LITIS, INSA EA 4108, 76801 Saint-Etienne-du-Rouvray, France

Associate Editor: Inanc Birol

ABSTRACT

Summary: Thanks to its free licensing and the development of initiatives like Bioconductor, R has become an essential part of the bioinformatics toolbox in the past years and is more and more confronted with genomically located data. While separate solutions are available to manipulate and visualize such data, no R package currently offers the efficiency required for computationally intensive tasks such as interactive genome browsing. The package proposed here fulfills this specific need, providing a multilevel interface suitable for most needs, from a completely interfaced genome browser to low-level classes and methods. Its time and memory efficiency have been challenged in a human dataset, where it outperformed existing solutions by several orders of magnitude.

Availability and implementation: R sources and packages are freely available at the CRAN repository and dedicated Web site: http://bioinformatics.ovsa.fr/Rgb. Distributed under the GPL 3 license, compatible with most operating systems (Windows, Linux, Mac OS) and architectures.

Contact: maressyl@gmail.com or fabrice.jardin@chb.unicancer.fr

Supplementary information: Supplementary data are available at Bioinformatics online.

Received on December 20, 2013; revised on March 3, 2014; accepted on April 2, 2014

1 INTRODUCTION

The growing demand from the biology community for statistically robust approaches has made the R statistics-oriented scripting language an essential part of the bioinformatics toolbox. Its graphical capabilities make it a valuable tool to produce publication-grade complex figures, whereas its computational efficiency allows it to handle huge datasets, as currently required in fields like transcriptomics or next-generation sequencing. These qualities come with open-source licensing and various operating system ports that make it available virtually everywhere. Thanks to the Bioconductor initiative (Gentleman et al., 2004), a large amount of software is freely available as R packages for tasks as diverse as microarray processing (Smyth, 2005; Venkatraman and Olshen, 2007), feature annotation (Zhang et al., 2003) or sequence analysis (Anders and Huber, 2010).

Much of this software generates genomic data, i.e. lists of chromosome regions defined by starting and ending coordinates. Such data are usually subset using chromosomal coordinates rather than row indexes, a paradigm R was not developed to deal with. Bioconductor historically addressed this issue with the RangedData class from the IRanges package, handling genomic regions as ranges of integers (base positions). Its flexibility and efficiency were extended a few years later by the GenomicRanges package, making direct use of IRanges components for the subsetting. For visualization, Bioconductor provides two solutions: rtracklayer and GViz. The former sends the data to be visualized to the UCSC web genome browser (Kent et al., 2002), a model that implies frequent round trips between programs and a consequent network burden. The latter is more integrated within Bioconductor and produces static graphics from the classes described above, with delays incompatible with user interactivity and intensive computing.
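To make the difference between index-based and coordinate-based subsetting concrete, the following is a minimal base R sketch. It uses neither Rgb nor Bioconductor: the toy table, its column names and the `slice_by_coordinates` helper are assumptions introduced purely for illustration.

```r
# Toy annotation table; chromosome names and coordinates are made up.
regions <- data.frame(
  chrom = c("1", "1", "1", "2", "2"),
  start = c(100L, 500L, 900L, 150L, 700L),
  end   = c(300L, 650L, 1200L, 400L, 950L),
  name  = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# Row-index subsetting: what R data structures support natively.
by_index <- regions[2:3, ]

# Coordinate subsetting (hypothetical helper): keep every region that
# overlaps chrom:from-to. Two closed intervals overlap when each one
# starts before the other ends.
slice_by_coordinates <- function(tab, chrom, from, to) {
  keep <- tab$chrom == chrom & tab$start <= to & tab$end >= from
  tab[keep, , drop = FALSE]
}

hits <- slice_by_coordinates(regions, chrom = "1", from = 600L, to = 1000L)
print(hits$name)
```

This naive filter scans every row on each call, which is the kind of cost that dedicated range structures are designed to avoid.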

The package described here reconciles these two aspects in a coherent way, thus offering an interactive interface responsive enough for comfortable browsing and atomic operations suitable for computer-intensive algorithms.

2 IMPLEMENTATION

Rgb is implemented as an R package, providing various classes and methods for scripts. It makes use of the Reference class system, which offers an object-oriented framework similar to what can be found in other languages such as Java or C++. It is flexible enough to suit the needs of the four categories of R users:

R beginners will find in Rgb a complete graphical user interface (GUI) allowing them to convert and visualize genomic data without any command line. Stand-alone builds of Rgb can even make the R dependency totally transparent.

Script writers, needing an exclusive Command Line Interface to automate analysis, will find in Rgb classes and methods able to handle their genomic data and produce high-quality graphics from them.

Console users, who use R as an exploratory tool without a special need for reproducibility, will find Rgb’s ability to mix the two interfaces particularly time saving.

Package developers may be interested in extending classes to handle new data storage modes or representations, a process greatly facilitated by the object-oriented design of Rgb.
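As an illustration of the Reference class system mentioned above, here is a minimal sketch using only the `methods` package shipped with R. The `ToyTrack` and `ToyNamedTrack` classes, their fields and their methods are hypothetical names chosen for this example; they are not part of Rgb and do not reflect its actual class hierarchy.

```r
library(methods)

# Hypothetical class, for illustration only; it is not an Rgb class.
ToyTrack <- setRefClass(
  "ToyTrack",
  fields = list(name = "character", regions = "data.frame"),
  methods = list(
    # Return the regions overlapping chrom:from-to.
    slice = function(chrom, from, to) {
      keep <- regions$chrom == chrom & regions$start <= to & regions$end >= from
      regions[keep, , drop = FALSE]
    },
    # Modify the object in place (reference semantics, no copy is returned).
    addRegion = function(chrom, start, end) {
      regions <<- rbind(
        regions,
        data.frame(chrom = chrom, start = start, end = end,
                   stringsAsFactors = FALSE)
      )
      invisible(.self)
    }
  )
)

# Extending a class: the subclass inherits fields and methods.
ToyNamedTrack <- setRefClass(
  "ToyNamedTrack",
  contains = "ToyTrack",
  methods = list(
    describe = function() paste0(name, ": ", nrow(regions), " regions")
  )
)

track <- ToyNamedTrack$new(
  name = "toy",
  regions = data.frame(chrom = c("1", "2"), start = c(100L, 150L),
                       end = c(300L, 400L), stringsAsFactors = FALSE)
)
alias <- track                     # same object, not a copy
alias$addRegion("1", 250L, 600L)   # the change is visible through 'track'
print(track$describe())
print(nrow(track$slice("1", 200L, 260L)))
```

Unlike ordinary R objects, which are copied on modification, a Reference class object is shared between all variables pointing to it, which is what makes the in-place update above visible through both `track` and `alias`.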

The core of the subsetting system consists of a collection of C functions making direct use of R libraries, and the graphical interface relies on Tcl/Tk. Both of them are natively managed by R, and have been successfully compiled and tested on Windows and Linux operating systems. BAM file support can be added by the Rsamtools package, which is part of Bioconductor.

*To whom correspondence should be addressed.

Bioinformatics Advance Access published April 21, 2014

3 PERFORMANCES

To assess the suitability of Rgb’s track.table class for computer-intensive tasks such as responsive genome browsing or overlap computation, its time and maximal memory consumption on usual atomic tasks were compared with existing solutions: standard R data.frame, the more efficient data.table, IRanges’ RangedData, GenomicRanges’ GRanges and GViz’s AnnotationTrack. Three tasks were monitored as the most common atomic operations on such data: genomic extraction by chromosomal coordinates, and extraction and modification of consecutive rows by indexes. The 291 128 exons recorded in the CCDS database (Pruitt et al., 2009) for the human genome were used as a common dataset. Benchmarking was performed on a mid-range desktop computer (3.1 GHz Intel i3-2100 processor with 8 GB RAM) running R 3.0.2 in a 64-bit Fedora 18 distribution. Each measure was made with R functions proc.time and gc in a fresh session, to normalize garbage collection effects (scripts and dataset available as Supplementary Data).
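A measurement of this kind can be sketched with proc.time and gc as follows. This is not the authors' benchmarking script (which is provided as Supplementary Data): the `measure` helper and the toy table are assumptions made for illustration, and the memory counters are reset within a single session rather than by starting a fresh one.

```r
# Hypothetical helper: time one task and report its peak memory usage.
measure <- function(task) {
  invisible(gc(reset = TRUE))    # reset the "max used" counters
  t0 <- proc.time()
  task()                         # run the task under test
  elapsed <- (proc.time() - t0)[["elapsed"]]
  mem <- gc()                    # last column holds "max used" in Mb
  c(seconds = elapsed, peak_mb = sum(mem[, ncol(mem)]))
}

# Toy stand-in for a genomic table (made-up coordinates).
n <- 100000L
tab <- data.frame(start = seq_len(n) * 10L)
tab$end <- tab$start + 5L

# Time a naive coordinate extraction on the toy table.
res <- measure(function() tab[tab$start <= 5000L & tab$end >= 4000L, ])
print(res)
```

Summing the last column of the gc output adds the peak usage of both memory pools that R reports (Ncells and Vcells), giving a single figure per task.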

With the genomic extraction task (Fig. 1), GViz and RangedData proved unsuitable for intensive tasks, with computing times approaching one second for a single extraction (typical genome browser representations contain several tracks to be subset) and excessive memory usage (hundreds of MiB for a dataset of 30 MiB). GRanges and generic R solutions perform better, but are still outperformed by Rgb by a factor of 50–1500 on small genomic extractions.

Similar results can be observed with the modification task, with Rgb outperforming R generic classes by a factor of 100 in time consumption and 10 in memory usage. With the last task, extraction of consecutive rows by indexes, Rgb still outperforms the bioinformatic solutions but is overtaken by generic solutions such as data.frame and data.table. However, as this kind of extraction is far less relevant in a genomic context than extraction by coordinates, this average performance is acceptable.

4 CONCLUSION

Rgb provides several entry points to a consistent genome browsing system, which may prove equally useful to users looking for an interactive genome browser able to handle their R datasets and to developers needing efficient genomic subsetting capabilities for their scripts. Its object-oriented paradigm and open-source licensing make it easily extendable, and its performance relative to existing solutions has been demonstrated on a real genomic dataset.

ACKNOWLEDGEMENTS

The authors thank Christian Bastard for his helpful comments and feedback during Rgb development and the many CGH-array datasets he provided for testing.

Funding: This work was supported by a PhD grant from the Région Haute-Normandie (France).

Conflict of Interest: none declared.

REFERENCES

Anders,S. and Huber,W. (2010) Differential expression analysis for sequence count data. Genome Biol., 11, R106.

Gentleman,R.C. et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol., 5, R80.

Kent,W.J. et al. (2002) The human genome browser at UCSC. Genome Res., 12, 996–1006.

Pruitt,K.D. et al. (2009) The consensus coding sequence (CCDS) project: identifying a common protein-coding gene set for the human and mouse genomes. Genome Res., 19, 1316–1323.

Smyth,G.K. (2005) limma: linear models for microarray data. In: Gentleman,R. et al. (ed.) Bioinformatics and Computational Biology Solutions Using R and Bioconductor, Statistics for Biology and Health. Springer, New York, NY, pp. 397–420.

Venkatraman,E.S. and Olshen,A.B. (2007) A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics, 23, 657–663.

Zhang,J. et al. (2003) An extensible application for assembling annotation for genomic data. Bioinformatics, 19, 155.

Fig. 1. Performance comparison of Rgb’s track.table class with existing software. Three atomic tasks were benchmarked on a common dataset, in terms of maximal memory usage and computing time, as recorded by R
