FIG 1 - uploaded by Mikhail Roytberg
Three different ways of viewing an RNA sequence. (a) A schematic 2-dimensional description of an RNA folding. (b) A linear representation of the RNA. (c) The RNA as a rooted ordered tree.

Source publication
Article
Locality is an important and well-studied notion in comparative analysis of biological sequences. Similarly, taking into account affine gap penalties when calculating biological sequence alignments is a well-accepted technique for obtaining better alignments. When dealing with RNA, one has to take into consideration not only sequential features, bu...

Context in source publication

Context 1
... have been quite a few approaches for defining alignments in terms of RNAs. The first one is due to the seminal paper of Zhang and Shasha (1989), which represented RNA sequences as rooted ordered trees (Fig. 1) and defined editing operations on trees which correspond to editing operations on RNA sequences. In this way, an alignment of two RNA sequences corresponds to a sequence of editing operations on two corresponding trees, and any tree editing algorithm can be used to compute the optimal alignment of two RNAs. Furthermore, this approach ...
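
As a rough illustration of the tree view in panel (c), the sketch below builds a rooted ordered tree from an RNA sequence and its secondary structure. It assumes the structure is given in dot-bracket notation (an input format not mentioned in the source) and uses one common convention: each base pair becomes an internal node and each unpaired base becomes a leaf. The names `Node`, `rna_to_tree`, and `to_string` are hypothetical, and the encoding may differ in detail from the one used by Zhang and Shasha.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node of a rooted ordered tree; children keep left-to-right order."""
    label: str
    children: list["Node"] = field(default_factory=list)


def rna_to_tree(seq: str, structure: str) -> Node:
    """Build a tree from a sequence and a dot-bracket structure (assumed format).

    '(' opens a base pair, ')' closes it, '.' is an unpaired base.
    """
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    root = Node("root")          # artificial root that holds the whole forest
    stack = [root]
    for base, sym in zip(seq, structure):
        if sym == "(":
            node = Node(base)    # label is completed when the pair closes
            stack[-1].children.append(node)
            stack.append(node)
        elif sym == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')' in structure")
            node = stack.pop()
            node.label += base   # e.g. 'G' + 'C' -> 'GC'
        elif sym == ".":
            stack[-1].children.append(Node(base))
        else:
            raise ValueError(f"unexpected symbol {sym!r}")
    if len(stack) != 1:
        raise ValueError("unbalanced '(' in structure")
    return root


def to_string(node: Node) -> str:
    """Render the tree in a compact parenthesized form."""
    if not node.children:
        return node.label
    return node.label + "(" + " ".join(to_string(c) for c in node.children) + ")"


print(to_string(rna_to_tree("GGAAACC", "((...))")))
```

Under this convention, a stem of two base pairs enclosing a three-base loop becomes a chain of two internal nodes with three leaf children, so tree edit operations on these nodes map back to edits on paired and unpaired bases.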

Citations

... While the idea of representing a molecule as a graph is not novel [36][37][38], our approach is, as it takes into account intra-molecular contacts in such a way that the circuit relations between contacts are preserved. As a measure of similarity and dissimilarity, we propose the graph edit distance which is commonly used in bioinformatics and structural biology for comparing structures [39][40][41][42][43]. Once we have a metric space model for linear molecules, we can use a plethora of other tools, for example, compute persistent homology, extended persistence, etc. ...
... One of the metrics commonly used on the space of all (finite) graphs is the graph edit distance [39][40][41][42][43][48] and its variations. Intuitively speaking, graph edit distance defines the similarity of two graphs by the minimum amount of distortion which is needed to transform one graph into the other. ...
Article
Structure plays a pivotal role in determining the functional properties of self-interacting linear biomolecular chains, for example proteins and nucleic acids. In this paper, we propose a method for representing each such molecule combinatorially - as a one-dimensional simplicial complex - in a novel way that takes into account intra-chain contacts. The representation allows for efficient quantification of structural similarities and differences between molecules, and for studying molecular topology using extended persistence. This method performs a multi-scale analysis on a filtered simplicial complex as it tracks clusters, holes, and higher dimensional voids in the filtration. From extended persistence we extract information about the arrangement of intra-chain interactions, a topological property which demonstrably affects folding and unfolding dynamics of the linear chains.
... For this reason, gapped edit distance with nonlinear gap cost functions has been studied less than the classic edit distance, which can be viewed as gapped edit distance with linear gap costs (regardless of the choice of gap model). Rolf Backofen et al. [4] studied the application of edit distance with gaps in RNA comparison. S. Schirmer and R. ...
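
To make the contrast between linear and affine gap costs concrete, here is a minimal sketch for plain strings (not trees or RNA structures). It uses the standard three-table dynamic program usually attributed to Gotoh, swapped in here purely for illustration; it is not the method of any of the cited papers. The function name `affine_gap_distance` and the cost parameters are assumptions: a gap of length k costs `gap_open + k * gap_extend`, so `gap_open = 0` recovers linear gap costs and the classic edit distance.

```python
import math


def affine_gap_distance(a, b, mismatch=1, gap_open=2, gap_extend=1):
    """Minimum alignment cost of strings a and b under affine gap costs."""
    n, m = len(a), len(b)
    inf = math.inf
    # M: alignment ends with a[i-1] aligned to b[j-1]
    # X: alignment ends with a[i-1] aligned to a gap
    # Y: alignment ends with b[j-1] aligned to a gap
    M = [[inf] * (m + 1) for _ in range(n + 1)]
    X = [[inf] * (m + 1) for _ in range(n + 1)]
    Y = [[inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + i * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + j * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            # Extending an existing gap is free of the opening cost;
            # starting a new one from another state pays gap_open.
            X[i][j] = min(M[i - 1][j] + gap_open,
                          Y[i - 1][j] + gap_open,
                          X[i - 1][j]) + gap_extend
            Y[i][j] = min(M[i][j - 1] + gap_open,
                          X[i][j - 1] + gap_open,
                          Y[i][j - 1]) + gap_extend
    return min(M[n][m], X[n][m], Y[n][m])


# One gap of length 2 is cheaper than two separate gaps of length 1.
print(affine_gap_distance("AAAA", "AA"))
# With gap_open=0 the cost is linear in gap length (classic edit distance).
print(affine_gap_distance("kitten", "sitting", gap_open=0))
```

The point of the affine model is visible in the first call: a single long gap pays the opening cost once, which favors alignments with few contiguous gaps over many scattered ones.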
Article
An important problem in geometric computing is defining and computing similarity between two geometric shapes, e.g. point sets, curves and surfaces. Important geometric and topological information of many shapes can be captured by defining a tree structure on them (e.g. medial axis and contour trees). Hence, it is natural to study the problem of comparing similarity between trees. We study gapped edit distance between two ordered labeled trees, first proposed by Touzet \cite{Touzet2003}. Given two binary trees $T_{1}$ and $T_{2}$ with $m$ and $n$ nodes, we compute the general gap edit distance in $O(m^{3}n^{2} + m^{2}n^{3})$ time. The computation of this distance in the case of arbitrary trees has been shown to be NP-hard \cite{Touzet2003}. We also give an algorithm for computing the complete subtree gap edit distance, which can be applied to comparing contour trees of terrains in $\mathbb{R}^{3}$.
... In trees representing RNA secondary structure, any gap on the primary sequence that is consistent with the secondary structure (i.e., the gap does not break any base pair) is a complete subforest gap in the tree. Such gaps are implemented in [36][37][38]. Compared to COMPLETESUBTREE, the grammar COMPLETESUBFOREST needs two additional symbols: E (sibling deletion) and J (sibling insertion). ...
Article
Dynamic programming is a classical algorithmic paradigm, which often allows the evaluation of a search space of exponential size in polynomial time. Recursive problem decomposition, tabulation of intermediate results for re-use, and Bellman's Principle of Optimality are its well-understood ingredients. However, algorithms often lack abstraction and are difficult to implement, tedious to debug, and delicate to modify. The present article proposes a generic framework for specifying dynamic programming problems. This framework can handle all kinds of sequential inputs, as well as tree-structured data. Biosequence analysis, document processing, molecular structure analysis, comparison of objects assembled in a hierarchic fashion, and generally, all domains come under consideration where strings and ordered, rooted trees serve as natural data representations. The new approach introduces inverse coupled rewrite systems. They describe the solutions of combinatorial optimization problems as the inverse image of a term rewrite relation that reduces problem solutions to problem inputs. This specification leads to concise yet translucent specifications of dynamic programming algorithms. Their actual implementation may be challenging, but eventually, as we hope, it can be produced automatically. The present article demonstrates the scope of this new approach by describing a diverse set of dynamic programming problems which arise in the domain of computational biology, with examples in biosequence and molecular structure analysis.
... I am also thankful to the MADALGO research center and the Caesarea Rothschild Institute for providing me with additional working environments and financial support. This thesis is based primarily on the following publications: [31] from ICALP'07, [12] from JCB, [80] from SODA'08, [63] from Algorithmica, [52] from SODA'09, and [45] from STACS'09. ...
Article
Dynamic Programming (DP) is a fundamental problem-solving technique that has been widely used for solving a broad range of search and optimization problems. While DP can be invoked when more specialized methods fail, this generality often incurs a cost in efficiency. We explore a unifying toolkit for speeding up DP, and algorithms that use DP as subroutines. Our methods and results can be summarized as follows. - Acceleration via Compression. Compression is traditionally used to efficiently store data. We use compression in order to identify repeats in the table that imply a redundant computation. Utilizing these repeats requires a new DP, and often different DPs for different compression schemes. We present the first provable speedup of the celebrated Viterbi algorithm (1967) that is used for the decoding and training of Hidden Markov Models (HMMs). Our speedup relies on the compression of the HMM's observable sequence. - Totally Monotone Matrices. It is well known that a wide variety of DPs can be reduced to the problem of finding row minima in totally monotone matrices. We introduce this scheme in the context of planar graph problems. In particular, we show that planar graph problems such as shortest paths, feasible flow, bipartite perfect matching, and replacement paths can be accelerated by DPs that exploit a total-monotonicity property of the shortest paths. - Combining Compression and Total Monotonicity. We introduce a method for accelerating string edit distance computation by combining compression and totally monotone matrices.
... However, it might be more useful not to directly use tree edit distance but to use variants that are specialized to RNA secondary structures. A number of such variants have been proposed and several systems have been developed [9], [21]. Another important example of tree structured data is a phylogenetic tree. A phylogenetic tree is a rooted unordered tree that represents a history of evolution of various biological species. ...
Article
Tree structured data often appear in bioinformatics. For example, glycans, RNA secondary structures and phylogenetic trees usually have tree structures. Comparison of trees is one of the fundamental tasks in the analysis of these data. Various distance measures have been proposed and utilized for comparison of trees, among which tree edit distance has been studied extensively. In this paper, we review key results and our recent results on the tree edit distance problem and related problems. In particular, we review polynomial time exact algorithms and more efficient approximation algorithms for the edit distance problem for ordered trees, and approximation algorithms for the largest common sub-tree problem for unordered trees. We also review applications of tree edit distance and its variants to bioinformatics, with a focus on comparison of glycan structures.
... We describe an algorithm based on that of Zhang and Shasha using an alignment graph. This approach was also used in [25] [4]. The alignment graph B F ,G of F and G is an edge-weighted directed graph defined as follows. ...
... We begin this section by giving an alternative description of Klein's algorithm using an alignment graph. However, as opposed to the alignment graph of [25] [4] our graph is three dimensional. ...
Article
The LCS of two rooted, ordered, and labeled trees F and G is the largest forest that can be obtained from both trees by deleting nodes. We present algorithms for computing tree LCS which exploit the sparsity inherent to the tree LCS problem. Assuming G is smaller than F, our first algorithm runs in time $O(r \cdot \mathrm{height}(F) \cdot \mathrm{height}(G) \cdot \lg\lg |G|)$, where r is the number of pairs (v ∈ F, w ∈ G) such that v and w have the same label. Our second algorithm runs in time $O(L r \lg r \cdot \lg\lg |G|)$, where L is the size of the LCS of F and G. For this algorithm we present a novel three-dimensional alignment graph. Our third algorithm is intended for the constrained variant of the problem in which only nodes with zero or one children can be deleted. For this case we obtain an $O(r h \lg\lg |G|)$ time algorithm, where $h = \mathrm{height}(F) + \mathrm{height}(G)$.
... Analogous to the sequence comparison problem that we have addressed, the most popular computational methods for RNA secondary structure comparison are algorithms based on tree edit distance or tree alignment models [4], [5], [10], [12], [16], [17], [21], [22]. In this framework, RNA secondary structures are represented as rooted trees, and one applies any well-developed tree comparison algorithm to compare their structural similarity. ...
... A fragment r[i 1 ... respectively. Notice that although r[5] and r[7] are at distance two, their partners are located at positions 18 and 15, which belong to different helices. In contrast, for the fragment r[3...9] = 3---5---7---9 ...
... (g(g(u(, r[5] is not a pioneer but r[3] and r[7] are. ...
Conference Paper
We point out the importance of incorporating affine-gap penalties in RNA secondary-structure comparison. Two notions of affine-gap penalties, one for sequences and the other for structures, are developed. A model from Jiang et al. in [J Comput Biol, 2002, 9, (2), pp. 371-388] is extended to allow this facility, and a polynomial-time algorithm is provided in this paper. Experimental results in this paper reveal that our new model generates more accurate and biologically meaningful alignments than several existing algorithms.
... We describe an algorithm based on that of Zhang and Shasha using an alignment graph. This approach was also used in [25,4]. The alignment graph B F ,G of F and G is an edge-weighted directed graph defined as follows. ...
... We begin this section by giving an alternative description of Klein's algorithm using an alignment graph. However, as opposed to the alignment graph of [25,4] our graph is three dimensional. Given a tree F and a path decomposition P of F we define a sequence of subforests of F as follows. ...
Conference Paper
The LCS of two rooted, ordered, and labeled trees F and G is the largest forest that can be obtained from both trees by deleting nodes. We present algorithms for computing tree LCS which exploit the sparsity inherent to the tree LCS problem. Assuming G is smaller than F, our first algorithm runs in time $O(r \cdot \mathrm{height}(F) \cdot \mathrm{height}(G) \cdot \lg\lg |G|)$, where r is the number of pairs (v ∈ F, w ∈ G) such that v and w have the same label. Our second algorithm runs in time $O(L r \lg r \cdot \lg\lg |G|)$, where L is the size of the LCS of F and G. For this algorithm we present a novel three-dimensional alignment graph. Our third algorithm is intended for the constrained variant of the problem in which only nodes with zero or one children can be deleted. For this case we obtain an $O(r h \lg\lg |G|)$ time algorithm, where $h = \mathrm{height}(F) + \mathrm{height}(G)$.
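
The definition of tree LCS given above (the largest forest obtainable from both trees by deleting nodes) can be illustrated with a naive memoized recursion over forests. This is only a sketch of the definition for small inputs; it does not implement the sparse algorithms or the alignment graph described in the abstract. The representation (a tree as a `(label, children)` tuple with `children` a tuple of trees) and the name `tree_lcs` are assumptions made for the example.

```python
from functools import lru_cache


def tree_lcs(f, g):
    """Size of the LCS of two rooted, ordered, labeled trees.

    Each tree is a (label, children) pair, where children is a tuple of trees.
    """

    @lru_cache(maxsize=None)
    def forest_lcs(F, G):
        # F and G are ordered forests, given as tuples of trees.
        if not F or not G:
            return 0
        v_label, v_children = F[-1]   # rightmost root of F
        w_label, w_children = G[-1]   # rightmost root of G
        # Delete v (its children take its place), or delete w.
        best = max(forest_lcs(F[:-1] + v_children, G),
                   forest_lcs(F, G[:-1] + w_children))
        # Keep both roots if their labels agree; the subtrees below them and
        # the forests to their left are then matched independently.
        if v_label == w_label:
            best = max(best,
                       1 + forest_lcs(v_children, w_children)
                         + forest_lcs(F[:-1], G[:-1]))
        return best

    return forest_lcs((f,), (g,))


leaf = lambda label: (label, ())
F = ("a", (leaf("b"), leaf("c")))
G = ("x", (leaf("b"), leaf("c")))
print(tree_lcs(F, G))  # the common forest consists of the two leaves b and c
```

Because deleting a node promotes its children to its position, the result is in general a forest rather than a single tree, as in the example where the differing roots are both deleted.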
Conference Paper
Detecting local common sequence-structure regions of RNAs is a biologically meaningful problem. By detecting such regions, biologists are able to identify functional similarity between the inspected molecules. We developed dynamic programming algorithms for finding common structure-sequence patterns between two RNAs. The RNAs are given by their sequence and a set of potential base pairs with associated probabilities. In contrast to prior work which matches fixed structures, we support the arc breaking edit operation; this allows matching only a subset of the given base pairs. We present an $O(n^{3})$ algorithm for local exact pattern matching between two nested RNAs, and an $O(n^{3} \log n)$ algorithm for one nested RNA and one bounded-unlimited RNA.