Fig 1 - uploaded by David Robertson
The syntax of grammar rules.  

Source publication
Article
Full-text available
Motivation: The field of 'DNA linguistics' has emerged from pioneering work in computational linguistics and molecular biology. Most formal grammars in this field are expressed using Definite Clause Grammars but these have computational limitations which must be overcome. The present study provides a new DNA parsing system, comprising a logic gram...

Context in source publication

Context 1
... rule syntax Figure 1 shows the syntax of basic grammar rules. A grammar rule consists of a single left-hand-side (LHS) category, an arrow symbol, and one or more right-hand-side (RHS) categories (rhs cat 1, ..., rhs cat n). ...
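The rule shape described in this excerpt (one LHS category, an arrow, one or more RHS categories) can be sketched as a small data structure with a reader. This is a minimal illustration only: the arrow spelling `--->`, the comma separator, the optional terminating period, and the names `GrammarRule` and `parse_rule` are assumptions, not the notation defined in Figure 1.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GrammarRule:
    """One rule: a single LHS category and one or more RHS categories."""
    lhs: str
    rhs: Tuple[str, ...]


def parse_rule(text: str, arrow: str = "--->") -> GrammarRule:
    """Parse 'lhs ---> rhs_1, ..., rhs_n.' into a GrammarRule.

    The arrow string, comma separator, and trailing period are
    hypothetical conventions chosen for this sketch.
    """
    head, sep, body = text.partition(arrow)
    if not sep:
        raise ValueError("rule has no arrow symbol")
    lhs = head.strip()
    if not lhs or "," in lhs:
        raise ValueError("rule must have exactly one LHS category")
    # Drop an optional terminating period, then split the RHS on commas.
    body = body.strip().rstrip(".")
    rhs = tuple(part.strip() for part in body.split(",") if part.strip())
    if not rhs:
        raise ValueError("rule must have at least one RHS category")
    return GrammarRule(lhs, rhs)


rule = parse_rule("promoter ---> minus_35, spacer, minus_10.")
print(rule.lhs, rule.rhs)
```

The category names in the usage line (`promoter`, `minus_35`, `spacer`, `minus_10`) are made up for the example and are not taken from the source publication.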

Similar publications

Article
Full-text available
This paper presents the development of a grammar and a syntactic parser for the Vietnamese language. We first discuss the construction of a lexicalized tree-adjoining grammar using an automatic extraction approach. We then present the construction and evaluation of a deep syntactic parser based on the extracted grammar. This is a complete system th...
Article
Full-text available
Parsing plays a very prominent role in computational linguistics. Parsing a Bangla sentence is a primary need in Bangla language processing. This chapter describes the Context Free Grammar (CFG) for parsing Bangla language, and hence, a Bangla parser is proposed based on the Bangla grammar. This approach is very simple to apply in Bangla sentences,...
Conference Paper
Full-text available
The present paper reports on an end-to-end application using a deep processing grammar to extract spatial and temporal information of prepositional and adverbial expressions from running text. The extraction process is based on the full understanding of the input text. It is represented in a formalism standard for unification-based gramm...
Conference Paper
Full-text available
This research work describes a computer system for understanding the parsing of Bangla sentences. It draws on recent developments in Natural Language Processing (NLP) research to look at the past, present, and future of NLP technology in a new light. The research work of Bangla Language Processing (BLP) was started in the late 1980s in Bangladesh and it...
Article
Full-text available
ABSTRACT In automatic syntactic parsing, defining linguistic criteria for grammars based on linguistic knowledge makes it possible to develop coherent and consistent resources. The construction of EsTxala and CaTxala, two dependency grammars of Spanish and Catalan for FreeLing (an environment of tools for the Processing...

Citations

... An ncRNA gene detection was made with the help of pair stochastic CFGs in [158]. Basic gene grammars were defined for processing DNA sequences in [124]. The model of grammatical induction for the recognition of human neuropeptide precursors was defined in [141]. ...
... A summary of applications of syntactic pattern recognition models in bioinformatics described in the previous subsections is presented in Table 1.

Table 1. The summary of applications of syntactic pattern recognition models in bioinformatics (in chronological order)

Model type: String-based models
Models: Regular grammars, stochastic context-free grammars, programmed context-free grammars, multiple context-free grammars, context-sensitive grammars, attributed grammars, finite-state automata, hidden Markov models, precedence parsing, CYK parsing, algebraic dynamic programming
References: [120], [121], [82], [109], [70], [122], [192], [22], [189], [88], [32][33][34], [168][169][170][171][172][173], [40], [115,117], [162], [19], [23], [118], [13], [123], [159], [46,47], [104], [204], [110], [24], [133], [154], [157,158], [108], [116], [124], [141], [89], [144], [151], [203], [3], [28], [111], [112], [131], [146], [190], [6][7][8], [26,27], [41], [201], [39], [103], [134], [161], [183], [31], [105,107], [142], [198], [148], [186,187], [199], [209], [15], [21], [37], [66], [86], [128], [164], [176], [184], [202], [147], [206], [29], [44], [180], [205], [2], [90], [98], [212], [193], [5], [143], [177], [191], [36], [45], [38], [74], [178], [4], [35], [155], [200], [119], [140], [188], [208], [149], [126], [48], [165]

Model type: Tree-based models
Models: (Stochastic) regular tree grammars, Tree Adjoining Grammars, algebraic dynamic programming
References: [135], [1], [196], [75], [77], [76], [78], [79], [80], [106], [139], [30], [174], [93], [127], [210,211], [91], [38], [81], [166], [182], [14], [129], [152], [137]

Model type: Graph-based models
Models: NLC graph grammars, NCE graph grammars, algebraic (DPO) graph transformation systems, string-regulated rewriting graph grammars, k-testable graph languages
References: [99], [185], [130], [136], [72], [179], [197] ...
Article
Full-text available
Formal tools and models of syntactic pattern recognition which are used in bioinformatics are introduced and characterized in the paper. They include, among others: stochastic (string) grammars and automata, hidden Markov models, programmed grammars, attributed grammars, stochastic tree grammars, Tree Adjoining Grammars (TAGs), algebraic dynamic programming, NLC- and NCE-type graph grammars, and algebraic graph transformation systems. The survey of applications of these formal tools and models in bioinformatics is presented.
... Wong in 1993 used Prolog to model the Probed Partial Digestion of DNA, with the goal of finding genomic maps compatible with experimental data [82]. The DNA-ChartParser system uses the Prolog unification mechanism to implement parsing of DNA sequences; in particular, [48] dealt with the specific case of E. coli promoters. One of the first uses of Prolog as a high-level language for representing and querying biochemical data was proposed by Kazic in 1994, to circumvent the impedance mismatch of other hybrid systems existing at the time of publication [45]. ...
Chapter
This paper provides an overview of the use of Prolog and its derivatives to sustain research and development in the fields of bioinformatics and computational biology. A number of applications in this domain have been enabled by the declarative nature of Prolog and the combinatorial nature of the underlying problems. The paper provides a summary of some relevant applications as well as potential directions that the Prolog community can continue to pursue in this important domain. The presentation is organized in two parts: “small,” which explores studies in biological components and systems, and “large,” that discusses the use of Prolog to handle biomedical knowledge and data. A concrete encoding example is presented and the effective implementation in Prolog of a widely used approximated search technique, large neighborhood search, is presented.
... Gemma Bel claimed that genetic code and natural language share a number of units, structures, and operations so that syntactic and semantic parallelisms between these codes should lead to methodological exchange between biology, linguistics, and semiotics (Cantero 2012). In this regard, the linguist Jakobson introduced the first relationship when he suggested an interpretation correlating elements of genetic code and verbal language (nucleotides against phonemes/letters, codons against words, and lexicon against 64 codons) (Leung et al. 2001). In a further step, Lyons more closely examined the molecular-linguistics analogy by considering four essential design features that enable language to function as a signalling, or semiotic, system: discreteness, arbitrariness, duality, and productivity. ...
Article
Full-text available
Since its emergence, bio-art has developed numerous metaphors central to the transfer of concepts of modern biology, genetics, and genomics to the public domain that reveal several cultural, ethical, and social variations in their related themes. This article assumes that a general typology of metaphors developed by practices related to bio-art can be categorised into two categories: pictorial and operational metaphors. Through these, information regarding several biological issues is transferred to the public arena. Based on the analysis, this article attempts to answer the following questions: How does bio-art develop metaphors to advance epistemic and discursive agendas that constitute public understanding of a set of deeply problematic assumptions regarding how today’s biology operates? Under the influence of today’s synthetic biology, could bio-media operationally reframe these epistemic agendas by reframing complex and multi-layered metaphors towards post-metaphoric structures? Finally, what are the scientific, cultural, and social implications of reframing?
... Leung, Mellish, and Robertson [38] present a grammatical parser to model and predict DNA transcription binding sites using Definite Clause Grammar (DCG). The parser takes the string and parses it to determine whether there is a transcription binding site. Another related approach, by Krogh, Mian and Haussler [37], uses a Hidden Markov Model (HMM) to predict protein-coding genes in E. coli DNA sequences. ...
Article
As data collection technologies advance and memory storage costs decline, the volume of collected data has soared. Scientists and investigators collect all possible data for fear of missing important information. Alongside this data collection trend, researchers have studied data mining and analysis to find the most efficient way to mine data. Various valuable data mining techniques can be found in the literature, such as Support Vector Machines (SVM), Artificial Neural Networks (ANN), and Formal Methods (Grammars). Grammars are very valuable for analyzing structured data and describing them in a condensed manner. However, few have used them for data mining, despite their many benefits. In this research we present an approach to mining big data. First, a grammar is inferred to build a structural model that describes the data. Then, in the next phase, a probabilistic context-free grammar is inferred, along with a model for more complex structures. Given an input sequence, the model parses it and generates the probability of that sequence being part of the class based on its structural characteristics. Grammatical concatenation is utilized when sub-structures exist within the class’s structural description. The model then accepts or rejects the input as part of the data’s class by comparing the probability to a pre-set threshold. Finally, this is applied to a large heterogeneous data set by inferring multiple grammars. After building a grammatical model for each class, the algorithm parses multiple points in the large set. It then classifies these data into smaller sets that share similar structural characteristics using probabilistic grammars. If more than one class accepts a data point, it is assigned to the highest-ranking class. Biological data, DNA and proteins, were used for experimentation in this research.
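The accept-or-reject step described in this abstract (score a sequence under each class grammar, compare with a pre-set threshold, keep the highest-ranking class) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the grammars are hand-written toy PCFGs in Chomsky normal form, the score is the standard inside (CYK-style) probability, and the names `inside_probability`, `classify`, `AT_RICH`, and `GC_RICH` are hypothetical. Grammar inference and grammatical concatenation are not covered.

```python
from collections import defaultdict


def inside_probability(seq, lexical, binary, start="S"):
    """P(start derives seq) for a PCFG in Chomsky normal form.

    lexical maps (A, terminal) -> probability of A -> terminal.
    binary maps (A, B, C) -> probability of A -> B C.
    """
    n = len(seq)
    if n == 0:
        return 0.0
    # chart[(i, j)][A] = probability that A derives seq[i:j]
    chart = defaultdict(lambda: defaultdict(float))
    for i, sym in enumerate(seq):
        for (a, terminal), p in lexical.items():
            if terminal == sym:
                chart[(i, i + 1)][a] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (a, b, c), p in binary.items():
                    pb = chart[(i, k)].get(b, 0.0)
                    pc = chart[(k, j)].get(c, 0.0)
                    if pb and pc:
                        chart[(i, j)][a] += p * pb * pc
    return chart[(0, n)].get(start, 0.0)


def classify(seq, models, threshold):
    """Return the best-scoring class whose probability meets the threshold, else None."""
    scores = {name: inside_probability(seq, lex, bin_) for name, (lex, bin_) in models.items()}
    accepted = {name: p for name, p in scores.items() if p >= threshold}
    return max(accepted, key=accepted.get) if accepted else None


# Toy class grammars (made up for illustration): S -> S S | terminal
AT_RICH = ({("S", "a"): 0.3, ("S", "t"): 0.3}, {("S", "S", "S"): 0.4})
GC_RICH = ({("S", "g"): 0.3, ("S", "c"): 0.3}, {("S", "S", "S"): 0.4})
MODELS = {"at_rich": AT_RICH, "gc_rich": GC_RICH}

print(classify("at", MODELS, threshold=0.01))
```

Raw sequence probabilities shrink quickly with length, so a fixed threshold like the one used here is only meaningful for sequences of comparable length; how the original work handles this is not stated in the abstract.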
... In our study, we have proposed a new method to support the de novo discovery of kinase specific phosphorylation sites based upon computational grammar. Various methods based on Computational grammars have been proposed so far for modelling and predicting various types of biological sequences such as promoter region of human [23], transcription binding site [24], associating genes with their regulatory sequences [25], predicting RNA folding [26], secondary structure of RNA molecule [27][28][29], genes and biological sequences [30,31], syntactic model to design genetic constructs [32] and new antimicrobial peptides [33]. Nowadays, Grammar Inference (GI) is becoming an active field of research in the area of computational grammar [34]. ...
Article
Full-text available
Kinase-mediated phosphorylation is a key post-translational mechanism that plays an important role in regulating various cellular processes and phenotypes. Many diseases, like cancer, are related to signaling defects associated with protein phosphorylation. Characterizing the protein kinases and their substrates enhances our ability to understand the mechanism of protein phosphorylation and extends our knowledge of signaling networks, thereby helping us to treat such diseases. Experimental methods for predicting phosphorylation sites are labour intensive and expensive. Also, the manifold increase of protein sequences in the databanks over the years necessitates the improvement of high-speed and accurate computational methods for predicting phosphorylation sites in protein sequences. To date, a number of computational methods have been proposed by various researchers for predicting phosphorylation sites, but there remains much scope for improvement. In this communication, we present a simple and novel method based on the Grammatical Inference (GI) approach to automate the prediction of kinase-specific phosphorylation sites. In this regard, we have used a popular GI algorithm, Alergia, to infer a Deterministic Stochastic Finite State Automaton (DSFA) which equally represents the regular grammar corresponding to the phosphorylation sites. Extensive experiments on several datasets generated by us reveal that our inferred grammar successfully predicts phosphorylation sites in a kinase-specific manner. It performs significantly better when compared with the other existing phosphorylation site prediction methods. We have also compared our inferred DSFA with two other GI inference algorithms. The DSFA generated by our method performs better, which indicates that our method is robust and has potential for predicting phosphorylation sites in a kinase-specific manner.
... Computational grammars have been used in modelling and predicting transcription binding site [28], associating genes with their regulatory sequences [29], predicting RNA folding [30], secondary structure of RNA molecule [31][32][33], genes and biological sequences [34,35], syntactic model to design genetic constructs [36]. Grammatical models have also been developed with the goal of designing new antimicrobial peptides [37]. ...
Article
Full-text available
An important step in understanding gene regulation is to identify the promoter regions where transcription factor binding takes place. Predicting a promoter region de novo has been a theoretical goal for many researchers for a long time. There exist a number of in silico methods to predict the promoter region de novo, but most of these methods still suffer from various shortcomings, a major one being the selection of appropriate features of promoter regions distinguishing them from non-promoters. In this communication, we have proposed a new composite method that predicts promoter sequences based on the interrelationship between structural profiles of DNA and primary sequence elements of the promoter regions. We have shown that a Context Free Grammar (CFG) can formalize the relationships between different primary sequence features and, by utilizing the CFG, we demonstrate that an efficient parser can be constructed for extracting these relationships from DNA sequences to distinguish the true promoter sequences from non-promoter sequences. Along with the CFG, we have extracted the structural features of the promoter region to improve upon the efficiency of our prediction system. Extensive experiments performed on different datasets reveal that our method is effective in predicting promoter sequences on a genome-wide scale and performs satisfactorily as compared to other promoter prediction techniques.
... The field of DNA linguistics has focused on computational linguistics and molecular biology. Such efforts have contributed to developing a logic grammar formalism that has been used to perform language processing and recognition of DNA sequences such as E. coli promoters [33]. We posit that linguistic structure coupled with algorithm methodologies helps us to understand the difference between data and algorithms in the DNA/RNA world. ...
Article
Full-text available
The fields of molecular biology and computer science have cooperated over recent years to create a synergy between the cybernetic and biosemiotic relationship found in cellular genomics to that of information and language found in computational systems. Biological information frequently manifests its "meaning" through instruction or actual production of formal bio-function. Such information is called prescriptive information (PI). PI programs organize and execute a prescribed set of choices. Closer examination of this term in cellular systems has led to a dichotomy in its definition suggesting both prescribed data and prescribed algorithms are constituents of PI. This paper looks at this dichotomy as expressed in both the genetic code and in the central dogma of protein synthesis. An example of a genetic algorithm is modeled after the ribosome, and an examination of the protein synthesis process is used to differentiate PI data from PI algorithms.
... Later, DCGs and the Prolog programming language were used in modeling gene regulation (Collado-Vides, 1992; Rosenblueth et al., 1996), benefiting from features such as parameter-passing and arbitrary Prolog code embeddings. Basic Gene Grammar, an attempt to simplify the representations of DNA sequences, and with expressive power equivalent to that of DCG, has been used to model and predict transcription binding sites (Leung et al., 2001). ...
Article
Full-text available
Treating genomes just as languages raises the possibility of producing concise generalizations about information in biological sequences. Grammars used in this way would constitute a model of underlying biological processes or structures, and that grammars may, in fact, serve as an appropriate tool for theory formation. The increasing number of biological sequences that have been yielded further highlights a growing need for developing grammatical systems in bioinformatics. The intent of this review is therefore to list some bibliographic references regarding the recent progresses in the field of grammatical modeling of biological sequences. This review will also contain some sections to briefly introduce basic knowledge about formal language theory, such as the Chomsky hierarchy, for non-experts in computational linguistics, and to provide some helpful pointers to start a deeper investigation into this field.
... Computational grammars have been used to model and predict transcription binding sites (Leung et al. 2001), RNA folding (Rivas and Eddy 2000) and genes (Searls 2002). Components of larger-than-gene structure (LGS) represented with grammars include integrons (Joss et al. 2009; Moura et al. 2009; Rowe-Magnus et al. 2003), insertion sequences (Siguier et al. 2006) and gene cassettes (Partridge et al. 2009). ...
Article
Full-text available
Larger than gene structures (LGS) are DNA segments that include at least one gene and often other segments such as inverted repeats and gene promoters. Mobile genetic elements (MGE) such as integrons are LGS that play an important role in horizontal gene transfer, primarily in Gram-negative organisms. Known LGS have a profound effect on organism virulence, antibiotic resistance and other properties of the organism due to the number of genes involved. Expert-compiled grammars have been shown to be an effective computational representation of LGS, well suited to automating annotation, and supporting de novo gene discovery. However, development of LGS grammars by experts is labour intensive and restricted to known LGS. Objectives: This study uses computational grammar inference methods to automate LGS discovery. We compare the ability of six algorithms to infer LGS grammars from DNA sequences annotated with genes and other short sequences. We compared the predictive power of learned grammars against an expert-developed grammar for gene cassette arrays found in Class 1, 2 and 3 integrons, which are modular LGS containing up to 9 of about 240 cassette types. Using a Bayesian generalization algorithm our inferred grammar was able to predict > 95% of MGE structures in a corpus of 1760 sequences obtained from Genbank (F-score 75%). Even with 100% noise added to the training and test sets, we obtained an F-score of 68%, indicating that the method is robust and has the potential to predict de novo LGS structures when the underlying gene features are known. http://www2.chi.unsw.edu.au/attacca.
... Surveys presented in [91] and [35] describe correlations between linguistic structures and biological function. In particular, linguistic models of macromolecules [10,41], have been used to model nucleic acid structure [90,89,47], protein linguistics [1,82], and gene regulation [18,84,56]. Much of the work available in the literature assumes the underlying grammar is known a priori. ...
... Some work has briefly been done in regards to modeling GRNs using a subclass of CSGs [18,91,89,47,5] called definite clause grammars (DCGs) developed in the efficient computer language, Prolog. This was further developed into Basic Gene Grammars in [56]. The end result is a very high-level model description with a database approach to determining the classification of sequences of data in silico. ...
... For example, [91] and [35] describe correlations between linguistic structures and biological function. Grammars have been used to model nucleic acid structure [90,89,47], protein linguistics [1,82], and gene regulation [18,84,56]. However, often the literature assumes the source grammar is already known, and it is usually for a specific set of biological data [84,56]. ...
Thesis
Full-text available
Grammars are generally understood to be the set of rules that define the relationships between elements of a language. However, grammars can also be used to elucidate structural relationships within sequences constructed from any finite alphabet. In this work abstract grammars are used to model the primary and secondary structures present in biological data. These grammar models are inferred and applied to efficiently solve various sequence analysis problems in computational biology, including multiple sequence alignment, fragment assembly, database redundancy removal, and structural prediction. The primary structures, or sequential ordering of symbols, of biological data are first modeled with Lempel-Ziv (LZ) grammars. The results are used to construct a grammar based sequence distance metric which can be used to compare biological sequences by comparing their inferred grammars. This concept is applied to solve several problems involving biological sequence analysis including multiple sequence alignment and phylogenetic clustering. The higher-level secondary structures of biological sequences are then modeled via two novel grammar inference methods. The resulting context-free grammars are used to estimate structural pieces within biological sequences, which can in-turn be used as supplemental information to help guide various sequence analysis algorithms. The use of this approach to develop algorithms for various sequence analysis tasks demonstrates the viability and versatility of using abstract grammars to model biological data.
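The idea of comparing sequences through their LZ parses can be illustrated with a short sketch. This is not the metric defined in the thesis: it assumes an LZ78-style incremental parse as the complexity measure and one simple normalized form of distance built from phrase counts of the two concatenations. The names `lz78_phrase_count` and `lz_distance` are hypothetical.

```python
def lz78_phrase_count(seq: str) -> int:
    """Number of phrases in an LZ78-style incremental parse of seq."""
    phrases = set()
    current = ""
    count = 0
    for ch in seq:
        current += ch
        if current not in phrases:
            # A new phrase ends here: the longest known prefix plus one symbol.
            phrases.add(current)
            count += 1
            current = ""
    if current:
        # Trailing fragment that matches an already-seen phrase.
        count += 1
    return count


def lz_distance(s: str, q: str) -> float:
    """Illustrative distance: extra phrases needed to parse one sequence after the other.

    Assumed form for this sketch: max(c(sq) - c(s), c(qs) - c(q)) / max(c(s), c(q)).
    """
    cs, cq = lz78_phrase_count(s), lz78_phrase_count(q)
    if max(cs, cq) == 0:
        raise ValueError("both sequences are empty")
    extra_sq = lz78_phrase_count(s + q) - cs
    extra_qs = lz78_phrase_count(q + s) - cq
    return max(extra_sq, extra_qs) / max(cs, cq)


print(lz_distance("acgt", "aaaa"))
```

The value is symmetric in its two arguments by construction, but it is not zero for identical inputs, so it behaves as a dissimilarity score rather than a true metric.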