String Operations in Query Languages

Abstract

We study relational calculi with support for string operations. Most prior proposals were based on adding the operation of concatenation to first-order logic. Such an extension is problematic as the relational calculus becomes computationally complete, which in turn implies strong limits on the ability to perform optimization and static analysis of properties such as query safety. In contrast, we look at extensions of relational calculus that have nice expressiveness, decidability, and safety properties, while corresponding to sets of string operations used in SQL. We start with an extension based on the string ordering and LIKE predicates. We then extend this basic model to include string length comparison. While both of these share some of the attractive properties of relational calculus (low data complexity for generic queries, effective syntax for safe queries, correspondence with an algebra), there is a large gap between these calculi in expressive power and complexity. The small...
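The three kinds of string predicate named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's formalism. The function names are hypothetical, the "string ordering" is assumed here to be the prefix order (the abstract does not say which ordering is meant), and LIKE is assumed to use only the standard SQL wildcards `%` and `_` with no escape character.

```python
import re


def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a SQL LIKE pattern into a compiled regular expression.

    Assumption: '%' matches any string, '_' matches exactly one character,
    and there is no escape character.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def like(s: str, pattern: str) -> bool:
    """LIKE predicate: the whole string must match the pattern."""
    return like_to_regex(pattern).fullmatch(s) is not None


def is_prefix(x: str, y: str) -> bool:
    """String ordering, taken here to be the prefix order (an assumption)."""
    return y.startswith(x)


def same_length(x: str, y: str) -> bool:
    """String length comparison, in its simplest form: equal length."""
    return len(x) == len(y)
```

A query in such a calculus combines these predicates with database relations using first-order connectives and quantifiers; the sketch covers only the predicates themselves.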
... The most basic one, restricted-quantifier normal form [6,16,8], states that every R(M) formula is equivalent to a formula in which no symbol appears in the scope of a quantifier ∃x or ∀x (that is, they appear only in the scope of quantifiers ∃x ∈ dom and ∀x ∈ dom). The main result of this section is that for every R(T) query, quantifiers can be restricted to range over the (finite and definable) set of trees whose domains can only contain nodes present in the domains of trees in the active domain of the finite -structure, and in the tuple of the free variables. ...
... From the learning point of view, this means that every definable family is PAC-learnable [5]. Finite VC dimension also implies strong bounds on the expressiveness of relational query languages [6,8]. It turns out that the presence of the extension predicate prevents M from having finite VC dimension. Model theory of strings vs. model theory of trees. We remarked before that if the alphabet of directions has a unique element, then trees over such alphabet are naturally associated with strings: that is, trees in Trees_1(Σ) are in 1-1 correspondence with Σ*. What are the analogs of T_p and ...
Conference Paper
We study relations on trees defined by first-order constraints over a vocabulary that includes the tree extension relation T<T', holding if and only if every branch of T extends to a branch of T', unary node-tests, and a binary relation checking if the domains of two trees are equal. We show that from such a formula one can generate a tree automaton that accepts the set of tuples of trees defined by the formula, and conversely that every automaton over tree-tuples is captured by such a formula. We look at the fragment with only extension inequalities and leaf tests, and show that it corresponds to a new class of automata on tree tuples, which is strictly weaker than general tree-tuple automata. We use the automata representations to show separation and expressibility results for formulae in the logic. We then turn to relational calculi over the logic defined here: that is, from constraints we extend to queries that have second-order parameters for a finite set of tree tuples. We give normal forms for queries, and use these to get bounds on the data complexity of query evaluation, showing that while general query evaluation is unbounded within the polynomial hierarchy, generic query evaluation has very low complexity, giving strong bounds on the expressive power of relational calculi with tree extension constraints. We also give normal forms for safe queries in the calculus.
Chapter
Since their inception, the Perspectives in Logic and Lecture Notes in Logic series have published seminal works by leading logicians. Many of the original books in the series have been unavailable for years, but they are now in print once again. This volume, the twenty-fourth publication in the Lecture Notes in Logic series, contains the proceedings of the European Summer Meeting of the Association for Symbolic Logic, held in Helsinki, Finland, in August 2003. These articles include an extended tutorial on generalizing finite model theory, as well as seventeen original research articles spanning all areas of mathematical logic, including proof theory, set theory, model theory, computability theory and philosophy.
Chapter
We investigate the expressive power and complexity questions for the LIKE operator in SQL. The languages definable by a single LIKE pattern and generalizations are related to a well-known hierarchy of classes of formal languages, namely the dot-depth hierarchy introduced by Cohen and Brzozowski. Then we turn to natural decision problems and show that membership is likely easier for LIKE patterns than for more powerful regular expressions. Equivalence is provably harder for general regular expressions. More complex conditions based on LIKE patterns are also considered.
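The membership problem for a single LIKE pattern (does a given string match the pattern?) can be decided by a simple scan that only ever retries from the most recent `%`. The Python sketch below is a generic wildcard-matching routine, not the algorithm or complexity bound from the chapter. The name `like_match` is hypothetical, and the pattern syntax is assumed to be `%` (any string) and `_` (any one character) with no escape character.

```python
def like_match(s: str, pattern: str) -> bool:
    """Decide whether s matches a LIKE pattern without building an automaton."""
    i = j = 0    # current positions in s and in pattern
    star = -1    # index of the most recent '%' seen in pattern, or -1
    mark = 0     # position in s from which that '%' is currently absorbing
    while i < len(s):
        if j < len(pattern) and pattern[j] == "%":
            # Remember the wildcard and first try to let it match nothing.
            star, mark = j, i
            j += 1
        elif j < len(pattern) and (pattern[j] == "_" or pattern[j] == s[i]):
            # Single-character match: advance both positions.
            i += 1
            j += 1
        elif star != -1:
            # Mismatch: let the last '%' absorb one more character and retry.
            mark += 1
            i, j = mark, star + 1
        else:
            return False
    # s is exhausted; any remaining pattern symbols must all be '%'.
    while j < len(pattern) and pattern[j] == "%":
        j += 1
    return j == len(pattern)
```

The scan uses constant extra space and, in the worst case, time proportional to the product of the two lengths. General regular expressions usually need an automaton construction or a backtracking search for the same test.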
Conference Paper
With the advent of large string datasets in several scientific and business applications, there is a growing need to perform ad-hoc analysis on strings. Currently, strings are stored, managed, and queried using procedural codes. This limits users to certain operations supported by existing procedural applications and requires manual query planning with limited tuning opportunities. This paper presents StarQL, a generic and declarative query language for strings. StarQL is based on a native string data model that allows StarQL to support a large variety of string operations and provide semantic-based query optimization. String analytic queries are too intricate to be solved on one machine. Therefore, we propose a scalable and efficient data structure that allows StarQL implementations to handle large sets of strings and utilize large computing infrastructures. Our evaluation shows that StarQL is able to express workloads of application-specific tools, such as BLAST and KAT in bioinformatics, and to mine Wikipedia text for interesting patterns using declarative queries. Furthermore, the StarQL query optimizer shows an order of magnitude reduction in query execution time.
Conference Paper
Many ideas of Alfred Tarski - one of the founders of modern logic - find application in database theory. We survey some of them with no attempt at comprehensiveness. Topics discussed include the genericity of database queries; the relational algebra, the Tarskian definition of truth for the relational calculus, and cylindric algebras; relation algebras and computationally complete query languages; real polynomial constraint databases; and geometrical query languages.
Article
Current data management and information retrieval systems lack advanced string processing capabilities needed in string-oriented application areas like computational molecular biology. Several theoretical models for string processing have been proposed but they either have not been implemented in practice or the implementations are too restricted or platform-dependent to be generally useful. In this article, we introduce the language Alignment Declarations designed for string querying and restructuring. The language extends the capabilities of existing database query languages by allowing the user to define database predicates that express structural properties of strings (e.g. containment of certain patterns) or relations between several strings (e.g. similarity measures). These predicates can be created and executed within the same database session and also stored for later sessions. We also describe the design and implementation of a working system.
Conference Paper
Unranked trees, that is, trees with no restriction on the number of children of nodes, have recently attracted much attention, primarily as an abstraction of XML (Extensible Markup Language) documents. In this paper, we study logical definability over unranked trees, as well as collections of unranked trees, that can be viewed as databases of XML documents. The traditional approach to definability is to view each tree as a structure of a fixed vocabulary, and study the expressive power of various logics on trees. A different approach, based on model theory, considers a structure whose universe is the set of all trees, and studies definable sets and relations; this approach extends smoothly to the setting of definability over collections of trees. We study the latter, model-theoretic approach. We find sets of operations on unranked trees that define regular tree languages, and show that some natural restrictions correspond to logics studied in the context of XML pattern languages. We then look at relational calculi over collections of unranked trees, and obtain quantifier-restriction results that give us bounds on the expressive power and complexity. As unrestricted relational calculi can express problems complete for each level of the polynomial hierarchy, we look at their restrictions and find several calculi with low (NC¹) data complexity that can express important XML properties like DTD validation and XPath evaluation.
Conference Paper
The guarded fragment with transitive guards, [GF+TG], is an extension of GF in which certain relations are required to be transitive, transitive predicate letters appear only in guards of the quantifiers and the equality symbol may appear everywhere. We prove that the satisfiability problem for [GF+TG] is decidable. This answers the question posed in (Ganzinger et al., 1999). Moreover, we show that the problem is 2EXPTIME-complete. This result is optimal since the satisfiability problem for GF is 2EXPTIME-complete (Grädel, 1999). We also show that the satisfiability problem for two-variable [GF+TG] is NEXPTIME-hard, in contrast to GF with a bounded number of variables, for which the satisfiability problem is EXPTIME-complete.
Conference Paper
In this paper we pursue the study of Alignment Calculus, a declarative string database query language that supports both string querying and restructuring. This language is aimed at applications such as molecular biology databases, where the basic data type is a string, and the queries are combinatorial in nature. The declarative nature of our language does, however, require some additional effort in its implementation. Here we solve this problem by first defining a domain-independent syntactic subset of the full language and then developing a query evaluation mechanism for this sublanguage. This mechanism then handles the required restructuring operations in a finite manner.
Conference Paper
Structured document databases can be naturally viewed as derivation trees of a context-free grammar. Under this view, the classical formalism of attribute grammars becomes a formalism for structured document query languages. From this perspective, we study the expressive power of BAGs: Boolean-valued attribute grammars with propositional logic formulas as semantic rules, and RAGs: relation-valued attribute grammars with first-order logic formulas as semantic rules. BAGs can express only unary queries; RAGs can express queries of any arity. We first show that the (unary) queries expressible by BAGs are precisely those definable in monadic second-order logic. We then show that the queries expressible by RAGs are precisely those definable by first-order inductions of linear depth, or, equivalently, those computable in linear time on a parallel machine with polynomially many processors. Further, we show that RAGs that only use synthesized attributes are strictly weaker than RAGs that use both synthesized and inherited attributes. We show that RAGs are more expressive than monadic second-order logic for queries of any arity. Finally, we discuss relational attribute grammars in the context of BAGs and RAGs. We show that in the case of BAGs this does not increase the expressive power, while different semantics for relational RAGs capture the complexity classes NP, coNP and UP ∩ coUP.
Article
The infinitary logic ℒ^ω_{∞ω} consists of all formulas of ℒ_{∞ω} with a finite number of variables. During the past several years, the study of ℒ^ω_{∞ω} has occupied a prominent place in finite model theory. We present an overview of results concerning the interaction of ℒ^ω_{∞ω} with least fixed-point logic and first-order implicit definability on finite structures.
Book
Part 1 Mathematical preliminaries: words and languages; automata and regular languages; semigroups and homomorphisms. Part 2 Formal languages and formal logic: examples; definitions. Part 3 Finite automata: monadic second-order sentences and regular languages; regular numerical predicates; infinite words and decidable theories. Part 4 Model-theoretic games: the Ehrenfeucht-Fraïssé game; application to FO[<]; application to FO[+1]. Part 5 Finite semigroups: the syntactic monoid; calculation of the syntactic monoid; application to FO[<]; semidirect products; categories and path conditions; pseudovarieties. Part 6 First-order logic: characterization of FO[<]; a hierarchy in FO[<]; another characterization of FO[+1]; sentences with regular numerical predicates. Part 7 Modular quantifiers: definition and examples; languages in (FO + MOD(P))[<]; languages in (FO + MOD)[+1]; languages in (FO + MOD)[Reg]; summary. Part 8 Circuit complexity: examples of circuits; circuits and circuit complexity classes; lower bounds. Part 9 Regular languages and circuit complexity: regular languages in NC1; formulas with arbitrary numerical predicates; regular languages and non-regular numerical predicates; special cases of the central conjecture. Appendices: proof of the Krohn-Rhodes theorem; proofs of the category theorems.
Article
In order to study circuit complexity classes within NC1 in a uniform setting, we need a uniformity condition which is more restrictive than those in common use. Two such conditions, stricter than NC1 uniformity, have appeared in recent research: Immerman's families of circuits defined by first-order formulas and a uniformity corresponding to Buss' deterministic log-time reductions. We show that these two notions are equivalent, leading to a natural notion of uniformity for low-level circuit complexity classes. We show that recent results on the structure of NC1 still hold true in this very uniform setting. Finally, we investigate a parallel notion of uniformity, still more restrictive, based on the regular languages. Here we give characterizations of subclasses of the regular languages based on their logical expressibility, extending recent work of Straubing, Thérien, and Thomas. A preliminary version of this work appeared in “Structure of Complexity Theory: Third Annual Conference” pp. 47–59, IEEE Comput. Soc., Washington, DC, 1988.
Article
There is a significant amount of interest in combining and extending database and information retrieval technologies to manage textual data. The challenge is becoming more relevant due to increased availability of documents in digital form. Document data has a natural hierarchical structure, which may be made explicit due to the use of markup conventions (as with SGML). An important aspect of managing structured and semistructured textual data consists of supporting the efficient retrieval of text components based both on their content and on their structure. In this paper we study issues related to the expressive power and optimization of a class of algebras that support combining string (or pattern) searches with queries on the hierarchical structure of the text. The region algebra studied is a set-at-a-time algebra for manipulating text regions (substrings of the text) that supports finding out nesting and ordering properties of the text regions. This algebra is part of the language in use in commercial text retrieval systems and can form the basis for supporting SQL-like access to textual data. By presenting a close relationship between the region algebra and the monadic first order theory of finite binary trees, we show that queries in the algebra can be optimized, in the sense that equivalence to less expensive expressions can be tested. This optimization can be difficult (co-NP-hard in the general case), but there is an important class of queries that can be optimized in polynomial time. On the negative side, we show that the language is incapable of capturing some important properties of the text structure, related to the nesting and ordering of text regions. We conclude by suggesting possible extensions to increase the expressive power of the language and consider one such example.
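To make the idea of a set-at-a-time algebra over text regions concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the operator set defined in the article. Regions are assumed to be half-open `(start, end)` offsets into the text, and the names `find`, `including`, `included_in`, and `before` are hypothetical.

```python
from typing import Set, Tuple

Region = Tuple[int, int]  # half-open interval [start, end) into the text


def find(text: str, word: str) -> Set[Region]:
    """Pattern search: the set of all regions where word occurs in text."""
    regions: Set[Region] = set()
    start = text.find(word)
    while start != -1:
        regions.add((start, start + len(word)))
        start = text.find(word, start + 1)  # allow overlapping occurrences
    return regions


def contains(outer: Region, inner: Region) -> bool:
    """Nesting test: inner lies entirely within outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def including(a: Set[Region], b: Set[Region]) -> Set[Region]:
    """Regions of a that contain at least one region of b."""
    return {r for r in a if any(contains(r, q) for q in b)}


def included_in(a: Set[Region], b: Set[Region]) -> Set[Region]:
    """Regions of a that are contained in at least one region of b."""
    return {r for r in a if any(contains(q, r) for q in b)}


def before(a: Set[Region], b: Set[Region]) -> Set[Region]:
    """Regions of a that end no later than some region of b starts."""
    return {r for r in a if any(r[1] <= q[0] for q in b)}
```

Each operator takes sets of regions and returns a set of regions, so expressions compose freely; for example, `before(find(text, "ab"), find(text, "cd"))` selects the occurrences of "ab" that precede some occurrence of "cd".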
Article
We consider relational databases organized over an ordered domain with some additional relations; a typical example is the ordered domain of rational numbers together with the operation of addition. The focus of our study is the first-order (FO) queries that are invariant under order-preserving “permutations”; such queries are called order-generic. It has recently been discovered that for some domains order-generic FO queries fail to express more than pure order queries. For example, every order-generic FO query over rational numbers with + can be rewritten without +. For some other domains, however, this is not the case. We provide very general conditions on the FO theory of the domain that ensure the collapse of order-generic extended FO queries to pure order queries over this domain: the Pseudo-finite Homogeneity Property and a stronger Isolation Property. We further distinguish one broad class of domains satisfying the Isolation Property, the so-called quasi-o-minimal domains. This class includes all the o-minimal domains, but also the ordered group of integer numbers and the ordered semigroup of natural numbers, and some other domains. An important difference of this paper from the recent series of related papers is that we generalize all the notions to the case of finitely representable database states (as opposed to finite states) and develop a general lifting technique that, essentially, allows us to extend any result of the kind we are interested in, from finite to finitely representable states. We show, however, that these results cannot be transferred to arbitrary infinite states.
Article
This paper develops a query language for sequence databases, such as genome databases and text databases. The language, called Sequence Datalog, extends classical Datalog with interpreted function symbols for manipulating sequences. It has both a clear operational and declarative semantics, based on a new notion called the extended active domain of a database. The extended domain contains all the sequences in the database and all their subsequences. This idea leads to a clear distinction between safe and unsafe recursion over sequences: safe recursion stays inside the extended active domain, while unsafe recursion does not. By carefully limiting the amount of unsafe recursion, the paper develops a safe and expressive subset of Sequence Datalog. As part of the development, a new type of transducer is introduced, called a generalized sequence transducer. Unsafe recursion is allowed only within these generalized transducers. Generalized transducers extend ordinary transducers by allowing them to invoke other transducers as “subroutines.” Generalized transducers can be implemented in Sequence Datalog in a straightforward way. Moreover, their introduction into the language leads to simple conditions that guarantee safety and finiteness. This paper develops two such conditions. The first condition expresses exactly the class of PTIME sequence functions, and the second expresses exactly the class of elementary sequence functions.
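The extended active domain described above can be sketched in a few lines of Python. This is an illustration, not the paper's definition. It assumes that sequences are Python strings, that "subsequences" means contiguous subsequences, and that the empty sequence is included. The function names are hypothetical.

```python
from typing import Iterable, Set


def extended_active_domain(db: Iterable[str]) -> Set[str]:
    """All sequences in db together with all their contiguous subsequences.

    Assumption: the empty sequence counts as a subsequence of every sequence.
    """
    domain: Set[str] = set()
    for seq in db:
        n = len(seq)
        for i in range(n + 1):
            for j in range(i, n + 1):
                domain.add(seq[i:j])
    return domain


def stays_inside(results: Iterable[str], db: Iterable[str]) -> bool:
    """Check that every derived sequence lies in the extended active domain."""
    domain = extended_active_domain(db)
    return all(r in domain for r in results)
```

Under these assumptions, taking a suffix of a stored sequence stays inside the extended active domain, while concatenating a stored sequence with itself generally does not. This matches the abstract's distinction between safe and unsafe recursion.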