https://doi.org/10.1007/s00778-021-00693-2
SPECIAL ISSUE PAPER
Data-induced predicates for sideways information passing in query optimizers
Srikanth Kandula · Laurel Orr · Surajit Chaudhuri
Microsoft, Redmond, WA, USA (correspondence: srikanth@microsoft.com)
The VLDB Journal (2022) 31:1263–1290. Published online: 29 August 2021
Received: 30 November 2020 / Revised: 30 June 2021 / Accepted: 8 July 2021
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2021
Abstract
Using data statistics, we convert predicates on a table into data-induced predicates (diPs) that apply on the joining tables. Doing so substantially speeds up multi-relation queries because the benefits of predicate pushdown can now apply beyond just the tables that have predicates. We use diPs to skip data exclusively during query optimization; i.e., diPs lead to better plans and have no overhead during query execution. We study how to apply diPs for complex query expressions and how the usefulness of diPs varies with the data statistics used to construct diPs and the data distributions. Our results show that building diPs using zone-maps which are already maintained in today’s clusters leads to sizable data skipping gains. Using a new (slightly larger) statistic, 50% of the queries in the TPC-H, TPC-DS and JoinOrder benchmarks can skip at least 33% of the query input. Consequently, the median query in a production big-data cluster finishes roughly 2× faster.
Keywords Data-induced predicates · Query optimization · Sideways-information passing · Range sets · Zone maps · Data skipping · Partition elimination · Query processing · Efficiency · Data-parallel clusters · Big-data systems
1 Introduction
In this paper, we seek to extend the benefits of predicate pushdown beyond just the tables that have predicates. Consider the following fragment of TPC-H query #17 [19].
SELECT SUM(l_extendedprice)
FROM lineitem
JOIN part ON l_partkey = p_partkey
WHERE p_brand = ':1' AND p_container = ':2'
The lineitem table is much larger than the part table, but because the query predicate uses columns that are only available in part, predicate pushdown cannot speed up the scan of lineitem. However, scanning the entire lineitem table is wasteful if only a small number of its rows join with the rows from part that satisfy the predicate.
If the predicate were on the column used in the join condition, _partkey, a variety of techniques would become applicable (e.g., algebraic equivalence [56], magic set
rewriting [53,77], or value-based pruning [86]). For instance, a filter such as p_partkey < 1000 transfers directly to l_partkey through the equi-join. But predicates over join columns are rare,¹ and these techniques do not apply when the predicates use columns that do not exist in the joining tables.
Some systems implement a form of sideways information passing over joins [21,72] during query execution. For example, they may build a bloom filter over the values of the join column _partkey in the rows that satisfy the predicate on the part table and use this bloom filter to skip rows from the lineitem table. Unfortunately, this technique only applies during query execution, does not easily extend to general joins, and has high overheads, especially during parallel execution on large datasets, because constructing the bloom filter becomes a scheduling barrier, delaying the scan of lineitem until the bloom filter has been constructed.
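To make this execution-time technique concrete, the following is a minimal Python sketch; the Bloom filter sizing and hashing, the row format, and the function names are illustrative assumptions rather than any particular system's implementation.

# Minimal sketch of execution-time sideways information passing with a
# Bloom filter; parameters and row layout are illustrative only.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        for i in range(self.num_hashes):
            h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8)
            yield int.from_bytes(h.digest(), "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

def filtered_lineitem_scan(part_rows, lineitem_rows, brand, container):
    # Build the filter over the join-column values of the part rows that
    # satisfy the predicate ...
    bf = BloomFilter()
    for row in part_rows:
        if row["p_brand"] == brand and row["p_container"] == container:
            bf.add(row["p_partkey"])
    # ... then use it to drop lineitem rows before the join. The scan of
    # lineitem cannot start until the filter is fully built, which is the
    # scheduling barrier mentioned above.
    return (row for row in lineitem_rows if bf.may_contain(row["l_partkey"]))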
We seek a method that can convert predicates on a table to data skipping opportunities on joining tables even if the predicate columns are absent in other tables. Moreover, we seek a method that applies exclusively during query plan generation in order to limit overheads during query execution. Finally,
¹ Over all the queries in the TPC-H [28] and TPC-DS [26] benchmarks, there are zero predicates on join columns, perhaps because join columns tend to be opaque system-generated identifiers.
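To illustrate the alternative pursued here, the following is a minimal Python sketch of deriving a data-induced predicate from zone-maps. It assumes each partition carries a (min, max) range per column and simplifies to a single-column equality predicate over a two-table equi-join; the data layout and function names are illustrative, not the paper's implementation.

# Minimal sketch of plan-time data skipping via a data-induced predicate.
# Each partition is modeled as a dict mapping a column name to its
# (min, max) range, e.g. {"p_brand": ("Brand#11", "Brand#11"),
# "p_partkey": (100, 4999)}; this layout is an assumption for illustration.

def surviving_partitions(part_zone_maps, pred_col, pred_val):
    # part partitions whose range on the predicate column could contain the
    # constant; every other partition cannot satisfy the predicate.
    return [z for z in part_zone_maps
            if z[pred_col][0] <= pred_val <= z[pred_col][1]]

def induced_predicate(part_zone_maps, pred_col, pred_val, join_col):
    # The union of join-column ranges from the surviving part partitions is
    # the data-induced predicate (diP) on the join column.
    return [z[join_col]
            for z in surviving_partitions(part_zone_maps, pred_col, pred_val)]

def partitions_to_scan(lineitem_zone_maps, join_col, dip_ranges):
    # Keep only lineitem partitions whose join-column range overlaps some
    # diP range; the rest cannot produce join output and are dropped from
    # the plan before execution starts.
    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]
    return [z for z in lineitem_zone_maps
            if any(overlaps(z[join_col], r) for r in dip_ranges)]

Because the elimination above consults only partition metadata, it can run entirely inside the optimizer and adds no work at execution time, which is the sense in which diPs skip data exclusively during query optimization.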