Content available from The VLDB Journal
https://doi.org/10.1007/s00778-021-00693-2
SPECIAL ISSUE PAPER
Data-induced predicates for sideways information passing in query
optimizers
Srikanth Kandula¹ · Laurel Orr¹ · Surajit Chaudhuri¹
Received: 30 November 2020 / Revised: 30 June 2021 / Accepted: 8 July 2021
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2021
Abstract
Using data statistics, we convert predicates on a table into data-induced predicates (diPs) that apply on the joining tables.
Doing so substantially speeds up multi-relation queries because the benefits of predicate pushdown can now apply beyond
just the tables that have predicates. We use diPs to skip data exclusively during query optimization; i.e., diPs lead to better
plans and have no overhead during query execution. We study how to apply diPs for complex query expressions and how
the usefulness of diPs varies with the data statistics used to construct diPs and the data distributions. Our results show that
building diPs using zone maps, which are already maintained in today’s clusters, leads to sizable data skipping gains. Using a
new (slightly larger) statistic, 50% of the queries in the TPC-H, TPC-DS and JoinOrder benchmarks can skip at least 33% of
the query input. Consequently, the median query in a production big-data cluster finishes roughly 2× faster.
Keywords Data-induced predicates · Query optimization · Sideways-information passing · Range sets · Zone maps · Data
skipping · Partition elimination · Query processing · Efficiency · Data-parallel clusters · Big-data systems
1 Introduction
In this paper, we seek to extend the benefits of predicate pushdown
beyond just the tables that have predicates. Consider
the following fragment of TPC-H query #17 [19].
SELECT SUM(l_extendedprice)
FROM lineitem
JOIN part ON l_partkey = p_partkey
WHERE p_brand=':1' AND p_container=':2'
The lineitem table is much larger than the part table,
but because the query predicate uses columns that are only
available in part, predicate pushdown cannot speed up the
scan of lineitem. However, it is easy to see that scanning
the entire lineitem table will be wasteful if only a small
number of those rows will join with the rows from part that
satisfy the predicate on part.
If the predicate were instead on the column used in the join
condition, _partkey, then a variety of techniques would become
applicable (e.g., algebraic equivalence [56], magic set rewriting
[53,77] or value-based pruning [86]), but predicates over
join columns are rare,¹ and these techniques do not apply
when the predicates use columns that do not exist in the joining
tables.
Srikanth Kandula
srikanth@microsoft.com
¹ Microsoft, Redmond, WA, USA
Some systems implement a form of sideways information
passing over joins [21,72] during query execution. For
example, they may build a bloom filter over the values of
the join column _partkey in the rows that satisfy the predicate
on the part table and use this bloom filter to skip
rows from the lineitem table. Unfortunately, this technique
only applies during query execution, does not easily
extend to general joins and has high overheads, especially
during parallel execution on large datasets, because constructing
the bloom filter becomes a scheduling barrier: the scan of
lineitem cannot start until the bloom filter has been
constructed.
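The bloom-filter form of sideways information passing described above can be sketched as follows. This is a minimal illustration and not the implementation of any particular system: the BloomFilter class, the sum_extendedprice helper, the toy rows and the filter size are all assumptions made for this example; only the column names come from the query fragment.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # num_bits and num_hashes are arbitrary choices for this sketch.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# Hypothetical toy tables; p_partkey is unique in part.
part = [
    {"p_partkey": 1, "p_brand": "Brand#1", "p_container": "SM BOX"},
    {"p_partkey": 2, "p_brand": "Brand#2", "p_container": "LG BOX"},
    {"p_partkey": 3, "p_brand": "Brand#1", "p_container": "SM BOX"},
]
lineitem = [
    {"l_partkey": 1, "l_extendedprice": 10.0},
    {"l_partkey": 2, "l_extendedprice": 20.0},
    {"l_partkey": 3, "l_extendedprice": 30.0},
    {"l_partkey": 2, "l_extendedprice": 40.0},
]


def sum_extendedprice(part, lineitem, brand, container):
    # Step 1: scan part, apply the predicate, and build the filter over
    # the join column. The lineitem scan cannot begin before this ends,
    # which is the scheduling barrier mentioned in the text.
    bloom = BloomFilter()
    matching_keys = set()
    for row in part:
        if row["p_brand"] == brand and row["p_container"] == container:
            bloom.add(row["p_partkey"])
            matching_keys.add(row["p_partkey"])

    # Step 2: scan lineitem and skip rows that fail the filter.
    total, probed = 0.0, 0
    for row in lineitem:
        if not bloom.might_contain(row["l_partkey"]):
            continue  # skipped without probing the join
        probed += 1
        if row["l_partkey"] in matching_keys:  # exact join check
            total += row["l_extendedprice"]
    return total, probed


total, probed = sum_extendedprice(part, lineitem, "Brand#1", "SM BOX")
print(total, probed)
```

Note that every lineitem row is still read and tested against the filter at execution time; the filter only avoids the join probe for rows that cannot match.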
We seek a method that can convert predicates on a table to
data skipping opportunities on joining tables even if the predicate
columns are absent in other tables. Moreover, we seek a
method that applies exclusively during query plan generation
in order to limit overheads during query execution. Finally,
¹ Over all the queries in the TPC-H [28] and TPC-DS [26] benchmarks,
there are zero predicates on join columns, perhaps because join columns
tend to be opaque system-generated identifiers.
The VLDB Journal (2022) 31:1263–1290
/ Published online: 29 August 2021