CS2: A New Database Synopsis for Query Estimation
Feng Yu
Southern Illinois University
Carbondale, IL, 62901
fyu@cs.siu.edu
Wen-Chi Hou
Southern Illinois University
Carbondale, IL, 62901
hou@cs.siu.edu
Cheng Luo
Coppin State University
Baltimore, MD 21216
cluo@coppin.edu
Dunren Che
Southern Illinois University
Carbondale, IL, 62901
dche@cs.siu.edu
Mengxia Zhu
Southern Illinois University
Carbondale, IL, 62901
mzhu@cs.siu.edu
ABSTRACT
Fast and accurate estimations for complex queries are pro-
foundly beneficial for large databases with heavy workloads.
In this research, we propose a statistical summary for a
database, called CS2 (Correlated Sample Synopsis), to pro-
vide rapid and accurate result size estimations for all queries
with joins and arbitrary selections. Unlike the state-of-the-
art techniques, CS2 does not completely rely on simple ran-
dom samples, but mainly consists of correlated sample tuples
that retain join relationships with less storage. We introduce
a statistical technique, called reverse sample, and design a
powerful estimator, called reverse estimator, to fully utilize
correlated sample tuples for query estimation. We prove
both theoretically and empirically that the reverse estima-
tor is unbiased and accurate using CS2. Extensive exper-
iments on multiple datasets show that CS2 is fast to con-
struct and derives more accurate estimations than existing
methods with the same space budget.
Categories and Subject Descriptors
H.2.4 [Database Management]: Systems, Query Processing
General Terms
Theory
Keywords
Query Optimization, Database Synopsis, Selectivity Estima-
tion
1. INTRODUCTION
Query optimization has been a central research topic of
Database Management Systems for many years [14, 25, 26,
30]. It aims to find the most efficient execution plans for
queries.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SIGMOD’13, June 22–27, 2013, New York, New York, USA.
Copyright 2013 ACM 978-1-4503-2037-5/13/06 ...$15.00.

A query optimizer generally uses statistics about
the database [22, 24], and assumptions about the attribute
values [5, 25] to make cost estimations of alternative plans
and selects the best one for execution. It is important to
find the most efficient plans because studies [5, 19] have
shown that executions with sub-optimal plans can be orders
of magnitude slower than those with optimal plans. Un-
fortunately, due to the accuracy of the statistics stored in
the database and the validity of assumptions made, query
optimizers often cannot find the best plans in their search
spaces for the queries. On the other hand, storing accurate
statistics can require prohibitively large amounts of space.
Thus, constructing a concise, accurate database synopsis has
been a central issue of query optimization for decades. In
this research, we are mainly concerned with queries with
arbitrary select and equi-join operations.
There has been much effort to construct concise, accurate
database synopses in the past. Among them, histograms
[8, 15, 22, 24] have attracted the most attention. They are
simple and effective for estimating low-dimensional queries,
where the numbers of attributes of interest in a relation,
including the selection and join attributes, are small. How-
ever, as the dimension increases, the storage space required
by a multi-dimensional histogram can grow dramatically.
In addition, the computations and maintenance of multi-
dimensional histograms become very complex [21]. Some
techniques, such as discrete cosine transform, wavelets, sket-
ches [2, 16, 20], etc., have been proposed to compress the
histograms. Unfortunately, these size reduction techniques
become ineffective as the dimension increases.
There is also a long history of using sampling in databases
[13, 12, 11, 18, 31]. With a small amount of sample tuples,
accurate selectivity estimations can be obtained for select
queries with arbitrary selection predicates. However, for
joins, large estimation errors were observed when indepen-
dent random samples of relations are used. To overcome
this problem, Acharya et al. [1] stored a join synopsis for
each relation. The synopsis for a relation is a simple ran-
dom sample of the result of the joins of the relation and
all its descendant relations by the foreign-key relationships.
Each join synopsis can then be used to generate simple ran-
dom samples of the results of all subqueries that involve the
source relation, and thus are very useful for estimations of
a large set of queries.
Chaudhuri et al. [3] proposed using stratified sampling
to improve estimations. However, workloads must be
known beforehand for stratified sampling to be effective.
Moreover, the method is only applicable to “star queries”
with foreign-key joins. Wu et al. [31] used cumulative
distributions to make estimations, but their method mainly
focuses on range queries over a single relation.
Spiegel et al. [27, 28] proposed an innovative approach to
summarizing the correlations among the join and other at-
tribute values in a data graph. Tuples in the graph are con-
nected to nodes that represent their respective attribute val-
ues, and to tuples with which they join. To reduce the size of
the data graph, tuples with similar join- and attribute-based
characteristics are merged into clusters, and clusters with a
similar value distribution for an attribute share the same
(1-dimensional) histogram. This class of graph-based syn-
opses, called TuG (Tuple Graph) synopses, can be applied
to queries with joins of many-to-many and many-to-one re-
lationships [29], and selections with table-level predicates.
Experimental results [27, 28] showed that histogram-based
synopses are not as accurate as join synopses and TuG for
complex join queries. However, join synopses and TuG also
have some limitations and shortcomings. For example, join
synopses restrict their applications to foreign-key joins. In
addition, a join synopsis is stored for each relation in the
database, which raises concern about the total size of the
synopses. As for TuG, its accuracy is heavily dependent on
the quality of the clusterings performed. But as the num-
ber of attributes of interest increases and the relationships
among attribute values becomes more complex, clusterings
become less effective, and thus the accuracy of estimations
deteriorates dramatically. The construction of a TuG syn-
opsis is also very complex.
Motivated by the importance and continuing lack of an
effective database synopsis, we set our goal to construct
a simple, concise, and accurate database synopsis. We have
chosen to use sample tuples to construct the synopsis as in-
tricate relationships among attribute values can be naturally
embodied in the sample tuples. In addition, the space for
storing a sample tuple only increases linearly with the num-
ber of attributes (of interest) in a relation. It makes sample
tuples an effective tool for modeling multi-, and especially
high-dimensional attribute value distributions. However, in-
stead of storing independent random samples of relations,
we propose to store correlated sample tuples to preserve the
join relationships. That is, tuples of a relation are included
in the sample of a relation if they join with already drawn
sample tuples of other relations. Correlated sample tuples
from relations form the correlated sample synopsis (CS2),
which will be used for query size estimation, as well as ap-
proximate aggregate query answering.
While both the join synopses and the correlated sample
synopsis attempt to capture relationships between tuples of
different relations, there are many differences. The join syn-
opses store samples of the results of joins, while the cor-
related sample synopsis stores a sample of correlated tu-
ples from the relations. Besides this philosophical differ-
ence, the correlated sample synopsis is also more general
than the join synopses. The join synopses are applicable
only to the foreign-key join relationships, while the corre-
lated sample synopsis can be applied to all types of join
relationships, including many-to-one and many-to-many re-
lationships. The join synopses assume no dangling tuples
exist in the relations, but no such assumption is needed for
the correlated sample synopsis. The join synopses rely on
the simple random samples to make estimation and thus re-
quire constructing up to n synopses, where n is the number
of relations in the join graph. The correlated sample synop-
sis can estimate with both random and correlated samples
and needs just one synopsis for the entire database. This
potential storage space advantage can translate into higher
estimation accuracy for the correlated sample synopsis.
While TuG covers all types of joins, the selections are re-
stricted to table-level selections. It cannot accommodate
predicates like “R.age=30 OR S.salary >50K” that cannot
be pushed down to the individual tables. As for the cor-
related sample synopsis, there is no such restriction; it can
be applied to any selection and join. In addition, selectivity
estimates with statistical confidence levels and intervals can
also be provided by using the correlated sample synopsis.
In this research, we propose to store a set of correlated
sample tuples as the database synopsis. We have developed
a powerful estimator, called the reverse estimator, that can
fully utilize the correlated sample tuples to make accurate
estimations, without relying on simple random sampling.
The proposed reverse estimator represents a breakthrough
in utilizing samples to make estimation. We have imple-
mented the join synopses, TuG, and the correlated sample
synopsis to compare their performance. The experimental
results showed that both the join synopses (JS) and corre-
lated sample synopsis (CS2) were much easier and several
orders faster to construct than the TuG. In addition, they
also generated more accurate estimates than TuG, with the
correlated sample synopsis generating the most accurate es-
timates. The experimental results confirmed that the cor-
related sample synopsis is a simple, concise, and accurate
summary of the database, not to mention that it is also the
most general synopsis, applicable to queries with all types
of join relationships and arbitrary selection conditions. It
is the desired database synopsis that researchers have long
been searching for.
The rest of the paper is organized as follows. Section 2 in-
troduces some background knowledge. Section 3 discusses
the construction of a correlated sample synopsis. In Sec-
tion 4, we describe selectivity estimation for queries with
acyclic join graphs, and in Section 5 for cyclic join graphs.
Section 6 presents a method to improve the estimation ac-
curacy of commonly posted queries. Section 7 presents the
experimental results, and Section 8 concludes the paper.
2. PRELIMINARY
In this section, we define terminology that will be used to
model the join relationships in a database. For simplicity,
here we focus on equi-joins. The framework can be extended
to other joins easily. Hereafter, a join refers to an equi-join.
2.1 Join Graphs
Definition 1. (Join Graph of a Database) The join
graph G(V, E) of a database consists of a set of vertices
V = {R1, R2, ..., Rn} and a set of edges E, with each vertex
denoting a relation in the database and each edge
⟨Ri, Rj⟩ ∈ E, Ri, Rj ∈ V, denoting the existence of a join
relationship between Ri and Rj.

In this research, we make no assumption about the mul-
tiplicity of the join relationships; that is, they can be one-
to-one, many-to-one, or many-to-many (often realized by
many-to-one relationships in the relational model).
Figure 1: A Join Graph

Example 1. Figure 1 shows the join graph of a database
where there are join attributes between R1 and R2, R2 and
R3, and R2 and R4.
For simplicity and without loss of generality, we shall as-
sume there is at most one join relationship between a pair
of relations and each join relationship involves only one join
attribute. The discussions here can be easily extended to
general cases.
Definition 2. (Join Graph of a Query) Let Q be a
query over a set of relations V′ ⊆ V. The join graph of
Q is a subgraph of G, denoted by G′(V′, E′), where V′ ⊆ V,
E′ ⊆ E, with each edge ⟨Ri, Rj⟩ ∈ E′ corresponding to a
join between Ri and Rj in Q.
If the join graph of a query or a database is disconnected,
we can consider each connected component separately and
then merge them by Cartesian products. Therefore, we shall
assume hereafter all join graphs are connected.
Definition 3. (Joinable Relations) A pair of relations
Ri and Rk, i ≠ k, is said to be joinable, denoted by J(Ri, Rk),
if there is a path of length ≥ 1 between Ri and Rk in the
join graph of the query. If there is an edge ⟨Ri, Rk⟩ ∈ E,
we say Ri and Rk are directly joinable; otherwise, they are
indirectly joinable.
Definition 4. (Joinable Tuples) Assume Ri and Rk are
a pair of joinable relations with a path ⟨Ri, Ri+1, ..., Rk−1, Rk⟩.
A pair of tuples ti ∈ Ri and tk ∈ Rk are said to be joinable,
denoted by J(ti, tk), if ti can find a match ti+1 in Ri+1, ti+1
can find a match ti+2 in Ri+2, ..., and tk−1 can find a match
tk in Rk. Joinable tuples ti ∈ Ri and tk ∈ Rk are directly
joinable if Ri and Rk are directly joinable and they can
match directly; otherwise, they are indirectly joinable.
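Definition 4 can be read operationally as a hop-by-hop propagation of matches along the path. The following sketch (our own toy encoding of relations and join attributes, not code from the paper) checks whether two tuples are joinable under the paper's simplifying assumption of one join attribute per relationship:

```python
def joinable(relations, edges, path, t_start, t_end):
    """relations: name -> {tuple_id: {attr: value}};
    edges: (Ra, Rb) -> name of the single join attribute shared by Ra and Rb.
    Returns True iff t_start in path[0] and t_end in path[-1] are joinable."""
    frontier = {t_start}                        # tuple ids matched so far
    for ra, rb in zip(path, path[1:]):
        attr = edges[(ra, rb)]
        vals = {relations[ra][t][attr] for t in frontier}
        frontier = {t for t, row in relations[rb].items() if row[attr] in vals}
        if not frontier:                        # the chain of matches broke
            return False
    return t_end in frontier
```

For instance, with three toy relations chained by join attributes x and y, `joinable(..., ["R1", "R2", "R3"], 1, "A")` reports whether tuple 1 of R1 and tuple A of R3 are (indirectly) joinable.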
3. CORRELATED SAMPLE SYNOPSIS CONSTRUCTION
Sample tuples naturally embody the intricate relation-
ships of attribute values and thus can be good represen-
tatives of the underlying tuples of a relation. They are very
effective in estimating the result sizes of select queries with
arbitrary selection conditions [13]. Unfortunately, indepen-
dent simple random samples, each of which consists of ran-
domly drawn tuples from a participant relation of a query,
generated a large error in join size estimation [13]. We ob-
served that the large errors were caused by the independence
of the sample tuples, disregarding the join relationships be-
tween tuples of participant relations. Preserving the join
relationships may hold the key to the construction of an ef-
fective sample synopsis. With this principle in mind, we
design a simple sampling scheme, called correlated sam-
pling, that preserves the join relationships of tuples.
Table 1: Notation used in the paper

Symbol(s)            Meaning
Ri                   Relation
t ∈ Ri               A tuple t in relation Ri
G(V, E)              Join graph of a database
⟨Ri, Rk⟩             Edge in the join graph
J(Ri, Rk)            Ri and Rk are joinable relations
J(ti, tk)            ti and tk are joinable tuples
SRSWOR               Simple random sample without replacement
S′i                  SRSWOR of Ri
Si                   Sample relation
S                    Correlated sample synopsis (CS2)
(...)                Joins without order
Y                    Query result size
Yi                   Number of result tuples generated by ti ∈ Rh
yi                   Number of result tuples generated by ti ∈ Sh
Ŷ, Ŷp, ŶRV           Estimations of Y
ri                   Joinable tuples sampled ratio (JR)
Zi                   Standard Total Number of Result Tuples (STNRT)
{zi}_{i=1}^{n1}      Reverse sample
R_h^J, R_h^{Jc}      Joinable and non-joinable partitions of Rh
3.1 Correlated Sampling
Correlated Sampling aims to gather a sample of joinable
sample tuples from joinable relations in the database to form
a synopsis, denoted by S.
Definition 5. (Correlated Sampling) First, a relation,
say R1, is chosen as the source relation, from which a
simple random sample without replacement (SRSWOR) [6]
is collected into another relation, denoted by S1. We include
S1 into the synopsis, S, and mark R1 as visited. By
induction, when Ri (i > 1) is visited and Si is the sample of
Ri, we follow an unvisited edge ⟨Ri, Rj⟩ in the join graph of
the database. Instead of a SRSWOR, we collect the tuples
in Rj that are directly joinable with the tuples in Si. We
include these correlated tuples into Sj, and repeat this
procedure until all relations in the join graph are exhausted
and all tuples directly or indirectly joinable with S1 are
collected. Finally, the synopsis S = {S1, ..., Sn} is called
a Correlated Sample Synopsis, or CS2.

Note that, to emphasize that S1 is a SRSWOR of R1, we
use S′1 instead of S1 in the sequel. The sample relation S′1
and the other Sj's are called sample relations, and their
tuples are called sample tuples. A sample relation Sj which
is not a SRSWOR is called a correlated sample relation,
the tuples of Sj are called correlated sample tuples, and
Rj is called a non-source relation.
Algorithm 1 describes the general idea of correlated
sampling. We shall refine it later.
3.1.1 Sampling Path and Source Relation Selection
When choosing sampling order along the join graph, al-
though one can follow any join relationship, e.g., a many-to-
one, one-to-many, or many-to-many relationship, to gather
correlated sample tuples, we suggest following a many-to-
one relationship from a visited relation to another relation
(visited or not). The rationale is that following one-to-many
(and perhaps many-to-many) relationships can easily explode
the size of the synopsis, and thus reduce the number of
sample tuples that can be drawn from the source relation.

Algorithm 1 Correlated Sampling
Require: G — Join Graph of the Database; n1 — Sample Size
 1: Ra = Source_Selection(G)  //source relation selection
 2: S′a = ∅; Si = ∅ (i ≠ a)
 3: S′a = SRSWOR(Ra, n1)  //collect simple random sample
 4: W = {Ra}  //mark Ra as visited
 5: while ∃ an unvisited edge ⟨Ri, Rj⟩ with Ri ∈ W do
 6:   Sj = Sj ∪ {t | t ∈ Rj, t is directly joinable with a tuple
      in Si}  //collect correlated sample tuples
 7:   W = W ∪ {Rj}  //mark Rj as visited
 8: end while
 9: S = {S′a} ∪ (∪_{j≠a} {Sj})
10: return S — Correlated Sample Synopsis
The same reasoning applies to the selection of the source
relation. Thus, we prefer a relation that can reach the most
relations, through only the many-to-one relationships di-
rectly or indirectly, to be the source relation. It is possi-
ble that the source relation may not reach all the relations
in the graph through only the many-to-one relationships.
Therefore, one may have to add additional source relations
to cover unvisited relationships and relations.
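As a concrete illustration, Algorithm 1 can be sketched in a few lines of Python. This is our own minimal encoding (relations as lists of dicts, edges carrying their single join attribute), not the paper's implementation, and it ignores the source-selection heuristic by taking the source relation as a parameter:

```python
import random

def correlated_sampling(relations, edges, source, n1, seed=0):
    """relations: name -> list of tuple-dicts; edges: list of
    (Ra, Rb, join_attr) triples from the join graph; n1: SRSWOR size.
    Follows Algorithm 1: take a SRSWOR of the source relation, then
    repeatedly follow an unvisited edge out of a visited relation,
    collecting the tuples that join directly with the sample so far."""
    rng = random.Random(seed)
    samples = {name: [] for name in relations}
    samples[source] = rng.sample(relations[source], n1)  # S'_a = SRSWOR(Ra, n1)
    visited = {source}
    remaining = list(edges)
    progress = True
    while progress:                     # stop when no unvisited edge is reachable
        progress = False
        for edge in list(remaining):
            ra, rb, attr = edge
            if rb in visited and ra not in visited:
                ra, rb = rb, ra         # follow the edge from its visited side
            if ra not in visited:
                continue
            vals = {t[attr] for t in samples[ra]}
            for t in relations[rb]:     # correlated sample tuples of Rb
                if t[attr] in vals and t not in samples[rb]:
                    samples[rb].append(t)
            visited.add(rb)
            remaining.remove(edge)
            progress = True
    return samples
```

On a toy chain R1–R2–R3, each non-source sample ends up containing exactly the tuples directly joinable with the sample of its parent, as Definition 5 requires.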
3.1.2 Sample Size and Synopsis Maintenance
In Step (3) of Algorithm 1, we assume, for simplicity, that
the sample size n1 from the source relation is known before-
hand. However, in reality, the sample size that meets the
synopsis space requirement may not be known until the end
of the sampling process. Given a synopsis space budget, the
simplest way is to draw one tuple at a time from the source
relation and collect related tuples along the sampling path.
This process continues until the space budget is used up. If
only the many-to-one relationships are followed in the sam-
pling, one can also draw a number of sample tuples at a time
until there is no budget left.
The correlated sample synopsis can be updated using ex-
isting algorithms [10]. Alternatively, since the construction
of a synopsis takes just a few seconds (e.g., 10 seconds, as
demonstrated in Section 7), one can simply reconstruct the
synopsis periodically to keep it up to date.
3.2 A Primitive Correlated Sample Synopsis
For simplicity, we shall name the relations following the
order in which they are sampled by the correlated sampling
scheme described. That is, R1 is the source relation, R2 is
sampled after the source relation, and then R3, R4, etc.
To explicitly specify the order, we modify the join graph into
a directed sampling order graph (as shown in Figure 2(a)),
in which a double-headed arrow from Ri to Rj indicates that
Rj is sampled based on the sample tuples of Ri, or Rj is a
child of Ri (equivalently, Ri is a parent of Rj). Ancestor
and descendant relationships in a sampling order graph are
defined in the same way as in an ordinary directed graph.
Example 2. (A Primitive CS2) Consider the order of
sampling in Figure 2(a) and the join relationships among
tuples in Figure 2(b). Each tuple is represented only by its
ID for simplicity. R1 = {1, 2, 3}, R2 = {a, b, c}, R3 = {A, B},
and R4 = {I, II}. The lines, including both solid and dashed
lines, between tuples represent the matching of join attribute
values. R1 and R2 are in a many-to-many relationship. The
solid lines indicate that their end tuples are selected into the
synopsis. That is, two tuples of R1, 2 and 3, are first drawn
into S′1 (i.e., S′1 = {2, 3}). Then b and c of R2 are selected
into S2 because their join attribute values match those in
S′1. After that, B is selected from R3 into S3 because of
matching join attribute values with b, c in S2. Finally, II of
R4 is selected into S4 because its join attribute value matches
that of a tuple in S2. Therefore, the sample synopsis S
consists of four sample relations, as shown in Figure 2(c).

Figure 2: A Primitive CS2. (a) Sampling Order Graph;
(b) Joinable Tuple IDs; (c) CS2: S′1 = {2, 3}, S2 = {b, c},
S3 = {B}, S4 = {II}.
Note that the synopsis S described here is just in its
primitive form.
Instead of storing all attributes of a tuple in the synopsis,
one may choose to store only the join attributes and those
attributes that are often referenced in selection conditions.
This would significantly reduce the size of a synopsis.
4. QUERY ESTIMATION
By evaluating queries against sample relations, we estimate
query result sizes from the sample query results. We
focus on queries that contain equi-joins and arbitrary selection
predicates. Therefore, the query result size can be obtained
by summing up the number of result tuples contributed by
each tuple of an arbitrary relation in the query. Due to space
limitations, here we consider only a database with an acyclic
join graph G(V, E), and we assume no dangling tuples exist
in the joins. Dangling tuples in the source relation will not
generate result tuples, and dangling tuples in non-source
relations are discussed in Section 4.2.3.
4.1 Source Query Estimation
A source query here refers to a query that has the source
relation, R1, as one of its operand relations.
Consider a source query over relations R1, R2, ..., Rn. Let
{S′1, S2, ..., Sn} be the respective sample relations in the
synopsis. Each result tuple of the query must be generated
by a set of n tuples {t1, ..., tn}, ti ∈ Ri, 1 ≤ i ≤ n. We say
that each ti, 1 ≤ i ≤ n, generates a result tuple. Let Yi be
the number of result tuples generated by the ith tuple
ti ∈ R1; then the Query Result Size is Y = Σ_{i=1}^{N1} Yi,
where N1 = |R1|. We propose the following unbiased
estimator for Y.
Theorem 1 (Unbiasedness of CS2). Given S′1, a
SRSWOR of the source relation R1, the estimator

    Ŷ = (N1 / n1) Σ_{i=1}^{n1} yi    (1)

is an unbiased estimator of the query result size Y, where
N1 = |R1|, n1 = |S′1|, and yi is the number of result tuples
generated by the ith tuple in S′1. Also,

    v1(Ŷ) = (N1² / n1) ε² (1 − f)

is an unbiased estimator of Var(Ŷ), where f = n1/N1 and
ε² = Σ_{i=1}^{n1} (yi − ȳ)² / (n1 − 1), with ȳ the sample
mean of the yi.
Proof. We only have to show that {yj}_{j=1}^{n1} is a
SRSWOR of {Yi}_{i=1}^{N1}; then we can apply the theorems
in [6] to prove this theorem.
First of all, yj is the total number of result tuples generated
by a tuple uj ∈ S′1 when query Q is executed on the CS2, S.
According to Algorithm 1 (correlated sampling), for any
correlated tuple vk ∈ Rj, i.e., J(vk, uj), we have vk ∈ Sj,
j > 1. That is, every result tuple generated by uj on the
original database is also generated by uj on S.
On the other hand, S is a collection of sample relations,
which are subsets of the original relations. Thus, for the
same query Q, uj cannot generate more result tuples on S
than on the original database.
Hence, given uj = ti, uj ∈ S′1, ti ∈ R1, we have yj = Yi.
Since S′1 is a SRSWOR of R1, {yj}_{j=1}^{n1} is also a
SRSWOR of {Yi}_{i=1}^{N1}.
By Theorem 2.1 and Theorem 2.4 of [6], this theorem is
proved.
Example 3. (Source Query Estimation) Consider the
database and synopsis in Figure 2. We attempt to estimate
the sizes of all queries that involve the source relation R1,
that is, (R1, R2), (R1, R2, R3), (R1, R2, R4), and
(R1, R2, R3, R4). Here, (Ri, ..., Rk) represents the joins of
Ri, ..., and Rk in no particular order.
First, we use S′1, S2, S3, and S4 to compute the sample
results of the queries:

    |(S′1, S2)| = |S′1 ⋈ S2| = 2
    |(S′1, S2, S3)| = |(S′1 ⋈ S2) ⋈ S3| = 2
    |(S′1, S2, S4)| = |(S′1 ⋈ S2) ⋈ S4| = 1
    |(S′1, S2, S3, S4)| = |((S′1 ⋈ S2) ⋈ S3) ⋈ S4| = 1

Then, the query sizes are estimated by Eq (1), where N1 =
3 and n1 = 2. So, the estimated query sizes of (R1, R2),
(R1, R2, R3), (R1, R2, R4), and (R1, R2, R3, R4) are 3, 3,
3/2, and 3/2, respectively, as compared to the exact sizes 4, 4,
2, and 2.
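Example 3's arithmetic can be replayed directly from Eq (1). The fragment below (our own illustration, not the paper's code) scales each sample result size by N1/n1, since the sum of the yi over S′1 equals the size of the query result on the sample relations:

```python
def source_estimate(N1, n1, sample_result_size):
    # Eq (1): Y_hat = (N1/n1) * sum of y_i, where the sum of y_i over S'_1
    # equals the size of the query result evaluated on the sample relations.
    return N1 / n1 * sample_result_size

# Example 3's numbers: N1 = 3, n1 = 2; sample join sizes 2, 2, 1, 1.
estimates = [source_estimate(3, 2, s) for s in (2, 2, 1, 1)]
# -> [3.0, 3.0, 1.5, 1.5], i.e., the estimates 3, 3, 3/2, 3/2 in the text.
```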
4.2 No-Source Query Estimation
A no-source query refers to a query that doesn't include
the source relation.

Example 4. (No-Source Query) In Figure 2, both
(R2, R3) and (R2, R4) are no-source queries, since neither
of their join graphs contains the source relation R1.

The highest relation in a query refers to the relation
with the least index according to the sampling order. For
instance, in Example 4, R2 is the highest relation in query
(R2, R3).
For any no-source query Q with Rh as the highest relation
in its join graph, let Yj be the total number of result tuples
generated by the jth tuple uj in Rh, and let Sh be the sample
relation of Rh. We want to estimate the query result size
Y_{Rh} = Σ_{j=1}^{Nh} Yj.
4.2.1 Estimation with Reverse Sampling
Since there is no SRSWOR of Rh, it is not straightforward
how the query result size Y_{Rh} can be derived from the
sample of Rh directly. Therefore, we “back track” to the
source relation and consider estimating Y_{Rh} from R1. We
try to establish the relationship between Y_{Rh} and some
derived values on tuples of R1 based on Yj. In order to
establish the “reverse” connection from Rh to R1, we
introduce a definition called the Joinable Tuples Sampled
Ratio, or JR.

Definition 6. (Joinable Tuples Sampled Ratio or JR)
Considering R1 as the source relation, for ui ∈ Rh, h > 1, let
Ci be the total number of tuples in R1 that are joinable with
ui. Among these Ci joinable tuples, we assume Di tuples
are drawn into S′1, 0 ≤ Di ≤ Ci. Then

    ri = Di / Ci    (2)

is called the Joinable Tuples Sampled Ratio, or JR, of ui.
Now we show how to derive the result size Y_{Rh} from
derived values on tuples of R1. We assign a value Xi to the
ith tuple, ti, of R1, as follows:

    Xi = Σ_{j=1}^{Nh} A_j^i · Yj,
    A_j^i = 1 if J(uj, ti), uj ∈ Rh, ti ∈ R1; 0 otherwise    (3)

where Nh = |Rh|, and J(uj, ti) is as in Definition 4. Xi is
called the Derived Total Number of Result Tuples, or
DTNRT, of ti in R1. Using the Cj in the JR, we adjust the
weight of Yj in Xi; the adjusted value, denoted by Zi, is

    Zi = Σ_{j=1}^{Nh} (A_j^i / Cj) · Yj    (4)

where Cj is defined in Definition 6. We call Zi the Standard
Total Number of Result Tuples, or STNRT.
Note that Zi is associated with a tuple in R1, but its value
is derived from Rh. Therefore, the values in Rh are assigned
to R1 “reversely” along the sampling path. In order to
estimate Y_{Rh}, we pick a SRSWOR of {Zi} and call it a
reverse sample.
Definition 7. (Reverse Sample) A reverse sample, denoted
by {zi}_{i=1}^{n1}, is a SRSWOR of the STNRT set
{Zi}_{i=1}^{N1} s.t. Zi is chosen into {zi}_{i=1}^{n1} if and
only if the tuple ti of R1 is sampled into S′1.

It is trivial that yj = Yi if ti ∈ Rh is sampled into Sh as
uj, in which yj is the total number of result tuples generated
by uj. Thus, according to (4), zi can be expressed using
values in Sh as follows:

    zi = Σ_{j=1}^{nh} (a_j^i / Cj) · yj,
    a_j^i = 1 if J(u′j, t′i), u′j ∈ Sh, t′i ∈ S′1; 0 otherwise    (5)
To estimate the no-source query result size Y_{Rh}, we
propose the Reverse (RV) Estimator, defined as

    Ŷ_RV = (N1 / n1) Σ_{j=1}^{nh} rj · yj    (6)

where N1 = |R1|, n1 = |S′1|, nh = |Sh|, and yj is the total
number of result tuples generated by the jth tuple in Sh.
The significance of the RV Estimator is that it can provide
unbiased estimations for no-source queries using CS2, as
shown in the following theorem.
Theorem 2 (Unbiasedness of CS2). Given a no-source
query with Rh as the highest relation in its join graph, Ŷ_RV
in Eq (6) is an unbiased estimator of the query result size
Y_{Rh}.

Proof. First of all, given R1, S′1, Rh, and Sh, we study the
sum of the Zi as defined in Eq (4):

    Σ_{i=1}^{N1} Zi = Σ_{i=1}^{N1} Σ_{j=1}^{Nh} (A_j^i / Cj) Yj
                    = Σ_{j=1}^{Nh} ((Σ_{i=1}^{N1} A_j^i) / Cj) Yj    (7)

By Definition 6, Σ_{i=1}^{N1} A_j^i = Cj. Thus (7) can be
written as Σ_{i=1}^{N1} Zi = Σ_{j=1}^{Nh} Yj = Y_{Rh}, which
means the sum of the Zi is actually equal to the query result
size Y_{Rh}. Therefore, we can estimate the sum of the Zi
for Y_{Rh}.
Second, we take a reverse sample {zi}_{i=1}^{n1}. By
Definition 7, {zi}_{i=1}^{n1} is a SRSWOR of {Zi}_{i=1}^{N1}.
Therefore, we can use Eq (1): Ŷ = (N1/n1) Σ_{i=1}^{n1} zi.
By Theorem 1, Ŷ is an unbiased estimator of Y_{Rh}.
Finally, we prove Ŷ = Ŷ_RV. We express Ŷ in terms of the yj:

    Ŷ = (N1/n1) Σ_{i=1}^{n1} zi
      = (N1/n1) Σ_{i=1}^{n1} Σ_{j=1}^{nh} (a_j^i / Cj) yj
      = (N1/n1) Σ_{j=1}^{nh} ((Σ_{i=1}^{n1} a_j^i) / Cj) yj

By Definition 6, Σ_{i=1}^{n1} a_j^i = Dj, so we have

    Ŷ = (N1/n1) Σ_{j=1}^{nh} (Dj/Cj) yj
      = (N1/n1) Σ_{j=1}^{nh} rj yj = Ŷ_RV.

The theorem is proved.
Theorem 3. The following is an unbiased estimator of the
variance of Ŷ_RV:

    v2(Ŷ_RV) = (N1(N1 − n1) / n1²) Σ_{i=1}^{n1} zi²
             − (2 N1(N1 − n1) / (n1²(n1 − 1))) Σ_{i1=1}^{n1} Σ_{i2>i1}^{n1} z_{i1} z_{i2}

where zi is defined in the proof of Theorem 2.
This theorem can be proved by (9A.37) and (9A.44) of [6].
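As a quick sanity check (ours, not the paper's), Theorem 3's expression is algebraically the classical SRSWOR variance estimator N1(N1 − n1)s²/n1 for an expanded total, where s² is the sample variance of the zi. The two forms agree numerically:

```python
from itertools import combinations
import statistics

def v2(z, N1):
    """Theorem 3's unbiased variance estimator for Y_hat_RV, evaluated
    over a reverse sample z = (z_1, ..., z_{n1})."""
    n1 = len(z)
    sum_sq = sum(zi * zi for zi in z)
    cross = sum(z[i] * z[j] for i, j in combinations(range(n1), 2))
    return (N1 * (N1 - n1) / n1**2) * sum_sq \
        - (2 * N1 * (N1 - n1) / (n1**2 * (n1 - 1))) * cross

# Any sample values work; here, arbitrary reverse-sample values.
z = [3.0, 1.5, 4.0, 2.5]
N1 = 10
classical = N1 * (N1 - len(z)) / len(z) * statistics.variance(z)
# v2(z, N1) and the classical form are equal (both 16.25 here).
```

This equality follows from expanding s² = (1/n1)Σzi² − (2/(n1(n1−1)))Σ_{i1<i2} z_{i1} z_{i2}.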
Example 5. (No-Source Query Estimation with the RV
Estimator) We attempt to use Ŷ_RV in Eq (6) to estimate
the result size of a no-source query (R2, R3) executed
on the relations in Figure 3(a), denoted by Y_{R2}. Note that
Ya = Yb = Yc = 1, because each tuple of R2 produces exactly
one result tuple. |S′1| = 2, and N1 = |R1| = 3. Thus
N1/n1 = 3/2.
Figure 3 shows the CS2 and JRs when S′1 = {2, 3} and
S2 = {b, c}. Tuples 1 and 2 of R1 are joinable with b in S2,
but only tuple 2 is selected into S′1. Thus, rb = 1/2. Similarly,
rc = 1. By Eq (6), for this given S′1, Ŷ_RV = (3/2)(Yb/2 + Yc) =
9/4. In order to verify that Ŷ_RV is unbiased, we consider all
possible SRSWORs of size 2 from R1, i.e., all possible S′1,
and calculate the average of their estimates, E(Ŷ_RV). Table 2
lists all the calculated Ŷ_RV. As observed, E(Ŷ_RV) is exactly
the actual result size.
Table 2: Samples, RV Estimates (Ŷ_RV), and JR Values

S′1      S2          JR                       Ŷ_RV
{1, 2}   {a, b}      ra = rb = 1              (3/2)(Ya + Yb) = 3
{1, 3}   {a, b, c}   ra = rc = 1, rb = 1/2    (3/2)(Ya + Yb/2 + Yc) = 15/4
{2, 3}   {b, c}      rb = 1/2, rc = 1         (3/2)(Yb/2 + Yc) = 9/4
                                              E(Ŷ_RV) = 3
Figure 3: CS2 and JRs when S′1 = {2, 3}. (a) Joinable
Tuple IDs; (b) CS2 and JRs: S′1 = {2, 3}; S2 = {b, c},
J2 = {1/2, 1}; S3 = {B}, J3 = {2/3}.
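The enumeration in Table 2 can be reproduced mechanically. The sketch below is our own encoding; the join pattern (a joins tuple 1; b joins tuples 1 and 2; c joins tuple 3) is the one implied by Figure 3 and Table 2. It evaluates Eq (6) for every possible S′1 and averages the estimates:

```python
from fractions import Fraction
from itertools import combinations

joins = {"a": {1}, "b": {1, 2}, "c": {3}}   # u in R2 -> joinable tuple ids in R1
Y = {"a": 1, "b": 1, "c": 1}                # Y_a = Y_b = Y_c = 1
N1, n1 = 3, 2

def rv_estimate(S1):
    # Eq (6): Y_hat_RV = (N1/n1) * sum over u drawn into S2 of r_u * y_u,
    # where r_u = D_u / C_u (Definition 6) and u enters S2 iff it joins S1.
    total = Fraction(0)
    for u, ids in joins.items():
        D = len(ids & set(S1))
        if D:                                # u is a correlated sample tuple
            total += Fraction(D, len(ids)) * Y[u]
    return Fraction(N1, n1) * total

estimates = [rv_estimate(set(s)) for s in combinations([1, 2, 3], 2)]
# {1,2} -> 3, {1,3} -> 15/4, {2,3} -> 9/4; the average is exactly 3 = Y_R2.
```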
4.2.2 JR Storage and Calculation
A CS2 is constructed for use by all possible queries. A JR
value is stored with each correlated sample tuple in Si(i > 1)
for use of all no-source query estimations.
As described in Algorithm 2, the JR values can be calculated in a top-down fashion following the sampling path. First, for each sample tuple $u$ in the direct children of $S_1^*$, we calculate $C_u$ and $D_u$, in which $C_u$ comprises the IDs of $u$'s joinable tuples in $R_1$, and $D_u$ comprises those in $S_1^*$. Then, each sample tuple in the other sample relations performs a union (without duplicates) of the $C_u$ and $D_u$ sets from its joinable tuples in its parent sample relation. By Definition 6, $\#()$ denotes the cardinality of a set. Therefore, the JR value of tuple $u$ is calculated as $r_u = \#(D_u)/\#(C_u)$. Finally, all the intermediate data, $C_u$ and $D_u$, are removed after the JR values are calculated.
In Algorithm 2, $C_u$ and $D_u$ are only calculated for correlated sample tuples, and they can be obtained in parallel with the correlated sampling procedure (Algorithm 3.1). Therefore, the intermediate time and space overheads can be ignored when the sample size is small. In addition, in Section 7, we show that the estimation of CS2 is fairly satisfactory using only quite a small sample.
4.2.3 Estimation Involving Non-joinable Tuples
Definition 8. (Joinable and Non-joinable Partition) Considering $R_1$ as the source relation, the joinable partition of $R_h$ (with respect to $R_1$), denoted by $R_h^J$, refers to the subset of tuples in $R_h$ that are joinable with tuples in $R_1$. The complement of $R_h^J$ in $R_h$ is called the non-joinable partition of $R_h$ (with respect to $R_1$), denoted by $R_h^{J^c}$.
By Definition 8, there is no chance for tuples in $R_h^{J^c}$ to be selected into the correlated sample synopsis. In order to estimate the result size of no-source queries with $R_h$ as the highest relation, we need to account for the number of result tuples generated by $R_h^{J^c}$.
Definition 9. (Partition Difference) For any given query Q, the total numbers of result tuples generated by $R_h^J$ and
Algorithm 2 Calculate JR
Require: $G(V, E)$: the acyclic join graph of the database;
         $S = \{S_1^*\} \cup \{S_i\}_{i=2}^{n}$: CS2 without JRs
 1: $W = \emptyset$                  // the set of visited relations
 2: // ---- Phase 1: for direct children of R1 ----
 3: $U = \{S_i \mid S_i \in S$ s.t. $R_i$ is a direct child of $R_1$ in $G\}$
 4: for all $S_i \in U$ do
 5:   for all $u \in S_i$ do
 6:     // IDs of tuples in $R_1$ and $S_1^*$ joinable with $u$
 7:     $C_u = \{t.ID \mid t \in R_1, J(t, u)\}$
 8:     $D_u = \{t.ID \mid t \in S_1^*, J(t, u)\}$
 9:     append $C_u$ to $u$
10:     append $D_u$ to $u$
11:   end for
12:   $W = W \cup \{S_i\}$           // mark $S_i$ as visited
13: end for
14: // ---- Phase 2: for other correlated samples ----
15: while $W \ne$ CS2 do
16:   pick $S_i$ whose parent $S_p$ is in $W$
17:   for all $v \in S_i$ do
18:     // all the joinable tuples of $v$ in $S_i$'s parent relation
19:     $J_v = \{u \mid u \in S_p, J(u, v)\}$
20:     $C_v = \bigcup_{u \in J_v} C_u$   // union without duplicates
21:     $D_v = \bigcup_{u \in J_v} D_u$
22:     append $C_v$ to $v$
23:     append $D_v$ to $v$
24:   end for
25:   $W = W \cup \{S_i\}$           // mark $S_i$ as visited
26: end while
27: // ---- Phase 3: calculate JR values ----
28: for all $S_i \in$ CS2 do
29:   for all $u \in S_i$ do
30:     $r_u = \#(D_u)/\#(C_u)$      // calculate JR
31:     append $r_u$ to $u$          // append JR to each sample tuple
32:     delete $C_u$, $D_u$          // release extra space
33:   end for
34: end for
35: return $S$ (CS2 with JRs calculated)
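The three phases can be sketched in a few lines, assuming the joinability maps below (which mirror Fig 3) are given explicitly; plain sets stand in for the ID lists appended to each sample tuple:

```python
R1 = {1, 2, 3}
S1_star = {2, 3}                      # SRSWOR taken from the source relation
join_R2 = {'b': {1, 2}, 'c': {3}}     # R1-tuple IDs joinable with each S2 tuple
join_R3 = {'B': {'b', 'c'}}           # S2 tuples joinable with each S3 tuple

C, D, r = {}, {}, {}

# Phase 1: direct children of R1 compute C_u (IDs in R1) and D_u (IDs in S1*)
for u, ids in join_R2.items():
    C[u] = ids & R1
    D[u] = ids & S1_star

# Phase 2: deeper samples take the duplicate-free union over joinable parents
for v, parents in join_R3.items():
    C[v] = set().union(*(C[u] for u in parents))
    D[v] = set().union(*(D[u] for u in parents))

# Phase 3: JR value r_u = #(D_u)/#(C_u); the intermediate sets are discarded
for u in C:
    r[u] = len(D[u]) / len(C[u])
# r matches Fig 3(b): r_b = 1/2, r_c = 1, and J3 gives r_B = 2/3
```

This is a toy instance over two levels; on a real join graph, Phase 2 would iterate in the top-down order picked by Algorithm 2.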
$R_h^{J^c}$ are $Y^J$ and $Y^{J^c}$, respectively. The partition difference between $R_h^J$ and $R_h^{J^c}$ with respect to Q, denoted by $\mu$, is defined as

$$\mu(R_h^J, R_h^{J^c}) = \left| \frac{n^{J^c}\, Y^J}{n^J\, Y^{J^c}} - 1 \right| \qquad (8)$$
The partition difference $\mu$ is specified for a fixed query Q. To estimate the general difference for any query with $R_h$ as the highest relation, one possible way is to select a set of sample queries and calculate an averaged $\mu(R_h^J, R_h^{J^c})$. One can also check the distributions of tuples according to the join attributes of these queries.
Given a tolerance threshold $\theta \in [0, 1]$, we claim that if $\mu(R_h^J, R_h^{J^c}) < \theta$, we can estimate $Y^{J^c}$ as

$$\hat{Y}^{J^c} = \frac{n^{J^c}}{n^J}\, \hat{Y}^J_{RV} \qquad (9)$$

where $\hat{Y}^J_{RV}$ is the RV estimator for $Y^J$. The relative error of $E(\hat{Y}^{J^c})$ is

$$\left| \frac{E(\hat{Y}^{J^c}) - Y^{J^c}}{Y^{J^c}} \right| = \left| \frac{n^{J^c}\, Y^J}{n^J\, Y^{J^c}} - 1 \right| = \mu < \theta$$

That is, when $\mu < \theta$, the relative estimation error of $\hat{Y}^{J^c}$ in Eq (9) is less than the tolerance threshold $\theta$.
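As a worked illustration of Eqs (8) and (9), using exact rational arithmetic; the partition counts below are invented purely for the example:

```python
from fractions import Fraction

n_J, n_Jc = 800, 200      # sizes of R_h^J and R_h^Jc (hypothetical)
Y_J, Y_Jc = 4000, 1100    # actual result counts per partition (hypothetical)

mu = abs(Fraction(n_Jc * Y_J, n_J * Y_Jc) - 1)   # Eq (8): mu = 1/11
theta = Fraction(1, 10)

assert mu < theta          # within tolerance, so Eq (9) applies
Y_J_hat = Y_J              # pretend the RV estimator happens to be exact here
Y_Jc_hat = Fraction(n_Jc, n_J) * Y_J_hat         # Eq (9): (200/800)*4000 = 1000

# the relative error of the Eq (9) estimate equals mu, as derived above
rel_err = abs((Y_Jc_hat - Y_Jc) / Y_Jc)
```

With these numbers the estimate 1000 misses the true 1100 by exactly $\mu = 1/11 \approx 9.1\%$, which stays under the 10% threshold.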
When $\mu(R_h^J, R_h^{J^c}) \ge \theta$, we have to estimate $R_h^{J^c}$ separately. An additional simple random sample $S_h^*$, called a supplement sample, is taken from $R_h^{J^c}$, and correlated sample tuples are stored into the respective sample relations of $R_h$ in the join graph $G$.
To estimate $Y^{J^c}$ using $S_h^*$, we modify $\hat{Y}$ in Eq (1) by treating $R_h^{J^c}$ as a source relation. That is,

$$\hat{Y}^{J^c} = \frac{N_h^{J^c}}{n_h^{J^c}} \sum_{i=1}^{n_h^{J^c}} y_i \qquad (10)$$

where $N_h^{J^c} = |R_h^{J^c}|$, $n_h^{J^c} = |S_h^*|$, and $y_i$ is the number of result tuples generated by the $i$th tuple in $S_h^*$.
Let $\hat{Y}^J_{RV}$ be the RV estimator of $Y^J$. Since $Y^J$ and $Y^{J^c}$ are independent, we have the following theorem.
Theorem 4. $\hat{Y}_h = \hat{Y}^J_{RV} + \hat{Y}^{J^c}$ is an unbiased estimator of the query result size $Y_{R_h}$. And $v_2(\hat{Y}^J_{RV}) + v_1(\hat{Y}^{J^c})$ is an unbiased estimator of $Var(\hat{Y}_h)$, in which $v_1$ and $v_2$ are defined as in Theorem 1 and Theorem 3.
5. SELECTIVITY ESTIMATION FOR QUERIES WITH CYCLIC GRAPHS
If the join graph $G$ of a query is cyclic, a query result tuple $t_r$ may involve two different tuples, say $t_1$ and $t_2$, both in $R_h$, with a cycle in $G$ from $t_1$ to $t_2$. In that case, we say $t_r$ is generated by $t_1$ rather than by $t_2$. Hence, each result tuple still corresponds to one and only one tuple in each relation when there is no dangling tuple.
Figure 4(a) shows a cyclic join graph and Figure 4(b) the join relationships. Let us use an example to illustrate the issues encountered in a cyclic join graph. Assume tuple 1 is first drawn from $R_1$. Then, following the correlated sampling (Figure 2), the related tuples $A$ ($\in R_2$), $b$ ($\in R_3$), $\alpha$ and $\beta$ ($\in R_4$), $I$ ($\in R_5$), and $\Delta$ and $\Theta$ ($\in R_6$) are drawn into the respective sample relations $S_i$, $2 \le i \le 6$, as shown in Figure 4(c). Note that $S_4$ contains tuples that are related to tuples in both $S_2$ (i.e., $A$) and $S_3$ (i.e., $b$) because the two sampling branches from $R_1$ intersect at $R_4$.
While the synopsis (Figure 4(c)) is sufficient for estimating queries involving relations that are subsets of $\{R_1, R_2, R_4, R_6\}$ or $\{R_1, R_3, R_4, R_5, R_6\}$, it is not straightforward how queries involving relations from both sampling branches can be estimated. For example, consider the query $(R_2, R_4, R_3)$, where $R_2$ and $R_3$ are on different branches. $R_2$ and $R_3$ are both candidates for the highest relation of the query. Assume we choose to use $S_2 = \{A\}$ to estimate the query size. From the synopsis, we know $A$ is connected to $\alpha$ of $R_4$. Unfortunately, we do not know whether $\alpha$ is connected to any tuples in $R_3$ because we did not let $\alpha$ pierce through $R_4$ to reach the related tuples in $R_3$ (i.e., $a$). A similar problem occurs if we use $S_3 = \{b\}$ to estimate the query size. That is, $b$ is connected to $\beta$ ($\in R_4$), but we do not know whether $\beta$ is connected to any tuples in $R_2$ because the sampling did not penetrate into $R_2$ to reach the related tuples (i.e., $B$ here).
Here, we propose two possible ways to handle this situation. The first is to estimate query sizes using the independence and uniformity assumptions. The second is to enhance the synopsis by letting the sampling pierce through the cycle-completing relation (e.g., $R_4$) and perhaps other relations.

[Figure 4: CS2 for a Cyclic Join Graph. (a) A cyclic join graph over $R_1$ to $R_6$. (b) Joinable tuple IDs: $R_1 = \{1, 2\}$, $R_2 = \{A, B\}$, $R_3 = \{a, b\}$, $R_4 = \{\alpha, \beta\}$, $R_5 = \{I, II\}$, $R_6 = \{\Delta, \Theta\}$. (c) CS2: $S_1^* = \{1\}$, $S_2 = \{A\}$, $S_3 = \{b\}$, $S_4 = \{\alpha, \beta\}$, $S_5 = \{I\}$, $S_6 = \{\Delta, \Theta\}$. (d) Extended samples: $S_3^+ = \{a\}$, $S_5^+ = \{II\}$.]
5.1 Estimate using Independence and Uniformity Assumptions
Consider the query $(R_2, R_4, R_3)$. Let $T_{24}$ and $T_{34}$ be the estimates for $|R_2 \bowtie R_4|$ and $|R_3 \bowtie R_4|$, respectively, derived using the method described in Section 4.2 with the sample relations $\{S_2, S_4\}$ and $\{S_3, S_4\}$, respectively. Let $N_4$ be the number of tuples in $R_4$. Assume every tuple in $R_4$ is equally likely to participate in the results of $R_2 \bowtie R_4$ and $R_3 \bowtie R_4$. Then $|(R_2, R_4, R_3)|$ can be estimated as

$$\frac{T_{24} \cdot T_{34}}{N_4}$$

This approach is simple, but the accuracy of the estimations may raise a concern because of the use of the independence and uniformity assumptions.
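The estimate amounts to one multiplication and one division; the intermediate estimates below are hypothetical values standing in for the Section 4.2 outputs:

```python
T24, T34 = 500.0, 300.0   # assumed estimates of |R2 join R4| and |R3 join R4|
N4 = 1000                 # |R4|: number of tuples in the shared relation

# under uniformity, every R4 tuple is equally likely in either partial result
size_243 = T24 * T34 / N4
```

With these numbers the estimated size of $(R_2, R_4, R_3)$ is $500 \times 300 / 1000 = 150$.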
5.2 Enhancing synopsis with extended samples
We allow the sampling to pierce through the cycle-completing relation (e.g., $R_4$) and extend to any relation as desired. In general, we only extend to the relations that are often referenced in queries together with the cycle-completing relation.
For example, we can let the sampling branch on the left (Figure 4) that goes through $R_1$, $R_2$, and $R_4$ extend to $R_3$ (or even to $R_1$ and $R_5$). An enhanced synopsis is shown in Figure 4(d), where the left sampling branch is extended to $R_3$ and $R_5$. In the figure, $S_i^+$ contains the extended sample tuples.
We do not need to differentiate $S_i$ from $S_i^+$ in the sample evaluations unless $R_i$ is the highest relation of the query. For example, consider the query $(R_2, R_4, R_3, R_5)$ and the enhanced synopsis in Figures 4(c) and 4(d). Assume $R_2$ is chosen to be the highest relation for evaluation. Consequently, $S_2$, without $S_2^+$, must be used. As for the other relations, we can just use $S_i \cup S_i^+$. The sample query result can be obtained by $(S_2, S_4, S_3 \cup S_3^+, S_5 \cup S_5^+)$, which is the same as $(S_2, S_4, S_3^+, S_5^+)$. The latter may be more efficient, but the former is simpler; that is, always use $S_i \cup S_i^+$, except when $R_i$ is the highest relation.
6. MULTI-SOURCE SYNOPSIS
Figure 5 shows a graph with two source relations $R_A$ and $R_B$. Let A and B be the two groups of relations with $R_A$ and $R_B$ being their source relations, respectively.

[Figure 5: Multiple Source Relations. Groups A and B, with source relations $R_A$ and $R_B$, share a common relation $R_i$.]

The two groups have a common relation $R_i$, but there can certainly be more. Note that the sampling processes that started from $R_A$ and $R_B$ would each draw their own related sample tuples from $R_i$ and store them in $S_i$.
Given a query, if there exists a sample penetrating all its operand relations, an estimate of the query size can be derived using the methods described in Section 4. If no sample penetrates all the operand relations, we resort to the estimation method based on the independence and uniformity assumptions in Section 5.
A multi-source synopsis arises from traversing only the
many-to-one join relationships in the sampling. But there
are other reasons for constructing a multi-source synopsis.
In situations where there are multiple groups of relations
that are often referenced together, we may select a source
relation for each such group. Depending upon the importance of queries and the desired accuracy, a multi-source synopsis allows a DBA to determine the grouping of relations and the sample sizes for individual groups. It can improve the accuracy of estimation for queries over relations in the same group; however, it can decrease the accuracy for queries over relations across groups, due to the use of the independence and uniformity assumptions in estimation. It can be a very flexible and effective way to build a synopsis.
7. EXPERIMENTS
In this section, we present the results of performance evaluations of TuG [27, 28], join synopses (JS) [1], and the correlated sample synopsis (CS2).
The implementations of the join synopses and the correlated sample synopsis are much simpler than TuG because TuG uses complex clustering and compression algorithms. We have implemented the TuG synopsis (whose code is not available to the public) following the descriptions in [27], including using the public Java source code [23] of CF-Tree and BIRCH [32] to conduct clustering, and implementing the Count-Min Sketches [7] and histogram synopses [8] to reduce the size of histograms.
All programs are implemented in Java 1.6 and MySQL 5.1 on an Ubuntu 10.04 Linux system, equipped with an Intel Core i7 3.4 GHz CPU, 4GB RAM, and a 500GB hard disk (7200 RPM, 16MB buffer). All datasets and synopses are stored on the hard disk.

Table 3: Datasets

  Name        Size    Comments
  TPCH1G05    1GB     TPC-H skew factor z=0.5
  TPCH1G1     1GB     TPC-H skew factor z=1.0
  TPCH10G05   10GB    TPC-H skew factor z=0.5
  TPCH10G1    10GB    TPC-H skew factor z=1.0
  DBLP        692MB   DBLP online database

[Figure 6: Join Graphs of Datasets (Boxed are Source Relations). (a) TPC-H join graph over lineitem (source), partsupp, orders, part, supplier, customer, nation, and region. (b) DBLP join graph over author_of (source), author, paper, type, and publisher.]
Using a large amount of memory only benefits TuG, which utilizes clustering on large datasets. CS2 and JS only require small amounts of memory when constructing synopses.
7.1 Datasets
We conducted experiments on both synthetic data (TPC-H datasets with skewed data [4]) and real-life data (the DBLP dataset [17]), listed in Table 3.
We generated 1GB and 10GB TPC-H datasets with skew
factors of z=0.5 and 1. Each TPC-H dataset has 8 relations,
consisting of only the foreign-key join relationships. The arcs
in Figure 6(a) indicate the many-to-one relationships, which
also satisfy the foreign-key constraint for the TPC-H data,
and the boxes indicate the source relations of CS2. Note
that every relation is a source relation in JS.
The DBLP database listed more than 1.3 million computer science articles as of January 2010. Our DBLP dataset derives from a relational enhancement of the original DBLP data, named DBLP++ [9]. We have replaced the many-to-many relationship between "author" and "paper" with two many-to-one relationships and a connecting relation "author_of", as JS does not apply to many-to-many relationships directly. However, CS2 is applicable to all types of
relationships. In addition, we have also filtered out some data so that the foreign-key constraint holds, as required by JS. The join graph is shown in Fig 6(b).
7.2 Construction Time and Index
The average construction times of the synopses in the accuracy tests are presented in Table 4. As observed, the construction of TuG is significantly slower than that of JS and CS2 because it performs several time-consuming tasks, such as CV table building, all-but-one similar merges, and BIRCH clustering. All of these tasks require tremendous amounts of disk I/O and CPU computation.
Building indexes will certainly increase the construction
Table 4: Average Construction Time of Synopses

  Dataset     Space Budget   TuG      JS       CS2
  TPCH1G05    29KB           639min   10sec    8sec
  TPCH1G1     21KB           286min   10sec    8sec
  TPCH10G05   100KB          -        408sec   334sec
  TPCH10G1    100KB          -        340sec   300sec
  DBLP        35KB           518min   8sec     3sec
speed of all the synopses. In our experiments, we built indexes on the join key attributes in both the TPC-H and DBLP datasets.
7.3 Query Workload
We randomly generated queries involving different sets of relations with different numbers of selection predicates. For such queries, we randomly varied the ranges of attribute values in the selection criteria. Our experiments were designed to examine how joins and selections affect the accuracy of estimations.
Table 5 details the test queries and the space budgets used for each test. "DBLP Large/Small Range" refers to queries with large/small attribute value ranges specified in the selection predicates. "DBLP No-source" refers to queries that do not involve the source relation "author_of", which would require using the proposed RV estimator (Section 4). They are also among the queries used in the "DBLP Large/Small Range" experiments.
7.4 Synopsis Space Budget
According to Section 3.1.1 and Algorithm 3.1, in order to control the size of CS2, we chose "lineitem" of TPC-H and "author_of" of DBLP as the source relations. Our correlated sampling also followed the many-to-one relationships in the join graphs. As Table 5 shows, the space of CS2 can be well controlled. Space budgets of CS2 are less than 20KB for the 1GB TPC-H datasets, 100KB for the 10GB TPC-H datasets, and less than 10KB for the DBLP dataset.
Note that it is hard to control the final sizes of TuGs, even though we used the same clustering parameters to generate them on the TPC-H datasets. We were not able to generate any TuG smaller than 20KB for TPC-H or 23KB for DBLP without incurring large estimation errors, and thus their results are absent for the 3KB and 10KB tests in Figure 7.
7.5 Estimation Accuracy
Test queries were grouped based on the numbers of joins and selections placed on each relation. For example, in Figure 7(a), the query group 'j1s2' comprises queries with one join and at most 2 selections on each operand relation, with at most 4 selections per query.
Since JS and CS2 both need random sampling, they are not unique for a given DB. Therefore, for each dataset and given space budget, we generated 100 JSs and CS2s to observe their average performance. Each CS2/JS is used for
estimations of all test queries. We use the absolute relative error to compare the accuracy of estimations, which is defined as

$$E = \frac{\left| \hat{Y}_{Est} - Y_{Act} \right|}{Y_{Act}} \times 100\%$$

where $Y_{Act}$ denotes the actual query result size and $\hat{Y}_{Est}$ is
Table 5: Query Workloads and Space Budgets in Estimation Accuracy Tests

  Figure   Tests                       Data Size   Total Queries   TuG    JS      CS2
  7(a)     TPCH1G05 Synopses 20KB      1GB         45              29KB   20KB    20KB
  7(b)     TPCH1G05 Synopses 10KB      1GB         45              -      10KB    10KB
  7(c)     TPCH1G1 Synopses 20KB       1GB         45              21KB   20KB    20KB
  7(d)     TPCH1G1 Synopses 10KB       1GB         45              -      10KB    10KB
  7(e)     TPCH10G05 Synopses 100KB    10GB        45              -      100KB   100KB
  7(f)     TPCH10G1 Synopses 100KB     10GB        45              -      100KB   100KB
  7(g)     DBLP Large Range 3KB        692MB       51              23KB   3KB     3KB
  7(h)     DBLP Small Range 3KB        692MB       51              -      3KB     3KB
  7(i)     DBLP Small Range 10KB       692MB       51              -      10KB    10KB
  7(j)     DBLP No-source              692MB       36              -      3KB     3KB
the estimated result size. The results of the accuracy tests are shown in Fig 7.
As shown in Figures 7(a) and 7(c), TuG generally produced the worst relative errors. As mentioned earlier, a TuG smaller than 20KB cannot generate meaningful estimations. JS and CS2 had similar accuracies for queries with small numbers of joins. However, JS's accuracy degraded quickly when the number of joins increased. CS2 produced the lowest relative errors in all cases of the TPC-H tests.
Figure 7(g) shows the results of the DBLP large range query tests. Again, CS2 was superior to JS and TuG, with TuG performing the worst. As observed from Figures 7(g) and 7(h), small range queries gave higher errors than large range queries. Also, as the sizes of the synopses increased, the relative errors decreased (Figures 7(h) and 7(i)). The dominance of CS2 over JS became more obvious as the numbers of joins and selections increased, as shown in Fig 7(h).
Finally, we compare the accuracy of CS2 and JS for no-
source queries. All queries were extracted from the previous
tests. It is clear, as shown in Fig 7(j), that CS2 with the
reverse estimator produced lower relative errors than JS.
In summary, TuG performed the worst among these three
synopses. CS2 produced the smallest relative errors in all
tests. In addition, CS2 increasingly outperformed JS as the
numbers of joins and selections grew.
7.6 Unbiasedness of RV Estimator
In Section 4.2, we proposed the reverse estimator for all no-source queries and proved theoretically that it is unbiased (Theorem 2). Here, we validate empirically the unbiasedness of this estimator using CS2. We also compare the results with JS.
For illustrative purposes, we select two queries for this bias test, one from group 'j2s1' and the other from 'j2s2' in the DBLP tests. A query is estimated each time using a different synopsis (or synopses) of the same size (3KB). We generated 100 synopses for this purpose. The absolute
average relative error of the first $i$ estimates is defined as

$$E_i = \frac{\left| \frac{1}{i}\sum_{j=1}^{i} \hat{Y}^{Est}_j - Y_{Act} \right|}{Y_{Act}} \times 100\%$$

where $i \ge 1$, $Y_{Act}$ is the actual query size, $\hat{Y}^{Est}_j$ is the estimated value derived from the $j$th synopsis (or synopses), and $\frac{1}{i}\sum_{j=1}^{i} \hat{Y}^{Est}_j$ is the average of the first $i$ estimates. For an unbiased estimator [6], the absolute average relative error should approach 0 as $i$ (the number of tries) increases.
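The metric $E_i$ can be computed as a running average; the estimate sequence below is fabricated around an actual size of 100 purely to show the mechanics:

```python
Y_act = 100.0
est = [90.0, 115.0, 95.0, 105.0, 98.0, 102.0]  # hypothetical unbiased estimates

E = []
running = 0.0
for i, y in enumerate(est, start=1):
    running += y
    E.append(abs(running / i - Y_act) / Y_act * 100)   # E_i in percent

# the running-average error shrinks as more estimates are folded in:
# E is roughly [10.0, 2.5, 0.0, 1.25, 0.6, 0.83]
```

For an unbiased estimator, this sequence trends toward 0 by the law of large numbers, which is exactly the behavior plotted in Figure 8.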
As observed in Figure 8(a), the errors of both CS2 (using the RV estimator) and JS approached 0 as the number of tries increased. On the other hand, Figure 8(b) demonstrates that the errors of CS2 converged to 0 earlier than those of JS within 100 rounds; i.e., CS2 yields a faster convergence speed than JS when more selections are added. Certainly, JS yields unbiased estimates as it uses simple random samples of join results. These experiments validate our earlier proofs of the unbiasedness of the RV estimator.
8. CONCLUSIONS AND FUTURE WORK
In this paper, we have presented an innovative approach to constructing a database synopsis. The synopsis mainly consists of a set of correlated sample relations. We have designed a reverse estimator that can fully utilize the correlated sample tuples to make accurate estimations. Our experimental results show that the correlated sample synopsis is easy to construct and performs better than existing synopses. It proves itself to be a simple, concise, and accurate synopsis.
There is still much work to do. We shall study the possibility of further reducing the size of the synopsis using other sampling techniques, such as subsampling. Finally, we shall explore the use of the correlated sample synopsis in approximate aggregate query answering.
9. ACKNOWLEDGMENTS
A special thanks to Dr. Sakthivel Jeyarathnam for his help
in developing the reverse estimator. We are also grateful to
Mr. Chase Abner for editing assistance.
10. REFERENCES
[1] S. Acharya, P. B. Gibbons, V. Poosala, and
S. Ramaswamy. Join synopses for approximate query
answering. SIGMOD Rec., 28:275–286, June 1999.
[2] N. Alon, Y. Matias, and M. Szegedy. The space
complexity of approximating the frequency moments.
J. Comput. Syst. Sci., pages 20–29, 1996.
[3] S. Chaudhuri, G. Das, and V. Narasayya. Optimized
stratified sampling for approximate query processing.
ACM Trans. Database Syst., 32(2), June 2007.
[4] S. Chaudhuri and V. Narasayya. Program for tpc-d
data generation with skew. ftp://ftp.research.
microsoft.com/users/viveknar/tpcdskew.
[Figure 7: Absolute Relative Errors (%) in Accuracy Tests. Each panel plots the error per query group (j1s1 to j3s2): (a) TPCH1G05: TuG=29KB; JS,CS2=20KB. (b) TPCH1G05: JS,CS2=10KB. (c) TPCH1G1: TuG=21KB; JS,CS2=20KB. (d) TPCH1G1: JS,CS2=10KB. (e) TPCH10G05: JS,CS2=100KB. (f) TPCH10G1: JS,CS2=100KB. (g) DBLP Large Range: TuG=23KB; JS,CS2=3KB. (h) DBLP Small Range: JS,CS2=3KB. (i) DBLP Small Range: JS,CS2=10KB. (j) DBLP No-source: JS,CS2=3KB.]
[Figure 8: Averaged Absolute Relative Errors (%) in Bias Tests, plotted against the number of estimates (0 to 100). (a) Bias Test: DBLP No-source j2s1. (b) Bias Test: DBLP No-source j2s2.]
[5] S. Christodoulakis. Implications of certain
assumptions in database performance evaluation. ACM
Trans. Database Syst., 9:163–186, June 1984.
[6] W. G. Cochran. Sampling Techniques, 3rd Edition.
John Wiley, 1977.
[7] G. Cormode and S. Muthukrishnan. An improved
data stream summary: the count-min sketch and its
applications. J. Algorithms, 55:58–75, April 2005.
[8] A. Deshpande, M. Garofalakis, and R. Rastogi.
Independence is good: dependency-based histogram
synopses for high-dimensional data. SIGMOD Rec.,
30:199–210, May 2001.
[9] J. Diederich. FacetedDBLP.
http://dblp.l3s.de/dblp++.php.
[10] P. B. Gibbons, Y. Matias, and V. Poosala. Fast
incremental maintenance of approximate histograms.
VLDB ’97, pages 466–475, 1997.
[11] P. J. Haas, J. F. Naughton, S. Seshadri, and A. N.
Swami. Fixed-precision estimation of join selectivity.
PODS ’93, pages 190–201, 1993.
[12] P. J. Haas and A. N. Swami. Sequential sampling
procedures for query size estimation. SIGMOD Rec.,
21:341–350, June 1992.
[13] W.-C. Hou, G. Ozsoyoglu, and B. K. Taneja.
Statistical estimators for relational algebra
expressions. PODS ’88, pages 276–287, 1988.
[14] Y. E. Ioannidis and Y. C. Kang. Left-deep vs. bushy
trees: an analysis of strategy spaces and its
implications for query optimization. SIGMOD Rec.,
20:168–177, April 1991.
[15] R. P. Kooi. The optimization of queries in relational
databases. PhD thesis, Cleveland, OH, USA, 1980.
AAI8109596.
[16] J.-H. Lee, D.-H. Kim, and C.-W. Chung.
Multi-dimensional selectivity estimation using
compressed histogram information. SIGMOD Rec.,
28:205–214, June 1999.
[17] M. Ley. The DBLP computer science bibliography.
http://www.informatik.uni-trier.de/~ley/db/.
[18] R. J. Lipton, J. F. Naughton, and D. A. Schneider.
Practical selectivity estimation through adaptive
sampling. SIGMOD Rec., 19:1–11, May 1990.
[19] V. Markl, V. Raman, D. Simmen, G. Lohman,
H. Pirahesh, and M. Cilimdzic. Robust query
processing through progressive optimization. SIGMOD
’04, pages 659–670, 2004.
[20] Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based
histograms for selectivity estimation. SIGMOD Rec.,
27:448–459, June 1998.
[21] Y. Matias, J. S. Vitter, and M. Wang. Dynamic
maintenance of wavelet-based histograms. VLDB ’00,
pages 101–110, 2000.
[22] M. Muralikrishna and D. J. DeWitt. Equi-depth
histograms for estimating selectivity factors for
multi-dimensional queries. SIGMOD ’88, pages 28–36,
1988.
[23] R. Perdisci. JBIRCH.
http://roberto.perdisci.com/projects/jbirch.
[24] V. Poosala and Y. E. Ioannidis. Selectivity estimation
without the attribute value independence assumption.
VLDB ’97, pages 486–495, 1997.
[25] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin,
R. A. Lorie, and T. G. Price. Access path selection in
a relational database management system. SIGMOD
’79, pages 23–34, 1979.
[26] J. M. Smith and P. Y.-T. Chang. Optimizing the
performance of a relational algebra database interface.
Commun. ACM, 18:568–579, October 1975.
[27] J. Spiegel and N. Polyzotis. Graph-based synopses for
relational selectivity estimation. SIGMOD ’06, pages
205–216, 2006.
[28] J. Spiegel and N. Polyzotis. Tug synopses for
approximate query answering. ACM Trans. Database
Syst., 34(1):3:1–3:56, Apr. 2009.
[29] J. D. Ullman, H. Garcia-Molina, and J. Widom.
Database Systems: The Complete Book. Prentice Hall
PTR, Upper Saddle River, NJ, USA, 1st edition, 2001.
[30] E. Wong and K. Youssefi. Decomposition - a strategy
for query processing. ACM Transactions on Database
Systems, 1:223–241, 1976.
[31] Y.-L. Wu, D. Agrawal, and A. El Abbadi. Applying
the golden rule of sampling for query estimation.
SIGMOD Rec., 30(2):449–460, May 2001.
[32] T. Zhang, R. Ramakrishnan, and M. Livny. Birch: an
efficient data clustering method for very large
databases. SIGMOD Rec., 25:103–114, June 1996.
480
... Estimating join cardinality requires correlated statistics on the join attributes. While such statistics exist, e.g., correlated samples [17,24,44], they require the existence of indexes on every join attribute combination, which severely limits their applicability in the case of multi-way joins. As a result, even advanced optimizers rely on crude formulas that assume uniformity, inclusion, and independence-which are likely to produce highly sub-optimal execution plans [23]. ...
... Estimating join cardinality requires correlated statistics on the join attributes. While such statistics exist, e.g., correlated samples [22,62,29], they require the existence of indexes on every join attribute combination, which severely limits their applicability in the case of multi-way joins. As a result, even advanced optimizers rely on crude formulas that assume uniformity, inclusion, and independence-which are likely to produce highly sub-optimal execution plans [28]. ...
Preprint
Full-text available
Cost-based query optimization remains a critical task in relational databases even after decades of research and industrial development. Query optimizers rely on a large range of statistical synopses -- including attribute-level histograms and table-level samples -- for accurate cardinality estimation. As the complexity of selection predicates and the number of join predicates increase, two problems arise. First, statistics cannot be incrementally composed to effectively estimate the cost of the sub-plans generated in plan enumeration. Second, small errors are propagated exponentially through join operators, which can lead to severely sub-optimal plans. In this paper, we introduce COMPASS, a novel query optimization paradigm for in-memory databases based on a single type of statistics -- Fast-AGMS sketches. In COMPASS, query optimization and execution are intertwined. Selection predicates and sketch updates are pushed-down and evaluated online during query optimization. This allows Fast-AGMS sketches to be computed only over the relevant tuples -- which enhances cardinality estimation accuracy. Plan enumeration is performed over the query join graph by incrementally composing attribute-level sketches -- not by building a separate sketch for every sub-plan. We prototype COMPASS in MapD -- an open-source parallel database -- and perform extensive experiments over the complete JOB benchmark. The results prove that COMPASS generates better execution plans -- both in terms of cardinality and runtime -- compared to four other database systems. Overall, COMPASS achieves a speedup ranging from 1.35X to 11.28X in cumulative query execution time over the considered competitors.
... Yu et al. [105] introduce correlated sampling as a part of CS2 algorithm. They (1) choose one of the tables in a join graph as the source table R 1 , (2) use a random sampling method to obtain sample set S 1 for R 1 (mark R 1 as visited), (3) follow an unvisited edge < R i , R j > ( R i is visited) in the join graph and collect the tuples from R j which are joinable with tuples in S i as S j , and (4) estimate the join size over the samples. ...
Article
Full-text available
Query optimizer is at the heart of the database systems. Cost-based optimizer studied in this paper is adopted in almost all current database systems. A cost-based optimizer introduces a plan enumeration algorithm to find a (sub)plan, and then uses a cost model to obtain the cost of that plan, and selects the plan with the lowest cost. In the cost model, cardinality, the number of tuples through an operator, plays a crucial role. Due to the inaccuracy in cardinality estimation, errors in cost model, and the huge plan space, the optimizer cannot find the optimal execution plan for a complex query in a reasonable time. In this paper, we first deeply study the causes behind the limitations above. Next, we review the techniques used to improve the quality of the three key components in the cost-based optimizer, cardinality estimation, cost model, and plan enumeration. We also provide our insights on the future directions for each of the above aspects.
... Yu et al. [105] introduce correlated sampling as a part of CS2 algorithm. They (1) choose one of the tables in a join graph as the source table 1 , (2) use a random sampling method to obtain sample set 1 for 1 (mark 1 as visited), (3) follow an unvisited edge < , > ( is visited) in the join graph and collect the tuples from which are joinable with tuples in as , and (4) estimate the join size over the samples. ...
Preprint
Query optimizer is at the heart of the database systems. Cost-based optimizer studied in this paper is adopted in almost all current database systems. A cost-based optimizer introduces a plan enumeration algorithm to find a (sub)plan, and then uses a cost model to obtain the cost of that plan, and selects the plan with the lowest cost. In the cost model, cardinality, the number of tuples through an operator, plays a crucial role. Due to the inaccuracy in cardinality estimation, errors in cost model, and the huge plan space, the optimizer cannot find the optimal execution plan for a complex query in a reasonable time. In this paper, we first deeply study the causes behind the limitations above. Next, we review the techniques used to improve the quality of the three key components in the cost-based optimizer, cardinality estimation, cost model, and plan enumeration. We also provide our insights on the future directions for each of the above aspects.
Article
Inner-product estimation is the base of many important tasks in a variety of big data scenarios, including measuring similarity of streams in data stream processing, estimating join size in database, and analyzing cosine similarity in various applications. Sketch, as a class of probability algorithms, is promising in inner-product estimation. However, existing sketch solutions suffer from low accuracy due to their neglect of the high skewness of real data. In this paper, we design a new sketch algorithm for accurate and unbiased inner-product estimation, namely JoinSketch. To improve accuracy, JoinSketch consists of multiple components, and records items with different frequency in different components. We theoretically prove that JoinSketch is unbiased, and has lower variance compared with the well-known AGMS and Fast-AGMS sketch. The experimental results show that JoinSketch improves the accuracy by 10 times in average while maintaining a comparable speed. All code is open-sourced at Github.
Article
Today's query optimizers use fast selectivity estimation techniques but are known to be susceptible to large estimation errors. Recent work on supervised learned models for selectivity estimation significantly improves accuracy while ensuring relatively low estimation overhead. However, these models impose significant construction cost, as they need large numbers of training examples and computing selectivity labels is costly for large datasets. We propose a novel model construction method that incrementally generates training data and uses approximate selectivity labels, reducing total construction cost by an order of magnitude while preserving most of the accuracy gains. The proposed method is particularly attractive for model designs that are faster to train for a given number of training examples, but such models are known to support only a limited class of query expressions. We broaden the applicability of such supervised models to the class of select-project-join query expressions with range predicates and IN clauses. Our extensive evaluation on synthetic benchmarks and real-world queries shows that the 95th-percentile error of our proposed models is 10-100X better than that of traditional selectivity estimators. We also demonstrate significant gains in plan quality as a result of improved selectivity estimates.
Article
Query size estimation is crucial for many database system components. In particular, query optimizers need efficient and accurate query size estimation when deciding among alternative query plans. In this paper we propose a novel sampling technique based on the golden rule of sampling, introduced by von Neumann in 1947, for estimating range queries. The proposed technique randomly samples the frequency domain using the cumulative frequency distribution and yields good estimates without any a priori knowledge of the actual underlying distribution of spatial objects. We show experimentally that the proposed sampling technique gives smaller approximation error than the Min-Skew histogram based and wavelet based approaches for both synthetic and real datasets. Moreover, the proposed technique can be easily extended for higher dimensional datasets.
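The "golden rule" the abstract invokes is inverse-transform sampling: draw a uniform variate and invert the cumulative frequency distribution to pick a value. A generic 1-D sketch of that mechanism (not the paper's spatial-range estimator) with made-up values and frequencies:

```python
import bisect
import random

def inverse_transform_sampler(values, freqs, rng):
    # Build the cumulative frequency distribution once up front.
    cum = []
    total = 0
    for f in freqs:
        total += f
        cum.append(total)

    def sample():
        # Golden rule: draw u uniformly over the total mass, then invert
        # the CDF to find the value whose interval contains u.
        u = rng.uniform(0, total)
        return values[bisect.bisect_left(cum, u)]

    return sample

rng = random.Random(7)
values = [10, 20, 30, 40]
freqs = [1, 2, 3, 4]  # skewed frequencies; probabilities 0.1/0.2/0.3/0.4
sample = inverse_transform_sampler(values, freqs, rng)

draws = [sample() for _ in range(10000)]
p40 = draws.count(40) / len(draws)  # empirical share of 40s tracks 0.4
```

No assumption about the shape of the underlying distribution is needed; whatever the cumulative frequencies encode is reproduced by the draws.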
Article
We provide a procedure, based on random sampling, for estimating the size of a query result. The procedure is sequential in that sampling terminates after a random number of steps according to a stopping rule that depends upon the observations obtained so far. Enough observations are obtained so that, with a pre-specified probability, the estimate differs from the true size of the query result by no more than a pre-specified amount. Unlike previous sequential estimation procedures for queries, our procedure is asymptotically efficient and requires no ad hoc pilot sample or a priori assumptions about data characteristics. In addition to establishing the asymptotic properties of the estimation procedure, we provide techniques for reducing undercoverage at small sample sizes and show that the sampling cost of the procedure can be reduced through stratified sampling techniques.
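A sequential procedure of this flavor can be sketched with a simple relative-precision stopping rule: keep sampling until the confidence half-width of the running mean falls below a target fraction of the mean. This is a generic illustration under assumed parameters (eps, z, the uniform stand-in for per-sample contributions), not the paper's asymptotically efficient procedure:

```python
import math
import random

def sequential_estimate(draw, eps=0.05, z=1.96, min_n=30, max_n=200000):
    # Sample until the z-confidence half-width of the running mean
    # drops below eps * mean, or max_n draws are exhausted.
    n = 0
    s = 0.0
    s2 = 0.0
    while n < max_n:
        x = draw()
        n += 1
        s += x
        s2 += x * x
        if n >= min_n:
            mean = s / n
            var = max(s2 / n - mean * mean, 0.0)
            half = z * math.sqrt(var / n)
            if mean > 0 and half <= eps * mean:
                break
    return s / n, n

rng = random.Random(3)
# Stand-in for a per-sample contribution to a query-size estimate:
# uniform on {0, ..., 10}, true mean 5.
mean, n = sequential_estimate(lambda: rng.randrange(11))
```

The min_n guard is a crude answer to the undercoverage problem the abstract mentions: very early stopping on a lucky low-variance prefix would make the nominal confidence level too optimistic.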
Article
Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and the minimization of I/O costs. This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best-quality clustering with the available resources (i.e., available memory and time constraints). BIRCH can typically find a good clustering with a single scan of the data and improve the quality further with a few additional scans. BIRCH is also the first clustering algorithm proposed in the database area to handle "noise" (data points that are not part of the underlying pattern) effectively. We evaluate BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparison of BIRCH versus CLARANS, a clustering method proposed recently for large datasets, and show that BIRCH is consistently superior.
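The incremental core of BIRCH is the clustering feature, CF = (N, LS, SS): a point is absorbed into an existing cluster only if the cluster's radius stays under a threshold, otherwise a new cluster starts. Below is a 1-D sketch of that insertion idea only, with made-up data and threshold; the real algorithm organizes CFs in a height-balanced CF-tree, which is omitted here:

```python
import math

class CF:
    # Clustering feature for a 1-D cluster: count, linear sum, square sum.
    def __init__(self, x):
        self.n, self.ls, self.ss = 1, x, x * x

    def centroid(self):
        return self.ls / self.n

    def radius_if_added(self, x):
        # Radius the cluster would have after absorbing x, computed
        # purely from the additive CF statistics.
        n, ls, ss = self.n + 1, self.ls + x, self.ss + x * x
        return math.sqrt(max(ss / n - (ls / n) ** 2, 0.0))

    def add(self, x):
        self.n += 1
        self.ls += x
        self.ss += x * x

def cf_cluster(points, threshold):
    # Single-pass insertion: absorb each point into the nearest CF if the
    # resulting radius stays under the threshold, else open a new CF.
    cfs = []
    for x in points:
        best = min(cfs, key=lambda c: abs(c.centroid() - x), default=None)
        if best is not None and best.radius_if_added(x) <= threshold:
            best.add(x)
        else:
            cfs.append(CF(x))
    return cfs

data = [0.1, 0.2, 0.15, 5.0, 5.1, 4.9, 10.2, 10.0]
clusters = cf_cluster(data, threshold=0.5)
print(len(clusters))  # -> 3 (three well-separated groups)
```

Because CFs are additive, the pass needs only constant memory per cluster and never revisits raw points, which is what makes the single-scan, I/O-frugal behavior possible.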
Article
Approximating the joint data distribution of a multi-dimensional data set through a compact and accurate histogram synopsis is a fundamental problem arising in numerous practical scenarios, including query optimization and approximate query answering. Existing solutions either rely on simplistic independence assumptions or try to directly approximate the full joint data distribution over the complete set of attributes. Unfortunately, both approaches are doomed to fail for high-dimensional data sets with complex correlation patterns between attributes. In this paper, we propose a novel approach to histogram-based synopses that employs the solid foundation of statistical interaction models to explicitly identify and exploit the statistical characteristics of the data. Abstractly, our key idea is to break the synopsis into (1) a statistical interaction model that accurately captures significant correlation and independence patterns in data, and (2) a collection of histograms on low-dimensional marginals that, based on the model, can provide accurate approximations of the overall joint data distribution. Extensive experimental results with several real-life data sets verify the effectiveness of our approach. An important aspect of our general, model-based methodology is that it can be used to enhance the performance of other synopsis techniques that are based on data-space partitioning (e.g., wavelets) by providing an effective tool to deal with the “dimensionality curse”.