CS2: A New Database Synopsis for Query Estimation
Feng Yu
Southern Illinois University
Carbondale, IL, 62901
fyu@cs.siu.edu
Wen-Chi Hou
Southern Illinois University
Carbondale, IL, 62901
hou@cs.siu.edu
Cheng Luo
Coppin State University
Baltimore, MD 21216
cluo@coppin.edu
Dunren Che
Southern Illinois University
Carbondale, IL, 62901
dche@cs.siu.edu
Mengxia Zhu
Southern Illinois University
Carbondale, IL, 62901
mzhu@cs.siu.edu
ABSTRACT
Fast and accurate estimations for complex queries are profoundly beneficial for large databases with heavy workloads. In this research, we propose a statistical summary for a database, called CS2 (Correlated Sample Synopsis), to provide rapid and accurate result size estimations for all queries with joins and arbitrary selections. Unlike the state-of-the-art techniques, CS2 does not completely rely on simple random samples, but mainly consists of correlated sample tuples that retain join relationships with less storage. We introduce a statistical technique, called reverse sample, and design a powerful estimator, called reverse estimator, to fully utilize correlated sample tuples for query estimation. We prove both theoretically and empirically that the reverse estimator is unbiased and accurate using CS2. Extensive experiments on multiple datasets show that CS2 is fast to construct and derives more accurate estimations than existing methods with the same space budget.
Categories and Subject Descriptors
H.2.4 [Information Systems]: Database Management Systems, Query Processing
General Terms
Theory
Keywords
Query Optimization, Database Synopsis, Selectivity Estimation
1. INTRODUCTION
Query optimization has been a central research topic of Database Management Systems for many years [14, 25, 26, 30]. It aims to find the most efficient execution plans for queries. A query optimizer generally uses statistics about the database [22, 24] and assumptions about the attribute values [5, 25] to estimate the costs of alternative plans, and selects the best one for execution. Finding the most efficient plans is important because studies [5, 19] have shown that executions with sub-optimal plans can be orders of magnitude slower than those with optimal plans. Unfortunately, due to the limited accuracy of the statistics stored in the database and the validity of the assumptions made, query optimizers often cannot find the best plans in their search spaces. On the other hand, storing accurate statistics can require prohibitively large amounts of space. Thus, constructing a concise, accurate database synopsis has been a central issue of query optimization for decades. In this research, we are mainly concerned with queries with arbitrary select and equi-join operations.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD'13, June 22–27, 2013, New York, New York, USA.
Copyright 2013 ACM 978-1-4503-2037-5/13/06 ...$15.00.
There has been much effort to construct concise, accurate database synopses in the past. Among them, histograms [8, 15, 22, 24] have attracted the most attention. They are simple and effective for estimating low-dimensional queries, where the numbers of attributes of interest in a relation, including the selection and join attributes, are small. However, as the dimension increases, the storage space required by a multi-dimensional histogram can grow dramatically. In addition, the computation and maintenance of multi-dimensional histograms become very complex [21]. Some techniques, such as the discrete cosine transform, wavelets, and sketches [2, 16, 20], have been proposed to compress the histograms. Unfortunately, these size reduction techniques become ineffective as the dimension increases.
There is also a long history of using sampling in databases [13, 12, 11, 18, 31]. With a small number of sample tuples, accurate selectivity estimates can be obtained for select queries with arbitrary selection predicates. However, for joins, large estimation errors were observed when independent random samples of the relations are used. To overcome this problem, Acharya et al. [1] stored a join synopsis for each relation. The synopsis for a relation is a simple random sample of the result of joining the relation with all its descendant relations via foreign-key relationships. Each join synopsis can then be used to generate simple random samples of the results of all subqueries that involve the source relation, and is thus very useful for estimating a large set of queries.
Chaudhuri et al. [3] proposed using stratified sampling to improve estimations. However, the workload must be known beforehand for stratified sampling to be effective. Moreover, the method is only applicable to "star queries" with foreign-key joins. Wu et al. [31] used cumulative distributions to make estimations, but their method mainly focuses on range queries over a single relation.
Spiegel et al. [27, 28] proposed an innovative approach to summarizing the correlations among the join and other attribute values in a data graph. Tuples in the graph are connected to nodes that represent their respective attribute values, and to tuples with which they join. To reduce the size of the data graph, tuples with similar join- and attribute-based characteristics are merged into clusters, and clusters with a similar value distribution for an attribute share the same (1-dimensional) histogram. This class of graph-based synopses, called TuG (Tuple Graph) synopses, can be applied to queries with joins of many-to-many and many-to-one relationships [29], and to selections with table-level predicates.
Experimental results [27, 28] showed that histogram-based synopses are not as accurate as join synopses and TuG for complex join queries. However, join synopses and TuG also have limitations and shortcomings. For example, join synopses restrict their applications to foreign-key joins. In addition, a join synopsis is stored for each relation in the database, which raises concerns about the total size of the synopses. As for TuG, its accuracy depends heavily on the quality of the clusterings performed. As the number of attributes of interest increases and the relationships among attribute values become more complex, clusterings become less effective, and the accuracy of estimations deteriorates dramatically. The construction of a TuG synopsis is also very complex.
Motivated by the importance and continuing lack of an effective database synopsis, we set our goal to constructing a simple, concise, and accurate database synopsis. We have chosen to build the synopsis from sample tuples, since intricate relationships among attribute values are naturally embodied in them. In addition, the space for storing a sample tuple grows only linearly with the number of attributes (of interest) in a relation, which makes sample tuples an effective tool for modeling multi-dimensional, and especially high-dimensional, attribute value distributions. However, instead of storing independent random samples of the relations, we propose to store correlated sample tuples that preserve the join relationships: tuples of a relation are included in its sample if they join with already-drawn sample tuples of other relations. The correlated sample tuples from the relations form the correlated sample synopsis (CS2), which is used for query size estimation as well as approximate aggregate query answering.
While both the join synopses and the correlated sample synopsis attempt to capture relationships between tuples of different relations, there are many differences. The join synopses store samples of the results of joins, while the correlated sample synopsis stores a sample of correlated tuples from the relations. Besides this philosophical difference, the correlated sample synopsis is also more general than the join synopses. The join synopses are applicable only to foreign-key join relationships, while the correlated sample synopsis can be applied to all types of join relationships, including many-to-one and many-to-many relationships. The join synopses assume no dangling tuples exist in the relations, but no such assumption is needed for the correlated sample synopsis. The join synopses rely on simple random samples to make estimations and thus require constructing up to n synopses, where n is the number of relations in the join graph. The correlated sample synopsis can estimate with both random and correlated samples and needs just one synopsis for the entire database. This potential storage space advantage can translate into higher estimation accuracy for the correlated sample synopsis.
While TuG covers all types of joins, its selections are restricted to table-level selections. It cannot accommodate predicates like "R.age = 30 OR S.salary > 50K" that cannot be pushed down to the individual tables. The correlated sample synopsis has no such restriction; it can be applied to any selection and join. In addition, selectivity estimates with statistical confidence levels and intervals can also be provided using the correlated sample synopsis.
In this research, we propose to store a set of correlated sample tuples as the database synopsis. We have developed a powerful estimator, called the reverse estimator, that can fully utilize the correlated sample tuples to make accurate estimations without relying on simple random sampling. The proposed reverse estimator represents a breakthrough in utilizing samples to make estimations. We have implemented the join synopses, TuG, and the correlated sample synopsis to compare their performance. The experimental results showed that both the join synopses (JS) and the correlated sample synopsis (CS2) were much easier and several orders of magnitude faster to construct than TuG. In addition, they also generated more accurate estimates than TuG, with the correlated sample synopsis generating the most accurate estimates. The experimental results confirmed that the correlated sample synopsis is a simple, concise, and accurate summary of the database, not to mention that it is also the most general synopsis, applicable to queries with all types of join relationships and arbitrary selection conditions. It is the desired database synopsis that researchers have long been searching for.
The rest of the paper is organized as follows. Section 2 introduces some background knowledge. Section 3 discusses the construction of a correlated sample synopsis. In Section 4, we describe selectivity estimation for queries with acyclic join graphs, and in Section 5 for cyclic graphs. Section 6 presents a method to improve the estimation accuracy of commonly posed queries. Section 7 presents the experimental results, and Section 8 concludes.
2. PRELIMINARY
In this section, we define terminology that will be used to
model the join relationships in a database. For simplicity,
here we focus on equi-joins. The framework can be extended
to other joins easily. Hereafter, a join refers to an equi-join.
2.1 Join Graphs
Definition 1. (Join Graph of a Database) The join graph G(V, E) of a database consists of a set of vertices V = {R_1, R_2, ..., R_n}, each denoting a relation in the database, and a set of edges E, where each edge ⟨R_i, R_j⟩ ∈ E, R_i, R_j ∈ V, denotes the existence of a join relationship between R_i and R_j.
In this research, we make no assumption about the multiplicity of the join relationships; they can be one-to-one, many-to-one, or many-to-many (often realized by many-to-one relationships in the relational model).
[Figure 1: A Join Graph — vertices R_1, R_2, R_3, R_4 with edges R_1–R_2, R_2–R_3, and R_2–R_4.]
Example 1. Figure 1 shows the join graph of a database where there are join attributes between R_1 and R_2, R_2 and R_3, and R_2 and R_4.
For simplicity and without loss of generality, we shall assume there is at most one join relationship between a pair of relations and each join relationship involves only one join attribute. The discussion here can be easily extended to general cases.
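To make the definition concrete, a join graph can be held as a plain adjacency structure. The sketch below is our own illustrative encoding (not part of the paper) of the graph of Figure 1:

```python
# Join graph of Figure 1 as an undirected adjacency list.
# Vertices are relations; each listed neighbor denotes a join relationship.
join_graph = {
    "R1": ["R2"],
    "R2": ["R1", "R3", "R4"],
    "R3": ["R2"],
    "R4": ["R2"],
}

def edges(graph):
    """Return the set of undirected edges <Ri, Rj>, each listed once."""
    return {tuple(sorted((u, v))) for u in graph for v in graph[u]}

print(sorted(edges(join_graph)))
# [('R1', 'R2'), ('R2', 'R3'), ('R2', 'R4')]
```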
Definition 2. (Join Graph of a Query) Let Q be a query over a set of relations V′ ⊆ V. The join graph of Q is a subgraph of G, denoted G′(V′, E′), where V′ ⊆ V and E′ ⊆ E, with each edge ⟨R_i, R_j⟩ ∈ E′ corresponding to a join between R_i and R_j in Q.
If the join graph of a query or a database is disconnected,
we can consider each connected component separately and
then merge them by Cartesian products. Therefore, we shall
assume hereafter all join graphs are connected.
Definition 3. (Joinable Relations) A pair of relations R_i and R_k, i ≠ k, is said to be joinable, denoted J(R_i, R_k), if there is a path of length ≥ 1 between R_i and R_k in the join graph of the query. If there is an edge ⟨R_i, R_k⟩ ∈ E, we say R_i and R_k are directly joinable; otherwise, they are indirectly joinable.
Definition 4. (Joinable Tuples) Assume R_i and R_k are a pair of joinable relations with a path ⟨R_i, R_{i+1}, ..., R_{k−1}, R_k⟩. A pair of tuples t_i ∈ R_i and t_k ∈ R_k are said to be joinable, denoted J(t_i, t_k), if t_i can find a match t_{i+1} in R_{i+1}, t_{i+1} can find a match t_{i+2} in R_{i+2}, ..., and t_{k−1} can find a match t_k in R_k. Joinable tuples t_i ∈ R_i and t_k ∈ R_k are directly joinable if R_i and R_k are directly joinable and the tuples match directly; otherwise, they are indirectly joinable.
3. CORRELATED SAMPLE SYNOPSIS CONSTRUCTION
Sample tuples naturally embody the intricate relationships of attribute values and thus can be good representatives of the underlying tuples of a relation. They are very effective for estimating the result sizes of select queries with arbitrary selection conditions [13]. Unfortunately, independent simple random samples, each consisting of randomly drawn tuples from a participant relation of a query, generate large errors in join size estimation [13]. We observed that the large errors are caused by the independence of the sample tuples, which disregards the join relationships between tuples of the participant relations. Preserving the join relationships may hold the key to constructing an effective sample synopsis. With this principle in mind, we design a simple sampling scheme, called correlated sampling, that preserves the join relationships of tuples.
Table 1: Notation used in the paper

Symbol(s)             Meaning
R_i                   Relation
t ∈ R_i               A tuple t in relation R_i
G(V, E)               Join graph of a database
⟨R_i, R_k⟩            Edge in the join graph
J(R_i, R_k)           R_i and R_k are joinable relations
J(t_i, t_k)           t_i and t_k are joinable tuples
SRSWOR                Simple random sample without replacement
S*_i                  SRSWOR of R_i
S_i                   Sample relation
S                     Correlated sample synopsis (CS2)
⋈(...)                Joins without order
Y                     Query result size
Y_i                   Number of result tuples generated by t_i ∈ R_h
y_i                   Number of result tuples generated by t_i ∈ S_h
Ŷ, Ŷ_p, Ŷ_RV          Estimations of Y
r_i                   Joinable tuple sampled ratio (JR)
Z_i                   Standard total number of result tuples (STNRT)
{z_i}_{i=1}^{n_1}     Reverse sample
R^J_h, R^{Jc}_h       Joinable and non-joinable partitions
3.1 Correlated Sampling
Correlated Sampling aims to gather a sample of joinable
sample tuples from joinable relations in the database to form
a synopsis, denoted by S.
Definition 5. (Correlated Sampling) First, a relation, say R_1, is chosen as the source relation, from which a simple random sample without replacement (SRSWOR) [6] is collected into another relation, denoted S_1. We include S_1 in the synopsis S and mark R_1 as visited. By induction, when a relation R_i is visited and S_i is the sample of R_i, we follow an unvisited edge ⟨R_i, R_j⟩ in the join graph of the database. Instead of taking a SRSWOR, we collect the tuples in R_j that are directly joinable with the tuples in S_i. We include these correlated tuples in S_j, and repeat this procedure until all relations in the join graph are exhausted and all tuples directly or indirectly joinable with S_1 are collected. Finally, the synopsis S = {S_1, ..., S_n} is called a Correlated Sample Synopsis, or CS2.
Note that, to emphasize that S_1 is a SRSWOR of R_1, we write S*_1 instead of S_1 hereafter. S*_1 and the other S_j's are called sample relations, and their tuples are called sample tuples. A sample relation S_j that is not a SRSWOR is called a correlated sample relation, its tuples are called correlated sample tuples, and R_j is called a non-source relation.
Algorithm 1 describes the general idea of correlated sampling. We shall refine it later.

Algorithm 1 Correlated Sampling
Require: G — join graph of the database; n_1 — sample size
 1: R_a = Source_Selection(G)  // source relation selection
 2: S*_a = ∅; S_i = ∅ (∀ i ≠ a)
 3: S*_a = SRSWOR(R_a, n_1)  // collect a simple random sample
 4: W = {R_a}  // mark R_a as visited
 5: while ∃ an unvisited edge ⟨R_i, R_j⟩ with R_i ∈ W do
 6:     S_j = S_j ∪ {t | t ∈ R_j, t is directly joinable with a tuple in S_i}  // collect correlated sample tuples
 7:     W = W ∪ {R_j}  // mark R_j as visited
 8: end while
 9: S = {S*_a} ∪ (∪_{j≠a} S_j)
10: return S — correlated sample synopsis

3.1.1 Sampling Path and Source Relation Selection
When choosing the sampling order along the join graph, although one can follow any join relationship (many-to-one, one-to-many, or many-to-many) to gather correlated sample tuples, we suggest following a many-to-one relationship from a visited relation to another relation (visited or not). The rationale is that following one-to-many (and perhaps many-to-many) relationships can easily explode the size of the synopsis, and thus reduce the number of sample tuples that can be drawn from the source relation.
The same reasoning applies to the selection of the source relation: we prefer as the source relation one that can reach the most relations, directly or indirectly, through only many-to-one relationships. It is possible that the source relation cannot reach all the relations in the graph through only many-to-one relationships; in that case, one may have to add additional source relations to cover the unvisited relationships and relations.
3.1.2 Sample Size and Synopsis Maintenance
In Step 3 of Algorithm 1, we assume for simplicity that the sample size n_1 from the source relation is known beforehand. In reality, however, the sample size that meets the synopsis space requirement may not be known until the end of the sampling process. Given a synopsis space budget, the simplest approach is to draw one tuple at a time from the source relation and collect the related tuples along the sampling path, continuing until the space budget is used up. If only many-to-one relationships are followed in the sampling, one can also draw a batch of sample tuples at a time until there is no budget left.
The correlated sample synopsis can be updated using existing algorithms [10]. Alternatively, since the construction of a synopsis takes just a few seconds (e.g., 10 seconds, as demonstrated in Section 7), one can simply reconstruct the synopsis periodically to keep it up to date.
3.2 A Primitive Correlated Sample Synopsis
For simplicity, we shall name the relations following the order in which they are sampled by the correlated sampling scheme described above. That is, R_1 is the source relation, R_2 is sampled after the source relation, then R_3, R_4, and so on. To specify the order explicitly, we modify the join graph into a directed sampling order graph (as shown in Figure 2(a)), in which a double-headed arrow from R_i to R_j indicates that R_j is sampled based on the sample tuples of R_i, i.e., R_j is a child of R_i (equivalently, R_i is a parent of R_j). Ancestor and descendant relationships in a sampling order graph are defined in the same way as in an ordinary directed graph.
Example 2. (A Primitive CS2) Consider the order of sampling in Figure 2(a) and the join relationships among tuples in Figure 2(b). Each tuple is represented only by its ID for simplicity: R_1 = {1, 2, 3}, R_2 = {a, b, c}, R_3 = {A, B}, and R_4 = {I, II}. The lines, both solid and dashed, between tuples represent the matching of join attribute values. R_1 and R_2 are in a many-to-many relationship. The solid lines indicate that their end tuples are selected into the synopsis. That is, two tuples of R_1, 2 and 3, are first drawn into S*_1 (i.e., S*_1 = {2, 3}). Then b and c of R_2 are selected into S_2 because their join attribute values match those in S*_1. After that, B is selected from R_3 into S_3 because of matching join attribute values with b, c in S_2. Finally, II of R_4 is selected into S_4 because it matches c of S_2. Therefore, the sample synopsis S consists of four sample relations, as shown in Figure 2(c): S*_1 = {2, 3}, S_2 = {b, c}, S_3 = {B}, S_4 = {II}.

[Figure 2: A Primitive CS2 — (a) sampling order graph R_1 → R_2 → {R_3, R_4}; (b) joinable tuple IDs; (c) the resulting CS2.]
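The selection above can be replayed in a few lines. The sketch below is our own encoding: the match sets are read off Figure 2(b), and the sampling order R_1 → R_2 → {R_3, R_4} is hardcoded rather than derived by the general graph traversal of Algorithm 1:

```python
# Toy database of Example 2. Join matches between neighboring relations,
# inferred from Figure 2(b): pairs are (R2 tuple, R1 tuple), etc.
match_R1_R2 = {("a", 1), ("b", 1), ("b", 2), ("c", 3)}
match_R2_R3 = {("a", "A"), ("b", "B"), ("c", "B")}
match_R2_R4 = {("a", "I"), ("c", "II")}

def correlated_sample(s1):
    """Propagate a SRSWOR s1 of the source R1 along the sampling order,
    collecting the directly joinable tuples into S2, S3, S4."""
    s2 = {u for (u, t) in match_R1_R2 if t in s1}
    s3 = {v for (u, v) in match_R2_R3 if u in s2}
    s4 = {w for (u, w) in match_R2_R4 if u in s2}
    return s2, s3, s4

# With S1* = {2, 3}, as in the example:
s2, s3, s4 = correlated_sample({2, 3})
print(sorted(s2), sorted(s3), sorted(s4))   # ['b', 'c'] ['B'] ['II']
```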
Note that the synopsis S described here is just in its primitive form.
Instead of storing all attributes of a tuple in the synopsis, one may choose to store only the join attributes and those attributes that are often referenced in selection conditions. This would significantly reduce the size of a synopsis.
4. QUERY ESTIMATION
By evaluating queries against the sample relations, we estimate query result sizes from the sample query results. We focus on queries that contain equi-joins and arbitrary selection predicates; the query result size can therefore be obtained by summing up the number of result tuples contributed by each tuple of an arbitrary relation in the query. Due to space limitations, we consider here only a database with an acyclic join graph G(V, E), and we assume no dangling tuples exist in the joins. Dangling tuples in the source relation do not generate result tuples; dangling tuples in non-source relations are discussed in Section 4.2.3.
4.1 Source Query Estimation
A source query here refers to a query that has the source relation, R_1, as one of its operand relations.
Consider a source query over relations R_1, R_2, ..., R_n, and let {S*_1, ..., S_n} be the respective sample relations in the synopsis. Each result tuple of the query must be generated by a set of n tuples {t_1, ..., t_n}, t_i ∈ R_i, 1 ≤ i ≤ n; we say that each such t_i generates that result tuple. Let Y_i be the number of result tuples generated by the i-th tuple t_i ∈ R_1. Then the query result size is Y = \sum_{i=1}^{N_1} Y_i, where N_1 = |R_1|. We propose the following unbiased estimator for Y.
Theorem 1 (Unbiasedness of CS2). Given S*_1, a SRSWOR of the source relation R_1, the estimator

\hat{Y} = \frac{N_1}{n_1} \sum_{i=1}^{n_1} y_i    (1)

is an unbiased estimator of the query result size Y, where N_1 = |R_1|, n_1 = |S*_1|, and y_i is the number of result tuples generated by the i-th tuple in S*_1. Also,

v_1(\hat{Y}) = \frac{N_1^2}{n_1} \varepsilon^2 (1 − f)

is an unbiased estimator of Var(\hat{Y}), where f = n_1/N_1 and \varepsilon^2 = \sum_{i=1}^{n_1} (y_i − \bar{y})^2 / (n_1 − 1), with \bar{y} the sample mean of the y_i.
Proof. We only have to show that {y_j}_{j=1}^{n_1} is a SRSWOR of {Y_i}_{i=1}^{N_1}; the theorems in [6] then apply.
First, y_j is the total number of result tuples generated by tuple u_j ∈ S*_1 when query Q is executed on the CS2, S. According to Algorithm 1 (correlated sampling), any correlated tuple v_k ∈ R_j with J(v_k, u_j) is included in S_j, j > 1. That is, every result tuple generated by u_j on the original database is also generated by u_j on S. On the other hand, S is a collection of sample relations, which are subsets of the original relations; thus, for the same query Q, u_j cannot generate more result tuples on S than on the original database.
Hence, for every u_j = t_i with u_j ∈ S*_1 and t_i ∈ R_1, we have y_j = Y_i. Since S*_1 is a SRSWOR of R_1, {y_j}_{j=1}^{n_1} is also a SRSWOR of {Y_i}_{i=1}^{N_1}.
By Theorems 2.1 and 2.4 of [6], the theorem is proved.
Example 3. (Source Query Estimation) Consider the database and synopsis in Figure 2. We attempt to estimate the sizes of all queries that involve the source relation R_1, that is, ⋈(R_1, R_2), ⋈(R_1, R_2, R_3), ⋈(R_1, R_2, R_4), and ⋈(R_1, R_2, R_3, R_4). Here, ⋈(R_i, ..., R_k) represents the joins of R_i, ..., R_k in no particular order.
First, we use S*_1, S_2, S_3, and S_4 to compute the sample results of the queries:

|⋈(S*_1, S_2)| = 2
|⋈(S*_1, S_2, S_3)| = 2
|⋈(S*_1, S_2, S_4)| = 1
|⋈(S*_1, S_2, S_3, S_4)| = 1

Then the query sizes are estimated by Eq. (1), where N_1 = 3 and n_1 = 2. The estimated sizes of ⋈(R_1, R_2), ⋈(R_1, R_2, R_3), ⋈(R_1, R_2, R_4), and ⋈(R_1, R_2, R_3, R_4) are 3, 3, 3/2, and 3/2, respectively, as compared to the exact sizes 4, 4, 2, and 2.
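Equation (1) applied to these sample counts is a one-liner. A minimal sketch, with the numbers copied from Example 3 (the query labels are our own shorthand):

```python
# Eq. (1): Y_hat = (N1/n1) * sum(y_i) = (N1/n1) * |query result on samples|.
N1, n1 = 3, 2
sample_result_sizes = {          # |join over sample relations|, Example 3
    "R1*R2": 2,
    "R1*R2*R3": 2,
    "R1*R2*R4": 1,
    "R1*R2*R3*R4": 1,
}
estimates = {q: N1 / n1 * s for q, s in sample_result_sizes.items()}
print(estimates)
# {'R1*R2': 3.0, 'R1*R2*R3': 3.0, 'R1*R2*R4': 1.5, 'R1*R2*R3*R4': 1.5}
```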
4.2 No-Source Query Estimation
A no-source query refers to a query that does not include the source relation.
Example 4. (No-source Query) In Figure 2, both ⋈(R_2, R_3) and ⋈(R_2, R_4) are no-source queries, since neither of their join graphs contains the source relation R_1.
The highest relation in a query refers to the relation with the least index according to the sampling order. For instance, in Example 4, R_2 is the highest relation in the query ⋈(R_2, R_3).
For any no-source query Q with R_h as the highest relation in its join graph, let Y_j be the total number of result tuples generated by the j-th tuple u_j in R_h, and let S_h be the sample relation of R_h. We want to estimate the query result size Y_{R_h} = \sum_{j=1}^{N_h} Y_j.
4.2.1 Estimation with Reverse Sampling
Since there is no SRSWOR of R_h, it is not straightforward to derive the query result size Y_{R_h} from the sample of R_h directly. Therefore, we "back track" to the source relation and consider estimating Y_{R_h} from R_1: we try to establish a relationship between Y_{R_h} and values, derived from the Y_j, on the tuples of R_1. In order to establish this "reverse" connection from R_h to R_1, we introduce the Joinable Tuple Sampled Ratio, or JR.

Definition 6. (Joinable Tuple Sampled Ratio or JR) Considering R_1 as the source relation, for each u_i ∈ R_h, h > 1, let C_i be the total number of tuples in R_1 that are joinable with u_i. Among these C_i joinable tuples, assume D_i tuples are drawn into S*_1, 0 ≤ D_i ≤ C_i. Then

r_i = \frac{D_i}{C_i}    (2)

is called the Joinable Tuple Sampled Ratio, or JR, of u_i.
Now we show how to derive the result size Y_{R_h} from values on R_1. We assign a value X_i to the i-th tuple, t_i, of R_1, as follows:

X_i = \sum_{j=1}^{N_h} A^i_j Y_j,  where A^i_j = 1 if J(u_j, t_i), u_j ∈ R_h, t_i ∈ R_1, and A^i_j = 0 otherwise    (3)

where N_h = |R_h| and J(u_j, t_i) is as in Definition 4. X_i is called the Derived Total Number of Result Tuples, or DTNRT, associated with t_i in R_1. Using the C_j of the JR, we adjust the weight of each Y_j in X_i and denote the result Z_i:

Z_i = \sum_{j=1}^{N_h} \frac{A^i_j}{C_j} Y_j    (4)

where C_j is defined in Definition 6. We call Z_i the Standard Total Number of Result Tuples, or STNRT.
Note that Z_i is associated with a tuple in R_1, but its value comes from R_h. In this sense, the values in R_h are assigned to R_1 "reversely" along the sampling path. In order to estimate Y_{R_h}, we pick a SRSWOR of {Z_i}, called a reverse sample.

Definition 7. (Reverse Sample) A reverse sample, denoted {z_i}_{i=1}^{n_1}, is a SRSWOR of the STNRT set {Z_i}_{i=1}^{N_1} such that Z_i is chosen into {z_i}_{i=1}^{n_1} if and only if the tuple t_i of R_1 is sampled into S*_1.

It is immediate that y_j = Y_i if tuple t_i ∈ R_h is sampled into S_h as u_j, where y_j is the total number of result tuples generated by u_j. Thus, according to (4), z_i can be expressed using values in S_h as follows.
z_i = \sum_{j=1}^{n_h} \frac{a^i_j}{C_j} y_j,  where a^i_j = 1 if J(u'_j, t'_i), u'_j ∈ S_h, t'_i ∈ S*_1, and a^i_j = 0 otherwise    (5)

To estimate the no-source query result size Y_{R_h}, we propose the Reverse (RV) Estimator, defined as

\hat{Y}_{RV} = \frac{N_1}{n_1} \sum_{j=1}^{n_h} r_j y_j    (6)

where N_1 = |R_1|, n_1 = |S*_1|, n_h = |S_h|, and y_j is the total number of result tuples generated by the j-th tuple in S_h.
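Given the JR values r_j stored with the correlated sample tuples, Eq. (6) is straightforward to evaluate. A minimal sketch (the function and argument names are ours, not the paper's):

```python
def rv_estimate(N1, n1, jr, y):
    """Reverse (RV) estimator of Eq. (6): (N1/n1) * sum_j r_j * y_j,
    where jr[j] and y[j] belong to the j-th tuple of the sample S_h."""
    assert len(jr) == len(y)
    return N1 / n1 * sum(r * yj for r, yj in zip(jr, y))

# Numbers from Example 5 below: S2 = {b, c}, r_b = 1/2, r_c = 1, Y_b = Y_c = 1.
print(rv_estimate(3, 2, [0.5, 1.0], [1, 1]))   # 2.25
```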
The significance of the RV estimator is that it provides unbiased estimations for no-source queries using CS2, as shown in the following theorem.

Theorem 2 (Unbiasedness of CS2). Given a no-source query with R_h as the highest relation in its join graph, \hat{Y}_{RV} in Eq. (6) is an unbiased estimator of the query result size Y_{R_h}.
Proof. First, given R_1, S*_1, R_h, and S_h, consider the sum of the Z_i defined in Eq. (4):

\sum_{i=1}^{N_1} Z_i = \sum_{i=1}^{N_1} \sum_{j=1}^{N_h} \frac{A^i_j}{C_j} Y_j = \sum_{j=1}^{N_h} \frac{\sum_{i=1}^{N_1} A^i_j}{C_j} Y_j    (7)

By Definition 6, \sum_{i=1}^{N_1} A^i_j = C_j. Thus (7) can be written as \sum_{i=1}^{N_1} Z_i = \sum_{j=1}^{N_h} Y_j = Y_{R_h}; that is, the sum of the Z_i is exactly the query result size Y_{R_h}. Therefore, we can estimate the sum of the Z_i to obtain Y_{R_h}.

Second, take a reverse sample {z_i}_{i=1}^{n_1}. By Definition 7, {z_i}_{i=1}^{n_1} is a SRSWOR of {Z_i}_{i=1}^{N_1}. Therefore, we can use Eq. (1): \hat{Y} = \frac{N_1}{n_1} \sum_{i=1}^{n_1} z_i, and by Theorem 1, \hat{Y} is an unbiased estimator of Y_{R_h}.

Finally, we prove \hat{Y} = \hat{Y}_{RV}. Expressing \hat{Y} in terms of the y_j:

\hat{Y} = \frac{N_1}{n_1} \sum_{i=1}^{n_1} z_i = \frac{N_1}{n_1} \sum_{i=1}^{n_1} \sum_{j=1}^{n_h} \frac{a^i_j}{C_j} y_j = \frac{N_1}{n_1} \sum_{j=1}^{n_h} \frac{\sum_{i=1}^{n_1} a^i_j}{C_j} y_j

By Definition 6, \sum_{i=1}^{n_1} a^i_j = D_j, so

\hat{Y} = \frac{N_1}{n_1} \sum_{j=1}^{n_h} \frac{D_j}{C_j} y_j = \frac{N_1}{n_1} \sum_{j=1}^{n_h} r_j y_j = \hat{Y}_{RV}

The theorem is proved.
Theorem 3. The following is an unbiased estimator of the variance of \hat{Y}_{RV}:

v_2(\hat{Y}_{RV}) = \frac{N_1 (N_1 − n_1)}{n_1^2} \sum_{i=1}^{n_1} z_i^2 − \frac{2 N_1 (N_1 − n_1)}{n_1^2 (n_1 − 1)} \sum_{i_1=1}^{n_1} \sum_{i_2 > i_1}^{n_1} z_{i_1} z_{i_2}

where z_i is defined in the proof of Theorem 2. This theorem can be proved by (9A.37) and (9A.44) of [6].
Example 5. (No-Source Query Estimation with the RV Estimator) We use \hat{Y}_{RV} in Eq. (6) to estimate the result size, denoted Y_{R_2}, of the no-source query ⋈(R_2, R_3) executed on the relations in Figure 3(a). Note that Y_a = Y_b = Y_c = 1, because each tuple of R_2 produces exactly one result tuple. Also, |S*_1| = 2 and N_1 = |R_1| = 3, so N_1/n_1 = 3/2.
Figure 3 shows the CS2 and the JRs when S*_1 = {2, 3} and S_2 = {b, c}. Tuples 1 and 2 of R_1 are joinable with b in S_2, but only tuple 2 is selected into S*_1; thus r_b = 1/2. Similarly, r_c = 1. By Eq. (6), for this given S*_1, \hat{Y}_{RV} = (3/2)(Y_b/2 + Y_c) = 9/4. To verify that \hat{Y}_{RV} is unbiased, we consider all possible SRSWORs of size 2 from R_1 (i.e., all possible S*_1) and calculate the average of their estimates, E(\hat{Y}_{RV}). Table 2 lists all calculated \hat{Y}_{RV}. As observed, E(\hat{Y}_{RV}) is exactly the actual result size.

Table 2: Samples, RV Estimates (\hat{Y}_{RV}), and JR Values

S*_1     S_2        JR                        \hat{Y}_{RV}
{1,2}    {a,b}      r_a = r_b = 1             (3/2)(Y_a + Y_b) = 3
{1,3}    {a,b,c}    r_a = r_c = 1, r_b = 1/2  (3/2)(Y_a + Y_b/2 + Y_c) = 15/4
{2,3}    {b,c}      r_b = 1/2, r_c = 1        (3/2)(Y_b/2 + Y_c) = 9/4
                                              E(\hat{Y}_{RV}) = 3
[Figure 3: CS2 and JRs when S*_1 = {2, 3} — (a) joinable tuple IDs for R_1 = {1, 2, 3}, R_2 = {a, b, c}, R_3 = {A, B}; (b) the CS2: S*_1 = {2, 3}; S_2 = {b, c} with JRs J_2 = {1/2, 1}; S_3 = {B} with JR J_3 = {2/3}.]
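The averaging argument of Table 2 can be checked mechanically. The sketch below (our own encoding of Figure 3(a)) enumerates every SRSWOR of size 2 from R_1, recomputes the JRs, and averages the RV estimates:

```python
from itertools import combinations

# Figure 3(a): join matches between R2 tuples and R1 tuples.
match_R1_R2 = {("a", 1), ("b", 1), ("b", 2), ("c", 3)}
Y = {"a": 1, "b": 1, "c": 1}   # each R2 tuple yields exactly one result tuple
R1, n1 = [1, 2, 3], 2

def rv(s1):
    """RV estimate (Eq. 6) for the no-source query, given S1* = s1."""
    s2 = {u for (u, t) in match_R1_R2 if t in s1}        # correlated sample
    r = {}
    for u in s2:                                         # JR of u (Eq. 2)
        joinable = {t for (v, t) in match_R1_R2 if v == u}
        r[u] = len(joinable & s1) / len(joinable)
    return len(R1) / n1 * sum(r[u] * Y[u] for u in s2)

ests = [rv(set(s)) for s in combinations(R1, n1)]
print(ests, sum(ests) / len(ests))   # [3.0, 3.75, 2.25] 3.0
```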
4.2.2 JR Storage and Calculation
A CS2 is constructed for use by all possible queries. A JR value is stored with each correlated sample tuple in S_i (i > 1) for use by all no-source query estimations.
As described in Algorithm 2, the JR values can be calculated in a top-down fashion following the sampling path. First, for each sample tuple u in the direct children of S*_1, we compute the sets C_u and D_u, where C_u comprises the IDs of u's joinable tuples in R_1 and D_u comprises those in S*_1. Then, each sample tuple in the other sample relations takes the duplicate-free union of the C_u and D_u sets of its joinable tuples in its parent sample relation. By Definition 6, C_u = #(C_u) and D_u = #(D_u), where #(·) denotes the cardinality of a set. Therefore, the JR value of tuple u is calculated as r_u = #(D_u)/#(C_u). Finally, all the intermediate data, C_u and D_u, are removed after the JR values are calculated.
In Algorithm 2, C_u and D_u are computed only for correlated sample tuples, and they can be obtained in parallel with the correlated sampling procedure (Algorithm 1). Therefore, the intermediate time and space overheads are negligible when the sample size is small. In addition, in Section 7, we show that the estimation of CS2 is fairly satisfactory using only a quite small sample.
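Phase 1 of Algorithm 2, the C_u/D_u bookkeeping for a direct child of R_1, can be sketched on the same toy data; Phase 2's union propagation down the sampling path is omitted here:

```python
# Phase 1 of Algorithm 2 on the toy data: for each correlated sample tuple u,
# C_u = IDs of R1 tuples joinable with u, D_u = the subset of those in S1*.
match_R1_R2 = {("a", 1), ("b", 1), ("b", 2), ("c", 3)}
s1 = {2, 3}                                      # S1*
s2 = {u for (u, t) in match_R1_R2 if t in s1}    # correlated sample of R2

C = {u: {t for (v, t) in match_R1_R2 if v == u} for u in sorted(s2)}
D = {u: C[u] & s1 for u in C}
jr = {u: len(D[u]) / len(C[u]) for u in C}       # r_u = #(D_u)/#(C_u)
print(jr)   # {'b': 0.5, 'c': 1.0}
```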
4.2.3 Estimation Involving Non-joinable Tuples
Definition 8. (Joinable and Non-joinable Partition) Considering R_1 as the source relation, the joinable partition of R_h (with respect to R_1), denoted R^J_h, refers to the subset of tuples in R_h that are joinable with tuples in R_1. The complement of R^J_h in R_h is called the non-joinable partition of R_h (with respect to R_1), denoted R^{Jc}_h.
By Definition 8, tuples in R^{Jc}_h have no chance of being selected into the correlated sample synopsis. In order to estimate the result size of no-source queries with R_h as the highest relation, we need to account for the number of result tuples generated by R^{Jc}_h.
Algorithm 2 Calculate JR
Require: G(V, E): the acyclic join graph of the database;
        S = {S*_1} ∪ {S_2, ..., S_n}: CS2 without JRs
 1: W = ∅                          // the set of visited relations
 2: // Phase 1: direct children of R_1
 3: U = {S_i ∈ S | R_i is a direct child of R_1 in G}
 4: for all S_i ∈ U do
 5:   for all u ∈ S_i do
 6:     // IDs of tuples in R_1 and S*_1 joinable with u
 7:     C_u = {t.ID | t ∈ R_1, J(t, u)}
 8:     D_u = {t.ID | t ∈ S*_1, J(t, u)}
 9:     u ← C_u                    // append to u
10:     u ← D_u
11:   end for
12:   W = W ∪ {S_i}                // mark S_i as visited
13: end for
14: // Phase 2: other correlated samples
15: while W ≠ CS2 do
16:   pick S_i whose parent S_p is in W
17:   for all v ∈ S_i do
18:     // all the joinable tuples of v in S_i's parent relation
19:     J_v = {u | u ∈ S_p, J(u, v)}
20:     C_v = ⋃_{u ∈ J_v} C_u      // union without duplicates
21:     D_v = ⋃_{u ∈ J_v} D_u
22:     v ← C_v                    // append to v
23:     v ← D_v
24:   end for
25:   W = W ∪ {S_i}                // mark S_i as visited
26: end while
27: // Phase 3: calculate JR values
28: for all S_i ∈ CS2 do
29:   for all u ∈ S_i do
30:     r_u = #(D_u)/#(C_u)        // calculate JR
31:     u ← r_u                    // append JR to each sample tuple
32:     delete C_u, D_u            // release extra space
33:   end for
34: end for
35: return S                       // CS2 with JRs calculated

Definition 9. (Partition Difference) For any given query Q, let Y^J and Y^Jc be the total numbers of result tuples generated by R^J_h and R^Jc_h, respectively. The partition difference between R^J_h and R^Jc_h with respect to Q, denoted by µ, is defined as

    µ(R^J_h, R^Jc_h) = | (n^Jc · Y^J) / (n^J · Y^Jc) − 1 |        (8)
The partition difference µ is specific to a fixed query Q. To estimate the general difference for any query with R_h as the highest relation, one possible way is to select a set of sample queries and calculate an averaged µ(R^J_h, R^Jc_h). One can also examine the distributions of tuples over the join attributes of these queries.
Given a tolerance threshold θ ∈ [0, 1], we claim that if µ(R^J_h, R^Jc_h) < θ, we can estimate Y^Jc as

    Ŷ^Jc = (n^Jc / n^J) · Ŷ^J_RV        (9)

where Ŷ^J_RV is the RV estimator for Y^J. The relative error of E(Ŷ^Jc) is

    | E(Ŷ^Jc) − Y^Jc | / Y^Jc = | (n^Jc · Y^J) / (n^J · Y^Jc) − 1 | = µ < θ

That is, when µ < θ, the relative estimation error of Ŷ^Jc in Eq (9) is less than the tolerance threshold θ.
When µ(R^J_h, R^Jc_h) ≥ θ, we have to estimate R^Jc_h separately. An additional simple random sample S'_h, called a supplement sample, is taken from R^Jc_h, and its correlated sample tuples are stored into the respective sample relations of R_h in the join graph G.
To estimate Y^Jc using S'_h, we modify Ŷ in Eq (1) by treating R^Jc_h as a source relation. That is,

    Ŷ^Jc = (N^Jc_h / n^Jc_h) · Σ_{i=1}^{n^Jc_h} y_i        (10)

where N^Jc_h = |R^Jc_h|, n^Jc_h = |S'_h|, and y_i is the number of result tuples generated by the i-th tuple in S'_h.
Let Ŷ^J_RV be the RV estimator of Y^J. Since Y^J and Y^Jc are independent, we have the following theorem.
Theorem 4. Ŷ_h = Ŷ^J_RV + Ŷ^Jc is an unbiased estimator of the query result size Y_Rh. Moreover, v_2(Ŷ^J_RV) + v_1(Ŷ^Jc) is an unbiased estimator of Var(Ŷ_h), where v_1 and v_2 are defined as in Theorem 1 and Theorem 3.
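The case analysis above can be summarized in a short sketch. The numbers and the helper name `estimate_non_joinable` are hypothetical, and we assume n_j and n_jc stand for the partition cardinalities n^J and n^Jc:

```python
def estimate_non_joinable(theta, mu, n_j, n_jc, y_j_rv,
                          N_jc=None, supplement_counts=None):
    """Estimate Y^Jc: Eq (9) when mu < theta, else Eq (10)."""
    if mu < theta:
        # Eq (9): scale the RV estimate of the joinable partition,
        # which bounds the relative error by theta.
        return (n_jc / n_j) * y_j_rv
    # Eq (10): expansion estimator over a supplement sample S'_h
    # drawn from R^Jc_h; supplement_counts holds the y_i values.
    n_sup = len(supplement_counts)        # n^Jc_h = |S'_h|
    return (N_jc / n_sup) * sum(supplement_counts)

# Small mu: 20 non-joinable vs. 80 joinable tuples, RV estimate 400.
small = estimate_non_joinable(0.1, 0.05, n_j=80, n_jc=20, y_j_rv=400.0)   # 100.0
# Large mu: fall back to a 4-tuple supplement sample of R^Jc_h.
large = estimate_non_joinable(0.1, 0.3, 80, 20, 400.0,
                              N_jc=20, supplement_counts=[3, 0, 5, 2])    # 50.0
```

In the second call, the per-tuple result counts y_i sum to 10, which is expanded by N^Jc_h / n^Jc_h = 20/4.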
5. SELECTIVITY ESTIMATION FOR QUERIES WITH CYCLIC GRAPHS
If the join graph G' of a query is cyclic, a query result tuple t_r may involve two different tuples of the same relation, say t_1 and t_2, both in R_h, with a cycle in G' from t_1 to t_2. In that case, we say t_r is generated by t_1 rather than t_2. Hence, each result tuple still corresponds to one and only one tuple in each relation when there are no dangling tuples.
Figure 4(a) shows a cyclic join graph and Figure 4(b) the join relationships. Let us use an example to illustrate the issues encountered in a cyclic join graph. Assume tuple 1 is first drawn from R_1. Then, following the correlated sampling (Figure 2), the related tuples A (∈ R_2), b (∈ R_3), α, β (∈ R_4), I (∈ R_5), and ∆ and Θ (∈ R_6) are drawn into the respective sample relations S_i, 2 ≤ i ≤ 6, as shown in Figure 4(c). Note that S_4 contains tuples related to tuples in S_2 (i.e., A) and in S_3 (i.e., b) because the two sampling branches from R_1 intersect at R_4.
While this synopsis (Figure 4(c)) is sufficient for estimating queries involving relations that are subsets of {R_1, R_2, R_4, R_6} or {R_1, R_3, R_4, R_5, R_6}, it is not straightforward to estimate queries involving relations from both sampling branches. For example, consider the query ✶(R_2, R_4, R_3), where R_2 and R_3 are on different branches. Both R_2 and R_3 are candidates for the highest relation of the query. Assume we choose S_2 = {A} to estimate the query size. From the synopsis, we know A is connected to α of R_4. Unfortunately, we do not know whether α is connected to any tuples in R_3, because we did not let α pierce through R_4 to reach its related tuples in R_3 (i.e., a). A similar problem occurs if we use S_3 = {b} to estimate the query size. That is, b is connected to β (∈ R_4), but we do not know whether β is connected to any tuples in R_2, because the sampling did not penetrate into R_2 to reach the related tuples (i.e., B here).
Here, we propose two possible ways to handle this situation. The first is to estimate query sizes using the independence and uniformity assumptions. The second is to enhance the synopsis by letting the sampling pierce through the cycle-completing relation (e.g., R_4) and perhaps other relations.

[Figure 4: CS2 for a Cyclic Join Graph. (a) A cyclic join graph over R_1, ..., R_6. (b) Joinable tuple IDs: R_1 = {1, 2}, R_2 = {A, B}, R_3 = {a, b}, R_4 = {α, β}, R_5 = {I, II}, R_6 = {∆, Θ}. (c) CS2: S*_1 = {1}, S_2 = {A}, S_3 = {b}, S_4 = {α, β}, S_5 = {I}, S_6 = {∆, Θ}. (d) Extended samples: S+_3 = {a}, S+_5 = {II}.]
5.1 Estimation Using the Independence and Uniformity Assumptions
Consider the query ✶(R_2, R_4, R_3). Let T_24 and T_34 be the estimates for |R_2 ✶ R_4| and |R_3 ✶ R_4|, derived with the method of Section 4.2 using the sample relations {S_2, S_4} and {S_3, S_4}, respectively. Let N_4 be the number of tuples in R_4. Assume every tuple in R_4 is equally likely to participate in the results of R_2 ✶ R_4 and R_3 ✶ R_4. Then |✶(R_2, R_4, R_3)| can be estimated as

    T_24 · T_34 / N_4

This approach is simple, but the accuracy of its estimations may raise concerns because of the use of the independence and uniformity assumptions.
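As a sketch, with hypothetical numbers (T_24 and T_34 are the partial-join estimates above; the function name is ours):

```python
def cyclic_join_estimate(t24, t34, n4):
    """|join(R2, R4, R3)| under the independence and uniformity
    assumptions: every R4 tuple is equally likely to appear in
    either partial join, so the estimates compose as T24*T34/N4."""
    return t24 * t34 / n4

# E.g., 600 estimated tuples in R2 join R4, 450 in R3 join R4, |R4| = 900:
print(cyclic_join_estimate(600.0, 450.0, 900))  # 300.0
```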
5.2 Enhancing the Synopsis with Extended Samples
We allow the sampling to pierce through the cycle-completing relation (e.g., R_4) and extend to any relation as desired. In general, we only extend to the relations that are often referenced in queries together with the cycle-completing relation.
For example, we can let the left sampling branch in Figure 4, which goes from R_1 through R_2 to R_4, extend to R_3 (or even on to R_1 and R_5). An enhanced synopsis is shown in Figure 4(d), where the left sampling branch is extended to R_3 and R_5. In the figure, S+_i contains the extended sample tuples.
We do not need to differentiate S_i from S+_i in the sample evaluations unless R_i is the highest relation of the query. For example, consider the query ✶(R_2, R_4, R_3, R_5) and the enhanced synopsis in Figures 4(c) and 4(d). Assume R_2 is chosen as the highest relation for evaluation. Consequently, S_2, without S+_2, must be used. For the other relations, we can simply use S_i ∪ S+_i. The sample query result can be obtained by ✶(S_2, S_4, S_3 ∪ S+_3, S_5 ∪ S+_5), which is the same as ✶(S_2, S_4, S+_3, S+_5). The latter may be more efficient, but the former is simpler: always use S_i ∪ S+_i, except when S_i is the highest relation.
6. MULTI-SOURCE SYNOPSIS
Figure 5 shows a join graph with two source relations, R_A and R_B. Let A and B be the two groups of relations with R_A and R_B as their source relations, respectively.

[Figure 5: Multiple Source Relations. The groups A and B share a common relation R_i.]

The two groups have a common relation, R_i, but there could certainly be more. Note that the sampling processes started from R_A and R_B each draw their own related sample tuples from R_i and store them in S_i.
Given a query, if there exists a sample that penetrates all its operand relations, an estimate of the query size can be derived using the methods described in Section 4. If no sample penetrates all the operand relations, we resort to the estimation method based on the independence and uniformity assumptions in Section 5.
A multi-source synopsis arises from traversing only the many-to-one join relationships in the sampling, but there are other reasons for constructing one. In situations where multiple groups of relations are often referenced together, we may select a source relation for each such group. Depending upon the importance of queries and the desired accuracy, a multi-source synopsis allows a DBA to determine the grouping of relations and the sample sizes for individual groups. It can improve the accuracy of estimation for queries over relations in the same group, but it decreases the accuracy for queries over relations across groups, due to the use of the independence and uniformity assumptions in estimation. It can be a very flexible and effective way to build a synopsis.
7. EXPERIMENTS
In this section, we present the results of performance evaluations of TuG [27, 28], join synopses (JS) [1], and the correlated sample synopsis (CS2).
The implementations of the join synopses and the correlated sample synopsis are much simpler than that of TuG, because TuG uses complex clustering and compression algorithms. We implemented the TuG synopsis (whose code is not publicly available) following the descriptions in [27], including using the public Java source code [23] of CF-Tree and BIRCH [32] for clustering, and implementing Count-Min Sketches [7] and histogram synopses [8] to reduce the size of the histograms.
All programs are implemented in Java 1.6 and MySQL 5.1 on an Ubuntu 10.04 Linux system, equipped with an Intel Core i7 3.4 GHz CPU, 4GB of RAM, and a 500GB hard disk (7200 RPM, 16MB buffer). All datasets and synopses are stored on the hard disk. Using a large amount of memory only benefits TuG, which performs clustering on large datasets; CS2 and JS require only small amounts of memory when constructing synopses.

Table 3: Datasets
Name        Size    Comments
TPCH1G05    1GB     TPC-H, skew factor z=0.5
TPCH1G1     1GB     TPC-H, skew factor z=1.0
TPCH10G05   10GB    TPC-H, skew factor z=0.5
TPCH10G1    10GB    TPC-H, skew factor z=1.0
DBLP        692MB   DBLP online database

[Figure 6: Join Graphs of Datasets (boxed are source relations). (a) The TPC-H join graph, with lineitem as the source relation. (b) The DBLP join graph, with "author of" as the source relation joining author and paper; paper also joins type and publisher.]
7.1 Datasets
We conducted experiments on both synthetic data (TPC-H datasets with skewed data [4]) and real-life data (the DBLP dataset [17]), listed in Table 3.
We generated 1GB and 10GB TPC-H datasets with skew factors z=0.5 and z=1.0. Each TPC-H dataset has 8 relations, connected only by foreign-key join relationships. The arcs in Figure 6(a) indicate the many-to-one relationships, which also satisfy the foreign-key constraints of the TPC-H data, and the boxes indicate the source relations of CS2. Note that every relation is a source relation in JS.
The DBLP database listed more than 1.3 million computer science articles as of January 2010. Our DBLP dataset derives from a relational enhancement of the original DBLP data named DBLP++ [9]. We replaced the many-to-many relationship between "author" and "paper" with two many-to-one relationships and a connecting relation "author of", as JS does not apply to many-to-many relationships directly; CS2, however, is applicable to all types of relationships. In addition, we filtered out some data so that the foreign-key constraints hold, as required by JS. The join graph is shown in Figure 6(b).
7.2 Construction Time and Index
The average construction times of the synopses in the accuracy tests are presented in Table 4. As observed, the construction of TuG is significantly slower than that of JS and CS2, because it performs time-consuming tasks such as CV-table building, all-but-one similar merges, and BIRCH clustering, all of which require tremendous amounts of disk I/O and CPU computation.

Table 4: Average Construction Time of Synopses
Dataset     Space Budget   TuG      JS      CS2
TPCH1G05    29KB           639min   10sec   8sec
TPCH1G1     21KB           286min   10sec   8sec
TPCH10G05   100KB          ∞        408sec  334sec
TPCH10G1    100KB          ∞        340sec  300sec
DBLP        35KB           518min   8sec    3sec

Building indexes will certainly increase the construction speed of all the synopses. In our experiments, we built indexes on the join-key attributes in both the TPC-H and DBLP datasets.
7.3 Query Workload
We randomly generated queries involving different sets of relations with different numbers of selection predicates. For such queries, we randomly varied the ranges of attribute values in the selection criteria. Our experiments were designed to examine how joins and selections affect the accuracy of estimations.
Table 5 details the test queries and the space budgets used for each test. "DBLP Large/Small Range" refers to queries with large/small attribute value ranges specified in the selection predicates. "DBLP No-source" refers to queries that do not involve the source relation "author of" and thus require the proposed RV estimator (Section 4); they are also among the queries used in the "DBLP Large/Small Range" experiments.
7.4 Synopsis Space Budget
According to Section 3.1.1 and Algorithm 3.1, in order to control the size of CS2, we chose "lineitem" of TPC-H and "author of" of DBLP as the source relations. Our correlated sampling also followed the many-to-one relationships in the join graphs. As Table 5 shows, the space of CS2 can be well controlled: the space budgets of CS2 are less than 20KB for the 1GB TPC-H datasets, 100KB for the 10GB TPC-H datasets, and less than 10KB for the DBLP dataset.
Note that it is hard to control the final sizes of TuGs, even though we used the same clustering parameters to generate them on the TPC-H datasets. We were not able to generate any TuG smaller than 20KB for TPC-H or 23KB for DBLP without incurring large estimation errors, and thus TuG results are absent from the 3KB and 10KB tests in Figure 7.
7.5 Estimation Accuracy
Test queries were grouped based on the numbers of joins and selections placed on each relation. For example, in Figure 7(a), the query group 'j1s2' comprises queries with one join and at most 2 selections on each operand relation, i.e., at most 4 selections per query.
Since JS and CS2 both rely on random sampling, they are not unique for a given database. Therefore, for each dataset and space budget, we generated 100 JSs and CS2s and observed their average performance. Each CS2/JS was used to estimate all test queries.
We use the absolute relative error to compare the accuracy of estimations, defined as

    E = | Ŷ_Est − Y_Act | / Y_Act × 100%

where Y_Act denotes the actual query result size and Ŷ_Est the estimated result size.

Table 5: Query Workloads and Space Budgets in Estimation Accuracy Tests
Figure  Tests                       Data Size  Total Queries  TuG    JS     CS2
7(a)    TPCH1G05 Synopses 20KB     1GB        45             29KB   20KB   20KB
7(b)    TPCH1G05 Synopses 10KB     1GB        45             —      10KB   10KB
7(c)    TPCH1G1 Synopses 20KB      1GB        45             21KB   20KB   20KB
7(d)    TPCH1G1 Synopses 10KB      1GB        45             —      10KB   10KB
7(e)    TPCH10G05 Synopses 100KB   10GB       45             —      100KB  100KB
7(f)    TPCH10G1 Synopses 100KB    10GB       45             —      100KB  100KB
7(g)    DBLP Large Range 3KB       692MB      51             23KB   3KB    3KB
7(h)    DBLP Small Range 3KB       692MB      51             —      3KB    3KB
7(i)    DBLP Small Range 10KB      692MB      51             —      10KB   10KB
7(j)    DBLP No-source             692MB      36             —      3KB    3KB

The results of the accuracy tests are shown in Figure 7.
As shown in Figures 7(a) and 7(c), TuG generally produced the worst relative errors. As mentioned earlier, TuGs smaller than 20KB cannot generate meaningful estimations. JS and CS2 had similar accuracies for queries with small numbers of joins; however, JS's accuracy degraded quickly as the number of joins increased. CS2 produced the lowest relative errors in all cases of the TPC-H tests.
Figure 7(g) shows the results of the DBLP large-range query tests. Again, CS2 was superior to JS and TuG, with TuG performing the worst. As observed from Figures 7(g) and 7(h), small-range queries gave higher errors than large-range queries. Also, as the sizes of the synopses increased, the relative errors decreased (Figures 7(h) and 7(i)). The dominance of CS2 over JS became more obvious as the numbers of joins and selections increased, as shown in Figure 7(h).
Finally, we compare the accuracy of CS2 and JS for no-source queries, all of which were extracted from the previous tests. It is clear, as shown in Figure 7(j), that CS2 with the reverse estimator produced lower relative errors than JS.
In summary, TuG performed the worst among the three synopses. CS2 produced the smallest relative errors in all tests. In addition, CS2 increasingly outperformed JS as the numbers of joins and selections grew.
7.6 Unbiasedness of the RV Estimator
In Section 4.2, we proposed the reverse estimator for all no-source queries and proved theoretically that it is unbiased (Theorem 2). Here, we validate the unbiasedness of this estimator empirically using CS2, and also compare the results with JS.
For illustrative purposes, we select two queries for this bias test, one from group 'j2s1' and the other from 'j2s2' in the DBLP tests. Each query is estimated each time using a different synopsis (or set of synopses) of the same size (3KB); we generated 100 synopses for this purpose. The averaged absolute relative error of the first i estimates is defined as

    E_i = | (1/i) Σ_{j=1}^{i} Ŷ^Est_j − Y_Act | / Y_Act × 100%

where i ≥ 1, Y_Act is the actual query size, Ŷ^Est_j is the estimate derived from the j-th synopsis (or synopses), and (1/i) Σ_{j=1}^{i} Ŷ^Est_j is the average of the first i estimates. For an unbiased estimator [6], the averaged absolute relative error should approach 0 as i (the number of tries) increases.
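The convergence check can be reproduced in a few lines; the estimate sequence below is hypothetical, and only the formula for E_i comes from the text above:

```python
def running_relative_errors(estimates, y_act):
    """E_i = |(1/i) * sum of first i estimates - Y_Act| / Y_Act * 100."""
    errors, total = [], 0.0
    for i, y_est in enumerate(estimates, start=1):
        total += y_est
        errors.append(abs(total / i - y_act) / y_act * 100.0)
    return errors

# Hypothetical unbiased estimates scattered around the true size 100:
errs = running_relative_errors([120.0, 90.0, 95.0, 95.0], y_act=100.0)
# errs starts at 20% and shrinks toward 0 as the running mean
# approaches Y_Act, which is the behavior plotted in Figure 8.
```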
As observed in Figure 8(a), the errors of both CS2 (using the RV estimator) and JS approached 0 as the number of tries increased. Figure 8(b) further demonstrates that the errors of CS2 converged to 0 earlier than those of JS within 100 rounds; that is, CS2 yields a faster convergence speed than JS when more selections are added. Certainly, JS also yields unbiased estimates, as it uses simple random samples of join results. These experiments validate our earlier proof of the unbiasedness of the RV estimator.
8. CONCLUSIONS AND FUTURE WORK
In this paper, we have presented an innovative approach to constructing a database synopsis. The synopsis mainly consists of a set of correlated sample relations. We have designed a reverse estimator that can fully utilize the correlated sample tuples to make accurate estimations. Our experimental results show that the correlated sample synopsis is easy to construct and performs better than existing synopses. It proves itself to be a simple, concise, and accurate synopsis.
There is still much work to do. We shall study the possibility of further reducing the size of the synopsis using other sampling techniques, such as subsampling, and we shall explore the use of the correlated sample synopsis in approximate aggregate query answering.
9. ACKNOWLEDGMENTS
Special thanks to Dr. Sakthivel Jeyarathnam for his help in developing the reverse estimator. We are also grateful to Mr. Chase Abner for his editing assistance.
10. REFERENCES
[1] S. Acharya, P. B. Gibbons, V. Poosala, and
S. Ramaswamy. Join synopses for approximate query
answering. SIGMOD Rec., 28:275–286, June 1999.
[2] N. Alon, Y. Matias, and M. Szegedy. The space
complexity of approximating the frequency moments.
Journal of Computer and System Sciences, pages
20–29, 1996.
[3] S. Chaudhuri, G. Das, and V. Narasayya. Optimized
stratified sampling for approximate query processing.
ACM Trans. Database Syst., 32(2), June 2007.
[4] S. Chaudhuri and V. Narasayya. Program for TPC-D
data generation with skew. ftp://ftp.research.
microsoft.com/users/viveknar/tpcdskew.
[Figure 7: Absolute Relative Errors (%) in Accuracy Tests, plotted per query group (j1s1, j1s2, j2s1, j2s2, j3s1, j3s2). Panels: (a) TPCH1G05: TuG=29KB, JS/CS2=20KB; (b) TPCH1G05: JS/CS2=10KB; (c) TPCH1G1: TuG=21KB, JS/CS2=20KB; (d) TPCH1G1: JS/CS2=10KB; (e) TPCH10G05: JS/CS2=100KB; (f) TPCH10G1: JS/CS2=100KB; (g) DBLP Large Range: TuG=23KB, JS/CS2=3KB; (h) DBLP Small Range: JS/CS2=3KB; (i) DBLP Small Range: JS/CS2=10KB; (j) DBLP No-source: JS/CS2=3KB.]
[Figure 8: Averaged Absolute Relative Errors (%) in Bias Tests over 100 estimates, comparing JS and CS2. (a) DBLP No-source, j2s1. (b) DBLP No-source, j2s2.]
[5] S. Christodoulakis. Implications of certain
assumptions in database performance evaluation. ACM
Trans. Database Syst., 9:163–186, June 1984.
[6] W. G. Cochran. Sampling Techniques, 3rd Edition.
John Wiley, 1977.
[7] G. Cormode and S. Muthukrishnan. An improved
data stream summary: the count-min sketch and its
applications. J. Algorithms, 55:58–75, April 2005.
[8] A. Deshpande, M. Garofalakis, and R. Rastogi.
Independence is good: dependency-based histogram
synopses for high-dimensional data. SIGMOD Rec.,
30:199–210, May 2001.
[9] J. Diederich. FacetedDBLP.
http://dblp.l3s.de/dblp++.php.
[10] P. B. Gibbons, Y. Matias, and V. Poosala. Fast
incremental maintenance of approximate histograms.
VLDB ’97, pages 466–475, 1997.
[11] P. J. Haas, J. F. Naughton, S. Seshadri, and A. N.
Swami. Fixed-precision estimation of join selectivity.
PODS ’93, pages 190–201, 1993.
[12] P. J. Haas and A. N. Swami. Sequential sampling
procedures for query size estimation. SIGMOD Rec.,
21:341–350, June 1992.
[13] W.-C. Hou, G. Ozsoyoglu, and B. K. Taneja.
Statistical estimators for relational algebra
expressions. PODS ’88, pages 276–287, 1988.
[14] Y. E. Ioannidis and Y. C. Kang. Left-deep vs. bushy
trees: an analysis of strategy spaces and its
implications for query optimization. SIGMOD Rec.,
20:168–177, April 1991.
[15] R. P. Kooi. The optimization of queries in relational
databases. PhD thesis, Cleveland, OH, USA, 1980.
AAI8109596.
[16] J.-H. Lee, D.-H. Kim, and C.-W. Chung.
Multi-dimensional selectivity estimation using
compressed histogram information. SIGMOD Rec.,
28:205–214, June 1999.
[17] M. Ley. The DBLP computer science bibliography.
http://www.informatik.uni-trier.de/~ley/db/.
[18] R. J. Lipton, J. F. Naughton, and D. A. Schneider.
Practical selectivity estimation through adaptive
sampling. SIGMOD Rec., 19:1–11, May 1990.
[19] V. Markl, V. Raman, D. Simmen, G. Lohman,
H. Pirahesh, and M. Cilimdzic. Robust query
processing through progressive optimization. SIGMOD
’04, pages 659–670, 2004.
[20] Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based
histograms for selectivity estimation. SIGMOD Rec.,
27:448–459, June 1998.
[21] Y. Matias, J. S. Vitter, and M. Wang. Dynamic
maintenance of wavelet-based histograms. VLDB ’00,
pages 101–110, 2000.
[22] M. Muralikrishna and D. J. DeWitt. Equi-depth
histograms for estimating selectivity factors for
multi-dimensional queries. SIGMOD ’88, pages 28–36,
1988.
[23] R. Perdisci. JBIRCH.
http://roberto.perdisci.com/projects/jbirch.
[24] V. Poosala and Y. E. Ioannidis. Selectivity estimation
without the attribute value independence assumption.
VLDB ’97, pages 486–495, 1997.
[25] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin,
R. A. Lorie, and T. G. Price. Access path selection in
a relational database management system. SIGMOD
’79, pages 23–34, 1979.
[26] J. M. Smith and P. Y.-T. Chang. Optimizing the
performance of a relational algebra database interface.
Commun. ACM, 18:568–579, October 1975.
[27] J. Spiegel and N. Polyzotis. Graph-based synopses for
relational selectivity estimation. SIGMOD ’06, pages
205–216, 2006.
[28] J. Spiegel and N. Polyzotis. TuG synopses for
approximate query answering. ACM Trans. Database
Syst., 34(1):3:1–3:56, Apr. 2009.
[29] J. D. Ullman, H. Garcia-Molina, and J. Widom.
Database Systems: The Complete Book. Prentice Hall
PTR, Upper Saddle River, NJ, USA, 1st edition, 2001.
[30] E. Wong and K. Youssefi. Decomposition - a strategy
for query processing. ACM Transactions on Database
Systems, 1:223–241, 1976.
[31] Y.-L. Wu, D. Agrawal, and A. El Abbadi. Applying
the golden rule of sampling for query estimation.
SIGMOD Rec., 30(2):449–460, May 2001.
[32] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an
efficient data clustering method for very large
databases. SIGMOD Rec., 25:103–114, June 1996.