CS2: A New Database Synopsis for Query Estimation
Feng Yu
Southern Illinois University
Carbondale, IL, 62901
fyu@cs.siu.edu
Wen-Chi Hou
Southern Illinois University
Carbondale, IL, 62901
hou@cs.siu.edu
Cheng Luo
Coppin State University
Baltimore, MD 21216
cluo@coppin.edu
Dunren Che
Southern Illinois University
Carbondale, IL, 62901
dche@cs.siu.edu
Mengxia Zhu
Southern Illinois University
Carbondale, IL, 62901
mzhu@cs.siu.edu
ABSTRACT
Fast and accurate estimations for complex queries are pro-
foundly beneficial for large databases with heavy workloads.
In this research, we propose a statistical summary for a
database, called CS2 (Correlated Sample Synopsis), to pro-
vide rapid and accurate result size estimations for all queries
with joins and arbitrary selections. Unlike the state-of-the-
art techniques, CS2 does not completely rely on simple ran-
dom samples, but mainly consists of correlated sample tuples
that retain join relationships with less storage. We introduce
a statistical technique, called reverse sample, and design a
powerful estimator, called reverse estimator, to fully utilize
correlated sample tuples for query estimation. We prove
both theoretically and empirically that the reverse estima-
tor is unbiased and accurate using CS2. Extensive exper-
iments on multiple datasets show that CS2 is fast to con-
struct and derives more accurate estimations than existing
methods with the same space budget.
Categories and Subject Descriptors
H.2.4 [Database Management]: Systems, Query Processing
General Terms
Theory
Keywords
Query Optimization, Database Synopsis, Selectivity Estima-
tion
1. INTRODUCTION
Query optimization has been a central research topic of
Database Management Systems for many years [14, 25, 26,
30]. It aims to find the most efficient execution plans for
queries.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SIGMOD’13, June 22–27, 2013, New York, New York, USA.
Copyright 2013 ACM 978-1-4503-2037-5/13/06 ...$15.00.

A query optimizer generally uses statistics about
the database [22, 24], and assumptions about the attribute
values [5, 25] to make cost estimations of alternative plans
and selects the best one for execution. It is important to
find the most efficient plans because studies [5, 19] have
shown that executions with sub-optimal plans can be orders
of magnitude slower than those with optimal plans. Un-
fortunately, due to the accuracy of the statistics stored in
the database and the validity of assumptions made, query
optimizers often cannot find the best plans in their search
spaces for the queries. On the other hand, storing accurate
statistics can require prohibitively large amounts of space.
Thus, constructing a concise, accurate database synopsis has
been a central issue of query optimization for decades. In
this research, we are mainly concerned with queries with
arbitrary select and equi-join operations.
There has been much effort to construct concise, accurate
database synopses in the past. Among them, histograms
[8, 15, 22, 24] have attracted the most attention. They are
simple and effective for estimating low-dimensional queries,
where the numbers of attributes of interest in a relation,
including the selection and join attributes, are small. How-
ever, as the dimension increases, the storage space required
by a multi-dimensional histogram can grow dramatically.
In addition, the computations and maintenance of multi-
dimensional histograms become very complex [21]. Some
techniques, such as discrete cosine transform, wavelets, sket-
ches [2, 16, 20], etc., have been proposed to compress the
histograms. Unfortunately, these size reduction techniques
become ineffective as the dimension increases.
There is also a long history of using sampling in databases
[13, 12, 11, 18, 31]. With a small amount of sample tuples,
accurate selectivity estimations can be obtained for select
queries with arbitrary selection predicates. However, for
joins, large estimation errors were observed when indepen-
dent random samples of relations are used. To overcome
this problem, Acharya et al. [1] stored a join synopsis for
each relation. The synopsis for a relation is a simple ran-
dom sample of the result of the joins of the relation and
all its descendant relations by the foreign-key relationships.
Each join synopsis can then be used to generate simple ran-
dom samples of the results of all subqueries that involve the
source relation, and thus are very useful for estimations of
a large set of queries.
Chaudhuri et al. [3] proposed using stratified sampling
to improve estimations. However, workloads must be
known beforehand for stratified sampling to be effective.
Moreover, the method is only applicable to “star queries”
with foreign-key joins. Wu et al. [31] used cumulative
distributions to make estimations, but their method mainly
focuses on range queries over a single relation.
Spiegel et al. [27, 28] proposed an innovative approach to
summarizing the correlations among the join and other at-
tribute values in a data graph. Tuples in the graph are con-
nected to nodes that represent their respective attribute val-
ues, and to tuples with which they join. To reduce the size of
the data graph, tuples with similar join- and attribute-based
characteristics are merged into clusters, and clusters with a
similar value distribution for an attribute share the same
(1-dimensional) histogram. This class of graph-based syn-
opses, called TuG (Tuple Graph) synopses, can be applied
to queries with joins of many-to-many and many-to-one re-
lationships [29], and selections with table-level predicates.
Experimental results [27, 28] showed that histogram-based
synopses are not as accurate as join synopses and TuG for
complex join queries. However, join synopses and TuG also
have some limitations and shortcomings. For example, join
synopses restrict their applications to foreign-key joins. In
addition, a join synopsis is stored for each relation in the
database, which raises concern about the total size of the
synopses. As for TuG, its accuracy is heavily dependent on
the quality of the clusterings performed. But as the num-
ber of attributes of interest increases and the relationships
among attribute values becomes more complex, clusterings
become less effective, and thus the accuracy of estimations
deteriorates dramatically. The construction of a TuG syn-
opsis is also very complex.
Motivated by the importance and continuing lack of an
effective database synopsis, we set our goal to construct
a simple, concise, and accurate database synopsis. We have
chosen to use sample tuples to construct the synopsis as in-
tricate relationships among attribute values can be naturally
embodied in the sample tuples. In addition, the space for
storing a sample tuple only increases linearly with the num-
ber of attributes (of interest) in a relation. It makes sample
tuples an effective tool for modeling multi-, and especially
high-dimensional attribute value distributions. However, in-
stead of storing independent random samples of relations,
we propose to store correlated sample tuples to preserve the
join relationships. That is, tuples of a relation are included
in the sample of a relation if they join with already drawn
sample tuples of other relations. Correlated sample tuples
from relations form the correlated sample synopsis (CS2),
which will be used for query size estimation, as well as ap-
proximate aggregate query answering.
While both the join synopses and the correlated sample
synopsis attempt to capture relationships between tuples of
different relations, there are many differences. The join syn-
opses store samples of the results of joins, while the cor-
related sample synopsis stores a sample of correlated tu-
ples from the relations. Besides this philosophical differ-
ence, the correlated sample synopsis is also more general
than the join synopses. The join synopses are applicable
only to the foreign-key join relationships, while the corre-
lated sample synopsis can be applied to all types of join
relationships, including many-to-one and many-to-many re-
lationships. The join synopses assume no dangling tuples
exist in the relations, but no such assumption is needed for
the correlated sample synopsis. The join synopses rely on
the simple random samples to make estimation and thus re-
quire constructing up to n synopses, where n is the number
of relations in the join graph. The correlated sample synop-
sis can estimate with both random and correlated samples
and needs just one synopsis for the entire database. This
potential storage space advantage can translate into higher
estimation accuracy for the correlated sample synopsis.
While TuG covers all types of joins, the selections are re-
stricted to table-level selections. It cannot accommodate
predicates like “R.age=30 OR S.salary >50K” that cannot
be pushed down to the individual tables. As for the cor-
related sample synopsis, there is no such restriction; it can
be applied to any selection and join. In addition, selectivity
estimates with statistical confidence levels and intervals can
also be provided by using the correlated sample synopsis.
In this research, we propose to store a set of correlated
sample tuples as the database synopsis. We have developed
a powerful estimator, called the reverse estimator, that can
fully utilize the correlated sample tuples to make accurate
estimations, without relying on simple random sampling.
The proposed reverse estimator represents a breakthrough
in utilizing samples to make estimation. We have imple-
mented the join synopses, TuG, and the correlated sample
synopsis to compare their performance. The experimental
results showed that both the join synopses (JS) and corre-
lated sample synopsis (CS2) were much easier and several
orders faster to construct than the TuG. In addition, they
also generated more accurate estimates than TuG, with the
correlated sample synopsis generating the most accurate es-
timates. The experimental results confirmed that the cor-
related sample synopsis is a simple, concise, and accurate
summary of the database, not to mention that it is also the
most general synopsis, applicable to queries with all types
of join relationships and arbitrary selection conditions. It
is the desired database synopsis that researchers have long
been searching for.
The rest of the paper is organized as follows. Section 2 in-
troduces some background knowledge. Section 3 discusses
the construction of a correlated sample synopsis. In Sec-
tion 4, we describe selectivity estimation for queries with
acyclic join graphs, and in Section 5 for cyclic join graphs.
Section 6 presents a method to improve the estimation ac-
curacy of commonly posted queries. Section 7 presents the
experimental results, and Section 8 concludes the paper.
2. PRELIMINARY
In this section, we define terminology that will be used to
model the join relationships in a database. For simplicity,
here we focus on equi-joins. The framework can be extended
to other joins easily. Hereafter, a join refers to an equi-join.
2.1 Join Graphs
Definition 1. (Join Graph of a Database) The join
graph G(V, E) of a database consists of a set of vertices
V = {R1, R2, ..., Rn} and a set of edges E, with each vertex
denoting a relation in the database and each edge
⟨Ri, Rj⟩ ∈ E, Ri, Rj ∈ V, denoting the existence of a join
relationship between Ri and Rj.

In this research, we make no assumption about the mul-
tiplicity of the join relationships; that is, they can be one-
to-one, many-to-one, or many-to-many (often realized by
many-to-one relationships in the relational model).
Figure 1: A Join Graph

Example 1. Figure 1 shows the join graph of a database
where there are join attributes between R1 and R2, R2 and
R3, and R2 and R4.
For simplicity and without loss of generality, we shall as-
sume there is at most one join relationship between a pair
of relations and each join relationship involves only one join
attribute. The discussions here can be easily extended to
general cases.
Definition 2. (Join Graph of a Query) Let Q be a
query over a set of relations V′ ⊆ V. The join graph of
Q is a subgraph of G, denoted by G′(V′, E′), where V′ ⊆ V,
E′ ⊆ E, with each edge ⟨Ri, Rj⟩ ∈ E′ corresponding to a
join between Ri and Rj in Q.
If the join graph of a query or a database is disconnected,
we can consider each connected component separately and
then merge them by Cartesian products. Therefore, we shall
assume hereafter all join graphs are connected.
Definition 3. (Joinable Relations) A pair of relations
Ri and Rk, i ≠ k, is said to be joinable, denoted by J(Ri, Rk),
if there is a path of length ≥ 1 between Ri and Rk in the
join graph of the query. If there is an edge ⟨Ri, Rk⟩ ∈ E,
we say Ri and Rk are directly joinable; otherwise, they are
indirectly joinable.
Definition 4. (Joinable Tuples) Assume Ri and Rk are
a pair of joinable relations with a path ⟨Ri, Ri+1, ..., Rk−1, Rk⟩.
A pair of tuples ti ∈ Ri and tk ∈ Rk are said to be joinable,
denoted by J(ti, tk), if ti can find a match ti+1 in Ri+1, ti+1
can find a match ti+2 in Ri+2, ..., and tk−1 can find a match
tk in Rk. Joinable tuples ti ∈ Ri and tk ∈ Rk are directly
joinable if Ri and Rk are directly joinable and they can
match directly; otherwise, they are indirectly joinable.
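Definition 4 can be read operationally as a hop-by-hop propagation of matches along the path. The following sketch (our own toy encoding of relations and join attributes, not code from the paper) checks whether two tuples are joinable under the paper's simplifying assumption of one join attribute per relationship:

```python
def joinable(relations, edges, path, t_start, t_end):
    """relations: name -> {tuple_id: {attr: value}};
    edges: (Ra, Rb) -> name of the single join attribute shared by Ra and Rb.
    Returns True iff t_start in path[0] and t_end in path[-1] are joinable."""
    frontier = {t_start}                        # tuple ids matched so far
    for ra, rb in zip(path, path[1:]):
        attr = edges[(ra, rb)]
        vals = {relations[ra][t][attr] for t in frontier}
        frontier = {t for t, row in relations[rb].items() if row[attr] in vals}
        if not frontier:                        # the chain of matches broke
            return False
    return t_end in frontier
```

For instance, with three toy relations chained by join attributes x and y, `joinable(..., ["R1", "R2", "R3"], 1, "A")` reports whether tuple 1 of R1 and tuple A of R3 are (indirectly) joinable.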
3. CORRELATED SAMPLE SYNOPSIS CONSTRUCTION
Sample tuples naturally embody the intricate relation-
ships of attribute values and thus can be good represen-
tatives of the underlying tuples of a relation. They are very
effective in estimating the result sizes of select queries with
arbitrary selection conditions [13]. Unfortunately, indepen-
dent simple random samples, each of which consists of ran-
domly drawn tuples from a participant relation of a query,
generated a large error in join size estimation [13]. We ob-
served that the large errors were caused by the independence
of the sample tuples, disregarding the join relationships be-
tween tuples of participant relations. Preserving the join
relationships may hold the key to the construction of an ef-
fective sample synopsis. With this principle in mind, we
design a simple sampling scheme, called correlated sam-
pling, that preserves the join relationships of tuples.
Table 1: Notation used in the paper

Symbol(s)            Meaning
Ri                   Relation
t ∈ Ri               A tuple t in relation Ri
G(V, E)              Join graph of a database
⟨Ri, Rk⟩             Edge in the join graph
J(Ri, Rk)            Ri and Rk are joinable relations
J(ti, tk)            ti and tk are joinable tuples
SRSWOR               Simple random sample without replacement
S′i                  SRSWOR of Ri
Si                   Sample relation
S                    Correlated sample synopsis (CS2)
(...)                Joins without order
Y                    Query result size
Yi                   Number of result tuples generated by ti ∈ Rh
yi                   Number of result tuples generated by ti ∈ Sh
Ŷ, Ŷp, ŶRV           Estimations of Y
ri                   Joinable tuples sampled ratio (JR)
Zi                   Standard Total Number of Result Tuples (STNRT)
{zi}_{i=1}^{n1}      Reverse sample
R_h^J, R_h^{Jc}      Joinable and non-joinable partitions of Rh
3.1 Correlated Sampling
Correlated Sampling aims to gather a sample of joinable
sample tuples from joinable relations in the database to form
a synopsis, denoted by S.
Definition 5. (Correlated Sampling) First, a relation,
say R1, is chosen as the source relation, from which a
simple random sample without replacement (SRSWOR) [6]
is collected into another relation, denoted by S1. We include
S1 into the synopsis, S, and mark R1 as visited. By
induction, when Ri (i > 1) is visited and Si is the sample of
Ri, we follow an unvisited edge ⟨Ri, Rj⟩ in the join graph of
the database. Instead of a SRSWOR, we collect the tuples
in Rj that are directly joinable with the tuples in Si. We
include these correlated tuples into Sj, and repeat this
procedure until all relations in the join graph are exhausted
and all tuples directly or indirectly joinable with S1 are
collected. Finally, the synopsis S = {S1, ..., Sn} is called
a Correlated Sample Synopsis, or CS2.

Note that, to emphasize that S1 is a SRSWOR of R1, we
use S′1 instead of S1 in the sequel. The sample relation S′1
and the other Sj's are called sample relations, and their
tuples are called sample tuples. A sample relation Sj which
is not a SRSWOR is called a correlated sample relation,
the tuples of Sj are called correlated sample tuples, and
Rj is called a non-source relation.
Algorithm 1 describes the general idea of correlated
sampling. We shall refine it later.
3.1.1 Sampling Path and Source Relation Selection
When choosing sampling order along the join graph, al-
though one can follow any join relationship, e.g., a many-to-
one, one-to-many, or many-to-many relationship, to gather
correlated sample tuples, we suggest following a many-to-
one relationship from a visited relation to another relation
(visited or not). The rationale is that following one-to-many
(and perhaps many-to-many) relationships can easily explode
the size of the synopsis, and thus reduce the number of
sample tuples that can be drawn from the source relation.

Algorithm 1 Correlated Sampling
Require: G — Join Graph of the Database; n1 — Sample Size
 1: Ra = Source_Selection(G)  //source relation selection
 2: S′a = ∅; Si = ∅ (i ≠ a)
 3: S′a = SRSWOR(Ra, n1)  //collect simple random sample
 4: W = {Ra}  //mark Ra as visited
 5: while ∃ an unvisited edge ⟨Ri, Rj⟩ with Ri ∈ W do
 6:   Sj = Sj ∪ {t | t ∈ Rj, t is directly joinable with a tuple
      in Si}  //collect correlated sample tuples
 7:   W = W ∪ {Rj}  //mark Rj as visited
 8: end while
 9: S = {S′a} ∪ (∪_{j≠a} {Sj})
10: return S — Correlated Sample Synopsis
The same reasoning applies to the selection of the source
relation. Thus, we prefer a relation that can reach the most
relations, through only the many-to-one relationships di-
rectly or indirectly, to be the source relation. It is possi-
ble that the source relation may not reach all the relations
in the graph through only the many-to-one relationships.
Therefore, one may have to add additional source relations
to cover unvisited relationships and relations.
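As a concrete illustration, Algorithm 1 can be sketched in a few lines of Python. This is our own minimal encoding (relations as lists of dicts, edges carrying their single join attribute), not the paper's implementation, and it ignores the source-selection heuristic by taking the source relation as a parameter:

```python
import random

def correlated_sampling(relations, edges, source, n1, seed=0):
    """relations: name -> list of tuple-dicts; edges: list of
    (Ra, Rb, join_attr) triples from the join graph; n1: SRSWOR size.
    Follows Algorithm 1: take a SRSWOR of the source relation, then
    repeatedly follow an unvisited edge out of a visited relation,
    collecting the tuples that join directly with the sample so far."""
    rng = random.Random(seed)
    samples = {name: [] for name in relations}
    samples[source] = rng.sample(relations[source], n1)  # S'_a = SRSWOR(Ra, n1)
    visited = {source}
    remaining = list(edges)
    progress = True
    while progress:                     # stop when no unvisited edge is reachable
        progress = False
        for edge in list(remaining):
            ra, rb, attr = edge
            if rb in visited and ra not in visited:
                ra, rb = rb, ra         # follow the edge from its visited side
            if ra not in visited:
                continue
            vals = {t[attr] for t in samples[ra]}
            for t in relations[rb]:     # correlated sample tuples of Rb
                if t[attr] in vals and t not in samples[rb]:
                    samples[rb].append(t)
            visited.add(rb)
            remaining.remove(edge)
            progress = True
    return samples
```

On a toy chain R1–R2–R3, each non-source sample ends up containing exactly the tuples directly joinable with the sample of its parent, as Definition 5 requires.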
3.1.2 Sample Size and Synopsis Maintenance
In Step (3) of Algorithm 1, we assume, for simplicity, that
the sample size n1 from the source relation is known before-
hand. However, in reality, the sample size that meets the
synopsis space requirement may not be known until the end
of the sampling process. Given a synopsis space budget, the
simplest way is to draw one tuple at a time from the source
relation and collect related tuples along the sampling path.
This process continues until the space budget is used up. If
only the many-to-one relationships are followed in the sam-
pling, one can also draw a number of sample tuples at a time
until there is no budget left.
The correlated sample synopsis can be updated using ex-
isting algorithms [10]. Alternatively, since the construction
of a synopsis takes just a few seconds (e.g., 10 seconds, as
demonstrated in Section 7), one can simply reconstruct the
synopsis periodically to keep it up to date.
3.2 A Primitive Correlated Sample Synopsis
For simplicity, we shall name the relations following the
order in which they are sampled by the correlated sampling
scheme described. That is, R1 is the source relation, R2 is
sampled after the source relation, and then R3, R4, etc.
To explicitly specify the order, we modify the join graph into
a directed sampling order graph (as shown in Figure 2(a)),
in which a double-headed arrow from Ri to Rj indicates that
Rj is sampled based on the sample tuples of Ri, or Rj is a
child of Ri (equivalently, Ri is a parent of Rj). Ancestor
and descendant relationships in a sampling order graph are
defined in the same way as in an ordinary directed graph.
Example 2. (A Primitive CS2) Consider the order of
sampling in Figure 2(a) and the join relationships among
tuples in Figure 2(b). Each tuple is represented only by its
ID for simplicity. R1 = {1, 2, 3}, R2 = {a, b, c}, R3 = {A, B},
and R4 = {I, II}. The lines, including both solid and dashed
lines, between tuples represent the matching of join attribute
values. R1 and R2 are in a many-to-many relationship. The
solid lines indicate that their end tuples are selected into the
synopsis. That is, two tuples of R1, 2 and 3, are first drawn
into S′1 (i.e., S′1 = {2, 3}). Then b and c of R2 are selected
into S2 because their join attribute values match those in
S′1. After that, B is selected from R3 into S3 because of
matching join attribute values with b, c in S2. Finally, II of
R4 is selected into S4 because its join attribute value matches
that of a tuple in S2. Therefore, the sample synopsis S
consists of four sample relations, as shown in Figure 2(c).

Figure 2: A Primitive CS2. (a) Sampling Order Graph;
(b) Joinable Tuple IDs; (c) CS2: S′1 = {2, 3}, S2 = {b, c},
S3 = {B}, S4 = {II}.
Note that the synopsis S described here is just in its
primitive form.
Instead of storing all attributes of a tuple in the synopsis,
one may choose to store only the join attributes and those
attributes that are often referenced in selection conditions.
This would significantly reduce the size of a synopsis.
4. QUERY ESTIMATION
By evaluating queries against sample relations, we estimate
query result sizes from the sample query results. We
focus on queries that contain equi-joins and arbitrary selection
predicates. Therefore, the query result size can be obtained
by summing up the number of result tuples contributed by
each tuple of an arbitrary relation in the query. Due to space
limitations, here we consider only a database with an acyclic
join graph G(V, E), and we assume no dangling tuples exist
in the joins. Dangling tuples in the source relation will not
generate result tuples, and dangling tuples in non-source
relations are discussed in Section 4.2.3.
4.1 Source Query Estimation
A source query here refers to a query that has the source
relation, R1, as one of its operand relations.
Consider a source query over relations R1, R2, ..., Rn. Let
{S′1, S2, ..., Sn} be the respective sample relations in the
synopsis. Each result tuple of the query must be generated
by a set of n tuples {t1, ..., tn}, ti ∈ Ri, 1 ≤ i ≤ n. We say
that each ti, 1 ≤ i ≤ n, generates a result tuple. Let Yi be
the number of result tuples generated by the ith tuple
ti ∈ R1; then the Query Result Size is Y = Σ_{i=1}^{N1} Yi,
where N1 = |R1|. We propose the following unbiased
estimator for Y.
Theorem 1 (Unbiasedness of CS2). Given S′1, a
SRSWOR of the source relation R1, the estimator

    Ŷ = (N1 / n1) Σ_{i=1}^{n1} yi    (1)

is an unbiased estimator of the query result size Y, where
N1 = |R1|, n1 = |S′1|, and yi is the number of result tuples
generated by the ith tuple in S′1. Also,

    v1(Ŷ) = (N1² / n1) ε² (1 − f)

is an unbiased estimator of Var(Ŷ), where f = n1/N1 and
ε² = Σ_{i=1}^{n1} (yi − ȳ)² / (n1 − 1), with ȳ the sample
mean of the yi.
Proof. We only have to show that {yj}_{j=1}^{n1} is a
SRSWOR of {Yi}_{i=1}^{N1}; then we can apply the theorems
in [6] to prove this theorem.
First of all, yj is the total number of result tuples generated
by a tuple uj ∈ S′1 when query Q is executed on the CS2, S.
According to Algorithm 1 (correlated sampling), for any
correlated tuple vk ∈ Rj, i.e., J(vk, uj), we have vk ∈ Sj,
j > 1. That is, every result tuple generated by uj on the
original database is also generated by uj on S.
On the other hand, S is a collection of sample relations,
which are subsets of the original relations. Thus, for the
same query Q, uj cannot generate more result tuples on S
than on the original database.
Hence, given uj = ti, uj ∈ S′1, ti ∈ R1, we have yj = Yi.
Since S′1 is a SRSWOR of R1, {yj}_{j=1}^{n1} is also a
SRSWOR of {Yi}_{i=1}^{N1}.
By Theorem 2.1 and Theorem 2.4 of [6], this theorem is
proved.
Example 3. (Source Query Estimation) Consider the
database and synopsis in Figure 2. We attempt to estimate
the sizes of all queries that involve the source relation R1,
that is, (R1, R2), (R1, R2, R3), (R1, R2, R4), and
(R1, R2, R3, R4). Here, (Ri, ..., Rk) represents the joins of
Ri, ..., and Rk in no particular order.
First, we use S′1, S2, S3, and S4 to compute the sample
results of the queries:

    |(S′1, S2)| = |S′1 ⋈ S2| = 2
    |(S′1, S2, S3)| = |(S′1 ⋈ S2) ⋈ S3| = 2
    |(S′1, S2, S4)| = |(S′1 ⋈ S2) ⋈ S4| = 1
    |(S′1, S2, S3, S4)| = |((S′1 ⋈ S2) ⋈ S3) ⋈ S4| = 1

Then, the query sizes are estimated by Eq (1), where N1 =
3 and n1 = 2. So, the estimated query sizes of (R1, R2),
(R1, R2, R3), (R1, R2, R4), and (R1, R2, R3, R4) are 3, 3,
3/2, and 3/2, respectively, as compared to the exact sizes 4, 4,
2, and 2.
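Example 3's arithmetic can be replayed directly from Eq (1). The fragment below (our own illustration, not the paper's code) scales each sample result size by N1/n1, since the sum of the yi over S′1 equals the size of the query result on the sample relations:

```python
def source_estimate(N1, n1, sample_result_size):
    # Eq (1): Y_hat = (N1/n1) * sum of y_i, where the sum of y_i over S'_1
    # equals the size of the query result evaluated on the sample relations.
    return N1 / n1 * sample_result_size

# Example 3's numbers: N1 = 3, n1 = 2; sample join sizes 2, 2, 1, 1.
estimates = [source_estimate(3, 2, s) for s in (2, 2, 1, 1)]
# -> [3.0, 3.0, 1.5, 1.5], i.e., the estimates 3, 3, 3/2, 3/2 in the text.
```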
4.2 No-Source Query Estimation
A no-source query refers to a query that doesn't include
the source relation.

Example 4. (No-Source Query) In Figure 2, both
(R2, R3) and (R2, R4) are no-source queries, since neither
of their join graphs contains the source relation R1.

The highest relation in a query refers to the relation
with the least index according to the sampling order. For
instance, in Example 4, R2 is the highest relation in query
(R2, R3).
For any no-source query Q with Rh as the highest relation
in its join graph, let Yj be the total number of result tuples
generated by the jth tuple uj in Rh, and let Sh be the sample
relation of Rh. We want to estimate the query result size
Y_{Rh} = Σ_{j=1}^{Nh} Yj.
4.2.1 Estimation with Reverse Sampling
Since there is no SRSWOR of Rh, it is not straightforward
how the query result size Y_{Rh} can be derived from the
sample of Rh directly. Therefore, we “back track” to the
source relation and consider estimating Y_{Rh} from R1. We
try to establish the relationship between Y_{Rh} and some
derived values on tuples of R1 based on Yj. In order to
establish the “reverse” connection from Rh to R1, we
introduce a definition called the Joinable Tuples Sampled
Ratio, or JR.

Definition 6. (Joinable Tuples Sampled Ratio or JR)
Considering R1 as the source relation, for ui ∈ Rh, h > 1, let
Ci be the total number of tuples in R1 that are joinable with
ui. Among these Ci joinable tuples, we assume Di tuples
are drawn into S′1, 0 ≤ Di ≤ Ci. Then

    ri = Di / Ci    (2)

is called the Joinable Tuples Sampled Ratio, or JR, of ui.
Now we show how to derive the result size Y_{Rh} from
derived values on tuples of R1. We assign a value Xi to the
ith tuple, ti, of R1, as follows:

    Xi = Σ_{j=1}^{Nh} A_j^i · Yj,
    A_j^i = 1 if J(uj, ti), uj ∈ Rh, ti ∈ R1; 0 otherwise    (3)

where Nh = |Rh|, and J(uj, ti) is as in Definition 4. Xi is
called the Derived Total Number of Result Tuples, or
DTNRT, of ti in R1. Using the Cj in the JR, we adjust the
weight of Yj in Xi; the adjusted value, denoted by Zi, is

    Zi = Σ_{j=1}^{Nh} (A_j^i / Cj) · Yj    (4)

where Cj is defined in Definition 6. We call Zi the Standard
Total Number of Result Tuples, or STNRT.
Note that Zi is associated with a tuple in R1, but its value
is derived from Rh. Therefore, the values in Rh are assigned
to R1 “reversely” along the sampling path. In order to
estimate Y_{Rh}, we pick a SRSWOR of {Zi} and call it a
reverse sample.
Definition 7. (Reverse Sample) A reverse sample, denoted
by {zi}_{i=1}^{n1}, is a SRSWOR of the STNRT set
{Zi}_{i=1}^{N1} s.t. Zi is chosen into {zi}_{i=1}^{n1} if and
only if the tuple ti of R1 is sampled into S′1.

It is trivial that yj = Yi if ti ∈ Rh is sampled into Sh as
uj, in which yj is the total number of result tuples generated
by uj. Thus, according to (4), zi can be expressed using
values in Sh as follows:

    zi = Σ_{j=1}^{nh} (a_j^i / Cj) · yj,
    a_j^i = 1 if J(u′j, t′i), u′j ∈ Sh, t′i ∈ S′1; 0 otherwise    (5)
To estimate the no-source query result size Y_{Rh}, we
propose the Reverse (RV) Estimator, defined as

    Ŷ_RV = (N1 / n1) Σ_{j=1}^{nh} rj · yj    (6)

where N1 = |R1|, n1 = |S′1|, nh = |Sh|, and yj is the total
number of result tuples generated by the jth tuple in Sh.
The significance of the RV Estimator is that it can provide
unbiased estimations for no-source queries using CS2, as
shown in the following theorem.
Theorem 2 (Unbiasedness of CS2). Given a no-source
query with Rh as the highest relation in its join graph, Ŷ_RV
in Eq (6) is an unbiased estimator of the query result size
Y_{Rh}.

Proof. First of all, given R1, S′1, Rh, and Sh, we study the
sum of the Zi as defined in Eq (4):

    Σ_{i=1}^{N1} Zi = Σ_{i=1}^{N1} Σ_{j=1}^{Nh} (A_j^i / Cj) Yj
                    = Σ_{j=1}^{Nh} ((Σ_{i=1}^{N1} A_j^i) / Cj) Yj    (7)

By Definition 6, Σ_{i=1}^{N1} A_j^i = Cj. Thus (7) can be
written as Σ_{i=1}^{N1} Zi = Σ_{j=1}^{Nh} Yj = Y_{Rh}, which
means the sum of the Zi is actually equal to the query result
size Y_{Rh}. Therefore, we can estimate the sum of the Zi
for Y_{Rh}.
Second, we take a reverse sample {zi}_{i=1}^{n1}. By
Definition 7, {zi}_{i=1}^{n1} is a SRSWOR of {Zi}_{i=1}^{N1}.
Therefore, we can use Eq (1): Ŷ = (N1/n1) Σ_{i=1}^{n1} zi.
By Theorem 1, Ŷ is an unbiased estimator of Y_{Rh}.
Finally, we prove Ŷ = Ŷ_RV. We express Ŷ in terms of the yj:

    Ŷ = (N1/n1) Σ_{i=1}^{n1} zi
      = (N1/n1) Σ_{i=1}^{n1} Σ_{j=1}^{nh} (a_j^i / Cj) yj
      = (N1/n1) Σ_{j=1}^{nh} ((Σ_{i=1}^{n1} a_j^i) / Cj) yj

By Definition 6, Σ_{i=1}^{n1} a_j^i = Dj, so we have

    Ŷ = (N1/n1) Σ_{j=1}^{nh} (Dj/Cj) yj
      = (N1/n1) Σ_{j=1}^{nh} rj yj = Ŷ_RV.

The theorem is proved.
Theorem 3. The following is an unbiased estimator of the
variance of Ŷ_RV:

    v2(Ŷ_RV) = (N1(N1 − n1) / n1²) Σ_{i=1}^{n1} zi²
             − (2 N1(N1 − n1) / (n1²(n1 − 1))) Σ_{i1=1}^{n1} Σ_{i2>i1}^{n1} z_{i1} z_{i2}

where zi is defined in the proof of Theorem 2.
This theorem can be proved by (9A.37) and (9A.44) of [6].
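As a quick sanity check (ours, not the paper's), Theorem 3's expression is algebraically the classical SRSWOR variance estimator N1(N1 − n1)s²/n1 for an expanded total, where s² is the sample variance of the zi. The two forms agree numerically:

```python
from itertools import combinations
import statistics

def v2(z, N1):
    """Theorem 3's unbiased variance estimator for Y_hat_RV, evaluated
    over a reverse sample z = (z_1, ..., z_{n1})."""
    n1 = len(z)
    sum_sq = sum(zi * zi for zi in z)
    cross = sum(z[i] * z[j] for i, j in combinations(range(n1), 2))
    return (N1 * (N1 - n1) / n1**2) * sum_sq \
        - (2 * N1 * (N1 - n1) / (n1**2 * (n1 - 1))) * cross

# Any sample values work; here, arbitrary reverse-sample values.
z = [3.0, 1.5, 4.0, 2.5]
N1 = 10
classical = N1 * (N1 - len(z)) / len(z) * statistics.variance(z)
# v2(z, N1) and the classical form are equal (both 16.25 here).
```

This equality follows from expanding s² = (1/n1)Σzi² − (2/(n1(n1−1)))Σ_{i1<i2} z_{i1} z_{i2}.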
Example 5. (No-Source Query Estimation with the RV
Estimator) We attempt to use Ŷ_RV in Eq (6) to estimate
the result size of a no-source query (R2, R3) executed
on the relations in Figure 3(a), denoted by Y_{R2}. Note that
Ya = Yb = Yc = 1, because each tuple of R2 produces exactly
one result tuple. |S′1| = 2, and N1 = |R1| = 3. Thus
N1/n1 = 3/2.
Figure 3 shows the CS2 and JRs when S′1 = {2, 3} and
S2 = {b, c}. Tuples 1 and 2 of R1 are joinable with b in S2,
but only tuple 2 is selected into S′1. Thus, rb = 1/2. Similarly,
rc = 1. By Eq (6), for this given S′1, Ŷ_RV = (3/2)(Yb/2 + Yc) =
9/4. In order to verify that Ŷ_RV is unbiased, we consider all
possible SRSWORs of size 2 from R1, i.e., all possible S′1,
and calculate the average of their estimates, E(Ŷ_RV). Table 2
lists all the calculated Ŷ_RV. As observed, E(Ŷ_RV) is exactly
the actual result size.
Table 2: Samples, RV Estimates (Ŷ_RV), and JR Values

S′1      S2          JR                       Ŷ_RV
{1, 2}   {a, b}      ra = rb = 1              (3/2)(Ya + Yb) = 3
{1, 3}   {a, b, c}   ra = rc = 1, rb = 1/2    (3/2)(Ya + Yb/2 + Yc) = 15/4
{2, 3}   {b, c}      rb = 1/2, rc = 1         (3/2)(Yb/2 + Yc) = 9/4
                                              E(Ŷ_RV) = 3
Figure 3: CS2 and JRs when S′1 = {2, 3}. (a) Joinable
Tuple IDs; (b) CS2 and JRs: S′1 = {2, 3}; S2 = {b, c},
J2 = {1/2, 1}; S3 = {B}, J3 = {2/3}.
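The enumeration in Table 2 can be reproduced mechanically. The sketch below is our own encoding; the join pattern (a joins tuple 1; b joins tuples 1 and 2; c joins tuple 3) is the one implied by Figure 3 and Table 2. It evaluates Eq (6) for every possible S′1 and averages the estimates:

```python
from fractions import Fraction
from itertools import combinations

joins = {"a": {1}, "b": {1, 2}, "c": {3}}   # u in R2 -> joinable tuple ids in R1
Y = {"a": 1, "b": 1, "c": 1}                # Y_a = Y_b = Y_c = 1
N1, n1 = 3, 2

def rv_estimate(S1):
    # Eq (6): Y_hat_RV = (N1/n1) * sum over u drawn into S2 of r_u * y_u,
    # where r_u = D_u / C_u (Definition 6) and u enters S2 iff it joins S1.
    total = Fraction(0)
    for u, ids in joins.items():
        D = len(ids & set(S1))
        if D:                                # u is a correlated sample tuple
            total += Fraction(D, len(ids)) * Y[u]
    return Fraction(N1, n1) * total

estimates = [rv_estimate(set(s)) for s in combinations([1, 2, 3], 2)]
# {1,2} -> 3, {1,3} -> 15/4, {2,3} -> 9/4; the average is exactly 3 = Y_R2.
```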
4.2.2 JR Storage and Calculation
A CS2 is constructed for use by all possible queries. A JR
value is stored with each correlated sample tuple in Si(i > 1)
for use of all no-source query estimations.
As described in Algorithm 2, the JR values can be calculated in a top-down fashion following the sampling path. First, for each sample tuple $u$ in the direct children of $S_1^*$, we calculate $C_u$ and $D_u$, in which $C_u$ comprises the IDs of $u$'s joinable tuples in $R_1$, and $D_u$ comprises those in $S_1^*$. Then, each sample tuple in the other sample relations performs a union (without duplicates) of the $C_u$ and $D_u$ sets from its joinable tuples in its parent sample relation. By Definition 6, $\#()$ denotes the cardinality of a set. Therefore, the JR value of tuple $u$ is calculated as $r_u = \#(D_u)/\#(C_u)$. Finally, all the intermediate data, $C_u$ and $D_u$, are removed after the JR values are calculated.
In Algorithm 2, $C_u$ and $D_u$ are only calculated for correlated sample tuples, and they can be obtained in parallel with the correlated sampling procedure (Algorithm 3.1). Therefore, the intermediate time and space overheads can be ignored when the sample size is small. In addition, in Section 7, we show that the estimation of CS2 is fairly satisfactory using only quite a small sample.
4.2.3 Estimation Involving Non-joinable Tuples
Definition 8. (Joinable and Non-joinable Partition) Considering $R_1$ as the source relation, the joinable partition of $R_h$ (with respect to $R_1$), denoted by $R_h^J$, refers to the subset of tuples in $R_h$ that are joinable with tuples in $R_1$. The complement of $R_h^J$ in $R_h$ is called the non-joinable partition of $R_h$ (with respect to $R_1$), denoted by $R_h^{J^c}$.
By Definition 8, there is no chance for tuples in $R_h^{J^c}$ to be selected into the correlated sample synopsis. In order to estimate the result size of no-source queries with $R_h$ as the highest relation, we need to account for the number of result tuples generated by $R_h^{J^c}$.
Definition 9. (Partition Difference) For any given query Q, the total numbers of result tuples generated by $R_h^J$ and
Algorithm 2 Calculate JR
Require: $G(V, E)$: the acyclic join graph of the database;
         $S = \{S_1^*\} \cup \{S_i\}_{i=2}^{n}$: CS2 without JRs
 1: $W = \emptyset$                  // the set of visited relations
 2: // ---- Phase 1: for direct children of R1 ----
 3: $U = \{S_i \mid S_i \in S$ s.t. $R_i$ is a direct child of $R_1$ in $G\}$
 4: for all $S_i \in U$ do
 5:   for all $u \in S_i$ do
 6:     // IDs of tuples in $R_1$ and $S_1^*$ joinable with $u$
 7:     $C_u = \{t.ID \mid t \in R_1, J(t, u)\}$
 8:     $D_u = \{t.ID \mid t \in S_1^*, J(t, u)\}$
 9:     append $C_u$ to $u$
10:     append $D_u$ to $u$
11:   end for
12:   $W = W \cup \{S_i\}$           // mark $S_i$ as visited
13: end for
14: // ---- Phase 2: for other correlated samples ----
15: while $W \ne$ CS2 do
16:   pick $S_i$ whose parent $S_p$ is in $W$
17:   for all $v \in S_i$ do
18:     // all the joinable tuples of $v$ in $S_i$'s parent relation
19:     $J_v = \{u \mid u \in S_p, J(u, v)\}$
20:     $C_v = \bigcup_{u \in J_v} C_u$   // union without duplicates
21:     $D_v = \bigcup_{u \in J_v} D_u$
22:     append $C_v$ to $v$
23:     append $D_v$ to $v$
24:   end for
25:   $W = W \cup \{S_i\}$           // mark $S_i$ as visited
26: end while
27: // ---- Phase 3: calculate JR values ----
28: for all $S_i \in$ CS2 do
29:   for all $u \in S_i$ do
30:     $r_u = \#(D_u)/\#(C_u)$      // calculate JR
31:     append $r_u$ to $u$          // append JR to each sample tuple
32:     delete $C_u$, $D_u$          // release extra space
33:   end for
34: end for
35: return $S$ (CS2 with JRs calculated)
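The three phases can be sketched in a few lines, assuming the joinability maps below (which mirror Fig 3) are given explicitly; plain sets stand in for the ID lists appended to each sample tuple:

```python
R1 = {1, 2, 3}
S1_star = {2, 3}                      # SRSWOR taken from the source relation
join_R2 = {'b': {1, 2}, 'c': {3}}     # R1-tuple IDs joinable with each S2 tuple
join_R3 = {'B': {'b', 'c'}}           # S2 tuples joinable with each S3 tuple

C, D, r = {}, {}, {}

# Phase 1: direct children of R1 compute C_u (IDs in R1) and D_u (IDs in S1*)
for u, ids in join_R2.items():
    C[u] = ids & R1
    D[u] = ids & S1_star

# Phase 2: deeper samples take the duplicate-free union over joinable parents
for v, parents in join_R3.items():
    C[v] = set().union(*(C[u] for u in parents))
    D[v] = set().union(*(D[u] for u in parents))

# Phase 3: JR value r_u = #(D_u)/#(C_u); the intermediate sets are discarded
for u in C:
    r[u] = len(D[u]) / len(C[u])
# r matches Fig 3(b): r_b = 1/2, r_c = 1, and J3 gives r_B = 2/3
```

This is a toy instance over two levels; on a real join graph, Phase 2 would iterate in the top-down order picked by Algorithm 2.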
$R_h^{J^c}$ are $Y^J$ and $Y^{J^c}$, respectively. The partition difference between $R_h^J$ and $R_h^{J^c}$ with respect to Q, denoted by $\mu$, is defined as

$$\mu(R_h^J, R_h^{J^c}) = \left| \frac{n^{J^c}\, Y^J}{n^J\, Y^{J^c}} - 1 \right| \qquad (8)$$
The partition difference $\mu$ is specified for a fixed query Q. To estimate the general difference for any query with $R_h$ as the highest relation, one possible way is to select a set of sample queries and calculate an averaged $\mu(R_h^J, R_h^{J^c})$. One can also check the distributions of tuples according to the join attributes of these queries.
Given a tolerance threshold $\theta \in [0, 1]$, we claim that if $\mu(R_h^J, R_h^{J^c}) < \theta$, we can estimate $Y^{J^c}$ as

$$\hat{Y}^{J^c} = \frac{n^{J^c}}{n^J}\, \hat{Y}^J_{RV} \qquad (9)$$

where $\hat{Y}^J_{RV}$ is the RV estimator for $Y^J$. The relative error of $E(\hat{Y}^{J^c})$ is

$$\left| \frac{E(\hat{Y}^{J^c}) - Y^{J^c}}{Y^{J^c}} \right| = \left| \frac{n^{J^c}\, Y^J}{n^J\, Y^{J^c}} - 1 \right| = \mu < \theta$$

That is, when $\mu < \theta$, the relative estimation error of $\hat{Y}^{J^c}$ in Eq (9) is less than the tolerance threshold $\theta$.
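As a worked illustration of Eqs (8) and (9), using exact rational arithmetic; the partition counts below are invented purely for the example:

```python
from fractions import Fraction

n_J, n_Jc = 800, 200      # sizes of R_h^J and R_h^Jc (hypothetical)
Y_J, Y_Jc = 4000, 1100    # actual result counts per partition (hypothetical)

mu = abs(Fraction(n_Jc * Y_J, n_J * Y_Jc) - 1)   # Eq (8): mu = 1/11
theta = Fraction(1, 10)

assert mu < theta          # within tolerance, so Eq (9) applies
Y_J_hat = Y_J              # pretend the RV estimator happens to be exact here
Y_Jc_hat = Fraction(n_Jc, n_J) * Y_J_hat         # Eq (9): (200/800)*4000 = 1000

# the relative error of the Eq (9) estimate equals mu, as derived above
rel_err = abs((Y_Jc_hat - Y_Jc) / Y_Jc)
```

With these numbers the estimate 1000 misses the true 1100 by exactly $\mu = 1/11 \approx 9.1\%$, which stays under the 10% threshold.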
When $\mu(R_h^J, R_h^{J^c}) \ge \theta$, we have to estimate $R_h^{J^c}$ separately. An additional simple random sample $S_h^*$, called a supplement sample, is taken from $R_h^{J^c}$, and correlated sample tuples are stored into the respective sample relations of $R_h$ in the join graph $G$.
To estimate $Y^{J^c}$ using $S_h^*$, we modify $\hat{Y}$ in Eq (1) by treating $R_h^{J^c}$ as a source relation. That is,

$$\hat{Y}^{J^c} = \frac{N_h^{J^c}}{n_h^{J^c}} \sum_{i=1}^{n_h^{J^c}} y_i \qquad (10)$$

where $N_h^{J^c} = |R_h^{J^c}|$, $n_h^{J^c} = |S_h^*|$, and $y_i$ is the number of result tuples generated by the $i$th tuple in $S_h^*$.
Let $\hat{Y}^J_{RV}$ be the RV estimator of $Y^J$. Since $Y^J$ and $Y^{J^c}$ are independent, we have the following theorem.
Theorem 4. $\hat{Y}_h = \hat{Y}^J_{RV} + \hat{Y}^{J^c}$ is an unbiased estimator of the query result size $Y_{R_h}$. And $v_2(\hat{Y}^J_{RV}) + v_1(\hat{Y}^{J^c})$ is an unbiased estimator of $Var(\hat{Y}_h)$, in which $v_1$ and $v_2$ are defined as in Theorem 1 and Theorem 3.
5. SELECTIVITY ESTIMATION FOR QUERIES WITH CYCLIC GRAPHS
If the join graph $G$ of a query is cyclic, a query result tuple $t_r$ may involve two different tuples, say $t_1$ and $t_2$, both in $R_h$, with a cycle in $G$ from $t_1$ to $t_2$. In that case, we say $t_r$ is generated by $t_1$ rather than by $t_2$. Hence, each result tuple still corresponds to one and only one tuple in each relation when there is no dangling tuple.
Figure 4(a) shows a cyclic join graph and Figure 4(b) the join relationships. Let us use an example to illustrate the issues encountered in a cyclic join graph. Assume tuple 1 is first drawn from $R_1$. Then, following the correlated sampling (Figure 2), the related tuples $A$ ($\in R_2$), $b$ ($\in R_3$), $\alpha$ and $\beta$ ($\in R_4$), $I$ ($\in R_5$), and $\Delta$ and $\Theta$ ($\in R_6$) are drawn into the respective sample relations $S_i$, $2 \le i \le 6$, as shown in Figure 4(c). Note that $S_4$ contains tuples that are related to tuples in both $S_2$ (i.e., $A$) and $S_3$ (i.e., $b$) because the two sampling branches from $R_1$ intersect at $R_4$.
While the synopsis (Figure 4(c)) is sufficient for estimating queries involving relations that are subsets of $\{R_1, R_2, R_4, R_6\}$ or $\{R_1, R_3, R_4, R_5, R_6\}$, it is not straightforward how queries involving relations from both sampling branches can be estimated. For example, consider the query $(R_2, R_4, R_3)$, where $R_2$ and $R_3$ are on different branches. $R_2$ and $R_3$ are both candidates for the highest relation of the query. Assume we choose to use $S_2 = \{A\}$ to estimate the query size. From the synopsis, we know $A$ is connected to $\alpha$ of $R_4$. Unfortunately, we do not know whether $\alpha$ is connected to any tuples in $R_3$ because we did not let $\alpha$ pierce through $R_4$ to reach the related tuples in $R_3$ (i.e., $a$). A similar problem occurs if we use $S_3 = \{b\}$ to estimate the query size. That is, $b$ is connected to $\beta$ ($\in R_4$), but we do not know whether $\beta$ is connected to any tuples in $R_2$ because the sampling did not penetrate into $R_2$ to reach the related tuples (i.e., $B$ here).
Here, we propose two possible ways to handle this situation. The first is to estimate query sizes using the independence and uniformity assumptions. The second is to enhance the synopsis by letting the sampling pierce through the cycle-completing relation (e.g., $R_4$) and perhaps other relations.

[Figure 4: CS2 for a Cyclic Join Graph. (a) A cyclic join graph over $R_1$ to $R_6$. (b) Joinable tuple IDs: $R_1 = \{1, 2\}$, $R_2 = \{A, B\}$, $R_3 = \{a, b\}$, $R_4 = \{\alpha, \beta\}$, $R_5 = \{I, II\}$, $R_6 = \{\Delta, \Theta\}$. (c) CS2: $S_1^* = \{1\}$, $S_2 = \{A\}$, $S_3 = \{b\}$, $S_4 = \{\alpha, \beta\}$, $S_5 = \{I\}$, $S_6 = \{\Delta, \Theta\}$. (d) Extended samples: $S_3^+ = \{a\}$, $S_5^+ = \{II\}$.]
5.1 Estimate using Independence and Uniformity Assumptions
Consider the query $(R_2, R_4, R_3)$. Let $T_{24}$ and $T_{34}$ be the estimates for $|R_2 \bowtie R_4|$ and $|R_3 \bowtie R_4|$, respectively, derived using the method described in Section 4.2 with the sample relations $\{S_2, S_4\}$ and $\{S_3, S_4\}$, respectively. Let $N_4$ be the number of tuples in $R_4$. Assume every tuple in $R_4$ is equally likely to participate in the results of $R_2 \bowtie R_4$ and $R_3 \bowtie R_4$. Then $|(R_2, R_4, R_3)|$ can be estimated as

$$\frac{T_{24} \cdot T_{34}}{N_4}$$

This approach is simple, but the accuracy of the estimations may raise a concern because of the use of the independence and uniformity assumptions.
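The estimate amounts to one multiplication and one division; the intermediate estimates below are hypothetical values standing in for the Section 4.2 outputs:

```python
T24, T34 = 500.0, 300.0   # assumed estimates of |R2 join R4| and |R3 join R4|
N4 = 1000                 # |R4|: number of tuples in the shared relation

# under uniformity, every R4 tuple is equally likely in either partial result
size_243 = T24 * T34 / N4
```

With these numbers the estimated size of $(R_2, R_4, R_3)$ is $500 \times 300 / 1000 = 150$.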
5.2 Enhancing synopsis with extended samples
We allow the sampling to pierce through the cycle-completing relation (e.g., $R_4$) and extend to any relation as desired. In general, we only extend to the relations that are often referenced in queries together with the cycle-completing relation.
For example, we can let the sampling branch on the left (Figure 4) that goes through $R_1$, $R_2$, and $R_4$ extend to $R_3$ (or even to $R_1$ and $R_5$). An enhanced synopsis is shown in Figure 4(d), where the left sampling branch is extended to $R_3$ and $R_5$. In the figure, $S_i^+$ contains the extended sample tuples.
We do not need to differentiate $S_i$ from $S_i^+$ in the sample evaluations unless $R_i$ is the highest relation of the query. For example, consider the query $(R_2, R_4, R_3, R_5)$ and the enhanced synopsis in Figures 4(c) and 4(d). Assume $R_2$ is chosen to be the highest relation for evaluation. Consequently, $S_2$, without $S_2^+$, must be used. As for the other relations, we can just use $S_i \cup S_i^+$. The sample query result can be obtained by $(S_2, S_4, S_3 \cup S_3^+, S_5 \cup S_5^+)$, which is the same as $(S_2, S_4, S_3^+, S_5^+)$. The latter may be more efficient, but the former is simpler; that is, always use $S_i \cup S_i^+$, except when $R_i$ is the highest relation.
6. MULTI-SOURCE SYNOPSIS
Figure 5 shows a graph with two source relations $R_A$ and $R_B$. Let A and B be the two groups of relations with $R_A$ and $R_B$ being their source relations, respectively.

[Figure 5: Multiple Source Relations. Groups A and B, with source relations $R_A$ and $R_B$, share a common relation $R_i$.]

The two groups have a common relation $R_i$, but there can certainly be more. Note that the sampling processes that started from $R_A$ and $R_B$ would each draw their own related sample tuples from $R_i$ and store them in $S_i$.
Given a query, if there exists a sample penetrating all its operand relations, an estimate of the query size can be derived using the methods described in Section 4. If no sample penetrates all the operand relations, we resort to the estimation method based on the independence and uniformity assumptions in Section 5.
A multi-source synopsis arises from traversing only the
many-to-one join relationships in the sampling. But there
are other reasons for constructing a multi-source synopsis.
In situations where there are multiple groups of relations
that are often referenced together, we may select a source
relation for each such group. Depending upon the importance of queries and the desired accuracy, a multi-source synopsis allows a DBA to determine the grouping of relations and the sample sizes for individual groups. It can improve the accuracy of estimation for queries over relations in the same group; however, it can decrease the accuracy for queries over relations across groups, due to the use of the independence and uniformity assumptions in estimation. It can be a very flexible and effective way to build a synopsis.
7. EXPERIMENTS
In this section, we present the results of performance evaluations of TuG [27, 28], join synopses (JS) [1], and the correlated sample synopsis (CS2).
The implementations of the join synopses and the correlated sample synopsis are much simpler than TuG because TuG uses complex clustering and compression algorithms. We have implemented the TuG synopsis (whose code is not available to the public) following the descriptions in [27], including using the public Java source code [23] of CF-Tree and BIRCH [32] to conduct clustering, and implementing the Count-Min Sketches [7] and histogram synopses [8] to reduce the size of histograms.
All programs are implemented in Java 1.6 and MySQL 5.1 on an Ubuntu 10.04 Linux system, equipped with an Intel Core i7 3.4 GHz CPU, 4GB RAM, and a 500GB hard disk (7200 RPM, 16MB buffer). All datasets and synopses are stored on the hard disk.

Table 3: Datasets

  Name        Size    Comments
  TPCH1G05    1GB     TPC-H skew factor z=0.5
  TPCH1G1     1GB     TPC-H skew factor z=1.0
  TPCH10G05   10GB    TPC-H skew factor z=0.5
  TPCH10G1    10GB    TPC-H skew factor z=1.0
  DBLP        692MB   DBLP online database

[Figure 6: Join Graphs of Datasets (Boxed are Source Relations). (a) TPC-H join graph over lineitem (source), partsupp, orders, part, supplier, customer, nation, and region. (b) DBLP join graph over author_of (source), author, paper, type, and publisher.]
Using a large amount of memory only benefits TuG, which utilizes clustering on large datasets. CS2 and JS only require small amounts of memory when constructing synopses.
7.1 Datasets
We conducted experiments on both synthetic data (TPC-H datasets with skewed data [4]) and real-life data (the DBLP dataset [17]), listed in Table 3.
We generated 1GB and 10GB TPC-H datasets with skew
factors of z=0.5 and 1. Each TPC-H dataset has 8 relations,
consisting of only the foreign-key join relationships. The arcs
in Figure 6(a) indicate the many-to-one relationships, which
also satisfy the foreign-key constraint for the TPC-H data,
and the boxes indicate the source relations of CS2. Note
that every relation is a source relation in JS.
The DBLP database listed more than 1.3 million computer science articles as of January 2010. Our DBLP dataset derives from a relational enhancement of the original DBLP data, named DBLP++ [9]. We have replaced the many-to-many relationship between "author" and "paper" with two many-to-one relationships and a connecting relation "author_of", as JS does not apply to many-to-many relationships directly. However, CS2 is applicable to all types of
relationships. In addition, we have also filtered out some data so that the foreign-key constraint holds, as required by JS. The join graph is shown in Fig 6(b).
7.2 Construction Time and Index
The average construction times of the synopses in the accuracy tests are presented in Table 4. As observed, the construction of TuG is significantly slower than that of JS and CS2 because it performs several time-consuming tasks, such as CV table building, all-but-one similar merges, and BIRCH clustering. All of these tasks require tremendous amounts of disk I/O and CPU computation.
Building indexes will certainly increase the construction
Table 4: Average Construction Time of Synopses

  Dataset     Space Budget   TuG      JS       CS2
  TPCH1G05    29KB           639min   10sec    8sec
  TPCH1G1     21KB           286min   10sec    8sec
  TPCH10G05   100KB          -        408sec   334sec
  TPCH10G1    100KB          -        340sec   300sec
  DBLP        35KB           518min   8sec     3sec
speed of all the synopses. In our experiments, we built indexes on the join key attributes in both the TPC-H and DBLP datasets.
7.3 Query Workload
We randomly generated queries involving different sets of relations with different numbers of selection predicates. For such queries, we randomly varied the ranges of attribute values in the selection criteria. Our experiments were designed to examine how joins and selections affect the accuracy of estimations.
Table 5 details the test queries and the space budgets used for each test. "DBLP Large/Small Range" refers to queries with large/small attribute value ranges specified in the selection predicates. "DBLP No-source" refers to queries that do not involve the source relation "author_of", which would require using the proposed RV estimator (Section 4). They are also among the queries used in the "DBLP Large/Small Range" experiments.
7.4 Synopsis Space Budget
According to Section 3.1.1 and Algorithm 3.1, in order to control the size of CS2, we chose "lineitem" of TPC-H and "author_of" of DBLP as the source relations. Our correlated sampling also followed the many-to-one relationships in the join graphs. As Table 5 shows, the space of CS2 can be well controlled. Space budgets of CS2 are less than 20KB for the 1GB TPC-H datasets, 100KB for the 10GB TPC-H datasets, and less than 10KB for the DBLP dataset.
Note that it is hard to control the final sizes of TuGs, even though we used the same clustering parameters to generate them on the TPC-H datasets. We were not able to generate any TuG smaller than 20KB for TPC-H or 23KB for DBLP without incurring large estimation errors, and thus their results are absent for the 3KB and 10KB tests in Figure 7.
7.5 Estimation Accuracy
Test queries were grouped based on the numbers of joins and selections placed on each relation. For example, in Figure 7(a), the query group 'j1s2' comprises queries with one join and at most 2 selections on each operand relation, with at most 4 selections per query.
Since JS and CS2 both need random sampling, they are not unique for a given DB. Therefore, for each dataset and given space budget, we generated 100 JSs and CS2s to observe their average performance. Each CS2/JS is used for
estimations of all test queries. We use the absolute relative error to compare the accuracy of estimations, which is defined as

$$E = \frac{\left| \hat{Y}_{Est} - Y_{Act} \right|}{Y_{Act}} \times 100\%$$

where $Y_{Act}$ denotes the actual query result size and $\hat{Y}_{Est}$ is
Table 5: Query Workloads and Space Budgets in Estimation Accuracy Tests

  Figure   Tests                       Data Size   Total Queries   TuG    JS      CS2
  7(a)     TPCH1G05 Synopses 20KB      1GB         45              29KB   20KB    20KB
  7(b)     TPCH1G05 Synopses 10KB      1GB         45              -      10KB    10KB
  7(c)     TPCH1G1 Synopses 20KB       1GB         45              21KB   20KB    20KB
  7(d)     TPCH1G1 Synopses 10KB       1GB         45              -      10KB    10KB
  7(e)     TPCH10G05 Synopses 100KB    10GB        45              -      100KB   100KB
  7(f)     TPCH10G1 Synopses 100KB     10GB        45              -      100KB   100KB
  7(g)     DBLP Large Range 3KB        692MB       51              23KB   3KB     3KB
  7(h)     DBLP Small Range 3KB        692MB       51              -      3KB     3KB
  7(i)     DBLP Small Range 10KB       692MB       51              -      10KB    10KB
  7(j)     DBLP No-source              692MB       36              -      3KB     3KB
the estimated result size. The results of the accuracy tests are shown in Fig 7.
As shown in Figures 7(a) and 7(c), TuG generally produced the worst relative errors. As mentioned earlier, a TuG smaller than 20KB cannot generate meaningful estimations. JS and CS2 had similar accuracies for queries with small numbers of joins. However, JS's accuracy degraded quickly when the number of joins increased. CS2 produced the lowest relative errors in all cases of the TPC-H tests.
Figure 7(g) shows the results of the DBLP large range query tests. Again, CS2 was superior to JS and TuG, with TuG performing the worst. As observed from Figures 7(g) and 7(h), small range queries gave higher errors than large range queries. Also, as the sizes of the synopses increased, the relative errors decreased (Figures 7(h) and 7(i)). The dominance of CS2 over JS became more obvious as the numbers of joins and selections increased, as shown in Fig 7(h).
Finally, we compare the accuracy of CS2 and JS for no-
source queries. All queries were extracted from the previous
tests. It is clear, as shown in Fig 7(j), that CS2 with the
reverse estimator produced lower relative errors than JS.
In summary, TuG performed the worst among these three
synopses. CS2 produced the smallest relative errors in all
tests. In addition, CS2 increasingly outperformed JS as the
numbers of joins and selections grew.
7.6 Unbiasedness of RV Estimator
In Section 4.2, we proposed the reverse estimator for all no-source queries and proved theoretically that it is unbiased (Theorem 2). Here, we validate empirically the unbiasedness of this estimator using CS2. We also compare the results with JS.
For illustrative purposes, we select two queries for this bias test, one from group 'j2s1' and the other from 'j2s2' in the DBLP tests. A query is estimated each time using a different synopsis (or synopses) of the same size (3KB). We generated 100 synopses for this purpose. The absolute
average relative error of the first $i$ estimates is defined as

$$E_i = \frac{\left| \frac{1}{i}\sum_{j=1}^{i} \hat{Y}^{Est}_j - Y_{Act} \right|}{Y_{Act}} \times 100\%$$

where $i \ge 1$, $Y_{Act}$ is the actual query size, $\hat{Y}^{Est}_j$ is the estimated value derived from the $j$th synopsis (or synopses), and $\frac{1}{i}\sum_{j=1}^{i} \hat{Y}^{Est}_j$ is the average of the first $i$ estimates. For an unbiased estimator [6], the absolute average relative error should approach 0 as $i$ (the number of tries) increases.
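The metric $E_i$ can be computed as a running average; the estimate sequence below is fabricated around an actual size of 100 purely to show the mechanics:

```python
Y_act = 100.0
est = [90.0, 115.0, 95.0, 105.0, 98.0, 102.0]  # hypothetical unbiased estimates

E = []
running = 0.0
for i, y in enumerate(est, start=1):
    running += y
    E.append(abs(running / i - Y_act) / Y_act * 100)   # E_i in percent

# the running-average error shrinks as more estimates are folded in:
# E is roughly [10.0, 2.5, 0.0, 1.25, 0.6, 0.83]
```

For an unbiased estimator, this sequence trends toward 0 by the law of large numbers, which is exactly the behavior plotted in Figure 8.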
As observed in Figure 8(a), the errors of both CS2 (using the RV estimator) and JS approached 0 as the number of tries increased. On the other hand, Figure 8(b) demonstrates that the errors of CS2 converged to 0 earlier than those of JS within 100 rounds; i.e., CS2 yields a faster convergence speed than JS when more selections are added. Certainly, JS yields unbiased estimates as it uses simple random samples of join results. These experiments validate our earlier proofs of the unbiasedness of the RV estimator.
8. CONCLUSIONS AND FUTURE WORK
In this paper, we have presented an innovative approach to constructing a database synopsis. The synopsis mainly consists of a set of correlated sample relations. We have designed a reverse estimator that can fully utilize the correlated sample tuples to make accurate estimations. Our experimental results show that the correlated sample synopsis is easy to construct and performs better than existing synopses. It proves itself to be a simple, concise, and accurate synopsis.
There is still much work to do. We shall study the possibility of further reducing the size of the synopsis using other sampling techniques, such as subsampling. Finally, we shall explore the use of the correlated sample synopsis in approximate aggregate query answering.
9. ACKNOWLEDGMENTS
A special thanks to Dr. Sakthivel Jeyarathnam for his help
in developing the reverse estimator. We are also grateful to
Mr. Chase Abner for editing assistance.
10. REFERENCES
[1] S. Acharya, P. B. Gibbons, V. Poosala, and
S. Ramaswamy. Join synopses for approximate query
answering. SIGMOD Rec., 28:275–286, June 1999.
[2] N. Alon, Y. Matias, and M. Szegedy. The space
complexity of approximating the frequency moments.
J. Comput. Syst. Sci., pages 20–29, 1996.
[3] S. Chaudhuri, G. Das, and V. Narasayya. Optimized
stratified sampling for approximate query processing.
ACM Trans. Database Syst., 32(2), June 2007.
[4] S. Chaudhuri and V. Narasayya. Program for tpc-d
data generation with skew. ftp://ftp.research.
microsoft.com/users/viveknar/tpcdskew.
[Figure 7: Absolute Relative Errors (%) in Accuracy Tests. Each panel plots the error per query group (j1s1 to j3s2): (a) TPCH1G05: TuG=29KB; JS,CS2=20KB. (b) TPCH1G05: JS,CS2=10KB. (c) TPCH1G1: TuG=21KB; JS,CS2=20KB. (d) TPCH1G1: JS,CS2=10KB. (e) TPCH10G05: JS,CS2=100KB. (f) TPCH10G1: JS,CS2=100KB. (g) DBLP Large Range: TuG=23KB; JS,CS2=3KB. (h) DBLP Small Range: JS,CS2=3KB. (i) DBLP Small Range: JS,CS2=10KB. (j) DBLP No-source: JS,CS2=3KB.]
[Figure 8: Averaged Absolute Relative Errors (%) in Bias Tests, plotted against the number of estimates (0 to 100). (a) Bias Test: DBLP No-source j2s1. (b) Bias Test: DBLP No-source j2s2.]
[5] S. Christodoulakis. Implications of certain
assumptions in database performance evaluation. ACM
Trans. Database Syst., 9:163–186, June 1984.
[6] W. G. Cochran. Sampling Techniques, 3rd Edition.
John Wiley, 1977.
[7] G. Cormode and S. Muthukrishnan. An improved
data stream summary: the count-min sketch and its
applications. J. Algorithms, 55:58–75, April 2005.
[8] A. Deshpande, M. Garofalakis, and R. Rastogi.
Independence is good: dependency-based histogram
synopses for high-dimensional data. SIGMOD Rec.,
30:199–210, May 2001.
[9] J. Diederich. FacetedDBLP.
http://dblp.l3s.de/dblp++.php.
[10] P. B. Gibbons, Y. Matias, and V. Poosala. Fast
incremental maintenance of approximate histograms.
VLDB ’97, pages 466–475, 1997.
[11] P. J. Haas, J. F. Naughton, S. Seshadri, and A. N.
Swami. Fixed-precision estimation of join selectivity.
PODS ’93, pages 190–201, 1993.
[12] P. J. Haas and A. N. Swami. Sequential sampling
procedures for query size estimation. SIGMOD Rec.,
21:341–350, June 1992.
[13] W.-C. Hou, G. Ozsoyoglu, and B. K. Taneja.
Statistical estimators for relational algebra
expressions. PODS ’88, pages 276–287, 1988.
[14] Y. E. Ioannidis and Y. C. Kang. Left-deep vs. bushy
trees: an analysis of strategy spaces and its
implications for query optimization. SIGMOD Rec.,
20:168–177, April 1991.
[15] R. P. Kooi. The optimization of queries in relational
databases. PhD thesis, Cleveland, OH, USA, 1980.
AAI8109596.
[16] J.-H. Lee, D.-H. Kim, and C.-W. Chung.
Multi-dimensional selectivity estimation using
compressed histogram information. SIGMOD Rec.,
28:205–214, June 1999.
[17] M. Ley. The DBLP computer science bibliography.
http://www.informatik.uni-trier.de/~ley/db/.
[18] R. J. Lipton, J. F. Naughton, and D. A. Schneider.
Practical selectivity estimation through adaptive
sampling. SIGMOD Rec., 19:1–11, May 1990.
[19] V. Markl, V. Raman, D. Simmen, G. Lohman,
H. Pirahesh, and M. Cilimdzic. Robust query
processing through progressive optimization. SIGMOD
’04, pages 659–670, 2004.
[20] Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based
histograms for selectivity estimation. SIGMOD Rec.,
27:448–459, June 1998.
[21] Y. Matias, J. S. Vitter, and M. Wang. Dynamic
maintenance of wavelet-based histograms. VLDB ’00,
pages 101–110, 2000.
[22] M. Muralikrishna and D. J. DeWitt. Equi-depth
histograms for estimating selectivity factors for
multi-dimensional queries. SIGMOD ’88, pages 28–36,
1988.
[23] R. Perdisci. JBIRCH.
http://roberto.perdisci.com/projects/jbirch.
[24] V. Poosala and Y. E. Ioannidis. Selectivity estimation
without the attribute value independence assumption.
VLDB ’97, pages 486–495, 1997.
[25] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin,
R. A. Lorie, and T. G. Price. Access path selection in
a relational database management system. SIGMOD
’79, pages 23–34, 1979.
[26] J. M. Smith and P. Y.-T. Chang. Optimizing the
performance of a relational algebra database interface.
Commun. ACM, 18:568–579, October 1975.
[27] J. Spiegel and N. Polyzotis. Graph-based synopses for
relational selectivity estimation. SIGMOD ’06, pages
205–216, 2006.
[28] J. Spiegel and N. Polyzotis. Tug synopses for
approximate query answering. ACM Trans. Database
Syst., 34(1):3:1–3:56, Apr. 2009.
[29] J. D. Ullman, H. Garcia-Molina, and J. Widom.
Database Systems: The Complete Book. Prentice Hall
PTR, Upper Saddle River, NJ, USA, 1st edition, 2001.
[30] E. Wong and K. Youssefi. Decomposition - a strategy
for query processing. ACM Transactions on Database
Systems, 1:223–241, 1976.
[31] Y.-L. Wu, D. Agrawal, and A. El Abbadi. Applying
the golden rule of sampling for query estimation.
SIGMOD Rec., 30(2):449–460, May 2001.
[32] T. Zhang, R. Ramakrishnan, and M. Livny. Birch: an
efficient data clustering method for very large
databases. SIGMOD Rec., 25:103–114, June 1996.
480
... Estimating join cardinality requires correlated statistics on the join attributes. While such statistics exist, e.g., correlated samples [17,24,44], they require the existence of indexes on every join attribute combination, which severely limits their applicability in the case of multi-way joins. As a result, even advanced optimizers rely on crude formulas that assume uniformity, inclusion, and independence-which are likely to produce highly sub-optimal execution plans [23]. ...
... Estimating join cardinality requires correlated statistics on the join attributes. While such statistics exist, e.g., correlated samples [22,62,29], they require the existence of indexes on every join attribute combination, which severely limits their applicability in the case of multi-way joins. As a result, even advanced optimizers rely on crude formulas that assume uniformity, inclusion, and independence-which are likely to produce highly sub-optimal execution plans [28]. ...
Preprint
Full-text available
Cost-based query optimization remains a critical task in relational databases even after decades of research and industrial development. Query optimizers rely on a large range of statistical synopses -- including attribute-level histograms and table-level samples -- for accurate cardinality estimation. As the complexity of selection predicates and the number of join predicates increase, two problems arise. First, statistics cannot be incrementally composed to effectively estimate the cost of the sub-plans generated in plan enumeration. Second, small errors are propagated exponentially through join operators, which can lead to severely sub-optimal plans. In this paper, we introduce COMPASS, a novel query optimization paradigm for in-memory databases based on a single type of statistics -- Fast-AGMS sketches. In COMPASS, query optimization and execution are intertwined. Selection predicates and sketch updates are pushed-down and evaluated online during query optimization. This allows Fast-AGMS sketches to be computed only over the relevant tuples -- which enhances cardinality estimation accuracy. Plan enumeration is performed over the query join graph by incrementally composing attribute-level sketches -- not by building a separate sketch for every sub-plan. We prototype COMPASS in MapD -- an open-source parallel database -- and perform extensive experiments over the complete JOB benchmark. The results prove that COMPASS generates better execution plans -- both in terms of cardinality and runtime -- compared to four other database systems. Overall, COMPASS achieves a speedup ranging from 1.35X to 11.28X in cumulative query execution time over the considered competitors.
... Yu et al. [105] introduce correlated sampling as a part of CS2 algorithm. They (1) choose one of the tables in a join graph as the source table R 1 , (2) use a random sampling method to obtain sample set S 1 for R 1 (mark R 1 as visited), (3) follow an unvisited edge < R i , R j > ( R i is visited) in the join graph and collect the tuples from R j which are joinable with tuples in S i as S j , and (4) estimate the join size over the samples. ...
Article
Full-text available
Query optimizer is at the heart of the database systems. Cost-based optimizer studied in this paper is adopted in almost all current database systems. A cost-based optimizer introduces a plan enumeration algorithm to find a (sub)plan, and then uses a cost model to obtain the cost of that plan, and selects the plan with the lowest cost. In the cost model, cardinality, the number of tuples through an operator, plays a crucial role. Due to the inaccuracy in cardinality estimation, errors in cost model, and the huge plan space, the optimizer cannot find the optimal execution plan for a complex query in a reasonable time. In this paper, we first deeply study the causes behind the limitations above. Next, we review the techniques used to improve the quality of the three key components in the cost-based optimizer, cardinality estimation, cost model, and plan enumeration. We also provide our insights on the future directions for each of the above aspects.
... Yu et al. [105] introduce correlated sampling as a part of CS2 algorithm. They (1) choose one of the tables in a join graph as the source table 1 , (2) use a random sampling method to obtain sample set 1 for 1 (mark 1 as visited), (3) follow an unvisited edge < , > ( is visited) in the join graph and collect the tuples from which are joinable with tuples in as , and (4) estimate the join size over the samples. ...
Preprint
Query optimizer is at the heart of the database systems. Cost-based optimizer studied in this paper is adopted in almost all current database systems. A cost-based optimizer introduces a plan enumeration algorithm to find a (sub)plan, and then uses a cost model to obtain the cost of that plan, and selects the plan with the lowest cost. In the cost model, cardinality, the number of tuples through an operator, plays a crucial role. Due to the inaccuracy in cardinality estimation, errors in cost model, and the huge plan space, the optimizer cannot find the optimal execution plan for a complex query in a reasonable time. In this paper, we first deeply study the causes behind the limitations above. Next, we review the techniques used to improve the quality of the three key components in the cost-based optimizer, cardinality estimation, cost model, and plan enumeration. We also provide our insights on the future directions for each of the above aspects.
Article
Inner-product estimation is the base of many important tasks in a variety of big data scenarios, including measuring similarity of streams in data stream processing, estimating join size in database, and analyzing cosine similarity in various applications. Sketch, as a class of probability algorithms, is promising in inner-product estimation. However, existing sketch solutions suffer from low accuracy due to their neglect of the high skewness of real data. In this paper, we design a new sketch algorithm for accurate and unbiased inner-product estimation, namely JoinSketch. To improve accuracy, JoinSketch consists of multiple components, and records items with different frequency in different components. We theoretically prove that JoinSketch is unbiased, and has lower variance compared with the well-known AGMS and Fast-AGMS sketch. The experimental results show that JoinSketch improves the accuracy by 10 times in average while maintaining a comparable speed. All code is open-sourced at Github.
Article
Today's query optimizers use fast selectivity estimation techniques but are known to be susceptible to large estimation errors. Recent work on supervised learned models for selectivity estimation significantly improves accuracy while ensuring relatively low estimation overhead. However, these models impose significant construction cost, as they need large numbers of training examples and computing selectivity labels is costly for large datasets. We propose a novel model construction method that incrementally generates training data and uses approximate selectivity labels, reducing total construction cost by an order of magnitude while preserving most of the accuracy gains. The proposed method is particularly attractive for model designs that are faster to train for a given number of training examples, but such models are known to support only a limited class of query expressions. We broaden the applicability of such supervised models to the class of select-project-join query expressions with range predicates and IN clauses. Our extensive evaluation on synthetic benchmarks and real-world queries shows that the 95th-percentile error of our proposed models is 10-100X better than that of traditional selectivity estimators. We also demonstrate significant gains in plan quality as a result of improved selectivity estimates.
Article
Query size estimation is crucial for many database system components. In particular, query optimizers need efficient and accurate query size estimation when deciding among alternative query plans. In this paper we propose a novel sampling technique based on the golden rule of sampling, introduced by von Neumann in 1947, for estimating range queries. The proposed technique randomly samples the frequency domain using the cumulative frequency distribution and yields good estimates without any a priori knowledge of the actual underlying distribution of spatial objects. We show experimentally that the proposed sampling technique gives smaller approximation error than the Min-Skew histogram based and wavelet based approaches for both synthetic and real datasets. Moreover, the proposed technique can be easily extended for higher dimensional datasets.
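The "golden rule" the abstract invokes is inverse-transform sampling: draw a uniform variate and invert the cumulative frequency distribution to pick a value. A generic 1-D sketch of that mechanism (not the paper's spatial-range estimator) with made-up values and frequencies:

```python
import bisect
import random

def inverse_transform_sampler(values, freqs, rng):
    # Build the cumulative frequency distribution once up front.
    cum = []
    total = 0
    for f in freqs:
        total += f
        cum.append(total)

    def sample():
        # Golden rule: draw u uniformly over the total mass, then invert
        # the CDF to find the value whose interval contains u.
        u = rng.uniform(0, total)
        return values[bisect.bisect_left(cum, u)]

    return sample

rng = random.Random(7)
values = [10, 20, 30, 40]
freqs = [1, 2, 3, 4]  # skewed frequencies; probabilities 0.1/0.2/0.3/0.4
sample = inverse_transform_sampler(values, freqs, rng)

draws = [sample() for _ in range(10000)]
p40 = draws.count(40) / len(draws)  # empirical share of 40s tracks 0.4
```

No assumption about the shape of the underlying distribution is needed; whatever the cumulative frequencies encode is reproduced by the draws.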
Article
We provide a procedure, based on random sampling, for estimating the size of a query result. The procedure is sequential in that sampling terminates after a random number of steps according to a stopping rule that depends upon the observations obtained so far. Enough observations are obtained so that, with a pre-specified probability, the estimate differs from the true size of the query result by no more than a pre-specified amount. Unlike previous sequential estimation procedures for queries, our procedure is asymptotically efficient and requires no ad hoc pilot sample or a priori assumptions about data characteristics. In addition to establishing the asymptotic properties of the estimation procedure, we provide techniques for reducing undercoverage at small sample sizes and show that the sampling cost of the procedure can be reduced through stratified sampling techniques.
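A sequential procedure of this flavor can be sketched with a simple relative-precision stopping rule: keep sampling until the confidence half-width of the running mean falls below a target fraction of the mean. This is a generic illustration under assumed parameters (eps, z, the uniform stand-in for per-sample contributions), not the paper's asymptotically efficient procedure:

```python
import math
import random

def sequential_estimate(draw, eps=0.05, z=1.96, min_n=30, max_n=200000):
    # Sample until the z-confidence half-width of the running mean
    # drops below eps * mean, or max_n draws are exhausted.
    n = 0
    s = 0.0
    s2 = 0.0
    while n < max_n:
        x = draw()
        n += 1
        s += x
        s2 += x * x
        if n >= min_n:
            mean = s / n
            var = max(s2 / n - mean * mean, 0.0)
            half = z * math.sqrt(var / n)
            if mean > 0 and half <= eps * mean:
                break
    return s / n, n

rng = random.Random(3)
# Stand-in for a per-sample contribution to a query-size estimate:
# uniform on {0, ..., 10}, true mean 5.
mean, n = sequential_estimate(lambda: rng.randrange(11))
```

The min_n guard is a crude answer to the undercoverage problem the abstract mentions: very early stopping on a lucky low-variance prefix would make the nominal confidence level too optimistic.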
Article
Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and the minimization of I/O costs. This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best-quality clustering with the available resources (i.e., available memory and time constraints). BIRCH can typically find a good clustering with a single scan of the data and improve the quality further with a few additional scans. BIRCH is also the first clustering algorithm proposed in the database area to handle "noise" (data points that are not part of the underlying pattern) effectively. We evaluate BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparison of BIRCH versus CLARANS, a clustering method proposed recently for large datasets, and show that BIRCH is consistently superior.
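The incremental core of BIRCH is the clustering feature, CF = (N, LS, SS): a point is absorbed into an existing cluster only if the cluster's radius stays under a threshold, otherwise a new cluster starts. Below is a 1-D sketch of that insertion idea only, with made-up data and threshold; the real algorithm organizes CFs in a height-balanced CF-tree, which is omitted here:

```python
import math

class CF:
    # Clustering feature for a 1-D cluster: count, linear sum, square sum.
    def __init__(self, x):
        self.n, self.ls, self.ss = 1, x, x * x

    def centroid(self):
        return self.ls / self.n

    def radius_if_added(self, x):
        # Radius the cluster would have after absorbing x, computed
        # purely from the additive CF statistics.
        n, ls, ss = self.n + 1, self.ls + x, self.ss + x * x
        return math.sqrt(max(ss / n - (ls / n) ** 2, 0.0))

    def add(self, x):
        self.n += 1
        self.ls += x
        self.ss += x * x

def cf_cluster(points, threshold):
    # Single-pass insertion: absorb each point into the nearest CF if the
    # resulting radius stays under the threshold, else open a new CF.
    cfs = []
    for x in points:
        best = min(cfs, key=lambda c: abs(c.centroid() - x), default=None)
        if best is not None and best.radius_if_added(x) <= threshold:
            best.add(x)
        else:
            cfs.append(CF(x))
    return cfs

data = [0.1, 0.2, 0.15, 5.0, 5.1, 4.9, 10.2, 10.0]
clusters = cf_cluster(data, threshold=0.5)
print(len(clusters))  # -> 3 (three well-separated groups)
```

Because CFs are additive, the pass needs only constant memory per cluster and never revisits raw points, which is what makes the single-scan, I/O-frugal behavior possible.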
Article
Approximating the joint data distribution of a multi-dimensional data set through a compact and accurate histogram synopsis is a fundamental problem arising in numerous practical scenarios, including query optimization and approximate query answering. Existing solutions either rely on simplistic independence assumptions or try to directly approximate the full joint data distribution over the complete set of attributes. Unfortunately, both approaches are doomed to fail for high-dimensional data sets with complex correlation patterns between attributes. In this paper, we propose a novel approach to histogram-based synopses that employs the solid foundation of statistical interaction models to explicitly identify and exploit the statistical characteristics of the data. Abstractly, our key idea is to break the synopsis into (1) a statistical interaction model that accurately captures significant correlation and independence patterns in data, and (2) a collection of histograms on low-dimensional marginals that, based on the model, can provide accurate approximations of the overall joint data distribution. Extensive experimental results with several real-life data sets verify the effectiveness of our approach. An important aspect of our general, model-based methodology is that it can be used to enhance the performance of other synopsis techniques that are based on data-space partitioning (e.g., wavelets) by providing an effective tool to deal with the “dimensionality curse”.