Conference PaperPDF Available

From A to E: Analyzing TPC’s OLTP Benchmarks -- The obsolete, the ubiquitous, the unexplored

Authors:

Abstract and Figures

Introduced in 2007, TPC-E is the most recently standardized OLTP benchmark by TPC. Even though TPC-E has already been around for six years, it has not gained the popularity of its predecessor TPC-C: all the published results for TPC-E use a single database vendor's product. TPC-E is significantly different than its predecessors. Some of its distinguishing characteristics are the non-uniform input creation, longer-running and more complicated transactions, more difficult partitioning etc. These factors slow down the adoption of TPC-E. In turn, there is little knowledge in the community about how TPC-E behaves micro-architecturally and within the database engine. To shed light on TPC-E, we implement it on top of a scalable open-source database engine, Shore-MT, and perform a workload characterization study, comparing it with the previous, much better known OLTP benchmarks of TPC: TPC-B and TPC-C. In parallel, we study the evolution of the OLTP benchmarks throughout the decades. Our results demonstrate that TPC-E exhibits similar micro-architectural behavior to TPC-B and TPC-C, even though it incurs less stall time and higher instructions per cycle. On the other hand, within the database engine it suffers more from logical lock contention. Therefore, we argue that, on the hardware side, TPC-E needs less aggressive processors. Whereas on the software side it can benefit from designs based on intra-transaction parallelism, logical partitioning, and optimistic concurrency control to minimize the effects of lock contention without introducing distributed transactions.
Content may be subject to copyright.
From A to E: Analyzing TPC’s OLTP Benchmarks
The obsolete, the ubiquitous, the unexplored
Pınar Tözün Ippokratis PandisCansu Kaynak Djordje Jevdjic Anastasia Ailamaki
École Polytechnique Fédérale de Lausanne IBM Almaden Research Center
Lausanne, VD, Switzerland San Jose, CA, USA
ABSTRACT
Introduced in 2007, TPC-E is the most recently standard-
ized OLTP benchmark by TPC. Even though TPC-E has
already been around for six years, it has not gained the pop-
ularity of its predecessor TPC-C: all the published results
for TPC-E use a single database vendor’s product. TPC-
E is significantly different than its predecessors. Some of
its distinguishing characteristics are the non-uniform input
creation, longer-running and more complicated transactions,
more difficult partitioning etc. These factors slow down the
adoption of TPC-E. In turn, there is little knowledge in the
community about how TPC-E behaves micro-architecturally
and within the database engine.
To shed light on TPC-E, we implement it on top of a scal-
able open-source database engine, Shore-MT, and perform
a workload characterization study, comparing it with the
previous, much better known OLTP benchmarks of TPC:
TPC-B and TPC-C. In parallel, we study the evolution of
the OLTP benchmarks throughout the decades. Our results
demonstrate that TPC-E exhibits similar micro-architectural
behavior to TPC-B and TPC-C, even though it incurs less
stall time and higher instructions per cycle. On the other
hand, within the database engine it suffers more from logi-
cal lock contention. Therefore, we argue that, on the hard-
ware side, TPC-E needs less aggressive processors. Whereas
on the software side it can benefit from designs based on
intra-transaction parallelism, logical partitioning, and opti-
mistic concurrency control to minimize the effects of lock
contention without introducing distributed transactions.
1. INTRODUCTION
For the past decades, the data management ecosystem and
in turn the database and hardware markets have evolved
primarily around two applications: online transaction pro-
cessing (OLTP) and online analytical processing (OLAP).
Transaction processing benchmarks are the gold standard
for comparing products by different database and hardware
vendors, and are regularly used for marketing purposes [16,
Permission to make digital or hard copies of part or all of this work for
personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear
this notice and the full citation on the first page. Copyrights for components
of this work owned by others than ACM must be honored. Abstracting with
credit is permitted. To copy otherwise, to republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee.
EDBT/ICDT ’13, March 18 - 22 2013, Genoa, Italy
Copyright 2013 ACM 978-1-4503-1597-5/13/03 ...$15.00.
28]. For the last two decades, TPC-C [37] has been the most
widely used OLTP benchmark by the majority of indus-
try and academia. TPC-C consists of simple short-running
transactions with frequent updates and less frequent index
scans. On the other hand, the benchmark of choice for
OLAP workloads is TPC-H [39]. TPC-H observes more
complicated long-running read-only queries with frequent
index and file scans. The data management stacks, from
the database down to hardware, are typically optimized for
these two extreme benchmarks.
In order to represent OLTP workloads more realistically,
the Transaction Processing Performance Council (TPC) in-
troduced the TPC-E benchmark [38] in 2007. TPC-E is
an OLTP workload that includes transactions for real-time
business intelligence combined with client-side requests. It
acts in between a typical OLTP and an OLAP benchmark.
The design decision for TPC-E was to create a sophisticated
OLTP benchmark, having more complicated and longer trans-
actions when compared to TPC-C, relying on the extensive
use of non-primary indexes, observing data and access skew,
applying integrity and referential constraints, and being less
amenable to partitioning.
Both industry and academia are slow at adopting TPC-E.
For example, even though the benchmark was standardized
six years ago, all of the published results for TPC-E use
the same database product (Microsoft SQL Server). Due to
TPC-E’s significant differences from the other benchmarks,
it is not easy to extrapolate how systems perform when they
run TPC-E (and TPC-E-like applications).
Existing experimental studies typically use database bench-
marks other than TPC-E. Previous studies of OLTP and
OLAP benchmarks, either micro-architectural [2, 5, 15, 22,
32, 33] or profiling [18, 20, 29, 30], provide valuable results.
However, they fall short of explaining the behavior TPC-E
is expected to exhibit. Recent work that analyzes TPC-E
either focuses only on the I/O behavior [10, 21] or reports
micro-architectural results on only one type of machine while
running TPC-E on a commercial RDBMS and treating the
database as a black-box [13]. To date, there is neither an
analysis of the TPC-E benchmark on various hardware plat-
forms nor a comprehensive breakdown of the execution time
with respect to database engine components.
In this paper, we perform a detailed study of TPC-E.
We characterize where it spends time within an open-source
database engine and how it behaves micro-architecturally
on two different hardware platforms, one in-order and one
out-of-order machine. In parallel, we compare TPC-E to
the well-known OLTP benchmarks and observe how TPC’s
transactional benchmarks have evolved over the years. Then,
we discuss what kind of changes in database and hardware
systems can be more beneficial for such a workload. The
contributions of our study are as follows:
Our micro-architectural study demonstrates that TPC-
E is actually very similar to the previous OLTP bench-
marks in terms of its micro-architectural behavior. It
highly suffers from L1 instruction misses and exhibits
low instructions per cycle (IPC); IPC is smaller than
one on a machine that has ability to execute four. Thus,
we argue that TPC-E-like workloads need less aggressive
processors with a lower instruction issue width on the
hardware side. In addition, even though simultaneous
multi-threading (SMT) hides some of the stalls caused
by instruction misses and almost doubles the IPC, we
need more effective solutions like intra-transaction par-
allelism [11, 29] or computation spreading [4, 9] to better
utilize modern processor cores.
Our profiling study reveals that, within the database en-
gine, TPC-E spends 70% more time inside the lock man-
ager compared to both TPC-B and TPC-C for a con-
figuration with an orders of magnitude bigger database
size. TPC-E’s more complicated schema and transac-
tions make it less straightforward to physically partition
a TPC-E database to eliminate its locking overheads due
to the significant number of distributed transactions such
a design would cause. However, TPC-E can benefit from
shared-everything designs that aim minimizing locking
with logical [29] or physiological partitioning [30], or sys-
tems that rely on optimistic concurrency control [24] to
improve system performance.
The rest of the paper is organized as follows. Section 2
briefly describes the previous transaction processing bench-
marks standardized by TPC over the years and details TPC-
E. Section 3 introduces Shore-Kits, which is a suite of OLTP
benchmarks for Shore-MT and Section 4 describes our ex-
perimental methodology. Section 5 and Section 6 present
the profiling and micro-architectural analysis respectively
for TPC-E in comparison with TPC-B and TPC-C. Based
on the analysis results, Section 7 discusses possible design
optimizations both for upcoming hardware and storage man-
agers, mainly while running TPC-E on top. Finally, Sec-
tion 8 surveys the related work and Section 9 concludes.
2. EVOLUTION OF OLTP BENCHMARKS
Transaction processing benchmarks are the gold standard
for DBMS performance evaluation and they are frequently
used for marketing purposes. The Transaction Processing
Performance Council (TPC) is a non-profit IT organization
founded to define database benchmarks and disseminate ob-
jective, verifiable performance data to the industry. This
section describes the four important database transaction
processing benchmarks that have been used under the trade-
mark of TPC and highlights how they have evolved over the
years with each new benchmark.
2.1 The obsolete TPC-A and TPC-B
The first widely accepted database benchmark was formal-
ized in 1985 [3]. That specification included three workloads,
of which the “DebitCredit” stressed the database engine.
The DebitCredit benchmark was an instant success. Soon
database and hardware vendors started reporting extraor-
dinary results, often achieved by removing key constraints
from the specification. Therefore, in 1988 a consortium of
analysts and hardware, operating system, and database ven-
dors formed the Transaction Processing Performance Coun-
cil in order to enforce some order in database benchmarking.
Its first benchmark specification, TPC-A, essentially formal-
ized the DebitCredit benchmark.
TPC-A is straightforward. It models deposits on and
withdrawals from random bank accounts, with the associ-
ated double-entry accounting on a database that contains
xBranches, 10xTellers, and 100,000xAccounts. It also
captures the entire system, including terminals and network.
Transactions usually originate from their “home” Branch,
but can go anywhere. Conflicts are possible requiring the
system to recover occasionally from failed transactions. An
important aspect of this benchmark is its scaling rule: for a
result to be valid, the database size must be proportional to
the reported throughput.
Simple though it maybe, the TPC-A benchmark high-
lighted the importance of quantifying the performance and
correctness of different systems. Early benchmarking showed
vast performance differences among different vendors (400x),
as well as exposing serious bugs, which had been lurked and
undiscovered, for many years in mature products.
TPC’s second benchmark, TPC-B [35], is very similar to
TPC-A, but eliminates the network and terminal handling
to create a database engine stress test. Like TPC-A, the
TPC-B database contains four tables: Branch,Teller,Ac-
count, and History. These tables are accessed in a double-
entry accounting style as customers make deposits on and
withdrawals from various tellers. The benchmark consists of
a single transaction, AccountUpdate, which simply updates
one record in the Branch,Teller, and Account tables while
appending a record to the History table. Therefore, it is a
very update-heavy transaction that stresses the transaction
processing engine; especially the logging and concurrency
control modules. Due to the similarities between TPC-A
and TPC-B, for the rest of our study we use TPC-B alone.
2.2 The ubiquitous TPC-C
For its third benchmark specification, TPC-C [37], TPC
moves away from banking to commerce. TPC-C models an
online transaction processing database for a wholesale sup-
plier. The transactions follow customer orders from initial
creation to final delivery and payment.
A TPC-C database consists of nine tables in total where
one of them has fixed size (Fixed), four of them scale propor-
tionally with the number of Warehouses (Scaling), and four
of them might change size, mostly grow, due to insert and
delete operations (Growing). Thereby, compared to TPC-B,
TPC-C offers a more complex database schema; where the
TPC-B schema can be represented as a tree with only four
nodes, the TPC-C schema is a directed acyclic graph with
nine nodes.
Like the database schema, the TPC-C transactions are
also more complex. The benchmark combines the five trans-
actions listed below in a transaction mix at frequencies given
in parenthesis:
NewOrder (45%) inserts a new sales order to the database.
It is a medium-weight transaction with a 1% failure rate
due to invalid inputs.
Table 1: The TPC-E transactions
Transaction Weight Access Category Frames Executed % in Mix
BrokerVolume Mid to Heavy RO BI 1 (out of 1) 4.9
CustomerPosition Mid to Heavy RO CI 2/3 (out of 3) 13
MarketFeed Medium RW MT 1 (out of 1) 1
MarketWatch Medium RO CI 1 (out of 1) 18
SecurityDetail Medium RO CI 1 (out of 1) 14
TradeLookup Medium RO BI/CI 1 (out of 4) 8
TradeOrder Heavy RW CI 2/5/6 (out of 6) 10.1
TradeResult Heavy RW MT 5/6 (out of 6) 10
TradeStatus Light RO CI 1 (out of 1) 19
TradeUpdate Medium RW BI/CI 1 (out of 3) 2
BI: Brokerage Initiated, CI: Customer Initiated, MT: Market Triggered
Payment (43%) is a short transaction, very similar to
the AccountUpdate transaction of TPC-B, which makes
a payment on an existing order.
OrderStatus (4%) is a read-only transaction that com-
putes the shipping status and the line items of an order.
Delivery (4%) is the largest and the most contentious
update transaction. It selects the oldest undelivered or-
ders for each warehouse and marks them as delivered.
StockLevel (4%) is also a read-only transaction. It joins
on average 200 order line items with their corresponding
stock entries in order to produce a report.
The specification also lays out strict requirements about
response time, consistency, and recovery in the system, and
brings back the testing an end-to-end system that includes
network and terminal handling.
TPC-C stresses the entire stack (database system, oper-
ating system, and hardware) in several ways. First, it mixes
short and long, read-only and update-intensive transactions,
exercising a wider variety of features and situations than
the TPC-B benchmark. In addition, the benchmark has
major hotspots, partly due to the way transactions access
the Warehouse table and partly due to how the Delivery
transaction is designed. The resulting contention and dead-
locks stress the system’s concurrency control mechanisms.
Finally, the database grows throughout the benchmark run;
not just because of the append-only History table as in
TPC-B, but also because of the insert and delete operations
on different tables, stressing code paths that the previous
TPC-B benchmark did not reach.
TPC-C has been the most popular OLTP benchmark for
over twenty years. Major database vendors have published
results on TPC’s website, and on several occasions it is used
for marketing purposes [16, 28].
2.3 The unexplored TPC-E
To represent more realistically real-life OLTP workloads,
TPC presented TPC-E [38] as an alternative to the dom-
inant TPC-C. In this subsection, we give an overview of
TPC-E while pointing out its differences from TPC-C.
2.3.1 Model
TPC-E models a brokerage house. The database tables
keep information about the customers, brokers, and market.
The transactions simulate a workload where either the cus-
tomers initiate requests to the brokerage house (customer
initiated transactions) or the market sends ticker feeds or
trade results to the brokerage house (market-triggered trans-
actions). The brokerage house responds to the customers,
checks the orders to decide whether to submit them or not,
submits the related brokerage requests (brokerage initiated
transactions), and analyzes or updates the database. One
could say that TPC-E represents a more complicated busi-
ness model compared to TPC-C.
2.3.2 Database
TPC-E has more tables than TPC-C; thirty-three tables
instead of nine. Nine of TPC-E’s tables are of Fixed size,
sixteen are Scaling based on the number of Customers, and
eight are Growing. However, the growth rate of the Growing
tables varies and in general it is greater than the growth
rates of the Growing tables in TPC-C. In addition, the TPC-
E tables are populated with pseudo-real data and exhibit
data skew. On the contrary, TPC-C tables have randomly
generated data that face a low degree of skew.
The scaling factor determines the number of Branches in
TPC-B and the number of Warehouses in TPC-C. TPC-E
has a scaling factor that controls the number of Customers
in the database. But, unlike TPC-B and TPC-C, where a
single scaling factor (via the number of Branches and Ware-
houses) is the only parameter that determines the initial size
of the database, TPC-E has two additional parameters that
affect the initial database size. In particular, the parameters
called working days and scaling factor control the cardinal-
ity of the Trade table and in turn all the other Growing
tables in TPC-E.
TPC-E also has a Growing table, Trade_Request, that
right after database population starts as an empty table and
then grows. Neither TPC-B nor TPC-C have empty tables
after the initial database population.
2.3.3 Transactions
TPC-E contains twelve transactions in total, which are
shown in Table 1. Only ten of the transactions belong to the
regular transaction mix. Two of them, DataMaintenance
and TradeCleanup, get executed separately. DataMainte-
nance is executed periodically, every minute, alongside with
the transaction mix, whereas TradeCleanup needs to be exe-
cuted before each run if one wants to cleanup the submitted
or pending trades from a previous run in order to restore the
initial database state. In TPC-C, all of the five transactions
Table 2: Evolution of TPC’s OLTP benchmarks
TPC-A TPC-B TPC-C TPC-E
First release Nov 1989 Aug 1990 Aug 1992 Feb 2007
Last update Jun 1994 Jun 1994 Feb 2010 Jun 2010
Business model Banking Banking Wholesale supplier Brokerage house
Tables
Fixed 0 0 1 9
Scaling 3 3 4 16
Growing 1 1 4 8
Total 4 4 9 33
Transactions RW 1 1 3 6
RO 0 0 2 6
Transaction RW 100% 100% 92% 23.1%
Mix % RO 0% 0% 8% 76.9%
Transactions using None None 2 10
secondary indexes
Data population Random Random Random Pseudo-real
are included in the transaction mix.
The TPC-E transactions consist of frames, which are parts
of a long transaction with a distinctive task. For some trans-
actions only a subset of their frames are executed depending
on the input values or whether they are initiated by a cus-
tomer or brokerage; like in TradeLookup and TradeUpdate.
TPC-C does not contain as complicated and long transac-
tions. All transactions in TPC-C have only one frame.
One significant distinction of TPC-E from its predecessors
is the majority of the transactions in the mix are Read-Only
(RO). That is, in TPC-E around 75% of the transactions
executed are read-only, whereas TPC-C has 92% Read-Write
(RW ) transactions in the mix.
Another distinction of TPC-E is that its transaction mix
enforces dependencies among some of the transactions. More
specifically, the market-triggered transactions (TradeResult
and MarketFeed) require TradeOrder transactions to submit
input for them. Therefore, they cannot be executed inde-
pendently from the transaction mix. In TPC-C none of the
transactions have such dependencies.
TPC-E specification also introduces skew in transaction
inputs, harness control measures within the transactions,
and checks for referential integrity constraints, which do not
exist in TPC-C. Moreover, for high performance, TPC-E
needs to perform lookups and scans through non-primary
indexes in almost all of its transactions (ten out of twelve),
whereas TPC-C uses secondary indexes in only two of its
transactions.
Overall, TPC-E is a much more sophisticated OLTP bench-
mark compared to all its predecessors and therefore, it of-
fers a more interesting and mature environment for testing
OLTP engines. On the other hand, it is also harder to adopt
for people from both industry and academia, which have
been optimizing their systems mainly based on TPC-C for
the last twenty years.
2.4 The evolution summary
Table 2 summarizes the high-level comparison of the four
OLTP benchmarks of TPC, which we detailed above. What
we can conclude from this section and Table 2 is that with
each benchmark TPC standardized, we see a significant com-
plexity increase, which is driven by the facts listed below:
A more sophisticated business model.
A larger variety in terms of transaction types.
Longer-running and less deterministic transactions, caus-
ing longer and less predictable instruction streams.
Increase in the number of read-only transactions that
need to be run together with update-heavy ones.
Increase in the number of scan operations and depen-
dency on the secondary indexes, which in turn makes
physical database partitioning less effective.
More fundamental stress within the storage manager and
exploration of an increased number of code-paths.
The above items are going to be crucial while explaining
the behavior of these workloads within a storage manager
and micro-architecturally.
3. SHORE-KITS: BENCHMARKS ON TOP
OF SHORE-MT
Shore-MT [19] is an enhanced version of the SHORE stor-
age manager [8], whose micro-architectural behavior is very
close to the commercial systems [1]. Shore-MT adds a mul-
tithreaded storage manager kernel to SHORE and is partic-
ularly developed to adapt SHORE to multicore era, mainly
by focusing on eliminating the scalability bottlenecks when
running on multicore hardware. Today, Shore-MT is one the
most scalable open-source shared-everything storage man-
agers within a single database node. It has been used in
various research projects as a test-bed both by the team
who develops and maintains it [18, 20, 29, 30, 31] and by
other well-known teams in the database and computer ar-
chitecture communities [4, 14].
In order to study the behavior and challenges the stan-
dardized OLTP benchmarks pose on modern storage man-
agers, we implement them on top of Shore-MT and dis-
tribute them as a suite of database benchmarks, called Shore-
Kits. In other words, Shore-Kits1is an open-source suite of
OLTP benchmarks for the Shore-MT storage manager.
1Available at https://bitbucket.org/shoremt
Since Shore-MT does not have an SQL front end, a query
parser, and an optimizer, the benchmarks are implemented
in C++ using direct calls to Shore-MT’s storage manager
API, which is linked as a static library to the executable.
With some programming effort and code refactoring, one
can port Shore-Kits to other storage managers by changing
the API calls to match the target storage manager’s API.
We implemented TPC-E using the query plans taken from
a TPC-E implementation of a major database vendor. As
for the index decisions, we initially adapted the indexes from
the same kit. Later, however, we had to change some of
the indexes in order to optimize performance when running
on top of Shore-MT. For example, Shore-MT’s API allows
Shore-Kits to use only unclustered indexes, whereas the kit
of the commercial database uses clustered ones for the pri-
mary indexes. Therefore, the optimal index decisions varied
between Shore-Kits and the kit of the commercial database.
Due to its large number of tables and longer and more com-
plicated transactions, TPC-E was by far the most difficult
benchmark implemented in Shore-Kits.
TPC-E stresses Shore-MT in ways previous benchmarks
do not. It pinpointed code-paths, exposing previously un-
detected bugs and performance bottlenecks. Therefore, it
helped us to further improve Shore-MT. For example, Shore-
MT had implementation of forward and backward index
scans. But the backward index scans were disabled, because
they were causing large number of deadlocks in some work-
loads. Debugging and re-enabling backward index scans in
Shore-MT improved performance of TPC-E by three orders
of magnitude on an Intel server.
4. EXPERIMENTAL METHODOLOGY
We used two servers for our experiments: (1) a Sun Ul-
traSPARC T5220 server with one socket containing eight
in-order cores, where each core has support for eight hard-
ware contexts and is clocked at 1.4GHz, running Solaris 10,
and (2) a server with two Intel Xeon X5660 processors each
with six out-of-order processor cores running Ubuntu 10.04
with Linux kernel version 2.6.32. Table 3 lists the character-
istics of each processor in detail. The diversity and degree
of hardware parallelism on these systems make them good
candidates for this study to reflect the behavior of our work-
loads on various types of modern hardware.
We use memory-resident databases for our experiments
and flush the log to RAM due to not having a suitably fast
I/O sub-system. A configuration that allows I/O in our
infrastructure might cause an unreasonably slow and highly
suboptimal OLTP system, and therefore, unrealistic micro-
architectural conclusions.
On the Intel machine, we experiment with two cases; when
hyper-threading (HT) is off and when it is on. When hyper-
threading is on, the Intel machine supports two hardware
contexts running at the same time on one core to be able to
overlap the stall time of one of the threads with the execution
of the other. This property is analogous to the simultane-
ous multi-threading (SMT) support in the SPARC machine
where each core has support for eight hardware contexts by
default, which is actually one of the main design principles
of the UltraSPARC T2 architecture.
We chose the most optimal configuration options we de-
termined empirically for all the benchmarks running on top
of Shore-MT to make sure that we run them without any
obvious scalability bottlenecks and better utilize the hard-
Table 3: Server Properties
Server UltraSPARC Intel Xeon
T2 X5660
#Sockets 1 2
#Cores per Socket 8 (in-order) 6 (OoO)
#HW Contexts 64 24
Clock Speed 1.40GHz 2.80GHz
Memory 64GB 48GB
L3 (shared) -12MB
access latency 29 cycles
L2 (shared) 4MB -
access latency 20 cycles
L2 (per core) -256KB
access latency 6 cycles
L1-I (per core) 16KB 32KB
access latency 3 cycles 4 cycles
L1-D (per core) 8KB 32KB
access latency 3 cycles 4 cycles
OS
SunOS 5.10 Ubuntu 10.04
Generic with Linux
141414-10 kernel 2.6.32
ware resources. In TPC-B we pad the records of Branch and
Teller tables so that a single database page only has a sin-
gle record. This minimizes false sharing of database pages
and avoids latching contention, which can be a fundamental
bottleneck for typical shared-everything architectures [30].
We also enable Speculative Lock Inheritance (SLI) [18] and
logging optimizations from Aether [20] to reduce the bottle-
necks coming from the lock and log managers, respectively,
for the benchmarks that benefit from these techniques.
Furthermore, for TPC-B and TPC-C we spread the re-
quests based on the primary key of the Branch and Ware-
house tables, respectively, to reduce logical lock contention.
In order to do that, we picked scaling factors that are equal
to the number of hardware contexts available on the machine
a specific experiment is run on, since the scaling factor is
equal to the number of Branches in TPC-B and Warehouses
in TPC-C. In other words, on the Intel machine we picked
a scaling factor of 12 and 24 when hyper-threading is dis-
abled and enabled, respectively, and on the SPARC machine
we picked a scaling factor of 64. Unfortunately, for TPC-E,
it is not straightforward how to spread the requests due to
its more complex schema and transactions that do not have
correlation based on any primary key column for the major-
ity of the database tables. To be able to run an in-memory
database, we picked a database size that contains 1000 cus-
tomers for TPC-E. We set the working days and scaling
factor parameters to 300 and 500, respectively, which are
the default values for these parameters in the TPC-E spec-
ification.
Before taking any measurements, we start with a newly
populated database, make each worker thread in the system
execute 1000 transactions to warm-up the caches, and then
perform a one-minute run. The tools used to collect the
hardware counter values and profiling results during these
runs are mentioned in the related sections.
5. PROFILING ANALYSIS
In order to further understand the high-level characteris-
tics of each benchmark, firstly, we report statistical infor-
0
20
40
60
80
100
4 8 16 32 48 60 4 8 16 32 48 60 4 8 16 32 48 60
TPC-B TPC-C TPC-E
Time Breakdown (%)
#Clients
Other
Btree
Catalog
SM
BPool
Xct Mgr
Logging
Latching
Locking
Figure 1: Time breakdown as the machine load increases on UltraSPARC T2.
mation collected from the storage manager in Section 5.1.
Then, in Section 5.2, our profiling analysis identifies the
components of the storage manager each benchmark spends
the most time in.
5.1 High-level analysis
Table 4 contains the high-level statistics of each bench-
mark to further highlight the changes in complexity with
each OLTP benchmark standardized by TPC. These statis-
tics are independent of the underlying hardware. We chose
a scaling factor of one for each benchmark in this part of
the analysis. This corresponds to one Branch in TPC-B,
one Warehouse in TPC-C, and one-thousand Customers in
TPC-E. For the initial database, we measure the number
of records each benchmark has and how many pages it uses
in Shore-MT, which uses 8KB pages by default. Then, we
use the existing statistic measurements within Shore-MT to
see how many records, locks, and pages on average a trans-
action accesses for each benchmark while performing a run
with one worker thread executing transactions.
As expected, Table 4 re-emphasizes the complexity in-
crease from TPC-B to TPC-E. TPC-E has several orders
of magnitude more records per scaling factor compared to
TPC-B and TPC-C, and a much larger database size as
the total number of heap and index pages indicates. TPC-
B only touches one record per table, hence it accesses few
database locks and pages. TPC-C accesses almost ten times
the records TPC-B accesses per transaction in its transac-
tion mix, increasing the number of locks and database pages
it accesses as well. Finally, TPC-E performs around four
times the record accesses of TPC-C, which is also reflected
in the higher number of row-level locks it has to acquire.
However, the total number of locks acquired does not in-
crease accordingly since Shore-MT escalates to higher-level
locking from row-level locking when a single transaction ac-
cesses more than a threshold of records (the default value is
twenty-five in Shore-MT).
Table 4 reports two values for the average number of pages
accessed in a transaction; the unique number of pages ac-
cessed and the total number of pages accessed, which is also
the number of times a page is requested from the buffer
pool. Such a separation reveals that even though TPC-E
accesses more than twice the index pages TPC-C does, the
number of unique index page accesses is the same for both
workloads. The main reason for this is TPC-E’s extensive
Table 4: High-level statistics of each benchmark per
scaling factor 1
TPC-B TPC-C TPC-E
# records 10K 600K 117M
# heap pages 147 12K 1M
# index pages 91 6K 1M
Average per xct
# records accessed 4 36 149
# row-level locks 10 54 171
# higher-level locks 10 36 69
# heap pages accessed (U) 4 23 40
# index pages accessed (U) 4 33 33
# heap pages accessed 7 49 125
# index pages accessed 4 90 211
K: thousand, M: million, U: unique
index scans. TPC-C does not re-access most of the index
pages it touches, while TPC-E does this very frequently for
the index leaf pages during its index scans; it sequentially
reads an index leaf page and hence frequently reuses that
page. This results in TPC-E exhibiting lower L1 data cache
miss rates as Section 6.1.3 and Section 6.2 show.
5.2 Time breakdown
To get accurate time breakdowns within the storage man-
ager, we use DTrace [7] on the SPARC machine. Figure 1
presents the results of the profiling as we increase the ma-
chine utilization, i.e., as we run more clients in the system.
Figure 1 highlights that the lock manager is one of the
components the OLTP benchmarks spend most of their time
in within a shared-everything database management system.
The lock manager becomes the main bottleneck especially
for TPC-E, making it unable to utilize more than eight hard-
ware contexts on this machine, while both TPC-B and TPC-
C are able to almost fully utilize the machine with smaller
database sizes.
Logging is the next problematic component for TPC-B
and TPC-C. It becomes, however, less significant as we in-
crease the system utilization since we adopt the logging
optimizations of [20] that benefit from combining logging
requests as the number of clients in the system increases.
Btree and BPool (buffer-pool) come after Locking and Log-
ging, since a transaction’s execution is highly dependent on
0%
20%
40%
60%
80%
100%
4 16 48 4 16 48 4 16 48
TPC-B TPC-C TPC-E
Lock Manager Breakdown
#Clients
Lock-PC Lock-LC Lock
Figure 2: Time breakdown in the lock manager as
the machine load increases on UltraSPARC T2.
its index operations. The rest of the major components
of a storage manager are Catalog (metadata manager), SM
(storage manager API functionality), Xct Mgr (transaction
manager), and Latching; in which none of the workloads
spends a major part of their execution time.
Figure 2 focuses on the time spent inside the lock manager
and shows the time breakdown of sub-categories: Physical
lock contention, Lock-PC, represents the time spent while
waiting to acquire the element guarding a particular record
or table lock. Logical lock contention, Lock-LC, represents
the time spent until a record or table lock is granted after
the lock request is appended to the list of requests for this
lock. Finally, locking, Lock, is the time spent on performing
the locking operation aside from the waiting time.
TPC-E mainly suffers from logical lock contention (Lock-
LC) even though we use a larger database size for it com-
pared to TPC-B and TPC-C. There are three main rea-
sons for this outcome: (1) TPC-E observes data and access
skew, turning some of the data regions into hotspots (e.g.,
Last_Trade table); (2) TPC-E transactions acquire on aver-
age more locks since they access a larger number of database
records; and (3) TPC-E transactions hold the locks they ac-
quire for a longer duration since they are more complicated,
longer running, and scan-heavy transactions. TPC-B and
TPC-C, on the other hand, do not suffer from logical lock
contention since the system can properly spread the requests
and SLI [18] prevents physical lock contention from becom-
ing problematic, leaving only the actual locking operation as
the main time-consuming component within the lock man-
ager.
However, as we will see in Table 5, the lock contention
is not as problematic when we run TPC-E on the Intel
machine, which has faster processors than the SPARC ma-
chine. The faster the processor, the faster the lock acqui-
sitions and releases are, and hence, the less time is spent
on lock contention. We come across this fact also when
we run TPC-B. When two threads want to access the same
Branch in a TPC-B database, they first acquire a read lock
on the wanted Branch during the index probe according to
ARIES/IM [26] (the default concurrency control scheme in
Shore-MT). Later, when they want to upgrade their read
locks to exclusive ones to update the Branch, they both wait
for each other and they deadlock. While on the SPARC
machine we observe such deadlocks, TPC-B runs without
deadlocks on the Intel machine due to faster locking oper-
Table 5: Number of worker threads used for each
benchmark on the two machines
Server UltraSPARC T2 Intel Xeon X5660
No HT HT
TPC-B 48 10 18
TPC-C 60 10 18
TPC-E 4 12 24
ations. Switching to ARIES/KVL [25], which has stricter
concurrency control rules than ARIES/IM, makes this type
of deadlocks disappear on the SPARC machine as well.
6. MICRO-ARCHITECTURAL ANALYSIS
While performing a micro-architectural analysis for the
OLTP benchmarks, we try to answer the following questions:
(1) Where do CPU cycles go on different types of modern
hardware? Are they wasted on memory stalls or used to
retire an instruction?, (2) Do stalls happen mainly due to
instructions or data?, (3) How important are the instruction
and data miss rates?, (4) How much instruction-level (ILP)
and memory-level (MLP) parallelism do OLTP benchmarks
exhibit?, and (5) What is the effect of simultaneous multi-
threading (or hyper-threading)?
All the numbers reported in this section were obtained
when the workloads have their peak performance on the cor-
responding server with their optimal configuration on Shore-
MT. Table 5 shows the number of worker threads execut-
ing transactions in the system when the peak throughput
is achieved for each workload on each server. Adding more
worker threads to the system on top of the numbers reported
in Table 5 causes degradation in throughput, either due to
contention on shared records and storage manager objects
or over-saturation of the machine being used.
6.1 OLTP on an out-of-order processor
This section presents micro-architectural results from the
Intel Xeon X5660 processors. We use VTune [17], which
provides an API to ease the use of the hardware counters
on this machine. We emphasize that the execution time
breakdown on a superscalar out-of-order (OoO) processor
cannot be precise due to overlapping of different execution
components [12]. However, considering the low IPC of the
workloads we are experimenting with (Section 6.1.4), we can
assume that not much of work is overlapped. Nevertheless,
we draw the execution cycles that can be overlapped side-
by-side rather than on top of each other.
Intel Xeon X5660 processors support hyper-threading, run-
ning two hardware contexts on one core at the same time.
The goal of hyper-threading is to overlap the stall time of
one thread with the execution of another. In the following
subsections, for each experiment we present results when
hyper-threading is disabled and when it is enabled.
6.1.1 Execution time breakdown
Figure 3 shows the breakdown of the execution cycles into
busy and stall time for the three benchmarks. We count the
cycles in which at least one instruction is retired as busy and
where no instruction is retired as stalled.
In Figure 3, we see that more than half of the execu-
tion time is spent on stalls for all of the OLTP benchmarks.
While TPC-B and TPC-C show very similar behavior in
0%
20%
40%
60%
80%
100%
No HT HT No HT HT No HT HT
TPC-B TPC-C TPC-E
Execution Time Breakdown
Stalled (App) Stalled (OS) Busy (App) Busy (OS)
Figure 3: Execution time breakdown for three OLTP
benchmarks on an OoO processor with and without
hyper-threading.
terms of the percentage of busy and stalled cycles, TPC-E
seems to observe fewer stalled cycles during the overall exe-
cution. This behavior results in a higher IPC value for TPC-
E (see Section 6.1.4). As expected, when hyper-threading is
enabled, the stalled cycles increase in the overall execution
time since two threads instead of one share the private L1
and L2 caches, evicting each other’s data and instructions
from the cache, thus, causing more cache misses.
Figure 3 also breaks the execution time into time spent on
the operating system operations (OS ) and application itself
(App); and it demonstrates that for our configuration, the
OS does not contribute much to the overall execution time.
6.1.2 Core stalls
As presented in the previous section, stalls dominate the
total execution time of OLTP benchmarks. The estimated
breakdown of these stalls into resource, which also includes
data, and instruction stalls are given in Figure 4. We ac-
count resource stalls within a core, mainly stemming from
the re-order buffer (ROB) being full, as backend/resource
stalls while the remaining stalls as frontend/instruction stalls.
We, again, separate OS and application stalls even though
OS does not contribute significantly to the total stall time.
As Figure 4 demonstrates, the main cause of core stalls
is the frontend stalls for the OLTP benchmarks. In other
words, a core spends most of its execution cycles waiting
for instructions, since it cannot find them in its private L1
instruction cache. The percentage of the frontend stalls is
higher for TPC-E compared to both TPC-B and TPC-C.
We link this behavior to lower data miss rate of TPC-E (see
Section 6.1.3), which increases the percentage of stalls for
instructions.
In addition, hyper-threading increases the percentage of
the backend stalls. Two threads sharing the resources of
one core with hyper-threading can increase the hit rate of
the instruction cache more than the data cache, because
transactions tend to share more instructions than data [4].
6.1.3 Data and instruction misses
Figure 5 shows the number of misses per k-instructions on
the left-hand side and the estimated number of cycles spent
on these misses on the right-hand side. As we mentioned
before, we demonstrate the cycles spent on various cache
misses side-by-side rather than on top of each other because
0%
20%
40%
60%
80%
100%
No HT HT No HT HT No HT HT
TPC-B TPC-C TPC-E
Core Stalls Breakdown
Frontend (Instruction) - App
Frontend (Instruction) - OS
Backend (Resource) - App Backend (Resource) - OS
Figure 4: Core stalls breakdown for three OLTP
benchmarks on an OoO processor with and without
hyper-threading.
of the unknown overlapping cycles for these misses. We
categorize the cache misses as L1 instruction cache misses
(L1I ), L2 instruction misses (L2I ), L1 data cache misses
(L1D), L2 data misses (L2D ), and L3 or last-level cache
misses (LLC ). For stall cycles due to cache misses, we use
the expected penalty for that particular miss on the machine
being used. For LLC misses, we average the penalty for
going to local memory and remote memory.
What we observe is that L1 instruction cache misses domi-
nate both the total number of misses and the total number of
cycles spent on those misses for all of the OLTP benchmarks.
As mentioned in Section 6.1.1, enabling hyper-threading in-
creases the total number of misses in general due to more
threads sharing the shared cache resources.
TPC-E exhibits 35% fewer data misses and almost the
same number of instruction misses, regardless of its longer
running and more complicated transactions. Since it per-
forms more scan operations, TPC-E can reuse the cache lines
for data and instructions it needs more often.
6.1.4 Instruction- and memory-level parallelism
Finally, Figure 6 shows how many instructions per cycle
(IPC) these OLTP benchmarks can execute per core on the
left-hand side and how many long-latency misses (L2 miss)
can be overlapped (MLP) on the right-hand side.
An Intel Xeon X5660 processor has the ability to retire
four instructions per cycle. However, by looking at Figure 6,
we see that OLTP benchmarks can hardly retire even one
instruction per cycle even though enabling hyper-threading
provides some benefit. Overall, as the complexity of the
benchmark increases, going from TPC-B to TPC-E, the IPC
also increases. As we also mentioned in Section 6.1.2, it is
expected that TPC-E has a higher IPC value since it spends
less of its execution time on stall cycles compared to the
other two workloads. Higher IPC stems from TPC-E ob-
serving fewer L1 data misses (Section 6.1.3) because of its
frequent scan operations.
From the MLP values given in Figure 6, we also con-
clude that OLTP benchmarks do not exhibit high MLP.
Even though there are 48-entry load-store queues in this
processor, OLTP benchmarks do not have more than 2.7
outstanding long-latency misses even when hyper-threading
is enabled. While TPC-B and TPC-C observe very similar
MLP values, TPC-E exhibits less memory-level parallelism.
0
20
40
60
80
100
No HT HT No HT HT No HT HT
TPC-B TPC-C TPC-E
#Misses per k-instructions
LLC L2D L1D L2I L1I
0
50
100
150
200
250
300
350
400
450
No HT HT No HT HT No HT HT
TPC-B TPC-C TPC-E
#Cycles for Misses per k-
Instructions (Estimates)
Figure 5: Number of misses per k-instructions for three OLTP benchmarks on an OoO processor with and
without hyper-threading and the estimated number of cycles spent on these misses.
0
0.2
0.4
0.6
0.8
1
1.2
TPC-B TPC-C TPC-E
IPC
No HT
HT
0.0
0.4
0.8
1.2
1.6
2.0
2.4
2.8
TPC-B TPC-C TPC-E
MLP
Figure 6: Instructions committed per cycle and
memory-level parallelism on an OoO processor with
and without hyper-threading.
6.2 OLTP on an in-order processor
This section presents micro-architectural results from the
Sun UltraSPARC T5220 server. We used the hardware coun-
ters on this machine through the cputrack command [27],
which allows us to count various types of cache misses and
number of instructions executed by each thread.
UltraSPARC T2 is an in-order processor that supports
simultaneous multi-threading. A core provides support for
eight hardware contexts and collocates two hardware con-
texts in the pipeline in one cycle. Therefore, each of these
hardware contexts uses one cycle in every four cycles, aiming
to overlap the stall time of other hardware contexts.
Figure 7 shows the number of misses per k-instructions on
the left-hand side and the estimated number of cycles spent
on these misses on the right-hand side as in Figure 5. On
this processor, we also cannot infer the overlapped opera-
tions and, as in Figure 5, we draw the execution cycles that
can be overlapped side-by-side rather than on top of each
other. We report L1 instruction cache misses (L1I ), L2 in-
struction misses (L2I ), L1 data cache misses (L1D), and L2
data misses (L2D). For stall cycles due to misses, we use the
expected penalty for that particular miss on this machine.
Similar to the Intel machine, the main source of misses
and stall cycles are also L1 instruction cache misses as Fig-
ure 7 shows. On the other hand, the last-level cache (L2)
maintains almost all of the instructions for these workloads
running on Shore-MT. Due to having smaller L1 data caches
and more hardware contexts using the same private L1 cache
in a core, L1 data cache misses contribute to a bigger portion
of the total stall cycles compared to the Intel machine.
The comparison among the three benchmarks in terms of
misses look similar to the comparison we have on the Intel
machine (Figure 5). The instruction miss numbers are very
close to each other for all the workloads and TPC-E has 50%
fewer data misses compared to TPC-B and TPC-C.
Figure 8 shows the IPC values for the three OLTP bench-
marks running on UltraSPARC T2. Considering that this is
an in-order machine, being able to execute instructions from
two hardware contexts in a cycle, the IPC being higher than
one shows a more effective use of the hardware resources
compared to the Intel machine. While, on the Intel ma-
chine, OLTP benchmarks can hardly leverage less than half
of the instruction issue width, on SPARC, they can utilize
more than half of it.
7. DISCUSSION
In this section we summarize the highlights of our exper-
imental study and discuss the optimal ways of executing
OLTP benchmarks, mainly focusing on TPC-E.
Looking at the high-level description and statistics for
each benchmark, we see that with each new OLTP bench-
mark standardized by TPC, we have a significant increase
in complexity compared to the previous ones. Moreover,
observing our time breakdown results from Section 5.2 and
previous studies [18, 20, 29, 30], each benchmark stresses dif-
ferent parts of the storage manager in different ways. How-
ever, regardless of these differences, micro-architecturally, all
the OLTP benchmarks that exist today observe very similar
behavior (Section 6).
As our micro-architectural analysis show, TPC-E a has
higher IPC, observes lower miss rates, and spends less of
its execution time on memory stalls compared to TPC-B
and TPC-C. However, the fact that OLTP benchmarks com-
monly observe low IPC, spend most of their execution time
on memory stalls, and mainly suffer from L1 instruction
cache misses still remains. Going from an aggressive out-
of-order processor to an in-order processor, does not change
the micro-architectural characteristics of the OLTP bench-
marks much. However, we observe that simultaneous multi-
0
20
40
60
80
100
TPC-B TPC-C TPC-E
#Misses per k-instructions
L2D
L1D
L2I
L1I
0
200
400
600
800
1000
1200
TPC-B TPC-C TPC-E
#Cycles for Misses per k-
instructions (Estimated)
Figure 7: Number of misses per k-instructions for three OLTP benchmarks on an in-order processor with
simultaneous multi-threading and the estimated number of cycles spent on these misses.
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
TPC-B TPC-C TPC-E
IPC
Figure 8: Instructions committed per cycle on an in-
order processor with simultaneous multi-threading.
threading (or hyper-threading) helps to overlap the stall
time caused by cache misses to some extent.
By looking at the time TPC-E spends inside the lock man-
ager, the natural choice would be to partition the database
and deploy a shared-nothing design for it. Even though for
TPC-B- and TPC-C-like database schemas, this would work
very well [34], for TPC-E such a design would cause a lot
of distributed transactions. There are two main reasons for
this: (1) Due to its complex schema, not all the TPC-E ta-
bles can be correlated with a single database column like
the Branch ID in TPC-B or Warehouse ID in TPC-C. (2)
The TPC-E transactions access a lot of database records
from various tables and perform frequent index scans by us-
ing secondary indexes. Therefore, it is not clear based on
which columns we should partition TPC-E tables in a way to
minimize distributed transactions when we deploy a shared-
nothing design.
On the other hand, a shared-everything design based on
logical or physiological partitioning like in DORA [29] or
PLP [30], respectively, might be more beneficial especially
for TPC-E-like workloads. Such designs successfully mini-
mize locking and latching overheads within the storage man-
ager and they do not suffer from distributed transactions like
in a shared-nothing design. In addition, optimistic and mul-
tiversion concurrency control schemes [6, 24] may especially
help TPC-E-like read-heavy workloads to improve concur-
rency by avoiding blocking at the time of a potential conflict
and rather lazily performing checks at commit time.
Considering that L1 instruction cache misses dominate
the total number of cache misses, techniques that involve
several cores to execute a transaction or exploit common in-
structions both within the same transaction and across dif-
ferent transactions would be very helpful for OLTP. While
using several cores creates an aggregate L1 instruction cache
capacity for a transaction, being able to reuse common in-
structions reduces the need to re-fetch an instruction over
and over.
Software-side techniques that exploit intra-transaction par-
allelism [11, 29] divide the transactions into smaller actions
and run independent actions in parallel on different nearby
cores. Each action has smaller instruction footprint than the
entire transaction and a higher chance of fitting its instruc-
tions in the L1I cache. On the hardware side, computation
spreading through thread migration [4, 9] both uses multiple
cores to execute a transaction and makes newer transactions
reuse the instructions brought to the L1I cache by the older
transactions without any guidance from the software side. A
more effective solution, however, would be to involve both
software and hardware enhancements to minimize the stall
cycles due to instructions.
8. RELATED WORK
There is a large body of related work on workload char-
acterization for database workloads. Barrosso et al. [5] in-
vestigated the memory system behavior of OLTP and DSS
style workloads using TPC-B and TPC-D [36], respectively,
both on a real machine and with a full-system simulation.
They found that these two types of workloads need different
architectural designs in terms of the memory system. Ran-
ganathan et al. [32] used the same workloads as in [5]. How-
ever, they only focused on the effectiveness of out-of-order
execution on SMPs while running these workloads in a simu-
lation environment. We believe, neither TPC-B nor TPC-D
can be representative of TPC-E since TPC-E has much more
complicated and longer-running transactions than TPC-B
and it is not completely read-only like TPC-D.
Keeton et al. [22] experimented with TPC-C on a 4-way
Pentium Pro SMP machine and performed a similar analysis
to [5, 32]. Although, TPC-C is closer to TPC-E compared
to both TPC-B and TPC-D, it still has major differences
from TPC-E as described in Section 2. Stets et al. [33]
performs a micro-architectural comparison between TPC-B
and TPC-C. We add TPC-E to this comparison and also
analyze what happens within the storage manager.
Ailamaki et al. [2] examined where the time goes on
four different commercial DBMSs with a microbenchmark
to have a finer-grain understanding of the memory system
behavior of multiprocessors. Hardavellas et al. [15] ana-
lyzed OLTP, with TPC-C, and DSS, with TPC-H, on both
in-order and out-of-order machines by using a simulation
environment. Rather than optimizing the hardware for the
workloads, these two papers focused on the implications on
the DBMS side in order to utilize the underlying hardware
more effectively. In our work, we consider both the hardware
and the DBMS design for optimal TPC-E execution.
Johnson et al. [18, 20] and Pandis et al. [29, 30] provide
detailed analysis on where the time goes within the storage
manager for typical OLTP benchmarks. Their main aim was
to highlight components that become scalability bottlenecks
in the existing systems and propose alternative designs that
remove those bottlenecks. In this paper, we also perform
the same analysis with TPC-E and discuss which one of
their techniques can or cannot help TPC-E, and also expose
the bottleneck on L1 instruction misses.
There are a few performance analysis papers that use
TPC-E. For example, [10, 21] use I/O traces of a produc-
tion database server running TPC-E in order to study its
I/O behavior. In [10] the authors compare the I/O behavior
of TPC-C and TPC-E. We do not study the I/O behav-
ior. For our experiments we use memory-resident databases
and focus on the micro-architectural behavior. Ferdman et
al. [13] present a detailed micro-architectural analysis with
many types of workloads on Intel X5670 processors, focus-
ing on the architectural design needs of the scale-out work-
loads. They provide a comparison between the scale-out
workloads and server workloads, like TPC-C and TPC-E.
In our work, we use a very similar methodology while an-
alyzing the OLTP benchmarks micro-architecturally on our
Intel X5660 processors and our high-level conclusions cor-
roborate with their findings. In addition, we perform such a
micro-architectural analysis on different hardware platforms
to understand the behavior when we switch from an in-order
core to an out-of-order one. Moreover, we also demonstrate
which components TPC-E stresses within the storage man-
ager as opposed to a pure micro-architectural study. Atta et
al. [4] propose computation spreading through thread mi-
gration to minimize instruction misses for OLTP workloads.
A part of their study also analyzes the instruction and data
misses of both TPC-C and TPC-E with a trace simulation
study, but not on real hardware. Finally, [23] uses TPC-E
to show that a cluster of “wimpy” (low-power Atom-based)
nodes is not as energy-efficient as a cluster of traditional
server-grade processors (Xeon-based).
9. CONCLUSIONS
In this paper, we present a thorough workload charac-
terization study for TPC-E. We rely on profiling results to
determine where the time goes within the storage manager
while executing TPC-E on top. Furthermore, we use per-
formance counters to investigate the micro-architectural be-
havior on two different camps of modern hardware; aggres-
sive out-of-order and lean in-order. We compare TPC-E
with previous OLTP benchmarks standardized by TPC, the
well-studied TPC-C and the obsolete TPC-B, to better un-
derstand what TPC-E-like workloads need both from the
software and hardware.
Our study shows that TPC-E observes higher IPC but, at
a high-level, has a very similar micro-architectural behavior
to its predecessors; it suffers from a high number of L1 in-
struction cache misses and spends most of its time stalling
on memory accesses. Within the storage manager, TPC-E
stresses the lock manager the most, like its predecessors, al-
though it gets a higher penalty within the lock manager due
to logical lock contention on hot database records.
We believe TPC-E can benefit from the previous design
proposals made for OLTP workloads, both from the hard-
ware side and within the storage manager. Running TPC-E
on less aggressive processors, with few instruction issues,
and processors that have support for SMT increases its IPC
value and leads to a better utilization of micro-architectural
resources. However, we advocate a more fundamental spe-
cialized solution where hardware and software operate to-
gether. Such a design can be based on logical partitioning,
intra-transaction parallelism, and/or computation spreading
to get the best of modern and future hardware for OLTP.
Acknowledgments
We would like to thank all the members of the DIAS and
PARSA laboratories at EPFL, Islam Atta, and Duygu Cey-
lan for their support and feedback throughout this work. We
are very grateful to Onur Kocberber and Rene Mueller for
sharing their expertise on VTune with us. We also thank
the reviewers for their constructive comments and help to
improve this paper. This work was partially supported by
a Sloan research fellowship, NSF grants CCR-0205544, IIS-
0133686, and IIS-0713409, an ESF EurYI award, and Swiss
National Foundation funds.
10. REFERENCES
[1] A. Ailamaki, D. J. DeWitt, and M. D. Hill. Data page
layouts for relational databases on deep memory
hierarchies. VLDB J., 11(3):198–215, 2002.
[2] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A.
Wood. DBMSs on a modern processor: Where does
time go? In VLDB, pages 266–277, 1999.
[3] Anon. et al. A measure of transaction processing
power. Datamation, 31(7), 1985.
[4] I. Atta, P. T¨
oz¨
un, A. Ailamaki, and A. Moshovos.
SLICC: Self-Assembly of Instruction Cache Collectives
for OLTP Workloads. In MICRO, pages 188–198,
2012.
[5] L. A. Barroso, K. Gharachorloo, and E. Bugnion.
Memory system characterization of commercial
workloads. In ISCA, pages 3–14, 1998.
[6] P. A. Bernstein and N. Goodman. Multiversion
concurrency control—theory and algorithms. ACM
TODS, 8(4):465–483, 1983.
[7] B. M. Cantrill et al. Dtrace. Available at
http://dtrace.org.
[8] M. J. Carey, D. J. DeWitt, M. J. Franklin, N. E. Hall,
M. L. McAuliffe, J. F. Naughton, D. T. Schuh, M. H.
Solomon, C. K. Tan, O. G. Tsatalos, S. J. White, and
M. J. Zwilling. Shoring up persistent applications. In
SIGMOD, pages 383–394, 1994.
[9] K. Chakraborty, P. M. Wells, and G. S. Sohi.
Computation spreading: employing hardware
migration to specialize cmp cores on-the-fly. In
ASPLOS, pages 283–292, 2006.
[10] S. Chen, A. Ailamaki, M. Athanassoulis, P. B.
Gibbons, R. Johnson, I. Pandis, and R. Stoica. TPC-E
vs. TPC-C: Characterizing the new TPC-E
benchmark via an I/O comparison study. SIGMOD
Record, 39:5–10, 2010.
[11] C. B. Colohan, A. Ailamaki, J. G. Steffan, and T. C.
Mowry. Optimistic intra-transaction parallelism on
chip multiprocessors. In VLDB, pages 73–84, 2005.
[12] S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E.
Smith. A performance counter architecture for
computing accurate cpi components. In ASPLOS,
pages 175–184, 2006.
[13] M. Ferdman, A. Adileh, O. Kocberber, S. Volos,
M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu,
A. Ailamaki, and B. Falsafi. Clearing the Clouds: A
Study of Emerging Scale-out Workloads on Modern
Hardware. In ASPLOS, pages 37–48, 2012.
[14] G. Graefe, H. Kimura, and H. Kuno. Foster B-trees.
ACM TODS, 37(3):17:1–17:29, 2012.
[15] N. Hardavellas, I. Pandis, R. Johnson, N. Mancheril,
A. Ailamaki, and B. Falsafi. Database servers on chip
multiprocessors: Limitations and opportunities. In
CIDR, pages 79–87, 2007.
[16] IBM. IBM breaks double digit performance barrier
with 10 million transactions per minute, 2010.
Available at http://www-
03.ibm.com/press/us/en/pressrelease/32328.wss.
[17] Intel. Intel VTune Amplifier XE performance profiler.
Available at http://software.intel.com/en-
us/articles/intel-vtune-amplifier-xe/.
[18] R. Johnson, I. Pandis, and A. Ailamaki. Improving
OLTP scalability using speculative lock inheritance.
PVLDB, 2(1):479–489, 2009.
[19] R. Johnson, I. Pandis, N. Hardavellas, A. Ailamaki,
and B. Falsafi. Shore-MT: a scalable storage manager
for the multicore era. In EDBT, pages 24–35, 2009.
[20] R. Johnson, I. Pandis, R. Stoica, M. Athanassoulis,
and A. Ailamaki. Aether: a scalable approach to
logging. PVLDB, 3:681–692, 2010.
[21] S. Kavalanekar, B. Worthington, Q. Zhang, and
V. Sharda. Characterization of storage workload
traces from production windows servers. In IISWC,
pages 119–128, 2008.
[22] K. Keeton, D. A. Patterson, Y. Q. He, R. C. Raphael,
and W. E. Baker. Performance characterization of a
quad Pentium Pro SMP using OLTP workloads. In
ISCA, pages 15–26, 1998.
[23] W. Lang, J. M. Patel, and S. Shankar. Wimpy node
clusters: what about non-wimpy workloads? In
DaMoN, pages 47–55, 2010.
[24] P.-A. Larson, S. Blanas, C. Diaconu, C. Freedman,
J. M. Patel, and M. Zwilling. High-performance
concurrency control mechanisms for main-memory
databases. PVLDB, 5(4), 2011.
[25] C. Mohan. ARIES/KVL: a key-value locking method
for concurrency control of multiaction transactions
operating on B-tree indexes. In VLDB, pages 392–405,
1990.
[26] C. Mohan and F. Levine. ARIES/IM: an efficient and
high concurrency index management method using
write-ahead logging. In SIGMOD, pages 371–380,
1992.
[27] Oracle. cputrack. Available at
http://docs.oracle.com/cd/E19683-01/816-
0210/6m6nb7m6s/index.html.
[28] Oracle. SPARC supercluster with 27 SPARC T3-4
servers demonstrates world record performance on
TPC-C benchmark, 2010. Available at
http://www.oracle.com/us/solutions/performance-
scalability/t3-4-tpc-c-12210-bmark-190934.html.
[29] I. Pandis, R. Johnson, N. Hardavellas, and
A. Ailamaki. Data-oriented transaction execution.
PVLDB, 3(1):928–939, 2010.
[30] I. Pandis, P. T¨
oz¨
un, R. Johnson, and A. Ailamaki.
PLP: page latch-free shared-everything OLTP.
PVLDB, 4(10):610–621, 2011.
[31] D. Porobic, I. Pandis, M. Branco, P. T¨
oz¨
un, and
A. Ailamaki. OLTP on hardware islands. PVLDB,
5(11):1447–1458, 2012.
[32] P. Ranganathan, K. Gharachorloo, S. V. Adve, and
L. A. Barroso. Performance of database workloads on
shared-memory systems with out-of-order processors.
In ASPLOS-VIII, pages 307–318, 1998.
[33] R. Stets, K. Gharachorloo, and L. Barroso. A detailed
comparison of two transaction processing workloads.
In WWC, pages 37–48, 2002.
[34] M. Stonebraker, S. Madden, D. J. Abadi,
S. Harizopoulos, N. Hachem, and P. Helland. The end
of an architectural era: (it’s time for a complete
rewrite). In VLDB, pages 1150–1160, 2007.
[35] TPC. TPC benchmark B standard specification,
revision 2.0, 1994. Available at
http://www.tpc.org/tpcb.
[36] TPC. TPC benchmark D standard specification,
revision 2.1, 1998. Available at
http://www.tpc.org/tpcd.
[37] TPC. TPC benchmark C standard specification,
revision 5.11, 2010. Available at
http://www.tpc.org/tpcc.
[38] TPC. TPC benchmark E standard specification,
revision 1.12.0, 2010. Available at
http://www.tpc.org/tpce.
[39] TPC. TPC benchmark H standard specification,
revision 2.14.3, 2011. Available at
http://www.tpc.org/tpch.
... Additionally, the underlying hardware is important, as, e.g., some DBMSs have been shown to utilize multi-processor or multi-core environments more effectively than others [81]. Intuitively, how well a DBMS can exploit parallelism affects the performance of query execution [76,80]. Ultimately, performance measurement is about gains or losses in percentages, not in, e.g., response times. ...
... In the domain of relational databases, the Transaction Processing Council (TPC) benchmarks [e.g., 37] are perhaps the most utilized [30,80], and test the throughput of the DBMS with various parameters. For example, the TPC-A benchmark simulates a database of a bank with four tables and with one transaction, Heap IEEE Xplore ("Abstract":performance AND "Abstract":comparison AND ("Abstract":database OR "Abstract":dbms)) ScienceDirect ...
... Title, abstract, keywords: performance AND comparison AND (database OR dbms) Google Scholar database performance comparison the TPC-B benchmark a database of a wholesale supplier with nine tables and with five transactions, and the TPC-E benchmark a brokerage database with 33 tables and 12 transactions. All these benchmarks have the option for simulating strong consistency, and while TPC-A and TPC-B have transactions typical for transaction processing, TPC-E includes also decision support transactions [80]. TPC-A simulates human end-user thinking by waiting between transactions, as a human arguably would wait between clicks in an online bank. ...
Preprint
Full-text available
Efficiency has been a pivotal aspect of the software industry since its inception, as a system that serves the end-user fast, and the service provider cost-efficiently benefits all parties. A database management system (DBMS) is an integral part of effectively all software systems, and therefore it is logical that different studies have compared the performance of different DBMSs in hopes of finding the most efficient one. This survey systematically synthesizes the results and approaches of studies that compare DBMS performance and provides recommendations for industry and research. The results show that performance is usually tested in a way that does not reflect real-world use cases, and that tests are typically reported in insufficient detail for replication or for drawing conclusions from the stated results.
... A prior data loading performance study focuses on relational databases [4] and does not consider spatial data and systems. Finally, existing micro-architectural studies do not consider spatial workloads either [22,24,25,28]. They range from investigating the behavior of online transaction processing (OLTP) [25,28], to online analytical processing (OLAP) [24] and graph [22] workloads. ...
... Finally, existing micro-architectural studies do not consider spatial workloads either [22,24,25,28]. They range from investigating the behavior of online transaction processing (OLTP) [25,28], to online analytical processing (OLAP) [24] and graph [22] workloads. This paper presents a comprehensive analysis of geospatial data loading with the following goals: 1) analyze how different file formats impact data loading, 2) compare the loading throughput of different spatial libraries, 3) understand the stress that geospatial data loading imposes on processor and caches across different file formats and libraries, and 4) provide insights to data scientists for optimizing their pipeline and avoiding loading bottlenecks. ...
Conference Paper
Full-text available
The rate at which applications gather geospatial data today has turned data loading into a critical component of data analysis pipelines. However, users are confronted with multiple file formats for storing geospatial data and an array of systems for processing it. To shed light on how the choice of file format and system affects performance, this paper explores the performance of loading geospatial data stored in diverse file formats using different libraries. It aims to study the impact of different file formats, compare loading throughput across spatial libraries, and examine the micro-architectural behavior of geospatial data loading. Our findings show that GeoParquet files provide the highest loading throughput across all benchmarked libraries. Furthermore, we note that the more spatial features per byte a file format can store, the higher the data loading throughput. Our micro-architectural analysis reveals high instructions per cycle (IPC) during spatial data loading for most libraries and formats. Additionally, our experiments show that instruction misses dominate L1 cache misses, except for GeoParquet files, where data misses take over.
Article
Full-text available
Efficiency has been a pivotal aspect of the software industry since its inception, as a system that serves the end-user fast, and the service provider cost-efficiently benefits all parties. A database management system (DBMS) is an integral part of effectively all software systems, and therefore it is logical that different studies have compared the performance of different DBMSs in hopes of finding the most efficient one. This study systematically synthesizes the results and approaches of studies that compare DBMS performance and provides recommendations for industry and research. The results show that performance is usually tested in a way that does not reflect real-world use cases, and that tests are typically reported in insufficient detail for replication or for drawing conclusions from the stated results.
Article
In this paper, we revisit the classical data management problem of page replacement. We propose Write-Aware Timestamp Tracking (WATT), a novel replacement algorithm that is optimized for modern hardware. By explicitly tracking the access history of each cached page, WATT achieves state-of-the-art replacement effectiveness. WATT is also carefully co-designed with modern multi-core CPUs and can be implemented with very low overhead. Finally, WATT allows trading of read versus write I/O operations, which is useful for prolonging flash SSD lifetime.
Article
With the rapid development of distributed transactional databases in recent years, there is an urgent need for fair performance evaluation and comparison. Though there are various open-source benchmarks built for databases, it is lack of a comprehensive study about the applicability for distributed transactional databases. This paper presents a review of the state-of-art benchmarks with respect to distributed transactional databases. We first summarize the representative architectures of distributed transactional databases and then provide an overview about the chock points in distributed transactional databases. Then, we classify the classic transactional benchmarks based on their characteristics and design purposes. Finally, we review these benchmarks from schema and data definition, workload generation, and evaluation and metrics to check whether they are still applicable to distributed transactional databases with respect to the chock points. This paper exposes a potential research direction to motivate future benchmark designs in the area of distributed transactional databases.
Method
Full-text available
The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions. The performance metric reported by TPC-H is called the TPC-H Composite Query-per-Hour Performance Metric (QphH@Size), and reflects multiple aspects of the capability of the system to process queries. These aspects include the selected database size against which the queries are executed, the query processing power when queries are submitted by a single stream, and the query throughput when queries are submitted by multiple concurrent users. The TPC-H Price/Performance metric is expressed as $/QphH@Size. More Information: http://www.tpc.org/tpch/default.asp
Conference Paper
Full-text available
While hardware technology has undergone major advancements over the past decade, transaction processing systems have remained largely unchanged. The number of cores on a chip grows exponentially, following Moore's Law, allowing for an ever- increasing number of transactions to execute in parallel. As the number of concurrently-executing transactions increases, contended critical sections become scalability burdens. In typical transaction processing systems the centralized lock manager is often the first contended component and scalability bottleneck. In this paper, we identify the conventional thread-to- transaction assignment policy as the primary cause of contention. Then, we design DORA, a system that decomposes each transaction to smaller actions and assigns actions to threads based on which data each action is about to access. DORA's design allows each thread to mostly access thread-local data structures, minimizing interaction with the contention-prone centralized lock manager. Built on top of a conventional storage engine, DORA maintains all the ACID properties. Evaluation of a prototype implementation of DORA on a multicore system demonstrates that DORA attains up to 4.8x higher throughput than a state-of-the-art storage engine when running a variety of synthetic and real-world OLTP workloads. 2010 VLDB Endowment.
Article
Scaling the performance of shared-everything on-line transaction processing to highly-parallel multicore hardware remains a great challenge for database system designers. Developments in OLTP technology remove locking and logging from being scalability bottlenecks on such systems, leaving page latching as the next potential problem. To tackle the page latching problem, we design a system around physiological partitioning (PLP). The PLP design applies logical-only partitioning, maintaining the desired properties of shared-everything designs, and introduces a multi-rooted B+Tree index structure (MRBTree) which allows us to partition the accesses at the physical page level. That is, logical partitioning, along with MRBTrees ensure that all accesses to a given index page come from a single thread and, hence, can be entirely latch-free. We extend the design to make heap page accesses thread-private as well. The elimination of page latching allows us to simplify key code paths in the system such as B+Tree operations leading to more efficient yet easier maintainable code. The profiling of a prototype PLP system shows that it acquires 85% and 68% fewer contentious critical sections per transaction than an optimized conventional design and one based on logical-only partitioning respectively. As a result the PLP prototype improves performance by up to 40% and 18% over the two systems on two multicore machines.
Article
Database applications such as online transaction processing (OLTP) and decision support systems (DSS) constitute the largest and fastest-growing segment of the market for multiprocessor servers. However, most current system designs have been optimized to perform well on scientific and engineering workloads. Given the radically different behavior of database workloads (especially OLTP), it is important to re-evaluate key system design decisions in the context of this important class of applications.This paper examines the behavior of database workloads on shared-memory multiprocessors with aggressive out-of-order processors, and considers simple optimizations that can provide further performance improvements. Our study is based on detailed simulations of the Oracle commercial database engine. The results show that the combination of out-of-order execution and multiple instruction issue is indeed effective in improving performance of database workloads, providing gains of 1.5 and 2.6 times over an in-order single-issue processor for OLTP and DSS, respectively. In addition, speculative techniques enable optimized implementations of memory consistency models that significantly improve the performance of stricter consistency models, bringing the performance to within 10--15% of the performance of more relaxed models.The second part of our study focuses on the more challenging OLTP workload. We show that an instruction stream buffer is effective in reducing the remaining instruction stalls in OLTP, providing a 17% reduction in execution time (approaching a perfect instruction cache to within 15%). Furthermore, our characterization shows that a large fraction of the data communication misses in OLTP exhibit migratory behavior; our preliminary results show that software prefetch and writeback/flush hints can be used for this data to further reduce execution time by 12%.
Article
In canonical parallel processing, the operating system (OS) assigns a processing core to a single thread from a multithreaded server application. Since different threads from the same application often carry out similar computation, albeit at different times, we observe extensive code reuse among different processors, causing redundancy (e.g., in our server workloads, 45-65% of all instruction blocks are accessed by all processors). Moreover, largely independent fragments of computation compete for the same private resources causing destructive interference. Together, this redundancy and interference lead to poor utilization of private microarchitecture resources such as caches and branch predictors.We present Computation Spreading (CSP), which employs hardware migration to distribute a thread's dissimilar fragments of computation across the multiple processing cores of a chip multiprocessor (CMP), while grouping similar computation fragments from different threads together. This paper focuses on a specific example of CSP for OS intensive server applications: separating application level (user) computation from the OS calls it makes.When performing CSP, each core becomes temporally specialized to execute certain computation fragments, and the same core is repeatedly used for such fragments. We examine two specific thread assignment policies for CSP, and show that these policies, across four server workloads, are able to reduce instruction misses in private L2 caches by 27-58%, private L2 load misses by 0-19%, and branch mispredictions by 9-25%.
Article
This paper provides a comprehensive treatment of index management in transaction systems. We present a method, called ARIESIIM 1992, for concurrency control and recovery of B+-trees. ARIES/IM guarantees serializability and uses write-ahead logging for recovery. It supports very high concurrency and good performance by (1) treating as the lock of a key the same lock as the one on the corresponding record data in a data page (e.g., at the record level), (2) not acquiring, in the interest of permitting very high concurrency, commit duration locks on index pages even during index structure modification operations (SMOs) like page splits and page deletions, and (3) allowing retrievals, inserts, and deletes to go on concurrently with SMOs. During restart recovery, any necessary redos of index changes are always performed in a page-oriented fashion (i.e., without traversing the index tree) and, during normal processing and restart recovery, whenever possible undos are performed in a page-oriented fashion. ARIES/IM permits different granularities of locking to be supported in a flexible manner. A subset of ARIES/IM has been implemented in the OS/2 Extended Edition Database Manager. Since the locking ideas of ARIES/IM have general applicability, some of them have also been implemented in SQL/DS and the VM Shared File System, even though those systems use the shadow-page technique for recovery.
Article
Scaling the performance of shared-everything transaction processing systems to highly-parallel multicore hardware remains a challenge for database system designers. Recent proposals alleviate locking and logging bottlenecks in the system, leaving page latching as the next potential problem. To tackle the page latching problem, we propose physiological partitioning (PLP). The PLP design applies logical-only partitioning, maintaining the desired properties of sharedeverything designs, and introduces a multi-rooted B+Tree index structure (MRBTree) which enables the partitioning of the accesses at the physical page level. Logical partitioning and MRBTrees together ensure that all accesses to a given index page come from a single thread and, hence, can be entirely latch-free; an extended design makes heap page accesses thread-private as well. Eliminating page latching allows us to simplify key code paths in the system such as B+Tree operations leading to more efficient and maintainable code. Profiling a prototype PLP system running on different multicore machines shows that it acquires 85% and 68% fewer contentious critical sections, respectively, than an optimized conventional design and one based on logical-only partitioning. PLP also improves performance up to 40% and 18%, respectively, over the existing systems.
Conference Paper
Online transaction processing (OLTP) is at the core of many data center applications. OLTP workloads are known to have large instruction footprints that foil existing L1 instruction caches resulting in poor overall performance. Prefetching can reduce the impact of such instruction cache miss stalls, however, state-of-the-art solutions require large dedicated hardware tables on the order of 40KB in size. SLICC is a programmer transparent, low cost technique to minimize instruction cache misses when executing OLTP workloads. SLICC migrates threads, spreading their instruction footprint over several L1 caches. It exploits repetition within and across transactions, where a transaction's first iteration prefetches the instructions for subsequent iterations or similar subsequent transactions. SLICC reduces instruction misses by 58% on average for TPC-C and TPCE, thereby improving performance by 68%. When compared to a state-of-the-art prefetcher, and notwithstanding the increased storage overheads (42x as compared to SLICC), performance using SLICC is 21% higher for TPC-E and within 2% for TPC-C.
Article
Foster B-trees are a new variant of B-trees that combines advantages of prior B-tree variants optimized for many-core processors and modern memory hierarchies with flash storage and nonvolatile memory. Specific goals include: (i) minimal concurrency control requirements for the data structure, (ii) efficient migration of nodes to new storage locations, and (iii) support for continuous and comprehensive self-testing. Like Blink-trees, Foster B-trees optimize latching without imposing restrictions or specific designs on transactional locking, for example, key range locking. Like write-optimized B-trees, and unlike Blink-trees, Foster B-trees enable large writes on RAID and flash devices as well as wear leveling and efficient defragmentation. Finally, they support continuous and inexpensive yet comprehensive verification of all invariants, including all cross-node invariants of the B-tree structure. An implementation and a performance evaluation show that the Foster B-tree supports high concurrency and high update rates without compromising consistency, correctness, or read performance.