From A to E: Analyzing TPC’s OLTP Benchmarks -- The obsolete, the ubiquitous, the unexplored

March 2013

DOI:10.1145/2452376.2452380

Conference: Proceedings of the 16th International Conference on Extending Database Technology

Authors:

Ippokratis Pandis

Amazon Web Services

Cansu Kaynak

École Polytechnique Fédérale de Lausanne

Djordje Jevdjic

École Polytechnique Fédérale de Lausanne

From A to E: Analyzing TPC’s OLTP Benchmarks

The obsolete, the ubiquitous, the unexplored

Pınar Tözün, Ippokratis Pandis∗, Cansu Kaynak, Djordje Jevdjic, Anastasia Ailamaki

École Polytechnique Fédérale de Lausanne, Lausanne, VD, Switzerland

∗IBM Almaden Research Center, San Jose, CA, USA

ABSTRACT

Introduced in 2007, TPC-E is the OLTP benchmark most recently standardized by TPC. Even though TPC-E has been around for six years, it has not gained the popularity of its predecessor TPC-C: all the published results for TPC-E use a single database vendor's product. TPC-E differs significantly from its predecessors. Some of its distinguishing characteristics are non-uniform input creation, longer-running and more complicated transactions, and more difficult partitioning. These factors slow down the adoption of TPC-E. In turn, there is little knowledge in the community about how TPC-E behaves micro-architecturally and within the database engine.

To shed light on TPC-E, we implement it on top of a scalable open-source database engine, Shore-MT, and perform a workload characterization study, comparing it with the previous, much better known OLTP benchmarks of TPC: TPC-B and TPC-C. In parallel, we study the evolution of the OLTP benchmarks throughout the decades. Our results demonstrate that TPC-E exhibits similar micro-architectural behavior to TPC-B and TPC-C, even though it incurs less stall time and higher instructions per cycle. On the other hand, within the database engine it suffers more from logical lock contention. Therefore, we argue that, on the hardware side, TPC-E needs less aggressive processors, whereas on the software side it can benefit from designs based on intra-transaction parallelism, logical partitioning, and optimistic concurrency control to minimize the effects of lock contention without introducing distributed transactions.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

EDBT/ICDT ’13, March 18 - 22 2013, Genoa, Italy

1. INTRODUCTION

For the past decades, the data management ecosystem, and in turn the database and hardware markets, have evolved primarily around two applications: online transaction processing (OLTP) and online analytical processing (OLAP). Transaction processing benchmarks are the gold standard for comparing products by different database and hardware vendors, and are regularly used for marketing purposes [16, 28]. For the last two decades, TPC-C [37] has been the most widely used OLTP benchmark in both industry and academia. TPC-C consists of simple short-running transactions with frequent updates and less frequent index scans. On the other hand, the benchmark of choice for OLAP workloads is TPC-H [39]. TPC-H consists of more complicated long-running read-only queries with frequent index and file scans. The data management stacks, from the database down to hardware, are typically optimized for these two extreme benchmarks.

In order to represent OLTP workloads more realistically,

the Transaction Processing Performance Council (TPC) in-

troduced the TPC-E benchmark [38] in 2007. TPC-E is

an OLTP workload that includes transactions for real-time

business intelligence combined with client-side requests. It

acts in between a typical OLTP and an OLAP benchmark.

The design decision for TPC-E was to create a sophisticated

OLTP benchmark, having more complicated and longer trans-

actions when compared to TPC-C, relying on the extensive

use of non-primary indexes, observing data and access skew,

applying integrity and referential constraints, and being less

amenable to partitioning.

Both industry and academia have been slow to adopt TPC-E. For example, even though the benchmark was standardized six years ago, all of the published results for TPC-E use the same database product (Microsoft SQL Server). Due to TPC-E's significant differences from the other benchmarks, it is not easy to extrapolate how systems perform when they run TPC-E (and TPC-E-like applications).

Existing experimental studies typically use database bench-

marks other than TPC-E. Previous studies of OLTP and

OLAP benchmarks, either micro-architectural [2, 5, 15, 22,

32, 33] or proﬁling [18, 20, 29, 30], provide valuable results.

However, they fall short of explaining the behavior TPC-E

is expected to exhibit. Recent work that analyzes TPC-E

either focuses only on the I/O behavior [10, 21] or reports

micro-architectural results on only one type of machine while

running TPC-E on a commercial RDBMS and treating the

database as a black-box [13]. To date, there is neither an

analysis of the TPC-E benchmark on various hardware plat-

forms nor a comprehensive breakdown of the execution time

with respect to database engine components.

In this paper, we perform a detailed study of TPC-E.

We characterize where it spends time within an open-source

database engine and how it behaves micro-architecturally

on two diﬀerent hardware platforms, one in-order and one

out-of-order machine. In parallel, we compare TPC-E to

the well-known OLTP benchmarks and observe how TPC’s

transactional benchmarks have evolved over the years. Then,

we discuss what kind of changes in database and hardware

systems can be more beneﬁcial for such a workload. The

contributions of our study are as follows:

• Our micro-architectural study demonstrates that TPC-E is actually very similar to the previous OLTP benchmarks in terms of its micro-architectural behavior. It suffers heavily from L1 instruction misses and exhibits low instructions per cycle (IPC); IPC is smaller than one on a machine that can execute four instructions per cycle. Thus, we argue that, on the hardware side, TPC-E-like workloads need less aggressive processors with a lower instruction issue width. In addition, even though simultaneous multi-threading (SMT) hides some of the stalls caused by instruction misses and almost doubles the IPC, we need more effective solutions like intra-transaction parallelism [11, 29] or computation spreading [4, 9] to better utilize modern processor cores.

• Our profiling study reveals that, within the database engine, TPC-E spends 70% more time inside the lock manager compared to both TPC-B and TPC-C, in a configuration with a database that is orders of magnitude larger. Because of TPC-E's more complicated schema and transactions, physically partitioning a TPC-E database to eliminate its locking overheads is less straightforward: such a design would cause a significant number of distributed transactions. However, TPC-E can benefit from shared-everything designs that aim to minimize locking with logical [29] or physiological partitioning [30], or from systems that rely on optimistic concurrency control [24] to improve system performance.

The rest of the paper is organized as follows. Section 2

brieﬂy describes the previous transaction processing bench-

marks standardized by TPC over the years and details TPC-

E. Section 3 introduces Shore-Kits, which is a suite of OLTP

benchmarks for Shore-MT and Section 4 describes our ex-

perimental methodology. Section 5 and Section 6 present

the proﬁling and micro-architectural analysis respectively

for TPC-E in comparison with TPC-B and TPC-C. Based

on the analysis results, Section 7 discusses possible design

optimizations both for upcoming hardware and storage man-

agers, mainly while running TPC-E on top. Finally, Sec-

tion 8 surveys the related work and Section 9 concludes.

2. EVOLUTION OF OLTP BENCHMARKS

Transaction processing benchmarks are the gold standard

for DBMS performance evaluation and they are frequently

used for marketing purposes. The Transaction Processing

Performance Council (TPC) is a non-proﬁt IT organization

founded to deﬁne database benchmarks and disseminate ob-

jective, veriﬁable performance data to the industry. This

section describes the four important database transaction

processing benchmarks that have been used under the trade-

mark of TPC and highlights how they have evolved over the

years with each new benchmark.

2.1 The obsolete TPC-A and TPC-B

The ﬁrst widely accepted database benchmark was formal-

ized in 1985 [3]. That speciﬁcation included three workloads,

of which the “DebitCredit” stressed the database engine.

The DebitCredit benchmark was an instant success. Soon

database and hardware vendors started reporting extraor-

dinary results, often achieved by removing key constraints

from the speciﬁcation. Therefore, in 1988 a consortium of

analysts and hardware, operating system, and database ven-

dors formed the Transaction Processing Performance Coun-

cil in order to enforce some order in database benchmarking.

Its ﬁrst benchmark speciﬁcation, TPC-A, essentially formal-

ized the DebitCredit benchmark.

TPC-A is straightforward. It models deposits on and withdrawals from random bank accounts, with the associated double-entry accounting, on a database that contains x Branches, 10x Tellers, and 100,000x Accounts. It also captures the entire system, including terminals and network. Transactions usually originate from their “home” Branch, but can go anywhere. Conflicts are possible, requiring the system to recover occasionally from failed transactions. An important aspect of this benchmark is its scaling rule: for a result to be valid, the database size must be proportional to the reported throughput.
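The table ratios and the scaling rule described above can be illustrated with a minimal Python sketch. The 1 : 10 : 100,000 ratios come from the text; the names `TpcaCardinalities`, `tpca_cardinalities`, and `min_branches_for`, as well as the proportionality constant `tps_per_branch`, are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TpcaCardinalities:
    branches: int
    tellers: int
    accounts: int


def tpca_cardinalities(x: int) -> TpcaCardinalities:
    """Table sizes for a database with x Branches (ratios taken from the text)."""
    if x < 1:
        raise ValueError("x must be a positive number of Branches")
    return TpcaCardinalities(branches=x, tellers=10 * x, accounts=100_000 * x)


def min_branches_for(reported_tps: float, tps_per_branch: float = 1.0) -> int:
    """Smallest x that keeps the database size proportional to throughput.

    tps_per_branch is an assumed constant for illustration; the text only
    states that the size must be proportional to the reported throughput.
    """
    return max(1, math.ceil(reported_tps / tps_per_branch))
```

Under these assumptions, a result of 12.5 transactions per second would need at least `min_branches_for(12.5)` Branches, and `tpca_cardinalities` then gives the matching Teller and Account counts.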

Simple though it may be, the TPC-A benchmark highlighted the importance of quantifying the performance and correctness of different systems. Early benchmarking showed vast performance differences among different vendors (400x) and exposed serious bugs that had lurked undiscovered for many years in mature products.

TPC's second benchmark, TPC-B [35], is very similar to TPC-A, but eliminates the network and terminal handling to create a database engine stress test. Like TPC-A, the TPC-B database contains four tables: Branch, Teller, Account, and History. These tables are accessed in a double-entry accounting style as customers make deposits on and withdrawals from various tellers. The benchmark consists of a single transaction, AccountUpdate, which simply updates one record in each of the Branch, Teller, and Account tables while appending a record to the History table. Therefore, it is a very update-heavy transaction that stresses the transaction processing engine, especially the logging and concurrency control modules. Due to the similarities between TPC-A and TPC-B, for the rest of our study we use TPC-B alone.
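To make the shape of AccountUpdate concrete, the following is a minimal sketch using Python's built-in sqlite3 module rather than Shore-MT. The column names and the simplified schema are assumptions for illustration and do not follow the official TPC-B schema; the sketch only shows the three single-record updates plus the History append.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Column names are illustrative assumptions, not the official TPC-B schema.
    conn.executescript(
        """
        CREATE TABLE branch  (b_id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
        CREATE TABLE teller  (t_id INTEGER PRIMARY KEY, b_id INTEGER, balance INTEGER NOT NULL);
        CREATE TABLE account (a_id INTEGER PRIMARY KEY, b_id INTEGER, balance INTEGER NOT NULL);
        CREATE TABLE history (a_id INTEGER, t_id INTEGER, b_id INTEGER, delta INTEGER);
        """
    )


def account_update(conn: sqlite3.Connection, a_id: int, t_id: int, b_id: int, delta: int) -> int:
    """Run one AccountUpdate and return the new account balance."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("UPDATE account SET balance = balance + ? WHERE a_id = ?", (delta, a_id))
        conn.execute("UPDATE teller SET balance = balance + ? WHERE t_id = ?", (delta, t_id))
        conn.execute("UPDATE branch SET balance = balance + ? WHERE b_id = ?", (delta, b_id))
        conn.execute("INSERT INTO history VALUES (?, ?, ?, ?)", (a_id, t_id, b_id, delta))
        row = conn.execute("SELECT balance FROM account WHERE a_id = ?", (a_id,)).fetchone()
    return row[0]
```

Every invocation writes to all four tables, which is why such a transaction mainly exercises logging and concurrency control.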

2.2 The ubiquitous TPC-C

For its third benchmark speciﬁcation, TPC-C [37], TPC

moves away from banking to commerce. TPC-C models an

online transaction processing database for a wholesale sup-

plier. The transactions follow customer orders from initial

creation to ﬁnal delivery and payment.

A TPC-C database consists of nine tables in total where

one of them has ﬁxed size (Fixed), four of them scale propor-

tionally with the number of Warehouses (Scaling), and four

of them might change size, mostly grow, due to insert and

delete operations (Growing). Thereby, compared to TPC-B,

TPC-C oﬀers a more complex database schema; where the

TPC-B schema can be represented as a tree with only four

nodes, the TPC-C schema is a directed acyclic graph with

nine nodes.

Table 1: The TPC-E transactions

Transaction       Weight        Access  Category  Frames Executed   % in Mix
BrokerVolume      Mid to Heavy  RO      BI        1 (out of 1)      4.9
CustomerPosition  Mid to Heavy  RO      CI        2/3 (out of 3)    13
MarketFeed        Medium        RW      MT        1 (out of 1)      1
MarketWatch       Medium        RO      CI        1 (out of 1)      18
SecurityDetail    Medium        RO      CI        1 (out of 1)      14
TradeLookup       Medium        RO      BI/CI     1 (out of 4)      8
TradeOrder        Heavy         RW      CI        2/5/6 (out of 6)  10.1
TradeResult       Heavy         RW      MT        5/6 (out of 6)    10
TradeStatus       Light         RO      CI        1 (out of 1)      19
TradeUpdate       Medium        RW      BI/CI     1 (out of 3)      2

BI: Brokerage Initiated, CI: Customer Initiated, MT: Market Triggered

Like the database schema, the TPC-C transactions are also more complex. The benchmark combines the five transactions listed below in a transaction mix at the frequencies given in parentheses:

• NewOrder (45%) inserts a new sales order into the database. It is a medium-weight transaction with a 1% failure rate due to invalid inputs.
• Payment (43%) is a short transaction, very similar to the AccountUpdate transaction of TPC-B, which makes a payment on an existing order.
• OrderStatus (4%) is a read-only transaction that computes the shipping status and the line items of an order.
• Delivery (4%) is the largest and the most contentious update transaction. It selects the oldest undelivered orders for each warehouse and marks them as delivered.
• StockLevel (4%) is also a read-only transaction. It joins on average 200 order line items with their corresponding stock entries in order to produce a report.
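The TPC-C transaction mix listed above can be sketched as a weighted random choice. This is a simplified illustration: the percentages and the read-only flags come from the list, while the names `TPCC_MIX`, `next_transaction`, and `read_write_share` are hypothetical, and a real driver would also have to honor the response-time and pacing rules of the specification.

```python
import random

# Transaction mix from the list above: name -> (percentage, read_only)
TPCC_MIX = {
    "NewOrder": (45, False),
    "Payment": (43, False),
    "OrderStatus": (4, True),
    "Delivery": (4, False),
    "StockLevel": (4, True),
}


def next_transaction(rng: random.Random) -> str:
    """Draw one transaction type with probability proportional to its share."""
    names = list(TPCC_MIX)
    weights = [TPCC_MIX[name][0] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def read_write_share() -> int:
    """Percentage of the mix made up of update (read-write) transactions."""
    return sum(pct for pct, read_only in TPCC_MIX.values() if not read_only)
```

Summing the shares of the three update transactions gives the read-write share of the mix, 45 + 43 + 4 = 92%.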

The specification also lays out strict requirements about response time, consistency, and recovery in the system, and brings back the testing of an end-to-end system that includes network and terminal handling.

TPC-C stresses the entire stack (database system, oper-

ating system, and hardware) in several ways. First, it mixes

short and long, read-only and update-intensive transactions,

exercising a wider variety of features and situations than

the TPC-B benchmark. In addition, the benchmark has

major hotspots, partly due to the way transactions access

the Warehouse table and partly due to how the Delivery

transaction is designed. The resulting contention and dead-

locks stress the system’s concurrency control mechanisms.

Finally, the database grows throughout the benchmark run;

not just because of the append-only History table as in

TPC-B, but also because of the insert and delete operations

on diﬀerent tables, stressing code paths that the previous

TPC-B benchmark did not reach.

TPC-C has been the most popular OLTP benchmark for over twenty years. Major database vendors have published results on TPC's website, and on several occasions these results have been used for marketing purposes [16, 28].

2.3 The unexplored TPC-E

To represent real-life OLTP workloads more realistically, TPC presented TPC-E [38] as an alternative to the dominant TPC-C. In this subsection, we give an overview of TPC-E while pointing out its differences from TPC-C.

2.3.1 Model

TPC-E models a brokerage house. The database tables

keep information about the customers, brokers, and market.

The transactions simulate a workload where either the cus-

tomers initiate requests to the brokerage house (customer

initiated transactions) or the market sends ticker feeds or

trade results to the brokerage house (market-triggered trans-

actions). The brokerage house responds to the customers,

checks the orders to decide whether to submit them or not,

submits the related brokerage requests (brokerage initiated

transactions), and analyzes or updates the database. One

could say that TPC-E represents a more complicated busi-

ness model compared to TPC-C.

2.3.2 Database

TPC-E has more tables than TPC-C: thirty-three instead of nine. Nine of TPC-E's tables are of Fixed size, sixteen are Scaling based on the number of Customers, and eight are Growing. However, the growth rate of the Growing tables varies and in general it is greater than the growth rates of the Growing tables in TPC-C. In addition, the TPC-E tables are populated with pseudo-real data and exhibit data skew. In contrast, TPC-C tables have randomly generated data that exhibit a low degree of skew.

The scaling factor determines the number of Branches in

TPC-B and the number of Warehouses in TPC-C. TPC-E

has a scaling factor that controls the number of Customers

in the database. But, unlike TPC-B and TPC-C, where a

single scaling factor (via the number of Branches and Ware-

houses) is the only parameter that determines the initial size

of the database, TPC-E has two additional parameters that

aﬀect the initial database size. In particular, the parameters

called working days and scaling factor control the cardinal-

ity of the Trade table and in turn all the other Growing

tables in TPC-E.
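As a rough sketch of how the two extra parameters could feed into the initial cardinality of the Trade table, consider the Python function below. The formula is an assumption that is not stated in the text: it treats customers divided by the scaling factor as a nominal trade rate per second, sustained for eight hours on each initial working day. The function name and the `trading_seconds_per_day` parameter are hypothetical, and the relationship should be checked against the TPC-E specification before relying on it.

```python
def initial_trade_rows(
    customers: int,
    working_days: int = 300,
    scaling_factor: int = 500,
    trading_seconds_per_day: int = 8 * 3600,
) -> int:
    """Estimated number of Trade rows after initial population.

    Assumed relationship (not given in the text): customers / scaling_factor
    is a nominal trade rate per second, applied over working_days days of
    trading_seconds_per_day seconds each.
    """
    if customers <= 0 or working_days <= 0 or scaling_factor <= 0:
        raise ValueError("all parameters must be positive")
    return customers * working_days * trading_seconds_per_day // scaling_factor
```

Under this assumption, doubling the working days doubles the initial Trade cardinality, while doubling the scaling factor halves it, which matches the statement that both parameters affect the initial database size.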

TPC-E also has a Growing table, Trade_Request, that starts as an empty table right after database population and then grows. Neither TPC-B nor TPC-C has an empty table after the initial database population.

2.3.3 Transactions

TPC-E contains twelve transactions in total. The ten transactions that belong to the regular transaction mix are shown in Table 1. The other two, DataMaintenance and TradeCleanup, are executed separately. DataMaintenance is executed periodically, every minute, alongside the transaction mix, whereas TradeCleanup needs to be executed before each run if one wants to clean up the submitted or pending trades from a previous run in order to restore the initial database state. In TPC-C, all five transactions are included in the transaction mix.

Table 2: Evolution of TPC’s OLTP benchmarks

                                      TPC-A     TPC-B     TPC-C               TPC-E
First release                         Nov 1989  Aug 1990  Aug 1992            Feb 2007
Last update                           Jun 1994  Jun 1994  Feb 2010            Jun 2010
Business model                        Banking   Banking   Wholesale supplier  Brokerage house
Tables: Fixed                         0         0         1                   9
Tables: Scaling                       3         3         4                   16
Tables: Growing                       1         1         4                   8
Tables: Total                         4         4         9                   33
Transactions: RW                      1         1         3                   6
Transactions: RO                      0         0         2                   6
Transaction Mix %: RW                 100%      100%      92%                 23.1%
Transaction Mix %: RO                 0%        0%        8%                  76.9%
Transactions using secondary indexes  None      None      2                   10
Data population                       Random    Random    Random              Pseudo-real

The TPC-E transactions consist of frames, which are parts

of a long transaction with a distinctive task. For some trans-

actions only a subset of their frames are executed depending

on the input values or whether they are initiated by a cus-

tomer or brokerage; like in TradeLookup and TradeUpdate.

TPC-C does not contain as complicated and long transac-

tions. All transactions in TPC-C have only one frame.

One significant distinction of TPC-E from its predecessors is that the majority of the transactions in the mix are Read-Only (RO). That is, in TPC-E around 77% of the transactions executed are read-only, whereas TPC-C has 92% Read-Write (RW) transactions in the mix.
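The read-only and read-write shares of the TPC-E mix can be recomputed from the per-transaction percentages of Table 1. The sketch below simply transcribes those values; the names `TPCE_MIX` and `mix_share` are hypothetical.

```python
# (access type, % in mix) per transaction, transcribed from Table 1
TPCE_MIX = {
    "BrokerVolume": ("RO", 4.9),
    "CustomerPosition": ("RO", 13.0),
    "MarketFeed": ("RW", 1.0),
    "MarketWatch": ("RO", 18.0),
    "SecurityDetail": ("RO", 14.0),
    "TradeLookup": ("RO", 8.0),
    "TradeOrder": ("RW", 10.1),
    "TradeResult": ("RW", 10.0),
    "TradeStatus": ("RO", 19.0),
    "TradeUpdate": ("RW", 2.0),
}


def mix_share(access: str) -> float:
    """Total percentage of the mix with the given access type ('RO' or 'RW')."""
    total = sum(pct for kind, pct in TPCE_MIX.values() if kind == access)
    return round(total, 1)  # one decimal place, as in the tables
```

The read-only entries add up to 76.9% and the read-write entries to 23.1%, in agreement with the transaction mix rows of Table 2.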

Another distinction of TPC-E is that its transaction mix

enforces dependencies among some of the transactions. More

speciﬁcally, the market-triggered transactions (TradeResult

and MarketFeed) require TradeOrder transactions to submit

input for them. Therefore, they cannot be executed inde-

pendently from the transaction mix. In TPC-C none of the

transactions have such dependencies.

TPC-E speciﬁcation also introduces skew in transaction

inputs, harness control measures within the transactions,

and checks for referential integrity constraints, which do not

exist in TPC-C. Moreover, for high performance, TPC-E

needs to perform lookups and scans through non-primary

indexes in almost all of its transactions (ten out of twelve),

whereas TPC-C uses secondary indexes in only two of its

transactions.

Overall, TPC-E is a much more sophisticated OLTP benchmark compared to all its predecessors and therefore, it offers a more interesting and mature environment for testing OLTP engines. On the other hand, it is also harder to adopt for people from both industry and academia, who have been optimizing their systems mainly based on TPC-C for the last twenty years.

2.4 The evolution summary

Table 2 summarizes the high-level comparison of the four

OLTP benchmarks of TPC, which we detailed above. What

we can conclude from this section and Table 2 is that with

each benchmark TPC standardized, we see a signiﬁcant com-

plexity increase, which is driven by the facts listed below:

• A more sophisticated business model.
• A larger variety in terms of transaction types.
• Longer-running and less deterministic transactions, causing longer and less predictable instruction streams.
• Increase in the number of read-only transactions that need to be run together with update-heavy ones.
• Increase in the number of scan operations and dependency on the secondary indexes, which in turn makes physical database partitioning less effective.
• More fundamental stress within the storage manager and exploration of an increased number of code-paths.

The above items are going to be crucial while explaining

the behavior of these workloads within a storage manager

and micro-architecturally.

3. SHORE-KITS: BENCHMARKS ON TOP OF SHORE-MT

Shore-MT [19] is an enhanced version of the SHORE storage manager [8], whose micro-architectural behavior is very close to that of commercial systems [1]. Shore-MT adds a multithreaded storage manager kernel to SHORE and was developed specifically to adapt SHORE to the multicore era, mainly by eliminating the scalability bottlenecks that arise when running on multicore hardware. Today, Shore-MT is one of the most scalable open-source shared-everything storage managers within a single database node. It has been used in various research projects as a test-bed both by the team who develops and maintains it [18, 20, 29, 30, 31] and by other well-known teams in the database and computer architecture communities [4, 14].

In order to study the behavior and challenges the standardized OLTP benchmarks pose on modern storage managers, we implement them on top of Shore-MT and distribute them as a suite of database benchmarks, called Shore-Kits. In other words, Shore-Kits (available at https://bitbucket.org/shoremt) is an open-source suite of OLTP benchmarks for the Shore-MT storage manager.

Since Shore-MT does not have an SQL front end, a query

parser, and an optimizer, the benchmarks are implemented

in C++ using direct calls to Shore-MT’s storage manager

API, which is linked as a static library to the executable.

With some programming eﬀort and code refactoring, one

can port Shore-Kits to other storage managers by changing

the API calls to match the target storage manager’s API.

We implemented TPC-E using the query plans taken from

a TPC-E implementation of a major database vendor. As

for the index decisions, we initially adapted the indexes from

the same kit. Later, however, we had to change some of

the indexes in order to optimize performance when running

on top of Shore-MT. For example, Shore-MT’s API allows

Shore-Kits to use only unclustered indexes, whereas the kit

of the commercial database uses clustered ones for the pri-

mary indexes. Therefore, the optimal index decisions varied

between Shore-Kits and the kit of the commercial database.

Due to its large number of tables and longer and more com-

plicated transactions, TPC-E was by far the most diﬃcult

benchmark implemented in Shore-Kits.

TPC-E stresses Shore-MT in ways previous benchmarks do not. It pinpointed code paths that exposed previously undetected bugs and performance bottlenecks, and therefore helped us to further improve Shore-MT. For example, Shore-MT had implementations of both forward and backward index scans, but the backward index scans were disabled because they were causing a large number of deadlocks in some workloads. Debugging and re-enabling backward index scans in Shore-MT improved the performance of TPC-E by three orders of magnitude on an Intel server.

4. EXPERIMENTAL METHODOLOGY

We used two servers for our experiments: (1) a Sun Ul-

traSPARC T5220 server with one socket containing eight

in-order cores, where each core has support for eight hard-

ware contexts and is clocked at 1.4GHz, running Solaris 10,

and (2) a server with two Intel Xeon X5660 processors each

with six out-of-order processor cores running Ubuntu 10.04

with Linux kernel version 2.6.32. Table 3 lists the character-

istics of each processor in detail. The diversity and degree

of hardware parallelism on these systems make them good

candidates for this study to reﬂect the behavior of our work-

loads on various types of modern hardware.

We use memory-resident databases for our experiments

and ﬂush the log to RAM due to not having a suitably fast

I/O sub-system. A conﬁguration that allows I/O in our

infrastructure might cause an unreasonably slow and highly

suboptimal OLTP system, and therefore, unrealistic micro-

architectural conclusions.

On the Intel machine, we experiment with two cases; when

hyper-threading (HT) is oﬀ and when it is on. When hyper-

threading is on, the Intel machine supports two hardware

contexts running at the same time on one core to be able to

overlap the stall time of one of the threads with the execution

of the other. This property is analogous to the simultane-

ous multi-threading (SMT) support in the SPARC machine

where each core has support for eight hardware contexts by

default, which is actually one of the main design principles

of the UltraSPARC T2 architecture.

Table 3: Server Properties

Server               UltraSPARC T2                 Intel Xeon X5660
#Sockets             1                             2
#Cores per Socket    8 (in-order)                  6 (OoO)
#HW Contexts         64                            24
Clock Speed          1.40GHz                       2.80GHz
Memory               64GB                          48GB
L3 (shared)          -                             12MB
  access latency     -                             29 cycles
L2 (shared)          4MB                           -
  access latency     20 cycles                     -
L2 (per core)        -                             256KB
  access latency     -                             6 cycles
L1-I (per core)      16KB                          32KB
  access latency     3 cycles                      4 cycles
L1-D (per core)      8KB                           32KB
  access latency     3 cycles                      4 cycles
OS                   SunOS 5.10 Generic 141414-10  Ubuntu 10.04 with Linux kernel 2.6.32

We chose the best configuration options we determined empirically for all the benchmarks running on top of Shore-MT to make sure that we run them without any obvious scalability bottlenecks and better utilize the hardware resources. In TPC-B we pad the records of the Branch and Teller tables so that a single database page holds only a single record. This minimizes false sharing of database pages and avoids latching contention, which can be a fundamental bottleneck for typical shared-everything architectures [30]. We also enable Speculative Lock Inheritance (SLI) [18] and logging optimizations from Aether [20] to reduce the bottlenecks coming from the lock and log managers, respectively, for the benchmarks that benefit from these techniques.
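The record padding used for the Branch and Teller tables can be illustrated with a small calculation. This is a sketch under stated assumptions: the 8KB page size is the Shore-MT default mentioned in this paper, whereas the 64-byte `page_overhead` and both function names are hypothetical and do not reflect Shore-MT's actual page layout.

```python
PAGE_SIZE = 8 * 1024  # 8KB pages, the Shore-MT default


def records_per_page(record_size: int, page_size: int = PAGE_SIZE, page_overhead: int = 64) -> int:
    """How many fixed-size records fit in the usable part of one page.

    page_overhead (header and slot space) is an assumed, illustrative value.
    """
    return (page_size - page_overhead) // record_size


def padded_size_for_single_record(record_size: int, page_size: int = PAGE_SIZE, page_overhead: int = 64) -> int:
    """Smallest record size at which only one record fits on a page."""
    usable = page_size - page_overhead
    if not 0 < record_size <= usable:
        raise ValueError("record must be positive and fit in one page")
    # A second record no longer fits once a record exceeds half the usable space.
    return max(record_size, usable // 2 + 1)
```

With these assumed numbers, a 100-byte record would share its page with 80 others, while padding it to just over half of the usable page space leaves one record per page, so concurrent updates of different Branch or Teller records latch different pages.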

Furthermore, for TPC-B and TPC-C we spread the re-

quests based on the primary key of the Branch and Ware-

house tables, respectively, to reduce logical lock contention.

In order to do that, we picked scaling factors that are equal

to the number of hardware contexts available on the machine

a speciﬁc experiment is run on, since the scaling factor is

equal to the number of Branches in TPC-B and Warehouses

in TPC-C. In other words, on the Intel machine we picked

a scaling factor of 12 and 24 when hyper-threading is dis-

abled and enabled, respectively, and on the SPARC machine

we picked a scaling factor of 64. Unfortunately, for TPC-E,

it is not straightforward how to spread the requests due to

its more complex schema and transactions that do not have

correlation based on any primary key column for the major-

ity of the database tables. To be able to run an in-memory

database, we picked a database size that contains 1000 cus-

tomers for TPC-E. We set the working days and scaling

factor parameters to 300 and 500, respectively, which are

the default values for these parameters in the TPC-E spec-

iﬁcation.
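The request spreading described above for TPC-B and TPC-C can be sketched as a mapping from worker threads to Branch or Warehouse keys. The scaling factors are the ones reported in the text; the assumption of one worker per hardware context, the modulo assignment, and the names `SCALING_FACTORS` and `home_key` are illustrative and do not describe how Shore-Kits actually assigns requests.

```python
# Scaling factors used in the experiments, one Branch/Warehouse per hardware context
SCALING_FACTORS = {
    "intel_ht_off": 12,
    "intel_ht_on": 24,
    "sparc_t2": 64,
}


def home_key(worker_id: int, scaling_factor: int) -> int:
    """1-based Branch (TPC-B) or Warehouse (TPC-C) id served by a worker.

    Assumes worker ids start at 0; with at most scaling_factor workers,
    no two workers share a key, so their requests touch different
    Branch/Warehouse records.
    """
    if worker_id < 0 or scaling_factor < 1:
        raise ValueError("worker_id must be >= 0 and scaling_factor >= 1")
    return worker_id % scaling_factor + 1
```

When the number of workers equals the scaling factor, every worker gets its own key, which is the property that reduces logical lock contention. No comparable key exists for most TPC-E tables, which is why the same trick does not carry over.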

Before taking any measurements, we start with a newly

populated database, make each worker thread in the system

execute 1000 transactions to warm-up the caches, and then

perform a one-minute run. The tools used to collect the

hardware counter values and proﬁling results during these

runs are mentioned in the related sections.

5. PROFILING ANALYSIS

In order to further understand the high-level characteristics of each benchmark, firstly, we report statistical information collected from the storage manager in Section 5.1. Then, in Section 5.2, our profiling analysis identifies the components of the storage manager each benchmark spends the most time in.

[Figure 1: Time breakdown as the machine load increases on UltraSPARC T2. One panel each for TPC-B, TPC-C, and TPC-E; x-axis: #Clients (4, 8, 16, 32, 48, 60); y-axis: Time Breakdown (%); components: Locking, Latching, Logging, Xct Mgr, BPool, Catalog, Btree, Other.]

5.1 High-level analysis

Table 4 contains the high-level statistics of each bench-

mark to further highlight the changes in complexity with

each OLTP benchmark standardized by TPC. These statis-

tics are independent of the underlying hardware. We chose

a scaling factor of one for each benchmark in this part of

the analysis. This corresponds to one Branch in TPC-B,

one Warehouse in TPC-C, and one-thousand Customers in

TPC-E. For the initial database, we measure the number

of records each benchmark has and how many pages it uses

in Shore-MT, which uses 8KB pages by default. Then, we

use the existing statistic measurements within Shore-MT to

see how many records, locks, and pages on average a trans-

action accesses for each benchmark while performing a run

with one worker thread executing transactions.

As expected, Table 4 re-emphasizes the complexity in-

crease from TPC-B to TPC-E. TPC-E has several orders

of magnitude more records per scaling factor compared to

TPC-B and TPC-C, and a much larger database size as

the total number of heap and index pages indicates. TPC-

B only touches one record per table, hence it accesses few

database locks and pages. TPC-C accesses almost ten times

the records TPC-B accesses per transaction in its transac-

tion mix, increasing the number of locks and database pages

it accesses as well. Finally, TPC-E performs around four

times the record accesses of TPC-C, which is also reﬂected

in the higher number of row-level locks it has to acquire.

However, the total number of locks acquired does not in-

crease accordingly since Shore-MT escalates to higher-level

locking from row-level locking when a single transaction ac-

cesses more than a threshold of records (the default value is

twenty-ﬁve in Shore-MT).
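The effect of escalation on the lock counts can be illustrated with a small sketch. It assumes a single table and a single escalation step to a table-level lock. The actual lock hierarchy and escalation policy of Shore-MT are more involved, and `TransactionLocks` is a hypothetical name. Only the threshold of twenty-five comes from the text.

```python
ESCALATION_THRESHOLD = 25  # default threshold reported in the text


class TransactionLocks:
    """Locks held by one transaction on one table (simplified, hypothetical)."""

    def __init__(self, threshold=ESCALATION_THRESHOLD):
        self.threshold = threshold
        self.row_locks = set()
        self.table_locked = False
        self.acquisitions = 0  # total requests sent to the lock manager

    def lock_row(self, row_id):
        if self.table_locked or row_id in self.row_locks:
            return  # already covered by a lock we hold
        if len(self.row_locks) >= self.threshold:
            # Escalate: one coarse lock replaces all further row-level requests.
            self.table_locked = True
            self.acquisitions += 1
            return
        self.row_locks.add(row_id)
        self.acquisitions += 1
```

Under these assumptions, a transaction that touches 149 distinct rows of one table issues 25 row-level requests plus one escalated request, which is why the number of locks grows more slowly than the number of records accessed.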

Table 4 reports two values for the average number of pages
accessed in a transaction; the unique number of pages accessed
and the total number of pages accessed, which is also the number
of times a page is requested from the buffer pool. Such a
separation reveals that even though TPC-E accesses more than
twice the index pages TPC-C does, the number of unique index
page accesses is the same for both workloads. The main reason
for this is TPC-E's extensive index scans. TPC-C does not
re-access most of the index pages it touches, while TPC-E does
this very frequently for the index leaf pages during its index
scans; it sequentially reads an index leaf page and hence
frequently reuses that page. This results in TPC-E exhibiting
lower L1 data cache miss rates as Section 6.1.3 and Section 6.2 show.

Table 4: High-level statistics of each benchmark per scaling factor 1
(K: thousand, M: million, U: unique)

                               TPC-B    TPC-C    TPC-E
# records                      ∼10K     ∼600K    ∼117M
# heap pages                   147      ∼12K     ∼1M
# index pages                  91       ∼6K      ∼1M
Average per xct
# records accessed             4        36       149
# row-level locks              10       54       171
# higher-level locks           10       36       69
# heap pages accessed (U)      4        23       40
# index pages accessed (U)     4        33       33
# heap pages accessed          7        49       125
# index pages accessed         4        90       211
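The page reuse discussed above can be quantified directly from the per-transaction averages in Table 4 by dividing the total number of page requests by the number of unique pages. The helper name `reuse_factor` is introduced here only for illustration.

```python
# Per-transaction averages copied from Table 4: (total accesses, unique pages).
PAGE_ACCESSES = {
    "TPC-B": {"heap": (7, 4), "index": (4, 4)},
    "TPC-C": {"heap": (49, 23), "index": (90, 33)},
    "TPC-E": {"heap": (125, 40), "index": (211, 33)},
}


def reuse_factor(total, unique):
    """Average number of buffer-pool requests per distinct page."""
    return total / unique


for bench, kinds in PAGE_ACCESSES.items():
    for kind, (total, unique) in kinds.items():
        factor = reuse_factor(total, unique)
        print(f"{bench} {kind}: {factor:.2f} requests per unique page")
```

For index pages this gives roughly 2.7 requests per unique page for TPC-C and 6.4 for TPC-E, which is the reuse attributed to index scans in the text.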

5.2 Time breakdown

To get accurate time breakdowns within the storage man-

ager, we use DTrace [7] on the SPARC machine. Figure 1

presents the results of the proﬁling as we increase the ma-

chine utilization, i.e., as we run more clients in the system.

Figure 1 highlights that the lock manager is one of the

components the OLTP benchmarks spend most of their time

in within a shared-everything database management system.

The lock manager becomes the main bottleneck especially

for TPC-E, making it unable to utilize more than eight hard-

ware contexts on this machine, while both TPC-B and TPC-

C are able to almost fully utilize the machine with smaller

database sizes.

Logging is the next problematic component for TPC-B

and TPC-C. It becomes, however, less signiﬁcant as we in-

crease the system utilization since we adopt the logging

optimizations of [20] that beneﬁt from combining logging

requests as the number of clients in the system increases.

Btree and BPool (buffer-pool) come after Locking and Logging,
since a transaction's execution is highly dependent on its
index operations. The rest of the major components of a storage
manager are Catalog (metadata manager), SM (storage manager API
functionality), Xct Mgr (transaction manager), and Latching, in
none of which the workloads spend a major part of their
execution time.

Figure 2: Time breakdown in the lock manager as the machine load increases on UltraSPARC T2.
[Stacked bars of the lock manager breakdown for TPC-B, TPC-C, and TPC-E at 4, 16, and 48 clients; categories: Lock-PC, Lock-LC, Lock.]

Figure 2 focuses on the time spent inside the lock manager

and shows the time breakdown of sub-categories: Physical

lock contention, Lock-PC, represents the time spent while

waiting to acquire the element guarding a particular record

or table lock. Logical lock contention, Lock-LC, represents

the time spent until a record or table lock is granted after

the lock request is appended to the list of requests for this

lock. Finally, locking, Lock, is the time spent on performing

the locking operation aside from the waiting time.
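The three categories can be made concrete with a small sketch that splits the time of a single lock request. The timestamp fields below are hypothetical instrumentation points chosen to match the definitions above. They are not the actual probes used for Figure 2.

```python
from dataclasses import dataclass


@dataclass
class LockRequestTrace:
    """Hypothetical timestamps (in cycles) for one lock request."""
    start: float           # request enters the lock manager
    guard_acquired: float  # obtained the element guarding the lock
    enqueued: float        # request appended to the lock's request list
    granted: float         # lock granted
    done: float            # lock manager returns to the caller


def breakdown(t):
    """Split one request into Lock-PC, Lock-LC, and Lock time."""
    lock_pc = t.guard_acquired - t.start   # physical contention: waiting on the guard
    lock_lc = t.granted - t.enqueued       # logical contention: waiting for the grant
    lock = (t.done - t.start) - lock_pc - lock_lc  # remaining locking work
    return {"Lock-PC": lock_pc, "Lock-LC": lock_lc, "Lock": lock}
```

Summing these per-request values over a run and normalizing by the total would give the kind of stacked breakdown shown in Figure 2.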

TPC-E mainly suﬀers from logical lock contention (Lock-

LC) even though we use a larger database size for it com-

pared to TPC-B and TPC-C. There are three main rea-

sons for this outcome: (1) TPC-E observes data and access

skew, turning some of the data regions into hotspots (e.g.,

Last_Trade table); (2) TPC-E transactions acquire on aver-

age more locks since they access a larger number of database

records; and (3) TPC-E transactions hold the locks they ac-

quire for a longer duration since they are more complicated,

longer running, and scan-heavy transactions. TPC-B and

TPC-C, on the other hand, do not suﬀer from logical lock

contention since the system can properly spread the requests

and SLI [18] prevents physical lock contention from becom-

ing problematic, leaving only the actual locking operation as

the main time-consuming component within the lock man-

ager.

However, as we will see in Table 5, the lock contention

is not as problematic when we run TPC-E on the Intel

machine, which has faster processors than the SPARC ma-

chine. The faster the processor, the faster the lock acqui-

sitions and releases are, and hence, the less time is spent

on lock contention. We come across this fact also when

we run TPC-B. When two threads want to access the same

Branch in a TPC-B database, they ﬁrst acquire a read lock

on the wanted Branch during the index probe according to

ARIES/IM [26] (the default concurrency control scheme in

Shore-MT). Later, when they want to upgrade their read

locks to exclusive ones to update the Branch, they both wait

for each other and they deadlock. While on the SPARC machine
we observe such deadlocks, TPC-B runs without deadlocks on the
Intel machine due to faster locking operations. Switching to
ARIES/KVL [25], which has stricter concurrency control rules
than ARIES/IM, makes this type of deadlock disappear on the
SPARC machine as well.

Table 5: Number of worker threads used for each benchmark on the two machines

Server     UltraSPARC T2    Intel Xeon X5660
                            No HT    HT
TPC-B      48               10       18
TPC-C      60               10       18
TPC-E      4                12       24
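The lock-upgrade deadlock described above for TPC-B can be reproduced with a wait-for graph: each transaction that upgrades a shared lock to an exclusive one waits for every other shared holder, and a cycle in the graph is a deadlock. The sketch below is a generic illustration with hypothetical function names, not the deadlock detector of Shore-MT.

```python
def upgrade_waits(shared_holders, upgraders):
    """Wait-for edges created when `upgraders` turn shared locks into exclusive ones.

    An upgrade must wait until every other shared holder releases its lock.
    """
    return {t: set(shared_holders) - {t} for t in upgraders}


def find_cycle(wait_for):
    """Depth-first search for a cycle in a wait-for graph; returns it or None."""
    visiting, done = [], set()

    def visit(node):
        if node in visiting:
            return visiting[visiting.index(node):]
        if node in done:
            return None
        visiting.append(node)
        for nxt in sorted(wait_for.get(node, ())):
            cycle = visit(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in sorted(wait_for):
        cycle = visit(start)
        if cycle:
            return cycle
    return None
```

With two transactions holding a shared lock on the same Branch, `find_cycle(upgrade_waits({"T1", "T2"}, ["T1", "T2"]))` reports a cycle, whereas a single upgrader only waits and no cycle exists.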

6. MICRO-ARCHITECTURAL ANALYSIS

While performing a micro-architectural analysis for the

OLTP benchmarks, we try to answer the following questions:

(1) Where do CPU cycles go on diﬀerent types of modern

hardware? Are they wasted on memory stalls or used to

retire an instruction?, (2) Do stalls happen mainly due to

instructions or data?, (3) How important are the instruction

and data miss rates?, (4) How much instruction-level (ILP)

and memory-level (MLP) parallelism do OLTP benchmarks

exhibit?, and (5) What is the eﬀect of simultaneous multi-

threading (or hyper-threading)?

All the numbers reported in this section were obtained

when the workloads have their peak performance on the cor-

responding server with their optimal conﬁguration on Shore-

MT. Table 5 shows the number of worker threads execut-

ing transactions in the system when the peak throughput

is achieved for each workload on each server. Adding more

worker threads to the system on top of the numbers reported

in Table 5 causes degradation in throughput, either due to

contention on shared records and storage manager objects

or over-saturation of the machine being used.

6.1 OLTP on an out-of-order processor

This section presents micro-architectural results from the

Intel Xeon X5660 processors. We use VTune [17], which

provides an API to ease the use of the hardware counters

on this machine. We emphasize that the execution time

breakdown on a superscalar out-of-order (OoO) processor

cannot be precise due to overlapping of diﬀerent execution

components [12]. However, considering the low IPC of the

workloads we are experimenting with (Section 6.1.4), we can

assume that not much work is overlapped. Nevertheless,

we draw the execution cycles that can be overlapped side-

by-side rather than on top of each other.

Intel Xeon X5660 processors support hyper-threading, run-

ning two hardware contexts on one core at the same time.

The goal of hyper-threading is to overlap the stall time of

one thread with the execution of another. In the following

subsections, for each experiment we present results when

hyper-threading is disabled and when it is enabled.

6.1.1 Execution time breakdown

Figure 3 shows the breakdown of the execution cycles into

busy and stall time for the three benchmarks. We count the

cycles in which at least one instruction is retired as busy and

where no instruction is retired as stalled.
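The busy and stalled fractions follow directly from three counter values, as the sketch below shows. The function and argument names are hypothetical and do not correspond to specific VTune events. The example numbers are placeholders, not measurements.

```python
def cycle_breakdown(total_cycles, cycles_with_retirement, instructions_retired):
    """Split cycles into busy and stalled fractions and compute IPC.

    A cycle counts as busy if at least one instruction retires in it,
    and as stalled otherwise.
    """
    if not 0 <= cycles_with_retirement <= total_cycles or total_cycles <= 0:
        raise ValueError("inconsistent counter values")
    stalled_cycles = total_cycles - cycles_with_retirement
    return {
        "busy": cycles_with_retirement / total_cycles,
        "stalled": stalled_cycles / total_cycles,
        "ipc": instructions_retired / total_cycles,
    }


# Placeholder counter values for illustration only.
example = cycle_breakdown(total_cycles=1_000_000,
                          cycles_with_retirement=400_000,
                          instructions_retired=700_000)
print(example)
```

Because several instructions can retire in one busy cycle, the IPC can exceed the busy fraction, as it does in this example.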

In Figure 3, we see that more than half of the execution time
is spent on stalls for all of the OLTP benchmarks. While TPC-B
and TPC-C show very similar behavior in terms of the percentage
of busy and stalled cycles, TPC-E seems to observe fewer stalled
cycles during the overall execution. This behavior results in a
higher IPC value for TPC-E (see Section 6.1.4). As expected,
when hyper-threading is enabled, the stalled cycles increase in
the overall execution time since two threads instead of one
share the private L1 and L2 caches, evicting each other's data
and instructions from the cache, thus, causing more cache misses.

Figure 3: Execution time breakdown for three OLTP benchmarks on an OoO processor with and without hyper-threading.
[Stacked bars of the execution time breakdown for TPC-B, TPC-C, and TPC-E, each with No HT and HT; categories: Stalled (App), Stalled (OS), Busy (App), Busy (OS).]

Figure 3 also breaks the execution time into time spent on
operating system operations (OS) and on the application itself
(App); it demonstrates that, for our configuration, the

OS does not contribute much to the overall execution time.

6.1.2 Core stalls

As presented in the previous section, stalls dominate the

total execution time of OLTP benchmarks. Figure 4 gives the
estimated breakdown of these stalls into resource stalls, which
also include data stalls, and instruction stalls. We count
resource stalls within a core, mainly stemming from the re-order
buffer (ROB) being full, as backend/resource stalls and the
remaining stalls as frontend/instruction stalls.

We, again, separate OS and application stalls even though

OS does not contribute signiﬁcantly to the total stall time.

As Figure 4 demonstrates, the main cause of core stalls

is the frontend stalls for the OLTP benchmarks. In other

words, a core spends most of its execution cycles waiting

for instructions, since it cannot ﬁnd them in its private L1

instruction cache. The percentage of the frontend stalls is

higher for TPC-E compared to both TPC-B and TPC-C.

We link this behavior to lower data miss rate of TPC-E (see

Section 6.1.3), which increases the percentage of stalls for

instructions.

In addition, hyper-threading increases the percentage of

the backend stalls. Two threads sharing the resources of

one core with hyper-threading can increase the hit rate of

the instruction cache more than the data cache, because

transactions tend to share more instructions than data [4].

Figure 4: Core stalls breakdown for three OLTP benchmarks on an OoO processor with and without hyper-threading.
[Stacked bars of the core stalls breakdown for TPC-B, TPC-C, and TPC-E, each with No HT and HT; categories: Frontend (Instruction) - App, Frontend (Instruction) - OS, Backend (Resource) - App, Backend (Resource) - OS.]

6.1.3 Data and instruction misses

Figure 5 shows the number of misses per k-instructions on the
left-hand side and the estimated number of cycles spent on these
misses on the right-hand side. As we mentioned before, we
demonstrate the cycles spent on various cache misses side-by-side
rather than on top of each other because of the unknown
overlapping cycles for these misses. We categorize the cache
misses as L1 instruction cache misses (L1I), L2 instruction
misses (L2I), L1 data cache misses (L1D), L2 data misses (L2D),
and L3 or last-level cache misses (LLC). For stall cycles due to
cache misses, we use the expected penalty for that particular
miss on the machine being used. For LLC misses, we average the
penalty for going to local memory and remote memory.
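The estimate described above multiplies each miss count by the expected penalty of that miss type. A minimal sketch is given below. The function names are hypothetical, and all miss counts and latencies are placeholders rather than values measured on the Xeon X5660.

```python
def llc_miss_penalty(local_memory_cycles, remote_memory_cycles):
    """Average of the local- and remote-memory latencies, as described above."""
    return (local_memory_cycles + remote_memory_cycles) / 2


def estimated_stall_cycles(misses_per_kinst, penalty_cycles):
    """Estimated cycles per 1000 instructions spent on each miss type.

    Overlap between misses is unknown, so the per-type estimates are kept
    separate (side by side) instead of being summed into one total.
    """
    return {
        level: misses_per_kinst[level] * penalty_cycles[level]
        for level in misses_per_kinst
    }


# Placeholder inputs for illustration only; not values from the paper.
penalties = {"L1I": 10, "L2I": 40, "L1D": 10, "L2D": 40,
             "LLC": llc_miss_penalty(200, 300)}
misses = {"L1I": 30, "L2I": 2, "L1D": 12, "L2D": 3, "LLC": 1}
estimate = estimated_stall_cycles(misses, penalties)
print(estimate)
```

Keeping one entry per miss type mirrors the side-by-side presentation: the entries are upper bounds for each type, not additive components of a single total.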

What we observe is that L1 instruction cache misses domi-

nate both the total number of misses and the total number of

cycles spent on those misses for all of the OLTP benchmarks.

As mentioned in Section 6.1.1, enabling hyper-threading in-

creases the total number of misses in general due to more

threads sharing the shared cache resources.

TPC-E exhibits ∼35% fewer data misses and almost the

same number of instruction misses, regardless of its longer

running and more complicated transactions. Since it per-

forms more scan operations, TPC-E can reuse the cache lines

for data and instructions it needs more often.

6.1.4 Instruction- and memory-level parallelism

Finally, Figure 6 shows how many instructions per cycle

(IPC) these OLTP benchmarks can execute per core on the

left-hand side and how many long-latency misses (L2 miss)

can be overlapped (MLP) on the right-hand side.

An Intel Xeon X5660 processor has the ability to retire

four instructions per cycle. However, by looking at Figure 6,

we see that OLTP benchmarks can hardly retire even one

instruction per cycle even though enabling hyper-threading

provides some beneﬁt. Overall, as the complexity of the

benchmark increases, going from TPC-B to TPC-E, the IPC

also increases. As we also mentioned in Section 6.1.1, it is

expected that TPC-E has a higher IPC value since it spends

less of its execution time on stall cycles compared to the

other two workloads. Higher IPC stems from TPC-E ob-

serving fewer L1 data misses (Section 6.1.3) because of its

frequent scan operations.

From the MLP values given in Figure 6, we also con-

clude that OLTP benchmarks do not exhibit high MLP.

Even though there are 48-entry load-store queues in this

processor, OLTP benchmarks do not have more than 2.7

outstanding long-latency misses even when hyper-threading

is enabled. While TPC-B and TPC-C observe very similar

MLP values, TPC-E exhibits less memory-level parallelism.

Figure 5: Number of misses per k-instructions for three OLTP benchmarks on an OoO processor with and without hyper-threading and the estimated number of cycles spent on these misses.
[Left panel: #Misses per k-instructions; right panel: #Cycles for misses per k-instructions (estimates). Bars for TPC-B, TPC-C, and TPC-E, each with No HT and HT; miss types: L1I, L2I, L1D, L2D, LLC.]

Figure 6: Instructions committed per cycle and memory-level parallelism on an OoO processor with and without hyper-threading.
[Left panel: IPC; right panel: MLP. Bars for TPC-B, TPC-C, and TPC-E.]

6.2 OLTP on an in-order processor

This section presents micro-architectural results from the

Sun UltraSPARC T5220 server. We used the hardware coun-

ters on this machine through the cputrack command [27],

which allows us to count various types of cache misses and

number of instructions executed by each thread.

UltraSPARC T2 is an in-order processor that supports

simultaneous multi-threading. A core provides support for

eight hardware contexts and collocates two hardware con-

texts in the pipeline in one cycle. Therefore, each of these

hardware contexts uses one cycle in every four cycles, aiming

to overlap the stall time of other hardware contexts.

Figure 7 shows the number of misses per k-instructions on

the left-hand side and the estimated number of cycles spent

on these misses on the right-hand side as in Figure 5. On

this processor, we also cannot infer the overlapped opera-

tions and, as in Figure 5, we draw the execution cycles that

can be overlapped side-by-side rather than on top of each

other. We report L1 instruction cache misses (L1I), L2 instruction
misses (L2I), L1 data cache misses (L1D), and L2

data misses (L2D). For stall cycles due to misses, we use the

expected penalty for that particular miss on this machine.

Similar to the Intel machine, the main source of misses
and stall cycles is also L1 instruction cache misses, as Figure 7
shows. On the other hand, the last-level cache (L2)

maintains almost all of the instructions for these workloads

running on Shore-MT. Due to having smaller L1 data caches

and more hardware contexts using the same private L1 cache

in a core, L1 data cache misses contribute to a bigger portion

of the total stall cycles compared to the Intel machine.

The comparison among the three benchmarks in terms of

misses looks similar to the comparison we have on the Intel

machine (Figure 5). The instruction miss numbers are very

close to each other for all the workloads and TPC-E has 50%

fewer data misses compared to TPC-B and TPC-C.

Figure 8 shows the IPC values for the three OLTP bench-

marks running on UltraSPARC T2. Considering that this is

an in-order machine, being able to execute instructions from

two hardware contexts in a cycle, the IPC being higher than

one shows a more eﬀective use of the hardware resources

compared to the Intel machine. While, on the Intel machine,
OLTP benchmarks leverage less than half of the instruction
issue width, on SPARC, they can utilize more than half of it.
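The comparison above amounts to dividing the measured IPC by the number of instructions a core can handle per cycle. The sketch below uses the widths stated in the text. The IPC values are placeholders for illustration and are not read from Figures 6 and 8.

```python
def issue_width_utilization(ipc, issue_width):
    """Fraction of the per-cycle instruction slots that do useful work."""
    if issue_width <= 0:
        raise ValueError("issue_width must be positive")
    return ipc / issue_width


# Widths taken from the text: four instructions per cycle on the Xeon X5660,
# two hardware contexts executing per cycle on an UltraSPARC T2 core.
INTEL_WIDTH = 4
SPARC_WIDTH = 2

# Placeholder IPC values for illustration only, NOT measurements from the paper.
example_ipc = {"intel": 0.8, "sparc": 1.2}

intel_util = issue_width_utilization(example_ipc["intel"], INTEL_WIDTH)
sparc_util = issue_width_utilization(example_ipc["sparc"], SPARC_WIDTH)
print(f"Intel: {intel_util:.0%} of issue width, SPARC: {sparc_util:.0%}")
```

Any IPC below one on the four-wide core stays under a quarter of the width, while any IPC above one on the two-wide core exceeds half of it.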

7. DISCUSSION

In this section we summarize the highlights of our exper-

imental study and discuss the optimal ways of executing

OLTP benchmarks, mainly focusing on TPC-E.

Looking at the high-level description and statistics for

each benchmark, we see that with each new OLTP bench-

mark standardized by TPC, we have a signiﬁcant increase

in complexity compared to the previous ones. Moreover,

observing our time breakdown results from Section 5.2 and

previous studies [18, 20, 29, 30], each benchmark stresses dif-

ferent parts of the storage manager in diﬀerent ways. How-

ever, regardless of these diﬀerences, micro-architecturally, all

the OLTP benchmarks that exist today observe very similar

behavior (Section 6).

As our micro-architectural analysis shows, TPC-E has a higher
IPC, observes lower miss rates, and spends less of its execution
time on memory stalls compared to TPC-B and TPC-C. However, the
fact that OLTP benchmarks commonly observe low IPC, spend most
of their execution time on memory stalls, and mainly suffer from
L1 instruction cache misses still remains. Going from an
aggressive out-of-order processor to an in-order processor does
not change the micro-architectural characteristics of the OLTP
benchmarks much. However, we observe that simultaneous
multi-threading (or hyper-threading) helps to overlap the stall
time caused by cache misses to some extent.

Figure 7: Number of misses per k-instructions for three OLTP benchmarks on an in-order processor with simultaneous multi-threading and the estimated number of cycles spent on these misses.
[Left panel: #Misses per k-instructions; right panel: #Cycles for misses per k-instructions (estimated). Bars for TPC-B, TPC-C, and TPC-E; miss types: L1I, L2I, L1D, L2D.]

Figure 8: Instructions committed per cycle on an in-order processor with simultaneous multi-threading.
[IPC bars for TPC-B, TPC-C, and TPC-E.]

By looking at the time TPC-E spends inside the lock man-

ager, the natural choice would be to partition the database

and deploy a shared-nothing design for it. Even though for

TPC-B- and TPC-C-like database schemas, this would work

very well [34], for TPC-E such a design would cause a lot

of distributed transactions. There are two main reasons for

this: (1) Due to its complex schema, not all the TPC-E ta-

bles can be correlated with a single database column like

the Branch ID in TPC-B or Warehouse ID in TPC-C. (2)

The TPC-E transactions access a lot of database records

from various tables and perform frequent index scans by us-

ing secondary indexes. Therefore, it is not clear based on

which columns we should partition TPC-E tables in a way to

minimize distributed transactions when we deploy a shared-

nothing design.

On the other hand, a shared-everything design based on

logical or physiological partitioning like in DORA [29] or

PLP [30], respectively, might be more beneﬁcial especially

for TPC-E-like workloads. Such designs successfully mini-

mize locking and latching overheads within the storage man-

ager and they do not suﬀer from distributed transactions like

in a shared-nothing design. In addition, optimistic and mul-

tiversion concurrency control schemes [6, 24] may especially

help TPC-E-like read-heavy workloads to improve concur-

rency by avoiding blocking at the time of a potential conﬂict

and rather lazily performing checks at commit time.
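The idea of checking for conflicts lazily at commit time can be sketched with per-record version numbers. This is a generic, single-threaded illustration with hypothetical class names. It is not the scheme of [6] or [24], and it assumes that `commit` runs atomically, which a real system would have to enforce.

```python
class VersionedStore:
    """Maps each key to a (value, version) pair."""

    def __init__(self, initial=None):
        self.data = {k: (v, 1) for k, v in (initial or {}).items()}

    def get(self, key):
        return self.data.get(key, (None, 0))


class OptimisticTransaction:
    def __init__(self, store):
        self.store = store
        self.read_versions = {}  # key -> version observed at first read
        self.writes = {}         # buffered writes, applied only at commit

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        value, version = self.store.get(key)
        self.read_versions.setdefault(key, version)
        return value

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Validation: every record read must still be at the version we saw.
        for key, seen in self.read_versions.items():
            if self.store.get(key)[1] != seen:
                return False  # conflict found at commit time; caller retries
        for key, value in self.writes.items():
            _, version = self.store.get(key)
            self.store.data[key] = (value, version + 1)
        return True
```

Readers never block in this sketch. A transaction only aborts if a record it read was overwritten by a transaction that committed in the meantime, which is why such schemes are attractive for read-heavy workloads.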

Considering that L1 instruction cache misses dominate

the total number of cache misses, techniques that involve

several cores to execute a transaction or exploit common in-

structions both within the same transaction and across dif-

ferent transactions would be very helpful for OLTP. While

using several cores creates an aggregate L1 instruction cache

capacity for a transaction, being able to reuse common in-

structions reduces the need to re-fetch an instruction over

and over.

Software-side techniques that exploit intra-transaction par-

allelism [11, 29] divide the transactions into smaller actions

and run independent actions in parallel on diﬀerent nearby

cores. Each action has smaller instruction footprint than the

entire transaction and a higher chance of ﬁtting its instruc-

tions in the L1I cache. On the hardware side, computation

spreading through thread migration [4, 9] both uses multiple

cores to execute a transaction and makes newer transactions

reuse the instructions brought to the L1I cache by the older

transactions without any guidance from the software side. A

more eﬀective solution, however, would be to involve both

software and hardware enhancements to minimize the stall

cycles due to instructions.
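The software-side decomposition can be sketched as a transaction split into phases of independent actions, with the actions of one phase running in parallel. This is a loose illustration of the idea only. The names are hypothetical, and Python threads model neither the placement on nearby cores nor the instruction cache effects.

```python
from concurrent.futures import ThreadPoolExecutor


def run_transaction(phases, max_workers=4):
    """Run a transaction expressed as a list of phases.

    Each phase is a list of independent actions (zero-argument callables).
    Actions within a phase run in parallel; phases run one after another.
    """
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for phase in phases:
            futures = [pool.submit(action) for action in phase]
            # Waiting for all results acts as a barrier between phases.
            results.append([f.result() for f in futures])
    return results
```

For example, `run_transaction([[read_a, read_b], [update_c]])`, with hypothetical actions, would run the two reads concurrently and start the update only after both have finished.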

8. RELATED WORK

There is a large body of related work on workload
characterization for database workloads. Barroso et al. [5]
investigated the memory system behavior of OLTP and DSS

style workloads using TPC-B and TPC-D [36], respectively,

both on a real machine and with a full-system simulation.

They found that these two types of workloads need diﬀerent

architectural designs in terms of the memory system. Ran-

ganathan et al. [32] used the same workloads as in [5]. How-

ever, they only focused on the eﬀectiveness of out-of-order

execution on SMPs while running these workloads in a simulation
environment. We believe that neither TPC-B nor TPC-D

can be representative of TPC-E since TPC-E has much more

complicated and longer-running transactions than TPC-B

and it is not completely read-only like TPC-D.

Keeton et al. [22] experimented with TPC-C on a 4-way

Pentium Pro SMP machine and performed a similar analysis

to [5, 32]. Although TPC-C is closer to TPC-E compared
to both TPC-B and TPC-D, it still has major differences
from TPC-E as described in Section 2. Stets et al. [33]
perform a micro-architectural comparison between TPC-B

and TPC-C. We add TPC-E to this comparison and also

analyze what happens within the storage manager.

Ailamaki et al. [2] examined where the time goes on

four diﬀerent commercial DBMSs with a microbenchmark

to have a ﬁner-grain understanding of the memory system

behavior of multiprocessors. Hardavellas et al. [15] ana-

lyzed OLTP, with TPC-C, and DSS, with TPC-H, on both

in-order and out-of-order machines by using a simulation

environment. Rather than optimizing the hardware for the

workloads, these two papers focused on the implications on

the DBMS side in order to utilize the underlying hardware

more eﬀectively. In our work, we consider both the hardware

and the DBMS design for optimal TPC-E execution.

Johnson et al. [18, 20] and Pandis et al. [29, 30] provide

detailed analysis on where the time goes within the storage

manager for typical OLTP benchmarks. Their main aim was

to highlight components that become scalability bottlenecks

in the existing systems and propose alternative designs that

remove those bottlenecks. In this paper, we also perform

the same analysis with TPC-E and discuss which one of

their techniques can or cannot help TPC-E, and also expose

the bottleneck on L1 instruction misses.

There are a few performance analysis papers that use

TPC-E. For example, [10, 21] use I/O traces of a produc-

tion database server running TPC-E in order to study its

I/O behavior. In [10] the authors compare the I/O behavior

of TPC-C and TPC-E. We do not study the I/O behav-

ior. For our experiments we use memory-resident databases

and focus on the micro-architectural behavior. Ferdman et

al. [13] present a detailed micro-architectural analysis with

many types of workloads on Intel X5670 processors, focus-

ing on the architectural design needs of the scale-out work-

loads. They provide a comparison between the scale-out

workloads and server workloads, like TPC-C and TPC-E.

In our work, we use a very similar methodology while an-

alyzing the OLTP benchmarks micro-architecturally on our

Intel X5660 processors and our high-level conclusions
corroborate their findings. In addition, we perform such a

micro-architectural analysis on diﬀerent hardware platforms

to understand the behavior when we switch from an in-order

core to an out-of-order one. Moreover, we also demonstrate

which components TPC-E stresses within the storage man-

ager as opposed to a pure micro-architectural study. Atta et

al. [4] propose computation spreading through thread mi-

gration to minimize instruction misses for OLTP workloads.

A part of their study also analyzes the instruction and data

misses of both TPC-C and TPC-E with a trace simulation

study, but not on real hardware. Finally, [23] uses TPC-E

to show that a cluster of “wimpy” (low-power Atom-based)

nodes is not as energy-eﬃcient as a cluster of traditional

server-grade processors (Xeon-based).

9. CONCLUSIONS

In this paper, we present a thorough workload charac-

terization study for TPC-E. We rely on proﬁling results to

determine where the time goes within the storage manager

while executing TPC-E on top. Furthermore, we use performance
counters to investigate the micro-architectural behavior on two
different camps of modern hardware: aggressive out-of-order
and lean in-order. We compare TPC-E

with previous OLTP benchmarks standardized by TPC, the

well-studied TPC-C and the obsolete TPC-B, to better un-

derstand what TPC-E-like workloads need both from the

software and hardware.

Our study shows that TPC-E observes higher IPC but, at

a high-level, has a very similar micro-architectural behavior

to its predecessors; it suﬀers from a high number of L1 in-

struction cache misses and spends most of its time stalling

on memory accesses. Within the storage manager, TPC-E

stresses the lock manager the most, like its predecessors, al-

though it gets a higher penalty within the lock manager due

to logical lock contention on hot database records.

We believe TPC-E can beneﬁt from the previous design

proposals made for OLTP workloads, both from the hardware
side and within the storage manager. Running TPC-E
on less aggressive processors with a narrow instruction issue
width, and on processors that support SMT, increases its IPC

value and leads to a better utilization of micro-architectural

resources. However, we advocate a more fundamental spe-

cialized solution where hardware and software operate to-

gether. Such a design can be based on logical partitioning,

intra-transaction parallelism, and/or computation spreading

to get the best of modern and future hardware for OLTP.

Acknowledgments

We would like to thank all the members of the DIAS and

PARSA laboratories at EPFL, Islam Atta, and Duygu Cey-

lan for their support and feedback throughout this work. We

are very grateful to Onur Kocberber and Rene Mueller for

sharing their expertise on VTune with us. We also thank

the reviewers for their constructive comments and help to

improve this paper. This work was partially supported by

a Sloan research fellowship, NSF grants CCR-0205544, IIS-

0133686, and IIS-0713409, an ESF EurYI award, and Swiss

National Foundation funds.

10. REFERENCES

[1] A. Ailamaki, D. J. DeWitt, and M. D. Hill. Data page

layouts for relational databases on deep memory

hierarchies. VLDB J., 11(3):198–215, 2002.

[2] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A.

Wood. DBMSs on a modern processor: Where does

time go? In VLDB, pages 266–277, 1999.

[3] Anon. et al. A measure of transaction processing

power. Datamation, 31(7), 1985.

[4] I. Atta, P. Tözün, A. Ailamaki, and A. Moshovos.

SLICC: Self-Assembly of Instruction Cache Collectives

for OLTP Workloads. In MICRO, pages 188–198,

2012.

[5] L. A. Barroso, K. Gharachorloo, and E. Bugnion.

Memory system characterization of commercial

workloads. In ISCA, pages 3–14, 1998.

[6] P. A. Bernstein and N. Goodman. Multiversion

concurrency control—theory and algorithms. ACM

TODS, 8(4):465–483, 1983.

[7] B. M. Cantrill et al. Dtrace. Available at

http://dtrace.org.

[8] M. J. Carey, D. J. DeWitt, M. J. Franklin, N. E. Hall,

M. L. McAuliﬀe, J. F. Naughton, D. T. Schuh, M. H.

Solomon, C. K. Tan, O. G. Tsatalos, S. J. White, and

M. J. Zwilling. Shoring up persistent applications. In

SIGMOD, pages 383–394, 1994.

[9] K. Chakraborty, P. M. Wells, and G. S. Sohi.

Computation spreading: employing hardware

migration to specialize cmp cores on-the-ﬂy. In

ASPLOS, pages 283–292, 2006.

[10] S. Chen, A. Ailamaki, M. Athanassoulis, P. B.

Gibbons, R. Johnson, I. Pandis, and R. Stoica. TPC-E

vs. TPC-C: Characterizing the new TPC-E

benchmark via an I/O comparison study. SIGMOD

Record, 39:5–10, 2010.

[11] C. B. Colohan, A. Ailamaki, J. G. Steﬀan, and T. C.

Mowry. Optimistic intra-transaction parallelism on

chip multiprocessors. In VLDB, pages 73–84, 2005.

[12] S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E.

Smith. A performance counter architecture for

computing accurate cpi components. In ASPLOS,

pages 175–184, 2006.

[13] M. Ferdman, A. Adileh, O. Kocberber, S. Volos,

M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu,

A. Ailamaki, and B. Falsaﬁ. Clearing the Clouds: A

Study of Emerging Scale-out Workloads on Modern

Hardware. In ASPLOS, pages 37–48, 2012.

[14] G. Graefe, H. Kimura, and H. Kuno. Foster B-trees.

ACM TODS, 37(3):17:1–17:29, 2012.

[15] N. Hardavellas, I. Pandis, R. Johnson, N. Mancheril,

A. Ailamaki, and B. Falsaﬁ. Database servers on chip

multiprocessors: Limitations and opportunities. In

CIDR, pages 79–87, 2007.

[16] IBM. IBM breaks double digit performance barrier

with 10 million transactions per minute, 2010.

Available at http://www-03.ibm.com/press/us/en/pressrelease/32328.wss.

[17] Intel. Intel VTune Ampliﬁer XE performance proﬁler.

Available at http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/.

[18] R. Johnson, I. Pandis, and A. Ailamaki. Improving

OLTP scalability using speculative lock inheritance.

PVLDB, 2(1):479–489, 2009.

[19] R. Johnson, I. Pandis, N. Hardavellas, A. Ailamaki,

and B. Falsaﬁ. Shore-MT: a scalable storage manager

for the multicore era. In EDBT, pages 24–35, 2009.

[20] R. Johnson, I. Pandis, R. Stoica, M. Athanassoulis,

and A. Ailamaki. Aether: a scalable approach to

logging. PVLDB, 3:681–692, 2010.

[21] S. Kavalanekar, B. Worthington, Q. Zhang, and

V. Sharda. Characterization of storage workload

traces from production windows servers. In IISWC,

pages 119–128, 2008.

[22] K. Keeton, D. A. Patterson, Y. Q. He, R. C. Raphael,

and W. E. Baker. Performance characterization of a

quad Pentium Pro SMP using OLTP workloads. In

ISCA, pages 15–26, 1998.

[23] W. Lang, J. M. Patel, and S. Shankar. Wimpy node

clusters: what about non-wimpy workloads? In

DaMoN, pages 47–55, 2010.

[24] P.-A. Larson, S. Blanas, C. Diaconu, C. Freedman,

J. M. Patel, and M. Zwilling. High-performance

concurrency control mechanisms for main-memory

databases. PVLDB, 5(4), 2011.

[25] C. Mohan. ARIES/KVL: a key-value locking method

for concurrency control of multiaction transactions

operating on B-tree indexes. In VLDB, pages 392–405,

1990.

[26] C. Mohan and F. Levine. ARIES/IM: an eﬃcient and

high concurrency index management method using

write-ahead logging. In SIGMOD, pages 371–380,

1992.

[27] Oracle. cputrack. Available at

http://docs.oracle.com/cd/E19683-01/816-

0210/6m6nb7m6s/index.html.

[28] Oracle. SPARC supercluster with 27 SPARC T3-4

servers demonstrates world record performance on

TPC-C benchmark, 2010. Available at

http://www.oracle.com/us/solutions/performance-

scalability/t3-4-tpc-c-12210-bmark-190934.html.

[29] I. Pandis, R. Johnson, N. Hardavellas, and

A. Ailamaki. Data-oriented transaction execution.

PVLDB, 3(1):928–939, 2010.

[30] I. Pandis, P. T¨

oz¨

un, R. Johnson, and A. Ailamaki.

PLP: page latch-free shared-everything OLTP.

PVLDB, 4(10):610–621, 2011.

[31] D. Porobic, I. Pandis, M. Branco, P. T¨

oz¨

un, and

A. Ailamaki. OLTP on hardware islands. PVLDB,

5(11):1447–1458, 2012.

[32] P. Ranganathan, K. Gharachorloo, S. V. Adve, and

L. A. Barroso. Performance of database workloads on

shared-memory systems with out-of-order processors.

In ASPLOS-VIII, pages 307–318, 1998.

[33] R. Stets, K. Gharachorloo, and L. Barroso. A detailed

comparison of two transaction processing workloads.

In WWC, pages 37–48, 2002.

[34] M. Stonebraker, S. Madden, D. J. Abadi,

S. Harizopoulos, N. Hachem, and P. Helland. The end

of an architectural era: (it’s time for a complete

rewrite). In VLDB, pages 1150–1160, 2007.

[35] TPC. TPC benchmark B standard speciﬁcation,

revision 2.0, 1994. Available at

http://www.tpc.org/tpcb.

[36] TPC. TPC benchmark D standard speciﬁcation,

revision 2.1, 1998. Available at

http://www.tpc.org/tpcd.

[37] TPC. TPC benchmark C standard speciﬁcation,

revision 5.11, 2010. Available at

http://www.tpc.org/tpcc.

[38] TPC. TPC benchmark E standard speciﬁcation,

revision 1.12.0, 2010. Available at

http://www.tpc.org/tpce.

[39] TPC. TPC benchmark H standard speciﬁcation,

revision 2.14.3, 2011. Available at

http://www.tpc.org/tpch.


