Continuous Benchmarking: Using System
Benchmarking in Build Pipelines
Martin Grambow, Fabian Lehmann, David Bermbach
TU Berlin & Einstein Center Digital Future
Mobile Cloud Computing Research Group
Berlin, Germany
{mg, flm, db}@mcc.tu-berlin.de
Abstract—Continuous integration and deployment are established paradigms in modern software engineering. Both intend to ensure the quality of software products and to automate the testing and release process. Today’s state of the art, however, focuses on functional tests or small microbenchmarks such as single method performance while the overall quality of service (QoS) is ignored.
In this paper, we propose to add a dedicated benchmarking step into the testing and release process which can be used to ensure that QoS goals are met and that new system releases are at least as “good” as the previous ones. For this purpose, we present a research prototype which automatically deploys the system release, runs one or more benchmarks, collects and analyzes results, and decides whether the release fulfills predefined QoS goals. We evaluate our approach by replaying two years of Apache Cassandra’s commit history.
Index Terms—Benchmarking, Continuous Integration, Software Development, QoS, YCSB, Cassandra
I. INTRODUCTION
Today’s IT systems tend to be rather complex pieces of software, so even the smallest changes, either in the source code itself or in the application settings, can have a big (negative) impact on their performance, e.g., adding or configuring security features as shown in [1]–[3]. Beyond increased latencies, which result in a poor user experience, such changes also affect cost: cloud-based systems usually use autoscaling to automatically adapt the amount of resources to meet quality of service (QoS) goals, so when a software change leads to increased use of hardware resources, more resources will be provisioned, leading to significantly higher cost. Finally, systems and services rarely operate in isolation; in interaction with other systems and services, even small QoS changes will affect other services and may even trigger a butterfly effect in some scenarios. Regardless of the cause, these deficits are often coupled with reduced revenue or even fines if the current performance metrics do not meet the defined service level agreements (SLAs) or user expectations; e.g., Google reports that the number of daily searches per user decreases if the latency of results increases [4].
In order to prevent undesired effects on performance or
other quality metrics, we propose to add system benchmarking
to the build pipeline of software systems. This way, developers
can assert that a new release is at least as good as the previous
release and that it complies with SLAs. In this regard, we make
the following contributions:
1) We describe how QoS requirements can be integrated
into the development process and how benchmarking can
be used to enforce QoS goals.
2) We present a proof-of-concept prototype, including the
corresponding Jenkins plug-in.
II. BACKGROUND
In this section, we give a short overview of paradigms and
systems used in this paper.
Version Control with Git: Git1 is a very popular distributed
version control tool. Among many other features, software
developers can download the current version of the source
code from a given repository (checkout or pull), commit
changes to this software locally, and upload (push) them to
ultimately merge them into the current version.
Continuous Integration & Deployment: Continuous Integration (CI) and Continuous Deployment (CD) are two modern paradigms which aim to improve, automate, and accelerate the software development process, leading to shorter release cycles. CI defines the process of integrating new software changes into the master version, including adapting and running corresponding test cases, which ensures that the software is extensively tested before it is merged into a production branch of a system [5]. CD describes the automated process of releasing and deploying new software versions. Once a new release has been thoroughly tested in the CI process, it is automatically rolled out to the production system so that frequent daily releases are possible; this shortens the release cycle.
Both processes, CI and CD, are designed to run multiple times per day, depending on how many features are implemented per day and the release policy. Thus, 10 minutes is a common guiding value for the total run time of both processes so that developers can get early feedback on their software changes. In practice, however, this is not always realistic, so multi-tiered deployment pipelines are usually used instead.
Jenkins: Jenkins2 is an open source automation server for all tasks related to the software development process. Triggered by various kinds of events, a Jenkins server executes user-defined job pipelines which consist of multiple build steps; the structure and sequence of these steps depend on the respective application. While the typical structure of a pipeline is a simple sequence of tasks, it is also possible to insert conditions or execute steps in parallel. Since different software projects have different needs, Jenkins can be extended through so-called plug-ins which can be used to implement arbitrary functionality.
1 https://git-scm.com/
2 https://jenkins.io/
Benchmarking: Benchmarking “is the process of measuring
quality and collecting information on system states” [6].
In contrast to monitoring, benchmarking aims to answer a
specific question and is typically used to compare different
system versions, configurations, alternatives, or deployments.
Moreover, benchmarking typically measures the quality of
a non-production environment with arbitrary metrics at a
specific time in multiple test runs and analyzes its output in a
subsequent offline analysis. Benchmarking can involve running
micro benchmarks which measure very small isolated features
down to single method performance. Typically, however, a
system is deployed on several machines (along with other
systems it potentially depends on) and a measurement client
on another machine. Then, the measurement client runs an
application-driven workload against the system under test
(SUT) and tracks its changes in QoS.
Apache Cassandra: Apache Cassandra3 is a popular NoSQL
database system which was originally developed at Facebook.
The system was explicitly designed for elastic scalability,
high performance and availability; in exchange, it offers only
eventually consistent guarantees based on the PACELC trade-
offs [7].
YCSB: The Yahoo! Cloud Serving Benchmark4 (YCSB) [8] is
the de-facto standard benchmark for NoSQL databases. YCSB
offers a suite of synthetic standard workloads which can be
run against the SUT. After each execution, YCSB reports
aggregated measurement results including throughput, latency,
and the total runtime which can be used to evaluate and rank
the compared systems or system configurations.
III. APPROACH
In this section, we describe how benchmarking can be used as part of a CI or CD build process to ensure that QoS requirements are met, give an overview of our system architecture, and describe how builds with QoS problems can be detected in benchmarking results.
A. Continuous Benchmarking
As already described, state of the art CI and CD solutions
focus on functional tests and micro benchmarks such as single
method performance. To complement this, we propose to
regularly run system benchmarks as part of the build process.
Comparable to CI and CD, we refer to this as Continuous
Benchmarking (CB).
CB should only be done if the correct functionality of the software has already been ensured; otherwise, the benchmark might run against a buggy software version and produce incorrect results, e.g., because data records are processed much faster due to an error. CB should, hence, take place once all functional tests and integration tests have already been passed. In the following, we will give an overview of the steps involved in CB; see also figure 1 which gives a high-level overview of the CB process and how it integrates into a CI/CD process.
Fig. 1. Main Steps of Continuous Benchmarking
3 http://cassandra.apache.org/
4 https://github.com/brianfrankcooper/YCSB
Setup: First, once all functional and integration tests have been passed, the SUT and the benchmarking client must be set up, which can be done either sequentially or in parallel. To get comparable results over several runs, it is important to deploy both systems in the same environment in each CB process (same hardware, operating system, supporting libraries, etc.). If, for example, a different hard disk were used in each run, the differing speed would influence the results and it would not be possible to carry out a trend analysis of the key metrics. Also, unless the benchmark explicitly targets a future use case, it is reasonable to choose a runtime environment as similar as possible to the production environment. Finally, the SUT and the benchmarking system must be isolated from external factors which could affect the benchmark results, e.g., other processes running on the same machine, other services interacting with the SUT, or too much traffic on the network.
Depending on the system under test, it may also be necessary to deploy other external systems (e.g., BigTable [9] always relies on GFS [10] and Chubby [11]) or to run a preload phase which inserts an initial data set into the SUT [12].
Execution: In the second step, the benchmark is actually run. In fact, it may be run several times, as benchmarks should usually be repeated, and different benchmarks and benchmark configurations may be run in parallel. Here, monitoring should be used to assert that the machine(s) of the benchmarking client do(es) not become the performance bottleneck. Furthermore, benchmarks should be run for a sufficiently long time; typically, this means keeping a system benchmark running for at least 20–30 minutes [6].
Analysis: In the third step, the results from all benchmark runs
need to be collected as they will typically be distributed across
multiple machines. Next, these results need to be analyzed.
Depending on the benchmark, this may mean simple unit
conversions (e.g., ns to ms) and aggregations or more complex
analysis steps. For instance, when data staleness is measured
following the approach of [13], this may involve analysis of
several GBs of raw text files.
Decision: Finally, the process needs to decide whether the current build is released to the deployment pipeline or whether the process is aborted because QoS goals were not met. Depending on the application and its benchmark, the measured values can either be compared with absolute thresholds, e.g., as defined in Service Level Agreements (SLAs) or software specifications, or relative thresholds could be used (we discuss these metrics in section III-C).
Fig. 2. Architecture and Main Components of Continuous Benchmarking Setups
B. Architecture and Components
As shown in figure 2, the components in our architecture are
closely aligned with the steps described above. Typically, our
CB process will be triggered by a CI server or some other
build pipeline automation system. For this, the Benchmark
Manager acts as the main entry point. Once it has been
triggered, it installs and configures both the SUT and the
Benchmarking Client before starting the execution of the
benchmark run(s). Next, the Benchmark Manager collects all
results and forwards them to the Analyzer which is responsible
for all analysis steps. Of course, the Analyzer may also act as
a proxy that forwards the raw results to an external analysis
system – like the Benchmarking Client, the Analyzer is SUT-
specific. Finally, the Analyzer forwards the aggregated analysis
results to the CB Controller along with the raw data. The
CB Controller then persists the data, decides on success or
failure of the evaluated build, and reports the result back to
the CI server. Beyond this, the Visual Interface visualizes
the benchmarking results for human users and is also used
to configure the CB Controller (e.g., to adjust relative and
absolute thresholds).
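To make the interplay of these components more tangible, the following Java sketch outlines the control flow of a single CB run. It is a minimal illustration only: the interface and class names are hypothetical and merely mirror the roles described above; they are not the API of our prototype.

import java.util.List;

// Hypothetical interfaces mirroring the CB components described above.
interface SutDeployer { void deploy(String commitId); void tearDown(); }
interface BenchmarkingClient { List<RawResult> runBenchmark(); }
interface Analyzer { BenchmarkReport analyze(List<RawResult> rawResults); }
interface CbController { boolean decideAndPersist(BenchmarkReport report); }

class RawResult { /* raw measurement output, e.g., YCSB result files */ }
class BenchmarkReport { /* aggregated metrics in a standard format */ }

// Sketch of the Benchmark Manager acting as the entry point of a CB run.
class BenchmarkManager {
    private final SutDeployer sut;
    private final BenchmarkingClient client;
    private final Analyzer analyzer;
    private final CbController controller;

    BenchmarkManager(SutDeployer sut, BenchmarkingClient client,
                     Analyzer analyzer, CbController controller) {
        this.sut = sut;
        this.client = client;
        this.analyzer = analyzer;
        this.controller = controller;
    }

    // Returns true if the build meets the QoS goals and may proceed in the pipeline.
    boolean runContinuousBenchmark(String commitId) {
        sut.deploy(commitId);                               // Setup
        try {
            List<RawResult> raw = client.runBenchmark();    // Execution
            BenchmarkReport report = analyzer.analyze(raw); // Analysis
            return controller.decideAndPersist(report);     // Decision
        } finally {
            sut.tearDown();
        }
    }
}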
C. Metrics
The final step of our approach requires some metrics to decide on the success or failure of a benchmark run. Depending on the system requirements, these decisions can be based on fixed values given in SLAs, can reject a build because of a sudden and significant drop/jump in QoS compared to the last build, or can detect a negative trend over multiple builds. Here, we present the decision algorithms which we will later use in our evaluation.
Fixed values (FV): The simplest method to detect undesired builds is to apply fixed thresholds, e.g., from SLAs. A build is rejected if the determined metric m_c, e.g., latency, is not in a specific interval.

f(m_c) = \begin{cases} \text{succeed} & \text{if } FV_{lower} < m_c < FV_{upper} \\ \text{reject} & \text{otherwise} \end{cases}   (1)

Please note that we always consider both a lower and an upper value as thresholds. As an example, consider an application with a latency of 10 ms. If this latency suddenly drops to 1 ms, it could be the result of brilliant engineering. It is, however, much more likely to be the result of a bug where all requests terminate very quickly with an error message.
Jump detection (JD): Especially if a software system offers much better quality than required by the FV thresholds, a sudden massive change in quality may still have a significant impact on the user experience. Thus, a relative comparison of the current run metric m_c to the predecessor run metric m_{c-1} can be used to reject builds in which QoS deviates from the last build by more than t percent. Concrete values for t obviously depend on the concrete application; however, we recommend values around 5%.

f(m_c, m_{c-1}) = \begin{cases} \text{succeed} & \text{if } t > 100 \cdot \left(\frac{m_c}{m_{c-1}} - 1\right) \\ \text{reject} & \text{otherwise} \end{cases}   (2)
Trend detection (TD): Finally, a longer lasting trend can lead to the software deteriorating slightly over b builds and finally exceeding a given relative threshold of t percent in total. Here, the metric of the current build m_c must not exceed the moving average of the previous b builds by more than t percent (m_{c-b} refers to the metric of the b-th build before the current one).

f(m_c, b) = \begin{cases} \text{succeed} & \text{if } t > 100 \cdot \left(\frac{m_c \cdot b}{\sum_{i=1}^{b} m_{c-i}} - 1\right) \\ \text{reject} & \text{otherwise} \end{cases}   (3)
Besides these, there are other metrics which could be used. A valid approach might actually be to set the t value in JD and TD to zero to enforce continuously improving QoS. This, however, is likely to reject most builds unless the experiments are run on a completely isolated infrastructure.
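To make the three decision rules concrete, the sketch below implements equations (1)–(3) in Java, the language of our Benchmark Manager prototype. It is only an illustration under the assumption that larger metric values are worse (e.g., runtime or latency); it is not the code of our plug-in.

import java.util.List;

// Illustrative implementation of the FV, JD, and TD decision rules (equations 1-3).
public class DecisionMetrics {

    // Fixed values (FV): succeed only if the metric lies strictly between both thresholds.
    public static boolean fixedValues(double mc, double fvLower, double fvUpper) {
        return fvLower < mc && mc < fvUpper;
    }

    // Jump detection (JD): succeed if the deviation from the previous build stays below t percent.
    public static boolean jumpDetection(double mc, double mcPrevious, double t) {
        return t > 100.0 * (mc / mcPrevious - 1.0);
    }

    // Trend detection (TD): succeed if the current metric exceeds the moving average of the
    // previous b builds by less than t percent; previousBuilds holds m_{c-1} ... m_{c-b}.
    public static boolean trendDetection(double mc, List<Double> previousBuilds, double t) {
        double sum = 0.0;
        for (double m : previousBuilds) {
            sum += m;
        }
        return t > 100.0 * (mc * previousBuilds.size() / sum - 1.0);
    }
}

With the thresholds used in our evaluation (section IV-D), for example, jumpDetection would reject a build whose median runtime is more than 5% above that of its predecessor.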
IV. EVALUATION
In this section, we evaluate our approach through a proof-of-
concept prototype and a number of experiments with our pro-
posed Continuous Benchmarking process in a realistic setup.
We decided to use the existing commit history of Apache
Cassandra as it is one of the most popular NoSQL systems
and benchmark coverage, e.g., through the well established
YCSB benchmark, is good. For our experiments, we replayed
the commit history of Cassandra over the last two years.
A. Proof-of-Concept Implementation
We have implemented our system design as a proof-of-concept prototype. Parts of it are generic enough to be useful for all use cases; other parts are very use case-specific. Specifically, the Benchmark Manager’s code depends to some degree on the SUT and the benchmarking client used. Here, we implemented everything as needed for our evaluation (see next section) with Apache Cassandra and YCSB.
The Benchmark Manager is implemented in Java and uses a number of Unix shell scripts for the installation of Git, Ant, etc. if not already installed. For a production-ready implementation, we would recommend replacing such shell scripts with “Infrastructure as Code” environments such as Ansible5.
Fig. 3. Setup in all Experiment Runs
As already indicated above, the Analyzer is SUT- and
benchmark-specific. In our evaluation case, it is implemented
as a very short script that converts the YCSB output files to a
standard format that our CB Controller can understand.
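As an illustration of such a conversion step, the following sketch parses the aggregated summary lines that YCSB prints after a run (e.g., “[OVERALL], RunTime(ms), …”) into a simple key/value map. The exact output format may differ between YCSB versions; this is an assumed example, not our actual Analyzer.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Illustrative Analyzer step: convert YCSB's aggregated output into key/value metrics.
public class YcsbResultParser {

    // Parses lines of the form "[SECTION], MetricName, value", e.g.,
    // "[OVERALL], RunTime(ms), 1234" would yield "OVERALL.RunTime(ms)" -> 1234.0.
    public static Map<String, Double> parse(Path ycsbOutputFile) throws IOException {
        Map<String, Double> metrics = new HashMap<>();
        for (String line : Files.readAllLines(ycsbOutputFile)) {
            String[] parts = line.split(",");
            if (parts.length != 3 || !parts[0].trim().startsWith("[")) {
                continue; // skip status lines and anything not in the three-column format
            }
            String section = parts[0].trim().replace("[", "").replace("]", "");
            String metric = parts[1].trim();
            try {
                metrics.put(section + "." + metric, Double.parseDouble(parts[2].trim()));
            } catch (NumberFormatException e) {
                // ignore lines without a numeric value
            }
        }
        return metrics;
    }
}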
To better integrate our prototype into existing build pipelines, we have implemented the CB Controller and the Visual Interface as a Jenkins plug-in6. The Visual Interface allows users to specify thresholds and provides line charts for trend analysis as well as functionality for detailed insights into single benchmark runs. In contrast to our Benchmark Manager, which is tailored to our experiments, the plug-in is generally applicable to arbitrary metrics.
B. Experiment Setup
For our experiments, we deployed Jenkins and our CB plug-
in on a single virtual machine (VM). We configured the plug-in
to run Cassandra on two other machines and YCSB on a third
machine (see figure 3).
In all experiments, Cassandra used the “SimpleStrategy”
for replication as we only had two nodes in the cluster; the
replication factor was two. YCSB used workload A with
the following configuration: fieldcount=10, fieldlength=100,
records=20,000, operations=1,000,000 and threads=100.
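As a rough back-of-the-envelope check of this configuration, the raw payload amounts to 20,000 records × 10 fields × 100 bytes = 20 MB per copy of the data set (before replication and storage overhead), i.e., a working set that easily fits into the memory of the instances described next.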
We ran our set of experiments on Amazon EC2 m3.medium
instances (3.75GB RAM, one CPU core) in the eu-west region,
all in the same availability zone. We used the Amazon Linux
AMI and ran all experiments on the same three VMs.
As input for our experiments, we used 465 commits of Cassandra’s commit history between Jan 3, 2017 and Oct 23, 2018 which merged changes into the main trunk. We tested this reduced commit history three times successively; thus, each of these commits was benchmarked three times at different points in time, i.e., we had almost 1,400 benchmark runs. Please note that a real build pipeline of course also involves further steps such as testing. For our experiments, we decided to exclude these steps for reasons of simplicity.
C. Results
Figure 4 shows the results of our experiments as returned by YCSB. When ignoring outliers, which can be expected when experimenting in the cloud [14], the values are mostly densely packed and thus indicate the overall performance trend.
5 https://www.ansible.com/
6 https://github.com/jenkinsci/benchmark-evaluator-plugin
Fig. 4. Total Benchmark Runtime (total benchmark runtime in seconds over commit date for Runs 1–3)
At this stage, we did not specify any absolute or relative
thresholds in our plug-in as we wanted the entire commit series
to run through.
D. Application of Threshold Metrics
Following our approach and the metrics defined in section III-C, we applied these thresholds to the median benchmark measurement to exclude outliers while still evaluating actual measurement values. For the total benchmark runtime, we set 950 s and 1,100 s as FV thresholds. We chose t = 5% as the relative threshold for JD and t = 4% for TD, which considers b = 20 builds.
Figure 5 illustrates these thresholds along with the results from our median experiment run. An intersection of the median line and one of the other lines means that the respective build would be rejected.
Fig. 5. Total Benchmark Runtime: Median Run and Threshold Metrics
The fixed value boundaries trigger only once, for the sudden performance improvement (ca. 13.5%) towards the end of the time series. Here, the developers introduced two features, “Flush netty client messages immediately by default” and “Improve TokenMetaData cache populating performance avoid long locking”, which indicates that our detection is a false positive. Our jump detection algorithm would reject one build on May 10, 2017 (jump of around 6.5%) which, according to the Git commit messages, was caused either by “Forbid unsupported creation of SASI indexes over partition key column” or by “Avoid reading static row twice from legacy sstables”; the first one, however, seems more likely to be the cause. The trend detection, on the other hand, would reject 4 builds. The most significant violating build was on Aug 2, 2018 (performance drop of almost 4.6%); it merely moves some code comments without touching any functionality, thus, the main cause of this negative trend lies in the preceding builds and further analysis would be necessary.
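To make these checks concrete: plugging the reported jump of about 6.5% into equation (2) with t = 5 yields 100 · (m_c / m_{c-1} − 1) ≈ 6.5 > 5, so JD rejects the build; analogously, the deviation of almost 4.6% from the 20-build moving average exceeds the TD threshold of t = 4 in equation (3).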
We only applied our defined metrics to the total runtime. Of
course, there would be more metrics in other, more complex
scenarios.
V. DISCUSSION
Based on our measurement results, we believe that CB is a very useful approach for keeping the QoS of a system either constant or continuously improving it while using the same amount of infrastructure resources. With our proof-of-concept
prototype, we have also shown that the integration of CB into
a build pipeline is indeed possible and does not involve a lot of
effort – in fact, CB is simply integrated into the build process
through our prototype which is automatically triggered for new
versions. There are, however, also a number of open challenges and caveats.
Running CB will typically create additional costs for the de-
velopment process; there is a tradeoff between how frequently
CB is run, i.e., how early QoS problems can be detected, and
the costs associated with that. We believe that this tradeoff
is system-specific and cannot be solved in a general way.
Developers also have to decide whether they plan to run CB
on dedicated physical machines on-premises or whether they
shift benchmark execution to the cloud. Depending on the
frequency of CB runs, the on-premises option may be less
expensive. The cloud option, in contrast, allows running several benchmarks (and benchmark runs) in parallel so that it will be the preferred option when CB uses a set of benchmarks instead of a single benchmark only.
This choice between on-premises non-virtualized hardware
and the cloud option is also related to the variance of results:
We believe that running the CB process on dedicated hardware
will produce more stable results with less variance across
experiment runs. When running experiments in the cloud, we would recommend running an initial experiment with at least ten runs (more is better) to get a better understanding of the variance effects caused by the underlying infrastructure. This would then also determine the number of necessary repetitions during the actual CB execution. Based on our AWS results, we would recommend running the experiment at least three times (preferably five times) and using the median result for further analysis and decision making.
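A minimal helper for this recommendation, assuming the per-run metric values have already been collected, could look as follows (again an illustrative sketch, not part of our prototype).

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Helper for picking the median metric value across repeated benchmark runs.
public class RunAggregation {

    // Returns the median of the given run results
    // (average of the two middle values for an even number of runs).
    public static double median(List<Double> runResults) {
        List<Double> sorted = new ArrayList<>(runResults);
        Collections.sort(sorted);
        int n = sorted.size();
        if (n % 2 == 1) {
            return sorted.get(n / 2);
        }
        return (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }
}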
There is also the question of when to trigger a CB run.
In fact, we do not believe that running one for every single Git commit is the ideal, albeit too expensive, scenario. In our opinion, this would simply create too much data so that developers may no longer be able to visually comprehend the results. Comparable to our evaluation approach, we would recommend triggering CB whenever a new feature branch is merged into the main branch. This also allows developers
to manually override QoS thresholds when there are external
events which mandate feature or configuration updates. For
instance, when a new vulnerability in a TLS cipher suite is
detected, switching cipher suites may be necessary but might
have a strong impact on system performance [1]–[3].
There is also the challenge of finding a benchmark in the
first place. In our case, we were running Apache Cassandra for
which a number of open source benchmarks and benchmark
tools exist. For some custom microservice, this will typically
not be the case. In such scenarios, developers have to build
their own benchmark first – comparable to test-driven devel-
opment – which causes additional costs for personnel.
Finally, to conclude all cost aspects: CB will directly cause additional costs for the CB infrastructure and personnel costs for the management or development of benchmarks. These costs, however, will likely be offset by the avoided indirect costs of unhappy customers or the avoided direct costs of compensation payments for SLA violations. Balancing these costs is a non-trivial task that is application-specific and should probably be approached in an agile way with continuous readaptation.
VI. RELATED WORK
CB is a powerful mechanism for evaluating the QoS of a new system version in a production-like environment. As such, it relies on benchmarking approaches such as [8], [12], [15]–[19]. An alternative but also complementary approach to CB are live testing techniques such as canary releases [20] or dark launches [21]. In contrast to CB, live testing is characterized by the fact that a new version (of a software artifact) is directly deployed into the production environment in parallel with the older version.
For canary releases [20], this new version is initially rolled
out for a very small subset of users and developers monitor
its behavior in production. If there are errors or QoS issues in
the new version, the impact only affects a few users and the
version is reverted or shut down. Otherwise, more and more
users are added to the set of test users until the new version
has completely been rolled out.
While canary releases aim to only affect a small subset of users in case of failures, dark (or shadow) launches [21], [22] avoid affecting users entirely by deploying a new version in the production environment without serving real user traffic – so-called shadow instances. This way, no user is confronted with the new version and its potential issues.
Live testing techniques can be used to detect performance
and other QoS issues in production. However, testing new
versions in a production environment might be problematic
for several reasons: First, a production system is usually in
a normal state with usual load and regular traffic. Thus, a
new version is never evaluated in production under extreme
conditions or for rare corner cases. Second, a roll-out of several
new versions of multiple software artifacts is administratively
complex and error-prone, though tools like BiFrost [23] try
to overcome these problems. Third, theoretical setups and
architectures including new versions are hard to evaluate with
live testing techniques. Finally, live testing does not necessarily
create the right data to identify QoS degradation in the system
release as varying workloads depending on user traffic will
lead to varying observable QoS behavior. All this can be
done with CB, e.g., by creating benchmark setups for extreme
load peak situations. As benchmarking, however, can never
be identical to a production load, we propose to combine the
strengths of both approaches, i.e., to use both live testing and
CB in parallel.
Comparable to our approach, Waller et al. [24] also proposed to include benchmarking in CI pipelines and presented a Jenkins plug-in for this purpose. In contrast to our approach,
however, they focus on measuring performance overheads of a
code instrumentation tool. This is rather different from generic
system benchmarking of distributed systems but supports the
relevance of our approach.
Beyond these, there are two Jenkins plug-ins which could
handle the task of our Visual Interface: the Performance7 plug-in and the Benchmark plug-in8. Both, however, have explicitly
been designed for small single-machine micro benchmarks
such as single method benchmarks.
VII. CONCLUSION
Complex systems are very sensitive to change and even the
smallest change can strongly affect QoS. In particular, such changes occur frequently when releasing new software versions. Existing CI/CD pipelines, however, focus on functional testing or single-method performance measurements and, hence, cannot detect changes in QoS.
In this paper, we proposed a new approach called Continuous Benchmarking in which one or more system benchmarks are run as an additional step in the build pipeline. Measurement results from these benchmark runs are then compared either to absolute thresholds, e.g., as specified in an SLA, or to relative thresholds which compare the result to previous results to assert that QoS levels always improve or at least remain constant across releases. We have prototypically implemented CB using Apache Cassandra as SUT and YCSB as benchmarking client and evaluated our approach by replaying almost two years of Cassandra’s commit history.
7 https://plugins.jenkins.io/performance
8 https://github.com/jenkinsci/benchmark-plugin
REFERENCES
[1] S. Müller, D. Bermbach, S. Tai, and F. Pallas, “Benchmarking the performance impact of transport layer security in cloud database systems,” in Proc. of IC2E. IEEE, 2014.
[2] F. Pallas, J. Günther, and D. Bermbach, “Pick your choice in hbase: Security or performance,” in Big Data. IEEE, 2016.
[3] F. Pallas, D. Bermbach, S. Müller, and S. Tai, “Evidence-based security configurations for cloud datastores,” in Proc. of SAC. ACM, 2017.
[4] J. Brutlag, “Speed matters for google web search,” 2009.
[5] M. Fowler and M. Foemmel, “Continuous integration,” ThoughtWorks, vol. 122, 2006.
[6] D. Bermbach, E. Wittern, and S. Tai, Cloud Service Benchmarking: Measuring Quality of Cloud Services from a Client Perspective. Springer, 2017.
[7] D. Abadi, “Consistency tradeoffs in modern distributed database system design: Cap is only part of the story,” IEEE Computer, vol. 45, no. 2, 2012.
[8] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, “Benchmarking cloud serving systems with ycsb,” in Proc. of SOCC. ACM, 2010.
[9] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, “Bigtable: A distributed storage system for structured data,” in Proc. of OSDI. USENIX Association, 2006.
[10] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The google file system,” in Proc. of SOSP. ACM, 2003.
[11] M. Burrows, “The chubby lock service for loosely-coupled distributed systems,” in Proc. of OSDI. USENIX Association, 2006.
[12] D. Bermbach, J. Kuhlenkamp, A. Dey, A. Ramachandran, A. Fekete, and S. Tai, “BenchFoundry: A benchmarking framework for cloud storage services,” in Proc. of ICSOC. Springer, 2017.
[13] D. Bermbach, “Benchmarking eventually consistent distributed storage systems,” Ph.D. dissertation, Karlsruhe Institute of Technology, 2014.
[14] D. Bermbach, “Quality of cloud services: Expect the unexpected,” IEEE Internet Computing, 2017.
[15] C. Binnig, D. Kossmann, T. Kraska, and S. Loesing, “How is the weather tomorrow?: Towards a benchmark for the cloud,” in Proc. of DBTEST. ACM, 2009.
[16] D. E. Difallah, A. Pavlo, C. Curino, and P. Cudre-Mauroux, “Oltp-bench: An extensible testbed for benchmarking relational databases,” Proc. of VLDB Endowment, vol. 7, no. 4, 2013.
[17] D. Bermbach and E. Wittern, “Benchmarking web api quality,” in Proc. of ICWE. Springer, 2016.
[18] A. H. Borhani, P. Leitner, B. S. Lee, X. Li, and T. Hung, “Wpress: An application-driven performance benchmark for cloud-based virtual machines,” Proc. of EDOC, 2014.
[19] D. Bermbach, J. Kuhlenkamp, A. Dey, S. Sakr, and R. Nambiar, “Towards an extensible middleware for database benchmarking,” in Proc. of TPCTC. Springer, 2014.
[20] J. Humble and D. Farley, Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Addison-Wesley Professional, 2010.
[21] D. G. Feitelson, E. Frachtenberg, and K. L. Beck, “Development and deployment at facebook,” IEEE Internet Computing, vol. 17, no. 4, 2013.
[22] C. Tang, T. Kooburat, P. Venkatachalam, A. Chander, Z. Wen, A. Narayanan, P. Dowell, and R. Karl, “Holistic configuration management at facebook,” in Proc. of SOSP. ACM, 2015.
[23] G. Schermann, D. Schöni, P. Leitner, and H. C. Gall, “Bifrost: Supporting continuous deployment with automated enactment of multi-phase live testing strategies,” in Proc. of Middleware. ACM, 2016.
[24] J. Waller, N. C. Ehmke, and W. Hasselbring, “Including performance benchmarks into continuous integration to enable devops,” ACM SIGSOFT Software Engineering Notes, vol. 40, no. 2, 2015.