Continuous Benchmarking: Using System
Benchmarking in Build Pipelines
Martin Grambow, Fabian Lehmann, David Bermbach
TU Berlin & Einstein Center Digital Future
Mobile Cloud Computing Research Group
Berlin, Germany
{mg, flm, db}@mcc.tu-berlin.de
Abstract—Continuous integration and deployment are established paradigms in modern software engineering. Both intend to ensure the quality of software products and to automate the testing and release process. Today’s state of the art, however, focuses on functional tests or small microbenchmarks such as single method performance while the overall quality of service (QoS) is ignored.
In this paper, we propose to add a dedicated benchmarking step into the testing and release process which can be used to ensure that QoS goals are met and that new system releases are at least as “good” as the previous ones. For this purpose, we present a research prototype which automatically deploys the system release, runs one or more benchmarks, collects and analyzes results, and decides whether the release fulfills predefined QoS goals. We evaluate our approach by replaying two years of Apache Cassandra’s commit history.
Index Terms—Benchmarking, Continuous Integration, Software Development, QoS, YCSB, Cassandra
I. INTRODUCTION
Today’s IT systems tend to be rather complex pieces of software, so even the smallest changes, either in the source code itself or in the application settings, can have a big (negative) impact on their performance, e.g., adding or configuring security features as shown in [1]–[3]. Beyond increased latencies, which result in a poor user experience, such changes also affect cost: cloud-based systems usually use autoscaling to automatically adapt the amount of resources to meet quality of service (QoS) goals, so when a software change leads to increased use of hardware resources, more resources will be provisioned, leading to significantly higher cost. Finally, systems and services rarely operate in isolation; in interaction with other systems and services, even small QoS changes will affect other services and may even trigger a butterfly effect in some scenarios. Regardless of the cause, these deficits are often coupled with reduced revenue or even fines if the current performance metrics do not meet the defined service level agreements (SLAs) or user expectations; e.g., Google reports that the number of daily searches per user decreases if the latency of results increases [4].
In order to prevent undesired effects on performance or
other quality metrics, we propose to add system benchmarking
to the build pipeline of software systems. This way, developers
can assert that a new release is at least as good as the previous
release and that it complies with SLAs. In this regard, we make
the following contributions:
1) We describe how QoS requirements can be integrated
into the development process and how benchmarking can
be used to enforce QoS goals.
2) We present a proof-of-concept prototype, including the
corresponding Jenkins plug-in.
II. BACKGROUND
In this section, we give a short overview of paradigms and
systems used in this paper.
Version Control with Git: Git1 is a very popular distributed
version control tool. Among many other features, software
developers can download the current version of the source
code from a given repository (checkout or pull), commit
changes to this software locally, and upload (push) them to
ultimately merge them into the current version.
Continuous Integration & Deployment: Continuous Integration (CI) and Continuous Deployment (CD) are two modern paradigms which aim to improve, automate, and accelerate the software development process, leading to shorter release cycles. CI defines the process of integrating new software changes into the master version, including adapting and running corresponding test cases, which ensures that the software is extensively tested before it is merged into a production branch of a system [5]. CD describes the automated process of releasing and deploying new software versions. Once a new release has been thoroughly tested in the CI process, it is automatically rolled out to the production system so that frequent daily releases are possible; this shortens the release cycle.
Both processes, CI and CD, are designed to run multiple times per day, depending on how many features are implemented per day and the release policy. Thus, 10 minutes is a common guiding value for the total run time of both processes so that developers can get early feedback on their software changes. In practice, however, this is not always realistic, so multi-tiered deployment pipelines are usually used instead.
Jenkins: Jenkins2 is an open source automation server for all tasks related to the software development process. Triggered by various kinds of events, a Jenkins server executes user-defined job pipelines which consist of multiple build steps; the structure and sequence of these steps depend on the respective application. While the typical structure of a pipeline is a simple sequence of tasks, it is also possible to insert conditions or execute steps in parallel. Since different software projects have different needs, Jenkins can be extended through so-called plug-ins which can be used to implement arbitrary functionality.
1 https://git-scm.com/
2 https://jenkins.io/
Benchmarking: Benchmarking “is the process of measuring
quality and collecting information on system states” [6].
In contrast to monitoring, benchmarking aims to answer a
specific question and is typically used to compare different
system versions, configurations, alternatives, or deployments.
Moreover, benchmarking typically measures the quality of
a non-production environment with arbitrary metrics at a
specific time in multiple test runs and analyzes its output in a
subsequent offline analysis. Benchmarking can involve running
micro benchmarks which measure very small isolated features
down to single method performance. Typically, however, a
system is deployed on several machines (along with other
systems it potentially depends on) and a measurement client
on another machine. Then, the measurement client runs an
application-driven workload against the system under test
(SUT) and tracks its changes in QoS.
Apache Cassandra: Apache Cassandra3 is a popular NoSQL
database system which was originally developed at Facebook.
The system was explicitly designed for elastic scalability,
high performance and availability; in exchange, it offers only
eventually consistent guarantees based on the PACELC trade-
offs [7].
YCSB: The Yahoo! Cloud Serving Benchmark4 (YCSB) [8] is
the de-facto standard benchmark for NoSQL databases. YCSB
offers a suite of synthetic standard workloads which can be
run against the SUT. After each execution, YCSB reports
aggregated measurement results including throughput, latency,
and the total runtime which can be used to evaluate and rank
the compared systems or system configurations.
III. APPROACH
In this section, we describe how benchmarking can be used as part of a CI or CD build process to ensure that QoS requirements are met, give an overview of our system architecture, and describe how builds with QoS problems can be detected in benchmarking results.
A. Continuous Benchmarking
As already described, state of the art CI and CD solutions
focus on functional tests and micro benchmarks such as single
method performance. To complement this, we propose to
regularly run system benchmarks as part of the build process.
Comparable to CI and CD, we refer to this as Continuous
Benchmarking (CB).
CB should only be done if the correct functionality of the software has already been ensured; otherwise, the benchmark might run against a buggy software version and produce incorrect results, e.g., because data records are processed much faster due to an error. CB should, hence, take place once all functional tests and integration tests have already been passed. In the following, we will give an overview of the steps involved in CB; see also figure 1 which gives a high-level overview of the CB process and how it integrates into a CI/CD process.
Fig. 1. Main Steps of Continuous Benchmarking
3 http://cassandra.apache.org/
4 https://github.com/brianfrankcooper/YCSB
Setup: First, once all functional and integration tests have been passed, the SUT and the benchmarking client must be set up, which can be done either sequentially or in parallel. To get comparable results over several runs, it is important to deploy both systems in the same environment in each CB process (same hardware, operating system, supporting libraries, etc.). If, for example, a different hard disk were used in each run, the differing speed would influence the results and it would not be possible to carry out a trend analysis of the key metrics. Also, unless the benchmark explicitly targets a future use case, it is reasonable to choose a runtime environment as similar as possible to the production environment. Finally, the SUT and the benchmarking system must be isolated from external factors which could affect the benchmark results, e.g., other processes running on the same machine, other services interacting with the SUT, or too much traffic on the network.
Depending on the system under test, it may also be necessary to deploy other external systems (e.g., BigTable [9] always relies on GFS [10] and Chubby [11]) or to run a preload phase which inserts an initial data set into the SUT [12].
Execution: In the second step, the benchmark is actually run. In fact, it may be run several times, as benchmarks should usually be repeated, and different benchmarks and benchmark configurations may be run in parallel. Here, monitoring should be used to assert that the machine(s) of the benchmarking client do(es) not become the performance bottleneck. Furthermore, benchmarks should be run for a sufficiently long time; typically, this means keeping a system benchmark running for at least 20–30 minutes [6].
Analysis: In the third step, the results from all benchmark runs
need to be collected as they will typically be distributed across
multiple machines. Next, these results need to be analyzed.
Depending on the benchmark, this may mean simple unit
conversions (e.g., ns to ms) and aggregations or more complex
analysis steps. For instance, when data staleness is measured
following the approach of [13], this may involve analysis of
several GBs of raw text files.
Decision: Finally, the process needs to decide whether the current build is released to the deployment pipeline or whether the process is aborted because QoS goals were not met. Depending on the application and its benchmark, the measured values can either be compared with absolute thresholds, e.g., as defined in Service Level Agreements (SLAs) or software specifications, or relative thresholds could be used (we discuss these metrics in section III-C).
Fig. 2. Architecture and Main Components of Continuous Benchmarking Setups
B. Architecture and Components
As shown in figure 2, the components in our architecture are
closely aligned with the steps described above. Typically, our
CB process will be triggered by a CI server or some other
build pipeline automation system. For this, the Benchmark
Manager acts as the main entry point. Once it has been
triggered, it installs and configures both the SUT and the
Benchmarking Client before starting the execution of the
benchmark run(s). Next, the Benchmark Manager collects all
results and forwards them to the Analyzer which is responsible
for all analysis steps. Of course, the Analyzer may also act as
a proxy that forwards the raw results to an external analysis
system – like the Benchmarking Client, the Analyzer is SUT-
specific. Finally, the Analyzer forwards the aggregated analysis
results to the CB Controller along with the raw data. The
CB Controller then persists the data, decides on success or
failure of the evaluated build, and reports the result back to
the CI server. Beyond this, the Visual Interface visualizes
the benchmarking results for human users and is also used
to configure the CB Controller (e.g., to adjust relative and
absolute thresholds).
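To make the interplay of these components more tangible, the following Java sketch outlines the control flow of a single CB run. It is a minimal illustration only: the interface and class names are hypothetical and merely mirror the roles described above; they are not the API of our prototype.

import java.util.List;

// Hypothetical interfaces mirroring the CB components described above.
interface SutDeployer { void deploy(String commitId); void tearDown(); }
interface BenchmarkingClient { List<RawResult> runBenchmark(); }
interface Analyzer { BenchmarkReport analyze(List<RawResult> rawResults); }
interface CbController { boolean decideAndPersist(BenchmarkReport report); }

class RawResult { /* raw measurement output, e.g., YCSB result files */ }
class BenchmarkReport { /* aggregated metrics in a standard format */ }

// Sketch of the Benchmark Manager acting as the entry point of a CB run.
class BenchmarkManager {
    private final SutDeployer sut;
    private final BenchmarkingClient client;
    private final Analyzer analyzer;
    private final CbController controller;

    BenchmarkManager(SutDeployer sut, BenchmarkingClient client,
                     Analyzer analyzer, CbController controller) {
        this.sut = sut;
        this.client = client;
        this.analyzer = analyzer;
        this.controller = controller;
    }

    // Returns true if the build meets the QoS goals and may proceed in the pipeline.
    boolean runContinuousBenchmark(String commitId) {
        sut.deploy(commitId);                               // Setup
        try {
            List<RawResult> raw = client.runBenchmark();    // Execution
            BenchmarkReport report = analyzer.analyze(raw); // Analysis
            return controller.decideAndPersist(report);     // Decision
        } finally {
            sut.tearDown();
        }
    }
}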
C. Metrics
The final step of our approach requires some metrics to decide on the success or failure of a benchmark run. Depending on the system requirements, these decisions can be based on fixed values given in SLAs, can reject a build because of a sudden and significant drop/jump in QoS compared to the last build, or can detect a negative trend over multiple builds. Here, we present the decision algorithms which we will later use in our evaluation.
Fixed values (FV): The simplest method to detect undesired builds is to apply fixed thresholds, e.g., from SLAs. A build is rejected if the determined metric m_c, e.g., latency, is not in a specific interval.

f(m_c) = \begin{cases} \text{succeed} & \text{if } FV_{lower} < m_c < FV_{upper} \\ \text{reject} & \text{otherwise} \end{cases}   (1)

Please note that we always consider both a lower and an upper value as thresholds. As an example, consider an application with a latency of 10 ms. If this latency suddenly drops to 1 ms, it could be the result of brilliant engineering. It is, however, much more likely to be the result of a bug where all requests terminate very quickly with an error message.
Jump detection (JD): Especially if a software system offers much better quality than required by the FV thresholds, a sudden massive change in quality may still have a significant impact on the user experience. Thus, a relative comparison of the current run metric m_c to the predecessor run metric m_{c-1} can be used to reject builds in which QoS deviates from the last build by more than t percent. Concrete values for t obviously depend on the concrete application; however, we recommend values around 5%.

f(m_c, m_{c-1}) = \begin{cases} \text{succeed} & \text{if } t > 100 \cdot \left(\frac{m_c}{m_{c-1}} - 1\right) \\ \text{reject} & \text{otherwise} \end{cases}   (2)
Trend detection (TD): Finally, a longer lasting trend can lead to the software deteriorating slightly over b builds and finally exceeding a given relative threshold of t percent in total. Here, the metric of the current build m_c must not exceed the moving average of the previous b builds by more than t percent (m_{c-b} refers to the metric of the b-th build before the current one).

f(m_c, b) = \begin{cases} \text{succeed} & \text{if } t > 100 \cdot \left(\frac{m_c \cdot b}{\sum_{i=1}^{b} m_{c-i}} - 1\right) \\ \text{reject} & \text{otherwise} \end{cases}   (3)
Besides these, there are other metrics which could be used. A valid approach might actually be to set the t value in JD and TD to zero to enforce continuously improving QoS. This, however, is likely to reject most builds unless the experiments are run on a completely isolated infrastructure.
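To make the three decision rules concrete, the sketch below implements equations (1)–(3) in Java, the language of our Benchmark Manager prototype. It is only an illustration under the assumption that larger metric values are worse (e.g., runtime or latency); it is not the code of our plug-in.

import java.util.List;

// Illustrative implementation of the FV, JD, and TD decision rules (equations 1-3).
public class DecisionMetrics {

    // Fixed values (FV): succeed only if the metric lies strictly between both thresholds.
    public static boolean fixedValues(double mc, double fvLower, double fvUpper) {
        return fvLower < mc && mc < fvUpper;
    }

    // Jump detection (JD): succeed if the deviation from the previous build stays below t percent.
    public static boolean jumpDetection(double mc, double mcPrevious, double t) {
        return t > 100.0 * (mc / mcPrevious - 1.0);
    }

    // Trend detection (TD): succeed if the current metric exceeds the moving average of the
    // previous b builds by less than t percent; previousBuilds holds m_{c-1} ... m_{c-b}.
    public static boolean trendDetection(double mc, List<Double> previousBuilds, double t) {
        double sum = 0.0;
        for (double m : previousBuilds) {
            sum += m;
        }
        return t > 100.0 * (mc * previousBuilds.size() / sum - 1.0);
    }
}

With the thresholds used in our evaluation (section IV-D), for example, jumpDetection would reject a build whose median runtime is more than 5% above that of its predecessor.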
IV. EVALUATION
In this section, we evaluate our approach through a proof-of-
concept prototype and a number of experiments with our pro-
posed Continuous Benchmarking process in a realistic setup.
We decided to use the existing commit history of Apache
Cassandra as it is one of the most popular NoSQL systems
and benchmark coverage, e.g., through the well established
YCSB benchmark, is good. For our experiments, we replayed
the commit history of Cassandra over the last two years.
A. Proof-of-Concept Implementation
We have implemented our system design as a proof-of-concept prototype. Parts of it are generic enough to be useful for all use cases; other parts are very use case-specific. Specifically, the Benchmark Manager’s code depends to some degree on the SUT and the benchmarking client used. Here, we implemented everything as needed for our evaluation (see next section) with Apache Cassandra and YCSB.
The Benchmark Manager is implemented in Java and uses a number of Unix shell scripts for the installation of Git, Ant, etc. if not already installed. For a production-ready implementation, we would recommend replacing such shell scripts with “Infrastructure as Code” environments such as Ansible5.
Fig. 3. Setup in all Experiment Runs
As already indicated above, the Analyzer is SUT- and
benchmark-specific. In our evaluation case, it is implemented
as a very short script that converts the YCSB output files to a
standard format that our CB Controller can understand.
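As an illustration of such a conversion step, the following sketch parses the aggregated summary lines that YCSB prints after a run (e.g., “[OVERALL], RunTime(ms), …”) into a simple key/value map. The exact output format may differ between YCSB versions; this is an assumed example, not our actual Analyzer.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Illustrative Analyzer step: convert YCSB's aggregated output into key/value metrics.
public class YcsbResultParser {

    // Parses lines of the form "[SECTION], MetricName, value", e.g.,
    // "[OVERALL], RunTime(ms), 1234" would yield "OVERALL.RunTime(ms)" -> 1234.0.
    public static Map<String, Double> parse(Path ycsbOutputFile) throws IOException {
        Map<String, Double> metrics = new HashMap<>();
        for (String line : Files.readAllLines(ycsbOutputFile)) {
            String[] parts = line.split(",");
            if (parts.length != 3 || !parts[0].trim().startsWith("[")) {
                continue; // skip status lines and anything not in the three-column format
            }
            String section = parts[0].trim().replace("[", "").replace("]", "");
            String metric = parts[1].trim();
            try {
                metrics.put(section + "." + metric, Double.parseDouble(parts[2].trim()));
            } catch (NumberFormatException e) {
                // ignore lines without a numeric value
            }
        }
        return metrics;
    }
}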
To better integrate our prototype into existing build pipelines, we have implemented the CB Controller and the Visual Interface as a Jenkins plug-in6. The Visual Interface allows users to specify thresholds and provides line charts for trend analysis as well as functionality for detailed insights into single benchmark runs. In contrast to our Benchmark Manager, which is tailored to our experiments, the plug-in is generally applicable to arbitrary metrics.
B. Experiment Setup
For our experiments, we deployed Jenkins and our CB plug-
in on a single virtual machine (VM). We configured the plug-in
to run Cassandra on two other machines and YCSB on a third
machine (see figure 3).
In all experiments, Cassandra used the “SimpleStrategy”
for replication as we only had two nodes in the cluster; the
replication factor was two. YCSB used workload A with
the following configuration: fieldcount=10, fieldlength=100,
records=20,000, operations=1,000,000 and threads=100.
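As a rough back-of-the-envelope check of this configuration, the raw payload amounts to 20,000 records × 10 fields × 100 bytes = 20 MB per copy of the data set (before replication and storage overhead), i.e., a working set that easily fits into the memory of the instances described next.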
We ran our set of experiments on Amazon EC2 m3.medium
instances (3.75GB RAM, one CPU core) in the eu-west region,
all in the same availability zone. We used the Amazon Linux
AMI and ran all experiments on the same three VMs.
As input for our experiments, we used 465 commits of Cassandra’s commit history between Jan 3, 2017 and Oct 23, 2018 which merged changes into the main trunk. We tested this reduced commit history three times successively; thus, each of these commits was benchmarked three times at different points in time, i.e., we had almost 1,400 benchmark runs. Please note that a real build pipeline of course also involves further steps such as testing. For our experiments, we decided to exclude these steps for reasons of simplicity.
C. Results
Figure 4 shows the results of our experiments as returned by YCSB. When ignoring outliers, which can be expected when experimenting in the cloud [14], the values are mostly densely packed and thus indicate the overall performance trend.
5 https://www.ansible.com/
6 https://github.com/jenkinsci/benchmark-evaluator-plugin
Fig. 4. Total Benchmark Runtime (total benchmark runtime in seconds over commit date for Runs 1–3)
At this stage, we did not specify any absolute or relative
thresholds in our plug-in as we wanted the entire commit series
to run through.
D. Application of Threshold Metrics
Following our approach and the metrics defined in section III-C, we applied these thresholds to the median benchmark measurement to exclude outliers while still evaluating actual measurement values. For the total benchmark runtime, we set 950 s and 1,100 s as FV thresholds. We chose t = 5% as the relative threshold for JD and t = 4% for TD, which considers b = 20 builds.
Figure 5 illustrates these thresholds along with the results from our median experiment run. An intersection of the median line and one of the other lines means that the respective build would be rejected.
Fig. 5. Total Benchmark Runtime: Median Run and Threshold Metrics
The fixed value boundaries trigger only once, for the sudden performance improvement (ca. 13.5%) towards the end of the time series. Here, the developers introduced two features, “Flush netty client messages immediately by default” and “Improve TokenMetaData cache populating performance avoid long locking”, which indicates that our detection is a false positive. Our jump detection algorithm would reject one build on May 10, 2017 (jump of around 6.5%) which, according to the Git commit messages, was caused either by “Forbid unsupported creation of SASI indexes over partition key column” or by “Avoid reading static row twice from legacy sstables”; the first one, however, seems more likely to be the cause. The trend detection, on the other hand, would reject 4 builds. The most significant violating build was on Aug 2, 2018 (performance drop of almost 4.6%); it merely moves some code comments without touching any functionality, thus, the main cause of this negative trend lies in the preceding builds and further analysis would be necessary.
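To make these checks concrete: plugging the reported jump of about 6.5% into equation (2) with t = 5 yields 100 · (m_c / m_{c-1} − 1) ≈ 6.5 > 5, so JD rejects the build; analogously, the deviation of almost 4.6% from the 20-build moving average exceeds the TD threshold of t = 4 in equation (3).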
We only applied our defined metrics to the total runtime. Of
course, there would be more metrics in other, more complex
scenarios.
V. DISCUSSION
Based on our measurement results, we believe that CB is a very useful approach for keeping the QoS of a system either constant or continuously improving it while using the same amount of infrastructure resources. With our proof-of-concept
prototype, we have also shown that the integration of CB into
a build pipeline is indeed possible and does not involve a lot of
effort – in fact, CB is simply integrated into the build process
through our prototype which is automatically triggered for new
versions. There are, however, also a number of open challenges and caveats.
Running CB will typically create additional costs for the de-
velopment process; there is a tradeoff between how frequently
CB is run, i.e., how early QoS problems can be detected, and
the costs associated with that. We believe that this tradeoff
is system-specific and cannot be solved in a general way.
Developers also have to decide whether they plan to run CB
on dedicated physical machines on-premises or whether they
shift benchmark execution to the cloud. Depending on the
frequency of CB runs, the on-premises option may be less
expensive. The cloud option, in contrast, allows running several benchmarks (and benchmark runs) in parallel so that it will be the preferred option when CB uses a set of benchmarks instead of a single benchmark only.
This choice between on-premises non-virtualized hardware
and the cloud option is also related to the variance of results:
We believe that running the CB process on dedicated hardware
will produce more stable results with less variance across
experiment runs. When running experiments in the cloud, we would recommend running an initial experiment with at least ten runs (more is better) to get a better understanding of the variance effects caused by the underlying infrastructure. This would then also determine the number of necessary repetitions during the actual CB execution. Based on our AWS results, we would recommend running the experiment at least three times (preferably five times) and using the median result for further analysis and decision making.
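A minimal helper for this recommendation, assuming the per-run metric values have already been collected, could look as follows (again an illustrative sketch, not part of our prototype).

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Helper for picking the median metric value across repeated benchmark runs.
public class RunAggregation {

    // Returns the median of the given run results
    // (average of the two middle values for an even number of runs).
    public static double median(List<Double> runResults) {
        List<Double> sorted = new ArrayList<>(runResults);
        Collections.sort(sorted);
        int n = sorted.size();
        if (n % 2 == 1) {
            return sorted.get(n / 2);
        }
        return (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }
}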
There is also the question of when to trigger a CB run.
In fact, we do not believe that running one for every single Git commit is the ideal, albeit too expensive, scenario. In our opinion, this would simply create too much data so that developers may no longer be able to visually comprehend the results. Comparable to our evaluation approach, we would recommend triggering CB whenever a new feature branch is merged into the main branch. This also allows developers
to manually override QoS thresholds when there are external
events which mandate feature or configuration updates. For
instance, when a new vulnerability in a TLS cipher suite is
detected, switching cipher suites may be necessary but might
have a strong impact on system performance [1]–[3].
There is also the challenge of finding a benchmark in the
first place. In our case, we were running Apache Cassandra for
which a number of open source benchmarks and benchmark
tools exist. For some custom microservice, this will typically
not be the case. In such scenarios, developers have to build
their own benchmark first – comparable to test-driven devel-
opment – which causes additional costs for personnel.
Finally, to conclude all cost aspects: CB will directly cause additional costs for the CB infrastructure and personnel costs for the management or development of benchmarks. These costs, however, will likely be offset by the avoided indirect costs of unhappy customers or the avoided direct costs of compensation payments for SLA violations. Balancing these costs is a non-trivial task that is application-specific and should probably be approached in an agile way with continuous readaptation.
VI. RELATED WORK
CB is a powerful mechanism for evaluating the QoS of a new system version in a production-like environment. As such, it relies on benchmarking approaches such as [8], [12], [15]–[19]. An alternative but also complementary approach to CB are live testing techniques such as canary releases [20] or dark launches [21]. In contrast to CB, live testing is characterized by the fact that a new version (of a software artifact) is directly deployed into the production environment in parallel with the older version.
For canary releases [20], this new version is initially rolled
out for a very small subset of users and developers monitor
its behavior in production. If there are errors or QoS issues in
the new version, the impact only affects a few users and the
version is reverted or shut down. Otherwise, more and more
users are added to the set of test users until the new version
has completely been rolled out.
While canary releases aim to only affect a small subset of users in case of failures, dark (or shadow) launches [21], [22] avoid affecting users entirely by deploying a new version in the production environment without serving real user traffic – so-called shadow instances. This way, no user is confronted with the new version and its potential issues.
Live testing techniques can be used to detect performance
and other QoS issues in production. However, testing new
versions in a production environment might be problematic
for several reasons: First, a production system is usually in
a normal state with usual load and regular traffic. Thus, a
new version is never evaluated in production under extreme
conditions or for rare corner cases. Second, a roll-out of several
new versions of multiple software artifacts is administratively
complex and error-prone, though tools like BiFrost [23] try
to overcome these problems. Third, theoretical setups and
architectures including new versions are hard to evaluate with
live testing techniques. Finally, live testing does not necessarily
create the right data to identify QoS degradation in the system
release as varying workloads depending on user traffic will
lead to varying observable QoS behavior. All this can be
done with CB, e.g., by creating benchmark setups for extreme
load peak situations. As benchmarking, however, can never
be identical to a production load, we propose to combine the
strengths of both approaches, i.e., to use both live testing and
CB in parallel.
Comparable to our approach, Waller et al. [24] also proposed to include benchmarking in CI pipelines and presented a Jenkins plug-in for this purpose. In contrast to our approach,
however, they focus on measuring performance overheads of a
code instrumentation tool. This is rather different from generic
system benchmarking of distributed systems but supports the
relevance of our approach.
Beyond these, there are two Jenkins plug-ins which could
handle the task of our Visual Interface: the Performance7 plug-in and the Benchmark plug-in8. Both, however, have explicitly
been designed for small single-machine micro benchmarks
such as single method benchmarks.
VII. CONCLUSION
Complex systems are very sensitive to change and even the
smallest change can strongly affect QoS. In particular, such changes occur frequently when releasing new software versions. Existing CI/CD pipelines, however, focus on functional testing or single-method performance measurements and, hence, cannot detect changes in QoS.
In this paper, we proposed a new approach called Continuous Benchmarking in which one or more system benchmarks are run as an additional step in the build pipeline. Measurement results from these benchmark runs are then compared either to absolute thresholds, e.g., as specified in an SLA, or to relative thresholds which compare the result to previous results to assert that QoS levels always improve or at least remain constant across releases. We have prototypically implemented CB using Apache Cassandra as SUT and YCSB as benchmarking client and evaluated our approach by replaying almost two years of Cassandra’s commit history.
7 https://plugins.jenkins.io/performance
8 https://github.com/jenkinsci/benchmark-plugin
REFERENCES
[1] S. Müller, D. Bermbach, S. Tai, and F. Pallas, “Benchmarking the performance impact of transport layer security in cloud database systems,” in Proc. of IC2E. IEEE, 2014.
[2] F. Pallas, J. Günther, and D. Bermbach, “Pick your choice in hbase: Security or performance,” in Big Data. IEEE, 2016.
[3] F. Pallas, D. Bermbach, S. Müller, and S. Tai, “Evidence-based security configurations for cloud datastores,” in Proc. of SAC. ACM, 2017.
[4] J. Brutlag, “Speed matters for google web search,” 2009.
[5] M. Fowler and M. Foemmel, “Continuous integration,” ThoughtWorks, vol. 122, 2006.
[6] D. Bermbach, E. Wittern, and S. Tai, Cloud Service Benchmarking: Measuring Quality of Cloud Services from a Client Perspective. Springer, 2017.
[7] D. Abadi, “Consistency tradeoffs in modern distributed database system design: Cap is only part of the story,” IEEE Computer, vol. 45, no. 2, 2012.
[8] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, “Benchmarking cloud serving systems with ycsb,” in Proc. of SOCC. ACM, 2010.
[9] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, “Bigtable: A distributed storage system for structured data,” in Proc. of OSDI. USENIX Association, 2006.
[10] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The google file system,” in Proc. of SOSP. ACM, 2003.
[11] M. Burrows, “The chubby lock service for loosely-coupled distributed systems,” in Proc. of OSDI. USENIX Association, 2006.
[12] D. Bermbach, J. Kuhlenkamp, A. Dey, A. Ramachandran, A. Fekete, and S. Tai, “BenchFoundry: A benchmarking framework for cloud storage services,” in Proc. of ICSOC. Springer, 2017.
[13] D. Bermbach, “Benchmarking eventually consistent distributed storage systems,” Ph.D. dissertation, Karlsruhe Institute of Technology, 2014.
[14] D. Bermbach, “Quality of cloud services: Expect the unexpected,” IEEE Internet Computing, 2017.
[15] C. Binnig, D. Kossmann, T. Kraska, and S. Loesing, “How is the weather tomorrow?: Towards a benchmark for the cloud,” in Proc. of DBTEST. ACM, 2009.
[16] D. E. Difallah, A. Pavlo, C. Curino, and P. Cudre-Mauroux, “Oltp-bench: An extensible testbed for benchmarking relational databases,” Proc. of VLDB Endowment, vol. 7, no. 4, 2013.
[17] D. Bermbach and E. Wittern, “Benchmarking web api quality,” in Proc. of ICWE. Springer, 2016.
[18] A. H. Borhani, P. Leitner, B. S. Lee, X. Li, and T. Hung, “Wpress: An application-driven performance benchmark for cloud-based virtual machines,” Proc. of EDOC, 2014.
[19] D. Bermbach, J. Kuhlenkamp, A. Dey, S. Sakr, and R. Nambiar, “Towards an extensible middleware for database benchmarking,” in Proc. of TPCTC. Springer, 2014.
[20] J. Humble and D. Farley, Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Addison-Wesley Professional, 2010.
[21] D. G. Feitelson, E. Frachtenberg, and K. L. Beck, “Development and deployment at facebook,” IEEE Internet Computing, vol. 17, no. 4, 2013.
[22] C. Tang, T. Kooburat, P. Venkatachalam, A. Chander, Z. Wen, A. Narayanan, P. Dowell, and R. Karl, “Holistic configuration management at facebook,” in Proc. of SOSP. ACM, 2015.
[23] G. Schermann, D. Schöni, P. Leitner, and H. C. Gall, “Bifrost: Supporting continuous deployment with automated enactment of multi-phase live testing strategies,” in Proc. of Middleware. ACM, 2016.
[24] J. Waller, N. C. Ehmke, and W. Hasselbring, “Including performance benchmarks into continuous integration to enable devops,” ACM SIGSOFT Software Engineering Notes, vol. 40, no. 2, 2015.