Int J Parallel Prog
DOI 10.1007/s10766-010-0161-2
Milepost GCC: Machine Learning Enabled Self-tuning
Compiler
Grigori Fursin · Yuriy Kashnikov · Abdul Wahid Memon · Zbigniew Chamski · Olivier Temam · Mircea Namolaru · Elad Yom-Tov · Bilha Mendelson · Ayal Zaks · Eric Courtois · Francois Bodin · Phil Barnard · Elton Ashton · Edwin Bonilla · John Thomson · Christopher K. I. Williams · Michael O'Boyle
Received: 12 February 2009 / Accepted: 30 November 2010
© Springer Science+Business Media, LLC 2011
Abstract  Tuning compiler optimizations for rapidly evolving hardware makes porting and extending an optimizing compiler for each new platform extremely challenging. Iterative optimization is a popular approach to adapting programs to a new architecture automatically using feedback-directed compilation. However, the large number of evaluations required for each program has prevented iterative compilation from widespread take-up in production compilers. Machine learning has been proposed to tune optimizations across programs systematically but is currently limited to a few transformations, long training phases and critically lacks publicly released, stable tools. Our approach is to develop a modular, extensible, self-tuning optimization infrastructure to automatically learn the best optimizations across multiple programs and architectures based on the correlation between program features, run-time behavior and optimizations. In this paper we describe Milepost GCC, the first publicly-available open-source machine learning-based compiler. It consists of an Interactive Compilation Interface (ICI) and plugins to extract program features and exchange optimization data with the cTuning.org open public repository. It automatically adapts the internal optimization heuristic at function-level granularity to improve execution time, code size and compilation time of a new program on a given architecture. Part of the MILEPOST technology, together with the low-level ICI-inspired plugin framework, is now included in mainline GCC. We developed machine learning plugins based on probabilistic and transductive approaches to predict good combinations of optimizations. Our preliminary experimental results show that it is possible to automatically reduce the execution time of individual MiBench programs, some by more than a factor of 2, while also improving compilation time and code size. On average we are able to reduce the execution time of the MiBench benchmark suite by 11% for the ARC reconfigurable processor. We also present a realistic multi-objective optimization scenario for the Berkeley DB library using Milepost GCC, improving execution time by approximately 17% while reducing compilation time and code size by 12% and 7% respectively on an Intel Xeon processor.

Keywords  Machine learning compiler · Self-tuning compiler · Adaptive compiler · Automatic performance tuning · Machine learning · Program characterization · Program features · Collective optimization · Continuous optimization · Multi-objective optimization · Empirical performance tuning · Optimization repository · Iterative compilation · Feedback-directed compilation · Adaptive compilation · Optimization prediction · Portable optimization

G. Fursin (✉) · Z. Chamski · O. Temam
INRIA Saclay, Parc Club Orsay Universite, 3 rue Jean Rostand, 91893 Orsay, France
e-mail: grigori.fursin@unidapt.org

G. Fursin · Y. Kashnikov · A. W. Memon
University of Versailles Saint Quentin en Yvelines, 45 avenue des Etats Unis, 78000 Versailles, France

M. Namolaru · E. Yom-Tov · B. Mendelson · A. Zaks
IBM Research Lab, Haifa University Campus, Mount Carmel, 31905 Haifa, Israel

E. Courtois · F. Bodin
CAPS Entreprise, 4 Allée Marie Berhaut, 35000 Rennes, France

P. Barnard · E. Ashton
ARC International, St. Albans AL1 5HE, UK

E. Bonilla · J. Thomson · C. K. I. Williams · M. O'Boyle
University of Edinburgh, Informatics Forum, 10 Crichton Street, Edinburgh EH8 9AB, UK
1 Introduction
Designers of new processor architectures attempt to bring higher performance and
lower power across a wide range of programs while keeping time to market as short as
possible. However, static compilers fail to deliver satisfactory levels of performance as they cannot keep pace with the rate of hardware evolution. Fixed heuristics based on simplistic hardware models, together with a lack of run-time information, mean that much manual retuning of the compiler is needed for each new hardware generation. Typical systems now have multiple heterogeneous reconfigurable cores, making such manual compiler tuning increasingly infeasible.
The difficulty of achieving portable performance has led to empirical iterative com-
pilation for statically compiled programs [7,64,17,44,16,46,18,77,27,67,37,38,26,
29,15,30], applying automatic compiler tuning based on feedback-directed compi-
lation. Here the static optimization model of a compiler is replaced by an iterative
search of the optimization space to empirically find the most profitable solutions that
improve execution time, compilation time, code size, power and other metrics. Need-
ing little or no knowledge of the current platform, this approach can adapt programs to any given architecture and is currently used in library generators and adaptive tools [84,56,71,68,1,25]. However, it is generally limited to searching for combinations of global compiler optimization flags and tweaking a few fine-grain transformations within relatively narrow search spaces. The main barrier to its wider
use is the excessive compilation and execution time currently needed to optimize each program, which has so far prevented wider adoption of iterative compilation in general-purpose compilers.
Our approach to solving this problem is to use machine learning, which has the potential to reuse knowledge across iterative compilation runs, gaining the benefits of iterative compilation while reducing the number of executions needed. The objective of the Milepost project [58] is to develop compiler technology that can automatically learn how to best optimize programs for configurable heterogeneous processors based on the correlation between program features, run-time behavior and optimizations. It also aims to dramatically reduce the time to market of configurable or frequently evolving systems. Rather than developing a specialized compiler by hand for each configuration, Milepost aims to produce optimizing compilers automatically.
A key goal of the project is to make machine learning based compilation a realistic
technology for general-purpose production compilers. Current approaches [60,12,72,
2,11] are highly preliminary, limited to global compiler flags or a few transformations
considered in isolation. GCC was selected as the compiler infrastructure for Milepost
as it is currently the most stable and robust open-source compiler. GCC is currently the
only production compiler that supports more than 30 different architectures and has
multiple aggressive optimizations, making it a natural vehicle for our research. Each
new version usually features new transformations and it may take months to adjust
each optimization heuristic, if only to prevent performance degradation in any of the
supported architectures. This further emphasizes the need for an automated approach.
We use the Interactive Compilation Interface (ICI) [41,40] that separates the opti-
mization process from a particular production compiler. ICI is a plugin system that
acts as a “middleware” interface between production compilers such as GCC and user-
definable research plugins. ICI allowed us to add a program feature extraction module
and to select arbitrary optimization passes in GCC. In the future, the compiler-independent ICI should help transfer Milepost technology to other compilers. We connected
Milepost GCC to a public collective optimization database at cTuning.org [14,28,32].
This provides a wealth of continuously updated training data from multiple users and
environments.
In this paper we present experimental results showing that it is possible to automatically improve the performance of the well-known MiBench [36] benchmark suite using iterative compilation and machine learning on several platforms, including x86 (Intel and AMD) and the ARC configurable core family. Using Milepost GCC, after a few weeks of training, we were able to learn a model that automatically improves the execution time of some individual MiBench programs by a factor of more than 2, while improving the overall MiBench suite by 11% on the reconfigurable ARC architecture, often without sacrificing code size or compilation time. Furthermore, our approach supports general multi-objective optimization where a user can choose to minimize not only execution time but also code size and compilation time.
This paper is organized as follows: this section provided the motivation for our research. Section 2 describes the experimental setup used throughout the article. Section 3 describes how iterative compilation can deliver multi-objective optimization. Section 4 describes the overall Milepost collaborative infrastructure, while Sect. 5 describes the machine learning techniques used to predict good optimizations based on program features and provides an experimental evaluation. It is followed by sections on related and future work.
2 Experimental Setup
The tools, benchmarks, architectures and environment used throughout the article are
briefly described in this section.
2.1 Compiler
We considered several compilers for our research and development, including Open64 [65], LLVM [53], ROSE [70], Phoenix [69] and GCC [34]. GCC was selected as it is a mature and popular open-source optimizing compiler that supports many languages, has a large community, is competitive with the best commercial compilers, and features a large number of program transformation techniques, including advanced optimizations such as the polyhedral transformation framework (GRAPHITE) [78]. Furthermore, GCC is the only extensible open-source optimizing compiler that supports more than 30 processor families. However, the techniques we developed are not compiler dependent. We selected the latest GCC 4.4.x as the base for our machine-learning enabled self-tuning compiler.
2.2 Optimizations
There are approximately 100 flags available for tuning in the most recent version of GCC, all of which are considered by our framework. However, validating all possible combinations of optimizations is impossible due to their sheer number. Since GCC was not originally designed for iterative compilation, it is not always possible to explore the entire optimization space by simply combining multiple compiler optimization flags, because some of them are activated only at a given global GCC optimization level (-Os, -O1, -O2, -O3). We overcome this issue by first selecting a global optimization level (-O1 .. -O3) and then either turning on a particular optimization through the corresponding -f<optimization name> flag or turning it off using the -fno-<optimization name> flag. In some cases, certain combinations of compiler flags or passes cause the compiler to crash or produce incorrect program execution. We reduce the probability of such cases by comparing the outputs of programs with reference outputs.
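To make this selection procedure concrete, the following minimal sketch generates one candidate combination on top of a global -O level and validates the resulting binary against a reference output. It is not part of the Milepost distribution; the flag subset, file names and helpers are illustrative assumptions.

```python
import random
import subprocess

# Small illustrative subset of the ~100 GCC optimization flags considered.
FLAGS = ["tree-vectorize", "unroll-all-loops", "schedule-insns", "gcse", "ivopts"]

def random_combination():
    """Pick a global -O level, then explicitly enable or disable each flag."""
    args = [random.choice(["-O1", "-O2", "-O3"])]
    for f in FLAGS:
        # Turn the optimization on (-f...) or off (-fno-...).
        args.append(f"-f{f}" if random.random() < 0.5 else f"-fno-{f}")
    return args

def build_and_validate(src, combo, reference_output):
    """Compile with the candidate flags and compare program output
    against the recorded reference run (here compiled with -O3)."""
    subprocess.run(["gcc", *combo, src, "-o", "./a.out"], check=True)
    result = subprocess.run(["./a.out"], capture_output=True, text=True)
    return result.stdout == reference_output  # discard miscompiled combinations

print("candidate:", " ".join(random_combination()))
```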
2.3 Platforms
We selected two general-purpose and one embedded processor for evaluation:
AMD – a cluster of 16 AMD Opteron 2218 machines, 2.6 GHz, 4 GB main memory, 2 MB L2 cache, running Debian Linux Sid x64 with kernel 2.6.28.1 (provided by GRID5000 [35])

Intel – a cluster of 16 Intel Xeon EM64T machines, 3 GHz, 2 GB main memory, 1 MB L2 cache, running Debian Linux Sid x64 with kernel 2.6.28.1 (provided by GRID5000)

ARC – an FPGA implementation of the ARC 725D reconfigurable processor, 200 MHz, 32 KB L1 cache, running Linux ARC with kernel 2.4.29

We specifically selected platforms that have been on the market for some time but are not outdated, to allow a fair comparison of our optimization techniques with default compiler optimization heuristics that had been reasonably hand-tuned.
2.4 Benchmarks and Experiments
We use both embedded and server processors, so we selected the MiBench/cBench [36,29,28] benchmark suite for evaluation, covering a broad range of applications from simple embedded functions to larger desktop/server programs. Most of the benchmarks have been rewritten to be easily portable to different architectures; we use dataset 1 in all cases. We encountered problems while compiling 4 tiff programs on the ARC platform and hence used them only on the AMD and Intel platforms.
We use OProfile [66] with hardware counter support to perform non-intrusive function-level profiling during each run. This tool may introduce some overhead, so we executed each compiled program three times and averaged the execution and compilation times. In the future, we plan to use more statistically rigorous approaches [75,33]. For this study, we selected the most time-consuming function from each benchmark for further analysis and optimization. If a program has several hot functions depending on the dataset, we analyze and optimize them one by one and report the results separately. Analyzing the effects of interactions between multiple functions on optimization is left for future work.
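As an illustration of this measurement protocol, the following sketch averages wall-clock time over three runs. It is our own simplification; the paper itself relies on OProfile and the CCC framework for the actual measurements.

```python
import statistics
import subprocess
import time

def measure(binary, runs=3):
    """Execute a compiled benchmark several times and average wall-clock
    time, smoothing out measurement and profiling overhead."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run([binary], check=True, capture_output=True)
        times.append(time.perf_counter() - start)
    return statistics.mean(times)
```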
2.5 Collective Optimization Database
All experimental results were recorded in the public Collective Optimization Database
[14,28,32] at cTuning.org, allowing independent analysis of our results.
3 Motivation
This section shows that tuning the optimization heuristics of an existing real-world compiler for multiple objectives such as execution time, code size and compilation time is a non-trivial task. We demonstrate that iterative compilation can effectively solve this problem, but often with excessive search costs, which motivates the use of machine learning to mitigate the need for per-program iterative compilation and to learn optimizations across programs based on their features.
3.1 Multi-Objective Empirical Iterative Optimization
Iterative compilation is a popular method to explore different optimizations by exe-
cuting a given program on a given architecture and finding good solutions to improve
program execution time and other characteristics based on empirical search.
We selected 88 program transformations of GCC known to influence performance, including inlining, unrolling, scheduling, register allocation and constant propagation. We selected 1000 combinations of optimizations using a random search strategy with a 50% probability of turning each flag on or off. We use this strategy to allow uniform, unbiased exploration of unknown optimization search spaces. In order to validate the resulting diversity of program transformations, we checked that no two combinations of optimizations generated the same binary for any of the benchmarks, using the MD5 checksum of the assembler code obtained through the objdump -d command. Occasionally, random selection of flags in GCC may result in invalid code. In order to avoid such situations, we validated all generated combinations of optimizations by comparing the outputs of all benchmarks used in our study with the outputs recorded during reference runs compiled with the -O3 global optimization level.
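The binary-diversity check can be sketched as follows; this is an illustrative reconstruction in which only the objdump -d disassembly and MD5 ingredients come from the text.

```python
import hashlib
import subprocess

def binary_signature(path):
    """MD5 checksum of the disassembled code, mirroring the paper's
    objdump -d check that two flag combinations produced different binaries."""
    asm = subprocess.run(["objdump", "-d", path],
                         capture_output=True, check=True).stdout
    return hashlib.md5(asm).hexdigest()

seen = set()

def is_new_binary(path):
    """True only if no earlier combination generated identical code."""
    sig = binary_signature(path)
    if sig in seen:
        return False
    seen.add(sig)
    return True
```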
Figure 1 shows the best execution time speedup achieved for each benchmark over the highest GCC optimization level (-O3) after 1000 iterations across the 3 selected architectures. It confirms results from previous research on iterative compilation and demonstrates that it is possible to outperform GCC's highest default optimization level for most programs using random iterative search for good combinations of optimizations. Several benchmarks achieve more than a 2 times speedup, while on average we reached speedups of 1.33 and 1.4 for Intel and AMD respectively, and a smaller speedup of 1.15 for ARC, likely due to its simpler architecture and lower sensitivity to program optimizations.

Fig. 1 Maximum execution time speedups over the highest GCC optimization level (-O3) using iterative compilation with uniform random distribution after 1,000 iterations on 3 selected architectures
However, the task of an optimizing compiler is not only to improve execution time but also to balance code size and compilation time across a wide range of programs and architectures. The violin graphs¹ in Fig. 2 show high variation of execution time speedups, code size improvements and compilation time speedups during iterative compilation across all benchmarks on the Intel platform. Multi-objective optimization in such cases depends on end-user usage scenarios: improving both execution time and code size is often required for embedded applications, improving both compilation and execution time is important for data centers and real-time systems, while improving only execution time is common for desktops and supercomputers.

¹ Violin graphs are similar to box graphs, showing the probability density in addition to the min, max and interquartile range.
As an example, in Fig. 3 we present the execution time speedups versus code size improvements and versus compilation time for susan_c on the AMD platform. Naturally, depending on the optimization scenario, users are interested in optimization cases on the frontier of the program optimization area. Circles on these graphs show the 2D frontier that improves at least two metrics, while squares show optimization cases where a speedup is also achieved on the third optimization metric beyond some threshold (compilation time speedup of more than 2 in the first graph and code size improvement of more than 1.2 in the second graph). These graphs demonstrate that for
this selected benchmark and architecture there are relatively many optimization cases
that improve execution time, code size and compilation time simultaneously. This is
because many flags turned on for the default optimization level (-O3) do not influ-
ence this program or even degrade performance and take considerable compilation
time.
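A simple way to compute such a 2D frontier from recorded optimization cases is sketched below. This is our own illustration; the tuple layout and thresholds are assumptions.

```python
def dominates(a, b):
    """True if case `a` is at least as good as `b` on both plotted metrics
    and is a different point (2D Pareto dominance)."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b

def frontier_2d(cases):
    """Keep the cases not dominated by any other: the circles in Fig. 3."""
    return [c for c in cases if not any(dominates(o, c) for o in cases)]

# Each case: (execution time speedup, code size improvement, compilation time speedup)
cases = [(1.4, 1.1, 2.3), (1.2, 1.3, 0.9), (1.1, 1.0, 1.0)]
front = frontier_2d([c[:2] for c in cases])
# Squares: frontier cases whose third metric also exceeds a threshold.
squares = [c for c in cases if c[:2] in front and c[2] > 2.0]
```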
Figure 4 summarizes the code size improvements and compilation time speedups achievable on the Intel platform across the evaluated programs, for optimization cases with execution time speedups within 95% of the maximum found during iterative compilation. We can observe that in some cases we can improve execution time, code size and compilation time simultaneously, for example for susan_c and dijkstra. In other cases, without degrading execution time relative to the default optimization level (-O3), we can considerably improve compilation time (by more than 1.7 times) and code size, for example for jpeg_c and patricia. Throughout the rest of the article, we consider improving execution time to be of primary importance, followed by code size and compilation time. However, our self-tuning compiler can work with other arbitrary optimization scenarios. Users may provide their own plugins to choose optimal solutions, for example using a Pareto distribution as shown in [37,38].
The combinations of flags corresponding to Fig. 4 across all programs and architectures are presented² in Table 1. Some combinations can reduce compilation time by 70%, which can be critical when compiling large-scale applications or for cloud computing services where a quick response time matters. The diversity of compiler optimizations involved demonstrates that the compiler optimization space is not trivial and that the compiler's best default optimization heuristic (-O3) is far from optimal. All combinations of flags found per program and architecture during this research are available on-line in the Collective Optimization Database [14] to allow end-users to optimize their programs and to enable further collaborative research.

² The flags that do not influence execution time, code size or compilation time have been iteratively and automatically removed from the original combinations of random optimizations using the CCC framework to simplify the analysis of the results.

Fig. 2 Distribution of execution time speedups, code size improvements and compilation time speedups on Intel platform during iterative compilation (1,000 iterations)

Fig. 3 Distribution of execution time speedups, code size improvements and compilation time speedups for benchmark susan_c on AMD platform during iterative compilation. Depending on the optimization scenario, good optimization cases are depicted with circles on the 2D optimization area frontier and with squares where the third metric exceeds some threshold (compilation time speedup > 2 or code size improvement > 1.2)
Finally, Fig. 5 shows that it may take on average 70 iterations before reaching 95% of the speedup available after 1000 iterations (averaged over 10 repetitions), a number that depends heavily on the programs and architectures. Such a large number of iterations is needed due to the increasing number of aggressive optimizations available in the compiler, where multiple combinations of optimizations can considerably increase or decrease performance and change code size and compilation time.

Fig. 4 Code size improvements and compilation time speedups for optimization cases with execution time speedups within 95% of the maximum available on Intel platform (as found by iterative compilation)

Fig. 5 Number of iterations needed to obtain 95% of the available speedup using iterative compilation with uniform random distribution

The experimental results of this section suggest that iterative compilation can effectively generalize and automate the program optimization process but can be too time consuming. Hence it is important to speed up iterative compilation. In the next section, we present the Milepost framework, which speeds up program optimization through machine learning.
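The quantity plotted in Fig. 5 can be computed from a recorded sequence of speedups as in the following sketch, an illustrative reconstruction rather than code from the CCC framework.

```python
def iterations_to_reach(speedups, fraction=0.95):
    """Number of iterations before the running best speedup first reaches
    `fraction` of the best speedup found over the whole search (cf. Fig. 5)."""
    target = fraction * max(speedups)
    best = 0.0
    for i, s in enumerate(speedups, start=1):
        best = max(best, s)
        if best >= target:
            return i
    return len(speedups)

# In the paper this count is averaged over 10 repetitions of the random search.
```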
Table 1 Best found combinations of Milepost GCC flags to improve execution time, code size and compilation time after iterative compilation (1000 iterations) across all evaluated benchmarks and platforms
-O1 -fcse-follow-jumps -fno-tree-ter -ftree-vectorize
-O1 -fno-cprop-registers -fno-dce -fno-move-loop-invariants -frename-registers
-fno-tree-copy-prop -fno-tree-copyrename
-O1 -freorder-blocks -fschedule-insns -fno-tree-ccp -fno-tree-dominator-opts
-O2
-O2 -falign-loops -fno-cse-follow-jumps -fno-dce -fno-gcse-lm -fno-inline-functions-called-once
-fno-schedule-insns2 -fno-tree-ccp -fno-tree-copyrename -funroll-all-loops
-O2 -finline-functions -fno-omit-frame-pointer -fschedule-insns -fno-split-ivs-in-unroller
-fno-tree-sink -funroll-all-loops
-O2 -fno-align-jumps -fno-early-inlining -fno-gcse -fno-inline-functions-called-once
-fno-move-loop-invariants -fschedule-insns -fno-tree-copyrename -fno-tree-loop-optimize
-fno-tree-ter -fno-tree-vrp
-O2 -fno-caller-saves -fno-guess-branch-probability -fno-ira-share-spill-slots -fno-tree-reassoc
-funroll-all-loops -fno-web
-O2 -fno-caller-saves -fno-ivopts -fno-reorder-blocks -fno-strict-overflow -funroll-all-loops
-O2 -fno-cprop-registers -fno-move-loop-invariants -fno-omit-frame-pointer -fpeel-loops
-O2 -fno-dce -fno-guess-branch-probability -fno-strict-overflow -fno-tree-dominator-opts
-fno-tree-loop-optimize -fno-tree-reassoc -fno-tree-sink
-O2 -fno-ivopts -fpeel-loops -fschedule-insns
-O2 -fno-tree-loop-im -fno-tree-pre
-O3 -falign-loops -fno-caller-saves -fno-cprop-registers -fno-if-conversion -fno-ivopts
-freorder-blocks-and-partition -fno-tree-pre -funroll-all-loops
-O3 -falign-loops -fno-cprop-registers -fno-if-conversion -fno-peephole2 -funroll-all-loops
-O3 -falign-loops -fno-delete-null-pointer-checks -fno-gcse-lm -fira-coalesce -floop-interchange
-fsched2-use-superblocks -fno-tree-pre -fno-tree-vectorize -funroll-all-loops
-funsafe-loop-optimizations -fno-web
-O3 -fno-gcse -floop-strip-mine -fno-move-loop-invariants -fno-predictive-commoning -ftracer
-O3 -fno-inline-functions-called-once -fno-regmove -frename-registers -fno-tree-copyrename
-O3 -fno-inline-functions -fno-move-loop-invariants
4 Milepost Optimization Approach and Framework
As shown in the previous section, iterative compilation can considerably outperform existing compilers, but at the cost of excessive recompilation and program execution during optimization search space exploration. Multiple techniques have been proposed to speed up this process. For example, the ACOVEA tool [1] utilizes genetic algorithms; hill-climbing search [27] and run-time function-level per-phase optimization evaluation [30] have been used, as well as Pareto distributions [37,38] to find multi-objective solutions. However, these approaches start their exploration of optimizations for a new program from scratch and do not reuse any prior optimization knowledge across different programs and architectures.
The Milepost project takes an orthogonal approach based on the observation that similar programs may exhibit similar behavior and require similar optimizations, so it is possible to correlate program features and optimizations, thereby predicting good transformations for unseen programs based on previous optimization experience [60,12,72,2,39,11,32]. In the current version of Milepost GCC we use static program features (such as the number of instructions in a method, number of branches, etc.) to characterize programs and build predictive models. Naturally, since static features may not be enough to capture run-time program behavior, we plan to add plugins that improve program and optimization correlation based on dynamic features (performance counters [11], microarchitecture-independent characteristics [39], reactions to transformations [32] or semantically non-equivalent program modifications [31]).
The next section describes the overall framework, followed by a detailed description of Milepost GCC and the Interactive Compilation Interface. This is then followed by a discussion of the features used to predict good optimizations.
4.1 Milepost Adaptive Optimization Framework
The Milepost framework shown in Fig. 6 uses a number of components: (i) a machine learning enabled Milepost GCC with the Interactive Compilation Interface (ICI) to modify internal optimization decisions, (ii) a Continuous Collective Compilation Framework (CCC) to perform an iterative search for good combinations of optimizations, and (iii) a Collective Optimization Database (COD) to record compilation and execution statistics in a common repository. This information is later used as training data for the machine learning models. We use the public COD hosted at cTuning.org [14,28,32]. The Milepost framework currently proceeds in two distinct phases, in accordance with typical machine learning practice: training and deployment.
Fig. 6 Open framework to automatically tune programs and improve default optimization heuristics using predictive machine learning techniques: Milepost GCC with Interactive Compilation Interface (ICI) and program feature extractor, the CCC framework to train ML models and predict good optimization passes, and the COD optimization repository at cTuning.org

Training During the training phase we need to gather information about the structure of programs and record how they behave when compiled under different optimization settings. Such information allows machine learning tools to correlate aspects of program structure, or features, with optimizations, building a strategy that predicts good combinations of optimizations.
In order to train a useful model, a large number of compilations and executions are
needed as training examples. These training examples are generated by CCC [13,28],
which evaluates different combinations of optimizations and stores execution time,
profiling information, code size, compilation time and other metrics in a database. The
features of the program are also extracted from Milepost GCC and stored in the COD.
Plugins allow fine-grained control and examination of the compiler, driven externally through shared libraries.
Deployment Once sufficient training data is gathered, multiple machine learning models can be created. Such models aim to correlate a given set of program features with profitable program transformations in order to predict good optimization strategies. They can later be re-inserted as plugins back into Milepost GCC or deployed as web services at cTuning.org. The latter method allows continuous updating of the machine learning model based on information collected from multiple users. When encountering a new program, Milepost GCC determines the program's features and passes them to the model to predict the most profitable optimizations to improve execution time or other metrics, depending on the user's optimization requirements.
4.2 Milepost GCC and Interactive Compilation Interface
Current production compilers often have fixed, black-box optimization heuristics without the means to fine-tune the application of transformations. This section describes the Interactive Compilation Interface (ICI) [40], which opens up a compiler and provides opportunities for external control and examination of its optimization decisions with minimal changes. To avoid the pitfall of revealing the compiler's intermediate representation and libraries to the point where too many internal details are exposed and further evolution is prevented, we choose to control the decision process itself, granting access only to the high-level features needed for effectively taking a decision. Optimization settings at a fine-grained level, beyond the capabilities of command line options or pragmas, can be managed through external shared libraries, leaving the compiler uncluttered. By replacing default optimization heuristics, execution time, code size and compilation time can be improved.
We decided to implement ICI for GCC and transform it into a research self-tuning
compiler to provide a common stable extensible compiler infrastructure shared by both
academia and industry, aiming to improve the quality, practicality and reproducibility
of research, and make experimental results immediately useful to the community.
The internal structure of ICI is shown in Fig. 7. We separate ICI into two parts:
a low-level compiler-dependent part and a high-level compiler-independent part, the main reason
being to keep high-level iterative compilation and machine learning plugins invariant
when moving from one compiler to another. At the same time, since plugins now
extend GCC through external shared libraries, experiments can be performed with no
further modifications to the underlying compiler.
Fig. 7 GCC Interactive Compilation Interface: a original GCC, b Milepost GCC with ICI and plugins
External plugins can transparently monitor execution of passes or replace the GCC
Controller (Pass Manager), if desired. Passes can be selected by an external plugin
which may choose to drive them in a very different order than that currently used
in GCC, even choosing different pass orderings for each and every function in the
program being compiled. This mechanism simplifies the introduction of new analysis
and optimization passes to the compiler.
In an additional set of enhancements, a coherent event and data passing mechanism
enables external plugins to discover the state of the compiler and to be informed as
it changes. At various points in the compilation process events (IC Event) are raised
indicating decisions about transformations. Auxiliary data (IC Data) is registered if
needed.
Using ICI, we can now substitute all default optimization heuristics with external
optimization plugins to suggest an arbitrary combination of optimization passes dur-
ing compilation without the need for any project or Makefile changes. Together with
additional routines needed for machine learning, such as program feature extraction,
our compiler infrastructure forms the Milepost GCC. We also added a ‘-Oml’ flag
which calls a plugin to extract features, queries machine learning model plugins and
substitutes the default optimization levels.
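A build script could then invoke this self-tuning mode as sketched below. Note that only the -Oml flag itself comes from the paper; the compiler binary name and source file are hypothetical and depend on how Milepost GCC was built and installed.

```python
import subprocess

# Hypothetical invocation: the binary name "milepost-gcc" is an assumption;
# only the -Oml flag is described in the paper.
subprocess.run(
    ["milepost-gcc", "-Oml", "bench.c", "-o", "bench"],
    check=True,
)
# -Oml extracts program features, queries the machine learning model plugins
# and substitutes the default optimization level with the predicted passes,
# without any project or Makefile changes.
```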
In this work, we do not investigate optimal orders of optimizations, since this requires detailed information about dependencies between passes to detect legal orders; we plan to provide this information in the future. Hence, we examine the pass orders generated by compiler flags during iterative compilation and focus on selecting or deselecting appropriate passes that improve program execution time, compilation time or code size. In the future, we will focus on fine-grain parametric transformations in MILEPOST GCC [40] and combine them with the POET scripting language [86].
4.3 Static Program Features
Our machine learning models predict the best GCC optimization to apply to an input
program based on its program structure or program features. The program features are
typically a summary of the internal program representation and characterize essential
aspects of a program that help to distinguish between good and bad optimizations.
The current version of ICI allows invoking auxiliary passes that are not part of GCC's default compiler optimization heuristics. These passes can monitor and profile the compilation process or extract the data structures needed for generating program features.
During compilation, a program is represented by several data structures implementing the intermediate representation (tree-SSA, RTL, etc.), the control flow graph (CFG), def-use chains, the loop hierarchy, and so on. The data structures available depend on the compilation pass currently being performed. For statistical machine learning, the information about these data structures is encoded in a constant-size vector of numbers (i.e. features); this process is called feature extraction and facilitates the reuse of optimization knowledge across different programs.
We implemented an additional ml-feat pass in GCC to extract static program features. This pass is not invoked during default compilation but can be called via the extract_program_static_features plugin after any arbitrary pass, when all data necessary to produce the features is available.
In Milepost GCC, feature extraction is performed in two stages. In the first stage,
a relational representation of the program is extracted; in the second stage, the vector
of features is computed from this representation. In the first stage, the program is considered to be characterized by a number of entities and relations over these entities. The entities are a direct mapping of similar entities defined by the language reference, or generated during compilation. Examples of such entities are variables, types, instructions, basic blocks, temporary variables, etc.
A relation over a set of entities is a subset of their Cartesian product. The relations specify properties of the entities or the connections among them. We use a notation based on logic for describing the relations: Datalog, a Prolog-like language with simpler semantics, suitable for expressing relations and operations upon them [83,79].
To extract the relational representation of the program, we used a simple method based on examination of the include files. The main data structures of the compiler are built using struct data types with a number of fields. Each such struct data type may introduce an entity, and each of its fields may introduce a relation between the entity representing the enclosing struct data type and the entity representing the data type of the field. This data is collected by the ml-feat pass.
In the second stage, we provide a Prolog program defining the features to be computed from the Datalog relational representation extracted from the compiler's internal data structures in the first stage.
Table 2 List of static program features currently available in Milepost GCC V2.1

ft1 Number of basic blocks in the method
ft2 Number of basic blocks with a single successor
ft3 Number of basic blocks with two successors
ft4 Number of basic blocks with more than two successors
ft5 Number of basic blocks with a single predecessor
ft6 Number of basic blocks with two predecessors
ft7 Number of basic blocks with more than two predecessors
ft8 Number of basic blocks with a single predecessor and a single successor
ft9 Number of basic blocks with a single predecessor and two successors
ft10 Number of basic blocks with two predecessors and one successor
ft11 Number of basic blocks with two successors and two predecessors
ft12 Number of basic blocks with more than two successors and more than two predecessors
ft13 Number of basic blocks with fewer than 15 instructions
ft14 Number of basic blocks with a number of instructions in the interval [15, 500]
ft15 Number of basic blocks with more than 500 instructions
ft16 Number of edges in the control flow graph
ft17 Number of critical edges in the control flow graph
ft18 Number of abnormal edges in the control flow graph
ft19 Number of direct calls in the method
ft20 Number of conditional branches in the method
ft21 Number of assignment instructions in the method
ft22 Number of binary integer operations in the method
ft23 Number of binary floating point operations in the method
ft24 Number of instructions in the method
ft25 Average number of instructions in basic blocks
ft26 Average number of phi-nodes at the beginning of a basic block
ft27 Average number of arguments for a phi-node
ft28 Number of basic blocks with no phi nodes
The extract_program_static_features plugin invokes a Prolog compiler to execute this program, resulting in a vector of features (as shown in Table 2) which later serves to detect similarities between programs, build machine learning models and predict the best combinations of passes for new programs. We provide more details about the aggregation of semantic program properties for machine learning based optimization in [63].
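To illustrate what such features look like, the sketch below computes a few of the Table 2 features from a toy control flow graph. In Milepost GCC this is actually done via Datalog relations processed by a Prolog program, so this Python version is purely illustrative.

```python
# Toy CFG: basic block id -> list of successor ids.
cfg = {0: [1, 2], 1: [3], 2: [3], 3: []}

def predecessors(cfg):
    """Invert the successor map to get predecessor lists per block."""
    preds = {b: [] for b in cfg}
    for b, succs in cfg.items():
        for s in succs:
            preds[s].append(b)
    return preds

preds = predecessors(cfg)
features = {
    "ft1": len(cfg),                                     # basic blocks in the method
    "ft2": sum(1 for s in cfg.values() if len(s) == 1),  # blocks with one successor
    "ft3": sum(1 for s in cfg.values() if len(s) == 2),  # blocks with two successors
    "ft9": sum(1 for b in cfg                            # one predecessor, two successors
               if len(preds[b]) == 1 and len(cfg[b]) == 2),
    "ft16": sum(len(s) for s in cfg.values()),           # edges in the CFG
}
print(features)
```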
5 Using Machine Learning to Predict Good Optimization Passes
The Milepost approach to learning optimizations across programs is based on the
observation that programs may exhibit similar behavior for a similar set of optimiza-
tions [2,32], and hence we try to apply machine learning techniques to correlate their features with the most profitable program optimizations.
Table 2 continued

ft29 Number of basic blocks with a number of phi nodes in the interval [0, 3]
ft30 Number of basic blocks with more than 3 phi nodes
ft31 Number of basic blocks where the total number of arguments for all phi-nodes is greater than 5
ft32 Number of basic blocks where the total number of arguments for all phi-nodes is in the interval [1, 5]
ft33 Number of switch instructions in the method
ft34 Number of unary operations in the method
ft35 Number of instructions that perform pointer arithmetic in the method
ft36 Number of indirect references via pointers ("*" in C)
ft37 Number of times the address of a variable is taken ("&" in C)
ft38 Number of times the address of a function is taken ("&" in C)
ft39 Number of indirect calls (i.e. done via pointers) in the method
ft40 Number of assignment instructions with the left operand an integer constant in the method
ft41 Number of binary operations with one of the operands an integer constant in the method
ft42 Number of calls with pointers as arguments
ft43 Number of calls with more than 4 arguments
ft44 Number of calls that return a pointer
ft45 Number of calls that return an integer
ft46 Number of occurrences of integer constant zero
ft47 Number of occurrences of 32-bit integer constants
ft48 Number of occurrences of integer constant one
ft49 Number of occurrences of 64-bit integer constants
ft50 Number of references to local variables in the method
ft51 Number of references (def/use) of static/extern variables in the method
ft52 Number of local variables referred to in the method
ft53 Number of static/extern variables referred to in the method
ft54 Number of local variables that are pointers in the method
ft55 Number of static/extern variables that are pointers in the method
ft56 Number of unconditional branches in the method
In this case, whenever we are given a new unseen program, we can search for similar programs within the training set and suggest good optimizations based on their optimization experience.
In order to test this assumption, we selected the combination of optimizations which yields the best performance for a given program on AMD (see "reference" in Fig. 8). We then applied each of these "best" combinations to all other programs and report the performance difference (see "applied to"). It can be seen that a fairly large number of programs share similar optimizations.
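The experiment behind Fig. 8 can be expressed as the following sketch, assuming a hypothetical evaluate(program, flags) helper that compiles the program with the given flags and returns its measured speedup.

```python
def cross_application_matrix(best_flags, evaluate):
    """best_flags: program -> its best flag combination found by search.
    evaluate(prog, flags) -> measured speedup of `prog` compiled with `flags`.
    Returns the % difference from each program's own best speedup (cf. Fig. 8)."""
    programs = list(best_flags)
    own_best = {p: evaluate(p, best_flags[p]) for p in programs}
    matrix = {}
    for ref in programs:        # rows: the "reference" program
        for tgt in programs:    # columns: the "applied to" program
            achieved = evaluate(tgt, best_flags[ref])
            matrix[(ref, tgt)] = 100.0 * (own_best[tgt] - achieved) / own_best[tgt]
    return matrix
```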
In the next subsections we introduce two machine learning techniques to select combinations of optimization passes, based on the construction of either a probabilistic model or a transductive model over a set of M training programs; these models are then used to predict "good" combinations of optimization passes for unseen programs based on their features.
Fig. 8 Percentage difference between the speedup achievable after iterative compilation for the "applied to" program and the speedup obtained when applying the best optimization from the "reference" program to the "applied to" program on AMD. "–" means that the best optimization was not found for this program
There are several differences between the two models: first, in our implementation
the probabilistic model assumes each attribute is independent, whereas the proposed
transductive model also analyzes interdependencies between attributes. Second, the
probabilistic model finds the closest programs from the training set to the test program,
whereas the transductive model attempts to generalize and identify good combinations
of flags and program attributes. Therefore, it is expected that in some settings pro-
grams will benefit more from the probabilistic approach, whereas in others programs
will be improved more by using the transductive method, depending on training set
size, number of samples of the program space, as well as program and architecture
attributes.
In order to train the two machine learning models, we generated 1000 random combinations of flags turned either on or off, as described in Sect. 3. Such a number of runs is small relative to the size of the optimization space, yet it provides enough optimization cases and sufficient information to capture good optimization choices. The program features for each benchmark, the flag settings and the execution times formed the training data for each model. All experiments were conducted using leave-one-out cross-validation: for each of the N programs, the other N − 1 programs are used as training data. This guarantees that each program is unseen when the model predicts good optimization settings, avoiding bias.
5.1 Probabilistic Machine Learning Model
Our probabilistic machine learning method is similar to that of [2], where a probability distribution over "good" solutions (i.e. optimization passes or compiler flags) is learnt across different programs. This approach has been referred to as Predictive Search Distributions (PSD) [8]. However, unlike prior work [2,8], where such a distribution is used to focus the search for compiler optimizations on a new program, we use the learnt distribution to make one-shot predictions on unseen programs. Thus we do not search for the best optimization; we automatically predict it.
Given a set of training programs $T_1, \ldots, T_M$, described by feature vectors $\mathbf{t}_1, \ldots, \mathbf{t}_M$, for which we have evaluated different combinations of optimization passes $\mathbf{x}$ and their corresponding execution times (or speedups) $y$, so that each program $T_j$ has an associated dataset $D_j = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N_j}$, with $j = 1, \ldots, M$, our goal is to predict a good combination of optimization passes $\mathbf{x}^*$ minimizing $y$ when a new program $T^*$ is presented.

We approach this problem by learning a mapping from the features of a program $\mathbf{t}$ to a distribution over good solutions $q(\mathbf{x} \mid \mathbf{t}, \theta)$, where $\theta$ are the parameters of the distribution. Once this distribution has been learnt, prediction for a new program $T^*$ is straightforward and is achieved by sampling at the mode of the distribution. In other words, we obtain the predicted combination of flags by computing:

$$\mathbf{x}^* = \operatorname*{argmax}_{\mathbf{x}} \; q(\mathbf{x} \mid \mathbf{t}^*, \theta). \qquad (1)$$
In order to learn the model it is necessary to fit a distribution over good solutions to
each training program beforehand. These solutions can be obtained, for example, by
using uniform sampling or by running an estimation of distribution algorithm (EDA,
see [47] for an overview) on each of the training programs. In our experiments we
use uniform sampling and we choose the set of good solutions to be those optimi-
zation settings that achieve at least 98% of the maximum speed-up available in the
corresponding program-dependent dataset.
Let us denote the distribution over good solutions on each training program by $P(\mathbf{x} \mid T_j)$ with $j = 1, \ldots, M$. In principle, these distributions can belong to any parametric family. However, in our experiments we use an IID model where each of the elements of the combination is considered independently. In other words, the probability of a "good" combination of passes is simply the product of the individual probabilities describing how likely each pass is to belong to a good solution:

$$P(\mathbf{x} \mid T_j) = \prod_{\ell=1}^{L} P(x_\ell \mid T_j), \qquad (2)$$

where $L$ is the length of the combination.
Fig. 9 Euclidean distance for all programs based on static program features normalized by feature 24 (number of instructions in a method)
Once the individual training distributions $P(\mathbf{x} \mid T_j)$ are obtained, the predictive distribution $q(\mathbf{x} \mid \mathbf{t}, \theta)$ can be learnt by maximization of the conditional likelihood or by using k-nearest-neighbor methods. In our experiments we use a 1-nearest-neighbor approach (Fig. 9 shows the Euclidean distances between all programs, with visible clustering). In other words, we set the predictive distribution $q(\mathbf{x} \mid \mathbf{t}, \theta)$ to be the distribution corresponding to the training program that is closest in feature space to the new (test) program.
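Putting the pieces together, a minimal sketch of this predictor looks as follows. It is our own illustration: binary flag vectors, per-flag frequencies standing in for the IID distribution of Eq. (2), and a 1-nearest-neighbor lookup; feature vectors are assumed to be pre-normalized (e.g. by ft24, as in Fig. 9).

```python
import math

def fit_flag_distribution(good_combinations):
    """good_combinations: 0/1 flag vectors that reached at least 98% of the
    best speedup for one training program. Returns per-flag probabilities
    (the IID model of Eq. (2))."""
    n = len(good_combinations)
    length = len(good_combinations[0])
    return [sum(c[i] for c in good_combinations) / n for i in range(length)]

def nearest_program(test_features, train_features):
    """Index of the training program closest in Euclidean feature distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(train_features)),
               key=lambda j: dist(test_features, train_features[j]))

def predict(test_features, train_features, train_distributions):
    """One-shot prediction at the mode of the neighbor's distribution:
    turn on exactly the flags that are more likely on than off."""
    j = nearest_program(test_features, train_features)
    return [1 if p > 0.5 else 0 for p in train_distributions[j]]
```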
Figure 10 compares the speedups achieved after iterative compilation using 1000 iterations and a 50% probability of selecting each optimization on AMD and Intel with those achieved after one-shot prediction using the probabilistic model, or after simply selecting the best combination of optimizations from the closest program. Interestingly, the results suggest that simply selecting the best combination of optimizations from a similar program may not perform well in many cases. This may be due to our random optimization space exploration technique: each "good" combination of optimizations includes multiple flags that do not influence performance or other metrics on a given program, yet some of them can considerably degrade performance on other programs. In contrast, the probabilistic approach helps to statistically filter out non-influential flags and thereby improve predictions.
Fig. 10 Speedups achieved when using iterative compilation on a AMD and b Intel with random search strategy (1,000 iterations; 50% probability to select each optimization), when selecting the best optimization from the nearest program, and when predicting optimizations using a probabilistic ML model based on program features
5.2 Transductive Machine Learning Model
In this subsection we describe a new transductive approach where the optimization combinations themselves are used as features for the learning algorithm, together with program features. The model is then queried for the best combination of optimizations out of the set of optimizations that the program was compiled with. Many learning algorithms can be used to build the ML model. In this work we used a decision tree model [23] to ease analysis of the resulting model.
As in the previous section, we try to predict whether a specific optimization com-
bination will obtain at least 95% of the maximal speedup possible. The feature set
is comprised of the flags/passes and the extracted program features, obtained from
Milepost GCC. Denoting the vector of extracted features from the i-th program by
ti,i =1,...,M and the possible optimization passes by xj,j =1,...,N,wetrain
the ML model with a set of features which is the cross-product of x×t, such that
each feature vector is a concatenation of xjand ti. This is akin to multi-class methods
which rely on single binary classifiers (see [24] for a detailed discussion of such meth-
ods). The target for the predictor is whether this combination of program features and
flags/passes combination will give a speedup of at least 95% of the maximal speedup.
Once a program is compiled with different optimization settings (either an exhaustive
sample or a random sample of optimization combinations), all successfully compiled
settings are used as a query to the learned model together with the
program features, and the flag setting predicted to have the best speedup
is used. If several settings are predicted to have the same speedup, the one that
exhibited, on average, the best speedup on the training-set programs is used.
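As a concrete illustration of this construction, the sketch below (assuming scikit-learn and illustrative array layouts; it is not the MILEPOST plugin code) builds the cross-product training set, fits a decision tree, and answers queries with the tie-breaking rule just described.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def build_training_set(program_features, flag_matrix, speedups):
        # program_features: (M, F) static features t_i per training program;
        # flag_matrix: (N, K) 0/1 encodings x_j of optimization combinations;
        # speedups: (M, N) measured speedup of program i under combination j.
        X, y = [], []
        for i, t_i in enumerate(program_features):
            best = speedups[i].max()
            for j, x_j in enumerate(flag_matrix):
                X.append(np.concatenate([x_j, t_i]))  # feature vector: x_j ++ t_i
                y.append(int(speedups[i, j] >= 0.95 * best))
        return np.array(X), np.array(y)

    def predict_best_combination(model, t_new, candidate_flags, mean_speedups):
        # Query the tree with every successfully compiled combination of the
        # new program; ties between combinations predicted to reach the top
        # 5% are broken by their average speedup on the training programs.
        queries = np.array([np.concatenate([x, t_new]) for x in candidate_flags])
        scores = model.predict_proba(queries)[:, 1]
        top = scores == scores.max()
        return int(np.argmax(np.where(top, mean_speedups, -np.inf)))

    # X, y = build_training_set(...); model = DecisionTreeClassifier().fit(X, y)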
Figure 11 compares the speedups achieved after iterative compilation using 1,000
iterations and 50% probability of selecting each optimization on ARC with those achieved
after one-shot prediction using the probabilistic and transductive models. It shows that our
probabilistic model can automatically improve the default optimization heuristics of GCC by
11% on average, while reaching 100% of the achievable speedup in some cases. On the
other hand, the transductive model improves GCC by only a modest 5%. However, in several
cases (susan_s, dijkstra, rijndael_e, qsort1 and stringsearch1) it outperforms the
probabilistic model, likely due to a different mechanism for capturing the importance of
program features and optimizations. Moreover, the transductive (decision tree) model has
the advantage that its results are much easier to analyze. For example, Fig. 12 shows
the top levels of the decision trees learnt for ARC. The leaves indicate the probability that
the optimization and program feature combinations which reached these nodes will
achieve at least 95% of the maximal speedup for a benchmark. Most of the features found at the
top levels characterize the control flow graph (CFG). This is somewhat expected, since
the structure of the CFG is one of the major factors that may affect the efficiency of
several optimizations. Other features relate to the applicability of the "address-taken"
operator to functions, which may affect the accuracy of the call graph and of subsequent
analyses using it. To improve the performance of both models, we intend to analyze the
quality and importance of program features and their correlation with optimizations
in the future.
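For the kind of analysis behind Fig. 12, a trained tree can be inspected directly. The sketch below (again assuming scikit-learn; feature names such as ft16 follow the paper's numbering) prints the learned top-level thresholds, e.g. a test on the number of CFG edges.

    from sklearn.tree import export_text

    def print_top_splits(model, feature_names, depth=3):
        # Render the first levels of the learned tree; each line shows a
        # threshold test such as "ft16 <= 193.5", mirroring Fig. 12.
        print(export_text(model, feature_names=list(feature_names),
                          max_depth=depth))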
5.3 Realistic Optimization Scenario of a Production Application
Experimental results from the previous section show how to optimize several standard
benchmarks using Milepost GCC. In this section we show how to optimize a
real production application using Milepost technology combined with the machine
learning model from Sect. 5.1. For this purpose, we selected the open-source Berkeley
DB library (BDB), a popular high-performance database written in C with
APIs to most other languages. For evaluation purposes we used an official internal
benchmarking suite and provided support for the CCC framework to perform iterative
compilation in the same manner as described in Sect. 3, in order to find the upper bounds
for execution time, code size and compilation time.
[Figure: Y-axis: execution time speedup (0.8-1.6); X-axis: MiBench programs (bitcount through stringsearch1) plus AVERAGE; series: upper bound (iterative compilation), predicted optimization using probabilistic ML model, predicted optimization using transductive ML model.]
Fig. 11 Speedups achieved when using iterative compilation on ARC with a random search strategy (1,000 iterations; 50% probability of selecting each optimization) and when predicting the best optimizations using the probabilistic and transductive ML models based on program features
[Figure: decision tree; top-level tests include ft6 (number of basic blocks with two predecessors and one successor) < 18, ft38 (number of times the address of a function is taken, & in C) < 16.5, ft9 (number of basic blocks with a single predecessor and two successors) < 15.5, ft16 (number of edges in the control flow graph) < 193.5, and -O < 0.5; leaf values (e.g. 0.038, 0.24, 0.88) give the probability of reaching at least 95% of the maximal speedup.]
Fig. 12 Top levels of decision trees learnt for ARC
For simplicity, we decided to use the probabilistic machine learning model from
Sect. 5.1. Since BDB is relatively large (around 200,000 lines of code), we selected the
3 hottest functions, extracted features for each of them using Milepost GCC and
calculated the Euclidean distance to all programs from our training set (MiBench/cBench)
to find the five most similar programs. Then, depending on the optimization scenario,
we selected the best optimizations from those programs to (a) improve execution time
without degrading compilation time, (b) improve code size without degrading execution
time, and (c) improve compilation time without degrading execution time; this selection
step is sketched below.
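The following is a hedged illustration under assumed data layouts, not the actual framework code. In particular, the per-dimension scaling before the Euclidean distance is an assumption (the paper does not specify feature normalization), and the scenario filter mirrors constraints (a)-(c).

    import numpy as np

    def closest_programs(hot_features, train_features, names, k=5):
        # hot_features: feature vector of one hot BDB function, shape (F,);
        # train_features: (M, F) features of the M training programs.
        scale = train_features.max(axis=0).astype(float)  # assumed scaling
        scale[scale == 0] = 1.0
        d = np.linalg.norm(train_features / scale - hot_features / scale, axis=1)
        order = np.argsort(d)[:k]
        return [(names[i], float(d[i])) for i in order]

    def pick_for_scenario(candidates, improve, keep):
        # candidates: dicts with 'time', 'size' and 'ctime' measured relative
        # to -O3 (lower is better; 1.0 means identical to -O3). Scenario (a)
        # is improve='time', keep='ctime'; (b) improve='size', keep='time';
        # (c) improve='ctime', keep='time'.
        feasible = [c for c in candidates if c[keep] <= 1.0]
        return min(feasible, key=lambda c: c[improve], default=None)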
Figure 13 shows the execution time speedups, code size improvements and
compilation time speedups over the -O3 optimization level achieved when applying the
selected optimizations from the most similar programs to BerkeleyDB under these three
optimization scenarios. These speedups are compared to the upper bound for the respective
metric achieved after iterative compilation (200 iterations) for the whole program. The
programs on the X-axis are sorted by distance, starting from the closest program. In the
case of improving execution time, we obtain significant speedups across the functions.
For improving compilation time we are far from the optimal solution, because that
optimum is naturally associated with the lowest optimization level, while we also required
that the execution time of -O3 not be degraded. Overall, the best results were achieved
when applying optimizations from the tiff programs, which are closer in the feature space to
the hot functions selected from BerkeleyDB than any other program in the training
set.
We added information about the best optimizations from these 3 optimization scenarios
to the open online Collective Optimization Database [14] to help users and
researchers validate and reproduce such results. These optimization cases are referenced
by the following cTuning RUN_ID reference numbers: 24857532370695782,
17268781782733561 and 9072658980980875. The default run corresponding to the -O3
optimization level is referenced by 965827379437489142. We also added support for the
pragma #ctuning-opt-case UID, which allows end-users to explicitly attach to a
given code section the combinations of optimizations found by other users during empirical
collective search and referenced by UID in the COD, instead of using
machine learning.
6 Related Work
Automatic performance tuning techniques are now widely adopted to improve different
characteristics of code empirically. They automatically search for good optimization
settings by applying multiple compilations and executions of a given program, while
requiring little or no knowledge of the current platform, so programs can be adapted
to any given architecture. Originally developed to improve the performance of
small kernels using a few parametric transformations across multiple architectures
where static compilers fail to deliver the best performance [84,45,56,71,81,6,85], these
[Figure: three panels, each plotting speedup/improvement (0.8-1.2) for the 5 closest programs (tiffmedian, tiffdither, tiff2rgba, patricia, stringsearch1) next to the maximum after iterative compilation; series: execution time speedup, compilation time speedup, code size improvement; one off-scale value in panel (c) is annotated as 2.1.]
Fig. 13 Execution time speedups (a), code size improvements (b) and compilation time speedups (c) for BerkeleyDB on Intel when applying optimizations from the 5 closest programs from MiBench/cBench (based on Euclidean distance using static program features of the 3 hottest functions) under several optimization scenarios
techniques have been extended to larger applications and a richer set of optimizations
[7,44,16,46,18,77,27,21,51,67,86,25,68,78].
Though popular for library generators and embedded systems, iterative compilation
has still not been widely adopted by general-purpose compilers, mainly due to excessively
long optimization times. Multiple genetic and probabilistic approaches have been developed
to speed up the optimization of a given program on a given architecture [64,17,73,
37,4,38,26]. Furthermore, novel dynamic and hybrid (static and dynamic) adaptation
techniques have been proposed to speed up the evaluation of optimizations at run-time
[80,49]. In [30], we showed that iterative compilation can be sped up by
several orders of magnitude using static function cloning with pre-optimized versions
for various objectives, together with low-overhead run-time evaluation of optimizations,
which also enabled adaptive binaries reactive to run-time changes in the program and
environment. In [27,31], we introduced a technique for quickly determining a realistic
lower bound on the execution time of memory-intensive applications by converting array
accesses to scalars in various ways, without preserving the semantics of the code, in
order to detect performance anomalies and identify code sections that can benefit from
empirical optimizations. All these techniques can effectively learn the optimization
space of an individual program and its optimization process, but they still do not learn
optimizations across programs and architectures.
Calder et al. [10] presented a new approach to predict branches for a new program
based on the behavior of other programs. They used neural networks and decision trees
to map static features associated with each branch to a prediction of whether the branch
will be taken, and managed to slightly reduce the branch misprediction rate on a set
of C and Fortran programs. Moss et al. and McGovern et al. [62,59] incorporated
reinforcement learning models into a compiler to improve code scheduling; however, no
absolute performance improvements were reported. Monsifrot et al. [61] presented a
classifier based on decision tree learning to determine which loops to unroll. Stephenson
and Amarasinghe [72] also predicted unroll factors, using nearest
neighbor classification and support vector machines. In our previous work [2,11] we
used static or dynamic (performance counter) code features with the SUIF, Intel and
PathScale compilers to predict a set of multiple optimizations that improve execution
time for new programs based on similarities between previously optimized programs.
Liao et al. [82] applied machine learning to performance counters and decision trees to
choose hardware prefetcher configurations. Several researchers [9,50,55] attempted to
characterize program input in order to predict the best code variant at run-time using
several machine learning methods, including automatically generated decision trees and
statistical modeling. Other works [42,39,22] used machine learning for performance
prediction and hardware-software co-design.
Though machine learning techniques show good potential to speed up the
iterative compilation process and to facilitate the reuse of optimization knowledge across
different programs and architectures, the training phase can still be very long. Techniques
for continuous optimization can effectively speed up the training of machine learning
models. Anderson et al. [3] presented a practical framework for continuous and transparent
profiling and analysis of computing systems; unfortunately, this work was not
continued and no machine learning was used. Lattner and Adve [48] and Lu et
al. [54] describe frameworks for lifelong program optimization, but without providing
details on the practical collection of data and on optimization strategies across runs. Other
frameworks [5,74] can collect profile information across multiple runs of users and
continuously alter run-time decisions in Java virtual machines, while we focus on
production static compilers and on predictive modeling to correlate program features with
program optimizations. In previous work [28,32] we presented an open-source framework
for statistical collective optimization that can leverage the experience of multiple
users of static compilers and collect run-time profile data transparently in an open
public database for further machine learning processing. In [32], we also presented a
new technique to characterize programs based on their reactions to transformations, which
can serve as a portable alternative to program characterization based on static or
dynamic program features.
We found many of the above approaches highly preliminary, limited to a few
transformations and global flags, and rarely accompanied by publicly released open-source
tools or experimental data to reproduce the results. In contrast, the main goal of the Milepost
project is to make machine learning based multi-objective optimization a realistic,
automatic, reproducible and portable technology for general-purpose production compilers.
7 Conclusions and Future Work
The main contribution of this article is the first practical attempt to move empirical
multi-objective iterative optimization and machine learning research techniques
into production compilers, deliver an open collaborative R&D infrastructure based on
the popular GCC compiler, and connect it to the cTuning.org optimization repository
to help end-users optimize their applications and allow researchers to reproduce and
improve experimental results. We show that Milepost GCC has the potential to automate
the tuning of compiler heuristics for a wide range of architectures and multi-objective
optimizations such as improving execution time, code size, compilation time and other
constraints, while considerably simplifying overall compiler design and reducing time to market.
We released all of the Milepost/cTuning infrastructure and experimental data as open
source at cTuning.org [57,19,20] so that they are immediately useful to end-users and
researchers. We hope that Milepost GCC, connected to cTuning.org's public collaborative
tools and databases with a common API, will open many new practical opportunities
for systematic and reproducible research in the area of empirical multi-objective
optimization and machine learning. Some of the Milepost technology is now included in
mainline GCC.
We continue to extend the Interactive Compilation Interface [41,40] to abstract
high-level optimization processes from compiler internals and to provide finer-grain
tuning of performance, power, compilation time and code size. We also expect to combine
ICI with the POET scripting language [86] and pragmas to unify fine-grain program
tuning. Future work will connect LLVM, ROSE, Path64 and other compilers to our
framework. We are also integrating our framework with the collective optimization
approach [32] to reduce or completely remove training-stage overheads given limited
benchmarks, architectures and datasets. Collective optimization also allows us to define
truly representative benchmarks based on classical clustering techniques.
Our framework now facilitates deeper analysis of interactions among optimizations
and investigation of the influence of program inputs and run-time state on program
optimizations in large applications. We are also extending the Milepost/cTuning technology
to improve machine learning models and to analyze the quality of program features when
searching for optimal sequences of optimization passes or polyhedral transformations
[51,78]. We have started combining Milepost technology with machine-learning based
auto-parallelization and predictive scheduling techniques [52,43,76]. We have also
started investigating staged compilation techniques to balance static and
dynamic optimizations using machine learning in LLVM or in Milepost GCC4CIL connected
to the Mono virtual machine. We plan to connect architecture simulators to our
framework to enable software and hardware co-optimization. Finally, we will investigate
adaptive and machine learning techniques for parallelization on heterogeneous
multi-core architectures and for power-saving prediction in large data centers and
supercomputers.
Acknowledgments This research was generously supported by the EU FP6 035307 Project Milepost
(MachIne Learning for Embedded PrOgramS opTimization) [58]. We would like to thank the GRID5000 [35]
community for providing computational resources that helped validate the results of this paper. We are grateful to
Prof. William Jalby for providing financial support for Abdul Wahid Memon and Yuriy Kashnikov to work
on this project. We would like to thank Ari Freund, Björn Franke, Hugh Leather, our colleagues from the
Institute of Computing Technology of the Chinese Academy of Sciences, and users from the GCC, cTuning and
HiPEAC communities for interesting discussions and feedback during this project. We would like to thank
Cupertino Miranda and Joern Rennecke for their help in improving the Interactive Compilation Interface [41].
We would also like to thank Diego Novillo and the GCC developers for practical discussions about the
implementation of the ICI-compatible plugin framework in GCC. We are grateful to Google for their support
in extending Milepost GCC during the Google Summer of Code'09 program. We would also like to thank Yossi Gil
for proof-reading this paper and the anonymous reviewers for their insightful comments.
References
1. ACOVEA: Using Natural Selection to Investigate Software Complexities. http://www.coyotegulch.com/products/acovea
2. Agakov, F., Bonilla, E., Cavazos, J., Franke, B., Fursin, G., O’Boyle, M., Thomson, J., Toussaint, M.,
Williams, C.: Using machine learning to focus iterative optimization. In: Proceedings of the Interna-
tional Symposium on Code Generation and Optimization (CGO) (2006)
3. Anderson, J., Berc, L., Dean, J., Ghemawat, S., Henzinger, M., Leung, S., Sites, D., Vandevoorde, M.,
Waldspurger, C., Weihl, W.: Continuous profiling: Where have all the cycles gone? In: Proceedings of
the 30th Symposium on Microarchitecture (MICRO-30), (1997)
4. Arcuri, A., White, D.R., Clark, J., Yao, X.: Multi-objective improvement of software using co-evolution
and smart seeding. In: Proceedings of the 7th International Conference on Simulated Evolution And
Learning (SEAL’08) (2008)
5. Arnold, M., Welc, A., Rajan, V.T.: Improving virtual machine performance using a cross-run pro-
file repository. In: Proceedings of the ACM Conference on Object-Oriented Programming, Systems,
Languages and Applications (OOPSLA’05) (2005)
6. Barthou, D., Donadio, S., Carribault, P., Duchateau, A., Jalby, W.: Loop optimization using hierarchi-
cal compilation and kernel decomposition. In: Proceedings of the International Symposium on Code
Generation and Optimization (CGO) (2007)
7. Bodin, F., Kisuki, T., Knijnenburg, P., O’Boyle, M., Rohou, E.: Iterative compilation in a non-linear
optimisation space. In: Proceedings of the Workshop on Profile and Feedback Directed Compilation
(1998)
8. Bonilla, E.V., Williams, C.K.I., Agakov, F.V., Cavazos, J., Thomson, J., O’Boyle, M.F.P.: Predic-
tive search distributions. In: Proceedings of the 23rd International Conference on Machine Learning.
pp. 121–128, New York, NY, USA, (2006)
9. Brewer, E.: High-level optimization via automated statistical modeling. In: Proceedings of the 5th ACM
SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 80–91 (1995)
10. Calder, B., Grunwald, D., Jones, M., Lindsay, D., Martin, J., Mozer, M., Zorn, B.: Evidence-based
static branch prediction using machine learning. ACM Transactions on Programming Languages and
Systems (TOPLAS) (1997)
11. Cavazos, J., Fursin, G., Agakov, F., Bonilla, E., O’Boyle, M., Temam, O.: Rapidly selecting good
compiler optimizations using performance counters. In: Proceedings of the International Symposium
on Code Generation and Optimization (CGO) March (2007)
12. Cavazos, J., Moss, J.: Inducing heuristics to decide whether to schedule. In: Proceedings of the ACM
SIGPLAN Conference on Programming Language Design and Implementation (PLDI) (2004)
13. CCC: Continuous Collective Compilation Framework for iterative multi-objective optimization. http://cTuning.org/ccc
14. COD: Public collaborative repository and tools for program and architecture characterization and
optimization. http://cTuning.org/cdatabase
15. Chen, Y., Huang, Y., Eeckhout, L., Fursin, G., Peng, L., Temam, O., Wu, C.: Evaluating iterative opti-
mization across 1000 data sets. In: Proceedings of the ACM SIGPLAN Conference on Programming
Language Design and Implementation (PLDI) June (2010)
16. Cooper, K., Grosul, A., Harvey, T., Reeves, S., Subramanian, D., Torczon, L., Waterman, T.: ACME:
adaptive compilation made efficient. In: Proceedings of the Conference on Languages, Compilers, and
Tools for Embedded Systems (LCTES) (2005)
17. Cooper, K., Schielke, P., Subramanian, D.: Optimizing for reduced code space using genetic algo-
rithms. In: Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems
(LCTES), pp. 1–9, (1999)
18. Cooper, K., Subramanian, D., Torczon, L.: Adaptive optimizing compilers for the 21st century.
J. Supercomput. 23(1) (2002)
19. cTuning CC: cTuning Compiler Collection that can convert any traditional compiler into adaptive,
machine learning enabled self-tuning infrastructure using Milepost GCC with ICI, CCC framework,
cBench, COD public repository and cTuning.org web-services. http://cTuning.org/ctuning-cc
20. cTuning.org: public collaborative optimization center with open source tools and repository to systematize,
simplify and automate the design and optimization of computing systems while enabling reproducibility
of results. http://cTuning.org
21. Donadio, S., Brodman, J.C., Roeder, T., Yotov, K., Barthou, D., Cohen, A., Garzaran, M.J., Padua,
D.A., Pingali, K.: A language for the compact representation of multiple program versions. In: Pro-
ceedings of the International Workshop on Languages and Compilers for Parallel computing (LCPC)
(2005)
22. Dubach, C., Jones, T.M., Bonilla, E.V., Fursin, G., O’Boyle, M.F.: Portable compiler optimization
across embedded programs and microarchitectures using machine learning. In: Proceedings of the
IEEE/ACM International Symposium on Microarchitecture (MICRO) December (2009)
23. Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley, New-York (2001)
24. El-Yaniv, R., Pechyony, D., Yom-Tov, E.: Better multiclass classification via a margin-optimized single
binary problem. Pattern Recognit Lett 29(14), 1954–1959 (2008)
25. ESTO: Expert System for Tuning Optimizations. http://www.haifa.ibm.com/projects/systems/cot/esto/index.html
26. Franke, B., O’Boyle, M., Thomson, J., Fursin, G.: Probabilistic source-level optimisation of embed-
ded programs. In: Proceedings of the Conference on Languages, Compilers, and Tools for Embedded
Systems (LCTES) (2005)
27. Fursin, G.: Iterative Compilation and Performance Prediction for Numerical Applications. PhD thesis,
University of Edinburgh, United Kingdom (2004)
28. Fursin, G.: Collective tuning initiative: automating and accelerating development and optimization of
computing systems. In: Proceedings of the GCC Developers’ Summit, June (2009)
29. Fursin, G., Cavazos, J., O’Boyle, M., Temam, O.: MiDataSets: creating the conditions for a more
realistic evaluation of iterative optimization. In: Proceedings of the International Conference on High
Performance Embedded Architectures & Compilers (HiPEAC 2007), January (2007)
30. Fursin, G., Cohen, A., O’Boyle, M., Temam, O.: A practical method for quickly evaluating program
optimizations. In: Proceedings of the International Conference on High Performance Embedded Archi-
tectures & Compilers (HiPEAC 2005), pp. 29–46, November (2005)
31. Fursin, G., O’Boyle, M., Temam, O., Watts, G.: Fast and accurate method for determining a lower
bound on execution time. Concurrency 16(2–3), 271–292 (2004)
32. Fursin, G., Temam, O.: Collective optimization. In: Proceedings of the International Conference on
High Performance Embedded Architectures & Compilers (HiPEAC 2009), January (2009)
33. Georges, A., Buytaert, D., Eeckhout, L.: Statistically rigorous Java performance evaluation. In: Proceedings
of the Twenty-Second ACM SIGPLAN Conference on Object-Oriented Programming, Systems,
Languages & Applications (OOPSLA) (2007)
34. GCC: the GNU Compiler Collection. http://gcc.gnu.org
35. GRID5000: A nationwide infrastructure for large scale parallel and distributed computing research.
http://www.grid5000.fr
36. Guthaus, M.R., Ringenberg, J.S., Ernst, D., Austin, T.M., Mudge, T., Brown, R.B.: MiBench:
A free, commercially representative embedded benchmark suite. In: Proceedings of the IEEE 4th
Annual Workshop on Workload Characterization, Austin, TX, December (2001)
37. Heydemann, K., Bodin, F.: Iterative compilation for two antagonistic criteria: Application to code
size and performance. In: Proceedings of the 4th Workshop on Optimizations for DSP and Embedded
Systems, colocated with CGO (2006)
38. Hoste, K., Eeckhout, L.: Cole: Compiler optimization level exploration. In: Proceedings of the Inter-
national Symposium on Code Generation and Optimization (CGO) (2008)
39. Hoste, K., Eeckhout, L.: Comparing benchmarks using key microarchitecture-independent character-
istics. In: Proceedings of the IEEE International Symposium on Workload Characterization (IISWC),
pp. 83–92, California, USA, October (2006)
40. Huang, Y., Peng, L., Wu, C., Kashnikov, Y., Rennecke, J., Fursin, G.: Transforming GCC into a research-
friendly environment: plugins for optimization tuning and reordering, function cloning and program
instrumentation. In: 2nd International Workshop on GCC Research Opportunities (GROW), Colocated
with HiPEAC’10 Conference, January (2010)
41. ICI: Interactive Compilation Interface is a unified plugin system to convert black-box production
compilers into interactive research toolsets for application and architecture characterization and opti-
mization. http://cTuning.org/ici
42. Ipek, E., McKee, S.A., de Supinski, B.R., Schulz, M., Caruana, R.: Efficiently exploring architec-
tural design spaces via predictive modeling. In: Proceedings of the 12th International Conference on
Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 195–206
(2006)
43. Jimenez, V., Gelado, I., Vilanova, L., Gil, M., Fursin, G., Navarro, N.: Predictive runtime code sched-
uling for heterogeneous architectures. In: Proceedings of the International Conference on High Per-
formance Embedded Architectures & Compilers (HiPEAC 2009), January (2009)
44. Kisuki, T., Knijnenburg, P., O’Boyle, M.: Combined selection of tile sizes and unroll factors using
iterative compilation. In: Proceedings of the International Conference on Parallel Architectures and
Compilation Techniques (PACT), pp. 237–246, (2000)
45. Kisuki T., Knijnenburg P., O’Boyle M.: Combined selection of tile sizes and unroll factors using iter-
ative compilation. In: Proceedings of IEEE International Conference on Parallel Architectures and
Compilation Techniques (PACT), pp. 237–246, (2000)
46. Kulkarni, P., Zhao, W., Moon, H., Cho, K., Whalley, D., Davidson, J., Bailey, M., Paek, Y., Gallivan,
K.: Finding effective optimization phase sequences. In: Proceedings of the Conference on Languages,
Compilers, and Tools for Embedded Systems (LCTES), pp. 12–23 (2003)
47. Larrañaga, P., Lozano, J.A.: Estimation of Distribution Algorithms: A New Tool for Evolutionary
Computation. Kluwer, Norwell (2001)
48. Lattner, C., Adve, V.: LLVM: A compilation framework for lifelong program analysis & transfor-
mation. In: Proceedings of the 2004 International Symposium on Code Generation and Optimization
(CGO’04), Palo Alto, California, March (2004)
49. Lau, J., Arnold, M., Hind, M., Calder, B.: Online performance auditing: Using hot optimizations with-
out getting burned. In: Proceedings of the ACM SIGPLAN Conference on Programming Language
Design and Implementation (PLDI’06) (2006)
50. Li, X., Garzaran, M.J., Padua, D.A.: Optimizing sorting with machine learning algorithms. In: Pro-
ceedings of the International Parallel and Distributed Processing Symposium (IPDPS) (2007)
51. Long, S., Fursin, G.: A heuristic search algorithm based on unified transformation framework. In:
Proceedings of the 7th International Workshop on High Performance Scientific and Engineering Com-
puting (HPSEC-05), pp. 137–144, (2005)
52. Long, S., Fursin, G., Franke, B.: A cost-aware parallel workload allocation approach based on machine
learning techniques. In: Proceedings of the IFIP International Conference on Network and Parallel
Computing (NPC 2007), number 4672 in LNCS, pp. 506–515. Springer, September (2007)
53. LLVM: the low level virtual machine compiler infrastructure. http://llvm.org
54. Lu, J., Chen, H., Yew, P.-C., Hsu, W.-C.: Design and implementation of a lightweight dynamic optimization
system. J. Instruction-Level Parallelism 6 (2004)
55. Luo, L., Chen, Y., Wu, C., Long, S., Fursin, G.: Finding representative sets of optimizations for adap-
tive multiversioning applications. In: 3rd Workshop on Statistical and Machine Learning Approaches
Applied to Architectures and Compilation (SMART’09), Colocated with HiPEAC’09 Conference,
January (2009)
56. Matteo, F., Johnson, S.: FFTW: An adaptive software architecture for the FFT. In: Proceedings of the
IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. 1381–1384,
Seattle, WA, May (1998)
57. MILEPOST GCC: public collaborative R&D website. http://cTuning.org/milepost-gcc
58. MILEPOST project archive (MachIne Learning for Embedded PrOgramS opTimization). http://cTuning.org/project-milepost
59. McGovern, A., Moss, E.: Scheduling straight-line code using reinforcement learning and rollouts. In:
Advances in Neural Information Processing Systems (NIPS). Morgan Kaufmann, San Mateo (1998)
60. Monsifrot, A., Bodin, F.,Quiniou, R.: A machine learning approach to automatic production of compiler
heuristics. In: Proceedings of the International Conference on Artificial Intelligence: Methodology,
Systems, Applications, LNCS 2443, pp. 41–50 (2002)
61. Monsifrot, A., Bodin, F.,Quiniou, R.: A machine learning approach to automatic production of compiler
heuristics. In: Proceedings of the Tenth International Conference on Artificial Intelligence: Methodol-
ogy, Systems, Applications (AIMSA), LNCS 2443, pp. 41–50, (2002)
62. Moss, J., Utgoff, P., Cavazos, J., Precup, D., Stefanovic, D., Brodley, C., Scheeff, D.: Learning to
schedule straight-line code. In: Advances in Neural Information Processing Systems (NIPS), pp. 929–
935. Morgan Kaufmann, (1997)
63. Namolaru, M., Cohen, A., Fursin, G., Zaks, A., Freund, A.: Practical aggregation of semantical program
properties for machine learning based optimization. In: Proceedings of the International Conference
on Compilers, Architecture, and Synthesis For Embedded Systems (CASES 2010), October (2010)
64. Nisbet, A.: Iterative feedback directed parallelisation using genetic algorithms. In: Proceedings of the
Workshop on Profile and Feedback Directed Compilation in Conjunction with the International Conference
on Parallel Architectures and Compilation Techniques (PACT) (1998)
65. Open64: an open source optimizing compiler suite. http://www.open64.net
66. OProfile: system-wide profiler for Linux systems, capable of profiling all running code at low overhead.
http://oprofile.sourceforge.net
67. Pan, Z., Eigenmann, R.: Fast and effective orchestration of compiler optimizations for automatic perfor-
mance tuning. In: Proceedings of the International Symposium on Code Generation and Optimization
(CGO), pp. 319–332 (2006)
68. PathScale EKOPath Compilers. http://www.pathscale.com
69. Phoenix: software optimization and analysis framework for Microsoft compiler technologies. https://connect.microsoft.com/Phoenix
70. ROSE: an open source compiler infrastructure to build source-to-source program transformation and
analysis tools. http://www.rosecompiler.org/
71. Singer, B., Veloso, M.: Learning to predict performance from formula modeling and training data. In:
Proceedings of the Conference on Machine Learning (2000)
72. Stephenson, M., Amarasinghe, S.: Predicting unroll factors using supervised classification. In:
Proceedings of International Symposium on Code Generation and Optimization (CGO), pp. 123–134,
(2005)
73. Stephenson, M., Amarasinghe, S., Martin, M., O'Reilly, U.-M.: Meta optimization: Improving compiler
heuristics with machine learning. In: Proceedings of the ACM SIGPLAN Conference on Programming
Language Design and Implementation (PLDI’03), pp. 77–90, June (2003)
74. Stephenson, M.W.: Automating the construction of compiler heuristics using machine learning. PhD
thesis, MIT, USA, (2006)
75. Touati, S., Worms, J., Briais, S.: The speedup test. In: INRIA Technical Report HAL-inria-00443839
(2010)
76. Tournavitis, G., Wang, Z., Franke, B., O’Boyle, M.F.: Towards a holistic approach to auto-
parallelization: Integrating profile-driven parallelism detection and machine-learning based mapping.
In: Proceedings of the Conference on Programming Language Design and Implementation (PLDI)
(2009)
77. Triantafyllis, S., Vachharajani, M., Vachharajani, N., August, D.: Compiler optimization-space explo-
ration. In: Proceedings of the International Symposium on Code Generation and Optimization (CGO),
pp. 204–215 (2003)
78. Trifunovic, K., Cohen, A., Edelsohn, D., Feng, L., Grosser, T., Jagasia, H., Ladelsky, R., Pop, S.,
Sjoedin, J., Upadrasta, R.: Graphite two years after: First lessons learned from real-world polyhedral
compilation. In: 2nd International Workshop on GCC Research Opportunities (GROW) (2010)
79. Ullman, J.: Principles of Database and Knowledge-Base Systems, vol. 1. Computer Science Press,
New York (1988)
80. Voss, M., Eigenmann, R.: ADAPT: Automated de-coupled adaptive program transformation. In: Pro-
ceedings of International Conference on Parallel Processing (2000)
81. Vuduc, R., Demmel, J.W., Yelick, K.A.: OSKI: A library of automatically tuned sparse matrix kernels.
J. Phys. Conf. Ser. 16, 521–530 (2005)
82. Liao, S.-W., Hung, T.-H., Nguyen, D., Chou, C., Tu, C., Zhou, H.: Machine learning-based prefetch
optimization for data center applications. In: Proceedings of the IEEE/ACM Conference on Supercomputing
(SC) (2009)
83. Whaley, J., Lam, M.S.: Cloning based context sensitive pointer alias analysis using binary decision
diagrams. In: Proceedings of the Conference on Programming Language Design and Implementation
(PLDI), (2004)
84. Whaley, R., Dongarra, J.: Automatically tuned linear algebra software. In: Proceedings of the Confer-
ence on High Performance Networking and Computing (1998)
85. Williams, S., Oliker, L., Vuduc, R., Shalf, J., Yelick, K., Demmel, J.: Optimization of sparse matrix-
vector multiplication on emerging multicore platforms. In: Proceedings of the IEEE/ACM Conference
on Supercomputing (SC) (2007)
86. Yi, Q., Seymour, K., You, H., Vuduc, R., Quinlan, D.: POET: Parameterized optimizations for empirical
tuning. In: Proceedings of the Workshop on Performance Optimization of High-level Languages and
Libraries (POHLL) Co-located with IEEE International Parallel and Distributed Processing Sympo-
sium (IPDPS) (2007)
... Two primary categories of approaches have emerged to tackle this challenge: machine learning-based approaches (Hui et al. 2019;Ashouri 2017;Liu et al. 2022;Fursin et al. 2011;Fursin 2009;Cavazos et al. 2007;Ashouri et al. 2016;Pallister et al. 2015) and design space exploration Asher et al. 2017;Youcong et al. 2019;Georgiou et al. 2018;Garciarena et al. 2016;Lin et al. 2008;Agakov et al. 2006;Chen et al. 2021;Ni et al. 2019;Blackmore et al. 2017;Tağtekin et al. 2021;López-Ibáñez et al. 2016). Machine learning-based approaches leverage specific hardware platforms, GCC versions, and benchmark programs to construct prediction models, enabling rapid option selection for optimized programs. ...
... Machine learning-based approaches Wang and O'Boyle (2018) utilize static features (Hui et al. 2019;Fursin et al. 2011;Fursin 2009), dynamic features (Hui et al. 2019;Cavazos et al. 2007), and/or hybrid features (Ashouri et al. 2016) extracted from the target programs in open-source benchmark sets (PolyBench 2020; CBench 2020) as their input. These approaches then generate various prediction models and apply them to predict optimal option combinations, with the goal of optimizing execution time (Hui et al. 2019;Ashouri 2017;Liu et al. 2022;Fursin et al. 2011;Fursin 2009;Cavazos et al. 2007;Ashouri et al. 2016), code size (Fursin et al. 2011;Fursin 2009), and energy consumption (Pallister et al. 2015). ...
... Machine learning-based approaches Wang and O'Boyle (2018) utilize static features (Hui et al. 2019;Fursin et al. 2011;Fursin 2009), dynamic features (Hui et al. 2019;Cavazos et al. 2007), and/or hybrid features (Ashouri et al. 2016) extracted from the target programs in open-source benchmark sets (PolyBench 2020; CBench 2020) as their input. These approaches then generate various prediction models and apply them to predict optimal option combinations, with the goal of optimizing execution time (Hui et al. 2019;Ashouri 2017;Liu et al. 2022;Fursin et al. 2011;Fursin 2009;Cavazos et al. 2007;Ashouri et al. 2016), code size (Fursin et al. 2011;Fursin 2009), and energy consumption (Pallister et al. 2015). Liu et al. ...
Article
Full-text available
The open-source compiler GCC offers numerous options to improve execution time. Two categories of approaches, machine learning-based and design space exploration, have emerged for selecting the optimal set of options. However, they continue to face challenge in quickly obtaining high-quality solutions due to the large and discrete optimization space, time-consuming utility evaluation for selected options, and complex interactions among options. To address these challenges, we propose TSOA, a Two-Stage Optimization Approach for GCC compilation options to minimize execution time. In the first stage, we present OPPM, an Option Preselection algorithm based on Pattern Mining. OPPM generates diverse samples to cover a wide range of option interactions. It subsequently mines frequent options from both objective-improved and non-improved samples. The mining results are further validated using CRC codes to precisely preselect options and reduce the optimization space. Transitioning to the second stage, we present OSEA, an Option Selection Evolutionary optimization Algorithm. OSEA is grounded in solution preselection and an option interaction graph. The solution preselection employs a random forest to build a classifier, efficiently identifying promising solutions for the next-generation population and thereby reducing the time spent on utility evaluation. Simultaneously, the option interaction graph is built to capture option interplays and their influence on objectives from evaluated solutions. Then, high-quality solutions are generated based on the option interaction graph. We evaluate the performance of TSOA by comparing it with representative machine learning-based and design space exploration approaches across a diverse set of 20 problem instances from two benchmark platforms. Additionally, we validate the effectiveness of OPPM and conduct related ablation experiments. The experimental results show that TSOA outperforms state-of-the-art approaches significantly in both optimization time and solution quality. Moreover, OPPM outperforms other option preselection algorithms, while the effectiveness of random forest-assisted solution preselection, along with new solution generation based on the option interaction graph, has been verified.
... In this paper, the motivation for using MLIR's affine dialect auto-scheduling is to investigate the effectiveness of automatic optimization techniques in improving the performance of computer vision applications and AI computation. This kind of optimization is used in computer vision domain-specific language, Halide [4] and traditional compiler GCC [5] by different approaches. ...
... On the other hand, the machine learning approach also be used to handle instruction scheduling optimization. IBM Milepost gcc [5] developed a decision tree-based machine learning plugin to predict the optimal combination. The approach involves learning the best optimizations across multiple programs and architectures based on the correlation between program features, run-time behavior, and optimizations. ...
... Second, machine learning methods take less time to get the optimized solution than iterative compilation when the process is large enough. However, the speedup of performance and the reduction of code size may not be as well as an iterative compilation as [5] shows. Thus, according to the size of the search space and the speedup between the two approaches, this paper tries to explore the strength and limitations of MLIR affine transformation in computer vision algorithms using a heuristic approach. ...
Preprint
Full-text available
The increasing usage of computer vision algorithms in camera-centric devices has led to a growing need for optimizing these algorithms to improve their performance on resource-constrained platforms. Halide is a language specific to image processing algorithms that separates the algorithm's scheduling from its implementation, resulting in high performance. This thesis proposes an approach to improve the performance of Halide computer vision algorithms using stochastic algorithms such as simulated annealing to optimize scheduling, which enables exploring the global optimum of affine scheduling in constrained time. To convert the Halide program to MLIR, we use novel compile flows, namely the Halide to MLIR (HTM) converter. The efficacy of the approach will be evaluated on different platforms, such as x86, ARM, and RISC-V. The study demonstrates the potential of MLIR's transformation and optimization capabilities on Affine dialects and highlights the need for tuning infrastructure to fully leverage MLIR's optimization capabilities.
... However, some researchers have used AI to decide which optimization techniques are to be included in compilers. The AI techniques that have been used so far to select optimization techniques are genetic programming (GP) (Stephenson et al., 2003), genetic algorithm (GA) (Agakov et al., 2006), probabilistic ML (Fursin et al., 2011), artificial neural network (ANN) (Dubach et al., 2007) and long short-term memory (LSTM) (Cummins et al., 2017). Programmers developing compilers can manually analyze only a few optimization techniques. ...
... Alternatively, AI can be used to study the performance of various available optimization techniques on a large number of sample programs, and only the best optimization techniques can be then included in the compilers. Fursin et al. (2011) andCummins et al. (2017) followed this approach and could achieve up to 11% and 14% speedup, respectively. ...
Conference Paper
Full-text available
Compilers are complex programs and specialized algorithms are available to implement the different phases of compilers. Nevertheless, some researchers have used artificial intelligence techniques to improve the performance of the syntax analysis, code optimization and code generation phases of compilers, and to check the correctness of compilers. This paper reviews the artificial intelligence, machine learning and deep learning techniques that have been successfully used so far for designing and testing compilers.
... Dans ce contexte, Fursin et al. [21] ont proposé MILEPOST GCC, un framework de compilation capable d'apprendre automatiquement comment optimiser au mieux les programmes pour les processeurs hétérogènes configurables, sur la base de la corrélation entre les caractéristiques du programme, le comportement à l'exécution et les optimisations. Pour extraire les caractéristiques du programme d'entrée, Ils proposent un plugin pour générer un vecteur de taille fixe représentant ce dernier. ...
... Chaque application passe par une phase de caractérisation qui génère une représentation paramétrée de l'application cible pour extraire ses principales caractéristiques. Ils caractérisent un programme en combinant deux vecteurs de représentation issus du plugin de MILEPOST GCC proposé par Fursin et al. [21], et celle de MICA « Microarchitecture-independent workload characterization » proposé par Hoste et Eeckhout [28] contenant des caractéristiques comme le parallélisme au niveau des instructions, distances de réutilisation de la mémoire, etc. Ces caractéristiques sont pré-traitées au moyen de techniques statistiques de réduction des dimensions afin d'identifier une représentation plus compacte. ...
Thesis
Full-text available
Les techniques d'optimisation automatique de code permettent d'améliorer les performances des programmes notamment le temps d’exécution. Ces techniques visent à transformer les programmes pour exploiter plus efficacement le matériel utilisé, en explorant l'espace des optimisations possibles pour choisir les plus efficaces. Les implémentations efficaces de ces techniques utilisent généralement des modèles de coût basés sur l'apprentissage automatique/profond afin d'évaluer l'effet des optimisations explorées. Dans ce travail, nous proposons un modèle de coût basé sur l'apprentissage profond qui vise à prédire l'accélération obtenue suite à l'application d'une séquence de transformations de code de manière plus précise par rapport à l'approche actuelle utilisée dans le compilateur Tiramisu. Ce nouveau modèle a l'avantage de supporter une plus large gamme de programmes, ce qui permet de meilleures optimisations et de meilleures accélérations obtenues pour les programmes du monde réel. Le modèle de coût proposé atteint une erreur absolue moyenne en pourcentage de 19.95% pour prédire les accélérations des programmes optimisés.
... I have been exposed to these problems since 2008 while developing a machine-learning based compiler with collective tuning and federated learning, introducing the artifact evaluation process with a unified artifact appendix and reproducibility checklists at ACM and IEEE conferences, reproducing results from many research projects from academia and industry and helping companies and the community run MLPerf benchmarks and submit the most efficient and cost-effective versions and configurations of software and hardware for popular AI tasks [30,29,35,10]. While working with academia and industry, I have noticed that often takes weeks and months of private, painful, 1 arXiv:2406.16791v1 ...
Preprint
In this white paper, I present my community effort to automatically co-design cheaper, faster and more energy-efficient software and hardware for AI, ML and other popular workloads with the help of the Collective Mind framework (CM), virtualized MLOps, MLPerf benchmarks and reproducible optimization tournaments. I developed CM to modularize, automate and virtualize the tedious process of building, running, profiling and optimizing complex applications across rapidly evolving open-source and proprietary AI/ML models, datasets, software and hardware. I achieved that with the help of portable, reusable and technology-agnostic automation recipes (ResearchOps) for MLOps and DevOps (CM4MLOps) discovered in close collaboration with academia and industry when reproducing more than 150 research papers and organizing the 1st mass-scale community benchmarking of ML and AI systems using CM and MLPerf. I donated CM and CM4MLOps to MLCommons to help connect academia and industry to learn how to build and run AI and other emerging workloads in the most efficient and cost-effective way using a common and technology-agnostic automation, virtualization and reproducibility framework while unifying knowledge exchange, protecting everyone's intellectual property, enabling portable skills, and accelerating transfer of the state-of-the-art research to production. My long-term vision is to make AI accessible to everyone by making it a commodity automatically produced from the most suitable open-source and proprietary components from different vendors based on user demand, requirements and constraints such as cost, latency, throughput, accuracy, energy, size and other important characteristics.
... It discusses the principles of PGO, including profile collection and optimization feedback, and explores recent research on advanced PGO methods, such as training phase selection, adaptive instrumentation, and selective optimization. PGO can improve program performance by improving profile accuracy and overhead, but it has both benefits and limitations.[16][17][18][31][35] The development of polyhedral compilation and optimization techniques is examined in this literature study. ...
Article
Full-text available
The modern period has seen advancements in compiler design, optimization technique, and software system efficiency. The influence of the most recent developments in compiler design and optimization techniques on program execution speed, memory utilization, and overall software quality is highlighted in this study. The design of the compiler is advanced by the efficient code that is now structured in research with high-speed performance without manual intervention. The influence of the most recent developments in compiler design and optimization techniques on program execution speed, memory utilization, and overall software quality is highlighted in this paper's thorough analysis.
Conference Paper
General matrix multiplication (GEMM) is a core computation kernel for deep neural networks. CUTLASS, a state-of-the-art open-source CUDA-based linear-algebra template library, provides a highly optimized tiling-based GEMM. However, CUTLASS GEMM often cannot achieve the optimal performance when its tiling configuration is not appropriately chosen because the performance varies significantly depending on some factors such as the tile size and shape, as well as the target graphics processing unit (GPU) architecture. Thus, determining the optimal tiling configuration is a major challenge in achieving the best performance of a tiling-based GEMM.To address this problem, we propose CUTLASS-tailor, a novel end-to-end framework that predicts the best tile parameters for target CUTLASS GEMM operations and underlying GPUs using a neural network model. We trained the prediction model using a suitable synthetic dataset that includes various input matrix combinations with different sizes and structures. Furthermore, to cover the various GPUs with a universal model, we also included the number of GPU cores and the amount of shared memory as GPU hardware features for the input of the CUTLASS-tailor network. On a test dataset from several real-world GEMMs, CUTLASS-tailor-based GEMM operations outperformed the GEMM operations using cuBLAS by up to 1.94× on an NVIDIA TitanXp GPU, and also showed that CUTLASS-tailor can find better tile parameters than well-known search algorithms.
Article
Full-text available
Modern compilers are responsible for adapting the semantics of source programs into a form that makes efficient use of a highly complex, hetero-geneous machine. This adaptation amounts to solve an optimization problem in a huge and unstructured search space, while predicting the performance outcome of complex sequences of program transformations. The polyhedral model of com-pilation is aimed at these challenges. Its geometrical, non-inductive semantics enables the construction of better-structured optimization problems and pre-cise analytical models. Recent work demonstrated the scalability of the main polyhedral algorithms to real-world programs. Its integration into production compilers is under way, pioneered by the graphite branch of the GNU Compiler Collection (GCC). Two years after the effective beginning of the project, this paper reports on original questions and innovative solutions that arose during the design and implementation of graphite.
Article
Research over the past five years has shown significant performance improvements using a technique called adaptive compilation. An adaptive compiler uses a compile-execute-analyze feedback loop to find the combination of optimizations and parameters that minimizes some performance goal, such as code size or execution time. Despite its ability to improve performance, adaptive compilation has not seen widespread use because of two obstacles: the large amounts of time that such systems have used to perform the many compilations and executions prohibits most users from adopting these systems, and the complexity inherent in a feedback-driven adaptive system has made it difficult to build and hard to use. A significant portion of the adaptive compilation process is devoted to multiple executions of the code being compiled. We have developed a technique called virtual execution to address this problem. Virtual execution runs the program a single time and preserves information that allows us to accurately predict the performance of different optimization sequences without running the code again. Our prototype implementation of this technique significantly reduces the time required by our adaptive compiler. In conjunction with this performance boost, we have developed a graphical-user interface (GUI) that provides a controlled view of the compilation process. By providing appropriate defaults, the interface limits the amount of information that the user must provide to get started. At the same time, it lets the experienced user exert fine-grained control over the parameters that control the system.
Conference Paper
Compiler writers have crafted many heuristics over the years to approximately solve NP-hard problems efficiently. Finding a heuristic that performs well on a broad range of applications is a tedious and difficult process. This paper introduces Meta Optimization, a methodology for automatically fine-tuning compiler heuristics. Meta Optimization uses machine-learning techniques to automatically search the space of compiler heuristics. Our techniques reduce compiler design complexity by relieving compiler writers of the tedium of heuristic tuning. Our machine-learning system uses an evolutionary algorithm to automatically find effective compiler heuristics. We present promising experimental results. In one mode of operation Meta Optimization creates application-specific heuristics which often result in impressive speedups. For hyperblock formation, one optimization we present in this paper, we obtain an average speedup of 23% (up to 73%) for the applications in our suite. Furthermore, by evolving a compiler's heuristic over several benchmarks, we can create effective, general-purpose heuristics. The best general-purpose heuristic our system found for hyperblock formation improved performance by an average of 25% on our training set, and 9% on a completely unrelated test set. We demonstrate the efficacy of our techniques on three different optimizations in this paper: hyperblock formation, register allocation, and data prefetching.
Conference Paper
As hardware complexity increases and virtualization is added at more layers of the execution stack, predicting the performance impact of optimizations becomes increasingly difficult. Production compilers and virtual machines invest substantial development effort in performance tuning to achieve good performance for a range of benchmarks. Although optimizations typically perform well on average, they often have unpredictable impact on running time, sometimes degrading performance significantly. Today's VMs perform sophisticated feedback-directed optimizations, but these techniques do not address performance degradations, and they actually make the situation worse by making the system more unpredictable.This paper presents an online framework for evaluating the effectiveness of optimizations, enabling an online system to automatically identify and correct performance anomalies that occur at runtime. This work opens the door for a fundamental shift in the way optimizations are developed and tuned for online systems, and may allow the body of work in offline empirical optimization search to be applied automatically at runtime. We present our implementation and evaluation of this system in a product Java VM.
Article
As the current rate of improvement in processor performance far exceeds the rate of memory performance, memory latency is the dominant overhead in many performance critical applications. In many cases, automatic compiler-based approaches to improving memory performance are limited and programmers frequently resort to manual optimisation techniques. However, this process is tedious and time-consuming. Furthermore, a diverse range of a rapidly evolving hardware makes the optimisation process even more complex. It is often hard to predict the potential benefits from different optimisations and there are no simple criteria to stop optimisations i.e. when optimal memory performance has been achieved or sufficiently approached. This thesis presents a platform independent optimisation approach for numerical applications based on iterative feedback-directed program restructuring using a new reasonably fast and accurate performance prediction technique for guiding optimisations. New strategies for searching the optimisation space, by means of profiling to find the best possible program variant, have been developed. These strategies have been evaluated using a range of kernels and programs on different platforms and operating systems. A significant performance improvement has been achieved using new approaches when compared to the state-of-the-art native static and platform-specific feedback directed compilers.
Article
Compiler writers are expected to create effective and inexpensive solutions to NP-hard problems such as instruction scheduling and register allocation. To make matters worse, separate optimization phases have strong interactions and competing resource constraints. Compiler writers deal with system complexity by dividing the problem into multiple phases and devising approximate heuristics for each phase. However, to achieve satisfactory performance, developers are forced to manually tweak their heuristics with trial-and-error experimentation. In this dissertation I present meta optimization, a methodology for automatically constructing high-quality compiler heuristics using machine-learning techniques. This thesis describes machine-learned heuristics for three important compiler optimizations: hyperblock formation, register allocation, and loop unrolling. The machine-learned heuristics outperform their state-of-the-art hand-crafted counterparts, by as much as 3x in some cases. By automatically collecting data and systematically analyzing them, my techniques discover subtle interactions that even experienced engineers would likely overlook. In addition to improving performance, my techniques can significantly reduce the human effort involved in compiler design.
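As a flavour of how such a heuristic can be machine-learned, the sketch below treats loop-unrolling-factor selection as supervised learning over loop features. The features, the training examples and the 1-nearest-neighbour rule are illustrative assumptions standing in for the classifiers the dissertation actually evaluates.

# Learning an unroll-factor heuristic from labelled examples:
# (trip_count, body_ops, has_branch) -> best unroll factor found by search.
TRAINING = [
    ((1000, 4, 0), 8),
    ((1000, 40, 0), 2),
    ((16,   4, 0), 4),
    ((1000, 8, 1), 1),    # control flow in the body: don't unroll
    ((64,  12, 0), 4),
]

def distance(f1, f2):
    return sum((a - b) ** 2 for a, b in zip(f1, f2))

def predict_unroll(features):
    # 1-nearest-neighbour: reuse the label of the closest known loop
    _, factor = min(((distance(features, f), u) for f, u in TRAINING),
                    key=lambda pair: pair[0])
    return factor

# Nearest neighbour of (512, 6, 0) is (64, 12, 0) under this unscaled
# metric (trip count dominates the distance), so the prediction is 4.
print(predict_unroll((512, 6, 0)))

In practice the features would be normalised and the classifier cross-validated; the point here is only the shape of the mapping from loop features to a heuristic decision.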
Article
Understanding the behavior of emerging workloads is important for designing next-generation microprocessors. To address this issue, computer architects and performance analysts build benchmark suites of new application domains and compare the behavioral characteristics of these suites against well-known benchmark suites. Current practice typically compares workloads based on microarchitecture-dependent characteristics generated by running the workloads on real hardware. There is one pitfall, though, with comparing benchmarks using microarchitecture-dependent characteristics: completely different inherent program behavior may yield similar microarchitecture-dependent behavior. This paper proposes a methodology for characterizing benchmarks based on microarchitecture-independent characteristics. The methodology minimizes the number of inherent program characteristics that need to be measured by exploiting correlation between program characteristics; in fact, we reduce our 47-dimensional space to an 8-dimensional space without compromising the methodology's ability to compare benchmarks. The important benefits of this methodology are that (i) only a limited number of microarchitecture-independent characteristics need to be measured, and (ii) the resulting workload characterization is easy to interpret. Using this methodology we compare 122 benchmarks from 6 recently proposed benchmark suites. We conclude that some benchmarks in emerging benchmark suites are indeed similar to benchmarks from well-known benchmark suites, as suggested by a microarchitecture-dependent characterization. However, other benchmarks are dissimilar under a microarchitecture-independent characterization, although a microarchitecture-dependent characterization suggests the opposite to be true.
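A minimal sketch of the correlation-exploiting reduction might look as follows, assuming a greedy filter that keeps a characteristic only if it is not strongly correlated with one already kept. The toy data, the 0.9 threshold and the greedy strategy are illustrative assumptions rather than the paper's exact statistical methodology.

# Greedy correlation filter over program characteristics:
# drop any feature nearly collinear with one already retained.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# rows = benchmarks, columns = microarchitecture-independent features
data = [
    [0.9, 1.8, 3.1, 12.0],
    [0.5, 1.1, 2.9, 30.0],
    [0.7, 1.4, 1.0, 18.0],
    [0.2, 0.5, 2.2, 55.0],
]
columns = list(zip(*data))

kept = []
for i, col in enumerate(columns):
    # keep a feature only if it is not nearly collinear with one we kept
    if all(abs(pearson(col, columns[j])) < 0.9 for j in kept):
        kept.append(i)
print("retained feature indices:", kept)   # columns 1 and 3 are redundant here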
Article
Iterative compilation is a widely adopted technique to optimize programs for different constraints such as performance, code size and power consumption in rapidly evolving hardware and software environments. However, in the case of statically compiled programs, it is often restricted to optimizations for a specific dataset and may not be applicable to applications that exhibit different run-time behavior across program phases, across multiple datasets, or when executed in heterogeneous, reconfigurable and virtual environments. Several frameworks have recently been introduced to tackle these problems and to enable run-time optimization and adaptation of statically compiled programs based on static function multiversioning and monitoring of online program behavior. In this article, we present a novel technique to select a minimal set of representative optimization variants (function versions) for such frameworks while avoiding performance loss across available datasets and code-size explosion. We developed a novel mapping mechanism, using popular decision-tree or rule-induction machine learning techniques, to rapidly select the best code versions at run-time based on dataset features and to minimize selection overhead. These techniques enable the creation of self-tuning static binaries or libraries that adapt to changing behavior and environments at run-time using staged compilation, require no complex recompilation frameworks, and effectively outperform traditional single-version non-adaptable code.
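The run-time selection mechanism can be pictured with the following sketch, assuming a decision tree trained offline maps dataset features to the fastest of several precompiled function versions. The features, versions, training data and the use of scikit-learn are all illustrative assumptions, not the article's implementation.

# Run-time dispatch among precompiled function versions via a decision
# tree trained offline on per-dataset iterative-compilation results.
from sklearn.tree import DecisionTreeClassifier

# offline training: (input_size, sparsity) -> index of fastest version,
# as discovered beforehand by iterative compilation across datasets
X = [[1_000, 0.9], [1_000, 0.1], [1_000_000, 0.9], [1_000_000, 0.1]]
y = [0, 0, 1, 2]          # 0: baseline, 1: sparse-aware, 2: cache-tiled

selector = DecisionTreeClassifier(max_depth=3).fit(X, y)

def kernel_baseline(data): return sum(x * x for x in data)
def kernel_sparse(data):   return sum(x * x for x in data if x)
def kernel_tiled(data):    return sum(x * x for x in data)   # stand-in body

VERSIONS = [kernel_baseline, kernel_sparse, kernel_tiled]

def run(data, sparsity):
    # run-time dispatch: one cheap prediction instead of re-profiling
    version = selector.predict([[len(data), sparsity]])[0]
    return VERSIONS[version](data)

print(run(list(range(1_500_000)), 0.1))   # large dense input -> tiled version

Keeping the selector small (a shallow tree) is what makes the run-time overhead negligible compared with re-profiling every dataset.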