Google Vizier: A Service for Black-Box Optimization
Daniel Golovin
Google Research
dgg@google.com
Benjamin Solnik
Google Research
bsolnik@google.com
Subhodeep Moitra
Google Research
smoitra@google.com
Greg Kochanski
Google Research
gpk@google.com
John Karro
Google Research
karro@google.com
D. Sculley
Google Research
dsculley@google.com
ABSTRACT
Any suciently complex system acts as a black box when it becomes
easier to experiment with than to understand. Hence, black-box
optimization has become increasingly important as systems have
become more complex. In this paper we describe Google Vizier, a
Google-internal service for performing black-box optimization that
has become the de facto parameter tuning engine at Google. Google
Vizier is used to optimize many of our machine learning models
and other systems, and also provides core capabilities to Google’s
Cloud Machine Learning HyperTune subsystem. We discuss our
requirements, infrastructure design, underlying algorithms, and
advanced features such as transfer learning and automated early
stopping that the service provides.
KEYWORDS
Black-Box Optimization, Bayesian Optimization, Gaussian Processes, Hyperparameters, Transfer Learning, Automated Stopping
1 INTRODUCTION
Black-box optimization is the task of optimizing an objective function $f : X \to \mathbb{R}$ with a limited budget for evaluations. The adjective "black-box" means that while we can evaluate $f(x)$ for any $x \in X$, we have no access to any other information about $f$, such as gradients or the Hessian. When function evaluations are expensive, it makes sense to carefully and adaptively select values to evaluate; the overall goal is for the system to generate a sequence of $x_t$ that approaches the global optimum as rapidly as possible.
Black-box optimization algorithms can be used to find the best operating parameters for any system whose performance can be measured as a function of adjustable parameters. It has many important applications, such as automated tuning of the hyperparameters of machine learning systems (e.g., learning rates, or the number of hidden layers in a deep neural network), optimization of the user interfaces of web services (e.g., optimizing colors and fonts
KDD ’17, August 13-17, 2017, Halifax, NS, Canada
©2017 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-4887-4/17/08.
https://doi.org/10.1145/3097983.3098043
to maximize reading speed), and optimization of physical systems
(e.g., optimizing airfoils in simulation).
In this paper we discuss a state-of-the-art system for black-box optimization developed within Google, called Google Vizier, named after a high official who offers advice to rulers. It is a service for black-box optimization that supports several advanced algorithms. The system has a convenient Remote Procedure Call (RPC) interface, along with a dashboard and analysis tools. Google Vizier is a research project, parts of which supply core capabilities to our Cloud Machine Learning HyperTune (https://cloud.google.com/ml/) subsystem. We discuss the architecture of the system, design choices, and some of the algorithms used.
1.1 Related Work
Black-box optimization makes minimal assumptions about the problem under consideration, and thus is broadly applicable across many domains; it has been studied in multiple scholarly fields under names including Bayesian Optimization [2, 25, 26], Derivative-free Optimization [7, 24], Sequential Experimental Design [5], and assorted variants of the multiarmed bandit problem [13, 20, 29].
Several classes of algorithms have been proposed for the problem. The simplest of these are non-adaptive procedures such as Random Search, which selects $x_t$ uniformly at random from $X$ at each time step $t$, independent of the previously selected points $\{x_\tau : 1 \le \tau < t\}$, and Grid Search, which selects along a grid (i.e., the Cartesian product of finite sets of feasible values for each parameter). Classic algorithms such as Simulated Annealing and assorted genetic algorithms have also been investigated, e.g., Covariance Matrix Adaptation [16].
Another class of algorithms performs a local search by selecting points that maintain a search pattern, such as a simplex in the case of the classic Nelder-Mead algorithm [22]. More modern variants of these algorithms maintain simple models of the objective $f$ within a subset of the feasible region (called the trust region), and select a point $x_t$ to improve the model within the trust region [7].
More recently, some researchers have combined powerful techniques for modeling the objective $f$ over the entire feasible region, using ideas developed for multiarmed bandit problems for managing explore/exploit trade-offs. These approaches are fundamentally Bayesian in nature, hence this literature goes under the name Bayesian Optimization. Typically, the model for $f$ is a Gaussian process (as in [26, 29]), a deep neural network (as in [27, 31]), or a regression forest (as in [2, 19]).
Many of these algorithms have open-source implementations available. Within the machine learning community, examples include HyperOpt (https://github.com/jaberg/hyperopt), MOE (https://github.com/Yelp/MOE), Spearmint (https://github.com/HIPS/Spearmint), and AutoWeka (https://github.com/automl/autoweka), among many others. In contrast to such software packages, which require practitioners to set them up and run them locally, we opted to develop a managed service for black-box optimization, which is more convenient for users but involves additional design considerations.
1.2 Denitions
Throughout the paper, we use to the following terms to describe
the semantics of the system:
ATrial is a list of parameter values,
x
, that will lead to a single
evaluation of
f(x)
. A trial can be “Completed”, which means that it
has been evaluated and the objective value
f(x)
has been assigned
to it, otherwise it is “Pending”.
AStudy represents a single optimization run over a feasible
space. Each Study contains a conguration describing the feasible
space, as well as a set of Trials. It is assumed that
f(x)
does not
change in the course of a Study.
AWorker refers to a process responsible for evaluating a Pending
Trial and calculating its objective value.
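As a rough illustration only (not Vizier's actual protocol buffer definitions, which are internal), these entities might be modeled as follows; all field names here are hypothetical:

# Illustrative sketch; Vizier's real Trial/Study messages are protocol
# buffers with additional fields. All names here are hypothetical.
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional

class TrialStatus(Enum):
    PENDING = "PENDING"
    COMPLETED = "COMPLETED"

@dataclass
class Trial:
    parameters: Dict[str, object]             # parameter name -> value
    status: TrialStatus = TrialStatus.PENDING
    objective_value: Optional[float] = None   # f(x), set once Completed

@dataclass
class Study:
    name: str
    config: dict                              # feasible space, goal, etc.
    trials: List[Trial] = field(default_factory=list)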
2 SYSTEM OVERVIEW
This section explores the design considerations involved in implementing black-box optimization as a service.
2.1 Design Goals and Constraints
Vizier's design satisfies the following desiderata:
• Ease of use. Minimal user configuration and setup.
• Hosts state-of-the-art black-box optimization algorithms.
• High availability.
• Scalable to millions of trials per study, thousands of parallel trial evaluations per study, and billions of studies.
• Easy to experiment with new algorithms.
• Easy to change out algorithms deployed in production.
For ease of use, we implemented Vizier as a managed service that stores the state of each optimization. This approach drastically reduces the effort a new user needs to get up and running, and a managed service with a well-documented and stable RPC API allows us to upgrade the service without user effort. We provide a default configuration for our managed service that is good enough to ensure that most users need never concern themselves with the underlying optimization algorithms.
The default option allows the service to dynamically select a recommended black-box algorithm along with low-level settings based on the study configuration. We choose to make our algorithms stateless, so that we can seamlessly switch algorithms during a study, dynamically choosing the algorithm that is likely to perform better for a particular trial of a given study. For example, Gaussian Process Bandits [26, 29] provide excellent result quality, but naive implementations scale as $O(n^3)$ with the number of training points. Thus, once we have collected a large number of completed Trials, we may want to switch to using a more scalable algorithm.
At the same time, we want to allow ourselves (and advanced users) the freedom to experiment with new algorithms or special-case modifications of the supported algorithms in a manner that is safe, easy, and fast. Hence, we have built Google Vizier as a modular system consisting of four cooperating processes (see Figure 1) that update the state of Studies in the central database. The processes themselves are modular with several clean abstraction layers that allow us to experiment with and apply different algorithms easily.
Finally, we want to allow multiple trials to be evaluated in parallel, and allow for the possibility that evaluating the objective function for each trial could itself be a distributed process. To this end we define Workers, responsible for evaluating suggestions, and identify each worker by a name (a worker_handle) that persists across process preemptions or crashes.
2.2 Basic User Workflow
To use Vizier, a developer may use one of our client libraries (currently implemented in C++, Python, Golang), which will generate service requests encoded as protocol buffers [15]. The basic workflow is extremely simple. Users specify a study configuration which includes:
• Identifying characteristics of the study (e.g., name, owner, permissions).
• The set of parameters along with feasible sets for each (c.f. Section 2.3.1 for details); Vizier does constrained optimization over the feasible set.
Given this configuration, basic use of the service (with each trial being evaluated by a single process) can be implemented as follows:
# Register this client with the Study, creating it if
# necessary.
client.LoadStudy(study_config, worker_handle)
while not client.StudyIsDone():
    # Obtain a trial to evaluate.
    trial = client.GetSuggestion()
    # Evaluate the objective function at the trial parameters.
    metrics = RunTrial(trial)
    # Report back the results.
    client.CompleteTrial(trial, metrics)
Here RunTrial is the problem-specific evaluation of the objective function $f$. Multiple named metrics may be reported back to Vizier; however, one must be distinguished as the objective value $f(x)$ for trial $x$. Note that multiple processes working on a study should share the same worker_handle if and only if they are collaboratively evaluating the same trial. All processes registered with a given study with the same worker_handle are guaranteed to receive the same trial upon request, which enables distributed trial evaluation.
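To make the worker_handle semantics concrete, the following sketch shows how several replicas of one distributed evaluation job might register under a single handle so that they all receive the same Trial. It reuses the client object from the snippet above; the job name, replica-index environment variable, and RunDistributedTrial helper are hypothetical details of the user's own setup, not part of Vizier.

import os

# Hypothetical sketch: all replicas of one evaluation job share a single
# worker_handle, so Vizier hands them the same Trial to evaluate jointly.
# A different job (with a different handle) would receive a different Trial.
job_name = "resnet_tuning_job_7"           # hypothetical job identifier
worker_handle = job_name                   # shared by every replica of this job
replica_index = int(os.environ.get("REPLICA_INDEX", "0"))

client.LoadStudy(study_config, worker_handle)
trial = client.GetSuggestion()             # same Trial on every replica

metrics = RunDistributedTrial(trial, replica_index)  # user-defined evaluation
if replica_index == 0:
    # Only one replica needs to report the final result.
    client.CompleteTrial(trial, metrics)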
Figure 1: Architecture of Vizier service. Main components are (1) Dangling Work Finder (restarts work lost to preemptions), (2) Persistent Database holding the current state of all Studies, (3) Suggestion Service (creates new Trials), (4) Early Stopping Service (helps terminate a Trial early), (5) Vizier API (JSON, validation, multiplexing), and (6) Evaluation Workers (provided and owned by the user).
2.3 Interfaces
2.3.1 Configuring a Study. To configure a study, the user provides a study name, owner, optional access permissions, an optimization goal from {MAXIMIZE, MINIMIZE}, and specifies the feasible region $X$ via a set of ParameterConfigs, each of which declares a parameter name along with its values. We support the following parameter types:
• DOUBLE: The feasible region is a closed interval $[a, b]$ for some real values $a \le b$.
• INTEGER: The feasible region has the form $[a, b] \cap \mathbb{Z}$ for some integers $a \le b$.
• DISCRETE: The feasible region is an explicitly specified, ordered set of real numbers.
• CATEGORICAL: The feasible region is an explicitly specified, unordered set of strings.
Users may also suggest recommended scaling, e.g., logarithmic scaling for parameters for which the objective may depend only on the order of magnitude of a parameter value.
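As an illustration of these parameter types, a study configuration for tuning a small neural network might look roughly like the following Python dictionary; the exact field names of Vizier's configuration schema are not public, so this layout is an assumption.

# Hypothetical study configuration; field names are illustrative, not
# Vizier's actual schema.
study_config = {
    "name": "mnist_tuning_example",
    "owner": "some_user",
    "goal": "MAXIMIZE",                      # from {MAXIMIZE, MINIMIZE}
    "parameters": [
        {"name": "learning_rate", "type": "DOUBLE",
         "min": 1e-5, "max": 1.0, "scaling": "LOG"},
        {"name": "num_hidden_layers", "type": "INTEGER",
         "min": 1, "max": 8},
        {"name": "batch_size", "type": "DISCRETE",
         "values": [32, 64, 128, 256]},
        {"name": "activation", "type": "CATEGORICAL",
         "values": ["relu", "tanh", "sigmoid"]},
    ],
}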
2.3.2 API Definition. Workers and end users can make calls to the Vizier Service using either a REST API or Google's internal RPC protocol [15]. The most important service calls are:
• CreateStudy: Given a Study configuration, this creates an optimization Study and returns a globally unique identifier ("guid") which is then used for all future service calls. If a Study with a matching name exists, the guid for that Study is returned. This allows parallel workers to call this method and all register with the same Study.
• SuggestTrials: This method takes a "worker handle" as input, and immediately returns a globally unique handle for a "long-running operation" that represents the work of generating Trial suggestions. The user can then poll the API periodically to check the status of the operation. Once the operation is completed, it will contain the suggested Trials. This design ensures that all service calls are made with low latency, while allowing for the fact that the generation of Trials can take longer (a polling sketch follows this list).
• AddMeasurementToTrial: This method allows clients to provide intermediate metrics during the evaluation of a Trial. These metrics are then used by the Automated Stopping rules to determine which Trials should be stopped early.
• CompleteTrial: This method changes a Trial's status to "Completed", and provides a final objective value that is then used to inform the suggestions provided by future calls to SuggestTrials.
• ShouldTrialStop: This method returns a globally unique handle for a long-running operation that represents the work of determining whether a Pending Trial should be stopped.
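A client-side polling loop for the long-running-operation pattern described above might look roughly like the following; the operation accessor methods shown here are illustrative assumptions rather than the documented Vizier API.

import time

# Hypothetical polling loop for a long-running SuggestTrials operation.
operation = client.SuggestTrials(worker_handle, num_suggestions=2)
while not client.IsOperationDone(operation):
    time.sleep(5)  # poll periodically; the interval here is arbitrary
trials = client.GetOperationResult(operation)  # the suggested Trials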
2.4 Infrastructure
2.4.1 Parallel Processing of Suggestion Work. As the de facto
parameter tuning engine of Google, Vizier is constantly working on
generating suggestions for a large number of Studies concurrently.
As such, a single machine would be insufficient for handling the
workload. Our Suggestion Service is therefore partitioned across
several Google datacenters, with a number of machines being used
in each one. Each instance of the Suggestion Service potentially
can generate suggestions for several Studies in parallel, giving
us a massively scalable suggestion infrastructure. Google’s load
balancing infrastructure is then used to allow clients to make calls
to a unied endpoint, without needing to know which instance is
doing the work.
When a request is received by a Suggestion Service instance to
generate suggestions, the instance first places a distributed lock on
the Study. This lock is acquired for a fixed period of time, and is
periodically extended by a separate thread running on the instance.
In other words, the lock will be held until either the instance fails,
or it decides it’s done working on the Study. If the instance fails
(due to e.g. hardware failure, job preemption, etc), the lock soon
expires, making it eligible to be picked up by a separate process
(called the “DanglingWorkFinder”) which then reassigns the Study
to a dierent Suggestion Service instance.
One consideration in maintaining a production system is that
bugs are inevitably introduced as our code matures. Occasionally, a
new algorithmic change, however well tested, will lead to instances
of the Suggestion Service failing for particular Studies. If a Study
is picked up by the DanglingWorkFinder too many times, it will
temporarily halt the Study and alert us. This prevents subtle bugs
that only aect a few Studies from causing crash loops that aect
the overall stability of the system.
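The lock-and-extend behavior described above can be sketched as follows; the lock object and its methods are a generic illustration, not Vizier's internal implementation.

import threading

# Generic sketch of a lease-style lock that is periodically extended by a
# background thread, as described for the Suggestion Service. The
# `distributed_lock` object and its methods are hypothetical.
LEASE_SECONDS = 60

def hold_study_lock(distributed_lock, study_guid, stop_event):
    """Acquire a time-limited lock and keep extending it until asked to stop."""
    if not distributed_lock.Acquire(study_guid, ttl_seconds=LEASE_SECONDS):
        return False  # another instance owns the Study

    def extend_loop():
        # If this process dies, extensions stop and the lock soon expires,
        # letting the DanglingWorkFinder reassign the Study elsewhere.
        while not stop_event.wait(LEASE_SECONDS / 2):
            distributed_lock.Extend(study_guid, ttl_seconds=LEASE_SECONDS)

    threading.Thread(target=extend_loop, daemon=True).start()
    return True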
2.5 The Algorithm Playground
Vizier’s algorithm playground provides a mechanism for advanced
users to easily, quickly, and safely replace Vizier’s core optimization
algorithms with arbitrary algorithms.
The playground serves a dual purpose; it allows rapid prototyping of new algorithms, and it allows power-users to easily customize Vizier with advanced or exotic capabilities that are particular to their use-case. In all cases, users of the playground benefit from all of Vizier's infrastructure aside from the core algorithms, such as access to a persistent database of Trials, the dashboard, and visualizations.
Figure 2: Architecture of Playground mode. Main components are (1) the Vizier API, which takes service requests; (2) the Custom Policy, which implements the Abstract Policy and generates suggested Trials; (3) the Playground Binary, which drives the custom policy based on demand reported by the Vizier API; and (4) the Evaluation Workers, which behave as normal, i.e., they request and evaluate Trials.
At the core of the playground is the ability to inject Trials into a Study. Vizier allows the user or other authorized processes to request one or more particular Trials be evaluated. In Playground mode, Vizier does not suggest Trials for evaluation, but relies on an external binary to generate Trials, which are then pushed to the service for later distribution to the workers.
More specifically, the architecture of the Playground involves the following key components: (1) Abstract Policy, (2) Playground Binary, (3) Vizier Service, and (4) Evaluation Workers. See Figure 2 for an illustration.
The Abstract Policy contains two abstract methods:
(1) GetNewSuggestions(trials, num_suggestions)
(2) GetEarlyStoppingTrials(trials)
which should be implemented by the user's custom policy. Both these methods are passed the full state of all Trials in the Study, so algorithms may be implemented in a stateless fashion if desired. GetNewSuggestions is expected to generate num_suggestions new trials, while the GetEarlyStoppingTrials method is expected to return a list of Pending Trials that should be stopped early. The custom policy is registered with the Playground Binary which periodically polls the Vizier Service. The Evaluation Workers maintain the service abstraction and are unaware of the existence of the Playground.
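As a rough illustration of this interface, a minimal random-search policy might be implemented as follows; the AbstractPolicy and Trial types are stand-ins for the real (internal) playground types, and the stubs below exist only to make the sketch self-contained.

import random

# Minimal sketch of a custom Playground policy implementing random search.
class AbstractPolicy:
    def GetNewSuggestions(self, trials, num_suggestions):
        raise NotImplementedError
    def GetEarlyStoppingTrials(self, trials):
        raise NotImplementedError

class Trial:
    def __init__(self, parameters):
        self.parameters = parameters

class RandomSearchPolicy(AbstractPolicy):
    def GetNewSuggestions(self, trials, num_suggestions):
        """Return `num_suggestions` new trials sampled uniformly at random."""
        suggestions = []
        for _ in range(num_suggestions):
            params = {
                "learning_rate": 10 ** random.uniform(-5, 0),  # log-uniform
                "batch_size": random.choice([32, 64, 128, 256]),
            }
            suggestions.append(Trial(parameters=params))
        return suggestions

    def GetEarlyStoppingTrials(self, trials):
        """Random search never requests early stopping."""
        return []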
2.6 Benchmarking Suite
Vizier has an integrated framework that allows us to efficiently benchmark our algorithms on a variety of objective functions. Many of the objective functions come from the Black-Box Optimization Benchmarking Workshop [10], but the framework allows for any function to be modeled by implementing an abstract Experimenter class, which has a virtual method responsible for calculating the objective value for a given Trial, and a second virtual method that returns the optimal solution for that benchmark.
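A hedged sketch of what such an Experimenter subclass might look like for a simple benchmark (the 2-D sphere function) is shown below; the base-class and method names are assumptions based on the description above, with a stub base class to keep the sketch self-contained.

# Hypothetical Experimenter implementation for the sphere benchmark.
class Experimenter:
    def EvaluateTrial(self, trial):
        raise NotImplementedError
    def OptimalSolution(self):
        raise NotImplementedError

class SphereExperimenter(Experimenter):
    def EvaluateTrial(self, trial):
        """Compute the objective value for the given Trial's parameters."""
        x = trial.parameters["x"]
        y = trial.parameters["y"]
        return x ** 2 + y ** 2

    def OptimalSolution(self):
        """Return the known optimum of this benchmark."""
        return {"x": 0.0, "y": 0.0}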
Figure 3: A section of the dashboard for tracking the progress of Trials and the corresponding objective function values. Note also the presence of action buttons such as Get Suggestions for manually requesting suggestions.
Users congure a set of benchmark runs by providing a set of
algorithm congurations and a set of objective functions. The bench-
marking suite will optimize each function with each algorithm
k
times (where
k
is congurable), producing a series of performance-
over-time metrics which are then formatted after execution. The
individual runs are distributed over multiple threads and multi-
ple machines, so it is easy to have thousands of benchmark runs
executed in parallel.
2.7 Dashboard and Visualizations
Vizier has a web dashboard which is used for both monitoring and changing the state of Vizier studies. The dashboard is fully featured and implements the full functionality of the Vizier API. The dashboard is commonly used for: (1) Tracking the progress of a study; (2) Interactive visualizations; (3) Creating, updating and deleting a study; (4) Requesting new suggestions, early stopping, activating/deactivating a study. See Figure 3 for a section of the dashboard. In addition to monitoring and visualizations, the dashboard contains action buttons such as Get Suggestions.
The dashboard uses a translation layer which converts between JSON and protocol buffers [15] when talking with backend servers. The dashboard is built with Polymer [14], an open source web framework supported by Google, and uses material design principles. It contains interactive visualizations for analyzing the parameters in your study. In particular, we use the parallel coordinates visualization [18], which has the benefit of scaling to high dimensional spaces (~15 dimensions) and works with both numerical and categorical parameters. See Figure 4 for an example. Each vertical axis is a dimension corresponding to a parameter, whereas each horizontal line is an individual trial. The point at which the horizontal line intersects the vertical axis gives the value of the parameter in that dimension. This can be used for examining how the dimensions co-vary with each other and also against the objective function value (left most axis). The visualizations are built using d3.js [4].
Figure 4: The Parallel Coordinates visualization [18] is used for examining results from different Vizier runs. It has the benefit of scaling to high dimensional spaces (~15 dimensions) and works with both numerical and categorical parameters. Additionally, it is interactive and allows various modes of slicing and dicing data.
3 THE VIZIER ALGORITHMS
Vizier's modular design allows us to easily support multiple algorithms. For studies with under a thousand trials, Vizier defaults to using Batched Gaussian Process Bandits [8]. We use a Matérn kernel with automatic relevance determination (see, e.g., Section 5.1 of Rasmussen and Williams [23] for a discussion) and the expected improvement acquisition function [21]. We search for and find local maxima of the acquisition function with a proprietary gradient-free hill climbing algorithm, with random starting points.
We implement discrete parameters by embedding them in $\mathbb{R}$. Categorical parameters with $k$ feasible values are represented via one-hot encoding, i.e., embedded in $[0, 1]^k$. In both cases, the Gaussian Process regressor gives us a continuous and differentiable function upon which we can walk uphill; then, when the walk has converged, we round to the nearest feasible point.
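A minimal sketch of this embed-then-round treatment of a CATEGORICAL parameter, in a generic NumPy setting rather than Vizier's internal code, is:

import numpy as np

# Sketch: embed a categorical parameter into [0, 1]^k via one-hot encoding,
# let a continuous optimizer propose a point, then round back to a category.
categories = ["relu", "tanh", "sigmoid"]          # k = 3 feasible values

def one_hot(value):
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

def round_to_feasible(continuous_point):
    """Map an arbitrary point in [0, 1]^k to the nearest one-hot vertex."""
    return categories[int(np.argmax(continuous_point))]

# Example: a hill-climbing step on the acquisition surface might yield
# something like [0.2, 0.7, 0.4]; rounding recovers the category "tanh".
print(round_to_feasible(np.array([0.2, 0.7, 0.4])))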
While some authors recommend using Bayesian deep learning models in lieu of Gaussian processes for scalability [27, 31], in our experience they are too sensitive to their own hyperparameters and do not reliably perform well. Other researchers have recognized this problem as well, and are working to address it [28].
For studies with tens of thousands of trials or more, other algorithms may be used. Though RandomSearch and GridSearch are supported as first-class choices and may be used in this regime, and many other published algorithms are supported through the algorithm playground, we currently recommend a proprietary local-search algorithm under these conditions.
For all of these algorithms we support data normalization, which maps numeric parameter values into $[0, 1]$ and objective values onto $[-0.5, 0.5]$. Depending on the problem, a one-to-one nonlinear mapping may be used for some of the parameters, and is typically used for the objective. Data normalization is handled before trials are presented to the trial suggestion algorithms, and its suggestions are transparently mapped back to the user-specified scaling.
3.1 Automated Early Stopping
In some important applications of black–box optimization, informa-
tion related to the performance of a trial may become available dur-
ing trial evaluation. Perhaps the best example of such a performance
curve occurs when tuning machine learning hyperparameters for
models trained progressively (e.g., via some version of stochastic
gradient descent). In this case, the model typically becomes more
accurate as it trains on more data, and the accuracy of the model is
available at the end of each training epoch. Using these accuracy
vs. training step curves, it is often possible to determine that a
trial’s parameter settings are unpromising well before evaluation is
finished. In this case we can terminate trial evaluation early, freeing
those evaluation resources for more promising trial parameters.
When done algorithmically, this is referred to as automated early
stopping.
Vizier supports automated early stopping via an API call to a
ShouldTrialStop
method. Analogously to the Suggestion Service,
there is an Automated Stopping Service that accepts requests from
the Vizier API to analyze a study and determine the set of trials
that should be stopped, according to the configured early stopping
algorithm. As with suggestion algorithms, several automated early
stopping algorithms are supported, and rapid prototyping can be
done via the algorithm playground.
3.2 Automated Stopping Algorithms
Vizier supports the following automated stopping algorithms. These
are meant to work in a stateless fashion, i.e., they are given the full
state of all trials in the Vizier study when determining which trials
should stop.
3.2.1 Performance Curve Stopping Rule. This stopping rule per-
forms regression on the performance curves to make a prediction
of the nal objective value of a Trial given a set of Trials that are
already Completed, and a partial performance curve (i.e., a set of
measurements taken during Trial evaluation). Given this prediction,
if the probability of exceeding the optimal value found thus far is
suciently low, early stopping is requested for the Trial.
While prior work on automated early stopping used Bayesian parametric regression [9, 30], we opted for a Bayesian non-parametric regression, specifically a Gaussian process model with a carefully
designed kernel that measures similarity between performance
curves. Our motivation in this was to be robust to many kinds
of performance curves, including those coming from applications
other than tuning machine learning hyperparameters in which the
performance curves may have very dierent semantics. Notably,
this stopping rule still works well even when the performance curve
is not measuring the same quantity as the objective value, but is
merely predictive of it.
3.2.2 Median Stopping Rule. The median stopping rule stops a pending trial $x_t$ at step $s$ if the trial's best objective value by step $s$ is strictly worse than the median value of the running averages $\hat{o}^{\tau}_{1:s}$ of all completed trials' objectives $x_\tau$ reported up to step $s$. Here, we calculate the running average of a trial $x_\tau$ up to step $s$ as $\hat{o}^{\tau}_{1:s} = \frac{1}{s}\sum_{i=1}^{s} o^{\tau}_i$, where $o^{\tau}_i$ is the objective value of $x_\tau$ at step $i$. As with the performance curve stopping rule, the median stopping rule does not depend on a parametric model, and is applicable to a wide range of performance curves. In fact, the median stopping rule is model-free, and is more reminiscent of a bandit-based approach such as HyperBand [20].
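A compact sketch of the median stopping rule, operating on per-trial lists of intermediate objective values and assuming (for clarity) that lower objective values are better, is shown below.

import statistics

# Sketch of the median stopping rule. Assumes the objective is being
# minimized, so "worse" means "larger"; flip the comparison to maximize.
def should_stop(pending_curve, completed_curves):
    """pending_curve: intermediate objective values of the pending trial.
    completed_curves: per-trial objective-value curves of completed trials."""
    s = len(pending_curve)
    best_so_far = min(pending_curve)  # best value the pending trial has reached

    # Running average of each completed trial over its first s reported steps.
    running_averages = [
        sum(curve[:s]) / min(s, len(curve))
        for curve in completed_curves
        if curve  # ignore empty curves
    ]
    if not running_averages:
        return False

    median_average = statistics.median(running_averages)
    # Stop if the pending trial's best value is strictly worse (larger).
    return best_so_far > median_average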
3.3 Transfer Learning
When doing black-box optimization, users often run studies that are similar to studies they have run before, and we can use this fact to minimize repeated work. Vizier supports a form of Transfer Learning which leverages data from prior studies to guide and accelerate the current study. For instance, one might tune the learning rate and regularization of a machine learning system, then use that Study as a prior to tune the same ML system on a different data set.
Vizier's current approach to transfer learning is relatively simple, yet robust to changes in objective across studies. We designed our transfer learning approach with these goals in mind:
(1) Scale well to situations where there are many prior studies.
(2) Accelerate studies (i.e., achieve better results with fewer trials) when the priors are good, particularly in cases where the location of the optimum, $x^*$, doesn't change much.
(3) Be robust against poorly chosen prior studies (i.e., a bad prior should give only a modest deceleration).
(4) Share information even when there is no formal relationship between the prior and current Studies.
In previous work on transfer learning in the context of hyperparameter optimization, Bardenet et al. [1] discuss the difficulty in transferring knowledge across different datasets, especially when the observed metrics and the sampling of the datasets are different. They use a ranking approach for constructing a surrogate model for the response surface. This approach suffers from the computational overhead of running a ranking algorithm. Yogatama and Mann [32] propose a more efficient approach, which scales as $\Theta(kn + n^3)$ for $k$ studies of $n$ trials each, where the cubic term comes from using a Gaussian process in their acquisition function.
Vizier typically uses Gaussian Process regressors, so one natural approach to implementing transfer learning might be to build a larger Gaussian Process regressor that is trained on both the prior(s) and the current Study. However that approach fails to satisfy design goal 1: for $k$ studies with $n$ trials each it would require $\Omega(k^3n^3)$ time. Such an approach also requires one to specify or learn kernel functions that bridge between the prior(s) and current Study, violating design goal 4.
Instead, our strategy is to build a stack of Gaussian Process regressors, where each regressor is associated with a study, and where each level is trained on the residuals relative to the regressor below it. Our model is that the studies were performed in a linear sequence, each study using the studies before it as priors.
The bottom of the stack contains a regressor built using data from the oldest study in the stack. The regressor above it is associated with the 2nd oldest study, and regresses on the residual of its objective relative to the predictions of the regressor below it. Similarly, the regressor associated with the $i$th study is built using the data from that study, and regresses on the residual of the objective with respect to the predictions of the regressor below it.
Figure 5: An illustration of our transfer learning scheme, showing how $\mu_i$ is built from the residual labels w.r.t. $\mu_{i-1}$ (shown in dotted red lines).
More formally, we have a sequence of studies $\{S_i\}_{i=1}^{k}$ on unknown objective functions $\{f_i\}_{i=1}^{k}$, where the current study is $S_k$, and we build two sequences of regressors $\{R_i\}_{i=1}^{k}$ and $\{R'_i\}_{i=1}^{k}$ having posterior mean functions $\{\mu_i\}_{i=1}^{k}$ and $\{\mu'_i\}_{i=1}^{k}$ respectively, and posterior standard deviation functions $\{\sigma_i\}_{i=1}^{k}$ and $\{\sigma'_i\}_{i=1}^{k}$, respectively. Our final predictions will be $\mu_k$ and $\sigma_k$.
Let $D_i = \{(x^i_t, y^i_t)\}_t$ be the dataset for study $S_i$. Let $R'_i$ be a regressor trained using the data $\{(x^i_t,\; y^i_t - \mu_{i-1}(x^i_t))\}_t$, which computes $\mu'_i$ and $\sigma'_i$. Then we define our posterior mean at level $i$ as $\mu_i(x) := \mu'_i(x) + \mu_{i-1}(x)$. We take our posterior standard deviation at level $i$, $\sigma_i(x)$, to be a weighted geometric mean of $\sigma'_i(x)$ and $\sigma_{i-1}(x)$, where the weights are a function of the amount of data (i.e., completed trials) in $S_i$ and $S_{i-1}$. The exact weighting function depends on a constant $\alpha \ge 1$ that sets the relative importance of old and new standard deviations.
This approach has nice properties when the prior regressors are densely supported (i.e., have many well-spaced data points) but the top-level regressor has relatively little training data: (1) fine structure in the priors carries through to $\mu_k$, even if the top-level regressor gives a low-resolution model of the objective function residual; (2) since the estimate for $\sigma'_k$ is inaccurate, averaging it with $\sigma_{k-1}$ can lead to an improved estimate. Further, when the top-level regressor has dense support, $\beta \approx 1$ and $\sigma_k \approx \sigma'_k$, as one might desire.
We provide details in the pseudocode in Algorithm 1, and illustrate the regressors in Figure 5.
Algorithm 1 is then used in the Batched Gaussian Process Bandits [8] algorithm. Algorithm 1 has the property that for a sufficiently dense sampling of the feasible region in the training data for the current study, the predictions converge to those of a regressor trained only on the current study data. This ensures a certain degree of robustness: badly chosen priors will eventually be overwhelmed (design goal 3).
Algorithm 1 Transfer Learning Regressor
# This is a higher-order function that returns a regressor R(x_test);
# then R(x_test) can be evaluated to obtain (µ, σ).
function GetRegressor(D_training, i)
    if i < 0: return a function that returns (0, 1) for all inputs
    # Recurse to get a regressor (µ_{i-1}(x), σ_{i-1}(x)) trained on
    # the data for all levels of the stack below this one.
    R_prior ← GetRegressor(D_training, i − 1)
    # Compute training residuals.
    D_residuals ← [(x, y − R_prior(x)[0]) for (x, y) ∈ D_i]
    # Train a Gaussian Process (µ′_i(x), σ′_i(x)) on the residuals.
    GP_residuals ← TrainGP(D_residuals)
    function StackedRegressor(x_test)
        µ_prior, σ_prior ← R_prior(x_test)
        µ_top, σ_top ← GP_residuals(x_test)
        µ ← µ_top + µ_prior
        β ← α|D_i| / (α|D_i| + |D_{i−1}|)
        σ ← σ_top^β · σ_prior^(1−β)
        return (µ, σ)
    end function
    return StackedRegressor
end function
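For concreteness, a minimal Python sketch of Algorithm 1 follows; train_gp is assumed to return a callable mapping a point to a (mean, standard deviation) pair, standing in for whatever Gaussian Process library one uses, and alpha is the constant α described above.

# Sketch of Algorithm 1 (transfer learning regressor stack).
def get_regressor(datasets, i, alpha=1.0):
    """datasets: list of per-study datasets, each a list of (x, y) pairs.
    Returns a regressor R with R(x_test) -> (mu, sigma) for study i."""
    if i < 0:
        return lambda x_test: (0.0, 1.0)  # base case: flat prior

    # Regressor trained on all studies below this level of the stack.
    r_prior = get_regressor(datasets, i - 1, alpha)

    # Residuals of this study's objectives w.r.t. the prior's mean.
    d_residuals = [(x, y - r_prior(x)[0]) for (x, y) in datasets[i]]
    gp_residuals = train_gp(d_residuals)  # assumed: returns x -> (mu, sigma)

    n_i = len(datasets[i])
    n_below = len(datasets[i - 1]) if i > 0 else 0

    def stacked_regressor(x_test):
        mu_prior, sigma_prior = r_prior(x_test)
        mu_top, sigma_top = gp_residuals(x_test)
        mu = mu_top + mu_prior
        beta = alpha * n_i / (alpha * n_i + n_below) if (n_i or n_below) else 1.0
        sigma = (sigma_top ** beta) * (sigma_prior ** (1.0 - beta))
        return mu, sigma

    return stacked_regressor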
In production settings, transfer learning is often particularly
valuable when the number of trials per study is relatively small,
but there are many such studies. For example, certain production
machine learning systems may be very expensive to train, limiting
the number of trials that can be run for hyperparameter tuning,
yet are mission critical for a business and are thus worked on year
after year. Over time, the total number of trials spanning several
small hyperparameter tuning runs can be quite informative. Our
transfer learning scheme is particularly well-suited to this case, as
illustrated in section 4.3.
4 RESULTS
4.1 Performance Evaluation
To evaluate the performance of Google Vizier we require functions that can be used to benchmark the results. These are pre-selected, easily calculated functions with known optimal points that have proven challenging for black-box optimization algorithms. We can measure the success of an optimizer on a benchmark function $f$ by its final optimality gap. That is, if $x^*$ minimizes $f$, and $\hat{x}$ is the best solution found by the optimizer, then $|f(\hat{x}) - f(x^*)|$ measures the success of that optimizer on that function. If, as is frequently the case, the optimizer has a stochastic component, we then calculate the average optimality gap by averaging over multiple runs of the optimizer on the same benchmark function.
Comparing between benchmarks is more difficult given that the different benchmark functions have different ranges and difficulties. For example, a good black-box optimizer applied to the Rastrigin function might achieve an optimality gap of 160, while simple random sampling of the Beale function can quickly achieve an optimality gap of 60 [10]. We normalize for this by taking the ratio of the optimality gap to the optimality gap of Random Search on the same function under the same conditions. Once normalized, we average over the benchmarks to get a single value representing an optimizer's performance.
The benchmarks selected were primarily taken from the Black-Box Optimization Benchmarking Workshop [10] (an academic competition for black-box optimizers), and include the Beale, Branin, Ellipsoidal, Rastrigin, Rosenbrock, Six Hump Camel, Sphere, and Styblinski benchmark functions.
4.2 Empirical Results
In Figure 6 we look at result quality for four optimization algorithms currently implemented in the Vizier framework: a multi-armed bandit technique using a Gaussian process regressor [29], the SMAC algorithm [19], the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [16], and a probabilistic search method of our own. For a given dimension $d$, we generalized each benchmark function into a $d$-dimensional space, ran each optimizer on each benchmark 100 times, and recorded the intermediate results (averaging these over the multiple runs). Figure 6 shows their improvement over Random Search; the horizontal axis represents the number of trials that have been evaluated, while the vertical axis indicates each optimality gap as a fraction of the Random Search optimality gap at the same point. The 2×Random Search curve is the Random Search algorithm when it was allowed to sample two points for each point the other algorithms evaluated. While some authors have claimed that 2×Random Search is highly competitive with Bayesian Optimization methods [20], our data suggests this is only true when the dimensionality of the problem is sufficiently high (e.g., over 16).
4.3 Transfer Learning
We display the value of transfer learning in Figure 7 with a series
of short studies; each study is just six trials long. Even so, one can
see that transfer learning from one study to the next leads to steady
progress towards the optimum, as the stack of regressors gradually
builds up information about the shape of the objective function.
This experiment is conducted in a 10 dimensional space, using
the 8 black-box functions described in section 4.1. We run 30 studies
(180 trials) and each study uses transfer learning from all previous
studies.
As one might hope, transfer learning causes the GP bandit algo-
rithm to show a strong systematic decrease in the optimality gap
from study to study, with its final average optimality gap 37% the size of Random Search's. As expected, Random Search shows no systematic improvement in its optimality gap from study to study.
Note that a systematic improvement in the optimality gap is a difficult task since each study gets a budget of only 6 trials whilst operating in a 10 dimensional space, and the GP regressor is optimizing 8 internal hyperparameters for each study. By any reasonable measure, a single study's data is insufficient for the regressor to learn much about the shape of the objective function.
Figure 6: Ratio of the average optimality gap of each optimizer (GP Bandit, SMAC, CMA-ES, Probabilistic Search, and 2×Random Search) to that of Random Search at a given number of samples, shown for benchmark dimensions 4, 8, 16, 32, and 64. The 2×Random Search is a Random Search allowed to sample two points at every step (as opposed to a single point for the other algorithms).
Figure 7: Convergence of transfer learning in a 10 dimensional space. This shows a sequence of studies with progressive transfer learning for both GP Bandit (blue diamonds) and Random Search (red squares) optimizers. The X-axis shows the index of the study, i.e., the number of times that transfer learning has been applied; the Y-axis shows the log of the best mean optimality gap seen in the study (see Section 4.1). Each study contains six trials; for the GP Bandit-based optimizer the previous studies are used as priors for transfer learning. Note that the GP bandits show a consistent improvement in optimality gap from study to study, thus demonstrating an effective transfer of knowledge from the earlier trials; Random Search does not do transfer learning.
4.4 Automated Stopping
4.4.1 Performance Curve Stopping Rule. In our experiments, we found that the use of the performance curve stopping rule resulted in achieving optimality gaps comparable to those achieved without the stopping rule, while using approximately 50% fewer CPU-hours when tuning hyperparameters for deep neural networks. Our result is in line with figures reported by other researchers, while using a more flexible non-parametric model (e.g., Domhan et al. [9] report reductions in the 40% to 60% range on three ML hyperparameter tuning benchmarks).
4.4.2 Median Automated Stopping Rule. We evaluated the Median Stopping Rule for several hyperparameter search problems, including a state-of-the-art residual network architecture based on [17] for image classification on CIFAR10 with 16 tunable hyperparameters, and an LSTM architecture [33] for language modeling on the Penn TreeBank data set with 12 tunable hyperparameters. We observed that in all cases the stopping rule consistently achieved a factor two to three speedup over random search, while always finding the best performing Trial. Li et al. [20] argued that "2X random search", i.e., random search at twice the speed, is competitive with several state-of-the-art black-box optimization methods on a broad range of benchmarks. The robustness of the stopping rule was also evaluated by running repeated simulations on a large set of completed random search trials under random permutation, which showed that the algorithm almost never decided to stop the ultimately-best-performing trial early.
5 USE CASES
Vizier is used for a number of different application domains.
5.1 Hyperparameter Tuning and HyperTune
Vizier is used across Google to optimize hyperparameters of machine learning models, both for research and production models. Our implementation scales to service the entire hyperparameter tuning workload across Alphabet, which is extensive. As one (admittedly extreme) example, Collins et al. [6] used Vizier to perform hyperparameter tuning studies that collectively contained millions of trials for a research project investigating the capacity of different recurrent neural network architectures. In this context, a single trial involved training a distinct machine learning model using different hyperparameter values. That research project would not be possible without effective black-box optimization. For other research projects, automating the arduous and tedious task of hyperparameter tuning accelerates their progress.
Perhaps even more importantly, Vizier has made notable improvements to production models underlying many Google products, resulting in measurably better user experiences for over a billion people. External researchers and developers can achieve the same benefits using the Google Cloud Machine Learning HyperTune subsystem, which benefits from our experience and technology.
5.2 Automated A/B Testing
In addition to tuning hyperparameters, Vizier has a number of other uses. It is used for automated A/B testing of Google web properties, for example tuning user-interface parameters such as font and thumbnail sizes, color schema, and spacing, or traffic-serving parameters such as the relative importance of various signals in determining which items to show to a user. An example of the latter would be "how should the search results returned from Google Maps trade off search-relevance for distance from the user?"
5.3 Delicious Chocolate Chip Cookies
Vizier is also used to solve complex black–box optimization prob-
lems arising from physical design or logistical problems. Here we
present an example that highlights some additional capabilities of
the system: nding the most delicious chocolate chip cookie recipe
from a parameterized space of recipes.
Parameters included baking soda, brown sugar, white sugar, butter, vanilla, egg, flour, chocolate, chip type, salt, cayenne, orange extract, baking time, and baking temperature. We provided
recipes to contractors responsible for providing desserts for Google
employees. The head chefs among the contractors were given dis-
cretion to alter parameters if (and only if) they strongly believed
it to be necessary, but would carefully note what alterations were
made. The cookies were baked, and distributed to the cafes for
taste–testing. Cafe goers tasted the cookies and provided feedback
via a survey. Survey results were aggregated and the results were
sent back to Vizier. The “machine learning cookies” were provided
about twice a week over several weeks.
The cookies improved significantly over time; later rounds were extremely well-rated and, in the authors' opinions, delicious. However, we wish to highlight the following capabilities of Vizier that the cookie design experiment exercised:
• Infeasible trials: In real applications, some trials may be infeasible, meaning they cannot be evaluated for reasons that are intrinsic to the parameter settings. Very high learning rates may cause training to diverge, leading to garbage models. In this example: very low levels of butter may make your cookie dough impossibly crumbly and incohesive.
• Manual overrides of suggested trials: Sometimes you cannot evaluate the suggested trial or else mistakenly evaluate a different trial than the one asked for. For example, when baking you might be running low on an ingredient and have to settle for less than the recommended amount.
• Transfer learning: Before starting to bake at large scale, we baked some recipes in a smaller scale run-through. This provided useful data that we could transfer learn from when baking at scale. Conditions were not identical, however, resulting in some unexpected consequences. For example, the dough was allowed to sit longer in large-scale production, which unexpectedly, and somewhat dramatically, increased the subjective spiciness of the cookies for trials that involved cayenne. Fortunately, our transfer learning scheme is relatively robust to such shifts.
Vizier supports marking trials as infeasible, in which case they do not receive an objective value. In the case of Bayesian Optimization, previous work either assigns them a particularly bad objective value, attempts to incorporate a probability of infeasibility into the acquisition function to penalize points that are likely to be infeasible [3], or tries to explicitly model the shape of the infeasible region [11, 12]. We take the first approach, which is simple and fairly effective for the applications we consider. Regarding manual overrides, Vizier's stateless design makes it easy to support updating or deleting trials; we simply update the trial state on the database. For details on transfer learning, refer to Section 3.3.
6 CONCLUSION
We have presented our design for Vizier, a scalable, state-of-the-
art internal service for black–box optimization within Google, ex-
plained many of its design choices, and described its use cases
and benets. It has already proven to be a valuable platform for
research and development, and we expect it will only grow more so
as the area of black–box optimization grows in importance. Also,
it designs excellent cookies, which is a very rare capability among
computational systems.
7 ACKNOWLEDGMENTS
We gratefully acknowledge the contributions of the following: Jeremy Kubica, Jeff Dean, Eric Christiansen, Moritz Hardt, Katya Gonina, Kevin Jamieson, and Abdul Salem.
REFERENCES
[1] Rémi Bardenet, Mátyás Brendel, Balázs Kégl, and Michele Sebag. 2013. Collaborative hyperparameter tuning. ICML 2 (2013), 199.
[2] James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems. 2546–2554.
[3] J Bernardo, MJ Bayarri, JO Berger, AP Dawid, D Heckerman, AFM Smith, and M West. 2011. Optimization under unknown constraints. Bayesian Statistics 9 9 (2011), 229.
[4] Michael Bostock, Vadim Ogievetsky, and Jeffrey Heer. 2011. D³ data-driven documents. IEEE Transactions on Visualization and Computer Graphics 17, 12 (2011), 2301–2309.
[5] Herman Chernoff. 1959. Sequential Design of Experiments. Ann. Math. Statist. 30, 3 (1959), 755–770. https://doi.org/10.1214/aoms/1177706205
[6] Jasmine Collins, Jascha Sohl-Dickstein, and David Sussillo. 2017. Capacity and Trainability in Recurrent Neural Networks. In Proceedings of the International Conference on Learning Representations (ICLR).
[7] Andrew R Conn, Katya Scheinberg, and Luis N Vicente. 2009. Introduction to derivative-free optimization. SIAM.
[8] Thomas Desautels, Andreas Krause, and Joel W Burdick. 2014. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research 15, 1 (2014), 3873–3923.
[9] Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. 2015. Speeding Up Automatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation of Learning Curves. In IJCAI. 3460–3468.
[10] Steffen Finck, Nikolaus Hansen, Raymond Ros, and Anne Auger. 2009. Real-Parameter Black-Box Optimization Benchmarking 2009: Presentation of the Noiseless Functions. http://coco.gforge.inria.fr/lib/exe/fetch.php?media=download3.6:bbobdocfunctions.pdf. (2009). [Online].
[11] Jacob R Gardner, Matt J Kusner, Zhixiang Eddie Xu, Kilian Q Weinberger, and John P Cunningham. 2014. Bayesian Optimization with Inequality Constraints. In ICML. 937–945.
[12] Michael A Gelbart, Jasper Snoek, and Ryan P Adams. 2014. Bayesian optimization with unknown constraints. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 250–259.
[13] Josep Ginebra and Murray K. Clayton. 1995. Response Surface Bandits. Journal of the Royal Statistical Society. Series B (Methodological) 57, 4 (1995), 771–784. http://www.jstor.org/stable/2345943
[14] Google. 2017. Polymer: Build modern apps using web components. https://github.com/Polymer/polymer. (2017). [Online].
[15] Google. 2017. Protocol Buffers: Google's data interchange format. https://github.com/google/protobuf. (2017). [Online].
[16] Nikolaus Hansen and Andreas Ostermeier. 2001. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation 9, 2 (2001), 159–195.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[18] Julian Heinrich and Daniel Weiskopf. 2013. State of the Art of Parallel Coordinates. In Eurographics (STARs). 95–116.
[19] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. 2011. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization. Springer, 507–523.
[20] Lisha Li, Kevin G. Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. 2016. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. CoRR abs/1603.06560 (2016). http://arxiv.org/abs/1603.06560
[21] J Moćkus, V Tiesis, and A Źilinskas. 1978. The Application of Bayesian Methods for Seeking the Extremum. Vol. 2. Elsevier. 117–128 pages.
[22] John A Nelder and Roger Mead. 1965. A simplex method for function minimization. The Computer Journal 7, 4 (1965), 308–313.
[23] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.
[24] Luis Miguel Rios and Nikolaos V Sahinidis. 2013. Derivative-free optimization: a review of algorithms and comparison of software implementations. Journal of Global Optimization 56, 3 (2013), 1247–1293.
[25] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando de Freitas. 2016. Taking the human out of the loop: A review of Bayesian optimization. Proc. IEEE 104, 1 (2016), 148–175.
[26] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems. 2951–2959.
[27] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Prabhat, and Ryan P. Adams. 2015. Scalable Bayesian Optimization Using Deep Neural Networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), JMLR Workshop and Conference Proceedings, Vol. 37. JMLR.org, 2171–2180. http://jmlr.org/proceedings/papers/v37/snoek15.html
[28] Jost Tobias Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter. 2016. Bayesian Optimization with Robust Bayesian Neural Networks. In Advances in Neural Information Processing Systems 29. Curran Associates, Inc., 4134–4142. http://papers.nips.cc/paper/6117-bayesian-optimization-with-robust-bayesian-neural-networks.pdf
[29] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. 2010. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. ICML (2010).
[30] Kevin Swersky, Jasper Snoek, and Ryan Prescott Adams. 2014. Freeze-thaw Bayesian optimization. arXiv preprint arXiv:1406.3896 (2014).
[31] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. 2016. Deep kernel learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics. 370–378.
[32] Dani Yogatama and Gideon Mann. 2014. Efficient Transfer Learning Method for Automatic Hyperparameter Tuning. JMLR: W&CP 33 (2014), 1077–1085.
[33] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014).
... The objectives of this work are to develop and present a high-fidelity ensemble wildfire simulation framework for enabling the parametric examination of wildfire behavior, the creation of ensemble data to support ML tasks, and to support fire-risk assessment and uncertainty analysis of wildfires. The proposed simulation framework combines an opensource large-eddy simulation tool Swirl-Fire for wildfire predictions (Wang et al. 2023), an opensource optimization platform Vizier (Golovin et al. 2017) for run-time management of ensemble simulation, and large-scale batch processing. The resulting framework is employed to perform a detailed parametric analysis of the effects of changing wind and slope on the fire-spread behavior. ...
... To enable the high-fidelity ensemble simulation, we are leveraging the open-source optimization platform Vizier (Golovin et al. 2017). Given only a parameter space and a cost function over this space, Vizier efficiently searches for optimal parameters. ...
Preprint
Full-text available
Background. Wildfire research uses ensemble methods to analyze fire behaviors and assess uncertainties. Nonetheless, current research methods are either confined to simple models or complex simulations with limits. Modern computing tools could allow for efficient, high-fidelity ensemble simulations. Aims. This study proposes a high-fidelity ensemble wildfire simulation framework for studying wildfire behavior, ML tasks, fire-risk assessment, and uncertainty analysis. Methods. In this research, we present a simulation framework that integrates the Swirl-Fire large-eddy simulation tool for wildfire predictions with the Vizier optimization platform for automated run-time management of ensemble simulations and large-scale batch processing. All simulations are executed on tensor-processing units to enhance computational efficiency. Key results. A dataset of 117 simulations is created, each with 1.35 billion mesh points. The simulations are compared to existing experimental data and show good agreement in terms of fire rate of spread. Computations are done for fire acceleration, mean rate of spread, and fireline intensity. Conclusions. Strong coupling between these 2 parameters are observed for the fire spread and intermittency. A critical Froude number that delineates fires from plume-driven to convection-driven is identified and confirmed with literature observations. Implications. The ensemble simulation framework is efficient in facilitating parametric wildfire studies.
... Consider a black-box function f : R^n → R, where each evaluation of f is assumed to be expensive in terms of computational resources or time. The objective of the black-box optimization problem is to find [37]: ...
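The truncated objective can be written in its standard form (consistent with the black-box setting described in this paper, not necessarily the citing paper's exact notation):

x^{*} \in \arg\min_{x \in \mathcal{X} \subseteq \mathbb{R}^{n}} f(x),

subject to a limited budget of T evaluations f(x_1), ..., f(x_T), with no access to gradients or any other structure of f.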
Preprint
Full-text available
Large Language Models (LLMs) have achieved significant progress across various fields and have exhibited strong potential in evolutionary computation, such as generating new solutions and automating algorithm design. Surrogate-assisted selection is a core step in evolutionary algorithms for solving expensive optimization problems by reducing the number of real evaluations. Traditionally, it has relied on conventional machine learning methods, leveraging historical evaluations to predict the performance of new solutions. In this work, we propose a novel surrogate model based purely on LLM inference capabilities, eliminating the need for training. Specifically, we formulate model-assisted selection as a classification and regression problem, utilizing LLMs to directly evaluate the quality of new solutions based on historical data: predicting whether a solution is good or bad, or approximating its value. This approach is then integrated into evolutionary algorithms, termed LLM-assisted EA (LAEA). Detailed experiments compare visualization results on 2D data across 9 mainstream LLMs, as well as their performance on optimization problems. The experimental results demonstrate that LLMs have significant potential as surrogate models in evolutionary computation, achieving performance comparable to traditional surrogate models using inference alone. This work offers new insights into the application of LLMs in evolutionary computation. Code is available at: https://github.com/hhyqhh/LAEA.git
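As a rough illustration of how surrogate-assisted selection could be framed as LLM classification (an assumption-laden sketch, not the paper's code; query_llm is a hypothetical stand-in for whatever completion API is used):

# Illustrative, training-free LLM surrogate: historical (solution, fitness)
# pairs are serialized into a prompt and the LLM is asked whether a candidate
# is likely GOOD or BAD. `query_llm` is a hypothetical callable.

def build_prompt(history, candidate):
    lines = ["Each line is a solution vector and its objective value (lower is better):"]
    for x, fx in history:
        lines.append(f"{list(x)} -> {fx:.4f}")
    lines.append(f"Candidate: {list(candidate)}")
    lines.append("Answer with one word, GOOD or BAD, relative to the median objective.")
    return "\n".join(lines)

def llm_surrogate_is_good(history, candidate, query_llm):
    reply = query_llm(build_prompt(history, candidate))
    return "GOOD" in reply.upper()

# In an evolutionary loop, offspring judged GOOD would be sent for real
# (expensive) evaluation, while those judged BAD are filtered out.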
... However, machine learning models are black boxes with hidden inner workings, making it difficult for users to understand the relationship between data and results (Carvalho et al., 2019). Thus, feature importance analyses have been widely used in different fields to aid machine learning interpretability (Golovin et al., 2017; Kahng et al., 2017; J. Zhang et al., 2018). ...
Article
Full-text available
Purpose: With the growing popularity of recreational activities, this study aimed to develop prediction models for recreational activity participation and to explore the key factors affecting participation. Methods: A total of 12,712 participants, excluding individuals under 20, were selected from the National Health and Nutrition Examination Survey (NHANES) from 2011 to 2018. The mean age of the sample was 46.86 years (±16.97), with a gender distribution of 6,721 males and 5,991 females. The variables included demographic, physical-related, and lifestyle variables. This study developed 42 prediction models using six machine learning methods: logistic regression, Support Vector Machine (SVM), decision tree, random forest, eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM). The relative importance of each variable was evaluated by permutation feature importance. Results: The results illustrated that LightGBM was the most effective algorithm for predicting recreational activity participation (accuracy: .838, precision: .783, recall: .967, F1-score: .865, AUC: .826). In particular, prediction performance increased when the demographic and lifestyle datasets were used together. As a result of permutation feature importance computed on the top models, education level and moderate-to-vigorous physical activity (MVPA) were found to be essential variables. Conclusion: These findings demonstrate the potential of a data-driven machine learning approach in the recreation domain. Furthermore, this study interpreted the prediction model through feature importance analysis to mitigate the interpretability limitations of machine learning.
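To illustrate the two ingredients the abstract combines, the following sketch trains a LightGBM classifier on synthetic data and ranks features with permutation importance. The data, features, and hyperparameters are placeholders, not those of the NHANES study.

# Illustrative only: synthetic data standing in for demographic/lifestyle features.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=2000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LGBMClassifier(n_estimators=200).fit(X_tr, y_tr)

# Permutation importance: how much held-out accuracy drops when a feature is shuffled.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.4f}")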
... Automation in hyperparameter tuning can significantly reduce the time and expertise required to find optimal settings [115,116]. Automated tools like Hyperopt, Optuna, and Google's Vizier can assist meteorologists and data scientists by efficiently exploring parameter spaces, but expert oversight remains essential to guide the tuning process and interpret results effectively [117][118][119]. ...
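For instance, a minimal Optuna study looks roughly like this; the objective below is a toy stand-in for a real model's validation loss, and the parameter names and ranges are placeholders.

import optuna

def objective(trial):
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    depth = trial.suggest_int("depth", 2, 10)
    # Toy loss with a known optimum; replace with real training + validation.
    return (lr - 1e-3) ** 2 + (depth - 6) ** 2

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)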
Article
Full-text available
Accurate and rapid weather forecasting and climate modeling are long-standing goals of human development. While Numerical Weather Prediction (NWP) remains the gold standard, it faces challenges such as inherent atmospheric uncertainties and computational costs, especially in the post-Moore era. With the advent of deep learning, the field has been revolutionized by data-driven models. This paper reviews the key models and significant developments in data-driven weather forecasting and climate modeling. It provides an overview of these models, covering aspects such as dataset selection, model design, training process, computational acceleration, and prediction effectiveness. Data-driven models trained on reanalysis data can provide effective forecasts with an anomaly correlation coefficient (ACC) greater than 0.6 for up to 15 days at a spatial resolution of 0.25°. These models outperform or match the most advanced NWP methods for 90% of variables, reducing forecast generation time from hours to seconds. Data-driven climate models can reliably simulate climate patterns for decades to 100 years, offering orders-of-magnitude computational savings and competitive performance. Despite their advantages, data-driven methods have limitations, including poor interpretability, challenges in evaluating model uncertainty, and conservative predictions in extreme cases. Future research should focus on larger models, integrating more physical constraints, and enhancing evaluation methods.
... Because of that, AutoML research has grown rapidly over the last few years, probably most evidently in NAS (Elsken et al., 2019). At the same time, most big IT companies have developed large software packages enabling AutoML, including Google (Golovin et al., 2017; Song et al., 2022), Amazon (Erickson et al., 2020), Meta (Balandat et al., 2020), IBM, Oracle (Yakovlev et al., 2020), and Microsoft (Wang et al., 2021a). ...
Preprint
Full-text available
Automated machine learning (AutoML) was formed around the fundamental objectives of automatically and efficiently configuring machine learning (ML) workflows, aiding the research of new ML algorithms, and contributing to the democratization of ML by making it accessible to a broader audience. Over the past decade, commendable achievements in AutoML have primarily focused on optimizing predictive performance. This focused progress, while substantial, raises questions about how well AutoML has met its broader, original goals. In this position paper, we argue that a key to unlocking AutoML's full potential lies in addressing the currently underexplored aspect of user interaction with AutoML systems, including their diverse roles, expectations, and expertise. We envision a more human-centered approach in future AutoML research, promoting the collaborative design of ML systems that tightly integrates the complementary strengths of human expertise and AutoML methodologies.
... Some recent works explore scalable transfer learning with deep neural networks [33,47]. Also, different components of BO can be transferred, such as observations [41], surrogate functions [15,47], hyperparameter initializations [47], or all of them [45]. However, most existing transfer-BO approaches assume the traditional black-box BO setting. ...
Preprint
In this paper, we address the problem of cost-sensitive multi-fidelity Bayesian Optimization (BO) for efficient hyperparameter optimization (HPO). Specifically, we assume a scenario where users want to stop BO early when the performance improvement is not satisfactory relative to the required computational cost. Motivated by this scenario, we introduce a utility function, predefined by each user, that describes the trade-off between the cost and performance of BO. This utility function, combined with our novel acquisition function and stopping criterion, allows us to dynamically choose, at each BO step, the configuration that we expect to maximally improve the utility in the future, and to automatically stop BO around the maximum utility. Further, we improve the sample efficiency of existing learning curve (LC) extrapolation methods with transfer learning, while successfully capturing the correlations between different configurations to develop a sensible surrogate function for multi-fidelity BO. We validate our algorithm on various LC datasets and find that it outperforms all the previous multi-fidelity BO and transfer-BO baselines we consider, achieving a significantly better trade-off between cost and performance of BO.
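One way to make such a cost/performance trade-off concrete is sketched below; the linear utility form and the expected-gain proxy are illustrative assumptions, not the paper's actual definitions.

# Illustrative utility-based stopping rule for cost-aware BO (assumed linear
# utility; the paper's utility, acquisition, and stopping criterion differ).

def utility(best_perf, total_cost, lam=0.01):
    # Higher performance is better; each unit of cost is penalized by lam.
    return best_perf - lam * total_cost

def should_stop(best_perf, total_cost, predicted_gain, next_cost, lam=0.01):
    # Stop when the utility expected after the next evaluation is no better
    # than the utility already achieved.
    current = utility(best_perf, total_cost, lam)
    projected = utility(best_perf + predicted_gain, total_cost + next_cost, lam)
    return projected <= current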
... When setting the hyperparameters, we leveraged insights from prior research and experimental experience, and referred to optimization algorithms based on Bayesian methods. Additionally, we utilized the Python open-source toolkit 'advisor' for parameter optimization [35]. ...
Article
Full-text available
With the development of deep learning, several graph neural network (GNN)-based approaches have been utilized for text classification. However, GNNs encounter challenges in capturing contextual text information within a document sequence. To address this, a novel text classification model, RB-GAT, is proposed by combining RoBERTa-BiGRU embedding with a multi-head Graph ATtention Network (GAT). First, the pre-trained RoBERTa model is exploited to learn word and text embeddings in different contexts. Second, the Bidirectional Gated Recurrent Unit (BiGRU) is employed to capture long-term dependencies and bidirectional sentence information from the text context. Next, the multi-head graph attention network is applied to analyze this information, which serves as the node features for the document. Finally, the classification results are generated through a Softmax layer. Experimental results on five benchmark datasets demonstrate that our method achieves accuracies of 71.48%, 98.45%, 80.32%, 90.84%, and 95.67% on Ohsumed, R8, MR, 20NG, and R52, respectively, which is superior to nine existing text classification approaches.
... In practice, attackers generally cannot access the internal structure and parameters of the T2I models, as they are wrapped behind a black-box API, which makes conventional gradient descent algorithms unsuitable (Stephan et al., 2017; Kingma & Ba, 2014). Although some existing black-box learning methods (Golovin et al., 2017) relax the need to access internal information of the black box, they generally require obtaining the output results of the black box to compute losses and estimate gradients. However, during training, the integrated textual and visual defenses of the API can effectively detect malicious content within the input prompts or generated images. ...
Preprint
Text-to-Image (T2I) models have raised security concerns due to their potential to generate inappropriate or harmful images. In this paper, we propose UPAM, a novel framework that investigates the robustness of T2I models from the attack perspective. Unlike most existing attack methods that focus on deceiving textual defenses, UPAM aims to deceive both textual and visual defenses in T2I models. UPAM enables gradient-based optimization, offering greater effectiveness and efficiency than previous methods. Given that T2I models might not return results due to defense mechanisms, we introduce a Sphere-Probing Learning (SPL) scheme to support gradient optimization even when no results are returned. Additionally, we devise a Semantic-Enhancing Learning (SEL) scheme to finetune UPAM for generating target-aligned images. Our framework also ensures attack stealthiness. Extensive experiments demonstrate UPAM's effectiveness and efficiency.
... Black-box optimization is also an appealing approach for network optimization problems. It refers to the task of optimizing an objective function f : X → R without access to any other information about f, e.g., gradients or the Hessian [172]. Telecom networks will become increasingly complex in the 6G era, and black-box optimization can avoid the complexity of building dedicated optimization models. ...
Preprint
Full-text available
Large language models (LLMs) have received considerable attention recently due to their outstanding comprehension and reasoning capabilities, leading to great progress in many fields. The advancement of LLM techniques also offers promising opportunities to automate many tasks in the telecommunication (telecom) field. After pre-training and fine-tuning, LLMs can perform diverse downstream tasks based on human instructions, paving the way to artificial general intelligence (AGI)-enabled 6G. Given the great potential of LLM technologies, this work aims to provide a comprehensive overview of LLM-enabled telecom networks. In particular, we first present LLM fundamentals, including model architecture, pre-training, fine-tuning, inference and utilization, model evaluation, and telecom deployment. Then, we introduce key LLM-enabled techniques and telecom applications in terms of generation, classification, optimization, and prediction problems. Specifically, the LLM-enabled generation applications include telecom domain knowledge, code, and network configuration generation. After that, the LLM-based classification applications involve network security, text, image, and traffic classification problems. Moreover, multiple LLM-enabled optimization techniques are introduced, such as automated reward function design for reinforcement learning and verbal reinforcement learning. Furthermore, for LLM-aided prediction problems, we discuss time-series prediction models and multi-modality prediction problems for telecom. Finally, we highlight the challenges and identify the future directions of LLM-enabled telecom networks.
Article
Full-text available
Hyperparameter optimization (HPO) has been well developed and has evolved into a well-established research topic over the decades. With the success and wide application of deep learning, HPO has garnered increased attention, particularly within the realm of machine learning model training and inference. The primary objective is to mitigate the challenges associated with manual hyperparameter tuning, which is often ad hoc, reliant on human expertise, and consequently hinders reproducibility while inflating deployment costs. Recognizing the growing significance of HPO, this paper surveys classical HPO methods, approaches for accelerating the optimization process, HPO in an online setting (dynamic algorithm configuration, DAC), and HPO when there is more than one objective to optimize (multi-objective HPO). Acceleration strategies are categorized into multi-fidelity, bandit-based, and early-stopping methods; DAC algorithms encompass gradient-based, population-based, and reinforcement learning-based methods; multi-objective HPO can be approached via scalarization, metaheuristics, and model-based algorithms tailored to the multi-objective setting. A tabulated overview of popular frameworks and tools for HPO is provided, catering to the interests of practitioners.
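As one concrete instance of the scalarization family mentioned above (a standard textbook form, not necessarily the survey's notation), m objectives f_1, ..., f_m can be collapsed into a single objective via a weighted sum:

g(x; \lambda) = \sum_{i=1}^{m} \lambda_i \, f_i(x), \qquad \lambda_i \ge 0, \quad \sum_{i=1}^{m} \lambda_i = 1.

Sweeping the weight vector \lambda traces out solutions on the convex portion of the Pareto front; reaching non-convex regions requires alternatives such as the (augmented) Chebyshev scalarization.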
Article
Full-text available
Bayesian optimization has been demonstrated as an effective methodology for the global optimization of functions with expensive evaluations. Its strategy relies on querying a distribution over functions defined by a relatively cheap surrogate model. The ability to accurately model this distribution over functions is critical to the effectiveness of Bayesian optimization, and is typically fit using Gaussian processes (GPs). However, since GPs scale cubically with the number of observations, it has been challenging to handle objectives whose optimization requires a large number of evaluations, and as such, massively parallelizing the optimization. In this work, we explore the use of neural networks as an alternative to Gaussian processes to model distributions over functions. We show that performing adaptive basis function regression with a neural network as the parametric form performs competitively with state-of-the-art GP-based approaches, but scales linearly with the number of data rather than cubically. This allows us to achieve a previously intractable degree of parallelism, which we use to rapidly search over large spaces of models. We achieve state-of-the-art results on benchmark object recognition tasks using convolutional neural networks, and image caption generation using multimodal neural language models.
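The linear-versus-cubic scaling claim can be made concrete with standard Bayesian linear regression algebra (a sketch of the general idea, not text from the paper): with d basis functions \phi(x) taken from the network's last hidden layer and n observations stacked into \Phi \in \mathbb{R}^{n \times d},

A = \beta\,\Phi^{\top}\Phi + \alpha I_d, \qquad m = \beta\, A^{-1} \Phi^{\top} y,
\mu(x_*) = \phi(x_*)^{\top} m, \qquad \sigma^{2}(x_*) = \phi(x_*)^{\top} A^{-1} \phi(x_*) + \beta^{-1}.

Forming A costs O(n d^2) and inverting it O(d^3), so the total is linear in n for fixed d, in contrast to the O(n^3) cost of exact GP inference.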
Conference Paper
Full-text available
Bayesian optimization is a powerful framework for minimizing expensive objective functions while using very few function evaluations. It has been successfully applied to a variety of problems, including hyperparameter tuning and experimental design. However, this framework has not been extended to the inequality-constrained optimization setting, particularly the setting in which evaluating feasibility is just as expensive as evaluating the objective. Here we present constrained Bayesian optimization, which places a prior distribution on both the objective and the constraint functions. We evaluate our method on simulated and real data, demonstrating that constrained Bayesian optimization can quickly find optimal and feasible points, even when small feasible regions cause standard methods to fail.
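One common acquisition function in this setting (a standard formulation consistent with the description above, not a quotation from the paper) weights expected improvement by the probability that the constraint is satisfied:

a(x) = \mathrm{EI}(x) \cdot \Pr\!\left[c(x) \le 0\right],

where EI is computed with respect to the best feasible observation so far and the constraint function c is modeled by its own Gaussian process.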
Article
Performance of machine learning algorithms depends critically on identifying a good set of hyperparameters. While recent approaches use Bayesian optimization to adaptively select configurations, we focus on speeding up random search through adaptive resource allocation and early stopping. We formulate hyperparameter optimization as a pure-exploration non-stochastic infinite-armed bandit problem where a predefined resource, such as iterations, data samples, or features, is allocated to randomly sampled configurations. We introduce a novel algorithm, Hyperband, for this framework and analyze its theoretical properties, providing several desirable guarantees. Furthermore, we compare Hyperband with popular Bayesian optimization methods on a suite of hyperparameter optimization problems. We observe that Hyperband can provide over an order-of-magnitude speedup over our competitor set on a variety of deep-learning and kernel-based learning problems.
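Hyperband's inner loop is successive halving; a minimal sketch follows. The full Hyperband algorithm additionally sweeps over several (n, budget) brackets, and the evaluate function here is a toy stand-in for training a configuration with a given resource budget.

import math, random

def successive_halving(sample_config, evaluate, n=27, min_budget=1, eta=3):
    # Start with n random configs on a small budget, keep the best 1/eta
    # fraction each round, and multiply the budget by eta for the survivors.
    configs = [sample_config() for _ in range(n)]
    budget = min_budget
    while len(configs) > 1:
        scored = sorted(configs, key=lambda c: evaluate(c, budget))
        configs = scored[: max(1, len(configs) // eta)]
        budget *= eta
    return configs[0]

# Toy usage: the "validation loss" improves with budget and is minimized near lr = 1e-2.
best = successive_halving(
    sample_config=lambda: {"lr": 10 ** random.uniform(-5, 0)},
    evaluate=lambda cfg, b: abs(math.log10(cfg["lr"]) + 2) + 1.0 / b,
)
print(best)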
Article
Many applications require optimizing an unknown, noisy function that is expensive to evaluate. We formalize this task as a multi-armed bandit problem, where the payoff function is either sampled from a Gaussian process (GP) or has low RKHS norm. We resolve the important open problem of deriving regret bounds for this setting, which imply novel convergence rates for GP optimization. We analyze GP-UCB, an intuitive upper-confidence-based algorithm, and bound its cumulative regret in terms of maximal information gain, establishing a novel connection between GP optimization and experimental design. Moreover, by bounding the latter in terms of operator spectra, we obtain explicit sublinear regret bounds for many commonly used covariance functions. In some important cases, our bounds have surprisingly weak dependence on the dimensionality. In our experiments on real sensor data, GP-UCB compares favorably with other heuristic GP optimization approaches.
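In its standard form, the GP-UCB rule described above selects at round t

x_t = \arg\max_{x \in \mathcal{X}} \left[ \mu_{t-1}(x) + \sqrt{\beta_t}\,\sigma_{t-1}(x) \right],

where \mu_{t-1} and \sigma_{t-1} are the GP posterior mean and standard deviation after t-1 observations, and \beta_t is the exploration weight whose choice (growing slowly with t) drives the paper's regret bounds.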
Article
Big Data applications are typically associated with systems involving large numbers of users, massive complex software systems, and large-scale heterogeneous computing and storage architectures. The construction of such systems involves many distributed design choices. The end products (e.g., recommendation systems, medical analysis tools, real-time game engines, speech recognizers) thus involve many tunable configuration parameters. These parameters are often specified and hard-coded into the software by various developers or teams. If optimized jointly, these parameters can result in significant improvements. Bayesian optimization is a powerful tool for the joint optimization of design choices that has gained great popularity in recent years. It promises greater automation so as to increase both product quality and human productivity. This review paper introduces Bayesian optimization, highlights some of its methodological aspects, and showcases a wide range of applications.
Article
Hyperparameter learning has traditionally been a manual task because of the limited number of trials. Today's computing infrastructures allow bigger evaluation budgets, thus opening the way for algorithmic approaches. Recently, surrogate-based optimization was successfully applied to hyperparameter learning for deep belief networks and to WEKA classifiers. The methods combined brute force computational power with model building about the behavior of the error function in the hyperparameter space, and they could significantly improve on manual hyperparameter tuning. What may make experienced practitioners even better at hyperparameter optimization is their ability to generalize across similar learning problems. In this paper, we propose a generic method to incorporate knowledge from previous experiments when simultaneously tuning a learning algorithm on new problems at hand. To this end, we combine surrogate-based ranking and optimization techniques for surrogate-based collaborative tuning (SCoT). We demonstrate SCoT in two experiments where it outperforms standard tuning techniques and single-problem surrogate-based optimization.