Google Vizier: A Service for Black-Box Optimization
Daniel Golovin
Google Research
dgg@google.com
Benjamin Solnik
Google Research
bsolnik@google.com
Subhodeep Moitra
Google Research
smoitra@google.com
Greg Kochanski
Google Research
gpk@google.com
John Karro
Google Research
karro@google.com
D. Sculley
Google Research
dsculley@google.com
ABSTRACT
Any suciently complex system acts as a black box when it becomes
easier to experiment with than to understand. Hence, black-box
optimization has become increasingly important as systems have
become more complex. In this paper we describe Google Vizier, a
Google-internal service for performing black-box optimization that
has become the de facto parameter tuning engine at Google. Google
Vizier is used to optimize many of our machine learning models
and other systems, and also provides core capabilities to Google’s
Cloud Machine Learning HyperTune subsystem. We discuss our
requirements, infrastructure design, underlying algorithms, and
advanced features such as transfer learning and automated early
stopping that the service provides.
KEYWORDS
Black-Box Optimization, Bayesian Optimization, Gaussian Processes, Hyperparameters, Transfer Learning, Automated Stopping
1 INTRODUCTION
Black-box optimization is the task of optimizing an objective function $f : X \to \mathbb{R}$ with a limited budget for evaluations. The adjective "black-box" means that while we can evaluate $f(x)$ for any $x \in X$, we have no access to any other information about $f$, such as gradients or the Hessian. When function evaluations are expensive, it makes sense to carefully and adaptively select values to evaluate; the overall goal is for the system to generate a sequence of $x_t$ that approaches the global optimum as rapidly as possible.
Black-box optimization algorithms can be used to find the best operating parameters for any system whose performance can be measured as a function of adjustable parameters. It has many important applications, such as automated tuning of the hyperparameters of machine learning systems (e.g., learning rates, or the number of hidden layers in a deep neural network), optimization of the user interfaces of web services (e.g., optimizing colors and fonts
KDD ’17, August 13-17, 2017, Halifax, NS, Canada
©2017 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-4887-4/17/08.
https://doi.org/10.1145/3097983.3098043
to maximize reading speed), and optimization of physical systems
(e.g., optimizing airfoils in simulation).
In this paper we discuss a state-of-the-art system for black-box optimization developed within Google, called Google Vizier, named after a high official who offers advice to rulers. It is a service for black-box optimization that supports several advanced algorithms. The system has a convenient Remote Procedure Call (RPC) interface, along with a dashboard and analysis tools. Google Vizier is a research project, parts of which supply core capabilities to our Cloud Machine Learning HyperTune (https://cloud.google.com/ml/) subsystem. We discuss the architecture of the system, design choices, and some of the algorithms used.
1.1 Related Work
Black-box optimization makes minimal assumptions about the problem under consideration, and thus is broadly applicable across many domains; it has been studied in multiple scholarly fields under names including Bayesian Optimization [2, 25, 26], Derivative-free Optimization [7, 24], Sequential Experimental Design [5], and assorted variants of the multiarmed bandit problem [13, 20, 29].
Several classes of algorithms have been proposed for the problem. The simplest of these are non-adaptive procedures such as Random Search, which selects $x_t$ uniformly at random from $X$ at each time step $t$, independent of the previously selected points $\{x_\tau : 1 \le \tau < t\}$, and Grid Search, which selects along a grid (i.e., the Cartesian product of finite sets of feasible values for each parameter). Classic algorithms such as Simulated Annealing and assorted genetic algorithms have also been investigated, e.g., Covariance Matrix Adaptation [16].
Another class of algorithms performs a local search by selecting points that maintain a search pattern, such as a simplex in the case of the classic Nelder-Mead algorithm [22]. More modern variants of these algorithms maintain simple models of the objective $f$ within a subset of the feasible region (called the trust region), and select a point $x_t$ to improve the model within the trust region [7].
More recently, some researchers have combined powerful techniques for modeling the objective $f$ over the entire feasible region, using ideas developed for multiarmed bandit problems for managing explore/exploit trade-offs. These approaches are fundamentally Bayesian in nature, hence this literature goes under the name Bayesian Optimization. Typically, the model for $f$ is a Gaussian process (as in [26, 29]), a deep neural network (as in [27, 31]), or a regression forest (as in [2, 19]).
Many of these algorithms have open-source implementations available. Within the machine learning community, examples include HyperOpt (https://github.com/jaberg/hyperopt), MOE (https://github.com/Yelp/MOE), Spearmint (https://github.com/HIPS/Spearmint), and AutoWeka (https://github.com/automl/autoweka), among many others. In contrast to such software packages, which require practitioners to set them up and run them locally, we opted to develop a managed service for black-box optimization, which is more convenient for users but involves additional design considerations.
1.2 Denitions
Throughout the paper, we use to the following terms to describe
the semantics of the system:
ATrial is a list of parameter values,
x
, that will lead to a single
evaluation of
f(x)
. A trial can be “Completed”, which means that it
has been evaluated and the objective value
f(x)
has been assigned
to it, otherwise it is “Pending”.
AStudy represents a single optimization run over a feasible
space. Each Study contains a conguration describing the feasible
space, as well as a set of Trials. It is assumed that
f(x)
does not
change in the course of a Study.
AWorker refers to a process responsible for evaluating a Pending
Trial and calculating its objective value.
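As a rough illustration only (not Vizier's actual protocol buffer definitions, which are internal), these entities might be modeled as follows; all field names here are hypothetical:

# Illustrative sketch; Vizier's real Trial/Study messages are protocol
# buffers with additional fields. All names here are hypothetical.
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional

class TrialStatus(Enum):
    PENDING = "PENDING"
    COMPLETED = "COMPLETED"

@dataclass
class Trial:
    parameters: Dict[str, object]             # parameter name -> value
    status: TrialStatus = TrialStatus.PENDING
    objective_value: Optional[float] = None   # f(x), set once Completed

@dataclass
class Study:
    name: str
    config: dict                              # feasible space, goal, etc.
    trials: List[Trial] = field(default_factory=list)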
2 SYSTEM OVERVIEW
This section explores the design considerations involved in implementing black-box optimization as a service.
2.1 Design Goals and Constraints
Vizier's design satisfies the following desiderata:
• Ease of use. Minimal user configuration and setup.
• Hosts state-of-the-art black-box optimization algorithms.
• High availability.
• Scalable to millions of trials per study, thousands of parallel trial evaluations per study, and billions of studies.
• Easy to experiment with new algorithms.
• Easy to change out algorithms deployed in production.
For ease of use, we implemented Vizier as a managed service that stores the state of each optimization. This approach drastically reduces the effort a new user needs to get up and running, and a managed service with a well-documented and stable RPC API allows us to upgrade the service without user effort. We provide a default configuration for our managed service that is good enough to ensure that most users need never concern themselves with the underlying optimization algorithms.
The default option allows the service to dynamically select a recommended black-box algorithm along with low-level settings based on the study configuration. We choose to make our algorithms stateless, so that we can seamlessly switch algorithms during a study, dynamically choosing the algorithm that is likely to perform better for a particular trial of a given study. For example, Gaussian Process Bandits [26, 29] provide excellent result quality, but naive implementations scale as $O(n^3)$ with the number of training points. Thus, once we have collected a large number of completed Trials, we may want to switch to using a more scalable algorithm.
At the same time, we want to allow ourselves (and advanced users) the freedom to experiment with new algorithms or special-case modifications of the supported algorithms in a manner that is safe, easy, and fast. Hence, we have built Google Vizier as a modular system consisting of four cooperating processes (see Figure 1) that update the state of Studies in the central database. The processes themselves are modular with several clean abstraction layers that allow us to experiment with and apply different algorithms easily.
Finally, we want to allow multiple trials to be evaluated in parallel, and allow for the possibility that evaluating the objective function for each trial could itself be a distributed process. To this end we define Workers, responsible for evaluating suggestions, and identify each worker by a name (a worker_handle) that persists across process preemptions or crashes.
2.2 Basic User Workflow
To use Vizier, a developer may use one of our client libraries (currently implemented in C++, Python, Golang), which will generate service requests encoded as protocol buffers [15]. The basic workflow is extremely simple. Users specify a study configuration which includes:
• Identifying characteristics of the study (e.g., name, owner, permissions).
• The set of parameters along with feasible sets for each (c.f. Section 2.3.1 for details); Vizier does constrained optimization over the feasible set.
Given this configuration, basic use of the service (with each trial being evaluated by a single process) can be implemented as follows:
# Register this client with the Study, creating it if
# necessary.
client.LoadStudy(study_config, worker_handle)
while not client.StudyIsDone():
    # Obtain a trial to evaluate.
    trial = client.GetSuggestion()
    # Evaluate the objective function at the trial parameters.
    metrics = RunTrial(trial)
    # Report back the results.
    client.CompleteTrial(trial, metrics)
Here RunTrial is the problem-specific evaluation of the objective function $f$. Multiple named metrics may be reported back to Vizier; however, one must be distinguished as the objective value $f(x)$ for trial $x$. Note that multiple processes working on a study should share the same worker_handle if and only if they are collaboratively evaluating the same trial. All processes registered with a given study with the same worker_handle are guaranteed to receive the same trial upon request, which enables distributed trial evaluation.
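To make the worker_handle semantics concrete, the following sketch shows how several replicas of one distributed evaluation job might register under a single handle so that they all receive the same Trial. It reuses the client object from the snippet above; the job name, replica-index environment variable, and RunDistributedTrial helper are hypothetical details of the user's own setup, not part of Vizier.

import os

# Hypothetical sketch: all replicas of one evaluation job share a single
# worker_handle, so Vizier hands them the same Trial to evaluate jointly.
# A different job (with a different handle) would receive a different Trial.
job_name = "resnet_tuning_job_7"           # hypothetical job identifier
worker_handle = job_name                   # shared by every replica of this job
replica_index = int(os.environ.get("REPLICA_INDEX", "0"))

client.LoadStudy(study_config, worker_handle)
trial = client.GetSuggestion()             # same Trial on every replica

metrics = RunDistributedTrial(trial, replica_index)  # user-defined evaluation
if replica_index == 0:
    # Only one replica needs to report the final result.
    client.CompleteTrial(trial, metrics)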
Figure 1: Architecture of Vizier service. Main components are (1) Dangling Work Finder (restarts work lost to preemptions), (2) Persistent Database holding the current state of all Studies, (3) Suggestion Service (creates new Trials), (4) Early Stopping Service (helps terminate a Trial early), (5) Vizier API (JSON, validation, multiplexing), and (6) Evaluation Workers (provided and owned by the user).
2.3 Interfaces
2.3.1 Configuring a Study. To configure a study, the user provides a study name, owner, optional access permissions, an optimization goal from {MAXIMIZE, MINIMIZE}, and specifies the feasible region $X$ via a set of ParameterConfigs, each of which declares a parameter name along with its values. We support the following parameter types:
• DOUBLE: The feasible region is a closed interval $[a, b]$ for some real values $a \le b$.
• INTEGER: The feasible region has the form $[a, b] \cap \mathbb{Z}$ for some integers $a \le b$.
• DISCRETE: The feasible region is an explicitly specified, ordered set of real numbers.
• CATEGORICAL: The feasible region is an explicitly specified, unordered set of strings.
Users may also suggest recommended scaling, e.g., logarithmic scaling for parameters for which the objective may depend only on the order of magnitude of a parameter value.
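As an illustration of these parameter types, a study configuration for tuning a small neural network might look roughly like the following Python dictionary; the exact field names of Vizier's configuration schema are not public, so this layout is an assumption.

# Hypothetical study configuration; field names are illustrative, not
# Vizier's actual schema.
study_config = {
    "name": "mnist_tuning_example",
    "owner": "some_user",
    "goal": "MAXIMIZE",                      # from {MAXIMIZE, MINIMIZE}
    "parameters": [
        {"name": "learning_rate", "type": "DOUBLE",
         "min": 1e-5, "max": 1.0, "scaling": "LOG"},
        {"name": "num_hidden_layers", "type": "INTEGER",
         "min": 1, "max": 8},
        {"name": "batch_size", "type": "DISCRETE",
         "values": [32, 64, 128, 256]},
        {"name": "activation", "type": "CATEGORICAL",
         "values": ["relu", "tanh", "sigmoid"]},
    ],
}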
2.3.2 API Definition. Workers and end users can make calls to the Vizier Service using either a REST API or Google's internal RPC protocol [15]. The most important service calls are:
• CreateStudy: Given a Study configuration, this creates an optimization Study and returns a globally unique identifier ("guid") which is then used for all future service calls. If a Study with a matching name exists, the guid for that Study is returned. This allows parallel workers to call this method and all register with the same Study.
• SuggestTrials: This method takes a "worker handle" as input, and immediately returns a globally unique handle for a "long-running operation" that represents the work of generating Trial suggestions. The user can then poll the API periodically to check the status of the operation. Once the operation is completed, it will contain the suggested Trials. This design ensures that all service calls are made with low latency, while allowing for the fact that the generation of Trials can take longer (a polling sketch follows this list).
• AddMeasurementToTrial: This method allows clients to provide intermediate metrics during the evaluation of a Trial. These metrics are then used by the Automated Stopping rules to determine which Trials should be stopped early.
• CompleteTrial: This method changes a Trial's status to "Completed", and provides a final objective value that is then used to inform the suggestions provided by future calls to SuggestTrials.
• ShouldTrialStop: This method returns a globally unique handle for a long-running operation that represents the work of determining whether a Pending Trial should be stopped.
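A client-side polling loop for the long-running-operation pattern described above might look roughly like the following; the operation accessor methods shown here are illustrative assumptions rather than the documented Vizier API.

import time

# Hypothetical polling loop for a long-running SuggestTrials operation.
operation = client.SuggestTrials(worker_handle, num_suggestions=2)
while not client.IsOperationDone(operation):
    time.sleep(5)  # poll periodically; the interval here is arbitrary
trials = client.GetOperationResult(operation)  # the suggested Trials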
2.4 Infrastructure
2.4.1 Parallel Processing of Suggestion Work. As the de facto
parameter tuning engine of Google, Vizier is constantly working on
generating suggestions for a large number of Studies concurrently.
As such, a single machine would be insufficient for handling the
workload. Our Suggestion Service is therefore partitioned across
several Google datacenters, with a number of machines being used
in each one. Each instance of the Suggestion Service potentially
can generate suggestions for several Studies in parallel, giving
us a massively scalable suggestion infrastructure. Google’s load
balancing infrastructure is then used to allow clients to make calls
to a unied endpoint, without needing to know which instance is
doing the work.
When a request is received by a Suggestion Service instance to
generate suggestions, the instance first places a distributed lock on
the Study. This lock is acquired for a fixed period of time, and is
periodically extended by a separate thread running on the instance.
In other words, the lock will be held until either the instance fails,
or it decides it’s done working on the Study. If the instance fails
(due to e.g. hardware failure, job preemption, etc), the lock soon
expires, making it eligible to be picked up by a separate process
(called the “DanglingWorkFinder”) which then reassigns the Study
to a dierent Suggestion Service instance.
One consideration in maintaining a production system is that
bugs are inevitably introduced as our code matures. Occasionally, a
new algorithmic change, however well tested, will lead to instances
of the Suggestion Service failing for particular Studies. If a Study
is picked up by the DanglingWorkFinder too many times, it will
temporarily halt the Study and alert us. This prevents subtle bugs
that only aect a few Studies from causing crash loops that aect
the overall stability of the system.
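The lock-and-extend behavior described above can be sketched as follows; the lock object and its methods are a generic illustration, not Vizier's internal implementation.

import threading

# Generic sketch of a lease-style lock that is periodically extended by a
# background thread, as described for the Suggestion Service. The
# `distributed_lock` object and its methods are hypothetical.
LEASE_SECONDS = 60

def hold_study_lock(distributed_lock, study_guid, stop_event):
    """Acquire a time-limited lock and keep extending it until asked to stop."""
    if not distributed_lock.Acquire(study_guid, ttl_seconds=LEASE_SECONDS):
        return False  # another instance owns the Study

    def extend_loop():
        # If this process dies, extensions stop and the lock soon expires,
        # letting the DanglingWorkFinder reassign the Study elsewhere.
        while not stop_event.wait(LEASE_SECONDS / 2):
            distributed_lock.Extend(study_guid, ttl_seconds=LEASE_SECONDS)

    threading.Thread(target=extend_loop, daemon=True).start()
    return True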
2.5 The Algorithm Playground
Vizier’s algorithm playground provides a mechanism for advanced
users to easily, quickly, and safely replace Vizier’s core optimization
algorithms with arbitrary algorithms.
The playground serves a dual purpose; it allows rapid prototyping of new algorithms, and it allows power-users to easily customize Vizier with advanced or exotic capabilities that are particular to their use-case. In all cases, users of the playground benefit from all of Vizier's infrastructure aside from the core algorithms, such as access to a persistent database of Trials, the dashboard, and visualizations.
Figure 2: Architecture of Playground mode. Main components are (1) the Vizier API, which takes service requests; (2) the Custom Policy, which implements the Abstract Policy and generates suggested Trials; (3) the Playground Binary, which drives the custom policy based on demand reported by the Vizier API; and (4) the Evaluation Workers, which behave as normal, i.e., they request and evaluate Trials.
At the core of the playground is the ability to inject Trials into a Study. Vizier allows the user or other authorized processes to request one or more particular Trials be evaluated. In Playground mode, Vizier does not suggest Trials for evaluation, but relies on an external binary to generate Trials, which are then pushed to the service for later distribution to the workers.
More specifically, the architecture of the Playground involves the following key components: (1) Abstract Policy, (2) Playground Binary, (3) Vizier Service, and (4) Evaluation Workers. See Figure 2 for an illustration.
The Abstract Policy contains two abstract methods:
(1) GetNewSuggestions(trials, num_suggestions)
(2) GetEarlyStoppingTrials(trials)
which should be implemented by the user's custom policy. Both these methods are passed the full state of all Trials in the Study, so algorithms may be implemented in a stateless fashion if desired. GetNewSuggestions is expected to generate num_suggestions new trials, while the GetEarlyStoppingTrials method is expected to return a list of Pending Trials that should be stopped early. The custom policy is registered with the Playground Binary which periodically polls the Vizier Service. The Evaluation Workers maintain the service abstraction and are unaware of the existence of the Playground.
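As a rough illustration of this interface, a minimal random-search policy might be implemented as follows; the AbstractPolicy and Trial types are stand-ins for the real (internal) playground types, and the stubs below exist only to make the sketch self-contained.

import random

# Minimal sketch of a custom Playground policy implementing random search.
class AbstractPolicy:
    def GetNewSuggestions(self, trials, num_suggestions):
        raise NotImplementedError
    def GetEarlyStoppingTrials(self, trials):
        raise NotImplementedError

class Trial:
    def __init__(self, parameters):
        self.parameters = parameters

class RandomSearchPolicy(AbstractPolicy):
    def GetNewSuggestions(self, trials, num_suggestions):
        """Return `num_suggestions` new trials sampled uniformly at random."""
        suggestions = []
        for _ in range(num_suggestions):
            params = {
                "learning_rate": 10 ** random.uniform(-5, 0),  # log-uniform
                "batch_size": random.choice([32, 64, 128, 256]),
            }
            suggestions.append(Trial(parameters=params))
        return suggestions

    def GetEarlyStoppingTrials(self, trials):
        """Random search never requests early stopping."""
        return []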
2.6 Benchmarking Suite
Vizier has an integrated framework that allows us to efficiently benchmark our algorithms on a variety of objective functions. Many of the objective functions come from the Black-Box Optimization Benchmarking Workshop [10], but the framework allows for any function to be modeled by implementing an abstract Experimenter class, which has a virtual method responsible for calculating the objective value for a given Trial, and a second virtual method that returns the optimal solution for that benchmark.
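A hedged sketch of what such an Experimenter subclass might look like for a simple benchmark (the 2-D sphere function) is shown below; the base-class and method names are assumptions based on the description above, with a stub base class to keep the sketch self-contained.

# Hypothetical Experimenter implementation for the sphere benchmark.
class Experimenter:
    def EvaluateTrial(self, trial):
        raise NotImplementedError
    def OptimalSolution(self):
        raise NotImplementedError

class SphereExperimenter(Experimenter):
    def EvaluateTrial(self, trial):
        """Compute the objective value for the given Trial's parameters."""
        x = trial.parameters["x"]
        y = trial.parameters["y"]
        return x ** 2 + y ** 2

    def OptimalSolution(self):
        """Return the known optimum of this benchmark."""
        return {"x": 0.0, "y": 0.0}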
Figure 3: A section of the dashboard for tracking the progress of Trials and the corresponding objective function values. Note also the presence of action buttons such as Get Suggestions for manually requesting suggestions.
Users congure a set of benchmark runs by providing a set of
algorithm congurations and a set of objective functions. The bench-
marking suite will optimize each function with each algorithm
k
times (where
k
is congurable), producing a series of performance-
over-time metrics which are then formatted after execution. The
individual runs are distributed over multiple threads and multi-
ple machines, so it is easy to have thousands of benchmark runs
executed in parallel.
2.7 Dashboard and Visualizations
Vizier has a web dashboard which is used for both monitoring and changing the state of Vizier studies. The dashboard is fully featured and implements the full functionality of the Vizier API. The dashboard is commonly used for: (1) Tracking the progress of a study; (2) Interactive visualizations; (3) Creating, updating and deleting a study; (4) Requesting new suggestions, early stopping, activating/deactivating a study. See Figure 3 for a section of the dashboard. In addition to monitoring and visualizations, the dashboard contains action buttons such as Get Suggestions.
The dashboard uses a translation layer which converts between JSON and protocol buffers [15] when talking with backend servers. The dashboard is built with Polymer [14], an open source web framework supported by Google, and uses material design principles. It contains interactive visualizations for analyzing the parameters in your study. In particular, we use the parallel coordinates visualization [18], which has the benefit of scaling to high dimensional spaces (~15 dimensions) and works with both numerical and categorical parameters. See Figure 4 for an example. Each vertical axis is a dimension corresponding to a parameter, whereas each horizontal line is an individual trial. The point at which the horizontal line intersects the vertical axis gives the value of the parameter in that dimension. This can be used for examining how the dimensions co-vary with each other and also against the objective function value (left most axis). The visualizations are built using d3.js [4].
Figure 4: The Parallel Coordinates visualization [18] is used for examining results from different Vizier runs. It has the benefit of scaling to high dimensional spaces (~15 dimensions) and works with both numerical and categorical parameters. Additionally, it is interactive and allows various modes of slicing and dicing data.
3 THE VIZIER ALGORITHMS
Vizier's modular design allows us to easily support multiple algorithms. For studies with under a thousand trials, Vizier defaults to using Batched Gaussian Process Bandits [8]. We use a Matérn kernel with automatic relevance determination (see, e.g., Section 5.1 of Rasmussen and Williams [23] for a discussion) and the expected improvement acquisition function [21]. We search for and find local maxima of the acquisition function with a proprietary gradient-free hill climbing algorithm, with random starting points.
We implement discrete parameters by embedding them in $\mathbb{R}$. Categorical parameters with $k$ feasible values are represented via one-hot encoding, i.e., embedded in $[0, 1]^k$. In both cases, the Gaussian Process regressor gives us a continuous and differentiable function upon which we can walk uphill; then, when the walk has converged, we round to the nearest feasible point.
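A minimal sketch of this embed-then-round treatment of a CATEGORICAL parameter, in a generic NumPy setting rather than Vizier's internal code, is:

import numpy as np

# Sketch: embed a categorical parameter into [0, 1]^k via one-hot encoding,
# let a continuous optimizer propose a point, then round back to a category.
categories = ["relu", "tanh", "sigmoid"]          # k = 3 feasible values

def one_hot(value):
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

def round_to_feasible(continuous_point):
    """Map an arbitrary point in [0, 1]^k to the nearest one-hot vertex."""
    return categories[int(np.argmax(continuous_point))]

# Example: a hill-climbing step on the acquisition surface might yield
# something like [0.2, 0.7, 0.4]; rounding recovers the category "tanh".
print(round_to_feasible(np.array([0.2, 0.7, 0.4])))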
While some authors recommend using Bayesian deep learning models in lieu of Gaussian processes for scalability [27, 31], in our experience they are too sensitive to their own hyperparameters and do not reliably perform well. Other researchers have recognized this problem as well, and are working to address it [28].
For studies with tens of thousands of trials or more, other algorithms may be used. Though RandomSearch and GridSearch are supported as first-class choices and may be used in this regime, and many other published algorithms are supported through the algorithm playground, we currently recommend a proprietary local-search algorithm under these conditions.
For all of these algorithms we support data normalization, which maps numeric parameter values into $[0, 1]$ and objective values onto $[-0.5, 0.5]$. Depending on the problem, a one-to-one nonlinear mapping may be used for some of the parameters, and is typically used for the objective. Data normalization is handled before trials are presented to the trial suggestion algorithms, and its suggestions are transparently mapped back to the user-specified scaling.
3.1 Automated Early Stopping
In some important applications of black–box optimization, informa-
tion related to the performance of a trial may become available dur-
ing trial evaluation. Perhaps the best example of such a performance
curve occurs when tuning machine learning hyperparameters for
models trained progressively (e.g., via some version of stochastic
gradient descent). In this case, the model typically becomes more
accurate as it trains on more data, and the accuracy of the model is
available at the end of each training epoch. Using these accuracy
vs. training step curves, it is often possible to determine that a
trial’s parameter settings are unpromising well before evaluation is
finished. In this case we can terminate trial evaluation early, freeing
those evaluation resources for more promising trial parameters.
When done algorithmically, this is referred to as automated early
stopping.
Vizier supports automated early stopping via an API call to a
ShouldTrialStop
method. Analogously to the Suggestion Service,
there is an Automated Stopping Service that accepts requests from
the Vizier API to analyze a study and determine the set of trials
that should be stopped, according to the configured early stopping
algorithm. As with suggestion algorithms, several automated early
stopping algorithms are supported, and rapid prototyping can be
done via the algorithm playground.
3.2 Automated Stopping Algorithms
Vizier supports the following automated stopping algorithms. These
are meant to work in a stateless fashion, i.e., they are given the full
state of all trials in the Vizier study when determining which trials
should stop.
3.2.1 Performance Curve Stopping Rule. This stopping rule per-
forms regression on the performance curves to make a prediction
of the nal objective value of a Trial given a set of Trials that are
already Completed, and a partial performance curve (i.e., a set of
measurements taken during Trial evaluation). Given this prediction,
if the probability of exceeding the optimal value found thus far is
suciently low, early stopping is requested for the Trial.
While prior work on automated early stopping used Bayesian parametric regression [9, 30], we opted for a Bayesian non-parametric regression, specifically a Gaussian process model with a carefully
designed kernel that measures similarity between performance
curves. Our motivation in this was to be robust to many kinds
of performance curves, including those coming from applications
other than tuning machine learning hyperparameters in which the
performance curves may have very dierent semantics. Notably,
this stopping rule still works well even when the performance curve
is not measuring the same quantity as the objective value, but is
merely predictive of it.
3.2.2 Median Stopping Rule. The median stopping rule stops a pending trial $x_t$ at step $s$ if the trial's best objective value by step $s$ is strictly worse than the median value of the running averages $\hat{o}^{\tau}_{1:s}$ of all completed trials' objectives $x_\tau$ reported up to step $s$. Here, we calculate the running average of a trial $x_\tau$ up to step $s$ as $\hat{o}^{\tau}_{1:s} = \frac{1}{s}\sum_{i=1}^{s} o^{\tau}_i$, where $o^{\tau}_i$ is the objective value of $x_\tau$ at step $i$. As with the performance curve stopping rule, the median stopping rule does not depend on a parametric model, and is applicable to a wide range of performance curves. In fact, the median stopping rule is model-free, and is more reminiscent of a bandit-based approach such as HyperBand [20].
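A compact sketch of the median stopping rule, operating on per-trial lists of intermediate objective values and assuming (for clarity) that lower objective values are better, is shown below.

import statistics

# Sketch of the median stopping rule. Assumes the objective is being
# minimized, so "worse" means "larger"; flip the comparison to maximize.
def should_stop(pending_curve, completed_curves):
    """pending_curve: intermediate objective values of the pending trial.
    completed_curves: per-trial objective-value curves of completed trials."""
    s = len(pending_curve)
    best_so_far = min(pending_curve)  # best value the pending trial has reached

    # Running average of each completed trial over its first s reported steps.
    running_averages = [
        sum(curve[:s]) / min(s, len(curve))
        for curve in completed_curves
        if curve  # ignore empty curves
    ]
    if not running_averages:
        return False

    median_average = statistics.median(running_averages)
    # Stop if the pending trial's best value is strictly worse (larger).
    return best_so_far > median_average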
3.3 Transfer Learning
When doing black-box optimization, users often run studies that are similar to studies they have run before, and we can use this fact to minimize repeated work. Vizier supports a form of Transfer Learning which leverages data from prior studies to guide and accelerate the current study. For instance, one might tune the learning rate and regularization of a machine learning system, then use that Study as a prior to tune the same ML system on a different data set.
Vizier's current approach to transfer learning is relatively simple, yet robust to changes in objective across studies. We designed our transfer learning approach with these goals in mind:
(1) Scale well to situations where there are many prior studies.
(2) Accelerate studies (i.e., achieve better results with fewer trials) when the priors are good, particularly in cases where the location of the optimum, $x^*$, doesn't change much.
(3) Be robust against poorly chosen prior studies (i.e., a bad prior should give only a modest deceleration).
(4) Share information even when there is no formal relationship between the prior and current Studies.
In previous work on transfer learning in the context of hyperparameter optimization, Bardenet et al. [1] discuss the difficulty in transferring knowledge across different datasets, especially when the observed metrics and the sampling of the datasets are different. They use a ranking approach for constructing a surrogate model for the response surface. This approach suffers from the computational overhead of running a ranking algorithm. Yogatama and Mann [32] propose a more efficient approach, which scales as $\Theta(kn + n^3)$ for $k$ studies of $n$ trials each, where the cubic term comes from using a Gaussian process in their acquisition function.
Vizier typically uses Gaussian Process regressors, so one natural approach to implementing transfer learning might be to build a larger Gaussian Process regressor that is trained on both the prior(s) and the current Study. However that approach fails to satisfy design goal 1: for $k$ studies with $n$ trials each it would require $\Omega(k^3n^3)$ time. Such an approach also requires one to specify or learn kernel functions that bridge between the prior(s) and current Study, violating design goal 4.
Instead, our strategy is to build a stack of Gaussian Process regressors, where each regressor is associated with a study, and where each level is trained on the residuals relative to the regressor below it. Our model is that the studies were performed in a linear sequence, each study using the studies before it as priors.
The bottom of the stack contains a regressor built using data from the oldest study in the stack. The regressor above it is associated with the 2nd oldest study, and regresses on the residual of its objective relative to the predictions of the regressor below it. Similarly, the regressor associated with the $i$th study is built using the data from that study, and regresses on the residual of the objective with respect to the predictions of the regressor below it.
Figure 5: An illustration of our transfer learning scheme, showing how $\mu_i$ is built from the residual labels w.r.t. $\mu_{i-1}$ (shown in dotted red lines).
More formally, we have a sequence of studies $\{S_i\}_{i=1}^{k}$ on unknown objective functions $\{f_i\}_{i=1}^{k}$, where the current study is $S_k$, and we build two sequences of regressors $\{R_i\}_{i=1}^{k}$ and $\{R'_i\}_{i=1}^{k}$ having posterior mean functions $\{\mu_i\}_{i=1}^{k}$ and $\{\mu'_i\}_{i=1}^{k}$ respectively, and posterior standard deviation functions $\{\sigma_i\}_{i=1}^{k}$ and $\{\sigma'_i\}_{i=1}^{k}$, respectively. Our final predictions will be $\mu_k$ and $\sigma_k$.
Let $D_i = \{(x^i_t, y^i_t)\}_t$ be the dataset for study $S_i$. Let $R'_i$ be a regressor trained using the data $\{(x^i_t,\; y^i_t - \mu_{i-1}(x^i_t))\}_t$, which computes $\mu'_i$ and $\sigma'_i$. Then we define our posterior mean at level $i$ as $\mu_i(x) := \mu'_i(x) + \mu_{i-1}(x)$. We take our posterior standard deviation at level $i$, $\sigma_i(x)$, to be a weighted geometric mean of $\sigma'_i(x)$ and $\sigma_{i-1}(x)$, where the weights are a function of the amount of data (i.e., completed trials) in $S_i$ and $S_{i-1}$. The exact weighting function depends on a constant $\alpha \ge 1$ that sets the relative importance of old and new standard deviations.
This approach has nice properties when the prior regressors are densely supported (i.e., have many well-spaced data points) but the top-level regressor has relatively little training data: (1) fine structure in the priors carries through to $\mu_k$, even if the top-level regressor gives a low-resolution model of the objective function residual; (2) since the estimate for $\sigma'_k$ is inaccurate, averaging it with $\sigma_{k-1}$ can lead to an improved estimate. Further, when the top-level regressor has dense support, $\beta \approx 1$ and $\sigma_k \approx \sigma'_k$, as one might desire.
We provide details in the pseudocode in Algorithm 1, and illustrate the regressors in Figure 5.
Algorithm 1 is then used in the Batched Gaussian Process Bandits [8] algorithm. Algorithm 1 has the property that for a sufficiently dense sampling of the feasible region in the training data for the current study, the predictions converge to those of a regressor trained only on the current study data. This ensures a certain degree of robustness: badly chosen priors will eventually be overwhelmed (design goal 3).
Algorithm 1 Transfer Learning Regressor
# This is a higher-order function that returns a regressor R(x_test);
# then R(x_test) can be evaluated to obtain (µ, σ).
function GetRegressor(D_training, i)
    if i < 0: return a function that returns (0, 1) for all inputs
    # Recurse to get a regressor (µ_{i-1}(x), σ_{i-1}(x)) trained on
    # the data for all levels of the stack below this one.
    R_prior ← GetRegressor(D_training, i − 1)
    # Compute training residuals.
    D_residuals ← [(x, y − R_prior(x)[0]) for (x, y) ∈ D_i]
    # Train a Gaussian Process (µ′_i(x), σ′_i(x)) on the residuals.
    GP_residuals ← TrainGP(D_residuals)
    function StackedRegressor(x_test)
        µ_prior, σ_prior ← R_prior(x_test)
        µ_top, σ_top ← GP_residuals(x_test)
        µ ← µ_top + µ_prior
        β ← α|D_i| / (α|D_i| + |D_{i−1}|)
        σ ← σ_top^β · σ_prior^(1−β)
        return (µ, σ)
    end function
    return StackedRegressor
end function
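For concreteness, a minimal Python sketch of Algorithm 1 follows; train_gp is assumed to return a callable mapping a point to a (mean, standard deviation) pair, standing in for whatever Gaussian Process library one uses, and alpha is the constant α described above.

# Sketch of Algorithm 1 (transfer learning regressor stack).
def get_regressor(datasets, i, alpha=1.0):
    """datasets: list of per-study datasets, each a list of (x, y) pairs.
    Returns a regressor R with R(x_test) -> (mu, sigma) for study i."""
    if i < 0:
        return lambda x_test: (0.0, 1.0)  # base case: flat prior

    # Regressor trained on all studies below this level of the stack.
    r_prior = get_regressor(datasets, i - 1, alpha)

    # Residuals of this study's objectives w.r.t. the prior's mean.
    d_residuals = [(x, y - r_prior(x)[0]) for (x, y) in datasets[i]]
    gp_residuals = train_gp(d_residuals)  # assumed: returns x -> (mu, sigma)

    n_i = len(datasets[i])
    n_below = len(datasets[i - 1]) if i > 0 else 0

    def stacked_regressor(x_test):
        mu_prior, sigma_prior = r_prior(x_test)
        mu_top, sigma_top = gp_residuals(x_test)
        mu = mu_top + mu_prior
        beta = alpha * n_i / (alpha * n_i + n_below) if (n_i or n_below) else 1.0
        sigma = (sigma_top ** beta) * (sigma_prior ** (1.0 - beta))
        return mu, sigma

    return stacked_regressor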
In production settings, transfer learning is often particularly
valuable when the number of trials per study is relatively small,
but there are many such studies. For example, certain production
machine learning systems may be very expensive to train, limiting
the number of trials that can be run for hyperparameter tuning,
yet are mission critical for a business and are thus worked on year
after year. Over time, the total number of trials spanning several
small hyperparameter tuning runs can be quite informative. Our
transfer learning scheme is particularly well-suited to this case, as
illustrated in section 4.3.
4 RESULTS
4.1 Performance Evaluation
To evaluate the performance of Google Vizier we require functions that can be used to benchmark the results. These are pre-selected, easily calculated functions with known optimal points that have proven challenging for black-box optimization algorithms. We can measure the success of an optimizer on a benchmark function $f$ by its final optimality gap. That is, if $x^*$ minimizes $f$, and $\hat{x}$ is the best solution found by the optimizer, then $|f(\hat{x}) - f(x^*)|$ measures the success of that optimizer on that function. If, as is frequently the case, the optimizer has a stochastic component, we then calculate the average optimality gap by averaging over multiple runs of the optimizer on the same benchmark function.
Comparing between benchmarks is more difficult given that the different benchmark functions have different ranges and difficulties. For example, a good black-box optimizer applied to the Rastrigin function might achieve an optimality gap of 160, while simple random sampling of the Beale function can quickly achieve an optimality gap of 60 [10]. We normalize for this by taking the ratio of the optimality gap to the optimality gap of Random Search on the same function under the same conditions. Once normalized, we average over the benchmarks to get a single value representing an optimizer's performance.
The benchmarks selected were primarily taken from the Black-Box Optimization Benchmarking Workshop [10] (an academic competition for black-box optimizers), and include the Beale, Branin, Ellipsoidal, Rastrigin, Rosenbrock, Six Hump Camel, Sphere, and Styblinski benchmark functions.
4.2 Empirical Results
In Figure 6 we look at result quality for four optimization algorithms currently implemented in the Vizier framework: a multi-armed bandit technique using a Gaussian process regressor [29], the SMAC algorithm [19], the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [16], and a probabilistic search method of our own. For a given dimension $d$, we generalized each benchmark function into a $d$-dimensional space, ran each optimizer on each benchmark 100 times, and recorded the intermediate results (averaging these over the multiple runs). Figure 6 shows their improvement over Random Search; the horizontal axis represents the number of trials that have been evaluated, while the vertical axis indicates each optimality gap as a fraction of the Random Search optimality gap at the same point. The 2×Random Search curve is the Random Search algorithm when it was allowed to sample two points for each point the other algorithms evaluated. While some authors have claimed that 2×Random Search is highly competitive with Bayesian Optimization methods [20], our data suggests this is only true when the dimensionality of the problem is sufficiently high (e.g., over 16).
4.3 Transfer Learning
We display the value of transfer learning in Figure 7 with a series
of short studies; each study is just six trials long. Even so, one can
see that transfer learning from one study to the next leads to steady
progress towards the optimum, as the stack of regressors gradually
builds up information about the shape of the objective function.
This experiment is conducted in a 10 dimensional space, using
the 8 black-box functions described in section 4.1. We run 30 studies
(180 trials) and each study uses transfer learning from all previous
studies.
As one might hope, transfer learning causes the GP bandit algo-
rithm to show a strong systematic decrease in the optimality gap
from study to study, with its final average optimality gap 37% the size of Random Search's. As expected, Random Search shows no systematic improvement in its optimality gap from study to study.
Note that a systematic improvement in the optimality gap is a difficult task since each study gets a budget of only 6 trials whilst operating in a 10 dimensional space, and the GP regressor is optimizing 8 internal hyperparameters for each study. By any reasonable measure, a single study's data is insufficient for the regressor to learn much about the shape of the objective function.
Figure 6: Ratio of the average optimality gap of each optimizer (GP Bandit, SMAC, CMA-ES, Probabilistic Search, and 2×Random Search) to that of Random Search at a given number of samples, shown for benchmark dimensions 4, 8, 16, 32, and 64. The 2×Random Search is a Random Search allowed to sample two points at every step (as opposed to a single point for the other algorithms).
Figure 7: Convergence of transfer learning in a 10 dimensional space. This shows a sequence of studies with progressive transfer learning for both GP Bandit (blue diamonds) and Random Search (red squares) optimizers. The X-axis shows the index of the study, i.e., the number of times that transfer learning has been applied; the Y-axis shows the log of the best mean optimality gap seen in the study (see Section 4.1). Each study contains six trials; for the GP Bandit-based optimizer the previous studies are used as priors for transfer learning. Note that the GP bandits show a consistent improvement in optimality gap from study to study, thus demonstrating an effective transfer of knowledge from the earlier trials; Random Search does not do transfer learning.
4.4 Automated Stopping
4.4.1 Performance Curve Stopping Rule. In our experiments, we found that the use of the performance curve stopping rule resulted in achieving optimality gaps comparable to those achieved without the stopping rule, while using approximately 50% fewer CPU-hours when tuning hyperparameters for deep neural networks. Our result is in line with figures reported by other researchers, while using a more flexible non-parametric model (e.g., Domhan et al. [9] report reductions in the 40% to 60% range on three ML hyperparameter tuning benchmarks).
4.4.2 Median Automated Stopping Rule. We evaluated the Median Stopping Rule for several hyperparameter search problems, including a state-of-the-art residual network architecture based on [17] for image classification on CIFAR10 with 16 tunable hyperparameters, and an LSTM architecture [33] for language modeling on the Penn TreeBank data set with 12 tunable hyperparameters. We observed that in all cases the stopping rule consistently achieved a factor two to three speedup over random search, while always finding the best performing Trial. Li et al. [20] argued that "2X random search", i.e., random search at twice the speed, is competitive with several state-of-the-art black-box optimization methods on a broad range of benchmarks. The robustness of the stopping rule was also evaluated by running repeated simulations on a large set of completed random search trials under random permutation, which showed that the algorithm almost never decided to stop the ultimately-best-performing trial early.
5 USE CASES
Vizier is used for a number of different application domains.
5.1 Hyperparameter Tuning and HyperTune
Vizier is used across Google to optimize hyperparameters of machine learning models, both for research and production models. Our implementation scales to service the entire hyperparameter tuning workload across Alphabet, which is extensive. As one (admittedly extreme) example, Collins et al. [6] used Vizier to perform hyperparameter tuning studies that collectively contained millions of trials for a research project investigating the capacity of different recurrent neural network architectures. In this context, a single trial involved training a distinct machine learning model using different hyperparameter values. That research project would not be possible without effective black-box optimization. For other research projects, automating the arduous and tedious task of hyperparameter tuning accelerates their progress.
Perhaps even more importantly, Vizier has made notable improvements to production models underlying many Google products, resulting in measurably better user experiences for over a billion people. External researchers and developers can achieve the same benefits using the Google Cloud Machine Learning HyperTune subsystem, which benefits from our experience and technology.
5.2 Automated A/B Testing
In addition to tuning hyperparameters, Vizier has a number of other uses. It is used for automated A/B testing of Google web properties, for example tuning user-interface parameters such as font and thumbnail sizes, color schema, and spacing, or traffic-serving parameters such as the relative importance of various signals in determining which items to show to a user. An example of the latter would be "how should the search results returned from Google Maps trade off search-relevance for distance from the user?"
5.3 Delicious Chocolate Chip Cookies
Vizier is also used to solve complex black–box optimization prob-
lems arising from physical design or logistical problems. Here we
present an example that highlights some additional capabilities of
the system: nding the most delicious chocolate chip cookie recipe
from a parameterized space of recipes.
Parameters included baking soda, brown sugar, white sugar, butter, vanilla, egg, flour, chocolate, chip type, salt, cayenne, orange extract, baking time, and baking temperature. We provided
recipes to contractors responsible for providing desserts for Google
employees. The head chefs among the contractors were given dis-
cretion to alter parameters if (and only if) they strongly believed
it to be necessary, but would carefully note what alterations were
made. The cookies were baked, and distributed to the cafes for
taste–testing. Cafe goers tasted the cookies and provided feedback
via a survey. Survey results were aggregated and the results were
sent back to Vizier. The “machine learning cookies” were provided
about twice a week over several weeks.
The cookies improved significantly over time; later rounds were extremely well-rated and, in the authors' opinions, delicious. However, we wish to highlight the following capabilities of Vizier that the cookie design experiment exercised:
• Infeasible trials: In real applications, some trials may be infeasible, meaning they cannot be evaluated for reasons that are intrinsic to the parameter settings. Very high learning rates may cause training to diverge, leading to garbage models. In this example: very low levels of butter may make your cookie dough impossibly crumbly and incohesive.
• Manual overrides of suggested trials: Sometimes you cannot evaluate the suggested trial or else mistakenly evaluate a different trial than the one asked for. For example, when baking you might be running low on an ingredient and have to settle for less than the recommended amount.
• Transfer learning: Before starting to bake at large scale, we baked some recipes in a smaller scale run-through. This provided useful data that we could transfer learn from when baking at scale. Conditions were not identical, however, resulting in some unexpected consequences. For example, the dough was allowed to sit longer in large-scale production, which unexpectedly, and somewhat dramatically, increased the subjective spiciness of the cookies for trials that involved cayenne. Fortunately, our transfer learning scheme is relatively robust to such shifts.
Vizier supports marking trials as infeasible, in which case they do not receive an objective value. In the case of Bayesian Optimization, previous work either assigns them a particularly bad objective value, attempts to incorporate a probability of infeasibility into the acquisition function to penalize points that are likely to be infeasible [3], or tries to explicitly model the shape of the infeasible region [11, 12]. We take the first approach, which is simple and fairly effective for the applications we consider. Regarding manual overrides, Vizier's stateless design makes it easy to support updating or deleting trials; we simply update the trial state on the database. For details on transfer learning, refer to Section 3.3.
6 CONCLUSION
We have presented our design for Vizier, a scalable, state-of-the-
art internal service for black–box optimization within Google, ex-
plained many of its design choices, and described its use cases
and benets. It has already proven to be a valuable platform for
research and development, and we expect it will only grow more so
as the area of black–box optimization grows in importance. Also,
it designs excellent cookies, which is a very rare capability among
computational systems.
7 ACKNOWLEDGMENTS
We gratefully acknowledge the contributions of the following: Jeremy Kubica, Jeff Dean, Eric Christiansen, Moritz Hardt, Katya Gonina, Kevin Jamieson, and Abdul Salem.
REFERENCES
[1] Rémi Bardenet, Mátyás Brendel, Balázs Kégl, and Michele Sebag. 2013. Collaborative hyperparameter tuning. ICML 2 (2013), 199.
[2] James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems. 2546–2554.
[3] J Bernardo, MJ Bayarri, JO Berger, AP Dawid, D Heckerman, AFM Smith, and M West. 2011. Optimization under unknown constraints. Bayesian Statistics 9 9 (2011), 229.
[4] Michael Bostock, Vadim Ogievetsky, and Jeffrey Heer. 2011. D³ data-driven documents. IEEE Transactions on Visualization and Computer Graphics 17, 12 (2011), 2301–2309.
[5] Herman Chernoff. 1959. Sequential Design of Experiments. Ann. Math. Statist. 30, 3 (1959), 755–770. https://doi.org/10.1214/aoms/1177706205
[6] Jasmine Collins, Jascha Sohl-Dickstein, and David Sussillo. 2017. Capacity and Trainability in Recurrent Neural Networks. In Proceedings of the International Conference on Learning Representations (ICLR).
[7] Andrew R Conn, Katya Scheinberg, and Luis N Vicente. 2009. Introduction to derivative-free optimization. SIAM.
[8] Thomas Desautels, Andreas Krause, and Joel W Burdick. 2014. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research 15, 1 (2014), 3873–3923.
[9] Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. 2015. Speeding Up Automatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation of Learning Curves. In IJCAI. 3460–3468.
[10] Steffen Finck, Nikolaus Hansen, Raymond Ros, and Anne Auger. 2009. Real-Parameter Black-Box Optimization Benchmarking 2009: Presentation of the Noiseless Functions. http://coco.gforge.inria.fr/lib/exe/fetch.php?media=download3.6:bbobdocfunctions.pdf. (2009). [Online].
[11] Jacob R Gardner, Matt J Kusner, Zhixiang Eddie Xu, Kilian Q Weinberger, and John P Cunningham. 2014. Bayesian Optimization with Inequality Constraints. In ICML. 937–945.
[12] Michael A Gelbart, Jasper Snoek, and Ryan P Adams. 2014. Bayesian optimization with unknown constraints. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 250–259.
[13] Josep Ginebra and Murray K. Clayton. 1995. Response Surface Bandits. Journal of the Royal Statistical Society. Series B (Methodological) 57, 4 (1995), 771–784. http://www.jstor.org/stable/2345943
[14] Google. 2017. Polymer: Build modern apps using web components. https://github.com/Polymer/polymer. (2017). [Online].
[15] Google. 2017. Protocol Buffers: Google's data interchange format. https://github.com/google/protobuf. (2017). [Online].
[16] Nikolaus Hansen and Andreas Ostermeier. 2001. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation 9, 2 (2001), 159–195.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[18] Julian Heinrich and Daniel Weiskopf. 2013. State of the Art of Parallel Coordinates. In Eurographics (STARs). 95–116.
[19] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. 2011. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization. Springer, 507–523.
[20] Lisha Li, Kevin G. Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. 2016. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. CoRR abs/1603.06560 (2016). http://arxiv.org/abs/1603.06560
[21] J Moćkus, V Tiesis, and A Źilinskas. 1978. The Application of Bayesian Methods for Seeking the Extremum. Vol. 2. Elsevier. 117–128 pages.
[22] John A Nelder and Roger Mead. 1965. A simplex method for function minimization. The Computer Journal 7, 4 (1965), 308–313.
[23] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.
[24] Luis Miguel Rios and Nikolaos V Sahinidis. 2013. Derivative-free optimization: a review of algorithms and comparison of software implementations. Journal of Global Optimization 56, 3 (2013), 1247–1293.
[25] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando de Freitas. 2016. Taking the human out of the loop: A review of Bayesian optimization. Proc. IEEE 104, 1 (2016), 148–175.
[26] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems. 2951–2959.
[27] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Prabhat, and Ryan P. Adams. 2015. Scalable Bayesian Optimization Using Deep Neural Networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), JMLR Workshop and Conference Proceedings, Vol. 37. JMLR.org, 2171–2180. http://jmlr.org/proceedings/papers/v37/snoek15.html
[28] Jost Tobias Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter. 2016. Bayesian Optimization with Robust Bayesian Neural Networks. In Advances in Neural Information Processing Systems 29. Curran Associates, Inc., 4134–4142. http://papers.nips.cc/paper/6117-bayesian-optimization-with-robust-bayesian-neural-networks.pdf
[29] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. 2010. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. ICML (2010).
[30] Kevin Swersky, Jasper Snoek, and Ryan Prescott Adams. 2014. Freeze-thaw Bayesian optimization. arXiv preprint arXiv:1406.3896 (2014).
[31] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. 2016. Deep kernel learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics. 370–378.
[32] Dani Yogatama and Gideon Mann. 2014. Efficient Transfer Learning Method for Automatic Hyperparameter Tuning. JMLR: W&CP 33 (2014), 1077–1085.
[33] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014).
... The objectives of this work are to develop and present a high-fidelity ensemble wildfire simulation framework for enabling the parametric examination of wildfire behavior, the creation of ensemble data to support ML tasks, and to support fire-risk assessment and uncertainty analysis of wildfires. The proposed simulation framework combines an opensource large-eddy simulation tool Swirl-Fire for wildfire predictions (Wang et al. 2023), an opensource optimization platform Vizier (Golovin et al. 2017) for run-time management of ensemble simulation, and large-scale batch processing. The resulting framework is employed to perform a detailed parametric analysis of the effects of changing wind and slope on the fire-spread behavior. ...
... To enable the high-fidelity ensemble simulation, we are leveraging the open-source optimization platform Vizier (Golovin et al. 2017). Given only a parameter space and a cost function over this space, Vizier efficiently searches for optimal parameters. ...
Preprint
Full-text available
Background. Wildfire research uses ensemble methods to analyze fire behaviors and assess uncertainties. Nonetheless, current research methods are either confined to simple models or complex simulations with limits. Modern computing tools could allow for efficient, high-fidelity ensemble simulations. Aims. This study proposes a high-fidelity ensemble wildfire simulation framework for studying wildfire behavior, ML tasks, fire-risk assessment, and uncertainty analysis. Methods. In this research, we present a simulation framework that integrates the Swirl-Fire large-eddy simulation tool for wildfire predictions with the Vizier optimization platform for automated run-time management of ensemble simulations and large-scale batch processing. All simulations are executed on tensor-processing units to enhance computational efficiency. Key results. A dataset of 117 simulations is created, each with 1.35 billion mesh points. The simulations are compared to existing experimental data and show good agreement in terms of fire rate of spread. Computations are done for fire acceleration, mean rate of spread, and fireline intensity. Conclusions. Strong coupling between these 2 parameters are observed for the fire spread and intermittency. A critical Froude number that delineates fires from plume-driven to convection-driven is identified and confirmed with literature observations. Implications. The ensemble simulation framework is efficient in facilitating parametric wildfire studies.
... Consider a black-box function f : R^n → R, where each evaluation of f is assumed to be expensive in terms of computational resources or time. The objective of the black-box optimization problem is to find [37]: ...
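The truncated objective can be written in its standard form (consistent with the black-box setting described in this paper, not necessarily the citing paper's exact notation):

x^{*} \in \arg\min_{x \in \mathcal{X} \subseteq \mathbb{R}^{n}} f(x),

subject to a limited budget of T evaluations f(x_1), ..., f(x_T), with no access to gradients or any other structure of f.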
Preprint
Full-text available
Large Language Models (LLMs) have achieved significant progress across various fields and have exhibited strong potential in evolutionary computation, such as generating new solutions and automating algorithm design. Surrogate-assisted selection is a core step in evolutionary algorithms for solving expensive optimization problems by reducing the number of real evaluations. Traditionally, it has relied on conventional machine learning methods, leveraging historical evaluations to predict the performance of new solutions. In this work, we propose a novel surrogate model based purely on LLM inference capabilities, eliminating the need for training. Specifically, we formulate model-assisted selection as a classification and regression problem, utilizing LLMs to directly evaluate the quality of new solutions based on historical data: predicting whether a solution is good or bad, or approximating its value. This approach is then integrated into evolutionary algorithms, termed LLM-assisted EA (LAEA). Detailed experiments compare visualization results on 2D data across 9 mainstream LLMs, as well as their performance on optimization problems. The experimental results demonstrate that LLMs have significant potential as surrogate models in evolutionary computation, achieving performance comparable to traditional surrogate models using inference alone. This work offers new insights into the application of LLMs in evolutionary computation. Code is available at: https://github.com/hhyqhh/LAEA.git
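As a rough illustration of how surrogate-assisted selection could be framed as LLM classification (an assumption-laden sketch, not the paper's code; query_llm is a hypothetical stand-in for whatever completion API is used):

# Illustrative, training-free LLM surrogate: historical (solution, fitness)
# pairs are serialized into a prompt and the LLM is asked whether a candidate
# is likely GOOD or BAD. `query_llm` is a hypothetical callable.

def build_prompt(history, candidate):
    lines = ["Each line is a solution vector and its objective value (lower is better):"]
    for x, fx in history:
        lines.append(f"{list(x)} -> {fx:.4f}")
    lines.append(f"Candidate: {list(candidate)}")
    lines.append("Answer with one word, GOOD or BAD, relative to the median objective.")
    return "\n".join(lines)

def llm_surrogate_is_good(history, candidate, query_llm):
    reply = query_llm(build_prompt(history, candidate))
    return "GOOD" in reply.upper()

# In an evolutionary loop, offspring judged GOOD would be sent for real
# (expensive) evaluation, while those judged BAD are filtered out.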
... However, machine learning models are black boxes with hidden inner workings, making it difficult for users to understand the relationship between data and results (Carvalho et al., 2019). Thus, feature importance analyses have been widely used in different fields to aid machine learning interpretability (Golovin et al., 2017; Kahng et al., 2017; J. Zhang et al., 2018). ...
Article
Full-text available
Purpose: With the growing popularity of recreational activities, this study aimed to develop prediction models for recreational activity participation and to explore the key factors affecting participation. Methods: A total of 12,712 participants, excluding individuals under 20, were selected from the National Health and Nutrition Examination Survey (NHANES) from 2011 to 2018. The mean age of the sample was 46.86 years (±16.97), with a gender distribution of 6,721 males and 5,991 females. The variables included demographic, physical-related, and lifestyle variables. This study developed 42 prediction models using six machine learning methods: logistic regression, Support Vector Machine (SVM), decision tree, random forest, eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM). The relative importance of each variable was evaluated by permutation feature importance. Results: The results illustrated that LightGBM was the most effective algorithm for predicting recreational activity participation (accuracy: .838, precision: .783, recall: .967, F1-score: .865, AUC: .826). In particular, prediction performance increased when the demographic and lifestyle datasets were used together. As a result of permutation feature importance computed on the top models, education level and moderate-to-vigorous physical activity (MVPA) were found to be essential variables. Conclusion: These findings demonstrate the potential of a data-driven machine learning approach in the recreation domain. Furthermore, this study interpreted the prediction model through feature importance analysis to mitigate the interpretability limitations of machine learning.
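To illustrate the two ingredients the abstract combines, the following sketch trains a LightGBM classifier on synthetic data and ranks features with permutation importance. The data, features, and hyperparameters are placeholders, not those of the NHANES study.

# Illustrative only: synthetic data standing in for demographic/lifestyle features.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=2000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LGBMClassifier(n_estimators=200).fit(X_tr, y_tr)

# Permutation importance: how much held-out accuracy drops when a feature is shuffled.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.4f}")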
... Automation in hyperparameter tuning can significantly reduce the time and expertise required to find optimal settings [115,116]. Automated tools like Hyperopt, Optuna, and Google's Vizier can assist meteorologists and data scientists by efficiently exploring parameter spaces, but expert oversight remains essential to guide the tuning process and interpret results effectively [117][118][119]. ...
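For instance, a minimal Optuna study looks roughly like this; the objective below is a toy stand-in for a real model's validation loss, and the parameter names and ranges are placeholders.

import optuna

def objective(trial):
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    depth = trial.suggest_int("depth", 2, 10)
    # Toy loss with a known optimum; replace with real training + validation.
    return (lr - 1e-3) ** 2 + (depth - 6) ** 2

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)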
Article
Full-text available
Accurate and rapid weather forecasting and climate modeling are long-standing goals of human development. While Numerical Weather Prediction (NWP) remains the gold standard, it faces challenges such as inherent atmospheric uncertainties and computational costs, especially in the post-Moore era. With the advent of deep learning, the field has been revolutionized by data-driven models. This paper reviews the key models and significant developments in data-driven weather forecasting and climate modeling. It provides an overview of these models, covering aspects such as dataset selection, model design, training process, computational acceleration, and prediction effectiveness. Data-driven models trained on reanalysis data can provide effective forecasts with an anomaly correlation coefficient (ACC) greater than 0.6 for up to 15 days at a spatial resolution of 0.25°. These models outperform or match the most advanced NWP methods for 90% of variables, reducing forecast generation time from hours to seconds. Data-driven climate models can reliably simulate climate patterns for decades to 100 years, offering orders-of-magnitude computational savings and competitive performance. Despite their advantages, data-driven methods have limitations, including poor interpretability, challenges in evaluating model uncertainty, and conservative predictions in extreme cases. Future research should focus on larger models, integrating more physical constraints, and enhancing evaluation methods.
... Because of that, AutoML research has grown rapidly over the last few years, probably most evidently in NAS (Elsken et al., 2019). At the same time, most big IT companies have developed large software packages enabling AutoML, including Google (Golovin et al., 2017; Song et al., 2022), Amazon (Erickson et al., 2020), Meta (Balandat et al., 2020), IBM, Oracle (Yakovlev et al., 2020), and Microsoft (Wang et al., 2021a). ...
Preprint
Full-text available
Automated machine learning (AutoML) was formed around the fundamental objectives of automatically and efficiently configuring machine learning (ML) workflows, aiding the research of new ML algorithms, and contributing to the democratization of ML by making it accessible to a broader audience. Over the past decade, commendable achievements in AutoML have primarily focused on optimizing predictive performance. This focused progress, while substantial, raises questions about how well AutoML has met its broader, original goals. In this position paper, we argue that a key to unlocking AutoML's full potential lies in addressing the currently underexplored aspect of user interaction with AutoML systems, including their diverse roles, expectations, and expertise. We envision a more human-centered approach in future AutoML research, promoting the collaborative design of ML systems that tightly integrates the complementary strengths of human expertise and AutoML methodologies.
... Some recent works explore scalable transfer learning with deep neural networks [33,47]. Also, different components of BO can be transferred, such as observations [41], surrogate functions [15,47], hyperparameter initializations [47], or all of them [45]. However, most existing transfer-BO approaches assume the traditional black-box BO setting. ...
Preprint
In this paper, we address the problem of cost-sensitive multi-fidelity Bayesian Optimization (BO) for efficient hyperparameter optimization (HPO). Specifically, we assume a scenario where users want to stop BO early when the performance improvement is not satisfactory relative to the required computational cost. Motivated by this scenario, we introduce a utility function, predefined by each user, that describes the trade-off between the cost and performance of BO. This utility function, combined with our novel acquisition function and stopping criterion, allows us to dynamically choose, at each BO step, the configuration that we expect to maximally improve the utility in the future, and to automatically stop BO around the maximum utility. Further, we improve the sample efficiency of existing learning curve (LC) extrapolation methods with transfer learning, while successfully capturing the correlations between different configurations to develop a sensible surrogate function for multi-fidelity BO. We validate our algorithm on various LC datasets and find that it outperforms all the previous multi-fidelity BO and transfer-BO baselines we consider, achieving a significantly better trade-off between cost and performance of BO.
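One way to make such a cost/performance trade-off concrete is sketched below; the linear utility form and the expected-gain proxy are illustrative assumptions, not the paper's actual definitions.

# Illustrative utility-based stopping rule for cost-aware BO (assumed linear
# utility; the paper's utility, acquisition, and stopping criterion differ).

def utility(best_perf, total_cost, lam=0.01):
    # Higher performance is better; each unit of cost is penalized by lam.
    return best_perf - lam * total_cost

def should_stop(best_perf, total_cost, predicted_gain, next_cost, lam=0.01):
    # Stop when the utility expected after the next evaluation is no better
    # than the utility already achieved.
    current = utility(best_perf, total_cost, lam)
    projected = utility(best_perf + predicted_gain, total_cost + next_cost, lam)
    return projected <= current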
... When setting the hyperparameters, we leveraged insights from prior research and experimental experience, and referred to optimization algorithms based on Bayesian methods. Additionally, we utilized the Python open-source toolkit 'advisor' for parameter optimization [35]. ...
Article
Full-text available
With the development of deep learning, several graph neural network (GNN)-based approaches have been utilized for text classification. However, GNNs encounter challenges in capturing contextual text information within a document sequence. To address this, a novel text classification model, RB-GAT, is proposed by combining RoBERTa-BiGRU embedding with a multi-head Graph ATtention Network (GAT). First, the pre-trained RoBERTa model is exploited to learn word and text embeddings in different contexts. Second, the Bidirectional Gated Recurrent Unit (BiGRU) is employed to capture long-term dependencies and bidirectional sentence information from the text context. Next, the multi-head graph attention network is applied to analyze this information, which serves as the node features for the document. Finally, the classification results are generated through a Softmax layer. Experimental results on five benchmark datasets demonstrate that our method achieves accuracies of 71.48%, 98.45%, 80.32%, 90.84%, and 95.67% on Ohsumed, R8, MR, 20NG, and R52, respectively, which is superior to nine existing text classification approaches.
... In practice, attackers generally cannot access the internal structure and parameters of the T2I models, as they are wrapped behind a black-box API, which makes conventional gradient descent algorithms unsuitable (Stephan et al., 2017; Kingma & Ba, 2014). Although some existing black-box learning methods (Golovin et al., 2017) relax the need to access internal information of the black box, they generally require obtaining the output results of the black box to compute losses and estimate gradients. However, during training, the integrated textual and visual defenses of the API can effectively detect malicious content within the input prompts or generated images. ...
Preprint
Text-to-Image (T2I) models have raised security concerns due to their potential to generate inappropriate or harmful images. In this paper, we propose UPAM, a novel framework that investigates the robustness of T2I models from the attack perspective. Unlike most existing attack methods that focus on deceiving textual defenses, UPAM aims to deceive both textual and visual defenses in T2I models. UPAM enables gradient-based optimization, offering greater effectiveness and efficiency than previous methods. Given that T2I models might not return results due to defense mechanisms, we introduce a Sphere-Probing Learning (SPL) scheme to support gradient optimization even when no results are returned. Additionally, we devise a Semantic-Enhancing Learning (SEL) scheme to finetune UPAM for generating target-aligned images. Our framework also ensures attack stealthiness. Extensive experiments demonstrate UPAM's effectiveness and efficiency.
... Black-box optimization is also an appealing approach for network optimization problems. It refers to the task of optimizing an objective function f : X → R without access to any other information about f, e.g., gradients or the Hessian [172]. Telecom networks will become increasingly complex in the 6G era, and black-box optimization can avoid the complexity of building dedicated optimization models. ...
Preprint
Full-text available
Large language models (LLMs) have received considerable attention recently due to their outstanding comprehension and reasoning capabilities, leading to great progress in many fields. The advancement of LLM techniques also offers promising opportunities to automate many tasks in the telecommunication (telecom) field. After pre-training and fine-tuning, LLMs can perform diverse downstream tasks based on human instructions, paving the way to artificial general intelligence (AGI)-enabled 6G. Given the great potential of LLM technologies, this work aims to provide a comprehensive overview of LLM-enabled telecom networks. In particular, we first present LLM fundamentals, including model architecture, pre-training, fine-tuning, inference and utilization, model evaluation, and telecom deployment. Then, we introduce key LLM-enabled techniques and telecom applications in terms of generation, classification, optimization, and prediction problems. Specifically, the LLM-enabled generation applications include telecom domain knowledge, code, and network configuration generation. After that, the LLM-based classification applications involve network security, text, image, and traffic classification problems. Moreover, multiple LLM-enabled optimization techniques are introduced, such as automated reward function design for reinforcement learning and verbal reinforcement learning. Furthermore, for LLM-aided prediction problems, we discuss time-series prediction models and multi-modality prediction problems for telecom. Finally, we highlight the challenges and identify the future directions of LLM-enabled telecom networks.
Article
Full-text available
Hyperparameter optimization (HPO) has been well developed and has evolved into a well-established research topic over the decades. With the success and wide application of deep learning, HPO has garnered increased attention, particularly within the realm of machine learning model training and inference. The primary objective is to mitigate the challenges associated with manual hyperparameter tuning, which is often ad hoc, reliant on human expertise, and consequently hinders reproducibility while inflating deployment costs. Recognizing the growing significance of HPO, this paper surveys classical HPO methods, approaches for accelerating the optimization process, HPO in an online setting (dynamic algorithm configuration, DAC), and HPO when there is more than one objective to optimize (multi-objective HPO). Acceleration strategies are categorized into multi-fidelity, bandit-based, and early-stopping methods; DAC algorithms encompass gradient-based, population-based, and reinforcement learning-based methods; multi-objective HPO can be approached via scalarization, metaheuristics, and model-based algorithms tailored to the multi-objective setting. A tabulated overview of popular frameworks and tools for HPO is provided, catering to the interests of practitioners.
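As one concrete instance of the scalarization family mentioned above (a standard textbook form, not necessarily the survey's notation), m objectives f_1, ..., f_m can be collapsed into a single objective via a weighted sum:

g(x; \lambda) = \sum_{i=1}^{m} \lambda_i \, f_i(x), \qquad \lambda_i \ge 0, \quad \sum_{i=1}^{m} \lambda_i = 1.

Sweeping the weight vector \lambda traces out solutions on the convex portion of the Pareto front; reaching non-convex regions requires alternatives such as the (augmented) Chebyshev scalarization.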
Article
Full-text available
Bayesian optimization has been demonstrated as an effective methodology for the global optimization of functions with expensive evaluations. Its strategy relies on querying a distribution over functions defined by a relatively cheap surrogate model. The ability to accurately model this distribution over functions is critical to the effectiveness of Bayesian optimization, and is typically fit using Gaussian processes (GPs). However, since GPs scale cubically with the number of observations, it has been challenging to handle objectives whose optimization requires a large number of evaluations, and as such, massively parallelizing the optimization. In this work, we explore the use of neural networks as an alternative to Gaussian processes to model distributions over functions. We show that performing adaptive basis function regression with a neural network as the parametric form performs competitively with state-of-the-art GP-based approaches, but scales linearly with the number of data rather than cubically. This allows us to achieve a previously intractable degree of parallelism, which we use to rapidly search over large spaces of models. We achieve state-of-the-art results on benchmark object recognition tasks using convolutional neural networks, and image caption generation using multimodal neural language models.
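The linear-versus-cubic scaling claim can be made concrete with standard Bayesian linear regression algebra (a sketch of the general idea, not text from the paper): with d basis functions \phi(x) taken from the network's last hidden layer and n observations stacked into \Phi \in \mathbb{R}^{n \times d},

A = \beta\,\Phi^{\top}\Phi + \alpha I_d, \qquad m = \beta\, A^{-1} \Phi^{\top} y,
\mu(x_*) = \phi(x_*)^{\top} m, \qquad \sigma^{2}(x_*) = \phi(x_*)^{\top} A^{-1} \phi(x_*) + \beta^{-1}.

Forming A costs O(n d^2) and inverting it O(d^3), so the total is linear in n for fixed d, in contrast to the O(n^3) cost of exact GP inference.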
Conference Paper
Full-text available
Bayesian optimization is a powerful framework for minimizing expensive objective functions while using very few function evaluations. It has been successfully applied to a variety of problems, including hyperparameter tuning and experimental design. However, this framework has not been extended to the inequality-constrained optimization setting, particularly the setting in which evaluating feasibility is just as expensive as evaluating the objective. Here we present constrained Bayesian optimization, which places a prior distribution on both the objective and the constraint functions. We evaluate our method on simulated and real data, demonstrating that constrained Bayesian optimization can quickly find optimal and feasible points, even when small feasible regions cause standard methods to fail.
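One common acquisition function in this setting (a standard formulation consistent with the description above, not a quotation from the paper) weights expected improvement by the probability that the constraint is satisfied:

a(x) = \mathrm{EI}(x) \cdot \Pr\!\left[c(x) \le 0\right],

where EI is computed with respect to the best feasible observation so far and the constraint function c is modeled by its own Gaussian process.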
Article
Performance of machine learning algorithms depends critically on identifying a good set of hyperparameters. While recent approaches use Bayesian optimization to adaptively select configurations, we focus on speeding up random search through adaptive resource allocation and early stopping. We formulate hyperparameter optimization as a pure-exploration non-stochastic infinite-armed bandit problem where a predefined resource, such as iterations, data samples, or features, is allocated to randomly sampled configurations. We introduce a novel algorithm, Hyperband, for this framework and analyze its theoretical properties, providing several desirable guarantees. Furthermore, we compare Hyperband with popular Bayesian optimization methods on a suite of hyperparameter optimization problems. We observe that Hyperband can provide over an order-of-magnitude speedup over our competitor set on a variety of deep-learning and kernel-based learning problems.
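Hyperband's inner loop is successive halving; a minimal sketch follows. The full Hyperband algorithm additionally sweeps over several (n, budget) brackets, and the evaluate function here is a toy stand-in for training a configuration with a given resource budget.

import math, random

def successive_halving(sample_config, evaluate, n=27, min_budget=1, eta=3):
    # Start with n random configs on a small budget, keep the best 1/eta
    # fraction each round, and multiply the budget by eta for the survivors.
    configs = [sample_config() for _ in range(n)]
    budget = min_budget
    while len(configs) > 1:
        scored = sorted(configs, key=lambda c: evaluate(c, budget))
        configs = scored[: max(1, len(configs) // eta)]
        budget *= eta
    return configs[0]

# Toy usage: the "validation loss" improves with budget and is minimized near lr = 1e-2.
best = successive_halving(
    sample_config=lambda: {"lr": 10 ** random.uniform(-5, 0)},
    evaluate=lambda cfg, b: abs(math.log10(cfg["lr"]) + 2) + 1.0 / b,
)
print(best)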
Article
Many applications require optimizing an unknown, noisy function that is expensive to evaluate. We formalize this task as a multi-armed bandit problem, where the payoff function is either sampled from a Gaussian process (GP) or has low RKHS norm. We resolve the important open problem of deriving regret bounds for this setting, which imply novel convergence rates for GP optimization. We analyze GP-UCB, an intuitive upper-confidence-based algorithm, and bound its cumulative regret in terms of maximal information gain, establishing a novel connection between GP optimization and experimental design. Moreover, by bounding the latter in terms of operator spectra, we obtain explicit sublinear regret bounds for many commonly used covariance functions. In some important cases, our bounds have surprisingly weak dependence on the dimensionality. In our experiments on real sensor data, GP-UCB compares favorably with other heuristic GP optimization approaches.
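In its standard form, the GP-UCB rule described above selects at round t

x_t = \arg\max_{x \in \mathcal{X}} \left[ \mu_{t-1}(x) + \sqrt{\beta_t}\,\sigma_{t-1}(x) \right],

where \mu_{t-1} and \sigma_{t-1} are the GP posterior mean and standard deviation after t-1 observations, and \beta_t is the exploration weight whose choice (growing slowly with t) drives the paper's regret bounds.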
Article
Big Data applications are typically associated with systems involving large numbers of users, massive complex software systems, and large-scale heterogeneous computing and storage architectures. The construction of such systems involves many distributed design choices. The end products (e.g., recommendation systems, medical analysis tools, real-time game engines, speech recognizers) thus involve many tunable configuration parameters. These parameters are often specified and hard-coded into the software by various developers or teams. If optimized jointly, these parameters can result in significant improvements. Bayesian optimization is a powerful tool for the joint optimization of design choices that has gained great popularity in recent years. It promises greater automation so as to increase both product quality and human productivity. This review paper introduces Bayesian optimization, highlights some of its methodological aspects, and showcases a wide range of applications.
Article
Hyperparameter learning has traditionally been a manual task because of the limited number of trials. Today's computing infrastructures allow bigger evaluation budgets, thus opening the way for algorithmic approaches. Recently, surrogate-based optimization was successfully applied to hyperparameter learning for deep belief networks and to WEKA classifiers. The methods combined brute force computational power with model building about the behavior of the error function in the hyperparameter space, and they could significantly improve on manual hyperparameter tuning. What may make experienced practitioners even better at hyperparameter optimization is their ability to generalize across similar learning problems. In this paper, we propose a generic method to incorporate knowledge from previous experiments when simultaneously tuning a learning algorithm on new problems at hand. To this end, we combine surrogate-based ranking and optimization techniques for surrogate-based collaborative tuning (SCoT). We demonstrate SCoT in two experiments where it outperforms standard tuning techniques and single-problem surrogate-based optimization.