Convergence and Finite-Time Behavior of Simulated Annealing
Debasis Mitra, Fabio Romeo and Alberto Sangiovanni-Vincentelli
Advances in Applied Probability, Vol. 18, No. 3 (Sep. 1986), pp. 747-771
Published by: Applied Probability Trust
Stable URL: http://www.jstor.org/stable/1427186
Adv. Appl. Prob. 18, 747-771 (1986)
Printed in N. Ireland
© Applied Probability Trust 1986

CONVERGENCE AND FINITE-TIME BEHAVIOR OF SIMULATED ANNEALING

DEBASIS MITRA,* AT&T Bell Laboratories
FABIO ROMEO,** University of California, Berkeley
ALBERTO SANGIOVANNI-VINCENTELLI,** University of California, Berkeley
Abstract

Simulated annealing is a randomized algorithm which has been proposed for finding globally optimum least-cost configurations in large NP-complete problems with cost functions which may have many local minima. A theoretical analysis of simulated annealing based on its precise model, a time-inhomogeneous Markov chain, is presented. An annealing schedule is given for which the Markov chain is strongly ergodic and the algorithm converges to a global optimum. The finite-time behavior of simulated annealing is also analyzed and a bound obtained on the departure of the probability distribution of the state at finite time from the optimum. This bound gives an estimate of the rate of convergence and insights into the conditions on the annealing schedule which give optimum performance.

GLOBAL OPTIMIZATION; RANDOMIZED ALGORITHMS; TIME-INHOMOGENEOUS MARKOV CHAINS
1. Introduction

Many combinatorial optimization problems belong to a class of problems which are difficult to solve, i.e., the class of NP-complete problems [3]. For these problems, there is no known algorithm whose worst-case complexity is bounded by a polynomial in the size of the input. Heuristic algorithms are used to solve NP-complete problems approximately, i.e. to find 'good' solutions which are 'close' to the optimum. These algorithms explore a discrete space of admissible configurations, S, in a deterministic fashion. Often the search terminates at a local minimum due to the fact that heuristic algorithms are 'greedy'. To avoid this behavior, a class of randomized algorithms (e.g. [17]) have been devised which generate the next configuration randomly, and which can 'climb hills', i.e., moves that generate configurations of higher cost than the present one are accepted.
Received 24 April 1985; revision received 26 July 1985.
* Postal address: AT&T Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07974, USA.
** Postal address: Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720, USA.
Simulated annealing as proposed by Kirkpatrick et al. [11] allows 'hill-climbing' moves but these moves are accepted according to a certain criterion which takes the state of the search process into consideration in a manner unlike other randomized algorithms. The controlling mechanism is based on the observation that combinatorial optimization problems with a large configuration space exhibit properties similar to physical processes with many degrees of freedom.

In particular, bringing a fluid into a low-energy state, such as growing a crystal, has been considered in [11] to be similar to the process of finding an optimum solution of a combinatorial optimization problem. Annealing is a well-known process for growing crystals. It consists of melting the fluid and then lowering the temperature slowly until the crystal is formed. The rate of decrease of temperature has to be very low around the freezing temperature. The Metropolis Monte Carlo method [1], [14] can be used to simulate the annealing process. It has been proposed as an effective method for finding global minima of combinatorial optimization problems.
In applications to combinatorial optimization, this method starts from an arbitrary configuration and, given that the simulation is at configuration i at time m, m = 0, 1, 2, ···, a new configuration j is randomly generated from an admissible set N(i) and a check is made to determine whether the cost of the new configuration satisfies an acceptance criterion based on the temperature, a controlling parameter, at time m, T_m. If the cost decreases, the simulation accepts the move. Otherwise, a random number uniformly distributed over [0, 1] is picked and compared with exp(-{c(j) - c(i)}/T_m), where c(.) is the cost function on configurations. If the random number is smaller, the simulation accepts the move, otherwise it discards the move. In any case, time is incremented. Note that the higher the temperature, the more likely it is that a 'hill-climbing' move is accepted. The initial temperature, the number of moves generated at each temperature and the rate of decrease of temperature are all important parameters that affect the speed of the algorithm and the quality of the final configuration.
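The acceptance rule just described can be made concrete in a few lines. The following Python sketch performs a single move at temperature T; the names cost and neighbors, and the uniform choice over N(i), are illustrative assumptions rather than anything prescribed by the paper.

import math
import random

def metropolis_step(i, cost, neighbors, T, rng=random):
    # Propose a random neighbor j of the current configuration i.
    j = rng.choice(neighbors(i))
    delta = cost(j) - cost(i)
    # Accept a cost decrease outright; accept an increase with
    # probability exp(-delta / T), so hill-climbing moves become
    # rarer as the temperature T is lowered.
    if delta <= 0 or rng.random() < math.exp(-delta / T):
        return j
    return i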
Experimental results [9], [11], [16], [19] show that simulated annealing produces very good results when compared to other techniques for the solution of combinatorial optimization problems such as those arising from the layout of integrated circuits, at the expense, however, of large computation time (a 1500 standard cell placement problem can take as much as 24 hours of a VAX 11/780 [16]). This has emphasized the need for a better theoretical understanding of simulated annealing.

Early analyses using time-homogeneous Markov chains [7], [8], [18] were based on certain (unrealistic) assumptions on the number of iterations taken at each temperature. It was shown [12], [15] that simulated annealing, and even generalizations of it called 'probabilistic hill-climbing algorithms' [15], give asymptotically the optimum solution with probability 1.
The analysis in this paper is based on time-inhomogeneous Markov chains. We prove that for an arbitrary but bounded cost function, for annealing schedules of the form

(1.1)  T_m = γ / log(m + m_0 + 1),  m = 0, 1, 2, ···,

where m_0 is any parameter satisfying 1 ≤ m_0 < ∞, the Markov chain is strongly ergodic if γ ≥ rL, where r is the radius of the graph underlying the chain and L is a Lipschitz-like constant of the cost function. Strong ergodicity implies that, for any starting probability vector, the state probability vector converges component-wise to a constant vector e*. Furthermore, we show that e* is the optimum vector, i.e., the vector in which all elements are zero except those with the indices of the global least-cost configurations. Our other main result is on finite-time behavior and rate of convergence. We give a bound on the departure of the state vector from the optimum vector after a finite number of iterations. This bound indicates how the annealing schedule must be balanced between contrary requirements for optimum performance. A simple corollary to this result states that for a large number of iterations k, the L_1-norm of the difference of the state vector from the optimum vector is O(1/k^min(a,b)), where a and b respectively increase and decrease with increasing γ.
We also obtain a set of results on distributions which we call quasi-stationary. These constructs are the equilibrium distributions of time-homogeneous Markov chains obtained from simulated annealing by holding the temperature fixed at various values. The dependence of the quasi-stationary distributions on temperature is shown to have a number of desirable properties. These properties are essential for our analysis of the time-inhomogeneous Markov chains obtained from annealing schedules given in (1.1). In addition, they are of independent interest since they hold for annealing schedules considerably more general than (1.1). This may be important in the future if, as we expect, it becomes possible to design schedules matched to special properties of the cost function.

In an important work Geman and Geman [4] have proved in the context of Markov fields, used to model image-processing models, that simulated annealing converges to the least-cost configurations for a particular annealing schedule.† Our results are stronger in the following respects: (i) there is no result in [4] on finite-time behavior and rate of convergence, results which are most useful to obtain a practical annealing schedule; (ii) our conditions on the annealing schedule are substantially weaker; (iii) the proof of convergence is simpler since it makes use of powerful, known results in the theory of time-inhomogeneous Markov chains; (iv) the graph underlying the Markov chain is arbitrary and well-matched to combinatorial optimization, and there is no suggestion of the need of structural constraints such as those that exist in image processing.

† The recent works of Gidas [5] and Hajek [6] are also noteworthy.
The paper is organized as follows. In Section 2, the structure of simulated annealing and the Markov-chain model are briefly recalled. In Section 3, the quasi-stationary probabilities of the Markov chain are introduced and their properties established. In Section 4, the basic results of the theory of time-inhomogeneous Markov chains useful to us are recalled. In Section 5, the annealing schedule that guarantees convergence of simulated annealing to the optimum vector is presented and the basic convergence theorem proven. In Section 6, the finite-time behavior of the Markov chain and the rate of convergence of simulated annealing are investigated. In Section 7, some concluding remarks and future research directions are offered.
2. Preliminaries

In this section, we describe the basic structure of the simulated annealing algorithm and we introduce a Markov chain model for it.

Simulated annealing algorithm structure (j_0, T_0)
{
    /* Given an initial state j_0 and an initial value for the parameter T, T_0. */
    X = j_0;
    m = 0;
    while ('stopping criterion' is not satisfied) {
        while ('inner loop criterion' is not satisfied) {
            j = generate(X);
            if (accept(c(j), c(X), T_m))
                X = j;
        }
        T_{m+1} = update(T_m);
        m = m + 1;
    }
}
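As a companion to the structure above, here is a self-contained Python sketch of the whole loop, specialized to one move per temperature and to the logarithmic schedule T_m = γ/log(m + m_0 + 1) analyzed in Section 5. All function and parameter names (cost, neighbors, num_steps) are illustrative assumptions, and the stopping criterion is simply a fixed number of iterations.

import math
import random

def simulated_annealing(j0, cost, neighbors, gamma, m0=1, num_steps=10000,
                        rng=random):
    X = j0                 # current configuration
    best = j0              # best configuration seen so far
    for m in range(num_steps):
        T = gamma / math.log(m + m0 + 1)      # update function, cf. (5.10)
        j = rng.choice(neighbors(X))          # generate
        delta = cost(j) - cost(X)
        if delta <= 0 or rng.random() < math.exp(-delta / T):   # accept
            X = j
        if cost(X) < cost(best):
            best = X
    return best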
The 'inner loop criterion' determines how many steps are taken by the algorithm at a given temperature. In the analysis here we have emphasized the case in which only one step is taken at each temperature. However, as we observe later, it is easy to extend the results to the case of more than one step at each temperature.

In the algorithm structure three functions play a fundamental role: accept, generate and update. While several accept functions can be used [15], in this paper we restrict our attention to the one proposed in [11].
accept(c(j), c(i), T)
{
    /* Returns 1 if the cost variation passes a test. T is the control parameter. */
    Δc = c(j) - c(i);
    y = min[1, exp(-Δc/T)];
    r = random(0, 1);
    /* random is a function which returns a pseudo-random number uniformly
       distributed on the interval [0, 1]. */
    if (r ≤ y)
        return(1);
    else
        return(0);
}
The generate function selects a new configuration. In simulated annealing, a new configuration is generated randomly from a set of possible configurations. To specify this function completely, a set of configurations accessible from a given configuration and the probability of generating one of these has to be given.

The update function, also called the annealing schedule or cooling schedule, produces a new value for the temperature. This function is most important to determine the convergence properties of the algorithm. We focus on update functions which return monotonically decreasing values of T, i.e. ∀m ≥ 0, T_{m+1} < T_m, and lim_{m→∞} T_m = 0. The function is completely specified when the explicit dependency of T on m is given. This paper is devoted to the study of update functions that guarantee convergence of the algorithm to the optimum vector.
It is easy to see that simulated annealing can be represented by a Markov chain, whose connectivity is fully specified by the generate function and whose transition probabilities are determined by the accept and by the generate functions.

The underlying directed graph, which we denote by G, is determined as follows. There is a bijective correspondence between the elements of S, the set of all the possible configurations of the optimization problem, and the nodes of the graph. Given two different elements, say i and j, of S, there is an arc from i to j if j can be generated starting from i. The two nodes are said to be neighbors.
We define N(i) to be the set of all the neighbors of i. We assume that i ∉ N(i). In several applications of the simulated annealing algorithm, the probability of generating a particular neighboring configuration starting from i is simply given by 1/|N(i)|, where |N(i)| is the cardinality of N(i). However, in certain applications such as placement of integrated circuits [16], it is important to generate certain neighbors with higher probability. For this reason, we assume that the probability of generating j from i is given by

(2.1)  g(i, j)/g(i),

where g(i, j) gives the 'weights' for each of the neighbors of i and g(i) is a normalizing function which ensures that

(1/g(i)) Σ_{j∈N(i)} g(i, j) = 1.

The directed graph G is assumed to be connected.
The one-step transition probabilities of the Markov chain are represented as weights on the edges of the directed graph G defined above and are determined by the product of the probability of generating a given configuration and the probability of accepting it. We define first a one-parameter family of transition probabilities:

(2.2)  P_ij(T) = 0                                                    if j ∉ N(i) and j ≠ i,
       P_ij(T) = [g(i, j)/g(i)] min[1, exp(-{c(j) - c(i)}/T)]         if j ∈ N(i),

and

(2.3)  P_ii(T) = 1 - Σ_{j∈N(i)} P_ij(T).
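For a small instance the one-parameter family (2.2)-(2.3) can be assembled explicitly, which is convenient for checking the results of Sections 3-6 numerically. In the Python sketch below, costs[i] plays the role of c(i) and g[i][j] the role of the generation weight g(i, j), with g[i][j] = 0 whenever j ∉ N(i); these data structures are illustrative assumptions.

import math
import numpy as np

def transition_matrix(costs, g, T):
    s = len(costs)
    P = np.zeros((s, s))
    for i in range(s):
        gi = sum(g[i])                        # normalizer g(i)
        for j in range(s):
            if j != i and g[i][j] > 0:        # j is a neighbor of i
                P[i, j] = (g[i][j] / gi) * min(1.0, math.exp(-(costs[j] - costs[i]) / T))
        P[i, i] = 1.0 - P[i].sum()            # diagonal element, cf. (2.3)
    return P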
The transition probabilities of the time-inhomogeneous Markov chain in which m denotes discrete time are obtained from the above and the annealing schedule which specifies T = T_m, m = 0, 1, 2, ···.
3. Quasi-stationary probability distributions and their properties

We have shown that a mathematical model for simulated annealing is a time-inhomogeneous Markov chain. However, if the temperature is frozen at a particular value T, then we obtain a time-homogeneous Markov chain. To prove the convergence of simulated annealing, it is important to study this Markov chain. In particular, we show here that this chain has a stationary probability distribution, which we call the quasi-stationary probability distribution of the time-inhomogeneous Markov chain.† In addition, we show that the stationary probability distributions have a limit when T goes to 0, i.e. when m goes to ∞, and that this limit is the optimum vector e*.

† The reader is warned that the term 'quasi-stationary distribution' is used for another concept in [8].
3.1. The quasi-stationary probabilities. For i ∈ S define

(3.1)  π_i(T) ≜ g(i) exp(-c(i)/T) / G(T),

where G(T) is a scaling factor such that ||π(T)|| = 1, with

||x|| ≜ Σ_{i=1}^{s} |x_i|

and s = |S|. The role of G(T) is similar to that of the partition function in statistical mechanics and stochastic networks [10].
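The quasi-stationary distribution of (3.1) is straightforward to evaluate for a small instance, and doing so gives a direct numerical check of Proposition 3.1 below. In this sketch g_node[i] stands for g(i) (the row sums of the generation weights); the names are assumptions made for illustration.

import math
import numpy as np

def quasi_stationary(costs, g_node, T):
    # Unnormalized weights g(i) exp(-c(i)/T); dividing by their sum
    # plays the role of the scaling factor G(T).
    w = np.array([g_node[i] * math.exp(-costs[i] / T) for i in range(len(costs))])
    return w / w.sum()

Together with the transition_matrix sketch of Section 2, one can check numerically that this vector is left-invariant under P(T) when the weights g(i, j) are symmetric, as Proposition 3.1 asserts.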
We now show that π(T) is the stationary probability distribution for the time-homogeneous Markov chain. For this to be true we need to assume that the function g(i, j) is symmetric, i.e.,

(3.2)  g(i, j) = g(j, i)  ∀i, j ∈ S.

This is a mild restriction which is easy to satisfy in implementations of simulated annealing. In particular, symmetry exists in the case where all neighbours of each configuration are given equal weights.

Proposition 3.1. If (3.2) holds, then {π(T)}, defined by (3.1), satisfies

(3.3)  π(T_m)P(T_m) = π(T_m),  m = 0, 1, ···,

where P(T) is the one-step transition probability matrix of the Markov chain defined in (2.2) and (2.3).
Proof. By (3.1), (2.2), and (3.2), we have that for every i and j neighbors in S, and also for j = i,

π_i(T)/π_j(T) = [g(i)/g(j)] exp({c(j) - c(i)}/T) = P_ji(T)/P_ij(T).

Note that this is true regardless of the sign of {c(j) - c(i)}. Hence, detailed balance holds:

(3.4)  π_i(T)P_ij(T) = π_j(T)P_ji(T).

Equation (3.4) obviously holds also for those i and j that are not neighbors in S under the given topology since then each side is 0. By adding, with respect to i, both sides of (3.4) and recalling (2.3), (3.3) is obtained.

It is of some interest to note that detailed balance, see (3.4), is equivalent to the time-reversibility [10] of the time-homogeneous Markov chain.
3.2. Asymptotic quasi-stationary probabilities. The results in this section and Sections 3.3-3.4 hold for any update function in which

(3.5a)  T_m > T_{m+1},  ∀m ≥ 0,
(3.5b)  lim_{m→∞} T_m = 0.

It should be emphasized that here and in Sections 3.3-3.4 we are investigating the dependence of π(T_m) on T_m where {T_m} behaves as in (3.5), and that π(T_m) is a construct and not the distribution obtained from simulated annealing. It is possible to show the following result.
Proposition 3.2. If the update function satisfies (3.5b), then the quasi-stationary probability vector π(T_m) defined in (3.1) converges, as m → ∞, to the optimal vector e*:

(3.6)  e*_i = g(i)/g(*) for i ∈ S*, and e*_i = 0 for i ∉ S*,

where S* is the set of indices of global least-cost configurations, i.e.

S* ≜ {i ∈ S | c(i) ≤ c(j) ∀j ∈ S},

and

g(*) ≜ Σ_{j∈S*} g(j).
All use subject to JSTOR Terms and Conditions
Convergence
and
finite-time
behavior
of simulated
annealing 755
The proof of this proposition is straightforward and hence omitted here. A proof for a more general class of algorithms, probabilistic hill-climbing algorithms, can be found in [15].

Note that this result can be interpreted as the convergence of the algorithm to the optimum vector provided that an infinite number of iterations are taken at each value of m, so that the equilibrium distribution is reached.
3.3. Monotonicity of the quasi-stationary probabilities. The convergence of the quasi-stationary distributions to e* displays remarkable monotonicity properties. This property is insightful and also an essential element of the analysis of the asymptotic and finite-time behavior of simulated annealing.

We will need to identify the 'weighted mean cost', to be denoted by C̄ and defined thus:

(3.7)  C̄ ≜ Σ_{j∈S} g(j)c(j) / Σ_{j∈S} g(j).

Proposition 3.3.
(i) For each i ∈ S*,
    π_i(T_{m+1}) - π_i(T_m) > 0  ∀m ≥ 0.
(ii) For each i ∉ S*, there exists a unique integer m̂_i, 0 ≤ m̂_i < ∞, such that
    π_i(T_{m+1}) - π_i(T_m) > 0 for 0 ≤ m ≤ m̂_i - 1,
    π_i(T_{m+1}) - π_i(T_m) < 0 for m ≥ m̂_i.
Proof. Consider π as a continuous function of the parameter T and differentiate π_i(T) in (3.1) with respect to T:

(3.8)  [T² G(T)² exp(2c(i)/T)/g(i)] (d/dT) π_i(T)
         = -Σ_{j} g(j){c(j) - c(i)} exp(-{c(j) - c(i)}/T)
         = [Σ_{j: c(j)<c(i)} g(j){c(i) - c(j)} exp({c(i) - c(j)}/T)
            - Σ_{j: c(j)>c(i)} g(j){c(j) - c(i)} exp(-{c(j) - c(i)}/T)].

The sign properties of (d/dT)π_i(T) will be deduced from the relative magnitudes of the two bracketed terms on the right-hand side.
If configuration i is least-cost, i.e. i ∈ S*, then the first term is null and (d/dT)π_i(T) < 0 for T > 0. Statement (i) then follows from (3.5).

The terms, if not null, are respectively monotonically decreasing and monotonically increasing with increasing T, and the value of the right-hand side of (3.8) evaluated at T = 0 is positive. Hence the right-hand side either has a finite-valued zero or not, depending on the sign of its value evaluated at T = ∞. We conclude that if c(i) is not least-cost and c(i) < C̄ then a unique zero exists at T_i, say, where 0 < T_i < ∞, and also that

(3.9)  (d/dT) π_i(T) > 0 for 0 < T < T_i,
       (d/dT) π_i(T) = 0 for T = T_i,
       (d/dT) π_i(T) < 0 for T_i < T < ∞.

Thus for c(i) < C̄, the weighted mean cost, we may use (3.5) to identify m̂_i in statement (ii) with the smallest integer such that T_{m̂_i} ≤ T_i.

If on the other hand c(i) ≥ C̄, then

(d/dT) π_i(T) > 0,  0 < T < ∞,

and m̂_i = 0 in statement (ii).
An immediate corollary to Proposition 3.3 is the existence of m̂, m̂ < ∞, such that for all i ∉ S*,

(3.10)  π_i(T_{m+1}) - π_i(T_m) < 0,  ∀m ≥ m̂.

In fact

(3.11)  m̂ = max_{i∉S*} m̂_i.
3.4. Uniform monotonicity of the quasi-stationary probabilities. The analysis in Section 6 on finite-time behavior requires knowledge of m̂, which marks the onset of monotonic decrease of the quasi-stationary probabilities of all but the least-cost configurations. We show here how it may be identified. This is done by considering T_i, for i ∉ S* and c(i) < C̄, as functions of the cost associated to each state {c(j)}.

Proposition 3.4. For all i such that i ∉ S* and c(i) < C̄, the critical temperatures T_i are monotonic, strictly increasing with increasing c(i).

Proof. A little algebra shows that for any pair (i_1, i_2), where i_1 ∈ S and i_2 ∈ S and c(i_1) - c(i_2) = ε > 0,

(3.12)  [1/g(i_1)] (d/dT) π_{i_1}(T) = exp(-ε/T) (ε/T²) [π_{i_2}(T)/g(i_2)] + exp(-ε/T) [1/g(i_2)] (d/dT) π_{i_2}(T).

For the case of interest here c(i_2) < c(i_1) < C̄, so that from the definition of T_{i_2}, see (3.9), (d/dT)π_{i_2}(T_{i_2}) = 0. Now if (3.12) is evaluated at T = T_{i_2}, then the second term on the right-hand side is 0, while the first term is positive. Again noting (3.9) it follows that T_{i_2} < T_{i_1}.
Note that configurations with common cost have common values of T_i and m̂_i.

To calculate m̂ it is helpful to identify the least-cost and the next-to-least-cost of all the configurations. Let

(3.13)  c(*) ≜ min_{j∈S} c(j)

and

(3.14)  δ ≜ {min_{j∉S*} c(j)} - c(*),

so that c(*) and {δ + c(*)} are respectively the least- and the next-to-least cost. Note that δ is an important global characteristic of the cost function.

The monotonicity property in Proposition 3.4 allows (3.11) to be sharpened: m̂ = m̂_î, where î is any configuration with next-to-least cost.

Setting (3.8) equal to 0 for i = î, let T̂ be the unique positive solution of the equation

(3.15)  δ g(*) - Σ_{j: c(j) > δ + c(*)} g(j){c(j) - c(*) - δ} exp(-{c(j) - c(*)}/T) = 0,

where g(*) is given in Proposition 3.2. Then m̂ is the smallest integer such that T_{m̂} ≤ T̂.
We conclude this section by a summary. The quasi-stationary probability distribution converges with decreasing temperature (i.e. increasing time) to the optimum vector. The quasi-stationary probabilities of least-cost configurations monotonically increase with decreasing temperature. For configurations with costs not less than the weighted mean cost, the opposite is true. Each configuration i with cost between least-cost and weighted mean cost has an associated 'critical temperature' T_i; while the temperature is greater than T_i, the configuration's quasi-stationary probability increases with decreasing temperature, and for temperatures less than T_i the opposite is true. Furthermore, the critical temperature is an increasing function of cost. All of the above properties hold for any update function satisfying (3.5).
4. Time-inhomogeneous Markov chains

In this section a number of well-known properties of time-inhomogeneous Markov chains are presented. These results will be used in Section 5 to prove the convergence properties of the simulated annealing algorithm and to determine the influence of the annealing schedule on the rate of convergence to the optimal solution of the combinatorial optimization problem.
All theorems and propositions are given without proof. The interested reader can find these proofs in [7], [8] and [18].

4.1. Notation. For the sake of notational simplicity, from now on all vectors, matrices and functions depending on T_m will be denoted as depending on m. Let P(m, m) be the identity matrix, and

P(m, n + m) ≜ Π_{i=0}^{n-1} P(m + i),  m ≥ 0, n ≥ 1,

be the n-step transition matrix. Furthermore let

v(m) ≜ [v_1(m), v_2(m), ···, v_s(m)]

denote the state probability vector after m transitions of the Markov chain, so that v(m + n) = v(m)P(m, m + n). We also let v(m, n) = v(0)P(m, n).
4.2. Basic results from the theory of time-inhomogeneous Markov chains. We need the following definition.

Definition 4.1. A time-inhomogeneous Markov chain is weakly ergodic if, for all m,

(4.1)  lim_{n→∞} sup_{v_1(0), v_2(0)} ||v_1(m, n) - v_2(m, n)|| = 0,

where v_1(0) and v_2(0) are two arbitrary initial state probability vectors and

v_1(m, n) = v_1(0)P(m, n),
v_2(m, n) = v_2(0)P(m, n).

Note that weak ergodicity does not imply the existence of limits of vectors v_1(m, n) and v_2(m, n) but only a tendency towards equality of the rows of P(m, n). Thus weak ergodicity implies only a 'loss of memory' of the initial conditions, but not convergence.
The investigation of conditions under which weak ergodicity holds is aided by the introduction of the following coefficient of ergodicity.†

Definition 4.2. Given a stochastic matrix P, its coefficient of ergodicity τ_1 is

(4.2)  τ_1(P) = (1/2) max_{i,j} Σ_{k=1}^{s} |P_ik - P_jk| = 1 - min_{i,j} Σ_{k=1}^{s} min(P_ik, P_jk).

† We are following Seneta [18]; Isaacson and Madsen [7], following Dobrushin [2], call (1 - τ_1) the ergodic coefficient.
With the above definition of the coefficient of ergodicity the following result can be proved [7], [8], [18].

Theorem 4.1. The time-inhomogeneous Markov chain is weakly ergodic if and only if there is a strictly increasing sequence of positive integers {k_i}, i = 0, 1, ···, such that

(4.3)  Σ_{i=0}^{∞} [1 - τ_1(P(k_i, k_{i+1}))] = ∞.
Strong ergodicity is defined as follows.

Definition 4.3. The time-inhomogeneous Markov chain is strongly ergodic if there exists a vector q, ||q|| = 1 and q_i ≥ 0, i ∈ S, such that for all m

(4.4)  lim_{n→∞} sup_{v(0)} ||v(m, n) - q|| = 0.

Strong ergodicity is obtained only with convergence in addition to loss of memory. Note that since the Markov chain is finite, the convergence in norm used to define weak and strong ergodicity is equivalent to coordinate-wise convergence.
We shall need the following result due to Madsen and Isaacson [13], [7].

Theorem 4.2. If for every m there exists a π(m) such that π(m) = π(m)P(m), ||π(m)|| = 1 and

Σ_{m=0}^{∞} ||π(m) - π(m + 1)|| < ∞,

and the time-inhomogeneous Markov chain is weakly ergodic, then it is also strongly ergodic. Moreover, if

e* = lim_{m→∞} π(m),

then for all m,

lim_{n→∞} sup_{v(0)} ||v(m, n) - e*|| = 0.
5. Strong ergodicity of simulated annealing

To establish weak ergodicity we use Theorem 4.1. In particular, we first determine a bound on the coefficient of ergodicity and then we determine the update function such that (4.3) is satisfied. Next we show that weak ergodicity together with the existence of π(T_m), as defined in (3.1), are sufficient conditions to ensure strong ergodicity.
5.1. Radius of G and Lipschitz constant. We need a few definitions related to the structure of the graph underlying the Markov chain and to the slope of the cost function.

Let S_M be the set of all the points that are local maxima for the cost function, i.e.,

S_M ≜ {i ∈ S | c(j) ≤ c(i) ∀j ∈ N(i)}.

Let

(5.1)  r ≜ min_{i∈(S−S_M)} max_{j∈S} d(i, j)

be the radius of the graph, where d(i, j) is the distance of j from i measured by the length (number of edges) of the minimum length path from i to j in G. Let l, the index of a node where the minimum in (5.1) is attained, be the center of the graph.

We will show that at any time the radius r represents an upper bound on the number of transitions of the Markov chain that are required for the probability transition matrix to have all the elements in at least one column, namely the one indexed by l, different from 0. Note that the radius is well defined since we assumed G is connected and, because of the symmetry of g(i, j), it is also strongly connected.

A Lipschitz-like constant bounding the local slope of the cost function is given by

(5.2)  L = max_{i∈S} max_{j∈N(i)} |c(j) - c(i)|.

Finally we define a lower bound on the generation function:

(5.3)  w ≜ min_{i∈S} min_{j∈N(i)} g(i, j)/g(i).

An important assumption is that w > 0.
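For a small instance, r, L and w can be computed directly from the cost function and the generation weights, which makes the condition of Theorem 5.1 and the constants of Section 6 explicit. The Python sketch below assumes, as in the earlier sketches, that g[i][j] > 0 marks j as a neighbor of i; the breadth-first search computes graph distances.

from collections import deque

def graph_parameters(costs, g):
    s = len(costs)
    nbrs = [[j for j in range(s) if j != i and g[i][j] > 0] for i in range(s)]

    def eccentricity(i):
        # Largest distance (in edges) from node i to any other node.
        dist = {i: 0}
        queue = deque([i])
        while queue:
            u = queue.popleft()
            for v in nbrs[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return max(dist.values())

    local_max = {i for i in range(s)
                 if all(costs[j] <= costs[i] for j in nbrs[i])}           # the set S_M
    r = min(eccentricity(i) for i in range(s) if i not in local_max)      # radius, cf. (5.1)
    L = max(abs(costs[j] - costs[i]) for i in range(s) for j in nbrs[i])  # cf. (5.2)
    w = min(g[i][j] / sum(g[i]) for i in range(s) for j in nbrs[i])       # cf. (5.3)
    return r, L, w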
5.2. Coefficient of ergodicity. If i and j are neighbors in G, i.e. j ∈ N(i), then from (2.2), (5.2) and (5.3),

(5.4)  P_ij(m) ≥ w exp(-L/T_m),  m = 0, 1, ···.

Now the diagonal elements P_ii(m), i ∈ (S - S_M), may be quite small initially, but these terms are monotonic, increasing with increasing m. This is because the probabilities of transition from node i to neighboring nodes with lower cost are constant with respect to m, while the probabilities of transition to neighboring nodes with higher cost are monotonically decreasing with increasing m. Hence there exists some k_0, k_0 < ∞, such that for all i ∈ S - S_M

(5.5)  P_ii(m) ≥ w exp(-L/T_m),  m ≥ (k_0 - 1)r,

since the left-hand side monotonically increases and the right-hand side monotonically decreases with increasing m.

We can use (5.1) and (5.5) to bound P_il(m - r, m) for every i ∈ S and m ≥ k_0 r:

(5.6)  P_il(m - r, m) ≥ Π_{n=m-r}^{m-1} {w exp(-L/T_n)} ≥ w^r exp(-rL/T_{m-1}).

Hence the coefficient of ergodicity τ_1 defined in (4.2) satisfies

(5.7)  τ_1(P(kr - r, kr)) ≤ 1 - min_{i,j} min(P_il(kr - r, kr), P_jl(kr - r, kr))
(5.8)                     ≤ 1 - w^r exp(-rL/T_{kr-1}),  k ≥ k_0.

From now on, for convenience we shall abbreviate τ_1(P(n, m)) to τ_1(n, m).
5.3. Weak ergodicity. By Theorem 4.1 and (5.8), we have that the Markov chain associated with simulated annealing is weakly ergodic if

(5.9)  Σ_{k=k_0}^{∞} w^r exp(-rL/T_{kr-1}) = ∞.

Note that up to now, we have only assumed that the sequence of parameters {T_m} is monotonically decreasing and lim_{m→∞} T_m = 0; in particular, the dependency of T_m on m has not been specified. We give now an update function which ensures that the Markov chain is weakly ergodic.
Theorem 5.1. The Markov chain associated with simulated annealing with the following update function:

(5.10)  T_m = γ / log(m + m_0 + 1),  m = 0, 1, 2, ···,

where m_0 is any parameter satisfying 1 ≤ m_0 < ∞, is weakly ergodic if

(5.11)  γ ≥ rL.

Proof. Replacing T_m in (5.8) with the formula given in (5.10) we obtain

(5.12a)  τ_1(kr - r, kr) ≤ 1 - a/(k + m_0/r)^θ,  k ≥ k_0,

where

(5.12b)  θ ≜ rL/γ

and

(5.12c)  a ≜ w^r / r^{rL/γ}.

It is obvious that, for any l ≥ 1,

Σ_{k=l}^{∞} {1 - τ_1(kr - r, kr)} = ∞

if θ ≤ 1. Using Theorem 4.1, the proposition is proved.
It is clear that weak ergodicity is preserved even if the annealing schedule in (5.10) is modified to keep the temperature unchanged at various (finitely many) time steps.
5.4. Strong ergodicity. In Section 3 we have shown that there exists for every m, m ≥ 0, a vector π(m) of quasi-stationary probabilities that has unit norm, satisfies (3.3) and, as shown in Proposition 3.2, converges to the optimum vector e* defined in (3.6).

Hence, to prove the strong ergodicity of the Markov chain associated with simulated annealing using Theorem 4.2, we only have to prove the following proposition. Interestingly, the proposition holds more generally than for the update function in (5.10).
Proposition 5.1. For update functions satisfying (3.5) the corresponding quasi-stationary probabilities are such that

(5.13)  Σ_{m=0}^{∞} ||π(m + 1) - π(m)|| ≤ 2(m̂ + 1) < ∞,

where m̂ is given in (3.10) and (3.11).

Proof. From statement (i) of Proposition 3.3, and (3.10), for m ≥ m̂,

(5.14)  ||π(m + 1) - π(m)|| = Σ_{i∈S*} {π_i(m + 1) - π_i(m)} - Σ_{i∉S*} {π_i(m + 1) - π_i(m)}.

Since

Σ_{i∈S*} π_i(m) + Σ_{i∉S*} π_i(m) = 1,  ∀m ≥ 0,

we have

(5.15)  ||π(m + 1) - π(m)|| = 2{π*(m + 1) - π*(m)},  m ≥ m̂,

where

(5.16)  π*(m) ≜ Σ_{i∈S*} π_i(m),  m ≥ 0.

By (5.15), we have

(5.17)  Σ_{m=m̂}^{∞} ||π(m + 1) - π(m)|| ≤ 2.

In view of (5.17) the proposition is proven.
Using Theorem 4.2 and Theorem 5.1, we can prove the fundamental result of this section.

Theorem 5.2. The time-inhomogeneous Markov chain associated with simulated annealing is strongly ergodic if it is weakly ergodic and the annealing schedule satisfies (3.5). In this case, for all m,

(5.18)  lim_{n→∞} sup_{v(0)} ||v(m, n) - e*|| = 0.

In particular, the annealing schedule in (5.10) with γ ≥ rL gives a strongly ergodic Markov chain for which (5.18) holds.
6. Finite-time behavior and rate of convergence

We obtain an estimate of the departure of the state of the Markov chain at finite time m from the optimum vector e*. The results in Theorem 6.2 below give important insights into the factors affecting the rate of convergence and their implications in the design of optimum annealing schedules.
6.1. Components of finite-time behavior. The following decomposition is basic:

(6.1)  v(m) - e* = {v(m) - π(0)P(0, m)} + {π(0)P(0, m) - π(m)} + {π(m) - e*}.

Observe that the sum of the first two terms in braces on the right-hand side measures the departure at time m of the state distribution from the quasi-stationary distribution. We have chosen to decompose this quantity further so that the first term measures the extent to which at time m the Markov chain has lost memory of the difference between v(0) and π(0).

From (6.1) we obtain

(6.2)  ||v(m) - e*|| ≤ ||v(m) - π(0)P(0, m)|| + ||π(0)P(0, m) - π(m)|| + ||π(m) - e*||.
In the next subsections, each of the three terms on the right-hand side is bounded independently.

6.1.1. Bound for the first term of (6.2). To determine a bound for the first term on the right-hand side of (6.2), we need the following fundamental result due to Dobrushin [2], [7], [18].

Theorem 6.1. If P is any stochastic matrix and y is any row vector with Σ_i y_i = 0, then

||yP|| ≤ τ_1(P) ||y||.

In view of Theorem 6.1, for the first term of the right-hand side of (6.2),

(6.3)  ||v(kr) - π(0)P(0, kr)|| = ||{v(0) - π(0)}P(0, kr)|| ≤ ||v(0) - π(0)|| τ_1(0, kr).
To complete the bound of the first term of (6.2) we need to bound τ_1(0, kr). To this end the following proposition is necessary.

Proposition 6.1. If γ ≥ rL and the annealing schedule (5.10) is applied, so that τ_1 satisfies (5.12a), then

(6.4a)  τ_1(lr - r, kr) ≤ [(k_0 + m_0/r)/(k + m_0/r)]^a,  for 1 ≤ l ≤ k_0 ≤ k,

(6.4b)  τ_1(lr - r, kr) ≤ [(l + m_0/r)/(k + m_0/r)]^a,  for k_0 ≤ l ≤ k,

where a is defined by (5.12c), r by (5.1), k_0 is such that (5.5) holds and m_0 is the parameter that controls the initial value of the temperature.

Proof. Let Q and R be two stochastic matrices; then [7]

τ_1(QR) ≤ τ_1(Q) τ_1(R).

By the above property, we have from (5.12a), for k_0 ≤ l ≤ k,

τ_1(lr - r, kr) ≤ Π_{m=l}^{k} τ_1(mr - r, mr)
              ≤ Π_{m=l}^{k} [1 - a/(m + m_0/r)^θ]
              ≤ exp(-a Σ_{m=l}^{k} 1/(m + m_0/r)^θ)
              ≤ exp(-a Σ_{m=l}^{k} 1/(m + m_0/r))  (since θ ≤ 1)
              ≤ [(l + m_0/r)/(k + m_0/r)]^a.

A similar bound can be derived for 1 ≤ l ≤ k_0 ≤ k.
The bound in (6.4) on the coefficient of ergodicity is fundamental to the finite-time analysis of simulated annealing. Substituting the above bound in (6.3) yields

(6.5)  ||v(kr) - π(0)P(0, kr)|| ≤ ||v(0) - π(0)|| [(k_0 + m_0/r)/(k + m_0/r)]^a,  k ≥ k_0.
6.1.2. Bound for the second term of (6.2). Let

(6.6)  ρ(m) ≜ π(0)P(0, m) - π(m),  m = 0, 1, ···.

Note that ρ(0) = 0 and that {ρ(m)} satisfy the recursion

(6.7)  ρ(m + r) = ρ(m)P(m, m + r) + Σ_{s=1}^{r} {π(m + s - 1) - π(m + s)}P(m + s, m + r).

The recursion is solved to give

(6.8a)  ρ(kr) = Σ_{l=1}^{k} ε(lr)P(lr, kr),

where

(6.8b)  ε(lr) ≜ Σ_{s=1}^{r} {π(lr - s) - π(lr - s + 1)}P(lr - s + 1, lr).

Applying Theorem 6.1 twice to obtain bounds for ||ε(lr)P(lr, kr)|| and ||ε(lr)|| from (6.8a) and (6.8b) respectively, we obtain

(6.9)  ||ρ(kr)|| ≤ Σ_{l=1}^{k} τ_1(lr, kr) Σ_{s=1}^{r} ||π(lr + 1 - s) - π(lr - s)||,  k ≥ 1.
Now making use of (5.15) and (6.4) we obtain

(6.10)  ||ρ(kr)|| ≤ [(k_0 + m_0/r)/(k + m_0/r)]^a Σ_{l=1}^{l_0} Σ_{s=1}^{r} ||π(lr + 1 - s) - π(lr - s)||
                  + [2/(k + m_0/r)^a] Σ_{l=l_0+1}^{k} (l + 1 + m_0/r)^a {π*(lr) - π*(lr - r)}

for k > l_0 ≜ max{m̂/r, k_0 - 2}. Now writing π̂*(n) for {1 - π*(n)}, with π* as in (5.16), we have

(6.11)  Σ_{l=l_0+1}^{k} (l + 1 + m_0/r)^a {π*(lr) - π*(lr - r)}
        ≤ Σ_{l=l_0+1}^{k} {(l + 1 + m_0/r)^a - (l + m_0/r)^a} π̂*(lr - r) + (l_0 + 1 + m_0/r)^a π̂*(l_0 r)
        ≤ a Σ_{l=l_0+1}^{k} π̂*(lr - r)/(l + m_0/r)^{1-a} + (l_0 + 1 + m_0/r)^a π̂*(l_0 r),

where in the last step we have used the relation a ≤ 1.

On substituting (6.11) in (6.10) we obtain, for k > l_0,

(6.12a)  ||ρ(kr)|| ≤ D_1/(k + m_0/r)^a + [2a/(k + m_0/r)^a] Σ_{l=l_0+1}^{k} π̂*(lr - r)/(l + m_0/r)^{1-a},

where

(6.12b)  D_1 ≜ (k_0 + m_0/r)^a Σ_{l=1}^{l_0} Σ_{s=1}^{r} ||π(lr + 1 - s) - π(lr - s)|| + 2(l_0 + 1 + m_0/r)^a π̂*(l_0 r).
To proceed further it is necessary to estimate {π̂*(m)}, and this is undertaken in the following proposition.

Proposition 6.2.

(6.13)  π̂*(m) = 1 - π*(m) = (1/2) ||π(m) - e*|| ≤ Σ_{j∉S*} [g(j)/g(*)] (m + m_0 + 1)^{-b(j)},  m = 0, 1, ···,

where {b(j)} is given by

b(j) ≜ {c(j) - c(*)}/γ,  j ∈ S,

c(*), see (3.13), is the minimum of the cost function and g(*), see Proposition 3.2, is Σ_{j∈S*} g(j).
Proof. By the definition of π(m) given in (3.1) and that of π*(m) given in (5.16) we have

1 - π*(m) = 1 - Σ_{i∈S*} g(i) exp(-c(i)/T_m)/G(m)
          = [Σ_{j∉S*} {g(j)/g(*)} (m + m_0 + 1)^{-b(j)}] / [1 + Σ_{j∉S*} {g(j)/g(*)} (m + m_0 + 1)^{-b(j)}]
          ≤ Σ_{j∉S*} {g(j)/g(*)} (m + m_0 + 1)^{-b(j)}.

Observe that the bound given in (6.13) is asymptotically (i.e. as m → ∞) tight.

We can now say that

(6.14a)  π̂*(lr - r) ≤ Σ_{j∉S*} ĝ(j)/(l - 1 + m_0/r)^{b(j)},  l = 1, 2, ···,

where

(6.14b)  ĝ(j) ≜ [g(j)/g(*)] / r^{b(j)},  j ∉ S*.

By substituting (6.14a) in (6.12a) and then bounding the resulting expression we obtain

(6.15)  ||ρ(kr)|| ≤ D_1/(k + m_0/r)^a + Σ_{j∉S*} [2a ĝ(j)/(a - b(j))] [1/(k + m_0/r)^{b(j)} - E^{a-b(j)}/(k + m_0/r)^a],

where E ≜ (l_0 - 1 + m_0/r).

This bound in (6.15) has been obtained for a ≠ b(j), j ∉ S*; if this is not true, then for the terms corresponding to values of j for which a = b(j), a related expression is obtained by a slightly different bounding procedure.
6.1.3. Bound for the third term of (6.2). This bound comes directly from Proposition 6.2.
6.2. Final results. Combining the results given in Sections 6.1.1-6.1.3 we obtain the following final theorem.

Theorem 6.2. For every k > l_0, the following relation holds:

(6.16)  ||v(kr) - e*|| ≤ D/(k + m_0/r)^a
        + Σ_{j∉S*} [2a ĝ(j)/(a - b(j))] [1/(k + m_0/r)^{b(j)} - E^{a-b(j)}/(k + m_0/r)^a]
        + Σ_{j∉S*} 2 ĝ(j)/(k + m_0/r)^{b(j)},

where

D = D_1 + ||v(0) - π(0)|| (k_0 + m_0/r)^a.

Also, a, {b(j)} and {ĝ(j)} are given in (5.12c), (6.13) and (6.14b) respectively.
Equation (6.16) can be further simplified if we observe that the dominant term of

1/(k + m_0/r)^{b(j)},  j ∉ S*,

is given by

1/(k + m_0/r)^{b},  where b ≜ min_{j∉S*} b(j) = δ/γ,

and δ, which has been defined in (3.14), is the difference between next-to-least cost and least cost.
A simple corollary to Theorem 6.2 is the following.

Proposition 6.3. The simulated annealing algorithm with the annealing schedule given by (5.10) has the following estimate for its rate of convergence:

(6.17)  ||v(kr) - e*|| = O(1/k^{min(a,b)}).
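Proposition 6.3 can be observed numerically on a small instance by evolving the state distribution exactly, v(m + 1) = v(m)P(T_m), under the schedule (5.10) and recording its L1 distance from e*. The following self-contained Python sketch does this; the data structures costs and g are the same illustrative ones used in the earlier sketches.

import math
import numpy as np

def distance_to_optimum(costs, g, gamma, m0=1, num_steps=2000):
    s = len(costs)
    c_star = min(costs)
    g_node = [sum(g[i]) for i in range(s)]
    # Optimum vector e* of (3.6): mass g(i)/g(*) on the least-cost states.
    e_star = np.array([g_node[i] if costs[i] == c_star else 0.0 for i in range(s)])
    e_star /= e_star.sum()
    v = np.full(s, 1.0 / s)                  # arbitrary starting distribution
    errors = []
    for m in range(num_steps):
        T = gamma / math.log(m + m0 + 1)     # schedule (5.10)
        P = np.zeros((s, s))
        for i in range(s):
            for j in range(s):
                if j != i and g[i][j] > 0:
                    P[i, j] = (g[i][j] / g_node[i]) * min(1.0, math.exp(-(costs[j] - costs[i]) / T))
            P[i, i] = 1.0 - P[i].sum()
        v = v @ P                            # one exact step of the inhomogeneous chain
        errors.append(float(np.abs(v - e_star).sum()))
    return errors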
6.3. Discussion. We can see from (6.17) that the bound on the asymptotic rate of convergence is limited by min(a, b). Both a and b depend on δ and L derived from the cost function, w and r from the connectivity properties of the graph underlying the Markov chain, and γ from the annealing schedule. Note that with all other parameters and time held fixed, higher γ corresponds to higher temperature and thus, in this sense, to slower cooling. Now γ has to satisfy a condition that gives weak ergodicity, i.e. γ ≥ γ_WE, wherein by our analysis γ_WE = rL, but otherwise it is a free parameter. It is therefore of some interest to investigate the value of γ which maximizes min(a, b).

Recall the definition of a in (5.12c) and that b = δ/γ. Hence a(γ) and b(γ) are respectively increasing and decreasing with increasing γ, and it is easy to see that there exists a unique γ̂ such that a(γ̂) = b(γ̂). Furthermore, the problem

max_{γ: γ ≥ γ_WE} {min(a, b)}

has the solution

γ = max(γ_WE, γ̂).
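Under the assumption that estimates of r, L, w and δ are available, this choice of γ can be computed numerically: a(γ) = w^r/r^(rL/γ) from (5.12c) is increasing in γ and b(γ) = δ/γ is decreasing, so their crossing point γ̂ can be found by bisection. A Python sketch (the bracketing constants are illustrative assumptions):

import math

def optimal_gamma(r, L, w, delta, tol=1e-9):
    gamma_we = r * L                                  # weak-ergodicity threshold (5.11)

    def gap(gamma):
        # a(gamma) = w**r / r**(r*L/gamma), computed via logs to avoid overflow.
        a = math.exp(r * math.log(w) - (r * L / gamma) * math.log(r))
        return a - delta / gamma                      # a(gamma) - b(gamma)

    lo, hi = 1e-6, max(gamma_we, 1.0)
    while gap(hi) < 0:                                # expand until a(hi) >= b(hi)
        hi *= 2.0
        if hi > 1e15:                                 # degenerate instance; fall back
            return gamma_we
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    gamma_hat = 0.5 * (lo + hi)                       # a(gamma_hat) = b(gamma_hat)
    return max(gamma_we, gamma_hat)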
The above procedure for optimizing the algorithm is often feasible since for many combinatorial optimization problems, graph partitioning problems in particular, estimates of r, L and δ are available.

The above discussion has been on the effect of γ (from the annealing schedule) on the bound on the rate of convergence at finite, but large, time. For behavior at smaller time, the more detailed relation (6.16) has to be considered. Observe that on the right-hand side of this equation, the only factors which depend on the time kr are 1/(k + m_0/r)^a and 1/(k + m_0/r)^{b(j)}, j ∉ S*. We may glean qualitative information on the dependence of the rate of convergence on γ by investigating the dependence of a and {b(j)} on γ. Now, smaller γ gives larger b(j) for each j and, as already noted, smaller a. Hence, reducing γ has the effect of reducing the third term and increasing the first term on the right-hand side of (6.16). The dependence of the middle term is more involved since it has features of both other terms reflected in it. Roughly, it is small only when both the first and third terms are small, i.e. in the mid-range of γ.
With the benefit of analysis we can even go back to (6.2) and deduce qualitatively the effect of γ on each of the three terms there. The first term measures how effectively the difference between v(0) and π(0) is forgotten at step m of the algorithm. The bound in (6.5) corroborates our intuitive understanding that this rate of memory loss is aided by having higher γ, i.e. higher temperatures and slower cooling. The third term, for which we have the most explicit information (see Proposition 6.2), depends on the rate at which the quasi-stationary distribution approaches its asymptotic value, the optimum distribution. This term benefits from small γ. The middle term benefits from a matching of the two rates. The point in the analysis where this is most explicitly manifest is in (6.12a). The two rates are matched and the term minimized in the mid-range of γ. In all, the above discussion illuminates the balancing of opposite mechanisms that an optimal annealing schedule must reflect.

The analysis can be brought to bear on an important question (for which we are indebted to H. S. Witsenhausen): to what extent does simulated annealing exploit the connectivity of the configurations in a particular case? The comparison is therefore between a given partially-connected graph and a construct in which the connectivity is artificially increased. A first observation is that the artificial increase of connectivity leads to a deteriorating component in performance, insofar as the departure of the quasi-stationary distribution at a particular temperature from the optimum distribution (see the third term in (6.2)) is greater. This is easily seen by tracing the effect of increased connectivity on g(j)/g(*) in Proposition 6.2. On the other hand, the effect on the coefficient of ergodicity and, in particular, on the parameter a in the bound for it given in Proposition 6.1, depends on the particulars of the case being considered. To see this, observe that the parameter a depends on w, r and L, and typically the first two decrease while the last increases with the increase of connectivity in the construct.
7. Concluding remarks

We have proven a number of results on the behavior of simulated annealing. In particular, we have introduced an annealing schedule which guarantees that the individual state probabilities converge either to a positive value or to 0 depending upon whether the configuration corresponding to the state is globally least-cost or not. Also we have analyzed finite-time behavior in terms of a decomposition of the distance of the state probability vector from the optimum. Each of the three terms of the decomposition reflects an important component of the behavior of the algorithm. Each term has an independent bound and this allows the trade-offs in the design of the algorithm to be quantified.

We give below a selection of three directions in which the present analysis may be extended:
1. An analysis more closely attached to the evolution with time of mean cost rather than the distance of the state distribution from the optimal.
2. An analysis of schedules in which temperature is lowered at a faster rate than that allowed here by (1.1).
3. The exploitation of special properties of the cost function to design matched annealing schedules with a provable improvement in performance.
References

[1] Binder, K. (1978) Monte Carlo Methods in Statistical Physics. Springer-Verlag, Berlin.
[2] Dobrushin, R. L. (1956) Central limit theorem for nonstationary Markov chains, I, II. Theory Prob. Appl. 1, 65-80; 329-383.
[3] Garey, M. R. and Johnson, D. S. (1979) Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, San Francisco.
[4] Geman, S. and Geman, D. (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Analysis and Machine Intelligence 6, 721-741.
[5] Gidas, B. (1985) Non-stationary Markov chains and convergence of the annealing algorithm. J. Statist. Phys. 39, 73-131.
[6] Hajek, B. (1985) Cooling schedules for optimal annealing. Preprint.
[7] Isaacson, D. L. and Madsen, R. W. (1976) Markov Chains: Theory and Applications. Wiley, New York.
[8] Iosifescu, M. (1980) Finite Markov Processes and their Applications. Wiley, New York.
[9] Johnson, D. S. (1984) Simulated annealing performance studies. Presented at the Simulated Annealing Workshop, Yorktown Heights.
[10] Kelly, F. P. (1980) Reversibility and Stochastic Networks. Wiley, New York.
[11] Kirkpatrick, S., Gelatt, C. D. and Vecchi, M. P. (1983) Optimization by simulated annealing. Science 220, 671-680.
[12] Lundy, M. and Mees, A. (1984) Convergence of the annealing algorithm. Presented at the Simulated Annealing Workshop, Yorktown Heights.
[13] Madsen, R. W. and Isaacson, D. L. (1973) Strongly ergodic behavior for non-stationary Markov processes. Ann. Prob. 1, 329-335.
[14] Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N. and Teller, A. H. (1953) Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087-1091.
[15] Romeo, F. and Sangiovanni-Vincentelli, A. (1984) Probabilistic hill climbing algorithms: properties and applications. ERL Memo, University of California, Berkeley.
[16] Sechen, C. and Sangiovanni-Vincentelli, A. (1984) The TimberWolf placement and routing package. Proc. 1984 Custom Integrated Circuit Conference, Rochester.
[17] Schwartz, J. (1980) Fast probabilistic algorithms for verification of polynomial identities. J. Assoc. Comput. Mach. 27, 701-717.
[18] Seneta, E. (1980) Non-negative Matrices and Markov Chains, 2nd edn. Springer-Verlag, New York.
[19] Vecchi, M. P. and Kirkpatrick, S. (1983) Global wiring by simulated annealing. IEEE Trans. Computer-Aided Design 2, 215-222.
This content downloaded from 128.59.46.225 on Wed, 2 Jul 2014 16:01:23 PM
All use subject to JSTOR Terms and Conditions
... It uses a parameter called temperature , which is progressively lowered during the search to transition from a diversification behaviour to an intensification one ( Kirkpatrick, Gelatt, & Vecchi, 1983;Č erný, 1985 ). One alternative is to use the same temperature value throughout the whole search ( Mitra, Romeo, & Sangiovanni-Vincentelli, 1985 ). This SA variant appears in the literature under different names ( Cohn & Fielding, 1999;Jerrum, Sinclair, & Hochbaum, 1996;Johnson & Jacobson, 2002;Orosz & Jacobson, 2002 ) and, for consistency, throughout this work we will refer to it as Fixed Temperature algorithm (FTA). ...
... The first work that mentioned a SA with a fixed temperature in the optimization literature is probably the one of Mitra et al. in 1985( Mitra et al., 1985. In 1989, Hayek and Sasaki studied a SA for the polynomial-time matching problem and gave examples for which any monotone decreasing temperature sequence is not optimal ( Hajek & Sasaki, 1989 ). ...
... The first work that mentioned a SA with a fixed temperature in the optimization literature is probably the one of Mitra et al. in 1985( Mitra et al., 1985. In 1989, Hayek and Sasaki studied a SA for the polynomial-time matching problem and gave examples for which any monotone decreasing temperature sequence is not optimal ( Hajek & Sasaki, 1989 ). ...
Article
Since the introduction of Simulated Annealing (SA), researchers have considered variants that keep the same temperature value throughout the whole search and tried to determine whether this strategy can be more effective than the original cooling scheme. Several studied have tried to answer this question without a conclusive answer and without providing indications that could be useful for a practical implementation. In this work, we address this question following an experimental approach, relating the characteristics of the algorithms with the characteristics of the landscapes they encounter. We use problem-independent landscape features to study the algorithmic behaviour across different problems. We consider three different objective functions and various instance classes and determine the conditions under which the fixed-temperature variant of SA can outperform its original counterpart and when SA is instead a better choice.
... Since each random walk takes a different path towards optimality, nodes that share the greatest amount of topological similarity have the greatest chance of becoming aligned across independent paths taken towards a near-optimal solution. Our random walk through search space is generated using simulated annealing, which has a rich history of success in optimizing NP-complete problems [25][26][27][28][29][30][31][32][33][34][35][36][37] . Its randomness is key: each run of our Simulated Annealing Network Aligner, or SANA 38,39 , follows a different, randomized path towards an alignment that uncovers close to the maximum amount of common topology that can be discovered between two networks 40 . ...
... npj Systems Biology and Applications (2022)25 Published in partnership with the Systems Biology Institute ...
Article
Full-text available
Topological network alignment aims to align two networks node-wise in order to maximize the observed common connection (edge) topology between them. The topological alignment of two protein–protein interaction (PPI) networks should thus expose protein pairs with similar interaction partners allowing, for example, the prediction of common Gene Ontology (GO) terms. Unfortunately, no network alignment algorithm based on topology alone has been able to achieve this aim, though those that include sequence similarity have seen some success. We argue that this failure of topology alone is due to the sparsity and incompleteness of the PPI network data of almost all species, which provides the network topology with a small signal-to-noise ratio that is effectively swamped when sequence information is added to the mix. Here we show that the weak signal can be detected using multiple stochastic samples of “good” topological network alignments, which allows us to observe regions of the two networks that are robustly aligned across multiple samples. The resulting network alignment frequency (NAF) strongly correlates with GO-based Resnik semantic similarity and enables the first successful cross-species predictions of GO terms based on topology-only network alignments. Our best predictions have an AUPR of about 0.4, which is competitive with state-of-the-art algorithms, even when there is no observable sequence similarity and no known homology relationship. While our results provide only a “proof of concept” on existing network data, we hypothesize that predicting GO terms from topology-only network alignments will become increasingly practical as the volume and quality of PPI network data increase.
... Still, under the analogy with physical systems, it is well-known that if the cooling is too rapid, the system cannot achieve thermal equilibrium for each temperature value, which may result in a configuration with defects in the form of high-energy, metastable, locally optimal structures. A treatment of this topic from a theoretical point of view based on the Markov chain theory is present in [1,12,15,22,26,32], where specifically in [15] was derived a necessary and sufficient condition on the cooling speed that guarantees asymptotic convergence of the SA to the ground states. ...
Article
Full-text available
The Digital Annealer is a CMOS hardware designed by Fujitsu Laboratories for high-speed solving of Quadratic Unconstrained Binary Optimization (QUBO) problems that could be difficult to solve by means of existing general-purpose computers. In this paper, we present a mathematical description of the first-generation Digital Annealer’s Algorithm from the Markov chain theory perspective, establish a relationship between its stationary distribution with the Gibbs-Boltzmann distribution, and provide a necessary and sufficient condition on its cooling schedule that ensures asymptotic convergence to the ground states.
... Thus, following the work (Mitra, Romeo, and Vincentelli 1985), for the stationary distribution of the Markov Chain ...
Article
The problem of influence maximization, i.e., mining top-k influential nodes from a social network such that the spread of influence in the network is maximized, is NP-hard. Most of the existing algorithms for the problem are based on the greedy algorithm. Although the greedy algorithm can achieve a good approximation, it is computationally expensive. In this paper, we propose a totally different approach based on Simulated Annealing (SA) for the influence maximization problem. This is the first SA-based algorithm for the problem. Additionally, we propose two heuristic methods to accelerate the convergence process of SA, and a new method of computing influence to speed up the proposed algorithm. Experimental results on four real networks show that the proposed algorithms run faster than the state-of-the-art greedy algorithm by 2-3 orders of magnitude while improving on its accuracy.
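To give a sense of the general shape of such a search (a minimal sketch only: the swap move, the geometric cooling schedule, and the toy one-hop "influence" surrogate below are our own illustrative assumptions, not the heuristics or spread model proposed in the paper):

    import math
    import random

    def one_hop_coverage(graph, seeds):
        """Toy influence surrogate: seeds plus their direct neighbors."""
        covered = set(seeds)
        for s in seeds:
            covered.update(graph.get(s, ()))
        return len(covered)

    def sa_seed_selection(graph, k, steps=10000, t0=1.0, alpha=0.999):
        nodes = list(graph)
        current = set(random.sample(nodes, k))
        best, best_val = set(current), one_hop_coverage(graph, current)
        cur_val, t = best_val, t0
        for _ in range(steps):
            # Neighbor move: swap one seed for a random non-seed node.
            out_node = random.choice(tuple(current))
            in_node = random.choice([n for n in nodes if n not in current])
            candidate = (current - {out_node}) | {in_node}
            cand_val = one_hop_coverage(graph, candidate)
            # Metropolis acceptance for a maximization objective.
            if cand_val >= cur_val or random.random() < math.exp((cand_val - cur_val) / t):
                current, cur_val = candidate, cand_val
                if cur_val > best_val:
                    best, best_val = set(current), cur_val
            t *= alpha  # geometric cooling
        return best, best_val

    # Tiny example graph as an adjacency dict:
    g = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b", "e"], "e": ["d"]}
    print(sa_seed_selection(g, k=2, steps=2000))

The acceptance rule lets the search occasionally take worse seed sets early on (high temperature) and becomes effectively greedy as the temperature decays.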
... (SANA is available on GitHub at https://github.com/waynebhayes/SANA.) SA has a rich history of successful application to NP-complete problems across a wide array of application domains [63][64][65][66][67][68][69][70][71][72] . One important aspect of SA is the choice of temperature schedule; SANA automatically determines effective temperature limits using an algorithm detailed elsewhere 73 . ...
Preprint
Full-text available
The function of a protein is defined by its interaction partners. Thus, topology-driven network alignment of the protein-protein interaction (PPI) networks of two species should uncover similar interaction patterns and allow identification of functionally similar proteins. However, few of the fifty or more algorithms for PPI network alignment have demonstrated a significant link between network topology and functional similarity, and none have recovered orthologs using network topology alone. We find that the major contributing factors to this failure are: (i) edge densities in current PPI networks are too low to expect topological network alignment to succeed; (ii) when edge densities are high enough, some measures of topological similarity easily uncover functionally similar proteins while others do not; and (iii) most network alignment algorithms fail to optimize their own topological objective functions, hampering their ability to use topology effectively. We demonstrate that SANA (the Simulated Annealing Network Aligner) significantly outperforms existing aligners at optimizing their own objective functions, even achieving near-optimal solutions when the optimal solution is known. We offer the first demonstration of global network alignments based on topology alone that align functionally similar proteins with p-values in some cases below 1e-300. We predict that topological network alignment has a bright future as edge densities increase towards the value where good alignments become possible. We demonstrate that when enough common topology is present at high enough edge densities (for example, in the recent, partly synthetic networks of the Integrated Interaction Database), topological network alignment easily recovers most orthologs, paving the way towards high-throughput functional prediction based on topology-driven network alignment.
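Regarding the temperature-limit point raised in the excerpt above: one generic heuristic for choosing an initial temperature from a target acceptance probability is to sample some uphill moves and solve the Metropolis acceptance formula for T0. This is a common rule of thumb, not necessarily the automatic method SANA itself uses (which is described in its own reference 73):

    import math

    def initial_temperature(uphill_deltas, p0=0.95):
        """Pick T0 so that an average uphill move is accepted with
        probability roughly p0 under Metropolis acceptance:
        exp(-mean_delta / T0) = p0  =>  T0 = -mean_delta / ln(p0)."""
        mean_delta = sum(uphill_deltas) / len(uphill_deltas)
        return -mean_delta / math.log(p0)

    # e.g. if sampled uphill moves cost about 3.0 on average:
    t0 = initial_temperature([2.5, 3.1, 3.4], p0=0.95)  # ~58.5

A final temperature is chosen analogously with a small target acceptance probability, so that the run ends in an effectively greedy phase.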
Article
Flow shop scheduling deals with determining the optimal sequence in which jobs are processed on machines in a fixed order, with the main objective of minimizing the completion time of all jobs (makespan). This type of scheduling problem appears in many industrial and production planning applications. This study proposes a new bi-objective mixed-integer programming model for solving synchronous flow shop scheduling problems with completion-time objectives. The objective functions are the total makespan and the sum of tardiness and earliness costs of blocks. Jobs are moved among machines through a synchronous transportation system with synchronized processing cycles. In each cycle, the existing jobs begin simultaneously, each on one of the machines, and after completion wait until the last job is completed. Subsequently, all the jobs are moved concurrently to the next machine. Four algorithms, including the non-dominated sorting genetic algorithm (NSGA II), multi-objective simulated annealing (MOSA), multi-objective particle swarm optimization (MOPSO), and multi-objective hybrid vibration-damping optimization (MOHVDO), are used to find a near-optimal solution for this NP-hard problem. In particular, the proposed hybrid VDO algorithm is based on the imperialist competitive algorithm (ICA) and the integration of a neighborhood creation technique. MOHVDO and MOSA show the best performance among the algorithms with respect to objective-function values and CPU time, respectively. Thus, the results from running small-scale and medium-scale problems in MOHVDO and MOSA are compared with the solutions obtained from the epsilon-constraint method. In particular, the error percentage of MOHVDO's objective functions is less than 2% compared to the epsilon-constraint method for all solved problems. Besides the specific results obtained in terms of performance and, hence, practical applicability, the proposed approach fills a considerable gap in the literature. Indeed, even though variants of the aforementioned meta-heuristic algorithms have been widely introduced in multi-objective environments, a simultaneous implementation of these algorithms, as well as a comparative study of their performance when solving flow shop scheduling problems, has so far been overlooked.
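To make the synchronous movement rule concrete, here is a minimal sketch of computing the makespan of one job sequence under that rule (our own illustration, assuming one job enters the line per cycle in sequence order; this is not the paper's bi-objective MIP): in every cycle each job occupies one machine, all jobs wait for the slowest operation of that cycle, and then everything shifts forward one machine.

    def synchronous_makespan(proc, sequence):
        """proc[j][m] = processing time of job j on machine m; `sequence` is a job order.
        In cycle t, the job at sequence position i occupies machine t - i (when valid);
        each cycle lasts as long as its slowest operation, and the makespan is the sum
        of all cycle lengths."""
        n, m = len(sequence), len(proc[sequence[0]])
        makespan = 0
        for t in range(n + m - 1):
            cycle = 0
            for i, job in enumerate(sequence):
                machine = t - i
                if 0 <= machine < m:
                    cycle = max(cycle, proc[job][machine])
            makespan += cycle
        return makespan

    # Two jobs, two machines: job 0 takes (3, 2), job 1 takes (1, 4).
    proc = [(3, 2), (1, 4)]
    print(synchronous_makespan(proc, [0, 1]))  # cycles of length 3, 2, 4 -> 9

The waiting built into each cycle is exactly why synchronous flow shops need objective functions beyond the classical makespan recursions.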
Chapter
Since the function of a protein is defined by its interaction partners, and since we expect similar interaction patterns across species, the alignment of protein-protein interaction (PPI) networks between species, based on network topology alone, should uncover functionally related proteins across species. Surprisingly, despite the publication of more than fifty algorithms aimed at performing PPI network alignment, few have demonstrated a statistically significant link between network topology and functional similarity, and none have demonstrated that orthologs can be recovered using network topology alone. We find that the major contributing factors to this surprising failure are: (i) edge densities in most currently available experimental PPI networks are demonstrably too low to expect topological network alignment to succeed; (ii) in the few cases where the edge densities are high enough, some measures of topological similarity easily uncover functionally similar proteins while others do not; and (iii) most network alignment algorithms to date perform poorly at optimizing even their own topological objective functions, hampering their ability to use topology effectively. We demonstrate that SANA—the Simulated Annealing Network Aligner—significantly outperforms existing aligners at optimizing their own objective functions, even achieving near-optimal solutions when the optimal solution is known. We offer the first demonstration of global network alignments based on topology alone that align functionally similar proteins with p-values in some cases below 10⁻³⁰⁰. We predict that topological network alignment has a bright future as edge densities increase toward the value where good alignments become possible. We demonstrate that when enough common topology is present at high enough edge densities—for example in the recent, partly synthetic networks of the Integrated Interaction Database—topological network alignment easily recovers most orthologs, paving the way toward high-throughput functional prediction based on topology-driven network alignment.
Article
Simulated annealing, proposed by Kirkpatrick et al., has proven to be an effective technique for solving general combinatorial optimization problems. Markov chains are proposed as mathematical models of the Simulated Annealing algorithm. Using these models, it has been possible to prove that, under certain assumptions on the rules used by the algorithm to generate the configurations of the problem and on the time spent at each temperature, the Simulated Annealing algorithm generates a globally optimal solution with probability one. This result has made possible the definition of a general class of algorithms with the same statistical properties: the class of probabilistic hill-climbing methods. The mathematical properties of this class are presented, and rules on the selection of annealing schedules are obtained from these properties.