Gürdal Arslan
Department of Electrical Engineering,
University of Hawaii,
Manoa, Honolulu, HI 96822
e-mail: gurdal@hawaii.edu
Jason R. Marden
e-mail: marden@ucla.edu
Jeff S. Shamma
e-mail: shamma@ucla.edu
Department of Mechanical and Aerospace
Engineering,
University of California, Los Angeles,
Los Angeles, CA 90095

Contributed by the Dynamic Systems, Measurement, and Control Division of ASME for publication in the JOURNAL OF DYNAMIC SYSTEMS, MEASUREMENT, AND CONTROL. Manuscript received March 31, 2006; final manuscript received April 1, 2007. Review conducted by Tal Shima.
Autonomous Vehicle-Target
Assignment: A Game-Theoretical
Formulation
We consider an autonomous vehicle-target assignment problem where a group of vehicles
are expected to optimally assign themselves to a set of targets. We introduce a game-
theoretical formulation of the problem in which the vehicles are viewed as self-interested
decision makers. Thus, we seek the optimization of a global utility function through
autonomous vehicles that are capable of making individually rational decisions to opti-
mize their own utility functions. The first important aspect of the problem is to choose the
utility functions of the vehicles in such a way that the objectives of the vehicles are
localized to each vehicle yet aligned with a global utility function. The second important
aspect of the problem is to equip the vehicles with an appropriate negotiation mechanism
by which each vehicle pursues the optimization of its own utility function. We present
several design procedures and accompanying caveats for vehicle utility design. We
present two new negotiation mechanisms, namely, “generalized regret monitoring with
fading memory and inertia” and “selective spatial adaptive play,” and provide accom-
panying proofs of their convergence. Finally, we present simulations that illustrate how
vehicle negotiations can consistently lead to near-optimal assignments provided that the
utilities of the vehicles are designed appropriately. [DOI: 10.1115/1.2766722]
1 Introduction
Designing autonomous vehicles with intelligent and coordinated action capabilities to achieve an overall objective is a major part of the theme of "cooperative control," which has received significant attention in recent years. Whereas much of the work in this area focuses on "kinetic" coordination, e.g., multivehicle trajectory generation (see, e.g., [1] and references therein), the
focus here is on strategic coordination. In particular, we consider
an autonomous vehicle-target assignment problem illustrated in
Fig. 1, where a group of vehicles are expected to assign them-
selves to a set of targets to optimize a global utility function.
When viewed as a combinatorial optimization problem, the vehicle-target assignment problem considered in this paper is a generalization of the well-known weapon-target assignment problem [2] to the case where the global utility is a general function of the vehicle-target assignments. In its full generality, the weapon-target assignment problem is known to be NP-complete [2], and the existing literature on the weapon-target assignment problem is concentrated on heuristic methods to quickly obtain near-optimal assignments in relatively large instances of the problem, very often with no guarantees on the degree of suboptimality (cf. [3] and references therein).
Therefore, from an optimization viewpoint, the vehicle-target as-
signment problem considered in this paper is, in general, a hard
problem, even though optimal assignments can be obtained quite
efficiently in very special cases.
Our viewpoint in this paper deviates from that of direct optimi-
zation. Rather, we emphasize the design of vehicles that are indi-
vidually capable of making coordination decisions to optimize
their own utilities, which then indirectly translates to the optimi-
zation of a global utility function. The main potential benefit of
this approach is to enable autonomous vehicles that are individu-
ally capable of operating in uncertain and adversarial environ-
ments, with limited information, communication, and computa-
tion, to autonomously optimize a global utility. The optimization
methods available in the literature are not suitable for our pur-
poses because even a distributed implementation of such optimi-
zation algorithms need not induce "individually rational" behavior, which is the key to realizing the expected benefits of our
approach. Furthermore, an optimization approach would typically
require constant dissemination of global information throughout
the network of the vehicles as well as increased communication
and computation.
Accordingly, in this paper we formulate our autonomous vehicle-target assignment problem as a multiplayer game [4,5], where each vehicle is interested in optimizing its own utility. We use the notion of pure Nash equilibrium to represent the assignments that are agreeable to the rational vehicles, i.e., the assignments at which there is no incentive for any vehicle to unilaterally deviate. We use algorithms for multiplayer learning in games as negotiation mechanisms by which the vehicles seek to optimize their utilities. The problem of optimizing a global utility function by the autonomous vehicles then reduces to the proper design of (i) the vehicle utilities and (ii) the negotiation mechanisms.
Designing vehicle utilities is essential to obtaining desirable collective behavior through self-interested vehicles (cf. [6]). An important consideration in designing the vehicle utilities is that the vehicle utility functions should be "aligned" with the global utility function in the sense that agreeable assignments (i.e., Nash equilibria) should lead to high, ideally maximal, global utility. There are multiple ways that such alignment can be achieved. An obvious instance is to set the vehicle utilities equal to the global utility. This choice is not desirable in the case of a large number of interacting vehicles, because another consideration in designing
the vehicle utilities is that the vehicle utilities should be “local-
ized,” i.e., a vehicle’s utility should depend only on the local
information available to the vehicle. For example, in a large
vehicle-target assignment problem, the vehicles may have range
restrictions and a vehicle may not even be aware of the targets
and/or the vehicles outside its range. In such a case, a vehicle
whose utility is set to the global utility would not have sufficient
information to compute its own utility. Therefore, a vehicle’s util-
ity should be localized to its range while maintaining the alignment with the global utility. More generally, we will discuss the properties of being aligned and localized for several utility design procedures in Sec. 3.
Obtaining optimal assignments using the approach presented in
this paper also requires that the vehicles use a negotiation mecha-
nism that is convergent in the multiplayer game induced by the
vehicle utilities. We will show that when vehicle utilities are
aligned with the global utility, they always lead to a class of
games known as “ordinal potential games” 7. The significance
of this connection is that certain multiplayer learning algorithms,
such as fictitious play FP兲关8, are known to converge in potential
games, and hence can be used as vehicle negotiation mechanisms.
However, FP has an intensive informational requirement. Spatial
adaptive play SAP兲关9is another such algorithm, which leads to
an optimizer of the potential function in potential games with
arbitrarily high probability. Although SAP reduces the information
requirement, there can be a high implementation cost when ve-
hicles have a large number of possible actions.
This paper goes beyond existing work in the area through the
introduction of new negotiating mechanisms that alleviate the in-
formational and implementation requirement, namely, “general-
ized regret monitoring with fading memory and inertia” and “se-
lective spatial adaptive play.” We establish new convergence
results for both algorithms and simulate their performance on an
illustrative weapon-target assignment problem.
The remainder of this paper is organized as follows. Section 2
sets up an autonomous vehicle-target assignment problem as a
multiplayer game. Section 3 discusses the issue of designing the
utility functions of the vehicles that are localized to each vehicle
yet aligned with a given global utility function. Section 4 reviews
selected learning algorithms available in the literature and pre-
sents two new algorithms, along with convergence results, that
offer some advantages over existing algorithms. Section 5 presents some simulation results to illustrate the possibility of obtaining near-optimal assignments through vehicle negotiations. Finally,
Section 6 contains some concluding remarks.
2 Game-Theoretical Formulation of an Autonomous
Vehicle-Target Assignment Problem
We begin by considering an optimal assignment problem where $n_v$ vehicles are to be assigned to $n_t$ targets. Each entity, whether a vehicle or a target, may have different characteristics. The vehicles are labeled as $V_1, \dots, V_{n_v}$, and the targets are labeled as $T_0, T_1, \dots, T_{n_t}$, where a fictitious target $T_0$ represents the "null target" or "no target." Let $\mathcal{V} \triangleq \{V_1, \dots, V_{n_v}\}$ and $\mathcal{T} \triangleq \{T_0, T_1, \dots, T_{n_t}\}$. A vehicle can be assigned to any target in its range, denoted by $\mathcal{A}_i \subseteq \mathcal{T}$ for vehicle $V_i \in \mathcal{V}$. The null target always satisfies $T_0 \in \mathcal{A}_i$. Let $\mathcal{A} \triangleq \mathcal{A}_1 \times \cdots \times \mathcal{A}_{n_v}$. The assignment of vehicle $V_i$ is denoted by $a_i \in \mathcal{A}_i$, and the collection of vehicle assignments $(a_1, \dots, a_{n_v})$, called the assignment profile, is denoted by $a$. Each assignment profile, $a \in \mathcal{A}$, corresponds to a global utility, $U_g(a)$, that can be interpreted as the objective of a global planner.
We view the vehicles as "autonomous" decision makers, and accordingly, each vehicle, e.g., vehicle $V_i \in \mathcal{V}$, is assumed to select its own target assignment, $a_i \in \mathcal{A}_i$, to maximize its own utility function, $U_{V_i}(a)$. In general, vehicle utility functions may be different and each of them may depend on the whole assignment profile $a$. Hence, the vehicles do not necessarily face an optimization problem; rather, they face a (finite) multiplayer game. In such a setting, the vehicles are to negotiate an assignment profile that is mutually agreeable. The autonomous target assignment problem is to design the utilities, $U_{V_i}(a)$, as well as appropriate negotiation procedures so that the vehicles can negotiate a mutually agreeable target assignment that yields maximal global utility, $U_g(a)$.
To be able to deal with the intricacies of our autonomous target assignment problem, we adopt some concepts and methods from the theory of games [4,5]. We start with the concept of equilibrium to characterize the target assignments that are agreeable to the vehicles. A well-known equilibrium concept for multiplayer games is the notion of Nash equilibrium. In the context of an autonomous target assignment problem, a Nash equilibrium is an assignment profile $a^* = (a_1^*, \dots, a_{n_v}^*)$ such that no vehicle could improve its utility by unilaterally deviating from $a^*$. Before introducing the notion of Nash equilibrium in more precise terms, we will introduce some notation. Let $a_{-i}$ denote the collection of the target assignments of the vehicles other than vehicle $V_i$, i.e.,

$$a_{-i} = (a_1, \dots, a_{i-1}, a_{i+1}, \dots, a_{n_v})$$

and let

$$\mathcal{A}_{-i} \triangleq \mathcal{A}_1 \times \cdots \times \mathcal{A}_{i-1} \times \mathcal{A}_{i+1} \times \cdots \times \mathcal{A}_{n_v}$$

With this notation, we will sometimes write an assignment profile $a$ as $(a_i, a_{-i})$. Similarly, we may write $U_{V_i}(a)$ as $U_{V_i}(a_i, a_{-i})$. Using the above notation, an assignment profile $a^*$ is called a pure Nash equilibrium if, for all vehicles $V_i \in \mathcal{V}$,

$$U_{V_i}(a_i^*, a_{-i}^*) = \max_{a_i \in \mathcal{A}_i} U_{V_i}(a_i, a_{-i}^*) \qquad (1)$$
In this paper, we will represent the agreeable target assignment
profiles by the set of pure Nash equilibria even though in the
literature some non-Nash solution concepts for multiplayer games
are also available. We will introduce one such concept called ef-
ficiency for future reference. An assignment profile is called effi-
cient if there is no other assignment that yields higher utilities to
all vehicles. For given vehicle utilities, a Nash equilibrium assign-
ment may or may not be efficient. Our justification of a pure Nash
equilibrium as an agreeable assignment is based on the autonomous and self-interested nature of the vehicles. Clearly, an efficient pure Nash equilibrium should be more appealing to the vehicles than an inefficient pure Nash equilibrium.

Fig. 1 Illustration of vehicle-target assignment
In general, a pure Nash equilibrium may not exist for an arbi-
trary set of vehicle utilities. However, as will be seen in Sec. 3,
any reasonable set of vehicle utilities tailored to the autonomous
vehicle-target problem would have at least one pure Nash equilib-
rium.
We conclude this section with the definition of potential games and ordinal potential games [7]. These games form an important class of games because of their relevance to autonomous vehicle-target assignment as well as their desirable convergence properties mentioned earlier.
DEFINITION 2.1 ((ORDINAL) POTENTIAL GAMES). A potential game consists of vehicle utilities, $U_{V_i}(a)$, $V_i \in \mathcal{V}$, and a potential function, $\phi(a): \mathcal{A} \to \mathbb{R}$, such that, for every vehicle $V_i \in \mathcal{V}$, for every $a_{-i} \in \mathcal{A}_{-i}$, and for every $a_i', a_i'' \in \mathcal{A}_i$,

$$U_{V_i}(a_i', a_{-i}) - U_{V_i}(a_i'', a_{-i}) = \phi(a_i', a_{-i}) - \phi(a_i'', a_{-i})$$

An ordinal potential game consists of vehicle utilities $U_{V_i}(a)$, $V_i \in \mathcal{V}$, and a potential function $\phi(a): \mathcal{A} \to \mathbb{R}$ such that, for every vehicle $V_i \in \mathcal{V}$, for every $a_{-i} \in \mathcal{A}_{-i}$, and for every $a_i', a_i'' \in \mathcal{A}_i$,

$$U_{V_i}(a_i', a_{-i}) - U_{V_i}(a_i'', a_{-i}) > 0 \iff \phi(a_i', a_{-i}) - \phi(a_i'', a_{-i}) > 0$$
In a potential game, the difference in utility received by any one
vehicle for its two different target choices, when the assignments
of other vehicles are fixed, can be measured by a potential func-
tion that only depends on the assignment profile and not on the
label of any vehicle.
In an ordinal potential game, an improvement in utility received
by any one vehicle for its two different target choices, when the
assignments of other vehicles are fixed, always results in an im-
provement of a potential function that, again, only depends on the
assignment profile and not on the label of any vehicle. Clearly,
ordinal potential games form a broader class than potential games.
3 Utility Design
In this section, we discuss various important aspects of designing the vehicle utilities to achieve a high global utility. We cite [7,10] as the key references for this section, since we freely use some of the terminology and the ideas presented in them. To make the discussion more concrete and relevant, we assume a certain structure for the global utility, even though it is possible to present the ideas at a more abstract level. We assume that all vehicles that assign themselves to a particular target form a team and engage their common target in a coordinated manner. An engagement with target $T_j \in \mathcal{T}$ generates some utility denoted by $U_{T_j}(a)$; $U_{T_0}(a) = 0$ for any $a$.
It is important to distinguish between a target utility, $U_{T_j}(a)$, and a vehicle utility, $U_{V_i}(a)$. The realized target utility represents the overall value for engaging target $T_j$, whereas a vehicle utility partly reflects vehicle $V_i$'s share of that value. Furthermore, it may be that vehicle $V_i$ shares this reward even if it did not engage target $T_j$. This will depend on the final specification of vehicle utilities.

We will assume that the utility generated by an engagement with target $T_j$ depends only on the characteristics of target $T_j$ and the vehicles engaging target $T_j$. This is stated more precisely in the following assumption.
ASSUMPTION 3.1. Let $a$ and $\tilde{a}$ be two action profiles in $\mathcal{A}$, and for any target, $T_j \in \mathcal{T}$, define the sets

$$S_j = \{V_i \in \mathcal{V} : a_i = T_j\} \quad \text{and} \quad \tilde{S}_j = \{V_i \in \mathcal{V} : \tilde{a}_i = T_j\}$$

Then,

$$S_j = \tilde{S}_j \Rightarrow U_{T_j}(a) = U_{T_j}(\tilde{a})$$
We now define the global utility to be the total sum of the utilities generated by all engagements, i.e.,

$$U_g(a) = \sum_{T_j \in \mathcal{T}} U_{T_j}(a) \qquad (2)$$

This summation is only one approach to aggregating the target utility functions. See [11] for a more general discussion from the perspective of multiobjective optimization.
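Under Assumption 3.1, each target utility depends only on the set of vehicles engaging that target, so Eq. (2) can be evaluated engagement by engagement. A minimal sketch, where the evaluator `U_T(j, S)` is an assumed interface returning the utility of target $j$ when engaged by the vehicle set $S$:

```python
# Sketch of the global utility (2) under Assumption 3.1; the null target
# (index 0) contributes zero by convention.
def engaging_set(a, j):
    """S_j: indices of the vehicles assigned to target j under profile a."""
    return frozenset(i for i, a_i in enumerate(a) if a_i == j)

def global_utility(a, targets, U_T):
    return sum(U_T(j, engaging_set(a, j)) for j in targets if j != 0)
```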
It will be convenient to model an engagement with a target as a
random event that is assumed to be independent of the other target
engagements. At the end of an engagement, the target and some of
the engaging vehicles are destroyed with certain probability. The
statistics of the outcome of an engagement depend on the charac-
teristics of the target as well as the composition of the engaging
vehicles. As an example, it may be the case that only a particular
team of vehicles may destroy a particular target with reasonable
probability. In this case, the utility generated by an engagement is
taken to be the expected difference between the value of a de-
stroyed target and the total value of the destroyed vehicles. These
issues are discussed further for the well-known weapon-target as-
signment problem in Sec. 5.
An important consideration in specifying the vehicle utilities, $U_{V_i}(a)$, $i = 1, \dots, n_v$, is to make them "aligned" with the global utility, $U_g(a)$. Ideally, this means that the vehicles can only agree on an optimal assignment profile, i.e., an assignment profile that maximizes the global utility. Because it is not always straightforward to achieve the alignment of the vehicle utilities with the global utility in this ideal sense without first calculating an optimal assignment, we adopt a more relaxed notion of alignment from [10]. That is, a vehicle can improve its own utility by unilateral action if and only if the same unilateral action also improves the global utility.
DEFINITION 3.1 (ALIGNMENT). We will say that a set of vehicle utilities $U_{V_i}(a)$, $V_i \in \mathcal{V}$, is aligned¹ with the global utility $U_g(a)$ when the following condition is satisfied. For every vehicle, $V_i \in \mathcal{V}$, for every $a_{-i} \in \mathcal{A}_{-i}$, and for every $a_i', a_i'' \in \mathcal{A}_i$,

$$U_{V_i}(a_i', a_{-i}) - U_{V_i}(a_i'', a_{-i}) > 0 \iff U_g(a_i', a_{-i}) - U_g(a_i'', a_{-i}) > 0 \qquad (3)$$

We see that the notion of alignment coincides with the notion of ordinal potential games in Definition 2.1.

¹The notion of alignment we adopt here is called factoredness in [10].
It turns out that alignment does not rule out pure Nash equilib-
ria that may be suboptimal from the global utility perspective.
Moreover, such suboptimal pure Nash equilibria may even yield
the highest utilities to all vehicles and hence may be efficient.
Nevertheless, alignment also guarantees that the optimal assign-
ment profiles are always included in the set of pure Nash equilib-
ria; hence, they are agreeable to the vehicles even though they
may be inefficient.
The above discussion on alignment is summarized by the fol-
lowing proposition, whose proof is straightforward.
PROPOSITION 3.1. Let $a^{\mathrm{opt}}$ denote an optimal assignment profile, i.e.,

$$a^{\mathrm{opt}} \in \arg\max_{a \in \mathcal{A}} U_g(a)$$

Under the alignment condition (3), the resulting game is an ordinal potential game that has $a^{\mathrm{opt}}$ as a (possibly nonunique) pure Nash equilibrium.
3.1 Identical Interest Utility (IIU). One obvious, but ultimately ineffective, way of making the vehicle utilities aligned with the global utility is to set all vehicle utilities to the global utility. In game-theory terminology, setting

$$U_{V_i}(a) = U_g(a), \quad \text{for all vehicles } V_i \in \mathcal{V} \qquad (4)$$
results in an identical interest game. Obviously, an identical interest game with $U_{V_i}(a) = U_g(a)$, for all vehicles $V_i \in \mathcal{V}$, is also a potential game with the potential $U_g(a)$, and hence, the vehicle utilities (4) are aligned with the global utility. In fact, optimal assignments in this case yield the highest vehicle utilities and therefore are efficient. However, suboptimal Nash equilibria may still exist.
As will be seen later, the vehicles negotiate by proposing tar-
gets and responding to the previous target assignment proposals
that are exchanged among the vehicles. Each vehicle whose utility
is set to the global utility needs to know (i) the proposals made by all other vehicles as well as (ii) the characteristics of all the vehicles and the targets to be able to generate a new proposal. The reason for this is that vehicle $V_i$'s utility would depend on all engagements with all targets, including those that are not in $\mathcal{A}_i$.
Therefore, when the vehicle utilities are set to the global utility,
continuous dissemination of global information is required among
the vehicles.
3.2 Range-Restricted Utility (RRU). A possible way of making the vehicle utilities more localized than IIU would be to set the utility of vehicle $V_i$ equal to the sum of the utilities generated by the engagements with the targets that belong to vehicle $V_i$'s target set $\mathcal{A}_i$, i.e.,

$$U_{V_i}(a) = \sum_{T_j \in \mathcal{A}_i} U_{T_j}(a), \quad \text{for all vehicles } V_i \in \mathcal{V} \qquad (5)$$

Note that in this case the global information requirement on the vehicles is alleviated. Moreover, the vehicle utilities (5) are still aligned with the global utility. This guarantees that the optimal assignments are agreeable to the vehicles, but they may be inefficient; see Example 3.3. In fact, the vehicle utilities lead to a potential game; see [7]. The following proposition is an immediate consequence of Assumption 3.1.

PROPOSITION 3.2. Vehicle utilities that satisfy (5) form a potential game with the global utility $U_g(a)$ serving as a potential function.

Note that when all vehicles have the same set of available targets, i.e., $\mathcal{A}_1 = \cdots = \mathcal{A}_{n_v}$, then (5) leads to an identical interest game.
A concern regarding vehicle utilities (4) (and possibly (5)) stems from the so-called learnability issue introduced in [10]. That is, a vehicle may not be able to influence its own utility in a
significant way when a large number of vehicles can assign them-
selves to the same large set of targets. In this case, since the utility
of a vehicle is the total sum of the utilities generated by a large
number of engagements involving a large number of targets and
vehicles, the proposals made by an individual vehicle may not
have any significant effect on its own utility. Hence, a negotiating
vehicle may find itself approximately indifferent to the available
target choices if the negotiation mechanism employed is utility
based, i.e., the vehicle proposes targets in response to the actual
utilities corresponding to its past proposals, as in reinforcement
learning.
3.3 Equally Shared Utility (ESU). One way to limit the influence of other vehicles on vehicle $V_i$'s utility is to set

$$U_{V_i}(a) = \frac{U_{T_j}(a)}{n_{T_j}(a)}, \quad \text{if } a_i = T_j \qquad (6)$$

where $n_{T_j}(a)$ is the total number of vehicles engaging target $T_j$. The rationale behind (6) is to distribute the utility generated by an engagement equally among the engaging vehicles. Note that in this case vehicle $V_i$'s utility is independent of the engagements in which vehicle $V_i$ does not participate.

Even though the total sum of vehicle utilities (6) equals the global utility, it turns out that (6) need not be exactly aligned with the global utility.
Example 3.1. Consider two targets $T_1$ and $T_2$ with values 2 and 10, respectively, and two anonymous vehicles $V_1$ and $V_2$, i.e., $V_1$ and $V_2$ have identical characteristics. Assume that each vehicle is individually capable of destroying any one of the targets with probability 1, while the targets in no case have any chance of destroying any of the vehicles. The vehicle utilities in this example can be represented in the matrix form shown in Fig. 2, where if vehicle $V_i \in \{V_1, V_2\}$ chooses target $a_i \in \{T_0, T_1, T_2\}$, then the first number (respectively, the second number) in the entry $(a_1, a_2)$ represents the utility to the first vehicle (respectively, to the second vehicle). The global planner would of course prefer each vehicle to engage a different target, since this would yield a maximal global utility of 12. However, such an optimal assignment profile might leave the vehicle engaging the low-value target unsatisfied with a utility of 2, and this unsatisfied vehicle might be able to improve its utility to 5 by unilaterally switching to the high-value target, at the expense of lowering the global utility to 10. Because of the misalignment of (6) with the global utility in this example, an optimal assignment profile may not be agreeable to all vehicles, whereas the vehicles may find the suboptimal Nash equilibrium assignment $(a_1, a_2) = (T_2, T_2)$ agreeable.

Fig. 2 Misaligned vehicle utilities
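The numbers in this example are easy to verify programmatically; the following sketch (hypothetical code, not from the paper) reproduces both conclusions under the equally shared utility:

```python
# Numerical check of Example 3.1 under ESU: (T1, T2) is optimal but not a pure
# Nash equilibrium, while the suboptimal (T2, T2) is a pure Nash equilibrium.
values = {1: 2.0, 2: 10.0}           # target values; 0 is the null target

def esu(a, i):
    """Equally shared utility (6): each vehicle destroys its target w.p. 1."""
    if a[i] == 0:
        return 0.0
    team = sum(1 for a_m in a if a_m == a[i])
    return values[a[i]] / team

def is_nash(a):
    return all(esu(a, i) >= esu(a[:i] + (alt,) + a[i+1:], i)
               for i in range(2) for alt in (0, 1, 2))

print(is_nash((1, 2)))   # False: the vehicle on T1 prefers to join T2 (5 > 2)
print(is_nash((2, 2)))   # True: the suboptimal assignment is agreeable
```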
However, in the case of anonymous vehicles, (6) does lead to a potential game.
DEFINITION 3.2 (ANONYMITY). Vehicles are anonymous if for any permutation

$$\sigma: \{1, 2, \dots, n_v\} \to \{1, 2, \dots, n_v\}$$

and for any two assignments, $a$ and $\tilde{a}$, related by

$$\tilde{a}_i = a_{\sigma(i)}, \quad \forall i \in \{1, 2, \dots, n_v\}$$

the equality

$$U_{T_j}(a) = U_{T_j}(\tilde{a})$$

holds for any target $T_j$.

As the terminology implies, the utility generated by an engagement with a target does not depend on the identities of the vehicles engaging the target, but only on the number of vehicles engaging the target.
PROPOSITION 3.3. Anonymous vehicles with utilities that satisfy (6) form a potential game with potential function

$$\phi(a) = \sum_{T_j \in \mathcal{T}} \sum_{\ell=1}^{n_{T_j}(a)} \frac{U_{T_j}(\ell)}{\ell}$$

where $n_{T_j}(a)$ is the total number of vehicles assigned to target $T_j$ and $U_{T_j}(\ell)$ is the utility generated by an engagement of $\ell$ anonymous vehicles with target $T_j$.
Hence, in the case of anonymous vehicles, (6) is aligned with the above potential function, which is the same potential function introduced in [12] in the context of so-called congestion games, but different from the global utility function $U_g(a)$. The significance of this observation is that the existence of a potential function associated with the vehicle utilities guarantees the existence of agreeable (possibly suboptimal) assignment profiles in the form of pure Nash equilibria. Furthermore, there exist learning algo-
rithms that are known to converge in potential games and these
convergent learning algorithms can be used by the vehicles as
negotiation mechanisms always leading to a settlement on an as-
signment profile. If the vehicles are not anonymous, then the mis-
alignment of the vehicle utilities (6) with the global utility can be
even more severe.
Example 3.2. Consider two targets $T_1$ and $T_2$ with values 10 each, and two distinguishable vehicles, $V_1$ and $V_2$, with values 2 each. Assume that vehicle $V_1$ is individually capable of destroying any one of the targets with probability one, and neither of the targets is ever capable of destroying $V_1$. Assume further that vehicle $V_2$ is never capable of destroying any of the targets, and any one of the targets can destroy vehicle $V_2$ with probability one. This setup leads to the vehicle utilities shown in Fig. 3. In this example, the two vehicles may not be able to agree on any assignment profile, optimal or suboptimal, because while vehicle $V_1$ would be better off engaging a target alone, vehicle $V_2$ would be better off engaging a target together with vehicle $V_1$. Yet, the global planner would prefer vehicle $V_1$ engaging one of the targets and vehicle $V_2$ not engaging any target. If these two vehicles were to use a negotiation mechanism that allows settlement only on a pure Nash equilibrium, then they would not be able to agree on any assignment because a pure Nash equilibrium does not exist in this example. A mixed, but not pure, Nash equilibrium is still guaranteed to exist, but would not lead to an agreement on a particular assignment. Therefore, in the distinguishable-vehicles case, the vehicle utilities (6) might lead to a situation where the vehicles are not only in conflict with the global planner but also in conflict among themselves.

Fig. 3 Misaligned vehicle utilities with no pure Nash equilibrium
3.4 Wonderful Life Utility (WLU). A solution to the problem of designing individual utility functions that are more learnable than (4) or (5) and still aligned with the global utility is offered in [10] in the form of a family of utility structures called the wonderful life utility. In our context, a particular WLU structure would be obtained by setting the utility of a vehicle to the marginal contribution made by the vehicle to the global utility, i.e.,

$$U_{V_i}(a_i, a_{-i}) = U_g(a_i, a_{-i}) - U_g(T_0, a_{-i}), \quad \text{for all vehicles } V_i \in \mathcal{V} \qquad (7)$$

From the definition of the global utility (2), the WLU (7) can be written as

$$U_{V_i}(a_i, a_{-i}) = U_{T_j}(a_i, a_{-i}) - U_{T_j}(T_0, a_{-i}), \quad \text{if } a_i = T_j$$

for all vehicles $V_i \in \mathcal{V}$, which means that the utility of a vehicle is its marginal contribution to the utility generated by the engagement in which the vehicle participates. WLU is expected to make each vehicle's utility more learnable by removing the unnecessary dependencies on other vehicles' assignment decisions, while still keeping the vehicle utilities aligned with the global utility. It turns out that WLU (7) also leads to a potential game with the global utility being the potential function.

PROPOSITION 3.4. Vehicle utilities that satisfy (7) form a potential game with the global utility $U_g(a)$ serving as a potential function.

Another interpretation of the WLU is that a vehicle is rewarded with a side payment equal to the externality it may create by not assigning itself to any target, which is the idea behind "internalizing the externalities" in economics [13].
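A minimal sketch of Eq. (7) in code, assuming a per-target evaluator `target_utility(j, team)` that respects Assumption 3.1 (both names are illustrative, not from the paper):

```python
# Wonderful life utility (7): a vehicle's utility is its marginal contribution
# to the engagement it joins; opting out (the null target, index 0) yields zero.
def wlu(a, i, target_utility):
    j = a[i]
    if j == 0:
        return 0.0
    team = frozenset(m for m, a_m in enumerate(a) if a_m == j)
    with_i = target_utility(j, team)             # U_Tj(a_i, a_-i)
    without_i = target_utility(j, team - {i})    # U_Tj(T0, a_-i)
    return with_i - without_i
```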
3.5 Comparisons. Each of the vehicle utilities IIU (4), RRU (5), and WLU (7) leads to a potential game with the global utility function being the potential function, and hence, they are aligned with the global utility. This guarantees that the optimal assignments are in each case included in the set of pure Nash equilibria. However, in each case, there may also be suboptimal Nash equilibria that may be pure and/or mixed. There is ample evidence in the literature that a mixed equilibrium cannot emerge as a stable outcome of vehicle negotiations, particularly in potential games (e.g., [14]). However, a suboptimal pure Nash equilibrium can emerge as a stable outcome, depending on the negotiation mechanism used by the vehicles.
Example 3.3. Consider $N \geq 2$ vehicles, $V_1, \dots, V_N$, and $N+1$ targets, $T_1, \dots, T_{N+1}$, where $\mathcal{A}_i = \{T_i, T_{N+1}\}$. Assume that any vehicle $V_i$ engaging target $T_i$ generates 1 unit of utility. Assume also that an engagement with target $T_{N+1}$ generates 0 utility unless all vehicles engage $T_{N+1}$, in which case they generate 2 units of utility. Clearly, the optimal assignment is given by $a^* = (T_1, T_2, \dots, T_N)$. The optimal assignment profile $a^*$ is a pure Nash equilibrium when the vehicle utilities are given by any of (4), (5), or (7). However, there is another pure Nash equilibrium $a^{**} = (T_{N+1}, T_{N+1}, \dots, T_{N+1})$ for any of the vehicle utilities (4), (5), or (7), which is suboptimal with respect to the global utility. The global utility and the vehicle utilities corresponding to $a^*$ and $a^{**}$ are summarized as follows:

$$U_g(a^*) = N, \quad U_g(a^{**}) = 2$$

$$U_{V_i}(a^*) = N, \quad U_{V_i}(a^{**}) = 2 \quad \text{if vehicle utilities are given by (4)}$$

$$U_{V_i}(a^*) = 1, \quad U_{V_i}(a^{**}) = 2 \quad \text{if vehicle utilities are given by (5) or (7)}$$

Note that the optimality gap $N - 2$ between $a^*$ and $a^{**}$ can be arbitrarily large for large $N$. Note also that if the vehicle utilities are given by RRU (5) or WLU (7), the suboptimal Nash equilibrium $a^{**}$ yields higher utilities to all vehicles than the optimal Nash equilibrium $a^*$.

In the case of RRU or WLU, if the negotiation mechanism employed by the vehicles were to eliminate the inefficient assignment profiles, the vehicles would never be able to agree on the optimal assignment $a^*$. This example illustrates the fact that the vehicle utilities cannot be designed independently of the negotiation mechanism employed by the vehicles.
4 Negotiation Mechanisms
The issue of which Nash equilibrium will emerge as a stable
outcome of vehicle negotiations is studied under the topic of equi-
librium selection in game theory. In this section, we will discuss
equilibrium selection and other important properties of some ne-
gotiation mechanisms. In particular, we will present a negotiation
mechanism from the literature that leads to an optimal Nash equi-
librium in potential games with arbitrarily high probability.
We will adopt various learning algorithms available in the lit-
erature for multiplayer games as vehicle negotiation mechanisms
to make use of the theoretical and computational tools provided
by game theory. The negotiation mechanisms that will be pre-
sented in this section will provide the vehicles with strategic
decision-making capabilities. In particular, each vehicle will ne-
gotiate with other vehicles without any knowledge about the utili-
ties of the other vehicles. One of the reasons for such a require-
ment is that the vehicles may not have the same information
regarding their environment. For example, a vehicle may not
know all the targets and/or the potential collaborating vehicles
available to another vehicle and, moreover, it may not be possible
to pass on such information due to limited communication band-
width. Another reason for the private utilities requirement is to
make the vehicles truly autonomous in the sense that each vehicle
is individually capable of making robust strategic decisions in
uncertain and adversarial environments. In this case, any indi-
vidual vehicle is cooperative with the other vehicles only to the
extent that cooperation helps the vehicle to maximize its own
utility, which is, of course, carefully designed by the global plan-
ner.
Accordingly, we will consider some negotiation mechanisms
that require each vehicle to know, at most, its own utility function,
the proposals made by the vehicle itself, and the proposals made
by those other vehicles that can influence the utility of the vehicle.
We will review these negotiation mechanisms in terms of conver-
gence, equilibrium selection, and computational efficiency. We
will present our review primarily in the context of potential
games, since many of the vehicle utility structures considered in
Sec. 3 fall into this category. In some cases, we will point to
existing results in the literature, while in some other cases we will
point to open problems.
4.1 Review: Selected Recursive Averaging Algorithms
4.1.1 Action-Based Fictitious Play. Action-based fictitious play, or simply FP, was originally introduced as a computational method to calculate the Nash equilibria in zero-sum games [15], but was later proposed as a learning mechanism in multiplayer games (cf. [8]).
One can also think of FP as a negotiation mechanism employed by the vehicles to select their targets. At each negotiation step, $k = 1, 2, \dots$, the vehicles simultaneously propose targets

$$a(k) \triangleq (a_1(k), \dots, a_{n_v}(k))$$

where $a_i(k) \in \mathcal{A}_i$ is the label of the target proposed by vehicle $V_i$. The objective is to construct a negotiation mechanism so that the proposed assignments, $a(k)$, ultimately converge for large $k$. FP is one such mechanism that is guaranteed to converge for potential games.
In FP, the target assignment proposals at stage $k$ are functions of past proposed assignments over the interval $[1, k-1]$ as follows. First, enumerate the targets available to vehicle $V_i$ as $\mathcal{A}_i = \{\mathcal{A}_i^1, \dots, \mathcal{A}_i^{|\mathcal{A}_i|}\}$. For any target index $j \in [1, |\mathcal{A}_i|]$, let $n^j(k; V_i)$ denote the total number of times vehicle $V_i$ proposed target $\mathcal{A}_i^j$ up to stage $k$. Now define the empirical frequency vector, $q_i(k) \in \mathbb{R}^{|\mathcal{A}_i|}$, of vehicle $V_i$ as follows:

$$q_i(k) = \left( \frac{n^1(k-1; V_i)}{k-1}, \frac{n^2(k-1; V_i)}{k-1}, \dots, \frac{n^{|\mathcal{A}_i|}(k-1; V_i)}{k-1} \right)$$

In words, $q_i(k)$ reflects the histogram of proposed target assignments by vehicle $V_i$ over the interval $[1, k-1]$. Note that the elements of the empirical frequency vector are all nonnegative and sum to unity. Therefore, $q_i(k)$ can be identified with a probability vector on the probability simplex of dimension $|\mathcal{A}_i|$.
We are now set to define the FP process. At stage $k$, vehicle $V_i$ selects its proposed assignment $a_i(k) \in \mathcal{A}_i$ in accordance with maximizing its expected utility as though all other vehicles make a simultaneous and independent random selection of their actions, $a_{-i}$, based on the product distribution defined by the empirical frequencies, $q_1(k), \dots, q_{i-1}(k), q_{i+1}(k), \dots, q_{n_v}(k)$, i.e.,

$$a_i(k) \in \arg\max_{\alpha \in \mathcal{A}_i} E_{a_{-i}}[U_{V_i}(\alpha, a_{-i})]$$

In case the maximizer is not unique, any maximizer will do.

One appealing property of FP is that the empirical frequencies generated by FP converge to the set of Nash equilibria in potential games [7,16]. Although the empirical frequencies may converge to a mixed Nash equilibrium while the proposals are cycling (see the related churning issue in [17]), it is generally believed that convergence of empirical frequencies to a mixed but not pure Nash equilibrium happens rarely when vehicle utilities are not equivalent to a zero-sum game [18,19]. Thus, if the vehicles negotiate using FP and their utilities constitute a potential game, then in most cases we can expect them to asymptotically reach an agreement on an assignment profile. We should also mention numerous stochastic versions of FP with similar convergence properties [20].
The main disadvantage of FP for the purposes of this paper is
its computational burden on each vehicle. The most computation-
ally intensive operation is the optimization of the utilities during
the negotiations, which effectively requires an enumeration of all possible combined assignments by other vehicles [21,22]. This
makes FP computationally prohibitive when there are large num-
bers of vehicles with large target sets. To make FP truly scalable,
it is clear that the vehicles need to evaluate their utilities more
directly without using the empirical frequencies.
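A sketch of one FP proposal step makes the computational burden concrete: the expected utility enumerates every joint assignment of the other vehicles. The `counts` and `U_i` interfaces are assumptions for illustration (valid for steps $k \geq 2$).

```python
# One action-based FP step for vehicle V_i (illustrative sketch): best-respond
# to the product of the other vehicles' empirical frequencies.
from itertools import product

def fp_propose(i, ranges, counts, k, U_i):
    # counts[m][t]: times vehicle m proposed target t over steps 1..k-1
    others = [m for m in range(len(ranges)) if m != i]
    freqs = [{t: counts[m][t] / (k - 1) for t in ranges[m]} for m in others]

    def expected_utility(alpha):
        total = 0.0
        # the costly part: enumerate all combined assignments of the others
        for joint in product(*[sorted(ranges[m]) for m in others]):
            prob = 1.0
            for q_m, t in zip(freqs, joint):
                prob *= q_m[t]
            profile = list(joint)
            profile.insert(i, alpha)          # splice V_i's candidate back in
            total += prob * U_i(tuple(profile))
        return total

    return max(ranges[i], key=expected_utility)
```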
4.1.2 Utility-Based FP. The distinction between action-based and utility-based FP (see [23,24]) is that the vehicles predict their utilities during the negotiations based on the actual utilities corresponding to the previous proposals. Utility-based FP is in essence a multiagent reinforcement learning algorithm [25,26]. The difference is that in reinforcement learning, the utility evaluation is based on experience, whereas in utility-based FP, it is based on a call to a simulated utility function evaluator.
The main advantage of utility-based FP is its very low compu-
tational burden on each vehicle. In particular, the vehicles do not
need to compute the empirical frequencies of the past proposals
made by any vehicle and do not need to compute their expected
utilities based on the empirical frequencies. It only requires an individual vehicle to process a (state) vector whose dimension is its number of targets and to select a (randomized) maximizer. This
significantly alleviates the computational bottleneck of FP. How-
ever, the convergence of utility-based FP for potential games is
still an open issue.
There are also other utility-based learning algorithms that are proven to converge in partnership games [27–29]. These algorithms are similar to multiagent reinforcement learning algorithms and have a computational burden comparable to that of utility-based FP. However, convergence requires fine tuning of various parameters, such as the learning rates of each agent. Moreover, utility-based learning algorithms are prone to the learnability issue and may exhibit slower convergence than action-based FP.
4.1.3 Regret Matching. The discussion on FP in Sec. 4.1.2 motivates a learning algorithm that is computationally feasible as well as convergent in potential games, both theoretically and practically. Accordingly, we introduce regret matching, from [30], whose main distinction is that the vehicles propose targets based on their regret for not proposing particular targets in the past negotiation steps.
As before, let us enumerate the targets available to vehicle $V_i$ as $\mathcal{A}_i = \{\mathcal{A}_i^1, \dots, \mathcal{A}_i^{|\mathcal{A}_i|}\}$. Vehicle $V_i$ selects its proposed target, $a_i(k)$, according to a probability distribution, $p_i(k) \in \Delta(|\mathcal{A}_i|)$, that will be specified shortly. The $\ell$th component, $p_i^{\ell}(k)$, of $p_i(k)$ is the probability that vehicle $V_i$ selects the $\ell$th target in $\mathcal{A}_i$ at the negotiation step $k$, i.e., $p_i^{\ell}(k) = \mathrm{Prob}[a_i(k) = \mathcal{A}_i^{\ell}]$. Vehicle $V_i$ does not know the utility $U_{V_i}(a(k))$ before proposing its own target $a_i(k)$. Accordingly, before selecting $a_i(k)$, $k > 1$, vehicle $V_i$ computes its average regret

$$R_{V_i}^{\ell}(k) \triangleq \frac{1}{k-1} \sum_{m=1}^{k-1} \left[ U_{V_i}(\mathcal{A}_i^{\ell}, a_{-i}(m)) - U_{V_i}(a(m)) \right]$$

for not proposing $\mathcal{A}_i^{\ell}$ in all past negotiation steps, assuming that the proposed targets of all other vehicles remain unaltered. Clearly, vehicle $V_i$ can compute $R_{V_i}^{\ell}(k)$ using the recursion
$$R_{V_i}^{\ell}(k+1) = \frac{k-1}{k} R_{V_i}^{\ell}(k) + \frac{1}{k} \left[ U_{V_i}(\mathcal{A}_i^{\ell}, a_{-i}(k)) - U_{V_i}(a(k)) \right], \quad k \geq 1$$

We note that, at any step $k > 1$, vehicle $V_i$ updates all entries in its average regret vector $R_{V_i}(k) \triangleq (R_{V_i}^1(k), \dots, R_{V_i}^{|\mathcal{A}_i|}(k))^T$, whose dimension is $|\mathcal{A}_i|$. In particular, the vehicles do not need to compute the empirical frequencies of the past proposals made by any vehicle and do not need to compute their expected utilities based on the empirical frequencies. We also note that it is sufficient for vehicle $V_i$, at step $k > 1$, to have access to $a_i(k-1)$ and $U_{V_i}(\mathcal{A}_i^{\ell}, a_{-i}(k-1))$ for all $\ell \in \{1, \dots, |\mathcal{A}_i|\}$. In other words, it is sufficient for vehicle $V_i$ to have access to its proposal at step $k-1$ and its actual utility $U_{V_i}(a(k-1))$ received at step $k-1$, as well as its hypothetical utilities $U_{V_i}(\mathcal{A}_i^{\ell}, a_{-i}(k-1))$, which would have been received if it had proposed target $\mathcal{A}_i^{\ell}$ [instead of $a_i(k-1)$] and all other vehicle proposals $a_{-i}(k-1)$ had remained unchanged at step $k-1$.

Once vehicle $V_i$ computes its average regret vector, $R_{V_i}(k)$, it proposes a target $a_i(k)$, $k > 1$, according to the probability distribution

$$p_i(k) = \frac{[R_{V_i}(k)]^+}{\mathbf{1}^T [R_{V_i}(k)]^+}$$

provided that the denominator above is positive; otherwise, $p_i(k)$ is the uniform distribution over $\mathcal{A}_i$ ($p_i(1) \in \Delta(|\mathcal{A}_i|)$ is always arbitrary). Roughly speaking, a vehicle using regret matching proposes a particular target at any step with probability proportional to the average regret for not playing that particular target in the past negotiation steps. It turns out that the average regret of a vehicle using regret matching would asymptotically vanish (similar results hold for different regret-based adaptive dynamics; see [30–32]). Although this result characterizes the long-term behavior of regret matching in general games, it need not imply that the negotiations of vehicles using regret matching will converge to a pure equilibrium assignment profile when vehicle utilities constitute a potential game, an objective which we will pursue in Sec. 4.2.
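A compact sketch of one regret-matching step follows. It assumes vehicle $V_i$ can evaluate hypothetical utilities of the form $U_{V_i}(\mathcal{A}_i^{\ell}, a_{-i}(k-1))$; the `U_i(alt, a)` interface below (utility of profile `a` with $V_i$'s entry replaced by `alt`) is an assumption for illustration.

```python
# One regret-matching step (Sec. 4.1.3): update the average regret vector R
# recursively, then sample the next proposal proportionally to positive regrets.
import random

def rm_step(R, k, a_last, my_targets, i, U_i):
    realized = U_i(a_last[i], a_last)
    for idx, alt in enumerate(my_targets):        # update every regret entry
        hypothetical = U_i(alt, a_last)
        R[idx] = ((k - 1) / k) * R[idx] + (hypothetical - realized) / k
    positive = [max(r, 0.0) for r in R]
    if sum(positive) > 0:
        weights = positive
    else:
        weights = [1.0] * len(my_targets)         # uniform if no positive regret
    return random.choices(my_targets, weights=weights)[0]
```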
4.2 Generalized Regret Monitoring With Fading Memory and Inertia. To enable convergence to a pure equilibrium in potential games, we will modify regret matching in two ways. First, we will assume that each vehicle has a fading memory; that is, each vehicle exponentially discounts the influence of its past regret in the computation of its average regret vector. More precisely, each vehicle computes a discounted average regret vector according to the recursion

$$\tilde{R}_{V_i}^{\ell}(k+1) = (1-\rho) \tilde{R}_{V_i}^{\ell}(k) + \rho \left[ U_{V_i}(\mathcal{A}_i^{\ell}, a_{-i}(k)) - U_{V_i}(a(k)) \right], \quad \text{for all } \ell \in \{1, \dots, |\mathcal{A}_i|\}$$

where $\rho \in (0,1]$ is a parameter with $1-\rho$ being the discount factor, and $\tilde{R}_{V_i}^{\ell}(1) = 0$.
Second, we will assume that each vehicle proposes a target based on its discounted average regret using some inertia. Therefore, each vehicle $V_i$ proposes a target $a_i(k)$, at step $k > 1$, according to the probability distribution

$$p_i(k) = \epsilon_i(k) \, \mathrm{RM}_i(\tilde{R}_{V_i}(k)) + (1 - \epsilon_i(k)) \, v_{a_i(k-1)}$$

where $\epsilon_i(k)$ is a parameter representing vehicle $V_i$'s willingness to optimize at time $k$, $v_{a_i(k-1)}$ is the vertex of $\Delta(|\mathcal{A}_i|)$ corresponding to the target $a_i(k-1)$ proposed by vehicle $V_i$ at step $k-1$, and $\mathrm{RM}_i: \mathbb{R}^{|\mathcal{A}_i|} \to \Delta(|\mathcal{A}_i|)$ is any continuous function satisfying

$$x^{\ell} > 0 \Leftrightarrow \mathrm{RM}_i^{\ell}(x) > 0, \quad \text{and} \quad \mathbf{1}^T x^+ = 0 \Rightarrow \mathrm{RM}_i(x) = \frac{1}{|\mathcal{A}_i|}\mathbf{1} \qquad (8)$$

where $x^{\ell}$ and $\mathrm{RM}_i^{\ell}(x)$ are the $\ell$th components of $x$ and $\mathrm{RM}_i(x)$, respectively.
We will call the above dynamics generalized regret monitoring (RM) with fading memory and inertia. The reason behind the term "monitoring" is that the algorithm leaves unspecified how an agent reacts to regrets through the function $\mathrm{RM}_i(\cdot)$. One particular choice for the function $\mathrm{RM}_i$ is

$$\mathrm{RM}_i(x) = \frac{x^+}{\mathbf{1}^T x^+} \quad (\text{when } \mathbf{1}^T x^+ > 0)$$

which leads to regret matching with fading memory and inertia. Another particular choice is

$$\mathrm{RM}_i^{\ell}(x) = \frac{e^{(1/\tau) x^{\ell}}}{\sum_{m : x^m > 0} e^{(1/\tau) x^m}} \, I\{x^{\ell} > 0\} \quad (\text{when } \mathbf{1}^T x^+ > 0) \qquad (9)$$

where $\tau > 0$ is a parameter. Note that, for small values of $\tau$, vehicle $V_i$ would choose, with high probability, the target corresponding to the maximum regret. This choice leads to a stochastic variant of an algorithm called joint strategy fictitious play with fading memory and inertia; see [22]. Also, note that, for large values of $\tau$, $V_i$ would choose any target having positive regret with approximately equal probability.
According to these rules, vehicle $V_i$ will stay with its previous proposal $a_i(k-1)$ with probability $1 - \epsilon_i(k)$ regardless of its regret. We make the following standing assumption on the vehicles' willingness to optimize.

ASSUMPTION 4.1. There exist constants $\underline{\epsilon}$ and $\bar{\epsilon}$ such that

$$0 < \underline{\epsilon} < \epsilon_i(k) < \bar{\epsilon} < 1$$

for all time $k \geq 1$ and for all $i \in \{1, \dots, n_v\}$.

This assumption implies that vehicles are always willing to optimize with some nonzero inertia.² The following theorem establishes the convergence of generalized regret monitoring with fading memory and inertia to a pure equilibrium.

²This assumption can be relaxed to holding for sufficiently large $k$, as opposed to all $k$.
THEOREM 4.1. Assume that vehicle utilities constitute an ordinal potential game³ and no vehicle is indifferent between distinct strategies, i.e.,

$$U_{V_i}(a_i^1, a_{-i}) \neq U_{V_i}(a_i^2, a_{-i}), \quad \forall a_i^1, a_i^2 \in \mathcal{A}_i, \; a_i^1 \neq a_i^2, \; \forall a_{-i} \in \mathcal{A}_{-i}, \; \forall i \in \{1, \dots, n_v\}$$

Then, the target proposals $a(t)$ generated by generalized regret monitoring with fading memory and inertia satisfying Assumption 4.1 converge to a pure Nash equilibrium almost surely.

³This theorem also holds in the more general class of weakly acyclic games; see [33].
Proof. We will state and prove a series of claims. The first claim states that if a vehicle proposes a target with positive (discounted) average regret, then all subsequent target proposals will also have positive regret.

CLAIM 4.1. Fix any $k_0 \geq 1$. Then, $\tilde{R}_{V_i}^{a_i(k_0)}(k_0) > 0 \Rightarrow \tilde{R}_{V_i}^{a_i(k)}(k) > 0$ for all $k \geq k_0$.

Proof. Suppose $\tilde{R}_{V_i}^{a_i(k_0)}(k_0) > 0$. If $a_i(k_0+1) = a_i(k_0)$, then

$$\tilde{R}_{V_i}^{a_i(k_0+1)}(k_0+1) = (1-\rho) \tilde{R}_{V_i}^{a_i(k_0)}(k_0) > 0$$

If $a_i(k_0+1) \neq a_i(k_0)$, then $a_i(k_0+1)$ must have been selected through $\mathrm{RM}_i$, which assigns positive probability only to targets with positive regret, so

$$\tilde{R}_{V_i}^{a_i(k_0+1)}(k_0+1) > 0$$

The argument can be repeated to show that $\tilde{R}_{V_i}^{a_i(k)}(k) > 0$, for all $k \geq k_0$.
Define

$$M_u \triangleq \max\{U_{V_i}(a) : a \in \mathcal{A}, \, V_i \in \mathcal{V}\}$$

$$m_u \triangleq \min\{U_{V_i}(a) : a \in \mathcal{A}, \, V_i \in \mathcal{V}\}$$

$$\delta \triangleq \min\{|U_{V_i}(a^1) - U_{V_i}(a^2)| : a^1, a^2 \in \mathcal{A}, \, a_{-i}^1 = a_{-i}^2, \, |U_{V_i}(a^1) - U_{V_i}(a^2)| > 0, \, V_i \in \mathcal{V}\}$$

$$N \triangleq \min\left\{ n \in \{1, 2, \dots\} : [1-(1-\rho)^n]\delta - (1-\rho)^n (M_u - m_u) \geq \frac{\delta}{2} \right\}$$

$$f \triangleq \min\left\{ \mathrm{RM}_i^m(x) : \|x\|_{\infty} \leq M_u - m_u, \, x^m \geq \frac{\delta}{2} \text{ for some } m, \, V_i \in \mathcal{V} \right\}$$

Note that $\delta, f > 0$, and $|\tilde{R}_{V_i}^{a_i}(k)| \leq M_u - m_u$, for all $V_i \in \mathcal{V}$, $a_i \in \mathcal{A}_i$, $k \geq 1$.
The second claim states that if the current proposal is a strict Nash equilibrium and if the proposal is repeated a sufficient number of times, then all subsequent proposals will also be that Nash equilibrium.

CLAIM 4.2. Fix $k_0 \geq 1$. Assume

1. $a(k_0)$ is a strict Nash equilibrium, and
2. $\tilde{R}_{V_i}^{a_i(k_0)}(k_0) > 0$ for all $V_i \in \mathcal{V}$, and
3. $a(k_0) = a(k_0+1) = \cdots = a(k_0+N-1)$.

Then, $a(k) = a(k_0)$, for all $k \geq k_0$.

Proof. For any $V_i \in \mathcal{V}$ and any $a_i \in \mathcal{A}_i$, we have

$$\tilde{R}_{V_i}^{a_i}(k_0+N) = (1-\rho)^N \tilde{R}_{V_i}^{a_i}(k_0) + [1-(1-\rho)^N]\{U_{V_i}(a_i, a_{-i}(k_0)) - U_{V_i}(a_i(k_0), a_{-i}(k_0))\}$$

Since $a(k_0)$ is a strict Nash equilibrium, for any $V_i \in \mathcal{V}$ and any $a_i \in \mathcal{A}_i$, $a_i \neq a_i(k_0)$, we have

$$U_{V_i}(a_i, a_{-i}(k_0)) < U_{V_i}(a_i(k_0), a_{-i}(k_0))$$

Therefore,

$$\tilde{R}_{V_i}^{a_i}(k_0+N) \leq (1-\rho)^N (M_u - m_u) - [1-(1-\rho)^N]\delta \leq -\frac{\delta}{2} < 0$$

We also know that, for all $V_i \in \mathcal{V}$,

$$\tilde{R}_{V_i}^{a_i(k_0)}(k_0+N) = (1-\rho)^N \tilde{R}_{V_i}^{a_i(k_0)}(k_0) > 0$$

This proves the claim.
The third claim states that if the current proposal is not a Nash equilibrium and if the proposal is repeated a sufficient number of times, then a subsequent assignment proposal will have a higher global utility with at least a fixed probability.

CLAIM 4.3. Fix $k_0 \geq 1$. Assume

1. $a(k_0)$ is not a Nash equilibrium, and
2. $a(k_0) = a(k_0+1) = \cdots = a(k_0+N-1)$.

Let $a^* = (a_i^*, a_{-i}(k_0))$ be such that

$$U_{V_i}(a_i^*, a_{-i}(k_0)) > U_{V_i}(a_i(k_0), a_{-i}(k_0))$$

for some $V_i \in \mathcal{V}$ and some $a_i^* \in \mathcal{A}_i$. Then, $\tilde{R}_{V_i}^{a_i^*}(k_0+N) \geq \delta/2$, and $a^*$ will be proposed at step $k_0+N$ with at least probability $\lambda \triangleq (1-\bar{\epsilon})^{n_v-1} \underline{\epsilon} f$.

Proof. We have

$$\tilde{R}_{V_i}^{a_i^*}(k_0+N) \geq -(1-\rho)^N (M_u - m_u) + [1-(1-\rho)^N]\delta \geq \frac{\delta}{2}$$

Therefore, the probability of vehicle $V_i$ proposing $a_i^*$ at step $k_0+N$ is at least $\underline{\epsilon} f$. Because of the players' inertia, the probability that all vehicles will propose the action profile $a^*$ at step $k_0+N$ is at least $(1-\bar{\epsilon})^{n_v-1} \underline{\epsilon} f$.
The fourth claim specifies an event, and an associated probability, that guarantees that all vehicles will subsequently propose only targets with positive regret.

CLAIM 4.4. Fix $k_0 \geq 1$. We have $\tilde{R}_{V_i}^{a_i(k)}(k) > 0$ for all $k \geq k_0 + 2Nn_v$ and for all $V_i \in \mathcal{V}$ with probability at least

$$\prod_{i=1}^{n_v} \frac{1}{|\mathcal{A}_i|} \, \underline{\epsilon} \, \lambda \, (1-\bar{\epsilon})^{2Nn_v}$$

Proof. Let $a^0 \triangleq a(k_0)$. Suppose $\tilde{R}_{V_i}^{a_i^0}(k_0) \leq 0$ (the complementary case is covered by Claim 4.1). Furthermore, suppose that $a^0$ is repeated $N$ consecutive times, i.e., $a(k_0) = \cdots = a(k_0+N-1) = a^0$, which occurs with probability at least $[(1-\bar{\epsilon})^{n_v}]^{N-1}$.

If there exists an $a^* = (a_i^*, a_{-i}^0)$ such that $U_{V_i}(a^*) > U_{V_i}(a^0)$, then, by Claim 4.3, $\tilde{R}_{V_i}^{a_i^*}(k_0+N) \geq \delta/2$ and $a^*$ will be proposed at step $k_0+N$ with at least probability $\lambda$. Conditioned on this, we know from Claim 4.1 that $\tilde{R}_{V_i}^{a_i(k)}(k) > 0$ for all $k > k_0+N$.

If there does not exist such an action $a^*$, then $\tilde{R}_{V_i}^{a_i}(k_0+N) \leq 0$ for all $a_i \in \mathcal{A}_i$. A proposal profile $(a_i^w, a_{-i}^0)$ with $U_{V_i}(a_i^w, a_{-i}^0) < U_{V_i}(a^0)$ will be proposed at step $k_0+N$ with at least probability $(1/|\mathcal{A}_i|)\underline{\epsilon}(1-\bar{\epsilon})^{n_v-1}$. If $a(k_0+N) = (a_i^w, a_{-i}^0)$, and if, furthermore, $(a_i^w, a_{-i}^0)$ is repeated $N$ consecutive times, i.e., $a(k_0+N) = \cdots = a(k_0+2N-1)$, which happens with probability at least $[(1-\bar{\epsilon})^{n_v}]^{N-1}$, then, by Claim 4.3, $\tilde{R}_{V_i}^{a_i^0}(k_0+2N) \geq \delta/2$ and the joint target $a^0$ will be proposed at step $k_0+2N$ with at least probability $\lambda$. Conditioned on this, we know from Claim 4.1 that $\tilde{R}_{V_i}^{a_i(k)}(k) > 0$ for all $k > k_0+2N$.

In summary, $\tilde{R}_{V_i}^{a_i(k)}(k) > 0$ for all $k \geq k_0+2N$ with probability at least

$$\frac{1}{|\mathcal{A}_i|} \, \underline{\epsilon} \, \lambda \, (1-\bar{\epsilon})^{2Nn_v}$$

We can repeat this argument for each vehicle to show that $\tilde{R}_{V_i}^{a_i(k)}(k) > 0$ for all times $k \geq k_0 + 2Nn_v$ and for all $V_i \in \mathcal{V}$ with probability at least

$$\prod_{i=1}^{n_v} \frac{1}{|\mathcal{A}_i|} \, \underline{\epsilon} \, \lambda \, (1-\bar{\epsilon})^{2Nn_v}$$
Final Step: Establishing Convergence to a Pure Nash Equilibrium. Fix $k_0 \geq 1$. Let $k_1 \triangleq k_0 + 2Nn_v$. Suppose $\tilde{R}_{V_i}^{a_i(k)}(k) > 0$ for all $k \geq k_1$ and for all $V_i \in \mathcal{V}$, which, by Claim 4.4, occurs with probability at least

$$\prod_{i=1}^{n_v} \frac{1}{|\mathcal{A}_i|} \, \underline{\epsilon} \, \lambda \, (1-\bar{\epsilon})^{2Nn_v}$$

Suppose further that $a(k_1) = \cdots = a(k_1+N-1)$, which occurs with at least probability $[(1-\bar{\epsilon})^{n_v}]^{N-1}$. If $a(k_1)$ is a Nash equilibrium, then, by Claim 4.2, we are done. Otherwise, according to Claim 4.3, a proposal profile $a' = (a_i', a_{-i}(k_1))$ with $U_{V_i}(a') > U_{V_i}(a(k_1))$ for some $V_i \in \mathcal{V}$ will be played at step $k_1+N$ with at least probability $\lambda$. Note that this would imply $U_g(a(k_1+N)) > U_g(a(k_1))$. Suppose now $a(k_1+N) = \cdots = a(k_1+2N-1)$, which occurs with at least probability $[(1-\bar{\epsilon})^{n_v}]^{N-1}$. If $a'$ is a Nash equilibrium, then, by Claim 4.2, we are done. Otherwise, according to Claim 4.3, a proposal profile $a'' = (a_i'', a_{-i}')$ with $U_{V_i}(a'') > U_{V_i}(a(k_1+N))$ for some $V_i \in \mathcal{V}$ will be played at step $k_1+2N$ with at least probability $\lambda$. Note that this would imply $U_g(a(k_1+2N)) > U_g(a(k_1+N))$.

Note that this procedure can only be repeated a finite number of times because the global utility is strictly increasing each time. We can repeat the above arguments until we reach a pure Nash equilibrium $a^*$ and stay at $a^*$ for $N$ consecutive steps. This means that there exist constants $\tilde{\epsilon} > 0$ and $\tilde{T} > 0$, both of which are independent of $k_0$, such that the following event happens with at least probability $\tilde{\epsilon}$: $a(k) = a^*$ for all $k \geq k_0 + \tilde{T}$. Since this lower bound is uniform in $k_0$, the event must eventually occur with probability one. This proves Theorem 4.1.
Note that an agreed assignment that emerges from generalized RM with fading memory and inertia can be suboptimal. Characterizing the equilibrium selection properties in potential games still remains an open problem. As in FP, the regret-based dynamics introduced above would require communication of proposed target assignments as part of a negotiation process. FP is guaranteed to converge for potential games but requires an individual vehicle to process the empirical frequencies of all other vehicles that affect its utility and to use these empirical frequencies to compute the maximizer of its expected utility. Generalized RM with fading memory and inertia is guaranteed to converge to a pure equilibrium in almost all (ordinal) potential games; however, its computational requirements are significantly lower. It only requires an individual vehicle to process an average regret vector whose dimension is its number of targets and to select a (randomized) target based on the positive part of its average regret vector.
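For concreteness, here is a sketch of one step of generalized RM with fading memory and inertia, instantiated with the regret-matching choice of $\mathrm{RM}_i$; the parameter values and the `U_i(alt, a)` interface are illustrative assumptions, not prescriptions from the paper.

```python
# One step of generalized RM with fading memory and inertia (Sec. 4.2):
# rho is the fading parameter, eps the willingness to optimize (Assumption 4.1).
import random

def grm_step(R_tilde, a_last, my_targets, i, U_i, rho=0.1, eps=0.3):
    realized = U_i(a_last[i], a_last)
    for idx, alt in enumerate(my_targets):        # discounted average regret
        R_tilde[idx] = (1 - rho) * R_tilde[idx] + \
                       rho * (U_i(alt, a_last) - realized)
    if random.random() > eps:                     # inertia: repeat last proposal
        return a_last[i]
    positive = [max(r, 0.0) for r in R_tilde]
    if sum(positive) == 0:
        return random.choice(my_targets)          # condition (8): uniform fallback
    return random.choices(my_targets, weights=positive)[0]
```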
4.3 Review: One-Step Memory Spatial Adaptive Play. The previous negotiation mechanisms were called recursive averaging algorithms since they maintained a running average (or fading-memory average) of certain variables, e.g., averaged actions of other players (FP) or averaged regret measures (RM). These algorithms have "infinite memory" in that the long-term effect of a measured variable may diminish but is never completely eliminated.

In this section, we will consider an opposite extreme, namely, a specific one-step memory algorithm called spatial adaptive play (SAP). SAP was introduced in [9] (Chap. 6, which also reviews other multistep memory algorithms) as a learning process for games played on graphs. SAP can be a very effective negotiation mechanism in our autonomous vehicle-target assignment problem because it would impose a low computational burden on each vehicle and it would lead to an optimal solution in potential games with arbitrarily high probability.
Unlike the other negotiation mechanisms we considered thus far, at any step of SAP negotiations, one vehicle is randomly chosen, where each vehicle is equally likely to be chosen, and only this chosen vehicle is given the chance to update its proposed target.⁴ Let $a(k-1)$ denote the profile of proposed targets at step $k-1$. At step $k$, the vehicle that is given the chance to update its proposed target, say vehicle $V_i$, proposes a target according to a probability distribution $p_i(k) \in \Delta(|\mathcal{A}_i|)$ that maximizes

$$p_i^T(k) \begin{bmatrix} U_{V_i}(\mathcal{A}_i^1, a_{-i}(k-1)) \\ \vdots \\ U_{V_i}(\mathcal{A}_i^{|\mathcal{A}_i|}, a_{-i}(k-1)) \end{bmatrix} + \tau H(p_i(k))$$

where $H(\cdot)$ is the entropy function that rewards randomization (see Nomenclature) and $\tau > 0$ is a parameter that controls the level of randomization. For any $\tau > 0$, the maximizing probability $p_i(k)$ is uniquely given by

$$p_i(k) = \sigma\!\left( \frac{1}{\tau} \begin{bmatrix} U_{V_i}(\mathcal{A}_i^1, a_{-i}(k-1)) \\ \vdots \\ U_{V_i}(\mathcal{A}_i^{|\mathcal{A}_i|}, a_{-i}(k-1)) \end{bmatrix} \right)$$

where $\sigma(\cdot)$ is the logit or soft-max function (see Nomenclature). For any $\tau > 0$, $p_i(k)$ assigns positive probability to all targets in $\mathcal{A}_i$. We are interested in small values of $\tau > 0$ because then $p_i(k)$ approximately maximizes vehicle $V_i$'s (unperturbed) utility based on other vehicles' proposals at the previous step. For other interpretations of the entropy term, see [35,36]; and for different ways of randomization, see [20].
The computational burden of SAP on each updating vehicle is comparable to that of RM on each vehicle. Each vehicle needs to observe and maintain the proposal profile $a(k)$ (actually, only the relevant part of $a(k)$). If given the chance to update its proposal, vehicle $V_i$ needs to call its utility function evaluator only $|\mathcal{A}_i|$ times. Because only one vehicle updates its proposal at a given negotiation step, the convergence of negotiations may be slow when there are a large number of vehicles.⁵ However, if the vehicles have a relatively small number of common targets in their target sets, then multiple vehicles can be allowed to update their proposals at a given step as long as they do not have common targets. Allowing such multiple updates may potentially speed up the negotiations substantially. In our simulations summarized in Sec. 5, SAP typically provided convergence to a near-optimal assignment faster than most other negotiation mechanisms.
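One SAP update reduces to a logit (soft-max) choice over the updating vehicle's utilities against the previous proposal profile. A minimal sketch under the same assumed `U_i(alt, a)` interface as before:

```python
# One SAP update (Sec. 4.3) for the single chosen vehicle; tau controls the
# level of randomization (small tau approximates a best response).
import math
import random

def sap_update(i, a_prev, my_targets, U_i, tau=0.05):
    utilities = [U_i(alt, a_prev) for alt in my_targets]
    m = max(utilities)                 # subtract the max for numerical stability
    weights = [math.exp((u - m) / tau) for u in utilities]
    return random.choices(my_targets, weights=weights)[0]
```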
4.4 Selective Spatial Adaptive Play. We will now introduce "selective spatial adaptive play" (sSAP) for the cases where a vehicle has a large number of targets in its target set or calling its utility function evaluator is computationally expensive. We will parameterize sSAP with $n = (n_1, \dots, n_{n_v})$, where $1 \leq n_i \leq |\mathcal{A}_i| - 1$ represents the number of times that vehicle $V_i$ calls its utility function evaluator when it is given the chance to update its proposal. Let us say that vehicle $V_i$, using sSAP, is given the chance to update its proposal at step $k$. First, vehicle $V_i$ sequentially selects $n_i$ targets from $\mathcal{A}_i \setminus \{a_i(k-1)\}$ without replacement, where each target is selected independently and with uniform probability over the remaining targets. Call these selected targets $\mathcal{A}_i^1(k), \dots, \mathcal{A}_i^{n_i}(k)$, and let $\mathcal{A}_i^0(k) \triangleq a_i(k-1)$ be appended to this set of selected targets. Then, at step $k$, vehicle $V_i$ proposes a target according to the probability distribution

$$p_i(k) = \sigma\!\left( \frac{1}{\tau} \begin{bmatrix} U_{V_i}(\mathcal{A}_i^0(k), a_{-i}(k-1)) \\ \vdots \\ U_{V_i}(\mathcal{A}_i^{n_i}(k), a_{-i}(k-1)) \end{bmatrix} \right)$$

for some $\tau > 0$. In other words, at step $k$, vehicle $V_i$ proposes a target to approximately maximize its own utility based on the selected targets $\mathcal{A}_i^0(k), \dots, \mathcal{A}_i^{n_i}(k)$ and other vehicles' proposals at the previous step. Thus, to compute $p_i(k)$, vehicle $V_i$ needs to call its utility function evaluator only $n_i$ times, where $n_i \geq 1$ could be much smaller than $|\mathcal{A}_i|$.
⁴ We will not deal with the issue of how the autonomous vehicles can randomly choose exactly one vehicle (or multiple vehicles with no common targets) to update its proposal without centralized coordination. In actuality, such asynchronous updating may be easier to implement than the aforementioned negotiation mechanisms that require synchronous updating. One possible implementation of asynchronous updating would be similar to the implementation of the well-known Aloha protocol in multiaccess communication, where multiple transmitting nodes attempt to access a single communication channel without colliding with each other [34].
⁵ If SAP is used as a centralized optimization tool, then the computational burden at each step will be very small because only one entry of $a(k)$ will be updated at each step.
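Before stating the convergence result, here is a minimal sketch of the sSAP update, reusing the `softmax` helper and the `utility` callback convention from the SAP sketch above (again our illustration, not the paper's code):

```python
def ssap_proposal(utility, targets, a_prev, i, n_i, tau, rng):
    """One sSAP update for vehicle i, with 1 <= n_i <= |A_i| - 1.

    Samples n_i candidate targets without replacement from A_i excluding
    the current proposal a_i(k-1), prepends A_i^0(k) = a_i(k-1), and
    soft-maxes over the utilities of these n_i + 1 candidates only.
    """
    current = a_prev[i]
    others = [t for t in targets if t != current]
    picks = rng.choice(len(others), size=n_i, replace=False)
    candidates = [current] + [others[j] for j in picks]
    u = np.array([utility(i, t, a_prev) for t in candidates])
    p = softmax(u / tau)
    return candidates[rng.choice(len(candidates), p=p)]
```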
THEOREM 4.2. Assume that the vehicle utilities constitute a potential game where the global utility $U_g$ is a potential function. Then, the target proposals $a(k)$ generated by sSAP satisfy

$$ \lim_{\tau \downarrow 0} \; \lim_{k \to \infty} \operatorname{Prob}\big[ a(k) \text{ is an optimal target assignment profile} \big] = 1 $$
Proof. sSAP induces an irreducible Markov process whose state space is $\mathcal{A}$ and whose state at step $k$ is the profile $a(k)$ of proposed targets. The empirical frequencies of the visited states converge to the unique stationary distribution of this induced Markov process. As in Theorem 6.1 in [9], we show that this stationary distribution, denoted by $\mu$, is given as

$$ \mu(a) = \frac{e^{(1/\tau) U_g(a)}}{\sum_{\bar{a} \in \mathcal{A}} e^{(1/\tau) U_g(\bar{a})}}, \qquad a \in \mathcal{A} $$

by verifying the detailed balance equations

$$ \mu(a)\operatorname{Prob}[a \to b] = \mu(b)\operatorname{Prob}[b \to a], \qquad a, b \in \mathcal{A} $$

The only nontrivial case that requires verification of the above equations is when $a$ and $b$ differ in exactly one position. Fix $a$ and $b$ such that $a_i \ne b_i$ and $a_{-i} = b_{-i}$. Then, we have

$$ \operatorname{Prob}[a \to b] = \frac{1}{n_v} \sum_{(a^0, \ldots, a^{n_i}) \in S(a,b)} \frac{1}{(|\mathcal{A}_i| - 1) \cdots (|\mathcal{A}_i| - n_i)} \, \frac{e^{(1/\tau) U_{V_i}(b)}}{\sum_{j=0}^{n_i} e^{(1/\tau) U_{V_i}(a^j)}} $$

where

$$ S(a,b) = \Big\{ (a^0, \ldots, a^{n_i}) \in \mathcal{A}^{n_i+1} : \big(a_{-i}^j = a_{-i}, \forall j\big), \; \big(a^0 = a\big), \; \big(a^j = b \text{ for exactly one } j\big), \; \big(a^j \ne a^m, \forall j \ne m\big) \Big\} $$

It is now straightforward to see (by matching each tuple in $S(a,b)$ with the tuple in $S(b,a)$ that contains the same set of candidate profiles) that

$$ \frac{\operatorname{Prob}[a \to b]}{\operatorname{Prob}[b \to a]} = e^{(1/\tau)[U_{V_i}(b) - U_{V_i}(a)]} = e^{(1/\tau)[U_g(b) - U_g(a)]} = \frac{\mu(b)}{\mu(a)} $$

where the second equality uses the fact that $U_g$ is a potential function for the game. Therefore, $\mu$ is indeed as given above, and it can be written, in the alternative vector form, as

$$ \mu = \sigma\!\left( \frac{1}{\tau} U_g \right) $$

where, by an abuse of notation, $U_g$ is also used to represent a vector whose "$a$th entry" equals $U_g(a)$. Finally, the fact that the Markov process induced by sSAP with $\tau > 0$ is irreducible and aperiodic readily leads to the desired result. ∎
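The closed form of $\mu$ is easy to probe numerically. The toy check below is our construction (any small potential game would do): it evaluates $\mu = \sigma((1/\tau) U_g)$ on the enumerated profile space, reusing the `softmax` helper from the SAP sketch, and shows the probability mass concentrating on the maximizers of $U_g$ as $\tau \downarrow 0$.

```python
import itertools
import numpy as np

# Toy identical-interest game: 2 vehicles, 2 targets, potential U_g.
targets = ["T1", "T2"]

def U_g(a):
    # Covering distinct targets is better than doubling up on one.
    return float(len(set(a)))

profiles = list(itertools.product(targets, repeat=2))
u = np.array([U_g(a) for a in profiles])
for tau in (1.0, 0.1, 0.01):
    mu = softmax(u / tau)  # stationary distribution mu = sigma(U_g / tau)
    print(f"tau={tau}: Prob[optimal profile] = {mu[u == u.max()].sum():.4f}")
```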
Thus, in the setup above, $\mu$ assigns arbitrarily high probability to those assignment profiles that maximize a potential function for the game as $\tau \downarrow 0$. Clearly, this result indicates that, in the case of the vehicle utilities IIU (4), RRU (5), or WLU (7), sSAP negotiations would lead to an optimal target assignment with arbitrarily high probability provided that $\tau > 0$ is chosen sufficiently small. Of course, one can gradually decrease $\tau$ to allow initial exploration. We believe that one can obtain convergence, in probability, of the proposals $a(k)$ to an optimal assignment if $\tau$ is decreased sufficiently slowly, as in simulated annealing [37,38]. In our simulations, choosing $\tau$ inversely proportional to $k^2$ during the negotiations typically resulted in fast convergence of the proposals to a near-optimal assignment.
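For instance, the schedule used in our simulations can be expressed as a one-line function (the constant 10 matches the runs reported in Sec. 5; treating it as a tunable parameter is our suggestion):

```python
def tau_schedule(k, c=10.0):
    """Randomization level at negotiation step k, decreasing as c / k**2."""
    return c / float(k) ** 2
```

Passing `tau_schedule(k)` as the `tau` argument of the SAP/sSAP sketches above yields the annealed negotiations used in the simulations.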
5 Simulation Results
In this section, we present numerical results to illustrate that, when the individual utility functions and the negotiation mechanisms are properly selected, the autonomous vehicles can agree on a target assignment profile that yields near-optimal global utility. We consider two scenarios. In the first scenario, we illustrate the near optimality of our approach by simulating a special case of the well-known weapon-target assignment model, for which an optimal assignment can be obtained in a short period of time even for large numbers of weapons and targets [2]. In the second scenario, we simulate a general instance of the problem and compare various negotiation algorithms in terms of their performance and speed of convergence.
Scenario 1. Here, the vehicles are identical and have zero values, whereas the targets are different and have positive values. Each vehicle can be assigned to any of the targets.⁶ Let $V_j$ be the value of target $T_j$ and $p_j$ be the probability that target $T_j$ gets eliminated when only a single vehicle engages it. When multiple vehicles are assigned to target $T_j$, each of the vehicles is assumed to engage target $T_j$ independently. Hence, if the number of vehicles engaging target $T_j$ is $x_j$, then $T_j$ will be eliminated with probability $1 - (1 - p_j)^{x_j}$. Therefore, as a function of the assignment profile $a$, the utility generated by the engagement with target $T_j$ is given by

$$ U_{T_j}(a) = V_j \left( 1 - (1 - p_j)^{\sum_{i=1}^{n_v} I(a_i = T_j)} \right) $$

which leads to the following global utility function:

$$ U_g(a) = \sum_{j=1}^{n_t} V_j \left( 1 - (1 - p_j)^{\sum_{i=1}^{n_v} I(a_i = T_j)} \right) $$

Given the parameters $n_v$, $n_t$, $V_1, \ldots, V_{n_t}$, and $p_1, \ldots, p_{n_t}$, an optimal vehicle-target assignment that maximizes the global utility function given above can be quickly obtained using an iterative procedure called the minimum marginal return algorithm [2].
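For reference, the following is a sketch of the greedy minimum-marginal-return procedure along the lines of [2]; the optimality claim for this identical-vehicle special case is from [2], while the code itself is our reconstruction:

```python
import heapq

def minimum_marginal_return(V, p, n_v):
    """Greedily assign n_v identical vehicles to targets.

    V[j]: value of target T_j; p[j]: single-vehicle elimination probability.
    Each new vehicle goes to the target with the largest marginal gain
    V[j] * p[j] * (1 - p[j])**x[j], where x[j] vehicles are already on T_j.
    Returns x, the number of vehicles assigned to each target.
    """
    x = [0] * len(V)
    heap = [(-V[j] * p[j], j) for j in range(len(V))]  # max-heap via negation
    heapq.heapify(heap)
    for _ in range(n_v):
        neg_gain, j = heapq.heappop(heap)
        x[j] += 1
        # The next vehicle on T_j yields a gain smaller by a factor (1 - p[j]).
        heapq.heappush(heap, (neg_gain * (1 - p[j]), j))
    return x
```

The resulting $U_g = \sum_j V_j (1 - (1 - p_j)^{x_j})$ serves as the optimal benchmark against which the negotiated assignments in Table 1 are normalized.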
To test the effectiveness of our approach, we simulated the vehicle negotiations using the above model with 200 vehicles and 200 targets in MATLAB on a single personal computer with a 1.4 GHz Pentium M processor and 1.1 GB of RAM. Each of the target values, $V_1, \ldots, V_{200}$, and each of the elimination probabilities, $p_1, \ldots, p_{200}$, was independently chosen once according to the uniform probability distribution on $[0, 1]$ and thereafter kept constant throughout the simulations. We first conducted 100 runs of generalized RM negotiations (the $\mathrm{RM}_i$ function as in Eq. (9), with algorithm parameters 0.1 and 0.5) with WLU utilities (7), where each negotiation consisted of 100 steps. We then repeated this with 100 runs of SAP negotiations with WLU utilities (7), where each run consisted of 1000 steps. We also conducted 100 runs of utility-based FP negotiations with WLU utilities (7), where each negotiation consisted of 1000 steps. In all cases, the randomization level $\tau$ was decreased as $10/k^2$, where $k$ is the negotiation step. The evolution of the global utility during typical runs of generalized RM, SAP, and utility-based FP negotiations is shown in Fig. 4. Also, the global utility corresponding to the assignment profile at the end of each run of negotiations and the CPU time required for each run were recorded. A summary of these numerical results is provided in Table 1.
All negotiations consistently yielded near-optimal assignments. The global utility generated by SAP negotiations was almost always monotonically increasing, whereas the global utility generated by generalized RM and utility-based FP negotiations exhibited fluctuations, as seen in Fig. 4.
In any SAP negotiation step, only one vehicle calls its utility function evaluator (200 times), whereas in any generalized RM negotiation step, all vehicles call their utility function evaluators (200 times each). As a result, although a typical generalized RM negotiation converged in 100 steps, as opposed to 1000 steps in the case of SAP, a typical 100-step generalized RM negotiation took 593 s of CPU time on average, whereas a typical 1000-step SAP negotiation took 49 s of CPU time on average. However, it is important to note that these numbers reflect sequential
⁶ Note that there is no reason to consider a null target $T_0$ here.
CPU time. In an actual implementation, individual vehicles would call their utility function evaluators in parallel. The "parallel" CPU time in Table 1 is the overall CPU time divided by the number of vehicles; it is a rough reflection of what the actual implementation time would be in a parallel implementation. We see that generalized RM is actually faster than SAP in this sense. The parallel time for SAP is the same as its sequential CPU time because only one vehicle updates its strategy per iteration.

In the case of utility-based FP, all vehicles call their utility function evaluators at each negotiation step, but only once each. This can be contrasted with generalized RM, which requires a utility function evaluation for every possible target. Utility-based FP took 1000 negotiation steps to approach the optimal global utility, but used 67 s of CPU time on average (or about 0.33 s in parallel), which is also faster than the average CPU time used by RM, despite utility-based FP requiring more iterations.
For this scenario, action-based FP would impose an enormous computational burden on each vehicle, since a vehicle using action-based FP would have to keep track of the empirical frequencies of the choices of the 199 other vehicles and compute its expected utility over a decision space of dimension $200^{200}$ at every negotiation step. However, the numerical results presented above verify that autonomous vehicles can quickly negotiate and agree on an assignment profile that yields near-optimal global utility when vehicle utilities and negotiation mechanisms are chosen properly.
Scenario 2. In this scenario, we consider a more general instance of the weapon-target assignment problem, for which we have virtually no way of computing the optimal global utility. The setup in this scenario is similar to that in Scenario 1, except that the vehicles are not identical and are also range restricted. More specifically, each vehicle still has zero value, but the probability $p_{ij}$ that target $T_j$ gets eliminated when only vehicle $V_i$ engages it differs from vehicle to vehicle. Each of the elimination probabilities $p_{ij}$, $1 \le i, j \le 200$, was independently chosen once according to the uniform probability distribution on $[0, 1]$ and thereafter kept constant throughout the simulations. Each vehicle $V_i$ has 20 targets in its range $\mathcal{A}_i$, and the targets in $\mathcal{A}_i$ are chosen from the set of all targets with equal probability and independently of the other vehicles. Therefore, a pair of vehicles may have some common as well as some distinct targets in their ranges. As in Scenario 1, the target values $V_1, \ldots, V_{200}$ are chosen independently according to the uniform probability distribution on $[0, 1]$. Therefore, as a function of the assignment profile $a$, the utility generated by the engagement with target $T_j$ is given by
$$ U_{T_j}(a) = V_j \left( 1 - \prod_{i : a_i = T_j} (1 - p_{ij}) \right) $$

which leads to the following global utility function:

$$ U_g(a) = \sum_{j=1}^{n_t} V_j \left( 1 - \prod_{i : a_i = T_j} (1 - p_{ij}) \right) $$
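A direct transcription of this global utility into code (our sketch; `a[i]` holds the index of the target proposed by $V_i$, or `None` for no engagement):

```python
def global_utility(a, V, p):
    """U_g(a) = sum_j V[j] * (1 - prod over {i : a_i = T_j} of (1 - p[i][j]))."""
    survive = [1.0] * len(V)  # running product of (1 - p_ij) per target
    for i, j in enumerate(a):
        if j is not None:
            survive[j] *= 1.0 - p[i][j]
    return sum(V[j] * (1.0 - survive[j]) for j in range(len(V)))
```

Only the factors of the targets in a vehicle's range change when that vehicle switches its proposal, which is what keeps the localized utility evaluations cheap.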
Using the same computational resources and the same setup as in Scenario 1, we simulated the vehicle negotiations on the above model. The evolution of the global utility during typical runs of generalized RM, SAP, and utility-based FP negotiations is shown in Fig. 5. The global utility corresponding to the assignment profile at the end of each run of negotiations and the CPU time required for each run were recorded. A summary of these numerical results is provided in Table 2.
All negotiations eventually settled at some assignment profiles, leading to comparable global utilities, as shown in Fig. 5 and Table 2. The convergence in this scenario was slower for all negotiation mechanisms. The reason is that the vehicles in this scenario are not identical and are range restricted, and as a result, computing each vehicle's utility is computationally more demanding. The relative timings, in both CPU time and convergence rate, are similar to those in Scenario 1.

Action-based FP was computationally infeasible for this scenario as well, for the same reason stated earlier, i.e., its enormous computational burden on each vehicle.
The numerical results presented above show that autonomous vehicles can quickly negotiate and agree on a (possibly near-optimal) assignment profile when vehicle utilities and negotiation mechanisms are chosen properly. In all cases, vehicles only communicate with their "neighbors," i.e., those vehicles that share a common target. The difference between the algorithms is in the number of vehicles that communicate per iteration. In SAP, only the vehicle revising its assignment must communicate with its neighbors. In generalized RM and utility-based FP, all vehicles must communicate with their neighbors in every iteration. In Scenario 1, all vehicles share the same targets, and thus all vehicles are neighbors.
Fig. 4  Evolution of global utility during typical runs of negotiations
Table 1  Summary of simulation runs

                                                   Generalized RM        SAP    Utility-based FP
Average global utility / Optimal global utility        0.99             0.99         0.98
Minimum global utility / Optimal global utility        0.99             0.99         0.96
Average CPU time (s)                              593 (≈3.0 parallel)    49     67 (≈0.33 parallel)
In Scenario 2, the communication pattern is much sparser because of the limited vehicle ranges and the distribution of targets. The greatest communication savings per iteration are achieved by SAP. However, SAP required more iterations to converge.
6 Conclusions
We introduced an autonomous vehicle-target assignment problem as a multiplayer game where the vehicles are self-interested players with their own individual utility functions. We emphasized rational decision making on the part of the vehicles to develop autonomous operation capability in uncertain and adversarial environments. To achieve optimality with respect to a global utility function, we discussed various aspects of the design of the vehicle utilities, in particular, alignment with a global utility function and localization. We reviewed selected multiplayer learning algorithms available in the literature. We introduced two new algorithms that address the informational and computational requirements of existing algorithms, namely, generalized RM with fading memory and inertia and selective spatial adaptive play, and provided accompanying convergence proofs. Finally, we discussed these learning algorithms in terms of convergence, equilibrium selection, and computational efficiency, and illustrated the near-optimal achievement of a global utility through autonomous vehicle negotiations.
We end by pointing to a significant extension of this work: the case where the vehicle-target assignments need to be made sequentially over a time horizon [2]. In this case, the assignment decisions made by the vehicles at a given time step (probabilistically) determine the future games to be played by the vehicles. Therefore, the vehicles need to take future utilities into account in their negotiations. A natural framework for studying such sequential decision-making problems in a competitive multiplayer setting is that of Markov games [39,40]. Extending the approach taken in this paper to a Markov game setup requires significant future work.
Acknowledgment
Research supported by NSF Grant No. ECS-0501394, AFOSR/MURI Grant No. F49620-01-1-0361, and ARO Grant No. W911NF-04-1-0316.
Nomenclature
$|A|$ = number of elements in $A$, for a finite set $A$
$I(\cdot)$ = indicator function
$\mathbb{R}^n$ = $n$-dimensional Euclidean space, for a positive integer $n$
$\mathbf{1}$ = vector $(1, \ldots, 1)^T \in \mathbb{R}^n$
$(\cdot)^T$ = transpose operation
$\Delta_n$ = simplex in $\mathbb{R}^n$, i.e., $\{ s \in \mathbb{R}^n : s \ge 0 \text{ componentwise, and } \mathbf{1}^T s = 1 \}$
$\mathrm{Int}(\Delta_n)$ = set of interior points of the simplex $\Delta_n$, i.e., $\{ s \in \Delta_n : s > 0 \text{ componentwise} \}$
$H : \mathrm{Int}(\Delta_n) \to \mathbb{R}$ = entropy function, $H(x) = -x^T \log(x)$
$\sigma : \mathbb{R}^n \to \Delta_n$ = "logit" or "soft-max" function, $(\sigma(x))_i = e^{x_i} / (e^{x_1} + \cdots + e^{x_n})$
$[x]^+$ = vector in $\mathbb{R}^n$ whose $i$th entry equals $\max\{x_i, 0\}$, for $x \in \mathbb{R}^n$
References
[1] Olfati-Saber, R., 2006, "Flocking for Multi-Agent Dynamic Systems: Algorithms and Theory," IEEE Trans. Autom. Control, 51, pp. 401–420.
[2] Murphey, R. A., 1999, "Target-Based Weapon Target Assignment Problems," Nonlinear Assignment Problems: Algorithms and Applications, P. M. Pardalos and L. S. Pitsoulis, eds., Kluwer, Dordrecht, pp. 39–53.
[3] Ahuja, R. K., Kumar, A., Jha, K., and Orlin, J. B., 2003, "Exact and Heuristic Methods for the Weapon-Target Assignment Problem," http://ssrn.com/abstract=489802
[4] Fudenberg, D., and Tirole, J., 1991, Game Theory, MIT Press, Cambridge, MA.
[5] Basar, T., and Olsder, G. J., 1999, Dynamic Noncooperative Game Theory, SIAM, Philadelphia.
[6] Wolpert, D. H., and Tumer, K., 2001, "Optimal Payoff Functions for Members of Collectives," Adv. Complex Syst., 4(2&3), pp. 265–279.
[7] Monderer, D., and Shapley, L. S., 1996, "Potential Games," Games Econ. Behav., 14, pp. 124–143.
[8] Fudenberg, D., and Levine, D. K., 1998, The Theory of Learning in Games, MIT Press, Cambridge, MA.
[9] Young, H. P., 1998, Individual Strategy and Social Structure: An Evolutionary Theory of Institutions, Princeton University Press, Princeton, NJ.
[10] Wolpert, D., and Tumer, K., 2004, "A Survey of Collectives," Collectives and the Design of Complex Systems, K. Tumer and D. Wolpert, eds., Springer-Verlag, New York, p. 142.
[11] Miettinen, K. M., 1998, Nonlinear Multiobjective Optimization, Kluwer, Dordrecht.
[12] Rosenthal, R. W., 1973, "A Class of Games Possessing Pure-Strategy Nash Equilibria," Int. J. Game Theory, 2, pp. 65–67.
[13] Mas-Colell, A., Whinston, M. D., and Green, J. R., 1995, Microeconomic Theory, Oxford University Press, London.
[14] Benaim, M., and Hirsch, M. W., 1999, "Mixed Equilibria and Dynamical Systems Arising From Fictitious Play in Perturbed Games," Games Econ. Behav., 29, pp. 36–72.
Fig. 5  Evolution of global utility during typical runs of negotiations
Table 2  Summary of simulation runs

                        Generalized RM           SAP     Utility-based FP
Global utility              87.62               85.24          85.49
Average CPU time (s)  2707 (≈13.5 parallel)      382     529 (≈2.64 parallel)
[15] Brown, G. W., 1951, "Iterative Solutions of Games by Fictitious Play," Activity Analysis of Production and Allocation, T. C. Koopmans, ed., Wiley, New York, pp. 374–376.
[16] Monderer, D., and Shapley, L. S., 1996, "Fictitious Play Property for Games With Identical Interests," J. Econ. Theory, 68, pp. 258–265.
[17] Curtis, J. W., and Murphey, R., 2003, "Simultaneous Area Search and Task Assignment for a Team of Cooperative Agents," AIAA Guidance, Navigation, and Control Conference and Exhibit, Austin, TX, August, AIAA Paper No. 2003-5584.
[18] Hofbauer, J., 1995, "Stability for the Best Response Dynamics," University of Vienna, Vienna, Austria, http://homepage.univie.ac.at/josef.hofbauer/br.ps
[19] Krishna, V., and Sjöström, T., 1998, "On the Convergence of Fictitious Play," Math. Oper. Res., 23, pp. 479–511.
[20] Hofbauer, J., and Sandholm, B., 2002, "On the Global Convergence of Stochastic Fictitious Play," Econometrica, 70, pp. 2265–2294.
[21] Lambert, T. J., III, Epelman, M. A., and Smith, R. L., 2005, "A Fictitious Play Approach to Large-Scale Optimization," Oper. Res., 53(3), pp. 477–489.
[22] Marden, J. R., Arslan, G., and Shamma, J. S., 2005, "Joint Strategy Fictitious Play With Inertia for Potential Games," Proc. 44th IEEE Conference on Decision and Control, December, pp. 6692–6697.
[23] Fudenberg, D., and Levine, D., 1998, "Learning in Games," Eur. Econ. Rev., 42, pp. 631–639.
[24] Fudenberg, D., and Levine, D. K., 1995, "Consistency and Cautious Fictitious Play," J. Econ. Dyn. Control, 19, pp. 1065–1089.
[25] Sutton, R. S., and Barto, A. G., 1998, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA.
[26] Bertsekas, D. P., and Tsitsiklis, J. N., 1996, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA.
[27] Leslie, D., and Collins, E., 2003, "Convergent Multiple-Timescales Reinforcement Learning Algorithms in Normal Form Games," Ann. Appl. Probab., 13, pp. 1231–1251.
[28] Leslie, D., and Collins, E., 2005, "Individual Q-Learning in Normal Form Games," SIAM J. Control Optim., 44(2), pp. 495–514.
[29] Leslie, D. S., and Collins, E. J., 2006, "Generalised Weakened Fictitious Play," Games Econ. Behav., 56(2), pp. 285–298.
[30] Hart, S., and Mas-Colell, A., 2000, "A Simple Adaptive Procedure Leading to Correlated Equilibrium," Econometrica, 68(5), pp. 1127–1150.
[31] Hart, S., and Mas-Colell, A., 2001, "A General Class of Adaptive Strategies," J. Econ. Theory, 98, pp. 26–54.
[32] Hart, S., and Mas-Colell, A., 2003, "Regret Based Continuous-Time Dynamics," Games Econ. Behav., 45, pp. 375–394.
[33] Marden, J. R., Arslan, G., and Shamma, J. S., 2007, "Regret Based Dynamics: Convergence in Weakly Acyclic Games," Proc. 6th International Joint Conference on Autonomous Agents and Multi-Agent Systems, ACM Press, New York, pp. 194–201.
[34] Bertsekas, D., and Gallager, R., 1992, Data Networks, 2nd ed., Prentice-Hall, Englewood Cliffs, NJ.
[35] Hofbauer, J., and Hopkins, E., 2005, "Learning in Perturbed Asymmetric Games," Games Econ. Behav., 52, pp. 133–152.
[36] Wolpert, D. H., 2004, "Information Theory—The Bridge Connecting Bounded Rational Game Theory and Statistical Physics," http://arxiv.org/PS-cache/cond-mat/pdf/0402/0402508.pdf
[37] Aarts, E., and Korst, J., 1989, Simulated Annealing and Boltzmann Machines, Wiley, New York.
[38] van Laarhoven, P. J. M., and Aarts, E. H. L., 1987, Simulated Annealing: Theory and Applications, Reidel, Dordrecht.
[39] Raghavan, T. E. S., and Filar, J. A., 1991, "Algorithms for Stochastic Games—A Survey," Methods Models Oper. Res., 35, pp. 437–472.
[40] Vrieze, O. J., and Tijs, S. H., 1980, "Fictitious Play Applied to Sequences of Games and Discounted Stochastic Games," Int. J. Game Theory, 11, pp. 71–85.