Conference PaperPDF Available

AATEAM: Achieving the Ad Hoc Teamwork by Employing the Attention Mechanism

Authors:
  • Spanish National Research Council / Change Management Tool

Abstract

In the ad hoc teamwork setting, a team of agents needs to perform a task without prior coordination. The most advanced approach learns policies based on previous experiences and reuses one of the policies to interact with new teammates. However, the selected policy in many cases is sub-optimal. Switching between policies to adapt to new teammates' behaviour takes time, which threatens the successful performance of a task. In this paper, we propose AATEAM-a method that uses the attention-based neural networks to cope with new teammates' behaviour in real-time. We train one attention network per teammate type. The attention networks learn both to extract the temporal correlations from the sequence of states (i.e. contexts) and the mapping from contexts to actions. Each attention network also learns to predict a future state given the current context and its output action. The prediction accuracies help to determine which actions the ad hoc agent should take. We perform extensive experiments to show the effectiveness of our method.
The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)
AATEAM: Achieving the Ad Hoc
Teamwork by Employing the Attention Mechanism
Shuo Chen,1,2 Ewa Andrejczuk,2Zhiguang Cao,3Jie Zhang1
1School of Computer Science and Engineering, Nanyang Technological University, Singapore
2ST Engineering - NTU Corporate Laboratory, Nanyang Technological University, Singapore
3Department of Industrial Systems Engineering and Management, National University of Singapore, Singapore
chen1087@e.ntu.edu.sg, {ewaa, zhangj}@ntu.edu.sg, isecaoz@nus.edu.sg
Abstract
In the ad hoc teamwork setting, a team of agents needs to per-
form a task without prior coordination. The most advanced
approach learns policies based on previous experiences and
reuses one of the policies to interact with new teammates.
However, the selected policy in many cases is sub-optimal.
Switching between policies to adapt to new teammates’ be-
haviour takes time, which threatens the successful perfor-
mance of a task. In this paper, we propose AATEAM –
a method that uses the attention-based neural networks to
cope with new teammates’ behaviour in real-time. We train
one attention network per teammate type. The attention net-
works learn both to extract the temporal correlations from the
sequence of states (i.e. contexts) and the mapping from con-
texts to actions. Each attention network also learns to predict
a future state given the current context and its output action.
The prediction accuracies help to determine which actions the
ad hoc agent should take. We perform extensive experiments
to show the effectiveness of our method.
Introduction
In the majority of the multiagent systems (MAS) literature
tackling teamwork, a team of at least two agents shares co-
operation protocols to work together and perform collabora-
tive tasks (Grosz and Kraus 1996; Tambe 1997). However,
with the increasing number of agents in various domains
(e.g. construction, search and rescue) developed by differ-
ent companies, we can no longer coordinate agents’ activi-
ties beforehand. In this case, we expect a team of agents to
perform a task without any pre-defined coordination strat-
egy. This problem is known in the literature as the ad hoc
teamwork problem (Stone et al. 2010). This paper tackles
the ad hoc teamwork problem where an ad hoc agent plans
its actions only by observing other teammates.
Most of the existing approaches for the ad hoc teamwork
focus on simple domains, i.e. they ignore the environmental
uncertainty or they assume teammates’ actions are fully ob-
servable (Agmon, Barrett, and Stone 2014; Ravula, Alkoby,
and Stone 2019). To the best of our knowledge, PLASTIC-
Policy (Barrett et al. 2017) is the only scheme proposed for
more complex domains, i.e. with a dynamic environment
and a continuous state/action space. PLASTIC-Policy learns
Copyright c
2020, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
a policy for each past teammate type. When working with
new teammates, it chooses one of the policies based on the
similarity between the new and past experiences and uses
that policy to select the sequence of actions for the ad hoc
agent. However, the selected policy only works well when
new teammates behave similarly to the chosen teammate
type. When new teammates’ behaviour becomes far from the
teammate type, the ad hoc agent needs to switch to another
policy. Switching between policies to adapt to new team-
mates’ behaviour takes time, which threatens the successful
performance of a task. Given that, it is desirable to develop
a new method that adapts to new teammates’ behaviours in
real-time for complex domains.
In this paper, we propose to Achieve the Ad hoc Teamwork
by Employing Attention Mechanism (AATEAM). AATEAM
uses attention-based recurrent neural networks to cope with
new teammates’ behaviour in real-time. Our workflow is as
follows. First, we use reinforcement learning to learn poli-
cies to work with past teammates. Second, we make the ad
hoc agent play with each type of teammate using the cor-
responding learned policy. We collect the states (input) and
the ad hoc agent’s actions (label) as training data. Third, we
use this data to teach one attention network for each type of
past teammates. The attention network works as follows. In
every step, the attention network takes a state as input. The
attention mechanism inside extracts the temporal correlation
between the state and the states encountered previously, i.e.
context. It further maps the context to an action probability
distribution. In every step, the attention networks also out-
put a metric that measures the similarity between past and
new teammate types. The more similar the types, the more
important the corresponding probability distribution over ac-
tions. We aggregate networks’ probability distributions over
actions by calculating the weighted sum of those probabil-
ities and its importance. Finally, we output an action with
the highest aggregated probability. Since AATEAM evalu-
ates the similarity in every step, we can adjust to the new
teammates’ behaviour in real-time.
In summary, we make the following contributions: 1) we
propose AATEAM, an attention-based method to cope with
teammates’ different behaviours in complex domains. To the
best of our knowledge, this is the first time attention mech-
anism is applied to this problem. 2) Besides learning the
mapping from contexts to probability distribution over ac-
7095
tions, we train the attention network to measure the simi-
larity between the corresponding past teammates and new
teammates. This helps to adjust the action selection in real-
time. 3) We perform extensive experiments in the RoboCup
2D simulation domain. The results demonstrate that the at-
tention networks outperform PLASTIC-Policy when work-
ing with both known and unknown teammates.
Related Work
The majority of existing work in the ad hoc teamwork is pro-
posed for simple domains. They either do not consider the
environmental uncertainty or they assume the action space
is fully observable.
For the first simplification, the authors (Agmon and Stone
2012; Chakraborty and Stone 2013; Agmon, Barrett, and
Stone 2014) assume that the utilities of the team’s actions are
given. Also, Genter, Agmon, and Stone (2011) deal with the
role assignment problem for the ad hoc agent where team-
mates’ roles are known. However, the actions’ utilities and
teammates’ roles may vary when the environment changes.
Therefore, their problem setting is too simplistic.
As for the second simplification, assuming a fully-
observable action space limits the potential on real-life
applications. The researchers develop algorithms to plan
the ad hoc agent’s action either by using the past team-
mates’ behaviour types (Wu, Zilberstein, and Chen 2011;
Albrecht and Ramamoorthy 2013; Barrett et al. 2013; Melo
and Sardinha 2016; Chandrasekaran et al. 2017; Ravula,
Alkoby, and Stone 2019) or by inferring the responsibilities
new teammates are taking (Chen et al. 2019). In both cases,
they assume that teammates’ actions are fully observable,
which is a strong assumption when it comes to more com-
plex domains (e.g. the actions are continuous and hence, the
ad hoc agent cannot know the exact value of an action).
The only scheme for complex domains is PLASTIC-
Policy (Barrett et al. 2017). PLASTIC-Policy uses reinforce-
ment learning to learn a policy for working with each type of
past teammates. To reduce learning complexity, PLASTIC-
Policy first assumes the existence of a countable set of high-
level actions. Each high-level action is a mapping from the
state to a continuous action. Therefore, it limits the action
space to be searched by reinforcement learning. The policy
takes a continuous state as input and outputs a high-level
action of the ad hoc agent. As mentioned before, in com-
plex domains, the ad hoc agent cannot directly observe team-
mates’ actions. Therefore, PLASTIC-Policy defines the state
transition for continuous state space and observes team-
mates’ behaviour based on state transitions. Given a state
transition with new teammates, the ad hoc agent locates the
most similar state transition with each type of past team-
mates. Using that information, it updates the probability of
new teammates being similar to each type of past teammates,
and applies the policy for the most similar past teammates.
The main drawback of PLASTIC-Policy is that the learned
policies are not necessarily optimal for new teammates. Ad-
ditionally, the update based on state transitions may not be
fast enough to adapt to new teammates’ changing behaviour.
Our paper addresses these drawbacks by using all attention
networks to select the best action based on a context and the
similarity between past and new teammates’ type.
Preliminary
Markov Decision Process
In this paper, our objective is to find the best set of ac-
tions for the ad hoc agent. Following the state-of-the-art
literature (Barrett et al. 2017), we model the ad hoc team-
work problem as the Markov Decision Process (MDP) (Sut-
ton and Barto 2011). Specifically, we represent an MDP
by a tuple S, A, P, R, where Sis a set of states; Ais
a set of high-level actions the ad hoc agent can perform;
P:S×A×S[0,1] is the state transition function; That
is, P(s|s, a)is the probability of transiting to the next state
s, given the action aand current state s; and the reward
function R:S×A→Rspecifies the reward R(s, a)when
the ad hoc agent takes the action ain the state s. Note that
for continuous state space, the definition of state transition
is domain-specific.
A policy π:SAtakes a state sas input and out-
puts an action a. After taking the action ain the state s, the
maximum long term expected reward is computed as:
Q(s, a)=R(s, a)+γ
sS
P(s|s, a)max
aQ(s,a
)(1)
where 01is the discount factor for the future reward.
An optimal policy chooses the action athat maximizes the
Q(s, a)for all s.
Attention Mechanism
The attention mechanism is an approach for intelligently
extracting contextual information from a sequence of in-
puts. Due to its great performance, the attention mech-
anism has become a popular technique in computer vi-
sion (Mnih et al. 2014; Xu et al. 2015) and natural lan-
guage processing (NLP) (Bahdanau, Cho, and Bengio 2014;
Vaswani et al. 2017).
The example attention network for translation is shown
in Figure 1. The attention network has two recurrent neu-
ral networks called encoder and decoder respectively. The
attention network first encodes a full sentence with its en-
coder. The last hidden state from the encoder is the initial
Figure 1: The attention network for translation
7096
hidden state (input) to its decoder. To do the translation, the
attention mechanism decides the importance (i.e. weights) of
each input to the translation output. The weighted sum of en-
coder’s outputs is the contextual information. The decoder’s
hidden state and the contextual information determine the
output together. The method to compute the weight in the
attention mechanism varies from problem to problem.
Note that in the teamwork, the ad hoc agent’s action in the
current step is correlated to the current state as well as to pre-
vious states. Given different teammates’ behaviours (coming
from different teammate types or different situations), the
correlations of previous states to the current action are dif-
ferent. Therefore, the attention mechanism is a suitable ap-
proach that takes the sequence of states as input and outputs
the correlations given different teammates’ behaviours.
Ad Hoc Teamwork with Attention Mechanism
In this section, we explain the details of our method. First,
we present the overall design of one attention-based neu-
ral network. Note that AATEAM contains multiple attention
networks, one per past teammates’ type. Second, we explain
how AATEAM selects actions for the ad hoc agent. Finally,
we describe how we train AATEAM.
The Design of the Attention Network in AATEAM
The overall architecture of our attention-based neural net-
work is as shown in Figure 2. The architecture consists of
two parts. That is, the encoder and decoder respectively. The
encoder extracts important features for the ad hoc teamwork
from the sequence of states. The decoder exploits the ex-
tracted features to output the probability distribution over
actions for the ad hoc agent.
The encoder. The encoder consists of a linear layer and
the gated recurrent unit (GRU)1(Cho et al. 2014) as shown
in Figure 3. The initial hidden state of the encoder
h0is a
vector of zeros
0. In every step t, the encoder takes the pre-
vious hidden state
ht1and the state stas input. Let ude-
note the size of state vector, ldenote the size of GRU’s hid-
den state and fxy(·)denote a linear layer that transforms a
vector of size xinto another vector of size y. A linear layer
ful(·)transforms the state vector stinto an embedding
vector of length l. It does so, by multiplying the input by
the weight matrix. Then, GRU uses the embedding vector
and
ht1to output a hidden state
htand an encoder-output
ot, i.e. the encoding of st. The
htand otthen become the
input of the decoder.
The decoder. The purpose of the decoder is to output a
high-level action atAfor the ad hoc agent. Let n=|A|
denote the number of high-level actions. Next, we under-
stand an episode as one instance of a task performance. We
denote the signal of the start of an episode as SOEand set
1The GRU is a popular variant of the recurrent neural network
(RNN). As an alternative RNN, we could use the long short-term
memory (LSTM) (Hochreiter and Schmidhuber 1997). However,
GRU has comparable performance to LSTM while having less pa-
rameters, which makes it easier to train (Chung et al. 2014).
Figure 2: The attention network for the ad hoc teamwork
its index to n+1. The initial hidden state of the decoder
d0
is the encoder’s hidden state in step 1.a0=SOEand its
index is input into the decoder in step 1to signal the start of
a task. In every step t, the decoder takes as input: 1) the set
of encoder-outputs until step t, i.e. {oi|i=1, ..., t}, 2) the
previous hidden state
dt1, 3) the encoder’s hidden state
ht,
and 4) the previous output action at1. As shown in Fig-
ure 3, the decoder uses these inputs and the attention module
to generate the inputs for its GRU. Then, the GRU outputs
a hidden state
dtand a probability distribution over actions
{P(ak
t)|k=1, .., n}(we discuss how we select atfrom
{P(ak
t)|k=1, .., n}in the latter subsection).
The attention module determines the weight of each oi
based on {oi|i=1, ..., t},
dt1, and
ht(we present the de-
tails of how to compute the weights in the following subsec-
tion). Given the attention weight set {wt,i|i=1, .., t}, the
decoder produces the context vector ctin step tas
ct=
t
i=1
wt,i ·oi(2)
Next, the decoder uses an embedding layer (that maps an
action index to a vector of length l) and dropout layer (that
reduces overfitting (Srivastava et al. 2014)) to generate the
embedding of at1, i.e. Embed(at1). To produce the input
intof GRU in step t, the concatenation (denoted by “”) of
Embed(at1)and ctgoes through a linear layer f2ll(·)
and Relu layer. The Relu layer applies the rectified linear
unit function to each element of the input vector:
Relu(x)=max(0,x)(3)
We use the Relu function because, if the neuron gets acti-
vated, its gradient always equals to 1. This helps to train the
network faster. Given intand
dt1, GRU outputs the de-
coder’s hidden state
dtin step tand an action seed vector at
of length l. To produce the probability distribution over ac-
tions in step t, i.e. {P(ak
t)|k=1, ..., n}, we first use a linear
layer fln(·)that converts atinto a vector {˜ak
t|k=1, .., n}.
Second, we transform {˜ak
t}using a softmax layer. That is,
P(ak
t)= e˜ak
t
n
m=1 e˜am
t
(4)
7097
Figure 3: The detailed design of the attention network
where P(ak
t)represents the probability of taking action ak
in step t.Attention Mechanism. The attention module is
shown in Figure 3. In every step t, the attention module
takes the following input: 1) the set of encoder’s outputs
{oi|it}; Each oinaturally provides the information for
the weights of {oi}; 2) The encoder’s hidden state
ht; The
hidden state
htcontains the information of states until step
tand thus, it helps to produce a new context; and 3) the de-
coder’s previous hidden state
dt1; Note that
dt1comes
from at2and
dt2. It also results from at3,at2and
dt3. Following this logic,
dt1reflects the effects of the
sequence of actions {ai|i<t}on
d0, that is, on the infor-
mation of initial state
h1. The effects of previous actions on
states are useful for generating wt,i since atis correlated
with wt,i.
Given the inputs, a linear layer f2ll(·)transforms
the concatenation of
htand
dt1into a vector vhd =
f2ll(
ht||
dt1). Then, each oiin {oi|it}goes through
a linear layer to become a vector voi=fll(oi). The con-
catenation of voiand vhd further goes through a Tanh layer
(following Bahdanau, Cho, and Bengio (2014)) that applies
the tanh function to each element of the input vector:
tanh(x)=exex
ex+ex(5)
Next, the output of Tanh layer goes through a linear layer
f2ll(·)to output the weight seed ¯wt,i for oiin step t.In
summary, the weight seed ¯wt,i is computed as:
¯wt,i =f2ll(Tanh(f2ll(
ht||
dt1)||fll(oi))) (6)
Finally, a softmax layer takes the set of weight seeds
{¯wt,i|it}as input and outputs the weight wt,i of each
oiin step t. That is,
wt,i =e¯wt,i
t
b=1 e¯wt,b
(7)
Selecting the Action in AATEAM
In the previous subsection, we described the design of one
attention network. Each attention network outputs a proba-
bility distribution over actions. In this subsection, we discuss
how to select the final action for the ad hoc agent given those
probability distributions.
Formally, we denote the set of attention networks in
AATEAM as T. In every step t, we input the current state
stand previous action at1into each attention network.
Note that each network has its own {oi},
ht1and
dt1
(computed by that network). Based on the input, each net-
work jin Toutputs a probability distribution over actions
{P(ak
t)}jas well as its
htand
dt. Note that the encoder’s
hidden state
htresults from {si|it}. Therefore,
htcon-
tains the state information until the step t. Additionally,
dt
reflects the effects of {ai|i<t}on s1(as discussed in
the Attention Mechanism subsection). Henceforth, we train
each attention network such that
dtis the prediction of the
state in step tbased on previous states and actions (we dis-
cuss the details of the training in the next subsection). If
dt
is close to
ht, it means that the attention network’s action
affects the state as intended. Because the state also results
from the teammates’ behaviour, the distance between
dtand
ht, i.e. dist j(
ht,
dt), implies the extent to which the atten-
tion network matches the teammates’ behaviour. The more
a network matches the teammates’ behaviour, the more in-
fluential it should be to the action’s selection. Henceforth,
we can use the distance to weigh the probability distribu-
tions (one per attention network) and select the action with
the highest weighted probability. We compute the weight of
each network jas:
vj=edistj(
ht,
dt)·10
|T |
z=1 edistz(
ht,
dt)·10 (8)
and the weighted probability distribution over actions as:
{¯
P(ak
t)}=
|T |
j=1
{P(ak
t)}j·vj(9)
Finally, we select the action with the highest probability in
{¯
P(ak
t)}as the action for the ad hoc agent in step t.
Training AATEAM
In this subsection, we discuss how we train one attention
network j(for each teammate type j). First, we use rein-
forcement learning to learn a policy πjfor cooperating with
past teammate type j. After that, we make the ad hoc agent
use πjto work with the teammate type jfor a number of
episodes. During the teamwork, we collect the state-action
7098
pairs as training data. Let us denote ˆatas the action that
the ad hoc agent took in step tbased on learned policy πj.
The state-action pair (st,ˆat)is the training instance of step
t, where ˆatis the label for the training. A training sample
is a sequence of state-action pairs, formally {(st,ˆat)}j.For
each state-action pair (st,ˆat)in a training sample, the at-
tention network takes stas input and returns a probability
vector over actions {P(ak
t)|k=1, ..., n}. The probability
that the network selects the action ˆatis P(ak
tat). We cal-
culate the loss of an action using the negative log-likelihood:
lossa=log(P(ak
tat)) (10)
Notice that the higher the probability that the attention net-
work selects ˆat, the smaller the loss of action lossa.
Since {ˆat}output by πjfits the past teammate type j’s
behaviour, i.e. affects the state as intended, the attention net-
work should learn to reduce the distance between
dtand
ht.
By doing so, we ensure that
dtmakes good prediction of st
given the teammate type js behaviour. Hence, we compute
the loss of distance as the square of L2-norm:
lossd=
dt
ht2
2(11)
Then, the overall loss is an additive function of the above
loss measures:
loss =lossa+βlossd(12)
where βis the constant that adjusts the weight of lossd.
We use the Adam optimizer (Kingma and Ba 2014) to up-
date the network parameters to minimise the computed loss.
Moreover, during the training, we apply the teacher forcing
strategy (Williams and Zipser 1989), i.e. use ˆatinstead of at
as the input to the decoder in the step t+1. This is because
the actual st+1 results from ˆatinstead of at.
Comparison to the NLP domain. In the teamwork do-
main, the attention network needs to handle the sequence of
inputs differently from the attention networks in the NLP
domain (e.g. the Figure 1). This is because each domain has
different information available. First, in the NLP domain,
the encoder can encode the whole sentence before inputting
anything into the decoder. In our domain, we do not know
the states from the future. Hence, we can only use the state
information until the current step. Additionally, the NLP de-
coder uses the last hidden state of the encoder as its initial
hidden state such that the decoding is based on the whole
sentence. In our domain, the decoder can only use
h1as its
initial hidden state (since there is only s1in the beginning
of a task). Second, the attention network in NLP needs to
learn when to end a sentence (i.e. output EOS). However,
in our domain, the end of episode is an external signal and
thus, our network does not need to output it.
Experiments
In this section, we present the experiments we have done
to demonstrate the effectiveness of AATEAM. First, we in-
troduce the experiment domain. Second, we present how
we collect the data. Finally, we pitch AATEAM against
PLASTIC-Policy to show the advantage of our approach. In
detail, we show the results of both algorithm when work-
ing with first, known and second, unknown teammates. Our
results indicate that when working with both known and un-
known teammates, in most cases our algorithm outperforms
PLASTIC-Policy significantly.
Half Field Offence Domain
Following Barrett et al. (2017), we use a limited version of
the RoboCup 2D domain, i.e. half field offense (HFO) do-
main (Hausknecht et al. 2016). HFO retains the complex-
ity of the RoboCup 2D domain in terms of the state/action
space and the uncertainty of actions’ effects. The difference
between both settings is that we play the game within a half
of the field and the ad hoc agent always replaces one of the
agents in the offence team. This simpler setting with sin-
gle objective allows researchers to focus on developing al-
gorithms that tackle the environmental complexity.
In this paper, we use the HFO version with two offensive
agents that try to score a goal into opposite team’s goal. The
opposite team has two defenders including the goalie. For
the defensive agents, we adopt the agent2D behaviour pro-
vided by Akiyama (2010). At the beginning of each episode,
the ball is put in a random position close to the midline of
full field. The initial position of the goalie is in the goal cen-
tre. The remaining agents start at random position within a
given range. The ranges of offensive and defensive agents
ensure that the offensive agents are initialised to be closer to
the ball than the defender. In every step, each agent can ob-
serve the position of the ball as well as the others’ positions.
An episode ends in one of the following cases: 1) the offen-
sive team scores a goal, 2) the ball leaves the field, 3) the de-
fensive team captures the ball, or 4) the game lasts for 500
simulation steps. The success episodes are the episodes in
which the offensive team scores a goal.
State. In every simulation step, the ad hoc agent gets the
following state features:
the ad hoc agent’s (x, y)position and its body angle
the ball’s (x, y)position
the distance and the angle from the ad hoc agent to the
centre of the goal
the ad hoc agent’s largest opening angle to the goal
the distance from the ad hoc agent to the closest defender
the teammate’s largest opening angle to the goal
the distance from the teammate to the closest defender
the opening angle for passing to the teammate
the teammate’s (x, y)position
each defender’s (x, y)position
Given our setting, i.e. one teammate and two opponents, the
state is a vector of 18 elements. Note that we sort the defend-
ers’ position in the state vector based on their uniform num-
ber while the original simulator sorts those positions based
on the distance between each defender and the ad hoc agent.
This modification is to ensure that each element about a de-
fender’s position always corresponds to the same defender.
7099
Action. The high-level actions that the ad hoc agent can per-
form are: 1) Dribble,2)Pass,3)Shoot, and 4) Move.In
every step, the simulator informs the ad hoc agent if it is
close enough to be able to kick the ball. If so, the ad hoc
agent can select an action from {Dribbl e, P ass, Shoot}.
When it is not close enough to the ball, it can only perform
Move. Both learned policies and attention networks select
high-level actions for the ad hoc agent when the agent can
kick the ball. Given a high-level action, the simulator maps
it to a continuous action to perform the simulation.
Collecting data for AATEAM. We use the State-Action-
Reward-State-Action (SARSA) algorithm (Hausknecht et al.
2016) to learn a policy πjfor each past teammate type j.
The ad hoc agent uses πjto work with teammate type jfor
200000 episodes. Within each episode, we collect the state-
action pairs {(st,ˆat)}every time the ad hoc agent obtains
the control of the ball and until it loses the ball. This se-
quence of {(st,ˆat)}is one training sample. We only store
the state-action pairs from the success episodes.
Note that due to the continuous state space, the states
of adjacent steps are very similar. Our initial experiments
show that if we collect the state-action pairs in every step,
the smoothness among the sequence of states impairs the at-
tention network’s performance. Henceforth, we set a thresh-
old tr to detect the change of state. In every time step,
we compare each element of current state (denoted as z)
with the corresponding element of previous state (denoted
as z). When at least one state element meets the condition
|zz|
|min(z,z)|>tr, we collect the current state-action pair.
Similarly, when using the trained networks to play, we select
a new action when at least one state element meets the above
condition. Otherwise, we continue the current action. More-
over, we remove the training samples that contain only one
state-action pair. This is because attention networks learn to
extract the temporal correlations of states. The samples with
only one state have no temporal correlation inside and thus,
hinder the learning performance.
Collecting data for PLASTIC-Policy. We collect state tran-
sitions (200000 episodes for each type) since PLASTIC-
Policy uses them to update its probability distribution over
the past teammate types. According to Barrett et al. (2017),
a state transition (s, a, s)means that either the possession
of the ball changes from one agent to another agent or the
episode ends when the ad hoc agent performs the action a.
In the experiments, s is the state of the step when the ad hoc
agent can kick the ball. sis the state of the step either when
another agent can kick the ball (this could result from inter-
cepting, passing, or shooting) or the episode ends. Note that
we cannot determine state transitions when the ad hoc agent
cannot kick the ball because the simulator does not have a
signal that indicates which specific agent holds the ball.
Teammate types. In this paper, we use the externally-
created teammates. The teammates are not programmed to
work with the ad hoc agent. We run the binary file released
from the 2013 RoboCup 2D simulation league competition
(RoboCup 2013). From these external agents, we use five
types as past teammate types, i.e. aut, axiom, agent2D, glid-
ers, and yushan. Additionally, we use five types as new team-
Table 1: Experimental Parameters
Name Value Name Value
β0.1 tr 0.2
u18 l256
GRU Layer Number 5Max length 60
Dropout rate 0.1Learning rate 0.01
mate types, i.e. yunlu, utaustin, legend, ubc, and warthog.
Experimental Parameters and Settings
We sum up the experiment parameters in Table 1. The num-
ber of hidden state layers for GRU is 5. We use the last layer
of hidden states to compute the dist j(
ht,
dt)and lossd. The
maximum length of a state sequence that attention networks
process is 60. When the state sequence’s length exceeds 60,
the networks drop the states from the beginning. The rate of
dropout is 0.1. When training the attention networks, the de-
fault learning rate is 0.01 and the batch size is 32. In the ex-
periments, we use AATEAM and PLASTIC-Policy to play
with each teammate type for 10 trials where each trial con-
tains 1000 episodes. We set the decision time limit for each
step as 100 ms following Barrett et al. (2017). We evaluate
the performance based on the scoring rate, i.e. how many
episodes the offence team wins in every 1000 episodes. The
error bars in Figure 4 and Figure 5 below are the standard
deviations of scoring rates. Note that the results are discrete
points. We add lines between them to improve clarity. We
apply the binomial test to test the significance of our results
(following Barrett et al. (2017)). We observe that the differ-
ence between AATEAM and PLASTIC-Policy is significant
(95% confidence) except for the axiom and ubc types.
Playing with Past Teammates
In this subsection, we make the ad hoc agent play with
the teammates it encountered before, i.e. past teammates.
In detail, it plays with five past teammate types using first,
AATEAM and second, PLASTIC-Policy. Also, we have one
baseline and one upper bound here. The baseline is the per-
formance of the original offence team, i.e. the team without
the ad hoc agent. The upper bound is the performance of the
ad hoc agent playing with a given teammate type following
the learned policy for that teammate. Since each policy is
learned only for the corresponding past teammate type, we
expect the performance with the learned policies to be higher
than any other approach. The results are shown in Figure 4.
In Figure 4, we see that both AATEAM and PLASTIC-
Policy are better than the baseline. This is because the rein-
forcement learning algorithms managed to learn good poli-
cies for past teammate types (both PLASTIC-Policy and
during the training of AATEAM). On top of that, the at-
tention networks successfully learned to fit the behaviour of
past teammates. Hence, both algorithms exhibit more intel-
ligent behaviours.
When it comes to the upper bound, we can see that
AATEAM’s performance is closer to the upper bound than
7100
aut axiom agent2D gliders yushan
0.55
0.60
0.65
0.70
0.75
0.80
Scoring Rate
AATEAM
PLASTIC
Baseline
Upper bound
Figure 4: The performance with known teammates
PLASTIC-Policy in most of the cases. Also, AATEAM out-
performs PLASTIC-Policy significantly. The reason is as
follows. PLASTIC-Policy finds the corresponding policy for
the known teammate type by evaluating the similarity be-
tween past experiences and current ones. However, it may
not be able to identify the most similar past experience due
to the decision time limit (i.e. 100 ms). Thus, the simi-
larity evaluation may cause it to switch between different
learned policies. This degrades its performance. In compar-
ison, AATEAM extracts the context in every step and maps
the context to the best action. Note that the extraction and
mapping using trained attention networks are very efficient.
Therefore, AATEAM can take advantage of the knowledge
from the learned policies in a more flexible way. In detail,
the ad hoc agent uses AATEAM to learn to cope with sev-
eral different teammate types. Next, it plays with teammates
of one of the known types. Each attention network outputs a
probability distribution over actions. AATEAM aggregates
the probabilities from all attention networks. The aggrega-
tion of the probability distributions enables a better usage of
the previous knowledge (as we base our decision on more
knowledge than the knowledge about only one past team-
mate type). Note that the originally learned policies (upper
bounds) come from reinforcement learning and they may
not be optimal. Given a teammate type, some policies for
another types may suggest a better action than the policy
for the given type. Therefore, AATEAM can perform better
than the upper bounds as it uses more information from neu-
ral networks. In summary, we observe that AATEAM can
easily adapt to the teammates’ behaviour.
Playing with New Teammates
In this subsection, the ad hoc agent uses AATEAM and
PLASTIC-Policy to work with the new teammates, i.e. the
legend ubc utaustin warthog yunlu
0.45
0.50
0.55
0.60
0.65
0.70
0.75
0.80
Scoring Rate
AATEAM
PLASTIC
Baseline
Figure 5: The performance with unknown teammates
0 250 500 750 1000 1250 1500 1750 2000
Training Time
(
K iterations
)
0.1
0.3
0.5
0.7
0.9
1.1
Loss of actions
AATEAM
Without attention
Keep samples of length 1
Figure 6: The loss of different training cases (using yushan
type as example)
teammates it has never encountered before. Naturally, we
have one baseline. That is, the baseline where an original
team plays without the ad hoc agent. We cannot use the up-
per bound as the ad hoc agent did not learn the policies to
work with the new teammates beforehand.
The results in Figure 5 show that both AATEAM and
PLASTIC-Policy outperform the baseline as both algo-
rithms behave more intelligently than the replaced agent.
Moreover, in most cases AATEAM still significantly out-
performs PLASTIC-Policy. The reason is that PLASTIC-
Policy switches between policies based on the current be-
haviour of new teammates. Since switching takes time,
PLASTIC-Policy cannot adapt to teammates’ behaviours
quickly. In addition, the performance of both AATEAM and
PLASTIC-Policy is good except when the ad hoc agent
plays with the ubc type. The reason is that the ubc agent
does not often pass the ball to its teammate. Therefore, even
though AATEAM and PLASTIC-Policy are smarter than
its original teammate, they cannot help much. In summary,
the experiments of playing with unknown teammate types
demonstrate that AATEAM can adapt to different team-
mates’ behaviours better than PLASTIC-Policy. This is cru-
cial for the ad hoc teamwork in complex domains.
The Importance of Attention Mechanism
In this subsection, we show the importance of the attention
mechanism in AATEAM. Besides the architecture shown in
Figure 3, we also try to train the neural network without the
attention mechanism, i.e. each oihas the same weight. Ad-
ditionally, we try to include the training samples of length
1 during the training. The loss of actions in different cases
are shown in Figure 6. The results demonstrate that we can
get a good network only when we adopt the attention mech-
anism and remove the training samples of length 1 (to pre-
vent the attention module from being influenced). Note that
the loss of actions directly affects the performance of neural
network. Henceforth, the results in Figure 6 indicate that the
attention mechanism is an essential part in AATEAM.
Conclusions and Future Work
This paper proposes a novel approach called AATEAM to
solve the ad hoc teamwork problem in complex domains.
7101
To the best of our knowledge, this is the first attempt to
solve the ad hoc teamwork problem using attention-based
recurrent neural networks. In detail, AATEAM contains a
set of neural networks, each of them taught the behaviour
of one different past teammate type. Using the probability
distributions output by each neural network, AATEAM se-
lects the best action for the ad hoc agent in real-time. We
pitched our algorithm against PLASTIC-Policy that is the
only scheme proposed for the ad hoc teamwork problem in
complex domains. Our results indicate that when working
with both known and unknown teammates, in most cases
our algorithm outperforms PLASTIC-Policy.
One limitation of AATEAM is that it needs plenty of data
to train the attention networks. Additionally, it is difficult
to know if the architecture of neural networks is the most
appropriate one. As future work, we will test AATEAM for
more teammate types to show the robustness of our method.
Furthermore, we plan to test the same idea for even more
complex domains (with larger action and search space).
Acknowledgments
The research was partially supported by the ST Engi-
neering - NTU Corporate Lab through the NRF corporate
lab@university scheme.
References
Agmon, N., and Stone, P. 2012. Leading ad hoc agents in joint
action settings with multiple teammates. In Proceedings of the
International Conference on Autonomous Agents and Multiagent
Systems, 341–348.
Agmon, N.; Barrett, S.; and Stone, P. 2014. Modeling uncertainty
in leading ad hoc teams. In Proceedings of the International Con-
ference on Autonomous Agents and Multiagent Systems, 397–404.
Akiyama, H. 2010. Agent2D base code release. https://osdn.net/
projects/rctools/.
Albrecht, S. V., and Ramamoorthy, S. 2013. A game-theoretic
model and best-response learning method for ad hoc coordination
in multiagent systems. In Proceedings of the International Confer-
ence on Autonomous Agents and Multiagent Systems, 1155–1156.
Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine
translation by jointly learning to align and translate. arXiv preprint
arXiv:1409.0473.
Barrett, S.; Stone, P.; Kraus, S.; and Rosenfeld, A. 2013. Teamwork
with limited knowledge of teammates. In Proceedings of the AAAI
Conference on Artificial Intelligence, 102–108.
Barrett, S.; Rosenfeld, A.; Kraus, S.; and Stone, P. 2017. Mak-
ing friends on the fly: Cooperating with new teammates. Artificial
Intelligence 242:132–171.
Chakraborty, D., and Stone, P. 2013. Cooperating with a Marko-
vian ad hoc teammate. In Proceedings of the International Confer-
ence on Autonomous Agents and Multiagent Systems, 1085–1092.
Chandrasekaran, M.; Doshi, P.; Zeng, Y.; and Chen, Y. 2017. Can
bounded and self-interested agents be teammates? Application to
planning in ad hoc teams. Autonomous Agents and Multi-Agent
Systems 31(4):821–860.
Chen, S.; Andrejczuk, E.; A. Irissappane, A.; and Zhang, J. 2019.
ATSIS: Achieving the ad hoc teamwork by sub-task inference and
selection. In Proceedings of the International Joint Conference on
Artificial Intelligence, 172–179.
Cho, K.; Van Merri¨
enboer, B.; Gulcehre, C.; Bahdanau, D.;
Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase
representations using RNN encoder-decoder for statistical machine
translation. arXiv preprint arXiv:1406.1078.
Chung, J.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2014. Empirical
evaluation of gated recurrent neural networks on sequence model-
ing. arXiv preprint arXiv:1412.3555.
Genter, K. L.; Agmon, N.; and Stone, P. 2011. Role-based ad
hoc teamwork. In Proceedings of the AAAI Conference on Plan,
Activity, and Intent Recognition, 17–24.
Grosz, B. J., and Kraus, S. 1996. Collaborative plans for complex
group action. Artificial Intelligence 86(2):269–357.
Hausknecht, M.; Mupparaju, P.; Subramanian, S.; Kalyanakrish-
nan, S.; and Stone, P. 2016. Half field offense: An environment
for multiagent learning and ad hoc teamwork. In AAMAS Adaptive
Learning Agents (ALA) Workshop.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term mem-
ory. Neural computation 9(8):1735–1780.
Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic
optimization. arXiv preprint arXiv:1412.6980.
Melo, F. S., and Sardinha, A. 2016. Ad hoc teamwork by learn-
ing teammates’ task. Autonomous Agents and Multi-Agent Systems
30(2):175–219.
Mnih, V.; Heess, N.; Graves, A.; and Kavukcuoglu, K. 2014. Re-
current models of visual attention. In Advances in Neural Informa-
tion Processing Systems, 2204–2212.
Ravula, M.; Alkoby, S.; and Stone, P. 2019. Ad hoc teamwork
with behavior switching agents. In Proceedings of the International
Joint Conference on Artificial Intelligence, 550–556.
RoboCup. 2013. The binary file for the agents. https://archive.
robocup.info/Soccer/Simulation/2D/binaries/RoboCup/.
Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and
Salakhutdinov, R. 2014. Dropout: a simple way to prevent neu-
ral networks from overfitting. The Journal of Machine Learning
Research 15(1):1929–1958.
Stone, P.; Kaminka, G. A.; Kraus, S.; and Rosenschein, J. S.
2010. Ad hoc autonomous agent teams: Collaboration without pre-
coordination. In Proceedings of the AAAI Conference on Artificial
Intelligence, 1504–1509.
Sutton, R. S., and Barto, A. G. 2011. Reinforcement learning: An
introduction. Cambridge, MA: MIT Press.
Tambe, M. 1997. Towards flexible teamwork. Journal of Artificial
Intelligence Research 7:83–124.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.;
Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is
all you need. In Advances in Neural Information Processing Sys-
tems, 5998–6008.
Williams, R. J., and Zipser, D. 1989. A learning algorithm for
continually running fully recurrent neural networks. Neural Com-
putation 1(2):270–280.
Wu, F.; Zilberstein, S.; and Chen, X. 2011. Online planning for ad
hoc autonomous agent teams. In Proceedings of the International
Joint Conferences on Artificial Intelligence, 439–445.
Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.;
Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural
image caption generation with visual attention. In International
Conference on Machine Learning, 2048–2057.
7102
... The formal definition of ad hoc teamwork is proposed in [40], and has attracted a surge of studies in the past years [2,3,4,5,9,10,57]. However, most of these works discuss ad hoc teamwork collaboration as a theoretical problem under relatively ideal conditions in some simple 2D environments. ...
... Two common methods for ad hoc teamwork include type inference, which models teammates as defined types and uses past interactions to estimate type probabilities for learning algorithms [2,3], and experience recognition, which compares current and past observations to identify the best actions, with PLASTIC-Policy as an influential example [4]. Deep learning has been applied to both methods [5,29,57], and some variations of ad hoc teamwork have been explored, including partial observability settings [12,30] and open environment settings with fluctuating team sizes [11,33]. However, most ad hoc teamwork research uses data-driven learning methods, which are computationally expensive and lack transparency. ...
Preprint
Compared with the widely investigated homogeneous multi-robot collaboration, heterogeneous robots with different capabilities can provide a more efficient and flexible collaboration for more complex tasks. In this paper, we consider a more challenging heterogeneous ad hoc teamwork collaboration problem where an ad hoc robot joins an existing heterogeneous team for a shared goal. Specifically, the ad hoc robot collaborates with unknown teammates without prior coordination, and it is expected to generate an appropriate cooperation policy to improve the efficiency of the whole team. To solve this challenging problem, we leverage the remarkable potential of the large language model (LLM) to establish a decentralized heterogeneous ad hoc teamwork collaboration framework that focuses on generating reasonable policy for an ad hoc robot to collaborate with original heterogeneous teammates. A training-free hierarchical dynamic planner is developed using the LLM together with the newly proposed Interactive Reflection of Thoughts (IRoT) method for the ad hoc agent to adapt to different teams. We also build a benchmark testing dataset to evaluate the proposed framework in the heterogeneous ad hoc multi-agent tidying-up task. Extensive comparison and ablation experiments are conducted in the benchmark to demonstrate the effectiveness of the proposed framework. We have also employed the proposed framework in physical robots in a real-world scenario. The experimental videos can be found at https://youtu.be/wHYP5T2WIp0.
... UCT has been combined with biased adaptive play to provide an online planning algorithm for ad hoc teamwork (Wu, Zilberstein, and Chen 2011), and researchers have explored using Value Iteration or UCT depending on the state observations (Barrett, Stone, and Kraus 2011). Much of the current (and recent) research has formulated AHT as a probabilistic sequential decision-making problem, assuming an underlying Markov decision process (MDP) or a partially observable MDP (POMDP) (Barrett et al. 2017;Chen et al. 2020;Santos et al. 2021;Rahman et al. 2021). This has included learning different policies for different teammate types and choosing a policy based on the teammate types seen during execution (Barrett et al. 2017). ...
... This has included learning different policies for different teammate types and choosing a policy based on the teammate types seen during execution (Barrett et al. 2017). Later work has used recurrent neural networks to avoid switching between policies for different teammate types (Chen et al. 2020). ...
Article
Full-text available
State of the art methods for ad hoc teamwork, i.e., for collaboration without prior coordination, often use a long history of prior observations to model the behavior of other agents (or agent types) and to determine the ad hoc agent's behavior. In many practical domains, it is difficult to obtain large training datasets, and necessary to quickly revise the existing models to account for changes in team composition or domain attributes. Our architecture builds on the principles of step-wise refinement and ecological rationality to enable an ad hoc agent to perform non-monotonic logical reasoning with prior commonsense domain knowledge and models learned rapidly from limited examples to predict the behavior of other agents. In the simulated multiagent collaboration domain Fort Attack, we experimentally demonstrate that our architecture enables an ad hoc agent to adapt to changes in the behavior of other agents, and provides enhanced transparency and better performance than a state of the art data-driven baseline.
... They enabled the model to work in a synthetic soccer game, where different policies were trained to cooperate with dedicated soccer teams (Hausknecht et al., 2016). Chen et al. (2020) designed an attention network to perform teammate type inference, which incorporated the temporal information flexibly. Nevertheless, their solution was still based on pre-training best responses over a static set of teammate types, which they chose to inherit the setting of Barrett et al. (2017). ...
... S.) (Thompson, 1933). Qlearning has been widely applied in many previous ad hoc solutions (Barrett et al., 2017;Chen et al., 2020). The other baselines are classic online learning solutions designed for similar context. ...
Preprint
Ad hoc teamwork requires an agent to cooperate with unknown teammates without prior coordination. Many works propose to abstract teammate instances into high-level representation of types and then pre-train the best response for each type. However, most of them do not consider the distribution of teammate instances within a type. This could expose the agent to the hidden risk of \emph{type confounding}. In the worst case, the best response for an abstract teammate type could be the worst response for all specific instances of that type. This work addresses the issue from the lens of causal inference. We first theoretically demonstrate that this phenomenon is due to the spurious correlation brought by uncontrolled teammate distribution. Then, we propose our solution, CTCAT, which disentangles such correlation through an instance-wise teammate feedback rectification. This operation reweights the interaction of teammate instances within a shared type to reduce the influence of type confounding. The effect of CTCAT is evaluated in multiple domains, including classic ad hoc teamwork tasks and real-world scenarios. Results show that CTCAT is robust to the influence of type confounding, a practical issue that directly hazards the robustness of our trained agents but was unnoticed in previous works.
... However, this approach relies on environmental feedback rewards, which may be very sparse in real-world cases. The AATEAM [20] algorithm employs attention mechanisms to measure the similarity between current teammate types and those in the testing types repository. However, this method requires a large amount of data to train the attention network, which is inefficient. ...
Article
Full-text available
When agents need to collaborate without previous coordination, the multi-agent cooperation problem transforms into an ad hoc teamwork (AHT) problem. Mainstream research on AHT is divided into type-based and type-free methods. The former depends on known teammate types to infer the current teammate type, while the latter does not require them at all. However, in many real-world applications, the complete absence and sufficient knowledge of known types are both impractical. Thus, this research focuses on the challenge of AHT with limited known types. To this end, this paper proposes a method called a Few typE-based Ad hoc Teamwork via meta-reinforcement learning (FEAT), which effectively adapts to teammates using a small set of known types within a single episode. FEAT enables agents to develop a highly adaptive policy through meta-reinforcement learning by employing limited priors about known types. It also utilizes this policy to generate a diverse type repository automatically. During the ad hoc cooperation, the agent can autonomously identify known teammate types followed by directly utilizing the pre-trained optimal cooperative policy or swiftly updating the meta policy to respond to teammates of unknown types. Comprehensive experiments in the pursuit domain validate the effectiveness of the algorithm and its components.
... ConvCPD [54] allows the changing of teammate types, which introduces a mechanism to detect the change point of the current type. AATEAM [55] develops an attention-based structure to infer teammate types from the state histories. ODITS [56] proposes a multimodal representation framework to encode teamwork situations. ...
Article
Full-text available
Multi-agent reinforcement learning (MARL) is a prevalent learning paradigm for solving stochastic games. In most MARL studies, agents in a game are defined as teammates or enemies beforehand, and the relationships among the agents (i.e., their identities) remain fixed throughout the game. However, in real-world problems, the agent relationships are commonly unknown in advance or dynamically changing. Many multi-party interactions start off by asking: who is on my team? This question arises whether it is the first day at the stock exchange or the kindergarten. Therefore, training policies for such situations in the face of imperfect information and ambiguous identities is an important problem that needs to be addressed. In this work, we develop a novel identity detection reinforcement learning (IDRL) framework that allows an agent to dynamically infer the identities of nearby agents and select an appropriate policy to accomplish the task. In the IDRL framework, a relation network is constructed to deduce the identities of other agents by observing the behaviors of the agents. A danger network is optimized to estimate the risk of false-positive identifications. Beyond that, we propose an intrinsic reward that balances the need to maximize external rewards and accurate identification. After identifying the cooperation-competition pattern among the agents, IDRL applies one of the off-the-shelf MARL methods to learn the policy. To evaluate the proposed method, we conduct experiments on Red-10 card-shedding game, and the results show that IDRL achieves superior performance over other state-of-the-art MARL methods. Impressively, the relation network has the par performance to identify the identities of agents with top human players; the danger network reasonably avoids the risk of imperfect identification. The code to reproduce all the reported results is available online at https://github.com/MR-BENjie/IDRL.
... For example, RL methods have been used to choose the most useful policy (from a set of learned policies) to control the ad hoc agent in each situation (Barrett et al. 2017), or to consider predictions from learned policies when selecting an ad hoc agent's actions for different types of agents (Santos et al. 2021). Attention-based deep neural networks have been used to jointly learn policies for different agent types (Chen et al. 2020) and for different team compositions (Rahman et al. 2021). Other work has combined sampling strategies with learning methods to optimize performance (Zand et al. 2022). ...
Article
Full-text available
Ad hoc teamwork (AHT) refers to the problem of enabling an agent to collaborate with teammates without prior coordination. State of the art methods in AHT are data-driven, using a large labeled dataset of prior observations to model the behavior of other agent types and to determine the ad hoc agent’s behavior. These methods are computationally expensive, lack transparency, and make it difficult to adapt to previously unseen changes. Our recent work introduced an architecture that determined an ad hoc agent’s behavior based on non-monotonic logical reasoning with prior commonsense domain knowledge and models learned from limited examples to predict the behavior of other agents. This paper describes KAT, a knowledge-driven architecture for AHT that substantially expands our prior architecture’s capabilities to support: (a) online selection, adaptation, and learning of the behavior prediction models; and (b) collaboration with teammates in the presence of partial observability and limited communication. We illustrate and experimentally evaluate KAT’s capabilities in two simulated benchmark domains for multiagent collaboration: Fort Attack and Half Field Offense. We show that KAT’s performance is better than a purely knowledge-driven baseline, and comparable with or better than a state of the art data-driven baseline, particularly in the presence of limited training data, partial observability, and changes in team composition.
Conference Paper
State of the art frameworks for ad hoc teamwork i.e., for enabling an agent to collaborate with others “on the fly”, pursue a data-driven approach, using a large labeled dataset of prior observations to model the behavior of other agents and to determine the ad hoc agent’s behavior. It is often difficult to pursue such an approach in complex domains due to the lack of sufficient training examples and computational resources. In addition, the learned models lack transparency and it is difficult to revise the existing knowledge in response to previously unseen changes. Our prior architecture enabled an ad hoc agent to perform non-monotonic logical reasoning with commonsense domain knowledge and predictive models of other agents’ behavior that are learned from limited examples. In this paper, we enable the ad hoc agent to acquire previously unknown domain knowledge governing actions and change, and to provide relational descriptions as on-demand explanations of its decisions in response to different types of questions. We evaluate the architecture’s knowledge acquisition and explanation generation abilities in two simulated benchmark domains: Fort Attack and Half Field Offense.
Conference Paper
Full-text available
In an ad hoc teamwork setting, the team needs to coordinate their activities to perform a task without prior agreement on how to achieve it. The ad hoc agent cannot communicate with its teammates but it can observe their behaviour and plan accordingly. To do so, the existing approaches rely on the teammates' behaviour models. However, the models may not be accurate, which can compromise teamwork. For this reason, we present the Ad Hoc Teamwork by Sub-task Inference and Selection (ATSIS) algorithm, which uses sub-task inference without relying on teammates' models. First, the ad hoc agent observes its teammates to infer which sub-tasks they are handling. Based on that, it selects its own sub-task using a partially observable Markov decision process that handles the uncertainty of the sub-task inference. Last, the ad hoc agent uses Monte Carlo tree search to find the set of actions to perform the sub-task. Our experiments show the benefits of ATSIS for robust teamwork.
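As an illustration of the sub-task inference and selection idea, the toy Python sketch below keeps a Bayesian belief over which sub-task a teammate is handling and then picks a complementary sub-task for the ad hoc agent; the sub-task set, likelihoods, and selection rule are assumptions for illustration, and the Monte Carlo tree search step that ATSIS runs afterwards is omitted.

SUBTASKS = ["chase_prey", "block_escape"]  # illustrative sub-task set

def update_belief(belief, obs_likelihood):
    """Bayes update of the belief over which sub-task the teammate handles."""
    posterior = {s: belief[s] * obs_likelihood[s] for s in SUBTASKS}
    z = sum(posterior.values()) or 1.0
    return {s: p / z for s, p in posterior.items()}

def select_own_subtask(belief):
    """Toy rule: take the sub-task the teammate is least likely to cover."""
    return min(SUBTASKS, key=lambda s: belief[s])

belief = {s: 1.0 / len(SUBTASKS) for s in SUBTASKS}
# The latest observation suggests the teammate is chasing the prey.
belief = update_belief(belief, {"chase_prey": 0.8, "block_escape": 0.2})
print(select_own_subtask(belief))  # block_escape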
Article
Full-text available
Planning for ad hoc teamwork is challenging because it involves agents collaborating without any prior coordination or communication. The focus is on principled methods for a single agent to cooperate with others. This motivates investigating the ad hoc teamwork problem in the context of self-interested decision-making frameworks. Agents engaged in individual decision making in multiagent settings face the task of having to reason about other agents’ actions, which may in turn involve reasoning about others. An established approximation that operationalizes this approach is to bound the infinite nesting from below by introducing level 0 models. For the purposes of this study, individual, self-interested decision making in multiagent settings is modeled using interactive dynamic influence diagrams (I-DID). These are graphical models with the benefit that they naturally offer a factored representation of the problem, allowing agents to ascribe dynamic models to others and reason about them. We demonstrate that an implication of bounded, finitely-nested reasoning by a self-interested agent is that we may not obtain optimal team solutions in cooperative settings, if it is part of a team. We address this limitation by including models at level 0 whose solutions involve reinforcement learning. We show how the learning is integrated into planning in the context of I-DIDs. This facilitates optimal teammate behavior, and we demonstrate its applicability to ad hoc teamwork on several problem domains and configurations.
Article
Full-text available
Robots are being deployed in an increasing variety of environments for longer periods of time. As the number of robots grows, they will increasingly need to interact with other robots. Additionally, the number of companies and research laboratories producing these robots is increasing, leading to the situation where these robots may not share a common communication or coordination protocol. While standards for coordination and communication may be created, we expect that robots will need to additionally reason intelligently about their teammates with limited information. This problem motivates the area of ad hoc teamwork in which an agent may potentially cooperate with a variety of teammates in order to achieve a shared goal. This article focuses on a limited version of the ad hoc teamwork problem in which an agent knows the environmental dynamics and has had past experiences with other teammates, though these experiences may not be representative of the current teammates. To tackle this problem, this article introduces a new general-purpose algorithm, PLASTIC, that reuses knowledge learned from previous teammates or provided by experts to quickly adapt to new teammates. This algorithm is instantiated in two forms: 1) PLASTIC–Model – which builds models of previous teammates' behaviors and plans behaviors online using these models and 2) PLASTIC–Policy – which learns policies for cooperating with previous teammates and selects among these policies online. We evaluate PLASTIC on two benchmark tasks: the pursuit domain and robot soccer in the RoboCup 2D simulation domain. Recognizing that a key requirement of ad hoc teamwork is adaptability to previously unseen agents, the tests use more than 40 previously unknown teams on the first task and 7 previously unknown teams on the second. While PLASTIC assumes that there is some degree of similarity between the current and past teammates' behaviors, no steps are taken in the experimental setup to make sure this assumption holds. The teammates were created by a variety of independent developers and were not designed to share any similarities. Nonetheless, the results show that PLASTIC was able to identify and exploit similarities between its current and past teammates' behaviors, allowing it to quickly adapt to new teammates.
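A hedged sketch of the kind of loss-weighted belief update behind PLASTIC–Policy's online selection among previously learned policies is shown below; the learning rate, loss definition, and model names are illustrative rather than taken from the authors' code.

# Each stored teammate model is discounted in proportion to how poorly it
# predicted the teammates' last observed action (polynomial-weights style).
def update_beliefs(beliefs, prediction_probs, eta=0.2):
    """beliefs: {model: probability}; prediction_probs: probability each
    model assigned to the action the teammates actually took."""
    new = {}
    for m, b in beliefs.items():
        loss = 1.0 - prediction_probs[m]   # higher loss = worse prediction
        new[m] = b * (1.0 - eta * loss)
    z = sum(new.values())
    return {m: v / z for m, v in new.items()}

beliefs = {"teamA_policy": 0.5, "teamB_policy": 0.5}
beliefs = update_beliefs(beliefs, {"teamA_policy": 0.9, "teamB_policy": 0.3})
best = max(beliefs, key=beliefs.get)  # act with the best-matching policy
print(best, beliefs)  # teamA_policy gains belief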
Article
Full-text available
This paper addresses the problem of ad hoc teamwork, where a learning agent engages in a cooperative task with other (unknown) agents. The agent must effectively coordinate with the other agents towards completion of the intended task, not relying on any pre-defined coordination strategy. We contribute a new perspective on the ad hoc teamwork problem and propose that, in general, the learning agent should not only identify (and coordinate with) the teammates’ strategy but also identify the task to be completed. In our approach to the ad hoc teamwork problem, we represent tasks as fully cooperative matrix games. Relying exclusively on observations of the behavior of the teammates, the learning agent must identify the task at hand (namely, the corresponding payoff function) from a set of possible tasks and adapt to the teammates’ behavior. Teammates are assumed to follow a bounded-rationality best-response model and thus also adapt their behavior to that of the learning agent. We formalize the ad hoc teamwork problem as a sequential decision problem and propose two novel approaches to address it. In particular, we propose (i) the use of an online learning approach that considers the different tasks depending on their ability to predict the behavior of the teammate; and (ii) a decision-theoretic approach that models the ad hoc teamwork problem as a partially observable Markov decision problem. We provide theoretical bounds of the performance of both approaches and evaluate their performance in several domains of different complexity.
Conference Paper
As autonomous AI agents proliferate in the real world, they will increasingly need to cooperate with each other to achieve complex goals without always being able to coordinate in advance. This kind of cooperation, in which agents have to learn to cooperate on the fly, is called ad hoc teamwork. Many previous works investigating this setting assumed that teammates behave according to one of many predefined types that is fixed throughout the task. This assumption of stationarity in behaviors is strong and cannot be guaranteed in many real-world settings. In this work, we relax this assumption and investigate settings in which teammates can change their types during the course of the task. This adds complexity to the planning problem, as the agent now needs to recognize that a change has occurred in addition to figuring out the new type of the teammate it is interacting with. In this paper, we present a novel Convolutional-Neural-Network-based Change point Detection (CPD) algorithm for ad hoc teamwork. When evaluating our algorithm on the modified predator-prey domain, we find that it outperforms existing Bayesian CPD algorithms.
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
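For reference, the core operation the Transformer builds on is scaled dot-product attention; the NumPy sketch below is a generic illustration, not the authors' code.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d), K: (m, d), V: (m, d_v); returns an (n, d_v) array."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # attention-weighted sum of the values

Q = np.random.randn(2, 4)
K = np.random.randn(3, 4)
V = np.random.randn(3, 8)
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 8)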
Article
Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
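A short NumPy sketch of the idea follows, using the common "inverted dropout" variant that rescales at training time so the test-time network is used unchanged; the paper's original formulation instead scales the weights down at test time.

import numpy as np

def dropout(x, p_drop=0.5, training=True, rng=np.random.default_rng(0)):
    """Randomly zero units during training and rescale the survivors so the
    expected activation is unchanged; do nothing at test time."""
    if not training or p_drop == 0.0:
        return x
    mask = rng.random(x.shape) >= p_drop
    return x * mask / (1.0 - p_drop)

x = np.ones((2, 4))
print(dropout(x))                   # roughly half the units zeroed, rest scaled by 2
print(dropout(x, training=False))   # unchanged at test time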
Article
Applying convolutional neural networks to large images is computationally expensive because the amount of computation scales linearly with the number of image pixels. We present a novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution. Like convolutional neural networks, the proposed model has a degree of translation invariance built-in, but the amount of computation it performs can be controlled independently of the input image size. While the model is non-differentiable, it can be trained using reinforcement learning methods to learn task-specific policies. We evaluate our model on several image classification tasks, where it significantly outperforms a convolutional neural network baseline on cluttered images, and on a dynamic visual control problem, where it learns to track a simple object without an explicit training signal for doing so.
Conference Paper
The growing use of autonomous agents in practice may require agents to cooperate as a team in situations where they have limited prior knowledge about one another, cannot communicate directly, or do not share the same world models. These situations raise the need to design ad hoc team members, i.e., agents that will be able to cooperate without coordination in order to reach an optimal team behavior. This paper considers the problem of leading N-agent teams by an agent toward their optimal joint utility, where the agents compute their next actions based only on their most recent observations of their teammates' actions. We show that compared to previous results in two-agent teams, in larger teams the agent might not be able to lead the team to the action with maximal joint utility, thus its optimal strategy is to lead the team to the best possible reachable cycle of joint actions. We describe a graphical model of the problem and a polynomial time algorithm for solving it. We then consider other variations of the problem, including leading teams of agents where they base their actions on longer history of past observations, leading a team by more than one ad hoc agent, and leading a teammate while the ad hoc agent is uncertain of its behavior.
Conference Paper
The ad hoc coordination problem is to design an ad hoc agent which is able to achieve optimal flexibility and efficiency in a multiagent system that admits no prior coordination between the ad hoc agent and the other agents. We conceptualise this problem formally as a stochastic Bayesian game in which the behaviour of a player is determined by its type. Based on this model, we derive a solution, called Harsanyi-Bellman Ad Hoc Coordination (HBA), which utilises a set of user-defined types to characterise players based on their observed behaviours. We evaluate HBA in the level-based foraging domain, showing that it outperforms several alternative algorithms using just a few user-defined types. We also report on a human-machine experiment in which the humans played Prisoner's Dilemma and Rock-Paper-Scissors against HBA and alternative algorithms. The results show that HBA achieved equal efficiency but a significantly higher welfare and winning rate.
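A simplified one-shot Python sketch of the type-based reasoning in HBA: maintain a posterior over user-defined teammate types from observed behaviour and pick the action with the highest expected payoff under that posterior. The full method also plans over future payoffs, Bellman-style, and the types and payoffs below are purely illustrative.

def posterior_over_types(prior, likelihoods):
    """likelihoods[t]: probability of the observed behaviour under type t."""
    post = {t: prior[t] * likelihoods[t] for t in prior}
    z = sum(post.values()) or 1.0
    return {t: p / z for t, p in post.items()}

def best_response(posterior, payoff):
    """payoff[(action, type)]: our expected payoff for an action against a type."""
    actions = {a for a, _ in payoff}
    return max(actions,
               key=lambda a: sum(posterior[t] * payoff[(a, t)] for t in posterior))

prior = {"cooperator": 0.5, "defector": 0.5}
post = posterior_over_types(prior, {"cooperator": 0.7, "defector": 0.2})
payoff = {("C", "cooperator"): 3, ("C", "defector"): 0,
          ("D", "cooperator"): 5, ("D", "defector"): 1}
print(best_response(post, payoff))  # action with highest expected payoff under the posterior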