The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)
AATEAM: Achieving the Ad Hoc Teamwork by Employing the Attention Mechanism
Shuo Chen,1,2 Ewa Andrejczuk,2 Zhiguang Cao,3 Jie Zhang1
1School of Computer Science and Engineering, Nanyang Technological University, Singapore
2ST Engineering - NTU Corporate Laboratory, Nanyang Technological University, Singapore
3Department of Industrial Systems Engineering and Management, National University of Singapore, Singapore
chen1087@e.ntu.edu.sg, {ewaa, zhangj}@ntu.edu.sg, isecaoz@nus.edu.sg
Abstract
In the ad hoc teamwork setting, a team of agents needs to perform a task without prior coordination. The most advanced approach learns policies based on previous experiences and reuses one of the policies to interact with new teammates. However, the selected policy is in many cases sub-optimal. Switching between policies to adapt to new teammates' behaviour takes time, which threatens the successful performance of a task. In this paper, we propose AATEAM, a method that uses attention-based neural networks to cope with new teammates' behaviour in real-time. We train one attention network per teammate type. The attention networks learn both to extract the temporal correlations from the sequence of states (i.e. contexts) and the mapping from contexts to actions. Each attention network also learns to predict a future state given the current context and its output action. The prediction accuracies help to determine which actions the ad hoc agent should take. We perform extensive experiments to show the effectiveness of our method.
Introduction
In the majority of the multiagent systems (MAS) literature tackling teamwork, a team of at least two agents shares cooperation protocols to work together and perform collaborative tasks (Grosz and Kraus 1996; Tambe 1997). However, with the increasing number of agents in various domains (e.g. construction, search and rescue) developed by different companies, we can no longer coordinate agents' activities beforehand. In this case, we expect a team of agents to perform a task without any pre-defined coordination strategy. This problem is known in the literature as the ad hoc teamwork problem (Stone et al. 2010). This paper tackles the ad hoc teamwork problem where an ad hoc agent plans its actions only by observing its teammates.
Most of the existing approaches for ad hoc teamwork focus on simple domains, i.e. they ignore the environmental uncertainty or they assume teammates' actions are fully observable (Agmon, Barrett, and Stone 2014; Ravula, Alkoby, and Stone 2019). To the best of our knowledge, PLASTIC-Policy (Barrett et al. 2017) is the only scheme proposed for more complex domains, i.e. with a dynamic environment and a continuous state/action space. PLASTIC-Policy learns a policy for each past teammate type. When working with new teammates, it chooses one of the policies based on the similarity between the new and past experiences and uses that policy to select the sequence of actions for the ad hoc agent. However, the selected policy only works well when the new teammates behave similarly to the chosen teammate type. When the new teammates' behaviour diverges from that teammate type, the ad hoc agent needs to switch to another policy. Switching between policies to adapt to new teammates' behaviour takes time, which threatens the successful performance of a task. Given that, it is desirable to develop a new method that adapts to new teammates' behaviours in real-time for complex domains.

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
In this paper, we propose to Achieve the Ad hoc Teamwork by Employing the Attention Mechanism (AATEAM). AATEAM uses attention-based recurrent neural networks to cope with new teammates' behaviour in real-time. Our workflow is as follows. First, we use reinforcement learning to learn policies to work with past teammates. Second, we make the ad hoc agent play with each type of teammate using the corresponding learned policy. We collect the states (input) and the ad hoc agent's actions (labels) as training data. Third, we use this data to train one attention network for each type of past teammates. The attention network works as follows. In every step, the attention network takes a state as input. The attention mechanism inside extracts the temporal correlation between the state and the states encountered previously, i.e. the context. It further maps the context to a probability distribution over actions. In every step, the attention networks also output a metric that measures the similarity between past and new teammate types. The more similar the types, the more important the corresponding probability distribution over actions. We aggregate the networks' probability distributions over actions by calculating the sum of those probabilities weighted by their importance. Finally, we output the action with the highest aggregated probability. Since AATEAM evaluates the similarity in every step, we can adjust to the new teammates' behaviour in real-time.
In summary, we make the following contributions: 1) We propose AATEAM, an attention-based method to cope with teammates' different behaviours in complex domains. To the best of our knowledge, this is the first time the attention mechanism is applied to this problem. 2) Besides learning the mapping from contexts to probability distributions over actions, we train the attention networks to measure the similarity between the corresponding past teammates and new teammates. This helps to adjust the action selection in real-time. 3) We perform extensive experiments in the RoboCup 2D simulation domain. The results demonstrate that the attention networks outperform PLASTIC-Policy when working with both known and unknown teammates.
Related Work
The majority of existing work on ad hoc teamwork is proposed for simple domains. These works either do not consider the environmental uncertainty or they assume the action space is fully observable.
For the first simplification, the authors of (Agmon and Stone 2012; Chakraborty and Stone 2013; Agmon, Barrett, and Stone 2014) assume that the utilities of the team's actions are given. Also, Genter, Agmon, and Stone (2011) deal with the role assignment problem for the ad hoc agent where teammates' roles are known. However, the actions' utilities and teammates' roles may vary when the environment changes. Therefore, their problem settings are too simplistic.
As for the second simplification, assuming a fully-observable action space limits the potential for real-life applications. Researchers develop algorithms to plan the ad hoc agent's actions either by using the past teammates' behaviour types (Wu, Zilberstein, and Chen 2011; Albrecht and Ramamoorthy 2013; Barrett et al. 2013; Melo and Sardinha 2016; Chandrasekaran et al. 2017; Ravula, Alkoby, and Stone 2019) or by inferring the responsibilities the new teammates are taking (Chen et al. 2019). In both cases, they assume that teammates' actions are fully observable, which is a strong assumption when it comes to more complex domains (e.g. the actions are continuous and hence, the ad hoc agent cannot know the exact value of an action).
The only scheme for complex domains is PLASTIC-Policy (Barrett et al. 2017). PLASTIC-Policy uses reinforcement learning to learn a policy for working with each type of past teammates. To reduce the learning complexity, PLASTIC-Policy first assumes the existence of a countable set of high-level actions. Each high-level action is a mapping from the state to a continuous action. Therefore, it limits the action space to be searched by reinforcement learning. The policy takes a continuous state as input and outputs a high-level action for the ad hoc agent. As mentioned before, in complex domains, the ad hoc agent cannot directly observe teammates' actions. Therefore, PLASTIC-Policy defines the state transition for a continuous state space and observes teammates' behaviour based on state transitions. Given a state transition with new teammates, the ad hoc agent locates the most similar state transition with each type of past teammates. Using that information, it updates the probability of the new teammates being similar to each type of past teammates, and applies the policy for the most similar past teammates. The main drawback of PLASTIC-Policy is that the learned policies are not necessarily optimal for new teammates. Additionally, the update based on state transitions may not be fast enough to adapt to the new teammates' changing behaviour. Our paper addresses these drawbacks by using all attention networks to select the best action based on a context and the similarity between past and new teammates' types.
Preliminary
Markov Decision Process
In this paper, our objective is to find the best set of actions for the ad hoc agent. Following the state-of-the-art literature (Barrett et al. 2017), we model the ad hoc teamwork problem as a Markov Decision Process (MDP) (Sutton and Barto 2011). Specifically, we represent an MDP by a tuple ⟨S, A, P, R⟩, where S is a set of states; A is a set of high-level actions the ad hoc agent can perform; P : S × A × S → [0, 1] is the state transition function, that is, P(s′ | s, a) is the probability of transiting to the next state s′ given the action a and the current state s; and the reward function R : S × A → ℝ specifies the reward R(s, a) when the ad hoc agent takes the action a in the state s. Note that for a continuous state space, the definition of a state transition is domain-specific.
A policy π : S → A takes a state s as input and outputs an action a. After taking the action a in the state s, the maximum long-term expected reward is computed as:

    Q(s, a) = R(s, a) + γ · Σ_{s′∈S} P(s′ | s, a) · max_{a′} Q(s′, a′)    (1)

where 0 < γ ≤ 1 is the discount factor for the future reward. An optimal policy chooses the action a that maximizes Q(s, a) for all s.
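To make Eq. (1) concrete, the following is a minimal value-iteration sketch on a toy two-state MDP; the states, actions, transition probabilities and rewards are invented for illustration and are not part of the paper's domain:

```python
# Toy MDP: 2 states, 2 actions -- all numbers are made up for illustration.
S = [0, 1]
A = [0, 1]
P = {  # P[(s, a)] = list of (next_state, probability)
    (0, 0): [(0, 0.9), (1, 0.1)],
    (0, 1): [(0, 0.2), (1, 0.8)],
    (1, 0): [(0, 0.5), (1, 0.5)],
    (1, 1): [(1, 1.0)],
}
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 2.0}
gamma = 0.9

# Repeatedly apply Eq. (1): Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')
Q = {(s, a): 0.0 for s in S for a in A}
for _ in range(200):
    Q = {(s, a): R[(s, a)] + gamma * sum(p * max(Q[(s2, a2)] for a2 in A)
                                         for s2, p in P[(s, a)])
         for s in S for a in A}

# An optimal policy picks argmax_a Q(s, a) in every state.
policy = {s: max(A, key=lambda a: Q[(s, a)]) for s in S}
```

Here both states prefer action 1, since it leads toward the absorbing state with the highest reward.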
Attention Mechanism
The attention mechanism is an approach for intelligently extracting contextual information from a sequence of inputs. Due to its great performance, the attention mechanism has become a popular technique in computer vision (Mnih et al. 2014; Xu et al. 2015) and natural language processing (NLP) (Bahdanau, Cho, and Bengio 2014; Vaswani et al. 2017).
An example attention network for translation is shown in Figure 1. The attention network has two recurrent neural networks called the encoder and the decoder respectively. The attention network first encodes a full sentence with its encoder. The last hidden state from the encoder is the initial hidden state (input) to its decoder. To do the translation, the attention mechanism decides the importance (i.e. weights) of each input to the translation output. The weighted sum of the encoder's outputs is the contextual information. The decoder's hidden state and the contextual information together determine the output. The method to compute the weights in the attention mechanism varies from problem to problem.

Figure 1: The attention network for translation
Note that in teamwork, the ad hoc agent's action in the current step is correlated with the current state as well as with previous states. Given different teammates' behaviours (coming from different teammate types or different situations), the correlations of previous states to the current action differ. Therefore, the attention mechanism is a suitable approach that takes the sequence of states as input and outputs the correlations given different teammates' behaviours.
Ad Hoc Teamwork with Attention Mechanism
In this section, we explain the details of our method. First, we present the overall design of one attention-based neural network. Note that AATEAM contains multiple attention networks, one per past teammate type. Second, we explain how AATEAM selects actions for the ad hoc agent. Finally, we describe how we train AATEAM.
The Design of the Attention Network in AATEAM
The overall architecture of our attention-based neural network is shown in Figure 2. The architecture consists of two parts: the encoder and the decoder. The encoder extracts important features for the ad hoc teamwork from the sequence of states. The decoder exploits the extracted features to output the probability distribution over actions for the ad hoc agent.
The encoder. The encoder consists of a linear layer and the gated recurrent unit (GRU)¹ (Cho et al. 2014) as shown in Figure 3. The initial hidden state of the encoder h_0 is a vector of zeros. In every step t, the encoder takes the previous hidden state h_{t−1} and the state s_t as input. Let u denote the size of the state vector, l denote the size of the GRU's hidden state, and f_{u→l}(·) denote a linear layer that transforms a vector of size u into another vector of size l. A linear layer f_{u→l}(·) transforms the state vector s_t into an embedding vector of length l. It does so by multiplying the input by a weight matrix. Then, the GRU uses the embedding vector and h_{t−1} to output a hidden state h_t and an encoder output o_t, i.e. the encoding of s_t. The h_t and o_t then become the input of the decoder.
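One encoder step can be sketched in pure Python as a linear embedding followed by a standard GRU cell. The sizes (u = 3, l = 4), the random weights, and the gate equations below are the textbook GRU formulation, not code from the paper:

```python
import math
import random

random.seed(0)
U, L = 3, 4  # state size u and hidden size l (tiny invented values)

def linear(W, b, x):
    """A fully connected layer: W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def rand_layer(n_out, n_in):
    return ([[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

sigmoid = lambda v: [1 / (1 + math.exp(-x)) for x in v]
tanh_v = lambda v: [math.tanh(x) for x in v]

# Random weights for the embedding layer f_{u->l} and one GRU cell.
W_emb = rand_layer(L, U)
Wz, Wr, Wn = rand_layer(L, 2 * L), rand_layer(L, 2 * L), rand_layer(L, 2 * L)

def encoder_step(s_t, h_prev):
    """One encoder step: embed the state, then apply a GRU cell."""
    x = linear(*W_emb, s_t)                      # embedding of length l
    xh = x + h_prev                              # concatenation [x || h_prev]
    z = sigmoid(linear(*Wz, xh))                 # update gate
    r = sigmoid(linear(*Wr, xh))                 # reset gate
    n = tanh_v(linear(*Wn, x + [ri * hi for ri, hi in zip(r, h_prev)]))
    h_t = [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h_prev)]
    return h_t, h_t  # for a single-layer GRU, the output o_t equals h_t

h0 = [0.0] * L                                   # initial hidden state: zeros
h1, o1 = encoder_step([0.3, -0.2, 0.5], h0)
```

In practice one would use a library GRU implementation; this sketch only illustrates how each state is embedded and folded into the running hidden state.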
The decoder. The purpose of the decoder is to output a high-level action a_t ∈ A for the ad hoc agent. Let n = |A| denote the number of high-level actions. Next, we understand an episode as one instance of a task performance. We denote the signal of the start of an episode as SOE and set its index to n + 1. The initial hidden state of the decoder d_0 is the encoder's hidden state in step 1. a_0 = SOE and its index is input into the decoder in step 1 to signal the start of a task. In every step t, the decoder takes as input: 1) the set of encoder outputs until step t, i.e. {o_i | i = 1, ..., t}; 2) the previous hidden state d_{t−1}; 3) the encoder's hidden state h_t; and 4) the previous output action a_{t−1}. As shown in Figure 3, the decoder uses these inputs and the attention module to generate the inputs for its GRU. Then, the GRU outputs a hidden state d_t and a probability distribution over actions {P(a_t^k) | k = 1, ..., n} (we discuss how we select a_t from this distribution in a later subsection).

¹ The GRU is a popular variant of the recurrent neural network (RNN). As an alternative RNN, we could use the long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997). However, the GRU has comparable performance to the LSTM while having fewer parameters, which makes it easier to train (Chung et al. 2014).

Figure 2: The attention network for the ad hoc teamwork
The attention module determines the weight of each o_i based on {o_i | i = 1, ..., t}, d_{t−1}, and h_t (we present the details of how to compute the weights in the following subsection). Given the attention weight set {w_{t,i} | i = 1, ..., t}, the decoder produces the context vector c_t in step t as

    c_t = Σ_{i=1}^{t} w_{t,i} · o_i    (2)

Next, the decoder uses an embedding layer (that maps an action index to a vector of length l) and a dropout layer (that reduces overfitting (Srivastava et al. 2014)) to generate the embedding of a_{t−1}, i.e. Embed(a_{t−1}). To produce the input in_t of the GRU in step t, the concatenation (denoted by "||") of Embed(a_{t−1}) and c_t goes through a linear layer f_{2l→l}(·) and a ReLU layer. The ReLU layer applies the rectified linear unit function to each element of the input vector:

    ReLU(x) = max(0, x)    (3)

We use the ReLU function because, if the neuron gets activated, its gradient always equals 1. This helps to train the network faster. Given in_t and d_{t−1}, the GRU outputs the decoder's hidden state d_t in step t and an action seed vector a_t of length l. To produce the probability distribution over actions in step t, i.e. {P(a_t^k) | k = 1, ..., n}, we first use a linear layer f_{l→n}(·) that converts a_t into a vector {ã_t^k | k = 1, ..., n}. Second, we transform {ã_t^k} using a softmax layer. That is,

    P(a_t^k) = e^{ã_t^k} / Σ_{m=1}^{n} e^{ã_t^m}    (4)
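The computations in Eqs. (2) and (4) can be sketched in plain Python; the encoder outputs, attention weights, and action seeds below are invented toy numbers:

```python
import math

def softmax(xs):
    """Eq. (4): turn a vector of seeds into a probability distribution."""
    m = max(xs)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def context_vector(weights, outputs):
    """Eq. (2): c_t = sum_i w_{t,i} * o_i (element-wise weighted sum)."""
    return [sum(w * o[k] for w, o in zip(weights, outputs))
            for k in range(len(outputs[0]))]

# Toy numbers (invented): three encoder outputs of length 2 and their weights.
outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
weights = [0.2, 0.3, 0.5]
c_t = context_vector(weights, outputs)    # -> [1.2, 1.3]

# Eq. (4): action seed vector -> probability distribution over actions.
action_seeds = [2.0, 1.0, 0.5]
probs = softmax(action_seeds)
```

The max-subtraction in the softmax does not change the result of Eq. (4); it only avoids overflow for large seeds.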
Figure 3: The detailed design of the attention network
where P(a_t^k) represents the probability of taking action a^k in step t.

Attention Mechanism. The attention module is shown in Figure 3. In every step t, the attention module takes the following input: 1) the set of encoder outputs {o_i | i ≤ t}; each o_i naturally provides the information for the weights of {o_i}; 2) the encoder's hidden state h_t; the hidden state h_t contains the information of the states until step t and thus, it helps to produce a new context; and 3) the decoder's previous hidden state d_{t−1}. Note that d_{t−1} comes from a_{t−2} and d_{t−2}, which in turn result from a_{t−3}, a_{t−2} and d_{t−3}. Following this logic, d_{t−1} reflects the effects of the sequence of actions {a_i | i < t} on d_0, that is, on the information of the initial state h_1. The effects of previous actions on states are useful for generating w_{t,i} since a_t is correlated with w_{t,i}.

Given the inputs, a linear layer f_{2l→l}(·) transforms the concatenation of h_t and d_{t−1} into a vector v_hd = f_{2l→l}(h_t || d_{t−1}). Then, each o_i in {o_i | i ≤ t} goes through a linear layer to become a vector v_{o_i} = f_{l→l}(o_i). The concatenation of v_{o_i} and v_hd further goes through a Tanh layer (following Bahdanau, Cho, and Bengio (2014)) that applies the tanh function to each element of the input vector:

    tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})    (5)

Next, the output of the Tanh layer goes through a linear layer f_{2l→1}(·) to output the scalar weight seed w̄_{t,i} for o_i in step t. In summary, the weight seed w̄_{t,i} is computed as:

    w̄_{t,i} = f_{2l→1}(Tanh(f_{2l→l}(h_t || d_{t−1}) || f_{l→l}(o_i)))    (6)

Finally, a softmax layer takes the set of weight seeds {w̄_{t,i} | i ≤ t} as input and outputs the weight w_{t,i} of each o_i in step t. That is,

    w_{t,i} = e^{w̄_{t,i}} / Σ_{b=1}^{t} e^{w̄_{t,b}}    (7)
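The attention scoring of Eqs. (5)-(7) can be sketched as follows, reading the final linear layer as producing one scalar seed per encoder output (as the softmax in Eq. (7) requires); the hidden size and random weights are invented:

```python
import math
import random

random.seed(1)
L = 4  # hidden size l (tiny invented value)

def linear(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(n_out, n_in):
    return [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

W_hd = rand_matrix(L, 2 * L)   # f_{2l->l} applied to h_t || d_{t-1}
W_o = rand_matrix(L, L)        # f_{l->l} applied to each o_i
W_s = rand_matrix(1, 2 * L)    # final layer producing the scalar weight seed

def attention_weights(h_t, d_prev, encoder_outputs):
    """Eqs. (5)-(7): score each o_i against (h_t, d_{t-1}), then softmax."""
    v_hd = linear(W_hd, h_t + d_prev)                  # f_{2l->l}(h_t || d_{t-1})
    seeds = []
    for o_i in encoder_outputs:
        v_o = linear(W_o, o_i)                         # f_{l->l}(o_i)
        hidden = [math.tanh(x) for x in v_hd + v_o]    # Eq. (5), element-wise
        seeds.append(linear(W_s, hidden)[0])           # scalar seed for o_i
    m = max(seeds)                                     # Eq. (7): softmax over seeds
    exps = [math.exp(s - m) for s in seeds]
    total = sum(exps)
    return [e / total for e in exps]

w = attention_weights([0.1] * L, [0.2] * L, [[0.3] * L, [0.4] * L, [0.0] * L])
```

The returned weights are positive and sum to one, so they can be used directly in the weighted sum of Eq. (2).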
Selecting the Action in AATEAM
In the previous subsection, we described the design of one attention network. Each attention network outputs a probability distribution over actions. In this subsection, we discuss how to select the final action for the ad hoc agent given those probability distributions.
Formally, we denote the set of attention networks in AATEAM as T. In every step t, we input the current state s_t and the previous action a_{t−1} into each attention network. Note that each network has its own {o_i}, h_{t−1} and d_{t−1} (computed by that network). Based on the input, each network j in T outputs a probability distribution over actions {P(a_t^k)}_j as well as its h_t and d_t. Note that the encoder's hidden state h_t results from {s_i | i ≤ t}. Therefore, h_t contains the state information until step t. Additionally, d_t reflects the effects of {a_i | i < t} on s_1 (as discussed in the Attention Mechanism subsection). Hence, we train each attention network such that d_t is the prediction of the state in step t based on the previous states and actions (we discuss the details of the training in the next subsection). If d_t is close to h_t, it means that the attention network's action affects the state as intended. Because the state also results from the teammates' behaviour, the distance between d_t and h_t, i.e. dist_j(h_t, d_t), implies the extent to which the attention network matches the teammates' behaviour. The more a network matches the teammates' behaviour, the more influential it should be in the action selection. Hence, we can use the distance to weigh the probability distributions (one per attention network) and select the action with the highest weighted probability. We compute the weight of each network j as:

    v_j = e^{−dist_j(h_t, d_t) · 10} / Σ_{z=1}^{|T|} e^{−dist_z(h_t, d_t) · 10}    (8)

and the weighted probability distribution over actions as:

    {P̄(a_t^k)} = Σ_{j=1}^{|T|} {P(a_t^k)}_j · v_j    (9)

Finally, we select the action with the highest probability in {P̄(a_t^k)} as the action for the ad hoc agent in step t.
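The selection rule of Eqs. (8) and (9) amounts to a distance-based softmax followed by a weighted sum; a sketch with two hypothetical networks and three actions (all numbers invented):

```python
import math

def aggregate_action_probs(distances, action_probs):
    """Eqs. (8)-(9): weight each network's action distribution by how well its
    state prediction matched (smaller distance -> larger weight), then sum."""
    # Eq. (8): softmax over -10 * distance.
    exps = [math.exp(-d * 10) for d in distances]
    total = sum(exps)
    v = [e / total for e in exps]
    # Eq. (9): weighted sum of the per-network distributions.
    n = len(action_probs[0])
    return [sum(v[j] * action_probs[j][k] for j in range(len(v))) for k in range(n)]

# Toy numbers (invented): two networks, three high-level actions.
distances = [0.05, 0.40]                     # dist_j(h_t, d_t) per network
probs = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]   # {P(a_t^k)}_j per network
agg = aggregate_action_probs(distances, probs)
best_action = max(range(len(agg)), key=lambda k: agg[k])
```

Because the first network's prediction is much closer, its distribution dominates and the first action is selected even though the second network strongly prefers the third.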
Training AATEAM
In this subsection, we discuss how we train one attention network j (for each teammate type j). First, we use reinforcement learning to learn a policy π_j for cooperating with past teammate type j. After that, we make the ad hoc agent use π_j to work with the teammate type j for a number of episodes. During the teamwork, we collect the state-action pairs as training data. Let us denote â_t as the action that the ad hoc agent took in step t based on the learned policy π_j. The state-action pair (s_t, â_t) is the training instance of step t, where â_t is the label for the training. A training sample is a sequence of state-action pairs, formally {(s_t, â_t)}_j. For each state-action pair (s_t, â_t) in a training sample, the attention network takes s_t as input and returns a probability vector over actions {P(a_t^k) | k = 1, ..., n}. The probability that the network selects the action â_t is P(a_t^k = â_t). We calculate the loss of an action using the negative log-likelihood:

    loss_a = −log(P(a_t^k = â_t))    (10)

Notice that the higher the probability that the attention network selects â_t, the smaller the loss of action loss_a.

Since {â_t} output by π_j fits the past teammate type j's behaviour, i.e. affects the state as intended, the attention network should learn to reduce the distance between d_t and h_t. By doing so, we ensure that d_t makes a good prediction of s_t given the teammate type j's behaviour. Hence, we compute the loss of distance as the square of the L2-norm:

    loss_d = ||d_t − h_t||_2^2    (11)

Then, the overall loss is an additive function of the above loss measures:

    loss = loss_a + β · loss_d    (12)

where β is a constant that adjusts the weight of loss_d. We use the Adam optimizer (Kingma and Ba 2014) to update the network parameters to minimise the computed loss. Moreover, during the training, we apply the teacher forcing strategy (Williams and Zipser 1989), i.e. we use â_t instead of a_t as the input to the decoder in step t + 1. This is because the actual s_{t+1} results from â_t rather than a_t.
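The per-step training loss of Eqs. (10)-(12) can be sketched as follows; the probabilities and hidden-state values are invented toy numbers:

```python
import math

def training_loss(action_probs, label_idx, d_t, h_t, beta=0.1):
    """Eq. (12): loss = loss_a + beta * loss_d.

    action_probs : the network's distribution {P(a_t^k)} for one step
    label_idx    : index of the action taken by the learned policy (the label)
    d_t, h_t     : decoder and encoder hidden states (same length)
    """
    loss_a = -math.log(action_probs[label_idx])              # Eq. (10)
    loss_d = sum((d - h) ** 2 for d, h in zip(d_t, h_t))     # Eq. (11), ||.||_2^2
    return loss_a + beta * loss_d

# Toy numbers (invented): the policy's action has probability 0.5,
# and the prediction d_t is slightly off in its first element.
loss = training_loss([0.5, 0.3, 0.2], 0, d_t=[0.1, 0.2], h_t=[0.0, 0.2])
```

With these numbers, loss_a = −log 0.5 and loss_d = 0.01, so the distance term contributes only β · 0.01 to the total.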
Comparison to the NLP domain. In the teamwork domain, the attention network needs to handle the sequence of inputs differently from the attention networks in the NLP domain (e.g. Figure 1). This is because each domain has different information available. First, in the NLP domain, the encoder can encode the whole sentence before inputting anything into the decoder. In our domain, we do not know the states from the future. Hence, we can only use the state information until the current step. Additionally, the NLP decoder uses the last hidden state of the encoder as its initial hidden state such that the decoding is based on the whole sentence. In our domain, the decoder can only use h_1 as its initial hidden state (since there is only s_1 at the beginning of a task). Second, the attention network in NLP needs to learn when to end a sentence (i.e. output EOS). However, in our domain, the end of an episode is an external signal and thus, our network does not need to output it.
Experiments
In this section, we present the experiments we have done to demonstrate the effectiveness of AATEAM. First, we introduce the experiment domain. Second, we present how we collect the data. Finally, we pitch AATEAM against PLASTIC-Policy to show the advantage of our approach. In detail, we show the results of both algorithms when working with, first, known and, second, unknown teammates. Our results indicate that when working with both known and unknown teammates, in most cases our algorithm outperforms PLASTIC-Policy significantly.
Half Field Offence Domain
Following Barrett et al. (2017), we use a limited version of the RoboCup 2D domain, i.e. the half field offense (HFO) domain (Hausknecht et al. 2016). HFO retains the complexity of the RoboCup 2D domain in terms of the state/action space and the uncertainty of actions' effects. The difference between the two settings is that we play the game within one half of the field and the ad hoc agent always replaces one of the agents in the offence team. This simpler setting with a single objective allows researchers to focus on developing algorithms that tackle the environmental complexity.

In this paper, we use the HFO version with two offensive agents that try to score a goal against the opposing team. The opposing team has two defenders, including the goalie. For the defensive agents, we adopt the agent2D behaviour provided by Akiyama (2010). At the beginning of each episode, the ball is put in a random position close to the midline of the full field. The initial position of the goalie is the goal centre. The remaining agents start at random positions within a given range. The ranges of the offensive and defensive agents ensure that the offensive agents are initialised closer to the ball than the defenders. In every step, each agent can observe the position of the ball as well as the others' positions. An episode ends in one of the following cases: 1) the offensive team scores a goal, 2) the ball leaves the field, 3) the defensive team captures the ball, or 4) the game lasts for 500 simulation steps. The successful episodes are the episodes in which the offensive team scores a goal.
State. In every simulation step, the ad hoc agent gets the following state features:
• the ad hoc agent's (x, y) position and its body angle
• the ball's (x, y) position
• the distance and the angle from the ad hoc agent to the centre of the goal
• the ad hoc agent's largest opening angle to the goal
• the distance from the ad hoc agent to the closest defender
• the teammate's largest opening angle to the goal
• the distance from the teammate to the closest defender
• the opening angle for passing to the teammate
• the teammate's (x, y) position
• each defender's (x, y) position
Given our setting, i.e. one teammate and two opponents, the state is a vector of 18 elements. Note that we sort the defenders' positions in the state vector based on their uniform numbers, while the original simulator sorts those positions based on the distance between each defender and the ad hoc agent. This modification ensures that each element about a defender's position always corresponds to the same defender.
Action. The high-level actions that the ad hoc agent can perform are: 1) Dribble, 2) Pass, 3) Shoot, and 4) Move. In every step, the simulator informs the ad hoc agent whether it is close enough to be able to kick the ball. If so, the ad hoc agent can select an action from {Dribble, Pass, Shoot}. When it is not close enough to the ball, it can only perform Move. Both the learned policies and the attention networks select high-level actions for the ad hoc agent when the agent can kick the ball. Given a high-level action, the simulator maps it to a continuous action to perform the simulation.
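The kickable/non-kickable gating described above can be sketched as follows; the function names and the probability values are illustrative, not from the paper's code:

```python
def available_actions(can_kick):
    """The simulator tells the ad hoc agent whether it can kick the ball;
    only then may it choose among the ball-handling actions."""
    return ["Dribble", "Pass", "Shoot"] if can_kick else ["Move"]

def select_action(agg_probs, can_kick,
                  all_actions=("Dribble", "Pass", "Shoot", "Move")):
    """Pick the highest-probability action among those currently available."""
    allowed = available_actions(can_kick)
    return max(allowed, key=lambda a: agg_probs[all_actions.index(a)])

# Toy distribution (invented): Move has the highest raw probability, but with
# ball possession only the kicking actions are considered.
probs = [0.1, 0.2, 0.3, 0.4]
a = select_action(probs, can_kick=True)    # -> "Shoot"
```

Restricting the argmax to the available subset matches the rule that Move is the only option when the ball is out of reach.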
Collecting data for AATEAM. We use the State-Action-Reward-State-Action (SARSA) algorithm (Hausknecht et al. 2016) to learn a policy π_j for each past teammate type j. The ad hoc agent uses π_j to work with teammate type j for 200000 episodes. Within each episode, we collect the state-action pairs {(s_t, â_t)} every time the ad hoc agent obtains control of the ball and until it loses the ball. This sequence of {(s_t, â_t)} is one training sample. We only store the state-action pairs from the successful episodes.

Note that due to the continuous state space, the states of adjacent steps are very similar. Our initial experiments show that if we collect the state-action pairs in every step, the smoothness of the sequence of states impairs the attention network's performance. Hence, we set a threshold tr to detect the change of state. In every time step, we compare each element of the current state (denoted as z′) with the corresponding element of the previous state (denoted as z). When at least one state element meets the condition |z′ − z| / |min(z, z′)| > tr, we collect the current state-action pair. Similarly, when using the trained networks to play, we select a new action when at least one state element meets the above condition. Otherwise, we continue the current action. Moreover, we remove the training samples that contain only one state-action pair. This is because the attention networks learn to extract the temporal correlations of states. The samples with only one state have no temporal correlation inside and thus hinder the learning performance.
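The state-change test can be sketched as below; the eps guard against a zero denominator is our addition and is not specified in the paper:

```python
def state_changed(prev, curr, tr=0.2, eps=1e-9):
    """True when at least one element satisfies |z' - z| / |min(z, z')| > tr.
    eps guards the division when min(z, z') is zero (our assumption)."""
    for z, z_new in zip(prev, curr):
        denom = max(abs(min(z, z_new)), eps)
        if abs(z_new - z) / denom > tr:
            return True
    return False

# Toy states (invented): the second element grows from 1.0 to 1.5, i.e. by 50%.
changed = state_changed([0.4, 1.0], [0.42, 1.5])      # exceeds tr = 0.2
unchanged = state_changed([0.4, 1.0], [0.41, 1.05])   # all changes below tr
```

A single sufficiently large relative change in any element triggers collection (or re-selection of an action during play).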
Collecting data for PLASTIC-Policy. We collect state transitions (200000 episodes for each type) since PLASTIC-Policy uses them to update its probability distribution over the past teammate types. According to Barrett et al. (2017), a state transition (s, a, s′) means that either the possession of the ball changes from one agent to another or the episode ends when the ad hoc agent performs the action a. In the experiments, s is the state of the step in which the ad hoc agent can kick the ball. s′ is the state of the step in which either another agent can kick the ball (this could result from intercepting, passing, or shooting) or the episode ends. Note that we cannot determine state transitions when the ad hoc agent cannot kick the ball because the simulator does not have a signal that indicates which specific agent holds the ball.
Teammate types. In this paper, we use externally-created teammates. The teammates are not programmed to work with the ad hoc agent. We run the binary files released from the 2013 RoboCup 2D simulation league competition (RoboCup 2013). From these external agents, we use five types as past teammate types, i.e. aut, axiom, agent2D, gliders, and yushan. Additionally, we use five types as new teammate types, i.e. yunlu, utaustin, legend, ubc, and warthog.

Table 1: Experimental Parameters

    Name              Value   Name           Value
    β                 0.1     tr             0.2
    u                 18      l              256
    GRU layer number  5       Max length     60
    Dropout rate      0.1     Learning rate  0.01
Experimental Parameters and Settings
We sum up the experimental parameters in Table 1. The number of hidden state layers for the GRU is 5. We use the last layer of hidden states to compute dist_j(h_t, d_t) and loss_d. The maximum length of a state sequence that the attention networks process is 60. When the state sequence's length exceeds 60, the networks drop the states from the beginning. The rate of dropout is 0.1. When training the attention networks, the default learning rate is 0.01 and the batch size is 32. In the experiments, we use AATEAM and PLASTIC-Policy to play with each teammate type for 10 trials, where each trial contains 1000 episodes. We set the decision time limit for each step to 100 ms following Barrett et al. (2017). We evaluate the performance based on the scoring rate, i.e. how many episodes the offence team wins in every 1000 episodes. The error bars in Figure 4 and Figure 5 below are the standard deviations of the scoring rates. Note that the results are discrete points; we add lines between them to improve clarity. We apply the binomial test to assess the significance of our results (following Barrett et al. (2017)). We observe that the difference between AATEAM and PLASTIC-Policy is significant (95% confidence) except for the axiom and ubc types.
Playing with Past Teammates
In this subsection, we make the ad hoc agent play with the teammates it encountered before, i.e. past teammates. In detail, it plays with five past teammate types using, first, AATEAM and, second, PLASTIC-Policy. Also, we have one baseline and one upper bound here. The baseline is the performance of the original offence team, i.e. the team without the ad hoc agent. The upper bound is the performance of the ad hoc agent playing with a given teammate type following the learned policy for that teammate. Since each policy is learned only for the corresponding past teammate type, we expect the performance with the learned policies to be higher than with any other approach. The results are shown in Figure 4.

In Figure 4, we see that both AATEAM and PLASTIC-Policy are better than the baseline. This is because the reinforcement learning algorithms managed to learn good policies for the past teammate types (both in PLASTIC-Policy and during the training of AATEAM). On top of that, the attention networks successfully learned to fit the behaviour of past teammates. Hence, both algorithms exhibit more intelligent behaviours.
Figure 4: The performance with known teammates (scoring rates of AATEAM, PLASTIC, Baseline, and the Upper bound for the aut, axiom, agent2D, gliders, and yushan teammate types)

When it comes to the upper bound, we can see that
AATEAM's performance is closer to the upper bound than
PLASTIC-Policy in most of the cases. Also, AATEAM outperforms
PLASTIC-Policy significantly. The reason is as
follows. PLASTIC-Policy finds the corresponding policy for
the known teammate type by evaluating the similarity be-
tween past experiences and current ones. However, it may
not be able to identify the most similar past experience due
to the decision time limit (i.e. 100 ms). Thus, the simi-
larity evaluation may cause it to switch between different
learned policies. This degrades its performance. In compar-
ison, AATEAM extracts the context in every step and maps
the context to the best action. Note that the extraction and
mapping using trained attention networks are very efficient.
Therefore, AATEAM can take advantage of the knowledge
from the learned policies in a more flexible way. In detail,
the ad hoc agent uses AATEAM to learn to cope with sev-
eral different teammate types. Next, it plays with teammates
of one of the known types. Each attention network outputs a
probability distribution over actions. AATEAM aggregates
the probabilities from all attention networks. The aggrega-
tion of the probability distributions enables a better usage of
the previous knowledge (as we base our decision on more
knowledge than the knowledge about only one past team-
mate type). Note that the originally learned policies (upper
bounds) come from reinforcement learning and they may
not be optimal. Given a teammate type, the policies learned
for other types may occasionally suggest a better action than
the policy for the given type. Therefore, AATEAM can perform better
than the upper bounds as it uses more information from neu-
ral networks. In summary, we observe that AATEAM can
easily adapt to the teammates’ behaviour.
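The aggregation step described above can be sketched as follows. The weighting of each network by its state-prediction accuracy reflects the paper's statement that prediction accuracies help determine which actions to take, but the function name, the normalisation, and the exact weighting scheme are illustrative assumptions, not the paper's formula.

```python
def select_action(action_probs, prediction_accuracies):
    """Aggregate the action distributions output by all attention networks.

    action_probs: one probability distribution over actions per network.
    prediction_accuracies: how well each network predicted recent states,
    used here as an (assumed) confidence weight for that network.
    """
    total = sum(prediction_accuracies)
    weights = [a / total for a in prediction_accuracies]
    num_actions = len(action_probs[0])
    aggregated = [sum(w * dist[j] for w, dist in zip(weights, action_probs))
                  for j in range(num_actions)]
    best = max(range(num_actions), key=aggregated.__getitem__)
    return best, aggregated

# Three networks (one per past teammate type), three candidate actions.
probs = [[0.6, 0.3, 0.1],   # network trained on teammate type A
         [0.2, 0.5, 0.3],   # type B
         [0.1, 0.2, 0.7]]   # type C
best, agg = select_action(probs, [0.9, 0.4, 0.3])  # type A predicts states best
```

Because every network contributes in proportion to its weight, an action favoured by several moderately confident networks can beat the top action of any single network, which is the "more knowledge than one past teammate type" effect discussed above.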
Playing with New Teammates
In this subsection, the ad hoc agent uses AATEAM and
PLASTIC-Policy to work with the new teammates, i.e. the
teammates it has never encountered before.

Figure 5: The performance with unknown teammates (scoring rates of AATEAM, PLASTIC, and Baseline for the legend, ubc, utaustin, warthog, and yunlu teammate types)

Figure 6: The loss of actions over training time (in thousands of iterations) for AATEAM, the variant without attention, and the variant that keeps samples of length 1 (using yushan type as example)

Naturally, we
have one baseline: the performance of the original offence
team playing without the ad hoc agent. We cannot use the
upper bound here, as the ad hoc agent did not learn policies
to work with the new teammates beforehand.
The results in Figure 5 show that both AATEAM and
PLASTIC-Policy outperform the baseline as both algo-
rithms behave more intelligently than the replaced agent.
Moreover, in most cases AATEAM still significantly out-
performs PLASTIC-Policy. The reason is that PLASTIC-
Policy switches between policies based on the current be-
haviour of new teammates. Since switching takes time,
PLASTIC-Policy cannot adapt to teammates’ behaviours
quickly. In addition, the performance of both AATEAM and
PLASTIC-Policy is good except when the ad hoc agent
plays with the ubc type. The reason is that the ubc agent
does not often pass the ball to its teammate. Therefore, even
though AATEAM and PLASTIC-Policy are smarter than the
agent they replace, they cannot help much. In summary,
the experiments of playing with unknown teammate types
demonstrate that AATEAM can adapt to different team-
mates’ behaviours better than PLASTIC-Policy. This is cru-
cial for the ad hoc teamwork in complex domains.
The Importance of Attention Mechanism
In this subsection, we show the importance of the attention
mechanism in AATEAM. Besides the architecture shown in
Figure 3, we also try to train the neural network without the
attention mechanism, i.e. where each encoder output o_i has
the same weight. Additionally, we try to include the training
samples of length 1 during the training. The loss of actions
in these different cases is shown in Figure 6. The results
demonstrate that we can
get a good network only when we adopt the attention mech-
anism and remove the training samples of length 1 (to pre-
vent the attention module from being influenced by degenerate
single-step contexts). Note that the loss of actions directly
affects the performance of the neural network. Hence, the
results in Figure 6 indicate that the attention mechanism is
an essential part of AATEAM.
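To make the ablated contrast concrete, the following sketch pools the recurrent encoder outputs o_i either with attention weights or with the uniform weights of the "without attention" variant. Dot-product scoring against a query vector is an assumption for illustration; the scoring function in the paper's Figure 3 architecture may differ.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def pool(outputs, query=None):
    """Pool the recurrent encoder outputs o_i into one context vector.

    With a query vector, weight each o_i by softmax(o_i . query), as in
    dot-product attention (an assumed scoring function). Without a query,
    fall back to uniform weights, i.e. the ablated variant in which every
    o_i has the same weight.
    """
    n, dim = len(outputs), len(outputs[0])
    if query is None:
        weights = [1.0 / n] * n
    else:
        scores = [sum(o_k * q_k for o_k, q_k in zip(o, query)) for o in outputs]
        weights = softmax(scores)
    return [sum(w * o[d] for w, o in zip(weights, outputs)) for d in range(dim)]

# Two encoder outputs in 2-D; a query aligned with the first output
# concentrates nearly all attention weight on it, while uniform pooling
# merely averages the two.
outputs = [[1.0, 0.0], [0.0, 1.0]]
context_uniform = pool(outputs)
context_attended = pool(outputs, [10.0, 0.0])
```

The uniform variant cannot emphasise the steps of the state sequence that matter for the current decision, which is consistent with its higher action loss in Figure 6.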
Conclusions and Future Work
This paper proposes a novel approach called AATEAM to
solve the ad hoc teamwork problem in complex domains.
To the best of our knowledge, this is the first attempt to
solve the ad hoc teamwork problem using attention-based
recurrent neural networks. In detail, AATEAM contains a
set of neural networks, each trained on the behaviour of a
different past teammate type. Using the probability
distributions output by each neural network, AATEAM se-
lects the best action for the ad hoc agent in real-time. We
pitched our algorithm against PLASTIC-Policy, which is the
only existing scheme proposed for the ad hoc teamwork problem in
complex domains. Our results indicate that when working
with both known and unknown teammates, in most cases
our algorithm outperforms PLASTIC-Policy.
One limitation of AATEAM is that it needs plenty of data
to train the attention networks. Additionally, it is difficult
to know whether the chosen neural network architecture is
the most appropriate one. As future work, we will test AATEAM with
more teammate types to show the robustness of our method.
Furthermore, we plan to test the same idea for even more
complex domains (with larger action and search space).
Acknowledgments
The research was partially supported by the ST Engi-
neering - NTU Corporate Lab through the NRF corporate
lab@university scheme.
References
Agmon, N., and Stone, P. 2012. Leading ad hoc agents in joint
action settings with multiple teammates. In Proceedings of the
International Conference on Autonomous Agents and Multiagent
Systems, 341–348.
Agmon, N.; Barrett, S.; and Stone, P. 2014. Modeling uncertainty
in leading ad hoc teams. In Proceedings of the International Con-
ference on Autonomous Agents and Multiagent Systems, 397–404.
Akiyama, H. 2010. Agent2D base code release. https://osdn.net/
projects/rctools/.
Albrecht, S. V., and Ramamoorthy, S. 2013. A game-theoretic
model and best-response learning method for ad hoc coordination
in multiagent systems. In Proceedings of the International Confer-
ence on Autonomous Agents and Multiagent Systems, 1155–1156.
Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine
translation by jointly learning to align and translate. arXiv preprint
arXiv:1409.0473.
Barrett, S.; Stone, P.; Kraus, S.; and Rosenfeld, A. 2013. Teamwork
with limited knowledge of teammates. In Proceedings of the AAAI
Conference on Artificial Intelligence, 102–108.
Barrett, S.; Rosenfeld, A.; Kraus, S.; and Stone, P. 2017. Mak-
ing friends on the fly: Cooperating with new teammates. Artificial
Intelligence 242:132–171.
Chakraborty, D., and Stone, P. 2013. Cooperating with a Marko-
vian ad hoc teammate. In Proceedings of the International Confer-
ence on Autonomous Agents and Multiagent Systems, 1085–1092.
Chandrasekaran, M.; Doshi, P.; Zeng, Y.; and Chen, Y. 2017. Can
bounded and self-interested agents be teammates? Application to
planning in ad hoc teams. Autonomous Agents and Multi-Agent
Systems 31(4):821–860.
Chen, S.; Andrejczuk, E.; Irissappane, A.; and Zhang, J. 2019.
ATSIS: Achieving the ad hoc teamwork by sub-task inference and
selection. In Proceedings of the International Joint Conference on
Artificial Intelligence, 172–179.
Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.;
Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase
representations using RNN encoder-decoder for statistical machine
translation. arXiv preprint arXiv:1406.1078.
Chung, J.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2014. Empirical
evaluation of gated recurrent neural networks on sequence model-
ing. arXiv preprint arXiv:1412.3555.
Genter, K. L.; Agmon, N.; and Stone, P. 2011. Role-based ad
hoc teamwork. In Proceedings of the AAAI Conference on Plan,
Activity, and Intent Recognition, 17–24.
Grosz, B. J., and Kraus, S. 1996. Collaborative plans for complex
group action. Artificial Intelligence 86(2):269–357.
Hausknecht, M.; Mupparaju, P.; Subramanian, S.; Kalyanakrish-
nan, S.; and Stone, P. 2016. Half field offense: An environment
for multiagent learning and ad hoc teamwork. In AAMAS Adaptive
Learning Agents (ALA) Workshop.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term mem-
ory. Neural computation 9(8):1735–1780.
Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic
optimization. arXiv preprint arXiv:1412.6980.
Melo, F. S., and Sardinha, A. 2016. Ad hoc teamwork by learn-
ing teammates’ task. Autonomous Agents and Multi-Agent Systems
30(2):175–219.
Mnih, V.; Heess, N.; Graves, A.; and Kavukcuoglu, K. 2014. Re-
current models of visual attention. In Advances in Neural Informa-
tion Processing Systems, 2204–2212.
Ravula, M.; Alkoby, S.; and Stone, P. 2019. Ad hoc teamwork
with behavior switching agents. In Proceedings of the International
Joint Conference on Artificial Intelligence, 550–556.
RoboCup. 2013. The binary file for the agents. https://archive.
robocup.info/Soccer/Simulation/2D/binaries/RoboCup/.
Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and
Salakhutdinov, R. 2014. Dropout: a simple way to prevent neu-
ral networks from overfitting. The Journal of Machine Learning
Research 15(1):1929–1958.
Stone, P.; Kaminka, G. A.; Kraus, S.; and Rosenschein, J. S.
2010. Ad hoc autonomous agent teams: Collaboration without pre-
coordination. In Proceedings of the AAAI Conference on Artificial
Intelligence, 1504–1509.
Sutton, R. S., and Barto, A. G. 2011. Reinforcement learning: An
introduction. Cambridge, MA: MIT Press.
Tambe, M. 1997. Towards flexible teamwork. Journal of Artificial
Intelligence Research 7:83–124.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.;
Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is
all you need. In Advances in Neural Information Processing Sys-
tems, 5998–6008.
Williams, R. J., and Zipser, D. 1989. A learning algorithm for
continually running fully recurrent neural networks. Neural Com-
putation 1(2):270–280.
Wu, F.; Zilberstein, S.; and Chen, X. 2011. Online planning for ad
hoc autonomous agent teams. In Proceedings of the International
Joint Conferences on Artificial Intelligence, 439–445.
Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.;
Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural
image caption generation with visual attention. In International
Conference on Machine Learning, 2048–2057.