The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)
AATEAM: Achieving the Ad Hoc Teamwork by Employing the Attention Mechanism
Shuo Chen,1,2 Ewa Andrejczuk,2 Zhiguang Cao,3 Jie Zhang1
1School of Computer Science and Engineering, Nanyang Technological University, Singapore
2ST Engineering - NTU Corporate Laboratory, Nanyang Technological University, Singapore
3Department of Industrial Systems Engineering and Management, National University of Singapore, Singapore
chen1087@e.ntu.edu.sg, {ewaa, zhangj}@ntu.edu.sg, isecaoz@nus.edu.sg
Abstract
In the ad hoc teamwork setting, a team of agents needs to perform a task without prior coordination. The most advanced approach learns policies based on previous experiences and reuses one of the policies to interact with new teammates. However, the selected policy is in many cases sub-optimal. Switching between policies to adapt to new teammates' behaviour takes time, which threatens the successful performance of a task. In this paper, we propose AATEAM, a method that uses attention-based neural networks to cope with new teammates' behaviour in real-time. We train one attention network per teammate type. The attention networks learn both to extract the temporal correlations from the sequence of states (i.e. contexts) and the mapping from contexts to actions. Each attention network also learns to predict a future state given the current context and its output action. The prediction accuracies help to determine which actions the ad hoc agent should take. We perform extensive experiments to show the effectiveness of our method.
Introduction
In the majority of the multiagent systems (MAS) literature tackling teamwork, a team of at least two agents shares cooperation protocols to work together and perform collaborative tasks (Grosz and Kraus 1996; Tambe 1997). However, with the increasing number of agents in various domains (e.g. construction, search and rescue) developed by different companies, we can no longer coordinate agents' activities beforehand. In this case, we expect a team of agents to perform a task without any pre-defined coordination strategy. This problem is known in the literature as the ad hoc teamwork problem (Stone et al. 2010). This paper tackles the ad hoc teamwork problem where an ad hoc agent plans its actions only by observing its teammates.
Most of the existing approaches for ad hoc teamwork focus on simple domains, i.e. they ignore the environmental uncertainty or they assume teammates' actions are fully observable (Agmon, Barrett, and Stone 2014; Ravula, Alkoby, and Stone 2019). To the best of our knowledge, PLASTIC-Policy (Barrett et al. 2017) is the only scheme proposed for more complex domains, i.e. with a dynamic environment and a continuous state/action space. PLASTIC-Policy learns a policy for each past teammate type. When working with new teammates, it chooses one of the policies based on the similarity between the new and past experiences and uses that policy to select the sequence of actions for the ad hoc agent. However, the selected policy only works well when the new teammates behave similarly to the chosen teammate type. When the new teammates' behaviour diverges from that teammate type, the ad hoc agent needs to switch to another policy. Switching between policies to adapt to new teammates' behaviour takes time, which threatens the successful performance of a task. Given that, it is desirable to develop a new method that adapts to new teammates' behaviours in real-time for complex domains.

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
In this paper, we propose to Achieve the Ad hoc Teamwork by Employing the Attention Mechanism (AATEAM). AATEAM uses attention-based recurrent neural networks to cope with new teammates' behaviour in real-time. Our workflow is as follows. First, we use reinforcement learning to learn policies to work with past teammates. Second, we make the ad hoc agent play with each type of teammate using the corresponding learned policy. We collect the states (input) and the ad hoc agent's actions (labels) as training data. Third, we use this data to train one attention network for each type of past teammates. The attention network works as follows. In every step, the attention network takes a state as input. The attention mechanism inside extracts the temporal correlation between the state and the states encountered previously, i.e. the context. It further maps the context to a probability distribution over actions. In every step, the attention networks also output a metric that measures the similarity between past and new teammate types. The more similar the types, the more important the corresponding probability distribution over actions. We aggregate the networks' probability distributions over actions by calculating the sum of those probabilities weighted by their importance. Finally, we output the action with the highest aggregated probability. Since AATEAM evaluates the similarity in every step, we can adjust to the new teammates' behaviour in real-time.
In summary, we make the following contributions: 1) We propose AATEAM, an attention-based method to cope with teammates' different behaviours in complex domains. To the best of our knowledge, this is the first time the attention mechanism is applied to this problem. 2) Besides learning the mapping from contexts to probability distributions over actions, we train the attention networks to measure the similarity between the corresponding past teammates and new teammates. This helps to adjust the action selection in real-time. 3) We perform extensive experiments in the RoboCup 2D simulation domain. The results demonstrate that the attention networks outperform PLASTIC-Policy when working with both known and unknown teammates.
Related Work
The majority of existing work on ad hoc teamwork is proposed for simple domains. These works either do not consider the environmental uncertainty or they assume the action space is fully observable.
For the first simplification, the authors of (Agmon and Stone 2012; Chakraborty and Stone 2013; Agmon, Barrett, and Stone 2014) assume that the utilities of the team's actions are given. Also, Genter, Agmon, and Stone (2011) deal with the role assignment problem for the ad hoc agent where teammates' roles are known. However, the actions' utilities and teammates' roles may vary when the environment changes. Therefore, their problem settings are too simplistic.
As for the second simplification, assuming a fully-observable action space limits the potential for real-life applications. Researchers develop algorithms to plan the ad hoc agent's actions either by using the past teammates' behaviour types (Wu, Zilberstein, and Chen 2011; Albrecht and Ramamoorthy 2013; Barrett et al. 2013; Melo and Sardinha 2016; Chandrasekaran et al. 2017; Ravula, Alkoby, and Stone 2019) or by inferring the responsibilities the new teammates are taking (Chen et al. 2019). In both cases, they assume that teammates' actions are fully observable, which is a strong assumption when it comes to more complex domains (e.g. the actions are continuous and hence, the ad hoc agent cannot know the exact value of an action).
The only scheme for complex domains is PLASTIC-Policy (Barrett et al. 2017). PLASTIC-Policy uses reinforcement learning to learn a policy for working with each type of past teammates. To reduce the learning complexity, PLASTIC-Policy first assumes the existence of a countable set of high-level actions. Each high-level action is a mapping from the state to a continuous action. Therefore, it limits the action space to be searched by reinforcement learning. The policy takes a continuous state as input and outputs a high-level action for the ad hoc agent. As mentioned before, in complex domains, the ad hoc agent cannot directly observe teammates' actions. Therefore, PLASTIC-Policy defines the state transition for a continuous state space and observes teammates' behaviour based on state transitions. Given a state transition with new teammates, the ad hoc agent locates the most similar state transition with each type of past teammates. Using that information, it updates the probability of the new teammates being similar to each type of past teammates, and applies the policy for the most similar past teammates. The main drawback of PLASTIC-Policy is that the learned policies are not necessarily optimal for new teammates. Additionally, the update based on state transitions may not be fast enough to adapt to the new teammates' changing behaviour. Our paper addresses these drawbacks by using all attention networks to select the best action based on a context and the similarity between past and new teammates' types.
Preliminary
Markov Decision Process
In this paper, our objective is to find the best set of actions for the ad hoc agent. Following the state-of-the-art literature (Barrett et al. 2017), we model the ad hoc teamwork problem as a Markov Decision Process (MDP) (Sutton and Barto 2011). Specifically, we represent an MDP by a tuple ⟨S, A, P, R⟩, where S is a set of states; A is a set of high-level actions the ad hoc agent can perform; P : S × A × S → [0, 1] is the state transition function, that is, P(s′ | s, a) is the probability of transiting to the next state s′ given the action a and the current state s; and the reward function R : S × A → ℝ specifies the reward R(s, a) when the ad hoc agent takes the action a in the state s. Note that for a continuous state space, the definition of a state transition is domain-specific.
A policy π : S → A takes a state s as input and outputs an action a. After taking the action a in the state s, the maximum long-term expected reward is computed as:

    Q(s, a) = R(s, a) + γ · Σ_{s′∈S} P(s′ | s, a) · max_{a′} Q(s′, a′)    (1)

where 0 < γ ≤ 1 is the discount factor for the future reward. An optimal policy chooses the action a that maximizes Q(s, a) for all s.
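To make Eq. (1) concrete, the following is a minimal value-iteration sketch on a toy two-state MDP; the states, actions, transition probabilities and rewards are invented for illustration and are not part of the paper's domain:

```python
# Toy MDP: 2 states, 2 actions -- all numbers are made up for illustration.
S = [0, 1]
A = [0, 1]
P = {  # P[(s, a)] = list of (next_state, probability)
    (0, 0): [(0, 0.9), (1, 0.1)],
    (0, 1): [(0, 0.2), (1, 0.8)],
    (1, 0): [(0, 0.5), (1, 0.5)],
    (1, 1): [(1, 1.0)],
}
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 2.0}
gamma = 0.9

# Repeatedly apply Eq. (1): Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')
Q = {(s, a): 0.0 for s in S for a in A}
for _ in range(200):
    Q = {(s, a): R[(s, a)] + gamma * sum(p * max(Q[(s2, a2)] for a2 in A)
                                         for s2, p in P[(s, a)])
         for s in S for a in A}

# An optimal policy picks argmax_a Q(s, a) in every state.
policy = {s: max(A, key=lambda a: Q[(s, a)]) for s in S}
```

Here both states prefer action 1, since it leads toward the absorbing state with the highest reward.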
Attention Mechanism
The attention mechanism is an approach for intelligently extracting contextual information from a sequence of inputs. Due to its great performance, the attention mechanism has become a popular technique in computer vision (Mnih et al. 2014; Xu et al. 2015) and natural language processing (NLP) (Bahdanau, Cho, and Bengio 2014; Vaswani et al. 2017).
An example attention network for translation is shown in Figure 1. The attention network has two recurrent neural networks called the encoder and the decoder respectively. The attention network first encodes a full sentence with its encoder. The last hidden state from the encoder is the initial hidden state (input) to its decoder. To do the translation, the attention mechanism decides the importance (i.e. weights) of each input to the translation output. The weighted sum of the encoder's outputs is the contextual information. The decoder's hidden state and the contextual information together determine the output. The method to compute the weights in the attention mechanism varies from problem to problem.

Figure 1: The attention network for translation
Note that in teamwork, the ad hoc agent's action in the current step is correlated with the current state as well as with previous states. Given different teammates' behaviours (coming from different teammate types or different situations), the correlations of previous states to the current action differ. Therefore, the attention mechanism is a suitable approach that takes the sequence of states as input and outputs the correlations given different teammates' behaviours.
Ad Hoc Teamwork with Attention Mechanism
In this section, we explain the details of our method. First, we present the overall design of one attention-based neural network. Note that AATEAM contains multiple attention networks, one per past teammate type. Second, we explain how AATEAM selects actions for the ad hoc agent. Finally, we describe how we train AATEAM.
The Design of the Attention Network in AATEAM
The overall architecture of our attention-based neural network is shown in Figure 2. The architecture consists of two parts: the encoder and the decoder. The encoder extracts important features for the ad hoc teamwork from the sequence of states. The decoder exploits the extracted features to output the probability distribution over actions for the ad hoc agent.
The encoder. The encoder consists of a linear layer and the gated recurrent unit (GRU)¹ (Cho et al. 2014) as shown in Figure 3. The initial hidden state of the encoder h_0 is a vector of zeros. In every step t, the encoder takes the previous hidden state h_{t−1} and the state s_t as input. Let u denote the size of the state vector, l denote the size of the GRU's hidden state, and f_{u→l}(·) denote a linear layer that transforms a vector of size u into another vector of size l. A linear layer f_{u→l}(·) transforms the state vector s_t into an embedding vector of length l. It does so by multiplying the input by a weight matrix. Then, the GRU uses the embedding vector and h_{t−1} to output a hidden state h_t and an encoder output o_t, i.e. the encoding of s_t. The h_t and o_t then become the input of the decoder.
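One encoder step can be sketched in pure Python as a linear embedding followed by a standard GRU cell. The sizes (u = 3, l = 4), the random weights, and the gate equations below are the textbook GRU formulation, not code from the paper:

```python
import math
import random

random.seed(0)
U, L = 3, 4  # state size u and hidden size l (tiny invented values)

def linear(W, b, x):
    """A fully connected layer: W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def rand_layer(n_out, n_in):
    return ([[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

sigmoid = lambda v: [1 / (1 + math.exp(-x)) for x in v]
tanh_v = lambda v: [math.tanh(x) for x in v]

# Random weights for the embedding layer f_{u->l} and one GRU cell.
W_emb = rand_layer(L, U)
Wz, Wr, Wn = rand_layer(L, 2 * L), rand_layer(L, 2 * L), rand_layer(L, 2 * L)

def encoder_step(s_t, h_prev):
    """One encoder step: embed the state, then apply a GRU cell."""
    x = linear(*W_emb, s_t)                      # embedding of length l
    xh = x + h_prev                              # concatenation [x || h_prev]
    z = sigmoid(linear(*Wz, xh))                 # update gate
    r = sigmoid(linear(*Wr, xh))                 # reset gate
    n = tanh_v(linear(*Wn, x + [ri * hi for ri, hi in zip(r, h_prev)]))
    h_t = [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h_prev)]
    return h_t, h_t  # for a single-layer GRU, the output o_t equals h_t

h0 = [0.0] * L                                   # initial hidden state: zeros
h1, o1 = encoder_step([0.3, -0.2, 0.5], h0)
```

In practice one would use a library GRU implementation; this sketch only illustrates how each state is embedded and folded into the running hidden state.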
The decoder. The purpose of the decoder is to output a high-level action a_t ∈ A for the ad hoc agent. Let n = |A| denote the number of high-level actions. Next, we understand an episode as one instance of a task performance. We denote the signal of the start of an episode as SOE and set its index to n + 1. The initial hidden state of the decoder d_0 is the encoder's hidden state in step 1. a_0 = SOE and its index is input into the decoder in step 1 to signal the start of a task. In every step t, the decoder takes as input: 1) the set of encoder outputs until step t, i.e. {o_i | i = 1, ..., t}; 2) the previous hidden state d_{t−1}; 3) the encoder's hidden state h_t; and 4) the previous output action a_{t−1}. As shown in Figure 3, the decoder uses these inputs and the attention module to generate the inputs for its GRU. Then, the GRU outputs a hidden state d_t and a probability distribution over actions {P(a_t^k) | k = 1, ..., n} (we discuss how we select a_t from this distribution in a later subsection).

¹ The GRU is a popular variant of the recurrent neural network (RNN). As an alternative RNN, we could use the long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997). However, the GRU has comparable performance to the LSTM while having fewer parameters, which makes it easier to train (Chung et al. 2014).

Figure 2: The attention network for the ad hoc teamwork
The attention module determines the weight of each o_i based on {o_i | i = 1, ..., t}, d_{t−1}, and h_t (we present the details of how to compute the weights in the following subsection). Given the attention weight set {w_{t,i} | i = 1, ..., t}, the decoder produces the context vector c_t in step t as

    c_t = Σ_{i=1}^{t} w_{t,i} · o_i    (2)

Next, the decoder uses an embedding layer (that maps an action index to a vector of length l) and a dropout layer (that reduces overfitting (Srivastava et al. 2014)) to generate the embedding of a_{t−1}, i.e. Embed(a_{t−1}). To produce the input in_t of the GRU in step t, the concatenation (denoted by "||") of Embed(a_{t−1}) and c_t goes through a linear layer f_{2l→l}(·) and a ReLU layer. The ReLU layer applies the rectified linear unit function to each element of the input vector:

    ReLU(x) = max(0, x)    (3)

We use the ReLU function because, if the neuron gets activated, its gradient always equals 1. This helps to train the network faster. Given in_t and d_{t−1}, the GRU outputs the decoder's hidden state d_t in step t and an action seed vector a_t of length l. To produce the probability distribution over actions in step t, i.e. {P(a_t^k) | k = 1, ..., n}, we first use a linear layer f_{l→n}(·) that converts a_t into a vector {ã_t^k | k = 1, ..., n}. Second, we transform {ã_t^k} using a softmax layer. That is,

    P(a_t^k) = e^{ã_t^k} / Σ_{m=1}^{n} e^{ã_t^m}    (4)
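The computations in Eqs. (2) and (4) can be sketched in plain Python; the encoder outputs, attention weights, and action seeds below are invented toy numbers:

```python
import math

def softmax(xs):
    """Eq. (4): turn a vector of seeds into a probability distribution."""
    m = max(xs)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def context_vector(weights, outputs):
    """Eq. (2): c_t = sum_i w_{t,i} * o_i (element-wise weighted sum)."""
    return [sum(w * o[k] for w, o in zip(weights, outputs))
            for k in range(len(outputs[0]))]

# Toy numbers (invented): three encoder outputs of length 2 and their weights.
outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
weights = [0.2, 0.3, 0.5]
c_t = context_vector(weights, outputs)    # -> [1.2, 1.3]

# Eq. (4): action seed vector -> probability distribution over actions.
action_seeds = [2.0, 1.0, 0.5]
probs = softmax(action_seeds)
```

The max-subtraction in the softmax does not change the result of Eq. (4); it only avoids overflow for large seeds.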
Figure 3: The detailed design of the attention network
where P(a_t^k) represents the probability of taking action a^k in step t.

Attention Mechanism. The attention module is shown in Figure 3. In every step t, the attention module takes the following input: 1) the set of encoder outputs {o_i | i ≤ t}; each o_i naturally provides the information for the weights of {o_i}; 2) the encoder's hidden state h_t; the hidden state h_t contains the information of the states until step t and thus, it helps to produce a new context; and 3) the decoder's previous hidden state d_{t−1}. Note that d_{t−1} comes from a_{t−2} and d_{t−2}, which in turn result from a_{t−3}, a_{t−2} and d_{t−3}. Following this logic, d_{t−1} reflects the effects of the sequence of actions {a_i | i < t} on d_0, that is, on the information of the initial state h_1. The effects of previous actions on states are useful for generating w_{t,i} since a_t is correlated with w_{t,i}.

Given the inputs, a linear layer f_{2l→l}(·) transforms the concatenation of h_t and d_{t−1} into a vector v_hd = f_{2l→l}(h_t || d_{t−1}). Then, each o_i in {o_i | i ≤ t} goes through a linear layer to become a vector v_{o_i} = f_{l→l}(o_i). The concatenation of v_{o_i} and v_hd further goes through a Tanh layer (following Bahdanau, Cho, and Bengio (2014)) that applies the tanh function to each element of the input vector:

    tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})    (5)

Next, the output of the Tanh layer goes through a linear layer f_{2l→1}(·) to output the scalar weight seed w̄_{t,i} for o_i in step t. In summary, the weight seed w̄_{t,i} is computed as:

    w̄_{t,i} = f_{2l→1}(Tanh(f_{2l→l}(h_t || d_{t−1}) || f_{l→l}(o_i)))    (6)

Finally, a softmax layer takes the set of weight seeds {w̄_{t,i} | i ≤ t} as input and outputs the weight w_{t,i} of each o_i in step t. That is,

    w_{t,i} = e^{w̄_{t,i}} / Σ_{b=1}^{t} e^{w̄_{t,b}}    (7)
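The attention scoring of Eqs. (5)-(7) can be sketched as follows, reading the final linear layer as producing one scalar seed per encoder output (as the softmax in Eq. (7) requires); the hidden size and random weights are invented:

```python
import math
import random

random.seed(1)
L = 4  # hidden size l (tiny invented value)

def linear(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(n_out, n_in):
    return [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

W_hd = rand_matrix(L, 2 * L)   # f_{2l->l} applied to h_t || d_{t-1}
W_o = rand_matrix(L, L)        # f_{l->l} applied to each o_i
W_s = rand_matrix(1, 2 * L)    # final layer producing the scalar weight seed

def attention_weights(h_t, d_prev, encoder_outputs):
    """Eqs. (5)-(7): score each o_i against (h_t, d_{t-1}), then softmax."""
    v_hd = linear(W_hd, h_t + d_prev)                  # f_{2l->l}(h_t || d_{t-1})
    seeds = []
    for o_i in encoder_outputs:
        v_o = linear(W_o, o_i)                         # f_{l->l}(o_i)
        hidden = [math.tanh(x) for x in v_hd + v_o]    # Eq. (5), element-wise
        seeds.append(linear(W_s, hidden)[0])           # scalar seed for o_i
    m = max(seeds)                                     # Eq. (7): softmax over seeds
    exps = [math.exp(s - m) for s in seeds]
    total = sum(exps)
    return [e / total for e in exps]

w = attention_weights([0.1] * L, [0.2] * L, [[0.3] * L, [0.4] * L, [0.0] * L])
```

The returned weights are positive and sum to one, so they can be used directly in the weighted sum of Eq. (2).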
Selecting the Action in AATEAM
In the previous subsection, we described the design of one attention network. Each attention network outputs a probability distribution over actions. In this subsection, we discuss how to select the final action for the ad hoc agent given those probability distributions.
Formally, we denote the set of attention networks in AATEAM as T. In every step t, we input the current state s_t and the previous action a_{t−1} into each attention network. Note that each network has its own {o_i}, h_{t−1} and d_{t−1} (computed by that network). Based on the input, each network j in T outputs a probability distribution over actions {P(a_t^k)}_j as well as its h_t and d_t. Note that the encoder's hidden state h_t results from {s_i | i ≤ t}. Therefore, h_t contains the state information until step t. Additionally, d_t reflects the effects of {a_i | i < t} on s_1 (as discussed in the Attention Mechanism subsection). Hence, we train each attention network such that d_t is the prediction of the state in step t based on the previous states and actions (we discuss the details of the training in the next subsection). If d_t is close to h_t, it means that the attention network's action affects the state as intended. Because the state also results from the teammates' behaviour, the distance between d_t and h_t, i.e. dist_j(h_t, d_t), implies the extent to which the attention network matches the teammates' behaviour. The more a network matches the teammates' behaviour, the more influential it should be in the action selection. Hence, we can use the distance to weigh the probability distributions (one per attention network) and select the action with the highest weighted probability. We compute the weight of each network j as:

    v_j = e^{−dist_j(h_t, d_t) · 10} / Σ_{z=1}^{|T|} e^{−dist_z(h_t, d_t) · 10}    (8)

and the weighted probability distribution over actions as:

    {P̄(a_t^k)} = Σ_{j=1}^{|T|} {P(a_t^k)}_j · v_j    (9)

Finally, we select the action with the highest probability in {P̄(a_t^k)} as the action for the ad hoc agent in step t.
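The selection rule of Eqs. (8) and (9) amounts to a distance-based softmax followed by a weighted sum; a sketch with two hypothetical networks and three actions (all numbers invented):

```python
import math

def aggregate_action_probs(distances, action_probs):
    """Eqs. (8)-(9): weight each network's action distribution by how well its
    state prediction matched (smaller distance -> larger weight), then sum."""
    # Eq. (8): softmax over -10 * distance.
    exps = [math.exp(-d * 10) for d in distances]
    total = sum(exps)
    v = [e / total for e in exps]
    # Eq. (9): weighted sum of the per-network distributions.
    n = len(action_probs[0])
    return [sum(v[j] * action_probs[j][k] for j in range(len(v))) for k in range(n)]

# Toy numbers (invented): two networks, three high-level actions.
distances = [0.05, 0.40]                     # dist_j(h_t, d_t) per network
probs = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]   # {P(a_t^k)}_j per network
agg = aggregate_action_probs(distances, probs)
best_action = max(range(len(agg)), key=lambda k: agg[k])
```

Because the first network's prediction is much closer, its distribution dominates and the first action is selected even though the second network strongly prefers the third.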
Training AATEAM
In this subsection, we discuss how we train one attention network j (for each teammate type j). First, we use reinforcement learning to learn a policy π_j for cooperating with past teammate type j. After that, we make the ad hoc agent use π_j to work with the teammate type j for a number of episodes. During the teamwork, we collect the state-action pairs as training data. Let us denote â_t as the action that the ad hoc agent took in step t based on the learned policy π_j. The state-action pair (s_t, â_t) is the training instance of step t, where â_t is the label for the training. A training sample is a sequence of state-action pairs, formally {(s_t, â_t)}_j. For each state-action pair (s_t, â_t) in a training sample, the attention network takes s_t as input and returns a probability vector over actions {P(a_t^k) | k = 1, ..., n}. The probability that the network selects the action â_t is P(a_t^k = â_t). We calculate the loss of an action using the negative log-likelihood:

    loss_a = −log(P(a_t^k = â_t))    (10)

Notice that the higher the probability that the attention network selects â_t, the smaller the loss of action loss_a.

Since {â_t} output by π_j fits the past teammate type j's behaviour, i.e. affects the state as intended, the attention network should learn to reduce the distance between d_t and h_t. By doing so, we ensure that d_t makes a good prediction of s_t given the teammate type j's behaviour. Hence, we compute the loss of distance as the square of the L2-norm:

    loss_d = ||d_t − h_t||_2^2    (11)

Then, the overall loss is an additive function of the above loss measures:

    loss = loss_a + β · loss_d    (12)

where β is a constant that adjusts the weight of loss_d. We use the Adam optimizer (Kingma and Ba 2014) to update the network parameters to minimise the computed loss. Moreover, during the training, we apply the teacher forcing strategy (Williams and Zipser 1989), i.e. we use â_t instead of a_t as the input to the decoder in step t + 1. This is because the actual s_{t+1} results from â_t rather than a_t.
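The per-step training loss of Eqs. (10)-(12) can be sketched as follows; the probabilities and hidden-state values are invented toy numbers:

```python
import math

def training_loss(action_probs, label_idx, d_t, h_t, beta=0.1):
    """Eq. (12): loss = loss_a + beta * loss_d.

    action_probs : the network's distribution {P(a_t^k)} for one step
    label_idx    : index of the action taken by the learned policy (the label)
    d_t, h_t     : decoder and encoder hidden states (same length)
    """
    loss_a = -math.log(action_probs[label_idx])              # Eq. (10)
    loss_d = sum((d - h) ** 2 for d, h in zip(d_t, h_t))     # Eq. (11), ||.||_2^2
    return loss_a + beta * loss_d

# Toy numbers (invented): the policy's action has probability 0.5,
# and the prediction d_t is slightly off in its first element.
loss = training_loss([0.5, 0.3, 0.2], 0, d_t=[0.1, 0.2], h_t=[0.0, 0.2])
```

With these numbers, loss_a = −log 0.5 and loss_d = 0.01, so the distance term contributes only β · 0.01 to the total.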
Comparison to the NLP domain. In the teamwork domain, the attention network needs to handle the sequence of inputs differently from the attention networks in the NLP domain (e.g. Figure 1). This is because each domain has different information available. First, in the NLP domain, the encoder can encode the whole sentence before inputting anything into the decoder. In our domain, we do not know the states from the future. Hence, we can only use the state information until the current step. Additionally, the NLP decoder uses the last hidden state of the encoder as its initial hidden state such that the decoding is based on the whole sentence. In our domain, the decoder can only use h_1 as its initial hidden state (since there is only s_1 at the beginning of a task). Second, the attention network in NLP needs to learn when to end a sentence (i.e. output EOS). However, in our domain, the end of an episode is an external signal and thus, our network does not need to output it.
Experiments
In this section, we present the experiments we have done to demonstrate the effectiveness of AATEAM. First, we introduce the experiment domain. Second, we present how we collect the data. Finally, we pitch AATEAM against PLASTIC-Policy to show the advantage of our approach. In detail, we show the results of both algorithms when working with, first, known and, second, unknown teammates. Our results indicate that when working with both known and unknown teammates, in most cases our algorithm outperforms PLASTIC-Policy significantly.
Half Field Offence Domain
Following Barrett et al. (2017), we use a limited version of the RoboCup 2D domain, i.e. the half field offense (HFO) domain (Hausknecht et al. 2016). HFO retains the complexity of the RoboCup 2D domain in terms of the state/action space and the uncertainty of actions' effects. The difference between the two settings is that we play the game within one half of the field and the ad hoc agent always replaces one of the agents in the offence team. This simpler setting with a single objective allows researchers to focus on developing algorithms that tackle the environmental complexity.

In this paper, we use the HFO version with two offensive agents that try to score a goal against the opposing team. The opposing team has two defenders, including the goalie. For the defensive agents, we adopt the agent2D behaviour provided by Akiyama (2010). At the beginning of each episode, the ball is put in a random position close to the midline of the full field. The initial position of the goalie is the goal centre. The remaining agents start at random positions within a given range. The ranges of the offensive and defensive agents ensure that the offensive agents are initialised closer to the ball than the defenders. In every step, each agent can observe the position of the ball as well as the others' positions. An episode ends in one of the following cases: 1) the offensive team scores a goal, 2) the ball leaves the field, 3) the defensive team captures the ball, or 4) the game lasts for 500 simulation steps. The successful episodes are the episodes in which the offensive team scores a goal.
State. In every simulation step, the ad hoc agent gets the following state features:
• the ad hoc agent's (x, y) position and its body angle
• the ball's (x, y) position
• the distance and the angle from the ad hoc agent to the centre of the goal
• the ad hoc agent's largest opening angle to the goal
• the distance from the ad hoc agent to the closest defender
• the teammate's largest opening angle to the goal
• the distance from the teammate to the closest defender
• the opening angle for passing to the teammate
• the teammate's (x, y) position
• each defender's (x, y) position
Given our setting, i.e. one teammate and two opponents, the state is a vector of 18 elements. Note that we sort the defenders' positions in the state vector based on their uniform numbers, while the original simulator sorts those positions based on the distance between each defender and the ad hoc agent. This modification ensures that each element about a defender's position always corresponds to the same defender.
Action. The high-level actions that the ad hoc agent can perform are: 1) Dribble, 2) Pass, 3) Shoot, and 4) Move. In every step, the simulator informs the ad hoc agent whether it is close enough to be able to kick the ball. If so, the ad hoc agent can select an action from {Dribble, Pass, Shoot}. When it is not close enough to the ball, it can only perform Move. Both the learned policies and the attention networks select high-level actions for the ad hoc agent when the agent can kick the ball. Given a high-level action, the simulator maps it to a continuous action to perform the simulation.
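The kickable/non-kickable gating described above can be sketched as follows; the function names and the probability values are illustrative, not from the paper's code:

```python
def available_actions(can_kick):
    """The simulator tells the ad hoc agent whether it can kick the ball;
    only then may it choose among the ball-handling actions."""
    return ["Dribble", "Pass", "Shoot"] if can_kick else ["Move"]

def select_action(agg_probs, can_kick,
                  all_actions=("Dribble", "Pass", "Shoot", "Move")):
    """Pick the highest-probability action among those currently available."""
    allowed = available_actions(can_kick)
    return max(allowed, key=lambda a: agg_probs[all_actions.index(a)])

# Toy distribution (invented): Move has the highest raw probability, but with
# ball possession only the kicking actions are considered.
probs = [0.1, 0.2, 0.3, 0.4]
a = select_action(probs, can_kick=True)    # -> "Shoot"
```

Restricting the argmax to the available subset matches the rule that Move is the only option when the ball is out of reach.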
Collecting data for AATEAM. We use the State-Action-Reward-State-Action (SARSA) algorithm (Hausknecht et al. 2016) to learn a policy π_j for each past teammate type j. The ad hoc agent uses π_j to work with teammate type j for 200000 episodes. Within each episode, we collect the state-action pairs {(s_t, â_t)} every time the ad hoc agent obtains control of the ball and until it loses the ball. This sequence of {(s_t, â_t)} is one training sample. We only store the state-action pairs from the successful episodes.

Note that due to the continuous state space, the states of adjacent steps are very similar. Our initial experiments show that if we collect the state-action pairs in every step, the smoothness of the sequence of states impairs the attention network's performance. Hence, we set a threshold tr to detect the change of state. In every time step, we compare each element of the current state (denoted as z′) with the corresponding element of the previous state (denoted as z). When at least one state element meets the condition |z′ − z| / |min(z, z′)| > tr, we collect the current state-action pair. Similarly, when using the trained networks to play, we select a new action when at least one state element meets the above condition. Otherwise, we continue the current action. Moreover, we remove the training samples that contain only one state-action pair. This is because the attention networks learn to extract the temporal correlations of states. The samples with only one state have no temporal correlation inside and thus hinder the learning performance.
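The state-change test can be sketched as below; the eps guard against a zero denominator is our addition and is not specified in the paper:

```python
def state_changed(prev, curr, tr=0.2, eps=1e-9):
    """True when at least one element satisfies |z' - z| / |min(z, z')| > tr.
    eps guards the division when min(z, z') is zero (our assumption)."""
    for z, z_new in zip(prev, curr):
        denom = max(abs(min(z, z_new)), eps)
        if abs(z_new - z) / denom > tr:
            return True
    return False

# Toy states (invented): the second element grows from 1.0 to 1.5, i.e. by 50%.
changed = state_changed([0.4, 1.0], [0.42, 1.5])      # exceeds tr = 0.2
unchanged = state_changed([0.4, 1.0], [0.41, 1.05])   # all changes below tr
```

A single sufficiently large relative change in any element triggers collection (or re-selection of an action during play).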
Collecting data for PLASTIC-Policy. We collect state transitions (200000 episodes for each type) since PLASTIC-Policy uses them to update its probability distribution over the past teammate types. According to Barrett et al. (2017), a state transition (s, a, s′) means that either the possession of the ball changes from one agent to another or the episode ends when the ad hoc agent performs the action a. In the experiments, s is the state of the step in which the ad hoc agent can kick the ball. s′ is the state of the step in which either another agent can kick the ball (this could result from intercepting, passing, or shooting) or the episode ends. Note that we cannot determine state transitions when the ad hoc agent cannot kick the ball because the simulator does not have a signal that indicates which specific agent holds the ball.
Teammate types. In this paper, we use externally-created teammates. The teammates are not programmed to work with the ad hoc agent. We run the binary files released from the 2013 RoboCup 2D simulation league competition (RoboCup 2013). From these external agents, we use five types as past teammate types, i.e. aut, axiom, agent2D, gliders, and yushan. Additionally, we use five types as new teammate types, i.e. yunlu, utaustin, legend, ubc, and warthog.

Table 1: Experimental Parameters

    Name              Value   Name           Value
    β                 0.1     tr             0.2
    u                 18      l              256
    GRU layer number  5       Max length     60
    Dropout rate      0.1     Learning rate  0.01
Experimental Parameters and Settings
We sum up the experimental parameters in Table 1. The number of hidden state layers for the GRU is 5. We use the last layer of hidden states to compute dist_j(h_t, d_t) and loss_d. The maximum length of a state sequence that the attention networks process is 60. When the state sequence's length exceeds 60, the networks drop the states from the beginning. The rate of dropout is 0.1. When training the attention networks, the default learning rate is 0.01 and the batch size is 32. In the experiments, we use AATEAM and PLASTIC-Policy to play with each teammate type for 10 trials, where each trial contains 1000 episodes. We set the decision time limit for each step to 100 ms following Barrett et al. (2017). We evaluate the performance based on the scoring rate, i.e. how many episodes the offence team wins in every 1000 episodes. The error bars in Figure 4 and Figure 5 below are the standard deviations of the scoring rates. Note that the results are discrete points; we add lines between them to improve clarity. We apply the binomial test to assess the significance of our results (following Barrett et al. (2017)). We observe that the difference between AATEAM and PLASTIC-Policy is significant (95% confidence) except for the axiom and ubc types.
Playing with Past Teammates
In this subsection, we make the ad hoc agent play with the teammates it encountered before, i.e. past teammates. In detail, it plays with five past teammate types using, first, AATEAM and, second, PLASTIC-Policy. Also, we have one baseline and one upper bound here. The baseline is the performance of the original offence team, i.e. the team without the ad hoc agent. The upper bound is the performance of the ad hoc agent playing with a given teammate type following the learned policy for that teammate. Since each policy is learned only for the corresponding past teammate type, we expect the performance with the learned policies to be higher than with any other approach. The results are shown in Figure 4.

In Figure 4, we see that both AATEAM and PLASTIC-Policy are better than the baseline. This is because the reinforcement learning algorithms managed to learn good policies for the past teammate types (both in PLASTIC-Policy and during the training of AATEAM). On top of that, the attention networks successfully learned to fit the behaviour of past teammates. Hence, both algorithms exhibit more intelligent behaviours.
Figure 4: The performance with known teammates (scoring rates of AATEAM, PLASTIC, Baseline, and the Upper bound for the aut, axiom, agent2D, gliders, and yushan teammate types)

When it comes to the upper bound, we can see that
AATEAM's performance is closer to the upper bound than
PLASTIC-Policy in most of the cases. Also, AATEAM outperforms
PLASTIC-Policy significantly. The reason is as
follows. PLASTIC-Policy finds the corresponding policy for
the known teammate type by evaluating the similarity be-
tween past experiences and current ones. However, it may
not be able to identify the most similar past experience due
to the decision time limit (i.e. 100 ms). Thus, the simi-
larity evaluation may cause it to switch between different
learned policies. This degrades its performance. In compar-
ison, AATEAM extracts the context in every step and maps
the context to the best action. Note that the extraction and
mapping using trained attention networks are very efficient.
Therefore, AATEAM can take advantage of the knowledge
from the learned policies in a more flexible way. In detail,
the ad hoc agent uses AATEAM to learn to cope with sev-
eral different teammate types. Next, it plays with teammates
of one of the known types. Each attention network outputs a
probability distribution over actions. AATEAM aggregates
the probabilities from all attention networks. The aggrega-
tion of the probability distributions enables a better usage of
the previous knowledge (as we base our decision on more
knowledge than the knowledge about only one past team-
mate type). Note that the originally learned policies (upper
bounds) come from reinforcement learning and they may
not be optimal. Given a teammate type, the policies learned
for other types may occasionally suggest a better action than
the policy for the given type. Therefore, AATEAM can perform better
than the upper bounds as it uses more information from neu-
ral networks. In summary, we observe that AATEAM can
easily adapt to the teammates’ behaviour.
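The aggregation step described above can be sketched as follows. The weighting of each network by its state-prediction accuracy reflects the paper's statement that prediction accuracies help determine which actions to take, but the function name, the normalisation, and the exact weighting scheme are illustrative assumptions, not the paper's formula.

```python
def select_action(action_probs, prediction_accuracies):
    """Aggregate the action distributions output by all attention networks.

    action_probs: one probability distribution over actions per network.
    prediction_accuracies: how well each network predicted recent states,
    used here as an (assumed) confidence weight for that network.
    """
    total = sum(prediction_accuracies)
    weights = [a / total for a in prediction_accuracies]
    num_actions = len(action_probs[0])
    aggregated = [sum(w * dist[j] for w, dist in zip(weights, action_probs))
                  for j in range(num_actions)]
    best = max(range(num_actions), key=aggregated.__getitem__)
    return best, aggregated

# Three networks (one per past teammate type), three candidate actions.
probs = [[0.6, 0.3, 0.1],   # network trained on teammate type A
         [0.2, 0.5, 0.3],   # type B
         [0.1, 0.2, 0.7]]   # type C
best, agg = select_action(probs, [0.9, 0.4, 0.3])  # type A predicts states best
```

Because every network contributes in proportion to its weight, an action favoured by several moderately confident networks can beat the top action of any single network, which is the "more knowledge than one past teammate type" effect discussed above.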
Playing with New Teammates
In this subsection, the ad hoc agent uses AATEAM and
PLASTIC-Policy to work with the new teammates, i.e. the
teammates it has never encountered before.

Figure 5: The performance with unknown teammates (scoring rates of AATEAM, PLASTIC, and Baseline for the legend, ubc, utaustin, warthog, and yunlu teammate types)

Figure 6: The loss of actions over training time (in thousands of iterations) for AATEAM, the variant without attention, and the variant that keeps samples of length 1 (using yushan type as example)

Naturally, we
have one baseline: the performance of the original offence
team playing without the ad hoc agent. We cannot use the
upper bound here, as the ad hoc agent did not learn policies
to work with the new teammates beforehand.
The results in Figure 5 show that both AATEAM and
PLASTIC-Policy outperform the baseline as both algo-
rithms behave more intelligently than the replaced agent.
Moreover, in most cases AATEAM still significantly out-
performs PLASTIC-Policy. The reason is that PLASTIC-
Policy switches between policies based on the current be-
haviour of new teammates. Since switching takes time,
PLASTIC-Policy cannot adapt to teammates’ behaviours
quickly. In addition, the performance of both AATEAM and
PLASTIC-Policy is good except when the ad hoc agent
plays with the ubc type. The reason is that the ubc agent
does not often pass the ball to its teammate. Therefore, even
though AATEAM and PLASTIC-Policy are smarter than the
agent they replace, they cannot help much. In summary,
the experiments of playing with unknown teammate types
demonstrate that AATEAM can adapt to different team-
mates’ behaviours better than PLASTIC-Policy. This is cru-
cial for the ad hoc teamwork in complex domains.
The Importance of Attention Mechanism
In this subsection, we show the importance of the attention
mechanism in AATEAM. Besides the architecture shown in
Figure 3, we also try to train the neural network without the
attention mechanism, i.e. where each encoder output o_i has
the same weight. Additionally, we try to include the training
samples of length 1 during the training. The loss of actions
in these different cases is shown in Figure 6. The results
demonstrate that we can
get a good network only when we adopt the attention mech-
anism and remove the training samples of length 1 (to pre-
vent the attention module from being influenced by degenerate
single-step contexts). Note that the loss of actions directly
affects the performance of the neural network. Hence, the
results in Figure 6 indicate that the attention mechanism is
an essential part of AATEAM.
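To make the ablated contrast concrete, the following sketch pools the recurrent encoder outputs o_i either with attention weights or with the uniform weights of the "without attention" variant. Dot-product scoring against a query vector is an assumption for illustration; the scoring function in the paper's Figure 3 architecture may differ.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def pool(outputs, query=None):
    """Pool the recurrent encoder outputs o_i into one context vector.

    With a query vector, weight each o_i by softmax(o_i . query), as in
    dot-product attention (an assumed scoring function). Without a query,
    fall back to uniform weights, i.e. the ablated variant in which every
    o_i has the same weight.
    """
    n, dim = len(outputs), len(outputs[0])
    if query is None:
        weights = [1.0 / n] * n
    else:
        scores = [sum(o_k * q_k for o_k, q_k in zip(o, query)) for o in outputs]
        weights = softmax(scores)
    return [sum(w * o[d] for w, o in zip(weights, outputs)) for d in range(dim)]

# Two encoder outputs in 2-D; a query aligned with the first output
# concentrates nearly all attention weight on it, while uniform pooling
# merely averages the two.
outputs = [[1.0, 0.0], [0.0, 1.0]]
context_uniform = pool(outputs)
context_attended = pool(outputs, [10.0, 0.0])
```

The uniform variant cannot emphasise the steps of the state sequence that matter for the current decision, which is consistent with its higher action loss in Figure 6.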
Conclusions and Future Work
This paper proposes a novel approach called AATEAM to
solve the ad hoc teamwork problem in complex domains.
To the best of our knowledge, this is the first attempt to
solve the ad hoc teamwork problem using attention-based
recurrent neural networks. In detail, AATEAM contains a
set of neural networks, each trained on the behaviour of a
different past teammate type. Using the probability
distributions output by each neural network, AATEAM se-
lects the best action for the ad hoc agent in real-time. We
pitched our algorithm against PLASTIC-Policy, which is the
only existing scheme proposed for the ad hoc teamwork problem in
complex domains. Our results indicate that when working
with both known and unknown teammates, in most cases
our algorithm outperforms PLASTIC-Policy.
One limitation of AATEAM is that it needs plenty of data
to train the attention networks. Additionally, it is difficult
to know whether the chosen neural network architecture is
the most appropriate one. As future work, we will test AATEAM with
more teammate types to show the robustness of our method.
Furthermore, we plan to test the same idea for even more
complex domains (with larger action and search space).
Acknowledgments
The research was partially supported by the ST Engi-
neering - NTU Corporate Lab through the NRF corporate
lab@university scheme.
References
Agmon, N., and Stone, P. 2012. Leading ad hoc agents in joint
action settings with multiple teammates. In Proceedings of the
International Conference on Autonomous Agents and Multiagent
Systems, 341–348.
Agmon, N.; Barrett, S.; and Stone, P. 2014. Modeling uncertainty
in leading ad hoc teams. In Proceedings of the International Con-
ference on Autonomous Agents and Multiagent Systems, 397–404.
Akiyama, H. 2010. Agent2D base code release. https://osdn.net/
projects/rctools/.
Albrecht, S. V., and Ramamoorthy, S. 2013. A game-theoretic
model and best-response learning method for ad hoc coordination
in multiagent systems. In Proceedings of the International Confer-
ence on Autonomous Agents and Multiagent Systems, 1155–1156.
Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine
translation by jointly learning to align and translate. arXiv preprint
arXiv:1409.0473.
Barrett, S.; Stone, P.; Kraus, S.; and Rosenfeld, A. 2013. Teamwork
with limited knowledge of teammates. In Proceedings of the AAAI
Conference on Artificial Intelligence, 102–108.
Barrett, S.; Rosenfeld, A.; Kraus, S.; and Stone, P. 2017. Mak-
ing friends on the fly: Cooperating with new teammates. Artificial
Intelligence 242:132–171.
Chakraborty, D., and Stone, P. 2013. Cooperating with a Marko-
vian ad hoc teammate. In Proceedings of the International Confer-
ence on Autonomous Agents and Multiagent Systems, 1085–1092.
Chandrasekaran, M.; Doshi, P.; Zeng, Y.; and Chen, Y. 2017. Can
bounded and self-interested agents be teammates? Application to
planning in ad hoc teams. Autonomous Agents and Multi-Agent
Systems 31(4):821–860.
Chen, S.; Andrejczuk, E.; Irissappane, A.; and Zhang, J. 2019.
ATSIS: Achieving the ad hoc teamwork by sub-task inference and
selection. In Proceedings of the International Joint Conference on
Artificial Intelligence, 172–179.
Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.;
Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase
representations using RNN encoder-decoder for statistical machine
translation. arXiv preprint arXiv:1406.1078.
Chung, J.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2014. Empirical
evaluation of gated recurrent neural networks on sequence model-
ing. arXiv preprint arXiv:1412.3555.
Genter, K. L.; Agmon, N.; and Stone, P. 2011. Role-based ad
hoc teamwork. In Proceedings of the AAAI Conference on Plan,
Activity, and Intent Recognition, 17–24.
Grosz, B. J., and Kraus, S. 1996. Collaborative plans for complex
group action. Artificial Intelligence 86(2):269–357.
Hausknecht, M.; Mupparaju, P.; Subramanian, S.; Kalyanakrish-
nan, S.; and Stone, P. 2016. Half field offense: An environment
for multiagent learning and ad hoc teamwork. In AAMAS Adaptive
Learning Agents (ALA) Workshop.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term mem-
ory. Neural computation 9(8):1735–1780.
Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic
optimization. arXiv preprint arXiv:1412.6980.
Melo, F. S., and Sardinha, A. 2016. Ad hoc teamwork by learn-
ing teammates’ task. Autonomous Agents and Multi-Agent Systems
30(2):175–219.
Mnih, V.; Heess, N.; Graves, A.; and Kavukcuoglu, K. 2014. Re-
current models of visual attention. In Advances in Neural Informa-
tion Processing Systems, 2204–2212.
Ravula, M.; Alkoby, S.; and Stone, P. 2019. Ad hoc teamwork
with behavior switching agents. In Proceedings of the International
Joint Conference on Artificial Intelligence, 550–556.
RoboCup. 2013. The binary file for the agents. https://archive.
robocup.info/Soccer/Simulation/2D/binaries/RoboCup/.
Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and
Salakhutdinov, R. 2014. Dropout: a simple way to prevent neu-
ral networks from overfitting. The Journal of Machine Learning
Research 15(1):1929–1958.
Stone, P.; Kaminka, G. A.; Kraus, S.; and Rosenschein, J. S.
2010. Ad hoc autonomous agent teams: Collaboration without pre-
coordination. In Proceedings of the AAAI Conference on Artificial
Intelligence, 1504–1509.
Sutton, R. S., and Barto, A. G. 2011. Reinforcement learning: An
introduction. Cambridge, MA: MIT Press.
Tambe, M. 1997. Towards flexible teamwork. Journal of Artificial
Intelligence Research 7:83–124.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.;
Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is
all you need. In Advances in Neural Information Processing Sys-
tems, 5998–6008.
Williams, R. J., and Zipser, D. 1989. A learning algorithm for
continually running fully recurrent neural networks. Neural Com-
putation 1(2):270–280.
Wu, F.; Zilberstein, S.; and Chen, X. 2011. Online planning for ad
hoc autonomous agent teams. In Proceedings of the International
Joint Conferences on Artificial Intelligence, 439–445.
Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.;
Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural
image caption generation with visual attention. In International
Conference on Machine Learning, 2048–2057.