Towards Heterogeneous Multi-Agent Reinforcement Learning
with Graph Neural Networks
Douglas R. Meneghetti1, Reinaldo A. C. Bianchi1
1Electrical Engineering Department, FEI University Center
CEP 09850-901 – 3972 – São Bernardo do Campo – SP – Brazil
{douglasrizzo,rbianchi}@fei.edu.br
Abstract. This work proposes a neural network architecture that learns policies for multiple agent classes in a heterogeneous multi-agent reinforcement learning setting.
The proposed network uses directed labeled graph representations for states,
encodes feature vectors of different sizes for different entity classes, uses rela-
tional graph convolution layers to model different communication channels be-
tween entity types and learns distinct policies for different agent classes, sharing
parameters wherever possible. Results have shown that specializing the commu-
nication channels between entity classes is a promising step to achieve higher
performance in environments composed of heterogeneous entities.
1. Introduction
In recent years, multi-agent deep reinforcement learning has emerged as an active area
of research. Alongside it, geometric deep learning enables neural networks to per-
form supervised, semi-supervised and unsupervised learning on data structured as graphs
and manifolds. Combining both fields, a new paradigm of multi-agent reinforcement
learning has emerged, in which agents learn to communicate [Sukhbaatar et al. 2016,
Peng et al. 2017] by using graph convolution layers as message passing mechanisms
[Gilmer et al. 2017].
Until now, work in this area has focused on the approximation of policies for homogeneous agents, i.e. agents that share the same action set and policy [Agarwal et al. 2019, Malysheva et al. 2019, Jiang et al. 2020], or on the specialization of agents for a limited number of simple actions [Wang et al. 2018a]. However, no work has explicitly studied
the potential of creating neural network architectures for environments with heteroge-
neous agents, capable of specializing the approximated policies according to an agent’s
class or role in the environment. Such environments may contain heterogeneous teams of
agents (e.g. drones and terrestrial robots) or homogeneous teams of agents with the need
for specialized policies (e.g. the RoboCup Soccer Leagues).
In this work, we tackle the challenge of heterogeneous multi-agent reinforcement learning by proposing a neural network architecture that uses class information about agents and environment entities to model specialized communication mechanisms, specializing inter-agent communication in heterogeneous multi-agent environments through the use of inter-class relational graph convolutions.
The text is organized as follows: section 2 presents the theoretical background
in reinforcement learning and graph neural networks; section 3 presents related work;
in section 4, we introduce the heterogeneous multi-agent graph network, our proposed
neural network architecture; sections 5 and 6 present our experiments and results in the
StarCraft Multi-Agent Challenge environments and section 7 concludes the paper.
2. Research Background
Reinforcement learning techniques solve tasks that are formalized as Markov Decision Processes (MDPs). An MDP is a tuple $\langle S, A, P, R \rangle$, where $S$ is the set of possible states, $A$ is the set of actions an agent can perform and $P : S \times A \times S \to [0, 1]$ is a state transition function, where $P(s, a, s')$ maps the probability of an agent observing state $s'$ after executing action $a$ in state $s$. $R : S \times A \to \mathbb{R}$ is a reward function and $0 \le \gamma < 1$ is a discount factor for future rewards, compared to present ones.
Many authors [Littman 1994, Bowling and Veloso 2000, Busoniu et al. 2008]
propose the modeling of multi-agent systems as stochastic games, which can be con-
sidered a generalization of MDPs. In a stochastic game, the set of actions becomes $A = A_1 \times A_2 \times \ldots \times A_m$ from $m$ agents; the transition function becomes conditioned on the joint action of all agents, $P : S \times A_1 \times A_2 \times \ldots \times A_m \times S$; and the reward function may be different for each agent.
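As a rough illustration of the formalism above, the sketch below expresses a stochastic game as type signatures in Python; the names (`StochasticGame`, the placeholder `State`/`Action` types) are ours and purely illustrative, not part of any cited formulation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

State = int    # placeholder state type (assumption)
Action = int   # placeholder action type (assumption)

@dataclass
class StochasticGame:
    """Illustrative container for an m-agent stochastic game."""
    num_agents: int
    # P(s, (a_1, ..., a_m), s') -> probability of reaching s' under the joint action
    transition: Callable[[State, Tuple[Action, ...], State], float]
    # One reward function per agent: R_i(s, (a_1, ..., a_m)) -> reward of agent i
    rewards: Sequence[Callable[[State, Tuple[Action, ...]], float]]
    gamma: float = 0.99  # discount factor
```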
Furthermore, earlier works that have represented MDPs as sets of ob-
jects belonging to multiple classes include relational MDPs [Guestrin et al. 2003],
object-oriented MDPs (OO-MDPs) [Wasser et al. 2008] and multi-agent OO-MDPs
[da Silva et al. 2019].
2.1. Graph Neural Networks
In the same way that successful neural network architectures are biased with respect to the underlying structure of their input data (e.g. convolutional neural networks for data with spatial relations and recurrent neural networks for sequential data), the existence of many kinds of data that can be naturally represented as graphs, such as road maps, academic citations [Kipf and Welling 2017] and molecules [Duvenaud et al. 2015], has prompted the creation of neural network architectures specialized in dealing with graphs.
A graph $G$ is composed of a non-empty set of nodes or vertices, denoted as $V$, and a set of edges, denoted as $E$. Each edge $e \in E$ connects a pair of (not necessarily distinct) nodes [Bondy and Murty 2008]. When dealing with graphs for the purposes of machine learning, each node, edge and the graph itself may possess features, stored in vectors [Battaglia et al. 2018]. $\vec{v}_i$, $\vec{e}_j$ and $\vec{u}$ are the attribute vectors of node $i$, edge $j$ and graph $G$, respectively.
For this work, a graph is defined as a tuple $G = (V, E)$, where vertices in $V$ have feature vectors and $E$ is a set of arcs (directed edges) which do not have features.
In its most essential form [Gori et al. 2005, Scarselli et al. 2009a, Scarselli et al. 2009b], a graph neural network allows each node $i \in V$ in an input graph to aggregate information from its in-neighbors $N^-(i)$, an operation called message passing. Message passing can be expressed generically as
$$\vec{u}_i = \operatorname{Agg}_{j \in N^-(i)} f^{(l)}\!\left(\vec{v}_i^{\,(l-1)}, \vec{v}_j^{\,(l-1)}, \vec{e}_{(j,i)}\right),$$
where $\vec{v}_i^{\,(l-1)}$ is the feature vector of node $i$ in layer $l-1$ of the network, $\vec{e}_{(j,i)}$ is the feature vector of edge $e_{(j,i)}$, $f$ is a parametric transition function that takes into account the state of node $i$ and its in-neighbors, and $\operatorname{Agg}$ is a permutation-invariant aggregation function, such as average, max or sum.
After the message passing step, the vector of aggregated information $\vec{u}_i$ is used to generate the output of node $i$ for layer $l$ using a parameterized update or output function $g$,
$$\vec{v}_i^{\,(l)} = g^{(l)}\!\left(\vec{v}_i^{\,(l-1)}, \vec{u}_i\right).$$
Although each node only aggregates information from its in-neighbors, its output
can still be influenced by nodes at greater distances by incorporating multiple layers with
the aforementioned steps, achieving what can be compared to multi-hop communication.
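To make the two steps above concrete, the following PyTorch sketch implements one message-passing layer with a sum aggregator over in-neighbors. The class and parameter names are ours, edge features are omitted for brevity, and the linear-plus-ReLU choice for $f$ and $g$ is only one possible instantiation.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """Minimal sketch of one message-passing layer with sum aggregation."""

    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Linear(2 * dim, dim)  # transition function f(v_i, v_j)
        self.g = nn.Linear(2 * dim, dim)  # update/output function g(v_i, u_i)

    def forward(self, v: torch.Tensor, arcs: torch.Tensor) -> torch.Tensor:
        # v: (num_nodes, dim) node features; arcs: (num_arcs, 2) with rows (j, i),
        # meaning node j sends a message to node i (i aggregates its in-neighbors).
        j, i = arcs[:, 0], arcs[:, 1]
        messages = torch.relu(self.f(torch.cat([v[i], v[j]], dim=-1)))
        u = torch.zeros_like(v).index_add_(0, i, messages)  # Agg = sum over in-neighbors
        return torch.relu(self.g(torch.cat([v, u], dim=-1)))
```

Stacking several such layers yields the multi-hop communication mentioned above, since information propagates one arc per layer.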
3. Related Work
Works that directly represent multi-agent systems as graphs of agents include DGN
[Jiang et al. 2020], MAGNet [Malysheva et al. 2019], NerveNet [Wang et al. 2018b] and
[Agarwal 2019]. DGN [Jiang et al. 2020] introduced the use of graph convolutional lay-
ers for inter-agent communication, as well as techniques to stabilize training when using these convolutions in RL tasks. The work of [Agarwal 2019] was the first to add non-agent entities to graphs. MAGNet introduces a graph generation layer, which generates the adjacency matrix between agents. NerveNet [Wang et al. 2018b] is the first work that tackles the problem of agents of multiple types but, since it works with graphs of fixed sizes and all agents have a single real-valued action, the network does not specialize to these different types of agents.
Preceding work that can be viewed as inter-agent communication with GNNs in-
clude CommNet [Sukhbaatar et al. 2016] and BiCNet [Peng et al. 2017]. More recent
work that also focuses on modulating when agents communicate with each other include
ATOC [Jiang and Lu 2018] and TarMAC [Das et al. 2019].
This work differs from previous ones by making use of node class information
both as a means to specialize communication between node classes as well as for learning
a centralized policy for multiple agents of the same class.
4. Heterogeneous Multi-agent Graph Q-Network
In this work, states are represented as directed labeled graphs, in which nodes repre-
sent either agents or environment entities; arcs represent communication channels ei-
ther among agents or between an agent and an environment entity; node labels represent
agent/entity classes and edge labels represent specialized communication channels. In practical settings, the existence of an arc between a node $v$ and an agent $z$ may be related to $z$'s capability of observing $v$, and an arc from agent $z_1$ to agent $z_2$ indicates an open communication line from $z_1$ to $z_2$.
For a graph $G$, each node $v \in V(G)$ is associated with a node class $c \in C$, where $C$ is the set of node classes. The class of a node $v$ can be accessed through a function $C(v)$. A subset $Z$ of $C$ contains agent classes, i.e. classes pertaining solely to agent nodes.
Figure 1. A multi-agent system represented as a graph.
The class of a node determines the number of state variables used to describe that node. Furthermore, the class of an agent $z$ determines its action set $A_{C(z)}$, as well as its policy $\pi_{C(z)}$. In our work, arcs are used to encode relations between node classes. More specifically, an arc labeled $(n, m)$ represents a relation between an agent of class $n$ and another node of class $m$.
Figure 1 exemplifies a graph with three agents and five environment entities, in which agent nodes aggregate information from their neighbors. In the figure, $v_i^{(j)}$ represents node $v$ with index $i$ and class $j$. Darker nodes represent agents, while lighter nodes represent other environment objects encoded in the graph state.
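As an illustration of this state representation, the snippet below builds a small heterogeneous graph state out of plain tensors and Python lists. The field names and the specific classes, feature sizes and arcs are ours; any graph library (e.g. PyTorch Geometric or DGL) could store the same information.

```python
import torch

# Hypothetical state with two agent classes (1, 2) and one entity class (3).
node_class = [1, 1, 2, 3, 3]                       # class label of each node
node_feats = [torch.randn(4), torch.randn(4),      # class-1 agents: 4 features each
              torch.randn(6),                      # class-2 agent: 6 features
              torch.randn(5), torch.randn(5)]      # class-3 entities: 5 features each

# Arcs (j, i): node j sends information to agent node i.
arcs = torch.tensor([[3, 0], [4, 0], [3, 2], [1, 0], [0, 1]])

# Each arc is labeled with the (agent class, source class) pair it connects,
# which later selects the relation used for that message in the RGCN layers.
arc_label = [(node_class[i], node_class[j]) for j, i in arcs.tolist()]
```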
4.1. Neural Network Architecture
The proposed neural network architecture, named Heterogeneous Multi-Agent Graph Q-Network (HMAGQ-Net), is composed of three modules: an encoding module, a communication module and an action selection module. The modules are applied in sequence to the input data and the model can be trained end-to-end. The full network architecture is presented in figure 2 and explained below.
4.1.1. Encoding
In order to deal with the varying number and meaning of state variables that compose each node class, we introduce an encoding function $\phi_c$ for each $c \in C$, which receives as input the vector $\vec{v} \in \mathbb{R}^{d_{C(v)}}$ containing a node's description, and outputs an encoded vector $\phi_c(\vec{v}) \in \mathbb{R}^m$, where $m$ is a common output size for the encoding functions of all classes. In this work, we explore using multi-layer perceptrons as implementations of $\phi$.
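A minimal sketch of the per-class encoders follows, assuming the layer sizes and sigmoid nonlinearity reported in the experiments of section 5; the helper names, the toy classes and their input dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

def make_encoder(in_dim: int, hidden: int = 128, out_dim: int = 64) -> nn.Module:
    """One encoder phi_c: class-specific input size d_c, shared output size m = out_dim."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Sigmoid(),
        nn.Linear(hidden, hidden), nn.Sigmoid(),
        nn.Linear(hidden, out_dim),
    )

# One encoder per node class; input dimensions d_c differ across classes.
encoders = nn.ModuleDict({"1": make_encoder(4), "2": make_encoder(6), "3": make_encoder(5)})

# Encode a toy state: each node is encoded by the phi of its own class.
node_class = [1, 1, 2, 3, 3]
node_feats = [torch.randn(4), torch.randn(4), torch.randn(6), torch.randn(5), torch.randn(5)]
encoded = torch.stack([encoders[str(c)](x) for c, x in zip(node_class, node_feats)])  # (5, 64)
```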
4.1.2. Communication
In the communication layer, each agent node $z$ aggregates information from its set of in-neighbor nodes $N^-(z)$. In HMAGQ-Net, we employ relational graph convolutions (RGCN) [Schlichtkrull et al. 2018] to allow for specialization of the message passing mechanism. In an RGCN layer, the feature vector of node $i$ in layer $l+1$ is given by
$$\vec{v}_i^{\,(l+1)} = \sigma\!\left( \sum_{r \in R} \sum_{j \in N_i^r} \frac{1}{c_{i,r}} W_r^{(l)} \vec{v}_j^{\,(l)} + W_0^{(l)} \vec{v}_i^{\,(l)} \right),$$
where $r$ represents the index of a relation between nodes $i$ and $j$, $N_i^r$ is the set of in-neighbors of $i$ under relation $r$ and $c_{i,r}$ is a normalization constant. In this work, the set of relations is defined as all possible pairs $(c_1, c_2)$, $c_1 \in Z$, $c_2 \in C$ (see arc labels in figure 1).
Regularization in RGCN is achieved by decomposing the parameter matrix $W_r^{(l)}$ into $B$ basis transformations $V_b^{(l)}$ and coefficient vectors $\vec{a}_r^{\,(l)}$,
$$W_r^{(l)} = \sum_{b=1}^{B} a_{rb}^{(l)} V_b^{(l)}.$$
In this way, all relations $r \in R$ share the same set of basis matrices, while the coefficient vectors depend on $r$. In our work, the number of relations is $|R| = |Z| \times |C|$, as each agent class models a specialized communication channel with all other node classes.
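The sketch below gives one possible implementation of a relational layer with basis decomposition, following the two equations above. It assumes the arc labels built earlier have already been mapped to integer relation indices, and the class and variable names are ours rather than the reference implementation of [Schlichtkrull et al. 2018].

```python
import torch
import torch.nn as nn

class RelationalGraphConv(nn.Module):
    """Sketch of a relational graph convolution with B shared basis matrices."""

    def __init__(self, dim: int, num_relations: int, num_bases: int):
        super().__init__()
        self.basis = nn.Parameter(torch.randn(num_bases, dim, dim) * 0.01)  # V_b
        self.coeff = nn.Parameter(torch.randn(num_relations, num_bases))    # a_rb
        self.self_weight = nn.Linear(dim, dim, bias=False)                  # W_0

    def forward(self, v: torch.Tensor, arcs: torch.Tensor, rel: torch.Tensor) -> torch.Tensor:
        # v: (N, dim) node features; arcs: (E, 2) with rows (j, i); rel: (E,) relation ids.
        j, i = arcs[:, 0], arcs[:, 1]
        w = torch.einsum("rb,bde->rde", self.coeff, self.basis)   # W_r = sum_b a_rb V_b
        msg = torch.bmm(w[rel], v[j].unsqueeze(-1)).squeeze(-1)   # W_r v_j, one per arc
        # Normalization constant c_{i,r}: number of in-neighbors of i under relation r.
        deg = torch.zeros(v.size(0), self.coeff.size(0), dtype=v.dtype)
        deg.index_put_((i, rel), torch.ones_like(rel, dtype=v.dtype), accumulate=True)
        msg = msg / deg[i, rel].clamp(min=1.0).unsqueeze(-1)
        out = torch.zeros_like(v).index_add_(0, i, msg)           # sum over relations/arcs
        return torch.sigmoid(out + self.self_weight(v))           # sigma(...)
```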
4.1.3. Action selection
After $K$ layers of graph convolutions, the final feature vectors of the agent nodes are taken as their individual observations of the graph. We introduce a function $Q_c$ for each agent class $c \in Z$, which receives the observation $o_z$ of an agent $z$ of class $c$ as input and outputs a vector of size $|A_c|$, corresponding to the observation-action values for agent $z$.
Optionally, the concatenation of the feature vectors generated by all graph convo-
lution layers may be taken as the final observation for each agent [Jiang et al. 2020], an
alternative named in the experiments as “full receptive field” and displayed as red arrows
in figure 2.
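A sketch of the action selection module with one Q head per agent class is given below, together with a simple epsilon-greedy selection step; the helper names, the example action counts and the epsilon value are illustrative assumptions. In the full receptive field variant, the observation passed to each head would simply be the concatenation of the outputs of all K communication layers.

```python
import torch
import torch.nn as nn

def make_q_head(obs_dim: int, num_actions: int, hidden: int = 128) -> nn.Module:
    """Q_c: maps the final observation of a class-c agent to |A_c| action values."""
    return nn.Sequential(
        nn.Linear(obs_dim, hidden), nn.Sigmoid(),
        nn.Linear(hidden, hidden), nn.Sigmoid(),
        nn.Linear(hidden, num_actions),
    )

# Illustrative action count: 4 movement + m attack + 1 no-op (here m = 5).
q_heads = nn.ModuleDict({"2": make_q_head(64, 10), "3": make_q_head(64, 10)})

def select_actions(obs: torch.Tensor, agent_ids, node_class, eps: float = 0.05):
    """Epsilon-greedy action selection, one Q head per agent class (illustrative)."""
    actions = []
    for z in agent_ids:
        q = q_heads[str(node_class[z])](obs[z])
        a = torch.randint(len(q), ()) if torch.rand(()) < eps else q.argmax()
        actions.append(int(a))
    return actions
```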
4.2. Training stabilization
We employ both a policy network and a target network, with the same topology. The
target network is responsible for generating stable targets and is updated with a copy of
the parameters of the policy network after a fixed number of time steps. The parameters
of the policy network are optimized during every step of the environment with a batch of
transitions sampled from a replay buffer.
To speed up training, proportional prioritized experience replay [Schaul et al. 2015] was implemented, in which each transition of the replay buffer maintains a tuple $\langle s, \vec{a}, s', \vec{r} \rangle$, where $s$ and $s'$ are states represented as graphs, $\vec{a}$ are the actions selected by all agents and $\vec{r}$ are the rewards observed by each agent.
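A compact sketch of proportional prioritized sampling over such transitions follows; it is a simplified list-based version (names ours), not the sum-tree implementation usually used in practice, and stores whole graph states directly.

```python
import numpy as np

class ProportionalReplay:
    """Minimal proportional prioritized replay sketch (no sum tree)."""

    def __init__(self, capacity: int, alpha: float = 0.6, beta: float = 0.4):
        self.capacity, self.alpha, self.beta = capacity, alpha, beta
        self.buffer, self.priorities = [], []

    def push(self, transition):
        # transition = (s, joint_actions, s_next, rewards); states stored as graphs.
        self.buffer.append(transition)
        self.priorities.append(max(self.priorities, default=1.0))  # new samples: max priority
        if len(self.buffer) > self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)

    def sample(self, batch_size: int):
        p = np.asarray(self.priorities) ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.buffer), batch_size, p=p)
        weights = (len(self.buffer) * p[idx]) ** (-self.beta)  # importance-sampling weights
        weights /= weights.max()
        return [self.buffer[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors, eps: float = 1e-6):
        for i, e in zip(idx, td_errors):
            self.priorities[i] = abs(float(e)) + eps
```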
The loss function is given by
$$J(\theta) = \sum_{c \in Z} \frac{1}{|Z|} \sum_{z \in c} \left( r_z + \gamma \max_{a'} Q_c(o'_z, a'; \hat{\theta}) - Q_c(o_z, a_z; \theta) \right)^2,$$
where $Z$ represents all agent classes, $z$ is a single agent, $\theta$ are the parameters of the policy network and $\hat{\theta}$ are the parameters of the target network.
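Under the notation above, the class-wise TD loss could be computed roughly as in the sketch below. Here `policy_net` and `target_net` are assumed to return the final observation vectors $o_z$ for every agent in a graph state, and `agents_by_class` is an assumed helper field grouping agent indices by class; all names are ours.

```python
import torch

def hmagq_loss(batch, policy_net, target_net, q_heads, target_q_heads, gamma: float = 0.99):
    """Sketch of the class-wise TD loss from the equation above (names ours)."""
    loss = torch.zeros(())
    for s, actions, s_next, rewards in batch:
        obs = policy_net(s)            # final observations o_z for all agents in s
        obs_next = target_net(s_next)  # o'_z from the target network (parameters theta-hat)
        for c, agents in s.agents_by_class.items():          # assumed helper field
            per_class = torch.zeros(())
            for z in agents:
                q_sa = q_heads[c](obs[z])[actions[z]]          # Q_c(o_z, a_z; theta)
                q_next = target_q_heads[c](obs_next[z]).max()  # max_a' Q_c(o'_z, a'; theta-hat)
                target = rewards[z] + gamma * q_next.detach()
                per_class = per_class + (target - q_sa) ** 2
            loss = loss + per_class / len(s.agents_by_class)   # 1/|Z| weighting
    return loss / len(batch)
```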
Figure 2. An example of the proposed model processing a graph of 3 environment entities of class $c_1$ (gray), 2 agents of class $c_2$ (green) and 3 agents of class $c_3$ (blue). Red elements refer to changes in the network topology if the outputs of all $K$ layers from the communication module are used as input for the action module.
Table 1. Hyperparameters used in the training setting
Training steps: $10^6$
Target network $\hat{\theta}$ update interval: 250
Network learning rate: $2.5 \times 10^{-4}$
L2 regularization coef.: $10^{-5}$
TRR coef.: 0.01
RL discount factor $\gamma$: 0.99
Exploration $\varepsilon_{\max}$: 0.95
Exploration $\varepsilon_{\min}$: 0.1
Proportional PER $\alpha$: 0.6
Proportional PER $\beta$: 0.4
5. Experiments
The proposed model was tested in the StarCraft Multi-Agent Challenge (SMAC) domain [Samvelyan et al. 2019], a collection of maps for the StarCraft II Learning Environment focused on multi-agent tasks. In the maps, $n$ units from the player team are individually controlled in order to achieve victory in a battle scenario against $m$ units from the adversary team. Each unit belongs to one of multiple classes, which may be described by different state variables and have different action sets and optimal policies. In each of the maps, each node class possesses between 4 and 6 features. Agent classes have 4 movement actions, $m$ attack actions and 1 no-op action for incapacitated units (all discrete). Since units have different behavior (movement speed, attack range), it is expected that learning different policies for each unit type will be beneficial for the player team.
In all tests, each network $\phi$ in the encoding layer was an MLP with two hidden layers of 128 neurons and an output encoding of 64 values. The communication module was composed of 4 relational layers, with the first layer having an input vector of 64 values, the last layer having an output vector of 64 values, and all hidden connections being composed of vectors of 128 values. The relational module was tested against an attentional communication module composed of graph attention layers [Veličković et al. 2018]. The attention layers worked with 4 attention heads, whose outputs were concatenated at the end of each layer. Finally, the $Q$ networks for agent classes were MLPs with 64 values in the input layer, two hidden layers with 128 neurons and an output vector size equal to the number of actions of each agent class. For all modules, the sigmoid nonlinearity was used, as well as the Adam optimizer. Hyperparameters are provided in table 1.
Additional experiments were performed to evaluate the performance of using the full receptive field (FRF) as the final agent observations; giving the agents the ability to communicate by creating arcs between them, regardless of distance (full agent communication, FAC); and using temporal relation regularization (TRR, [Jiang et al. 2020]) in the attentional model against which our proposal was tested.
Experiments were performed in a mixed hardware environment: a computer equipped with an Nvidia GTX 1070 and a server equipped with an Nvidia V100. Each run took an average of 70 hours to complete.
Table 2. Results of applying HMAGQ-Net on the 2s3z map of the SMAC domain under different configurations. FRF = full receptive field. FAC = full agent communication. TRR = temporal relation regularization. "Mean n. steps" and "Mean reward" are each reported over all episodes and over the last 10% of episodes.
Comms module, FRF, FAC, TRR — Mean n. steps (All, Last 10%) — Mean reward (All, Last 10%)
RGCN X X 77.66 82.18 4.69 4.79
RGCN X 78.65 83.62 3.82 3.77
RGCN 76.85 81.93 4.24 4.30
GAT X X X 70.63 75.10 3.85 3.86
GAT X X 77.20 82.13 3.53 3.47
GAT X X 79.41 84.59 3.97 3.99
GAT 77.21 82.29 3.98 3.99
Random baseline 52.155 2.222
6. Results
Table 2 displays the results of the different trained models. Two values were taken as
measures of performance for the agent team: final episode reward and number of steps
the agent team remained alive. In the SMAC environments, agents with larger rewards
were able to deal more damage to the opponent teams, while longer episodes indicate
agents that were able to survive for longer.
When tested against a random baseline (a group of agents that only take random actions), all the trained models had superior performance in both measures. The two models that accumulated the highest average reward per episode employed RGCN layers, while the model that remained alive for the largest number of steps was a GAT model.
Overall, networks that employed full-agent communication (FAC) achieved a higher reward than networks that did not. However, no clear performance gain was observed for agents that received the full receptive field (FRF) of the communication layers as their observations or that used TRR in their loss function. This may be due to the fact that these techniques were first proposed to solve the simpler predator-prey environment, with homogeneous agents [Jiang et al. 2020].
Figure 3 presents the same measures throughout the training. Since the measures
were noisy, we employed exponential smoothing to better visualize the trend lines. It can
be seen from the top graph that all models tended to converge to the same number of
steps alive, while achieving different final rewards. It can also be seen that the two RGCN
models that accumulated the most reward dominated all other models in that measure
through most of the training time.
7. Conclusion
This work presented the Heterogeneous Multi-agent Graph Q-Network, a neural network
architecture that processes environment states represented as directed labeled graphs and
employs relational graph convolution layers to achieve specialized communication be-
tween agents of heterogeneous classes, as well as multiple encoding networks to normal-
ize entity representation and multiple action networks to learn individual policies for each
agent class.
Figure 3. Number of steps (top) and average reward collected by each agent
(bottom), per episode, for a total of 1 million training steps.
Results have shown that specializing the communication channels between entity
classes is a promising step to achieve higher performance in environments composed
of heterogeneous entities. In future work, we intend to test HMAGQ-Net on multiple environments with different numbers of agents and agent classes; isolate the contribution of learning policies for agent classes by testing variants that learn a single policy for all agents and individual policies for each agent; and propose an action module trained via policy gradient.
Acknowledgments
The authors acknowledge the São Paulo Research Foundation (FAPESP Grant 2019/07665-4) for supporting this project. This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.
References
[Agarwal 2019] Agarwal, A. (2019). Learning Transferable Cooperative Behavior in Multi-Agent Teams. Master's Thesis, Carnegie Mellon University, Pittsburgh, USA.
[Agarwal et al. 2019] Agarwal, A., Kumar, S., and Sycara, K. (2019). Learning Transferable
Cooperative Behavior in Multi-Agent Teams. In ICML 2019 Workshop on Learning
and Reasoning with Graph-Structured Representations.
[Battaglia et al. 2018] Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A.,
Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R.,
Gulcehre, C., Song, F., Ballard, A., Gilmer, J., Dahl, G., Vaswani, A., Allen, K., Nash,
C., Langston, V., Dyer, C., Heess, N., Wierstra, D., Kohli, P., Botvinick, M., Vinyals,
O., Li, Y., and Pascanu, R. (2018). Relational inductive biases, deep learning, and
graph networks. arXiv e-prints.
[Bondy and Murty 2008] Bondy, J. A. and Murty, U. S. R. (2008). Graph Theory. Springer
London.
[Bowling and Veloso 2000] Bowling, M. and Veloso, M. (2000). An Analysis of Stochastic Game Theory for Multiagent Reinforcement Learning. Technical report, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.
[Busoniu et al. 2008] Busoniu, L., Babuska, R., and Schutter, B. D. (2008). A Comprehen-
sive Survey of Multiagent Reinforcement Learning. IEEE Transactions on Systems,
Man, and Cybernetics, Part C (Applications and Reviews), 38(2):156–172.
[da Silva et al. 2019] da Silva, F. L., Glatt, R., and Costa, A. H. R. (2019). MOO-MDP: An
Object-Oriented Representation for Cooperative Multiagent Reinforcement Learning.
IEEE Transactions on Cybernetics, 49.
[Das et al. 2019] Das, A., Gervet, T., Romoff, J., Batra, D., Parikh, D., Rabbat, M., and
Pineau, J. (2019). TarMAC: Targeted Multi-Agent Communication. Proceedings of
the 36th International Conference on Machine Learning, 97:1538–1546.
[Duvenaud et al. 2015] Duvenaud, D., Maclaurin, D., Aguilera-Iparraguirre, J., Gómez-Bombarelli, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R. P. (2015). Convolutional Networks on Graphs for Learning Molecular Fingerprints.
[Gilmer et al. 2017] Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E.
(2017). Neural Message Passing for Quantum Chemistry. arXiv:1704.01212 [cs].
[Gori et al. 2005] Gori, M., Monfardini, G., and Scarselli, F. (2005). A new model for
learning in graph domains. In Proceedings of the International Joint Conference on
Neural Networks, volume 2, pages 729–734. IEEE.
[Guestrin et al. 2003] Guestrin, C., Koller, D., Gearhart, C., and Kanodia, N. (2003). Gen-
eralizing Plans to New Environments in Relational MDPs. In Proceedings of the 18th
International Joint Conference on Artificial Intelligence, IJCAI’03, pages 1003–1010,
San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
[Jiang et al. 2020] Jiang, J., Dun, C., Huang, T., and Lu, Z. (2020). Graph Convolutional
Reinforcement Learning. In International Conference on Learning Representations.
[Jiang and Lu 2018] Jiang, J. and Lu, Z. (2018). Learning Attentional Communication for
Multi-Agent Cooperation. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K.,
Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing
Systems 31, pages 7254–7264. Curran Associates, Inc.
[Kipf and Welling 2017] Kipf, T. N. and Welling, M. (2017). Semi-Supervised Classifica-
tion with Graph Convolutional Networks. In 5th International Conference on Learning
Representations, ICLR 2017 - Conference Track Proceedings. International Confer-
ence on Learning Representations, ICLR.
[Littman 1994] Littman, M. L. (1994). Markov Games as a Framework for Multi-Agent
Reinforcement Learning. In Proceedings of the Eleventh International Conference on
Machine Learning, volume 157, pages 157–163.
[Malysheva et al. 2019] Malysheva, A., Kudenko, D., and Shpilman, A. (2019). MAGNet:
Multi-agent Graph Network for Deep Multi-agent Reinforcement Learning. In Adap-
tive and Learning Agents Workshop at AAMAS (ALA 2019), Montreal, Canada.
[Peng et al. 2017] Peng, P., Wen, Y., Yang, Y., Yuan, Q., Tang, Z., Long, H., and Wang,
J. (2017). Multiagent Bidirectionally-Coordinated Nets: Emergence of Human-Level
Coordination in Learning to Play StarCraft Combat Games.
[Samvelyan et al. 2019] Samvelyan, M., Rashid, T., de Witt, C. S., Farquhar, G., Nardelli,
N., Rudner, T. G. J., Hung, C.-M., Torr, P. H. S., Foerster, J., and Whiteson, S. (2019).
The StarCraft Multi-Agent Challenge. arXiv:1902.04043 [cs, stat].
[Scarselli et al. 2009a] Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfar-
dini, G. (2009a). Computational capabilities of graph neural networks. IEEE Transac-
tions on Neural Networks, 20(1):81–102.
[Scarselli et al. 2009b] Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfar-
dini, G. (2009b). The Graph Neural Network Model. IEEE Transactions on Neural
Networks, 20(1):61–80.
[Schaul et al. 2015] Schaul, T., Horgan, D., Gregor, K., and Silver, D. (2015). Universal
Value Function Approximators. In International Conference on Machine Learning,
pages 1312–1320.
[Schlichtkrull et al. 2018] Schlichtkrull, M., Kipf, T. N., Bloem, P., van den Berg, R., Titov,
I., and Welling, M. (2018). Modeling relational data with graph convolutional net-
works. In European Semantic Web Conference, pages 593–607. Springer.
[Sukhbaatar et al. 2016] Sukhbaatar, S., Szlam, A., and Fergus, R. (2016). Learning Multi-
agent Communication with Backpropagation. In Lee, D. D., Sugiyama, M., Luxburg,
U. V., Guyon, I., and Garnett, R., editors, Advances in Neural Information Processing
Systems 29, pages 2244–2252. Curran Associates, Inc.
[Veličković et al. 2018] Veličković, P., Casanova, A., Liò, P., Cucurull, G., Romero, A., and Bengio, Y. (2018). Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings. International Conference on Learning Representations, ICLR.
[Wang et al. 2018a] Wang, D., Duan, Y., and Weng, J. (2018a). Motivated Optimal Devel-
opmental Learning for Sequential Tasks Without Using Rigid Time-Discounts. IEEE
Transactions on Neural Networks and Learning Systems, 29.
[Wang et al. 2018b] Wang, T., Liao, R., Ba, J., and Fidler, S. (2018b). Nervenet: Learn-
ing structured policy with graph neural networks. In 6th International Conference on
Learning Representations, ICLR 2018 - Conference Track Proceedings. International
Conference on Learning Representations, ICLR.
[Wasser et al. 2008] Wasser, C. G. D., Cohen, A., and Littman, M. L. (2008). An Object-
Oriented Representation for Efficient Reinforcement Learning. In Proceedings of the
25th International Conference on Machine Learning, pages 240–247. ACM.