
Mastering Cooperative Driving Strategy in Complex Scenarios using Multi-Agent Reinforcement Learning

Qingyi Liang1,2, Zhengmin Jiang2,3, Jianwen Yin2, Kun Xu2, Zhongming Pan2, Shaobo Dang2, Jia Liu2

1Southern University of Science and Technology, Shenzhen, 518055, China (12233254@mail.sustech.edu.cn)
2CAS Key Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China (qy.liang1@siat.ac.cn, zm.jiang@siat.ac.cn, jw.yin@siat.ac.cn, kun.xu@siat.ac.cn, zm.pan@siat.ac.cn, shaobo.dang@siat.ac.cn, jia.liu1@siat.ac.cn)
3University of Chinese Academy of Sciences, Beijing, 100049, China
Correspondence: jia.liu1@siat.ac.cn
Abstract— With the advent of machine learning, several autonomous driving tasks have become easier to accomplish. Nonetheless, the proliferation of autonomous vehicles in urban traffic scenarios has precipitated other challenges, such as cooperative driving. Multi-Agent Reinforcement Learning (MARL) approaches have emerged as a promising solution, as they can successfully address the challenge by simulating the interactive relationship among autonomous vehicles (AVs). In this paper, we leverage Social Value Orientation to depict the behavioral tendencies of AVs, thereby enhancing the performance of MARL approaches. We also incorporate different scale features in the policy network to strengthen its representation ability. Moreover, effective reward functions are designed based on traffic efficiency, comfort, safety, and strategy. Finally, we validate our approach in an open-source autonomous driving simulator. Simulation results indicate that our proposed approach outperforms IPPO and MAPPO algorithms in terms of success rate, route completion rate, crash rate, and other metrics.
I. INTRODUCTION
Autonomous driving technology has garnered significant
attention in recent years, and has improved traffic con-
ditions in terms of safety, convenience, and intelligence.
Presently, practical applications of autonomous driving are
observed in dock transportation and express delivery. The
autonomous system typically comprises three main com-
ponents for generating actions: (1) perception - gathering
information about the surrounding environment, including
roads, vehicles, and pedestrians, using cameras and LIDAR,
as well as other technologies; (2) planning - utilizing the
acquired information to generate the vehicle’s next high-level
command; and (3) execution - transforming the command
generated in the previous step into control signals, such
as throttle and steering of the vehicle[1]. While supervised
learning has exhibited excellent performance in tasks such
as computer vision and plays a crucial role in the perception
part of autonomous driving, it is less applicable to the
decision-making process for various driving tasks, such as
lane keeping, overtaking, intersections, etc., as the vehicle
needs to make timely predictions and decisions regarding
the dynamic environment. Therefore, other solutions such as
RL should be explored.
RL necessitates a detailed and explicit reward function,
after which the agent continuously interacts with the environ-
ment and acquires experience by trial and error to eventually
form a policy. The process of RL can be outlined as follows:
first, the agent perceives the state in the environment; then,
the agent takes a certain action after computation; and the
environment provides a reward based on the current state and
the action taken by the agent, repeating the above steps until
reaching the goal. Consequently, RL becomes a sequential
decision problem and usually involves multiple rounds of
decisions, each of which necessitates the agent to respond to
the new environment and select an optimal action, enabling
it to handle different scenarios. RL is divided into single-agent RL and multi-agent RL (MARL). In a single-agent environment, the agent only needs to interact with the environment and maximize its own benefit, whereas in MARL, each agent needs to interact not only with the environment but also with other agents while maintaining its own reward function. This makes the multi-agent environment more
unstable and complicated, as the relationship among agents
can be classified as cooperative, competitive, or mixed. There
have been several important breakthroughs in the study of
MARL in chess games and other games [2][3][4][5]. It is also
expected to have applications in smart cities, autonomous
driving, and industrial production. Furthermore, combining
the study of MARL with human social group behavior can
help model predictions and analyze the behavior of a certain
cluster, thus enabling the evaluation and optimization of
decisions.
Owing to the high cost of trial and error, present MARL
experiments are predominantly conducted on simulators,
and there are still some issues that require resolution: (1)
generalization is not strong, as most studies concentrate on
a single independent scenario, and the coping ability for
other scenarios or some special scenarios is insufficient. For
example, in the case of an extremely congested intersection,
selecting the appropriate route to avoid congestion is a chal-
lenge. (2) Due to the low sample efficiency of RL, the lack of direct supervision signals, and the resulting unsuitability of deep networks, the network structure in DRL is relatively simple, consisting mostly of shallow MLPs or CNNs. The representation ability of shallow networks is weaker, which can lead to slow updates or convergence to a local optimum because crucial information is missing when the network is updated. In this paper,
we propose novel solutions to address the aforementioned
challenges, and the research objectives and contributions of
this paper are outlined as follows:
Firstly, to enhance the extracted features, we concate-
nate features at different scales to minimize information
loss while simultaneously improving the network rep-
resentation.
Secondly, we design comprehensive and detailed reward
functions that consider travel efficiency, safety, comfort,
and driving strategies. These reward functions are in-
tended to verify the generalization ability of the model
to different scenarios and its strategic capabilities for
dealing with special scenarios.
Finally, we construct complex training scenarios and
validate the effectiveness of the network structure and
its generalization ability in three distinct scenarios.
II. RELATED WORK
The MARL environment poses several challenges, such as dimensional explosion, non-stationarity, and credit assignment, which make the direct application of single-agent algorithms infeasible. To address this issue, one approach is to combine
the value function of each agent into the joint action value
function, which is the sum of the local action value function
of each agent[11]. To deal with the nonlinearity problem
of the joint action value function, another improvement is
proposed, which ensures consistent monotonicity between
the local action value function and the joint action value
function, optimizing the global action value function as well
as the individual’s action value function[12]. Additionally, a
more robust decomposition method is introduced for the joint
action value function by mapping it to a new value function,
which aims to scale the true action value function, making
the value function approximated by the network closer to the
true joint action value function[13]. To address the challenge
posed by dimensional explosion, a centralized training and
decentralized execution approach is proposed, which allows
access to additional information during training and only
local information during execution and has proven effective
in cooperative, competitive, and mixed interactions[14].
Research has been conducted on the coordination be-
tween an autonomous vehicle (AV) and a human-driven
vehicle (HV). The AV collects information about the HV,
correlates HV’s position to AV’s reward function, and can
force HV to move left, right, slow down, and prioritize
through intersections[6]. Social Value Orientation (SVO) is introduced to measure the HV's behavior and assess its altruistic, pro-social, egoistic, or competitive preference. The AV predicts the HV's trajectory and chooses whether to merge, turn left, etc., based on the SVO value. The results verify the validity of SVO in measuring social preferences and the improvement in correctly predicting HVs' trajectories[7].
While the aforementioned studies focus on a single AV and a single HV, settings with multiple AVs and HVs have also been studied using a multichannel VelocityMap with different time sequences that encodes AVs, HVs, self-attention, mission vehicles, and the road layout, together with a 3D convolutional network to handle the temporal dependency. The results demonstrate that adjusting the level of altruism in an AV can affect other vehicles[8]. Researchers argue that human driving behavior
is a series of continuous skills that expand over time. Thus,
a task-independent and ego-centric vehicle is constructed,
first encoding the collected vehicle trajectory information as
latent skills and then training the vehicle with RL method to
learn skills instead of actions in the latent space. The skill
output is then decoded as control information, enabling the
vehicle to cope with different scenarios[9]. Modifying the
network architecture is a viable option for improving per-
formance. Features in the Soft Actor-Critic (SAC) network
are reused and connected to avoid the problems of gradient
disappearance and explosion. The results indicate that this
approach outperforms the standard SAC[23]. SVO is further
extended to all agents in a MARL environment, where each
agent focuses not only on its own reward but also on local
and global rewards. Experiments demonstrate that the agents
exhibit socially acceptable behavior, improving the safety
and efficiency of the whole group across five scenarios, such
as intersections and roundabouts[10].
Other studies focus on the distribution of credit for MARL
cooperation. A counterfactual baseline is used in [20] to
determine how much an agent contributes to the collective.
For each action of an agent, a “default action” is used
instead, and the difference in global rewards due to these
two actions is calculated, evaluating the contribution of the
agent’s current action to the system. In [19], the problem of
total reward redistribution in MARL systems is discussed,
and it is found that using more efficient value distribution
can promote stable cooperation among agents. The Shapley
Value is proposed to approximate the contribution of each
agent to the system to distribute cooperation benefits fairly.
The results verify the feasibility of this method, providing
new ideas for MARL credit assignment tasks.
III. PRELIMINARY
A. Dec-POMDPs
We describe the decentralized partially observable Markov
decision process in MARL as a tuple $\langle S, A_i, P, r_i, \gamma \rangle$ for agents $i = 1, \cdots, N$, where $S$ is the state space that includes all the possible states that agents can adopt; $A_i$ is the finite set consisting of the actions that agent $i$ can take, $a_i \in A_i$; $P$ is the transition function with a range of $[0,1]$, defined as $P: S \times A_1 \times \cdots \times A_N \rightarrow S$; $r_i$ is the reward of agent $i$, denoted as $r_i: S \times A_i \rightarrow \mathbb{R}$; and $\gamma \in [0,1]$ is the discount factor that balances the instantaneous and long-term returns. The total reward of the environment from the beginning moment $t_0$ to the end of the interaction at moment $T$ can be expressed as
$$R_t = \sum_{t=t_0}^{T} \gamma^{t-t_0} r_t.$$
A policy can be expressed as a state-to-action mapping $\pi: S \rightarrow A$. Whether the action taken in a state is good is evaluated by the action value function $Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$, and the state value function used to judge the strategy under that state is $V^{\pi}(s) = \mathbb{E}[R_t \mid s_t = s, \pi]$. The advantage function is $A_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.
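For illustration, the following minimal Python sketch computes the discounted return $R_t$ and the one-step advantage $A_t$ from a recorded trajectory using the definitions above; it is not part of the original implementation, and the bootstrap value of the terminal state is assumed to be zero.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """R_t = sum_k gamma^k * r_{t+k}, computed backwards over one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def one_step_advantage(rewards, values, gamma=0.99):
    """A_t = r_t + gamma * V(s_{t+1}) - V(s_t); V at the terminal state is taken as 0."""
    values_next = np.append(values[1:], 0.0)
    return rewards + gamma * values_next - values

# Example: a short trajectory with critic estimates for each visited state.
rewards = np.array([0.5, 0.2, -1.0, 2.0])
values = np.array([1.0, 0.8, 0.6, 1.5])
print(discounted_return(rewards))
print(one_step_advantage(rewards, values))
```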
B. Social Value Orientation(SVO)
Social psychology acknowledges that individuals are com-
plex decision-makers, as they not only consider benefits
to themselves but also benefits of others, as well as how
others perceive their behaviors[15]. In complex scenarios,
individuals’ motivations are diverse, and there are individual
differences and variations. Social Value Orientations (SVO)
provide a measure of the importance individuals attach to
different interests based on their attitudes toward individual
benefits and the benefits of others. In [16], different individu-
als are classified into four types: Individualism, Competition,
Cooperation, and Altruism, as shown in Fig. 1.
Fig. 1: Graphical representation of Social Value Orientation framework.
Integrating SVO into MARL is a desirable approach. Our
objective is to enable agents to learn driving strategies and
have authentic and benign interactions with humans. SVO
provides a good measure for optimizing the interactions
among agents and obtaining information about agents’ SVO
can help find solutions to improve traffic efficiency. Inspired
by SVO and the reward function design in [10][18] which
integrates individual rewards and rewards of others, we use
neighborhood rewards and global rewards to balance self-
interests and other’s interests. Since an agent only interacts
with agents in the surrounding neighborhood within finite
time steps, we assign a higher attention weight to the nearby
agents. We define the average reward of nearby agents within
d=40 meters as the neighborhood rewards, as follows:
$$r^K_{i,t} = \frac{\sum_{j \in K} r_{j,t}}{|K|}, \quad K = \{\, j : \| \mathrm{pos}(i) - \mathrm{pos}(j) \| \le d \,\} \tag{1}$$
where $|K|$ is the number of neighborhood agents. For global rewards, we have:
$$r^N_{i,t} = \frac{\sum_{j=1}^{N} r_{j,t}}{N} \tag{2}$$
wherein $N$ is the total number of agents in the environment.
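A minimal sketch of Eqs. (1) and (2) is given below; it is an illustration rather than the authors' implementation. The 40 m radius follows the text, while the array layout of positions and rewards, and the inclusion of the ego vehicle in $K$ (its distance to itself is zero), are assumptions.

```python
import numpy as np

def svo_rewards(positions, rewards, i, d=40.0):
    """Neighborhood reward (Eq. 1) and global reward (Eq. 2) for agent i.

    positions: (N, 2) array of agent positions in meters.
    rewards:   (N,)  array of individual rewards r_{j,t}.
    """
    dists = np.linalg.norm(positions - positions[i], axis=1)
    neighbors = np.where(dists <= d)[0]          # K = {j : ||pos(i) - pos(j)|| <= d}
    r_neighborhood = rewards[neighbors].mean()   # average over the |K| nearby agents
    r_global = rewards.mean()                    # average over all N agents
    return r_neighborhood, r_global

# Example with 4 agents; only the first two are within 40 m of agent 0.
pos = np.array([[0.0, 0.0], [10.0, 5.0], [100.0, 0.0], [30.0, -20.0]])
r = np.array([1.0, 0.5, -0.2, 0.8])
print(svo_rewards(pos, r, i=0))
```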
IV. METHODOLOGY
To evaluate our approach, we conduct experiments on
the driving simulator MetaDrive, which covers single-agent,
safe exploration, and multi-agent traffic scenarios[17]. The
simulator supports user-customized maps and can import
real scenario data. Furthermore, the platform provides several
standard RL environment interfaces, enabling researchers to
swiftly build scenarios to validate their algorithms.
A. Network structure
Due to the inefficiency of MARL sampling, the available
sample data is limited, and the deep network structure is
unsuitable for the task of this paper. Therefore, we utilize a
shallow fully-connected layer network to update individual
rewards, neighborhood rewards, and global rewards sepa-
rately. The input to the three networks is Oi, and the output is
a one-dimensional value. As for the policy network, we use a
fully connected network equipped with skip-connection. The
skip-connection connects the original input Oito the input of
each layer of the network, minimizing information loss and
improving the effectiveness of the network representation by
connecting features of different scales. The policy network
architecture is shown in Fig. 2.
Fig. 2: Architecture of the proposed policy network.
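A minimal PyTorch sketch of such a skip-connected policy network is shown below; it reflects our reading of Fig. 2 rather than the authors' code, and the hidden sizes, activation, observation dimension, and action dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SkipConnectedPolicy(nn.Module):
    """MLP policy in which every hidden layer also sees the raw observation O_i."""

    def __init__(self, obs_dim, act_dim, hidden_sizes=(256, 256)):
        super().__init__()
        self.layers = nn.ModuleList()
        in_dim = obs_dim
        for h in hidden_sizes:
            self.layers.append(nn.Linear(in_dim, h))
            in_dim = h + obs_dim  # next layer takes [hidden features, O_i]
        self.head = nn.Linear(in_dim, act_dim)

    def forward(self, obs):
        x = obs
        for layer in self.layers:
            x = torch.tanh(layer(x))
            x = torch.cat([x, obs], dim=-1)  # skip-connection from the original input
        return self.head(x)  # action mean or logits, depending on the action space

# Example: illustrative 91-dimensional observation and 2-dimensional continuous action.
policy = SkipConnectedPolicy(obs_dim=91, act_dim=2)
actions = policy(torch.randn(8, 91))
print(actions.shape)  # torch.Size([8, 2])
```

Concatenating the raw observation at every layer keeps multi-scale features available to the output head, which is the stated motivation for the skip-connections.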
We denote the loss function of the individual value function network as follows:
$$L_i(\phi) = \mathbb{E}_{s_t}\Big[\max\Big\{\big(V_{i,\phi}(s_t) - \hat{V}_{i,t}\big)^2,\ \big(V_{i,\phi_{old}}(s_t) + \mathrm{clip}\big(V_{i,\phi}(s_t) - V_{i,\phi_{old}}(s_t), -\epsilon, +\epsilon\big) - \hat{V}_{i,t}\big)^2\Big\}\Big] \tag{3}$$
where $\phi_{old}$ is the old parameter before the network updates and $\hat{V}_{i,t}$ is the target value function. Similarly, we can derive the loss function of the neighborhood network as follows:
$$L_n(\phi) = \mathbb{E}_{s^K_t}\Big[\max\Big\{\big(V_{n,\phi}(s^K_t) - \hat{V}_{n,t}\big)^2,\ \big(V_{n,\phi_{old}}(s^K_t) + \mathrm{clip}\big(V_{n,\phi}(s^K_t) - V_{n,\phi_{old}}(s^K_t), -\epsilon, +\epsilon\big) - \hat{V}_{n,t}\big)^2\Big\}\Big] \tag{4}$$
where $s^K_t \in S_1 \times \cdots \times S_K$ is the state space composed of the neighboring agents, and the global network's loss is:
$$L_g(\phi) = \mathbb{E}_{s^N_t}\Big[\max\Big\{\big(V_{g,\phi}(s^N_t) - \hat{V}_{g,t}\big)^2,\ \big(V_{g,\phi_{old}}(s^N_t) + \mathrm{clip}\big(V_{g,\phi}(s^N_t) - V_{g,\phi_{old}}(s^N_t), -\epsilon, +\epsilon\big) - \hat{V}_{g,t}\big)^2\Big\}\Big] \tag{5}$$
here $s^N_t \in S_1 \times \cdots \times S_N$ is the whole state space. We express the objective of updating the policy with the advantage $A_i$ in this way:
$$J_{i,\theta} = \mathbb{E}_{\pi_\theta}\big[\min\big(\rho A_i,\ \mathrm{clip}(\rho, 1-\epsilon, 1+\epsilon) A_i\big)\big] \tag{6}$$
here we use $\rho$, the ratio between the new and old policies, as a regularization term to ensure that the difference between the old and new parameters is not significant. Regarding the actions taken by each agent, we
utilize KL divergence and entropy as constraints to limit them:
$$J_{\theta, a_{i,t}} = \mathbb{E}\big[\pi_\theta(a_{i,t} \mid s_{i,t}) \log \pi_\theta(a_{i,t} \mid s_{i,t}) + D_{KL}\big(\pi_\theta(a_{i,t} \mid s_{i,t}) \,\|\, \pi_\theta(a_{i,t-1} \mid s_{i,t-1})\big)\big] \tag{7}$$
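For clarity, the sketch below implements generic PPO-style clipped value and policy losses matching the structure of Eqs. (3)-(6); it is a simplified paraphrase rather than the exact training code, and the clipping range and tensor shapes are assumptions.

```python
import torch

def clipped_value_loss(v_new, v_old, v_target, eps=0.2):
    """Eqs. (3)-(5): maximum of the unclipped and clipped squared value errors."""
    v_clipped = v_old + torch.clamp(v_new - v_old, -eps, eps)
    loss_unclipped = (v_new - v_target) ** 2
    loss_clipped = (v_clipped - v_target) ** 2
    return torch.max(loss_unclipped, loss_clipped).mean()

def clipped_policy_objective(logp_new, logp_old, advantage, eps=0.2):
    """Eq. (6): minimum of the ratio-weighted and clipped-ratio-weighted advantages."""
    ratio = torch.exp(logp_new - logp_old)  # rho = pi_new / pi_old
    surr_unclipped = ratio * advantage
    surr_clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return torch.min(surr_unclipped, surr_clipped).mean()

# Example usage with dummy batch tensors.
v_new, v_old, v_target = torch.randn(32), torch.randn(32), torch.randn(32)
logp_new, logp_old, adv = torch.randn(32), torch.randn(32), torch.randn(32)
print(clipped_value_loss(v_new, v_old, v_target))
print(clipped_policy_objective(logp_new, logp_old, adv))
```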
B. Reward function
Constructing a comprehensive and appropriate reward
function is crucial for enabling the agent to learn strategies
quickly to handle various road scenarios. In addition to
encouraging the agent to drive forward, we also consider
other factors such as safety, comfort, and lane change.
$$r_t = c_1 \cdot R_{driving} + c_2 \cdot R_{safety} + c_3 \cdot R_{comfort} + c_4 \cdot R_{change\_lane} + c_5 \cdot R_{end} \tag{8}$$
Each term is described below, followed by an illustrative composition sketch.
Driving reward Rdriving
Our approach encourages the agent to drive forward,
actively explore the environment, and obtain the corre-
sponding rewards. We also aim for the agent to drive
as quickly as possible to improve efficiency in passing
various road sections and reach the destination.
Safety reward Rsafety
To prevent collisions, we incorporate LIDAR informa-
tion into the reward function. Specifically, if any vehicle
is detected within a certain distance directly in front
of the agent’s current lane, the agent slows down. In
merge scenarios, we detect front vehicles in all lanes in
front of the agent. If a vehicle traveling in a different
direction from the agent (such as making a cut-in action,
etc.) is detected, the agent slows down. This approach
effectively avoids unnecessary collisions.
Comfort reward Rcomfort
Driving comfort is a crucial factor in real-world driving,
and we incorporate it into the reward function. Specif-
ically, we define the comfort reward as follows: when
the current reference route is straight, the yaw angle
should be as small as possible to reduce unnecessary
oscillation.
Change lane reward Rchange lane
We have observed that when congestion occurs in the
local area in front of the agent, such as a lane or part
of the lanes being occupied, the agent tends to slow
down and wait instead of switching to a free lane. This
behavior can lead to increased congestion, collisions,
and reduced traffic efficiency. Therefore, we introduce
a change lane reward. Inspired by [24] of using the
position and speed of the front vehicle to infer the
curvature of the road during congestion, we believe that
the state of the surrounding vehicles can represent the
state of the road to some extent, so we design change
lane reward as follows: when the average speed of
nearby vehicles and the speed of the ego-vehicle are
lower than a certain threshold (indicating congestion and stopped vehicles), we encourage the agent to switch to
another lane to avoid prolonged congestion and improve
traffic flow efficiency.
End reward Rend
Upon successful arrival at the destination, a reward of
10 is added. However, if the vehicle runs off the road
or a collision occurs, a penalty of 8 is imposed.
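The sketch below illustrates how such a composite reward could be assembled following Eq. (8); the weights, thresholds, and observation fields are illustrative assumptions rather than the tuned values used in the experiments.

```python
def composite_reward(obs, c=(1.0, 1.0, 0.5, 0.5, 1.0),
                     speed_threshold=2.0, safe_distance=15.0):
    """Weighted sum of the five reward terms in Eq. (8), with made-up weights.

    obs is assumed to provide: longitudinal progress, ego speed, distance to the
    closest front vehicle, yaw angle, mean neighbor speed, and terminal flags.
    """
    r_driving = obs["progress"] + 0.1 * obs["speed"]          # move forward, drive fast
    r_safety = -1.0 if obs["front_distance"] < safe_distance else 0.0
    r_comfort = -abs(obs["yaw_angle"]) if obs["route_is_straight"] else 0.0
    congested = (obs["mean_neighbor_speed"] < speed_threshold
                 and obs["speed"] < speed_threshold)
    r_change_lane = 1.0 if (congested and obs["changed_lane"]) else 0.0
    if obs["arrived"]:
        r_end = 10.0
    elif obs["crashed"] or obs["out_of_road"]:
        r_end = -8.0
    else:
        r_end = 0.0
    terms = (r_driving, r_safety, r_comfort, r_change_lane, r_end)
    return sum(ci * ri for ci, ri in zip(c, terms))
```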
C. Building scenarios
We aim to construct a new scenario that includes a range
of common traffic scenarios such as straight lanes, curves,
intersections, roundabouts, T-intersections, ramps, etc. These
scenarios are combined in a specific sequence to form a com-
plex road that simulates various real-world scenarios without
traffic lights. Our objective is for agents to successfully pass
these scenarios, and even acquire better driving skills and
strategies than human drivers. To assess the generalization
performance of the algorithm, we design three different test
scenarios, which are illustrated in Fig. 3.
Fig. 3: The map on the top left is the training scenario; the rest are the test scenarios.
V. EXPERIMENT
A. Experiment settings
We conducted our experiments on the MetaDrive
simulator[17], where we specified the random positions of
vehicles in four lanes on the road and maintained a real-
time agent count of 40 to ensure a certain traffic density. The
start and end points of each vehicle were chosen randomly,
and we considered the route successfully completed if the
vehicle followed the given reference route to the destination
without crashing or veering off the road. Otherwise, the reasons
for failure were classified as crash, out, and arrived max
step, while the route completion rate was calculated. To
obtain sensing information, we mainly used LIDAR, with
a neighborhood range of 40m and a safety distance of
15m to prevent collisions. Our experiments were conducted
on a computer equipped with an Intel i7-12700×20 core
processor and an NVIDIA GeForce RTX 3080 GPU, running
MetaDrive 0.2.5, python 3.7, PyTorch 1.13, and CUDA 10.2.
The training period lasted for 2.5M environment steps with
1563 iterations, a learning rate of 3e-4, a training batch size
of 1024, γ=0.99, and λ=0.95 for GAE.
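For reference, the hyperparameters listed above can be gathered into a single configuration; the sketch below restates them, with key names of our own choosing (actual MetaDrive/RLlib option names may differ).

```python
TRAINING_CONFIG = {
    "total_env_steps": 2_500_000,   # 2.5M environment steps (~1563 iterations)
    "learning_rate": 3e-4,
    "train_batch_size": 1024,
    "gamma": 0.99,                  # discount factor
    "gae_lambda": 0.95,             # lambda for GAE
    "num_agents": 40,               # real-time agent count on the map
    "num_lanes": 4,
    "neighborhood_radius_m": 40.0,  # range used for the neighborhood rewards
    "safety_distance_m": 15.0,      # braking distance used in the safety reward
}
```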
B. Results
We compared the proposed algorithm to the baseline
algorithms IPPO[21] and MAPPO[22]. We used 4 start seeds for all algorithms, and each training trial consisted of 2.5M steps. We evaluated the algorithms using five metrics: success rate, route completion rate, crash rate, out rate, and reward. Additionally, to assess the impact of concatenating network inputs of different scales, we conducted experiments comparing the network before and after the proposed changes.

Fig. 4: (a)-(c) Learning curves of success rate, route completion rate and reward for the three training methods: IPPO, MAPPO and our method. (d) Performance of the network before and after the proposed changes; we used the original CoPO[10] algorithm with the improved reward function described in Section IV for comparison. (e) Effect of the number of agents on the route completion rate.
TABLE I: Comparison of different algorithms
Metrics IPPO MAPPO Ours
Success(%) 63.37±3.56 20.07±5.21 82.01±3.10
Route completion(%) 76.66±2.45 38.70±8.90 87.99±2.37
Crash(%) 20.01±0.95 22.14±3.17 14.69±0.71
Out(%) 40.47±5.72 58.11±4.01 19.18±4.71
Reward 1162.67±56.68 338.88±142.64 1217.62±64.90
As shown in Fig. 4a-4c, our method and IPPO have
similar performance in the early training period. However,
our method eventually converges to a higher performance
level with a stable success rate of approximately 0.8, which
is nearly 20% higher than IPPO, and 15% higher in the route
completion rate. Although the agents trained by our method
take longer to obtain rewards than those of IPPO, the final
reward values of the two are similar, suggesting that our
agents consider additional metrics besides maximizing the
reward function when updating their strategy. Notably, after
0.5M training steps, our agents achieve higher success rate
and route completion rate despite receiving fewer rewards
than IPPO, indicating that they learn a more effective strategy
to complete the task. Table I presents the results of the three algorithms; our proposed method achieves the best performance across all five metrics. It is worth noting that
MAPPO performs poorly on this task compared to the IPPO
algorithm. We speculate that this may be due to the fact
that although MAPPO uses a centralized critic to evalu-
ate the concatenation of neighbor agents’ observations, the
surrounding agents change frequently, leading to increased
input complexity, unstable training, and reduced efficiency,
resulting in poorer performance.
As depicted in Fig. 4d, our network improvement results
in a faster training speed, higher success rate, and better
convergence scores, illustrating the effectiveness of network
enhancement. We also conduct an investigation into the
impact of increasing the number of agents on performance,
as shown in Fig. 4e. As the number of agents increases, the convergence rate remains largely unchanged, and the route completion rates remain above 0.8, indicating the scalability of our method. Table II presents the evaluation results on the test scenarios, and Fig. 5 demonstrates that our proposed method outperforms IPPO and MAPPO.

TABLE II: Validation results on the three test scenarios
Scenarios Test I Test II Test III
Success(%) 74.67±4.34 80.67±3.90 77.81±4.97
Route completion(%) 83.26±2.68 88.64±2.36 85.31±3.27
Crash(%) 13.90±3.08 12.90±3.72 19.02±3.92
Out(%) 3.60±0.81 1.66±0.77 3.42±1.34
Reward 546.19±22.29 440.40±38.71 1075.13±98.27

Fig. 5: Evaluation results on the test II scenario: success, route completion, crash, out and reward curves. Each evaluation takes 30 episodes.
VI. CONCLUSIONS
In this paper, we presented a novel approach to address
cooperative driving issues using the MARL technique. We
developed a multi-scale feature fusion method to reduce
information loss and improve the representation ability of the
policy network. Additionally, we designed comprehensive
reward functions based on traffic efficiency, safety, comfort
and driving strategy. Compared with other baselines, we
validated the advantages of our approach on five general
metrics. In the future, we plan to focus on credit assignment
and communication in MARL and contribute to solutions
for hybrid traffic scenarios involving human drivers. We will
also explore more about the potential of this method for real-
world applications.
APPENDIX
A video of the test scenario is available at:
https://www.bilibili.com/video/BV11L411y7ea/?vd_source=eb49c849410e667ba1f8ced7e0b03440
ACKNOWLEDGMENT
This work was supported by the National Natural Science Funds of China for Young Scholar (62003328, 62073311), the Guangdong Basic and Applied Basic Research Foundation (2023A1515011813, 2020B515130004), and the Shenzhen Basic Key Research Project (JCYJ20200109115414354).
REFERENCES
[1] Kiran, B.R., Sobh, I., Talpaert, V., Mannion, P., Al Sallab, A.A., Yogamani, S. and Pérez, P., 2021. Deep reinforcement learning for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems, 23(6), pp.4909-4926.
[2] Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L.,
Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T. and
Lillicrap, T., 2020. Mastering atari, go, chess and shogi by planning
with a learned model. Nature, 588(7839), pp.604-609.
[3] Perolat, J., De Vylder, B., Hennes, D., Tarassov, E., Strub, F., de Boer,
V., Muller, P., Connor, J.T., Burch, N., Anthony, T. and McAleer,
S., 2022. Mastering the game of Stratego with model-free multiagent
reinforcement learning. Science, 378(6623), pp.990-996.
[4] Vinyals, O., Babuschkin, I., Czarnecki, W.M., Mathieu, M., Dudzik,
A., Chung, J., Choi, D.H., Powell, R., Ewalds, T., Georgiev, P. and
Oh, J., 2019. Grandmaster level in StarCraft II using multi-agent
reinforcement learning. Nature, 575(7782), pp.350-354.
[5] Berner, C., Brockman, G., Chan, B., Cheung, V., Debiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C. and Józefowicz, R., 2019. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680.
[6] Sadigh, D., Landolfi, N., Sastry, S.S., Seshia, S.A. and Dragan, A.D.,
2018. Planning for cars that coordinate with people: leveraging effects
on human actions for planning and active information gathering over
human internal state. Autonomous Robots, 42, pp.1405-1426.
[7] Schwarting, W., Pierson, A., Alonso-Mora, J., Karaman, S. and Rus,
D., 2019. Social behavior for autonomous vehicles. Proceedings of the
National Academy of Sciences, 116(50), pp.24972-24978.
[8] Toghi, B., Valiente, R., Sadigh, D., Pedarsani, R. and Fallah, Y.P.,
2021, September. Cooperative autonomous vehicles that sympathize
with human drivers. In 2021 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS) (pp. 4517-4524). IEEE.
[9] Zhou, T., Wang, L., Chen, R., Wang, W. and Liu, Y., 2022.
Accelerating Reinforcement Learning for Autonomous Driving us-
ing Task-Agnostic and Ego-Centric Motion Skills. arXiv preprint
arXiv:2209.12072.
[10] Peng, Z., Li, Q., Hui, K.M., Liu, C. and Zhou, B., 2021. Learning
to simulate self-driven particles system with coordinated policy op-
timization. Advances in Neural Information Processing Systems, 34,
pp.10784-10797.
[11] Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W.M., Zambaldi,
V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J.Z., Tuyls, K.
and Graepel, T., 2017. Value-decomposition networks for cooperative
multi-agent learning. arXiv preprint arXiv:1706.05296.
[12] Rashid, T., Samvelyan, M., De Witt, C.S., Farquhar, G., Foerster, J. and
Whiteson, S., 2020. Monotonic value function factorisation for deep
multi-agent reinforcement learning. The Journal of Machine Learning
Research, 21(1), pp.7234-7284.
[13] Son, K., Kim, D., Kang, W.J., Hostallero, D.E. and Yi, Y., 2019, May.
Qtran: Learning to factorize with transformation for cooperative multi-
agent reinforcement learning. In International conference on machine
learning (pp. 5887-5896). PMLR.
[14] Lowe, R., Wu, Y.I., Tamar, A., Harb, J., Pieter Abbeel, O. and
Mordatch, I., 2017. Multi-agent actor-critic for mixed cooperative-
competitive environments. Advances in neural information processing
systems, 30.
[15] McClintock, C.G. and McNeel, S.P., 1966. Reward and score feedback
as determinants of cooperative and competitive game behavior. Journal
of Personality and Social Psychology, 4(6), p.606.
[16] Liebrand, W.B. and McClintock, C.G., 1988. The ring measure of
social values: A computerized procedure for assessing individual
differences in information processing and social value orientation.
European journal of personality, 2(3), pp.217-230.
[17] Li, Q., Peng, Z., Feng, L., Zhang, Q., Xue, Z. and Zhou, B., 2022.
Metadrive: Composing diverse driving scenarios for generalizable
reinforcement learning. IEEE transactions on pattern analysis and
machine intelligence.
[18] Toghi, B., Valiente, R., Sadigh, D., Pedarsani, R. and Fallah, Y.P.,
2021. Altruistic maneuver planning for cooperative autonomous
vehicles using multi-agent advantage actor-critic. arXiv preprint
arXiv:2107.05664.
[19] Han, S., Wang, H., Su, S., Shi, Y. and Miao, F., 2022, May. Stable
and efficient Shapley value-based reward reallocation for multi-agent
reinforcement learning of autonomous vehicles. In 2022 International
Conference on Robotics and Automation (ICRA) (pp. 8765-8771).
IEEE.
[20] Foerster, J., Farquhar, G., Afouras, T., Nardelli, N. and Whiteson, S.,
2018, April. Counterfactual multi-agent policy gradients. In Proceed-
ings of the AAAI conference on artificial intelligence (Vol. 32, No.
1).
[21] de Witt, C.S., Gupta, T., Makoviichuk, D., Makoviychuk, V., Torr,
P.H., Sun, M. and Whiteson, S., 2020. Is independent learning
all you need in the starcraft multi-agent challenge?. arXiv preprint
arXiv:2011.09533.
[22] Yu, C., Velu, A., Vinitsky, E., Wang, Y., Bayen, A. and Wu, Y., 2021.
The surprising effectiveness of ppo in cooperative, multi-agent games.
arXiv preprint arXiv:2103.01955.
[23] Liu, J., Li, H., Yang, Z., Dang, S. and Huang, Z., 2022. Deep Dense
Network-Based Curriculum Reinforcement Learning for High-Speed
Overtaking. IEEE Intelligent Transportation Systems Magazine.
[24] Li, K., Lei, B. and Li, H., 2020. Estimation of Road Geometric
Information for Congested Roads by Autonomous Driving. Journal
of Integration Technology, 9(5), pp.69-80.