Mastering Cooperative Driving Strategy in Complex Scenarios using
Multi-Agent Reinforcement Learning
Qingyi Liang1,2, Zhengmin Jiang2,3, Jianwen Yin2, Kun Xu2,
Zhongming Pan2, Shaobo Dang2, Jia Liu2,∗
Abstract— With the advent of machine learning, several
autonomous driving tasks have become easier to accomplish.
Nonetheless, the proliferation of autonomous vehicles in urban
traffic scenarios has precipitated other challenges, such as coop-
erative driving. Multi-Agent Reinforcement Learning (MARL)
approaches have emerged as a promising solution, as they can
successfully address the challenge by simulating the interactive
relationship among autonomous vehicles (AVs). In this paper,
we leverage Social Value Orientation to depict the behavioral
tendencies of AVs, thereby enhancing the performance of
MARL approaches. We also incorporate different scale features
in the policy network to strengthen its representation ability.
Moreover, effective reward functions are designed based on
traffic efficiency, comfort, safety, and strategy. Finally, we
validate our approach in an open-source autonomous driving
simulator. Simulation results indicate that our proposed ap-
proach outperforms IPPO and MAPPO algorithms in terms
of success rate, route completion rate, crash rate, and other
metrics.
I. INTRODUCTION
Autonomous driving technology has garnered significant
attention in recent years, and has improved traffic con-
ditions in terms of safety, convenience, and intelligence.
Presently, practical applications of autonomous driving are
observed in dock transportation and express delivery. The
autonomous system typically comprises three main com-
ponents for generating actions: (1) perception - gathering
information about the surrounding environment, including
roads, vehicles, and pedestrians, using cameras and LIDAR,
as well as other technologies; (2) planning - utilizing the
acquired information to generate the vehicle’s next high-level
command; and (3) execution - transforming the command
generated in the previous step into control signals, such
as throttle and steering of the vehicle[1]. While supervised
learning has exhibited excellent performance in tasks such
as computer vision and plays a crucial role in the perception
part of autonomous driving, it is less applicable to the
decision-making process for various driving tasks, such as
lane keeping, overtaking, intersections, etc., as the vehicle
needs to make timely predictions and decisions regarding
1Southern University of Science and Technology, Shenzhen, 518055,
China 12233254@mail.sustech.edu.cn
2CAS Key Laboratory of Human-Machine Intelligence-Synergy
Systems, Shenzhen Institutes of Advanced Technology, Chinese Academy
of Sciences, Shenzhen 518055, China (qy.liang1@siat.ac.cn,
zm.jiang@siat.ac.cn, jw.yin@siat.ac.cn,
kun.xu@siat.ac.cn, zm.pan@siat.ac.cn,
shaobo.dang@siat.ac.cn, jia.liu1@siat.ac.cn)
3University of Chinese Academy of Sciences, Beijing, 100049, China
∗Correspondence: jia.liu1@siat.ac.cn
the dynamic environment. Therefore, other solutions such as
RL should be explored.
RL necessitates a detailed and explicit reward function,
after which the agent continuously interacts with the environ-
ment and acquires experience by trial and error to eventually
form a policy. The process of RL can be outlined as follows:
first, the agent perceives the state in the environment; then,
the agent takes a certain action after computation; and the
environment provides a reward based on the current state and
the action taken by the agent, repeating the above steps until
reaching the goal. Consequently, RL becomes a sequential
decision problem and usually involves multiple rounds of
decisions, each of which necessitates the agent to respond to
the new environment and select an optimal action, enabling
it to handle different scenarios. RL is divided into single-agent RL and multi-agent RL (MARL). In a single-agent environment, the agent only needs to interact with the environment and maximize its own benefit, whereas in MARL each agent must interact not only with the environment but also with other agents, while maintaining its own reward function. This makes the multi-agent environment more unstable and complicated, as the relationship among agents can be classified as cooperative, competitive, or mixed. There
have been several important breakthroughs in the study of
MARL in chess games and other games [2][3][4][5]. It is also
expected to have applications in smart cities, autonomous
driving, and industrial production. Furthermore, combining
the study of MARL with human social group behavior can
help model predictions and analyze the behavior of a certain
cluster, thus enabling the evaluation and optimization of
decisions.
Owing to the high cost of trial and error, present MARL
experiments are predominantly conducted on simulators,
and there are still some issues that require resolution: (1) generalization is weak, as most studies concentrate on a single independent scenario, and the ability to cope with other or special scenarios is insufficient; for example, at an extremely congested intersection, selecting an appropriate route to avoid the congestion is a challenge. (2) Owing to the low sample efficiency of RL, the lack of direct supervision signals, and the resulting unsuitability of deep networks, the network structure in DRL is relatively simple, consisting mostly of shallow MLPs or CNNs; the representation ability of such shallow networks is weaker, which can lead to slow updates or convergence to a local optimum because crucial information is missing when the network is updated. In this paper, we propose novel solutions to address the aforementioned
979-8-3503-2718-2/23/$31.00 ©2023 IEEE
Proceedings of the 2023 IEEE International Conference on Real-time Computing and Robotics (RCAR), July 17-20, 2023, Datong, China. DOI: 10.1109/RCAR58764.2023.10249648
Authorized licensed use limited to: Shenzhen Institute of Advanced Technology CAS. Downloaded on November 27, 2023 at 07:16:01 UTC from IEEE Xplore. Restrictions apply.
challenges, and the research objectives and contributions of
this paper are outlined as follows:
•Firstly, to enhance the extracted features, we concate-
nate features at different scales to minimize information
loss while simultaneously improving the network rep-
resentation.
•Secondly, we design comprehensive and detailed reward
functions that consider travel efficiency, safety, comfort,
and driving strategies. These reward functions are in-
tended to verify the generalization ability of the model
to different scenarios and its strategic capabilities for
dealing with special scenarios.
•Finally, we construct complex training scenarios and
validate the effectiveness of the network structure and
its generalization ability in three distinct scenarios.
II. RELATED WORK
The MARL environment poses several challenges, such as dimensional explosion, non-stationarity, and credit assignment, which make the direct application of single-agent algorithms infeasible. To address this issue, one approach is to combine
the value function of each agent into the joint action value
function, which is the sum of the local action value function
of each agent[11]. To deal with the nonlinearity problem
of the joint action value function, another improvement is
proposed, which ensures consistent monotonicity between
the local action value function and the joint action value
function, optimizing the global action value function as well
as the individual’s action value function[12]. Additionally, a
more robust decomposition method is introduced for the joint
action value function by mapping it to a new value function,
which aims to scale the true action value function, making
the value function approximated by the network closer to the
true joint action value function[13]. To address the challenge
posed by dimensional explosion, a centralized training and
decentralized execution approach is proposed, which allows
access to additional information during training and only
local information during execution and has proven effective
in cooperative, competitive, and mixed interactions[14].
Research has been conducted on the coordination be-
tween an autonomous vehicle (AV) and a human-driven
vehicle (HV). The AV collects information about the HV,
correlates HV’s position to AV’s reward function, and can
force HV to move left, right, slow down, and prioritize
through intersections[6]. Social Value Orientation (SVO) is introduced to measure an HV's behavior and assess its altruistic, pro-social, egoistic, or competitive preference. The
AV predicts HV’s trajectory and chooses whether to merge,
turn left, etc. based on the SVO value. The results verify
the validity of SVO in measuring social preferences and the
improvement of correctly predicting HV’s trajectories[7].
While the aforementioned studies focus on a single AV and a single HV, another study considers multiple AVs and HVs, utilizing a multichannel VelocityMap over different time sequences focusing on AVs, HVs, self-attention, mission vehicles, and the road layout, together with a 3D convolutional network to handle the time-dependent problem. The results demonstrate that adjusting the level of altruism in an AV can affect other vehicles[8]. Other researchers argue that human driving behavior is a series of continuous skills that unfold over time. Accordingly, a task-agnostic and ego-centric skill model is constructed: the collected vehicle trajectory information is first encoded as latent skills, and the vehicle is then trained with RL to learn skills instead of actions in the latent space. The skill output is finally decoded into control information, enabling the vehicle to cope with different scenarios[9]. Modifying the
network architecture is a viable option for improving per-
formance. Features in the Soft Actor-Critic (SAC) network
are reused and connected to avoid the problems of gradient
disappearance and explosion. The results indicate that this
approach outperforms the standard SAC[23]. SVO is further
extended to all agents in a MARL environment, where each
agent focuses not only on its own reward but also on local
and global rewards. Experiments demonstrate that the agents
exhibit socially acceptable behavior, improving the safety
and efficiency of the whole group across five scenarios, such
as intersections and roundabouts[10].
Other studies focus on the distribution of credit for MARL
cooperation. A counterfactual baseline is used in [20] to
determine how much an agent contributes to the collective.
For each action of an agent, a “default action” is used
instead, and the difference in global rewards due to these
two actions is calculated, evaluating the contribution of the
agent’s current action to the system. In [19], the problem of
total reward redistribution in MARL systems is discussed,
and it is found that using more efficient value distribution
can promote stable cooperation among agents. The Shapley
Value is proposed to approximate the contribution of each
agent to the system to distribute cooperation benefits fairly.
The results verify the feasibility of this method, providing
new ideas for MARL credit assignment tasks.
III. PRELIMINARY
A. Dec-POMDPs
We describe the decentralized partially observable Markov decision process in MARL as a tuple $\langle S, A_i, P, r_i, \gamma \rangle$ for agents $i = 1, \cdots, N$, where $S$ is the state space that includes all the possible states that agents can adopt; $A_i$ is the finite set consisting of the actions that agent $i$ can take, $a_i \in A_i$; $P$ is the transition function $P: S \times A_1 \times \cdots \times A_N \to S$ with range $[0,1]$; $r_i$ is the reward of agent $i$, denoted $r_i: S \times A_i \to \mathbb{R}$; and $\gamma \in [0,1]$ is the discount factor that balances instantaneous and long-term returns. The total reward of the environment from the beginning moment $t_0$ to the end of the interaction at moment $T$ can be expressed as $R_t = \sum_{t=t_0}^{T} \gamma^{t-t_0} r_t$. A policy can be expressed as a state-to-action mapping $\pi: S \to A$; whether an action in a state is good is evaluated by the action value function $Q^{\pi}(s,a) = \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$, and the state value function used to judge the strategy under the state is expressed as $V^{\pi}(s) = \mathbb{E}[R_t \mid s_t = s, \pi]$. The advantage function is $A_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.
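These two quantities can be sketched directly from their definitions. The helper names below are illustrative, not from the paper; this is a minimal sketch of the discounted return $R_t$ and the one-step advantage $A_t$, assuming scalar rewards and value estimates:

```python
# Sketch of the quantities in Sec. III-A (illustrative names, not the
# paper's implementation): the discounted return
# R_t = sum_t gamma^(t-t0) * r_t, and the one-step advantage
# A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

def discounted_return(rewards, gamma):
    """Sum of gamma^(t-t0) * r_t over a trajectory of rewards."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def one_step_advantage(r_t, v_t, v_next, gamma):
    """A_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return r_t + gamma * v_next - v_t

rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, 0.99))        # 1 + 0.99*0 + 0.99^2 * 2
print(one_step_advantage(1.0, 0.5, 1.0, 0.99)) # 1 + 0.99*1.0 - 0.5
```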
B. Social Value Orientation (SVO)
Social psychology acknowledges that individuals are com-
plex decision-makers, as they not only consider benefits
to themselves but also benefits of others, as well as how
others perceive their behaviors[15]. In complex scenarios,
individuals’ motivations are diverse, and there are individual
differences and variations. Social Value Orientations (SVO)
provide a measure of the importance individuals attach to
different interests based on their attitudes toward individual
benefits and the benefits of others. In [16], different individu-
als are classified into four types: Individualism, Competition,
Cooperation, and Altruism, as shown in Fig. 1.
Fig. 1: Graphical representation of Social Value Orientation framework.
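As a concrete illustration of the four orientations in Fig. 1, the common SVO ring formulation from the literature (an assumption here; this equation is not defined in the paper) weights an agent's own reward against others' reward by an orientation angle $\varphi$, with $\varphi \approx 0$ for egoistic, $\varphi \approx \pi/4$ for cooperative/pro-social, and $\varphi \approx \pi/2$ for altruistic agents:

```python
import math

# Sketch of the SVO ring measure (assumption: standard formulation from
# the SVO literature, not an equation from this paper):
#   r_svo = cos(phi) * r_self + sin(phi) * r_others
def svo_reward(r_self, r_others, phi):
    return math.cos(phi) * r_self + math.sin(phi) * r_others

r_self, r_others = 1.0, 0.5
print(svo_reward(r_self, r_others, 0.0))          # egoistic: only own reward
print(svo_reward(r_self, r_others, math.pi / 4))  # cooperative: equal weights
print(svo_reward(r_self, r_others, math.pi / 2))  # altruistic: only others'
```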
Integrating SVO into MARL is a desirable approach. Our
objective is to enable agents to learn driving strategies and
have authentic and benign interactions with humans. SVO
provides a good measure for optimizing the interactions
among agents and obtaining information about agents’ SVO
can help find solutions to improve traffic efficiency. Inspired
by SVO and the reward function design in [10][18] which
integrates individual rewards and rewards of others, we use
neighborhood rewards and global rewards to balance self-interest and the interests of others. Since an agent only interacts with agents in its surrounding neighborhood within finite time steps, we assign a higher attention weight to nearby agents. We define the average reward of nearby agents within d = 40 meters as the neighborhood reward, as follows:
$$r^{K}_{i,t} = \frac{\sum_{j \in K} r_{j,t}}{|K|}, \quad K = \{ j : \| pos(i) - pos(j) \| \le d \} \quad (1)$$

here $|K|$ is the number of neighborhood agents. For global rewards, we have:

$$r^{N}_{i,t} = \frac{\sum_{j=1}^{N} r_{j,t}}{N} \quad (2)$$

wherein $N$ is the total number of agents in the environment.
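Eqs. (1)-(2) can be sketched as follows. The data layout (dictionaries of positions and per-agent rewards) and the choice to exclude the ego agent from its own neighborhood set are illustrative assumptions, not details given in the paper:

```python
import math

# Sketch of Eqs. (1)-(2). Assumptions: agents are given as {id: (x, y)}
# positions and {id: reward} scalars, and agent i itself is excluded
# from its neighborhood K (the paper does not specify this).
def neighborhood_reward(i, positions, rewards, d=40.0):
    """Average reward of agents within d meters of agent i (Eq. 1)."""
    xi, yi = positions[i]
    K = [j for j, (xj, yj) in positions.items()
         if j != i and math.hypot(xi - xj, yi - yj) <= d]
    if not K:
        return 0.0
    return sum(rewards[j] for j in K) / len(K)

def global_reward(rewards):
    """Average reward over all N agents (Eq. 2)."""
    return sum(rewards.values()) / len(rewards)

positions = {0: (0.0, 0.0), 1: (10.0, 0.0), 2: (100.0, 0.0)}
rewards = {0: 1.0, 1: 2.0, 2: 3.0}
print(neighborhood_reward(0, positions, rewards))  # only agent 1 is within 40 m -> 2.0
print(global_reward(rewards))                      # (1 + 2 + 3) / 3 = 2.0
```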
IV. METHODOLOGY
To evaluate our approach, we conduct experiments on
the driving simulator MetaDrive, which covers single-agent,
safe exploration, and multi-agent traffic scenarios[17]. The
simulator supports user-customized maps and can import
real scenario data. Furthermore, the platform provides several
standard RL environment interfaces, enabling researchers to
swiftly build scenarios to validate their algorithms.
A. Network structure
Due to the inefficiency of MARL sampling, the available
sample data is limited, and the deep network structure is
unsuitable for the task of this paper. Therefore, we utilize a
shallow fully-connected layer network to update individual
rewards, neighborhood rewards, and global rewards separately. The input to the three networks is $O_i$, and the output is a one-dimensional value. As for the policy network, we use a fully connected network equipped with skip-connections. The skip-connection concatenates the original input $O_i$ onto the input of each layer of the network, minimizing information loss and improving the effectiveness of the network representation by connecting features of different scales. The policy network architecture is shown in Fig. 2.
Fig. 2: Architecture of the proposed policy network.
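The wiring of the skip-connections can be sketched as below. The layer widths and two-layer depth are assumptions for illustration (the paper specifies only that $O_i$ is concatenated onto each layer's input); plain Python lists stand in for tensors so the wiring, not the numerics, is the point:

```python
import random

# Conceptual sketch of the skip-connection policy network in Fig. 2
# (assumption: widths, depth, and function names are illustrative).
def linear(x, out_dim, seed):
    """A toy fully-connected layer: out_dim weighted sums of x."""
    rng = random.Random(seed)
    w = [[rng.uniform(-0.1, 0.1) for _ in x] for _ in range(out_dim)]
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def policy_forward(obs, hidden=64, act_dim=2):
    h1 = linear(obs, hidden, seed=0)
    # Skip-connection: concatenate the raw observation O_i onto the
    # input of every subsequent layer.
    h2 = linear(h1 + obs, hidden, seed=1)
    action = linear(h2 + obs, act_dim, seed=2)
    return action

obs = [0.1] * 10
a = policy_forward(obs)
print(len(a))  # 2 action outputs (e.g., throttle and steering)
```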
We denote the loss function of the individual value function network as follows:

$$L_i(\phi) = \mathbb{E}_{s_t}\big[\max\{(V_{i,\phi}(s_t) - \hat{V}_{i,t})^2, (V_{i,\phi_{old}}(s_t) + \mathrm{clip}(V_{i,\phi}(s_t) - V_{i,\phi_{old}}(s_t), -\epsilon, +\epsilon) - \hat{V}_{i,t})^2\}\big] \quad (3)$$

where $\phi_{old}$ is the old parameter before the network updates and $\hat{V}_{i,t}$ is the target value function. Similarly, we can derive the loss function of the neighborhood network as follows:

$$L_n(\phi) = \mathbb{E}_{s^K_t}\big[\max\{(V_{n,\phi}(s^K_t) - \hat{V}_{n,t})^2, (V_{n,\phi_{old}}(s^K_t) + \mathrm{clip}(V_{n,\phi}(s^K_t) - V_{n,\phi_{old}}(s^K_t), -\epsilon, +\epsilon) - \hat{V}_{n,t})^2\}\big] \quad (4)$$

where $s^K_t = S_1 \times \cdots \times S_K$ is the state space composed of the neighboring agents, and that of the global network:

$$L_g(\phi) = \mathbb{E}_{s^N_t}\big[\max\{(V_{g,\phi}(s^N_t) - \hat{V}_{g,t})^2, (V_{g,\phi_{old}}(s^N_t) + \mathrm{clip}(V_{g,\phi}(s^N_t) - V_{g,\phi_{old}}(s^N_t), -\epsilon, +\epsilon) - \hat{V}_{g,t})^2\}\big] \quad (5)$$

here $s^N_t = S_1 \times \cdots \times S_N$ is the whole state space. We express the objective of updating $A_i$ in this way:

$$J_{i,\theta} = \mathbb{E}_{\pi_\theta}\big[\min(\rho A_i, \mathrm{clip}(\rho, 1-\epsilon, 1+\epsilon) A_i)\big] \quad (6)$$

here $\rho$ is the probability ratio between the new and old policies; clipping it ensures that the difference between the old and new policies is not significant. Regarding the actions taken by each agent, we utilize KL divergence and entropy as constraints to limit them:

$$J_{\theta, a_{i,t}} = \mathbb{E}\big[-\pi_\theta(a_{i,t} \mid s_{i,t}) \log \pi_\theta(a_{i,t} \mid s_{i,t}) + D_{KL}(\pi_\theta(a_{i,t} \mid s_{i,t}), \pi_\theta(a_{i,t-1} \mid s_{i,t-1}))\big] \quad (7)$$
B. Reward function
Constructing a comprehensive and appropriate reward
function is crucial for enabling the agent to learn strategies
quickly to handle various road scenarios. In addition to
encouraging the agent to drive forward, we also consider
other factors such as safety, comfort, and lane change.
$$r_t = c_1 \cdot R_{driving} + c_2 \cdot R_{safety} + c_3 \cdot R_{comfort} + c_4 \cdot R_{change\_lane} + c_5 \cdot R_{end} \quad (8)$$
•Driving reward Rdriving
Our approach encourages the agent to drive forward,
actively explore the environment, and obtain the corre-
sponding rewards. We also aim for the agent to drive
as quickly as possible to improve efficiency in passing
various road sections and reach the destination.
•Safety reward Rsafety
To prevent collisions, we incorporate LIDAR informa-
tion into the reward function. Specifically, if any vehicle
is detected within a certain distance directly in front
of the agent’s current lane, the agent slows down. In
merge scenarios, we detect front vehicles in all lanes in
front of the agent. If a vehicle traveling in a different
direction from the agent (such as making a cut-in action,
etc.) is detected, the agent slows down. This approach
effectively avoids unnecessary collisions.
•Comfort reward Rcomfort
Driving comfort is a crucial factor in real-world driving,
and we incorporate it into the reward function. Specif-
ically, we define the comfort reward as follows: when
the current reference route is straight, the yaw angle
should be as small as possible to reduce unnecessary
oscillation.
•Change lane reward Rchange lane
We have observed that when congestion occurs in the
local area in front of the agent, such as a lane or part
of the lanes being occupied, the agent tends to slow
down and wait instead of switching to a free lane. This
behavior can lead to increased congestion, collisions,
and reduced traffic efficiency. Therefore, we introduce
a change lane reward. Inspired by [24] of using the
position and speed of the front vehicle to infer the
curvature of the road during congestion, we believe that
the state of the surrounding vehicles can represent the
state of the road to some extent, so we design change
lane reward as follows: when the average speed of
nearby vehicles and the speed of the ego-vehicle are
lower than a certain threshold (indicating congestion
and vehicles stop), we encourage the agent to switch to
another lane to avoid prolonged congestion and improve
traffic flow efficiency.
•End reward Rend
Upon successful arrival at the destination, a reward of 10 is added. However, if the vehicle runs off the road or a collision occurs, a penalty of 8 is imposed.
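Putting Eq. (8) and the end-reward rule together gives a small sketch. The weights $c_1 \ldots c_5$ and the component values are illustrative assumptions; the paper specifies only the +10 success reward and the 8-point penalty:

```python
# Sketch of the total reward in Eq. (8). Assumptions: unit weights
# c1..c5 and the sample component values; only the end-reward terms
# (+10 on arrival, -8 on crash/out) come from the paper.
def end_reward(arrived, crashed_or_out):
    if arrived:
        return 10.0
    if crashed_or_out:
        return -8.0
    return 0.0

def total_reward(r_driving, r_safety, r_comfort, r_change_lane, r_end,
                 c=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """r_t = c1*R_driving + c2*R_safety + c3*R_comfort + c4*R_change_lane + c5*R_end."""
    terms = (r_driving, r_safety, r_comfort, r_change_lane, r_end)
    return sum(ci * ri for ci, ri in zip(c, terms))

print(end_reward(arrived=True, crashed_or_out=False))              # 10.0
print(total_reward(0.5, 0.0, -0.1, 0.0, end_reward(True, False)))  # 10.4
```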
C. Building scenarios
We aim to construct a new scenario that includes a range
of common traffic scenarios such as straight lanes, curves,
intersections, roundabouts, T-intersections, ramps, etc. These
scenarios are combined in a specific sequence to form a com-
plex road that simulates various real-world scenarios without
traffic lights. Our objective is for agents to successfully pass
these scenarios, and even acquire better driving skills and
strategies than human drivers. To assess the generalization
performance of the algorithm, we design three different test
scenarios, which are illustrated in Fig. 3.
Fig. 3: The map on the top left is the training scenario and the rest are the test scenarios.
V. EXPERIMENT
A. Experiment settings
We conducted our experiments on the MetaDrive simulator[17], where we specified random positions of vehicles across four lanes on the road and maintained a real-time agent count of 40 to ensure a certain traffic density. The start and end points of each vehicle were chosen randomly, and we considered the route successfully completed if the vehicle followed the given reference route to the destination without crashing or veering off the road. Otherwise, the reasons for failure were classified as crash, out, and arrived max step, and the route completion rate was calculated. To obtain sensing information, we mainly used LIDAR, with a neighborhood range of 40 m and a safety distance of 15 m to prevent collisions. Our experiments were conducted on a computer equipped with an Intel i7-12700 (20-thread) processor and an NVIDIA GeForce RTX 3080 GPU, running MetaDrive 0.2.5, Python 3.7, PyTorch 1.13, and CUDA 10.2. The training period lasted for 2.5M environment steps with 1563 iterations, a learning rate of 3e-4, a training batch size of 1024, γ = 0.99, and λ = 0.95 for GAE.
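With the stated γ = 0.99 and λ = 0.95, GAE can be sketched as below. The function and variable names are illustrative assumptions; the paper uses GAE but does not list its update equations:

```python
# Sketch of Generalized Advantage Estimation with the paper's
# gamma = 0.99, lambda = 0.95 (assumption: names are illustrative).
def gae(rewards, values, gamma=0.99, lam=0.95):
    """values has len(rewards) + 1 entries (bootstrap value at the end)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                 # lambda-weighted sum
        advantages[t] = running
    return advantages

rewards = [1.0, 0.0, 1.0]
values = [0.5, 0.4, 0.3, 0.0]
adv = gae(rewards, values)
print(len(adv))  # one advantage estimate per transition
```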
B. Results
We compared the proposed algorithm to the baseline
algorithms IPPO[21] and MAPPO[22]. We used 4 start seeds
Fig. 4: a-c, Learning curves of success rate, route completion rate and reward for the three training methods: IPPO, MAPPO and our method. d, Performance of the network before and after the proposed changes. We used the original CoPO[10] algorithm with the improved reward function described in Section IV to make a comparison. e, Effect of the number of agents on the route completion rate.
for all algorithms, and each training trial consisted of 2.5M
steps. We evaluated the algorithms using five metrics: success
rate, route completion rate, crash rate, out rate, and reward.
Additionally, to assess the impact of concatenating network
inputs of different scales, we conducted experiments com-
paring the network before and after the proposed changes.
TABLE I: Comparison of different algorithms
Metrics IPPO MAPPO Ours
Success(%) 63.37±3.56 20.07±5.21 82.01±3.10
Route completion(%) 76.66±2.45 38.70±8.90 87.99±2.37
Crash(%) 20.01±0.95 22.14±3.17 14.69±0.71
Out(%) 40.47±5.72 58.11±4.01 19.18±4.71
Reward 1162.67±56.68 338.88±142.64 1217.62±64.90
As shown in Fig. 4a-4c, our method and IPPO have
similar performance in the early training period. However,
our method eventually converges to a higher performance
level with a stable success rate of approximately 0.8, which
is nearly 20% higher than IPPO, and 15% higher in the route
completion rate. Although the agents trained by our method
take longer to obtain rewards than those of IPPO, the final
reward values of the two are similar, suggesting that our
agents consider additional metrics besides maximizing the
reward function when updating their strategy. Notably, after 0.5M training steps, our agents achieve a higher success rate and route completion rate despite receiving fewer rewards than IPPO, indicating that they learn a more effective strategy to complete the task. Table I presents the results of the three algorithms; our proposed method achieves the best performance across all five metrics. It is worth noting that
MAPPO performs poorly on this task compared to the IPPO
algorithm. We speculate that this may be due to the fact
that although MAPPO uses a centralized critic to evalu-
ate the concatenation of neighbor agents’ observations, the
surrounding agents change frequently, leading to increased
input complexity, unstable training, and reduced efficiency,
resulting in poorer performance.
As depicted in Fig. 4d, our network improvement results
in a faster training speed, higher success rate, and better
convergence scores, illustrating the effectiveness of network
enhancement. We also conduct an investigation into the
impact of increasing the number of agents on performance,
as shown in Fig. 4e. As the number of agents increases, the
Fig. 5: Evaluation results on the test II scenario: success, route completion, crash, out and reward curves. Each evaluation takes 30 episodes.
TABLE II: Validation results on the three test scenarios
Scenarios Test I Test II Test III
Success(%) 74.67±4.34 80.67±3.90 77.81±4.97
Route completion(%) 83.26±2.68 88.64±2.36 85.31±3.27
Crash(%) 13.90±3.08 12.90±3.72 19.02±3.92
Out(%) 3.60±0.81 1.66±0.77 3.42±1.34
Reward 546.19±22.29 440.40±38.71 1075.13±98.27
convergence rate remains largely unchanged, and the route
completion rates remain above 0.8, indicating the scalability
of our method. Table II presents the evaluation results on
test scenarios, with Fig. 5 demonstrating that our proposed
method outperforms IPPO and MAPPO.
VI. CONCLUSIONS
In this paper, we presented a novel approach to address
cooperative driving issues using the MARL technique. We
developed a multi-scale feature fusion method to reduce
information loss and improve the representation ability of the
policy network. Additionally, we designed comprehensive
reward functions based on traffic efficiency, safety, comfort
and driving strategy. Compared with other baselines, we
validated the advantages of our approach on five general
metrics. In the future, we plan to focus on credit assignment
and communication in MARL and contribute to solutions
for hybrid traffic scenarios involving human drivers. We will also further explore the potential of this method for real-world applications.
APPENDIX
A video of the test scenario is available at:
https://www.bilibili.com/video/BV11L411y7ea/?vd_source=eb49c849410e667ba1f8ced7e0b03440
ACKNOWLEDGMENT
This work was supported by the National Natural Science Foundation of China for Young Scholars (62003328, 62073311), the Guangdong Basic and Applied Basic Research Foundation (2023A1515011813, 2020B515130004), and the Shenzhen Basic Key Research Project (JCYJ20200109115414354).
REFERENCES
[1] Kiran, B.R., Sobh, I., Talpaert, V., Mannion, P., Al Sallab, A.A., Yogamani, S. and Pérez, P., 2021. Deep reinforcement learning for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems, 23(6), pp.4909-4926.
[2] Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L.,
Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T. and
Lillicrap, T., 2020. Mastering atari, go, chess and shogi by planning
with a learned model. Nature, 588(7839), pp.604-609.
[3] Perolat, J., De Vylder, B., Hennes, D., Tarassov, E., Strub, F., de Boer,
V., Muller, P., Connor, J.T., Burch, N., Anthony, T. and McAleer,
S., 2022. Mastering the game of Stratego with model-free multiagent
reinforcement learning. Science, 378(6623), pp.990-996.
[4] Vinyals, O., Babuschkin, I., Czarnecki, W.M., Mathieu, M., Dudzik,
A., Chung, J., Choi, D.H., Powell, R., Ewalds, T., Georgiev, P. and
Oh, J., 2019. Grandmaster level in StarCraft II using multi-agent
reinforcement learning. Nature, 575(7782), pp.350-354.
[5] Berner, C., Brockman, G., Chan, B., Cheung, V., Debiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C. and Józefowicz, R., 2019. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680.
[6] Sadigh, D., Landolfi, N., Sastry, S.S., Seshia, S.A. and Dragan, A.D.,
2018. Planning for cars that coordinate with people: leveraging effects
on human actions for planning and active information gathering over
human internal state. Autonomous Robots, 42, pp.1405-1426.
[7] Schwarting, W., Pierson, A., Alonso-Mora, J., Karaman, S. and Rus,
D., 2019. Social behavior for autonomous vehicles. Proceedings of the
National Academy of Sciences, 116(50), pp.24972-24978.
[8] Toghi, B., Valiente, R., Sadigh, D., Pedarsani, R. and Fallah, Y.P.,
2021, September. Cooperative autonomous vehicles that sympathize
with human drivers. In 2021 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS) (pp. 4517-4524). IEEE.
[9] Zhou, T., Wang, L., Chen, R., Wang, W. and Liu, Y., 2022.
Accelerating Reinforcement Learning for Autonomous Driving us-
ing Task-Agnostic and Ego-Centric Motion Skills. arXiv preprint
arXiv:2209.12072.
[10] Peng, Z., Li, Q., Hui, K.M., Liu, C. and Zhou, B., 2021. Learning
to simulate self-driven particles system with coordinated policy op-
timization. Advances in Neural Information Processing Systems, 34,
pp.10784-10797.
[11] Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W.M., Zambaldi,
V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J.Z., Tuyls, K.
and Graepel, T., 2017. Value-decomposition networks for cooperative
multi-agent learning. arXiv preprint arXiv:1706.05296.
[12] Rashid, T., Samvelyan, M., De Witt, C.S., Farquhar, G., Foerster, J. and
Whiteson, S., 2020. Monotonic value function factorisation for deep
multi-agent reinforcement learning. The Journal of Machine Learning
Research, 21(1), pp.7234-7284.
[13] Son, K., Kim, D., Kang, W.J., Hostallero, D.E. and Yi, Y., 2019, May.
Qtran: Learning to factorize with transformation for cooperative multi-
agent reinforcement learning. In International conference on machine
learning (pp. 5887-5896). PMLR.
[14] Lowe, R., Wu, Y.I., Tamar, A., Harb, J., Pieter Abbeel, O. and
Mordatch, I., 2017. Multi-agent actor-critic for mixed cooperative-
competitive environments. Advances in neural information processing
systems, 30.
[15] McClintock, C.G. and McNeel, S.P., 1966. Reward and score feedback
as determinants of cooperative and competitive game behavior. Journal
of Personality and Social Psychology, 4(6), p.606.
[16] Liebrand, W.B. and McClintock, C.G., 1988. The ring measure of
social values: A computerized procedure for assessing individual
differences in information processing and social value orientation.
European journal of personality, 2(3), pp.217-230.
[17] Li, Q., Peng, Z., Feng, L., Zhang, Q., Xue, Z. and Zhou, B., 2022.
Metadrive: Composing diverse driving scenarios for generalizable
reinforcement learning. IEEE transactions on pattern analysis and
machine intelligence.
[18] Toghi, B., Valiente, R., Sadigh, D., Pedarsani, R. and Fallah, Y.P.,
2021. Altruistic maneuver planning for cooperative autonomous
vehicles using multi-agent advantage actor-critic. arXiv preprint
arXiv:2107.05664.
[19] Han, S., Wang, H., Su, S., Shi, Y. and Miao, F., 2022, May. Stable
and efficient Shapley value-based reward reallocation for multi-agent
reinforcement learning of autonomous vehicles. In 2022 International
Conference on Robotics and Automation (ICRA) (pp. 8765-8771).
IEEE.
[20] Foerster, J., Farquhar, G., Afouras, T., Nardelli, N. and Whiteson, S.,
2018, April. Counterfactual multi-agent policy gradients. In Proceed-
ings of the AAAI conference on artificial intelligence (Vol. 32, No.
1).
[21] de Witt, C.S., Gupta, T., Makoviichuk, D., Makoviychuk, V., Torr,
P.H., Sun, M. and Whiteson, S., 2020. Is independent learning
all you need in the starcraft multi-agent challenge?. arXiv preprint
arXiv:2011.09533.
[22] Yu, C., Velu, A., Vinitsky, E., Wang, Y., Bayen, A. and Wu, Y., 2021.
The surprising effectiveness of ppo in cooperative, multi-agent games.
arXiv preprint arXiv:2103.01955.
[23] Liu, J., Li, H., Yang, Z., Dang, S. and Huang, Z., 2022. Deep Dense
Network-Based Curriculum Reinforcement Learning for High-Speed
Overtaking. IEEE Intelligent Transportation Systems Magazine.
[24] Li, K., Lei, B. and Li, H., 2020. Estimation of Road Geometric
Information for Congested Roads by Autonomous Driving. Journal
of Integration Technology, 9(5), pp.69-80.