Mastering Cooperative Driving Strategy in Complex Scenarios using
Multi-Agent Reinforcement Learning
Qingyi Liang1,2, Zhengmin Jiang2,3, Jianwen Yin2, Kun Xu2,
Zhongming Pan2, Shaobo Dang2, Jia Liu2,∗
Abstract— With the advent of machine learning, several
autonomous driving tasks have become easier to accomplish.
Nonetheless, the proliferation of autonomous vehicles in urban
traffic scenarios has precipitated other challenges, such as coop-
erative driving. Multi-Agent Reinforcement Learning (MARL)
approaches have emerged as a promising solution, as they can
successfully address the challenge by simulating the interactive
relationship among autonomous vehicles (AVs). In this paper,
we leverage Social Value Orientation to depict the behavioral
tendencies of AVs, thereby enhancing the performance of
MARL approaches. We also incorporate different scale features
in the policy network to strengthen its representation ability.
Moreover, effective reward functions are designed based on
traffic efficiency, comfort, safety, and strategy. Finally, we
validate our approach in an open-source autonomous driving
simulator. Simulation results indicate that our proposed ap-
proach outperforms IPPO and MAPPO algorithms in terms
of success rate, route completion rate, crash rate, and other
metrics.
I. INTRODUCTION
Autonomous driving technology has garnered significant
attention in recent years, and has improved traffic con-
ditions in terms of safety, convenience, and intelligence.
Presently, practical applications of autonomous driving are
observed in dock transportation and express delivery. The
autonomous system typically comprises three main com-
ponents for generating actions: (1) perception - gathering
information about the surrounding environment, including
roads, vehicles, and pedestrians, using cameras and LIDAR,
as well as other technologies; (2) planning - utilizing the
acquired information to generate the vehicle’s next high-level
command; and (3) execution - transforming the command
generated in the previous step into control signals, such
as throttle and steering of the vehicle[1]. While supervised
learning has exhibited excellent performance in tasks such
as computer vision and plays a crucial role in the perception
part of autonomous driving, it is less applicable to the
decision-making process for various driving tasks, such as
lane keeping, overtaking, intersections, etc., as the vehicle
needs to make timely predictions and decisions regarding
1Southern University of Science and Technology, Shenzhen, 518055,
China 12233254@mail.sustech.edu.cn
2CAS Key Laboratory of Human-Machine Intelligence-Synergy
Systems, Shenzhen Institutes of Advanced Technology, Chinese Academy
of Sciences, Shenzhen 518055, China (qy.liang1@siat.ac.cn,
zm.jiang@siat.ac.cn, jw.yin@siat.ac.cn,
kun.xu@siat.ac.cn, zm.pan@siat.ac.cn,
shaobo.dang@siat.ac.cn, jia.liu1@siat.ac.cn)
3University of Chinese Academy of Sciences, Beijing, 100049, China
∗Correspondence: jia.liu1@siat.ac.cn
the dynamic environment. Therefore, other solutions such as
RL should be explored.
RL necessitates a detailed and explicit reward function,
after which the agent continuously interacts with the environ-
ment and acquires experience by trial and error to eventually
form a policy. The process of RL can be outlined as follows:
first, the agent perceives the state in the environment; then,
the agent takes a certain action after computation; and the
environment provides a reward based on the current state and
the action taken by the agent, repeating the above steps until
reaching the goal. Consequently, RL becomes a sequential
decision problem and usually involves multiple rounds of
decisions, each of which necessitates the agent to respond to
the new environment and select an optimal action, enabling
it to handle different scenarios. RL is divided into single-agent RL and multi-agent RL (MARL). In a single-agent environment, the agent only needs to interact with the environment and maximize its own benefit, whereas in MARL each agent must interact not only with the environment but also with other agents, while maintaining its own reward function. This makes the multi-agent environment more unstable and complicated, as the relationship among agents can be classified as cooperative, competitive, or mixed. There
have been several important breakthroughs in the study of
MARL in chess games and other games [2][3][4][5]. It is also
expected to have applications in smart cities, autonomous
driving, and industrial production. Furthermore, combining
the study of MARL with human social group behavior can
help model predictions and analyze the behavior of a certain
cluster, thus enabling the evaluation and optimization of
decisions.
Owing to the high cost of trial and error, present MARL
experiments are predominantly conducted on simulators,
and there are still some issues that require resolution: (1) generalization is weak, as most studies concentrate on a single independent scenario, and the ability to cope with other or special scenarios is insufficient; for example, at an extremely congested intersection, selecting an appropriate route to avoid the congestion is a challenge. (2) Owing to the low sample efficiency of RL, the lack of direct supervision signals, and the resulting unsuitability of deep networks, the network structure in DRL is relatively simple, consisting mostly of shallow MLPs or CNNs; the representation ability of such shallow networks is weaker, which can lead to slow updates or convergence to a local optimum because crucial information is missing when the network is updated. In this paper, we propose novel solutions to address the aforementioned
979-8-3503-2718-2/23/$31.00 ©2023 IEEE
Proceedings of the 2023 IEEE International Conference on Real-time Computing and Robotics (RCAR), July 17-20, 2023, Datong, China. DOI: 10.1109/RCAR58764.2023.10249648
Authorized licensed use limited to: Shenzhen Institute of Advanced Technology CAS. Downloaded on November 27, 2023 at 07:16:01 UTC from IEEE Xplore. Restrictions apply.
challenges, and the research objectives and contributions of
this paper are outlined as follows:
•Firstly, to enhance the extracted features, we concate-
nate features at different scales to minimize information
loss while simultaneously improving the network rep-
resentation.
•Secondly, we design comprehensive and detailed reward
functions that consider travel efficiency, safety, comfort,
and driving strategies. These reward functions are in-
tended to verify the generalization ability of the model
to different scenarios and its strategic capabilities for
dealing with special scenarios.
•Finally, we construct complex training scenarios and
validate the effectiveness of the network structure and
its generalization ability in three distinct scenarios.
II. RELATED WORK
The MARL environment poses several challenges, such as dimensional explosion, non-stationarity, and credit assignment, which make the direct application of single-agent algorithms infeasible. To address this issue, one approach is to combine
the value function of each agent into the joint action value
function, which is the sum of the local action value function
of each agent[11]. To deal with the nonlinearity problem
of the joint action value function, another improvement is
proposed, which ensures consistent monotonicity between
the local action value function and the joint action value
function, optimizing the global action value function as well
as the individual’s action value function[12]. Additionally, a
more robust decomposition method is introduced for the joint
action value function by mapping it to a new value function,
which aims to scale the true action value function, making
the value function approximated by the network closer to the
true joint action value function[13]. To address the challenge
posed by dimensional explosion, a centralized training and
decentralized execution approach is proposed, which allows
access to additional information during training and only
local information during execution and has proven effective
in cooperative, competitive, and mixed interactions[14].
Research has been conducted on the coordination be-
tween an autonomous vehicle (AV) and a human-driven
vehicle (HV). The AV collects information about the HV,
correlates HV’s position to AV’s reward function, and can
force HV to move left, right, slow down, and prioritize
through intersections[6]. Social Value Orientation (SVO) is introduced to measure an HV's behavior and assess its altruistic, pro-social, egoistic, or competitive preference. The
AV predicts HV’s trajectory and chooses whether to merge,
turn left, etc. based on the SVO value. The results verify
the validity of SVO in measuring social preferences and the
improvement of correctly predicting HV’s trajectories[7].
While the aforementioned studies focus on a single AV and a single HV, another study considers multiple AVs and HVs, utilizing a multichannel VelocityMap over different time sequences focusing on AVs, HVs, self-attention, mission vehicles, and the road layout, together with a 3D convolutional network to handle the time-dependent problem. The results demonstrate that adjusting the level of altruism in an AV can affect other vehicles[8]. Other researchers argue that human driving behavior is a series of continuous skills that unfold over time. Accordingly, a task-agnostic and ego-centric skill model is constructed: the collected vehicle trajectory information is first encoded as latent skills, and the vehicle is then trained with RL to learn skills instead of actions in the latent space. The skill output is finally decoded into control information, enabling the vehicle to cope with different scenarios[9]. Modifying the
network architecture is a viable option for improving per-
formance. Features in the Soft Actor-Critic (SAC) network
are reused and connected to avoid the problems of gradient
disappearance and explosion. The results indicate that this
approach outperforms the standard SAC[23]. SVO is further
extended to all agents in a MARL environment, where each
agent focuses not only on its own reward but also on local
and global rewards. Experiments demonstrate that the agents
exhibit socially acceptable behavior, improving the safety
and efficiency of the whole group across five scenarios, such
as intersections and roundabouts[10].
Other studies focus on the distribution of credit for MARL
cooperation. A counterfactual baseline is used in [20] to
determine how much an agent contributes to the collective.
For each action of an agent, a “default action” is used
instead, and the difference in global rewards due to these
two actions is calculated, evaluating the contribution of the
agent’s current action to the system. In [19], the problem of
total reward redistribution in MARL systems is discussed,
and it is found that using more efficient value distribution
can promote stable cooperation among agents. The Shapley
Value is proposed to approximate the contribution of each
agent to the system to distribute cooperation benefits fairly.
The results verify the feasibility of this method, providing
new ideas for MARL credit assignment tasks.
III. PRELIMINARY
A. Dec-POMDPs
We describe the decentralized partially observable Markov decision process in MARL as a tuple $\langle S, A_i, P, r_i, \gamma \rangle$ for agents $i = 1, \cdots, N$, where $S$ is the state space that includes all the possible states that agents can adopt; $A_i$ is the finite set consisting of the actions that agent $i$ can take, $a_i \in A_i$; $P$ is the transition function $P: S \times A_1 \times \cdots \times A_N \to S$ with range $[0,1]$; $r_i$ is the reward of agent $i$, denoted $r_i: S \times A_i \to \mathbb{R}$; and $\gamma \in [0,1]$ is the discount factor that balances instantaneous and long-term returns. The total reward of the environment from the beginning moment $t_0$ to the end of the interaction at moment $T$ can be expressed as $R_t = \sum_{t=t_0}^{T} \gamma^{t-t_0} r_t$. A policy can be expressed as a state-to-action mapping $\pi: S \to A$; whether an action in a state is good is evaluated by the action value function $Q^{\pi}(s,a) = \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$, and the state value function used to judge the strategy under the state is expressed as $V^{\pi}(s) = \mathbb{E}[R_t \mid s_t = s, \pi]$. The advantage function is $A_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.
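These two quantities can be sketched directly from their definitions. The helper names below are illustrative, not from the paper; this is a minimal sketch of the discounted return $R_t$ and the one-step advantage $A_t$, assuming scalar rewards and value estimates:

```python
# Sketch of the quantities in Sec. III-A (illustrative names, not the
# paper's implementation): the discounted return
# R_t = sum_t gamma^(t-t0) * r_t, and the one-step advantage
# A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

def discounted_return(rewards, gamma):
    """Sum of gamma^(t-t0) * r_t over a trajectory of rewards."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def one_step_advantage(r_t, v_t, v_next, gamma):
    """A_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return r_t + gamma * v_next - v_t

rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, 0.99))        # 1 + 0.99*0 + 0.99^2 * 2
print(one_step_advantage(1.0, 0.5, 1.0, 0.99)) # 1 + 0.99*1.0 - 0.5
```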
B. Social Value Orientation (SVO)
Social psychology acknowledges that individuals are com-
plex decision-makers, as they not only consider benefits
to themselves but also benefits of others, as well as how
others perceive their behaviors[15]. In complex scenarios,
individuals’ motivations are diverse, and there are individual
differences and variations. Social Value Orientations (SVO)
provide a measure of the importance individuals attach to
different interests based on their attitudes toward individual
benefits and the benefits of others. In [16], different individu-
als are classified into four types: Individualism, Competition,
Cooperation, and Altruism, as shown in Fig. 1.
Fig. 1: Graphical representation of Social Value Orientation framework.
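As a concrete illustration of the four orientations in Fig. 1, the common SVO ring formulation from the literature (an assumption here; this equation is not defined in the paper) weights an agent's own reward against others' reward by an orientation angle $\varphi$, with $\varphi \approx 0$ for egoistic, $\varphi \approx \pi/4$ for cooperative/pro-social, and $\varphi \approx \pi/2$ for altruistic agents:

```python
import math

# Sketch of the SVO ring measure (assumption: standard formulation from
# the SVO literature, not an equation from this paper):
#   r_svo = cos(phi) * r_self + sin(phi) * r_others
def svo_reward(r_self, r_others, phi):
    return math.cos(phi) * r_self + math.sin(phi) * r_others

r_self, r_others = 1.0, 0.5
print(svo_reward(r_self, r_others, 0.0))          # egoistic: only own reward
print(svo_reward(r_self, r_others, math.pi / 4))  # cooperative: equal weights
print(svo_reward(r_self, r_others, math.pi / 2))  # altruistic: only others'
```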
Integrating SVO into MARL is a desirable approach. Our
objective is to enable agents to learn driving strategies and
have authentic and benign interactions with humans. SVO
provides a good measure for optimizing the interactions
among agents and obtaining information about agents’ SVO
can help find solutions to improve traffic efficiency. Inspired
by SVO and the reward function design in [10][18] which
integrates individual rewards and rewards of others, we use
neighborhood rewards and global rewards to balance self-interest and the interests of others. Since an agent only interacts with agents in its surrounding neighborhood within finite time steps, we assign a higher attention weight to nearby agents. We define the average reward of nearby agents within d = 40 meters as the neighborhood reward, as follows:
$$r^{K}_{i,t} = \frac{\sum_{j \in K} r_{j,t}}{|K|}, \quad K = \{ j : \| pos(i) - pos(j) \| \le d \} \quad (1)$$

here $|K|$ is the number of neighborhood agents. For global rewards, we have:

$$r^{N}_{i,t} = \frac{\sum_{j=1}^{N} r_{j,t}}{N} \quad (2)$$

wherein $N$ is the total number of agents in the environment.
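Eqs. (1)-(2) can be sketched as follows. The data layout (dictionaries of positions and per-agent rewards) and the choice to exclude the ego agent from its own neighborhood set are illustrative assumptions, not details given in the paper:

```python
import math

# Sketch of Eqs. (1)-(2). Assumptions: agents are given as {id: (x, y)}
# positions and {id: reward} scalars, and agent i itself is excluded
# from its neighborhood K (the paper does not specify this).
def neighborhood_reward(i, positions, rewards, d=40.0):
    """Average reward of agents within d meters of agent i (Eq. 1)."""
    xi, yi = positions[i]
    K = [j for j, (xj, yj) in positions.items()
         if j != i and math.hypot(xi - xj, yi - yj) <= d]
    if not K:
        return 0.0
    return sum(rewards[j] for j in K) / len(K)

def global_reward(rewards):
    """Average reward over all N agents (Eq. 2)."""
    return sum(rewards.values()) / len(rewards)

positions = {0: (0.0, 0.0), 1: (10.0, 0.0), 2: (100.0, 0.0)}
rewards = {0: 1.0, 1: 2.0, 2: 3.0}
print(neighborhood_reward(0, positions, rewards))  # only agent 1 is within 40 m -> 2.0
print(global_reward(rewards))                      # (1 + 2 + 3) / 3 = 2.0
```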
IV. METHODOLOGY
To evaluate our approach, we conduct experiments on
the driving simulator MetaDrive, which covers single-agent,
safe exploration, and multi-agent traffic scenarios[17]. The
simulator supports user-customized maps and can import
real scenario data. Furthermore, the platform provides several
standard RL environment interfaces, enabling researchers to
swiftly build scenarios to validate their algorithms.
A. Network structure
Due to the inefficiency of MARL sampling, the available
sample data is limited, and the deep network structure is
unsuitable for the task of this paper. Therefore, we utilize a
shallow fully-connected layer network to update individual
rewards, neighborhood rewards, and global rewards separately. The input to the three networks is $O_i$, and the output is a one-dimensional value. As for the policy network, we use a fully connected network equipped with skip-connections. The skip-connection concatenates the original input $O_i$ onto the input of each layer of the network, minimizing information loss and improving the effectiveness of the network representation by connecting features of different scales. The policy network architecture is shown in Fig. 2.
Fig. 2: Architecture of the proposed policy network.
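The wiring of the skip-connections can be sketched as below. The layer widths and two-layer depth are assumptions for illustration (the paper specifies only that $O_i$ is concatenated onto each layer's input); plain Python lists stand in for tensors so the wiring, not the numerics, is the point:

```python
import random

# Conceptual sketch of the skip-connection policy network in Fig. 2
# (assumption: widths, depth, and function names are illustrative).
def linear(x, out_dim, seed):
    """A toy fully-connected layer: out_dim weighted sums of x."""
    rng = random.Random(seed)
    w = [[rng.uniform(-0.1, 0.1) for _ in x] for _ in range(out_dim)]
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def policy_forward(obs, hidden=64, act_dim=2):
    h1 = linear(obs, hidden, seed=0)
    # Skip-connection: concatenate the raw observation O_i onto the
    # input of every subsequent layer.
    h2 = linear(h1 + obs, hidden, seed=1)
    action = linear(h2 + obs, act_dim, seed=2)
    return action

obs = [0.1] * 10
a = policy_forward(obs)
print(len(a))  # 2 action outputs (e.g., throttle and steering)
```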
We denote the loss function of the individual value function network as follows:

$$L_i(\phi) = \mathbb{E}_{s_t}\big[\max\{(V_{i,\phi}(s_t) - \hat{V}_{i,t})^2, (V_{i,\phi_{old}}(s_t) + \mathrm{clip}(V_{i,\phi}(s_t) - V_{i,\phi_{old}}(s_t), -\epsilon, +\epsilon) - \hat{V}_{i,t})^2\}\big] \quad (3)$$

where $\phi_{old}$ is the old parameter before the network updates and $\hat{V}_{i,t}$ is the target value function. Similarly, we can derive the loss function of the neighborhood network as follows:

$$L_n(\phi) = \mathbb{E}_{s^K_t}\big[\max\{(V_{n,\phi}(s^K_t) - \hat{V}_{n,t})^2, (V_{n,\phi_{old}}(s^K_t) + \mathrm{clip}(V_{n,\phi}(s^K_t) - V_{n,\phi_{old}}(s^K_t), -\epsilon, +\epsilon) - \hat{V}_{n,t})^2\}\big] \quad (4)$$

where $s^K_t = S_1 \times \cdots \times S_K$ is the state space composed of the neighboring agents, and that of the global network:

$$L_g(\phi) = \mathbb{E}_{s^N_t}\big[\max\{(V_{g,\phi}(s^N_t) - \hat{V}_{g,t})^2, (V_{g,\phi_{old}}(s^N_t) + \mathrm{clip}(V_{g,\phi}(s^N_t) - V_{g,\phi_{old}}(s^N_t), -\epsilon, +\epsilon) - \hat{V}_{g,t})^2\}\big] \quad (5)$$

here $s^N_t = S_1 \times \cdots \times S_N$ is the whole state space. We express the objective of updating $A_i$ in this way:

$$J_{i,\theta} = \mathbb{E}_{\pi_\theta}\big[\min(\rho A_i, \mathrm{clip}(\rho, 1-\epsilon, 1+\epsilon) A_i)\big] \quad (6)$$

here $\rho$ is the probability ratio between the new and old policies; clipping it ensures that the difference between the old and new policies is not significant. Regarding the actions taken by each agent, we utilize KL divergence and entropy as constraints to limit them:

$$J_{\theta, a_{i,t}} = \mathbb{E}\big[-\pi_\theta(a_{i,t} \mid s_{i,t}) \log \pi_\theta(a_{i,t} \mid s_{i,t}) + D_{KL}(\pi_\theta(a_{i,t} \mid s_{i,t}), \pi_\theta(a_{i,t-1} \mid s_{i,t-1}))\big] \quad (7)$$
B. Reward function
Constructing a comprehensive and appropriate reward
function is crucial for enabling the agent to learn strategies
quickly to handle various road scenarios. In addition to
encouraging the agent to drive forward, we also consider
other factors such as safety, comfort, and lane change.
$$r_t = c_1 \cdot R_{driving} + c_2 \cdot R_{safety} + c_3 \cdot R_{comfort} + c_4 \cdot R_{change\_lane} + c_5 \cdot R_{end} \quad (8)$$
•Driving reward Rdriving
Our approach encourages the agent to drive forward,
actively explore the environment, and obtain the corre-
sponding rewards. We also aim for the agent to drive
as quickly as possible to improve efficiency in passing
various road sections and reach the destination.
•Safety reward Rsafety
To prevent collisions, we incorporate LIDAR informa-
tion into the reward function. Specifically, if any vehicle
is detected within a certain distance directly in front
of the agent’s current lane, the agent slows down. In
merge scenarios, we detect front vehicles in all lanes in
front of the agent. If a vehicle traveling in a different
direction from the agent (such as making a cut-in action,
etc.) is detected, the agent slows down. This approach
effectively avoids unnecessary collisions.
•Comfort reward Rcomfort
Driving comfort is a crucial factor in real-world driving,
and we incorporate it into the reward function. Specif-
ically, we define the comfort reward as follows: when
the current reference route is straight, the yaw angle
should be as small as possible to reduce unnecessary
oscillation.
•Change lane reward Rchange lane
We have observed that when congestion occurs in the
local area in front of the agent, such as a lane or part
of the lanes being occupied, the agent tends to slow
down and wait instead of switching to a free lane. This
behavior can lead to increased congestion, collisions,
and reduced traffic efficiency. Therefore, we introduce
a change lane reward. Inspired by [24] of using the
position and speed of the front vehicle to infer the
curvature of the road during congestion, we believe that
the state of the surrounding vehicles can represent the
state of the road to some extent, so we design change
lane reward as follows: when the average speed of
nearby vehicles and the speed of the ego-vehicle are
lower than a certain threshold (indicating congestion
and vehicles stop), we encourage the agent to switch to
another lane to avoid prolonged congestion and improve
traffic flow efficiency.
•End reward Rend
Upon successful arrival at the destination, a reward of 10 is added. However, if the vehicle runs off the road or a collision occurs, a penalty of 8 is imposed.
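Putting Eq. (8) and the end-reward rule together gives a small sketch. The weights $c_1 \ldots c_5$ and the component values are illustrative assumptions; the paper specifies only the +10 success reward and the 8-point penalty:

```python
# Sketch of the total reward in Eq. (8). Assumptions: unit weights
# c1..c5 and the sample component values; only the end-reward terms
# (+10 on arrival, -8 on crash/out) come from the paper.
def end_reward(arrived, crashed_or_out):
    if arrived:
        return 10.0
    if crashed_or_out:
        return -8.0
    return 0.0

def total_reward(r_driving, r_safety, r_comfort, r_change_lane, r_end,
                 c=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """r_t = c1*R_driving + c2*R_safety + c3*R_comfort + c4*R_change_lane + c5*R_end."""
    terms = (r_driving, r_safety, r_comfort, r_change_lane, r_end)
    return sum(ci * ri for ci, ri in zip(c, terms))

print(end_reward(arrived=True, crashed_or_out=False))              # 10.0
print(total_reward(0.5, 0.0, -0.1, 0.0, end_reward(True, False)))  # 10.4
```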
C. Building scenarios
We aim to construct a new scenario that includes a range
of common traffic scenarios such as straight lanes, curves,
intersections, roundabouts, T-intersections, ramps, etc. These
scenarios are combined in a specific sequence to form a com-
plex road that simulates various real-world scenarios without
traffic lights. Our objective is for agents to successfully pass
these scenarios, and even acquire better driving skills and
strategies than human drivers. To assess the generalization
performance of the algorithm, we design three different test
scenarios, which are illustrated in Fig. 3.
Fig. 3: The map on the top left is the training scenario and the rest are the test scenarios.
V. EXPERIMENT
A. Experiment settings
We conducted our experiments on the MetaDrive simulator[17], where we specified random positions of vehicles across four lanes on the road and maintained a real-time agent count of 40 to ensure a certain traffic density. The start and end points of each vehicle were chosen randomly, and we considered the route successfully completed if the vehicle followed the given reference route to the destination without crashing or veering off the road. Otherwise, the reasons for failure were classified as crash, out, and arrived max step, and the route completion rate was calculated. To obtain sensing information, we mainly used LIDAR, with a neighborhood range of 40 m and a safety distance of 15 m to prevent collisions. Our experiments were conducted on a computer equipped with an Intel i7-12700 (20-thread) processor and an NVIDIA GeForce RTX 3080 GPU, running MetaDrive 0.2.5, Python 3.7, PyTorch 1.13, and CUDA 10.2. The training period lasted for 2.5M environment steps with 1563 iterations, a learning rate of 3e-4, a training batch size of 1024, γ = 0.99, and λ = 0.95 for GAE.
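With the stated γ = 0.99 and λ = 0.95, GAE can be sketched as below. The function and variable names are illustrative assumptions; the paper uses GAE but does not list its update equations:

```python
# Sketch of Generalized Advantage Estimation with the paper's
# gamma = 0.99, lambda = 0.95 (assumption: names are illustrative).
def gae(rewards, values, gamma=0.99, lam=0.95):
    """values has len(rewards) + 1 entries (bootstrap value at the end)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                 # lambda-weighted sum
        advantages[t] = running
    return advantages

rewards = [1.0, 0.0, 1.0]
values = [0.5, 0.4, 0.3, 0.0]
adv = gae(rewards, values)
print(len(adv))  # one advantage estimate per transition
```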
B. Results
We compared the proposed algorithm to the baseline
algorithms IPPO[21] and MAPPO[22]. We used 4 start seeds
Fig. 4: a-c, Learning curves of success rate, route completion rate and reward for the three training methods: IPPO, MAPPO and our method. d, Performance of the network before and after the proposed changes. We used the original CoPO[10] algorithm with the improved reward function described in Section IV to make a comparison. e, Effect of the number of agents on the route completion rate.
for all algorithms, and each training trial consisted of 2.5M
steps. We evaluated the algorithms using five metrics: success
rate, route completion rate, crash rate, out rate, and reward.
Additionally, to assess the impact of concatenating network
inputs of different scales, we conducted experiments com-
paring the network before and after the proposed changes.
TABLE I: Comparison of different algorithms
Metrics IPPO MAPPO Ours
Success(%) 63.37±3.56 20.07±5.21 82.01±3.10
Route completion(%) 76.66±2.45 38.70±8.90 87.99±2.37
Crash(%) 20.01±0.95 22.14±3.17 14.69±0.71
Out(%) 40.47±5.72 58.11±4.01 19.18±4.71
Reward 1162.67±56.68 338.88±142.64 1217.62±64.90
As shown in Fig. 4a-4c, our method and IPPO have
similar performance in the early training period. However,
our method eventually converges to a higher performance
level with a stable success rate of approximately 0.8, which
is nearly 20% higher than IPPO, and 15% higher in the route
completion rate. Although the agents trained by our method
take longer to obtain rewards than those of IPPO, the final
reward values of the two are similar, suggesting that our
agents consider additional metrics besides maximizing the
reward function when updating their strategy. Notably, after 0.5M training steps, our agents achieve a higher success rate and route completion rate despite receiving fewer rewards than IPPO, indicating that they learn a more effective strategy to complete the task. Table I presents the results of the three algorithms; our proposed method achieves the best performance across all five metrics. It is worth noting that
MAPPO performs poorly on this task compared to the IPPO
algorithm. We speculate that this may be due to the fact
that although MAPPO uses a centralized critic to evalu-
ate the concatenation of neighbor agents’ observations, the
surrounding agents change frequently, leading to increased
input complexity, unstable training, and reduced efficiency,
resulting in poorer performance.
As depicted in Fig. 4d, our network improvement results
in a faster training speed, higher success rate, and better
convergence scores, illustrating the effectiveness of network
enhancement. We also conduct an investigation into the
impact of increasing the number of agents on performance,
as shown in Fig. 4e. As the number of agents increases, the
Fig. 5: Evaluation results on the test II scenario: success, route completion, crash, out and reward curves. Each evaluation takes 30 episodes.
TABLE II: Validation results on the three test scenarios
Scenarios Test I Test II Test III
Success(%) 74.67±4.34 80.67±3.90 77.81±4.97
Route completion(%) 83.26±2.68 88.64±2.36 85.31±3.27
Crash(%) 13.90±3.08 12.90±3.72 19.02±3.92
Out(%) 3.60±0.81 1.66±0.77 3.42±1.34
Reward 546.19±22.29 440.40±38.71 1075.13±98.27
convergence rate remains largely unchanged, and the route
completion rates remain above 0.8, indicating the scalability
of our method. Table II presents the evaluation results on
test scenarios, with Fig. 5 demonstrating that our proposed
method outperforms IPPO and MAPPO.
VI. CONCLUSIONS
In this paper, we presented a novel approach to address
cooperative driving issues using the MARL technique. We
developed a multi-scale feature fusion method to reduce
information loss and improve the representation ability of the
policy network. Additionally, we designed comprehensive
reward functions based on traffic efficiency, safety, comfort
and driving strategy. Compared with other baselines, we
validated the advantages of our approach on five general
metrics. In the future, we plan to focus on credit assignment
and communication in MARL and contribute to solutions
for hybrid traffic scenarios involving human drivers. We will also further explore the potential of this method for real-world applications.
APPENDIX
A video of the test scenario is available at:
https://www.bilibili.com/video/BV11L411y7ea/?vd_source=eb49c849410e667ba1f8ced7e0b03440
ACKNOWLEDGMENT
This work was supported by the National Natural Science Foundation of China for Young Scholars (62003328, 62073311), the Guangdong Basic and Applied Basic Research Foundation (2023A1515011813, 2020B515130004), and the Shenzhen Basic Key Research Project (JCYJ20200109115414354).
REFERENCES
[1] Kiran, B.R., Sobh, I., Talpaert, V., Mannion, P., Al Sallab, A.A., Yogamani, S. and Pérez, P., 2021. Deep reinforcement learning for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems, 23(6), pp.4909-4926.
[2] Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L.,
Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T. and
Lillicrap, T., 2020. Mastering atari, go, chess and shogi by planning
with a learned model. Nature, 588(7839), pp.604-609.
[3] Perolat, J., De Vylder, B., Hennes, D., Tarassov, E., Strub, F., de Boer,
V., Muller, P., Connor, J.T., Burch, N., Anthony, T. and McAleer,
S., 2022. Mastering the game of Stratego with model-free multiagent
reinforcement learning. Science, 378(6623), pp.990-996.
[4] Vinyals, O., Babuschkin, I., Czarnecki, W.M., Mathieu, M., Dudzik,
A., Chung, J., Choi, D.H., Powell, R., Ewalds, T., Georgiev, P. and
Oh, J., 2019. Grandmaster level in StarCraft II using multi-agent
reinforcement learning. Nature, 575(7782), pp.350-354.
[5] Berner, C., Brockman, G., Chan, B., Cheung, V., Debiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C. and Józefowicz, R., 2019. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680.
[6] Sadigh, D., Landolfi, N., Sastry, S.S., Seshia, S.A. and Dragan, A.D.,
2018. Planning for cars that coordinate with people: leveraging effects
on human actions for planning and active information gathering over
human internal state. Autonomous Robots, 42, pp.1405-1426.
[7] Schwarting, W., Pierson, A., Alonso-Mora, J., Karaman, S. and Rus,
D., 2019. Social behavior for autonomous vehicles. Proceedings of the
National Academy of Sciences, 116(50), pp.24972-24978.
[8] Toghi, B., Valiente, R., Sadigh, D., Pedarsani, R. and Fallah, Y.P.,
2021, September. Cooperative autonomous vehicles that sympathize
with human drivers. In 2021 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS) (pp. 4517-4524). IEEE.
[9] Zhou, T., Wang, L., Chen, R., Wang, W. and Liu, Y., 2022.
Accelerating Reinforcement Learning for Autonomous Driving us-
ing Task-Agnostic and Ego-Centric Motion Skills. arXiv preprint
arXiv:2209.12072.
[10] Peng, Z., Li, Q., Hui, K.M., Liu, C. and Zhou, B., 2021. Learning
to simulate self-driven particles system with coordinated policy op-
timization. Advances in Neural Information Processing Systems, 34,
pp.10784-10797.
[11] Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W.M., Zambaldi,
V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J.Z., Tuyls, K.
and Graepel, T., 2017. Value-decomposition networks for cooperative
multi-agent learning. arXiv preprint arXiv:1706.05296.
[12] Rashid, T., Samvelyan, M., De Witt, C.S., Farquhar, G., Foerster, J. and
Whiteson, S., 2020. Monotonic value function factorisation for deep
multi-agent reinforcement learning. The Journal of Machine Learning
Research, 21(1), pp.7234-7284.
[13] Son, K., Kim, D., Kang, W.J., Hostallero, D.E. and Yi, Y., 2019, May.
Qtran: Learning to factorize with transformation for cooperative multi-
agent reinforcement learning. In International conference on machine
learning (pp. 5887-5896). PMLR.
[14] Lowe, R., Wu, Y.I., Tamar, A., Harb, J., Pieter Abbeel, O. and
Mordatch, I., 2017. Multi-agent actor-critic for mixed cooperative-
competitive environments. Advances in neural information processing
systems, 30.
[15] McClintock, C.G. and McNeel, S.P., 1966. Reward and score feedback
as determinants of cooperative and competitive game behavior. Journal
of Personality and Social Psychology, 4(6), p.606.
[16] Liebrand, W.B. and McClintock, C.G., 1988. The ring measure of
social values: A computerized procedure for assessing individual
differences in information processing and social value orientation.
European journal of personality, 2(3), pp.217-230.
[17] Li, Q., Peng, Z., Feng, L., Zhang, Q., Xue, Z. and Zhou, B., 2022.
Metadrive: Composing diverse driving scenarios for generalizable
reinforcement learning. IEEE transactions on pattern analysis and
machine intelligence.
[18] Toghi, B., Valiente, R., Sadigh, D., Pedarsani, R. and Fallah, Y.P.,
2021. Altruistic maneuver planning for cooperative autonomous
vehicles using multi-agent advantage actor-critic. arXiv preprint
arXiv:2107.05664.
[19] Han, S., Wang, H., Su, S., Shi, Y. and Miao, F., 2022, May. Stable
and efficient Shapley value-based reward reallocation for multi-agent
reinforcement learning of autonomous vehicles. In 2022 International
Conference on Robotics and Automation (ICRA) (pp. 8765-8771).
IEEE.
[20] Foerster, J., Farquhar, G., Afouras, T., Nardelli, N. and Whiteson, S.,
2018, April. Counterfactual multi-agent policy gradients. In Proceed-
ings of the AAAI conference on artificial intelligence (Vol. 32, No.
1).
[21] de Witt, C.S., Gupta, T., Makoviichuk, D., Makoviychuk, V., Torr,
P.H., Sun, M. and Whiteson, S., 2020. Is independent learning
all you need in the starcraft multi-agent challenge?. arXiv preprint
arXiv:2011.09533.
[22] Yu, C., Velu, A., Vinitsky, E., Wang, Y., Bayen, A. and Wu, Y., 2021.
The surprising effectiveness of ppo in cooperative, multi-agent games.
arXiv preprint arXiv:2103.01955.
[23] Liu, J., Li, H., Yang, Z., Dang, S. and Huang, Z., 2022. Deep Dense
Network-Based Curriculum Reinforcement Learning for High-Speed
Overtaking. IEEE Intelligent Transportation Systems Magazine.
[24] Li, K., Lei, B. and Li, H., 2020. Estimation of Road Geometric
Information for Congested Roads by Autonomous Driving. Journal
of Integration Technology, 9(5), pp.69-80.