2160 IEEE WIRELESS COMMUNICATIONS LETTERS, VOL. 11, NO. 10, OCTOBER 2022
A Data-Driven Packet Routing Algorithm for an Unmanned Aerial Vehicle
Swarm: A Multi-Agent Reinforcement Learning Approach
Xiulin Qiu, Student Member, IEEE, Lei Xu, Member, IEEE, Ping Wang, Fellow, IEEE,
Yuwang Yang, and Zhenqiang Liao
Abstract—Routing decisions made by unmanned aerial vehi-
cle (UAV) swarms are affected by complex and dynamically
changing topologies. A centralized routing algorithm imposes
the entire computational burden on one module, and the high
data dimensionality renders computation burdensome. In this
letter, we develop a multi-agent reinforcement learning-based
routing algorithm for a UAV swarm. The UAVs are trained
in a data-driven manner to make distributed routing decisions.
Factors that include channel quality, UAV movement, UAV over-
head, and the extent of neighbor variation are incorporated into
link quality assessment. Long short-term memory is used to
improve the Actor and Critic networks, and more information
on temporal continuity is added to facilitate adaptation to the
dynamically changing environment. Simulations show that the
proposed routing algorithm reduces data transmission delay and
enhances the transmission rate compared with traditional routing
algorithms.
Index Terms—UAV swarm network, multi-agent reinforcement
learning, link quality assessment, data-driven routing, LSTM.
I. INTRODUCTION
AN UNMANNED aerial vehicle (UAV) swarm is a distributed deployment of multiple UAVs that work together to accomplish a mission. Swarms are more reliable and accurate than a single UAV and cover more terrain when engaging in post-disaster search and rescue, battlefield strikes, environmental sensing, and target tracking. It is challenging to ensure connectivity, high throughput, and efficient self-adaptation [1] when UAVs move rapidly.
Artificial Intelligence (AI) empowered wireless commu-
nication technologies are believed to offer a range of new
features such as traffic prediction, network optimization and
congestion control. With 5G standards gradually becoming
Manuscript received 11 June 2022; revised 11 July 2022; accepted 27
July 2022. Date of publication 3 August 2022; date of current version
7 October 2022. This work was supported in part by the National Natural
Science Foundation of China under Grant 61973161, Grant 61991404, and
Grant 61773206; and in part by SZKITI under Grant SYG201826. The
associate editor coordinating the review of this article and approving it for
publication was G. Chen. (Corresponding author: Yuwang Yang.)
Xiulin Qiu, Lei Xu, and Yuwang Yang are with the School of
Computer Science and Engineering, Nanjing University of Science
and Technology, Nanjing 210094, China (e-mail: qiuxiulin@njust.edu.cn;
xulei_marcus@126.com; yuwangyang@njust.edu.cn).
Ping Wang is with the Department of Electrical Engineering and Computer
Science, Lassonde School of Engineering, York University, Toronto, ON M3J
1P3, Canada (e-mail: pingw@yorku.ca).
Zhenqiang Liao is with the School of Electrical and Mechanical
Engineering, Suzhou Global Institute of Software Technology, Suzhou
215163, China (e-mail: zqliao1013@126.com).
Digital Object Identifier 10.1109/LWC.2022.3195963
fixed, AI empowered 6G will unlock the full potential of
radio signals and enable a shift from cognitive radio (CR)
to intelligent radio (IR) [2]. Traditional fixed-model-based communication design relies on the application of algorithmic insights by human experts, while data-driven techniques use machine learning to learn "good" communication policies from data sets and are more adaptable to topologically complex and dynamically changing environments. Data-driven routing methods applied to dynamic UAV swarm networks are therefore expected to further improve transmission performance.
Reinforcement learning (RL) is one of the main branches of AI, in which an agent acquires experience by exploring interactions with the environment [3], so RL is an emerging approach for making routing decisions in dynamic environments. Jiang et al. [4] used Q-Learning to develop adaptive UAV-assisted geographic routing. Mai et al. [5] developed a model-free, data-driven routing strategy by combining a graph neural network (GNN) with multi-agent RL. Kaviani et al. [6] presented the DeepCQ+ routing framework employing multi-agent deep RL (MADRL), which can handle different numbers of nodes, data streams with different data rates and source/destination pairs, various mobility levels, and other dynamic topologies. Jin et al. [7] employed Q-learning reward function learning (RFLQGeo) to improve the performance and efficiency of unmanned robotic networks (URNs).
However, the topology of UAV swarm networks is complex and variable, and it is difficult to obtain global information through centralized networking. We address the routing problem of UAV swarm networks in a distributed way, using multi-agent reinforcement learning. The proposed algorithm establishes a parameter exchange mechanism based on the multi-agent deep deterministic policy gradient (MADDPG) [11] in the training phase, while the execution phase can be fully distributed. Each UAV uses an LSTM-based Actor net to obtain all available information on temporal continuity and the correlations between the information in each state.
The contributions of this letter are as follows. (1) We use
the interactive learning capability of RL to cope with the
dynamic environment and build a routing model with dis-
tributed execution. (2) A link quality assessment mechanism is
proposed to alleviate the difficulty of collecting global topol-
ogy information in dynamic environments. (3) We build the
Actor net and Critic net containing LSTM cells to further
accelerate the convergence of the algorithm. (4) We analyze
the computational complexity of the algorithm, and simu-
lations demonstrate the advantages of the proposed routing
algorithm.
2162-2345 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: NANJING UNIVERSITY OF SCIENCE AND TECHNOLOGY. Downloaded on October 08,2022 at 13:30:00 UTC from IEEE Xplore. Restrictions apply.
QIU et al.: DATA-DRIVEN PACKET ROUTING ALGORITHM FOR AN UAV SWARM 2161
II. ROUTING MODELING BASED ON MULTI-AGENT
REINFORCEMENT LEARNING
A. The Network Model
Consider an area containing users of mobile communication devices, where multiple UAVs are deployed, and these users try to communicate with base stations (BS) outside the area through multi-UAV relays. This model applies, for example, to establishing communication between an affected area and users outside it when base stations are damaged. To ensure the users' experience, the UAV swarm network needs to provide reliable, high-quality links. The UAV swarm network is viewed as a graph $G(V,E)$, where $V(G)$ is the set of nodes (all UAVs) and $E(G)$ is the set of edges (all communication links); $d\_e(i,j)$ denotes the Euclidean distance between UAVs $i$ and $j$, and the communication link $e(i,j)$ exists only when $d\_c(i,j) > d\_e(i,j)$, where $d\_c(i,j)$ is the communication distance.
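As a concrete illustration of this neighborhood rule, the following sketch builds the edge set $E(G)$ from UAV positions; the coordinates and the shared communication range are illustrative assumptions, not values from the letter.

```python
import math

# Hypothetical UAV positions (metres); names and numbers are illustrative.
positions = {1: (0.0, 0.0), 2: (120.0, 50.0), 3: (400.0, 300.0)}
d_c = 250.0  # communication distance d_c(i, j), assumed identical for all pairs

def euclidean(p, q):
    """d_e(i, j): Euclidean distance between two UAVs."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Edge e(i, j) exists only when d_c(i, j) > d_e(i, j).
edges = {(i, j) for i in positions for j in positions
         if i < j and d_c > euclidean(positions[i], positions[j])}

print(edges)  # only the UAV pairs within communication range
```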
B. The Channel Model
We use the channel model defined in [8] to describe UAV-to-UAV communication links and divide the network transmission bandwidth into $K$ orthogonal subchannels. The subchannels are allocated to UAVs using the round-robin approach proposed in [9], where every UAV visits each subchannel in a predefined sequential order to avoid transmission collisions with others. We use the Rician model to describe the channel. Suppose subchannel $k$ is assigned to UAV $i$ at time slot $t$; when UAV $i$ transmits to UAV $j$ over subchannel $k$, the received power at UAV $j$ from UAV $i$ is expressed as $P^k_{i,j}(t)$. The Rician distribution is defined as
$$p^k_\xi(d_{i,j}(t)) = \frac{d_{i,j}(t)}{\sigma_0^2}\exp\left(-\frac{d_{i,j}(t)^2-\rho^2}{2\sigma_0^2}\right) I_0\left(\frac{d_{i,j}(t)\,\rho}{\sigma_0^2}\right) \quad (1)$$
where $d_{i,j}(t)$ is the transmission distance between UAVs $i$ and $j$, $\rho$ and $\sigma_0$ reflect the strength of the dominant and non-dominant paths, respectively, and $I_0$ is the modified Bessel function.
In case no dominant path exists ($\rho = 0$), the Rician fading reduces to Rayleigh fading, defined by
$$p^k_\xi(d_{i,j}(t)) = \frac{d_{i,j}(t)}{\sigma_0^2}\exp\left(-\frac{d_{i,j}(t)^2}{2\sigma_0^2}\right) \quad (2)$$
The received power at UAV $j$ from UAV $i$ is expressed as
$$P^k_{i,j}(t) = P_t G\, d_{i,j}(t)^{-\alpha} + p^k_\xi(d_{i,j}(t)) \quad (3)$$
where $G$ is the constant power gain factor introduced by the amplifier and antenna, $P_t$ is the transmit power of a UAV, $d_{i,j}(t)$ is the transmission distance between UAVs $i$ and $j$, and $d_{i,j}(t)^{-\alpha}$ is the path loss. The interference from UAV $m$ to UAV $j$ over subchannel $k$ is represented as
$$I^k_{m,\mathrm{UAV}}(t) = \psi_{m,k}(t)\, P_t G\, d_{m,j}(t)^{-\alpha} \quad (4)$$
where $\Psi(t) = [\psi_{m,k}(t)]$ is an $N_l \times K$ binary UAV-to-UAV subchannel pairing matrix with $\psi_{i,k}(t) = 1$ when subchannel $k$ is assigned to UAV $i$ for UAV-to-UAV transmission, and $\psi_{i,k}(t) = 0$ otherwise. $\mathcal{N}_l(t) = \{1, 2, \ldots, N_l(t)\}$ is the set of UAVs that perform UAV-to-UAV transmissions in time slot $t$. The SINR at UAV $j$ over subchannel $k$ is given by
$$\gamma^k_{i,j}(t) = \frac{P_t G\, d_{i,j}(t)^{-\alpha} + p^k_\xi(d_{i,j}(t))}{\sigma^2 + \sum_{m=1,\,m\neq i}^{N_l} I^k_{m,\mathrm{UAV}}(t)} \quad (5)$$
where $\sigma^2$ is the Gaussian noise variance. Therefore, the rate of data transmission from UAV $i$ to UAV $j$ in subchannel $k$ is
$$R^k_{i,j}(t) = b\log_2\left(1+\gamma^k_{i,j}(t)\right) \quad (6)$$
where $b$ is the channel bandwidth. The channel and power resources are allocated by the Medium Access Control (MAC) protocol, while each UAV chooses the link with the highest signal-to-noise ratio for transmission over the allocated link.
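To make the link-rate computation of (3)-(6) concrete, here is a minimal sketch; all numeric constants (powers, bandwidth, path-loss exponent, distances) are illustrative assumptions, and the small-scale fading term is set to zero for simplicity.

```python
import math

# Illustrative values; symbols follow Eqs. (3)-(6), numbers are assumptions.
P_t    = 0.1     # transmit power of a UAV (W)
G      = 1.0     # constant power gain factor
alpha  = 2.0     # path-loss exponent
b      = 1e6     # channel bandwidth (Hz)
sigma2 = 1e-9    # Gaussian noise variance
p_xi   = 0.0     # small-scale fading term p^k_xi, neglected here for clarity

def received_power(d):
    # Eq. (3): P^k_{i,j}(t) = P_t * G * d^{-alpha} + p^k_xi(d)
    return P_t * G * d ** (-alpha) + p_xi

def sinr(d, interference):
    # Eq. (5): received power over noise plus co-channel interference
    return received_power(d) / (sigma2 + interference)

def rate(d, interference):
    # Eq. (6): R^k_{i,j}(t) = b * log2(1 + gamma)
    return b * math.log2(1 + sinr(d, interference))

# A 100 m link with one interferer 300 m away on the same subchannel
I_k = P_t * G * 300.0 ** (-alpha)   # Eq. (4) with psi_{m,k} = 1
print(f"{rate(100.0, I_k) / 1e6:.2f} Mbit/s")
```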
The RL-based model is described below.
C. The Multi-Agent RL Model
Assume that the source node of a packet is UAV $s$ and the destination node is UAV $d$; each UAV is an agent, and all nodes cooperate to complete packet transmission. The routing problem is a multi-agent Markov (stochastic) game characterized by the tuple $(\mathcal{I}, \mathcal{S}, \{A_i\}_{i\in\mathcal{I}}, p, \{r_i\}_{i\in\mathcal{I}})$, where $\mathcal{I} \triangleq \{1, \ldots, i, \ldots, I\}$ is the set of agents, $\mathcal{S} \triangleq \{S_1, \ldots, S_i, \ldots, S_I\}$ is the set of states of all agents with $S_i$ the state of the $i$-th agent, $\{A_i\}_{i\in\mathcal{I}}$ is the set of actions of the agents, $p(s' \mid s, \{A_i\}_{i\in\mathcal{I}}) : \mathcal{S} \times A_1 \times \cdots \times A_I \times \mathcal{S}$ is the state transition probability of the system with $s'$ the next state, and $\{r_i\}_{i\in\mathcal{I}}$ is the set of agent rewards.
In the game, all UAVs try to find the path with the best link quality to maximize the value function $v_\pi(s) = \mathbb{E}_\pi\left[r_{t+1} + \eta r_{t+2} + \eta^2 r_{t+3} + \cdots \mid S_t = s\right]$. The discount factor $\eta \in [0,1]$ indicates the importance of future rewards. After all UAVs formulate optimal policies, the equilibrium policy for the whole system $\{\pi_1^*, \ldots, \pi_I^*\}$ is the Nash policy [11] that obtains the optimal packet transmission path, where $\pi_i^*$ is the optimal policy of the $i$-th UAV.
1) The State Space: The state of each UAV is private to that UAV, and we include the channel condition, indicated by the SINR, in the state. For a single UAV $i$, the state $S_i(t)$ at time $t$ contains two parts:
• The signal-to-noise ratio $\gamma^k_{i,-i}(t)$, which is the signal impact of the other UAVs on UAV $i$, denoted by $-i$.
• The set of neighbors $Ner(t)$ of UAV $i$, i.e., the set of UAVs that can establish a direct connection with UAV $i$.
Thus, the state of UAV $i$ at time $t$ is $S_i(t) = \langle \gamma^k_{i,-i}, Ner \rangle$.
2) The Action Space: The action of UAV $i$ is to choose which link to use to transmit packets. In the channel model, when the communication distance satisfies $d\_c(i,j) > d\_e(i,j)$, UAV $j$ can be a neighbor of UAV $i$, and there is a possible communication link $e(i,j)$. However, the instantaneous distance between the two UAVs alone is not a sufficient criterion; the link duration must also meet the communication requirements. We use the link expiration time (LET) defined in [10]. Assume that UAVs $i$ and $j$ move at speeds $v_i$ and $v_j$, respectively; the link expiration time is then expressed as
$$\mathrm{LET}_{i,j} = \frac{\sqrt{(m^2+c^2)r^2 - (md-nc)^2} - (mn+cd)}{m^2+c^2} \quad (7)$$
where
$$m = v_i\cos\theta_i - v_j\cos\theta_j, \qquad n = x_i - x_j \quad (8)$$
Fig. 1. The UAV swarm network routing algorithm (a) and the improved MADDPG routing algorithm based on LSTM (b).
$$c = v_i\sin\theta_i - v_j\sin\theta_j, \qquad d = y_i - y_j \quad (9)$$
$(x_i, y_i)$ and $(x_j, y_j)$ denote the positions of UAV $i$ and UAV $j$, respectively, $v_i$ and $v_j$ denote the velocities of the UAVs, and $\theta_i$ and $\theta_j$ are the angles between the UAV velocities and the horizontal direction. Hence, if $\mathrm{LET}_{i,j} \ge T_{hello}$, then $e(i,j) \in a_i$, where $T_{hello}$ is the period in which the UAV performs the link handshake. Thus, the action space of the UAV is
$$a_i \in A = \{e(i,j) \mid d\_c(i,j) \ge d\_e(i,j) \cap \mathrm{LET}_{i,j} \ge T_{hello}\} \quad (10)$$
where $A$ is the set of actions of UAV $i$ and is affected by the communication distance to the other UAVs.
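A small sketch of the LET computation of (7)-(9) and the $T_{hello}$ test used by (10) follows; the positions, speeds, radio range $r$, and handshake period are illustrative assumptions.

```python
import math

def let(xi, yi, vi, ti, xj, yj, vj, tj, r):
    """Link expiration time, Eqs. (7)-(9); angles in radians, r = radio range."""
    m = vi * math.cos(ti) - vj * math.cos(tj)   # Eq. (8)
    n = xi - xj
    c = vi * math.sin(ti) - vj * math.sin(tj)   # Eq. (9)
    d = yi - yj
    denom = m * m + c * c
    if denom == 0:
        return float("inf")  # identical velocities: the link never expires
    disc = denom * r * r - (m * d - n * c) ** 2
    if disc < 0:
        return 0.0           # the UAVs are never simultaneously in range
    return (math.sqrt(disc) - (m * n + c * d)) / denom  # Eq. (7)

T_hello = 1.0  # handshake period (assumed)
# Two UAVs 100 m apart closing head-on at 20 m/s each, radio range 250 m
t = let(0, 0, 20, 0, 100, 0, 20, math.pi, 250)
print(t, t >= T_hello)  # the link enters the action space when LET >= T_hello
```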
3) The Reward Function: Our link quality evaluation measures link quality in several aspects. The first measure is the channel metric, using the channel model of the UAV presented in Section II-B. The goal is to select a link whose signal-to-noise ratio is no lower than a predefined threshold:
$$\gamma^k_{i,j}(t) = \frac{P_U\, d_{i,j}(t)^{-\alpha}}{\sigma^2 + \sum_{m=1,\,m\neq i}^{N_l} I^k_{m,\mathrm{UAV}}(t)} \ge \gamma^k_{min} \quad (11)$$
where $\gamma^k_{min}$ is the minimum signal-to-noise ratio of the UAV channel. If, over time slot $t$, the link channel of a UAV satisfies constraint (11), the UAV receives a reward $r_{SINR}$, defined as
$$r_{SINR} = \begin{cases} b\log\left(1+\gamma^k_{i,j}(t)\right) - v_m p_U(t), & \text{if } \gamma^k_{i,j}(t) \ge \gamma^k_{min} \\ 0, & \text{otherwise} \end{cases} \quad (12)$$
where $b$ is the channel bandwidth, $p_U(t)$ is the transmission power, and $v_m > 0$ is a cost per unit level of power. The second part of the reward is
$$r_{LET} = \begin{cases} \frac{\mathrm{LET}_{i,j}}{\tau}, & \mathrm{LET}_{i,j} < \tau \\ 1, & \text{otherwise} \end{cases} \quad (13)$$
where $\tau$ is the time interval of the position update. When $\mathrm{LET}_{i,j} < \tau$, the LET is short and the reward for choosing this link is small, so $r_{LET} < 1$. Otherwise the LET is long, but the reward cannot grow without bound, so as long as the transmission demand is satisfied the reward is set to 1. The third component is the busyness of the selected neighboring node, expressed in terms of the queue length of packets to be sent, $n_{send}$. A longer queue incurs a larger penalty:
$$r_{BUSY} = -\log(1 + n_{send}) \quad (14)$$
where $n_{send}$ is positive and the reward decreases as $n_{send}$ grows. The more data a UAV has queued to send, the greater the load on that link, so the reward is reduced when an agent selects it. Thus, the total reward of UAV $i$ at time $t$ is
$$r^t_i = \alpha r^t_{SINR} + \beta r^t_{LET} + \delta r^t_{BUSY} \quad (15)$$
where $\alpha$, $\beta$, $\delta$ are the weights of the three components.
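The three reward components and their weighted sum in (12)-(15) can be sketched as follows; the weights, thresholds, and sample inputs are illustrative assumptions, not values from the letter.

```python
import math

# Weights alpha, beta, delta of Eq. (15); the values are assumptions.
ALPHA, BETA, DELTA = 0.5, 0.3, 0.2
b, v_m, tau, gamma_min = 1.0, 0.01, 1.0, 2.0  # illustrative constants

def r_sinr(gamma, p_u):
    # Eq. (12): rate-style reward minus a power cost, zero below the threshold
    return b * math.log(1 + gamma) - v_m * p_u if gamma >= gamma_min else 0.0

def r_let(let_ij):
    # Eq. (13): proportional below the position-update interval, capped at 1
    return let_ij / tau if let_ij < tau else 1.0

def r_busy(n_send):
    # Eq. (14): penalty growing with the neighbor's send-queue length
    return -math.log(1 + n_send)

def total_reward(gamma, p_u, let_ij, n_send):
    # Eq. (15): weighted sum of the three components
    return (ALPHA * r_sinr(gamma, p_u)
            + BETA * r_let(let_ij)
            + DELTA * r_busy(n_send))

print(total_reward(gamma=5.0, p_u=0.1, let_ij=8.75, n_send=3))
```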
We propose two measures to cope with situations in which local information is unavailable. First, the agent's training dataset should traverse cases of missing local data, so that the algorithm has experience with missing data at execution time. Second, cross-layer techniques can be applied in this scenario to increase access to local data.
III. THE ROUTING ALGORITHM BASED ON IMPROVED
MADDPG WITH LSTM
Figure 1(a) shows the 10 steps of the routing algorithm. In steps 1 to 3, the LSTM-based Actor net outputs an action based on the current state. In steps 4 to 7, the agent performs the action, obtains a reward, and evolves to a new state. The agent stores the state, action, reward, and new state in the buffer, then samples from the buffer and uses the target network parameters to calculate $y$, which is used for Critic net loss minimization. In steps 8 to 10, the LSTM-based Critic net parameters are updated using the loss function defined in (18), and the LSTM-based Actor net parameters are updated using gradient ascent.
We fit the policy function and the action-value function using the MADDPG [11] algorithm. The policy is represented by the functions $\pi = \{\pi_1, \pi_2, \ldots, \pi_I\}$ with parameters $\theta = \{\theta_1, \theta_2, \ldots, \theta_I\}$. UAV routing is a time-varying decision problem; traditional routing is prone to frequent failures as the topology changes because information on network state changes in the time domain is not fully utilized. We replace the Critic and target networks of the traditional MADDPG algorithm with LSTM networks that can leverage time-domain information between UAV states to reduce the possibility of routing failure and improve routing accuracy.
Authorized licensed use limited to: NANJING UNIVERSITY OF SCIENCE AND TECHNOLOGY. Downloaded on October 08,2022 at 13:30:00 UTC from IEEE Xplore. Restrictions apply.
As shown in Figure 1(b), we build the Actor net and Critic net containing LSTM cells, where the input of the Actor net $\pi_i$ of the $i$-th UAV at time $t$ is the information on the historical states and historical actions of all UAVs encoded by the LSTM, together with the current observed state $s^i_t$ of UAV $i$. The output of the Actor net is the current action taken by the UAV, $a^i_t$; thus $a^i_t = \pi_i(h_{t-1}, s^i_t)$, where $h_{t-1}$ is the LSTM output at time $t-1$. The Actor's objective is thus
$$J(\theta_i) = \mathbb{E}_{h_{t-1},s_t}\left[Q(h_{t-1}, s_t, a)\big|_{a=\mu_i(h_{t-1}, s_t)}\right] \quad (16)$$
$Q(h_{t-1}, s_t, a)$ is the action-value function of the UAV, and $Q_\pi(h_{t-1}, s_t, a) = \mathbb{E}_{\tau\sim\pi}\left[R(t) \mid S=s_t, A=a\right]$, where $R(t)$ is the cumulative reward obtained at time $t$ and afterwards, $R(t) = r_{t+1} + \eta r_{t+2} + \eta^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \eta^k r_{t+k+1}$. The gradient is
$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{h_{t-1},s_t}\left[\nabla_a Q(h_{t-1}, s_t, a)\big|_{a=\mu_i(h_{t-1}, s_t)}\,\nabla_{\theta_i}\pi_i(h_{t-1}, s^i_t)\right] \quad (17)$$
The action-value function $Q_i(S, a_1, a_2, \ldots, a_I)$ is fitted by a network with parameters $\omega = \{\omega_1, \omega_2, \ldots, \omega_I\}$. $A_t$ is the joint action distribution $(a^1_t, \ldots, a^I_t)$ of all UAVs at time $t$, and $h_t$ is the LSTM output at time $t$. The Critic net is updated to minimize the loss defined as follows:
$$Loss(\omega_i) = \mathbb{E}_{h_{t-1},s_t}\left[\left(Q^{\omega_i}_i(h_{t-1}, s_t, a) - y_i\right)^2\right] \quad (18)$$
where $y_i$ is the target value derived from the target network,
$$y_i = r^i_t + \eta\, Q^{\omega'_i}_i\left(h_t, s_{t+1}, \pi'_i\!\left(h_t, s^i_{t+1} \mid \theta'_i\right)\right) \quad (19)$$
where $r^i_t$ denotes the instantaneous reward of the $i$-th UAV in the current state, $\theta'$ is the Actor target network parameter, $\omega'$ is the Critic target network parameter, $\eta$ is the discount factor of the cumulative reward, and $h_t$ is the output of the LSTM at time $t$.
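The target computation of (19) and the per-sample loss inside (18) can be illustrated with scalar placeholders; the numbers below, and the stand-in values for the target and online Critic outputs, are assumptions for illustration only.

```python
# TD target of Eq. (19) for one agent, with scalar placeholders:
# q_target stands in for Q^{omega'}_i evaluated at the target Actor's action.
eta = 0.95          # discount factor (assumed)
r_t = 0.9           # instantaneous reward r^i_t (assumed)
q_target = 2.0      # target Critic's value at (h_t, s_{t+1}, pi'_i(...))

y_i = r_t + eta * q_target           # Eq. (19)
q_online = 2.5                       # online Critic's current estimate (assumed)
loss = (q_online - y_i) ** 2         # Eq. (18), before taking the expectation
print(y_i, loss)
```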
As shown in Algorithm 1, the LSTM parameters are first initialized as $h_0$ and the actions of each UAV are generated according to the policy. After the actions are executed and rewards obtained, the LSTM state becomes $h_1$ and the information is stored in the experience pool $R$. The LSTM helps alleviate the difficulty that the current decision cannot otherwise access relevant information from long before it, so it is well suited to the fast-changing wireless communication environment.
Computational complexity analysis: The total number of parameters in each LSTM network is $M = 4n_c n_c + 4n_i n_c + n_c n_o + 3n_c$, where $n_c$, $n_i$, $n_o$ are the numbers of memory cells, input neuron units, and output neuron units, respectively. So the computational complexity of the standard LSTM network is $O(M)$. Assume the batch size is $B$ and the number of training episodes is $H$; the time complexity of LSTM training is then approximately $O(M \cdot H \cdot B)$. The computational complexity of the proposed MADDPG-LSTM routing algorithm for a system containing $I$ UAVs is about $O(M \cdot H \cdot B \cdot I \cdot D_s) + O(M \cdot H \cdot B \cdot I \cdot (D_a + D_s))$, where $D_a$ is the dimension of the action space and $D_s$ is the dimension of the state space. Compared with a centralized routing algorithm, the proposed algorithm alleviates the computational power required by each UAV and reduces routing overhead.
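The parameter count $M$ above is straightforward to evaluate; the network sizes below are illustrative assumptions chosen only to show the arithmetic.

```python
def lstm_params(n_c, n_i, n_o):
    # M = 4*n_c*n_c + 4*n_i*n_c + n_c*n_o + 3*n_c, as in the complexity analysis
    return 4 * n_c * n_c + 4 * n_i * n_c + n_c * n_o + 3 * n_c

# Illustrative sizes (assumed): 64 memory cells, 10 state inputs, 8 actions
M = lstm_params(n_c=64, n_i=10, n_o=8)
B, H = 32, 1000                # batch size and training episodes (assumed)
training_cost = M * H * B      # the O(M*H*B) term, per network
print(M, training_cost)
```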
Algorithm 1 The Processing of the Routing Algorithm in a UAV
Initialize: For all UAVs, initialize Actor online network parameters $\theta = \{\theta_1, \theta_2, \ldots, \theta_I\}$ and Critic online network parameters $\omega = \{\omega_1, \omega_2, \ldots, \omega_I\}$, replay buffer $R$, Actor target network parameters $\theta' \leftarrow \theta$, Critic target network parameters $\omega' \leftarrow \omega$.
1: for episode $= 1$ to $M$ do
2:   Initialize empty history $h_0$, state $s = \{s_1, s_2, \ldots, s_I\}$
3:   while time $t < T$ and $s \neq \emptyset$ do
4:     for agent $i = 1$ to $I$ do
5:       Generate the action to be performed by UAV $i$ according to $a^i_t = \pi_i(h_{t-1}, s^i_t)$;
6:       Get reward $r^i_t$ and new observation $s_{t+1}$ from the environment;
7:       Update the LSTM according to $h_t = \mathrm{LSTM}(h_{t-1}, [S_{t-1}, A_{t-1}])$.
8:     end for
9:   end while
10:  Store the episode $\{h, s_i, a_i, r_i\}$
11:  for agent $i = 1$ to $I$ do
12:    Sample a random episode $\{h_1, s^i_1, a^i_1, r^i_1, h_2, s^i_2, a^i_2, r^i_2, \ldots\}$
13:    for time $t = T$ down to 1 do
14:      Calculate $y$ by (19) using the target network parameters
15:      Update the Critic online network by minimizing the loss (18)
16:      Update the Actor online network by the policy gradient (17)
17:    end for
18:  end for
19:  Update the target networks:
     $\theta'_i \leftarrow \tau\theta_i + (1-\tau)\theta'_i$
     $\omega'_i \leftarrow \tau\omega_i + (1-\tau)\omega'_i$
20: end for
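The soft target-network update of step 19 can be sketched as below, with parameters represented as plain lists of floats standing in for network weight tensors; the update rate and the weight values are assumptions.

```python
# Soft target-network update from step 19 of Algorithm 1. Note tau here is
# the update rate of step 19, distinct from the position-update interval
# used in Eq. (13).
TAU = 0.01  # update rate (assumed)

def soft_update(online, target, tau=TAU):
    # theta' <- tau*theta + (1 - tau)*theta', applied element-wise
    return [tau * o + (1 - tau) * t for o, t in zip(online, target)]

theta_online = [1.0, 2.0, 3.0]
theta_target = [0.0, 0.0, 0.0]
theta_target = soft_update(theta_online, theta_target)
print(theta_target)  # each weight moves 1% of the way toward the online value
```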
IV. EVALUATION
DGATR [5] is a multi-agent RL approach combined with a graph neural network for routing, and DeepCQ+ [6] trains DNNs by exchanging acknowledgment (ACK) packets and solves the routing problem of dynamic mobile ad-hoc networks (MANETs) based on multi-agent RL. Although these two routing methods perform well in scenarios with variable network size, data rate, and mobility, they do not take channel quality and node busyness into consideration. We simulated the proposed algorithm using the NS3-Gym [12] tool and compared DGATR, DeepCQ+, MADDPG, and our MADDPG-LSTM in terms of delay and data delivery rate. The former is the time difference between a packet being sent by the source and received by the destination node, and the latter is the number of packets received by the destination node divided by the number sent by the source node.
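The two metrics just defined can be computed from a per-packet log as follows; the log entries are hypothetical values for illustration only.

```python
# Delay and delivery-rate metrics as defined above, computed from a
# hypothetical per-packet log: (send_time, recv_time or None if lost).
log = [(0.00, 0.13), (0.10, 0.26), (0.20, None), (0.30, 0.44)]

delivered = [(s, r) for s, r in log if r is not None]
delivery_rate = len(delivered) / len(log)                  # received / sent
mean_delay = sum(r - s for s, r in delivered) / len(delivered)

print(f"delivery rate {delivery_rate:.0%}, mean delay {mean_delay * 1000:.0f} ms")
```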
NS3-Gym enables data communication between NS3 and OpenAI Gym [13] based on ZMQ (ZeroMQ, a library used to implement messaging and communication between two applications). The system effectively simulates communication protocols based on RL methods. The dataset
TABLE I
SIMULATION PARAMETERS
Fig. 2. Simulation results.
used for training is obtained from the execution environment through the interface functions provided by NS3-Gym for passing status, reward, action, and end-flag information. The states, actions, and rewards are combined into a vector $\{s_j, a_j, r_j\}$ and deposited into the experience pool $R$ to form the dataset. To be more realistic, we used the Gauss-Markov mobility model to generate random UAV motions. The network load was controlled by varying the constant bit rate (CBR); we also varied the UAV speeds, the number of UAVs, and other parameters. The simulation parameters are shown in Table I.
Figure 2(a) shows that the transmission delay increases as the load increases. With DGATR, the delay is about 160 ms when CBR = 1,000 kbps. Our MADDPG-LSTM exhibits a shorter delay (about 130 ms), similar to that of the traditional MADDPG using a fully connected neural network. Figures 2(b) and (c) show that the delay increases as the number of UAVs and the speed of the UAVs increase, respectively. Our MADDPG-LSTM outperforms the other methods, indicating that it adapts well to the dynamically changing and complex network topology.
Figure 2(d) shows that the delivery rate decreases as the
network load increases. MADDPG-LSTM performs better
than the other methods under all conditions. As the network
load increases to 1,000 kbps, the data delivery rates of all
methods exceed 65%, and MADDPG-LSTM achieves the
highest, i.e., nearly 80%. Figures 2(e) and (f) show that as
UAV numbers and speeds increase, the topological complexity
and variability rise, negatively impacting performance.
V. CONCLUSION
We have presented a multi-agent RL-based UAV swarm
routing algorithm. Each UAV is viewed as an agent that uses
local information to make routing decisions via a link qual-
ity assessment mechanism. We have used LSTM to replace
the fully connected layers of the Actor and Critic networks of
the MADDPG algorithm to improve the ability of the routing
protocol to adapt to dynamic environments. NS3-Gym sim-
ulation has shown the superiority of our MADDPG-LSTM
algorithm in terms of transmission delay and delivery rate. In future work, we plan to further refine the routing strategies for different UAV swarms to make data-driven routing more intelligent.
REFERENCES
[1] L. Hong, H. Guo, J. Liu, and Y. Zhang, “Toward swarm coordina-
tion: Topology-aware inter-UAV routing optimization,” IEEE Trans. Veh.
Technol., vol. 69, no. 9, pp. 10177–10187, Sep. 2020.
[2] S. Dang, O. Amin, B. Shihada, and M.-S. Alouini, “What should 6G
be?” Nat. Electron., vol. 3, no. 1, pp. 20–29, 2020.
[3] J. Aznar-Poveda, A.-J. Garcia-Sanchez, E. Egea-Lopez, and
J. Garcia-Haro, “Approximate reinforcement learning to control
beaconing congestion in distributed networks,” Sci. Rep., vol. 12, no. 1,
pp. 1–11, 2022.
[4] S. Jiang, Z. Huang, and Y. Ji, “Adaptive UAV-assisted geographic rout-
ing with Q-learning in VANET,” IEEE Commun. Lett., vol. 25, no. 4,
pp. 1358–1362, Apr. 2021.
[5] X. Mai, Q. Fu, and Y. Chen, “Packet routing with graph attention multi-
agent reinforcement learning,” 2021, arXiv:2107.13181.
[6] S. Kaviani et al., “Robust and scalable routing with multi-agent deep
reinforcement learning for MANETs,” 2021, arXiv:2101.03273.
[7] W. Jin, R. Gu, and Y. Ji, “Reward function learning for Q-learning-
based geographic routing protocol,” IEEE Commun. Lett., vol. 23, no. 7,
pp. 1236–1239, Jul. 2019.
[8] S. Zhang, H. Zhang, B. Di, and L. Song, “Cellular UAV-to-X communi-
cations: Design and optimization for multi-UAV networks,” IEEE Trans.
Wireless Commun., vol. 18, no. 2, pp. 1346–1359, Feb. 2019.
[9] Q. Zhao, B. Krishnamachari, and K. Liu, “On myopic sensing for multi-
channel opportunistic access: structure, optimality, and performance,”
IEEE Trans. Wireless Commun., vol. 7, no. 12, pp. 5431–5440,
Dec. 2008.
[10] B. Sliwa, C. Schüler, M. Patchou, and C. Wietfeld, “PARRoT: Predictive
ad-hoc routing fueled by reinforcement learning and trajectory knowl-
edge,” in Proc. IEEE 93rd Veh. Technol. Conf. (VTC-Spring), 2021,
pp. 1–7.
[11] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-
agent actor–critic for mixed cooperative-competitive environments,” in
Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 6379–6390.
[12] P. Gawłowicz and A. Zubow, “ns3-gym: Extending OpenAI gym for
networking research,” 2018, arXiv:1810.03943.
[13] G. Brockman et al., “OpenAI gym,” 2016, arXiv:1606.01540.