A Data-Driven Packet Routing Algorithm for an Unmanned Aerial Vehicle Swarm: A Multi-Agent Reinforcement Learning Approach

Xiulin Qiu, Student Member, IEEE, Lei Xu, Member, IEEE, Ping Wang, Fellow, IEEE, Yuwang Yang, and Zhenqiang Liao
Abstract—Routing decisions made by unmanned aerial vehicle (UAV) swarms are affected by complex and dynamically changing topologies. A centralized routing algorithm imposes the entire computational burden on one module, and the high data dimensionality renders computation burdensome. In this letter, we develop a multi-agent reinforcement learning-based routing algorithm for a UAV swarm. The UAVs are trained in a data-driven manner to make distributed routing decisions. Factors that include channel quality, UAV movement, UAV overhead, and the extent of neighbor variation are incorporated into link quality assessment. Long short-term memory is used to improve the Actor and Critic networks, and more information on temporal continuity is added to facilitate adaptation to the dynamically changing environment. Simulations show that the proposed routing algorithm reduces data transmission delay and enhances the transmission rate compared with traditional routing algorithms.

Index Terms—UAV swarm network, multi-agent reinforcement learning, link quality assessment, data-driven routing, LSTM.
I. INTRODUCTION
An unmanned aerial vehicle (UAV) swarm is a distributed deployment of multiple UAVs that work together to accomplish a mission. Swarms are more reliable and accurate than a single UAV and cover more terrain when engaging in post-disaster search and rescue, battlefield strikes, environmental sensing, and target tracking. It is challenging to ensure connectivity, high throughput, and efficient self-adaptation [1] when UAVs move rapidly.
Artificial intelligence (AI)-empowered wireless communication technologies are believed to offer a range of new features such as traffic prediction, network optimization, and congestion control. With 5G standards gradually becoming
fixed, AI-empowered 6G will unlock the full potential of radio signals and enable a shift from cognitive radio (CR) to intelligent radio (IR) [2]. Traditional fixed-model-based communication design relies on algorithmic insights supplied by human experts, whereas data-driven techniques use machine learning to learn "good" communication policies from data sets and are more adaptable to topologically complex and dynamically changing environments. Data-driven routing methods applied to dynamic UAV swarm networks are therefore expected to further improve transmission performance.
Reinforcement learning (RL) is one of the main branches of AI; it acquires experience by exploring interactions with the environment [3], which makes it an emerging approach to routing decisions in dynamic environments. Jiang et al. [4] used Q-learning to develop adaptive UAV-assisted geographic routing. Mai et al. [5] developed a model-free, data-driven routing strategy by combining a graph neural network (GNN) with multi-agent RL. Kaviani et al. [6] presented a DeepCQ+ routing framework employing multi-agent deep RL (MADRL) that can handle varying numbers of nodes, data streams with different data rates and source/destination pairs, various mobility levels, and other dynamic topologies. Jin et al. [7] employed Q-learning reward function learning (RFLQGeo) to improve the performance and efficiency of unmanned robotic networks (URNs).
However, the topology of a UAV swarm network is complex and variable, and it is difficult to obtain global information through centralized networking. We therefore address the routing problem of UAV swarm networks in a distributed manner using multi-agent reinforcement learning. The proposed algorithm establishes a parameter exchange mechanism based on the multi-agent deep deterministic policy gradient (MADDPG) [11] in the training phase, while the execution phase is fully distributed. Each UAV uses an LSTM-based Actor network to capture the temporal continuity of the available information and the correlations between the information in each state.
The contributions of this letter are as follows. (1) We use the interactive learning capability of RL to cope with the dynamic environment and build a routing model with distributed execution. (2) A link quality assessment mechanism is proposed to alleviate the difficulty of collecting global topology information in dynamic environments. (3) We build Actor and Critic networks containing LSTM cells to further accelerate the convergence of the algorithm. (4) We analyze the computational complexity of the algorithm, and simulations demonstrate the advantages of the proposed routing algorithm.
II. ROUTING MODELING BASED ON MULTI-AGENT REINFORCEMENT LEARNING

A. The Network Model

Consider an area containing users of mobile communication devices in which multiple UAVs are deployed; the users try to communicate with base stations (BSs) outside the area through multi-UAV relays. This model applies, for example, to establishing communication between an affected area and users outside it when base stations are damaged. To ensure the user experience, the UAV swarm network needs to provide reliable, high-quality links. The UAV swarm network is viewed as a graph G(V, E), where V(G) is the set of nodes (all UAVs) and E(G) is the set of edges (all communication links). d_e(i, j) denotes the Euclidean distance between UAVs i and j, and the communication link e(i, j) exists only when d_c(i, j) > d_e(i, j), where d_c(i, j) is the communication distance.
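For illustration, the following minimal Python sketch (not part of the letter; helper names and sample coordinates are assumptions) builds the swarm graph G(V, E), keeping an edge e(i, j) only when the Euclidean distance d_e(i, j) is below the communication distance d_c.

```python
# Minimal sketch: build the swarm graph from UAV positions and the
# communication distance d_c, per the rule d_c(i, j) > d_e(i, j).
import math

def build_swarm_graph(positions, d_c):
    """positions: dict {uav_id: (x, y)}; d_c: communication distance (m)."""
    edges = set()
    for i, (xi, yi) in positions.items():
        for j, (xj, yj) in positions.items():
            if i >= j:
                continue
            d_e = math.hypot(xi - xj, yi - yj)
            if d_e < d_c:                       # link e(i, j) exists
                edges.add((i, j))
    return set(positions), edges

nodes, links = build_swarm_graph({0: (0, 0), 1: (120, 50), 2: (900, 900)}, d_c=250.0)
```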
B. The Channel Model

We use the channel model defined in [8] to describe UAV-to-UAV communication links and divide the network transmission bandwidth into K orthogonal subchannels. The subchannels are allocated to UAVs using the round-robin approach proposed in [9], in which every UAV visits each subchannel in a predefined sequential order to avoid transmission collisions with other UAVs. The channel is described by the Rician model. Suppose subchannel k is assigned to UAV i at time slot t; when UAV i transmits to UAV j over subchannel k, the received power at UAV j from UAV i is expressed as P^k_{i,j}(t). The Rician distribution is defined as

    p^k_\xi(d_{i,j}(t)) = \frac{d_{i,j}(t)}{\sigma_0^2} \exp\left(-\frac{d_{i,j}(t)^2 + \rho^2}{2\sigma_0^2}\right) I_0\left(\frac{d_{i,j}(t)\,\rho}{\sigma_0^2}\right)    (1)

where d_{i,j}(t) is the transmission distance between UAVs i and j, \rho and \sigma_0 reflect the strengths of the dominant and non-dominant paths, respectively, and I_0 is the modified Bessel function.
In case no dominant path exists (\rho = 0), the Rician fading reduces to Rayleigh fading, defined by

    p^k_\xi(d_{i,j}(t)) = \frac{d_{i,j}(t)}{\sigma_0^2} \exp\left(-\frac{d_{i,j}(t)^2}{2\sigma_0^2}\right)    (2)
The received power at UAV j from UAV i is expressed as

    P^k_{i,j}(t) = P_t G \, d_{i,j}(t)^{-\alpha} + p^k_\xi(d_{i,j}(t))    (3)

where G is the constant power gain factor introduced by the amplifier and antenna, P_t is the transmit power of a UAV, d_{i,j}(t) is the transmission distance between UAVs i and j, and (d_{i,j}(t))^{\alpha} is the path loss. The interference from UAV m to UAV j over subchannel k is represented as

    I^k_{m,UAV}(t) = \psi_{m,k}(t) \, P_t G \, d_{m,j}(t)^{-\alpha}    (4)

where \Psi(t) = [\psi_{m,k}(t)] is an N_l \times K binary UAV-to-UAV subchannel pairing matrix with \psi_{i,k}(t) = 1 when subchannel k is assigned to UAV i for UAV-to-UAV transmission and \psi_{i,k}(t) = 0 otherwise. N_l(t) = \{1, 2, \ldots, N_l(t)\} is the set of UAVs that perform UAV-to-UAV transmission links in time slot t. The SINR at UAV j over subchannel k is

    \gamma^k_{i,j}(t) = \frac{P_t G \, d_{i,j}(t)^{-\alpha} + p^k_\xi(d_{i,j}(t))}{\sigma^2 + \sum_{m=1, m \neq i}^{N_l} I^k_{m,UAV}(t)}    (5)
where \sigma^2 is the Gaussian noise variance. Therefore, the rate of data transmission from UAV i to UAV j in subchannel k is

    R^k_{i,j}(t) = b \log_2\left(1 + \gamma^k_{i,j}(t)\right)    (6)

where b is the channel bandwidth. The channel and power resources are allocated by the medium access control (MAC) protocol, while each UAV chooses the link with the highest signal-to-noise ratio for transmission over the allocated link.
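As an illustrative sketch (not the authors' code), the Python snippet below evaluates the received power, SINR, and rate of a single link following (3)-(6); the fading draw and all parameter values are assumptions made only for this example.

```python
# Sketch of Eqs. (3)-(6): received power with Rician fading, SINR, and rate.
import math
import random

def rician_fading(rho, sigma0):
    # Magnitude of a complex Gaussian with line-of-sight component rho.
    x = random.gauss(rho, sigma0)
    y = random.gauss(0.0, sigma0)
    return math.hypot(x, y)

def link_rate(d_ij, interference_sum, p_t=0.1, g=1.0, alpha=2.5,
              rho=1.0, sigma0=0.5, noise_var=1e-9, bandwidth=1e6):
    p_rx = p_t * g * d_ij ** (-alpha) + rician_fading(rho, sigma0)   # Eq. (3)
    sinr = p_rx / (noise_var + interference_sum)                      # Eq. (5)
    return bandwidth * math.log2(1.0 + sinr)                          # Eq. (6)

print(link_rate(d_ij=150.0, interference_sum=1e-8))
```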
The RL-based model is described below.
C. The Multi-Agent RL Model

Assume that the source node of a packet is UAV s and the destination node is UAV d; each UAV is an agent, and all nodes cooperate to complete packet transmission. The routing problem is a multi-agent Markov (stochastic) game characterized by the tuple (I, S, \{A_i\}_{i \in I}, p, \{r_i\}_{i \in I}), where I \triangleq \{1, \ldots, i, \ldots, I\} is the set of agents and S \triangleq \{S_1, \ldots, S_i, \ldots, S_I\} is the set of states of all agents, with S_i the state of the i-th agent. \{A_i\}_{i \in I} is the set of actions of the agents, p(s' \mid s, \{A_i\}_{i \in I}): S \times A_1 \times \cdots \times A_I \times S \to [0, 1] is the state transition probability of the system, where s' is the next state, and \{r_i\}_{i \in I} is the reward for the agents.

In the game, all UAVs try to find the path with the best link quality to maximize the value function v_\pi(s) = \mathbb{E}_\pi\left[r_{t+1} + \eta r_{t+2} + \eta^2 r_{t+3} + \ldots \mid S_t = s\right]. The discount factor \eta \in [0, 1] indicates the importance of future rewards. After all UAVs formulate optimal policies, the equilibrium policy for the whole system \{\pi_1^*, \ldots, \pi_I^*\} is the Nash policy [11] that obtains the optimal packet transmission path, where \pi_i^* is the optimal policy of the i-th UAV.
1) The State Space: The state of each UAV is private to that UAV, and we include the channel condition, indicated by the SINR, in the state. For a single UAV i, the state S_i(t) at time t contains two parts:
- the signal-to-noise ratio \gamma^k_{-i,i}(t), i.e., the signal impact of the other UAVs (denoted by UAV -i) on UAV i;
- the set of neighbors Ner(t) of UAV i, i.e., the set of UAVs that can establish a direct connection with UAV i.
Thus, the state of UAV i at time t is S_i(t) = <\gamma^k_{-i,i}, Ner>.
2) The Action Space: The action set of UAV i is the choice of which link to use to transmit packets. In the channel model, when the communication distance satisfies d_c(i, j) > d_e(i, j), UAV j can be a neighbor of UAV i and there is a possible communication link e(i, j). However, the instantaneous distance between the two UAVs alone cannot serve as the effective communication criterion; the link duration must also meet the communication requirements. We use the link expiration time (LET) defined in [10] and assume that UAVs i and j move at speeds v_i and v_j, respectively; the link expiration time is then expressed as

    LET_{i,j} = \frac{-(mn + cd) + \sqrt{(m^2 + c^2)\,r^2 - (md - nc)^2}}{m^2 + c^2}    (7)

where

    m = v_i \cos\theta_i - v_j \cos\theta_j, \quad n = x_i - x_j    (8)
    c = v_i \sin\theta_i - v_j \sin\theta_j, \quad d = y_i - y_j    (9)

(x_i, y_i) and (x_j, y_j) denote the positions of UAV i and UAV j, respectively, v_i and v_j denote the UAV velocities, \theta_i and \theta_j are the angles between the UAV velocities and the horizontal direction, and r is the communication range. Hence, if LET_{i,j} \geq T_{hello}, then e(i, j) \in a_i, where T_{hello} is the period at which the UAV performs the link handshake. Thus, the action space of the UAV is

    a_i \in A = \{\, e(i, j) \mid d_c(i, j) \geq d_e(i, j) \wedge LET_{i,j} \geq T_{hello} \,\}    (10)

where A is the set of actions of UAV i and is affected by the communication distances to the other UAVs.

Fig. 1. The UAV swarm network routing algorithm (a) and the improved MADDPG routing algorithm based on LSTM (b).
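As a concrete illustration of (7)-(10), the Python sketch below computes the LET between two UAVs and applies the hello-period test; the function names, guard conditions, and sample values are assumptions for this example only.

```python
# Sketch of Eqs. (7)-(10): link expiration time and the action-set test.
import math

def link_expiration_time(pi, pj, vi, vj, theta_i, theta_j, r):
    """pi, pj: (x, y) positions; vi, vj: speeds; theta: heading angles; r: range."""
    m = vi * math.cos(theta_i) - vj * math.cos(theta_j)   # Eq. (8)
    n = pi[0] - pj[0]
    c = vi * math.sin(theta_i) - vj * math.sin(theta_j)   # Eq. (9)
    d = pi[1] - pj[1]
    if m == 0 and c == 0:                                 # no relative motion
        return math.inf
    disc = (m**2 + c**2) * r**2 - (m * d - n * c)**2
    if disc < 0:
        return 0.0                                        # already out of range
    return (-(m * n + c * d) + math.sqrt(disc)) / (m**2 + c**2)   # Eq. (7)

let_ij = link_expiration_time((0, 0), (100, 40), vi=15, vj=12,
                              theta_i=0.3, theta_j=2.0, r=250)
# e(i, j) enters the action set A only if let_ij >= T_hello (Eq. (10)).
```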
3) The Reward Function: Our link quality evaluation measures link quality in several respects. The first measure is the channel metric based on the UAV channel model presented in Section II-B. The goal is to select a link whose signal-to-noise ratio is no lower than a predefined threshold:

    \gamma^k_{i,j}(t) = \frac{P_U \, d_{i,j}(t)^{-\alpha}}{\sigma^2 + \sum_{m=1, m \neq i}^{N_l} I^k_{m,UAV}(t)} \geq \gamma^k_{\min}    (11)

where \gamma^k_{\min} is the minimum signal-to-noise ratio of the UAV channel. If, over time slot t, the link channel of a UAV satisfies constraint (11), the UAV receives a reward r_{SINR}, defined as

    r_{SINR} = \begin{cases} b \log\left(1 + \gamma^k_{i,j}(t)\right) - v_m p_U(t), & \text{if } \gamma^k_{i,j}(t) \geq \gamma^k_{\min} \\ 0, & \text{otherwise} \end{cases}    (12)

where b is the channel bandwidth, p_U(t) is the transmission power, and v_m > 0 is a cost per unit level of power. The second part of the reward is

    r_{LET} = \begin{cases} \frac{LET_{i,j}}{\tau}, & \text{if } LET_{i,j} < \tau \\ 1, & \text{otherwise} \end{cases}    (13)

where \tau is the time interval of the position update. When LET_{i,j} < \tau, the LET is short and the reward from choosing this link is small, so r_{LET} < 1. Otherwise, the LET is long, but the reward cannot be increased without bound, so as long as the transmission demand is satisfied the reward is set to 1. The third component is the busyness of the selected neighboring node, expressed in terms of the queue length of packets waiting to be sent, n_{send}. Longer queues incur a larger penalty:

    r_{BUSY} = -\log(1 + n_{send})    (14)

where n_{send} is positive and the reward is inversely related to n_{send}: the more data a UAV has waiting to send, the greater the load on this link, so the reward decreases when an agent selects it. Thus, the total reward of UAV i at time t is

    r^t_i = \alpha \, r^t_{SINR} + \beta \, r^t_{LET} + \delta \, r^t_{BUSY}    (15)

where \alpha, \beta, \delta are the weights of the three components.
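A minimal sketch of the composite reward in (11)-(15) is given below; the weights, thresholds, and the normalized bandwidth are placeholders rather than values reported in this letter, and the sign of the busyness term follows the stated "larger penalty for longer queues".

```python
# Sketch of the composite reward, Eqs. (11)-(15).
import math

def total_reward(sinr, let_ij, n_send, b=1.0, p_u=0.1, v_m=0.01,
                 sinr_min=5.0, tau=1.0, alpha=0.5, beta=0.3, delta=0.2):
    # Eq. (12): channel-quality reward, paid only if the SINR constraint holds.
    r_sinr = b * math.log(1 + sinr) - v_m * p_u if sinr >= sinr_min else 0.0
    # Eq. (13): link-stability reward, saturating at 1.
    r_let = let_ij / tau if let_ij < tau else 1.0
    # Eq. (14): busyness penalty based on the neighbor's send-queue length.
    r_busy = -math.log(1 + n_send)
    # Eq. (15): weighted sum.
    return alpha * r_sinr + beta * r_let + delta * r_busy
```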
We propose two measures to cope with situations in which local information is unavailable. First, during training, the agent's dataset should cover cases of missing local data so that the algorithm has experience with missing data at execution time. Second, cross-layer techniques should be applied in this scenario to increase access to local data.
III. THE ROUTING ALGORITHM BASED ON IMPROVED MADDPG WITH LSTM

Figure 1(a) shows the 10 steps of the routing algorithm. In steps 1 to 3, the LSTM-based Actor net outputs an action based on the current state. In steps 4 to 7, the agent performs the action, obtains a reward, and evolves to a new state. The agent stores the state, action, reward, and new state in the buffer and then samples from the buffer, using the target network parameters to calculate y, which is used for Critic net loss minimization. In steps 8 to 10, the LSTM-based Critic net parameters are updated using the loss function defined in (18), and the LSTM-based Actor net parameters are updated using the gradient ascent algorithm.
We fit the policy function and the action value function using the MADDPG [11] algorithm. The policy is represented by the function \pi = \{\pi_1, \pi_2, \ldots, \pi_I\} with parameters \theta = \{\theta_1, \theta_2, \ldots, \theta_I\}. UAV routing is a time-varying decision problem; traditional routing is prone to frequent failures as the topology changes because information on network state changes in the time domain is not fully utilized. We replace the Critic and target networks of the traditional MADDPG algorithm with LSTM, which can leverage time-domain information between UAV states to reduce the possibility of routing failure and improve the accuracy of routing.
As shown in Figure 1(b), we build Actor and Critic networks containing LSTM cells. The input of the Actor net \pi_i of the i-th UAV at time t is the information on the historical states and historical actions of all UAVs encoded by the LSTM, together with the current observed state s^i_t of UAV i. The output of the Actor net is the current action a^i_t taken by the UAV, so a^i_t = \pi_i(h_{t-1}, s^i_t), where h_{t-1} is the LSTM output at time t-1. The Actor's objective is thus

    J(\theta_i) = \mathbb{E}_{h_{t-1}, s_t}\left[ Q(h_{t-1}, s_t, a) \,\big|_{a = \mu_i(h_{t-1}, s_t)} \right]    (16)

Q(h_{t-1}, s_t, a) is the action value function of the UAV, with Q^{\pi}(h_{t-1}, s_t, a) = \mathbb{E}_{\tau \sim \pi}\left[ R(t) \mid S = s_t, A = a \right], where R(t) is the cumulative reward obtained at time t and afterwards, R(t) = r_{t+1} + \eta r_{t+2} + \eta^2 r_{t+3} + \ldots = \sum_{k=0}^{\infty} \eta^k r_{t+k+1}. The gradient is

    \nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{h_{t-1}, s_t}\left[ \nabla_a Q(h_{t-1}, s_t, a) \,\big|_{a = \mu_i(h_{t-1}, s_t)} \, \nabla_{\theta_i} \pi_i(h_{t-1}, s^i_t) \right]    (17)
The action value function Q_i(S, a_1, a_2, \ldots, a_I) is fitted by a network with parameters \omega = \{\omega_1, \omega_2, \ldots, \omega_I\}. A_t is the joint action (a^1_t, \ldots, a^I_t) of all UAVs at time t, and h_t is the LSTM output at time t. The Critic net is updated to minimize the loss defined as follows:

    Loss(\omega_i) = \mathbb{E}_{h_{t-1}, s_t}\left[ \left( Q^{\omega_i}_i(h_{t-1}, s_t, a) - y_i \right)^2 \right]    (18)

where y_i is the target value derived from the target network,

    y_i = r^i_t + \eta \, Q^{\omega'_i}_i\left(h_t, s_{t+1}, a'\right)\Big|_{a' = \pi'_i(h_t, s^i_{t+1})}    (19)

where r^i_t denotes the instantaneous reward of the i-th UAV in the current state, \theta' is the Actor target network parameter, \omega' is the Critic target network parameter, \eta is the discount factor of the cumulative reward, and h_t is the output of the LSTM at time t.
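To make the LSTM-based Actor/Critic structure and the target computation of (16)-(19) more tangible, a hedged PyTorch sketch is given below. The layer sizes, the way the hidden state h_{t-1} is threaded through, and the single-agent simplification of the joint action are assumptions made for this example, not the authors' implementation.

```python
# Illustrative sketch: LSTM-based Actor, centralized Critic, and the Critic
# loss/target of Eqs. (18)-(19). Shapes: states (batch, state_dim),
# rewards (batch, 1); h_prev / h_t are (h, c) tuples from the LSTM cell.
import torch
import torch.nn as nn

class LSTMActor(nn.Module):
    """a_t = pi_i(h_{t-1}, s_t): one LSTM-cell step per routing decision."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.cell = nn.LSTMCell(state_dim, hidden)
        self.head = nn.Linear(hidden, action_dim)

    def forward(self, s_t, h_prev=None):
        h_t, c_t = self.cell(s_t, h_prev)              # h_prev = (h_{t-1}, c_{t-1})
        return torch.softmax(self.head(h_t), dim=-1), (h_t, c_t)

class LSTMCritic(nn.Module):
    """Q_i(h_{t-1}, s_t, a): critic over the (joint) action."""
    def __init__(self, state_dim, joint_action_dim, hidden=64):
        super().__init__()
        self.cell = nn.LSTMCell(state_dim + joint_action_dim, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, s_t, joint_a_t, h_prev=None):
        h_t, c_t = self.cell(torch.cat([s_t, joint_a_t], dim=-1), h_prev)
        return self.head(h_t), (h_t, c_t)

def critic_loss(critic, target_critic, target_actor,
                s_t, joint_a_t, r_t, s_next, h_prev, h_t, eta=0.95):
    """Eqs. (18)-(19): y = r + eta * Q'(h_t, s_{t+1}, pi'(h_t, s_{t+1}))."""
    with torch.no_grad():
        a_next, _ = target_actor(s_next, h_t)
        # Single-agent illustration: the joint action is just this agent's action.
        q_next, _ = target_critic(s_next, a_next, h_t)
        y = r_t + eta * q_next
    q, _ = critic(s_t, joint_a_t, h_prev)
    return nn.functional.mse_loss(q, y)
```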
As shown in Algorithm 1, the LSTM state is first initialized as h_0 and the actions of each UAV are generated according to the policy. After the actions are executed to obtain rewards, the LSTM state becomes h_1 and the information is stored in the experience pool R. The LSTM helps alleviate the problem that the current task cannot access relevant information from long before it, so it is well suited to the fast-changing wireless communication environment.
Computational complexity analysis: The total number of parameters in each LSTM network is M = 4 n_c n_c + 4 n_i n_c + n_c n_o + 3 n_c, where n_c, n_i, n_o are the numbers of memory cells, input neuron units, and output neuron units, respectively. The computational complexity of the standard LSTM network is therefore O(M). Assume the batch size is B and the number of training episodes is H. The time complexity of LSTM training is then approximately O(M \cdot H \cdot B). The computational complexity of the proposed MADDPG-LSTM routing algorithm for a system containing I UAVs is about O(M \cdot H \cdot B \cdot I \cdot D_s) + O(M \cdot H \cdot B \cdot I \cdot (D_a + D_s)), where D_a is the dimension of the action space and D_s is the dimension of the state space. Compared with a centralized routing algorithm, the proposed algorithm alleviates the computational power required by each UAV and reduces routing overhead.
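A quick worked example of the parameter count M follows; the layer sizes are illustrative assumptions, not the configuration used in this letter.

```python
# M = 4*nc^2 + 4*ni*nc + nc*no + 3*nc for one LSTM network.
def lstm_param_count(nc, ni, no):
    return 4 * nc * nc + 4 * ni * nc + nc * no + 3 * nc

print(lstm_param_count(nc=64, ni=10, no=8))   # -> 19648 parameters
```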
Algorithm 1 The Processing of the Routing Algorithm in a UAV

Initialize: For all UAVs, initialize the Actor online network parameters \theta = \{\theta_1, \theta_2, \ldots, \theta_I\} and Critic online network parameters \omega = \{\omega_1, \omega_2, \ldots, \omega_I\}, the replay buffer R, the Actor target network parameters \theta' \leftarrow \theta, and the Critic target network parameters \omega' \leftarrow \omega.
 1: for episode = 1 to M do
 2:   Initialize empty history h_0 and state s = \{s_1, s_2, \ldots, s_I\}
 3:   while time t < T and s \neq \emptyset do
 4:     for agent i = 1 to I do
 5:       Generate the action to be performed by UAV i according to a^i_t = \pi_i(h_{t-1}, s^i_t);
 6:       Get the reward r^i_t and the new observation s_{t+1} from the environment;
 7:       Update the LSTM according to h_t = LSTM(h_{t-1}, [S_{t-1}, A_{t-1}]).
 8:     end for
 9:   end while
10:   Store the episode \{h, s^i, a^i, r^i\} in R
11:   for agent i = 1 to I do
12:     Sample a random episode \{h_1, s^i_1, a^i_1, r^i_1, h_2, s^i_2, a^i_2, r^i_2, \ldots\}
13:     for time t = T down to 1 do
14:       Calculate y using the target network parameters by (19)
15:       Update the Critic online network by minimizing the loss (18)
16:       Update the Actor online network by the policy gradient (17)
17:     end for
18:   end for
19:   Update the target networks:
        \theta'_i \leftarrow \tau \theta_i + (1 - \tau) \theta'_i
        \omega'_i \leftarrow \tau \omega_i + (1 - \tau) \omega'_i
20: end for
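The soft target-network update in step 19 can be sketched as follows, assuming PyTorch modules for the online and target networks and an assumed small \tau (e.g., 0.01).

```python
# Sketch of the soft update in step 19 of Algorithm 1.
def soft_update(online_net, target_net, tau=0.01):
    for p, p_targ in zip(online_net.parameters(), target_net.parameters()):
        p_targ.data.copy_(tau * p.data + (1.0 - tau) * p_targ.data)
```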
IV. EVALUATION

DGATR [5] is a multi-agent RL approach combined with a graph neural network for routing, and DeepCQ+ [6] trains DNNs through interaction with acknowledgment (ACK) packets and solves the routing problem of dynamic mobile ad hoc networks (MANETs) based on multi-agent RL. Although these two routing methods perform well in scenarios with variable network size, data rate, and mobility, they do not take channel quality and node busyness into consideration. We simulated the proposed algorithm using the NS3-Gym [12] tool and compared DGATR, DeepCQ+, MADDPG, and our MADDPG-LSTM in terms of delay and the data delivery rate. The former is the time difference between a packet being sent by the source and received by the destination node; the latter is the number of packets received by the destination node divided by the number sent by the source node.
NS3-Gym enables data exchange between NS3 and OpenAI Gym [13] based on ZMQ (ZeroMQ, a library used to implement messaging between two applications), and it can effectively simulate communication protocols driven by RL methods.
TABLE I. SIMULATION PARAMETERS
Fig. 2. Simulation results.
The dataset used for training is obtained from the execution environment through the interface functions provided by NS3-Gym for passing status, reward, action, and end-flag information. The states, actions, and rewards are combined into a vector {s_j, a_j, r_j} and deposited into the experience pool R to form the dataset. To make the simulation more realistic, we used the Gauss-Markov mobility model to generate random UAV motions. The network load was controlled by varying the constant bit rate (CBR); we also varied the UAV speeds, the number of UAVs, and other parameters. The simulation parameters are shown in Table I.
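A hedged sketch of how training transitions could be collected through the NS3-Gym interface is shown below; the constructor is left at its defaults, the random action stands in for the LSTM-based Actor, and the exact ns3gym API should be checked against [12].

```python
# Sketch: collecting {s_j, a_j, r_j} transitions via the NS3-Gym bridge [12].
from ns3gym import ns3env

env = ns3env.Ns3Env()                 # connects to the running NS3 simulation
replay_pool = []                      # experience pool R

obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()        # placeholder for a_t = pi_i(h_{t-1}, s_t)
    next_obs, reward, done, info = env.step(action)
    replay_pool.append((obs, action, reward, next_obs))
    obs = next_obs
env.close()
```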
Figure 2(a) shows that the transmission delay increases as the load increases. With DGATR, the delay is about 160 ms when CBR = 1,000 kbps. Our MADDPG-LSTM exhibits a shorter delay (about 130 ms), similar to that of the traditional MADDPG with fully connected neural networks. Figures 2(b) and (c) show that the delay increases as the number of UAVs and the speed of the UAVs increase, respectively. Our MADDPG-LSTM outperforms the other methods, indicating that it adapts well to the dynamically changing and complex network topology.

Figure 2(d) shows that the delivery rate decreases as the network load increases. MADDPG-LSTM performs better than the other methods under all conditions. As the network load increases to 1,000 kbps, the data delivery rates of all methods still exceed 65%, and MADDPG-LSTM achieves the highest rate, nearly 80%. Figures 2(e) and (f) show that as UAV numbers and speeds increase, the topological complexity and variability rise, negatively impacting performance.
V. CONCLUSION

We have presented a multi-agent RL-based UAV swarm routing algorithm. Each UAV is viewed as an agent that uses local information to make routing decisions via a link quality assessment mechanism. We have used LSTM to replace the fully connected layers of the Actor and Critic networks of the MADDPG algorithm to improve the ability of the routing protocol to adapt to dynamic environments. NS3-Gym simulation has shown the superiority of our MADDPG-LSTM algorithm in terms of transmission delay and delivery rate. In future work, we plan to further refine the routing strategies for different UAV swarms to make data-driven routing more intelligent.
REFERENCES

[1] L. Hong, H. Guo, J. Liu, and Y. Zhang, "Toward swarm coordination: Topology-aware inter-UAV routing optimization," IEEE Trans. Veh. Technol., vol. 69, no. 9, pp. 10177-10187, Sep. 2020.
[2] S. Dang, O. Amin, B. Shihada, and M.-S. Alouini, "What should 6G be?" Nat. Electron., vol. 3, no. 1, pp. 20-29, 2020.
[3] J. Aznar-Poveda, A.-J. Garcia-Sanchez, E. Egea-Lopez, and J. Garcia-Haro, "Approximate reinforcement learning to control beaconing congestion in distributed networks," Sci. Rep., vol. 12, no. 1, pp. 1-11, 2022.
[4] S. Jiang, Z. Huang, and Y. Ji, "Adaptive UAV-assisted geographic routing with Q-learning in VANET," IEEE Commun. Lett., vol. 25, no. 4, pp. 1358-1362, Apr. 2021.
[5] X. Mai, Q. Fu, and Y. Chen, "Packet routing with graph attention multi-agent reinforcement learning," 2021, arXiv:2107.13181.
[6] S. Kaviani et al., "Robust and scalable routing with multi-agent deep reinforcement learning for MANETs," 2021, arXiv:2101.03273.
[7] W. Jin, R. Gu, and Y. Ji, "Reward function learning for Q-learning-based geographic routing protocol," IEEE Commun. Lett., vol. 23, no. 7, pp. 1236-1239, Jul. 2019.
[8] S. Zhang, H. Zhang, B. Di, and L. Song, "Cellular UAV-to-X communications: Design and optimization for multi-UAV networks," IEEE Trans. Wireless Commun., vol. 18, no. 2, pp. 1346-1359, Feb. 2019.
[9] Q. Zhao, B. Krishnamachari, and K. Liu, "On myopic sensing for multi-channel opportunistic access: Structure, optimality, and performance," IEEE Trans. Wireless Commun., vol. 7, no. 12, pp. 5431-5440, Dec. 2008.
[10] B. Sliwa, C. Schüler, M. Patchou, and C. Wietfeld, "PARRoT: Predictive ad-hoc routing fueled by reinforcement learning and trajectory knowledge," in Proc. IEEE 93rd Veh. Technol. Conf. (VTC-Spring), 2021, pp. 1-7.
[11] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, "Multi-agent actor-critic for mixed cooperative-competitive environments," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 6379-6390.
[12] P. Gawłowicz and A. Zubow, "ns3-gym: Extending OpenAI Gym for networking research," 2018, arXiv:1810.03943.
[13] G. Brockman et al., "OpenAI Gym," 2016, arXiv:1606.01540.