2160 IEEE WIRELESS COMMUNICATIONS LETTERS, VOL. 11, NO. 10, OCTOBER 2022
A Data-Driven Packet Routing Algorithm for an Unmanned Aerial Vehicle
Swarm: A Multi-Agent Reinforcement Learning Approach
Xiulin Qiu, Student Member, IEEE, Lei Xu, Member, IEEE, Ping Wang, Fellow, IEEE,
Yuwang Yang, and Zhenqiang Liao
Abstract—Routing decisions made by unmanned aerial vehi-
cle (UAV) swarms are affected by complex and dynamically
changing topologies. A centralized routing algorithm imposes
the entire computational burden on one module, and the high
data dimensionality renders computation burdensome. In this
letter, we develop a multi-agent reinforcement learning-based
routing algorithm for a UAV swarm. The UAVs are trained
in a data-driven manner to make distributed routing decisions.
Factors that include channel quality, UAV movement, UAV over-
head, and the extent of neighbor variation are incorporated into
link quality assessment. Long short-term memory is used to
improve the Actor and Critic networks, and more information
on temporal continuity is added to facilitate adaptation to the
dynamically changing environment. Simulations show that the
proposed routing algorithm reduces data transmission delay and
enhances the transmission rate compared with traditional routing
algorithms.
Index Terms—UAV swarm network, multi-agent reinforcement
learning, link quality assessment, data-driven routing, LSTM.
I. INTRODUCTION
AN UNMANNED aerial vehicle (UAV) swarm is a distributed deployment of multiple UAVs that work together to accomplish a mission. Swarms are more reliable and accurate than a single UAV and cover more terrain when engaging in post-disaster search and rescue, battlefield strikes, environmental sensing, and target tracking. It is challenging to ensure connectivity, high throughput, and efficient self-adaptation [1] when UAVs move rapidly.
Artificial Intelligence (AI) empowered wireless commu-
nication technologies are believed to offer a range of new
features such as traffic prediction, network optimization and
congestion control. With 5G standards gradually becoming
Manuscript received 11 June 2022; revised 11 July 2022; accepted 27
July 2022. Date of publication 3 August 2022; date of current version
7 October 2022. This work was supported in part by the National Natural
Science Foundation of China under Grant 61973161, Grant 61991404, and
Grant 61773206; and in part by SZKITI under Grant SYG201826. The
associate editor coordinating the review of this article and approving it for
publication was G. Chen. (Corresponding author: Yuwang Yang.)
Xiulin Qiu, Lei Xu, and Yuwang Yang are with the School of
Computer Science and Engineering, Nanjing University of Science
and Technology, Nanjing 210094, China (e-mail: qiuxiulin@njust.edu.cn;
xulei_marcus@126.com; yuwangyang@njust.edu.cn).
Ping Wang is with the Department of Electrical Engineering and Computer
Science, Lassonde School of Engineering, York University, Toronto, ON M3J
1P3, Canada (e-mail: pingw@yorku.ca).
Zhenqiang Liao is with the School of Electrical and Mechanical
Engineering, Suzhou Global Institute of Software Technology, Suzhou
215163, China (e-mail: zqliao1013@126.com).
Digital Object Identifier 10.1109/LWC.2022.3195963
fixed, AI empowered 6G will unlock the full potential of
radio signals and enable a shift from cognitive radio (CR)
to intelligent radio (IR) [2]. Traditional fixed-model-based communication design relies on the application of algorithmic insights by human experts, while data-driven techniques use machine learning to learn "good" communication policies from data sets and are more adaptable to topologically complex and dynamically changing environments. Data-driven routing methods applied to dynamic UAV swarm networks are therefore expected to further improve transmission performance.
Reinforcement learning (RL) is one of the main branches of AI, in which an agent acquires experience by exploring interactions with the environment [3], so RL is an emerging approach for making routing decisions in dynamic environments. Jiang et al. [4] used Q-Learning to develop adaptive UAV-assisted geographic routing. Mai et al. [5] developed a model-free, data-driven routing strategy by combining a graph neural network (GNN) with multi-agent RL. Kaviani et al. [6] presented the DeepCQ+ routing framework employing multi-agent deep RL (MADRL), which can handle different numbers of nodes, data streams with different data rates and source/destination pairs, various mobility levels, and other dynamic topologies. Jin et al. [7] employed Q-learning reward function learning (RFLQGeo) to improve the performance and efficiency of unmanned robotic networks (URNs).
However, the topology of UAV swarm networks is complex and variable, and it is difficult to obtain global information through centralized networking. We address the routing problem of UAV swarm networks in a distributed way, using multi-agent reinforcement learning. The proposed algorithm establishes a parameter exchange mechanism based on the multi-agent deep deterministic policy gradient (MADDPG) [11] in the training phase, while the execution phase can be fully distributed. Each UAV uses an LSTM-based Actor net to obtain all available information on temporal continuity and the correlations between the information in each state.
The contributions of this letter are as follows. (1) We use
the interactive learning capability of RL to cope with the
dynamic environment and build a routing model with dis-
tributed execution. (2) A link quality assessment mechanism is
proposed to alleviate the difficulty of collecting global topol-
ogy information in dynamic environments. (3) We build the
Actor net and Critic net containing LSTM cells to further
accelerate the convergence of the algorithm. (4) We analyze
the computational complexity of the algorithm, and simu-
lations demonstrate the advantages of the proposed routing
algorithm.
2162-2345 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: NANJING UNIVERSITY OF SCIENCE AND TECHNOLOGY. Downloaded on October 08,2022 at 13:30:00 UTC from IEEE Xplore. Restrictions apply.
QIU et al.: DATA-DRIVEN PACKET ROUTING ALGORITHM FOR AN UAV SWARM 2161
II. ROUTING MODELING BASED ON MULTI-AGENT
REINFORCEMENT LEARNING
A. The Network Model
Consider an area containing users of mobile communication devices, where multiple UAVs are deployed, and these users try to communicate with base stations (BS) outside the area through multi-UAV relays. This model applies, for example, to establishing communication between an affected area and users outside it when base stations are damaged. To ensure the users' experience, the UAV swarm network needs to provide reliable, high-quality links. The UAV swarm network is viewed as a graph $G(V,E)$, where $V(G)$ is the set of nodes (all UAVs) and $E(G)$ is the set of edges (all communication links); $d\_e(i,j)$ denotes the Euclidean distance between UAVs $i$ and $j$, and the communication link $e(i,j)$ exists only when $d\_c(i,j) > d\_e(i,j)$, where $d\_c(i,j)$ is the communication distance.
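As a concrete illustration of this neighborhood rule, the following sketch builds the edge set $E(G)$ from UAV positions; the coordinates and the shared communication range are illustrative assumptions, not values from the letter.

```python
import math

# Hypothetical UAV positions (metres); names and numbers are illustrative.
positions = {1: (0.0, 0.0), 2: (120.0, 50.0), 3: (400.0, 300.0)}
d_c = 250.0  # communication distance d_c(i, j), assumed identical for all pairs

def euclidean(p, q):
    """d_e(i, j): Euclidean distance between two UAVs."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Edge e(i, j) exists only when d_c(i, j) > d_e(i, j).
edges = {(i, j) for i in positions for j in positions
         if i < j and d_c > euclidean(positions[i], positions[j])}

print(edges)  # only the UAV pairs within communication range
```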
B. The Channel Model
We use the channel model defined in [8] to describe UAV-to-UAV communication links and divide the network transmission bandwidth into $K$ orthogonal subchannels. The subchannels are allocated to UAVs using the round-robin approach proposed in [9], where every UAV visits each subchannel in a predefined sequential order to avoid transmission collisions with others. We use the Rician model to describe the channel. Suppose subchannel $k$ is assigned to UAV $i$ at time slot $t$; when UAV $i$ transmits to UAV $j$ over subchannel $k$, the received power at UAV $j$ from UAV $i$ is expressed as $P^k_{i,j}(t)$. The Rician distribution is defined as
$$p^k_\xi(d_{i,j}(t)) = \frac{d_{i,j}(t)}{\sigma_0^2}\exp\left(-\frac{d_{i,j}(t)^2-\rho^2}{2\sigma_0^2}\right) I_0\left(\frac{d_{i,j}(t)\,\rho}{\sigma_0^2}\right) \quad (1)$$
where $d_{i,j}(t)$ is the transmission distance between UAVs $i$ and $j$, $\rho$ and $\sigma_0$ reflect the strength of the dominant and non-dominant paths, respectively, and $I_0$ is the modified Bessel function.
In case no dominant path exists ($\rho = 0$), the Rician fading reduces to Rayleigh fading, defined by
$$p^k_\xi(d_{i,j}(t)) = \frac{d_{i,j}(t)}{\sigma_0^2}\exp\left(-\frac{d_{i,j}(t)^2}{2\sigma_0^2}\right) \quad (2)$$
The received power at UAV $j$ from UAV $i$ is expressed as
$$P^k_{i,j}(t) = P_t G\, d_{i,j}(t)^{-\alpha} + p^k_\xi(d_{i,j}(t)) \quad (3)$$
where $G$ is the constant power gain factor introduced by the amplifier and antenna, $P_t$ is the transmit power of a UAV, $d_{i,j}(t)$ is the transmission distance between UAVs $i$ and $j$, and $d_{i,j}(t)^{-\alpha}$ is the path loss. The interference from UAV $m$ to UAV $j$ over subchannel $k$ is represented as
$$I^k_{m,\mathrm{UAV}}(t) = \psi_{m,k}(t)\, P_t G\, d_{m,j}(t)^{-\alpha} \quad (4)$$
where $\Psi(t) = [\psi_{m,k}(t)]$ is an $N_l \times K$ binary UAV-to-UAV subchannel pairing matrix with $\psi_{i,k}(t) = 1$ when subchannel $k$ is assigned to UAV $i$ for UAV-to-UAV transmission, and $\psi_{i,k}(t) = 0$ otherwise. $\mathcal{N}_l(t) = \{1, 2, \ldots, N_l(t)\}$ is the set of UAVs that perform UAV-to-UAV transmissions in time slot $t$. The SINR at UAV $j$ over subchannel $k$ is given by
$$\gamma^k_{i,j}(t) = \frac{P_t G\, d_{i,j}(t)^{-\alpha} + p^k_\xi(d_{i,j}(t))}{\sigma^2 + \sum_{m=1,\,m\neq i}^{N_l} I^k_{m,\mathrm{UAV}}(t)} \quad (5)$$
where $\sigma^2$ is the Gaussian noise variance. Therefore, the rate of data transmission from UAV $i$ to UAV $j$ in subchannel $k$ is
$$R^k_{i,j}(t) = b\log_2\left(1+\gamma^k_{i,j}(t)\right) \quad (6)$$
where $b$ is the channel bandwidth. The channel and power resources are allocated by the Medium Access Control (MAC) protocol, while each UAV chooses the link with the highest signal-to-noise ratio for transmission over the allocated link.
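To make the link-rate computation of (3)-(6) concrete, here is a minimal sketch; all numeric constants (powers, bandwidth, path-loss exponent, distances) are illustrative assumptions, and the small-scale fading term is set to zero for simplicity.

```python
import math

# Illustrative values; symbols follow Eqs. (3)-(6), numbers are assumptions.
P_t    = 0.1     # transmit power of a UAV (W)
G      = 1.0     # constant power gain factor
alpha  = 2.0     # path-loss exponent
b      = 1e6     # channel bandwidth (Hz)
sigma2 = 1e-9    # Gaussian noise variance
p_xi   = 0.0     # small-scale fading term p^k_xi, neglected here for clarity

def received_power(d):
    # Eq. (3): P^k_{i,j}(t) = P_t * G * d^{-alpha} + p^k_xi(d)
    return P_t * G * d ** (-alpha) + p_xi

def sinr(d, interference):
    # Eq. (5): received power over noise plus co-channel interference
    return received_power(d) / (sigma2 + interference)

def rate(d, interference):
    # Eq. (6): R^k_{i,j}(t) = b * log2(1 + gamma)
    return b * math.log2(1 + sinr(d, interference))

# A 100 m link with one interferer 300 m away on the same subchannel
I_k = P_t * G * 300.0 ** (-alpha)   # Eq. (4) with psi_{m,k} = 1
print(f"{rate(100.0, I_k) / 1e6:.2f} Mbit/s")
```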
The RL-based model is described below.
C. The Multi-Agent RL Model
Assume that the source node of a packet is UAV $s$ and the destination node is UAV $d$; each UAV is an agent, and all nodes cooperate to complete packet transmission. The routing problem is a multi-agent Markov (stochastic) game characterized by the tuple $(\mathcal{I}, \mathcal{S}, \{A_i\}_{i\in\mathcal{I}}, p, \{r_i\}_{i\in\mathcal{I}})$, where $\mathcal{I} \triangleq \{1, \ldots, i, \ldots, I\}$ is the set of agents, $\mathcal{S} \triangleq \{S_1, \ldots, S_i, \ldots, S_I\}$ is the set of states of all agents with $S_i$ the state of the $i$-th agent, $\{A_i\}_{i\in\mathcal{I}}$ is the set of actions of the agents, $p(s' \mid s, \{A_i\}_{i\in\mathcal{I}}) : \mathcal{S} \times A_1 \times \cdots \times A_I \times \mathcal{S}$ is the state transition probability of the system with $s'$ the next state, and $\{r_i\}_{i\in\mathcal{I}}$ is the set of agent rewards.
In the game, all UAVs try to find the path with the best link quality to maximize the value function $v_\pi(s) = \mathbb{E}_\pi\left[r_{t+1} + \eta r_{t+2} + \eta^2 r_{t+3} + \cdots \mid S_t = s\right]$. The discount factor $\eta \in [0,1]$ indicates the importance of future rewards. After all UAVs formulate optimal policies, the equilibrium policy for the whole system $\{\pi_1^*, \ldots, \pi_I^*\}$ is the Nash policy [11] that obtains the optimal packet transmission path, where $\pi_i^*$ is the optimal policy of the $i$-th UAV.
1) The State Space: The state of each UAV is private to that UAV, and we include the channel condition, indicated by the SINR, in the state. For a single UAV $i$, the state $S_i(t)$ at time $t$ contains two parts:
• The signal-to-noise ratio $\gamma^k_{i,-i}(t)$, which is the signal impact of the other UAVs on UAV $i$, denoted by $-i$.
• The set of neighbors $Ner(t)$ of UAV $i$, i.e., the set of UAVs that can establish a direct connection with UAV $i$.
Thus, the state of UAV $i$ at time $t$ is $S_i(t) = \langle \gamma^k_{i,-i}, Ner \rangle$.
2) The Action Space: The action of UAV $i$ is to choose which link to use to transmit packets. In the channel model, when the communication distance satisfies $d\_c(i,j) > d\_e(i,j)$, UAV $j$ can be a neighbor of UAV $i$, and there is a possible communication link $e(i,j)$. However, the instantaneous distance between the two UAVs alone is not a sufficient criterion; the link duration must also meet the communication requirements. We use the link expiration time (LET) defined in [10]. Assume that UAVs $i$ and $j$ move at speeds $v_i$ and $v_j$, respectively; the link expiration time is then expressed as
$$\mathrm{LET}_{i,j} = \frac{\sqrt{(m^2+c^2)r^2 - (md-nc)^2} - (mn+cd)}{m^2+c^2} \quad (7)$$
where
$$m = v_i\cos\theta_i - v_j\cos\theta_j, \qquad n = x_i - x_j \quad (8)$$
Fig. 1. The UAV swarm network routing algorithm (a) and the improved MADDPG routing algorithm based on LSTM (b).
$$c = v_i\sin\theta_i - v_j\sin\theta_j, \qquad d = y_i - y_j \quad (9)$$
$(x_i, y_i)$ and $(x_j, y_j)$ denote the positions of UAV $i$ and UAV $j$, respectively, $v_i$ and $v_j$ denote the velocities of the UAVs, and $\theta_i$ and $\theta_j$ are the angles between the UAV velocities and the horizontal direction. Hence, if $\mathrm{LET}_{i,j} \ge T_{hello}$, then $e(i,j) \in a_i$, where $T_{hello}$ is the period in which the UAV performs the link handshake. Thus, the action space of the UAV is
$$a_i \in A = \{e(i,j) \mid d\_c(i,j) \ge d\_e(i,j) \cap \mathrm{LET}_{i,j} \ge T_{hello}\} \quad (10)$$
where $A$ is the set of actions of UAV $i$ and is affected by the communication distance to the other UAVs.
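A small sketch of the LET computation of (7)-(9) and the $T_{hello}$ test used by (10) follows; the positions, speeds, radio range $r$, and handshake period are illustrative assumptions.

```python
import math

def let(xi, yi, vi, ti, xj, yj, vj, tj, r):
    """Link expiration time, Eqs. (7)-(9); angles in radians, r = radio range."""
    m = vi * math.cos(ti) - vj * math.cos(tj)   # Eq. (8)
    n = xi - xj
    c = vi * math.sin(ti) - vj * math.sin(tj)   # Eq. (9)
    d = yi - yj
    denom = m * m + c * c
    if denom == 0:
        return float("inf")  # identical velocities: the link never expires
    disc = denom * r * r - (m * d - n * c) ** 2
    if disc < 0:
        return 0.0           # the UAVs are never simultaneously in range
    return (math.sqrt(disc) - (m * n + c * d)) / denom  # Eq. (7)

T_hello = 1.0  # handshake period (assumed)
# Two UAVs 100 m apart closing head-on at 20 m/s each, radio range 250 m
t = let(0, 0, 20, 0, 100, 0, 20, math.pi, 250)
print(t, t >= T_hello)  # the link enters the action space when LET >= T_hello
```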
3) The Reward Function: Our link quality evaluation measures link quality in several aspects. The first measure is the channel metric, using the channel model of the UAV presented in Section II-B. The goal is to select a link whose signal-to-noise ratio is no lower than a predefined threshold:
$$\gamma^k_{i,j}(t) = \frac{P_U\, d_{i,j}(t)^{-\alpha}}{\sigma^2 + \sum_{m=1,\,m\neq i}^{N_l} I^k_{m,\mathrm{UAV}}(t)} \ge \gamma^k_{min} \quad (11)$$
where $\gamma^k_{min}$ is the minimum signal-to-noise ratio of the UAV channel. If, over time slot $t$, the link channel of a UAV satisfies constraint (11), the UAV receives a reward $r_{SINR}$, defined as
$$r_{SINR} = \begin{cases} b\log\left(1+\gamma^k_{i,j}(t)\right) - v_m p_U(t), & \text{if } \gamma^k_{i,j}(t) \ge \gamma^k_{min} \\ 0, & \text{otherwise} \end{cases} \quad (12)$$
where $b$ is the channel bandwidth, $p_U(t)$ is the transmission power, and $v_m > 0$ is a cost per unit level of power. The second part of the reward is
$$r_{LET} = \begin{cases} \frac{\mathrm{LET}_{i,j}}{\tau}, & \mathrm{LET}_{i,j} < \tau \\ 1, & \text{otherwise} \end{cases} \quad (13)$$
where $\tau$ is the time interval of the position update. When $\mathrm{LET}_{i,j} < \tau$, the LET is short and the reward for choosing this link is small, so $r_{LET} < 1$. Otherwise the LET is long, but the reward cannot grow without bound, so as long as the transmission demand is satisfied the reward is set to 1. The third component is the busyness of the selected neighboring node, expressed in terms of the queue length of packets to be sent, $n_{send}$. A longer queue incurs a larger penalty:
$$r_{BUSY} = -\log(1 + n_{send}) \quad (14)$$
where $n_{send}$ is positive and the reward decreases as $n_{send}$ grows. The more data a UAV has queued to send, the greater the load on that link, so the reward is reduced when an agent selects it. Thus, the total reward of UAV $i$ at time $t$ is
$$r^t_i = \alpha r^t_{SINR} + \beta r^t_{LET} + \delta r^t_{BUSY} \quad (15)$$
where $\alpha$, $\beta$, $\delta$ are the weights of the three components.
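The three reward components and their weighted sum in (12)-(15) can be sketched as follows; the weights, thresholds, and sample inputs are illustrative assumptions, not values from the letter.

```python
import math

# Weights alpha, beta, delta of Eq. (15); the values are assumptions.
ALPHA, BETA, DELTA = 0.5, 0.3, 0.2
b, v_m, tau, gamma_min = 1.0, 0.01, 1.0, 2.0  # illustrative constants

def r_sinr(gamma, p_u):
    # Eq. (12): rate-style reward minus a power cost, zero below the threshold
    return b * math.log(1 + gamma) - v_m * p_u if gamma >= gamma_min else 0.0

def r_let(let_ij):
    # Eq. (13): proportional below the position-update interval, capped at 1
    return let_ij / tau if let_ij < tau else 1.0

def r_busy(n_send):
    # Eq. (14): penalty growing with the neighbor's send-queue length
    return -math.log(1 + n_send)

def total_reward(gamma, p_u, let_ij, n_send):
    # Eq. (15): weighted sum of the three components
    return (ALPHA * r_sinr(gamma, p_u)
            + BETA * r_let(let_ij)
            + DELTA * r_busy(n_send))

print(total_reward(gamma=5.0, p_u=0.1, let_ij=8.75, n_send=3))
```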
We propose two measures to cope with situations in which local information is unavailable. First, the agent's training dataset should traverse cases of missing local data, so that the algorithm has experience with missing data at execution time. Second, cross-layer techniques can be applied in this scenario to increase access to local data.
III. THE ROUTING ALGORITHM BASED ON IMPROVED
MADDPG WITH LSTM
Figure 1(a) shows the 10 steps of the routing algorithm. In steps 1 to 3, the LSTM-based Actor net outputs an action based on the current state. In steps 4 to 7, the agent performs the action, obtains a reward, and evolves to a new state. The agent stores the state, action, reward, and new state in the buffer, then samples from the buffer and uses the target network parameters to calculate $y$, which is used for Critic net loss minimization. In steps 8 to 10, the LSTM-based Critic net parameters are updated using the loss function defined in (18), and the LSTM-based Actor net parameters are updated using gradient ascent.
We fit the policy function and the action-value function using the MADDPG [11] algorithm. The policy is represented by the functions $\pi = \{\pi_1, \pi_2, \ldots, \pi_I\}$ with parameters $\theta = \{\theta_1, \theta_2, \ldots, \theta_I\}$. UAV routing is a time-varying decision problem; traditional routing is prone to frequent failures as the topology changes because information on network state changes in the time domain is not fully utilized. We replace the Critic and target networks of the traditional MADDPG algorithm with LSTM networks that can leverage time-domain information between UAV states to reduce the possibility of routing failure and improve routing accuracy.
Authorized licensed use limited to: NANJING UNIVERSITY OF SCIENCE AND TECHNOLOGY. Downloaded on October 08,2022 at 13:30:00 UTC from IEEE Xplore. Restrictions apply.
As shown in Figure 1(b), we build the Actor net and Critic net containing LSTM cells, where the input of the Actor net $\pi_i$ of the $i$-th UAV at time $t$ is the information on the historical states and historical actions of all UAVs encoded by the LSTM, together with the current observed state $s^i_t$ of UAV $i$. The output of the Actor net is the current action taken by the UAV, $a^i_t$; thus $a^i_t = \pi_i(h_{t-1}, s^i_t)$, where $h_{t-1}$ is the LSTM output at time $t-1$. The Actor's objective is thus
$$J(\theta_i) = \mathbb{E}_{h_{t-1},s_t}\left[Q(h_{t-1}, s_t, a)\big|_{a=\mu_i(h_{t-1}, s_t)}\right] \quad (16)$$
$Q(h_{t-1}, s_t, a)$ is the action-value function of the UAV, and $Q_\pi(h_{t-1}, s_t, a) = \mathbb{E}_{\tau\sim\pi}\left[R(t) \mid S=s_t, A=a\right]$, where $R(t)$ is the cumulative reward obtained at time $t$ and afterwards, $R(t) = r_{t+1} + \eta r_{t+2} + \eta^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \eta^k r_{t+k+1}$. The gradient is
$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{h_{t-1},s_t}\left[\nabla_a Q(h_{t-1}, s_t, a)\big|_{a=\mu_i(h_{t-1}, s_t)}\,\nabla_{\theta_i}\pi_i(h_{t-1}, s^i_t)\right] \quad (17)$$
The action-value function $Q_i(S, a_1, a_2, \ldots, a_I)$ is fitted by a network with parameters $\omega = \{\omega_1, \omega_2, \ldots, \omega_I\}$. $A_t$ is the joint action distribution $(a^1_t, \ldots, a^I_t)$ of all UAVs at time $t$, and $h_t$ is the LSTM output at time $t$. The Critic net is updated to minimize the loss defined as follows:
$$Loss(\omega_i) = \mathbb{E}_{h_{t-1},s_t}\left[\left(Q^{\omega_i}_i(h_{t-1}, s_t, a) - y_i\right)^2\right] \quad (18)$$
where $y_i$ is the target value derived from the target network,
$$y_i = r^i_t + \eta\, Q^{\omega'_i}_i\left(h_t, s_{t+1}, \pi'_i\!\left(h_t, s^i_{t+1} \mid \theta'_i\right)\right) \quad (19)$$
where $r^i_t$ denotes the instantaneous reward of the $i$-th UAV in the current state, $\theta'$ is the Actor target network parameter, $\omega'$ is the Critic target network parameter, $\eta$ is the discount factor of the cumulative reward, and $h_t$ is the output of the LSTM at time $t$.
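The target computation of (19) and the per-sample loss inside (18) can be illustrated with scalar placeholders; the numbers below, and the stand-in values for the target and online Critic outputs, are assumptions for illustration only.

```python
# TD target of Eq. (19) for one agent, with scalar placeholders:
# q_target stands in for Q^{omega'}_i evaluated at the target Actor's action.
eta = 0.95          # discount factor (assumed)
r_t = 0.9           # instantaneous reward r^i_t (assumed)
q_target = 2.0      # target Critic's value at (h_t, s_{t+1}, pi'_i(...))

y_i = r_t + eta * q_target           # Eq. (19)
q_online = 2.5                       # online Critic's current estimate (assumed)
loss = (q_online - y_i) ** 2         # Eq. (18), before taking the expectation
print(y_i, loss)
```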
As shown in Algorithm 1, the LSTM parameters are first initialized as $h_0$ and the actions of each UAV are generated according to the policy. After the actions are executed and rewards obtained, the LSTM state becomes $h_1$ and the information is stored in the experience pool $R$. The LSTM helps alleviate the difficulty that the current decision cannot otherwise access relevant information from long before it, so it is well suited to the fast-changing wireless communication environment.
Computational complexity analysis: The total number of parameters in each LSTM network is $M = 4n_c n_c + 4n_i n_c + n_c n_o + 3n_c$, where $n_c$, $n_i$, $n_o$ are the numbers of memory cells, input neuron units, and output neuron units, respectively. So the computational complexity of the standard LSTM network is $O(M)$. Assume the batch size is $B$ and the number of training episodes is $H$; the time complexity of LSTM training is then approximately $O(M \cdot H \cdot B)$. The computational complexity of the proposed MADDPG-LSTM routing algorithm for a system containing $I$ UAVs is about $O(M \cdot H \cdot B \cdot I \cdot D_s) + O(M \cdot H \cdot B \cdot I \cdot (D_a + D_s))$, where $D_a$ is the dimension of the action space and $D_s$ is the dimension of the state space. Compared with a centralized routing algorithm, the proposed algorithm alleviates the computational power required by each UAV and reduces routing overhead.
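The parameter count $M$ above is straightforward to evaluate; the network sizes below are illustrative assumptions chosen only to show the arithmetic.

```python
def lstm_params(n_c, n_i, n_o):
    # M = 4*n_c*n_c + 4*n_i*n_c + n_c*n_o + 3*n_c, as in the complexity analysis
    return 4 * n_c * n_c + 4 * n_i * n_c + n_c * n_o + 3 * n_c

# Illustrative sizes (assumed): 64 memory cells, 10 state inputs, 8 actions
M = lstm_params(n_c=64, n_i=10, n_o=8)
B, H = 32, 1000                # batch size and training episodes (assumed)
training_cost = M * H * B      # the O(M*H*B) term, per network
print(M, training_cost)
```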
Algorithm 1 The Processing of the Routing Algorithm in a UAV
Initialize: For all UAVs, initialize Actor online network parameters $\theta = \{\theta_1, \theta_2, \ldots, \theta_I\}$ and Critic online network parameters $\omega = \{\omega_1, \omega_2, \ldots, \omega_I\}$, replay buffer $R$, Actor target network parameters $\theta' \leftarrow \theta$, Critic target network parameters $\omega' \leftarrow \omega$.
1: for episode $= 1$ to $M$ do
2:   Initialize empty history $h_0$, state $s = \{s_1, s_2, \ldots, s_I\}$
3:   while time $t < T$ and $s \neq \emptyset$ do
4:     for agent $i = 1$ to $I$ do
5:       Generate the action to be performed by UAV $i$ according to $a^i_t = \pi_i(h_{t-1}, s^i_t)$;
6:       Get reward $r^i_t$ and new observation $s_{t+1}$ from the environment;
7:       Update the LSTM according to $h_t = \mathrm{LSTM}(h_{t-1}, [S_{t-1}, A_{t-1}])$.
8:     end for
9:   end while
10:  Store the episode $\{h, s_i, a_i, r_i\}$
11:  for agent $i = 1$ to $I$ do
12:    Sample a random episode $\{h_1, s^i_1, a^i_1, r^i_1, h_2, s^i_2, a^i_2, r^i_2, \ldots\}$
13:    for time $t = T$ down to 1 do
14:      Calculate $y$ by (19) using the target network parameters
15:      Update the Critic online network by minimizing the loss (18)
16:      Update the Actor online network by the policy gradient (17)
17:    end for
18:  end for
19:  Update the target networks:
     $\theta'_i \leftarrow \tau\theta_i + (1-\tau)\theta'_i$
     $\omega'_i \leftarrow \tau\omega_i + (1-\tau)\omega'_i$
20: end for
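The soft target-network update of step 19 can be sketched as below, with parameters represented as plain lists of floats standing in for network weight tensors; the update rate and the weight values are assumptions.

```python
# Soft target-network update from step 19 of Algorithm 1. Note tau here is
# the update rate of step 19, distinct from the position-update interval
# used in Eq. (13).
TAU = 0.01  # update rate (assumed)

def soft_update(online, target, tau=TAU):
    # theta' <- tau*theta + (1 - tau)*theta', applied element-wise
    return [tau * o + (1 - tau) * t for o, t in zip(online, target)]

theta_online = [1.0, 2.0, 3.0]
theta_target = [0.0, 0.0, 0.0]
theta_target = soft_update(theta_online, theta_target)
print(theta_target)  # each weight moves 1% of the way toward the online value
```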
IV. EVALUATION
DGATR [5] is a multi-agent RL approach combined with a graph neural network for routing, and DeepCQ+ [6] trains DNNs by exchanging acknowledgment (ACK) packets and solves the routing problem of dynamic mobile ad-hoc networks (MANETs) based on multi-agent RL. Although these two routing methods perform well in scenarios with variable network size, data rate, and mobility, they do not take channel quality and node busyness into consideration. We simulated the proposed algorithm using the NS3-Gym [12] tool and compared DGATR, DeepCQ+, MADDPG, and our MADDPG-LSTM in terms of delay and data delivery rate. The former is the time difference between a packet being sent by the source and received by the destination node, and the latter is the number of packets received by the destination node divided by the number sent by the source node.
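The two metrics just defined can be computed from a per-packet log as follows; the log entries are hypothetical values for illustration only.

```python
# Delay and delivery-rate metrics as defined above, computed from a
# hypothetical per-packet log: (send_time, recv_time or None if lost).
log = [(0.00, 0.13), (0.10, 0.26), (0.20, None), (0.30, 0.44)]

delivered = [(s, r) for s, r in log if r is not None]
delivery_rate = len(delivered) / len(log)                  # received / sent
mean_delay = sum(r - s for s, r in delivered) / len(delivered)

print(f"delivery rate {delivery_rate:.0%}, mean delay {mean_delay * 1000:.0f} ms")
```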
NS3-Gym enables data communication between NS3 and OpenAI Gym [13] based on ZMQ (ZeroMQ, a library used to implement messaging and communication between two applications). The system effectively simulates communication protocols based on RL methods. The dataset
TABLE I
SIMULATION PARAMETERS
Fig. 2. Simulation results.
used for training is obtained from the execution environment through the interface functions provided by NS3-Gym for passing status, reward, action, and end-flag information. The states, actions, and rewards are combined into a vector $\{s_j, a_j, r_j\}$ and deposited into the experience pool $R$ to form the dataset. To be more realistic, we used the Gauss-Markov mobility model to generate random UAV motions. The network load was controlled by varying the constant bit rate (CBR); we also varied the UAV speeds, the number of UAVs, and other parameters. The simulation parameters are shown in Table I.
Figure 2(a) shows that the transmission delay increases as the load increases. With DGATR, the delay is about 160 ms when CBR = 1,000 kbps. Our MADDPG-LSTM exhibits a shorter delay (about 130 ms), similar to that of the traditional MADDPG using a fully connected neural network. Figures 2(b) and (c) show that the delay increases as the number of UAVs and the speed of the UAVs increase, respectively. Our MADDPG-LSTM outperforms the other methods, indicating that it adapts well to the dynamically changing and complex network topology.
Figure 2(d) shows that the delivery rate decreases as the
network load increases. MADDPG-LSTM performs better
than the other methods under all conditions. As the network
load increases to 1,000 kbps, the data delivery rates of all
methods exceed 65%, and MADDPG-LSTM achieves the
highest, i.e., nearly 80%. Figures 2(e) and (f) show that as
UAV numbers and speeds increase, the topological complexity
and variability rise, negatively impacting performance.
V. CONCLUSION
We have presented a multi-agent RL-based UAV swarm
routing algorithm. Each UAV is viewed as an agent that uses
local information to make routing decisions via a link qual-
ity assessment mechanism. We have used LSTM to replace
the fully connected layers of the Actor and Critic networks of
the MADDPG algorithm to improve the ability of the routing
protocol to adapt to dynamic environments. NS3-Gym sim-
ulation has shown the superiority of our MADDPG-LSTM
algorithm in terms of transmission delay and delivery rate. In future work, we plan to further refine the routing strategies for different UAV swarms to make data-driven routing more intelligent.
REFERENCES
[1] L. Hong, H. Guo, J. Liu, and Y. Zhang, “Toward swarm coordina-
tion: Topology-aware inter-UAV routing optimization,” IEEE Trans. Veh.
Technol., vol. 69, no. 9, pp. 10177–10187, Sep. 2020.
[2] S. Dang, O. Amin, B. Shihada, and M.-S. Alouini, “What should 6G
be?” Nat. Electron., vol. 3, no. 1, pp. 20–29, 2020.
[3] J. Aznar-Poveda, A.-J. Garcia-Sanchez, E. Egea-Lopez, and
J. Garcia-Haro, “Approximate reinforcement learning to control
beaconing congestion in distributed networks,” Sci. Rep., vol. 12, no. 1,
pp. 1–11, 2022.
[4] S. Jiang, Z. Huang, and Y. Ji, “Adaptive UAV-assisted geographic rout-
ing with Q-learning in VANET,” IEEE Commun. Lett., vol. 25, no. 4,
pp. 1358–1362, Apr. 2021.
[5] X. Mai, Q. Fu, and Y. Chen, “Packet routing with graph attention multi-
agent reinforcement learning,” 2021, arXiv:2107.13181.
[6] S. Kaviani et al., “Robust and scalable routing with multi-agent deep
reinforcement learning for MANETs,” 2021, arXiv:2101.03273.
[7] W. Jin, R. Gu, and Y. Ji, “Reward function learning for Q-learning-
based geographic routing protocol,” IEEE Commun. Lett., vol. 23, no. 7,
pp. 1236–1239, Jul. 2019.
[8] S. Zhang, H. Zhang, B. Di, and L. Song, “Cellular UAV-to-X communi-
cations: Design and optimization for multi-UAV networks,” IEEE Trans.
Wireless Commun., vol. 18, no. 2, pp. 1346–1359, Feb. 2019.
[9] Q. Zhao, B. Krishnamachari, and K. Liu, “On myopic sensing for multi-
channel opportunistic access: structure, optimality, and performance,”
IEEE Trans. Wireless Commun., vol. 7, no. 12, pp. 5431–5440,
Dec. 2008.
[10] B. Sliwa, C. Schüler, M. Patchou, and C. Wietfeld, “PARRoT: Predictive
ad-hoc routing fueled by reinforcement learning and trajectory knowl-
edge,” in Proc. IEEE 93rd Veh. Technol. Conf. (VTC-Spring), 2021,
pp. 1–7.
[11] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-
agent actor–critic for mixed cooperative-competitive environments,” in
Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 6379–6390.
[12] P. Gawłowicz and A. Zubow, “ns3-gym: Extending OpenAI gym for
networking research,” 2018, arXiv:1810.03943.
[13] G. Brockman et al., “OpenAI gym,” 2016, arXiv:1606.01540.