Citation: Jiang, L.; Nan, Y.; Zhang, Y.; Li, Z. Anti-Interception Guidance for Hypersonic Glide Vehicle: A Deep Reinforcement Learning Approach. Aerospace 2022, 9, 424. https://doi.org/10.3390/aerospace9080424
Academic Editor: Sergey Leonov
Received: 8 April 2022; Accepted: 1 August 2022; Published: 4 August 2022
Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Anti-Interception Guidance for Hypersonic Glide Vehicle:
A Deep Reinforcement Learning Approach
Liang Jiang 1,*, Ying Nan 1, Yu Zhang 2 and Zhihan Li 1
1 College of Astronautics, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
2 School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
* Correspondence: nuaajl@nuaa.edu.com
Abstract:
Anti-interception guidance can enhance the survivability of a hypersonic glide vehicle (HGV) confronted with multiple interceptors. In general, anti-interception guidance for aircraft can be divided into procedural guidance, fly-around guidance and active evading guidance. However, these guidance methods cannot be applied to an HGV's unknown real-time process due to limited intelligence information or on-board computing abilities. In this paper, an anti-interception guidance approach based on deep reinforcement learning (DRL) is proposed. First, the penetration process is conceptualized as a generalized three-body adversarial optimal (GTAO) problem. The problem is then modelled as a Markov decision process (MDP), and a DRL scheme consisting of an actor-critic architecture is designed to solve it. We propose a new mechanism called repetitive batch training (RBT): reusing the same sample batch during training results in fewer serious estimation errors in the critic network (CN), which provides better gradients to the immature actor network (AN). The training data and test results confirm that RBT can improve on the traditional DDPG-based methods.
Keywords: hypersonic glide vehicle; anti-interception; deep reinforcement learning; guidance
1. Introduction
A hypersonic glide vehicle (HGV) has the advantages of a ballistic missile and a lifting-body vehicle. It is efficient as it does not require any additional complex anti-interception mechanisms (e.g., carrying a defender to destroy an interceptor [1] or relying on the range advantage of a high-powered onboard radar for early evasion before the interceptor locks on [2]), and can achieve penetration through anti-interception guidance. Over the past decade, studies have focused on utilizing HGV manoeuvrability to protect against interceptors.
In general, anti-interception guidance for aircraft has three categories: (1) procedural guidance [3–6], (2) fly-around guidance [7,8], and (3) active evading guidance [9,10].
Early research on anti-interception guidance focused on procedural guidance. In procedural guidance, the desired trajectory (such as sine manoeuvres [5], square-wave manoeuvres [3], or snake manoeuvres [6]) is planned prior to launch based on facts such as the target position, interceptor capability and manoeuvre strategy, and the vehicle receives guidance based on the fixed trajectory after launch. Imado et al. [11] studied lateral procedural guidance in a horizontal plane. Zhang et al. [12] proposed a midcourse penetration strategy using an axial impulse manoeuvre and provided a detailed trajectory design method. This penetration strategy does not require lateral pulse motors. To eliminate the re-entry error caused by the midcourse manoeuvre, Wu et al. [13] used the remaining pulse motors to return to a preset ballistic trajectory. Procedural guidance only needs to plan a ballistic trajectory and inject it into the onboard computer before launch, which is easy to implement and does not occupy onboard computing resources. With the advancement of interception guidance, the procedural manoeuvres studied by Imado et al. [11,12] may be recognized by interceptors, and effectiveness cannot be guaranteed when fighting against advanced interceptors.
In conjunction with advances in planning and information fusion, flying around the detection zone has emerged as a penetration strategy. The primary objective is to plan a trajectory that can evade an enemy's detection zone. A complex nonlinear programming problem with multiple constraints and stages is used to achieve this objective. In terms of the detection zone, Zhang et al. [14] established infinitely high cylindrical and semi-ellipsoidal detection zone models under the assumption of earth-flattening and optimized the trajectory that satisfies the waypoint and detection zone constraints. Zhao et al. [7] proposed adjusting the interval density with curvature and error criteria based on the multi-interval pseudo-spectral method. An adaptive pseudo-spectral method was constructed to solve the multi-interval pseudo-spectrum, and the number of points in the interval was allocated. A rapid trajectory optimization algorithm was also proposed for the whole course under the condition of multiple constraints and multiple detection zones. However, fly-around guidance has insufficient adaptability on the battlefield. Potential re-entry points have a wide distribution, and the circumvention methods studied by Zhang et al. [7,14] cannot guarantee that a planned trajectory will meet energy constraints. In addition, there may not be a trajectory that can evade all detection zones in an enemy's key air defence area. Moreover, it may be impossible to know an enemy's detection zones due to limited intelligence.
As onboard air-detection capabilities have advanced, active evading has gradually gained popularity in anti-interception guidance, and some research results have been obtained through differential game (DG) theory and numerical optimization algorithms in recent years. In DG, Hamilton functions are built based on an adversarial model and are then solved by numerical algorithms to find the optimal control using real-time aircraft flight states. Xian et al. [15] conducted research based on DG and obtained a strategy set of evasion manoeuvres. Based on an accurate model of a penetrating spacecraft and an interceptor, Bardhan et al. [4] proposed guidance using a state-dependent Riccati equation (SDRE). This approach obtained superior combat effectiveness compared with traditional DG. However, DG requires many calculations and has poor real-time performance. Model errors can be introduced during linearization [16], and onboard computers have difficulty achieving high-frequency corrections. The ADP algorithm was employed by Sun et al. [17,18] to address the horizontal flight pursuit problem. In an attempt to reduce the computational complexity, neural networks were used to fit the Hamilton function online. However, it is unclear whether this idea can be adapted for an HGV featuring a giant flight envelope. Numerical optimization algorithms, for example, pseudo-spectral methods [19,20], have been used to discretize complex nonlinear HGV differential equations and convert various types of constraints into algebraic constraints. Various optimization methods, such as convex optimization or sequential quadratic programming, are then used to solve the optimal trajectory. The main drawback of applying numerical optimization algorithms to HGV anti-interception guidance is that they occupy a considerable amount of onboard computer resources over a long period of time, and the computational time required increases exponentially with the number of aircraft. Due to these limitations, active evading guidance is unsuitable for engineering.
Reinforcement learning (RL) is a model-free algorithm that is used to solve decision-making problems and has gained attention in the control field because it is entirely based on data, does not require model knowledge, and can perform end-to-end self-learning. Due to the limitations of traditional RL, early research could not handle high-dimensional and continuous battlefield state information. In recent years, deep neural networks (DNNs) have demonstrated the ability to approximate an arbitrary function and have unparalleled advantages in the feature extraction of high-dimensional data. Deep reinforcement learning (DRL) is a technique resulting from the intersection of DNNs and RL, and its abilities can exceed those of human empirical cognition [21]. A wave of research into DRL has been sparked by successful applications, such as Alpha-Zero [22] and Deep Q Networks (DQN) [23,24]. After training, a DNN can quickly output control in milliseconds and has a good generalization ability in unknown environments [25]. Therefore, DRL has promising applications in aircraft guidance. There has been some discussion regarding using DRL to train DNNs to intercept a penetrating aircraft [26]. Brian et al. [27] used reinforcement meta-learning to optimize an adaptive guidance system that is suitable for the approach phase of an HGV. However, no research has been conducted on the application of DRL to HGV anti-interception guidance, although some studies in different areas have examined similar questions. Wen et al. [28] proposed a collision avoidance method based on a deep deterministic policy gradient (DDPG) approach, and a proper heading angle was obtained by using the proposed algorithm to guarantee conflict-free conditions for all aircraft. For micro-drones flying in orchards, Lin et al. [29] implemented DDPG to create a collision-free path. Guo et al. studied a similar problem, the difference being that DDPG was applied to an unmanned ship. Lin et al. [30] studied how to use DDPG to train a fully connected DNN to avoid collision with four other vehicles by controlling the acceleration of the merging vehicle. For unmanned surface vehicles (USVs), Xu et al. [31] used DDPG to determine the switching time of path-planning and dynamic collision avoidance. These studies led us to believe that DDPG-based methods have promising applications for solving the anti-interception guidance problem of HGVs. However, due to the differences in the objects of study, the anti-interception guidance problem for HGVs requires consideration of the following issues: (1) The performance (especially the available overload) of an HGV is time-varying due to the velocity and altitude, while the performance of the control object is fixed in the abovementioned studies. We need to build a model in which the DDPG training process is not affected by time-varying performance. (2) The end state is the only concern in the anti-interception guidance problem, and only one instant reward is obtained in a training episode. Therefore, sparsity and the delayed reward effect are more significant in this study than in the studies mentioned above. In this paper, we attempt to improve the existing DDPG-based methods to help DNNs gain intelligence faster.
The main contributions of this paper are as follows: (1) Anti-interception HGV guidance is described as an optimization problem, and a generalized three-body adversarial optimization (GTAO) model is developed. This model does not need to account for the severe constraints on the available overload (AO) and is suitable for DRL. To our knowledge, this is the first time that DRL has been applied to the anti-interception guidance of an HGV. (2) A DRL scheme is developed to solve the GTAO problem, and the RBT-DDPG algorithm is proposed. Compared with the traditional DDPG-based algorithm, the RBT-DDPG algorithm can improve the learning effects of the critic network (CN), alleviate the exploration–utilization paradox, and achieve a better performance. In addition, since the forward computation of a fully connected neural network is very simple, an intelligent network trained by DRL can quickly compute a command for anti-interception guidance that matches the high dynamic characteristics of the HGV. (3) A strategy review of the HGV anti-interception guidance derived from the DRL approach is provided. We note that this is the first time that these strategies are summarized for HGV guidance through semantics, and they may inspire the research community.
The remainder of this paper is organized as follows: Section 2 describes the problem of anti-interception guidance for HGVs as an optimization problem and translates the problem into solving a Markov decision process (MDP). In Section 3, the RBT-DDPG algorithm is given in detail. Moreover, we propose a specific design of the state space, action space, and reward functions that are necessary to solve the MDP using DRL. Section 4 examines the training and test data and the specific anti-interception strategy. Section 5 presents a conclusion to the paper and an outlook on intelligent guidance.
2. Problem Description
Figure 1 shows the path of an HGV conducting an anti-interception manoeuvre. The coordinates and velocity of the aircraft are known, as well as the coordinates of the desired regression point (DRP). The HGV and interceptor rely on aerodynamics to perform ballistic manoeuvres. The aerodynamic forces are mainly derived from the angle of attack (AoA). As an HGV can glide for thousands of kilometres, this paper focuses on the guidance needed after the interceptors have locked on to the HGV, and the flight distance of this process is set to 200 km.
Figure 1. Illustration of an HGV anti-interception manoeuvre. The anti-interception of an HGV can be divided into three phases. (1) Approach: the HGV manoeuvres according to anti-interception guidance, while the interceptor operates under its own guidance. Since the HGV is located a long distance from the interceptor and has high energy, various penetration strategies are available during this phase. (2) Rendezvous: at this phase, the distance between the HGV and the interceptors is the shortest. This distance may be shorter than the kill radius of the interceptors, allowing the HGV to be intercepted, or greater than the kill radius, allowing the HGV to successfully avoid interception. (3) Egress: with its remaining energy, the HGV flies to the DRP after moving away from the interceptors. A successful mission requires the HGV to arrive at the DRP with the highest levels of energy and accuracy. From the above analysis, it can be seen that whether the HGV can evade the interceptors in phase (2) depends on the manoeuvres adopted in phase (1). Phase (1) also determines the difficulty of ballistic regression in phase (3).
2.1. The Object of Anti-Interception Guidance
In the Earth-centered earth-fixed frame, the motion of the aircraft in the vertical plane
is described as follows [32]:
$$
\begin{aligned}
\frac{\mathrm{d}v}{\mathrm{d}t} &= \frac{1}{m}\left(P\cos\alpha - C_X(v,\alpha)\,qS\right) - g\sin\theta \\
\frac{\mathrm{d}\theta}{\mathrm{d}t} &= \frac{1}{mv}\left(P\sin\alpha + C_Y(v,\alpha)\,qS\right) - \left(\frac{g}{v} - \frac{v}{R_0+y}\right)\cos\theta \\
\frac{\mathrm{d}x}{\mathrm{d}t} &= \frac{R_0\,v\cos\theta}{R_0+y} \\
\frac{\mathrm{d}y}{\mathrm{d}t} &= v\sin\theta
\end{aligned}
\tag{1}
$$
where $x$ is the flight distance, $y$ is the altitude, $v$ is the flight velocity, $\theta$ is the ballistic inclination, $g$ is the gravity acceleration, $R_0$ is the mean radius of the earth when the flatness of the earth is ignored, $C_X(v,\alpha)$ and $C_Y(v,\alpha)$ are the drag and lift aerodynamic coefficients of the aircraft, respectively, $\alpha$ is the AoA, $q$ is the dynamic pressure, $S$ is the aerodynamic reference area, $m$ is the mass of the vehicle, and $P$ is the engine thrust.
Let the subscripts H and I indicate the variables subordinate to the HGV and interceptor, respectively. Setting $x_H(t) = \begin{bmatrix} x & y & v & \theta \end{bmatrix}^T$ as the state of an axisymmetric HGV, Equation (1) can be rewritten as follows:

$$
\dot{x}_H(t) = f_H(x_H) + g_H(x_H)u(t) =
\begin{bmatrix}
\dfrac{R_0\,v\cos\theta}{R_0+y} \\[4pt]
v\sin\theta \\[4pt]
-\dfrac{C_X qS}{m} - g\sin\theta \\[4pt]
-\left(\dfrac{g}{v} - \dfrac{v}{R_0+y}\right)\cos\theta
\end{bmatrix}
+
\begin{bmatrix}
0 \\ 0 \\ 0 \\ \delta_H(x_H)
\end{bmatrix}
u(t)
\tag{2}
$$

where $\delta_H(x_H) = \max_{\alpha} \frac{1}{mv} C_Y(v,\alpha)qS$ is the maximum rate of inclination generated by aerodynamics in state $x_H$, and $u(t) \in [-1, 1]$ is the command of the guidance.
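To make the role of Equation (2) concrete, the sketch below numerically integrates the HGV point-mass model with a simple explicit Euler step. The atmosphere model, the numerical coefficient values, and the function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

R0 = 6.371e6          # mean Earth radius, m
G0 = 9.81             # gravity acceleration, m/s^2 (assumed constant)
M_HGV = 500.0         # HGV mass, kg (Table 1)
S_REF = 0.579         # HGV reference area, m^2 (Table 1)

def air_density(y):
    """Placeholder exponential atmosphere (an assumption, not the paper's model)."""
    return 1.225 * np.exp(-y / 7200.0)

def delta_h(state, cy_max):
    """Maximum inclination rate delta_H(x_H): lift at the largest available AoA."""
    _, y, v, _ = state
    q_dyn = 0.5 * air_density(y) * v**2
    return cy_max * q_dyn * S_REF / (M_HGV * v)

def hgv_step(state, u, cx, cy_max, dt=1e-2):
    """One explicit Euler step of Equation (2) for an unpowered HGV (P = 0)."""
    x, y, v, theta = state
    q_dyn = 0.5 * air_density(y) * v**2
    dx = R0 * v * np.cos(theta) / (R0 + y)
    dy = v * np.sin(theta)
    dv = -cx * q_dyn * S_REF / M_HGV - G0 * np.sin(theta)
    dtheta = -(G0 / v - v / (R0 + y)) * np.cos(theta) + delta_h(state, cy_max) * u
    return np.array([x + dx * dt, y + dy * dt, v + dv * dt, theta + dtheta * dt])

# Example: one step with a full "pull-up" command u = +1 (illustrative values only).
state = np.array([0.0, 35e3, 2000.0, 0.0])   # x, y, v, theta
state = hgv_step(state, u=1.0, cx=0.05, cy_max=0.4)
```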
Remark 1. $\delta_H(x_H)$ is related to $y$ and $v$ of the vehicle and the available AoA, so its value varies with $x_H(t)$. The purpose of this procedure is to place $u(t)$ into a constant range, ensuring that the manoeuvring ability required for guidance always follows the real-time AO under the giant flight envelope of the HGV.

Remark 2. The reason for using $u(t)$ as the control variable, instead of directly using the AoA $\alpha$, is that it is possible to rely on an existing model to calculate $\delta_H(x_H)$ and input it into the neural network (as shown in Section 3), thus omitting the neural network training process for the maximum available overload.
As the long-range interceptor is in the target-lock state, its booster rocket is switched off and it flies unpowered with state $x_I(t) = \begin{bmatrix} x & y & v & \theta \end{bmatrix}^T$. The motion is as follows:

$$
\dot{x}_I(t) = f_I(x_I, x_H) =
\begin{bmatrix}
\dfrac{R_0\,v\cos\theta}{R_0+y} \\[4pt]
v\sin\theta \\[4pt]
-\dfrac{C_X qS}{m} - g\sin\theta \\[4pt]
G_I(x_I, x_H, t)
\end{bmatrix}
\tag{3}
$$

where $G_I(x_I, x_H, t)$ is the actual rate of inclination under the influence of the guidance and control system.
Remark 3. As the focus of this paper is on the centre-of-mass motion of vehicles, it is assumed that the vehicles always follow the guidance under the effects of the attitude control system, so errors in attitude control are ignored in Equations (2) and (3), as are minor effects such as Coriolis forces and implicated accelerations.
The HGV and $N$ interceptors form a nonlinear system:

$$
\dot{x}_S(t) = f_S(x_S) + g_S(x_S)u(t)
\tag{4}
$$

where the system state is $x_S = \begin{bmatrix} (x_H)^T & (x_{I1})^T & \dots & (x_{IN})^T \end{bmatrix}^T$, the nonlinear kinematics of the system are $f_S(x_S) = \begin{bmatrix} (f_H)^T & (f_{I1})^T & \dots & (f_{IN})^T \end{bmatrix}^T$, and the nonlinear effect of the control is $g_S(x_S) = \begin{bmatrix} (g_H)^T & O_{1\times 4N} \end{bmatrix}^T$.
The guidance system aims to control the system described in Equation (4) using $u(t)$ to achieve penetration. Therefore, the design of an anti-interception guidance law can be viewed as an optimal control problem. First, the HGV must successfully evade the interceptors during penetration, and then reach the DRP $P_E = \begin{bmatrix} x_E & y_E \end{bmatrix}^T$ to conduct the follow-up mission. Let $t_f$ be the time at which the HGV arrives at $x_E$, and let $u(t)$ drive the system of Equation (4) to the state $x_S(t_f)$. The anti-interception guidance is designed to solve the following GTAO problem [33].

As mentioned in Equation (4), an HGV and its opponents form a system represented by $x_S$. The initial value of $x_S$ is:

$$
x_S(t_0) = \begin{bmatrix} x_{H,0} & y_{H,0} & v_{H,0} & \theta_{H,0} & \dots & x_{IN,0} & y_{IN,0} & v_{IN,0} & \theta_{IN,0} \end{bmatrix}^T
\tag{5}
$$
The process constraint of penetration is:

$$
\min_t \left[ \left(M_{x,i}\,x_S(t)\right)^2 + \left(M_{y,i}\,x_S(t)\right)^2 \right] > R^2, \quad \forall i \in [1, N]
\tag{6}
$$

where $M_{x,i} = \begin{bmatrix} 1 & O_{1\times 3} & O_{1\times 4(i-1)} & -1 & O_{1\times 3} & O_{1\times 4(N-i)} \end{bmatrix}$, $M_{y,i} = \begin{bmatrix} 0 & 1 & O_{1\times 2} & O_{1\times 4(i-1)} & 0 & -1 & O_{1\times 2} & O_{1\times 4(N-i)} \end{bmatrix}$, and $R$ is the kill radius of the interceptor.
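As a concrete reading of constraint (6), the selector rows $M_{x,i}$ and $M_{y,i}$ simply extract the relative position between the HGV and interceptor $i$. The sketch below, an illustrative assumption rather than the paper's code, evaluates the constraint for the stacked state vector of Equation (4).

```python
import numpy as np

def miss_distance_sq(x_s, i, n_interceptors):
    """Squared HGV-to-interceptor-i distance from the stacked state of Equation (4).

    x_s = [x_H, y_H, v_H, theta_H, x_I1, y_I1, v_I1, theta_I1, ...]
    """
    m_x = np.zeros(4 + 4 * n_interceptors)
    m_y = np.zeros(4 + 4 * n_interceptors)
    m_x[0], m_x[4 + 4 * (i - 1)] = 1.0, -1.0        # x_H - x_Ii
    m_y[1], m_y[5 + 4 * (i - 1)] = 1.0, -1.0        # y_H - y_Ii
    return float((m_x @ x_s) ** 2 + (m_y @ x_s) ** 2)

def constraint_6_satisfied(x_s, n_interceptors, kill_radius=300.0):
    """Constraint (6): the HGV must stay outside every interceptor's kill radius."""
    return all(miss_distance_sq(x_s, i, n_interceptors) > kill_radius ** 2
               for i in range(1, n_interceptors + 1))
```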
The process constraint of heat flux is:

$$
(q_S)_{3D} \le q_U
\tag{7}
$$

where $(q_S)_{3D}$ is the three-dimensional stagnation-point heat flux and $q_U$ is the upper limit of the heat flux. For an arbitrarily shaped, three-dimensional stagnation point with radii of curvature $R_1$ and $R_2$, the heat flux is expressed as:

$$
(q_S)_{3D} = \sqrt{\frac{1+k}{2}}\,(q_S)_{AXI}
\tag{8}
$$

where $k = R_1/R_2$ and $(q_S)_{AXI}$ is the axisymmetric heat flux, which is related to the flight altitude and velocity [34].
The minimum velocity constraint is:

$$
\begin{bmatrix} O_{1\times 2} & 1 & 0 & O_{1\times 4N} \end{bmatrix} x_S \ge V_{\min}
\tag{9}
$$

The control constraint is:

$$
u(t) \in [-1, 1]
\tag{10}
$$
The objective function is a Mayer-type function:

$$
J(x_S(t_0), u(t)) = Q\!\left(x_S(t_f)\right)
\tag{11}
$$

where $Q\!\left(x_S(t_f)\right) = \left(x_S(t_f) - \tilde{P}_E\right)^T R \left(x_S(t_f) - \tilde{P}_E\right)$, $\tilde{P}_E = \begin{bmatrix} P_E^T & V_{\min} & O_{1\times(4N+1)} \end{bmatrix}^T$ and $R = \mathrm{diag}\!\left(0,\, w_1,\, w_2,\, O_{(4N+1)\times(4N+1)}\right)$, in which $w_1, w_2 \in \mathbb{R}^+$ are weights.
The optimal performance is:

$$
J^*(x_S(t_0)) = \max_{u(t),\, t\in[t_0, t_f]} J(x_S(t_0), u(t))
\tag{12}
$$

From Equation (12), the optimal control $u(t)$ is determined by $x_S(t_0)$. After obtaining all the model information of the system in Equation (4), the optimal state trajectory of the system can be found according to $x_S(t_0)$ using static optimization methods (e.g., quadratic programming). Nevertheless, resolving the problem using optimization methods is challenging, especially since there is limited information about the interceptor (aerodynamic parameters, available overload, guidance, etc.).
2.2. Markov Decision Process
The MDP can model a sequential decision problem and is well suited to the HGV anti-interception process. The MDP can be defined by a five-tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma \rangle$ [35]. $\mathcal{S}$ is a multidimensional continuous state space. $\mathcal{A}$ is an available action space. $\mathcal{T}$ is a state transition function: $\mathcal{S} \times \mathcal{A} \to \mathcal{S}$. That is, after an action $a \in \mathcal{A}$ is taken in the state $s \in \mathcal{S}$, the state changes from $s$ to $s' \in \mathcal{S}$. $\mathcal{R}$ is an instant reward function: it represents the instant reward obtained from the state transition. $\gamma \in [0, 1]$ is a constant discount factor used to balance the importance of the instant reward and the forward reward.

The cumulative reward obtained by the controller under the command sequence $\tau = \{a_0, \dots, a_n\}$ is:

$$
G(s_0, \tau) = \sum_{t=0}^{\infty} \gamma^t r_t
\tag{13}
$$

The expected value function of the cumulative reward based on state $s_t$ and the expected value function of $(s_t, a_t)$ are introduced as shown below:

$$
V^{\pi}(s_t) = \mathbb{E}\!\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t, \pi \right]
\tag{14}
$$

$$
Q^{\pi}(s_t, a_t) = \mathbb{E}\!\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t, a_t, \pi \right]
\tag{15}
$$

where $V^{\pi}(s_t)$ indicates the expected cumulative reward that controller $\pi$ can obtain in the current state $s_t$, and $Q^{\pi}(s_t, a_t)$ indicates the expected cumulative reward under controller $\pi$ after executing $a_t$ in state $s_t$.

According to the Bellman optimality theorem [35], updating $\pi(s_t)$ through the iteration rule shown in the following equation can be used to approximate the maximum $V^{\pi}(s_t)$ value:

$$
\pi(s_t) = \arg\max_{a_t} Q^{\pi}(s_t, a_t)
\tag{16}
$$
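To connect Equations (13)–(16) with the actor-critic training used later, the sketch below shows how a one-step Bellman target for $Q$ and the greedy action of Equation (16) would be computed with generic critic and actor callables; it is an illustrative assumption, not the paper's code.

```python
import numpy as np

def bellman_target(reward, next_state, done, target_actor, target_critic, gamma=1.0):
    """One-step TD target y_t = r_t + gamma * Q'(s_{t+1}, A'(s_{t+1})) used to fit the CN.

    Table 2 sets gamma = 1, consistent with the terminal-only reward of Equation (25).
    """
    if done:
        return reward
    next_action = target_actor(next_state)
    return reward + gamma * target_critic(next_state, next_action)

def greedy_action(state, critic, candidate_actions):
    """Discrete approximation of Equation (16): pick the action with the largest Q.

    In the continuous-action DDPG setting this maximization is replaced by the AN,
    which is trained to output (approximately) arg-max actions directly.
    """
    q_values = [critic(state, a) for a in candidate_actions]
    return candidate_actions[int(np.argmax(q_values))]
```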
3. Proposed Method
Section 2 converts the anti-interception guidance of an HGV into an MDP. The key to obtaining optimal guidance is the accurate estimation of $Q^{\pi}(s_t, a_t)$. Much progress has been made towards artificial intelligence using supervised learning systems trained to replicate human expert decisions. However, expert data are often expensive, unreliable, or unavailable [36], although DRL can still realize accurate $Q^{\pi}(s_t, a_t)$ estimation. The guidance system needs to process as much data as possible and then choose the next action. The input space of the guidance system has high dimensionality and continuous characteristics, and its action space has continuous characteristics. The DDPG-based methods (the main methods are DDPG and TD3) have been shown to effectively handle high-dimensional continuous information and output continuous actions, so they can be used to solve the MDP proposed in this paper. This section aims to achieve faster policy convergence and better performance by optimizing the training of DDPG-based methods. Since the TD3 algorithm only has one more CN pair than DDPG, this paper takes RBT-DDPG as an example to introduce how the RBT mechanism improves the training of the critic part. RBT-TD3 is easily obtained by simply repeating the improvements made by RBT-DDPG in each pair of TD3 critic networks.
3.1. RBT-DDPG-Based Methods
CN $Q(\cdot)$ in the DDPG-based methods is only used, during reinforcement learning, to train AN $A(\cdot)$. At execution, the real action is determined directly by $A(\cdot)$.

For CN, the DDPG-based optimization objective is to minimize the loss function $L_Q(\varphi_Q)$. Its gradient is

$$
\nabla_{\varphi_Q} L_Q(\varphi_Q) = \left( \frac{1}{N_b} \sum_{j=1}^{N_b} \nabla_{\varphi_Q} Q\!\left(s_{t,j}, a_{t,j} \middle| \varphi_Q\right) \right) \left( \frac{2}{N_b} \sum_{j=1}^{N_b} \left( Q\!\left(s_{t,j}, a_{t,j} \middle| \varphi_Q\right) - \hat{y}_t \right) \right)
\tag{17}
$$

After a single gradient descent operation is performed, the gradient descent method is only guaranteed to update the parameters in the right direction. The amplitude of the loss function is not guaranteed after a single gradient descent procedure.

For AN, the DDPG-based optimization objective is to minimize the loss function $L_A(\varphi_A)$. Its gradient is

$$
\nabla_{\varphi_A} L_A(\varphi_A) = \frac{1}{N_b} \sum_{j=1}^{N_b} \nabla_{\tilde{a}_{t,j}} Q\!\left(s_{t,j}, \tilde{a}_{t,j} \middle| \varphi_Q\right) \nabla_{\varphi_A} A\!\left(s_{t,j} \middle| \varphi_A\right)
\tag{18}
$$

The direction of the parameter iteration of AN is affected by CN. The single gradient descent method used by the DDPG-based method does not guarantee accurate CN estimation. In some instances where the estimation is grossly inaccurate, incorrect parameter updates will be provided to AN, further deteriorating the sample data in the memory pool and affecting the training efficiency.
Rather than allowing CN to steer AN in the wrong direction, this paper proposes a new, improved, DDPG-based mechanism: repetitive batch training (RBT). The core idea of RBT, which mainly aims to improve the updating strategy of $Q^{\pi}(s_t, a_t)$, is that when a sample batch is used to update the CN parameters, the CN is repetitively trained using the loss function as its reference threshold (as in Equation (19)), thereby avoiding a serious misestimate of CN. The reference threshold $L_{TH}$ for repeats should be set appropriately.

$$
\begin{cases}
\text{Repeat}, & L_Q(\varphi_Q) \ge L_{TH} \\
\text{Pass}, & L_Q(\varphi_Q) < L_{TH}
\end{cases}
\tag{19}
$$

$Q\!\left(s_{t,j}, \tilde{a}_{t,j} \middle| \varphi_Q\right)$ is split into two parts:

$$
Q\!\left(s_{t,j}, a_{t,j} \middle| \varphi_Q\right) = Q^*\!\left(s_{t,j}, a_{t,j} \middle| \varphi_Q\right) + D\!\left(s_{t,j}, a_{t,j} \middle| \varphi_Q\right)
\tag{20}
$$

where $Q^*\!\left(s_{t,j}, a_{t,j} \middle| \varphi_Q\right)$ is the real mapping of $Q$ and $D\!\left(s_{t,j}, a_{t,j} \middle| \varphi_Q\right)$ is the estimation error. Therefore, Equation (18) is rewritten as:

$$
\nabla_{\varphi_A} L_A(\varphi_A) = \frac{1}{N_b} \sum_{j=1}^{N_b} \left[ \nabla_{\tilde{a}_{t,j}} Q^*\!\left(s_{t,j}, \tilde{a}_{t,j} \middle| \varphi_Q\right) + \nabla_{\tilde{a}_{t,j}} D\!\left(s_{t,j}, \tilde{a}_{t,j} \middle| \varphi_Q\right) \right] \nabla_{\varphi_A} A\!\left(s_{t,j} \middle| \varphi_A\right)
\tag{21}
$$

According to the Taylor expansion:

$$
\nabla_{\tilde{a}_{t,j}} D\!\left(s_{t,j}, \tilde{a}_{t,j} \middle| \varphi_Q\right) = \frac{1}{\Delta\tilde{a}} \left[ D\!\left(s_{t,j}, \tilde{a}_{t,j} + \Delta\tilde{a} \middle| \varphi_Q\right) - D\!\left(s_{t,j}, \tilde{a}_{t,j} \middle| \varphi_Q\right) - R_2 \right]
\tag{22}
$$

where $R_2$ is the higher-order residual term. Since $D\!\left(s_{t,j}, \tilde{a}_{t,j} \middle| \varphi_Q\right) \in (0, L_{TH})$:

$$
\left| D\!\left(s_{t,j}, \tilde{a}_{t,j} + \Delta\tilde{a} \middle| \varphi_Q\right) - D\!\left(s_{t,j}, \tilde{a}_{t,j} \middle| \varphi_Q\right) \right| < 2L_{TH}
\tag{23}
$$

$$
\left\| \nabla_{\tilde{a}_{t,j}} D\!\left(s_{t,j}, \tilde{a}_{t,j} \middle| \varphi_Q\right) \right\| < \frac{2L_{TH}}{\left\| \Delta\tilde{a} \right\|_2} - \frac{R_2}{\Delta\tilde{a}}
\tag{24}
$$

Naturally, setting $L_{TH}$ to a small value will reduce the misdirection from the unconverged CN to the AN. If $L_{TH}$ is too large, the RBT-DDPG will be weakened; if $L_{TH}$ is too small, CN will overfit the samples in a single batch. The RBT-DDPG shown in Figure 2 and Algorithm 1 is an example of how RBT can be combined with DDPG-based methods.
Figure 2.
Signal flow of the RBT-DDPG algorithm. When a sample batch is used to update the CN
parameters, the CN is repetitively trained using the loss function as its reference threshold (as shown
in Steps 4–7 of the Figure), thereby avoiding a serious misestimate of CN.
Algorithm 1 RBT-DDPG
1: Initialize parameters $\varphi_A$, $\varphi_{A'}$, $\varphi_Q$, $\varphi_{Q'}$.
2: for each iteration do
3:   for each environment step do
4:     $a = \left(1 - e^{-\theta t}\right)\mu + e^{-\theta t}\hat{a} + \sigma\sqrt{\frac{1 - e^{-2\theta t}}{2\theta}}\,\varepsilon$, $\varepsilon \sim N(0, 1)$, $\hat{a} = A(s)$.
5:   end for
6:   for each gradient step do
7:     Randomly sample $N_b$ samples.
8:     $\varphi_Q \leftarrow \varphi_Q + \alpha_C \nabla_{\varphi_Q} L_Q(\varphi_Q)$.
9:     if $L_C > L_{TH}$ then
10:      Go back to step 7.
11:    end if
12:    $\varphi_A \leftarrow \varphi_A + \alpha_A \nabla_{\varphi_A} L_A(\varphi_A)$.
13:    $\varphi_{Q'} \leftarrow (1 - s_r)\varphi_{Q'} + s_r \varphi_Q$.
14:    $\varphi_{A'} \leftarrow (1 - s_r)\varphi_{A'} + s_r \varphi_A$.
15:   end for
16: end for
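A minimal PyTorch-style sketch of the RBT gradient step (steps 7–14 of Algorithm 1) is given below. Network classes, the replay-buffer interface, the repeat cap, and the variable names are illustrative assumptions rather than the authors' implementation; the sketch follows the textual description of re-fitting the critic on the same batch until its loss drops below $L_{TH}$, which is the essential difference from plain DDPG.

```python
import torch
import torch.nn.functional as F

def rbt_gradient_step(batch, actor, critic, actor_t, critic_t,
                      actor_opt, critic_opt, gamma=1.0, l_th=0.1,
                      soft_rate=0.001, max_repeats=50):
    """One RBT-DDPG gradient step: repetitive batch training of the critic,
    followed by a single actor update and soft target-network updates."""
    s, a, r, s_next, done = batch          # tensors sampled from the memory pool
                                           # done: float tensor, 1.0 at terminal steps

    # Bellman target y_t computed with the target networks (no gradient flow).
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * critic_t(s_next, actor_t(s_next))

    # RBT: keep training the CN on the same batch until L_Q < L_TH (bounded here).
    for _ in range(max_repeats):
        critic_loss = F.mse_loss(critic(s, a), y)
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()
        if critic_loss.item() < l_th:
            break

    # Actor update: ascend the critic's estimate of Q(s, A(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft (Polyak) updates of the target networks with rate s_r.
    with torch.no_grad():
        for p_t, p in zip(critic_t.parameters(), critic.parameters()):
            p_t.mul_(1.0 - soft_rate).add_(soft_rate * p)
        for p_t, p in zip(actor_t.parameters(), actor.parameters()):
            p_t.mul_(1.0 - soft_rate).add_(soft_rate * p)
```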
3.2. Scheme of DRL
To approximate $Q^{\pi}(s_t, a_t)$ and the optimal AN through DRL, the state space, action space and instant reward function are designed as follows.
3.2.1. State Space
As mentioned in Section 2.1, in the vertical plane, the HGV and interceptors follow Equations (2) or (3), and both can be expressed in terms of two-dimensional coordinates, the velocity, and the ballistic inclination. Therefore, the network can predict the future flight state based on the current state. The AN state space is $S_{HI} = S_H \times S_{I1} \times \dots \times S_{IN}$, where $S_H \subset \mathbb{R}^4$ and $S_{Ii} \subset \mathbb{R}^4$ are the state spaces of the HGV and the interceptor $i$, respectively. AN needs to know the AO of the current state, $S_{\Omega} \subset \mathbb{R}$, to evaluate the available manoeuvrability. It also needs to know the position of the DRP, $S_D \subset \mathbb{R}^2$. As a result, the state space of AN is designed as a $(7 + 4N)$-dimensional space $S = S_{HI} \times S_{\Omega} \times S_D$, and the form of each element is $(x_H, y_H, v_H, \theta_H, x_{I1}, y_{I1}, v_{I1}, \theta_{I1}, \dots, x_{IN}, y_{IN}, v_{IN}, \theta_{IN}, \omega_{\max}, x_D, y_D)$. For CN, an additional action bit is needed, so its input space is $(8 + 4N)$-dimensional.
Remark 4. The definition of the state space used in this paper means that there is a vast input space for the neural network when an HGV is confronted with many interceptors, resulting in many duplicate elements and complicating the training process. An attention mechanism can alleviate this issue by extracting features from the original input space [37]. Typically, two interceptors are used to intercept one target in air defence operations [38]. Therefore, there is only one HGV and two interceptors in the virtual scenario, limiting the input space of the neural networks.
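For concreteness, the sketch below assembles the $(7+4N)$-dimensional AN input described above for the $N = 2$ case. The ordering follows the element list given in the text; any normalization is omitted because none is described here, so treat this as an illustrative assumption.

```python
import numpy as np

def build_actor_state(hgv, interceptors, omega_max, drp):
    """Stack the AN input: HGV state, N interceptor states, available overload, DRP.

    hgv, interceptors[i]: (x, y, v, theta); omega_max: real-time AO; drp: (x_D, y_D).
    """
    parts = [np.asarray(hgv, dtype=float)]
    parts += [np.asarray(i_state, dtype=float) for i_state in interceptors]
    parts.append(np.array([omega_max], dtype=float))
    parts.append(np.asarray(drp, dtype=float))
    return np.concatenate(parts)            # length 7 + 4N

# Example with two interceptors (N = 2): a 15-dimensional AN input.
s = build_actor_state(hgv=(0.0, 35e3, 2000.0, 0.0),
                      interceptors=[(200e3, 30e3, 1200.0, 0.0),
                                    (200e3, 40e3, 1500.0, 0.0)],
                      omega_max=0.05, drp=(200e3, 35e3))
assert s.shape == (15,)
```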
3.2.2. Action Space
As described in Equation (2), since $u(t)$ is limited to the interval $[-1, 1]$, the output, or action, of the neural network can easily be defined as $u(t)$. $u(t)$ is multiplied by the AO derived from the model information and then applied to the HGV's flight control system to ensure that the guidance signal meets the dynamical constraints of the current flight state. The neural network only needs to pick a real number within $[-1, 1]$, which indicates the choice of $\dot{\theta}$ as a guidance command subject to AO constraints. From a training perspective, when aiming to bypass the learning process of the AO, it is more straightforward to use the model information directly, rather than adopting the AoA as the action space.
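The sketch below shows one way this action mapping could look in code: the network output $u \in [-1, 1]$ is scaled by the model-derived maximum inclination rate $\delta_H$ before being passed to the flight control system. The clipping step and the function names are assumptions for illustration.

```python
import numpy as np

def action_to_command(u, delta_h_max):
    """Map the network action u in [-1, 1] to an inclination-rate command.

    delta_h_max is the real-time maximum inclination rate delta_H(x_H) computed
    from the vehicle model (Equation (2)), so the command always respects the AO.
    """
    u = float(np.clip(u, -1.0, 1.0))       # enforce the control constraint (10)
    return u * delta_h_max                  # commanded contribution to theta_dot

# Example: a half-strength pull-up request when the AO allows 0.05 rad/s.
theta_dot_cmd = action_to_command(u=0.5, delta_h_max=0.05)
```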
3.2.3. Instant Reward Function
As an essential part of the trial-and-error approach, the instant reward function guides the networks to learn the optimal strategy. Diverse instant reward functions indicate different behaviour tendencies, and the instant reward function affects the quality of the strategy learned by DRL. Following Equation (12), an instant reward function should be designed with two aims: (1) to apply the HGV-to-DRP distance reward function $r_E(\cdot)$ at the terminal time $t_f$, and (2) to apply the velocity reward function $r_D(\cdot)$ at $t_f$. The instant reward function is designed as follows:

$$
r(x_S(t)) =
\begin{cases}
r_E(x_S(t)) + r_D(x_S(t)), & t = t_f \\
0, & t < t_f
\end{cases}
\tag{25}
$$

It is necessary to convert $w_1$ in Equation (11) to a function with a positive domain in order to encourage the HGV to evade interceptors and reach $x_E$ during DRL:

$$
f(x) = \frac{1}{w_1 + x}
\tag{26}
$$
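A sketch of how such a terminal-only reward could be implemented is shown below. The exact functional forms of $r_E$ and $r_D$ are not spelled out in the text (only the conversion $f(x) = 1/(w_1 + x)$ of Equation (26) and the weights in Table 2), so the specific combination and scaling used here are assumptions for illustration only.

```python
import numpy as np

W1, W2 = 0.5, 1e-4            # weights from Table 2
V_MIN = 1000.0                # HGV minimum velocity, m/s (Table 1)

def terminal_reward(hgv_state, drp, evaded_all, t_is_final):
    """Sparse reward of Equation (25): zero until t_f, a terminal reward otherwise.

    r_E uses the positive-domain conversion f(x) = 1/(w_1 + x) of Equation (26)
    on the miss distance to the DRP; r_D rewards the surplus terminal velocity.
    This combination is an illustrative assumption, not the paper's exact form.
    """
    if not t_is_final or not evaded_all:
        return 0.0
    x, y, v, _ = hgv_state
    miss = np.hypot(x - drp[0], y - drp[1])
    r_e = 1.0 / (W1 + 1e-3 * miss)        # Equation (26) on the miss distance (in km)
    r_d = W2 * max(v - V_MIN, 0.0)        # reward the remaining kinetic margin
    return r_e + r_d
```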
Remark 5. If the flight state of the HGV does not meet the various constraints mentioned in Section 2, then the episode ends early, and the instant reward for the whole episode is 0, which introduces the common sparse reward problem in reinforcement learning. However, it is evident from Section 4 that RBT-DDPG-based methods can generate intelligence from sparse rewards.

Remark 6. An intensive instant reward function similar to that designed by Jiang et al. [39] is not used, since there is no experience to draw on for the HGV anti-interception problem. Furthermore, rather than forming a Bolza problem, this instant reward function is fully equivalent to the optimization goal, that is, the Mayer-type problem defined in Equation (12). Moreover, the neural network is not influenced by human tendencies in strategy exploration, resulting in an in-depth exploration of all strategic options that could approach the global optimum.

Remark 7. A curriculum-based approach similar to that discussed by Li et al. [40] was attempted, in which HGVs can quickly learn to reduce interceptors' energy using snake manoeuvres. As the policy solidifies during the approach phase, it is difficult to achieve global optimality using this framework, and the obtained performance is significantly lower than that obtained with Equation (25).
4. Training and Testing
Section 3 introduces RBT-DDPG-based methods to solve the GTAO problem discussed in Section 2. This section verifies the effectiveness of DRL in finding the optimal anti-interception guidance system.
4.1. Settings
4.1.1. Aircraft Settings
To simulate a random initial interceptor energy state in the virtual scenario, the interceptors' initial altitude and initial velocity follow uniform distributions, $U(25\ \mathrm{km}, 45\ \mathrm{km})$ and $U(1050\ \mathrm{m/s}, 1650\ \mathrm{m/s})$, respectively. Table 1 lists the remaining parameters.
The aerodynamics of an aircraft are usually approximated by a curve-fitted model (CFM). The CFM of the HGV used in this paper is referenced from Wang et al. [41]: $C_L = 0.21 + 0.075\,M + (0.23 + 0.05\,M)\,\alpha$ and $C_D = 0.41 + 0.011\,M + (0.0081 + 0.0021\,M)\,\alpha + 0.0042\,\alpha^2$, where $M$ is the Mach number of the HGV and $\alpha$ is the AoA (rad). Moreover, the interceptors use $C_L = (0.18 + 0.02\,M)\,\alpha$ and $C_D = 0.18 + 0.01\,M + 0.001\,M\,\alpha + 0.004\,\alpha^2$ [42]. The interceptors employ proportional navigation guidance. To compensate for the lack of representation of the vectoring capability in the virtual scenario, we increased the interceptors' kill radius to 300 m.
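The CFM above is straightforward to evaluate; the helper below codes the polynomials exactly as printed in the text (so treat the coefficients as indicative rather than authoritative, since signs can be lost in typesetting).

```python
def hgv_aero_coefficients(mach, alpha_rad):
    """HGV lift/drag coefficients from the curve-fitted model quoted above [41].

    Polynomial coefficients are copied as printed in the text; alpha is in rad.
    """
    c_l = 0.21 + 0.075 * mach + (0.23 + 0.05 * mach) * alpha_rad
    c_d = (0.41 + 0.011 * mach + (0.0081 + 0.0021 * mach) * alpha_rad
           + 0.0042 * alpha_rad ** 2)
    return c_l, c_d

def interceptor_aero_coefficients(mach, alpha_rad):
    """Interceptor lift/drag coefficients from the CFM quoted above [42]."""
    c_l = (0.18 + 0.02 * mach) * alpha_rad
    c_d = 0.18 + 0.01 * mach + 0.001 * mach * alpha_rad + 0.004 * alpha_rad ** 2
    return c_l, c_d

# Example: HGV at Mach 6 with a 5 degree AoA.
cl, cd = hgv_aero_coefficients(6.0, 5.0 * 3.141592653589793 / 180.0)
```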
Table 1. Parameters of the HGV and interceptor in the virtual scenario.

Parameter                                      Interceptor      HGV
Mass/kg                                        75               500
Reference area/m^2                             0.3              0.579
Minimum velocity/(m/s)                         400              1000
Available AoA/°                                −20 to 20        −10 to 10
Time constant of attitude control system/s     0.1              1
Initial coordinate x/km                        200              0
Initial coordinate y/km                        Random           35
Initial velocity/(m/s)                         Random           2000
Initial inclination/°                          0                0
Coordinate x of the DRP/km                     -                200
Coordinate y of the DRP/km                     -                35
Kill radius/m                                  300              -
4.1.2. Hyperparameter Settings
The hyperparameters used for training are shown in Table 2.

In AN, it is evident that the bulk of the computation occurs in the hidden layers. A neuron in a hidden layer reads in $n_i$ numbers ($n_i$ is the width of the previous layer) through a dropout layer (the drop rate is 0.2), multiplies them by the weights (totalling $2n_i + 0.8n_i$ FLOPs), adds up all the values (totalling $n_i$ FLOPs), and then passes the result through the activation function (LReLU) after adding a bias term (totalling 2 FLOPs), which means that a single neuron consumes $3.8n_i + 3$ FLOPs in a single calculation. The actor, as shown in Figure 3, consumes approximately 87K FLOPs in a single execution. Assuming an on-board computer with $10^9$ FLOPS of floating-point capability (most mainstream industrial FPGAs in 2021 provide more than 1 GFLOPS), a single execution of the anti-interception guidance formed by AN takes less than 0.1 ms.
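The per-neuron cost $3.8n_i + 3$ quoted above can be turned into a quick budget check. The layer widths below are hypothetical (Figure 3 is not reproduced here); they are chosen only to show how an estimate on the order of the quoted FLOP count and the sub-0.1 ms latency claim would be computed.

```python
def actor_flops(layer_widths):
    """Sum the per-neuron cost (3.8 * n_in + 3 FLOPs) over all non-input layers."""
    total = 0.0
    for n_in, n_out in zip(layer_widths[:-1], layer_widths[1:]):
        total += n_out * (3.8 * n_in + 3)
    return total

# Hypothetical AN shape: 15 inputs (N = 2 interceptors), two hidden layers, 1 output.
widths = [15, 128, 128, 1]
flops = actor_flops(widths)                  # on the order of tens of kFLOPs
latency_ms = flops / 1e9 * 1e3               # assuming ~1 GFLOPS on-board capability
print(f"{flops:.0f} FLOPs, ~{latency_ms:.3f} ms per execution")
```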
Figure 3. Structures of the neural networks ((Left) AN; (Right) CN) with the parameters of each layer.
Table 2. Hyperparameters of RBT-DDPG and RBT-TD3.

Parameter                    Value
Δt/s                         10^−2
T_c/s                        1
γ                            1
α_C                          10^−4
α_A in RBT-DDPG              10^−4
α_A in RBT-TD3               5 × 10^−5
s_r                          0.001
μ in RBT-DDPG                0.05
σ in RBT-DDPG                0.01
θ in RBT-DDPG                5 × 10^−5
σ in RBT-TD3                 0.1
L_TH                         0.1
Weight initialization        N(0, 0.02)
Bias initialization          N(0, 0.02)
w_1                          0.5
w_2                          10^−4
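The exploration noise in step 4 of Algorithm 1 is an Ornstein–Uhlenbeck-style perturbation of the actor output parameterized by the μ, σ and θ values in Table 2. The sketch below is an illustrative reading of that expression, not the authors' code.

```python
import numpy as np

def ou_exploration_action(a_hat, t, mu=0.05, sigma=0.01, theta=5e-5, rng=None):
    """Exploration action from step 4 of Algorithm 1.

    a = (1 - e^{-theta t}) mu + e^{-theta t} a_hat
        + sigma * sqrt((1 - e^{-2 theta t}) / (2 theta)) * eps,   eps ~ N(0, 1),
    i.e. the exact solution of an Ornstein-Uhlenbeck process started at a_hat.
    Default parameter values follow the RBT-DDPG entries of Table 2.
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal()
    decay = np.exp(-theta * t)
    diffusion = sigma * np.sqrt((1.0 - np.exp(-2.0 * theta * t)) / (2.0 * theta))
    a = (1.0 - decay) * mu + decay * a_hat + diffusion * eps
    return float(np.clip(a, -1.0, 1.0))     # keep the action inside [-1, 1]
```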
4.2. Training Results and Analysis
The CPU of the training platform is an AMD Ryzen 5 3600 @ 4.2 GHz, and the RAM is 8 GB × 2 DDR4 @ 3733 MHz. As the networks are straightforward and the computation is concentrated on calculating the aircraft models, GPUs were not used for training. The aircraft models in the virtual scenario are written in C++ and packaged as a dynamic link library (DLL). The networks and the training algorithm are implemented in Python and interact with the virtual scenario by calling the DLL. The training process is shown in Figures 4–6.
Figure 4.
Cumulative reward during training. With the help of the RBT mechanism, both DDPG and
TD3 reached a faster training speed.
Figure 5. Loss functions of AN and CN during training. Similar to the cumulative reward curve presented in Figure 4, the actor loss function of RBT-DDPG in Figure 5 decreases faster, indicating that RBT-DDPG ensures that AN learns faster. RBT-DDPG has a lower CN loss function throughout almost all episodes, reflecting that RBT can improve the CN estimation. The same phenomenon occurs in the comparison between RBT-TD3 and its original version.
Figure 6. The number of RBT iterations that occur during RBT-DDPG training. RBT is repeated several times at the beginning of training (near the first training iteration), as CN is required to train to a tiny estimation error. RBT is barely executed before 60,000 training steps, as CN can already provide accurate estimates for the samples in the current memory pool, and no additional training is needed. From steps 100,000 to 200,000, RBT is repeated several times, and many of the executions are greater than 10. Due to the introduction of new strategies into the memory pool, the original CN does not accurately estimate the Q value, so additional training is performed. Between steps 200,000 and 350,000, RBT is occasionally executed, and most executions contain less than 10 repetitions because fine-tuning CN can accommodate the new strategies that are explored by the actor. RBT executions increase after 350,000 steps, as CN must adapt to the multiple strategy samples brought into the memory pool. At the end of training, the average number of repetitions is approximately 2, which is an acceptable algorithmic complexity cost.
For episodes 0–3500, DDPG and RBT-DDPG are in the process of constructing their memory pools. The neural networks produce approximately random outputs in this process, and no training occurs. As a result, cumulative rewards are near 0 for most episodes, with a slim possibility of reaching six points through random exploration. The neural networks begin to be iteratively updated as soon as the memory pool is complete. The RBT-DDPG exceeds the maximum score of the random exploration process at approximately 3900 episodes, reaching close to seven points. The RBT-DDPG then gradually boosts intelligence at a rate of approximately 0.00145 points/episode, which is approximately three times the 0.00046 points/episode of the DDPG. RBT-TD3 intelligence achieves steady growth from about episode 500. TD3, on the other hand, apparently fell into a local optimum before episode 3500, with reward values that remained around 0.8 points. RBT-TD3 learns faster than RBT-DDPG because its learning algorithm is more complex, but the strategies they learned are similar and eventually converge to almost the same reward.
4.3. Test Results and Analysis
A Monte Carlo test was performed to verify that the strategies learned by the neural
network are universally adaptable. In addition, some cases were used to analyze the
anti-interception strategies. As the final performance obtained by RBT-DDPG and RBT-TD3
is similar, due to space limitations, this section uses the AN learned by RBT-DDPG as the
test object.
4.3.1. Monte Carlo Test
To reveal the specific strategy that was obtained from training, the AN from RBT-DDPG controls an HGV in virtual scenarios to perform anti-interception tests. To verify the adaptability of the AN to different initial states, a test was conducted using scenarios in which the initial altitude and velocity of the interceptors were randomly distributed (with the same distribution as used during training). Since exploration is no longer needed, no further OU processes were added to AN. A total of 1000 episodes were conducted.

The ANs from DDPG and RBT-DDPG were tested for 1000 episodes each. Table 3 and Figure 7 illustrate the results. Suppose that the measure of the success of an anti-interception is whether it eliminates two interceptors. In that case, the 91.48% anti-interception success rate of the RBT-DDPG AN is better than the 79.74% rate of the DDPG AN, reflecting the greater adaptability of the RBT-DDPG AN to complex initial conditions. According to the average terminal miss distance $\bar{e}(t_f)$, having eliminated the interceptors, both ANs perform well in achieving DRP regression. However, in terms of the average terminal velocity $\bar{v}(t_f)$, the HGV guided by the AN from RBT-DDPG is faster. For RBT-DDPG, the probability density peaks at a reward of 7.4, while for DDPG it peaks at 6.7, indicating that RBT-DDPG can perform well in more scenarios than DDPG.
To observe the specific strategies learned through RBT-DDPG, we traversed the initial conditions of the two interceptors and tested each combination individually. Figure 8 indicates the correspondence between the initial states of the interceptors and the test case serial numbers. The vertical motion of the HGV and interceptors in all cases is shown in Figure 9.
Table 3. Statistical results.

Algorithm     Anti-Interception Success Rate     ē(t_f)/m     v̄(t_f)/(m/s)
DDPG          79.74%                             1425.62      1377.17
RBT-DDPG      91.48%                             1514.44      1453.81
Figure 7. Probability distribution and density of the cumulative reward for each episode in the Monte Carlo test. About 10% of the RBT-DDPG rewards are less than 3, compared to about 24% for DDPG, reflecting the greater adaptability of the RBT-DDPG AN to complex initial conditions.
Figure 8. Correspondence between initial states and test case serial numbers (the horizontal coordinate represents the initial state of the first interceptor, while the vertical coordinate represents the second interceptor. The letters H, M, and L in the first position represent 44 km, 35 km, and 26 km altitudes, respectively. The letters H, M, and L in the second position represent 1500 m/s, 1250 m/s, and 1000 m/s velocities, respectively).
In Figure 9, the neural network adopts different strategies in response to interceptors
with variable initial energies. In terms of behaviour, the strategies fall into two categories:
S-curve (dive–leap–dive) and C-curve (leap–dive). There is also a specific pattern to the
peaks. In general, the higher the initial energy of the interceptors faced by the HGV, the
higher the peak.
Figure 9.
Vertical motion of the HGV and interceptors in the test cases. The strategies fall into two
categories: S-curve (dive–leap–dive) and C-curve (leap–dive). There is also a specific pattern to
the peaks.
4.3.2. Analysis of Anti-Interception Strategies
Using the data from the 5th and 42nd test cases mentioned in Section 4.3.1, we attempted to identify the strategies that the neural network learned through DRL.

As shown by the solid purple trajectory in Figures 10 and 11, we also used the differential game (DG) approach [33] as a comparison method. The DG approach uses the relative angle and velocity information as input to guide the flight and can successfully evade the interceptor. However, while the HGV evaded the interceptor under DG guidance, it lost significant kinetic energy due to its long residence in the dense atmosphere and its low ballistic dive point, and it fell below the minimum velocity at approximately 55 km from the target. DG cannot account for atmospheric density and cannot optimize energy, which is a significant advantage of DRL.
Figure 10. Vertical motion comparison between RBT-DDPG (Cases 5 and 42) and Differential Game. The DG approach uses the relative angle and velocity information as input to guide the flight and can successfully evade the interceptor. In contrast, DG cannot account for atmospheric density and cannot optimize energy, which is a significant advantage of DRL. The AN learned by RBT-DDPG chooses to dive before leaping in Case 5, whereas, in Case 42, it takes a direct leap. Furthermore, while the peak in Case 5 is 60 km, it only reaches 54 km in Case 42 before diving. The AN can control the HGV to select the appropriate ballistic inclination for the dive after escaping the interceptor.
Figure 11. Velocity comparison between RBT-DDPG (Cases 5 and 42) and Differential Game.
The neural network chooses to dive before leaping in Case 5, whereas, in Case 42, it takes a direct leap. Furthermore, while the peak in Case 5 is 60 km, it only reaches 54 km in Case 42 before diving. In both cases, the HGV causes one of the interceptors to go below the minimum flight speed (400 m/s) before entering the rendezvous phase. In the rendezvous phase, the minimum distances between the interceptor and the HGV are 389 m and 672 m, respectively, which indicates that the HGV passes close to the interceptable area of the interceptors. The terminal velocity of the HGV in Case 5 is approximately 100 m/s lower than that in Case 42, due to the higher initial interceptor energy faced in Case 5, which resulted in a longer manoeuvre path and a more violent pull-up in the dense atmospheric region. Figure 12 illustrates that the HGV tends to perform a large overload manoeuvre in the approach phase, almost fully utilizing its manoeuvring ability. In the egress phase of Case 42, only very small manoeuvres were required to correct the ballistic inclination, demonstrating that the neural network can control the HGV and select the appropriate ballistic inclination for the dive after escaping the interceptor.
Figure 12. Overloads comparison between Cases 5 and 42.
We derived a rudimentary instant reward function with no prior knowledge of the strategy that should be implemented, resulting in a sparse reward problem. Nevertheless, the CN trained by RBT-DDPG does not make significant Q estimation errors. The Q estimation is accurate at the beginning of an episode, and this accuracy is maintained throughout the process in both Case 5 and Case 42 (Figure 13). This phenomenon is consistent with the idea presented in Equation (12) that the flight states of both sides at the outset determine the optimal anti-interception strategy that the HGV should implement.

Figure 13. Q-value comparison between Cases 5 and 42 shown in Figure 12. The Q estimation is accurate at the beginning of an episode and maintains accuracy throughout the process in both Case 5 and Case 42.
Figures 10–12 illustrate the anti-interception strategy learned by RBT-DDPG: (1) The HGV lures interceptors with high initial energy into the dense atmosphere through diving manoeuvres in the approach phase, then relies on pull-up manoeuvres in the denser atmosphere to drain much of the interceptors' energy. It is important to note that this dive manoeuvre also consumes the HGV's kinetic energy (e.g., Case 5). In contrast, when confronted with interceptors with low initial energy, the neural network does not choose to dive first, even though this strategy is feasible, but instead leaps directly into the thin-atmosphere region (e.g., Case 42), reflecting the optimality of the strategy. (2) Through the approach-phase manoeuvre, the HGV reduces the kinetic energy of the interceptors to a proper level and attracts the interceptors into the thin atmosphere. Here, the interceptor's AO no longer allows the interceptable area to cover the whole reachable area of the HGV, which allows the HGV to gain an available penetration path, as shown in Figure 14.
Figure 14. Illustration of the penetration strategy during the rendezvous phase learned by RBT-DDPG. The interceptor's AO no longer allows the interceptable area to cover the whole reachable area of the HGV, which allows the HGV to gain an available penetration path.
5. Conclusions
Traditionally, research on anti-interception guidance for aircraft has focused on differential game theory and optimization algorithms. Due to the high number of matrix calculations needed, applying differential game theory online is computationally uneconomical. Even though the newly developed ADP algorithm employs a neural network that significantly reduces the computation associated with the Hamilton functions, it cannot be applied to aircraft with very large flight envelopes, such as HGVs. It is challenging to implement convex programming, sequential quadratic programming, or other planning algorithms for HGVs due to their high computational complexity and insufficient real-time performance.

We conceptualize the penetration strategy of HGVs as a GTAO problem from the perspective of optimal control, revealing the significant impact that the initial conditions of both the attackers and defenders have on the penetration strategy. The problem is then modelled as an MDP and solved using the DDPG algorithm. The RBT-DDPG algorithm was developed to improve the CN estimation during the training process. At the end of the paper, the data on the training process and online simulation tests verify that RBT-DDPG can autonomously learn anti-interception guidance and adopt a rational strategy (S-curve or C-curve) when interceptors have differing initial energy conditions. Compared to the traditional DDPG algorithm, our proposed algorithm reduces the training episodes by 48.48%. Since the AN deployed online is straightforward, it is suitable for onboard computers. To our knowledge, this is the first work to apply DRL to achieve anti-interception guidance for HGVs.
This paper focuses on a scenario in which one HGV breaks through the defences of two interceptors; this is a traditional scenario but may not be fully adapted to future trends in group confrontation. In the future, we anticipate applying multi-agent DRL (MA-DRL) to multiple HGVs. The agents trained by MA-DRL can conduct guidance for each HGV in a distributed manner under limited-information constraints, so the anti-interception guidance strategy can adapt to the numbers of enemies and HGVs. This will greatly improve the generalizability of the intelligence to the battlefield. Additionally, RL is known to suffer from long training times. We anticipate using pseudo-spectral methods to create a collection of expert data and then combining some expert advice [43,44] to accelerate training.
Author Contributions:
Conceptualization, L.J. and Y.N.; methodology, L.J.; software, L.J.; validation,
Y.Z., Z.L.; formal analysis, L.J.; investigation, Y.N.; resources, Y.N.; data curation, L.J.; writing—
original draft preparation, L.J.; writing—review and editing, Y.Z.; visualization, L.J.; supervision,
Y.N.; project administration, Y.N.; funding acquisition, Y.N. All authors have read and agreed to the
published version of the manuscript.
Funding:
This work was supported in part by the Aviation Science Foundation of China under Grant
201929052002.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Acknowledgments: The authors would like to thank Cheng Yuehua from the University of Aeronautics and Astronautics for her invaluable support during the writing of the paper.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Guo, Y.; Gao, Q.; Xie, J.; Qiao, Y.; Hu, X. Hypersonic vehicles against a guided missile: A defender triangle interception approach. In Proceedings of the 2014 IEEE Chinese Guidance, Navigation and Control Conference, Yantai, China, 8–10 August 2014; pp. 2506–2509.
2. Liu, K.F.; Meng, H.D.; Wang, C.J.; Li, J.; Chen, Y. Anti-Head-on Interception Penetration Guidance Law for Slide Vehicle. Mod. Def. Technol. 2008, 4, 39–45.
3. Luo, C.; Huang, C.Q.; Ding, D.L.; Guo, H. Design of Weaving Penetration for Hypersonic Glide Vehicle. Electron. Opt. Control 2013, 7, 67–72.
4. Zhu, Q.G.; Liu, G.; Xian, Y. Simulation of Reentry Maneuvering Trajectory of Tactical Ballistic Missile. Tactical Missile Technol. 2008, 1, 79–82.
5. He, L.; Yan, X.D.; Tang, S. Guidance law design for spiral-diving maneuver penetration. Acta Aeronaut. Astronaut. Sin. 2019, 40, 188–202.
6. Zhao, K.; Cao, D.Q.; Huang, W.H. Manoeuvre control of the hypersonic gliding vehicle with a scissored pair of control moment gyros. Sci. China Technol. 2018, 61, 1150–1160. [CrossRef]
7. Zhao, X.; Qin, W.W.; Zhang, X.S.; He, B.; Yan, X. Rapid full-course trajectory optimization for multi-constraint and multi-step avoidance zones. J. Solid Rocket. Technol. 2019, 42, 245–252.
8. Wang, P.; Yang, X.L.; Fu, W.X.; Qiang, L. An On-board Reentry Trajectory Planning Method with No-fly Zone Constraints. Missiles Space Vehicles 2016, 2, 1–7.
9. Fang, X.L.; Liu, X.X.; Zhang, G.Y.; Wang, F. An analysis of foreign ballistic missile manoeuvre penetration strategies. Winged Missiles J. 2011, 12, 17–22.
10. Sun, S.M.; Tang, G.J.; Zhou, Z.B. Research on Penetration Maneuver of Ballistic Missile Based on Differential Game. J. Proj. Rocket. Missiles Guid. 2010, 30, 65–68.
11. Imado, F.; Miwa, S. Fighter evasive maneuvers against proportional navigation missile. J. Aircr. 1986, 23, 825–830. [CrossRef]
12. Zhang, G.; Gao, P.; Tang, Q. The Method of the Impulse Trajectory Transfer in a Different Plane for the Ballistic Missile Penetrating Missile Defense System in the Passive Ballistic Curve. J. Astronaut. 2008, 29, 89–94.
13. Wu, Q.X.; Zhang, W.H. Research on Midcourse Maneuver Penetration of Ballistic Missile. J. Astronaut. 2006, 27, 1243–1247.
14. Zhang, K.N.; Zhou, H.; Chen, W.C. Trajectory Planning for Hypersonic Vehicle With Multiple Constraints and Multiple Manoeuvreing Penetration Strategies. J. Ballist. 2012, 24, 85–90.
15. Xian, Y.; Tian, H.P.; Wang, J.; Shi, J.Q. Research on intelligent manoeuvre penetration of missile based on differential game theory. Flight Dyn. 2014, 32, 70–73.
16. Sun, J.L.; Liu, C.S. An Overview on the Adaptive Dynamic Programming Based Missile Guidance Law. Acta Autom. Sin. 2017, 43, 1101–1113.
17. Sun, J.L.; Liu, C.S. Distributed Fuzzy Adaptive Backstepping Optimal Control for Nonlinear Multimissile Guidance Systems with Input Saturation. IEEE Trans. Fuzzy Syst. 2019, 27, 447–461.
18. Sun, J.L.; Liu, C.S. Backstepping-based adaptive dynamic programming for missile-target guidance systems with state and input constraints. J. Frankl. Inst. 2018, 355, 8412–8440. [CrossRef]
19. Wang, F.; Cui, N.G. Optimal Control of Initiative Anti-interception Penetration Using Multistage Hp-Adaptive Radau Pseudospectral Method. In Proceedings of the 2015 2nd International Conference on Information Science and Control Engineering, Shanghai, China, 24–26 April 2015.
20. Liu, Y.; Yang, Z.; Sun, M.; Chen, Z. Penetration design for the boost phase of near space aircraft. In Proceedings of the 2017 36th Chinese Control Conference, Dalian, China, 26–28 July 2017.
21. Marcus, G. Innateness, alphazero, and artificial intelligence. arXiv 2018, arXiv:1801.05667.
22. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the game of go without human knowledge. Nature 2017, 550, 354–359. [CrossRef]
23. Osband, I.; Blundell, C.; Pritzel, A.; Van Roy, B. Deep Exploration via Bootstrapped DQN. arXiv 2016, arXiv:1602.04621.
24. Chen, J.W.; Cheng, Y.H.; Jiang, B. Mission-Constrained Spacecraft Attitude Control System On-Orbit Reconfiguration Algorithm. J. Astronaut. 2017, 38, 989–997.
25. Dong, C.; Deng, Y.B.; Luo, C.C.; Tang, X. Compression Artifacts Reduction by a Deep Convolutional Network. In Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015.
26. Fu, X.W.; Wang, H.; Xu, Z. Research on Cooperative Pursuit Strategy for Multi-UAVs based on DE-MADDPG Algorithm. Acta Aeronaut. Astronaut. Sin. 2021, 42, 311–325.
27. Brian, G.; Kris, D.; Roberto, F. Adaptive Approach Phase Guidance for a Hypersonic Glider via Reinforcement Meta Learning. In Proceedings of the AIAA SCITECH 2022 Forum, San Diego, CA, USA, 3–7 January 2022.
28. Wen, H.; Li, H.; Wang, Z.; Hou, X.; He, K. Application of DDPG-based Collision Avoidance Algorithm in Air Traffic Control. In Proceedings of the ISCID 2019: IEEE 12th International Symposium on Computational Intelligence and Design, Hangzhou, China, 14 December 2020.
29. Lin, G.; Zhu, L.; Li, J.; Zou, X.; Tang, Y. Collision-free path planning for a guava-harvesting robot based on recurrent deep reinforcement learning. Comput. Electron. Agric. 2021, 188, 106350. [CrossRef]
30. Lin, Y.; Mcphee, J.; Azad, N.L. Anti-Jerk On-Ramp Merging Using Deep Reinforcement Learning. In Proceedings of the IVS 2020: IEEE Intelligent Vehicles Symposium, Las Vegas, NV, USA, 19 October–13 November 2020.
31. Xu, X.L.; Cai, P.; Ahmed, Z.; Yellapu, V.S.; Zhang, W. Path planning and dynamic collision avoidance algorithm under COLREGs via deep reinforcement learning. Neurocomputing 2021, 468, 181–197. [CrossRef]
32. Lei, H.M. Principles of Missile Guidance and Control. Control Technol. Tactical Missile 2007, 15, 162–164.
33. Cheng, T.; Zhou, H.; Dong, X.F.; Cheng, W.C. Differential game guidance law for integration of penetration and strike of multiple flight vehicles. J. Beijing Univ. Aeronaut. Astronaut. 2022, 48, 898–909.
34. Zhao, J.S.; Gu, L.X.; Ma, H.Z. A rapid approach to convective aeroheating prediction of hypersonic vehicles. Sci. China Technol. Sci. 2013, 56, 2010–2024. [CrossRef]
35. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998.
36. Liu, R.Z.; Wang, W.; Shen, Y.; Li, Z.; Yu, Y.; Lu, T. An Introduction of mini-AlphaStar. arXiv 2021, arXiv:2104.06890.
37. Deka, A.; Luo, W.; Li, H.; Lewis, M.; Sycara, K. Hiding Leader's Identity in Leader-Follower Navigation through Multi-Agent Reinforcement Learning. arXiv 2021, arXiv:2103.06359.
38. Xiong, J.-H.; Tang, S.-J.; Guo, J.; Zhu, D.-L. Design of Variable Structure Guidance Law for Head-on Interception Based on Variable Coefficient Strategy. Acta Armamentarii 2014, 35, 134–139.
39. Jiang, L.; Nan, Y.; Li, Z.H. Realizing Midcourse Penetration With Deep Reinforcement Learning. IEEE Access 2021, 9, 89812–89822. [CrossRef]
40. Li, B.; Yang, Z.P.; Chen, D.Q.; Liang, S.Y.; Ma, H. Maneuvering target tracking of UAV based on MN-DDPG and transfer learning. Def. Technol. 2021, 17, 457–466. [CrossRef]
41. Wang, J.; Zhang, R. Terminal guidance for a hypersonic vehicle with impact time control. J. Guid. Control Dyn. 2018, 41, 1790–1798. [CrossRef]
42. Ge, L.Q. Cooperative Guidance for Intercepting Multiple Targets by Multiple Air-to-Air Missiles. Master's Thesis, Nanjing University of Aeronautics and Astronautics, Nanjing, China, 2019.
43. Cruz, F.; Parisi, G.I.; Twiefel, J.; Wermter, S. Multi-modal integration of dynamic audiovisual patterns for an interactive reinforcement learning scenario. In Proceedings of the RSJ 2016: IEEE International Conference on Intelligent Robots & Systems, Daejeon, Korea, 9–14 October 2016.
44. Bignold, A.; Cruz, F.; Dazeley, R.; Vamplew, P.; Foale, C. Human engagement providing evaluative and informative advice for interactive reinforcement learning. Neural Comput. Appl. 2022. [CrossRef]