Citation: Jiang, L.; Nan, Y.; Zhang, Y.; Li, Z. Anti-Interception Guidance for Hypersonic Glide Vehicle: A Deep Reinforcement Learning Approach. Aerospace 2022, 9, 424. https://doi.org/10.3390/aerospace9080424

Academic Editor: Sergey Leonov

Received: 8 April 2022
Accepted: 1 August 2022
Published: 4 August 2022

Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Anti-Interception Guidance for Hypersonic Glide Vehicle:
A Deep Reinforcement Learning Approach
Liang Jiang 1,*, Ying Nan 1, Yu Zhang 2 and Zhihan Li 1

1 College of Astronautics, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
2 School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
* Correspondence: nuaajl@nuaa.edu.com
Abstract: Anti-interception guidance can enhance the survivability of a hypersonic glide vehicle (HGV) confronted with multiple interceptors. In general, anti-interception guidance for aircraft can be divided into procedural guidance, fly-around guidance and active evading guidance. However, these guidance methods cannot cope with an HGV's unknown real-time engagement process due to limited intelligence information or onboard computing abilities. In this paper, an anti-interception guidance approach based on deep reinforcement learning (DRL) is proposed. First, the penetration process is conceptualized as a generalized three-body adversarial optimal (GTAO) problem. The problem is then modelled as a Markov decision process (MDP), and a DRL scheme consisting of an actor-critic architecture is designed to solve it. We propose a new mechanism called repetitive batch training (RBT): reusing the same sample batch during training results in fewer serious estimation errors in the critic network (CN), which provides better gradients to the immature actor network (AN). The training data and test results confirm that RBT can improve on traditional DDPG-based methods.
Keywords: hypersonic glide vehicle; anti-interception; deep reinforcement learning; guidance
1. Introduction
A hypersonic glide vehicle (HGV) has the advantages of a ballistic missile and a lifting body vehicle. It is efficient as it does not require any additional complex anti-interception mechanisms (e.g., carrying a defender to destroy an interceptor [1] or relying on the range advantage of a high-powered onboard radar for early evasion before the interceptor locks on [2]), and can achieve penetration through anti-interception guidance. Over the past decade, studies have focused on utilizing HGV manoeuvrability to protect against interceptors.
In general, anti-interception guidance for aircraft has three categories: (1) procedural
guidance [3–6], (2) fly-around guidance [7,8], and (3) active evading guidance [9,10].
Early research on anti-interception guidance focused on procedural guidance. In procedural guidance, the desired trajectory (such as sine manoeuvres [5], square wave manoeuvres [3], or snake manoeuvres [6]) is planned prior to launch based on facts such as the target position, interceptor capability and manoeuvre strategy, and the vehicle receives guidance based on the fixed trajectory after launch. Imado et al. [11] studied lateral procedural guidance in a horizontal plane. Zhang et al. [12] proposed a midcourse penetration strategy using an axial impulse manoeuvre and provided a detailed trajectory design method. This penetration strategy does not require lateral pulse motors. To eliminate the re-entry error caused by the midcourse manoeuvre, Wu et al. [13] used the remaining pulse motors to return to a preset ballistic trajectory. Procedural guidance only needs to plan a trajectory and inject it into the onboard computer before launch, which is easy to implement and does not occupy onboard computing resources. With the advancement of interception guidance, however, the procedural manoeuvres studied in [11,12] may be recognized by interceptors, and effectiveness cannot be guaranteed against advanced interceptors.
Aerospace 2022,9, 424. https://doi.org/10.3390/aerospace9080424 https://www.mdpi.com/journal/aerospace
In conjunction with advances in planning and information fusion, flying around the detection zone has emerged as a penetration strategy. The primary objective is to plan a trajectory that evades the enemy's detection zones, which leads to a complex nonlinear programming problem with multiple constraints and stages. In terms of the detection zone, Zhang et al. [14] established infinitely high cylindrical and semi-ellipsoidal detection zone models under the earth-flattening assumption and optimized a trajectory that satisfies the waypoint and detection zone constraints. Zhao et al. [7] proposed adjusting the interval density with curvature and error criteria based on the multi-interval pseudo-spectral method. An adaptive pseudo-spectral method was constructed, and the number of points in each interval was allocated accordingly. A rapid trajectory optimization algorithm was also proposed for the whole course under multiple constraints and multiple detection zones. However, fly-around guidance has insufficient adaptability on the battlefield. Potential re-entry points have a wide distribution, and the circumvention methods studied in [7,14] cannot guarantee that a planned trajectory will meet energy constraints. In addition, there may not be a trajectory that can evade all detection zones in an enemy's key air defence area. Moreover, it may be impossible to know the enemy's detection zones due to limited intelligence.
As onboard air-detection capabilities have advanced, active evading has gradually gained popularity in anti-interception guidance, and some research results have been achieved through differential game (DG) theory and numerical optimization algorithms in recent years. In DG, Hamiltonian functions are built based on an adversarial model and are then solved by numerical algorithms to find the optimal control using real-time aircraft flight states. Xian et al. [15] conducted research based on DG and obtained a strategy set of evasion manoeuvres. Based on an accurate model of a penetrating spacecraft and an interceptor, Bardhan et al. [4] proposed guidance using a state-dependent Riccati equation (SDRE). This approach achieved superior combat effectiveness compared with traditional DG. However, DG requires many calculations and has poor real-time performance. Model errors can be introduced during linearization [16], and onboard computers have difficulty achieving high-frequency corrections. The ADP algorithm was employed by Sun et al. [17,18] to address the horizontal flight pursuit problem. In an attempt to reduce the computational complexity, neural networks were used to fit the Hamiltonian function online. However, it is unclear whether this idea can be adapted for an HGV featuring a giant flight envelope. Numerical optimization algorithms, for example, pseudo-spectral methods [19,20], have been used to discretize complex nonlinear HGV differential equations and convert various types of constraints into algebraic constraints. Various optimization methods, such as convex optimization or sequential quadratic programming, are then used to solve for the optimal trajectory. The main drawback of applying numerical optimization algorithms to HGV anti-interception guidance is that they occupy a considerable amount of onboard computing resources over a long period of time, and the computational time required increases exponentially with the number of aircraft. Due to these limitations, active evading guidance is unsuitable for engineering applications.
Reinforcement learning (RL) is a model-free algorithm used to solve decision-making problems and has gained attention in the control field because it is entirely data-driven, does not require model knowledge, and can perform end-to-end self-learning. Due to the limitations of traditional RL, early research could not handle high-dimensional and continuous battlefield state information. In recent years, deep neural networks (DNNs) have demonstrated the ability to approximate arbitrary functions and have unparalleled advantages in the feature extraction of high-dimensional data. Deep reinforcement learning (DRL), a technique resulting from the intersection of DNNs and RL, has abilities that can exceed human empirical cognition [21]. A wave of research into DRL has been sparked by successful applications such as AlphaZero [22] and Deep Q-Networks (DQN) [23,24]. After training, a DNN can output control commands in milliseconds and has good generalization ability in unknown environments [25]. Therefore, DRL has promising
applications in aircraft guidance. There has been some discussion regarding using DRL to train DNNs to intercept a penetrating aircraft [26]. Brain et al. [27] used reinforcement meta-learning to optimize an adaptive guidance system suitable for the approach phase of an HGV. However, no research has been conducted on the application of DRL to HGV anti-interception guidance, although some studies in different areas have examined similar questions. Wen et al. [28] proposed a collision avoidance method based on the deep deterministic policy gradient (DDPG) approach, in which a proper heading angle was obtained to guarantee conflict-free conditions for all aircraft. For micro-drones flying in orchards, Lin et al. [29] implemented DDPG to create a collision-free path. Guo et al. studied a similar problem, the difference being that DDPG was applied to an unmanned ship. Lin et al. [30] studied how to use DDPG to train a fully connected DNN to avoid collision with four other vehicles by controlling the acceleration of a merging vehicle. For unmanned surface vehicles (USVs), Xu et al. [31] used DDPG to determine the switching time between path-planning and dynamic collision avoidance. These studies led us to believe that DDPG-based methods have promising applications for solving the anti-interception guidance problem of HGVs. However, due to the differences in the objects of study, the anti-interception guidance problem for HGVs requires consideration of the following issues: (1) The performance (especially the available overload) of an HGV is time-varying with velocity and altitude, while the performance of the controlled object is fixed in the abovementioned studies. We need to build a model in which the DDPG training process is not affected by time-varying performance. (2) The end state is the only concern in the anti-interception guidance problem, and only one instant reward is obtained in a training episode. Therefore, the sparsity and delayed-reward effects are more significant in this study than in the studies mentioned above. In this paper, we attempt to improve the existing DDPG-based methods to help DNNs gain intelligence faster.
The main contributions of this paper are as follows: (1) Anti-interception HGV guidance is described as an optimization problem, and a generalized three-body adversarial optimization (GTAO) model is developed. This model does not need to account for the severe constraints on the available overload (AO) and is suitable for DRL. To our knowledge, this is the first time that DRL has been applied to the anti-interception guidance of an HGV. (2) A DRL scheme is developed to solve the GTAO problem, and the RBT-DDPG algorithm is proposed. Compared with traditional DDPG-based algorithms, the RBT-DDPG algorithm can improve the learning effects of the critic network (CN), alleviate the exploration-exploitation paradox, and achieve better performance. In addition, since the forward computation of a fully connected neural network is very simple, an intelligent network trained by DRL can quickly compute a command for anti-interception guidance that matches the high dynamic characteristics of the HGV. (3) A strategy review of HGV anti-interception guidance derived from the DRL approach is provided. We note that this is the first time that these strategies have been summarized semantically for HGV guidance, which may inspire the research community.
The remainder of this paper is organized as follows: Section 2 describes the problem of anti-interception guidance for HGVs as an optimization problem and translates it into solving a Markov decision process (MDP). In Section 3, the RBT-DDPG algorithm is given in detail. Moreover, we propose a specific design of the state space, action space, and reward functions necessary to solve the MDP using DRL. Section 4 examines the training and test data and the specific anti-interception strategy. Section 5 presents the conclusions of the paper and an outlook on intelligent guidance.
2. Problem Description
Figure 1 shows the path of an HGV conducting an anti-interception manoeuvre. The coordinates and velocity of the aircraft are known, as is the coordinate of the desired regression point (DRP). The HGV and interceptor rely on aerodynamics to perform ballistic manoeuvres. The aerodynamic forces are mainly derived from the angle of attack (AoA). As an HGV can glide for thousands of kilometres, this paper focuses on the guidance needed after the interceptors have locked on to the HGV, and the flight distance of this process is set to 200 km.
Figure 1. Illustration of an HGV anti-interception manoeuvre. The anti-interception of an HGV can be divided into three phases. (1) Approach: the HGV manoeuvres according to anti-interception guidance, while the interceptor operates under its own guidance. Since the HGV is located a long distance from the interceptor and has high energy, various penetration strategies are available during this phase. (2) Rendezvous: at this phase, the distance between the HGV and the interceptors is the shortest. This distance may be shorter than the kill radius of the interceptors, allowing the HGV to be intercepted, or greater than the kill radius, allowing the HGV to successfully avoid interception. (3) Egress: with its remaining energy, the HGV flies to the DRP after moving away from the interceptors. A successful mission requires the HGV to arrive at the DRP with the highest levels of energy and accuracy. From the above analysis, it can be seen that whether the HGV can evade the interceptors in phase (2) depends on the manoeuvres adopted in phase (1). Phase (1) also determines the difficulty of ballistic regression in phase (3).
2.1. The Object of Anti-Interception Guidance
In the Earth-centered, Earth-fixed frame, the motion of the aircraft in the vertical plane is described as follows [32]:

$$
\begin{aligned}
\frac{dv}{dt} &= \frac{1}{m}\left(P\cos\alpha - C_X(v,\alpha)qS\right) - g\sin\theta \\
\frac{d\theta}{dt} &= \frac{1}{mv}\left(P\sin\alpha + C_Y(v,\alpha)qS\right) - \left(\frac{g}{v} - \frac{v}{R_0 + y}\right)\cos\theta \\
\frac{dx}{dt} &= \frac{R_0\, v\cos\theta}{R_0 + y} \\
\frac{dy}{dt} &= v\sin\theta
\end{aligned}
\tag{1}
$$

where $x$ is the flight distance, $y$ is the altitude, $v$ is the flight velocity, $\theta$ is the ballistic inclination, $g$ is the gravitational acceleration, $R_0$ is the mean radius of the Earth (the flatness of the Earth is ignored), $C_X(v,\alpha)$ and $C_Y(v,\alpha)$ are the drag and lift aerodynamic coefficients of the aircraft, respectively, $\alpha$ is the AoA, $q$ is the dynamic pressure, $S$ is the aerodynamic reference area, $m$ is the mass of the vehicle, and $P$ is the engine thrust.
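For concreteness, Equation (1) can be integrated numerically. The sketch below implements the right-hand side and one fourth-order Runge-Kutta step in Python; the exponential atmosphere, constant gravity, and all numerical values in the usage example are illustrative assumptions, not values taken from the paper.

```python
import math

R0 = 6.371e6  # mean Earth radius, m
G0 = 9.81     # gravitational acceleration, m/s^2 (held constant here)

def vertical_plane_dynamics(state, alpha, m, S, P, CX, CY,
                            rho0=1.225, H_scale=7200.0):
    """Right-hand side of Eq. (1); state = (x, y, v, theta).
    CX, CY are callables (v, alpha) -> drag / lift coefficient."""
    x, y, v, theta = state
    rho = rho0 * math.exp(-y / H_scale)  # exponential atmosphere (assumption)
    q = 0.5 * rho * v * v                # dynamic pressure
    dv = (P * math.cos(alpha) - CX(v, alpha) * q * S) / m - G0 * math.sin(theta)
    dtheta = ((P * math.sin(alpha) + CY(v, alpha) * q * S) / (m * v)
              - (G0 / v - v / (R0 + y)) * math.cos(theta))
    dx = R0 * v * math.cos(theta) / (R0 + y)
    dy = v * math.sin(theta)
    return (dx, dy, dv, dtheta)

def rk4_step(state, alpha, dt, **params):
    """Advance the point-mass model by one RK4 step of length dt."""
    f = lambda s: vertical_plane_dynamics(s, alpha, **params)
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k3)))
    return tuple(s + dt / 6.0 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))
```

For an unpowered glide ($P = 0$) at $\theta = 0$, drag reduces $v$ while the downrange distance $x$ grows, as expected from the first and third rows of Equation (1).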
Let subscripts H and I indicate variables subordinate to the HGV and interceptor, respectively. Setting $x_H(t) = \begin{bmatrix} x & y & v & \theta \end{bmatrix}^T$ as the state of an axisymmetric HGV, Equation (1) can be rewritten as follows:

$$
\dot{x}_H(t) = f_H(x_H) + g_H(x_H)u(t) =
\begin{bmatrix}
\dfrac{R_0\, v\cos\theta}{R_0 + y} \\
v\sin\theta \\
-\dfrac{C_X qS}{m} - g\sin\theta \\
-\left(\dfrac{g}{v} - \dfrac{v}{R_0 + y}\right)\cos\theta
\end{bmatrix}
+
\begin{bmatrix} 0 \\ 0 \\ 0 \\ \delta_H(x_H) \end{bmatrix} u(t)
\tag{2}
$$

where $\delta_H(x_H) = \max_{\alpha}\dfrac{1}{mv}C_Y(v,\alpha)qS$ is the maximum rate of inclination generated by aerodynamics in state $x_H$, and $u(t) \in [-1, 1]$ is the guidance command.
Remark 1. $\delta_H(x_H)$ is related to $y$ and $v$ of the vehicle and to the available AoA, so its value varies with $x_H(t)$. The purpose of this formulation is to place $u(t)$ into a constant range, ensuring that the manoeuvring ability required by the guidance always respects the real-time AO under the giant flight envelope of the HGV.
Remark 2. The reason for using $u(t)$ as the control variable, instead of directly using the AoA $\alpha$, is that an existing model can be relied upon to calculate $\delta_H(x_H)$ and input it into the neural network (as shown in Section 3), thus sparing the neural network from having to learn the maximum available overload.
As the long-range interceptor is in the target-lock state, its booster rocket is switched off and it flies unpowered with state $x_I(t) = \begin{bmatrix} x & y & v & \theta \end{bmatrix}^T$. The motion is as follows:

$$
\dot{x}_I(t) = f_I(x_I, x_H) =
\begin{bmatrix}
\dfrac{R_0\, v\cos\theta}{R_0 + y} \\
v\sin\theta \\
-\dfrac{C_X qS}{m} - g\sin\theta \\
G_I(x_I, x_H, t)
\end{bmatrix}
\tag{3}
$$

where $G_I(x_I, x_H, t)$ is the actual rate of inclination under the influence of the interceptor's guidance and control system.
Remark 3. As the focus of this paper is on the centre-of-mass motion of the vehicles, it is assumed that the vehicles always follow the guidance under the effects of the attitude control system; errors in attitude control are therefore ignored in Equations (2) and (3), as are minor effects such as Coriolis forces and transport accelerations.
The HGV and $N$ interceptors form a nonlinear system:

$$
\dot{x}_S(t) = f_S(x_S) + g_S(x_S)u(t) \tag{4}
$$

where the system state is $x_S = \begin{bmatrix} (x_H)^T & (x_{I1})^T & \dots & (x_{IN})^T \end{bmatrix}^T$, the nonlinear kinematics of the system are $f_S(x_S) = \begin{bmatrix} (f_H)^T & (f_{I1})^T & \dots & (f_{IN})^T \end{bmatrix}^T$, and the nonlinear effect of the control is $g_S(x_S) = \begin{bmatrix} (g_H)^T & O_{1\times 4N} \end{bmatrix}^T$.
The guidance system aims to control the system described in Equation (4) using $u(t)$ to achieve penetration. Therefore, the design of anti-interception guidance can be viewed as an optimal control problem. First, the HGV must successfully evade the interceptors during penetration, and then reach the DRP $P_E = \begin{bmatrix} x_E & y_E \end{bmatrix}^T$ to conduct the follow-up mission. Let $t_f$ be the time at which the HGV arrives at $x_E$, and let $u(t)$ drive system Equation (4) to state $x_S(t_f)$. The anti-interception guidance is designed to solve the following GTAO problem [33].

As mentioned in Equation (4), an HGV and its opponents form a system represented by $x_S$. The initial value of $x_S$ is:

$$
x_S(t_0) = \begin{bmatrix} x_{H,0} & y_{H,0} & v_{H,0} & \theta_{H,0} & \dots & x_{IN,0} & y_{IN,0} & v_{IN,0} & \theta_{IN,0} \end{bmatrix}^T \tag{5}
$$
The process constraint of penetration is:

$$
\min_t\left[\left(M_{x,i}\,x_S(t)\right)^2 + \left(M_{y,i}\,x_S(t)\right)^2\right] > R^2,\quad i\in[1,N] \tag{6}
$$

where $M_{x,i} = \begin{bmatrix} 1 & O_{1\times 3} & O_{1\times 4(i-1)} & -1 & O_{1\times 3} & O_{1\times 4(N-i)} \end{bmatrix}$, $M_{y,i} = \begin{bmatrix} 0 & 1 & O_{1\times 2} & O_{1\times 4(i-1)} & 0 & -1 & O_{1\times 2} & O_{1\times 4(N-i)} \end{bmatrix}$, and $R$ is the kill radius of the interceptor.
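In simulation, the constraint of Equation (6) reduces to a miss-distance check against each interceptor. A minimal sketch, assuming the flat state layout of Equation (5):

```python
def penetration_ok(xS, N, R):
    """Process constraint of Eq. (6): the HGV must stay outside the
    kill radius R of every interceptor i = 1..N.
    xS = [xH, yH, vH, thH, xI1, yI1, vI1, thI1, ...] (layout of Eq. (5))."""
    xH, yH = xS[0], xS[1]
    for i in range(N):
        xI, yI = xS[4 + 4 * i], xS[5 + 4 * i]
        if (xH - xI) ** 2 + (yH - yI) ** 2 <= R ** 2:
            return False  # constraint violated: intercepted
    return True
```

Comparing squared distances avoids a square root per interceptor per time step.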
The process constraint of heat flux is:

$$
(q_s)_{3D} \le q_U \tag{7}
$$

where $(q_s)_{3D}$ is the three-dimensional stagnation-point heat flux and $q_U$ is the upper limit of the heat flux. For an arbitrarily shaped, three-dimensional stagnation point with radii of curvature $R_1$ and $R_2$, the heat flux is expressed as:

$$
(q_s)_{3D} = \sqrt{\frac{1+k}{2}}\,(q_s)_{AXI} \tag{8}
$$

where $k = R_1/R_2$ and $(q_s)_{AXI}$ is the axisymmetric heat flux, which is related to the flight altitude and velocity [34].

The minimum velocity constraint is:

$$
\begin{bmatrix} O_{1\times 2} & 1 & 0 & O_{1\times 4N} \end{bmatrix} x_S \ge V_{\min} \tag{9}
$$
The control constraint is:

$$
u(t) \in [-1, 1] \tag{10}
$$

The objective function is a Mayer-type function:

$$
J(x_S(t_0), u(t)) = Q\!\left(x_S\!\left(t_f\right)\right) \tag{11}
$$

where $Q\!\left(x_S\!\left(t_f\right)\right) = \left(x_S\!\left(t_f\right) - \tilde{P}_E\right)^T R \left(x_S\!\left(t_f\right) - \tilde{P}_E\right)$, $\tilde{P}_E = \begin{bmatrix} P_E^T & V_{\min} & O_{1\times(4N+1)} \end{bmatrix}^T$ and $R = \mathrm{diag}\!\left(-w_1 I_{2\times 2},\; w_2,\; O_{(4N+1)\times(4N+1)}\right)$, in which $w_1, w_2 \in \mathbb{R}^+$ are weights.

The optimal performance is:

$$
J^*(x_S(t_0)) = \max_{u(t),\, t\in[t_0, t_f]} J(x_S(t_0), u(t)) \tag{12}
$$
From Equation (12), the optimal control $u^*(t)$ is determined by $x_S(t_0)$. After obtaining all the model information of system Equation (4), the optimal state trajectory of the system can be found from $x_S(t_0)$ using static optimization methods (e.g., quadratic programming). Nevertheless, resolving the problem using optimization methods is challenging, especially since there is limited information about the interceptor (aerodynamic parameters, available overload, guidance, etc.).
2.2. Markov Decision Process
The MDP can model a sequential decision problem and is well-suited to the HGV anti-interception process. The MDP can be defined by a five-tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma \rangle$ [35]. $\mathcal{S}$ is a multidimensional continuous state space. $\mathcal{A}$ is an available action space. $\mathcal{T}$ is a state transition function $\mathcal{S} \times \mathcal{A} \to \mathcal{S}$; that is, after an action $a \in \mathcal{A}$ is taken in the state $s \in \mathcal{S}$, the state changes from $s$ to $s' \in \mathcal{S}$. $\mathcal{R}$ is an instant reward function: it represents the instant reward obtained from the state transition. $\gamma \in [0, 1]$ is a constant discount factor used to balance the importance of instant and future rewards.
The cumulative reward obtained by the controller under the command sequence $\tau = \{a_0, \dots, a_n\}$ is:

$$
G(s_0, \tau) = \sum_{t=0}^{\infty} \gamma^t r_t \tag{13}
$$
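Equation (13) is the standard discounted return and can be computed in one backward pass. (In this paper only the terminal step carries a nonzero reward, and Table 2 uses $\gamma = 1$, so $G$ collapses to the terminal reward.) A minimal sketch:

```python
def discounted_return(rewards, gamma):
    """G(s0, tau) = sum_t gamma^t * r_t  (Eq. (13)), computed backwards
    via the recursion G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, `discounted_return([0, 0, 1], 0.5)` returns 0.25, i.e. $\gamma^2 r_2$.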
The expected value function of the cumulative reward based on state $s_t$ and the expected value function of $(s_t, a_t)$ are introduced as shown below:

$$
V^{\pi}(s_t) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t, \pi\right] \tag{14}
$$

$$
Q^{\pi}(s_t, a_t) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t, a_t, \pi\right] \tag{15}
$$
where $V^{\pi}(s_t)$ indicates the expected cumulative reward that controller $\pi$ can obtain in the current state $s_t$, and $Q^{\pi}(s_t, a_t)$ indicates the expected cumulative reward under controller $\pi$ after executing $a_t$ in state $s_t$.
According to the Bellman optimality theorem [35], updating $\pi(s_t)$ through the iteration rule shown in the following equation can be used to approximate the maximum $V^{\pi}(s_t)$ value:

$$
\pi(s_t) = \arg\max_{a_t} Q^{\pi}(s_t, a_t) \tag{16}
$$
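Equation (16) is ordinary greedy policy improvement. For a tabular $Q$ it is a one-liner, as sketched below; the paper replaces the table with the CN and the explicit argmax with gradient ascent through the AN.

```python
def greedy_policy(Q, states, actions):
    """Policy improvement of Eq. (16): pi(s) = argmax_a Q(s, a),
    shown for a tabular Q indexed by (state, action) pairs."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```

In a continuous action space this argmax has no closed form, which is precisely why DDPG-style actor-critic methods are needed here.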
3. Proposed Method
Section 2 converts the anti-interception guidance of an HGV into an MDP. The key to obtaining optimal guidance is the accurate estimation of $Q^{\pi}(s_t, a_t)$. Much progress has been made towards artificial intelligence using supervised learning systems trained to replicate human expert decisions. However, expert data are often expensive, unreliable, or unavailable [36], whereas DRL can still realize accurate $Q^{\pi}(s_t, a_t)$ estimation. The guidance system needs to process as much data as possible and then choose the next action. The input space of the guidance system is high-dimensional and continuous, and its action space is continuous. The DDPG-based methods (principally DDPG and TD3) have been shown to effectively handle high-dimensional continuous information and output continuous actions, so they can be used to solve the MDP proposed in this paper. This section aims to achieve faster policy convergence and better performance by optimizing DDPG-based training. Since the TD3 algorithm has only one more CN pair than DDPG, this paper takes RBT-DDPG as an example to introduce how the RBT mechanism improves the training of the critic. RBT-TD3 is easily obtained by simply repeating the improvements made by RBT-DDPG in each pair of TD3 critic networks.
3.1. RBT-DDPG-Based Methods
In the DDPG-based methods, CN $Q(\cdot)$ is only used during reinforcement learning to train AN $A(\cdot)$. At execution time, the real action is determined directly by $A(\cdot)$.
For CN, the DDPG-based optimization objective is to minimize the loss function $L_Q(\phi_Q)$. Its gradient is

$$
\nabla_{\phi_Q} L_Q(\phi_Q) = \left(\frac{1}{N_b}\sum_{j=1}^{N_b} \nabla_{\phi_Q} Q(s_{t,j}, a_{t,j}|\phi_Q)\right)\left(\frac{2}{N_b}\sum_{j=1}^{N_b}\left(Q(s_{t,j}, a_{t,j}|\phi_Q) - \hat{y}_t\right)\right) \tag{17}
$$
After a single gradient descent operation, the parameters are only guaranteed to move in the right direction; the magnitude of the loss function after the step is not guaranteed.
For AN, the DDPG-based optimization objective is to minimize the loss function $L_A(\phi_A)$. Its gradient is

$$
\nabla_{\phi_A} L_A(\phi_A) = -\frac{1}{N_b}\sum_{j=1}^{N_b} \nabla_{\tilde{a}_{t,j}} Q\!\left(s_{t,j}, \tilde{a}_{t,j}|\phi_Q\right) \nabla_{\phi_A} A(s_{t,j}|\phi_A) \tag{18}
$$
The direction of the AN parameter iteration is affected by CN. The single gradient descent step used by DDPG-based methods does not guarantee accurate CN estimation. In instances where the estimation is grossly inaccurate, incorrect parameter updates are provided to AN, which further deteriorates the sample data in the memory pool and reduces training efficiency.

Rather than allowing CN to steer AN in the wrong direction, this paper proposes a new mechanism that improves DDPG-based methods: repetitive batch training (RBT). The core idea of RBT, which mainly aims to improve the updating strategy of $Q^{\pi}(s_t, a_t)$, is that when a sample batch is used to update the CN parameters, the CN is repetitively trained on that batch, with the loss function compared against a reference threshold (as in Equation (19)), thereby avoiding a serious misestimate by CN. The reference threshold $L_{TH}$ for repeats should be set appropriately:

$$
\begin{cases}
\text{Repeat}, & L_Q(\phi_Q) \ge L_{TH} \\
\text{Pass}, & L_Q(\phi_Q) < L_{TH}
\end{cases} \tag{19}
$$
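The RBT rule of Equation (19) amounts to a small wrapper around the ordinary critic update: keep stepping on the same batch until the loss drops below $L_{TH}$. A sketch with generic `train_step`/`loss_fn` callables (these names and the safety cap `max_repeats` are our additions, not from the paper):

```python
def rbt_critic_update(train_step, loss_fn, batch, L_TH, max_repeats=50):
    """Repetitive batch training (Eq. (19)): apply gradient steps on the
    SAME batch until the critic loss falls below L_TH.
    train_step(batch) performs one gradient descent step on the critic;
    loss_fn(batch) returns the current critic loss on that batch."""
    for _ in range(max_repeats):
        train_step(batch)
        if loss_fn(batch) < L_TH:
            break  # Pass: the CN now fits this batch well enough
    return loss_fn(batch)
```

The same wrapper applies unchanged to each critic pair of TD3, which is how RBT-TD3 is obtained.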
$Q\!\left(s_{t,j}, \tilde{a}_{t,j}|\phi_Q\right)$ is split into two parts:

$$
Q\!\left(s_{t,j}, a_{t,j}|\phi_Q\right) = Q^*\!\left(s_{t,j}, a_{t,j}|\phi_{Q^*}\right) + D\!\left(s_{t,j}, a_{t,j}|\phi_Q\right) \tag{20}
$$

where $Q^*\!\left(s_{t,j}, a_{t,j}|\phi_{Q^*}\right)$ is the real mapping of $Q$ and $D\!\left(s_{t,j}, a_{t,j}|\phi_Q\right)$ is the estimation error.
Therefore, Equation (18) is rewritten as:

$$
\nabla_{\phi_A} L_A(\phi_A) = -\frac{1}{N_b}\sum_{j=1}^{N_b} \left(\nabla_{\tilde{a}_{t,j}} Q^*\!\left(s_{t,j}, \tilde{a}_{t,j}|\phi_{Q^*}\right) + \nabla_{\tilde{a}_{t,j}} D\!\left(s_{t,j}, \tilde{a}_{t,j}|\phi_Q\right)\right) \nabla_{\phi_A} A(s_{t,j}|\phi_A) \tag{21}
$$
According to the Taylor expansion:

$$
\nabla_{\tilde{a}_{t,j}} D\!\left(s_{t,j}, \tilde{a}_{t,j}|\phi_Q\right) = \frac{1}{\Delta\tilde{a}}\left[D\!\left(s_{t,j}, \tilde{a}_{t,j}+\Delta\tilde{a}\,\middle|\,\phi_Q\right) - D\!\left(s_{t,j}, \tilde{a}_{t,j}\,\middle|\,\phi_Q\right)\right] - R_2 \tag{22}
$$

where $R_2$ is the higher-order residual term. Since $D\!\left(s_{t,j}, \tilde{a}_{t,j}|\phi_Q\right) \in (0, L_{TH})$:

$$
\left|D\!\left(s_{t,j}, \tilde{a}_{t,j}+\Delta\tilde{a}\,\middle|\,\phi_Q\right) - D\!\left(s_{t,j}, \tilde{a}_{t,j}\,\middle|\,\phi_Q\right)\right| < 2L_{TH} \tag{23}
$$

$$
\left\|\nabla_{\tilde{a}_{t,j}} D\!\left(s_{t,j}, \tilde{a}_{t,j}|\phi_Q\right)\right\| < 2L_{TH}\left\|\Delta\tilde{a}\right\|^{-1} + \left\|R_2\right\| \tag{24}
$$
Naturally, setting $L_{TH}$ to a small value will reduce the misdirection from the unconverged CN to the AN.
If $L_{TH}$ is too large, the effect of RBT-DDPG will be weakened; if $L_{TH}$ is too small, CN will overfit the samples in a single batch. The RBT-DDPG shown in Figure 2 and Algorithm 1 is an example of how RBT can be combined with DDPG-based methods.
Figure 2. Signal flow of the RBT-DDPG algorithm. When a sample batch is used to update the CN parameters, the CN is repetitively trained using the loss function against its reference threshold (as shown in Steps 4-7 of the figure), thereby avoiding a serious misestimate by CN.
Algorithm 1 RBT-DDPG
1:  Initialize parameters φA, φA−, φQ, φQ−.
2:  for each iteration do
3:      for each environment step do
4:          a = (1 − e^(−θt))µ + e^(−θt)â + σ√((1 − e^(−2θt))/(2θ))·ε,  ε ∼ N(0, 1),  â = A(s).
5:      end for
6:      for each gradient step do
7:          Randomly sample Nb samples.
8:          φQ ← φQ + αC∇φQ LQ(φQ).
9:          if LQ(φQ) ≥ LTH then
10:             Go back to step 8 (reuse the same batch).
11:         end if
12:         φA ← φA + αA∇φA LA(φA).
13:         φQ− ← (1 − sr)φQ− + sr φQ.
14:         φA− ← (1 − sr)φA− + sr φA.
15:     end for
16: end for
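Step 4 of Algorithm 1 blends the actor output with an exactly discretized Ornstein-Uhlenbeck process. A sketch using the Table 2 hyperparameters as defaults (the clipping to the action space is our addition):

```python
import math
import random

def ou_explore(a_hat, t, mu=0.05, sigma=0.01, theta=5e-5):
    """Exploration action of Algorithm 1, step 4:
    a = (1 - e^{-theta t}) mu + e^{-theta t} a_hat
        + sigma * sqrt((1 - e^{-2 theta t}) / (2 theta)) * eps,  eps ~ N(0, 1)."""
    decay = math.exp(-theta * t)
    noise_std = sigma * math.sqrt((1.0 - math.exp(-2.0 * theta * t)) / (2.0 * theta))
    a = (1.0 - decay) * mu + decay * a_hat + noise_std * random.gauss(0.0, 1.0)
    return max(-1.0, min(1.0, a))  # clip to the action space [-1, 1]
```

At t = 0 the action equals the actor output exactly; as t grows, the mean drifts toward µ and the noise variance approaches its stationary value σ²/(2θ).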
3.2. Scheme of DRL
To approximate $Q^{\pi}(s_t, a_t)$ and the optimal AN through DRL, the state space, action space and instant reward function are designed as follows.
3.2.1. State Space
As mentioned in Section 2.1, in the vertical plane, the HGV and interceptors follow Equation (2) or (3), and both can be expressed in terms of two-dimensional coordinates, the velocity, and the ballistic inclination. Therefore, the network can predict the future flight state based on the current state. The AN state space is $\mathcal{S}_{HI} = \mathcal{S}_H \times \mathcal{S}_{I1} \times \dots \times \mathcal{S}_{IN}$, where $\mathcal{S}_H \in \mathbb{R}^4$ and $\mathcal{S}_{Ii} \in \mathbb{R}^4$ are the state spaces of the HGV and interceptor $i$, respectively. AN needs to know the AO of the current state, $\mathcal{S}_{\Omega} \in \mathbb{R}$, to evaluate the available manoeuvrability. It also needs to know the position of the DRP, $\mathcal{S}_D \in \mathbb{R}^2$. As a result, the state space of AN is designed as a $(7+4N)$-dimensional space $\mathcal{S} = \mathcal{S}_{HI} \times \mathcal{S}_{\Omega} \times \mathcal{S}_D$, and the form of each element is $(x_H, y_H, v_H, \theta_H, x_{I1}, y_{I1}, v_{I1}, \theta_{I1}, \dots, x_{IN}, y_{IN}, v_{IN}, \theta_{IN}, \omega_{\max}, x_D, y_D)$. For CN, an additional action input is needed, so its input space is $(8+4N)$-dimensional.
Remark 4. The definition of the state space used in this paper means that there is a vast input space for the neural network when an HGV is confronted with many interceptors, resulting in many duplicate elements and complicating the training process. An attention mechanism can alleviate this issue by extracting features from the original input space [37]. Typically, two interceptors are used to intercept one target in air defence operations [38]. Therefore, there are only one HGV and two interceptors in the virtual scenario, limiting the input space of the neural networks.
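Assembling the (7 + 4N)-dimensional actor input is then a direct concatenation. A minimal sketch (the tuple layouts are assumptions matching the element form above):

```python
def build_actor_state(hgv, interceptors, omega_max, drp):
    """Actor input (x_H, y_H, v_H, th_H, ..., x_IN, y_IN, v_IN, th_IN,
    omega_max, x_D, y_D); hgv and each interceptor are (x, y, v, theta)."""
    s = list(hgv)
    for itc in interceptors:
        s.extend(itc)      # 4 entries per interceptor
    s.append(omega_max)    # available overload (AO) of the current state
    s.extend(drp)          # desired regression point (x_D, y_D)
    return s
```

With N = 2 interceptors this yields a 15-dimensional vector; the CN input appends the action, giving 8 + 4N dimensions.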
3.2.2. Action Space
As described in Equation (2), since $u(t)$ is limited to the interval $[-1, 1]$, the output or action of the neural network can simply be defined as $u(t)$. $u(t)$ is multiplied by the AO derived from the model information and then applied to the HGV's flight control system to ensure that the guidance signal meets the dynamical constraints of the current flight state. The neural network only needs to pick a real number within $[-1, 1]$, which indicates the choice of $\dot{\theta}$ as a guidance command subject to AO constraints. From a training perspective, to bypass the learning of the AO, it is more straightforward to use the model information directly rather than adopting the AoA as the action space.
3.2.3. Instant Reward Function
As an essential part of the trial-and-error approach, the instant reward function guides the networks to learn the optimal strategy. Diverse instant reward functions induce different behaviour tendencies, and the instant reward function affects the quality of the strategy learned by DRL. As in Equation (12), the instant reward function is designed with two aims: (1) to evaluate the distance between the HGV and the DRP via the reward function $r_E(\cdot)$ at the terminal time $t_f$, and (2) to apply the velocity reward function $r_D(\cdot)$ at $t_f$. The instant reward function is designed as follows:

$$
r(x_S(t)) =
\begin{cases}
r_E(x_S(t)) + r_D(x_S(t)), & t = t_f \\
0, & t < t_f
\end{cases} \tag{25}
$$
It is necessary to convert $w_1$ in Equation (11) to a function with a positive domain in order to encourage the HGV to evade interceptors and reach $x_E$ during DRL:

$$
f(x) = \frac{1}{w_1 + \sqrt{x}} \tag{26}
$$
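Combining Equations (25) and (26) gives a sparse terminal reward. The split below between the distance term $r_E$ and the velocity term $r_D$ is our reading of Equations (11), (25) and (26), with the Table 2 weights as defaults; it is a sketch, not the paper's exact implementation:

```python
import math

def instant_reward(xS_tf, xE, yE, v_min, t, t_f, w1=0.5, w2=1e-4):
    """Sparse terminal reward of Eq. (25): zero before t_f; at t_f a
    distance term (via the positive-domain map of Eq. (26)) plus a
    velocity term (assumed interpretation, weights from Table 2)."""
    if t < t_f:
        return 0.0
    xH, yH, vH = xS_tf[0], xS_tf[1], xS_tf[2]
    d2 = (xH - xE) ** 2 + (yH - yE) ** 2   # squared miss to the DRP
    r_E = 1.0 / (w1 + math.sqrt(d2))       # Eq. (26): larger when closer
    r_D = w2 * (vH - v_min)                # reward residual velocity
    return r_E + r_D
```

The $1/(w_1 + \sqrt{x})$ form keeps the distance reward bounded at the DRP itself while still decreasing monotonically with miss distance.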
Remark 5. If the flight state of the HGV does not meet the various constraints mentioned in Section 2, the episode ends early, and the instant reward for the whole episode is 0, which introduces the common sparse reward problem in reinforcement learning. However, it is evident from Section 4 that RBT-DDPG-based methods can develop intelligence from sparse rewards.
Remark 6. An intensive instant reward function similar to that designed by Jiang et al. [39] is not used, since there is no experience to draw on for the HGV anti-interception problem. Furthermore, rather than posing a Bolza problem, this instant reward function is fully equivalent to the optimization goal, that is, the Mayer-type problem defined in Equation (12). Moreover, the neural network is not influenced by human tendencies in strategy exploration, resulting in an in-depth exploration of all strategic options that could approach the global optimum.
Remark 7. A curriculum-based approach similar to that discussed by Li et al. [40] was attempted, in which HGVs quickly learn to reduce the interceptors' energy using snake manoeuvres. However, as the policy solidifies during the approach phase, it is difficult to achieve global optimality within this framework, and the obtained performance is significantly lower than that obtained with Equation (25).
4. Training and Testing
Section 3 introduced RBT-DDPG-based methods to solve the GTAO problem discussed in Section 2. This section verifies the effectiveness of DRL in finding the optimal anti-interception guidance system.
4.1. Settings
4.1.1. Aircraft Settings
To simulate a random initial interceptor energy state in the virtual scenario, the interceptors' initial altitude and initial velocity follow the uniform distributions $U(25\,\text{km}, 45\,\text{km})$ and $U(1050\,\text{m/s}, 1650\,\text{m/s})$, respectively. Table 1 lists the remaining parameters.
The aerodynamics of an aircraft are usually approximated by a curve-fitted model (CFM). The CFM of the HGV used in this paper is taken from Wang et al. [41]:

CL = −0.21 + 0.075 · M + (0.23 + 0.05 · M) · α,
CD = 0.41 + 0.011 · M + (−0.0081 + 0.0021 · M) · α + 0.0042 · α²,

where M is the Mach number of the HGV and α is the AoA (rad). Moreover, the CFM of the interceptors is [42]:

CL = (0.18 + 0.02 · M) · α,
CD = 0.18 + 0.01 · M + 0.001 · M · α + 0.004 · α².

The interceptors employ proportional guidance. To compensate for the lack of representation of the vectoring capability in the virtual scenario, we increased the interceptors' kill radius to 300 m.
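The two CFMs above can be evaluated directly; a small sketch (function names are ours, coefficients as given in the text):

```python
def hgv_coefficients(mach: float, alpha: float):
    """Curve-fitted lift/drag coefficients of the HGV (alpha in rad) [41]."""
    cl = -0.21 + 0.075 * mach + (0.23 + 0.05 * mach) * alpha
    cd = (0.41 + 0.011 * mach
          + (-0.0081 + 0.0021 * mach) * alpha
          + 0.0042 * alpha ** 2)
    return cl, cd

def interceptor_coefficients(mach: float, alpha: float):
    """Curve-fitted lift/drag coefficients of the interceptors [42]."""
    cl = (0.18 + 0.02 * mach) * alpha
    cd = 0.18 + 0.01 * mach + 0.001 * mach * alpha + 0.004 * alpha ** 2
    return cl, cd
```

Note that the AoA enters both drag polars quadratically, which is why the large-overload evasive manoeuvres discussed in Section 4.3.2 are so costly in kinetic energy.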
Aerospace 2022,9, 424 11 of 21
Table 1. Parameters of the HGV and interceptor in the virtual scenario.
Parameter Interceptor HGV
Mass/kg 75 500
Reference area/m² 0.3 0.579
Minimum velocity/(m/s) 400 1000
Available AoA/° −20∼20 −10∼10
Time constant of attitude control system/s 0.1 1
Initial coordinate x/km 200 0
Initial coordinate y/km Random 35
Initial velocity/(m/s) Random 2000
Initial inclination/° 0 0
Coordinate x of the DRP/km - 200
Coordinate y of the DRP/km - 35
Kill radius/m 300 -
4.1.2. Hyperparameter Settings
The hyperparameters used in training are shown in Table 2.
In the AN, it is evident that the bulk of the computation occurs in the hidden layers. A neuron in a hidden layer reads in n_i numbers (n_i is the width of the previous layer) through a dropout layer (the drop rate is 0.2) and multiplies them by the weights (totalling 2n_i + 0.8n_i FLOPs), adds up all the values (totalling n_i FLOPs), then passes the result through the activation function (LReLU) after adding a bias term (totalling 3 FLOPs), which means that a single neuron consumes 3.8n_i + 3 FLOPs in a single calculation. The actor, as shown in Figure 3, consumes approximately 87K FLOPs in a single execution. Assuming an on-board computer with a 10⁻³ TFLOPS floating-point capability (most mainstream industrial FPGAs in 2021 provide more than this), a single execution of the anti-interception guidance formed by the AN takes less than 0.1 ms.
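The per-neuron accounting above can be turned into a quick latency estimate; a sketch in which the layer widths are assumptions (the actual AN layout is given in Figure 3, not reproduced here):

```python
def neuron_flops(n_inputs: int) -> float:
    """FLOPs of one hidden neuron per the paper's accounting:
    dropout-masked reads/multiplies (2*n_i + 0.8*n_i), summation (n_i),
    bias add plus LReLU activation (approx. 3)."""
    return 3.8 * n_inputs + 3

def layer_flops(n_inputs: int, n_neurons: int) -> float:
    """FLOPs of one fully connected layer in a single forward pass."""
    return n_neurons * neuron_flops(n_inputs)

# Hypothetical widths (input, two hidden layers, scalar output).
widths = [8, 128, 128, 1]
total = sum(layer_flops(n_in, n_out)
            for n_in, n_out in zip(widths, widths[1:]))

# At 10^-3 TFLOPS (i.e., 1e9 FLOPS), the single-execution latency (s):
latency = total / 1e9
```

With these assumed widths the total is on the order of tens of kFLOPs and the latency is well under 0.1 ms, consistent with the estimate in the text.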
Figure 3. Structures of the neural networks ((Left) AN; (Right) CN) with the parameters of each layer.
Table 2. Hyperparameters of RBT-DDPG and RBT-TD3.
Parameter Value
∆t/s 10⁻²
Tc/s 1
γ 1
αC 10⁻⁴
αA in RBT-DDPG 10⁻⁴
αA in RBT-TD3 5 × 10⁻⁵
sr 0.001
µ in RBT-DDPG 0.05
σ in RBT-DDPG 0.01
θ in RBT-DDPG 5 × 10⁻⁵
σ in RBT-TD3 0.1
LTH 0.1
Weight initialization N(0, 0.02)
Bias initialization N(0, 0.02)
w1 0.5
w2 10⁻⁴
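The Ornstein–Uhlenbeck (OU) exploration noise parameterized in Table 2 for RBT-DDPG (µ = 0.05, σ = 0.01, θ = 5 × 10⁻⁵) can be sketched as follows; the time step and seeding are assumptions, not values from the paper:

```python
import random

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise with the RBT-DDPG settings
    from Table 2 (mu=0.05, sigma=0.01, theta=5e-5). dt and the seed are
    assumptions for this sketch."""
    def __init__(self, mu=0.05, sigma=0.01, theta=5e-5, dt=1.0, seed=0):
        self.mu, self.sigma, self.theta, self.dt = mu, sigma, theta, dt
        self.rng = random.Random(seed)
        self.x = mu  # start at the long-run mean

    def sample(self) -> float:
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * (self.dt ** 0.5) * self.rng.gauss(0.0, 1.0))
        self.x += dx
        return self.x
```

With θ this small the process behaves almost like a random walk around µ, giving temporally correlated exploration; during testing (Section 4.3.1) this noise is simply switched off.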
4.2. Training Results and Analysis
The CPU of the training platform is an AMD Ryzen 5 3600 @ 4.2 GHz, and the RAM is 2 × 8 GB DDR4 @ 3733 MHz. As the networks are straightforward and the computation is concentrated on calculating the aircraft models, GPUs are not used for training. The aircraft models in the virtual scenario are written in C++ and packaged as a dynamic link library (DLL). The networks and the training algorithm are implemented in Python, and they interact with the virtual scenario by calling the DLL. The training process is shown in Figures 4–6.
Figure 4. Cumulative reward during training. With the help of the RBT mechanism, both DDPG and TD3 reached a faster training speed.
Figure 5. Loss functions of the AN and CN during training. Similar to the cumulative reward curves in Figure 4, the actor loss function of RBT-DDPG decreases faster, indicating that RBT-DDPG ensures that the AN learns faster. RBT-DDPG has a lower CN loss function throughout almost all episodes, reflecting that RBT can improve the CN estimation. The same phenomenon occurs in the comparison between RBT-TD3 and its original version.
Figure 6. The number of RBT iterations that occur during RBT-DDPG training. RBT is repeated several times at the beginning of training (near the first training iteration), when the CN is required to train to a tiny estimation error. RBT is barely executed before 60,000 training steps, as the CN can already provide accurate estimates for the samples in the current memory pool and no additional training is needed. From steps 100,000 to 200,000, RBT is repeated several times, and many executions exceed 10 repetitions: owing to the introduction of new strategies into the memory pool, the original CN does not accurately estimate the Q value, so additional training is performed. Between steps 200,000 and 350,000, RBT is occasionally executed, and most executions contain fewer than 10 repetitions, because fine-tuning the CN can accommodate the new strategies explored by the actor. RBT executions increase after 350,000 steps, as the CN must adapt to the multiple strategy samples brought into the memory pool. At the end of training, the average number of repetitions is approximately 2, which is an acceptable algorithmic complexity cost.
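The repetition behaviour described above can be sketched as a loop that keeps training the critic on the same batch until its estimation error falls below a threshold (LTH in Table 2). This is a minimal sketch of the idea only; the `critic_step` callable and the repeat cap are assumptions, not the paper's implementation:

```python
def rbt_critic_update(critic_step, batch, loss_threshold=0.1, max_repeats=50):
    """Repeated-batch-training sketch: re-run gradient steps on the SAME
    batch until the critic loss drops below loss_threshold (LTH) or the
    assumed cap is hit. `critic_step(batch)` performs one gradient step
    and returns the resulting loss."""
    repeats = 0
    loss = critic_step(batch)           # mandatory first update
    while loss > loss_threshold and repeats < max_repeats:
        loss = critic_step(batch)       # extra repetition on the same batch
        repeats += 1
    return repeats, loss
```

The returned `repeats` is what Figure 6 plots: near zero when the CN already fits the memory pool, and large when newly explored strategies invalidate the current Q estimates.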
For episodes 0–3500, DDPG and RBT-DDPG are constructing their memory pools. The neural networks produce near-random outputs in this process, and no training occurs. As a result, cumulative rewards are near 0 for most episodes, with a slim possibility of reaching six points through random exploration. The neural networks begin to be iteratively updated as soon as the memory pool is full. RBT-DDPG exceeds the maximum score of the random exploration process at approximately 3900 episodes, reaching close to seven points. RBT-DDPG then gradually improves at a rate of approximately 0.00145 points/episode, roughly three times the 0.00046 points/episode of DDPG. RBT-TD3 achieves steady growth from about episode 500. TD3, on the other hand, apparently fell into a local optimum before episode 3500, with reward values that remained around 0.8 points. RBT-TD3 learns faster than RBT-DDPG because its learning algorithm is more complex, but the strategies they learn are similar and eventually converge to almost the same reward.
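The growth rates quoted above (0.00145 and 0.00046 points/episode) are slopes of the cumulative-reward curves; such a slope can be recovered with an ordinary least-squares fit. A self-contained sketch on synthetic data (the curve here is fabricated purely to exercise the fit):

```python
def reward_slope(episodes, rewards):
    """Least-squares slope (points/episode) of a reward-vs-episode curve."""
    n = len(episodes)
    mean_x = sum(episodes) / n
    mean_y = sum(rewards) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(episodes, rewards))
    var = sum((x - mean_x) ** 2 for x in episodes)
    return cov / var
```

Applied to the logged training curves, this yields the per-episode improvement rates used to compare the algorithms.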
4.3. Test Results and Analysis
A Monte Carlo test was performed to verify that the strategies learned by the neural
network are universally adaptable. In addition, some cases were used to analyze the
anti-interception strategies. As the final performance obtained by RBT-DDPG and RBT-TD3
is similar, due to space limitations, this section uses the AN learned by RBT-DDPG as the
test object.
4.3.1. Monte Carlo Test
To reveal the specific strategy obtained from training, the AN from RBT-DDPG controls an HGV in virtual scenarios to perform anti-interception tests. To verify the adaptability of the AN to different initial states, a test was conducted using scenarios in which the initial altitude and velocity of the interceptors were randomly distributed (the same distributions as used during training). Since exploration is no longer needed, no OU noise was added to the AN. A total of 1000 episodes were conducted.
The ANs from DDPG and RBT-DDPG were each tested for 1000 episodes. Table 3 and Figure 7 illustrate the results. Suppose that the measure of the success of an anti-interception is whether it eliminates both interceptors. In that case, the 91.48% anti-interception success rate of the RBT-DDPG AN is better than the 79.74% rate of the DDPG AN, reflecting the greater adaptability of the RBT-DDPG AN to complex initial conditions. According to the average terminal miss distance ē(t_f), having eliminated the interceptors, both actors perform well in achieving DRP regression. However, in terms of the average terminal velocity v̄(t_f), the HGV guided by the RBT-DDPG AN is faster. The peak probability density is 7.4 for RBT-DDPG and 6.7 for DDPG, indicating that RBT-DDPG performs well in more scenarios than DDPG.
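The statistics in Table 3 can be reproduced from raw episode logs; a sketch, with the episode-record format assumed (the paper does not specify whether the reported averages condition on success):

```python
def mc_statistics(results):
    """Aggregate Monte Carlo episodes. Each record is assumed to be
    (both_interceptors_eliminated: bool, miss_distance_m: float,
     terminal_velocity_mps: float). Returns the success rate and the
    averages of miss distance and terminal velocity over all episodes."""
    n = len(results)
    success_rate = sum(1 for ok, _, _ in results if ok) / n
    avg_miss = sum(e for _, e, _ in results) / n
    avg_velocity = sum(v for _, _, v in results) / n
    return success_rate, avg_miss, avg_velocity
```

Run over the 1000-episode logs of each actor, this yields the success-rate and average columns of Table 3.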
To observe the specific strategies learned through RBT-DDPG, we traversed the initial
conditions of the two interceptors and tested each individually. In Figure 8, the correspon-
dence between the initial state of the interceptors and the test case serial number is indicated.
The vertical motion of the HGV and interceptors in all cases are shown in Figure 9.
Table 3. Statistical results.
Algorithm Anti-Interception Success Rate ē(t_f)/m v̄(t_f)/(m/s)
DDPG 79.74% 1425.62 1377.17
RBT-DDPG 91.48% 1514.44 1453.81
Figure 7. Probability distribution and density of the cumulative reward for each episode in the Monte Carlo test. About 10% of the RBT-DDPG rewards are less than 3, compared to about 24% for DDPG, reflecting the greater adaptability of the RBT-DDPG AN to complex initial conditions.
Figure 8. Correspondence between initial state and test case serial number (the horizontal coordinate represents the initial state of the first interceptor, while the vertical coordinate represents the second interceptor. The letters H, M, and L in the first position represent 44 km, 35 km, and 26 km altitudes, respectively; the letters H, M, and L in the second position represent 1500 m/s, 1250 m/s, and 1000 m/s velocities, respectively).
In Figure 9, the neural network adopts different strategies in response to interceptors
with variable initial energies. In terms of behaviour, the strategies fall into two categories:
S-curve (dive–leap–dive) and C-curve (leap–dive). There is also a specific pattern to the
peaks. In general, the higher the initial energy of the interceptors faced by the HGV, the
higher the peak.
Figure 9. Vertical motion of the HGV and interceptors in the test cases. The strategies fall into two categories: S-curve (dive–leap–dive) and C-curve (leap–dive). There is also a specific pattern to the peaks.
4.3.2. Analysis of Anti-Interception Strategies
Using the data from the 5th and 42nd test cases mentioned in Section 4.3.1, we attempted to identify the strategies the neural network learned through DRL.
As shown by the solid purple trajectories in Figures 10 and 11, we also used the differential game (DG) approach [33] as a comparison method. The DG approach uses the relative angle and velocity information as input to guide the flight, and can successfully evade the interceptor. However, while the HGV evaded the interceptor under DG guidance, it lost significant kinetic energy due to its long residence in the dense atmosphere and its low ballistic dive point, and fell below the minimum velocity approximately 55 km from the target. DG cannot account for atmospheric density and cannot optimize energy, which is a significant advantage of DRL.
Figure 10. Vertical motion comparison between RBT-DDPG (Cases 5 and 42) and the Differential Game. The DG approach uses the relative angle and velocity information as input to guide the flight, and can successfully evade the interceptor. In contrast, DG cannot account for atmospheric density and cannot optimize energy, which is a significant advantage of DRL. The AN learned by RBT-DDPG chooses to dive before leaping in Case 5, whereas, in Case 42, it takes a direct leap. Furthermore, while the peak in Case 5 is 60 km, it only reaches 54 km in Case 42 before diving. The AN can control the HGV to select the appropriate ballistic inclination for the dive after escaping the interceptor.
Figure 11. Velocity comparison between RBT-DDPG (Cases 5 and 42) and the Differential Game.
The neural network chooses to dive before leaping in Case 5, whereas, in Case 42, it takes a direct leap. Furthermore, while the peak in Case 5 is 60 km, it only reaches 54 km in Case 42 before diving. In both cases, the HGV causes one of the interceptors to go below the minimum flight speed (400 m/s) before entering the rendezvous phase. In the rendezvous phase, the minimum distances between the interceptor and the HGV are 389 m and 672 m, respectively, which indicates that the HGV passes close to the interceptable area of the interceptors. The terminal velocity of the HGV in Case 5 is approximately 100 m/s lower than that in Case 42, due to the higher initial interceptor energy faced in Case 5, which results in a longer manoeuvre path and a more violent pull-up in the dense atmospheric region. Figure 12 illustrates that the HGV tends to perform a large-overload manoeuvre in the approach phase, almost fully utilizing its manoeuvring ability. In the regression phase of Case 42, only very small manoeuvres are required to correct the ballistic inclination, demonstrating that the neural network can control the HGV to select the appropriate ballistic inclination for the dive after escaping the interceptor.
Figure 12. Overload comparison between Cases 5 and 42.
We derived a rudimentary instant reward function with no prior knowledge of the strategy that should be implemented, resulting in a sparse-reward problem. Nevertheless, the CN trained by RBT-DDPG does not make significant Q-estimation errors. The Q estimation is accurate at the beginning of an episode, and this accuracy is maintained throughout the process in both Case 5 and Case 42 (Figure 13). This phenomenon is consistent with the idea presented in Equation (12) that the flight states of both sides at the outset determine the optimal anti-interception strategy that the HGV should implement.
Figure 13. Q-value comparison between Cases 5 and 42. The Q estimation is accurate at the beginning of an episode and maintains its accuracy throughout the process in both cases.
Figures 10–12 illustrate the anti-interception strategy learned by RBT-DDPG: (1) The HGV lures interceptors with high initial energy into the dense atmosphere through diving manoeuvres in the approach phase, thus relying on pull-up manoeuvres in the denser atmosphere to drain much of the interceptors' energy. It is important to note that this dive manoeuvre also consumes the HGV's own kinetic energy (e.g., Case 5). In contrast, when confronted with interceptors with low initial energy, the neural network does not choose to dive first, even though this strategy is feasible, but instead leaps directly into the thin-atmosphere region (e.g., Case 42), reflecting the optimality of the strategy. (2) Through the approach-phase manoeuvre, the HGV reduces the kinetic energy of the interceptors to a proper level and attracts the interceptors into the thin atmosphere. Here, the interceptors' AO no longer allows the interceptable area to cover the whole reachable area of the HGV, which allows the HGV to gain an available penetration path, as shown in Figure 14.
Figure 14. Illustration of the penetration strategy during the rendezvous phase learned by RBT-DDPG. The interceptor's AO no longer allows the interceptable area to cover the whole reachable area of the HGV, which allows the HGV to gain an available penetration path.
5. Conclusions
Traditionally, research on anti-interception guidance for aircraft has focused on differential game theory and optimization algorithms. Due to the high number of matrix calculations needed, applying differential game theory online is computationally uneconomical. Even though the newly developed ADP algorithm employs a neural network that significantly reduces the computation associated with the Hamilton functions, it cannot be applied to aircraft with very large flight envelopes, such as HGVs. It is challenging to implement convex programming, sequential quadratic programming, or other planning algorithms on HGVs due to their high computational complexity and insufficient real-time performance.
We conceptualize the penetration of HGVs as a GTAO problem from the perspective of optimal control, revealing the significant impact that the initial conditions of both the attackers and the defenders have on the penetration strategy. The problem is then modelled as an MDP and solved using the DDPG algorithm. The RBT-DDPG algorithm was developed to improve the CN estimation during the training process. The data on the training process and the online simulation tests verify that RBT-DDPG can autonomously learn anti-interception guidance and adopt a rational strategy (S-curve or C-curve) when interceptors have differing initial energy conditions. Compared to the traditional DDPG algorithm, our proposed algorithm reduces the training episodes by 48.48%. Since the AN deployed online is computationally lightweight, it is suitable for onboard computers. To our knowledge, this is the first work to apply DRL to anti-interception guidance for HGVs.
This paper focuses on a scenario in which one HGV breaks through the defences of two interceptors; this is a traditional scenario but may not be fully adapted to future trends in group confrontation. In the future, we anticipate applying multi-agent DRL (MA-DRL) to multiple HGVs. The agents trained by MA-DRL can conduct guidance for each HGV in a distributed manner under limited information constraints, so the anti-interception guidance strategy can adapt to the numbers of enemies and HGVs. This will greatly improve generalizability to the battlefield. Additionally, RL is known to suffer from long training times. We anticipate using pseudo-spectral methods to create a collection of expert data and then incorporating expert advice [43,44] to accelerate training.
Author Contributions: Conceptualization, L.J. and Y.N.; methodology, L.J.; software, L.J.; validation, Y.Z. and Z.L.; formal analysis, L.J.; investigation, Y.N.; resources, Y.N.; data curation, L.J.; writing—original draft preparation, L.J.; writing—review and editing, Y.Z.; visualization, L.J.; supervision, Y.N.; project administration, Y.N.; funding acquisition, Y.N. All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported in part by the Aviation Science Foundation of China under Grant 201929052002.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Acknowledgments: The authors would like to thank Cheng Yuehua from the Nanjing University of Aeronautics and Astronautics for her invaluable support during the writing of the paper.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Guo, Y.; Gao, Q.; Xie, J.; Qiao, Y.; Hu, X. Hypersonic vehicles against a guided missile: A defender triangle interception approach. In Proceedings of the 2014 IEEE Chinese Guidance, Navigation and Control Conference, Yantai, China, 8–10 August 2014; pp. 2506–2509.
2. Liu, K.F.; Meng, H.D.; Wang, C.J.; Li, J.; Chen, Y. Anti-Head-on Interception Penetration Guidance Law for Slide Vehicle. Mod. Def. Technol. 2008, 4, 39–45.
3. Luo, C.; Huang, C.Q.; Ding, D.L.; Guo, H. Design of Weaving Penetration for Hypersonic Glide Vehicle. Electron. Opt. Control 2013, 7, 67–72.
4. Zhu, Q.G.; Liu, G.; Xian, Y. Simulation of Reentry Maneuvering Trajectory of Tactical Ballistic Missile. Tactical Missile Technol. 2008, 1, 79–82.
5. He, L.; Yan, X.D.; Tang, S. Guidance law design for spiral-diving maneuver penetration. Acta Aeronaut. Astronaut. Sin. 2019, 40, 188–202.
6. Zhao, K.; Cao, D.Q.; Huang, W.H. Manoeuvre control of the hypersonic gliding vehicle with a scissored pair of control moment gyros. Sci. China Technol. Sci. 2018, 61, 1150–1160. [CrossRef]
7. Zhao, X.; Qin, W.W.; Zhang, X.S.; He, B.; Yan, X. Rapid full-course trajectory optimization for multi-constraint and multi-step avoidance zones. J. Solid Rocket. Technol. 2019, 42, 245–252.
8. Wang, P.; Yang, X.L.; Fu, W.X.; Qiang, L. An On-board Reentry Trajectory Planning Method with No-fly Zone Constraints. Missiles Space Vehicles 2016, 2, 1–7.
9. Fang, X.L.; Liu, X.X.; Zhang, G.Y.; Wang, F. An analysis of foreign ballistic missile manoeuvre penetration strategies. Winged Missiles J. 2011, 12, 17–22.
10. Sun, S.M.; Tang, G.J.; Zhou, Z.B. Research on Penetration Maneuver of Ballistic Missile Based on Differential Game. J. Proj. Rocket. Missiles Guid. 2010, 30, 65–68.
11. Imado, F.; Miwa, S. Fighter evasive maneuvers against proportional navigation missile. J. Aircr. 1986, 23, 825–830. [CrossRef]
12. Zhang, G.; Gao, P.; Tang, Q. The Method of the Impulse Trajectory Transfer in a Different Plane for the Ballistic Missile Penetrating Missile Defense System in the Passive Ballistic Curve. J. Astronaut. 2008, 29, 89–94.
13. Wu, Q.X.; Zhang, W.H. Research on Midcourse Maneuver Penetration of Ballistic Missile. J. Astronaut. 2006, 27, 1243–1247.
14. Zhang, K.N.; Zhou, H.; Chen, W.C. Trajectory Planning for Hypersonic Vehicle With Multiple Constraints and Multiple Manoeuvring Penetration Strategies. J. Ballist. 2012, 24, 85–90.
15. Xian, Y.; Tian, H.P.; Wang, J.; Shi, J.Q. Research on intelligent manoeuvre penetration of missile based on differential game theory. Flight Dyn. 2014, 32, 70–73.
16. Sun, J.L.; Liu, C.S. An Overview on the Adaptive Dynamic Programming Based Missile Guidance Law. Acta Autom. Sin. 2017, 43, 1101–1113.
17. Sun, J.L.; Liu, C.S. Distributed Fuzzy Adaptive Backstepping Optimal Control for Nonlinear Multimissile Guidance Systems with Input Saturation. IEEE Trans. Fuzzy Syst. 2019, 27, 447–461.
18. Sun, J.L.; Liu, C.S. Backstepping-based adaptive dynamic programming for missile-target guidance systems with state and input constraints. J. Frankl. Inst. 2018, 355, 8412–8440. [CrossRef]
19. Wang, F.; Cui, N.G. Optimal Control of Initiative Anti-interception Penetration Using Multistage Hp-Adaptive Radau Pseudospectral Method. In Proceedings of the 2015 2nd International Conference on Information Science and Control Engineering, Shanghai, China, 24–26 April 2015.
20. Liu, Y.; Yang, Z.; Sun, M.; Chen, Z. Penetration design for the boost phase of near space aircraft. In Proceedings of the 2017 36th Chinese Control Conference, Dalian, China, 26–28 July 2017.
21. Marcus, G. Innateness, AlphaZero, and artificial intelligence. arXiv 2018, arXiv:1801.05667.
22. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the game of Go without human knowledge. Nature 2017, 550, 354–359. [CrossRef]
23. Osband, I.; Blundell, C.; Pritzel, A.; Van Roy, B. Deep Exploration via Bootstrapped DQN. arXiv 2016, arXiv:1602.04621.
24. Chen, J.W.; Cheng, Y.H.; Jiang, B. Mission-Constrained Spacecraft Attitude Control System On-Orbit Reconfiguration Algorithm. J. Astronaut. 2017, 38, 989–997.
25. Dong, C.; Deng, Y.B.; Luo, C.C.; Tang, X. Compression Artifacts Reduction by a Deep Convolutional Network. In Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015.
26. Fu, X.W.; Wang, H.; Xu, Z. Research on Cooperative Pursuit Strategy for Multi-UAVs based on DE-MADDPG Algorithm. Acta Aeronaut. Astronaut. Sin. 2021, 42, 311–325.
27. Brian, G.; Kris, D.; Roberto, F. Adaptive Approach Phase Guidance for a Hypersonic Glider via Reinforcement Meta Learning. In Proceedings of the AIAA SCITECH 2022 Forum, San Diego, CA, USA, 3–7 January 2022.
28. Wen, H.; Li, H.; Wang, Z.; Hou, X.; He, K. Application of DDPG-based Collision Avoidance Algorithm in Air Traffic Control. In Proceedings of the ISCID 2019: IEEE 12th International Symposium on Computational Intelligence and Design, Hangzhou, China, 14 December 2020.
29. Lin, G.; Zhu, L.; Li, J.; Zou, X.; Tang, Y. Collision-free path planning for a guava-harvesting robot based on recurrent deep reinforcement learning. Comput. Electron. Agric. 2021, 188, 106350. [CrossRef]
30. Lin, Y.; McPhee, J.; Azad, N.L. Anti-Jerk On-Ramp Merging Using Deep Reinforcement Learning. In Proceedings of the IVS 2020: IEEE Intelligent Vehicles Symposium, Las Vegas, NV, USA, 19 October–13 November 2020.
31. Xu, X.L.; Cai, P.; Ahmed, Z.; Yellapu, V.S.; Zhang, W. Path planning and dynamic collision avoidance algorithm under COLREGs via deep reinforcement learning. Neurocomputing 2021, 468, 181–197. [CrossRef]
32. Lei, H.M. Principles of Missile Guidance and Control. Control Technol. Tactical Missile 2007, 15, 162–164.
33. Cheng, T.; Zhou, H.; Dong, X.F.; Cheng, W.C. Differential game guidance law for integration of penetration and strike of multiple flight vehicles. J. Beijing Univ. Aeronaut. Astronaut. 2022, 48, 898–909.
34. Zhao, J.S.; Gu, L.X.; Ma, H.Z. A rapid approach to convective aeroheating prediction of hypersonic vehicles. Sci. China Technol. Sci. 2013, 56, 2010–2024. [CrossRef]
35. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998.
36. Liu, R.Z.; Wang, W.; Shen, Y.; Li, Z.; Yu, Y.; Lu, T. An Introduction of mini-AlphaStar. arXiv 2021, arXiv:2104.06890.
37. Deka, A.; Luo, W.; Li, H.; Lewis, M.; Sycara, K. Hiding Leader's Identity in Leader-Follower Navigation through Multi-Agent Reinforcement Learning. arXiv 2021, arXiv:2103.06359.
38. Xiong, J.-H.; Tang, S.-J.; Guo, J.; Zhu, D.-L. Design of Variable Structure Guidance Law for Head-on Interception Based on Variable Coefficient Strategy. Acta Armamentarii 2014, 35, 134–139.
39. Jiang, L.; Nan, Y.; Li, Z.H. Realizing Midcourse Penetration With Deep Reinforcement Learning. IEEE Access 2021, 9, 89812–89822. [CrossRef]
40. Li, B.; Yang, Z.P.; Chen, D.Q.; Liang, S.Y.; Ma, H. Maneuvering target tracking of UAV based on MN-DDPG and transfer learning. Def. Technol. 2021, 17, 457–466. [CrossRef]
41. Wang, J.; Zhang, R. Terminal guidance for a hypersonic vehicle with impact time control. J. Guid. Control Dyn. 2018, 41, 1790–1798. [CrossRef]
42. Ge, L.Q. Cooperative Guidance for Intercepting Multiple Targets by Multiple Air-to-Air Missiles. Master's Thesis, Nanjing University of Aeronautics and Astronautics, Nanjing, China, 2019.
43. Cruz, F.; Parisi, G.I.; Twiefel, J.; Wermter, S. Multi-modal integration of dynamic audiovisual patterns for an interactive reinforcement learning scenario. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Korea, 9–14 October 2016.
44. Bignold, A.; Cruz, F.; Dazeley, R.; Vamplew, P.; Foale, C. Human engagement providing evaluative and informative advice for interactive reinforcement learning. Neural Comput. Appl. 2022. [CrossRef]