Solving finite-horizon HJB for optimal control of continuous-time systems
Ziyu Lin
School of Vehicle and Mobility
Tsinghua University
Beijing, China
linzy17@mails.tsinghua.edu.cn
Jie Li
School of Vehicle and Mobility
Tsinghua University
Beijing, China
jie-li18@mails.tsinghua.edu.cn
Jingliang Duan
School of Vehicle and Mobility
Tsinghua University
Beijing, China
djl15@mails.tsinghua.edu.cn
Haitong Ma
School of Vehicle and Mobility
Tsinghua University
Beijing, China
maht19@mails.tsinghua.edu.cn
Shengbo Eben Li*
School of Vehicle and Mobility
Tsinghua University
Beijing, China
lishbo@tsinghua.edu.cn
Jianyu Chen
Institute for Interdisciplinary Information Sciences
Tsinghua University
Beijing, China
jianyuchen@berkeley.edu
Abstract—Continuous-time finite-horizon optimal control is of significant importance, since most physical systems are defined in continuous time and most practical applications consider a finite horizon. However, unlike the infinite-horizon HJB equation, the finite-horizon HJB equation contains an additional partial derivative with respect to time, which is an intractable unknown term. This paper shows that this partial derivative exactly equals the negative of the terminal-time utility function, which significantly simplifies the HJB equation and thus allows efficient numerical solutions to it. We provide two methods to prove our finding: the first analyzes the initial-time equivalence between a fixed-horizon optimal control problem and a fixed-terminal-time optimal control problem, and the second works directly from the definition of the partial derivative. Based on this finding, we can reuse traditional approximate dynamic programming (ADP) algorithms to approximate the solution of the finite-horizon HJB equation for continuous-time systems. We evaluate the correctness of our finding by analyzing a linear quadratic problem.
Keywords—finite-horizon HJB equation, approximate dynamic programming, continuous-time optimal control.
I. INTRODUCTION
Continuous-time (CT) finite-horizon control systems preserve important merits in theoretical analysis and controller synthesis. Most physical systems are defined in continuous time. Meanwhile, the fixed-final-time optimal control problem (OCP) is common in practical applications. In vehicle-tracking control, for example, the desired tracking trajectory can only be given over a finite horizon due to the time-varying environment and the limited capability of road perception [1]. In other words, it may be impossible to define an infinite-horizon OCP in most real-world applications.
Dynamic programming (DP) is the fundamental tool for solving the continuous-time (CT) finite-horizon OCP [2]. It employs the principle of Bellman optimality to derive the Hamilton-Jacobi-Bellman (HJB) equation, which is a necessary and sufficient condition for the CT optimal control policy. For the continuous-time linear quadratic control problem, the optimal solution satisfies the Riccati equation, a special case of the HJB equation [3]. In the general case, however, the HJB equation is a nonlinear partial differential equation (PDE) in the value function, which is usually infeasible to solve analytically [4]. To make things worse, unlike the infinite-horizon HJB equation, the finite-horizon HJB equation contains a time-dependent value function, whose partial derivative with respect to time is an intractable unknown term.
Recent decades have witnessed numerous efforts to solve the HJB equation for nonlinear systems [5]. An algorithm using the HJB equation to approximate the control policy was first proposed by Werbos [6], [7]. It was termed Approximate Dynamic Programming (ADP), whose synonyms include adaptive DP [8], neuro-DP [8] and reinforcement learning (RL) [9]. Its basic principle is to solve the HJB equation off-line using policy iteration, which involves a two-step iteration process of policy evaluation and policy improvement [10]. Many ADP methods seeking the optimal control solution for CT systems with infinite horizon have emerged in the literature [11]. Abu-Khalaf and Lewis proposed a method for continuous input-constrained systems with a non-quadratic utility function to obtain an approximate control law [12]. T. Dierks and S. Jagannathan designed a single online critic network to solve the infinite-horizon HJB equation with small bounded error [13]. Vamvoudakis and Lewis introduced synchronous policy iteration, which simultaneously adapts both the actor neural network (NN) and the critic NN [14]. However, compared with infinite-horizon settings, most ADP methods face far greater challenges in solving the finite-horizon OCP, because the non-zero, time-varying partial derivative of the value with respect to time introduces an additional unknown term into the finite-horizon HJB equation.
To overcome this problem, some studies adopted a linear combination with time-varying weights to approximate the time-varying value function [15]-[17]. The time-varying weights can be solved by applying least squares on a predefined region in practice [18]. In other studies, time-varying activation functions are employed as the time-varying approximator [19]. Nevertheless, existing methods for the CT finite-horizon OCP suffer from two problems: a) Input-affine restriction: these methods are limited to input-affine systems because the control policy needs to be analytically expressed in terms of the value function. b) Requirement for hand-crafted basis functions: the value function or policy is represented by a linear combination of manually designed basis functions. For complex or large-scale systems, inappropriate basis functions usually result in an inaccurate value function or a sub-optimal policy.
Different from the existing methods, this paper provides an alternative viewpoint which directly derives the partial derivative of the value function with respect to time analytically. The contributions can be summarized as follows:
(1) It is found, for the first time in theory, that the partial derivative of the value with respect to time exactly equals the negative of the terminal-time utility function. Hence, the CT finite-horizon HJB equation only contains two unknown terms, so that any ADP method for the CT infinite-horizon OCP can be used to solve the finite-horizon OCP.
(2) It overcomes the following problems of existing finite-horizon ADP methods: a) Since many ADP methods do not require the policy to be analytically expressed by the value function, our finding allows reusing these methods to realize optimal control for arbitrary nonlinear systems with non-affine inputs. b) Neural networks can be employed with the system states as inputs to represent the value function or policy. In this case, no hand-crafted basis functions are needed.
The paper is organized as follows. Section II provides the formulation of the optimal control problem and introduces the general CT ADP framework. Section III presents the derivation of the CT finite-horizon HJB equation with two proof methods: analyzing the initial-time equivalence between the fixed-horizon OCP and the fixed-terminal-time OCP, and working from the definition of the partial derivative. Section IV demonstrates the correctness of our finding by analyzing linear quadratic systems. Section V concludes the paper.
II. PRELIMINARIES
A. Optimal Control Problem Formulation

Consider the general continuous-time time-invariant dynamic system given by
$$\dot{x} = f(x, u), \quad (1)$$
where $x \in \mathbb{R}^{m}$ is the state, $u \in \mathbb{R}^{n}$ is the control input, and $f(x, u)$ is the system model, which is assumed to be controllable. The value function of the CT finite-horizon OCP is
$$V^{\pi}(x(t), t) = \int_{t}^{T} l(x(\tau), u(\tau))\, d\tau, \quad (2)$$
where the time-varying $V^{\pi}(x(t), t)$ is defined as the value function corresponding to policy $\pi$, $t \in [0, T]$ is the current time, $T$ is the fixed terminal time, $\tau$ is the virtual time in the future horizon, and $l(x, u)$ is the utility function. Different from discrete-time control, the value function in the CT domain is the integral of the utility function. One cannot easily handle the derivative of an integral function, which leads to a major difference between discrete-time and continuous-time control. The policy $\pi$ maps from state $x$ to control input $u$:
$$u(\tau) = \pi(x(\tau), \tau), \quad \tau \in [t, T], \; t \in [0, T]. \quad (3)$$
The objective of the OCP is to minimize the cost functional:
$$V^{*}(x(t), t) = \min_{u} V^{\pi}(x, t) = \min_{u} \int_{t}^{T} l(x(\tau), u(\tau))\, d\tau \quad \text{s.t.}\ \dot{x} = f(x, u), \quad (4)$$
where the superscript $*$ means "optimal". The basic principle of the CT finite-horizon OCP is to seek a policy $\pi^{*}(x, t)$ that minimizes $V(x, t)$ for all $x$:
$$\pi^{*}(x, t) = \arg\min_{\pi} \{ V^{\pi}(x(t), t) \}.$$
B. Optimality Condition for the OCP

To find the optimal policy, we introduce the Hamilton function:
$$H\!\left(x, u, \frac{\partial V^{\pi}(x, t)}{\partial x}\right) = l(x, u) + \frac{\partial V^{\pi}(x, t)}{\partial x}^{\top} f(x, u).$$
The Hamiltonian satisfies the following finite-horizon self-consistency condition:
$$H\!\left(x, u, \frac{\partial V^{\pi}(x, t)}{\partial x}\right) = -\frac{\partial V^{\pi}(x, t)}{\partial t}, \quad \forall t \in [0, T]. \quad (5)$$
Similarly, the optimal solution satisfies the following finite-horizon HJB equation:
$$\min_{u}\left\{ l(x, u) + \frac{\partial V^{*}(x, t)}{\partial x}^{\top} f(x, u) \right\} = -\frac{\partial V^{*}(x, t)}{\partial t}, \quad (6)$$
with the terminal boundary condition
$$V(x^{\pi}(T), T) = 0.$$
To seek the optimal policy for CT finite-horizon systems, we only need to solve the HJB equation, which contains three unknown terms: $u^{*}(x, t)$, $\frac{\partial V^{*}(x, t)}{\partial x}$, and $\frac{\partial V^{*}(x, t)}{\partial t}$. The HJB equation is hard to solve analytically since it is a nonlinear partial differential equation.
C. Policy Iteration of Traditional ADP

ADP serves as a numerical solution to the HJB equation (6) within the policy iteration (PI) framework, which iteratively applies the following two steps:

(1) Policy Evaluation (PEV): At the $k$-th iteration, given policy $\pi^{k}$, the corresponding value function $V^{\pi^{k}}(x, t)$ is estimated by applying the self-consistency condition (5).

(2) Policy Improvement (PIM): PIM finds a better policy $\pi^{k+1}(x, t)$ by minimizing the Hamilton function with $V^{\pi^{k}}(x, t)$:
$$\pi^{k+1}(x, t) = \arg\min_{u}\left\{ H\!\left(x, u, \frac{\partial V^{\pi^{k}}(x, t)}{\partial x}\right) \right\}. \quad (7)$$
The iteration of PEV and PIM gradually converges to the HJB solution. The pseudo-code of PI is shown in Algorithm 1.
Algorithm 1 CT ADP in PI framework
Initialize with policy $\pi^{0}(x, t)$
Given an arbitrarily small positive $\epsilon$ and set $k = 0$
while $\| V^{\pi^{k+1}}(x, t) - V^{\pi^{k}}(x, t) \| \geq \epsilon$ do
    1. Solve the value function $V^{\pi^{k}}(x, t)$ for $\forall x$ using
       $$l(x, \pi^{k}(x, t)) + \frac{\partial V^{\pi^{k}}}{\partial x}^{\top} f(x, \pi^{k}(x, t)) = -\frac{\partial V^{\pi^{k}}}{\partial t} \quad (8)$$
    2. Solve the new policy $\pi^{k+1}(x)$ for $\forall x$ using (7)
end while
In the PEV process, there are two unknown partial derivative terms in (8): $\frac{\partial V^{\pi}(x,t)}{\partial t}$ and $\frac{\partial V^{\pi}(x,t)}{\partial x}$. For any choice of $\frac{\partial V^{\pi}(x,t)}{\partial t}$, we can find a correspondingly different $\frac{\partial V^{\pi}(x,t)}{\partial x}$ that makes (8) hold, and vice versa. In other words, (8) has no unique solution. However, for the infinite-horizon HJB equation we have $\frac{\partial V^{\pi}(x,t)}{\partial t} = 0$, so the PEV process can be easily solved since only $\frac{\partial V^{\pi}(x,t)}{\partial x}$ is unknown in this case. Inspired by this, if $\frac{\partial V^{\pi}(x,t)}{\partial t}$ is known, we can obtain the solution of the finite-horizon HJB equation with PI techniques, similar to existing infinite-horizon ADP methods. Therefore, it is crucial and meaningful to obtain an analytical expression of $\frac{\partial V^{\pi}(x,t)}{\partial t}$.
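To make the PI loop of Algorithm 1 concrete in the benign infinite-horizon case just mentioned (where $\partial V^{\pi}/\partial t = 0$), the sketch below instantiates it for a scalar LQ problem, for which both PEV and PIM have closed forms. This is only an illustrative sketch; the plant and cost parameters, the initial gain, and the stopping tolerance are assumptions, not values from the paper.

```python
# Minimal sketch of Algorithm 1 for the special case of an infinite-horizon
# scalar LQ problem (so that dV/dt = 0 and PEV has a closed form).
# System: dx/dt = a*x + b*u, utility l = 0.5*(q*x^2 + r*u^2),
# linear policy u = -K*x, quadratic value V(x) = 0.5*P*x^2.
# All numerical values below are illustrative assumptions, not from the paper.
import math

a, b, q, r = 1.0, 1.0, 1.0, 1.0   # assumed plant/cost parameters
K = 3.0                            # initial gain, chosen stabilizing (a - b*K < 0)
eps = 1e-10

P_prev = float("inf")
for _ in range(100):
    # PEV: solve l + dV/dx * f = 0 for the closed loop x_dot = (a - b*K)*x,
    # which gives 0.5*(q + r*K^2) + P*(a - b*K) = 0.
    P = (q + r * K**2) / (2.0 * (b * K - a))
    if abs(P - P_prev) < eps:
        break
    P_prev = P
    # PIM: minimize the Hamiltonian 0.5*(q*x^2 + r*u^2) + P*x*(a*x + b*u)
    # over u, giving u = -(b/r)*P*x, i.e. K <- (b/r)*P.
    K = b / r * P

beta = math.sqrt(a**2 + q * b**2 / r)
P_star = r * (a + beta) / b**2     # positive root of the algebraic Riccati equation
print(f"PI solution P = {P:.6f}, Riccati solution P* = {P_star:.6f}")
```

The two printed values agree, which is exactly the behavior the finite-horizon case cannot reproduce until the unknown time derivative in (8) is resolved.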
III. DERIVATION OF FINITE-HORIZON HJB EQUATION

This paper focuses on the right-hand side of (8), i.e., $\frac{\partial V^{\pi}(x,t)}{\partial t}$. If we know its value, the HJB equation can be solved easily, as mentioned before. Firstly, we put forward our conclusion as Theorem 1.

Then, by replacing $\frac{\partial V^{\pi}(x,t)}{\partial t}$ with $-l(x^{\pi}(T), u(T))$, the finite-horizon HJB equation can be rewritten as
$$\min_{u}\left\{ l(x, u) + \frac{\partial V^{*}(x, t)}{\partial x}^{\top} f(x, u) \right\} = l(x^{*}(T), u^{*}(T)), \quad \forall x,\ t \in [0, T]. \quad (9)$$
The above HJB equation can be solved by reusing the traditional CT ADP method with the PI technique, which is very similar to existing infinite-horizon ADP. Similarly, its corresponding self-consistency condition is
$$l(x(t), u(t)) + \frac{\partial V^{\pi}(x(t), t)}{\partial x}^{\top} f(x, u) = l(x^{\pi}(T), u(T)), \quad \forall x,\ t \in [0, T]. \quad (10)$$
Theorem 1. For the value function defined in (4), its partial derivative with respect to time $t$ exactly equals the negative of the terminal-time utility function, i.e.,
$$\frac{\partial V^{\pi}(x, t)}{\partial t} = -l(x^{\pi}(T), u(T)) = -l\big(x^{\pi}(T), \pi(x^{\pi}(T), T)\big), \quad (11)$$
where $x^{\pi}(T)$ is the terminal state, obtained by rolling out the model $f(x, u)$ with policy $\pi$ from the initial state $x(t)$.
Proof: We now provide two approaches to prove it: (1) the first uses the connection between two finite-time control problems, one with a fixed terminal time and one with a fixed horizon; (2) the second derives the partial derivative directly from its definition.
A. Method 1: Initial-time equivalence of two OCPs

Firstly, we define a fixed-predictive-horizon OCP that is equivalent to the fixed-terminal-time OCP (4). It is called the assistant problem, denoted as OCP$_A$. The cost function of OCP$_A$ is
$$V_{A}^{\pi_{A}}(x(t), c) = \int_{t}^{T+t-c} l(x(\tau), u_{A}(\tau))\, d\tau, \quad (12)$$
where $c \in [0, T]$ and $T$ is a fixed constant. The upper limit of the integral is $T + t - c$, so the predictive horizon of OCP$_A$ is a fixed value $T - c$, while the predictive horizon $T - t$ of OCP (4) varies with $t$. In fact, the value function $V$ depends on the length of the predictive horizon. Therefore, $V_{A}^{\pi_{A}}(x(t), c)$ is an explicit function of $c$ instead of $t$. $\pi_{A}$ is the policy used in OCP$_A$, and it maps from state $x$ to control input $u_{A}$:
$$u_{A}(\tau) = \pi_{A}(x(\tau), c, \tau), \quad \tau \in [t, T+t-c], \; t, c \in [0, T]. \quad (13)$$
The objective of OCP$_A$ is to minimize $V_{A}^{\pi_{A}}$:
$$V_{A}^{*}(x(t), c) = \min_{u_{A}} V_{A}^{\pi_{A}}(x(t), c) = \min_{u_{A}} \int_{t}^{T+t-c} l(x(\tau), u_{A}(\tau))\, d\tau. \quad (14)$$
The two problems have equal optimal value functions when $t = c$:
$$V^{*}(x, t) = V_{A}^{*}(x(t), c) = \min_{u} \int_{c}^{T} l(x(\tau), u(\tau))\, d\tau, \quad t = c, \; \forall x. \quad (15)$$
Hence, it is obvious that the two OCPs share the same optimal control policy in this case:
$$\pi^{*}(x, t) = \pi_{A}^{*}(x, c, t), \quad t = c. \quad (16)$$
In light of (16), we assume that the same policy is applied to both problems:
$$u(\tau) = \pi(x(\tau), \tau) = u_{A}(\tau) = \pi_{A}(x(\tau), c, \tau), \quad t = c, \; \forall x. \quad (17)$$
Furthermore, it follows that
$$V^{\pi}(x(t), t) = \int_{c}^{T} l(x(\tau), u(\tau))\, d\tau = \int_{c}^{T} l(x(\tau), u_{A}(\tau))\, d\tau = V_{A}^{\pi_{A}}(x(t), c), \quad t = c. \quad (18)$$
Based on (18), we have
$$\frac{\partial V^{\pi}(x, t)}{\partial x} = \frac{\partial V_{A}^{\pi_{A}}(x, c)}{\partial x}, \quad t = c. \quad (19)$$
Different from $\frac{\partial V^{\pi}(x,t)}{\partial t}$,
$$\frac{\partial V_{A}^{\pi_{A}}(x(t), c)}{\partial t} = 0, \quad (20)$$
because $V_{A}^{\pi_{A}}(x(t), c)$ is a function of $c$ instead of $t$. Furthermore, taking the derivative of (2) and (12) with respect to time $t$, one has
$$\frac{d V^{\pi}(x(t), t)}{dt} = \frac{\partial V^{\pi}(x, t)}{\partial x}^{\top} f(x, u) + \frac{\partial V^{\pi}(x, t)}{\partial t} = -l(x(t), u(t)), \quad (21)$$
$$\frac{d V_{A}^{\pi_{A}}(x(t), c)}{dt} = \frac{\partial V_{A}^{\pi_{A}}(x, c)}{\partial x}^{\top} f(x, u) = l\big(x^{\pi_{A}}(T+t-c), u_{A}(T+t-c)\big) - l(x(t), u_{A}(t)). \quad (22)$$
According to (17) and (19)-(22), we know that
$$\begin{aligned}
\left.\frac{\partial V^{\pi}(x, t)}{\partial t}\right|_{t=c} &= \frac{d V^{\pi}(x(t), t)}{dt} - \frac{d V_{A}^{\pi_{A}}(x(t), c)}{dt} \\
&= -l\big(x^{\pi_{A}}(T+t-c), u_{A}(T+t-c)\big) \\
&= -l\big(x^{\pi_{A}}(T), u_{A}(T)\big) \\
&= -l\big(x^{\pi}(T), u(T)\big). \quad (23)
\end{aligned}$$
Since $c$ can take any value in $[0, T]$, this can be simplified as
$$\frac{\partial V^{\pi}(x, t)}{\partial t} = -l(x^{\pi}(T), u(T)), \quad t \in [0, T]. \quad (24)$$
This completes proof method 1.
B. Method 2: The definition of the partial derivative

Here, we provide another way to prove Theorem 1, which uses the definitions of both the total derivative and the partial derivative. The value function of the original optimal control problem is given in (2).

Let us first look at the total derivative of the value with respect to time:
$$\begin{aligned}
\frac{d V^{\pi}(x, t)}{dt} &= \frac{V^{\pi}(x+dx, t+dt) - V^{\pi}(x, t)}{dt} \\
&= \frac{\int_{t+dt}^{T} l(x(\tau), u(\tau))\, d\tau - \int_{t}^{T} l(x(\tau), u(\tau))\, d\tau}{dt} \\
&= \frac{-\int_{t}^{t+dt} l(x(\tau), u(\tau))\, d\tau}{dt} \\
&= \frac{-l(x(t), u(t))\, dt}{dt} = -l(x(t), u(t)). \quad (25)
\end{aligned}$$
We compare it with the partial derivative. By mathematical definition, a partial derivative of a function of multiple variables is its derivative with respect to one of those variables, with the others held constant (as opposed to the total derivative, in which all variables are allowed to vary). For example, the partial derivative of $f(x_1, \dots, x_i, \dots, x_n)$ in the direction of $x_i$ is
$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \dots, x_i + h, \dots, x_n) - f(x_1, \dots, x_i, \dots, x_n)}{h}. \quad (26)$$
When we take the partial derivative with respect to $t$, the state $x(t)$ and control $u(t)$ are the other variables, whose trajectories are held constant. Hence, $V^{\pi}(x, t+dt)$ can be derived as $\int_{t+dt}^{T} l(x(\tau - dt), u(\tau - dt))\, d\tau$. Then, the partial derivative of the value with respect to the variable $t$ is
$$\begin{aligned}
\frac{\partial V^{\pi}(x, t)}{\partial t} &= \frac{V^{\pi}(x, t+dt) - V^{\pi}(x, t)}{dt} \\
&= \frac{\int_{t+dt}^{T} l(x(\tau - dt), u(\tau - dt))\, d\tau - \int_{t}^{T} l(x(\tau), u(\tau))\, d\tau}{dt} \\
&= \frac{\int_{t}^{T-dt} l(x(\tau), u(\tau))\, d\tau - \int_{t}^{T} l(x(\tau), u(\tau))\, d\tau}{dt} \\
&= \frac{-\int_{T-dt}^{T} l(x(\tau), u(\tau))\, d\tau}{dt} \\
&= \frac{-l(x^{\pi}(T), u(T))\, dt}{dt} \\
&= -l(x^{\pi}(T), u(T)), \quad t \in [0, T]. \quad (27)
\end{aligned}$$
An illustration of this partial derivative is shown in Fig. 1.

Figure 1: Partial derivative

From (24) and (27), it is clear that the two derivation methods reach the same conclusion. By substituting (24) into the original OCP's self-consistency condition (5), we obtain (10). If we select the optimal policy $\pi^{*}$ and substitute (24) into (6), we obtain the finite-horizon HJB equation (9).
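A quick numerical illustration of Theorem 1 under the method-2 interpretation is sketched below: rolling a fixed policy out from a state, shifting the current time by a small amount only removes the tail slice of the integral in (2), so the finite-difference quotient approaches the negative terminal utility. The dynamics, policy, utility, and step sizes here are illustrative assumptions.

```python
# Minimal numerical check of Theorem 1 (sketch). Holding the rolled-out
# trajectory fixed, shifting the current time by dtau only removes the last
# slice of the integral, so (V(x, t+dtau) - V(x, t)) / dtau ~= -l(x(T), u(T)).
# The system, policy and utility below are illustrative assumptions.
import numpy as np

def f(x, u):                 # assumed nonlinear, non-affine dynamics
    return -x**3 + np.tanh(u)

def policy(x):               # assumed fixed (suboptimal) policy
    return -2.0 * x

def utility(x, u):           # running cost l(x, u)
    return x**2 + 0.5 * u**2

def rollout_value(x0, horizon, dt=1e-4):
    """Euler rollout of the policy; returns V = integral of l, and l at the end."""
    n = int(round(horizon / dt))
    x, V = x0, 0.0
    for _ in range(n):
        u = policy(x)
        V += utility(x, u) * dt
        x = x + f(x, u) * dt
    return V, utility(x, policy(x))

x0, T, t = 1.5, 2.0, 0.5
dtau = 1e-3
V1, l_T = rollout_value(x0, T - t)           # V(x, t)
V2, _ = rollout_value(x0, T - (t + dtau))    # V(x, t + dtau): same trajectory, truncated
dVdt = (V2 - V1) / dtau
print(f"finite-difference dV/dt = {dVdt:.5f}, -l(x(T), u(T)) = {-l_T:.5f}")
```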
Remark 1. Since $\frac{\partial V^{\pi}(x,t)}{\partial t} = -l(x^{\pi}(T), u(T))$, only $\frac{\partial V^{\pi}(x,t)}{\partial x}$ is unknown in (8). In this case, the only difference is that for infinite-horizon ADP, $\frac{\partial V^{\pi}(x,t)}{\partial t} = 0$, while for finite-horizon ADP, $\frac{\partial V^{\pi}(x,t)}{\partial t} = -l(x^{\pi}(T), u(T))$. Hence, we can easily obtain the solution of the CT finite-horizon OCP by using PI techniques, similar to existing infinite-horizon ADP methods.

Following Remark 1, the pseudo-code of CT finite-horizon ADP is shown in Algorithm 2, and the framework is shown in Fig. 2.
Algorithm 2 CT Finite-horizon ADP in PI framework
Initialize with policy $\pi^{0}(x, t)$
Given an arbitrarily small positive $\epsilon$ and set $k = 0$
while $\| V^{\pi^{k+1}}(x, t) - V^{\pi^{k}}(x, t) \| \geq \epsilon$ do
    Roll out from $x_t$ with policy $\pi^{k}$
    Receive and store $x^{\pi^{k}}(T)$
    1. Solve the value function $V^{\pi^{k}}(x, t)$ for $\forall x$ using
       $$l(x, \pi^{k}(x, t)) + \frac{\partial V^{\pi^{k}}}{\partial x}^{\top} f(x, \pi^{k}(x, t)) = l\big(x^{\pi^{k}}(T), u(T)\big) \quad (28)$$
    2. Solve the new policy $\pi^{k+1}(x)$ for $\forall x$ using (7)
end while
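To show how Algorithm 2 could be realized without hand-crafted basis functions, the following is a minimal actor-critic sketch that minimizes the residual of (28) in the PEV step and the Hamiltonian in the PIM step. Everything concrete in it (the dynamics $f$, the utility $l$, the network sizes, sampling ranges, learning rates, and the Euler rollout used to obtain the terminal utility) is an illustrative assumption rather than a specification from the paper.

```python
# Sketch of Algorithm 2 with neural-network approximators (assumptions throughout).
import torch
import torch.nn as nn

T, dt = 2.0, 0.01                      # fixed terminal time and rollout step

def f(x, u):                           # assumed non-affine dynamics x_dot = f(x, u)
    return -x**3 + torch.tanh(u)

def l(x, u):                           # utility function l(x, u)
    return (x**2 + 0.1 * u**2).sum(dim=-1, keepdim=True)

critic = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))   # V(x, t)
actor = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))    # pi(x, t)
opt_v = torch.optim.Adam(critic.parameters(), lr=1e-3)
opt_pi = torch.optim.Adam(actor.parameters(), lr=1e-3)

def terminal_utility(x, t):
    """Roll the model out to time T under the current actor; return l at T."""
    with torch.no_grad():
        steps = ((T - t) / dt).long().clamp(min=0)
        x_T = x.clone()
        for i in range(int(steps.max())):
            tau = t + i * dt
            active = (i < steps).float()          # stop each sample at its own horizon
            u = actor(torch.cat([x_T, tau], dim=-1))
            x_T = x_T + active * f(x_T, u) * dt
        u_T = actor(torch.cat([x_T, torch.full_like(t, T)], dim=-1))
        return l(x_T, u_T)

for _ in range(500):
    x = torch.empty(256, 1).uniform_(-2.0, 2.0)   # sampled states
    t = torch.empty(256, 1).uniform_(0.0, T)      # sampled current times
    l_T = terminal_utility(x, t)                  # right-hand side of (28)

    # Policy evaluation: minimize the squared residual of (28).
    x.requires_grad_(True)
    V = critic(torch.cat([x, t], dim=-1))
    dVdx = torch.autograd.grad(V.sum(), x, create_graph=True)[0]
    u = actor(torch.cat([x, t], dim=-1)).detach()
    residual = l(x, u) + (dVdx * f(x, u)).sum(dim=-1, keepdim=True) - l_T
    loss_v = (residual**2).mean()
    opt_v.zero_grad(); loss_v.backward(); opt_v.step()

    # Policy improvement: minimize the Hamiltonian with respect to the actor.
    u_new = actor(torch.cat([x, t], dim=-1))
    dVdx_c = dVdx.detach()                        # gradient from before the critic update
    loss_pi = (l(x, u_new) + (dVdx_c * f(x, u_new)).sum(dim=-1, keepdim=True)).mean()
    opt_pi.zero_grad(); loss_pi.backward(); opt_pi.step()
```

Because the policy here is a generic network of the state and time rather than an analytical function of the value gradient, nothing in the sketch requires the dynamics to be input-affine.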
Compared with existing methods for solving the CT finite-horizon HJB equation, our method does not need hand-crafted basis functions and can be used for arbitrary nonlinear systems with non-affine inputs. We have applied the corresponding CT finite-horizon ADP algorithm to simulations of automated vehicle control for the path-tracking maneuver. The results showed that the proposed ADP method can obtain a near-optimal policy with less than 1% error for linear vehicle dynamics, and that it is about 500 times faster than the nonlinear MPC solver ipopt for nonlinear vehicle dynamics [20].

Figure 2: The framework of the CT finite-horizon ADP algorithm
IV. ANALYTICAL VERIFICATION

To verify Theorem 1, we compare our result with the analytical solution of a linear quadratic (LQ) optimal control problem, which is formulated as
$$V^{\pi}(x, t) = \frac{1}{2}\int_{t}^{T}\left( q x^{2} + r u^{2} \right) d\tau \quad \text{s.t.}\ \dot{x}(t) = a x(t) + b u(t), \quad (29)$$
where $(a, b)$ is stabilizable, $q > 0$, and $r > 0$.

For the LQ problem, we can obtain the analytical optimal solution of both the value $V^{*}$ and the control input $u^{*}$ from optimal control theory:
$$V^{*}(x, t) = \frac{1}{2} x^{\top} P(t) x, \qquad u^{*}(x, t) = -\frac{1}{r} b P(t) x, \quad (30)$$
where $P(t)$ can be obtained by solving the Riccati differential equation associated with problem (29):
$$P(t) = \frac{r}{b^{2}} \cdot \frac{(\beta + a) + \eta(\beta - a)\, e^{2\beta(t-T)}}{1 - \eta\, e^{2\beta(t-T)}}, \quad (31)$$
where $\beta = \sqrt{q b^{2}/r + a^{2}}$, $\eta = (a+\beta)/(a-\beta)$, and the boundary condition is $P(T) = 0$.
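As a quick numerical sanity check of (31), the closed form can be compared against direct integration of the Riccati differential equation $\dot{P} = -q - 2aP + (b^{2}/r)P^{2}$ with $P(T) = 0$, written below in the time-to-go variable $s = T - t$; the parameter values are illustrative assumptions.

```python
# Check the closed-form P(t) in (31) against RK4 integration of the Riccati ODE.
# In time-to-go s = T - t the ODE reads dP/ds = q + 2*a*P - (b^2/r)*P^2, P(0) = 0.
# Parameter values are illustrative assumptions.
import math

a, b, q, r, T, t = -0.5, 1.0, 2.0, 1.0, 3.0, 1.0

beta = math.sqrt(q * b**2 / r + a**2)
eta = (a + beta) / (a - beta)

def P_closed_form(time):
    E = math.exp(2.0 * beta * (time - T))
    return (r / b**2) * ((beta + a) + eta * (beta - a) * E) / (1.0 - eta * E)

def P_rk4(time, n_steps=20000):
    def g(P):                          # dP/ds in time-to-go
        return q + 2.0 * a * P - (b**2 / r) * P**2
    P = 0.0
    h = (T - time) / n_steps
    for _ in range(n_steps):
        k1 = g(P); k2 = g(P + 0.5 * h * k1)
        k3 = g(P + 0.5 * h * k2); k4 = g(P + h * k3)
        P += h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return P

print(f"closed form P(t) = {P_closed_form(t):.8f}, integrated P(t) = {P_rk4(t):.8f}")
```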
Theorem 1 states a conclusion that holds for a general policy $\pi$. Here, for the LQ problem, we verify the special case of the optimal policy $\pi^{*}$. The left side of equation (11), which is the partial derivative of the LQ value function with respect to time, can be derived as
$$\frac{\partial V^{*}(x(t), t)}{\partial t} = \frac{1}{2}\dot{P}(t)\, x(t)^{2}. \quad (32)$$
Meanwhile, the right side of equation (11), the negative of the terminal-time utility function, can be denoted as
$$-l(x^{*}(T), u^{*}(T)) = -\frac{1}{2}\left( q\, x^{*}(T)^{2} + r\, u^{*}(T)^{2} \right). \quad (33)$$
Hence, equation (11) turns into the following equation:
$$\frac{1}{2}\dot{P}(t)\, x(t)^{2} = -\frac{1}{2}\left( q\, x^{*}(T)^{2} + r\, u^{*}(T)^{2} \right). \quad (34)$$
To verify whether Theorem 1 holds, we further expand equation (34) with the analytical solution. The optimal terminal state $x^{*}(T)$ can be obtained by analyzing the following state trajectory:
$$x^{*}(\tau) = x(t)\, e^{\int_{t}^{\tau}\left(a - \frac{b^{2}}{r}P(s)\right) ds}, \quad (35)$$
where $\tau \in [t, T]$, and the integral term can be derived further:
$$\begin{aligned}
\int_{t}^{\tau}\left(a - \frac{b^{2}}{r}P(s)\right) ds
&= \int_{t}^{\tau}\left(a - \frac{b^{2}}{r}\cdot\frac{r}{b^{2}}\cdot\frac{(\beta + a) + \eta(\beta - a)e^{2\beta(s-T)}}{1 - \eta e^{2\beta(s-T)}}\right) ds \\
&= \int_{t}^{\tau}\left(\beta - \frac{2\beta}{1 - \eta e^{2\beta(s-T)}}\right) ds \\
&= \beta(\tau - t) - 2\beta\int_{t}^{\tau}\frac{1}{1 - \eta e^{2\beta(s-T)}}\, ds \\
&= \beta(\tau - t) - 2\beta\left[(\tau - t) - \frac{1}{2\beta}\ln\frac{1 - \eta e^{2\beta(\tau-T)}}{1 - \eta e^{2\beta(t-T)}}\right] \\
&= -\beta(\tau - t) + \ln\frac{1 - \eta e^{2\beta(\tau-T)}}{1 - \eta e^{2\beta(t-T)}}. \quad (36)
\end{aligned}$$
Hence, the optimal state trajectory is
$$x^{*}(\tau) = x(t)\, e^{-\beta(\tau-t) + \ln\frac{1 - \eta e^{2\beta(\tau-T)}}{1 - \eta e^{2\beta(t-T)}}} = x(t)\,\frac{1 - \eta e^{2\beta(\tau-T)}}{1 - \eta e^{2\beta(t-T)}}\, e^{-\beta(\tau-t)}, \quad \tau \in [t, T]. \quad (37)$$
In this case, $x^{*}(T)$ can be obtained by letting $\tau = T$. The term $\dot{P}(t)$ on the left side of equation (34) can be obtained by differentiating $P(t)$:
$$\dot{P}(t) = \frac{(2\beta)^{2} r}{b^{2}}\cdot\frac{\eta\, e^{2\beta(t-T)}}{\left(1 - \eta e^{2\beta(t-T)}\right)^{2}}, \quad (38)$$
where $\beta = \sqrt{q b^{2}/r + a^{2}}$ and $\eta = (a+\beta)/(a-\beta)$, as defined in (31).
The partial derivative of the value function with respect to time, i.e., the left side of equation (34), can be derived as
$$\begin{aligned}
\frac{\partial V^{*}(x(t), t)}{\partial t} &= \frac{1}{2}\dot{P}(t)\, x(t)^{2} \\
&= \frac{1}{2}(2\beta)^{2}\frac{r}{b^{2}}\cdot\frac{a+\beta}{a-\beta}\, x(t)^{2}\, \frac{e^{2\beta(t-T)}}{\left(1-\eta e^{2\beta(t-T)}\right)^{2}} \\
&= \frac{1}{2}(2\beta)^{2}\frac{r}{b^{2}}\cdot\frac{a^{2}-\beta^{2}}{(a-\beta)^{2}}\, x(t)^{2}\, \frac{e^{2\beta(t-T)}}{\left(1-\eta e^{2\beta(t-T)}\right)^{2}} \\
&= \frac{1}{2}\frac{r}{b^{2}}\left(a^{2} - \frac{q b^{2}}{r} - a^{2}\right)\frac{(2\beta)^{2}}{(a-\beta)^{2}}\, x(t)^{2}\, \frac{e^{2\beta(t-T)}}{\left(1-\eta e^{2\beta(t-T)}\right)^{2}} \\
&= -\frac{1}{2} q\, x(t)^{2}\, \frac{(2\beta)^{2}}{(a-\beta)^{2}}\cdot\frac{e^{2\beta(t-T)}}{\left(1-\eta e^{2\beta(t-T)}\right)^{2}}. \quad (39)
\end{aligned}$$
As for the negative of the terminal-time utility function, the right side of equation (34) can be derived as
$$\begin{aligned}
-l(x^{*}(T), u^{*}(T)) &= -\frac{1}{2} q\, x^{*}(T)^{2} - \frac{1}{2} r\, u^{*}(T)^{2} \\
&= -\frac{1}{2} q\, x^{*}(T)^{2} \\
&= -\frac{1}{2} q\, x(t)^{2}\left(\frac{1-\eta e^{2\beta(T-T)}}{1-\eta e^{2\beta(t-T)}}\right)^{2} e^{-2\beta(T-t)} \\
&= -\frac{1}{2} q\, x(t)^{2}\, e^{2\beta(t-T)}\left(\frac{1-\eta}{1-\eta e^{2\beta(t-T)}}\right)^{2} \\
&= -\frac{1}{2} q\, x(t)^{2}\, \frac{(1-\eta)^{2}\, e^{2\beta(t-T)}}{\left(1-\eta e^{2\beta(t-T)}\right)^{2}} \\
&= -\frac{1}{2} q\, x(t)^{2}\left(\frac{2\beta}{a-\beta}\right)^{2}\frac{e^{2\beta(t-T)}}{\left(1-\eta e^{2\beta(t-T)}\right)^{2}}, \quad (40)
\end{aligned}$$
where the second equality uses $u^{*}(T) = -\frac{1}{r}b P(T) x^{*}(T) = 0$ because $P(T) = 0$.
It can now be seen that for the LQ problem, the partial derivative of the optimal value function with respect to time equals the negative of the terminal-time utility function:
$$\frac{\partial V^{*}(x(t), t)}{\partial t} = -l(x^{*}(T), u^{*}(T)), \quad t \in [0, T]. \quad (41)$$
In fact, we can even obtain $\frac{\partial V^{*}(x(\tau), \tau)}{\partial \tau} = -l(x^{*}(T), u^{*}(T))$ for all $\tau \in [t, T]$.
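The closed-form expressions (39) and (40) can also be compared numerically; the parameter values below are illustrative assumptions.

```python
# Numerically compare the two sides of (41) using the closed-form LQ solution:
# left side 0.5 * dP/dt * x(t)^2 from (38), right side -l(x*(T), u*(T))
# with x*(T) from (37) and u*(T) = -(b/r)*P(T)*x*(T) = 0.
# Parameter values are illustrative assumptions.
import math

a, b, q, r, T = -0.5, 1.0, 2.0, 1.0, 3.0
t, x_t = 1.0, 1.7

beta = math.sqrt(q * b**2 / r + a**2)
eta = (a + beta) / (a - beta)
E = math.exp(2.0 * beta * (t - T))

# Left side of (34)/(41): 0.5 * Pdot(t) * x(t)^2, with Pdot from (38).
P_dot = (2.0 * beta)**2 * (r / b**2) * eta * E / (1.0 - eta * E)**2
lhs = 0.5 * P_dot * x_t**2

# Right side: -l(x*(T), u*(T)) with x*(T) from (37) and u*(T) = 0.
x_T = x_t * (1.0 - eta) / (1.0 - eta * E) * math.exp(-beta * (T - t))
rhs = -0.5 * q * x_T**2

print(f"0.5*Pdot(t)*x(t)^2 = {lhs:.8f}")
print(f"-l(x*(T), u*(T))   = {rhs:.8f}")
```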
V. CONCLUSION
Unlike the infinite-horizon HJB equation, the finite-horizon HJB equation contains an additional partial derivative with respect to time, which is an intractable unknown term. This paper provides an alternative viewpoint which directly derives the partial derivative of the value function with respect to time analytically. It is found that this partial derivative exactly equals the negative of the terminal-time utility function, which allows reusing traditional ADP algorithms to approximate the solution of the finite-horizon HJB equation for continuous-time systems. Two proof methods have been given: the initial-time equivalence of two OCPs and the definition of the partial derivative. The correctness of our finding is verified by analyzing a linear quadratic problem.
VI. ACKNOWLEDGMENT
This work is supported by the International Science & Technology Cooperation Program of China under 2019YFE0100200 and by the Beijing NSF under JQ18010. Special thanks should be given to TOYOTA for their support of this study. Ziyu Lin and Jingliang Duan contributed equally to this work. Corresponding author: S. Li (lisb04@gmail.com).
REFERENCES
[1] S. Li, K. Li, R. Rajamani, and J. Wang, “Model predic-
tive multi-objective vehicular adaptive cruise control,” IEEE
Transactions on Control Systems Technology, vol. 19, no. 3,
pp. 556–566, 2010.
[2] D. P. Bertsekas and J. N. Tsitsiklis, “Neuro-dynamic pro-
gramming: an overview,” in Proceedings of 1995 34th IEEE
Conference on Decision and Control, vol. 1. IEEE, 1995,
pp. 560–564.
[3] T. Pappas, A. Laub, and N. Sandell, “On the numerical solution of the discrete-time algebraic Riccati equation,” IEEE Transactions on Automatic Control, vol. 25, no. 4, pp. 631–641, 1980.
[4] F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal control.
John Wiley & Sons, 2012.
[5] R. S. Sutton and A. G. Barto, Reinforcement learning: An
introduction. MIT press, 2018.
[6] P. Werbos, “Approximate dynamic programming for realtime control and neural modelling,” Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Approaches, pp. 493–525, 1992.
[7] P. J. Werbos, “Building and understanding adaptive systems:
A statistical/numerical approach to factory automation and
brain research,” IEEE Transactions on Systems, Man, and
Cybernetics, vol. 17, no. 1, pp. 7–20, 1987.
[8] W. B. Powell, Approximate Dynamic Programming: Solving
the curses of dimensionality. John Wiley & Sons, 2007, vol.
703.
[9] S. Li, “Reinforcement learning and control,” Lecture notes in
Tsinghua University, 2020.
[10] R. A. Howard, “Dynamic programming and Markov processes,” 1960.
[11] J. Duan, S. E. Li, Z. Liu, M. Bujarbaruah, and B. Cheng,
“Generalized policy iteration for optimal control in continu-
ous time,” arXiv preprint arXiv:1909.05402, 2019.
[12] M. Abu-Khalaf and F. L. Lewis, “Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach,” Automatica, vol. 41, no. 5, pp. 779–791, 2005.
[13] T. Dierks and S. Jagannathan, “Optimal control of affine
nonlinear continuous-time systems,” in Proceedings of the
2010 American Control Conference. IEEE, 2010, pp. 1568–
1573.
[14] K. G. Vamvoudakis and F. L. Lewis, “Online actor–critic al-
gorithm to solve the continuous-time infinite horizon optimal
control problem,” Automatica, vol. 46, no. 5, pp. 878–888,
2010.
[15] T. Cheng, F. L. Lewis, and M. Abu-Khalaf, “A neural network solution for fixed-final time optimal control of nonlinear systems,” Automatica, vol. 43, no. 3, pp. 482–490, 2007.
[16] Z. Zhao, Y. Yang, H. Li, and D. Liu, “Approximate finite-
horizon optimal control with policy iteration,” in Proceedings
of the 33rd Chinese Control Conference. IEEE, 2014, pp.
8895–8900.
[17] D. Adhyaru, I. Kar, and M. Gopal, “Fixed final time optimal control approach for bounded robust controller design using Hamilton–Jacobi–Bellman solution,” IET Control Theory & Applications, vol. 3, no. 9, pp. 1183–1195, 2009.
[18] T. Cheng, F. L. Lewis, and M. Abu-Khalaf, “Fixed-final-time-constrained optimal control of nonlinear systems using neural network HJB approach,” IEEE Transactions on Neural Networks, vol. 18, no. 6, pp. 1725–1737, 2007.
[19] A. Heydari and S. N. Balakrishnan, “Finite-horizon control-constrained nonlinear optimal control using single network adaptive critics,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 1, pp. 145–157, 2012.
[20] Z. Lin, J. Duan, S. E. Li, H. Ma, and Y. Yin, “Continuous-time
finite-horizon adp for automated vehicle controller design
with high efficiency,” arXiv preprint arXiv:2007.02070, 2020.