Solving finite-horizon HJB for optimal control of continuous-time
systems
Ziyu Lin
School of Vehicle and Mobility
Tsinghua University
Beijing, China
linzy17@mails.tsinghua.edu.cn
Jie Li
School of Vehicle and Mobility
Tsinghua University
Beijing, China
jie-li18@mails.tsinghua.edu.cn
Jingliang Duan
School of Vehicle and Mobility
Tsinghua University
Beijing, China
djl15@mails.tsinghua.edu.cn
Haitong Ma
School of Vehicle and Mobility
Tsinghua University
Beijing, China
maht19@mails.tsinghua.edu.cn
Shengbo Eben Li*
School of Vehicle and Mobility
Tsinghua University
Beijing, China
lishbo@tsinghua.edu.cn
Jianyu Chen
Institute for Interdisciplinary Information Sciences
Tsinghua University
Beijing, China
jianyuchen@berkeley.edu
Abstract—Continuous-time finite-horizon optimal control is of significant importance because most physical systems are defined in continuous time and most practical applications consider a finite horizon. However, unlike the infinite-horizon HJB equation, the finite-horizon HJB equation contains an additional partial derivative with respect to time, which is an intractable unknown term. This paper finds that this partial derivative exactly equals the negative of the terminal-time utility function, which significantly simplifies the HJB equation and thus enables efficient numerical solutions. We provide two proofs of this finding: the first analyzes the initial-time equivalence between a fixed-time-horizon optimal control problem and a fixed-terminal-time optimal control problem, and the second works directly from the definition of the partial derivative. Based on this finding, we can reuse traditional approximate dynamic programming (ADP) algorithms to approximate the solution of the finite-horizon HJB equation for continuous-time systems. We verify the correctness of the finding by analyzing a linear quadratic problem.
Keywords-finite-horizon HJB equation, approximate dy-
namic programming, continuous-time optimal control.
I. INTRODUCTION
Continuous-time (CT) finite-horizon control systems offer important merits in theoretical analysis and controller synthesis. Most physical systems are defined in continuous time, and the fixed-final-time optimal control problem (OCP) is common in practical applications. In vehicle-tracking control, for example, the desired tracking trajectory can only be given over a finite horizon due to the time-varying environment and the limited capability of road perception [1]. In other words, it may be impossible to define an infinite-horizon OCP in most real-world applications.
Dynamic programming (DP) is the fundamental tool for solving the CT finite-horizon OCP [2]. It employs Bellman's principle of optimality to derive the Hamilton-Jacobi-Bellman (HJB) equation, which is a necessary and sufficient condition for the CT optimal control policy. For the continuous-time linear quadratic control problem, the optimal solution satisfies the Riccati equation, a special case of the HJB equation [3]. In the general case, however, the HJB equation is a nonlinear partial differential equation (PDE) in the value function, which is usually infeasible to solve analytically [4]. To make things worse, unlike the infinite-horizon HJB equation, the finite-horizon HJB equation contains a time-dependent value function, whose partial derivative with respect to time is an intractable unknown term.
Recent decades have witnessed numerous efforts to solve the HJB equation for nonlinear systems [5]. An algorithm using the HJB equation to approximate the control policy was first proposed by Werbos [6], [7]. It was termed approximate dynamic programming (ADP), whose synonyms include adaptive DP [8], neuro-DP [8] and reinforcement learning (RL) [9]. Its basic principle is to solve the HJB equation off-line using policy iteration, a two-step iterative process of policy evaluation and policy improvement [10]. Many ADP methods seeking optimal control solutions for CT systems with infinite horizon have appeared in the literature [11]. Abu-Khalaf and Lewis proposed a method for continuous input-constrained systems with a non-quadratic utility function to obtain an approximate control law [12]. Dierks and Jagannathan designed a single online critic network to solve the infinite-horizon HJB equation with small bounded error [13]. Vamvoudakis and Lewis introduced synchronous policy iteration, which simultaneously adapts both the actor network (NN) and the critic NN [14]. Compared with infinite-horizon settings, however, most ADP methods face far greater challenges in solving the finite-horizon OCP, because the non-zero, time-varying partial derivative of the value with respect to time introduces an additional unknown term into the finite-horizon HJB equation.
To overcome this problem, some studies adopted linear-combination approximations with time-varying weights to approximate the time-varying value function [15]–[17]. The time-varying weights can be solved by applying least squares over a predefined region in practice [18]. Other studies employ time-varying activation functions as the time-varying approximator [19]. Nevertheless, existing methods for the CT finite-horizon OCP suffer from two problems: a) Restriction to input-affine systems: these methods are limited to input-affine systems because the control policy must be analytically expressed in terms of the value function. b) Requirement for hand-crafted basis functions: the value function or policy is represented by a linear combination of manually designed basis functions. For complex or large-scale systems, inappropriate basis functions usually result in an inaccurate value function or a sub-optimal policy.
Different from existing methods, this paper provides an alternative viewpoint that directly derives the partial derivative of the value function with respect to time analytically. The contributions are summarized as follows:
(1) We show, for the first time, that the partial derivative of the value with respect to time exactly equals the negative of the terminal-time utility function. Hence, the CT finite-horizon HJB equation contains only two unknown terms, so that any ADP method for the CT infinite-horizon OCP can be used to solve the finite-horizon OCP.
(2) This overcomes the following problems of existing finite-horizon ADP methods: a) Since many ADP methods do not require the policy to be analytically expressed in terms of the value function, our finding allows reusing these methods to realize optimal control for arbitrary nonlinear systems with non-affine inputs. b) Neural networks taking the system state as input can be employed to represent the value function or policy, so no hand-crafted basis functions are needed.
The paper is organized as follows. Section II formulates the optimal control problem and introduces the general CT ADP framework. Section III derives the CT finite-horizon HJB equation with two proof methods: analyzing the initial-time equivalence between the fixed-time-horizon OCP and the fixed-terminal-time OCP, and applying the definition of the partial derivative. Section IV demonstrates the correctness of our finding by analyzing linear quadratic systems. Section V concludes the paper.
II. PRELIMINARIES
A. Optimal Control Problem Formulation
Consider the general continuous-time time-invariant dynamic system given by:

ẋ = f(x, u),    (1)

where x ∈ Ω ⊂ R^m is the state, u ∈ R^n is the control input, and f(x, u) is the system model, which is assumed to be controllable. The value function of the CT finite-horizon OCP is:

V^π(x(t), t) = ∫_t^T l(x(τ), u(τ)) dτ,    (2)

where the time-varying V^π(x(t), t) is defined as the value function corresponding to policy π, t ∈ [0, T] is the current time, T is the fixed terminal time, τ is the virtual time in the future horizon, and l(x, u) is the utility function. Different from discrete-time control, the value function in the CT domain is the integral of the utility function. One cannot easily handle the derivative of an integral function, which is a major difference between discrete-time and continuous-time control. The policy π maps the state x to the control input u:

u(τ) = π(x(τ), τ),   τ ∈ [t, T],  t ∈ [0, T].    (3)
The objective of the OCP is to minimize the cost functional:

V*(x(t), t) = min_u V^π(x, t) = min_u ∫_t^T l(x(τ), u(τ)) dτ
s.t. ẋ = f(x, u),    (4)

where the superscript * means "optimal". The basic principle of the CT finite-horizon OCP is to seek a policy π(x, t) that minimizes V(x, t) for all x:

π*(x, t) = arg min_u {V(x(t), t)}.
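Since the value (2) is an integral along the closed-loop trajectory, it can be estimated numerically by rolling out the dynamics. The sketch below is a minimal illustration (the scalar integrator ẋ = u, utility l = x² + u², and policy u = −0.5x are illustrative assumptions, not from the paper):

```python
def value_function(x0, t, T, f, l, pi, dt=1e-4):
    """Approximate V^pi(x(t), t) = int_t^T l(x, u) dtau by a forward-Euler rollout."""
    x, V = x0, 0.0
    for i in range(int(round((T - t) / dt))):
        tau = t + i * dt
        u = pi(x, tau)
        V += l(x, u) * dt        # accumulate the utility integral
        x += f(x, u) * dt        # Euler step of xdot = f(x, u)
    return V

# Hypothetical scalar example: xdot = u, l = x^2 + u^2, policy u = -0.5 x
f = lambda x, u: u
l = lambda x, u: x * x + u * u
pi = lambda x, tau: -0.5 * x

V = value_function(1.0, t=0.0, T=2.0, f=f, l=l, pi=pi)
# closed loop is xdot = -0.5 x, so V = 1.25 (1 - e^-2)
```

The same rollout structure underlies the finite-horizon ADP algorithm discussed later, where the terminal state of such a rollout is stored.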
B. Optimality Condition for the OCP
To find the optimal policy, we introduce the Hamiltonian:

H(x, u, ∂V^π(x, t)/∂x) = l(x, u) + (∂V^π(x, t)/∂x)^⊤ f(x, u).

The Hamiltonian satisfies the following finite-horizon self-consistency condition:

H(x, u, ∂V^π(x, t)/∂x) = −∂V^π(x, t)/∂t,   ∀t ∈ [0, T].    (5)
Similarly, the optimal solution satisfies the following finite-horizon HJB equation:

min_u { l(x, u) + (∂V*(x, t)/∂x)^⊤ f(x, u) } = −∂V*(x, t)/∂t,    (6)

with the terminal boundary condition

V*(x^π(T), T) = 0.

To seek the optimal policy for CT finite-horizon systems, we only need to solve the HJB equation, which contains three unknown terms: u*(x, t), ∂V*(x, t)/∂x, and ∂V*(x, t)/∂t. The HJB equation is hard to solve analytically since it is a nonlinear partial differential equation.
C. Policy Iteration in Traditional ADP
ADP serves as a numerical solution to the HJB equation (6) using the policy iteration (PI) framework, which iteratively applies the following two steps:
(1) Policy Evaluation (PEV): At the k-th iteration, given policy π_k, the corresponding value function V^{π_k}(x, t) is estimated by applying the self-consistency condition (5).
(2) Policy Improvement (PIM): PIM finds a better policy π_{k+1}(x, t) by minimizing the Hamiltonian with V^{π_k}(x, t):

π_{k+1}(x, t) = arg min_u H(x, u, ∂V^{π_k}(x, t)/∂x).    (7)

The iteration of PEV and PIM gradually converges to the HJB solution. The pseudo-code of PI is shown in Algorithm 1.
Algorithm 1 CT ADP in the PI framework
Initialize with policy π_0(x, t)
Given an arbitrarily small positive ε, set k = 0
while ‖V^{π_{k+1}}(x, t) − V^{π_k}(x, t)‖ ≥ ε do
  1. Solve the value function V^{π_k}(x, t) for ∀x ∈ Ω using:

  l(x, π_k(x, t)) + (∂V^{π_k}/∂x)^⊤ f(x, π_k(x, t)) = −∂V^{π_k}/∂t    (8)

  2. Solve the new policy π_{k+1}(x) for ∀x ∈ Ω using (7)
end while
In the PEV process, there are two unknown partial derivative terms in (8): ∂V^π(x, t)/∂t and ∂V^π(x, t)/∂x. For any ∂V^π(x, t)/∂t, we can find a corresponding ∂V^π(x, t)/∂x that makes (8) hold, and vice versa. In other words, (8) has no unique solution. For the infinite-horizon HJB equation, however, ∂V^π(x, t)/∂t = 0, so the PEV process can be solved easily since only ∂V^π(x, t)/∂x is unknown in this case. Inspired by this, if ∂V^π(x, t)/∂t were known, we could obtain the solution of the finite-horizon HJB equation in the same way as existing infinite-horizon ADP methods with PI techniques. Therefore, it is crucial and meaningful to obtain an analytical expression for ∂V^π(x, t)/∂t.
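To make the infinite-horizon case concrete, the PEV/PIM cycle of Algorithm 1 can be carried out in closed form for a scalar linear quadratic problem, since there V^π = ½px² with a constant p and ∂V^π/∂t = 0. The sketch below (the constants a, b, q, r are illustrative assumptions, not from the paper) alternates exact PEV and PIM; the result converges to the algebraic-Riccati value p = (r/b²)(a + β) with β = √(a² + qb²/r):

```python
a, b, q, r = -0.5, 1.0, 1.0, 1.0     # hypothetical scalar LQ constants
k = 0.0                              # initial stabilizing linear policy u = -k x
for _ in range(20):
    # PEV: with dV/dt = 0, condition (8) reduces to (q + r k^2)/2 + p (a - b k) = 0
    p = (q + r * k**2) / (2 * (b * k - a))
    # PIM: minimizing the Hamiltonian over u gives the new gain k = (b/r) p
    k = (b / r) * p

beta = (a**2 + q * b**2 / r) ** 0.5
# p has converged to the algebraic Riccati solution (r/b^2)(a + beta)
```

The finite-horizon case cannot be handled this way until ∂V^π/∂t is pinned down, which is precisely the goal of the next section.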
III. DERIVATION OF THE FINITE-HORIZON HJB EQUATION
This paper focuses on the right-hand side of (8), i.e., ∂V^π(x, t)/∂t. If its value is known, the HJB equation can be solved easily, as mentioned before. We first state our conclusion as Theorem 1 below. By replacing ∂V^π(x, t)/∂t with −l(x^π(T), u(T)) according to Theorem 1, the finite-horizon HJB equation can be rewritten as

min_u { l(x, u) + (∂V*(x, t)/∂x)^⊤ f(x, u) } = l(x*(T), u*(T)),   ∀x ∈ Ω, t ∈ [0, T].    (9)

The above HJB equation can be solved by reusing the traditional CT ADP method with the PI technique, very similar to existing infinite-horizon ADP. Its corresponding self-consistency condition is:

l(x(t), u(t)) + (∂V^π(x(t), t)/∂x)^⊤ f(x, u) = l(x^π(T), u(T)),   ∀x ∈ Ω, t ∈ [0, T].    (10)

Theorem 1. For the value function defined in (4), its partial derivative with respect to time t exactly equals the negative of the terminal-time utility function, i.e.:

∂V^π(x, t)/∂t = −l(x^π(T), u(T)) = −l(x^π(T), π(x^π(T), T)),    (11)

where x^π(T) is the terminal state, obtained by rolling out the model f(x, u) with policy π from the initial state x(t).

Proof: We provide two approaches to prove the theorem: (1) the first uses the connection between two finite-time control problems, one with a fixed terminal time and one with a fixed horizon; (2) the second derives the partial derivative directly from its definition.
A. Method 1: Initial-Time Equivalence of Two OCPs
First, we define a fixed-predictive-horizon OCP that is equivalent to the fixed-terminal-time OCP (4). It is called the assistant problem, denoted OCP_A. The cost function of OCP_A is:

V_A^{π_A}(x(t), c) = ∫_t^{T+t−c} l(x(τ), u_A(τ)) dτ,    (12)

where c ∈ [0, T] and T is a fixed constant. The upper limit of the integral is T + t − c. The predictive horizon of OCP_A is the fixed value T − c, while the predictive horizon T − t of OCP (4) varies with t. In fact, the value function V depends on the length of the predictive horizon. Therefore, V_A^{π_A}(x(t), c) is an explicit function of c instead of t. π_A is the policy used in OCP_A, and it maps the state x to the control input u_A:

u_A(τ) = π_A(x(τ), c, τ),   τ ∈ [t, T + t − c],  t, c ∈ [0, T].    (13)

The objective of OCP_A is to minimize V_A^{π_A}:

V_A*(x(t), c) = min_{u_A} V_A^{π_A}(x(t), c) = min_{u_A} ∫_t^{T+t−c} l(x(τ), u_A(τ)) dτ.    (14)
The optimal value functions of the OCP and OCP_A are equal when t = c:

V*(x, t) = V_A*(x(t), c) = min_u ∫_c^T l(x(τ), u(τ)) dτ,   t = c, ∀x ∈ Ω.    (15)

Hence, it is obvious that the two OCPs share the same optimal control policy in this case:

π*(x, t) = π_A*(x, c, t),   t = c.    (16)

According to (16), we assume that:

u(τ) = π(x(τ), τ) = u_A(τ) = π_A(x(τ), c, τ),   t = c, ∀x ∈ Ω.    (17)

Furthermore, it follows that:

V^π(x(t), t) = ∫_c^T l(x(τ), u(τ)) dτ = ∫_c^T l(x(τ), u_A(τ)) dτ = V_A^{π_A}(x(t), c),   t = c.    (18)

Based on (18), we have:

∂V^π(x, t)/∂x = ∂V_A^{π_A}(x, c)/∂x,   t = c.    (19)
Different from ∂V^π(x, t)/∂t,

∂V_A^{π_A}(x(t), c)/∂t = 0,    (20)

because V_A^{π_A}(x(t), c) is a function of c instead of t. Furthermore, taking the derivative of (2) and (12) with respect to time t, one has:

dV^π(x(t), t)/dt = (∂V^π(x, t)/∂x)^⊤ f(x, u) + ∂V^π(x, t)/∂t = −l(x(t), u(t)),    (21)

dV_A^{π_A}(x(t), c)/dt = (∂V_A^{π_A}(x, c)/∂x)^⊤ f(x, u)
  = l(x^{π_A}(T + t − c), u_A(T + t − c)) − l(x(t), u_A(t)).    (22)

According to (17), (21) and (22), we know:

∂V^π(x, t)/∂t |_{t=c} = dV^π(x(t), t)/dt − dV_A^{π_A}(x(t), c)/dt
  = −l(x^{π_A}(T + t − c), u_A(T + t − c))
  = −l(x^{π_A}(T), u_A(T))
  = −l(x^π(T), u(T)).    (23)

Since c can take any value in [0, T], this simplifies to:

∂V^π(x, t)/∂t = −l(x^π(T), u(T)),   t ∈ [0, T].    (24)

This completes proof method 1.
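The result (24) can be checked numerically with finite differences: shrinking the remaining horizon by a small δ while keeping the initial state fixed should change the value by approximately −l(x^π(T), u(T))·δ. The sketch below does this for a hypothetical scalar nonlinear system (ẋ = −x³ + u, π(x) = −x, l = x² + u² are illustrative assumptions):

```python
def rollout(x0, horizon, f, l, pi, dt=1e-4):
    """Forward-Euler rollout; returns the cost integral and the terminal state."""
    x, V = x0, 0.0
    for _ in range(int(round(horizon / dt))):
        u = pi(x)
        V += l(x, u) * dt
        x += f(x, u) * dt
    return V, x

f = lambda x, u: -x**3 + u           # hypothetical nonlinear dynamics
l = lambda x, u: x * x + u * u
pi = lambda x: -x

T, t, delta, x0 = 1.0, 0.0, 1e-3, 1.0
V_t, xT = rollout(x0, T - t, f, l, pi)
V_shift, _ = rollout(x0, T - t - delta, f, l, pi)   # same state, start time t + delta
dVdt = (V_shift - V_t) / delta       # finite-difference estimate of the partial derivative
# Theorem 1 predicts dVdt ~ -l(x^pi(T), pi(x^pi(T)))
```

Note that the policy here is not affine in the value function, underlining that the identity itself places no structural restriction on the system or policy.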
B. Method 2: The Definition of the Partial Derivative
Here, we provide another way to prove Theorem 1, using the definitions of both the total derivative and the partial derivative. The value function of the original optimal control problem is given in (2). Let us first look at the total derivative of the value with respect to time:

dV^π(x, t)/dt = [V^π(x + dx, t + dt) − V^π(x, t)] / dt
  = [∫_{t+dt}^T l(x(τ), u(τ)) dτ − ∫_t^T l(x(τ), u(τ)) dτ] / dt
  = −[∫_t^{t+dt} l(x(τ), u(τ)) dτ] / dt
  = −l(x(t), u(t)) dt / dt
  = −l(x(t), u(t)).    (25)
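The total-derivative identity (25) can itself be verified numerically: along one closed-loop trajectory, the remaining cost decreases at exactly the rate of the running utility. A sketch with an assumed scalar system (ẋ = −x + u, u = −0.5x, l = x² + u² are illustrative, not from the paper):

```python
def tail_cost(x_start, horizon, f, l, pi, dt=1e-4):
    """Integral of l(x, u) over the remaining horizon along the Euler trajectory."""
    x, V = x_start, 0.0
    for _ in range(int(round(horizon / dt))):
        u = pi(x)
        V += l(x, u) * dt
        x += f(x, u) * dt
    return V

f = lambda x, u: -x + u              # hypothetical scalar dynamics
l = lambda x, u: x * x + u * u
pi = lambda x: -0.5 * x

T, t, dt = 2.0, 0.5, 1e-4
x = 1.0                              # advance x(0) = 1 to x(t) along the trajectory
for _ in range(int(round(t / dt))):
    x += f(x, pi(x)) * dt
x_next = x + f(x, pi(x)) * dt        # x(t + dt) on the same trajectory

dV = (tail_cost(x_next, T - t - dt, f, l, pi) - tail_cost(x, T - t, f, l, pi)) / dt
# (25): dV/dt = -l(x(t), u(t)), because here both x and t vary along the trajectory
```

Because both the state and the time advance together, this difference quotient recovers the running utility at time t, not the terminal one.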
We compare this with the partial derivative. By mathematical definition, a partial derivative of a function of multiple variables is its derivative with respect to one of those variables, with the others held constant (as opposed to the total derivative, in which all variables are allowed to vary). For example, the partial derivative of f(x_1, ..., x_i, ..., x_n) in the direction of x_i is:

∂f/∂x_i = lim_{h→0} [f(x_1, ..., x_i + h, ..., x_n) − f(x_1, ..., x_i, ..., x_n)] / h.    (26)

When we take the partial derivative with respect to t, the state x(t) and policy u(t) are the other variables, whose trajectories are held constant. Hence, V^π(x, t + dt) can be derived as ∫_{t+dt}^T l(x(τ − dt), u(τ − dt)) dτ. Then, the partial derivative of the value with respect to the variable t is:
∂V^π(x, t)/∂t = [V^π(x, t + dt) − V^π(x, t)] / dt
  = [∫_{t+dt}^T l(x(τ − dt), u(τ − dt)) dτ − ∫_t^T l(x(τ), u(τ)) dτ] / dt
  = [∫_t^{T−dt} l(x(τ), u(τ)) dτ − ∫_t^T l(x(τ), u(τ)) dτ] / dt
  = −[∫_{T−dt}^T l(x(τ), u(τ)) dτ] / dt
  = −l(x^π(T), u(T)) dt / dt
  = −l(x^π(T), u(T)),   t ∈ [0, T].    (27)
An illustration of this partial derivative is shown in Fig. 1.

Figure 1: Partial derivative

From (24) and (27), it is clear that the two derivation methods reach the same conclusion. Substituting (24) into the self-consistency condition (5) of the original OCP yields (10). If we select the optimal policy π* and substitute (24) into (6), we obtain the finite-horizon HJB equation (9).
Remark 1. Since ∂V^π(x, t)/∂t = −l(x^π(T), u(T)), only ∂V^π(x, t)/∂x is unknown in (8). The only difference is that for infinite-horizon ADP, ∂V^π(x, t)/∂t = 0, while for finite-horizon ADP, ∂V^π(x, t)/∂t = −l(x^π(T), u(T)). Hence, we can easily obtain the solution of the CT finite-horizon OCP using PI techniques, similar to existing infinite-horizon ADP methods.

Following Remark 1, the pseudo-code of CT finite-horizon ADP is shown in Algorithm 2, and the corresponding framework is shown in Fig. 2.
Algorithm 2 CT Finite-Horizon ADP in the PI framework
Initialize with policy π_0(x, t)
Given an arbitrarily small positive ε, set k = 0
while ‖V^{π_{k+1}}(x, t) − V^{π_k}(x, t)‖ ≥ ε do
  Roll out from ∀x_t ∈ Ω with policy π_k
  Receive and store x^{π_k}(T)
  1. Solve the value function V^{π_k}(x, t) for ∀x ∈ Ω using:

  l(x, π_k(x, t)) + (∂V^{π_k}/∂x)^⊤ f(x, π_k(x, t)) = l(x^{π_k}(T), u(T))    (28)

  2. Solve the new policy π_{k+1}(x) for ∀x ∈ Ω using (7)
end while
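As a minimal runnable sketch of Algorithm 2, consider the scalar linear quadratic case, where the quadratic structure makes every step closed-form: the rollout gives the terminal-state ratio, PEV solves (28) for p(t) in V^{π_k} = ½p(t)x², and PIM updates the linear gain. All numerical constants, the grid size, and the linear-policy parameterization are illustrative assumptions for this sketch, not part of the paper's method:

```python
import numpy as np

a, b, q, r, T, N = -0.5, 1.0, 1.0, 1.0, 1.0, 2000   # hypothetical LQ constants
t = np.linspace(0.0, T, N + 1)

k = np.zeros(N + 1)                   # policy gain: u = -k(t) x
for _ in range(30):
    gamma = a - b * k                 # closed-loop coefficient: xdot = gamma(t) x
    # rollout: x^pi(T)/x(t) = exp(int_t^T gamma ds), via a cumulative trapezoid rule
    I = np.concatenate(([0.0], np.cumsum((gamma[:-1] + gamma[1:]) / 2) * (T / N)))
    ratio = np.exp(I[-1] - I)         # int_t^T = int_0^T - int_0^t
    lT = 0.5 * (q + r * k[-1] ** 2) * ratio**2      # l(x^pi(T), u(T)) / x(t)^2
    # PEV via (28): (q + r k^2)/2 + p (a - b k) = lT, solved pointwise for p(t)
    p = (lT - 0.5 * (q + r * k**2)) / gamma
    k = (b / r) * p                   # PIM: u = -(b/r) p(t) x

# compare with the analytical Riccati solution of this LQ problem
beta = np.sqrt(q * b**2 / r + a**2)
eta = -(a + beta) / (beta - a)
s = np.exp(2 * beta * (t - T))
P = (r / b**2) * (beta + a + eta * (beta - a) * s) / (1 - eta * s)
err = float(np.max(np.abs(p - P)))    # converges toward zero
```

In the general nonlinear setting the closed-form PEV step is replaced by fitting a value network to the residual of (28), but the structure of the loop is the same.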
Compared with existing methods for solving the CT finite-horizon HJB equation, our method does not need hand-crafted basis functions and can be applied to arbitrary nonlinear systems with non-affine inputs. We have applied the corresponding CT finite-horizon ADP algorithm to simulations of automated vehicle control for the path-tracking maneuver. The results showed that the proposed ADP method obtains a near-optimal policy with less than 1% error for linear vehicle dynamics, and that it is about 500 times faster than the nonlinear MPC solver IPOPT for nonlinear vehicle dynamics [20].

Figure 2: The framework of the CT finite-horizon ADP algorithm
IV. ANALYTICAL VERIFICATION
To verify Theorem 1, we compare our result with the analytical solution of a linear quadratic (LQ) optimal control problem, which is formulated as:

V^π(x, t) = (1/2) ∫_t^T (q x² + r u²) dτ
s.t. ẋ(t) = a x(t) + b u(t),    (29)

where (a, b) is stabilizable, q > 0 and r > 0. For the LQ problem, we can obtain the analytical optimal solution of both the value V* and the control input u* from optimal control theory:

V*(x, t) = (1/2) x^⊤ P(t) x,   u*(x, t) = −(1/r) b P(t) x,    (30)

where P(t) is obtained by solving the Riccati equation for problem (29):

P(t) = (r/b²) · [β + a + η(β − a) e^{2β(t−T)}] / [1 − η e^{2β(t−T)}],    (31)

with terminal condition P(T) = 0.
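The closed form (31) can be cross-checked by integrating the corresponding Riccati differential equation, Ṗ = (b²/r)P² − 2aP − q, backward in time from P(T) = 0. The numerical constants below are illustrative assumptions:

```python
import math

a, b, q, r, T = -1.0, 1.0, 2.0, 1.0, 2.0      # illustrative constants
beta = math.sqrt(q * b**2 / r + a**2)
eta = -(a + beta) / (beta - a)

def P_closed_form(t):
    """Closed-form solution (31), which satisfies P(T) = 0."""
    s = math.exp(2 * beta * (t - T))
    return (r / b**2) * (beta + a + eta * (beta - a) * s) / (1 - eta * s)

# integrate Pdot = (b^2/r) P^2 - 2 a P - q backward from P(T) = 0 to t = 0
dt = 1e-5
P = 0.0
for _ in range(int(round(T / dt))):
    P += dt * (q + 2 * a * P - (b**2 / r) * P**2)   # Euler step toward t = 0

# P now approximates P_closed_form(0.0)
```

The backward direction matters: the Riccati equation for a finite-horizon problem carries a terminal, not an initial, condition.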
Theorem 1 states a conclusion that holds for a general policy π. Here, for the LQ problem, we verify the special case of the optimal policy π*. The left-hand side of equation (11), the partial derivative of the LQ value function with respect to time, can be derived as:

∂V*(x(t), t)/∂t = (1/2) Ṗ(t) x(t)².    (32)

Meanwhile, the right-hand side of equation (11), the negative of the terminal-time utility function, can be written as:

−l(x*(T), u*(T)) = −(1/2) [q x*(T)² + r u*(T)²].    (33)

Hence, equation (11) becomes:

(1/2) Ṗ(t) x(t)² = −(1/2) [q x*(T)² + r u*(T)²].    (34)
To verify whether Theorem 1 holds, we evaluate both sides of equation (34) with the analytical solution. The optimal terminal state x*(T) can be obtained by analyzing the state trajectory:

x*(τ) = x(t) e^{∫_t^τ [a − (b²/r) P(s)] ds},    (35)

where τ ∈ [t, T]. The integral term can be derived further:

∫_t^τ [a − (b²/r) P(s)] ds
  = ∫_t^τ { a − (b²/r)(r/b²) [ 2β / (1 − η e^{2β(s−T)}) − (β − a) ] } ds
  = ∫_t^τ [ β − 2β / (1 − η e^{2β(s−T)}) ] ds
  = β(τ − t) − 2β ∫_t^τ ds / (1 − η e^{2β(s−T)})
  = β(τ − t) − 2β { (τ − t) − (1/2β) ln [ (1 − η e^{2β(τ−T)}) / (1 − η e^{2β(t−T)}) ] }
  = −β(τ − t) + ln [ (1 − η e^{2β(τ−T)}) / (1 − η e^{2β(t−T)}) ].    (36)
Hence, the optimal state trajectory is:

x*(τ) = x(t) e^{∫_t^τ [a − (b²/r) P(s)] ds}
  = x(t) e^{−β(τ−t) + ln[(1 − η e^{2β(τ−T)}) / (1 − η e^{2β(t−T)})]}
  = x(t) · (1 − η e^{2β(τ−T)}) / (1 − η e^{2β(t−T)}) · e^{−β(τ−t)},   τ ∈ [t, T].    (37)

In this case, x*(T) is obtained by letting τ = T. The derivative Ṗ(t) on the left-hand side of (34) is obtained by differentiating P(t):

Ṗ(t) = (2β)² (rη/b²) · e^{2β(t−T)} / [1 − η e^{2β(t−T)}]²,    (38)

where β = √(qb²/r + a²) and η = −(a + β)/(β − a).
The partial derivative of the value function with respect to time, the left-hand side of equation (34), can be derived as:

∂V*(x(t), t)/∂t = (1/2) Ṗ(t) x(t)²
  = −(1/2)(2β)² (r/b²) · (a + β)/(β − a) · x(t)² e^{2β(t−T)} / (1 − η e^{2β(t−T)})²
  = −(1/2)(2β)² (r/b²) · (β² − a²)/(β − a)² · x(t)² e^{2β(t−T)} / (1 − η e^{2β(t−T)})²
  = −(1/2)(r/b²)(qb²/r + a² − a²) · (2β)²/(β − a)² · x(t)² e^{2β(t−T)} / (1 − η e^{2β(t−T)})²
  = −(1/2) q x(t)² · (2β)²/(β − a)² · e^{2β(t−T)} / (1 − η e^{2β(t−T)})².    (39)
As for the terminal-time utility function, the right-hand side of equation (34) can be derived as:

l(x*(T), u*(T)) = (1/2) q x*(T)² + (1/2) r u*(T)²
  = (1/2) q x*(T)²     [since P(T) = 0 implies u*(T) = 0]
  = (1/2) q x(t)² [ (1 − η e^{2β(T−T)}) / (1 − η e^{2β(t−T)}) ]² e^{−2β(T−t)}
  = (1/2) q x(t)² e^{−2β(T−t)} [ (1 − η) / (1 − η e^{2β(t−T)}) ]²
  = (1/2) q x(t)² (1 − η)² e^{2β(t−T)} / (1 − η e^{2β(t−T)})²
  = (1/2) q x(t)² [ 2β/(β − a) ]² e^{2β(t−T)} / (1 − η e^{2β(t−T)})².    (40)
Comparing (39) and (40), we see that for the LQ problem, the partial derivative of the optimal value function with respect to time equals the negative of the terminal-time utility function:

∂V*(x(t), t)/∂t = −l(x*(T), u*(T)),   t ∈ [0, T].    (41)

Moreover, we even have ∂V*(x(τ), τ)/∂τ = −l(x*(T), u*(T)) for τ ∈ [t, T].
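The identity (41) can also be confirmed numerically from the closed forms: Ṗ(t) from (38) on the left, and the terminal state of trajectory (37) on the right, with u*(T) = 0 because P(T) = 0. The numerical constants and test times are illustrative assumptions:

```python
import math

a, b, q, r, T = -1.0, 1.0, 2.0, 1.0, 2.0      # illustrative constants
beta = math.sqrt(q * b**2 / r + a**2)
eta = -(a + beta) / (beta - a)

def P_dot(t):
    """Closed-form derivative (38)."""
    s = math.exp(2 * beta * (t - T))
    return (2 * beta)**2 * (r * eta / b**2) * s / (1 - eta * s)**2

def x_T(x_t, t):
    """Optimal terminal state: trajectory (37) evaluated at tau = T."""
    s = math.exp(2 * beta * (t - T))
    return x_t * (1 - eta) / (1 - eta * s) * math.exp(-beta * (T - t))

x = 1.3
checks = []
for t in (0.0, 0.7, 1.5):
    lhs = 0.5 * P_dot(t) * x**2               # dV*/dt, per (32) and (38)
    rhs = -0.5 * q * x_T(x, t)**2             # -l(x*(T), u*(T)), with u*(T) = 0
    checks.append((lhs, rhs))
# each (lhs, rhs) pair agrees to machine precision, confirming (41)
```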
V. CONCLUSION
Different from the HJB equation in the infinite-horizon case, the finite-horizon HJB equation contains an additional partial derivative with respect to time, which is an intractable unknown term. This paper provides an alternative viewpoint that derives this partial derivative of the value function analytically. It is found that the partial derivative exactly equals the negative of the terminal-time utility function, which allows reusing traditional ADP algorithms to approximate the solution of the finite-horizon HJB equation for continuous-time systems. Two proof methods have been given: the initial-time equivalence of two OCPs, and the definition of the partial derivative. The correctness of the finding is verified by analyzing a linear quadratic problem.
VI. ACKNOWLEDGMENT
This work is supported by the International Science & Technology Cooperation Program of China under 2019YFE0100200 and by Beijing NSF under JQ18010. Special thanks are given to TOYOTA for their support of this study. Ziyu Lin and Jingliang Duan contributed equally to this work. Corresponding author: S. Li (lisb04@gmail.com).
REFERENCES
[1] S. Li, K. Li, R. Rajamani, and J. Wang, “Model predic-
tive multi-objective vehicular adaptive cruise control,” IEEE
Transactions on Control Systems Technology, vol. 19, no. 3,
pp. 556–566, 2010.
[2] D. P. Bertsekas and J. N. Tsitsiklis, “Neuro-dynamic pro-
gramming: an overview,” in Proceedings of 1995 34th IEEE
Conference on Decision and Control, vol. 1. IEEE, 1995,
pp. 560–564.
[3] T. Pappas, A. Laub, and N. Sandell, "On the numerical solution of the discrete-time algebraic Riccati equation," IEEE Transactions on Automatic Control, vol. 25, no. 4, pp. 631–641, 1980.
[4] F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal control.
John Wiley & Sons, 2012.
[5] R. S. Sutton and A. G. Barto, Reinforcement learning: An
introduction. MIT press, 2018.
[6] P. Werbos, “Approximate dynamic programming for realtime
control and neural modelling,” Handbook of intelligent con-
trol: neural, fuzzy and adaptive approaches, pp. 493–525,
1992.
[7] P. J. Werbos, “Building and understanding adaptive systems:
A statistical/numerical approach to factory automation and
brain research,” IEEE Transactions on Systems, Man, and
Cybernetics, vol. 17, no. 1, pp. 7–20, 1987.
[8] W. B. Powell, Approximate Dynamic Programming: Solving
the curses of dimensionality. John Wiley & Sons, 2007, vol.
703.
[9] S. Li, “Reinforcement learning and control,” Lecture notes in
Tsinghua University, 2020.
[10] R. A. Howard, Dynamic Programming and Markov Processes. MIT Press, 1960.
[11] J. Duan, S. E. Li, Z. Liu, M. Bujarbaruah, and B. Cheng,
“Generalized policy iteration for optimal control in continu-
ous time,” arXiv preprint arXiv:1909.05402, 2019.
[12] M. Abu-Khalaf and F. L. Lewis, "Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach," Automatica, vol. 41, no. 5, pp. 779–791, 2005.
[13] T. Dierks and S. Jagannathan, “Optimal control of affine
nonlinear continuous-time systems,” in Proceedings of the
2010 American Control Conference. IEEE, 2010, pp. 1568–
1573.
[14] K. G. Vamvoudakis and F. L. Lewis, “Online actor–critic al-
gorithm to solve the continuous-time infinite horizon optimal
control problem,” Automatica, vol. 46, no. 5, pp. 878–888,
2010.
[15] T. Cheng, F. L. Lewis, and M. Abu-Khalaf, “A neural network
solution for fixed-final time optimal control of nonlinear
systems,” Automatica, vol. 43, no. 3, pp. 482–490, 2007.
[16] Z. Zhao, Y. Yang, H. Li, and D. Liu, “Approximate finite-
horizon optimal control with policy iteration,” in Proceedings
of the 33rd Chinese Control Conference. IEEE, 2014, pp.
8895–8900.
[17] D. Adhyaru, I. Kar, and M. Gopal, "Fixed final time optimal control approach for bounded robust controller design using Hamilton–Jacobi–Bellman solution," IET Control Theory & Applications, vol. 3, no. 9, pp. 1183–1195, 2009.
[18] T. Cheng, F. L. Lewis, and M. Abu-Khalaf, "Fixed-final-time-constrained optimal control of nonlinear systems using neural network HJB approach," IEEE Transactions on Neural Networks, vol. 18, no. 6, pp. 1725–1737, 2007.
[19] A. Heydari and S. N. Balakrishnan, “Finite-horizon control-
constrained nonlinear optimal control using single network
adaptive critics,” IEEE Transactions on Neural Networks and
Learning Systems, vol. 24, no. 1, pp. 145–157, 2012.
[20] Z. Lin, J. Duan, S. E. Li, H. Ma, and Y. Yin, “Continuous-time
finite-horizon adp for automated vehicle controller design
with high efficiency,” arXiv preprint arXiv:2007.02070, 2020.