Solving finite-horizon HJB for optimal control of continuous-time systems
Ziyu Lin
School of Vehicle and Mobility
Tsinghua University
Beijing, China
linzy17@mails.tsinghua.edu.cn
Jie Li
School of Vehicle and Mobility
Tsinghua University
Beijing, China
jie-li18@mails.tsinghua.edu.cn
Jingliang Duan
School of Vehicle and Mobility
Tsinghua University
Beijing, China
djl15@mails.tsinghua.edu.cn
Haitong Ma
School of Vehicle and Mobility
Tsinghua University
Beijing, China
maht19@mails.tsinghua.edu.cn
Shengbo Eben Li*
School of Vehicle and Mobility
Tsinghua University
Beijing, China
lishbo@tsinghua.edu.cn
Jianyu Chen
Institute for Interdisciplinary Information Sciences
Tsinghua University
Beijing, China
jianyuchen@berkeley.edu
Abstract—Continuous-time finite-horizon optimal control is of significant importance, since most physical systems are defined in continuous time and most practical applications consider a finite horizon. However, unlike the infinite-horizon HJB equation, the finite-horizon HJB equation contains an additional partial derivative with respect to time, which is an intractable unknown term. This paper shows that this partial derivative exactly equals the negative of the terminal-time utility function, which significantly simplifies the HJB equation and thus allows efficient numerical solutions to it. We provide two methods to prove our finding: the first analyzes the initial-time equivalence between a fixed-horizon optimal control problem and a fixed-terminal-time optimal control problem, and the second works directly from the definition of the partial derivative. Based on this finding, we can reuse traditional approximate dynamic programming (ADP) algorithms to approximate the solution of the finite-horizon HJB equation for continuous-time systems. We evaluate the correctness of our finding by analyzing a linear quadratic problem.
Keywords—finite-horizon HJB equation, approximate dynamic programming, continuous-time optimal control.
I. INTRODUCTION
Continuous-time (CT) finite-horizon control systems preserve important merits in theoretical analysis and controller synthesis. Most physical systems are defined in continuous time. Meanwhile, the fixed-final-time optimal control problem (OCP) is common in practical applications. In vehicle-tracking control, for example, the desired tracking trajectory can only be given over a finite horizon due to the time-varying environment and the limited capability of road perception [1]. In other words, it may be impossible to define an infinite-horizon OCP in most real-world applications.
Dynamic programming (DP) is the fundamental tool for solving the continuous-time (CT) finite-horizon OCP [2]. It employs the principle of Bellman optimality to derive the Hamilton-Jacobi-Bellman (HJB) equation, which is a necessary and sufficient condition for the CT optimal control policy. For the continuous-time linear quadratic control problem, the optimal solution satisfies the Riccati equation, a special case of the HJB equation [3]. In the general case, however, the HJB equation is a nonlinear partial differential equation (PDE) in the value function, which is usually infeasible to solve analytically [4]. To make things worse, unlike the infinite-horizon HJB equation, the finite-horizon HJB equation contains a time-dependent value function, whose partial derivative with respect to time is an intractable unknown term.
Recent decades have witnessed numerous efforts to solve the HJB equation for nonlinear systems [5]. An algorithm using the HJB equation to approximate the control policy was first proposed by Werbos [6], [7]. It was termed Approximate Dynamic Programming (ADP), whose synonyms include adaptive DP [8], neuro-DP [8] and reinforcement learning (RL) [9]. Its basic principle is to solve the HJB equation off-line using policy iteration, which involves a two-step iteration process of policy evaluation and policy improvement [10]. Many ADP methods seeking the optimal control solution for CT systems with infinite horizon have emerged in the literature [11]. Abu-Khalaf and Lewis proposed a method for continuous input-constrained systems with a non-quadratic utility function to obtain an approximate control law [12]. T. Dierks and S. Jagannathan designed a single online critic network to solve the infinite-horizon HJB equation with small bounded error [13]. Vamvoudakis and Lewis introduced synchronous policy iteration, which simultaneously adapts both the actor neural network (NN) and the critic NN [14]. However, compared with infinite-horizon settings, most ADP methods face far greater challenges in solving the finite-horizon OCP, because the non-zero, time-varying partial derivative of the value with respect to time introduces an additional unknown term into the finite-horizon HJB equation.
To overcome this problem, some studies adopted a linear combination with time-varying weights to approximate the time-varying value function [15]-[17]. The time-varying weights can be solved by applying least squares on a predefined region in practice [18]. In other studies, time-varying activation functions are employed as the time-varying approximator [19]. Nevertheless, existing methods for the CT finite-horizon OCP suffer from two problems: a) Input-affine restriction: these methods are limited to input-affine systems because the control policy needs to be analytically expressed in terms of the value function. b) Requirement for hand-crafted basis functions: the value function or policy is represented by a linear combination of manually designed basis functions. For complex or large-scale systems, inappropriate basis functions usually result in an inaccurate value function or a sub-optimal policy.
Different from the existing methods, this paper provides an alternative viewpoint which directly derives the partial derivative of the value function with respect to time analytically. The contributions can be summarized as follows:
(1) It is found, for the first time in theory, that the partial derivative of the value with respect to time exactly equals the negative of the terminal-time utility function. Hence, the CT finite-horizon HJB equation only contains two unknown terms, so that any ADP method for the CT infinite-horizon OCP can be used to solve the finite-horizon OCP.
(2) It overcomes the following problems of existing finite-horizon ADP methods: a) Since many ADP methods do not require the policy to be analytically expressed by the value function, our finding allows reusing these methods to realize optimal control for arbitrary nonlinear systems with non-affine inputs. b) Neural networks can be employed with the system states as inputs to represent the value function or policy. In this case, no hand-crafted basis functions are needed.
The paper is organized as follows. Section II provides the formulation of the optimal control problem and introduces the general CT ADP framework. Section III presents the derivation of the CT finite-horizon HJB equation with two proof methods: analyzing the initial-time equivalence between the fixed-horizon OCP and the fixed-terminal-time OCP, and working from the definition of the partial derivative. Section IV demonstrates the correctness of our finding by analyzing linear quadratic systems. Section V concludes the paper.
II. PRELIMINARIES
A. Optimal Control Problem Formulation

Consider the general continuous-time time-invariant dynamic system given by
$$\dot{x} = f(x, u), \quad (1)$$
where $x \in \mathbb{R}^{m}$ is the state, $u \in \mathbb{R}^{n}$ is the control input, and $f(x, u)$ is the system model, which is assumed to be controllable. The value function of the CT finite-horizon OCP is
$$V^{\pi}(x(t), t) = \int_{t}^{T} l(x(\tau), u(\tau))\, d\tau, \quad (2)$$
where the time-varying $V^{\pi}(x(t), t)$ is defined as the value function corresponding to policy $\pi$, $t \in [0, T]$ is the current time, $T$ is the fixed terminal time, $\tau$ is the virtual time in the future horizon, and $l(x, u)$ is the utility function. Different from discrete-time control, the value function in the CT domain is the integral of the utility function. One cannot easily handle the derivative of an integral function, which leads to a major difference between discrete-time and continuous-time control. The policy $\pi$ maps from state $x$ to control input $u$:
$$u(\tau) = \pi(x(\tau), \tau), \quad \tau \in [t, T], \; t \in [0, T]. \quad (3)$$
The objective of the OCP is to minimize the cost functional:
$$V^{*}(x(t), t) = \min_{u} V^{\pi}(x, t) = \min_{u} \int_{t}^{T} l(x(\tau), u(\tau))\, d\tau \quad \text{s.t.}\ \dot{x} = f(x, u), \quad (4)$$
where the superscript $*$ means "optimal". The basic principle of the CT finite-horizon OCP is to seek a policy $\pi^{*}(x, t)$ that minimizes $V(x, t)$ for all $x$:
$$\pi^{*}(x, t) = \arg\min_{\pi} \{ V^{\pi}(x(t), t) \}.$$
B. Optimality Condition for the OCP

To find the optimal policy, we introduce the Hamilton function:
$$H\!\left(x, u, \frac{\partial V^{\pi}(x, t)}{\partial x}\right) = l(x, u) + \frac{\partial V^{\pi}(x, t)}{\partial x}^{\top} f(x, u).$$
The Hamiltonian satisfies the following finite-horizon self-consistency condition:
$$H\!\left(x, u, \frac{\partial V^{\pi}(x, t)}{\partial x}\right) = -\frac{\partial V^{\pi}(x, t)}{\partial t}, \quad \forall t \in [0, T]. \quad (5)$$
Similarly, the optimal solution satisfies the following finite-horizon HJB equation:
$$\min_{u}\left\{ l(x, u) + \frac{\partial V^{*}(x, t)}{\partial x}^{\top} f(x, u) \right\} = -\frac{\partial V^{*}(x, t)}{\partial t}, \quad (6)$$
with the terminal boundary condition
$$V(x^{\pi}(T), T) = 0.$$
To seek the optimal policy for CT finite-horizon systems, we only need to solve the HJB equation, which contains three unknown terms: $u^{*}(x, t)$, $\frac{\partial V^{*}(x, t)}{\partial x}$, and $\frac{\partial V^{*}(x, t)}{\partial t}$. The HJB equation is hard to solve analytically since it is a nonlinear partial differential equation.
C. Policy Iteration of Traditional ADP

ADP serves as a numerical solution to the HJB equation (6) within the policy iteration (PI) framework, which iteratively applies the following two steps:

(1) Policy Evaluation (PEV): At the $k$-th iteration, given policy $\pi^{k}$, the corresponding value function $V^{\pi^{k}}(x, t)$ is estimated by applying the self-consistency condition (5).

(2) Policy Improvement (PIM): PIM finds a better policy $\pi^{k+1}(x, t)$ by minimizing the Hamilton function with $V^{\pi^{k}}(x, t)$:
$$\pi^{k+1}(x, t) = \arg\min_{u}\left\{ H\!\left(x, u, \frac{\partial V^{\pi^{k}}(x, t)}{\partial x}\right) \right\}. \quad (7)$$
The iteration of PEV and PIM gradually converges to the HJB solution. The pseudo-code of PI is shown in Algorithm 1.
Algorithm 1 CT ADP in PI framework
Initialize with policy $\pi^{0}(x, t)$
Given an arbitrarily small positive $\epsilon$ and set $k = 0$
while $\| V^{\pi^{k+1}}(x, t) - V^{\pi^{k}}(x, t) \| \geq \epsilon$ do
    1. Solve the value function $V^{\pi^{k}}(x, t)$ for $\forall x$ using
       $$l(x, \pi^{k}(x, t)) + \frac{\partial V^{\pi^{k}}}{\partial x}^{\top} f(x, \pi^{k}(x, t)) = -\frac{\partial V^{\pi^{k}}}{\partial t} \quad (8)$$
    2. Solve the new policy $\pi^{k+1}(x)$ for $\forall x$ using (7)
end while
In the PEV process, there are two unknown partial derivative terms in (8): $\frac{\partial V^{\pi}(x,t)}{\partial t}$ and $\frac{\partial V^{\pi}(x,t)}{\partial x}$. For any choice of $\frac{\partial V^{\pi}(x,t)}{\partial t}$, we can find a correspondingly different $\frac{\partial V^{\pi}(x,t)}{\partial x}$ that makes (8) hold, and vice versa. In other words, (8) has no unique solution. However, for the infinite-horizon HJB equation we have $\frac{\partial V^{\pi}(x,t)}{\partial t} = 0$, so the PEV process can be easily solved since only $\frac{\partial V^{\pi}(x,t)}{\partial x}$ is unknown in this case. Inspired by this, if $\frac{\partial V^{\pi}(x,t)}{\partial t}$ is known, we can obtain the solution of the finite-horizon HJB equation with PI techniques, similar to existing infinite-horizon ADP methods. Therefore, it is crucial and meaningful to obtain an analytical expression of $\frac{\partial V^{\pi}(x,t)}{\partial t}$.
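To make the PI loop of Algorithm 1 concrete in the benign infinite-horizon case just mentioned (where $\partial V^{\pi}/\partial t = 0$), the sketch below instantiates it for a scalar LQ problem, for which both PEV and PIM have closed forms. This is only an illustrative sketch; the plant and cost parameters, the initial gain, and the stopping tolerance are assumptions, not values from the paper.

```python
# Minimal sketch of Algorithm 1 for the special case of an infinite-horizon
# scalar LQ problem (so that dV/dt = 0 and PEV has a closed form).
# System: dx/dt = a*x + b*u, utility l = 0.5*(q*x^2 + r*u^2),
# linear policy u = -K*x, quadratic value V(x) = 0.5*P*x^2.
# All numerical values below are illustrative assumptions, not from the paper.
import math

a, b, q, r = 1.0, 1.0, 1.0, 1.0   # assumed plant/cost parameters
K = 3.0                            # initial gain, chosen stabilizing (a - b*K < 0)
eps = 1e-10

P_prev = float("inf")
for _ in range(100):
    # PEV: solve l + dV/dx * f = 0 for the closed loop x_dot = (a - b*K)*x,
    # which gives 0.5*(q + r*K^2) + P*(a - b*K) = 0.
    P = (q + r * K**2) / (2.0 * (b * K - a))
    if abs(P - P_prev) < eps:
        break
    P_prev = P
    # PIM: minimize the Hamiltonian 0.5*(q*x^2 + r*u^2) + P*x*(a*x + b*u)
    # over u, giving u = -(b/r)*P*x, i.e. K <- (b/r)*P.
    K = b / r * P

beta = math.sqrt(a**2 + q * b**2 / r)
P_star = r * (a + beta) / b**2     # positive root of the algebraic Riccati equation
print(f"PI solution P = {P:.6f}, Riccati solution P* = {P_star:.6f}")
```

The two printed values agree, which is exactly the behavior the finite-horizon case cannot reproduce until the unknown time derivative in (8) is resolved.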
III. DERIVATION OF FINITE-HORIZON HJB EQUATION

This paper focuses on the right-hand side of (8), i.e., $\frac{\partial V^{\pi}(x,t)}{\partial t}$. If we know its value, the HJB equation can be solved easily, as mentioned before. Firstly, we put forward our conclusion as Theorem 1.

Then, by replacing $\frac{\partial V^{\pi}(x,t)}{\partial t}$ with $-l(x^{\pi}(T), u(T))$, the finite-horizon HJB equation can be rewritten as
$$\min_{u}\left\{ l(x, u) + \frac{\partial V^{*}(x, t)}{\partial x}^{\top} f(x, u) \right\} = l(x^{*}(T), u^{*}(T)), \quad \forall x,\ t \in [0, T]. \quad (9)$$
The above HJB equation can be solved by reusing the traditional CT ADP method with the PI technique, which is very similar to existing infinite-horizon ADP. Similarly, its corresponding self-consistency condition is
$$l(x(t), u(t)) + \frac{\partial V^{\pi}(x(t), t)}{\partial x}^{\top} f(x, u) = l(x^{\pi}(T), u(T)), \quad \forall x,\ t \in [0, T]. \quad (10)$$
Theorem 1. For the value function defined in (4), its partial derivative with respect to time $t$ exactly equals the negative of the terminal-time utility function, i.e.,
$$\frac{\partial V^{\pi}(x, t)}{\partial t} = -l(x^{\pi}(T), u(T)) = -l\big(x^{\pi}(T), \pi(x^{\pi}(T), T)\big), \quad (11)$$
where $x^{\pi}(T)$ is the terminal state, obtained by rolling out the model $f(x, u)$ with policy $\pi$ from the initial state $x(t)$.
Proof: We now provide two approaches to prove it: (1) the first uses the connection between two finite-time control problems, one with a fixed terminal time and one with a fixed horizon; (2) the second derives the partial derivative directly from its definition.
A. Method 1: Initial-time equivalence of two OCPs

Firstly, we define a fixed-predictive-horizon OCP that is equivalent to the fixed-terminal-time OCP (4). It is called the assistant problem, denoted as OCP$_A$. The cost function of OCP$_A$ is
$$V_{A}^{\pi_{A}}(x(t), c) = \int_{t}^{T+t-c} l(x(\tau), u_{A}(\tau))\, d\tau, \quad (12)$$
where $c \in [0, T]$ and $T$ is a fixed constant. The upper limit of the integral is $T + t - c$, so the predictive horizon of OCP$_A$ is a fixed value $T - c$, while the predictive horizon $T - t$ of OCP (4) varies with $t$. In fact, the value function $V$ depends on the length of the predictive horizon. Therefore, $V_{A}^{\pi_{A}}(x(t), c)$ is an explicit function of $c$ instead of $t$. $\pi_{A}$ is the policy used in OCP$_A$, and it maps from state $x$ to control input $u_{A}$:
$$u_{A}(\tau) = \pi_{A}(x(\tau), c, \tau), \quad \tau \in [t, T+t-c], \; t, c \in [0, T]. \quad (13)$$
The objective of OCP$_A$ is to minimize $V_{A}^{\pi_{A}}$:
$$V_{A}^{*}(x(t), c) = \min_{u_{A}} V_{A}^{\pi_{A}}(x(t), c) = \min_{u_{A}} \int_{t}^{T+t-c} l(x(\tau), u_{A}(\tau))\, d\tau. \quad (14)$$
The two problems have equal optimal value functions when $t = c$:
$$V^{*}(x, t) = V_{A}^{*}(x(t), c) = \min_{u} \int_{c}^{T} l(x(\tau), u(\tau))\, d\tau, \quad t = c, \; \forall x. \quad (15)$$
Hence, it is obvious that the two OCPs share the same optimal control policy in this case:
$$\pi^{*}(x, t) = \pi_{A}^{*}(x, c, t), \quad t = c. \quad (16)$$
In light of (16), we assume that the same policy is applied to both problems:
$$u(\tau) = \pi(x(\tau), \tau) = u_{A}(\tau) = \pi_{A}(x(\tau), c, \tau), \quad t = c, \; \forall x. \quad (17)$$
Furthermore, it follows that
$$V^{\pi}(x(t), t) = \int_{c}^{T} l(x(\tau), u(\tau))\, d\tau = \int_{c}^{T} l(x(\tau), u_{A}(\tau))\, d\tau = V_{A}^{\pi_{A}}(x(t), c), \quad t = c. \quad (18)$$
Based on (18), we have
$$\frac{\partial V^{\pi}(x, t)}{\partial x} = \frac{\partial V_{A}^{\pi_{A}}(x, c)}{\partial x}, \quad t = c. \quad (19)$$
Different from $\frac{\partial V^{\pi}(x,t)}{\partial t}$,
$$\frac{\partial V_{A}^{\pi_{A}}(x(t), c)}{\partial t} = 0, \quad (20)$$
because $V_{A}^{\pi_{A}}(x(t), c)$ is a function of $c$ instead of $t$. Furthermore, taking the derivative of (2) and (12) with respect to time $t$, one has
$$\frac{d V^{\pi}(x(t), t)}{dt} = \frac{\partial V^{\pi}(x, t)}{\partial x}^{\top} f(x, u) + \frac{\partial V^{\pi}(x, t)}{\partial t} = -l(x(t), u(t)), \quad (21)$$
$$\frac{d V_{A}^{\pi_{A}}(x(t), c)}{dt} = \frac{\partial V_{A}^{\pi_{A}}(x, c)}{\partial x}^{\top} f(x, u) = l\big(x^{\pi_{A}}(T+t-c), u_{A}(T+t-c)\big) - l(x(t), u_{A}(t)). \quad (22)$$
According to (17) and (19)-(22), we know that
$$\begin{aligned}
\left.\frac{\partial V^{\pi}(x, t)}{\partial t}\right|_{t=c} &= \frac{d V^{\pi}(x(t), t)}{dt} - \frac{d V_{A}^{\pi_{A}}(x(t), c)}{dt} \\
&= -l\big(x^{\pi_{A}}(T+t-c), u_{A}(T+t-c)\big) \\
&= -l\big(x^{\pi_{A}}(T), u_{A}(T)\big) \\
&= -l\big(x^{\pi}(T), u(T)\big). \quad (23)
\end{aligned}$$
Since $c$ can take any value in $[0, T]$, this can be simplified as
$$\frac{\partial V^{\pi}(x, t)}{\partial t} = -l(x^{\pi}(T), u(T)), \quad t \in [0, T]. \quad (24)$$
This completes proof method 1.
B. Method 2: The definition of the partial derivative

Here, we provide another way to prove Theorem 1, which uses the definitions of both the total derivative and the partial derivative. The value function of the original optimal control problem is given in (2).

Let us first look at the total derivative of the value with respect to time:
$$\begin{aligned}
\frac{d V^{\pi}(x, t)}{dt} &= \frac{V^{\pi}(x+dx, t+dt) - V^{\pi}(x, t)}{dt} \\
&= \frac{\int_{t+dt}^{T} l(x(\tau), u(\tau))\, d\tau - \int_{t}^{T} l(x(\tau), u(\tau))\, d\tau}{dt} \\
&= \frac{-\int_{t}^{t+dt} l(x(\tau), u(\tau))\, d\tau}{dt} \\
&= \frac{-l(x(t), u(t))\, dt}{dt} = -l(x(t), u(t)). \quad (25)
\end{aligned}$$
We compare it with the partial derivative. By mathematical definition, a partial derivative of a function of multiple variables is its derivative with respect to one of those variables, with the others held constant (as opposed to the total derivative, in which all variables are allowed to vary). For example, the partial derivative of $f(x_1, \dots, x_i, \dots, x_n)$ in the direction of $x_i$ is
$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \dots, x_i + h, \dots, x_n) - f(x_1, \dots, x_i, \dots, x_n)}{h}. \quad (26)$$
When we take the partial derivative with respect to $t$, the state $x(t)$ and control $u(t)$ are the other variables, whose trajectories are held constant. Hence, $V^{\pi}(x, t+dt)$ can be derived as $\int_{t+dt}^{T} l(x(\tau - dt), u(\tau - dt))\, d\tau$. Then, the partial derivative of the value with respect to the variable $t$ is
$$\begin{aligned}
\frac{\partial V^{\pi}(x, t)}{\partial t} &= \frac{V^{\pi}(x, t+dt) - V^{\pi}(x, t)}{dt} \\
&= \frac{\int_{t+dt}^{T} l(x(\tau - dt), u(\tau - dt))\, d\tau - \int_{t}^{T} l(x(\tau), u(\tau))\, d\tau}{dt} \\
&= \frac{\int_{t}^{T-dt} l(x(\tau), u(\tau))\, d\tau - \int_{t}^{T} l(x(\tau), u(\tau))\, d\tau}{dt} \\
&= \frac{-\int_{T-dt}^{T} l(x(\tau), u(\tau))\, d\tau}{dt} \\
&= \frac{-l(x^{\pi}(T), u(T))\, dt}{dt} \\
&= -l(x^{\pi}(T), u(T)), \quad t \in [0, T]. \quad (27)
\end{aligned}$$
An illustration of this partial derivative is shown in Fig. 1.

Figure 1: Partial derivative

From (24) and (27), it is clear that the two derivation methods reach the same conclusion. By substituting (24) into the original OCP's self-consistency condition (5), we obtain (10). If we select the optimal policy $\pi^{*}$ and substitute (24) into (6), we obtain the finite-horizon HJB equation (9).
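A quick numerical illustration of Theorem 1 under the method-2 interpretation is sketched below: rolling a fixed policy out from a state, shifting the current time by a small amount only removes the tail slice of the integral in (2), so the finite-difference quotient approaches the negative terminal utility. The dynamics, policy, utility, and step sizes here are illustrative assumptions.

```python
# Minimal numerical check of Theorem 1 (sketch). Holding the rolled-out
# trajectory fixed, shifting the current time by dtau only removes the last
# slice of the integral, so (V(x, t+dtau) - V(x, t)) / dtau ~= -l(x(T), u(T)).
# The system, policy and utility below are illustrative assumptions.
import numpy as np

def f(x, u):                 # assumed nonlinear, non-affine dynamics
    return -x**3 + np.tanh(u)

def policy(x):               # assumed fixed (suboptimal) policy
    return -2.0 * x

def utility(x, u):           # running cost l(x, u)
    return x**2 + 0.5 * u**2

def rollout_value(x0, horizon, dt=1e-4):
    """Euler rollout of the policy; returns V = integral of l, and l at the end."""
    n = int(round(horizon / dt))
    x, V = x0, 0.0
    for _ in range(n):
        u = policy(x)
        V += utility(x, u) * dt
        x = x + f(x, u) * dt
    return V, utility(x, policy(x))

x0, T, t = 1.5, 2.0, 0.5
dtau = 1e-3
V1, l_T = rollout_value(x0, T - t)           # V(x, t)
V2, _ = rollout_value(x0, T - (t + dtau))    # V(x, t + dtau): same trajectory, truncated
dVdt = (V2 - V1) / dtau
print(f"finite-difference dV/dt = {dVdt:.5f}, -l(x(T), u(T)) = {-l_T:.5f}")
```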
Remark 1. Since $\frac{\partial V^{\pi}(x,t)}{\partial t} = -l(x^{\pi}(T), u(T))$, only $\frac{\partial V^{\pi}(x,t)}{\partial x}$ is unknown in (8). In this case, the only difference is that for infinite-horizon ADP, $\frac{\partial V^{\pi}(x,t)}{\partial t} = 0$, while for finite-horizon ADP, $\frac{\partial V^{\pi}(x,t)}{\partial t} = -l(x^{\pi}(T), u(T))$. Hence, we can easily obtain the solution of the CT finite-horizon OCP by using PI techniques, similar to existing infinite-horizon ADP methods.

Following Remark 1, the pseudo-code of CT finite-horizon ADP is shown in Algorithm 2, and the framework is shown in Fig. 2.
Algorithm 2 CT Finite-horizon ADP in PI framework
Initialize with policy $\pi^{0}(x, t)$
Given an arbitrarily small positive $\epsilon$ and set $k = 0$
while $\| V^{\pi^{k+1}}(x, t) - V^{\pi^{k}}(x, t) \| \geq \epsilon$ do
    Roll out from $x_t$ with policy $\pi^{k}$
    Receive and store $x^{\pi^{k}}(T)$
    1. Solve the value function $V^{\pi^{k}}(x, t)$ for $\forall x$ using
       $$l(x, \pi^{k}(x, t)) + \frac{\partial V^{\pi^{k}}}{\partial x}^{\top} f(x, \pi^{k}(x, t)) = l\big(x^{\pi^{k}}(T), u(T)\big) \quad (28)$$
    2. Solve the new policy $\pi^{k+1}(x)$ for $\forall x$ using (7)
end while
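To show how Algorithm 2 could be realized without hand-crafted basis functions, the following is a minimal actor-critic sketch that minimizes the residual of (28) in the PEV step and the Hamiltonian in the PIM step. Everything concrete in it (the dynamics $f$, the utility $l$, the network sizes, sampling ranges, learning rates, and the Euler rollout used to obtain the terminal utility) is an illustrative assumption rather than a specification from the paper.

```python
# Sketch of Algorithm 2 with neural-network approximators (assumptions throughout).
import torch
import torch.nn as nn

T, dt = 2.0, 0.01                      # fixed terminal time and rollout step

def f(x, u):                           # assumed non-affine dynamics x_dot = f(x, u)
    return -x**3 + torch.tanh(u)

def l(x, u):                           # utility function l(x, u)
    return (x**2 + 0.1 * u**2).sum(dim=-1, keepdim=True)

critic = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))   # V(x, t)
actor = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))    # pi(x, t)
opt_v = torch.optim.Adam(critic.parameters(), lr=1e-3)
opt_pi = torch.optim.Adam(actor.parameters(), lr=1e-3)

def terminal_utility(x, t):
    """Roll the model out to time T under the current actor; return l at T."""
    with torch.no_grad():
        steps = ((T - t) / dt).long().clamp(min=0)
        x_T = x.clone()
        for i in range(int(steps.max())):
            tau = t + i * dt
            active = (i < steps).float()          # stop each sample at its own horizon
            u = actor(torch.cat([x_T, tau], dim=-1))
            x_T = x_T + active * f(x_T, u) * dt
        u_T = actor(torch.cat([x_T, torch.full_like(t, T)], dim=-1))
        return l(x_T, u_T)

for _ in range(500):
    x = torch.empty(256, 1).uniform_(-2.0, 2.0)   # sampled states
    t = torch.empty(256, 1).uniform_(0.0, T)      # sampled current times
    l_T = terminal_utility(x, t)                  # right-hand side of (28)

    # Policy evaluation: minimize the squared residual of (28).
    x.requires_grad_(True)
    V = critic(torch.cat([x, t], dim=-1))
    dVdx = torch.autograd.grad(V.sum(), x, create_graph=True)[0]
    u = actor(torch.cat([x, t], dim=-1)).detach()
    residual = l(x, u) + (dVdx * f(x, u)).sum(dim=-1, keepdim=True) - l_T
    loss_v = (residual**2).mean()
    opt_v.zero_grad(); loss_v.backward(); opt_v.step()

    # Policy improvement: minimize the Hamiltonian with respect to the actor.
    u_new = actor(torch.cat([x, t], dim=-1))
    dVdx_c = dVdx.detach()                        # gradient from before the critic update
    loss_pi = (l(x, u_new) + (dVdx_c * f(x, u_new)).sum(dim=-1, keepdim=True)).mean()
    opt_pi.zero_grad(); loss_pi.backward(); opt_pi.step()
```

Because the policy here is a generic network of the state and time rather than an analytical function of the value gradient, nothing in the sketch requires the dynamics to be input-affine.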
Compared with existing methods for solving the CT finite-horizon HJB equation, our method does not need hand-crafted basis functions and can be used for arbitrary nonlinear systems with non-affine inputs. We have applied the corresponding CT finite-horizon ADP algorithm to simulations of automated vehicle control for the path-tracking maneuver. The results showed that the proposed ADP method can obtain a near-optimal policy with less than 1% error for linear vehicle dynamics, and that it is about 500 times faster than the nonlinear MPC solver ipopt for nonlinear vehicle dynamics [20].

Figure 2: The framework of the CT finite-horizon ADP algorithm
IV. ANALYTICAL VERIFICATION

To verify Theorem 1, we compare our result with the analytical solution of a linear quadratic (LQ) optimal control problem, which is formulated as
$$V^{\pi}(x, t) = \frac{1}{2}\int_{t}^{T}\left( q x^{2} + r u^{2} \right) d\tau \quad \text{s.t.}\ \dot{x}(t) = a x(t) + b u(t), \quad (29)$$
where $(a, b)$ is stabilizable, $q > 0$, and $r > 0$.

For the LQ problem, we can obtain the analytical optimal solution of both the value $V^{*}$ and the control input $u^{*}$ from optimal control theory:
$$V^{*}(x, t) = \frac{1}{2} x^{\top} P(t) x, \qquad u^{*}(x, t) = -\frac{1}{r} b P(t) x, \quad (30)$$
where $P(t)$ can be obtained by solving the Riccati differential equation associated with problem (29):
$$P(t) = \frac{r}{b^{2}} \cdot \frac{(\beta + a) + \eta(\beta - a)\, e^{2\beta(t-T)}}{1 - \eta\, e^{2\beta(t-T)}}, \quad (31)$$
where $\beta = \sqrt{q b^{2}/r + a^{2}}$, $\eta = (a+\beta)/(a-\beta)$, and the boundary condition is $P(T) = 0$.
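As a quick numerical sanity check of (31), the closed form can be compared against direct integration of the Riccati differential equation $\dot{P} = -q - 2aP + (b^{2}/r)P^{2}$ with $P(T) = 0$, written below in the time-to-go variable $s = T - t$; the parameter values are illustrative assumptions.

```python
# Check the closed-form P(t) in (31) against RK4 integration of the Riccati ODE.
# In time-to-go s = T - t the ODE reads dP/ds = q + 2*a*P - (b^2/r)*P^2, P(0) = 0.
# Parameter values are illustrative assumptions.
import math

a, b, q, r, T, t = -0.5, 1.0, 2.0, 1.0, 3.0, 1.0

beta = math.sqrt(q * b**2 / r + a**2)
eta = (a + beta) / (a - beta)

def P_closed_form(time):
    E = math.exp(2.0 * beta * (time - T))
    return (r / b**2) * ((beta + a) + eta * (beta - a) * E) / (1.0 - eta * E)

def P_rk4(time, n_steps=20000):
    def g(P):                          # dP/ds in time-to-go
        return q + 2.0 * a * P - (b**2 / r) * P**2
    P = 0.0
    h = (T - time) / n_steps
    for _ in range(n_steps):
        k1 = g(P); k2 = g(P + 0.5 * h * k1)
        k3 = g(P + 0.5 * h * k2); k4 = g(P + h * k3)
        P += h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return P

print(f"closed form P(t) = {P_closed_form(t):.8f}, integrated P(t) = {P_rk4(t):.8f}")
```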
Theorem 1 states a conclusion that holds for a general policy $\pi$. Here, for the LQ problem, we verify the special case of the optimal policy $\pi^{*}$. The left side of equation (11), which is the partial derivative of the LQ value function with respect to time, can be derived as
$$\frac{\partial V^{*}(x(t), t)}{\partial t} = \frac{1}{2}\dot{P}(t)\, x(t)^{2}. \quad (32)$$
Meanwhile, the right side of equation (11), the negative of the terminal-time utility function, can be denoted as
$$-l(x^{*}(T), u^{*}(T)) = -\frac{1}{2}\left( q\, x^{*}(T)^{2} + r\, u^{*}(T)^{2} \right). \quad (33)$$
Hence, equation (11) turns into the following equation:
$$\frac{1}{2}\dot{P}(t)\, x(t)^{2} = -\frac{1}{2}\left( q\, x^{*}(T)^{2} + r\, u^{*}(T)^{2} \right). \quad (34)$$
To verify whether Theorem 1 holds, we further expand equation (34) with the analytical solution. The optimal terminal state $x^{*}(T)$ can be obtained by analyzing the following state trajectory:
$$x^{*}(\tau) = x(t)\, e^{\int_{t}^{\tau}\left(a - \frac{b^{2}}{r}P(s)\right) ds}, \quad (35)$$
where $\tau \in [t, T]$, and the integral term can be derived further:
$$\begin{aligned}
\int_{t}^{\tau}\left(a - \frac{b^{2}}{r}P(s)\right) ds
&= \int_{t}^{\tau}\left(a - \frac{b^{2}}{r}\cdot\frac{r}{b^{2}}\cdot\frac{(\beta + a) + \eta(\beta - a)e^{2\beta(s-T)}}{1 - \eta e^{2\beta(s-T)}}\right) ds \\
&= \int_{t}^{\tau}\left(\beta - \frac{2\beta}{1 - \eta e^{2\beta(s-T)}}\right) ds \\
&= \beta(\tau - t) - 2\beta\int_{t}^{\tau}\frac{1}{1 - \eta e^{2\beta(s-T)}}\, ds \\
&= \beta(\tau - t) - 2\beta\left[(\tau - t) - \frac{1}{2\beta}\ln\frac{1 - \eta e^{2\beta(\tau-T)}}{1 - \eta e^{2\beta(t-T)}}\right] \\
&= -\beta(\tau - t) + \ln\frac{1 - \eta e^{2\beta(\tau-T)}}{1 - \eta e^{2\beta(t-T)}}. \quad (36)
\end{aligned}$$
Hence, the optimal state trajectory is
$$x^{*}(\tau) = x(t)\, e^{-\beta(\tau-t) + \ln\frac{1 - \eta e^{2\beta(\tau-T)}}{1 - \eta e^{2\beta(t-T)}}} = x(t)\,\frac{1 - \eta e^{2\beta(\tau-T)}}{1 - \eta e^{2\beta(t-T)}}\, e^{-\beta(\tau-t)}, \quad \tau \in [t, T]. \quad (37)$$
In this case, $x^{*}(T)$ can be obtained by letting $\tau = T$. The term $\dot{P}(t)$ on the left side of equation (34) can be obtained by differentiating $P(t)$:
$$\dot{P}(t) = \frac{(2\beta)^{2} r}{b^{2}}\cdot\frac{\eta\, e^{2\beta(t-T)}}{\left(1 - \eta e^{2\beta(t-T)}\right)^{2}}, \quad (38)$$
where $\beta = \sqrt{q b^{2}/r + a^{2}}$ and $\eta = (a+\beta)/(a-\beta)$, as defined in (31).
The partial derivative of the value function with respect to time, i.e., the left side of equation (34), can be derived as
$$\begin{aligned}
\frac{\partial V^{*}(x(t), t)}{\partial t} &= \frac{1}{2}\dot{P}(t)\, x(t)^{2} \\
&= \frac{1}{2}(2\beta)^{2}\frac{r}{b^{2}}\cdot\frac{a+\beta}{a-\beta}\, x(t)^{2}\, \frac{e^{2\beta(t-T)}}{\left(1-\eta e^{2\beta(t-T)}\right)^{2}} \\
&= \frac{1}{2}(2\beta)^{2}\frac{r}{b^{2}}\cdot\frac{a^{2}-\beta^{2}}{(a-\beta)^{2}}\, x(t)^{2}\, \frac{e^{2\beta(t-T)}}{\left(1-\eta e^{2\beta(t-T)}\right)^{2}} \\
&= \frac{1}{2}\frac{r}{b^{2}}\left(a^{2} - \frac{q b^{2}}{r} - a^{2}\right)\frac{(2\beta)^{2}}{(a-\beta)^{2}}\, x(t)^{2}\, \frac{e^{2\beta(t-T)}}{\left(1-\eta e^{2\beta(t-T)}\right)^{2}} \\
&= -\frac{1}{2} q\, x(t)^{2}\, \frac{(2\beta)^{2}}{(a-\beta)^{2}}\cdot\frac{e^{2\beta(t-T)}}{\left(1-\eta e^{2\beta(t-T)}\right)^{2}}. \quad (39)
\end{aligned}$$
As for the negative of the terminal-time utility function, the right side of equation (34) can be derived as
$$\begin{aligned}
-l(x^{*}(T), u^{*}(T)) &= -\frac{1}{2} q\, x^{*}(T)^{2} - \frac{1}{2} r\, u^{*}(T)^{2} \\
&= -\frac{1}{2} q\, x^{*}(T)^{2} \\
&= -\frac{1}{2} q\, x(t)^{2}\left(\frac{1-\eta e^{2\beta(T-T)}}{1-\eta e^{2\beta(t-T)}}\right)^{2} e^{-2\beta(T-t)} \\
&= -\frac{1}{2} q\, x(t)^{2}\, e^{2\beta(t-T)}\left(\frac{1-\eta}{1-\eta e^{2\beta(t-T)}}\right)^{2} \\
&= -\frac{1}{2} q\, x(t)^{2}\, \frac{(1-\eta)^{2}\, e^{2\beta(t-T)}}{\left(1-\eta e^{2\beta(t-T)}\right)^{2}} \\
&= -\frac{1}{2} q\, x(t)^{2}\left(\frac{2\beta}{a-\beta}\right)^{2}\frac{e^{2\beta(t-T)}}{\left(1-\eta e^{2\beta(t-T)}\right)^{2}}, \quad (40)
\end{aligned}$$
where the second equality uses $u^{*}(T) = -\frac{1}{r}b P(T) x^{*}(T) = 0$ because $P(T) = 0$.
It can now be seen that for the LQ problem, the partial derivative of the optimal value function with respect to time equals the negative of the terminal-time utility function:
$$\frac{\partial V^{*}(x(t), t)}{\partial t} = -l(x^{*}(T), u^{*}(T)), \quad t \in [0, T]. \quad (41)$$
In fact, we can even obtain $\frac{\partial V^{*}(x(\tau), \tau)}{\partial \tau} = -l(x^{*}(T), u^{*}(T))$ for all $\tau \in [t, T]$.
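The closed-form expressions (39) and (40) can also be compared numerically; the parameter values below are illustrative assumptions.

```python
# Numerically compare the two sides of (41) using the closed-form LQ solution:
# left side 0.5 * dP/dt * x(t)^2 from (38), right side -l(x*(T), u*(T))
# with x*(T) from (37) and u*(T) = -(b/r)*P(T)*x*(T) = 0.
# Parameter values are illustrative assumptions.
import math

a, b, q, r, T = -0.5, 1.0, 2.0, 1.0, 3.0
t, x_t = 1.0, 1.7

beta = math.sqrt(q * b**2 / r + a**2)
eta = (a + beta) / (a - beta)
E = math.exp(2.0 * beta * (t - T))

# Left side of (34)/(41): 0.5 * Pdot(t) * x(t)^2, with Pdot from (38).
P_dot = (2.0 * beta)**2 * (r / b**2) * eta * E / (1.0 - eta * E)**2
lhs = 0.5 * P_dot * x_t**2

# Right side: -l(x*(T), u*(T)) with x*(T) from (37) and u*(T) = 0.
x_T = x_t * (1.0 - eta) / (1.0 - eta * E) * math.exp(-beta * (T - t))
rhs = -0.5 * q * x_T**2

print(f"0.5*Pdot(t)*x(t)^2 = {lhs:.8f}")
print(f"-l(x*(T), u*(T))   = {rhs:.8f}")
```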
V. CONCLUSION
Unlike the infinite-horizon HJB equation, the finite-horizon HJB equation contains an additional partial derivative with respect to time, which is an intractable unknown term. This paper provides an alternative viewpoint which directly derives the partial derivative of the value function with respect to time analytically. It is found that this partial derivative exactly equals the negative of the terminal-time utility function, which allows reusing traditional ADP algorithms to approximate the solution of the finite-horizon HJB equation for continuous-time systems. Two proof methods have been given: the initial-time equivalence of two OCPs and the definition of the partial derivative. The correctness of our finding is verified by analyzing a linear quadratic problem.
VI. ACKNOWLEDGMENT
This work is supported by the International Science & Technology Cooperation Program of China under 2019YFE0100200 and by the Beijing NSF under JQ18010. Special thanks should be given to TOYOTA for their support of this study. Ziyu Lin and Jingliang Duan contributed equally to this work. Corresponding author: S. Li (lisb04@gmail.com).
REFERENCES
[1] S. Li, K. Li, R. Rajamani, and J. Wang, “Model predic-
tive multi-objective vehicular adaptive cruise control,” IEEE
Transactions on Control Systems Technology, vol. 19, no. 3,
pp. 556–566, 2010.
[2] D. P. Bertsekas and J. N. Tsitsiklis, “Neuro-dynamic pro-
gramming: an overview,” in Proceedings of 1995 34th IEEE
Conference on Decision and Control, vol. 1. IEEE, 1995,
pp. 560–564.
[3] T. Pappas, A. Laub, and N. Sandell, “On the numerical solution of the discrete-time algebraic Riccati equation,” IEEE Transactions on Automatic Control, vol. 25, no. 4, pp. 631–641, 1980.
[4] F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal control.
John Wiley & Sons, 2012.
[5] R. S. Sutton and A. G. Barto, Reinforcement learning: An
introduction. MIT press, 2018.
[6] P. Werbos, “Approximate dynamic programming for realtime control and neural modelling,” Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Approaches, pp. 493–525, 1992.
[7] P. J. Werbos, “Building and understanding adaptive systems:
A statistical/numerical approach to factory automation and
brain research,” IEEE Transactions on Systems, Man, and
Cybernetics, vol. 17, no. 1, pp. 7–20, 1987.
[8] W. B. Powell, Approximate Dynamic Programming: Solving
the curses of dimensionality. John Wiley & Sons, 2007, vol.
703.
[9] S. Li, “Reinforcement learning and control,” Lecture notes in
Tsinghua University, 2020.
[10] R. A. Howard, “Dynamic programming and Markov processes,” 1960.
[11] J. Duan, S. E. Li, Z. Liu, M. Bujarbaruah, and B. Cheng,
“Generalized policy iteration for optimal control in continu-
ous time,” arXiv preprint arXiv:1909.05402, 2019.
[12] M. Abu-Khalaf and F. L. Lewis, “Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach,” Automatica, vol. 41, no. 5, pp. 779–791, 2005.
[13] T. Dierks and S. Jagannathan, “Optimal control of affine
nonlinear continuous-time systems,” in Proceedings of the
2010 American Control Conference. IEEE, 2010, pp. 1568–
1573.
[14] K. G. Vamvoudakis and F. L. Lewis, “Online actor–critic al-
gorithm to solve the continuous-time infinite horizon optimal
control problem,” Automatica, vol. 46, no. 5, pp. 878–888,
2010.
[15] T. Cheng, F. L. Lewis, and M. Abu-Khalaf, “A neural network solution for fixed-final time optimal control of nonlinear systems,” Automatica, vol. 43, no. 3, pp. 482–490, 2007.
[16] Z. Zhao, Y. Yang, H. Li, and D. Liu, “Approximate finite-
horizon optimal control with policy iteration,” in Proceedings
of the 33rd Chinese Control Conference. IEEE, 2014, pp.
8895–8900.
[17] D. Adhyaru, I. Kar, and M. Gopal, “Fixed final time optimal control approach for bounded robust controller design using Hamilton–Jacobi–Bellman solution,” IET Control Theory & Applications, vol. 3, no. 9, pp. 1183–1195, 2009.
[18] T. Cheng, F. L. Lewis, and M. Abu-Khalaf, “Fixed-final-time-constrained optimal control of nonlinear systems using neural network HJB approach,” IEEE Transactions on Neural Networks, vol. 18, no. 6, pp. 1725–1737, 2007.
[19] A. Heydari and S. N. Balakrishnan, “Finite-horizon control-constrained nonlinear optimal control using single network adaptive critics,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 1, pp. 145–157, 2012.
[20] Z. Lin, J. Duan, S. E. Li, H. Ma, and Y. Yin, “Continuous-time
finite-horizon adp for automated vehicle controller design
with high efficiency,” arXiv preprint arXiv:2007.02070, 2020.