Reinforcement Learning for Partially Observable Dynamic Processes: Adaptive Dynamic Programming Using Measured Output Data
F. L. Lewis, Fellow, IEEE, and Kyriakos G. Vamvoudakis, Member, IEEE
Abstract—Approximate dynamic programming (ADP) is a class of reinforcement learning methods that have shown their importance in a variety of applications, including feedback control of dynamical systems. ADP generally requires full information about the system internal states, which is usually not available in practical situations. In this paper, we show how to implement ADP methods using only measured input/output data from the system. Linear dynamical systems with deterministic behavior are considered herein, which are systems of great interest in the control system community. In control system theory, these types of methods are referred to as output feedback (OPFB). The stochastic equivalent of the systems dealt with in this paper is a class of partially observable Markov decision processes. We develop both policy iteration and value iteration algorithms that converge to an optimal controller that requires only OPFB. It is shown that, similar to Q-learning, the new methods have the important advantage that knowledge of the system dynamics is not needed for the implementation of these learning algorithms or for the OPFB control. Only the order of the system, as well as an upper bound on its "observability index," must be known. The learned OPFB controller is in the form of a polynomial autoregressive moving-average controller that has equivalent performance with the optimal state variable feedback gain.
Index Terms—Approximate dynamic programming (ADP), data-based optimal control, output feedback (OPFB), policy iteration (PI), value iteration (VI).
I. INTRODUCTION
REINFORCEMENT learning (RL) is a class of methods
used in machine learning to methodically modify the
actions of an agent based on observed responses from its en-
vironment [7], [33]. RL methods have been developed starting
from learning mechanisms observed in mammals [4], [7]–[9],
[24], [28], [29], [33]. Every decision-making organism interacts
with its environment and uses those interactions to improve
its own actions in order to maximize the positive effect of
its limited available resources; this, in turn, leads to better
survival chances. RL is a means of learning optimal behaviors
by observing the response from the environment to nonoptimal
control policies. In engineering terms, RL refers to the learning
approach of an actor or agent that modifies its actions, or
control policies, based on stimuli received in response to its
interaction with its environment. This is based on evaluative
information from the environment and could be called action-
based learning. RL can be applied where standard supervised
learning is not applicable, and requires less a priori knowledge.
In view of the advantages offered by RL methods, a recent
objective of control system researchers is to introduce and
develop RL techniques that result in optimal feedback con-
trollers for dynamical systems that can be described in terms
of ordinary differential equations [4], [30], [38], [41], [42].
This includes most of the human-engineered systems, including
aerospace systems, vehicles, robotic systems, and many classes
of industrial processes.
Optimal control is generally an offline design technique that
requires full knowledge of the system dynamics, e.g., in the
linear system case, one must solve the Riccati equation. On
the other hand, adaptive control is a body of online design
techniques that use measured data along system trajectories
to learn to compensate for unknown system dynamics, dis-
turbances, and modeling errors to provide guaranteed perfor-
mance. Optimal adaptive controllers have been designed using
indirect techniques, whereby the unknown plant is first identi-
fied and then a Riccati equation is solved [12]. Inverse adaptive
controllers have been provided that optimize a performance
index, meaningful but not of the designer’s choice [13], [20].
Direct adaptive controllers that converge to optimal solutions
for unknown systems given a performance index selected by the designer have
generally not been developed.
Applications of RL to feedback control are discussed in [25],
[28], [43], and in the recent IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS
(SMC-B) Special Issue [16]. Recent surveys are given by Lewis
and Vrabie [19] and Wang et al. [38]. Temporal difference
RL methods [4], [7], [24], [28], [33], [41], namely, policy
iteration (PI) and value iteration (VI), have been developed
to solve online the Hamilton–Jacobi–Bellman (HJB) equation
associated with the optimal control problem. Such methods
require measurement of the entire state vector of the dynamical
system to be controlled.
PI refers to a class of algorithms built as a two-step iteration:
policy evaluation and policy improvement. Instead of trying a
direct approach to solving the HJB equation, the PI algorithm
starts by evaluating the cost/value of a given initial admissible
(stabilizing) controller. The cost associated with this policy
is then used to obtain a new improved control policy (i.e., a
policy that will have a lower associated cost than the previous
one). This is often accomplished by minimizing a Hamiltonian
function with respect to the new cost. The resulting policy is
thus obtained based on a greedy policy update with respect
to the new cost. These two steps of policy evaluation and
policy improvement are repeated until the policy improvement
step no longer changes the actual policy, and convergence to
the optimal controller is achieved. One must note that the
infinite horizon cost associated with a given policy can only be
evaluated in the case of an admissible control policy, meaning
that the control policy must be stabilizing. The difficulty of the
algorithm comes from the required effort of finding an initial
admissible control policy.
VI algorithms, as introduced by Werbos, are actor–critic on-
line learning algorithms that solve the optimal control problem
without the requirement of an initial stabilizing control policy
[40]–[42]. Werbos referred to the family of VI algorithms as
approximate dynamic programming (ADP) algorithms. These
algorithms use a critic neural network (NN) for value function
approximation (VFA) and an actor NN for the approximation
of the control policy. Actor–critic designs using NNs for the
implementation of VI are detailed in [25]. Furthermore, the
SMC-B Special Issue gives a good overview on applications
of ADP and RL to feedback control [16].
Most of the PI and VI algorithms require at least some
knowledge of the system dynamics and measurement of the
entire internal state vector that describes the dynamics of the
system/environment (e.g., [2], [7], [24], [28], [33], [35]–[37],
and [41]). The so-called Q-learning class of algorithms [5],
[39] (called action-dependent heuristic dynamic programming
(HDP) by Werbos [41], [42]) does not require exact or explicit
description of the system dynamics, but it still uses full state
measurement in the feedback control loop. From a control
system engineering perspective, this latter requirement may be
hard to fulfill as measurements of the entire state vector may not
be available and/or may be difficult and expensive to obtain.
Although various control algorithms (e.g., state feedback) re-
quire full state knowledge, in practical implementations, taking
measurements of the entire state vector is often not feasible. The
state vector is generally estimated based on limited information
about the system available by measuring the system’s outputs.
State estimation techniques have been proposed (e.g., [18],
[21], and [26]). These generally require a known model of the
system dynamics.
In real life, it is difficult to design and implement optimal
estimators because the system dynamics and the noise statistics
are not exactly known. However, information about the system
and noise is included in a long-enough set of input/output data.
It would be desirable to be able to design an estimator by using
input/output data without any system knowledge or noise iden-
tification. Such techniques belong to the field of data-based con-
trol techniques, where the control input depends on input/output
data measured directly from the plant. These techniques are as
follows: data-based predictive control [22], unfalsified control
[27], Markov data-based linear quadratic Gaussian control [31],
disturbance-based control [34], simultaneous perturbation sto-
chastic approximation [32], pulse-response-based control [3],
iterative feedback tuning [11], and virtual reference feedback
tuning [15]. In [1], data-based optimal control was achieved
through identification of Markov parameters.
In this paper, novel output-feedback (OPFB) ADP algorithms
are derived for affine in the control input linear time-invariant
(LTI) deterministic systems. Such systems have, as stochastic
equivalent, the partially observable Markov decision processes
(POMDPs). In this paper, data-based optimal control is im-
plemented online using novel PI and VI ADP algorithms that
require only reduced measured information available at the
system outputs. These two classes of OPFB algorithms do not
require any knowledge of the system dynamics (A, B, C) and,
as such, are similar to Q-learning [5], [39], [41], [42], but they
have the added advantage of requiring only measurements of
input/output data and not the full system state. In order to ensure
that the data set is sufficiently rich and linearly independent,
there is a need to add probing noise to the control input (cf. [5]).
We discuss this issue and show that probing noise leads to bias.
Adding a discount factor to the cost reduces this bias to a nearly
negligible effect. This discount factor is related to adding exponential
data weighting in the Kalman filter to remove the bias effects of
unmodeled dynamics [18].
This paper is organized as follows. Section II provides the
background of optimal control problem, dynamic program-
ming, and RL methods for the linear quadratic regulation
problem. Section III introduces the new class of PI and VI
algorithms by formulating the temporal difference error with
respect to the observed data and redefining the control sequence
as the output of a dynamic polynomial controller. Section IV
discusses the implementation aspects of the OPFB ADP al-
gorithms. For convergence, the new algorithms require per-
sistently exciting probing noise whose bias effect is canceled
by using a discounted cost function. Section V presents the
simulation results obtained using the new data-based ADP
algorithms and is followed by concluding remarks.
II. BACKGROUND
In this section, we give a review of optimal control, dynamic
programming, and RL methods [i.e., PI and VI] for the lin-
ear quadratic regulator (LQR). It is pointed out that both of
these methods employ contraction maps to solve the Bellman
equation, which is a fixed-point equation [23]. Both PI and VI
methods require full measurements of the entire state vector.
In the next section, we will show how to implement PI and VI
using only reduced information available at the system outputs.
A. Dynamic Programming and the LQR
Consider the linear TI discrete-time (DT) system

x_{k+1} = A x_k + B u_k, \qquad y_k = C x_k    (1)

where x_k \in R^n is the state, u_k \in R^m is the control input, and y_k \in R^p is the measured output. Assume throughout that (A, B) is controllable and (A, C) is observable [17].
Given a stabilizing control policy u_k = \mu(x_k), associate to the system the performance index

V^\mu(x_k) = \sum_{i=k}^{\infty} \left( y_i^T Q y_i + u_i^T R u_i \right) \equiv \sum_{i=k}^{\infty} r_i    (2)

with weighting matrices Q = Q^T \ge 0, R = R^T > 0, and (A, \sqrt{Q}\,C) being observable. Note that u_k = \mu(x_k) is the fixed policy. The utility is

r_k = y_k^T Q y_k + u_k^T R u_k.    (3)
The optimal control problem [17] is to find the policy u_k = \mu(x_k) that minimizes the cost (2) along the trajectories of the system (1). Due to the special structure of the dynamics and the cost, this is known as the LQR problem.

A difference equation that is equivalent to (2) is given by the Bellman equation

V^\mu(x_k) = y_k^T Q y_k + u_k^T R u_k + V^\mu(x_{k+1}).    (4)

The optimal cost, or value, is given by

V^*(x_k) = \min_{\mu} \sum_{i=k}^{\infty} \left( y_i^T Q y_i + u_i^T R u_i \right).    (5)

According to Bellman's optimality principle, the value may be determined using the HJB equation

V^*(x_k) = \min_{u_k} \left[ y_k^T Q y_k + u_k^T R u_k + V^*(x_{k+1}) \right]    (6)

and the optimal control is given by

\mu^*(x_k) = \arg\min_{u_k} \left[ y_k^T Q y_k + u_k^T R u_k + V^*(x_{k+1}) \right].    (7)
For the LQR case, any value is quadratic in the state so that the cost associated to any policy u_k = \mu(x_k) (not necessarily optimal) is

V^\mu(x_k) = x_k^T P x_k    (8)

for some n \times n matrix P. Substituting this into (4), one obtains the LQR Bellman equation

x_k^T P x_k = y_k^T Q y_k + u_k^T R u_k + x_{k+1}^T P x_{k+1}.    (9)

If the policy is a linear state variable feedback so that

u_k = \mu(x_k) = -K x_k    (10)

then the closed-loop system is

x_{k+1} = (A - BK) x_k \equiv A_c x_k.    (11)

Inserting these equations into (9) and averaging over all state trajectories yield the Lyapunov equation

0 = (A - BK)^T P (A - BK) - P + C^T Q C + K^T R K.    (12)

If the feedback K is stabilizing and (A, C) is observable, there exists a positive definite solution to this equation. Then, the Lyapunov solution gives the value of using the state feedback K, i.e., the solution of this equation gives the kernel P such that V^\mu(x_k) = x_k^T P x_k.

To find the optimal control, insert (1) into (9) to obtain

x_k^T P x_k = y_k^T Q y_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k).    (13)

To determine the minimizing control, set the derivative with respect to u_k equal to zero to obtain

u_k = -(R + B^T P B)^{-1} B^T P A x_k    (14)

whence substitution into (12) yields the Riccati equation

0 = A^T P A - P + C^T Q C - A^T P B (R + B^T P B)^{-1} B^T P A.    (15)

This is the LQR Riccati equation, which is equivalent to the HJB equation (6).
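To make (14)-(15) concrete, the following minimal sketch (in Python, with illustrative placeholder matrices rather than any system from this paper) solves the Riccati equation numerically and forms the corresponding optimal feedback gain.

```python
# Sketch: numerical solution of the LQR Riccati equation (15) and gain (14).
# The system matrices here are illustrative placeholders.
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.5]])
Q = np.eye(1)   # output weighting
R = np.eye(1)   # control weighting

# Solve 0 = A^T P A - P + C^T Q C - A^T P B (R + B^T P B)^{-1} B^T P A
P = solve_discrete_are(A, B, C.T @ Q @ C, R)

# Optimal feedback u_k = -K x_k with K = (R + B^T P B)^{-1} B^T P A
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
print("P =\n", P)
print("K =", K)
```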
B. Temporal Difference, PI, and VI
It is well known that the optimal value and control can be
determined online in real time using temporal difference RL
methods [7], [24], [28], [33], [41], which rely on solving online
for the value that makes small the so-called Bellman temporal difference error

e_k = -V^\mu(x_k) + y_k^T Q y_k + u_k^T R u_k + V^\mu(x_{k+1}).    (16)
The temporal difference error is defined based on the Bellman
equation (4). For use in any practical real-time algorithm, the
value should be approximated by a parametric structure [4],
[41], [42].
For the LQR case, the value is quadratic in the state, and the Bellman temporal difference error is

e_k = -x_k^T P x_k + y_k^T Q y_k + u_k^T R u_k + x_{k+1}^T P x_{k+1}.    (17)

Given a control policy, the solution of this equation is equivalent to solving the Lyapunov equation (12) and gives the kernel P such that V^\mu(x_k) = x_k^T P x_k.
Write (6) equivalently as

0 = \min_{u_k} \left[ -V^*(x_k) + y_k^T Q y_k + u_k^T R u_k + V^*(x_{k+1}) \right].    (18)
This is a fixed-point equation. As such, it can be solved by
the method of successive approximation using a contraction
map. The successive approximation method resulting from this
fixed-point equation is known as PI, an iterative method of
determining the optimal value and policy. For the LQR, PI is
performed by the following two steps based on the temporal
difference error (17) and a policy update step based on (7).
Algorithm 1—PI

Select a stabilizing initial control policy u_k^0 = \mu^0(x_k). Then, for j = 0, 1, ..., perform until convergence:

1. Policy Evaluation: Using the policy u_k^j = \mu^j(x_k) in (1), solve for P^{j+1} such that

0 = -x_k^T P^{j+1} x_k + y_k^T Q y_k + (u_k^j)^T R u_k^j + x_{k+1}^T P^{j+1} x_{k+1}.    (19)
2. Policy Improvement:

u_k^{j+1} = \mu^{j+1}(x_k) = \arg\min_{u_k} \left[ y_k^T Q y_k + u_k^T R u_k + x_{k+1}^T P^{j+1} x_{k+1} \right].    (20)
The solution in the policy evaluation step is generally carried
out in a least squares (LS) sense. The initial policy is required to
be stable since only then does (19) have a meaningful solution.
This algorithm is equivalent to the following, which uses
the Lyapunov equation (12), instead of the Bellman equation,
and (14).
Algorithm 2—PI Lyapunov Iteration Equivalent

Select a stabilizing initial control policy K^0. Then, for j = 0, 1, ..., perform until convergence:

1. Policy Evaluation:

0 = (A - BK^j)^T P^{j+1} (A - BK^j) - P^{j+1} + C^T Q C + (K^j)^T R K^j.    (21)

2. Policy Improvement:

K^{j+1} = (R + B^T P^{j+1} B)^{-1} B^T P^{j+1} A.    (22)
It was shown in [4] that, under general conditions, the policy \mu^{j+1} obtained by (20) is improved over \mu^j in the sense that V^{\mu^{j+1}}(x_k) \le V^{\mu^j}(x_k). It was shown by Hewer [10] that Algorithm 2 converges under the controllability/observability assumptions if the initial feedback gain is stabilizing.

Note that, in PI Algorithm 2, the system dynamics (A, B) are required for the policy evaluation step, while in PI Algorithm 1, they are not. Algorithm 2 is performed offline knowing the state dynamics.
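A minimal offline sketch of Algorithm 2 (Hewer's iteration) is given below; the system matrices and the initial stabilizing gain are illustrative placeholders, and the Lyapunov equation (21) is solved with a standard solver.

```python
# Sketch of Algorithm 2 (model-based PI / Hewer's iteration).
# System matrices and the initial stabilizing gain K0 are placeholders.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.5]])
Q, R = np.eye(1), np.eye(1)

K = np.zeros((1, 2))          # K0 = 0 is stabilizing here since A is stable
for j in range(20):
    Ac = A - B @ K
    # Policy evaluation (21): solve Ac^T P Ac - P + C^T Q C + K^T R K = 0
    P = solve_discrete_lyapunov(Ac.T, C.T @ Q @ C + K.T @ R @ K)
    # Policy improvement (22)
    K_new = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    if np.allclose(K_new, K, atol=1e-10):
        break
    K = K_new
print("Converged gain K =", K)
```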
On the other hand, Algorithm 1 is performed online in real time as the data (x_k, r_k, x_{k+1}) are measured at each time step, with r_k = y_k^T Q y_k + u_k^T R u_k being the utility. Note that (19) is a scalar equation, whereas the value kernel P is a symmetric n \times n matrix with n(n+1)/2 independent elements. Therefore, n(n+1)/2 data sets are required before (19) can be solved. This is a standard problem in LS estimation. The policy evaluation step may be performed using batch LS, as enough data are collected along the system trajectory, or using recursive LS (RLS). The dynamics (A, B) are not required for this, since the state x_{k+1} is measured at each step, which contains implicit information about the dynamics (A, B). This procedure amounts to a stochastic approximation method that evaluates the performance of a given policy along one sample path, e.g., the system trajectory. PI Algorithm 1 effectively provides a method for solving the Riccati equation (15) online using data measured along the system trajectories. Full state measurements of x_k are required.
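The following sketch illustrates the policy evaluation step (19) of Algorithm 1 by batch LS. The n(n+1)/2 independent entries of P are estimated from the quadratic terms of the measured states; the system, the fixed policy gain, and the data-collection scheme are illustrative assumptions, not the paper's setup.

```python
# Sketch: policy evaluation (19) of Algorithm 1 by batch least squares.
# System, policy gain, and data collection are illustrative placeholders.
import numpy as np

A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.5]])
Q, R = np.eye(1), np.eye(1)
K = np.array([[0.1, 0.3]])            # fixed stabilizing policy u = -K x
n = A.shape[0]

def quad_basis(x):
    # independent quadratic terms of x^T P x for symmetric P (off-diagonals doubled)
    return np.array([x[i] * x[j] * (1.0 if i == j else 2.0)
                     for i in range(n) for j in range(i, n)])

Phi, targets = [], []
rng = np.random.default_rng(0)
for _ in range(20):                   # several short trajectories for data richness
    x = rng.standard_normal(n)
    for _ in range(10):
        u = -K @ x
        y = C @ x
        x_next = A @ x + B @ u
        r = float(y @ Q @ y + u @ R @ u)
        Phi.append(quad_basis(x) - quad_basis(x_next))   # regressor implied by (19)
        targets.append(r)
        x = x_next

theta = np.linalg.lstsq(np.array(Phi), np.array(targets), rcond=None)[0]
P = np.zeros((n, n))                  # unpack the symmetric kernel
idx = 0
for i in range(n):
    for j in range(i, n):
        P[i, j] = P[j, i] = theta[idx]
        idx += 1
print("Estimated value kernel P =\n", P)
```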
A second class of algorithms for the online iterative solution
of the optimal control problem based on the Bellman temporal
difference error (17) is given by VI or HDP [41]. Instead of
the fixed-point equation in the form of (18), which leads to PI,
consider (6)

V^*(x_k) = \min_{u_k} \left[ y_k^T Q y_k + u_k^T R u_k + V^*(x_{k+1}) \right]    (23)

which is also a fixed-point equation. As such, it can be solved using a contraction map by successive approximation using the VI algorithm.
Algorithm 3—VI

Select an initial control policy K^0. Then, for j = 0, 1, ..., perform until convergence:

1. Value Update:

x_k^T P^{j+1} x_k = y_k^T Q y_k + (u_k^j)^T R u_k^j + x_{k+1}^T P^j x_{k+1}.    (24)

2. Policy Improvement:

u_k^{j+1} = \mu^{j+1}(x_k) = \arg\min_{u_k} \left[ y_k^T Q y_k + u_k^T R u_k + x_{k+1}^T P^{j+1} x_{k+1} \right].    (25)
Since (23) is implemented as a recursion rather than solved as an equation, a stabilizing policy is not required. Therefore, the VI algorithm does not require an initial stabilizing gain. It is equivalent to a matrix recursion knowing the system dynamics (A, B), which has been shown to converge in [14].
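For reference, a minimal sketch of the matrix recursion equivalent to VI Algorithm 3 is shown below, started from P^0 = 0 so that no stabilizing gain is needed; the system matrices are illustrative placeholders.

```python
# Sketch of the matrix recursion equivalent to VI (Algorithm 3), started from P0 = 0.
# System matrices are illustrative placeholders.
import numpy as np

A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.5]])
Q, R = np.eye(1), np.eye(1)

P = np.zeros((2, 2))                  # no initial stabilizing gain required
for j in range(200):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)      # greedy policy, cf. (25)
    P_next = C.T @ Q @ C + A.T @ P @ A - A.T @ P @ B @ K   # value update, cf. (24)
    if np.max(np.abs(P_next - P)) < 1e-12:
        break
    P = P_next
print("VI limit P =\n", P)
```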
Note that the policy update steps in PI Algorithm 1 and VI
Algorithm 3 rely on the hypothesis of the VFA parameterization
(8). Both algorithms require knowledge of the dynamics (A, B)
for the policy improvement step. In [2], it is shown that if a
second approximator structure is assumed for the policy, then
only the B matrix is needed for the policy improvement in
Algorithm 1 or 3.
An approach that provides online real-time algorithms for the
solution of the optimal control problem without knowing any
system dynamics is Q-learning. This has been applied to both
PI [5] and VI, where it is known as action-dependent HDP [41].
All these methods require measurement of the full state x_k \in R^n.
III. TEMPORAL DIFFERENCE, PI, AND VI BASED ON OPFB
This section presents the new results of this paper. The
Bellman error (17) for LQR is quadratic in the state. This can be used in a PI or VI algorithm for online learning of optimal controls as long as full measurements of the state x_k \in R^n are available. In this section, we show how to write the Bellman temporal difference error in terms only of the observed data, namely, the input sequence u_k and the output sequence y_k.
The main results of the following equations are (45) and (46),
which give a temporal difference error in terms only of the
observed output data.
To reformulate PI and VI in terms only of the observed
data, we show how to write V^\mu(x_k) = x_k^T P x_k as a quadratic
form in terms of the input and output sequences. A surprising
benefit is that there result two algorithms for RL that do not
require any knowledge of the system dynamics (A, B, C). That
is, these algorithms have the same advantage as Q-learning
in not requiring knowledge of the system dynamics, yet they
have the added benefit of requiring only measurements of the
available input/output data, not the full system state as required
by Q-learning.
A. Writing the Value Function in Terms of Available
Measured Data
Consider the deterministic linear TI system

x_{k+1} = A x_k + B u_k, \qquad y_k = C x_k    (26)

with x_k \in R^n, u_k \in R^m, and y_k \in R^p. Assume that (A, B) is controllable and (A, C) is observable [17]. Controllability is a property of the matrices (A, B) and means that any initial state can be driven to any desired final state. Observability is a property of (A, C) and means that observations of the output y_k over a long-enough time horizon can be used to reconstruct the full state x_k. Given the current time k, the dynamics can be written on a time horizon [k-N, k] as the expanded state equation
x_k = A^N x_{k-N} + [B \;\; AB \;\; A^2B \;\; \cdots \;\; A^{N-1}B] \begin{bmatrix} u_{k-1} \\ u_{k-2} \\ \vdots \\ u_{k-N} \end{bmatrix}    (27)

\begin{bmatrix} y_{k-1} \\ y_{k-2} \\ \vdots \\ y_{k-N} \end{bmatrix} = \begin{bmatrix} CA^{N-1} \\ \vdots \\ CA \\ C \end{bmatrix} x_{k-N} + \begin{bmatrix} 0 & CB & CAB & \cdots & CA^{N-2}B \\ 0 & 0 & CB & \cdots & CA^{N-3}B \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & CB \\ 0 & 0 & \cdots & 0 & 0 \end{bmatrix} \begin{bmatrix} u_{k-1} \\ u_{k-2} \\ \vdots \\ u_{k-N} \end{bmatrix}    (28)

or, by appropriate definition of variables, as

x_k = A^N x_{k-N} + U_N u_{k-1,k-N}    (29)

y_{k-1,k-N} = V_N x_{k-N} + T_N u_{k-1,k-N}    (30)

with U_N = [B \;\; AB \;\; \cdots \;\; A^{N-1}B] being the controllability matrix and

V_N = \begin{bmatrix} CA^{N-1} \\ \vdots \\ CA \\ C \end{bmatrix}    (31)

being the observability matrix, where V_N \in R^{pN \times n}. T_N is the Toeplitz matrix of Markov parameters.
The vectors

y_{k-1,k-N} = \begin{bmatrix} y_{k-1} \\ y_{k-2} \\ \vdots \\ y_{k-N} \end{bmatrix} \in R^{pN}, \qquad u_{k-1,k-N} = \begin{bmatrix} u_{k-1} \\ u_{k-2} \\ \vdots \\ u_{k-N} \end{bmatrix} \in R^{mN}

are the input and output sequences over the time interval [k-N, k-1]. They represent the available measured data.
Since (A, C) is observable, there exists a K, the observability index, such that rank(V_N) < n for N < K and rank(V_N) = n for N \ge K. Note that K satisfies Kp \ge n. Let N \ge K. Then, V_N has full column rank n, and there exists a matrix M \in R^{n \times pN} such that

A^N = M V_N.    (32)

This was used in [1] to find the optimal control through identification of Markov parameters.

Since V_N has full column rank, its left inverse is given as

V_N^+ = (V_N^T V_N)^{-1} V_N^T    (33)

so that

M = A^N V_N^+ + Z (I - V_N V_N^+) \equiv M_0 + M_1    (34)

for any matrix Z, with M_0 denoting the minimum norm operator and P_{R(V_N)^\perp} = I - V_N V_N^+ being the projection onto the range perpendicular to V_N.
The following lemma shows how to write the system state in
terms of the input/output data.
Lemma 1: Let the system (26) be observable. Then, the system state is given uniquely in terms of the measured input/output sequences by

x_k = M_0 y_{k-1,k-N} + (U_N - M_0 T_N) u_{k-1,k-N} \equiv M_y y_{k-1,k-N} + M_u u_{k-1,k-N}    (35)

or

x_k = [M_u \;\; M_y] \begin{bmatrix} u_{k-1,k-N} \\ y_{k-1,k-N} \end{bmatrix}    (36)

where M_y = M_0 and M_u = U_N - M_0 T_N, with M_0 = A^N V_N^+, V_N^+ = (V_N^T V_N)^{-1} V_N^T being the left inverse of the observability matrix (31), and N \ge K, where K is the observability index.

Proof: Note that A^N x_{k-N} = M V_N x_{k-N} so that, according to (30)

A^N x_{k-N} = M V_N x_{k-N} = M y_{k-1,k-N} - M T_N u_{k-1,k-N}    (37)

(M_0 + M_1) V_N x_{k-N} = (M_0 + M_1) y_{k-1,k-N} - (M_0 + M_1) T_N u_{k-1,k-N}.    (38)
Note, however, that M_1 V_N = 0 so that M V_N x_{k-N} = M_0 V_N x_{k-N}, and apply M_1 to (30) to see that

0 = M_1 y_{k-1,k-N} - M_1 T_N u_{k-1,k-N}, \quad \forall M_1 \text{ s.t. } M_1 V_N = 0.    (39)

Therefore

A^N x_{k-N} = M_0 V_N x_{k-N} = M_0 y_{k-1,k-N} - M_0 T_N u_{k-1,k-N}    (40)

independently of M_1. Then, from (29)

x_k = M_0 y_{k-1,k-N} + (U_N - M_0 T_N) u_{k-1,k-N} \equiv M_y y_{k-1,k-N} + M_u u_{k-1,k-N}.    (41)

This result expresses x_k in terms of the system inputs and outputs from time k-N to time k-1. Now, we will express the value function in terms of the inputs and outputs.

It is important to note that the system dynamics information (e.g., A, B, and C) must be known to use (36). In fact, M_y = M_0 is given in (34), where V_N^+ depends on A and C. Also, M_u depends on M_0, U_N, and T_N. U_N is given in (27) and (29) in terms of A and B. T_N in (28) and (30) depends on A, B, and C.
In the next step, it is shown how to use the structural dependence in (35) yet avoid knowledge of A, B, and C.

Define the vector of the observed data at time k as

z_{k-1,k-N} = \begin{bmatrix} u_{k-1,k-N} \\ y_{k-1,k-N} \end{bmatrix}.    (42)

Now, one has

V^\mu(x_k) = x_k^T P x_k = z_{k-1,k-N}^T \begin{bmatrix} M_u^T \\ M_y^T \end{bmatrix} P \, [M_u \;\; M_y] \, z_{k-1,k-N}    (43)

V^\mu(x_k) = z_{k-1,k-N}^T \begin{bmatrix} M_u^T P M_u & M_u^T P M_y \\ M_y^T P M_u & M_y^T P M_y \end{bmatrix} z_{k-1,k-N} \equiv z_{k-1,k-N}^T \bar{P} z_{k-1,k-N}.    (44)

Note that u_{k-1,k-N} \in R^{mN}, y_{k-1,k-N} \in R^{pN}, z_{k-1,k-N} \in R^{(m+p)N}, and \bar{P} \in R^{(m+p)N \times (m+p)N}.

Equation (44) expresses the value function at time k as a quadratic form in terms of the past inputs and outputs.

Note that the inner kernel matrix \bar{P} in (44) depends on the system dynamics A, B, and C. In the next section, it is shown how to use RL methods to learn the kernel matrix \bar{P} without knowing A, B, and C.
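The constructions above are easy to check numerically. The sketch below (illustrative system matrices; variable names follow the paper's notation) builds V_N, U_N, T_N, and M_0, reconstructs x_k from the last N inputs and outputs as in (35), and verifies the quadratic form (44) for an arbitrary symmetric kernel P.

```python
# Sketch: verify the state reconstruction (35)-(36) and the data-based
# value form (44) on an illustrative system; P is an arbitrary symmetric kernel.
import numpy as np

A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.5]])
n, m, p = 2, 1, 1
N = 2                                                 # N >= observability index

mpow = np.linalg.matrix_power
VN = np.vstack([C @ mpow(A, N - 1 - i) for i in range(N)])      # (31)
UN = np.hstack([mpow(A, i) @ B for i in range(N)])              # (29)
TN = np.zeros((p * N, m * N))                                   # Toeplitz in (28), (30)
for i in range(N):
    for j in range(i + 1, N):
        TN[i*p:(i+1)*p, j*m:(j+1)*m] = C @ mpow(A, j - i - 1) @ B

M0 = mpow(A, N) @ np.linalg.pinv(VN)                  # minimum-norm M in (34)
My, Mu = M0, UN - M0 @ TN                             # Lemma 1

# simulate a window of data and compare x_k with its input/output reconstruction
rng = np.random.default_rng(0)
x = rng.standard_normal(n)
us, ys = [], []
for k in range(N):
    u = rng.standard_normal(m)
    us.append(u); ys.append(C @ x)
    x = A @ x + B @ u
u_stack = np.concatenate(us[::-1])                    # [u_{k-1}; ...; u_{k-N}]
y_stack = np.concatenate(ys[::-1])                    # [y_{k-1}; ...; y_{k-N}]
x_rec = My @ y_stack + Mu @ u_stack                   # (35)
print("reconstruction error:", np.linalg.norm(x - x_rec))

P = np.array([[2.0, 0.3], [0.3, 1.0]])                # any symmetric kernel
Pbar = np.block([[Mu.T @ P @ Mu, Mu.T @ P @ My],
                 [My.T @ P @ Mu, My.T @ P @ My]])     # (44)
z = np.concatenate([u_stack, y_stack])
print("x^T P x - z^T Pbar z =", x @ P @ x - z @ Pbar @ z)
```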
B. Writing the TD Error in Terms of Available Measured Data
We may now write Bellman's equation (9) in terms of the observed data as

z_{k-1,k-N}^T \bar{P} z_{k-1,k-N} = y_k^T Q y_k + u_k^T R u_k + z_{k,k-N+1}^T \bar{P} z_{k,k-N+1}.    (45)

Based on this equation, write the temporal difference error (17) in terms of the inputs and outputs as

e_k = -z_{k-1,k-N}^T \bar{P} z_{k-1,k-N} + y_k^T Q y_k + u_k^T R u_k + z_{k,k-N+1}^T \bar{P} z_{k,k-N+1}.    (46)

Using this TD error, the policy evaluation step of any form of RL based on the Bellman temporal difference error (17) can be equivalently performed using only the measured data, not the state.

The matrix \bar{P} depends on A, B, and C through M_y and M_u. However, RL methods allow one to learn \bar{P} online without A, B, and C, as shown next.
C. Writing the Policy Update in Terms of Available
Measured Data
Using Q-learning, one can perform PI and VI without any
knowledge of the system dynamics. Likewise, it is now shown
that, using the aforementioned constructions, one can derive
a form for the policy improvement step (20)/(25) that does
not depend on the state dynamics but only on the measured
input/output data.
The policy improvement step may be written in terms of the observed data as

\mu(x_k) = \arg\min_{u_k} \left[ y_k^T Q y_k + u_k^T R u_k + x_{k+1}^T P x_{k+1} \right]    (47)

\mu(x_k) = \arg\min_{u_k} \left[ y_k^T Q y_k + u_k^T R u_k + z_{k,k-N+1}^T \bar{P} z_{k,k-N+1} \right].    (48)

Partition z_{k,k-N+1}^T \bar{P} z_{k,k-N+1} as

z_{k,k-N+1}^T \bar{P} z_{k,k-N+1} = \begin{bmatrix} u_k \\ u_{k-1,k-N+1} \\ y_{k,k-N+1} \end{bmatrix}^T \begin{bmatrix} p_0 & p_u & p_y \\ p_u^T & P_{22} & P_{23} \\ p_y^T & P_{32} & P_{33} \end{bmatrix} \begin{bmatrix} u_k \\ u_{k-1,k-N+1} \\ y_{k,k-N+1} \end{bmatrix}.    (49)

One has p_0 \in R^{m \times m}, p_u \in R^{m \times m(N-1)}, and p_y \in R^{m \times pN}. Then, differentiating with respect to u_k to perform the minimization in (48) yields

0 = R u_k + p_0 u_k + p_u u_{k-1,k-N+1} + p_y y_{k,k-N+1}    (50)

or

u_k = -(R + p_0)^{-1} (p_u u_{k-1,k-N+1} + p_y y_{k,k-N+1}).    (51)

This is a dynamic polynomial autoregressive moving-average (ARMA) controller that generates the current control input u_k in terms of the previous inputs and the current and previous outputs.
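The partitioning (49) and the control update (51) reduce to index bookkeeping on \bar{P}. The sketch below uses a randomly generated symmetric matrix as a stand-in for a learned kernel and illustrative dimensions m, p, N; it extracts p_0, p_u, p_y and evaluates the ARMA control law.

```python
# Sketch: extract p0, pu, py from a kernel Pbar partitioned as in (49)
# and evaluate the ARMA control (51). Pbar here is a random stand-in
# for a learned kernel; m, p, N are illustrative.
import numpy as np

m, p, N = 1, 1, 2
dim = (m + p) * N
rng = np.random.default_rng(0)
S = rng.standard_normal((dim, dim))
Pbar = S @ S.T                        # placeholder symmetric kernel
R = np.eye(m)

# ordering of z_{k,k-N+1}: [u_k; u_{k-1,k-N+1}; y_{k,k-N+1}]
p0 = Pbar[:m, :m]
pu = Pbar[:m, m:m * N]
py = Pbar[:m, m * N:]

u_prev = rng.standard_normal(m * (N - 1))   # u_{k-1}, ..., u_{k-N+1}
y_hist = rng.standard_normal(p * N)         # y_k, ..., y_{k-N+1}
u_k = -np.linalg.solve(R + p0, pu @ u_prev + py @ y_hist)   # (51)
print("u_k =", u_k)
```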
Exactly as in Q-learning (called action-dependent learning by Werbos [41]), the control input appears in the quadratic form (49), so that the minimization in (48) can be carried out in terms of the learned kernel matrix \bar{P} without resorting to the system dynamics. However, since (49) contains present and
past values of the input and output, the result is a dynamical
controller in polynomial ARMA form.
We have developed the following RL algorithms that use
only the measured input/output data and do not require mea-
surements of the full state vector xk.
Algorithm 4—PI Algorithm Using OPFB

Select a stabilizing initial control policy u_k^0 = \mu^0. Then, for j = 0, 1, ..., perform until convergence:

1. Policy Evaluation: Solve for \bar{P}^{j+1} such that

0 = -z_{k-1,k-N}^T \bar{P}^{j+1} z_{k-1,k-N} + y_k^T Q y_k + (u_k^j)^T R u_k^j + z_{k,k-N+1}^T \bar{P}^{j+1} z_{k,k-N+1}.    (52)

2. Policy Improvement: Partition \bar{P}^{j+1} as in (49). Then, define the updated policy by

u_k^{j+1} = \mu^{j+1}(x_k) = -(R + p_0^{j+1})^{-1} (p_u^{j+1} u_{k-1,k-N+1} + p_y^{j+1} y_{k,k-N+1}).    (53)
Algorithm 5—VI Algorithm Using OPFB

Select any initial control policy u_k^0 = \mu^0. Then, for j = 0, 1, ..., perform until convergence:

1. Policy Evaluation: Solve for \bar{P}^{j+1} such that

z_{k-1,k-N}^T \bar{P}^{j+1} z_{k-1,k-N} = y_k^T Q y_k + (u_k^j)^T R u_k^j + z_{k,k-N+1}^T \bar{P}^j z_{k,k-N+1}.    (54)

2. Policy Improvement: Partition \bar{P}^{j+1} as in (49). Then, define the updated policy by

u_k^{j+1} = \mu^{j+1}(x_k) = -(R + p_0^{j+1})^{-1} (p_u^{j+1} u_{k-1,k-N+1} + p_y^{j+1} y_{k,k-N+1}).    (55)
Remark 1: These algorithms do not require measurements of the internal state vector x_k. They only require measurements at each time step k of the utility r_k = y_k^T Q y_k + u_k^T R u_k, the inputs from time k-N to time k, and the outputs from time k-N to time k. The policy evaluation step may be implemented using standard methods such as batch LS or RLS (see the next section).

Remark 2: The control policy given by these algorithms in the form of (51), (53), and (55) is a dynamic ARMA regulator in terms of the past inputs and the current and past outputs. As such, it optimizes a polynomial square-of-sums cost function that is equivalent to the LQR sum-of-squares cost function (2). This polynomial cost function could be determined using the techniques in [17].

Remark 3: PI Algorithm 4 requires an initial stabilizing control policy, while VI Algorithm 5 does not. As such, VI is suitable for control of open-loop unstable systems.

Remark 4: The PI algorithm requires an initial stabilizing control policy u_k^0 = \mu^0 that is required to be a function only of the observable data. Suppose that one can find an initial stabilizing state feedback gain such that u_k^0 = \mu^0(x_k) = -K^0 x_k. Then, the equivalent stabilizing OPFB ARMA controller is easy to find and is given by

u_k^0 = \mu^0(x_k) = -K^0 x_k = -K^0 [M_u \;\; M_y] \begin{bmatrix} u_{k-1,k-N} \\ y_{k-1,k-N} \end{bmatrix}.
The next result shows that the controller (51) is unique.

Lemma 2: Define M_0 and M_1 according to (34). Then, the control sequence generated by (51) is independent of M_1 and depends only on M_0. Moreover, (51) is equivalent to

u_k = -(R + B^T P B)^{-1} (p_u u_{k-1,k-N+1} + p_y y_{k,k-N+1})    (56)

where p_u and p_y depend only on M_0.

Proof: Write (50) as

0 = R u_k + p_0 u_k + p_u u_{k-1,k-N+1} + p_y y_{k,k-N+1} = R u_k + [p_0 \; p_u \,|\, p_y] \begin{bmatrix} u_{k,k-N+1} \\ y_{k,k-N+1} \end{bmatrix}.

According to (44) and (49)

[p_0 \; p_u] = [I_m \;\; 0] M_u^T P M_u, \qquad p_y = [I_m \;\; 0] M_u^T P M_y

so

0 = R u_k + p_0 u_k + p_u u_{k-1,k-N+1} + p_y y_{k,k-N+1} = R u_k + [I_m \;\; 0] M_u^T P (M_u u_{k,k-N+1} + M_y y_{k,k-N+1}).

According to Lemma 1, this is unique, independently of M_1 [see (38)], and equal to

0 = R u_k + [I_m \;\; 0] M_u^T P \left( (U_N - M_0 T_N) u_{k,k-N+1} + M_0 y_{k,k-N+1} \right).

Now

[I_m \;\; 0] M_u^T P = \left( M_u \begin{bmatrix} I_m \\ 0 \end{bmatrix} \right)^T P.

However

M_u \begin{bmatrix} I_m \\ 0 \end{bmatrix} = (U_N - (M_0 + M_1) T_N) \begin{bmatrix} I_m \\ 0 \end{bmatrix} = B

where one has used the structure of U_N and T_N. Consequently

0 = R u_k + p_0 u_k + p_u u_{k-1,k-N+1} + p_y y_{k,k-N+1} = R u_k + B^T P \left( (U_N - M_0 T_N) u_{k,k-N+1} + M_0 y_{k,k-N+1} \right)

which is independent of M_1.
Now note that p_0 = [I_m \;\; 0] M_u^T P M_u \begin{bmatrix} I_m \\ 0 \end{bmatrix} = B^T P B as before. Hence, (56) follows from (51).

Remark 5: Note that the controller cannot be implemented in the form of (56) since it requires that the matrix P be known.
IV. IMPLEMENTATION, PROBING NOISE, BIAS, AND DISCOUNT FACTORS
In this section, we discuss the need for probing noise to
implement the aforementioned algorithms, show that this noise
leads to deleterious effects such as bias, and argue how adding a
discount factor in the cost (2) can reduce this bias to a negligible
effect.
The equations in PI and VI are solved online by standard techniques such as batch LS or RLS. See [5], where RLS was used in Q-learning. In PI, one solves (52) by writing it in the form of

stk(\bar{P}^{j+1})^T \left[ z_{k-1,k-N} \otimes z_{k-1,k-N} - z_{k,k-N+1} \otimes z_{k,k-N+1} \right] = y_k^T Q y_k + (u_k^j)^T R u_k^j    (57)

with \otimes being the Kronecker product and stk(\cdot) being the column stacking operator [6]. The redundant quadratic terms in the Kronecker product are combined. In VI, one solves (54) in the form of

stk(\bar{P}^{j+1})^T \left[ z_{k-1,k-N} \otimes z_{k-1,k-N} \right] = y_k^T Q y_k + (u_k^j)^T R u_k^j + z_{k,k-N+1}^T \bar{P}^j z_{k,k-N+1}.    (58)

Both of these equations only require that the input/output data be measured. They are scalar equations, yet one must solve for the kernel matrix \bar{P} \in R^{(m+p)N \times (m+p)N}, which is symmetric and has [(m+p)N][(m+p)N + 1]/2 independent terms. Therefore, one requires data samples for at least [(m+p)N][(m+p)N + 1]/2 time steps for a solution using batch LS.
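To make the data assembly concrete, the sketch below performs one policy evaluation step of Algorithm 4 in the form (52)/(57) on a small simulated example. The system, the initial policy (zero control plus dither), the dither level, and the use of a discount factor are illustrative choices for the sketch, not the paper's exact setup; the regression uses the non-redundant quadratic terms of z.

```python
# Sketch: one policy evaluation step of Algorithm 4 (eq. (52)/(57)) by batch LS.
# System, initial policy, dither level, and discount factor are illustrative.
import numpy as np

A = np.array([[0.9, 0.2], [0.0, 0.8]])    # open-loop stable, so u = 0 is admissible
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.5]])
Qy, R = np.eye(1), np.eye(1)
m, p, N = 1, 1, 2
gamma = 0.2                               # discount to mitigate dither bias
dim = (m + p) * N

def quad(z):
    # non-redundant quadratic terms of z^T Pbar z (off-diagonals doubled)
    return np.array([z[i] * z[j] * (1.0 if i == j else 2.0)
                     for i in range(dim) for j in range(i, dim)])

def zvec(us, ys, k):
    # z_{k-1,k-N} built from recorded data: [u_{k-1..k-N}; y_{k-1..k-N}]
    u_part = np.concatenate([us[k - 1 - i] for i in range(N)])
    y_part = np.concatenate([ys[k - 1 - i] for i in range(N)])
    return np.concatenate([u_part, y_part])

rng = np.random.default_rng(0)
x = rng.standard_normal(2)
us, ys, rs = [], [], []
for k in range(400):                      # collect data under current policy + dither
    u = 0.1 * rng.standard_normal(m)      # policy u = 0 plus probing noise
    y = C @ x
    us.append(u); ys.append(y)
    rs.append(float(y @ Qy @ y + u @ R @ u))
    x = A @ x + B @ u

Phi, tgt = [], []
for k in range(N, len(us) - 1):
    z_old, z_new = zvec(us, ys, k), zvec(us, ys, k + 1)
    Phi.append(quad(z_old) - gamma * quad(z_new))     # discounted form of (52)
    tgt.append(rs[k])
theta = np.linalg.lstsq(np.array(Phi), np.array(tgt), rcond=None)[0]

Pbar = np.zeros((dim, dim))               # unpack symmetric kernel
idx = 0
for i in range(dim):
    for j in range(i, dim):
        Pbar[i, j] = Pbar[j, i] = theta[idx]
        idx += 1
print("Estimated Pbar =\n", Pbar)
```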
To solve the PI update equation (57), it is required that the quadratic vector [z_{k-1,k-N} \otimes z_{k-1,k-N} - z_{k,k-N+1} \otimes z_{k,k-N+1}] be linearly independent over time, which is a property known as persistence of excitation (PE). To solve the VI update equation (58), one requires PE of the quadratic vector [z_{k-1,k-N} \otimes z_{k-1,k-N}]. It is standard practice (see [5] for instance) to inject probing noise into the control action to obtain PE, so that one puts into the system dynamics the input \hat{u}_k = u_k + d_k, with u_k being the control computed by the current PI or VI policy and d_k being a probing noise or dither, e.g., white noise.
It is well known that dither can cause biased results and mis-
match in system identification. In [44], this issue is discussed,
and several alternative methods are presented for injecting
dither into system identification schemes to obtain improved
results. Unfortunately, in control applications, one has little
choice about where to inject the probing noise.
To see the deleterious effects of probing noise, consider the Bellman equation (9) with input \hat{u}_k = u_k + d_k, where d_k is a probing noise. One writes

x_k^T \hat{P} x_k = y_k^T Q y_k + \hat{u}_k^T R \hat{u}_k + \hat{x}_{k+1}^T \hat{P} \hat{x}_{k+1}

x_k^T \hat{P} x_k = y_k^T Q y_k + (u_k + d_k)^T R (u_k + d_k) + (A x_k + B u_k + B d_k)^T \hat{P} (A x_k + B u_k + B d_k)

x_k^T \hat{P} x_k = y_k^T Q y_k + u_k^T R u_k + d_k^T R d_k + u_k^T R d_k + d_k^T R u_k + (A x_k + B u_k)^T \hat{P} (A x_k + B u_k) + (A x_k + B u_k)^T \hat{P} B d_k + (B d_k)^T \hat{P} (A x_k + B u_k) + (B d_k)^T \hat{P} B d_k.

Now, use tr{AB} = tr{BA} for commensurate matrices A and B, take expected values to evaluate the correlation matrices, and assume that the dither at time k is white noise independent of u_k and x_k so that E{R u_k d_k^T} = 0 and E{\hat{P} B d_k (A x_k + B u_k)^T} = 0 and that the cross terms drop out. Then, averaged over repeated control runs with different probing noise sequences d_k, this equation is effectively

x_k^T \hat{P} x_k = y_k^T Q y_k + u_k^T R u_k + (A x_k + B u_k)^T \hat{P} (A x_k + B u_k) + d_k^T (B^T \hat{P} B + R) d_k

which is the undithered Bellman equation plus a term depending on the dither covariance. As such, the solution computed by PI or VI will not correspond to the actual value associated with the Bellman equation.
It is now argued that discounting the cost can significantly reduce the deleterious effects of probing noise. Adding a discount factor \gamma < 1 to the cost (2) results in

V^\mu(x_k) = \sum_{i=k}^{\infty} \gamma^{i-k} \left( y_i^T Q y_i + u_i^T R u_i \right) \equiv \sum_{i=k}^{\infty} \gamma^{i-k} r_i    (59)

with the associated Bellman equation

x_k^T P x_k = y_k^T Q y_k + u_k^T R u_k + \gamma x_{k+1}^T P x_{k+1}.    (60)

This has an HJB (Riccati) equation that is equivalent to

0 = -P + \gamma A^T P A - \gamma A^T P B (R/\gamma + B^T P B)^{-1} B^T P A + C^T Q C    (61)

and an optimal policy of

u_k = -(R/\gamma + B^T P B)^{-1} B^T P A x_k.    (62)
The benefits of discounting are most clearly seen by examining VI. With discount, (24) in VI Algorithm 3 is modified as

x_k^T P^{j+1} x_k = y_k^T Q y_k + (u_k^j)^T R u_k^j + \gamma x_{k+1}^T P^j x_{k+1}    (63)

which corresponds to the underlying Riccati difference equation

P^{j+1} = \gamma A^T P^j A - \gamma A^T P^j B (R/\gamma + B^T P^j B)^{-1} B^T P^j A + C^T Q C    (64)
where each iteration is decayed by the factor \gamma < 1. The effects of this can be best understood by considering the Lyapunov difference equation

P^{j+1} = \gamma A^T P^j A + C^T Q C, \qquad P^0 \ge 0    (65)

which has the solution

P^j = \gamma^j (A^T)^j P^0 A^j + \sum_{i=0}^{j-1} \gamma^i (A^T)^i C^T Q C A^i.    (66)

The effect of the discount factor is thus to decay the effects of the initial conditions.

In a similar vein, the discount factor decays the effects of the previous probing noises and improper initial conditions in the PI and VI algorithms. In fact, adding a discount factor is closely related to adding exponential data weighting in the Kalman filter to remove the bias effects of unmodeled dynamics [18].
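The forgetting effect in (65)-(66) can be seen directly: iterating the discounted Lyapunov recursion from two different initial kernels, the difference between the iterates decays like \gamma^j (A^T)^j (\cdot) A^j. The matrices and discount factor below are illustrative placeholders.

```python
# Sketch: the discounted Lyapunov recursion (65) forgets its initial condition,
# illustrating how the discount factor decays old (e.g., dither-corrupted) information.
# Matrices and the discount factor are illustrative placeholders.
import numpy as np

A = np.array([[0.9, 0.2], [0.0, 0.8]])
C = np.array([[1.0, 0.5]])
Q = np.eye(1)
gamma = 0.5

P_a = np.zeros((2, 2))                 # two different initial kernels P^0
P_b = 10.0 * np.eye(2)
for j in range(1, 11):
    P_a = gamma * A.T @ P_a @ A + C.T @ Q @ C     # recursion (65)
    P_b = gamma * A.T @ P_b @ A + C.T @ Q @ C
    print(f"j = {j:2d}, ||P_a - P_b|| = {np.linalg.norm(P_a - P_b):.2e}")
```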
V. SIMULATIONS
A. Stable Linear System and PI
Consider the stable linear system with quadratic cost function

x_{k+1} = \begin{bmatrix} 1.1 & -0.3 \\ 1 & 0 \end{bmatrix} x_k + \begin{bmatrix} 1 \\ 0 \end{bmatrix} u_k, \qquad y_k = [1 \;\; 0.8] x_k

where Q and R in the cost function are the identity matrices of appropriate dimensions. The open-loop poles are z_1 = 0.5 and z_2 = 0.6. In order to verify the correctness of the proposed algorithm, the optimal value kernel matrix P is found by solving (15) to be

P = \begin{bmatrix} 1.0150 & 0.8150 \\ 0.8150 & 0.6552 \end{bmatrix}.

Now, by using \bar{P} = \begin{bmatrix} M_u^T P M_u & M_u^T P M_y \\ M_y^T P M_u & M_y^T P M_y \end{bmatrix} and (35), one has

\bar{P} = \begin{bmatrix} 1.0150 & 0.8440 & 1.1455 & 0.3165 \\ 0.8440 & 0.7918 & 1.0341 & 0.2969 \\ 1.1455 & 1.0341 & 1.3667 & 0.3878 \\ 0.3165 & 0.2969 & 0.3878 & 0.1113 \end{bmatrix}.

Since the system is stable, we use the OPFB PI Algorithm 4 implemented as in (52) and (53). PE was ensured by adding dithering noise to the control input, and a discount (forgetting) factor \gamma = 0.2 was added to diminish the dither bias effects. The observability index is K = 2, and N is selected to be equal to two.
By applying the dynamic OPFB control (53), the system remained stable, and the parameters of \bar{P} converged to the optimal ones. In fact

\hat{\bar{P}} = \begin{bmatrix} 1.1340 & 0.8643 & 1.1571 & 0.3161 \\ 0.8643 & 0.7942 & 1.0348 & 0.2966 \\ 1.1571 & 1.0348 & 1.3609 & 0.3850 \\ 0.3161 & 0.2966 & 0.3850 & 0.1102 \end{bmatrix}.

In the example, batch LS was used to solve (52) at each step.
Fig. 1. Convergence of p_0, p_u, and p_y.

Fig. 2. Evolution of the system states for the duration of the experiment.
Fig. 1 shows the convergence of p_0 \in R, p_u \in R, and p_y \in R^{1 \times 2} of \bar{P} to the correct values. Fig. 2 shows the evolution of the system states and their convergence to zero.
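As a cross-check of this example, the kernel P reported above can be recomputed from the model; in the sketch below, the off-diagonal entry of A is written with the sign implied by the stated open-loop poles z_1 = 0.5 and z_2 = 0.6, and the computed kernel should be close to the one given in the text.

```python
# Sketch: recompute the value kernel of the Section V-A example from the model.
# The sign of the (1,2) entry of A is inferred from the stated open-loop poles.
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.1, -0.3], [1.0, 0.0]])
B = np.array([[1.0], [0.0]])
C = np.array([[1.0, 0.8]])
Q, R = np.eye(1), np.eye(1)

P = solve_discrete_are(A, B, C.T @ Q @ C, R)   # undiscounted LQR kernel, eq. (15)
print("P =\n", P)          # compare with the kernel reported in the text
print("open-loop poles:", np.linalg.eigvals(A))
```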
B. Q-Learning and OPFB ADP
The purpose of this example is to compare the performance of Q-learning [5], [39] and that of OPFB ADP. Consider the stable linear system described before, and apply Q-learning. The Q-function for this particular system is given by

Q_h(x_k, u_k) = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} Q + A^T P A & A^T P B \\ B^T P A & R + B^T P B \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} \equiv \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T H \begin{bmatrix} x_k \\ u_k \end{bmatrix} = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} 4.9768 & 0.8615 & 2.8716 \\ 0.8615 & 1.2534 & 0.8447 \\ 2.8716 & 0.8447 & 3.8158 \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix}.
Fig. 3. Convergence of H_{11}, H_{12}, and H_{13}.

Fig. 4. Evolution of the system states for the duration of the experiment.

By comparing these two methods, it is seen that Q-learning converges faster than OPFB ADP since fewer parameters are being identified.

Fig. 3 shows the convergence of the three parameters H_{11}, H_{12}, and H_{13} of the H matrix shown before to the correct values. Fig. 4 shows the evolution of the system states.
C. Unstable Linear System and VI
Consider the unstable linear system with quadratic cost function

x_{k+1} = \begin{bmatrix} 1.8 & -0.77 \\ 1 & 0 \end{bmatrix} x_k + \begin{bmatrix} 1 \\ 0 \end{bmatrix} u_k, \qquad y_k = [1 \;\; 0.5] x_k

where Q and R in the cost function are the identity matrices of appropriate dimensions. The open-loop poles are z_1 = 0.7 and z_2 = 1.1, so the system is unstable. In order to verify the correctness of the proposed algorithm, the optimal value kernel matrix P is found by solving (15) to be

P = \begin{bmatrix} 1.3442 & 0.7078 \\ 0.7078 & 0.3756 \end{bmatrix}.

Now, by using \bar{P} = \begin{bmatrix} M_u^T P M_u & M_u^T P M_y \\ M_y^T P M_u & M_y^T P M_y \end{bmatrix} and (35), one has

\bar{P} = \begin{bmatrix} 1.3442 & 0.7465 & 2.4582 & 1.1496 \\ 0.7465 & 0.4271 & 1.3717 & 0.6578 \\ 2.4582 & 1.3717 & 4.4990 & 2.1124 \\ 1.1496 & 0.6578 & 2.1124 & 1.0130 \end{bmatrix}.

Since the system is open-loop unstable, one must use VI, not PI. OPFB VI Algorithm 5 is implemented as in (54) and (55). PE was ensured by adding dithering noise to the control input, and a discount (forgetting) factor \gamma = 0.2 was used to diminish the dither bias effects. The observability index is K = 2, and N is selected to be equal to two.
By applying the dynamic OPFB control (55), the system remained stabilized, and the parameters of \bar{P} converged to the optimal ones. In fact

\hat{\bar{P}} = \begin{bmatrix} 1.3431 & 0.7504 & 2.4568 & 1.1493 \\ 0.7504 & 0.4301 & 1.3730 & 0.6591 \\ 2.4568 & 1.3730 & 4.4979 & 2.1120 \\ 1.1493 & 0.6591 & 2.1120 & 1.0134 \end{bmatrix}.
In the example, batch LS was used to solve (54) at each step.

Fig. 5. Convergence of p_0, p_u, and p_y.

Fig. 6. Evolution of the system states for the duration of the experiment.

Fig. 5 shows the convergence of p_0 \in R, p_u \in R, and p_y \in R^{1 \times 2} of \bar{P}. Fig. 6 shows the evolution of the system states, their boundedness despite the fact that the plant is open-loop unstable, and their convergence to zero.
VI. CONCLUSION
In this paper, we have proposed the implementation of ADP
using only the measured input/output data from a dynamical
system. This is known in control system theory as “OPFB,” as opposed to full state feedback, and corresponds to RL for a class
of POMDPs. Both PI and VI algorithms have been developed
that require only OPFB. An added and surprising benefit is that,
similar to Q-learning, the system dynamics are not needed to
implement these OPFB algorithms, so that they converge to
the optimal controller for completely unknown systems. The
system order, as well as an upper bound on its “observability
index,” must be known. The learned OPFB controller is given
in the form of a polynomial ARMA controller that is equivalent
to the optimal state variable feedback gain. Learning this controller requires the addition of probing noise so that the data are sufficiently rich. This probing noise adds some bias, and in order to mitigate it, a discount factor is added to the cost.
Future research efforts will focus on how to apply the previ-
ous method to nonlinear systems.
REFERENCES
[1] W. Aangenent, D. Kostic, B. de Jager, R. van de Molengraft, and
M. Steinbuch, “Data-based optimal control,” in Proc. Amer. Control Conf.,
Portland, OR, 2005, pp. 1460–1465.
[2] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, “Discrete-time nonlinear
HJB solution using approximate dynamic programming: Convergence
proof,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4,
pp. 943–949, Aug. 2008.
[3] J. K. Bennighof, S. M. Chang, and M. Subramaniam, “Minimum time
pulse response based control of flexible structures,” J. Guid. Control Dyn.,
vol. 16, no. 5, pp. 874–881, 1993.
[4] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming.
Belmont, MA: Athena Scientific, 1996.
[5] S. J. Bradtke, B. E. Ydstie, and A. G. Barto, “Adaptive linear quadratic
control using policy iteration,” in Proc. Amer. Control Conf., Baltimore,
MD, Jun. 1994, pp. 3475–3476.
[6] J. W. Brewer, “Kronecker products and matrix calculus in system theory,”
IEEE Trans. Circuits Syst., vol. CAS-25, no. 9, pp. 772–781, Sep. 1978.
[7] X. Cao, Stochastic Learning and Optimization. Berlin, Germany:
Springer-Verlag, 2007.
[8] K. Doya, “Reinforcement learning in continuous time and space,” Neural
Comput., vol. 12, no. 1, pp. 219–245, Jan. 2000.
[9] K. Doya, H. Kimura, and M. Kawato, “Neural mechanisms of learn-
ing and control,” IEEE Control Syst. Mag., vol. 21, no. 4, pp. 42–54,
Aug. 2001.
[10] G. Hewer, “An iterative technique for the computation of the steady state
gains for the discrete optimal regulator,” IEEE Trans. Autom. Control,
vol. AC-16, no. 4, pp. 382–384, Aug. 1971.
[11] H. Hjalmarsson and M. Gevers, “Iterative feedback tuning: Theory and
applications,” IEEE Control Syst. Mag., vol. 18, no. 4, pp. 26–41, Aug. 1998.
[12] P. Ioannou and B. Fidan, “Advances in design and control,” in Adaptive
Control Tutorial. Philadelphia, PA: SIAM, 2006.
[13] M. Krstic and H. Deng, Stabilization of Nonlinear Uncertain Systems.
Berlin, Germany: Springer-Verlag, 1998.
[14] P. Lancaster and L. Rodman, Algebraic Riccati Equations (Oxford Science
Publications). New York: Oxford Univ. Press, 1995.
[15] A. Lecchini, M. C. Campi, and S. M. Savaresi, “Virtual reference feedback
tuning for two degree of freedom controllers,” Int. J. Adapt. Control
Signal Process., vol. 16, no. 5, pp. 355–371, 2002.
[16] F. L. Lewis, G. Lendaris, and D. Liu, “Special issue on approximate
dynamic programming and reinforcement learning for feedback control,”
IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 896–897,
Aug. 2008.
[17] F. L. Lewis and V. L. Syrmos, Optimal Control. New York: Wiley, 1995.
[18] F. L. Lewis, L. Xie, and D. O. Popa, Optimal and Robust Estimation.
Boca Raton, FL: CRC Press, Sep. 2007.
[19] F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic
programming for feedback control,” IEEE Circuits Syst. Mag., vol. 9,
no. 3, pp. 32–50, Sep. 2009.
[20] Z. H. Li and M. Krstic, “Optimal design of adaptive tracking controllers
for nonlinear systems,” in Proc. ACC, 1997, pp. 1191–1197.
[21] R. K. Lim, M. Q. Phan, and R. W. Longman, “State-Space System
Identification with Identified Hankel Matrix,” Dept. Mech. Aerosp. Eng.,
Princeton Univ., Princeton, NJ, Tech. Rep. 3045, Sep. 1998.
[22] R. K. Lim and M. Q. Phan, “Identification of a multistep-ahead observer
and its application to predictive control,” J. Guid. Control Dyn., vol. 20,
no. 6, pp. 1200–1206, 1997.
[23] P. Mehta and S. Meyn, “Q-learning and Pontryagin’s minimum principle,” 2009, preprint.
[24] W. Powell, Approximate Dynamic Programming: Solving the Curses of
Dimensionality. New York: Wiley, 2007.
[25] D. Prokhorov and D. Wunsch, “Adaptive critic designs,” IEEE Trans.
Neural Netw., vol. 8, no. 5, pp. 997–1007, Sep. 1997.
[26] M. Q. Phan, R. K. Lim, and R. W. Longman, “Unifying Input–Output
and State-Space Perspectives of Predictive Control,” Dept. Mech. Aerosp.
Eng., Princeton Univ., Princeton, NJ, Tech. Rep. 3044, Sep. 1998.
[27] M. G. Safonov and T. C. Tsao, “The unfalsified control concept and
learning,” IEEE Trans. Autom. Control, vol. 42, no. 6, pp. 843–847,
Jun. 1997.
[28] W. Schultz, “Neural coding of basic reward terms of animal learning the-
ory, game theory, microeconomics and behavioral ecology,” Curr. Opin.
Neurobiol., vol. 14, no. 2, pp. 139–147, Apr. 2004.
[29] W. Schultz, P. Dayan, and P. Read Montague, “A neural substrate of
prediction and reward,” Science, vol. 275, no. 5306, pp. 1593–1599,
Mar. 1997.
[30] J. Si, A. Barto, W. Powell, and D. Wunsch, Handbook of Learning and Approximate Dynamic Programming. Englewood Cliffs, NJ: Wiley, 2004.
[31] R. E. Skelton and G. Shi, “Markov Data-Based LQG Control,” Trans.
ASME, J. Dyn. Syst. Meas. Control, vol. 122, no. 3, pp. 551–559, 2000.
[32] J. C. Spall and J. A. Cristion, “Model-free control of nonlinear stochastic
systems with discrete-time measurements,” IEEE Trans. Autom. Control,
vol. 43, no. 9, pp. 1198–1210, Sep. 1998.
[33] R. S. Sutton and A. G. Barto, Reinforcement Learning—An Introduction.
Cambridge, MA: MIT Press, 1998.
[34] R. L. Toussaint, J. C. Boissy, M. L. Norg, M. Steinbuch, and
O. H. Bosgra, “Suppressing non-periodically repeating disturbances in
mechanical servo systems,” in Proc. IEEE Conf. Decision Control,
Tampa, FL, 1998, pp. 2541–2542.
[35] D. Vrabie, K. Vamvoudakis, and F. Lewis, “Adaptive optimal control-
lers based on generalized policy iteration in a continuous-time frame-
work,” in Proc. IEEE Mediterranean Conf. Control Autom., Jun. 2009,
pp. 1402–1409.
[36] K. G. Vamvoudakis and F. L. Lewis, “Online actor critic algorithm
to solve the continuous-time infinite horizon optimal control prob-
lem,” in Proc. Int. Joint Conf. Neural Netw., Atlanta, GA, Jun. 2009,
pp. 3180–3187.
[37] K. Vamvoudakis, D. Vrabie, and F. L. Lewis, “Online policy itera-
tion based algorithms to solve the continuous-time infinite horizon op-
timal control problem,” in Proc. IEEE Symp. ADPRL, Nashville, TN,
Mar. 2009, pp. 36–41.
[38] F. Y. Wang, H. Zhang, and D. Liu, “Adaptive dynamic programming:
An introduction,” IEEE Comput. Intell. Mag., vol. 4, no. 2, pp. 39–47,
May 2009.
[39] C. Watkins, “Learning from delayed rewards,” Ph.D. dissertation, Cam-
bridge Univ., Cambridge, U.K., 1989.
[40] P. J. Werbos, “Beyond regression: New tools for prediction and analysis
in the behavior sciences,” Ph.D. dissertation, Harvard Univ., Cambridge,
MA, 1974.
[41] P. J. Werbos, “Approximate dynamic programming for real-time control
and neural modeling,” in Handbook of Intelligent Control, D. A. White
and D. A. Sofge, Eds. New York: Van Nostrand Reinhold, 1992.
[42] P. Werbos, “Neural networks for control and system identification,” in
Proc. IEEE CDC, 1989, pp. 260–265.
[43] D. A. White and D. A. Sofge, Eds., Handbook of Intelligent Control. New
York: Van Nostrand Reinhold, 1992.
[44] B. Widrow and E. Walach, Adaptive Inverse Control. Upper Saddle
River, NJ: Prentice-Hall, 1996.
F. L. Lewis (S’78–M’81–SM’86–F’94) received the
B.S. degree in physics/electrical engineering and the
M.S.E.E. degree from Rice University, Houston, TX,
the M.S. degree in aeronautical engineering from the
University of West Florida, Pensacola, and the Ph.D.
degree from the Georgia Institute of Technology,
Atlanta.
He is currently a Distinguished Scholar Professor
and the Moncrief-O’Donnell Chair with the Automa-
tion and Robotics Research Institute (ARRI), The
University of Texas at Arlington, Fort Worth. He
works in feedback control, intelligent systems, distributed control systems, and
sensor networks. He is the author of 216 journal papers, 330 conference papers,
14 books, 44 chapters, and 11 journal special issues and is the holder of six U.S.
patents.
Dr. Lewis is a Fellow of the International Federation of Automatic Control
and the U.K. Institute of Measurement and Control, a Professional Engineer in
Texas, and a U.K. Chartered Engineer. He was the Founding Member of the
Board of Governors of the Mediterranean Control Association. He served on
the National Academy of Engineering Committee on Space Station in 1995.
He is an elected Guest Consulting Professor with both South China University
of Technology, Guangzhou, China, and Shanghai Jiao Tong University, Shang-
hai, China. He received the Fulbright Research Award, the National Science
Foundation Research Initiation Grant, the American Society for Engineering
Education Terman Award, the 2009 International Neural Network Society
Gabor Award, the U.K. Institute of Measurement and Control Honeywell Field
Engineering Medal in 2009, and the Outstanding Service Award from the
Dallas IEEE Section. He was selected as Engineer of the Year by the Fort
Worth IEEE Section and listed in Fort Worth Business Press Top 200 Leaders
in Manufacturing. He helped win the IEEE Control Systems Society Best
Chapter Award (as the Founding Chairman of the Dallas-Fort Worth Chapter),
the National Sigma Xi Award for Outstanding Chapter (as the President of
the University of Texas at Arlington Chapter), and the U.S. Small Business
Administration Tibbets Award in 1996 (as the Director of ARRI’s Small
Business Innovation Research Program).
Kyriakos G. Vamvoudakis (S’02–M’06) was born
in Athens, Greece. He received the Diploma in
Electronic and Computer Engineering (with high-
est honors) from the Technical University of Crete,
Chania, Greece, in 2006 and the M.Sc. degree in
electrical engineering from The University of Texas
at Arlington, in 2008, where he is currently working
toward the Ph.D. degree.
He is also currently a Research Assistant with
the Automation and Robotics Research Institute,
The University of Texas at Arlington. His current
research interests include approximate dynamic programming, neural-network
feedback control, optimal control, adaptive control, and systems biology.
Mr. Vamvoudakis is a member of Tau Beta Pi, Eta Kappa Nu, and Golden
Key honor societies and is listed in Who’s Who in the World. He is a Registered
Electrical/Computer Engineer (Professional Engineer) and a member of the
Technical Chamber of Greece.
... The following theorem proposes sufficient conditions to guarantee the monotonicity of sequence {P [t] } generated by the Riccati (29). ...
... Moreover, based on Lemmas 1 and 2, (29) gives ...
... According to [20], [29], the Q-function in (40) can be expressed in terms of available measured data. ...
Article
In this study, we employ two data-driven approaches to address the secure control problem for cyber-physical systems when facing false data injection attacks. Firstly, guided by zero-sum game theory and the principle of optimality, we derive the optimal control gain, which hinges on the solution of a corresponding algebraic Riccati equation. Secondly, we present sufficient conditions to guarantee the existence of a solution to the algebraic Riccati equation, which constitutes the first major contributions of this paper. Subsequently, we introduce two data-driven Q-learning algorithms, facilitating model-free control design. The second algorithm represents the second major contribution of this paper, as it not only operates without the need for a system model but also eliminates the requirement for state vectors, making it quite practical. Lastly, the efficacy of the proposed control schemes is confirmed through a case study involving an F-16 aircraft.
... This paper was recommended for publication in revised form by Associate Editor Kyriakos G. Vamvoudakis under the direction of Editor Miroslav Krstic. Jiang (2016, 2019), Jiang and Jiang (2012), Lewis and Vamvoudakis (2011), Lewis and Vrabie (2009), Modares et al. (2016), Vamvoudakis and Lewis (2010). Although the explicit system identification process is skipped by using ADP/RL based learning algorithm for solving the optimal control problem, yet accurate state data is still required in Bian and Jiang (2016), Chen et al. (2019), , Jiang and Jiang (2012), Jiang et al. (2018), Luo et al. (2018) to establish the iterative learning equation. ...
... In order to expand the application of ADP/RL based learning methods, several practical issues are further taken into account, for instance, control with output feedback Lewis & Vamvoudakis, 2011;Modares et al., 2016;Peng et al., 2020;Rizvi & Lin, 2020b;Sun et al., 2019), disturbance rejection Luo et al., 2018), uncertain system dynamics Jiang & Jiang, 2013). In particular, a robust ADP-based learning algorithm with output feedback (OPFB) controller was proposed in to solve the linear quadratic regulation (LQR) problem for linear systems with dynamics uncertainty, which was extended to solve the output regulation problem for discrete-time systems in . ...
... Moreover, the effect on the estimated optimal controller by the existence of the observer error has not been discussed in Rizvi and Lin (2020b). In addition, in most existing ADP-based algorithms, e.g., , , Jiang and Jiang (2012), Lewis and Vamvoudakis (2011), Modares et al. (2016), Rizvi and Lin (2020b), Vamvoudakis and Lewis (2010), the PE condition or the rank condition is usually needed to ensure that the collected learning data is rich enough. ...
Article
In this paper, we present an approximate optimal dynamic output feedback control learning algorithm to solve the linear quadratic regulation problem for unknown linear continuous-time systems. First, a dynamic output feedback controller is designed by constructing the internal state. Then, an adaptive dynamic programming based learning algorithm is proposed to estimate the optimal feedback control gain by only accessing the input and output data. By adding a constructed virtual observer error into the iterative learning equation, the proposed learning algorithm with the new iterative learning equation is immune to the observer error. In addition, the value iteration based learning equation is established without storing a series of past data, which could lead to a reduction of demands on the usage of memory storage. Besides, the proposed algorithm eliminates the requirement of repeated finite window integrals, which may reduce the computational load. Moreover, the convergence analysis shows that the estimated control policy converges to the optimal control policy. Finally, a physical experiment on an unmanned quadrotor is given to illustrate the effectiveness of the proposed approach.
Article
The objective of this research is to enable safety‐critical systems to simultaneously learn and execute optimal control policies in a safe manner to achieve complex autonomy. Learning optimal policies via trial and error, that is, traditional reinforcement learning, is difficult to implement in safety‐critical systems, particularly when task restarts are unavailable. Safe model‐based reinforcement learning techniques based on a barrier transformation have recently been developed to address this problem. However, these methods rely on full‐state feedback, limiting their usability in a real‐world environment. In this work, an output‐feedback safe model‐based reinforcement learning technique based on a novel barrier‐aware dynamic state estimator has been designed to address this issue. The developed approach facilitates simultaneous learning and execution of safe control policies for safety‐critical linear systems. Simulation results indicate that barrier transformation is an effective approach to achieve online reinforcement learning in safety‐critical systems using output feedback.
Article
This article proposes a novel data-driven framework of distributed optimal consensus for discrete-time linear multi-agent systems under general digraphs. A fully distributed control protocol is designed using the linear quadratic regulator approach and is proved, through dynamic programming and the minimum principle, to be a necessary and sufficient condition for optimal control of multi-agent systems. Moreover, the control protocol can be constructed from local information with the aid of the solution of the algebraic Riccati equation (ARE). Based on the Q-learning method, a reinforcement learning framework is presented to find the solution of the ARE in a data-driven way, in which information collected from an arbitrary follower suffices to learn the feedback gain matrix. Thus, the multi-agent system can achieve distributed optimal consensus even when the system dynamics and global information are completely unavailable. For the output feedback case, accurate state estimation is established so that optimal consensus control is still realized. Furthermore, the data-driven optimal consensus method designed in this article is applicable to any general digraph that contains a directed spanning tree. Finally, numerical simulations verify the validity of the proposed optimal control protocols and the data-driven framework.
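As a concrete, simplified illustration of finding the ARE solution by Q-learning from data, the sketch below runs Bradtke-style Q-learning policy iteration for a single-agent LQR problem; the plant is used only to generate data, and the consensus, graph, and output feedback aspects of the cited work are omitted.

```python
import numpy as np

# Simplified single-agent sketch of Q-learning policy iteration for discrete-time LQR.
# The plant (A, B) is used only to simulate data; the learning update itself uses only
# measured (x_k, u_k, x_{k+1}) samples and the stage cost, never the model.
np.random.seed(0)
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])                 # illustrative, open-loop stable plant
B = np.array([[0.0],
              [1.0]])
Q, R = np.eye(2), np.eye(1)
n, m = 2, 1

K = np.zeros((m, n))                       # initial (stabilizing) policy u = -K x
for _ in range(15):                        # policy iteration loop
    Phi, c = [], []
    x = np.random.randn(n)
    for _ in range(80):                    # collect transitions under exploration noise
        u = -K @ x + 0.1 * np.random.randn(m)
        x_next = A @ x + B @ u
        z = np.concatenate([x, u])                       # Q-function argument [x; u]
        z_next = np.concatenate([x_next, -K @ x_next])   # follow the current policy afterwards
        Phi.append(np.kron(z, z) - np.kron(z_next, z_next))
        c.append(x @ Q @ x + u @ R @ u)                  # measured stage cost
        x = x_next
    vecH, *_ = np.linalg.lstsq(np.array(Phi), np.array(c), rcond=None)
    H = vecH.reshape(n + m, n + m)
    H = 0.5 * (H + H.T)                    # Q(x, u) = [x; u]^T H [x; u]
    K = np.linalg.solve(H[n:, n:], H[n:, :n])   # policy improvement
print(np.round(K, 4))                      # approximates the LQR gain for (A, B, Q, R)
```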
Chapter
In control system design, a common objective is to find a stabilizing controller that ensures the system’s output tracks a desired reference trajectory. Optimal control theory aims to achieve this goal by determining a control law that not only stabilizes the error dynamics but also minimizes a predefined performance index. Reinforcement learning (RL) algorithms have proven to be effective in solving the optimal tracking control problem (OTCP) for both discrete-time (Dierks and Jagannathan 2009; Wang et al. 2012; Modares et al. 2014) and continuous-time systems (Zhang et al. 2011). RL algorithms not only learn optimal tracking control solutions but also stabilize the tracking error systems.
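For context, one common discrete-time formulation of the OTCP performance index (a generic sketch; the discount factor and weights are assumptions, not those of the cited works) is

\[
V(e_0) \;=\; \sum_{k=0}^{\infty} \gamma^{k}\!\left( e_k^{\top} Q\, e_k + u_k^{\top} R\, u_k \right),
\qquad e_k = y_k - r_k ,
\]

where \(r_k\) is the reference trajectory, \(e_k\) the tracking error, and \(0<\gamma\le 1\) a discount factor that keeps the cost bounded for references that do not decay to zero.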
Book
Reinforcement learning (RL) and adaptive dynamic programming (ADP) have been among the most critical research fields in science and engineering for modern complex systems. This book describes the latest RL and ADP techniques for decision and control in human-engineered systems, covering both single-player decision and control and multi-player games. Edited by pioneers of RL and ADP research, the book brings together ideas and methods from many fields and provides important and timely guidance on controlling a wide variety of systems, such as robots, industrial processes, and economic decision-making. © 2013 The Institute of Electrical and Electronics Engineers, Inc.
Article
State estimation is a fundamental component of modern control theory. In discrete-time format, the standard state estimator (observer) is one step ahead. It provides one-step-ahead estimation of the system state on the basis of information available at the current time step. To obtain multistep-ahead estimation, one can repeatedly propagate the one-step estimation a number of time steps into the future, but this process tends to accumulate errors from one propagation to the next. A multistep observer, which directly estimates the state of the system at some specified time step in the future, is identified directly from I/O data. One possible application of this multistep-ahead observer is in receding-horizon predictive control, which bases its present control action on a prediction of the system response at some time step in the future. It is possible to recover the usual one-step-ahead state-space model of the system from the identified multistep-ahead observer as well, although a stabilizing feedback controller can be designed directly from the identified observer. Numerical examples are used to illustrate the key identification and control aspects of this multistep-ahead observer concept.
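To make the identification-from-I/O idea concrete, the following sketch fits a d-step-ahead output predictor by least squares directly from input/output sequences, in the spirit of a multistep observer; the simulated plant, window length, and horizon are assumptions, and only the measured I/O data enter the fit.

```python
import numpy as np

# Identify a d-step-ahead output predictor directly from I/O data: the regression maps a
# window of past outputs/inputs plus the intervening future inputs to the output d steps
# ahead. The plant below exists only to generate the data.
np.random.seed(1)
A = np.array([[0.95, 0.10],
              [-0.10, 0.90]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
p, d, N = 4, 3, 400                        # past window, prediction horizon, data length

x, ys, us = np.zeros(2), [], []
for _ in range(N + d):
    u = np.random.randn()
    ys.append((C @ x).item())
    us.append(u)
    x = A @ x + B[:, 0] * u

Phi, Y = [], []
for k in range(p - 1, N - 1):
    past = ys[k - p + 1:k + 1] + us[k - p + 1:k + 1]      # past outputs and inputs
    future_u = us[k + 1:k + d]                            # inputs applied before step k + d
    Phi.append(past + future_u)
    Y.append(ys[k + d])
Theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(Y), rcond=None)
# Theta now maps a window of measured I/O data directly to the output d steps ahead.
```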
Book
Preface Acknowledgements List of acronyms 1. Introduction 2. Parametric models 3. Parameter identification: continuous time 4. Parameter identification: discrete time 5. Continuous-time model reference adaptive control 6. Continuous-time adaptive pole placement control 7. Adaptive control for discrete-time systems 8. Adaptive control of nonlinear systems Appendix Bibliography Index.
Article
This paper presents the pulse response based control method for minimum time control of structures. An explicit model of the structure is not needed for this method, because the structure is represented in terms of its measured response to pulses in the control inputs. The minimum time control problem is solved by finding the minimal number of time steps for which a control history exists that consists of a train of pulses, satisfies the input bounds, and results in the desired outputs at the end of the control task. There is no modal truncation in the pulse response representation of the response, because all modes contribute to the pulse response data. The precision with which the final state of the system can be specified using pulse response based control is limited only by the observability of the system with the given set of outputs. A special algorithm for solving the numerical optimization problem arising in pulse response based control is presented, and the effect of measurement noise on the accuracy of the final predicted outputs is investigated. A numerical example demonstrates the effectiveness of pulse response based control and the algorithm used to implement it. The method is applied to linear problems in this paper.
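To make the minimum-time idea concrete, the following sketch (an assumption-laden toy, not the paper's algorithm or structural model) increases the horizon until a bounded pulse train that reaches a desired final output exists, checking feasibility with a linear program.

```python
import numpy as np
from scipy.optimize import linprog

# Toy minimum-time search: find the smallest horizon N for which a bounded input sequence
# drives a double integrator to a desired final output. Each column of G is the pulse
# response A^(N-1-j) B, i.e. the effect of a unit pulse applied at step j.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.5 * dt**2], [dt]])
target = np.array([1.0, 0.0])              # desired final position and velocity
u_max = 0.5                                # illustrative input bound
for N in range(1, 200):
    G = np.hstack([np.linalg.matrix_power(A, N - 1 - j) @ B for j in range(N)])
    res = linprog(c=np.zeros(N), A_eq=G, b_eq=target,
                  bounds=[(-u_max, u_max)] * N, method="highs")
    if res.success:                        # first feasible horizon is the minimum time
        print(f"minimum number of steps: {N}")
        print(np.round(res.x, 3))
        break
```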
Article
Chapter outline: A Dynamic Programming Example: A Shortest Path Problem; The Three Curses of Dimensionality; Some Real Applications; Problem Classes; The Many Dialects of Dynamic Programming; What is New in this Book?; Bibliographic Notes.
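Since the chapter opens with a shortest-path example, here is a minimal backward Bellman recursion on a made-up four-node graph; node names and edge costs are purely illustrative.

```python
# Backward dynamic programming (Bellman recursion) on a small acyclic shortest-path example.
graph = {                       # edge costs: graph[node] = {successor: cost}
    "A": {"B": 2, "C": 5},
    "B": {"C": 1, "D": 4},
    "C": {"D": 1},
    "D": {},                    # terminal node
}
order = ["D", "C", "B", "A"]    # process nodes in reverse topological order
J = {"D": 0.0}                  # cost-to-go of the terminal node
policy = {}
for node in order[1:]:
    succ, cost = min(graph[node].items(), key=lambda kv: kv[1] + J[kv[0]])
    policy[node] = succ         # greedy successor achieving the Bellman minimum
    J[node] = cost + J[succ]
print(J["A"], policy)           # optimal cost from A and the resulting routing policy
```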
Conference Paper
In this paper we discuss an online algorithm based on policy iteration for learning the continuous-time (CT) optimal control solution with infinite horizon cost for nonlinear systems with known dynamics. We present an online adaptive algorithm implemented as an actor/critic structure which involves simultaneous continuous-time adaptation of both actor and critic neural networks. We call this 'synchronous' policy iteration. A persistence of excitation condition is shown to guarantee convergence of the critic to the actual optimal value function. Novel tuning algorithms are given for both critic and actor networks, with extra terms in the actor tuning law being required to guarantee closed-loop dynamical stability. The convergence to the optimal controller is proven, and stability of the system is also guaranteed. Simulation examples show the effectiveness of the new algorithm.
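For orientation, a generic normalized-gradient critic update of the kind used in such synchronous actor-critic schemes can be sketched as follows (a sketch only; the exact tuning laws, extra stabilizing terms, and notation of the cited paper differ):

\[
e_c = \hat W_c^{\top}\sigma(x,u) + x^{\top}Q x + u^{\top}R u,
\qquad
\dot{\hat W}_c = -\,\alpha_c\,\frac{\sigma(x,u)}{\left(1+\sigma^{\top}\sigma\right)^{2}}\, e_c ,
\qquad
\sigma(x,u) = \nabla\phi(x)\bigl(f(x)+g(x)u\bigr),
\]

where \(\hat W_c\) are critic weights on a value-function basis \(\phi(x)\) and \(e_c\) is the Bellman residual; the actor weights \(\hat W_a\) are tuned simultaneously toward the policy \(\hat u = -\tfrac{1}{2}R^{-1}g^{\top}(x)\nabla\phi^{\top}(x)\hat W_a\), with additional terms in the actor law to guarantee closed-loop stability.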