Reinforcement Learning for Partially Observable Dynamic Processes: Adaptive Dynamic Programming Using Measured Output Data
F. L. Lewis, Fellow, IEEE, and Kyriakos G. Vamvoudakis, Member, IEEE
Abstract—Approximate dynamic programming (ADP) is a class of reinforcement learning methods that have shown their importance in a variety of applications, including feedback control of dynamical systems. ADP generally requires full information about the system internal states, which is usually not available in practical situations. In this paper, we show how to implement ADP methods using only measured input/output data from the system. Linear dynamical systems with deterministic behavior are considered herein, which are systems of great interest in the control system community. In control system theory, these types of methods are referred to as output feedback (OPFB). The stochastic equivalent of the systems dealt with in this paper is a class of partially observable Markov decision processes. We develop both policy iteration and value iteration algorithms that converge to an optimal controller that requires only OPFB. It is shown that, similar to Q-learning, the new methods have the important advantage that knowledge of the system dynamics is not needed for the implementation of these learning algorithms or for the OPFB control. Only the order of the system, as well as an upper bound on its "observability index," must be known. The learned OPFB controller is in the form of a polynomial autoregressive moving-average controller that has equivalent performance with the optimal state variable feedback gain.
Index Terms—Approximate dynamic programming (ADP), data-based optimal control, output feedback (OPFB), policy iteration (PI), value iteration (VI).
I. INTRODUCTION
REINFORCEMENT learning (RL) is a class of methods
used in machine learning to methodically modify the
actions of an agent based on observed responses from its en-
vironment [7], [33]. RL methods have been developed starting
from learning mechanisms observed in mammals [4], [7]–[9],
[24], [28], [29], [33]. Every decision-making organism interacts
with its environment and uses those interactions to improve
its own actions in order to maximize the positive effect of
its limited available resources; this, in turn, leads to better
survival chances. RL is a means of learning optimal behaviors
by observing the response from the environment to nonoptimal
control policies. In engineering terms, RL refers to the learning
approach of an actor or agent that modifies its actions, or
control policies, based on stimuli received in response to its
interaction with its environment. This is based on evaluative
information from the environment and could be called action-
based learning. RL can be applied where standard supervised
learning is not applicable, and requires less a priori knowledge.
In view of the advantages offered by RL methods, a recent
objective of control system researchers is to introduce and
develop RL techniques that result in optimal feedback con-
trollers for dynamical systems that can be described in terms
of ordinary differential equations [4], [30], [38], [41], [42].
This includes most of the human-engineered systems, including
aerospace systems, vehicles, robotic systems, and many classes
of industrial processes.
Optimal control is generally an offline design technique that
requires full knowledge of the system dynamics, e.g., in the
linear system case, one must solve the Riccati equation. On
the other hand, adaptive control is a body of online design
techniques that use measured data along system trajectories
to learn to compensate for unknown system dynamics, dis-
turbances, and modeling errors to provide guaranteed perfor-
mance. Optimal adaptive controllers have been designed using
indirect techniques, whereby the unknown plant is first identi-
fied and then a Riccati equation is solved [12]. Inverse adaptive
controllers have been provided that optimize a performance
index, meaningful but not of the designer’s choice [13], [20].
Direct adaptive controllers that converge to optimal solutions
for unknown systems given a performance index selected by the designer have
generally not been developed.
Applications of RL to feedback control are discussed in [25],
[28], [43], and in the recent IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS
(SMC-B) Special Issue [16]. Recent surveys are given by Lewis
and Vrabie [19] and Wang et al. [38]. Temporal difference
RL methods [4], [7], [24], [28], [33], [41], namely, policy
iteration (PI) and value iteration (VI), have been developed
to solve online the Hamilton–Jacobi–Bellman (HJB) equation
associated with the optimal control problem. Such methods
require measurement of the entire state vector of the dynamical
system to be controlled.
PI refers to a class of algorithms built as a two-step iteration:
policy evaluation and policy improvement. Instead of trying a
direct approach to solving the HJB equation, the PI algorithm
starts by evaluating the cost/value of a given initial admissible
(stabilizing) controller. The cost associated with this policy
is then used to obtain a new improved control policy (i.e., a
policy that will have a lower associated cost than the previous
one). This is often accomplished by minimizing a Hamiltonian
function with respect to the new cost. The resulting policy is
thus obtained based on a greedy policy update with respect
to the new cost. These two steps of policy evaluation and
policy improvement are repeated until the policy improvement
step no longer changes the actual policy, and convergence to
the optimal controller is achieved. One must note that the
infinite horizon cost associated with a given policy can only be
evaluated in the case of an admissible control policy, meaning
that the control policy must be stabilizing. The difficulty of the
algorithm comes from the required effort of finding an initial
admissible control policy.
VI algorithms, as introduced by Werbos, are actor–critic on-
line learning algorithms that solve the optimal control problem
without the requirement of an initial stabilizing control policy
[40]–[42]. Werbos referred to the family of VI algorithms as
approximate dynamic programming (ADP) algorithms. These
algorithms use a critic neural network (NN) for value function
approximation (VFA) and an actor NN for the approximation
of the control policy. Actor–critic designs using NNs for the
implementation of VI are detailed in [25]. Furthermore, the
SMC-B Special Issue gives a good overview on applications
of ADP and RL to feedback control [16].
Most of the PI and VI algorithms require at least some
knowledge of the system dynamics and measurement of the
entire internal state vector that describes the dynamics of the
system/environment (e.g., [2], [7], [24], [28], [33], [35]–[37],
and [41]). The so-called Q-learning class of algorithms [5],
[39] (called action-dependent heuristic dynamic programming
(HDP) by Werbos [41], [42]) does not require exact or explicit
description of the system dynamics, but it still uses full state
measurement in the feedback control loop. From a control
system engineering perspective, this latter requirement may be
hard to fulfill as measurements of the entire state vector may not
be available and/or may be difficult and expensive to obtain.
Although various control algorithms (e.g., state feedback) re-
quire full state knowledge, in practical implementations, taking
measurements of the entire state vector is often not feasible. The
state vector is generally estimated based on limited information
about the system available by measuring the system’s outputs.
State estimation techniques have been proposed (e.g., [18],
[21], and [26]). These generally require a known model of the
system dynamics.
In real life, it is difficult to design and implement optimal
estimators because the system dynamics and the noise statistics
are not exactly known. However, information about the system
and noise is included in a long-enough set of input/output data.
It would be desirable to be able to design an estimator by using
input/output data without any system knowledge or noise iden-
tification. Such techniques belong to the field of data-based con-
trol techniques, where the control input depends on input/output
data measured directly from the plant. These techniques are as
follows: data-based predictive control [22], unfalsified control
[27], Markov data-based linear quadratic Gaussian control [31],
disturbance-based control [34], simultaneous perturbation sto-
chastic approximation [32], pulse-response-based control [3],
iterative feedback tuning [11], and virtual reference feedback
tuning [15]. In [1], data-based optimal control was achieved
through identification of Markov parameters.
In this paper, novel output-feedback (OPFB) ADP algorithms
are derived for affine in the control input linear time-invariant
(LTI) deterministic systems. Such systems have, as stochastic
equivalent, the partially observable Markov decision processes
(POMDPs). In this paper, data-based optimal control is im-
plemented online using novel PI and VI ADP algorithms that
require only reduced measured information available at the
system outputs. These two classes of OPFB algorithms do not
require any knowledge of the system dynamics (A, B, C) and,
as such, are similar to Q-learning [5], [39], [41], [42], but they
have the added advantage of requiring only measurements of
input/output data and not the full system state. In order to ensure
that the data set is sufficiently rich and linearly independent,
there is a need to add probing noise to the control input (cf. [5]).
We discuss this issue and show that probing noise leads to bias.
Adding a discount factor to the cost reduces this bias to a nearly
negligible effect. This discount factor is related to adding exponential
data weighting in the Kalman filter to remove the bias effects of
unmodeled dynamics [18].
This paper is organized as follows. Section II provides the
background of optimal control problem, dynamic program-
ming, and RL methods for the linear quadratic regulation
problem. Section III introduces the new class of PI and VI
algorithms by formulating the temporal difference error with
respect to the observed data and redefining the control sequence
as the output of a dynamic polynomial controller. Section IV
discusses the implementation aspects of the OPFB ADP al-
gorithms. For convergence, the new algorithms require per-
sistently exciting probing noise whose bias effect is canceled
by using a discounted cost function. Section V presents the
simulation results obtained using the new data-based ADP
algorithms and is followed by concluding remarks.
II. BACKGROUND
In this section, we give a review of optimal control, dynamic
programming, and RL methods [i.e., PI and VI] for the lin-
ear quadratic regulator (LQR). It is pointed out that both of
these methods employ contraction maps to solve the Bellman
equation, which is a fixed-point equation [23]. Both PI and VI
methods require full measurements of the entire state vector.
In the next section, we will show how to implement PI and VI
using only reduced information available at the system outputs.
A. Dynamic Programming and the LQR
Consider the linear TI discrete-time (DT) system

x_{k+1} = A x_k + B u_k, \qquad y_k = C x_k    (1)

where x_k \in R^n is the state, u_k \in R^m is the control input, and y_k \in R^p is the measured output. Assume throughout that (A, B) is controllable and (A, C) is observable [17].
Given a stabilizing control policy u_k = \mu(x_k), associate to the system the performance index

V^\mu(x_k) = \sum_{i=k}^{\infty} \left( y_i^T Q y_i + u_i^T R u_i \right) \equiv \sum_{i=k}^{\infty} r_i    (2)

with weighting matrices Q = Q^T \ge 0, R = R^T > 0, and (A, \sqrt{Q}\,C) being observable. Note that u_k = \mu(x_k) is the fixed policy. The utility is

r_k = y_k^T Q y_k + u_k^T R u_k.    (3)
The optimal control problem [17] is to find the policy u_k = \mu(x_k) that minimizes the cost (2) along the trajectories of the system (1). Due to the special structure of the dynamics and the cost, this is known as the LQR problem.

A difference equation that is equivalent to (2) is given by the Bellman equation

V^\mu(x_k) = y_k^T Q y_k + u_k^T R u_k + V^\mu(x_{k+1}).    (4)

The optimal cost, or value, is given by

V^*(x_k) = \min_{\mu} \sum_{i=k}^{\infty} \left( y_i^T Q y_i + u_i^T R u_i \right).    (5)

According to Bellman's optimality principle, the value may be determined using the HJB equation

V^*(x_k) = \min_{u_k} \left[ y_k^T Q y_k + u_k^T R u_k + V^*(x_{k+1}) \right]    (6)

and the optimal control is given by

\mu^*(x_k) = \arg\min_{u_k} \left[ y_k^T Q y_k + u_k^T R u_k + V^*(x_{k+1}) \right].    (7)
For the LQR case, any value is quadratic in the state so that the cost associated to any policy u_k = \mu(x_k) (not necessarily optimal) is

V^\mu(x_k) = x_k^T P x_k    (8)

for some n \times n matrix P. Substituting this into (4), one obtains the LQR Bellman equation

x_k^T P x_k = y_k^T Q y_k + u_k^T R u_k + x_{k+1}^T P x_{k+1}.    (9)

If the policy is a linear state variable feedback so that

u_k = \mu(x_k) = -K x_k    (10)

then the closed-loop system is

x_{k+1} = (A - BK) x_k \equiv A_c x_k.    (11)

Inserting these equations into (9) and averaging over all state trajectories yield the Lyapunov equation

0 = (A - BK)^T P (A - BK) - P + C^T Q C + K^T R K.    (12)

If the feedback K is stabilizing and (A, C) is observable, there exists a positive definite solution to this equation. Then, the Lyapunov solution gives the value of using the state feedback K, i.e., the solution of this equation gives the kernel P such that V^\mu(x_k) = x_k^T P x_k.

To find the optimal control, insert (1) into (9) to obtain

x_k^T P x_k = y_k^T Q y_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k).    (13)

To determine the minimizing control, set the derivative with respect to u_k equal to zero to obtain

u_k = -(R + B^T P B)^{-1} B^T P A x_k    (14)

whence substitution into (12) yields the Riccati equation

0 = A^T P A - P + C^T Q C - A^T P B (R + B^T P B)^{-1} B^T P A.    (15)

This is the LQR Riccati equation, which is equivalent to the HJB equation (6).
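To make (14)-(15) concrete, the following minimal sketch (in Python, with illustrative placeholder matrices rather than any system from this paper) solves the Riccati equation numerically and forms the corresponding optimal feedback gain.

```python
# Sketch: numerical solution of the LQR Riccati equation (15) and gain (14).
# The system matrices here are illustrative placeholders.
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.5]])
Q = np.eye(1)   # output weighting
R = np.eye(1)   # control weighting

# Solve 0 = A^T P A - P + C^T Q C - A^T P B (R + B^T P B)^{-1} B^T P A
P = solve_discrete_are(A, B, C.T @ Q @ C, R)

# Optimal feedback u_k = -K x_k with K = (R + B^T P B)^{-1} B^T P A
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
print("P =\n", P)
print("K =", K)
```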
B. Temporal Difference, PI, and VI
It is well known that the optimal value and control can be
determined online in real time using temporal difference RL
methods [7], [24], [28], [33], [41], which rely on solving online
for the value that makes small the so-called Bellman temporal difference error

e_k = -V^\mu(x_k) + y_k^T Q y_k + u_k^T R u_k + V^\mu(x_{k+1}).    (16)
The temporal difference error is defined based on the Bellman
equation (4). For use in any practical real-time algorithm, the
value should be approximated by a parametric structure [4],
[41], [42].
For the LQR case, the value is quadratic in the state, and the Bellman temporal difference error is

e_k = -x_k^T P x_k + y_k^T Q y_k + u_k^T R u_k + x_{k+1}^T P x_{k+1}.    (17)

Given a control policy, the solution of this equation is equivalent to solving the Lyapunov equation (12) and gives the kernel P such that V^\mu(x_k) = x_k^T P x_k.
Write (6) equivalently as

0 = \min_{u_k} \left[ -V^*(x_k) + y_k^T Q y_k + u_k^T R u_k + V^*(x_{k+1}) \right].    (18)
This is a fixed-point equation. As such, it can be solved by
the method of successive approximation using a contraction
map. The successive approximation method resulting from this
fixed-point equation is known as PI, an iterative method of
determining the optimal value and policy. For the LQR, PI is
performed by the following two steps based on the temporal
difference error (17) and a policy update step based on (7).
Algorithm 1—PI

Select a stabilizing initial control policy u_k^0 = \mu^0(x_k). Then, for j = 0, 1, ..., perform until convergence:

1. Policy Evaluation: Using the policy u_k^j = \mu^j(x_k) in (1), solve for P^{j+1} such that

0 = -x_k^T P^{j+1} x_k + y_k^T Q y_k + (u_k^j)^T R u_k^j + x_{k+1}^T P^{j+1} x_{k+1}.    (19)
2. Policy Improvement:

u_k^{j+1} = \mu^{j+1}(x_k) = \arg\min_{u_k} \left[ y_k^T Q y_k + u_k^T R u_k + x_{k+1}^T P^{j+1} x_{k+1} \right].    (20)
The solution in the policy evaluation step is generally carried
out in a least squares (LS) sense. The initial policy is required to
be stable since only then does (19) have a meaningful solution.
This algorithm is equivalent to the following, which uses
the Lyapunov equation (12), instead of the Bellman equation,
and (14).
Algorithm 2—PI Lyapunov Iteration Equivalent

Select a stabilizing initial control policy K^0. Then, for j = 0, 1, ..., perform until convergence:

1. Policy Evaluation:

0 = (A - BK^j)^T P^{j+1} (A - BK^j) - P^{j+1} + C^T Q C + (K^j)^T R K^j.    (21)

2. Policy Improvement:

K^{j+1} = (R + B^T P^{j+1} B)^{-1} B^T P^{j+1} A.    (22)
It was shown in [4] that, under general conditions, the policy \mu^{j+1} obtained by (20) is improved over \mu^j in the sense that V^{\mu^{j+1}}(x_k) \le V^{\mu^j}(x_k). It was shown by Hewer [10] that Algorithm 2 converges under the controllability/observability assumptions if the initial feedback gain is stabilizing.

Note that, in PI Algorithm 2, the system dynamics (A, B) are required for the policy evaluation step, while in PI Algorithm 1, they are not. Algorithm 2 is performed offline knowing the state dynamics.
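A minimal offline sketch of Algorithm 2 (Hewer's iteration) is given below; the system matrices and the initial stabilizing gain are illustrative placeholders, and the Lyapunov equation (21) is solved with a standard solver.

```python
# Sketch of Algorithm 2 (model-based PI / Hewer's iteration).
# System matrices and the initial stabilizing gain K0 are placeholders.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.5]])
Q, R = np.eye(1), np.eye(1)

K = np.zeros((1, 2))          # K0 = 0 is stabilizing here since A is stable
for j in range(20):
    Ac = A - B @ K
    # Policy evaluation (21): solve Ac^T P Ac - P + C^T Q C + K^T R K = 0
    P = solve_discrete_lyapunov(Ac.T, C.T @ Q @ C + K.T @ R @ K)
    # Policy improvement (22)
    K_new = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    if np.allclose(K_new, K, atol=1e-10):
        break
    K = K_new
print("Converged gain K =", K)
```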
On the other hand, Algorithm 1 is performed online in real time as the data (x_k, r_k, x_{k+1}) are measured at each time step, with r_k = y_k^T Q y_k + u_k^T R u_k being the utility. Note that (19) is a scalar equation, whereas the value kernel P is a symmetric n \times n matrix with n(n+1)/2 independent elements. Therefore, n(n+1)/2 data sets are required before (19) can be solved. This is a standard problem in LS estimation. The policy evaluation step may be performed using batch LS, as enough data are collected along the system trajectory, or using recursive LS (RLS). The dynamics (A, B) are not required for this, since the state x_{k+1} is measured at each step, which contains implicit information about the dynamics (A, B). This procedure amounts to a stochastic approximation method that evaluates the performance of a given policy along one sample path, e.g., the system trajectory. PI Algorithm 1 effectively provides a method for solving the Riccati equation (15) online using data measured along the system trajectories. Full state measurements of x_k are required.
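The following sketch illustrates the policy evaluation step (19) of Algorithm 1 by batch LS. The n(n+1)/2 independent entries of P are estimated from the quadratic terms of the measured states; the system, the fixed policy gain, and the data-collection scheme are illustrative assumptions, not the paper's setup.

```python
# Sketch: policy evaluation (19) of Algorithm 1 by batch least squares.
# System, policy gain, and data collection are illustrative placeholders.
import numpy as np

A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.5]])
Q, R = np.eye(1), np.eye(1)
K = np.array([[0.1, 0.3]])            # fixed stabilizing policy u = -K x
n = A.shape[0]

def quad_basis(x):
    # independent quadratic terms of x^T P x for symmetric P (off-diagonals doubled)
    return np.array([x[i] * x[j] * (1.0 if i == j else 2.0)
                     for i in range(n) for j in range(i, n)])

Phi, targets = [], []
rng = np.random.default_rng(0)
for _ in range(20):                   # several short trajectories for data richness
    x = rng.standard_normal(n)
    for _ in range(10):
        u = -K @ x
        y = C @ x
        x_next = A @ x + B @ u
        r = float(y @ Q @ y + u @ R @ u)
        Phi.append(quad_basis(x) - quad_basis(x_next))   # regressor implied by (19)
        targets.append(r)
        x = x_next

theta = np.linalg.lstsq(np.array(Phi), np.array(targets), rcond=None)[0]
P = np.zeros((n, n))                  # unpack the symmetric kernel
idx = 0
for i in range(n):
    for j in range(i, n):
        P[i, j] = P[j, i] = theta[idx]
        idx += 1
print("Estimated value kernel P =\n", P)
```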
A second class of algorithms for the online iterative solution
of the optimal control problem based on the Bellman temporal
difference error (17) is given by VI or HDP [41]. Instead of
the fixed-point equation in the form of (18), which leads to PI,
consider (6)

V^*(x_k) = \min_{u_k} \left[ y_k^T Q y_k + u_k^T R u_k + V^*(x_{k+1}) \right]    (23)

which is also a fixed-point equation. As such, it can be solved using a contraction map by successive approximation using the VI algorithm.
Algorithm 3—VI

Select an initial control policy K^0. Then, for j = 0, 1, ..., perform until convergence:

1. Value Update:

x_k^T P^{j+1} x_k = y_k^T Q y_k + (u_k^j)^T R u_k^j + x_{k+1}^T P^j x_{k+1}.    (24)

2. Policy Improvement:

u_k^{j+1} = \mu^{j+1}(x_k) = \arg\min_{u_k} \left[ y_k^T Q y_k + u_k^T R u_k + x_{k+1}^T P^{j+1} x_{k+1} \right].    (25)
Since (23) is implemented as a recursion rather than solved as an equation, a stabilizing policy is not required. Therefore, the VI algorithm does not require an initial stabilizing gain. It is equivalent to a matrix recursion knowing the system dynamics (A, B), which has been shown to converge in [14].
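For reference, a minimal sketch of the matrix recursion equivalent to VI Algorithm 3 is shown below, started from P^0 = 0 so that no stabilizing gain is needed; the system matrices are illustrative placeholders.

```python
# Sketch of the matrix recursion equivalent to VI (Algorithm 3), started from P0 = 0.
# System matrices are illustrative placeholders.
import numpy as np

A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.5]])
Q, R = np.eye(1), np.eye(1)

P = np.zeros((2, 2))                  # no initial stabilizing gain required
for j in range(200):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)      # greedy policy, cf. (25)
    P_next = C.T @ Q @ C + A.T @ P @ A - A.T @ P @ B @ K   # value update, cf. (24)
    if np.max(np.abs(P_next - P)) < 1e-12:
        break
    P = P_next
print("VI limit P =\n", P)
```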
Note that the policy update steps in PI Algorithm 1 and VI
Algorithm 3 rely on the hypothesis of the VFA parameterization
(8). Both algorithms require knowledge of the dynamics (A, B)
for the policy improvement step. In [2], it is shown that if a
second approximator structure is assumed for the policy, then
only the B matrix is needed for the policy improvement in
Algorithm 1 or 3.
An approach that provides online real-time algorithms for the
solution of the optimal control problem without knowing any
system dynamics is Q-learning. This has been applied to both
PI [5] and VI, where it is known as action-dependent HDP [41].
All these methods require measurement of the full state x_k \in R^n.
III. TEMPORAL DIFFERENCE, PI, AND VI BASED ON OPFB
This section presents the new results of this paper. The
Bellman error (17) for LQR is quadratic in the state. This can be used in a PI or VI algorithm for online learning of optimal controls as long as full measurements of the state x_k \in R^n are available. In this section, we show how to write the Bellman temporal difference error in terms only of the observed data, namely, the input sequence u_k and the output sequence y_k.
The main results of the following equations are (45) and (46),
which give a temporal difference error in terms only of the
observed output data.
To reformulate PI and VI in terms only of the observed
data, we show how to write V^\mu(x_k) = x_k^T P x_k as a quadratic
form in terms of the input and output sequences. A surprising
benefit is that there result two algorithms for RL that do not
require any knowledge of the system dynamics (A, B, C). That
is, these algorithms have the same advantage as Q-learning
in not requiring knowledge of the system dynamics, yet they
have the added benefit of requiring only measurements of the
available input/output data, not the full system state as required
by Q-learning.
A. Writing the Value Function in Terms of Available
Measured Data
Consider the deterministic linear TI system

x_{k+1} = A x_k + B u_k, \qquad y_k = C x_k    (26)

with x_k \in R^n, u_k \in R^m, and y_k \in R^p. Assume that (A, B) is controllable and (A, C) is observable [17]. Controllability is a property of the matrices (A, B) and means that any initial state can be driven to any desired final state. Observability is a property of (A, C) and means that observations of the output y_k over a long-enough time horizon can be used to reconstruct the full state x_k. Given the current time k, the dynamics can be written on a time horizon [k-N, k] as the expanded state equation
x_k = A^N x_{k-N} + [B \;\; AB \;\; A^2B \;\; \cdots \;\; A^{N-1}B] \begin{bmatrix} u_{k-1} \\ u_{k-2} \\ \vdots \\ u_{k-N} \end{bmatrix}    (27)

\begin{bmatrix} y_{k-1} \\ y_{k-2} \\ \vdots \\ y_{k-N} \end{bmatrix} = \begin{bmatrix} CA^{N-1} \\ \vdots \\ CA \\ C \end{bmatrix} x_{k-N} + \begin{bmatrix} 0 & CB & CAB & \cdots & CA^{N-2}B \\ 0 & 0 & CB & \cdots & CA^{N-3}B \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & CB \\ 0 & 0 & \cdots & 0 & 0 \end{bmatrix} \begin{bmatrix} u_{k-1} \\ u_{k-2} \\ \vdots \\ u_{k-N} \end{bmatrix}    (28)

or, by appropriate definition of variables, as

x_k = A^N x_{k-N} + U_N u_{k-1,k-N}    (29)

y_{k-1,k-N} = V_N x_{k-N} + T_N u_{k-1,k-N}    (30)

with U_N = [B \;\; AB \;\; \cdots \;\; A^{N-1}B] being the controllability matrix and

V_N = \begin{bmatrix} CA^{N-1} \\ \vdots \\ CA \\ C \end{bmatrix}    (31)

being the observability matrix, where V_N \in R^{pN \times n}. T_N is the Toeplitz matrix of Markov parameters.
The vectors

y_{k-1,k-N} = \begin{bmatrix} y_{k-1} \\ y_{k-2} \\ \vdots \\ y_{k-N} \end{bmatrix} \in R^{pN}, \qquad u_{k-1,k-N} = \begin{bmatrix} u_{k-1} \\ u_{k-2} \\ \vdots \\ u_{k-N} \end{bmatrix} \in R^{mN}

are the input and output sequences over the time interval [k-N, k-1]. They represent the available measured data.
Since (A, C) is observable, there exists a K, the observability index, such that rank(V_N) < n for N < K and rank(V_N) = n for N \ge K. Note that K satisfies Kp \ge n. Let N \ge K. Then, V_N has full column rank n, and there exists a matrix M \in R^{n \times pN} such that

A^N = M V_N.    (32)

This was used in [1] to find the optimal control through identification of Markov parameters.

Since V_N has full column rank, its left inverse is given as

V_N^+ = (V_N^T V_N)^{-1} V_N^T    (33)

so that

M = A^N V_N^+ + Z (I - V_N V_N^+) \equiv M_0 + M_1    (34)

for any matrix Z, with M_0 denoting the minimum norm operator and P_{R(V_N)^\perp} = I - V_N V_N^+ being the projection onto the range perpendicular to V_N.
The following lemma shows how to write the system state in
terms of the input/output data.
Lemma 1: Let the system (26) be observable. Then, the system state is given uniquely in terms of the measured input/output sequences by

x_k = M_0 y_{k-1,k-N} + (U_N - M_0 T_N) u_{k-1,k-N} \equiv M_y y_{k-1,k-N} + M_u u_{k-1,k-N}    (35)

or

x_k = [M_u \;\; M_y] \begin{bmatrix} u_{k-1,k-N} \\ y_{k-1,k-N} \end{bmatrix}    (36)

where M_y = M_0 and M_u = U_N - M_0 T_N, with M_0 = A^N V_N^+, V_N^+ = (V_N^T V_N)^{-1} V_N^T being the left inverse of the observability matrix (31), and N \ge K, where K is the observability index.

Proof: Note that A^N x_{k-N} = M V_N x_{k-N} so that, according to (30)

A^N x_{k-N} = M V_N x_{k-N} = M y_{k-1,k-N} - M T_N u_{k-1,k-N}    (37)

(M_0 + M_1) V_N x_{k-N} = (M_0 + M_1) y_{k-1,k-N} - (M_0 + M_1) T_N u_{k-1,k-N}.    (38)
Note, however, that M_1 V_N = 0 so that M V_N x_{k-N} = M_0 V_N x_{k-N}, and apply M_1 to (30) to see that

0 = M_1 y_{k-1,k-N} - M_1 T_N u_{k-1,k-N}, \quad \forall M_1 \text{ s.t. } M_1 V_N = 0.    (39)

Therefore

A^N x_{k-N} = M_0 V_N x_{k-N} = M_0 y_{k-1,k-N} - M_0 T_N u_{k-1,k-N}    (40)

independently of M_1. Then, from (29)

x_k = M_0 y_{k-1,k-N} + (U_N - M_0 T_N) u_{k-1,k-N} \equiv M_y y_{k-1,k-N} + M_u u_{k-1,k-N}.    (41)

This result expresses x_k in terms of the system inputs and outputs from time k-N to time k-1. Now, we will express the value function in terms of the inputs and outputs.

It is important to note that the system dynamics information (e.g., A, B, and C) must be known to use (36). In fact, M_y = M_0 is given in (34), where V_N^+ depends on A and C. Also, M_u depends on M_0, U_N, and T_N. U_N is given in (27) and (29) in terms of A and B. T_N in (28) and (30) depends on A, B, and C.
In the next step, it is shown how to use the structural dependence in (35) yet avoid knowledge of A, B, and C.

Define the vector of the observed data at time k as

z_{k-1,k-N} = \begin{bmatrix} u_{k-1,k-N} \\ y_{k-1,k-N} \end{bmatrix}.    (42)

Now, one has

V^\mu(x_k) = x_k^T P x_k = z_{k-1,k-N}^T \begin{bmatrix} M_u^T \\ M_y^T \end{bmatrix} P \, [M_u \;\; M_y] \, z_{k-1,k-N}    (43)

V^\mu(x_k) = z_{k-1,k-N}^T \begin{bmatrix} M_u^T P M_u & M_u^T P M_y \\ M_y^T P M_u & M_y^T P M_y \end{bmatrix} z_{k-1,k-N} \equiv z_{k-1,k-N}^T \bar{P} z_{k-1,k-N}.    (44)

Note that u_{k-1,k-N} \in R^{mN}, y_{k-1,k-N} \in R^{pN}, z_{k-1,k-N} \in R^{(m+p)N}, and \bar{P} \in R^{(m+p)N \times (m+p)N}.

Equation (44) expresses the value function at time k as a quadratic form in terms of the past inputs and outputs.

Note that the inner kernel matrix \bar{P} in (44) depends on the system dynamics A, B, and C. In the next section, it is shown how to use RL methods to learn the kernel matrix \bar{P} without knowing A, B, and C.
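The constructions above are easy to check numerically. The sketch below (illustrative system matrices; variable names follow the paper's notation) builds V_N, U_N, T_N, and M_0, reconstructs x_k from the last N inputs and outputs as in (35), and verifies the quadratic form (44) for an arbitrary symmetric kernel P.

```python
# Sketch: verify the state reconstruction (35)-(36) and the data-based
# value form (44) on an illustrative system; P is an arbitrary symmetric kernel.
import numpy as np

A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.5]])
n, m, p = 2, 1, 1
N = 2                                                 # N >= observability index

mpow = np.linalg.matrix_power
VN = np.vstack([C @ mpow(A, N - 1 - i) for i in range(N)])      # (31)
UN = np.hstack([mpow(A, i) @ B for i in range(N)])              # (29)
TN = np.zeros((p * N, m * N))                                   # Toeplitz in (28), (30)
for i in range(N):
    for j in range(i + 1, N):
        TN[i*p:(i+1)*p, j*m:(j+1)*m] = C @ mpow(A, j - i - 1) @ B

M0 = mpow(A, N) @ np.linalg.pinv(VN)                  # minimum-norm M in (34)
My, Mu = M0, UN - M0 @ TN                             # Lemma 1

# simulate a window of data and compare x_k with its input/output reconstruction
rng = np.random.default_rng(0)
x = rng.standard_normal(n)
us, ys = [], []
for k in range(N):
    u = rng.standard_normal(m)
    us.append(u); ys.append(C @ x)
    x = A @ x + B @ u
u_stack = np.concatenate(us[::-1])                    # [u_{k-1}; ...; u_{k-N}]
y_stack = np.concatenate(ys[::-1])                    # [y_{k-1}; ...; y_{k-N}]
x_rec = My @ y_stack + Mu @ u_stack                   # (35)
print("reconstruction error:", np.linalg.norm(x - x_rec))

P = np.array([[2.0, 0.3], [0.3, 1.0]])                # any symmetric kernel
Pbar = np.block([[Mu.T @ P @ Mu, Mu.T @ P @ My],
                 [My.T @ P @ Mu, My.T @ P @ My]])     # (44)
z = np.concatenate([u_stack, y_stack])
print("x^T P x - z^T Pbar z =", x @ P @ x - z @ Pbar @ z)
```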
B. Writing the TD Error in Terms of Available Measured Data
We may now write Bellman's equation (9) in terms of the observed data as

z_{k-1,k-N}^T \bar{P} z_{k-1,k-N} = y_k^T Q y_k + u_k^T R u_k + z_{k,k-N+1}^T \bar{P} z_{k,k-N+1}.    (45)

Based on this equation, write the temporal difference error (17) in terms of the inputs and outputs as

e_k = -z_{k-1,k-N}^T \bar{P} z_{k-1,k-N} + y_k^T Q y_k + u_k^T R u_k + z_{k,k-N+1}^T \bar{P} z_{k,k-N+1}.    (46)

Using this TD error, the policy evaluation step of any form of RL based on the Bellman temporal difference error (17) can be equivalently performed using only the measured data, not the state.

The matrix \bar{P} depends on A, B, and C through M_y and M_u. However, RL methods allow one to learn \bar{P} online without A, B, and C, as shown next.
C. Writing the Policy Update in Terms of Available
Measured Data
Using Q-learning, one can perform PI and VI without any
knowledge of the system dynamics. Likewise, it is now shown
that, using the aforementioned constructions, one can derive
a form for the policy improvement step (20)/(25) that does
not depend on the state dynamics but only on the measured
input/output data.
The policy improvement step may be written in terms of the observed data as

\mu(x_k) = \arg\min_{u_k} \left[ y_k^T Q y_k + u_k^T R u_k + x_{k+1}^T P x_{k+1} \right]    (47)

\mu(x_k) = \arg\min_{u_k} \left[ y_k^T Q y_k + u_k^T R u_k + z_{k,k-N+1}^T \bar{P} z_{k,k-N+1} \right].    (48)

Partition z_{k,k-N+1}^T \bar{P} z_{k,k-N+1} as

z_{k,k-N+1}^T \bar{P} z_{k,k-N+1} = \begin{bmatrix} u_k \\ u_{k-1,k-N+1} \\ y_{k,k-N+1} \end{bmatrix}^T \begin{bmatrix} p_0 & p_u & p_y \\ p_u^T & P_{22} & P_{23} \\ p_y^T & P_{32} & P_{33} \end{bmatrix} \begin{bmatrix} u_k \\ u_{k-1,k-N+1} \\ y_{k,k-N+1} \end{bmatrix}.    (49)

One has p_0 \in R^{m \times m}, p_u \in R^{m \times m(N-1)}, and p_y \in R^{m \times pN}. Then, differentiating with respect to u_k to perform the minimization in (48) yields

0 = R u_k + p_0 u_k + p_u u_{k-1,k-N+1} + p_y y_{k,k-N+1}    (50)

or

u_k = -(R + p_0)^{-1} (p_u u_{k-1,k-N+1} + p_y y_{k,k-N+1}).    (51)

This is a dynamic polynomial autoregressive moving-average (ARMA) controller that generates the current control input u_k in terms of the previous inputs and the current and previous outputs.
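The partitioning (49) and the control update (51) reduce to index bookkeeping on \bar{P}. The sketch below uses a randomly generated symmetric matrix as a stand-in for a learned kernel and illustrative dimensions m, p, N; it extracts p_0, p_u, p_y and evaluates the ARMA control law.

```python
# Sketch: extract p0, pu, py from a kernel Pbar partitioned as in (49)
# and evaluate the ARMA control (51). Pbar here is a random stand-in
# for a learned kernel; m, p, N are illustrative.
import numpy as np

m, p, N = 1, 1, 2
dim = (m + p) * N
rng = np.random.default_rng(0)
S = rng.standard_normal((dim, dim))
Pbar = S @ S.T                        # placeholder symmetric kernel
R = np.eye(m)

# ordering of z_{k,k-N+1}: [u_k; u_{k-1,k-N+1}; y_{k,k-N+1}]
p0 = Pbar[:m, :m]
pu = Pbar[:m, m:m * N]
py = Pbar[:m, m * N:]

u_prev = rng.standard_normal(m * (N - 1))   # u_{k-1}, ..., u_{k-N+1}
y_hist = rng.standard_normal(p * N)         # y_k, ..., y_{k-N+1}
u_k = -np.linalg.solve(R + p0, pu @ u_prev + py @ y_hist)   # (51)
print("u_k =", u_k)
```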
Exactly as in Q-learning (called action-dependent learning by Werbos [41]), the control input appears in the quadratic form (49), so that the minimization in (48) can be carried out in terms of the learned kernel matrix \bar{P} without resorting to the system dynamics. However, since (49) contains present and
past values of the input and output, the result is a dynamical
controller in polynomial ARMA form.
We have developed the following RL algorithms that use
only the measured input/output data and do not require mea-
surements of the full state vector xk.
Algorithm 4—PI Algorithm Using OPFB

Select a stabilizing initial control policy u_k^0 = \mu^0. Then, for j = 0, 1, ..., perform until convergence:

1. Policy Evaluation: Solve for \bar{P}^{j+1} such that

0 = -z_{k-1,k-N}^T \bar{P}^{j+1} z_{k-1,k-N} + y_k^T Q y_k + (u_k^j)^T R u_k^j + z_{k,k-N+1}^T \bar{P}^{j+1} z_{k,k-N+1}.    (52)

2. Policy Improvement: Partition \bar{P}^{j+1} as in (49). Then, define the updated policy by

u_k^{j+1} = \mu^{j+1}(x_k) = -(R + p_0^{j+1})^{-1} (p_u^{j+1} u_{k-1,k-N+1} + p_y^{j+1} y_{k,k-N+1}).    (53)
Algorithm 5—VI Algorithm Using OPFB

Select any initial control policy u_k^0 = \mu^0. Then, for j = 0, 1, ..., perform until convergence:

1. Policy Evaluation: Solve for \bar{P}^{j+1} such that

z_{k-1,k-N}^T \bar{P}^{j+1} z_{k-1,k-N} = y_k^T Q y_k + (u_k^j)^T R u_k^j + z_{k,k-N+1}^T \bar{P}^j z_{k,k-N+1}.    (54)

2. Policy Improvement: Partition \bar{P}^{j+1} as in (49). Then, define the updated policy by

u_k^{j+1} = \mu^{j+1}(x_k) = -(R + p_0^{j+1})^{-1} (p_u^{j+1} u_{k-1,k-N+1} + p_y^{j+1} y_{k,k-N+1}).    (55)
Remark 1: These algorithms do not require measurements of the internal state vector x_k. They only require measurements at each time step k of the utility r_k = y_k^T Q y_k + u_k^T R u_k, the inputs from time k-N to time k, and the outputs from time k-N to time k. The policy evaluation step may be implemented using standard methods such as batch LS or RLS (see the next section).

Remark 2: The control policy given by these algorithms in the form of (51), (53), and (55) is a dynamic ARMA regulator in terms of the past inputs and the current and past outputs. As such, it optimizes a polynomial square-of-sums cost function that is equivalent to the LQR sum-of-squares cost function (2). This polynomial cost function could be determined using the techniques in [17].

Remark 3: PI Algorithm 4 requires an initial stabilizing control policy, while VI Algorithm 5 does not. As such, VI is suitable for control of open-loop unstable systems.

Remark 4: The PI algorithm requires an initial stabilizing control policy u_k^0 = \mu^0 that is required to be a function only of the observable data. Suppose that one can find an initial stabilizing state feedback gain such that u_k^0 = \mu^0(x_k) = -K^0 x_k. Then, the equivalent stabilizing OPFB ARMA controller is easy to find and is given by

u_k^0 = \mu^0(x_k) = -K^0 x_k = -K^0 [M_u \;\; M_y] \begin{bmatrix} u_{k-1,k-N} \\ y_{k-1,k-N} \end{bmatrix}.
The next result shows that the controller (51) is unique.

Lemma 2: Define M_0 and M_1 according to (34). Then, the control sequence generated by (51) is independent of M_1 and depends only on M_0. Moreover, (51) is equivalent to

u_k = -(R + B^T P B)^{-1} (p_u u_{k-1,k-N+1} + p_y y_{k,k-N+1})    (56)

where p_u and p_y depend only on M_0.

Proof: Write (50) as

0 = R u_k + p_0 u_k + p_u u_{k-1,k-N+1} + p_y y_{k,k-N+1} = R u_k + [p_0 \; p_u \,|\, p_y] \begin{bmatrix} u_{k,k-N+1} \\ y_{k,k-N+1} \end{bmatrix}.

According to (44) and (49)

[p_0 \; p_u] = [I_m \;\; 0] M_u^T P M_u, \qquad p_y = [I_m \;\; 0] M_u^T P M_y

so

0 = R u_k + p_0 u_k + p_u u_{k-1,k-N+1} + p_y y_{k,k-N+1} = R u_k + [I_m \;\; 0] M_u^T P (M_u u_{k,k-N+1} + M_y y_{k,k-N+1}).

According to Lemma 1, this is unique, independently of M_1 [see (38)], and equal to

0 = R u_k + [I_m \;\; 0] M_u^T P \left( (U_N - M_0 T_N) u_{k,k-N+1} + M_0 y_{k,k-N+1} \right).

Now

[I_m \;\; 0] M_u^T P = \left( M_u \begin{bmatrix} I_m \\ 0 \end{bmatrix} \right)^T P.

However

M_u \begin{bmatrix} I_m \\ 0 \end{bmatrix} = (U_N - (M_0 + M_1) T_N) \begin{bmatrix} I_m \\ 0 \end{bmatrix} = B

where one has used the structure of U_N and T_N. Consequently

0 = R u_k + p_0 u_k + p_u u_{k-1,k-N+1} + p_y y_{k,k-N+1} = R u_k + B^T P \left( (U_N - M_0 T_N) u_{k,k-N+1} + M_0 y_{k,k-N+1} \right)

which is independent of M_1.
Now note that p_0 = [I_m \;\; 0] M_u^T P M_u \begin{bmatrix} I_m \\ 0 \end{bmatrix} = B^T P B as before. Hence, (56) follows from (51).

Remark 5: Note that the controller cannot be implemented in the form of (56) since it requires that the matrix P be known.
IV. IMPLEMENTATION, PROBING NOISE, BIAS, AND DISCOUNT FACTORS
In this section, we discuss the need for probing noise to
implement the aforementioned algorithms, show that this noise
leads to deleterious effects such as bias, and argue how adding a
discount factor in the cost (2) can reduce this bias to a negligible
effect.
The equations in PI and VI are solved online by standard techniques such as batch LS or RLS. See [5], where RLS was used in Q-learning. In PI, one solves (52) by writing it in the form of

stk(\bar{P}^{j+1})^T \left[ z_{k-1,k-N} \otimes z_{k-1,k-N} - z_{k,k-N+1} \otimes z_{k,k-N+1} \right] = y_k^T Q y_k + (u_k^j)^T R u_k^j    (57)

with \otimes being the Kronecker product and stk(\cdot) being the column stacking operator [6]. The redundant quadratic terms in the Kronecker product are combined. In VI, one solves (54) in the form of

stk(\bar{P}^{j+1})^T \left[ z_{k-1,k-N} \otimes z_{k-1,k-N} \right] = y_k^T Q y_k + (u_k^j)^T R u_k^j + z_{k,k-N+1}^T \bar{P}^j z_{k,k-N+1}.    (58)

Both of these equations only require that the input/output data be measured. They are scalar equations, yet one must solve for the kernel matrix \bar{P} \in R^{(m+p)N \times (m+p)N}, which is symmetric and has [(m+p)N][(m+p)N + 1]/2 independent terms. Therefore, one requires data samples for at least [(m+p)N][(m+p)N + 1]/2 time steps for a solution using batch LS.
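To make the data assembly concrete, the sketch below performs one policy evaluation step of Algorithm 4 in the form (52)/(57) on a small simulated example. The system, the initial policy (zero control plus dither), the dither level, and the use of a discount factor are illustrative choices for the sketch, not the paper's exact setup; the regression uses the non-redundant quadratic terms of z.

```python
# Sketch: one policy evaluation step of Algorithm 4 (eq. (52)/(57)) by batch LS.
# System, initial policy, dither level, and discount factor are illustrative.
import numpy as np

A = np.array([[0.9, 0.2], [0.0, 0.8]])    # open-loop stable, so u = 0 is admissible
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.5]])
Qy, R = np.eye(1), np.eye(1)
m, p, N = 1, 1, 2
gamma = 0.2                               # discount to mitigate dither bias
dim = (m + p) * N

def quad(z):
    # non-redundant quadratic terms of z^T Pbar z (off-diagonals doubled)
    return np.array([z[i] * z[j] * (1.0 if i == j else 2.0)
                     for i in range(dim) for j in range(i, dim)])

def zvec(us, ys, k):
    # z_{k-1,k-N} built from recorded data: [u_{k-1..k-N}; y_{k-1..k-N}]
    u_part = np.concatenate([us[k - 1 - i] for i in range(N)])
    y_part = np.concatenate([ys[k - 1 - i] for i in range(N)])
    return np.concatenate([u_part, y_part])

rng = np.random.default_rng(0)
x = rng.standard_normal(2)
us, ys, rs = [], [], []
for k in range(400):                      # collect data under current policy + dither
    u = 0.1 * rng.standard_normal(m)      # policy u = 0 plus probing noise
    y = C @ x
    us.append(u); ys.append(y)
    rs.append(float(y @ Qy @ y + u @ R @ u))
    x = A @ x + B @ u

Phi, tgt = [], []
for k in range(N, len(us) - 1):
    z_old, z_new = zvec(us, ys, k), zvec(us, ys, k + 1)
    Phi.append(quad(z_old) - gamma * quad(z_new))     # discounted form of (52)
    tgt.append(rs[k])
theta = np.linalg.lstsq(np.array(Phi), np.array(tgt), rcond=None)[0]

Pbar = np.zeros((dim, dim))               # unpack symmetric kernel
idx = 0
for i in range(dim):
    for j in range(i, dim):
        Pbar[i, j] = Pbar[j, i] = theta[idx]
        idx += 1
print("Estimated Pbar =\n", Pbar)
```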
To solve the PI update equation (57), it is required that the quadratic vector [z_{k-1,k-N} \otimes z_{k-1,k-N} - z_{k,k-N+1} \otimes z_{k,k-N+1}] be linearly independent over time, which is a property known as persistence of excitation (PE). To solve the VI update equation (58), one requires PE of the quadratic vector [z_{k-1,k-N} \otimes z_{k-1,k-N}]. It is standard practice (see [5] for instance) to inject probing noise into the control action to obtain PE, so that one puts into the system dynamics the input \hat{u}_k = u_k + d_k, with u_k being the control computed by the current PI or VI policy and d_k being a probing noise or dither, e.g., white noise.
It is well known that dither can cause biased results and mis-
match in system identification. In [44], this issue is discussed,
and several alternative methods are presented for injecting
dither into system identification schemes to obtain improved
results. Unfortunately, in control applications, one has little
choice about where to inject the probing noise.
To see the deleterious effects of probing noise, consider the Bellman equation (9) with input \hat{u}_k = u_k + d_k, where d_k is a probing noise. One writes

x_k^T \hat{P} x_k = y_k^T Q y_k + \hat{u}_k^T R \hat{u}_k + \hat{x}_{k+1}^T \hat{P} \hat{x}_{k+1}

x_k^T \hat{P} x_k = y_k^T Q y_k + (u_k + d_k)^T R (u_k + d_k) + (A x_k + B u_k + B d_k)^T \hat{P} (A x_k + B u_k + B d_k)

x_k^T \hat{P} x_k = y_k^T Q y_k + u_k^T R u_k + d_k^T R d_k + u_k^T R d_k + d_k^T R u_k + (A x_k + B u_k)^T \hat{P} (A x_k + B u_k) + (A x_k + B u_k)^T \hat{P} B d_k + (B d_k)^T \hat{P} (A x_k + B u_k) + (B d_k)^T \hat{P} B d_k.

Now, use tr{AB} = tr{BA} for commensurate matrices A and B, take expected values to evaluate the correlation matrices, and assume that the dither at time k is white noise independent of u_k and x_k so that E{R u_k d_k^T} = 0 and E{\hat{P} B d_k (A x_k + B u_k)^T} = 0 and that the cross terms drop out. Then, averaged over repeated control runs with different probing noise sequences d_k, this equation is effectively

x_k^T \hat{P} x_k = y_k^T Q y_k + u_k^T R u_k + (A x_k + B u_k)^T \hat{P} (A x_k + B u_k) + d_k^T (B^T \hat{P} B + R) d_k

which is the undithered Bellman equation plus a term depending on the dither covariance. As such, the solution computed by PI or VI will not correspond to the actual value associated with the Bellman equation.
It is now argued that discounting the cost can significantly reduce the deleterious effects of probing noise. Adding a discount factor \gamma < 1 to the cost (2) results in

V^\mu(x_k) = \sum_{i=k}^{\infty} \gamma^{i-k} \left( y_i^T Q y_i + u_i^T R u_i \right) \equiv \sum_{i=k}^{\infty} \gamma^{i-k} r_i    (59)

with the associated Bellman equation

x_k^T P x_k = y_k^T Q y_k + u_k^T R u_k + \gamma x_{k+1}^T P x_{k+1}.    (60)

This has an HJB (Riccati) equation that is equivalent to

0 = -P + \gamma A^T P A - \gamma A^T P B (R/\gamma + B^T P B)^{-1} B^T P A + C^T Q C    (61)

and an optimal policy of

u_k = -(R/\gamma + B^T P B)^{-1} B^T P A x_k.    (62)
The benefits of discounting are most clearly seen by examining VI. With discount, (24) in VI Algorithm 3 is modified as

x_k^T P^{j+1} x_k = y_k^T Q y_k + (u_k^j)^T R u_k^j + \gamma x_{k+1}^T P^j x_{k+1}    (63)

which corresponds to the underlying Riccati difference equation

P^{j+1} = \gamma A^T P^j A - \gamma A^T P^j B (R/\gamma + B^T P^j B)^{-1} B^T P^j A + C^T Q C    (64)
where each iteration is decayed by the factor \gamma < 1. The effects of this can be best understood by considering the Lyapunov difference equation

P^{j+1} = \gamma A^T P^j A + C^T Q C, \qquad P^0 \ge 0    (65)

which has the solution

P^j = \gamma^j (A^T)^j P^0 A^j + \sum_{i=0}^{j-1} \gamma^i (A^T)^i C^T Q C A^i.    (66)

The effect of the discount factor is thus to decay the effects of the initial conditions.

In a similar vein, the discount factor decays the effects of the previous probing noises and improper initial conditions in the PI and VI algorithms. In fact, adding a discount factor is closely related to adding exponential data weighting in the Kalman filter to remove the bias effects of unmodeled dynamics [18].
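The forgetting effect in (65)-(66) can be seen directly: iterating the discounted Lyapunov recursion from two different initial kernels, the difference between the iterates decays like \gamma^j (A^T)^j (\cdot) A^j. The matrices and discount factor below are illustrative placeholders.

```python
# Sketch: the discounted Lyapunov recursion (65) forgets its initial condition,
# illustrating how the discount factor decays old (e.g., dither-corrupted) information.
# Matrices and the discount factor are illustrative placeholders.
import numpy as np

A = np.array([[0.9, 0.2], [0.0, 0.8]])
C = np.array([[1.0, 0.5]])
Q = np.eye(1)
gamma = 0.5

P_a = np.zeros((2, 2))                 # two different initial kernels P^0
P_b = 10.0 * np.eye(2)
for j in range(1, 11):
    P_a = gamma * A.T @ P_a @ A + C.T @ Q @ C     # recursion (65)
    P_b = gamma * A.T @ P_b @ A + C.T @ Q @ C
    print(f"j = {j:2d}, ||P_a - P_b|| = {np.linalg.norm(P_a - P_b):.2e}")
```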
V. SIMULATIONS
A. Stable Linear System and PI
Consider the stable linear system with quadratic cost function

x_{k+1} = \begin{bmatrix} 1.1 & -0.3 \\ 1 & 0 \end{bmatrix} x_k + \begin{bmatrix} 1 \\ 0 \end{bmatrix} u_k, \qquad y_k = [1 \;\; 0.8] x_k

where Q and R in the cost function are the identity matrices of appropriate dimensions. The open-loop poles are z_1 = 0.5 and z_2 = 0.6. In order to verify the correctness of the proposed algorithm, the optimal value kernel matrix P is found by solving (15) to be

P = \begin{bmatrix} 1.0150 & 0.8150 \\ 0.8150 & 0.6552 \end{bmatrix}.

Now, by using \bar{P} = \begin{bmatrix} M_u^T P M_u & M_u^T P M_y \\ M_y^T P M_u & M_y^T P M_y \end{bmatrix} and (35), one has

\bar{P} = \begin{bmatrix} 1.0150 & 0.8440 & 1.1455 & 0.3165 \\ 0.8440 & 0.7918 & 1.0341 & 0.2969 \\ 1.1455 & 1.0341 & 1.3667 & 0.3878 \\ 0.3165 & 0.2969 & 0.3878 & 0.1113 \end{bmatrix}.

Since the system is stable, we use the OPFB PI Algorithm 4 implemented as in (52) and (53). PE was ensured by adding dithering noise to the control input, and a discount (forgetting) factor \gamma = 0.2 was added to diminish the dither bias effects. The observability index is K = 2, and N is selected to be equal to two.
By applying the dynamic OPFB control (53), the system remained stable, and the parameters of \bar{P} converged to the optimal ones. In fact

\hat{\bar{P}} = \begin{bmatrix} 1.1340 & 0.8643 & 1.1571 & 0.3161 \\ 0.8643 & 0.7942 & 1.0348 & 0.2966 \\ 1.1571 & 1.0348 & 1.3609 & 0.3850 \\ 0.3161 & 0.2966 & 0.3850 & 0.1102 \end{bmatrix}.

In the example, batch LS was used to solve (52) at each step.
Fig. 1. Convergence of p_0, p_u, and p_y.

Fig. 2. Evolution of the system states for the duration of the experiment.
Fig. 1 shows the convergence of p_0 \in R, p_u \in R, and p_y \in R^{1 \times 2} of \bar{P} to the correct values. Fig. 2 shows the evolution of the system states and their convergence to zero.
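As a cross-check of this example, the kernel P reported above can be recomputed from the model; in the sketch below, the off-diagonal entry of A is written with the sign implied by the stated open-loop poles z_1 = 0.5 and z_2 = 0.6, and the computed kernel should be close to the one given in the text.

```python
# Sketch: recompute the value kernel of the Section V-A example from the model.
# The sign of the (1,2) entry of A is inferred from the stated open-loop poles.
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.1, -0.3], [1.0, 0.0]])
B = np.array([[1.0], [0.0]])
C = np.array([[1.0, 0.8]])
Q, R = np.eye(1), np.eye(1)

P = solve_discrete_are(A, B, C.T @ Q @ C, R)   # undiscounted LQR kernel, eq. (15)
print("P =\n", P)          # compare with the kernel reported in the text
print("open-loop poles:", np.linalg.eigvals(A))
```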
B. Q-Learning and OPFB ADP
The purpose of this example is to compare the performance of Q-learning [5], [39] and that of OPFB ADP. Consider the stable linear system described before, and apply Q-learning. The Q-function for this particular system is given by

Q_h(x_k, u_k) = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} Q + A^T P A & A^T P B \\ B^T P A & R + B^T P B \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} \equiv \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T H \begin{bmatrix} x_k \\ u_k \end{bmatrix} = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} 4.9768 & 0.8615 & 2.8716 \\ 0.8615 & 1.2534 & 0.8447 \\ 2.8716 & 0.8447 & 3.8158 \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix}.
Fig. 3. Convergence of H_{11}, H_{12}, and H_{13}.

Fig. 4. Evolution of the system states for the duration of the experiment.

By comparing these two methods, it is seen that Q-learning converges faster than OPFB ADP since fewer parameters are being identified.

Fig. 3 shows the convergence of the three parameters H_{11}, H_{12}, and H_{13} of the H matrix shown before to the correct values. Fig. 4 shows the evolution of the system states.
C. Unstable Linear System and VI
Consider the unstable linear system with quadratic cost function

x_{k+1} = \begin{bmatrix} 1.8 & -0.77 \\ 1 & 0 \end{bmatrix} x_k + \begin{bmatrix} 1 \\ 0 \end{bmatrix} u_k, \qquad y_k = [1 \;\; 0.5] x_k

where Q and R in the cost function are the identity matrices of appropriate dimensions. The open-loop poles are z_1 = 0.7 and z_2 = 1.1, so the system is unstable. In order to verify the correctness of the proposed algorithm, the optimal value kernel matrix P is found by solving (15) to be

P = \begin{bmatrix} 1.3442 & 0.7078 \\ 0.7078 & 0.3756 \end{bmatrix}.

Now, by using \bar{P} = \begin{bmatrix} M_u^T P M_u & M_u^T P M_y \\ M_y^T P M_u & M_y^T P M_y \end{bmatrix} and (35), one has

\bar{P} = \begin{bmatrix} 1.3442 & 0.7465 & 2.4582 & 1.1496 \\ 0.7465 & 0.4271 & 1.3717 & 0.6578 \\ 2.4582 & 1.3717 & 4.4990 & 2.1124 \\ 1.1496 & 0.6578 & 2.1124 & 1.0130 \end{bmatrix}.

Since the system is open-loop unstable, one must use VI, not PI. OPFB VI Algorithm 5 is implemented as in (54) and (55). PE was ensured by adding dithering noise to the control input, and a discount (forgetting) factor \gamma = 0.2 was used to diminish the dither bias effects. The observability index is K = 2, and N is selected to be equal to two.
By applying the dynamic OPFB control (55), the system remained stabilized, and the parameters of \bar{P} converged to the optimal ones. In fact

\hat{\bar{P}} = \begin{bmatrix} 1.3431 & 0.7504 & 2.4568 & 1.1493 \\ 0.7504 & 0.4301 & 1.3730 & 0.6591 \\ 2.4568 & 1.3730 & 4.4979 & 2.1120 \\ 1.1493 & 0.6591 & 2.1120 & 1.0134 \end{bmatrix}.
In the example, batch LS was used to solve (54) at each step.

Fig. 5. Convergence of p_0, p_u, and p_y.

Fig. 6. Evolution of the system states for the duration of the experiment.

Fig. 5 shows the convergence of p_0 \in R, p_u \in R, and p_y \in R^{1 \times 2} of \bar{P}. Fig. 6 shows the evolution of the system states, their boundedness despite the fact that the plant is open-loop unstable, and their convergence to zero.
VI. CONCLUSION
In this paper, we have proposed the implementation of ADP
using only the measured input/output data from a dynamical
system. This is known in control system theory as “OPFB,” as opposed to full state feedback, and corresponds to RL for a class
of POMDPs. Both PI and VI algorithms have been developed
that require only OPFB. An added and surprising benefit is that,
similar to Q-learning, the system dynamics are not needed to
implement these OPFB algorithms, so that they converge to
the optimal controller for completely unknown systems. The
system order, as well as an upper bound on its “observability
index,” must be known. The learned OPFB controller is given
in the form of a polynomial ARMA controller that is equivalent
to the optimal state variable feedback gain. Learning this controller requires the addition of probing noise so that the data are sufficiently rich. This probing noise adds some bias, and in order to mitigate it, a discount factor is added to the cost.
Future research efforts will focus on how to apply the previ-
ous method to nonlinear systems.
REFERENCES
[1] W. Aangenent, D. Kostic, B. de Jager, R. van de Molengraft, and
M. Steinbuch, “Data-based optimal control,” in Proc. Amer. Control Conf.,
Portland, OR, 2005, pp. 1460–1465.
[2] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, “Discrete-time nonlinear
HJB solution using approximate dynamic programming: Convergence
proof,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4,
pp. 943–949, Aug. 2008.
[3] J. K. Bennighof, S. M. Chang, and M. Subramaniam, “Minimum time
pulse response based control of flexible structures,” J. Guid. Control Dyn.,
vol. 16, no. 5, pp. 874–881, 1993.
[4] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming.
Belmont, MA: Athena Scientific, 1996.
[5] S. J. Bradtke, B. E. Ydstie, and A. G. Barto, “Adaptive linear quadratic
control using policy iteration,” in Proc. Amer. Control Conf., Baltimore,
MD, Jun. 1994, pp. 3475–3476.
[6] J. W. Brewer, “Kronecker products and matrix calculus in system theory,”
IEEE Trans. Circuits Syst., vol. CAS-25, no. 9, pp. 772–781, Sep. 1978.
[7] X. Cao, Stochastic Learning and Optimization. Berlin, Germany:
Springer-Verlag, 2007.
[8] K. Doya, “Reinforcement learning in continuous time and space,” Neural
Comput., vol. 12, no. 1, pp. 219–245, Jan. 2000.
[9] K. Doya, H. Kimura, and M. Kawato, “Neural mechanisms of learn-
ing and control,” IEEE Control Syst. Mag., vol. 21, no. 4, pp. 42–54,
Aug. 2001.
[10] G. Hewer, “An iterative technique for the computation of the steady state
gains for the discrete optimal regulator,” IEEE Trans. Autom. Control,
vol. AC-16, no. 4, pp. 382–384, Aug. 1971.
[11] H. Hjalmarsson and M. Gevers, “Iterative feedback tuning: Theory and
applications,” IEEE Control Syst. Mag., vol. 18, no. 4, pp. 26–41, Aug. 1998.
[12] P. Ioannou and B. Fidan, “Advances in design and control,” in Adaptive
Control Tutorial. Philadelphia, PA: SIAM, 2006.
[13] M. Krstic and H. Deng, Stabilization of Nonlinear Uncertain Systems.
Berlin, Germany: Springer-Verlag, 1998.
[14] P. Lancaster and L. Rodman, Algebraic Riccati Equations (Oxford Science
Publications). New York: Oxford Univ. Press, 1995.
[15] A. Lecchini, M. C. Campi, and S. M. Savaresi, “Virtual reference feedback
tuning for two degree of freedom controllers,” Int. J. Adapt. Control
Signal Process., vol. 16, no. 5, pp. 355–371, 2002.
[16] F. L. Lewis, G. Lendaris, and D. Liu, “Special issue on approximate
dynamic programming and reinforcement learning for feedback control,”
IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 896–897,
Aug. 2008.
[17] F. L. Lewis and V. L. Syrmos, Optimal Control. New York: Wiley, 1995.
[18] F. L. Lewis, L. Xie, and D. O. Popa, Optimal and Robust Estimation.
Boca Raton, FL: CRC Press, Sep. 2007.
[19] F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic
programming for feedback control,” IEEE Circuits Syst. Mag., vol. 9,
no. 3, pp. 32–50, Sep. 2009.
[20] Z. H. Li and M. Krstic, “Optimal design of adaptive tracking controllers
for nonlinear systems,” in Proc. ACC, 1997, pp. 1191–1197.
[21] R. K. Lim, M. Q. Phan, and R. W. Longman, “State-Space System
Identification with Identified Hankel Matrix,” Dept. Mech. Aerosp. Eng.,
Princeton Univ., Princeton, NJ, Tech. Rep. 3045, Sep. 1998.
[22] R. K. Lim and M. Q. Phan, “Identification of a multistep-ahead observer
and its application to predictive control,” J. Guid. Control Dyn., vol. 20,
no. 6, pp. 1200–1206, 1997.
[23] P. Mehta and S. Meyn, “Q-learning and Pontryagin’s minimum principle,” 2009, preprint.
[24] W. Powell, Approximate Dynamic Programming: Solving the Curses of
Dimensionality. New York: Wiley, 2007.
[25] D. Prokhorov and D. Wunsch, “Adaptive critic designs,” IEEE Trans.
Neural Netw., vol. 8, no. 5, pp. 997–1007, Sep. 1997.
[26] M. Q. Phan, R. K. Lim, and R. W. Longman, “Unifying Input–Output
and State-Space Perspectives of Predictive Control,” Dept. Mech. Aerosp.
Eng., Princeton Univ., Princeton, NJ, Tech. Rep. 3044, Sep. 1998.
[27] M. G. Safonov and T. C. Tsao, “The unfalsified control concept and
learning,” IEEE Trans. Autom. Control, vol. 42, no. 6, pp. 843–847,
Jun. 1997.
[28] W. Schultz, “Neural coding of basic reward terms of animal learning the-
ory, game theory, microeconomics and behavioral ecology,” Curr. Opin.
Neurobiol., vol. 14, no. 2, pp. 139–147, Apr. 2004.
[29] W. Schultz, P. Dayan, and P. Read Montague, “A neural substrate of
prediction and reward,” Science, vol. 275, no. 5306, pp. 1593–1599,
Mar. 1997.
[30] J. Si, A. Barto, W. Powell, and D. Wunsch, Handbook of Learning and Approximate Dynamic Programming. Englewood Cliffs, NJ: Wiley, 2004.
[31] R. E. Skelton and G. Shi, “Markov Data-Based LQG Control,” Trans.
ASME, J. Dyn. Syst. Meas. Control, vol. 122, no. 3, pp. 551–559, 2000.
[32] J. C. Spall and J. A. Cristion, “Model-free control of nonlinear stochastic
systems with discrete-time measurements,” IEEE Trans. Autom. Control,
vol. 43, no. 9, pp. 1198–1210, Sep. 1998.
[33] R. S. Sutton and A. G. Barto, Reinforcement Learning—An Introduction.
Cambridge, MA: MIT Press, 1998.
[34] R. L. Toussaint, J. C. Boissy, M. L. Norg, M. Steinbuch, and
O. H. Bosgra, “Suppressing non-periodically repeating disturbances in
mechanical servo systems,” in Proc. IEEE Conf. Decision Control,
Tampa, FL, 1998, pp. 2541–2542.
[35] D. Vrabie, K. Vamvoudakis, and F. Lewis, “Adaptive optimal control-
lers based on generalized policy iteration in a continuous-time frame-
work,” in Proc. IEEE Mediterranean Conf. Control Autom., Jun. 2009,
pp. 1402–1409.
[36] K. G. Vamvoudakis and F. L. Lewis, “Online actor critic algorithm
to solve the continuous-time infinite horizon optimal control prob-
lem,” in Proc. Int. Joint Conf. Neural Netw., Atlanta, GA, Jun. 2009,
pp. 3180–3187.
[37] K. Vamvoudakis, D. Vrabie, and F. L. Lewis, “Online policy itera-
tion based algorithms to solve the continuous-time infinite horizon op-
timal control problem,” in Proc. IEEE Symp. ADPRL, Nashville, TN,
Mar. 2009, pp. 36–41.
[38] F. Y. Wang, H. Zhang, and D. Liu, “Adaptive dynamic programming:
An introduction,” IEEE Comput. Intell. Mag., vol. 4, no. 2, pp. 39–47,
May 2009.
[39] C. Watkins, “Learning from delayed rewards,” Ph.D. dissertation, Cam-
bridge Univ., Cambridge, U.K., 1989.
[40] P. J. Werbos, “Beyond regression: New tools for prediction and analysis
in the behavior sciences,” Ph.D. dissertation, Harvard Univ., Cambridge,
MA, 1974.
[41] P. J. Werbos, “Approximate dynamic programming for real-time control
and neural modeling,” in Handbook of Intelligent Control, D. A. White
and D. A. Sofge, Eds. New York: Van Nostrand Reinhold, 1992.
[42] P. Werbos, “Neural networks for control and system identification,” in
Proc. IEEE CDC, 1989, pp. 260–265.
[43] D. A. White and D. A. Sofge, Eds., Handbook of Intelligent Control. New
York: Van Nostrand Reinhold, 1992.
[44] B. Widrow and E. Walach, Adaptive Inverse Control. Upper Saddle
River, NJ: Prentice-Hall, 1996.
F. L. Lewis (S’78–M’81–SM’86–F’94) received the
B.S. degree in physics/electrical engineering and the
M.S.E.E. degree from Rice University, Houston, TX,
the M.S. degree in aeronautical engineering from the
University of West Florida, Pensacola, and the Ph.D.
degree from the Georgia Institute of Technology,
Atlanta.
He is currently a Distinguished Scholar Professor
and the Moncrief-O’Donnell Chair with the Automa-
tion and Robotics Research Institute (ARRI), The
University of Texas at Arlington, Fort Worth. He
works in feedback control, intelligent systems, distributed control systems, and
sensor networks. He is the author of 216 journal papers, 330 conference papers,
14 books, 44 chapters, and 11 journal special issues and is the holder of six U.S.
patents.
Dr. Lewis is a Fellow of the International Federation of Automatic Control
and the U.K. Institute of Measurement and Control, a Professional Engineer in
Texas, and a U.K. Chartered Engineer. He was the Founding Member of the
Board of Governors of the Mediterranean Control Association. He served on
the National Academy of Engineering Committee on Space Station in 1995.
He is an elected Guest Consulting Professor with both South China University
of Technology, Guangzhou, China, and Shanghai Jiao Tong University, Shang-
hai, China. He received the Fulbright Research Award, the National Science
Foundation Research Initiation Grant, the American Society for Engineering
Education Terman Award, the 2009 International Neural Network Society
Gabor Award, the U.K. Institute of Measurement and Control Honeywell Field
Engineering Medal in 2009, and the Outstanding Service Award from the
Dallas IEEE Section. He was selected as Engineer of the Year by the Fort
Worth IEEE Section and listed in Fort Worth Business Press Top 200 Leaders
in Manufacturing. He helped win the IEEE Control Systems Society Best
Chapter Award (as the Founding Chairman of the Dallas-Fort Worth Chapter),
the National Sigma Xi Award for Outstanding Chapter (as the President of
the University of Texas at Arlington Chapter), and the U.S. Small Business
Administration Tibbets Award in 1996 (as the Director of ARRI’s Small
Business Innovation Research Program).
Kyriakos G. Vamvoudakis (S’02–M’06) was born
in Athens, Greece. He received the Diploma in
Electronic and Computer Engineering (with high-
est honors) from the Technical University of Crete,
Chania, Greece, in 2006 and the M.Sc. degree in
electrical engineering from The University of Texas
at Arlington, in 2008, where he is currently working
toward the Ph.D. degree.
He is also currently a Research Assistant with
the Automation and Robotics Research Institute,
The University of Texas at Arlington. His current
research interests include approximate dynamic programming, neural-network
feedback control, optimal control, adaptive control, and systems biology.
Mr. Vamvoudakis is a member of Tau Beta Pi, Eta Kappa Nu, and Golden
Key honor societies and is listed in Who’s Who in the World. He is a Registered
Electrical/Computer Engineer (Professional Engineer) and a member of the
Technical Chamber of Greece.
... The following theorem proposes sufficient conditions to guarantee the monotonicity of sequence {P [t] } generated by the Riccati (29). ...
... Moreover, based on Lemmas 1 and 2, (29) gives ...
... According to [20], [29], the Q-function in (40) can be expressed in terms of available measured data. ...
Article
In this study, we employ two data-driven approaches to address the secure control problem for cyber-physical systems when facing false data injection attacks. Firstly, guided by zero-sum game theory and the principle of optimality, we derive the optimal control gain, which hinges on the solution of a corresponding algebraic Riccati equation. Secondly, we present sufficient conditions to guarantee the existence of a solution to the algebraic Riccati equation, which constitutes the first major contributions of this paper. Subsequently, we introduce two data-driven Q-learning algorithms, facilitating model-free control design. The second algorithm represents the second major contribution of this paper, as it not only operates without the need for a system model but also eliminates the requirement for state vectors, making it quite practical. Lastly, the efficacy of the proposed control schemes is confirmed through a case study involving an F-16 aircraft.
... This paper was recommended for publication in revised form by Associate Editor Kyriakos G. Vamvoudakis under the direction of Editor Miroslav Krstic. Jiang (2016, 2019), Jiang and Jiang (2012), Lewis and Vamvoudakis (2011), Lewis and Vrabie (2009), Modares et al. (2016), Vamvoudakis and Lewis (2010). Although the explicit system identification process is skipped by using ADP/RL based learning algorithm for solving the optimal control problem, yet accurate state data is still required in Bian and Jiang (2016), Chen et al. (2019), , Jiang and Jiang (2012), Jiang et al. (2018), Luo et al. (2018) to establish the iterative learning equation. ...
... In order to expand the application of ADP/RL based learning methods, several practical issues are further taken into account, for instance, control with output feedback Lewis & Vamvoudakis, 2011;Modares et al., 2016;Peng et al., 2020;Rizvi & Lin, 2020b;Sun et al., 2019), disturbance rejection Luo et al., 2018), uncertain system dynamics Jiang & Jiang, 2013). In particular, a robust ADP-based learning algorithm with output feedback (OPFB) controller was proposed in to solve the linear quadratic regulation (LQR) problem for linear systems with dynamics uncertainty, which was extended to solve the output regulation problem for discrete-time systems in . ...
... Moreover, the effect on the estimated optimal controller by the existence of the observer error has not been discussed in Rizvi and Lin (2020b). In addition, in most existing ADP-based algorithms, e.g., , , Jiang and Jiang (2012), Lewis and Vamvoudakis (2011), Modares et al. (2016), Rizvi and Lin (2020b), Vamvoudakis and Lewis (2010), the PE condition or the rank condition is usually needed to ensure that the collected learning data is rich enough. ...
Article
In this paper, we present an approximate optimal dynamic output feedback control learning algorithm to solve the linear quadratic regulation problem for unknown linear continuous-time systems. First, a dynamic output feedback controller is designed by constructing the internal state. Then, an adaptive dynamic programming based learning algorithm is proposed to estimate the optimal feedback control gain by only accessing the input and output data. By adding a constructed virtual observer error into the iterative learning equation, the proposed learning algorithm with the new iterative learning equation is immune to the observer error. In addition, the value iteration based learning equation is established without storing a series of past data, which could lead to a reduction of demands on the usage of memory storage. Besides, the proposed algorithm eliminates the requirement of repeated finite window integrals, which may reduce the computational load. Moreover, the convergence analysis shows that the estimated control policy converges to the optimal control policy. Finally, a physical experiment on an unmanned quadrotor is given to illustrate the effectiveness of the proposed approach.
Article
The objective of this research is to enable safety‐critical systems to simultaneously learn and execute optimal control policies in a safe manner to achieve complex autonomy. Learning optimal policies via trial and error, that is, traditional reinforcement learning, is difficult to implement in safety‐critical systems, particularly when task restarts are unavailable. Safe model‐based reinforcement learning techniques based on a barrier transformation have recently been developed to address this problem. However, these methods rely on full‐state feedback, limiting their usability in a real‐world environment. In this work, an output‐feedback safe model‐based reinforcement learning technique based on a novel barrier‐aware dynamic state estimator has been designed to address this issue. The developed approach facilitates simultaneous learning and execution of safe control policies for safety‐critical linear systems. Simulation results indicate that barrier transformation is an effective approach to achieve online reinforcement learning in safety‐critical systems using output feedback.
Article
This article proposes a novel data-driven framework of distributed optimal consensus for discrete-time linear multi-agent systems under general digraphs. A fully distributed control protocol is designed using the linear quadratic regulator approach and is proved, through dynamic programming and the minimum principle, to be a necessary and sufficient condition for optimal control of multi-agent systems. Moreover, the control protocol can be constructed from local information with the aid of the solution of the algebraic Riccati equation (ARE). Based on the Q-learning method, a reinforcement learning framework is presented to find the solution of the ARE in a data-driven way, in which information collected from an arbitrary follower suffices to learn the feedback gain matrix. Thus, the multi-agent system can achieve distributed optimal consensus even when the system dynamics and global information are completely unavailable. For the output feedback case, accurate state estimation is established so that optimal consensus control is still realized. Furthermore, the data-driven optimal consensus method designed in this article is applicable to any general digraph that contains a directed spanning tree. Finally, numerical simulations verify the validity of the proposed optimal control protocols and the data-driven framework.
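As a concrete, simplified illustration of finding the ARE solution by Q-learning from data, the sketch below runs Bradtke-style Q-learning policy iteration for a single-agent LQR problem; the plant is used only to generate data, and the consensus, graph, and output feedback aspects of the cited work are omitted.

```python
import numpy as np

# Simplified single-agent sketch of Q-learning policy iteration for discrete-time LQR.
# The plant (A, B) is used only to simulate data; the learning update itself uses only
# measured (x_k, u_k, x_{k+1}) samples and the stage cost, never the model.
np.random.seed(0)
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])                 # illustrative, open-loop stable plant
B = np.array([[0.0],
              [1.0]])
Q, R = np.eye(2), np.eye(1)
n, m = 2, 1

K = np.zeros((m, n))                       # initial (stabilizing) policy u = -K x
for _ in range(15):                        # policy iteration loop
    Phi, c = [], []
    x = np.random.randn(n)
    for _ in range(80):                    # collect transitions under exploration noise
        u = -K @ x + 0.1 * np.random.randn(m)
        x_next = A @ x + B @ u
        z = np.concatenate([x, u])                       # Q-function argument [x; u]
        z_next = np.concatenate([x_next, -K @ x_next])   # follow the current policy afterwards
        Phi.append(np.kron(z, z) - np.kron(z_next, z_next))
        c.append(x @ Q @ x + u @ R @ u)                  # measured stage cost
        x = x_next
    vecH, *_ = np.linalg.lstsq(np.array(Phi), np.array(c), rcond=None)
    H = vecH.reshape(n + m, n + m)
    H = 0.5 * (H + H.T)                    # Q(x, u) = [x; u]^T H [x; u]
    K = np.linalg.solve(H[n:, n:], H[n:, :n])   # policy improvement
print(np.round(K, 4))                      # approximates the LQR gain for (A, B, Q, R)
```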
Chapter
In control system design, a common objective is to find a stabilizing controller that ensures the system’s output tracks a desired reference trajectory. Optimal control theory aims to achieve this goal by determining a control law that not only stabilizes the error dynamics but also minimizes a predefined performance index. Reinforcement learning (RL) algorithms have proven to be effective in solving the optimal tracking control problem (OTCP) for both discrete-time (Dierks and Jagannathan 2009; Wang et al. 2012; Modares et al. 2014) and continuous-time systems (Zhang et al. 2011). RL algorithms not only learn optimal tracking control solutions but also stabilize the tracking error systems.
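For context, one common discrete-time formulation of the OTCP performance index (a generic sketch; the discount factor and weights are assumptions, not those of the cited works) is

\[
V(e_0) \;=\; \sum_{k=0}^{\infty} \gamma^{k}\!\left( e_k^{\top} Q\, e_k + u_k^{\top} R\, u_k \right),
\qquad e_k = y_k - r_k ,
\]

where \(r_k\) is the reference trajectory, \(e_k\) the tracking error, and \(0<\gamma\le 1\) a discount factor that keeps the cost bounded for references that do not decay to zero.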
Book
Reinforcement learning (RL) and adaptive dynamic programming (ADP) have been among the most critical research fields in science and engineering for modern complex systems. This book describes the latest RL and ADP techniques for decision and control in human-engineered systems, covering both single-player decision and control and multi-player games. Edited by pioneers of RL and ADP research, the book brings together ideas and methods from many fields and provides important and timely guidance on controlling a wide variety of systems, such as robots, industrial processes, and economic decision-making. © 2013 The Institute of Electrical and Electronics Engineers, Inc.
Article
State estimation is a fundamental component of modern control theory. In discrete-time format, the standard state estimator (observer) is one step ahead. It provides one-step-ahead estimation of the system state on the basis of information available at the current time step. To obtain multistep-ahead estimation, one can repeatedly propagate the one-step estimation a number of time steps into the future, but this process tends to accumulate errors from one propagation to the next. A multistep observer, which directly estimates the state of the system at some specified time step in the future, is identified directly from I/O data. One possible application of this multistep-ahead observer is in receding-horizon predictive control, which bases its present control action on a prediction of the system response at some time step in the future. It is possible to recover the usual one-step-ahead state-space model of the system from the identified multistep-ahead observer as well, although a stabilizing feedback controller can be designed directly from the identified observer. Numerical examples are used to illustrate the key identification and control aspects of this multistep-ahead observer concept.
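To make the identification-from-I/O idea concrete, the following sketch fits a d-step-ahead output predictor by least squares directly from input/output sequences, in the spirit of a multistep observer; the simulated plant, window length, and horizon are assumptions, and only the measured I/O data enter the fit.

```python
import numpy as np

# Identify a d-step-ahead output predictor directly from I/O data: the regression maps a
# window of past outputs/inputs plus the intervening future inputs to the output d steps
# ahead. The plant below exists only to generate the data.
np.random.seed(1)
A = np.array([[0.95, 0.10],
              [-0.10, 0.90]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
p, d, N = 4, 3, 400                        # past window, prediction horizon, data length

x, ys, us = np.zeros(2), [], []
for _ in range(N + d):
    u = np.random.randn()
    ys.append((C @ x).item())
    us.append(u)
    x = A @ x + B[:, 0] * u

Phi, Y = [], []
for k in range(p - 1, N - 1):
    past = ys[k - p + 1:k + 1] + us[k - p + 1:k + 1]      # past outputs and inputs
    future_u = us[k + 1:k + d]                            # inputs applied before step k + d
    Phi.append(past + future_u)
    Y.append(ys[k + d])
Theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(Y), rcond=None)
# Theta now maps a window of measured I/O data directly to the output d steps ahead.
```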
Book
Preface Acknowledgements List of acronyms 1. Introduction 2. Parametric models 3. Parameter identification: continuous time 4. Parameter identification: discrete time 5. Continuous-time model reference adaptive control 6. Continuous-time adaptive pole placement control 7. Adaptive control for discrete-time systems 8. Adaptive control of nonlinear systems Appendix Bibliography Index.
Article
This paper presents the pulse response based control method for minimum time control of structures. An explicit model of the structure is not needed for this method, because the structure is represented in terms of its measured response to pulses in the control inputs. The minimum time control problem is solved by finding the minimal number of time steps for which a control history exists that consists of a train of pulses, satisfies the input bounds, and results in the desired outputs at the end of the control task. There is no modal truncation in the pulse response representation of the response, because all modes contribute to the pulse response data. The precision with which the final state of the system can be specified using pulse response based control is limited only by the observability of the system with the given set of outputs. A special algorithm for solving the numerical optimization problem arising in pulse response based control is presented, and the effect of measurement noise on the accuracy of the final predicted outputs is investigated. A numerical example demonstrates the effectiveness of pulse response based control and the algorithm used to implement it. The method is applied to linear problems in this paper.
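To make the minimum-time idea concrete, the following sketch (an assumption-laden toy, not the paper's algorithm or structural model) increases the horizon until a bounded pulse train that reaches a desired final output exists, checking feasibility with a linear program.

```python
import numpy as np
from scipy.optimize import linprog

# Toy minimum-time search: find the smallest horizon N for which a bounded input sequence
# drives a double integrator to a desired final output. Each column of G is the pulse
# response A^(N-1-j) B, i.e. the effect of a unit pulse applied at step j.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.5 * dt**2], [dt]])
target = np.array([1.0, 0.0])              # desired final position and velocity
u_max = 0.5                                # illustrative input bound
for N in range(1, 200):
    G = np.hstack([np.linalg.matrix_power(A, N - 1 - j) @ B for j in range(N)])
    res = linprog(c=np.zeros(N), A_eq=G, b_eq=target,
                  bounds=[(-u_max, u_max)] * N, method="highs")
    if res.success:                        # first feasible horizon is the minimum time
        print(f"minimum number of steps: {N}")
        print(np.round(res.x, 3))
        break
```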
Article
Chapter outline: A Dynamic Programming Example: A Shortest Path Problem; The Three Curses of Dimensionality; Some Real Applications; Problem Classes; The Many Dialects of Dynamic Programming; What is New in this Book?; Bibliographic Notes.
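Since the chapter opens with a shortest-path example, here is a minimal backward Bellman recursion on a made-up four-node graph; node names and edge costs are purely illustrative.

```python
# Backward dynamic programming (Bellman recursion) on a small acyclic shortest-path example.
graph = {                       # edge costs: graph[node] = {successor: cost}
    "A": {"B": 2, "C": 5},
    "B": {"C": 1, "D": 4},
    "C": {"D": 1},
    "D": {},                    # terminal node
}
order = ["D", "C", "B", "A"]    # process nodes in reverse topological order
J = {"D": 0.0}                  # cost-to-go of the terminal node
policy = {}
for node in order[1:]:
    succ, cost = min(graph[node].items(), key=lambda kv: kv[1] + J[kv[0]])
    policy[node] = succ         # greedy successor achieving the Bellman minimum
    J[node] = cost + J[succ]
print(J["A"], policy)           # optimal cost from A and the resulting routing policy
```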
Conference Paper
In this paper we discuss an online algorithm based on policy iteration for learning the continuous-time (CT) optimal control solution with infinite horizon cost for nonlinear systems with known dynamics. We present an online adaptive algorithm implemented as an actor/critic structure which involves simultaneous continuous-time adaptation of both actor and critic neural networks. We call this 'synchronous' policy iteration. A persistence of excitation condition is shown to guarantee convergence of the critic to the actual optimal value function. Novel tuning algorithms are given for both critic and actor networks, with extra terms in the actor tuning law being required to guarantee closed-loop dynamical stability. The convergence to the optimal controller is proven, and stability of the system is also guaranteed. Simulation examples show the effectiveness of the new algorithm.
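For orientation, a generic normalized-gradient critic update of the kind used in such synchronous actor-critic schemes can be sketched as follows (a sketch only; the exact tuning laws, extra stabilizing terms, and notation of the cited paper differ):

\[
e_c = \hat W_c^{\top}\sigma(x,u) + x^{\top}Q x + u^{\top}R u,
\qquad
\dot{\hat W}_c = -\,\alpha_c\,\frac{\sigma(x,u)}{\left(1+\sigma^{\top}\sigma\right)^{2}}\, e_c ,
\qquad
\sigma(x,u) = \nabla\phi(x)\bigl(f(x)+g(x)u\bigr),
\]

where \(\hat W_c\) are critic weights on a value-function basis \(\phi(x)\) and \(e_c\) is the Bellman residual; the actor weights \(\hat W_a\) are tuned simultaneously toward the policy \(\hat u = -\tfrac{1}{2}R^{-1}g^{\top}(x)\nabla\phi^{\top}(x)\hat W_a\), with additional terms in the actor law to guarantee closed-loop stability.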