Automatica, Vol. 27, No. 2, pp. 425-426, 1991
Printed in Great Britain. 0005-1098/91 $3.00 + 0.00
Pergamon Press plc
© 1991 International Federation of Automatic Control
Technical Communique
An Elementary Derivation of the Maximum Likelihood Estimator of the Covariance Matrix, and an Illustrative Determinant Inequality*
SEPPO KARRILA†‡ and TAPIO WESTERLUND†
Key Words--Maximum likelihood estimation; estimation; determinants; optimization; least-squares estimation.
Abstract--The unique maximum likelihood estimate of the covariance matrix of normally distributed random vectors is derived by use of elementary linear algebra, leading to simple scalar equations. In addition, the application of a determinant inequality, also derived here, shows that a standard "derivation" of the maximum likelihood estimate is fallacious.
Introduction

IN SOME textbooks on estimation theory [for example, Goodwin and Payne (1977), p. 48] the maximum likelihood estimate of the covariance matrix of normally distributed random vectors is obtained by matrix differentiation results for general matrices, without restricting the covariance matrix to being symmetric positive definite (SPD) or even just symmetric. A stationary point of the likelihood function is obtained; the stationary point is then observed to be SPD, and it is concluded that this must be the unique solution to the maximization problem in the smaller domain of SPD matrices. Although the solution is correct, its derivation is not, and some confusion may arise since the likelihood function attains arbitrarily large values when the covariance matrix is not restricted to being SPD. Naturally, the stationary point obtained for general matrices was in fact a saddle point, and some further insight into the situation is provided by the monotonicity result for determinants presented here.
The maximum likelihood estimate

The likelihood function of N independent normally distributed random vectors e with n real components is given by

L = (2\pi)^{-Nn/2} \, |R^{-1}|^{N/2} \exp\left(-\tfrac{1}{2} \operatorname{tr}\{E^T E R^{-1}\}\right)    (1)

where R is the unknown covariance matrix. The matrix E is formed from the observations according to

E^T = [e_1, e_2, \dots, e_N].    (2)
The standard way of obtaining the maximum likelihood estimates is by differentiating the logarithm of the likelihood function with respect to the estimated parameters, using matrix differentiation rules for general matrices. This gives [see Goodwin and Payne (1977), p. 48, eqn 3.3.10]

\frac{\partial \ln L}{\partial R} = -\frac{N}{2} R^{-1} + \frac{1}{2} R^{-1} E^T E R^{-1}.    (3)
* Received 19 December 1989; received in final form 12 June 1990. Recommended for publication in revised form by Editor W. S. Levine.
† Department of Chemical Engineering, Åbo Akademi, Biskopsgatan 8, SF-20500 Åbo, Finland.
‡ Author to whom all correspondence should be addressed.
The maximum likelihood estimate of R is given by the root

\hat{R} = \frac{1}{N} E^T E.    (4)
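As a sanity check, the following minimal sketch (not part of the original paper) simulates E from N zero-mean Gaussian n-vectors, evaluates the logarithm of equation (1), and confirms that equation (4) beats randomly drawn SPD competitors; the dimensions, seed, and the chosen R_true are arbitrary assumptions for illustration.

```python
# Minimal numerical sketch (assumed setup, not from the paper): simulate E,
# evaluate the log of equation (1), and check the candidate R_hat = E^T E / N.
import numpy as np

rng = np.random.default_rng(0)
N, n = 5000, 3
R_true = np.array([[2.0, 0.5, 0.0],
                   [0.5, 1.0, 0.3],
                   [0.0, 0.3, 1.5]])          # assumed true covariance
E = rng.multivariate_normal(np.zeros(n), R_true, size=N)  # rows are e_i^T

def log_L(R_inv):
    # log of equation (1): -(Nn/2) log(2 pi) + (N/2) log|R^-1| - tr(E^T E R^-1)/2
    return (-0.5 * N * n * np.log(2 * np.pi)
            + 0.5 * N * np.linalg.slogdet(R_inv)[1]
            - 0.5 * np.trace(E.T @ E @ R_inv))

R_hat = E.T @ E / N                           # equation (4)
print(np.round(R_hat, 2))                     # close to R_true for large N

# Any randomly drawn SPD competitor attains a smaller likelihood:
for _ in range(5):
    M = rng.normal(size=(n, n))
    R_alt = M @ M.T + 0.1 * np.eye(n)         # guaranteed SPD
    assert log_L(np.linalg.inv(R_hat)) >= log_L(np.linalg.inv(R_alt))
```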
The saddle point nature of this symmetric root is seen as follows. We perturb the inverse solution with a real nonzero skew-symmetric matrix H = -H^T:

R^{-1} = N (E^T E)^{-1} + \lambda H    (5)

where \lambda is a real scalar. Observe that the trace within the exponential in equation (1) is unchanged by this perturbation, since the trace of a sum is the sum of traces, and the skew-symmetric real matrix E H E^T has zero diagonal elements, whereby \operatorname{tr}(E^T E H) = \operatorname{tr}(E H E^T) = 0. Therefore

L(R^{-1}) = L(\hat{R}^{-1}) \left( \frac{|\hat{R}^{-1} + \lambda H|}{|\hat{R}^{-1}|} \right)^{N/2}.    (6)
Also the determinant inequality

|\hat{R}^{-1} + \lambda H| > |\hat{R}^{-1}|    (7)

holds for all \lambda \neq 0 as \hat{R}^{-1} is SPD (see the Appendix), so that

L(R^{-1}) > L(\hat{R}^{-1}).    (8)
The likelihood function will thus be increased (monotonically with respect to |\lambda|) by perturbations about the stationary point with skew-symmetric matrices, and the stationary point given by equation (4) is only a saddle point (for general matrices as the domain).
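The saddle-point behaviour of equations (5)-(8) is easy to observe numerically. The sketch below, an illustration under assumed data with a randomly generated H, shows the log-likelihood growing strictly with |\lambda| along a skew-symmetric direction.

```python
# Sketch of equations (5)-(8) under an assumed setup: a skew-symmetric
# perturbation of R_hat^{-1} leaves the trace term of (1) unchanged but
# increases the determinant factor, so L grows with |lam|.
import numpy as np

rng = np.random.default_rng(0)
N, n = 200, 3
E = rng.normal(size=(N, n))                   # rows are the observed e_i^T
R_hat_inv = N * np.linalg.inv(E.T @ E)        # inverse of equation (4)

def log_L(R_inv):
    # log of equation (1), dropping the constant -(Nn/2) log(2 pi)
    return 0.5 * N * np.linalg.slogdet(R_inv)[1] - 0.5 * np.trace(E.T @ E @ R_inv)

K = rng.normal(size=(n, n))
H = K - K.T                                   # nonzero skew-symmetric H = -H^T

vals = [log_L(R_hat_inv + lam * H) for lam in (0.0, 0.1, 0.2, 0.4)]
assert all(a < b for a, b in zip(vals, vals[1:]))   # L increases with |lam|
```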
A simple solution

The maximum likelihood estimate can be obtained rigorously by using differentiation rules for symmetric matrices (Graybill, 1983). However, the following elementary and concise approach is more appealing, especially for classroom use.
Constrain R to be SPD and assume E^T E is invertible, so that it is also SPD. Then square roots of these matrices are defined (uniquely, by requiring them to be SPD). Define the matrix

A = (E^T E)^{1/2} R^{-1} (E^T E)^{1/2} > 0    (9)

and note that it has the same trace as E^T E R^{-1}. The determinant of A is related to that of R by

|A| = |E^T E| \, |R^{-1}|.    (10)

Now maximization of equation (1) is equivalent to maximizing

f = |A|^{N/2} \exp\left(-\tfrac{1}{2} \operatorname{tr}\{A\}\right)    (11)

with respect to A, where A is SPD.
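The trace and determinant identities behind this change of variables can be verified directly; the sketch below uses assumed data and an arbitrary SPD trial R.

```python
# Sketch checking equations (9)-(10) under assumed data: A shares its trace
# with E^T E R^{-1}, and |A| = |E^T E| |R^{-1}|.
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(200, 3))

def spd_sqrt(S):
    # the unique SPD square root, via the symmetric eigendecomposition
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.sqrt(w)) @ V.T

R = np.diag([1.0, 2.0, 3.0])                  # an arbitrary SPD trial covariance
S = spd_sqrt(E.T @ E)
A = S @ np.linalg.inv(R) @ S                  # equation (9)

assert np.isclose(np.trace(A), np.trace(E.T @ E @ np.linalg.inv(R)))
assert np.isclose(np.linalg.det(A), np.linalg.det(E.T @ E) / np.linalg.det(R))
```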
Let \lambda_1, \lambda_2, \dots, \lambda_n be the eigenvalues of A; being SPD, A can be diagonalized and all its eigenvalues are positive. Then

f = \left( \prod_{i=1}^{n} \lambda_i \right)^{N/2} \exp\left( -\tfrac{1}{2} \sum_{i=1}^{n} \lambda_i \right) = \prod_{i=1}^{n} \lambda_i^{N/2} e^{-\lambda_i/2}.    (12)
(This equation would be equally valid for general symmetric matrices A (or R), and considering negative \lambda_i clearly shows that no global maximum would ever be attained.) The stationary point is now obtained from the last expression by considering the factors separately:

\frac{d}{d\lambda_i} \left( \lambda_i^{N/2} e^{-\lambda_i/2} \right) = \lambda_i^{(N/2)-1} e^{-\lambda_i/2} \left( \frac{N - \lambda_i}{2} \right) = 0.    (13)

For the allowed eigenvalues \lambda_i \in \, ]0, \infty[ the unique solution of equation (13) is

\lambda_i = N, \quad \forall i.    (14)
Since the derivative of each factor changes sign just once,
from positive to negative, the stationary point obtained is the
global maximum (within the domain considered).
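A quick grid search confirms this one-dimensional picture; the sketch below (with an assumed N) locates the maximizer of one factor of (12), working on the log scale to avoid overflow.

```python
# Sketch of the scalar maximization in equations (12)-(14): the factor
# lambda^(N/2) * exp(-lambda/2) peaks at lambda = N. N = 50 is assumed.
import numpy as np

N = 50
lam = np.linspace(1e-3, 4 * N, 100001)
log_factor = (N / 2) * np.log(lam) - lam / 2  # log of one factor of (12)
assert np.isclose(lam[np.argmax(log_factor)], N, atol=0.01)
```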
Now all the eigenvalues of A are equal. Note that the only matrix similar to a multiple of the identity is that multiple itself, and

A = NI = (E^T E) \hat{R}^{-1}.    (15)

The unique SPD maximum likelihood estimate of R is therefore

\hat{R} = \frac{1}{N} E^T E.    (16)
Discussion

An elementary proof for the maximum likelihood estimate of the covariance matrix, for normally distributed random vectors, was presented. This proof, it is hoped, will supersede the less rigorous but more complicated proofs in some current textbooks. Aside from the main theme, a determinant inequality that the authors have not managed to find in the literature is derived in the Appendix and used to illustrate the importance of proper consideration of the domain in optimization problems.
References

Goodwin, G. C. and R. L. Payne (1977). Dynamic System Identification: Experimental Design and Data Analysis. Academic Press, New York.
Graybill, F. A. (1983). Matrices with Applications in Statistics. Wadsworth, Belmont, CA.
Appendix

Let S be a real symmetric positive definite matrix and H be some nonzero real skew-symmetric matrix (H = -H^T). Here we show that

|S + \lambda H| > |S|    (A.1)

for all values of the real scalar \lambda \neq 0, and in fact the determinant monotonically increases with the absolute magnitude of this perturbation parameter. (The reader may observe that the same proof is valid for the skew-Hermitian perturbation of a Hermitian matrix in the complex case, provided that absolute values of the determinants are taken.)
Observe that

S + \lambda H = S^{1/2} \left( I + \lambda S^{-1/2} H S^{-1/2} \right) S^{1/2}    (A.2)

and by the product rule for determinants

|S + \lambda H| = |S| \cdot |I + \lambda G|    (A.3)

with

G = S^{-1/2} H S^{-1/2}.    (A.4)

Since G is skew-symmetric its eigenvalues are purely imaginary, and these are shifted by unity when the identity matrix is added:

|S + \lambda H| = |S| \prod_{j=1}^{n} (1 + i \lambda \lambda_j),    (A.5)
where i\lambda_j, j = 1, \dots, n, are the eigenvalues of G and i = \sqrt{-1}. The product on the RHS is purely real since the LHS is, so taking the absolute value will at most change the sign. Shifting the absolute value to the factors of the product gives

|S| \prod_{j=1}^{n} |1 + i \lambda \lambda_j| = |S| \prod_{j=1}^{n} \left( 1 + \lambda^2 \lambda_j^2 \right)^{1/2},    (A.6)
which obviously is monotonically increasing with |\lambda|, strictly so since at least one of the eigenvalues is nonzero. Due to continuity with respect to \lambda, the RHS of (A.5) cannot jump to negative values as \lambda moves away from zero; thus it stays positive, and taking the absolute value does not even change the sign. This proves that the LHS of (A.5) is also monotonically increasing with respect to the absolute value of \lambda. The weaker result

|S + \lambda H| > |S|    (A.7)

for all real \lambda \neq 0 follows from this strict monotonicity.
Q.E.D.
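Both the strict inequality (A.1) and the monotonicity in |\lambda| are easy to confirm numerically; the sketch below uses an assumed random SPD S and a random skew-symmetric H.

```python
# Numerical check of (A.1) and its strict monotonicity, for an assumed
# random SPD matrix S and a nonzero skew-symmetric H.
import numpy as np

rng = np.random.default_rng(1)
n = 4
M = rng.normal(size=(n, n))
S = M @ M.T + n * np.eye(n)                  # SPD by construction
K = rng.normal(size=(n, n))
H = K - K.T                                  # skew-symmetric, H = -H^T

dets = [np.linalg.det(S + lam * H) for lam in (0.0, 0.5, 1.0, 2.0)]
assert all(d > dets[0] for d in dets[1:])               # (A.1): |S + lam H| > |S|
assert all(a < b for a, b in zip(dets, dets[1:]))       # increases with |lam|
```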