Automatica, Vol. 27, No. 2, pp. 425-426, 1991
Printed in Great Britain. 0005-1098/91 $3.00 + 0.00
Pergamon Press plc
© 1991 International Federation of Automatic Control
Technical Communique
An Elementary Derivation of the Maximum Likelihood Estimator of the Covariance Matrix, and an Illustrative Determinant Inequality*
SEPPO KARRILA†‡ and TAPIO WESTERLUND†
Key Words--Maximum likelihood estimation; estimation; determinants; optimization; least-squares estimation.
Abstract--The unique maximum likelihood estimate of the covariance matrix of normally distributed random vectors is derived by use of elementary linear algebra, leading to simple scalar equations. In addition, the application of a determinant inequality, also derived here, shows that a standard "derivation" of the maximum likelihood estimate is fallacious.
Introduction

IN SOME textbooks on estimation theory [for example, Goodwin and Payne (1977), p. 48] the maximum likelihood estimate of the covariance matrix of normally distributed random vectors is obtained by matrix differentiation results for general matrices, without restricting the covariance matrix to being symmetric positive definite (SPD) or even just symmetric. A stationary point of the likelihood function is obtained; the stationary point is then observed to be SPD, and it is concluded that this must be the unique solution to the maximization problem in the smaller domain of SPD matrices. Although the solution is correct, its derivation is not, and some confusion may arise since the likelihood function attains arbitrarily large values when the covariance matrix is not restricted to being SPD. Naturally, the stationary point obtained for general matrices was in fact a saddle point, and some further insight into the situation is provided by the monotonicity result for determinants presented here.
The maximum likelihood estimate

The likelihood function of N independent normally distributed random vectors e with n real components is given by

L = (2\pi)^{-Nn/2} \, |R^{-1}|^{N/2} \exp\left(-\tfrac{1}{2} \operatorname{tr}\{E^T E R^{-1}\}\right)    (1)

where R is the unknown covariance matrix. The matrix E is formed from the observations according to

E^T = [e_1, e_2, \dots, e_N].    (2)
The standard way of obtaining the maximum likelihood estimates is by differentiating the logarithm of the likelihood function with respect to the estimated parameters, using matrix differentiation rules for general matrices. This gives [see Goodwin and Payne (1977), p. 48, eqn 3.3.10]

\frac{\partial \ln L}{\partial R} = -\frac{N}{2} R^{-1} + \frac{1}{2} R^{-1} E^T E R^{-1}.    (3)
* Received 19 December 1989; received in final form 12 June 1990. Recommended for publication in revised form by Editor W. S. Levine.
† Department of Chemical Engineering, Åbo Akademi, Biskopsgatan 8, SF-20500 Åbo, Finland.
‡ Author to whom all correspondence should be addressed.
The maximum likelihood estimate of R is given by the root

\hat{R} = \frac{1}{N} E^T E.    (4)
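As a sanity check, the following minimal sketch (not part of the original paper) simulates E from N zero-mean Gaussian n-vectors, evaluates the logarithm of equation (1), and confirms that equation (4) beats randomly drawn SPD competitors; the dimensions, seed, and the chosen R_true are arbitrary assumptions for illustration.

```python
# Minimal numerical sketch (assumed setup, not from the paper): simulate E,
# evaluate the log of equation (1), and check the candidate R_hat = E^T E / N.
import numpy as np

rng = np.random.default_rng(0)
N, n = 5000, 3
R_true = np.array([[2.0, 0.5, 0.0],
                   [0.5, 1.0, 0.3],
                   [0.0, 0.3, 1.5]])          # assumed true covariance
E = rng.multivariate_normal(np.zeros(n), R_true, size=N)  # rows are e_i^T

def log_L(R_inv):
    # log of equation (1): -(Nn/2) log(2 pi) + (N/2) log|R^-1| - tr(E^T E R^-1)/2
    return (-0.5 * N * n * np.log(2 * np.pi)
            + 0.5 * N * np.linalg.slogdet(R_inv)[1]
            - 0.5 * np.trace(E.T @ E @ R_inv))

R_hat = E.T @ E / N                           # equation (4)
print(np.round(R_hat, 2))                     # close to R_true for large N

# Any randomly drawn SPD competitor attains a smaller likelihood:
for _ in range(5):
    M = rng.normal(size=(n, n))
    R_alt = M @ M.T + 0.1 * np.eye(n)         # guaranteed SPD
    assert log_L(np.linalg.inv(R_hat)) >= log_L(np.linalg.inv(R_alt))
```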
The saddle point nature of this symmetric root is seen as follows. We perturb the inverse solution with a real nonzero skew-symmetric matrix H = -H^T:

R^{-1} = N (E^T E)^{-1} + \lambda H    (5)

where \lambda is a real scalar. Observe that the trace within the exponential in equation (1) is unchanged by this perturbation, since the trace of a sum is the sum of traces, and the skew-symmetric real matrix E H E^T has zero diagonal elements, whereby \operatorname{tr}(E^T E H) = \operatorname{tr}(E H E^T) = 0. Therefore

L(R^{-1}) = L(\hat{R}^{-1}) \left( \frac{|\hat{R}^{-1} + \lambda H|}{|\hat{R}^{-1}|} \right)^{N/2}.    (6)
Also the determinant inequality

|\hat{R}^{-1} + \lambda H| > |\hat{R}^{-1}|    (7)

holds for all \lambda \neq 0 as \hat{R}^{-1} is SPD (see the Appendix), so that

L(R^{-1}) > L(\hat{R}^{-1}).    (8)
The likelihood function will thus be increased (monotonically with respect to |\lambda|) by perturbations about the stationary point with skew-symmetric matrices, and the stationary point given by equation (4) is only a saddle point (for general matrices as the domain).
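The saddle-point behaviour of equations (5)-(8) is easy to observe numerically. The sketch below, an illustration under assumed data with a randomly generated H, shows the log-likelihood growing strictly with |\lambda| along a skew-symmetric direction.

```python
# Sketch of equations (5)-(8) under an assumed setup: a skew-symmetric
# perturbation of R_hat^{-1} leaves the trace term of (1) unchanged but
# increases the determinant factor, so L grows with |lam|.
import numpy as np

rng = np.random.default_rng(0)
N, n = 200, 3
E = rng.normal(size=(N, n))                   # rows are the observed e_i^T
R_hat_inv = N * np.linalg.inv(E.T @ E)        # inverse of equation (4)

def log_L(R_inv):
    # log of equation (1), dropping the constant -(Nn/2) log(2 pi)
    return 0.5 * N * np.linalg.slogdet(R_inv)[1] - 0.5 * np.trace(E.T @ E @ R_inv)

K = rng.normal(size=(n, n))
H = K - K.T                                   # nonzero skew-symmetric H = -H^T

vals = [log_L(R_hat_inv + lam * H) for lam in (0.0, 0.1, 0.2, 0.4)]
assert all(a < b for a, b in zip(vals, vals[1:]))   # L increases with |lam|
```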
A simple solution

The maximum likelihood estimate can be obtained rigorously by using differentiation rules for symmetric matrices (Graybill, 1983). However, the following elementary and concise approach is more appealing, especially for classroom use.
Constrain R to be SPD and assume E^T E is invertible, so that it is also SPD. Then square roots of these matrices are defined (uniquely, by requiring them to be SPD). Define the matrix

A = (E^T E)^{1/2} R^{-1} (E^T E)^{1/2} > 0    (9)

and note that it has the same trace as E^T E R^{-1}. The determinant of A is related to that of R by

|A| = |E^T E| \, |R^{-1}|.    (10)

Now maximization of equation (1) is equivalent to maximizing

f = |A|^{N/2} \exp\left(-\tfrac{1}{2} \operatorname{tr}\{A\}\right)    (11)

with respect to A, where A is SPD.
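The trace and determinant identities behind this change of variables can be verified directly; the sketch below uses assumed data and an arbitrary SPD trial R.

```python
# Sketch checking equations (9)-(10) under assumed data: A shares its trace
# with E^T E R^{-1}, and |A| = |E^T E| |R^{-1}|.
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(200, 3))

def spd_sqrt(S):
    # the unique SPD square root, via the symmetric eigendecomposition
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.sqrt(w)) @ V.T

R = np.diag([1.0, 2.0, 3.0])                  # an arbitrary SPD trial covariance
S = spd_sqrt(E.T @ E)
A = S @ np.linalg.inv(R) @ S                  # equation (9)

assert np.isclose(np.trace(A), np.trace(E.T @ E @ np.linalg.inv(R)))
assert np.isclose(np.linalg.det(A), np.linalg.det(E.T @ E) / np.linalg.det(R))
```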
Let \lambda_1, \lambda_2, \dots, \lambda_n be the eigenvalues of A; being SPD, A can be diagonalized and all its eigenvalues are positive. Then

f = \left( \prod_{i=1}^{n} \lambda_i \right)^{N/2} \exp\left( -\tfrac{1}{2} \sum_{i=1}^{n} \lambda_i \right) = \prod_{i=1}^{n} \lambda_i^{N/2} e^{-\lambda_i/2}.    (12)
(This equation would be equally valid for general symmetric matrices A (or R), and considering negative \lambda_i clearly shows that no global maximum would ever be attained.) The stationary point is now obtained from the last expression by considering the factors separately:

\frac{d}{d\lambda_i} \left( \lambda_i^{N/2} e^{-\lambda_i/2} \right) = \lambda_i^{(N/2)-1} e^{-\lambda_i/2} \left( \frac{N - \lambda_i}{2} \right) = 0.    (13)

For the allowed eigenvalues \lambda_i \in \, ]0, \infty[ the unique solution of equation (13) is

\lambda_i = N, \quad \forall i.    (14)
Since the derivative of each factor changes sign just once,
from positive to negative, the stationary point obtained is the
global maximum (within the domain considered).
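A quick grid search confirms this one-dimensional picture; the sketch below (with an assumed N) locates the maximizer of one factor of (12), working on the log scale to avoid overflow.

```python
# Sketch of the scalar maximization in equations (12)-(14): the factor
# lambda^(N/2) * exp(-lambda/2) peaks at lambda = N. N = 50 is assumed.
import numpy as np

N = 50
lam = np.linspace(1e-3, 4 * N, 100001)
log_factor = (N / 2) * np.log(lam) - lam / 2  # log of one factor of (12)
assert np.isclose(lam[np.argmax(log_factor)], N, atol=0.01)
```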
Now all the eigenvalues of A are equal. Note that the only matrix similar to a multiple of the identity is that multiple itself, and

A = NI = (E^T E) \hat{R}^{-1}.    (15)

The unique SPD maximum likelihood estimate of R is therefore

\hat{R} = \frac{1}{N} E^T E.    (16)
Discussion

An elementary proof for the maximum likelihood estimate of the covariance matrix, for normally distributed random vectors, was presented. This proof, it is hoped, will supersede the less rigorous but more complicated proofs in some current textbooks. Aside from the main theme, a determinant inequality that the authors have not managed to find in the literature is derived in the Appendix and used to illustrate the importance of proper consideration of the domain in optimization problems.
References

Goodwin, G. C. and R. L. Payne (1977). Dynamic System Identification: Experimental Design and Data Analysis. Academic Press, New York.
Graybill, F. A. (1983). Matrices with Applications in Statistics. Wadsworth, Belmont, CA.
Appendix

Let S be a real symmetric positive definite matrix and H be some nonzero real skew-symmetric matrix (H = -H^T). Here we show that

|S + \lambda H| > |S|    (A.1)

for all values of the real scalar \lambda \neq 0, and in fact the determinant monotonically increases with the absolute magnitude of this perturbation parameter. (The reader may observe that the same proof is valid for the skew-Hermitian perturbation of a Hermitian matrix in the complex case, provided that absolute values of the determinants are taken.)
Observe that

S + \lambda H = S^{1/2} \left( I + \lambda S^{-1/2} H S^{-1/2} \right) S^{1/2}    (A.2)

and by the product rule for determinants

|S + \lambda H| = |S| \cdot |I + \lambda G|    (A.3)

with

G = S^{-1/2} H S^{-1/2}.    (A.4)

Since G is skew-symmetric its eigenvalues are purely imaginary, and these are shifted by unity when the identity matrix is added:

|S + \lambda H| = |S| \prod_{j=1}^{n} (1 + i \lambda \lambda_j),    (A.5)
where i\lambda_j, j = 1, \dots, n, are the eigenvalues of G and i = \sqrt{-1}. The product on the RHS is purely real since the LHS is, so taking the absolute value will at most change the sign. Shifting the absolute value to the factors of the product gives

|S| \prod_{j=1}^{n} |1 + i \lambda \lambda_j| = |S| \prod_{j=1}^{n} \left( 1 + \lambda^2 \lambda_j^2 \right)^{1/2},    (A.6)
which obviously is monotonically increasing with |\lambda|, strictly so since at least one of the eigenvalues is nonzero. Due to continuity with respect to \lambda, the RHS of (A.5) cannot jump to negative values as \lambda moves away from zero; thus it stays positive, and taking the absolute value does not even change the sign. This proves that the LHS of (A.5) is also monotonically increasing with respect to the absolute value of \lambda. The weaker result

|S + \lambda H| > |S|    (A.7)

for all real \lambda \neq 0 follows from this strict monotonicity.
Q.E.D.
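Both the strict inequality (A.1) and the monotonicity in |\lambda| are easy to confirm numerically; the sketch below uses an assumed random SPD S and a random skew-symmetric H.

```python
# Numerical check of (A.1) and its strict monotonicity, for an assumed
# random SPD matrix S and a nonzero skew-symmetric H.
import numpy as np

rng = np.random.default_rng(1)
n = 4
M = rng.normal(size=(n, n))
S = M @ M.T + n * np.eye(n)                  # SPD by construction
K = rng.normal(size=(n, n))
H = K - K.T                                  # skew-symmetric, H = -H^T

dets = [np.linalg.det(S + lam * H) for lam in (0.0, 0.5, 1.0, 2.0)]
assert all(d > dets[0] for d in dets[1:])               # (A.1): |S + lam H| > |S|
assert all(a < b for a, b in zip(dets, dets[1:]))       # increases with |lam|
```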