
Inversion of Covariance Matrix for High Dimension Data

March 2011
Journal of Mathematics and Statistics 7(3):227-229


DOI:10.3844/jmssp.2011.227.229




Journal of Mathematics and Statistics 7 (3): 227-229, 2011

ISSN 1549-3644



Samruam Chongcharoen

National Institute of Development Administration, School of Applied Statistics, Bangkok 10240, Thailand

Abstract: Problem statement: In testing the mean vector of independent and identically distributed multivariate normal random vectors with unknown covariance matrix, the p×p sample covariance matrix S does not have an inverse when the sample size does not exceed the dimension, n ≤ p. Such data arise, for example, from DNA microarrays, where a large number of gene expression levels are measured on relatively few subjects. Hence any statistic that involves the inverse of S does not exist. Approach: In this study, we consider a modification of S of the form S + cI and look for the smallest real value c ≠ 0 for which $(S + cI)^{-1}$ exists. Results: The study shows that, when the dimension p tends to infinity and the change in S is kept as small as possible, $(S + cI)^{-1}$ exists when c = 1. Conclusion: In statistical analysis of high-dimensional data for which the inverse of the sample covariance matrix does not exist, one way to obtain an invertible matrix is to replace the sample covariance matrix S by the form S + cI, and we recommend choosing c = 1.

Key words: DNA microarrays, eigenvalue, positive semidefinite, positive definite, gene expression, covariance matrix, statistic value, real vector, real number, determinant, symmetric matrix, definite matrix

INTRODUCTION

Suppose $X_1, X_2, \ldots, X_n$ is a random sample from a p-dimensional multivariate normal distribution with unknown mean $\mu = (\mu_1, \mu_2, \ldots, \mu_p)'$ and unknown positive definite covariance matrix V with n ≤ p. The sample mean and the p×p sample covariance matrix S are

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad \text{and} \quad S = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})' \qquad (1)$$
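Equation (1) can be sketched in a few lines of Python. This is a minimal illustration, not code from the study: the helper names (`sample_mean`, `sample_covariance`, `determinant`) and the toy data with n = 3 observations on p = 4 variables are assumptions chosen only to show that S is singular when n ≤ p.

```python
from typing import List

Matrix = List[List[float]]


def sample_mean(X: Matrix) -> List[float]:
    """Column means of an n x p data matrix (rows are observations)."""
    n = len(X)
    p = len(X[0])
    return [sum(row[j] for row in X) / n for j in range(p)]


def sample_covariance(X: Matrix) -> Matrix:
    """p x p sample covariance matrix S of equation (1), with divisor n - 1."""
    n = len(X)
    p = len(X[0])
    xbar = sample_mean(X)
    centered = [[row[j] - xbar[j] for j in range(p)] for row in X]
    return [
        [sum(c[j] * c[k] for c in centered) / (n - 1) for k in range(p)]
        for j in range(p)
    ]


def determinant(A: Matrix) -> float:
    """Determinant by Gaussian elimination with partial pivoting."""
    M = [row[:] for row in A]
    size = len(M)
    det = 1.0
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            return 0.0  # no usable pivot: singular to working precision
        if pivot != col:
            M[col], M[pivot] = M[pivot], M[col]
            det = -det
        det *= M[col][col]
        for r in range(col + 1, size):
            factor = M[r][col] / M[col][col]
            for k in range(col, size):
                M[r][k] -= factor * M[col][k]
    return det


# Hypothetical toy data: n = 3 observations (rows) on p = 4 variables.
X = [
    [1.0, 2.0, 0.0, 4.0],
    [2.0, 1.0, 3.0, 1.0],
    [3.0, 0.0, 3.0, 1.0],
]
S = sample_covariance(X)
print("det(S) =", determinant(S))  # zero up to rounding, since n <= p
```

Because the centered observations sum to zero, S has rank at most n − 1, so its determinant vanishes whenever n ≤ p.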

A covariance matrix is a real symmetric matrix, and Harville (1997) showed that every symmetric matrix has an eigenvalue.

When n ≤ p, Hotelling's $T^2$ statistic does not exist, so Dempster (1958; 1960), Bai and Saranadasa (1996) and Srivastava and Du (2008) developed test statistics that use other forms of S in place of the inverse of S, which does not exist. The work of Polymenis (2011), Girgis et al. (2010), George and Kibria (2010), Yahya et al. (2011) and Nassiry et al. (2009) provides related background that helps in developing these ideas.

Searle (1982) also showed that the eigenvalues of every real symmetric matrix are real. For n ≤ p, Johnson and Wichern (2002) have shown that the determinant of the sample covariance matrix is zero for all samples, that is, S is singular. Now consider a p×p covariance matrix A with an eigenvalue $\lambda$ and a corresponding real eigenvector v ≠ 0. By the definition of an eigenvalue, we have $Av = \lambda v$, then:

$$v'Av = \lambda v'v$$

and then:

$$\lambda = \frac{v'Av}{v'v}$$
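The ratio of v′Av to v′v above can be checked numerically. The sketch below is illustrative only: the 2×2 matrix `A`, its eigenvectors `v1` and `v2`, and the helper names are assumptions, not quantities from the study.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def mat_vec(A: Matrix, v: Vector) -> Vector:
    """Matrix-vector product A v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u: Vector, v: Vector) -> float:
    """Inner product u'v."""
    return sum(a * b for a, b in zip(u, v))


def rayleigh_quotient(A: Matrix, v: Vector) -> float:
    """v'Av / v'v, which equals the eigenvalue when v is an eigenvector of A."""
    return dot(v, mat_vec(A, v)) / dot(v, v)


# Hypothetical symmetric positive definite matrix with eigenvalues 3 and 1.
A = [[2.0, 1.0], [1.0, 2.0]]
v1 = [1.0, 1.0]   # A v1 = 3 v1
v2 = [1.0, -1.0]  # A v2 = 1 v2

print(rayleigh_quotient(A, v1), rayleigh_quotient(A, v2))
```

Both quotients are positive here, which matches the statement that a positive definite matrix has only positive eigenvalues.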

Because $v'v > 0$, A is positive semidefinite ($v'Av \ge 0$) if and only if every eigenvalue satisfies $\lambda \ge 0$, and A is positive definite ($v'Av > 0$) if and only if every eigenvalue satisfies $\lambda > 0$. Thus a covariance matrix is at least positive semidefinite. Searle (1982) also showed that for any p×p matrix A, the determinant of A is equal to the product of its eigenvalues, that is, $|A| = \prod_{i=1}^{p} \lambda_i$. Hence, for n ≤ p, the sample covariance matrix S must have at least one zero eigenvalue. Since every positive definite matrix is nonsingular and has a positive determinant, the easiest way to make the covariance matrix S from high-dimensional data nonsingular is to modify it into a positive definite matrix. We consider the form S + cI, c ≠ 0, and look for the smallest c ≠ 0 for which $(S + cI)^{-1}$ exists. Now suppose that S has r nonzero eigenvalues, that is, it has exactly r positive eigenvalues and p − r zero eigenvalues. We are interested in modifying S to be nonsingular with the smallest change in S by considering S + cI, where c ≠ 0 is a real number.
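As a small worked illustration of these statements, consider a hypothetical singular 2×2 matrix (chosen for this sketch, not taken from the study) with p = 2 and r = 1:

```latex
\begin{gather*}
S = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, \qquad
\lambda_1 = 2,\quad \lambda_2 = 0, \qquad
|S| = \lambda_1 \lambda_2 = 0, \\
S + cI = \begin{pmatrix} 1 + c & 1 \\ 1 & 1 + c \end{pmatrix}, \qquad
|S + cI| = (1 + c)^2 - 1 = (\lambda_1 + c)\,c = (2 + c)\,c .
\end{gather*}
```

With c = 1 the determinant equals 3, so S + I is nonsingular in this example.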

MATERIALS AND METHODS

Suppose $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_r > 0$ and $\lambda_{r+1} = \lambda_{r+2} = \ldots = \lambda_p = 0$ are all the eigenvalues of S. By the definition of an eigenvalue, for any eigenvalue $\lambda_i$, i = 1, 2, …, p, of S with a corresponding real eigenvector v ≠ 0, we have $Sv = \lambda_i v$, and then $(S + cI)v = Sv + cv = \lambda_i v + cv = (\lambda_i + c)v$. So $\lambda_i + c$ is an eigenvalue of S + cI. Thus the p eigenvalues of S + cI are $\lambda_1 + c \ge \lambda_2 + c \ge \ldots \ge \lambda_r + c > c = \ldots = c$. We can see that c cannot be negative, because if it were, S + cI could not be a positive definite matrix. Now if 0 < c < 1, the determinant of S + cI is:

$$|S + cI| = \prod_{i=1}^{r} (\lambda_i + c)\,(c)^{p-r}$$

which approaches zero as p tends to infinity, so that in the limit $(S + cI)^{-1}$ does not exist. Therefore the only possible case is c ≥ 1. Since we are looking for the smallest value c for which $(S + cI)^{-1}$ exists, we pick c = 1. This completes the proof.
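The behaviour of the determinant in this argument can be illustrated numerically. The following Python sketch evaluates the factorization of |S + cI| given above. It makes several assumptions for illustration only: the positive eigenvalues 3, 2 and 1 are hypothetical, the number r of positive eigenvalues is held fixed while p grows, and the function name `det_shifted` is invented here.

```python
import math
from typing import Sequence


def det_shifted(positive_eigs: Sequence[float], p: int, c: float) -> float:
    """|S + cI| = prod_{i=1}^{r} (lambda_i + c) * c**(p - r), r = len(positive_eigs)."""
    r = len(positive_eigs)
    return math.prod(lam + c for lam in positive_eigs) * c ** (p - r)


# Hypothetical positive eigenvalues of S (r = 3); the other p - r are zero.
eigs = [3.0, 2.0, 1.0]

for p in (5, 50, 500):
    # With 0 < c < 1 the factor c**(p - r) shrinks as p grows;
    # with c = 1 that factor is exactly 1 for every p.
    print(p, det_shifted(eigs, p, 0.5), det_shifted(eigs, p, 1.0))
```

For any fixed finite p the determinant with 0 < c < 1 is still positive; it is only in the limit p → ∞ that it tends to zero, whereas with c = 1 it does not depend on p − r at all.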

RESULTS

This study provides a way to modify a sample covariance matrix S obtained from data whose dimension p is not smaller than the number of available observations n, n ≤ p, so that the inverse of S does not exist. The matrix S is replaced by S + cI with the smallest change in S, and $(S + cI)^{-1}$ exists with c = 1.
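The result can be tried out on a small example. The sketch below is a plain-Python illustration under stated assumptions: the singular 4×4 matrix `S` is a hypothetical sample covariance matrix of rank 2, and the helpers `shift_by_identity`, `invert` and `mat_mul` are names invented here. It shows that inversion of S fails, while S + I can be inverted.

```python
from typing import List

Matrix = List[List[float]]


def shift_by_identity(S: Matrix, c: float) -> Matrix:
    """Return S + cI."""
    p = len(S)
    return [[S[i][j] + (c if i == j else 0.0) for j in range(p)] for i in range(p)]


def invert(A: Matrix) -> Matrix:
    """Gauss-Jordan inversion with partial pivoting; ValueError if singular."""
    size = len(A)
    # Augment A with the identity matrix: [A | I].
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(size)]
         for i, row in enumerate(A)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        scale = M[col][col]
        M[col] = [x / scale for x in M[col]]
        for r in range(size):
            if r != col:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [row[size:] for row in M]


def mat_mul(A: Matrix, B: Matrix) -> Matrix:
    """Matrix product A B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Hypothetical singular sample covariance matrix (p = 4, rank 2).
S = [
    [1.0, -1.0, 1.5, -1.5],
    [-1.0, 1.0, -1.5, 1.5],
    [1.5, -1.5, 3.0, -3.0],
    [-1.5, 1.5, -3.0, 3.0],
]

try:
    invert(S)
    s_invertible = True
except ValueError:
    s_invertible = False

shifted = shift_by_identity(S, 1.0)   # S + cI with the recommended c = 1
inverse = invert(shifted)
product = mat_mul(shifted, inverse)   # should be close to the identity matrix
print("S invertible:", s_invertible)
```

For a fixed finite p, every eigenvalue of S + cI is at least c when c > 0, so this sketch would also run with other positive values of c; c = 1 is used because it is the value recommended in this study.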

DISCUSSION

At present, data whose dimension p is not smaller than the number of available observations n, n ≤ p, arise in many diverse applied fields, e.g., the medical, pharmaceutical, agricultural, psychological, educational, social, behavioral, political, criminal, industrial, meteorological, zoological and biological sciences, but there are barely any statistical techniques for analyzing this kind of data. The technique presented here may help researchers to develop new statistical techniques for analyzing high-dimensional data.

CONCLUSION

In statistical analysis of high-dimensional data, where the sample size does not exceed the number of dimensions (variables), any statistic that involves the inverse of the sample covariance matrix does not exist, because that inverse does not exist. One way to modify a sample covariance matrix S so that it has an inverse is to replace it by the form S + cI. For this form, we showed that $(S + cI)^{-1}$ exists when c ≥ 1, and for the smallest change in S we recommend choosing c = 1.

ACKNOWLEDGMENT

The author thanks Professor Dr. A.K. Gupta, Department of Mathematics and Statistics, Bowling Green State University, Bowling Green, USA, for his great suggestions.

REFERENCES

Bai, Z. and H. Saranadasa, 1996. Effect of high dimension: An example of a two sample problem. Statist. Sinica, 6: 311-329. http://www3.stat.sinica.edu.tw/statistica/j6n2/j6n21/j6n21.htm

Dempster, A.P., 1958. A high dimensional two sample significance test. Ann. Math. Stat., 29: 995-1010. http://projecteuclid.org/euclid.aoms/1177706437

Dempster, A.P., 1960. A significance test for the separation of two highly multivariate small samples. Biometrics, 16: 41-50. http://www.jstor.org/stable/2527954

George, F. and B.M.G. Kibria, 2010. Some test statistics for testing the binomial parameter: Empirical power comparison. Am. J. Biostat., 1: 82-93. DOI: 10.3844/amjbsp.2010.82.93

Girgis, H., R. Hamed and M. Osman, 2010. Testing the equality of growth curves of independent populations with application on Egypt case. Am. J. Biostat., 1: 46-61. DOI: 10.3844/amjbsp.2010.46.61

Harville, D.A., 1997. Matrix Algebra From A Statistician’s Perspective. Springer, ISBN: 0-387-94978-X, pp: 533. http://springer.com/statistics/statistical

Johnson, R.A. and D.W. Wichern, 2002. Applied Multivariate Statistical Analysis. 5th Edn., Prentice Hall, ISBN: 0-13-092553-5, pp: 135. http://www.prenhall.com/

Nassiry, M.R., A. Javanmard and R. Tohidi, 2009. Application of statistical procedures for analysis of genetic diversity in domestic animal populations. Am. J. Anim. Vet. Sci., 4: 136-141. DOI: 10.3844/ajavsp.2009.136.141


Polymenis, D.A., 2011. An application of univariate statistics to Hotelling’s $T^2$. J. Math. Stat., 7: 86-94. DOI: 10.3844/jmssp.2011.86.94

Searle, S.R., 1982. Matrix Algebra Useful for Statistics. John Wiley and Sons, Inc., ISBN: 0-471-86681-4, pp: 278.

Srivastava, M.D. and M. Du, 2008. A test for mean vector with fewer observations than the dimension. J. Multivariate Anal., 99: 386-402. http://www.elsevier.com/locate/jmva

Yahya, A.A., A. Osman, A.R. Ramli and A. Balola, 2011. Feature selection for high dimensional data: An evolutionary filter approach. J. Comput. Sci., 7: 800-820. DOI: 10.3844/jcssp.2011.800.820
