Inversion of Covariance Matrix for High Dimension Data

Authors:
  • Independent academician/ researcher

Abstract

Problem statement: When testing the mean vector of independent and identically distributed multivariate normal random vectors with unknown covariance matrix, and the sample size does not exceed the dimension (n ≤ p), the p×p sample covariance matrix S does not have an inverse. Such data arise, for example, from DNA microarrays, where a large number of gene expression levels are measured on relatively few subjects. Hence any statistic involving the inverse of S does not exist. Approach: In this study, we considered a modification of S of the form S + cI and sought the smallest real value c ≠ 0 for which $(S + cI)^{-1}$ exists. Results: The study showed that, as the dimension p tends to infinity and with the smallest change in S, $(S + cI)^{-1}$ exists when c = 1. Conclusion: In statistical analysis of high dimensional data for which the inverse of the sample covariance matrix does not exist, one way to modify the sample covariance matrix S so that it has an inverse is to replace it with S + cI, and we recommend choosing c = 1.
Journal of Mathematics and Statistics 7 (3): 227-229, 2011
ISSN 1549-3644
© 2011 Science Publications
Samruam Chongcharoen
National Institute of Development Administration,
School of Applied Statistics, Bangkok, 10240, Thailand
Key words: DNA microarrays, eigenvalue, positive semidefinite, positive definite, gene expression,
covariance matrix, statistic value, real vector, real number, determinant, symmetric
matrix, definite matrix
INTRODUCTION
Now suppose $X_1, X_2, \ldots, X_n$ is a random sample
from a p-dimensional multivariate normal distribution
with unknown mean $\mu = (\mu_1, \mu_2, \ldots, \mu_p)'$ and unknown
positive definite covariance matrix V with n ≤ p. The
sample mean and p×p sample covariance matrix S are

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad \text{and} \quad S = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})' \qquad (1)$$
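The estimators in (1) can be sketched in plain Python with exact rational arithmetic. This is only an illustration: the toy data and the helper names `sample_mean`, `sample_covariance` and `determinant` are assumptions made here, not part of the original study. Because the centered observations sum to zero, S has rank at most n - 1, so its determinant is exactly zero whenever n ≤ p.

```python
from fractions import Fraction


def sample_mean(rows):
    """Return the sample mean vector of n observations (each a list of length p)."""
    n = len(rows)
    p = len(rows[0])
    return [sum(Fraction(row[j]) for row in rows) / n for j in range(p)]


def sample_covariance(rows):
    """Return the p x p sample covariance matrix S with divisor n - 1, as in (1)."""
    n = len(rows)
    p = len(rows[0])
    mean = sample_mean(rows)
    centered = [[Fraction(row[j]) - mean[j] for j in range(p)] for row in rows]
    return [
        [sum(d[j] * d[k] for d in centered) / (n - 1) for k in range(p)]
        for j in range(p)
    ]


def determinant(matrix):
    """Exact determinant of a matrix of Fractions by Gaussian elimination."""
    a = [row[:] for row in matrix]
    size = len(a)
    det = Fraction(1)
    for col in range(size):
        # Find a row at or below the diagonal with a nonzero entry in this column.
        pivot = next((r for r in range(col, size) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)  # no pivot available: the matrix is singular
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= a[col][col]
        for r in range(col + 1, size):
            factor = a[r][col] / a[col][col]
            for k in range(col, size):
                a[r][k] -= factor * a[col][k]
    return det


# Hypothetical toy data: n = 3 observations on p = 4 variables (n <= p).
data = [
    [1, 2, 0, 4],
    [2, 0, 1, 3],
    [0, 1, 5, 2],
]
S = sample_covariance(data)
det_S = determinant(S)
print(det_S)  # prints 0, since rank(S) <= n - 1 < p
```

Exact fractions are used so that the zero determinant is not blurred by floating-point rounding.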
A covariance matrix is a real symmetric matrix,
and Harville (1997) showed that every symmetric
matrix has an eigenvalue.
When n ≤ p, Hotelling's $T^2$ statistic does not exist, so
Dempster (1958; 1960), Bai and Saranadasa (1996) and
Srivastava and Du (2008) developed test statistics that use
other functions of S in place of the inverse
of S. The work of Polymenis (2011), Girgis et al. (2010),
George and Kibria (2010), Yahya et al. (2011) and Nassiry et
al. (2009) may also help in developing these ideas.
Searle (1982) also showed that the eigenvalues of
every real symmetric matrix are real. For n ≤ p, Johnson
and Wichern (2002) have shown that the determinant of the
sample covariance matrix is zero for all samples, that is, S is
singular. Now consider a p×p covariance matrix A
with an eigenvalue λ and a corresponding real eigenvector v ≠ 0.
By the definition of an eigenvalue, we have Av = λv,
then:

$$v'Av = \lambda v'v$$

and then:

$$\frac{v'Av}{v'v} = \lambda$$
Because $v'v > 0$, A is positive semidefinite ($v'Av \ge 0$)
if and only if each of its eigenvalues satisfies λ ≥ 0, and A is positive
definite ($v'Av > 0$) if and only if each satisfies λ > 0. Thus the covariance matrix
S is at least positive semidefinite. Searle (1982) also
showed that for any p×p matrix A, the determinant of A
is equal to the product of its eigenvalues, that is,

$$|A| = \prod_{i=1}^{p} \lambda_i.$$

Hence, the singular covariance matrix S must have at
least one eigenvalue equal to zero. Since every positive
definite matrix is nonsingular and its determinant is
positive, the easiest way to make the covariance matrix
S from high dimensional data nonsingular is to
modify it to be a positive definite matrix. We consider the
form S + cI, c ≠ 0, and look for the smallest c ≠ 0 that
makes $(S + cI)^{-1}$ exist. Now suppose that S has r
nonzero eigenvalues, that is, it has exactly r positive
eigenvalues and p - r zero eigenvalues. We are interested
in modifying S to be nonsingular with the smallest
change in S by considering S + cI for a real
number c ≠ 0.
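As a small worked illustration of these facts (the two observations are hypothetical and chosen only for easy arithmetic), take n = p = 2 with $X_1 = (1, 0)'$ and $X_2 = (0, 1)'$. Then $\bar{X} = (1/2, 1/2)'$ and the sample covariance matrix, its eigenvalues and its determinant are

$$S = \begin{pmatrix} 1/2 & -1/2 \\ -1/2 & 1/2 \end{pmatrix}, \qquad \lambda_1 = 1, \quad \lambda_2 = 0, \qquad |S| = \lambda_1 \lambda_2 = 0,$$

so S is positive semidefinite but singular. Adding cI gives

$$|S + cI| = \left(\tfrac{1}{2} + c\right)^2 - \tfrac{1}{4} = c(1 + c),$$

which is nonzero for c > 0. For c = 1 in particular,

$$|S + I| = 2, \qquad (S + I)^{-1} = \begin{pmatrix} 3/4 & 1/4 \\ 1/4 & 3/4 \end{pmatrix}.$$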
MATERIALS AND METHODS
Suppose $\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \cdots \ge \lambda_r > 0$ and
$\lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_p = 0$ are all the eigenvalues of S. From the definition
of an eigenvalue, for any eigenvalue $\lambda_i$, i = 1, 2,…, p, of S
with a corresponding real eigenvector v ≠ 0, we have $Sv = \lambda_i v$, then
$(S + cI)v = Sv + cv = \lambda_i v + cv = (\lambda_i + c)v$. So $\lambda_i + c$ is an
eigenvalue of S + cI. Thus all p eigenvalues of S + cI
are $\lambda_1 + c \ge \lambda_2 + c \ge \lambda_3 + c \ge \cdots \ge \lambda_r + c > c = \cdots = c$. We
can see that c cannot be negative, because if it were, S
+ cI could not be a positive definite matrix. Now if 0 < c <
1, the determinant of S + cI is:

$$|S + cI| = \left[\prod_{i=1}^{r} (\lambda_i + c)\right] c^{\,p-r}$$

which approaches zero as p tends to infinity, so
that $(S + cI)^{-1}$ does not exist. Therefore the only
possible case is c ≥ 1, but we are looking for the smallest
value c that makes $(S + cI)^{-1}$ exist. So we pick c = 1.
The proof is completed.
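The determinant formula used in this argument can be checked numerically. The sketch below is an illustration under stated assumptions: the function name `shifted_determinant` and the three positive eigenvalues are hypothetical, and r is held fixed while p grows. It evaluates the product of the r factors $\lambda_i + c$ times $c^{p-r}$ for c = 1/2 and c = 1.

```python
from fractions import Fraction


def shifted_determinant(nonzero_eigenvalues, p, c):
    """Return |S + cI| = prod_{i<=r}(lambda_i + c) * c**(p - r).

    nonzero_eigenvalues holds the r positive eigenvalues of S; the remaining
    p - r eigenvalues of S are zero, so each contributes a factor c.
    """
    r = len(nonzero_eigenvalues)
    if p < r:
        raise ValueError("p must be at least the number of nonzero eigenvalues")
    c = Fraction(c)
    det = Fraction(1)
    for lam in nonzero_eigenvalues:
        det *= Fraction(lam) + c
    return det * c ** (p - r)


# Hypothetical positive eigenvalues of S (r = 3), held fixed while p grows.
eigs = [4, 2, 1]

for p in (5, 10, 50, 100):
    half = shifted_determinant(eigs, p, Fraction(1, 2))
    one = shifted_determinant(eigs, p, 1)
    # For finite p both values are strictly positive; only the c = 1/2
    # column shrinks toward zero as p increases.
    print(p, float(half), float(one))
```

With these assumed eigenvalues, the c = 1 column stays at 30 for every p, while the c = 1/2 column halves with each added dimension and tends to zero as p grows, although it remains strictly positive for any finite p.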
RESULTS
The result of this study provides a way to
modify a sample covariance matrix S that comes from
data whose dimension p is larger than the
number of available observations n (n ≤ p), so that the
inverse of the sample covariance matrix does not exist. The
modified matrix is S + cI, with the smallest change in S, and
$(S + cI)^{-1}$ exists with c = 1.
DISCUSSION
At present, data whose dimension p is larger than the
number of available observations n (n ≤ p) arise in many
diverse applied fields, e.g., the medical, pharmaceutical, agricultural,
psychological, educational, social, behavioral, political,
criminal, industrial, meteorological, zoological and
biological sciences, but there are barely any statistical
techniques for analyzing this kind of data. The
technique we found may help researchers to develop
new statistical techniques for analyzing high
dimensional data.
CONCLUSION
In statistical analysis of high dimensional data, where
the sample size is less than the dimension (the number
of variables), any statistic involving the inverse of the
sample covariance matrix will not exist, because that
inverse does not exist. One way to modify a sample
covariance matrix S to have an inverse is to replace it
with the form S + cI. For this form of the sample
covariance matrix, we showed that $(S + cI)^{-1}$ exists
when c ≥ 1 and, for the smallest change in S, we
recommend choosing c = 1.
ACKNOWLEDGMENT
The author thanks Professor Dr. A.K. Gupta, Department of
Mathematics and Statistics, Bowling Green State
University, Bowling Green, USA, for his great
suggestions.
REFERENCES
Bai, Z. and H. Saranadasa, 1996. Effect of high
dimension: An example of a two sample problem.
Statist. Sinica, 6: 311-329.
http://www3.stat.sinica.edu.tw/statistica/j6n2/j6n21/j6n21.htm
Dempster, A.P., 1958. A high dimensional two sample
significance test. Ann. Math. Stat., 29: 995-1010.
http://projecteuclid.org/euclid.aoms/1177706437
Dempster, A.P., 1960. A significance test for the
separation of two highly multivariate small
samples. Biometrics, 16: 41-50.
http://www.jstor.org/stable/2527954
George, F. and B.M.G. Kibria, 2010. Some test
statistics for testing the binomial parameter:
empirical power comparison. Am. J. Biostat., 1:
82-93. DOI: 10.3844/amjbsp.2010.82.93
Girgis, H., R. Hamed and M. Osman, 2010. Testing the
equality of growth curves of independent
populations with application on Egypt case. Am. J.
Biostat., 1: 46-61. DOI:
10.3844/amjbsp.2010.46.61
Harville, D.A., 1997. Matrix Algebra From A
Statistician’s Perspective. Springer, ISBN: 0-387-94978-X,
pp: 533.
http://springer.com/statistics/statistical
Johnson, R.A. and D.W. Wichern, 2002. Applied
Multivariate Statistical Analysis. 5th Edn., Prentice
Hall, ISBN: 0-13-092553-5, pp: 135.
http://www.prenhall.com/
Nassiry, M.R., A. Javanmard and R. Tohidi, 2009.
Application of statistical procedures for analysis of
genetic diversity in domestic animal populations.
Am. J. Anim. Vet. Sci., 4: 136-141. DOI:
10.3844/ajavsp.2009.136.141
Polymenis, D.A., 2011. An application of univariate
statistics to Hotelling's $T^2$. J. Math. Stat., 7: 86-94.
DOI: 10.3844/jmssp.2011.86.94
Srivastava, M.D. and M. Du, 2008. A test for mean
vector with fewer observations than the dimension.
J. Multivariate Anal., 99: 386-402.
http://www.elsevier.com/locate/jmva
Searle, S.R., 1982. Matrix Algebra Useful for Statistics.
John Wiley and Sons, Inc., ISBN: 0-471-86681-4,
pp: 278.
Yahya, A.A., A. Osman, A.R. Ramli and A. Balola,
2011. Feature selection for high dimensional data:
An evolutionary filter approach. J. Comput. Sci.,
7: 800-820. DOI: 10.3844/jcssp.2011.800.820