
Redundancy of Universal Coding, Kolmogorov Complexity, and Hausdorff Dimension

Authors:
  • Hayato Takahashi (Random Data Lab. Inc.)

Abstract

We study asymptotic code lengths of universal codes for parametric models. We exhibit a universal code whose code length is asymptotically less than or equal to that of the minimum description length (MDL) code. In particular, when some of the parameters of a source are not random reals, the coefficient of the logarithmic term for our universal code is less than that of the MDL code. We describe the redundancy in terms of Kolmogorov complexity and Hausdorff dimension, and show that our universal code is asymptotically optimal in the sense that the coefficient of the logarithmic term in its code length is minimal. Our universal code can be considered a natural extension of the Shannon code and the MDL code.
Redundancy of universal coding, Kolmogorov complexity, and Hausdorff dimension (Abstract)
Hayato Takahashi
The Institute of Statistical Mathematics, 4-6-7 Minami-Azabu, Minato-ku, Tokyo 106-8569, Japan; tel: 81-3-3446-1501 (ext. 9701); fax: 81-3-5421-8750; e-mail: takahasi@ism.ac.jp
January 28, 2003
In [1, 4, 5], under a suitable condition, it is shown that the asymptotic code-length of sequences generated by a parametric model $P_\theta$ is given as follows:

$$-\log P_{\hat\theta} + \frac{k}{2}\log n + o(\log n), \quad P_\theta\text{-a.e.}, \tag{1}$$

where $\hat\theta$ is the maximum-likelihood estimator, $k$ is the dimension of the parameter space, $n$ is the sample size, and the base of $\log$ is 2.
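For concreteness (this worked example is ours, not the paper's): in the Bernoulli model, $k = 1$, and formula (1) can be evaluated directly on a sample. The following Python sketch computes the two-part MDL code length; the function name is illustrative.

    import math

    def mdl_code_length(x):
        """Two-part MDL code length (1) for a 0/1 sample x, with k = 1:
        -log2 P_thetahat(x) + (1/2) * log2 n."""
        n = len(x)
        p = sum(x) / n                       # maximum-likelihood estimate
        if p in (0.0, 1.0):
            nll = 0.0                        # degenerate case: -log2 P = 0
        else:
            nll = -sum(math.log2(p if b else 1.0 - p) for b in x)
        return nll + 0.5 * math.log2(n)      # (k/2) log2 n with k = 1

    # A sample of length 1024 with 700 ones: about n*H(700/1024) + 5 bits.
    print(mdl_code_length([1] * 700 + [0] * 324))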
In view of the proof of Rissanen [4], the second term of (1) is the description of the maximum-likelihood estimator $\hat\theta$ to $(\log n)/2$-bit accuracy; it is therefore natural to study a universal coding obtained by compressing the description of the maximum-likelihood estimator. In fact, Vovk [6] studied a universal coding for the Bernoulli model with code-length

$$\inf_{\theta} \bigl( -\log P_\theta + K(\theta \mid n) \bigr), \tag{2}$$

where $\theta$ ranges over the computable reals and $K$ is the prefix Kolmogorov complexity [2, 3].
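To see what (2) buys, consider a source whose parameter is a simple computable real. The sketch below is our illustration, not the paper's construction; since prefix complexity is uncomputable, the constant standing in for $K(\theta \mid n)$ is an assumption we plug in by hand.

    import math

    def code2_length(x, theta, k_theta_bits):
        """Code length (2) for one fixed computable theta:
        -log2 P_theta(x) plus an assumed description cost k_theta_bits.
        (K(theta | n) itself is uncomputable; this is a stand-in.)"""
        nll = -sum(math.log2(theta if b else 1.0 - theta) for b in x)
        return nll + k_theta_bits

    # For a Bernoulli(1/3) source, theta = 1/3 is computable, so
    # K(theta | n) = O(1); code (2) then saves the (1/2) log2 n bits
    # that (1) spends on describing thetahat.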
In order to study the code (2), we study the asymptotic expansion of the Bayes mixture $\int P_\theta \, dm(\theta)$ with two kinds of priors: one is a prior that is singular with respect to Lebesgue measure, and the other is an a priori probability on Euclidean space.
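For the Bernoulli model with the uniform (Lebesgue) prior, the Bayes mixture has a closed form, which makes its asymptotic expansion easy to check numerically; this worked example is ours.

    import math

    def bayes_mixture_length(x):
        """Mixture code length -log2 of the integral of p^a (1-p)^(n-a) dp
        over [0,1] for a 0/1 sample x with the uniform prior; the integral
        equals 1 / ((n+1) * C(n, a)), where a is the number of ones."""
        n, a = len(x), sum(x)
        return math.log2((n + 1) * math.comb(n, a))

    # For n = 1024, a = 700 this agrees with the MDL length (1),
    # -log2 P_thetahat + (1/2) log2 n, up to an O(1) term.
    print(bayes_mixture_length([1] * 700 + [0] * 324))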
By taking the prior of the Bayes mixture to be an a priori probability on Euclidean space, we extend the universal coding (2) to a multidimensional parameter space and show a universal coding whose code-length is

$$-\log P_{\hat\theta} + \sum_{j=1}^{k} K\bigl(\text{description of } \theta_j \text{ up to } (\log n)/2 \text{ bits} \mid n\bigr) + O(\log\log n), \quad P_\theta\text{-a.e.}, \tag{3}$$

where $\theta = (\theta_1, \cdots, \theta_k)$. On the other hand, Rissanen [5] showed that the code-length (1) is optimal up to the $O(\log n)$ term except for parameters in a set of
Lebesgue measure 0. Therefore we characterize, in terms of Hausdorff dimension, the parameter set for which

$$\frac{\sum_{j=1}^{k} K\bigl(\text{description of } \theta_j \text{ up to } (\log n)/2 \text{ bits} \mid n\bigr)}{\frac{k}{2}\log n} < 1 \qquad (n \to \infty).$$

Consequently we show a universal coding with the following property: for all real numbers $h_j$, $0 \le h_j \le 1$, $1 \le j \le k$, there are subsets $H_1 \times \cdots \times H_k$ of the parameter space, with $\dim H_j = h_j$, such that if $\theta_j \in H_j$, the code-length is

$$-\log P_{\hat\theta} + \frac{\sum_{j=1}^{k} \dim H_j}{2}\log n + o(\log n), \quad P_\theta\text{-a.e.}, \tag{4}$$

where $\dim H$ is the Hausdorff dimension of $H$. We also show that this code-length is optimal up to the $O(\log n)$ term when the parameter space is the unit interval.
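As an illustration of (4) (our example, not the paper's): if each $\theta_j$ lies in the middle-thirds Cantor set, whose Hausdorff dimension is $\log 2 / \log 3$, the coefficient of $\log n$ drops below the MDL value $k/2$.

    import math

    # Hausdorff dimension of the middle-thirds Cantor set.
    dim_cantor = math.log(2) / math.log(3)      # ~ 0.6309

    # With k parameters, each confined to the Cantor set, the coefficient
    # of log n in (4) is k * dim_cantor / 2 instead of the MDL value k / 2.
    k = 3
    print(k / 2, k * dim_cantor / 2)            # 1.5 vs ~ 0.946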
Since the code-lengths of the universal codings (2) and (3) involve Kolmogorov complexity, we cannot construct such codes effectively. To avoid this difficulty, we approximate the Kolmogorov complexity in (2) and (3) by considering a Bayes mixture whose prior is singular with respect to Lebesgue measure. Then we show a universal coding, which is constructive, such that the code-length is

$$-\log P_{\hat\theta} + \frac{h}{2}\log n + o(\log n), \quad P_\theta\text{-a.e.}, \tag{5}$$

where $h = -p\log p - (1-p)\log(1-p)$ and $p$ is the relative frequency of 1 in the dyadic expansion of $\hat\theta$. Note that $h < 1$ if and only if $p \neq 1/2$; that is, if the relative frequency of 1 in the dyadic expansion of $\hat\theta$ is biased, then the code-length (5) is asymptotically less than that of the MDL coding. We also show that the code (5) is optimal up to the $O(\log n)$ term for almost every $\theta$ with respect to the prior.
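A small numerical check of the coefficient $h$ in (5) (our example): $\hat\theta = 1/7$ has the dyadic expansion $0.001001001\ldots$, so $p = 1/3$ and $h \approx 0.918 < 1$, in which case (5) is asymptotically shorter than the MDL length.

    import math

    def dyadic_entropy(bits):
        """h = -p log2 p - (1-p) log2 (1-p), with p the relative frequency
        of 1 in (a finite prefix of) the dyadic expansion of thetahat."""
        p = sum(bits) / len(bits)
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

    # thetahat = 1/7 = 0.001001001... in binary, so p = 1/3.
    print(dyadic_entropy([0, 0, 1] * 20))       # ~ 0.918 < 1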
Finally, we remark that the code-lengths shown in this paper give non-trivial upper and lower bounds on Kolmogorov complexity when the source is not a computable measure.
References
[1] A. R. Barron. Logically smooth density estimation. Ph.D. dissertation, Dept. Elec. Eng.,
Stanford Univ., Stanford, CA, Sept. 1985.
[2] G. J. Chaitin. A theory of program size formally identical to information theory. J. ACM,
22:329–340, 1975.
[3] L. A. Levin. Laws of information conservation (nongrowth) and aspects of the foundation
of probability theory. Prob. Inf. Transm., 10:206–210, 1974.
[4] J. Rissanen. Universal coding, information, prediction, and estimation. IEEE Trans.
Inform. Theory, IT-30(4):629–636, 1984.
[5] J. Rissanen. Stochastic complexity and modeling. Ann. Statist., 14(3):1080–1100, 1986.
[6] V. G. Vovk. Learning about the parameter of the Bernoulli model. J. Comput. System
Sci., 55:96–104, 1997.