Article

Estimation of Entropy and Mutual Information

Authors:
Liam Paninski

Abstract

We present some new results on the nonparametric estimation of entropy and mutual information. First, we use an exact local expansion of the entropy function to prove almost sure consistency and central limit theorems for three of the most commonly used discretized information estimators. The setup is related to Grenander's method of sieves and places no assumptions on the underlying probability measure generating the data. Second, we prove a converse to these consistency theorems, demonstrating that a misapplication of the most common estimation techniques leads to an arbitrarily poor estimate of the true information, even given unlimited data. This “inconsistency” theorem leads to an analytical approximation of the bias, valid in surprisingly small sample regimes and more accurate than the usual $(\hat{m}-1)/(2N)$ formula of Miller and Madow over a large region of parameter space. The two most practical implications of these results are negative: (1) information estimates in a certain data regime are likely contaminated by bias, even if “bias-corrected” estimators are used, and (2) confidence intervals calculated by standard techniques drastically underestimate the error of the most common estimation methods. Finally, we note a very useful connection between the bias of entropy estimators and a certain polynomial approximation problem. By casting bias calculation problems in this approximation theory framework, we obtain the best possible generalization of known asymptotic bias results. More interesting, this framework leads to an estimator with some nice properties: the estimator comes equipped with rigorous bounds on the maximum error over all possible underlying probability distributions, and this maximum error turns out to be surprisingly small. We demonstrate the application of this new estimator on both real and simulated data.
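For readers who want to experiment with the estimators discussed in the abstract, a minimal Python sketch of the plug-in (maximum-likelihood) estimator and the Miller-Madow correction, working in bits, might look like the following. This is a generic illustration of the two classical estimators the paper analyzes, not the paper's new minimax estimator.

```python
import numpy as np

def plugin_entropy_bits(counts):
    """Plug-in (maximum-likelihood) entropy estimate, in bits, from a histogram."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log2(p))

def miller_madow_entropy_bits(counts):
    """Plug-in estimate plus the Miller-Madow correction (m_hat - 1) / (2N),
    where m_hat counts the bins with at least one observation; the correction
    is divided by ln 2 to convert from nats to bits."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    m_hat = np.count_nonzero(counts)
    return plugin_entropy_bits(counts) + (m_hat - 1) / (2.0 * n * np.log(2))

# Example: 200 draws from a uniform distribution over 50 symbols (true H = log2 50).
rng = np.random.default_rng(0)
counts = rng.multinomial(200, np.ones(50) / 50)
print(plugin_entropy_bits(counts), miller_madow_entropy_bits(counts), np.log2(50))
```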


... Such a scenario can occur, for example, in image recognition, where symbols represent RGB values. This setup is typically referred to as the large-alphabet regime, where the entropy estimators are shown to be biased [2] and the convergence rate can be slow [3]. ...
... Notably, in the large-alphabet regime, numerous studies have been conducted [9][10][11][12][13][14]. This research seeks to build upon these studies by analyzing the latest approach to entropy estimation using deep neural networks, with a specific emphasis on the large-alphabet regime, which poses challenges to conventional entropy estimation methods, such as those that rely on the plug-in principle [2,15,16]. ...
... It exhibits commendable results within the classical regime, typified by a large sample size and a small alphabet size. However, a significant negative bias is observed as the alphabet size increases [2]. In response to this bias, several estimators were developed, including the Miller-Madow correction (MM) [15], which corrects the bias by incorporating a constant dependent on the non-zero sample probability count. ...
Article
Full-text available
This paper presents a comparative study of entropy estimation in a large-alphabet regime. A variety of entropy estimators have been proposed over the years, where each estimator is designed for a different setup with its own strengths and caveats. As a consequence, no estimator is known to be universally better than the others. This work addresses this gap by comparing twenty-one entropy estimators in the studied regime, starting with the simplest plug-in estimator and leading up to the most recent neural network-based and polynomial approximate estimators. Our findings show that the estimators’ performance highly depends on the underlying distribution. Specifically, we distinguish between three types of distributions, ranging from uniform to degenerate distributions. For each class of distribution, we recommend the most suitable estimator. Further, we propose a sample-dependent approach, which again considers three classes of distribution, and report the top-performing estimators in each class. This approach provides a data-dependent framework for choosing the desired estimator in practical setups.
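To make the negative bias of the plug-in estimator in the large-alphabet regime concrete, a toy simulation along the following lines can be used; the uniform distribution, sample size, and alphabet sizes are arbitrary illustrative choices, not the setup of the comparative study.

```python
import numpy as np

rng = np.random.default_rng(0)

def plugin_entropy_bits(counts):
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log2(p))

N = 100                      # fixed sample size
for m in (10, 100, 1000):    # growing alphabet size
    true_H = np.log2(m)      # entropy of the uniform distribution on m symbols
    est = [plugin_entropy_bits(rng.multinomial(N, np.ones(m) / m))
           for _ in range(200)]
    print(f"m={m:5d}  true H={true_H:5.2f}  "
          f"mean plug-in={np.mean(est):5.2f}  bias={np.mean(est) - true_H:+5.2f} bits")
```

As the alphabet outgrows the sample, the plug-in estimate falls increasingly far below the true entropy.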
... It would be highly desirable to have an unbiased entropy estimator, i.e., an estimator whose average value coincides with the true result H for all values of the sequence length N. However, it can be proven that such an estimator does not exist [15] and that, apart from the unavoidable statistical errors due to the finite number N of data in the sample (which typically scale as N^{-1/2}), all estimators present systematic errors which are in general difficult to evaluate properly. Therefore, a large effort has been devoted to the development of entropy estimators that, although necessarily biased, provide a good value for H with small statistical and systematic errors [16]. ...
... Hence, the need to compare and evaluate the different estimators in Markov chains. Even though there exists a plethora of entropy estimators in the literature [15,[39][40][41][42][43][44][45][46][47], we here focus on nine of the most commonly employed estimators, and we also propose a new estimator, constructed from known results [34]. ...
... is negatively biased [15], i.e., ⟨Ĥ_MLE⟩ − H < 0. ...
Article
Full-text available
Entropy estimation is a fundamental problem in information theory that has applications in various fields, including physics, biology, and computer science. Estimating the entropy of discrete sequences can be challenging due to limited data and the lack of unbiased estimators. Most existing entropy estimators are designed for sequences of independent events and their performances vary depending on the system being studied and the available data size. In this work, we compare different entropy estimators and their performance when applied to Markovian sequences. Specifically, we analyze both binary Markovian sequences and Markovian systems in the undersampled regime. We calculate the bias, standard deviation, and mean squared error for some of the most widely employed estimators. We discuss the limitations of entropy estimation as a function of the transition probabilities of the Markov processes and the sample size. Overall, this paper provides a comprehensive comparison of entropy estimators and their performance in estimating entropy for systems with memory, which can be useful for researchers and practitioners in various fields.
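A minimal sketch of the kind of experiment described here, assuming a two-state Markov chain with arbitrary transition probabilities, compares the exact entropy rate with a naive plug-in estimate of the conditional entropy from bigram counts; it is an illustration, not the paper's benchmark code.

```python
import numpy as np

rng = np.random.default_rng(1)

def entropy_rate_bits(p01, p10):
    """Exact entropy rate of a binary Markov chain with P(1|0)=p01, P(0|1)=p10."""
    pi0 = p10 / (p01 + p10)                   # stationary probability of state 0
    def h(p):                                 # binary entropy function, in bits
        return -p * np.log2(p) - (1 - p) * np.log2(1 - p)
    return pi0 * h(p01) + (1 - pi0) * h(p10)

def sample_chain(p01, p10, n):
    x = np.empty(n, dtype=int)
    x[0] = 0
    for t in range(1, n):
        p_one = p01 if x[t - 1] == 0 else 1.0 - p10
        x[t] = int(rng.random() < p_one)
    return x

def plugin_conditional_entropy_bits(x):
    """Naive plug-in estimate of H(X_t | X_{t-1}) from bigram counts."""
    joint = np.zeros((2, 2))
    for a, b in zip(x[:-1], x[1:]):
        joint[a, b] += 1
    joint /= joint.sum()
    cond = joint / joint.sum(axis=1, keepdims=True)
    mask = joint > 0
    return -np.sum(joint[mask] * np.log2(cond[mask]))

p01, p10 = 0.2, 0.4
x = sample_chain(p01, p10, 2000)
print("exact entropy rate :", entropy_rate_bits(p01, p10))
print("plug-in estimate   :", plugin_conditional_entropy_bits(x))
```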
... NNs. Indeed, the sample complexity of MI estimation scales poorly with dimension (Paninski, 2003). Collecting more samples of (W, Z_i) can be expensive, especially with NNs, as one realization of W ∼ P_{W|S_n} requires one complete training run. ...
... SI_k has been shown to retain many important properties of MI (Goldfeld et al., 2022), and more importantly, the statistical convergence rate for estimating SI_k(X; Y) depends on k but not the ambient dimensions d_x, d_y. This provides significant advantages over MI, whose computation generally requires an exponential number of samples in max(d_x, d_y) (Paninski, 2003). Similar convergence rates can be achieved while slicing in only one dimension. ...
Preprint
Full-text available
The ability of machine learning (ML) algorithms to generalize well to unseen data has been studied through the lens of information theory, by bounding the generalization error with the input-output mutual information (MI), i.e., the MI between the training data and the learned hypothesis. Yet, these bounds have limited practicality for modern ML applications (e.g., deep learning), due to the difficulty of evaluating MI in high dimensions. Motivated by recent findings on the compressibility of neural networks, we consider algorithms that operate by slicing the parameter space, i.e., trained on random lower-dimensional subspaces. We introduce new, tighter information-theoretic generalization bounds tailored for such algorithms, demonstrating that slicing improves generalization. Our bounds offer significant computational and statistical advantages over standard MI bounds, as they rely on scalable alternative measures of dependence, i.e., disintegrated mutual information and $k$-sliced mutual information. Then, we extend our analysis to algorithms whose parameters do not need to exactly lie on random subspaces, by leveraging rate-distortion theory. This strategy yields generalization bounds that incorporate a distortion term measuring model compressibility under slicing, thereby tightening existing bounds without compromising performance or requiring model compression. Building on this, we propose a regularization scheme enabling practitioners to control generalization through compressibility. Finally, we empirically validate our results and achieve the computation of non-vacuous information-theoretic generalization bounds for neural networks, a task that was previously out of reach.
... important testing ground for methods to address these errors [31][32][33]; for a review see Appendix A.8 of Ref. [26]. The approach we take here uses the fact that naive entropy estimates depend systematically on the size of the sample; if we can detect this systematic dependence then we can extrapolate to infinite data, as described in Appendix A. The result is that I_gap = 1.68 ± 0.07 bits/cell. ...
... We see the expected dependence on 1/N_em, and the steepness of this dependence is twice as large at N = 20 as at N = 10, as expected [26]. This gives us confidence in the extrapolation N_em → ∞ [26,[30][31][32][33]. ...
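The extrapolation procedure sketched in these excerpts, fitting the systematic dependence of naive estimates on sample size and reading off the infinite-data limit, can be illustrated as follows; the linear-in-1/N model, the synthetic distribution, and the subsample sizes are assumptions made for the example, not the exact procedure of the cited appendix.

```python
import numpy as np

rng = np.random.default_rng(2)

def plugin_entropy_bits(samples, m):
    counts = np.bincount(samples, minlength=m)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log2(p))

# Synthetic "true" distribution over m symbols (a stand-in for real data).
m = 200
q = rng.dirichlet(np.ones(m))
true_H = -np.sum(q * np.log2(q))

# Naive estimates at several subsample sizes, then a straight-line fit in 1/N.
sizes = np.array([250, 500, 1000, 2000, 4000])
H_hat = np.array([np.mean([plugin_entropy_bits(rng.choice(m, size=N, p=q), m)
                           for _ in range(50)]) for N in sizes])
slope, intercept = np.polyfit(1.0 / sizes, H_hat, 1)

print(f"true entropy              : {true_H:.3f} bits")
print(f"naive estimate at N=4000  : {H_hat[-1]:.3f} bits")
print(f"extrapolated to N -> inf  : {intercept:.3f} bits")
```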
Article
Full-text available
In a developing embryo, information about the position of cells is encoded in the concentrations of morphogen molecules. In the fruit fly, the local concentrations of just a handful of proteins encoded by the gap genes are sufficient to specify position with a precision comparable to the spacing between cells along the anterior-posterior axis. This matches the precision of downstream events such as the striped patterns of expression in the pair-rule genes, but is not quite sufficient to define unique identities for individual cells. We demonstrate theoretically that this information gap can be bridged if positional errors are spatially correlated, with correlation lengths approximately 20% of the embryo length. We then show experimentally that these correlations are present, with the required strength, in the fluctuating positions of the pair-rule stripes, and this can be traced back to the gap genes. Taking account of these correlations, the available information matches the information needed for unique cellular specification, within error bars of approximately 2%. These observations support a precisionist view of information flow through the underlying genetic networks, in which accurate signals are available from the start and preserved as they are transformed into the final spatial patterns.
... Bellin et al. (1994) propose a convergence criterion based on the simulation ensemble's moments and size, while Ballio and Guadagnini (2004) appropriately extend these criteria to the first two statistical moments (the ensemble's mean and variance). Works presenting this kind of information for stochastic simulations are not common in the literature, but examples can be found in Geyer (1994), Paninski (2003), and others already cited. Moreover, the convergence of groundwater model results, such as hydraulic heads, is qualitatively assessed by whether the curve of the ensemble average flattens as the volume of Monte Carlo simulations grows. ...
... Moreover, the distance to the numerical stability condition, i.e. the local ensemble variance, is (non-linearly) proportional to the absolute value of the entropy of the Monte Carlo process (Shannon, 1948), i.e. its thermodynamical temperature coded as its susceptibility to variations along with the volume of Monte Carlo simulations, N_MC (Paninski, 2003; Schiavo, 2023c). Therefore, from an ensemble of 2000 Monte Carlo conductivity fields, different volumes of simulations are employed as ensemble mean conductivity fields. ...
Article
Full-text available
The knowledge of aquifer systems, their geological setting, their structure, and subsequent modeling is highly uncertain and is usually faced through Monte Carlo-based methods in hydrogeology. One of the most important uncertainty sources for groundwater models is represented by input hydraulic conductivities, related to the aquifer's structure. There are no specific rules when simulating hydraulic conductivity fields within Monte Carlo frameworks to instruct numerical models, and information about employed conductivity fields and their numerical convergence is often not given. This technical work aims to fill this gap by investigating the impact of employing conductivity information upon different volumes of Monte Carlo simulations applied to a real case study. Thus, this work estimates the minimum volumes of Monte Carlo hydraulic conductivity fields to be employed in groundwater flow models for achieving numerically stable (i) boundary conditions, (ii) global model performances, and (iii) local ones such as simulated hydraulic heads. The present results aim to be indicative of similar hydrogeological settings and will serve as a basis for more complex ones and for investigating transport problems.
... Privacy quantification involves a wide range of mathematical concepts and metrics that serve as the foundation for measuring and evaluating privacy. Mutual information, entropy, and (ε, δ)-DP are examples of these concepts [35][36][37]. Other entropy concepts, such as Rényi's min-entropy, are used in privacy quantification to represent different types of attacks and information leakage scenarios [38]. ...
Article
Full-text available
Public administration frequently deals with geographically scattered personal data between multiple government locations and organizations. As digital technologies advance, public administration is increasingly relying on collaborative intelligence while protecting individual privacy. In this context, federated learning has become known as a potential technique to train machine learning models on private and distributed data while maintaining data privacy. This work looks at the trade-off between privacy assurances and vulnerability to membership inference attacks in differential private federated learning in the context of public administration applications. Real-world data from collaborating organizations, concretely, the payroll data from the Ministry of Education and the public opinion survey data from Asia Foundation in Afghanistan, were used to evaluate the effectiveness of noise injection, a typical defense strategy against membership inference attacks, at different noise levels. The investigation focused on the impact of noise on model performance and selected privacy metrics applicable to public administration data. The findings highlight the importance of a balanced compromise between data privacy and model utility because excessive noise can reduce the accuracy of the model. They also highlight the need for careful consideration of noise levels in differential private federated learning for public administration tasks to provide a well-calibrated balance between data privacy and model utility, contributing toward transparent government practices.
... In the last two decades, many efforts were made to improve the bounds on the sample complexity. Paninski [22], [23] was the first to prove that it is possible to consistently estimate the entropy using sublinear sample size. While the scaling of the minimal sample size of consistent estimation was shown to be n/log n in the seminal results of [24], [25], the optimal dependence of the sample size on both n and ε was not completely resolved until recently. ...
Preprint
Full-text available
We observe an infinite sequence of independent identically distributed random variables $X_1,X_2,\ldots$ drawn from an unknown distribution $p$ over $[n]$, and our goal is to estimate the entropy $H(p)=-\mathbb{E}[\log p(X)]$ within an $\varepsilon$-additive error. To that end, at each time point we are allowed to update a finite-state machine with $S$ states, using a possibly randomized but time-invariant rule, where each state of the machine is assigned an entropy estimate. Our goal is to characterize the minimax memory complexity $S^*$ of this problem, which is the minimal number of states for which the estimation task is feasible with probability at least $1-\delta$ asymptotically, uniformly in $p$. Specifically, we show that there exist universal constants $C_1$ and $C_2$ such that $ S^* \leq C_1\cdot\frac{n (\log n)^4}{\varepsilon^2\delta}$ for $\varepsilon$ not too small, and $S^* \geq C_2 \cdot \max \{n, \frac{\log n}{\varepsilon}\}$ for $\varepsilon$ not too large. The upper bound is proved using approximate counting to estimate the logarithm of $p$, and a finite memory bias estimation machine to estimate the expectation operation. The lower bound is proved via a reduction of entropy estimation to uniformity testing. We also apply these results to derive bounds on the memory complexity of mutual information estimation.
... Since the naive plug-in estimator systematically underestimates the Shannon entropy [30], in the third step, we could also have used any of the plethora of bias-corrected estimators for Shannon entropy that have been proposed [19,[30][31][32][33][34][35][36]. Any of the other measures can also be computed either using plug-in estimation or other generic estimators such as the jackknife estimator [37], or any tailored measure-specific estimator. ...
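The jackknife estimator mentioned above is a generic bias-reduction device; a small Python sketch (an illustration only, not the implementation in ComplexityMeasures.jl) applies it to the plug-in entropy estimate, grouping the leave-one-out terms by symbol for efficiency.

```python
import numpy as np

def plugin_entropy_bits(counts):
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log2(p))

def jackknife_entropy_bits(counts):
    """Jackknife bias correction of the plug-in estimate:
    H_jk = N * H_N - (N - 1) * mean(leave-one-out estimates).
    Leave-one-out terms are grouped by symbol, since removing any single
    observation of symbol s yields the same reduced histogram."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    loo_sum = 0.0
    for s, c in enumerate(counts):
        if c == 0:
            continue
        reduced = counts.copy()
        reduced[s] -= 1
        loo_sum += c * plugin_entropy_bits(reduced)
    return n * plugin_entropy_bits(counts) - (n - 1) * (loo_sum / n)

rng = np.random.default_rng(3)
counts = rng.multinomial(100, np.ones(30) / 30)   # 100 draws, 30 symbols
print(plugin_entropy_bits(counts), jackknife_entropy_bits(counts), np.log2(30))
```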
Preprint
Full-text available
In the nonlinear timeseries analysis literature, countless quantities have been presented as new "entropy" or "complexity" measures, often with similar roles. The ever-increasing pool of such measures makes creating a sustainable and all-encompassing software for them difficult both conceptually and pragmatically. Such a software however would be an important tool that can aid researchers make an informed decision of which measure to use and for which application, as well as accelerate novel research. Here we present ComplexityMeasures.jl, an easily extendable and highly performant open-source software that implements a vast selection of complexity measures. The software provides 1530 measures with 3,834 lines of source code, averaging only 2.5 lines of code per exported quantity (version 3.5). This is made possible by its mathematically rigorous composable design. In this paper we discuss the software design and demonstrate how it can accelerate complexity-related research in the future. We carefully compare it with alternative software and conclude that ComplexityMeasures.jl outclasses the alternatives in several objective aspects of comparison, such as computational performance, overall amount of measures, reliability, and extendability. ComplexityMeasures.jl is also a component of the DynamicalSystems.jl library for nonlinear dynamics and nonlinear timeseries analysis and follows open source development practices for creating a sustainable community of developers.
... If formula (6) is applied directly to real statistical data, the estimate will be biased [25,26]. To obtain an unbiased estimate, a correction can be used, in particular the following correction methods, given by formulas (7)-(9) [27]: ...
Article
Methods and tools for analyzing, evaluating, and comparing the properties of random sequences and random numbers are considered. Aspects such as mathematical modeling of random processes, statistical methods for estimating distribution parameters, and comparative analysis of the properties of random variables are also discussed. Today, random sequences (RS) and random numbers (RN) produced by physical true (PT RNG) and non-physical true (NPT RNG) generators are widely used in practice; in essence, they are mandated by regulation as the mechanisms for key generation in cryptographic systems. Depending on the cryptographic transformations, they are used to generate long-term keys and session keys for symmetric cryptographic transformations, long-term asymmetric key pairs and session key pairs, common parameters of cryptographic transformations and cryptographic protocols, specific one-time values (nonces), challenges, blinding and masking values, and so on. Among the many requirements for such generators is ensuring, in a number (and possibly the majority) of cryptographic applications, the maximum possible value of initial entropy. An analysis of international and national regulatory documents on the requirements for PT RNG and NPT RNG sources and, accordingly, for generators showed that, in view of the significant challenges associated with the expanding capabilities of cryptanalysis based on quantum and side-channel attacks in addition to classical ones, they must be substantially improved and evaluated using comprehensive methodologies with a system of unconditional criteria. The purpose of this article is to substantiate, develop, and experimentally confirm the correct application of algorithms for generating RS and RN based on PT RNG and NPT RNG, including when using classical and quantum microelectronics, as well as to develop recommendations for their application in generating keys and parameters for quantum-resistant methods and standards of cryptographic transformations.
... As this estimate is an approximation we also use the Miller-Madow correction in order to smooth the estimate based on sample size and improve its accuracy (Miller, 1955). No method of estimating discrete entropy in continuous spaces is perfect (see Paninski, 2003 for extensive discussion), but our estimator is invariant to linear transformations while making minimal assumptions about the underlying distribution. Note that while in the results presented here we estimate entropy per dimension we can just as easily estimate entropy per pair or set of dimensions (akin to modelling at the unigram vs n-gram level); our use of a dimension-wise estimate simplifies our analysis but limits its ability to track cross-dimensional dependencies. ...
Preprint
Full-text available
Large-scale neural models show impressive performance across a wide array of linguistic tasks. Despite this, they remain largely black boxes, inducing vector representations of their input that prove difficult to interpret. This limits our ability to understand what they learn and when they learn it, or to describe what kinds of representations generalise well out of distribution. To address this we introduce a novel approach to interpretability that looks at the mapping a model learns from sentences to representations as a kind of language in its own right. In doing so we introduce a set of information-theoretic measures that quantify how structured a model's representations are with respect to its input, and when during training that structure arises. Our measures are fast to compute, grounded in linguistic theory, and can predict which models will generalise best based on their representations. We use these measures to describe two distinct phases of training a transformer: an initial phase of in-distribution learning which reduces task loss, then a second stage where representations become robust to noise. Generalisation performance begins to increase during this second phase, drawing a link between generalisation and robustness to noise. Finally we look at how model size affects the structure of the representational space, showing that larger models ultimately compress their representations more than their smaller counterparts.
... An important technique in our approach is Mutual Information Maximization (MIM) [1,9,23], based on the concept of mutual information [10,24]. For two random variables X and Y, mutual information quantifies how much knowing X reduces uncertainty in Y, and vice versa. ...
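For discrete variables, the quantity described here can be computed directly from a contingency table via the identity I(X; Y) = H(X) + H(Y) − H(X, Y); the sketch below uses a made-up 2x2 table of counts purely for illustration.

```python
import numpy as np

def mutual_information_bits(joint_counts):
    """Plug-in I(X;Y) in bits from a contingency table of counts,
    via the identity I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    joint = np.array(joint_counts, dtype=float)
    joint /= joint.sum()
    def H(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))
    return H(joint.sum(axis=1)) + H(joint.sum(axis=0)) - H(joint.ravel())

# Made-up example: Y is a copy of a fair bit X, flipped 10% of the time.
table = np.array([[450, 50],
                  [50, 450]])
print(f"I(X;Y) = {mutual_information_bits(table):.3f} bits")   # approx. 0.53
```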
Preprint
Cross-Domain Sequential Recommendation (CDSR) methods aim to address the data sparsity and cold-start problems present in Single-Domain Sequential Recommendation (SDSR). Existing CDSR methods typically rely on overlapping users, designing complex cross-domain modules to capture users' latent interests that can propagate across different domains. However, the information they propagate is limited to the overlapping users and the users who have rich historical behavior records. As a result, these methods often underperform in real-world scenarios, where most users are non-overlapping (cold-start) and long-tailed. In this research, we introduce a new CDSR framework named Information Maximization Variational Autoencoder (IM-VAE). Here, we suggest using a Pseudo-Sequence Generator to enhance the user's interaction history input for downstream fine-grained CDSR models to alleviate the cold-start issues. We also propose a Generative Recommendation Framework combined with three regularizers inspired by the mutual information maximization (MIM) theory (McGill, 1954) to capture the semantic differences between a user's interests shared across domains and those specific to certain domains, as well as address the informational gap between a user's actual interaction sequences and the pseudo-sequences generated. To the best of our knowledge, this paper is the first CDSR work that considers the information disentanglement and denoising of pseudo-sequences in the open-world recommendation scenario. Empirical experiments illustrate that IM-VAE outperforms the state-of-the-art approaches on two real-world cross-domain datasets on all sorts of users, including cold-start and tailed users, demonstrating the effectiveness of IM-VAE in open-world recommendation.
... Given that, an estimation of this quantity could be used to understand the complexity of the problem and predict the type of ML architecture that better fits the scenario. The literature on information measure estimation is rich [69]- [74], and many methods can be adapted to provide a consistent proxy for the conditional entropy in the mixed continuous-discrete setting. On the other hand, extending the presented numerical analysis over a large class of models and ML schemes and exploring the practical use of data-driven entropy estimators as a proxy to condition the ML design are exciting directions for future work. ...
Preprint
Full-text available
We present new results to model and understand the role of encoder-decoder design in machine learning (ML) from an information-theoretic angle. We use two main information concepts, information sufficiency (IS) and mutual information loss (MIL), to represent predictive structures in machine learning. Our first main result provides a functional expression that characterizes the class of probabilistic models consistent with an IS encoder-decoder latent predictive structure. This result formally justifies the encoder-decoder forward stages many modern ML architectures adopt to learn latent (compressed) representations for classification. To illustrate IS as a realistic and relevant model assumption, we revisit some known ML concepts and present some interesting new examples: invariant, robust, sparse, and digital models. Furthermore, our IS characterization allows us to tackle the fundamental question of how much performance (predictive expressiveness) could be lost, using the cross entropy risk, when a given encoder-decoder architecture is adopted in a learning setting. Here, our second main result shows that a mutual information loss quantifies the lack of expressiveness attributed to the choice of a (biased) encoder-decoder ML design. Finally, we address the problem of universal cross-entropy learning with an encoder-decoder design where necessary and sufficiency conditions are established to meet this requirement. In all these results, Shannon's information measures offer new interpretations and explanations for representation learning.
... For many problems of interest, precise computation of MI is not an easy task [55,62]. Consequently, a wide range of techniques for MI estimation have flourished. ...
Preprint
Full-text available
Diffusion models for Text-to-Image (T2I) conditional generation have seen tremendous success recently. Despite their success, accurately capturing user intentions with these models still requires a laborious trial and error process. This challenge is commonly identified as a model alignment problem, an issue that has attracted considerable attention by the research community. Instead of relying on fine-grained linguistic analyses of prompts, human annotation, or auxiliary vision-language models to steer image generation, in this work we present a novel method that relies on an information-theoretic alignment measure. In a nutshell, our method uses self-supervised fine-tuning and relies on point-wise mutual information between prompts and images to define a synthetic training set to induce model alignment. Our comparative analysis shows that our method is on-par or superior to the state-of-the-art, yet requires nothing but a pre-trained denoising network to estimate MI and a lightweight fine-tuning strategy.
... This motivates the derivation of information-theoretic lower and upper bounds for $P^{c}_{e,\min}(\hat{Q}^{-})$ that are more amenable in terms of applications and interpretation. For example, obtaining $P^{c}_{e,\min}(\hat{Q}^{-})$ reliably from Eq. (7) may not be possible in situations where, on the other hand, information-theoretic quantities can be efficiently calculated using estimators [20]. Even when Eq. (7) can be evaluated accurately, its manipulation becomes challenging in the context of model development due to the non-linearity introduced by the $\min(\cdot)$ operator [21]. ...
Article
Full-text available
Predicting extreme events in chaotic systems, characterized by rare but intensely fluctuating properties, is of great importance due to their impact on the performance and reliability of a wide range of systems. Some examples include weather forecasting, traffic management, power grid operations, and financial market analysis, to name a few. Methods of increasing sophistication have been developed to forecast events in these systems. However, the boundaries that define the maximum accuracy of forecasting tools are still largely unexplored from a theoretical standpoint. Here, we address the question: What is the minimum possible error in the prediction of extreme events in complex, chaotic systems? We derive the minimum probability of error in extreme event forecasting along with its information-theoretic lower and upper bounds. These bounds are universal for a given problem, in that they hold regardless of the modeling approach for extreme event prediction: from traditional linear regressions to sophisticated neural network models. The limits in predictability are obtained from the cost-sensitive Fano’s and Hellman’s inequalities using the Rényi entropy. The results are also connected to Takens’ embedding theorem using the information can’t hurt inequality. Finally, the probability of error for a forecasting model is decomposed into three sources: uncertainty in the initial conditions, hidden variables, and suboptimal modeling assumptions. The latter allows us to assess whether prediction models are operating near their maximum theoretical performance or if further improvements are possible. The bounds are applied to the prediction of extreme events in the Rössler system and the Kolmogorov flow.
... Here, the three quantities denote known samples of historical prices, spot-forward parity, and forecasted prices. Since the optimal mean is theoretically intractable [82], to determine it we follow the work of [83], using the outputs of the empirically optimal predictor in experiments to approximate it. ...
Article
Full-text available
The complexity in stock index futures markets, influenced by the intricate interplay of human behavior, is characterized as nonlinearity and dynamism, contributing to significant uncertainty in long-term price forecasting. While machine learning models have demonstrated their efficacy in stock price forecasting, they rely solely on historical price data, which, given the inherent volatility and dynamic nature of financial markets, are insufficient to address the complexity and uncertainty in long-term forecasting due to the limited connection between historical and forecasting prices. This paper introduces a pioneering approach that integrates financial theory with advanced deep learning methods to enhance predictive accuracy and risk management in China’s stock index futures market. The SF-Transformer model, combining spot-forward parity and the Transformer model, is proposed to improve forecasting accuracy across short and long-term horizons. Formulated upon the arbitrage-free futures pricing model, the spot-forward parity model offers variables such as stock index price, risk-free rate, and stock index dividend yield for forecasting. Our insight is that the mutual information generated by these variables has the potential to significantly reduce uncertainty in long-term forecasting. A case study on predicting major stock index futures prices in China demonstrates the superiority of the SF-Transformer model over models based on LSTM, MLP, and the stock index futures arbitrage-free pricing model, covering both short and long-term forecasting up to 28 days. Unlike existing machine learning models, the Transformer processes entire time series concurrently, leveraging its attention mechanism to discern intricate dependencies and capture long-range relationships, thereby offering a holistic understanding of time series data. An enhancement of mutual information is observed after introducing spot-forward parity in the forecasting. The variation of mutual information and ablation study results highlights the significant contributions of spot-forward parity, particularly to the long-term forecasting. Overall, these findings highlight the SF-Transformer model’s efficacy in leveraging spot-forward parity for reducing uncertainty and advancing robust and comprehensive approaches in long-term stock index futures price forecasting.
... This is attributed to the fact that, practically, it is impossible to know the true probability distributions p(x), p(y) and p(x, y) and we have to estimate them from the dataset. Estimating mutual information for continuous data is not trivial [20]-[22] as we need to estimate the probability densities without knowing the probability density functions. Therefore, in this paper, we use datasets discretized with Sturges' rule, a commonly-used algorithm that chooses the number of bins needed to approximate the original distribution of the samples based on the size of the dataset. ...
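A minimal sketch of the discretize-then-estimate approach described in this excerpt, using Sturges' rule for the bin count; the Gaussian test data are synthetic and the binning rule is only one of several common choices.

```python
import numpy as np

def sturges_bins(n_samples):
    """Sturges' rule for the number of histogram bins: ceil(log2(n)) + 1."""
    return int(np.ceil(np.log2(n_samples)) + 1)

def binned_mi_bits(x, y):
    """Plug-in MI between two continuous samples after binning each variable
    with the Sturges bin count."""
    k = sturges_bins(len(x))
    joint, _, _ = np.histogram2d(x, y, bins=k)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / (px @ py)[mask]))

rng = np.random.default_rng(4)
x = rng.normal(size=5000)
y = x + rng.normal(size=5000)          # dependent on x, so I(X;Y) > 0
print(f"binned MI estimate: {binned_mi_bits(x, y):.2f} bits")
```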
Conference Paper
Feature selection is the data analysis process that selects a smaller and curated subset of the original dataset by filtering out data (features) which are irrelevant or redundant. The most important features can be ranked and selected based on statistical measures, such as mutual information. Feature selection not only reduces the size of dataset as well as the execution time for training Machine Learning (ML) models, but it can also improve the accuracy of the inference. This paper analyses mutual-information-based feature selection for resource-constrained FPGAs and proposes FINESSD, a novel approach that can be deployed for near-storage acceleration. This paper highlights that the Mutual Information Maximization (MIM) algorithm does not require multiple passes over the data while being a good trade-off between accuracy and FPGA resources, when approximated appropriately. The new FPGA accelerator for MIM generated by FINESSD can fully utilize the NVMe bandwidth of a modern SSD and perform feature selection without requiring full dataset transfers onto the main processor. The evaluation using a Samsung SmartSSD over small, large and out-of-core datasets shows that, compared to the mainstream multiprocessing Python ML libraries and an optimized C library, FINESSD yields up to 35x and 19x speedup respectively while being more than 70x more energy efficient for large, out-of-core datasets.
... This implementation depends on the estimation of information-theoretic measures from data. A ubiquitous challenge for the application of mutual information measures is that they are positively biased and their estimation is data-demanding [75,76]. These biases scale with the dimensionality of the variables, and hence can hinder the applicability of informationtheoretic inequalities for large collections of variables, or for variables with high cardinality. ...
Article
Full-text available
The causal structure of a system imposes constraints on the joint probability distribution of variables that can be generated by the system. Archetypal constraints consist of conditional independencies between variables. However, particularly in the presence of hidden variables, many causal structures are compatible with the same set of independencies inferred from the marginal distributions of observed variables. Additional constraints allow further testing for the compatibility of data with specific causal structures. An existing family of causally informative inequalities compares the information about a set of target variables contained in a collection of variables, with a sum of the information contained in different groups defined as subsets of that collection. While procedures to identify the form of these groups-decomposition inequalities have been previously derived, we substantially enlarge the applicability of the framework. We derive groups-decomposition inequalities subject to weaker independence conditions, with weaker requirements in the configuration of the groups, and additionally allowing for conditioning sets. Furthermore, we show how constraints with higher inferential power may be derived with collections that include hidden variables, and then converted into testable constraints using data processing inequalities. For this purpose, we apply the standard data processing inequality of conditional mutual information and derive an analogous property for a measure of conditional unique information recently introduced to separate redundant, synergistic, and unique contributions to the information that a set of variables has about a target.
... MI estimation in high-dimensional spaces is very challenging due to the "curse of dimensionality" [7], which is the need for exponentially more data as the dimension increases in order to provide reliable probability estimates. Nonetheless, due to the importance of estimating MI empirically from data, many MI estimators have been proposed in the literature [8], [9], [10], [11], [12], [13], relying on concepts such as kernel density estimation, variational approaches, or estimating bounds on MI. The approach we take in this paper is to leverage stateof-the-art entropy estimators developed in the area of learningbased compression [14], [15], [16], and use the relationship between entropy and MI to provide an estimate. ...
Preprint
Full-text available
In recent years, there has been a significant increase in applications of multimodal signal processing and analysis, largely driven by the increased availability of multimodal datasets and the rapid progress in multimodal learning systems. Well-known examples include autonomous vehicles, audiovisual generative systems, vision-language systems, and so on. Such systems integrate multiple signal modalities: text, speech, images, video, LiDAR, etc., to perform various tasks. A key issue for understanding such systems is the relationship between various modalities and how it impacts task performance. In this paper, we employ the concept of mutual information (MI) to gain insight into this issue. Taking advantage of the recent progress in entropy modeling and estimation, we develop a system called InfoMeter to estimate MI between modalities in a multimodal learning system. We then apply InfoMeter to analyze a multimodal 3D object detection system over a large-scale dataset for autonomous driving. Our experiments on this system suggest that a lower MI between modalities is beneficial for detection accuracy. This new insight may facilitate improvements in the development of future multimodal learning systems.
... Despite these potential advantages of MI, the scarcity of prior work on MI-based feature selection in DNNs is likely due to the fact that both the intermediate feature space and the output space of a DNN are often high-dimensional, and this is known to be a challenging setup for estimating MI [16,34,35,36]. Practical MI estimators often need to be tailored to the problem at hand. ...
Preprint
Full-text available
Deep models produce a number of features in each internal layer. A key problem in applications such as feature compression for remote inference is determining how important each feature is for the task(s) performed by the model. The problem is especially challenging in the case of multi-task inference, where the same feature may carry different importance for different tasks. In this paper, we examine how effective is mutual information (MI) between a feature and a model's task output as a measure of the feature's importance for that task. Experiments involving hard selection and soft selection (unequal compression) based on MI are carried out to compare the MI-based method with alternative approaches. Multi-objective analysis is provided to offer further insight.
... In this paper, the importance of a parent node to a child node is expressed as the importance of paired factor characteristics to the occurrence of victim risk. According to Shannon's information theory, mutual information (MI) is an index that measures the importance of a parent node to a child node in a network [16]. Netica regards MI as an important indicator for sensitivity analysis [17]; in addition, measures such as percentage of variance and variance of beliefs [18] are used for sensitivity analysis. ...
Article
Full-text available
Assessing the risk level and exploring the risk characteristics of victims of brushing type telecom network fraud is of practical significance for crime risk prevention and control. Starting from the demographic characteristics of victims and case characteristics, this paper establishes a Bayesian network model with a tuple composed of loss amount and contact duration as the evaluation index of victim risk level, aiming to provide ideas for police to implement precise anti-fraud propaganda. The research shows that the tuple victim risk assessment model has a high prediction accuracy and can take into account both the loss amount and contact duration, which is feasible as a victim risk assessment model; there is no significant single influencing factor characteristic that affects the victim risk level; among the key victim groups of women who are commerce service personnel and use social media platforms, police should focus on highly educated groups, and among people with the same educational background, police should focus on young people under 28 years old.
... where integration is performed over the range of values of X and Y and p_X, p_Y, and p_XY are the probability density functions of X, Y, and their joint distribution, respectively (Paninski, 2003). If variables X and Y are independent, it follows that I[X, Y] = 0, since p_XY = p_X p_Y and the logarithm argument becomes unity. ...
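The displayed equation this excerpt refers to is the standard definition of mutual information for continuous variables; restated in the excerpt's notation it reads:

```latex
I[X, Y] = \iint p_{XY}(x, y)\,\log \frac{p_{XY}(x, y)}{p_X(x)\, p_Y(y)} \, dx \, dy
```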
Article
Full-text available
Ferrochrome (FeCr) is a vital ingredient in stainless steel production and is commonly produced by smelting chromite ores in submerged arc furnaces. Silicon (Si) is a component of the FeCr alloy from the smelting process. Being both a contaminant and an indicator of the state of the process, its content needs to be kept within a narrow range. The complex chemistry of the smelting process and interactions between various factors make Si prediction by fundamental models infeasible. A data-driven approach offers an alternative by formulating the model based on historical data. This paper presents a systematic development of a data-driven model for predicting Si content. The model includes dimensionality reduction, regularized linear regression, and a boosting method to reduce the variability of the linear model residuals. It shows a good performance on testing data (R2 = 0.63). The most significant predictors, as determined by linear model analysis and permutation testing, are previous Si content, carbon and titanium in the alloy, calcium oxide in the slag, resistance between electrodes, and electrode slips. Further analysis using thermodynamic data and models, links these predictors to electrode control and slag
... Estimating mutual information in high-dimensional spaces presents a significant challenge when applying information-theoretic measures to real-world data. This problem has been extensively studied [143,144], revealing the inefficiency of solutions for large dimensions and the limited scalability of known approximations with respect to the sample size and dimension. Despite these difficulties, various entropy and mutual information estimation approaches have been developed, including classic methods like k-nearest neighbors (KNNs) [145] and kernel density estimation techniques [146], as well as more recent efficient methods. ...
Article
Full-text available
Deep neural networks excel in supervised learning tasks but are constrained by the need for extensive labeled data. Self-supervised learning emerges as a promising alternative, allowing models to learn without explicit labels. Information theory has shaped deep neural networks, particularly the information bottleneck principle. This principle optimizes the trade-off between compression and preserving relevant information, providing a foundation for efficient network design in supervised contexts. However, its precise role and adaptation in self-supervised learning remain unclear. In this work, we scrutinize various self-supervised learning approaches from an information-theoretic perspective, introducing a unified framework that encapsulates the self-supervised information-theoretic learning problem. This framework includes multiple encoders and decoders, suggesting that all existing work on self-supervised learning can be seen as specific instances. We aim to unify these approaches to understand their underlying principles better and address the main challenge: many works present different frameworks with differing theories that may seem contradictory. By weaving existing research into a cohesive narrative, we delve into contemporary self-supervised methodologies, spotlight potential research areas, and highlight inherent challenges. Moreover, we discuss how to estimate information-theoretic quantities and their associated empirical problems. Overall, this paper provides a comprehensive review of the intersection of information theory, self-supervised learning, and deep neural networks, aiming for a better understanding through our proposed unified approach.
... The estimation of MI-related quantities is known to be difficult (Pichler et al., 2020; Paninski, 2003). As a result, to estimate I(Z; S|Y), we develop tailor-made contrastive learning objectives to obtain a parameter-free estimator. ...
... To prove Theorem 2, we need a lemma from Gibbs and Su (2002); Paninski (2003) for the convergence rate of empirical measures. ...
... Thus, it is also called "Shannon entropy" when used in information theory. Actually, entropy is technically quite difficult to compute reliably for continuous variables [57]. However, for a random event with discrete probabilities of occurrence p_1, p_2, …, p_n, the corresponding entropy can be easily derived from the following formula: ...
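The formula the excerpt leads into is the familiar Shannon entropy of a discrete distribution (base-2 logarithm when measuring in bits):

```latex
H = -\sum_{i=1}^{n} p_i \log_2 p_i
```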
Article
Full-text available
Cell migration is crucial for numerous physiological and pathological processes. A cell adapts its morphology, including the overall and nuclear morphology, in response to various cues in complex microenvironments, such as topotaxis and chemotaxis during migration. Thus, the dynamics of cellular morphology can encode migration strategies, from which diverse migration mechanisms can be inferred. However, deciphering the mechanisms behind cell migration encoded in morphology dynamics remains a challenging problem. Here, we present a powerful universal metric, the Cell Morphological Entropy (CME), developed by combining parametric morphological analysis with Shannon entropy. The utility of CME, which accurately quantifies the complex cellular morphology at multiple length scales through the deviation from a perfectly circular shape, is illustrated using a variety of normal and tumor cell lines in different in vitro microenvironments. Our results show how geometric constraints affect the MDA-MB-231 cell nucleus, the emerging interactions of MCF-10A cells migrating on collagen gel, and the critical transition from proliferation to invasion in tumor spheroids. The analysis demonstrates that the CME-based approach provides an effective and physically interpretable tool to measure morphology in real-time across multiple length scales. It provides deeper insight into cell migration and contributes to the understanding of different behavioral modes and collective cell motility in more complex microenvironments.
... In the case of a high-dimensional output, an insufficient number of trials may lead to perceived correlations in the data which are in fact not there, subsequently increasing the calculated mutual information [54,52,55,56,53]. To decrease the sampling bias, we first performed principal component analysis to decrease the dimensionality of the output and employed a quadratic extrapolation method to estimate the unbiased value of information-metabolic efficiency. ...
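The quadratic extrapolation mentioned here amounts to fitting the naive estimate as a polynomial in 1/N and reading off the intercept; the sketch below uses placeholder numbers rather than data from the study.

```python
import numpy as np

# Quadratic-in-1/N extrapolation of an information estimate, as described above.
# The measured values below are placeholders, not data from the cited work.
n_trials = np.array([250, 500, 1000, 2000])      # number of trials per estimate
info_hat = np.array([1.42, 1.31, 1.25, 1.22])    # naive MI estimates (bits)

coeffs = np.polyfit(1.0 / n_trials, info_hat, 2)  # fit a + b/N + c/N^2
unbiased = np.polyval(coeffs, 0.0)                # intercept = value at 1/N -> 0
print(f"extrapolated estimate (N -> infinity): {unbiased:.2f} bits")
```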
Article
Full-text available
Shared input to a population of neurons induces noise correlations, which can decrease the information carried by a population activity. Inhibitory feedback in recurrent neural networks can reduce the noise correlations and thus increase the information carried by the population activity. However, the activity of inhibitory neurons is costly. This inhibitory feedback decreases the gain of the population. Thus, depolarization of its neurons requires stronger excitatory synaptic input, which is associated with higher ATP consumption. Given that the goal of neural populations is to transmit as much information as possible at minimal metabolic costs, it is unclear whether the increased information transmission reliability provided by inhibitory feedback compensates for the additional costs. We analyze this problem in a network of leaky integrate-and-fire neurons receiving correlated input. By maximizing mutual information with metabolic cost constraints, we show that there is an optimal strength of recurrent connections in the network, which maximizes the value of mutual information-per-cost. For higher values of input correlation, the mutual information-per-cost is higher for recurrent networks with inhibitory feedback compared to feedforward networks without any inhibitory neurons. Our results, therefore, show that the optimal synaptic strength of a recurrent network can be inferred from metabolically efficient coding arguments and that decorrelation of the input by inhibitory feedback compensates for the associated increased metabolic costs.
... Transfer entropy, and even mutual information, is difficult to compute [24], especially for high-dimensional or noisy data. In Appendix B, we offer a theoretical proof for the consistency and convergence properties of Transfer Entropy Neural Estimation, and examine its bias on a linear dynamic system where the true values of transfer entropy can be determined analytically. ...
... This motivates the derivation of information-theoretic lower and upper bounds for $P^{c}_{e,\min}(Q^{-})$ that are more amenable in terms of applications and interpretation. For example, obtaining $P^{c}_{e,\min}(Q^{-})$ reliably from Equation (7) may not be possible in situations where, on the other hand, information-theoretic quantities can be efficiently calculated using estimators [20]. Even when Equation (7) can be evaluated accurately, its manipulation becomes challenging in the context of model development due to the non-linearity introduced by the $\min(\cdot)$ operator [21]. ...
Preprint
Full-text available
Predicting extreme events in chaotic systems, characterized by rare but intensely fluctuating properties, is of great importance due to their impact on the performance and reliability of a wide range of systems. Some examples include weather forecasting, traffic management, power grid operations, and financial market analysis, to name a few. Methods of increasing sophistication have been developed to forecast events in these systems. However, the boundaries that define the maximum accuracy of forecasting tools are still largely unexplored from a theoretical standpoint. Here, we address the question: What is the minimum possible error in the prediction of extreme events in complex, chaotic systems? We derive lower bounds for the minimum probability of error in extreme event forecasting using the information-theoretic Fano's inequality. The limits obtained are universal, in that they hold regardless of the modeling approach: from traditional linear regressions to sophisticated neural network models. The approach also allows us to assess whether reduced-order models are operating near their theoretical maximum performance or if further improvements are theoretically possible.
Article
This paper considers the $\alpha$-entropy of finite probability schemes, proposed by A. Rényi in 1961 as a measure of their uncertainty (randomness). Two new limit theorems are presented for the sample estimate of the Rényi entropy of a sequence of independent multinomial trials.
Article
Statistical divergences are important tools in data analysis, information theory, and statistical physics, and there exist well-known inequalities on their bounds. However, in many circumstances involving temporal evolution, one needs limitations on the rates of such quantities instead. Here, several general upper bounds on the rates of some f-divergences are derived, valid for any type of stochastic dynamics (both Markovian and non-Markovian), in terms of information-like and/or thermodynamic observables. As special cases, the analytical bounds on the rate of mutual information are obtained. The major role in all those limitations is played by temporal Fisher information, characterizing the speed of global system dynamics, and some of them contain entropy production, suggesting a link with stochastic thermodynamics. Indeed, the derived inequalities can be used for estimation of minimal dissipation and global speed in thermodynamic stochastic systems. Specific applications of these inequalities in physics and neuroscience are given, which include the bounds on the rates of free energy and work in nonequilibrium systems, limits on the speed of information gain in learning synapses, as well as the bounds on the speed of predictive inference and learning rate. Overall, the derived bounds can be applied to any complex network of interacting elements, where predictability and thermodynamics of network dynamics are of prime concern.
Article
Full-text available
We propose a series of quantum algorithms for computing a wide range of quantum entropies and distances, including the von Neumann entropy, quantum Rényi entropy, trace distance, and fidelity. The proposed algorithms significantly outperform the prior best (and even quantum) ones in the low-rank case, some of which achieve exponential speedups. In particular, for N-dimensional quantum states of rank r, our proposed quantum algorithms for computing the von Neumann entropy, trace distance and fidelity within additive error ε have time complexity of Õ(r/ε^2), Õ(r^5/ε^6) and Õ(r^6.5/ε^7.5), respectively. By contrast, prior quantum algorithms for the von Neumann entropy and trace distance usually have time complexity Ω(N), and the prior best one for fidelity has time complexity Õ(r^12.5/ε^13.5). The key idea of our quantum algorithms is to extend block-encoding from unitary operators in previous work to quantum states (i.e., density operators). It is realized by developing several convenient techniques to manipulate quantum states and extract information from them. The advantage of our techniques over the existing methods is that no restrictions on density operators are required; in sharp contrast, the previous methods usually require a lower bound on the minimal non-zero eigenvalue of density operators.
Article
The symmetric information bottleneck (SIB), an extension of the more familiar information bottleneck, is a dimensionality-reduction technique that simultaneously compresses two random variables to preserve information between their compressed versions. We introduce the generalized symmetric information bottleneck (GSIB), which explores different functional forms of the cost of such simultaneous reduction. We then explore the data set size requirements of such simultaneous compression. We do this by deriving bounds and root-mean-squared estimates of statistical fluctuations of the involved loss functions. We show that in typical situations, the simultaneous GSIB compression requires qualitatively less data to achieve the same errors compared to compressing variables one at a time. We suggest that this is an example of a more general principle that simultaneous compression is more data efficient than independent compression of each of the input variables.
Article
Full-text available
We address the unsolved question of how best to estimate the collision entropy, also called quadratic or second order Rényi entropy. Integer-order Rényi entropies are synthetic indices useful for the characterization of probability distributions. In recent decades, numerous studies have been conducted to arrive at their valid estimates starting from experimental data, so as to derive suitable classification methods for the underlying processes, but optimal solutions have not been reached yet. Limited to the estimation of collision entropy, a one-line formula is presented here. The results of some specific Monte Carlo experiments give evidence of the validity of this estimator even for the very low densities of the data spread in high-dimensional sample spaces. The method's strengths are unbiased consistency, generality and minimum computational cost.
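As background for readers comparing such estimators, here is a generic coincidence-counting sketch under stated assumptions, not necessarily the one-line formula of this paper: for multinomial counts n_i out of N samples, the pair statistic Σ n_i(n_i − 1) / (N(N − 1)) is an unbiased estimate of the collision probability Σ p_i², and its negative logarithm is a natural estimate of the collision entropy H₂. The function name below is illustrative.

    import numpy as np

    def collision_entropy_hat(samples):
        # Coincidence-based estimate of the collision (order-2 Renyi) entropy, in nats.
        # The statistic sum_i n_i*(n_i - 1) / (N*(N - 1)) is an unbiased estimate of
        # the collision probability sum_i p_i^2; -log of it is the plug-in H_2.
        # Returns inf if no symbol repeats (no coincidences are observed).
        _, counts = np.unique(samples, return_counts=True)
        n = counts.sum()
        p2_hat = (counts * (counts - 1)).sum() / (n * (n - 1))
        return -np.log(p2_hat)

On samples drawn from a known distribution p, the estimate can be checked against the exact value −log(Σ p_i²).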
Article
Full-text available
Although the Bayesian paradigm is an important benchmark in studies of human inference, the extent to which it provides a useful framework to account for human behavior remains debated. We document systematic departures from Bayesian inference under correct beliefs, even on average, in the estimates by experimental subjects of the probability of a binary event following observations of successive realizations of the event. In particular, we find underreaction of subjects’ estimates to the evidence (“conservatism”) after only a few observations and at the same time overreaction after longer sequences of observations. This is not explained by an incorrect prior nor by many common models of Bayesian inference. We uncover the autocorrelation in estimates, which suggests that subjects carry imprecise representations of the decision situations, with noise in beliefs propagating over successive trials. But even taking into account these internal imprecisions and assuming various incorrect beliefs, we find that subjects’ updates are inconsistent with the rules of Bayesian inference. We show how subjects instead considerably economize on the attention that they pay to the information relevant to the decision, and on the degree of control that they exert over their precise response, while giving responses fairly adapted to the task. A “noisy-counting” model of probability estimation reproduces the several patterns we exhibit in subjects’ behavior. In sum, human subjects in our task perform reasonably well while greatly minimizing the amount of information that they pay attention to. Our results emphasize that investigating this economy of attention is crucial in understanding human decisions.
Article
In many applications in biology, engineering, and economics, identifying similarities and differences between distributions of data from complex processes requires comparing finite categorical samples of discrete counts. Statistical divergences quantify the difference between two distributions. However, their estimation is very difficult and empirical methods often fail, especially when the samples are small. We develop a Bayesian estimator of the Kullback-Leibler divergence between two probability distributions that makes use of a mixture of Dirichlet priors on the distributions being compared. We study the properties of the estimator on two examples: probabilities drawn from Dirichlet distributions and random strings of letters drawn from Markov chains. We extend the approach to the squared Hellinger divergence. Both estimators outperform other estimation techniques, with better results for data with a large number of categories and for higher values of divergences.
Article
Full-text available
To create a behaviorally relevant representation of the visual world, neurons in higher visual areas exhibit dynamic response changes to account for the time-varying interactions between external (e.g., visual input) and internal (e.g., reward value) factors. The resulting high-dimensional representational space poses challenges for precisely quantifying individual factors’ contributions to the representation and readout of sensory information during a behavior. The widely used point process generalized linear model (GLM) approach provides a powerful framework for a quantitative description of neuronal processing as a function of various sensory and non-sensory inputs (encoding) as well as linking particular response components to particular behaviors (decoding), at the level of single trials and individual neurons. However, most existing variations of GLMs assume the neural systems to be time-invariant, making them inadequate for modeling nonstationary characteristics of neuronal sensitivity in higher visual areas. In this review, we summarize some of the existing GLM variations, with a focus on time-varying extensions. We highlight their applications to understanding neural representations in higher visual areas and decoding transient neuronal sensitivity as well as linking physiology to behavior through manipulation of model components. This time-varying class of statistical models provides valuable insights into the neural basis of various visual behaviors in higher visual areas and holds significant potential for uncovering the fundamental computational principles that govern neuronal processing underlying various behaviors in different regions of the brain.
Article
We develop a data-driven framework to identify the interconnections between firms using an information-theoretic measure. This measure generalizes Granger causality and is capable of detecting nonlinear relationships within a network. Moreover, we develop an algorithm using recurrent neural networks and the aforementioned measure to identify the interconnections of high-dimensional nonlinear systems. The outcome of this algorithm is the causal graph encoding the interconnections among the firms. These causal graphs can be used as preliminary feature selection for another predictive model or for policy design. We evaluate the performance of our algorithm using both synthetic linear and nonlinear experiments and apply it to the daily stock returns of U.S. listed firms and infer their interconnections.
Article
Networks with stochastic variables described by heavy-tailed lognormal distribution are ubiquitous in nature, and hence they deserve an exact information-theoretic characterization. We derive analytical formulas for mutual information between elements of different networks with correlated lognormally distributed activities. In a special case, we find an explicit expression for mutual information between neurons when neural activities and synaptic weights are lognormally distributed, as suggested by experimental data. Comparison of this expression with the case when these two variables have short tails reveals that mutual information with heavy tails for neurons and synapses is generally larger and can diverge for some finite variances in presynaptic firing rates and synaptic weights. This result suggests that evolution might prefer brains with heterogeneous dynamics to optimize information processing.
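For orientation, a standard special case (stated here as a known fact, not the paper's general network formula): if $X = e^U$ and $Y = e^V$ with $(U, V)$ jointly Gaussian with correlation coefficient $\rho$, then, because mutual information is invariant under invertible transformations of each variable separately, $I(X; Y) = I(U; V) = -\frac{1}{2}\ln(1 - \rho^2)$; a lognormal pair carries exactly the information of its Gaussian logarithms.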
Article
Ultrasound (US) imaging is widely used for biometric measurement and diagnosis of internal organs due to the advantages of being real-time and radiation-free. However, due to inter-operator variations, resulting images highly depend on the experience of sonographers. This work proposes an intelligent robotic sonographer to autonomously “explore” target anatomies and navigate a US probe to standard planes by learning from the expert. The underlying high-level physiological knowledge from experts is inferred by a neural reward function, using a ranked pairwise image comparison approach in a self-supervised fashion. This process can be referred to as understanding the “language of sonography.” Considering the generalization capability to overcome inter-patient variations, mutual information is estimated by a network to explicitly disentangle the task-related and domain features in latent space. The robotic localization is carried out in coarse-to-fine mode based on the predicted reward associated with B-mode images. To validate the effectiveness of the proposed reward inference network, representative experiments were performed on vascular phantoms (“line” target), two types of ex vivo animal organ phantoms (chicken heart and lamb kidney representing “point” target), and in vivo human carotids. To further validate the performance of the autonomous acquisition framework, physical robotic acquisitions were performed on three phantoms (vascular, chicken heart, and lamb kidney). The results demonstrated that the proposed advanced framework can robustly work on a variety of seen and unseen phantoms as well as in vivo human carotid data. Code: https://github.com/yuan-12138/MI-GPSR . Video: https://youtu.be/u4ThAA9onE0 .
Article
Shared information is a measure of mutual dependence among multiple jointly distributed random variables with finite alphabets. For a Markov chain on a tree with a given joint distribution, we give a new proof of an explicit characterization of shared information. The Markov chain on a tree is shown to possess a global Markov property based on graph separation; this property plays a key role in our proofs. When the underlying joint distribution is not known, we exploit the special form of this characterization to provide a multiarmed bandit algorithm for estimating shared information, and analyze its error performance.
Chapter
Full-text available
The collective monograph "Ameliorarea calităţii şi siguranţei alimentelor prin biotehnologie şi inginerie alimentară" (Improving food quality and safety through biotechnology and food engineering) was produced within the project with code 20.80009.5107.09, "Ameliorarea calităţii şi siguranţei alimentelor prin biotehnologie şi inginerie alimentară", part of the State Program (2020-2023), Strategic Priority II "Sustainable agriculture, food security and food safety". The monograph was recommended for publication by the Senate of the Technical University of Moldova (minutes no. 4 of 24 October 2023). The work is intended for specialists in the food industry and the wine sector, and for economic operators engaged in the production and processing of horticultural raw materials, promotion and marketing. It analyzes various approaches to increasing the biological value of food products through the application of advanced technologies for protecting biologically active compounds during manufacturing and storage. The basic concept is to exploit the natural components of plant raw materials through efficient methods of treatment, extraction and incorporation into the food matrix. The many factors that can influence product quality are elucidated: the influence of technological factors, methods for stabilizing and protecting the biological activity of water- and fat-soluble components, and the optimization of manufacturing and storage processes. The proposed technologies also take into account possible textural and sensory changes, since the consumer is the final evaluator of food products. The collective monograph "Ameliorarea calităţii şi siguranţei alimentelor prin biotehnologie şi inginerie alimentară" is recommended as a textbook for second-cycle (Master's) and third-cycle (Doctoral) students of the Faculties of Food Technology and of Agricultural, Forestry and Environmental Sciences.
Chapter
Deep neural networks are vulnerable to adversarial examples, which exploit imperceptible perturbations to mislead classifiers. To improve adversarial robustness, recent methods have focused on estimating mutual information (MI). However, existing MI estimators struggle to provide stable and reliable estimates in high-dimensional data. To this end, we propose a Copula Entropic MI Estimator (CE²) to address these limitations. CE² leverages copula entropy to estimate MI in high dimensions, allowing target models to harness information from both clean and adversarial examples to withstand attacks. Our empirical experiments demonstrate that CE² achieves a trade-off between variance and bias in MI estimates, resulting in stable and reliable estimates. Furthermore, the defense algorithm based on CE² significantly enhances adversarial robustness against multiple attacks. The experimental results underscore the effectiveness of CE² and its potential for improving adversarial robustness.
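The identity underlying copula-based MI estimation in general (background on the approach, not the specific CE² construction): for continuous random variables with copula density $c(u_1, \ldots, u_d)$ on the unit cube, the mutual information equals the negative copula entropy, $I(X_1; \ldots; X_d) = \int c(u) \ln c(u)\, du = -H(c)$, so MI can in principle be estimated by rank-transforming each variable to $[0, 1]$ and estimating the (negative) entropy of the transformed sample.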
Article
Full-text available
An overview is given of the several methods in use for the nonparametric estimation of the differential entropy of a continuous random variable. The properties of various methods are compared. Several applications are given such as tests for goodness-of-fit, parameter estimation, quantization theory and spectral estimation.
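One simple instance of the surveyed histogram-based approach (a generic plug-in sketch, with $\Delta$ denoting an assumed common bin width, not a particular estimator from the overview): approximating the density as piecewise constant with bin probabilities $\hat{p}_i$ gives the differential entropy estimate $\hat{h}(X) = -\sum_i \hat{p}_i \ln \hat{p}_i + \ln \Delta$, i.e., the discrete entropy of the binned data plus the log bin width.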
Article
Full-text available
Considers the problem of the calculation of the bias of the maximum likelihood information estimate H, based on independent choices among k events. The expectation EH is calculated exactly as a function of the probabilities p1, p2, ..., pk. The bias H - EH is approximated by using a convergent expansion for a logarithm and using the first two terms of a finite expansion for the jth moment of a random variable. The resulting approximation is more generally valid, although less concise and simple, than the classical Miller-Madow approximation.
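For comparison, the classical Miller-Madow approximation referred to above corrects the maximum likelihood (plug-in) estimate by its leading-order bias; in a commonly quoted form (in nats), with N samples and $\hat{m}$ bins observed with non-zero counts, $\hat{H}_{MM} = \hat{H}_{ML} + \frac{\hat{m} - 1}{2N}$, reflecting that the plug-in estimate underestimates the true entropy by roughly $(m - 1)/(2N)$ to first order.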
Article
Full-text available
Measuring the information carried by neuronal activity is made difficult, particularly when recording from mammalian cells, by the limited amount of data usually available, which results in a systematic error. While empirical ad hoc procedures have been used to correct for such error, we have recently proposed a direct procedure consisting of the analytical calculation of the average error, its estimation (up to subleading terms) from the data, and its subtraction from raw information measures to yield unbiased measures. We calculate here the leading correction terms for both the average transmitted information and the conditional information and, since usually one must first regularize the data, we specify the expressions appropriate to different regularizations. Computer simulations indicate a broad range of validity of the analytical results, suggest the effectiveness of regularizing by simple binning and illustrate the advantage of this over the previously used 'bootstrap' procedure.
Article
Full-text available
Suppose P is an arbitrary discrete distribution on a countable alphabet. Given an i.i.d. sample (X1,…,Xn) drawn from P, we consider the problem of estimating the entropy H(P) or some other functional F=F(P) of the unknown distribution P. We show that, for additive functionals satisfying mild conditions (including the cases of the mean, the entropy, and mutual information), the plug-in estimates of F are universally consistent. We also prove that, without further assumptions, no rate-of-convergence results can be obtained for any sequence of estimators. In the case of entropy estimation, under a variety of different assumptions, we get rate-of-convergence results for the plug-in estimate and for a nonparametric estimator based on match-lengths. The behavior of the variance and the expected error of the plug-in estimate is shown to be in sharp contrast to the finite-alphabet case. A number of other important examples of functionals are also treated in some detail. © 2001 John Wiley & Sons, Inc. Random Struct. Alg., 19: 163–193, 2001
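A minimal sketch of the plug-in estimate analyzed here (generic code following the obvious definition; the function name is illustrative):

    import numpy as np

    def plugin_entropy(samples):
        # Plug-in (maximum likelihood) entropy estimate in nats:
        # substitute the empirical frequencies into H(P) = -sum_i p_i log p_i.
        # Consistent as the sample size grows, but negatively biased for finite samples.
        _, counts = np.unique(samples, return_counts=True)
        p_hat = counts / counts.sum()
        return -np.sum(p_hat * np.log(p_hat))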
Article
Full-text available
Variance adaptation processes have recently been examined in cells of the fly visual system and various vertebrate preparations. To better understand the contributions of somatic mechanisms to this kind of adaptation, we recorded intracellularly in vitro from neurons of rat sensorimotor cortex. The cells were stimulated with a noise current whose standard deviation was varied parametrically. We observed systematic variance-dependent adaptation (defined as a scaling of a nonlinear transfer function) similar in many respects to the effects observed in vivo. The fact that similar adaptive phenomena are seen in such different preparations led us to investigate a simple model of stochastic stimulus-driven neural activity. The simplest such model, the leaky integrate-and-fire (LIF) cell driven by noise current, permits us to analytically compute many quantities relevant to our observations on adaptation. We show that the LIF model displays "adaptive" behavior which is quite similar to the effects observed in vivo and in vitro.
Article
Full-text available
Consider estimating a functional $T(F)$ of an unknown distribution $F \in \mathbf{F}$ from data $X_1, \cdots, X_n$ i.i.d. $F$. Let $\omega(\varepsilon)$ denote the modulus of continuity of the functional $T$ over $\mathbf{F}$, computed with respect to Hellinger distance. For well-behaved loss functions $l(t)$, we show that $\inf_{T_n} \sup_{F \in \mathbf{F}} E_F\, l(T_n - T(F))$ is equivalent to $l(\omega(n^{-1/2}))$ to within constants, whenever $T$ is linear and $\mathbf{F}$ is convex. The same conclusion holds in three nonlinear cases: estimating the rate of decay of a density, estimating the mode and robust nonparametric regression. We study the difficulty of testing between the composite, infinite dimensional hypotheses $H_0: T(F) \leq t$ and $H_1: T(F) \geq t + \Delta$. Our results hold, in the cases studied, because the difficulty of the full infinite-dimensional composite testing problem is comparable to the difficulty of the hardest simple two-point testing subproblem.
Article
Full-text available
We consider in this paper two widely studied examples of nonparametric and semiparametric models in which the standard information bounds are totally misleading. In fact, no estimators converge at the $n^{-\alpha}$ rate for any $\alpha > 0$, although the information is strictly positive "promising" that $n^{-1/2}$ is achievable. The examples are the estimation of $\int p^2$ and the slope in the model of Engle et al. A class of models in which the parameter of interest can be estimated efficiently is discussed.
Article
Full-text available
We present general sufficient conditions for the almost sure $L_1$-consistency of histogram density estimates based on data-dependent partitions. Analogous conditions guarantee the almost-sure risk consistency of histogram classification schemes based on data-dependent partitions. Multivariate data are considered throughout. In each case, the desired consistency requires shrinking cells, subexponential growth of a combinatorial complexity measure and sublinear growth of the number of cells. It is not required that the cells of every partition be rectangles with sides parallel to the coordinate axes or that each cell contain a minimum number of points. No assumptions are made concerning the common distribution of the training vectors. We apply the results to establish the consistency of several known partitioning estimates, including the $k_n$-spacing density estimate, classifiers based on statistically equivalent blocks and classifiers based on multivariate clustering schemes.
Article
Full-text available
Traditional approaches to neural coding characterize the encoding of known stimuli in average neural responses. Organisms face nearly the opposite task--extracting information about an unknown time-dependent stimulus from short segments of a spike train. Here the neural code was characterized from the point of view of the organism, culminating in algorithms for real-time stimulus estimation based on a single example of the spike train. These methods were applied to an identified movement-sensitive neuron in the fly visual system. Such decoding experiments determined the effective noise level and fault tolerance of neural computation, and the structure of the decoding algorithms suggested a simple model for real-time analog signal processing with spiking neurons.
Chapter
Pattern recognition or discrimination is about guessing or predicting the unknown nature of an observation, a discrete quantity such as black or white, one or zero, sick or healthy, real or fake. An observation is a collection of numerical measurements such as an image (which is a sequence of bits, one per pixel), a vector of weather data, an electrocardiogram, or a signature on a check suitably digitized. More formally, an observation is a d-dimensional vector x. The unknown nature of the observation is called a class. It is denoted by y and takes values in a finite set {1, 2, ..., M}. In pattern recognition, one creates a function g(x): ℛ^d → {1, ..., M} which represents one’s guess of y given x. The mapping g is called a classifier. Our classifier errs on x if g(x) ≠ y.
Chapter
We shall need weighted moduli of smoothness in some significant situations of which the following are examples: (a) Investigation of the K-functional of the pair of spaces: a weighted Lp space and a weighted Sobolev space. (b) Relation between modulus of smoothness of a function and that of its derivative (which turns out to be a weighted modulus of smoothness). (c) Applications to best weighted polynomial approximation. (d) Applications to weighted approximation of some linear operators.
Book
Introduction.- LDP for Finite Dimensional Space.- Applications - The Finite Dimensional Case.- General Principles.- Sample Path Large Deviations.- The LDP for Abstract Empirical Measures.- Applications of Empirical Measures LDP.
Article
Extracting information measures from limited experimental samples, such as those normally available when using data recorded in vivo from mammalian cortical neurons, is known to be plagued by a systematic error, which tends to bias the estimate upward. We calculate here the average of the bias, under certain conditions, as an asymptotic expansion in the inverse of the size of the data sample. The result agrees with numerical simulations, and is applicable, as an additive correction term, to measurements obtained under such conditions. Moreover, we discuss the implications for measurements obtained through other usual procedures.
Article
This chapter reproduces the English translation by B. Seckler of the paper by Vapnik and Chervonenkis in which they gave proofs for the innovative results they had obtained in a draft form in July 1966 and announced in 1968 in their note in Soviet Mathematics Doklady. The paper was first published in Russian as Vapnik, V. N. and Chervonenkis, A. Ya., "On the uniform convergence of relative frequencies of events to their probabilities," Teoriya Veroyatnostei i ee Primeneniya 16(2), 264-279 (1971).
Article
The mean value and variance are computed for a statistical estimate for the entropy of a sequence of mutually independent random variables having a similar distribution. The estimate is shown to be biased, consistent and asymptotically normal.
Article
A main theme of this report is the relationship of approximation to learning and the primary role of sampling (inductive inference). We try to emphasize relations of the theory of learning to the main stream of mathematics.
Article
Cramer-Rao type integral inequalities are developed for loss functions w(x) which are bounded below by functions of the type g(x) = c|x|^l, l > 1. As applications, we obtain lower bounds of Hajek-LeCam type for locally asymptotic minimax error for such loss functions.
Chapter
Information theory answers two fundamental questions in communication theory: what is the ultimate data compression (answer: the entropy H), and what is the ultimate transmission rate of communication (answer: the channel capacity C). For this reason some consider information theory to be a subset of communication theory. We will argue that it is much more. Indeed, it has fundamental contributions to make in statistical physics (thermodynamics), computer science (Kolmogorov complexity or algorithmic complexity), statistical inference (Occam's Razor: “The simplest explanation is best”) and to probability and statistics (error rates for optimal hypothesis testing and estimation). The relationship of information theory to other fields is discussed. Information theory intersects physics (statistical mechanics), mathematics (probability theory), electrical engineering (communication theory) and computer science (algorithmic complexity). We describe these areas of intersection in detail.
Article
Many statistical problems can be reformulated in terms of tests of uniformity. Some strong laws of large numbers and a central limit theorem for the logarithm of transformed spacings are obtained. These theorems provide a characterization of the uniform distribution. A general information-type inequality is deduced which gives a quantitative measurement (using the Kullback-Leibler number) of the discrepancy between an arbitrary distribution and the uniform distribution.
Article
The asymptotic behaviour of the minimax risk can be used as a measure of how ‘hard’ an estimation problem is. We relate the asymptotic behaviour of this quantity to an appropriate modulus of continuity, using elementary ideas and techniques only.
Article
When studying convergence of measures, an important issue is the choice of probability metric. We provide a summary and some new results concerning bounds among some important probability metrics/distances that are used by statisticians and probabilists. Knowledge of other metrics can provide a means of deriving bounds for another one in an applied problem. Considering other metrics can also provide alternate insights. We also give examples that show that rates of convergence can strongly depend on the metric chosen. Careful consideration is necessary when choosing a metric.
Article
In many cases an optimum or computationally convenient test of a simple hypothesis $H_0$ against a simple alternative $H_1$ may be given in the following form. Reject $H_0$ if $S_n = \sum^n_{j=1} X_j \leqq k,$ where $X_1, X_2, \cdots, X_n$ are $n$ independent observations of a chance variable $X$ whose distribution depends on the true hypothesis and where $k$ is some appropriate number. In particular the likelihood ratio test for fixed sample size can be reduced to this form. It is shown that with each test of the above form there is associated an index $\rho$. If $\rho_1$ and $\rho_2$ are the indices corresponding to two alternative tests $e = \log \rho_1/\log \rho_2$ measures the relative efficiency of these tests in the following sense. For large samples, a sample of size $n$ with the first test will give about the same probabilities of error as a sample of size $en$ with the second test. To obtain the above result, use is made of the fact that $P(S_n \leqq na)$ behaves roughly like $m^n$ where $m$ is the minimum value assumed by the moment generating function of $X - a$. It is shown that if $H_0$ and $H_1$ specify probability distributions of $X$ which are very close to each other, one may approximate $\rho$ by assuming that $X$ is normally distributed.
Article
Let $X_1, X_2, \cdots, X_n$ be $n$ independent random variables each distributed uniformly over the interval (0, 1), and let $Y_0, Y_1, \cdots, Y_n$ be the respective lengths of the $n + 1$ segments into which the unit interval is divided by the $\{X_i\}$. A fairly wide class of statistical problems is related to finding the distribution of certain functions of the $Y_j$; these problems are reviewed in Section 1. The principal result of this paper is the development of a contour integral for the characteristic function (ch. fn.) of the random variable $W_n = \sum^n_{j=0} h_j(Y_j)$ for quite arbitrary functions $h_j(x)$, this result being essentially an extension of the classical integrals of Dirichlet. The cases of statistical interest correspond to $h_j(x) = h(x),$ independent of $j$. There is a fairly extensive literature devoted to studying the distributions for various functions $h(x)$. By applying our method these distributions and others are readily obtained, in a closed form in some instances, and generally in an asymptotic form by applying a steepest descent method to the contour integral.
Article
If $S(x_1, x_2,\cdots, x_n)$ is any function of $n$ variables and if $X_i, \hat{X}_i, 1 \leq i \leq n$ are $2n$ i.i.d. random variables then $\operatorname{var} S \leq \frac{1}{2} E \sum^n_{i=1} (S - S_i)^2$ where $S = S(X_1, X_2,\cdots, X_n)$ and $S_i$ is given by replacing the $i$th observation with $\hat{X}_i$, so $S_i = S(X_1, X_2,\cdots, \hat{X}_i,\cdots, X_n)$. This is applied to sharpen known variance bounds in the long common subsequence problem.
Article
Tukey's jackknife estimate of variance for a statistic $S(X_1, X_2, \cdots, X_n)$ which is a symmetric function of i.i.d. random variables $X_i$, is investigated using an ANOVA-like decomposition of $S$. It is shown that the jackknife variance estimate tends always to be biased upwards, a theorem to this effect being proved for the natural jackknife estimate of $\operatorname{Var} S(X_1, X_2, \cdots, X_{n-1})$ based on $X_1, X_2, \cdots, X_n$.
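For concreteness, a sketch of the jackknife variance estimate under discussion (the standard construction; names are illustrative). For the sample mean it reproduces the usual $s^2/n$ exactly, while for general statistics it tends, per the result above, to be biased upward.

    import numpy as np

    def jackknife_variance(x, stat):
        # Tukey's jackknife estimate of Var(stat) from leave-one-out replicates:
        #   v_jack = (n - 1)/n * sum_i (S_(i) - S_bar)^2,
        # where S_(i) omits the i-th observation and S_bar is the replicate mean.
        n = len(x)
        replicates = np.array([stat(np.delete(x, i)) for i in range(n)])
        return (n - 1) / n * np.sum((replicates - replicates.mean()) ** 2)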
Article
Part II: Bayes Estimators for Mutual Information, Chi-Squared, Covariance, and other Statistics. This paper is the second in a series of two on the problem of estimating a function of a probability distribution from a finite set of samples of that distribution. In the first paper, the Bayes estimator for a function of a probability distribution was introduced, the optimal properties of the Bayes estimator were discussed, and Bayes and frequency-counts estimators for the Shannon entropy were derived and graphically contrasted. In the current paper the analysis of the first paper is extended by the derivation of Bayes estimators for several other functions of interest in statistics and information theory. These functions are (powers of) the mutual information, chi-squared for tests of independence, variance, covariance, and average. Finding Bayes estimators for several of these functions requires extensions to the analytical techniques developed in the first paper, and these extensions form the main body of this paper. This paper extends the analysis in other ways as well, for example by enlarging the class of potential priors beyond the uniform prior assumed in the first paper. In particular, the use of the entropic and Dirichlet priors is considered.
Article
Although motion-sensitive neurons in macaque middle temporal (MT) area are conventionally characterized using stimuli whose velocity remains constant for 1-3 s, many ecologically relevant stimuli change on a shorter time scale (30-300 ms). We compared neuronal responses to conventional (constant-velocity) and time-varying stimuli in alert primates. The responses to both stimulus ensembles were well described as rate-modulated Poisson processes but with very high precision (approximately 3 ms) modulation functions underlying the time-varying responses. Information-theoretic analysis revealed that the responses encoded only approximately 1 bit/s about constant-velocity stimuli but up to 29 bits/s about the time-varying stimuli. Analysis of local field potentials revealed that part of the residual response variability arose from "noise" sources extrinsic to the neuron. Our results demonstrate that extrastriate neurons in alert primates can encode the fine temporal structure of visual stimuli.
Article
This paper addresses the problem of estimating a function of a probability distribution from a finite set of samples of that distribution. A Bayesian analysis of this problem is presented, the optimal properties of the Bayes estimators are discussed, and, as an example of the formalism, closed form expressions for the Bayes estimators for the moments of the Shannon entropy function are derived. Then numerical results are presented that compare the Bayes estimator to the frequency-counts estimator for the Shannon entropy. We also present the closed form estimators, all derived elsewhere, for the mutual information, chi-squared, covariance, and some other statistics.
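A hedged sketch of the kind of closed-form Bayes estimator discussed in these two papers (the posterior-mean Shannon entropy under a symmetric Dirichlet prior, a standard result; the parameter name beta and the uniform-prior default are illustrative, not the authors' notation):

    import numpy as np
    from scipy.special import digamma

    def bayes_entropy(counts, beta=1.0):
        # Posterior-mean Shannon entropy (nats) under a symmetric Dirichlet(beta) prior.
        # With posterior parameters a_i = n_i + beta and a_0 = sum_i a_i,
        #   E[H | n] = psi(a_0 + 1) - sum_i (a_i / a_0) * psi(a_i + 1),
        # where psi is the digamma function; beta = 1 is the uniform prior on the simplex.
        # counts should include zeros for unobserved symbols if the alphabet size is known.
        a = np.asarray(counts, dtype=float) + beta
        a0 = a.sum()
        return digamma(a0 + 1.0) - np.sum((a / a0) * digamma(a + 1.0))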
Article
We present a new derivation of the asymptotic correction for bias in the estimate of information from a finite sample. The new derivation reveals a relationship between information estimates and a sequence of polynomials with combinatorial significance, the exponential (Bell) polynomials, and helps to provide an understanding of the form and behavior of the asymptotic correction for bias.