Summary of the information-theoretic causal effect quantification and a comparison of the two steps to the Pearl and Neyman-Rubin potential outcome frameworks.

Source publication
Article
Modelling causal relationships has become popular across various disciplines. The most common frameworks for causality are Pearlian causal directed acyclic graphs (DAGs) and the Neyman-Rubin potential outcome framework. In this paper, we propose an information-theoretic framework for causal effect quantification. To this end, we formulate a two ste...

Context in source publication

Context 1
... this section we showed that only their combination yields a rigorous framework for causal effect quantification in Pearlian DAGs. Table 1 summarises our approach. ...

Citations

... The resulting edges form a DAG representing the optimal causal relationship. GES also utilizes conditional information entropy as a score function (Warren and Setubal [22]; Wieczorek and Roth [23]; Cai et al. [24]). Conditional information entropy indicates the extent to which the uncertainty about (or the amount of information in) other variables is reduced once the value of a specific variable is known. ...
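As a concrete illustration of the score described in this excerpt, the following is a minimal sketch of a plug-in estimate of conditional information entropy H(Y | X) for discrete samples (the function name and the toy data are illustrative and not taken from the cited works):

```python
import numpy as np
from collections import Counter

def conditional_entropy(x, y):
    """Plug-in estimate of H(Y | X) = H(X, Y) - H(X), in bits, for discrete samples."""
    def entropy(labels):
        counts = np.array(list(Counter(labels).values()), dtype=float)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))
    return entropy(list(zip(x, y))) - entropy(list(x))

# Example: knowing X removes most of the uncertainty about Y.
x = [0, 0, 1, 1, 0, 1, 0, 1]
y = [0, 0, 1, 1, 0, 1, 1, 1]
print(conditional_entropy(x, y))  # ~0.41 bits, well below H(Y) ~ 0.95 bits
```

A lower H(Y | X) means that knowing X leaves less uncertainty about Y, which is why such a score can rank candidate structures in score-based search.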
Preprint
This study examines the causal relationships between interest rates in the Korean financial market using a vector error correction model (VECM) and a Bayesian network. The Bayesian network, a novel approach in this context, identifies two distinct transmission channels: one led by call rates (reflecting monetary policy) and another by 10-year Treasury bond yields (reflecting fiscal policy). While the call rate channel aligns with traditional views, affecting bank lending rates, corporate bond spreads, and commercial paper rates, the 10-year Treasury yield channel highlights a separate fiscal policy transmission mechanism, influencing commercial paper rates indirectly through 30-year Treasury yields and directly impacting merchant bank lending rates. This finding suggests that fiscal intervention could potentially interfere with monetary policy, emphasizing the need for coordinated macroeconomic policy measures. The study also emphasizes the role of 10-year Treasury bonds as a benchmark in the Korean bond market, reflecting medium- to long-term economic outlooks and serving as the underlying asset for various derivatives.
... and is interpreted as the effect of k past X symbols on the present symbol of Y, given l elements of its past, calculated at time m. Beyond communications and neuroscience, information theory was shown useful for various fields of sequential analysis, encompassing control [5]–[8], causal inference [9], [10], and various machine learning tasks [11]–[13]. ...
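One plausible way to write the quantity described in this excerpt is as a conditional mutual information in the style of transfer entropy (a truncated, per-time-step directed information); the symbols k, l, and m follow the excerpt, but the exact notation of the cited works may differ:

```latex
% Effect of the k past X symbols on the present Y symbol,
% given l elements of Y's own past, evaluated at time m:
T_{X \to Y}(m; k, l) \;=\; I\!\left(X_{m-k}^{\,m-1};\; Y_m \;\middle|\; Y_{m-l}^{\,m-1}\right)
```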
Preprint
Despite the popularity of information measures in analysis of probabilistic systems, proper tools for their visualization are not common. This work develops a simple matrix representation of information transfer in sequential systems, termed information matrix (InfoMat). The simplicity of the InfoMat provides a new visual perspective on existing decomposition formulas of mutual information, and enables us to prove new relations between sequential information theoretic measures. We study various estimation schemes of the InfoMat, facilitating the visualization of information transfer in sequential datasets. By drawing a connection between visual patterns in the InfoMat and various dependence structures, we observe how information transfer evolves in the dataset. We then leverage this tool to visualize the effect of capacity-achieving coding schemes on the underlying exchange of information. We believe the InfoMat is applicable to any time-series task for a better understanding of the data at hand.
... Parents' heights are potential causes of offspring heights, but not vice versa. This intuitively appealing idea leads to tests for predictive causality, such as classical Granger causality testing in time-series analysis and its nonparametric generalization to transfer entropy [19]. In these tests, X is defined as a predictive cause of Y if future values of Y are not conditionally independent of past and present values of X, given past and present values of Y itself. ...
... Specifically, to predict how changing X (via an exogenous intervention or manipulation) would change the distribution of one of its descendants, Y, it is necessary to identify an adjustment set of other variables to condition upon [30]. Adjustment sets generalize the principle that one must condition on any common parents of X and Y to eliminate confounding biases, but must not condition on any common children to avoid introducing selection biases [19]. Appropriate adjustment sets can be computed from the DAG structure of a causal BN for both direct causal effects and total causal effects [30]. ...
... The distribution of Y would then change in response to the change in the distribution of its input, Z. If all relevant confounders are measured and controlled in this way, then the changes in Y's distribution are presumably caused by the differences in X values [16,19]. ...
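The adjustment-set logic sketched in these excerpts can be made concrete with a small (hypothetical) DAG; the sketch below assumes no latent confounders, in which case the parents of the treatment form one valid adjustment set for its total effect, while descendants of the treatment (including common children) must not be conditioned on:

```python
import networkx as nx

# Hypothetical causal DAG: Z confounds X and Y; C is a common child (collider) of X and Y.
G = nx.DiGraph([("Z", "X"), ("Z", "Y"), ("X", "Y"), ("X", "C"), ("Y", "C")])

def parent_adjustment_set(g, treatment):
    """Parents of the treatment: a valid adjustment set for the total effect of
    `treatment` when all parents are observed (no latent confounding)."""
    return set(g.predecessors(treatment))

def forbidden_for_adjustment(g, treatment):
    """Descendants of the treatment; conditioning on them (e.g., the common child C)
    can introduce selection bias."""
    return nx.descendants(g, treatment)

print(parent_adjustment_set(G, "X"))     # {'Z'}       -> condition on the common parent
print(forbidden_for_adjustment(G, "X"))  # {'Y', 'C'}  -> do not condition on these
```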
Chapter
For an AI agent to make trustworthy decision recommendations under uncertainty on behalf of human principals, it should be able to explain why its recommended decisions make preferred outcomes more likely and what risks they entail. Such rationales use causal models to link potential courses of action to resulting outcome probabilities. They reflect an understanding of possible actions, preferred outcomes, effects of action on outcome probabilities, and acceptable risks and trade-offs: the standard ingredients of normative theories of decision-making under uncertainty, such as expected utility theory. Competent AI advisory systems should also notice changes that might affect a user's plans and goals. In response, they should apply both learned patterns for quick response (analogous to fast, intuitive "System 1" decision-making in human psychology) and also slower causal inference and simulation, decision optimization, and planning algorithms (analogous to deliberative "System 2" decision-making in human psychology) to decide how best to respond to changing conditions. Concepts of conditional independence, conditional probability tables (CPTs) or models, causality, heuristic search for optimal plans, uncertainty reduction, and value of information (VoI) provide a rich, principled framework for recognizing and responding to relevant changes and features of decision problems via both learned and calculated responses. This chapter reviews how these and related concepts can be used to identify probabilistic causal dependencies among variables, detect changes that matter for achieving goals, represent them efficiently to support responses on multiple time scales, and evaluate and update causal models and plans in light of new data. The resulting causally explainable decisions make efficient use of available information to achieve goals in uncertain environments.
Keywords: Bayesian networks, Explainable AI (XAI), Causality, Decision analysis, Explainable AI, Explanation, Information, Partially observable Markov decision processes, Reinforcement learning, Stochastic control
... We use do(Z_i) as a shorthand for do(Z_i = z_i) [39,53]. Given a set of variables ...
Preprint
Counterfactual data augmentation has recently emerged as a method to mitigate confounding biases in the training data for a machine learning model. These biases, such as spurious correlations, arise due to various observed and unobserved confounding variables in the data generation process. In this paper, we formally analyze how confounding biases impact downstream classifiers and present a causal viewpoint to the solutions based on counterfactual data augmentation. We explore how removing confounding biases serves as a means to learn invariant features, ultimately aiding in generalization beyond the observed data distribution. Additionally, we present a straightforward yet powerful algorithm for generating counterfactual images, which effectively mitigates the influence of confounding effects on downstream classifiers. Through experiments on MNIST variants and the CelebA datasets, we demonstrate the effectiveness and practicality of our approach.
... Definition 3.2 (Directed Information (Raginsky, 2011; Wieczorek & Roth, 2019)). In a causal directed acyclic graph (DAG) G = (V, E), where V denotes the set of variables and E denotes the set of directed edges denoting the direction of causal influence among the variables in V, the directed information from a variable Z_i ∈ V to another variable Z_j ∈ V is denoted by I(Z_i → Z_j). ...
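For reference, the classical sequence-level directed information of Massey, which such DAG-level notions build on, can be written as follows (this is the standard definition; the precise DAG variant of I(Z_i → Z_j) in the cited works may differ in detail):

```latex
% Massey's directed information from the sequence X^n to the sequence Y^n:
I(X^n \to Y^n) \;=\; \sum_{t=1}^{n} I\!\left(X^{t};\; Y_t \;\middle|\; Y^{t-1}\right)
```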
Preprint
A machine learning model, under the influence of observed or unobserved confounders in the training data, can learn spurious correlations and fail to generalize when deployed. For image classifiers, augmenting a training dataset using counterfactual examples has been empirically shown to break spurious correlations. However, the counterfactual generation task itself becomes more difficult as the level of confounding increases. Existing methods for counterfactual generation under confounding consider a fixed set of interventions (e.g., texture, rotation) and are not flexible enough to capture diverse data-generating processes. Given a causal generative process, we formally characterize the adverse effects of confounding on any downstream tasks and show that the correlation between generative factors (attributes) can be used to quantitatively measure confounding between generative factors. To minimize such correlation, we propose a counterfactual generation method that learns to modify the value of any attribute in an image and generate new images given a set of observed attributes, even when the dataset is highly confounded. These counterfactual images are then used to regularize the downstream classifier such that the learned representations are the same across various generative factors conditioned on the class label. Our method is computationally efficient, simple to implement, and works well for any number of generative factors and confounding variables. Our experimental results on both synthetic (MNIST variants) and real-world (CelebA) datasets show the usefulness of our approach.
... For systems that the researcher cannot perturb in a controlled manner, one can attempt causal inference from observational data, dubbed "causal discovery" (Dawid 2010b). Such "discovery" requires explicit assumptions about process structure, conveyed perhaps as a directed acyclic graph coupled with inductive evidence (e.g., iterative falsification of alternative explanations) (Cox and Wermuth 2004; Dietze et al. 2018; Wieczorek and Roth 2019). ...
Article
Society increasingly demands accurate predictions of complex ecosystem processes under novel conditions to address environmental challenges. However, obtaining the process‐level knowledge required to do so does not necessarily align with the burgeoning use in ecology of correlative model selection criteria, such as Akaike information criterion. These criteria select models based on their ability to reproduce outcomes, not on their ability to accurately represent causal effects. Causal understanding does not require matching outcomes, but rather involves identifying model forms and parameter values that accurately describe processes. We contend that researchers can arrive at incorrect conclusions about cause‐and‐effect relationships by relying on information criteria. We illustrate via a specific example that inference extending beyond prediction into causality can be seriously misled by information‐theoretic evidence. Finally, we identify a solution space to bridge the gap between the correlative inference provided by model selection criteria and a process‐based understanding of ecological systems.
... Our approach combines two ideas: first, we make use of the ability of probability trees to represent context-dependent relationships by representing the different causal hypotheses as sub-trees in a single large probability tree, thereby reducing the problem of causal induction to a simple inference problem in this larger probability tree, which can be solved using Bayes' theorem; second, by combining the causal hypotheses in a single model, we can predict the information gain associated with each intervention in advance, and thus select the intervention with the highest gain, thereby leading to a natural active-learning method for causal induction on probability trees. a) Related work: Information theory has previously been used to quantify the causal effect between variables [17], or to specify circumstances where the causal orientation of categorical variables can be determined from observational data [18]. Information geometry was used to infer causal orientation from observational data, using assumptions on generative mechanisms [19]; however, both settings are different from the Bayesian learning problem considered here. ...
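A minimal sketch of the intervention-selection rule described in this excerpt: if the competing causal hypotheses are held in a single discrete model, the expected information gain of an intervention is the mutual information between the hypothesis variable and the predicted outcome under that intervention. The function name and the toy likelihoods below are illustrative, not the cited method's implementation:

```python
import numpy as np

def expected_information_gain(prior, likelihoods):
    """I(H; Y | do(a)) in bits, for a discrete hypothesis prior p(h) and a
    likelihood matrix p(y | h, do(a)) of shape (num_hypotheses, num_outcomes)."""
    prior = np.asarray(prior, dtype=float)
    lik = np.asarray(likelihoods, dtype=float)
    joint = prior[:, None] * lik                  # p(h, y | do(a))
    marginal_y = joint.sum(axis=0)                # p(y | do(a))
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(joint > 0, joint / (prior[:, None] * marginal_y[None, :]), 1.0)
    return float(np.sum(joint * np.log2(ratio)))

# Two causal hypotheses, two candidate interventions; pick the more informative one.
prior = [0.5, 0.5]
gain_a = expected_information_gain(prior, [[0.9, 0.1], [0.1, 0.9]])  # discriminative
gain_b = expected_information_gain(prior, [[0.5, 0.5], [0.5, 0.5]])  # uninformative
print(gain_a, gain_b)  # ~0.53 bits vs 0.0 bits -> intervention a is selected
```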
Preprint
The past two decades have seen a growing interest in combining causal information, commonly represented using causal graphs, with machine learning models. Probability trees provide a simple yet powerful alternative representation of causal information. They enable computation of both interventions and counterfactuals, and are strictly more general, since they allow context-dependent causal dependencies. Here we present a Bayesian method for learning probability trees from a combination of interventional and observational data. The method quantifies the expected information gain from an intervention, and selects the interventions with the largest gain. We demonstrate the efficiency of the method on simulated and real data. An effective method for learning probability trees on a limited interventional budget will greatly expand their applicability.
... Semantic entropy, in contrast, quantifies the reverse evolutionary process of the decay in the relative proportion of edges due to causal intervention. In particular, we build on the information-theoretic paradigm for quantifying causal influence (Massey and Massey, 2005; Wieczorek and Roth, 2019; Raginsky, 2011; Ay and Polani, 2008). [Figure 1 of the citing work: the relative proportion of non-isomorphic posets (or DAGs) on 16 elements as a function of the number of levels (left) and the number of relations or edges (right).] ...
... Semantic causal entropy, in contrast, attempts to quantify the reverse evolutionary process of decay of the proportion of edges dn² due to causal interventions that eliminate edges. A variety of information-theoretic metrics have been proposed to quantify causal influence (Massey and Massey, 2005; Raginsky, 2011; Ay and Polani, 2008; Wieczorek and Roth, 2019). We generalize the edge-centric model proposed by Janzing et al. (2013), and define the causal influence of removing a set of edges S as the φ-divergence D_φ(P ‖ P_S) between the distribution P represented by the original DAG and P_S, the distribution represented by the DAG with the edges in S removed. ...
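In the notation of this excerpt, the edge-deletion measure and the general φ-divergence (f-divergence) it relies on can be stated as below; P is the distribution entailed by the original DAG and P_S the distribution after removing the edge set S (the general divergence form is standard and not specific to the cited works):

```latex
% Causal influence of removing the edge set S, as a phi-divergence:
C(S) \;=\; D_{\varphi}\!\left(P \,\middle\|\, P_S\right),
\qquad
D_{\varphi}(P \,\|\, Q) \;=\; \sum_{x} Q(x)\, \varphi\!\left(\tfrac{P(x)}{Q(x)}\right),
\quad \varphi \text{ convex},\; \varphi(1) = 0 .
```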
Preprint
We investigate causal inference in the asymptotic regime, as the number of variables approaches infinity, using an information-theoretic framework. We define the structural entropy of a causal model in terms of its description complexity, measured by the logarithmic growth rate, in bits, of the number of all directed acyclic graphs (DAGs), parameterized by the edge density d. Structural entropy yields non-intuitive predictions. If we randomly sample a DAG from the space of all models, then in the range d ∈ (0, 1/8) the model is almost surely a two-layer DAG! Semantic entropy quantifies the reduction in entropy when edges are removed by causal intervention. Semantic causal entropy is defined as the f-divergence between the observational distribution and the interventional distribution P', where a subset S of edges is intervened on to determine their causal influence. We compare the decomposability properties of semantic entropy for different choices of f-divergence, including the KL-divergence, the squared Hellinger distance, and the total variation distance. We apply our framework to generalize a recently popular bipartite experimental design for studying causal inference on large datasets, where interventions are carried out on one set of variables (e.g., power plants, items in an online store), but outcomes are measured on a disjoint set of variables (residents near power plants, or shoppers). We generalize bipartite designs to k-partite designs, and describe an optimization framework for finding the optimal k-level DAG architecture for any value of d ∈ (0, 1/2). As edge density increases, a sequence of phase transitions occurs over disjoint intervals of d, with deeper DAG architectures emerging for larger values of d. We also give a quantitative bound on the number of samples needed to reliably test for average causal influence in a k-partite design.
... However, to understand how CAI methods and explanations can improve the trustworthiness of causal inferences and intervention decision recommendations in practice, we recommend recent analyses and applications of computational causal methods in Systems Biology for cancer research. Although current CAI methods do not yet fully automate valid causal discovery with high reliability [11], causal discovery and understanding of low-level (molecular-biological) pathways are increasingly able to inform, and build confidence in, high-level public health policies by helping target the right causal factors at the macro-level (e.g., diet, exposures) to be causally effective in reducing risks [12]. Explaining how interventions cause desired changes helps to select effective interventions. ...
... We may yet find that an ounce of causal AI is worth a pound of prediction." For practitioners, the greatest practical value of CAI is probably that it helps to identify actions and policies that are causally effective in increasing the probabilities of achieving desired outcomes [12]. For developers, a strong contribution of CAI is that it improves the speed and accuracy of machine learning, especially in novel situations, by enabling efficient generalization and transfer of knowledge learned under previously encountered conditions to apply to new situations, as in the example of Figure 9. CPTs satisfying the ICP property need not be re-learned every time an environment changes, or as new conditions or interventions are encountered, but instead can be used to "transport" previously acquired causal knowledge to the new situations (via transportability calculations) to predict effects of interventions without waiting to collect new data [25]. ...
Article
For an AI agent to make trustworthy decision recommendations under uncertainty on behalf of human principals, it should be able to explain why its recommended decisions make preferred outcomes more likely and what risks they entail. Such rationales use causal models to link potential courses of action to resulting outcome probabilities. They reflect an understanding of possible actions, preferred outcomes, the effects of action on outcome probabilities, and acceptable risks and trade-offs—the standard ingredients of normative theories of decision-making under uncertainty, such as expected utility theory. Competent AI advisory systems should also notice changes that might affect a user’s plans and goals. In response, they should apply both learned patterns for quick response (analogous to fast, intuitive “System 1” decision-making in human psychology) and also slower causal inference and simulation, decision optimization, and planning algorithms (analogous to deliberative “System 2” decision-making in human psychology) to decide how best to respond to changing conditions. Concepts of conditional independence, conditional probability tables (CPTs) or models, causality, heuristic search for optimal plans, uncertainty reduction, and value of information (VoI) provide a rich, principled framework for recognizing and responding to relevant changes and features of decision problems via both learned and calculated responses. This paper reviews how these and related concepts can be used to identify probabilistic causal dependencies among variables, detect changes that matter for achieving goals, represent them efficiently to support responses on multiple time scales, and evaluate and update causal models and plans in light of new data. The resulting causally explainable decisions make efficient use of available information to achieve goals in uncertain environments.
... Instead of modeling how changes in exposure propagate through mechanisms to cause changes in outcome probability distributions (called interventional distributions), the PO framework uses assumptions to interpret differences between observed and statistically predicted hypothetical outcomes as average treatment effects (ATEs) or average causal effects (ACEs) caused by treatments or interventions. Causal effects of interventions are assumed to be quantified by ACEs in the PO framework, rather than by changes in interventional distributions as in the SCM framework [32,34]. Key assumptions commonly made in applying the PO framework are that (a) treatment is independent of potential outcomes, called the ignorability assumption; (b) the outcome observed for one person (or unit of observation) does not depend on treatments assigned to others (e.g., lower pollution levels for some people should not affect mortality rates for others, such as family members); (c) there is only one version of each treatment (e.g., a specified reduction in mean PM2.5 is always achieved by the same reductions in its components and the same changes in detailed time courses of exposure); and (d) the assumed regression model describing how exposure affects mortality is correctly specified. ...
... The SNMM models the difference of intermediate ACEs (e.g., daily mortality at time t + 1) under counterfactual levels of a last increment of exposure (e.g., daily PM2.5 concentration at time t) as a function of past exposure and covariate history [44]. Counterfactual causal effects are then estimated under multiple assumptions such as that (a) the treatment effect model is correctly specified (i.e., the modeler is able to specify the correct mathematical formula by which exposure affects response); (b) the potential outcome under the observed exposure coincides with the observed outcome ("consistency"); (c) all combinations of variable levels needed to identify causal effects occur in the dataset ("positivity"); and (d) there are no unmeasured confounders inducing spurious (not directly causal) associations between exposures and response (i.e., conditional on the exposure history and the history of all measured confounders, the exposure at time t is independent of the potential outcomes; this is called the "sequential ignorability," "no unmeasured confounders," or "conditional exchangeability" assumption) [34,44,47–49]. Compared to some other PO methods (e.g., marginal structural models), SNMMs are, in principle, better tailored for dealing with violations of certain assumptions and can explicitly quantify interactions between time-varying exposures and time-varying covariates [45–47]. ...
... Technical options for reducing dependence of conclusions on untested assumptions have greatly expanded recently. The SCM framework [32] and an information-theoretic framework for causal discovery and causal inference [34] allow imponderable questions about whether ignorability, exchangeability, exclusion restriction, and other PO assumptions are valid to be replaced by properties that are more readily empirically testable using observational data, such as whether mortality is conditionally independent of exposure given the values of other measured variables [33,66], whether the structural causal mechanisms in an SCM are the same across multiple studies and interventions ("invariant causal prediction") [66], and whether information is found to flow from changes in exposure to changes in mortality rates over time ("directed information") [34]. Causal questions posed in the PO framework can be translated to equivalent questions in the SCM framework [33]. ...
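A minimal sketch of the kind of empirically testable conditional-independence check mentioned here, using a partial-correlation (Fisher z) test under a linear-Gaussian assumption; variable names and the toy data are hypothetical, and this is one common test rather than the specific method of the cited works:

```python
import numpy as np
from scipy import stats

def partial_corr_ci_test(x, y, z):
    """Test X independent of Y given Z via the partial correlation of residuals
    after regressing X and Y on Z (linear-Gaussian assumption). Returns (r, p)."""
    zmat = np.column_stack([np.ones(len(x)), z])              # add intercept
    rx = x - zmat @ np.linalg.lstsq(zmat, x, rcond=None)[0]   # residual of X given Z
    ry = y - zmat @ np.linalg.lstsq(zmat, y, rcond=None)[0]   # residual of Y given Z
    r = np.corrcoef(rx, ry)[0, 1]
    n, k = len(x), zmat.shape[1] - 1
    fisher_z = np.arctanh(r) * np.sqrt(n - k - 3)             # Fisher z statistic
    p = 2 * (1 - stats.norm.cdf(abs(fisher_z)))
    return r, p

# Toy data: exposure and mortality are associated only through a confounder C.
rng = np.random.default_rng(0)
c = rng.normal(size=2000)
exposure = c + rng.normal(size=2000)
mortality = c + rng.normal(size=2000)
print(partial_corr_ci_test(exposure, mortality, c[:, None]))  # large p: independent given C
```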
Article
Causal inference regarding exposures to ambient fine particulate matter (PM2.5) and mortality estimated from observational studies is limited by confounding, among other factors. In light of a variety of causal inference frameworks and methods that have been developed over the past century to specifically quantify causal effects, three research teams were selected in 2016 to evaluate the causality of PM2.5-mortality association among Medicare beneficiaries, using their own selections of causal inference methods and study designs but the same data sources. With a particular focus on controlling for unmeasured confounding, two research teams adopted an instrumental variables approach under a quasi-experiment or natural experiment study design, whereas one team adopted a structural nested mean model under the traditional cohort study design. All three research teams reported results supporting an estimated counterfactual causal relationship between ambient PM2.5 and all-cause mortality, and their estimated causal relationships are largely of similar magnitudes to recent epidemiological studies based on regression analyses with omitted potential confounders. The causal methods used by all three research teams were built upon the potential outcomes framework. This framework has marked conceptual advantages over regression-based methods in addressing confounding and yielding unbiased estimates of average treatment effect in observational epidemiological studies. However, potential violations of the unverifiable assumptions underlying each causal method leave the results from all three studies subject to biases. We also note that the studies are not immune to some other common sources of bias, including exposure measurement errors, ecological study design, model uncertainty and specification errors, and irrelevant exposure windows, that can undermine the validity of causal inferences in observational studies. As a result, despite some apparent consistency of study results from the three research teams with the wider epidemiological literature on PM2.5-mortality statistical associations, caution seems warranted in drawing causal conclusions from the results. A possible way forward is to improve study design and reduce dependence of conclusions on untested assumptions by complementing potential outcomes methods with structural causal modeling and information-theoretic methods that emphasize empirically tested and validated relationships.