Causation, Prediction, and Search

January 1993

81

DOI:10.1007/978-1-4612-2748-9

ISBN: 978-1-4612-7650-0

Authors:

Peter Spirtes

Carnegie Mellon University

Richard Scheines

Carnegie Mellon University

What assumptions and methods allow us to turn observations into causal knowledge, and how can even incomplete causal knowledge be used in planning and prediction to influence and control our environment? In this book Peter Spirtes, Clark Glymour, and Richard Scheines address these questions using the formalism of Bayes networks, with results that have been applied in diverse areas of research in the social, behavioral, and physical sciences. The authors show that although experimental and observational study designs may not always permit the same inferences, they are subject to uniform principles. They axiomatize the connection between causal structure and probabilistic independence, explore several varieties of causal indistinguishability, formulate a theory of manipulation, and develop asymptotically reliable procedures for searching over equivalence classes of causal models, including models of categorical data and structural equation models with and without latent variables. The authors show that the relationship between causality and probability can also help to clarify such diverse topics in statistics as the comparative power of experimentation versus observation, Simpson's paradox, errors in regression models, retrospective versus prospective sampling, and variable selection. The second edition contains a new introduction and an extensive survey of advances and applications that have appeared since the first edition was published in 1993.

Causation, Prediction, and Search

second edition

Peter Spirtes, Clark Glymour, and Richard Scheines

Peter Spirtes is Professor of Philosophy at the Center for Automated Learning and Discovery, Carnegie Mellon University. Clark Glymour is Alumni University Professor of Philosophy at Carnegie Mellon University and Valtz Family Professor of Philosophy at the University of California, San Diego. He is also Distinguished External Member of the Center for Human and Machine Cognition at the University of West Florida, and Adjunct Professor of Philosophy of History and Philosophy of Science at the University of Pittsburgh. Richard Scheines is Associate Professor of Philosophy at the Center for Automated Learning and Discovery, and at the Human Computer Interaction Institute, Carnegie Mellon University.

Of related interest

Learning in Graphical Models

edited by Michael I. Jordan

Graphical models, a marriage between probability theory and graph theory, provide a natural tool for dealing with two problems that occur throughout applied mathematics and engineering: uncertainty and complexity. In particular, they play an increasingly important role in the design and analysis of machine learning algorithms. Fundamental to the idea of a graphical model is the notion of modularity: a complex system is built by combining simpler parts. Probability theory serves as the glue whereby the parts are combined, ensuring that the system as a whole is consistent and providing ways to interface models to data. Graph theory provides both an intuitively appealing interface by which humans can model highly interacting sets of variables and a data structure that lends itself naturally to the design of efficient general-purpose algorithms. This book presents an in-depth exploration of issues related to learning within the graphical model formalism.

Computation, Causation, and Discovery

edited by Clark Glymour and Gregory F. Cooper

In science, business, and policymaking, anywhere data are used in prediction, two sorts of problems requiring very different methods of analysis often arise. The first, problems of recognition and classification, concerns learning how to use some features of a system to accurately predict other features of that system. The second, problems of causal discovery, concerns learning how to predict those changes to some features of a system that will result if an intervention changes other features. This book is about the second, much more difficult, type of problem. The contributors discuss recent research and applications using Bayes nets or directed graphic representations, including representations of feedback or “recursive” systems. The book contains a thorough discussion of foundational issues, algorithms, proof techniques, and applications to economics, physics, biology, educational research, and other areas.

A Bradford Book

The MIT Press

Massachusetts Institute of Technology

Cambridge, Massachusetts 02142

http://mitpress.mit.edu

Adaptive Computation and Machine Learning

Thomas Dietterich, Editor

Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns,

Associate Editors

Bioinformatics: The Machine Learning Approach

Pierre Baldi and Søren Brunak

Reinforcement Learning: An Introduction

Richard S. Sutton and Andrew G. Barto

Graphical Models for Machine Learning and Digital Communication

Brendan J. Frey

Learning in Graphical Models

Michael I. Jordan

Causation, Prediction, and Search, second edition

Peter Spirtes, Clark Glymour, and Richard Scheines

Causation, Prediction, and Search

Peter Spirtes, Clark Glymour, and Richard Scheines

with additional material by

David Heckerman, Christopher Meek,

Gregory F. Cooper, and Thomas Richardson

The MIT Press

Cambridge, Massachusetts

London, England

electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data

Spirtes, Peter.

Causation, prediction, and search.—2nd ed. / Peter Spirtes, Clark Glymour,

and Richard Scheines ; with additional material by David Heckerman,

Christopher Meek, Gregory F. Cooper, and Thomas Richardson.

p. cm. — (Adaptive computation and machine learning)

Includes bibliographical references and index.

ISBN 0-262-19440-6 (hc : alk. paper)

1. Mathematical statistics. I. Glymour, Clark N. II. Scheines, Richard.

III. Title. IV. Series.

QA276 .S65 2000

519.—dc21 00-026266

Contents

Preface to the Second Edition
Preface
Acknowledgments
Notational Conventions
1 Introduction and Advertisement
2 Formal Preliminaries
3 Causation and Prediction: Axioms and Explications
4 Statistical Indistinguishability
5 Discovery Algorithms for Causally Sufficient Structures
6 Discovery Algorithms without Causal Sufficiency
7 Prediction
8 Regression, Causation, and Prediction
9 The Design of Empirical Studies
10 The Structure of the Unobserved
11 Elaborating Linear Theories with Unmeasured Variables
12 Prequels and Sequels
13 Proofs of Theorems
Notes
Glossary
References
Index

To my parents, Morris and Cecile Spirtes—P.S.

In memory of Lucille Lynch Schwartz Watkins Speede Tindall Preston—C. G.

To Martha, for her support and love—R.S.

It is with data affected by numerous causes that Statistics is mainly concerned.

Experiment seeks to disentangle a complex of causes by removing all but one of them, or

rather by concentrating on the study of one and reducing the others, as far as

circumstances permit, to comparatively small residuum. Statistics, denied this resource,

must accept for analysis data subject to the influence of a host of causes, and must try to

discover from the data themselves which causes are the important ones and how much of

the observed effect is due to the operation of each.

—G. U. Yule and M. G. Kendall, 1950

The Theory of Estimation discusses the principles upon which observational data may be

used to estimate, or to throw light upon the values of theoretical quantities, not known

numerically, which enter into our specification of the causal system operating.

—Sir Ronald Fisher, 1956

George Box has [almost] said “The only way to find out what will happen when a

complex system is disturbed is to disturb the system, not merely to observe it passively.”

These words of caution about “natural experiments” are uncomfortably strong. Yet in

today’s world we see no alternative to accepting them as, if anything, too weak.

—G. Mosteller and J. Tukey, 1977

Causal inference is one of the most important, most subtle, and most neglected of all the

problems of Statistics.

—P. Dawid, 1979

Preface to the Second Edition

This second edition of Causation, Prediction, and Search is the culmination of almost

twenty years of research on automation and causal inference, beginning in 1980 with a

chapter of Glymour’s Theory and Evidence, continuing with our book, Discovering

Causal Structure, written with Kevin Kelly in 1987, and with an essay in 1990 (Spirtes et

al. 1990) which laid out much of the research project we have followed in subsequent

years. The thought—which one of us had—that the subject was more or less exhausted in

1993, when the first edition of this book appeared, has been proved entirely wrong.

For this edition we have substituted a new and briefer introduction, a discussion of

d-separation that eliminates a misleading didacticism of the first edition, and an entirely

new twelfth chapter, surveying and summarizing relevant results and applications since

1993. The original twelfth chapter was chiefly a series of conjectures, most of which have

been proved correct, concerning cyclic graphs and feedback systems.

Our first debt for this edition is to our two former students, Chris Meek and Thomas

Richardson. Much of the new work we describe is theirs. We are almost equally indebted

to Gregory Cooper, David Heckerman, and Larry Wasserman, who have been wonderful,

helpful colleagues and collaborators, and to Jaimie Robins, who, though often unhappy

with the very idea of this book, helped with his insightfulness and fairness of mind. We

have been encouraged by Judea Pearl’s support, by his development of ideas presented

here, particularly those on prediction first presented in chapter 7 of this book, and by his

explorations of a multitude of new aspects of causal inference not considered here. We

have been equally encouraged by the ingenious uses and modifications of our procedures

provided by a number of scientists, including Bill Shipley, David Bessler and his

collaborators, and Ludwig Litzka and his students. We owe a particular thanks to Cooper,

Heckerman and Meek for permitting us to use in chapter 12 their survey of Bayesian

search methods, and to Thomas Richardson for providing us with information about

recent unpublished developments on chain graphs.

Preface

This book is intended for anyone, regardless of discipline, who is interested in the use of

statistical methods to help obtain scientific explanations or to predict the outcomes of

actions, experiments or policies.

Much of G. Udny Yule’s work illustrates a vision of statistics whose goal is to

investigate when and how causal influences may be reliably inferred, and their

comparative strengths estimated, from statistical samples. Yule’s enterprise has been

largely replaced by Ronald Fisher’s conception, in which there is a fundamental

cleavage between experimental and non-experimental inquiry, and statistics is largely

unable to aid in causal inference without randomized experimental trials. Every now and

then members of the statistical community express misgivings about this turn of events,

and, in our view, rightly so. Our work represents a return to something like Yule’s

conception of the enterprise of theoretical statistics and its potential practical benefits.

If intellectual history in the twentieth century had gone otherwise, there might have

been a discipline to which our work belongs. As it happens, there is not. We develop

material that belongs to statistics, to computer science, and to philosophy; the

combination may not be entirely satisfactory for specialists in any of these subjects. We

hope it is nonetheless satisfactory for its purpose. We are not statisticians by training or

by association, and perhaps for that reason we tend to look at issues differently, and,

from the perspective common in the discipline, no doubt oddly. We are struck by the fact

that in the social and behavioral sciences, epidemiology, economics, market research,

engineering, and even applied physics, statistical methods are routinely used to justify

causal inferences from data not obtained from randomized experiments, and sample

statistics are used to predict the effects of policies, manipulations, or experiments.

Without these uses the profession of statistics would be a far smaller business. It may not

strike many professional statisticians as particularly odd that the discipline thriving from

such uses assures its audience that they are unwarranted, but it strikes us as very odd

indeed. From our perspective outside the discipline, the most urgent questions about the

application of statistics to such ends concern the conditions under which causal

inferences and predictions of the effects of manipulations can and cannot reliably be

made, and the most urgent need is a principled, rigorous theory with which to address

these problems. To judge from the testimony of their books, a good many statisticians

think any such theory is impossible. We think the common arguments against the

possibility of inferring causes from statistics outside of experimental trials are unsound,

and radical separations of the principles of experimental and observational study designs

are unwise. Experimental and observational design may not always permit the same

inferences, but they are subject to uniform principles.

The theory we develop follows necessarily from assumptions laid down in the

statistical community over the last fifteen years. The underlying structure of the theory is

essentially axiomatic. We will give two independent axioms on the relation between

causal structures and probability distributions and deduce from them features of causal

relationships and predictions that can and that cannot be reliably inferred from statistical

constraints under a variety of background assumptions. Versions of all of the axioms can

be found in papers by Lauritzen, Wermuth, Speed, Pearl, Rubin, Pratt, Schlaifer, and

others. In most cases we will develop the theory in terms of probability distributions that

can be thought of loosely as propensities that determine long run frequencies, but many

of the probability distributions can alternatively be understood as (normative) subjective

degrees of belief, and we will occasionally note Bayesian applications. From the axioms

there follow a variety of theorems concerning estimation, sampling, latent variable

existence and structure, regression, indistinguishability relations, experimental design,

prediction, Simpson’s paradox, and other topics. Foremost among the “other topics” are

the discovery that statistical methods commonly used for causal inference are radically

suboptimal, and that there exist asymptotically reliable, computationally efficient search

procedures that conjecture causal relationships from the outcomes of statistical decisions

made on the basis of sample data. (The procedures we will describe require statistical

decisions about the independence of random variables; when we say such a procedure is

“asymptotically reliable” we mean it provides correct information if the outcome of each

of the requisite statistical decisions is true in the population under study.)

This much of the book is mathematics: where the axioms are accepted, so must the

theorems be, including the existence of search procedures. The procedures we describe

are applicable to both linear and discrete data and can be feasibly applied to a hundred or

more variables so long as the causal relations between the variables are sufficiently

sparse and the sample sufficiently large. These procedures have been implemented in a

computer program, TETRAD II, which at the time of writing is publicly available.¹

The theorems concerning the existence and properties of reliable discovery

procedures of themselves tell us nothing about the reliabilities of the search procedures

in the short run. The methods we describe require an unpredictable sequence of

statistical decisions, which we have implemented as hypothesis tests. As is usual in such

cases, in small samples the conventional p values of the individual tests may not provide

good estimates of type 1 error probabilities for the search methods. We provide the

results of extensive tests of various procedures on simulated data using Monte Carlo

methods, and these tests give considerable evidence about reliability under the

conditions of the simulations. The simulations illustrate an easy method for estimating

the probabilities of error for any of the search methods we describe. The book also

contains studies of one large pseudoempirical data set—a body of simulated data created

by medical researchers to model emergency medicine diagnostic indicators and their

causes—and a great many empirical data sets, most of which have been discussed by

other authors in the context of specification searches.

A further aim of this work is to show that a proper understanding of the relationship

between causality and probability can help to clarify diverse topics in the statistical

literature, including the comparative power of experimentation versus observation,

Simpson’s paradox, errors in regression models, retrospective versus prospective

sampling, the perils of variable selection, and other topics. There are a number of

relevant topics we do not consider. They include problems of estimation with discrete

latent variables, optimizing statistical decisions, many details of sampling designs, time

series, and a full theory of “nonrecursive” causal structures—that is, finite graphical

representations of systems with feedback.

Causation, Prediction, and Search is not intended to be a textbook, and it is not fitted

out with the associated paraphernalia. There are open problems but no exercises. In a

textbook everything ought to be presented as if it were complete and tidy, even if it isn’t.

We make no such pretenses in this book, and the chapters are rich in unsolved problems

and open questions. Textbooks don’t usually pause much to argue points of view; we

pause quite a lot.

The various theorems in this book often have a graph theoretic character; many of

them are long, difficult case arguments of a kind quite unfamiliar in statistics. In order

not to interrupt the flow of the discussion we have placed all proofs but one in a chapter

at the end of the book. In the few cases where detailed proofs are available in the

published literature, we have simply referred the reader to them. Where proofs of

important results have not been published or are not readily available we have given the

demonstrations in some detail.

The structure of the book is as follows. Chapter 1 concerns the motivation for the

book in the context of current statistical practice and advertises some of the results.

Chapter 2 introduces the mathematical ideas necessary to the investigation, and chapter 3

gives the formal framework a causal interpretation, lays down the axioms, notes

circumstances in which they are likely to fail, and provides a few fundamental theorems.

The next two chapters work out the consequences of two of the axioms for some

fundamental issues in contexts in which it is known, or assumed, that there are no

unmeasured common causes affecting measured variables. In chapter 4 we give

graphical characterizations of necessary and sufficient conditions for causal hypotheses

to be statistically indistinguishable from one another in each of several senses. In chapter

5 we criticize features of model specification procedures commonly recommended in

statistics, and we describe feasible algorithms that from properties of population

distributions extract correct information about causal structure, assuming the axioms

apply and that no unmeasured common causes are at work. The algorithms are illustrated

for a variety of empirical and simulated samples. Chapter 6 extends the analysis of

chapter 5 to contexts in which it cannot be assumed that no unmeasured common causes

act on measured variables. From both a theoretical and practical perspective, this chapter

and the next form the center of the book, but they are especially difficult. Chapter 7

addresses the fundamental issue of predicting the effects of manipulations, policies, or

experiments. As an easy corollary, the chapter unifies directed graphical models with

Donald Rubin’s “counterfactual” framework for analyzing prediction. Chapter 8 applies

the results of the preceding chapters to the subject of regression. We argue that even

when standard statistical assumptions are satisfied multiple regression is a defective and

unreliable way to assess causal influence even in the large sample limit, and various

automated regression model specification searches only make matters worse. We show

that the algorithms of chapter 6 are more reliable in principle, and we compare the

performances of these algorithms against various multiple regression procedures on a

variety of simulated and empirical data sets. Chapter 9 considers the design of empirical

studies in the light of the results of earlier chapters, including issues of retrospective and

prospective sampling, the comparative power of experimental and observational designs,

selection of variables, and the design of ethical clinical trials. The chapter concludes

with a look back at some aspects of the dispute over smoking and lung cancer. Chapters

10 and 11 further consider the linear case, and analyze algorithms for discovering or

elaborating causal relations among measured and unmeasured variables in linear

systems. Chapter 12 is a brief consideration of a variety of open questions. Proofs are

given in chapter 13.

We have tried to make this work self-contained, but it is admittedly and unavoidably

difficult. The reader will be aided by a previous reading of Pearl 1988, Whittaker 1990,

or Neapolitan 1990.

Acknowledgments

One source of the ideas in this book is in work we began ten years ago at the University

of Pittsburgh. We drew many ideas about causality, statistics, and search from the

psychometric, economic, and sociological literature, beginning with Charles Spearman’s

project at the turn of the century and including the work of Herbert Simon, Hubert

Blalock, and Herbert Costner.

We obtained a new perspective on the enterprise from Judea Pearl’s Probabilistic

Reasoning in Intelligent Systems, which appeared the next year. Although not principally

concerned with discovery, Pearl’s book showed us how to connect conditional

independence with causal structure quite generally, and that connection proved essential

to establishing general, reliable discovery procedures. We have since profited from

correspondence and conversation with Pearl and with Dan Geiger and Thomas Verma,

and from several of their papers. Pearl’s work drew on the papers of Wermuth (1980),

Kiiveri and Speed (1982), Wermuth and Lauritzen (1983), and Kiiveri, Speed, and

Carlin (1984), which in the early 1980s had already provided the foundations for a

rigorous study of causal inference. Paul Holland introduced one of us to the Rubin

framework some years ago, but we only recently realized its logical connections with

directed graphical models. We were further helped by J. Whittaker’s (1990) excellent

account of the properties of undirected graphical models.

We have learned a great deal from Gregory Cooper at the University of Pittsburgh,

who provided us with data, comments, Bayesian algorithms, and the picture and

description of the ALARM network, which we consider in several places. Over the years

we have learned useful things from Kenneth Bollen. Chris Meek provided essential help

in obtaining an important theorem that derives various claims made by Rubin, Pratt, and

Schlaifer from axioms on directed graphical models.

Steve Fienberg and several students from Carnegie Mellon’s department of statistics

joined with us in a seminar on graphical models from which we learned a great deal. We

are indebted to him for his openness, intelligence, and helpfulness in our research, and to

Elizabeth Slate for guiding us through several papers in the Rubin framework. We are

obliged to Nancy Cartwright for her courteous but salient criticism of the approach taken

in our previous book and continued here. Her comments prompted our work on

parameters in chapter 4. We are indebted to Brian Skyrms for his interest and

encouragement over many years, and to Marek Druzdzel for helpful comments and

encouragement. We have also been helped by Linda Bouck, Ronald Christensen, Jan

Callahan, David Papineau, John Earman, Dan Hausman, Joe Hill, Michael Meyer, Teddy

Seidenfeld, Dana Scott, Jay Kadane, Steven Klepper, Herb Simon, Peter Slezak, Steve

Sorensen, John Worrall, and Andrea Woody. We are indebted to Ernest Seneca for

putting us in contact with Dr. Rick Linthurst, and we are especially grateful to Dr.

Linthurst for making his doctoral thesis available to us.

Our work has been supported by many institutions. They, and those who made

decisions on their behalf, deserve our thanks. They include Carnegie Mellon University,

the National Science Foundation programs in History and Philosophy of Science, in

Economics, and in Knowledge and Database Systems, the Office of Naval Research, the

Navy Personnel Research and Development Center, the John Simon Guggenheim

Memorial Foundation, Susan Chipman, Stanley Collyer, Helen Gigley, Peter Machamer,

Steve Sorensen, Teddy Seidenfeld, and Ron Overmann. The Navy Personnel Research

and Development Center provided us the benefit of access to a number of challenging

data analysis problems from which we have learned a great deal.

Notational Conventions

Text

In the text, each technical term is written in boldface where it is defined.

Variables: capitalized, and in italics, e.g., X

Values of variables: lower case, and in italics, e.g., X = x

Sets: capitalized, and in boldface, e.g., V

Values of sets of variables: lower case, and in boldface, e.g., V = v

Members of X that are not members of Y: X\Y

Error variables: e, ε, δ

Independence of X and Y: X ⫫ Y

Independence of X and Y conditional on Z: X ⫫ Y|Z

X ∪ Y: XY

Covariance of X and Y: COV(X,Y) or σ_XY

Correlation of X and Y: ρ_XY

Sample correlation of X and Y: r_XY

Partial correlation of X and Y, controlling for all members of set Z: ρ_XY.Z
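As a rough illustration of the last entries in this list, the sketch below computes a sample correlation and a first-order partial correlation, that is, the special case in which the controlling set Z contains a single variable. The function names `correlation` and `partial_correlation` are assumptions made for this example and do not come from the book or from TETRAD II.

```python
import math


def correlation(xs, ys):
    """Sample (Pearson) correlation r_XY of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(r_xy, r_xz, r_yz):
    """Partial correlation of X and Y controlling for one variable Z.

    Standard first-order formula; larger controlling sets would need
    a recursive or matrix-based computation, which is not shown here.
    """
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

For instance, if the correlation of X and Y is 0.25 and each is correlated 0.5 with Z, then `partial_correlation(0.25, 0.5, 0.5)` is zero: the association between X and Y vanishes once Z is controlled for.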

In all of the graphs that we consider, the vertices are random variables. Hence we use the

terms “variables in a graph” and “vertices in a graph” interchangeably.

Figures

Figure numbers occur just below a figure, starting at 1 within each chapter. Where

necessary, we distinguish between measured and unmeasured variables by boxing

measured variables and circling unmeasured variables (except for error terms). Variables

beginning with e, or are understood to be “error,” or “disturbance,” variables. For

example, in the figure below, X and Y are measured, T is not, and is an error term.

Figure n.1

We will neither box nor circle variables in graphs in which no distinction need be

made between measured and unmeasured variables, for example, figure n.2.

Figure n.2

For simplicity, we state and prove our results for probability distributions over

discrete random variables. However, under suitable integrability conditions, the results

can be easily generalized to continuous distributions that have density functions by

replacing the discrete variables by continuous variables, probability distributions by

density functions, and summations by integrals.

If a description of a set of variables is a function of a graph G and variables in G, then

we make G an optional argument to the function. For example, Parents(G,X) denotes

the set of variables that are parents of X in graph G; if the context makes clear which

graph is being referred to we will simply write Parents(X).
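The optional-graph convention can be mimicked in code. The sketch below is only an illustration: it assumes a directed graph stored as a set of (parent, child) edges, and the names `parents` and `CONTEXT_GRAPH`, as well as the example edges, are hypothetical.

```python
from typing import FrozenSet, Optional, Set, Tuple

Edge = Tuple[str, str]  # (parent, child), i.e. an edge parent -> child

# Hypothetical graph "made clear by context", used when none is passed.
CONTEXT_GRAPH: Set[Edge] = {("X1", "X2"), ("X3", "X2"), ("X2", "X4")}


def parents(x: str, graph: Optional[Set[Edge]] = None) -> FrozenSet[str]:
    """Return Parents(G, X); if G is omitted, fall back to the context graph."""
    g = CONTEXT_GRAPH if graph is None else graph
    return frozenset(parent for (parent, child) in g if child == x)
```

Here `parents("X2")` plays the role of Parents(X), while `parents("X2", some_graph)` plays the role of Parents(G,X).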

If a distribution is defined over a set of random variables O then we refer to the

distribution as P(O). An equation between distributions over random variables is

understood to be true for all values of the random variables for which all of the

distributions in the equation are defined. For example if X and Y each take the values 0

or 1 and P(X = 0) ≠ 0 and P(X = 1) ≠ 0 then P(Y|X) = P(Y) means P(Y = 0|X= 0) = P(Y =

0), P(Y = 0|X= 1) = P(Y = 0), P(Y = 1|X= 0) = P(Y = 1), and P(Y = 1|X= 1) = P(Y = 1).
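The reading of P(Y|X) = P(Y) as a family of equations, one for each pair of values at which the conditional probability is defined, can be checked mechanically. The following is a minimal sketch under the assumption that the joint distribution is given as a dictionary from (x, y) pairs to probabilities; the function name `conditional_equals_marginal` and the tolerance are choices made for this example.

```python
def conditional_equals_marginal(joint, tol=1e-9):
    """Check P(Y|X) = P(Y) at every (x, y) where P(Y = y | X = x) is defined.

    `joint` maps (x, y) pairs to P(X = x, Y = y).
    """
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    p_x = {x: sum(joint.get((x, y), 0.0) for y in ys) for x in xs}
    p_y = {y: sum(joint.get((x, y), 0.0) for x in xs) for y in ys}
    for x in xs:
        if p_x[x] == 0:
            # P(Y | X = x) is undefined, so the equation says nothing here.
            continue
        for y in ys:
            if abs(joint.get((x, y), 0.0) / p_x[x] - p_y[y]) > tol:
                return False
    return True
```

With binary X and Y this loop checks exactly the four equations listed above, provided P(X = 0) and P(X = 1) are both nonzero.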

We sometimes use a special summation symbol, →∑ (a summation sign with an arrow above it), which has the following properties:

(i) when sets of random variables are written beneath the special summation symbol, it is

understood that the summation is to be taken over sets of values of the random variables,

not the random variables themselves,

(ii) if a conditional probability distribution appears in the scope of such a summation

symbol, the summation is to be taken only over values of the random variables for which

the conditional probability distributions are defined,

(iii) if there are no values of the random variables under the special summation symbol

for which the conditional probability distributions in the scope of the symbol are

defined, then the summation is equal to zero.
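The three properties above can be sketched in code for the simple case of summing P(X | Y = y, Z = z) over the values of X. This is an illustration under stated assumptions: the joint distribution is a dictionary from (x, y, z) triples to probabilities, and the function name `special_sum_over_x` is hypothetical.

```python
def special_sum_over_x(joint, y, z):
    """Special summation of P(X | Y = y, Z = z) over the values of X.

    `joint` maps (x, y, z) triples to P(X = x, Y = y, Z = z).
    The conditional probabilities are defined only if P(Y = y, Z = z) != 0.
    """
    p_yz = sum(p for (_, yv, zv), p in joint.items() if (yv, zv) == (y, z))
    if p_yz == 0:
        # Property (iii): no value of X gives a defined term, so the sum is 0.
        return 0.0
    # Properties (i) and (ii): sum over values of X, using only defined terms.
    x_values = {xv for (xv, _, _) in joint}
    return sum(joint.get((x, y, z), 0.0) / p_yz for x in x_values)
```

Because the conditioning values are fixed in this case, either every term is defined or none is. With several conditioning configurations under the summation sign, the same test would be applied term by term.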

For example, suppose that X, Y, and Z can each take on the values 0 or 1. Then if P(Y = 0, Z = 0) ≠ 0,

→∑_X P(X | Y = 0, Z = 0) = P(X = 0 | Y = 0, Z = 0) + P(X = 1 | Y = 0, Z = 0)

However, if P(Y = 0, Z = 0) = 0, then P(X = 0 | Y = 0, Z = 0) and P(X = 1 | Y = 0, Z = 0) are not defined, so

→∑_X P(X | Y = 0, Z = 0) = 0

We will adopt the following conventions for empty sets of variables. If Y = ∅ then

(i) P(X|Y) means P(X).

(ii) ρ_XZ.Y means ρ_XZ.

(iii) A ⫫ B|Y means A ⫫ B.

(iv) A ⫫ Y is always true.

1 Introduction and Advertisement

1.1 The Issue

Adult judgments about which event is “the cause” of another event are loaded with

topicality, interest, background knowledge about normal cases, and moral implications.

Tell someone that Suzie was injured in an accident while John was driving her home, and

then ask what further information is needed to decide whether John’s actions caused her

injury. People want to know John’s condition, the detailed circumstances of the accident,

including the condition of the roadway, of John’s car, of the other driver if there was one,

and so on (Ahn 1995, 1996). The responses show that in such contexts judgments about

causation have a moral aspect, and an aspect that depends on an understanding of normal

conditions and deviations from the normal. That sort of thing will vary with culture,

background, and circumstance.

Causal claims have a subjunctive complexity—they are associated with claims about

what did not happen, or has not happened yet, or what would have happened if some

circumstance had been otherwise. If someone says their hair is brown because they dye it,

we infer that if they had not dyed their hair it would have been some other color. That

sort of counterfactual conditional is not always correct (someone with brown hair can dye

their hair brown), and endless but indispensable complexities result. Our moral sense, our

very notions of blame and regret, depend on subjunctive aspects of causal claims. In

addition, the kinds of entities that are described as causes and effects are enormously

varied, and the logical form of causal claims can vary from particular to general to

universal. Events are causes—the rise of the middle class caused the American

Revolution; Constantine’s conversion caused the triumph of Christianity in the Roman

Empire; the discovery of penicillin saved millions of lives. Features or properties, or their

changes, are often cited as causes—the pH of the liquid caused it to turn pink when

phenolphthalein was added; the heat caused the butter to melt. Objects or persons are cited

as causes—my daughter gave me a cold. Even relationships, or instances of them, can be

described as causes—her love for him caused her to leave the country. Descriptions of

effects can be equally varied. The salient effect of a preventive cause, for example, can be

a circumstance or event that doesn’t exist—she prevented the catastrophe.

The variation—the looseness—of causal claims has provided a reason for many

people to dismiss the very idea of causation as prescientific. Bertrand Russell claimed as

much, and Karl Pearson proposed to replace the idea of causation entirely by the idea of

correlation. To this day, some writers try to avoid the issue by euphemism, as though a

new word would clarify things, and at almost any conference of statisticians or social

scientists (but not, anymore, of philosophers) there is someone—often not alone—who is

eager to say he doesn’t “believe in causality.” But he acts as if he does; we all do, all the

time: we ask people to do things, or do them ourselves, because we want what we think

will result from the actions (turn down the volume on the radio, it’s too loud) and we

blame people for the unhappy effects of their actions. The skeptic about causality pushes

the brake pedal to make his car slow, flips a switch to make a lamp glow, puts his money

in the bank to collect interest.

Francis Bacon claimed knowledge is power, and he was talking about the power of

control supplied by causal knowledge. One of the greatest mysteries of the human

condition is that in a few short years a newborn infant comes to control much of her

environment, knows how to climb up to things, how to turn on the television with the

remote control, how to make a balloon expand and a soap bubble form, how to summon

others and how to avoid them. Developmental psychology has hardly begun to crack how

all that causal knowledge, all that power to control, is acquired so quickly. And a great

deal—perhaps most—of our scientific inquiry aims to find out something about causal

relationships. Billions are spent each year to discover the effects of drug treatments alone,

and similar sums to estimate the likely results of possible social and economic policies.

Those who claim not to believe in causality may provide consulting services to clients

who want to predict the effects of alternative business strategies, or who want to know

how to judge the bearing of some body of data on a causal hypothesis. Loose as the

notion may be, there is nothing serious to the claim that we can live and thrive without

using the idea of causation, however we name it.

The baby and the scientist occupy two ends of the same question: how can

observations be turned into causal knowledge, and how can causal knowledge, even if

incomplete, be used to influence and control our environment? The theory of

experimental design offers a route to causal knowledge, but while Fisher’s discussions of

the probabilistic and statistical aspects of experimental design were brilliant and rigorous,

everything causal was left informal. Fisher did not provide as rigorous a theory regarding

causal inference from non-experimental observations. Yet most of what we want to know

about, and most of what we think we know, is not amenable to randomized clinical trials.

The question of prediction has been equally unsettled—the question is: if you know

some causal relations, and you know some of the probability relations among some of the

related variables, can you predict what will result if you intervene and alter the value of

one or more of the variables? In many causal systems the probability of an event Y given

an intervention to bring about an event X is different from the conditional probability of Y

on X. In recent years some philosophers, economists, computer scientists, and

statisticians have realized the importance to many different kinds of problems of the

difference between predicting by conditioning and predicting by intervening.

Philosophers have used the difference between conditioning and intervening to argue that

the principle of maximum expected utility is not always rational. Whether they are right

or wrong about that (and we think wrong: see Meek and Glymour 1994), the essential

thing is to provide a general means of determining when the probability distribution of

one variable can be calculated from an intervention that forces a probability distribution

on the values of another variable, given partial causal and probabilistic knowledge of the

undisturbed system, and, when the probability of an effect can be calculated, to provide a

means of calculating it.
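
The difference between predicting by conditioning and predicting by intervening can be seen in a toy calculation. The sketch below rests on assumptions of our own: a hypothetical system in which a variable Z influences both X and Y while X has no influence on Y, made-up probabilities, and the supposition that forcing a value on X replaces the dependence of X on Z while leaving the distribution of Z and the dependence of Y on Z undisturbed. The general theory is developed in later chapters; this is only an illustration.

```python
from fractions import Fraction as F

# Hypothetical system: Z -> X and Z -> Y, with no edge from X to Y.
p_z = {0: F(1, 2), 1: F(1, 2)}
p_x_given_z = {0: {0: F(9, 10), 1: F(1, 10)},   # p_x_given_z[z][x]
               1: {0: F(1, 10), 1: F(9, 10)}}
p_y_given_z = {0: {0: F(4, 5), 1: F(1, 5)},     # p_y_given_z[z][y]
               1: {0: F(1, 5), 1: F(4, 5)}}

def p_y1_conditioning(x):
    """P(Y=1 | X=x) in the undisturbed system."""
    num = sum(p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][1] for z in (0, 1))
    den = sum(p_z[z] * p_x_given_z[z][x] for z in (0, 1))
    return num / den

def p_y1_intervening(x):
    """P(Y=1) after X is forced to x (assumed: Z and P(Y|Z) are unchanged)."""
    # X no longer depends on Z, and Y never depended on X, so x drops out.
    return sum(p_z[z] * p_y_given_z[z][1] for z in (0, 1))

print(p_y1_conditioning(1))  # 37/50
print(p_y1_intervening(1))   # 1/2
```

Observing X = 1 makes Z = 1 more probable and so raises the probability of Y = 1, whereas forcing X = 1 tells us nothing about Z in this hypothetical system.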

So we have three problems: first, the problem of clarifying the very idea of a causal

system with sufficient precision for mathematical analysis and sufficient generality to

capture a wide range of scientific practices; second, the problem of understanding the

possibilities and limitations for discovering such causal structures from various kinds of

data; and third, the problem of characterizing the probabilities predicted by a causal

hypothesis given an intervention directly to force a value, or distribution of values, on

one or more variables. This book attempts answers to all three of these questions.

Our answer to the problem of regimenting causal hypotheses uses a formalism

developed by Terry Speed and his students and subsequently elaborated by Judea Pearl

and his students, and gives it a causal interpretation previously suggested (Kiiveri and

Speed 1982). Our answer to the problem of discovery turns on algorithms developed

from the mathematics of the representation; that answer is supplemented in chapter 12 by

a discussion of Bayesian algorithms lent us by David Heckerman, Christopher Meek, and

Gregory Cooper, and by a discussion (based on joint work with Wasserman and Robins) of the convergence properties of any possible non-experimental discovery procedure. The assumptions of the theory of manipulation developed here have

anticipations in the econometric literature (Strotz and Wold 1960), but by putting these

assumptions in a graphical framework we are able to prove some novel theorems that

follow from the assumptions.

One approach to clarifying the notion of causation—the philosophers’ approach ever

since Plato—is to try to define “causation” in other terms, to provide necessary and

sufficient and noncircular conditions for one thing, or feature or event or circumstance, to

cause another, the way one can define “bachelor” as “unmarried adult male human.”

Another approach to the same problem—the mathematician’s approach ever since

Euclid—is to provide axioms that use the notion of causation without defining it, and to

investigate the necessary consequences of those assumptions. We have few fruitful

examples of the first sort of clarification, but many of the second: Euclid’s geometry,

Newton’s physics, Frege’s logic and Hilbert’s, Kolmogorov’s probability. Some

axiomatic theories—Newton’s, for example—offer a substantive theory of nature, while

others—Frege’s and Hilbert’s logics and Kolmogorov’s probability—are a

systematization and abstraction from practice and intuition—while still others—Euclid’s

Elements—are something of both. While we do not claim the success of these examples,

they are the models of this book.

We use a formalism—directed graphical models—that is not in the least original with

us; we claim some originality in explicitly stating the causal assumptions implicit in the

causal interpretation of the graphs, and in extending the application of graphs to solving

certain problems about manipulations. The representation invokes two ideas about

causation that are fundamental and ancient. The first idea, which can be traced back at

least to Bernoulli, is that the absence of causal relations is marked by independence in

probability—in Bernoulli’s examples, if the outcome of one trial has no influence on the

outcome of another trial, then the probability of both outcomes equals the product of each

outcome separately. The second idea, Bacon’s again, is that probability is associated with

control: if variation of one feature, X, causes variation of another feature Y, then Y can be

changed by an appropriate intervention that alters X. It turns out that the representation

captures what is common to a wide variety of statistical models of causal relations—for

example: regression models, logistic regression models, structural equation models, latent

factor models, and many models of categorical data—and captures how these models

may be used in prediction and control. The general assumptions are given in chapter 3.

These axioms have implications for scientific discovery by experimental and non-

experimental means. Our investigations require characterizing when two or more alternative

causal theories are, in various technical senses, indistinguishable by data, and

characterizing when and which causal features are shared by all models indistinguishable

from any particular model. These characterizations are given in chapter 4, and more

recent work on equivalence is described in chapter 12. In chapters 5, 6, 10, and 11, and

again in chapter 12, we describe algorithms for discovering causal structure from sample

data. We evaluate the algorithms in terms of several different features. (1) Are they

computationally feasible on realistic problems? (2) Are they reliable—do they in some

sense converge to a description of features common to all models indistinguishable from

the true model? (3) Are they as informative as possible for the features of the data they

use? Our algorithmic results concern the discovery of causal structure in linear and

nonlinear systems, systems with and without feedback, cases where there may, or may

not, be unrecorded common causes of recorded variables, and cases in which membership

in the observed sample is influenced by the variables under study. We investigate the

reliabilities of the algorithms on simulated data, and illustrate their application with

published data sets. In chapter 8, reliable procedures for causal inference are compared

with regression in theory, on simulated data, and on real data sets. In chapter 9 we

compare causal inference from experimental and non-experimental data. Besides reviews

of work on equivalence, prediction and search algorithms, chapter 12 includes a

consideration of the senses in which it is, and is not, possible to have a procedure that

reliably converges to the truth about causal relations as they are represented here, and a

discussion of procedures for learning feedback models, so far as such models can be

represented by directed cyclic graphs, and a brief discussion of the combination of search

methods with the Gibbs sampler and related procedures for estimation of posterior

probabilities.

The discovery of causal relations is only half of the story. The other half concerns the

use in prediction of causal knowledge, even partial and incomplete causal knowledge.

The fundamentals of the theory of prediction are given in chapter 3, and their

consequences are developed in detail in chapter 7. The theory of prediction has a number

of limitations, some of which are considered in chapter 12. The final chapter, 13,

provides detailed proofs of the theorems in the main text.

2 Formal Preliminaries

This chapter introduces some mathematical concepts used throughout the book. The

chapter is meant to provide mathematically explicit definitions of the formal apparatus

we use. It may be skipped in a first reading and referred to as needed, although the reader

should be warned that for good reason we occasionally use nonstandard definitions of

standard notions in graph theory. We assume the reader has some background in finite

mathematics and statistics, including correlation analysis, but otherwise this chapter

contains all of the mathematical concepts needed in this book. Some of the same

mathematical objects defined here are given special interpretations in the next chapter,

but here we treat everything entirely formally.

We consider a number of different kinds of graphs: directed graphs, undirected graphs,

inducing path graphs, partially oriented inducing path graphs, and patterns. These

different kinds of objects all contain a set of vertices and a set of edges. They differ in the

kinds of edges they contain. Despite these differences, many graphical concepts such as

undirected path, directed path, parent, etc., can be defined uniformly for all of these

different kinds of objects. In order to provide this uniformity for the objects we need in

our work, we modify the customary definitions in the theory of graphs.

2.1 Graphs

The undirected graph shown in figure 2.1 contains only undirected edges (e.g., A - B).

Figure 2.1

A directed graph, shown in figure 2.2, contains only directed edges (e.g., A → B).

Figure 2.2

An inducing path graph, shown in figure 2.3, contains both directed edges (e.g., A →

B) and bi-directed edges (e.g., B ↔ C). (Inducing path graphs and their uses are

explained in detail in chapter 6.)

Figure 2.3

A partially oriented inducing path graph, shown in figure 2.4, contains directed edges (e.g., B → F), bi-directed edges (e.g., B ↔ C), nondirected edges (e.g., E o-o D), and partially directed edges (e.g., A o→ B). (Partially oriented inducing path graphs and their uses are explained in detail in chapter 6.)

Figure 2.4

A pattern, shown in figure 2.5, contains undirected edges (e.g., A – B) and directed edges

(e.g., A → E). (Patterns and their uses are explained in detail in chapter 5.)

Figure 2.5

In the usual graph theoretic definition, a graph is an ordered pair <V,E> where V is a

set of vertices, and E is a set of edges. The members of E are pairs of vertices (an ordered

pair in a directed graph and an unordered pair in an undirected graph). For example, the

edge A → B is represented by the ordered pair <A,B>. In directed graphs the ordering of

the pair of vertices representing an edge in effect marks an arrowhead at one end of the

edge. For our purposes we need to represent a larger variety of marks attached to the ends

of undirected edges. In general, we allow that the end of an edge can be unmarked, or can

be marked with an arrowhead, or can be marked with an “o.”

In order to specify completely the type of an edge, therefore, we need to specify the

variables and marks at each end. For example, the left end of “A o→ B” can be

represented as the ordered pair [A, o],1 and the right end can be represented as the ordered

pair [B, >]. The first member of the ordered pair is called an endpoint of an edge; for example, in [A, o] the endpoint is A. The entire edge is a set of ordered pairs representing the endpoints, for example, {[A, o], [B, >]}. The edge {[B, >],[A, o]} is the same as {[A, o],[B, >]} since it doesn’t matter which end of the edge is listed first.

Note that a directed edge such as A → B has no mark at the A endpoint; we consider

the mark at the A endpoint to be empty, but when we write out the ordered pair we will

use the notation EM to stand for the empty mark, for example, [A,EM].

More formally, we say a graph is an ordered triple <V,M,E> where V is a non-empty

set of vertices, M is a non-empty set of marks, and E is a set of sets of ordered pairs of

the form {[V1,M1],[V2,M2]}, where V1 and V2 are in V, V1 ≠ V2, and M1 and M2 are in M.

Except in our discussion of systems with feedback we will always assume that in any

graph, any pair of vertices V1 and V2 occur in at most one set in E, or, in other words, that

there is at most one edge between any two vertices. If G = <V,M,E> we say that G

is over V.

For example, the directed graph of figure 2.2 can be represented as <{A,B,C,D,E},

{EM, >}, {{[A,EM],[B, >]}, {[A,EM],[E, >]}, {[A,EM],[D, >]}, {[D,EM],[B, >]},

{[D,EM],[C, >]}, {[B,EM],[C, >]}, {[E,EM],[C, >]}}>.

Each member {[V1, M1],[V2,M2]} of E is called an edge (e.g., {[A,EM],[B, >]} in

figure 2.2.) Each ordered pair [V1, M1] in an edge is called an edge-end (e.g., [A,EM] is

an edge-end of {[A,EM],[B, >]}.) Each vertex V1 in an edge {[V1, M1],[V2, M2]} is called

an endpoint of the edge (e.g., A is an endpoint of {[A,EM],[B, >]}.) V1 and V2 are

adjacent in G if and only if there is an edge in E with endpoints V1 and V2 (e.g., in figure

2.2, A and B are adjacent, but A and C are not.)
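
The representation <V,M,E> translates directly into a data structure. The following sketch encodes the graph of figure 2.2 exactly as listed above; the Python names (`edge`, `directed_edge`, `endpoints`, `adjacent`) are our own choices for illustration.

```python
EM, ARROWHEAD, CIRCLE = "EM", ">", "o"   # the three kinds of end marks

def edge(v1, m1, v2, m2):
    """An edge is an unordered set of two edge-ends [vertex, mark]."""
    if v1 == v2:
        raise ValueError("the two endpoints of an edge must be distinct")
    return frozenset({(v1, m1), (v2, m2)})

def directed_edge(a, b):
    """The edge {[a,EM],[b,>]}, written a -> b."""
    return edge(a, EM, b, ARROWHEAD)

# The directed graph of figure 2.2 as the triple <V,M,E>.
V = frozenset("ABCDE")
M = frozenset({EM, ARROWHEAD})
E = frozenset(directed_edge(a, b)
              for a, b in ["AB", "AE", "AD", "DB", "DC", "BC", "EC"])
FIGURE_2_2 = (V, M, E)

def endpoints(e):
    """The vertices occurring in the edge-ends of e."""
    return {v for v, _ in e}

def adjacent(graph, v1, v2):
    """True if some edge of the graph has endpoints v1 and v2."""
    _, _, edges = graph
    return any(endpoints(e) == {v1, v2} for e in edges)

# Listing the edge-ends in either order yields the same edge.
print(edge("A", CIRCLE, "B", ARROWHEAD) == edge("B", ARROWHEAD, "A", CIRCLE))  # True
print(adjacent(FIGURE_2_2, "A", "B"))  # True
print(adjacent(FIGURE_2_2, "A", "C"))  # False
```

Using a `frozenset` for each edge mirrors the remark that it does not matter which end of the edge is listed first.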

An undirected graph is a graph in which the set of marks M = {EM}. A directed

graph is a graph in which the set of marks M = {EM, >} and for each edge in E, one

edge-end has mark EM and the other edge-end has mark “>.”

An edge {[A,EM],[B, >]} is a directed edge from A to B. (Note that in an undirected

graph there are no directed edges.) An edge {[A,M1],[B, >]} is into B. An edge

{[A,EM],[B,M2]} is out of A. If there is a directed edge from A to B then A is a parent of

B and B is a child (or daughter) of A. We denote the set of all parents of vertices in V as

Parents(V) and the set of all children of vertices in V as Children(V). The indegree of a

vertex V is equal to the number of its parents; the outdegree is equal to the number of its

children; and the degree is equal to the number of vertices adjacent to V. (In a directed

graph, the degree of a vertex is equal to the sum of its indegree and outdegree.) In figure

2.2, the parents of B are A and D, and the child of B is C. Hence, B is of indegree 2,

outdegree 1, and degree 3.
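
These definitions can be checked mechanically against figure 2.2. The sketch below uses the edge list given earlier in this section; the function names and the set-of-pairs encoding are illustrative assumptions.

```python
EM, ARROWHEAD = "EM", ">"

def directed_edge(a, b):
    """The edge {[a,EM],[b,>]}: a directed edge from a to b."""
    return frozenset({(a, EM), (b, ARROWHEAD)})

# Edges of figure 2.2 as listed in the text.
EDGES = {directed_edge(a, b)
         for a, b in ["AB", "AE", "AD", "DB", "DC", "BC", "EC"]}

def parents(edges, v):
    # a is a parent of v when the edge has mark EM at a and ">" at v
    return {a for e in edges for a, mark in e
            if mark == EM and dict(e).get(v) == ARROWHEAD}

def children(edges, v):
    # b is a child of v when the edge has mark EM at v and ">" at b
    return {b for e in edges for b, mark in e
            if mark == ARROWHEAD and dict(e).get(v) == EM}

def degree(edges, v):
    # number of vertices adjacent to v
    return len({u for e in edges for u, _ in e if v in dict(e) and u != v})

print(sorted(parents(EDGES, "B")))   # ['A', 'D']
print(sorted(children(EDGES, "B")))  # ['C']
print(len(parents(EDGES, "B")), len(children(EDGES, "B")), degree(EDGES, "B"))  # 2 1 3
```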

We will treat an undirected path in a graph as a sequence of vertices that are adjacent

in the graph. In other words for every pair X, Y adjacent on the path, there is an edge

{[X,M1],[Y,M2]} in the graph. For example, in figure 2.2, the sequence <A,B,C,D> is an

undirected path because each pair of variables adjacent in the sequence (A and B, B and

C, and C and D) have corresponding edges in the graph. The set of edges in a path

consists of those edges whose endpoints are adjacent in the sequence. In figure 2.2 the

edges in path <A,B,C,D> are {[A,EM],[B, >]}, {[B,EM],[C, >]}, and {[C, >],[D,EM]}.

More formally, an undirected path between A and B in a graph G is a sequence of

vertices beginning with A and ending with B such that for every pair of vertices X and Y

that are adjacent in the sequence there is an edge {[X,M1],[Y,M2]} in G. An edge

{[X,M1],[Y,M2]} is in path U if and only if X and Y are adjacent to each other (in either

order) in U. If an edge between X and Y is in path U we also say that X and Y are

adjacent on U. If the edge containing X in an undirected path between X and Y is out of X

then we say that the path is out of X; similarly, if the edge containing X in a path

between X and Y is into X then we say that the path is into X. In order to simplify proofs

we call a sequence that consists of a single vertex an empty path. A path that contains no

vertex more than once is acyclic; otherwise it is cyclic. Two paths intersect iff they have

a vertex in common; any such common vertex is a point of intersection. If path U is

<U1, . . . ,Un> and path V is <Un,V1, . . . ,Vm>, then the concatenation of U and V is <U1, . . . ,Un,V1, . . . ,Vm>, denoted by U&V. The concatenation of U with an empty path is U, and the concatenation of an empty path with U is U. Ordinarily when we use the term “path” we will mean acyclic path; in referring to a cyclic path we will always use the

adjective.

A directed path from A to B in a graph G is a sequence of vertices beginning with A

and ending with B such that for every pair of vertices X, Y, adjacent in the sequence and

occurring in the sequence in that order, there is an edge {[X,EM],[Y, >]} in G. A is the

source and B the sink of the path. For example, in figure 2.2 <A,B,C> is a directed path

with source A and sink C. In contrast, in figure 2.2 <A,B,D> is an undirected path, but not

a directed path because B and D occur in the sequence in that order, but the edge

{[B,EM],[D, >]} is not in G (although {[D,EM],[B, >]} is in G.) Directed paths are

therefore special cases of undirected paths. For a directed edge e from U to V (U → V),

head(e) = V and tail(e) = U. A directed acyclic graph is a directed graph that contains

no directed cyclic paths.
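
The distinction between undirected and directed paths can be tested on the examples just given. The following is a sketch using the edges of figure 2.2; the helper names are our own, and sequences of vertices are written as strings for brevity.

```python
EM, ARROWHEAD = "EM", ">"

def directed_edge(a, b):
    return frozenset({(a, EM), (b, ARROWHEAD)})

# Edges of figure 2.2 as listed in the text.
EDGES = {directed_edge(a, b)
         for a, b in ["AB", "AE", "AD", "DB", "DC", "BC", "EC"]}

def edge_between(edges, x, y):
    """Return the edge with endpoints x and y, or None if there is none."""
    for e in edges:
        if {v for v, _ in e} == {x, y}:
            return e
    return None

def is_undirected_path(edges, seq):
    # every pair adjacent in the sequence must be joined by some edge
    return all(edge_between(edges, x, y) is not None
               for x, y in zip(seq, seq[1:]))

def is_directed_path(edges, seq):
    # every pair X, Y in that order needs the edge {[X,EM],[Y,>]}
    return all(directed_edge(x, y) in edges for x, y in zip(seq, seq[1:]))

def is_acyclic_path(seq):
    # a path is acyclic when no vertex occurs more than once
    return len(set(seq)) == len(seq)

print(is_undirected_path(EDGES, "ABCD"))  # True
print(is_directed_path(EDGES, "ABC"))     # True
print(is_undirected_path(EDGES, "ABD"))   # True
print(is_directed_path(EDGES, "ABD"))     # False
print(is_directed_path(EDGES, "A"))       # True (the empty path)
```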

A semidirected path between A and B in a partially oriented inducing path graph is

an undirected path U from A to B in which no edge contains an arrowhead pointing

toward A (i.e., there is no arrowhead at A on U, and if X and Y are adjacent on the path,

and X is between A and Y on the path, then there is no arrowhead at the X end of the edge

between X and Y.) Of course every directed path is semidirected, but in graphs with “o”

end marks there may be semidirected paths that are not directed.
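
A semidirected path that is not directed can be exhibited with two of the edge types mentioned above for figure 2.4. The two-edge graph below is a hypothetical fragment chosen for illustration, not a reproduction of that figure, and the function names are our own.

```python
EM, ARROWHEAD, CIRCLE = "EM", ">", "o"

def edge(v1, m1, v2, m2):
    return frozenset({(v1, m1), (v2, m2)})

# Hypothetical fragment: A o-> B and B -> F.
EDGES = {edge("A", CIRCLE, "B", ARROWHEAD), edge("B", EM, "F", ARROWHEAD)}

def edge_between(edges, x, y):
    for e in edges:
        if {v for v, _ in e} == {x, y}:
            return e
    return None

def is_semidirected_path(edges, seq):
    """No edge on the path may carry an arrowhead at its end nearer the start."""
    for x, y in zip(seq, seq[1:]):
        e = edge_between(edges, x, y)
        if e is None or dict(e)[x] == ARROWHEAD:
            return False
    return True

def is_directed_path(edges, seq):
    return all(edge(x, EM, y, ARROWHEAD) in edges for x, y in zip(seq, seq[1:]))

print(is_semidirected_path(EDGES, "ABF"))  # True
print(is_directed_path(EDGES, "ABF"))      # False: the A end carries an "o"
print(is_semidirected_path(EDGES, "FBA"))  # False: arrowhead at F points back
```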

A graph is complete if every pair of its vertices are adjacent. Figure 2.6 illustrates a

complete undirected graph.

Figure 2.6

A graph is connected if there is an undirected path between any two vertices. Figures

2.1–2.6 are connected, but figure 2.7 is not.

Figure 2.7

A subgraph of <V,M,E> is any graph <V′,M′,E′> such that V′ is included in V, M′ is included in M, and E′ is included in E. Figure 2.7 is a subgraph of figure 2.2. The subgraph of <V,M,E> over V′, where V′ is included in V, is the subgraph <V′,M,E′> in which an edge is in E′ if and only if it is in E and has both endpoints in V′.

A clique in graph G is any subgraph of G that is complete. In figure 2.1, for example, the subgraph G′ =

<{A,B,D},{EM},{{[A,EM],[B,EM]},{[B,EM],[D,EM]},{[A,EM],[D,EM]}}>

is a clique with vertices A, B and D. A clique in G whose vertex set is not properly contained in any other clique in G is maximal. In figure 2.1, both G′ and G″ = <{A,B},{EM},{{[A,EM],[B,EM]}}> are cliques, but G″, unlike G′, is not maximal because G″ is properly contained in G′.2

A triangle in a graph G is a complete subgraph of G with three vertices; in other

words, vertices X, Y and Z form a triangle if and only if X and Y are adjacent, Y and Z are

adjacent and X and Z are adjacent. In graph G a vertex V is a collider on undirected

path U if and only if there are two distinct edges on U containing V as an endpoint and

both are into V. Otherwise V is a noncollider on U. In graph G, vertex V is an

unshielded collider on U if V is a collider on U, V is adjacent to distinct vertices V1 and

V2 on U, and V1 and V2 are not adjacent in G. An ancestor of a vertex V is any vertex W

such that there is a directed path from W to V. A descendant of a vertex V is any vertex

W such that there is a directed path from V to W. In figure 2.2, A, B, C, D, and E are all

ancestors of C, although neither A nor C is a parent of C. Similarly, C is a descendant of

A, B, C, D, and E, although it is not a child of A or C. Since every vertex V is the source

of a directed (empty) path from V to V, each vertex is its own descendant and its own

ancestor, but not of course its own parent or its own child.
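
Colliders, ancestors, and descendants can likewise be computed from the edge list of figure 2.2. This is a sketch with assumed helper names; the collider tests presuppose that the sequence passed in is an acyclic undirected path of the graph.

```python
EM, ARROWHEAD = "EM", ">"
VERTICES = "ABCDE"

def directed_edge(a, b):
    return frozenset({(a, EM), (b, ARROWHEAD)})

# Edges of figure 2.2 as listed in the text.
EDGES = {directed_edge(a, b)
         for a, b in ["AB", "AE", "AD", "DB", "DC", "BC", "EC"]}

def edge_between(edges, x, y):
    for e in edges:
        if {v for v, _ in e} == {x, y}:
            return e
    return None

def is_collider(edges, path, i):
    """path[i] is a collider when both path edges containing it are into it."""
    if i == 0 or i == len(path) - 1:
        return False  # an end vertex lies on only one edge of the path
    v = path[i]
    left = edge_between(edges, path[i - 1], v)
    right = edge_between(edges, v, path[i + 1])
    return dict(left)[v] == ARROWHEAD and dict(right)[v] == ARROWHEAD

def is_unshielded_collider(edges, path, i):
    return (is_collider(edges, path, i)
            and edge_between(edges, path[i - 1], path[i + 1]) is None)

def reachable(edges, v, backwards):
    """Ancestors (backwards=True) or descendants of v, including v itself."""
    found, frontier = {v}, [v]   # the empty path makes v its own ancestor
    while frontier:
        w = frontier.pop()
        for u in VERTICES:
            e = directed_edge(u, w) if backwards else directed_edge(w, u)
            if e in edges and u not in found:
                found.add(u)
                frontier.append(u)
    return found

print(sorted(reachable(EDGES, "C", backwards=True)))   # ['A', 'B', 'C', 'D', 'E']
print(sorted(reachable(EDGES, "A", backwards=False)))  # ['A', 'B', 'C', 'D', 'E']
print(is_collider(EDGES, "ABCD", 2))                   # True: B -> C <- D
print(is_unshielded_collider(EDGES, "BCD", 1))         # False: B and D are adjacent
print(is_unshielded_collider(EDGES, "BCE", 1))         # True: B and E are not adjacent
```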

2.2 Probability

The vertices of the graphs we consider will always be random variables taking values in

one of the following: a copy of the real line; a copy of the nonnegative reals; an interval

of integers.

By a joint distribution on the vertices of a graph we mean a countably additive

probability measure on the Cartesian product of these objects. We say that two random

variables, X, Y are independent when the joint density of (X,Y) is the product of the

density of X and the density of Y for all values of X and Y. We write this as X ⫫ Y. We

generalize in the obvious way when asserting that one set of variables is independent of

another set of variables. When we say a set of random variables is jointly independent

we mean that any two disjoint subsets of the set are independent of one another. We say

that random variables X, Y are independent conditional on Z (or given Z), when the

density of X, Y given Z equals the product of the density of X given Z and the density of

Y given Z, for all values of X, Y, and for all values z of Z for which the density of z is not

equal to 0. We generalize in the obvious way for sets of random variables, X, Y, Z. If X
