Information Theory and the Brain

Abstract

From the Publisher: This book deals with a new and expanding area of neuroscience that provides a framework for understanding neuronal processing. This framework is derived from a conference held in Newquay, UK, where a group of scientists from around the world met to discuss the topic. This book begins with an introduction to the basic concepts of information theory and then illustrates these concepts with examples from research over the past forty years. Throughout the book, the contributors highlight current research from the areas of biological networks, information theory and artificial networks, information theory and psychology, and formal analysis. Each section includes an introduction and glossary covering basic concepts. This book will appeal to graduate students and researchers in neuroscience as well as computer scientists and cognitive scientists.
INFORMATION THEORY
AND THE BRAIN
Edited by
ROLAND BADDELEY
University of Sussex
PETER HANCOCK
University of Stirling
PETER FÖLDIÁK
University of St. Andrews
PUBLISHED BY THE PRESS SYNDICATE OF THE UNIVERSITY OF CAMBRIDGE
The Pitt Building, Trumpington Street, Cambridge, United Kingdom
CAMBRIDGE UNIVERSITY PRESS
The Edinburgh Building, Cambridge CB2 2RU, UK www.cup.cam.ac.uk
40 West 20th Street, New York, NY 10011-4211, USA www.cup.org
10 Stamford Road, Oakleigh, Melbourne 3166, Australia
Ruiz de Alarcón 13, 28014 Madrid, Spain
© Cambridge University Press 1999
This book is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without
the written permission of Cambridge University Press.
First published 1999
Printed in the United States of America
Typeface Times Roman 10.25/12.5 pt. System 3B2 [KW]
A catalogue record for this book is available from
the British Library
Library of Congress Cataloguing in Publication Data
Information theory and the brain / edited by Roland Baddeley, Peter
Hancock, Peter Földiák.
p. cm.
1. Neural networks (Neurobiology) 2. Neural networks (Computer
science). 3. Information theory in biology. I. Baddeley, Roland,
1965- . II. Hancock, Peter J. B., 1958- . III. Földiák, Peter, 1963- .
QP363.3.I54 1999 98-32172
612.8'2 dc21 CIP
ISBN 0 521 63197 1 hardback
Contents
List of Contributors page xi
Preface xiii
1 Introductory Information Theory and the Brain 1
ROLAND BADDELEY
1.1 Introduction 1
1.2 What Is Information Theory? 1
1.3 Why Is This Interesting? 4
1.4 Practical Use of Information Theory 5
1.5 Maximising Information Transmission 13
1.6 Potential Problems and Pitfalls 17
1.7 Conclusion 19
Part One: Biological Networks 21
2 Problems and Solutions in Early Visual Processing 25
BRIAN G. BURTON
2.1 Introduction 25
2.2 Adaptations of the Insect Retina 26
2.3 The Nature of the Retinal Image 30
2.4 Theories for the RFs of Retinal Cells 31
2.5 The Importance of Phase and the Argument for Sparse,
Distributed Coding 36
2.6 Discussion 38
3 Coding Efficiency and the Metabolic Cost of Sensory and
Neural Information 41
SIMON B. LAUGHLIN, JOHN C. ANDERSON, DAVID O'CARROLL
AND ROB DE RUYTER VAN STEVENINCK
3.1 Introduction 41
3.2 Why Code Efficiently? 42
3.3 Estimating the Metabolic Cost of Transmitting Information 45
3.4 Transmission Rates and Bit Costs in Different Neural
Components of the Blowfly Retina 48
3.5 The Energetic Cost of Neural Information is Substantial 49
3.6 The Costs of Synaptic Transfer 50
3.7 Bit Costs Scale with Channel Capacity – Single Synapses
Are Cheaper 52
3.8 Graded Potentials Versus Action Potentials 53
3.9 Costs, Molecular Mechanisms, Cellular Systems and
Neural Codes 54
3.10 Investment in Coding Scales with Utility 57
3.11 Phototransduction and the Cost of Seeing 58
3.12 Investment in Vision 59
3.13 Energetics – a Unifying Principle? 60
4 Coding Third-Order Image Structure 62
MITCHELL THOMPSON
4.1 Introduction 62
4.2 Higher-Order Statistics 64
4.3 Data Acquisition 65
4.4 Computing the SCF and Power Spectrum 66
4.5 Computing the TCF and Bispectrum 68
4.6 Spectral Measures and Moments 70
4.7 Channels and Correlations 72
4.8 Conclusions 77
Part Two: Information Theory and Artificial Networks 79
5 Experiments with Low-Entropy Neural Networks 84
GEORGE HARPUR AND RICHARD PRAGER
5.1 Introduction 84
5.2 Entropy in an Information-Processing System 84
5.3 An Unsupervised Neural Network Architecture 86
5.4 Constraints 88
5.5 Linear ICA 93
5.6 Image Coding 95
5.7 Speech Coding 97
5.8 Conclusions 100
6 The Emergence of Dominance Stripes and Orientation Maps
in a Network of Firing Neurons 101
STEPHEN P. LUTTRELL
6.1 Introduction 101
6.2 Theory 102
6.3 Dominance Stripes and Orientation Maps 104
6.4 Simulations 109
6.5 Conclusions 118
Appendix 119
7 Dynamic Changes in Receptive Fields Induced by Cortical
Reorganization 122
GERMÁN MATO AND NÉSTOR PARGA
7.1 Introduction 122
7.2 The Model 124
7.3 Discussion of the Model 127
7.4 Results 130
7.5 Conclusions 137
8 Time to Learn About Objects 139
GUY WALLIS
8.1 Introduction 139
8.2 Neurophysiology 142
8.3 A Neural Network Model 149
8.4 Simulating Fractal Image Learning 153
8.5 Psychophysical Experiments 156
8.6 Discussion 162
9 Principles of Cortical Processing Applied to and Motivated by
Artificial Object Recognition 164
NORBERT KRÜGER, MICHAEL PÖTZSCH AND
GABRIELE PETERS
9.1 Introduction 164
9.2 Object Recognition with Banana Wavelets 166
9.3 Analogies to Visual Processing and Their Functional
Meaning 171
9.4 Conclusion and Outlook 178
10 Performance Measurement Based on Usable Information 180
MARTIN ELLIFFE
10.1 Introduction 181
10.2 Information Theory: Simplistic Application 186
10.3 Information Theory: Binning Strategies 187
10.4 Usable Information: Refinement 191
10.5 Result Comparison 194
10.6 Conclusion 198
Part Three: Information Theory and Psychology 201
11 Modelling Clarity Change in Spontaneous Speech 204
MATTHEW AYLETT
11.1 Introduction 204
11.2 Modelling Clarity Variation 206
11.3 The Model in Detail 207
11.4 Using the Model to Calculate Clarity 213
11.5 Evaluating the Model 215
11.6 Summary of Results 218
11.7 Discussion 220
12 Free Gifts from Connectionist Modelling 221
JOHN A. BULLINARIA
12.1 Introduction 221
12.2 Learning and Developmental Bursts 222
12.3 Regularity, Frequency and Consistency Effects 223
12.4 Modelling Reaction Times 227
12.5 Speed–Accuracy Trade-offs 231
12.6 Reaction Time Priming 232
12.7 Cohort and Left–Right Seriality Effects 234
12.8 Lesion Studies 235
12.9 Discussion and Conclusions 239
13 Information and Resource Allocation 241
JANNE SINKKONEN
13.1 Introduction 241
13.2 Law for Temporal Resource Allocation 242
13.3 Statistical Information and Its Relationships to Resource
Allocation 246
13.4 Utility and Resource Sharing 248
13.5 Biological Validity of the Resource Concept 248
13.6 An MMR Study 249
13.7 Discussion 251
Part Four: Formal Analysis 255
14 Quantitative Analysis of a Schaffer Collateral Model 257
SIMON SCHULTZ, STEFANO PANZERI, EDMUND ROLLS
AND ALESSANDRO TREVES
14.1 Introduction 257
14.2 A Model of the Schaffer Collaterals 259
14.3 Technical Comments 262
14.4 How Graded is Information Representation on the
Schaffer Collaterals? 264
14.5 Non-uniform Convergence 267
14.6 Discussion and Summary 268
Appendix A. Expression from the Replica Evaluation 270
Appendix B. Parameter Values 272
15 A Quantitative Model of Information Processing in CA1 273
CARLO FULVI MARI, STEFANO PANZERI, EDMUND
ROLLS AND ALESSANDRO TREVES
15.1 Introduction 273
15.2 Hippocampal Circuitry 274
15.3 The Model 276
15.4 Statistical–Informational Analysis 280
15.5 Results 281
15.6 Discussion 283
Appendix: Results of the Analytical Evaluation 283
16 Stochastic Resonance and Bursting in a Binary-Threshold
Neuron with Intrinsic Noise 290
PAUL C. BRESSLOFF AND PETER ROPER
16.1 Introduction 290
16.2 The One-Vesicle Model 293
16.3 Neuronal Dynamics 294
16.4 Periodic Modulation and Response 300
16.5 Conclusions 301
Appendix A: The Continuous-Time CK Equation 303
Appendix B: Derivation of the Critical Temperature 303
17 Information Density and Cortical Magnification Factors 305
M. D. PLUMBLEY
17.1 Introduction 305
17.2 Artificial Neural Feature Maps 306
17.3 Information Theory and Information Density 308
17.4 Properties of Information Density and Information
Distribution 309
17.5 Symmetrical Conditional Entropy 311
17.6 Example: Two Components 312
17.7 Alternative Measures 312
17.8 Continuous Domain 314
17.9 Continuous Example: Gaussian Random Function 314
17.10 Discussion 316
17.11 Conclusions 316
Bibliography 318
Index 341
1

Introductory Information Theory and the Brain

ROLAND BADDELEY
1.1 Introduction
Learning and using a new technique always takes time. Even if the question
initially seems very straightforward, inevitably technicalities rudely intrude.
Therefore before a researcher decides to use the methods information theory
provides, it is worth finding out if this set of tools is appropriate for the
task in hand.
In this chapter I will therefore provide only a few important formulae and
no rigorous mathematical proofs (Cover and Thomas (1991) is excellent in
this respect). Neither will I provide simple ``how to'' recipes (for the psychol-
ogist, even after nearly 40 years, Attneave (1959) is still a good introduction).
Instead, it is hoped to provide a non-mathematical introduction to the basic
concepts and, using examples from the literature, show the kind of questions
information theory can be used to address. If, after reading this and the
following chapters, the reader decides that the methods are inappropriate,
he will have saved time. If, on the other hand, the methods seem potentially
useful, it is hoped that this chapter provides a simplistic overview that will
alleviate the growing pains.
1.2 What Is Information Theory?
Information theory was invented by Claude Shannon and introduced in his
classic book The Mathematical Theory of Communication (Shannon and
Weaver, 1949). What then is information theory? To quote three previous
authors in historical order:
The ``amount of information'' is exactly the same concept that we talked about for
years under the name ``variance''. [Miller, 1956]
The technical meaning of ``information'' is not radically different from the everyday
meaning; it is merely more precise. [Attneave, 1959]
The mutual information $I(X;Y)$ is the relative entropy between the joint distribution
and the product distribution $p(x)p(y)$, i.e.,

$$I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}$$

[Cover and Thomas, 1991]
Information theory is about measuring things, in particular, how much
measuring one thing tells us about another thing that we did not know
before. The approach information theory makes to measuring information
is to first define a measure of how uncertain we are of the state of the world.
We then measure how much less uncertain we are of the state of the world after we
have made some measurement (e.g. observing the output of a neuron; asking
a question; listening to someone speak). The difference between our uncer-
tainty before and the uncertainty after making a measurement we then define
as the amount of information that measurement gives us. As can be seen, this
approach depends critically on our approach to measuring uncertainty, and
for this information theory uses entropy. To make our description more
concrete, the concepts of entropy, and later information, will be illustrated
using a rather artificial scenario: one person has randomly flipped to a page
of this book, and another has to use yes/no questions (I said it was artificial)
to work out some aspect of the page in question (for instance the page
number or the author of the chapter).
Entropy
The first important aspect to quantify is how ``uncertain'' we are about the
input we have before we measure it. There is much less to communicate
about the page numbers in a two-page pamphlet than in the Encyclopedia
Britannica and, as the measure of this initial uncertainty, entropy measures
how many yes/no questions would be required on average to guess the state
of the world. Given that all pages are equally likely, the number of yes/no
questions required to guess the page flipped to in a two-page pamphlet would
be 1, and hence this would have an entropy (uncertainty) of 1 bit. For a 1024
($2^{10}$) page book, 10 yes/no questions are required on average and the entropy
would be 10 bits. For a one-page book, you would not even need to ask a
question, so it would have 0 bits of entropy. As well as the number of
questions required to guess a signal, the entropy also measures the smallest
possible size that the information could be compressed to.
The simplest situation and one encountered in many experiments is where
all possible states of the world are equally likely (in our case, the ``page
flipper'' flips to all pages with equal probability). In this case no compression
is possible and the entropy ($H$) is equal to:

$$H = \log_2 N \qquad (1.1)$$
where $N$ is the number of possible states of the world, and $\log_2$ means that
the logarithm is to the base 2.¹ Simply put, the more pages in a book, the
more yes/no questions required to identify the page and the higher the
entropy. But rather than work in a measuring system based on ``number of
pages'', we work with logarithms. The reason for this is simply that in many
cases we will be dealing with multiple events. If the ``page flipper'' flips twice,
the number of possible combinations of pages would be $N \times N$ (the
numbers of states multiply). If instead we use logarithms, then the entropy of
the two page flips will simply be the sum of the individual entropies (if the
number of states multiply, their logarithms add). Addition is simpler than
multiplication so by working with logs, we make subsequent calculations
much simpler (we also make the numbers much more manageable; an
entropy of 25 bits is more memorable than a system of 33,554,432 states).
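As a minimal worked example of equation 1.1, the page-flipping arithmetic can be checked in a few lines of Python (the function name here is arbitrary):

```python
import math

# Entropy of N equally likely states (equation 1.1), in bits.
def uniform_entropy_bits(n_states: int) -> float:
    return math.log2(n_states)

print(uniform_entropy_bits(2))      # two-page pamphlet: 1.0 bit
print(uniform_entropy_bits(1024))   # 1024-page book: 10.0 bits

# Two independent page flips: the numbers of states multiply,
# so the entropies (their logarithms) add.
print(uniform_entropy_bits(1024 * 1024))                        # 20.0 bits
print(uniform_entropy_bits(1024) + uniform_entropy_bits(1024))  # 20.0 bits
```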
When all states of the world are not equally likely, then compression is
possible and fewer questions need (on average) to be asked to identify an
input. People are often biased page flippers, flipping more often to the middle
pages. A clever compression algorithm, or a wise asker of questions, can use
this information to take, on average, fewer questions to identify the given
page. One of the main results of information theory is that, given knowledge
of the probability of all events, the minimum number of questions on average
required to identify a given event (and the smallest size to which it can be
compressed) is given by:

$$H(X) = \sum_{x} p(x) \log_2 \frac{1}{p(x)} \qquad (1.2)$$

where $p(x)$ is the probability of event $x$. If all events are equally likely, this
reduces to equation 1.1. In all cases the value of equation 1.2 will always be
equal to (if all states are equally likely), or less than (if the probabilities are
not equal) the entropy as calculated using equation 1.1. This leads us to call a
distribution where all states are equally likely a maximum entropy distribu-
tion, a property we will come back to later in Section 1.5.
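To make equation 1.2 concrete, the following sketch compares an unbiased ``page flipper'' with a biased one; the probabilities are invented purely for illustration:

```python
import math

def entropy_bits(probs):
    """Entropy of a discrete distribution (equation 1.2), in bits."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # unbiased flipper over a 4-page pamphlet
biased  = [0.70, 0.10, 0.10, 0.10]   # flips mostly to one (middle) page

print(entropy_bits(uniform))  # 2.0 bits, i.e. log2(4): the maximum entropy case
print(entropy_bits(biased))   # about 1.36 bits: fewer questions needed on average
```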
¹ Logarithms to the base 2 are often used since this makes the ``number of yes/no'' interpretation
possible. Sometimes, for mathematical convenience, natural logarithms are used and the resulting
measurements are then expressed in nats. The conversion is simple: 1 bit $= \log_e 2$ nats
$\approx 0.69314718$ nats.
Information
So entropy is intuitively a measure of (the logarithm of) the number of states
the world could be in. If, after measuring the world, this uncertainty is
decreased (it can never be increased), then the amount of decrease tells us
how much we have learned. Therefore, the information is defined as the
difference between the uncertainty before and after making a measurement.
Using the probability theory notation of $P(X|Y)$ to indicate the probability
of $X$ given knowledge of $Y$ (conditional on $Y$), the mutual information
($I(X;Y)$) between a measurement $X$ and the input $Y$ can be defined as:

$$I(X;Y) = H(X) - H(X|Y) \qquad (1.3)$$

With a bit of mathematical manipulation, we can also get the following
definitions, where $H(X,Y)$ is the entropy of all combinations of inputs and
outputs (the joint distribution):

$$I(X;Y) =
\begin{cases}
H(X) - H(X|Y) & \text{(a)} \\
H(Y) - H(Y|X) & \text{(b)} \\
H(X) + H(Y) - H(X,Y) & \text{(c)}
\end{cases}
\qquad (1.4)$$
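The identities in equation 1.4 can be checked numerically. The sketch below uses a small made-up joint distribution and form (c); the numbers carry no empirical meaning:

```python
import math

def entropy_bits(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# A small, made-up joint distribution P(X, Y) over two binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1,
         (1, 0): 0.1, (1, 1): 0.4}

p_x = [sum(p for (x, _), p in joint.items() if x == value) for value in (0, 1)]
p_y = [sum(p for (_, y), p in joint.items() if y == value) for value in (0, 1)]

# Form (c) of equation 1.4: I(X;Y) = H(X) + H(Y) - H(X,Y).
mutual_information = entropy_bits(p_x) + entropy_bits(p_y) - entropy_bits(joint.values())
print(round(mutual_information, 3))   # about 0.278 bits
```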
1.3 Why Is This Interesting?
In the previous section, we have informally defined information but left
unanswered the question of why information theory would be of any use
in studying brain function. A number of reasons have inspired its use includ-
ing:
Information Theory Can Be Used as a Statistical Tool. There are a number of
cases where information-theoretic tools are useful simply for the statistical
description or modelling of data. As a simple measure of association of two
variables, the mutual information or a near relative (Good, 1961; Press et al.,
1992) can be applied to both categorical and continuous signals and produces
a number that is on the same scale for both. While correlation is useful for
continuous variables (and if the variables are Gaussian, will produce very
similar results), it is not directly applicable to categorical data. While $\chi^2$ is
applicable to categorical data, all continuous data needs to be binned. In
these cases, information theory provides a well founded and general measure
of relatedness.
The use of information theory in statistics also provides a basis for the
tools of (non-linear) regression and prediction. Traditionally regression
methods minimise the sum-squared error. If instead we minimise the
(cross) entropy, this is both general (it can be applied to both categorical
and continuous outputs), and if used as an objective for neural networks,
maximising information (or minimising some related term) can result in
neural network learning algorithms that are much simpler; theoretically
more elegant; and in many cases appear to perform better (Ackley et al.,
1985; Bishop, 1995).
Analysis of Informational Bottlenecks. While many problems are, for theore-
tical and practical reasons, not amenable to analysis using information the-
ory, there are cases where a lot of information has to be communicated but
the nature of the communication itself places strong constraints on transmis-
sion rates. The time-varying membrane potential (a rich informational
source) has to be communicated using only a stream of spikes. A similar
argument applies to synapses, and to retinal ganglion cells communicating
the incoming light pattern to the cortex and beyond. The rate of speech
production places a strong limit on the rate of communication between
two people who at least sometimes think faster than they can speak. Even
though a system may not be best thought of as simply a communication
system, and all information transmitted may not be used, calculating trans-
mitted information places constraints on the relationship between two sys-
tems. Looking at models that maximise information transmission may
provide insight into the operation of such systems (Atick, 1992a; Linsker,
1992; Baddeley et al., 1997).
1.4 Practical Use of Information Theory
The previous section briefly outlined why, in principle, information theory
might be useful. That still leaves the very important practical question of how
one could measure it. Even in the original Shannon and Weaver book
(Shannon and Weaver, 1949), a number of methods were used. To give a
feel for how mutual information and entropy can be estimated, this section
will describe a number of different methods that have been applied to pro-
blems in brain function.
Directly Measuring Discrete Probability Distributions
The most direct and simply understood method of measuring entropy and
mutual information is to directly estimate the appropriate probability dis-
tributions (P(input), P(output) and P(input and output)). This is concep-
tually straightforward and, given enough data, a reasonable method.
One example of an application where this method is applicable was
inspired by the observation that people are very bad at random number
generation. People try and make sequences ``more random'' than real ran-
dom numbers by avoiding repeats of the same digit; they also, under time
pressure, repeat sequences. This ability to generate random sequences has
therefore been used as a measure of cognitive load (Figure 1.1), where
entropy has been used as the measure of randomness (Baddeley, 1956).
The simplest estimators were based on simple letter probabilities and in
this case it is very possible to directly estimate the distribution (we only
have 26 probabilities to estimate). Unfortunately, methods based on simple
probability estimation will prove unreliable when used to estimate, say, letter
pair probabilities (a statistic that will be sensitive to some order information).
In this case there are 676 ($26^2$) probabilities to be estimated, and subjects'
patience would probably be exhausted before enough data had been collected
to reliably estimate them. Note that even when estimating 26 probabilities,
entropy will be systematically underestimated (and information overesti-
mated) if we only have small amounts of data. Fortunately, simple methods
to remove such an ``under-sampling bias'' have been known for a long time
(Miller, 1955).
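A rough sketch of this direct-estimation approach is given below. It shows the plug-in estimator together with a first-order correction of roughly $(K-1)/(2N\ln 2)$ bits, written here in the spirit of Miller (1955) rather than as the exact form used in any particular study; the letter sequence is invented:

```python
import math
from collections import Counter

def plugin_entropy_bits(symbols):
    """Naive ('plug-in') entropy estimate from observed symbol frequencies."""
    counts = Counter(symbols)
    n = len(symbols)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def bias_corrected_entropy_bits(symbols, alphabet_size):
    """Plug-in estimate plus a first-order under-sampling correction of
    roughly (K - 1) / (2 N ln 2) bits, in the spirit of Miller (1955)."""
    n = len(symbols)
    return plugin_entropy_bits(symbols) + (alphabet_size - 1) / (2 * n * math.log(2))

sample = "fpyubferkiahdywshcfyfktwvnljkepuucdqld"   # an invented 'random letter' sample
print(plugin_entropy_bits(sample))                  # tends to underestimate the true entropy
print(bias_corrected_entropy_bits(sample, 26))      # partially corrected estimate
```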
Of great interest in the 1960s was the measuring of the ``capacity'' of
various senses. The procedure varied in detail, but was essentially the
same: the subjects were asked to label stimuli (say, tones of different frequen-
cies) with different numbers. The mutual information between the stimuli
and the numbers assigned by the subjects was then calculated with different
numbers of stimuli presented (see Figure 1.2). Given only two stimuli, a
subject would almost never make a mistaken identification, but as the num-
ber of stimuli to be labelled increased, subjects started to make mistakes. By
estimating where the function relating mutual information to the number of
Figure 1.1. The most straightforward method to calculate entropy or mutual information
is direct estimation of the probability distributions (after Baddeley, 1956). One case
where this is appropriate is in using the entropy of subjects' random number generation
ability as a measure of cognitive load. The subject is asked to generate random digit
sequences in time with a metronome, either as the only task, or while simultaneously
performing a task such as card sorting. Depending on the difficulty of the other task and
the speed of generation, the ``randomness'' of the digits will decrease. The simplest way
to estimate entropy is to estimate the probability of different letters. Using this measure
of entropy, redundancy (entropy/maximum entropy) decreases linearly with generation
time, and also with the difficulty of the other task. This has subsequently proved a very
effective measure of cognitive load.
input categories asymptotes, an estimate of subjects' channel capacity can be
made. Surprisingly this number is very small – about 2.5 bits. This capacity
estimate approximately holds for a large number of other judgements: loud-
ness (2.3 bits), tastes (1.9 bits), points on a line (3.25 bits), and this leads to
one of the best titles in psychology – the ``seven plus or minus two'' of Miller
(1956) refers to this small range (between 2.3 bits ($\log_2 5$) and 3.2 bits
($\log_2 9$)).
Again in these tasks, since the number of labels usable by subjects is small,
it is very possible to directly estimate the probability distributions with rea-
sonable amounts of data. If instead subjects were reliably able to label 256
stimuli (8 bits as opposed to 2.5 bits capacity), we would again get into
problems of collecting amounts of data sufficient to specify the distributions,
and methods based on the direct estimation of probability distributions
would require vast amounts of subjects' time.
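For absolute-judgement data of this kind, the transmitted information can be computed directly from a stimulus-response confusion table. The sketch below is generic: the counts are invented, not Pollack's data:

```python
import numpy as np

def mutual_info_bits(counts):
    """Mutual information (bits) from a joint count table of
    stimuli (rows) against responses (columns)."""
    joint = counts / counts.sum()
    px = joint.sum(axis=1, keepdims=True)   # stimulus marginal
    py = joint.sum(axis=0, keepdims=True)   # response marginal
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

# Invented confusion counts for a 4-tone labelling task: near-perfect
# labelling transmits close to log2(4) = 2 bits.
confusion = np.array([[24,  1,  0,  0],
                      [ 1, 23,  1,  0],
                      [ 0,  1, 23,  1],
                      [ 0,  0,  1, 24]], dtype=float)
print(mutual_info_bits(confusion))
```

With many stimuli and few trials per cell, this plug-in estimate is itself biased upwards, which is the under-sampling problem mentioned above.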
Continuous Distributions
Given that the data are discrete, and we have enough data, then simply
estimating probability distributions presents few conceptual problems.
Unfortunately if we have continuous variables such as membrane potentials,
or reaction times, then we have a problem. While the entropy of a discrete
probability distribution is finite, the entropy of any continuous variable is
Figure 1.2. Estimating the ``channel capacity'' for tone discrimination (after Pollack,
1952, 1953). The subject is presented with a number of tones and asked to assign
numeric labels to them. Given only three tones (A), the subject has almost perfect
performance, but as the number of tones increases (B), performance rapidly deteriorates.
This is not primarily an early sensory constraint, as performance is similar when the
tones are tightly grouped (C). One way to analyse such data is to plot the transmitted
information as a function of the number of input stimuli (D). As can be seen, up until
about 2.5 bits, all the available information is transmitted, but when the input information
is above 2.5 bits, the excess information is lost. This limited capacity has been found
for many tasks and was of great interest in the 1960s.
infinite. One easy way to see this is that using a single real number between 0
and 1, we could very simply code the entire Encyclopedia Britannica. The first
two digits after the decimal place could represent the first letter; the second
two digits could represent the second letter, and so on. Given no constraint
on accuracy, this means that the entropy of a continuous variable is infinite.
Before giving up hope, it should be remembered that mutual information
as specified by equation 1.4 is the difference between two entropies. It turns
out that as long as there is some noise in the system ($H(X|Y) > 0$), then the
difference between these two infinite entropies is finite. This makes the role of
noise vital in any information theory measurement of continuous variables.
One particular case is if both the signal and noise are Gaussian (i.e.
normally) distributed. In this case the mutual information between the signal
($s$) and the noise-corrupted version ($s + n$) is simply:

$$I(s; s+n) = \frac{1}{2}\log_2\left(1 + \frac{\sigma^2_{\mathrm{signal}}}{\sigma^2_{\mathrm{noise}}}\right) \qquad (1.5)$$

where $\sigma^2_{\mathrm{signal}}$ is the variance of the signal, and $\sigma^2_{\mathrm{noise}}$ is the variance of the noise.
This has the expected characteristics: the larger the signal relative to the noise,
the larger the amount of information transmitted; a doubling of the signal will
result in an approximately 1 bit increase in information transmission; and the
information transmitted will be independent of the unit of measurement.
It is important to note that the above expression is only valid when both
the signal and noise are Gaussian. While this is often a reasonable and
testable assumption because of the central limit theorem (basically, the
more things we add, usually the more Gaussian the system becomes), it is
still only an estimate and can underestimate the information (if the signal is
more Gaussian than the noise) or overestimate the information (if the noise is
more Gaussian than the signal).
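Equation 1.5 is straightforward to evaluate; the variances below are purely illustrative:

```python
import math

def gaussian_info_bits(signal_variance, noise_variance):
    """Information between a Gaussian signal and its noise-corrupted
    version (equation 1.5), in bits."""
    return 0.5 * math.log2(1.0 + signal_variance / noise_variance)

print(gaussian_info_bits(4.0, 1.0))    # about 1.16 bits
print(gaussian_info_bits(16.0, 1.0))   # about 2.04 bits
# Doubling the signal amplitude quadruples its variance, which at high
# signal-to-noise ratios adds roughly one bit per measurement.
```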
A second problem concerns correlated signals. Often a signal will have
structure – for instance, it could vary only slowly over time. Alternatively,
we could have multiple measurements. If all these measurements are inde-
pendent, then the situation is simple – the entropies and mutual informations
simply add. If, on the other hand, the variables are correlated across time,
then some method is required to take these correlations into account. In an
extreme case if all the measurements were identical in both signal and noise,
the information from one such measurement would be the same as the com-
bined information from all: it is important to in some way deal with these
effects of correlation.
Perhaps the most common way to deal with this ``correlated measure-
ments'' problem is to transform the signal to the Fourier domain. This
method is used in a number of papers in this volume and the underlying
logic is described in Figure 1.3.
The Fourier transform method always uses the same representation (in
terms of sines and cosines) independent of the data. In some cases, especially
when we do not have that much data, it may be more useful to choose a
representation which still has the uncorrelated property of the Fourier com-
ponents, but is optimised to represent a particular data set. One plausible
candidate for such a method is principal components analysis. Here a new set
of measurements, based on linear transformation of the original data, is used
to describe the data. The first component is the linear combination of the
original measurements that captures the maximum amount of variance. The
second component is formed by a linear combination of the original mea-
surements that captures as much of the variance as possible while being
orthogonal to the first component (and hence independent of the first com-
ponent if the signal is Gaussian). Further components can be constructed in a
similar manner. The main advantage over a Fourier-based representation is
Figure 1.3. Taking into account correlations in data by transforming to a new representation.
(A) shows a signal varying slowly as a function of time. Because the voltages
at different time steps are correlated, it is not possible to treat each time step as independent
and work out the information as the sum of the information values at different
time steps. One way to approach this problem is to transform the signal to a new
representation where all components are now uncorrelated. If the signal is Gaussian,
transforming to a Fourier series representation has this property. Here we represent the
original signal (A) as a sum of sines and cosines of different frequencies (B). While the
individual time measurements are correlated, if the signal is Gaussian, the amount of
each Fourier component (C) will be uncorrelated. Therefore the mutual information
for the whole signal will simply be the sum of the information values for the individual
frequencies (and these can be calculated using equation 1.5).
that more of the signal can be described using fewer descriptors and thus less
data is required to estimate the characteristics of the signal and noise.
Methods based on principal-component-based representations of spike
trains have been applied to calculating the information transmitted by cor-
tical neurons (Richmond and Optican, 1990).
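The decorrelate-then-sum logic of Figure 1.3, applied here with principal components, can be sketched as follows. This is an illustrative toy rather than the procedure of Richmond and Optican (1990): the signal covariance is synthetic, and the noise is assumed to be Gaussian, independent and of equal variance in every component:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, slowly varying Gaussian signals: 500 trials of a 32-sample trace.
t = np.arange(32)
true_cov = np.exp(-np.abs(t[:, None] - t[None, :]) / 8.0)   # smooth = correlated
signals = rng.multivariate_normal(np.zeros(32), true_cov, size=500)
noise_variance = 0.1   # assumed equal, independent Gaussian noise per component

# Decorrelate using principal components (eigenvectors of the signal covariance).
component_variances, _ = np.linalg.eigh(np.cov(signals, rowvar=False))

# Sum equation 1.5 over the now (approximately) uncorrelated components.
total_bits = 0.5 * np.log2(1.0 + component_variances / noise_variance).sum()
print(total_bits)
```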
All the above methods rely on an assumption of Gaussian nature of the
signal, and if this is not true and there exist non-linear relationships between
the inputs and outputs, methods based on Fourier analysis or principal
components analysis can only give rather inaccurate estimates. One method
that can be applied in this case is to use a non-linear compression method to
generate a compressed representation before performing the information
estimation (see Figure 1.4).
Figure 1.4. Using non-linear compression techniques for generating compact representations
of data. Linear principal components analysis can be performed using the neural
network shown in (A), where a copy of the input is used as the target output. On
convergence, the weights from the n input units to the h coding units will span the
same space as the first h principal components and, given that the input is Gaussian,
the coding units will be a good representation of the signal. If, on the other hand, there
is non-Gaussian non-linear structure in the signals, this approach may not be optimal.
One possible approach to dealing with such non-linearity is to use a compression-based
algorithm to create a non-linear compressed representation of the signals. This can be
done using the non-linear generalisation of the simple network to allow non-linearities
in processing (shown in (B)). Again the network is trained to recreate its input from its
output, while transmitting the information through a bottleneck, but this time the data
is allowed to be transformed using an arbitrary non-linearity before coding. If there are
significant non-linearities in the data, the representation provided by the bottleneck
units may provide a better representation of the input than a principal-components-based
representation. (After Fotheringhame and Baddeley, 1997.)
Estimation Using an ``Intelligent'' Predictor
Though the direct measurement of the probability distributions is concep-
tually the simplest method, often the dimensionality of the problem renders
this implausible. For instance, if interested in the entropy of English, one
could get better and better approximations by estimating the probability
distribution of letters, letter pairs, letter triplets, and so on. Even for letter
triplets, there are $27^3 = 19,683$ possible three-letter combinations whose
probabilities must be estimated: the amount of data required to do this at all accurately is
prohibitive. This is made worse because we know that many of the regula-
rities of English would only be revealed over groups of more than three
letters. One potential solution to this problem is available if we have access
to a good model of the language or predictor. For English, one source of a
predictor of English is a native speaker. Shannon (see Table 1.1) used this to
devise an ingenious method for estimating the entropy of English as
described in Table 1.1.
Even when we don't have access to such a good predictor as an English
language speaker, it is often simpler to construct (or train) a predictor rather
than to estimate a large number of probabilities. This approach to estimating
mutual information has been applied (Heller et al., 1995) to estimation of the
visual information transmission properties of neurons in both the primary
visual cortex (also called V1; area 17; or striate cortex) and the inferior
temporal cortex (see Figure 1.5). Essentially the spikes generated by neurons
when presented various stimuli were coded in a number of different ways (the
Table 1.1. Estimating the entropy of English using an intelligent predictor (after Shannon,
1951).

THERE IS NO REVERSE
1 1 1 5 1 1 2 1 1 2 1 1 1 5 1 1 7 1 1 1 2
ON A MOTORCYCLE
1 3 2 1 2 2 7 1 1 1 1 4 1 1 1 1

Above is a short passage of text. Underneath each letter is the number of guesses required by a
person to guess that letter based only on knowledge of the previous letters. If the letters were
completely random (maximum entropy and no redundancy), the best predictor would take on
average 27/2 guesses (26 letters and a space) for every letter. If, on the other hand, there is complete
predictability, then a predictor would require only one guess per letter. English is between
these two extremes and, using this method, Shannon estimated an entropy per letter of between 1.6
and 0.6 bits per letter. This contrasts with $\log_2 27 \approx 4.76$ bits if every letter was equally likely and
independent. Technical details can be found in Shannon (1951) and Attneave (1959).
average firing rate, vectors representing the presence and absence of spikes,
various low-pass-filtered versions of the spike train, etc.). These codified spike
trains were used to train a neural network to predict the visual stimulus that
was presented when the neurons generated these spikes. The accuracy of
these predictions, given some assumptions, can again be used to estimate
the mutual information between the visual input and the differently coded
spike trains. For these neurons and stimuli, the information trans-
mission is relatively small ($\approx 0.5$ bits s$^{-1}$).
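In the same spirit, though not the actual procedure of Heller et al. (1995), a predictor-based estimate can be sketched with off-the-shelf tools: train any classifier to predict the stimulus from the coded responses, tabulate the joint counts of stimulus and prediction, and compute the information in that table. The data below are simulated, and the result is only an estimate: a lower bound in principle (by the data-processing inequality), but biased upwards by limited data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Simulated stand-in data: 'responses' plays the role of coded spike trains
# (here 10 bins of Poisson counts whose rate depends on the stimulus), and
# 'stimulus' is the index of the pattern shown on each of 400 trials.
rng = np.random.default_rng(1)
stimulus = rng.integers(0, 4, size=400)
responses = rng.poisson(lam=2.0 + stimulus[:, None], size=(400, 10))

# Train a simple predictor and collect cross-validated predictions.
predicted = cross_val_predict(LogisticRegression(max_iter=1000),
                              responses, stimulus, cv=5)

# Tabulate joint counts of (stimulus, prediction) and compute the
# information in that table (plug-in estimate, as in the earlier sketch).
joint_counts = np.zeros((4, 4))
for s, p in zip(stimulus, predicted):
    joint_counts[s, p] += 1

joint = joint_counts / joint_counts.sum()
px = joint.sum(axis=1, keepdims=True)
py = joint.sum(axis=0, keepdims=True)
nz = joint > 0
print((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())
```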
Estimation Using Compression
One last method for estimating entropy is based on Shannon's coding theo-
rem, which states that the smallest size to which any compression algorithm can
compress a sequence is equal to its entropy. Therefore, by invoking a number
of compression algorithms on the sample sequence of interest, the smallest
compressed representation can be taken as an upper bound on that sequen-
ce's entropy. Methods based on this intuition have been more common in
genetics, where they have been used to ask such questions as does ``coding''
DNA have higher or lower entropy than ``non-coding'' DNA (Farach et al.,
1995). (The requirements of quick convergence and reasonable computation
Figure 1.5. Estimating neuronal information transfer rate using a neural network based
predictor (after Heller et al., 1995). A collection of 32 $4 \times 4$ Walsh patterns (and their
contrast reversed versions) (A) were presented to awake Rhesus Macaque monkeys, and
the spike trains generated by neurons in V1 and IT recorded (B and C). Using differ-
ently coded versions of these spike trains as input, a neural network (D) was trained
using the back-propagation algorithm to predict which Walsh pattern was presented.
Intuitively, if the spike train contains a lot of information about the input, then an
accurate prediction is possible, while if there is very little information then the spike
train will not allow accurate prediction of the input. Notice that (1) the calculated
information will be very dependent on the choice (and number) of stimuli, and (2)
even though we are using a predictor, implicitly we are still estimating probability
distributions and hence we require large amounts of data to accurately estimate the
information. Using this method, it was claimed that the neurons only transmitted small
amounts of information ($\approx 0.5$ bits), and that this information was contained not in the
exact timing of the spikes, but in a local ``rate''.
time mean that only the earliest algorithms simply performed compression,
but the concept behind later algorithms is essentially the same.)
More recently, this compression approach to entropy estimation has been
applied to automatically calculating linguistic taxonomies (Figure 1.6). The
entropy was calculated using a modified compression algorithm based on
Farach et al. (1995). Cross entropy was estimated using the compressed
length when the code book derived for one language was used to compress
another. Though methods based on compression have not been commonly
used in the theoretical neuroscience community (but see Redlich, 1993), they
provide at least interesting possibilities.
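A crude version of the compression idea can be tried with a general-purpose compressor such as zlib. This gives only an upper bound and is a stand-in for the specialised estimators cited above; the texts are invented:

```python
import random
import string
import zlib

def compressed_bits_per_char(text: str) -> float:
    """Upper bound on entropy per character: size in bits of the
    zlib-compressed text divided by the text length."""
    raw = text.encode("ascii", errors="ignore")
    return 8.0 * len(zlib.compress(raw, 9)) / len(raw)

random.seed(0)
alphabet = string.ascii_lowercase + " "
random_letters = "".join(random.choice(alphabet) for _ in range(20000))
redundant_text = "to be or not to be that is the question " * 500

print(compressed_bits_per_char(random_letters))   # not far above log2(27) = 4.75 bits
print(compressed_bits_per_char(redundant_text))   # far lower: structure compresses
```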
1.5 Maximising Information Transmission
The previous section was concerned with simply measuring entropy and
information. One other proposal that has received a lot of attention recently
is the proposition that some cortical systems can be understood in terms of
their maximising information transmission (Barlow, 1989). There are a num-
ber of reasons supporting such an information maximisation framework:
Maximising the Richness of a Representation. The richness and flexibility of
the responses to a behaviourally relevant input will be limited by the number
of different states that can be discriminated. As an extreme case, a protozoan
that can only discriminate between bright and dark will have less flexible
navigating behaviour than an insect (or human) that has an accurate repre-
Figure 1.6. Estimating entropies and cross entropies using compression-based techniques.
The declaration of the Bodleian Library (Oxford) has been translated into more
than 50 languages (A). The entropy of these letter sequences can be estimated using the
size of a compressed version of the statement. If the code book derived by the algorithm
for one language is used to code another language, the size of the code book will reflect
the cross entropy (B). Hierarchical minimum distance cluster analysis, using these cross
entropies as distances, can then be applied to this data (a small subset of the resulting
tree is shown (C)). This method can produce an automatic taxonomy of languages, and
has been shown to correspond very closely to those derived using more traditional
linguistic analysis (Juola, P., personal communication).
sentation of the grey-level structure of the visual world. Therefore, heuristi-
cally, evolution will favour representations that maximise information trans-
mission, because these will maximise the number of discriminable states of
the world.
As a Heuristic to Identify Underlying Causes in the Input. A second reason is
that maximising information transmission is a reasonable principle for gen-
erating representations of the world. The pressure to compress the world
often forces a new representation in terms of the actual ``causes'' of the
images (Olshausen and Field, 1996a). A representation of the world in
terms of edges (the result of a number of information maximisation algo-
rithms when applied to natural images, see for instance Chapter 5), may well
be easier to work with than a much larger and redundant representation in
terms of the raw intensities across the image.
To Allow Economies to be Made in Space, Weight and Energy. By having a
representation that is efficient at transmitting information, it may be possible
to economise on some other aspect of the system design. As described in Chapter 3,
an insect eye that transmits information efficiently can be smaller and lighter,
and can consume less energy (both when operating and when being trans-
ported). Such ``energetic'' arguments can also be applied to, say, the trans-
mission of information from the eye to the brain, where an inefficient
representation would require far more retinal ganglion cells, would take
significantly more space in the brain, and use a significantly larger amount
of energy.
As a Reasonable Formalism for Describing Models. The last reason is more
pragmatic and empirical. The quantities required to work out how efficient a
representation is, and the nature of a representation that maximises informa-
tion transmission, are measurable and mathematically formalisable. When
this is done, and the ``optimal'' representations compared to the physiologi-
cal and psychophysical measurements, the correspondence between these
optimal representations and those observed empirically is often very close.
This means that even if the information maximisation approach is only
heuristic, it is still useful in summarising data.
How then can one maximise information transmission? Most approaches
can be understood in terms of a combination of three different strategies:
• Maximise the number of effective measurements by making sure that each
  measurement tells us about a different thing.
• Maximise the signal whilst minimising the noise.
• Subject to the external constraints placed on the system, maximise the
  efficiency of the questions asked.
Maximising the Effective Number of Questions
The simplest method of increasing information transmission is to increase the
number of measurements made: someone asking 50 questions concerning the
page flipped to in a book has more chance of identifying it than someone who
asks one question. Again an eye connected by a large number of retinal
ganglion cells to later areas should send more information than the single
ganglion cell connected to an eyecup of a flatworm.
This insight is simple enough not to rely on information theory, but the
raw number of measurements is not always equivalent to the ``effective''
number of measurements. If, given two questions to identify a page in the
book, the first one was ``Is it between pages 1 and 10?'', then a second of
``Is it between 2 and 11?'' would provide remarkably little extra information.
In particular, given no noise, the maximum amount of information can be
transmitted if all measurements are independent of each other.
A similar case occurs in the transmission of information about light enter-
ing the eye. The outputs of two adjacent photoreceptors will often be mea-
suring light coming from the same object and therefore send very correlated
signals. Transmitting information to later stages simply as the output of
photoreceptors would therefore be very inefficient, since we would be sending
the same information multiple times. One simple proposal for transforming
the raw retinal input before transmitting it to later stages is shown in Figure
1.7, and has proved successful in describing a number of facts about early
visual processing (see Chapter 3).
Figure 1.7. Maximising information transmission by minimising redundancy. In most
images (A), the intensity arriving at two locations close together in the visual field will
often be very similar, since it will often originate from the same object. Sending information
in this form is therefore very inefficient. One way to improve the efficiency of
transmission is not to send the pixel intensities, but the difference between the intensity
at a location and that predicted from the nearby photoreceptors. This can be achieved
by using a centre surround receptive field as shown in (B). If we transmit this new
representation (C), far less channel capacity is used to send the same amount of information.
Such an approach seems to give a good account of the early spatial filtering
properties of insect (Srinivasan et al., 1982; van Hateren, 1992b) and human (Atick,
1992b; van Hateren, 1993) visual systems.
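The redundancy-reduction idea of Figure 1.7 can be illustrated with a toy one-dimensional ``image'': transmitting each sample's deviation from a prediction based on its neighbours, a crude stand-in for a centre-surround filter, requires far less variance, and hence by equation 1.5 far less channel capacity for the same noise, than transmitting the raw correlated intensities. The signal below is synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)

# A toy one-dimensional 'image' row: smooth, highly correlated intensities.
n = 10_000
intensity = 100.0 + 0.1 * np.cumsum(rng.normal(size=n))

# Predictive recoding: transmit each sample minus a prediction formed from
# its two neighbours (a crude 1-D stand-in for a centre-surround filter).
prediction = 0.5 * (intensity[:-2] + intensity[2:])
difference = intensity[1:-1] - prediction

print(np.var(intensity))    # large: raw intensities are highly redundant
print(np.var(difference))   # tiny: the recoded signal needs far less capacity
```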
Guarding Against Noise
The above ``independent measurement'' argument is only true to a point.
Given that the person you ask the question of speaks clearly, then ensuring
that each measurement tells you about a different thing is a reasonable
strategy. Unfortunately, if the person mumbles, has a very strong accent,
or has possibly been drinking too much, we could potentially miss the answer
to our questions. If this happens, then because each question is unrelated to
the others, an incorrect answer cannot be detected by its relationship to other
questions, nor can they be used to correct the mistake. Therefore, in the
presence of noise, some redundancy can be helpful to (1) detect corrupted
information, and (2) help correct any errors. As an example, many non-
native English speakers have great difficulty in hearing the difference between
the numbers 17 and 70. In such a case it actually might be worth asking ``is
the page above seventy'' as well as ``is it above fifty'' since this would provide
some guard against confusion of the word seventy. This may also explain the
charming English habit of shouting loudly and slowly to foreigners.
The appropriate amount of redundancy will depend on the amount of
noise: the amount of redundancy should be high when there is a lot of
noise, and low when there is little. Unfortunately this can be difficult to
handle when the amount of noise is different at different times, as in the
retina. Under a bright illuminant, the variations in image intensity (the sig-
nal) will be much larger than the variations due to the random nature of
photon arrival or the unreliability of synapses (the noise). On the other hand,
for very low light conditions this is no longer the case, with the variations due
to the noise now relatively large. If the system was to operate optimally, the
amount of redundancy in the representation should change at different illu-
mination levels. In the primate visual system, the spatial frequency ®ltering
properties of the ``retinal ®lters'' change as a function of light level, consistent
with the retina maximising information transmission at different light levels
(Atick, 1992b).
Making Efficient Measurements
The last way to maximise information transmission is to ensure not only that
all measurements measure different things, and noise is dealt with effectively,
but also that the measurements made are as informative as possible, subject
to the constraints imposed by the physics of the system.
For binary yes/no questions, this is relatively straightforward. Consider
again the problem of guessing a page in the Encyclopedia Britannica. Asking
the question ``Is it page number 1?'' is generally not a good idea – if you
happen to guess correctly then this will provide a great deal of information
(technically known as surprisal), but for the majority of the time you will
know very little more. The entropy (and hence the maximum amount of
information transmission) is maximal when the uncertainty is maximal,
and this occurs when both alternatives are equally likely. In this case we
want questions where ``yes'' has the same probability as ``no''. For instance
a question such as ``Is it in the first or second half of the book?'' will generally
tell you more than ``Is it page 2?''. The entropy as a function of probability is
shown for a yes/no system (binary channel) in Figure 1.8.
When there are more possible signalling states than true and false, the
constraints become much more important. Figure 1.9 shows three of the
simplest cases of constraints and the nature of the outputs (if we have no
noise) that will maximise information transmission. It is interesting to note
that the spike trains of neurons are exponentially distributed as shown in
Figure 1.9(C), consistent with maximal information transmission subject to
an average firing rate constraint (Baddeley et al., 1997).
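Both claims can be checked numerically: the binary entropy of Figure 1.8 peaks at a probability of 0.5, and, for a fixed mean firing rate, an exponential-like (geometric) distribution over discrete rates has higher entropy than, for example, a uniform distribution with the same mean. The sketch below is illustrative only:

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Binary channel (Figure 1.8): entropy peaks at p = 0.5 with 1 bit.
for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    print(p, entropy_bits([p, 1.0 - p]))

# Discrete 'firing rate' distributions with the same mean (here 5):
rates = np.arange(200)
mean_rate = 5.0

uniform = (rates <= 2 * mean_rate).astype(float)
uniform /= uniform.sum()                                              # uniform on 0..10, mean 5
geometric = (mean_rate / (1 + mean_rate)) ** rates / (1 + mean_rate)  # mean 5

print(entropy_bits(uniform), entropy_bits(geometric))   # the exponential-like wins
```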
1.6 Potential Problems and Pitfalls
The last sections were essentially positive. Unfortunately not all things about
information theory are good:
The Huge Data Requirement. Possibly the greatest problem with information
theory is its requirement for vast amounts of data if the results are to tell us
more about the data than about the assumptions used to calculate its value.
As mentioned in Section 1.4, estimating the probability of every three-letter
combination in English would require sufficient data to estimate 19,683 dif-
ferent probabilities. While this may actually be possible given the large num-
ber of books available electronically, to get a better approximation to
English, (say, eight-letter combinations), the amount of data required
Figure 1.8. The entropy of a binary random (Bernoulli) variable is a function of its
probability and maximum when its probability is 0.5 (when it has an entropy of 1 bit).
Intuitively, if a measurement is always false (or always true) then we are not uncertain of
its value. If instead it is true as often as not, then the uncertainty, and hence the entropy,
is maximised.
becomes completely unrealistic. Problems of this form are almost always
present when applying information theory, and often the only way to pro-
ceed is to make assumptions which are possibly unfounded and often difficult
to test. Assuming true independence (very difficult to verify even with large
data sets), and assuming a Gaussian signal and noise can greatly cut down on
the number of measurements required. However, these assumptions often
remain only assumptions, and any interpretations of the data rest strongly
on them.
Information and Useful Information. Information theory again only mea-
sures whether there are variations in the world that can be reliably discri-
minated. It does not tell us if this distinction is of any interest to the
animal. As an example, most information-maximisation-based models of
low-level vision assume that the informativeness of visual information is
simply based on how much it varies. Even at the simplest level, this is
difficult to maintain as variation due to, say, changes in illumination is
often of less interest than variations due to changes in reflectance, while
the variance due to changes in illumination is almost always greater than
that caused by changes in reflectance. While the simple ``variation equals
information'' may be a useful starting point, after the mathematics starts it
is potentially easy to forget that it is only a first approximation, and one
can be led astray.
Figure 1.9. The distribution of neuronal outputs consistent with optimal information
transmission will be determined by the most important constraints operating on that
neuron. First, if a neuron is only constrained by its maximum and minimum output,
then the maximum entropy, and therefore the maximum information that could be
transmitted, will occur when all output states are equally likely (A) (Laughlin, 1981).
Second, a constraint favoured for mathematical convenience is that the power (or
variance) of the output states is constrained. Given this, entropy is maximised for a
Gaussian firing rate distribution (B). Third, if the constraint is on the average firing
rate of a neuron, higher firing rates will be more ``costly'' than low firing rates, and an
exponential distribution of firing rates would maximise entropy (C). Measurements
from V1 and IT cells show that neurons in these areas have exponentially distributed
outputs when presented with natural images (Baddeley et al., 1997), and hence are at
least consistent with maximising information transmission subject to an average rate
constraint.
Coding and Decoding. A related problem is that information theory tells us if
the information is present, but does not describe whether, given the compu-
tational properties of real neurons, it would be simple for neurons to extract.
Caution should therefore be expressed when saying that information present
in a signal is information available to later neurons.
Does the Receiver Know About the Input? Information theory makes some
strong assumptions about the system. In particular it assumes that the recei-
ver knows everything about the statistics of the input, and that these statistics
do not change over time (that the system is stationary). This assumption of
stationarity is often particularly unrealistic.
1.7 Conclusion
In this chapter it was hoped to convey an intuitive feel for the core concepts
of information theory: entropy and information. These concepts themselves
are straightforward, and a number of ways of applying them to calculate
information transmission in real systems were described. Such examples are
intended to guide the reader towards the domains that in the past have
proved amenable to information theoretic techniques. In particular it is
argued that some aspects of cortical computation can be understood in the
context of maximisation of transmitted information. The following chapters
contain a large number of further examples and, in combination with Cover
and Thomas (1991) and Rieke et al. (1997), it is hoped that the reader will
find this book helpful as a starting point in exploring how information theory
can be applied to new problem domains.
Thesis
Full-text available
Neurons use action potentials, or spikes, to process information. Different aspects of spiking, such as its rate or timing, are used to encode information about different features of an input. But noise can influence how robustly information is represented in each coding scheme. Additionally, information can be lost if neural representations are not transmitted with high fidelity to downstream neurons. In both cases, properties of the input (its amplitude and kinetics) and properties of the neuron (its spike initiation mechanism and excitability) impact neural information processing. In my thesis, I first investigated how axons are optimized to transmit spike-based representations. Using patch clamp electrophysiology combined with optogenetics, I showed that the axon of CA1 pyramidal neurons spikes transiently in response to sustained depolarization, in contrast to the soma and axon initial segment, which spike repetitively. These distinct spiking patterns are due to the differential expression of ion channels, supporting functional specialization of neuronal compartments. Specifically, low-threshold potassium channels (Kv1) cause the axon to behave as a high-pass filter, enabling high fidelity transmission of spike-based information so that the axon selectively responds to inputs with fast kinetics. Together with biophysical modeling, my findings demonstrate that spike initiation properties in each part of the neuron are well matched to the signals normally processed in that neuronal compartment. I then investigated how background synaptic activity (noise) affects rate and temporal coding of vibrotactile stimuli. Using patch clamp electrophysiology and dynamic clamp in vitro, I found that layer 2/3 pyramidal neurons in primary somatosensory cortex spike intermittently to inputs repeated at frequencies perceived as vibration. The fraction of inputs evoking a spike varies with input amplitude, enabling firing rate to encode stimulus intensity. Despite being small in amplitude, inputs are abrupt in onset, which allows them to evoke precisely timed spikes, even under noisy conditions. Unreliable spiking allows noise to produce irregular skipping, enabling spike times (patterns) to encode stimulus frequency. The reliability and precision of spikes depend on input amplitude and kinetics, respectively. With the help of simulations, my results show that noise helps multiplexed rate and temporal coding.
... However, compressive transformations are also characterized by overall larger decision values (in terms of absolute decision values |dv|) than linear or anticompressive transformations (Figure 2A, inset bar graph). In other words, with compressive weighting, the sample values are transformed with a greater "gain" of processing , which can be assumed to be costly (e.g., in terms of metabolic resources) for biological observers (Baddeley et al., 2000;Kostal et al., 2013). The beneficial effect of compression on performance in Figure 2D can thus be explained by the recruitment of greater processing resources, resulting in an overall steeper weighting curve (Figure 2A), which counteracts the effects of decision noise . ...
Preprint
People routinely make decisions based on samples of numerical values. A common conclusion from the literature in psychophysics and behavioral economics is that observers subjectively compress magnitudes, such that extreme values have less sway over choice than prescribed by a normative model (underweighting). However, recent studies have reported evidence for anti-compression, that is, the relative overweighting of extreme values. Here, we investigate potential reasons for this discrepancy in findings and examine the possibility that it reflects adaptive responses to different task requirements. We performed a large-scale study (N = 607) of sequential numerical integration, manipulating (i) the task requirement (averaging a single stream or comparing two streams of numbers), (ii) the distribution of sample values (uniform or Gaussian), and (iii) their range (1 to 9 or 100 to 900). The data showed compression of subjective values in the averaging task, but anti-compression in the comparison task. This pattern held for both distribution types and for both ranges. The findings are consistent with model simulations showing that either compression or anti-compression can be beneficial for noisy observers, depending on the sample-level processing demands imposed by the task.
... There are several definitions of entropy in information systems. Regardless of the use of this concept, it is frequently described as the level of uncertainty, either of a signal or more generally of a specific information structure (Baddeley et al., 2008). That level of uncertainty has been found in other digital environments and has been evidenced through the kind of sequences that appear in sequential analysis (Sanchez-Lozano, 2010). ...
... Information theory (Cover and Thomas 2006) underlies much research on learning theory (Belghazi et al. 2018;Kwak and Chong-Ho Choi 2002;Brakel and Bengio 2018) as well as thinking in neuroscience (Baddeley, Foldiak, and Hancock 1999). The Information Bottleneck (IB) principle (Tishby, Pereira, and Bialek 1999) generalizes the notion of minimal sufficient statistics, expressing a tradeoff in the hidden representation between the information needed for predicting the output, and the information retained about the input. ...
Article
Full-text available
We introduce the HSIC (Hilbert-Schmidt independence criterion) bottleneck for training deep neural networks. The HSIC bottleneck is an alternative to the conventional cross-entropy loss and backpropagation that has a number of distinct advantages. It mitigates exploding and vanishing gradients, resulting in the ability to learn very deep networks without skip connections. There is no requirement for symmetric feedback or update locking. We find that the HSIC bottleneck provides performance on MNIST/FashionMNIST/CIFAR10 classification comparable to backpropagation with a cross-entropy target, even when the system is not encouraged to make the output resemble the classification labels. Appending a single layer trained with SGD (without backpropagation) to reformat the information further improves performance.
Article
Full-text available
Natural scene analysis has been extensively used to understand how the invariant structure of the visual environment may have shaped biological image processing strategies. This paper deals with four crucial, but hitherto largely neglected aspects of natural scenes: (1) the viewpoint of specific animals; (2) the fact that image statistics are not independent of the position within the visual field; (3) the influence of the direction of illumination on luminance, spectral and polarization contrast in a scene; and (4) the biologically relevant information content of natural scenes. To address these issues, I recorded the spatial distribution of light in a tropical mudflat with a spectrographic imager equipped with a polarizing filter in an attempt to describe quantitatively the visual environment of fiddler crabs. The environment viewed by the crabs has a distinct structure. Depending on the position of the sun, the luminance, the spectral composition, and the polarization characteristics of horizontal light distribution are not uniform. This is true for both skylight and for reflections from the mudflat surface. The high-contrast feature of the line of horizon dominates the vertical distribution of light and is a discontinuity in terms of luminance, spectral distribution and of image statistics. On a clear day, skylight intensity increases towards the horizon due to multiple scattering, and its spectral composition increasingly resembles that of sunlight. Sky-substratum contrast is highest at short wavelengths. I discuss the consequences of this extreme example of the topography of vision for extracting biologically relevant information from natural scenes.
Article
We propose that coding and decoding in the brain are achieved through digital computation using three principles: relative ordinal coding of inputs, random connections between neurons, and belief voting. Due to randomization and despite the coarseness of the relative codes, we show that these principles are sufficient for coding and decoding sequences with error-free reconstruction. In particular, the number of neurons needed grows linearly with the size of the input repertoire growing exponentially. We illustrate our model by reconstructing sequences with repertoires on the order of a billion items. From this, we derive the Shannon equations for the capacity limit to learn and transfer information in the neural population, which is then generalized to any type of neural network. Following the maximum entropy principle of efficient coding, we show that random connections serve to decorrelate redundant information in incoming signals, creating more compact codes for neurons and therefore, conveying a larger amount of information. Henceforth, despite the unreliability of the relative codes, few neurons become necessary to discriminate the original signal without error. Finally, we discuss the significance of this digital computation model regarding neurobiological findings in the brain and more generally with artificial intelligence algorithms, with a view toward a neural information theory and the design of digital neural networks.
Article
Full-text available
In 1948, Claude Shannon introduced his version of a concept that was core to Norbert Wiener's cybernetics, namely, information theory. Shannon's formalisms include a physical framework, namely a general communication system having six unique elements. Under this framework, Shannon information theory offers two particularly useful statistics, channel capacity and information transmitted. Remarkably, hundreds of neuroscience laboratories subsequently reported such numbers. But how (and why) did neuroscientists adapt a communications-engineering framework? Surprisingly, the literature offers no clear answers. To therefore first answer "how", 115 authoritative peer-reviewed papers, proceedings, books and book chapters were scrutinized for neuroscientists' characterizations of the elements of Shannon's general communication system. Evidently, many neuroscientists attempted no identification of the system's elements. Others identified only a few of Shannon's system's elements. Indeed, the available neuroscience interpretations show a stunning incoherence, both within and across studies. The interpretational gamut implies hundreds, perhaps thousands, of different possible neuronal versions of Shannon's general communication system. The obvious lack of a definitive, credible interpretation makes neuroscience calculations of channel capacity and information transmitted meaningless. To now answer why Shannon's system was ever adapted for neuroscience, three common features of the neuroscience literature were examined: ignorance of the role of the observer, the presumption of "decoding" of neuronal voltage-spike trains, and the pursuit of ingrained analogies such as information, computation, and machine. Each of these factors facilitated a plethora of interpretations of Shannon's system elements. Finally, let us not ignore the impact of these "informational misadventures" on society at large. It is the same impact as scientific fraud.
ResearchGate has not been able to resolve any references for this publication.