Aggregated Learning Curves
Andrew Heathcote and Scott Brown
School of Behavioural Sciences,
The University of Newcastle, Australia
Address for correspondence
Andrew Heathcote
School of Behavioural Sciences, Building W
The University of Newcastle, Callaghan, 2308, NSW, Australia
Email: heathcote@psychology.newcastle.edu.au
Abstract
We examine recent concerns that averaged learning curves can present a distorted picture
of individual learning. Analyses of practice curve data from a range of paradigms
demonstrate that such concerns are well founded for fits of power and exponential
functions when the arithmetic average is computed over subjects. We also demonstrate
that geometric averaging over subjects does not, in general, avoid distortion. By contrast,
we show that block averages of individual curves, and similar smoothing techniques,
cause little or no distortion of functional form, while still providing the noise-reduction
benefits that motivate the use of averages. Our analyses are concerned mainly with the
effects of averaging on the fit of exponential and power functions, but we also define
general conditions that must be met by any set of curves to avoid averaging distortion,
and provide a simple graphical test to determine when the conditions are violated.
A number of authors have expressed concern recently that the curve described by
an average across subjects does not represent performance of any subject contributing to
the average (e.g., Anderson & Tweney, 1997; Heathcote, Brown & Mewhort, 2000;
Myung, Kim, & Pitt, 2000). Concern that averaging over subjects misrepresents the
quantitative and qualitative form of individual curves is not new. As early as 1892, Boas
questioned the representativeness of averaged growth curves, and like concerns have
been reiterated for a variety of learning curves (e.g., Bakan, 1954; Bahrick, Fitts & Briggs,
1957; Estes, 1956; Kling, 1971; Sidman, 1952; Underwood, 1949). Examples of gross
misrepresentation by averaging are easy to construct. Sudden changes in performance by
individuals, for example, yield a smooth curve in the average when the point of change
differs among the individuals. The recent interest in averaging misrepresentationand
the main focus of this paperconcerns a subtle artefact, one that favours the power
over the exponential function in fits to average data.
Figure 1 illustrates the type of averaging distortion of interest. It shows that the
arithmetic mean of three exponential functions (y = a + be^{-rx}) with differing scale (b) and rate (r) parameters is better fit by a power function (y = a + bx^{-s}, R² = 0.995) than by an exponential function (R² = 0.984). Variation across individual curves in the scale parameter does not cause averaging distortion, and the same is true for variation in asymptote (a; note that a = 0 for the three curves in Figure 1), as these parameters are linear1. Distortion is caused only when nonlinear parameters (e.g., exponents, such as r for the exponential or s for the power function, and, in general, any parameter that does not disappear when the function is differentiated with respect to it) vary among individual curves.
---------------------------------
Insert Figure 1 about here
---------------------------------
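To make the artefact concrete, the following minimal sketch (in Python, assuming only numpy and scipy; the parameter values are those given in the Figure 1 caption, with a = 0) reproduces the demonstration: three exactly exponential components whose arithmetic mean is better fit by a power function.

```python
# Sketch of the Figure 1 demonstration: the arithmetic mean of three exponential
# curves (a = 0; b = 5, 2.236, 1; r = 1, 0.55, 0.1) is better fit by a power function.
import numpy as np
from scipy.optimize import curve_fit

x = np.arange(1.0, 26.0)  # x = 1, 2, ..., 25 as in Figure 1
components = np.vstack([5.0 * np.exp(-1.0 * x),
                        2.236 * np.exp(-0.55 * x),
                        1.0 * np.exp(-0.1 * x)])
mean_curve = components.mean(axis=0)  # arithmetic average over "subjects"

def exponential(x, a, b, r):
    return a + b * np.exp(-r * x)

def power(x, a, b, s):
    return a + b * x ** (-s)

def r_squared(y, yhat):
    return 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

for name, f, p0 in [("exponential", exponential, (0.0, 1.0, 0.5)),
                    ("power", power, (0.0, 1.0, 1.0))]:
    est, _ = curve_fit(f, x, mean_curve, p0=p0, maxfev=10000)
    print(name, round(r_squared(mean_curve, f(x, *est)), 3))
```

The power fit should win (values close to the R² = .995 versus .984 reported above, with the exact figures depending on the fitting details), even though every component is exactly exponential.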
Brown, Heathcote and Keats (submitted), using results that go back to Cauchy in
1821 (see Aczel, 1966 for details and extensions), showed that an arithmetic mean of
several component functions has exactly the same form as the components if and only if
the component functions are linear in all parameters that vary across the components.
This does not mean that the component functions must themselves be linear; each may
have an extremely nonlinear form. The issue is whether that form varies nonlinearly over
the components. We will describe such nonlinear variation as variation in the “shape” of
the component functions. Where shape varies amongst component functions, as in
Figure 1, fitting quantitative models to the mean function can be misleading, because the
mean function may have a different functional form than the components that contribute
to the mean. We refer to misleading preference for the wrong function as distortion.
The example in Figure 1 presents a puzzle in view of Anderson and Tweney’s (1997) conclusion that “extreme variability” (p. 737) in r is required to produce a
misleading average. In Figure 1, the maximum r differs from the minimum by only one
order of magnitude, yet the mean is better fit by the power function than by the
exponential. Using simulation, Brown et al. (submitted) determined the conditions under
which the average of exponential functions is better fit by a power function. Preference
for the power function depends on the ratio of maximum to minimum rate of the
component curves. When components with intermediate rates were included in the
average, preference for the power was the same as, or greater than, for the average of two functions. Under the right conditions, a ratio of only ten was sufficient to cause a power
function to provide a better fit to the mean.
Figure 1 was constructed to illustrate the conditions required for such distortion.
A power function is a better fit to the arithmetic average when the component
exponential curve with the largest rate dominates the mean function for an appreciable
range of small x values, but because it reaches its asymptote, leaves the component
exponential curve with smallest rate to dominate the mean function for larger values of x.
There is an important caveat, however: the smallest rate must not be so small, nor the largest rate so large, that the component curve is effectively flat in the measured range of
x. When a rate is large relative to the range of x, the curve achieves asymptote almost
immediately, and so has little effect on the way the mean function changes. Where the
rate is small, the curve does not change appreciably over the measured range of x, and,
again, has little effect on the way the average changes.
To understand why the power bias occurs, the shape of a curve can be described
by the way its “relative rate” (i.e., minus the derivative of the curve divided by its height
above the asymptote) changes with x (cf. Heathcote et al.’s, 2000, relative learning rates
for practice functions and Wickens’s, 1998, hazard rates for retention functions). The
relative rate for the exponential is a constant (r) for all values of x; for the power
function, it decreases hyperbolically as s/x. In Figure 1 the relative rate of the mean
function decreases from close to 1 (i.e., the relative rate of component one) when x is
small, to a relative rate close to 0.1 (i.e., the relative rate of component three) when x is
large, thus mimicking the decreasing relative rate of the power function. As illustrated,
this moderate change in relative rate is sufficient to yield a better fit for the power than
the exponential function.
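A short sketch of this diagnostic (again assuming the Figure 1 parameters and a known zero asymptote; the derivative is approximated by finite differences, so the endpoint values are rough):

```python
# Empirical relative rate, -(dy/dx)/(y - a), of the Figure 1 mean function (a = 0).
import numpy as np

x = np.arange(1.0, 26.0)
mean_curve = (5.0 * np.exp(-x) + 2.236 * np.exp(-0.55 * x) + np.exp(-0.1 * x)) / 3.0
rel_rate = -np.gradient(mean_curve, x) / mean_curve
print(rel_rate[0], rel_rate[-1])  # decreases with x, mimicking the power function's s/x
```

The same calculation applied to a single exponential component returns a constant close to its r.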
Arithmetic Averages over Subjects
The examples examined so far may seem contrived or artificial. The important
question is whether such distortion is a problem for real learning curves. The literature
offers a mixed story: In a reply to Anderson and Tweney (1997), Wixted and Ebbesen (1997) report little averaging distortion in the data of Wixted and Ebbesen (1991), as both individual and average retention functions were better fit by power than
exponential functions. For practice curves, Heathcote et al. (2000) found that fits to
individual curves (i.e., separate curves for each combination of subject and experimental
condition) strongly favoured an exponential over a power function. In contrast, an earlier
survey by Newell and Rosenbloom (1981) using practice data mostly averaged over
subjects and/or conditions, favoured the power function in every averaged data set.
Hence, at least for practice curves, it appears that averaging distortion may be a real
problem.
To examine the impact of averaging, 17 data sets from Heathcote et al. (2000)
were re-analysed after arithmetic averaging across subjects. Figure 2 displays the results
listed in order of the percentage of individual (unaveraged) curves better fit by the
exponential. The details of the methods, and the definitions of the labels used for each
data set, are given in Appendix A. There were many experimental conditions (between 8
and 260) in each data set, and data were not averaged over these, only across subjects
within conditions. The results show that averaging over subjects has strong and
unpredictable effects on model selection. Prior to averaging, all data sets showed a
consistent preference for the exponential function (in 62% to 91% of cases the
exponential provided a better fit). In 7 of the 17 data sets, averaging over subjects
decreased the preference for the exponential; 2 cases resulted in a strong preference for
the power function.
---------------------------------
Insert Figure 2 about here
---------------------------------
For some data sets shown in Figure 2 averaging over subjects increased rather
than decreased preference for the exponential function. One explanation relies on the
assertion that, as noise increases and dominates the underlying learning curve, both
power and exponential functions will provide the best fit equally often. While this
assertion seems intuitively plausible, Myung et al. (2000) reported a very strong bias
whereby the power function provided a better fit to purely random data than the
exponential function in 99% of cases. However, Myung et al.’s simulations were limited
to single-parameter power and exponential functions (i.e., where only the nonlinear s and r parameters varied). Brown et al. (submitted) replicated Myung et al.’s results, but
when they extended the analysis to exponential and power functions with two parameters
(b and r or s) and three parameters (b, r or s, and the asymptote a) varying, they found
equal preference in fits to random data. Hence, in Figure 2, which reports the fit of three
parameter power and exponential functions, averaging over subjects may have increased
the preference for the exponential function because it reduced noise, and so moved the
results away from equal preference.
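The following sketch illustrates the kind of check described for Brown et al. (submitted); it is our own illustrative reconstruction, not their code, and the noise settings and bounds are arbitrary choices (the lower bound of zero on the asymptote echoes the fitting procedure in Appendix A).

```python
# Fit three-parameter exponential and power functions to pure noise and count wins.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
x = np.arange(1.0, 51.0)
lower, upper = (0.0, 0.0, 0.0), np.inf  # a, b, and r (or s) constrained non-negative
wins_exp, n_series = 0, 200
for _ in range(n_series):
    y = rng.normal(1.0, 0.2, x.size)  # no learning signal at all
    try:
        pe, _ = curve_fit(lambda x, a, b, r: a + b * np.exp(-r * x), x, y,
                          p0=(1.0, 0.1, 0.1), bounds=(lower, upper))
        pp, _ = curve_fit(lambda x, a, b, s: a + b * x ** (-s), x, y,
                          p0=(1.0, 0.1, 0.5), bounds=(lower, upper))
    except RuntimeError:
        continue  # skip rare non-convergent series
    sse_e = np.sum((y - (pe[0] + pe[1] * np.exp(-pe[2] * x))) ** 2)
    sse_p = np.sum((y - (pp[0] + pp[1] * x ** (-pp[2]))) ** 2)
    wins_exp += int(sse_e < sse_p)
print(wins_exp / n_series)  # with three free parameters each, expect a proportion near .5
```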
The take-home message from this analysis is clear: arithmetic averaging yields
unpredictable results. It is too dangerous to be trusted where issues of curve form, and
more generally the fit of any nonlinear model, are at stake.
Geometric Averaging over Subjects
Geometric averaging has often been thought to attenuate or cure distortion
associated with the arithmetic average (e.g., Anderson & Tweney, 1997; Myung et al.,
2000; Rickard, 1997, 1999; Wixted & Ebbesen, 1997). Given a set of two parameter
power curves (b and s varying), a logarithmic transformation produces a linear equation
in the logarithm of x. When the averaged logarithms are de-transformed, the result is a
geometric average. The geometric mean function has the same form as the component
functions and its parameters are the averages of the components’ parameters (the type of
the average depends on the type of the parameter: geometric for the scale parameters
and arithmetic for the exponent parameters).
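For instance, a sketch of the noiseless two-parameter power case (the parameter values are made up for illustration):

```python
# Geometric averaging of two-parameter power curves preserves the power form:
# GA has scale = geometric mean of the b_i and exponent = arithmetic mean of the s_i.
import numpy as np

x = np.arange(1.0, 26.0)
b = np.array([2.0, 3.0, 5.0])
s = np.array([0.3, 0.6, 1.2])
curves = b[:, None] * x[None, :] ** (-s[:, None])

ga = np.exp(np.log(curves).mean(axis=0))            # de-transformed average of logs
predicted = np.exp(np.log(b).mean()) * x ** (-s.mean())
print(np.allclose(ga, predicted))  # True: the geometric mean is itself a power function
```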
Geometric averaging also works for two-parameter exponential curves. For the
data in Figure 1, Component 2 is exactly the geometric average of the three components,
with a scale parameter that is the geometric average of the component scale parameters
and a rate parameter that is the arithmetic average of the component rate parameters. As
is clear in the figure, the arithmetic mean function is very different from the geometric
average.
Unfortunately, geometric averages of power and exponential functions are not
well behaved when the component functions have non-zero asymptote parameters. In
memory retention, a zero asymptote is plausible, although designs with sufficiently long
retention intervals to determine the asymptote are rarely employed. When Rubin,
Hinton and Wenzel (1999) did employ long retention intervals in a continuous
recognition paradigm, they did not obtain zero asymptotes, even with difficult materials.
Unlike retention curves, practice functions must have an asymptote parameter greater
than zero because stimulus-driven RT cannot equal zero, even in highly practiced
subjects. Hence, although the geometric average is well behaved when the asymptote is
zero, the zero-asymptote case cannot be assumed to occur in real-world examples.
Brown et al. (submitted) studied the effect of non-zero asymptotes on geometric
averaging of exponential functions. Geometric averaging (GA) becomes increasingly
similar to arithmetic averaging (AA) as the ratio of scale to asymptote parameters (b/a) decreases. For values of a/b less than about 0.3, geometric averaging
differed sufficiently from arithmetic averaging that the geometric mean function was
better fit by an exponential than a power function. At all values of a/b greater than 0.3,
however, the bias towards the power function was not sufficiently attenuated, and a
power function fit the geometric mean better than an exponential function. For a/b
greater than 1, the effects of arithmetic and geometric averaging on power and
exponential fits were virtually identical.
Taylor expansions can be used to explain these findings. Across any given domain that is sufficiently small, the logarithmic and exponential functions are approximately linear: ln(1 + K) ≈ K, and e^K ≈ 1 + K. This yields the following approximation for geometric averages of exponential functions with asymptote (a_i), scale (b_i) and rate (r_i) parameters that vary across components:
$$
\begin{aligned}
GA(y) &= \left(\prod_{i=1}^{N}\left(a_i + b_i e^{-r_i x}\right)\right)^{1/N}
      = GA(a_i)\left(\prod_{i=1}^{N}\left(1 + \frac{b_i}{a_i} e^{-r_i x}\right)\right)^{1/N} \\
\ln GA(y) &\approx \ln GA(a_i) + \frac{1}{N}\sum_{i=1}^{N}\frac{b_i}{a_i} e^{-r_i x}
          = \ln GA(a_i) + AA\!\left(\frac{b_i}{a_i} e^{-r_i x}\right) \\
GA(y) &\approx GA(a_i)\exp\!\left(AA\!\left(\frac{b_i}{a_i} e^{-r_i x}\right)\right)
       \approx GA(a_i) + GA(a_i)\,AA\!\left(\frac{b_i}{a_i} e^{-r_i x}\right)
\end{aligned}
$$
Hence, geometric averaging is approximately equal to arithmetic averaging, differing only
by a factor of GA(ai) and a weighting based on the ai. The approximations hold only as
well as the approximations to the log and exponential functions: that is, they hold when the ratio of scale to asymptote parameters (b_i/a_i) is small.
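A numerical check of this approximation (with illustrative parameter values in the regime where the a_i dominate the b_i):

```python
# Compare the exact geometric average of exponentials with the Taylor approximation
# GA(a_i) * (1 + AA((b_i/a_i) * exp(-r_i x))); asymptotes dominate the scales here.
import numpy as np

x = np.arange(1.0, 26.0)
a = np.array([1.0, 1.2, 0.8])   # asymptotes
b = np.array([0.5, 0.4, 0.3])   # scales, so a_i/b_i > 1 throughout
r = np.array([1.0, 0.55, 0.1])

curves = a[:, None] + b[:, None] * np.exp(-r[:, None] * x[None, :])
ga = np.exp(np.log(curves).mean(axis=0))
approx = np.exp(np.log(a).mean()) * (
    1.0 + ((b / a)[:, None] * np.exp(-r[:, None] * x[None, :])).mean(axis=0))
print(np.abs(ga - approx).max())  # small here; grows as the b_i/a_i ratios increase
```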
In real practice data, it is common for a_i to be large relative to b_i. In the many thousands of practice curves examined by Heathcote et al. (2000), for example, 51% yielded parameter estimates for which a_i/b_i > 1. The data on geometric averages in
Figure 2 bear out these findings: Geometric averaging showed the same substantial and
inconsistent effects on model discrimination as arithmetic averaging and, in some cases,
produced effects quite different to arithmetic averaging. In short, geometric averaging
cannot be relied on to cure, or even attenuate, the distorting effects of averaging over
subjects.
While simple geometric averaging fails when components have non-zero
asymptote parameters, shifted geometric averaging does work. Consider, for example,
three-parameter exponential components $y_i = a_i + b_i e^{-r_i x}$. Subtracting the asymptote parameter from both sides and taking logarithms yields functions that are linear in the independent variable, $\ln(y_i - a_i) = \ln b_i - r_i x$, and so no distortion occurs in their
average. Although the approach appears attractive in principle, its usefulness is severely
limited in practice. The problem is that the asymptote must be estimated for each
component. Averaging is required only in cases where the signal to noise ratio for
components is small, exactly the cases for which an accurate estimate of the asymptote is
difficult to obtain by fitting.
An alternative strategy is to obtain long series of observations so that asymptotes
may be estimated without fitting functions. For practice series, for example, the
asymptote can be estimated from the mean RT in later trials where performance shows
no change with practice. Again, however, the usefulness of this approach is limited, due
to the presence of noise. Because learning curves estimate mean individual performance,
the asymptote also estimates mean performance, and so some observations less than or
equal to the asymptote will likely occur. Consequently, when the asymptote estimate is
subtracted from the data, zero or negative values are obtained, with the result that the
logarithmic transformation cannot be applied. If these values are excluded, the average
will be biased. This problem is quite general, and applies even when the asymptote
parameters are known or can be estimated accurately by fitting functions. It can only be
avoided if variability in the dependent variable shrinks to zero as performance
approaches the value of the asymptote parameter, an unlikely event.
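A sketch of the problem (illustrative parameters; the asymptote is even assumed known exactly, the most favourable case):

```python
# Even with the true asymptote a in hand, subtracting it from a noisy series
# leaves non-positive values near asymptote, so log(y - a) is undefined there.
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(1.0, 101.0)
a, b, r = 0.5, 1.0, 0.1
y = a + b * np.exp(-r * x) + rng.normal(0.0, 0.1, x.size)
print(np.sum(y - a <= 0))  # typically well above zero late in the series
```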
As with the arithmetic average, the take-home message is clear: Geometric
averaging yields unpredictable results and is too dangerous to be trusted.
Averaging Within Functions
Newell, Liu and Mayer-Kress (2001) have argued that averaging within curves
(e.g., block averaging), like averaging across subjects, is dangerous. In their words:
Learning trials are often blocked … to remove the presumed transient
randomlike changes from trial to trial while emphasizing the persistent
changes or the global trend of learning over trials. The problem is that
blocking data from groups of trials can modify or mask properties of the
persistent trend as well as those of the transient changes. In particular,
this data analysis strategy reduces the evidence of rapid change in
performance that is often present early in practice. (p. 59)
Block averaging is an example of the general class of data analysis tools often called
“smooths” that have been much favoured in modern statistics. As its name implies, a
smooth is biased when it comes to rapid changes. It is not true, however, that
smoothing always changes the shape of “persistent trends” such as learning curves. We
will demonstrate that block averaging, and more general types of smoothing, have no
effect on the shape of exponential functions, and that the bias induced for power
functions is usually acceptably small.
Consider the exponential function defined by Equation 1:

$$y(x) = a + b e^{-rx} \qquad (1)$$
Now, suppose that the trial factor, x, is measured in N blocks of M trials, so that
x = 1, 2, …, NM. Each point in the block-averaged series is defined as the arithmetic
average of all points within the corresponding block of the raw data series, which yields
Equation 2, where i is block number (from 1, …, N):
$$y(i) = a + \frac{b}{M}\sum_{j=1}^{M} e^{-r\left((i-1)M + j\right)} \qquad (2)$$
Equation 2 may be re-expressed as Equation 3:
$$y(i) = a + b\left(\frac{e^{rM}}{M}\sum_{j=1}^{M} e^{-rj}\right) e^{-rMi} \qquad (3)$$
Thus, the block average is precisely an exponential function as in Equation 1, except the
parameter b is multiplied by the constant (with respect to blocks) term in brackets in
Equation 3, and the parameter r is multiplied by M. The change in scale is linear and the
change in the rate simply reflects a linear change in the units for the predictor (trials to
blocks). Hence, there is no distortion of shape.
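This exactness is easy to verify numerically (a sketch with arbitrary parameter values):

```python
# The block average of an exponential series is exactly exponential in block number,
# with b rescaled by the bracketed constant of Equation 3 and r multiplied by M.
import numpy as np

a, b, r = 0.3, 2.0, 0.05
N, M = 20, 5
x = np.arange(1, N * M + 1)
y = a + b * np.exp(-r * x)

blocked = y.reshape(N, M).mean(axis=1)  # one value per block i = 1, ..., N
i = np.arange(1, N + 1)
b_star = b * np.exp(r * M) * np.mean(np.exp(-r * np.arange(1, M + 1)))
print(np.allclose(blocked, a + b_star * np.exp(-r * M * i)))  # True
```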
Similar results hold for the moving window smooths, at least when end-effects
are neglected (for a discussion of the subtleties of end-effects see Wand & Jones, 1995,
or Fan & Gijbels, 1996). Thus, a simple boxcar smooth (the continuous generalisation of block averaging) or a more sophisticated weighted smooth will not change the functional form of exponential functions. Moving window smooths include, as a sub-case, zero-order local polynomial regression (a Nadaraya-Watson smooth), at least for
windows with bounded support. Given the exponential function in Equation 1, the
weighted moving window smooth with a window of width M (the boxcar smooth, i.e., the continuous version of the block average, is the special case in which all weights equal one) is defined as:
$$y(x) = a + \frac{b}{M}\sum_{j=1-M/2}^{M/2} w_j\, e^{-r(x+j)} \qquad (4)$$
The w_j are a set of weights (constrained to sum to M), and x is constrained to M/2, …, N − M/2 to remove end effects altogether. Equation 4 can be re-expressed
as an exponential function in the form of Equation 1:
$$y(x) = a + b\left(\frac{1}{M}\sum_{j=1-M/2}^{M/2} w_j\, e^{-rj}\right) e^{-rx} \qquad (5)$$
Thus the weighted moving window smooth of an exponential function is itself an
exponential function with the parameter b scaled by the term in brackets in Equation 5,
while all other parameters remain unchanged.
The power function does not behave quite so tractably under block averaging. In
general, it is not the case that the block average of a power function will be a power
function itself. However, by applying results from the kernel smoothing literature, we
can assure ourselves that the biases introduced by block averaging power function data
will be small, given certain conditions.
Again, consider the continuous version of the block average, the boxcar smooth,
for generality. Ruppert and Wand (1994) and Bowman and Azzalini (1997) provide
estimates for the expected bias of a kernel smooth. Assuming that the “true” regression
function is a power function, the approximate (first-order) pointwise bias for the boxcar smooth of Equation 4 is $\frac{M^{2} s(s+1)}{48}\, b\, x^{-(s+2)}$. As Newell et al. (2001) anticipated,
the greatest bias occurs at the start of the series, because that is where curvature is
greatest, but it rapidly diminishes to zero towards the tail of the series (as x → ∞).
Although bias occurs for block averages of power functions, it tends to be small
for the values of s and b typically observed in retention and practice data, at least for
reasonable choices of block width, M. For example, consider the best fitting power
function to the mean function in Figure 1, y = 1.377x^{-1.0279}. Assume five blocks of length
M = 5 are used. For the first block, at x = 3, the approximate bias is only 0.054, which
reduces to 0.0001 for the fifth block, at x = 23. Further, the relative rate of a block-
averaged power function still decreases with x, if anything at a faster rate than for a
power function2, so block averaging does not distort a power function into appearing like
an exponential function.
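The worked example as a few lines of arithmetic:

```python
# First-order boxcar bias, M**2 * s * (s + 1) / 48 * b * x**-(s + 2), for the
# Figure 1 power fit (b = 1.377, s = 1.0279) with blocks of length M = 5.
b, s, M = 1.377, 1.0279, 5

def approx_bias(x):
    return M ** 2 * s * (s + 1) / 48.0 * b * x ** -(s + 2)

print(round(approx_bias(3), 3))   # ~0.054, first block (x = 3)
print(round(approx_bias(23), 4))  # ~0.0001, fifth block (x = 23)
```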
Figure 3 shows the effect of block averaging on exponential versus power
discrimination in Heathcote et al.’s (2000) practice data. Each of the 17 un-averaged data
sets from the practice law survey was re-analysed after averaging over blocks. Both short
and long block averages were used for each individual series, with the actual block
lengths dictated by the design of the experiment and ensuring that each series had a
reasonable number of points after averaging (see Appendix A for details).
---------------------------------
Insert Figure 3 about here
---------------------------------
The effect of block averaging was much smaller than the effect of averaging over
subjects. In particular, preference for the exponential model was never reversed, although
preference for the exponential was generally less than for the un-averaged case3. Usually,
the increased averaging associated with longer blocks resulted in a greater decrease in
preference for the exponential compared to shorter blocks. The exceptions to these
generalizations occurred mainly in data sets with weaker un-averaged preference, where
block averaging sometimes increased preference for the exponential, and in one case (s2)
long blocks caused a greater increase than short blocks. Averages over subjects were also
calculated on block averages. The effect on model selection was similar to that seen with
averages over sequentially numbered trials, that is, large distortions were observed.
General Discussion
Arithmetic averaging can have opposing effects on averages of noisy exponential
functions. Where the individual curves’ rate parameters vary sufficiently, a bias toward
the power function is created in the average. However, averaging can also reduce noise,
so in some cases, a clearer preference for the exponential can emerge as the deleterious
effects of noise on model discrimination are attenuated. In the data from Heathcote et
al.’s (2000) survey (Figure 2, above), both effects of averaging appear to be operating and
interacting in complicated ways. Consequently, researchers who rely on the average
could be misled into concluding that different functional forms apply to different
conditions or paradigms when the real cause is variation in the distribution of learning
rates.
Geometric averaging cannot be relied upon to cure averaging distortion for
power and exponential functions. Geometric averaging attenuates the bias favouring the
power over exponential functions only when asymptotic performance is less than one
third of the scale of learning. For practice curves, this condition is commonly violated.
Analysis of Heathcote et al.’s (2000) practice data showed geometric averaging did not
differ much from arithmetic averaging, and so it was largely ineffective at avoiding
distortion. For retention curves, geometric averaging is more likely to be effective, but
its effectiveness cannot be assumed unless asymptotic performance is determined and its
magnitude compared with the scale of learning. Further, our results on geometric
averaging and those of Brown et al. (submitted) do not address the effects of non-zero
asymptotes that vary across components, which are likely in empirical data.
When we have reported our results on the potential distortion engendered by
averaging over subjects to colleagues, one of the first questions to arise regards the
implications for the analysis of learning curves with repeated measures ANOVA. The
question is a difficult one, and we can deal with it only briefly here. First, the object of
inference in ANOVA is the mean over subjects, so any quantitative evaluation of the
mean function’s shape, such as polynomial contrasts, can suffer from averaging
distortion. Second, the situation is actually much worse than it need be: Additivity is
usually adopted as the structural model for the subjects’ effect in most repeated-measures
ANOVA programs. Additivity implies that each subject differs in location (e.g.,
asymptote), but exhibits the same change in performance from the beginning to end of
learning, an erroneous assumption in our experience. Further, no averaging distortion
occurs when subject curves differ in scale as well as location; it seems wasteful not to
take advantage of this fact4. When only additivity is assumed, scale differences between
subjects are assigned to error, unnecessarily reducing the power of tests. Scale
differences can also induce spurious covariance between levels of the learning factor,
which are commonly corrected for by reducing degrees of freedom, at a further
unnecessary cost to power.
A much more optimistic picture emerged for block averages, and in general for
“smooths” that aggregate data from contiguous trials. Newell et al.’s (2001) concern
about block averaging is unwarranted for exponential learning curves, and the bias for
power functions is generally small for reasonable choices of smoothing parameters (e.g.,
block width). Comparison of Figures 2 and 3 shows that, for practice data, block averages
produced much less distortion than averages over subjects.5
All of the types of averaging examined here improved the goodness-of-fit for
both exponential and power functions for the data of Heathcote et al.’s (2000) survey.
For the raw data, the (unweighted) average R² across data sets was .35 for the exponential and .31 for the power function. For short block averages, the average R² was .61 for the exponential and .57 for the power function, and for long block averages the average R² was .73 for the exponential function and .70 for the power function. In arithmetic averages over subjects, the average R² was .70 for the exponential and .65 for the power
function. Hence, block averaging can be as effective in improving signal to noise ratio as
averaging over subjects. Consequently, block averaging can take advantage of the
improvement in model discrimination that occurs with decreased noise while introducing
no averaging distortion for the exponential, and very little averaging distortion for the
power function.
Block averages and, in general, local smooths, work because they take advantage
of the assumed smooth nature of learning curves. Empirical learning curves, however,
are rarely smooth, but this is usually assumed to be due to an uninteresting random
process. The latter assumption, that performance fluctuations in behavioural time series contain no psychologically relevant structure, has recently been challenged by
Fourier analysis (e.g., Gilden, 2001) and by nonlinear dynamical systems or “chaos”
analysis (e.g., Kelly, Heathcote, Heath & Longstaff, in press). The structure found by
both kinds of analysis will be destroyed by local averages in the time (or trial) domain.
However, local averages can still be appropriate in the frequency domain for Fourier
analysis (e.g., power spectrum estimates combining adjacent frequencies), or in delay
coordinates for dynamical systems analysis (e.g., Kantz & Schreiber, 1997; Heath, 2000).
The critical issue is that aggregation is performed locally on a representation where
change is smooth. Chaotic dynamics, for example, can produce sharp fluctuations in a
time series while following relatively smooth trajectories in delay coordinates.
The results presented here have implications for theories of learning, as well
as for analysis of the learning curve. In many cases, measurement and design limitations
mean that raw data are fundamentally aggregated. In retention experiments, for example,
even individual subject retention probabilities can be aggregates over a population of
items. If items have widely differing rates of forgetting (in Figure 1, for example, Component 1 may represent items that are forgotten quickly and Component 3 items that are forgotten slowly), the aggregated retention curve will have a decreasing relative
rate (see Heathcote et al., 2000, for a discussion of aggregated component practice
theories). The present results show that exponential component rates need differ only by
an order of magnitude for a power function to provide a better fit to the aggregate. In
some cases, theoretical or practical interest may focus on the aggregate curve over
components (e.g., retention of words in general), but, as with averages over subjects,
decreasing relative rates in aggregate curves do not rule out exponential component
learning, only exponential components that all have similar rates.
Given that averages of exponential functions, arguably the baseline form for a
learning curve, demonstrably produce marked averaging distortions, the situation will
often be worse for more complex nonlinear models. One reason that we have studied
exponentials is their simplicity; their shape is defined by a single parameter, relative rate,
which is invariant across trials. Any unobserved learning trials prior to an experiment (k)
can always be absorbed into a linear scale parameter (e.g., $e^{-r(x+k)} = e^{-rk} e^{-rx}$); it is this
translation invariance of shape that allows local averages to preserve the form of
exponentials. Further, the rate parameter of an exponential defines an invariant temporal
scale for change across trials, and deviations from an exponential form can be taken as
diagnostic of multiple temporal scales of change, such as often occur in nonlinear
dynamical systems (cf. Newell et al., 2001).
Importantly, the present results apply to all types of smooth learning curves, not
just exponential and power functions. Unless it can be shown that individual curves
differ only linearly, the quantitative form of individual learning, or the relative merits of
quantitative models of individual behaviour, should not be investigated by examining
averages over subjects.
Given that linearity among components is a prerequisite for undistorted averaging, it is important to test for linearity. For any form of the learning curve, two curves differ only linearly if one
curve plotted against the other forms a straight line. More usefully, a group of
components vary only linearly if they plot as straight lines against their mean function.
Figure 4 is such a plot for the example in Figure 1; it shows clearly that the components
differ nonlinearly. In noisy data, smoothing can be applied to such plots in order to
discern nonlinear trends.
---------------------------------
Insert Figure 4 about here
---------------------------------
The relationships shown in Figure 4 can also be used to construct an
approximate test for nonlinear variation among components. Even though the mean
function is measured with error, its error will be much less than that of the components,
so it can be used as a predictor in polynomial regressions on each component. For
monotone learning curves, any non-linear variation will be largely concentrated in lower
order polynomial trends, so if the interaction of any of these trends with the components
(subjects) factor is significant, averaging over components should be avoided.
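One simple way to implement such a check (a sketch of our own devising, not the ANOVA machinery mentioned above: it measures how much a quadratic in the mean function improves on a straight line for each component):

```python
# Flag nonlinear (shape) variation: regress each component on the mean function
# and compare linear and quadratic fits; sizeable R^2 gains indicate that
# averaging over components risks distortion.
import numpy as np

def nonlinear_variation(curves, degree=2):
    mean_fn = curves.mean(axis=0)
    gains = []
    for y in curves:
        lin = np.polyval(np.polyfit(mean_fn, y, 1), mean_fn)
        quad = np.polyval(np.polyfit(mean_fn, y, degree), mean_fn)
        sst = np.sum((y - y.mean()) ** 2)
        gains.append((np.sum((y - lin) ** 2) - np.sum((y - quad) ** 2)) / sst)
    return np.array(gains)

x = np.arange(1.0, 26.0)
curves = np.vstack([5.0 * np.exp(-x), 2.236 * np.exp(-0.55 * x), np.exp(-0.1 * x)])
print(nonlinear_variation(curves))  # clearly non-zero: the curvature visible in Figure 4
```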
References
Aczel, J. (1966). Lectures on functional equations and their applications. London: Academic Press.
Anderson, R. B., & Tweney, R. D. (1997). Artefactual power curves in forgetting. Memory & Cognition, 25, 724-730.
Azzalini, A., Bowman, A. W., & Härdle, W. (1989). On the use of non-parametric regression for model checking. Biometrika, 76, 1-11.
Bahrick, H. P., Fitts, P. M., & Briggs, G. E. (1957). Learning curves: Facts or artefacts? Psychological Bulletin, 54, 256-268.
Bakan, D. (1954). A generalisation of Sidman’s results on group and individual functions and a criterion. Psychological Bulletin, 51, 63-64.
Boas, F. (1892). The growth of children. Science, 19, 256-257, 281-282; 20, 351-
352.
Bowman, A. W. & Azzalini, A. (1997) Applied smoothing techniques for data
analysis : The kernel approach with S-Plus illustrations. Oxford: Clarendon Press.
Brown, S., & Heathcote, A. (2000) Good, better, best: The use of non-parametric
regression and the bootstrap to assess parametric models. Paper presented at the 33rd
Annual Meeting of the Society for Mathematical Psychology, Kingston, Canada.
Brown, S., Heathcote, A. and Keats, J. (submitted). Averaging exponential
functions.
Estes, W. K. (1956). The problem of inference from curves based on group data. Psychological Bulletin, 53, 134-140.
Fan, J. & Gijbels, I. (1996) Local polynomial modelling and its applications.
London: Chapman & Hall.
Gilden, D. L. (2001). Cognitive emissions of 1/f noise. Psychological Review,
108, 33-56.
Heath, R. A. (2000). Nonlinear Dynamics: Techniques and Applications in Psychology. Mahwah, NJ: Erlbaum.
Heathcote, A., Brown, S. & Mewhort, D.J.K. (2000) Repealing the power law:
The case for an exponential law of practice. Psychonomic Bulletin and Review, 7, 185-
207.
Kantz, H., & Schreiber, T. (1997). Nonlinear time series analysis. Cambridge, England: Cambridge University Press.
Kelly, A., Heathcote, A., Heath, R. A. & Longstaff, M. (in press). Response time
dynamics: Evidence for linear and low-dimensional nonlinear structure in human choice
sequences. Quarterly Journal of Experimental Psychology.
Kling, J. W. (1971). Learning: An introductory survey. In J. W. Kling & L. A
Riggs (Eds.), Woodworth and Schlosberg's Experimental Psychology (pp. 551-613).
New York: Holt, Rinehart, & Winston.
Lindstrom, M. J., & Bates, D. M. (1990). Nonlinear mixed effects models for
repeated measures data. Biometrics, 46, 673-687.
Mandel, J. (1963). Non-additivity in two-way analysis of variance. Journal of the American Statistical Association, 56, 878-888.
Myung, I. J., Kim, C., & Pitt, M. A. (2000). Toward an explanation of the power-
law artefact: Insights from response surface analysis. Memory & Cognition, 28, 832-840.
Newell, K. M., Liu, Y.-T., & Mayer-Kress, G. (2001). Time scales in motor learning and development. Psychological Review, 108, 57-82.
Newell, A., & Rosenbloom, P. S. (1981). Mechanisms of skill acquisition and the
law of practice. In J. R. Anderson (Ed.), Cognitive Skills and their Acquisition (pp. 1-55).
Hillsdale, NJ: Erlbaum.
Rickard, T. C. (1997). Bending the Power law: A CMPL theory of strategy shifts
and the automatization of cognitive skills. Journal of Experimental Psychology: General,
126, 288-311.
Rickard, T.C. (1999) A CMPL alternative account of practice effects in
numerosity judgement tasks. Journal of Experimental Psychology: Learning, Memory and
Cognition, 25, 532-542.
Rubin, D. C., Hinton, S., & Wenzel, A. (1999). The precise time course of retention. Journal of Experimental Psychology: Learning, Memory and Cognition, 25.
Ruppert, D., & Wand, M. P. (1994). Multivariate locally weighted least squares
regression. Annals of Statistics, 22, 1347-1370.
Sidman, M. (1952). A note on functional relations obtained from group data.
Psychological Bulletin, 49, 263-269.
Underwood, B. J. (1949). Experimental Psychology. New York: Appleton-Century-Crofts.
Wand, M. P., & Jones, M. C. (1995). Kernel Smoothing. London: Chapman &
Hall.
Wickens, T. D. (1998). On the form of the retention function: Comment on
Rubin and Wenzel (1996): A quantitative description of retention. Psychological Review,
105, 379-386.
Wixted, J. T., & Ebbesen, E. B. (1991). On the form of forgetting. Psychological
Science, 2, 409-415.
Wixted, J. T., & Ebbesen, E. B. (1997). Genuine power curves in forgetting: A
quantitative analysis of individual subject forgetting functions. Memory & Cognition, 25,
731-739.
Appendix A
Table A1 defines the labels used for data sets from Heathcote et al.’s (2000)
survey. The “Short Blocks” and “Long Blocks” columns indicate the number of
observations per block/number of blocks per subject. For the k1 and k2 data, different block lengths were used for the two within-subject conditions.
---------------------------------
Insert Table A1 about here
---------------------------------
Ordinary least squares estimation was used to fit three parameter power and
exponential functions with estimated asymptotes bounded below by zero. Note that the
pre-averaging results in Figure 2 differ slightly from those reported in Heathcote et al.
(2000) for two reasons. First, the numbering system used for the practice factor (N)
labelled the first correctly answered trial occurring in each within-subjects condition as 1,
the second correct trial as 2, and so on, rather than using the absolute trial number
regardless of condition, as in Heathcote et al. Second, practice series were truncated to
the length of the shortest practice series within a condition. This ensured that each
subject contributed exactly one RT to each value in the averaged series. We found that
other numbering systems that did not enforce this condition introduced substantial
distortion into the average; for instance by allowing the tail of the series to be dominated
by a single subject’s data. The disadvantage of this approach is that it discards some
information about the tail of the practice function, and so may push the results towards
no preference (50%). Comparison with Heathcote et al.’s Figure 1 shows that the effect
was only slight.
Tables
Table A1: Labels, sources and block specifications for the 17 un-averaged practice data sets in Heathcote et al. (2000).

Label | Reference | Experiment | Long Blocks | Short Blocks
m1 | Rickard & Bourne (1996) | “OPER” Experiment | 10/9 | 3/30
m2 | Rickard (1997) | “CPL” Experiment | 10/9 | 5/18
m3 | Reder & Ritter (1992) | Experiment 1 | 4/5 | 2/10
m4 | As for m3 | Experiment 2 | 4/5 | 2/10
m5 | Schunn, Reder, Nhouyvanisvong, Richards, & Stroffolino (1997) | Experiment 1, using only stimuli presented 28 times | 4/7 | 2/14
s1 | Strayer & Kramer (1994a) | Consistently mapped trials from mixed consistent/varied mapping training blocks from Experiment 2 | 48/15 | 24/30
s2 | Strayer & Kramer (1994a; 1994b) | Consistently mapped training blocks from Experiment 2 of (1994a), Experiments 4, 6, and 7 of (1994b), and an unpublished two-alternative forced-choice version of the task | 48/15 | 24/30
s3 | Strayer & Kramer (1994c) | Consistently mapped trials (young subjects) | 24/18 | 12/36
v1 | Heathcote & Mewhort (1993) | Experiment 1 | 20/10 | 10/20
v2 | Carrasco, Ponte, Rechea & Sampedro (1998) | | 12/7 | 4/21
v3 | As for v1 | Experiments 3 and 4 | 20/16 | 10/32
k1 | Verwey (1996) | Time to press each key taken separately, day 1 session only | - | 30/24 & 10/12
k2 | As for k1 | Time to press each key taken separately, day 1 session omitted due to non-stationary errors | - | 30/53 & 10/21
c1 | Palmeri (1997) | Experiment 1 | 16/13 | 4/52
c2 | As for c1 | Experiment 2 | 16/10 | 4/40
c3 | As for c1 | Experiment 3 | 8/10 | 4/20
a1 | As for m2 | Alphabet Arithmetic task | 12/7 | 4/21
Footnotes
1Variation in linear parameters can, however, affect the strength of distortion. For
example, as b for a component curve approaches zero it is effectively excluded from the
average, and so cannot cause distortion. Conversely, a very large b can extend
the range of smaller x values where the shape of the mean function is affected by a
component with a high rate. The asymptote parameter has no direct effect, but where,
for example, asymptotes greater than zero occur, a power function can often imitate an
exponential function by underestimating the asymptote.
2 For example, for y = bx^{-s} and bias = kbx^{-(s+2)} + O(x^{-(s+3)}), where k = M²s(s+1)/48, the approximate block-averaged power function is z = y + bias. The derivative of z, neglecting higher-order terms, is z′ = −sx^{-1}z − 2kbx^{-(s+3)}, and so the approximate relative rate of z is −z′/z = s/x + 2k/(x(x² + k)), which decreases faster than a hyperbolic function for practice and retention curves, as by definition they are decreasing functions, and so s > 0 and k > 0.
3Practice curves can show a small decrease in relative rate early in practice but a
constant relative rate later in practice, as indicated by Heathcote et al.’s (2000) finding
that an APEX function provided a significantly improved fit over the exponential
function in some cases. The slightly decreased preference for the exponential function in
block averages may indicate greater sensitivity to early changes in relative learning rate
due to reduced noise.
4Mandel (1963) provides the methods necessary to perform ANOVA allowing a
full linear model for subjects’ effects. Our current work is investigating extensions of his
approach that allow violations of the linear subjects’ effect model (e.g., due to the
nonlinear trends evident in Figure 4), and hence the potential for averaging distortion, to
be detected. Where violations of linearity occur, transformations must be applied or
ANOVA replaced by nonlinear mixed model analyses (e.g., Lindstrom & Bates, 1990).
5We have considered only zero-order (local average) smooths, but higher order
polynomial smooths are also available. In particular, odd-order smooths are useful
because they are much less subject to end effects than even-order smooths (Fan &
Gijbels, 1996). First-order smoothing (i.e., approximating each local neighbourhood by
the line of best fit) is particularly attractive for the analysis of learning curves because it
provides point estimates of both their magnitude and derivative. Although the topic is
beyond the scope of the present work, the application of first-order smooths to the
estimation of relative rates and, hence, development of a non-parametric characterisation
of curve shape is a promising direction for future work. Smooths can also be applied to
the testing of parametric curves (Azzalini, Bowman & Hardle, 1989; Brown & Heathcote,
2000).
Acknowledgements
We would like to acknowledge the advice of Prof. D. J. K. Mewhort on writing
style and the support of an Australian Research Council Large Grant to AH.
Figure Captions
Figure 1. Three exponential functions, y = be^{-rx}, and their average, for x = 1, 2, …, 25.
Figure 2. Percentage of fits where a three-parameter exponential function provided a
better fit than a three-parameter power function in un-averaged data compared to fits for
data arithmetically and geometrically averaged over subjects. Note that where bars appear
to be absent this indicates equal (50%) preference.
Figure 3. Percentage of fits where a three-parameter exponential function provided a
better fit than a three-parameter power function in un-averaged data compared to fits to
short and long block arithmetic averages. Note that only short blocks were used for the
k1 and k2 data sets due to the design of the experiment.
Figure 4. The three exponential functions from Figure 1 plotted against their mean.
[Figure 1 appears here: the three exponential components (Component 1: b = 5, r = 1; Component 2: b = 2.236, r = 0.55; Component 3: b = 1, r = 0.1) and their Mean Function, plotted as y (0.0-1.5) against x (0-25).]
Figure 1
[Figure 2 appears here: bar chart titled “Averages over Subjects” showing % Exponential (0-100) for each Data Set (m4, m1, m3, s2, k2, m2, v3, s3, s1, v2, m5, c3, a1, k1, v1, c1, c2), with bars for Not Averaged, Arithmetic Average, and Geometric Average.]
Figure 2
[Figure 3 appears here: bar chart titled “Averages over Blocks” showing % Exponential (50-95) for each Data Set (m4, m1, s2, m3, k2, m2, v3, s3, s1, v2, m5, c3, v1, a1, k1, c1, c2), with bars for Not Averaged, Short Blocks, and Long Blocks.]
Figure 3
[Figure 4 appears here: the three exponential components from Figure 1 (Y, 0-2) plotted against their Mean Function (0.0-1.4).]
Figure 4