The Annals of Statistics
2000, Vol. 28, No. 2, 337–407
SPECIAL INVITED PAPER
ADDITIVE LOGISTIC REGRESSION:
A STATISTICAL VIEW OF BOOSTING
By Jerome Friedman,$^1$ Trevor Hastie$^{2,3}$ and Robert Tibshirani$^{2,4}$
Stanford University
Boosting is one of the most important recent developments in classi-
fication methodology. Boosting works by sequentially applying a classifica-
tion algorithm to reweighted versions of the training data and then taking
a weighted majority vote of the sequence of classifiers thus produced. For
many classification algorithms, this simple strategy results in dramatic
improvements in performance. We show that this seemingly mysterious
phenomenon can be understood in terms of well-known statistical princi-
ples, namely additive modeling and maximum likelihood. For the two-class
problem, boosting can be viewed as an approximation to additive modeling
on the logistic scale using maximum Bernoulli likelihood as a criterion.
We develop more direct approximations and show that they exhibit nearly
identical results to boosting. Direct multiclass generalizations based on
multinomial likelihood are derived that exhibit performance comparable
to other recently proposed multiclass generalizations of boosting in most
situations, and far superior in some. We suggest a minor modification to
boosting that can reduce computation, often by factors of 10 to 50. Finally,
we apply these insights to produce an alternative formulation of boosting
decision trees. This approach, based on best-first truncated tree induction,
often leads to better performance, and can provide interpretable descrip-
tions of the aggregate decision rule. It is also much faster computationally,
making it more suitable to large-scale data mining applications.
1. Introduction. The starting point for this paper is an interesting pro-
cedure called “boosting,” which is a way of combining the performance of many
“weak” classifiers to produce a powerful “committee.” Boosting was proposed in
the computational learning theory literature [Schapire (1990), Freund (1995),
Freund and Schapire (1997)] and has since received much attention.
While boosting has evolved somewhat over the years, we describe the most
commonly used version of the AdaBoost procedure [Freund and Schapire
Received August 1998; revised December 1999.
1Also at Stanford Linear Accelerator Center, Stanford, CA 94305. Supported in part by Dept.
of Energy Contract DE-AC03-76 SF 00515 and NSF Grant DMS-97-64431.
2Also at Division of BioStatistics, Dept. of Health, Research and Policy, Stanford University,
Stanford, CA 94305.
3Supported in part by NSF Grants DMS-95-04495, DMS-98-03645 and NIH Grant
ROI-CA-72028-01.
4Supported in part by Natural Sciences and Engineering Research Council of Canada.
AMS 1991 subject classifications. 62G05, 62G07, 68T10, 68T05.
Key words and phrases. classification, tree, nonparametric estimation, stagewise fitting,
machine learning.
(1996b)], which we call Discrete AdaBoost. This is essentially the same as AdaBoost.M1 for binary data in Freund and Schapire. Here is a concise description of AdaBoost in the two-class classification setting. We have training data $(x_1, y_1), \ldots, (x_N, y_N)$ with $x_i$ a vector-valued feature and $y_i = -1$ or $1$. We define $F(x) = \sum_{m=1}^{M} c_m f_m(x)$, where each $f_m(x)$ is a classifier producing values plus or minus 1 and the $c_m$ are constants; the corresponding prediction is $\mathrm{sign}(F(x))$. The AdaBoost procedure trains the classifiers $f_m(x)$ on weighted versions of the training sample, giving higher weight to cases that are currently misclassified. This is done for a sequence of weighted samples, and then the final classifier is defined to be a linear combination of the classifiers from each stage. A detailed description of Discrete AdaBoost is given in the boxed display titled Algorithm 1.
Much has been written about the success of AdaBoost in producing accurate classifiers. Many authors have explored the use of a tree-based classifier for $f_m(x)$ and have demonstrated that it consistently produces significantly lower
error rates than a single decision tree. In fact, Breiman (1996) (referring to
a NIPS workshop) called AdaBoost with trees the “best off-the-shelf classifier
in the world” [see also Breiman (1998b)]. Interestingly, in many examples the
test error seems to consistently decrease and then level off as more classifiers
are added, rather than ultimately increase. For some reason, it seems that
AdaBoost is resistant to overfitting.
Figure 1 shows the performance of Discrete AdaBoost on a synthetic classification task, using an adaptation of CART™ [Breiman, Friedman, Olshen and Stone (1984)] as the base classifier. This adaptation grows fixed-size trees
in a “best-first” manner (see Section 8). Included in the figure is the bagged
tree [Breiman (1996)] which averages trees grown on bootstrap resampled
versions of the training data. Bagging is purely a variance-reduction tech-
nique, and since trees tend to have high variance, bagging often produces
good results.
Discrete AdaBoost [Freund and Schapire (1996b)]

1. Start with weights $w_i = 1/N$, $i = 1, \ldots, N$.
2. Repeat for $m = 1, 2, \ldots, M$:
   (a) Fit the classifier $f_m(x) \in \{-1, 1\}$ using weights $w_i$ on the training data.
   (b) Compute $\mathrm{err}_m = E_w[1_{[y \neq f_m(x)]}]$, $c_m = \log((1 - \mathrm{err}_m)/\mathrm{err}_m)$.
   (c) Set $w_i \leftarrow w_i \exp[c_m 1_{[y_i \neq f_m(x_i)]}]$, $i = 1, 2, \ldots, N$, and renormalize so that $\sum_i w_i = 1$.
3. Output the classifier $\mathrm{sign}[\sum_{m=1}^{M} c_m f_m(x)]$.

Algorithm 1. $E_w$ represents expectation over the training data with weights $w = (w_1, w_2, \ldots, w_N)$, and $1_{[S]}$ is the indicator of the set $S$. At each iteration, AdaBoost increases the weights of the observations misclassified by $f_m(x)$ by a factor that depends on the weighted training error.
Fig. 1. Test error for Bagging, Discrete AdaBoost and Real AdaBoost on a simulated two-class nested spheres problem (see Section 6). There are 2000 training data points in ten dimensions, and the Bayes error rate is zero. All trees are grown best-first without pruning. The leftmost iteration corresponds to a single tree.
Early versions of AdaBoost used a resampling scheme to implement step 2
of Algorithm 1, by weighted sampling from the training data. This suggested a
connection with bagging and that a major component of the success of boosting
has to do with variance reduction.
However, boosting performs comparably well when:
1. A weighted tree-growing algorithm is used in step 2 rather than weighted resampling, where each training observation is assigned its weight $w_i$. This removes the randomization component essential in bagging.
2. “Stumps” are used for the weak learners. Stumps are single-split trees with
only two terminal nodes. These typically have low variance but high bias.
Bagging performs very poorly with stumps [Figure 1 (top right panel)].
These observations suggest that boosting is capable of both bias and vari-
ance reduction, and thus differs fundamentally from bagging.
The base classifier in Discrete AdaBoost produces a classification rule $f_m(x): \mathscr{X} \to \{-1, 1\}$, where $\mathscr{X}$ is the domain of the predictive features $x$. Freund and Schapire (1996b), Breiman (1998a) and Schapire and Singer (1998) have suggested various modifications to improve the boosting algorithms.
A generalization of Discrete AdaBoost appeared in Freund and Schapire (1996b), and was developed further in Schapire and Singer (1998), that uses real-valued "confidence-rated" predictions rather than the $\{-1, 1\}$ of Discrete AdaBoost. The weak learner for this generalized boosting produces a mapping $f_m(x): \mathscr{X} \to \mathbf{R}$; the sign of $f_m(x)$ gives the classification, and $|f_m(x)|$ a measure of the "confidence" in the prediction. This real-valued contribution is combined with the previous contributions with a multiplier $c_m$ as before, and a slightly different recipe for $c_m$ is provided.
We present a generalized version of AdaBoost, which we call Real AdaBoost in Algorithm 2, in which the weak learner returns a class probability estimate $p_m(x) = \hat{P}_w(y = 1 \mid x) \in [0, 1]$. The contribution to the final classifier is half the logit-transform of this probability estimate. One form of Schapire and Singer's generalized AdaBoost coincides with Real AdaBoost, in the special case where the weak learner is a decision tree. Real AdaBoost tends to perform the best in our simulated examples in Figure 1, especially with stumps, although we see with 100-node trees Discrete AdaBoost overtakes Real AdaBoost after 200 iterations.
In this paper we analyze the AdaBoost procedures from a statistical per-
spective. The main result of our paper rederives AdaBoost as a method for
fitting an additive model $\sum_m f_m(x)$ in a forward stagewise manner. This simple fact largely explains why it tends to outperform a single base learner.
By fitting an additive model of different and potentially simple functions, it
expands the class of functions that can be approximated.
Real AdaBoost

1. Start with weights $w_i = 1/N$, $i = 1, 2, \ldots, N$.
2. Repeat for $m = 1, 2, \ldots, M$:
   (a) Fit the classifier to obtain a class probability estimate $p_m(x) = \hat{P}_w(y = 1 \mid x) \in [0, 1]$, using weights $w_i$ on the training data.
   (b) Set $f_m(x) \leftarrow \frac{1}{2} \log[p_m(x)/(1 - p_m(x))] \in \mathbf{R}$.
   (c) Set $w_i \leftarrow w_i \exp[-y_i f_m(x_i)]$, $i = 1, 2, \ldots, N$, and renormalize so that $\sum_i w_i = 1$.
3. Output the classifier $\mathrm{sign}[\sum_{m=1}^{M} f_m(x)]$.

Algorithm 2. The Real AdaBoost algorithm uses class probability estimates $p_m(x)$ to construct real-valued contributions $f_m(x)$.
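For comparison with the Discrete AdaBoost sketch above, here is a minimal Python sketch of Algorithm 2. The scikit-learn tree and the clipping constant `eps` (to keep the logits finite) are our own assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def real_adaboost(X, y, M=100, eps=1e-5):
    """Algorithm 2 sketch: y in {-1, +1}; weak learner estimates P_w(y=1|x)."""
    N = len(y)
    w = np.full(N, 1.0 / N)
    stages = []
    for m in range(M):
        tree = DecisionTreeClassifier(max_depth=1)
        tree.fit(X, (y == 1).astype(int), sample_weight=w)   # (a) weighted fit
        p = tree.predict_proba(X)[:, 1].clip(eps, 1 - eps)
        f = 0.5 * np.log(p / (1 - p))                        # (b) half-logit
        w = w * np.exp(-y * f)                               # (c) reweight
        w /= w.sum()
        stages.append(tree)
    def classify(Xnew):
        F = np.zeros(len(Xnew))
        for tree in stages:                                  # sum half-logits
            p = tree.predict_proba(Xnew)[:, 1].clip(eps, 1 - eps)
            F += 0.5 * np.log(p / (1 - p))
        return np.sign(F)
    return classify
```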
Given this fact, Discrete and Real AdaBoost appear unnecessarily complicated. A much simpler way to fit an additive model would be to minimize squared-error loss $E(y - \sum_m f_m(x))^2$ in a forward stagewise manner. At the $m$th stage we fix $f_1(x), \ldots, f_{m-1}(x)$ and minimize squared error to obtain $f_m(x) = E(y - \sum_{j=1}^{m-1} f_j(x) \mid x)$. This is just "fitting of residuals" and is commonly used in linear regression and additive modeling [Hastie and Tibshirani (1990)].
However, squared-error loss is not a good choice for classification (see Figure 2 in Section 4.2), and hence "fitting of residuals" doesn't work very well in that case. We show that AdaBoost fits an additive model using a better loss function for classification. Specifically, we show that AdaBoost fits an additive logistic regression model, using a criterion similar to, but not the same as, the binomial log-likelihood. [If $p(x)$ is the class probability, an additive logistic regression approximates $\log[p(x)/(1 - p(x))]$ by an additive function $\sum_m f_m(x)$.] We then go on to derive a new boosting procedure, "LogitBoost," that directly optimizes the binomial log-likelihood.
The original boosting techniques [Schapire (1990), Freund (1995)] prov-
ably improved or “boosted” the performance of a single classifier by produc-
ing a “majority vote” of similar classifiers. These algorithms then evolved
into more adaptive and practical versions such as AdaBoost, whose success
was still explained in terms of boosting individual classifiers by a “weighted
majority vote” or “weighted committee.” We believe that this view, along with
the appealing name “boosting” inherited by AdaBoost, may have led to some
of the mystery about how and why the method works. As mentioned above,
we instead view boosting as a technique for fitting an additive model.
Section 2 gives a short history of the boosting idea. In Section 3 we briefly
review additive modeling. Section 4 shows how boosting can be viewed as an
additive model estimator and proposes some new boosting methods for the two-
class case. The multiclass problem is studied in Section 5. Simulated and real
data experiments are discussed in Sections 6 and 7. Our tree-growing imple-
mentation, using truncated best-first trees, is described in Section 8. Weight
trimming to speed up computation is discussed in Section 9, and we briefly
describe generalizations of boosting in Section 10. We end with a discussion
in Section 11.
2. A brief history of boosting. Schapire (1990) developed the first simple boosting procedure in the PAC-learning framework [Valiant (1984), Kearns and Vazirani (1994)]. Schapire showed that a weak learner could always improve its performance by training two additional classifiers on filtered versions of the input data stream. A weak learner is an algorithm for producing a two-class classifier with performance guaranteed (with high probability) to be significantly better than a coin flip. After learning an initial classifier $h_1$ on the first $N$ training points:

1. $h_2$ is learned on a new sample of $N$ points, half of which are misclassified by $h_1$.
2. $h_3$ is learned on $N$ points for which $h_1$ and $h_2$ disagree.
3. The boosted classifier is $h_B = \mathrm{Majority\ Vote}(h_1, h_2, h_3)$.
Schapire's "Strength of Weak Learnability" theorem proves that $h_B$ has improved performance over $h_1$.
Freund (1995) proposed a “boost by majority” variation which combined
many weak learners simultaneously and improved the performance of the sim-
ple boosting algorithm of Schapire. The theory supporting both of these algo-
rithms requires the weak learner to produce a classifier with a fixed error rate.
This led to the more adaptive and realistic AdaBoost [Freund and Schapire
(1996b)] and its offspring, where this assumption was dropped.
Freund and Schapire (1996b) and Schapire and Singer (1998) provide some
theory to support their algorithms, in the form of upper bounds on generaliza-
tion error. This theory has evolved in the computational learning community,
initially based on the concepts of PAC learning. Other theories attempting
to explain boosting come from game theory [Freund and Schapire (1996a),
Breiman (1997)] and VC theory [Schapire, Freund, Bartlett and Lee (1998)].
The bounds and the theory associated with the AdaBoost algorithms are inter-
esting, but tend to be too loose to be of practical importance. In practice, boost-
ing achieves results far more impressive than the bounds would imply.
3. Additive models. We show in the next section that AdaBoost fits an additive model $F(x) = \sum_{m=1}^{M} c_m f_m(x)$. We believe that viewing current boosting procedures as stagewise algorithms for fitting additive models goes a long way toward understanding their performance. Additive models have a long history in statistics, and so we first give some examples here.
3.1. Additive regression models. We initially focus on the regression problem, where the response $y$ is quantitative, $x$ and $y$ have some joint distribution, and we are interested in modeling the mean $E(y \mid x) = F(x)$. The additive model has the form

$$F(x) = \sum_{j=1}^{p} f_j(x_j). \tag{1}$$

There is a separate function $f_j(x_j)$ for each of the $p$ input variables $x_j$. More generally, each component $f_j$ is a function of a small, prespecified subset of the input variables. The backfitting algorithm [Friedman and Stuetzle (1981), Buja, Hastie and Tibshirani (1989)] is a convenient modular "Gauss–Seidel" algorithm for fitting additive models. A backfitting update is

$$f_j(x_j) \leftarrow E\Big[y - \sum_{k \neq j} f_k(x_k) \,\Big|\, x_j\Big] \quad \text{for } j = 1, 2, \ldots, p, 1, 2, \ldots. \tag{2}$$
Any method or algorithm for estimating a function of $x_j$ can be used to obtain an estimate of the conditional expectation in (2). In particular, this can include nonparametric smoothing algorithms, such as local regression or smoothing splines. In the right-hand side, all the latest versions of the functions $f_k$ are used in forming the partial residuals. The backfitting cycles are repeated until convergence. Under fairly general conditions, backfitting can be shown to converge to the minimizer of $E(y - F(x))^2$ [Buja, Hastie and Tibshirani (1989)].
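A minimal Python sketch of the backfitting update (2). The caller supplies a hypothetical one-dimensional `smoother` (e.g., a local regression or smoothing-spline fitter); that interface and the cycle count are our assumptions, not the paper's.

```python
import numpy as np

def backfit(X, y, smoother, n_cycles=10):
    """Fit an additive model F(x) = sum_j f_j(x_j) by backfitting.

    X: (N, p) array. smoother(x, r) must return a callable fitted to the
    partial residuals r as a function of the single feature x.
    """
    N, p = X.shape
    fits = [lambda x: np.zeros_like(x)] * p   # start with all f_j = 0
    f_vals = np.zeros((N, p))                 # current values f_j(x_ij)
    for _ in range(n_cycles):                 # Gauss-Seidel cycles
        for j in range(p):
            # partial residuals: remove every component except the jth
            r = y - f_vals.sum(axis=1) + f_vals[:, j]
            fits[j] = smoother(X[:, j], r)    # the update (2)
            f_vals[:, j] = fits[j](X[:, j])
    return fits
```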
3.2. Extended additive models. More generally, one can consider additive models whose elements $\{f_m(x)\}_1^M$ are functions of potentially all of the input features $x$. Usually in this context the $f_m(x)$ are taken to be simple functions characterized by a set of parameters $\gamma$ and a multiplier $\beta_m$,

$$f_m(x) = \beta_m b(x; \gamma_m). \tag{3}$$

The additive model then becomes

$$F_M(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m). \tag{4}$$

For example, in single-hidden-layer neural networks $b(x; \gamma) = \sigma(\gamma^t x)$, where $\sigma(\cdot)$ is a sigmoid function and $\gamma$ parameterizes a linear combination of the input features. In signal processing, wavelets are a popular choice with $\gamma$ parameterizing the location and scale shifts of a "mother" wavelet $b(x)$. In these applications $\{b(x; \gamma_m)\}_1^M$ are generally called "basis functions" since they span a function subspace.

If least-squares is used as a fitting criterion, one can solve for an optimal set of parameters through a generalized backfitting algorithm with updates

$$\{\beta_m, \gamma_m\} \leftarrow \arg\min_{\beta, \gamma} E\Big[y - \sum_{k \neq m} \beta_k b(x; \gamma_k) - \beta b(x; \gamma)\Big]^2 \tag{5}$$

for $m = 1, 2, \ldots, M$ in cycles until convergence. Alternatively, one can use a "greedy" forward stepwise approach,

$$\{\beta_m, \gamma_m\} \leftarrow \arg\min_{\beta, \gamma} E\big[y - F_{m-1}(x) - \beta b(x; \gamma)\big]^2 \tag{6}$$

for $m = 1, 2, \ldots, M$, where $\{\beta_k, \gamma_k\}_1^{m-1}$ are fixed at their corresponding solution values at earlier iterations. This is the approach used by Mallat and Zhang (1993) in "matching pursuit," where the $b(x; \gamma)$ are selected from an over-complete dictionary of wavelet bases. In the language of boosting, $f(x) = \beta b(x; \gamma)$ would be called a "weak learner" and $F_M(x)$ (4) the "committee." If decision trees were used as the weak learner, the parameters $\gamma$ would represent the splitting variables, split points, the constants in each terminal node and number of terminal nodes of each tree.

Note that the backfitting procedure (5) or its greedy cousin (6) only require an algorithm for fitting a single weak learner (3) to data. This base algorithm is simply applied repeatedly to modified versions of the original data

$$y_m \leftarrow y - \sum_{k \neq m} f_k(x).$$

In the forward stepwise procedure (6), the modified output $y_m$ at the $m$th iteration depends only on its value $y_{m-1}$ and the solution $f_{m-1}(x)$ at the previous iteration,

$$y_m = y_{m-1} - f_{m-1}(x). \tag{7}$$

At each step $m$, the previous output values $y_{m-1}$ are modified (7) so that the previous model $f_{m-1}(x)$ has no explanatory power on the new outputs $y_m$. One can therefore view this as a procedure for boosting a weak learner $f(x) = \beta b(x; \gamma)$ to form a powerful committee $F_M(x)$ (4).
3.3. Classification problems. For the classification problem, we learn from Bayes theorem that all we need is $P(y = j \mid x)$, the posterior or conditional class probabilities. One could transfer all the above regression machinery across to the classification domain by simply noting that $E(1_{[y=j]} \mid x) = P(y = j \mid x)$, where $1_{[y=j]}$ is the 0/1 indicator variable representing class $j$. While this works fairly well in general, several problems have been noted [Hastie, Tibshirani and Buja (1994)] for constrained regression methods. The estimates are typically not confined to $[0, 1]$, and severe masking problems can occur when there are more than two classes. A notable exception is when trees are used as the regression method, and in fact this is the approach used by Breiman, Friedman, Olshen and Stone (1984).
Logistic regression is a popular approach used in statistics for overcoming these problems. For a two-class problem, an additive logistic model has the form

$$\log \frac{P(y = 1 \mid x)}{P(y = -1 \mid x)} = \sum_{m=1}^{M} f_m(x). \tag{8}$$

The monotone logit transformation on the left guarantees that for any values of $F(x) = \sum_{m=1}^{M} f_m(x) \in \mathbf{R}$, the probability estimates lie in $[0, 1]$; inverting, we get

$$p(x) = P(y = 1 \mid x) = \frac{e^{F(x)}}{1 + e^{F(x)}}. \tag{9}$$

Here we have given a general additive form for $F(x)$; special cases exist that are well known in statistics. In particular, linear logistic regression [e.g., McCullagh and Nelder (1989)] and additive logistic regression [Hastie and Tibshirani (1990)] are popular. These models are usually fit by maximizing the binomial log-likelihood and enjoy all the associated asymptotic optimality features of maximum likelihood estimation.
A generalized version of backfitting (2), called "Local Scoring" in Hastie and Tibshirani (1990), can be used to fit the additive logistic model by maximum likelihood. Starting with guesses $f_1(x_1), \ldots, f_p(x_p)$, $F(x) = \sum_k f_k(x_k)$ and $p(x)$ defined in (9), we form the working response

$$z = F(x) + \frac{1_{[y=1]} - p(x)}{p(x)(1 - p(x))}. \tag{10}$$

We then apply backfitting to the response $z$ with observation weights $p(x)(1 - p(x))$ to obtain new $f_k(x_k)$. This process is repeated until convergence. The forward stagewise version (6) of this procedure bears a close similarity to the LogitBoost algorithm described later in the paper.
4. AdaBoost: an additive logistic regression model. In this section
we show that the AdaBoost algorithms (Discrete and Real) can be interpreted
as stagewise estimation procedures for fitting an additive logistic regression
model. They optimize an exponential criterion which to second order is equiva-
lent to the binomial log-likelihood criterion. We then propose a more standard
likelihood-based boosting procedure.
4.1. An exponential criterion. Consider minimizing the criterion

$$J(F) = E\big[e^{-yF(x)}\big] \tag{11}$$

for estimation of $F(x)$. Here $E$ represents expectation; depending on the context, this may be a population expectation (with respect to a probability distribution) or else a sample average. $E_w$ indicates a weighted expectation. Lemma 1 shows that the function $F(x)$ that minimizes $J(F)$ is the symmetric logistic transform of $P(y = 1 \mid x)$.

Lemma 1. $E[e^{-yF(x)}]$ is minimized at

$$F(x) = \frac{1}{2} \log \frac{P(y = 1 \mid x)}{P(y = -1 \mid x)}. \tag{12}$$

Hence

$$P(y = 1 \mid x) = \frac{e^{F(x)}}{e^{-F(x)} + e^{F(x)}}, \tag{13}$$

$$P(y = -1 \mid x) = \frac{e^{-F(x)}}{e^{-F(x)} + e^{F(x)}}. \tag{14}$$

Proof. While $E$ entails expectation over the joint distribution of $y$ and $x$, it is sufficient to minimize the criterion conditional on $x$:

$$E\big(e^{-yF(x)} \mid x\big) = P(y = 1 \mid x)e^{-F(x)} + P(y = -1 \mid x)e^{F(x)},$$

$$\frac{\partial E(e^{-yF(x)} \mid x)}{\partial F(x)} = -P(y = 1 \mid x)e^{-F(x)} + P(y = -1 \mid x)e^{F(x)}.$$

The result follows by setting the derivative to zero.
This exponential criterion appeared in Schapire and Singer (1998), motivated as an upper bound on misclassification error. Breiman (1997) also used this criterion in his results on AdaBoost and prediction games. The usual logistic transform does not have the factor $\frac{1}{2}$ as in (12); by multiplying the numerator and denominator in (13) by $e^{F(x)}$, we get the usual logistic model

$$p(x) = \frac{e^{2F(x)}}{1 + e^{2F(x)}}. \tag{15}$$

Hence the two models are equivalent up to a factor 2.
Corollary 1. If $E$ is replaced by averages over regions of $x$ where $F(x)$ is constant (as in the terminal node of a decision tree), the same result applies to the sample proportions of $y = 1$ and $y = -1$.
Results 1 and 2 show that both Discrete and Real AdaBoost, as well as the Generalized AdaBoost of Freund and Schapire (1996b), can be motivated as iterative algorithms for optimizing the (population-based) exponential criterion. The results share the same format.

1. Given an imperfect $F(x)$, an update $F(x) + f(x)$ is proposed based on the population version of the criterion.
2. The update, which involves population conditional expectations, is imperfectly approximated for finite data sets by some restricted class of estimators, such as averages in terminal nodes of trees.
Hastie and Tibshirani (1990) use a similar derivation of the local scoring
algorithm used in fitting generalized additive models. Many terms are typi-
cally required in practice, since at each stage the approximation to conditional
expectation is rather crude. Because of Lemma 1, the resulting algorithms
can be interpreted as a stagewise estimation procedure for fitting an additive
logistic regression model. The derivations are sufficiently different to warrant
separate treatment.
Result 1. The Discrete AdaBoost algorithm (population version) builds an additive logistic regression model via Newton-like updates for minimizing $E[e^{-yF(x)}]$.

Proof. Let $J(F) = E[e^{-yF(x)}]$. Suppose we have a current estimate $F(x)$ and seek an improved estimate $F(x) + cf(x)$. For fixed $c$ (and $x$), we expand $J(F(x) + cf(x))$ to second order about $f(x) = 0$,

$$J(F + cf) = E\big[e^{-y(F(x) + cf(x))}\big] \approx E\big[e^{-yF(x)}\big(1 - ycf(x) + c^2 y^2 f(x)^2/2\big)\big] = E\big[e^{-yF(x)}\big(1 - ycf(x) + c^2/2\big)\big],$$

since $y^2 = 1$ and $f(x)^2 = 1$. Minimizing pointwise with respect to $f(x) \in \{-1, 1\}$, we write

$$f(x) = \arg\min_f E_w\big(1 - ycf(x) + c^2/2 \mid x\big). \tag{16}$$

Here the notation $E_w(\cdot \mid x)$ refers to a weighted conditional expectation, where $w = w(x, y) = e^{-yF(x)}$, and

$$E_w\big[g(x, y) \mid x\big] \stackrel{\mathrm{def}}{=} \frac{E\big[w(x, y)\, g(x, y) \mid x\big]}{E\big[w(x, y) \mid x\big]}.$$

For $c > 0$, minimizing (16) is equivalent to maximizing

$$E_w\big[yf(x)\big]. \tag{17}$$

The solution is

$$f(x) = \begin{cases} 1, & \text{if } E_w(y \mid x) = P_w(y = 1 \mid x) - P_w(y = -1 \mid x) > 0, \\ -1, & \text{otherwise.} \end{cases} \tag{18}$$

Note that

$$E_w\big[yf(x)\big] = 1 - E_w\big[(y - f(x))^2\big]/2 \tag{19}$$

[again using $f(x)^2 = y^2 = 1$]. Thus minimizing a quadratic approximation to the criterion leads to a weighted least-squares choice of $f(x) \in \{-1, 1\}$, and this constitutes the Newton-like step.

Given $f(x) \in \{-1, 1\}$, we can directly minimize $J(F + cf)$ to determine $c$:

$$c = \arg\min_c E_w\big[e^{-cyf(x)}\big] = \frac{1}{2} \log \frac{1 - \mathrm{err}}{\mathrm{err}}, \tag{20}$$

where $\mathrm{err} = E_w\big[1_{[y \neq f(x)]}\big]$. Note that $c$ can be negative if the weak learner does worse than 50%, in which case it automatically reverses the polarity. Combining these steps we get the update for $F(x)$,

$$F(x) \leftarrow F(x) + \frac{1}{2} \log\frac{1 - \mathrm{err}}{\mathrm{err}}\, f(x).$$

In the next iteration the new contribution $cf(x)$ to $F(x)$ augments the weights

$$w(x, y) \leftarrow w(x, y)\, e^{-cf(x)y}.$$

Since $-yf(x) = 2 \times 1_{[y \neq f(x)]} - 1$, we see that the update is equivalent to

$$w(x, y) \leftarrow w(x, y) \exp\Big(\log\frac{1 - \mathrm{err}}{\mathrm{err}}\, 1_{[y \neq f(x)]}\Big).$$

Thus the function and weight updates are of an identical form to those used in Discrete AdaBoost.

This population version of AdaBoost translates naturally to a data version using trees. The weighted conditional expectation in (18) is approximated by the terminal-node weighted averages in a tree. In particular, the weighted least-squares criterion is used to grow the tree-based classifier $f(x)$, and given $f(x)$, the constant $c$ is based on the weighted training error.
Note that after each Newton step, the weights change, and hence the tree
configuration will change as well. This adds an adaptive twist to the data
version of a Newton-like algorithm.
Parts of this derivation for AdaBoost can be found in Breiman (1997) and
Schapire and Singer (1998), but without making the connection to additive
logistic regression models.
Corollary 2. After each update to the weights, the weighted misclassification error of the most recent weak learner is 50%.

Proof. This follows by noting that the $c$ that minimizes $J(F + cf)$ satisfies

$$\frac{\partial J(F + cf)}{\partial c} = -E\big[e^{-y(F(x) + cf(x))}\, yf(x)\big] = 0. \tag{21}$$

The result follows since $yf(x)$ is 1 for a correct and $-1$ for an incorrect classification.
Schapire and Singer (1998) give the interpretation that the weights are
updated to make the new weighted problem maximally difficult for the next
weak learner.
The Discrete AdaBoost algorithm expects the tree or other "weak learner" to deliver a classifier $f(x) \in \{-1, 1\}$. Result 1 requires minor modifications to accommodate $f(x) \in \mathbf{R}$, as in the generalized AdaBoost algorithms [Freund and Schapire (1996b), Schapire and Singer (1998)]; the estimate for $c_m$ differs. Fixing $f$, we see that the minimizer of (20) must satisfy

$$E_w\big[yf(x)\, e^{-cyf(x)}\big] = 0. \tag{22}$$

If $f$ is not discrete, this equation has no closed-form solution for $c$, and requires an iterative solution such as Newton–Raphson.
We now derive the Real AdaBoost algorithm, which uses weighted probability estimates to update the additive logistic model, rather than the classifications themselves. Again we derive the population updates and then apply them to data by approximating conditional expectations by terminal-node averages in trees.
Result 2. The Real AdaBoost algorithm fits an additive logistic regression model by stagewise and approximate optimization of $J(F) = E[e^{-yF(x)}]$.

Proof. Suppose we have a current estimate $F(x)$ and seek an improved estimate $F(x) + f(x)$ by minimizing $J(F(x) + f(x))$ at each $x$:

$$J(F(x) + f(x)) = E\big[e^{-yF(x)} e^{-yf(x)} \mid x\big] = e^{-f(x)} E\big[e^{-yF(x)} 1_{[y=1]} \mid x\big] + e^{f(x)} E\big[e^{-yF(x)} 1_{[y=-1]} \mid x\big].$$

Dividing through by $E[e^{-yF(x)} \mid x]$ and setting the derivative with respect to $f(x)$ to zero, we get

$$f(x) = \frac{1}{2} \log \frac{E_w\big[1_{[y=1]} \mid x\big]}{E_w\big[1_{[y=-1]} \mid x\big]} \tag{23}$$

$$= \frac{1}{2} \log \frac{P_w(y = 1 \mid x)}{P_w(y = -1 \mid x)}, \tag{24}$$

where $w(x, y) = \exp(-yF(x))$. The weights get updated by

$$w(x, y) \leftarrow w(x, y)\, e^{-yf(x)}.$$

The algorithm as presented would stop after one iteration. In practice we use crude approximations to conditional expectation, such as decision trees or other constrained models, and hence many steps are required.
Corollary 3. At the optimal $F(x)$, the weighted conditional mean of $y$ is 0.

Proof. If $F(x)$ is optimal, we have

$$\frac{\partial J(F(x))}{\partial F(x)} = -E\big[e^{-yF(x)} y\big] = 0. \tag{25}$$

We can think of the weights as providing an alternative to residuals for the binary classification problem. At the optimal function $F$, there is no further information about $F$ in the weighted conditional distribution of $y$. If there is, we use it to update $F$.
After iteration $M$ in either the Discrete or Real AdaBoost algorithms, we have composed an additive function of the form

$$F(x) = \sum_{m=1}^{M} f_m(x), \tag{26}$$

where each of the components is found in a greedy forward stagewise fashion, fixing the earlier components. Our term "stagewise" refers to a similar approach in statistics:

1. Variables are included sequentially in a stepwise regression.
2. The coefficients of variables already included receive no further adjustment.
4.2. Why $E[e^{-yF(x)}]$? So far the only justification for this exponential criterion is that it has a sensible population minimizer, and the algorithm described above performs well on real data. In addition:

1. Schapire and Singer (1998) motivate $e^{-yF(x)}$ as a differentiable upper bound to misclassification error $1_{[yF < 0]}$ (see Figure 2).
2. The AdaBoost algorithm that it generates is extremely modular, requiring at each iteration the retraining of a classifier on a weighted training database.

Let $y^* = (y + 1)/2$, taking values 0, 1, and parametrize the binomial probabilities by

$$p(x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}.$$

The binomial log-likelihood is

$$l(y^*, p(x)) = y^* \log p(x) + (1 - y^*) \log(1 - p(x)) = -\log\big(1 + e^{-2yF(x)}\big). \tag{27}$$
Fig. 2. A variety of loss functions for estimating a function $F(x)$ for classification. The horizontal axis is $yF$, which is negative for errors and positive for correct classifications. All the loss functions are monotone in $yF$ and are centered and scaled to match $e^{-yF}$ at $F = 0$. The curve labeled "Log-likelihood" is the binomial log-likelihood or cross-entropy $y^* \log p + (1 - y^*)\log(1 - p)$. The curve labeled "Squared Error(p)" is $(y^* - p)^2$. The curve labeled "Squared Error(F)" is $(y - F)^2$ and increases once $yF$ exceeds 1, thereby increasingly penalizing classifications that are too correct.
Hence we see that:

3. The population minimizers of $-E[l(y^*, p(x))]$ and $E[e^{-yF(x)}]$ coincide. This is easily seen because the expected log-likelihood is maximized at the true probabilities $p(x) = P(y^* = 1 \mid x)$, which define the logit $F(x)$. By Lemma 1 we see that this is exactly the minimizer of $E[e^{-yF(x)}]$. In fact, the exponential criterion and the (negative) log-likelihood are equivalent to second order in a Taylor series around $F = 0$,

$$-l(y^*, p) \approx \exp(-yF) + \log 2 - 1. \tag{28}$$

Graphs of $\exp(-yF)$ and $\log(1 + e^{-2yF(x)})$ are shown in Figure 2, as a function of $yF$; positive values of $yF$ imply correct classification. Note that $\exp(-yF)$ itself is not a proper log-likelihood, as it does not equal the log of any probability mass function on plus or minus 1.

4. There is another way to view the criterion $J(F)$. It is easy to show that

$$e^{-yF(x)} = \frac{|y^* - p(x)|}{\sqrt{p(x)(1 - p(x))}}, \tag{29}$$

with $F(x) = \frac{1}{2}\log[p(x)/(1 - p(x))]$. The right-hand side is known as the $\chi$ statistic in the statistical literature. $\chi^2$ is a quadratic approximation to the log-likelihood, and so $\chi$ can be considered a "gentler" alternative.
One feature of both the exponential and log-likelihood criteria is that they are monotone and smooth. Even if the training error is zero, the criteria will drive the estimates towards purer solutions (in terms of probability estimates).
Why not estimate the $f_m$ by minimizing the squared error $E(y - F(x))^2$? If $F_{m-1}(x) = \sum_{j=1}^{m-1} f_j(x)$ is the current prediction, this leads to a forward stagewise procedure that does an unweighted fit to the response $y - F_{m-1}(x)$ at step $m$, as in (6). Empirically we have found that this approach works quite well, but is dominated by those that use monotone loss criteria. We believe that the nonmonotonicity of squared error loss (Figure 2) is the reason. Correct classifications, but with $yF(x) > 1$, incur increasing loss for increasing values of $|F(x)|$. This makes squared-error loss an especially poor approximation to misclassification error rate. Classifications that are "too correct" are penalized as much as misclassification errors.
4.3. Direct optimization of the binomial log-likelihood. In this section we explore algorithms for fitting additive logistic regression models by stagewise optimization of the Bernoulli log-likelihood. Here we focus again on the two-class case and will use a 0/1 response $y^*$ to represent the outcome. We represent the probability of $y^* = 1$ by $p(x)$, where

$$p(x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}. \tag{30}$$

Algorithm 3 gives the details.
LogitBoost (two classes)

1. Start with weights $w_i = 1/N$, $i = 1, 2, \ldots, N$, $F(x) = 0$ and probability estimates $p(x_i) = \frac{1}{2}$.
2. Repeat for $m = 1, 2, \ldots, M$:
   (a) Compute the working response and weights
   $$z_i = \frac{y_i^* - p(x_i)}{p(x_i)(1 - p(x_i))}, \qquad w_i = p(x_i)(1 - p(x_i)).$$
   (b) Fit the function $f_m(x)$ by a weighted least-squares regression of $z_i$ to $x_i$ using weights $w_i$.
   (c) Update $F(x) \leftarrow F(x) + \frac{1}{2} f_m(x)$ and $p(x) \leftarrow e^{F(x)}/(e^{F(x)} + e^{-F(x)})$.
3. Output the classifier $\mathrm{sign}[F(x)] = \mathrm{sign}[\sum_{m=1}^{M} f_m(x)]$.

Algorithm 3. An adaptive Newton algorithm for fitting an additive logistic regression model.
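A minimal Python sketch of Algorithm 3, using a scikit-learn regression tree as the weighted least-squares base learner. The clipping constants anticipate the numerical protections discussed at the end of this section; they, and the identity $p = 1/(1 + e^{-2F})$ used below, are our implementation choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logitboost(X, ystar, M=100, z_max=4.0):
    """Algorithm 3 sketch: ystar in {0, 1}."""
    N = len(ystar)
    F = np.zeros(N)
    p = np.full(N, 0.5)
    stages = []
    for m in range(M):
        w = np.clip(p * (1 - p), 1e-24, None)        # (a) weights, floored
        z = np.clip((ystar - p) / w, -z_max, z_max)  # (a) working response
        tree = DecisionTreeRegressor(max_depth=1)
        tree.fit(X, z, sample_weight=w)              # (b) weighted LS fit
        F += 0.5 * tree.predict(X)                   # (c) update F
        p = 1.0 / (1.0 + np.exp(-2.0 * F))           # p = e^F / (e^F + e^-F)
        stages.append(tree)
    def classify(Xnew):
        Fnew = 0.5 * sum(t.predict(Xnew) for t in stages)
        return (Fnew > 0).astype(int)                # sign(F) on the 0/1 scale
    return classify
```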
Result 3. The LogitBoost algorithm (two classes, population version) uses Newton steps for fitting an additive symmetric logistic model by maximum likelihood.

Derivation. Consider the update $F(x) + f(x)$ and the expected log-likelihood

$$E\big[l(F + f)\big] = E\big[2y^*\big(F(x) + f(x)\big) - \log\big(1 + e^{2(F(x) + f(x))}\big)\big]. \tag{31}$$

Conditioning on $x$, we compute the first and second derivative at $f(x) = 0$,

$$s(x) = \frac{\partial E[l(F(x) + f(x)) \mid x]}{\partial f(x)}\bigg|_{f(x)=0} = 2E\big(y^* - p(x) \mid x\big), \tag{32}$$

$$H(x) = \frac{\partial^2 E[l(F(x) + f(x)) \mid x]}{\partial f(x)^2}\bigg|_{f(x)=0} = -4E\big[p(x)(1 - p(x)) \mid x\big], \tag{33}$$

where $p(x)$ is defined in terms of $F(x)$. The Newton update is then

$$F(x) \leftarrow F(x) - H(x)^{-1} s(x) \tag{34}$$

$$= F(x) + \frac{1}{2} \frac{E(y^* - p(x) \mid x)}{E[p(x)(1 - p(x)) \mid x]} = F(x) + \frac{1}{2} E_w\Big(\frac{y^* - p(x)}{p(x)(1 - p(x))} \,\Big|\, x\Big), \tag{35}$$

where $w(x) = p(x)(1 - p(x))$. Equivalently, the Newton update $f(x)$ solves the weighted least-squares approximation [about $F(x)$] to the log-likelihood

$$\min_{f(x)} E_{w(x)}\Big[\Big(F(x) + \frac{1}{2} \frac{y^* - p(x)}{p(x)(1 - p(x))} - \big(F(x) + f(x)\big)\Big)^2\Big]. \tag{36}$$

The population algorithm described here translates immediately to an implementation on data when $E(\cdot \mid x)$ is replaced by a regression method, such as regression trees [Breiman, Friedman, Olshen and Stone (1984)]. While the role of the weights is somewhat artificial in the population case, they are not in any implementation; $w(x)$ is constant when conditioned on $x$, but the $w(x_i)$ in a terminal node of a tree, for example, depend on the current values $F(x_i)$, and will typically not be constant.
Sometimes the $w(x)$ get very small in regions of $x$ perceived [by $F(x)$] to be pure; that is, when $p(x)$ is close to 0 or 1. This can cause numerical problems in the construction of $z$, and lead to the following crucial implementation protections (see the sketch following this list):

1. If $y^* = 1$, then compute $z = (y^* - p)/[p(1 - p)]$ as $1/p$. Since this number can get large if $p$ is small, threshold this ratio at $z_{\max}$. The particular value chosen for $z_{\max}$ is not crucial; we have found empirically that $z_{\max} \in [2, 4]$ works well. Likewise, if $y^* = 0$, compute $z = -1/(1 - p)$ with a lower threshold of $-z_{\max}$.
2. Enforce a lower threshold on the weights: $w = \max(w, 2 \times \text{machine-zero})$.
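A small Python helper implementing these two protections; the default `z_max = 4` and the use of `np.finfo(float).tiny` for "machine-zero" are our illustrative choices.

```python
import numpy as np

def working_response(ystar, p, z_max=4.0):
    """Thresholded working response z and weights w (protections 1 and 2)."""
    z = np.where(ystar == 1, 1.0 / p, -1.0 / (1.0 - p))
    z = np.clip(z, -z_max, z_max)                    # protection 1
    machine_zero = np.finfo(float).tiny
    w = np.maximum(p * (1.0 - p), 2 * machine_zero)  # protection 2
    return z, w
```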
4.4. Optimizing $E[e^{-yF(x)}]$ by Newton stepping. The population version of the Real AdaBoost procedure (Algorithm 2) optimizes $E[\exp(-y(F(x) + f(x)))]$ exactly with respect to $f$ at each iteration. In Algorithm 4 we propose the "Gentle AdaBoost" procedure that instead takes adaptive Newton steps much like the LogitBoost algorithm just described.

Result 4. The Gentle AdaBoost algorithm (population version) uses Newton steps for minimizing $E[e^{-yF(x)}]$.

Derivation.

$$\frac{\partial J(F(x) + f(x))}{\partial f(x)}\bigg|_{f(x)=0} = -E\big(e^{-yF(x)} y \mid x\big),$$

$$\frac{\partial^2 J(F(x) + f(x))}{\partial f(x)^2}\bigg|_{f(x)=0} = E\big(e^{-yF(x)} \mid x\big) \quad \text{since } y^2 = 1.$$

Hence the Newton update is

$$F(x) \leftarrow F(x) + \frac{E(e^{-yF(x)} y \mid x)}{E(e^{-yF(x)} \mid x)} = F(x) + E_w(y \mid x),$$

where $w(x, y) = e^{-yF(x)}$.
Gentle AdaBoost

1. Start with weights $w_i = 1/N$, $i = 1, 2, \ldots, N$, $F(x) = 0$.
2. Repeat for $m = 1, 2, \ldots, M$:
   (a) Fit the regression function $f_m(x)$ by weighted least-squares of $y_i$ to $x_i$ with weights $w_i$.
   (b) Update $F(x) \leftarrow F(x) + f_m(x)$.
   (c) Update $w_i \leftarrow w_i \exp(-y_i f_m(x_i))$ and renormalize.
3. Output the classifier $\mathrm{sign}[F(x)] = \mathrm{sign}[\sum_{m=1}^{M} f_m(x)]$.

Algorithm 4. A modified version of the Real AdaBoost algorithm, using Newton stepping rather than exact optimization at each step.
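A minimal Python sketch of Algorithm 4; as in the earlier sketches, the depth-one scikit-learn regression tree is a stand-in weak learner of our choosing.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gentle_adaboost(X, y, M=100):
    """Algorithm 4 sketch: y in {-1, +1}; each stage is a weighted LS fit."""
    N = len(y)
    w = np.full(N, 1.0 / N)
    stages = []
    for m in range(M):
        tree = DecisionTreeRegressor(max_depth=1)
        tree.fit(X, y, sample_weight=w)   # (a) weighted least squares
        f = tree.predict(X)
        w = w * np.exp(-y * f)            # (c) reweight
        w /= w.sum()                      # renormalize
        stages.append(tree)               # (b) F <- F + f_m, kept as a list
    return lambda Xnew: np.sign(sum(t.predict(Xnew) for t in stages))
```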
The main difference between this and the Real AdaBoost algorithm is how it uses its estimates of the weighted class probabilities to update the functions. Here the update is $f_m(x) = P_w(y = 1 \mid x) - P_w(y = -1 \mid x)$, rather than half the log-ratio as in (24): $f_m(x) = \frac{1}{2} \log[P_w(y = 1 \mid x)/P_w(y = -1 \mid x)]$. Log-ratios can be numerically unstable, leading to very large updates in pure regions, while the update here lies in the range $[-1, 1]$. Empirical evidence suggests (see Section 7) that this more conservative algorithm has similar performance to both the Real AdaBoost and LogitBoost algorithms, and often outperforms them both, especially when stability is an issue.
There is a strong similarity between the updates for the Gentle AdaBoost algorithm and those for the LogitBoost algorithm. Let $P = P(y = 1 \mid x)$, and $p(x) = e^{F(x)}/(e^{F(x)} + e^{-F(x)})$. Then

$$\frac{E(e^{-yF(x)} y \mid x)}{E(e^{-yF(x)} \mid x)} = \frac{e^{-F(x)} P - e^{F(x)}(1 - P)}{e^{-F(x)} P + e^{F(x)}(1 - P)} \tag{37}$$

$$= \frac{P - p(x)}{(1 - p(x))P + p(x)(1 - P)}.$$

The analogous expression for LogitBoost, from (34), is

$$\frac{1}{2}\, \frac{P - p(x)}{p(x)(1 - p(x))}. \tag{38}$$

At $p(x) \approx \frac{1}{2}$ these are nearly the same, but they differ as the $p(x)$ become extreme. For example, if $P \approx 1$ and $p(x) \approx 0$, (38) blows up, while (37) is about 1 (and always falls in $[-1, 1]$).
5. Multiclass procedures. Here we explore extensions of boosting to classification with multiple classes. We start off by proposing a natural generalization of the two-class symmetric logistic transformation, and then consider specific algorithms. In this context Schapire and Singer (1998) define $J$ responses $y_j$ for a $J$-class problem, each taking values in $\{-1, 1\}$. Similarly, the indicator response vector with elements $y_j^*$ is more standard in the statistics literature. Assume the classes are mutually exclusive.

Definition 1. For a $J$-class problem let $p_j(x) = P(y_j = 1 \mid x)$. We define the symmetric multiple logistic transformation

$$F_j(x) = \log p_j(x) - \frac{1}{J} \sum_{k=1}^{J} \log p_k(x). \tag{39}$$

Equivalently,

$$p_j(x) = \frac{e^{F_j(x)}}{\sum_{k=1}^{J} e^{F_k(x)}}, \qquad \sum_{k=1}^{J} F_k(x) = 0. \tag{40}$$

The centering condition in (40) is for numerical stability only; it simply pins the $F_j$ down, else we could add an arbitrary constant to each $F_j$ and the probabilities would remain the same. The equivalence of these two definitions is easily established, as is the equivalence with the two-class case.
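A short numpy sketch of the transformation pair (39)-(40), mapping probabilities to centered logits and back; the max-subtraction in the softmax is a standard stabilization of our own.

```python
import numpy as np

def to_symmetric_logits(p):
    """(39): F_j = log p_j - mean_k log p_k, so that sum_j F_j = 0."""
    logp = np.log(p)
    return logp - logp.mean(axis=-1, keepdims=True)

def to_probabilities(F):
    """(40): p_j = exp(F_j) / sum_k exp(F_k)."""
    e = np.exp(F - F.max(axis=-1, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=-1, keepdims=True)
```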
Schapire and Singer (1998) provide several generalizations of AdaBoost for the multiclass case, and also refer to other proposals [Freund and Schapire (1997), Schapire (1997)]; we describe their AdaBoost.MH algorithm (see Algorithm 5), since it seemed to dominate the others in their empirical studies. We then connect it to the models presented here. We will refer to the augmented variable in Algorithm 5 as the "class" variable $C$. We make a few observations:

1. The population version of this algorithm minimizes $\sum_{j=1}^{J} E[e^{-y_j F_j(x)}]$, which is equivalent to running separate population boosting algorithms on each of the $J$ problems of size $N$ obtained by partitioning the $N \times J$ samples in the obvious fashion. This is seen trivially by first conditioning on $C = j$, and then $x \mid C = j$, when computing conditional expectations.
2. The same is almost true for a tree-based algorithm. We see this because:
   (a) If the first split is on $C$, either a $J$-nary split if permitted, or else $J - 1$ binary splits, then the subtrees are identical to separate trees grown for each of the $J$ groups. This will always be the case for the first tree.
   (b) If a tree does not split on $C$ anywhere on the path to a terminal node, then that node returns a function $f_m(x, j) = g_m(x)$ that contributes nothing to the classification decision. However, as long as a tree includes a split on $C$ at least once on every path to a terminal node, it will make a contribution to the classifier for all input feature values.

The advantage or disadvantage of building one large tree using class label as an additional input feature is not clear. No motivation is provided. We therefore implement AdaBoost.MH using the more traditional direct approach of building $J$ separate trees to minimize $\sum_{j=1}^{J} E[\exp(-y_j F_j(x))]$. We have thus shown:

Result 5. The AdaBoost.MH algorithm for a $J$-class problem fits $J$ uncoupled additive logistic models, $G_j(x) = \frac{1}{2} \log[p_j(x)/(1 - p_j(x))]$, each class against the rest.
AdaBoost.MH [Schapire and Singer (1998)]

1. Expand the original $N$ observations into $N \times J$ pairs $((x_i, 1), y_{i1}), ((x_i, 2), y_{i2}), \ldots, ((x_i, J), y_{iJ})$, $i = 1, \ldots, N$. Here $y_{ij}$ is the $\{-1, 1\}$ response for class $j$ and observation $i$.
2. Apply Real AdaBoost to the augmented dataset, producing a function $F: \mathscr{X} \times \{1, \ldots, J\} \to \mathbf{R}$, $F(x, j) = \sum_m f_m(x, j)$.
3. Output the classifier $\arg\max_j F(x, j)$.

Algorithm 5. The AdaBoost.MH algorithm converts the $J$-class problem into that of estimating a two-class classifier on a training set $J$ times as large, with an additional "feature" defined by the set of class labels.
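A short numpy sketch of the dataset expansion in step 1 of Algorithm 5; the column layout and the 0-based label encoding are our own illustrative choices.

```python
import numpy as np

def expand_mh(X, y, J):
    """Expand N observations into N*J pairs ((x_i, j), y_ij) for AdaBoost.MH.

    y holds class labels in {0, ..., J-1}; y_ij = +1 if y_i == j, else -1.
    """
    labels = np.tile(np.arange(J), len(y)).reshape(-1, 1)   # j cycles fastest
    X_aug = np.hstack([np.repeat(X, J, axis=0), labels])    # append "class" C
    y_aug = np.where(labels.ravel() == np.repeat(y, J), 1, -1)
    return X_aug, y_aug
```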
In principle this parametrization is fine, since $G_j(x)$ is monotone in $p_j(x)$. However, we are estimating the $G_j(x)$ in an uncoupled fashion, and there is no guarantee that the implied probabilities sum to 1. We give some examples where this makes a difference, and AdaBoost.MH performs more poorly than an alternative coupled likelihood procedure.
Schapire and Singer's AdaBoost.MH was also intended to cover situations where observations can belong to more than one class. The "MH" represents "Multi-Label Hamming," Hamming loss being used to measure the errors in the space of $2^J$ possible class labels. In this context fitting a separate classifier for each label is a reasonable strategy. However, Schapire and Singer also propose using AdaBoost.MH when the class labels are mutually exclusive, which is the focus in this paper.
Algorithm 6 is a natural generalization of Algorithm 3 for fitting the $J$-class logistic regression model (40).

Result 6. The LogitBoost algorithm ($J$ classes, population version) uses quasi-Newton steps for fitting an additive symmetric logistic model by maximum likelihood.
LogitBoost ($J$ classes)

1. Start with weights $w_{ij} = 1/N$, $i = 1, \ldots, N$, $j = 1, \ldots, J$, $F_j(x) = 0$ and $p_j(x) = 1/J$ for all $j$.
2. Repeat for $m = 1, 2, \ldots, M$:
   (a) Repeat for $j = 1, \ldots, J$:
      (i) Compute working responses and weights in the $j$th class,
      $$z_{ij} = \frac{y_{ij}^* - p_j(x_i)}{p_j(x_i)(1 - p_j(x_i))}, \qquad w_{ij} = p_j(x_i)(1 - p_j(x_i)).$$
      (ii) Fit the function $f_{mj}(x)$ by a weighted least-squares regression of $z_{ij}$ to $x_i$ with weights $w_{ij}$.
   (b) Set $f_{mj}(x) \leftarrow \frac{J-1}{J}\big(f_{mj}(x) - \frac{1}{J}\sum_{k=1}^{J} f_{mk}(x)\big)$, and $F_j(x) \leftarrow F_j(x) + f_{mj}(x)$.
   (c) Update $p_j(x)$ via (40).
3. Output the classifier $\arg\max_j F_j(x)$.

Algorithm 6. An adaptive Newton algorithm for fitting an additive multiple logistic regression model.
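A one-function numpy sketch of the symmetrizing step (b) of Algorithm 6, applied to the $J$ per-class least-squares fits evaluated at the training points; the array layout is our assumption.

```python
import numpy as np

def symmetrize(f):
    """Step (b): f is an (N, J) array of per-class fits f_mj(x_i).

    Returns ((J-1)/J) * (f_mj - mean_k f_mk), which sums to zero over j,
    preserving the centering condition in (40).
    """
    J = f.shape[1]
    return ((J - 1) / J) * (f - f.mean(axis=1, keepdims=True))
```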
Derivation.

1. We first give the score and Hessian for the population Newton algorithm corresponding to a standard multilogit parametrization

$$G_j(x) = \log \frac{P(y_j^* = 1 \mid x)}{P(y_J^* = 1 \mid x)},$$

with $G_J(x) = 0$ (and the choice of $J$ for the base class is arbitrary). The expected conditional log-likelihood is

$$E\big[l(G + g) \mid x\big] = \sum_{j=1}^{J-1} E(y_j^* \mid x)\big(G_j(x) + g_j(x)\big) - \log\Big(1 + \sum_{k=1}^{J-1} e^{G_k(x) + g_k(x)}\Big),$$

$$s_j(x) = E\big(y_j^* - p_j(x) \mid x\big), \quad j = 1, \ldots, J - 1,$$

$$H_{j,k}(x) = -p_j(x)\big(\delta_{jk} - p_k(x)\big), \quad j, k = 1, \ldots, J - 1.$$

2. Our quasi-Newton update amounts to using a diagonal approximation to the Hessian, producing updates:

$$g_j(x) \leftarrow \frac{E(y_j^* - p_j(x) \mid x)}{p_j(x)(1 - p_j(x))}, \quad j = 1, \ldots, J - 1.$$

3. To convert to the symmetric parametrization, we would note that $g_J = 0$, and set $f_j(x) = g_j(x) - (1/J)\sum_{k=1}^{J} g_k(x)$. However, this procedure could be applied using any class as the base, not just the $J$th. By averaging over all choices for the base class, we get the update

$$f_j(x) = \frac{J - 1}{J}\bigg(\frac{E(y_j^* - p_j(x) \mid x)}{p_j(x)(1 - p_j(x))} - \frac{1}{J}\sum_{k=1}^{J}\frac{E(y_k^* - p_k(x) \mid x)}{p_k(x)(1 - p_k(x))}\bigg).$$

For more rigid parametric models and full Newton stepping, this symmetrization would be redundant. With quasi-Newton steps and adaptive (tree-based) models, the symmetrization removes the dependence on the choice of the base class.
6. Simulation studies. In this section the four flavors of boosting out-
lined above are applied to several artificially constructed problems. Compar-
isons based on real data are presented in Section 7.
An advantage of comparisons made in a simulation setting is that all as-
pects of each example are known, including the Bayes error rate and the
complexity of the decision boundary. In addition, the population expected error
rates achieved by each of the respective methods can be estimated to arbitrary
accuracy by averaging over a large number of different training and test data
sets drawn from the population. The four boosting methods compared here
are:
DAB: Discrete AdaBoost—Algorithm 1.
RAB: Real AdaBoost—Algorithm 2.
LB: LogitBoost—Algorithms 3 and 6.
GAB: Gentle AdaBoost—Algorithm 4.
DAB, RAB and GAB handle multiple classes using the AdaBoost.MH
approach.
In an attempt to differentiate performance, all of the simulated examples involve fairly complex decision boundaries. The ten input features for all examples are randomly drawn from a ten-dimensional standard normal distribution $x \sim N^{10}(0, I)$. For the first three examples the decision boundaries separating successive classes are nested concentric ten-dimensional spheres constructed by thresholding the squared radius from the origin

$$r^2 = \sum_{j=1}^{10} x_j^2. \tag{41}$$

Each class $C_k$, $1 \le k \le K$, is defined as the subset of observations

$$C_k = \{x_i \mid t_{k-1} \le r_i^2 < t_k\}, \tag{42}$$

with $t_0 = 0$ and $t_K = \infty$. The $\{t_k\}_1^{K-1}$ for each example were chosen so as to put approximately equal numbers of observations in each class. The training sample size is $N = K \cdot 1000$, so that approximately 1000 training observations are in each class. An independently drawn test set of 10,000 observations was used to estimate error rates for each training set. Averaged results over ten such independently drawn training–test set combinations were used for the final error rate estimates. The corresponding statistical uncertainties (standard errors) of these final estimates (averages) are approximately a line width on each plot.
Figure 3 (top left) compares the four algorithms in the two-class ($K = 2$) case using a two-terminal node decision tree ("stump") as the base classifier. Shown is error rate as a function of number of boosting iterations. The upper (black) line represents DAB, and the other three nearly coincident lines are the other three methods (dotted red = RAB, short-dashed green = LB, and long-dashed blue = GAB). Note that the somewhat erratic behavior of DAB, especially for fewer than 200 iterations, is not due to statistical uncertainty. For fewer than 400 iterations LB has a minuscule edge; after that it is a dead heat with RAB and GAB. DAB shows substantially inferior performance here, with roughly twice the error rate at all iterations.
Fig. 3. Test error curves for the simulation experiment with an additive decision boundary, as described in (42). In all panels except the top right, the solid curve (representing Discrete AdaBoost) lies alone above the other three curves.

Figure 3 (lower left) shows the corresponding results for three classes ($K = 3$), again with two-terminal node trees. Here the problem is more difficult, as reflected in increased error rates for all four methods, but their relationship is roughly the same: the upper (black) line represents DAB and the other three nearly coincident lines are the other three methods. The situation is somewhat different for a larger number of classes. Figure 3 (lower right) shows results for $K = 5$, which are typical for $K \ge 4$. As before, DAB incurs much higher error rates than all the others, and RAB and GAB have nearly identical performance. However, the performance of LB relative to RAB and GAB has changed. Up to about 40 iterations it has the same error rate. From 40 to about 100 iterations LB's error rates are slightly higher than the other two.
After 100 iterations the error rate for LB continues to improve, whereas those for RAB and GAB level off, decreasing much more slowly. By 800 iterations the error rate for LB is 0.19 whereas that for RAB and GAB is 0.32. Speculation as to the reason for LB's performance gain in these situations is presented below.
In the above examples a stump was used as the base classifier. One might
expect the use of larger trees would do better for these rather complex prob-
lems. Figure 3 (top right) shows results for the two-class problem, here boost-
ing trees with eight terminal nodes. These results can be compared to those for
stumps in Figure 3 (top left). Initially, error rates for boosting eight-node trees
decrease much more rapidly than for stumps, with each successive iteration,
for all methods. However, the error rates quickly level off and improvement
is very slow after about 100 iterations. The overall performance of DAB is
much improved with the bigger trees, coming close to that of the other three
methods. As before RAB, GAB and LB exhibit nearly identical performance.
Note that at each iteration the eight-node tree model consists of four times
the number of additive terms as does the corresponding stump model. This
is why the error rates decrease so much more rapidly in the early iterations.
In terms of model complexity (and training time), a 100-iteration model using
eight-terminal node trees is equivalent to a 400-iteration stump model.
Comparing the top two panels in Figure 3, one sees that for RAB, GAB and LB the error rate using the bigger trees (0.072) is in fact 33% higher than that for stumps (0.054) at 800 iterations, even though the former is four times more complex. This seemingly mysterious behavior is easily understood by examining the nature of the decision boundary separating the classes. The Bayes decision boundary between two classes is the set

$$\Big\{x : \log \frac{P(y = 1 \mid x)}{P(y = -1 \mid x)} = 0\Big\}, \tag{43}$$

or simply $\{x : B(x) = 0\}$. To approximate this set it is sufficient to estimate the logit $B(x)$, or any monotone transformation of $B(x)$, as closely as possible. As discussed above, boosting produces an additive logistic model whose component functions are represented by the base classifier. With stumps as the base classifier, each component function has the form

$$f_m(x) = c_m^L 1_{[x_j \le t_m]} + c_m^R 1_{[x_j > t_m]} \tag{44}$$

$$= f_m(x_j) \tag{45}$$

if the $m$th stump chose to split on coordinate $j$. Here $t_m$ is the split-point, and $c_m^L$ and $c_m^R$ are the weighted means of the response in the left and right terminal nodes. Thus the model produced by boosting stumps is additive in the original features,

$$F(x) = \sum_{j=1}^{p} g_j(x_j), \tag{46}$$

where $g_j(x_j)$ adds together all those stumps involving $x_j$ (and is 0 if none exist).
Examination of (41) and (42) reveals that an optimal decision boundary for the above examples is also additive in the original features, with $f_j(x_j) = x_j^2 + \text{constant}$. Thus, in the context of decision trees, stumps are ideally matched
to these problems; larger trees are not needed. However boosting larger trees
need not be counterproductive in this case if all of the splits in each individual
tree are made on the same predictor variable. This would also produce an
additive model in the original features (46). However, due to the forward
greedy stagewise strategy used by boosting, this is not likely to happen if the
decision boundary function involves more than one predictor; each individual
tree will try to do its best to involve all of the important predictors. Owing to
the nature of decision trees, this will produce models with interaction effects;
most terms in the model will involve products in more than one variable. Such
nonadditive models are not as well suited for approximating truly additive
decision boundaries such as (41) and (42). This is reflected in increased error
rate as observed in Figure 3.
The above discussion also suggests that if the decision boundary separating pairs of classes were inherently nonadditive in the predictors, then boosting stumps would be less advantageous than using larger trees. A tree with $m$ terminal nodes can produce basis functions with a maximum interaction order of $\min(m - 1, p)$, where $p$ is the number of predictor features. These higher order basis functions provide the possibility to more accurately estimate those decision boundaries $B(x)$ with high-order interactions. The purpose of the next example is to verify this intuition. There are two classes ($K = 2$) and 5000 training observations, with the $\{x_i\}_1^{5000}$ drawn from a ten-dimensional normal distribution as in the previous examples. Class labels were randomly assigned to each observation with log-odds

$$\log \frac{\Pr(y = 1 \mid x)}{\Pr(y = -1 \mid x)} = 10 \sum_{j=1}^{6} x_j \Big(1 + \sum_{l=1}^{6} (-1)^l x_l\Big). \tag{47}$$

Approximately equal numbers of observations are assigned to each of the two classes, and the Bayes error rate is 0.046. The decision boundary for this problem is a complicated function of the first six predictor variables, involving all of them in second-order interactions of equal strength. As in the above examples, test sets of 10,000 observations were used to estimate error rates for each training set, and final estimates were averages over ten replications.
Figure 4 (top left) shows test-error rate as a function of iteration number for
each of the four boosting methods using stumps. As in the previous examples,
RAB and GAB track each other very closely. DAB begins very slowly, being
dominated by all of the others until around 180 iterations, where it passes
below RAB and GAB. LB mostly dominates, having the lowest error rate until
about 650 iterations. At that point DAB catches up and by 800 iterations it
may have a very slight edge. However, none of these boosting methods perform
well with stumps on this problem, the best error rate being 0.35.
Fig. 4. Test error curves for the simulation experiment with a nonadditive decision boundary, as described in (47).

Figure 4 (top right) shows the corresponding plot when four-terminal node trees are boosted. Here there is a dramatic improvement with all of the four methods. For the first time there is some small differentiation between RAB
and GAB. At nearly all iterations the performance ranking is LB best, followed
by GAB, RAB and DAB in order. At 800 iterations LB achieves an error rate of
0.134. Figure 4 (lower left) shows results when eight terminal node trees are
boosted. Here, error rates are generally further reduced with LB improving
the least (0.130), but still dominating. The performance ranking among the
other three methods changes with increasing iterations; DAB overtakes RAB
at around 150 iterations and GAB at about 230, becoming fairly close to LB
by 800 iterations with an error rate of 0.138.
Although limited in scope, these simulation studies suggest several trends. They explain why boosting stumps can sometimes be superior to using larger trees, and suggest situations where this is likely to be the case; that is, when decision boundaries $B(x)$ can be closely approximated by functions that are additive in the original predictor features. When higher order interactions are required, stumps exhibit poor performance. These examples illustrate the close similarity between RAB and GAB. In all cases the difference in performance between DAB and the others decreases when larger trees and more iterations are used, sometimes overtaking the others. More generally, relative performance of these four methods depends on the problem at hand in terms of the nature of the decision boundaries, the complexity of the base classifier and the number of boosting iterations.
The superior performance of LB in Figure 3 (lower right) appears to be a
consequence of the multiclass logistic model (Algorithm 6). All of the other
methods use the asymmetric AdaBoost.MH strategy (Algorithm 5) of building
separate two-class models for each individual class against the pooled com-
plement classes. Even if the decision boundaries separating all class pairs
are relatively simple, pooling classes can produce complex decision bound-
aries that are difficult to approximate [Friedman (1996)]. By considering all
of the classes simultaneously, the symmetric multiclass model is better able
to take advantage of simple pairwise boundaries when they exist [Hastie and
Tibshirani (1998)]. As noted above, the pairwise boundaries induced by (41)
and (42) are simple when viewed in the context of additive modeling, whereas
the pooled boundaries are more complex; they cannot be well approximated
by functions that are additive in the original predictor variables.
The decision boundaries associated with these examples were deliberately
chosen to be geometrically complex in an attempt to elicit performance differ-
ences among the methods being tested. Such complicated boundaries are not
likely to often occur in practice. Many practical problems involve compara-
tively simple boundaries [Holte (1993)]; in such cases performance differences
will still be situation dependent, but correspondingly less pronounced.
7. Some experiments with real world data. In this section we show
the results of running the four fitting methods: LogitBoost, Discrete AdaBoost,
Real AdaBoost and Gentle AdaBoost on a collection of datasets from the UC-
Irvine machine learning archive, plus a popular simulated dataset. The base
learner is a tree in each case, with either two or eight terminal nodes. For com-
parison, a single decision tree was also fit (using the tree function in Splus),
with the tree size determined by 5-fold cross-validation.
The datasets are summarized in Table 1. The test error rates are shown in
Table 2 for the smaller datasets, and in Table 3 for the larger ones. The vowel,
sonar, satimage and letter datasets come with a prespecified test set. The
waveform data is simulated, as described in Breiman, Friedman, Olshen and
Table 1
Datasets used in the experiments
Data set # Train # Test # Inputs # Classes
Vowel 528 462 10 11
Breast cancer 699 5-fold CV 9 2
Ionosphere 351 5-fold CV 34 2
Glass 214 5-fold CV 10 7
Sonar 210 5-fold CV 60 2
Waveform 300 5000 21 3
Satimage 4435 2000 36 6
Letter 16000 4000 16 26
Stone (1984). For the others, 5-fold cross-validation was used to estimate the
test error.
It is difficult to discern trends on the small datasets (Table 2) because all
but quite large observed differences in performance could be attributed to
sampling fluctuations. On the vowel, breast cancer, ionosphere, sonar and
waveform data, purely additive stump models seem to perform comparably to
the larger (eight-node) trees. The glass data seems to benefit a little from
larger trees. There is no clear differentiation in performance among the boost-
ing methods.
On the larger data sets (Table 3) clearer trends are discernible. For the
satimage data the eight-node tree models are only slightly, but significantly,
more accurate than the purely additive models. For the letter data there is
no contest. Boosting stumps is clearly inadequate. There is no clear differ-
entiation among the boosting methods for eight-node trees. For the stumps,
LogitBoost, Real AdaBoost and Gentle AdaBoost have comparable perfor-
mance, distinctly superior to Discrete AdaBoost. This is consistent with
the results of the simulation study (Section 6).
Except perhaps for Discrete AdaBoost, the real data examples fail to demon-
strate performance differences between the various boosting methods. This is
in contrast to the simulated data sets of Section 6. There LogitBoost gener-
ally dominated, although often by a small margin. The inability of the real
data examples to discriminate may reflect statistical difficulties in estimat-
ing subtle differences with small samples. Alternatively, it may be that their
underlying decision boundaries are all relatively simple [Holte (1993)] so that
all reasonable methods exhibit similar performance.
8. Additive logistic trees. In most applications of boosting the base clas-
sifier is considered to be a primitive, repeatedly called by the boosting proce-
dure as iterations proceed. The operations performed by the base classifier
are the same as they would be in any other context given the same data and
weights. The fact that the final model is going to be a linear combination of
a large number of such classifiers is not taken into account. In particular,
when using decision trees, the same tree growing and pruning algorithms are
Table 2
Test error rates on small real examples

                        2 Terminal Nodes      8 Terminal Nodes
Method (iterations)     50     100    200     50     100    200

Vowel (CART error = 0.642)
LogitBoost              0.532  0.524  0.511   0.517  0.517  0.517
Real AdaBoost           0.565  0.561  0.548   0.496  0.496  0.496
Gentle AdaBoost         0.556  0.571  0.584   0.515  0.496  0.496
Discrete AdaBoost       0.563  0.535  0.563   0.511  0.500  0.500

Breast (CART error = 0.045)
LogitBoost              0.028  0.031  0.029   0.034  0.038  0.038
Real AdaBoost           0.038  0.038  0.040   0.032  0.034  0.034
Gentle AdaBoost         0.037  0.037  0.041   0.032  0.031  0.031
Discrete AdaBoost       0.042  0.040  0.040   0.032  0.035  0.037

Ion (CART error = 0.076)
LogitBoost              0.074  0.077  0.071   0.068  0.063  0.063
Real AdaBoost           0.068  0.066  0.068   0.054  0.054  0.054
Gentle AdaBoost         0.085  0.074  0.077   0.066  0.063  0.063
Discrete AdaBoost       0.088  0.080  0.080   0.068  0.063  0.063

Glass (CART error = 0.400)
LogitBoost              0.266  0.257  0.266   0.243  0.238  0.238
Real AdaBoost           0.276  0.247  0.257   0.234  0.234  0.234
Gentle AdaBoost         0.276  0.261  0.252   0.219  0.233  0.238
Discrete AdaBoost       0.285  0.285  0.271   0.238  0.234  0.243

Sonar (CART error = 0.596)
LogitBoost              0.231  0.231  0.202   0.163  0.154  0.154
Real AdaBoost           0.154  0.163  0.202   0.173  0.173  0.173
Gentle AdaBoost         0.183  0.183  0.173   0.154  0.154  0.154
Discrete AdaBoost       0.154  0.144  0.183   0.163  0.144  0.144

Waveform (CART error = 0.364)
LogitBoost              0.196  0.195  0.206   0.192  0.191  0.191
Real AdaBoost           0.193  0.197  0.195   0.185  0.182  0.182
Gentle AdaBoost         0.190  0.188  0.193   0.185  0.185  0.186
Discrete AdaBoost       0.188  0.185  0.191   0.186  0.183  0.183
Table 3
Test error rates on larger data examples

                                       Iterations
Method              Terminal Nodes     20     50     100    200    Fraction

Satimage (CART error = 0.148)
LogitBoost                2            0.140  0.120  0.112  0.102
Real AdaBoost             2            0.148  0.126  0.117  0.119
Gentle AdaBoost           2            0.148  0.129  0.119  0.119
Discrete AdaBoost         2            0.174  0.156  0.140  0.128
LogitBoost                8            0.096  0.095  0.092  0.088
Real AdaBoost             8            0.105  0.102  0.092  0.091
Gentle AdaBoost           8            0.106  0.103  0.095  0.089
Discrete AdaBoost         8            0.122  0.107  0.100  0.099

Letter (CART error = 0.124)
LogitBoost                2            0.250  0.182  0.159  0.145  0.06
Real AdaBoost             2            0.244  0.181  0.160  0.150  0.12
Gentle AdaBoost           2            0.246  0.187  0.157  0.145  0.14
Discrete AdaBoost         2            0.310  0.226  0.196  0.185  0.18
LogitBoost                8            0.075  0.047  0.036  0.033  0.03
Real AdaBoost             8            0.068  0.041  0.033  0.032  0.03
Gentle AdaBoost           8            0.068  0.040  0.030  0.028  0.03
Discrete AdaBoost         8            0.080  0.045  0.035  0.029  0.03
generally employed. Sometimes alterations are made (such as no pruning) for
programming convenience and speed.
When boosting is viewed in the light of additive modeling, however, this
greedy approach can be seen to be far from optimal in many situations. As
discussed in Section 6, the goal of the final classifier is to produce an accurate
approximation to the decision boundary function $B(x)$. In the context of boosting,
this goal applies to the final additive model, not to the individual terms
(base classifiers) at the time they were constructed. For example, it was seen
in Section 6 that if $B(x)$ was close to being additive in the original predictive
features, then boosting stumps was optimal since it produced an approximation
with the same structure. Building larger trees increased the error rate
of the final model because the resulting approximation involved high-order
interactions among the features. The larger trees optimized error rates of the
individual base classifiers, given the weights at that step, and even produced
lower unweighted error rates in the early stages. However, after a sufficient
number of boosts, the stump-based model achieved superior performance.
More generally, one can consider an expansion of the decision boundary
function in a functional ANOVA decomposition [Friedman (1991)],

$$B(x) = \sum_j f_j(x_j) + \sum_{j,k} f_{jk}(x_j, x_k) + \sum_{j,k,l} f_{jkl}(x_j, x_k, x_l) + \cdots \qquad (48)$$
The first sum represents the closest function to $B(x)$ that is additive in
the original features, the first two sums represent the closest approximation involving
at most two-feature interactions, the first three represent three-feature
interactions, and so on. If $B(x)$ can be accurately approximated by such an
expansion, truncated at low interaction order, then allowing the base classi-
fier to produce higher order interactions can reduce the accuracy of the final
boosted model. In the context of decision trees, higher order interactions are
produced by deeper trees.
In situations where the true underlying decision boundary function admits
a low order ANOVA decomposition, one can take advantage of this structure
to improve accuracy by restricting the depth of the base decision trees to be
not much larger than the actual interaction order of $B(x)$. Since this is not
likely to be known in advance for any particular problem, this maximum depth
becomes a “meta-parameter” of the procedure to be estimated by some model
selection technique, such as cross-validation.
One can restrict the depth of an induced decision tree by using its stan-
dard pruning procedure, starting from the largest possible tree, but requiring
it to delete enough splits to achieve the desired maximum depth. This can
be computationally wasteful when this depth is small. The time required to
build the tree is proportional to the depth of the largest possible tree before
pruning. Therefore, dramatic computational savings can be achieved by sim-
ply stopping the growing process at the maximum depth, or alternatively at
a maximum number of terminal nodes. The standard heuristic arguments in
favor of growing large trees and then pruning do not apply in the context of
boosting. Shortcomings in any individual tree can be compensated by trees
grown later in the boosting sequence.
If a truncation strategy based on number of terminal nodes is to be
employed, it is necessary to define an order in which splitting takes place. We
adopt a “best-first” strategy. An optimal split is computed for each currently
terminal node. The node whose split would achieve the greatest reduction in
the tree building criterion is then actually split. This increases the number
of terminal nodes by one. This continues until a maximum number $M$ of
terminal nodes is induced. Standard computational tricks can be employed so
that inducing trees in this order requires no more computation than other
orderings commonly used in decision tree induction.
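To make the best-first strategy concrete, here is a minimal sketch under our own conventions: a weighted squared-error split criterion, leaves represented as index arrays, and illustrative function names (best_split, grow_best_first) that do not come from the paper.

```python
import numpy as np

def best_split(X, y, w, idx):
    """Best weighted squared-error split of the observations in idx;
    returns (gain, feature, threshold), or None if idx cannot be split."""
    mu = (w[idx] * y[idx]).sum() / w[idx].sum()
    parent = (w[idx] * (y[idx] - mu) ** 2).sum()
    best = None
    for j in range(X.shape[1]):
        order = idx[np.argsort(X[idx, j])]
        for cut in range(1, len(order)):
            lo, hi = order[:cut], order[cut:]
            if X[lo[-1], j] == X[hi[0], j]:
                continue  # cannot cut between equal feature values
            child = 0.0
            for part in (lo, hi):
                m = (w[part] * y[part]).sum() / w[part].sum()
                child += (w[part] * (y[part] - m) ** 2).sum()
            gain = parent - child
            if best is None or gain > best[0]:
                best = (gain, j, 0.5 * (X[lo[-1], j] + X[hi[0], j]))
    return best

def grow_best_first(X, y, w, M):
    """Grow a tree best-first: always split the terminal node whose best
    split most reduces the (weighted) criterion, until M leaves exist."""
    leaves = [np.arange(len(y))]
    while len(leaves) < M:
        cands = [(i, best_split(X, y, w, idx)) for i, idx in enumerate(leaves)]
        cands = [(i, s) for i, s in cands if s is not None]
        if not cands:
            break  # no leaf can be split further
        i, (gain, j, t) = max(cands, key=lambda c: c[1][0])
        idx = leaves.pop(i)
        leaves += [idx[X[idx, j] <= t], idx[X[idx, j] > t]]
    # each leaf predicts its weighted mean response
    return [(idx, (w[idx] * y[idx]).sum() / w[idx].sum()) for idx in leaves]
```

Recomputing every leaf's best split at each step, as here, is the naive quadratic version; caching each leaf's best split so it is computed only once is one of the standard tricks alluded to above.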
The truncation limit $M$ is applied to all trees in the boosting sequence. It is
thus a meta-parameter of the entire boosting procedure. An optimal value can
be estimated through standard model selection techniques such as minimizing
cross-validated error rate of the final boosted model. We refer to this combi-
nation of truncated best-first trees, with boosting, as “additive logistic trees”
(ALT). Best-first trees were used in all of the simulated and real examples.
One can compare results on the latter (Tables 2 and 3) to corresponding results
reported by Dietterich [(1998), Table 1] on common data sets. Error rates
achieved by ALT with very small truncation values are seen to compare quite
favorably with other committee approaches using much larger trees at each
boosting step. Even when error rates are the same, the computational savings
associated with ALT can be quite important in data mining contexts where
large data sets cause computation time to become an issue.

Fig. 5. Coordinate functions for the additive logistic tree obtained by boosting (LogitBoost) with stumps, for the two-class nested sphere example from Section 6.
Another advantage of low order approximations is model visualization. In
particular, for models additive in the input features (46), the contribution of
each feature $x_j$ can be viewed as a graph of $g_j(x_j)$ plotted against $x_j$. Figure 5
shows such plots for the ten features of the two-class nested spheres example
of Figure 3. The functions are shown for the first class concentrated near the
origin; the corresponding functions for the other class are the negatives of
these functions.
The plots in Figure 5 clearly show that the contribution to the log-odds of
each individual feature is approximately quadratic, which matches the gener-
ating model (41) and (42).
When there are more than two classes, plots similar to Figure 5 can be made
for each class and analogously interpreted. Higher order interaction models
are more difficult to visualize. If there are at most two-feature interactions,
the two-variable contributions can be visualized using contour or perspective
mesh plots. Beyond two-feature interactions, visualization techniques are even
less effective. Even when noninteraction (stump) models do not achieve the
highest accuracy, they can be very useful as descriptive statistics owing to the
interpretability of the resulting model.
9. Weight trimming. In this section we propose a simple idea and show
that it can dramatically reduce computation for boosted models without sacri-
ficing accuracy. Despite its apparent simplicity, this approach does not appear
to be in common use [although similar ideas have been proposed before:
Schapire (1990), Freund (1995)]. At each boosting iteration there is a
distribution of weights over the training sample. As iterations proceed, this
distribution tends to become highly skewed towards smaller weight values.
A larger fraction of the training sample becomes correctly classified with
increasing confidence, thereby receiving smaller weights. Observations with
very low relative weight have little impact on training of the base classifier;
only those that carry the dominant proportion of the weight mass are influen-
tial. The fraction of such high weight observations can become very small in
later iterations. This suggests that at any iteration one can simply delete from
the training sample the large fraction of observations with very low weight
without having much effect on the resulting induced classifier. However, com-
putation is reduced since it tends to be proportional to the size of the training
sample, regardless of weights.
At each boosting iteration, training observations with weight $w_i$ less than
a threshold $w_i < t_\beta$ are not used to train the classifier. We take the value of
$t_\beta$ to be the $\beta$th quantile of the weight distribution over the training data
at the corresponding iteration. That is, only those observations that carry
the fraction $1 - \beta$ of the total weight mass are used for training. Typically
$\beta \in [0.01, 0.1]$, so that the data used for training carries from 90 to 99% of
the total weight mass. Note that the weights for all training observations are
recomputed at each iteration. Observations deleted at a particular iteration
may therefore reenter at later iterations if their weights subsequently increase
relative to other observations.
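In code the trimming rule is only a few lines. A minimal sketch, assuming the weights sit in a NumPy array; the function name is ours.

```python
import numpy as np

def trimmed_indices(w, beta=0.1):
    """Indices of the observations kept for training: drop the lowest-weight
    observations until a fraction beta of the total weight mass is removed,
    so the retained set carries the fraction 1 - beta of the mass."""
    order = np.argsort(w)                  # ascending weight
    cum = np.cumsum(w[order]) / w.sum()    # cumulative weight mass
    return np.sort(order[cum > beta])
```

The base learner at that iteration is then fitted on X[keep], y[keep], w[keep] only.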
Figure 6 (left panel) shows test-error rate as a function of iteration number
for the letter recognition problem described in Section 7, here using Gentle
AdaBoost and eight-node trees as the base classifier. Two error rate curves are
shown. The black solid one represents using the full training sample at each
iteration ($\beta = 0$), whereas the red dashed curve represents the corresponding
error rate for $\beta = 0.1$. The two curves track each other very closely, especially
at the later iterations. Figure 6 (right panel) shows the corresponding frac-
tion of observations used to train the base classifier as a function of iteration
number. Here the two curves are not similar. With $\beta = 0.1$ the number of
observations used for training drops very rapidly reaching roughly 5% of the
total at 20 iterations. By 50 iterations it is down to about 3% where it stays
throughout the rest of the boosting procedure. Thus, computation is reduced
by over a factor of 30 with no apparent loss in classification accuracy.

Fig. 6. The left panel shows the test error for the letter recognition problem as a function of iteration number. The black solid curve uses all the training data, the red dashed curve uses a subset based on weight thresholding. The right panel shows the percent of training data used for both approaches. The upper curve steps down, because training can stop for an entire class if it is fit sufficiently well (see text).

The reason why sample size in this case decreases for $\beta = 0$ after 150 iterations is
that if all of the observations in a particular class are classified correctly with
very high confidence ($F_k > 15 + \log N$), training for that class stops and
continues only for the remaining classes. At 400 iterations, 12 classes remained
of the original 26 classes.
The last column labeled fraction in Table 3 for the letter-recognition problem
shows the average fraction of observations used in training the base classifiers
over the 200 iterations, for all boosting methods and tree sizes. For eight-node
trees, all methods behave as shown in Figure 6. With stumps, LogitBoost uses
considerably less data than the others and is thereby correspondingly faster.
This is a genuine property of LogitBoost that sometimes gives it an advan-
tage with weight trimming. Unlike the other methods, the LogitBoost weights
$w_i = p_i(1 - p_i)$ do not in any way involve the class outputs $y_i$; they simply
measure nearness to the currently estimated decision boundary $F_M(x) = 0$.
Discarding small weights thus retains only those training observations that
are estimated to be close to the boundary. For the other three procedures the
weight is monotone decreasing in $y_i F_M(x_i)$. This gives highest weight to currently mis-
classified training observations, especially those far from the boundary. If after
trimming, the fraction of observations remaining is less than the error rate,
the subsample passed to the base learner will be highly unbalanced contain-
ing very few correctly classified observations. This imbalance seems to inhibit
learning. No such imbalance occurs with LogitBoost since near the decision
boundary, correctly and misclassified observations appear in roughly equal
numbers.
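The contrast between the two weighting schemes can be made concrete. A small sketch using the paper's relation $p(x) = e^{F(x)}/(e^{F(x)} + e^{-F(x)})$; the function names are ours.

```python
import numpy as np

def adaboost_weights(y, F):
    """Exponential-loss weights exp(-y F): monotone decreasing in the
    margin y*F, so misclassified points far from the boundary dominate."""
    return np.exp(-y * F)

def logitboost_weights(F):
    """LogitBoost weights p(1 - p), with p = 1/(1 + exp(-2F)) as in FHT:
    largest where F is near zero, independently of the labels y."""
    p = 1.0 / (1.0 + np.exp(-2.0 * F))
    return p * (1.0 - p)
```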
As this example illustrates, very large reductions in computation for boost-
ing can be achieved by this simple trick. A variety of other examples (not
shown) exhibit similar behavior with all boosting methods. Note that other
committee approaches to classification such as bagging [Breiman (1996)] and
randomized trees [Dietterich (1998)], while admitting parallel implementations,
cannot take advantage of this approach to reduce computation.
10. Further generalizations of boosting. We have shown above that
AdaBoost fits an additive model, optimizing a criterion similar to binomial
log-likelihood, via an adaptive Newton method. This suggests ways in which
the boosting paradigm may be generalized. First, the Newton step can be
replaced by a gradient step, slowing down the fitting procedure. This can
reduce susceptibility to overfitting and lead to improved performance. Sec-
ond, any smooth loss function can be used: for regression, squared error is
natural, leading to the “fitting of residuals” boosting algorithm mentioned
in the introduction. However, other loss functions might have benefits, for
example, tapered squared error based on Huber’s robust influence function
[Huber (1964)]. The resulting procedure is a fast, convenient method for resis-
tant fitting of additive models. Details of these generalizations may be found
in Friedman (1999).
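As an illustration of these generalizations, here is a minimal sketch of gradient-step boosting for regression, with squared-error and Huber-type pseudo-residuals. The shrinkage parameter nu stands in for the slowed-down gradient step, and fit_base is a placeholder for any weak regression learner returning a prediction function; all names are ours, and details differ from Friedman (1999).

```python
import numpy as np

def neg_gradient_squared(y, F):
    return y - F                              # ordinary residuals

def neg_gradient_huber(y, F, delta=1.0):
    r = y - F                                 # tapered (robust) residuals
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

def gradient_boost(X, y, fit_base, neg_grad, M=100, nu=0.1):
    """Generic gradient-step boosting: each stage fits the base learner to
    the negative gradient of the loss and takes a damped step of size nu."""
    F, ensemble = np.zeros(len(y)), []
    for _ in range(M):
        f = fit_base(X, neg_grad(y, F))       # base learner on pseudo-residuals
        F += nu * f(X)
        ensemble.append(f)
    return lambda Z: nu * sum(f(Z) for f in ensemble)
```

With neg_gradient_squared this is exactly the "fitting of residuals" algorithm mentioned above; swapping in neg_gradient_huber gives a resistant variant.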
11. Concluding remarks. In order to understand a learning procedure
statistically it is necessary to identify two important aspects: its structural
model and its error model. The former is most important since it determines
the function space of the approximator, thereby characterizing the class of
functions or hypotheses that can be accurately approximated with it. The
error model specifies the distribution of random departures of sampled data
from the structural model. It thereby defines the criterion to be optimized in
the estimation of the structural model.
We have shown that the structural model for boosting is additive on the
logistic scale with the base learner providing the additive components. This
understanding alone explains many of the properties of boosting. It is no sur-
prise that a large number of such (jointly optimized) components defines a
much richer class of learners than one of them alone. It reveals that in the
context of boosting all base learners are not equivalent, and there is no uni-
versally best choice over all situations. As illustrated in Section 6, the base
learners need to be chosen so that the resulting additive expansion matches
the particular decision boundary encountered. Even in the limited context of
boosting decision trees the interaction order, as characterized by the number of
terminal nodes, needs to be chosen with care. Purely additive models induced
by decision stumps are sometimes, but not always, the best. However, we con-
jecture that boundaries involving very high-order interactions will rarely be
encountered in practice. This motivates our additive logistic trees (ALT) pro-
cedure described in Section 8.
The error model for two-class boosting is the obvious one for binary vari-
ables, namely the Bernoulli distribution. We show that the AdaBoost proce-
dures maximize a criterion that is closely related to expected log-Bernoulli
likelihood, having the identical solution in the distributional ($L_2$) limit of
infinite data. We derived a more direct procedure for maximizing this log-
likelihood (LogitBoost) and show that it exhibits properties nearly identical to
those of Real AdaBoost.
In the multiclass case, the AdaBoost procedures maximize a separate
Bernoulli likelihood for each class versus the others. This is a natural choice
and is especially appropriate when observations can belong to more than one
class [Schapire and Singer (1998)]. In the more usual setting of a unique
class label for each observation, the symmetric multinomial distribution is
a more appropriate error model. We develop a multiclass LogitBoost proce-
dure that maximizes the corresponding log-likelihood by quasi-Newton step-
ping. We show through simulated examples that there exist settings where
this approach leads to superior performance, although none of these situa-
tions seems to have been encountered in the set of real data examples used
for illustration; both approaches performed quite similarly over these examples.
The concepts developed in this paper suggest that there is very little, if any,
connection between (deterministic) weighted boosting and other (randomized)
ensemble methods such as bagging [Breiman (1996)] and randomized trees
[Dietterich (1998)]. In the language of least-squares regression, the latter are
purely “variance” reducing procedures intended to mitigate instability, espe-
cially that associated with decision trees. Boosting on the other hand seems
fundamentally different. It appears to be mainly a “bias” reducing procedure,
intended to increase the flexibility of stable (highly biased) weak learners by
incorporating them in a jointly fitted additive expansion.
The distinction becomes less clear [Breiman (1998a)] when boosting is
implemented by finite weighted random sampling instead of weighted opti-
mization. The advantages or disadvantages of introducing randomization into
boosting by drawing finite samples is not clear. If there turns out to be an
advantage with randomization in some situations, then the degree of random-
ization, as reflected by the sample size, is an open question. It is not obvious
that the common choice of using the size of the original training sample is
optimal in all (or any) situations.
One fascinating issue not covered in this paper is the fact that boosting,
whatever flavor, seems resistant to overfitting. Some possible explanations are:
1. As the LogitBoost iterations proceed, the overall impact of changes intro-
duced by $f_m(x)$ decreases. Only observations with appreciable weight deter-
mine the new functions—those near the decision boundary. By definition
these observations have $F(x)$ near zero and can be affected by changes,
while those in pure regions have large values of $F(x)$ and are less likely
to be modified.
2. The stagewise nature of the boosting algorithms does not allow the full col-
lection of parameters to be jointly fit, and thus has far lower variance than
the full parameterization might suggest. In the computational learning the-
ory literature this is explained in terms of VC dimension of the ensemble
compared to that of each weak learner.
3. Classifiers are hurt less by overfitting than other function estimators
[e.g., the famous risk bound of the 1-nearest-neighbor classifier, Cover and
Hart (1967)].

Fig. 7. Real AdaBoost (stumps) on a noisy concentric-sphere problem, with 400 observations per class and Bayes error 25%. The test error (upper curve) increases after about fifty iterations.
Figure 7 shows a case where boosting does overfit. The data are gener-
ated from two ten-dimensional spherical Gaussians with the same mean, and
variances chosen so that the Bayes error is 25% (400 samples per class). We
used Real AdaBoost and stumps (the results were similar for all the boosting
algorithms). After about 50 iterations the test error (slowly) increases.
Schapire, Freund, Bartlett and Lee (1998) suggest that the properties of
AdaBoost, including its resistance to overfitting, can be understood in terms of
classification margins. However, Breiman (1997) presents evidence counter to
this explanation. Whatever the explanation, the empirical evidence is strong;
the introduction of boosting by Schapire, Freund and colleagues has brought
an exciting and important set of new ideas to the table.
Acknowledgments. We thank Andreas Buja for alerting us to the recent
work on text classification at AT&T laboratories, Bogdan Popescu for illu-
minating discussions on PAC learning theory and Leo Breiman and Robert
Schapire for useful comments on an earlier version of this paper. We also
thank two anonymous referees and an Associate Editor for detailed and use-
ful comments on an earlier draft of the paper.
REFERENCES

Breiman, L. (1996). Bagging predictors. Machine Learning 24 123–140.
Breiman, L. (1997). Prediction games and arcing algorithms. Technical Report 504, Dept. Statistics, Univ. California, Berkeley.
Breiman, L. (1998a). Arcing classifiers (with discussion). Ann. Statist. 26 801–849.
Breiman, L. (1998b). Combining predictors. Technical report, Dept. Statistics, Univ. California, Berkeley.
Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.
Buja, A., Hastie, T. and Tibshirani, R. (1989). Linear smoothers and additive models (with discussion). Ann. Statist. 17 453–555.
Cover, T. and Hart, P. (1967). Nearest neighbor pattern classification. IEEE Trans. Inform. Theory 13 21–27.
Dietterich, T. (1998). An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning 40 139–158.
Freund, Y. (1995). Boosting a weak learning algorithm by majority. Inform. and Comput. 121 256–285.
Freund, Y. and Schapire, R. (1996a). Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory 325–332.
Freund, Y. and Schapire, R. E. (1996b). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference 148–156. Morgan Kaufman, San Francisco.
Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55 119–139.
Friedman, J. (1991). Multivariate adaptive regression splines (with discussion). Ann. Statist. 19 1–141.
Friedman, J. (1996). Another approach to polychotomous classification. Technical report, Stanford Univ.
Friedman, J. (1999). Greedy function approximation: the gradient boosting machine. Technical report, Stanford Univ.
Friedman, J. and Stuetzle, W. (1981). Projection pursuit regression. J. Amer. Statist. Assoc. 76 817–823.
Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman and Hall, London.
Hastie, T. and Tibshirani, R. (1998). Classification by pairwise coupling. Ann. Statist. 26 451–471.
Hastie, T., Tibshirani, R. and Buja, A. (1994). Flexible discriminant analysis by optimal scoring. J. Amer. Statist. Assoc. 89 1255–1270.
Holte, R. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning 11 63–90.
Huber, P. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35 73–101.
Kearns, M. and Vazirani, U. (1994). An Introduction to Computational Learning Theory. MIT Press.
Mallat, S. and Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Processing 41 3397–3415.
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Chapman and Hall, London.
Schapire, R. (1997). Using output codes to boost multiclass learning problems. In Proceedings of the Fourteenth International Conference on Machine Learning 313–321. Morgan Kaufman, San Francisco.
Schapire, R. E. (1990). The strength of weak learnability. Machine Learning 5 197–227.
Schapire, R. E. and Singer, Y. (1998). Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory.
Schapire, R., Freund, Y., Bartlett, P. and Lee, W. (1998). Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Statist. 26 1651–1686.
Valiant, L. G. (1984). A theory of the learnable. Comm. ACM 27 1134–1142.
Department of Statistics
Sequoia Hall
Stanford University
Stanford, California 94305
E-mail: jnf, hastie, tibs@stat.stanford.edu
DISCUSSION
Leo Breiman
University of California, Berkeley
The authors have presented a fascinating view of boosting and I congrat-
ulate them for the steps they have taken to clear up the mystery of why the
AdaBoost algorithm works so well. Yet I have misgivings about their explana-
tion (which I expressed earlier in my comments to them on receiving the first
draft of their paper).
A crucial property of AdaBoost is that it almost never overfits the data no
matter how many iterations it is run. The authors' counterexample is the first
convincing one I have seen but stands in contrast to the hundreds of data sets
on which AdaBoost has not produced overfitting. Any complete explanation of
AdaBoost has to explain this empirical fact. Ordinarily, in logistic regression,
if one adds more and more variables, increasing the likelihood at each step,
then at some point overfitting sets in and the test-set error blossoms. The fact
that this does not happen in AdaBoost or in LogitBoost is extraordinary. And,
unless I am missing something, there is no explanation in the paper.
I approach the problem differently. In Breiman (1999a), I assume an ensemble
of classifiers $h(x, \theta)$, $\theta \in \Theta$. Let the sequence $\theta_n$ be i.i.d. selections from
$\Theta$ according to some probability $Q$ and let the classifiers $h(x, \theta_1), h(x, \theta_2), \ldots,$
$h(x, \theta_N)$ each cast a unit vote for the predicted class at $x$. Then, under weak
conditions, as $N$ goes to infinity, the selected class converges a.s. to
$\arg\max_j Q(h(x, \theta) = j)$. This holds, for example, if the classifiers are trees, in which
case I refer to the ensemble as a "random forest." The theorem is a simple
consequence of the law of large numbers.
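The unit-vote aggregation itself is a one-liner; a minimal sketch, with each classifier represented as a callable (our own representation, not Breiman's code):

```python
import numpy as np

def forest_vote(classifiers, x):
    """Plurality vote of an ensemble: each h(x, theta_n) casts one unit
    vote and the most frequent predicted class wins."""
    votes = np.array([h(x) for h in classifiers])
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]
```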
For example, in bagging, each classifier is grown on a bootstrap sample from
the original training set. If the training set has $K$ examples, let $\theta$ be a $K$-vector
of nonnegative integers adding to $K$. Under $Q$, let $\theta$ have the distribution of
sampling $K$ times with replacement from $K$ objects. Thus, using the results
cited, bagging converges a.s. as the number of iterations becomes large.
AdaBoost is probably a random forest. Consider the following version of
AdaBoost: given a probability $P$ on the training set, use a fixed algorithm to
construct a classifier $h(x, P)$ using the weighted training set. The next probability
$P'$ on the training set depends only on $P$, the indices of the examples
misclassified and the weighted misclassification error. Then there is a deterministic
function $\Phi$ defined on the space of all probabilities on the training
set such that $P' = \Phi(P)$. Assume that $\Phi$ is ergodic with invariant measure $Q$
and that the voting is unweighted. Then the class selected converges a.s. to
$\arg\max_j Q(h(x, P) = j)$.

The implication is that the random forest constructed by letting the
sequence $P_n$ be i.i.d. from $Q$ is equivalent to AdaBoost. AdaBoost uses
weighted voting, but a small alteration in the argument gives the same result.
In Breiman (1999a), empirical evidence is given to show that AdaBoost is a
random forest. This would settle the question of why AdaBoost does not overfit,
but not the question of why AdaBoost gives such low test-set error.
There are many different random forests, and some are more accurate than
others. For instance, Dietterich (1998) constructs a random forest by choosing
at random among the ten best splits at any node and shows empirically that
this gives generally lower test-set error than bagging. In Breiman (1999a) we show
that the test-set misclassification error is bounded above by an expression
that depends on two parameters: the first is the expected correlation between
pairs of classifiers, and the second is the expected accuracy of the individual
classifiers.
One very effective way to reduce correlation is by random choice of features
at each node. That is, suppose there are ten predictor variables. At each node,
choose a small subgroup of them at random to split on. This idea first appeared
in a paper by Amit and Geman (1997). Using this approach and 100 iterations
gives the following test-set errors as compared to the best corresponding values
for LogitBoost.
Test-Set Errors (%)
Dataset LogitBoost Random Forest
Breast 3.8 2.9
Ionosphere 6.3 7.1
Glass 23.8 20.6
Sonar 15.4 15.9
Waveform 19.1 17.2
Sat-images 11.2 8.6
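A minimal sketch of the random feature selection at a single node, with a Gini criterion as one concrete choice; the names, the criterion and the parameter n_try are ours, and Breiman's implementation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def best_split_random_features(X, y, n_try):
    """Best Gini split, searched over a random subset of n_try features
    rather than all of them; this decorrelates the trees in the forest."""
    best = None  # (weighted impurity, feature, threshold)
    for j in rng.choice(X.shape[1], size=n_try, replace=False):
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            score = (left.mean() * gini(y[left])
                     + (1.0 - left.mean()) * gini(y[~left]))
            if best is None or score < best[0]:
                best = (score, j, t)
    return best  # None if every sampled feature is constant
```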
The comparison is on runs I have already done. Since I have not done a run
on the vowel dataset that is similar to that of the paper, it does not appear.
Similarly, I have not run the letter dataset referred to. The major point of this
comparison is to show that correlation and strength are the essential ingre-
dients of an accurate ensemble, and that if boosting succeeds, it is because it
gives low correlation between classifiers of reasonable strength. The fact that
a similar algorithm maximizes the likelihood in a logistic setting is not the
primary reason for its success.
So what is known about the factors leading to AdaBoost’s success? One
clear factor is that at each step it decouples itself from the previous step. This
would intuitively seem to lead to low correlation between pairs of classifiers
in AdaBoost considered as a random forest. However, this has not yet been
analytically or empirically established.
A recently discovered empirical property of AdaBoost is mysterious but
may lead to better understanding. At each iteration, the examples currently
misclassified by AdaBoost are weighted more heavily for the next iteration,
to encourage their correct classification. The commonly accepted interpreta-
tion, therefore, is that AdaBoost tries to classify correctly the hard-to-classify
examples, and ignores the easy-to-classify examples. This interpretation is
incorrect, and I have to admit some culpability.
Suppose AdaBoost is run for a while. Then for each case in the training set,
look at the weighted proportion of times that it has been misclassified (the
weights are just those that AdaBoost assigns to each tree in the ensemble).
For all data sets I’ve looked at so far, these proportions are nearly the same
over the entire training set. That is, AdaBoost is an equalizer. When it puts
more emphasis on points that have been recently misclassified, it is trying to
get them correctly classified next time around [see Wheway (1999)].
Let the points in the training set be $(y_n, x_n)$. Let $pr(k, n)$ be the unweighted
proportion of times that the $n$th case in the training set is misclassified by the
first $k$ classifiers. To check on my hypothesis that AdaBoost worked so well
because it is an equalizer, using the methods in Breiman (1999b), I devised
an algorithm called arc-equal which sequentially minimizes $\mathrm{Av}_n (pr(k, n) - t)^2$,
where $t$ is an adjustable parameter.
Then I did runs on a number of datasets, proceeding as follows: first AdaBoost
was run 100 times. The $t$ was taken equal to the average weighted
misclassification rate over the training set produced by AdaBoost, and arc-equal
was run 100 times. When this was repeated 50 times, each time leaving out
10% of the data as a test set, the test-set error rate of arc-equal was always
nearly the same as that of AdaBoost.
The experiments performed, and the empirical observation that on our datasets
AdaBoost tends to equalize the misclassification error over the training
set, are suggestive. But they do not yet explain why equalization leads to low
test-set error. This mystery should be most interesting to some of our more
theoretically motivated statisticians.
REFERENCES

Amit, Y. and Geman, D. (1997). Shape quantization and recognition with randomized trees. Neural Comput. 9 1545–1588.
Breiman, L. (1999a). Random forests. Technical report, available at www.stat.berkeley.edu.
Breiman, L. (1999b). Prediction games and arcing algorithms. Neural Comput. 11 1493–1517.
Dietterich, T. (2000). An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting and randomization. Mach. Learning 40 139–158.
Wheway, V. (1999). Variance reduction trends on 'boosted' classifiers. Available from virg@cse.unsw.edu.au.
Department of Statistics
University of California
367 Evans Hall, #3860
Berkeley, California 94720-3860
E-mail: leo@stat.berkeley.edu
DISCUSSION
Peter Bühlmann and Bin Yu

ETH Zürich and Bell Laboratories, Murray Hill, Lucent Technologies and University of California, Berkeley
We congratulate Friedman, Hastie and Tibshirani (FHT) for their signif-
icant work connecting boosting and (kinds of) additive models. FHT provide
a much-needed bridge between an important and effective machine learning
procedure and traditional statistical modeling ideas. With this bridge, ideas
can now flow easily in both directions so that a deeper and more thorough
understanding of boosting will eventually emerge.
In this discussion, we would like to share our thoughts and reflections on
boosting and related statistical ideas, inspired by FHT.
1. What makes boosting work in classification? Let us consider the
two-class problem in this section. In an elegant and convincing way, FHT bring
Freund and Schapire’s Discrete AdaBoosting to our home domain by rederiv-
ing its population version as a familiar Newton-like step in the optimization of
a not-so-familiar exponential loss function serving as a surrogate for the zero–
one loss function. This surrogate is sensible since it is uniquely minimized by
(half of) the log odds ratio of the conditional probability of $Y = 1$ given $X = x$.
Under this light, it is not hard to find instances of statistical procedures shar-
ing traces of similarities with boosting. For example, in parametric estimation,
we indeed have been “boosting” a consistent estimator to an efficient estima-
tor by taking a Newton step in the optimization of the likelihood function and
indeed have been using estimating equations as sensible surrogates because
the likelihood equations are intractable. Despite these similarities, there are
fundamental differences between boosting and these statistical procedures.
As recounted by FHT, boosting as a conjecture was proposed as a theoretical
learning question in the machine learning community and is now being viewed
as a nonparametric stepwise fitting technique in an additive style. Its superb
empirical performance in high-dimensional classification problems has very
few, if any, rivals. Even though the concept of a “true” model does get used in
the evaluation stage of boosting, the starting point of boosting is not a “true
model” as is commonly done in statistics. Its starting point is a “weak learner”
and the question posed was how to make it better or “boost” it. In hindsight,
this is a natural and realistic approach in modern data analysis where the size
of datasets could not be imagined in the time of our forebears such as Fisher
or Neyman. Because of the complexity and scale of these problems, it is impos-
sible to come up with an effective likelihood approach in one step. Often, if
not always, a sensible procedure or a “weak learner” is available, either ad
hoc or based on a simple approximation. Moreover, an evaluation criterion or
a loss function is always available. The boosting question now becomes how to
improve the weak learner in terms of this loss function. With a starting point
and an objective function, a greedy approach for improvement is very natu-
ral. Numerical considerations of the greedy approach explain why boosting in
terms of the evaluating zero–one loss function might not be a good idea. Any
derivative-based greedy algorithm such as Newton’s method is not appropri-
ate for this discontinuous loss function whose minimizer is not unique but a
whole class. A surrogate, the exponential loss function, is used in AdaBoost-
ing as the implementing objective function although the evaluation function
remains the zero–one loss. From a numerical point of view, the exponential
function is an excellent function to apply Newton’s method, because of its con-
vexity and gradually changing derivatives. Furthermore, this is a relevant
loss function to optimize since its minimizer is sensible for the classification
problem. A second surrogate, proposed by FHT, is the binomial likelihood that
as a function is very close to the exponential loss and has the same minimizer.
FHT devise the Gentle Boosting algorithm based on the binomial surrogate
and show a similar performance to those based on the exponential loss. A third
surrogate, the squared loss function, is also discussed in FHT. The squared
function is the best possible for Newton’s method: convex and having constant
second derivatives. FHT report good performance but think that it is domi-
nated by schemes based on either the exponential or the binomial likelihood
loss functions and hint that the reason might be that the squared loss function
loses its monotonicity when $yF(x) > 1$. It is well known that the unique
minimizer of the squared loss is the conditional expectation of $Y$ given $X = x$,
which is naturally bounded between $-1$ and $1$ since $|Y| = 1$. If this constraint
is taken into account in the greedy search for the optimizer of the squared loss,
then one should not wander out to the region where $yF(x) > 1$ but stay in the
region where the squared loss is monotone. It is curious to know whether or
not taking this constraint into account (obviously numerically more compli-
cated) would bring the performance of squared loss boosting on a par with
those based on the exponential or binomial likelihood loss.
Although the three surrogates mentioned so far have sensible minimizers,
they are only qualitative approximations to the zero–one loss function. They
are actually quite bad quantitative approximations (cf. Figure 2 of FHT). But
the population minimizers of the expected zero–one loss function include those
of exponential, binomial likelihood and the squared loss. The first question
then simply becomes:
1. Which surrogate (or implementing loss function for boosting) to use so
that the boosting estimate best approximates its corresponding population
minimizer? This (difficult) question might involve numerical as well as
statistical–stochastic efficiency issues.
Now let us step back from these successful (implementing) surrogates to
look at the original zero–one loss function. From the quantitative approxima-
tion point of view, an obvious surrogate is a smoothed version of the zero–one
loss function, $1 - \Phi(yF(x)/\sigma)$, where $\Phi$ is the cdf of $N(0, 1)$ and $\sigma$ is a tuning
parameter to control the approximation of such a smoothed version to the original
zero–one loss function. This makes the numerical greedy approach possible.
If one follows the exact steps (when $\sigma$ is fixed) to this objective function
as in the derivation of the population version of Discrete AdaBoost (Result 1
in FHT), one gets a boosting algorithm with a different reweighting scheme,
having weights

$$\varphi(yF(x)/\sigma) = \varphi(F(x)/\sigma),$$

because $y^2 = 1$ and $\varphi(\cdot)$, the standard normal density, is symmetric.
These weights ignore the values of $y$, avoiding any reality check, and concentrate
more and more on the cases where the previous classifier $F$ is unsure of
itself. Since this is the opposite of the boosting reweighting philosophy, one
might expect that it would not work well. The smoothed zero–one loss func-
tion is not convex (or concave) and its derivatives change rapidly, especially
for a small tuning parameter $\sigma$, that is, a closer approximation to the zero–one
loss function. For such functions, the quadratic approximation is not accurate
and Newton’s method easily overshoots to miss the true optimizer. Hence this
recipe should not work from a numerical point of view.
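For concreteness, the smoothed loss and its label-free weights can be coded directly; a small sketch, with sigma and the function names being our own choices:

```python
import numpy as np
from scipy.stats import norm

def smoothed_01_loss(y, F, sigma=0.5):
    """1 - Phi(yF/sigma): a smooth surrogate for the zero-one loss."""
    return 1.0 - norm.cdf(y * F / sigma)

def smoothed_01_weights(F, sigma=0.5):
    """phi(yF/sigma) = phi(F/sigma), since y^2 = 1 and phi is symmetric:
    the weights depend only on |F|, not on the labels y."""
    return norm.pdf(F / sigma)
```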
2. The question is, however, for this smoothed zero–one loss, will a more suit-
able numerical optimization method such as trust region [cf. Gill, Murray
and Wright (1981)] lead to a sensible boosting procedure having a different
weighting philosophy, presumably with higher computational cost?
As argued above, boosting algorithms benefit greatly from a smart choice of
a surrogate implementing objective function from both statistical and numer-
ical points of view. FHT (Section 4.4) also point out that Gentle Boosting has
an edge because of the numerical stability of the derivative calculation of the
binomial likelihood function. However, boosting’s most powerful advocate is
its stunning success on real data sets and its “mysterious” resistance to over-
fitting in most cases in classification. The summary in Table 2 of FHT on
real dataset results suggests that the performance of a particular variant of
boosting depends on the interaction among the choice of the weak learner, the
underlying problem and the effectiveness of the Newton method as a numeri-
cal approximation to the surrogate loss function. Figuring out how these fac-
tors interact should bring us a deeper understanding of boosting and might
shed light on its resistance to overfitting.
We explore this interaction in the rest of this discussion. In the context of
boosting $L_2$ regression, we compare boosting with another ensemble scheme,
bagging [Breiman (1996)], for which we have gained some understanding
recently [Bühlmann and Yu (2000)]. Different weak learners are looked at,
the overfitting issue is touched upon, and a bag-boosting scheme is proposed.
In the last section, we return to classification to emphasize the importance of
the choice of the weak learner and to make the point that the resistance to
overfitting of boosting is probably due to the zero–one evaluation loss function.
2. Boosting and bagging with $L_2$ loss for regression. Boosting as
explained and analyzed by FHT is a general method: a greedy forward stagewise
technique to fit a model of additive style by minimizing a loss function.
This view opens the door for boosting in other contexts than classification
(although $L_2$ loss is also an appropriate surrogate as described in the previous
section). For $L_2$ regression, the boosting algorithm works as follows:
(a) Set $F_0(\cdot) \equiv 0$ (or another more sensible starting value).

(b) For $m = 1, 2, \ldots, M$, fit the function estimator $f_m(\cdot)$, which is parameterized as $f_m(x) = \beta b(x; \gamma)$ (as in FHT):

$$(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{n} \bigl(Y_i - F_{m-1}(X_i) - \beta b(X_i; \gamma)\bigr)^2.$$

Set $F_m(\cdot) = F_{m-1}(\cdot) + f_m(\cdot)$.

(c) Output the function estimator

$$F_M(\cdot) = \sum_{m=1}^{M} f_m(\cdot).$$
This algorithm is indicated by FHT [formula (6)] and was also given by
Friedman (1999). It does not involve any "reweighting." The weak learner
$b(\cdot; \gamma)$ is fitted in the $m$th step to the current residuals $Y_i - F_{m-1}(X_i)$. There
is no need for a surrogate loss function in this case, since the evaluating $L_2$ loss
is the best possible for Newton's method and the quadratic approximation is
exact. We now discuss the issue of choosing the weak learner. This discussion
is continued in Section 3 for the case of classification.
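A minimal runnable sketch of steps (a)-(c), using a stump as one concrete weak learner (anticipating the stump example further below) and simulating from model (1) defined below; fit_stump and l2_boost are our own names, and no claim is made that this matches the discussants' implementation.

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares stump for one-dimensional x: two levels split at t."""
    best = None
    for t in np.unique(x)[:-1]:
        left = x <= t
        a, b = r[left].mean(), r[~left].mean()
        rss = ((r[left] - a) ** 2).sum() + ((r[~left] - b) ** 2).sum()
        if best is None or rss < best[0]:
            best = (rss, t, a, b)
    _, t, a, b = best
    return lambda z: np.where(z <= t, a, b)

def l2_boost(x, y, M=6):
    """Steps (a)-(c): repeatedly fit a stump to the current residuals."""
    F, fs = np.zeros_like(y), []
    for _ in range(M):
        f = fit_stump(x, y - F)
        F, fs = F + f(x), fs + [f]
    return lambda z: sum(f(z) for f in fs)

# Model (1): Y = 2 + 3X + eps, X ~ Uniform(0, 1), eps ~ N(0, 1)
rng = np.random.default_rng(1)
n = 100
x = rng.uniform(0, 1, n)
y = 2 + 3 * x + rng.normal(0, 1, n)
F6 = l2_boost(x, y, M=6)          # six iterations, as in Figure 1
print(F6(np.array([0.2, 0.8])))   # crude, piecewise-constant approximation
```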
Linear weak learners. It is well known that boosting's cousin bagging does
not change any linear procedures. It can be shown easily that the same is true
of boosting in $L_2$ regression. When the weak learner is linear, $b(x; \gamma) = x^T\gamma$
and $f_m(x) = x^T\gamma_m$, the following two statements hold:

(a) For $L_2$ boosting in regression as described earlier,

$$f_m(\cdot) \equiv 0 \quad \text{for all } m = 2, 3, \ldots, M,$$
$$F_M(\cdot) = \text{the least-squares linear regression predictor, for all } M \ge 1.$$

(b) For LogitBoost as presented by FHT, with

$$p_M(x) = \exp(F_M(x)) / \bigl(\exp(F_M(x)) + \exp(-F_M(x))\bigr),$$

$p_M(x)$ converges to the corresponding MLE in logistic linear regression as
$M \to \infty$, provided that standard conditions for convergence of the MLE hold.
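Statement (a) is easy to check numerically: after the first least-squares step the residuals are orthogonal to the design columns, so the next fitted linear term is (numerically) zero. A small self-contained check under our own simulated setup:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])  # intercept + covariate
y = 2 + 3 * X[:, 1] + rng.normal(0, 1, n)

gamma1, *_ = np.linalg.lstsq(X, y, rcond=None)       # first boosting step
resid = y - X @ gamma1                               # residuals after step 1
gamma2, *_ = np.linalg.lstsq(X, resid, rcond=None)   # second step fits ~0
print(np.allclose(gamma2, 0.0))                      # True up to rounding
```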
Stumps as the weak learner. We now investigate the effect of boosting in
a concrete simple linear model,

$$Y_i = 2 + 3X_i + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad (1)$$

with the $\varepsilon_i$'s i.i.d. $N(0, 1)$, the $X_i$'s i.i.d. Uniform$([0, 1])$ and all $X_i$'s independent
from all $\varepsilon_i$'s. The target is the true function $F_{\text{true}}(x) = 2 + 3x$, $x \in [0, 1]$.
We use stumps (with one-dimensional covariate space) as weak learners and
run the $L_2$ boosting algorithm for regression from above. For a typical sample,
the boosting stumps estimate is displayed in Figure 1; it is still a rather crude
and erratic approximation for the target. Figure 1 uses six boosting iterations
which is optimal on average with respect to MSE as indicated in Figure 2.
Based on 50 independent simulations of model (1), we have estimated bias,
variance and hence also mean-squared error as a function of the number of
boosting iterations. The result is displayed in Figure 2.
3. Interestingly, there is a clear indication for overfitting, starting after six
iterations already. We see here a simple example in $L_2$ boosting for regres-
sion where overfitting occurs easily in contrast to classification; similar
phenomena are reported in Friedman (1999).
From our own work [Bühlmann and Yu (2000)] we know that stumps evaluated
at $x$ have high variances for $x$ in a whole region of the covariate space.
From an asymptotic point of view, this region is "centered around" the true
optimal split point for a stump and has "substantial" size $O(n^{-1/3})$. That
is, stumps do have high variances even in low dimensions as in this simple
case (with only three parameters), as long as one is looking at the "right
scale" $O(n^{-1/3})$; such high variance presumably propagates when combining
stumps in boosting. This observation is the starting point for another boosting
machine to be described next.
The bag-boosting machine with bagged weak learners. Bagging is known
to smooth a weak learner such as stumps and thus achieves a variance
reduction. The precise (asymptotic) smoothing function is given in Bühlmann
and Yu (2000), which also characterizes its variance reduction effect in simple
yet canonical cases. The high variance of a stump is reduced by up to a factor
of 2 to 3, while leaving the bias approximately unchanged. This implies
that bagging is also effective for a low-dimensional predictor such as stumps.
Hence, we combine bagging with boosting and we call it bag-boosting. For the
weak learner in boosting, just replace the stump (or a more general tree) by
the bagged stump (or bagged tree). This is a very natural idea and has been
thought about by a couple of researchers, although we have not seen anything
in writing. The resulting fit from bag-boosting is shown in Figures 1 and 2.
Note that performance for bagging alone is given by bag-boosting with one
boost. Instead of a classical bagging step, we actually use "sub-bagging," or
bagging on subsamples of size $[n/2]$ (resampling without replacement), which
is computationally cheaper while still being as accurate as bagging [Bühlmann
and Yu (2000)]. The visual improvement of bag-boosting in Figure 1 can be
explained by the fact that a bagged stump is a smooth rather than a step
function [Bühlmann and Yu (2000)].

Fig. 1. Boosting stumps (B, black) and bag-boosting stumps (BB, red) estimates based on a typical sample (dots) from (1). The target function is indicated in green. Boosting is done with 6 and bag-boosting with 4 iterations.

Fig. 2. Boosting stumps and bag-boosting stumps for model (1). Top: squared bias and variance of boosting (B, black and red) and bag-boosting (BB, purple and green). Bottom: mean-squared error for boosting (B, black) and bag-boosting (BB, red).

Its variance and MSE improvements are
impressively described by Figure 2, provided that we know roughly where to
stop with boosting iterations. Figure 2 illustrates that the more efficient base
learner (namely the bagged stump) has a faster increase in variance with the
number of boosting iterations: It would be interesting to know whether this
is a more general fact.
4. Thus bag-boosting has a potential to improve upon boosting with stumps
or larger decision trees. Its drawback is the higher, although still feasible,
computational cost.
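A minimal sketch of a sub-bagged stump as the weak learner, reusing the fit_stump function from the earlier sketch: it averages B stumps, each fitted on a random half-sample drawn without replacement; the names and B = 25 are our own choices.

```python
import numpy as np

def fit_subbagged_stump(x, r, B=25, rng=np.random.default_rng(3)):
    """Sub-bagged stump: average B stumps fitted on subsamples of size n/2
    drawn without replacement, yielding a smoothed weak learner."""
    n = len(x)
    fits = []
    for _ in range(B):
        sub = rng.choice(n, size=n // 2, replace=False)  # sub-bagging sample
        fits.append(fit_stump(x[sub], r[sub]))
    return lambda z: np.mean([f(z) for f in fits], axis=0)
```

Plugging this learner into the l2_boost loop in place of fit_stump gives a bag-boosting machine of the kind described above.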
3. Back to classification. Many issues remain. For example, the ques-
tion about convergence is not touched by FHT. In particular, for a fixed sample
size, there are two issues: the (non)convergence of the boosting algorithm and
whether it is better to stop iterations before having approximately converged,
if the algorithm actually would converge. The latter issue is known in fitting
complex parametric models such as neural networks where a possible regular-
ization is given by stopping before convergence. There is another convergence
issue at the next level, when the sample size tends to infinity.
5. A more important question is how much FHT’s population story tells about
the data driven algorithms. The population quantities are always approx-
imated: the accuracy depends on the choice of weak learners (see also
our remarks about linear weak learners, stumps and bagged stumps in
Section 2) and the data-generating distributions. Finally, the hope is that
the errors do not propagate with the boosting iterations!
Particularly, even with the connection made by FHT, it is hard to know
whether one should use a simple learner like stumps or a more complex one
such as trees with more terminal nodes.
In the boosting literature, which is almost exclusively on classification, deci-
sion trees are the most popular weak learners. We do agree that trees are
generally attractive, but it is worth pointing out that other (nonlinear) weak
learners may exhibit good performance as well. Even when using trees as
weak learners, the choice of the number of terminal nodes (e.g., with best-first
induction) in the learners does matter. Figure 1 in FHT can be misleading for
a superficial reader. It wrongly suggests that the number of terminal nodes
in the learner is irrelevant when there is sufficient computing power for per-
forming many boosting steps; and that with stumps, Discrete AdaBoost is
outperformed by Real AdaBoost consistently. The careful reader will note
a more subtle sensitivity with respect to choosing the learner in Section 6
of FHT [e.g., Figure 4 (top left and top right)] and results on real data sets
(Table 2).
FHT distinguish between additive and nonadditive decision boundaries.
They argue convincingly that in the first case, choosing stumps as weak learn-
ers is better than using a larger tree which is expected to overestimate nonex-
isting interaction terms. If we knew that the decision boundary could be well
approximated by an additive function, we would feel more comfortable by fit-
ting an additive model (with backfitting) as in Hastie and Tibshirani (1990),
rather than boosting stumps. Or in other words, if the conjecture by FHT
[Section 11] “that boundaries involving very high-order interactions will rarely
be encountered in practice” (whatever “high” means here) is really relevant,
there would not be any good reason to prefer boosting over the established
and popular GAM backfitting estimate.
Boosting’s resistance to overfitting has been one of its major virtues in
comparison with other nonparametric techniques. Overfitting in boosting algo-
rithms in the sense of fitting a large model can happen in two ways. The first
is having a large model fitted as the weak learner and the other is running
the algorithm for many iterations. The most overfitted model comes from a
large or complex weak learner in combination with many boosting iterations.

Table 1
Misclassification rates (in percentages) for the breast cancer data

Original tree (unpruned; the tree program from R)   5.643
Sub-bagging                                         4.812
Boosting stumps                                     4.029 (with optimal 97 boosts)
Bag-boosting stumps                                 3.652 (with optimal 47 boosts)
Boosting large tree                                 2.929 (with optimal 117 boosts)
Bag-boosting large tree                             2.768 (with optimal 11 boosts)

FHT touch on this overfitting issue in the concluding section by listing
three plausible explanations. As emphasized in Section 1 of this Discussion,
the use of the (implementing) surrogate exponential loss function is crucial
from the numerical point of view and is thus partially responsible for the
success of boosting. However, the evaluation in a classification problem has
always been the zero–one loss.
6. This divorce between the implementing and the evaluating loss function does not exist in L2 boosting, where we saw overfitting easily in Section 2. We concur with the last point of FHT in Section 11 that the zero–one loss is very robust against overfitting, and we further argue that an evaluation based on the exponential implementing loss does show strong overfitting. This conjecture is supported by the breast cancer dataset below. Moreover, we believe that “parts of the real world” are nonadditive. Under this belief, and knowing boosting’s resistance to overfitting, it can be very valuable to use best-first induced trees with more than two terminal nodes. This is also confirmed by our analysis of the breast cancer data in Table 1.
Analysis of breast cancer data. We partially redo the analysis of FHT for the breast cancer data and thereby add insights on bag-boosting and overfitting. Table 1 and Figure 3 give results comparable to the ones in FHT. We use “sub-bagging” for computational efficiency; it has performance similar to bagging [Bühlmann and Yu (2000)]. The misclassification rates (called test error in FHT) are estimated by averaging over 100 random partitions of the data set into a 90% training set and a 10% test set, whereas FHT use 5-fold cross-validation. So our results differ slightly from FHT (also because presumably we are using different tree algorithms), but we believe ours have better accuracy. Boosting and its variants are done with Real AdaBoost, and the trees are grown with the program from R. Bag-boosting large trees gives the best result, followed by boosting large trees, bag-boosting stumps, boosting stumps and then bagging. Bag-boosting needs only a few iterations and brings a nontrivial improvement over boosting alone, at the expense of a more computationally costly weak learner. Note that Figure 3 (bottom panel) shows strong overfitting with respect to E[exp(−Y F(X))].
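To make the procedures compared in Table 1 concrete, the following is a minimal numpy sketch of Real AdaBoost with stumps, together with the subbagged “bag-boosting” variant. It is only an illustration under stated assumptions: the weighted Gini split criterion, the half-sample size and all function names are choices made here for the sketch, not the actual code behind Table 1 or Figure 3.

import numpy as np

def fit_stump(X, y01, w):
    # Weighted stump: choose (feature, threshold) minimizing weighted Gini;
    # each side predicts its weighted proportion of the class y = 1.
    best = None
    for j in range(X.shape[1]):
        order = np.argsort(X[:, j])
        xs, ys, ws = X[order, j], y01[order], w[order]
        cw, cwy = np.cumsum(ws), np.cumsum(ws * ys)
        for i in range(len(xs) - 1):
            if xs[i] == xs[i + 1]:
                continue
            wl, wyl = cw[i], cwy[i]
            wr, wyr = cw[-1] - wl, cwy[-1] - wyl
            pl, pr = wyl / wl, wyr / wr
            gini = wl * pl * (1 - pl) + wr * pr * (1 - pr)
            if best is None or gini < best[0]:
                best = (gini, j, (xs[i] + xs[i + 1]) / 2, pl, pr)
    return best[1:]  # (feature, threshold, p_left, p_right)

def stump_p(s, X):
    j, t, pl, pr = s
    return np.where(X[:, j] <= t, pl, pr)

def real_adaboost(X, y, M=100, n_bag=0, seed=0):
    # Real AdaBoost with stumps; y in {-1, +1}. With n_bag > 0 each boosting
    # step averages n_bag stumps fit to random half-samples (bag-boosting
    # via subbagging) before taking the half log-odds.
    rng = np.random.default_rng(seed)
    n, y01, eps = len(y), (y + 1) / 2, 1e-5
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(M):
        if n_bag == 0:
            members = [fit_stump(X, y01, w)]
        else:
            subs = [rng.choice(n, n // 2, replace=False) for _ in range(n_bag)]
            members = [fit_stump(X[i], y01[i], w[i]) for i in subs]
        p = np.clip(np.mean([stump_p(s, X) for s in members], axis=0), eps, 1 - eps)
        f = 0.5 * np.log(p / (1 - p))          # half log-odds contribution
        w *= np.exp(-y * f)                    # AdaBoost reweighting
        w /= w.sum()
        ensemble.append(members)
    return ensemble

def predict_F(ensemble, X, eps=1e-5):
    F = np.zeros(len(X))
    for members in ensemble:
        p = np.clip(np.mean([stump_p(s, X) for s in members], axis=0), eps, 1 - eps)
        F += 0.5 * np.log(p / (1 - p))
    return F  # classify with np.sign(F)

In this sketch the averaged probabilities, rather than the individual stumps, define each step’s half log-odds contribution, which is one natural reading of the bag-boosting proposal.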
Fig. 3. Different weak learners in Real AdaBoost for breast cancer data (Section 7 in FHT). Top: misclassification rates. Bottom: expected exponential losses E[exp(−Y F(X))] on the logarithmic scale.
7. For this particular dataset and with respect to the misclassification rate, there is hardly any overfitting. This suggests that there is no loss in using large trees, as long as there is never any substantial overfitting. Boosting stumps, on the other hand, restricts the fit to additive models. With respect to the misclassification rate, should we always boost with large trees as weak learners?
Acknowledgments. Bin Yu thanks Mark Hansen for many stimulating
discussions and helpful comments on the draft and Margaret Wright for help-
ful discussions on optimization. Special thanks are due to Josef Vogt for the
results on the breast cancer data.
REFERENCES
Breiman, L. (1996). Bagging predictors. Machine Learning 24 123–140.
Bühlmann, P. and Yu, B. (2000). Explaining bagging. Preprint.
Friedman, J. (1999). Greedy function approximation: the gradient boosting machine. Technical report, Stanford Univ.
Gill, P. E., Murray, W. and Wright, M. H. (1981). Practical Optimization. Academic Press, New York.
Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman and Hall, London.
Seminar für Statistik
ETH-Zentrum, LEO D72
ETH-Zentrum, LEO D72
CH-8092 Zurich
Switzerland
E-mail: buhlmann@stat.math.ethz.ch
Department of Statistics
University of California
Berkeley, California 94720-3860
DISCUSSION
Andreas Buja
AT&T Labs
1. The cultural background of boosting: machine learning versus
statistics. I have worked under the same roof with the inventors of boosting,
Freund and Schapire, for several years, but I might as well have been on
another planet. The physical vicinity was of no help in coming to grips with
boosting. I listened to their talks, but I heard the words and missed the point.
Later I read Leo Breiman’s debates with Freund and Schapire, and it really
got my attention when I heard his praise of boosting. His penetrating analysis
was an effort to understand the performance of boosting, in particular its
mysterious immunity to overfitting. What he did not provide was a translation
of boosting into terms that are familiar to statisticians. He did not bridge
the culture gap that separates the stodgy main stream of statistics from its
dynamic younger sibling in computer science, machine learning. This is one
of the achievements of the paper by Friedman, Hastie and Tibshirani (FHT):
It gives statisticians a language with which to think about boosting.
If Freund and Schapire feel misinterpreted or if they think of FHT’s statis-
tical modeling interpretation as misguided, as I expect they will, it is beside
the point. There is no single “true” interpretation of anything; interpretation
is a vehicle in the service of human comprehension. The value of an interpre-
tation is in enabling others to fruitfully think about an idea, and this is what
the FHT paper certainly does for statisticians and the idea of boosting.
An interpretation of new ideas in terms of old ones is not a mere reduction
of the ideas to an “old hat.” On the contrary, such a reduction can clarify what
is novel. The emphasis of the FHT interpretations may be too long on the
reduction aspect and too short on working out the novelty in boosting; I will
try to explain what I mean below. The novelty that we see in light of FHT’s
interpretations, though, may not agree with what Freund and Schapire think
is novel. Thus, while the FHT paper may provide a gate for statisticians to a
new area, the same gate may not be open in reverse for the folks in machine
learning (although I’d love to be proved wrong).
2. Boosting arbitrary generalized linear and additive models. Ac-
cording to FHT, boosting can be seen as a two-class classification method that
fits additive models on a logistic scale in a forward stagewise manner with
regard to a loss function that is second-order equivalent to the negative log
likelihood for the conditional Bernoulli model. This interpretation leads FHT
naturally to propose their “LogitBoost” method by substituting the exact neg-
ative log likelihood for the original loss function.
The test of every interpretation is in its capacity to spawn new ideas and
generalizations. Differing interpretations are worthwhile because they lead to
differing generalizations. The FHT interpretation resulted immediately in a
very competitive new version of boosting, as well as an improvement in boost-
ing multiclass classification.
It seems to me, though, that the FHT interpretation calls, even screams,
for a much more global extension of boosting: Now that we know how to boost
additive logistic regression, we also know how to boost any generalized addi-
tive model. If Algorithm 3 of FHT for LogitBoost is the familiar Newton–
Raphson algorithm modified for the acquisition of nonlinear terms in a stage-
wise logistic regression, then the same can be used for the acquisition of non-
linear terms in any stagewise generalized regression.
In particular, we now know how to boost ordinary least-squares regression:
by stagewise LS fitting of nonlinear terms to the previous residuals. This may
seem trite, but it isn’t if one has ever tried to come up with a version of boosting
for regression with continuous response. Freund and Schapire have tried, and
their tentative proposal was to reduce regression to some form of classification,
in a classical attempt at reducing a new problem to an old one. This seems
like a circuitous approach, however, once the FHT interpretation of boosting is
at hand. With the extension of the FHT interpretation to generalized additive
regression, it seems most natural to interpret stagewise LS regression as the
regression version of boosting.
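A minimal sketch of this regression version of boosting, assuming a one-dimensional predictor and stumps as the nonlinear terms (both are illustrative choices): each stump is fit by least squares to the current residuals and added to the fit, with no readjustment of earlier terms.

import numpy as np

def ls_stump(x, r):
    # Least-squares stump: split point minimizing the residual sum of
    # squares when each side is fit by its mean.
    order = np.argsort(x)
    xs, rs = x[order], r[order]
    csum, total_ss = np.cumsum(rs), np.sum(rs ** 2)
    best = None
    for i in range(1, len(xs)):
        if xs[i - 1] == xs[i]:
            continue
        sl, sr = csum[i - 1], csum[-1] - csum[i - 1]
        sse = total_ss - sl ** 2 / i - sr ** 2 / (len(xs) - i)
        if best is None or sse < best[0]:
            best = (sse, (xs[i - 1] + xs[i]) / 2, sl / i, sr / (len(xs) - i))
    _, t, mean_left, mean_right = best
    return lambda z: np.where(z <= t, mean_left, mean_right)

def boost_ls(x, y, M=200):
    # Stagewise least squares: fit a stump to the residuals, add it, repeat.
    F, terms = np.zeros_like(y, dtype=float), []
    for _ in range(M):
        f = ls_stump(x, y - F)
        F += f(x)
        terms.append(f)
    return lambda z: sum(f(z) for f in terms)

# Usage: recover a smooth signal from noisy data
x = np.random.default_rng(0).uniform(-2, 2, 500)
y = np.sin(3 * x) + 0.3 * np.random.default_rng(1).standard_normal(500)
fhat = boost_ls(x, y)

With M large and no damping this will eventually track the noise; the point here is only the mechanics of unadjusted stagewise fitting.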
3. The role of greedy stagewise fitting in the protection against
overfit. FHT’s interpretation of boosting is lucid, but their presentation of
stagewise fitting needs to be put in perspective. I have no quibbles with the
facts, but some with the positioning of the facts.
FHT do state that the fitting à la boosting is “forward stagewise,” but the
impression we (and above all the readers in machine learning) may get is that
this term describes a common mode of model fitting in statistics. It doesn’t. We
have the term, but it is not very well known, and it does not enjoy high stand-
ing among most statisticians who know about it. It is a computational device,
or more accurately, a capitulation, in those situations where one is unable to
fit a large number of model terms jointly. We even teach our students that
fitting one term at a time and not adjusting the others is not the way to fit a
model; we teach them that changing one term in a model affects all others in
a joint model fit, assuming that the joint model fit is the norm. Interestingly,
one popular method for joint fitting, the backfitting algorithm, was devised
by Friedman and Stuetzle (1981) to more closely achieve joint optimality in
the otherwise greedy fitting procedure of projection pursuit regression (PPR).
Had they omitted the backfitting part in their algorithm and allowed the
acquisition of a greater number of ridge function terms instead, PPR would
have been a regression version of boosting. It appears that even the uncon-
ventional minds of Friedman and Stuetzle were involuntarily constrained by
statisticians’ prejudice in favor of parsimonious models obtained by joint model
fits. Doing away with this prejudice is certainly one of the novelties in the
boosting endeavor. Ockham’s razor is being replaced by Freund and Schapire’s
sledgehammer.
Omitting joint fitting is crucial to the success of boosting: Its purported
advantage, relative protection against overfit, disappears instantaneously if
termwise (stagewise) fitting is replaced with joint fitting of all terms. The 500
or so stumps and twigs that I have seen fitted in a single boosting fit would
come crashing down if they were fitted jointly. Therefore, the essence of the
protection against overfit is in the suboptimality of the fitting method.
This also explains why boosting could not have been invented by statisti-
cians: Our preoccupations are such that the statistical benefits of a suboptimal
computational device would not have appeared on our radar. There are excep-
tions to the rule, though: Statisticians have, for example, developed a notion of
one-step estimates in robustness, where it was shown that one Newton step in
the computation of an M-estimate provides essentially the robustness benefits
of full iteration. Similarly, in order to understand the performance of boosting,
we will need a theory that explains why this particular type of suboptimality
often avoids overfit.
A curious observation in this context concerns the cavalier attitude in the
boosting community, including FHT, toward the stopping problem. They seem
to consider it sufficient to provide plots or tables that show misclassification
rates for 100, 200, 300, 400 terms. So we stop around 300 or 400, and this is
part of the definition of the method. The assumption, empirically substanti-
ated, is that it really doesn’t matter much as long as we stop after the acqui-
sition of a sufficiently large number of terms. The situation is similar to that
in artificial neural networks (NNs), where the stopping time of online opti-
mization acts as a smoothing parameter. In NNs, however, one tries to detect
the onset of overfit with cross-validation. In boosting, by contrast, the testing
error remains flat for a long time, overfit does not occur before the human
runs out of patience and a precise stopping rule seems therefore unnecessary.
A satisfactory theory of protection against overfit would have to explain
the relative flatness of the test error in boosting. The flatness is only relative,
however, and even boosting has to submit to overfit at some point. This is
seen from the fact that usually the class of fitted terms bx γis rich in the
sense that for a given finite sample size Nthe N-dimensional design vectors
bγ=bxnγn=1···Nspan all of RNas γruns across its domain.
This is obviously not a proof, but a slightly tighter argument can be given.
First, we note that, unlike in joint fitting, any iterative termwise fitting procedure depends not only on the space spanned by the terms, but also on the actual basis they represent and the order in which they are fitted. (I thank Werner
Stuetzle who spelled this out most clearly to me.)
In the simple case where the domain of the γ’s is finite and consists of K elements γ_1, …, γ_K only, boosting is a greedy form of backfitting the linear model Σ_{k=1}^{K} β_k b(x; γ_k). Instead of cycling through the terms in an orderly fashion k = 1, 2, …, K, 1, 2, … as in conventional backfitting, the next k is picked greedily so as to lower the loss function maximally. We can conceive of
at least three types of backfitting according to the order in which the terms
are fitted: (1) cycle over the terms in a fixed order (traditional backfitting),
(2) pick the next term randomly, (3) pick the next term greedily. It would
seem plausible to examine combinations of type 2 and type 3, whereby the
picking probability of a term is a function of its loss reduction; as is, type 2
uses a uniform distribution on the terms and type 3 puts probability 1 on the
term with maximal reduction.
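These three orders are easy to compare empirically. A sketch for squared-error loss with a fixed, finite dictionary B of K basis vectors; the coordinate-update formula is the standard least-squares one, and the function and argument names are illustrative:

import numpy as np

def termwise_ls(B, y, order="greedy", n_iter=1000, seed=0):
    # Termwise least-squares fitting of y ~ B @ beta, one term at a time:
    # order = "cyclic" (type 1), "random" (type 2) or "greedy" (type 3).
    rng = np.random.default_rng(seed)
    n, K = B.shape
    beta, r = np.zeros(K), y.astype(float).copy()
    norms = (B ** 2).sum(axis=0)
    for t in range(n_iter):
        if order == "cyclic":
            k = t % K
        elif order == "random":
            k = int(rng.integers(K))
        else:  # greedy: pick the term giving the largest loss reduction
            k = int(np.argmax((B.T @ r) ** 2 / norms))
        step = (B[:, k] @ r) / norms[k]   # unadjusted update of one coefficient
        beta[k] += step
        r -= step * B[:, k]
    return beta

Run long enough with order="cyclic" this converges to the joint least-squares solution, which is exactly the point made next: if the joint solution overfits, so eventually does the termwise one.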
It is clear from the known convergence of type 1 backfitting to the joint
least-squares solution that overfit must occur if the joint solution overfits. If
similar convergence properties for type 2 and type 3 backfitting hold, overfit
will occur there also. In light of these considerations, it may be just a matter
of time till boosting follows the lead of NNs and cross-validation is introduced
for the stopping problem in boosting as well.
An issue that a theory of protection against overfit would have to tackle is
that there are two ways for a learner or base term b(x; γ) to be “weak”: weakness can arise from bias or from variance. In either case the weak learner
usually contains signal, but the signal is weak for different reasons. Weak-
ness due to bias means underfit, weakness due to variance means overfit.
Boosting is designed to reduce bias but not variance. The latter is addressed
by Breiman’s (1996) “bagging” proposal and by Bayesian model averaging. As
things stand, one either approaches good fits from poor fits with boosting, or
from overly flexible fits with model averaging. It would be desirable if both
bias and variance could be tackled in one unified approach.
4. The role of the functional form of the weak learner. Additive models with small trees as base terms are an innovation we owe to the machine learning literature. Trees are not essential to boosting, but the idea of building
additive models from multiple trees is a welcome side benefit.
Functional form hasn’t been a topic of great debate in the boosting liter-
ature, probably because of the original justifications in terms of the “weak
learner” metaphor. These justifications do not carry us far, however. The ques-
tion is not whether but how much a weak learner can be boosted, or in statis-
tical terminology: how well an additive model can approximate an underlying
response surface. A major determinant for answering this second question is
the functional form of the weak learner b(x; γ). The fact that the boosting folks
haven’t hit on this the hard way is probably due to the Holte (1993) effect,
cited a couple of times by FHT: “Very simple classification rules perform well
on most commonly used datasets.” That is, nature rarely chooses complex deci-
sion boundaries that couldn’t be described by thresholding additive functions
or at most two-variable or three-variable interactions.
Inspired by ANOVA practices, FHT propose to measure and control com-
plexity of a functional form in terms of interaction order, that is, the maximum
number of variables admitted in the weak learner at any given time. For trees,
the interaction order is identical with the maximum admitted tree depth.
But the idea of interaction order can be extended to other functional forms
as well. The most obvious are Friedman’s MARS fits, in which interaction
order can be controlled as easily as in trees. Another possibility is projection pursuit regression and classification (PPR and PPC, respectively), although a
major modification is necessary: the linear combinations that enter the ridge
functions would have to be constrained to, say, two or three predictors at
a time.
In order to more fully understand boosting, it will be necessary to apply it
to functional forms other than trees. This would not be difficult in the case of
MARS and PPR/C. In fact, for a boosting version of either, one would simply replace the joint fitting with unadjusted greedy fitting. Ironically, stagewise
modifications of MARS and PPR/C require that the technically most creative
ideas of their inventors be thrown out: Friedman used sophisticated stable
Cholesky updates for the fast computation of joint fits in MARS, and Friedman
and Stuetzle devised backfitting for the same purpose in PPR/C. Now we are
led to expect that giving up these technical marvels will be beneficial! Having
experienced first hand the problems of overfit in the original MARS, I would
be thrilled if the anticipated benefits of boosting showed up.
5. Conclusion. The FHT interpretation of boosting has implications for
machine learning as well as for statistics. For machine learning, the generalization of FHT’s ideas leads us straight to a regression version of boosting.
For statistics, FHT’s ideas raise the expectation that the benefits of boosting
might carry over to other stagewise modifications of nonparametric fitting pro-
cedures. The latter is just a guess so far, and in the absence of a convincing
theory of boosting’s protection against overfit, we wouldn’t even know why it
held if it did. Clearly, there is a lot of work ahead of us. FHT have given us a
means for generating new ideas that can guide our future research.
REFERENCES
Breiman, L. (1996). Bagging predictors. Machine Learning 24 123–140.
Friedman, J. and Stuetzle, W. (1981). Projection pursuit regression. J. Amer. Statist. Assoc. 76 817–823.
Holte, R. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning 11 63–90.
AT&T Labs
Shannon Laboratory
180 Park Avenue
Florham Park, New Jersey 07932-0971
E-mail: andreas@research.att.com
DISCUSSION
Yoav Freund and Robert E. Schapire
AT&T Labs
The main and important contribution of this paper is in establishing a
connection between boosting, a newcomer to the statistics scene, and additive
models.
One of the main properties of boosting that has made it interesting to statis-
ticians and others is its relative (but not complete) immunity to overfitting. As
pointed out by the authors, the current paper does not address this issue. Leo
Breiman (1998) tried to explain this behaviour in terms of bias and variance.
In our paper with Bartlett and Lee (1998), we gave an explanation in terms
of the “margins” of the training examples and the VC-dimension of the base
class. Breiman, as well as the current paper, point out that our bounds are
very rough and are thus not useful in practice. While this is clearly true at
this time, it is also true that the analysis given by Breiman and by this paper
yield no provable bounds whatsoever. It is completely unclear whether this
analysis can be used to predict the performance of classification rules outside
of the training sample.
At the root of this argument about boosting is a much more fundamen-
tal argument about the type of prior assumptions that one should make when
embarking on the task of inducing a classification rule from data. The assump-
tion that seems to underlie the use of maximum likelihood in the current paper
is that data are generated by a distribution from a prespecified class. In this
case, this is the class of distributions in which the relationship between the
features and the labels is described by a log-linear function. In comparison,
the assumption that we make in our analysis is that the data are generated
from some arbitrary distribution in an i.i.d. fashion. Clearly, our assumption
is the weaker one and this leads to a theory that is more generally applicable.
From a related but more practical point of view, one main issue when apply-
ing boosting or boosting-like techniques in practice is how to choose the base
class. The approach taken in this paper is that this choice is made based
on our prior beliefs regarding the type of log-linear dependencies that might
exist between the features and the label. On the other hand, in the boosting
approach, we make an assumption about what kind of rules might have slight
but significant correlations with the label. This is the essence of the “weak
learning” assumption upon which the theory of boosting is founded.
In the current paper, boosting is analyzed mostly in the context of decision
stumps and decision trees. The argument seems to be that while in most
real-world cases decision stumps are powerful enough, in some less com-
mon cases the type of dependencies that exist in the data require a more
powerful base class, such as two- or three-level decision trees. A rather differ-
ent approach to the combination of decision trees and boosting was recently
proposed by Freund and Mason (1999). They represent decision trees as sums
of very simple functions and use boosting to simultaneously learn both the
decision rules and the way to average them.
Another important issue discussed in this paper is the performance of boost-
ing methods on data which are generated by classes that have a significant
overlap, in other words, classification problems in which even the Bayes opti-
mal prediction rule has a significant error. It has been observed by several
authors, including those of the current paper, that AdaBoost is not an optimal
method in this case. The problem seems to be that AdaBoost overemphasizes the atypical examples, which eventually results in inferior rules. In the current
paper, the authors suggest “GentleBoost” as a better method than AdaBoost
for this case. The reason that this might be a better method is that it gives
less emphasis to misclassified examples. The increase in the weight of the
example is quadratic in the negative margin, rather than exponential.
However, one can argue that this alteration of AdaBoost, while being a step
in the right direction, is not large enough. In fact, one can argue that once
an example has a very large negative margin it is best to assume that it is
an outlier that should be completely removed from the training set. A new
boosting algorithm based on this radical approach was recently proposed by
Freund (1999).
REFERENCES
Breiman, L. (1998). Arcing classifiers. Ann. Statist. 26 801–849.
Freund, Y. (1999). An adaptive version of the boost by majority algorithm. In Proceedings of the
Twelfth Annual Conference on Computational Learning Theory.
Freund, Y. and Mason, L. (1999). The alternating decision tree learning algorithm. In Machine
Learning: Proceedings of the Sixteenth International Conference 124–133.
Schapire, R. E., Freund, Y., Bartlett, P. and Lee, W. S. (1998). Boosting the margin: a new
explanation for the effectiveness of voting methods. Ann. Statist. 26 1651–1686.
AT&T Labs–Research
180 Park Avenue
Florham Park, New Jersey 07932-0971
E-mail: yoav@research.att.com
DISCUSSION
Greg Ridgeway
University of Washington
The authors have done a great service to the field of statistics by branch-
ing out to the problems of interests to the computational learning theorist.
The connection between boosting and maximum likelihood not only helps to
explain why boosting is an effective classification method but it also opens
the door to the application of boosting to many other statistical prediction
problems.
It is difficult to overstate the attention that boosting has received in the machine learning and data mining communities. Soon after Freund and
Schapire’s algorithm made its debut, a flurry of empirical work showcased its
uncanny ability to improve prediction accuracy [Quinlan (1996), Bauer and
Kohavi (1999), Drucker and Cortes (1996)]. Charles Elkan’s application of the
boosted naïve Bayes won first place out of 45 entries in the 1997 data mining
competition [Elkan (1997)], demonstrating the strength of boosting against
research and industrial prediction tools. Elkan also noted in this work that
his model had the form of nonlinear logistic regression but did not completely
connect boosting with optimization and likelihood. Boosting is even making
its way into popular science as noted in the New York Times [Lee (1999)].
Low-variance, bias-reducing steps. Likelihood-based inference is in
the business of obtaining good parameter estimates, which is not necessarily
the same as producing a good predictor. In the classification game we win if we
produce a classifier with the lowest misclassification rate. A good estimator of
the underlying class distribution conditional on x is of secondary importance.
Consider the following example.
Example 1. Generate N = 1000 observations as

x ∼ N(0, 1),   F(x) = (1/2)x² − 1,
p(x) = exp(2F(x)) / (1 + exp(2F(x))),   y | x ∼ Bern(p(x)).

The Bayes decision rule is to classify a new observation as a 1 if |x| > √2 and 0 otherwise. The misclassification error for this rule is about 20.2%. Figure 1 shows the p(x) that generated the data as a smooth curve.

Fig. 1. The smooth curve is P(Y = 1 | x) and the jagged curve is the estimate that the Real AdaBoost algorithm produces after 100 iterations using a stump classifier on the Example 1 data.

After 100
boosting iterations using a stump classifier (a piecewise constant with only one knot, fit using CART) we see that boosting produces an estimate of p(x) that is overfit in terms of providing an accurate picture of the underlying data-generating mechanism. However, the misclassification rate on a test dataset with 10,000 observations is only 21.5%, not far off from the Bayes error rate. More iterations have little effect. This counts as a win for the analyst interested solely in predictive performance but a loss for those seeking to uncover the data-generating mechanism. Even though the authors’ Lemma 1 shows that the population minimizer of E[e^{−yF(x)}] is the true conditional class probability, with a finite sample it is easy for AdaBoost to overfit from a function estimate or calibration point of view. The debate between accurate prediction and the discovery of the contents of the black box is certainly not new, but it also has not gone away.
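Since the display above had to be reconstructed from a garbled printing, the sketch below makes the data-generating mechanism explicit; the function F(x) = x²/2 − 1 and the Bayes rule |x| > √2 are the reconstruction used here.

import numpy as np

rng = np.random.default_rng(1)
N = 1000
x = rng.standard_normal(N)
F = 0.5 * x ** 2 - 1.0                       # F(x) = x^2/2 - 1
p = 1.0 / (1.0 + np.exp(-2.0 * F))           # p(x) = exp(2F)/(1 + exp(2F))
y = (rng.uniform(size=N) < p).astype(int)    # y | x ~ Bern(p(x))

# Estimate the error of the Bayes rule "predict 1 iff |x| > sqrt(2)"
xt = rng.standard_normal(200_000)
pt = 1.0 / (1.0 + np.exp(-2.0 * (0.5 * xt ** 2 - 1.0)))
yt = (rng.uniform(size=xt.size) < pt).astype(int)
err = np.mean((np.abs(xt) > np.sqrt(2)).astype(int) != yt)
print(round(float(err), 3))                  # about 0.202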
Some of the amazement surrounding boosting was that it could take a sim-
ple classifier like a stump and turn it into a near-Bayes optimal classifier.
The choice of using a stump at each iteration in the example is critical. After
two boosting iterations the classifier will pretty much nail the correct decision
boundary since the two knots will likely be in the neighborhood of −√2 and +√2. Due to the constraints on the base classifier, subsequent iterations can
do little to modify the decision boundary. Therefore, local jaggedness devel-
ops as successive stumps try to fit individual observations but it cannot be
substantial enough to have an effect on the misclassification rate. This leads
to the question of the importance of selecting the base classifier. In this exam-
ple, fitting slightly more complex trees at each iteration has disastrous effects
on the performance. Figure 2 tracks the change in misclassification rate with
each boosting iteration when the base classifier is a tree with four termi-
nal nodes. A four-terminal node tree is likely to nearly capture the decision
boundary on the first iteration. Subsequent iterations have enough flexibility
to fit individual observations and generalization performance declines. From
Figure 2 we can see that the performance degrades to slightly worse than
predicting 0, the majority class, for all observations.
The results of Example 1 yield the following insights into boosting.
1. AdaBoost (in its present form) is not necessarily useful in function estima-
tion. Calibration of the boosted probability estimates is questionable.
2. The idea that AdaBoost is resistant to overfitting needs some qualification.
The choice of base classifier can make or break the success of utilizing
boosting.
3. Boosting seems to work well when we use a sequence of simple, low vari-
ance classifiers. Each stage makes a small, low variance step, sequentially
chipping away at the bias.
Fig. 2. Misclassification rate when fitting trees with four terminal nodes to the Example 1 test data on each Real AdaBoost iteration. The horizontal line marks the error rate if we predicted all observations to be zeros.

Other looks at functional optimization. Once boosting is cast as functional optimization we can approach the problem from other avenues. In the AdaBoost setup we wish to find an F(x) that minimizes J(F) = E[e^{−yF(x)}]. An additive improvement to F(x) would find an α and an f(x) such that J(F + αf) decreases. The authors describe how AdaBoost selects α and f(x). The Gentle AdaBoost algorithm that the authors propose performs a pointwise Newton step to suggest f(x) with α fixed at 1. We can also view this as gradient descent in function space with a squared penalty on the magnitude of f(x). Consider the following analysis.
The directional derivative, or Gâteaux variation, of J(F) in the direction of some function f(x) computes the local change in J(F) if we make the modification F(x) ← F(x) + αf(x), moving F(x) in the direction of f(x). The directional derivative of J(F) is

δJ(F; f) = (d/dα) E[exp(−y(F(x) + αf(x)))] |_{α=0}
         = −E[y f(x) exp(−yF(x))].                                  (1)

A gradient descent strategy would have us select from among some class of functions the f(x) that minimizes δJ(F; f), or equivalently offers the greatest local decrease in J(F). Without further restrictions this method is rather unstable, since we can make δJ(F; f) arbitrarily small by setting

f(x) = −∞ if E(y|x) < 0,   +∞ if E(y|x) > 0.                        (2)
On the other hand, if we penalize f(x) in (1) for being too big, then we might select f(x) as

f(x) = arg min_f E[ −y f(x) e^{−yF(x)} + (1/2) f(x)² ]
     = arg min_f E[ −y f(x) e^{−yF(x)} + (1/2) f(x)² + (1/2) (y e^{−yF(x)})² ]
     = arg min_f E[ ( f(x) − y e^{−yF(x)} )² ].                     (3)
Note that (3) is equivalent to choosing the f(x) that is similar to a least-squares fit to the gradient that the authors compute in their derivation of the Gentle AdaBoost algorithm. Obviously, in practice we only have a sample with which to estimate the expectation in (3). Therefore, it becomes critical that we use a low-variance least-squares regression method to minimize the empirical estimate of (3). I will return to this topic in a moment. After estimating f(x) we can use a line search to obtain the α that minimizes E[exp(−y(F(x) + αf(x)))].
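In sample terms, (3) says: compute the pseudo-response z_i = y_i exp(−y_i F(x_i)) and regress it on x with a low-variance method. A minimal sketch, with a grid search standing in for a proper one-dimensional optimizer (the grid and function names are illustrative):

import numpy as np

def gradient_step_targets(F, y):
    # Pseudo-response of (3): z_i = y_i * exp(-y_i F(x_i)), y_i in {-1, +1};
    # the direction f is a least-squares fit of z on x.
    return y * np.exp(-y * F)

def line_search_alpha(F, f, y, grid=None):
    # Pick alpha minimizing the empirical exponential loss along direction f.
    grid = np.linspace(0.0, 2.0, 201) if grid is None else grid
    losses = [np.mean(np.exp(-y * (F + a * f))) for a in grid]
    return float(grid[int(np.argmin(losses))])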
Applying this same technique within the LogitBoost scenario, that is, maximizing the directional derivative of the Bernoulli log-likelihood with a squared penalty on the direction f(x), the gradient step is

f(x) = arg min_f E[ ( f(x) − y e^{−yF(x)} / (e^{F(x)} + e^{−F(x)}) )² ].   (4)

After estimating the optimal step direction, the choice of the optimal step size, α, takes on an interesting form when we have a sample,

α̂ = arg max_α Σ_{i=1}^{N} [ 2y*_i (F(x_i) + αf(x_i)) − log(1 + exp(2(F(x_i) + αf(x_i)))) ],   (5)

where y*_i = (y_i + 1)/2. Equation (5) has the form of a likelihood for the linear logistic regression model with f(x_i) as a covariate and F(x_i) as an offset term. The optimal α then is just the maximum likelihood estimate of the coefficient of f(x_i), for which efficient software exists. Similar results hold for the class of exponential family regression models and the Cox model for survival analysis [Ridgeway (1999)].
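Because (5) is just a one-parameter logistic likelihood, any GLM routine with an offset maximizes it; the self-contained Newton iteration below does the same by hand (the iteration count is an arbitrary choice):

import numpy as np

def optimal_alpha(F, f, y01, n_steps=25):
    # Newton search for the alpha in (5): logistic regression of y* in {0, 1}
    # on the single covariate f(x_i), with F(x_i) entering as an offset.
    alpha = 0.0
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-2.0 * (F + alpha * f)))
        grad = np.sum(2.0 * f * (y01 - p))            # d loglik / d alpha
        hess = -np.sum(4.0 * f ** 2 * p * (1.0 - p))  # d2 loglik / d alpha2
        if hess == 0.0:
            break
        alpha -= grad / hess
    return alpha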
Both the authors’ results and this section arrive at the situation where we
need to propose a direction in which to move our current guess for the regres-
sion function. When handed only finite training datasets, one is faced with
estimating a direction that seems likely to offer a decrease in the loss func-
tion. Example 1 showed that high variance estimators of descent directions
(e.g., four-terminal node trees) can cause problems. In the past statisticians
have controlled the variance of predictors in other manners. As the authors
point out, the LogitBoost algorithm has a familiar feel to statisticians. Tradi-
tional statistical practice uses linear regression instead of trees at each stage
to estimate the conditional expectation. This yields the iteratively reweighted
least-squares algorithm for fitting linear logistic regression models. We can
also use smoothers to estimate the conditional expectation as in the local
scoring algorithm. These types of methods controlled variance at the expense
of bias reduction by limiting the functional form, restricting variable interac-
tions, or constraining the jaggedness of the curve.
When we introduce regression trees to estimate these gradient steps, instability becomes an issue. However, various forms of averaging may help to keep
variance in check. If we use bagged trees to estimate the gradient we can often
reduce the variance introduced on the iteration. This is the basis of Breiman’s
adaptive bagging [Breiman (1999)]. Denison (2000) proposes averaging over
all possible stumps at each iteration with each weighted by their posterior dis-
tribution. The amount of needed computation is nearly equivalent to CART
with the benefit of more stable gradient steps. It seems that the future of
nonparametric regression might consist of choosing an ideal loss function and
a sequence of low variance estimates of the gradient direction in which each
step decreases bias.
Bayesian prediction and function estimation. When I first learned
of the problem of boosting I attempted to understand it from a Bayesian
perspective. The redistribution of weights on observations, the model mix-
ing and the improved performance seemed to have a Bayesian flavor to it.
In fact, Freund and Schapire (1997) even offered a naïve Bayes justification for the AdaBoost algorithm. With the current results in hand it now
seems clear that the Bayesian interpretation of boosting is that it is simply
maximum likelihood! Boosting suggests moving the current function in some
optimal direction and assembles a path through function space toward a max-
imum likelihood estimate. The Bayesian wants to move the current function
in all directions at each iteration, eventually considering all paths through
the function space with each weighted by some posterior probability. As is
often the case, computation can impede a fully Bayesian solution. However,
the lessons learned from boosting could be eminently useful for Bayesian
prediction and function estimation problems.
In Bayesian prediction problems we want to find the predictive distribution of a y given its covariates, x, and a dataset containing previous observations,

Pr(y | x, D) = ∫ Pr(y | x, F) Pr(F | D) dF.                         (6)

The predictive distribution integrates over all members, F, in some class of functions. The crudest approximation simply assumes that Pr(F | D) has most of its mass in a small neighborhood of the maximum likelihood estimate of F, so that Pr(y | x, D) ≈ Pr(y | x, F̂). To make a point prediction one would compute some functional of Pr(y | x, F̂) (e.g., the mean, or the class with the largest probability). Recent efforts have proposed methods for sampling from Pr(F | D) in order to compute a Monte Carlo estimate of (6). Some of these ideas include MCMC on the space of trees [Denison, Mallick and Smith (1996), Chipman, George and McCulloch (1998)], on the space of additive models with a particular smoothness [Hastie and Tibshirani (1998)] and reversible jump MCMC on the space of piecewise constants [Heikkinen (1998)] or splines [DiMatteo, Genovese and Kass (1999)]. I think boosting may have a lot to offer the latter of these methods.
Heikkinen (1998) assumes that the regression function is a piecewise con-
stant where the number of pieces and the constants associated with those
pieces are random quantities. The simulation from Pr(F | D) proceeds by either
randomly combining two neighboring pieces, splitting a random piece into two,
or randomly perturbing the constant associated with one of the pieces. Among
the curses of MCMC methods is the problem of wasting iterations explor-
ing regions of the parameter space that are not particularly of interest or not
exploring the entire interesting region. Each boosting iteration computes some
kind of gradient in the function space indicating with respect to the current
state which neighboring functions are most interesting in some sense. From
the insight gained from boosting one could potentially produce more efficient
proposal mechanisms for exploring Pr(F | D); that is in the works.
Closing comments. This work melds some exciting work in both compu-
tational learning theory and statistics. The authors have elegantly connected
the AdaBoost algorithm with a novel interpretation of fitting additive logis-
tic regression. Researchers from a diverse set of fields are in search of bet-
ter methods for modeling datasets that are massive, noisy, complex and high
dimensional. Boosting, as a modular procedure for constructing such models,
offers an exciting paradigm for future prediction methods.
REFERENCES
Bauer, E. and Kohavi, R. (1999). An empirical comparison of voting classification algorithms:
bagging, boosting, and variants. Machine Learning 36 105–139.
Breiman, L. (1999). Using adaptive bagging to debias regressions. Technical Report 547, Dept.
Statistics, Univ. California, Berkeley.
Chipman, H.,George, E. and McCulloch, R. (1998). Bayesian CART model search (with dis-
cussion). J. Amer. Statist. Assoc. 93 935–960.
Denison, D. (2000). Boosting with Bayesian stumps. Technical report, Dept. Mathematics, Impe-
rial College.
Denison, D.,Mallick, B. and Smith, A. (1996). Bayesian CART. Technical report, Dept. Mathe-
matics, Imperial College.
DiMatteo, I.,Genovese, C. and Kass, R. (1999). Bayesian curve fitting with free-knot splines.
Technical report, Carnegie Mellon Univ.
Drucker, H. and Cortes, C. (1996). Boosting decision trees. In Proceedings of Neural Informa-
tion Processing 8 479–485. MIT Press.
Elkan, C. (1997). Boosting and naïve Bayes learning. Technical Report CS97-557, Univ. California, San Diego.
Freund, Y. and Schapire, R. (1997). A decision-theoretic generalization of on-line learning and
an application to boosting. J. Comput. System Sci. 55 119–139.
Hastie, T. and Tibshirani, R. (1998). Bayesian backfitting. Technical report, Stanford Univ.
Heikkinen, J. (1998). Curve and surface estimation using dynamic step functions. In Practical Nonparametric and Semiparametric Bayesian Statistics (D. Dey, P. Müller and D. Sinha, eds.) 255–272. Springer, New York.
Lee, J. (1999). A computer program that plays a hunch. New York Times August 17.
Quinlan, J. (1996). Bagging, boosting, and C4.5. In Proceedings Thirteenth American Associa-
tion for Artificial Intelligence National Conference on Artificial Intelligence 725–730.
AAAI Press, Menlo Park, CA.
Ridgeway, G. (1999). The state of boosting. In Computing Science and Statistics 31 (K. Berk and M. Pourahmadi, eds.) 172–181. Interface Foundation of North America, Fairfax, VA.
Department of Statistics
University of Washington
Seattle, Washington 98195-4322
E-mail: greg@stat.washington.edu
REJOINDER
Jerome Friedman, Trevor Hastie and Robert Tibshirani
Stanford University
We thank the discussants for their generous and thoughtful comments, and
the editors for arranging this discussion. Since the paper and discussions are
long, we can only reply to a few of the many important points raised by the
discussants.
1. Overfitting. All of the discussions raise the issue of overfitting in the
context of boosting. Breiman, and to a lesser extent, Freund and Schapire sug-
gest that our explanation is inadequate since it suggests that enough boosting
steps can lead to overfitting. Breiman cites empirical evidence suggesting that
“AdaBoost almost never overfits the data no matter how many iterations it is
run.” It would be a nice world if that were the case, but unfortunately it isn’t
quite. Besides the example in our paper, Ridgeway provides a simple exam-
ple that exhibits dramatic overfitting in Figure 2 of his discussion. Quoting
Ratsch, Onoda and Muller (2000), “Recent studies with highly noisy patterns
[Quinlan (1996), Grove and Schuurmans (1998), Ratsch (1998)] showed that
it is clearly a myth that Boosting methods do not overfit.” Still, it does appear
that boosting is more resistant to overfitting than one might expect based
on general modeling experience. Buja suggests that the suboptimal greedy
“stagewise” nature of the fitting procedure may provide some resistance. This
could be the case, and merits further investigation. However, we agree with Bühlmann and Yu that it is largely the nature of zero–one loss that provides general resistance to overfitting as judged in classification.
We illustrate this with a simple type of “nearest-neighbor” procedure. Consider a fixed set of data points {x_i}_{1}^{N}. A training data set consists of {y_i, x_i}_{1}^{N}, where each y_i ∈ {−1, 1} is a random variable with p_i = Pr(y_i = 1). Given a training set, our estimate for each p_i is p̂_i = (1 + a y_i)/2, 0 ≤ a ≤ 1. Here a is a parameter that controls the degree of fit to the training data. Larger values provide a closer fit; the value a = 1 interpolates the data, whereas a = 0
estimates p̂_i = 1/2 irrespective of the data. We consider five loss functions L(y, F): exponential, likelihood, squared error, absolute error and zero–one loss, all defined in Table 1.

Table 1

Criterion       L(y, F)             R̂                    R                                                  a*
exponential     exp(−yF)            ((1−a)/(1+a))^{1/2}  (1−ē)((1−a)/(1+a))^{1/2} + ē((1+a)/(1−a))^{1/2}   1 − 2ē
likelihood      log(1+exp(−2yF))    log(2/(1+a))         (1−ē) log(2/(1+a)) + ē log(2/(1−a))               1 − 2ē
squared error   (y − F)²            (1−a)²               (1−a)² + 4aē                                      1 − 2ē
absolute error  |y − F|             1 − a                1 − (1−2ē)a                                       1
zero–one loss   1[y ≠ sign(F)]      0                    ē                                                 any a > 0
For each loss function, F is defined to be the population minimizer,

arg min_F E L(y, F) = arg min_F [ p L(1, F) + (1 − p) L(−1, F) ].

For the exponential and likelihood criteria this is F = (1/2) log(p / (1 − p)); for squared and absolute error, F = 2p − 1; and for zero–one loss, F = sign(2p − 1).
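For the exponential criterion the stated minimizer follows from a one-line calculation, sketched here in the notation above:

\frac{\partial}{\partial F}\, E\left[e^{-yF} \mid x\right]
  = \frac{\partial}{\partial F}\left( p\,e^{-F} + (1-p)\,e^{F} \right)
  = -p\,e^{-F} + (1-p)\,e^{F} = 0
\;\Longrightarrow\; e^{2F} = \frac{p}{1-p}
\;\Longrightarrow\; F = \frac{1}{2}\log\frac{p}{1-p}.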
Table 1 shows the average loss on the training data,

R̂ = (1/N) Σ_{i=1}^{N} L(y_i, F̂_i),

and the corresponding average population risk,

R = (1/N) Σ_{i=1}^{N} E L(y*_i, F̂_i).

The latter expectation is with respect to the training y_i (which produces F̂_i) and the test observation y*_i, both drawn independently with probability p_i.
The F̂_i are obtained by plugging p̂_i into the expressions for F. For the exponential and likelihood criteria,

F̂_i = (1/2) log( p̂_i / (1 − p̂_i) ) = (1/2) log( (1 + a y_i) / (1 − a y_i) ),

whereas for the squared and absolute error criteria,

F̂_i = a y_i.
The risk values are reported in terms of the parameter a and

ē = (1/N) Σ_{i=1}^{N} E 1[ y*_i ≠ sign(F̂_i) ] = (2/N) Σ_{i=1}^{N} p_i (1 − p_i),
which is just the average risk based on zero–one loss (and is twice the Bayes risk). This error rate does not depend on the value of a, since it is independent of the magnitude of F̂_i, depending only on its sign. Thus, overfitting as measured by zero–one loss does not exist for this procedure. Interpolating the data by setting a = 1 does not degrade zero–one loss. Also shown in Table 1 is the parameter value a* that minimizes the population risk for each respective criterion.
As seen from Table 1, both the average training set loss and the population risk for the four criteria other than zero–one loss do depend on the magnitude of F̂_i, and thus depend strongly on how closely the training data are fit, as regulated by the value of a. The average training set loss R̂ is monotone decreasing with increasing values of a. Starting at a = 0, the population risk R for all four criteria initially decreases. For the first three, a minimum is reached at the value a* = 1 − 2ē. Increasing the value of a beyond that point monotonically increases population risk, symptomatic of classic overfitting. The degree of overfitting depends on the criterion. Criteria listed higher in Table 1 report more overfitting. For absolute error there is no overfitting; interpolating the training data here minimizes population risk for this criterion.
Figure 1 of this Rejoinder shows both training error (lowest, green curve) and population risk for ē = 0.1 (middle, red curve) and ē = 0.2 (highest, blue curve), as a function of 1/(1 − a), for the loss criteria shown in Table 1. All curves are normalized to have unit risk at a = 0. Note the difference in vertical scale between the lower and upper plots. The first three criteria exhibit classic overfitting. The degree of this overfit increases with increasing intrinsic noise as reflected by the value of ē. For exponential and likelihood loss the expected test error approaches infinity as the value of a approaches one (data interpolation), the latter much more slowly. Both squared and absolute loss approach finite values of 4ē and 2ē, respectively. Except for the discontinuity at a = 0 (where the risk is that of a coin flip, R = 1/2, normalized to be 1), the population zero–one risk curves drop immediately to ē. When the abscissa is plotted on a linear rather than logarithmic scale, the plot for exponential loss (upper left panel) closely resembles that displayed in Figure 3 (lower panel) of the Bühlmann and Yu discussion.
It is important to note that the extent to which overfitting is observed here depends only on the loss criterion used to measure degree-of-fit. For a given training data set, equal values of the parameter a produce identical models for the probabilities p̂_i. The judgment as to whether, or how much, a particular model overfits the training data is determined solely by the criterion used to define “fit,” and not by the training procedure used to obtain the model. Exponential loss reports dramatic overfitting, likelihood substantial overfitting, squared error loss moderate overfitting, and absolute and zero–one loss no overfitting at all, on exactly the same sequence of models.
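The closed forms in Table 1 are easy to evaluate; a short sketch that reproduces the risk numbers behind Figure 1 (the dictionary keys are just labels chosen here):

import numpy as np

def table1_risks(a, ebar):
    # (training risk Rhat, population risk R) from Table 1, for 0 <= a < 1.
    s = np.sqrt((1 - a) / (1 + a))
    return {
        "exponential": (s, (1 - ebar) * s + ebar / s),
        "likelihood":  (np.log(2 / (1 + a)),
                        (1 - ebar) * np.log(2 / (1 + a)) + ebar * np.log(2 / (1 - a))),
        "squared":     ((1 - a) ** 2, (1 - a) ** 2 + 4 * a * ebar),
        "absolute":    (1 - a, 1 - (1 - 2 * ebar) * a),
        "zero-one":    (0.0, ebar),
    }

# e.g. at the risk-minimizing a* = 1 - 2*ebar for ebar = 0.1:
for name, (rhat, r) in table1_risks(a=0.8, ebar=0.1).items():
    print(name, round(float(rhat), 3), round(float(r), 3))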
Of course, this example is highly idealized so as to permit exact calcula-
tions of risk. In other situations absolute and zero–one loss may exhibit some
degree of overfitting. The assumption that the x-values are fixed prevents
zero–one loss from reporting overfitting for this procedure. In actual practice,
the x-values themselves are random variables, and fitting the data too closely can result in some incorrect estimates of the sign of F, thereby degrading population zero–one risk. However, the analysis above suggests that overfitting as judged by absolute and especially zero–one loss should be less severe than when judged by exponential, likelihood or squared error loss. Also, there is no reason to expect that the degree-of-fit to the training data that results in minimum population risk will be the same for the former two measures as with the latter three. This is illustrated in Figure 3 of Bühlmann and Yu’s discussion.

Fig. 1. Training and population risk curves for the five loss functions in Table 1, as a function of 1/(1 − a). The green curves are training risk. The red curves are population risk for ē = 0.1, and the blue curves are for ē = 0.2. All curves are normalized to have unit risk at a = 0.
This analysis has no bearing on the relative merits of any of these criteria as
implementing loss functions for boosting, or any other procedure. As pointed
out by B ¨
uhlmann and Yu, it is often the case that the best criterion to drive
a search strategy is not the criterion used to judge the value of the resulting
solution. However, the phenomenon of “overfitting” is largely dictated by the
latter.
In the field of statistics most of our experience has been derived from like-
lihood and squared error loss. As Breiman points out “adding more and more
variables to a logistic regression, increasing the training likelihood at each
step, usually results in overfitting at some point.” This is surely the case
when overfitting is judged by the likelihood criterion. It can sometimes be
the case when judged by zero–one loss. However, the simple example pre-
sented here, as well as considerable empirical evidence elsewhere, suggest
that zero–one loss, being less sensitive to variations in the estimates of the
underlying probabilities (they depend only on the sign of 2 ˆpi1), exhibit
considerably less overfitting than other loss criteria that depend more heav-
ily on the detailed probability estimates. This will be the case whether those
estimates are obtained by AdaBoost, LogitBoost or any other procedure.
2. Regression. Most of the discussions consider the extension of boosting
to regression. Buja and Bühlmann and Yu, as well as Ridgeway (1999), observe that our view of boosting leads to natural extensions to regression based on squared error loss. This is discussed in Section 10 of the paper. It is explored in more detail in Friedman (1999a), where gradient-based boosting is extended to arbitrary differentiable loss criteria. As noted in Friedman (1999a), shrinkage is an important ingredient to the success of boosting in the regression context.
At each step m of the iterative procedure, the current approximation F_{m−1}(x) is updated by

F_m(x) = F_{m−1}(x) + ν · f_m(x).                                   (7)

Here f_m(x) is the update computed in the usual manner by the boosting algorithm at the mth step, and ν ∈ (0, 1] is a parameter whose value controls the amount of shrinkage.
shrinkage dramatically improves the accuracy of the target function estimate
when measured by likelihood, squared error or absolute error. Less dramatic
but significant gains were also observed for zero–one loss, where the quality
of the function estimate (as judged by the other criteria) is less important. In
all cases, accuracy improved monotonically with decreasing value of ν (more shrinkage), provided that the number of iterations M was correspondingly increased so as to reach minimum test error.
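The shrinkage device itself is a one-line change to the boosting loop. A sketch for the squared-error regression setting, where fit_base stands in for any weak learner returning a callable fit to the current residuals (the name is a placeholder, not an API from the paper):

import numpy as np

def boost_with_shrinkage(x, y, fit_base, M=2000, nu=0.1):
    # Gradient boosting outer loop with shrinkage (7):
    # F_m(x) = F_{m-1}(x) + nu * f_m(x); smaller nu calls for larger M.
    F, terms = np.zeros_like(y, dtype=float), []
    for _ in range(M):
        f = fit_base(x, y - F)       # weak learner on current residuals
        F += nu * f(x)
        terms.append(f)
    return lambda z: nu * sum(f(z) for f in terms)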
Breiman and Bühlmann and Yu suggest that incorporating a stochastic component into boosting algorithms can substantially improve performance. Bühlmann and Yu propose “bag-boosting” which, as Ridgeway points out, is the essence of Breiman’s (1999) “adaptive bagging” procedure. The base learner is replaced by the corresponding bagged learner at each boosting step. This can be viewed as an alternative to shrinkage for reducing the learning rate. The individual bagging substeps can be regarded as randomized boosting steps without updating the approximation. Friedman (1999b) investigated combining randomization with shrinkage in an integrated manner. The approximation is slowly but continuously updated at each randomized boost using a slow learning rate, rather than alternating between a large number of steps without updating (bagging), followed by a single step with a full update (boost). Combining randomization with shrinkage in this manner provides a modest but significant additional improvement over the already large gain produced by shrinkage alone. However, there is a substantial computational advantage in employing what Bühlmann and Yu call “subbagging.” Friedman and Hall (1999) discuss the equivalence of bagging based on bootstrapping and half-sampling without replacement, and show that different subsample sizes can often lead to better performance.
3. Interpretations of boosting. For their invention of boosting, Freund
and Schapire deserve a great deal of praise. It has led to both practical learn-
ing algorithms and much intellectual study and debate. As Buja so poetically
states, “There is no single ‘true’ interpretation of anything; interpretation is a
vehicle in the service of human comprehension. The proof of an interpretation
is in enabling others to fruitfully think about an idea.” We hope that our work
has helped serve that goal. Freund and Schapire and Breiman present two
additional interpretations, different from ours and each other. Ridgeway con-
siders yet another view along Bayesian lines. Having multiple interpretations
can only increase the chance of obtaining new insights.
We don’t see a great deal of contradiction between our view and that of
Freund and Schapire. “Provable bounds” can be derived for LogitBoost (Nigel
Duffy and David Helmbold, private communication). Also, we don’t see how
our “assumptions” are necessarily stronger (or weaker) than those motivating
their interpretation. It is hard to imagine a binary random variable that is not
Bernoulli distributed. We do not assume “a log-linear relationship between
the features and the labels.” The function class describing this relationship
is determined by the weak learner employed. AdaBoost, LogitBoost, or any
other boosting method using the same learner is making similar assumptions
concerning the label–feature relationship. As was pointed out in the paper,
and amplified by Bühlmann and Yu and Buja, “The question is not whether
but how much a weak learner can be boosted, or, in statistical terminology,
how well an additive model can approximate an underlying response surface.”
The underlying response surface is the unknown label–feature relationship.
The weak learner provides the additive components. Different weak learners
will perform differently with different response surfaces.
Freund and Mason’s (1999) “alternating decision tree” approach is quite
interesting. It shares many common elements with MARS [Friedman (1991)],
substituting exponential loss for squared error loss, and using zero-order
rather than first-order splines. Both of these modifications may be beneficial
in the classification context.
Freund’s (1999) suggestion of removing outliers is a good one, but not “rad-
ical.” This approach is at the center of the field of robust statistics. The source
of AdaBoost’s lack of robustness is its use of exponential loss as the implement-
ing loss function, which increases exponentially with large negative margin.
Negative log-likelihood increases only linearly with large negative margin,
thereby being much more robust. Still, sometimes performance can be further
improved a little by completely removing the influence of observations with
very large negative margins. This is accomplished as a beneficial side effect of
the weight trimming strategy described in Section 9 of the paper, when used
with LogitBoost. The LogitBoost weights are wi=pxi1pxi. Obser-
vations with the smallest weights at the current iteration are removed from
consideration for that iteration. Observations with large negative margin have
either pxi0 (for yi=1) or pxi1 (for yi=−1), thereby being deleted.
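A sketch of the trimming rule, assuming current probability estimates p_i are in hand; the 90% weight fraction is an illustrative value, not necessarily the one used in the paper’s experiments.

import numpy as np

def trim_indices(p, weight_frac=0.9):
    # LogitBoost weight trimming: keep the observations carrying the top
    # 'weight_frac' share of total weight w_i = p_i (1 - p_i). Points with
    # p_i near 0 or 1 (including those with very large negative margins)
    # carry tiny weight and are dropped for the current iteration.
    w = p * (1 - p)
    order = np.argsort(w)[::-1]                       # heaviest first
    keep = order[np.cumsum(w[order]) <= weight_frac * w.sum()]
    return np.sort(keep)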
Breiman views boosting as an “ensemble” method for combining classifiers.
He seeks to incorporate boosting and bagging into a common framework based
on “random forests.” This is an intriguing concept. However, it seems to us
that boosting operates differently than random forests. From the description,
a random forest appears to be an ensemble of learners that are i.i.d. given the data. Hence the bias of the ensemble is the same as that of each individual learner. By contrast, in boosting the individual members are not independent given the data: the construction of each depends explicitly on those members that have been previously entered. The bias of the ensemble is typically much less than that of each constituent learner. Boosting turns a weak learner into a strong ensemble, whereas random forests combine strong learners in an i.i.d. fashion in order to reduce variance.
Viewing boosting as stagewise additive regression does however suggest
Breiman’s conclusion that “if boosting succeeds, it is because it gives low cor-
relation between classifiers of reasonable strength.” In stagewise regression
“each step decouples itself from the previous step” leading to “low correla-
tion between pairs of classifiers.” In general, a greedy search strategy will
not successively add highly correlated terms, since that would not produce
very much local reduction in the criterion being minimized. Also, there will
be a tendency to add terms of “reasonable strength,” since that will tend to
reduce the criterion. Thus, one could conclude that AdaBoost owes its success
to these particular aspects of stagewise additive regression. This is consistent
with Breiman’s explanation, although motivated somewhat differently.
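For concreteness, the squared-error case makes this decoupling quantitative (a sketch in our notation; AdaBoost’s exponential criterion behaves analogously). With current fit $F_m$ and residual $r = y - F_m$, entering a candidate term $f$ with its optimal coefficient reduces the criterion by
$$\|r\|^2 - \min_{\beta}\|r - \beta f\|^2 \;=\; \frac{\langle r, f\rangle^2}{\|f\|^2}.$$
Since the coefficient of the term just entered was chosen to leave the residual uncorrelated with it, any candidate highly correlated with that term has small $\langle r, f\rangle$ and offers little reduction, while a candidate of reasonable strength aligned with the residual offers a large one.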
Unfortunately, this conclusion appears to be in conflict with one empirical fact,
namely, the success of shrinking (1) in increasing accuracy with all boosting
methods, including AdaBoost [Friedman (1999a)].
Shrinking strongly inhibits the natural tendency of greedy search strategies
to add low correlation terms of reasonable strength. For smaller values
of the shrinkage parameter $\nu$, the strength of each added term $\nu \cdot f_m(x)$
becomes smaller, and each term becomes more highly correlated with recently entered terms. Yet,
the evidence so far indicates that the smaller the value of ν, the higher the
overall accuracy, as long as there are enough iterations.
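In code, shrinkage amounts to a one-line change in the stagewise loop. Continuing the simulation sketch above (the values of $\nu$ and $M$ are illustrative, not recommendations):

    nu, M = 0.1, 1000                  # smaller nu typically requires more iterations
    fit = np.zeros_like(y)
    for _ in range(M):
        h = fit_stump(x, y - fit)      # greedy stagewise step on the current residuals
        fit += nu * h(x)               # enter only the fraction nu of each term f_m(x)

Each damped step leaves most of the residual in place, so the next term is fit to nearly the same target as its predecessor and tends to correlate with it, which is exactly the behavior described above.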
The source of this mystery is under investigation. It is but one of the many
aspects of this general class of procedures that are not yet well understood.
Whether these procedures are referred to as “boosting,” or more prosaically
as “stagewise additive regression,” there are still many problems left to solve
that can challenge us for some time to come.
REFERENCES
Breiman, L. (1999). Using adaptive bagging to debias regressions. Technical Report 547, Dept.
Statistics, Univ. California, Berkeley.
Freund, Y. (1999). An adaptive version of the boost by majority algorithm. In Proceedings of the
Twelfth Annual Conference on Computational Learning Theory.
Freund, Y. and Mason, L. (1999). The alternating decision tree learning algorithm. In Machine
Learning: Proceedings of the Sixteenth International Conference 124–133.
Friedman, J. H. (1991). Multivariate adaptive regression splines (with discussion). Ann. Statist.
19 1–141.
Friedman, J. H. (1999a). Greedy function approximation: a gradient boosting machine. Ann.
Statist. To appear.
Friedman, J. H. (1999b). Stochastic gradient boosting. Technical report, Dept. Statistics, Stanford
Univ.
Friedman, J. H. and Hall, P. (1999). On bagging and nonlinear estimation. J. Comput. Graph.
Statist. To appear.
Grove, A. and Schuurmans, D. (1998). Boosting in the limit: maximizing the margin of
learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial
Intelligence.
Quinlan, J. (1996). Boosting first order learning. In Proceedings of the Seventh International
Workshop on Algorithmic Learning Theory (S. Arikawa and A. Sharma, eds.) Lecture
Notes in Artificial Intelligence 1160 143–155. Springer, Berlin.
Rätsch, G. (1998). Ensemble learning methods for classification. Masters thesis, Dept. Computer
Science, Univ. Potsdam.
Rätsch, G., Onoda, T. and Müller, K. R. (2000). Soft margins for AdaBoost. Machine Learning
1–35.
Ridgeway, G. (1999). The state of boosting. In Proceedings of the Thirty-first Symposium on the
Interface 172–181.
Department of Statistics
Sequoia Hall
370 Serra Mall
Stanford University
Stanford, California 94305-4065
E-mail: hastie@stat.stanford.edu