The Annals of Statistics
2000, Vol. 28, No. 2, 337–407
SPECIAL INVITED PAPER
ADDITIVE LOGISTIC REGRESSION:
A STATISTICAL VIEW OF BOOSTING
By Jerome Friedman,$^1$ Trevor Hastie$^{2,3}$ and Robert Tibshirani$^{2,4}$
Stanford University
Boosting is one of the most important recent developments in classi-
fication methodology. Boosting works by sequentially applying a classifica-
tion algorithm to reweighted versions of the training data and then taking
a weighted majority vote of the sequence of classifiers thus produced. For
many classification algorithms, this simple strategy results in dramatic
improvements in performance. We show that this seemingly mysterious
phenomenon can be understood in terms of well-known statistical princi-
ples, namely additive modeling and maximum likelihood. For the two-class
problem, boosting can be viewed as an approximation to additive modeling
on the logistic scale using maximum Bernoulli likelihood as a criterion.
We develop more direct approximations and show that they exhibit nearly
identical results to boosting. Direct multiclass generalizations based on
multinomial likelihood are derived that exhibit performance comparable
to other recently proposed multiclass generalizations of boosting in most
situations, and far superior in some. We suggest a minor modification to
boosting that can reduce computation, often by factors of 10 to 50. Finally,
we apply these insights to produce an alternative formulation of boosting
decision trees. This approach, based on best-first truncated tree induction,
often leads to better performance, and can provide interpretable descrip-
tions of the aggregate decision rule. It is also much faster computationally,
making it more suitable to large-scale data mining applications.
1. Introduction. The starting point for this paper is an interesting pro-
cedure called “boosting,” which is a way of combining the performance of many
“weak” classifiers to produce a powerful “committee.” Boosting was proposed in
the computational learning theory literature [Schapire (1990), Freund (1995),
Freund and Schapire (1997)] and has since received much attention.
While boosting has evolved somewhat over the years, we describe the most
commonly used version of the AdaBoost procedure [Freund and Schapire
Received August 1998; revised December 1999.
1Also at Stanford Linear Accelerator Center, Stanford, CA 94305. Supported in part by Dept.
of Energy Contract DE-AC03-76 SF 00515 and NSF Grant DMS-97-64431.
2Also at Division of BioStatistics, Dept. of Health, Research and Policy, Stanford University,
Stanford, CA 94305.
3Supported in part by NSF Grants DMS-95-04495, DMS-98-03645 and NIH Grant
ROI-CA-72028-01.
4Supported in part by Natural Sciences and Engineering Research Council of Canada.
AMS 1991 subject classifications. 62G05, 62G07, 68T10, 68T05.
Key words and phrases. classification, tree, nonparametric estimation, stagewise fitting,
machine learning.
(1996b)], which we call Discrete AdaBoost. This is essentially the same as AdaBoost.M1 for binary data in Freund and Schapire. Here is a concise description of AdaBoost in the two-class classification setting. We have training data $(x_1, y_1), \ldots, (x_N, y_N)$ with $x_i$ a vector-valued feature and $y_i = -1$ or $1$. We define $F(x) = \sum_{m=1}^{M} c_m f_m(x)$, where each $f_m(x)$ is a classifier producing values plus or minus 1 and the $c_m$ are constants; the corresponding prediction is $\mathrm{sign}(F(x))$. The AdaBoost procedure trains the classifiers $f_m(x)$ on weighted versions of the training sample, giving higher weight to cases that are currently misclassified. This is done for a sequence of weighted samples, and then the final classifier is defined to be a linear combination of the classifiers from each stage. A detailed description of Discrete AdaBoost is given in the boxed display titled Algorithm 1.
Much has been written about the success of AdaBoost in producing accurate classifiers. Many authors have explored the use of a tree-based classifier for $f_m(x)$ and have demonstrated that it consistently produces significantly lower
error rates than a single decision tree. In fact, Breiman (1996) (referring to
a NIPS workshop) called AdaBoost with trees the “best off-the-shelf classifier
in the world” [see also Breiman (1998b)]. Interestingly, in many examples the
test error seems to consistently decrease and then level off as more classifiers
are added, rather than ultimately increase. For some reason, it seems that
AdaBoost is resistant to overfitting.
Figure 1 shows the performance of Discrete AdaBoost on a synthetic classification task, using an adaptation of CART™ [Breiman, Friedman, Olshen and Stone (1984)] as the base classifier. This adaptation grows fixed-size trees
in a “best-first” manner (see Section 8). Included in the figure is the bagged
tree [Breiman (1996)] which averages trees grown on bootstrap resampled
versions of the training data. Bagging is purely a variance-reduction tech-
nique, and since trees tend to have high variance, bagging often produces
good results.
Discrete AdaBoost [Freund and Schapire (1996b)]

1. Start with weights $w_i = 1/N$, $i = 1, \ldots, N$.
2. Repeat for $m = 1, 2, \ldots, M$:
   (a) Fit the classifier $f_m(x) \in \{-1, 1\}$ using weights $w_i$ on the training data.
   (b) Compute $\mathrm{err}_m = E_w[1_{[y \neq f_m(x)]}]$, $c_m = \log((1 - \mathrm{err}_m)/\mathrm{err}_m)$.
   (c) Set $w_i \leftarrow w_i \exp[c_m 1_{[y_i \neq f_m(x_i)]}]$, $i = 1, 2, \ldots, N$, and renormalize so that $\sum_i w_i = 1$.
3. Output the classifier $\mathrm{sign}[\sum_{m=1}^{M} c_m f_m(x)]$.

Algorithm 1. $E_w$ represents expectation over the training data with weights $w = (w_1, w_2, \ldots, w_N)$, and $1_{[S]}$ is the indicator of the set $S$. At each iteration, AdaBoost increases the weights of the observations misclassified by $f_m(x)$ by a factor that depends on the weighted training error.
Fig. 1. Test error for Bagging, Discrete AdaBoost and Real AdaBoost on a simulated two-class nested spheres problem (see Section 6). There are 2000 training data points in ten dimensions, and the Bayes error rate is zero. All trees are grown best-first without pruning. The leftmost iteration corresponds to a single tree.
Early versions of AdaBoost used a resampling scheme to implement step 2
of Algorithm 1, by weighted sampling from the training data. This suggested a
connection with bagging and that a major component of the success of boosting
has to do with variance reduction.
However, boosting performs comparably well when:
1. A weighted tree-growing algorithm is used in step 2 rather than weighted resampling, where each training observation is assigned its weight $w_i$. This removes the randomization component essential in bagging.
2. “Stumps” are used for the weak learners. Stumps are single-split trees with
only two terminal nodes. These typically have low variance but high bias.
Bagging performs very poorly with stumps [Figure 1 (top right panel)].
These observations suggest that boosting is capable of both bias and vari-
ance reduction, and thus differs fundamentally from bagging.
The base classifier in Discrete AdaBoost produces a classification rule $f_m(x): \mathscr{X} \to \{-1, 1\}$, where $\mathscr{X}$ is the domain of the predictive features $x$. Freund and Schapire (1996b), Breiman (1998a) and Schapire and Singer (1998) have suggested various modifications to improve the boosting algorithms.
A generalization of Discrete AdaBoost appeared in Freund and Schapire (1996b), and was developed further in Schapire and Singer (1998), that uses real-valued "confidence-rated" predictions rather than the $\{-1, 1\}$ of Discrete AdaBoost. The weak learner for this generalized boosting produces a mapping $f_m(x): \mathscr{X} \to \mathbf{R}$; the sign of $f_m(x)$ gives the classification, and $|f_m(x)|$ a measure of the "confidence" in the prediction. This real-valued contribution is combined with the previous contributions with a multiplier $c_m$ as before, and a slightly different recipe for $c_m$ is provided.
We present a generalized version of AdaBoost, which we call Real AdaBoost in Algorithm 2, in which the weak learner returns a class probability estimate $p_m(x) = \hat{P}_w(y = 1 \mid x) \in [0, 1]$. The contribution to the final classifier is half the logit-transform of this probability estimate. One form of Schapire and Singer's generalized AdaBoost coincides with Real AdaBoost, in the special case where the weak learner is a decision tree. Real AdaBoost tends to perform the best in our simulated examples in Figure 1, especially with stumps, although we see with 100-node trees Discrete AdaBoost overtakes Real AdaBoost after 200 iterations.
In this paper we analyze the AdaBoost procedures from a statistical per-
spective. The main result of our paper rederives AdaBoost as a method for
fitting an additive model $\sum_m f_m(x)$ in a forward stagewise manner. This simple fact largely explains why it tends to outperform a single base learner.
By fitting an additive model of different and potentially simple functions, it
expands the class of functions that can be approximated.
Real AdaBoost

1. Start with weights $w_i = 1/N$, $i = 1, 2, \ldots, N$.
2. Repeat for $m = 1, 2, \ldots, M$:
   (a) Fit the classifier to obtain a class probability estimate $p_m(x) = \hat{P}_w(y = 1 \mid x) \in [0, 1]$, using weights $w_i$ on the training data.
   (b) Set $f_m(x) \leftarrow \frac{1}{2} \log[p_m(x)/(1 - p_m(x))] \in \mathbf{R}$.
   (c) Set $w_i \leftarrow w_i \exp[-y_i f_m(x_i)]$, $i = 1, 2, \ldots, N$, and renormalize so that $\sum_i w_i = 1$.
3. Output the classifier $\mathrm{sign}[\sum_{m=1}^{M} f_m(x)]$.

Algorithm 2. The Real AdaBoost algorithm uses class probability estimates $p_m(x)$ to construct real-valued contributions $f_m(x)$.
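For comparison with the Discrete AdaBoost sketch above, here is a minimal Python sketch of Algorithm 2. The scikit-learn tree and the clipping constant `eps` (to keep the logits finite) are our own assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def real_adaboost(X, y, M=100, eps=1e-5):
    """Algorithm 2 sketch: y in {-1, +1}; weak learner estimates P_w(y=1|x)."""
    N = len(y)
    w = np.full(N, 1.0 / N)
    stages = []
    for m in range(M):
        tree = DecisionTreeClassifier(max_depth=1)
        tree.fit(X, (y == 1).astype(int), sample_weight=w)   # (a) weighted fit
        p = tree.predict_proba(X)[:, 1].clip(eps, 1 - eps)
        f = 0.5 * np.log(p / (1 - p))                        # (b) half-logit
        w = w * np.exp(-y * f)                               # (c) reweight
        w /= w.sum()
        stages.append(tree)
    def classify(Xnew):
        F = np.zeros(len(Xnew))
        for tree in stages:                                  # sum half-logits
            p = tree.predict_proba(Xnew)[:, 1].clip(eps, 1 - eps)
            F += 0.5 * np.log(p / (1 - p))
        return np.sign(F)
    return classify
```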
Given this fact, Discrete and Real AdaBoost appear unnecessarily complicated. A much simpler way to fit an additive model would be to minimize squared-error loss $E(y - \sum_m f_m(x))^2$ in a forward stagewise manner. At the $m$th stage we fix $f_1(x), \ldots, f_{m-1}(x)$ and minimize squared error to obtain $f_m(x) = E(y - \sum_{j=1}^{m-1} f_j(x) \mid x)$. This is just "fitting of residuals" and is commonly used in linear regression and additive modeling [Hastie and Tibshirani (1990)].
However, squared-error loss is not a good choice for classification (see Figure 2 in Section 4.2), and hence "fitting of residuals" doesn't work very well in that case. We show that AdaBoost fits an additive model using a better loss function for classification. Specifically, we show that AdaBoost fits an additive logistic regression model, using a criterion similar to, but not the same as, the binomial log-likelihood. [If $p(x)$ is the class probability, an additive logistic regression approximates $\log[p(x)/(1 - p(x))]$ by an additive function $\sum_m f_m(x)$.] We then go on to derive a new boosting procedure, "LogitBoost," that directly optimizes the binomial log-likelihood.
The original boosting techniques [Schapire (1990), Freund (1995)] prov-
ably improved or “boosted” the performance of a single classifier by produc-
ing a “majority vote” of similar classifiers. These algorithms then evolved
into more adaptive and practical versions such as AdaBoost, whose success
was still explained in terms of boosting individual classifiers by a “weighted
majority vote” or “weighted committee.” We believe that this view, along with
the appealing name “boosting” inherited by AdaBoost, may have led to some
of the mystery about how and why the method works. As mentioned above,
we instead view boosting as a technique for fitting an additive model.
Section 2 gives a short history of the boosting idea. In Section 3 we briefly
review additive modeling. Section 4 shows how boosting can be viewed as an
additive model estimator and proposes some new boosting methods for the two-
class case. The multiclass problem is studied in Section 5. Simulated and real
data experiments are discussed in Sections 6 and 7. Our tree-growing imple-
mentation, using truncated best-first trees, is described in Section 8. Weight
trimming to speed up computation is discussed in Section 9, and we briefly
describe generalizations of boosting in Section 10. We end with a discussion
in Section 11.
2. A brief history of boosting. Schapire (1990) developed the first simple boosting procedure in the PAC-learning framework [Valiant (1984), Kearns and Vazirani (1994)]. Schapire showed that a weak learner could always improve its performance by training two additional classifiers on filtered versions of the input data stream. A weak learner is an algorithm for producing a two-class classifier with performance guaranteed (with high probability) to be significantly better than a coin flip. After learning an initial classifier $h_1$ on the first $N$ training points:

1. $h_2$ is learned on a new sample of $N$ points, half of which are misclassified by $h_1$.
2. $h_3$ is learned on $N$ points for which $h_1$ and $h_2$ disagree.
3. The boosted classifier is $h_B = \mathrm{Majority\ Vote}(h_1, h_2, h_3)$.
Schapire's "Strength of Weak Learnability" theorem proves that $h_B$ has improved performance over $h_1$.
Freund (1995) proposed a “boost by majority” variation which combined
many weak learners simultaneously and improved the performance of the sim-
ple boosting algorithm of Schapire. The theory supporting both of these algo-
rithms requires the weak learner to produce a classifier with a fixed error rate.
This led to the more adaptive and realistic AdaBoost [Freund and Schapire
(1996b)] and its offspring, where this assumption was dropped.
Freund and Schapire (1996b) and Schapire and Singer (1998) provide some
theory to support their algorithms, in the form of upper bounds on generaliza-
tion error. This theory has evolved in the computational learning community,
initially based on the concepts of PAC learning. Other theories attempting
to explain boosting come from game theory [Freund and Schapire (1996a),
Breiman (1997)] and VC theory [Schapire, Freund, Bartlett and Lee (1998)].
The bounds and the theory associated with the AdaBoost algorithms are inter-
esting, but tend to be too loose to be of practical importance. In practice, boost-
ing achieves results far more impressive than the bounds would imply.
3. Additive models. We show in the next section that AdaBoost fits an additive model $F(x) = \sum_{m=1}^{M} c_m f_m(x)$. We believe that viewing current boosting procedures as stagewise algorithms for fitting additive models goes a long way toward understanding their performance. Additive models have a long history in statistics, and so we first give some examples here.
3.1. Additive regression models. We initially focus on the regression problem, where the response $y$ is quantitative, $x$ and $y$ have some joint distribution, and we are interested in modeling the mean $E(y \mid x) = F(x)$. The additive model has the form

$$F(x) = \sum_{j=1}^{p} f_j(x_j). \tag{1}$$

There is a separate function $f_j(x_j)$ for each of the $p$ input variables $x_j$. More generally, each component $f_j$ is a function of a small, prespecified subset of the input variables. The backfitting algorithm [Friedman and Stuetzle (1981), Buja, Hastie and Tibshirani (1989)] is a convenient modular "Gauss–Seidel" algorithm for fitting additive models. A backfitting update is

$$f_j(x_j) \leftarrow E\Big[y - \sum_{k \neq j} f_k(x_k) \,\Big|\, x_j\Big] \quad \text{for } j = 1, 2, \ldots, p, 1, 2, \ldots. \tag{2}$$
Any method or algorithm for estimating a function of $x_j$ can be used to obtain an estimate of the conditional expectation in (2). In particular, this can include nonparametric smoothing algorithms, such as local regression or smoothing splines. In the right-hand side, all the latest versions of the functions $f_k$ are used in forming the partial residuals. The backfitting cycles are repeated until convergence. Under fairly general conditions, backfitting can be shown to converge to the minimizer of $E(y - F(x))^2$ [Buja, Hastie and Tibshirani (1989)].
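A minimal Python sketch of the backfitting update (2). The caller supplies a hypothetical one-dimensional `smoother` (e.g., a local regression or smoothing-spline fitter); that interface and the cycle count are our assumptions, not the paper's.

```python
import numpy as np

def backfit(X, y, smoother, n_cycles=10):
    """Fit an additive model F(x) = sum_j f_j(x_j) by backfitting.

    X: (N, p) array. smoother(x, r) must return a callable fitted to the
    partial residuals r as a function of the single feature x.
    """
    N, p = X.shape
    fits = [lambda x: np.zeros_like(x)] * p   # start with all f_j = 0
    f_vals = np.zeros((N, p))                 # current values f_j(x_ij)
    for _ in range(n_cycles):                 # Gauss-Seidel cycles
        for j in range(p):
            # partial residuals: remove every component except the jth
            r = y - f_vals.sum(axis=1) + f_vals[:, j]
            fits[j] = smoother(X[:, j], r)    # the update (2)
            f_vals[:, j] = fits[j](X[:, j])
    return fits
```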
3.2. Extended additive models. More generally, one can consider additive models whose elements $\{f_m(x)\}_1^M$ are functions of potentially all of the input features $x$. Usually in this context the $f_m(x)$ are taken to be simple functions characterized by a set of parameters $\gamma$ and a multiplier $\beta_m$,

$$f_m(x) = \beta_m b(x; \gamma_m). \tag{3}$$

The additive model then becomes

$$F_M(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m). \tag{4}$$

For example, in single-hidden-layer neural networks $b(x; \gamma) = \sigma(\gamma^t x)$, where $\sigma(\cdot)$ is a sigmoid function and $\gamma$ parameterizes a linear combination of the input features. In signal processing, wavelets are a popular choice with $\gamma$ parameterizing the location and scale shifts of a "mother" wavelet $b(x)$. In these applications $\{b(x; \gamma_m)\}_1^M$ are generally called "basis functions" since they span a function subspace.

If least-squares is used as a fitting criterion, one can solve for an optimal set of parameters through a generalized backfitting algorithm with updates

$$\{\beta_m, \gamma_m\} \leftarrow \arg\min_{\beta, \gamma} E\Big[y - \sum_{k \neq m} \beta_k b(x; \gamma_k) - \beta b(x; \gamma)\Big]^2 \tag{5}$$

for $m = 1, 2, \ldots, M$ in cycles until convergence. Alternatively, one can use a "greedy" forward stepwise approach,

$$\{\beta_m, \gamma_m\} \leftarrow \arg\min_{\beta, \gamma} E\big[y - F_{m-1}(x) - \beta b(x; \gamma)\big]^2 \tag{6}$$

for $m = 1, 2, \ldots, M$, where $\{\beta_k, \gamma_k\}_1^{m-1}$ are fixed at their corresponding solution values at earlier iterations. This is the approach used by Mallat and Zhang (1993) in "matching pursuit," where the $b(x; \gamma)$ are selected from an over-complete dictionary of wavelet bases. In the language of boosting, $f(x) = \beta b(x; \gamma)$ would be called a "weak learner" and $F_M(x)$ (4) the "committee." If decision trees were used as the weak learner, the parameters $\gamma$ would represent the splitting variables, split points, the constants in each terminal node and number of terminal nodes of each tree.

Note that the backfitting procedure (5) or its greedy cousin (6) only require an algorithm for fitting a single weak learner (3) to data. This base algorithm is simply applied repeatedly to modified versions of the original data

$$y_m \leftarrow y - \sum_{k \neq m} f_k(x).$$

In the forward stepwise procedure (6), the modified output $y_m$ at the $m$th iteration depends only on its value $y_{m-1}$ and the solution $f_{m-1}(x)$ at the previous iteration,

$$y_m = y_{m-1} - f_{m-1}(x). \tag{7}$$

At each step $m$, the previous output values $y_{m-1}$ are modified (7) so that the previous model $f_{m-1}(x)$ has no explanatory power on the new outputs $y_m$. One can therefore view this as a procedure for boosting a weak learner $f(x) = \beta b(x; \gamma)$ to form a powerful committee $F_M(x)$ (4).
3.3. Classification problems. For the classification problem, we learn from Bayes theorem that all we need is $P(y = j \mid x)$, the posterior or conditional class probabilities. One could transfer all the above regression machinery across to the classification domain by simply noting that $E(1_{[y=j]} \mid x) = P(y = j \mid x)$, where $1_{[y=j]}$ is the 0/1 indicator variable representing class $j$. While this works fairly well in general, several problems have been noted [Hastie, Tibshirani and Buja (1994)] for constrained regression methods. The estimates are typically not confined to $[0, 1]$, and severe masking problems can occur when there are more than two classes. A notable exception is when trees are used as the regression method, and in fact this is the approach used by Breiman, Friedman, Olshen and Stone (1984).
Logistic regression is a popular approach used in statistics for overcoming these problems. For a two-class problem, an additive logistic model has the form

$$\log \frac{P(y = 1 \mid x)}{P(y = -1 \mid x)} = \sum_{m=1}^{M} f_m(x). \tag{8}$$

The monotone logit transformation on the left guarantees that for any values of $F(x) = \sum_{m=1}^{M} f_m(x) \in \mathbf{R}$, the probability estimates lie in $[0, 1]$; inverting, we get

$$p(x) = P(y = 1 \mid x) = \frac{e^{F(x)}}{1 + e^{F(x)}}. \tag{9}$$

Here we have given a general additive form for $F(x)$; special cases exist that are well known in statistics. In particular, linear logistic regression [e.g., McCullagh and Nelder (1989)] and additive logistic regression [Hastie and Tibshirani (1990)] are popular. These models are usually fit by maximizing the binomial log-likelihood and enjoy all the associated asymptotic optimality features of maximum likelihood estimation.
A generalized version of backfitting (2), called "Local Scoring" in Hastie and Tibshirani (1990), can be used to fit the additive logistic model by maximum likelihood. Starting with guesses $f_1(x_1), \ldots, f_p(x_p)$, $F(x) = \sum_k f_k(x_k)$ and $p(x)$ defined in (9), we form the working response

$$z = F(x) + \frac{1_{[y=1]} - p(x)}{p(x)(1 - p(x))}. \tag{10}$$

We then apply backfitting to the response $z$ with observation weights $p(x)(1 - p(x))$ to obtain new $f_k(x_k)$. This process is repeated until convergence. The forward stagewise version (6) of this procedure bears a close similarity to the LogitBoost algorithm described later in the paper.
4. AdaBoost: an additive logistic regression model. In this section
we show that the AdaBoost algorithms (Discrete and Real) can be interpreted
as stagewise estimation procedures for fitting an additive logistic regression
model. They optimize an exponential criterion which to second order is equiva-
lent to the binomial log-likelihood criterion. We then propose a more standard
likelihood-based boosting procedure.
4.1. An exponential criterion. Consider minimizing the criterion

$$J(F) = E\big[e^{-yF(x)}\big] \tag{11}$$

for estimation of $F(x)$. Here $E$ represents expectation; depending on the context, this may be a population expectation (with respect to a probability distribution) or else a sample average. $E_w$ indicates a weighted expectation. Lemma 1 shows that the function $F(x)$ that minimizes $J(F)$ is the symmetric logistic transform of $P(y = 1 \mid x)$.

Lemma 1. $E[e^{-yF(x)}]$ is minimized at

$$F(x) = \frac{1}{2} \log \frac{P(y = 1 \mid x)}{P(y = -1 \mid x)}. \tag{12}$$

Hence

$$P(y = 1 \mid x) = \frac{e^{F(x)}}{e^{-F(x)} + e^{F(x)}}, \tag{13}$$

$$P(y = -1 \mid x) = \frac{e^{-F(x)}}{e^{-F(x)} + e^{F(x)}}. \tag{14}$$

Proof. While $E$ entails expectation over the joint distribution of $y$ and $x$, it is sufficient to minimize the criterion conditional on $x$:

$$E\big(e^{-yF(x)} \mid x\big) = P(y = 1 \mid x)e^{-F(x)} + P(y = -1 \mid x)e^{F(x)},$$

$$\frac{\partial E(e^{-yF(x)} \mid x)}{\partial F(x)} = -P(y = 1 \mid x)e^{-F(x)} + P(y = -1 \mid x)e^{F(x)}.$$

The result follows by setting the derivative to zero.
This exponential criterion appeared in Schapire and Singer (1998), motivated as an upper bound on misclassification error. Breiman (1997) also used this criterion in his results on AdaBoost and prediction games. The usual logistic transform does not have the factor $\frac{1}{2}$ as in (12); by multiplying the numerator and denominator in (13) by $e^{F(x)}$, we get the usual logistic model

$$p(x) = \frac{e^{2F(x)}}{1 + e^{2F(x)}}. \tag{15}$$

Hence the two models are equivalent up to a factor 2.
Corollary 1. If $E$ is replaced by averages over regions of $x$ where $F(x)$ is constant (as in the terminal node of a decision tree), the same result applies to the sample proportions of $y = 1$ and $y = -1$.
Results 1 and 2 show that both Discrete and Real AdaBoost, as well as the Generalized AdaBoost of Freund and Schapire (1996b), can be motivated as iterative algorithms for optimizing the (population-based) exponential criterion. The results share the same format.

1. Given an imperfect $F(x)$, an update $F(x) + f(x)$ is proposed based on the population version of the criterion.
2. The update, which involves population conditional expectations, is imperfectly approximated for finite data sets by some restricted class of estimators, such as averages in terminal nodes of trees.
Hastie and Tibshirani (1990) use a similar derivation of the local scoring
algorithm used in fitting generalized additive models. Many terms are typi-
cally required in practice, since at each stage the approximation to conditional
expectation is rather crude. Because of Lemma 1, the resulting algorithms
can be interpreted as a stagewise estimation procedure for fitting an additive
logistic regression model. The derivations are sufficiently different to warrant
separate treatment.
Result 1. The Discrete AdaBoost algorithm (population version) builds an additive logistic regression model via Newton-like updates for minimizing $E[e^{-yF(x)}]$.

Proof. Let $J(F) = E[e^{-yF(x)}]$. Suppose we have a current estimate $F(x)$ and seek an improved estimate $F(x) + cf(x)$. For fixed $c$ (and $x$), we expand $J(F(x) + cf(x))$ to second order about $f(x) = 0$,

$$J(F + cf) = E\big[e^{-y(F(x) + cf(x))}\big] \approx E\big[e^{-yF(x)}\big(1 - ycf(x) + c^2 y^2 f(x)^2/2\big)\big] = E\big[e^{-yF(x)}\big(1 - ycf(x) + c^2/2\big)\big],$$

since $y^2 = 1$ and $f(x)^2 = 1$. Minimizing pointwise with respect to $f(x) \in \{-1, 1\}$, we write

$$f(x) = \arg\min_f E_w\big(1 - ycf(x) + c^2/2 \mid x\big). \tag{16}$$

Here the notation $E_w(\cdot \mid x)$ refers to a weighted conditional expectation, where $w = w(x, y) = e^{-yF(x)}$, and

$$E_w\big[g(x, y) \mid x\big] \stackrel{\mathrm{def}}{=} \frac{E\big[w(x, y)\, g(x, y) \mid x\big]}{E\big[w(x, y) \mid x\big]}.$$

For $c > 0$, minimizing (16) is equivalent to maximizing

$$E_w\big[yf(x)\big]. \tag{17}$$

The solution is

$$f(x) = \begin{cases} 1, & \text{if } E_w(y \mid x) = P_w(y = 1 \mid x) - P_w(y = -1 \mid x) > 0, \\ -1, & \text{otherwise.} \end{cases} \tag{18}$$

Note that

$$E_w\big[yf(x)\big] = 1 - E_w\big[(y - f(x))^2\big]/2 \tag{19}$$

[again using $f(x)^2 = y^2 = 1$]. Thus minimizing a quadratic approximation to the criterion leads to a weighted least-squares choice of $f(x) \in \{-1, 1\}$, and this constitutes the Newton-like step.

Given $f(x) \in \{-1, 1\}$, we can directly minimize $J(F + cf)$ to determine $c$:

$$c = \arg\min_c E_w\big[e^{-cyf(x)}\big] = \frac{1}{2} \log \frac{1 - \mathrm{err}}{\mathrm{err}}, \tag{20}$$

where $\mathrm{err} = E_w\big[1_{[y \neq f(x)]}\big]$. Note that $c$ can be negative if the weak learner does worse than 50%, in which case it automatically reverses the polarity. Combining these steps we get the update for $F(x)$,

$$F(x) \leftarrow F(x) + \frac{1}{2} \log\frac{1 - \mathrm{err}}{\mathrm{err}}\, f(x).$$

In the next iteration the new contribution $cf(x)$ to $F(x)$ augments the weights

$$w(x, y) \leftarrow w(x, y)\, e^{-cf(x)y}.$$

Since $-yf(x) = 2 \times 1_{[y \neq f(x)]} - 1$, we see that the update is equivalent to

$$w(x, y) \leftarrow w(x, y) \exp\Big(\log\frac{1 - \mathrm{err}}{\mathrm{err}}\, 1_{[y \neq f(x)]}\Big).$$

Thus the function and weight updates are of an identical form to those used in Discrete AdaBoost.

This population version of AdaBoost translates naturally to a data version using trees. The weighted conditional expectation in (18) is approximated by the terminal-node weighted averages in a tree. In particular, the weighted least-squares criterion is used to grow the tree-based classifier $f(x)$, and given $f(x)$, the constant $c$ is based on the weighted training error.
Note that after each Newton step, the weights change, and hence the tree
configuration will change as well. This adds an adaptive twist to the data
version of a Newton-like algorithm.
Parts of this derivation for AdaBoost can be found in Breiman (1997) and
Schapire and Singer (1998), but without making the connection to additive
logistic regression models.
Corollary 2. After each update to the weights, the weighted misclassification error of the most recent weak learner is 50%.

Proof. This follows by noting that the $c$ that minimizes $J(F + cf)$ satisfies

$$\frac{\partial J(F + cf)}{\partial c} = -E\big[e^{-y(F(x) + cf(x))}\, yf(x)\big] = 0. \tag{21}$$

The result follows since $yf(x)$ is 1 for a correct and $-1$ for an incorrect classification.
Schapire and Singer (1998) give the interpretation that the weights are
updated to make the new weighted problem maximally difficult for the next
weak learner.
The Discrete AdaBoost algorithm expects the tree or other "weak learner" to deliver a classifier $f(x) \in \{-1, 1\}$. Result 1 requires minor modifications to accommodate $f(x) \in \mathbf{R}$, as in the generalized AdaBoost algorithms [Freund and Schapire (1996b), Schapire and Singer (1998)]; the estimate for $c_m$ differs. Fixing $f$, we see that the minimizer of (20) must satisfy

$$E_w\big[yf(x)\, e^{-cyf(x)}\big] = 0. \tag{22}$$

If $f$ is not discrete, this equation has no closed-form solution for $c$, and requires an iterative solution such as Newton–Raphson.
We now derive the Real AdaBoost algorithm, which uses weighted probability estimates to update the additive logistic model, rather than the classifications themselves. Again we derive the population updates and then apply them to data by approximating conditional expectations by terminal-node averages in trees.
Result 2. The Real AdaBoost algorithm fits an additive logistic regression model by stagewise and approximate optimization of $J(F) = E[e^{-yF(x)}]$.

Proof. Suppose we have a current estimate $F(x)$ and seek an improved estimate $F(x) + f(x)$ by minimizing $J(F(x) + f(x))$ at each $x$:

$$J(F(x) + f(x)) = E\big[e^{-yF(x)} e^{-yf(x)} \mid x\big] = e^{-f(x)} E\big[e^{-yF(x)} 1_{[y=1]} \mid x\big] + e^{f(x)} E\big[e^{-yF(x)} 1_{[y=-1]} \mid x\big].$$

Dividing through by $E[e^{-yF(x)} \mid x]$ and setting the derivative with respect to $f(x)$ to zero, we get

$$f(x) = \frac{1}{2} \log \frac{E_w\big[1_{[y=1]} \mid x\big]}{E_w\big[1_{[y=-1]} \mid x\big]} \tag{23}$$

$$= \frac{1}{2} \log \frac{P_w(y = 1 \mid x)}{P_w(y = -1 \mid x)}, \tag{24}$$

where $w(x, y) = \exp(-yF(x))$. The weights get updated by

$$w(x, y) \leftarrow w(x, y)\, e^{-yf(x)}.$$

The algorithm as presented would stop after one iteration. In practice we use crude approximations to conditional expectation, such as decision trees or other constrained models, and hence many steps are required.
Corollary 3. At the optimal $F(x)$, the weighted conditional mean of $y$ is 0.

Proof. If $F(x)$ is optimal, we have

$$\frac{\partial J(F(x))}{\partial F(x)} = -E\big[e^{-yF(x)} y\big] = 0. \tag{25}$$

We can think of the weights as providing an alternative to residuals for the binary classification problem. At the optimal function $F$, there is no further information about $F$ in the weighted conditional distribution of $y$. If there is, we use it to update $F$.
After iteration $M$ in either the Discrete or Real AdaBoost algorithms, we have composed an additive function of the form

$$F(x) = \sum_{m=1}^{M} f_m(x), \tag{26}$$

where each of the components is found in a greedy forward stagewise fashion, fixing the earlier components. Our term "stagewise" refers to a similar approach in statistics:

1. Variables are included sequentially in a stepwise regression.
2. The coefficients of variables already included receive no further adjustment.
4.2. Why $E[e^{-yF(x)}]$? So far the only justification for this exponential criterion is that it has a sensible population minimizer, and the algorithm described above performs well on real data. In addition:

1. Schapire and Singer (1998) motivate $e^{-yF(x)}$ as a differentiable upper bound to misclassification error $1_{[yF < 0]}$ (see Figure 2).
2. The AdaBoost algorithm that it generates is extremely modular, requiring at each iteration the retraining of a classifier on a weighted training database.

Let $y^* = (y + 1)/2$, taking values 0, 1, and parametrize the binomial probabilities by

$$p(x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}.$$

The binomial log-likelihood is

$$l(y^*, p(x)) = y^* \log p(x) + (1 - y^*) \log(1 - p(x)) = -\log\big(1 + e^{-2yF(x)}\big). \tag{27}$$
Fig. 2. A variety of loss functions for estimating a function $F(x)$ for classification. The horizontal axis is $yF$, which is negative for errors and positive for correct classifications. All the loss functions are monotone in $yF$ and are centered and scaled to match $e^{-yF}$ at $F = 0$. The curve labeled "Log-likelihood" is the binomial log-likelihood or cross-entropy $y^* \log p + (1 - y^*)\log(1 - p)$. The curve labeled "Squared Error(p)" is $(y^* - p)^2$. The curve labeled "Squared Error(F)" is $(y - F)^2$ and increases once $yF$ exceeds 1, thereby increasingly penalizing classifications that are too correct.
Hence we see that:

3. The population minimizers of $-E[l(y^*, p(x))]$ and $E[e^{-yF(x)}]$ coincide. This is easily seen because the expected log-likelihood is maximized at the true probabilities $p(x) = P(y^* = 1 \mid x)$, which define the logit $F(x)$. By Lemma 1 we see that this is exactly the minimizer of $E[e^{-yF(x)}]$. In fact, the exponential criterion and the (negative) log-likelihood are equivalent to second order in a Taylor series around $F = 0$,

$$-l(y^*, p) \approx \exp(-yF) + \log 2 - 1. \tag{28}$$

Graphs of $\exp(-yF)$ and $\log(1 + e^{-2yF(x)})$ are shown in Figure 2, as a function of $yF$; positive values of $yF$ imply correct classification. Note that $\exp(-yF)$ itself is not a proper log-likelihood, as it does not equal the log of any probability mass function on plus or minus 1.

4. There is another way to view the criterion $J(F)$. It is easy to show that

$$e^{-yF(x)} = \frac{|y^* - p(x)|}{\sqrt{p(x)(1 - p(x))}}, \tag{29}$$

with $F(x) = \frac{1}{2}\log[p(x)/(1 - p(x))]$. The right-hand side is known as the $\chi$ statistic in the statistical literature. $\chi^2$ is a quadratic approximation to the log-likelihood, and so $\chi$ can be considered a "gentler" alternative.
One feature of both the exponential and log-likelihood criteria is that they are monotone and smooth. Even if the training error is zero, the criteria will drive the estimates towards purer solutions (in terms of probability estimates).
Why not estimate the $f_m$ by minimizing the squared error $E(y - F(x))^2$? If $F_{m-1}(x) = \sum_{j=1}^{m-1} f_j(x)$ is the current prediction, this leads to a forward stagewise procedure that does an unweighted fit to the response $y - F_{m-1}(x)$ at step $m$, as in (6). Empirically we have found that this approach works quite well, but is dominated by those that use monotone loss criteria. We believe that the nonmonotonicity of squared error loss (Figure 2) is the reason. Correct classifications, but with $yF(x) > 1$, incur increasing loss for increasing values of $|F(x)|$. This makes squared-error loss an especially poor approximation to misclassification error rate. Classifications that are "too correct" are penalized as much as misclassification errors.
4.3. Direct optimization of the binomial log-likelihood. In this section we explore algorithms for fitting additive logistic regression models by stagewise optimization of the Bernoulli log-likelihood. Here we focus again on the two-class case and will use a 0/1 response $y^*$ to represent the outcome. We represent the probability of $y^* = 1$ by $p(x)$, where

$$p(x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}. \tag{30}$$

Algorithm 3 gives the details.
LogitBoost (two classes)

1. Start with weights $w_i = 1/N$, $i = 1, 2, \ldots, N$, $F(x) = 0$ and probability estimates $p(x_i) = \frac{1}{2}$.
2. Repeat for $m = 1, 2, \ldots, M$:
   (a) Compute the working response and weights
   $$z_i = \frac{y_i^* - p(x_i)}{p(x_i)(1 - p(x_i))}, \qquad w_i = p(x_i)(1 - p(x_i)).$$
   (b) Fit the function $f_m(x)$ by a weighted least-squares regression of $z_i$ to $x_i$ using weights $w_i$.
   (c) Update $F(x) \leftarrow F(x) + \frac{1}{2} f_m(x)$ and $p(x) \leftarrow e^{F(x)}/(e^{F(x)} + e^{-F(x)})$.
3. Output the classifier $\mathrm{sign}[F(x)] = \mathrm{sign}[\sum_{m=1}^{M} f_m(x)]$.

Algorithm 3. An adaptive Newton algorithm for fitting an additive logistic regression model.
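A minimal Python sketch of Algorithm 3, using a scikit-learn regression tree as the weighted least-squares base learner. The clipping constants anticipate the numerical protections discussed at the end of this section; they, and the identity $p = 1/(1 + e^{-2F})$ used below, are our implementation choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logitboost(X, ystar, M=100, z_max=4.0):
    """Algorithm 3 sketch: ystar in {0, 1}."""
    N = len(ystar)
    F = np.zeros(N)
    p = np.full(N, 0.5)
    stages = []
    for m in range(M):
        w = np.clip(p * (1 - p), 1e-24, None)        # (a) weights, floored
        z = np.clip((ystar - p) / w, -z_max, z_max)  # (a) working response
        tree = DecisionTreeRegressor(max_depth=1)
        tree.fit(X, z, sample_weight=w)              # (b) weighted LS fit
        F += 0.5 * tree.predict(X)                   # (c) update F
        p = 1.0 / (1.0 + np.exp(-2.0 * F))           # p = e^F / (e^F + e^-F)
        stages.append(tree)
    def classify(Xnew):
        Fnew = 0.5 * sum(t.predict(Xnew) for t in stages)
        return (Fnew > 0).astype(int)                # sign(F) on the 0/1 scale
    return classify
```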
Result 3. The LogitBoost algorithm (two classes, population version) uses Newton steps for fitting an additive symmetric logistic model by maximum likelihood.

Derivation. Consider the update $F(x) + f(x)$ and the expected log-likelihood

$$E\big[l(F + f)\big] = E\big[2y^*\big(F(x) + f(x)\big) - \log\big(1 + e^{2(F(x) + f(x))}\big)\big]. \tag{31}$$

Conditioning on $x$, we compute the first and second derivative at $f(x) = 0$,

$$s(x) = \frac{\partial E[l(F(x) + f(x)) \mid x]}{\partial f(x)}\bigg|_{f(x)=0} = 2E\big(y^* - p(x) \mid x\big), \tag{32}$$

$$H(x) = \frac{\partial^2 E[l(F(x) + f(x)) \mid x]}{\partial f(x)^2}\bigg|_{f(x)=0} = -4E\big[p(x)(1 - p(x)) \mid x\big], \tag{33}$$

where $p(x)$ is defined in terms of $F(x)$. The Newton update is then

$$F(x) \leftarrow F(x) - H(x)^{-1} s(x) \tag{34}$$

$$= F(x) + \frac{1}{2} \frac{E(y^* - p(x) \mid x)}{E[p(x)(1 - p(x)) \mid x]} = F(x) + \frac{1}{2} E_w\Big(\frac{y^* - p(x)}{p(x)(1 - p(x))} \,\Big|\, x\Big), \tag{35}$$

where $w(x) = p(x)(1 - p(x))$. Equivalently, the Newton update $f(x)$ solves the weighted least-squares approximation [about $F(x)$] to the log-likelihood

$$\min_{f(x)} E_{w(x)}\Big[\Big(F(x) + \frac{1}{2} \frac{y^* - p(x)}{p(x)(1 - p(x))} - \big(F(x) + f(x)\big)\Big)^2\Big]. \tag{36}$$

The population algorithm described here translates immediately to an implementation on data when $E(\cdot \mid x)$ is replaced by a regression method, such as regression trees [Breiman, Friedman, Olshen and Stone (1984)]. While the role of the weights is somewhat artificial in the population case, they are not in any implementation; $w(x)$ is constant when conditioned on $x$, but the $w(x_i)$ in a terminal node of a tree, for example, depend on the current values $F(x_i)$, and will typically not be constant.
Sometimes the $w(x)$ get very small in regions of $x$ perceived [by $F(x)$] to be pure; that is, when $p(x)$ is close to 0 or 1. This can cause numerical problems in the construction of $z$, and lead to the following crucial implementation protections (see the sketch following this list):

1. If $y^* = 1$, then compute $z = (y^* - p)/[p(1 - p)]$ as $1/p$. Since this number can get large if $p$ is small, threshold this ratio at $z_{\max}$. The particular value chosen for $z_{\max}$ is not crucial; we have found empirically that $z_{\max} \in [2, 4]$ works well. Likewise, if $y^* = 0$, compute $z = -1/(1 - p)$ with a lower threshold of $-z_{\max}$.
2. Enforce a lower threshold on the weights: $w = \max(w, 2 \times \text{machine-zero})$.
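A small Python helper implementing these two protections; the default `z_max = 4` and the use of `np.finfo(float).tiny` for "machine-zero" are our illustrative choices.

```python
import numpy as np

def working_response(ystar, p, z_max=4.0):
    """Thresholded working response z and weights w (protections 1 and 2)."""
    z = np.where(ystar == 1, 1.0 / p, -1.0 / (1.0 - p))
    z = np.clip(z, -z_max, z_max)                    # protection 1
    machine_zero = np.finfo(float).tiny
    w = np.maximum(p * (1.0 - p), 2 * machine_zero)  # protection 2
    return z, w
```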
4.4. Optimizing $E[e^{-yF(x)}]$ by Newton stepping. The population version of the Real AdaBoost procedure (Algorithm 2) optimizes $E[\exp(-y(F(x) + f(x)))]$ exactly with respect to $f$ at each iteration. In Algorithm 4 we propose the "Gentle AdaBoost" procedure that instead takes adaptive Newton steps much like the LogitBoost algorithm just described.

Result 4. The Gentle AdaBoost algorithm (population version) uses Newton steps for minimizing $E[e^{-yF(x)}]$.

Derivation.

$$\frac{\partial J(F(x) + f(x))}{\partial f(x)}\bigg|_{f(x)=0} = -E\big(e^{-yF(x)} y \mid x\big),$$

$$\frac{\partial^2 J(F(x) + f(x))}{\partial f(x)^2}\bigg|_{f(x)=0} = E\big(e^{-yF(x)} \mid x\big) \quad \text{since } y^2 = 1.$$

Hence the Newton update is

$$F(x) \leftarrow F(x) + \frac{E(e^{-yF(x)} y \mid x)}{E(e^{-yF(x)} \mid x)} = F(x) + E_w(y \mid x),$$

where $w(x, y) = e^{-yF(x)}$.
Gentle AdaBoost

1. Start with weights $w_i = 1/N$, $i = 1, 2, \ldots, N$, $F(x) = 0$.
2. Repeat for $m = 1, 2, \ldots, M$:
   (a) Fit the regression function $f_m(x)$ by weighted least-squares of $y_i$ to $x_i$ with weights $w_i$.
   (b) Update $F(x) \leftarrow F(x) + f_m(x)$.
   (c) Update $w_i \leftarrow w_i \exp(-y_i f_m(x_i))$ and renormalize.
3. Output the classifier $\mathrm{sign}[F(x)] = \mathrm{sign}[\sum_{m=1}^{M} f_m(x)]$.

Algorithm 4. A modified version of the Real AdaBoost algorithm, using Newton stepping rather than exact optimization at each step.
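A minimal Python sketch of Algorithm 4; as in the earlier sketches, the depth-one scikit-learn regression tree is a stand-in weak learner of our choosing.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gentle_adaboost(X, y, M=100):
    """Algorithm 4 sketch: y in {-1, +1}; each stage is a weighted LS fit."""
    N = len(y)
    w = np.full(N, 1.0 / N)
    stages = []
    for m in range(M):
        tree = DecisionTreeRegressor(max_depth=1)
        tree.fit(X, y, sample_weight=w)   # (a) weighted least squares
        f = tree.predict(X)
        w = w * np.exp(-y * f)            # (c) reweight
        w /= w.sum()                      # renormalize
        stages.append(tree)               # (b) F <- F + f_m, kept as a list
    return lambda Xnew: np.sign(sum(t.predict(Xnew) for t in stages))
```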
The main difference between this and the Real AdaBoost algorithm is how it uses its estimates of the weighted class probabilities to update the functions. Here the update is $f_m(x) = P_w(y = 1 \mid x) - P_w(y = -1 \mid x)$, rather than half the log-ratio as in (24): $f_m(x) = \frac{1}{2} \log[P_w(y = 1 \mid x)/P_w(y = -1 \mid x)]$. Log-ratios can be numerically unstable, leading to very large updates in pure regions, while the update here lies in the range $[-1, 1]$. Empirical evidence suggests (see Section 7) that this more conservative algorithm has similar performance to both the Real AdaBoost and LogitBoost algorithms, and often outperforms them both, especially when stability is an issue.
There is a strong similarity between the updates for the Gentle AdaBoost algorithm and those for the LogitBoost algorithm. Let $P = P(y = 1 \mid x)$, and $p(x) = e^{F(x)}/(e^{F(x)} + e^{-F(x)})$. Then

$$\frac{E(e^{-yF(x)} y \mid x)}{E(e^{-yF(x)} \mid x)} = \frac{e^{-F(x)} P - e^{F(x)}(1 - P)}{e^{-F(x)} P + e^{F(x)}(1 - P)} \tag{37}$$

$$= \frac{P - p(x)}{(1 - p(x))P + p(x)(1 - P)}.$$

The analogous expression for LogitBoost, from (34), is

$$\frac{1}{2}\, \frac{P - p(x)}{p(x)(1 - p(x))}. \tag{38}$$

At $p(x) \approx \frac{1}{2}$ these are nearly the same, but they differ as the $p(x)$ become extreme. For example, if $P \approx 1$ and $p(x) \approx 0$, (38) blows up, while (37) is about 1 (and always falls in $[-1, 1]$).
5. Multiclass procedures. Here we explore extensions of boosting to classification with multiple classes. We start off by proposing a natural generalization of the two-class symmetric logistic transformation, and then consider specific algorithms. In this context Schapire and Singer (1998) define $J$ responses $y_j$ for a $J$-class problem, each taking values in $\{-1, 1\}$. Similarly, the indicator response vector with elements $y_j^*$ is more standard in the statistics literature. Assume the classes are mutually exclusive.

Definition 1. For a $J$-class problem let $p_j(x) = P(y_j = 1 \mid x)$. We define the symmetric multiple logistic transformation

$$F_j(x) = \log p_j(x) - \frac{1}{J} \sum_{k=1}^{J} \log p_k(x). \tag{39}$$

Equivalently,

$$p_j(x) = \frac{e^{F_j(x)}}{\sum_{k=1}^{J} e^{F_k(x)}}, \qquad \sum_{k=1}^{J} F_k(x) = 0. \tag{40}$$

The centering condition in (40) is for numerical stability only; it simply pins the $F_j$ down, else we could add an arbitrary constant to each $F_j$ and the probabilities would remain the same. The equivalence of these two definitions is easily established, as is the equivalence with the two-class case.
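A short numpy sketch of the transformation pair (39)-(40), mapping probabilities to centered logits and back; the max-subtraction in the softmax is a standard stabilization of our own.

```python
import numpy as np

def to_symmetric_logits(p):
    """(39): F_j = log p_j - mean_k log p_k, so that sum_j F_j = 0."""
    logp = np.log(p)
    return logp - logp.mean(axis=-1, keepdims=True)

def to_probabilities(F):
    """(40): p_j = exp(F_j) / sum_k exp(F_k)."""
    e = np.exp(F - F.max(axis=-1, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=-1, keepdims=True)
```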
Schapire and Singer (1998) provide several generalizations of AdaBoost for the multiclass case, and also refer to other proposals [Freund and Schapire (1997), Schapire (1997)]; we describe their AdaBoost.MH algorithm (see Algorithm 5), since it seemed to dominate the others in their empirical studies. We then connect it to the models presented here. We will refer to the augmented variable in Algorithm 5 as the "class" variable $C$. We make a few observations:

1. The population version of this algorithm minimizes $\sum_{j=1}^{J} E[e^{-y_j F_j(x)}]$, which is equivalent to running separate population boosting algorithms on each of the $J$ problems of size $N$ obtained by partitioning the $N \times J$ samples in the obvious fashion. This is seen trivially by first conditioning on $C = j$, and then $x \mid C = j$, when computing conditional expectations.
2. The same is almost true for a tree-based algorithm. We see this because:
   (a) If the first split is on $C$, either a $J$-nary split if permitted, or else $J - 1$ binary splits, then the subtrees are identical to separate trees grown for each of the $J$ groups. This will always be the case for the first tree.
   (b) If a tree does not split on $C$ anywhere on the path to a terminal node, then that node returns a function $f_m(x, j) = g_m(x)$ that contributes nothing to the classification decision. However, as long as a tree includes a split on $C$ at least once on every path to a terminal node, it will make a contribution to the classifier for all input feature values.

The advantage or disadvantage of building one large tree using class label as an additional input feature is not clear. No motivation is provided. We therefore implement AdaBoost.MH using the more traditional direct approach of building $J$ separate trees to minimize $\sum_{j=1}^{J} E[\exp(-y_j F_j(x))]$. We have thus shown:

Result 5. The AdaBoost.MH algorithm for a $J$-class problem fits $J$ uncoupled additive logistic models, $G_j(x) = \frac{1}{2} \log[p_j(x)/(1 - p_j(x))]$, each class against the rest.
AdaBoost.MH [Schapire and Singer (1998)]

1. Expand the original $N$ observations into $N \times J$ pairs $((x_i, 1), y_{i1}), ((x_i, 2), y_{i2}), \ldots, ((x_i, J), y_{iJ})$, $i = 1, \ldots, N$. Here $y_{ij}$ is the $\{-1, 1\}$ response for class $j$ and observation $i$.
2. Apply Real AdaBoost to the augmented dataset, producing a function $F: \mathscr{X} \times \{1, \ldots, J\} \to \mathbf{R}$, $F(x, j) = \sum_m f_m(x, j)$.
3. Output the classifier $\arg\max_j F(x, j)$.

Algorithm 5. The AdaBoost.MH algorithm converts the $J$-class problem into that of estimating a two-class classifier on a training set $J$ times as large, with an additional "feature" defined by the set of class labels.
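A short numpy sketch of the dataset expansion in step 1 of Algorithm 5; the column layout and the 0-based label encoding are our own illustrative choices.

```python
import numpy as np

def expand_mh(X, y, J):
    """Expand N observations into N*J pairs ((x_i, j), y_ij) for AdaBoost.MH.

    y holds class labels in {0, ..., J-1}; y_ij = +1 if y_i == j, else -1.
    """
    labels = np.tile(np.arange(J), len(y)).reshape(-1, 1)   # j cycles fastest
    X_aug = np.hstack([np.repeat(X, J, axis=0), labels])    # append "class" C
    y_aug = np.where(labels.ravel() == np.repeat(y, J), 1, -1)
    return X_aug, y_aug
```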
In principle this parametrization is fine, since $G_j(x)$ is monotone in $p_j(x)$. However, we are estimating the $G_j(x)$ in an uncoupled fashion, and there is no guarantee that the implied probabilities sum to 1. We give some examples where this makes a difference, and AdaBoost.MH performs more poorly than an alternative coupled likelihood procedure.
Schapire and Singer's AdaBoost.MH was also intended to cover situations where observations can belong to more than one class. The "MH" represents "Multi-Label Hamming," Hamming loss being used to measure the errors in the space of $2^J$ possible class labels. In this context fitting a separate classifier for each label is a reasonable strategy. However, Schapire and Singer also propose using AdaBoost.MH when the class labels are mutually exclusive, which is the focus in this paper.
Algorithm 6 is a natural generalization of Algorithm 3 for fitting the $J$-class logistic regression model (40).

Result 6. The LogitBoost algorithm ($J$ classes, population version) uses quasi-Newton steps for fitting an additive symmetric logistic model by maximum likelihood.
LogitBoost ($J$ classes)

1. Start with weights $w_{ij} = 1/N$, $i = 1, \ldots, N$, $j = 1, \ldots, J$, $F_j(x) = 0$ and $p_j(x) = 1/J$ for all $j$.
2. Repeat for $m = 1, 2, \ldots, M$:
   (a) Repeat for $j = 1, \ldots, J$:
      (i) Compute working responses and weights in the $j$th class,
      $$z_{ij} = \frac{y_{ij}^* - p_j(x_i)}{p_j(x_i)(1 - p_j(x_i))}, \qquad w_{ij} = p_j(x_i)(1 - p_j(x_i)).$$
      (ii) Fit the function $f_{mj}(x)$ by a weighted least-squares regression of $z_{ij}$ to $x_i$ with weights $w_{ij}$.
   (b) Set $f_{mj}(x) \leftarrow \frac{J-1}{J}\big(f_{mj}(x) - \frac{1}{J}\sum_{k=1}^{J} f_{mk}(x)\big)$, and $F_j(x) \leftarrow F_j(x) + f_{mj}(x)$.
   (c) Update $p_j(x)$ via (40).
3. Output the classifier $\arg\max_j F_j(x)$.

Algorithm 6. An adaptive Newton algorithm for fitting an additive multiple logistic regression model.
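A one-function numpy sketch of the symmetrizing step (b) of Algorithm 6, applied to the $J$ per-class least-squares fits evaluated at the training points; the array layout is our assumption.

```python
import numpy as np

def symmetrize(f):
    """Step (b): f is an (N, J) array of per-class fits f_mj(x_i).

    Returns ((J-1)/J) * (f_mj - mean_k f_mk), which sums to zero over j,
    preserving the centering condition in (40).
    """
    J = f.shape[1]
    return ((J - 1) / J) * (f - f.mean(axis=1, keepdims=True))
```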
Derivation.

1. We first give the score and Hessian for the population Newton algorithm corresponding to a standard multilogit parametrization

$$G_j(x) = \log \frac{P(y_j^* = 1 \mid x)}{P(y_J^* = 1 \mid x)},$$

with $G_J(x) = 0$ (and the choice of $J$ for the base class is arbitrary). The expected conditional log-likelihood is

$$E\big[l(G + g) \mid x\big] = \sum_{j=1}^{J-1} E(y_j^* \mid x)\big(G_j(x) + g_j(x)\big) - \log\Big(1 + \sum_{k=1}^{J-1} e^{G_k(x) + g_k(x)}\Big),$$

$$s_j(x) = E\big(y_j^* - p_j(x) \mid x\big), \quad j = 1, \ldots, J - 1,$$

$$H_{j,k}(x) = -p_j(x)\big(\delta_{jk} - p_k(x)\big), \quad j, k = 1, \ldots, J - 1.$$

2. Our quasi-Newton update amounts to using a diagonal approximation to the Hessian, producing updates:

$$g_j(x) \leftarrow \frac{E(y_j^* - p_j(x) \mid x)}{p_j(x)(1 - p_j(x))}, \quad j = 1, \ldots, J - 1.$$

3. To convert to the symmetric parametrization, we would note that $g_J = 0$, and set $f_j(x) = g_j(x) - (1/J)\sum_{k=1}^{J} g_k(x)$. However, this procedure could be applied using any class as the base, not just the $J$th. By averaging over all choices for the base class, we get the update

$$f_j(x) = \frac{J - 1}{J}\bigg(\frac{E(y_j^* - p_j(x) \mid x)}{p_j(x)(1 - p_j(x))} - \frac{1}{J}\sum_{k=1}^{J}\frac{E(y_k^* - p_k(x) \mid x)}{p_k(x)(1 - p_k(x))}\bigg).$$

For more rigid parametric models and full Newton stepping, this symmetrization would be redundant. With quasi-Newton steps and adaptive (tree-based) models, the symmetrization removes the dependence on the choice of the base class.
6. Simulation studies. In this section the four flavors of boosting out-
lined above are applied to several artificially constructed problems. Compar-
isons based on real data are presented in Section 7.
An advantage of comparisons made in a simulation setting is that all as-
pects of each example are known, including the Bayes error rate and the
complexity of the decision boundary. In addition, the population expected error
rates achieved by each of the respective methods can be estimated to arbitrary
accuracy by averaging over a large number of different training and test data
sets drawn from the population. The four boosting methods compared here
are:
DAB: Discrete AdaBoost—Algorithm 1.
RAB: Real AdaBoost—Algorithm 2.
LB: LogitBoost—Algorithms 3 and 6.
GAB: Gentle AdaBoost—Algorithm 4.
DAB, RAB and GAB handle multiple classes using the AdaBoost.MH
approach.
In an attempt to differentiate performance, all of the simulated examples involve fairly complex decision boundaries. The ten input features for all examples are randomly drawn from a ten-dimensional standard normal distribution $x \sim N^{10}(0, I)$. For the first three examples the decision boundaries separating successive classes are nested concentric ten-dimensional spheres constructed by thresholding the squared radius from the origin

$$r^2 = \sum_{j=1}^{10} x_j^2. \tag{41}$$

Each class $C_k$, $1 \le k \le K$, is defined as the subset of observations

$$C_k = \{x_i \mid t_{k-1} \le r_i^2 < t_k\}, \tag{42}$$

with $t_0 = 0$ and $t_K = \infty$. The $\{t_k\}_1^{K-1}$ for each example were chosen so as to put approximately equal numbers of observations in each class. The training sample size is $N = K \cdot 1000$, so that approximately 1000 training observations are in each class. An independently drawn test set of 10,000 observations was used to estimate error rates for each training set. Averaged results over ten such independently drawn training–test set combinations were used for the final error rate estimates. The corresponding statistical uncertainties (standard errors) of these final estimates (averages) are approximately a line width on each plot.
Figure 3 (top left) compares the four algorithms in the two-class ($K = 2$) case using a two-terminal node decision tree ("stump") as the base classifier. Shown is error rate as a function of number of boosting iterations. The upper (black) line represents DAB, and the other three nearly coincident lines are the other three methods (dotted red = RAB, short-dashed green = LB, and long-dashed blue = GAB). Note that the somewhat erratic behavior of DAB, especially for fewer than 200 iterations, is not due to statistical uncertainty. For fewer than 400 iterations LB has a minuscule edge; after that it is a dead heat with RAB and GAB. DAB shows substantially inferior performance here, with roughly twice the error rate at all iterations.
Fig. 3. Test error curves for the simulation experiment with an additive decision boundary, as described in (42). In all panels except the top right, the solid curve (representing Discrete AdaBoost) lies alone above the other three curves.

Figure 3 (lower left) shows the corresponding results for three classes ($K = 3$), again with two-terminal node trees. Here the problem is more difficult, as reflected in increased error rates for all four methods, but their relationship is roughly the same: the upper (black) line represents DAB and the other three nearly coincident lines are the other three methods. The situation is somewhat different for a larger number of classes. Figure 3 (lower right) shows results for $K = 5$, which are typical for $K \ge 4$. As before, DAB incurs much higher error rates than all the others, and RAB and GAB have nearly identical performance. However, the performance of LB relative to RAB and GAB has changed. Up to about 40 iterations it has the same error rate. From 40 to about 100 iterations LB's error rates are slightly higher than the other two.
After 100 iterations the error rate for LB continues to improve, whereas those for RAB and GAB level off, decreasing much more slowly. By 800 iterations the error rate for LB is 0.19 whereas that for RAB and GAB is 0.32. Speculation as to the reason for LB's performance gain in these situations is presented below.
In the above examples a stump was used as the base classifier. One might
expect the use of larger trees would do better for these rather complex prob-
lems. Figure 3 (top right) shows results for the two-class problem, here boost-
ing trees with eight terminal nodes. These results can be compared to those for
stumps in Figure 3 (top left). Initially, error rates for boosting eight-node trees
decrease much more rapidly than for stumps, with each successive iteration,
for all methods. However, the error rates quickly level off and improvement
is very slow after about 100 iterations. The overall performance of DAB is
much improved with the bigger trees, coming close to that of the other three
methods. As before RAB, GAB and LB exhibit nearly identical performance.
Note that at each iteration the eight-node tree model consists of four times
the number of additive terms as does the corresponding stump model. This
is why the error rates decrease so much more rapidly in the early iterations.
In terms of model complexity (and training time), a 100-iteration model using
eight-terminal node trees is equivalent to a 400-iteration stump model.
Comparing the top two panels in Figure 3, one sees that for RAB, GAB and LB the error rate using the bigger trees (0.072) is in fact 33% higher than that for stumps (0.054) at 800 iterations, even though the former is four times more complex. This seemingly mysterious behavior is easily understood by examining the nature of the decision boundary separating the classes. The Bayes decision boundary between two classes is the set

$$\Big\{x : \log \frac{P(y = 1 \mid x)}{P(y = -1 \mid x)} = 0\Big\}, \tag{43}$$

or simply $\{x : B(x) = 0\}$. To approximate this set it is sufficient to estimate the logit $B(x)$, or any monotone transformation of $B(x)$, as closely as possible. As discussed above, boosting produces an additive logistic model whose component functions are represented by the base classifier. With stumps as the base classifier, each component function has the form

$$f_m(x) = c_m^L 1_{[x_j \le t_m]} + c_m^R 1_{[x_j > t_m]} \tag{44}$$

$$= f_m(x_j) \tag{45}$$

if the $m$th stump chose to split on coordinate $j$. Here $t_m$ is the split-point, and $c_m^L$ and $c_m^R$ are the weighted means of the response in the left and right terminal nodes. Thus the model produced by boosting stumps is additive in the original features,

$$F(x) = \sum_{j=1}^{p} g_j(x_j), \tag{46}$$

where $g_j(x_j)$ adds together all those stumps involving $x_j$ (and is 0 if none exist).
Examination of (41) and (42) reveals that an optimal decision boundary for the above examples is also additive in the original features, with $f_j(x_j) = x_j^2 + \text{constant}$. Thus, in the context of decision trees, stumps are ideally matched
to these problems; larger trees are not needed. However boosting larger trees
need not be counterproductive in this case if all of the splits in each individual
tree are made on the same predictor variable. This would also produce an
additive model in the original features (46). However, due to the forward
greedy stagewise strategy used by boosting, this is not likely to happen if the
decision boundary function involves more than one predictor; each individual
tree will try to do its best to involve all of the important predictors. Owing to
the nature of decision trees, this will produce models with interaction effects;
most terms in the model will involve products in more than one variable. Such
nonadditive models are not as well suited for approximating truly additive
decision boundaries such as (41) and (42). This is reflected in increased error
rate as observed in Figure 3.
The above discussion also suggests that if the decision boundary separating pairs of classes were inherently nonadditive in the predictors, then boosting stumps would be less advantageous than using larger trees. A tree with $m$ terminal nodes can produce basis functions with a maximum interaction order of $\min(m - 1, p)$, where $p$ is the number of predictor features. These higher order basis functions provide the possibility to more accurately estimate those decision boundaries $B(x)$ with high-order interactions. The purpose of the next example is to verify this intuition. There are two classes ($K = 2$) and 5000 training observations, with the $\{x_i\}_1^{5000}$ drawn from a ten-dimensional normal distribution as in the previous examples. Class labels were randomly assigned to each observation with log-odds

$$\log \frac{\Pr(y = 1 \mid x)}{\Pr(y = -1 \mid x)} = 10 \sum_{j=1}^{6} x_j \Big(1 + \sum_{l=1}^{6} (-1)^l x_l\Big). \tag{47}$$

Approximately equal numbers of observations are assigned to each of the two classes, and the Bayes error rate is 0.046. The decision boundary for this problem is a complicated function of the first six predictor variables, involving all of them in second-order interactions of equal strength. As in the above examples, test sets of 10,000 observations were used to estimate error rates for each training set, and final estimates were averages over ten replications.
Figure 4 (top left) shows test-error rate as a function of iteration number for
each of the four boosting methods using stumps. As in the previous examples,
RAB and GAB track each other very closely. DAB begins very slowly, being
dominated by all of the others until around 180 iterations, where it passes
below RAB and GAB. LB mostly dominates, having the lowest error rate until
about 650 iterations. At that point DAB catches up and by 800 iterations it
may have a very slight edge. However, none of these boosting methods perform
well with stumps on this problem, the best error rate being 0.35.
Fig. 4. Test error curves for the simulation experiment with a nonadditive decision boundary, as described in (47).

Figure 4 (top right) shows the corresponding plot when four-terminal node trees are boosted. Here there is a dramatic improvement with all of the four methods. For the first time there is some small differentiation between RAB
and GAB. At nearly all iterations the performance ranking is LB best, followed
by GAB, RAB and DAB in order. At 800 iterations LB achieves an error rate of
0.134. Figure 4 (lower left) shows results when eight terminal node trees are
boosted. Here, error rates are generally further reduced with LB improving
the least (0.130), but still dominating. The performance ranking among the
other three methods changes with increasing iterations; DAB overtakes RAB
at around 150 iterations and GAB at about 230, becoming fairly close to LB
by 800 iterations with an error rate of 0.138.
Although limited in scope, these simulation studies suggest several trends. They explain why boosting stumps can sometimes be superior to using larger trees, and suggest situations where this is likely to be the case; that is, when decision boundaries $B(x)$ can be closely approximated by functions that are additive in the original predictor features. When higher order interactions are required, stumps exhibit poor performance. These examples illustrate the close similarity between RAB and GAB. In all cases the difference in performance between DAB and the others decreases when larger trees and more iterations are used, sometimes overtaking the others. More generally, relative performance of these four methods depends on the problem at hand in terms of the nature of the decision boundaries, the complexity of the base classifier and the number of boosting iterations.
The superior performance of LB in Figure 3 (lower right) appears to be a
consequence of the multiclass logistic model (Algorithm 6). All of the other
methods use the asymmetric AdaBoost.MH strategy (Algorithm 5) of building
separate two-class models for each individual class against the pooled com-
plement classes. Even if the decision boundaries separating all class pairs
are relatively simple, pooling classes can produce complex decision bound-
aries that are difficult to approximate [Friedman (1996)]. By considering all
of the classes simultaneously, the symmetric multiclass model is better able
to take advantage of simple pairwise boundaries when they exist [Hastie and
Tibshirani (1998)]. As noted above, the pairwise boundaries induced by (41)
and (42) are simple when viewed in the context of additive modeling, whereas
the pooled boundaries are more complex; they cannot be well approximated
by functions that are additive in the original predictor variables.
The decision boundaries associated with these examples were deliberately
chosen to be geometrically complex in an attempt to elicit performance differ-
ences among the methods being tested. Such complicated boundaries are not
likely to often occur in practice. Many practical problems involve compara-
tively simple boundaries [Holte (1993)]; in such cases performance differences
will still be situation dependent, but correspondingly less pronounced.
7. Some experiments with real world data. In this section we show
the results of running the four fitting methods: LogitBoost, Discrete AdaBoost,
Real AdaBoost and Gentle AdaBoost on a collection of datasets from the UC-
Irvine machine learning archive, plus a popular simulated dataset. The base
learner is a tree in each case, with either two or eight terminal nodes. For com-
parison, a single decision tree was also fit (using the tree function in Splus),
with the tree size determined by 5-fold cross-validation.
The datasets are summarized in Table 1. The test error rates are shown in
Table 2 for the smaller datasets, and in Table 3 for the larger ones. The vowel,
sonar, satimage and letter datasets come with a prespecified test set. The
waveform data is simulated, as described in Breiman, Friedman, Olshen and
Table 1
Datasets used in the experiments
Data set # Train # Test # Inputs # Classes
Vowel 528 462 10 11
Breast cancer 699 5-fold CV 9 2
Ionosphere 351 5-fold CV 34 2
Glass 214 5-fold CV 10 7
Sonar 210 5-fold CV 60 2
Waveform 300 5000 21 3
Satimage 4435 2000 36 6
Letter 16000 4000 16 26
Stone (1984). For the others, 5-fold cross-validation was used to estimate the
test error.
It is difficult to discern trends on the small datasets (Table 2) because all
but quite large observed differences in performance could be attributed to
sampling fluctuations. On the vowel, breast cancer, ionosphere, sonar and
waveform data, purely additive stump models seem to perform comparably to
the larger (eight-node) trees. The glass data seems to benefit a little from
larger trees. There is no clear differentiation in performance among the boost-
ing methods.
On the larger data sets (Table 3) clearer trends are discernible. For the
satimage data the eight-node tree models are only slightly, but significantly,
more accurate than the purely additive models. For the letter data there is
no contest. Boosting stumps is clearly inadequate. There is no clear differ-
entiation among the boosting methods for eight-node trees. For the stumps,
LogitBoost, Real AdaBoost and Gentle AdaBoost have comparable perfor-
mance, distinctly superior to Discrete AdaBoost. This is consistent with
the results of the simulation study (Section 6).
Except perhaps for Discrete AdaBoost, the real data examples fail to demon-
strate performance differences between the various boosting methods. This is
in contrast to the simulated data sets of Section 6. There LogitBoost gener-
ally dominated, although often by a small margin. The inability of the real
data examples to discriminate may reflect statistical difficulties in estimat-
ing subtle differences with small samples. Alternatively, it may be that their
underlying decision boundaries are all relatively simple [Holte (1993)] so that
all reasonable methods exhibit similar performance.
8. Additive logistic trees. In most applications of boosting the base clas-
sifier is considered to be a primitive, repeatedly called by the boosting proce-
dure as iterations proceed. The operations performed by the base classifier
are the same as they would be in any other context given the same data and
weights. The fact that the final model is going to be a linear combination of
a large number of such classifiers is not taken into account. In particular,
when using decision trees, the same tree growing and pruning algorithms are
Table 2
Test error rates on small real examples

                        2 Terminal Nodes      8 Terminal Nodes
Method (iterations)     50     100    200     50     100    200

Vowel (CART error = 0.642)
LogitBoost              0.532  0.524  0.511   0.517  0.517  0.517
Real AdaBoost           0.565  0.561  0.548   0.496  0.496  0.496
Gentle AdaBoost         0.556  0.571  0.584   0.515  0.496  0.496
Discrete AdaBoost       0.563  0.535  0.563   0.511  0.500  0.500

Breast (CART error = 0.045)
LogitBoost              0.028  0.031  0.029   0.034  0.038  0.038
Real AdaBoost           0.038  0.038  0.040   0.032  0.034  0.034
Gentle AdaBoost         0.037  0.037  0.041   0.032  0.031  0.031
Discrete AdaBoost       0.042  0.040  0.040   0.032  0.035  0.037

Ion (CART error = 0.076)
LogitBoost              0.074  0.077  0.071   0.068  0.063  0.063
Real AdaBoost           0.068  0.066  0.068   0.054  0.054  0.054
Gentle AdaBoost         0.085  0.074  0.077   0.066  0.063  0.063
Discrete AdaBoost       0.088  0.080  0.080   0.068  0.063  0.063

Glass (CART error = 0.400)
LogitBoost              0.266  0.257  0.266   0.243  0.238  0.238
Real AdaBoost           0.276  0.247  0.257   0.234  0.234  0.234
Gentle AdaBoost         0.276  0.261  0.252   0.219  0.233  0.238
Discrete AdaBoost       0.285  0.285  0.271   0.238  0.234  0.243

Sonar (CART error = 0.596)
LogitBoost              0.231  0.231  0.202   0.163  0.154  0.154
Real AdaBoost           0.154  0.163  0.202   0.173  0.173  0.173
Gentle AdaBoost         0.183  0.183  0.173   0.154  0.154  0.154
Discrete AdaBoost       0.154  0.144  0.183   0.163  0.144  0.144

Waveform (CART error = 0.364)
LogitBoost              0.196  0.195  0.206   0.192  0.191  0.191
Real AdaBoost           0.193  0.197  0.195   0.185  0.182  0.182
Gentle AdaBoost         0.190  0.188  0.193   0.185  0.185  0.186
Discrete AdaBoost       0.188  0.185  0.191   0.186  0.183  0.183
Table 3
Test error rates on larger data examples

                                       Iterations
Method              Terminal Nodes     20     50     100    200    Fraction

Satimage (CART error = 0.148)
LogitBoost                2            0.140  0.120  0.112  0.102
Real AdaBoost             2            0.148  0.126  0.117  0.119
Gentle AdaBoost           2            0.148  0.129  0.119  0.119
Discrete AdaBoost         2            0.174  0.156  0.140  0.128
LogitBoost                8            0.096  0.095  0.092  0.088
Real AdaBoost             8            0.105  0.102  0.092  0.091
Gentle AdaBoost           8            0.106  0.103  0.095  0.089
Discrete AdaBoost         8            0.122  0.107  0.100  0.099

Letter (CART error = 0.124)
LogitBoost                2            0.250  0.182  0.159  0.145  0.06
Real AdaBoost             2            0.244  0.181  0.160  0.150  0.12
Gentle AdaBoost           2            0.246  0.187  0.157  0.145  0.14
Discrete AdaBoost         2            0.310  0.226  0.196  0.185  0.18
LogitBoost                8            0.075  0.047  0.036  0.033  0.03
Real AdaBoost             8            0.068  0.041  0.033  0.032  0.03
Gentle AdaBoost           8            0.068  0.040  0.030  0.028  0.03
Discrete AdaBoost         8            0.080  0.045  0.035  0.029  0.03
generally employed. Sometimes alterations are made (such as no pruning) for
programming convenience and speed.
When boosting is viewed in the light of additive modeling, however, this
greedy approach can be seen to be far from optimal in many situations. As
discussed in Section 6, the goal of the final classifier is to produce an accurate
approximation to the decision boundary function $B(x)$. In the context of boosting,
this goal applies to the final additive model, not to the individual terms
(base classifiers) at the time they were constructed. For example, it was seen
in Section 6 that if $B(x)$ was close to being additive in the original predictive
features, then boosting stumps was optimal since it produced an approximation
with the same structure. Building larger trees increased the error rate
of the final model because the resulting approximation involved high-order
interactions among the features. The larger trees optimized error rates of the
individual base classifiers, given the weights at that step, and even produced
lower unweighted error rates in the early stages. However, after a sufficient
number of boosts, the stump-based model achieved superior performance.
More generally, one can consider an expansion of the decision boundary
function in a functional ANOVA decomposition [Friedman (1991)],

$$B(x) = \sum_j f_j(x_j) + \sum_{j,k} f_{jk}(x_j, x_k) + \sum_{j,k,l} f_{jkl}(x_j, x_k, x_l) + \cdots \qquad (48)$$
The first sum represents the closest function to $B(x)$ that is additive in
the original features, the first two sums represent the closest approximation involving
at most two-feature interactions, the first three represent three-feature
interactions, and so on. If $B(x)$ can be accurately approximated by such an
expansion, truncated at low interaction order, then allowing the base classi-
fier to produce higher order interactions can reduce the accuracy of the final
boosted model. In the context of decision trees, higher order interactions are
produced by deeper trees.
In situations where the true underlying decision boundary function admits
a low order ANOVA decomposition, one can take advantage of this structure
to improve accuracy by restricting the depth of the base decision trees to be
not much larger than the actual interaction order of $B(x)$. Since this is not
likely to be known in advance for any particular problem, this maximum depth
becomes a “meta-parameter” of the procedure to be estimated by some model
selection technique, such as cross-validation.
One can restrict the depth of an induced decision tree by using its stan-
dard pruning procedure, starting from the largest possible tree, but requiring
it to delete enough splits to achieve the desired maximum depth. This can
be computationally wasteful when this depth is small. The time required to
build the tree is proportional to the depth of the largest possible tree before
pruning. Therefore, dramatic computational savings can be achieved by sim-
ply stopping the growing process at the maximum depth, or alternatively at
a maximum number of terminal nodes. The standard heuristic arguments in
favor of growing large trees and then pruning do not apply in the context of
boosting. Shortcomings in any individual tree can be compensated by trees
grown later in the boosting sequence.
If a truncation strategy based on number of terminal nodes is to be
employed, it is necessary to define an order in which splitting takes place. We
adopt a “best-first” strategy. An optimal split is computed for each currently
terminal node. The node whose split would achieve the greatest reduction in
the tree building criterion is then actually split. This increases the number
of terminal nodes by one. This continues until a maximum number $M$ of
terminal nodes is induced. Standard computational tricks can be employed so
that inducing trees in this order requires no more computation than other
orderings commonly used in decision tree induction.
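To make the best-first strategy concrete, here is a minimal sketch under our own conventions: a weighted squared-error split criterion, leaves represented as index arrays, and illustrative function names (best_split, grow_best_first) that do not come from the paper.

```python
import numpy as np

def best_split(X, y, w, idx):
    """Best weighted squared-error split of the observations in idx;
    returns (gain, feature, threshold), or None if idx cannot be split."""
    mu = (w[idx] * y[idx]).sum() / w[idx].sum()
    parent = (w[idx] * (y[idx] - mu) ** 2).sum()
    best = None
    for j in range(X.shape[1]):
        order = idx[np.argsort(X[idx, j])]
        for cut in range(1, len(order)):
            lo, hi = order[:cut], order[cut:]
            if X[lo[-1], j] == X[hi[0], j]:
                continue  # cannot cut between equal feature values
            child = 0.0
            for part in (lo, hi):
                m = (w[part] * y[part]).sum() / w[part].sum()
                child += (w[part] * (y[part] - m) ** 2).sum()
            gain = parent - child
            if best is None or gain > best[0]:
                best = (gain, j, 0.5 * (X[lo[-1], j] + X[hi[0], j]))
    return best

def grow_best_first(X, y, w, M):
    """Grow a tree best-first: always split the terminal node whose best
    split most reduces the (weighted) criterion, until M leaves exist."""
    leaves = [np.arange(len(y))]
    while len(leaves) < M:
        cands = [(i, best_split(X, y, w, idx)) for i, idx in enumerate(leaves)]
        cands = [(i, s) for i, s in cands if s is not None]
        if not cands:
            break  # no leaf can be split further
        i, (gain, j, t) = max(cands, key=lambda c: c[1][0])
        idx = leaves.pop(i)
        leaves += [idx[X[idx, j] <= t], idx[X[idx, j] > t]]
    # each leaf predicts its weighted mean response
    return [(idx, (w[idx] * y[idx]).sum() / w[idx].sum()) for idx in leaves]
```

Recomputing every leaf's best split at each step, as here, is the naive quadratic version; caching each leaf's best split so it is computed only once is one of the standard tricks alluded to above.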
The truncation limit $M$ is applied to all trees in the boosting sequence. It is
thus a meta-parameter of the entire boosting procedure. An optimal value can
be estimated through standard model selection techniques such as minimizing
cross-validated error rate of the final boosted model. We refer to this combi-
nation of truncated best-first trees, with boosting, as “additive logistic trees”
(ALT). Best-first trees were used in all of the simulated and real examples.
One can compare results on the latter (Tables 2 and 3) to corresponding results
reported by Dietterich [(1998), Table 1] on common data sets. Error rates
achieved by ALT with very small truncation values are seen to compare quite
favorably with other committee approaches using much larger trees at each
boosting step. Even when error rates are the same, the computational savings
associated with ALT can be quite important in data mining contexts where
large data sets cause computation time to become an issue.

Fig. 5. Coordinate functions for the additive logistic tree obtained by boosting (LogitBoost) with stumps, for the two-class nested sphere example from Section 6.
Another advantage of low order approximations is model visualization. In
particular, for models additive in the input features (46), the contribution of
each feature $x_j$ can be viewed as a graph of $g_j(x_j)$ plotted against $x_j$. Figure 5
shows such plots for the ten features of the two-class nested spheres example
of Figure 3. The functions are shown for the first class concentrated near the
origin; the corresponding functions for the other class are the negatives of
these functions.
The plots in Figure 5 clearly show that the contribution to the log-odds of
each individual feature is approximately quadratic, which matches the gener-
ating model (41) and (42).
When there are more than two classes, plots similar to Figure 5 can be made
for each class and analogously interpreted. Higher order interaction models
are more difficult to visualize. If there are at most two-feature interactions,
the two-variable contributions can be visualized using contour or perspective
mesh plots. Beyond two-feature interactions, visualization techniques are even
less effective. Even when noninteraction (stump) models do not achieve the
highest accuracy, they can be very useful as descriptive statistics owing to the
interpretability of the resulting model.
9. Weight trimming. In this section we propose a simple idea and show
that it can dramatically reduce computation for boosted models without sacri-
ficing accuracy. Despite its apparent simplicity, this approach does not appear
to be in common use [although similar ideas have been proposed before:
Schapire (1990), Freund (1995)]. At each boosting iteration there is a
distribution of weights over the training sample. As iterations proceed, this
distribution tends to become highly skewed towards smaller weight values.
A larger fraction of the training sample becomes correctly classified with
increasing confidence, thereby receiving smaller weights. Observations with
very low relative weight have little impact on training of the base classifier;
only those that carry the dominant proportion of the weight mass are influen-
tial. The fraction of such high weight observations can become very small in
later iterations. This suggests that at any iteration one can simply delete from
the training sample the large fraction of observations with very low weight
without having much effect on the resulting induced classifier. However, com-
putation is reduced since it tends to be proportional to the size of the training
sample, regardless of weights.
At each boosting iteration, training observations with weight $w_i$ less than
a threshold $w_i < t_\beta$ are not used to train the classifier. We take the value of
$t_\beta$ to be the $\beta$th quantile of the weight distribution over the training data
at the corresponding iteration. That is, only those observations that carry
the fraction $1 - \beta$ of the total weight mass are used for training. Typically
$\beta \in [0.01, 0.1]$, so that the data used for training carries from 90 to 99% of
the total weight mass. Note that the weights for all training observations are
recomputed at each iteration. Observations deleted at a particular iteration
may therefore reenter at later iterations if their weights subsequently increase
relative to other observations.
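In code the trimming rule is only a few lines. A minimal sketch, assuming the weights sit in a NumPy array; the function name is ours.

```python
import numpy as np

def trimmed_indices(w, beta=0.1):
    """Indices of the observations kept for training: drop the lowest-weight
    observations until a fraction beta of the total weight mass is removed,
    so the retained set carries the fraction 1 - beta of the mass."""
    order = np.argsort(w)                  # ascending weight
    cum = np.cumsum(w[order]) / w.sum()    # cumulative weight mass
    return np.sort(order[cum > beta])
```

The base learner at that iteration is then fitted on X[keep], y[keep], w[keep] only.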
Figure 6 (left panel) shows test-error rate as a function of iteration number
for the letter recognition problem described in Section 7, here using Gentle
AdaBoost and eight-node trees as the base classifier. Two error rate curves are
shown. The black solid one represents using the full training sample at each
iteration ($\beta = 0$), whereas the red dashed curve represents the corresponding
error rate for $\beta = 0.1$. The two curves track each other very closely, especially
at the later iterations. Figure 6 (right panel) shows the corresponding frac-
tion of observations used to train the base classifier as a function of iteration
number. Here the two curves are not similar. With $\beta = 0.1$ the number of
observations used for training drops very rapidly reaching roughly 5% of the
total at 20 iterations. By 50 iterations it is down to about 3% where it stays
throughout the rest of the boosting procedure. Thus, computation is reduced
by over a factor of 30 with no apparent loss in classification accuracy.

Fig. 6. The left panel shows the test error for the letter recognition problem as a function of iteration number. The black solid curve uses all the training data, the red dashed curve uses a subset based on weight thresholding. The right panel shows the percent of training data used for both approaches. The upper curve steps down, because training can stop for an entire class if it is fit sufficiently well (see text).

The reason why sample size in this case decreases for $\beta = 0$ after 150 iterations is
that if all of the observations in a particular class are classified correctly with
very high confidence ($F_k > 15 + \log N$), training for that class stops and
continues only for the remaining classes. At 400 iterations, 12 classes remained
of the original 26 classes.
The last column labeled fraction in Table 3 for the letter-recognition problem
shows the average fraction of observations used in training the base classifiers
over the 200 iterations, for all boosting methods and tree sizes. For eight-node
trees, all methods behave as shown in Figure 6. With stumps, LogitBoost uses
considerably less data than the others and is thereby correspondingly faster.
This is a genuine property of LogitBoost that sometimes gives it an advan-
tage with weight trimming. Unlike the other methods, the LogitBoost weights
$w_i = p_i(1 - p_i)$ do not in any way involve the class outputs $y_i$; they simply
measure nearness to the currently estimated decision boundary $F_M(x) = 0$.
Discarding small weights thus retains only those training observations that
are estimated to be close to the boundary. For the other three procedures the
weight is monotone decreasing in $y_i F_M(x_i)$. This gives highest weight to currently mis-
classified training observations, especially those far from the boundary. If after
trimming, the fraction of observations remaining is less than the error rate,
the subsample passed to the base learner will be highly unbalanced contain-
ing very few correctly classified observations. This imbalance seems to inhibit
learning. No such imbalance occurs with LogitBoost since near the decision
boundary, correctly and misclassified observations appear in roughly equal
numbers.
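The contrast between the two weighting schemes can be made concrete. A small sketch using the paper's relation $p(x) = e^{F(x)}/(e^{F(x)} + e^{-F(x)})$; the function names are ours.

```python
import numpy as np

def adaboost_weights(y, F):
    """Exponential-loss weights exp(-y F): monotone decreasing in the
    margin y*F, so misclassified points far from the boundary dominate."""
    return np.exp(-y * F)

def logitboost_weights(F):
    """LogitBoost weights p(1 - p), with p = 1/(1 + exp(-2F)) as in FHT:
    largest where F is near zero, independently of the labels y."""
    p = 1.0 / (1.0 + np.exp(-2.0 * F))
    return p * (1.0 - p)
```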
As this example illustrates, very large reductions in computation for boost-
ing can be achieved by this simple trick. A variety of other examples (not
shown) exhibit similar behavior with all boosting methods. Note that other
committee approaches to classification such as bagging [Breiman (1996)] and
randomized trees [Dietterich (1998)], while admitting parallel implementations,
cannot take advantage of this approach to reduce computation.
10. Further generalizations of boosting. We have shown above that
AdaBoost fits an additive model, optimizing a criterion similar to binomial
log-likelihood, via an adaptive Newton method. This suggests ways in which
the boosting paradigm may be generalized. First, the Newton step can be
replaced by a gradient step, slowing down the fitting procedure. This can
reduce susceptibility to overfitting and lead to improved performance. Sec-
ond, any smooth loss function can be used: for regression, squared error is
natural, leading to the “fitting of residuals” boosting algorithm mentioned
in the introduction. However, other loss functions might have benefits, for
example, tapered squared error based on Huber’s robust influence function
[Huber (1964)]. The resulting procedure is a fast, convenient method for resis-
tant fitting of additive models. Details of these generalizations may be found
in Friedman (1999).
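As an illustration of these generalizations, here is a minimal sketch of gradient-step boosting for regression, with squared-error and Huber-type pseudo-residuals. The shrinkage parameter nu stands in for the slowed-down gradient step, and fit_base is a placeholder for any weak regression learner returning a prediction function; all names are ours, and details differ from Friedman (1999).

```python
import numpy as np

def neg_gradient_squared(y, F):
    return y - F                              # ordinary residuals

def neg_gradient_huber(y, F, delta=1.0):
    r = y - F                                 # tapered (robust) residuals
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

def gradient_boost(X, y, fit_base, neg_grad, M=100, nu=0.1):
    """Generic gradient-step boosting: each stage fits the base learner to
    the negative gradient of the loss and takes a damped step of size nu."""
    F, ensemble = np.zeros(len(y)), []
    for _ in range(M):
        f = fit_base(X, neg_grad(y, F))       # base learner on pseudo-residuals
        F += nu * f(X)
        ensemble.append(f)
    return lambda Z: nu * sum(f(Z) for f in ensemble)
```

With neg_gradient_squared this is exactly the "fitting of residuals" algorithm mentioned above; swapping in neg_gradient_huber gives a resistant variant.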
11. Concluding remarks. In order to understand a learning procedure
statistically it is necessary to identify two important aspects: its structural
model and its error model. The former is most important since it determines
the function space of the approximator, thereby characterizing the class of
functions or hypotheses that can be accurately approximated with it. The
error model specifies the distribution of random departures of sampled data
from the structural model. It thereby defines the criterion to be optimized in
the estimation of the structural model.
We have shown that the structural model for boosting is additive on the
logistic scale with the base learner providing the additive components. This
understanding alone explains many of the properties of boosting. It is no sur-
prise that a large number of such (jointly optimized) components defines a
much richer class of learners than one of them alone. It reveals that in the
context of boosting all base learners are not equivalent, and there is no uni-
versally best choice over all situations. As illustrated in Section 6, the base
learners need to be chosen so that the resulting additive expansion matches
the particular decision boundary encountered. Even in the limited context of
boosting decision trees the interaction order, as characterized by the number of
terminal nodes, needs to be chosen with care. Purely additive models induced
by decision stumps are sometimes, but not always, the best. However, we con-
jecture that boundaries involving very high-order interactions will rarely be
encountered in practice. This motivates our additive logistic trees (ALT) pro-
cedure described in Section 8.
The error model for two-class boosting is the obvious one for binary vari-
ables, namely the Bernoulli distribution. We show that the AdaBoost proce-
dures maximize a criterion that is closely related to expected log-Bernoulli
likelihood, having the identical solution in the distributional ($L_2$) limit of
infinite data. We derived a more direct procedure for maximizing this log-
likelihood (LogitBoost) and show that it exhibits properties nearly identical to
those of Real AdaBoost.
In the multiclass case, the AdaBoost procedures maximize a separate
Bernoulli likelihood for each class versus the others. This is a natural choice
and is especially appropriate when observations can belong to more than one
class [Schapire and Singer (1998)]. In the more usual setting of a unique
class label for each observation, the symmetric multinomial distribution is
a more appropriate error model. We develop a multiclass LogitBoost proce-
dure that maximizes the corresponding log-likelihood by quasi-Newton step-
ping. We show through simulated examples that there exist settings where
this approach leads to superior performance, although none of these situa-
tions seems to have been encountered in the set of real data examples used
for illustration; both approaches performed quite similarly over these examples.
The concepts developed in this paper suggest that there is very little, if any,
connection between (deterministic) weighted boosting and other (randomized)
ensemble methods such as bagging [Breiman (1996)] and randomized trees
[Dietterich (1998)]. In the language of least-squares regression, the latter are
purely “variance” reducing procedures intended to mitigate instability, espe-
cially that associated with decision trees. Boosting on the other hand seems
fundamentally different. It appears to be mainly a “bias” reducing procedure,
intended to increase the flexibility of stable (highly biased) weak learners by
incorporating them in a jointly fitted additive expansion.
The distinction becomes less clear [Breiman (1998a)] when boosting is
implemented by finite weighted random sampling instead of weighted opti-
mization. The advantages or disadvantages of introducing randomization into
boosting by drawing finite samples is not clear. If there turns out to be an
advantage with randomization in some situations, then the degree of random-
ization, as reflected by the sample size, is an open question. It is not obvious
that the common choice of using the size of the original training sample is
optimal in all (or any) situations.
One fascinating issue not covered in this paper is the fact that boosting,
whatever flavor, seems resistant to overfitting. Some possible explanations are:
1. As the LogitBoost iterations proceed, the overall impact of changes intro-
duced by $f_m(x)$ decreases. Only observations with appreciable weight deter-
mine the new functions—those near the decision boundary. By definition
these observations have $F(x)$ near zero and can be affected by changes,
while those in pure regions have large values of $F(x)$ and are less likely
to be modified.
2. The stagewise nature of the boosting algorithms does not allow the full col-
lection of parameters to be jointly fit, and thus has far lower variance than
the full parameterization might suggest. In the computational learning the-
ory literature this is explained in terms of VC dimension of the ensemble
compared to that of each weak learner.
3. Classifiers are hurt less by overfitting than other function estimators
[e.g., the famous risk bound of the 1-nearest-neighbor classifier, Cover and
Hart (1967)].

Fig. 7. Real AdaBoost (stumps) on a noisy concentric-sphere problem, with 400 observations per class and Bayes error 25%. The test error (upper curve) increases after about fifty iterations.
Figure 7 shows a case where boosting does overfit. The data are gener-
ated from two ten-dimensional spherical Gaussians with the same mean, and
variances chosen so that the Bayes error is 25% (400 samples per class). We
used Real AdaBoost and stumps (the results were similar for all the boosting
algorithms). After about 50 iterations the test error (slowly) increases.
Schapire, Freund, Bartlett and Lee (1998) suggest that the properties of
AdaBoost, including its resistance to overfitting, can be understood in terms of
classification margins. However, Breiman (1997) presents evidence counter to
this explanation. Whatever the explanation, the empirical evidence is strong;
the introduction of boosting by Schapire, Freund and colleagues has brought
an exciting and important set of new ideas to the table.
Acknowledgments. We thank Andreas Buja for alerting us to the recent
work on text classification at AT&T laboratories, Bogdan Popescu for illu-
minating discussions on PAC learning theory and Leo Breiman and Robert
Schapire for useful comments on an earlier version of this paper. We also
thank two anonymous referees and an Associate Editor for detailed and use-
ful comments on an earlier draft of the paper.
REFERENCES

Breiman, L. (1996). Bagging predictors. Machine Learning 24 123–140.
Breiman, L. (1997). Prediction games and arcing algorithms. Technical Report 504, Dept. Statistics, Univ. California, Berkeley.
Breiman, L. (1998a). Arcing classifiers (with discussion). Ann. Statist. 26 801–849.
Breiman, L. (1998b). Combining predictors. Technical report, Dept. Statistics, Univ. California, Berkeley.
Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.
Buja, A., Hastie, T. and Tibshirani, R. (1989). Linear smoothers and additive models (with discussion). Ann. Statist. 17 453–555.
Cover, T. and Hart, P. (1967). Nearest neighbor pattern classification. IEEE Trans. Inform. Theory 13 21–27.
Dietterich, T. (1998). An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning 40 139–158.
Freund, Y. (1995). Boosting a weak learning algorithm by majority. Inform. and Comput. 121 256–285.
Freund, Y. and Schapire, R. (1996a). Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory 325–332.
Freund, Y. and Schapire, R. E. (1996b). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference 148–156. Morgan Kaufman, San Francisco.
Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55 119–139.
Friedman, J. (1991). Multivariate adaptive regression splines (with discussion). Ann. Statist. 19 1–141.
Friedman, J. (1996). Another approach to polychotomous classification. Technical report, Stanford Univ.
Friedman, J. (1999). Greedy function approximation: the gradient boosting machine. Technical report, Stanford Univ.
Friedman, J. and Stuetzle, W. (1981). Projection pursuit regression. J. Amer. Statist. Assoc. 76 817–823.
Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman and Hall, London.
Hastie, T. and Tibshirani, R. (1998). Classification by pairwise coupling. Ann. Statist. 26 451–471.
Hastie, T., Tibshirani, R. and Buja, A. (1994). Flexible discriminant analysis by optimal scoring. J. Amer. Statist. Assoc. 89 1255–1270.
Holte, R. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning 11 63–90.
Huber, P. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35 73–101.
Kearns, M. and Vazirani, U. (1994). An Introduction to Computational Learning Theory. MIT Press.
Mallat, S. and Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Processing 41 3397–3415.
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Chapman and Hall, London.
Schapire, R. (1997). Using output codes to boost multiclass learning problems. In Proceedings of the Fourteenth International Conference on Machine Learning 313–321. Morgan Kaufman, San Francisco.
Schapire, R. E. (1990). The strength of weak learnability. Machine Learning 5 197–227.
Schapire, R. E. and Singer, Y. (1998). Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory.
Schapire, R., Freund, Y., Bartlett, P. and Lee, W. (1998). Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Statist. 26 1651–1686.
Valiant, L. G. (1984). A theory of the learnable. Comm. ACM 27 1134–1142.
Department of Statistics
Sequoia Hall
Stanford University
Stanford, California 94305
E-mail: jnf, hastie, tibs@stat.stanford.edu
DISCUSSION
Leo Breiman
University of California, Berkeley
The authors have presented a fascinating view of boosting and I congrat-
ulate them for the steps they have taken to clear up the mystery of why the
AdaBoost algorithm works so well. Yet I have misgivings about their explana-
tion (which I expressed earlier in my comments to them on receiving the first
draft of their paper).
A crucial property of AdaBoost is that it almost never overfits the data no
matter how many iterations it is run. The authors' counterexample is the first
convincing one I have seen but stands in contrast to the hundreds of data sets
on which AdaBoost has not produced overfitting. Any complete explanation of
AdaBoost has to explain this empirical fact. Ordinarily, in logistic regression,
if one adds more and more variables, increasing the likelihood at each step,
then at some point overfitting sets in and the test-set error blossoms. The fact
that this does not happen in AdaBoost or in LogitBoost is extraordinary. And,
unless I am missing something, there is no explanation in the paper.
I approach the problem differently. In Breiman (1999a), I assume an ensemble
of classifiers $h(x, \theta)$, $\theta \in \Theta$. Let the sequence $\theta_n$ be i.i.d. selections from
$\Theta$ according to some probability $Q$ and let the classifiers $h(x, \theta_1), h(x, \theta_2), \ldots,$
$h(x, \theta_N)$ each cast a unit vote for the predicted class at $x$. Then, under weak
conditions, as $N$ goes to infinity, the selected class converges a.s. to
$\arg\max_j Q(h(x, \theta) = j)$. This holds, for example, if the classifiers are trees, in which
case I refer to the ensemble as a "random forest." The theorem is a simple
consequence of the law of large numbers.
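The unit-vote aggregation itself is a one-liner; a minimal sketch, with each classifier represented as a callable (our own representation, not Breiman's code):

```python
import numpy as np

def forest_vote(classifiers, x):
    """Plurality vote of an ensemble: each h(x, theta_n) casts one unit
    vote and the most frequent predicted class wins."""
    votes = np.array([h(x) for h in classifiers])
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]
```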
For example, in bagging, each classifier is grown on a bootstrap sample from
the original training set. If the training set has $K$ examples, let $\theta$ be a $K$-vector
of nonnegative integers adding to $K$. Under $Q$, let $\theta$ have the distribution of
sampling $K$ times with replacement from $K$ objects. Thus, using the results
cited, bagging converges a.s. as the number of iterations becomes large.
AdaBoost is probably a random forest. Consider the following version of
AdaBoost: given a probability $P$ on the training set, use a fixed algorithm to
construct a classifier $h(x, P)$ using the weighted training set. The next probability
$P'$ on the training set depends only on $P$, the indices of the examples
misclassified and the weighted misclassification error. Then there is a deterministic
function $\Phi$ defined on the space of all probabilities on the training
set such that $P' = \Phi(P)$. Assume that $\Phi$ is ergodic with invariant measure $Q$
and that the voting is unweighted. Then the class selected converges a.s. to
$\arg\max_j Q(h(x, P) = j)$.

The implication is that the random forest constructed by letting the
sequence $P_n$ be i.i.d. from $Q$ is equivalent to AdaBoost. AdaBoost uses
weighted voting, but a small alteration in the argument gives the same result.
In Breiman (1999a), empirical evidence is given to show that AdaBoost is a
random forest. This would settle the question of why AdaBoost does not overfit,
but not the question of why AdaBoost gives such low test-set error.
There are many different random forests, and some are more accurate than
others. For instance, Dietterich (1998) constructs a random forest by choosing
at random among the ten best splits at any node and shows empirically that
this gives generally lower test-set error than bagging. In Breiman (1999a) we show
that the test-set misclassification error is bounded above by an expression
that depends on two parameters: the first is the expected correlation between
pairs of classifiers, and the second is the expected accuracy of the individual
classifiers.
One very effective way to reduce correlation is by random choice of features
at each node. That is, suppose there are ten predictor variables. At each node,
choose a small subgroup of them at random to split on. This idea first appeared
in a paper by Amit and Geman (1997). Using this approach and 100 iterations
gives the following test-set errors as compared to the best corresponding values
for LogitBoost.
Test-Set Errors (%)
Dataset LogitBoost Random Forest
Breast 3.8 2.9
Ionosphere 6.3 7.1
Glass 23.8 20.6
Sonar 15.4 15.9
Waveform 19.1 17.2
Sat-images 11.2 8.6
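A minimal sketch of the random feature selection at a single node, with a Gini criterion as one concrete choice; the names, the criterion and the parameter n_try are ours, and Breiman's implementation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def best_split_random_features(X, y, n_try):
    """Best Gini split, searched over a random subset of n_try features
    rather than all of them; this decorrelates the trees in the forest."""
    best = None  # (weighted impurity, feature, threshold)
    for j in rng.choice(X.shape[1], size=n_try, replace=False):
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            score = (left.mean() * gini(y[left])
                     + (1.0 - left.mean()) * gini(y[~left]))
            if best is None or score < best[0]:
                best = (score, j, t)
    return best  # None if every sampled feature is constant
```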
The comparison is on runs I have already done. Since I have not done a run
on the vowel dataset that is similar to that of the paper, it does not appear.
Similarly, I have not run the letter dataset referred to. The major point of this
comparison is to show that correlation and strength are the essential ingre-
dients of an accurate ensemble, and that if boosting succeeds, it is because it
gives low correlation between classifiers of reasonable strength. The fact that
a similar algorithm maximizes the likelihood in a logistic setting is not the
primary reason for its success.
So what is known about the factors leading to AdaBoost’s success? One
clear factor is that at each step it decouples itself from the previous step. This
would intuitively seem to lead to low correlation between pairs of classifiers
in AdaBoost considered as a random forest. However, this has not yet been
analytically or empirically established.
A recently discovered empirical property of AdaBoost is mysterious but
may lead to better understanding. At each iteration, the examples currently
misclassified by AdaBoost are weighted more heavily for the next iteration,
to encourage their correct classification. The commonly accepted interpreta-
tion, therefore, is that AdaBoost tries to classify correctly the hard-to-classify
examples, and ignores the easy-to-classify examples. This interpretation is
incorrect, and I have to admit some culpability.
Suppose AdaBoost is run for a while. Then for each case in the training set,
look at the weighted proportion of times that it has been misclassified (the
weights are just those that AdaBoost assigns to each tree in the ensemble).
For all data sets I’ve looked at so far, these proportions are nearly the same
over the entire training set. That is, AdaBoost is an equalizer. When it puts
more emphasis on points that have been recently misclassified, it is trying to
get them correctly classified next time around [see Wheway (1999)].
Let the points in the training set be $(y_n, x_n)$. Let $pr(k, n)$ be the unweighted
proportion of times that the $n$th case in the training set is misclassified by the
first $k$ classifiers. To check on my hypothesis that AdaBoost worked so well
because it is an equalizer, using the methods in Breiman (1999b), I devised
an algorithm called arc-equal which sequentially minimizes $\mathrm{Av}_n (pr(k, n) - t)^2$,
where $t$ is an adjustable parameter.
Then I did runs on a number of datasets, proceeding as follows: first AdaBoost
was run 100 times. The $t$ was taken equal to the average weighted
misclassification rate over the training set produced by AdaBoost, and arc-equal
was run 100 times. When this was repeated 50 times, each time leaving out
10% of the data as a test set, the test-set error rate of arc-equal was always
nearly the same as that of AdaBoost.
The experiments performed, and the empirical observation that on our datasets
AdaBoost tends to equalize the misclassification error over the training
set, are suggestive. But they do not yet explain why equalization leads to low
test-set error. This mystery should be most interesting to some of our more
theoretically motivated statisticians.
REFERENCES

Amit, Y. and Geman, D. (1997). Shape quantization and recognition with randomized trees. Neural Comput. 9 1545–1588.
Breiman, L. (1999a). Random forests. Technical report, available at www.stat.berkeley.edu.
Breiman, L. (1999b). Prediction games and arcing algorithms. Neural Comput. 11 1493–1517.
Dietterich, T. (2000). An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting and randomization. Mach. Learning 40 139–158.
Wheway, V. (1999). Variance reduction trends on 'boosted' classifiers. Available from virg@cse.unsw.edu.au.
Department of Statistics
University of California
367 Evans Hall, #3860
Berkeley, California 94720-3860
E-mail: leo@stat.berkeley.edu
DISCUSSION
Peter Bühlmann and Bin Yu

ETH Zürich and Bell Laboratories, Murray Hill, Lucent Technologies and University of California, Berkeley
We congratulate Friedman, Hastie and Tibshirani (FHT) for their signif-
icant work connecting boosting and (kinds of) additive models. FHT provide
a much-needed bridge between an important and effective machine learning
procedure and traditional statistical modeling ideas. With this bridge, ideas
can now flow easily in both directions so that a deeper and more thorough
understanding of boosting will eventually emerge.
In this discussion, we would like to share our thoughts and reflections on
boosting and related statistical ideas, inspired by FHT.
1. What makes boosting work in classification? Let us consider the
two-class problem in this section. In an elegant and convincing way, FHT bring
Freund and Schapire’s Discrete AdaBoosting to our home domain by rederiv-
ing its population version as a familiar Newton-like step in the optimization of
a not-so-familiar exponential loss function serving as a surrogate for the zero–
one loss function. This surrogate is sensible since it is uniquely minimized by
(half of) the log odds ratio of the conditional probability of $Y = 1$ given $X = x$.
Under this light, it is not hard to find instances of statistical procedures shar-
ing traces of similarities with boosting. For example, in parametric estimation,
we indeed have been “boosting” a consistent estimator to an efficient estima-
tor by taking a Newton step in the optimization of the likelihood function and
indeed have been using estimating equations as sensible surrogates because
the likelihood equations are intractable. Despite these similarities, there are
fundamental differences between boosting and these statistical procedures.
As recounted by FHT, boosting as a conjecture was proposed as a theoretical
learning question in the machine learning community and is now being viewed
as a nonparametric stepwise fitting technique in an additive style. Its superb
empirical performance in high-dimensional classification problems has very
few, if any, rivals. Even though the concept of a “true” model does get used in
the evaluation stage of boosting, the starting point of boosting is not a “true
model” as is commonly done in statistics. Its starting point is a “weak learner”
and the question posed was how to make it better or “boost” it. In hindsight,
this is a natural and realistic approach in modern data analysis where the size
of datasets could not be imagined in the time of our forebears such as Fisher
or Neyman. Because of the complexity and scale of these problems, it is impos-
sible to come up with an effective likelihood approach in one step. Often, if
not always, a sensible procedure or a “weak learner” is available, either ad
hoc or based on a simple approximation. Moreover, an evaluation criterion or
a loss function is always available. The boosting question now becomes how to
improve the weak learner in terms of this loss function. With a starting point
and an objective function, a greedy approach for improvement is very natu-
ral. Numerical considerations of the greedy approach explain why boosting in
terms of the evaluating zero–one loss function might not be a good idea. Any
derivative-based greedy algorithm such as Newton’s method is not appropri-
ate for this discontinuous loss function whose minimizer is not unique but a
whole class. A surrogate, the exponential loss function, is used in AdaBoost-
ing as the implementing objective function although the evaluation function
remains the zero–one loss. From a numerical point of view, the exponential
function is an excellent function to apply Newton’s method, because of its con-
vexity and gradually changing derivatives. Furthermore, this is a relevant
loss function to optimize since its minimizer is sensible for the classification
problem. A second surrogate, proposed by FHT, is the binomial likelihood that
as a function is very close to the exponential loss and has the same minimizer.
FHT devise the Gentle Boosting algorithm based on the binomial surrogate
and show a similar performance to those based on the exponential loss. A third
surrogate, the squared loss function, is also discussed in FHT. The squared
function is the best possible for Newton’s method: convex and having constant
second derivatives. FHT report good performance but think that it is domi-
nated by schemes based on either the exponential or the binomial likelihood
loss functions and hint that the reason might be that the squared loss function
loses its monotonicity when $yF(x) > 1$. It is well known that the unique
minimizer of the squared loss is the conditional expectation of $Y$ given $X = x$,
which is naturally bounded between $-1$ and $1$ since $|Y| = 1$. If this constraint
is taken into account in the greedy search for the optimizer of the squared loss,
then one should not wander out to the region where $yF(x) > 1$ but stay in the
region where the squared loss is monotone. It is curious to know whether or
not taking this constraint into account (obviously numerically more compli-
cated) would bring the performance of squared loss boosting on a par with
those based on the exponential or binomial likelihood loss.
Although the three surrogates mentioned so far have sensible minimizers,
they are only qualitative approximations to the zero–one loss function. They
are actually quite bad quantitative approximations (cf. Figure 2 of FHT). But
the population minimizers of the expected zero–one loss function include those
of exponential, binomial likelihood and the squared loss. The first question
then simply becomes:
1. Which surrogate (or implementing loss function for boosting) to use so
that the boosting estimate best approximates its corresponding population
minimizer? This (difficult) question might involve numerical as well as
statistical–stochastic efficiency issues.
Now let us step back from these successful (implementing) surrogates to
look at the original zero–one loss function. From the quantitative approxima-
tion point of view, an obvious surrogate is a smoothed version of the zero–one
loss function, $1 - \Phi(yF(x)/\sigma)$, where $\Phi$ is the cdf of $N(0, 1)$ and $\sigma$ is a tuning
parameter to control the approximation of such a smoothed version to the original
zero–one loss function. This makes the numerical greedy approach possible.
If one follows the exact steps (when $\sigma$ is fixed) to this objective function
as in the derivation of the population version of Discrete AdaBoost (Result 1
in FHT), one gets a boosting algorithm with a different reweighting scheme,
having weights

$$\varphi(yF(x)/\sigma) = \varphi(F(x)/\sigma),$$

because $y^2 = 1$ and $\varphi(\cdot)$, the standard normal density, is symmetric.
These weights ignore the values of $y$, avoiding any reality check, and concentrate
more and more on the cases where the previous classifier $F$ is unsure of
itself. Since this is the opposite of the boosting reweighting philosophy, one
might expect that it would not work well. The smoothed zero–one loss func-
tion is not convex (or concave) and its derivatives change rapidly, especially
for a small tuning parameter $\sigma$, that is, a closer approximation to the zero–one
loss function. For such functions, the quadratic approximation is not accurate
and Newton’s method easily overshoots to miss the true optimizer. Hence this
recipe should not work from a numerical point of view.
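For concreteness, the smoothed loss and its label-free weights can be coded directly; a small sketch, with sigma and the function names being our own choices:

```python
import numpy as np
from scipy.stats import norm

def smoothed_01_loss(y, F, sigma=0.5):
    """1 - Phi(yF/sigma): a smooth surrogate for the zero-one loss."""
    return 1.0 - norm.cdf(y * F / sigma)

def smoothed_01_weights(F, sigma=0.5):
    """phi(yF/sigma) = phi(F/sigma), since y^2 = 1 and phi is symmetric:
    the weights depend only on |F|, not on the labels y."""
    return norm.pdf(F / sigma)
```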
2. The question is, however, for this smoothed zero–one loss, will a more suit-
able numerical optimization method such as trust region [cf. Gill, Murray
and Wright (1981)] lead to a sensible boosting procedure having a different
weighting philosophy, presumably with higher computational cost?
As argued above, boosting algorithms benefit greatly from a smart choice of
a surrogate implementing objective function from both statistical and numer-
ical points of view. FHT (Section 4.4) also point out that Gentle Boosting has
an edge because of the numerical stability of the derivative calculation of the
binomial likelihood function. However, boosting’s most powerful advocate is
its stunning success on real data sets and its “mysterious” resistance to over-
fitting in most cases in classification. The summary in Table 2 of FHT on
real dataset results suggests that the performance of a particular variant of
boosting depends on the interaction among the choice of the weak learner, the
underlying problem and the effectiveness of the Newton method as a numeri-
cal approximation to the surrogate loss function. Figuring out how these fac-
tors interact should bring us a deeper understanding of boosting and might
shed light on its resistance to overfitting.
We explore this interaction in the rest of this discussion. In the context of
boosting $L_2$ regression, we compare boosting with another ensemble scheme,
bagging [Breiman (1996)], for which we have gained some understanding
recently [Bühlmann and Yu (2000)]. Different weak learners are looked at,
the overfitting issue is touched upon, and a bag-boosting scheme is proposed.
In the last section, we return to classification to emphasize the importance of
the choice of the weak learner and to make the point that the resistance to
overfitting of boosting is probably due to the zero–one evaluation loss function.
2. Boosting and bagging with $L_2$ loss for regression. Boosting as
explained and analyzed by FHT is a general method: a greedy forward stagewise
technique to fit a model of additive style by minimizing a loss function.
This view opens the door for boosting in other contexts than classification
(although $L_2$ loss is also an appropriate surrogate as described in the previous
section). For $L_2$ regression, the boosting algorithm works as follows:
(a) Set $F_0(\cdot) \equiv 0$ (or another more sensible starting value).

(b) For $m = 1, 2, \ldots, M$, fit the function estimator $f_m(\cdot)$, which is parameterized as $f_m(x) = \beta b(x; \gamma)$ (as in FHT):

$$(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{n} \bigl(Y_i - F_{m-1}(X_i) - \beta b(X_i; \gamma)\bigr)^2.$$

Set $F_m(\cdot) = F_{m-1}(\cdot) + f_m(\cdot)$.

(c) Output the function estimator

$$F_M(\cdot) = \sum_{m=1}^{M} f_m(\cdot).$$
This algorithm is indicated by FHT [formula (6)] and was also given by
Friedman (1999). It does not involve any "reweighting." The weak learner
$b(\cdot; \gamma)$ is fitted in the $m$th step to the current residuals $Y_i - F_{m-1}(X_i)$. There
is no need for a surrogate loss function in this case, since the evaluating $L_2$ loss
is the best possible for Newton's method and the quadratic approximation is
exact. We now discuss the issue of choosing the weak learner. This discussion
is continued in Section 3 for the case of classification.
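A minimal runnable sketch of steps (a)-(c), using a stump as one concrete weak learner (anticipating the stump example further below) and simulating from model (1) defined below; fit_stump and l2_boost are our own names, and no claim is made that this matches the discussants' implementation.

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares stump for one-dimensional x: two levels split at t."""
    best = None
    for t in np.unique(x)[:-1]:
        left = x <= t
        a, b = r[left].mean(), r[~left].mean()
        rss = ((r[left] - a) ** 2).sum() + ((r[~left] - b) ** 2).sum()
        if best is None or rss < best[0]:
            best = (rss, t, a, b)
    _, t, a, b = best
    return lambda z: np.where(z <= t, a, b)

def l2_boost(x, y, M=6):
    """Steps (a)-(c): repeatedly fit a stump to the current residuals."""
    F, fs = np.zeros_like(y), []
    for _ in range(M):
        f = fit_stump(x, y - F)
        F, fs = F + f(x), fs + [f]
    return lambda z: sum(f(z) for f in fs)

# Model (1): Y = 2 + 3X + eps, X ~ Uniform(0, 1), eps ~ N(0, 1)
rng = np.random.default_rng(1)
n = 100
x = rng.uniform(0, 1, n)
y = 2 + 3 * x + rng.normal(0, 1, n)
F6 = l2_boost(x, y, M=6)          # six iterations, as in Figure 1
print(F6(np.array([0.2, 0.8])))   # crude, piecewise-constant approximation
```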
Linear weak learners. It is well known that boosting's cousin bagging does
not change any linear procedures. It can be shown easily that the same is true
of boosting in $L_2$ regression. When the weak learner is linear, $b(x; \gamma) = x^T\gamma$
and $f_m(x) = x^T\gamma_m$, the following two statements hold:

(a) For $L_2$ boosting in regression as described earlier,

$$f_m(\cdot) \equiv 0 \quad \text{for all } m = 2, 3, \ldots, M,$$
$$F_M(\cdot) = \text{the least-squares linear regression predictor, for all } M \ge 1.$$

(b) For LogitBoost as presented by FHT, with

$$p_M(x) = \exp(F_M(x)) / \bigl(\exp(F_M(x)) + \exp(-F_M(x))\bigr),$$

$p_M(x)$ converges to the corresponding MLE in logistic linear regression as
$M \to \infty$, provided that standard conditions for convergence of the MLE hold.
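Statement (a) is easy to check numerically: after the first least-squares step the residuals are orthogonal to the design columns, so the next fitted linear term is (numerically) zero. A small self-contained check under our own simulated setup:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])  # intercept + covariate
y = 2 + 3 * X[:, 1] + rng.normal(0, 1, n)

gamma1, *_ = np.linalg.lstsq(X, y, rcond=None)       # first boosting step
resid = y - X @ gamma1                               # residuals after step 1
gamma2, *_ = np.linalg.lstsq(X, resid, rcond=None)   # second step fits ~0
print(np.allclose(gamma2, 0.0))                      # True up to rounding
```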
Stumps as the weak learner. We now investigate the effect of boosting in
a concrete simple linear model,

$$Y_i = 2 + 3X_i + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad (1)$$

with the $\varepsilon_i$'s i.i.d. $N(0, 1)$, the $X_i$'s i.i.d. Uniform$([0, 1])$ and all $X_i$'s independent
from all $\varepsilon_i$'s. The target is the true function $F_{\text{true}}(x) = 2 + 3x$, $x \in [0, 1]$.
We use stumps (with one-dimensional covariate space) as weak learners and
run the $L_2$ boosting algorithm for regression from above. For a typical sample,
the boosting stumps estimate is displayed in Figure 1; it is still a rather crude
and erratic approximation for the target. Figure 1 uses six boosting iterations
which is optimal on average with respect to MSE as indicated in Figure 2.
Based on 50 independent simulations of model (1), we have estimated bias,
variance and hence also mean-squared error as a function of the number of
boosting iterations. The result is displayed in Figure 2.
3. Interestingly, there is a clear indication for overfitting, starting after six
iterations already. We see here a simple example in $L_2$ boosting for regres-
sion where overfitting occurs easily in contrast to classification; similar
phenomena are reported in Friedman (1999).
From our own work [Bühlmann and Yu (2000)] we know that stumps evaluated
at $x$ have high variances for $x$ in a whole region of the covariate space.
From an asymptotic point of view, this region is "centered around" the true
optimal split point for a stump and has "substantial" size $O(n^{-1/3})$. That
is, stumps do have high variances even in low dimensions as in this simple
case (with only three parameters), as long as one is looking at the "right
scale" $O(n^{-1/3})$; such high variance presumably propagates when combining
stumps in boosting. This observation is the starting point for another boosting
machine to be described next.
The bag-boosting machine with bagged weak learners. Bagging is known
to smooth a weak learner such as stumps and thus achieves a variance
reduction. The precise (asymptotic) smoothing function is given in Bühlmann
and Yu (2000), which also characterizes its variance reduction effect in simple
yet canonical cases. The high variance of a stump is reduced by up to a factor
of 2 to 3, while leaving the bias approximately unchanged. This implies
that bagging is also effective for a low-dimensional predictor such as stumps.
Hence, we combine bagging with boosting and we call it bag-boosting. For the
weak learner in boosting, just replace the stump (or a more general tree) by
the bagged stump (or bagged tree). This is a very natural idea and has been
thought about by a couple of researchers, although we have not seen anything
in writing. The resulting fit from bag-boosting is shown in Figures 1 and 2.
Note that performance for bagging alone is given by bag-boosting with one
boost. Instead of a classical bagging step, we actually use "sub-bagging," or
bagging on subsamples of size $[n/2]$ (resampling without replacement), which
is computationally cheaper while still being as accurate as bagging [Bühlmann
and Yu (2000)]. The visual improvement of bag-boosting in Figure 1 can be
explained by the fact that a bagged stump is a smooth rather than a step
function [Bühlmann and Yu (2000)].

Fig. 1. Boosting stumps (B, black) and bag-boosting stumps (BB, red) estimates based on a typical sample (dots) from (1). The target function is indicated in green. Boosting is done with 6 and bag-boosting with 4 iterations.

Fig. 2. Boosting stumps and bag-boosting stumps for model (1). Top: squared bias and variance of boosting (B, black and red) and bag-boosting (BB, purple and green). Bottom: mean-squared error for boosting (B, black) and bag-boosting (BB, red).

Its variance and MSE improvements are
impressively described by Figure 2, provided that we know roughly where to
stop with boosting iterations. Figure 2 illustrates that the more efficient base
learner (namely the bagged stump) has a faster increase in variance with the
number of boosting iterations: It would be interesting to know whether this
is a more general fact.
4. Thus bag-boosting has a potential to improve upon boosting with stumps
or larger decision trees. Its drawback is the higher, although still feasible,
computational cost.
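A minimal sketch of a sub-bagged stump as the weak learner, reusing the fit_stump function from the earlier sketch: it averages B stumps, each fitted on a random half-sample drawn without replacement; the names and B = 25 are our own choices.

```python
import numpy as np

def fit_subbagged_stump(x, r, B=25, rng=np.random.default_rng(3)):
    """Sub-bagged stump: average B stumps fitted on subsamples of size n/2
    drawn without replacement, yielding a smoothed weak learner."""
    n = len(x)
    fits = []
    for _ in range(B):
        sub = rng.choice(n, size=n // 2, replace=False)  # sub-bagging sample
        fits.append(fit_stump(x[sub], r[sub]))
    return lambda z: np.mean([f(z) for f in fits], axis=0)
```

Plugging this learner into the l2_boost loop in place of fit_stump gives a bag-boosting machine of the kind described above.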
3. Back to classification. Many issues remain. For example, the ques-
tion about convergence is not touched by FHT. In particular, for a fixed sample
size, there are two issues: the (non)convergence of the boosting algorithm and
whether it is better to stop iterations before having approximately converged,
if the algorithm actually would converge. The latter issue is known in fitting
complex parametric models such as neural networks where a possible regular-
ization is given by stopping before convergence. There is another convergence
issue at the next level, when the sample size tends to infinity.
5. A more important question is how much FHT’s population story tells about
the data driven algorithms. The population quantities are always approx-
imated: the accuracy depends on the choice of weak learners (see also
our remarks about linear weak learners, stumps and bagged stumps in
Section 2) and the data-generating distributions. Finally, the hope is that
the errors do not propagate with the boosting iterations!
Particularly, even with the connection made by FHT, it is hard to know
whether one should use a simple learner like stumps or a more complex one
such as trees with more terminal nodes.
In the boosting literature, which is almost exclusively on classification, deci-
sion trees are the most popular weak learners. We do agree that trees are
generally attractive, but it is worth pointing out that other (nonlinear) weak
learners may exhibit good performance as well. Even when using trees as
weak learners, the choice of the number of terminal nodes (e.g., with best-first
induction) in the learners does matter. Figure 1 in FHT can be misleading for
a superficial reader. It wrongly suggests that the number of terminal nodes
in the learner is irrelevant when there is sufficient computing power for per-
forming many boosting steps; and that with stumps, Discrete AdaBoost is
outperformed by Real AdaBoost consistently. The careful reader will note
a more subtle sensitivity with respect to choosing the learner in Section 6
of FHT [e.g., Figure 4 (top left and top right)] and results on real data sets
(Table 2).
FHT distinguish between additive and nonadditive decision boundaries.
They argue convincingly that in the first case, choosing stumps as weak learn-
ers is better than using a larger tree which is expected to overestimate nonex-
isting interaction terms. If we knew that the decision boundary could be well
approximated by an additive function, we would feel more comfortable by fit-
ting an additive model (with backfitting) as in Hastie and Tibshirani (1990),
rather than boosting stumps. Or in other words, if the conjecture by FHT
[Section 11] “that boundaries involving very high-order interactions will rarely
be encountered in practice” (whatever “high” means here) is really relevant,
there would not be any good reason to prefer boosting over the established
and popular GAM backfitting estimate.
Boosting’s resistance to overfitting has been one of its major virtues in
comparison with other nonparametric techniques. Overfitting in boosting algo-
rithms in the sense of fitting a large model can happen in two ways. The first
is having a large model fitted as the weak learner and the other is running
the algorithm for many iterations. The most overfitted model comes from a
large or complex weak learner in combination with many boosting iterations.

Table 1
Misclassification rates (in percentages) for the breast cancer data

Original tree (unpruned; the tree program from R)   5.643
Sub-bagging                                         4.812
Boosting stumps                                     4.029 (with optimal 97 boosts)
Bag-boosting stumps                                 3.652 (with optimal 47 boosts)
Boosting large tree                                 2.929 (with optimal 117 boosts)
Bag-boosting large tree                             2.768 (with optimal 11 boosts)

FHT touch on this overfitting issue in the concluding section by listing
three plausible explanations. As emphasized in Section 1 of this Discussion,
the use of the (implementing) surrogate exponential loss function is crucial
from the numerical point of view and is thus partially responsible for the
success of boosting. However, the evaluation in a classification problem has
always been the zero–one loss.
6. This divorce between the implementing and the evaluating loss function does not exist in L2 boosting, where we saw overfitting easily in Section 2. We concur with the last point of FHT in Section 11 that the zero–one loss is very robust against overfitting, and we further argue that an evaluation based on the exponential implementing loss does show strong overfitting. This conjecture is supported by the breast cancer dataset below. Moreover, we believe that “parts of the real world” are nonadditive. Under this belief, and knowing boosting’s resistance to overfitting, it can be very valuable to use best-first induced trees with more than two terminal nodes. This is also confirmed by our analysis of the breast cancer data in Table 1.
Analysis of breast cancer data. We partially redo the analysis of FHT for the breast cancer data and thereby add insights on bag-boosting and overfitting. Table 1 and Figure 3 give results comparable to the ones in FHT. We use “sub-bagging” for computational efficiency; it has performance similar to bagging [Bühlmann and Yu (2000)]. The misclassification rates (called test error in FHT) are estimated by averaging over 100 random partitions of the data set into a 90% training set and a 10% test set, whereas FHT use 5-fold cross-validation. So our results differ slightly from FHT (also because presumably we are using different tree algorithms), but we believe ours have better accuracy. Boosting and its variants are done with Real AdaBoost, and the trees are grown with the program from R. Bag-boosting large trees gives the best result, followed by boosting large trees, bag-boosting stumps, boosting stumps and then bagging. Bag-boosting needs only a few iterations and brings a nontrivial improvement over boosting alone, at the expense of a more computationally costly weak learner. Note that Figure 3 (bottom panel) shows strong overfitting with respect to E[exp(−Y F(X))].
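To make the procedures compared in Table 1 concrete, the following is a minimal numpy sketch of Real AdaBoost with stumps, together with the subbagged “bag-boosting” variant. It is only an illustration under stated assumptions: the weighted Gini split criterion, the half-sample size and all function names are choices made here for the sketch, not the actual code behind Table 1 or Figure 3.

import numpy as np

def fit_stump(X, y01, w):
    # Weighted stump: choose (feature, threshold) minimizing weighted Gini;
    # each side predicts its weighted proportion of the class y = 1.
    best = None
    for j in range(X.shape[1]):
        order = np.argsort(X[:, j])
        xs, ys, ws = X[order, j], y01[order], w[order]
        cw, cwy = np.cumsum(ws), np.cumsum(ws * ys)
        for i in range(len(xs) - 1):
            if xs[i] == xs[i + 1]:
                continue
            wl, wyl = cw[i], cwy[i]
            wr, wyr = cw[-1] - wl, cwy[-1] - wyl
            pl, pr = wyl / wl, wyr / wr
            gini = wl * pl * (1 - pl) + wr * pr * (1 - pr)
            if best is None or gini < best[0]:
                best = (gini, j, (xs[i] + xs[i + 1]) / 2, pl, pr)
    return best[1:]  # (feature, threshold, p_left, p_right)

def stump_p(s, X):
    j, t, pl, pr = s
    return np.where(X[:, j] <= t, pl, pr)

def real_adaboost(X, y, M=100, n_bag=0, seed=0):
    # Real AdaBoost with stumps; y in {-1, +1}. With n_bag > 0 each boosting
    # step averages n_bag stumps fit to random half-samples (bag-boosting
    # via subbagging) before taking the half log-odds.
    rng = np.random.default_rng(seed)
    n, y01, eps = len(y), (y + 1) / 2, 1e-5
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(M):
        if n_bag == 0:
            members = [fit_stump(X, y01, w)]
        else:
            subs = [rng.choice(n, n // 2, replace=False) for _ in range(n_bag)]
            members = [fit_stump(X[i], y01[i], w[i]) for i in subs]
        p = np.clip(np.mean([stump_p(s, X) for s in members], axis=0), eps, 1 - eps)
        f = 0.5 * np.log(p / (1 - p))          # half log-odds contribution
        w *= np.exp(-y * f)                    # AdaBoost reweighting
        w /= w.sum()
        ensemble.append(members)
    return ensemble

def predict_F(ensemble, X, eps=1e-5):
    F = np.zeros(len(X))
    for members in ensemble:
        p = np.clip(np.mean([stump_p(s, X) for s in members], axis=0), eps, 1 - eps)
        F += 0.5 * np.log(p / (1 - p))
    return F  # classify with np.sign(F)

In this sketch the averaged probabilities, rather than the individual stumps, define each step’s half log-odds contribution, which is one natural reading of the bag-boosting proposal.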
Fig. 3. Different weak learners in Real AdaBoost for breast cancer data (Section 7 in FHT). Top: misclassification rates. Bottom: expected exponential losses E[exp(−Y F(X))] on the logarithmic scale.
7. For this particular dataset and with respect to the misclassification rate, there is hardly any overfitting. This suggests that there is no loss in using large trees, as long as there is never any substantial overfitting. Boosting stumps, on the other hand, restricts the fit to additive models. With respect to the misclassification rate, should we always boost with large trees as weak learners?
Acknowledgments. Bin Yu thanks Mark Hansen for many stimulating
discussions and helpful comments on the draft and Margaret Wright for help-
ful discussions on optimization. Special thanks are due to Josef Vogt for the
results on the breast cancer data.
REFERENCES
Breiman, L. (1996). Bagging predictors. Machine Learning 24 123–140.
Bühlmann, P. and Yu, B. (2000). Explaining bagging. Preprint.
Friedman, J. (1999). Greedy function approximation: the gradient boosting machine. Technical report, Stanford Univ.
Gill, P. E., Murray, W. and Wright, M. H. (1981). Practical Optimization. Academic Press, New York.
Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman and Hall, London.
Seminar für Statistik
ETH-Zentrum, LEO D72
ETH-Zentrum, LEO D72
CH-8092 Zurich
Switzerland
E-mail: buhlmann@stat.math.ethz.ch
Department of Statistics
University of California
Berkeley, California 94720-3860
DISCUSSION
Andreas Buja
AT&T Labs
1. The cultural background of boosting: machine learning versus
statistics. I have worked under the same roof with the inventors of boosting,
Freund and Schapire, for several years, but I might as well have been on
another planet. The physical vicinity was of no help in coming to grips with
boosting. I listened to their talks, but I heard the words and missed the point.
Later I read Leo Breiman’s debates with Freund and Schapire, and it really
got my attention when I heard his praise of boosting. His penetrating analysis
was an effort to understand the performance of boosting, in particular its
mysterious immunity to overfitting. What he did not provide was a translation
of boosting into terms that are familiar to statisticians. He did not bridge
the culture gap that separates the stodgy main stream of statistics from its
dynamic younger sibling in computer science, machine learning. This is one
of the achievements of the paper by Friedman, Hastie and Tibshirani (FHT):
It gives statisticians a language with which to think about boosting.
If Freund and Schapire feel misinterpreted or if they think of FHT’s statis-
tical modeling interpretation as misguided, as I expect they will, it is beside
the point. There is no single “true” interpretation of anything; interpretation
is a vehicle in the service of human comprehension. The value of an interpre-
tation is in enabling others to fruitfully think about an idea, and this is what
the FHT paper certainly does for statisticians and the idea of boosting.
An interpretation of new ideas in terms of old ones is not a mere reduction
of the ideas to an “old hat.” On the contrary, such a reduction can clarify what
is novel. The emphasis of the FHT interpretations may be too long on the
reduction aspect and too short on working out the novelty in boosting; I will
try to explain what I mean below. The novelty that we see in light of FHT’s
interpretations, though, may not agree with what Freund and Schapire think
is novel. Thus, while the FHT paper may provide a gate for statisticians to a
new area, the same gate may not be open in reverse for the folks in machine
learning (although I’d love to be proved wrong).
2. Boosting arbitrary generalized linear and additive models. Ac-
cording to FHT, boosting can be seen as a two-class classification method that
fits additive models on a logistic scale in a forward stagewise manner with
regard to a loss function that is second-order equivalent to the negative log
likelihood for the conditional Bernoulli model. This interpretation leads FHT
naturally to propose their “LogitBoost” method by substituting the exact neg-
ative log likelihood for the original loss function.
The test of every interpretation is in its capacity to spawn new ideas and
generalizations. Differing interpretations are worthwhile because they lead to
differing generalizations. The FHT interpretation resulted immediately in a
very competitive new version of boosting, as well as an improvement in boost-
ing multiclass classification.
It seems to me, though, that the FHT interpretation calls, even screams,
for a much more global extension of boosting: Now that we know how to boost
additive logistic regression, we also know how to boost any generalized addi-
tive model. If Algorithm 3 of FHT for LogitBoost is the familiar Newton–
Raphson algorithm modified for the acquisition of nonlinear terms in a stage-
wise logistic regression, then the same can be used for the acquisition of non-
linear terms in any stagewise generalized regression.
In particular, we now know how to boost ordinary least-squares regression:
by stagewise LS fitting of nonlinear terms to the previous residuals. This may
seem trite, but it isn’t if one has ever tried to come up with a version of boosting
for regression with continuous response. Freund and Schapire have tried, and
their tentative proposal was to reduce regression to some form of classification,
in a classical attempt at reducing a new problem to an old one. This seems
like a circuitous approach, however, once the FHT interpretation of boosting is
at hand. With the extension of the FHT interpretation to generalized additive
regression, it seems most natural to interpret stagewise LS regression as the
regression version of boosting.
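A minimal sketch of this regression version of boosting, assuming a one-dimensional predictor and stumps as the nonlinear terms (both are illustrative choices): each stump is fit by least squares to the current residuals and added to the fit, with no readjustment of earlier terms.

import numpy as np

def ls_stump(x, r):
    # Least-squares stump: split point minimizing the residual sum of
    # squares when each side is fit by its mean.
    order = np.argsort(x)
    xs, rs = x[order], r[order]
    csum, total_ss = np.cumsum(rs), np.sum(rs ** 2)
    best = None
    for i in range(1, len(xs)):
        if xs[i - 1] == xs[i]:
            continue
        sl, sr = csum[i - 1], csum[-1] - csum[i - 1]
        sse = total_ss - sl ** 2 / i - sr ** 2 / (len(xs) - i)
        if best is None or sse < best[0]:
            best = (sse, (xs[i - 1] + xs[i]) / 2, sl / i, sr / (len(xs) - i))
    _, t, mean_left, mean_right = best
    return lambda z: np.where(z <= t, mean_left, mean_right)

def boost_ls(x, y, M=200):
    # Stagewise least squares: fit a stump to the residuals, add it, repeat.
    F, terms = np.zeros_like(y, dtype=float), []
    for _ in range(M):
        f = ls_stump(x, y - F)
        F += f(x)
        terms.append(f)
    return lambda z: sum(f(z) for f in terms)

# Usage: recover a smooth signal from noisy data
x = np.random.default_rng(0).uniform(-2, 2, 500)
y = np.sin(3 * x) + 0.3 * np.random.default_rng(1).standard_normal(500)
fhat = boost_ls(x, y)

With M large and no damping this will eventually track the noise; the point here is only the mechanics of unadjusted stagewise fitting.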
3. The role of greedy stagewise fitting in the protection against
overfit. FHT’s interpretation of boosting is lucid, but their presentation of
stagewise fitting needs to be put in perspective. I have no quibbles with the
facts, but some with the positioning of the facts.
FHT do state that the fitting à la boosting is “forward stagewise,” but the
impression we (and above all the readers in machine learning) may get is that
this term describes a common mode of model fitting in statistics. It doesn’t. We
have the term, but it is not very well known, and it does not enjoy high stand-
ing among most statisticians who know about it. It is a computational device,
or more accurately, a capitulation, in those situations where one is unable to
fit a large number of model terms jointly. We even teach our students that
fitting one term at a time and not adjusting the others is not the way to fit a
model; we teach them that changing one term in a model affects all others in
a joint model fit, assuming that the joint model fit is the norm. Interestingly,
one popular method for joint fitting, the backfitting algorithm, was devised
by Friedman and Stuetzle (1981) to more closely achieve joint optimality in
the otherwise greedy fitting procedure of projection pursuit regression (PPR).
Had they omitted the backfitting part in their algorithm and allowed the
acquisition of a greater number of ridge function terms instead, PPR would
have been a regression version of boosting. It appears that even the uncon-
ventional minds of Friedman and Stuetzle were involuntarily constrained by
statisticians’ prejudice in favor of parsimonious models obtained by joint model
fits. Doing away with this prejudice is certainly one of the novelties in the
boosting endeavor. Ockham’s razor is being replaced by Freund and Schapire’s
sledgehammer.
Omitting joint fitting is crucial to the success of boosting: Its purported
advantage, relative protection against overfit, disappears instantaneously if
termwise (stagewise) fitting is replaced with joint fitting of all terms. The 500
or so stumps and twigs that I have seen fitted in a single boosting fit would
come crashing down if they were fitted jointly. Therefore, the essence of the
protection against overfit is in the suboptimality of the fitting method.
This also explains why boosting could not have been invented by statisti-
cians: Our preoccupations are such that the statistical benefits of a suboptimal
computational device would not have appeared on our radar. There are excep-
tions to the rule, though: Statisticians have, for example, developed a notion of
one-step estimates in robustness, where it was shown that one Newton step in
the computation of an M-estimate provides essentially the robustness benefits
of full iteration. Similarly, in order to understand the performance of boosting,
we will need a theory that explains why this particular type of suboptimality
often avoids overfit.
A curious observation in this context concerns the cavalier attitude in the
boosting community, including FHT, toward the stopping problem. They seem
to consider it sufficient to provide plots or tables that show misclassification
rates for 100, 200, 300, 400 terms. So we stop around 300 or 400, and this is
part of the definition of the method. The assumption, empirically substanti-
ated, is that it really doesn’t matter much as long as we stop after the acqui-
sition of a sufficiently large number of terms. The situation is similar to that
in artificial neural networks (NNs), where the stopping time of online opti-
mization acts as a smoothing parameter. In NNs, however, one tries to detect
the onset of overfit with cross-validation. In boosting, by contrast, the testing
error remains flat for a long time, overfit does not occur before the human
runs out of patience and a precise stopping rule seems therefore unnecessary.
A satisfactory theory of protection against overfit would have to explain
the relative flatness of the test error in boosting. The flatness is only relative,
however, and even boosting has to submit to overfit at some point. This is
seen from the fact that usually the class of fitted terms bx γis rich in the
sense that for a given finite sample size Nthe N-dimensional design vectors
bγ=bxnγn=1···Nspan all of RNas γruns across its domain.
This is obviously not a proof, but a slightly tighter argument can be given.
First, we note that, unlike in joint fitting, any iterative termwise fitting procedure depends not only on the space spanned by the terms, but also on the actual basis they represent and the order in which they are fitted. (I thank Werner
Stuetzle who spelled this out most clearly to me.)
In the simple case where the domain of the γ’s is finite and consists of K elements γ_1, …, γ_K only, boosting is a greedy form of backfitting the linear model Σ_{k=1}^{K} β_k b(x; γ_k). Instead of cycling through the terms in an orderly fashion k = 1, 2, …, K, 1, 2, … as in conventional backfitting, the next k is picked greedily so as to lower the loss function maximally. We can conceive of
at least three types of backfitting according to the order in which the terms
are fitted: (1) cycle over the terms in a fixed order (traditional backfitting),
(2) pick the next term randomly, (3) pick the next term greedily. It would
seem plausible to examine combinations of type 2 and type 3, whereby the
picking probability of a term is a function of its loss reduction; as is, type 2
uses a uniform distribution on the terms and type 3 puts probability 1 on the
term with maximal reduction.
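These three orders are easy to compare empirically. A sketch for squared-error loss with a fixed, finite dictionary B of K basis vectors; the coordinate-update formula is the standard least-squares one, and the function and argument names are illustrative:

import numpy as np

def termwise_ls(B, y, order="greedy", n_iter=1000, seed=0):
    # Termwise least-squares fitting of y ~ B @ beta, one term at a time:
    # order = "cyclic" (type 1), "random" (type 2) or "greedy" (type 3).
    rng = np.random.default_rng(seed)
    n, K = B.shape
    beta, r = np.zeros(K), y.astype(float).copy()
    norms = (B ** 2).sum(axis=0)
    for t in range(n_iter):
        if order == "cyclic":
            k = t % K
        elif order == "random":
            k = int(rng.integers(K))
        else:  # greedy: pick the term giving the largest loss reduction
            k = int(np.argmax((B.T @ r) ** 2 / norms))
        step = (B[:, k] @ r) / norms[k]   # unadjusted update of one coefficient
        beta[k] += step
        r -= step * B[:, k]
    return beta

Run long enough with order="cyclic" this converges to the joint least-squares solution, which is exactly the point made next: if the joint solution overfits, so eventually does the termwise one.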
It is clear from the known convergence of type 1 backfitting to the joint
least-squares solution that overfit must occur if the joint solution overfits. If
similar convergence properties for type 2 and type 3 backfitting hold, overfit
will occur there also. In light of these considerations, it may be just a matter
of time till boosting follows the lead of NNs and cross-validation is introduced
for the stopping problem in boosting as well.
An issue that a theory of protection against overfit would have to tackle is
that there are two ways for a learner or base term b(x; γ) to be “weak”: weakness can arise from bias or from variance. In either case the weak learner
usually contains signal, but the signal is weak for different reasons. Weak-
ness due to bias means underfit, weakness due to variance means overfit.
Boosting is designed to reduce bias but not variance. The latter is addressed
by Breiman’s (1996) “bagging” proposal and by Bayesian model averaging. As
things stand, one either approaches good fits from poor fits with boosting, or
from overly flexible fits with model averaging. It would be desirable if both
bias and variance could be tackled in one unified approach.
4. The role of the functional form of the weak learner. Additive models with small trees as base terms are an innovation we owe to the machine learning literature. Trees are not essential to boosting, but the idea of building
additive models from multiple trees is a welcome side benefit.
Functional form hasn’t been a topic of great debate in the boosting liter-
ature, probably because of the original justifications in terms of the “weak
learner” metaphor. These justifications do not carry us far, however. The ques-
tion is not whether but how much a weak learner can be boosted, or in statis-
tical terminology: how well an additive model can approximate an underlying
response surface. A major determinant for answering this second question is
the functional form of the weak learner b(x; γ). The fact that the boosting folks
haven’t hit on this the hard way is probably due to the Holte (1993) effect,
cited a couple of times by FHT: “Very simple classification rules perform well
on most commonly used datasets.” That is, nature rarely chooses complex deci-
sion boundaries that couldn’t be described by thresholding additive functions
or at most two-variable or three-variable interactions.
Inspired by ANOVA practices, FHT propose to measure and control com-
plexity of a functional form in terms of interaction order, that is, the maximum
number of variables admitted in the weak learner at any given time. For trees,
the interaction order is identical with the maximum admitted tree depth.
But the idea of interaction order can be extended to other functional forms
as well. The most obvious are Friedman’s MARS fits, in which interaction
order can be controlled as easily as in trees. Another possibility is projection pursuit regression and classification (PPR and PPC, respectively), although a
major modification is necessary: the linear combinations that enter the ridge
functions would have to be constrained to, say, two or three predictors at
a time.
In order to more fully understand boosting, it will be necessary to apply it
to functional forms other than trees. This would not be difficult in the case of
MARS and PPR/C. In fact, for a boosting version of either, one would simply replace the joint fitting with unadjusted greedy fitting. Ironically, stagewise
modifications of MARS and PPR/C require that the technically most creative
ideas of their inventors be thrown out: Friedman used sophisticated stable
Cholesky updates for the fast computation of joint fits in MARS, and Friedman
and Stuetzle devised backfitting for the same purpose in PPR/C. Now we are
led to expect that giving up these technical marvels will be beneficial! Having
experienced first hand the problems of overfit in the original MARS, I would
be thrilled if the anticipated benefits of boosting showed up.
5. Conclusion. The FHT interpretation of boosting has implications for
machine learning as well as for statistics. For machine learning, the generalization of FHT’s ideas leads us straight to a regression version of boosting.
For statistics, FHT’s ideas raise the expectation that the benefits of boosting
might carry over to other stagewise modifications of nonparametric fitting pro-
cedures. The latter is just a guess so far, and in the absence of a convincing
theory of boosting’s protection against overfit, we wouldn’t even know why it
held if it did. Clearly, there is a lot of work ahead of us. FHT have given us a
means for generating new ideas that can guide our future research.
REFERENCES
Breiman, L. (1996). Bagging predictors. Machine Learning 24 123–140.
Friedman, J. and Stuetzle, W. (1981). Projection pursuit regression. J. Amer. Statist. Assoc. 76 817–823.
Holte, R. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning 11 63–90.
AT&T Labs
Shannon Laboratory
180 Park Avenue
Florham Park, New Jersey 07932-0971
E-mail: andreas@research.att.com
DISCUSSION
Yoav Freund and Robert E. Schapire
AT&T Labs
The main and important contribution of this paper is in establishing a
connection between boosting, a newcomer to the statistics scene, and additive
models.
One of the main properties of boosting that has made it interesting to statis-
ticians and others is its relative (but not complete) immunity to overfitting. As
pointed out by the authors, the current paper does not address this issue. Leo
Breiman (1998) tried to explain this behaviour in terms of bias and variance.
In our paper with Bartlett and Lee (1998), we gave an explanation in terms
of the “margins” of the training examples and the VC-dimension of the base
class. Breiman, as well as the current paper, point out that our bounds are
very rough and are thus not useful in practice. While this is clearly true at
this time, it is also true that the analysis given by Breiman and by this paper
yield no provable bounds whatsoever. It is completely unclear whether this
analysis can be used to predict the performance of classification rules outside
of the training sample.
At the root of this argument about boosting is a much more fundamen-
tal argument about the type of prior assumptions that one should make when
embarking on the task of inducing a classification rule from data. The assump-
tion that seems to underlie the use of maximum likelihood in the current paper
is that data are generated by a distribution from a prespecified class. In this
case, this is the class of distributions in which the relationship between the
features and the labels is described by a log-linear function. In comparison,
the assumption that we make in our analysis is that the data are generated
from some arbitrary distribution in an i.i.d. fashion. Clearly, our assumption
is the weaker one and this leads to a theory that is more generally applicable.
From a related but more practical point of view, one main issue when apply-
ing boosting or boosting-like techniques in practice is how to choose the base
class. The approach taken in this paper is that this choice is made based
on our prior beliefs regarding the type of log-linear dependencies that might
exist between the features and the label. On the other hand, in the boosting
approach, we make an assumption about what kind of rules might have slight
but significant correlations with the label. This is the essence of the “weak
learning” assumption upon which the theory of boosting is founded.
In the current paper, boosting is analyzed mostly in the context of decision
stumps and decision trees. The argument seems to be that while in most
real-world cases decision stumps are powerful enough, in some less com-
mon cases the type of dependencies that exist in the data require a more
powerful base class, such as two- or three-level decision trees. A rather differ-
ent approach to the combination of decision trees and boosting was recently
proposed by Freund and Mason (1999). They represent decision trees as sums
of very simple functions and use boosting to simultaneously learn both the
decision rules and the way to average them.
Another important issue discussed in this paper is the performance of boost-
ing methods on data which are generated by classes that have a significant
overlap, in other words, classification problems in which even the Bayes opti-
mal prediction rule has a significant error. It has been observed by several
authors, including those of the current paper, that AdaBoost is not an optimal
method in this case. The problem seems to be that AdaBoost overemphasizes the atypical examples, which eventually results in inferior rules. In the current
paper, the authors suggest “GentleBoost” as a better method than AdaBoost
for this case. The reason that this might be a better method is that it gives
less emphasis to misclassified examples. The increase in the weight of the
example is quadratic in the negative margin, rather than exponential.
However, one can argue that this alteration of AdaBoost, while being a step
in the right direction, is not large enough. In fact, one can argue that once
an example has a very large negative margin it is best to assume that it is
an outlier that should be completely removed from the training set. A new
boosting algorithm based on this radical approach was recently proposed by
Freund (1999).
REFERENCES
Breiman, L. (1998). Arcing classifiers. Ann. Statist. 26 801–849.
Freund, Y. (1999). An adaptive version of the boost by majority algorithm. In Proceedings of the
Twelfth Annual Conference on Computational Learning Theory.
Freund, Y. and Mason, L. (1999). The alternating decision tree learning algorithm. In Machine
Learning: Proceedings of the Sixteenth International Conference 124–133.
Schapire, R. E., Freund, Y., Bartlett, P. and Lee, W. S. (1998). Boosting the margin: a new
explanation for the effectiveness of voting methods. Ann. Statist. 26 1651–1686.
AT&T Labs–Research
180 Park Avenue
Florham Park, New Jersey 07932-0971
E-mail: yoav@research.att.com
DISCUSSION
Greg Ridgeway
University of Washington
The authors have done a great service to the field of statistics by branch-
ing out to the problems of interests to the computational learning theorist.
The connection between boosting and maximum likelihood not only helps to
explain why boosting is an effective classification method but it also opens
the door to the application of boosting to many other statistical prediction
problems.
It is difficult to overstate the attention that boosting has received in the machine learning and data mining communities. Soon after Freund and
Schapire’s algorithm made its debut, a flurry of empirical work showcased its
uncanny ability to improve prediction accuracy [Quinlan (1996), Bauer and
Kohavi (1999), Drucker and Cortes (1996)]. Charles Elkan’s application of the
boosted naïve Bayes won first place out of 45 entries in the 1997 data mining
competition [Elkan (1997)], demonstrating the strength of boosting against
research and industrial prediction tools. Elkan also noted in this work that
his model had the form of nonlinear logistic regression but did not completely
connect boosting with optimization and likelihood. Boosting is even making
its way into popular science as noted in the New York Times [Lee (1999)].
Low-variance, bias-reducing steps. Likelihood-based inference is in
the business of obtaining good parameter estimates, which is not necessarily
the same as producing a good predictor. In the classification game we win if we
produce a classifier with the lowest misclassification rate. A good estimator of
the underlying class distribution conditional on x is of secondary importance.
Consider the following example.
Example 1. Generate N = 1000 observations as

x ∼ N(0, 1),   F(x) = (1/2)x² − 1,
p(x) = exp(2F(x)) / (1 + exp(2F(x))),   y | x ∼ Bern(p(x)).

The Bayes decision rule is to classify a new observation as a 1 if |x| > √2 and 0 otherwise. The misclassification error for this rule is about 20.2%. Figure 1 shows the p(x) that generated the data as a smooth curve.

Fig. 1. The smooth curve is P(Y = 1 | x) and the jagged curve is the estimate that the Real AdaBoost algorithm produces after 100 iterations using a stump classifier on the Example 1 data.

After 100
boosting iterations using a stump classifier (a piecewise constant with only one knot, fit using CART) we see that boosting produces an estimate of p(x) that is overfit in terms of providing an accurate picture of the underlying data-generating mechanism. However, the misclassification rate on a test dataset with 10,000 observations is only 21.5%, not far off from the Bayes error rate. More iterations have little effect. This counts as a win for the analyst interested solely in predictive performance but a loss for those seeking to uncover the data-generating mechanism. Even though the authors’ Lemma 1 shows that the population minimizer of E[e^{−yF(x)}] is the true conditional class probability, with a finite sample it is easy for AdaBoost to overfit from a function estimate or calibration point of view. The debate between accurate prediction and the discovery of the contents of the black box is certainly not new, but it also has not gone away.
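Since the display above had to be reconstructed from a garbled printing, the sketch below makes the data-generating mechanism explicit; the function F(x) = x²/2 − 1 and the Bayes rule |x| > √2 are the reconstruction used here.

import numpy as np

rng = np.random.default_rng(1)
N = 1000
x = rng.standard_normal(N)
F = 0.5 * x ** 2 - 1.0                       # F(x) = x^2/2 - 1
p = 1.0 / (1.0 + np.exp(-2.0 * F))           # p(x) = exp(2F)/(1 + exp(2F))
y = (rng.uniform(size=N) < p).astype(int)    # y | x ~ Bern(p(x))

# Estimate the error of the Bayes rule "predict 1 iff |x| > sqrt(2)"
xt = rng.standard_normal(200_000)
pt = 1.0 / (1.0 + np.exp(-2.0 * (0.5 * xt ** 2 - 1.0)))
yt = (rng.uniform(size=xt.size) < pt).astype(int)
err = np.mean((np.abs(xt) > np.sqrt(2)).astype(int) != yt)
print(round(float(err), 3))                  # about 0.202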
Some of the amazement surrounding boosting was that it could take a sim-
ple classifier like a stump and turn it into a near-Bayes optimal classifier.
The choice of using a stump at each iteration in the example is critical. After
two boosting iterations the classifier will pretty much nail the correct decision
boundary since the two knots will likely be in the neighborhood of −√2 and +√2. Due to the constraints on the base classifier, subsequent iterations can
do little to modify the decision boundary. Therefore, local jaggedness devel-
ops as successive stumps try to fit individual observations but it cannot be
substantial enough to have an effect on the misclassification rate. This leads
to the question of the importance of selecting the base classifier. In this exam-
ple, fitting slightly more complex trees at each iteration has disastrous effects
on the performance. Figure 2 tracks the change in misclassification rate with
each boosting iteration when the base classifier is a tree with four termi-
nal nodes. A four-terminal node tree is likely to nearly capture the decision
boundary on the first iteration. Subsequent iterations have enough flexibility
to fit individual observations and generalization performance declines. From
Figure 2 we can see that the performance degrades to slightly worse than
predicting 0, the majority class, for all observations.
The results of Example 1 yield the following insights into boosting.
1. AdaBoost (in its present form) is not necessarily useful in function estima-
tion. Calibration of the boosted probability estimates is questionable.
2. The idea that AdaBoost is resistant to overfitting needs some qualification.
The choice of base classifier can make or break the success of utilizing
boosting.
3. Boosting seems to work well when we use a sequence of simple, low vari-
ance classifiers. Each stage makes a small, low variance step, sequentially
chipping away at the bias.
Fig. 2. Misclassification rate when fitting trees with four terminal nodes to the Example 1 test data on each Real AdaBoost iteration. The horizontal line marks the error rate if we predicted all observations to be zeros.

Other looks at functional optimization. Once boosting is cast as functional optimization we can approach the problem from other avenues. In the AdaBoost setup we wish to find an F(x) that minimizes J(F) = E[e^{−yF(x)}]. An additive improvement to F(x) would find an α and an f(x) such that J(F + αf) decreases. The authors describe how AdaBoost selects α and f(x). The Gentle AdaBoost algorithm that the authors propose performs a pointwise Newton step to suggest f(x) with α fixed at 1. We can also view this as gradient descent in function space with a squared penalty on the magnitude of f(x). Consider the following analysis.
The directional derivative, or Gâteaux variation, of J(F) in the direction of some function f(x) computes the local change in J(F) if we make the modification F(x) ← F(x) + αf(x), moving F(x) in the direction of f(x). The directional derivative of J(F) is

δJ(F; f) = (d/dα) E[exp(−y(F(x) + αf(x)))] |_{α=0}
         = −E[y f(x) exp(−yF(x))].                                  (1)

A gradient descent strategy would have us select from among some class of functions the f(x) that minimizes δJ(F; f), or equivalently offers the greatest local decrease in J(F). Without further restrictions this method is rather unstable, since we can make δJ(F; f) arbitrarily small by setting

f(x) = −∞ if E(y|x) < 0,   +∞ if E(y|x) > 0.                        (2)
On the other hand, if we penalize f(x) in (1) for being too big, then we might select f(x) as

f(x) = arg min_f E[ −y f(x) e^{−yF(x)} + (1/2) f(x)² ]
     = arg min_f E[ −y f(x) e^{−yF(x)} + (1/2) f(x)² + (1/2) (y e^{−yF(x)})² ]
     = arg min_f E[ ( f(x) − y e^{−yF(x)} )² ].                     (3)
Note that (3) is equivalent to choosing the f(x) that is similar to a least-squares fit to the gradient that the authors compute in their derivation of the Gentle AdaBoost algorithm. Obviously, in practice we only have a sample with which to estimate the expectation in (3). Therefore, it becomes critical that we use a low-variance least-squares regression method to minimize the empirical estimate of (3). I will return to this topic in a moment. After estimating f(x) we can use a line search to obtain the α that minimizes E[exp(−y(F(x) + αf(x)))].
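In sample terms, (3) says: compute the pseudo-response z_i = y_i exp(−y_i F(x_i)) and regress it on x with a low-variance method. A minimal sketch, with a grid search standing in for a proper one-dimensional optimizer (the grid and function names are illustrative):

import numpy as np

def gradient_step_targets(F, y):
    # Pseudo-response of (3): z_i = y_i * exp(-y_i F(x_i)), y_i in {-1, +1};
    # the direction f is a least-squares fit of z on x.
    return y * np.exp(-y * F)

def line_search_alpha(F, f, y, grid=None):
    # Pick alpha minimizing the empirical exponential loss along direction f.
    grid = np.linspace(0.0, 2.0, 201) if grid is None else grid
    losses = [np.mean(np.exp(-y * (F + a * f))) for a in grid]
    return float(grid[int(np.argmin(losses))])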
Applying this same technique within the LogitBoost scenario, that is, maximizing the directional derivative of the Bernoulli log-likelihood with a squared penalty on the direction f(x), the gradient step is

f(x) = arg min_f E[ ( f(x) − y e^{−yF(x)} / (e^{F(x)} + e^{−F(x)}) )² ].   (4)

After estimating the optimal step direction, the choice of the optimal step size, α, takes on an interesting form when we have a sample,

α̂ = arg max_α Σ_{i=1}^{N} [ 2y*_i (F(x_i) + αf(x_i)) − log(1 + exp(2(F(x_i) + αf(x_i)))) ],   (5)

where y*_i = (y_i + 1)/2. Equation (5) has the form of a likelihood for the linear logistic regression model with f(x_i) as a covariate and F(x_i) as an offset term. The optimal α then is just the maximum likelihood estimate of the coefficient of f(x_i), for which efficient software exists. Similar results hold for the class of exponential family regression models and the Cox model for survival analysis [Ridgeway (1999)].
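Because (5) is just a one-parameter logistic likelihood, any GLM routine with an offset maximizes it; the self-contained Newton iteration below does the same by hand (the iteration count is an arbitrary choice):

import numpy as np

def optimal_alpha(F, f, y01, n_steps=25):
    # Newton search for the alpha in (5): logistic regression of y* in {0, 1}
    # on the single covariate f(x_i), with F(x_i) entering as an offset.
    alpha = 0.0
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-2.0 * (F + alpha * f)))
        grad = np.sum(2.0 * f * (y01 - p))            # d loglik / d alpha
        hess = -np.sum(4.0 * f ** 2 * p * (1.0 - p))  # d2 loglik / d alpha2
        if hess == 0.0:
            break
        alpha -= grad / hess
    return alpha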
Both the authors’ results and this section arrive at the situation where we
need to propose a direction in which to move our current guess for the regres-
sion function. When handed only finite training datasets, one is faced with
estimating a direction that seems likely to offer a decrease in the loss func-
tion. Example 1 showed that high variance estimators of descent directions
(e.g., four-terminal node trees) can cause problems. In the past statisticians
have controlled the variance of predictors in other manners. As the authors
point out, the LogitBoost algorithm has a familiar feel to statisticians. Tradi-
tional statistical practice uses linear regression instead of trees at each stage
to estimate the conditional expectation. This yields the iteratively reweighted
least-squares algorithm for fitting linear logistic regression models. We can
also use smoothers to estimate the conditional expectation as in the local
scoring algorithm. These types of methods controlled variance at the expense
of bias reduction by limiting the functional form, restricting variable interac-
tions, or constraining the jaggedness of the curve.
When we introduce regression trees to estimate these gradient steps, instability becomes an issue. However, various forms of averaging may help to keep
variance in check. If we use bagged trees to estimate the gradient we can often
reduce the variance introduced on the iteration. This is the basis of Breiman’s
adaptive bagging [Breiman (1999)]. Denison (2000) proposes averaging over
all possible stumps at each iteration with each weighted by their posterior dis-
tribution. The amount of needed computation is nearly equivalent to CART
with the benefit of more stable gradient steps. It seems that the future of
nonparametric regression might consist of choosing an ideal loss function and
a sequence of low variance estimates of the gradient direction in which each
step decreases bias.
Bayesian prediction and function estimation. When I first learned
of the problem of boosting I attempted to understand it from a Bayesian
perspective. The redistribution of weights on observations, the model mix-
ing and the improved performance seemed to have a Bayesian flavor to it.
In fact, Freund and Schapire (1997) even offered a naïve Bayes justification for the AdaBoost algorithm. With the current results in hand it now
seems clear that the Bayesian interpretation of boosting is that it is simply
maximum likelihood! Boosting suggests moving the current function in some
optimal direction and assembles a path through function space toward a max-
imum likelihood estimate. The Bayesian wants to move the current function
in all directions at each iteration, eventually considering all paths through
the function space with each weighted by some posterior probability. As is
often the case, computation can impede a fully Bayesian solution. However,
the lessons learned from boosting could be eminently useful for Bayesian
prediction and function estimation problems.
In Bayesian prediction problems we want to find the predictive distribution of a y given its covariates, x, and a dataset containing previous observations,

Pr(y | x, D) = ∫ Pr(y | x, F) Pr(F | D) dF.                         (6)

The predictive distribution integrates over all members, F, in some class of functions. The crudest approximation simply assumes that Pr(F | D) has most of its mass in a small neighborhood of the maximum likelihood estimate of F, so that Pr(y | x, D) ≈ Pr(y | x, F̂). To make a point prediction one would compute some functional of Pr(y | x, F̂) (e.g., the mean, or the class with the largest probability). Recent efforts have proposed methods for sampling from Pr(F | D) in order to compute a Monte Carlo estimate of (6). Some of these ideas include MCMC on the space of trees [Denison, Mallick and Smith (1996), Chipman, George and McCulloch (1998)], on the space of additive models with a particular smoothness [Hastie and Tibshirani (1998)] and reversible jump MCMC on the space of piecewise constants [Heikkinen (1998)] or splines [DiMatteo, Genovese and Kass (1999)]. I think boosting may have a lot to offer the latter of these methods.
Heikkinen (1998) assumes that the regression function is a piecewise con-
stant where the number of pieces and the constants associated with those
pieces are random quantities. The simulation from Pr(F | D) proceeds by either
randomly combining two neighboring pieces, splitting a random piece into two,
or randomly perturbing the constant associated with one of the pieces. Among
the curses of MCMC methods is the problem of wasting iterations explor-
ing regions of the parameter space that are not particularly of interest or not
exploring the entire interesting region. Each boosting iteration computes some
kind of gradient in the function space indicating with respect to the current
state which neighboring functions are most interesting in some sense. From
the insight gained from boosting one could potentially produce more efficient
proposal mechanisms for exploring Pr(F | D); that is in the works.
Closing comments. This work melds some exciting work in both compu-
tational learning theory and statistics. The authors have elegantly connected
the AdaBoost algorithm with a novel interpretation of fitting additive logis-
tic regression. Researchers from a diverse set of fields are in search of bet-
ter methods for modeling datasets that are massive, noisy, complex and high
dimensional. Boosting, as a modular procedure for constructing such models,
offers an exciting paradigm for future prediction methods.
REFERENCES
Bauer, E. and Kohavi, R. (1999). An empirical comparison of voting classification algorithms:
bagging, boosting, and variants. Machine Learning 36 105–139.
Breiman, L. (1999). Using adaptive bagging to debias regressions. Technical Report 547, Dept.
Statistics, Univ. California, Berkeley.
Chipman, H.,George, E. and McCulloch, R. (1998). Bayesian CART model search (with dis-
cussion). J. Amer. Statist. Assoc. 93 935–960.
Denison, D. (2000). Boosting with Bayesian stumps. Technical report, Dept. Mathematics, Impe-
rial College.
Denison, D.,Mallick, B. and Smith, A. (1996). Bayesian CART. Technical report, Dept. Mathe-
matics, Imperial College.
DiMatteo, I.,Genovese, C. and Kass, R. (1999). Bayesian curve fitting with free-knot splines.
Technical report, Carnegie Mellon Univ.
Drucker, H. and Cortes, C. (1996). Boosting decision trees. In Proceedings of Neural Informa-
tion Processing 8 479–485. MIT Press.
Elkan, C. (1997). Boosting and naïve Bayes learning. Technical Report CS97-557, Univ. California, San Diego.
Freund, Y. and Schapire, R. (1997). A decision-theoretic generalization of on-line learning and
an application to boosting. J. Comput. System Sci. 55 119–139.
Hastie, T. and Tibshirani, R. (1998). Bayesian backfitting. Technical report, Stanford Univ.
Heikkinen, J. (1998). Curve and surface estimation using dynamic step functions. In Practical Nonparametric and Semiparametric Bayesian Statistics (D. Dey, P. Müller and D. Sinha, eds.) 255–272. Springer, New York.
Lee, J. (1999). A computer program that plays a hunch. New York Times August 17.
Quinlan, J. (1996). Bagging, boosting, and C4.5. In Proceedings Thirteenth American Associa-
tion for Artificial Intelligence National Conference on Artificial Intelligence 725–730.
AAAI Press, Menlo Park, CA.
Ridgeway, G. (1999). The state of boosting. In Computing Science and Statistics 31 (K. Berk and M. Pourahmadi, eds.) 172–181. Interface Foundation of North America, Fairfax, VA.
Department of Statistics
University of Washington
Seattle, Washington 98195-4322
E-mail: greg@stat.washington.edu
REJOINDER
Jerome Friedman, Trevor Hastie and Robert Tibshirani
Stanford University
We thank the discussants for their generous and thoughtful comments, and
the editors for arranging this discussion. Since the paper and discussions are
long, we can only reply to a few of the many important points raised by the
discussants.
1. Overfitting. All of the discussions raise the issue of overfitting in the
context of boosting. Breiman, and to a lesser extent, Freund and Schapire sug-
gest that our explanation is inadequate since it suggests that enough boosting
steps can lead to overfitting. Breiman cites empirical evidence suggesting that
“AdaBoost almost never overfits the data no matter how many iterations it is
run.” It would be a nice world if that were the case, but unfortunately it isn’t
quite. Besides the example in our paper, Ridgeway provides a simple exam-
ple that exhibits dramatic overfitting in Figure 2 of his discussion. Quoting
Ratsch, Onoda and Muller (2000), “Recent studies with highly noisy patterns
[Quinlan (1996), Grove and Schuurmans (1998), Ratsch (1998)] showed that
it is clearly a myth that Boosting methods do not overfit.” Still, it does appear
that boosting is more resistant to overfitting than one might expect based
on general modeling experience. Buja suggests that the suboptimal greedy
“stagewise” nature of the fitting procedure may provide some resistance. This
could be the case, and merits further investigation. However, we agree with Bühlmann and Yu that it is largely the nature of zero–one loss that provides general resistance to overfitting as judged in classification.
We illustrate this with a simple type of “nearest-neighbor” procedure. Consider a fixed set of data points {x_i}_{1}^{N}. A training data set consists of {y_i, x_i}_{1}^{N}, where each y_i ∈ {−1, 1} is a random variable with p_i = Pr(y_i = 1). Given a training set, our estimate for each p_i is p̂_i = (1 + a y_i)/2, 0 ≤ a ≤ 1. Here a is a parameter that controls the degree of fit to the training data. Larger values provide a closer fit; the value a = 1 interpolates the data, whereas a = 0
estimates p̂_i = 1/2 irrespective of the data. We consider five loss functions L(y, F): exponential, likelihood, squared error, absolute error and zero–one loss, all defined in Table 1.

Table 1

Criterion       L(y, F)             R̂                    R                                                  a*
exponential     exp(−yF)            ((1−a)/(1+a))^{1/2}  (1−ē)((1−a)/(1+a))^{1/2} + ē((1+a)/(1−a))^{1/2}   1 − 2ē
likelihood      log(1+exp(−2yF))    log(2/(1+a))         (1−ē) log(2/(1+a)) + ē log(2/(1−a))               1 − 2ē
squared error   (y − F)²            (1−a)²               (1−a)² + 4aē                                      1 − 2ē
absolute error  |y − F|             1 − a                1 − (1−2ē)a                                       1
zero–one loss   1[y ≠ sign(F)]      0                    ē                                                 any a > 0
For each loss function, F is defined to be the population minimizer,

arg min_F E L(y, F) = arg min_F [ p L(1, F) + (1 − p) L(−1, F) ].

For the exponential and likelihood criteria this is F = (1/2) log(p / (1 − p)); for squared and absolute error, F = 2p − 1; and for zero–one loss, F = sign(2p − 1).
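For the exponential criterion the stated minimizer follows from a one-line calculation, sketched here in the notation above:

\frac{\partial}{\partial F}\, E\left[e^{-yF} \mid x\right]
  = \frac{\partial}{\partial F}\left( p\,e^{-F} + (1-p)\,e^{F} \right)
  = -p\,e^{-F} + (1-p)\,e^{F} = 0
\;\Longrightarrow\; e^{2F} = \frac{p}{1-p}
\;\Longrightarrow\; F = \frac{1}{2}\log\frac{p}{1-p}.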
Table 1 shows the average loss on the training data,

R̂ = (1/N) Σ_{i=1}^{N} L(y_i, F̂_i),

and the corresponding average population risk,

R = (1/N) Σ_{i=1}^{N} E L(y*_i, F̂_i).

The latter expectation is with respect to the training y_i (which produces F̂_i) and the test observation y*_i, both drawn independently with probability p_i.
The F̂_i are obtained by plugging p̂_i into the expressions for F. For the exponential and likelihood criteria,

F̂_i = (1/2) log( p̂_i / (1 − p̂_i) ) = (1/2) log( (1 + a y_i) / (1 − a y_i) ),

whereas for the squared and absolute error criteria,

F̂_i = a y_i.
The risk values are reported in terms of the parameter a and

ē = (1/N) Σ_{i=1}^{N} E 1[ y*_i ≠ sign(F̂_i) ] = (2/N) Σ_{i=1}^{N} p_i (1 − p_i),
which is just the average risk based on zero–one loss (and is twice the Bayes risk). This error rate does not depend on the value of a, since it is independent of the magnitude of F̂_i, depending only on its sign. Thus, overfitting as measured by zero–one loss does not exist for this procedure. Interpolating the data by setting a = 1 does not degrade zero–one loss. Also shown in Table 1 is the parameter value a* that minimizes the population risk for each respective criterion.
As seen from Table 1, both the average training set loss and the population risk for the four criteria other than zero–one loss do depend on the magnitude of F̂_i, and thus depend strongly on how closely the training data are fit, as regulated by the value of a. The average training set loss R̂ is monotone decreasing with increasing values of a. Starting at a = 0, the population risk R for all four criteria initially decreases. For the first three, a minimum is reached at the value a* = 1 − 2ē. Increasing the value of a beyond that point monotonically increases population risk, symptomatic of classic overfitting. The degree of overfitting depends on the criterion. Criteria listed higher in Table 1 report more overfitting. For absolute error there is no overfitting; interpolating the training data here minimizes population risk for this criterion.
Figure 1 of this Rejoinder shows both training error (lowest, green curve) and population risk for ē = 0.1 (middle, red curve) and ē = 0.2 (highest, blue curve), as a function of 1/(1 − a), for the loss criteria shown in Table 1. All curves are normalized to have unit risk at a = 0. Note the difference in vertical scale between the lower and upper plots. The first three criteria exhibit classic overfitting. The degree of this overfit increases with increasing intrinsic noise as reflected by the value of ē. For exponential and likelihood loss the expected test error approaches infinity as the value of a approaches one (data interpolation), the latter much more slowly. Both squared and absolute loss approach finite values of 4ē and 2ē, respectively. Except for the discontinuity at a = 0 (where the risk is that of a coin flip, R = 1/2, normalized to be 1), the population zero–one risk curves drop immediately to ē. When the abscissa is plotted on a linear rather than logarithmic scale, the plot for exponential loss (upper left panel) closely resembles that displayed in Figure 3 (lower panel) of the Bühlmann and Yu discussion.
It is important to note that the extent to which overfitting is observed here depends only on the loss criterion used to measure degree-of-fit. For a given training data set, equal values of the parameter a produce identical models for the probabilities p̂_i. The judgment as to whether, or how much, a particular model overfits the training data is determined solely by the criterion used to define “fit,” and not by the training procedure used to obtain the model. Exponential loss reports dramatic overfitting, likelihood substantial overfitting, squared error loss moderate overfitting, and absolute and zero–one loss no overfitting at all, on exactly the same sequence of models.
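The closed forms in Table 1 are easy to evaluate; a short sketch that reproduces the risk numbers behind Figure 1 (the dictionary keys are just labels chosen here):

import numpy as np

def table1_risks(a, ebar):
    # (training risk Rhat, population risk R) from Table 1, for 0 <= a < 1.
    s = np.sqrt((1 - a) / (1 + a))
    return {
        "exponential": (s, (1 - ebar) * s + ebar / s),
        "likelihood":  (np.log(2 / (1 + a)),
                        (1 - ebar) * np.log(2 / (1 + a)) + ebar * np.log(2 / (1 - a))),
        "squared":     ((1 - a) ** 2, (1 - a) ** 2 + 4 * a * ebar),
        "absolute":    (1 - a, 1 - (1 - 2 * ebar) * a),
        "zero-one":    (0.0, ebar),
    }

# e.g. at the risk-minimizing a* = 1 - 2*ebar for ebar = 0.1:
for name, (rhat, r) in table1_risks(a=0.8, ebar=0.1).items():
    print(name, round(float(rhat), 3), round(float(r), 3))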
Of course, this example is highly idealized so as to permit exact calcula-
tions of risk. In other situations absolute and zero–one loss may exhibit some
degree of overfitting. The assumption that the x-values are fixed prevents
zero–one loss from reporting overfitting for this procedure. In actual practice,
the x-values themselves are random variables, and fitting the data too closely can result in some incorrect estimates of the sign of F, thereby degrading population zero–one risk. However, the analysis above suggests that overfitting as judged by absolute and especially zero–one loss should be less severe than when judged by exponential, likelihood or squared error loss. Also, there is no reason to expect that the degree-of-fit to the training data that results in minimum population risk will be the same for the former two measures as with the latter three. This is illustrated in Figure 3 of Bühlmann and Yu’s discussion.

Fig. 1. Training and population risk curves for the five loss functions in Table 1, as a function of 1/(1 − a). The green curves are training risk. The red curves are population risk for ē = 0.1, and the blue curves are for ē = 0.2. All curves are normalized to have unit risk at a = 0.
This analysis has no bearing on the relative merits of any of these criteria as
implementing loss functions for boosting, or any other procedure. As pointed
out by B ¨
uhlmann and Yu, it is often the case that the best criterion to drive
a search strategy is not the criterion used to judge the value of the resulting
solution. However, the phenomenon of “overfitting” is largely dictated by the
latter.
In the field of statistics most of our experience has been derived from like-
lihood and squared error loss. As Breiman points out “adding more and more
variables to a logistic regression, increasing the training likelihood at each
step, usually results in overfitting at some point.” This is surely the case
when overfitting is judged by the likelihood criterion. It can sometimes be
the case when judged by zero–one loss. However, the simple example pre-
sented here, as well as considerable empirical evidence elsewhere, suggest
that zero–one loss, being less sensitive to variations in the estimates of the
underlying probabilities (they depend only on the sign of 2 ˆpi1), exhibit
considerably less overfitting than other loss criteria that depend more heav-
ily on the detailed probability estimates. This will be the case whether those
estimates are obtained by AdaBoost, LogitBoost or any other procedure.
2. Regression. Most of the discussions consider the extension of boosting
to regression. Buja and Bühlmann and Yu, as well as Ridgeway (1999), observe that our view of boosting leads to natural extensions to regression based on squared error loss. This is discussed in Section 10 of the paper. It is explored in more detail in Friedman (1999a), where gradient-based boosting is extended to arbitrary differentiable loss criteria. As noted in Friedman (1999a), shrinkage is an important ingredient to the success of boosting in the regression context.
At each step m of the iterative procedure, the current approximation F_{m−1}(x) is updated by

F_m(x) = F_{m−1}(x) + ν · f_m(x).                                   (7)

Here f_m(x) is the update computed in the usual manner by the boosting algorithm at the mth step, and ν ∈ (0, 1] is a parameter whose value controls the amount of shrinkage.
shrinkage dramatically improves the accuracy of the target function estimate
when measured by likelihood, squared error or absolute error. Less dramatic
but significant gains were also observed for zero–one loss, where the quality
of the function estimate (as judged by the other criteria) is less important. In
all cases, accuracy improved monotonically with decreasing value of ν (more shrinkage), provided that the number of iterations M was correspondingly increased so as to reach minimum test error.
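The shrinkage device itself is a one-line change to the boosting loop. A sketch for the squared-error regression setting, where fit_base stands in for any weak learner returning a callable fit to the current residuals (the name is a placeholder, not an API from the paper):

import numpy as np

def boost_with_shrinkage(x, y, fit_base, M=2000, nu=0.1):
    # Gradient boosting outer loop with shrinkage (7):
    # F_m(x) = F_{m-1}(x) + nu * f_m(x); smaller nu calls for larger M.
    F, terms = np.zeros_like(y, dtype=float), []
    for _ in range(M):
        f = fit_base(x, y - F)       # weak learner on current residuals
        F += nu * f(x)
        terms.append(f)
    return lambda z: nu * sum(f(z) for f in terms)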
Breiman and Bühlmann and Yu suggest that incorporating a stochastic component into boosting algorithms can substantially improve performance. Bühlmann and Yu propose “bag-boosting” which, as Ridgeway points out, is the essence of Breiman’s (1999) “adaptive bagging” procedure. The base learner is replaced by the corresponding bagged learner at each boosting step. This can be viewed as an alternative to shrinkage for reducing the learning rate. The individual bagging substeps can be regarded as randomized boosting steps without updating the approximation. Friedman (1999b) investigated combining randomization with shrinkage in an integrated manner. The approximation is slowly but continuously updated at each randomized boost using a slow learning rate, rather than alternating between a large number of steps without updating (bagging), followed by a single step with a full update (boost). Combining randomization with shrinkage in this manner provides a modest but significant additional improvement over the already large gain produced by shrinkage alone. However, there is a substantial computational advantage in employing what Bühlmann and Yu call “subbagging.” Friedman and Hall (1999) discuss the equivalence of bagging based on bootstrapping and half-sampling without replacement, and show that different subsample sizes can often lead to better performance.
3. Interpretations of boosting. For their invention of boosting, Freund
and Schapire deserve a great deal of praise. It has led to both practical learn-
ing algorithms and much intellectual study and debate. As Buja so poetically
states, “There is no single ‘true’ interpretation of anything; interpretation is a
vehicle in the service of human comprehension. The proof of an interpretation
is in enabling others to fruitfully think about an idea.” We hope that our work
has helped serve that goal. Freund and Schapire and Breiman present two
additional interpretations, different from ours and each other. Ridgeway con-
siders yet another view along Bayesian lines. Having multiple interpretations
can only increase the chance of obtaining new insights.
We don’t see a great deal of contradiction between our view and that of
Freund and Schapire. “Provable bounds” can be derived for LogitBoost (Nigel
Duffy and David Helmbold, private communication). Also, we don’t see how
our “assumptions” are necessarily stronger (or weaker) than those motivating
their interpretation. It is hard to imagine a binary random variable that is not
Bernoulli distributed. We do not assume “a log-linear relationship between
the features and the labels.” The function class describing this relationship
is determined by the weak learner employed. AdaBoost, LogitBoost, or any
other boosting method using the same learner is making similar assumptions
concerning the label–feature relationship. As was pointed out in the paper,
and amplified by Bühlmann and Yu and Buja, “The question is not whether
but how much a weak learner can be boosted, or, in statistical terminology,
how well an additive model can approximate an underlying response surface.”
The underlying response surface is the unknown label–feature relationship.
The weak learner provides the additive components. Different weak learners
will perform differently with different response surfaces.
Freund and Mason’s (1999) “alternating decision tree” approach is quite
interesting. It shares many common elements with MARS [Friedman (1991)],
substituting exponential loss for squared error loss, and using zero-order
rather than first-order splines. Both of these modifications may be beneficial
in the classification context.
Freund’s (1999) suggestion of removing outliers is a good one, but not “rad-
ical.” This approach is at the center of the field of robust statistics. The source
of AdaBoost’s lack of robustness is its use of exponential loss as the implement-
ing loss function, which increases exponentially with large negative margin.
Negative log-likelihood increases only linearly with large negative margin,
thereby being much more robust. Still, sometimes performance can be further
improved a little by completely removing the influence of observations with
very large negative margins. This is accomplished as a beneficial side effect of
the weight trimming strategy described in Section 9 of the paper, when used
with LogitBoost. The LogitBoost weights are wi=pxi1pxi. Obser-
vations with the smallest weights at the current iteration are removed from
consideration for that iteration. Observations with large negative margin have
either pxi0 (for yi=1) or pxi1 (for yi=−1), thereby being deleted.
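A sketch of the trimming rule, assuming current probability estimates p_i are in hand; the 90% weight fraction is an illustrative value, not necessarily the one used in the paper’s experiments.

import numpy as np

def trim_indices(p, weight_frac=0.9):
    # LogitBoost weight trimming: keep the observations carrying the top
    # 'weight_frac' share of total weight w_i = p_i (1 - p_i). Points with
    # p_i near 0 or 1 (including those with very large negative margins)
    # carry tiny weight and are dropped for the current iteration.
    w = p * (1 - p)
    order = np.argsort(w)[::-1]                       # heaviest first
    keep = order[np.cumsum(w[order]) <= weight_frac * w.sum()]
    return np.sort(keep)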
Breiman views boosting as an “ensemble” method for combining classifiers.
He seeks to incorporate boosting and bagging into a common framework based
on “random forests.” This is an intriguing concept. However, it seems to us
that boosting operates differently than random forests. From the description,
a random forest appears to be an ensemble of learners that are i.i.d. given the data. Hence the bias of the ensemble is the same as that of each individual learner. By contrast, in boosting the individual members are not independent given the data: the construction of each depends explicitly on those members that have been previously entered. The bias of the ensemble is typically much less than that of each constituent learner. Boosting turns a weak learner into a strong ensemble, whereas random forests combine strong learners in an i.i.d. fashion in order to reduce variance.
Viewing boosting as stagewise additive regression does however suggest
Breiman’s conclusion that “if boosting succeeds, it is because it gives low cor-
relation between classifiers of reasonable strength.” In stagewise regression
“each step decouples itself from the previous step” leading to “low correla-
tion between pairs of classifiers.” In general, a greedy search strategy will
not successively add highly correlated terms, since that would not produce
very much local reduction in the criterion being minimized. Also, there will
be a tendency to add terms of “reasonable strength,” since that will tend to
reduce the criterion. Thus, one could conclude that AdaBoost owes its success
to these particular aspects of stagewise additive regression. This is consistent
with Breiman’s explanation, although motivated somewhat differently.
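For concreteness, the squared-error case makes this decoupling quantitative (a sketch in our notation; AdaBoost’s exponential criterion behaves analogously). With current fit $F_m$ and residual $r = y - F_m$, entering a candidate term $f$ with its optimal coefficient reduces the criterion by
$$\|r\|^2 - \min_{\beta}\|r - \beta f\|^2 \;=\; \frac{\langle r, f\rangle^2}{\|f\|^2}.$$
Since the coefficient of the term just entered was chosen to leave the residual uncorrelated with it, any candidate highly correlated with that term has small $\langle r, f\rangle$ and offers little reduction, while a candidate of reasonable strength aligned with the residual offers a large one.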
Unfortunately, this conclusion appears to be in conflict with one empirical fact,
namely, the success of shrinking (1) in increasing accuracy with all boosting
methods, including AdaBoost [Friedman (1999a)].
Shrinking strongly inhibits the natural tendency of greedy search strategies
to add low correlation terms of reasonable strength. For smaller values
of the shrinkage parameter $\nu$, the strength of each added term $\nu \cdot f_m(x)$
becomes smaller, and each term becomes more highly correlated with recently entered terms. Yet,
the evidence so far indicates that the smaller the value of ν, the higher the
overall accuracy, as long as there are enough iterations.
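In code, shrinkage amounts to a one-line change in the stagewise loop. Continuing the simulation sketch above (the values of $\nu$ and $M$ are illustrative, not recommendations):

    nu, M = 0.1, 1000                  # smaller nu typically requires more iterations
    fit = np.zeros_like(y)
    for _ in range(M):
        h = fit_stump(x, y - fit)      # greedy stagewise step on the current residuals
        fit += nu * h(x)               # enter only the fraction nu of each term f_m(x)

Each damped step leaves most of the residual in place, so the next term is fit to nearly the same target as its predecessor and tends to correlate with it, which is exactly the behavior described above.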
The source of this mystery is under investigation. It is but one of the many
aspects of this general class of procedures that are not yet well understood.
Whether these procedures are referred to as “boosting,” or more prosaically
as “stagewise additive regression,” there are still many problems left to solve
that can challenge us for some time to come.
REFERENCES
Breiman, L. (1999). Using adaptive bagging to debias regressions. Technical Report 547, Dept.
Statistics, Univ. California, Berkeley.
Freund, Y. (1999). An adaptive version of the boost by majority algorithm. In Proceedings of the
Twelfth Annual Conference on Computational Learning Theory.
Freund, Y. and Mason, L. (1999). The alternating decision tree learning algorithm. In Machine
Learning: Proceedings of the Sixteenth International Conference 124–133.
Friedman, J. H. (1991). Multivariate adaptive regression splines (with discussion). Ann. Statist.
19 1–141.
Friedman, J. H. (1999a). Greedy function approximation: a gradient boosting machine. Ann.
Statist. To appear.
Friedman, J. H. (1999b). Stochastic gradient boosting. Technical report, Dept. Statistics, Stanford
Univ.
Friedman, J. H. and Hall, P. (1999). On bagging and nonlinear estimation. J. Comput. Graph.
Statist. To appear.
Grove, A. and Schuurmans, D. (1998). Boosting in the limit: maximizing the margin of
learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial
Intelligence.
Quinlan, J. (1996). Boosting first order learning. In Proceedings of the Seventh International
Workshop on Algorithmic Learning Theory (S. Arikawa and A. Sharma, eds.) Lecture
Notes in Artificial Intelligence 1160 143–155. Springer, Berlin.
Rätsch, G. (1998). Ensemble learning methods for classification. Masters thesis, Dept. Computer
Science, Univ. Potsdam.
Rätsch, G., Onoda, T. and Müller, K. R. (2000). Soft margins for AdaBoost. Machine Learning
1–35.
Ridgeway, G. (1999). The state of boosting. In Proceedings of the Thirty-first Symposium on the
Interface 172–181.
Department of Statistics
Sequoia Hall
370 Serra Mall
Stanford University
Stanford, California 94305-4065
E-mail: hastie@stat.stanford.edu