Aggregating Regression Procedures for a Better Performance
Yuhong Yang
Department of Statistics
Iowa State University
yyang@iastate.edu
December, 1999
Abstract

Methods have been proposed to linearly combine candidate regression procedures to improve estimation accuracy. Applications of these methods in many examples are very successful, pointing to the great potential of combining procedures. A fundamental question regarding combining procedures is: What is the potential gain and how much does one need to pay for it?

A partial answer to this question is obtained by Juditsky and Nemirovski (1996) for the case when a large number of procedures are to be combined. We attempt to give a more general solution. Under an $l_1$ constraint on the linear coefficients, we show that for pursuing the best linear combination over $n^{\alpha}$ procedures, in terms of rate of convergence under the squared $L_2$ loss, one can pay a price of order $O\left(\log n/n^{1-\alpha}\right)$ when $0 < \alpha \le 1/2$ and a price of order $O\left((\log n/n)^{1/2}\right)$ when $1/2 \le \alpha < 1$. These rates cannot be improved or essentially improved in a uniform sense. This result suggests that one should be cautious in pursuing the best linear combination, because one may end up paying a high price for nothing when linear combination in fact does not help. We show that with care in aggregation, the final procedure can automatically avoid paying the high price for such a case and then behaves as well as the best candidate procedure in terms of rate of convergence.

Keywords and phrases: Aggregating procedures, adaptive estimation, linear combining, nonparametric regression.
1 Introduction

Recently, new ideas on combining different procedures for estimation, coding, forecasting, or learning have been considered in statistics and several related fields, resulting in a number of very interesting results. The common theme behind these works is to automatically share the strength of the individual procedures in some sense. In the context of machine learning, it has been shown that with an appropriate weighting method, a combined procedure can behave close to the best procedure in terms of a certain cumulative loss; see, e.g., Vovk (1990), Littlestone and Warmuth (1994), Cesa-Bianchi et al. (1997), and Cesa-Bianchi and Lugosi (1999). The focus has been on deriving mixed strategies with optimal performance without any probabilistic assumptions at all on the generation of the data. In the field of forecasting, combined forecasts have been shown to work better in various examples; see, e.g., Clemen (1989) for a review of work in that direction. In information theory, the study of universal coding in the spirit of adaptation has resulted in very interesting and powerful techniques also useful in other related fields such as machine learning and statistics. See Merhav and Feder (1998) and Barron, Rissanen and Yu (1998) for reviews of work in that field. In statistics, several methods have recently been proposed to linearly combine regression estimators. They include a model-selection-criterion-based method by Buckland et al. (1995), cross-validation based "stacking" by Wolpert (1992) and Breiman (1996) (an earlier version is in Stone (1974)), a bootstrap based method by LeBlanc and Tibshirani (1996), a stochastic approximation based method by Juditsky and Nemirovski (1996), and information-theoretic methods to combine density and regression estimators by Yang (1996, 1998, 1999b, 1999c) and Catoni (1997) for density estimation. Juditsky and Nemirovski proposed algorithms and derived interesting theoretical upper and lower bounds for linear aggregation in pursuing the best performance among the linearly combined estimators (with coefficients subject to an appropriate constraint). Yang (1998, 1999c) shows that with proper weighting, a combined procedure has a risk bounded above by a multiple of the smallest risk over the original procedures plus a small penalty.

The above mentioned theoretical work in statistics is in two related but different directions: one aiming at automatically achieving the best possible performance among the given collection of candidate procedures, and the other aiming at improving the performance of the original procedures. For the latter, the hope is that an aggregated procedure (through a convex or linear combination of the original procedures with data-dependent coefficients) will significantly outperform each individual candidate procedure. Clearly the second direction is more aggressive. If one could identify the best linearly combined procedure, pursuing the best performance among the candidate procedures would be too conservative. On the other hand, common sense suggests that if one asks for more, one needs to pay more. The present paper intends to contribute to the theoretical understanding of the gain and price for pursuing the best linear combination.
Suppose that we have $M$ candidate regression procedures and consider the squared $L_2$ risk as a performance measure in estimating the regression function. In Yang (1998, 1999c) it is shown that a suitable data-dependent convex combination of these procedures results in an estimator that (under a minor condition) has a risk within a multiple of the smallest risk among the candidate procedures plus a small penalty of order $(\log M)/n$. Thus in terms of rate of convergence, with $M$ candidate procedures to be combined, one only needs to pay a price basically of order $(\log M)/n$ for performing nearly as well as the best candidate procedure (which, of course, is unknown to the statistician). As long as $M$ does not increase exponentially fast in $n$, the discrepancy $(\log M)/n$ is of order $\log n/n$, which does not affect the rate of convergence for typical nonparametric regression. As a consequence, when polynomially many nonparametric procedures are suitably combined, the estimator automatically converges at the best rate offered by the individual procedures. For the more aggressive goal of pursuing the best linear combination of the candidate procedures, under the constraint that the $l_1$ norm of the linear coefficients is bounded above by 1, Juditsky and Nemirovski (1996) proposed algorithms and showed that with $M$ estimators to be combined, the aggregated estimator has a risk within a multiple of $\sqrt{(\log M)/n}$ of the smallest risk over all the linear combinations of the estimators. Furthermore, they show that, in general, this order $\sqrt{(\log M)/n}$ cannot be overcome uniformly by any combining method. Thus compared to combining for attaining the best performance, one has to pay a much higher price, $\sqrt{(\log M)/n}$, for searching for the best linear combination of the original procedures.
The work of Juditsky and Nemirovski (1996) is targeted at the case when $M$ is large (e.g., their results are applied to restore Barron's class with $M$ of a polynomial order in $n$). They derived the above mentioned lower bound when $M$ and $n$ have the relationship $C_1 \log M \le n \le C_2 M \log M$ (where the constants $C_1$ and $C_2$ depend on the variance of the error and the assumed known upper bound on the supremum norm of the regression function $f$). The relationship implies that $M$ is at least of order $n/\log(n)$. It is unclear then what happens when $M$ is of a smaller order. For such a case, the order $\sqrt{(\log M)/n}$ may no longer be a valid lower bound. In the extreme case with $M$ fixed ($M$ does not grow as $n \to \infty$), one would expect a penalty of order close to the parametric rate $1/n$ instead of order $n^{-1/2}$.
In this paper, we show that when $M$ is of order $n^{\alpha}$, one only needs to pay a price of order $\log n/n^{1-\alpha}$ for $0 < \alpha \le 1/2$. This rate cannot be improved uniformly beyond a logarithmic factor. Note that the order of the price increases dramatically as $\alpha$ increases from 0, but after $\alpha \ge 1/2$, it stays at the rate $\sqrt{(\log n)/n}$ as long as $\alpha < 1$. This phenomenon is closely related to the advantage of sparse approximations as observed in wavelet estimation (see, e.g., Donoho and Johnstone (1998)), neural networks and subset selection (see, e.g., Barron (1994), Yang and Barron (1998), Yang (1999a), and Barron, Birge and Massart (1999)). Under the $l_1$ constraint on the linear coefficients, when $\alpha > 1/2$, there cannot be too many (relative to $M$) large coefficients, and combining sparsely selected procedures with suitably large coefficients achieves the optimal performance.
In applications, one does not know if the best linear combination can substantially improve the estimation accuracy so that the high price of order, e.g., $(\log n)/n^{1/2}$ is justified. Accordingly, it is not clear which direction to go when combining the candidate procedures. We show that, fortunately, with some care in combining, an estimator can be aggressive and conservative automatically in the right way. For convenience in discussion, we will call the conservative goal combining for adaptation, and the aggressive goal combining for improvement.

The paper is organized as follows. In Section 2, we derive general risk bounds for combining $M$ procedures. In Section 3, we study a combined procedure suitable for different purposes at the same time. In Section 4, we give an illustration using linear and sparse approximations. We briefly mention a generalization of the main results in Section 5. In Section 6, a basic combining algorithm and its property are presented, which provides a tool for the main results in this paper. The proofs of the results are in Section 7.
2 Risk bounds on linear aggregation

Consider the regression model $Y_i = f(X_i) + \sigma \varepsilon_i$, $i = 1, \ldots, n$, where $(X_i, Y_i)_{i=1}^{n}$ are i.i.d. copies from the joint distribution of $(X, Y)$ with $Y = f(X) + \sigma\varepsilon$. The explanatory variable $X$ (which could be high-dimensional) has an unknown distribution $P_X$. The variance parameter $\sigma > 0$ is unknown, and the random variable $\varepsilon$ is assumed to have a known density function $h(x)$ (with respect to Lebesgue or a general measure $\mu$) with mean 0 and variance 1. The goal is to estimate the regression function $f$ based on the data $Z^n = (X_i, Y_i)_{i=1}^{n}$.
Let $\delta$ be a regression estimation procedure producing estimator $\hat{f}_i(x) = \hat{f}_i(x; Z^i)$ for each $i \ge 1$. Let $\|\cdot\|$ denote the $L_2$ norm with respect to the distribution of $X$, i.e., $\|g\| = \sqrt{\int g^2(x) P_X(dx)}$. Let $R(f; n; \delta) = E\|f - \hat{f}_n\|^2$ denote the risk of the procedure $\delta$ at the sample size $n$ under the squared $L_2$ loss.
Let $\Delta = \{\delta_1, \delta_2, \ldots, \delta_M\}$ denote a collection of candidate procedures to be aggregated. Let $\hat{f}_{j,i}(x) = \hat{f}_{j,i}(x; Z^i)$ denote the estimator of $f$ based on procedure $\delta_j$ given the observations $Z^i$ for $i \ge 1$. Assume $M = M_n$ changes according to the sample size $n$. In particular, we will consider the case when $M = Cn^{\alpha}$ for some $0 < \alpha < 1$. When the sample size increases, one is allowed to consider more candidate procedures (possibly more and more complicated).
As in Juditsky and Nemirovski (1996), the coefficients for the linear combination are suitably constrained. Let
$$\mathcal{F}_n = \Big\{ \sum_{1 \le j \le M} \theta_j \hat{f}_{j,n}(x) : \sum_{1 \le j \le M} |\theta_j| \le 1 \Big\}$$
be the collection of linear combinations of the original estimators in $\Delta$ with coefficients summing up to no more than 1 in absolute values. The hope behind the consideration of linear aggregation is that a certain combination of the original estimators might have a much better performance than the individual ones. Advantages of such combining have been empirically demonstrated in several related fields (e.g., Bates and Granger (1969), Breiman (1996)). Let $\|\cdot\|_1^M$ denote the $l_1$ norm on $R^M$, i.e., $\|\theta\|_1^M = \sum_{1 \le j \le M} |\theta_j|$. Define
$$R(f; n; \Delta) = \inf_{\|\theta\|_1^M \le 1} E \Big\| f - \sum_{1 \le j \le M} \theta_j \hat{f}_{j,n} \Big\|^2.$$
It is the smallest risk over all the estimators in the linear aggregation class $\mathcal{F}_n$. Obviously, $R(f; n; \Delta) \le \inf_{1 \le j \le M_n} R(f; n; \delta_j)$. In this paper, unless stated otherwise, by linear combination we mean linear combination with the coefficients satisfying the above $l_1$ constraint.
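To make the objects $\mathcal{F}_n$ and $R(f; n; \Delta)$ concrete, the following is a minimal Python sketch (not part of the paper's construction) that forms an $l_1$-constrained linear combination of candidate fitted values and evaluates its empirical squared loss; the candidate fits, the coefficient vector, and the toy data are placeholders introduced here for illustration only.

```python
import numpy as np

def l1_project(theta):
    """Rescale theta so that its l_1 norm is at most 1 (the constraint defining F_n)."""
    s = np.sum(np.abs(theta))
    return theta if s <= 1.0 else theta / s

def combined_fit(fits, theta):
    """Linear combination sum_j theta_j * fhat_j evaluated at the design points.

    fits:  (M, n) array, row j holds fhat_j evaluated at the n design points.
    theta: (M,) coefficient vector with sum_j |theta_j| <= 1.
    """
    return theta @ fits

def empirical_sq_loss(f_true, fits, theta):
    """Empirical counterpart of ||f - sum_j theta_j fhat_j||^2."""
    resid = f_true - combined_fit(fits, theta)
    return np.mean(resid ** 2)

# toy usage with placeholder candidate estimators
rng = np.random.default_rng(0)
n, M = 200, 5
x = rng.uniform(size=n)
f_true = np.sin(2 * np.pi * x)
# hypothetical candidate fits (stand-ins for procedures delta_1, ..., delta_M)
fits = np.vstack([np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n) for _ in range(M)])
theta = l1_project(rng.normal(size=M))
print(empirical_sq_loss(f_true, fits, theta))
```

In the theory above, the infimum of this squared loss (in risk form) over all such coefficient vectors is what defines $R(f; n; \Delta)$; the aggregation procedures studied below approximate that infimum from data.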
We need the following assumptions for our results.

A1. The regression function $f(x)$ is uniformly bounded, i.e., $\|f\|_{\infty} \le A < \infty$. The variance parameter $\sigma$ is bounded above and below by known positive constants $\overline{\sigma} < \infty$ and $\underline{\sigma} > 0$.

A2. The error distribution $h$ has a finite fourth moment and satisfies the following: for each pair $0 < s_0 < 1$ and $T > 0$, there exists a constant $B$ (depending on $s_0$ and $T$) such that
$$\int h(x) \log \frac{h(x)}{\frac{1}{s}\, h\!\left(\frac{x - t}{s}\right)} \, dx \le B \left( (1 - s)^2 + t^2 \right)$$
for all $s_0 \le s \le s_0^{-1}$ and $-T < t < T$.

The constants $A$ and $B$ in the above assumptions are involved in the derivation of the risk bounds, but they need not be known to carry out our aggregation procedure. Assumption A2 is mild and is satisfied by Gaussian, double-exponential, and many other smooth distributions.

An algorithm, named ARM in Yang (1999c), to combine procedures for adaptation is given in Section 6. This algorithm serves as a building block for the results in this paper. Through a suitable discretization of the linear coefficients together with a sparse approximation, the problem of combining for improvement becomes the problem of combining for adaptation over a (much) larger class of procedures. We have the following performance upper bound.
Theorem 1: Assume that Conditions A1 and A2 are satisfied. For any given collection of estimation procedures $\Delta = \{\delta_j,\ 1 \le j \le M_n\}$, we can construct a combined procedure $\delta$ such that
$$R(f; n; \delta) \le C \cdot \begin{cases} R\left(f; \frac{n}{2}; \Delta\right) + \dfrac{M_n \log(1 + n/M_n)}{n} & \text{when } M_n < \sqrt{n}, \\[6pt] R\left(f; \frac{n}{4}; \Delta\right) + \dfrac{\log M_n}{\sqrt{n \log n}} & \text{when } M_n \ge \sqrt{n}, \end{cases}$$
where $C$ is a constant depending on $A$, $\overline{\sigma}$, $\underline{\sigma}$, and $h$. In particular, if $M_n \le C_0 n^{\alpha}$ for some $\alpha > 0$ and $C_0 > 0$, then
$$R(f; n; \delta) \le C' \cdot \begin{cases} R\left(f; \frac{n}{4}; \Delta\right) + \left(\dfrac{\log n}{n}\right)^{1/2} & \text{when } 1/2 \le \alpha < 1, \\[6pt] R\left(f; \frac{n}{2}; \Delta\right) + \dfrac{\log n}{n^{1-\alpha}} & \text{when } 0 < \alpha \le 1/2, \end{cases} \qquad (1)$$
where the constant $C'$ depends on $A$, $\overline{\sigma}$, $\underline{\sigma}$, $C_0$, and $h$.
Remark: The condition on $\sigma$ in Assumption A1 is mainly technical (it is not really needed to perform the procedure). The lower bound condition on $\sigma$ is not essential even from a technical point of view, since one can always add a little bit of noise to the observations to satisfy the condition, usually without affecting the rate of convergence.
The constructed procedure $\delta$ is given in the proof of Theorem 1 in Section 7. Note that for both parametric and nonparametric regression, for a good procedure $\delta$, $R(f; n; \delta)$ and $R(f; n/2; \delta)$ are usually of the same order. Thus it is typically the case that $R(f; n; \Delta)$ and $R\left(f; \frac{n}{2}; \Delta\right)$ converge at the same rate. From the result, when $\alpha \ge 1/2$, the penalty term for pursuing the best linear combination of $n^{\alpha}$ procedures is of order $((\log n)/n)^{1/2}$ (independent of $\alpha$). This rate is obtained by Juditsky and Nemirovski (1996) with a weaker assumption on the errors (finite variance), but requiring the knowledge of $A$. When $\alpha < 1/2$, our result above shows that the penalty is smaller in order, resulting in a possibly much faster rate of convergence. For an extreme example, when $M_n$ is fixed, the price we pay is only of order $\log n/n$.
How good are the upper bounds derived here? Juditsky and Nemirovski (1996) show that when $M$ and $n$ satisfy $C_1 \log M \le n \le C_2 M \log M$ for some constants $C_1$ and $C_2$ (i.e., $M$ is no smaller than order $n/\log n$ but not too large), the order $((\log n)/n)^{1/2}$ cannot be improved in a minimax sense. We show that, in general, the rates given in Theorem 1 cannot be improved, up to possibly a logarithmic factor for some cases. For simplicity, assume that the errors are normally distributed with variance 1.
Theorem 2: Consider $M_n = \lfloor C_0 n^{\alpha} \rfloor$ for some $\alpha > 0$. There exist $M_n$ procedures $\Delta_{M_n} = \{\delta_j,\ 1 \le j \le M_n\}$ such that for any aggregated procedure $\delta^{(n)}$ based on $\Delta_{M_n}$, one can find a regression function $f$ with $\|f\|_{\infty} \le \sqrt{2}$ satisfying
$$R\left(f; n; \delta^{(n)}\right) - R(f; n; \Delta_{M_n}) \ge C \cdot \begin{cases} \left(\dfrac{\log n}{n}\right)^{1/2} & \text{when } 1/2 < \alpha < 1, \\[6pt] \dfrac{1}{n^{1-\alpha}} & \text{when } 0 \le \alpha \le 1/2, \end{cases}$$
where the constant $C$ does not depend on $n$.
Thus no aggregation method can achieve the smallest risk over all the linear combinations within an order smaller than the ones given above in accordance with $\alpha$, uniformly over all bounded regression functions. Note that the lower rate matches the upper rate when $\alpha > 1/2$, and the upper and lower rates differ only in logarithmic factors when $0 \le \alpha \le 1/2$.
It is interesting to notice how the price (in rate) for combining for improvement changes according to $M_n$. In the beginning, it basically increases linearly in $M_n$, but after $M_n$ reaches $\sqrt{n}$, it increases much more slowly, in a logarithmic fashion. Accordingly, it stays at rate $\left(\frac{\log n}{n}\right)^{1/2}$ as long as $M_n$ increases polynomially in $n$.
In a dierent direction, Yang (1998, 1999c) shows that one only needs to pay the price of order
(log
M
)
=n
to pursuit the less ambitious goal of achieving the best performance among the original
M
procedures. Observing the dramatic dierence between the two penalties, one naturally faces the
question: Should we combine for adaptation or for improvement? If one of the original procedures
happen to behave the best (or close to the best) among all the linear combinations, or at least one of
the original procedures converges at a rate faster than (log
n
)
=n
1
?
(for 0
<
1
=
2) or
p
log
n=n
(for
1
=
2), if one aggregates for better performance, one could be unfortunately paying too high a price
for nothing but hurting the convergence rate in estimating
f
. In terms of rate of convergence, combining
for improvement is worth the eort only if
R
(
f
;
n=
2; ) plus the penalty in (1) is of a smaller order
than (log
M
)
=n
+ inf
j
R
(
f
;
n=
2;
j
). Since the risks are of course unknown, in applications, one does not
know in advance whether to combine for adaptation or combine for improvement. A wrong choice can
lead to a much worse rate of convergence. In the next section, we show one can actually handle the two
goals optimally at the same time.
3 Multi-purpose aggregation
Here we show that when combining the procedures properly, one can have the potential of obtaining a large gain in estimation accuracy yet without losing much when there happens to be no advantage in considering sophisticated linear combinations.

Let us consider a slightly different setting compared to the previous section. Suppose that we have a countable collection of candidate procedures $\Delta = \{\delta_1, \delta_2, \ldots\}$. Under this setting, one does not need to decide beforehand how many procedures should be included at a given sample size. Consider three different approaches to combine the procedures in $\Delta$.
The rst approach is to combine the procedures for adaptation. Here one intends to capture the
7
best performance in terms of rate of convergence among the candidate procedures. Let
A
denote this
combined procedure based on using the Three-Stage ARM Algorithm as given in the Section 6. Since
is not (necessarily) a nite collection, one can not use the uniform weight. The prior weight
j
is
taken to b e
ce
?
log
j
;
where log
is dened by log
x
= log(
x
+ 1) + 2 log log(
x
+ 1) and the constant
c
is
chosen to normalize the weights to add up to 1. Based on Proposition 1 in Section 6, we have that for
any
f
with
k
f
k
1
<
1
;
R
(
f
;
n
;
A
)
C
1
inf
j
log (
j
+ 1)
n
+
R
(
f
;
n=
2;
j
)
=:
C
1
R
1
(
f
;
n
; )
;
(2)
where the constant
C
1
depends on
k
f
k
1
;
,
;
and
h:
In the rest of the paper, unless stated otherwise,
a constant
C
(with or without subscript) may depend on
k
f
k
1
;
,
;
and
h:
For convenience, we may
use the same symbol
C
for dierent such constants in dierent places. From above, if one procedure,
say
j
behaves the best, then the penalty is of order
1
n
:
If the best estimator changes according to
n;
then inf
j
log(
j
+1)
n
+
R
(
f
;
n=
2;
j
)
is a trade-o between complexity and estimation accuracy.
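As a small illustration of the prior weights used for $\delta_A$, the sketch below computes $\pi_j \propto e^{-\log^* j}$ with $\log^* x = \log(x+1) + 2\log\log(x+1)$; truncating the infinite normalizing sum at a finite $J$ is an approximation introduced here purely for illustration.

```python
import math

def log_star(x):
    """Modified logarithm log*(x) = log(x+1) + 2*log(log(x+1)) from Section 3."""
    return math.log(x + 1) + 2.0 * math.log(math.log(x + 1))

def prior_weights(J):
    """Approximate prior weights pi_j = c * exp(-log*(j)), j = 1..J,
    with c chosen so the (truncated) weights sum to one."""
    raw = [math.exp(-log_star(j)) for j in range(1, J + 1)]
    c = 1.0 / sum(raw)          # truncation at J approximates the infinite sum
    return [c * w for w in raw]

pi = prior_weights(10000)
print(pi[:5], sum(pi))  # heavier weight on early (simpler) procedures
```

The weights decay roughly like $1/((j+1)\log^2(j+1))$, which is why the complexity charge in (2) is only logarithmic in the index $j$.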
The second approach targets the best performance among all the linear combinations of the original procedures up to different orders. For each integer $L \ge 1$, let $\delta^L$ denote the combined (for improvement) procedure based on the first $L$ procedures $\delta_1, \ldots, \delta_L$ as used for Theorem 1. Then combine the procedures $\{\delta^1, \delta^2, \ldots\}$ with weight $c\, e^{-\log^* j}$ for $j \ge 1$ as defined earlier. Let $\delta_B$ denote this combined procedure. Let $\Delta^L$ denote the set of the first $L$ procedures in $\Delta$. Let
$$\psi_n(L) = \begin{cases} \dfrac{L \log(1 + n/L)}{n} & 1 \le L < \sqrt{n}, \\[6pt] \dfrac{\log L}{\sqrt{n \log n}} & L \ge \sqrt{n}. \end{cases}$$
By Theorem 1 and Proposition 1, we have that for any $f$ with $\|f\|_{\infty} < \infty$,
$$R(f; n; \delta_B) \le C_2 \inf_L \left( R\left(f; \frac{n}{2}; \Delta^L\right) + \psi_n(L) \right) =: C_2 R_2(f; n; \Delta). \qquad (3)$$
The third approach recognizes that in many cases, when combining a lot of procedures, the best linear combination may concentrate on only a few procedures. For such a case, working with these important procedures only leads to a much smaller price when combining for improvement. This calls for additional care in aggregation, and it can be done as follows. For each integer $L > 1$, $1 \le k < L$, and a subset $S$ of $\{1, 2, \ldots, L\}$ of size $k$, let $\delta(S)$ be the combined (for improvement) procedure based on $\{\delta_j : j \in S\}$ as for (1). Then let $\delta_{L,k}$ be the combined (for adaptation) procedure based on all such $\delta(S)$ with uniform weight $1/\binom{L}{k}$ (there are $\binom{L}{k}$ many such procedures). Then let $\delta^{(L)}$ be the combined (for adaptation) procedure based on $\delta_{L,1}, \ldots, \delta_{L,L-1}$ using the uniform weight $1/(L-1)$. Let $\delta_C$ denote the combined (for adaptation) procedure based on $\delta^{(L)}$, $L \ge 2$, with weight $c_0 e^{-\log^* j}$, where the constant $c_0$ is chosen such that $\sum_{j=2}^{\infty} c_0 e^{-\log^* j} = 1$. Let $\Delta_S$ denote the collection of procedures $\{\delta_j : j \in S\}$. Based on Proposition 1 and Theorem 1, we have that for any $f$ with $\|f\|_{\infty} < \infty$,
$$R(f; n; \delta_C) \le C_3 \inf_{L \ge 2} \; \inf_{1 \le k \le L-1} \; \inf_{|S| = k,\ S \subset \{1, 2, \ldots, L\}} \left( R\left(f; \frac{n}{16}; \Delta_S\right) + \psi_n(k) + \frac{\log\binom{L}{k}}{n} \right) \qquad (4)$$
$$=: C_3 R_3(f; n; \Delta).$$
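A small numerical sketch (with illustrative values of $n$, $L$, and $k$ chosen here, not taken from the paper) compares the penalty $\psi_n(L)$ in (3) with the penalty $\psi_n(k) + \log\binom{L}{k}/n$ in (4), showing why the sparse route can be much cheaper when only $k \ll L$ procedures matter.

```python
import math

def psi(n, L):
    """Penalty psi_n(L): L*log(1+n/L)/n if L < sqrt(n), else log(L)/sqrt(n*log(n))."""
    if L < math.sqrt(n):
        return L * math.log(1.0 + n / L) / n
    return math.log(L) / math.sqrt(n * math.log(n))

def sparse_penalty(n, L, k):
    """Penalty term psi_n(k) + log(binom(L, k))/n appearing in the bound (4)."""
    return psi(n, k) + math.log(math.comb(L, k)) / n

n, L = 10_000, 500
print("full linear aggregation over L procedures:", psi(n, L))
for k in (5, 20, 50):
    print(f"sparse aggregation with k = {k}:", sparse_penalty(n, L, k))
```

For small $k$ the subset-search charge $\log\binom{L}{k}/n$ is modest, so the sparse penalty can fall well below $\psi_n(L)$; as $k$ grows, $\psi_n(k)$ eventually dominates and the advantage disappears.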
Now we combine these three procedures $\delta_A$, $\delta_B$, and $\delta_C$ with equal weight $1/3$, and let $\delta_F$ denote the final combined procedure. Note that it is still a linear combination of the original procedures. We have the following result.
Corollary 1: Assume Conditions A1 and A2 are satisfied. Then for each $f$ with $\|f\|_{\infty} < \infty$, we have
$$R(f; n; \delta_F) \le C \min\left( R_1(f; n/2; \Delta),\ R_2(f; n/2; \Delta),\ R_3(f; n/2; \Delta) \right),$$
where $R_1(f; n; \Delta)$, $R_2(f; n; \Delta)$, and $R_3(f; n; \Delta)$ are given in (2), (3), and (4).
The above result characterizes good performance of the final estimator simultaneously in three directions in terms of rate of convergence. First of all, the final estimator converges as fast as any original procedure. Secondly, when linear combinations of the first $L_n$ procedures (for some $L_n > 1$) can improve estimation accuracy dramatically, one pays a price of at most order $\psi_n(L_n)$ for the better performance. When $L_n$ is small, the gain is substantial. When certain linear combinations of a small number of procedures perform well, the final estimator can also take advantage of that. In summary, the final estimator can behave both aggressively (combining for improvement) and conservatively (combining for adaptation), whichever is better.
4 An illustration via linear approximation
We illustrate the result of multi-purpose aggregation studied in the previous section through an example with linear and sparse approximations. We assume that $x \in [0, 1]^d$ ($1 \le d \le \infty$). Let $\{\Phi_j : j = 1, 2, \ldots\}$ be a collection of linear approximation systems. For each $j$, $\Phi_j = \{\varphi_{j,1}(x), \varphi_{j,2}(x), \ldots\}$ is a chosen collection of linearly independent functions in $L_2[0, 1]^d$. Traditionally, orthonormal bases (or at least bases with some frame properties) have been emphasized. Recently, non-orthogonal and/or over-complete bases have been advocated and studied. Relaxation of orthogonality enables one to consider, e.g., trigonometric expansions with fractional frequencies and neural network models. Considering different bases provides much more flexibility, which gives a great potential to improve estimation accuracy, especially in high-dimensional settings. See Barron and Cover (1991), Mallat and Zhang (1993), Barron (1994), Donoho and Johnstone (1994), Juditsky and Nemirovski (1996), Yang and Barron (1998), Yang (1999a), and Barron, Birge and Massart (1999) for some work in those directions.
For a xed
j;
the (squared
L
2
) approximation error of
f
using the rst
N
terms is
j;N
(
f
) = inf
f
a
l
g
k
f
?
N
X
l
=1
a
l
'
j;l
k
2
:
We call this individual approximation. The approximation error of
f
using linear combinations of the
individual approximations of
f
up to
N
terms based on the rst
L
systems is
L
N
(
f
) = inf
f
a
j;l
g
k
f
?
L
X
j
=1
N
X
l
=1
a
j;l
'
j;l
k
2
:
We call this linearly combined approximation. Obviously
L
N
(
f
)
j;N
(
f
) for 1
j
L:
When
L
N
(
f
)
j;N
(
f
) for 1
j
L
with the right size, the advantage of considering linear combinations
over dierent systems can be substantial. The approximation error of
f
based on sparse approximation
using
k
out of the rst
L
systems is
L;k
N
(
f
) = inf
S
f
1
;:::;M
g
;
j
S
j
=
k
inf
f
a
j;l
g
k
f
?
X
j
2
S
N
X
l
=1
a
j;l
'
j;l
k
2
:
We call this sparsely combined approximation. The sparse approximation can improve estimation accu-
racy compared to the linearly combined approximation if only a few approximation systems are actually
needed in the linearly combined approximation, i.e., one can nd
k
L
such that
L;k
N
(
f
) is close to
L
N
(
f
)
:
For a given $j$ and $N$, traditional linear model estimators (e.g., estimators based on the least squares principle or projection estimators with orthogonal basis functions) can be used to estimate the best parameters in the linear approximation, resulting in the familiar bias-squared (approximation error) plus variance (estimation error) trade-off for the mean squared error. As is well known, the variance is typically of order $N/n$ under minor conditions.

Combining the approximation error and the estimation error, one can bound $R_1(f; n; \Delta)$, $R_2(f; n; \Delta)$, and $R_3(f; n; \Delta)$ as defined in (2), (3), and (4) as follows:
$$R_1(f; n; \Delta) = O\left( \inf_{j, N} \left( \rho_{j,N}(f) + \frac{N}{n} + \frac{\log j}{n} \right) \right), \qquad (5)$$
$$R_2(f; n; \Delta) = O\left( \inf_{L, N} \left( \rho_N^L(f) + \frac{LN}{n} + \psi_n(L) \right) \right), \qquad (6)$$
$$R_3(f; n; \Delta) = O\left( \inf_{L, N} \; \inf_{1 \le k \le L-1} \left( \rho_N^{L,k}(f) + \psi_n(k) + \frac{k \log L}{n} + \frac{kN}{n} \right) \right). \qquad (7)$$
Based on Corollary 1 and the above bounds, one can derive rates of convergence for the final aggregated procedure $\delta_F$ under various assumptions on the approximation errors $\rho_{j,N}(f)$, $\rho_N^L(f)$, and $\rho_N^{L,k}(f)$. The conclusion is basically that, in terms of rate of convergence, the final estimator behaves as well as the best estimator based on an individual approximation system, or as the linearly combined estimator, or as the sparsely combined estimator, whichever is the best.
When the basis functions are orthonormal, conditions on the $L_2$ approximation errors typically correspond to conditions on the coefficients, resulting in a simple characterization of the functions. Here we give an example. Suppose $d = \infty$ and assume $X = (X_1, X_2, \ldots)$ has independent, uniformly distributed components (or after suitable transformations). We assume the true regression function is additive, i.e.,
$$f(x) = c_0 + f_1(x_1) + f_2(x_2) + \cdots. \qquad (8)$$
To estimate the additive component $f_j(x_j)$, a linear approximation system $\Phi_j = \{\varphi_{j,1}(x_j), \varphi_{j,2}(x_j), \ldots\}$ is used. Assume the basis functions are orthonormal with mean zero. For a given $j$, let $\hat{f}_{j,N}(x_j)$ be the projection estimator of $f_j(x_j)$ based on the first $N$ basis functions in $\Phi_j$. That is, $\hat{f}_{j,N}(x_j) = \sum_{i=1}^{N} \hat{\beta}_{j,i} \varphi_{j,i}(x_j)$, where $\hat{\beta}_{j,i} = \frac{1}{n} \sum_{l=1}^{n} Y_l \varphi_{j,i}(X_{j,l})$. For simplicity, assume $\|f\|_{\infty} \le A$ for some known constant $A > 0$ and that the estimators are accordingly clipped into that range. Let $\delta_{j,N}$, $j \ge 1$, $N \ge 1$, denote these regression procedures. Let $\delta_A$, $\delta_B$, and $\delta_C$ be the differently combined procedures as constructed in the previous section and let $\delta_F$ denote the final procedure combining them together.
Assume $f_j(x_j) = \sum_{l=1}^{\infty} \beta_{j,l} \varphi_{j,l}(x_j)$ for $j \ge 1$ and assume the coefficients satisfy the following condition B0:
$$\sum_{j=1}^{\infty} j^{2\gamma} \left( \sum_{i=1}^{\infty} i^{2s} \beta_{j,i}^2 \right) < \infty \qquad (9)$$
for some $s > 0$ and $\gamma > 0$. When the true regression function is actually univariate in one variable, say $x_{j_0}$, then $\beta_{j,i} = 0$ for all $j$ and $i$ except $j = j_0$. If one knew this to be the case, one could ignore the other variables. Let B1 denote this condition. Another condition, denoted B2, is that $\beta_{j,i} = 0$ for all $j \ge 1$ and $i > i_0$ for some unknown integer $i_0$, and in addition,
$$\sum_{j=1}^{\infty} \sum_{i=1}^{i_0} |\beta_{j,i}| < \infty.$$
Corollary 2: Assume the errors are normally distributed with $\sigma^2$ bounded above and below by known constants. If $f$ satisfies Condition B0, we have
$$R(f; n; \delta_F) = O\left( n^{-\frac{2s}{1 + s(2 + 1/\gamma)}} \right). \qquad (10)$$
If $f$ satisfies Conditions B0 and B1, we have
$$R(f; n; \delta_F) = O\left( n^{-\frac{2s}{1 + 2s}} \right). \qquad (11)$$
If $f$ satisfies Conditions B0 and B2, we have
$$R(f; n; \delta_F) = O\left( (\log n/n)^{1/2} \right). \qquad (12)$$
Note that the procedure $\delta_F$ does not require knowledge of the constants $s$ and $\gamma$ (or $i_0$). Thus the rate $n^{-2s/(1 + s(2 + 1/\gamma))}$ is adaptively achieved. When $s$ or $\gamma$ is very small, the rate of convergence is very slow. Under the additional assumption of B1 or B2, a much better rate of convergence is automatically achieved by the aggregated procedure.
Remarks:

1. In the construction of the aggregated procedure $\delta_C$, sparseness is in terms of the number of procedures being combined. One can also consider sparseness in terms of the number of terms in the linear approximation within each approximation system. Then the same convergence rate $(\log n/n)^{1/2}$ can be obtained under Condition B2 without assuming that for each $j$, there are only finitely many non-zero coefficients. See Yang and Barron (1998) for such a treatment in density estimation based on model selection.
2. Under the assumptions that $X = (X_1, \ldots)$ has independent and uniformly distributed components and that the basis functions have mean zero, $E\, \varphi_{j,l}(X_j)\, \varphi_{j',l'}(X_{j'}) = 0$ for all $j \ne j'$. These strong conditions make the approximation error readily bounded under Condition B0. Without these conditions, the convergence rates in Corollary 2 can be shown to still hold under the direct conditions $\inf_{\{\beta_{j,l}\}} \| f - \sum_{j=1}^{J} \sum_{l=1}^{\infty} \beta_{j,l} \varphi_{j,l} \|^2 = O\left(J^{-2\gamma}\right)$ and $\sum_{i=1}^{\infty} i^{2s} \beta_{j,i}^2 < \infty$ for each $j$. Also, the additivity condition (8) can be expressed in terms of pre-specified linear combinations of the original explanatory variables rather than the original explanatory variables themselves.
3. If $f$ happens to be "parametric" in the sense that it can be expressed as a linear combination of finitely many basis functions (possibly across different systems), then the convergence rate of the final procedure is $O((\log n)/n)$, possibly losing a logarithmic factor.
4. When $s > 1/2$, under Condition B0 and the condition that $\beta_{j,i} = 0$ for all $j$ and $i$ except $j = j_0$, the summability condition $\sum_{j=1}^{\infty} \sum_{i=1}^{i_0} |\beta_{j,i}| < \infty$ in B2 is automatically satisfied. For this case, it can be shown that $R(f; n; \delta_F)$ in fact converges at a better rate, $n^{-2s/(2s+1)}$, than $(\log n/n)^{1/2}$.
5 Generalization

The main results in this paper can be generalized with little difficulty in two directions based on an analysis similar to that in Yang (1999c). Firstly, the error distribution $h$ need not be known completely. It suffices to assume that $h$ is in a countable collection of candidate error distributions. This gives more flexibility to handle errors with different degrees of heavy tail. Secondly, one does not need to require that the random errors have a constant variance function. Assume instead that for each $\delta_j$, in addition to having an estimator $\hat{f}_{j,n}$ of the regression function, we also have an estimator $\hat{\sigma}_{j,n}$ of the variance function. The procedures can share variance estimators if so desired. The procedures can be combined for estimating $f$ using both the regression estimators and the variance estimators (see Yang (1999c)). A recent work on variance estimation is Ruppert et al. (1997), where a local polynomial method is proposed with a theoretical justification.
6 A Three-Stage algorithm to combine procedures for adaptation

Let $\Delta = \{\delta_j,\ j \ge 1\}$ be a collection of regression procedures. The index set $\{j \ge 1\}$ is allowed to degenerate to a finite set. Let $\pi_j$ be positive numbers summing up to one, i.e., $\sum_{j=1}^{\infty} \pi_j = 1$. They will be used as prior weights on the procedures. The following is an algorithm to combine candidate procedures for adaptation, essentially as given in Yang (1999c).
A Three-Stage ARM Algorithm

Step 1. Split the data into three parts: $Z^{(1)} = (X_i, Y_i)_{i=1}^{n_1}$, $Z^{(2)} = (X_i, Y_i)_{i=n_1+1}^{n_1+n_2}$, and $Z^{(3)} = (X_i, Y_i)_{i=n_1+n_2+1}^{n}$. Let $n_3 = n - n_1 - n_2$.

Step 2. Obtain estimates $\hat{f}_{j,n_1}(x; Z^{(1)})$ of $f$ based on $Z^{(1)}$ for $j \ge 1$.

Step 3. Estimate the variance $\sigma^2$ for each procedure by
$$\hat{\sigma}_j^2 = \frac{1}{n_2} \sum_{i=n_1+1}^{n_1+n_2} \left( Y_i - \hat{f}_{j,n_1}(X_i) \right)^2.$$

Step 4. For each $j$, evaluate predictions. For $n_1 + n_2 + 1 \le k \le n$, predict $Y_k$ by $\hat{f}_{j,n_1}(X_k)$ and compute
$$E_{j,k} = \frac{\prod_{i=n_1+n_2+1}^{k} h\!\left( \frac{Y_i - \hat{f}_{j,n_1}(X_i)}{\hat{\sigma}_j} \right)}{\hat{\sigma}_j^{\,k - n_1 - n_2}}.$$

Step 5. Let
$$W_{j,k} = \frac{\pi_j E_{j,k}}{\sum_{l \ge 1} \pi_l E_{l,k}}$$
and compute the final weight
$$W_j = \frac{1}{n_3} \sum_{k=n_1+n_2+1}^{n} W_{j,k}.$$
The final estimator is
$$\tilde{f}_n(x) = \sum_{j=1}^{\infty} W_j \hat{f}_{j,n/2}(x). \qquad (13)$$
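Below is a minimal sketch of Steps 1-5 of the Three-Stage ARM Algorithm, assuming a Gaussian error density $h$ and a finite list of candidate procedures supplied as fitting functions; it returns the weights $W_j$, from which the final estimator (13) is the weighted combination of the half-sample fits. It is an illustration of the weighting scheme, not the exact implementation of Yang (1999c).

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def arm_weights(X, Y, fit_procedures, prior=None):
    """Three-Stage ARM weights W_j for a finite list of candidate procedures,
    assuming a Gaussian error density h (an assumption made for this sketch).

    fit_procedures: list of functions; each maps (X1, Y1) to a prediction
    function x -> fhat_j(x) trained on the first part of the data."""
    n = len(Y)
    n1, n2 = n // 2, n // 4                        # n1 = n/2, n2 = n3 = n/4
    X1, Y1 = X[:n1], Y[:n1]                        # Step 1: split the data
    X2, Y2 = X[n1:n1 + n2], Y[n1:n1 + n2]
    X3, Y3 = X[n1 + n2:], Y[n1 + n2:]

    fits = [proc(X1, Y1) for proc in fit_procedures]                      # Step 2
    sig = np.array([np.sqrt(np.mean((Y2 - f(X2)) ** 2)) for f in fits])   # Step 3
    J = len(fits)
    prior = np.full(J, 1.0 / J) if prior is None else np.asarray(prior)

    # Step 4: log E_{j,k} = sum_{i<=k} [log h((Y_i - fhat_j(X_i))/sig_j) - log sig_j]
    log_E = np.vstack([np.cumsum(norm.logpdf((Y3 - f(X3)) / s) - np.log(s))
                       for f, s in zip(fits, sig)])                       # (J, n3)
    # Step 5: W_{j,k} = pi_j E_{j,k} / sum_l pi_l E_{l,k}, then average over k
    log_num = np.log(prior)[:, None] + log_E
    W_jk = np.exp(log_num - logsumexp(log_num, axis=0, keepdims=True))
    return W_jk.mean(axis=1)                                              # W_j

# toy usage: two hypothetical candidate procedures (constant fit vs. linear fit)
rng = np.random.default_rng(2)
X = rng.uniform(size=400); Y = 2.0 * X + 0.5 * rng.normal(size=400)
const_proc = lambda X1, Y1: (lambda x: np.full_like(x, Y1.mean()))
lin_proc = lambda X1, Y1: (lambda x, c=np.polyfit(X1, Y1, 1): np.polyval(c, x))
print(arm_weights(X, Y, [const_proc, lin_proc]))   # weight should favor the linear fit
```

The log-space normalization is only a numerical safeguard; mathematically the weights are exactly those of Step 5, and the final estimate at a point $x$ would be the $W_j$-weighted sum of the candidate fits there.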
The combined estimator has the following theoretical property. For simplicity in notation, assume that $n$ is a multiple of 4, and then take $n_1 = n/2$ and $n_2 = n_3 = n/4$. We assume that the estimators $\hat{\sigma}_j$ are bounded above and below by the positive constants $\overline{\sigma}$ and $\underline{\sigma}$ (otherwise one needs to clip the estimators to be in that range).
Proposition 1: Assume Conditions A1 and A2 hold. Then the above convexly combined estimator $\tilde{f}_n$ satisfies
$$E \| f - \tilde{f}_n \|^2 \le C \inf_j \left( \frac{1 + \log(1/\pi_j)}{n} + E \| f - \hat{f}_{j,n/2} \|^2 \right),$$
where the constant $C$ depends only on $A$, $\overline{\sigma}$, $\underline{\sigma}$, and $h$. In particular, if there are $M$ procedures to be combined with uniform weight, then
$$E \| f - \tilde{f}_n \|^2 \le C \left( \frac{\log M}{n} + \inf_j E \| f - \hat{f}_{j,n/2} \|^2 \right).$$
Remarks:

1. In the ARM algorithm, the second stage is used to estimate $\sigma^2$. Here the estimators are derived in terms of predictions based on the individual regression procedures. The use of these variance estimators does not get in the way of estimating the regression function $f$ in terms of rate of convergence. One can also use common model-independent estimators of $\sigma^2$ (see, e.g., Rice (1984)). Then one does not need this stage, and accordingly, the risk of the variance estimators will appear in the risk bound on estimating $f$.

2. As discussed in Yang (1999c), the estimator $\tilde{f}_n$ depends on the order of the observations. For improvement, one can randomly permute the order of the observations a number of times and average the corresponding estimators.

3. In the definition of the final estimator $\tilde{f}_n = \sum_{j=1}^{\infty} W_j \hat{f}_{j,n/2}(x)$, we use $\hat{f}_{j,n/2}(x)$ instead of $\hat{f}_{j,n}(x)$ to have a cleaner risk bound. But $\hat{f}_{j,n}(x)$ should be a slightly better choice in terms of accuracy.

Proof of Proposition 1: The result is proved in Yang (1999c) for the case when there are finitely many, say $J$, candidate procedures with equal weight $\pi_j = 1/J$ for $1 \le j \le J$. The proof for the general case can be done similarly.
7 Proof of the results
Proof of Theorem 1: There are mainly two steps in our derivation of an aggregated procedure yielding the given risk bound. First, we discretize (with suitable accuracy) the coefficients for the linear combinations and then treat the set of all the corresponding discretely combined estimators as a new collection of candidate estimators. For suitable discretization, some results on metric entropy are very helpful. In the second step, we combine these estimators for adaptation using the algorithm ARM proposed in Yang (1999c) and described in Section 6. When $M_n$ is large, however, an additional difficulty arises, and an idea of sparse combining takes care of the problem.

We consider first the case when $M_n < \sqrt{n}$. Let $G = \{\theta = (\theta_1, \ldots, \theta_M) : \sum_{i=1}^{M} |\theta_i| \le 1\}$. Let $N_{\epsilon}$ be an $\epsilon$-net in $G$ under the $l_1^M$ distance, i.e., for each $\theta \in G$, there exists $\theta' \in N_{\epsilon}$ such that $\|\theta - \theta'\|_1^M = \sum_{i=1}^{M} |\theta_i - \theta'_i| \le \epsilon$. An $\epsilon$-net in $G$ yields a suitable net in the set $\mathcal{F}_n$ of the linear combinations of the original estimators. For simplicity in notation, let $\hat{f}_1, \ldots, \hat{f}_M$ denote the original estimators at the sample size $n$. Let $F_{\epsilon}$ be the set of the linear combinations of the estimators $\hat{f}_1, \ldots, \hat{f}_M$ with coefficients in $N_{\epsilon}$. Then for any estimator $\hat{f} = \sum_{i=1}^{M} \theta_i \hat{f}_i$ with $\theta \in G$, there exists $\theta' \in N_{\epsilon}$ such that
$$\Big\| \hat{f} - \sum_{i=1}^{M} \theta'_i \hat{f}_i \Big\| = \Big\| \sum_{i=1}^{M} (\theta_i - \theta'_i) \hat{f}_i \Big\| \le A \|\theta - \theta'\|_1^M \le A\epsilon. \qquad (14)$$
Now we combine all the estimators in $F_{\epsilon}$ using the ARM algorithm given in Section 6 with uniform weight $1/|N_{\epsilon}|$. Let $\hat{f}_n$ denote the combined estimator. By Proposition 1, for any $f$ with $\|f\|_{\infty} < \infty$, we have
$$E \| f - \hat{f}_n \|^2 \le C \frac{\log(|N_{\epsilon}|)}{n} + C \inf_{\hat{f} \in F_{\epsilon}} R(f; \hat{f}; n/2),$$
where $C$ depends only on $A$, $\overline{\sigma}$, $\underline{\sigma}$, and $h$. Since $F_{\epsilon}$ is an $(A\epsilon)$-net in $\mathcal{F}_n$, by the triangle inequality, for any $f$, we have $\inf_{\hat{f} \in F_{\epsilon}} R(f; \hat{f}; n/2) \le 2 \inf_{\hat{f} \in \mathcal{F}_n} R(f; \hat{f}; n/2) + 2 A^2 \epsilon^2$. It follows that
$$E \| f - \hat{f}_n \|^2 \le 2C \frac{\log(|N_{\epsilon}|)}{n} + 2C \inf_{\hat{f} \in \mathcal{F}_n} R(f; \hat{f}; n/2) + 2 A^2 C \epsilon^2. \qquad (15)$$
To get the best upper bound (in order), we need to minimize $\frac{\log(|N_{\epsilon}|)}{n} + 2 A^2 \epsilon^2$ when discretizing $G$. Note that the logarithm of the smallest size of $N_{\epsilon}$ is the covering entropy of the set $G$ under the $l_1^M$ distance (see, e.g., Kolmogorov and Tihomirov (1959) for properties of metric entropies). For this case, metric entropy orders are known. The following result is given in terms of the entropy number, i.e., the worst-case approximation error with the best net of size $2^k$ points. Let $\epsilon_k$ denote the entropy number of $G$. From Edmunds and Triebel (1989, Proposition 3.1.3), when $k \ge M$, $\epsilon_k \le c\, 2^{-k/M}$ for some constant $c$ independent of $k$ and $M$. Take
$$k = \frac{M \left(\log(n/M) + 2 \log 2\right)}{2 \log 2}$$
(note that $k \ge M$). (Strictly speaking, we need to round up or down to make $k$ an integer.) Then
$$\frac{\log(|N_{\epsilon}|)}{n} + 2 A^2 \epsilon^2 \le \frac{M \left(\log(n/M) + 2 \log 2\right)}{(2 \log 2)\, n} + \frac{(Ac)^2 M}{2n} \le c' \frac{M \log(1 + n/M)}{n},$$
where $c'$ depends only on $A$ and $c$. The upper bound in Theorem 1 for $M < \sqrt{n}$ then follows.
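The balance behind the choice of $k$ (hence $\epsilon$) in this discretization can be checked numerically; the sketch below evaluates the entropy cost $\log|N_{\epsilon}|/n$ and the bias cost $2A^2\epsilon^2$ with $\epsilon \approx c\, 2^{-k/M}$ against the target penalty $M\log(1+n/M)/n$, using placeholder values for $A$ and $c$.

```python
import math

def discretization_tradeoff(n, M, A=1.0, c=1.0):
    """Evaluate the two terms balanced in the proof of Theorem 1 (case M < sqrt(n)):
    an eps-net of size 2^k gives entropy cost ~ k*log(2)/n and bias cost ~ 2*(A*eps)^2
    with eps <= c * 2**(-k/M) (Edmunds and Triebel).  A and c are placeholders."""
    k = M * (math.log(n / M) + 2 * math.log(2)) / (2 * math.log(2))
    eps = c * 2.0 ** (-k / M)
    entropy_cost = k * math.log(2) / n
    bias_cost = 2.0 * (A * eps) ** 2
    target = M * math.log(1.0 + n / M) / n       # the penalty order in Theorem 1
    return entropy_cost, bias_cost, target

print(discretization_tradeoff(n=100_000, M=50))
```

Both costs come out of the same order as the target penalty, which is the point of the chosen $k$.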
Now consider the other case: $M \ge \sqrt{n}$. The argument above leads to a rate $(M/n) \log(1 + n/M)$ for $M \le n$, which, as will be seen, is only sub-optimal. For this case, due to the $l_1$ constraint, the number of large coefficients is small relative to $M$ when $M \ge \sqrt{n}$. An appropriate search for the large coefficients can result in the optimal rate of convergence, as we derive below.

Note that for $\|\theta\|_1^M \le 1$, $\| \sum_{i=1}^{M} \theta_i \hat{f}_i \| \le A$. Then by a sampling argument (see, e.g., Lemma 1 in Barron (1993)), for each $m$, there exist a subset $I \subset \{1, \ldots, M\}$ of size $m$ and $\theta'_I = (\theta'_i, i \in I)$ such that $\| \sum_{i=1}^{M} \theta_i \hat{f}_i - \sum_{i \in I} \theta'_i \hat{f}_i \| \le A/\sqrt{m}$. Taking $m = \sqrt{n/\log n}$, we have $\| \sum_{i=1}^{M} \theta_i \hat{f}_i - \sum_{i \in I} \theta'_i \hat{f}_i \| \le A (\log n/n)^{1/4}$. Consider an $\epsilon$-net in $B_I = \{\theta_I : \sum_{i \in I} |\theta_i| \le 1\}$ under the $l_1^m$ distance. Again by Edmunds and Triebel (1989), taking $k = \frac{m(\log(n/m) + 2\log 2)}{2 \log 2}$, the best $\epsilon$-net has approximation accuracy $\epsilon \le \frac{c}{2}\sqrt{m/n}$. Then, as in (14), we know that there exists $\theta''_I$ in this $\epsilon$-net such that $\| \sum_{i \in I} \theta'_i \hat{f}_i - \sum_{i \in I} \theta''_i \hat{f}_i \| \le \frac{Ac}{2}\sqrt{m/n}$. Thus for each $\hat{f} \in \mathcal{F}_n$, there exist $I \subset \{1, \ldots, M\}$ of size $m$ and $\theta''_I$ such that
$$\Big\| \sum_{i=1}^{M} \theta_i \hat{f}_i - \sum_{i \in I} \theta''_i \hat{f}_i \Big\| \le \frac{A (\log n)^{1/4}}{n^{1/4}} + \frac{Ac}{2\, n^{1/4} (\log n)^{1/4}} \le \frac{c'' (\log n)^{1/4}}{n^{1/4}},$$
where $c''$ depends only on $A$ and $c$. Notice that, in general, $I$ depends on $f$ and therefore it should be chosen adaptively. The above analysis suggests the following method of sparse combining.

For each fixed subset $I \subset \{1, \ldots, M\}$ of size $m$, discretize the linear coefficients as described above. Then (with uniform weight) combine the corresponding linear combinations of the procedures in $\Delta$. Then combine these (combined) procedures over all possible choices of $I$ (there are $\binom{M}{m}$ many such $I$ altogether) with uniform weight. Let $\delta$ denote this final procedure and let $\Delta_I = \{\delta_i, i \in I\}$. Applying Proposition 1 twice, we have that
$$R(f; n; \delta) \le C \left( R\left(f; \frac{n}{4}; \Delta\right) + \frac{(\log n)^{1/2}}{n^{1/2}} + \frac{m \log(n/m)}{n} + \frac{\log\binom{M}{m}}{n} \right) \le C' \left( R\left(f; \frac{n}{4}; \Delta\right) + \frac{\log M}{\sqrt{n \log n}} \right),$$
where the constants $C$ and $C'$ depend on $A$, $\overline{\sigma}$, $\underline{\sigma}$, and $h$. This completes the proof of Theorem 1.
Remark: In the above derivation, when $M > \sqrt{n}$, combining a small number (relative to $M$) of procedures together with a subset search yields a price of order $\sqrt{\log n/n}$ for $M$ of a polynomial order in $n$, which is the optimal rate based on Theorem 2 when $M$ is of a higher order than $\sqrt{n}$. Similar ideas on sparse subset selection are in, e.g., Barron (1994), Yang and Barron (1998), and Barron, Birge and Massart (1999).
We need a lemma on minimax lower bounds for the proof of Theorem 2. Let $d$ be a distance (metric) on a space $S$. For $D \subset S$, we say $G$ is an $\epsilon$-packing set in $D$ ($\epsilon > 0$) if $G \subset D$ and any two distinct members of $G$ are more than $\epsilon$ apart in the distance $d$. Now let $\mathcal{F}$ be a class of regression functions. The distance $d$ here is the $L_2$ distance.

Definition 1: (Global metric entropy) The packing $\epsilon$-entropy of $\mathcal{F}$ is the logarithm of the size of the largest $\epsilon$-packing set in $\mathcal{F}$. The packing $\epsilon$-entropy of $\mathcal{F}$ is denoted $M(\epsilon)$.

Definition 2: (Local metric entropy) The local $\epsilon$-entropy at $f \in \mathcal{F}$ is the logarithm of the size of the largest $(\epsilon/2)$-packing set in $B(f, \epsilon) = \{ f' \in \mathcal{F} : \| f' - f \| \le \epsilon \}$. The local $\epsilon$-entropy at $f$ is denoted by $M(\epsilon \mid f)$. The local $\epsilon$-entropy of $\mathcal{F}$ is defined as $M^{loc}(\epsilon) = \max_{f \in \mathcal{F}} M(\epsilon \mid f)$.

Both global and local entropies will be involved in our derivations of the lower bounds. Assume that $M^{loc}(\epsilon)$ is lower bounded by $\underline{M}^{loc}(\epsilon)$. Let $\epsilon_n$ be determined by
$$\underline{M}^{loc}(\epsilon_n) = n \epsilon_n^2 + 2 \log 2.$$
Assume $M(\epsilon)$ is upper bounded by $\overline{M}(\epsilon)$ and lower bounded by $\underline{M}(\epsilon)$. Let $\tau_n$ be determined by
$$\overline{M}(\sqrt{2}\, \tau_n) = n \tau_n^2 \qquad (16)$$
and $\eta_n$ be determined by
$$\underline{M}(\eta_n) = 4 n \tau_n^2 + 2 \log 2. \qquad (17)$$
Assume the random errors in the regression model are normally distributed with variance 1. The following lemma is useful for deriving minimax lower bounds using either global or local metric entropy.
Lemma 1: The minimax risk for estimating $f$ in $\mathcal{F}$ is lower bounded as follows:
$$\min_{\hat{f}} \max_{f \in \mathcal{F}} E \| f - \hat{f} \|^2 \ge \frac{\epsilon_n^2}{32}, \qquad \min_{\hat{f}} \max_{f \in \mathcal{F}} E \| f - \hat{f} \|^2 \ge \frac{\eta_n^2}{8},$$
where the minimization (or infimum) is over all regression estimators based on $Z^n = (X_i, Y_i)_{i=1}^{n}$.

The first bound in the lemma is from Yang and Barron (1999, Section 7) and the second one is from Yang and Barron (1997, Section 4).
Proof of Theorem 2: Let $\varphi_1(x), \varphi_2(x), \ldots$ be a uniformly bounded orthonormal basis (with respect to the distribution of $X$). An example is the trigonometric basis on $[0,1]$. Take $\delta_i$, $i \ge 1$, to be the procedure that always estimates $f$ by $\varphi_i(x)$. For each $M = C_0 n^{\alpha}$, consider the class of regression functions
$$\mathcal{F} = \left\{ f(x) = \theta_1 \varphi_1(x) + \cdots + \theta_M \varphi_M(x) : \|\theta\|_1^M \le 1 \right\}.$$
It is obvious that $R(f; n; \Delta_{M_n}) = 0$ for $f \in \mathcal{F}$. Thus to prove Theorem 2, it suffices to show that $\min_{\hat{f}} \max_{f \in \mathcal{F}} E \| f - \hat{f} \|^2 \ge C \lambda(n)$ for some constant $C > 0$ not depending on $n$, where $\lambda(n) = (\log n/n)^{1/2}$ for $1/2 < \alpha < 1$ and $\lambda(n) = n^{-(1-\alpha)}$ for $0 \le \alpha \le 1/2$.

Note that under the orthonormality assumption on the basis functions, the $L_2$ distance on $\mathcal{F}$ is the same as the $l_2$ distance on the coefficients in $\Theta = \{\theta : \|\theta\|_1^M \le 1\}$. Thus the entropy of $\mathcal{F}$ under the $L_2$ distance is the same as that of $\Theta$ under the $l_2^M$ distance. To apply Lemma 1, we lower bound the local entropy of $\mathcal{F}$ (or $\Theta$). Note that by the Cauchy-Schwarz inequality, the $l_1^M$ and $l_2^M$ norms have the relationship $\|\theta\|_1^M \le \sqrt{M} \|\theta\|_2^M$. Thus for $\epsilon \le M^{-1/2}$, taking $f \equiv 0$, we have
$$B(f, \epsilon) = \{ f \in \mathcal{F} : \| f \| \le \epsilon \} = \{ f : \|\theta\|_1^M \le 1,\ \|\theta\|_2^M \le \epsilon \} = \{ f : \|\theta\|_2^M \le \epsilon \}.$$
Consequently, for $\epsilon \le M^{-1/2}$, the $(\epsilon/2)$-packing of $B(f, \epsilon)$ under the $L_2$ distance is equivalent to the $(\epsilon/2)$-packing of $B_{\epsilon} = \{\theta : \|\theta\|_2^M \le \epsilon\}$ under the $l_2^M$ distance. Since a maximum $(\epsilon/2)$-packing set is an $(\epsilon/2)$-covering set, the union of the balls with radius $\epsilon/2$ centered at the points of a maximum packing set in $B_{\epsilon}$ covers $B_{\epsilon}$. It follows that the size of the maximum packing set is at least the ratio of the volumes of the balls $B_{\epsilon}$ and $B_{\epsilon/2}$, which is $2^M$. Thus we have shown that the local entropy $M^{loc}(\epsilon)$ of $\mathcal{F}$ under the $L_2$ distance is at least $\underline{M}^{loc}(\epsilon) = M \log 2$ for $\epsilon \le M^{-1/2}$.

For $M = C_0 n^{\alpha}$ with $0 \le \alpha \le 1/2$, solving $\underline{M}^{loc}(\epsilon_n) = n \epsilon_n^2 + 2 \log 2$ gives $\epsilon_n$ of order $n^{-(1-\alpha)/2}$. Note that for such $\alpha$, by possibly reducing $\underline{M}^{loc}(\epsilon)$ by a constant factor, the $\epsilon_n$ obtained this way can be made smaller than $M^{-1/2}$ (as required in the earlier derivation). By Lemma 1, we have proved the minimax lower rates for $\mathcal{F}$ when $0 \le \alpha \le 1/2$. That is, $\min_{\hat{f}} \max_{f \in \mathcal{F}} E \| f - \hat{f} \|^2 \ge C_1 n^{-(1-\alpha)}$ for some constant $C_1$ independent of $n$.

For $\alpha > 1/2$, we use the global entropy to derive the minimax lower bound. It is known from Schutt (1984) that the entropy number satisfies
$$c_1 \sqrt{\frac{\log(1 + M/k)}{k}} \le \epsilon_k \le c_2 \sqrt{\frac{\log(1 + M/k)}{k}}$$
for some constants $c_1$ and $c_2$ independent of $M$ and $k$ when $\log M \le k \le M$. We can choose $\tau_n$ and $\eta_n$ both of order $(\log n/n)^{1/4}$ to satisfy (16) and (17). This gives the minimax lower rate for $\mathcal{F}$ when $\alpha > 1/2$, i.e., $\min_{\hat{f}} \max_{f \in \mathcal{F}} E \| f - \hat{f} \|^2 \ge C_2 (\log n/n)^{1/2}$ for some constant $C_2$ independent of $n$. Finally, with the trigonometric basis, the functions in $\mathcal{F}$ satisfy $\|f\|_{\infty} \le \sqrt{2}$. The conclusion of Theorem 2 follows. This completes the proof of Theorem 2.
Remarks:

1. It is interesting to note that both the global and the local entropies are useful here for different cases. For $\alpha > 1/2$, the application of the global entropy gives the right rate of convergence. However, if one intends to use the minimax lower bound in terms of the local entropy, the above derivation of a local entropy bound does not work, because the critical $\epsilon$ is of order $(\log n/n)^{1/4}$, which is of a higher order than $M^{-1/2}$, and accordingly $B(f, \epsilon) \ne \{ f : \|\theta\|_2^M \le \epsilon \}$. On the other hand, for $0 \le \alpha \le 1/2$, the application of the local entropy method gives a rate that agrees with the upper bound up to a logarithmic factor. If one uses the global entropy, the lower bound by Lemma 1 differs substantially in rate from the upper bound. For the general relationship between global and local entropies, see Yang and Barron (1999, Section 7).

2. In the derivation of the lower bounds in Theorem 2, we choose very special (nonrandom) original estimators. This is of course not a typical situation in which one would consider combining estimation procedures. In applications, the candidate estimators (or many of them) are most likely somewhat highly correlated (they are estimating the same target), but probably not too highly correlated (otherwise one could gain little even by ideal combining). For such cases, the actual price paid by a good aggregation method is smaller than that given in Theorem 2, but probably not too much smaller.
Proof of Corollary 2: Assume that Condition B0 is satisfied. For a given $j$, the approximation error of $f_j(x_j)$ using the best first $N$ terms satisfies
$$\rho_{j,N}(f_j) = \Big\| f_j - \sum_{l=1}^{N} \beta_{j,l} \varphi_{j,l} \Big\|^2 = \sum_{i=N+1}^{\infty} \beta_{j,i}^2 \le \sum_{i=N+1}^{\infty} \frac{i^{2s} \beta_{j,i}^2}{(N+1)^{2s}} \le \frac{1}{(N+1)^{2s}} \sum_{i=1}^{\infty} i^{2s} \beta_{j,i}^2.$$
Thus under Condition B0 on $f$, we have $\rho_{j,N}(f_j) = O\left((N+1)^{-2s}\right)$ as $N \to \infty$. The approximation error of $f(x)$ using the basis functions $\varphi_{j,l}(x_j)$ with $1 \le j \le L$ and $1 \le l \le N$ satisfies
$$\rho_N^L(f) = \Big\| f - \sum_{j=1}^{L} \sum_{i=1}^{N} \beta_{j,i} \varphi_{j,i} \Big\|^2 = \sum_{j=L+1}^{\infty} \sum_{i=1}^{\infty} \beta_{j,i}^2 + \sum_{j=1}^{L} \sum_{i=N+1}^{\infty} \beta_{j,i}^2$$
$$\le \frac{1}{(L+1)^{2\gamma}} \sum_{j=L+1}^{\infty} j^{2\gamma} \sum_{i=1}^{\infty} i^{2s} \beta_{j,i}^2 + \frac{1}{(N+1)^{2s}} \sum_{j=1}^{L} j^{2\gamma} \sum_{i=N+1}^{\infty} i^{2s} \beta_{j,i}^2.$$
Thus the approximation error is $\rho_N^L(f) = O\left((N+1)^{-2s} + (L+1)^{-2\gamma}\right)$.

Under Conditions B0 and B2, sparse approximation has the potential to perform much better. From Condition B2, $\sum_{i=i_0+1}^{\infty} |\beta_{j,i}|^2 = 0$. Let $b_j = \sum_{i=1}^{i_0} |\beta_{j,i}|$. Then Condition B2 implies that there exists a constant $a$ such that $\| \sum_{j=1}^{L} \sum_{i=1}^{i_0} \beta_{j,i} \varphi_{j,i} \| \le \sum_{j=1}^{L} b_j \le a$ for all $L \ge 1$. From Lemma 1 in Barron (1993), there is a subset $S \subset \{1, 2, \ldots, L\}$ of size $k$ such that $\| \sum_{j=1}^{L} \sum_{i=1}^{i_0} \beta_{j,i} \varphi_{j,i} - \sum_{j \in S} \sum_{i=1}^{i_0} \beta_{j,i} \varphi_{j,i} \|^2 \le C k^{-1}$ for some constant $C > 0$. Thus, taking $N = i_0$, the overall approximation error is upper bounded in order by $\rho_N^{L,k}(f) = O\left((L+1)^{-2\gamma} + k^{-1}\right)$.

From (5), (6), and (7) and the above, we have that, under Conditions B0 and B1, with the choice of $N$ of order $n^{1/(2s+1)}$,
$$R_1(f; n; \Delta) = O\left( \inf_{N} \left( (N+1)^{-2s} + \frac{N}{n} \right) \right) = O\left( n^{-\frac{2s}{1+2s}} \right);$$
under Condition B0, with the choice of $N$ of order $n^{\frac{1}{1 + s(2 + 1/\gamma)}}$ and $L$ of order $n^{\frac{s}{\gamma(1 + s(2 + 1/\gamma))}}$,
$$R_2(f; n; \Delta) = O\left( \inf_{L, N} \left( (N+1)^{-2s} + (L+1)^{-2\gamma} + \frac{LN}{n} + \psi_n(L) \right) \right) = O\left( n^{-\frac{2s}{1 + s(2 + 1/\gamma)}} \right);$$
and under Conditions B0 and B2, with the choice of $k$ of order $\sqrt{n/\log n}$, $L$ of order $n^{1/(4\gamma)}$, and $N = i_0$,
$$R_3(f; n; \Delta) = O\left( \inf_{L, N} \; \inf_{1 \le k \le L-1} \left( (L+1)^{-2\gamma} + k^{-1} + \psi_n(k) + \frac{k \log L}{n} + \frac{kN}{n} \right) \right) = O\left( (\log n/n)^{1/2} \right).$$
The conclusions of Corollary 2 follow. This completes the proof of Corollary 2.
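The order arithmetic behind the choices of $N$, $L$, and $k$ in this proof can be checked numerically; the sketch below plugs those choices into the bounds (5)-(7), dropping constants, for illustrative values of $s$, $\gamma$, $i_0$, and $n$, and compares the result with the target rate $(\log n/n)^{1/2}$.

```python
import math

def psi(n, L):
    """psi_n(L) from Section 3 (penalty for linear aggregation over L procedures)."""
    if L < math.sqrt(n):
        return L * math.log(1.0 + n / L) / n
    return math.log(L) / math.sqrt(n * math.log(n))

def corollary2_terms(n, s, gamma, i0=5):
    """Evaluate (up to constants) the terms in the bounds for R_1, R_2, R_3
    at the choices of N, L, k used in the proof of Corollary 2."""
    # R_1 under B0 and B1: N ~ n^{1/(2s+1)}
    N1 = n ** (1.0 / (2 * s + 1))
    R1 = N1 ** (-2 * s) + N1 / n
    # R_2 under B0: N ~ n^{1/(1+s(2+1/gamma))}, L ~ n^{s/(gamma*(1+s(2+1/gamma)))}
    den = 1.0 + s * (2.0 + 1.0 / gamma)
    N2, L2 = n ** (1.0 / den), n ** (s / (gamma * den))
    R2 = N2 ** (-2 * s) + L2 ** (-2 * gamma) + L2 * N2 / n + psi(n, L2)
    # R_3 under B0 and B2: k ~ sqrt(n/log n), L ~ n^{1/(4 gamma)}, N = i0
    k = math.sqrt(n / math.log(n))
    L3 = n ** (1.0 / (4 * gamma))
    R3 = L3 ** (-2 * gamma) + 1.0 / k + psi(n, k) + k * math.log(L3) / n + k * i0 / n
    return R1, R2, R3, math.sqrt(math.log(n) / n)

print(corollary2_terms(n=10**6, s=1.0, gamma=1.0))
```

With these choices the individual terms in each bound are of the same order, which is exactly the balancing used to obtain the rates (10)-(12).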
References

[1] Barron, A.R. (1993) Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory, 39, 930-945.
[2] Barron, A.R. (1994) Approximation and estimation bounds for artificial neural networks. Machine Learning, 14, 115-133.
[3] Barron, A.R. and Cover, T.M. (1991) Minimum complexity density estimation. IEEE Trans. on Information Theory, 37, 1034-1054.
[4] Barron, A.R., Birge, L. and Massart, P. (1999) Risk bounds for model selection via penalization. Probability Theory and Related Fields, 113, 301-413.
[5] Barron, A.R., Rissanen, J., and Yu, B. (1998) The minimum description length principle in coding and modeling. IEEE Trans. on Information Theory, 44, 2743-2760.
[6] Bates, J.M. and Granger, C.W.J. (1969) The combination of forecasts. Operational Research Quarterly, 20, 451-468.
[7] Birge, L. and Massart, P. (1996) From model selection to adaptive estimation. In Research Papers in Probability and Statistics: Festschrift in Honor of Lucien Le Cam (D. Pollard, E. Torgersen and G. Yang, eds.), 55-87, Springer, New York.
[8] Breiman, L. (1996) Stacked regressions. Machine Learning, 24, 49-64.
[9] Buckland, S.T., Burnham, K.P., and Augustin, N.H. (1995) Model selection: An integral part of inference. Biometrics, 53, 603-618.
[10] Catoni, O. (1997) The mixture approach to universal model selection. Technical Report LIENS-97-22, Ecole Normale Superieure, Paris, France.
[11] Cesa-Bianchi, N., Freund, Y., Haussler, D.P., Schapire, R., and Warmuth, M.K. (1997) How to use expert advice. Journal of the ACM, 44, 427-485.
[12] Cesa-Bianchi, N. and Lugosi, G. (1999) On prediction of individual sequences. Accepted by Ann. Statistics.
[13] Clemen, R.T. (1989) Combining forecasts: a review and annotated bibliography. Intl. J. Forecast., 5, 559-583.
[14] Donoho, D.L. and Johnstone, I.M. (1994) Ideal denoising in an orthonormal basis chosen from a library of bases. C. R. Acad. Sci. Paris, 319, 1317-1322.
[15] Donoho, D.L. and Johnstone, I.M. (1998) Minimax estimation via wavelet shrinkage. Ann. Statistics, 26, 879-921.
[16] Edmunds, D.E. and Triebel, H. (1989) Entropy numbers and approximation numbers in function spaces. Proc. London Math. Soc., 58, 137-152.
[17] Juditsky, A. and Nemirovski, A. (1996) Functional aggregation for nonparametric estimation. Publication Interne, IRISA, N. 993.
[18] Kolmogorov, A.N. and Tihomirov, V.M. (1959) $\epsilon$-entropy and $\epsilon$-capacity of sets in function spaces. Uspehi Mat. Nauk, 14, 3-86.
[19] Merhav, N. and Feder, M. (1998) Universal prediction. IEEE Trans. on Information Theory, 44, 2124-2147.
[20] LeBlanc, M. and Tibshirani, R. (1996) Combining estimates in regression and classification. J. Amer. Statist. Assoc., 91, 1641-1650.
[21] Littlestone, N. and Warmuth, M.K. (1994) The weighted majority algorithm. Information and Computation, 108, 212-261.
[22] Mallat, S.G. and Zhang, Z. (1993) Matching pursuits with time-frequency dictionaries. IEEE Trans. on Signal Processing, 41, 3397-3415.
[23] Rice, J. (1984) Bandwidth choice for nonparametric regression. Ann. Statist., 12, 1215-1230.
[24] Ruppert, D., Wand, M.P., Holst, U., and Hossjer, O. (1997) Local polynomial variance-function estimation. J. Amer. Statist. Assoc., 39, 262-273.
[25] Schutt, C. (1984) Entropy numbers of diagonal operators between symmetric Banach spaces. J. Approx. Theory, 40, 121-128.
[26] Stone, M. (1974) Cross-validatory choice and assessment of statistical predictions (with discussion). J. Roy. Statist. Soc., Ser. B, 36, 111-147.
[27] Vovk, V.G. (1990) Aggregating strategies. In Proceedings of the 3rd Annual Workshop on Computational Learning Theory, 372-383.
[28] Wolpert, D. (1992) Stacked generalization. Neural Networks, 5, 241-259.
[29] Yang, Y. (1996) Minimax Optimal Density Estimation. Ph.D. Dissertation, Department of Statistics, Yale University, May, 1996.
[30] Yang, Y. (1998) Combining different procedures for adaptive regression. Accepted by Journal of Multivariate Analysis.
[31] Yang, Y. (1999a) Model selection for nonparametric regression. Statistica Sinica, 9, 475-499.
[32] Yang, Y. (1999b) Mixing strategies for density estimation. To appear in the Ann. Statistics.
[33] Yang, Y. (1999c) Adaptive regression by mixing. Technical Report No. 12, Department of Statistics, Iowa State University.
[34] Yang, Y. and Barron, A.R. (1997) Information-theoretic determination of minimax rates of convergence. Technical Report No. 28, Department of Statistics, Iowa State University, IA.
[35] Yang, Y. and Barron, A.R. (1998) An asymptotic property of model selection criteria. IEEE Trans. on Information Theory, 44, 95-116.
[36] Yang, Y. and Barron, A.R. (1999) Information-theoretic determination of minimax rates of convergence. To appear in the Ann. Statistics.
... Some references to aggregation of arbitrary estimators in regression models are [13], [10], [17], [18], [9], [2], [15], [16] and [7]. This paper extends the results of the paper [4], which considers regression with fixed design and Gaussian errors W i . ...
Conference Paper
This paper shows that near optimal rates of aggregation and adaptation to unknown sparsity can be simultaneously achieved via ℓ1 penalized least squares in a nonparametric regression setting. The main tool is a novel oracle inequality on the sum between the empirical squared loss of the penalized least squares estimate and a term reflecting the sparsity of the unknown regression function.
Article
Full-text available
Unconditionally secure message authentication is an important part of Quantum Cryptography (QC). We analyze security effects of using a key obtained from QC for authentication purposes in later rounds of QC. In particular, the eavesdropper gains partial knowledge on the key in QC that may have an effect on the security of the authentication in the later round. Our initial analysis indicates that this partial knowledge has little effect on the authentication part of the system, in agreement with previous results on the issue. However, when taking the full QC protocol into account, the picture is different. By accessing the quantum channel used in QC, the attacker can change the message to be authenticated. This together with partial knowledge of the key does incur a security weakness of the authentication. The underlying reason for this is that the authentication used, which is insensitive to such message changes when the key is unknown, becomes sensitive when used with a partially known key. We suggest a simple solution to this problem, and stress usage of this or an equivalent extra security measure in QC.
Article
Full-text available
We compare Bayes Model Averaging, BMA, to a non-Bayes form of model averaging called stacking. In stacking, the weights are no longer posterior probabilities of models; they are obtained by a technique based on cross-validation. When the correct data generating model (DGM) is on the list of models under consideration BMA is never worse than stacking and often is demonstrably better, provided that the noise level is of order commensurate with the coefficients and explanatory variables. Here, however, we focus on the case that the correct DGM is not on the model list and may not be well approximated by the elements on the model list. We give a sequence of computed examples by choosing model lists and DGM's to contrast the risk performance of stacking and BMA. In the first examples, the model lists are chosen to reflect geometric principles that should give good performance. In these cases, stacking typically outperforms BMA, sometimes by a wide margin. In the second set of examples we examine how stacking and BMA perform when the model list includes all subsets of a set of potential predictors. When we standardize the size of terms and coefficients in this setting, we find that BMA outperforms stacking when the deviant terms in the DGM 'point' in directions accommodated by the model list but that when the deviant term points outside the model list stacking seems to do better. Overall, our results suggest the stacking has better robustness properties than BMA in the most important settings.
Chapter
Full-text available
We present simple procedures for the prediction of a real valued sequence. The algorithms are based on a combination of several simple predictors. We show that if the sequence is a realization of a bounded stationary and ergodic random process then the average of squared errors converges, almost surely, to that of the optimum, given by the Bayes predictor. We offer an analog result for the prediction of stationary gaussian processes.
Article
We study the problem of aggregation of estimators. Given a collection of M different estimators, we construct a new estimator, called aggregate, which is nearly as good as the best linear combination over an l 1-ball of ℝM of the initial estimators. The aggregate is obtained by a particular version of the mirror averaging algorithm. We show that our aggregation procedure statisfies sharp oracle inequalities under general assumptions. Then we apply these results to a new aggregation problem: D-convex aggregation. Finally we implement our procedure in a Gaussian regression model with random design and we prove its optimality in a minimax sense up to a logarithmic factor.
Article
Full-text available
The conditional variance function in a heteroscedastic, nonparametric regression model is estimated by linear smoothing of squared residuals. Attention is focused on local polynomial smoothers. Both the mean and variance functions are assumed to be smooth, but neither is assumed to be in a parametric family. The biasing effect of preliminary estimation of the mean is studied, and a degrees-of-freedom correction of bias is proposed. The corrected method is shown to be adaptive in the sense that the variance function can be estimated with the same asymptotic mean and variance as if the mean function were known. A proposal is made for using standard bandwidth selectors for estimating both the mean and variance functions. The proposal is illustrated with data from the LIDAR method of measuring atmospheric pollutants and from turbulence-model computations.
Article
We attempt to recover an unknown function from noisy, sampled data. Using orthonormal bases of compactly supported wavelets, we develop a nonlinear method which works in the wavelet domain by simple nonlinear shrinkage of the empirical wavelet coefficients. The shrinkage can be tuned to be nearly minimax over any member of a wide range of Triebel- and Besov-type smoothness constraints and asymptotically minimax over Besov bodies with p ≤ q. Linear estimates cannot achieve even the minimax rates over Triebel and Besov classes with p < 2, so the method can significantly outperform every linear method (e.g., kernel, smoothing spline, sieve) in a minimax sense. Variants of our method based on simple threshold nonlinear estimators are nearly minimax. Our method possesses the interpretation of spatial adaptivity; it reconstructs using a kernel which may vary in shape and bandwidth from point to point, depending on the data. Least favorable distributions for certain of the Triebel and Besov scales generate objects with sparse wavelet transforms. Many real objects have similarly sparse transforms, which suggests that these minimax results are relevant for practical problems. Sequels to this paper, which was first drafted in November 1990, discuss practical implementation, spatial adaptation properties, universal near minimaxity and applications to inverse problems.
Article
Two separate sets of forecasts of airline passenger data have been combined to form a composite set of forecasts. The main conclusion is that the composite set of forecasts can yield lower mean-square error than either of the original forecasts. Past errors of each of the original forecasts are used to determine the weights to attach to these two original forecasts in forming the combined forecasts, and different methods of deriving these weights are examined.
Article
We argue that model selection uncertainty should be fully incorporated into statistical inference whenever estimation is sensitive to model choice and that choice is made with reference to the data. We consider different philosophies for achieving this goal and suggest strategies for data analysis. We illustrate our methods through three examples. The first is a Poisson regression of bird counts in which a choice is to be made between inclusion of one or both of two covariates. The second is a line transect data set for which different models yield substantially different estimates of abundance. The third is a simulated example in which truth is known.