Aggregating Regression Procedures for a Better Performance
Yuhong Yang
Department of Statistics
Iowa State University
yyang@iastate.edu
December, 1999
Abstract

Methods have been proposed to linearly combine candidate regression procedures to improve estimation accuracy. Applications of these methods in many examples are very successful, pointing to the great potential of combining procedures. A fundamental question regarding combining procedures is: What is the potential gain and how much does one need to pay for it?

A partial answer to this question is obtained by Juditsky and Nemirovski (1996) for the case when a large number of procedures are to be combined. We attempt to give a more general solution. Under an $l_1$ constraint on the linear coefficients, we show that for pursuing the best linear combination over $n^{\alpha}$ procedures, in terms of rate of convergence under the squared $L_2$ loss, one can pay a price of order $O\left(\log n/n^{1-\alpha}\right)$ when $0 < \alpha \le 1/2$ and a price of order $O\left((\log n/n)^{1/2}\right)$ when $1/2 \le \alpha < 1$. These rates cannot be improved or essentially improved in a uniform sense. This result suggests that one should be cautious in pursuing the best linear combination, because one may end up paying a high price for nothing when linear combination in fact does not help. We show that with care in aggregation, the final procedure can automatically avoid paying the high price for such a case and then behaves as well as the best candidate procedure in terms of rate of convergence.

Keywords and phrases: Aggregating procedures, adaptive estimation, linear combining, nonparametric regression.
1 Introduction

Recently, new ideas on combining different procedures for estimation, coding, forecasting, or learning have been considered in statistics and several related fields, resulting in a number of very interesting results. The common theme behind these works is to automatically share the strength of the individual procedures in some sense. In the context of machine learning, it has been shown that with an appropriate weighting method, a combined procedure can behave close to the best procedure in terms of a certain cumulative loss; see, e.g., Vovk (1990), Littlestone and Warmuth (1994), Cesa-Bianchi et al. (1997), and Cesa-Bianchi and Lugosi (1999). The focus has been on deriving mixed strategies with optimal performance without any probabilistic assumptions at all on the generation of the data. In the field of forecasting, combined forecasts have been shown to work better in various examples; see, e.g., Clemen (1989) for a review of work in that direction. In information theory, the study of universal coding in the spirit of adaptation has resulted in very interesting and powerful techniques also useful in other related fields such as machine learning and statistics. See Merhav and Feder (1998) and Barron, Rissanen and Yu (1998) for reviews of work in that field. In statistics, several methods have recently been proposed to linearly combine regression estimators. They include a model-selection-criterion-based method by Buckland et al. (1995), cross-validation based "stacking" by Wolpert (1992) and Breiman (1996) (an earlier version is in Stone (1974)), a bootstrap based method by LeBlanc and Tibshirani (1996), a stochastic approximation based method by Juditsky and Nemirovski (1996), and information-theoretic methods to combine density and regression estimators by Yang (1996, 1998, 1999b, 1999c) and Catoni (1997) for density estimation. Juditsky and Nemirovski proposed algorithms and derived interesting theoretical upper and lower bounds for linear aggregation in pursuing the best performance among the linearly combined estimators (with coefficients subject to an appropriate constraint). Yang (1998, 1999c) shows that with proper weighting, a combined procedure has a risk bounded above by a multiple of the smallest risk over the original procedures plus a small penalty.

The above mentioned theoretical work in statistics is in two related but different directions: one aiming at automatically achieving the best possible performance among the given collection of candidate procedures, and the other aiming at improving the performance of the original procedures. For the latter, the hope is that an aggregated procedure (through a convex or linear combination of the original procedures with data-dependent coefficients) will significantly outperform each individual candidate procedure. Clearly the second direction is more aggressive. If one could identify the best linearly combined procedure, pursuing the best performance among the candidate procedures would be too conservative. On the other hand, common sense suggests that if one asks for more, one needs to pay more. The present paper intends to contribute to the theoretical understanding of the gain and price for pursuing the best linear combination.
Suppose that we have $M$ candidate regression procedures and consider the squared $L_2$ risk as a performance measure in estimating the regression function. In Yang (1998, 1999c) it is shown that a suitable data-dependent convex combination of these procedures results in an estimator that (under a minor condition) has a risk within a multiple of the smallest risk among the candidate procedures plus a small penalty of order $(\log M)/n$. Thus in terms of rate of convergence, with $M$ candidate procedures to be combined, one only needs to pay a price basically of order $(\log M)/n$ for performing nearly as well as the best candidate procedure (which, of course, is unknown to the statistician). As long as $M$ does not increase exponentially fast in $n$, the discrepancy $(\log M)/n$ is of order $\log n/n$, which does not affect the rate of convergence for typical nonparametric regression. As a consequence, when polynomially many nonparametric procedures are suitably combined, the estimator automatically converges at the best rate offered by the individual procedures. For the more aggressive goal of pursuing the best linear combination of the candidate procedures, under the constraint that the $l_1$ norm of the linear coefficients is bounded above by 1, Juditsky and Nemirovski (1996) proposed algorithms and showed that with $M$ estimators to be combined, the aggregated estimator has a risk within a multiple of $\sqrt{(\log M)/n}$ of the smallest risk over all the linear combinations of the estimators. Furthermore, they show that, in general, this order $\sqrt{(\log M)/n}$ cannot be overcome uniformly by any combining method. Thus compared to combining for attaining the best performance, one has to pay a much higher price, $\sqrt{(\log M)/n}$, for searching for the best linear combination of the original procedures.
The work of Juditsky and Nemirovski (1996) is targeted at the case when $M$ is large (e.g., their results are applied to restore Barron's class with $M$ of a polynomial order in $n$). They derived the above mentioned lower bound when $M$ and $n$ have the relationship $C_1 \log M \le n \le C_2 M \log M$ (where the constants $C_1$ and $C_2$ depend on the variance of the error and the assumed known upper bound on the supremum norm of the regression function $f$). The relationship implies that $M$ is at least of order $n/\log(n)$. It is unclear then what happens when $M$ is of a smaller order. For such a case, the order $\sqrt{(\log M)/n}$ may no longer be a valid lower bound. In the extreme case with $M$ fixed ($M$ does not grow as $n \to \infty$), one would expect a penalty of order close to the parametric rate $1/n$ instead of order $n^{-1/2}$.
In this paper, we show that when $M$ is of order $n^{\alpha}$, one only needs to pay a price of order $\log n/n^{1-\alpha}$ for $0 < \alpha \le 1/2$. This rate cannot be improved uniformly beyond a logarithmic factor. Note that the order of the price increases dramatically as $\alpha$ increases from 0, but after $\alpha \ge 1/2$, it stays at the rate $\sqrt{(\log n)/n}$ as long as $\alpha < 1$. This phenomenon is closely related to the advantage of sparse approximations as observed in wavelet estimation (see, e.g., Donoho and Johnstone (1998)), neural networks and subset selection (see, e.g., Barron (1994), Yang and Barron (1998), Yang (1999a), and Barron, Birge and Massart (1999)). Under the $l_1$ constraint on the linear coefficients, when $\alpha > 1/2$, there cannot be too many (relative to $M$) large coefficients, and combining sparsely selected procedures with suitably large coefficients achieves the optimal performance.
In applications, one does not know if the best linear combination can substantially improve the estimation accuracy so that the high price of order, e.g., $(\log n)/n^{1/2}$ is justified. Accordingly, it is not clear which direction to go when combining the candidate procedures. We show that, fortunately, with some care in combining, an estimator can be aggressive and conservative automatically in the right way. For convenience in discussion, we will call the conservative goal combining for adaptation, and the aggressive goal combining for improvement.

The paper is organized as follows. In Section 2, we derive general risk bounds for combining $M$ procedures. In Section 3, we study a combined procedure suitable for different purposes at the same time. In Section 4, we give an illustration using linear and sparse approximations. We briefly mention a generalization of the main results in Section 5. In Section 6, a basic combining algorithm and its property are presented, which provides a tool for the main results in this paper. The proofs of the results are in Section 7.
2 Risk bounds on linear aggregation

Consider the regression model $Y_i = f(X_i) + \sigma \varepsilon_i$, $i = 1, \ldots, n$, where $(X_i, Y_i)_{i=1}^{n}$ are i.i.d. copies from the joint distribution of $(X, Y)$ with $Y = f(X) + \sigma\varepsilon$. The explanatory variable $X$ (which could be high-dimensional) has an unknown distribution $P_X$. The variance parameter $\sigma > 0$ is unknown, and the random variable $\varepsilon$ is assumed to have a known density function $h(x)$ (with respect to Lebesgue or a general measure $\mu$) with mean 0 and variance 1. The goal is to estimate the regression function $f$ based on the data $Z^n = (X_i, Y_i)_{i=1}^{n}$.
Let $\delta$ be a regression estimation procedure producing estimator $\hat{f}_i(x) = \hat{f}_i(x; Z^i)$ for each $i \ge 1$. Let $\|\cdot\|$ denote the $L_2$ norm with respect to the distribution of $X$, i.e., $\|g\| = \sqrt{\int g^2(x) P_X(dx)}$. Let $R(f; n; \delta) = E\|f - \hat{f}_n\|^2$ denote the risk of the procedure $\delta$ at the sample size $n$ under the squared $L_2$ loss.
Let $\Delta = \{\delta_1, \delta_2, \ldots, \delta_M\}$ denote a collection of candidate procedures to be aggregated. Let $\hat{f}_{j,i}(x) = \hat{f}_{j,i}(x; Z^i)$ denote the estimator of $f$ based on procedure $\delta_j$ given the observations $Z^i$ for $i \ge 1$. Assume $M = M_n$ changes according to the sample size $n$. In particular, we will consider the case when $M = Cn^{\alpha}$ for some $0 < \alpha < 1$. When the sample size increases, one is allowed to consider more candidate procedures (possibly more and more complicated).
As in Juditsky and Nemirovski (1996), the coefficients for the linear combination are suitably constrained. Let
$$\mathcal{F}_n = \Big\{ \sum_{1 \le j \le M} \theta_j \hat{f}_{j,n}(x) : \sum_{1 \le j \le M} |\theta_j| \le 1 \Big\}$$
be the collection of linear combinations of the original estimators in $\Delta$ with coefficients summing up to no more than 1 in absolute values. The hope behind the consideration of linear aggregation is that a certain combination of the original estimators might have a much better performance than the individual ones. Advantages of such combining have been empirically demonstrated in several related fields (e.g., Bates and Granger (1969), Breiman (1996)). Let $\|\cdot\|_1^M$ denote the $l_1$ norm on $R^M$, i.e., $\|\theta\|_1^M = \sum_{1 \le j \le M} |\theta_j|$. Define
$$R(f; n; \Delta) = \inf_{\|\theta\|_1^M \le 1} E \Big\| f - \sum_{1 \le j \le M} \theta_j \hat{f}_{j,n} \Big\|^2.$$
It is the smallest risk over all the estimators in the linear aggregation class $\mathcal{F}_n$. Obviously, $R(f; n; \Delta) \le \inf_{1 \le j \le M_n} R(f; n; \delta_j)$. In this paper, unless stated otherwise, by linear combination we mean linear combination with the coefficients satisfying the above $l_1$ constraint.
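To make the objects $\mathcal{F}_n$ and $R(f; n; \Delta)$ concrete, the following is a minimal Python sketch (not part of the paper's construction) that forms an $l_1$-constrained linear combination of candidate fitted values and evaluates its empirical squared loss; the candidate fits, the coefficient vector, and the toy data are placeholders introduced here for illustration only.

```python
import numpy as np

def l1_project(theta):
    """Rescale theta so that its l_1 norm is at most 1 (the constraint defining F_n)."""
    s = np.sum(np.abs(theta))
    return theta if s <= 1.0 else theta / s

def combined_fit(fits, theta):
    """Linear combination sum_j theta_j * fhat_j evaluated at the design points.

    fits:  (M, n) array, row j holds fhat_j evaluated at the n design points.
    theta: (M,) coefficient vector with sum_j |theta_j| <= 1.
    """
    return theta @ fits

def empirical_sq_loss(f_true, fits, theta):
    """Empirical counterpart of ||f - sum_j theta_j fhat_j||^2."""
    resid = f_true - combined_fit(fits, theta)
    return np.mean(resid ** 2)

# toy usage with placeholder candidate estimators
rng = np.random.default_rng(0)
n, M = 200, 5
x = rng.uniform(size=n)
f_true = np.sin(2 * np.pi * x)
# hypothetical candidate fits (stand-ins for procedures delta_1, ..., delta_M)
fits = np.vstack([np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n) for _ in range(M)])
theta = l1_project(rng.normal(size=M))
print(empirical_sq_loss(f_true, fits, theta))
```

In the theory above, the infimum of this squared loss (in risk form) over all such coefficient vectors is what defines $R(f; n; \Delta)$; the aggregation procedures studied below approximate that infimum from data.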
We need the following assumptions for our results.

A1. The regression function $f(x)$ is uniformly bounded, i.e., $\|f\|_{\infty} \le A < \infty$. The variance parameter $\sigma$ is bounded above and below by known positive constants $\overline{\sigma} < \infty$ and $\underline{\sigma} > 0$.

A2. The error distribution $h$ has a finite fourth moment and satisfies the following: for each pair $0 < s_0 < 1$ and $T > 0$, there exists a constant $B$ (depending on $s_0$ and $T$) such that
$$\int h(x) \log \frac{h(x)}{\frac{1}{s}\, h\!\left(\frac{x - t}{s}\right)} \, dx \le B \left( (1 - s)^2 + t^2 \right)$$
for all $s_0 \le s \le s_0^{-1}$ and $-T < t < T$.

The constants $A$ and $B$ in the above assumptions are involved in the derivation of the risk bounds, but they need not be known to carry out our aggregation procedure. Assumption A2 is mild and is satisfied by Gaussian, double-exponential, and many other smooth distributions.

An algorithm, named ARM in Yang (1999c), to combine procedures for adaptation is given in Section 6. This algorithm serves as a building block for the results in this paper. Through a suitable discretization of the linear coefficients together with a sparse approximation, the problem of combining for improvement becomes the problem of combining for adaptation over a (much) larger class of procedures. We have the following performance upper bound.
Theorem 1: Assume that Conditions A1 and A2 are satisfied. For any given collection of estimation procedures $\Delta = \{\delta_j,\ 1 \le j \le M_n\}$, we can construct a combined procedure $\delta$ such that
$$R(f; n; \delta) \le C \cdot \begin{cases} R\left(f; \frac{n}{2}; \Delta\right) + \dfrac{M_n \log(1 + n/M_n)}{n} & \text{when } M_n < \sqrt{n}, \\[6pt] R\left(f; \frac{n}{4}; \Delta\right) + \dfrac{\log M_n}{\sqrt{n \log n}} & \text{when } M_n \ge \sqrt{n}, \end{cases}$$
where $C$ is a constant depending on $A$, $\overline{\sigma}$, $\underline{\sigma}$, and $h$. In particular, if $M_n \le C_0 n^{\alpha}$ for some $\alpha > 0$ and $C_0 > 0$, then
$$R(f; n; \delta) \le C' \cdot \begin{cases} R\left(f; \frac{n}{4}; \Delta\right) + \left(\dfrac{\log n}{n}\right)^{1/2} & \text{when } 1/2 \le \alpha < 1, \\[6pt] R\left(f; \frac{n}{2}; \Delta\right) + \dfrac{\log n}{n^{1-\alpha}} & \text{when } 0 < \alpha \le 1/2, \end{cases} \qquad (1)$$
where the constant $C'$ depends on $A$, $\overline{\sigma}$, $\underline{\sigma}$, $C_0$, and $h$.
Remark: The condition on $\sigma$ in Assumption A1 is mainly technical (it is not really needed to perform the procedure). The lower bound condition on $\sigma$ is not essential even from a technical point of view, since one can always add a little bit of noise to the observations to satisfy the condition, usually without affecting the rate of convergence.
The constructed procedure $\delta$ is given in the proof of Theorem 1 in Section 7. Note that for both parametric and nonparametric regression, for a good procedure $\delta$, $R(f; n; \delta)$ and $R(f; n/2; \delta)$ are usually of the same order. Thus it is typically the case that $R(f; n; \Delta)$ and $R\left(f; \frac{n}{2}; \Delta\right)$ converge at the same rate. From the result, when $\alpha \ge 1/2$, the penalty term for pursuing the best linear combination of $n^{\alpha}$ procedures is of order $((\log n)/n)^{1/2}$ (independent of $\alpha$). This rate is obtained by Juditsky and Nemirovski (1996) with a weaker assumption on the errors (finite variance), but requiring the knowledge of $A$. When $\alpha < 1/2$, our result above shows that the penalty is smaller in order, resulting in a possibly much faster rate of convergence. For an extreme example, when $M_n$ is fixed, the price we pay is only of order $\log n/n$.
How good are the upper bounds derived here? Juditsky and Nemirovski (1996) show that when $M$ and $n$ satisfy $C_1 \log M \le n \le C_2 M \log M$ for some constants $C_1$ and $C_2$ (i.e., $M$ is no smaller than order $n/\log n$ but not too large), the order $((\log n)/n)^{1/2}$ cannot be improved in a minimax sense. We show that, in general, the rates given in Theorem 1 cannot be improved, up to possibly a logarithmic factor for some cases. For simplicity, assume that the errors are normally distributed with variance 1.
Theorem 2: Consider $M_n = \lfloor C_0 n^{\alpha} \rfloor$ for some $\alpha > 0$. There exist $M_n$ procedures $\Delta_{M_n} = \{\delta_j,\ 1 \le j \le M_n\}$ such that for any aggregated procedure $\delta^{(n)}$ based on $\Delta_{M_n}$, one can find a regression function $f$ with $\|f\|_{\infty} \le \sqrt{2}$ satisfying
$$R\left(f; n; \delta^{(n)}\right) - R(f; n; \Delta_{M_n}) \ge C \cdot \begin{cases} \left(\dfrac{\log n}{n}\right)^{1/2} & \text{when } 1/2 < \alpha < 1, \\[6pt] \dfrac{1}{n^{1-\alpha}} & \text{when } 0 \le \alpha \le 1/2, \end{cases}$$
where the constant $C$ does not depend on $n$.
Thus no aggregation method can achieve the smallest risk over all the linear combinations within an order smaller than the ones given above in accordance with $\alpha$, uniformly over all bounded regression functions. Note that the lower rate matches the upper rate when $\alpha > 1/2$, and the upper and lower rates differ only in logarithmic factors when $0 \le \alpha \le 1/2$.
It is interesting to notice how the price (in rate) for combining for improvement changes according to $M_n$. In the beginning, it basically increases linearly in $M_n$, but after $M_n$ reaches $\sqrt{n}$, it increases much more slowly, in a logarithmic fashion. Accordingly, it stays at rate $\left(\frac{\log n}{n}\right)^{1/2}$ as long as $M_n$ increases polynomially in $n$.
In a dierent direction, Yang (1998, 1999c) shows that one only needs to pay the price of order
(log
M
)
=n
to pursuit the less ambitious goal of achieving the best performance among the original
M
procedures. Observing the dramatic dierence between the two penalties, one naturally faces the
question: Should we combine for adaptation or for improvement? If one of the original procedures
happen to behave the best (or close to the best) among all the linear combinations, or at least one of
the original procedures converges at a rate faster than (log
n
)
=n
1
?
(for 0
<
1
=
2) or
p
log
n=n
(for
1
=
2), if one aggregates for better performance, one could be unfortunately paying too high a price
for nothing but hurting the convergence rate in estimating
f
. In terms of rate of convergence, combining
for improvement is worth the eort only if
R
(
f
;
n=
2; ) plus the penalty in (1) is of a smaller order
than (log
M
)
=n
+ inf
j
R
(
f
;
n=
2;
j
). Since the risks are of course unknown, in applications, one does not
know in advance whether to combine for adaptation or combine for improvement. A wrong choice can
lead to a much worse rate of convergence. In the next section, we show one can actually handle the two
goals optimally at the same time.
3 Multi-purpose aggregation
Here we show that when combining the procedures properly, one can have the potential of obtaining a large gain in estimation accuracy yet without losing much when there happens to be no advantage in considering sophisticated linear combinations.

Let us consider a slightly different setting compared to the previous section. Suppose that we have a countable collection of candidate procedures $\Delta = \{\delta_1, \delta_2, \ldots\}$. Under this setting, one does not need to decide beforehand how many procedures should be included at a given sample size. Consider three different approaches to combine the procedures in $\Delta$.
The rst approach is to combine the procedures for adaptation. Here one intends to capture the
7
best performance in terms of rate of convergence among the candidate procedures. Let
A
denote this
combined procedure based on using the Three-Stage ARM Algorithm as given in the Section 6. Since
is not (necessarily) a nite collection, one can not use the uniform weight. The prior weight
j
is
taken to b e
ce
?
log
j
;
where log
is dened by log
x
= log(
x
+ 1) + 2 log log(
x
+ 1) and the constant
c
is
chosen to normalize the weights to add up to 1. Based on Proposition 1 in Section 6, we have that for
any
f
with
k
f
k
1
<
1
;
R
(
f
;
n
;
A
)
C
1
inf
j
log (
j
+ 1)
n
+
R
(
f
;
n=
2;
j
)
=:
C
1
R
1
(
f
;
n
; )
;
(2)
where the constant
C
1
depends on
k
f
k
1
;
,
;
and
h:
In the rest of the paper, unless stated otherwise,
a constant
C
(with or without subscript) may depend on
k
f
k
1
;
,
;
and
h:
For convenience, we may
use the same symbol
C
for dierent such constants in dierent places. From above, if one procedure,
say
j
behaves the best, then the penalty is of order
1
n
:
If the best estimator changes according to
n;
then inf
j
log(
j
+1)
n
+
R
(
f
;
n=
2;
j
)
is a trade-o between complexity and estimation accuracy.
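As a small illustration of the prior weights used for $\delta_A$, the sketch below computes $\pi_j \propto e^{-\log^* j}$ with $\log^* x = \log(x+1) + 2\log\log(x+1)$; truncating the infinite normalizing sum at a finite $J$ is an approximation introduced here purely for illustration.

```python
import math

def log_star(x):
    """Modified logarithm log*(x) = log(x+1) + 2*log(log(x+1)) from Section 3."""
    return math.log(x + 1) + 2.0 * math.log(math.log(x + 1))

def prior_weights(J):
    """Approximate prior weights pi_j = c * exp(-log*(j)), j = 1..J,
    with c chosen so the (truncated) weights sum to one."""
    raw = [math.exp(-log_star(j)) for j in range(1, J + 1)]
    c = 1.0 / sum(raw)          # truncation at J approximates the infinite sum
    return [c * w for w in raw]

pi = prior_weights(10000)
print(pi[:5], sum(pi))  # heavier weight on early (simpler) procedures
```

The weights decay roughly like $1/((j+1)\log^2(j+1))$, which is why the complexity charge in (2) is only logarithmic in the index $j$.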
The second approach targets the best performance among all the linear combinations of the original procedures up to different orders. For each integer $L \ge 1$, let $\delta^L$ denote the combined (for improvement) procedure based on the first $L$ procedures $\delta_1, \ldots, \delta_L$ as used for Theorem 1. Then combine the procedures $\{\delta^1, \delta^2, \ldots\}$ with weight $c\, e^{-\log^* j}$ for $j \ge 1$ as defined earlier. Let $\delta_B$ denote this combined procedure. Let $\Delta^L$ denote the set of the first $L$ procedures in $\Delta$. Let
$$\psi_n(L) = \begin{cases} \dfrac{L \log(1 + n/L)}{n} & 1 \le L < \sqrt{n}, \\[6pt] \dfrac{\log L}{\sqrt{n \log n}} & L \ge \sqrt{n}. \end{cases}$$
By Theorem 1 and Proposition 1, we have that for any $f$ with $\|f\|_{\infty} < \infty$,
$$R(f; n; \delta_B) \le C_2 \inf_L \left( R\left(f; \frac{n}{2}; \Delta^L\right) + \psi_n(L) \right) =: C_2 R_2(f; n; \Delta). \qquad (3)$$
The third approach recognizes that in many cases, when combining a lot of procedures, the best linear combination may concentrate on only a few procedures. For such a case, working with these important procedures only leads to a much smaller price when combining for improvement. This calls for additional care in aggregation, and it can be done as follows. For each integer $L > 1$, $1 \le k < L$, and a subset $S$ of $\{1, 2, \ldots, L\}$ of size $k$, let $\delta(S)$ be the combined (for improvement) procedure based on $\{\delta_j : j \in S\}$ as for (1). Then let $\delta_{L,k}$ be the combined (for adaptation) procedure based on all such $\delta(S)$ with uniform weight $1/\binom{L}{k}$ (there are $\binom{L}{k}$ many such procedures). Then let $\delta^{(L)}$ be the combined (for adaptation) procedure based on $\delta_{L,1}, \ldots, \delta_{L,L-1}$ using the uniform weight $1/(L-1)$. Let $\delta_C$ denote the combined (for adaptation) procedure based on $\delta^{(L)}$, $L \ge 2$, with weight $c_0 e^{-\log^* j}$, where the constant $c_0$ is chosen such that $\sum_{j=2}^{\infty} c_0 e^{-\log^* j} = 1$. Let $\Delta_S$ denote the collection of procedures $\{\delta_j : j \in S\}$. Based on Proposition 1 and Theorem 1, we have that for any $f$ with $\|f\|_{\infty} < \infty$,
$$R(f; n; \delta_C) \le C_3 \inf_{L \ge 2} \; \inf_{1 \le k \le L-1} \; \inf_{|S| = k,\ S \subset \{1, 2, \ldots, L\}} \left( R\left(f; \frac{n}{16}; \Delta_S\right) + \psi_n(k) + \frac{\log\binom{L}{k}}{n} \right) \qquad (4)$$
$$=: C_3 R_3(f; n; \Delta).$$
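A small numerical sketch (with illustrative values of $n$, $L$, and $k$ chosen here, not taken from the paper) compares the penalty $\psi_n(L)$ in (3) with the penalty $\psi_n(k) + \log\binom{L}{k}/n$ in (4), showing why the sparse route can be much cheaper when only $k \ll L$ procedures matter.

```python
import math

def psi(n, L):
    """Penalty psi_n(L): L*log(1+n/L)/n if L < sqrt(n), else log(L)/sqrt(n*log(n))."""
    if L < math.sqrt(n):
        return L * math.log(1.0 + n / L) / n
    return math.log(L) / math.sqrt(n * math.log(n))

def sparse_penalty(n, L, k):
    """Penalty term psi_n(k) + log(binom(L, k))/n appearing in the bound (4)."""
    return psi(n, k) + math.log(math.comb(L, k)) / n

n, L = 10_000, 500
print("full linear aggregation over L procedures:", psi(n, L))
for k in (5, 20, 50):
    print(f"sparse aggregation with k = {k}:", sparse_penalty(n, L, k))
```

For small $k$ the subset-search charge $\log\binom{L}{k}/n$ is modest, so the sparse penalty can fall well below $\psi_n(L)$; as $k$ grows, $\psi_n(k)$ eventually dominates and the advantage disappears.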
Now we combine these three procedures $\delta_A$, $\delta_B$, and $\delta_C$ with equal weight $1/3$, and let $\delta_F$ denote the final combined procedure. Note that it is still a linear combination of the original procedures. We have the following result.
Corollary 1: Assume Conditions A1 and A2 are satisfied. Then for each $f$ with $\|f\|_{\infty} < \infty$, we have
$$R(f; n; \delta_F) \le C \min\left( R_1(f; n/2; \Delta),\ R_2(f; n/2; \Delta),\ R_3(f; n/2; \Delta) \right),$$
where $R_1(f; n; \Delta)$, $R_2(f; n; \Delta)$, and $R_3(f; n; \Delta)$ are given in (2), (3), and (4).
The above result characterizes good performance of the final estimator simultaneously in three directions in terms of rate of convergence. First of all, the final estimator converges as fast as any original procedure. Secondly, when linear combinations of the first $L_n$ procedures (for some $L_n > 1$) can improve estimation accuracy dramatically, one pays a price of at most order $\psi_n(L_n)$ for the better performance. When $L_n$ is small, the gain is substantial. When certain linear combinations of a small number of procedures perform well, the final estimator can also take advantage of that. In summary, the final estimator can behave both aggressively (combining for improvement) and conservatively (combining for adaptation), whichever is better.
4 An illustration via linear approximation
We illustrate the result of multi-purpose aggregation studied in the previous section through an example with linear and sparse approximations. We assume that $x \in [0, 1]^d$ ($1 \le d \le \infty$). Let $\{\Phi_j : j = 1, 2, \ldots\}$ be a collection of linear approximation systems. For each $j$, $\Phi_j = \{\varphi_{j,1}(x), \varphi_{j,2}(x), \ldots\}$ is a chosen collection of linearly independent functions in $L_2[0, 1]^d$. Traditionally, orthonormal bases (or at least bases with some frame properties) have been emphasized. Recently, non-orthogonal and/or over-complete bases have been advocated and studied. Relaxation of orthogonality enables one to consider, e.g., trigonometric expansions with fractional frequencies and neural network models. Considering different bases provides much more flexibility, which gives a great potential to improve estimation accuracy, especially in high-dimensional settings. See Barron and Cover (1991), Mallat and Zhang (1993), Barron (1994), Donoho and Johnstone (1994), Juditsky and Nemirovski (1996), Yang and Barron (1998), Yang (1999a), and Barron, Birge and Massart (1999) for some work in those directions.
For a xed
j;
the (squared
L
2
) approximation error of
f
using the rst
N
terms is
j;N
(
f
) = inf
f
a
l
g
k
f
?
N
X
l
=1
a
l
'
j;l
k
2
:
We call this individual approximation. The approximation error of
f
using linear combinations of the
individual approximations of
f
up to
N
terms based on the rst
L
systems is
L
N
(
f
) = inf
f
a
j;l
g
k
f
?
L
X
j
=1
N
X
l
=1
a
j;l
'
j;l
k
2
:
We call this linearly combined approximation. Obviously
L
N
(
f
)
j;N
(
f
) for 1
j
L:
When
L
N
(
f
)
j;N
(
f
) for 1
j
L
with the right size, the advantage of considering linear combinations
over dierent systems can be substantial. The approximation error of
f
based on sparse approximation
using
k
out of the rst
L
systems is
L;k
N
(
f
) = inf
S
f
1
;:::;M
g
;
j
S
j
=
k
inf
f
a
j;l
g
k
f
?
X
j
2
S
N
X
l
=1
a
j;l
'
j;l
k
2
:
We call this sparsely combined approximation. The sparse approximation can improve estimation accu-
racy compared to the linearly combined approximation if only a few approximation systems are actually
needed in the linearly combined approximation, i.e., one can nd
k
L
such that
L;k
N
(
f
) is close to
L
N
(
f
)
:
For a given $j$ and $N$, traditional linear model estimators (e.g., estimators based on the least squares principle or projection estimators with orthogonal basis functions) can be used to estimate the best parameters in the linear approximation, resulting in the familiar bias-squared (approximation error) plus variance (estimation error) trade-off for the mean squared error. As is well known, the variance is typically of order $N/n$ under minor conditions.

Combining the approximation error and the estimation error, one can bound $R_1(f; n; \Delta)$, $R_2(f; n; \Delta)$, and $R_3(f; n; \Delta)$ as defined in (2), (3), and (4) as follows:
$$R_1(f; n; \Delta) = O\left( \inf_{j, N} \left( \rho_{j,N}(f) + \frac{N}{n} + \frac{\log j}{n} \right) \right), \qquad (5)$$
$$R_2(f; n; \Delta) = O\left( \inf_{L, N} \left( \rho_N^L(f) + \frac{LN}{n} + \psi_n(L) \right) \right), \qquad (6)$$
$$R_3(f; n; \Delta) = O\left( \inf_{L, N} \; \inf_{1 \le k \le L-1} \left( \rho_N^{L,k}(f) + \psi_n(k) + \frac{k \log L}{n} + \frac{kN}{n} \right) \right). \qquad (7)$$
Based on Corollary 1 and the above bounds, one can derive rates of convergence for the final aggregated procedure $\delta_F$ under various assumptions on the approximation errors $\rho_{j,N}(f)$, $\rho_N^L(f)$, and $\rho_N^{L,k}(f)$. The conclusion is basically that, in terms of rate of convergence, the final estimator behaves as well as the best estimator based on an individual approximation system, or as the linearly combined estimator, or as the sparsely combined estimator, whichever is the best.
When the basis functions are orthonormal, conditions on the $L_2$ approximation errors typically correspond to conditions on the coefficients, resulting in a simple characterization of the functions. Here we give an example. Suppose $d = \infty$ and assume $X = (X_1, X_2, \ldots)$ has independent, uniformly distributed components (or after suitable transformations). We assume the true regression function is additive, i.e.,
$$f(x) = c_0 + f_1(x_1) + f_2(x_2) + \cdots. \qquad (8)$$
To estimate the additive component $f_j(x_j)$, a linear approximation system $\Phi_j = \{\varphi_{j,1}(x_j), \varphi_{j,2}(x_j), \ldots\}$ is used. Assume the basis functions are orthonormal with mean zero. For a given $j$, let $\hat{f}_{j,N}(x_j)$ be the projection estimator of $f_j(x_j)$ based on the first $N$ basis functions in $\Phi_j$. That is, $\hat{f}_{j,N}(x_j) = \sum_{i=1}^{N} \hat{\beta}_{j,i} \varphi_{j,i}(x_j)$, where $\hat{\beta}_{j,i} = \frac{1}{n} \sum_{l=1}^{n} Y_l \varphi_{j,i}(X_{j,l})$. For simplicity, assume $\|f\|_{\infty} \le A$ for some known constant $A > 0$ and that the estimators are accordingly clipped into that range. Let $\delta_{j,N}$, $j \ge 1$, $N \ge 1$, denote these regression procedures. Let $\delta_A$, $\delta_B$, and $\delta_C$ be the differently combined procedures as constructed in the previous section and let $\delta_F$ denote the final procedure combining them together.
Assume $f_j(x_j) = \sum_{l=1}^{\infty} \beta_{j,l} \varphi_{j,l}(x_j)$ for $j \ge 1$ and assume the coefficients satisfy the following condition B0:
$$\sum_{j=1}^{\infty} j^{2\gamma} \left( \sum_{i=1}^{\infty} i^{2s} \beta_{j,i}^2 \right) < \infty \qquad (9)$$
for some $s > 0$ and $\gamma > 0$. When the true regression function is actually univariate in one variable, say $x_{j_0}$, then $\beta_{j,i} = 0$ for all $j$ and $i$ except $j = j_0$. If one knew this to be the case, one could ignore the other variables. Let B1 denote this condition. Another condition, denoted B2, is that $\beta_{j,i} = 0$ for all $j \ge 1$ and $i > i_0$ for some unknown integer $i_0$, and in addition,
$$\sum_{j=1}^{\infty} \sum_{i=1}^{i_0} |\beta_{j,i}| < \infty.$$
Corollary 2: Assume the errors are normally distributed with $\sigma^2$ bounded above and below by known constants. If $f$ satisfies Condition B0, we have
$$R(f; n; \delta_F) = O\left( n^{-\frac{2s}{1 + s(2 + 1/\gamma)}} \right). \qquad (10)$$
If $f$ satisfies Conditions B0 and B1, we have
$$R(f; n; \delta_F) = O\left( n^{-\frac{2s}{1 + 2s}} \right). \qquad (11)$$
If $f$ satisfies Conditions B0 and B2, we have
$$R(f; n; \delta_F) = O\left( (\log n/n)^{1/2} \right). \qquad (12)$$
Note that the procedure $\delta_F$ does not require knowledge of the constants $s$ and $\gamma$ (or $i_0$). Thus the rate $n^{-2s/(1 + s(2 + 1/\gamma))}$ is adaptively achieved. When $s$ or $\gamma$ is very small, the rate of convergence is very slow. Under the additional assumption of B1 or B2, a much better rate of convergence is automatically achieved by the aggregated procedure.
Remarks:

1. In the construction of the aggregated procedure $\delta_C$, sparseness is in terms of the number of procedures being combined. One can also consider sparseness in terms of the number of terms in the linear approximation within each approximation system. Then the same convergence rate $(\log n/n)^{1/2}$ can be obtained under Condition B2 without assuming that for each $j$, there are only finitely many non-zero coefficients. See Yang and Barron (1998) for such a treatment in density estimation based on model selection.
2. Under the assumptions that $X = (X_1, \ldots)$ has independent and uniformly distributed components and that the basis functions have mean zero, $E\, \varphi_{j,l}(X_j)\, \varphi_{j',l'}(X_{j'}) = 0$ for all $j \ne j'$. These strong conditions make the approximation error readily bounded under Condition B0. Without these conditions, the convergence rates in Corollary 2 can be shown to still hold under the direct conditions $\inf_{\{\beta_{j,l}\}} \| f - \sum_{j=1}^{J} \sum_{l=1}^{\infty} \beta_{j,l} \varphi_{j,l} \|^2 = O\left(J^{-2\gamma}\right)$ and $\sum_{i=1}^{\infty} i^{2s} \beta_{j,i}^2 < \infty$ for each $j$. Also, the additivity condition (8) can be expressed in terms of pre-specified linear combinations of the original explanatory variables rather than the original explanatory variables themselves.
3. If $f$ happens to be "parametric" in the sense that it can be expressed as a linear combination of finitely many basis functions (possibly across different systems), then the convergence rate of the final procedure is $O((\log n)/n)$, possibly losing a logarithmic factor.
4. When $s > 1/2$, under Condition B0 and the condition that $\beta_{j,i} = 0$ for all $j$ and $i$ except $j = j_0$, the summability condition $\sum_{j=1}^{\infty} \sum_{i=1}^{i_0} |\beta_{j,i}| < \infty$ in B2 is automatically satisfied. For this case, it can be shown that $R(f; n; \delta_F)$ in fact converges at a better rate, $n^{-2s/(2s+1)}$, than $(\log n/n)^{1/2}$.
5 Generalization

The main results in this paper can be generalized with little difficulty in two directions based on an analysis similar to that in Yang (1999c). Firstly, the error distribution $h$ need not be known completely. It suffices to assume that $h$ is in a countable collection of candidate error distributions. This gives more flexibility to handle errors with different degrees of heavy tail. Secondly, one does not need to require that the random errors have a constant variance function. Assume instead that for each $\delta_j$, in addition to having an estimator $\hat{f}_{j,n}$ of the regression function, we also have an estimator $\hat{\sigma}_{j,n}$ of the variance function. The procedures can share variance estimators if so desired. The procedures can be combined for estimating $f$ using both the regression estimators and the variance estimators (see Yang (1999c)). A recent work on variance estimation is Ruppert et al. (1997), where a local polynomial method is proposed with a theoretical justification.
6 A Three-Stage algorithm to combine procedures for adaptation

Let $\Delta = \{\delta_j,\ j \ge 1\}$ be a collection of regression procedures. The index set $\{j \ge 1\}$ is allowed to degenerate to a finite set. Let $\pi_j$ be positive numbers summing up to one, i.e., $\sum_{j=1}^{\infty} \pi_j = 1$. They will be used as prior weights on the procedures. The following is an algorithm to combine candidate procedures for adaptation, essentially as given in Yang (1999c).
A Three-Stage ARM Algorithm

Step 1. Split the data into three parts: $Z^{(1)} = (X_i, Y_i)_{i=1}^{n_1}$, $Z^{(2)} = (X_i, Y_i)_{i=n_1+1}^{n_1+n_2}$, and $Z^{(3)} = (X_i, Y_i)_{i=n_1+n_2+1}^{n}$. Let $n_3 = n - n_1 - n_2$.

Step 2. Obtain estimates $\hat{f}_{j,n_1}(x; Z^{(1)})$ of $f$ based on $Z^{(1)}$ for $j \ge 1$.

Step 3. Estimate the variance $\sigma^2$ for each procedure by
$$\hat{\sigma}_j^2 = \frac{1}{n_2} \sum_{i=n_1+1}^{n_1+n_2} \left( Y_i - \hat{f}_{j,n_1}(X_i) \right)^2.$$

Step 4. For each $j$, evaluate predictions. For $n_1 + n_2 + 1 \le k \le n$, predict $Y_k$ by $\hat{f}_{j,n_1}(X_k)$ and compute
$$E_{j,k} = \frac{\prod_{i=n_1+n_2+1}^{k} h\!\left( \frac{Y_i - \hat{f}_{j,n_1}(X_i)}{\hat{\sigma}_j} \right)}{\hat{\sigma}_j^{\,k - n_1 - n_2}}.$$

Step 5. Let
$$W_{j,k} = \frac{\pi_j E_{j,k}}{\sum_{l \ge 1} \pi_l E_{l,k}}$$
and compute the final weight
$$W_j = \frac{1}{n_3} \sum_{k=n_1+n_2+1}^{n} W_{j,k}.$$
The final estimator is
$$\tilde{f}_n(x) = \sum_{j=1}^{\infty} W_j \hat{f}_{j,n/2}(x). \qquad (13)$$
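Below is a minimal sketch of Steps 1-5 of the Three-Stage ARM Algorithm, assuming a Gaussian error density $h$ and a finite list of candidate procedures supplied as fitting functions; it returns the weights $W_j$, from which the final estimator (13) is the weighted combination of the half-sample fits. It is an illustration of the weighting scheme, not the exact implementation of Yang (1999c).

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def arm_weights(X, Y, fit_procedures, prior=None):
    """Three-Stage ARM weights W_j for a finite list of candidate procedures,
    assuming a Gaussian error density h (an assumption made for this sketch).

    fit_procedures: list of functions; each maps (X1, Y1) to a prediction
    function x -> fhat_j(x) trained on the first part of the data."""
    n = len(Y)
    n1, n2 = n // 2, n // 4                        # n1 = n/2, n2 = n3 = n/4
    X1, Y1 = X[:n1], Y[:n1]                        # Step 1: split the data
    X2, Y2 = X[n1:n1 + n2], Y[n1:n1 + n2]
    X3, Y3 = X[n1 + n2:], Y[n1 + n2:]

    fits = [proc(X1, Y1) for proc in fit_procedures]                      # Step 2
    sig = np.array([np.sqrt(np.mean((Y2 - f(X2)) ** 2)) for f in fits])   # Step 3
    J = len(fits)
    prior = np.full(J, 1.0 / J) if prior is None else np.asarray(prior)

    # Step 4: log E_{j,k} = sum_{i<=k} [log h((Y_i - fhat_j(X_i))/sig_j) - log sig_j]
    log_E = np.vstack([np.cumsum(norm.logpdf((Y3 - f(X3)) / s) - np.log(s))
                       for f, s in zip(fits, sig)])                       # (J, n3)
    # Step 5: W_{j,k} = pi_j E_{j,k} / sum_l pi_l E_{l,k}, then average over k
    log_num = np.log(prior)[:, None] + log_E
    W_jk = np.exp(log_num - logsumexp(log_num, axis=0, keepdims=True))
    return W_jk.mean(axis=1)                                              # W_j

# toy usage: two hypothetical candidate procedures (constant fit vs. linear fit)
rng = np.random.default_rng(2)
X = rng.uniform(size=400); Y = 2.0 * X + 0.5 * rng.normal(size=400)
const_proc = lambda X1, Y1: (lambda x: np.full_like(x, Y1.mean()))
lin_proc = lambda X1, Y1: (lambda x, c=np.polyfit(X1, Y1, 1): np.polyval(c, x))
print(arm_weights(X, Y, [const_proc, lin_proc]))   # weight should favor the linear fit
```

The log-space normalization is only a numerical safeguard; mathematically the weights are exactly those of Step 5, and the final estimate at a point $x$ would be the $W_j$-weighted sum of the candidate fits there.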
The combined estimator has the following theoretical property. For simplicity in notation, assume that $n$ is a multiple of 4, and then take $n_1 = n/2$ and $n_2 = n_3 = n/4$. We assume that the estimators $\hat{\sigma}_j$ are bounded above and below by the positive constants $\overline{\sigma}$ and $\underline{\sigma}$ (otherwise one needs to clip the estimators to be in that range).
Proposition 1: Assume Conditions A1 and A2 hold. Then the above convexly combined estimator $\tilde{f}_n$ satisfies
$$E \| f - \tilde{f}_n \|^2 \le C \inf_j \left( \frac{1 + \log(1/\pi_j)}{n} + E \| f - \hat{f}_{j,n/2} \|^2 \right),$$
where the constant $C$ depends only on $A$, $\overline{\sigma}$, $\underline{\sigma}$, and $h$. In particular, if there are $M$ procedures to be combined with uniform weight, then
$$E \| f - \tilde{f}_n \|^2 \le C \left( \frac{\log M}{n} + \inf_j E \| f - \hat{f}_{j,n/2} \|^2 \right).$$
Remarks:

1. In the ARM algorithm, the second stage is used to estimate $\sigma^2$. Here the estimators are derived in terms of predictions based on the individual regression procedures. The use of these variance estimators does not get in the way of estimating the regression function $f$ in terms of rate of convergence. One can also use common model-independent estimators of $\sigma^2$ (see, e.g., Rice (1984)). Then one does not need this stage, and accordingly, the risk of the variance estimators will appear in the risk bound on estimating $f$.

2. As discussed in Yang (1999c), the estimator $\tilde{f}_n$ depends on the order of the observations. For improvement, one can randomly permute the order of the observations a number of times and average the corresponding estimators.

3. In the definition of the final estimator $\tilde{f}_n = \sum_{j=1}^{\infty} W_j \hat{f}_{j,n/2}(x)$, we use $\hat{f}_{j,n/2}(x)$ instead of $\hat{f}_{j,n}(x)$ to have a cleaner risk bound. But $\hat{f}_{j,n}(x)$ should be a slightly better choice in terms of accuracy.

Proof of Proposition 1: The result is proved in Yang (1999c) for the case when there are finitely many, say $J$, candidate procedures with equal weight $\pi_j = 1/J$ for $1 \le j \le J$. The proof for the general case can be done similarly.
7 Proof of the results
Proof of Theorem 1: There are mainly two steps in our derivation of an aggregated procedure yielding the given risk bound. First, we discretize (with suitable accuracy) the coefficients for the linear combinations and then treat the set of all the corresponding discretely combined estimators as a new collection of candidate estimators. For suitable discretization, some results on metric entropy are very helpful. In the second step, we combine these estimators for adaptation using the algorithm ARM proposed in Yang (1999c) and described in Section 6. When $M_n$ is large, however, an additional difficulty arises, and an idea of sparse combining takes care of the problem.

We consider first the case when $M_n < \sqrt{n}$. Let $G = \{\theta = (\theta_1, \ldots, \theta_M) : \sum_{i=1}^{M} |\theta_i| \le 1\}$. Let $N_{\epsilon}$ be an $\epsilon$-net in $G$ under the $l_1^M$ distance, i.e., for each $\theta \in G$, there exists $\theta' \in N_{\epsilon}$ such that $\|\theta - \theta'\|_1^M = \sum_{i=1}^{M} |\theta_i - \theta'_i| \le \epsilon$. An $\epsilon$-net in $G$ yields a suitable net in the set $\mathcal{F}_n$ of the linear combinations of the original estimators. For simplicity in notation, let $\hat{f}_1, \ldots, \hat{f}_M$ denote the original estimators at the sample size $n$. Let $F_{\epsilon}$ be the set of the linear combinations of the estimators $\hat{f}_1, \ldots, \hat{f}_M$ with coefficients in $N_{\epsilon}$. Then for any estimator $\hat{f} = \sum_{i=1}^{M} \theta_i \hat{f}_i$ with $\theta \in G$, there exists $\theta' \in N_{\epsilon}$ such that
$$\Big\| \hat{f} - \sum_{i=1}^{M} \theta'_i \hat{f}_i \Big\| = \Big\| \sum_{i=1}^{M} (\theta_i - \theta'_i) \hat{f}_i \Big\| \le A \|\theta - \theta'\|_1^M \le A\epsilon. \qquad (14)$$
Now we combine all the estimators in $F_{\epsilon}$ using the ARM algorithm given in Section 6 with uniform weight $1/|N_{\epsilon}|$. Let $\hat{f}_n$ denote the combined estimator. By Proposition 1, for any $f$ with $\|f\|_{\infty} < \infty$, we have
$$E \| f - \hat{f}_n \|^2 \le C \frac{\log(|N_{\epsilon}|)}{n} + C \inf_{\hat{f} \in F_{\epsilon}} R(f; \hat{f}; n/2),$$
where $C$ depends only on $A$, $\overline{\sigma}$, $\underline{\sigma}$, and $h$. Since $F_{\epsilon}$ is an $(A\epsilon)$-net in $\mathcal{F}_n$, by the triangle inequality, for any $f$, we have $\inf_{\hat{f} \in F_{\epsilon}} R(f; \hat{f}; n/2) \le 2 \inf_{\hat{f} \in \mathcal{F}_n} R(f; \hat{f}; n/2) + 2 A^2 \epsilon^2$. It follows that
$$E \| f - \hat{f}_n \|^2 \le 2C \frac{\log(|N_{\epsilon}|)}{n} + 2C \inf_{\hat{f} \in \mathcal{F}_n} R(f; \hat{f}; n/2) + 2 A^2 C \epsilon^2. \qquad (15)$$
To get the best upper bound (in order), we need to minimize $\frac{\log(|N_{\epsilon}|)}{n} + 2 A^2 \epsilon^2$ when discretizing $G$. Note that the logarithm of the smallest size of $N_{\epsilon}$ is the covering entropy of the set $G$ under the $l_1^M$ distance (see, e.g., Kolmogorov and Tihomirov (1959) for properties of metric entropies). For this case, metric entropy orders are known. The following result is given in terms of the entropy number, i.e., the worst-case approximation error with the best net of size $2^k$ points. Let $\epsilon_k$ denote the entropy number of $G$. From Edmunds and Triebel (1989, Proposition 3.1.3), when $k \ge M$, $\epsilon_k \le c\, 2^{-k/M}$ for some constant $c$ independent of $k$ and $M$. Take
$$k = \frac{M \left(\log(n/M) + 2 \log 2\right)}{2 \log 2}$$
(note that $k \ge M$). (Strictly speaking, we need to round up or down to make $k$ an integer.) Then
$$\frac{\log(|N_{\epsilon}|)}{n} + 2 A^2 \epsilon^2 \le \frac{M \left(\log(n/M) + 2 \log 2\right)}{(2 \log 2)\, n} + \frac{(Ac)^2 M}{2n} \le c' \frac{M \log(1 + n/M)}{n},$$
where $c'$ depends only on $A$ and $c$. The upper bound in Theorem 1 for $M < \sqrt{n}$ then follows.
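The balance behind the choice of $k$ (hence $\epsilon$) in this discretization can be checked numerically; the sketch below evaluates the entropy cost $\log|N_{\epsilon}|/n$ and the bias cost $2A^2\epsilon^2$ with $\epsilon \approx c\, 2^{-k/M}$ against the target penalty $M\log(1+n/M)/n$, using placeholder values for $A$ and $c$.

```python
import math

def discretization_tradeoff(n, M, A=1.0, c=1.0):
    """Evaluate the two terms balanced in the proof of Theorem 1 (case M < sqrt(n)):
    an eps-net of size 2^k gives entropy cost ~ k*log(2)/n and bias cost ~ 2*(A*eps)^2
    with eps <= c * 2**(-k/M) (Edmunds and Triebel).  A and c are placeholders."""
    k = M * (math.log(n / M) + 2 * math.log(2)) / (2 * math.log(2))
    eps = c * 2.0 ** (-k / M)
    entropy_cost = k * math.log(2) / n
    bias_cost = 2.0 * (A * eps) ** 2
    target = M * math.log(1.0 + n / M) / n       # the penalty order in Theorem 1
    return entropy_cost, bias_cost, target

print(discretization_tradeoff(n=100_000, M=50))
```

Both costs come out of the same order as the target penalty, which is the point of the chosen $k$.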
Now consider the other case: $M \ge \sqrt{n}$. The argument above leads to a rate $(M/n) \log(1 + n/M)$ for $M \le n$, which, as will be seen, is only sub-optimal. For this case, due to the $l_1$ constraint, the number of large coefficients is small relative to $M$ when $M \ge \sqrt{n}$. An appropriate search for the large coefficients can result in the optimal rate of convergence, as we derive below.

Note that for $\|\theta\|_1^M \le 1$, $\| \sum_{i=1}^{M} \theta_i \hat{f}_i \| \le A$. Then by a sampling argument (see, e.g., Lemma 1 in Barron (1993)), for each $m$, there exist a subset $I \subset \{1, \ldots, M\}$ of size $m$ and $\theta'_I = (\theta'_i, i \in I)$ such that $\| \sum_{i=1}^{M} \theta_i \hat{f}_i - \sum_{i \in I} \theta'_i \hat{f}_i \| \le A/\sqrt{m}$. Taking $m = \sqrt{n/\log n}$, we have $\| \sum_{i=1}^{M} \theta_i \hat{f}_i - \sum_{i \in I} \theta'_i \hat{f}_i \| \le A (\log n/n)^{1/4}$. Consider an $\epsilon$-net in $B_I = \{\theta_I : \sum_{i \in I} |\theta_i| \le 1\}$ under the $l_1^m$ distance. Again by Edmunds and Triebel (1989), taking $k = \frac{m(\log(n/m) + 2\log 2)}{2 \log 2}$, the best $\epsilon$-net has approximation accuracy $\epsilon \le \frac{c}{2}\sqrt{m/n}$. Then, as in (14), we know that there exists $\theta''_I$ in this $\epsilon$-net such that $\| \sum_{i \in I} \theta'_i \hat{f}_i - \sum_{i \in I} \theta''_i \hat{f}_i \| \le \frac{Ac}{2}\sqrt{m/n}$. Thus for each $\hat{f} \in \mathcal{F}_n$, there exist $I \subset \{1, \ldots, M\}$ of size $m$ and $\theta''_I$ such that
$$\Big\| \sum_{i=1}^{M} \theta_i \hat{f}_i - \sum_{i \in I} \theta''_i \hat{f}_i \Big\| \le \frac{A (\log n)^{1/4}}{n^{1/4}} + \frac{Ac}{2\, n^{1/4} (\log n)^{1/4}} \le \frac{c'' (\log n)^{1/4}}{n^{1/4}},$$
where $c''$ depends only on $A$ and $c$. Notice that, in general, $I$ depends on $f$ and therefore it should be chosen adaptively. The above analysis suggests the following method of sparse combining.

For each fixed subset $I \subset \{1, \ldots, M\}$ of size $m$, discretize the linear coefficients as described above. Then (with uniform weight) combine the corresponding linear combinations of the procedures in $\Delta$. Then combine these (combined) procedures over all possible choices of $I$ (there are $\binom{M}{m}$ many such $I$ altogether) with uniform weight. Let $\delta$ denote this final procedure and let $\Delta_I = \{\delta_i, i \in I\}$. Applying Proposition 1 twice, we have that
$$R(f; n; \delta) \le C \left( R\left(f; \frac{n}{4}; \Delta\right) + \frac{(\log n)^{1/2}}{n^{1/2}} + \frac{m \log(n/m)}{n} + \frac{\log\binom{M}{m}}{n} \right) \le C' \left( R\left(f; \frac{n}{4}; \Delta\right) + \frac{\log M}{\sqrt{n \log n}} \right),$$
where the constants $C$ and $C'$ depend on $A$, $\overline{\sigma}$, $\underline{\sigma}$, and $h$. This completes the proof of Theorem 1.
Remark: In the above derivation, when $M > \sqrt{n}$, combining a small number (relative to $M$) of procedures together with a subset search yields a price of order $\sqrt{\log n/n}$ for $M$ of a polynomial order in $n$, which is the optimal rate based on Theorem 2 when $M$ is of a higher order than $\sqrt{n}$. Similar ideas on sparse subset selection are in, e.g., Barron (1994), Yang and Barron (1998), and Barron, Birge and Massart (1999).
We need a lemma on minimax lower bounds for the proof of Theorem 2. Let $d$ be a distance (metric) on a space $S$. For $D \subset S$, we say $G$ is an $\epsilon$-packing set in $D$ ($\epsilon > 0$) if $G \subset D$ and any two distinct members of $G$ are more than $\epsilon$ apart in the distance $d$. Now let $\mathcal{F}$ be a class of regression functions. The distance $d$ here is the $L_2$ distance.

Definition 1: (Global metric entropy) The packing $\epsilon$-entropy of $\mathcal{F}$ is the logarithm of the size of the largest $\epsilon$-packing set in $\mathcal{F}$. The packing $\epsilon$-entropy of $\mathcal{F}$ is denoted $M(\epsilon)$.

Definition 2: (Local metric entropy) The local $\epsilon$-entropy at $f \in \mathcal{F}$ is the logarithm of the size of the largest $(\epsilon/2)$-packing set in $B(f, \epsilon) = \{ f' \in \mathcal{F} : \| f' - f \| \le \epsilon \}$. The local $\epsilon$-entropy at $f$ is denoted by $M(\epsilon \mid f)$. The local $\epsilon$-entropy of $\mathcal{F}$ is defined as $M^{loc}(\epsilon) = \max_{f \in \mathcal{F}} M(\epsilon \mid f)$.

Both global and local entropies will be involved in our derivations of the lower bounds. Assume that $M^{loc}(\epsilon)$ is lower bounded by $\underline{M}^{loc}(\epsilon)$. Let $\epsilon_n$ be determined by
$$\underline{M}^{loc}(\epsilon_n) = n \epsilon_n^2 + 2 \log 2.$$
Assume $M(\epsilon)$ is upper bounded by $\overline{M}(\epsilon)$ and lower bounded by $\underline{M}(\epsilon)$. Let $\tau_n$ be determined by
$$\overline{M}(\sqrt{2}\, \tau_n) = n \tau_n^2 \qquad (16)$$
and $\eta_n$ be determined by
$$\underline{M}(\eta_n) = 4 n \tau_n^2 + 2 \log 2. \qquad (17)$$
Assume the random errors in the regression model are normally distributed with variance 1. The following lemma is useful for deriving minimax lower bounds using either global or local metric entropy.
Lemma 1: The minimax risk for estimating $f$ in $\mathcal{F}$ is lower bounded as follows:
$$\min_{\hat{f}} \max_{f \in \mathcal{F}} E \| f - \hat{f} \|^2 \ge \frac{\epsilon_n^2}{32}, \qquad \min_{\hat{f}} \max_{f \in \mathcal{F}} E \| f - \hat{f} \|^2 \ge \frac{\eta_n^2}{8},$$
where the minimization (or infimum) is over all regression estimators based on $Z^n = (X_i, Y_i)_{i=1}^{n}$.

The first bound in the lemma is from Yang and Barron (1999, Section 7) and the second one is from Yang and Barron (1997, Section 4).
Proof of Theorem 2: Let $\varphi_1(x), \varphi_2(x), \ldots$ be a uniformly bounded orthonormal basis (with respect to the distribution of $X$). An example is the trigonometric basis on $[0,1]$. Take $\delta_i$, $i \ge 1$, to be the procedure that always estimates $f$ by $\varphi_i(x)$. For each $M = C_0 n^{\alpha}$, consider the class of regression functions
$$\mathcal{F} = \left\{ f(x) = \theta_1 \varphi_1(x) + \cdots + \theta_M \varphi_M(x) : \|\theta\|_1^M \le 1 \right\}.$$
It is obvious that $R(f; n; \Delta_{M_n}) = 0$ for $f \in \mathcal{F}$. Thus to prove Theorem 2, it suffices to show that $\min_{\hat{f}} \max_{f \in \mathcal{F}} E \| f - \hat{f} \|^2 \ge C \lambda(n)$ for some constant $C > 0$ not depending on $n$, where $\lambda(n) = (\log n/n)^{1/2}$ for $1/2 < \alpha < 1$ and $\lambda(n) = n^{-(1-\alpha)}$ for $0 \le \alpha \le 1/2$.

Note that under the orthonormality assumption on the basis functions, the $L_2$ distance on $\mathcal{F}$ is the same as the $l_2$ distance on the coefficients in $\Theta = \{\theta : \|\theta\|_1^M \le 1\}$. Thus the entropy of $\mathcal{F}$ under the $L_2$ distance is the same as that of $\Theta$ under the $l_2^M$ distance. To apply Lemma 1, we lower bound the local entropy of $\mathcal{F}$ (or $\Theta$). Note that by the Cauchy-Schwarz inequality, the $l_1^M$ and $l_2^M$ norms have the relationship $\|\theta\|_1^M \le \sqrt{M} \|\theta\|_2^M$. Thus for $\epsilon \le M^{-1/2}$, taking $f \equiv 0$, we have
$$B(f, \epsilon) = \{ f \in \mathcal{F} : \| f \| \le \epsilon \} = \{ f : \|\theta\|_1^M \le 1,\ \|\theta\|_2^M \le \epsilon \} = \{ f : \|\theta\|_2^M \le \epsilon \}.$$
Consequently, for $\epsilon \le M^{-1/2}$, the $(\epsilon/2)$-packing of $B(f, \epsilon)$ under the $L_2$ distance is equivalent to the $(\epsilon/2)$-packing of $B_{\epsilon} = \{\theta : \|\theta\|_2^M \le \epsilon\}$ under the $l_2^M$ distance. Since a maximum $(\epsilon/2)$-packing set is an $(\epsilon/2)$-covering set, the union of the balls with radius $\epsilon/2$ centered at the points of a maximum packing set in $B_{\epsilon}$ covers $B_{\epsilon}$. It follows that the size of the maximum packing set is at least the ratio of the volumes of the balls $B_{\epsilon}$ and $B_{\epsilon/2}$, which is $2^M$. Thus we have shown that the local entropy $M^{loc}(\epsilon)$ of $\mathcal{F}$ under the $L_2$ distance is at least $\underline{M}^{loc}(\epsilon) = M \log 2$ for $\epsilon \le M^{-1/2}$.

For $M = C_0 n^{\alpha}$ with $0 \le \alpha \le 1/2$, solving $\underline{M}^{loc}(\epsilon_n) = n \epsilon_n^2 + 2 \log 2$ gives $\epsilon_n$ of order $n^{-(1-\alpha)/2}$. Note that for such $\alpha$, by possibly reducing $\underline{M}^{loc}(\epsilon)$ by a constant factor, the $\epsilon_n$ obtained this way can be made smaller than $M^{-1/2}$ (as required in the earlier derivation). By Lemma 1, we have proved the minimax lower rates for $\mathcal{F}$ when $0 \le \alpha \le 1/2$. That is, $\min_{\hat{f}} \max_{f \in \mathcal{F}} E \| f - \hat{f} \|^2 \ge C_1 n^{-(1-\alpha)}$ for some constant $C_1$ independent of $n$.

For $\alpha > 1/2$, we use the global entropy to derive the minimax lower bound. It is known from Schutt (1984) that the entropy number satisfies
$$c_1 \sqrt{\frac{\log(1 + M/k)}{k}} \le \epsilon_k \le c_2 \sqrt{\frac{\log(1 + M/k)}{k}}$$
for some constants $c_1$ and $c_2$ independent of $M$ and $k$ when $\log M \le k \le M$. We can choose $\tau_n$ and $\eta_n$ both of order $(\log n/n)^{1/4}$ to satisfy (16) and (17). This gives the minimax lower rate for $\mathcal{F}$ when $\alpha > 1/2$, i.e., $\min_{\hat{f}} \max_{f \in \mathcal{F}} E \| f - \hat{f} \|^2 \ge C_2 (\log n/n)^{1/2}$ for some constant $C_2$ independent of $n$. Finally, with the trigonometric basis, the functions in $\mathcal{F}$ satisfy $\|f\|_{\infty} \le \sqrt{2}$. The conclusion of Theorem 2 follows. This completes the proof of Theorem 2.
Remarks:

1. It is interesting to note that both the global and the local entropies are useful here for different cases. For $\alpha > 1/2$, the application of the global entropy gives the right rate of convergence. However, if one intends to use the minimax lower bound in terms of the local entropy, the above derivation of a local entropy bound does not work, because the critical $\epsilon$ is of order $(\log n/n)^{1/4}$, which is of a higher order than $M^{-1/2}$, and accordingly $B(f, \epsilon) \ne \{ f : \|\theta\|_2^M \le \epsilon \}$. On the other hand, for $0 \le \alpha \le 1/2$, the application of the local entropy method gives a rate that agrees with the upper bound up to a logarithmic factor. If one uses the global entropy, the lower bound by Lemma 1 differs substantially in rate from the upper bound. For the general relationship between global and local entropies, see Yang and Barron (1999, Section 7).

2. In the derivation of the lower bounds in Theorem 2, we choose very special (nonrandom) original estimators. This is of course not a typical situation in which one would consider combining estimation procedures. In applications, the candidate estimators (or many of them) are most likely somewhat highly correlated (they are estimating the same target), but probably not too highly correlated (otherwise one could gain little even by ideal combining). For such cases, the actual price paid by a good aggregation method is smaller than that given in Theorem 2, but probably not too much smaller.
Proof of Corollary 2: Assume that Condition B0 is satisfied. For a given $j$, the approximation error of $f_j(x_j)$ using the best first $N$ terms satisfies
$$\rho_{j,N}(f_j) = \Big\| f_j - \sum_{l=1}^{N} \beta_{j,l} \varphi_{j,l} \Big\|^2 = \sum_{i=N+1}^{\infty} \beta_{j,i}^2 \le \sum_{i=N+1}^{\infty} \frac{i^{2s} \beta_{j,i}^2}{(N+1)^{2s}} \le \frac{1}{(N+1)^{2s}} \sum_{i=1}^{\infty} i^{2s} \beta_{j,i}^2.$$
Thus under Condition B0 on $f$, we have $\rho_{j,N}(f_j) = O\left((N+1)^{-2s}\right)$ as $N \to \infty$. The approximation error of $f(x)$ using the basis functions $\varphi_{j,l}(x_j)$ with $1 \le j \le L$ and $1 \le l \le N$ satisfies
$$\rho_N^L(f) = \Big\| f - \sum_{j=1}^{L} \sum_{i=1}^{N} \beta_{j,i} \varphi_{j,i} \Big\|^2 = \sum_{j=L+1}^{\infty} \sum_{i=1}^{\infty} \beta_{j,i}^2 + \sum_{j=1}^{L} \sum_{i=N+1}^{\infty} \beta_{j,i}^2$$
$$\le \frac{1}{(L+1)^{2\gamma}} \sum_{j=L+1}^{\infty} j^{2\gamma} \sum_{i=1}^{\infty} i^{2s} \beta_{j,i}^2 + \frac{1}{(N+1)^{2s}} \sum_{j=1}^{L} j^{2\gamma} \sum_{i=N+1}^{\infty} i^{2s} \beta_{j,i}^2.$$
Thus the approximation error is $\rho_N^L(f) = O\left((N+1)^{-2s} + (L+1)^{-2\gamma}\right)$.

Under Conditions B0 and B2, sparse approximation has the potential to perform much better. From Condition B2, $\sum_{i=i_0+1}^{\infty} |\beta_{j,i}|^2 = 0$. Let $b_j = \sum_{i=1}^{i_0} |\beta_{j,i}|$. Then Condition B2 implies that there exists a constant $a$ such that $\| \sum_{j=1}^{L} \sum_{i=1}^{i_0} \beta_{j,i} \varphi_{j,i} \| \le \sum_{j=1}^{L} b_j \le a$ for all $L \ge 1$. From Lemma 1 in Barron (1993), there is a subset $S \subset \{1, 2, \ldots, L\}$ of size $k$ such that $\| \sum_{j=1}^{L} \sum_{i=1}^{i_0} \beta_{j,i} \varphi_{j,i} - \sum_{j \in S} \sum_{i=1}^{i_0} \beta_{j,i} \varphi_{j,i} \|^2 \le C k^{-1}$ for some constant $C > 0$. Thus, taking $N = i_0$, the overall approximation error is upper bounded in order by $\rho_N^{L,k}(f) = O\left((L+1)^{-2\gamma} + k^{-1}\right)$.

From (5), (6), and (7) and the above, we have that, under Conditions B0 and B1, with the choice of $N$ of order $n^{1/(2s+1)}$,
$$R_1(f; n; \Delta) = O\left( \inf_{N} \left( (N+1)^{-2s} + \frac{N}{n} \right) \right) = O\left( n^{-\frac{2s}{1+2s}} \right);$$
under Condition B0, with the choice of $N$ of order $n^{\frac{1}{1 + s(2 + 1/\gamma)}}$ and $L$ of order $n^{\frac{s}{\gamma(1 + s(2 + 1/\gamma))}}$,
$$R_2(f; n; \Delta) = O\left( \inf_{L, N} \left( (N+1)^{-2s} + (L+1)^{-2\gamma} + \frac{LN}{n} + \psi_n(L) \right) \right) = O\left( n^{-\frac{2s}{1 + s(2 + 1/\gamma)}} \right);$$
and under Conditions B0 and B2, with the choice of $k$ of order $\sqrt{n/\log n}$, $L$ of order $n^{1/(4\gamma)}$, and $N = i_0$,
$$R_3(f; n; \Delta) = O\left( \inf_{L, N} \; \inf_{1 \le k \le L-1} \left( (L+1)^{-2\gamma} + k^{-1} + \psi_n(k) + \frac{k \log L}{n} + \frac{kN}{n} \right) \right) = O\left( (\log n/n)^{1/2} \right).$$
The conclusions of Corollary 2 follow. This completes the proof of Corollary 2.
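The order arithmetic behind the choices of $N$, $L$, and $k$ in this proof can be checked numerically; the sketch below plugs those choices into the bounds (5)-(7), dropping constants, for illustrative values of $s$, $\gamma$, $i_0$, and $n$, and compares the result with the target rate $(\log n/n)^{1/2}$.

```python
import math

def psi(n, L):
    """psi_n(L) from Section 3 (penalty for linear aggregation over L procedures)."""
    if L < math.sqrt(n):
        return L * math.log(1.0 + n / L) / n
    return math.log(L) / math.sqrt(n * math.log(n))

def corollary2_terms(n, s, gamma, i0=5):
    """Evaluate (up to constants) the terms in the bounds for R_1, R_2, R_3
    at the choices of N, L, k used in the proof of Corollary 2."""
    # R_1 under B0 and B1: N ~ n^{1/(2s+1)}
    N1 = n ** (1.0 / (2 * s + 1))
    R1 = N1 ** (-2 * s) + N1 / n
    # R_2 under B0: N ~ n^{1/(1+s(2+1/gamma))}, L ~ n^{s/(gamma*(1+s(2+1/gamma)))}
    den = 1.0 + s * (2.0 + 1.0 / gamma)
    N2, L2 = n ** (1.0 / den), n ** (s / (gamma * den))
    R2 = N2 ** (-2 * s) + L2 ** (-2 * gamma) + L2 * N2 / n + psi(n, L2)
    # R_3 under B0 and B2: k ~ sqrt(n/log n), L ~ n^{1/(4 gamma)}, N = i0
    k = math.sqrt(n / math.log(n))
    L3 = n ** (1.0 / (4 * gamma))
    R3 = L3 ** (-2 * gamma) + 1.0 / k + psi(n, k) + k * math.log(L3) / n + k * i0 / n
    return R1, R2, R3, math.sqrt(math.log(n) / n)

print(corollary2_terms(n=10**6, s=1.0, gamma=1.0))
```

With these choices the individual terms in each bound are of the same order, which is exactly the balancing used to obtain the rates (10)-(12).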
References

[1] Barron, A.R. (1993) Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory, 39, 930-945.
[2] Barron, A.R. (1994) Approximation and estimation bounds for artificial neural networks. Machine Learning, 14, 115-133.
[3] Barron, A.R. and Cover, T.M. (1991) Minimum complexity density estimation. IEEE Trans. on Information Theory, 37, 1034-1054.
[4] Barron, A.R., Birge, L. and Massart, P. (1999) Risk bounds for model selection via penalization. Probability Theory and Related Fields, 113, 301-413.
[5] Barron, A.R., Rissanen, J., and Yu, B. (1998) The minimum description length principle in coding and modeling. IEEE Trans. on Information Theory, 44, 2743-2760.
[6] Bates, J.M. and Granger, C.W.J. (1969) The combination of forecasts. Operational Research Quarterly, 20, 451-468.
[7] Birge, L. and Massart, P. (1996) From model selection to adaptive estimation. In Research Papers in Probability and Statistics: Festschrift in Honor of Lucien Le Cam (D. Pollard, E. Torgersen and G. Yang, eds.), 55-87, Springer, New York.
[8] Breiman, L. (1996) Stacked regressions. Machine Learning, 24, 49-64.
[9] Buckland, S.T., Burnham, K.P., and Augustin, N.H. (1995) Model selection: An integral part of inference. Biometrics, 53, 603-618.
[10] Catoni, O. (1997) The mixture approach to universal model selection. Technical Report LIENS-97-22, Ecole Normale Superieure, Paris, France.
[11] Cesa-Bianchi, N., Freund, Y., Haussler, D.P., Schapire, R., and Warmuth, M.K. (1997) How to use expert advice. Journal of the ACM, 44, 427-485.
[12] Cesa-Bianchi, N. and Lugosi, G. (1999) On prediction of individual sequences. Accepted by Ann. Statistics.
[13] Clemen, R.T. (1989) Combining forecasts: a review and annotated bibliography. Intl. J. Forecast., 5, 559-583.
[14] Donoho, D.L. and Johnstone, I.M. (1994) Ideal denoising in an orthonormal basis chosen from a library of bases. C. R. Acad. Sci. Paris, 319, 1317-1322.
[15] Donoho, D.L. and Johnstone, I.M. (1998) Minimax estimation via wavelet shrinkage. Ann. Statistics, 26, 879-921.
[16] Edmunds, D.E. and Triebel, H. (1989) Entropy numbers and approximation numbers in function spaces. Proc. London Math. Soc., 58, 137-152.
[17] Juditsky, A. and Nemirovski, A. (1996) Functional aggregation for nonparametric estimation. Publication Interne, IRISA, N. 993.
[18] Kolmogorov, A.N. and Tihomirov, V.M. (1959) $\epsilon$-entropy and $\epsilon$-capacity of sets in function spaces. Uspehi Mat. Nauk, 14, 3-86.
[19] Merhav, N. and Feder, M. (1998) Universal prediction. IEEE Trans. on Information Theory, 44, 2124-2147.
[20] LeBlanc, M. and Tibshirani, R. (1996) Combining estimates in regression and classification. J. Amer. Statist. Assoc., 91, 1641-1650.
[21] Littlestone, N. and Warmuth, M.K. (1994) The weighted majority algorithm. Information and Computation, 108, 212-261.
[22] Mallat, S.G. and Zhang, Z. (1993) Matching pursuits with time-frequency dictionaries. IEEE Trans. on Signal Processing, 41, 3397-3415.
[23] Rice, J. (1984) Bandwidth choice for nonparametric regression. Ann. Statist., 12, 1215-1230.
[24] Ruppert, D., Wand, M.P., Holst, U., and Hossjer, O. (1997) Local polynomial variance-function estimation. J. Amer. Statist. Assoc., 39, 262-273.
[25] Schutt, C. (1984) Entropy numbers of diagonal operators between symmetric Banach spaces. J. Approx. Theory, 40, 121-128.
[26] Stone, M. (1974) Cross-validatory choice and assessment of statistical predictions (with discussion). J. Roy. Statist. Soc., Ser. B, 36, 111-147.
[27] Vovk, V.G. (1990) Aggregating strategies. In Proceedings of the 3rd Annual Workshop on Computational Learning Theory, 372-383.
[28] Wolpert, D. (1992) Stacked generalization. Neural Networks, 5, 241-259.
[29] Yang, Y. (1996) Minimax Optimal Density Estimation. Ph.D. Dissertation, Department of Statistics, Yale University, May, 1996.
[30] Yang, Y. (1998) Combining different procedures for adaptive regression. Accepted by Journal of Multivariate Analysis.
[31] Yang, Y. (1999a) Model selection for nonparametric regression. Statistica Sinica, 9, 475-499.
[32] Yang, Y. (1999b) Mixing strategies for density estimation. To appear in the Ann. Statistics.
[33] Yang, Y. (1999c) Adaptive regression by mixing. Technical Report No. 12, Department of Statistics, Iowa State University.
[34] Yang, Y. and Barron, A.R. (1997) Information-theoretic determination of minimax rates of convergence. Technical Report No. 28, Department of Statistics, Iowa State University, IA.
[35] Yang, Y. and Barron, A.R. (1998) An asymptotic property of model selection criteria. IEEE Trans. on Information Theory, 44, 95-116.
[36] Yang, Y. and Barron, A.R. (1999) Information-theoretic determination of minimax rates of convergence. To appear in the Ann. Statistics.
... Some references to aggregation of arbitrary estimators in regression models are [13], [10], [17], [18], [9], [2], [15], [16] and [7]. This paper extends the results of the paper [4], which considers regression with fixed design and Gaussian errors W i . ...
Conference Paper
This paper shows that near optimal rates of aggregation and adaptation to unknown sparsity can be simultaneously achieved via ℓ1 penalized least squares in a nonparametric regression setting. The main tool is a novel oracle inequality on the sum between the empirical squared loss of the penalized least squares estimate and a term reflecting the sparsity of the unknown regression function.
Article
Full-text available
Unconditionally secure message authentication is an important part of Quantum Cryptography (QC). We analyze security effects of using a key obtained from QC for authentication purposes in later rounds of QC. In particular, the eavesdropper gains partial knowledge on the key in QC that may have an effect on the security of the authentication in the later round. Our initial analysis indicates that this partial knowledge has little effect on the authentication part of the system, in agreement with previous results on the issue. However, when taking the full QC protocol into account, the picture is different. By accessing the quantum channel used in QC, the attacker can change the message to be authenticated. This together with partial knowledge of the key does incur a security weakness of the authentication. The underlying reason for this is that the authentication used, which is insensitive to such message changes when the key is unknown, becomes sensitive when used with a partially known key. We suggest a simple solution to this problem, and stress usage of this or an equivalent extra security measure in QC.
Article
Full-text available
We compare Bayes Model Averaging, BMA, to a non-Bayes form of model averaging called stacking. In stacking, the weights are no longer posterior probabilities of models; they are obtained by a technique based on cross-validation. When the correct data generating model (DGM) is on the list of models under consideration BMA is never worse than stacking and often is demonstrably better, provided that the noise level is of order commensurate with the coefficients and explanatory variables. Here, however, we focus on the case that the correct DGM is not on the model list and may not be well approximated by the elements on the model list. We give a sequence of computed examples by choosing model lists and DGM's to contrast the risk performance of stacking and BMA. In the first examples, the model lists are chosen to reflect geometric principles that should give good performance. In these cases, stacking typically outperforms BMA, sometimes by a wide margin. In the second set of examples we examine how stacking and BMA perform when the model list includes all subsets of a set of potential predictors. When we standardize the size of terms and coefficients in this setting, we find that BMA outperforms stacking when the deviant terms in the DGM 'point' in directions accommodated by the model list but that when the deviant term points outside the model list stacking seems to do better. Overall, our results suggest the stacking has better robustness properties than BMA in the most important settings.
Chapter
Full-text available
We present simple procedures for the prediction of a real valued sequence. The algorithms are based on a combination of several simple predictors. We show that if the sequence is a realization of a bounded stationary and ergodic random process then the average of squared errors converges, almost surely, to that of the optimum, given by the Bayes predictor. We offer an analog result for the prediction of stationary gaussian processes.
Article
We study the problem of aggregation of estimators. Given a collection of M different estimators, we construct a new estimator, called aggregate, which is nearly as good as the best linear combination over an l 1-ball of ℝM of the initial estimators. The aggregate is obtained by a particular version of the mirror averaging algorithm. We show that our aggregation procedure statisfies sharp oracle inequalities under general assumptions. Then we apply these results to a new aggregation problem: D-convex aggregation. Finally we implement our procedure in a Gaussian regression model with random design and we prove its optimality in a minimax sense up to a logarithmic factor.
Article
Full-text available
The conditional variance function in a heteroscedastic, nonparametric regression model is estimated by linear smoothing of squared residuals. Attention is focused on local polynomial smoothers. Both the mean and variance functions are assumed to be smooth, but neither is assumed to be in a parametric family. The biasing effect of preliminary estimation of the mean is studied, and a degrees-of-freedom correction of bias is proposed. The corrected method is shown to be adaptive in the sense that the variance function can be estimated with the same asymptotic mean and variance as if the mean function were known. A proposal is made for using standard bandwidth selectors for estimating both the mean and variance functions. The proposal is illustrated with data from the LIDAR method of measuring atmospheric pollutants and from turbulence-model computations.
Article
We attempt to recover an unknown function from noisy, sampled data. Using orthonormal bases of compactly supported wavelets, we develop a nonlinear method which works in the wavelet domain by simple nonlinear shrinkage of the empirical wavelet coefficients. The shrinkage can be tuned to be nearly minimax over any member of a wide range of Triebel- and Besov-type smoothness constraints and asymptotically minimax over Besov bodies with p ≤ q. Linear estimates cannot achieve even the minimax rates over Triebel and Besov classes with p < 2, so the method can significantly outperform every linear method (e.g., kernel, smoothing spline, sieve) in a minimax sense. Variants of our method based on simple threshold nonlinear estimators are nearly minimax. Our method possesses the interpretation of spatial adaptivity; it reconstructs using a kernel which may vary in shape and bandwidth from point to point, depending on the data. Least favorable distributions for certain of the Triebel and Besov scales generate objects with sparse wavelet transforms. Many real objects have similarly sparse transforms, which suggests that these minimax results are relevant for practical problems. Sequels to this paper, which was first drafted in November 1990, discuss practical implementation, spatial adaptation properties, universal near minimaxity and applications to inverse problems.
Article
Two separate sets of forecasts of airline passenger data have been combined to form a composite set of forecasts. The main conclusion is that the composite set of forecasts can yield lower mean-square error than either of the original forecasts. Past errors of each of the original forecasts are used to determine the weights to attach to these two original forecasts in forming the combined forecasts, and different methods of deriving these weights are examined.
Article
We argue that model selection uncertainty should be fully incorporated into statistical inference whenever estimation is sensitive to model choice and that choice is made with reference to the data. We consider different philosophies for achieving this goal and suggest strategies for data analysis. We illustrate our methods through three examples. The first is a Poisson regression of bird counts in which a choice is to be made between inclusion of one or both of two covariates. The second is a line transect data set for which different models yield substantially different estimates of abundance. The third is a simulated example in which truth is known.