Representation Transfer Learning for
Semiparametric Regression
Baihua He, Huihang Liu, Xinyu Zhang, and Jian Huang

June 21, 2024

arXiv:2406.13197v1 [stat.ME] 19 Jun 2024
Abstract
We propose a transfer learning method that utilizes data representations in a semi-
parametric regression model. Our aim is to perform statistical inference on the param-
eter of primary interest in the target model while accounting for potential nonlinear
effects of confounding variables. We leverage knowledge from source domains, assum-
ing that the sample size of the source data is substantially larger than that of the target
data. This knowledge transfer is carried out by the sharing of data representations,
predicated on the idea that there exists a set of latent representations transferable from
the source to the target domain. We address model heterogeneity between the source
and target domains by incorporating domain-specific parameters in their respective
models. We establish sufficient conditions for the identifiability of the models and
demonstrate that the estimator for the primary parameter in the target model is both
consistent and asymptotically normal. These results lay the theoretical groundwork for
making statistical inferences about the main effects. Our simulation studies highlight
the benefits of our method, and we further illustrate its practical applications using
real-world data.
Keywords: Asymptotic normality, data representation, heterogeneity, identifiability, multi-
source data.
1 Introduction
In many practical scenarios, the availability of data can be limited, posing difficulties for
estimating effective models for statistical inference. Often, there is an abundance of data in
related but not identical source domains, while the specific target domain suffers from data
∗ Equal contribution.
Baihua He: Department of Statistics and Finance, School of Management, University of Science and Technology of China, Hefei, China. Email: baihua@ustc.edu.cn
Huihang Liu: International Institute of Finance, School of Management, University of Science and Technology of China, Hefei, China. Email: huihang@mail.ustc.edu.cn
Xinyu Zhang: Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China. Email: xinyu@amss.ac.cn
Jian Huang (corresponding author): Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong SAR, China. Email: j.huang@polyu.edu.hk
scarcity. Transfer learning provides a solution to this issue by leveraging knowledge from
similar source tasks to enhance model performance in the target task (Pan and Yang, 2010).
Over the past decade, transfer learning has been widely adopted in various machine learning
tasks, including computer vision (Yosinski et al., 2014), natural language processing (Mou
et al., 2016), and speech recognition (Huang et al., 2013).
The challenges of domain heterogeneity and the risks of negative transfer in utilizing
auxiliary source data have led to the advancement of statistical transfer learning, which
seeks to develop novel transfer learning methodologies that address these challenges and
establish their theoretical properties. Researchers have proposed transfer learning methods for a variety of models, including high-dimensional linear models (Bastani, 2021; Li et al., 2022), generalized linear models (Tian and Feng, 2022; Li et al., 2023), functional regression (Lin and Reimherr, 2022), semi-supervised classification (Zhou et al., 2022), and basis-type models (Cai and Pu, 2024), among others. Tian et al. (2023) introduced a linear representation multi-task method for estimating a shared representation. However, their reliance on linear associations and linear representations constrains the method's applicability to complex data structures. In the context of transfer learning, Hu and Zhang (2023) proposed a model averaging approach for semiparametric regression models. However, their method only leverages knowledge in the linear component for transfer learning, overlooking similarities in the non-linear components.
Although these methods have shown promising results, certain issues remain unexplored. The first is the challenge of balancing the trade-off between model flexibility and parameter interpretability. Most existing research on transfer learning and multi-source data integration falls into two categories: studies that focus on parametric models (Tian and Feng, 2022; Li et al., 2023), which offer simplicity and interpretability, and those that explore complex non-linear models (Tan et al., 2018), which lack interpretability due to their "black-box" nature. Striking a balance between the interpretability of parametric models and the flexibility of non-parametric models remains a significant challenge. The second is the challenge of constructing and identifying the transferable knowledge across data domains. Most existing statistical transfer learning methods transfer the model parameters directly to the target domain. However, these methods depend on parametric model assumptions, which may limit the transferability of the knowledge and fail to utilize the latent shared information among domains. The knowledge is not transferable if the model parameters are different, even if there are shared latent structures in the data. The identifiability of the transferable knowledge is also crucial for statistical inference before transferring. These challenges motivate us to develop a novel transfer learning method that uses the shared knowledge among domains and strikes a balance between model interpretability and flexibility.
We propose a representation transfer learning (RTL) method for knowledge transfer within the context of the semiparametric regression model (Engle et al., 1986). The semiparametric regression model enables the interpretability of the treatment parameters, or parameters of main interest, while capturing flexible data structures through the nonparametric components. Our main objective is to facilitate statistical inference for the treatment effects in the target model, taking into account possibly nonlinear effects of the confounding variables. We achieve this by accommodating multi-dimensional confounding variables in a flexible manner and by incorporating information from source domains. The scenario we address involves a target model of interest, together with independent heterogeneous source models that share a higher-level data representation. The representation transfer learning mechanism enables the latent shared knowledge to be transferred. We use deep neural networks in the estimation of the effects of confounding variables through a set of representation functions. The transfer of knowledge from the source data to the target model occurs via these representation functions, which we estimate by capitalizing on the ample sample sizes available in the source domains.
The main contributions of our paper are threefold. First, a critical concern is the identifiability of both the transferred representation function and the domain-specific parametric components. Given that the representation is a high-level abstraction of the data, it is often non-unique and non-identifiable (Chen et al., 2023). Within the semiparametric regression framework, the identifiability of the parametric components can be compromised due to their interaction with the representation. To tackle this challenge, we formulate novel and interpretable conditions that ensure the identifiability of both the representation function and the linear coefficients, provided there is sufficient diversity among the source domains. Second, our theoretical analysis shows that the proposed method can consistently estimate the representation functions via deep neural networks with ReLU activation. We demonstrate that representation transfer learning can reduce approximation bias and enhance sample efficiency. Third, we establish the asymptotic normality of the estimated primary parameter in the target model, providing the basis for statistical inference regarding the effects of the variable of primary interest. Consequently, RTL adeptly balances model interpretability with model flexibility and improves the estimation accuracy of the primary parameter.
Our proposed RTL approach marks a significant departure from existing statistical meth-
ods for transfer learning and semiparametric regression models. Unlike the traditional
distance-based transfer learning frameworks, we account for model heterogeneity between
the source and target domains by employing flexible representation functions and domain-
specific parameters. These learned representation functions serve as conduits for knowledge
transfer, capturing intrinsic information that is often the most challenging aspect to estimate
in a model.
The rest of the article is organized as follows. We introduce the model framework and
develop the proposed RTL method in Section 2. We provide the theoretical guarantees
in Section 3. We present the numerical studies, including simulations and a semi-synthetic data analysis based on the MNIST handwritten digits dataset, and illustrate RTL using the Pennsylvania reemployment bonus experiment and housing rental information data in Section 4. We give
concluding remarks in Section 5, and relegate all technical proofs to the Supplementary
Materials.
2 Model and methodology
In this section, we present our proposed Representation Transfer Learning (RTL) method.
We denote the data by the triplet $(Y, X, Z)$, where $Y \in \mathbb{R}$ is the response variable, $X \in \mathbb{R}^d$ corresponds to the $d$-dimensional covariate of primary interest, and $Z \in \mathbb{R}^q$ is a $q$-dimensional confounding variable. Typically, $X$ is a low-dimensional treatment variable, implying that $d$ is small. We permit the confounding variable $Z$ to be of a moderately high dimension, with its dimension $q$ allowed to grow as the sample size increases. We consider $K$ distinct source domains, each with its own dataset $(Y_k, X_k, Z_k)$ for $k = 1, \ldots, K$. Additionally, we have the target domain data, denoted as $(Y_0, X_0, Z_0)$. Our approach is designed to leverage the information from these multiple source domains to enhance inference in the target domain via a shared data representation.
2.1 Model
We begin by considering distinct semiparametric regression models for each source domain
and the target domain:
$$\text{Sources:}\quad Y_k = \beta_k^\top X_k + g_k(Z_k) + \varepsilon_k, \quad k = 1, \ldots, K, \qquad (1)$$
$$\text{Target:}\quad Y_0 = \beta_0^\top X_0 + g_0(Z_0) + \varepsilon_0. \qquad (2)$$
In these models, for $k = 0, 1, \ldots, K$, $\beta_k$ represents the effects associated with $X_k$, $g_k: \mathbb{R}^q \to \mathbb{R}$ denotes an unspecified nonparametric function capturing the potential nonlinear impact of the confounding variable $Z_k$, and $\varepsilon_k$ is the random noise component with $E(\varepsilon_k) = 0$ and $E(\varepsilon_k^2) = \sigma_k^2$. This model allows us to systematically address the influence of both the covariates of interest and the confounding variables across different domains.
Our main goal is to conduct statistical inference on the parameter β0, which measures
the effect of the covariate of interest, X0, on the outcome variable Y0. This task is chal-
lenging for two main reasons: the limited availability of data from the target domain and
the complex influence that nonparametric estimation of nuisance functions, related to mul-
tivariate confounding variables, has on the estimation of β0. Although the use of flexible
neural networks to approximate these functions may appear to be a feasible approach, it
complicates the inference process for β0. Furthermore, this method in itself does not over-
come the fundamental obstacle known as the “curse of dimensionality”, which arises during
the nonparametric estimation of a multidimensional function.
Transfer learning offers a solution to the “curse of dimensionality” in the target domain
by utilizing data from multiple sources. We are particularly interested in scenarios where
the combined sample size from the source domains significantly exceeds that of the target
domain. To enable the effective transfer of knowledge from the source domains to the target
domain, it is crucial to establish specific assumptions about the relationship between the
source and target data models. Our approach capitalizes on the source data to assist in
estimating the relevant function within the target data model. This method assumes the
existence of a latent representation of the confounding effect that is invariant across both
the source and target data.
Specifically, we propose the following expression for the confounding effects:
$$g_k(Z) = \gamma_k^\top R(Z), \quad k = 0, 1, \ldots, K,$$
where $R: \mathbb{R}^q \to \mathbb{R}^p$ functions as a representation of the confounding variables, and $\gamma_k$ represents domain-specific coefficients. This composite model structure aligns with the approaches suggested by Du et al. (2020) and Tripuraneni et al. (2020). The representation $R$ can be interpreted as a set of basis functions, with $\gamma_k$ acting as the corresponding weights. This strategy differs from traditional basis expansion techniques, such as spline methods, which rely on a predetermined set of basis functions for approximating nonparametric functions. Instead, our approach estimates the representation function $R$ from the data.
By focusing on the differences in the coefficients $(\beta, \gamma)$ across the source and target domains, we can effectively capture domain heterogeneity. This approach simplifies the challenging tasks of function estimation and heterogeneity detection. Consequently, we posit that the representation function $R$ is a shared element across different domains, representing the transferable knowledge from source tasks to the target task.
The above discussion leads to the proposed RTL model as follows:
$$\text{Sources:}\quad Y_k = \beta_k^\top X_k + \gamma_k^\top R(Z_k) + \varepsilon_k, \quad k = 1, \ldots, K, \qquad (3)$$
$$\text{Target:}\quad Y_0 = \beta_0^\top X_0 + \gamma_0^\top R(Z_0) + \varepsilon_0, \qquad (4)$$
where $\beta_k$ and $\gamma_k$ are source-specific coefficients of dimensions $d$ and $p$, respectively. The function $R: \mathbb{R}^q \to \mathbb{R}^p$ serves as the shared representation function across domains.
2.2 Estimation method
Based on the RTL models (3) and (4), at the population level, our proposed RTL method
proceeds in two steps:
Step P1: In the source domain, we consider the minimizers of the population risk function
$$\{(\beta_k, \gamma_k)_{k=1}^K, R\} \in \arg\min_{\{(\beta_k, \gamma_k)_{k=1}^K, R\}} \frac{1}{K}\sum_{k=1}^K E\{Y_k - \beta_k^\top X_k - \gamma_k^\top R(Z_k)\}^2, \qquad (5)$$
where $R$ is the shared representation function that will be used in the target domain, and the coefficients $(\beta_k, \gamma_k)_{k=1}^K$ take into account possible heterogeneity across the source domains and the target domain. Within the confounding effects $\gamma_k^\top R(Z_k)$, the factors $\gamma_k$ and $R$ are not separable and thus are not individually identifiable, as discussed in Section 3. Fortunately, the effect $\beta_k$ is uniquely identifiable.
Step P2: In the target domain, given the representation function $R$ from the source domain, we solve
$$\{\beta_0, \gamma_0\} = \arg\min_{(\beta_0, \gamma_0)} E\{Y_0 - \beta_0^\top X_0 - \gamma_0^\top R(Z_0)\}^2. \qquad (6)$$
Now suppose we have a random sample of independent and identically distributed observations from the target domain, denoted as $\{(Y_{0i}, X_{0i}, Z_{0i}), i = 1, \ldots, n_0\}$. Additionally, we have access to the datasets from $K$ source domains, $\{(Y_{ki}, X_{ki}, Z_{ki}), i = 1, \ldots, n_k\}$, where $k = 1, \ldots, K$. Let $N = n_1 + \cdots + n_K$ be the combined sample size of the source domains. Although there are no explicit constraints on the sample sizes across the target and source domains, it is usually the case that $N$ significantly exceeds the sample size $n_0$ of the target domain. Our main goal is to leverage the data from these source domains to enhance the estimation accuracy within the target domain. This can be achieved by using the empirical versions of the population-level formulations given in (5) and (6). We first estimate the representation function $R$ using the source data, and then estimate the regression parameters within the target model using the target data. These two steps correspond to Steps P1 and P2 at the population level and are as follows:
Step E1: Estimation of the shared representation function. This step involves estimating a shared representation function $R$, which is formulated as an optimization problem:
$$\{(\widehat{\beta}_k, \widehat{\gamma}_k)_{k=1}^K, \widehat{R}\} = \arg\min_{\{(\beta_k, \gamma_k)_{k=1}^K,\, R \in \mathcal{R}\}} \Big\{\frac{1}{K}\sum_{k=1}^K \frac{1}{n_k}\sum_{i=1}^{n_k}\big(Y_{ki} - \beta_k^\top X_{ki} - \gamma_k^\top R(Z_{ki})\big)^2\Big\}, \qquad (7)$$
where the estimation of the representation function $R$ is conducted over a specified class of neural networks, denoted as $\mathcal{R}$. This step is crucial for capturing the underlying representations shared across the source domains.
Step E2: Estimation of parameters in the target model. After estimating the representation function $\widehat{R}$ from the source data, the next step is to estimate the parameters within the target model. This is achieved by solving
$$\{\widehat{\beta}_0, \widehat{\gamma}_0\} = \arg\min_{\beta_0, \gamma_0} \frac{1}{n_0}\sum_{i=1}^{n_0}\{Y_{0i} - \beta_0^\top X_{0i} - \gamma_0^\top \widehat{R}(Z_{0i})\}^2. \qquad (8)$$
Given that $\widehat{R}$ remains fixed in this step, the pair $(\widehat{\beta}_0, \widehat{\gamma}_0)$ is effectively obtained through a least squares estimation process.
The proposed RTL method, which involves pre-training on multiple source domains before transferring the estimated representations to the target domain, enhances data and computational efficiency. The abundance of source data ensures that the representation function $R$ can be estimated at a much faster convergence rate than is possible when only target domain data is available.
2.3 Implementation
We approximate the representation functions by feedforward neural networks defined as
$$R(z) = A_D\,\sigma(A_{D-1}\,\sigma(\cdots\sigma(A_0 z + b_0)\cdots) + b_{D-1}) + b_D,$$
where $A_i \in \mathbb{R}^{p_{i+1}\times p_i}$ and $b_i \in \mathbb{R}^{p_{i+1}}$ for $i = 0, \ldots, D$, $p_0 = q$ is the dimension of the input variables, $p_{D+1} = p$ is the dimension of the output layer, and $\sigma(\cdot)$ is the activation function. We consider the ReLU activation function $\sigma(x) = \max\{0, x\}$, applied component-wise. The parameters of the representation function $R(\cdot)$ are denoted as $\theta = \{A_0, \ldots, A_D, b_0, \ldots, b_D\}$. The numbers $W = \max\{p_0, \ldots, p_D\}$ and $D$ are the width and depth of the neural network, respectively. The weight matrices together with the bias vectors contain $S = \sum_{i=0}^D p_{i+1}(p_i + 1)$ entries in total. The parameters, including weights and biases, are assumed to be bounded by a constant $B_\theta > 0$. We denote the set of neural network functions defined above by $\mathcal{R} = \mathcal{NN}(W, D, B_\theta)$.
Figure 1 illustrates the architecture of a demo neural network with $d = 3$, $q = 5$, and $p = 3$. We train the representation network and the linear layer in an iterative fashion for 400 epochs. In each epoch, we first update the representation network and then the linear layer. The weights in the representation network are optimized using the SGD optimizer with a learning rate of $10^{-3}$ and a batch size of $n_k$. The weights in the last layer are obtained by least squares estimation. We use early stopping during the training process, and use other i.i.d. observations as a validation dataset for model selection, with a sample size of 30% of the training dataset. That is, we select the model with the minimum prediction error on the validation set for evaluation.
Figure 1: The architecture of a partially linear neural network with input dimensions $d = 1$ and $q = 3$, representation dimension $p = 2$, depth $D = 2$, and width $W = 5$.
To train the representation network across the different datasets, we compute the total loss function (7) in each epoch and then backpropagate the gradients to update the weights $\theta$ in the representation network. After the estimated representation $\widehat{R}$ is computed, we obtain the estimators of $\beta_0$ and $\gamma_0$ by solving (8).
3 Theoretical results
In this section, we study the theoretical properties of the proposed RTL method. We first provide sufficient conditions under which the model parameters are identifiable. Next, we derive the convergence rate of the estimated representation function in terms of the source data sample size. Then we show that the estimator of the parameter of main interest in the target domain model is asymptotically normal. We also provide a consistent estimator of the asymptotic covariance matrix. These results make it possible to conduct statistical inference about the main parameter in the target domain model.
3.1 Identifiability
Identifiability is a fundamental question in statistical modeling problems. Usually, the parameters in a model are required to be uniquely identifiable so that their consistent estimation is possible. In the proposed model, this requires careful consideration because of the term $\gamma_k^\top R(Z_k)$ representing the confounding effect in the model. Since $\gamma_k$ and $R$ are both unknown, they are not identifiable in the usual sense. In this subsection, we provide a set of conditions to guarantee the identifiability of the parameter of main interest $\beta_k$ and the confounding effects represented by $\gamma_k^\top R(Z_k)$.

We first state the definition of a notion of identifiability, linear identifiability, for $\gamma_k$ and $R$.
Definition 1. The data representations are said to be linearly identifiable if, for any two sets of parameters $\{(\gamma_k)_{k=1}^K, R\}$ and $\{(\gamma_k')_{k=1}^K, R'\}$ satisfying the model, there exists an invertible matrix $\Lambda$ such that $R'(Z) = \Lambda^{-1} R(Z)$ and $\gamma_k' = \Lambda^\top \gamma_k$ for all $Z \in \mathcal{Z}$ and $k \in [K]$.

Based on this definition, we have $\gamma_k'^\top R' = \gamma_k^\top \Lambda \Lambda^{-1} R = \gamma_k^\top R$. Therefore, although $\gamma_k$ and $R$ are only linearly identifiable, the confounding effects, represented by $\gamma_k^\top R$, are uniquely identifiable in the usual sense.
We impose the following conditions to ensure the identifiability of the parameters and representations.

Condition 1. The matrix $E[\{X_k - E(X_k \mid Z_k)\}\{X_k - E(X_k \mid Z_k)\}^\top]$ is invertible.

Condition 2. (a) There exist $\{k_i\}_{i=1}^p \subset [K]$ such that the coefficients $\{\gamma_{k_i}\}_{i=1}^p$ are linearly independent. (b) There exist $Z_1, \ldots, Z_p \in \mathcal{Z}$ such that the matrix $[R(Z_1), \ldots, R(Z_p)]$ is invertible.
Condition 1 is a common assumption in regression analysis in the presence of confounding variables; it requires that the main variables $X_k$ have significant variation across different tasks after projecting out all variation that can be explained by the nuisance variables $Z_k$, for each $k \in [K]$. When confounding variables are present, they can introduce bias or distortions that obscure the true relationship between the variables of interest. By requiring that $X_k$ maintains significant variation independent of these confounders, this condition ensures that the effects of $X_k$ on the response variable $Y_k$ can be properly estimated.

Condition 2(a) requires that the support of the distribution of the coefficients $\{\gamma_k\}_{k=1}^K$ is sufficiently rich. A similar assumption was also imposed in the analysis of panel data (Ahn et al., 2001; Bai, 2009; Moon and Weidner, 2015). Condition 2(b) stipulates that $R$ exhibits a sufficient degree of variability. This variability is essential to ensure that the image of $R$, the set of all possible outputs it can generate, does not become confined within a proper subspace of its potential range. In simpler terms, the function must be versatile enough in its transformations to avoid being restricted to a limited portion of the space it operates within.
Theorem 1. Suppose Conditions 1-2 hold. Let $\{(\beta_k, \gamma_k)_{k=1}^K, R\}$ and $\{(\beta_k', \gamma_k')_{k=1}^K, R'\}$ be sets of parameters satisfying (5). Then $\beta_k' = \beta_k$, and there exists an invertible matrix $\Lambda$ such that $R' = \Lambda^{-1} R$ and $\gamma_k' = \Lambda^\top \gamma_k$ for $k \in [K]$.

Theorem 1 shows that the representation function $R$ is identifiable up to a multiplicative matrix transformation if Conditions 1-2 are satisfied.
In the following, we use a simple example to illustrate the identifiability of the proposed model. We set the representation dimension as $p = 2$, 3, and 5, the dimension of the non-linear part as $q = p$, and the dimension of the linear part as $d = 1$. The data generating process is $Y = \beta X + \gamma^\top R(Z) + \epsilon$, where $\beta$ and $\gamma$ are generated from the standard normal distribution, $X \in \mathbb{R}$ and $Z \in \mathbb{R}^q$ are drawn from the standard normal distribution, and $\epsilon \sim N(0, 0.3^2)$. The representation functions $R(\cdot)$ are generated from the following univariate functions: $\sin(\pi x)$, $\cos(\pi x)$, $2\sqrt{|x|} - 1$, $(1 - |x|)^2$, $1/(1 + \exp(-x))$, and $\sin(x)$. The linear coefficients are heterogeneous across the source datasets. We set the sample size in each source dataset as $n_k = 2000$ for all $k = 1, \ldots, K$ and let $K = 8$. The estimated representation function is transformed by a linear transformation, which is identified by minimizing the distance between the transformed representation and the true one. Figure 2 shows the transformed learned representation (solid line) obtained by the proposed method and the true representation function (dashed line).
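The alignment step can be carried out by ordinary least squares. A small numpy sketch with synthetic values (the array names and the matrix below are illustrative placeholders, not the functions used in Figure 2) is:

```python
# Sketch: align a learned representation with the true one up to an invertible
# linear map, as in the identifiability illustration above. R_hat_vals and
# R_true_vals are (n x p) arrays holding the two representations evaluated at
# the same inputs Z_1, ..., Z_n.
import numpy as np

def align_representation(R_hat_vals, R_true_vals):
    # Least squares solution of R_hat_vals @ A ~ R_true_vals: the linear map
    # minimizing the distance between the transformed and true representations.
    A, *_ = np.linalg.lstsq(R_hat_vals, R_true_vals, rcond=None)
    return R_hat_vals @ A, A

rng = np.random.default_rng(0)
R_true_vals = rng.normal(size=(200, 2))
Lam = np.array([[2.0, 1.0], [0.0, 0.5]])           # an invertible matrix Lambda
R_hat_vals = R_true_vals @ np.linalg.inv(Lam).T    # rows are Lambda^{-1} R(Z_i)
aligned, A = align_representation(R_hat_vals, R_true_vals)
print(np.allclose(aligned, R_true_vals))           # True: the truth is recovered
```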
Figure 2: Demonstration of the representation learned by RTL. The left panel uses $\sin(\pi x)$ and $\cos(\pi x)$ as the true representation functions. The middle panel uses $2\sqrt{|x|} - 1$, $\sin(\pi x)$, and $\cos(\pi x)$. The right panel uses $(1 - |x|)^2$, $1/(1 + \exp(-x))$, $\sin(x)$, $\sin(\pi x)$, and $\cos(\pi x)$. The solid line represents the learned representation function and the dashed line represents the true representation function.
3.2 Convergence rate
Based on the identifiability of the representations, we derive the convergence rate of the
estimated representation function in this section. We impose a set of regularity conditions
to guarantee the consistency of the estimated representation function.
Condition 3. (a) For any $\beta \in \mathcal{B}$, there exists a constant $B_\beta$ such that $\|\beta\|_2 \leq B_\beta$, where for any $d$-vector $a$, $\|a\|_2 = (\sum_{j=1}^d a_j^2)^{1/2}$. For any $\gamma \in \Gamma$, there exists a constant $B_\gamma$ such that $\|\gamma\|_2 \leq B_\gamma$. (b) The covariate $X \in \mathcal{X} \subset \mathbb{R}^d$ satisfies $\|X\|_2 \leq B_X$. The representation function $R(\cdot) \in \mathcal{R}$ satisfies $\|R(Z)\|_2 \leq B_R$ for any $Z \in \mathcal{Z}$. (c) The response variable $Y_k \in \mathcal{Y} \subset \mathbb{R}$ is subexponentially distributed for $k \in [K]$.
Condition 3 includes standard assumptions in regression problems (Tripuraneni et al., 2020; Jiao et al., 2023). We further assume that the representation function is in a Hölder class.
Definition 2. Let $\kappa = s + \nu > 0$, $\nu \in (0, 1]$ and $s = \lfloor \kappa \rfloor \in \mathbb{N}_0$, where $\lfloor \kappa \rfloor$ denotes the largest integer strictly smaller than $\kappa$ and $\mathbb{N}_0$ denotes the set of nonnegative integers. For a finite constant $B_0 > 0$, define the Hölder class $\mathcal{H}^{\kappa}(\mathcal{Z}, B_0)$ as
$$\mathcal{H}^{\kappa}(\mathcal{Z}, B_0) = \Big\{ R(\cdot): \mathcal{Z} \to \mathbb{R} \,:\, \max_{\|\omega\|_1 \leq s}\|\partial^{\omega}R\|_\infty \leq B_0,\ \max_{\|\omega\|_1 = s}\sup_{Z \neq Z'}\frac{|\partial^{\omega}R(Z) - \partial^{\omega}R(Z')|}{\|Z - Z'\|_2^{\nu}} \leq B_0 \Big\},$$
where $\partial^{\omega} = \partial^{\omega_1}\cdots\partial^{\omega_q}$ with $\omega = (\omega_1, \ldots, \omega_q)^\top \in \mathbb{N}_0^q$, and $\|\omega\|_1 = \sum_{j=1}^q \omega_j$.
Condition 4. Each element of the representation function $R$ belongs to the Hölder class $\mathcal{H}^{\alpha}(\mathcal{Z}, B_0)$.
Condition 5. The support $\mathcal{Z}$ of the representation function $R: \mathbb{R}^q \to \mathbb{R}^p$ belongs to a compact $p$-dimensional Riemannian manifold isometrically embedded in $\mathbb{R}^q$ with $p \leq q$.
Condition 6. The dimensions $\{p, q\}$ of $\{R, Z\}$ satisfy the following condition:
$$n^{-1/2} p^{1/2}\log N = o(1) \quad\text{and}\quad p\,(D + 2 + \log q)^{1/2}\prod_{i=0}^{D}(p_i + 1)\,(\log N)^2 N^{-1/2} = o(1),$$
where $D$ is the depth of the neural network, $n = \min_{1\leq k\leq K} n_k$, and $N = \sum_{k=1}^K n_k$.
Condition 5 is a low-dimensional manifold condition on $Z$. In fact, Condition 5 is not necessary for establishing the convergence rate of the representation function, but it guarantees a faster convergence rate of the representation function when $q$ is very large. Condition 6 pertains to the dimensionality of the model in relation to the sample sizes. This condition accommodates the presence of a moderately high-dimensional covariate vector, allowing the dimensions $\{p, q\}$ to increase indefinitely, provided that their rate of divergence meets the specified constraints. While this condition is met in numerous applications, it does not cover sparse, high-dimensional scenarios where the number of covariates exceeds the sample size.
For representation functions $R$ and $R'$, denote $d_2(R, R') = (E\|R(Z) - R'(Z)\|_2^2)^{1/2}$. Let $\Delta_N = p^{1/2} N^{-s/(2s + p\log q)} + s_1^{1/2}(\log N)^2 N^{-1/2}$, where $s_1 = \max\{Kd, Kp, S\}$. Typically, the size of the neural network satisfies $S > \max\{Kd, Kp\}$, and thus $s_1$ is simply the network size used in the estimation.
Theorem 2. Suppose Conditions 1-6 hold. Then there exists an invertible matrix $\Lambda$ such that
$$d_2(\widehat{R}, \bar{R}) = O_p(\Delta_N),$$
where $\bar{R} = \Lambda^{-1} R$.
Theorem 2 establishes the convergence rate of the estimated representation. The rate is determined by two terms. The first term represents the approximation error, which reflects the distance from the neural network class $\mathcal{R}$ to $R$. The second term represents the stochastic error.
3.3 Asymptotic normality
In this section, we establish the asymptotic normality of the estimated primary parameter within the target domain. It is a common scenario in transfer learning that the total sample size $N$ from the source domains significantly exceeds the sample size $n_0$ in the target domain. Our derivation of the asymptotic distribution is conducted with this disparity in sample sizes taken into consideration.
We need the following condition.
Condition 7. The matrix $J_0 = E[\{X_0 - m(Z_0)\}\{X_0 - m(Z_0)\}^\top]$ is invertible and $E[\{X_0 - m(Z_0)\}^\top\{X_0 - m(Z_0)\}] < \infty$, where $m(Z_0) = E(X_0 \mid R(Z_0))$.
Condition 7 is fairly standard in the semiparametric regression literature, and it is needed for constructing a semiparametrically efficient estimator of $\beta_0$. We note that the independence between $X_k$ and $Z_k$ is not required for $k \in [K]$ throughout the paper.
To remove the confounding effect of $R(Z_0)$, we consider finding a $d \times p$ matrix $\mu$ that satisfies the orthogonality equation
$$E\big[\{X_0 - \mu R(Z_0)\}R^\top(Z_0)\big] = 0.$$
This is equivalent to finding a $\mu$ that minimizes $E[\|X_0 - \mu R(Z_0)\|_2^2]$. Moreover, the efficient score for $\beta_0$ is $\{X_0 - \mu R(Z_0)\}\epsilon_0$. The following theorem establishes the asymptotic normality of $\widehat{\beta}_0$.
Theorem 3. Suppose Conditions 1-7 hold. Then we have
$$\sqrt{n_0}(\widehat{\beta}_0 - \beta_0) = J_0^{-1}\Big[\frac{1}{\sqrt{n_0}}\sum_{i=1}^{n_0}\{X_{0i} - \mu R(Z_{0i})\}\epsilon_{0i}\Big] + O_p\big(\sqrt{n_0}\,\Delta_N^2\big), \qquad (9)$$
where $\Delta_N = p^{1/2} N^{-s/(2s + p\log q)} + s_1^{1/2}(\log N)^2 N^{-1/2}$. Therefore, if $n_0^{1/2}\Delta_N^2 \to 0$ as $N \to \infty$, that is, $n_0^{1/2} s_1(\log N)^4 N^{-1} \to 0$ and $n_0^{1/2} p N^{-2s/(2s + p\log q)} \to 0$ as $N \to \infty$, we have
$$\sqrt{n_0}(\widehat{\beta}_0 - \beta_0) \xrightarrow{D} N\big(0, \sigma_0^2 J_0^{-1}\big), \quad\text{as } n_0 \to \infty \text{ and } N \to \infty. \qquad (10)$$
Through the data augmentation provided by the abundant source data, we prove that the estimator of $\beta_0$ attains $\sqrt{n_0}$-consistency and asymptotic normality. The asymptotic expression for $\widehat{\beta}_0 - \beta_0$ in Theorem 3 indicates that the estimator $\widehat{\beta}_0$ attains the information bound, so it is semiparametrically efficient. When the variance term $\sigma_0^2 J_0^{-1}$ is unknown, we use a plug-in estimator. Based on Theorem 3, a natural estimator of $\mu$ is given by solving the equation $\sum_{i=1}^{n_0} X_{0i}\widehat{R}^\top(Z_{0i}) - \mu\sum_{i=1}^{n_0}\widehat{R}(Z_{0i})\widehat{R}^\top(Z_{0i}) = 0$, which leads to
$$\widehat{\mu} = \Big(\sum_{i=1}^{n_0} X_{0i}\widehat{R}^\top(Z_{0i})\Big)\Big(\sum_{i=1}^{n_0}\widehat{R}(Z_{0i})\widehat{R}^\top(Z_{0i})\Big)^{-1}. \qquad (11)$$
Combining (10) and (11), we can estimate the variance of $\widehat{\beta}_0$ by $\widehat{\Sigma} = \widehat{J}_0^{-1}\widehat{A}\widehat{J}_0^{-1}$, where
$$\widehat{A} = \frac{1}{n_0}\sum_{i=1}^{n_0}\Big\{\big(Y_{0i} - \widehat{\beta}_0^\top X_{0i} - \widehat{\gamma}_0^\top\widehat{R}(Z_{0i})\big)^2\big(X_{0i} - \widehat{\mu}\widehat{R}(Z_{0i})\big)\big(X_{0i} - \widehat{\mu}\widehat{R}(Z_{0i})\big)^\top\Big\},$$
$$\widehat{J}_0 = \frac{1}{n_0}\sum_{i=1}^{n_0}\big(X_{0i} - \widehat{\mu}\widehat{R}(Z_{0i})\big)\big(X_{0i} - \widehat{\mu}\widehat{R}(Z_{0i})\big)^\top.$$
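A direct numpy sketch of this plug-in variance estimator is given below; the array names and shapes are assumptions made for illustration.

```python
# Sketch of the plug-in variance estimator for the target-domain estimator. The
# arrays are illustrative: X0 is (n0, d), Rhat is (n0, p) and holds Rhat(Z_0i),
# and resid is (n0,) with resid_i = Y_0i - beta0_hat'X_0i - gamma0_hat'Rhat(Z_0i).
import numpy as np

def beta0_standard_errors(X0, Rhat, resid):
    n0 = X0.shape[0]
    # mu_hat from (11): solves sum_i X_0i Rhat_i' = mu sum_i Rhat_i Rhat_i'.
    mu_hat = (X0.T @ Rhat) @ np.linalg.inv(Rhat.T @ Rhat)
    U = X0 - Rhat @ mu_hat.T                       # rows: X_0i - mu_hat Rhat_i
    J0_hat = (U.T @ U) / n0
    A_hat = (U.T @ (U * (resid ** 2)[:, None])) / n0
    Sigma_hat = np.linalg.inv(J0_hat) @ A_hat @ np.linalg.inv(J0_hat)
    # Sigma_hat estimates the asymptotic variance of sqrt(n0)(beta0_hat - beta0),
    # so the standard errors of beta0_hat are sqrt(diag(Sigma_hat) / n0).
    return np.sqrt(np.diag(Sigma_hat) / n0), Sigma_hat
```

The resulting standard errors can be combined with the normal approximation in (10) to form Wald-type confidence intervals, e.g., $\widehat{\beta}_{0j} \pm 1.96\,\mathrm{SE}_j$.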
The next corollary shows that $\widehat{\Sigma}$ is consistent.

Corollary 1. Under Conditions 1-7, if $\epsilon_{0i}$ and the components of $X_0 - E(X_0 \mid R(Z_0))$ have bounded fourth moments, then $\widehat{\Sigma} \xrightarrow{p} \sigma_0^2 J_0^{-1}$.

Theorem 3 and Corollary 1 show that the distribution of $\sqrt{n_0}(\widehat{\beta}_0 - \beta_0)$ can be approximated by a normal distribution whose covariance matrix can be consistently estimated, providing a theoretical basis for making statistical inference about the parameter of main interest in the target domain.
3.4 Benefits from source data
We now discuss the benefits of the source data for estimating the primary parameter $\beta_0$ in the target domain.

Suppose only the target dataset were available. The basic semiparametric partially linear model is (Engle et al., 1986)
$$Y_0 = \beta_0^\top X_0 + g_0(Z_0) + \varepsilon_0.$$
Consider the least squares estimator
$$\{\widetilde{\beta}_0, \widetilde{g}_0\} = \arg\min_{\beta_0, g_0}\frac{1}{n_0}\sum_{i=1}^{n_0}\{Y_{0i} - \beta_0^\top X_{0i} - g_0(Z_{0i})\}^2.$$
There is an extensive literature on the asymptotic properties of the least squares estimators in the semiparametric regression model using various approximation methods, such as splines, for dealing with the nonparametric component; see, for example, Hardle et al. (2000) and the references therein. Under the conditions given in Section 3, it holds that (Hardle et al., 2000; Farrell et al., 2021)
$$E\big|\widetilde{g}_0(Z) - g_0(Z)\big| = O_p\big(n_0^{-s/(2s+q)}\big). \qquad (12)$$
The convergence rate in (12) is optimal (Stone, 1980). Furthermore,
$$\sqrt{n_0}(\widetilde{\beta}_0 - \beta_0) = J_0^{-1}\Big[\frac{1}{\sqrt{n_0}}\sum_{i=1}^{n_0}\{X_{0i} - E(X_{0i}\mid Z_{0i})\}\epsilon_{0i}\Big] + \sqrt{n_0}\,O_p\big(n_0^{-2s/(2s+q)}\big).$$
Therefore, to ensure the asymptotic normality of $\widetilde{\beta}_0$, we must have $n_0^{1/2 - 2s/(2s+q)} \to 0$. This necessitates the condition $n_0^{-s/(2s+q)} = o(n_0^{-1/4})$. Fulfilling this requirement can be difficult, particularly when dealing with a multi-dimensional confounding variable and lacking source data. For instance, under a standard regularity condition where $g_0$ possesses continuous second-order derivatives and assuming $q = 10$, we have $O(n_0^{-2/(4+10)}) = O(n_0^{-1/7})$. Consequently, the condition $n_0^{-s/(2s+q)} = o(n_0^{-1/4})$ may prove to be quite restrictive. In contrast, with the inclusion of source data, Theorem 2 indicates that if the conditions $s_1^{1/2}(\log N)^2 N^{-1/2} = o(n_0^{-1/4})$ and $p^{1/2}N^{-s/(2s + p\log q)} = o(n_0^{-1/4})$ are met, then the estimator of $R$ will achieve a convergence rate faster than $n_0^{-1/4}$. Given that $s_1 = \max\{Kd, Kp, S\}$, we can set $s_1 = S$ for a sufficiently large network used in the analysis. These conditions are satisfied if the network size $S$ is less than $(\log N)^{-2}N^{1/2}/n_0^{1/4}$ and the total sample size from the source domains $N$ exceeds $n_0^{(2s + p\log q)/(4s)}$. Hence, a sufficiently large amount of source data can ensure the satisfaction of these conditions.
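To spell out the arithmetic in the example above, take $s = 2$ and $q = 10$:
$$\frac{s}{2s+q} = \frac{2}{2\cdot 2 + 10} = \frac{1}{7} < \frac{1}{4}, \qquad\text{so}\qquad n_0^{-s/(2s+q)} = n_0^{-1/7} \neq o\big(n_0^{-1/4}\big),$$
and the nonparametric bias term prevents $\sqrt{n_0}$-asymptotic normality when only the target data are used, whereas the corresponding source-assisted requirement depends on $N$ rather than $n_0$ and is easily met once $N$ is large.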
4 Numerical studies
In this section, we evaluate the performance of RTL via numerical studies. We first describe
the simulation results and then illustrate the applications of RTL on two real-world datasets.
4.1 Simulation studies
In this section, we evaluate the finite sample performance of RTL using simulated data. We
generate data under various designs and compare our method with the existing approaches.
4.1.1 Data generating models
We consider the data generating models described in (1) and (2) under the following two scenarios:

(a) Homogeneous models: In this scenario, the source and target domain models are the same. Thus, $\beta_k = \beta$ and $\gamma_k = \gamma$ for all $k = 0, 1, \ldots, K$. The elements of $\beta$ and $\gamma$ are drawn i.i.d. from the standard normal distribution;

(b) Heterogeneous models: In this scenario, the source and target domain models are different. The elements of $\beta_k$ and $\gamma_k$ are drawn i.i.d. from the standard normal distribution, separately for each $k = 0, 1, \ldots, K$.

The covariates $X_k$ and $Z_k$ are drawn i.i.d. from the uniform distribution on $[-1, 1]$. We consider two types of representation functions $R(\cdot)$:

(a) (Additive Model) $R(Z) = [f_1(z_1), f_2(z_2), \ldots, f_r(z_r)]^\top$, where the $f_i$'s are univariate functions;

(b) (Additive Factor Model) $R(Z) = [f_1(\tilde{z}_1), f_2(\tilde{z}_2), \ldots, f_r(\tilde{z}_r)]^\top$, where the $f_i$'s are univariate functions and $\widetilde{Z} = BZ$ for some transformation matrix $B$. We generate $B$ by drawing i.i.d. random numbers from $N(0, 1/q)$.
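For concreteness, a minimal simulation sketch of the homogeneous Additive Model design is given below; the specific functions, noise level, and sizes are illustrative choices consistent with the description above rather than the exact settings of our experiments.

```python
# Sketch of the homogeneous Additive Model design: K source datasets and one
# small target dataset generated from Y = beta' X + gamma' R(Z) + eps with a
# shared representation R(Z) = (f_1(z_1), ..., f_r(z_r)).
import numpy as np

rng = np.random.default_rng(2024)
d, q, r = 5, 10, 5
fs = [np.sin,
      lambda z: 2 * np.sqrt(np.abs(z)) - 1,
      lambda z: (1 - np.abs(z)) ** 2,
      lambda z: 1 / (1 + np.exp(-z)),
      lambda z: np.cos(np.pi * z / 2)]

def representation(Z):
    # Apply the j-th univariate function to the j-th coordinate of Z.
    return np.column_stack([fs[j](Z[:, j]) for j in range(r)])

beta = rng.normal(size=d)      # homogeneous design: shared across all domains
gamma = rng.normal(size=r)

def generate(n, sigma=0.3):
    X = rng.uniform(-1, 1, size=(n, d))
    Z = rng.uniform(-1, 1, size=(n, q))
    Y = X @ beta + representation(Z) @ gamma + sigma * rng.normal(size=n)
    return X, Z, Y

source_data = [generate(1000) for _ in range(6)]   # K = 6 source datasets
target_data = generate(50)                         # small target sample
```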
4.1.2 Evaluation
The performance of the estimated regression function $\widehat{\mu}(X, Z) = X^\top\widehat{\beta}_0 + \widehat{\gamma}_0^\top\widehat{R}(Z)$ is evaluated according to the prediction error and the estimation error. The prediction performance is evaluated by the empirical mean squared error computed on a test set of size $n_{\mathrm{test}}$ generated from the target data distribution, i.e., $\widehat{\mathrm{MSE}}_0 = n_{\mathrm{test}}^{-1}\sum_{i=1}^{n_{\mathrm{test}}}\{\widehat{\mu}(X_i, Z_i) - \mu(X_i, Z_i)\}^2$, which is an estimator of the mean squared error $\mathrm{MSE} = E\{[\widehat{\mu}(X, Z) - \mu(X, Z)]^2\}$. The estimation error is reported on the linear part of the target model, $\mathrm{Err}_{\beta_0} = \|\widehat{\beta}_0 - \beta_0\|_2$, where $\widehat{\beta}_0$ is the estimator of $\beta_0$.
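In code, the two metrics can be computed as follows (a sketch; the argument names and the availability of the true regression function $\mu$ on the test set are assumptions of the simulation setting):

```python
# Sketch of the two evaluation metrics: the empirical prediction MSE on a test
# set and the estimation error on the linear part of the target model.
import numpy as np

def prediction_mse(X_test, Z_test, beta0_hat, gamma0_hat, R_hat, mu_true):
    # R_hat maps Z (n x q) to the estimated representation values (n x p);
    # mu_true returns the true regression function on the test points.
    mu_hat = X_test @ beta0_hat + R_hat(Z_test) @ gamma0_hat
    return np.mean((mu_hat - mu_true(X_test, Z_test)) ** 2)

def estimation_error(beta0_hat, beta0_true):
    # Euclidean distance between the estimated and true linear coefficients.
    return np.linalg.norm(beta0_hat - beta0_true, ord=2)
```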
4.1.3 The effect of the source data sample size
Given that the main objective of transfer learning is to use the information from the source data to enhance the analysis of the target data, we first assess the performance of RTL as the sample size of the source data varies. In the experiments conducted here, the dimension of the linear component $X$ is fixed at $d = 5$, and the dimension of the non-linear component $Z$ is set at $q = 10$. We consider $K = 6$ source datasets in total. Additionally, the dimension of the representation function $R$ is set to 5.
Let the univariate functions $f_i$ be randomly chosen from $\sin(z_1)$, $2\sqrt{|z_2|} - 1$, $(1 - |z_3|)^2$, $1/\{1 + \exp(-z_4)\}$, and $\cos(\pi z_5/2)$. The dimension of the representation function in the working model, denoted as $r$, is set as $r = 1, 3, 5, 7$, and 9. When $r = 5$, the representation dimension is the same as that of the true representation function used in the data generating model. Under-specified and over-specified models are also considered when we set $r = 1, 3$ and $r = 7, 9$, respectively. The sample size $n_k$ in each source dataset is set to 10, 200, 400, 600, 800, 1000, and 1200, and the sample size in the target dataset is fixed at 50.
Figure 3: Additive Model with homogeneous coefficients. (A) Prediction MSE and (B) estimation error versus the sample size n in the source data, for RTL with r = 1, 3, 5, 7, 9 and the Oracle estimator.
Figure 4: Additive Model with heterogeneous coefficients. (A) Prediction MSE and (B) estimation error versus the sample size n in the source data, for RTL with r = 1, 3, 5, 7, 9 and the Oracle estimator.
Figure 5: Additive Factor Model with homogeneous coefficients. (A) Prediction MSE and (B) estimation error versus the sample size n in the source data, for RTL with r = 1, 3, 5, 7, 9 and the Oracle estimator.
Figure 6: Additive Factor Model with heterogeneous coefficients. (A) Prediction MSE and (B) estimation error versus the sample size n in the source data, for RTL with r = 1, 3, 5, 7, 9 and the Oracle estimator.
We repeat the experiments 50 times and report the average performance. We also report the 'Oracle' method, which uses the true representation function in the target data. Figures 3 to 6 present the prediction MSEs and estimation errors for RTL with the number of representation functions $r = 1, 3, 5, 7, 9$; the true value of $r$ in the generating model is 5. For the Additive Model design, the depth of the neural network is set as $D = 4$ and the width is set as $W = 300$; the results are shown in Figures 3 and 4. For the Additive Factor Model design, the depth of the neural network is set as $D = 6$ and the width is set as $W = 500$; the results are shown in Figures 5 and 6.
The experimental results indicate that as the sample size increases, the performance of
the RTL method approaches that of the oracle estimator, provided that the dimension of
the representation function is close to or exceeds the true dimension of the representation
function in the data-generating model. However, if the chosen dimension of the representa-
tion function is less than the true dimension present in the generating model, the proposed
method exhibits suboptimal performance. Consequently, in practical applications, it is ad-
visable to set the dimension of the representation function to a higher rather than lower
value to ensure better performance.
4.1.4 Comparison
We consider both the Additive Model design and the Additive Factor Model design as previously described. Additionally, we explore a more intricate deep model design to simulate the data generation process, as illustrated in Figure 7. In this model, the functions $f_i$ and $h_i$ are selected randomly from a pool of functions that includes $\sin(x)$, $\cos(x)$, $\cos(2x)$, $\sin(\pi x)$, $\cos(\pi x)$, $2\sqrt{|x + 0.5|} - 1$, $(1 - |x - 0.5|)^2$, $1/\{1 + \exp(-x)\}$, $\tan(x + 0.1)$, $\log(x + 1.5)$, $\exp(x)$, $x^2$, and $\arctan(x)$. For instance, the output of the first node in the second layer is computed as $f_1(z_1 + z_2)$. We utilize $K = 20$ source datasets and one target dataset, and we assess two configurations regarding the model's dimension and sample size. In the first configuration, each source dataset comprises 200 samples, and the dimension of the non-linear component is set to $q = 20$. In the second configuration, the sample size for each source dataset is increased to 400, while the dimension of the non-linear component is reduced to $q = 10$. For both configurations, the target dataset consists of $n_0 = 50$ samples, with the dimension of the linear component fixed at $d = 5$.
Figure 7: The architecture of a deep model with q = 10 and p = 5 used in Exp 2: the inputs z_1, ..., z_10 enter through the nodes f_1(·), ..., f_6(·) (e.g., the first node computes f_1(z_1 + z_2)) and are then passed through the nodes h_1(·), ..., h_5(·).
We consider the following competitor methods.
(a) The “Pool” method is a parametric pooling method which estimates the coefficients
using a combined loss. Pooled regression (PR) assumes that all parameters across
different individuals are the same. All datasets are pooled together.
(b) The "MAP" method represents the model averaging transfer learning method (Zhang et al., 2024).
(c) The "Trans-lasso" method represents the high-dimensional linear regression transfer learning method (Li et al., 2022).
(d) The “Meta” method represents the meta-analysis method where the coefficients from
different datasets are weighted based on the inverse variance of estimations.
(e) The “STL” method represents the neural network method which only uses the target
data.
To adapt these methods for non-linear models, we express the nonparametric component as a linear combination of cubic spline basis functions. The optimal number of knots is determined through the use of validation samples. For our proposed RTL method, we define the dimension of the representation space to be $p = 5$, which reflects the true underlying dimension. The outcomes of this comparison are depicted in Figures 8 and 9, where the left side corresponds to the scenario with $n_k = 400$ and $q = 10$, and the right side pertains to the scenario with $n_k = 200$ and $q = 20$. It can be seen that RTL has lower prediction and estimation errors than the existing methods, including Pool, MAP, Trans-Lasso, Meta, and STL. These results show the superiority of our proposed RTL method in terms of both prediction accuracy and estimation quality.
Figure 8: Prediction performance comparison of different methods (RTL, Trans-lasso, MAP, Meta, Pool, STL) across the Add, AddFactor, Deep, and Deep-Home designs. (A) The left-hand side corresponds to the case of $n_k = 200$ and $q = 20$; (B) the right-hand side corresponds to the case of $n_k = 400$ and $q = 10$. It can be seen that RTL has lower prediction errors than the existing methods, including Pool, MAP, Trans-Lasso, Meta, and STL.
Figure 9: Estimation performance comparison between RTL and Pool, MAP, Trans-Lasso, Meta, and STL across the Add, AddFactor, Deep, and Deep-Home designs. (A) The left-hand side corresponds to the case of $n_k = 200$ and $q = 20$; (B) the right-hand side corresponds to the case of $n_k = 400$ and $q = 10$. It can be seen that RTL has lower estimation errors than the existing methods, including Pool, MAP, Trans-Lasso, Meta, and STL.
Table 1: Illustration of asymptotic variance and normality. Histograms of the estimation biases with fitted normal densities accompany each row in the last column.

Design                    Avg. Bias    SD       SE       Normality
Additive, nk = 200          0.0040    0.3016   0.2953    0.0459
Additive, nk = 400         -0.0146    0.2321   0.2241    0.0386
Add-Factor, nk = 200        0.0001    0.2945   0.2880    0.0462
Add-Factor, nk = 400       -0.0153    0.2135   0.2044    0.0345
Deep, nk = 200             -0.0080    0.2155   0.2140    0.0350
Deep, nk = 400             -0.0056    0.1350   0.1350    0.0228
Deep (Homo), nk = 200      -0.0061    0.1567   0.1527    0.0242
Deep (Homo), nk = 400      -0.0003    0.1158   0.1117    0.0184
4.1.5 Assessment of variance estimation and asymptotic normality
To evaluate the estimated asymptotic variance, we employ the same experimental setup as described in Section 4.1.4, with the exception that the parameters $\beta_0$ and $\gamma_0$ are both set to 1, and the number of source datasets is set to 6. Our analysis concentrates on the transformed coefficient $\theta = \alpha^\top\beta_0$, where $\alpha = \mathbf{1}/\sqrt{p}$.

Table 1 presents the results, including the average bias and standard deviation (SD) of the combined coefficient, the mean of the estimated standard errors (SE), and an illustration of normality, all based on 1000 repetitions. The last column of Table 1 displays the histograms of the estimation biases alongside the asymptotic normal distributions with the estimated means and variances. The results suggest that the distribution of the RTL estimator is well approximated by a normal distribution.
4.2 Semi-synthetic MNIST data
In this section, we apply RTL to a semi-synthetic arithmetic dataset constructed from the MNIST dataset (Le Cun et al., 1998). The MNIST dataset is a collection of handwritten digits, comprising 70,000 samples with 10 class labels, each represented by a $28 \times 28$ grayscale image. For our semi-synthetic scenario, we define the data generating process as
$$Y = \beta X + \gamma^\top R_Z + \epsilon, \qquad (13)$$
where $X$ is a random sample drawn from the standard normal distribution, $Z$ is a digit image sampled from the MNIST dataset, $\beta \in \mathbb{R}$ and $\gamma \in \mathbb{R}^{10}$ are unknown coefficients to be estimated, $R_Z \in \mathbb{R}^{10}$ represents the one-hot encoded label corresponding to the input image $Z$, and $\epsilon$ is noise that follows the standard normal distribution. To illustrate, suppose we set $\beta = 1$, $X = 1$, $\epsilon = 0.2$, take $\gamma = (0, 1, \ldots, 9)^\top$, and let $Z$ be a handwritten digit image. The resulting value of $Y$ would be $1 + \gamma^\top R_Z + 0.2$, which equals 8.2 if the image represents the digit '7'.
We adopt the following experimental setup. The total number of source datasets is fixed at 10. Each source dataset is composed of a training subset, which accounts for 40% of the MNIST training set, and a separate validation subset consisting of 500 samples randomly selected from the same training set. The coefficients $\beta$ and $\gamma_1$ are randomly assigned for each source dataset, and we define $\gamma$ as $\gamma_1\mathbf{1}$ across all source data. The target dataset is limited to 100 samples, also drawn from the MNIST training set.
For the representation learning, we utilize a neural network with 7 hidden layers, which
includes 5 convolutional layers and 2 fully connected layers. More detailed information on
the architecture of the neural network is provided in the Supplementary Materials. The
output dimension of the representation network is set to 10. We adopt an iterative training
approach for the representation network and the subsequent linear layer, spanning 10 epochs.
The learning rate is established at $10^{-4}$, and we use a batch size of 128.
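A rough sketch of such a convolutional representation network is given below. The exact architecture is reported in the Supplementary Materials, so the channel counts, kernel sizes, and pooling used here are placeholder choices.

```python
# Rough sketch of a convolutional representation network for 28x28 MNIST images:
# five convolutional layers followed by two fully connected layers, producing a
# 10-dimensional representation. Layer sizes are placeholders, not the exact
# architecture reported in the Supplementary Materials.
import torch.nn as nn

class MNISTRepresentation(nn.Module):
    def __init__(self, p=10):
        super().__init__()
        channels = [1, 16, 32, 32, 64, 64]
        convs = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            convs += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU()]
        self.conv = nn.Sequential(*convs, nn.AdaptiveAvgPool2d(4))
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.Linear(64 * 4 * 4, 128), nn.ReLU(),
                                nn.Linear(128, p))

    def forward(self, z):
        # z: (batch, 1, 28, 28) grayscale images; output: (batch, p) representation.
        return self.fc(self.conv(z))
```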
The average performance of the estimated representation network, in terms of prediction,
classification, and estimation based on 5 replications, is presented in Table 2. It is important
to note that the prediction error is evaluated on a test set derived from the MNIST test set,
which comprises 10,000 images. The classification accuracy is assessed using the MNIST
training set. For this evaluation, the estimated representation network is augmented with
two additional linear layers with ReLU activation functions, and the final output undergoes
a transformation via a logarithmic softmax function.
Table 2: The prediction and estimation errors of RTL in the synthetic MNIST data analysis.
Method Prediction Error Estimation Error Classification Accuracy
RTL 0.2163(0.0675) 0.0701(0.0716) 98.69%(0.24%)
The results show that RTL performs well in the synthetic MNIST data analysis for both prediction and estimation in the target domain. The high classification accuracy indicates
that the estimated representation network is able to capture the label information of the
input images.
4.3 Rental data
We demonstrate the application of the proposed RTL method using an apartment rental
dataset from three major Chinese cities: Beijing, Shanghai, and Shenzhen. This dataset was
obtained from a publicly accessible website, available at http://www.idatascience.cn/.
The number of available rental apartments across various districts in these cities is detailed
in Table 3. Additionally, the variables included in the dataset are given in Table 4. The
main goal of our analysis is to assess the influence of key factors, such as neighborhood
characteristics and the proximity of schools, on rental prices.
Table 3: The number of apartments for rent by district in Beijing, Shanghai and Shenzhen
Beijing Shanghai Shenzhen
Haidian 528 Pudong 1333 Nanshan 1524
Chaoyang 1241 Xuhui 566 Futian 1169
Changping 310 Changning 432 Bao’an 1108
Dongcheng 315 Putuo 416 Longgang 857
Xicheng 308 Huangpu 393 Luohu 857
Fengtai 347 Baoshan 365 Longhua 778
Shijingshan 269 Longhua 360 Buji 735
Mentougou 264 Jing’an 349 Guangming 714
Fangshan 249 Yangpu 316 Yantian 543
Shunyi 225 Tongzhou 223
Jiading 302 Hongkou 207
Daxing 291 Fengxian 204
Huairou 162
Table 4: The description of variables in the apartment rental dataset.
Variable Description
(y) price monthly rent of the apartment
(z) room number of rooms
(z) hall number of halls
(z) toilet number of toilets
(z) hasbed has bed
(z) haswardrobe has wardrobe
(z) hasac has air conditioner
(z) hasgas has gas
(z) floor 4 categories based on height
(z) totalfloor total floors of the building
(z) numhospital number of hospitals (within 3km)
(x) neighborhood neighborhood of the apartment
(x) numschool number of schools (within 3km)
For the apartment rental dataset encompassing three major Chinese cities, Beijing,
Shanghai, and Shenzhen, it is important to recognize that these cities, being in distinct
regions (north, east, and south) of China, have unique rental markets. Consequently, pool-
ing the data from Beijing, Shanghai, and Shenzhen for analysis without considering regional
differences could lead to questionable conclusions. For instance, what characterizes a neigh-
borhood in Beijing may be different from one in Shanghai or Shenzhen. Additionally, the
availability of an air conditioner might influence rental prices differently across these cities
due to their varying climatic conditions. Moreover, considering the vast size of these
cities, there can be significant variations in rental market dynamics even between different
districts within the same city. Therefore, pooling data from different districts within a sin-
gle city is not advisable. On the other hand, despite the geographical distinctions and the
heterogeneity across districts within these cities, there are inherent similarities within their
rental markets. These commonalities make it plausible to apply knowledge about factors
affecting rental prices from one city to another. Thus, while regional specificities should not
be overlooked, there is merit in exploring the transferability of insights across these diverse
urban rental markets.
Given this context, transfer learning is a reasonable approach to analyzing data from
a specific district in one city, using data from other districts as source data. This method
helps overcome the limitations of small sample sizes for district-specific data and enhances the
analysis by leveraging the broader patterns and insights from across the dataset. Thus, while
respecting regional specificities, transfer learning offers a way to explore the transferability
of insights across these diverse urban rental markets.
In our analysis, we focus on the effects of two variables: neighborhood and numschool (the
number of schools within a 3km radius), which are widely recognized as having a significant
influence on rental prices in China. We use other factors as confounding variables that may
also affect rental prices, albeit to a lesser degree. We examine four target datasets from
four randomly selected districts. These districts include Changping in Beijing, Yantian in
Shenzhen, and both Putuo and Fengxian in Shanghai. When analyzing a specific target
dataset, such as the one from Changping in Beijing, we incorporate all the remaining data
as source data. This enables us to quantify the effects of neighborhood characteristics and
the proximity to schools on rental prices, while also taking into account the wider context
provided by the comparative data from other districts.
Using the proposed RTL method for the semiparametric regression model as described in
Section 2, we calculate the 95% confidence intervals for the coefficients of neighborhood and
numschool based on the results in Section 3. The findings are presented in Table 5. These
estimated coefficients provide a quantitative assessment of the impact these variables have on
rental prices. Moreover, the fact that these confidence intervals do not include zero indicates
that the district where the house is located and the proximity to schools significantly increase
the monthly rent, when other factors are held constant.
Table 5: The estimated coefficients and confidence intervals for the variables of main interest
in the housing rental data.
District Variable Estimate SE 95% CI
Changping neighborhood 38.08 3.37 [31.48,44.68]
numschool 16.00 3.93 [8.30,23.70]
Putuo neighborhood 59.53 6.18 [47.41,71.64]
numschool 12.23 2.85 [6.64,17.82]
Fengxian neighborhood 9.80 3.07 [3.79,15.82]
numschool 9.36 2.47 [4.52,14.21]
Yantian neighborhood 66.70 3.92 [59.03,74.38]
numschool 41.03 5.42 [30.41,51.65]
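As a quick check of how these intervals follow from the normal approximation, consider the neighborhood coefficient in Changping:
$$38.08 \pm 1.96 \times 3.37 = [31.47,\ 44.69],$$
which matches the reported interval up to rounding of the displayed estimate and standard error.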
Figure 10: Prediction errors (on a logarithmic scale) of the proposed RTL and the existing methods Pool, MAP, Trans-Lasso, Meta, and STL on the housing rental dataset, for the four target districts: (A) Changping, (B) Putuo, (C) Fengxian, and (D) Yantian.
We further assess the prediction performance by randomly splitting the target dataset into training, validation, and testing sets. Specifically, the training and testing sets each consist of 30% of the data, with the remaining 40% designated as the validation set. We evaluate the prediction performance using the mean squared error (MSE) on the testing set. Figure 10 illustrates the prediction errors of our proposed RTL method compared to the existing methods Pool, MAP, Trans-Lasso, Meta, and STL. The results demonstrate that RTL has lower prediction error than these existing methods, indicating its superior performance in predicting rental prices.
5 Discussion
In this work, we introduce a new approach to transfer learning within the context of semi-
parametric regression inference. The essence of our strategy lies in the transfer of knowledge
from the source domains to the target domain via a representation function. Our goal is to
enhance both prediction accuracy and estimation precision in the target domain by leverag-
ing data from multiple source domains. The key idea of our method is the learning of a shared
representation across various source tasks, which is then applied to a target task. We address
data heterogeneity between the source and target domains by incorporating domain-specific
parameters in their respective models. This strategy facilitates the integration of varied data
representations while maintaining model interpretability and adaptability to heterogeneous
datasets.
Our proposed RTL method has the potential to be adapted for use with other models,
including semiparametric generalized linear and classification models. However, there are
several challenging issues that warrant further investigation within our proposed framework.
Firstly, a pivotal hyperparameter in our approach is the number of representations, for which
the optimal selection remains an open question. The determination of this parameter signif-
icantly affects the model’s performance and its ability to generalize. Our simulation studies
indicate that the method performs adequately as long as the number of representations
falls within a reasonable range. This observation suggests that it is helpful to consider a
cross-validation-type method for selecting this hyperparameter, which could provide a systematic
approach to further enhance model performance (a schematic sketch of such a selection rule is
given at the end of this section). Secondly, our findings are based
on a moderately high-dimensional regime of the model. Although this scenario is relevant
in many applications, extending our method to handle sparse, high-dimensional settings is
challenging. In such scenarios, where the model’s dimensionality may exceed the sample
size, it becomes crucial to integrate regularization techniques into the model fitting objective
function via a penalty term to ensure effective model performance. Moreover, while our
current model uses a linear mapping to integrate representations, there is potential to ex-
plore more flexible approaches. For instance, transitioning from task-specific linear functions
to nonlinear functions could allow for the capture of more complex non-linear relationships
between the representations and the target responses. Pursuing advancements in this area
could significantly improve the model’s capacity and its adaptability to more complex data.
We hope to address these issues in our future work.
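As an illustration of the cross-validation-type selection rule mentioned above, the following schematic sketch chooses the number of representations by validation error on the target domain. The functions fit_rtl and validation_error are hypothetical placeholders for fitting the RTL model with a given representation dimension and for scoring it on held-out target data; they are not part of the paper's implementation.

```python
def select_num_representations(candidate_dims, source_data, target_train, target_valid,
                               fit_rtl, validation_error):
    """Pick the representation dimension with the smallest validation error."""
    best_dim, best_err = None, float("inf")
    for p in candidate_dims:
        model = fit_rtl(source_data, target_train, num_representations=p)  # hypothetical fit
        err = validation_error(model, target_valid)                        # hypothetical score
        if err < best_err:
            best_dim, best_err = p, err
    return best_dim

# Example call (with user-supplied fit_rtl and validation_error):
# p_star = select_num_representations([2, 4, 8, 16], source_data,
#                                     target_train, target_valid,
#                                     fit_rtl, validation_error)
```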
References
Ahn, S. C., Lee, Y. H., and Schmidt, P. (2001). GMM estimation of linear panel data models
with time-varying individual effects. Journal of Econometrics, 101(2):219–255.
Bai, J. (2009). Panel data models with interactive fixed effects. Econometrica: Journal of
the Econometric Society, 77(4):1229–1279.
Bastani, H. (2021). Predicting with proxies: Transfer learning in high dimension. Manage-
ment Science, 67(5):2964–2984.
Cai, T. T. and Pu, H. (2024). Transfer learning for nonparametric regression: Non-
asymptotic minimax analysis and adaptive procedure. arXiv: 2401.12272.
Chen, W., Horwood, J., Heo, J., and Hernández-Lobato, J. M. (2023). Leveraging task structures
for improved identifiability in neural network representations. arXiv: 2306.14861.
Du, S. S., Hu, W., Kakade, S. M., Lee, J. D., and Lei, Q. (2020). Few-shot learning via
learning the representation, provably. In International Conference on Learning Represen-
tations.
Engle, R. F., Granger, C. W., Rice, J., and Weiss, A. (1986). Semiparametric estimates
of the relation between weather and electricity sales. Journal of the American Statistical
Association, 81(394):310–320.
Farrell, M. H., Liang, T., and Misra, S. (2021). Deep neural networks for estimation and
inference. Econometrica, 89(1):181–213.
Golowich, N., Rakhlin, A., and Shamir, O. (2018). Size-independent sample complexity of
neural networks. In Conference On Learning Theory, pages 297–299. PMLR.
Györfi, L., Kohler, M., Krzyzak, A., Walk, H., et al. (2002). A Distribution-Free Theory of
Nonparametric Regression. Springer.
Härdle, W., Liang, H., and Gao, J. (2000). Partially Linear Models. Contributions to
Statistics. Physica Heidelberg, 1st edition.
Hu, X. and Zhang, X. (2023). Optimal parameter-transfer learning by semiparametric model
averaging. Journal of Machine Learning Research, 24(2023):1–53.
Huang, J.-T., Li, J., Yu, D., Deng, L., and Gong, Y. (2013). Cross-language knowledge
transfer using multilingual deep neural network with shared hidden layers. In 2013 IEEE
International Conference on Acoustics, Speech and Signal Processing, pages 7304–7308,
Vancouver, BC, Canada. IEEE.
Jiao, Y., Shen, G., Lin, Y., and Huang, J. (2023). Deep nonparametric regression on approx-
imate manifolds: Nonasymptotic error bounds with polynomial prefactors. The Annals of
Statistics, 51(2):691–716.
Le Cun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied
to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and
Processes. Springer Science & Business Media, Berlin, Heidelberg.
Li, S., Cai, T. T., and Li, H. (2022). Transfer learning for high-dimensional linear regression:
Prediction, estimation, and minimax optimality. Journal of The Royal Statistical Society
Series B: Statistical Methodology, 84(1):149–173.
Li, S., Zhang, L., Cai, T. T., and Li, H. (2023). Estimation and inference for high-dimensional
generalized linear models with knowledge transfer. Journal of the American Statistical
Association, pages 1–12.
Lin, H. and Reimherr, M. (2022). Transfer learning for functional linear regression with
structural interpretability. arXiv: 2206.04277.
Moon, H. R. and Weidner, M. (2015). Linear regression for panel with unknown number of
factors as interactive fixed effects. Econometrica: Journal of the Econometric Society,
83(4):1543–1579.
Mou, L., Meng, Z., Yan, R., Li, G., Xu, Y., Zhang, L., and Jin, Z. (2016). How transferable
are neural networks in NLP applications? In Su, J., Duh, K., and Carreras, X., editors,
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing,
pages 479–489, Austin, Texas. Association for Computational Linguistics.
Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on
Knowledge and Data Engineering, 22(10):1345–1359.
Shen, G. (2024). Exploring the complexity of deep neural networks through functional
equivalence. arXiv: 2305.11417.
Stone, C. J. (1980). Optimal rates of convergence for nonparametric estimators. The Annals
of Statistics, 8(6):1348–1360.
Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., and Liu, C. (2018). A survey on deep
transfer learning. In Kůrková, V., Manolopoulos, Y., Hammer, B., Iliadis, L., and
Maglogiannis, I., editors, Artificial Neural Networks and Machine Learning - ICANN 2018,
pages 270–279, Cham. Springer International Publishing.
Tian, Y. and Feng, Y. (2022). Transfer learning under high-dimensional generalized linear
models. Journal of the American Statistical Association, 118(544):2684–2697.
Tian, Y., Gu, Y., and Feng, Y. (2023). Learning from similar linear representations: Adap-
tivity, minimaxity, and robustness. arXiv: 2303.17765.
Tripuraneni, N., Jordan, M. I., and Jin, C. (2020). On the theory of transfer learning:
The importance of task diversity. In Proceedings of the 34th International Conference on
Neural Information Processing Systems, NIPS’20, pages 7852–7862.
Van der Vaart, A. W. (2000). Asymptotic Statistics. Cambridge University Press, Cambridge,
UK.
Van Der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes:
With Applications to Statistics. Springer, New York, NY, USA.
Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cam-
bridge University Press, Cambridge, UK.
Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014). How transferable are features
in deep neural networks? In Proceedings of the 27th International Conference on Neural
Information Processing Systems - Volume 2, NIPS’14, pages 3320–3328, Cambridge, MA,
USA. MIT Press.
Zhang, X., Liu, H., Wei, Y., and Ma, Y. (2024). Prediction using many samples with
models possibly containing partially shared parameters. Journal of Business & Economic
Statistics, 42(1):187–196.
Zhou, D., Liu, M., Li, M., and Cai, T. (2022). Doubly robust augmented model accuracy
transfer inference with high dimensional features. arXiv: 2208.05134.
Appendix
A The network architecture for the semi-synthetic
MNIST data
The architecture of the network used in the semi-synthetic data analysis is detailed in
Figure S11.
[Figure S11 diagram: input 1@28x28 -> Conv 64@28x28 -> Conv 64@28x28 -> MaxP 64@14x14 -> Conv 128@14x14 -> Conv 128@14x14 -> MaxP 128@7x7 -> Conv 256@7x7 -> MaxP 256@3x3 -> Flatten 1x2304 -> 1x512 -> 1x10 -> SoftMax.]
Figure S11: Structure of the representation network used for the semi-synthetic data analysis. The
convolution (Conv) transforms, max pooling (MaxP) transforms, tensor flattening (Flatten), and
softmax transform are labeled at the bottom. The dimension of the output of each layer is labeled at the top.
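For readers who wish to reproduce a comparable representation network, the following is a minimal PyTorch sketch consistent with the layer dimensions in Figure S11. The kernel sizes (3x3 with padding 1), the 2x2 max pooling, and the ReLU activations are assumptions, since only the channel counts and feature-map sizes are shown in the figure.

```python
import torch
import torch.nn as nn

class RepresentationNet(nn.Module):
    """CNN matching the feature-map sizes reported in Figure S11."""
    def __init__(self, num_outputs: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),     # 1@28x28  -> 64@28x28
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),    # 64@28x28
            nn.MaxPool2d(2),                                            # 64@14x14
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),   # 128@14x14
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),  # 128@14x14
            nn.MaxPool2d(2),                                            # 128@7x7
            nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(),  # 256@7x7
            nn.MaxPool2d(2),                                            # 256@3x3
        )
        self.head = nn.Sequential(
            nn.Flatten(),                                   # 1x2304
            nn.Linear(256 * 3 * 3, 512), nn.ReLU(),         # 1x512
            nn.Linear(512, num_outputs),                    # 1x10
            nn.Softmax(dim=1),
        )

    def forward(self, x):
        return self.head(self.features(x))

# Example: a batch of four 28x28 grayscale images gives a (4, 10) output.
# out = RepresentationNet()(torch.randn(4, 1, 28, 28))
```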
B Proofs of theoretical results
Before we present the proofs of the results stated in the paper, we first introduce the definitions of the Rademacher and Gaussian complexities for the function class $\mathcal{R}$.

Definition 3. We define the empirical and population Rademacher complexities for a class of functions $\mathcal{R}$ containing functions $R:\mathbb{R}^q\to\mathbb{R}^p$, over $n$ data points $(Z_1,\dots,Z_n)$, as
$$\widehat{F}_n(\mathcal{R}) = E_{\varepsilon}\Bigg[\sup_{R\in\mathcal{R}}\frac{1}{n}\sum_{j=1}^{p}\sum_{i=1}^{n}\varepsilon_{ij}R_j(Z_i)\Bigg]
\quad\text{and}\quad
F_n(\mathcal{R}) = E_{Z}\big[\widehat{F}_n(\mathcal{R})\big],$$
respectively, where $E_{\varepsilon}(\cdot)$ refers to the expectation operator taken over the randomness of the $\varepsilon_{ij}$'s, the $\varepsilon_{ij}$'s are independent Rademacher random variables, and $R_j(\cdot)$ is the $j$th element of $R(\cdot)$. Analogously, the empirical and population Gaussian complexities are defined as
$$\widehat{G}_n(\mathcal{R}) = E_{\iota}\Bigg[\sup_{R\in\mathcal{R}}\frac{1}{n}\sum_{j=1}^{p}\sum_{i=1}^{n}\iota_{ij}R_j(Z_i)\Bigg]
\quad\text{and}\quad
G_n(\mathcal{R}) = E_{Z}\big[\widehat{G}_n(\mathcal{R})\big],$$
respectively, where the $\iota_{ij}$'s are independent standard Gaussian random variables.
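As a concrete numerical illustration of Definition 3 (not part of the paper's code), the following sketch approximates the empirical Rademacher complexity by Monte Carlo when the supremum is restricted to a finite set of candidate representation functions; the two candidate functions below are illustrative assumptions.

```python
import numpy as np

def empirical_rademacher(candidates, Z, num_draws=200, rng=None):
    """Monte Carlo estimate of the empirical Rademacher complexity over a finite class.

    candidates: list of functions mapping an (n, q) array Z to an (n, p) array
                whose (i, j) entry is R_j(Z_i).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = Z.shape[0]
    outputs = [R(Z) for R in candidates]                  # evaluate each candidate once
    total = 0.0
    for _ in range(num_draws):
        eps = rng.choice([-1.0, 1.0], size=outputs[0].shape)            # signs eps_ij
        total += max(float(np.sum(eps * out)) / n for out in outputs)   # sup over candidates
    return total / num_draws

# Example with two fixed candidate representations R : R^2 -> R^3.
rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 2))
W1, W2 = rng.normal(size=(2, 3)), rng.normal(size=(2, 3))
candidates = [lambda Z, W=W1: np.tanh(Z @ W),
              lambda Z, W=W2: np.maximum(Z @ W, 0.0)]
print(empirical_rademacher(candidates, Z, rng=rng))
```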
B.1 Auxiliary Lemmas
Define $\ell(D_k;\beta_k,\gamma_k,R) = \{Y_k - X_k^{\top}\beta_k - R^{\top}(Z_k)\gamma_k\}^2$, where $D_k = (X_k, Z_k, Y_k)$. Let
$$\Psi_{\ell,\delta} = \Bigg\{\frac{1}{K}\sum_{k=1}^{K}\ell(D_k,\psi) - \frac{1}{K}\sum_{k=1}^{K}\ell(D_k,\psi^{*}) : \psi\in\Psi_{\delta}\Bigg\},$$
where $\Psi_{\delta} = \{\psi : \delta/2 \le d_2(\psi,\psi^{*}) \le \delta,\ \delta>0,\ \psi\in\Psi\}$, $\psi = \{(\beta_k,\gamma_k)_{k=1}^{K}, R\}$, and $\psi^{*} = \{(\beta_k^{*},\gamma_k^{*})_{k=1}^{K}, R^{*}\}$. Denote $P_N$ and $P$ as the empirical and probability measures of $\{\mathcal{D}_{n,k}\}_{k=1}^{K}$ and $\{D_k\}_{k=1}^{K}$, where $\mathcal{D}_{n,k} = \{D_{ki} = (X_{ki}, Z_{ki}, Y_{ki}),\ i=1,\dots,n_k\}$. We further define $\mathbb{G}_N = \sqrt{N}(P_N - P)$. Denote the population 2-norm for the function class $\mathcal{B}^K + \Gamma^K(\mathcal{R})$ as
$$d_2(\psi,\psi') = \Bigg(\frac{1}{K}\sum_{k=1}^{K} E\Big[\big\{X_k^{\top}\beta_k + R^{\top}(Z_k)\gamma_k - X_k^{\top}\beta'_k - R'^{\top}(Z_k)\gamma'_k\big\}^2\Big]\Bigg)^{1/2},$$
where $\psi' = \{(\beta'_k,\gamma'_k)_{k=1}^{K}, R'\}$. We use $\lesssim$, $\gtrsim$, and $\asymp$ to denote less than, greater than, and equal to up to a universal constant. For simplicity of notation, we omit the parts with the same parameters from the distance $d_2$ in what follows.
Lemma 1. Suppose Condition 3 holds. Then we have
$$E^{*}\|\mathbb{G}_N\|_{\Psi_{\ell,\delta}} \lesssim \delta\sqrt{s_1\log(s_2)} + \frac{s_1}{\sqrt{N}}\log(s_2),$$
where $E^{*}$ is the outer measure and $s_2 = 12B_{\gamma}(D+1)(B_R+1)(2B_{\theta})^{D+2}\big(\prod_{j=0}^{D}p_j\big)\big/\big(\prod_{j=1}^{D}p_j!\big)^{1/S}$.
Proof of Lemma 1. Using the triangle inequality, we can decompose the distance on the function class $\mathcal{B}^K + \Gamma^K(\mathcal{R})$ into distances over $\mathcal{B}^K$, $\Gamma^K$, and $\mathcal{R}$. We have
$$
\begin{aligned}
d_2\big(\{(\beta'_k,\gamma'_k)_{k=1}^{K},R'\},\{(\beta_k,\gamma_k)_{k=1}^{K},R\}\big)
&\le d_2\big(\{(\beta'_k,\gamma'_k)_{k=1}^{K},R'\},\{(\beta_k,\gamma'_k)_{k=1}^{K},R'\}\big) \\
&\quad + d_2\big(\{(\beta_k,\gamma'_k)_{k=1}^{K},R'\},\{(\beta_k,\gamma_k)_{k=1}^{K},R'\}\big) \\
&\quad + d_2\big(\{(\beta_k,\gamma_k)_{k=1}^{K},R'\},\{(\beta_k,\gamma_k)_{k=1}^{K},R\}\big) \\
&\lesssim d_2\big(\{\beta'_k\}_{k=1}^{K},\{\beta_k\}_{k=1}^{K}\big)
 + d_2\big(\{\gamma'_k\}_{k=1}^{K},\{\gamma_k\}_{k=1}^{K}\big)
 + \max_{k\in[K]}\|\gamma_k\|_2\, d_2(R',R). \qquad (\mathrm{A.1})
\end{aligned}
$$
We then use a covering argument on each of the spaces $\mathcal{B}^K$, $\Gamma^K$, and $\mathcal{R}$ to witness a covering of the composed space $\mathcal{B}^K + \Gamma^K(\mathcal{R})$. First, let $\mathcal{C}_{\mathcal{B}^K}$ be a $\tau_0$-covering of the class $\mathcal{B}^K$ in the norm $d_2$. Then, for each $\beta\in\mathcal{C}_{\mathcal{B}^K}$, construct a $\tau_1$-covering $\mathcal{C}_{\mathcal{R}}$ of the class $\mathcal{R}$ in the norm $d_2$. Last, for each $\beta\in\mathcal{C}_{\mathcal{B}^K}$ and $R\in\mathcal{C}_{\mathcal{R}}$, construct a $\tau_2$-covering $\mathcal{C}_{\Gamma^K}$ of the class $\Gamma^K$ in the norm $d_2$. Using the decomposition of the distance in (A.1), we can claim that the set
$$\mathcal{C}_{\mathcal{B}^K}\cdot\mathcal{C}_{\Gamma^K(\mathcal{R})} = \bigcup_{\beta\in\mathcal{C}_{\mathcal{B}^K}}\Big(\bigcup_{R\in\mathcal{C}_{\mathcal{R}}}\mathcal{C}_{\Gamma^K}\Big)$$
is a $(\tau_0 + \max_{k\in[K]}\|\gamma_k\|_2\,\tau_1 + \tau_2)$-covering of the function space $\mathcal{B}^K + \Gamma^K(\mathcal{R})$ in the norm $d_2$. To see this, let $\{\beta_k\}_{k=1}^{K}\in\mathcal{B}^K$, $\{\gamma_k\}_{k=1}^{K}\in\Gamma^K$, and $R\in\mathcal{R}$ be arbitrary. Now let $\{\beta'_k\}_{k=1}^{K}\in\mathcal{C}_{\mathcal{B}^K}$ be $\tau_0$-close to $\{\beta_k\}_{k=1}^{K}$; given this $\{\beta'_k\}_{k=1}^{K}$, there exists $R'\in\mathcal{C}_{\mathcal{R}}$ that is $\tau_1$-close to $R$; given this $\{\beta'_k\}_{k=1}^{K}$ and $R'$, there exists $\{\gamma'_k\}_{k=1}^{K}\in\mathcal{C}_{\Gamma^K}$ that is $\tau_2$-close to $\{\gamma_k\}_{k=1}^{K}$. By the construction of $\{(\beta'_k,\gamma'_k)_{k=1}^{K},R'\}$ and (A.1), we have
$$d_2\big(\{(\beta'_k,\gamma'_k)_{k=1}^{K},R'\},\{(\beta_k,\gamma_k)_{k=1}^{K},R\}\big) \le \tau_0 + \max_{k\in[K]}\|\gamma_k\|_2\,\tau_1 + \tau_2.$$
We now bound the cardinality of the covering $\mathcal{C}_{\mathcal{B}^K}\cdot\mathcal{C}_{\Gamma^K(\mathcal{R})}$:
$$\big|\mathcal{C}_{\mathcal{B}^K}\cdot\mathcal{C}_{\Gamma^K(\mathcal{R})}\big| \le |\mathcal{C}_{\mathcal{B}^K}|\,|\mathcal{C}_{\mathcal{R}}|\,\max_{R\in\mathcal{R}}\big|\mathcal{C}_{\Gamma^K_R}\big|.$$
To control the cardinality of $\max_{R\in\mathcal{R}}|\mathcal{C}_{\Gamma^K_R}|$, note that an $\epsilon$-covering of $\mathcal{C}_{\Gamma^K_R}$ can be obtained from the cover $\mathcal{C}_{\Gamma_R}\times\cdots\times\mathcal{C}_{\Gamma_R}$. Hence,
$$|\mathcal{C}_{\mathcal{B}^K}| \le |\mathcal{C}_{\mathcal{B}}|^{K}, \qquad \max_{R\in\mathcal{R}}\big|\mathcal{C}_{\Gamma^K_R}\big| \le \max_{R\in\mathcal{R}}\big|\mathcal{C}_{\Gamma_R}\big|^{K}.$$
Note that for any $\psi,\psi'\in\Psi_{\delta}$, we have
$$
\begin{aligned}
E\Bigg[\frac{1}{K}\sum_{k=1}^{K}\ell(D_k,\psi) - \frac{1}{K}\sum_{k=1}^{K}\ell(D_k,\psi')\Bigg]
&= E\Bigg[\frac{1}{K}\sum_{k=1}^{K}\ell(D_k,\psi) - \frac{1}{K}\sum_{k=1}^{K}\ell(D_k,\psi^{*})\Bigg]
 + E\Bigg[\frac{1}{K}\sum_{k=1}^{K}\ell(D_k,\psi^{*}) - \frac{1}{K}\sum_{k=1}^{K}\ell(D_k,\psi')\Bigg] \\
&\le d_2^{2}(\psi,\psi^{*}) + d_2^{2}(\psi',\psi^{*}) \lesssim \delta^2.
\end{aligned}
$$
By Theorem 3 in Shen (2024), we have
$$\mathcal{N}(\tau, d_2, \mathcal{R}) \le \mathcal{N}(\tau, d_{\infty}, \mathcal{R})
\le \frac{\Big\{4(D+1)(B_R+1)(2B_{\theta})^{D+2}\big(\prod_{j=0}^{D}p_j\big)\tau^{-1}\Big\}^{S}}{\prod_{j=1}^{D}p_j!},$$
where $d_{\infty}$ is the infinity norm. Furthermore, by the construction of the covering net and Theorem 2.7.11 in Van Der Vaart and Wellner (1996), we have
$$\log \mathcal{N}_{[\,]}(\tau, d_2, \Psi_{\ell,\delta})
\lesssim Kd\log\Big(\frac{3\delta}{\tau}\Big) + Kp\log\Big(\frac{3\delta}{\tau}\Big)
+ S\log\Bigg(\frac{12B_{\gamma}(D+1)(B_R+1)(2B_{\theta})^{D+2}\big(\prod_{j=0}^{D}p_j\big)}{\tau\big(\prod_{j=1}^{D}p_j!\big)^{1/S}}\Bigg).$$
Using the sub-additivity of the $\sqrt{\cdot}$ function, if $\delta \le s_2/3$, then we have
$$J_{[\,]}(\delta, \Psi_{\ell,\delta}) := \int_{0}^{\delta}\sqrt{1 + \log \mathcal{N}_{[\,]}(\tau, d_2, \Psi_{\ell,\delta})}\,d\tau
\lesssim \int_{0}^{\delta}\sqrt{1 + s_1\log(s_2/\tau)}\,d\tau
= s_2\sqrt{\frac{s_1}{2}}\int_{\sqrt{2\log(s_2/\delta)}}^{\infty} v^{2} e^{-v^{2}/2}\,dv
\lesssim \delta\sqrt{s_1\log(s_2)}. \qquad (\mathrm{A.2})$$
Under Condition 3, combining (A.2) with Lemma 3.4.2 in Van Der Vaart and Wellner (1996) gives
$$E^{*}\|\mathbb{G}_N\|_{\Psi_{\ell,\delta}} \lesssim J_{[\,]}(\delta, \Psi_{\ell,\delta})\Bigg(1 + \frac{J_{[\,]}(\delta, \Psi_{\ell,\delta})}{\delta^{2}\sqrt{N}}\Bigg)
\lesssim \delta\sqrt{s_1\log(s_2)} + \frac{s_1}{\sqrt{N}}\log(s_2).$$
Lemma 2. Suppose Condition 3 holds. If $n_k \gtrsim d + \log K$, we have
$$
\begin{aligned}
F_N\big(\mathcal{B}^K + \Gamma^K(\mathcal{R})\big)
&\lesssim O\big\{n^{-1/2}d^{1/2}\big\} + O\big\{n^{-1/2}N^{-2}(p\log N)^{1/2}\big\} + O\big\{n^{-1/2}p^{1/2}(\log N)\big\} \\
&\quad + O\Big\{N^{-1/2}(D+2+\log q)^{1/2}\,p\,(\log N)^{2}\prod_{i=0}^{D}(p_i+1)\Big\} \\
&\quad + O\Bigg\{n^{-1/2}N^{-2}\Bigg[S\log\Bigg(N(D+1)(2B_{\theta})^{D+2}\Big(\prod_{j=0}^{D}p_j\Big)\Big/\Big(\prod_{j=1}^{D}p_j!\Big)^{1/S}\Bigg)\Bigg]^{1/2}\Bigg\}.
\end{aligned}
$$
Proof of Lemma 2. Under Condition 3, by the definition of the empirical Gaussian complex-
ity, we have
b
Gnk(B) = Eι"sup
βk∈B
1
nk
nk
X
i=1
ιikXkiβk#
max
k[K]
Bβ
nkv
u
u
tEι"nk
X
i=1 ιikXki2
2#
max
k[K]
Bβ
nkv
u
u
t
nk
X
i=1 Xki2
2
= max
k[K]
Bβ
nkqtr(ΣXk)
= max
k[K]
Bβ
nkv
u
u
t
d
X
j=1
σjXk),(A.3)
where ΣXk=Pnk
i=1 XkiXki/nkand σjXk) is the jth largest eigenvalue of ΣXk. Similarly,
we can prove
b
Gnk(Γ) max
k[K]
Bγ
nkv
u
u
t
p
X
j=1
σjR(Zk)),(A.4)
where σjR(Zk)) is the jth largest eigenvalue of Pnk
i=1 R(Zki)R(Zki)/nk. Furthermore, we
can obtain that,
b
GN(R) = 1
NEι"sup
R∈R
p
X
j=1
K
X
k=1
nk
X
i=1
ιikRj(Zki)#
p
X
j=1 b
GN(Rj)(log N)
p
X
j=1 b
FN(Rj).(A.5)
By the definition of empirical Gaussian complexity, we can easily conclude
b
GN(BK+ ΓK(R)) b
GN(BK) + b
GNK(R)).(A.6)
For simplicity in notation, denote fik (γk,R;γ
k,R) = RT(Zki)γkR′⊤(Zki)γ
k2. Theo-
rem 7 of Tripuraneni et al. (2020) implies that
b
GNK(R)) 4 sup
{(γk)K
k=1,R},{(γ
k)K
k=1,R}
1
K
K
X
k=1
1
nk
nk
X
i=1
fik(γk,R;γ
k,R)/N2
+ 128(log N)hmax
kγkb
GN(R) + max
kb
Gnk(Γ)i.(A.7)
Similar to the calculation of the covering number in Lemma 1, we can obtain that
N(τ/3, d, fk(Γ(R)Γ(R)))
48B2
RBγ
τ2p
192B2
γBR(D+ 1)(BR+ 1)(2Bθ)D+2(QD
j=0 pj)2S
τ2S(QD
j=1 pj!)2
.
Denote fk(γk,R;γ
k,R) = RT(Zk)γkR′⊤(Zk)γ
k2. By Lemma 9.1 of Gy¨orfi et al.
(2002), we have that, for any τ > 0
P sup
{γk,R},{γ
k,R}
1
nk
nk
X
i=1
fik(γk,R;γ
k,R)E[fk(γk,R;γ
k,R)]> τ!
2N(τ/3, d, fk(Γ(R)Γ(R))) exp nkτ2
18B2
γB2
R.
Denote c= 1/(18B2
γB2
R) and C= 2N(1/(3nk), d, fk(Γ(R)+Γ(R))). Note that log C/(cnk)
1/nk. Then,
E"sup
{γk,R},{γ
k,R}(1
nk
nk
X
i=1
fik(γk,R;γ
k,R)E[fk(γk,R;γ
k,R)])#
v
u
u
u
tE
sup
{γk,R},{γ
k,R}(1
nk
nk
X
i=1
fik(γk,R;γ
k,R)E[fk(γk,R;γ
k,R)])!2
v
u
u
u
tZ
0
P
sup
{γk,R},{γ
k,R}(1
nk
nk
X
i=1
fik(γk,R;γ
k,R)E[fk(γk,R;γ
k,R)])!2
> τ
dτ
slog C
cnk
+Z
log C
cnk
2N(τ/3, d, fk(Γ(R)Γ(R))) exp nkτ
18B2
γB2
Rdτ
slog C
cnk
+Z
log C
cnk
2N(1/(3nk), d, fk(Γ(R)Γ(R))) exp nkτ
18B2
γB2
Rdτ
=s18B2
γB2
R(1 + log 2 + log{N(1/(3nk), d, fk(Γ(R)Γ(R)))})
nk
.
Under Condition 3, we have
E"sup
{(γk)K
k=1,R},{(γ
k)K
k=1,R}
1
K
K
X
k=1
1
nk
nk
X
i=1
fik(γk,R;γ
k,R)#
E"sup
{γk,R},{γ
k,R}
1
nk
nk
X
i=1
fik(γk,R;γ
k,R)#
4B2
γB2
R+s18B2
γB2
R(1 + log 2)
nk
+s36pB2
γB2
R
nk
log (48nkB2
RBγ)
+v
u
u
u
u
t
36SB2
γB2
R
nk
log
192nkBRB2
γ(D+ 1)(BR+ 1)(2Bθ)D+2(QD
j=0 pj)
QD
j=1 pj!1/S
.(A.8)
Noting by Ledoux and Talagrand (1991) (p97), the empirically Rademacher complexity is
upper bounded by empirical Gaussian complexity up to a factor, together with (A.6) and
(A.7), we have
b
FN(BK+ ΓK(R))
rπ
2b
GN(BK+ ΓK(R))
rπ
2b
GN(BK) + b
GNK(R))
rπ
2max
kb
Gnk(B)+22πsup
{(γk)K
k=1,R},{(γ
k)K
k=1,R}
1
K
K
X
k=1
1
nk
nk
X
i=1
fik(γk,R;γ
k,R)/N2
+ 642π(log N)hmax
kγkb
GN(R) + max
kb
Gnk(Γ)i.(A.9)
Under Condition 3and θΘ, adapting from Theorem 2 of Golowich et al. (2018),
FN(Rj)2
D
Y
i=0
(pi+ 1)BRpD+ 2 + log q/N.
If nkd+ log K, applying Lemma 4 of Tripuraneni et al. (2020) and using the concavity of
·function, we have EhqPd
j=1 σjXk)iO(d) and EhqPp
j=1 σjR(Zk))iO(p).
Thus, under Condition 3, combing (A.9) with (A.3), (A.4), (A.5), and (A.8), we prove
Lemma 2.
Lemma 3. Suppose Conditions 3-4 hold. Then we have
$$\|\widehat{\mu} - \mu\|_2 = O_p(n_0^{-1/2}) + O_p(\Delta_N); \qquad
\|\widehat{J}_0^{-1} - J_0^{-1}\|_2 = O_p(n_0^{-1/2}) + O_p(\Delta_N). \qquad (\mathrm{A.10})$$
Proof of Lemma 3.Theorem 2indicates that b
R(Z)R(Z)2=Op(∆N) for all Z
Z. Denote b
Q=Pn0
i=1 b
R(Z0i)( b
R(Z0i))T/n0and Q=Pn0
i=1 R(Z0i)(R(Z0i))T/n0. Under
Condition 3, we first derive
b
QQ
2
=
1
n0
n0
X
i=1 b
R(Z0i)R(Z0i)b
R(Z0i)R(Z0i)T
+1
n0
n0
X
i=1
R(Z0i)b
R(Z0i)R(Z0i)T
+1
n0
n0
X
i=1 b
R(Z0i)R(Z0i)(R(Z0i))T
2
=Op{∥ b
R(Z0)R(Z0)2}=Op(∆N).(A.11)
By (A.11) and Condition 3, we further derive that
b
Q1Q1
2
=
b
Q1(b
QQ)Q1
2
b
Q1
2
b
QQ
2
Q1
2
=Op(1)Op(∆N)Op(1) = Op(∆N).
Recall the construction of efficient scores for β0,µis the minimizer of
E[X0µR(Z0)2
2].
Hence, under Conditions 3-4and by the weak law of large numbers, we have
b
µµ2
1
n0
n0
X
i=1
X0i(b
R(Z0i))Tb
Q11
n0
n0
X
i=1
X0i(R(Z0i))TQ1
2
+
1
n0
n0
X
i=1
X0i(R(Z0i))TQ1µ
2
1
n0
n0
X
i=1
X0i{b
R(Z0i)R(Z0i)}T{b
Q1Q1}
2
+
1
n0
n0
X
i=1
X0i{R(Z0i)}T{b
Q1Q1}
2
+
1
n0
n0
X
i=1
X0i{b
R(Z0i)R(Z0i)}TQ1
2
+
1
n0
n0
X
i=1
X0i(R(Z0i))TQ1µ
2
=Op(∆N) + Op(n1/2
0).(A.12)
With the quadratic loss and by the property of E(X0|R(Z0)), we have
E(X0|R(Z0)) = µR(Z0).
Hence, by (A.12) and b
R(Z0)R(Z0)2=Op(∆N), and the independence between Dn,0
and Dn,k for k= 1, . . . , K, we have
b
J0J02
=
1
n0
n0
X
i=1 {X0ib
µb
R(Z0i)}{X0ib
µb
R(Z0i)}TJ0
2
1
n0
n0
X
i=1 {X0ib
µb
R(Z0i)}{X0ib
µb
R(Z0i)}T
1
n0
n0
X
i=1 {X0iµb
R(Z0i)}{X0iµb
R(Z0i)}T
2
+
1
n0
n0
X
i=1 {X0iµb
R(Z0i)}{X0iµb
R(Z0i)}T
Eh{X0µb
R(Z0)}{X0µb
R(Z0)}Ti
2
+
Eh{X0µb
R(Z0)}{X0µb
R(Z0)}T
−{X0µR(Z0)}{X0µR(Z0)}T
2
+
E{X0µR(Z0)}{X0µR(Z0)}TJ0
2
=Op(n1/2
0) + Op(∆N).
B.2 Proofs of Theorems and Corollaries
Proof of Theorem 1. For any $\psi$ satisfying (5), we have
$$
\begin{aligned}
&\frac{1}{K}\sum_{k=1}^{K}E\big[\ell(D_k;\beta_k,\gamma_k,R) - \ell(D_k;\beta^{*}_k,\gamma^{*}_k,R^{*})\big] \\
&\quad = \frac{1}{K}\sum_{k=1}^{K}E\Big[\big\{R^{*\top}(Z_k)\gamma^{*}_k - R^{\top}(Z_k)\gamma_k + X_k^{\top}\beta^{*}_k - X_k^{\top}\beta_k\big\}^{2}\Big] \\
&\quad = \frac{1}{K}\sum_{k=1}^{K}E\Big[\big[(\beta^{*}_k-\beta_k)^{\top}\{X_k - E(X_k|Z_k)\}
 + (\beta^{*}_k-\beta_k)^{\top}\{E(X_k|Z_k)\} + \{R^{*\top}(Z_k)\gamma^{*}_k - R^{\top}(Z_k)\gamma_k\}\big]^{2}\Big] \\
&\quad = \frac{1}{K}\sum_{k=1}^{K}E\Big[\big[(\beta^{*}_k-\beta_k)^{\top}\{X_k - E(X_k|Z_k)\}\big]^{2}\Big]
 + \frac{1}{K}\sum_{k=1}^{K}E\Big[\big[(\beta^{*}_k-\beta_k)^{\top}\{E(X_k|Z_k)\} + \{R^{*\top}(Z_k)\gamma^{*}_k - R^{\top}(Z_k)\gamma_k\}\big]^{2}\Big].
\end{aligned}
$$
Together with Condition 1, we conclude that
$$\frac{1}{K}\sum_{k=1}^{K}E\big[\ell(D_k;\beta_k,\gamma_k,R) - \ell(D_k;\beta^{*}_k,\gamma^{*}_k,R^{*})\big] > 0$$
for any $\beta_k \ne \beta^{*}_k$. Hence, by the definition of $\psi$ and $\psi^{*}$, we attain that $\beta_k = \beta^{*}_k$ and
$$R^{\top}(Z)\gamma_k = R^{*\top}(Z)\gamma^{*}_k, \qquad (\mathrm{A.13})$$
for all $Z\in\mathcal{Z}$, every $\beta_k\in\mathcal{B}$ and $\gamma_k\in\Gamma$, and $k\in[K]$. We drop the subscript of $Z_k$ to avoid confusion, since equation (A.13) holds for all $Z_k\in\mathcal{Z}$. By Condition 2, we construct an invertible matrix $U_0 = [\gamma_{k_1},\cdots,\gamma_{k_p}]\in\mathbb{R}^{p\times p}$ such that $U_0^{\top}R(Z) = U_{*}^{\top}R^{*}(Z)$ for all $Z\in\mathcal{Z}$, where $U_{*} = [\gamma^{*}_{k_1},\cdots,\gamma^{*}_{k_p}]\in\mathbb{R}^{p\times p}$. Then we have $R(Z) = (U_0^{\top})^{-1}U_{*}^{\top}R^{*}(Z)$. Note that by Condition 2, there exist $p$ tasks with inputs $Z_1,\dots,Z_p$ such that $V_0 = [R(Z_1),\dots,R(Z_p)]\in\mathbb{R}^{p\times p}$ is an invertible matrix. Consequently, we can write
$$V_0 = \Lambda V_{*},$$
where $\Lambda = (U_0^{\top})^{-1}U_{*}^{\top}$ and $V_{*} = [R^{*}(Z_1),\dots,R^{*}(Z_p)]\in\mathbb{R}^{p\times p}$. Since $V_0$ is invertible and $U_{*}$ does not depend on the input, so are $\Lambda$ and $V_{*}$. This completes the proof.
Proof of Theorem 2. We center the functions to
ik(Dki;ψk) = (Dki;ψk)(Dki;0),
where ψk= (βk,γk,R). Under Condition 3, applying the contraction principle (Ledoux
and Talagrand,1991, Theorem 4.12) over set {β
kXki +γ
kR(Zki), i [nk], k [K]} RN
shows that
Eε"sup
ψΨ
1
K
K
X
k=1
1
nk
nk
X
i=1
εikik (Dki;ψk)#2BδFN(BK+ ΓK(R)),(A.14)
with probability at least 1δ, where Bδ=clog (1)+4BXBβ+ 4BRBγand cis a constant.
Additional, we can easily prove that |ik(Dki;0)| Bwith probability 1 δunder Condi-
tion 3, where B= (clog (1) + BXBβ+BRBγ)2. Further, the constant-shift property of
Rademacher averages (Wainwright (2019), Exercise 4.7c) gives
Eε"sup
ψΨ
1
K
K
X
k=1
1
nk
nk
X
i=1
εik(Dki;ψk)#
Eε"sup
ψΨ
1
K
K
X
k=1
1
nk
nk
X
i=1
εikik (Dki;ψk)#+B
N,(A.15)
with probability at least 1 δ. Theorem 4.10 of Wainwright (2019) shows that
sup
ψΨ
1
K
K
X
k=1 L(Dn,k;βk,γk,R)1
K
K
X
k=1
E{(Dk;βk,γk,R)}
2FN((BK+ ΓK(R))) + 2Brlog(1)
N
with probability at least 1 3δ, where L(Dn,k;βk,γk,R) = Pnk
i=1 (Dki;βk,γk,R)/nk. Con-
sequently, combining (A.14) and (A.15), we have
sup
ψ∈F
1
K
K
X
k=1 L(Dn,k;βk,γk,R)1
K
K
X
k=1
E{(Dk;βk,γk,R)}
4BδFN(BK+ ΓK(R)) + 4Brlog(1)
N,(A.16)
with probability at least 14δ. Therefore, under Condition 6, combining (A.16) with Lemma
2, we can conclude that
sup
ψ∈F
1
K
K
X
k=1 L(Dn,k;βk,γk,R)1
K
K
X
k=1
E{(Dk;βk,γk,R)}
p
0.(A.17)
Further applying Theorem 1, there exists an invertible matrix Λsuch that, b
Rcoverges to
R, and b
γkconverges to γ
k, where R= Λ1
Rand γ
k= Λ
γkfor k[K]. Define
e
R= arg min
R∈R
1
K
K
X
k=1
EhRT(Zk)γkRT
(Zk)γk2i.(A.18)
Under Conditions 35, by the proof of Theorem 6.2 in Jiao et al. (2023), we know that
if the network width and depth be W= 114(κ+ 1)(plog q)κ+1 and D= 21(κ+
1)2N(plog q)/2(plog q+2κ)log2(8N(plog q)/2(plog q+2κ)), then
1
K
K
X
k=1
Eγ∗⊤
ke
R(Zk)γ∗⊤
kR(Zk)2
=1
K
K
X
k=1
Eγ
ke
R(Zk)γ
kR(Zk)2
1
K
K
X
k=1
E
e
R(Zk)R(Zk)
2γk2
=O(pB2
γN2s
2s+plog q),
where e
R(·)=Λe
R(·). Consequently, under Condition 6, we have
1
K
K
X
k=1
Eh(Dk;βk,γ
k,e
R)i1
K
K
X
k=1
E[(Dk;βk,γ
k,R)]
=1
K
K
X
k=1
Ee
RT(Zk)γ
kRT(Zk)γ
k20.(A.19)
Then, (A.17) and (A.19) lead to
1
K
K
X
k=1 L(Dn,k;βk,γ
k,e
R)1
K
K
X
k=1 L(Dn,k;βk,γ
k,R)
1
K
K
X
k=1 L(Dn,k;βk,γ
k,e
R)1
K
K
X
k=1
Eh(Dk;βk,γ
k,e
R)i
+
1
K
K
X
k=1
Eh(Dk;βk,γ
k,e
R)i1
K
K
X
k=1
E[(Dk;βk,γ
k,R)]
+
1
K
K
X
k=1
E[(Dk;βk,γ
k,R)] 1
K
K
X
k=1 L(Dn,k;βk,γ
k,R)
=op(1).(A.20)
Noting that {(b
βk,b
γk)K
k=1,b
R}is the minimizer of (7), by (A.20), we further have
1
K
K
X
k=1 L(Dn,k;b
βk,b
γk,b
R)1
K
K
X
k=1 L(Dn,k;βk,γ
k,e
R)
1
K
K
X
k=1 L(Dn,k;βk,γ
k,R) + op(1).(A.21)
Since
1
K
K
X
k=1
E[(Dk;βk,γk,R)] 1
K
K
X
k=1
E[(Dk;βk,γ
k,R)]
=d2
2(ψ,ψ),
then, for any small δ > 0, we have
inf
d(ψ,ψ)δ
1
K
K
X
k=1
E[(Dk;βk,γk,R)] >1
K
K
X
k=1
E[(Dk;βk,γk,R)]
>1
K
K
X
k=1
E[(Dk;βk,γ
k,R)] .(A.22)
Therefore, the Conditions of Theorem 5.7 in Van der Vaart (2000) follow from (A.17), (A.21),
and (A.22), and this implies that d(b
ψ,ψ)p
0 as n , where b
ψ={(b
βk,b
γk)K
k=1,b
R}.
Next, we show the convergence rates of d(b
ψ,ψ). By lemma 1, we have
E"sup
ψ∈Fδ
N
1
K
K
X
k=1 L(Dn,k;βk,γk,R)1
K
K
X
k=1
E[(Dk;βk,γk,R)]
(1
K
K
X
k=1 L(Dn,k;βk,γ
k,R)1
K
K
X
k=1
E[(Dk;βk,γ
k,R)])#
ϕN(δ),(A.23)
where ϕN(δ) = δps1log(s2) + s1
Nlog(s2).
Denote υN=s1/2
1(log N)2N1/2. With some calculations, we have that ϕN(υN)υ2
NN
and ϕN(p1/2BγNs/(2s+plog q))NpB2
γN2s
2s+plog q. On the other hand, by the definition of
e
Rin (A.18) and analogy to (A.23),
1
K
K
X
k=1 L(Dn,k;βk,γ
k,e
R)1
K
K
X
k=1 L(Dn,k;βk,γ
k,R)
1
K
K
X
k=1
En(Dk;βk,γ
k,e
R)o1
K
K
X
k=1
E{(Dk;βk,γ
k,R)}
+Op(N1/2ϕN(υN))
d2
2({(βk,γ
k)K
k=1,e
R},{(βk,γ
k)K
k=1,R}) + Op(N1/2ϕN(υN))
O(pB2
γN2s
2s+plog q) + Op(υ2
N).
Then by the definition of b
ψ, we have
1
K
K
X
k=1 L(Dn,k;b
βk,b
γk,b
R)
1
K
K
X
k=1 L(Dn,k;βk,γ
k,e
R)
1
K
K
X
k=1 L(Dn,k;βk,γ
k,R) + Op(pB2
γN2s
2s+plog q+υ2
N).
By Theorem 3.4.1 in Van Der Vaart and Wellner (1996), we have d(b
ψ,ψ) = O(p1/2Ns/(2s+plog q)+
υN). Furthermore, we have
d2
2(b
ψ,ψ)
=1
K
K
X
k=1
Eh(b
βkβk)T{XkE(Xk|R(Zk))}+
(b
βkβk)T{E(Xk|R(Zk))}+{b
RT(Zk)b
γkRT(Zk)γ
k}2
=1
K
K
X
k=1
E(b
βkβk)T{XkE(Xk|R(Zk))}2
+1
K
K
X
k=1
E(b
βkβk)T{E(Xk|R(Zk))}+{b
RT(Zk)b
γkRT(Zk)γ
k}2.
Thus, by Conditions 13, it follows that maxk[K]b
βkβk2=Op(p1/2Ns/(2s+plog q)+υN)
and
1
K
K
X
k=1
Eh{b
RT(Zk)b
γkRT(Zk)γ
k}2i=O(pB2
γN2s
2s+plog q+υ2
N).
Denote
{e
γ
k,e
γk}= arg sup
γ
kΓ
inf
γkΓ
1
K
K
X
k=1
Eh{b
RT(Zk)γkRT(Zk)γ
k}2i.
Furthermore,
1
K
K
X
k=1
Eh{b
RT(Zk)e
γkRT(Zk)e
γk}2i
sup
γ
kΓ
inf
γkΓ
1
K
K
X
k=1
Eh{b
RT(Zk)γkRT(Zk)γ
k}2i
Cinf
γkΓ
1
K
K
X
k=1
Eh{b
RT(Zk)γkRT(Zk)γ
k}2i
C
K
K
X
k=1
Eh{b
RT(Zk)b
γkRT(Zk)γ
k}2i,
where the second inequality is implied by Lemma 6 of Tripuraneni et al. (2020). Conse-
quently, by Condition 3, we have d2(b
R,R) = O(∆N).
Proof of Theorem 3. Under Condition 3, by Theorem 2 and Lemma 3, it is easy to conclude
that
{b
J0}1"1
n0
n0
X
i=1 {X0ib
µb
R(Z0i)}{R(Z0i)b
R(Z0i)}Tγ
0#=Op(n1/2
0N+ 2
N).(A.24)
By (A.12), and the independence between Dn,k and Dn,0, we have
1
n0
n0
X
i=1 {µb
R(Z0i)b
µb
R(Z0i)}ϵ0i
2
b
µµ2
1
n0
n0
X
i=1 b
R(Z0i)ϵ0i
2
=Op(n1/2
0N+n1
0).
Consequently, we can easily obtain
{b
J0}1"1
n0
n0
X
i=1 {µR(Z0i)µb
R(Z0i) + µb
R(Z0i)b
µb
R(Z0i)}ϵ0i#
=Op(n1/2
0N+n1
0).(A.25)
Recall the first-order optimality Conditions
1
n0
n0
X
i=1
X0i{Y0iX0i
Tb
β0(b
R(Z0i))Tb
γ0}= 0;
1
n0
n0
X
i=1 b
R(Z0i){Y0iX0i
Tb
β0(b
R(Z0i))Tb
γ0}= 0.
Then the empirical orthogonal score for β0is then given by
1
n0
n0
X
i=1 {X0ib
µb
R(Z0i)}{Y0iX0i
Tb
β0(b
R(Z0i))Tb
γ0}= 0.
With some simple calculations, we can obtain that
1
n0
n0
X
i=1 {X0ib
µb
R(Z0i)}{X0i
T(β0b
β0)(b
R(Z0i))T(b
γ0γ
0)
+ (R(Z0i)b
R(Z0i))Tγ
0+ϵ0i}= 0.
Then we conclude
b
β0β0={b
J0}1"1
n0
n0
X
i=1 {X0ib
µb
R(Z0i)}(R(Z0i)b
R(Z0i))Tγ
0#
+{b
J0}1"1
n0
n0
X
i=1 {X0ib
µb
R(Z0i)}ϵ0i#
={b
J0}1"1
n0
n0
X
i=1 {X0ib
µb
R(Z0i)}(R(Z0i)b
R(Z0i))Tγ
0#
+{b
J0}1"1
n0
n0
X
i=1 {X0iµR(Z0i)}ϵ0i#
+{b
J0}1"1
n0
n0
X
i=1 {µR(Z0i)µb
R(Z0i) + µb
R(Z0i)b
µb
R(Z0i)}ϵ0i#.
By (A.10), (A.24), and (A.25), we obtain that
b
β0β0={J0}1"1
n0
n0
X
i=1 {X0iµR(Z0i)}ϵ0i#+Op(∆2
N).
Consequently, we have proved that
n0(b
β0β0)d
N(0, σ2
0J01).
Proof of Corollary 1. Denote the efficient score for β0 as
Φ(D0;µ0,β0,γ
0,R) = {X0µR(Z0)}{Y0XT
0β0(R(Z0))Tγ
0}.
Then the efficient orthogonal score for β0is then given by
Φ(D0;b
µ,b
β0,b
γ0,b
R) = {X0b
µb
R(Z0i)}{Y0XT
0b
β0(b
R(Z0i))Tb
γ0}.
Note that Φis a d-dimensional vector. Let Φjdenote the jth element of Φ,j[d]. To
simplify the notation, write Φj(D0)=Φj(D0;µ,β0,γ
0,R) and b
Φj(D0;b
µ,b
β0,b
γ0,b
R). For
any j[d], with some calculations, under Conditions 3and 7, the results of Theorem 2and
Theorem 3imply that
1
n0
n0
X
i=1
b
Φ(D0)Φ(D0)
2
2=Op(∆N+n1/2
0).(A.26)
By Condition 7, we have that
EΦ(µ,β0,γ
0,R)2
2
=E[σ2
0]E[{X0m(Z0)}T{X0m(Z0)}] = O(1).
Therefore, for any j1, j2[d],
1
n0
n0
X
i=1 |Φj1(D0i)|2_|Φj2(D0i)|2!1/2
=Op(1),(A.27)
where aWb= max{a, b}. Consequently, by (A.26) and (A.27), we have
1
n0
n0
X
i=1 b
Φj1(D0i)b
Φj2(D0i)1
n0
n0
X
i=1
Φj1(D0ij2(D0i)
1
n0
n0
X
i=1 b
Φj1(D0i)b
Φj2(D0i)Φj1(D0ij2(D0i)
1
n0
n0
X
i=1 b
Φj1(D0i)Φj1(D0i)_b
Φj2(D0i)Φj2(D0i)
×|Φj1(D0i)|_|Φj2(D0i)|+b
Φj1(D0i)Φj1(D0i)_b
Φj2(D0i)Φj2(D0i)
Op(∆N+n1/2
0).(A.28)
Furthermore, note that
1
n0
n0
X
i=1
Φj1(D0ij2(D0i)E(ϵ2
0V
j1V
j2) = Op(n1/2
0),(A.29)
where V
jis the jth element of X0m(Z0). Therefore, by (A.28) and (A.29), we have
1
n0
n0
X
i=1 b
Φj1(D0i)b
Φj2(D0i)E(ϵ2
0V
j1V
j2)
Op(∆N+n1/2
0).
Note that
b
J0
p
J0.
Then applying the continuous mapping theorem completes the proof of Corollary 1.