Representation Transfer Learning for
Semiparametric Regression
Baihua He, Huihang Liu, Xinyu Zhang, and Jian Huang

June 21, 2024

arXiv:2406.13197v1 [stat.ME] 19 Jun 2024
Abstract
We propose a transfer learning method that utilizes data representations in a semi-
parametric regression model. Our aim is to perform statistical inference on the param-
eter of primary interest in the target model while accounting for potential nonlinear
effects of confounding variables. We leverage knowledge from source domains, assum-
ing that the sample size of the source data is substantially larger than that of the target
data. This knowledge transfer is carried out by the sharing of data representations,
predicated on the idea that there exists a set of latent representations transferable from
the source to the target domain. We address model heterogeneity between the source
and target domains by incorporating domain-specific parameters in their respective
models. We establish sufficient conditions for the identifiability of the models and
demonstrate that the estimator for the primary parameter in the target model is both
consistent and asymptotically normal. These results lay the theoretical groundwork for
making statistical inferences about the main effects. Our simulation studies highlight
the benefits of our method, and we further illustrate its practical applications using
real-world data.
Keywords: Asymptotic normality, data representation, heterogeneity, identifiability, multi-
source data.
1 Introduction
In many practical scenarios, the availability of data can be limited, posing difficulties for
estimating effective models for statistical inference. Often, there is an abundance of data in
related but not identical source domains, while the specific target domain suffers from data
∗ Equal contribution.
Baihua He: Department of Statistics and Finance, School of Management, University of Science and Technology of China, Hefei, China. Email: baihua@ustc.edu.cn
Huihang Liu: International Institute of Finance, School of Management, University of Science and Technology of China, Hefei, China. Email: huihang@mail.ustc.edu.cn
Xinyu Zhang: Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China. Email: xinyu@amss.ac.cn
Jian Huang (corresponding author): Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong SAR, China. Email: j.huang@polyu.edu.hk
scarcity. Transfer learning provides a solution to this issue by leveraging knowledge from
similar source tasks to enhance model performance in the target task (Pan and Yang, 2010).
Over the past decade, transfer learning has been widely adopted in various machine learning
tasks, including computer vision (Yosinski et al., 2014), natural language processing (Mou
et al., 2016), and speech recognition (Huang et al., 2013).
The challenges of domain heterogeneity and the risks of negative transfer in utilizing
auxiliary source data have led to the advancement of statistical transfer learning, which
seeks to develop novel transfer learning methodologies that address these challenges and
establish their theoretical properties. Researchers have proposed transfer learning methods for a variety of models, including high-dimensional linear models (Bastani, 2021; Li et al., 2022), generalized linear models (Tian and Feng, 2022; Li et al., 2023), functional regression (Lin and Reimherr, 2022), semi-supervised classification (Zhou et al., 2022), and basis-type models (Cai and Pu, 2024), among others. Tian et al. (2023) introduced a linear representation multi-task method for estimating a shared representation. However, their reliance on linear associations and linear representations constrains the method's applicability to complex data structures. In the context of transfer learning, Hu and Zhang (2023) proposed a model averaging approach for semiparametric regression models. However, their method only leverages knowledge in the linear component for transfer learning, overlooking similarities in the non-linear components.
Although these methods have shown promising results, certain issues remain unexplored. The first is the challenge of balancing the trade-off between model flexibility and parameter interpretability. Most existing research on transfer learning and multi-source data integration falls into two categories: studies that focus on parametric models (Tian and Feng, 2022; Li et al., 2023), which offer simplicity and interpretability, and those that explore complex non-linear models (Tan et al., 2018), which lack interpretability due to their "black-box" nature. Striking a balance between the interpretability of parametric models and the flexibility of non-parametric models remains a significant challenge. The second is the challenge of constructing and identifying the transferable knowledge across data domains. Most existing statistical transfer learning methods transfer the model parameters directly to the target domain. However, these methods depend on parametric model assumptions, which may limit the transferability of the knowledge and fail to utilize the latent shared information among domains. The knowledge is not transferable if the model parameters are different, even if there are shared latent structures in the data. The identifiability of the transferable knowledge is also crucial for statistical inference before transferring. These challenges motivate us to develop a novel transfer learning method that uses the shared knowledge among domains and strikes a balance between model interpretability and flexibility.
We propose a representation transfer learning (RTL) method for knowledge transfer within the context of the semiparametric regression model (Engle et al., 1986). The semiparametric regression model enables the interpretability of the treatment parameters, or parameters of main interest, while capturing flexible data structures through the nonparametric components. Our main objective is to facilitate statistical inference for the treatment effects in the target model, taking into account possibly nonlinear effects of the confounding variables. We achieve this by accommodating multi-dimensional confounding variables in a flexible manner and by incorporating information from source domains. The scenario we address involves a target model of interest, together with independent heterogeneous source models that share a higher-level data representation. The representation transfer learning mechanism enables the latent shared knowledge to be transferred. We use deep neural networks in the estimation of the effects of confounding variables through a set of representation functions. The transfer of knowledge from the source data to the target model occurs via these representation functions, which we estimate by capitalizing on the ample sample sizes available in the source domains.
The main contributions of our paper are threefold. First, a critical concern is the identifiability of both the transferred representation function and the domain-specific parametric components. Given that the representation is a high-level abstraction of the data, it is often non-unique and non-identifiable (Chen et al., 2023). Within the semiparametric regression framework, the identifiability of the parametric components can be compromised due to their interaction with the representation. To tackle this challenge, we formulate novel and interpretable conditions that ensure the identifiability of both the representation function and the linear coefficients, provided there is sufficient diversity among the source domains. Second, our theoretical analysis shows that the proposed method can consistently estimate the representation functions via deep neural networks with ReLU activation. We demonstrate that representation transfer learning can reduce approximation bias and enhance sample efficiency. Third, we establish the asymptotic normality of the estimated primary parameter in the target model, providing the basis for statistical inference regarding the effects of the variable of primary interest. Consequently, RTL adeptly balances model interpretability with model flexibility and improves the estimation accuracy of the primary parameter.
Our proposed RTL approach marks a significant departure from existing statistical meth-
ods for transfer learning and semiparametric regression models. Unlike the traditional
distance-based transfer learning frameworks, we account for model heterogeneity between
the source and target domains by employing flexible representation functions and domain-
specific parameters. These learned representation functions serve as conduits for knowledge
transfer, capturing intrinsic information that is often the most challenging aspect to estimate
in a model.
The rest of the article is organized as follows. We introduce the model framework and
develop the proposed RTL method in Section 2. We provide the theoretical guarantees
in Section 3. We present the numerical studies, including simulations and a semi-synthetic data analysis based on the MNIST handwritten digits dataset, and illustrate RTL using the Pennsylvania reemployment bonus experiment and housing rental information data in Section 4. We give
concluding remarks in Section 5, and relegate all technical proofs to the Supplementary
Materials.
2 Model and methodology
In this section, we present our proposed Representation Transfer Learning (RTL) method.
We denote the data by the triplet $(Y, X, Z)$, where $Y \in \mathbb{R}$ is the response variable, $X \in \mathbb{R}^d$ corresponds to the $d$-dimensional covariate of primary interest, and $Z \in \mathbb{R}^q$ is a $q$-dimensional confounding variable. Typically, $X$ is a low-dimensional treatment variable, implying that $d$ is small. We permit the confounding variable $Z$ to be of a moderately high dimension, with its dimension $q$ allowed to grow as the sample size increases. We consider $K$ distinct source domains, each with its own dataset $(Y_k, X_k, Z_k)$ for $k = 1, \ldots, K$. Additionally, we have the target domain data, denoted as $(Y_0, X_0, Z_0)$. Our approach is designed to leverage the information from these multiple source domains to enhance inference in the target domain via a shared data representation.
2.1 Model
We begin by considering distinct semiparametric regression models for each source domain
and the target domain:
$$\text{Sources:}\quad Y_k = \beta_k^\top X_k + g_k(Z_k) + \varepsilon_k, \quad k = 1, \ldots, K, \qquad (1)$$
$$\text{Target:}\quad Y_0 = \beta_0^\top X_0 + g_0(Z_0) + \varepsilon_0. \qquad (2)$$
In these models, for $k = 0, 1, \ldots, K$, $\beta_k$ represents the effects associated with $X_k$, $g_k: \mathbb{R}^q \to \mathbb{R}$ denotes an unspecified nonparametric function capturing the potential nonlinear impact of the confounding variable $Z_k$, and $\varepsilon_k$ is the random noise component with $E(\varepsilon_k) = 0$ and $E(\varepsilon_k^2) = \sigma_k^2$. This model allows us to systematically address the influence of both the covariates of interest and the confounding variables across different domains.
Our main goal is to conduct statistical inference on the parameter β0, which measures
the effect of the covariate of interest, X0, on the outcome variable Y0. This task is chal-
lenging for two main reasons: the limited availability of data from the target domain and
the complex influence that nonparametric estimation of nuisance functions, related to mul-
tivariate confounding variables, has on the estimation of β0. Although the use of flexible
neural networks to approximate these functions may appear to be a feasible approach, it
complicates the inference process for β0. Furthermore, this method in itself does not over-
come the fundamental obstacle known as the “curse of dimensionality”, which arises during
the nonparametric estimation of a multidimensional function.
Transfer learning offers a solution to the “curse of dimensionality” in the target domain
by utilizing data from multiple sources. We are particularly interested in scenarios where
the combined sample size from the source domains significantly exceeds that of the target
domain. To enable the effective transfer of knowledge from the source domains to the target
domain, it is crucial to establish specific assumptions about the relationship between the
source and target data models. Our approach capitalizes on the source data to assist in
estimating the relevant function within the target data model. This method assumes the
existence of a latent representation of the confounding effect that is invariant across both
the source and target data.
Specifically, we propose the following expression for the confounding effects:
$$g_k(Z) = \gamma_k^\top R(Z), \quad k = 0, 1, \ldots, K,$$
where $R: \mathbb{R}^q \to \mathbb{R}^p$ functions as a representation of the confounding variables, and $\gamma_k$ represents domain-specific coefficients. This composite model structure aligns with the approaches suggested by Du et al. (2020) and Tripuraneni et al. (2020). The representation $R$ can be interpreted as a set of basis functions, with $\gamma_k$ acting as the corresponding weights. This strategy differs from traditional basis expansion techniques, such as spline methods, which rely on a predetermined set of basis functions for approximating nonparametric functions. Instead, our approach estimates the representation function $R$ from the data.
By focusing on the differences in the coefficients $(\beta, \gamma)$ across the source and target domains, we can effectively capture domain heterogeneity. This approach simplifies the challenging tasks of function estimation and heterogeneity detection. Consequently, we posit that the representation function $R$ is a shared element across different domains, representing the transferable knowledge from source tasks to the target task.
The above discussion leads to the proposed RTL model as follows:
$$\text{Sources:}\quad Y_k = \beta_k^\top X_k + \gamma_k^\top R(Z_k) + \varepsilon_k, \quad k = 1, \ldots, K, \qquad (3)$$
$$\text{Target:}\quad Y_0 = \beta_0^\top X_0 + \gamma_0^\top R(Z_0) + \varepsilon_0, \qquad (4)$$
where $\beta_k$ and $\gamma_k$ are source-specific coefficients of dimensions $d$ and $p$, respectively. The function $R: \mathbb{R}^q \to \mathbb{R}^p$ serves as the shared representation function across domains.
2.2 Estimation method
Based on the RTL models (3) and (4), at the population level, our proposed RTL method
proceeds in two steps:
Step P1: In the source domain, we consider the minimizers of the population risk function
$$\{(\beta_k, \gamma_k)_{k=1}^K, R\} \in \arg\min_{\{(\beta_k, \gamma_k)_{k=1}^K, R\}} \frac{1}{K}\sum_{k=1}^K E\{Y_k - \beta_k^\top X_k - \gamma_k^\top R(Z_k)\}^2, \qquad (5)$$
where $R$ is the shared representation function that will be used in the target domain, and the coefficients $(\beta_k, \gamma_k)_{k=1}^K$ take into account possible heterogeneity across the source domains and the target domain. Within the confounding effects $\gamma_k^\top R(Z_k)$, the factors $\gamma_k$ and $R$ are not separable and thus are not individually identifiable, as discussed in Section 3. Fortunately, the effect $\beta_k$ is uniquely identifiable.
Step P2: In the target domain, given the representation function $R$ from the source domain, we solve
$$\{\beta_0, \gamma_0\} = \arg\min_{(\beta_0, \gamma_0)} E\{Y_0 - \beta_0^\top X_0 - \gamma_0^\top R(Z_0)\}^2. \qquad (6)$$
Now suppose we have a random sample of independent and identically distributed observations from the target domain, denoted as $\{(Y_{0i}, X_{0i}, Z_{0i}), i = 1, \ldots, n_0\}$. Additionally, we have access to the datasets from $K$ source domains, $\{(Y_{ki}, X_{ki}, Z_{ki}), i = 1, \ldots, n_k\}$, where $k = 1, \ldots, K$. Let $N = n_1 + \cdots + n_K$ be the combined sample size of the source domains. Although there are no explicit constraints on the sample sizes across the target and source domains, it is usually the case that $N$ significantly exceeds the sample size $n_0$ of the target domain. Our main goal is to leverage the data from these source domains to enhance the estimation accuracy within the target domain. This can be achieved by using the empirical versions of the population-level formulations given in (5) and (6). We first estimate the representation function $R$ using the source data, and then estimate the regression parameters within the target model using the target data. These two steps correspond to Steps P1 and P2 at the population level and are as follows:
Step E1: Estimation of the shared representation function. This step involves estimating a shared representation function $R$, which is formulated as an optimization problem:
$$\{(\widehat{\beta}_k, \widehat{\gamma}_k)_{k=1}^K, \widehat{R}\} = \arg\min_{\{(\beta_k, \gamma_k)_{k=1}^K,\, R \in \mathcal{R}\}} \Big\{\frac{1}{K}\sum_{k=1}^K \frac{1}{n_k}\sum_{i=1}^{n_k}\big(Y_{ki} - \beta_k^\top X_{ki} - \gamma_k^\top R(Z_{ki})\big)^2\Big\}, \qquad (7)$$
where the estimation of the representation function $R$ is conducted over a specified class of neural networks, denoted as $\mathcal{R}$. This step is crucial for capturing the underlying representations shared across the source domains.
Step E2: Estimation of parameters in the target model. After estimating the representation function $\widehat{R}$ from the source data, the next step is to estimate the parameters within the target model. This is achieved by solving
$$\{\widehat{\beta}_0, \widehat{\gamma}_0\} = \arg\min_{\beta_0, \gamma_0} \frac{1}{n_0}\sum_{i=1}^{n_0}\{Y_{0i} - \beta_0^\top X_{0i} - \gamma_0^\top \widehat{R}(Z_{0i})\}^2. \qquad (8)$$
Given that $\widehat{R}$ remains fixed in this step, the pair $(\widehat{\beta}_0, \widehat{\gamma}_0)$ is effectively obtained through a least squares estimation process.
The proposed RTL method, which involves pre-training on multiple source domains before transferring the estimated representations to the target domain, enhances data and computational efficiency. The abundance of source data ensures that the representation function $R$ can be estimated at a much faster convergence rate than is possible when only target domain data is available.
2.3 Implementation
We approximate the representation functions by feedforward neural networks defined as
$$R(z) = A_D\,\sigma(A_{D-1}\,\sigma(\cdots\sigma(A_0 z + b_0)\cdots) + b_{D-1}) + b_D,$$
where $A_i \in \mathbb{R}^{p_{i+1}\times p_i}$ and $b_i \in \mathbb{R}^{p_{i+1}}$ for $i = 0, \ldots, D$, $p_0 = q$ is the dimension of the input variables, $p_{D+1} = p$ is the dimension of the output layer, and $\sigma(\cdot)$ is the activation function. We consider the ReLU activation function $\sigma(x) = \max\{0, x\}$, applied component-wise. The parameters of the representation function $R(\cdot)$ are denoted as $\theta = \{A_0, \ldots, A_D, b_0, \ldots, b_D\}$. The numbers $W = \max\{p_0, \ldots, p_D\}$ and $D$ are the width and depth of the neural network, respectively. The weight matrices together with the bias vectors contain $S = \sum_{i=0}^D p_{i+1}(p_i + 1)$ entries in total. The parameters, including weights and biases, are assumed to be bounded by a constant $B_\theta > 0$. We denote the set of neural network functions defined above by $\mathcal{R} = \mathcal{NN}(W, D, B_\theta)$.
Figure 1 illustrates the architecture of a demo neural network with $d = 3$, $q = 5$, and $p = 3$. We train the representation network and the linear layer in an iterative fashion for 400 epochs. In each epoch, we first update the representation network and then the linear layer. The weights in the representation network are optimized using the SGD optimizer with a learning rate of $10^{-3}$ and a batch size of $n_k$. The weights in the last layer are obtained by least squares estimation. We use early stopping during the training process, and use other i.i.d. observations as a validation dataset for model selection, with a sample size of 30% of the training dataset. That is, we select the model with the minimum prediction error on the validation set for evaluation.
Figure 1: The architecture of a partially linear neural network with input dimensions $d = 1$ and $q = 3$, representation dimension $p = 2$, depth $D = 2$, and width $W = 5$.
To train the representation network across the different datasets, we compute the total loss function (7) in each epoch and then backpropagate the gradients to update the weights $\theta$ in the representation network. After the estimated representation $\widehat{R}$ is computed, we obtain the estimators of $\beta_0$ and $\gamma_0$ by solving (8).
3 Theoretical results
In this section, we study the theoretical properties of the proposed RTL method. We first provide sufficient conditions under which the model parameters are identifiable. Next, we derive the convergence rate of the estimated representation function in terms of the source data sample size. Then we show that the estimator of the parameter of main interest in the target domain model is asymptotically normal. We also provide a consistent estimator of the asymptotic covariance matrix. These results make it possible to conduct statistical inference about the main parameter in the target domain model.
3.1 Identifiability
Identifiability is a fundamental question in statistical modeling problems. Usually, the parameters in a model are required to be uniquely identifiable so that their consistent estimation is possible. In the proposed model, this requires careful consideration because of the term $\gamma_k^\top R(Z_k)$ representing the confounding effect in the model. Since $\gamma_k$ and $R$ are both unknown, they are not identifiable in the usual sense. In this subsection, we provide a set of conditions to guarantee the identifiability of the parameter of main interest $\beta_k$ and the confounding effects represented by $\gamma_k^\top R(Z_k)$.

We first state the definition of a notion of identifiability, linear identifiability, for $\gamma_k$ and $R$.
Definition 1. The data representations are said to be linearly identifiable if, for any two sets of parameters $\{(\gamma_k)_{k=1}^K, R\}$ and $\{(\gamma_k')_{k=1}^K, R'\}$ satisfying the model, there exists an invertible matrix $\Lambda$ such that $R'(Z) = \Lambda^{-1} R(Z)$ and $\gamma_k' = \Lambda^\top \gamma_k$ for all $Z \in \mathcal{Z}$ and $k \in [K]$.

Based on this definition, we have $\gamma_k'^\top R' = \gamma_k^\top \Lambda \Lambda^{-1} R = \gamma_k^\top R$. Therefore, although $\gamma_k$ and $R$ are only linearly identifiable, the confounding effects, represented by $\gamma_k^\top R$, are uniquely identifiable in the usual sense.
We impose the following conditions to ensure the identifiability of the parameters and representations.

Condition 1. The matrix $E[\{X_k - E(X_k \mid Z_k)\}\{X_k - E(X_k \mid Z_k)\}^\top]$ is invertible.

Condition 2. (a) There exist $\{k_i\}_{i=1}^p \subset [K]$ such that the coefficients $\{\gamma_{k_i}\}_{i=1}^p$ are linearly independent. (b) There exist $Z_1, \ldots, Z_p \in \mathcal{Z}$ such that the matrix $[R(Z_1), \ldots, R(Z_p)]$ is invertible.
Condition 1 is a common assumption in regression analysis in the presence of confounding variables; it requires that the main variables $X_k$ have significant variation across different tasks after projecting out all variation that can be explained by the nuisance variables $Z_k$, for each $k \in [K]$. When confounding variables are present, they can introduce bias or distortions that obscure the true relationship between the variables of interest. By requiring that $X_k$ maintains significant variation independent of these confounders, this condition ensures that the effects of $X_k$ on the response variable $Y_k$ can be properly estimated.

Condition 2(a) requires that the support of the distribution of the coefficients $\{\gamma_k\}_{k=1}^K$ is sufficiently rich. A similar assumption was also imposed in the analysis of panel data (Ahn et al., 2001; Bai, 2009; Moon and Weidner, 2015). Condition 2(b) stipulates that $R$ exhibits a sufficient degree of variability. This variability is essential to ensure that the image of $R$, the set of all possible outputs it can generate, does not become confined within a proper subspace of its potential range. In simpler terms, the function must be versatile enough in its transformations to avoid being restricted to a limited portion of the space it operates within.
Theorem 1. Suppose Conditions 1-2 hold. Let $\{(\beta_k, \gamma_k)_{k=1}^K, R\}$ and $\{(\beta_k', \gamma_k')_{k=1}^K, R'\}$ be sets of parameters satisfying (5). Then $\beta_k' = \beta_k$, and there exists an invertible matrix $\Lambda$ such that $R' = \Lambda^{-1} R$ and $\gamma_k' = \Lambda^\top \gamma_k$ for $k \in [K]$.

Theorem 1 shows that the representation function $R$ is identifiable up to a multiplicative matrix transformation if Conditions 1-2 are satisfied.
In the following, we use a simple example to illustrate the identifiability of the proposed model. We set the representation dimension as $p = 2$, 3, and 5, the dimension of the non-linear part as $q = p$, and the dimension of the linear part as $d = 1$. The data generating process is $Y = \beta X + \gamma^\top R(Z) + \epsilon$, where $\beta$ and $\gamma$ are generated from the standard normal distribution, $X \in \mathbb{R}$ and $Z \in \mathbb{R}^q$ are drawn from the standard normal distribution, and $\epsilon \sim N(0, 0.3^2)$. The representation functions $R(\cdot)$ are generated from the following univariate functions: $\sin(\pi x)$, $\cos(\pi x)$, $2\sqrt{|x|} - 1$, $(1 - |x|)^2$, $1/(1 + \exp(-x))$, and $\sin(x)$. The linear coefficients are heterogeneous across the source datasets. We set the sample size in each source dataset as $n_k = 2000$ for all $k = 1, \ldots, K$ and let $K = 8$. The estimated representation function is transformed by a linear transformation, which is identified by minimizing the distance between the transformed representation and the true one. Figure 2 shows the transformed learned representation (solid line) obtained by the proposed method and the true representation function (dashed line).
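The alignment step can be carried out by ordinary least squares. A small numpy sketch with synthetic values (the array names and the matrix below are illustrative placeholders, not the functions used in Figure 2) is:

```python
# Sketch: align a learned representation with the true one up to an invertible
# linear map, as in the identifiability illustration above. R_hat_vals and
# R_true_vals are (n x p) arrays holding the two representations evaluated at
# the same inputs Z_1, ..., Z_n.
import numpy as np

def align_representation(R_hat_vals, R_true_vals):
    # Least squares solution of R_hat_vals @ A ~ R_true_vals: the linear map
    # minimizing the distance between the transformed and true representations.
    A, *_ = np.linalg.lstsq(R_hat_vals, R_true_vals, rcond=None)
    return R_hat_vals @ A, A

rng = np.random.default_rng(0)
R_true_vals = rng.normal(size=(200, 2))
Lam = np.array([[2.0, 1.0], [0.0, 0.5]])           # an invertible matrix Lambda
R_hat_vals = R_true_vals @ np.linalg.inv(Lam).T    # rows are Lambda^{-1} R(Z_i)
aligned, A = align_representation(R_hat_vals, R_true_vals)
print(np.allclose(aligned, R_true_vals))           # True: the truth is recovered
```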
Figure 2: Demonstration of the representation learned by RTL. The left panel uses $\sin(\pi x)$ and $\cos(\pi x)$ as the true representation functions. The middle panel uses $2\sqrt{|x|} - 1$, $\sin(\pi x)$, and $\cos(\pi x)$. The right panel uses $(1 - |x|)^2$, $1/(1 + \exp(-x))$, $\sin(x)$, $\sin(\pi x)$, and $\cos(\pi x)$. The solid line represents the learned representation function and the dashed line represents the true representation function.
3.2 Convergence rate
Based on the identifiability of the representations, we derive the convergence rate of the
estimated representation function in this section. We impose a set of regularity conditions
to guarantee the consistency of the estimated representation function.
Condition 3. (a) For any $\beta \in \mathcal{B}$, there exists a constant $B_\beta$ such that $\|\beta\|_2 \leq B_\beta$, where for any $d$-vector $a$, $\|a\|_2 = (\sum_{j=1}^d a_j^2)^{1/2}$. For any $\gamma \in \Gamma$, there exists a constant $B_\gamma$ such that $\|\gamma\|_2 \leq B_\gamma$. (b) The covariate $X \in \mathcal{X} \subset \mathbb{R}^d$ satisfies $\|X\|_2 \leq B_X$. The representation function $R(\cdot) \in \mathcal{R}$ satisfies $\|R(Z)\|_2 \leq B_R$ for any $Z \in \mathcal{Z}$. (c) The response variable $Y_k \in \mathcal{Y} \subset \mathbb{R}$ is subexponentially distributed for $k \in [K]$.
Condition 3 includes standard assumptions in regression problems (Tripuraneni et al., 2020; Jiao et al., 2023). We further assume that the representation function is in a Hölder class.
Definition 2. Let $\kappa = s + \nu > 0$, $\nu \in (0, 1]$ and $s = \lfloor \kappa \rfloor \in \mathbb{N}_0$, where $\lfloor \kappa \rfloor$ denotes the largest integer strictly smaller than $\kappa$ and $\mathbb{N}_0$ denotes the set of nonnegative integers. For a finite constant $B_0 > 0$, define the Hölder class $\mathcal{H}^{\kappa}(\mathcal{Z}, B_0)$ as
$$\mathcal{H}^{\kappa}(\mathcal{Z}, B_0) = \Big\{ R(\cdot): \mathcal{Z} \to \mathbb{R} \,:\, \max_{\|\omega\|_1 \leq s}\|\partial^{\omega}R\|_\infty \leq B_0,\ \max_{\|\omega\|_1 = s}\sup_{Z \neq Z'}\frac{|\partial^{\omega}R(Z) - \partial^{\omega}R(Z')|}{\|Z - Z'\|_2^{\nu}} \leq B_0 \Big\},$$
where $\partial^{\omega} = \partial^{\omega_1}\cdots\partial^{\omega_q}$ with $\omega = (\omega_1, \ldots, \omega_q)^\top \in \mathbb{N}_0^q$, and $\|\omega\|_1 = \sum_{j=1}^q \omega_j$.
Condition 4. Each element of the representation function $R$ belongs to the Hölder class $\mathcal{H}^{\alpha}(\mathcal{Z}, B_0)$.
Condition 5. The support $\mathcal{Z}$ of the representation function $R: \mathbb{R}^q \to \mathbb{R}^p$ belongs to a compact $p$-dimensional Riemannian manifold isometrically embedded in $\mathbb{R}^q$ with $p \leq q$.
Condition 6. The dimensions $\{p, q\}$ of $\{R, Z\}$ satisfy the following condition:
$$n^{-1/2} p^{1/2}\log N = o(1) \quad\text{and}\quad p\,(D + 2 + \log q)^{1/2}\prod_{i=0}^{D}(p_i + 1)\,(\log N)^2 N^{-1/2} = o(1),$$
where $D$ is the depth of the neural network, $n = \min_{1\leq k\leq K} n_k$, and $N = \sum_{k=1}^K n_k$.
Condition 5 is a low-dimensional manifold condition on $Z$. In fact, Condition 5 is not necessary for establishing the convergence rate of the representation function, but it guarantees a faster convergence rate of the representation function when $q$ is very large. Condition 6 pertains to the dimensionality of the model in relation to the sample sizes. This condition accommodates the presence of a moderately high-dimensional covariate vector, allowing the dimensions $\{p, q\}$ to increase indefinitely, provided that their rate of divergence meets the specified constraints. While this condition is met in numerous applications, it does not cover sparse, high-dimensional scenarios where the number of covariates exceeds the sample size.
For representation functions $R$ and $R'$, denote $d_2(R, R') = (E\|R(Z) - R'(Z)\|_2^2)^{1/2}$. Let $\Delta_N = p^{1/2} N^{-s/(2s + p\log q)} + s_1^{1/2}(\log N)^2 N^{-1/2}$, where $s_1 = \max\{Kd, Kp, S\}$. Typically, the size of the neural network satisfies $S > \max\{Kd, Kp\}$, and thus $s_1$ is simply the network size used in the estimation.
Theorem 2. Suppose Conditions 1-6 hold. Then there exists an invertible matrix $\Lambda$ such that
$$d_2(\widehat{R}, \bar{R}) = O_p(\Delta_N),$$
where $\bar{R} = \Lambda^{-1} R$.
Theorem 2 establishes the convergence rate of the estimated representation. The rate is determined by two terms. The first term represents the approximation error, which reflects the distance from the neural network class $\mathcal{R}$ to $R$. The second term represents the stochastic error.
3.3 Asymptotic normality
In this section, we establish the asymptotic normality of the estimated primary parameter within the target domain. It is a common scenario in transfer learning that the total sample size $N$ from the source domains significantly exceeds the sample size $n_0$ in the target domain. Our derivation of the asymptotic distribution is conducted with this disparity in sample sizes taken into consideration.
We need the following condition.
Condition 7. The matrix $J_0 = E[\{X_0 - m(Z_0)\}\{X_0 - m(Z_0)\}^\top]$ is invertible and $E[\{X_0 - m(Z_0)\}^\top\{X_0 - m(Z_0)\}] < \infty$, where $m(Z_0) = E(X_0 \mid R(Z_0))$.
Condition 7 is fairly standard in the semiparametric regression literature, and it is needed for constructing a semiparametrically efficient estimator of $\beta_0$. We note that the independence between $X_k$ and $Z_k$ is not required for $k \in [K]$ throughout the paper.
To remove the confounding effect of $R(Z_0)$, we consider finding a $d \times p$ matrix $\mu$ that satisfies the orthogonality equation
$$E\big[\{X_0 - \mu R(Z_0)\}R^\top(Z_0)\big] = 0.$$
This is equivalent to finding a $\mu$ that minimizes $E[\|X_0 - \mu R(Z_0)\|_2^2]$. Moreover, the efficient score for $\beta_0$ is $\{X_0 - \mu R(Z_0)\}\epsilon_0$. The following theorem establishes the asymptotic normality of $\widehat{\beta}_0$.
Theorem 3. Suppose Conditions 1-7 hold. Then we have
$$\sqrt{n_0}(\widehat{\beta}_0 - \beta_0) = J_0^{-1}\Big[\frac{1}{\sqrt{n_0}}\sum_{i=1}^{n_0}\{X_{0i} - \mu R(Z_{0i})\}\epsilon_{0i}\Big] + O_p\big(\sqrt{n_0}\,\Delta_N^2\big), \qquad (9)$$
where $\Delta_N = p^{1/2} N^{-s/(2s + p\log q)} + s_1^{1/2}(\log N)^2 N^{-1/2}$. Therefore, if $n_0^{1/2}\Delta_N^2 \to 0$ as $N \to \infty$, that is, $n_0^{1/2} s_1(\log N)^4 N^{-1} \to 0$ and $n_0^{1/2} p N^{-2s/(2s + p\log q)} \to 0$ as $N \to \infty$, we have
$$\sqrt{n_0}(\widehat{\beta}_0 - \beta_0) \xrightarrow{D} N\big(0, \sigma_0^2 J_0^{-1}\big), \quad\text{as } n_0 \to \infty \text{ and } N \to \infty. \qquad (10)$$
Through the data augmentation provided by the abundant source data, we prove that the estimator of $\beta_0$ attains $\sqrt{n_0}$-consistency and asymptotic normality. The asymptotic expression for $\widehat{\beta}_0 - \beta_0$ in Theorem 3 indicates that the estimator $\widehat{\beta}_0$ attains the information bound, so it is semiparametrically efficient. When the variance term $\sigma_0^2 J_0^{-1}$ is unknown, we use a plug-in estimator. Based on Theorem 3, a natural estimator of $\mu$ is given by solving the equation $\sum_{i=1}^{n_0} X_{0i}\widehat{R}^\top(Z_{0i}) - \mu\sum_{i=1}^{n_0}\widehat{R}(Z_{0i})\widehat{R}^\top(Z_{0i}) = 0$, which leads to
$$\widehat{\mu} = \Big(\sum_{i=1}^{n_0} X_{0i}\widehat{R}^\top(Z_{0i})\Big)\Big(\sum_{i=1}^{n_0}\widehat{R}(Z_{0i})\widehat{R}^\top(Z_{0i})\Big)^{-1}. \qquad (11)$$
Combining (10) and (11), we can estimate the variance of $\widehat{\beta}_0$ by $\widehat{\Sigma} = \widehat{J}_0^{-1}\widehat{A}\widehat{J}_0^{-1}$, where
$$\widehat{A} = \frac{1}{n_0}\sum_{i=1}^{n_0}\Big\{\big(Y_{0i} - \widehat{\beta}_0^\top X_{0i} - \widehat{\gamma}_0^\top\widehat{R}(Z_{0i})\big)^2\big(X_{0i} - \widehat{\mu}\widehat{R}(Z_{0i})\big)\big(X_{0i} - \widehat{\mu}\widehat{R}(Z_{0i})\big)^\top\Big\},$$
$$\widehat{J}_0 = \frac{1}{n_0}\sum_{i=1}^{n_0}\big(X_{0i} - \widehat{\mu}\widehat{R}(Z_{0i})\big)\big(X_{0i} - \widehat{\mu}\widehat{R}(Z_{0i})\big)^\top.$$
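A direct numpy sketch of this plug-in variance estimator is given below; the array names and shapes are assumptions made for illustration.

```python
# Sketch of the plug-in variance estimator for the target-domain estimator. The
# arrays are illustrative: X0 is (n0, d), Rhat is (n0, p) and holds Rhat(Z_0i),
# and resid is (n0,) with resid_i = Y_0i - beta0_hat'X_0i - gamma0_hat'Rhat(Z_0i).
import numpy as np

def beta0_standard_errors(X0, Rhat, resid):
    n0 = X0.shape[0]
    # mu_hat from (11): solves sum_i X_0i Rhat_i' = mu sum_i Rhat_i Rhat_i'.
    mu_hat = (X0.T @ Rhat) @ np.linalg.inv(Rhat.T @ Rhat)
    U = X0 - Rhat @ mu_hat.T                       # rows: X_0i - mu_hat Rhat_i
    J0_hat = (U.T @ U) / n0
    A_hat = (U.T @ (U * (resid ** 2)[:, None])) / n0
    Sigma_hat = np.linalg.inv(J0_hat) @ A_hat @ np.linalg.inv(J0_hat)
    # Sigma_hat estimates the asymptotic variance of sqrt(n0)(beta0_hat - beta0),
    # so the standard errors of beta0_hat are sqrt(diag(Sigma_hat) / n0).
    return np.sqrt(np.diag(Sigma_hat) / n0), Sigma_hat
```

The resulting standard errors can be combined with the normal approximation in (10) to form Wald-type confidence intervals, e.g., $\widehat{\beta}_{0j} \pm 1.96\,\mathrm{SE}_j$.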
The next corollary shows that $\widehat{\Sigma}$ is consistent.

Corollary 1. Under Conditions 1-7, if $\epsilon_{0i}$ and the components of $X_0 - E(X_0 \mid R(Z_0))$ have bounded fourth moments, then $\widehat{\Sigma} \xrightarrow{p} \sigma_0^2 J_0^{-1}$.

Theorem 3 and Corollary 1 show that the distribution of $\sqrt{n_0}(\widehat{\beta}_0 - \beta_0)$ can be approximated by a normal distribution whose covariance matrix can be consistently estimated, providing a theoretical basis for making statistical inference about the parameter of main interest in the target domain.
3.4 Benefits from source data
We now discuss the benefits of the source data for estimating the primary parameter $\beta_0$ in the target domain.

Suppose only the target dataset were available. The basic semiparametric partially linear model is (Engle et al., 1986)
$$Y_0 = \beta_0^\top X_0 + g_0(Z_0) + \varepsilon_0.$$
Consider the least squares estimator
$$\{\widetilde{\beta}_0, \widetilde{g}_0\} = \arg\min_{\beta_0, g_0}\frac{1}{n_0}\sum_{i=1}^{n_0}\{Y_{0i} - \beta_0^\top X_{0i} - g_0(Z_{0i})\}^2.$$
There is an extensive literature on the asymptotic properties of the least squares estimators in the semiparametric regression model using various approximation methods, such as splines, for dealing with the nonparametric component; see, for example, Hardle et al. (2000) and the references therein. Under the conditions given in Section 3, it holds that (Hardle et al., 2000; Farrell et al., 2021)
$$E\big|\widetilde{g}_0(Z) - g_0(Z)\big| = O_p\big(n_0^{-s/(2s+q)}\big). \qquad (12)$$
The convergence rate in (12) is optimal (Stone, 1980). Furthermore,
$$\sqrt{n_0}(\widetilde{\beta}_0 - \beta_0) = J_0^{-1}\Big[\frac{1}{\sqrt{n_0}}\sum_{i=1}^{n_0}\{X_{0i} - E(X_{0i}\mid Z_{0i})\}\epsilon_{0i}\Big] + \sqrt{n_0}\,O_p\big(n_0^{-2s/(2s+q)}\big).$$
Therefore, to ensure the asymptotic normality of $\widetilde{\beta}_0$, we must have $n_0^{1/2 - 2s/(2s+q)} \to 0$. This necessitates the condition $n_0^{-s/(2s+q)} = o(n_0^{-1/4})$. Fulfilling this requirement can be difficult, particularly when dealing with a multi-dimensional confounding variable and lacking source data. For instance, under a standard regularity condition where $g_0$ possesses continuous second-order derivatives and assuming $q = 10$, we have $O(n_0^{-2/(4+10)}) = O(n_0^{-1/7})$. Consequently, the condition $n_0^{-s/(2s+q)} = o(n_0^{-1/4})$ may prove to be quite restrictive. In contrast, with the inclusion of source data, Theorem 2 indicates that if the conditions $s_1^{1/2}(\log N)^2 N^{-1/2} = o(n_0^{-1/4})$ and $p^{1/2}N^{-s/(2s + p\log q)} = o(n_0^{-1/4})$ are met, then the estimator of $R$ will achieve a convergence rate faster than $n_0^{-1/4}$. Given that $s_1 = \max\{Kd, Kp, S\}$, we can set $s_1 = S$ for a sufficiently large network used in the analysis. These conditions are satisfied if the network size $S$ is less than $(\log N)^{-2}N^{1/2}/n_0^{1/4}$ and the total sample size from the source domains $N$ exceeds $n_0^{(2s + p\log q)/(4s)}$. Hence, a sufficiently large amount of source data can ensure the satisfaction of these conditions.
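To spell out the arithmetic in the example above, take $s = 2$ and $q = 10$:
$$\frac{s}{2s+q} = \frac{2}{2\cdot 2 + 10} = \frac{1}{7} < \frac{1}{4}, \qquad\text{so}\qquad n_0^{-s/(2s+q)} = n_0^{-1/7} \neq o\big(n_0^{-1/4}\big),$$
and the nonparametric bias term prevents $\sqrt{n_0}$-asymptotic normality when only the target data are used, whereas the corresponding source-assisted requirement depends on $N$ rather than $n_0$ and is easily met once $N$ is large.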
4 Numerical studies
In this section, we evaluate the performance of RTL via numerical studies. We first describe
the simulation results and then illustrate the applications of RTL on two real-world datasets.
4.1 Simulation studies
In this section, we evaluate the finite sample performance of RTL using simulated data. We
generate data under various designs and compare our method with the existing approaches.
4.1.1 Data generating models
We consider the data generating models described in (1) and (2) under the following two scenarios:

(a) Homogeneous models: In this scenario, the source and target domain models are the same. Thus, $\beta_k = \beta$ and $\gamma_k = \gamma$ for all $k = 0, 1, \ldots, K$. The elements of $\beta$ and $\gamma$ are drawn i.i.d. from the standard normal distribution;

(b) Heterogeneous models: In this scenario, the source and target domain models are different. The elements of $\beta_k$ and $\gamma_k$ are drawn i.i.d. from the standard normal distribution, separately for each $k = 0, 1, \ldots, K$.

The covariates $X_k$ and $Z_k$ are drawn i.i.d. from the uniform distribution on $[-1, 1]$. We consider two types of representation functions $R(\cdot)$:

(a) (Additive Model) $R(Z) = [f_1(z_1), f_2(z_2), \ldots, f_r(z_r)]^\top$, where the $f_i$'s are univariate functions;

(b) (Additive Factor Model) $R(Z) = [f_1(\tilde{z}_1), f_2(\tilde{z}_2), \ldots, f_r(\tilde{z}_r)]^\top$, where the $f_i$'s are univariate functions and $\widetilde{Z} = BZ$ for some transformation matrix $B$. We generate $B$ by drawing i.i.d. random numbers from $N(0, 1/q)$.
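For concreteness, a minimal simulation sketch of the homogeneous Additive Model design is given below; the specific functions, noise level, and sizes are illustrative choices consistent with the description above rather than the exact settings of our experiments.

```python
# Sketch of the homogeneous Additive Model design: K source datasets and one
# small target dataset generated from Y = beta' X + gamma' R(Z) + eps with a
# shared representation R(Z) = (f_1(z_1), ..., f_r(z_r)).
import numpy as np

rng = np.random.default_rng(2024)
d, q, r = 5, 10, 5
fs = [np.sin,
      lambda z: 2 * np.sqrt(np.abs(z)) - 1,
      lambda z: (1 - np.abs(z)) ** 2,
      lambda z: 1 / (1 + np.exp(-z)),
      lambda z: np.cos(np.pi * z / 2)]

def representation(Z):
    # Apply the j-th univariate function to the j-th coordinate of Z.
    return np.column_stack([fs[j](Z[:, j]) for j in range(r)])

beta = rng.normal(size=d)      # homogeneous design: shared across all domains
gamma = rng.normal(size=r)

def generate(n, sigma=0.3):
    X = rng.uniform(-1, 1, size=(n, d))
    Z = rng.uniform(-1, 1, size=(n, q))
    Y = X @ beta + representation(Z) @ gamma + sigma * rng.normal(size=n)
    return X, Z, Y

source_data = [generate(1000) for _ in range(6)]   # K = 6 source datasets
target_data = generate(50)                         # small target sample
```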
4.1.2 Evaluation
The performance of the estimated regression function $\widehat{\mu}(X, Z) = X^\top\widehat{\beta}_0 + \widehat{\gamma}_0^\top\widehat{R}(Z)$ is evaluated according to the prediction error and the estimation error. The prediction performance is evaluated by the empirical mean squared error computed on a test set of size $n_{\mathrm{test}}$ generated from the target data distribution, i.e., $\widehat{\mathrm{MSE}}_0 = n_{\mathrm{test}}^{-1}\sum_{i=1}^{n_{\mathrm{test}}}\{\widehat{\mu}(X_i, Z_i) - \mu(X_i, Z_i)\}^2$, which is an estimator of the mean squared error $\mathrm{MSE} = E\{[\widehat{\mu}(X, Z) - \mu(X, Z)]^2\}$. The estimation error is reported on the linear part of the target model, $\mathrm{Err}_{\beta_0} = \|\widehat{\beta}_0 - \beta_0\|_2$, where $\widehat{\beta}_0$ is the estimator of $\beta_0$.
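In code, the two metrics can be computed as follows (a sketch; the argument names and the availability of the true regression function $\mu$ on the test set are assumptions of the simulation setting):

```python
# Sketch of the two evaluation metrics: the empirical prediction MSE on a test
# set and the estimation error on the linear part of the target model.
import numpy as np

def prediction_mse(X_test, Z_test, beta0_hat, gamma0_hat, R_hat, mu_true):
    # R_hat maps Z (n x q) to the estimated representation values (n x p);
    # mu_true returns the true regression function on the test points.
    mu_hat = X_test @ beta0_hat + R_hat(Z_test) @ gamma0_hat
    return np.mean((mu_hat - mu_true(X_test, Z_test)) ** 2)

def estimation_error(beta0_hat, beta0_true):
    # Euclidean distance between the estimated and true linear coefficients.
    return np.linalg.norm(beta0_hat - beta0_true, ord=2)
```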
4.1.3 The effect of the source data sample size
Given that the main objective of transfer learning is to use the information from the source data to enhance the analysis of the target data, we first assess the performance of RTL as the sample size of the source data varies. In the experiments conducted here, the dimension of the linear component $X$ is fixed at $d = 5$, and the dimension of the non-linear component $Z$ is set at $q = 10$. We consider $K = 6$ source datasets in total. Additionally, the dimension of the representation function $R$ is set to 5.
Let the univariate functions $f_i$ be randomly chosen from $\sin(z_1)$, $2\sqrt{|z_2|} - 1$, $(1 - |z_3|)^2$, $1/\{1 + \exp(-z_4)\}$, and $\cos(\pi z_5/2)$. The dimension of the representation function in the working model, denoted as $r$, is set as $r = 1, 3, 5, 7$, and 9. When $r = 5$, the representation dimension is the same as that of the true representation function used in the data generating model. Under-specified and over-specified models are also considered when we set $r = 1, 3$ and $r = 7, 9$, respectively. The sample size $n_k$ in each source dataset is set to 10, 200, 400, 600, 800, 1000, and 1200, and the sample size in the target dataset is fixed at 50.
Figure 3: Additive Model with homogeneous coefficients. (A) Prediction MSE and (B) estimation error versus the sample size n in the source data, for RTL with r = 1, 3, 5, 7, 9 and the Oracle estimator.
Figure 4: Additive Model with heterogeneous coefficients. (A) Prediction MSE and (B) estimation error versus the sample size n in the source data, for RTL with r = 1, 3, 5, 7, 9 and the Oracle estimator.
Figure 5: Additive Factor Model with homogeneous coefficients. (A) Prediction MSE and (B) estimation error versus the sample size n in the source data, for RTL with r = 1, 3, 5, 7, 9 and the Oracle estimator.
Figure 6: Additive Factor Model with heterogeneous coefficients. (A) Prediction MSE and (B) estimation error versus the sample size n in the source data, for RTL with r = 1, 3, 5, 7, 9 and the Oracle estimator.
We repeat the experiments 50 times and report the average performance. We also report the 'Oracle' method, which uses the true representation function in the target data. Figures 3 to 6 present the prediction MSEs and estimation errors for RTL with the number of representation functions $r = 1, 3, 5, 7, 9$; the true value of $r$ in the generating model is 5. For the Additive Model design, the depth of the neural network is set as $D = 4$ and the width is set as $W = 300$; the results are shown in Figures 3 and 4. For the Additive Factor Model design, the depth of the neural network is set as $D = 6$ and the width is set as $W = 500$; the results are shown in Figures 5 and 6.
The experimental results indicate that as the sample size increases, the performance of
the RTL method approaches that of the oracle estimator, provided that the dimension of
the representation function is close to or exceeds the true dimension of the representation
function in the data-generating model. However, if the chosen dimension of the representa-
tion function is less than the true dimension present in the generating model, the proposed
method exhibits suboptimal performance. Consequently, in practical applications, it is ad-
visable to set the dimension of the representation function to a higher rather than lower
value to ensure better performance.
4.1.4 Comparison
We consider both the Additive Model design and the Additive Factor Model design as previously described. Additionally, we explore a more intricate deep model design to simulate the data generation process, as illustrated in Figure 7. In this model, the functions $f_i$ and $h_i$ are selected randomly from a pool of functions that includes $\sin(x)$, $\cos(x)$, $\cos(2x)$, $\sin(\pi x)$, $\cos(\pi x)$, $2\sqrt{|x + 0.5|} - 1$, $(1 - |x - 0.5|)^2$, $1/\{1 + \exp(-x)\}$, $\tan(x + 0.1)$, $\log(x + 1.5)$, $\exp(x)$, $x^2$, and $\arctan(x)$. For instance, the output of the first node in the second layer is computed as $f_1(z_1 + z_2)$. We utilize $K = 20$ source datasets and one target dataset, and we assess two configurations regarding the model's dimension and sample size. In the first configuration, each source dataset comprises 200 samples, and the dimension of the non-linear component is set to $q = 20$. In the second configuration, the sample size for each source dataset is increased to 400, while the dimension of the non-linear component is reduced to $q = 10$. For both configurations, the target dataset consists of $n_0 = 50$ samples, with the dimension of the linear component fixed at $d = 5$.
Figure 7: The architecture of a deep model with q = 10 and p = 5 used in Exp 2: the inputs z_1, ..., z_10 enter through the nodes f_1(·), ..., f_6(·) (e.g., the first node computes f_1(z_1 + z_2)) and are then passed through the nodes h_1(·), ..., h_5(·).
We consider the following competitor methods.
(a) The “Pool” method is a parametric pooling method which estimates the coefficients
using a combined loss. Pooled regression (PR) assumes that all parameters across
different individuals are the same. All datasets are pooled together.
(b) The "MAP" method represents the model averaging transfer learning method (Zhang et al., 2024).
(c) The "Trans-lasso" method represents the high-dimensional linear regression transfer learning method (Li et al., 2022).
(d) The “Meta” method represents the meta-analysis method where the coefficients from
different datasets are weighted based on the inverse variance of estimations.
(e) The “STL” method represents the neural network method which only uses the target
data.
To adapt these methods for non-linear models, we express the nonparametric component as a linear combination of cubic spline basis functions. The optimal number of knots is determined through the use of validation samples. For our proposed RTL method, we define the dimension of the representation space to be $p = 5$, which reflects the true underlying dimension. The outcomes of this comparison are depicted in Figures 8 and 9, where the left side corresponds to the scenario with $n_k = 400$ and $q = 10$, and the right side pertains to the scenario with $n_k = 200$ and $q = 20$. It can be seen that RTL has lower prediction and estimation errors than the existing methods, including Pool, MAP, Trans-Lasso, Meta, and STL. These results show the superiority of our proposed RTL method in terms of both prediction accuracy and estimation quality.
Figure 8: Prediction performance comparison of different methods (RTL, Trans-lasso, MAP, Meta, Pool, STL) across the Add, AddFactor, Deep, and Deep-Home designs. (A) The left-hand side corresponds to the case of $n_k = 200$ and $q = 20$; (B) the right-hand side corresponds to the case of $n_k = 400$ and $q = 10$. It can be seen that RTL has lower prediction errors than the existing methods, including Pool, MAP, Trans-Lasso, Meta, and STL.
Figure 9: Estimation performance comparison between RTL and Pool, MAP, Trans-Lasso, Meta, and STL across the Add, AddFactor, Deep, and Deep-Home designs. (A) The left-hand side corresponds to the case of $n_k = 200$ and $q = 20$; (B) the right-hand side corresponds to the case of $n_k = 400$ and $q = 10$. It can be seen that RTL has lower estimation errors than the existing methods, including Pool, MAP, Trans-Lasso, Meta, and STL.
Table 1: Illustration of asymptotic variance and normality. Histograms of the estimation biases with fitted normal densities accompany each row in the last column.

Design                    Avg. Bias    SD       SE       Normality
Additive, nk = 200          0.0040    0.3016   0.2953    0.0459
Additive, nk = 400         -0.0146    0.2321   0.2241    0.0386
Add-Factor, nk = 200        0.0001    0.2945   0.2880    0.0462
Add-Factor, nk = 400       -0.0153    0.2135   0.2044    0.0345
Deep, nk = 200             -0.0080    0.2155   0.2140    0.0350
Deep, nk = 400             -0.0056    0.1350   0.1350    0.0228
Deep (Homo), nk = 200      -0.0061    0.1567   0.1527    0.0242
Deep (Homo), nk = 400      -0.0003    0.1158   0.1117    0.0184
4.1.5 Assessment of variance estimation and asymptotic normality
To evaluate the estimated asymptotic variance, we employ the same experimental setup as described in Section 4.1.4, with the exception that the parameters $\beta_0$ and $\gamma_0$ are both set to 1, and the number of source datasets is set to 6. Our analysis concentrates on the transformed coefficient $\theta = \alpha^\top\beta_0$, where $\alpha = \mathbf{1}/\sqrt{p}$.

Table 1 presents the results, including the average bias and standard deviation (SD) of the combined coefficient, the mean of the estimated standard errors (SE), and an illustration of normality, all based on 1000 repetitions. The last column of Table 1 displays the histograms of the estimation biases alongside the asymptotic normal distributions with the estimated means and variances. The results suggest that the distribution of the RTL estimator is well approximated by a normal distribution.
4.2 Semi-synthetic MNIST data
In this section, we apply RTL to a semi-synthetic arithmetic dataset constructed from the MNIST dataset (Le Cun et al., 1998). The MNIST dataset is a collection of handwritten digits, comprising 70,000 samples with 10 class labels, each represented by a $28 \times 28$ grayscale image. For our semi-synthetic scenario, we define the data generating process as
$$Y = \beta X + \gamma^\top R_Z + \epsilon, \qquad (13)$$
where $X$ is a random sample drawn from the standard normal distribution, $Z$ is a digit image sampled from the MNIST dataset, $\beta \in \mathbb{R}$ and $\gamma \in \mathbb{R}^{10}$ are unknown coefficients to be estimated, $R_Z \in \mathbb{R}^{10}$ represents the one-hot encoded label corresponding to the input image $Z$, and $\epsilon$ is noise that follows the standard normal distribution. To illustrate, suppose we set $\beta = 1$, $X = 1$, $\epsilon = 0.2$, take $\gamma = (0, 1, \ldots, 9)^\top$, and let $Z$ be a handwritten digit image. The resulting value of $Y$ would be $1 + \gamma^\top R_Z + 0.2$, which equals 8.2 if the image represents the digit '7'.
We adopt the following experimental setup. The total number of source datasets is fixed at 10. Each source dataset is composed of a training subset, which accounts for 40% of the MNIST training set, and a separate validation subset consisting of 500 samples randomly selected from the same training set. The coefficients $\beta$ and $\gamma_1$ are randomly assigned for each source dataset, and we define $\gamma$ as $\gamma_1\mathbf{1}$ across all source data. The target dataset is limited to 100 samples, also drawn from the MNIST training set.
For the representation learning, we utilize a neural network with 7 hidden layers, which
includes 5 convolutional layers and 2 fully connected layers. More detailed information on
the architecture of the neural network is provided in the Supplementary Materials. The
output dimension of the representation network is set to 10. We adopt an iterative training
approach for the representation network and the subsequent linear layer, spanning 10 epochs.
The learning rate is established at $10^{-4}$, and we use a batch size of 128.
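A rough sketch of such a convolutional representation network is given below. The exact architecture is reported in the Supplementary Materials, so the channel counts, kernel sizes, and pooling used here are placeholder choices.

```python
# Rough sketch of a convolutional representation network for 28x28 MNIST images:
# five convolutional layers followed by two fully connected layers, producing a
# 10-dimensional representation. Layer sizes are placeholders, not the exact
# architecture reported in the Supplementary Materials.
import torch.nn as nn

class MNISTRepresentation(nn.Module):
    def __init__(self, p=10):
        super().__init__()
        channels = [1, 16, 32, 32, 64, 64]
        convs = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            convs += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU()]
        self.conv = nn.Sequential(*convs, nn.AdaptiveAvgPool2d(4))
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.Linear(64 * 4 * 4, 128), nn.ReLU(),
                                nn.Linear(128, p))

    def forward(self, z):
        # z: (batch, 1, 28, 28) grayscale images; output: (batch, p) representation.
        return self.fc(self.conv(z))
```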
The average performance of the estimated representation network, in terms of prediction,
classification, and estimation based on 5 replications, is presented in Table 2. It is important
to note that the prediction error is evaluated on a test set derived from the MNIST test set,
which comprises 10,000 images. The classification accuracy is assessed using the MNIST
training set. For this evaluation, the estimated representation network is augmented with
two additional linear layers with ReLU activation functions, and the final output undergoes
a transformation via a logarithmic softmax function.
Table 2: The prediction and estimation errors of RTL in the synthetic MNIST data analysis.
Method Prediction Error Estimation Error Classification Accuracy
RTL 0.2163(0.0675) 0.0701(0.0716) 98.69%(0.24%)
The results show that RTL performs well in the synthetic MNIST data analysis for both prediction and estimation in the target domain. The high classification accuracy indicates
that the estimated representation network is able to capture the label information of the
input images.
4.3 Rental data
We demonstrate the application of the proposed RTL method using an apartment rental
dataset from three major Chinese cities: Beijing, Shanghai, and Shenzhen. This dataset was
obtained from a publicly accessible website, available at http://www.idatascience.cn/.
The number of available rental apartments across various districts in these cities is detailed
in Table 3. Additionally, the variables included in the dataset are given in Table 4. The
main goal of our analysis is to assess the influence of key factors, such as neighborhood
characteristics and the proximity of schools, on rental prices.
Table 3: The number of apartments for rent by district in Beijing, Shanghai and Shenzhen
Beijing Shanghai Shenzhen
Haidian 528 Pudong 1333 Nanshan 1524
Chaoyang 1241 Xuhui 566 Futian 1169
Changping 310 Changning 432 Bao’an 1108
Dongcheng 315 Putuo 416 Longgang 857
Xicheng 308 Huangpu 393 Luohu 857
Fengtai 347 Baoshan 365 Longhua 778
Shijingshan 269 Longhua 360 Buji 735
Mentougou 264 Jing’an 349 Guangming 714
Fangshan 249 Yangpu 316 Yantian 543
Shunyi 225 Tongzhou 223
Jiading 302 Hongkou 207
Daxing 291 Fengxian 204
Huairou 162
Table 4: The description of variables in the apartment rental dataset.
Variable Description
(y) price monthly rent of the apartment
(z) room number of rooms
(z) hall number of halls
(z) toilet number of toilets
(z) hasbed has bed
(z) haswardrobe has wardrobe
(z) hasac has air conditioner
(z) hasgas has gas
(z) floor 4 categories based on height
(z) totalfloor total floors of the building
(z) numhospital number of hospitals (within 3km)
(x) neighborhood neighborhood of the apartment
(x) numschool number of schools (within 3km)
For the apartment rental dataset encompassing three major Chinese cities, Beijing,
Shanghai, and Shenzhen, it is important to recognize that these cities, being in distinct
regions (north, east, and south) of China, have unique rental markets. Consequently, pool-
ing the data from Beijing, Shanghai, and Shenzhen for analysis without considering regional
differences could lead to questionable conclusions. For instance, what characterizes a neigh-
borhood in Beijing may be different from one in Shanghai or Shenzhen. Additionally, the
availability of an air conditioner might influence rental prices differently across these cities
due to their varying climatic conditions. Moreover, considering the vast size of these
cities, there can be significant variations in rental market dynamics even between different
districts within the same city. Therefore, pooling data from different districts within a sin-
gle city is not advisable. On the other hand, despite the geographical distinctions and the
heterogeneity across districts within these cities, there are inherent similarities within their
rental markets. These commonalities make it plausible to apply knowledge about factors
affecting rental prices from one city to another. Thus, while regional specificities should not
be overlooked, there is merit in exploring the transferability of insights across these diverse
urban rental markets.
Given this context, transfer learning is a reasonable approach to analyzing data from
a specific district in one city, using data from other districts as source data. This method
helps overcome the limitations of small sample sizes for district-specific data and enhances the
analysis by leveraging the broader patterns and insights from across the dataset. Thus, while
respecting regional specificities, transfer learning offers a way to explore the transferability
of insights across these diverse urban rental markets.
In our analysis, we focus on the effects of two variables: neighborhood and numschool (the
number of schools within a 3km radius), which are widely recognized as having a significant
influence on rental prices in China. We use other factors as confounding variables that may
also affect rental prices, albeit to a lesser degree. We examine four target datasets from
four randomly selected districts. These districts include Changping in Beijing, Yantian in
Shenzhen, and both Putuo and Fengxian in Shanghai. When analyzing a specific target
dataset, such as the one from Changping in Beijing, we incorporate all the remaining data
as source data. This enables us to quantify the effects of neighborhood characteristics and
the proximity to schools on rental prices, while also taking into account the wider context
provided by the comparative data from other districts.
Using the proposed RTL method for the semiparametric regression model as described in
Section 2, we calculate the 95% confidence intervals for the coefficients of neighborhood and
numschool based on the results in Section 3. The findings are presented in Table 5. These
estimated coefficients provide a quantitative assessment of the impact these variables have on
rental prices. Moreover, the fact that these confidence intervals do not include zero indicates
that the district where the house is located and the proximity to schools significantly increase
the monthly rent, when other factors are held constant.
Table 5: The estimated coefficients and confidence intervals for the variables of main interest
in the housing rental data.
District Variable Estimate SE 95% CI
Changping neighborhood 38.08 3.37 [31.48,44.68]
numschool 16.00 3.93 [8.30,23.70]
Putuo neighborhood 59.53 6.18 [47.41,71.64]
numschool 12.23 2.85 [6.64,17.82]
Fengxian neighborhood 9.80 3.07 [3.79,15.82]
numschool 9.36 2.47 [4.52,14.21]
Yantian neighborhood 66.70 3.92 [59.03,74.38]
numschool 41.03 5.42 [30.41,51.65]
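As a quick check of how these intervals follow from the normal approximation, consider the neighborhood coefficient in Changping:
$$38.08 \pm 1.96 \times 3.37 = [31.47,\ 44.69],$$
which matches the reported interval up to rounding of the displayed estimate and standard error.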
Figure 10: Prediction errors (on a logarithmic scale) of the proposed RTL and the existing methods Pool, MAP, Trans-Lasso, Meta, and STL on the housing rental dataset, for the four target districts: (A) Changping, (B) Putuo, (C) Fengxian, and (D) Yantian.
We further assess the prediction performance by randomly splitting the target dataset into training, validation, and testing sets. Specifically, the training and testing sets each consist of 30% of the data, with the remaining 40% designated as the validation set. We evaluate the prediction performance using the mean squared error (MSE) on the testing set. Figure 10 illustrates the prediction errors of our proposed RTL method compared to the existing methods Pool, MAP, Trans-Lasso, Meta, and STL. The results demonstrate that RTL has lower prediction error than these existing methods, indicating its superior performance in predicting rental prices.
5 Discussion
In this work, we introduce a new approach to transfer learning within the context of semi-
parametric regression inference. The essence of our strategy lies in the transfer of knowledge
from the source domains to the target domain via a representation function. Our goal is to
enhance both prediction accuracy and estimation precision in the target domain by leverag-
ing data from multiple source domains. The key idea of our method is the learning of a shared
representation across various source tasks, which is then applied to a target task. We address
data heterogeneity between the source and target domains by incorporating domain-specific
parameters in their respective models. This strategy facilitates the integration of varied data
representations while maintaining model interpretability and adaptability to heterogeneous
datasets.
Our proposed RTL method has the potential to be adapted for use with other models,
including semiparametric generalized linear and classification models. However, there are
several challenging issues that warrant further investigation within our proposed framework.
Firstly, a pivotal hyperparameter in our approach is the number of representations, for which
the optimal selection remains an open question. The determination of this parameter signif-
icantly affects the model’s performance and its ability to generalize. Our simulation studies
indicate that the method performs adequately as long as the number of representations
falls within a reasonable range. This observation suggests that it is helpful to consider a
cross-validation-type method for selecting this hyperparameter, which could provide a systematic
approach to further enhance model performance (a schematic sketch of such a selection rule is
given at the end of this section). Secondly, our findings are based
on a moderately high-dimensional regime of the model. Although this scenario is relevant
in many applications, extending our method to handle sparse, high-dimensional settings is
challenging. In such scenarios, where the model’s dimensionality may exceed the sample
size, it becomes crucial to integrate regularization techniques into the model fitting objective
function via a penalty term to ensure effective model performance. Moreover, while our
current model uses a linear mapping to integrate representations, there is potential to ex-
plore more flexible approaches. For instance, transitioning from task-specific linear functions
to nonlinear functions could allow for the capture of more complex non-linear relationships
between the representations and the target responses. Pursuing advancements in this area
could significantly improve the model’s capacity and its adaptability to more complex data.
We hope to address these issues in our future work.
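As an illustration of the cross-validation-type selection rule mentioned above, the following schematic sketch chooses the number of representations by validation error on the target domain. The functions fit_rtl and validation_error are hypothetical placeholders for fitting the RTL model with a given representation dimension and for scoring it on held-out target data; they are not part of the paper's implementation.

```python
def select_num_representations(candidate_dims, source_data, target_train, target_valid,
                               fit_rtl, validation_error):
    """Pick the representation dimension with the smallest validation error."""
    best_dim, best_err = None, float("inf")
    for p in candidate_dims:
        model = fit_rtl(source_data, target_train, num_representations=p)  # hypothetical fit
        err = validation_error(model, target_valid)                        # hypothetical score
        if err < best_err:
            best_dim, best_err = p, err
    return best_dim

# Example call (with user-supplied fit_rtl and validation_error):
# p_star = select_num_representations([2, 4, 8, 16], source_data,
#                                     target_train, target_valid,
#                                     fit_rtl, validation_error)
```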
References
Ahn, S. C., Lee, Y. H., and Schmidt, P. (2001). GMM estimation of linear panel data models
with time-varying individual effects. Journal of Econometrics, 101(2):219–255.
Bai, J. (2009). Panel data models with interactive fixed effects. Econometrica: Journal of
the Econometric Society, 77(4):1229–1279.
Bastani, H. (2021). Predicting with proxies: Transfer learning in high dimension. Manage-
ment Science, 67(5):2964–2984.
Cai, T. T. and Pu, H. (2024). Transfer learning for nonparametric regression: Non-
asymptotic minimax analysis and adaptive procedure. arXiv: 2401.12272.
Chen, W., Horwood, J., Heo, J., and Hernández-Lobato, J. M. (2023). Leveraging task structures
for improved identifiability in neural network representations. arXiv: 2306.14861.
Du, S. S., Hu, W., Kakade, S. M., Lee, J. D., and Lei, Q. (2020). Few-shot learning via
learning the representation, provably. In International Conference on Learning Represen-
tations.
Engle, R. F., Granger, C. W., Rice, J., and Weiss, A. (1986). Semiparametric estimates
of the relation between weather and electricity sales. Journal of the American Statistical
Association, 81(394):310–320.
Farrell, M. H., Liang, T., and Misra, S. (2021). Deep neural networks for estimation and
inference. Econometrica, 89(1):181–213.
Golowich, N., Rakhlin, A., and Shamir, O. (2018). Size-independent sample complexity of
neural networks. In Conference On Learning Theory, pages 297–299. PMLR.
Györfi, L., Kohler, M., Krzyzak, A., Walk, H., et al. (2002). A Distribution-Free Theory of
Nonparametric Regression. Springer.
Härdle, W., Liang, H., and Gao, J. (2000). Partially Linear Models. Contributions to
Statistics. Physica Heidelberg, 1st edition.
Hu, X. and Zhang, X. (2023). Optimal parameter-transfer learning by semiparametric model
averaging. Journal of Machine Learning Research, 24(2023):1–53.
Huang, J.-T., Li, J., Yu, D., Deng, L., and Gong, Y. (2013). Cross-language knowledge
transfer using multilingual deep neural network with shared hidden layers. In 2013 IEEE
International Conference on Acoustics, Speech and Signal Processing, pages 7304–7308,
Vancouver, BC, Canada. IEEE.
Jiao, Y., Shen, G., Lin, Y., and Huang, J. (2023). Deep nonparametric regression on approx-
imate manifolds: Nonasymptotic error bounds with polynomial prefactors. The Annals of
Statistics, 51(2):691–716.
Le Cun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied
to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and
Processes. Springer Science & Business Media, Berlin, Heidelberg.
Li, S., Cai, T. T., and Li, H. (2022). Transfer learning for high-dimensional linear regression:
Prediction, estimation, and minimax optimality. Journal of The Royal Statistical Society
Series B: Statistical Methodology, 84(1):149–173.
Li, S., Zhang, L., Cai, T. T., and Li, H. (2023). Estimation and inference for high-dimensional
generalized linear models with knowledge transfer. Journal of the American Statistical
Association, pages 1–12.
Lin, H. and Reimherr, M. (2022). Transfer learning for functional linear regression with
structural interpretability. arXiv: 2206.04277.
Moon, H. R. and Weidner, M. (2015). Linear regression for panel with unknown number of
factors as interactive fixed effects. Econometrica: Journal of the Econometric Society,
83(4):1543–1579.
Mou, L., Meng, Z., Yan, R., Li, G., Xu, Y., Zhang, L., and Jin, Z. (2016). How transferable
are neural networks in NLP applications? In Su, J., Duh, K., and Carreras, X., editors,
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing,
pages 479–489, Austin, Texas. Association for Computational Linguistics.
Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on
Knowledge and Data Engineering, 22(10):1345–1359.
Shen, G. (2024). Exploring the complexity of deep neural networks through functional
equivalence. arXiv: 2305.11417.
Stone, C. J. (1980). Optimal rates of convergence for nonparametric estimators. The Annals
of Statistics, 8(6):1348–1360.
Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., and Liu, C. (2018). A survey on deep
transfer learning. In Kůrková, V., Manolopoulos, Y., Hammer, B., Iliadis, L., and
Maglogiannis, I., editors, Artificial Neural Networks and Machine Learning - ICANN 2018,
pages 270–279, Cham. Springer International Publishing.
Tian, Y. and Feng, Y. (2022). Transfer learning under high-dimensional generalized linear
models. Journal of the American Statistical Association, 118(544):2684–2697.
Tian, Y., Gu, Y., and Feng, Y. (2023). Learning from similar linear representations: Adap-
tivity, minimaxity, and robustness. arXiv: 2303.17765.
Tripuraneni, N., Jordan, M. I., and Jin, C. (2020). On the theory of transfer learning:
The importance of task diversity. In Proceedings of the 34th International Conference on
Neural Information Processing Systems, NIPS’20, pages 7852–7862.
Van der Vaart, A. W. (2000). Asymptotic Statistics. Cambridge University Press, Cambridge,
UK.
Van Der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes:
With Applications to Statistics. Springer, New York, NY, USA.
Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cam-
bridge University Press, Cambridge, UK.
Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014). How transferable are features
in deep neural networks? In Proceedings of the 27th International Conference on Neural
Information Processing Systems - Volume 2, NIPS’14, pages 3320–3328, Cambridge, MA,
USA. MIT Press.
Zhang, X., Liu, H., Wei, Y., and Ma, Y. (2024). Prediction using many samples with
models possibly containing partially shared parameters. Journal of Business & Economic
Statistics, 42(1):187–196.
Zhou, D., Liu, M., Li, M., and Cai, T. (2022). Doubly robust augmented model accuracy
transfer inference with high dimensional features. arXiv: 2208.05134.
Appendix
A The network architecture for the semi-synthetic
MNIST data
The architecture of the network used in the semi-synthetic data analysis is detailed in
Figure S11.
[Figure S11 diagram: input 1@28x28 -> Conv 64@28x28 -> Conv 64@28x28 -> MaxP 64@14x14 -> Conv 128@14x14 -> Conv 128@14x14 -> MaxP 128@7x7 -> Conv 256@7x7 -> MaxP 256@3x3 -> Flatten 1x2304 -> 1x512 -> 1x10 -> SoftMax.]
Figure S11: Structure of the representation network used for the semi-synthetic data analysis. The
convolution (Conv) transforms, max pooling (MaxP) transforms, tensor flattening (Flatten), and
softmax transform are labeled at the bottom. The dimension of the output of each layer is labeled at the top.
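For readers who wish to reproduce a comparable representation network, the following is a minimal PyTorch sketch consistent with the layer dimensions in Figure S11. The kernel sizes (3x3 with padding 1), the 2x2 max pooling, and the ReLU activations are assumptions, since only the channel counts and feature-map sizes are shown in the figure.

```python
import torch
import torch.nn as nn

class RepresentationNet(nn.Module):
    """CNN matching the feature-map sizes reported in Figure S11."""
    def __init__(self, num_outputs: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),     # 1@28x28  -> 64@28x28
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),    # 64@28x28
            nn.MaxPool2d(2),                                            # 64@14x14
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),   # 128@14x14
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),  # 128@14x14
            nn.MaxPool2d(2),                                            # 128@7x7
            nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(),  # 256@7x7
            nn.MaxPool2d(2),                                            # 256@3x3
        )
        self.head = nn.Sequential(
            nn.Flatten(),                                   # 1x2304
            nn.Linear(256 * 3 * 3, 512), nn.ReLU(),         # 1x512
            nn.Linear(512, num_outputs),                    # 1x10
            nn.Softmax(dim=1),
        )

    def forward(self, x):
        return self.head(self.features(x))

# Example: a batch of four 28x28 grayscale images gives a (4, 10) output.
# out = RepresentationNet()(torch.randn(4, 1, 28, 28))
```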
B Proofs of theoretical results
Before we present the proofs of the results stated in the paper, we first introduce the definitions of the Rademacher and Gaussian complexities for the function class $\mathcal{R}$.

Definition 3. We define the empirical and population Rademacher complexities for a class of functions $\mathcal{R}$ containing functions $R:\mathbb{R}^q\to\mathbb{R}^p$, over $n$ data points $(Z_1,\dots,Z_n)$, as
$$\widehat{F}_n(\mathcal{R}) = E_{\varepsilon}\Bigg[\sup_{R\in\mathcal{R}}\frac{1}{n}\sum_{j=1}^{p}\sum_{i=1}^{n}\varepsilon_{ij}R_j(Z_i)\Bigg]
\quad\text{and}\quad
F_n(\mathcal{R}) = E_{Z}\big[\widehat{F}_n(\mathcal{R})\big],$$
respectively, where $E_{\varepsilon}(\cdot)$ refers to the expectation operator taken over the randomness of the $\varepsilon_{ij}$'s, the $\varepsilon_{ij}$'s are independent Rademacher random variables, and $R_j(\cdot)$ is the $j$th element of $R(\cdot)$. Analogously, the empirical and population Gaussian complexities are defined as
$$\widehat{G}_n(\mathcal{R}) = E_{\iota}\Bigg[\sup_{R\in\mathcal{R}}\frac{1}{n}\sum_{j=1}^{p}\sum_{i=1}^{n}\iota_{ij}R_j(Z_i)\Bigg]
\quad\text{and}\quad
G_n(\mathcal{R}) = E_{Z}\big[\widehat{G}_n(\mathcal{R})\big],$$
respectively, where the $\iota_{ij}$'s are independent standard Gaussian random variables.
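As a concrete numerical illustration of Definition 3 (not part of the paper's code), the following sketch approximates the empirical Rademacher complexity by Monte Carlo when the supremum is restricted to a finite set of candidate representation functions; the two candidate functions below are illustrative assumptions.

```python
import numpy as np

def empirical_rademacher(candidates, Z, num_draws=200, rng=None):
    """Monte Carlo estimate of the empirical Rademacher complexity over a finite class.

    candidates: list of functions mapping an (n, q) array Z to an (n, p) array
                whose (i, j) entry is R_j(Z_i).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = Z.shape[0]
    outputs = [R(Z) for R in candidates]                  # evaluate each candidate once
    total = 0.0
    for _ in range(num_draws):
        eps = rng.choice([-1.0, 1.0], size=outputs[0].shape)            # signs eps_ij
        total += max(float(np.sum(eps * out)) / n for out in outputs)   # sup over candidates
    return total / num_draws

# Example with two fixed candidate representations R : R^2 -> R^3.
rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 2))
W1, W2 = rng.normal(size=(2, 3)), rng.normal(size=(2, 3))
candidates = [lambda Z, W=W1: np.tanh(Z @ W),
              lambda Z, W=W2: np.maximum(Z @ W, 0.0)]
print(empirical_rademacher(candidates, Z, rng=rng))
```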
B.1 Auxiliary Lemmas
Define $\ell(D_k;\beta_k,\gamma_k,R) = \{Y_k - X_k^{\top}\beta_k - R^{\top}(Z_k)\gamma_k\}^2$, where $D_k = (X_k, Z_k, Y_k)$. Let
$$\Psi_{\ell,\delta} = \Bigg\{\frac{1}{K}\sum_{k=1}^{K}\ell(D_k,\psi) - \frac{1}{K}\sum_{k=1}^{K}\ell(D_k,\psi^{*}) : \psi\in\Psi_{\delta}\Bigg\},$$
where $\Psi_{\delta} = \{\psi : \delta/2 \le d_2(\psi,\psi^{*}) \le \delta,\ \delta>0,\ \psi\in\Psi\}$, $\psi = \{(\beta_k,\gamma_k)_{k=1}^{K}, R\}$, and $\psi^{*} = \{(\beta_k^{*},\gamma_k^{*})_{k=1}^{K}, R^{*}\}$. Denote $P_N$ and $P$ as the empirical and probability measures of $\{\mathcal{D}_{n,k}\}_{k=1}^{K}$ and $\{D_k\}_{k=1}^{K}$, where $\mathcal{D}_{n,k} = \{D_{ki} = (X_{ki}, Z_{ki}, Y_{ki}),\ i=1,\dots,n_k\}$. We further define $\mathbb{G}_N = \sqrt{N}(P_N - P)$. Denote the population 2-norm for the function class $\mathcal{B}^K + \Gamma^K(\mathcal{R})$ as
$$d_2(\psi,\psi') = \Bigg(\frac{1}{K}\sum_{k=1}^{K} E\Big[\big\{X_k^{\top}\beta_k + R^{\top}(Z_k)\gamma_k - X_k^{\top}\beta'_k - R'^{\top}(Z_k)\gamma'_k\big\}^2\Big]\Bigg)^{1/2},$$
where $\psi' = \{(\beta'_k,\gamma'_k)_{k=1}^{K}, R'\}$. We use $\lesssim$, $\gtrsim$, and $\asymp$ to denote less than, greater than, and equal to up to a universal constant. For simplicity of notation, we omit the parts with the same parameters from the distance $d_2$ in what follows.
Lemma 1. Suppose Condition 3 holds. Then we have
$$E^{*}\|\mathbb{G}_N\|_{\Psi_{\ell,\delta}} \lesssim \delta\sqrt{s_1\log(s_2)} + \frac{s_1}{\sqrt{N}}\log(s_2),$$
where $E^{*}$ is the outer measure and $s_2 = 12B_{\gamma}(D+1)(B_R+1)(2B_{\theta})^{D+2}\big(\prod_{j=0}^{D}p_j\big)\big/\big(\prod_{j=1}^{D}p_j!\big)^{1/S}$.
Proof of Lemma 1. Using the triangle inequality, we can decompose the distance on the function class $\mathcal{B}^K + \Gamma^K(\mathcal{R})$ into distances over $\mathcal{B}^K$, $\Gamma^K$, and $\mathcal{R}$. We have
$$
\begin{aligned}
d_2\big(\{(\beta'_k,\gamma'_k)_{k=1}^{K},R'\},\{(\beta_k,\gamma_k)_{k=1}^{K},R\}\big)
&\le d_2\big(\{(\beta'_k,\gamma'_k)_{k=1}^{K},R'\},\{(\beta_k,\gamma'_k)_{k=1}^{K},R'\}\big) \\
&\quad + d_2\big(\{(\beta_k,\gamma'_k)_{k=1}^{K},R'\},\{(\beta_k,\gamma_k)_{k=1}^{K},R'\}\big) \\
&\quad + d_2\big(\{(\beta_k,\gamma_k)_{k=1}^{K},R'\},\{(\beta_k,\gamma_k)_{k=1}^{K},R\}\big) \\
&\lesssim d_2\big(\{\beta'_k\}_{k=1}^{K},\{\beta_k\}_{k=1}^{K}\big)
 + d_2\big(\{\gamma'_k\}_{k=1}^{K},\{\gamma_k\}_{k=1}^{K}\big)
 + \max_{k\in[K]}\|\gamma_k\|_2\, d_2(R',R). \qquad (\mathrm{A.1})
\end{aligned}
$$
We then use a covering argument on each of the spaces $\mathcal{B}^K$, $\Gamma^K$, and $\mathcal{R}$ to witness a covering of the composed space $\mathcal{B}^K + \Gamma^K(\mathcal{R})$. First, let $\mathcal{C}_{\mathcal{B}^K}$ be a $\tau_0$-covering of the class $\mathcal{B}^K$ in the norm $d_2$. Then, for each $\beta\in\mathcal{C}_{\mathcal{B}^K}$, construct a $\tau_1$-covering $\mathcal{C}_{\mathcal{R}}$ of the class $\mathcal{R}$ in the norm $d_2$. Last, for each $\beta\in\mathcal{C}_{\mathcal{B}^K}$ and $R\in\mathcal{C}_{\mathcal{R}}$, construct a $\tau_2$-covering $\mathcal{C}_{\Gamma^K}$ of the class $\Gamma^K$ in the norm $d_2$. Using the decomposition of the distance in (A.1), we can claim that the set
$$\mathcal{C}_{\mathcal{B}^K}\cdot\mathcal{C}_{\Gamma^K(\mathcal{R})} = \bigcup_{\beta\in\mathcal{C}_{\mathcal{B}^K}}\Big(\bigcup_{R\in\mathcal{C}_{\mathcal{R}}}\mathcal{C}_{\Gamma^K}\Big)$$
is a $(\tau_0 + \max_{k\in[K]}\|\gamma_k\|_2\,\tau_1 + \tau_2)$-covering of the function space $\mathcal{B}^K + \Gamma^K(\mathcal{R})$ in the norm $d_2$. To see this, let $\{\beta_k\}_{k=1}^{K}\in\mathcal{B}^K$, $\{\gamma_k\}_{k=1}^{K}\in\Gamma^K$, and $R\in\mathcal{R}$ be arbitrary. Now let $\{\beta'_k\}_{k=1}^{K}\in\mathcal{C}_{\mathcal{B}^K}$ be $\tau_0$-close to $\{\beta_k\}_{k=1}^{K}$; given this $\{\beta'_k\}_{k=1}^{K}$, there exists $R'\in\mathcal{C}_{\mathcal{R}}$ that is $\tau_1$-close to $R$; given this $\{\beta'_k\}_{k=1}^{K}$ and $R'$, there exists $\{\gamma'_k\}_{k=1}^{K}\in\mathcal{C}_{\Gamma^K}$ that is $\tau_2$-close to $\{\gamma_k\}_{k=1}^{K}$. By the construction of $\{(\beta'_k,\gamma'_k)_{k=1}^{K},R'\}$ and (A.1), we have
$$d_2\big(\{(\beta'_k,\gamma'_k)_{k=1}^{K},R'\},\{(\beta_k,\gamma_k)_{k=1}^{K},R\}\big) \le \tau_0 + \max_{k\in[K]}\|\gamma_k\|_2\,\tau_1 + \tau_2.$$
We now bound the cardinality of the covering $\mathcal{C}_{\mathcal{B}^K}\cdot\mathcal{C}_{\Gamma^K(\mathcal{R})}$:
$$\big|\mathcal{C}_{\mathcal{B}^K}\cdot\mathcal{C}_{\Gamma^K(\mathcal{R})}\big| \le |\mathcal{C}_{\mathcal{B}^K}|\,|\mathcal{C}_{\mathcal{R}}|\,\max_{R\in\mathcal{R}}\big|\mathcal{C}_{\Gamma^K_R}\big|.$$
To control the cardinality of $\max_{R\in\mathcal{R}}|\mathcal{C}_{\Gamma^K_R}|$, note that an $\epsilon$-covering of $\mathcal{C}_{\Gamma^K_R}$ can be obtained from the cover $\mathcal{C}_{\Gamma_R}\times\cdots\times\mathcal{C}_{\Gamma_R}$. Hence,
$$|\mathcal{C}_{\mathcal{B}^K}| \le |\mathcal{C}_{\mathcal{B}}|^{K}, \qquad \max_{R\in\mathcal{R}}\big|\mathcal{C}_{\Gamma^K_R}\big| \le \max_{R\in\mathcal{R}}\big|\mathcal{C}_{\Gamma_R}\big|^{K}.$$
Note that for any $\psi,\psi'\in\Psi_{\delta}$, we have
$$
\begin{aligned}
E\Bigg[\frac{1}{K}\sum_{k=1}^{K}\ell(D_k,\psi) - \frac{1}{K}\sum_{k=1}^{K}\ell(D_k,\psi')\Bigg]
&= E\Bigg[\frac{1}{K}\sum_{k=1}^{K}\ell(D_k,\psi) - \frac{1}{K}\sum_{k=1}^{K}\ell(D_k,\psi^{*})\Bigg]
 + E\Bigg[\frac{1}{K}\sum_{k=1}^{K}\ell(D_k,\psi^{*}) - \frac{1}{K}\sum_{k=1}^{K}\ell(D_k,\psi')\Bigg] \\
&\le d_2^{2}(\psi,\psi^{*}) + d_2^{2}(\psi',\psi^{*}) \lesssim \delta^2.
\end{aligned}
$$
By Theorem 3 in Shen (2024), we have
$$\mathcal{N}(\tau, d_2, \mathcal{R}) \le \mathcal{N}(\tau, d_{\infty}, \mathcal{R})
\le \frac{\Big\{4(D+1)(B_R+1)(2B_{\theta})^{D+2}\big(\prod_{j=0}^{D}p_j\big)\tau^{-1}\Big\}^{S}}{\prod_{j=1}^{D}p_j!},$$
where $d_{\infty}$ is the infinity norm. Furthermore, by the construction of the covering net and Theorem 2.7.11 in Van Der Vaart and Wellner (1996), we have
$$\log \mathcal{N}_{[\,]}(\tau, d_2, \Psi_{\ell,\delta})
\lesssim Kd\log\Big(\frac{3\delta}{\tau}\Big) + Kp\log\Big(\frac{3\delta}{\tau}\Big)
+ S\log\Bigg(\frac{12B_{\gamma}(D+1)(B_R+1)(2B_{\theta})^{D+2}\big(\prod_{j=0}^{D}p_j\big)}{\tau\big(\prod_{j=1}^{D}p_j!\big)^{1/S}}\Bigg).$$
Using the sub-additivity of the $\sqrt{\cdot}$ function, if $\delta \le s_2/3$, then we have
$$J_{[\,]}(\delta, \Psi_{\ell,\delta}) := \int_{0}^{\delta}\sqrt{1 + \log \mathcal{N}_{[\,]}(\tau, d_2, \Psi_{\ell,\delta})}\,d\tau
\lesssim \int_{0}^{\delta}\sqrt{1 + s_1\log(s_2/\tau)}\,d\tau
= s_2\sqrt{\frac{s_1}{2}}\int_{\sqrt{2\log(s_2/\delta)}}^{\infty} v^{2} e^{-v^{2}/2}\,dv
\lesssim \delta\sqrt{s_1\log(s_2)}. \qquad (\mathrm{A.2})$$
Under Condition 3, combining (A.2) with Lemma 3.4.2 in Van Der Vaart and Wellner (1996) gives
$$E^{*}\|\mathbb{G}_N\|_{\Psi_{\ell,\delta}} \lesssim J_{[\,]}(\delta, \Psi_{\ell,\delta})\Bigg(1 + \frac{J_{[\,]}(\delta, \Psi_{\ell,\delta})}{\delta^{2}\sqrt{N}}\Bigg)
\lesssim \delta\sqrt{s_1\log(s_2)} + \frac{s_1}{\sqrt{N}}\log(s_2).$$
Lemma 2. Suppose Condition 3 holds. If $n_k \gtrsim d + \log K$, we have
$$
\begin{aligned}
F_N\big(\mathcal{B}^K + \Gamma^K(\mathcal{R})\big)
&\lesssim O\big\{n^{-1/2}d^{1/2}\big\} + O\big\{n^{-1/2}N^{-2}(p\log N)^{1/2}\big\} + O\big\{n^{-1/2}p^{1/2}(\log N)\big\} \\
&\quad + O\Big\{N^{-1/2}(D+2+\log q)^{1/2}\,p\,(\log N)^{2}\prod_{i=0}^{D}(p_i+1)\Big\} \\
&\quad + O\Bigg\{n^{-1/2}N^{-2}\Bigg[S\log\Bigg(N(D+1)(2B_{\theta})^{D+2}\Big(\prod_{j=0}^{D}p_j\Big)\Big/\Big(\prod_{j=1}^{D}p_j!\Big)^{1/S}\Bigg)\Bigg]^{1/2}\Bigg\}.
\end{aligned}
$$
Proof of Lemma 2. Under Condition 3, by the definition of the empirical Gaussian complex-
ity, we have
b
Gnk(B) = Eι"sup
βk∈B
1
nk
nk
X
i=1
ιikXkiβk#
max
k[K]
Bβ
nkv
u
u
tEι"nk
X
i=1 ιikXki2
2#
max
k[K]
Bβ
nkv
u
u
t
nk
X
i=1 Xki2
2
= max
k[K]
Bβ
nkqtr(ΣXk)
= max
k[K]
Bβ
nkv
u
u
t
d
X
j=1
σjXk),(A.3)
where ΣXk=Pnk
i=1 XkiXki/nkand σjXk) is the jth largest eigenvalue of ΣXk. Similarly,
we can prove
b
Gnk(Γ) max
k[K]
Bγ
nkv
u
u
t
p
X
j=1
σjR(Zk)),(A.4)
where σjR(Zk)) is the jth largest eigenvalue of Pnk
i=1 R(Zki)R(Zki)/nk. Furthermore, we
can obtain that,
b
GN(R) = 1
NEι"sup
R∈R
p
X
j=1
K
X
k=1
nk
X
i=1
ιikRj(Zki)#
p
X
j=1 b
GN(Rj)(log N)
p
X
j=1 b
FN(Rj).(A.5)
By the definition of empirical Gaussian complexity, we can easily conclude
b
GN(BK+ ΓK(R)) b
GN(BK) + b
GNK(R)).(A.6)
For simplicity in notation, denote fik (γk,R;γ
k,R) = RT(Zki)γkR′⊤(Zki)γ
k2. Theo-
rem 7 of Tripuraneni et al. (2020) implies that
b
GNK(R)) 4 sup
{(γk)K
k=1,R},{(γ
k)K
k=1,R}
1
K
K
X
k=1
1
nk
nk
X
i=1
fik(γk,R;γ
k,R)/N2
+ 128(log N)hmax
kγkb
GN(R) + max
kb
Gnk(Γ)i.(A.7)
Similar to the calculation of the covering number in Lemma 1, we can obtain that
N(τ/3, d, fk(Γ(R)Γ(R)))
48B2
RBγ
τ2p
192B2
γBR(D+ 1)(BR+ 1)(2Bθ)D+2(QD
j=0 pj)2S
τ2S(QD
j=1 pj!)2
.
Denote fk(γk,R;γ
k,R) = RT(Zk)γkR′⊤(Zk)γ
k2. By Lemma 9.1 of Gy¨orfi et al.
(2002), we have that, for any τ > 0
P sup
{γk,R},{γ
k,R}
1
nk
nk
X
i=1
fik(γk,R;γ
k,R)E[fk(γk,R;γ
k,R)]> τ!
2N(τ/3, d, fk(Γ(R)Γ(R))) exp nkτ2
18B2
γB2
R.
Denote c= 1/(18B2
γB2
R) and C= 2N(1/(3nk), d, fk(Γ(R)+Γ(R))). Note that log C/(cnk)
1/nk. Then,
E"sup
{γk,R},{γ
k,R}(1
nk
nk
X
i=1
fik(γk,R;γ
k,R)E[fk(γk,R;γ
k,R)])#
v
u
u
u
tE
sup
{γk,R},{γ
k,R}(1
nk
nk
X
i=1
fik(γk,R;γ
k,R)E[fk(γk,R;γ
k,R)])!2
v
u
u
u
tZ
0
P
sup
{γk,R},{γ
k,R}(1
nk
nk
X
i=1
fik(γk,R;γ
k,R)E[fk(γk,R;γ
k,R)])!2
> τ
dτ
slog C
cnk
+Z
log C
cnk
2N(τ/3, d, fk(Γ(R)Γ(R))) exp nkτ
18B2
γB2
Rdτ
slog C
cnk
+Z
log C
cnk
2N(1/(3nk), d, fk(Γ(R)Γ(R))) exp nkτ
18B2
γB2
Rdτ
=s18B2
γB2
R(1 + log 2 + log{N(1/(3nk), d, fk(Γ(R)Γ(R)))})
nk
.
Under Condition 3, we have
E"sup
{(γk)K
k=1,R},{(γ
k)K
k=1,R}
1
K
K
X
k=1
1
nk
nk
X
i=1
fik(γk,R;γ
k,R)#
E"sup
{γk,R},{γ
k,R}
1
nk
nk
X
i=1
fik(γk,R;γ
k,R)#
4B2
γB2
R+s18B2
γB2
R(1 + log 2)
nk
+s36pB2
γB2
R
nk
log (48nkB2
RBγ)
+v
u
u
u
u
t
36SB2
γB2
R
nk
log
192nkBRB2
γ(D+ 1)(BR+ 1)(2Bθ)D+2(QD
j=0 pj)
QD
j=1 pj!1/S
.(A.8)
Noting by Ledoux and Talagrand (1991) (p97), the empirically Rademacher complexity is
upper bounded by empirical Gaussian complexity up to a factor, together with (A.6) and
(A.7), we have
b
FN(BK+ ΓK(R))
rπ
2b
GN(BK+ ΓK(R))
rπ
2b
GN(BK) + b
GNK(R))
rπ
2max
kb
Gnk(B)+22πsup
{(γk)K
k=1,R},{(γ
k)K
k=1,R}
1
K
K
X
k=1
1
nk
nk
X
i=1
fik(γk,R;γ
k,R)/N2
+ 642π(log N)hmax
kγkb
GN(R) + max
kb
Gnk(Γ)i.(A.9)
Under Condition 3and θΘ, adapting from Theorem 2 of Golowich et al. (2018),
FN(Rj)2
D
Y
i=0
(pi+ 1)BRpD+ 2 + log q/N.
If nkd+ log K, applying Lemma 4 of Tripuraneni et al. (2020) and using the concavity of
·function, we have EhqPd
j=1 σjXk)iO(d) and EhqPp
j=1 σjR(Zk))iO(p).
Thus, under Condition 3, combing (A.9) with (A.3), (A.4), (A.5), and (A.8), we prove
Lemma 2.
Lemma 3. Suppose Conditions 3-4 hold. Then we have
$$\|\widehat{\mu} - \mu\|_2 = O_p(n_0^{-1/2}) + O_p(\Delta_N); \qquad
\|\widehat{J}_0^{-1} - J_0^{-1}\|_2 = O_p(n_0^{-1/2}) + O_p(\Delta_N). \qquad (\mathrm{A.10})$$
Proof of Lemma 3.Theorem 2indicates that b
R(Z)R(Z)2=Op(∆N) for all Z
Z. Denote b
Q=Pn0
i=1 b
R(Z0i)( b
R(Z0i))T/n0and Q=Pn0
i=1 R(Z0i)(R(Z0i))T/n0. Under
Condition 3, we first derive
b
QQ
2
=
1
n0
n0
X
i=1 b
R(Z0i)R(Z0i)b
R(Z0i)R(Z0i)T
+1
n0
n0
X
i=1
R(Z0i)b
R(Z0i)R(Z0i)T
+1
n0
n0
X
i=1 b
R(Z0i)R(Z0i)(R(Z0i))T
2
=Op{∥ b
R(Z0)R(Z0)2}=Op(∆N).(A.11)
By (A.11) and Condition 3, we further derive that
b
Q1Q1
2
=
b
Q1(b
QQ)Q1
2
b
Q1
2
b
QQ
2
Q1
2
=Op(1)Op(∆N)Op(1) = Op(∆N).
Recall the construction of efficient scores for β0,µis the minimizer of
E[X0µR(Z0)2
2].
Hence, under Conditions 3-4and by the weak law of large numbers, we have
b
µµ2
1
n0
n0
X
i=1
X0i(b
R(Z0i))Tb
Q11
n0
n0
X
i=1
X0i(R(Z0i))TQ1
2
+
1
n0
n0
X
i=1
X0i(R(Z0i))TQ1µ
2
1
n0
n0
X
i=1
X0i{b
R(Z0i)R(Z0i)}T{b
Q1Q1}
2
+
1
n0
n0
X
i=1
X0i{R(Z0i)}T{b
Q1Q1}
2
+
1
n0
n0
X
i=1
X0i{b
R(Z0i)R(Z0i)}TQ1
2
+
1
n0
n0
X
i=1
X0i(R(Z0i))TQ1µ
2
=Op(∆N) + Op(n1/2
0).(A.12)
With the quadratic loss and by the property of E(X0|R(Z0)), we have
E(X0|R(Z0)) = µR(Z0).
Hence, by (A.12) and b
R(Z0)R(Z0)2=Op(∆N), and the independence between Dn,0
and Dn,k for k= 1, . . . , K, we have
b
J0J02
=
1
n0
n0
X
i=1 {X0ib
µb
R(Z0i)}{X0ib
µb
R(Z0i)}TJ0
2
1
n0
n0
X
i=1 {X0ib
µb
R(Z0i)}{X0ib
µb
R(Z0i)}T
1
n0
n0
X
i=1 {X0iµb
R(Z0i)}{X0iµb
R(Z0i)}T
2
+
1
n0
n0
X
i=1 {X0iµb
R(Z0i)}{X0iµb
R(Z0i)}T
Eh{X0µb
R(Z0)}{X0µb
R(Z0)}Ti
2
+
Eh{X0µb
R(Z0)}{X0µb
R(Z0)}T
−{X0µR(Z0)}{X0µR(Z0)}T
2
+
E{X0µR(Z0)}{X0µR(Z0)}TJ0
2
=Op(n1/2
0) + Op(∆N).
B.2 Proofs of Theorems and Corollaries
Proof of Theorem 1. For any $\psi$ satisfying (5), we have
$$
\begin{aligned}
&\frac{1}{K}\sum_{k=1}^{K}E\big[\ell(D_k;\beta_k,\gamma_k,R) - \ell(D_k;\beta^{*}_k,\gamma^{*}_k,R^{*})\big] \\
&\quad = \frac{1}{K}\sum_{k=1}^{K}E\Big[\big\{R^{*\top}(Z_k)\gamma^{*}_k - R^{\top}(Z_k)\gamma_k + X_k^{\top}\beta^{*}_k - X_k^{\top}\beta_k\big\}^{2}\Big] \\
&\quad = \frac{1}{K}\sum_{k=1}^{K}E\Big[\big[(\beta^{*}_k-\beta_k)^{\top}\{X_k - E(X_k|Z_k)\}
 + (\beta^{*}_k-\beta_k)^{\top}\{E(X_k|Z_k)\} + \{R^{*\top}(Z_k)\gamma^{*}_k - R^{\top}(Z_k)\gamma_k\}\big]^{2}\Big] \\
&\quad = \frac{1}{K}\sum_{k=1}^{K}E\Big[\big[(\beta^{*}_k-\beta_k)^{\top}\{X_k - E(X_k|Z_k)\}\big]^{2}\Big]
 + \frac{1}{K}\sum_{k=1}^{K}E\Big[\big[(\beta^{*}_k-\beta_k)^{\top}\{E(X_k|Z_k)\} + \{R^{*\top}(Z_k)\gamma^{*}_k - R^{\top}(Z_k)\gamma_k\}\big]^{2}\Big].
\end{aligned}
$$
Together with Condition 1, we conclude that
$$\frac{1}{K}\sum_{k=1}^{K}E\big[\ell(D_k;\beta_k,\gamma_k,R) - \ell(D_k;\beta^{*}_k,\gamma^{*}_k,R^{*})\big] > 0$$
for any $\beta_k \ne \beta^{*}_k$. Hence, by the definition of $\psi$ and $\psi^{*}$, we attain that $\beta_k = \beta^{*}_k$ and
$$R^{\top}(Z)\gamma_k = R^{*\top}(Z)\gamma^{*}_k, \qquad (\mathrm{A.13})$$
for all $Z\in\mathcal{Z}$, every $\beta_k\in\mathcal{B}$ and $\gamma_k\in\Gamma$, and $k\in[K]$. We drop the subscript of $Z_k$ to avoid confusion, since equation (A.13) holds for all $Z_k\in\mathcal{Z}$. By Condition 2, we construct an invertible matrix $U_0 = [\gamma_{k_1},\cdots,\gamma_{k_p}]\in\mathbb{R}^{p\times p}$ such that $U_0^{\top}R(Z) = U_{*}^{\top}R^{*}(Z)$ for all $Z\in\mathcal{Z}$, where $U_{*} = [\gamma^{*}_{k_1},\cdots,\gamma^{*}_{k_p}]\in\mathbb{R}^{p\times p}$. Then we have $R(Z) = (U_0^{\top})^{-1}U_{*}^{\top}R^{*}(Z)$. Note that by Condition 2, there exist $p$ tasks with inputs $Z_1,\dots,Z_p$ such that $V_0 = [R(Z_1),\dots,R(Z_p)]\in\mathbb{R}^{p\times p}$ is an invertible matrix. Consequently, we can write
$$V_0 = \Lambda V_{*},$$
where $\Lambda = (U_0^{\top})^{-1}U_{*}^{\top}$ and $V_{*} = [R^{*}(Z_1),\dots,R^{*}(Z_p)]\in\mathbb{R}^{p\times p}$. Since $V_0$ is invertible and $U_{*}$ does not depend on the input, so are $\Lambda$ and $V_{*}$. This completes the proof.
Proof of Theorem 2. We center the functions to
ik(Dki;ψk) = (Dki;ψk)(Dki;0),
where ψk= (βk,γk,R). Under Condition 3, applying the contraction principle (Ledoux
and Talagrand,1991, Theorem 4.12) over set {β
kXki +γ
kR(Zki), i [nk], k [K]} RN
shows that
Eε"sup
ψΨ
1
K
K
X
k=1
1
nk
nk
X
i=1
εikik (Dki;ψk)#2BδFN(BK+ ΓK(R)),(A.14)
with probability at least 1δ, where Bδ=clog (1)+4BXBβ+ 4BRBγand cis a constant.
Additional, we can easily prove that |ik(Dki;0)| Bwith probability 1 δunder Condi-
tion 3, where B= (clog (1) + BXBβ+BRBγ)2. Further, the constant-shift property of
Rademacher averages (Wainwright (2019), Exercise 4.7c) gives
Eε"sup
ψΨ
1
K
K
X
k=1
1
nk
nk
X
i=1
εik(Dki;ψk)#
Eε"sup
ψΨ
1
K
K
X
k=1
1
nk
nk
X
i=1
εikik (Dki;ψk)#+B
N,(A.15)
with probability at least 1 δ. Theorem 4.10 of Wainwright (2019) shows that
sup
ψΨ
1
K
K
X
k=1 L(Dn,k;βk,γk,R)1
K
K
X
k=1
E{(Dk;βk,γk,R)}
2FN((BK+ ΓK(R))) + 2Brlog(1)
N
with probability at least 1 3δ, where L(Dn,k;βk,γk,R) = Pnk
i=1 (Dki;βk,γk,R)/nk. Con-
sequently, combining (A.14) and (A.15), we have
sup
ψ∈F
1
K
K
X
k=1 L(Dn,k;βk,γk,R)1
K
K
X
k=1
E{(Dk;βk,γk,R)}
4BδFN(BK+ ΓK(R)) + 4Brlog(1)
N,(A.16)
with probability at least 14δ. Therefore, under Condition 6, combining (A.16) with Lemma
2, we can conclude that
sup
ψ∈F
1
K
K
X
k=1 L(Dn,k;βk,γk,R)1
K
K
X
k=1
E{(Dk;βk,γk,R)}
p
0.(A.17)
Further applying Theorem 1, there exists an invertible matrix Λsuch that, b
Rcoverges to
R, and b
γkconverges to γ
k, where R= Λ1
Rand γ
k= Λ
γkfor k[K]. Define
e
R= arg min
R∈R
1
K
K
X
k=1
EhRT(Zk)γkRT
(Zk)γk2i.(A.18)
Under Conditions 35, by the proof of Theorem 6.2 in Jiao et al. (2023), we know that
if the network width and depth be W= 114(κ+ 1)(plog q)κ+1 and D= 21(κ+
1)2N(plog q)/2(plog q+2κ)log2(8N(plog q)/2(plog q+2κ)), then
1
K
K
X
k=1
Eγ∗⊤
ke
R(Zk)γ∗⊤
kR(Zk)2
=1
K
K
X
k=1
Eγ
ke
R(Zk)γ
kR(Zk)2
1
K
K
X
k=1
E
e
R(Zk)R(Zk)
2γk2
=O(pB2
γN2s
2s+plog q),
where e
R(·)=Λe
R(·). Consequently, under Condition 6, we have
1
K
K
X
k=1
Eh(Dk;βk,γ
k,e
R)i1
K
K
X
k=1
E[(Dk;βk,γ
k,R)]
=1
K
K
X
k=1
Ee
RT(Zk)γ
kRT(Zk)γ
k20.(A.19)
Then, (A.17) and (A.19) lead to
1
K
K
X
k=1 L(Dn,k;βk,γ
k,e
R)1
K
K
X
k=1 L(Dn,k;βk,γ
k,R)
1
K
K
X
k=1 L(Dn,k;βk,γ
k,e
R)1
K
K
X
k=1
Eh(Dk;βk,γ
k,e
R)i
+
1
K
K
X
k=1
Eh(Dk;βk,γ
k,e
R)i1
K
K
X
k=1
E[(Dk;βk,γ
k,R)]
+
1
K
K
X
k=1
E[(Dk;βk,γ
k,R)] 1
K
K
X
k=1 L(Dn,k;βk,γ
k,R)
=op(1).(A.20)
Noting that {(b
βk,b
γk)K
k=1,b
R}is the minimizer of (7), by (A.20), we further have
1
K
K
X
k=1 L(Dn,k;b
βk,b
γk,b
R)1
K
K
X
k=1 L(Dn,k;βk,γ
k,e
R)
1
K
K
X
k=1 L(Dn,k;βk,γ
k,R) + op(1).(A.21)
Since
1
K
K
X
k=1
E[(Dk;βk,γk,R)] 1
K
K
X
k=1
E[(Dk;βk,γ
k,R)]
=d2
2(ψ,ψ),
then, for any small δ > 0, we have
inf
d(ψ,ψ)δ
1
K
K
X
k=1
E[(Dk;βk,γk,R)] >1
K
K
X
k=1
E[(Dk;βk,γk,R)]
>1
K
K
X
k=1
E[(Dk;βk,γ
k,R)] .(A.22)
Therefore, the Conditions of Theorem 5.7 in Van der Vaart (2000) follow from (A.17), (A.21),
and (A.22), and this implies that d(b
ψ,ψ)p
0 as n , where b
ψ={(b
βk,b
γk)K
k=1,b
R}.
Next, we show the convergence rates of d(b
ψ,ψ). By lemma 1, we have
E"sup
ψ∈Fδ
N
1
K
K
X
k=1 L(Dn,k;βk,γk,R)1
K
K
X
k=1
E[(Dk;βk,γk,R)]
(1
K
K
X
k=1 L(Dn,k;βk,γ
k,R)1
K
K
X
k=1
E[(Dk;βk,γ
k,R)])#
ϕN(δ),(A.23)
where ϕN(δ) = δps1log(s2) + s1
Nlog(s2).
Denote υN=s1/2
1(log N)2N1/2. With some calculations, we have that ϕN(υN)υ2
NN
and ϕN(p1/2BγNs/(2s+plog q))NpB2
γN2s
2s+plog q. On the other hand, by the definition of
e
Rin (A.18) and analogy to (A.23),
1
K
K
X
k=1 L(Dn,k;βk,γ
k,e
R)1
K
K
X
k=1 L(Dn,k;βk,γ
k,R)
1
K
K
X
k=1
En(Dk;βk,γ
k,e
R)o1
K
K
X
k=1
E{(Dk;βk,γ
k,R)}
+Op(N1/2ϕN(υN))
d2
2({(βk,γ
k)K
k=1,e
R},{(βk,γ
k)K
k=1,R}) + Op(N1/2ϕN(υN))
O(pB2
γN2s
2s+plog q) + Op(υ2
N).
Then by the definition of b
ψ, we have
1
K
K
X
k=1 L(Dn,k;b
βk,b
γk,b
R)
1
K
K
X
k=1 L(Dn,k;βk,γ
k,e
R)
1
K
K
X
k=1 L(Dn,k;βk,γ
k,R) + Op(pB2
γN2s
2s+plog q+υ2
N).
By Theorem 3.4.1 in Van Der Vaart and Wellner (1996), we have d(b
ψ,ψ) = O(p1/2Ns/(2s+plog q)+
υN). Furthermore, we have
d2
2(b
ψ,ψ)
=1
K
K
X
k=1
Eh(b
βkβk)T{XkE(Xk|R(Zk))}+
(b
βkβk)T{E(Xk|R(Zk))}+{b
RT(Zk)b
γkRT(Zk)γ
k}2
=1
K
K
X
k=1
E(b
βkβk)T{XkE(Xk|R(Zk))}2
+1
K
K
X
k=1
E(b
βkβk)T{E(Xk|R(Zk))}+{b
RT(Zk)b
γkRT(Zk)γ
k}2.
Thus, by Conditions 13, it follows that maxk[K]b
βkβk2=Op(p1/2Ns/(2s+plog q)+υN)
and
1
K
K
X
k=1
Eh{b
RT(Zk)b
γkRT(Zk)γ
k}2i=O(pB2
γN2s
2s+plog q+υ2
N).
Denote
{e
γ
k,e
γk}= arg sup
γ
kΓ
inf
γkΓ
1
K
K
X
k=1
Eh{b
RT(Zk)γkRT(Zk)γ
k}2i.
Furthermore,
1
K
K
X
k=1
Eh{b
RT(Zk)e
γkRT(Zk)e
γk}2i
sup
γ
kΓ
inf
γkΓ
1
K
K
X
k=1
Eh{b
RT(Zk)γkRT(Zk)γ
k}2i
Cinf
γkΓ
1
K
K
X
k=1
Eh{b
RT(Zk)γkRT(Zk)γ
k}2i
C
K
K
X
k=1
Eh{b
RT(Zk)b
γkRT(Zk)γ
k}2i,
where the second inequality is implied by Lemma 6 of Tripuraneni et al. (2020). Conse-
quently, by Condition 3, we have d2(b
R,R) = O(∆N).
Proof of Theorem 3. Under Condition 3, by Theorem 2 and Lemma 3, it is easy to conclude
that
{b
J0}1"1
n0
n0
X
i=1 {X0ib
µb
R(Z0i)}{R(Z0i)b
R(Z0i)}Tγ
0#=Op(n1/2
0N+ 2
N).(A.24)
By (A.12), and the independence between Dn,k and Dn,0, we have
1
n0
n0
X
i=1 {µb
R(Z0i)b
µb
R(Z0i)}ϵ0i
2
b
µµ2
1
n0
n0
X
i=1 b
R(Z0i)ϵ0i
2
=Op(n1/2
0N+n1
0).
Consequently, we can easily obtain
{b
J0}1"1
n0
n0
X
i=1 {µR(Z0i)µb
R(Z0i) + µb
R(Z0i)b
µb
R(Z0i)}ϵ0i#
=Op(n1/2
0N+n1
0).(A.25)
Recall the first-order optimality Conditions
1
n0
n0
X
i=1
X0i{Y0iX0i
Tb
β0(b
R(Z0i))Tb
γ0}= 0;
1
n0
n0
X
i=1 b
R(Z0i){Y0iX0i
Tb
β0(b
R(Z0i))Tb
γ0}= 0.
Then the empirical orthogonal score for β0is then given by
1
n0
n0
X
i=1 {X0ib
µb
R(Z0i)}{Y0iX0i
Tb
β0(b
R(Z0i))Tb
γ0}= 0.
With some simple calculations, we can obtain that
1
n0
n0
X
i=1 {X0ib
µb
R(Z0i)}{X0i
T(β0b
β0)(b
R(Z0i))T(b
γ0γ
0)
+ (R(Z0i)b
R(Z0i))Tγ
0+ϵ0i}= 0.
Then we conclude
b
β0β0={b
J0}1"1
n0
n0
X
i=1 {X0ib
µb
R(Z0i)}(R(Z0i)b
R(Z0i))Tγ
0#
+{b
J0}1"1
n0
n0
X
i=1 {X0ib
µb
R(Z0i)}ϵ0i#
={b
J0}1"1
n0
n0
X
i=1 {X0ib
µb
R(Z0i)}(R(Z0i)b
R(Z0i))Tγ
0#
+{b
J0}1"1
n0
n0
X
i=1 {X0iµR(Z0i)}ϵ0i#
+{b
J0}1"1
n0
n0
X
i=1 {µR(Z0i)µb
R(Z0i) + µb
R(Z0i)b
µb
R(Z0i)}ϵ0i#.
By (A.10), (A.24), and (A.25), we obtain that
b
β0β0={J0}1"1
n0
n0
X
i=1 {X0iµR(Z0i)}ϵ0i#+Op(∆2
N).
Consequently, we have proved that
n0(b
β0β0)d
N(0, σ2
0J01).
Proof of Corollary 1. Denote the efficient score for β0 as
Φ(D0;µ0,β0,γ
0,R) = {X0µR(Z0)}{Y0XT
0β0(R(Z0))Tγ
0}.
Then the efficient orthogonal score for β0is then given by
Φ(D0;b
µ,b
β0,b
γ0,b
R) = {X0b
µb
R(Z0i)}{Y0XT
0b
β0(b
R(Z0i))Tb
γ0}.
Note that Φis a d-dimensional vector. Let Φjdenote the jth element of Φ,j[d]. To
simplify the notation, write Φj(D0)=Φj(D0;µ,β0,γ
0,R) and b
Φj(D0;b
µ,b
β0,b
γ0,b
R). For
any j[d], with some calculations, under Conditions 3and 7, the results of Theorem 2and
Theorem 3imply that
1
n0
n0
X
i=1
b
Φ(D0)Φ(D0)
2
2=Op(∆N+n1/2
0).(A.26)
By Condition 7, we have that
EΦ(µ,β0,γ
0,R)2
2
=E[σ2
0]E[{X0m(Z0)}T{X0m(Z0)}] = O(1).
Therefore, for any j1, j2[d],
1
n0
n0
X
i=1 |Φj1(D0i)|2_|Φj2(D0i)|2!1/2
=Op(1),(A.27)
where aWb= max{a, b}. Consequently, by (A.26) and (A.27), we have
1
n0
n0
X
i=1 b
Φj1(D0i)b
Φj2(D0i)1
n0
n0
X
i=1
Φj1(D0ij2(D0i)
1
n0
n0
X
i=1 b
Φj1(D0i)b
Φj2(D0i)Φj1(D0ij2(D0i)
1
n0
n0
X
i=1 b
Φj1(D0i)Φj1(D0i)_b
Φj2(D0i)Φj2(D0i)
×|Φj1(D0i)|_|Φj2(D0i)|+b
Φj1(D0i)Φj1(D0i)_b
Φj2(D0i)Φj2(D0i)
Op(∆N+n1/2
0).(A.28)
Furthermore, note that
1
n0
n0
X
i=1
Φj1(D0ij2(D0i)E(ϵ2
0V
j1V
j2) = Op(n1/2
0),(A.29)
where V
jis the jth element of X0m(Z0). Therefore, by (A.28) and (A.29), we have
1
n0
n0
X
i=1 b
Φj1(D0i)b
Φj2(D0i)E(ϵ2
0V
j1V
j2)
Op(∆N+n1/2
0).
Note that
b
J0
p
J0.
Then applying the continuous mapping theorem completes the proof of Corollary 1.