Doubly Robust Inference for Hazard Ratio under Informative Censoring
with Machine Learning
Jiyu Luo and Ronghui Xu

Jiyu Luo: Herbert Wertheim School of Public Health and Human Longevity Science, University of California, San Diego, La Jolla, CA 92093-0112, USA. E-mail: jil130@ucsd.edu.
Ronghui Xu: Herbert Wertheim School of Public Health and Human Longevity Science, Department of Mathematics and Halicioglu Data Science Institute, University of California, San Diego, La Jolla, CA 92093-0112, USA. E-mail: rxu@health.ucsd.edu.

arXiv:2206.02296v1 [stat.ME] 6 Jun 2022
Abstract
Randomized clinical trials with time-to-event outcomes have traditionally used the log-rank test followed by
the Cox proportional hazards (PH) model to estimate the hazard ratio between the treatment groups. These
are valid under the assumption that the right-censoring mechanism is non-informative, i.e. independent of the
time-to-event of interest within each treatment group. More generally, the censoring time might depend on
additional covariates, and inverse probability of censoring weighting (IPCW) can be used to correct for the bias
resulting from the informative censoring. IPCW requires a correctly specified censoring time model conditional
on the treatment and the covariates. Doubly robust inference in this setting has not been plausible previously
due to the non-collapsibility of the Cox model. However, with the recent development of data-adaptive machine
learning methods we derive an augmented IPCW (AIPCW) estimator that has the following doubly robust (DR)
properties: it is model doubly robust, in that it is consistent and asymptotically normal (CAN), as long as one of the
two models, one for the failure time and one for the censoring time, is correctly specified; it is also rate doubly
robust, in that it is CAN as long as the product of the estimation error rates under these two models is faster
than root-n. We investigate the AIPCW estimator using extensive simulation in finite samples.
Keywords: Cox proportional hazards model; Rate doubly robust; AIPCW.
1 Introduction
In the analysis of time-to-event data, the Cox proportional hazards (PH) model (Cox, 1972) has been widely used to estimate the hazard ratio (HR) between two treatment groups in a randomized clinical trial, for example. The validity of the maximum partial likelihood estimator (MPLE) under the PH model relies on the non-informative
censoring assumption (Fleming and Harrington,1991); that is, the censoring time random variable is independent
of the failure time random variable within each treatment group. In practice, this assumption can be violated, which leads to informative censoring, and the censoring time may well depend on additional covariates. This issue was recently highlighted in Van Lancker et al. (2021), who aimed to develop procedures for selecting baseline covariates to be adjusted for in the Cox regression model. Such adjustment, however, changes the effect estimand, making
it difficult to compare across different adjustment sets. Alternatively, the crude or marginal hazard ratio, as it is
often referred to in the medical literature, between the two groups can still be consistently estimated using inverse
probability of censoring weighting (IPCW) under the relaxed censoring assumption that the censoring time and the
failure time are independent given the additional covariates.
IPCW was proposed in Robins and Finkelstein (2000) to correct the log-rank test for bias resulting from informative censoring and, prior to that, in Robins (1993). Up until then, the main body of literature in both applied
and theoretical survival analysis had assumed non-informative censoring, given the predictors in a regression model
(Fleming and Harrington, 1991). A separate line of research calling for IPCW arose under violation of the PH assumption, where it was recognized that the MPLE gave rise to a population quantity that involved the nuisance censoring distribution (Xu, 1996; Xu and O'Quigley, 2000). A series of works has since been done to correct for this bias using IPCW approaches, including Boyd et al. (2012); Hattori and Henmi (2012); Nguyen and Gillen (2017); Nuño and Gillen (2021). We note that the terminology 'IPCW' was not always mentioned in some of these works, which used the (conditional) survival distribution increments as weights in each risk set; but these are algebraically equivalent to the inverse probability of censoring weights.
The censoring distribution used in IPCW is often modeled parametrically or semiparametrically, and the resulting
IPCW estimator is consistent and asymptotically normal (CAN) if the model is correctly specified. Nguyen and Gillen
(2017) proposed a survival tree approach to estimate the conditional censoring distribution given the covariates, but
with no theoretical guarantee for inference. In fact, it is known that the resulting estimator is typically biased (Belloni et al., 2013).
Doubly robust (DR) approaches were developed for handling missing data (Robins, 1993; Robins et al., 1995; Scharfstein et al., 1999; Robins et al., 2000b; Robins and Rotnitzky, 2001; van der Laan and Robins, 2003; Bang and Robins, 2005; Tsiatis, 2006). Such an approach is called doubly robust because two working models are involved, one for the outcome of interest and one for the missing data mechanism, and the estimator is consistent as long as one of the two working models is correctly specified. When IPW is used to handle the missingness (referred to as coarsening),
this usually comes down to augmentation with the coarsened data and the resulting DR estimator is an augmented
IPW (AIPW) estimator (Tsiatis,2006).
Since right censoring in survival data may be framed as a type of coarsening (Tsiatis,2006), Rotnitzky and Robins
(2005) developed an augmented IPCW (AIPCW) approach for censored survival data. For the PH model, however, this approach is not straightforward to apply. As will be seen later, this is mainly due to the non-collapsibility of
the Cox model (Martinussen and Vansteelandt,2013;Tchetgen Tchetgen and Robins,2012;Rava,2021).
In this paper, we consider simultaneously the regression parameter and the nuisance baseline hazard function
under the PH model. This naturally gives rise to full data estimating equations that are sums of independent and
identically distributed (i.i.d.) martingales. The augmentation leads to working models for the failure time and the
censoring time given the group indicator and the covariates. To specify a conditional failure time model that is
compatible with the original (marginal) PH model given the group membership only, data adaptive machine learning
(ML) or nonparametric methods are needed. With cross-fitting (Chernozhukov et al.,2018), the resulting AIPCW
estimator has doubly robust properties not only in the classical sense, which is referred to as model doubly robust, but
also rate doubly robust (Smucler et al.,2019;Hou et al.,2021). Here, rate double robustness refers to an estimator
being CAN when the product of the estimation error rates under the two working models is faster than root-n, while
either one of them is allowed to be arbitrarily slow.
The rest of the paper is organized as follows. In Section 1.1, we state the model and assumption about censoring.
In Section 2, we take a missing data approach by constructing the AIPCW score from the full data score, and
provide a detailed algorithm for the cross-fitted AIPCW estimator. Asymptotic properties of the AIPCW estimator
are described in Section 3. In Section 4, we conduct simulations for the AIPCW estimator using different nuisance
estimators, and also compare them with the IPCW estimators. Finally, we conclude with discussion in Section 5.
Additional materials are provided in the Appendix.
1.1 Model and assumption
Let $T$ and $C$ be the failure time and the censoring time, respectively. Denote $X = \min(T, C)$ and $\Delta = I(T \le C)$. Denote also $Y(t) = I(X \ge t)$ the at-risk process, and $N(t) = I(X \le t, \Delta = 1)$ the failure event counting process. We consider the two-group survival setting where $A$ is a binary group indicator. For a randomized trial, this can be the treatment groups. Let $Z$ be a $p$-dimensional vector of baseline covariates. We assume that the data consist of $n$ independent and identically distributed (i.i.d.) copies of the random vector $O = (X, \Delta, A, Z)$.

Assumption 1. (informative censoring) $C \perp T \mid (A, Z)$.

We assume the PH model for the two-group survival:
$$\lambda(t \mid A) = \lambda_0(t) \exp(\beta A), \qquad (1)$$
where $\lambda(t \mid A)$ denotes the group-specific hazard function of $T$, $\beta$ is the log hazard ratio, and $\lambda_0(t)$ is the baseline hazard function.
2 Doubly robust inference
In this section following Tsiatis (2006) we treat right censoring as a coarsened data problem. We start with a set
of full data score functions under the PH model, and show that when IPCW is applied to this set of full data
score functions we obtain the familiar IPCW estimator under the Cox model (Boyd et al.,2012). We then mimic
the approach of Rotnitzky and Robins (2005) to augment the IPCW score functions and arrive at a doubly robust
AIPCW estimator. Finally, for inference purposes we introduce cross-fitting and describe the implementation of the
cross-fitted AIPCW estimator.
2.1 Full data score functions
The full data vector is $(T, A, Z)$. Following the commonly used NPMLE approach for the semiparametric PH model, the unknown parameters are $\beta$ and $\Lambda_0(t) = \int_0^t \lambda_0(u)\,du$, the cumulative baseline hazard, which is discretized to jumps at the observed event times only (Nielsen et al., 1992).

Following Fleming and Harrington (1991), define the full data counting process $N^T(t) = I(T \le t)$ and the full data at-risk process $Y^T(t) = I(T \ge t)$. Let
$$M^T(t; \beta, \Lambda_0) = N^T(t) - \int_0^t Y^T(u)\, e^{\beta A}\, d\Lambda_0(u). \qquad (2)$$
Then $M^T(t; \beta, \Lambda_0)$ is the full data martingale with respect to the full data filtration $\mathcal{F}^f_t = \{N^T(u), Y^T(u+), A, Z;\ 0 \le u \le t\}$ under model (1).

We have the following full data score functions for a single copy of the data:
$$D^f_1(\beta, \Lambda_0, t) = dM^T(t; \beta, \Lambda_0), \qquad D^f_2(\beta, \Lambda_0) = \int_0^\tau A\, dM^T(t; \beta, \Lambda_0),$$
where $\tau$ is the maximum follow-up time. Note that $D^f_1(\beta, \Lambda_0, t)$ is a martingale difference that is often used in survival analysis; see, for example, Lu and Ying (2004). For each $t$, the true values of the parameters $\beta$ and $\Lambda_0$ satisfy
$$E\{D^f_1(\beta, \Lambda_0, t)\} = 0 \quad \text{and} \quad E\{D^f_2(\beta, \Lambda_0)\} = 0. \qquad (3)$$
2.2 IPCW score functions
In survival analysis, it is common to consider the quantity
$$M(t) = N(t) - \int_0^t Y(u)\, e^{\beta A}\, d\Lambda_0(u). \qquad (4)$$
Note that it is not a martingale under informative censoring. We define $S_c(t \mid A, Z) = P(C \ge t \mid A, Z)$ the conditional survival function of $C$, $\tilde\Delta(t) = I(\min(T, t) < C)$, and denote
$$dM^w(t; \beta, \Lambda_0, S_c) = S_c(t \mid A, Z)^{-1}\, \tilde\Delta(t)\, dM^T(t; \beta, \Lambda_0) = S_c(t \mid A, Z)^{-1}\{dN(t) - Y(t)\, e^{\beta A}\, d\Lambda_0(t)\}. \qquad (5)$$
Note that expression (5) gives the IPCW score functions:
$$D^w_1(\beta, \Lambda_0, t; S_c) = dM^w(t; \beta, \Lambda_0, S_c), \qquad (6)$$
$$D^w_2(\beta, \Lambda_0; S_c) = \int_0^\tau A\, dM^w(t; \beta, \Lambda_0, S_c). \qquad (7)$$
With $n$ copies of i.i.d. data, this gives the following IPCW weighted estimating equations:
$$\frac{1}{n}\sum_{i=1}^n D^w_{1i}(\beta, \Lambda_0, t; S_c) = 0, \qquad \frac{1}{n}\sum_{i=1}^n D^w_{2i}(\beta, \Lambda_0; S_c) = 0.$$
After some algebra, the above estimating equations can be combined to give the IPCW partial likelihood score equation (Boyd et al., 2012):
$$\sum_{i=1}^n \int_0^\tau \hat S_c(t \mid A_i, Z_i)^{-1}\left\{A_i - \frac{\tilde S^{(1)}(\beta, t; \hat S_c)}{\tilde S^{(0)}(\beta, t; \hat S_c)}\right\} dN_i(t) = 0, \qquad (8)$$
where $\tilde S^{(l)}(\beta, t; S_c) = \sum_{j=1}^n A_j^l\, S_c(t \mid A_j, Z_j)^{-1}\, Y_j(t)\, e^{\beta A_j}$ for $l = 0, 1$, and $\hat S_c(t \mid A, Z)$ is some consistent estimator of $S_c(t \mid A, Z)$.
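To fix ideas, the following is a minimal R sketch, not the authors' implementation, of the nuisance estimation behind (8): a working Cox model is fit for the censoring time to obtain $\hat S_c(t \mid A, Z)$, and the inverse probability of censoring weight $\hat S_c(X_i \mid A_i, Z_i)^{-1}$ is evaluated at each subject's observed time, as it multiplies $dN_i(t)$. The data frame dat and its columns X, Delta, A, Z1, Z2 are hypothetical, and solving the full weighted score (8), which also reweights the risk sets, is not shown.

library(survival)

# Working Cox model for the censoring time (censoring treated as the 'event')
cens.fit <- coxph(Surv(X, 1 - Delta) ~ A + Z1 + Z2, data = dat)
# One predicted censoring survival curve per subject
sf <- survfit(cens.fit, newdata = dat)
# Step-function lookup of S_c-hat at each subject's own observed time X_i
Sc.at.X <- vapply(seq_len(nrow(dat)), function(i) {
  idx <- findInterval(dat$X[i], sf$time)
  if (idx == 0) 1 else sf$surv[idx, i]
}, numeric(1))
# IPCW weight attached to each observed failure, with small values trimmed (cf. Section 4)
ipcw <- ifelse(dat$Delta == 1, 1 / pmax(Sc.at.X, 0.01), NA)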
2.3 AIPCW score functions
The consistency of the IPCW estimator relies critically on $S_c(t \mid A, Z)$ being correctly specified. When it is misspecified, the IPCW estimator is biased. Rotnitzky and Robins (2005) provide an augmentation approach for an IPCW estimator in survival analysis, so that it has the doubly robust property to be detailed later. However, their approach
cannot be directly applied because we have not only different weights for different individuals in the data set, but also
different weights for each risk set. To this end, it is helpful to augment the martingale increment in (5) as follows.
Denote $N_c(t) = I(X \le t, \Delta = 0)$ the counting process for the censoring event, and $\Lambda_c(t \mid A, Z) = \int_0^t S_c(u \mid A, Z)^{-1}\, d\{1 - S_c(u \mid A, Z)\}$ the cumulative hazard function of $C$ given $A, Z$. Then $M_c(t; S_c) = N_c(t) - \int_0^t Y(u)\, d\Lambda_c(u \mid A, Z)$ is the martingale corresponding to the censoring event counting process with respect to its natural history filtration. Also denote $S(t \mid A, Z) = P(T \ge t \mid A, Z)$, and $F(t \mid A, Z) = 1 - S(t \mid A, Z)$. Define
$$dM^{aug}(t; \beta, \Lambda_0, S, S_c) = dM^w(t; \beta, \Lambda_0, S_c) + \int_0^t E\{dM^T(t; \beta, \Lambda_0) \mid A, Z, T \ge u\}\, \frac{dM_c(u; S_c)}{S_c(u \mid A, Z)} \qquad (9)$$
$$= \frac{dN(t) - Y(t)\, e^{\beta A}\, d\Lambda_0(t)}{S_c(t \mid A, Z)} - J(t; S, S_c)\{dS(t \mid A, Z) + S(t \mid A, Z)\, e^{\beta A}\, d\Lambda_0(t)\}, \qquad (10)$$
where $J(t; S, S_c) = \int_0^t S(u \mid A, Z)^{-1}\, S_c(u \mid A, Z)^{-1}\, dM_c(u; S_c)$. The last `=' above used the fact that, for $u \le t$,
$$E\{N^T(t) \mid A, Z, T \ge u\} = P(T \le t \mid A, Z, T \ge u) = \frac{F(t \mid A, Z) - F(u \mid A, Z)}{S(u \mid A, Z)}, \qquad (11)$$
$$E\{Y^T(t) \mid A, Z, T \ge u\} = P(T \ge t \mid A, Z, T \ge u) = \frac{S(t \mid A, Z)}{S(u \mid A, Z)}. \qquad (12)$$
The above leads to the AIPCW score functions:
$$D_1(\beta, \Lambda_0, t; S, S_c) = dM^{aug}(t; \beta, \Lambda_0, S, S_c), \qquad (13)$$
$$D_2(\beta, \Lambda_0; S, S_c) = \int_0^\tau A \cdot dM^{aug}(t; \beta, \Lambda_0, S, S_c). \qquad (14)$$
In Theorem 1 below, we will show that (13) and (14) are doubly robust score functions. We use the superscript $o$ to denote the truth; for example, $S^o(t \mid A, Z)$, $S^o_c(t \mid A, Z)$ and $\Lambda^o_c(t \mid A, Z)$ denote the true $S(t \mid A, Z)$, $S_c(t \mid A, Z)$ and $\Lambda_c(t \mid A, Z)$, respectively. Also let $\beta^o$ and $\Lambda^o_0$ denote the true values of the parameters of interest. We assume the following:

Assumption 2. $S^o(\tau \mid a, z) > c$ for $a \in \{0, 1\}$, $z \in \mathcal{Z}$ and some $c > 0$.

Assumption 3. $S^o_c(\tau \mid a, z) > c$ for $a \in \{0, 1\}$, $z \in \mathcal{Z}$ and some $c > 0$.

Theorem 1. Under Assumptions 1-3, if either $S = S^o$ or $S_c = S^o_c$,
$$E\{D_1(\beta^o, \Lambda^o_0, t; S, S_c)\} = E\{D_2(\beta^o, \Lambda^o_0; S, S_c)\} = 0. \qquad (15)$$

The above theorem states that the scores $(D_1, D_2)$ identify the true parameters $(\beta^o, \Lambda^o_0)$, as long as one of the two survival functions, $S(t \mid A, Z)$ and $S_c(t \mid A, Z)$, is true.
Given $n$ i.i.d. data points, we estimate $\beta^o, \Lambda^o_0$ by solving
$$\frac{1}{n}\sum_{i=1}^n D_{1i}(\beta, \Lambda_0, t; S, S_c) = 0, \qquad (16)$$
$$\frac{1}{n}\sum_{i=1}^n D_{2i}(\beta, \Lambda_0; S, S_c) = 0. \qquad (17)$$
Solving (16) gives
$$\tilde\Lambda_0(\beta, t; S, S_c) = \int_0^t \frac{\frac{1}{n}\sum_{i=1}^n \left[S_c(u \mid A_i, Z_i)^{-1}\, dN_i(u) - J_i(u; S, S_c)\, dS(u \mid A_i, Z_i)\right]}{S^{(0)}(\beta, u; S, S_c)}, \qquad (18)$$
where
$$S^{(l)}(\beta, t; S, S_c) = \frac{1}{n}\sum_{i=1}^n A_i^l\, e^{\beta A_i}\left\{S_c(t \mid A_i, Z_i)^{-1}\, Y_i(t) + J_i(t; S, S_c)\, S(t \mid A_i, Z_i)\right\} \qquad (19)$$
for $l = 0, 1$. Further define $\bar A(\beta, t; S, S_c) = S^{(1)}(\beta, t; S, S_c)/S^{(0)}(\beta, t; S, S_c)$. After plugging (18) into (17), we have:
$$U(\beta; S, S_c) = \frac{1}{n}\sum_{i=1}^n \int_0^\tau \left\{S_c(t \mid A_i, Z_i)^{-1}\, dN_i(t) - J_i(t; S, S_c)\, dS(t \mid A_i, Z_i)\right\}\left\{A_i - \bar A(\beta, t; S, S_c)\right\} = 0. \qquad (20)$$
It is worth noting that, like the partial likelihood score equation, (20) is not a sum of i.i.d. terms due to $\bar A(\beta, t; S, S_c)$. As seen from the derivation leading to (10), the augmentation to the weighted martingale increment, which is linear in $N(t)$ and $Y(t)$, is the result of augmentation to the weighted $N(t)$ and $Y(t)$, respectively. It is apparent that $S_c(t \mid A_i, Z_i)^{-1}\, dN_i(t) - J_i(t; S, S_c)\, dS(t \mid A_i, Z_i)$ is the augmented weighted $dN_i(t)$, and the augmented weighted $Y_i(t)$'s give rise to the quantities $S^{(l)}(\cdot)$ and $\bar A(\cdot)$, which are the analogs of similar quantities under the usual Cox model. For example, $\bar A(\beta, t; S, S_c)$ corresponds to the empirical mean of the treatment random variable $A$ among subjects who fail at time $t$, which we may denote by $\rho(\beta, t)$.

The quantity $\rho(\beta, t)$ was implied in Rotnitzky and Robins (2005), as a nuisance parameter, based on the partial likelihood score function. It would, however, not be straightforward to construct compatible models for $\rho(\beta, t)$, which is defined on nested risk sets over time. The set of full data estimating functions we consider here, simultaneously for $\beta$ and $\Lambda_0$, on the other hand, leads naturally to models for $S$ and $S_c$.
2.4 Cross-fitted AIPCW estimator
In practice, both survival functions $S(t \mid A, Z)$ and $S_c(t \mid A, Z)$ are unknown and need to be estimated by some estimators $\hat S(t \mid A, Z)$ and $\hat S_c(t \mid A, Z)$. Parametric and semiparametric models, like the Cox model and the accelerated failure time (AFT) model, are often applied since their theoretical properties are well-studied and they require little computing power. However, these models can be misspecified, especially for $S(t \mid A, Z)$ due to the non-collapsibility of the Cox model. ML or nonparametric methods, like splines (Gray, 1992; Kooperberg et al., 1995a) and the random survival forest (Ishwaran et al., 2008), offer a good alternative. ML or nonparametric estimators, however, do not have root-$n$ convergence rates, which makes it difficult to conduct inference. We will show that asymptotic normality can be established if we also apply cross-fitting, where the entire sample is first split into $k$ folds, and for each fold, we estimate the nuisance functions using only the out-of-fold sample. Details of the cross-fitted AIPCW estimator $\hat\beta$ are described in Algorithm 1. Heuristically, cross-fitting works by inducing independence between the nuisance parameter estimators and the rest of the quantities in the scores, thereby allowing asymptotic normality to be established (Smucler et al., 2019; Hou et al., 2021).
Algorithm 1: $k$-fold cross-fitted AIPCW estimation of $\beta$

Input: A sample of $n$ observations that are split into $k$ folds of equal size with index sets $\mathcal{I}_1, \mathcal{I}_2, \ldots, \mathcal{I}_k$.

for each fold indexed by $m$ do
    obtain estimated nuisance functions $(\hat S^{(m)}, \hat S_c^{(m)})$ using the out-of-fold sample indexed by $\mathcal{I}_{-m} := \{1, \ldots, n\} \setminus \mathcal{I}_m$.
end for

Output: $\hat\beta$, the solution to
$$\frac{1}{n}\sum_{i=1}^n D_{1i}(\beta, \Lambda_0, t; \hat S^{(m(i))}, \hat S_c^{(m(i))}) = 0, \qquad (21)$$
$$\frac{1}{n}\sum_{i=1}^n D_{2i}(\beta, \Lambda_0; \hat S^{(m(i))}, \hat S_c^{(m(i))}) = 0, \qquad (22)$$
where $m(i)$ maps observation $i$ to the index of the fold it belongs to.
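As a concrete illustration of the cross-fitting step in Algorithm 1, the following R sketch (not the authors' code) splits a hypothetical data frame dat with columns X, Delta, A, Z1, Z2 into $k = 5$ folds and fits the two nuisance functions on each out-of-fold sample, here with working Cox models purely for illustration.

library(survival)

k    <- 5
fold <- sample(rep(1:k, length.out = nrow(dat)))   # fold label m(i) for each observation
nuis <- vector("list", k)
for (m in 1:k) {
  train <- dat[fold != m, ]                        # out-of-fold sample indexed by I_{-m}
  nuis[[m]] <- list(
    S.fit  = coxph(Surv(X, Delta) ~ A + Z1 + Z2, data = train),     # working model for S(t | A, Z)
    Sc.fit = coxph(Surv(X, 1 - Delta) ~ A + Z1 + Z2, data = train)  # working model for S_c(t | A, Z)
  )
}
# Predicted survival curves for the held-out fold m = 1; rows of sf$surv index the event
# times of the training sample and columns index the held-out subjects.
sf <- survfit(nuis[[1]]$S.fit, newdata = dat[fold == 1, ])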
Quantities involving cross-fitting would be slightly different from quantities without cross-fitting, and will involve the estimated nuisance parameters $(\hat S^{(1)}, \hat S_c^{(1)}), \ldots, (\hat S^{(k)}, \hat S_c^{(k)})$. Specifically, solving (21), we have
$$\tilde\Lambda^{cf}_0(\beta, t; \hat S, \hat S_c) = \int_0^t \frac{\frac{1}{n}\sum_{i=1}^n \left[\hat S_c^{(m(i))}(u \mid A_i, Z_i)^{-1}\, dN_i(u) - J_i(u; \hat S^{(m(i))}, \hat S_c^{(m(i))})\, d\hat S^{(m(i))}(u \mid A_i, Z_i)\right]}{S^{(0)}_{cf}(\beta, u; \hat S, \hat S_c)}, \qquad (23)$$
with
$$S^{(l)}_{cf}(\beta, t; \hat S, \hat S_c) = \frac{1}{n}\sum_{i=1}^n A_i^l\, e^{\beta A_i}\left\{\hat S_c^{(m(i))}(t \mid A_i, Z_i)^{-1}\, Y_i(t) + J_i(t; \hat S^{(m(i))}, \hat S_c^{(m(i))})\, \hat S^{(m(i))}(t \mid A_i, Z_i)\right\} \qquad (24)$$
for $l = 0, 1$. Also, $\bar A_{cf}(\beta, t; \hat S, \hat S_c) = S^{(1)}_{cf}(\beta, t; \hat S, \hat S_c)/S^{(0)}_{cf}(\beta, t; \hat S, \hat S_c)$, and after plugging (23) into (22), we have the final cross-fitted AIPCW estimating equation:
$$U_{cf}(\beta; \hat S, \hat S_c) = \frac{1}{n}\sum_{i=1}^n \int_0^\tau \left\{\hat S_c^{(m(i))}(t \mid A_i, Z_i)^{-1}\, dN_i(t) - J_i(t; \hat S^{(m(i))}, \hat S_c^{(m(i))})\, d\hat S^{(m(i))}(t \mid A_i, Z_i)\right\}\left\{A_i - \bar A_{cf}(\beta, t; \hat S, \hat S_c)\right\}. \qquad (25)$$
We solve the cross-fitted AIPCW estimating equation (25) using the Newton-Raphson algorithm.
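Since $\beta$ is a scalar here, the Newton-Raphson step can be written compactly. The following R sketch is illustrative only, with U_cf a hypothetical function returning the cross-fitted score (25) at a given $\beta$ (nuisance estimates already plugged in) and the derivative of the score approximated numerically.

newton_solve <- function(U_cf, beta0 = 0, tol = 1e-8, max.iter = 50, eps = 1e-5) {
  beta <- beta0
  for (it in seq_len(max.iter)) {
    u  <- U_cf(beta)
    du <- (U_cf(beta + eps) - U_cf(beta - eps)) / (2 * eps)  # numerical slope of the score
    beta.new <- beta - u / du                                # Newton-Raphson update
    if (abs(beta.new - beta) < tol) return(beta.new)
    beta <- beta.new
  }
  beta                                                       # last iterate if not converged
}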
3 Asymptotic Properties
We will now describe the asymptotic properties of the proposed cross-fitted AIPCW estimator, when estimated using
a random sample of size n. We first list a few additional assumptions.
Assumption 4. There exist $S^*(t \mid a, z)$ and $S_c^*(t \mid a, z)$ with $S^*(\tau \mid a, z) > c$ and $S_c^*(\tau \mid a, z) > c$ for some $c > 0$, such that
$$\sup_{t \in [0, \tau],\, a \in \{0,1\},\, z \in \mathcal{Z}} |\hat S(t \mid a, z) - S^*(t \mid a, z)| = O_p(a_n),$$
$$\sup_{t \in [0, \tau],\, a \in \{0,1\},\, z \in \mathcal{Z}} |\hat S_c(t \mid a, z) - S_c^*(t \mid a, z)| = O_p(b_n),$$
for some $a_n = o(1)$ and $b_n = o(1)$.

Assumption 5. For the limits $S^*$ and $S_c^*$, there exist a neighbourhood $\mathcal{B}$ of $\beta^o$ and functions $s^{(l)}(\beta, t; S^*, S_c^*)$ for $l = 0, 1$ defined on $\mathcal{B} \times [0, \tau]$ such that $\sup_{t \in [0, \tau],\, \beta \in \mathcal{B}} |S^{(l)}(\beta, t; S^*, S_c^*) - s^{(l)}(\beta, t; S^*, S_c^*)| = o_p(1)$.

Assumption 6. For $l = 0, 1$, $s^{(l)}(\beta, t; S^*, S_c^*)$ are continuous functions of $\beta \in \mathcal{B}$, uniformly in $t \in [0, \tau]$, and are bounded on $\mathcal{B} \times [0, \tau]$; $s^{(0)}(\beta, t; S^*, S_c^*)$ is bounded away from zero on $\mathcal{B} \times [0, \tau]$. For all $\beta \in \mathcal{B}$, $t \in [0, \tau]$:
$$s^{(1)}(\beta, t; S^*, S_c^*) = \frac{\partial}{\partial \beta} s^{(0)}(\beta, t; S^*, S_c^*) = \frac{\partial^2}{\partial \beta^2} s^{(0)}(\beta, t; S^*, S_c^*). \qquad (26)$$
In addition, let $\bar a = s^{(1)}/s^{(0)}$ and $v = \bar a - \bar a^2$. We have $\nu(\beta^o; S^*, S_c^*) = \int_0^\tau v(\beta^o, t; S^*, S_c^*)\, s^{(0)}(\beta^o, t; S^*, S_c^*)\, d\Lambda^o_0(t) > 0$.
Assumption 4 assumes that both $\hat S$ and $\hat S_c$ converge to some limiting functions $S^*$ and $S_c^*$ that are not necessarily the truth. Here, we do not make the root-$n$ convergence assumption for each of $\hat S$ and $\hat S_c$, which often limits us to parametric or semiparametric models. This assumption also implies that $\hat S^{(m)}$ and $\hat S_c^{(m)}$ converge to $S^*$ and $S_c^*$ at the same rate. Assumptions 5 and 6 are similar to regularity assumptions that are typically made under the PH model (Andersen and Gill, 1982).

The asymptotic properties of the cross-fitted AIPCW estimator $\hat\beta$ defined in Algorithm 1 are summarized in Theorems 2 and 3 below.
Theorem 2. Under Assumptions 4-6, if either $S^* = S^o$ or $S_c^* = S^o_c$, then $\hat\beta \xrightarrow{p} \beta^o$.

Theorem 3. Under Assumptions 4-6, if any of the following conditions hold:

(a) (Rate double robustness) $S^* = S^o$, $S_c^* = S^o_c$ and $a_n b_n = o(n^{-1/2})$;

(b) (Model double robustness) $S^* = S^o$ and $a_n = O(n^{-1/2})$. In particular, there exists an influence function $\xi(t, a, z)$ such that $\hat S(t \mid a, z) - S^*(t \mid a, z) = \sum_{j=1}^n \xi_j(t, a, z)/n + o_p(n^{-1/2})$;

(c) (Model double robustness) $S_c^* = S^o_c$ and $b_n = O(n^{-1/2})$. In particular, there exists an influence function $\eta(t, a, z)$ such that $\hat S_c(t \mid a, z) - S_c^*(t \mid a, z) = \sum_{j=1}^n \eta_j(t, a, z)/n + o_p(n^{-1/2})$;

then we have
$$\sqrt{n}(\hat\beta - \beta^o) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \nu(\beta^o; S^*, S_c^*)^{-1}\, \psi_i(\beta^o, \Lambda^o_0, S^*, S_c^*) + o_p(1), \qquad (27)$$
where the expression for $\psi_i(\beta^o, \Lambda^o_0, S^*, S_c^*)$ is provided in Appendix A.
Theorem 3 establishes both the model double robustness and the rate double robustness properties. Traditionally, doubly robust inference is established assuming both working models are parametric or semiparametric. Model double robustness here allows estimation under the possibly wrong model to converge at any rate. The theorem also establishes rate double robustness, which states that if the estimators under both working models converge to the truth and their product rate is faster than root-$n$, the proposed AIPCW estimator is CAN even if one of the nuisance estimators converges arbitrarily slowly. This result permits more flexible ML or nonparametric methods with valid inference.

The asymptotic variance of the proposed estimator is simplified under condition (a). In this case, we provide an estimator of the asymptotic variance, which is given in Theorem 4 below.

Theorem 4. Under Assumptions 4-6, if condition (a) of Theorem 3 holds, i.e. if $S^* = S^o$, $S_c^* = S^o_c$ and $a_n b_n = o_p(n^{-1/2})$, then $\hat\nu^{-2} K/n$ is a consistent estimator for the asymptotic variance of $\hat\beta$, where $\hat\nu$ and $K$ are provided in Appendix A.
When one of the working models is misspecified, the asymptotic variance is rather complicated. In this case, resampling methods such as the bootstrap (Efron, 1979) may be used to estimate the variance since the AIPCW estimator is asymptotically linear.
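A nonparametric bootstrap for this purpose can be sketched as follows in R; aipcw_cf is a hypothetical wrapper, not part of the paper, that takes a data frame and returns the cross-fitted AIPCW estimate of $\beta$.

B <- 200
boot.est <- replicate(B, {
  idx <- sample(nrow(dat), replace = TRUE)   # resample subjects with replacement
  aipcw_cf(dat[idx, ])
})
se.boot <- sd(boot.est)                      # bootstrap standard error of beta-hat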
4 Simulation
In this section, we compare the performance of the cross-fitted AIPCW estimators $\hat\beta$ using different working models, against different IPCW estimators and the MPLE. We consider sample sizes $n = 500$ and $n = 1000$, and 1000 data sets are simulated for each setting, which corresponds to a margin of error of about $\pm 1.35\%$ for the coverage probability of nominal 95% confidence intervals. Five-fold cross-fitting is used.

For data generation, we first follow the diagram in Figure 1(a) and generate $U_1 \sim \text{Unif}(-1, 1)$, $A \sim \text{Bernoulli}(0.5)$, $Z_1 \sim N(0.5U_1, 1)$, $Z_2 \sim N(U_1^2, 0.09)$, and $T = -\log(0.5U_1 + 0.5)\, e^{A}$. Here, $T$ follows the PH model (1) with $\beta^o = -1$ and $\lambda^o_0(t) = 1$.
We consider two scenarios of data generating processes for the censoring time $C$, as described in Figure 1(b). Both scenarios have around 25% of the samples administratively censored at $\tau = 1$, and 40% of the remaining samples censored during follow-up. Note that administrative censoring works in the same way for $T$ and $C$, i.e. those observations are considered 'censored' for both the estimation of $S$ and the estimation of $S_c$. It is obvious that Scenario 1 can be correctly modeled. Scenario 2 is designed such that most commonly used semiparametric models fail. As it turns out, under Scenario 2 $S_c(\tau \mid A, Z)$ can be very close to zero for some values of $A$ and $Z$, leading to possible violation of Assumptions 2 and 3. This echoes the argument made in D'Amour et al. (2021) that the overlap assumption needed for DR estimation often fails in practice.
Figure 1: (a) Variable diagram for $U_1$, $U_2$, $A$, $Z = (Z_1, Z_2)$, $T$ and $C$. (b) Data generating process for $C$:

Scenario 1 (Cox PH): $\lambda_c(t) = \exp(-1 + 2Z_2)$.

Scenario 2 (Mixture): $\log(U_2) \sim N(0, 1)$; if $Z_1 > 0$, $\log(C) = 0.2A - 2\sqrt{|Z_2|} + 0.3U_2$; if $Z_1 \le 0$, $\log(C) = 2.4 - 0.3A + 0.5\sqrt{|Z_1|} + 0.5\sqrt{|Z_2|} - U_2$.
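For concreteness, a minimal R sketch of the Scenario 1 data-generating process, as reconstructed above (not the authors' simulation code), is:

set.seed(1)
n   <- 500
U1  <- runif(n, -1, 1)
A   <- rbinom(n, 1, 0.5)
Z1  <- rnorm(n, mean = 0.5 * U1, sd = 1)
Z2  <- rnorm(n, mean = U1^2, sd = sqrt(0.09))        # variance 0.09
Ttime <- -log(0.5 * U1 + 0.5) * exp(A)               # PH model (1) with beta_o = -1, lambda_0(t) = 1
Ctime <- rexp(n, rate = exp(-1 + 2 * Z2))            # Scenario 1: Cox PH censoring hazard
tau   <- 1
X     <- pmin(Ttime, Ctime, tau)                     # administrative censoring at tau = 1
Delta <- as.numeric(Ttime <= pmin(Ctime, tau))       # failure indicator
dat   <- data.frame(X, Delta, A, Z1, Z2)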
We consider three types of working models: the PH model using the R package 'survival'; splines (Kooperberg et al., 1995a) using the R package 'polspline'; and the random survival forest (RSF) (Ishwaran et al., 2008) using the R package 'randomForestSRC'. We set splitrule = 'bs.gradient' for RSF, while keeping all the other settings at their defaults. We study 7 different combinations of working models for the proposed AIPCW estimator: Cox-Cox, Cox-spline, Cox-RSF, spline-Cox, RSF-Cox, spline-spline, and RSF-RSF, where the first part of the name denotes the model for $S$ and the second part denotes the model for $S_c$. It is worth noting that due to the non-collapsibility of the Cox model, a semiparametric conditional model for $S$ is almost always misspecified. Therefore the consistency of AIPCW-Cox-Cox, AIPCW-Cox-spline and AIPCW-Cox-RSF relies on the correct specification of the censoring model. We also note that the convergence rates of the spline and RSF estimators are largely unknown and depend on the choice of tuning parameters. See the Discussion for more on this.
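As an example of the RSF working model for the censoring distribution, the following R sketch (not the authors' code) uses hypothetical out-of-fold and held-out data frames train and test with columns X, Delta, A, Z1, Z2, and the splitting rule described above.

library(randomForestSRC)

rsf.c <- rfsrc(Surv(X, cens) ~ A + Z1 + Z2,
               data = transform(train, cens = 1 - Delta),   # censoring as the 'event'
               splitrule = "bs.gradient")
pred  <- predict(rsf.c, newdata = test)
# pred$survival: matrix of predicted S_c values, rows = test subjects, columns = pred$time.interest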
We also investigate the performance of the MPLE and various IPCW estimators: IPCW-Cox, IPCW-spline, IPCW-RSF, IPCW-A and IPCW-1. More specifically, IPCW-A estimates $S_c$ using the product-limit estimator within each group indicated by $A$, while IPCW-1 estimates $S_c$ using the product-limit estimator on the entire sample. The robust variance estimator from Boyd et al. (2012) is used to estimate the model-based standard errors of the IPCW estimators. Standard errors for the cross-fitted AIPCW estimators are estimated using Theorem 4, which assumes that both the $S$ and $S_c$ models are correctly specified.

To avoid numerical problems, we impose a minimum on $\hat S^{(m)}(t \mid A, Z)$ and $\hat S_c^{(m)}(t \mid A, Z)$ in the above, so that values below 0.01 are trimmed to 0.01. Finally, as a benchmark, we also fit model (1) to the full data without censoring.
The simulation results for Scenarios 1 and 2 are reported in Tables 1 and 2, respectively. It is immediate that under informative censoring, the MPLE, IPCW-1 and IPCW-A have substantial bias, leading to poor coverage of the confidence intervals (CI). Under Scenario 1, where the censoring model is correctly specified as Cox, the other three IPCW estimators (-Cox, -spline, -RSF) all appear to perform reasonably well. All seven AIPCW estimators also perform well under Scenario 1, with AIPCW-Cox-RSF having larger bias compared to the rest.

Under Scenario 2, IPCW-Cox appears more biased than IPCW-spline and IPCW-RSF, as expected. But even for the latter two estimators, their SE's severely under-estimate the SD's, leading to poor coverage of the CI's. This also points to the known fact that inference is not guaranteed when ML or nonparametric methods are used in IPCW, as discussed earlier. AIPCW-Cox-Cox also has large bias under Scenario 2, as expected. The remaining six AIPCW's are less biased. For the larger sample size $n = 1000$, the AIPCW estimators using two ML or nonparametric methods appear to have the least bias, with close to nominal coverage probabilities. Finally we note that, under Scenario 2, spline-based AIPCW's tend to have larger variance. This might be explained by the fact that splines are less stable near the boundary $\tau$, which under Scenario 2 has small $\hat S_c(\tau \mid A, Z)$ for some values of $A$ and $Z$, as mentioned earlier.
Table 1: Simulation results under Scenario 1. Data are generated following Figures 1(a) and (b) with $\beta^o = -1$. Red indicates that the model or approach is invalid.
Sample Size Estimators Bias SD SE CP
n= 500
AIPCW-Cox-Cox 0.002 0.196 0.191 0.94
AIPCW-Cox-spline -0.001 0.198 0.190 0.94
AIPCW-Cox-RSF 0.023 0.197 0.207 0.96
AIPCW-spline-Cox 0.005 0.185 0.177 0.94
AIPCW-RSF-Cox 0.005 0.189 0.178 0.94
AIPCW-spline-spline 0.002 0.185 0.177 0.94
AIPCW-RSF-RSF 0.002 0.192 0.190 0.95
IPCW-Cox -0.006 0.186 0.179 0.94
IPCW-spline -0.005 0.188 0.179 0.94
IPCW-RSF 0.008 0.190 0.177 0.93
IPCW-A -0.221 0.180 0.162 0.70
IPCW-1 -0.221 0.179 0.162 0.70
MPLE -0.205 0.175 0.167 0.76
Full data 0.002 0.103 0.099 0.93
n= 1000
AIPCW-Cox-Cox -0.008 0.137 0.134 0.94
AIPCW-Cox-spline -0.010 0.138 0.133 0.94
AIPCW-Cox-RSF 0.019 0.141 0.153 0.97
AIPCW-spline-Cox 0.001 0.127 0.123 0.94
AIPCW-RSF-Cox 0.002 0.130 0.125 0.94
AIPCW-spline-spline 0.001 0.127 0.123 0.94
AIPCW-RSF-RSF -0.005 0.134 0.134 0.95
IPCW-Cox -0.009 0.130 0.128 0.94
IPCW-spline -0.007 0.135 0.128 0.94
IPCW-RSF 0.011 0.134 0.128 0.95
IPCW-A -0.225 0.126 0.114 0.51
IPCW-1 -0.224 0.126 0.114 0.51
MPLE -0.207 0.122 0.118 0.58
Full data -0.003 0.069 0.07 0.94
SD: standard deviation; SE: standard error; CP: coverage probability of nominal 95% CI
Table 2: Simulation results under Scenario 2. Data are generated following Figures 1(a) and (b) with $\beta^o = -1$. Red indicates that the model or approach is invalid.
Sample Size Estimators Bias SD SE CP
n= 500
AIPCW-Cox-Cox -0.129 0.285 0.276 0.93
AIPCW-Cox-spline -0.029 0.604 0.623 0.97
AIPCW-Cox-RSF -0.064 0.249 0.243 0.93
AIPCW-spline-Cox -0.068 0.282 0.256 0.93
AIPCW-RSF-Cox -0.034 0.275 0.250 0.93
AIPCW-spline-spline 0.038 0.578 0.585 0.96
AIPCW-RSF-RSF -0.039 0.264 0.238 0.93
IPCW-Cox -0.114 0.266 0.174 0.77
IPCW-spline -0.046 0.452 0.192 0.68
IPCW-RSF -0.088 0.257 0.179 0.80
IPCW-A -0.227 0.184 0.170 0.74
IPCW-1 -0.226 0.183 0.166 0.72
MPLE -0.216 0.179 0.174 0.77
Full data 0.002 0.103 0.099 0.93
n= 1000
AIPCW-Cox-Cox -0.127 0.195 0.192 0.90
AIPCW-Cox-spline -0.056 0.396 0.367 0.95
AIPCW-Cox-RSF -0.035 0.187 0.189 0.95
AIPCW-spline-Cox -0.056 0.191 0.180 0.93
AIPCW-RSF-Cox -0.021 0.185 0.178 0.92
AIPCW-spline-spline 0.008 0.344 0.332 0.95
AIPCW-RSF-RSF -0.020 0.198 0.179 0.93
IPCW-Cox -0.103 0.204 0.126 0.71
IPCW-spline -0.045 0.377 0.146 0.63
IPCW-RSF -0.047 0.202 0.134 0.78
IPCW-A -0.220 0.127 0.120 0.56
IPCW-1 -0.219 0.127 0.117 0.53
MPLE -0.211 0.123 0.123 0.61
Full data -0.003 0.069 0.07 0.94
SD: standard deviation; SE: standard error; CP: coverage probability of nominal 95% CI
5 Discussion
For the analysis of two-group survival, including in randomized clinical trials, non-informative censoring is typically assumed. When the simple PH model (1) is used with no covariates adjusted for, this requires the censoring distribution to be independent of any covariates. When this assumption is violated, the commonly used MPLE is biased, and typically IPCW is used to correct that bias if the interest remains in estimating the marginal hazard ratio between the two groups. IPCW, on the other hand, requires modeling the censoring distribution, which can be wrong unless ML or nonparametric estimates are used. In this paper we have developed an AIPCW estimator that is both model DR and rate DR. Rate double robustness allows us to get around the non-collapsibility of the Cox regression model using more flexible ML or nonparametric methods for the conditional failure time model demanded by the DR construct, because almost any parametric or semiparametric model would otherwise be invalid.
The theoretical results require certain rate conditions on the estimates of the nuisance parameters. These are not always established for a given ML or nonparametric estimator. Cui et al. (2022) and Kooperberg et al. (1995b) demonstrated that under certain conditions, rates better than $n^{-1/4}$ can be achieved for the random survival forest and splines, which would lead to a faster than root-$n$ product rate. Faster than $n^{-1/4}$ rates are also shown to be attainable for other ML methods, for example, regression trees (Wager and Walther, 2015) and neural networks (Chen and White, 1999). The rates, of course, depend on the hyper-parameter values. In the simulations we used the default settings for the spline and the random survival forest. Investigation of other ML or nonparametric methods, as well as their tuning, in relationship to the performance of DR estimators, remains a topic of future work.
This work focused on two-group survival and a binary $A$. Generalization to continuous and/or multivariate $A$ is conceptually straightforward, although different algebra might be involved. In particular, for continuous $A$, we would no longer have $A^2 = A$, and additional quantities like $S^{(2)}$ would need to be introduced.
Finally, the models for $S$ and $S_c$ may include additional and different sets of covariates for these two models, so long as the failure time and the censoring time are independent given the common covariates $Z$.
The R code for the cross-fitted AIPCW estimator, as well as the simulation procedures investigated in this work, is available online at http://github.com/charlesluo1002/DR-Cox.
References
Andersen, P. K. and Gill, R. D. (1982). Cox’s regression model for counting processes: a large sample study.
Ann. Stat. 10: 1100–1120.
Bai, X., Tsiatis, A. A., Lu, W. and Song, R. (2017). Optimal treatment regimes for survival endpoints using a
locally-efficient doubly-robust estimator from a classification perspective. Lifetime Data Anal. 23(4): 585–604.
Belloni, A., Chernozhukov, V. and Hansen, C. (2013). Inference on treatment effects after selection among
high-dimensional controls. The Review of Economic Studies 81(2): 608-650.
Bang, H. and Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models.
Biometrics 61: 692-972.
Bickel, P.J., Klaassen, C.A.J., Ritov, Y. and Wellner, J.A. (1993). Efficient and adaptive estimation for
semiparametric Models. The Johns Hopkins University Press, Baltimore.
Boyd, A.P., Kittelson, J.M. and Gillen, D.L. (2012). Estimation of treatment effect under non-proportional
hazards and conditionally independent censoring. Stat. Med. 31(28): 3504-15.
Breslow, N.E. (1972). Discussion of the paper by D. R. Cox. J. R. Statist. Soc. B. 34: 216–217.
Campigotto, F. and Weller, E. (2014). Impact of informative censoring on the Kaplan-Meier estimate of
progression-free survival in phase II clinical trials. J. Clin. Oncol. 32(27): 3068-3074.
Chen, X. and White, H. (1999). Improved rates and asymptotic normality for nonparametric neural network
estimators. IEEE Trans. Inf. Theory 45(2): 682-691.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J.
(2018). Double/debiased machine learning for treatment and structural parameters. Econom. J. 12: C1-C68.
Cox, D.R. (1972). Regression models and life-tables (with discussion). J. R. Statist. Soc. B. 34: 187–220.
Cox, D.R. (1975). Partial likelihood. Biometrika 62: 269–276.
Cui, Y., Zhu, R., Zhou, M. and Kosorok, M. (2022). Consistency of survival tree and forest models: splitting
bias and correction. Stat. Sin. (preprint).
D'Amour, A., Ding, P., Feller, A., Lei, L. and Sekhon, J. (2021). Overlap in observational studies with high-dimensional covariates. J. Econom. 221: 644-654.
Dukes, O., Martinussen, T., Tchetgen Tchetgen, E. J. and Vansteelandt, S. (2019). On doubly robust
estimation of the hazard difference. Biometrics 75: 100-109.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Ann. Statist. 7: 1-26.
Fleming, T.R. and Harrington, D.P. (1991). Counting processes and survival analysis. Wiley, New York.
Gray, R. J. (1992). Flexible methods for analyzing survival data using splines, with applications to breast cancer
prognosis. J. Am. Stat. Assoc. 87: 942–951.
Hattori, S. and Henmi, M. (2012). Estimation of treatment effects based on possibly misspecified Cox regression.
Lifetime Data Anal. 18(4): 408-33.
Hou, J., Bradic, J. and Xu, R. (2021). Treatment effect estimation under additive hazards models with high-
dimensional confounding. J. Am. Stat. Assoc. 116: early view.
Ishwaran, H., Kogalur, U. B., Blackstone, E. H. and Lauer, M. S. (2008). Random survival forests. Ann. Appl. Stat. 2: 841–860.
Van Lancker, K., Dukes, O. and Vansteelandt, S. (2021). Principled selection of baseline covariates to account
for censoring in randomized trials with a survival endpoint. Stat. Med. 40(18): 4108–4121.
Keiding, N., Holst, C. and Green, A. (1989). Retrospective estimation of diabetes incidence from information
in a current prevalent population and historical mortality. Am. J. Epidemiol. 130: 588-600.
Kooperberg, C., Stone, C. J. and Truong, Y. K. (1995a). Hazard regression. J. Am. Stat. Assoc. 90: 78–94.
Kooperberg, C., Stone, C. J. and Truong, Y. K. (1995b). The L2 rate of convergence for hazard regression. Scand. J. Stat. 22: 143–157.
Koul, H., Susarla, V. and van Ryzin, J. (1981). Regression analysis with randomly right censored data. Ann. Stat. 9: 1276–88.
Lin, D. Y. and Wei, L. J. (1989). The robust inference for the Cox proportional hazards model. J. Am. Stat. Assoc. 84(408): 1074–1078.
Lok, J. J., Yang, S., Sharkey, B. and Hughes, M. D. (2018). Estimation of the cumulative incidence function
under multiple dependent and independent censoring mechanisms. Lifetime Data Anal. 24(2): 201–223.
Lu, X. and Tsiatis, A.A. (2008). Improving the efficiency of the log-rank test using auxiliary covariates. Biometrika
95(3): 679–694.
Lu, W. and Ying, Z. (2004). On semiparametric transformation cure models. Biometrika 91(2): 331–343.
Martinussen, T. and Vansteelandt, S. (2013). On collapsibility and confounding bias in Cox and Aalen regres-
sion models. Lifetime Data Anal. 19(3): 279–296.
Murphy, S.A. (1994). Consistency in a proportional hazards model incorporating a random effect. Ann. Stat. 22: 712-731.
Murphy, S.A. (1995). Asymptotic theory for the frailty model. Ann. Stat. 23: 182-198.
Nguyen, V. Q. and Gillen, D. L. (2017). Censoring-robust estimation in observational survival studies: Assessing
the relative effectiveness of vascular access type on patency among end-stage renal disease patients. Stat. Biosci.
9(2): 406–430.
Nielsen, G., Gill, R. D., Andersen, P.K. and Sørensen, T.I.A. (1992). A counting process approach to maximum likelihood estimation in frailty models. Scand. J. Stat. 19: 25–44.
Nuño, M.M. and Gillen, D. L. (2021). Censoring-robust time-dependent receiver operating characteristic curve estimators. Stat. Med. 40(30): 6885–6899.
Rava, D. (2021). Survival analysis and causal inference: from marginal structural Cox to additive hazards model and beyond. Ph.D. Thesis, University of California, San Diego.
Robins, J.M. (2000). Marginal structural models versus structural nested models as tools for causal inference.
Statistical models in epidemiology, the environment, and clinical trials New York: Springer, pp. 95–133.
Robins, J.M. and Finkelstein, D. M. (2000). Correcting for noncompliance and dependent censoring in an AIDS
clinical trial with inverse probability of censoring weighted (IPCW) log-rank tests. Biometrics 56(3): 779–788.
Robins, J.M., Hernan, M.A. and Brumback, B. (2000). Marginal structural models and causal inference in
epidemiology. Epidemiology 11(5): 550–560.
Robins, J.M. and Rotnitzky, A. (2001). Comment on "Inference for semiparametric models: Some questions and an answer" by Bickel and Kwon. Stat. Sin. 11(4): 920-936.
Robins, J.M. (1993). Information recovery and bias adjustment in proportional hazards regression analysis of randomized trials using surrogate markers. In Proceedings of the Biopharmaceutical Section, American Statistical Association. 24-33.
Robins, J.M., Rotnitzky, A. and van der Laan, M. (2000). Comment on "On profile likelihood" by Murphy and van der Vaart. J. Am. Stat. Assoc. 95: 477–482.
Robins, J.M., Rotnitzky, A. and Zhao, L. P. (1995). Analysis of semiparametric regression models for repeated
outcomes in the presence of missing data. J. Am. Stat. Assoc. 90: 106-121.
Rotnitzky, A., Bergesio, A. and Farall, A. (2009). Analysis of quality-of-life adjusted failure time data in the
presence of competing, possibly informative, censoring mechanisms. Lifetime Data Anal. 15(1): 1-23.
Rotnitzky, A. and Robins, J.M. (2005). Inverse probability weighting in survival analysis. Encyclopedia of Bio-
statistics. Vol 4. 2619-2625. Second Edition. Edited by Peter Armitage and Theodore Colton. New York, Wiley,
2004.
Scharfstein, D. O., Rotnitzky, A. and Robins, J.M. (1999). Rejoinder to "Adjusting for nonignorable dropout using semiparametric nonresponse models". J. Am. Stat. Assoc. 94: 1135–1146.
International Non-Hodgkin’s Lymphoma Prognostic Factors Project (1993). A predictive model for
aggressive non-Hodgkin’s lymphoma. N. Engl. J. Med.329(14): 987-994.
Smucler, E., Rotnitzky, A. and Robins, J.M. (2019). A unifying approach for doubly-robust L1 regularized
estimation of causal contrasts. arXiv preprint arXiv:1904.03737.
Struthers, C. A. and Kalbfleisch, J. D. (1986). Misspecified proportional hazard models. Biometrika 73:
363–369.
Tchetgen Tchetgen, E.J. and Robins, J.M. (2012). On parametrization, robustness and sensitivity analysis in
a marginal structural cox proportional hazards model for point exposure. Statistics and Probability Letters 82:
907-915.
Templeton, A.J., Amir, E. and Tannock, I.F. (2020). Informative censoring a neglected cause of bias in
oncology trials. Nat. Rev. Clin. Oncol. 17: 327-328.
Tsiatis, A. A. (2006). Semiparametric theory and missing data. New York: Springer.
van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge, UK, Cambridge University Press..
van der Laan, M. J. and Robins, J.M.(2003). Unified methods for censored longitudinal data and causality. New
York: Springer.
Wager, S. and Walther, G. (2015). Adaptive concentration of regression trees, with application to random forests.
arXiv preprint arXiv:1503.06388.
Xu, R. (1996). Inference for the proportional hazards model. Ph.D. Thesis, University of California, San Diego.
Xu, R. and Adak, S. (2002). Survival analysis with time-varying regression effects using a tree-based approach.
Biometrics 58(2): 305–315.
Xu, R. and O’Quigley J (2000). Estimating average regression effect under non-proportional hazards. Biostatistics
1: 423–439.
Yang, S., Pieper, K. and Cools, F. (2020). Semiparametric estimation of structural failure time models in
continuous-time processes. Biometrika 107: 123-136.
Zhang, M. and Schaubel, D. E. (2012a). Contrasting treatment-specific survival using double-robust estimators.
Stat. Med. 31: 4255–4268.
Zhang, M. and Schaubel, D. E. (2012b). Double-robust semiparametric estimator for differences in restricted
mean lifetimes in observational studies. Biometrics 68: 999-1009.
Appendix
A Notation and Expressions
First, we list or repeat notation that will be used in the proofs. For $i$ in $1, \ldots, n$, we define
$$M_{ci}(t; S_c) = N_{ci}(t) - \int_0^t Y_i(u)\, d\Lambda_c(u \mid A_i, Z_i),$$
$$J_i(t; S, S_c) = \int_0^t S(u \mid A_i, Z_i)^{-1}\, S_c(u \mid A_i, Z_i)^{-1}\, dM_{ci}(u; S_c),$$
$$dN_i(t; S, S_c) = S_c(t \mid A_i, Z_i)^{-1}\, dN_i(t) - J_i(t; S, S_c)\, dS(t \mid A_i, Z_i),$$
$$\Gamma^{(l)}_i(\beta, t; S, S_c) = A_i^l\, e^{\beta A_i}\{S_c(t \mid A_i, Z_i)^{-1}\, Y_i(t) + J_i(t; S, S_c)\, S(t \mid A_i, Z_i)\},$$
$$S^{(l)}(\beta, t; S, S_c) = \frac{1}{n}\sum_{i=1}^n \Gamma^{(l)}_i(\beta, t; S, S_c),$$
$$dM^{aug}_i(t; \beta, \Lambda_0, S, S_c) = dN_i(t; S, S_c) - \Gamma^{(0)}_i(\beta, t; S, S_c)\, d\Lambda_0(t),$$
$$\bar A(\beta, t; S, S_c) = S^{(1)}(\beta, t; S, S_c)/S^{(0)}(\beta, t; S, S_c),$$
$$U(\beta; S, S_c) = \frac{1}{n}\sum_{i=1}^n \int_0^\tau dN_i(t; S, S_c)\{A_i - \bar A(\beta, t; S, S_c)\},$$
$$V(\beta, t; S, S_c) = d\bar A(\beta, t; S, S_c)/d\beta = \bar A(\beta, t; S, S_c) - \bar A(\beta, t; S, S_c)^2,$$
$$\bar a(\beta, t; S, S_c) = s^{(1)}(\beta, t; S, S_c)/s^{(0)}(\beta, t; S, S_c),$$
$$v(\beta, t; S, S_c) = \bar a(\beta, t; S, S_c) - \bar a(\beta, t; S, S_c)^2,$$
$$\mu(\beta; S, S_c) = \int_0^\tau \{\bar a(\beta^o, t; S, S_c) - \bar a(\beta, t; S, S_c)\}\, s^{(0)}(\beta^o, t; S, S_c)\, d\Lambda^o_0(t),$$
$$\nu(\beta; S, S_c) = \int_0^\tau v(\beta, t; S, S_c)\, s^{(0)}(\beta^o, t; S, S_c)\, d\Lambda^o_0(t).$$
Next are expressions used in the asymptotic results. We stated in Theorem 3 that
$$\sqrt{n}(\hat\beta - \beta^o) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \nu(\beta^o; S^*, S_c^*)^{-1}\, \psi_i(\beta^o, \Lambda^o_0, S^*, S_c^*) + o_p(1). \qquad (28)$$
Here $\psi_i(\beta^o, \Lambda^o_0, S^*, S_c^*) = \psi_{1i} + \psi_{2i} + \psi_{3i}$, where
$$\psi_{1i} = \int_0^\tau \{A_i - \bar a(\beta^o, t; S^*, S_c^*)\}\, dM^{aug}_i(t; \beta^o, \Lambda^o_0, S^*, S_c^*), \qquad (29)$$
$$\psi_{2i} = \frac{k}{n(k-1)} \sum_{j \in \mathcal{I}_{-m(i)}} \int_0^\tau \{\bar a(\beta^o, t; S^*, S_c^*) - A_i\}\, J_i(t; S^*, S_c^*)\{d\xi_j(t, A_i, Z_i) + e^{\beta^o A_i}\, \xi_j(t, A_i, Z_i)\, d\Lambda^o_0(t)\},$$
$$\psi_{3i} = \frac{k}{n(k-1)} \sum_{j \in \mathcal{I}_{-m(i)}} \int_0^\tau \{A_i - \bar a(\beta^o, t; S^*, S_c^*)\}\Bigg(\frac{\eta_j(t, A_i, Z_i)}{S_c^*(t \mid A_i, Z_i)^2}\{dN_i(t) - Y_i(t)\, e^{\beta^o A_i}\, d\Lambda^o_0(t)\}$$
$$\qquad - \int_0^t \Big[\frac{S^*(u \mid A_i, Z_i)}{S_c^*(u \mid A_i, Z_i)^2}\, \eta_j(u, A_i, Z_i)\{dM_{ci}(u; S_c^*) + Y_i(u)\, S_c^*(u \mid A_i, Z_i)^{-1}\, dS_c^*(u \mid A_i, Z_i)\}$$
$$\qquad + S^*(u \mid A_i, Z_i)^{-1}\, S_c^*(u \mid A_i, Z_i)^{-2}\, Y_i(u)\, \eta_j(u, A_i, Z_i)\{dS^*(t \mid A_i, Z_i) + S^*(t \mid A_i, Z_i)\, e^{\beta^o A_i}\, d\Lambda^o_0(t)\}\Big]\Bigg).$$
The long expression above simplifies depending on which models are correctly specified. Under case (a) of Theorem 3, when both $S$ and $S_c$ are correctly specified, $\psi_{2i} = \psi_{3i} = 0$. Under case (b), when $\hat S$ is $\sqrt{n}$-consistent, $\psi_{3i} = 0$. Under case (c), when $\hat S_c$ is $\sqrt{n}$-consistent, $\psi_{2i} = 0$.

In Theorem 4, we stated that $\hat\nu^{-2} K/n$ is a consistent estimator for the asymptotic variance of $\hat\beta$ when both $S^* = S^o$ and $S_c^* = S^o_c$ are correctly specified. The expressions for $K$ and $\hat\nu$ are as follows:
$$\hat\nu = \frac{1}{n}\sum_{i=1}^n \int_0^\tau V(\hat\beta, t; \hat S, \hat S_c)\, dN_i(t; \hat S^{(m(i))}, \hat S_c^{(m(i))}),$$
$$K = \frac{1}{n}\sum_{i=1}^n \tilde\psi_{1i}\big(\hat\beta, \tilde\Lambda_0(\hat\beta, \cdot\,; \hat S, \hat S_c), \hat S^{(m(i))}, \hat S_c^{(m(i))}\big)^2,$$
where
$$\tilde\psi_{1i}(\beta, \Lambda_0, S, S_c) = \int_0^\tau \{A_i - \bar A(\beta, t; S, S_c)\}\, dM^{aug}_i(t; \beta, \Lambda_0, S, S_c).$$
B Proof of Double Robustness
Lemma 1. For any $S_c(t \mid A, Z)$ with its corresponding censoring specific martingale $M_c(t; S_c)$,
$$\int_0^t \frac{dM_c(u; S_c)}{S_c(u \mid A, Z)} = 1 - \frac{Y(t)}{S_c(t \mid A, Z)} - \frac{N(t)}{S_c(X \mid A, Z)}, \qquad (30)$$
where $N(t) = I(X < t, T \le C)$.

Note, this can be seen as a continuous version of Lemma 10.4 in Tsiatis (2006).

Proof

First note that
$$\int_0^t \frac{dN_c(u)}{S_c(u \mid A, Z)} = \frac{N_c(t)}{S_c(X \mid A, Z)}, \qquad (31)$$
where $N_c(t) = I(X < t, T > C)$. Next, since $S_c(u \mid A, Z) = \exp\{-\Lambda_c(u \mid A, Z)\}$,
$$-\int_0^t \frac{Y(u)\, d\Lambda_c(u \mid A, Z)}{S_c(u \mid A, Z)} = I(X \ge t)\int_0^t \frac{dS_c(u \mid A, Z)}{S_c(u \mid A, Z)^2} + I(X < t)\int_0^X \frac{dS_c(u \mid A, Z)}{S_c(u \mid A, Z)^2}$$
$$= I(X \ge t)\{-S_c(u \mid A, Z)^{-1}\}\big|_{u=0}^{u=t} + I(X < t)\{-S_c(u \mid A, Z)^{-1}\}\big|_{u=0}^{u=X}$$
$$= 1 - \frac{Y(t)}{S_c(t \mid A, Z)} - \frac{I(X < t)}{S_c(X \mid A, Z)}. \qquad (32)$$
Since $I(X < t) = N(t) + N_c(t)$, (31) + (32) then gives the lemma.
Proof of Theorem 1

Recall that
$$dM^{aug}(t; \beta, \Lambda_0, S, S_c) = dM^w(t; \beta, \Lambda_0, S_c) - J(t; S, S_c)\{dS(t \mid A, Z) + S(t \mid A, Z)\, e^{\beta A}\, d\Lambda_0(t)\},$$
where $J(t; S, S_c)$ is also included in Appendix A.

a) Assume $S_c = S^o_c$.

We first consider $dM^w(t; \beta^o, \Lambda^o_0, S^o_c)$. For $h(A) = 1$ or $A$,
$$E\{h(A)\, dM^w(t; \beta^o, \Lambda^o_0, S^o_c)\}$$
$$= E\Big(h(A)\, S^o_c(t \mid A, Z)^{-1}\big[dE\{I(T \le t)\, I(C \ge t) \mid T, A, Z\} - E\{I(T \ge t)\, I(C \ge t) \mid T, A, Z\}\, e^{\beta^o A}\, d\Lambda^o_0(t)\big]\Big)$$
$$= E\big[h(A)\, S^o_c(t \mid A, Z)^{-1}\{dI(T \le t)\, P(C \ge t \mid A, Z) - I(T \ge t)\, P(C \ge t \mid A, Z)\, e^{\beta^o A}\, d\Lambda^o_0(t)\}\big]$$
$$= E\{h(A)\, dM^T(t; \beta^o, \Lambda^o_0)\}$$
$$= 0,$$
where the second `=' above uses the informative censoring Assumption 1.

Next we consider $J(t; S, S^o_c)\{dS(t \mid A, Z) + S(t \mid A, Z)\, e^{\beta^o A}\, d\Lambda^o_0(t)\}$. Its expectation being zero follows immediately from the fact that $M_c(t; S^o_c)$ is a martingale.
b) Assume $S = S^o$.

Noting that $Y^T(t)\, N(t) = N(t)\, dN^T(t) = 0$ and $Y(t)\, dN^T(t) = dN(t)$, we multiply (30) by $dM^T(t) = dN^T(t) - Y^T(t)\, e^{\beta^o A}\, d\Lambda_0(t)$, giving:
$$dM^T(t; \beta^o, \Lambda^o_0)\int_0^t \frac{dM_c(u; S_c)}{S_c(u \mid A, Z)}$$
$$= dN^T(t)\int_0^t \frac{dM_c(u; S_c)}{S_c(u \mid A, Z)} - Y^T(t)\, e^{\beta^o A}\, d\Lambda^o_0(t)\int_0^t \frac{dM_c(u; S_c)}{S_c(u \mid A, Z)}$$
$$= dN^T(t) - \frac{dN^T(t)\, Y(t)}{S_c(t \mid A, Z)} - \frac{dN^T(t)\, N(t)}{S_c(X \mid A, Z)} - Y^T(t)\, e^{\beta^o A}\, d\Lambda^o_0(t) + \frac{Y(t)\, e^{\beta^o A}\, d\Lambda^o_0(t)}{S_c(t \mid A, Z)} + \frac{Y^T(t)\, N(t)\, e^{\beta^o A}\, d\Lambda^o_0(t)}{S_c(X \mid A, Z)}$$
$$= dM^T(t) - dM^w(t).$$
Therefore
$$dM^w(t; \beta^o, \Lambda^o_0) = dM^T(t; \beta^o, \Lambda^o_0) - dM^T(t; \beta^o, \Lambda^o_0)\int_0^t \frac{dM_c(u; S_c)}{S_c(u \mid A, Z)}.$$

We note that (11) and (12) hold when $S = S^o$. From (9) we then have
$$E\{dM^{aug}(t; \beta^o, \Lambda^o_0, S^o, S_c)\}$$
$$= E\Big[dM^w(t; \beta^o, \Lambda^o_0, S_c) + \int_0^t E\{dM^T(t; \beta^o, \Lambda^o_0) \mid A, Z, T \ge u\}\, \frac{dM_c(u; S_c)}{S_c(u \mid A, Z)}\Big]$$
$$= E\Big[dM^T(t; \beta^o, \Lambda^o_0) - dM^T(t; \beta^o, \Lambda^o_0)\int_0^t \frac{dM_c(u; S_c)}{S_c(u \mid A, Z)} + \int_0^t E\{dM^T(t; \beta^o, \Lambda^o_0) \mid A, Z, T \ge u\}\, \frac{dM_c(u; S_c)}{S_c(u \mid A, Z)}\Big]$$
$$= E\Big(\int_0^t \big[E\{dM^T(t; \beta^o, \Lambda^o_0) \mid A, Z, T \ge u\} - dM^T(t; \beta^o, \Lambda^o_0)\big]\, \frac{dM_c(u; S_c)}{S_c(u \mid A, Z)}\Big)$$
$$= E\Big(E\Big[\int_0^t \big[E\{dM^T(t; \beta^o, \Lambda^o_0) \mid A, Z, T \ge u\} - dM^T(t; \beta^o, \Lambda^o_0)\big]\, \frac{dN_c(u)}{S_c(u \mid A, Z)} \,\Big|\, A, Z, T \ge u, C = u\Big]\Big)$$
$$\quad - E\Big(E\Big[\int_0^t \big[E\{dM^T(t; \beta^o, \Lambda^o_0) \mid A, Z, T \ge u\} - dM^T(t; \beta^o, \Lambda^o_0)\big]\, \frac{Y(u)\, d\Lambda_c(u)}{S_c(u \mid A, Z)} \,\Big|\, A, Z, T \ge u, C \ge u\Big]\Big)$$
$$= E\Big(\int_0^t \frac{dN_c(u)}{S_c(u \mid A, Z)}\big[E\{dM^T(t; \beta^o, \Lambda^o_0) \mid A, Z, T \ge u, C = u\} - E\{dM^T(t; \beta^o, \Lambda^o_0) \mid A, Z, T \ge u, C = u\}\big]\Big)$$
$$\quad - E\Big(\int_0^t \frac{Y(u)\, d\Lambda_c(u)}{S_c(u \mid A, Z)}\big[E\{dM^T(t; \beta^o, \Lambda^o_0) \mid A, Z, T \ge u, C \ge u\} - E\{dM^T(t; \beta^o, \Lambda^o_0) \mid A, Z, T \ge u, C \ge u\}\big]\Big)$$
$$= 0,$$
where in the third line above $E\{dM^T(t; \beta^o, \Lambda^o_0)\} = 0$ because $M^T(t; \beta^o, \Lambda^o_0)$ is a martingale.

The above also gives
$$E\Big\{\int_0^t A\, dM^{aug}(t; \beta^o, \Lambda^o_0, S^o, S_c)\Big\} = 0.$$