Doubly Robust Inference for Hazard Ratio under Informative Censoring
with Machine Learning
Jiyu Luo and Ronghui Xu

Jiyu Luo: Herbert Wertheim School of Public Health and Human Longevity Science, University of California, San Diego, La Jolla, CA 92093-0112, USA. E-mail: jil130@ucsd.edu.
Ronghui Xu: Herbert Wertheim School of Public Health and Human Longevity Science, Department of Mathematics and Halicioglu Data Science Institute, University of California, San Diego, La Jolla, CA 92093-0112, USA. E-mail: rxu@health.ucsd.edu.

arXiv:2206.02296v1 [stat.ME] 6 Jun 2022
Abstract
Randomized clinical trials with time-to-event outcomes have traditionally used the log-rank test followed by
the Cox proportional hazards (PH) model to estimate the hazard ratio between the treatment groups. These
are valid under the assumption that the right-censoring mechanism is non-informative, i.e. independent of the
time-to-event of interest within each treatment group. More generally, the censoring time might depend on
additional covariates, and inverse probability of censoring weighting (IPCW) can be used to correct for the bias
resulting from the informative censoring. IPCW requires a correctly specified censoring time model conditional
on the treatment and the covariates. Doubly robust inference in this setting has not been plausible previously
due to the non-collapsibility of the Cox model. However, with the recent development of data-adaptive machine
learning methods we derive an augmented IPCW (AIPCW) estimator that has the following doubly robust (DR)
properties: it is model doubly robust, in that it is consistent and asymptotically normal (CAN), as long as one of the
two models, one for the failure time and one for the censoring time, is correctly specified; it is also rate doubly
robust, in that it is CAN as long as the product of the estimation error rates under these two models is faster
than root-n. We investigate the AIPCW estimator using extensive simulation in finite samples.
Keywords: Cox proportional hazards model; Rate doubly robust; AIPCW.
1 Introduction
In the analysis of time-to-event data, the Cox proportional hazards (PH) model (Cox, 1972) has been widely used to estimate the hazard ratio (HR) between two treatment groups in a randomized clinical trial, for example. The validity of the maximum partial likelihood estimator (MPLE) under the PH model relies on the non-informative
censoring assumption (Fleming and Harrington,1991); that is, the censoring time random variable is independent
of the failure time random variable within each treatment group. In practice, this assumption can be violated, which leads to informative censoring, and the censoring time may well depend on additional covariates. This issue was recently highlighted in Van Lancker et al. (2021), who aimed to develop procedures for selecting baseline covariates to be adjusted for in the Cox regression model. Such adjustment, however, changes the effect estimand, making
it difficult to compare across different adjustment sets. Alternatively, the crude or marginal hazard ratio, as it is
often referred to in the medical literature, between the two groups can still be consistently estimated using inverse
probability of censoring weighting (IPCW) under the relaxed censoring assumption that the censoring time and the
failure time are independent given the additional covariates.
IPCW was proposed in Robins and Finkelstein (2000) to correct the log-rank test for bias resulting from informative censoring and, prior to that, in Robins (1993). Up until then, the main body of literature in both applied
and theoretical survival analysis had assumed non-informative censoring, given the predictors in a regression model
(Fleming and Harrington, 1991). A separate line of research calling for IPCW arose under violation of the PH assumption, where it was recognized that the MPLE gave rise to a population quantity that involved the nuisance censoring distribution (Xu, 1996; Xu and O'Quigley, 2000). A series of works has since been done to correct for this bias using IPCW approaches, including Boyd et al. (2012); Hattori and Henmi (2012); Nguyen and Gillen (2017); Nuño and Gillen (2021). We note that the terminology 'IPCW' was not always mentioned in some of these works, which used the (conditional) survival distribution increments as weights in each risk set; but these are algebraically equivalent to the inverse probability of censoring weights.
The censoring distribution used in IPCW is often modeled parametrically or semiparametrically, and the resulting
IPCW estimator is consistent and asymptotically normal (CAN) if the model is correctly specified. Nguyen and Gillen
(2017) proposed a survival tree approach to estimate the conditional censoring distribution given the covariates, but
with no theoretical guarantee for inference. In fact, it is known that the resulting estimator is typically biased (Belloni et al., 2013).
Doubly robust (DR) approaches were developed for handling missing data (Robins, 1993; Robins et al., 1995; Scharfstein et al., 1999; Robins et al., 2000b; Robins and Rotnitzky, 2001; van der Laan and Robins, 2003; Bang and Robins, 2005; Tsiatis, 2006). Such an approach is called doubly robust because two working models are involved, one for the outcome of interest and one for the missing data mechanism, and the estimator is consistent as long as one of the two working models is correctly specified. When IPW is used to handle the missingness (referred to as coarsening),
this usually comes down to augmentation with the coarsened data and the resulting DR estimator is an augmented
IPW (AIPW) estimator (Tsiatis,2006).
Since right censoring in survival data may be framed as a type of coarsening (Tsiatis,2006), Rotnitzky and Robins
(2005) developed an augmented IPCW (AIPCW) approach for censored survival data. For the PH model, however, this approach is not straightforward to apply. As will be seen later, this is mainly due to the non-collapsibility of
the Cox model (Martinussen and Vansteelandt,2013;Tchetgen Tchetgen and Robins,2012;Rava,2021).
In this paper, we consider simultaneously the regression parameter and the nuisance baseline hazard function
under the PH model. This naturally gives rise to full data estimating equations that are sums of independent and
identically distributed (i.i.d.) martingales. The augmentation leads to working models for the failure time and the
censoring time given the group indicator and the covariates. To specify a conditional failure time model that is
compatible with the original (marginal) PH model given the group membership only, data adaptive machine learning
(ML) or nonparametric methods are needed. With cross-fitting (Chernozhukov et al.,2018), the resulting AIPCW
estimator has doubly robust properties not only in the classical sense, which is referred to as model doubly robust, but
also rate doubly robust (Smucler et al.,2019;Hou et al.,2021). Here, rate double robustness refers to an estimator
being CAN when the product of the estimation error rates under the two working models is faster than root-n, while
either one of them is allowed to be arbitrarily slow.
The rest of the paper is organized as follows. In Section 1.1, we state the model and assumption about censoring.
In Section 2, we take a missing data approach by constructing the AIPCW score from the full data score, and
provide a detailed algorithm for the cross-fitted AIPCW estimator. Asymptotic properties of the AIPCW estimator
are described in Section 3. In Section 4, we conduct simulations for the AIPCW estimator using different nuisance
estimators, and also compare them with the IPCW estimators. Finally, we conclude with discussion in Section 5.
Additional materials are provided in the Appendix.
1.1 Model and assumption
Let $T$ and $C$ be the failure time and the censoring time, respectively. Denote $X = \min(T, C)$ and $\Delta = I(T \le C)$. Denote also $Y(t) = I(X \ge t)$ the at-risk process, and $N(t) = I(X \le t, \Delta = 1)$ the failure event counting process. We consider the two-group survival setting where $A$ is a binary group indicator. For a randomized trial, this can be the treatment groups. Let $Z$ be a $p$-dimensional vector of baseline covariates. We assume that the data consist of $n$ independent and identically distributed (i.i.d.) copies of the random vector $O = (X, \Delta, A, Z)$.

Assumption 1. (informative censoring) $C \perp T \mid (A, Z)$.

We assume the PH model for the two-group survival:
$$\lambda(t \mid A) = \lambda_0(t) \exp(\beta A), \qquad (1)$$
where $\lambda(t \mid A)$ denotes the group-specific hazard function of $T$, $\beta$ is the log hazard ratio, and $\lambda_0(t)$ is the baseline hazard function.
2 Doubly robust inference
In this section following Tsiatis (2006) we treat right censoring as a coarsened data problem. We start with a set
of full data score functions under the PH model, and show that when IPCW is applied to this set of full data
score functions we obtain the familiar IPCW estimator under the Cox model (Boyd et al.,2012). We then mimic
the approach of Rotnitzky and Robins (2005) to augment the IPCW score functions and arrive at a doubly robust
AIPCW estimator. Finally, for inference purposes we introduce cross-fitting and describe the implementation of the
cross-fitted AIPCW estimator.
2.1 Full data score functions
The full data vector is $(T, A, Z)$. Following the commonly used NPMLE approach for the semiparametric PH model, the unknown parameters are $\beta$ and $\Lambda_0(t) = \int_0^t \lambda_0(u)\,du$, the cumulative baseline hazard, which is discretized to jumps at the observed event times only (Nielsen et al., 1992).

Following Fleming and Harrington (1991), define the full data counting process $N^T(t) = I(T \le t)$ and the full data at-risk process $Y^T(t) = I(T \ge t)$. Let
$$M^T(t; \beta, \Lambda_0) = N^T(t) - \int_0^t Y^T(u)\, e^{\beta A}\, d\Lambda_0(u). \qquad (2)$$
Then $M^T(t; \beta, \Lambda_0)$ is the full data martingale with respect to the full data filtration $\mathcal{F}^f_t = \{N^T(u), Y^T(u+), A, Z;\ 0 \le u \le t\}$ under model (1).

We have the following full data score functions for a single copy of the data:
$$D^f_1(\beta, \Lambda_0, t) = dM^T(t; \beta, \Lambda_0), \qquad D^f_2(\beta, \Lambda_0) = \int_0^\tau A\, dM^T(t; \beta, \Lambda_0),$$
where $\tau$ is the maximum follow-up time. Note that $D^f_1(\beta, \Lambda_0, t)$ is a martingale difference that is often used in survival analysis; see, for example, Lu and Ying (2004). For each $t$, the true values of the parameters $\beta$ and $\Lambda_0$ satisfy
$$E\{D^f_1(\beta, \Lambda_0, t)\} = 0 \quad \text{and} \quad E\{D^f_2(\beta, \Lambda_0)\} = 0. \qquad (3)$$
2.2 IPCW score functions
In survival analysis, it is common to consider the quantity
$$M(t) = N(t) - \int_0^t Y(u)\, e^{\beta A}\, d\Lambda_0(u). \qquad (4)$$
Note that it is not a martingale under informative censoring. We define $S_c(t \mid A, Z) = P(C \ge t \mid A, Z)$ the conditional survival function of $C$, $\tilde\Delta(t) = I(\min(T, t) < C)$, and denote
$$dM^w(t; \beta, \Lambda_0, S_c) = S_c(t \mid A, Z)^{-1}\, \tilde\Delta(t)\, dM^T(t; \beta, \Lambda_0) = S_c(t \mid A, Z)^{-1}\{dN(t) - Y(t)\, e^{\beta A}\, d\Lambda_0(t)\}. \qquad (5)$$
Note that expression (5) gives the IPCW score functions:
$$D^w_1(\beta, \Lambda_0, t; S_c) = dM^w(t; \beta, \Lambda_0, S_c), \qquad (6)$$
$$D^w_2(\beta, \Lambda_0; S_c) = \int_0^\tau A\, dM^w(t; \beta, \Lambda_0, S_c). \qquad (7)$$
With $n$ copies of i.i.d. data, this gives the following IPCW weighted estimating equations:
$$\frac{1}{n}\sum_{i=1}^n D^w_{1i}(\beta, \Lambda_0, t; S_c) = 0, \qquad \frac{1}{n}\sum_{i=1}^n D^w_{2i}(\beta, \Lambda_0; S_c) = 0.$$
After some algebra, the above estimating equations can be combined to give the IPCW partial likelihood score equation (Boyd et al., 2012):
$$\sum_{i=1}^n \int_0^\tau \hat S_c(t \mid A_i, Z_i)^{-1}\left\{A_i - \frac{\tilde S^{(1)}(\beta, t; \hat S_c)}{\tilde S^{(0)}(\beta, t; \hat S_c)}\right\} dN_i(t) = 0, \qquad (8)$$
where $\tilde S^{(l)}(\beta, t; S_c) = \sum_{j=1}^n A_j^l\, S_c(t \mid A_j, Z_j)^{-1}\, Y_j(t)\, e^{\beta A_j}$ for $l = 0, 1$, and $\hat S_c(t \mid A, Z)$ is some consistent estimator of $S_c(t \mid A, Z)$.
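To fix ideas, the following is a minimal R sketch, not the authors' implementation, of the nuisance estimation behind (8): a working Cox model is fit for the censoring time to obtain $\hat S_c(t \mid A, Z)$, and the inverse probability of censoring weight $\hat S_c(X_i \mid A_i, Z_i)^{-1}$ is evaluated at each subject's observed time, as it multiplies $dN_i(t)$. The data frame dat and its columns X, Delta, A, Z1, Z2 are hypothetical, and solving the full weighted score (8), which also reweights the risk sets, is not shown.

library(survival)

# Working Cox model for the censoring time (censoring treated as the 'event')
cens.fit <- coxph(Surv(X, 1 - Delta) ~ A + Z1 + Z2, data = dat)
# One predicted censoring survival curve per subject
sf <- survfit(cens.fit, newdata = dat)
# Step-function lookup of S_c-hat at each subject's own observed time X_i
Sc.at.X <- vapply(seq_len(nrow(dat)), function(i) {
  idx <- findInterval(dat$X[i], sf$time)
  if (idx == 0) 1 else sf$surv[idx, i]
}, numeric(1))
# IPCW weight attached to each observed failure, with small values trimmed (cf. Section 4)
ipcw <- ifelse(dat$Delta == 1, 1 / pmax(Sc.at.X, 0.01), NA)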
2.3 AIPCW score functions
The consistency of the IPCW estimator relies critically on $S_c(t \mid A, Z)$ being correctly specified. When it is misspecified, the IPCW estimator is biased. Rotnitzky and Robins (2005) provide an augmentation approach for an IPCW estimator in survival analysis, so that it has the doubly robust property to be detailed later. However, their approach
cannot be directly applied because we have not only different weights for different individuals in the data set, but also
different weights for each risk set. To this end, it is helpful to augment the martingale increment in (5) as follows.
Denote $N_c(t) = I(X \le t, \Delta = 0)$ the counting process for the censoring event, and $\Lambda_c(t \mid A, Z) = \int_0^t S_c(u \mid A, Z)^{-1}\, d\{1 - S_c(u \mid A, Z)\}$ the cumulative hazard function of $C$ given $A, Z$. Then $M_c(t; S_c) = N_c(t) - \int_0^t Y(u)\, d\Lambda_c(u \mid A, Z)$ is the martingale corresponding to the censoring event counting process with respect to its natural history filtration. Also denote $S(t \mid A, Z) = P(T \ge t \mid A, Z)$, and $F(t \mid A, Z) = 1 - S(t \mid A, Z)$. Define
$$dM^{aug}(t; \beta, \Lambda_0, S, S_c) = dM^w(t; \beta, \Lambda_0, S_c) + \int_0^t E\{dM^T(t; \beta, \Lambda_0) \mid A, Z, T \ge u\}\, \frac{dM_c(u; S_c)}{S_c(u \mid A, Z)} \qquad (9)$$
$$= \frac{dN(t) - Y(t)\, e^{\beta A}\, d\Lambda_0(t)}{S_c(t \mid A, Z)} - J(t; S, S_c)\{dS(t \mid A, Z) + S(t \mid A, Z)\, e^{\beta A}\, d\Lambda_0(t)\}, \qquad (10)$$
where $J(t; S, S_c) = \int_0^t S(u \mid A, Z)^{-1}\, S_c(u \mid A, Z)^{-1}\, dM_c(u; S_c)$. The last `=' above used the fact that, for $u \le t$,
$$E\{N^T(t) \mid A, Z, T \ge u\} = P(T \le t \mid A, Z, T \ge u) = \frac{F(t \mid A, Z) - F(u \mid A, Z)}{S(u \mid A, Z)}, \qquad (11)$$
$$E\{Y^T(t) \mid A, Z, T \ge u\} = P(T \ge t \mid A, Z, T \ge u) = \frac{S(t \mid A, Z)}{S(u \mid A, Z)}. \qquad (12)$$
The above leads to the AIPCW score functions:
$$D_1(\beta, \Lambda_0, t; S, S_c) = dM^{aug}(t; \beta, \Lambda_0, S, S_c), \qquad (13)$$
$$D_2(\beta, \Lambda_0; S, S_c) = \int_0^\tau A \cdot dM^{aug}(t; \beta, \Lambda_0, S, S_c). \qquad (14)$$
In Theorem 1 below, we will show that (13) and (14) are doubly robust score functions. We use the superscript $o$ to denote the truth; for example, $S^o(t \mid A, Z)$, $S^o_c(t \mid A, Z)$ and $\Lambda^o_c(t \mid A, Z)$ denote the true $S(t \mid A, Z)$, $S_c(t \mid A, Z)$ and $\Lambda_c(t \mid A, Z)$, respectively. Also let $\beta^o$ and $\Lambda^o_0$ denote the true values of the parameters of interest. We assume the following:

Assumption 2. $S^o(\tau \mid a, z) > c$ for $a \in \{0, 1\}$, $z \in \mathcal{Z}$ and some $c > 0$.

Assumption 3. $S^o_c(\tau \mid a, z) > c$ for $a \in \{0, 1\}$, $z \in \mathcal{Z}$ and some $c > 0$.

Theorem 1. Under Assumptions 1-3, if either $S = S^o$ or $S_c = S^o_c$,
$$E\{D_1(\beta^o, \Lambda^o_0, t; S, S_c)\} = E\{D_2(\beta^o, \Lambda^o_0; S, S_c)\} = 0. \qquad (15)$$

The above theorem states that the scores $(D_1, D_2)$ identify the true parameters $(\beta^o, \Lambda^o_0)$, as long as one of the two survival functions, $S(t \mid A, Z)$ and $S_c(t \mid A, Z)$, is true.
Given $n$ i.i.d. data points, we estimate $\beta^o, \Lambda^o_0$ by solving
$$\frac{1}{n}\sum_{i=1}^n D_{1i}(\beta, \Lambda_0, t; S, S_c) = 0, \qquad (16)$$
$$\frac{1}{n}\sum_{i=1}^n D_{2i}(\beta, \Lambda_0; S, S_c) = 0. \qquad (17)$$
Solving (16) gives
$$\tilde\Lambda_0(\beta, t; S, S_c) = \int_0^t \frac{\frac{1}{n}\sum_{i=1}^n \left[S_c(u \mid A_i, Z_i)^{-1}\, dN_i(u) - J_i(u; S, S_c)\, dS(u \mid A_i, Z_i)\right]}{S^{(0)}(\beta, u; S, S_c)}, \qquad (18)$$
where
$$S^{(l)}(\beta, t; S, S_c) = \frac{1}{n}\sum_{i=1}^n A_i^l\, e^{\beta A_i}\left\{S_c(t \mid A_i, Z_i)^{-1}\, Y_i(t) + J_i(t; S, S_c)\, S(t \mid A_i, Z_i)\right\} \qquad (19)$$
for $l = 0, 1$. Further define $\bar A(\beta, t; S, S_c) = S^{(1)}(\beta, t; S, S_c)/S^{(0)}(\beta, t; S, S_c)$. After plugging (18) into (17), we have:
$$U(\beta; S, S_c) = \frac{1}{n}\sum_{i=1}^n \int_0^\tau \left\{S_c(t \mid A_i, Z_i)^{-1}\, dN_i(t) - J_i(t; S, S_c)\, dS(t \mid A_i, Z_i)\right\}\left\{A_i - \bar A(\beta, t; S, S_c)\right\} = 0. \qquad (20)$$
It is worth noting that, like the partial likelihood score equation, (20) is not a sum of i.i.d. terms due to $\bar A(\beta, t; S, S_c)$. As seen from the derivation leading to (10), the augmentation to the weighted martingale increment, which is linear in $N(t)$ and $Y(t)$, is the result of augmentation to the weighted $N(t)$ and $Y(t)$, respectively. It is apparent that $S_c(t \mid A_i, Z_i)^{-1}\, dN_i(t) - J_i(t; S, S_c)\, dS(t \mid A_i, Z_i)$ is the augmented weighted $dN_i(t)$, and the augmented weighted $Y_i(t)$'s give rise to the quantities $S^{(l)}(\cdot)$ and $\bar A(\cdot)$, which are the analogs of similar quantities under the usual Cox model. For example, $\bar A(\beta, t; S, S_c)$ corresponds to the empirical mean of the treatment random variable $A$ among subjects who fail at time $t$, which we may denote by $\rho(\beta, t)$.

The quantity $\rho(\beta, t)$ was implied in Rotnitzky and Robins (2005), as a nuisance parameter, based on the partial likelihood score function. It would, however, not be straightforward to construct compatible models for $\rho(\beta, t)$, which is defined on nested risk sets over time. The set of full data estimating functions we consider here, simultaneously for $\beta$ and $\Lambda_0$, on the other hand, leads naturally to models for $S$ and $S_c$.
2.4 Cross-fitted AIPCW estimator
In practice, both survival functions $S(t \mid A, Z)$ and $S_c(t \mid A, Z)$ are unknown and need to be estimated by some estimators $\hat S(t \mid A, Z)$ and $\hat S_c(t \mid A, Z)$. Parametric and semiparametric models, like the Cox model and the accelerated failure time (AFT) model, are often applied since their theoretical properties are well-studied and they require little computing power. However, these models can be misspecified, especially for $S(t \mid A, Z)$ due to the non-collapsibility of the Cox model. ML or nonparametric methods, like splines (Gray, 1992; Kooperberg et al., 1995a) and the random survival forest (Ishwaran et al., 2008), offer a good alternative. ML or nonparametric estimators, however, do not have root-$n$ convergence rates, which makes it difficult to conduct inference. We will show that asymptotic normality can be established if we also apply cross-fitting, where the entire sample is first split into $k$ folds, and for each fold, we estimate the nuisance functions using only the out-of-fold sample. Details of the cross-fitted AIPCW estimator $\hat\beta$ are described in Algorithm 1. Heuristically, cross-fitting works by inducing independence between the nuisance parameter estimators and the rest of the quantities in the scores, thereby allowing asymptotic normality to be established (Smucler et al., 2019; Hou et al., 2021).
Algorithm 1: $k$-fold cross-fitted AIPCW estimation of $\beta$

Input: A sample of $n$ observations that are split into $k$ folds of equal size with index sets $\mathcal{I}_1, \mathcal{I}_2, \ldots, \mathcal{I}_k$.

for each fold indexed by $m$ do
    obtain estimated nuisance functions $(\hat S^{(m)}, \hat S_c^{(m)})$ using the out-of-fold sample indexed by $\mathcal{I}_{-m} := \{1, \ldots, n\} \setminus \mathcal{I}_m$.
end for

Output: $\hat\beta$, the solution to
$$\frac{1}{n}\sum_{i=1}^n D_{1i}(\beta, \Lambda_0, t; \hat S^{(m(i))}, \hat S_c^{(m(i))}) = 0, \qquad (21)$$
$$\frac{1}{n}\sum_{i=1}^n D_{2i}(\beta, \Lambda_0; \hat S^{(m(i))}, \hat S_c^{(m(i))}) = 0, \qquad (22)$$
where $m(i)$ maps observation $i$ to the index of the fold it belongs to.
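As a concrete illustration of the cross-fitting step in Algorithm 1, the following R sketch (not the authors' code) splits a hypothetical data frame dat with columns X, Delta, A, Z1, Z2 into $k = 5$ folds and fits the two nuisance functions on each out-of-fold sample, here with working Cox models purely for illustration.

library(survival)

k    <- 5
fold <- sample(rep(1:k, length.out = nrow(dat)))   # fold label m(i) for each observation
nuis <- vector("list", k)
for (m in 1:k) {
  train <- dat[fold != m, ]                        # out-of-fold sample indexed by I_{-m}
  nuis[[m]] <- list(
    S.fit  = coxph(Surv(X, Delta) ~ A + Z1 + Z2, data = train),     # working model for S(t | A, Z)
    Sc.fit = coxph(Surv(X, 1 - Delta) ~ A + Z1 + Z2, data = train)  # working model for S_c(t | A, Z)
  )
}
# Predicted survival curves for the held-out fold m = 1; rows of sf$surv index the event
# times of the training sample and columns index the held-out subjects.
sf <- survfit(nuis[[1]]$S.fit, newdata = dat[fold == 1, ])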
Quantities involving cross-fitting would be slightly different from quantities without cross-fitting, and will involve the estimated nuisance parameters $(\hat S^{(1)}, \hat S_c^{(1)}), \ldots, (\hat S^{(k)}, \hat S_c^{(k)})$. Specifically, solving (21), we have
$$\tilde\Lambda^{cf}_0(\beta, t; \hat S, \hat S_c) = \int_0^t \frac{\frac{1}{n}\sum_{i=1}^n \left[\hat S_c^{(m(i))}(u \mid A_i, Z_i)^{-1}\, dN_i(u) - J_i(u; \hat S^{(m(i))}, \hat S_c^{(m(i))})\, d\hat S^{(m(i))}(u \mid A_i, Z_i)\right]}{S^{(0)}_{cf}(\beta, u; \hat S, \hat S_c)}, \qquad (23)$$
with
$$S^{(l)}_{cf}(\beta, t; \hat S, \hat S_c) = \frac{1}{n}\sum_{i=1}^n A_i^l\, e^{\beta A_i}\left\{\hat S_c^{(m(i))}(t \mid A_i, Z_i)^{-1}\, Y_i(t) + J_i(t; \hat S^{(m(i))}, \hat S_c^{(m(i))})\, \hat S^{(m(i))}(t \mid A_i, Z_i)\right\} \qquad (24)$$
for $l = 0, 1$. Also, $\bar A_{cf}(\beta, t; \hat S, \hat S_c) = S^{(1)}_{cf}(\beta, t; \hat S, \hat S_c)/S^{(0)}_{cf}(\beta, t; \hat S, \hat S_c)$, and after plugging (23) into (22), we have the final cross-fitted AIPCW estimating equation:
$$U_{cf}(\beta; \hat S, \hat S_c) = \frac{1}{n}\sum_{i=1}^n \int_0^\tau \left\{\hat S_c^{(m(i))}(t \mid A_i, Z_i)^{-1}\, dN_i(t) - J_i(t; \hat S^{(m(i))}, \hat S_c^{(m(i))})\, d\hat S^{(m(i))}(t \mid A_i, Z_i)\right\}\left\{A_i - \bar A_{cf}(\beta, t; \hat S, \hat S_c)\right\}. \qquad (25)$$
We solve the cross-fitted AIPCW estimating equation (25) using the Newton-Raphson algorithm.
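Since $\beta$ is a scalar here, the Newton-Raphson step can be written compactly. The following R sketch is illustrative only, with U_cf a hypothetical function returning the cross-fitted score (25) at a given $\beta$ (nuisance estimates already plugged in) and the derivative of the score approximated numerically.

newton_solve <- function(U_cf, beta0 = 0, tol = 1e-8, max.iter = 50, eps = 1e-5) {
  beta <- beta0
  for (it in seq_len(max.iter)) {
    u  <- U_cf(beta)
    du <- (U_cf(beta + eps) - U_cf(beta - eps)) / (2 * eps)  # numerical slope of the score
    beta.new <- beta - u / du                                # Newton-Raphson update
    if (abs(beta.new - beta) < tol) return(beta.new)
    beta <- beta.new
  }
  beta                                                       # last iterate if not converged
}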
3 Asymptotic Properties
We will now describe the asymptotic properties of the proposed cross-fitted AIPCW estimator, when estimated using
a random sample of size n. We first list a few additional assumptions.
Assumption 4. There exist $S^*(t \mid a, z)$ and $S_c^*(t \mid a, z)$ with $S^*(\tau \mid a, z) > c$ and $S_c^*(\tau \mid a, z) > c$ for some $c > 0$, such that
$$\sup_{t \in [0, \tau],\, a \in \{0,1\},\, z \in \mathcal{Z}} |\hat S(t \mid a, z) - S^*(t \mid a, z)| = O_p(a_n),$$
$$\sup_{t \in [0, \tau],\, a \in \{0,1\},\, z \in \mathcal{Z}} |\hat S_c(t \mid a, z) - S_c^*(t \mid a, z)| = O_p(b_n),$$
for some $a_n = o(1)$ and $b_n = o(1)$.

Assumption 5. For the limits $S^*$ and $S_c^*$, there exist a neighbourhood $\mathcal{B}$ of $\beta^o$ and functions $s^{(l)}(\beta, t; S^*, S_c^*)$ for $l = 0, 1$ defined on $\mathcal{B} \times [0, \tau]$ such that $\sup_{t \in [0, \tau],\, \beta \in \mathcal{B}} |S^{(l)}(\beta, t; S^*, S_c^*) - s^{(l)}(\beta, t; S^*, S_c^*)| = o_p(1)$.

Assumption 6. For $l = 0, 1$, $s^{(l)}(\beta, t; S^*, S_c^*)$ are continuous functions of $\beta \in \mathcal{B}$, uniformly in $t \in [0, \tau]$, and are bounded on $\mathcal{B} \times [0, \tau]$; $s^{(0)}(\beta, t; S^*, S_c^*)$ is bounded away from zero on $\mathcal{B} \times [0, \tau]$. For all $\beta \in \mathcal{B}$, $t \in [0, \tau]$:
$$s^{(1)}(\beta, t; S^*, S_c^*) = \frac{\partial}{\partial \beta} s^{(0)}(\beta, t; S^*, S_c^*) = \frac{\partial^2}{\partial \beta^2} s^{(0)}(\beta, t; S^*, S_c^*). \qquad (26)$$
In addition, let $\bar a = s^{(1)}/s^{(0)}$ and $v = \bar a - \bar a^2$. We have $\nu(\beta^o; S^*, S_c^*) = \int_0^\tau v(\beta^o, t; S^*, S_c^*)\, s^{(0)}(\beta^o, t; S^*, S_c^*)\, d\Lambda^o_0(t) > 0$.
Assumption 4 assumes that both $\hat S$ and $\hat S_c$ converge to some limiting functions $S^*$ and $S_c^*$ that are not necessarily the truth. Here, we do not make the root-$n$ convergence assumption for each of $\hat S$ and $\hat S_c$, which often limits us to parametric or semiparametric models. This assumption also implies that $\hat S^{(m)}$ and $\hat S_c^{(m)}$ converge to $S^*$ and $S_c^*$ at the same rate. Assumptions 5 and 6 are similar to regularity assumptions that are typically made under the PH model (Andersen and Gill, 1982).

The asymptotic properties of the cross-fitted AIPCW estimator $\hat\beta$ defined in Algorithm 1 are summarized in Theorems 2 and 3 below.
Theorem 2. Under Assumptions 4-6, if either $S^* = S^o$ or $S_c^* = S^o_c$, then $\hat\beta \xrightarrow{p} \beta^o$.

Theorem 3. Under Assumptions 4-6, if any of the following conditions hold:

(a) (Rate double robustness) $S^* = S^o$, $S_c^* = S^o_c$ and $a_n b_n = o(n^{-1/2})$;

(b) (Model double robustness) $S^* = S^o$ and $a_n = O(n^{-1/2})$. In particular, there exists an influence function $\xi(t, a, z)$ such that $\hat S(t \mid a, z) - S^*(t \mid a, z) = \sum_{j=1}^n \xi_j(t, a, z)/n + o_p(n^{-1/2})$;

(c) (Model double robustness) $S_c^* = S^o_c$ and $b_n = O(n^{-1/2})$. In particular, there exists an influence function $\eta(t, a, z)$ such that $\hat S_c(t \mid a, z) - S_c^*(t \mid a, z) = \sum_{j=1}^n \eta_j(t, a, z)/n + o_p(n^{-1/2})$;

then we have
$$\sqrt{n}(\hat\beta - \beta^o) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \nu(\beta^o; S^*, S_c^*)^{-1}\, \psi_i(\beta^o, \Lambda^o_0, S^*, S_c^*) + o_p(1), \qquad (27)$$
where the expression for $\psi_i(\beta^o, \Lambda^o_0, S^*, S_c^*)$ is provided in Appendix A.
Theorem 3 establishes both the model double robustness and the rate double robustness properties. Traditionally, doubly robust inference is established assuming both working models are parametric or semiparametric. Model double robustness here allows estimation under the possibly wrong model to converge at any rate. The theorem also establishes rate double robustness, which states that if the estimators under both working models converge to the truth and their product rate is faster than root-$n$, the proposed AIPCW estimator is CAN even if one of the nuisance estimators converges arbitrarily slowly. This result permits more flexible ML or nonparametric methods with valid inference.

The asymptotic variance of the proposed estimator is simplified under condition (a). In this case, we provide an estimator of the asymptotic variance, which is given in Theorem 4 below.

Theorem 4. Under Assumptions 4-6, if condition (a) of Theorem 3 holds, i.e. if $S^* = S^o$, $S_c^* = S^o_c$ and $a_n b_n = o_p(n^{-1/2})$, then $\hat\nu^{-2} K/n$ is a consistent estimator for the asymptotic variance of $\hat\beta$, where $\hat\nu$ and $K$ are provided in Appendix A.
When one of the working models is misspecified, the asymptotic variance is rather complicated. In this case, resampling methods such as the bootstrap (Efron, 1979) may be used to estimate the variance since the AIPCW estimator is asymptotically linear.
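A nonparametric bootstrap for this purpose can be sketched as follows in R; aipcw_cf is a hypothetical wrapper, not part of the paper, that takes a data frame and returns the cross-fitted AIPCW estimate of $\beta$.

B <- 200
boot.est <- replicate(B, {
  idx <- sample(nrow(dat), replace = TRUE)   # resample subjects with replacement
  aipcw_cf(dat[idx, ])
})
se.boot <- sd(boot.est)                      # bootstrap standard error of beta-hat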
4 Simulation
In this section, we compare the performance of the cross-fitted AIPCW estimators $\hat\beta$ using different working models, against different IPCW estimators and the MPLE. We consider sample sizes $n = 500$ and $n = 1000$, and 1000 data sets are simulated for each setting, which corresponds to a margin of error of about $\pm 1.35\%$ for the coverage probability of nominal 95% confidence intervals. Five-fold cross-fitting is used.

For data generation, we first follow the diagram in Figure 1(a) and generate $U_1 \sim \text{Unif}(-1, 1)$, $A \sim \text{Bernoulli}(0.5)$, $Z_1 \sim N(0.5U_1, 1)$, $Z_2 \sim N(U_1^2, 0.09)$, and $T = -\log(0.5U_1 + 0.5)\, e^{A}$. Here, $T$ follows the PH model (1) with $\beta^o = -1$ and $\lambda^o_0(t) = 1$.
We consider two scenarios of data generating processes for the censoring time $C$, as described in Figure 1(b). Both scenarios have around 25% of the samples administratively censored at $\tau = 1$, and 40% of the remaining samples censored during follow-up. Note that administrative censoring works in the same way for $T$ and $C$, i.e. those observations are considered 'censored' for both the estimation of $S$ and the estimation of $S_c$. It is obvious that Scenario 1 can be correctly modeled. Scenario 2 is designed such that most commonly used semiparametric models fail. As it turns out, under Scenario 2 $S_c(\tau \mid A, Z)$ can be very close to zero for some values of $A$ and $Z$, leading to possible violation of Assumptions 2 and 3. This echoes the argument made in D'Amour et al. (2021) that the overlap assumption needed for DR estimation often fails in practice.
Figure 1: (a) Variable diagram for $U_1$, $U_2$, $A$, $Z = (Z_1, Z_2)$, $T$ and $C$. (b) Data generating process for $C$:

Scenario 1 (Cox PH): $\lambda_c(t) = \exp(-1 + 2Z_2)$.

Scenario 2 (Mixture): $\log(U_2) \sim N(0, 1)$; if $Z_1 > 0$, $\log(C) = 0.2A - 2\sqrt{|Z_2|} + 0.3U_2$; if $Z_1 \le 0$, $\log(C) = 2.4 - 0.3A + 0.5\sqrt{|Z_1|} + 0.5\sqrt{|Z_2|} - U_2$.
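For concreteness, a minimal R sketch of the Scenario 1 data-generating process, as reconstructed above (not the authors' simulation code), is:

set.seed(1)
n   <- 500
U1  <- runif(n, -1, 1)
A   <- rbinom(n, 1, 0.5)
Z1  <- rnorm(n, mean = 0.5 * U1, sd = 1)
Z2  <- rnorm(n, mean = U1^2, sd = sqrt(0.09))        # variance 0.09
Ttime <- -log(0.5 * U1 + 0.5) * exp(A)               # PH model (1) with beta_o = -1, lambda_0(t) = 1
Ctime <- rexp(n, rate = exp(-1 + 2 * Z2))            # Scenario 1: Cox PH censoring hazard
tau   <- 1
X     <- pmin(Ttime, Ctime, tau)                     # administrative censoring at tau = 1
Delta <- as.numeric(Ttime <= pmin(Ctime, tau))       # failure indicator
dat   <- data.frame(X, Delta, A, Z1, Z2)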
We consider three types of working models: the PH model using the R package 'survival'; splines (Kooperberg et al., 1995a) using the R package 'polspline'; and the random survival forest (RSF) (Ishwaran et al., 2008) using the R package 'randomForestSRC'. We set splitrule = 'bs.gradient' for RSF, while keeping all the other settings at their defaults. We study 7 different combinations of working models for the proposed AIPCW estimator: Cox-Cox, Cox-spline, Cox-RSF, spline-Cox, RSF-Cox, spline-spline, and RSF-RSF, where the first part of the name denotes the model for $S$ and the second part denotes the model for $S_c$. It is worth noting that due to the non-collapsibility of the Cox model, a semiparametric conditional model for $S$ is almost always misspecified. Therefore the consistency of AIPCW-Cox-Cox, AIPCW-Cox-spline and AIPCW-Cox-RSF relies on the correct specification of the censoring model. We also note that the convergence rates of the spline and RSF estimators are largely unknown and depend on the choice of tuning parameters. See the Discussion for more on this.
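As an example of the RSF working model for the censoring distribution, the following R sketch (not the authors' code) uses hypothetical out-of-fold and held-out data frames train and test with columns X, Delta, A, Z1, Z2, and the splitting rule described above.

library(randomForestSRC)

rsf.c <- rfsrc(Surv(X, cens) ~ A + Z1 + Z2,
               data = transform(train, cens = 1 - Delta),   # censoring as the 'event'
               splitrule = "bs.gradient")
pred  <- predict(rsf.c, newdata = test)
# pred$survival: matrix of predicted S_c values, rows = test subjects, columns = pred$time.interest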
We also investigate the performance of the MPLE and various IPCW estimators: IPCW-Cox, IPCW-spline, IPCW-RSF, IPCW-A and IPCW-1. More specifically, IPCW-A estimates $S_c$ using the product-limit estimator within each group indicated by $A$, while IPCW-1 estimates $S_c$ using the product-limit estimator on the entire sample. The robust variance estimator from Boyd et al. (2012) is used to estimate the model-based standard errors of the IPCW estimators. Standard errors for the cross-fitted AIPCW estimators are estimated using Theorem 4, which assumes that both the $S$ and $S_c$ models are correctly specified.

To avoid numerical problems, we impose a minimum on $\hat S^{(m)}(t \mid A, Z)$ and $\hat S_c^{(m)}(t \mid A, Z)$ in the above, so that values below 0.01 are trimmed to 0.01. Finally, as a benchmark, we also fit model (1) to the full data without censoring.
The simulation results for Scenarios 1 and 2 are reported in Tables 1 and 2, respectively. It is immediate that under informative censoring, the MPLE, IPCW-1 and IPCW-A have substantial bias, leading to poor coverage of the confidence intervals (CI). Under Scenario 1, where the censoring model is correctly specified as Cox, the other three IPCW estimators (-Cox, -spline, -RSF) all appear to perform reasonably well. All seven AIPCW estimators also perform well under Scenario 1, with AIPCW-Cox-RSF having larger bias compared to the rest.

Under Scenario 2, IPCW-Cox appears more biased than IPCW-spline and IPCW-RSF, as expected. But even for the latter two estimators, their SE's severely under-estimate the SD's, leading to poor coverage of the CI's. This also points to the known fact that inference is not guaranteed when ML or nonparametric methods are used in IPCW, as discussed earlier. AIPCW-Cox-Cox also has large bias under Scenario 2, as expected. The remaining six AIPCW's are less biased. For the larger sample size $n = 1000$, the AIPCW estimators using two ML or nonparametric methods appear to have the least bias, with close to nominal coverage probabilities. Finally we note that, under Scenario 2, spline-based AIPCW's tend to have larger variance. This might be explained by the fact that splines are less stable near the boundary $\tau$, which under Scenario 2 has small $\hat S_c(\tau \mid A, Z)$ for some values of $A$ and $Z$, as mentioned earlier.
Table 1: Simulation results under Scenario 1. Data are generated following Figures 1(a) and (b) with $\beta^o = -1$. Red indicates that the model or approach is invalid.
Sample Size Estimators Bias SD SE CP
n= 500
AIPCW-Cox-Cox 0.002 0.196 0.191 0.94
AIPCW-Cox-spline -0.001 0.198 0.190 0.94
AIPCW-Cox-RSF 0.023 0.197 0.207 0.96
AIPCW-spline-Cox 0.005 0.185 0.177 0.94
AIPCW-RSF-Cox 0.005 0.189 0.178 0.94
AIPCW-spline-spline 0.002 0.185 0.177 0.94
AIPCW-RSF-RSF 0.002 0.192 0.190 0.95
IPCW-Cox -0.006 0.186 0.179 0.94
IPCW-spline -0.005 0.188 0.179 0.94
IPCW-RSF 0.008 0.190 0.177 0.93
IPCW-A -0.221 0.180 0.162 0.70
IPCW-1 -0.221 0.179 0.162 0.70
MPLE -0.205 0.175 0.167 0.76
Full data 0.002 0.103 0.099 0.93
n= 1000
AIPCW-Cox-Cox -0.008 0.137 0.134 0.94
AIPCW-Cox-spline -0.010 0.138 0.133 0.94
AIPCW-Cox-RSF 0.019 0.141 0.153 0.97
AIPCW-spline-Cox 0.001 0.127 0.123 0.94
AIPCW-RSF-Cox 0.002 0.130 0.125 0.94
AIPCW-spline-spline 0.001 0.127 0.123 0.94
AIPCW-RSF-RSF -0.005 0.134 0.134 0.95
IPCW-Cox -0.009 0.130 0.128 0.94
IPCW-spline -0.007 0.135 0.128 0.94
IPCW-RSF 0.011 0.134 0.128 0.95
IPCW-A -0.225 0.126 0.114 0.51
IPCW-1 -0.224 0.126 0.114 0.51
MPLE -0.207 0.122 0.118 0.58
Full data -0.003 0.069 0.07 0.94
SD: standard deviation; SE: standard error; CP: coverage probability of nominal 95% CI
Table 2: Simulation results under Scenario 2. Data are generated following Figures 1(a) and (b) with $\beta^o = -1$. Red indicates that the model or approach is invalid.
Sample Size Estimators Bias SD SE CP
n= 500
AIPCW-Cox-Cox -0.129 0.285 0.276 0.93
AIPCW-Cox-spline -0.029 0.604 0.623 0.97
AIPCW-Cox-RSF -0.064 0.249 0.243 0.93
AIPCW-spline-Cox -0.068 0.282 0.256 0.93
AIPCW-RSF-Cox -0.034 0.275 0.250 0.93
AIPCW-spline-spline 0.038 0.578 0.585 0.96
AIPCW-RSF-RSF -0.039 0.264 0.238 0.93
IPCW-Cox -0.114 0.266 0.174 0.77
IPCW-spline -0.046 0.452 0.192 0.68
IPCW-RSF -0.088 0.257 0.179 0.80
IPCW-A -0.227 0.184 0.170 0.74
IPCW-1 -0.226 0.183 0.166 0.72
MPLE -0.216 0.179 0.174 0.77
Full data 0.002 0.103 0.099 0.93
n= 1000
AIPCW-Cox-Cox -0.127 0.195 0.192 0.90
AIPCW-Cox-spline -0.056 0.396 0.367 0.95
AIPCW-Cox-RSF -0.035 0.187 0.189 0.95
AIPCW-spline-Cox -0.056 0.191 0.180 0.93
AIPCW-RSF-Cox -0.021 0.185 0.178 0.92
AIPCW-spline-spline 0.008 0.344 0.332 0.95
AIPCW-RSF-RSF -0.020 0.198 0.179 0.93
IPCW-Cox -0.103 0.204 0.126 0.71
IPCW-spline -0.045 0.377 0.146 0.63
IPCW-RSF -0.047 0.202 0.134 0.78
IPCW-A -0.220 0.127 0.120 0.56
IPCW-1 -0.219 0.127 0.117 0.53
MPLE -0.211 0.123 0.123 0.61
Full data -0.003 0.069 0.07 0.94
SD: standard deviation; SE: standard error; CP: coverage probability of nominal 95% CI
5 Discussion
For the analysis of two-group survival, including in randomized clinical trials, non-informative censoring is typically assumed. When the simple PH model (1) is used with no covariates adjusted for, this requires the censoring distribution to be independent of any covariates. When this assumption is violated, the commonly used MPLE is biased, and typically IPCW is used to correct that bias if the interest remains in estimating the marginal hazard ratio between the two groups. IPCW, on the other hand, requires modeling the censoring distribution, which can be wrong unless ML or nonparametric estimates are used. In this paper we have developed an AIPCW estimator that is both model DR and rate DR. Rate double robustness allows us to get around the non-collapsibility of the Cox regression model using more flexible ML or nonparametric methods for the conditional failure time model demanded by the DR construct, because almost any parametric or semiparametric model would otherwise be invalid.
The theoretical results require certain rate conditions on the estimates of the nuisance parameters. These are not always established for a given ML or nonparametric estimator. Cui et al. (2022) and Kooperberg et al. (1995b) demonstrated that under certain conditions, rates better than $n^{-1/4}$ can be achieved for the random survival forest and splines, which would lead to a faster than root-$n$ product rate. Faster than $n^{-1/4}$ rates are also shown to be attainable for other ML methods, for example, regression trees (Wager and Walther, 2015) and neural networks (Chen and White, 1999). The rates, of course, depend on the hyper-parameter values. In the simulations we used the default settings for the spline and the random survival forest. Investigation of other ML or nonparametric methods, as well as their tuning, in relationship to the performance of DR estimators, remains a topic of future work.
This work focused on two-group survival and a binary $A$. Generalization to continuous and/or multivariate $A$ is conceptually straightforward, although different algebra might be involved. In particular, for continuous $A$, we would no longer have $A^2 = A$, and additional quantities like $S^{(2)}$ would need to be introduced.
Finally, the models for $S$ and $S_c$ may include additional and different sets of covariates for these two models, so long as the failure time and the censoring time are independent given the common covariates $Z$.
The R code for the cross-fitted AIPCW estimator, as well as the simulation procedures investigated in this work, is available online at http://github.com/charlesluo1002/DR-Cox.
References
Andersen, P. K. and Gill, R. D. (1982). Cox’s regression model for counting processes: a large sample study.
Ann. Stat. 10: 1100–1120.
Bai, X., Tsiatis, A. A., Lu, W. and Song, R. (2017). Optimal treatment regimes for survival endpoints using a
locally-efficient doubly-robust estimator from a classification perspective. Lifetime Data Anal. 23(4): 585–604.
Belloni, A., Chernozhukov, V. and Hansen, C. (2013). Inference on treatment effects after selection among
high-dimensional controls. The Review of Economic Studies 81(2): 608-650.
Bang, H. and Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models.
Biometrics 61: 692-972.
Bickel, P.J., Klaassen, C.A.J., Ritov, Y. and Wellner, J.A. (1993). Efficient and adaptive estimation for
semiparametric Models. The Johns Hopkins University Press, Baltimore.
Boyd, A.P., Kittelson, J.M. and Gillen, D.L. (2012). Estimation of treatment effect under non-proportional
hazards and conditionally independent censoring. Stat. Med. 31(28): 3504-15.
Breslow, N.E. (1972). Discussion of the paper by D. R. Cox. J. R. Statist. Soc. B. 34: 216–217.
Campigotto, F. and Weller, E. (2014). Impact of informative censoring on the Kaplan-Meier estimate of
progression-free survival in phase II clinical trials. J. Clin. Oncol. 32(27): 3068-3074.
Chen, X. and White, H. (1999). Improved rates and asymptotic normality for nonparametric neural network
estimators. IEEE Trans. Inf. Theory 45(2): 682-691.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J.
(2018). Double/debiased machine learning for treatment and structural parameters. Econom. J. 12: C1-C68.
Cox, D.R. (1972). Regression models and life-tables (with discussion). J. R. Statist. Soc. B. 34: 187–220.
Cox, D.R. (1975). Partial likelihood. Biometrika 62: 269–276.
Cui, Y., Zhu, R., Zhou, M. and Kosorok, M. (2022). Consistency of survival tree and forest models: splitting
bias and correction. Stat. Sin. (preprint).
D'Amour, A., Ding, P., Feller, A., Lei, L. and Sekhon, J. (2021). Overlap in observational studies with high-dimensional covariates. J. Econom. 221: 644-654.
Dukes, O., Martinussen, T., Tchetgen Tchetgen, E. J. and Vansteelandt, S. (2019). On doubly robust
estimation of the hazard difference. Biometrics 75: 100-109.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Ann. Statist. 7: 1-26.
Fleming, T.R. and Harrington, D.P. (1991). Counting processes and survival analysis. Wiley, New York.
Gray, R. J. (1992). Flexible methods for analyzing survival data using splines, with applications to breast cancer
prognosis. J. Am. Stat. Assoc. 87: 942–951.
Hattori, S. and Henmi, M. (2012). Estimation of treatment effects based on possibly misspecified Cox regression.
Lifetime Data Anal. 18(4): 408-33.
Hou, J., Bradic, J. and Xu, R. (2021). Treatment effect estimation under additive hazards models with high-
dimensional confounding. J. Am. Stat. Assoc. 116: early view.
Ishwaran, H., Kogalur, U. B., Blackstone, E. H. and Lauer, M. S. (2008). Random survival forests. Ann. Appl. Stat. 2: 841–860.
Van Lancker, K., Dukes, O. and Vansteelandt, S. (2021). Principled selection of baseline covariates to account
for censoring in randomized trials with a survival endpoint. Stat. Med. 40(18): 4108–4121.
Keiding, N., Holst, C. and Green, A. (1989). Retrospective estimation of diabetes incidence from information
in a current prevalent population and historical mortality. Am. J. Epidemiol. 130: 588-600.
Kooperberg, C., Stone, C. J. and Truong, Y. K. (1995a). Hazard regression. J. Am. Stat. Assoc. 90: 78–94.
Kooperberg, C., Stone, C. J. and Truong, Y. K. (1995b). The L2 rate of convergence for hazard regression. Scand. J. Stat. 22: 143–157.
Koul, H., Susarla, V. and van Ryzin, J. (1981). Regression analysis with randomly right censored data. Ann. Stat. 9: 1276–88.
Lin, D. Y. and Wei, L. J. (1989). The robust inference for the Cox proportional hazards model. J. Am. Stat. Assoc. 84(408): 1074–1078.
Lok, J. J., Yang, S., Sharkey, B. and Hughes, M. D. (2018). Estimation of the cumulative incidence function
under multiple dependent and independent censoring mechanisms. Lifetime Data Anal. 24(2): 201–223.
Lu, X. and Tsiatis, A.A. (2008). Improving the efficiency of the log-rank test using auxiliary covariates. Biometrika
95(3): 679–694.
Lu, W. and Ying, Z. (2004). On semiparametric transformation cure models. Biometrika 91(2): 331–343.
Martinussen, T. and Vansteelandt, S. (2013). On collapsibility and confounding bias in Cox and Aalen regres-
sion models. Lifetime Data Anal. 19(3): 279–296.
Murphy, S.A. (1994). Consistency in a proportional hazards model incorporating a random effect. Ann. Stat. 22: 712-731.
Murphy, S.A. (1995). Asymptotic theory for the frailty model. Ann. Stat. 23: 182-198.
Nguyen, V. Q. and Gillen, D. L. (2017). Censoring-robust estimation in observational survival studies: Assessing
the relative effectiveness of vascular access type on patency among end-stage renal disease patients. Stat. Biosci.
9(2): 406–430.
Nielsen, G., Gill, R. D., Andersen, P.K. and Sørensen, T.I.A. (1992). A counting process approach to maximum likelihood estimation in frailty models. Scand. J. Stat. 19: 25–44.
Nuño, M.M. and Gillen, D. L. (2021). Censoring-robust time-dependent receiver operating characteristic curve estimators. Stat. Med. 40(30): 6885–6899.
Rava, D. (2021). Survival analysis and causal inference: from marginal structural Cox to additive hazards model and beyond. Ph.D. Thesis, University of California, San Diego.
Robins, J.M. (2000). Marginal structural models versus structural nested models as tools for causal inference.
Statistical models in epidemiology, the environment, and clinical trials New York: Springer, pp. 95–133.
Robins, J.M. and Finkelstein, D. M. (2000). Correcting for noncompliance and dependent censoring in an AIDS
clinical trial with inverse probability of censoring weighted (IPCW) log-rank tests. Biometrics 56(3): 779–788.
Robins, J.M., Hernan, M.A. and Brumback, B. (2000). Marginal structural models and causal inference in
epidemiology. Epidemiology 11(5): 550–560.
Robins, J.M. and Rotnitzky, A. (2001). Comment on "Inference for semiparametric models: Some questions and an answer" by Bickel and Kwon. Stat. Sin. 11(4): 920-936.
Robins, J.M. (1993). Information recovery and bias adjustment in proportional hazards regression analysis of randomized trials using surrogate markers. In Proceedings of the Biopharmaceutical Section, American Statistical Association. 24-33.
Robins, J.M., Rotnitzky, A. and van der Laan, M. (2000). Comment on "On profile likelihood" by Murphy and van der Vaart. J. Am. Stat. Assoc. 95: 477–482.
Robins, J.M., Rotnitzky, A. and Zhao, L. P. (1995). Analysis of semiparametric regression models for repeated
outcomes in the presence of missing data. J. Am. Stat. Assoc. 90: 106-121.
Rotnitzky, A., Bergesio, A. and Farall, A. (2009). Analysis of quality-of-life adjusted failure time data in the
presence of competing, possibly informative, censoring mechanisms. Lifetime Data Anal. 15(1): 1-23.
Rotnitzky, A. and Robins, J.M. (2005). Inverse probability weighting in survival analysis. Encyclopedia of Bio-
statistics. Vol 4. 2619-2625. Second Edition. Edited by Peter Armitage and Theodore Colton. New York, Wiley,
2004.
Scharfstein, D. O., Rotnitzky, A. and Robins, J.M. (1999). Rejoinder to "Adjusting for nonignorable dropout using semiparametric nonresponse models". J. Am. Stat. Assoc. 94: 1135–1146.
International Non-Hodgkin’s Lymphoma Prognostic Factors Project (1993). A predictive model for
aggressive non-Hodgkin’s lymphoma. N. Engl. J. Med.329(14): 987-994.
Smucler, E., Rotnitzky, A. and Robins, J.M. (2019). A unifying approach for doubly-robust L1 regularized
estimation of causal contrasts. arXiv preprint arXiv:1904.03737.
Struthers, C. A. and Kalbfleisch, J. D. (1986). Misspecified proportional hazard models. Biometrika 73:
363–369.
Tchetgen Tchetgen, E.J. and Robins, J.M. (2012). On parametrization, robustness and sensitivity analysis in
a marginal structural cox proportional hazards model for point exposure. Statistics and Probability Letters 82:
907-915.
Templeton, A.J., Amir, E. and Tannock, I.F. (2020). Informative censoring a neglected cause of bias in
oncology trials. Nat. Rev. Clin. Oncol. 17: 327-328.
Tsiatis, A. A. (2006). Semiparametric theory and missing data. New York: Springer.
van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge, UK, Cambridge University Press..
van der Laan, M. J. and Robins, J.M.(2003). Unified methods for censored longitudinal data and causality. New
York: Springer.
Wager, S. and Walther, G. (2015). Adaptive concentration of regression trees, with application to random forests.
arXiv preprint arXiv:1503.06388.
Xu, R. (1996). Inference for the proportional hazards model. Ph.D. Thesis, University of California, San Diego.
Xu, R. and Adak, S. (2002). Survival analysis with time-varying regression effects using a tree-based approach.
Biometrics 58(2): 305–315.
Xu, R. and O’Quigley J (2000). Estimating average regression effect under non-proportional hazards. Biostatistics
1: 423–439.
Yang, S., Pieper, K. and Cools, F. (2020). Semiparametric estimation of structural failure time models in
continuous-time processes. Biometrika 107: 123-136.
Zhang, M. and Schaubel, D. E. (2012a). Contrasting treatment-specific survival using double-robust estimators.
Stat. Med. 31: 4255–4268.
Zhang, M. and Schaubel, D. E. (2012b). Double-robust semiparametric estimator for differences in restricted
mean lifetimes in observational studies. Biometrics 68: 999-1009.
Appendix
A Notation and Expressions
First, we list or repeat notation that will be used in the proofs. For $i$ in $1, \ldots, n$, we define
$$M_{ci}(t; S_c) = N_{ci}(t) - \int_0^t Y_i(u)\, d\Lambda_c(u \mid A_i, Z_i),$$
$$J_i(t; S, S_c) = \int_0^t S(u \mid A_i, Z_i)^{-1}\, S_c(u \mid A_i, Z_i)^{-1}\, dM_{ci}(u; S_c),$$
$$dN_i(t; S, S_c) = S_c(t \mid A_i, Z_i)^{-1}\, dN_i(t) - J_i(t; S, S_c)\, dS(t \mid A_i, Z_i),$$
$$\Gamma^{(l)}_i(\beta, t; S, S_c) = A_i^l\, e^{\beta A_i}\{S_c(t \mid A_i, Z_i)^{-1}\, Y_i(t) + J_i(t; S, S_c)\, S(t \mid A_i, Z_i)\},$$
$$S^{(l)}(\beta, t; S, S_c) = \frac{1}{n}\sum_{i=1}^n \Gamma^{(l)}_i(\beta, t; S, S_c),$$
$$dM^{aug}_i(t; \beta, \Lambda_0, S, S_c) = dN_i(t; S, S_c) - \Gamma^{(0)}_i(\beta, t; S, S_c)\, d\Lambda_0(t),$$
$$\bar A(\beta, t; S, S_c) = S^{(1)}(\beta, t; S, S_c)/S^{(0)}(\beta, t; S, S_c),$$
$$U(\beta; S, S_c) = \frac{1}{n}\sum_{i=1}^n \int_0^\tau dN_i(t; S, S_c)\{A_i - \bar A(\beta, t; S, S_c)\},$$
$$V(\beta, t; S, S_c) = d\bar A(\beta, t; S, S_c)/d\beta = \bar A(\beta, t; S, S_c) - \bar A(\beta, t; S, S_c)^2,$$
$$\bar a(\beta, t; S, S_c) = s^{(1)}(\beta, t; S, S_c)/s^{(0)}(\beta, t; S, S_c),$$
$$v(\beta, t; S, S_c) = \bar a(\beta, t; S, S_c) - \bar a(\beta, t; S, S_c)^2,$$
$$\mu(\beta; S, S_c) = \int_0^\tau \{\bar a(\beta^o, t; S, S_c) - \bar a(\beta, t; S, S_c)\}\, s^{(0)}(\beta^o, t; S, S_c)\, d\Lambda^o_0(t),$$
$$\nu(\beta; S, S_c) = \int_0^\tau v(\beta, t; S, S_c)\, s^{(0)}(\beta^o, t; S, S_c)\, d\Lambda^o_0(t).$$
Next are expressions used in the asymptotic results. We stated in Theorem 3 that
$$\sqrt{n}(\hat\beta - \beta^o) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \nu(\beta^o; S^*, S_c^*)^{-1}\, \psi_i(\beta^o, \Lambda^o_0, S^*, S_c^*) + o_p(1). \qquad (28)$$
Here $\psi_i(\beta^o, \Lambda^o_0, S^*, S_c^*) = \psi_{1i} + \psi_{2i} + \psi_{3i}$, where
$$\psi_{1i} = \int_0^\tau \{A_i - \bar a(\beta^o, t; S^*, S_c^*)\}\, dM^{aug}_i(t; \beta^o, \Lambda^o_0, S^*, S_c^*), \qquad (29)$$
$$\psi_{2i} = \frac{k}{n(k-1)} \sum_{j \in \mathcal{I}_{-m(i)}} \int_0^\tau \{\bar a(\beta^o, t; S^*, S_c^*) - A_i\}\, J_i(t; S^*, S_c^*)\{d\xi_j(t, A_i, Z_i) + e^{\beta^o A_i}\, \xi_j(t, A_i, Z_i)\, d\Lambda^o_0(t)\},$$
$$\psi_{3i} = \frac{k}{n(k-1)} \sum_{j \in \mathcal{I}_{-m(i)}} \int_0^\tau \{A_i - \bar a(\beta^o, t; S^*, S_c^*)\}\Bigg(\frac{\eta_j(t, A_i, Z_i)}{S_c^*(t \mid A_i, Z_i)^2}\{dN_i(t) - Y_i(t)\, e^{\beta^o A_i}\, d\Lambda^o_0(t)\}$$
$$\qquad - \int_0^t \Big[\frac{S^*(u \mid A_i, Z_i)}{S_c^*(u \mid A_i, Z_i)^2}\, \eta_j(u, A_i, Z_i)\{dM_{ci}(u; S_c^*) + Y_i(u)\, S_c^*(u \mid A_i, Z_i)^{-1}\, dS_c^*(u \mid A_i, Z_i)\}$$
$$\qquad + S^*(u \mid A_i, Z_i)^{-1}\, S_c^*(u \mid A_i, Z_i)^{-2}\, Y_i(u)\, \eta_j(u, A_i, Z_i)\{dS^*(t \mid A_i, Z_i) + S^*(t \mid A_i, Z_i)\, e^{\beta^o A_i}\, d\Lambda^o_0(t)\}\Big]\Bigg).$$
The long expression above simplifies depending on which models are correctly specified. Under case (a) of Theorem 3, when both $S$ and $S_c$ are correctly specified, $\psi_{2i} = \psi_{3i} = 0$. Under case (b), when $\hat S$ is $\sqrt{n}$-consistent, $\psi_{3i} = 0$. Under case (c), when $\hat S_c$ is $\sqrt{n}$-consistent, $\psi_{2i} = 0$.

In Theorem 4, we stated that $\hat\nu^{-2} K/n$ is a consistent estimator for the asymptotic variance of $\hat\beta$ when both $S^* = S^o$ and $S_c^* = S^o_c$ are correctly specified. The expressions for $K$ and $\hat\nu$ are as follows:
$$\hat\nu = \frac{1}{n}\sum_{i=1}^n \int_0^\tau V(\hat\beta, t; \hat S, \hat S_c)\, dN_i(t; \hat S^{(m(i))}, \hat S_c^{(m(i))}),$$
$$K = \frac{1}{n}\sum_{i=1}^n \tilde\psi_{1i}\big(\hat\beta, \tilde\Lambda_0(\hat\beta, \cdot\,; \hat S, \hat S_c), \hat S^{(m(i))}, \hat S_c^{(m(i))}\big)^2,$$
where
$$\tilde\psi_{1i}(\beta, \Lambda_0, S, S_c) = \int_0^\tau \{A_i - \bar A(\beta, t; S, S_c)\}\, dM^{aug}_i(t; \beta, \Lambda_0, S, S_c).$$
B Proof of Double Robustness
Lemma 1. For any $S_c(t \mid A, Z)$ with its corresponding censoring specific martingale $M_c(t; S_c)$,
$$\int_0^t \frac{dM_c(u; S_c)}{S_c(u \mid A, Z)} = 1 - \frac{Y(t)}{S_c(t \mid A, Z)} - \frac{N(t)}{S_c(X \mid A, Z)}, \qquad (30)$$
where $N(t) = I(X < t, T \le C)$.

Note, this can be seen as a continuous version of Lemma 10.4 in Tsiatis (2006).

Proof

First note that
$$\int_0^t \frac{dN_c(u)}{S_c(u \mid A, Z)} = \frac{N_c(t)}{S_c(X \mid A, Z)}, \qquad (31)$$
where $N_c(t) = I(X < t, T > C)$. Next, since $S_c(u \mid A, Z) = \exp\{-\Lambda_c(u \mid A, Z)\}$,
$$-\int_0^t \frac{Y(u)\, d\Lambda_c(u \mid A, Z)}{S_c(u \mid A, Z)} = I(X \ge t)\int_0^t \frac{dS_c(u \mid A, Z)}{S_c(u \mid A, Z)^2} + I(X < t)\int_0^X \frac{dS_c(u \mid A, Z)}{S_c(u \mid A, Z)^2}$$
$$= I(X \ge t)\{-S_c(u \mid A, Z)^{-1}\}\big|_{u=0}^{u=t} + I(X < t)\{-S_c(u \mid A, Z)^{-1}\}\big|_{u=0}^{u=X}$$
$$= 1 - \frac{Y(t)}{S_c(t \mid A, Z)} - \frac{I(X < t)}{S_c(X \mid A, Z)}. \qquad (32)$$
Since $I(X < t) = N(t) + N_c(t)$, (31) + (32) then gives the lemma.
Proof of Theorem 1

Recall that
$$dM^{aug}(t; \beta, \Lambda_0, S, S_c) = dM^w(t; \beta, \Lambda_0, S_c) - J(t; S, S_c)\{dS(t \mid A, Z) + S(t \mid A, Z)\, e^{\beta A}\, d\Lambda_0(t)\},$$
where $J(t; S, S_c)$ is also included in Appendix A.

a) Assume $S_c = S^o_c$.

We first consider $dM^w(t; \beta^o, \Lambda^o_0, S^o_c)$. For $h(A) = 1$ or $A$,
$$E\{h(A)\, dM^w(t; \beta^o, \Lambda^o_0, S^o_c)\}$$
$$= E\Big(h(A)\, S^o_c(t \mid A, Z)^{-1}\big[dE\{I(T \le t)\, I(C \ge t) \mid T, A, Z\} - E\{I(T \ge t)\, I(C \ge t) \mid T, A, Z\}\, e^{\beta^o A}\, d\Lambda^o_0(t)\big]\Big)$$
$$= E\big[h(A)\, S^o_c(t \mid A, Z)^{-1}\{dI(T \le t)\, P(C \ge t \mid A, Z) - I(T \ge t)\, P(C \ge t \mid A, Z)\, e^{\beta^o A}\, d\Lambda^o_0(t)\}\big]$$
$$= E\{h(A)\, dM^T(t; \beta^o, \Lambda^o_0)\}$$
$$= 0,$$
where the second `=' above uses the informative censoring Assumption 1.

Next we consider $J(t; S, S^o_c)\{dS(t \mid A, Z) + S(t \mid A, Z)\, e^{\beta^o A}\, d\Lambda^o_0(t)\}$. Its expectation being zero follows immediately from the fact that $M_c(t; S^o_c)$ is a martingale.
b) Assume $S = S^o$.

Noting that $Y^T(t)\, N(t) = N(t)\, dN^T(t) = 0$ and $Y(t)\, dN^T(t) = dN(t)$, we multiply (30) by $dM^T(t) = dN^T(t) - Y^T(t)\, e^{\beta^o A}\, d\Lambda_0(t)$, giving:
$$dM^T(t; \beta^o, \Lambda^o_0)\int_0^t \frac{dM_c(u; S_c)}{S_c(u \mid A, Z)}$$
$$= dN^T(t)\int_0^t \frac{dM_c(u; S_c)}{S_c(u \mid A, Z)} - Y^T(t)\, e^{\beta^o A}\, d\Lambda^o_0(t)\int_0^t \frac{dM_c(u; S_c)}{S_c(u \mid A, Z)}$$
$$= dN^T(t) - \frac{dN^T(t)\, Y(t)}{S_c(t \mid A, Z)} - \frac{dN^T(t)\, N(t)}{S_c(X \mid A, Z)} - Y^T(t)\, e^{\beta^o A}\, d\Lambda^o_0(t) + \frac{Y(t)\, e^{\beta^o A}\, d\Lambda^o_0(t)}{S_c(t \mid A, Z)} + \frac{Y^T(t)\, N(t)\, e^{\beta^o A}\, d\Lambda^o_0(t)}{S_c(X \mid A, Z)}$$
$$= dM^T(t) - dM^w(t).$$
Therefore
$$dM^w(t; \beta^o, \Lambda^o_0) = dM^T(t; \beta^o, \Lambda^o_0) - dM^T(t; \beta^o, \Lambda^o_0)\int_0^t \frac{dM_c(u; S_c)}{S_c(u \mid A, Z)}.$$

We note that (11) and (12) hold when $S = S^o$. From (9) we then have
$$E\{dM^{aug}(t; \beta^o, \Lambda^o_0, S^o, S_c)\}$$
$$= E\Big[dM^w(t; \beta^o, \Lambda^o_0, S_c) + \int_0^t E\{dM^T(t; \beta^o, \Lambda^o_0) \mid A, Z, T \ge u\}\, \frac{dM_c(u; S_c)}{S_c(u \mid A, Z)}\Big]$$
$$= E\Big[dM^T(t; \beta^o, \Lambda^o_0) - dM^T(t; \beta^o, \Lambda^o_0)\int_0^t \frac{dM_c(u; S_c)}{S_c(u \mid A, Z)} + \int_0^t E\{dM^T(t; \beta^o, \Lambda^o_0) \mid A, Z, T \ge u\}\, \frac{dM_c(u; S_c)}{S_c(u \mid A, Z)}\Big]$$
$$= E\Big(\int_0^t \big[E\{dM^T(t; \beta^o, \Lambda^o_0) \mid A, Z, T \ge u\} - dM^T(t; \beta^o, \Lambda^o_0)\big]\, \frac{dM_c(u; S_c)}{S_c(u \mid A, Z)}\Big)$$
$$= E\Big(E\Big[\int_0^t \big[E\{dM^T(t; \beta^o, \Lambda^o_0) \mid A, Z, T \ge u\} - dM^T(t; \beta^o, \Lambda^o_0)\big]\, \frac{dN_c(u)}{S_c(u \mid A, Z)} \,\Big|\, A, Z, T \ge u, C = u\Big]\Big)$$
$$\quad - E\Big(E\Big[\int_0^t \big[E\{dM^T(t; \beta^o, \Lambda^o_0) \mid A, Z, T \ge u\} - dM^T(t; \beta^o, \Lambda^o_0)\big]\, \frac{Y(u)\, d\Lambda_c(u)}{S_c(u \mid A, Z)} \,\Big|\, A, Z, T \ge u, C \ge u\Big]\Big)$$
$$= E\Big(\int_0^t \frac{dN_c(u)}{S_c(u \mid A, Z)}\big[E\{dM^T(t; \beta^o, \Lambda^o_0) \mid A, Z, T \ge u, C = u\} - E\{dM^T(t; \beta^o, \Lambda^o_0) \mid A, Z, T \ge u, C = u\}\big]\Big)$$
$$\quad - E\Big(\int_0^t \frac{Y(u)\, d\Lambda_c(u)}{S_c(u \mid A, Z)}\big[E\{dM^T(t; \beta^o, \Lambda^o_0) \mid A, Z, T \ge u, C \ge u\} - E\{dM^T(t; \beta^o, \Lambda^o_0) \mid A, Z, T \ge u, C \ge u\}\big]\Big)$$
$$= 0,$$
where in the third line above $E\{dM^T(t; \beta^o, \Lambda^o_0)\} = 0$ because $M^T(t; \beta^o, \Lambda^o_0)$ is a martingale.

The above also gives
$$E\Big\{\int_0^t A\, dM^{aug}(t; \beta^o, \Lambda^o_0, S^o, S_c)\Big\} = 0.$$