Scandinavian Journal of Statistics, Vol. 39: 554–567, 2012
doi: 10.1111/j.1467-9469.2012.00804.x
©2012 Board of the Foundation of the Scandinavian Journal of Statistics. Published by Wiley Publishing Ltd.
Checking the Short-Term and Long-Term
Hazard Ratio Model for Survival Data
SONG YANG
Office of Biostatistics Research, National Heart, Lung and Blood Institute
YICHUAN ZHAO
Department of Mathematics and Statistics, Georgia State University
ABSTRACT. The short-term and long-term hazard ratio model includes the proportional hazards
model and the proportional odds model as submodels, and allows a wider range of hazard ratio
patterns compared with some of the more traditional models. We propose two omnibus tests for
checking this model, based, respectively, on the martingale residuals and the contrast between the
non-parametric and model-based estimators of the survival function. These tests are shown to be
consistent against any departure from the model. The empirical behaviours of the tests are studied
in simulations, and the tests are illustrated with some real data examples.
Key words: censoring, goodness of fit, martingale residuals, model checking, omnibus test,
survival data
1. Introduction
In clinical trials, life testing and reliability studies, the problem of comparing two groups
of data is often encountered. Usually a summary measure is used to capture the difference
between the two groups, and the baseline distribution is left unspecified. For survival data,
the Cox proportional hazards model (Cox, 1972) is the most widely used and its parameter
has an appealing interpretation as the hazard ratio, or the relative risk, for the two groups.
The proportional hazards model and the derived hazard ratio estimate provide good approxi-
mations in many situations when the hazards of the two groups are nearly proportional.
When the assumption of a constant hazard ratio is in doubt, common alternatives include
the accelerated failure time model (Kalbfleisch & Prentice, 2002) and the proportional odds
model (Bennett, 1983). The proportional hazards model and the proportional odds model
also belong to the class of transformation models (Bickel et al., 1993). These alternative
models have been studied extensively in the literature, in works such as Ying (1993), Cheng
et al. (1995), Murphy et al. (1997), Yang & Prentice (1999) and Chen et al. (2002), among
others. In addition to these more established models, other semiparametric models have
also been proposed (e.g. Tsodikov, 2002; Chen & Cheng, 2006; Zeng & Lin, 2007).
Yang & Prentice (2005) proposed a model which includes the proportional hazards model
and the proportional odds model as submodels. Assume absolutely continuous failure time
distributions and label the two groups control and treatment, with cumulative hazard functions
$\Lambda_C(t)$ and $\Lambda_T(t)$, and hazard rate functions $\lambda_C(t)$ and $\lambda_T(t)$, respectively. The model of
Yang & Prentice (2005) postulates that
$$\lambda_T(t)=\frac{\theta_1\theta_2}{\theta_1+(\theta_2-\theta_1)S_C(t)}\,\lambda_C(t), \qquad (1)$$
almost everywhere for $t<\tau_0$, where $S_C$ is the survivor function of the control group, $\theta_1,\theta_2>0$
and $\tau_0$ is the upper boundary of the support of the control distribution:
$\tau_0=\sup\{x:\int_0^x\lambda_C(t)\,dt<\infty\}$. Under this model, $\theta_1,\theta_2$ can be interpreted as the short-term and long-term
hazard ratio, respectively, and various patterns of the hazard ratio can be realized, such as
proportional hazards, no initial effect, disappearing effect or crossing hazards. The survivor
functions may also cross, a phenomenon not possible under the linear transformation models
and the accelerated failure time model.
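The short-term and long-term interpretation can be checked numerically. The following is a minimal sketch, assuming only the form of model (1); the function name `hazard_ratio` is illustrative and not part of the original article. Evaluating the ratio at control survival probabilities 1 and 0 recovers the two parameters.

```python
def hazard_ratio(s_c, theta1, theta2):
    """Hazard ratio lambda_T / lambda_C implied by model (1).

    s_c is the control-group survival probability S_C(t) at the time of interest.
    """
    return theta1 * theta2 / (theta1 + (theta2 - theta1) * s_c)


# At t = 0 the control survival is 1, so the ratio equals theta1 (short-term).
# As S_C(t) tends to 0 the ratio tends to theta2 (long-term).
for s_c in (1.0, 0.5, 0.0):
    print(s_c, hazard_ratio(s_c, theta1=0.9, theta2=1.2))
```

With equal parameters the ratio is constant in `s_c`, which corresponds to the proportional hazards submodel.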
Note that this model has an asymmetric flavour in that interchange of the control and
treatment groups will not result in a model of the same form. Also, in clinical applications,
heavy censoring is often present and there may be no data available near $\tau_0$. Thus in applications
one is interested in using model (1) for $t$ in $[0,\tau]$, where $\tau<\tau_0$, a range of interest with
adequate data available.
In this article, we propose two omnibus tests for checking model (1), based, respectively,
on the martingale residuals and the contrast between the non-parametric and model-based
estimators of the survival function. In the literature, the martingale residuals have often been
used to detect departures from the assumed model (Therneau et al., 1990; Lin et al., 1993).
Lin et al. (1993) studied various partial-sum processes of the martingale residuals, and used
simulated Gaussian processes to approximate the distributions of those processes under the
proportional hazards regression model. The martingale residual-based test proposed here is
related to the test for the proportional hazards regression model in Lin et al. (1993), but
different in that we will consider integrals of the martingale residuals.
While the martingale residual-based test and related graphical methods have been shown in
the literature to be extremely useful for the Cox proportional hazards regression model, the
partial-sum processes of martingale residuals themselves do not have a simple interpre-
tation. The contrast-based test uses the non-parametric survival function estimate, and has a
very simple and direct interpretation as the difference between the estimated survival proba-
bilities. Under the null hypothesis that model (1) holds, the distribution of both the martin-
gale residual-based test and the contrast-based test can be approximated through simulations.
We will show that both tests are consistent against any departure from model (1). Various
numerical simulations show that the proposed tests have good empirical size and power. The
proposed procedures will be illustrated in real data examples.
We organize the article as follows: In section 2, distributional results for the stochastic pro-
cesses used for the tests are established. Then the p-values of the tests are studied and the
consistency of the tests against departure from the model is established. In section 3, simula-
tion results and data examples are presented. Some concluding remarks are given in section 4.
Proofs of the asymptotic results are placed in appendix S1.
2. Distributional results
Let $T_1,\ldots,T_n$ be the pooled lifetimes of the two groups, and $C_1,\ldots,C_n$ be the corresponding
censoring variables. Let the sample sizes of the two groups be $n_1$ and $n_2$, respectively, and
arrange the indices such that $T_1,\ldots,T_{n_1}$, $n_1<n$, constitute the control group. Let $Z_i=I(i>n_1)$,
$i=1,\ldots,n$, where $I(\cdot)$ is the indicator function. We assume that $T_1,\ldots,T_n,C_1,\ldots,C_n$ are independent.
The available data are the triplets $(X_i,\delta_i,Z_i)$, $i=1,\ldots,n$, where $X_i=T_i\wedge C_i$ and
$\delta_i=I(T_i\le C_i)$, with $\wedge$ denoting the minimum.
To simplify the presentation, throughout this article, we will assume the following condi-
tions.
Condition 1. $\lim n_1/n=\rho\in(0,1)$.
Condition 2. The data range of interest is $[0,\tau]$, where $\tau<\tau_0$. The survivor function $G_i$ of
the censoring variable $C_i$ given $Z_i$ is continuous and satisfies $\sum_{i\le n_1}G_i(t)/n_1\to\Gamma_C(t)$,
$\sum_{i>n_1}G_i(t)/n_2\to\Gamma_T(t)$, uniformly in $t\le\tau$, for some functions $\Gamma_C(t)$, $\Gamma_T(t)$ satisfying
$\Gamma_C(\tau)\Gamma_T(\tau)>0$.
Condition 3. The survivor functions $S_C$ and $S_T$ of the two comparison groups are absolutely
continuous, and $S_C(\tau)S_T(\tau)>0$.
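Define $R(t)=1/S_C(t)-1$, $t\le\tau$, the odds function of the control group. Let $\beta=(\beta_1,\beta_2)^T$,
where ‘T’ denotes transpose and $\beta_1=\log\theta_1$, $\beta_2=\log\theta_2$. With this parameterization, we consider
the model
$$\lambda_i(t)=\frac{1}{\gamma_{1i}(\beta)+\gamma_{2i}(\beta)R(t)}\,\frac{dR(t)}{dt},\quad i=1,\ldots,n, \qquad (2)$$
almost everywhere on $t\in[0,\tau]$, where $\lambda_i(t)$ is the hazard function for $T_i$ given $Z_i$, and
$\gamma_{ji}(b)=\exp(-b_jZ_i)$, $j=1,2$, for $b=(b_1,b_2)^T$.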
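As a check on the reparameterization, the short derivation below sketches why the odds-function form (2) agrees with the hazard ratio form (1) for a treatment-group subject. It uses only the definitions above; the symbol $\gamma_j=e^{-\beta_j}=1/\theta_j$ for the exponential terms with $Z_i=1$ is notation assumed here for the sketch.

```latex
\begin{align*}
\frac{dR(t)}{dt} &= \frac{d}{dt}\Bigl\{\frac{1}{S_C(t)}-1\Bigr\}
                  = \frac{\lambda_C(t)}{S_C(t)}, \\
\lambda_T(t) &= \frac{1}{\gamma_1+\gamma_2 R(t)}\cdot\frac{\lambda_C(t)}{S_C(t)}
              = \frac{\lambda_C(t)}{S_C(t)/\theta_1+\{1-S_C(t)\}/\theta_2} \\
             &= \frac{\theta_1\theta_2}{\theta_2 S_C(t)+\theta_1\{1-S_C(t)\}}\,\lambda_C(t)
              = \frac{\theta_1\theta_2}{\theta_1+(\theta_2-\theta_1)S_C(t)}\,\lambda_C(t).
\end{align*}
```

For a control subject ($Z_i=0$) both exponential terms equal one, and the same computation returns $\lambda_C(t)$, as it should.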
To develop the tests for model (2), we need an estimator for $\beta$. Although it is possible to
adopt the pseudo-likelihood approach in Yang & Prentice (2005), here we will use a simpler
estimator. If $R$ were known, the log-likelihood function of model (2) would be proportional to
$$-\sum_{i=1}^{n}\Bigl[\delta_i\ln\{\gamma_{1i}(b)+\gamma_{2i}(b)R(X_i)\}+\ln\{1+\gamma_{2i}(b)R(X_i)/\gamma_{1i}(b)\}/\gamma_{2i}(b)\Bigr].$$
This motivates the following estimator for $\beta$. Let $\hat S_C$ be the Kaplan–Meier estimator of $S_C$
and $\hat R=1/\hat S_C-1$ (Kaplan & Meier, 1958). Throughout this article, $c/d$ is defined to be zero
when $d=0$. Let $\gamma_j(b)=\exp(-b_j)$, $j=1,2$. For $\tau$ satisfying conditions 2 and 3, define
$$p(b)=\sum_{i>n_1}\Bigl[I(X_i\le\tau)\,\delta_i\ln\{\gamma_1(b)+\gamma_2(b)\hat R(X_i)\}+\ln\{1+\gamma_2(b)\hat R(X_i\wedge\tau)/\gamma_1(b)\}/\gamma_2(b)\Bigr].$$
Then we will use the minimizer $\hat\beta$ of $p(b)$ to estimate $\beta$. Equivalently, $\hat\beta$ is the zero of the
gradient $\nabla p(b)$ of $p(b)$. Define
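$$A(b)=\int_0^\tau\ln\{\gamma_1(b)+\gamma_2(b)R(t)\}\,\Gamma_T(t)S_T(t)\,d\Lambda_T(t)+\int_0^\tau\frac{\{1+R(t)\}\Gamma_T(t)S_T(t)}{\gamma_1(b)+\gamma_2(b)R(t)}\,d\Lambda_C(t).$$
It can be checked that, under model (2), $\nabla A(b)$ is zero at $\beta$. We will also assume the
following condition.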
Condition 4. The function A(b) has a unique minimum that occurs at the unique zero of
∇A.
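The estimation step can be sketched in code. The following is a minimal pure-Python illustration, not the authors' implementation: it builds the Kaplan–Meier based odds estimate from the control group and evaluates the objective p(b) on the treatment group. The function names `km_odds` and `p_objective` are hypothetical, and the sketch assumes the Kaplan–Meier estimate stays positive on the range where it is evaluated (as condition 3 requires).

```python
import math


def km_odds(times, events):
    """Return t -> R_hat(t) = 1/S_hat(t) - 1, with S_hat the Kaplan-Meier
    estimate computed from control-group times and event indicators (0/1)."""
    data = sorted(zip(times, events))
    n = len(data)
    steps = []  # (event time, Kaplan-Meier value just after that time)
    surv = 1.0
    i = 0
    while i < n:
        t = data[i][0]
        at_risk = n - i          # subjects with observed time >= t
        deaths = 0
        j = i
        while j < n and data[j][0] == t:
            deaths += data[j][1]
            j += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            steps.append((t, surv))
        i = j

    def r_hat(t):
        s_t = 1.0
        for u, s_u in steps:
            if u <= t:
                s_t = s_u
            else:
                break
        if s_t <= 0.0:
            raise ValueError("Kaplan-Meier estimate is zero at t; choose a smaller tau")
        return 1.0 / s_t - 1.0

    return r_hat


def p_objective(b, x_trt, d_trt, r_hat, tau):
    """Objective p(b) summed over treatment-group observations (x_trt, d_trt)."""
    g1, g2 = math.exp(-b[0]), math.exp(-b[1])
    total = 0.0
    for x, d in zip(x_trt, d_trt):
        if d and x <= tau:
            total += math.log(g1 + g2 * r_hat(x))
        total += math.log(1.0 + g2 * r_hat(min(x, tau)) / g1) / g2
    return total
```

In practice `p_objective` would be passed to a numerical minimizer over b; the choice of optimizer is left open here.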
Let $M_i(t)=\delta_iI(X_i\le t)-\int_0^tI(X_i\ge s)/\{\gamma_{1i}(\beta)+\gamma_{2i}(\beta)R(s)\}\,dR(s)$, the martingale associated
with the $i$th study subject, $1\le i\le n$, where throughout the article we use the notation
$\int_l^u=\int_{(l,u]}$. Now define the martingale residuals
$$\hat M_i(t)=\delta_iI(X_i\le t)-\int_0^t\frac{I(X_i\ge s)\,d\hat R(s)}{\gamma_{1i}(\hat\beta)+\gamma_{2i}(\hat\beta)\hat R(s)},\quad 1\le i\le n.$$
Note that $\hat M_i(t)$ does not involve $\hat\beta$ for $i\le n_1$. It will be shown later that, under model
(2) and conditions 1–4, $\hat\beta$ is strongly consistent for $\beta$. Thus, $\sum_{i>n_1}\hat M_i(t)$ will be close to
$\sum_{i>n_1}M_i(t)$, and will fluctuate around zero under model (2). This leads us to define a martingale
residual-based test that rejects model (2) if $\sup_{t\le\tau}\bigl|\sum_{i>n_1}\int_0^t\phi(s)\,d\hat M_i(s)\bigr|/\sqrt n$ is large,
where $\phi$ is a data-dependent function. The choice of $\phi$ will be discussed later.
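The asymptotic distribution of the test involves that of $\hat\beta$. To describe these asymptotic
distributions, let
$$\bar M_1(t)=\frac{1}{\sqrt n}\sum_{i\le n_1}M_i(t),\qquad\bar M_2(t)=\frac{1}{\sqrt n}\sum_{i>n_1}M_i(t),$$
$$K_1(t)=\sum_{i\le n_1}I(X_i\ge t),\qquad K_2(t)=\sum_{i>n_1}I(X_i\ge t),$$
$$N_1(t)=\sum_{i\le n_1}\delta_iI(X_i\le t),\qquad N_2(t)=\sum_{i>n_1}\delta_iI(X_i\le t).$$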
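Define $D(g,b)=\gamma_1(b)+\gamma_2(b)g$ for $g>0$ and let
$$W(g,b)=\left(\frac{\gamma_1(b)}{D(g,b)},\frac{\gamma_2(b)g}{D(g,b)}\right)^T,\qquad\Omega(b)=-\{H(\tilde p/n,b)\}^{-1},$$
where $H(f,b)$ is the Hessian matrix of the function $f$ at $b$ and
$$\tilde p(b)=\int_0^\tau\ln\{\gamma_1(b)+\gamma_2(b)\hat R(t)\}\,dN_2(t)+\int_0^\tau\frac{K_2(t)\,d\hat R(t)}{\gamma_1(b)+\gamma_2(b)\hat R(t)}.$$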
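Define
$$W_T(t)=-W(\hat R(t),\beta),$$
$$W_C(t)=-\frac{K_2(t)W_T(t)}{K_1(t)D(R(t),\beta)\hat S_C(t)}+\frac{1}{K_1(t)}\int_t^\tau\frac{K_2(s)W_T(s)}{D(R(s),\beta)}\left\{\frac{\gamma_2(\beta)}{D(\hat R(s),\beta)\hat S_C(s)}-1\right\}d\hat R(s). \qquad (3)$$
Let $\hat W_T$ be the estimator of $W_T$ defined by replacing $\beta$ and $R$ with $\hat\beta$ and $\hat R$, respectively.
Similarly, define an estimator $\hat W_C$ of $W_C$. We first establish the following results.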
Theorem 1. Suppose that conditions 1–4 hold. Then, under model (2),
(i) $\hat\beta$ is strongly consistent for $\beta$;
(ii) $\sqrt n(\hat\beta-\beta)$ has a limiting zero mean normal distribution. A strongly consistent estimator
of the limiting covariance matrix is given by
$$\Omega(\hat\beta)\left\{\int_0^\tau\hat W_T\hat W_T^T\,dN_2/n+\int_0^\tau\hat W_C\hat W_C^T\,dN_1/n\right\}\Omega(\hat\beta).$$
For the integrand in the martingale residual-based test, while many choices are possible,
based on a trial-and-error process of simulation studies and real data applications, we will
work with the choice $\phi(t)=\phi_1(t)\phi_2(t)$, where
$$\phi_1(t)=\frac{K_1(t)}{K_1(t)+K_2(t)}\quad\text{and}\quad\phi_2(t)=1+4\,\frac{K_1(t)+K_2(t)}{n}\left\{1-\frac{K_1(t)+K_2(t)}{n}\right\}.$$
The factor $\phi_1(t)$ is used to help stabilize the integral near the upper tail. The function $\phi_2(t)$
assigns weights between 1 and 2 to all data points, with more weight in the central region
and less weight towards the boundaries of the data range.
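The two weight factors depend on the data only through the at-risk counts of the two groups. A minimal sketch, assuming the pooled observed times are stored with the control group first (the function name `weights` and this data layout are illustrative assumptions):

```python
def weights(t, x, n1):
    """Return the two weight factors at time t.

    x  : pooled observed times, with the first n1 entries from the control group
    n1 : control-group sample size
    """
    n = len(x)
    k1 = sum(1 for xi in x[:n1] if xi >= t)   # control subjects at risk at t
    k2 = sum(1 for xi in x[n1:] if xi >= t)   # treatment subjects at risk at t
    frac = (k1 + k2) / n                      # pooled at-risk proportion
    factor1 = k1 / (k1 + k2) if k1 + k2 > 0 else 0.0  # c/d := 0 when d = 0
    factor2 = 1.0 + 4.0 * frac * (1.0 - frac)         # always in [1, 2]
    return factor1, factor2
```

The second factor equals 1 when everyone or no one is still at risk and peaks at 2 when half of the pooled sample is at risk, which is the sense in which the central region receives more weight.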
Let
$$U_n(t)=\int_0^t\phi(s)\,d\bar M_2(s)-\int_0^t\mu(s)\,d\bar M_1(s)-\int_0^t\nu(s)^T\,d\hat R(s)\,\Omega(\beta)Q, \qquad (4)$$
where
$$\mu(s)=\frac{\phi(s)K_2(s)}{K_1(s)D(R(s),\beta)\hat S_C(s)}-\frac{1}{K_1(s)}\int_s^t\frac{\phi(x)K_2(x)}{D(R(x),\beta)}\left\{\frac{\gamma_2(\hat\beta)}{D(\hat R(x),\hat\beta)\hat S_C(x)}-1\right\}d\hat R(x)$$
and
$$\nu(s)=\frac{\phi(s)K_2(s)}{nD(\hat R(s),\hat\beta)}\,W(R(s),\beta).$$
Let $\hat\mu$ be the estimator of $\mu$ defined by replacing $\beta$ and $R$ with $\hat\beta$ and $\hat R$, respectively, and
define the estimator $\hat\nu$ of $\nu$ analogously. For the asymptotic distribution of the martingale
residual-based test statistic, we have the following result.
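Theorem 2. Suppose that conditions 1–4 hold. Then, under model (2), the process
$\sum_{i>n_1}\int_0^t\phi(s)\,d\hat M_i(s)/\sqrt n$, $0\le t\le\tau$, is asymptotically equivalent to the process $U_n(t)$, $0\le t\le\tau$, defined in (4),
which converges weakly to a zero mean Gaussian process. A strongly consistent estimator of the
covariance process of the limiting Gaussian process is given by
$$\begin{aligned}
\hat\sigma_1(s,t)={}&\int_0^s\phi^2(x)\,dN_2(x)/n+\int_0^s\hat\mu^2(x)\,dN_1(x)/n\\
&+\int_0^s\hat\nu(x)^T\,d\hat R(x)\,\Omega(\hat\beta)\left\{\int_0^\tau\hat W_T\hat W_T^T\,dN_2/n+\int_0^\tau\hat W_C\hat W_C^T\,dN_1/n\right\}\Omega(\hat\beta)\int_0^t\hat\nu(x)\,d\hat R(x)\\
&-\int_0^s\hat\nu(x)^T\,d\hat R(x)\,\Omega(\hat\beta)\left\{\int_0^t\phi\hat W_T\,dN_2/n-\int_0^t\hat\mu\hat W_C\,dN_1/n\right\}\\
&-\int_0^t\hat\nu(x)^T\,d\hat R(x)\,\Omega(\hat\beta)\left\{\int_0^s\phi\hat W_T\,dN_2/n-\int_0^s\hat\mu\hat W_C\,dN_1/n\right\},\qquad 0\le s\le t\le\tau.
\end{aligned}$$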
In the literature, the martingale residuals have been very useful for checking the Cox proportional
hazards regression model. However, the partial-sum processes of martingale residuals themselves
do not have a simple interpretation. As an alternative to the martingale residual-based test,
we can also obtain a test by contrasting the non-parametric and model-based estimators for
the survivor function $S_T$ of the treatment group. Define $\hat S_T=\exp(-\hat\Lambda_T)$, where $\hat\Lambda_T$ is the
Nelson–Aalen estimator for the cumulative hazard function $\Lambda_T$ (Nelson, 1969; Aalen, 1975).
Note that under model (2), integrating the hazard gives $S_T(t)=\{1+\gamma_2(\beta)R(t)/\gamma_1(\beta)\}^{-1/\gamma_2(\beta)}$, which equals one at $R(t)=0$. From this we can define
$$\tilde S_T(t)=\{1+\gamma_2(\hat\beta)\hat R(t)/\gamma_1(\hat\beta)\}^{-1/\gamma_2(\hat\beta)},$$
to be a model-based estimator for $S_T$. Intuitively, model (2) holds if the model-based survival
estimator is close to the non-parametric estimator. This leads us to define a contrast-based
test using $\hat S_T(t)-\tilde S_T(t)$, the difference between the estimated survival probabilities. Note that
the Kaplan–Meier estimator could also be used in the contrast-based test. In various simulations,
$\hat S_T$ results in a better performance for small samples, hence it is used in defining the
test instead of the Kaplan–Meier estimator.
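The two ingredients of the contrast can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, and the model-based curve is written in the normalised form obtained by integrating the hazard in model (2), so that it equals 1 when the control odds are 0.

```python
import math


def nelson_aalen_survival(times, events, t):
    """exp(-Nelson-Aalen cumulative hazard) at time t for one group."""
    event_times = sorted({x for x, d in zip(times, events) if d and x <= t})
    cum_hazard = 0.0
    for u in event_times:
        at_risk = sum(1 for x in times if x >= u)
        deaths = sum(1 for x, d in zip(times, events) if d and x == u)
        cum_hazard += deaths / at_risk
    return math.exp(-cum_hazard)


def model_survival(r, b1, b2):
    """Model-based treatment survival at control odds r, for parameters (b1, b2)."""
    g1, g2 = math.exp(-b1), math.exp(-b2)
    return (1.0 + g2 * r / g1) ** (-1.0 / g2)


def contrast(times_trt, events_trt, r_hat_t, b_hat, t):
    """Difference between the non-parametric and model-based survival at t.

    r_hat_t is the estimated control odds at t (assumed supplied by the caller).
    """
    return nelson_aalen_survival(times_trt, events_trt, t) - model_survival(r_hat_t, *b_hat)
```

When both parameters are zero the model-based curve reduces to 1/(1 + r), that is, to the control survival itself, which is a convenient sanity check.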
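Let
$$V_n(t)=-\hat S_T(t)\int_0^t\frac{n\,d\bar M_2(s)}{K_2(s)}+\psi(t)\int_0^t\frac{n\,d\bar M_1(s)}{K_1(s)}+\tilde S_T(t)B(t)^T\Omega(\hat\beta)Q, \qquad (5)$$
where
$$\psi(t)=\frac{\tilde S_T(t)}{D(\hat R(t),\hat\beta)\hat S_C(t)},\qquad B(t)=\left(\frac{\hat R(t)}{D(\hat R(t),\hat\beta)},\;-\ln(\tilde S_T(t))-\frac{\hat R(t)}{D(\hat R(t),\hat\beta)}\right)^T.$$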
The following result establishes the weak convergence of $\sqrt n\{\hat S_T(t)-\tilde S_T(t)\}$ on $t\in[0,\tau]$.
Theorem 3. Suppose that conditions 1–4 are satisfied. Then, under model (2), the process
$\sqrt n\{\hat S_T(t)-\tilde S_T(t)\}$, $0\le t\le\tau$, is asymptotically equivalent to the process $V_n$, $0\le t\le\tau$, defined in
(5), which converges weakly to a zero mean Gaussian process. A strongly consistent estimator
of the covariance process of the limiting Gaussian process is given by
$$\begin{aligned}
\hat\sigma_2(s,t)={}&\tilde S_T(s)\tilde S_T(t)B(s)^T\Omega(\hat\beta)\left\{\int_0^\tau\hat W_T(x)\hat W_T^T(x)\,dN_2(x)/n+\int_0^\tau\hat W_C(x)\hat W_C^T(x)\,dN_1(x)/n\right\}\Omega(\hat\beta)B(t)\\
&+\hat S_T(s)\hat S_T(t)\int_0^s\frac{n}{K_2^2(x)}\,dN_2(x)+\psi(s)\psi(t)\int_0^s\frac{n}{K_1^2(x)}\,dN_1(x)\\
&-\hat S_T(t)\tilde S_T(s)B(s)^T\Omega(\hat\beta)\int_0^t\frac{\hat W_T(x)}{K_2(x)}\,dN_2(x)-\hat S_T(s)\tilde S_T(t)B(t)^T\Omega(\hat\beta)\int_0^s\frac{\hat W_T(x)}{K_2(x)}\,dN_2(x)\\
&+\psi(t)\tilde S_T(s)B(s)^T\Omega(\hat\beta)\int_0^t\frac{\hat W_C(x)}{K_1(x)}\,dN_1(x)\\
&+\psi(s)\tilde S_T(t)B(t)^T\Omega(\hat\beta)\int_0^s\frac{\hat W_C(x)}{K_1(x)}\,dN_1(x),\qquad 0\le s\le t\le\tau. \qquad (6)
\end{aligned}$$
With the asymptotic results above, we define a contrast-based test that rejects model (2)
if $\sqrt n\,\sup_{t\le\tau}\phi_2(t)|\hat S_T(t)-\tilde S_T(t)|/\hat\sigma_2(t,t)$ is large. Note that we use the standardized process
in defining the test. Furthermore, the weight function $\phi_2(t)$ is used to moderate the influence
of data points near the boundaries of the data range. Use of the standardized process and
the weight function is based on various numerical studies that indicate better performance
of this setup.
The p-values of the tests are difficult to obtain analytically. The bootstrap method provides
a practical alternative. It is, however, very time-consuming. The normal resampling approxi-
mation method of Lin et al. (1993) reduces computing time significantly, and has become a
standard method. We will modify the method for our problem here.
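Let $\varepsilon_i$, $i=1,\ldots,n$, be independent variables that are also independent from the data. For
$t\le\tau$, with $N_i(t)=\delta_iI(X_i\le t)$, define the process
$$\begin{aligned}
\hat U(t)&=\frac{1}{\sqrt n}\Biggl[\sum_{i>n_1}\int_0^t\phi\,d(\varepsilon_iN_i)-\sum_{i\le n_1}\int_0^t\hat\mu\,d(\varepsilon_iN_i)\\
&\qquad-\int_0^t\hat\nu^T\,d\hat R\,\Omega(\hat\beta)\Biggl(\sum_{i>n_1}\int_0^\tau\hat W_T\,d(\varepsilon_iN_i)+\sum_{i\le n_1}\int_0^\tau\hat W_C\,d(\varepsilon_iN_i)\Biggr)\Biggr]\\
&=\frac{1}{\sqrt n}\Biggl[\sum_{i>n_1}\phi(X_i)\varepsilon_i\delta_iI(X_i\le t)-\sum_{i\le n_1}\varepsilon_i\delta_i\hat\mu(X_i)I(X_i\le t)\\
&\qquad-\int_0^t\hat\nu^T\,d\hat R\,\Omega(\hat\beta)\Biggl\{\sum_{i\le n_1}\varepsilon_i\delta_i\hat W_C(X_i)I(X_i\le\tau)+\sum_{i>n_1}\varepsilon_i\delta_i\hat W_T(X_i)I(X_i\le\tau)\Biggr\}\Biggr], \qquad (7)
\end{aligned}$$
and
$$\begin{aligned}
\hat V(t)&=\frac{1}{\sqrt n}\Biggl[-\hat S_T(t)\sum_{i>n_1}\varepsilon_i\delta_i\frac{n}{K_2(X_i)}I(X_i\le t)+\psi(t)\sum_{i\le n_1}\varepsilon_i\delta_i\frac{n}{K_1(X_i)}I(X_i\le t)\\
&\qquad+\tilde S_T(t)B(t)^T\Omega(\hat\beta)\Biggl\{\sum_{i\le n_1}\varepsilon_i\delta_i\hat W_C(X_i)I(X_i\le\tau)+\sum_{i>n_1}\varepsilon_i\delta_i\hat W_T(X_i)I(X_i\le\tau)\Biggr\}\Biggr]. \qquad (8)
\end{aligned}$$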
In the approach of Lin et al. (1993), the $\varepsilon_i$ values are standard normal variables. The
standard normal variables sometimes result in inflated empirical size in various simulation
studies for our problem here. Thus we need to make some adjustment. Specifically we will
choose $\varepsilon_i$, $i=1,\ldots,n$, to be independent normal variables that are independent of the data,
with mean zero and variance $c_n^2$ such that $\sup_nc_n<\infty$ and $c_n\to1$. Conditional on the observed
data $(X_i,\delta_i,Z_i)$, $i=1,\ldots,n$, the processes $\hat U$ and $\hat V$ are sums of $n$ independent Gaussian processes.
It can be shown that, given the data, the processes $\hat U$ and $\hat V$ converge weakly and
have the same limiting process as that of $U_n$ and $V_n$, respectively.
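Let $r_o$, $c_o$ be the observed values of
$$\sup_{t\in[0,\tau]}\Bigl|\sum_{i>n_1}\int_0^t\phi(s)\,d\hat M_i(s)\Bigr|\Big/\sqrt n\quad\text{and}\quad\sup_{t\in[0,\tau]}\sqrt n\{\phi_2(t)|\hat S_T(t)-\tilde S_T(t)|/\hat\sigma_2(t,t)\},$$
respectively. The p-values
$$P\Bigl[\sup_{t\in[0,\tau]}\Bigl|\sum_{i>n_1}\int_0^t\phi(s)\,d\hat M_i(s)\Bigr|\Big/\sqrt n>r_o\Bigr],\qquad P\Bigl[\sup_{t\in[0,\tau]}\sqrt n\{\phi_2(t)|\hat S_T(t)-\tilde S_T(t)|/\hat\sigma_2(t,t)\}>c_o\Bigr]$$
can be approximated by
$$P\Bigl[\sup_{t\in[0,\tau]}|\hat U(t)|>r_o\Bigr],\qquad P\Bigl[\sup_{t\in[0,\tau]}\{\phi_2(t)|\hat V(t)|/\hat\sigma_2(t,t)\}>c_o\Bigr],$$
respectively, which in turn can be approximated by simulating the conditional distributions
given the data a large number of times. We have the following result for the consistency of the
tests.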
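The resampling step can be sketched generically. Because the simulated processes in (7) and (8) are linear in the multipliers, each can be written as a sum over subjects of a multiplier times a fixed, data-dependent contribution evaluated on a time grid. The sketch below assumes those per-subject contributions have already been computed (including the 1/√n factor and any standardization) and stored in a hypothetical array `contributions`; it is an illustration of the normal resampling idea, not the authors' Matlab code.

```python
import random


def resampling_p_value(observed_sup, contributions, n_sim=1000, c_n=1.0, seed=0):
    """Approximate P[sup_t |sum_i eps_i * a_i(t)| > observed_sup].

    contributions[i][k] : subject i's fixed contribution a_i(t_k) at grid time t_k
    c_n                 : standard deviation of the normal multipliers eps_i
    """
    rng = random.Random(seed)
    n = len(contributions)
    m = len(contributions[0])
    exceed = 0
    for _ in range(n_sim):
        eps = [rng.gauss(0.0, c_n) for _ in range(n)]
        sup = max(
            abs(sum(eps[i] * contributions[i][k] for i in range(n)))
            for k in range(m)
        )
        if sup > observed_sup:
            exceed += 1
    return exceed / n_sim
```

Only the multipliers are redrawn in each replication; the data-dependent contributions are computed once, which is why this approach is much cheaper than a full bootstrap.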
Theorem 4. Suppose conditions 1–4 are satisfied. Then the martingale residual-based test is
consistent against a general departure from model (2) on $t\in[0,\tau]$. The contrast-based test with
the standardized process is consistent against a general departure from model (2) on $t\in[0,\tau]$,
except for the degenerate case where $\sigma_2(t,t)$ is zero for some $t\in[0,\tau]$.
3. Simulation studies and examples
3.1. Simulation studies
To fine-tune the tests and to evaluate their performance, we have conducted various simulation
studies. As mentioned before, because of better performance for small samples, $\hat S_T$ is
used in defining the contrast-based test instead of the Kaplan–Meier estimator. Note that the
zero of $\nabla\tilde p$ provides an alternative estimator for $\beta$. It was found that the estimator $\hat\beta$ generally
results in a better performance than this alternative estimator. For the $\varepsilon_i$ variables in (7),
we can take $c_n\equiv1$. That is, no modification of the original normal approximation method is
needed. For the $\varepsilon_i$ variables in (8), the choice $c_n=1+1/\sqrt n$ works well.
Regarding the choice of $\tau$, we found that in computing $\hat\beta$, we can take $\tau=\max_iX_i$ to include
all data. After $\hat\beta$ is obtained, in simulating the conditional distributions of $\hat U$, $\hat V$, we can
take $\tau=(\max_iX_iZ_i)\wedge(\max_iX_i(1-Z_i))$. Note that our presentation and proofs work with the
situation where all processes are restricted to $[0,\tau]$ for $\tau$ in condition 2.
Next, we report the results from some representative simulation studies. All numerical com-
putations were done in Matlab. To evaluate the empirical size of the tests, we first generate
data under the model of Yang & Prentice (2005). Lifetime variables were generated with $R(t)$
being the identity function. The values of $\beta$ were $(\log(0.9),\log(1.2))^T$ and $(\log(1.2),\log(0.8))^T$,
representing a one-third increase or decrease from the initial hazard ratio, respectively. We
will refer to these two cases as model I and model II, respectively. For the empirical power
evaluation, lifetime variables were generated with the standard exponential distribution for
the control group, and with
$$\frac{\lambda_T(t)}{\lambda_C(t)}=\begin{cases}a,&t\in(0,0.5)\cup(1.5,\infty),\\1/a,&t\in[0.5,1.5],\end{cases}$$
for $a=3$ and $a=0.5$. These two cases will be referred to as model III and model IV, respectively.
Notice that model III gives a U-shaped hazard ratio while model IV gives an upside
down U-shaped hazard ratio, as opposed to a monotone hazard ratio implied by the model
of Yang & Prentice (2005). The censoring variables were independent and identically distributed
with the log-normal distribution, where the normal distribution had mean $c$ and standard
deviation 0.5, with $c$ chosen to achieve various censoring rates. The empirical size and
power were obtained from 1000 repetitions, and for each repetition, the critical values of the
tests were calculated empirically from 1000 realizations of relevant conditional distributions.
The results of these simulations are summarized in Tables 1 and 2.
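Data generation under models I and II can be sketched by inverse-transform sampling. With the control odds function equal to the identity, the control survivor function is 1/(1 + t), and integrating the model hazard gives the treatment survivor function {1 + (θ1/θ2)t}^(−θ2). The code below is a hedged illustration in Python (the article's computations were done in Matlab); the function names and the `simulate` interface are assumptions made for this sketch.

```python
import math
import random


def quantile_control(u):
    """Time t with S_C(t) = u when S_C(t) = 1 / (1 + t)."""
    return 1.0 / u - 1.0


def quantile_treatment(u, theta1, theta2):
    """Time t with S_T(t) = u when S_T(t) = {1 + (theta1/theta2) t}^(-theta2)."""
    return (theta2 / theta1) * (u ** (-1.0 / theta2) - 1.0)


def surv_treatment(t, theta1, theta2):
    """Treatment survivor function implied by the model with R(t) = t."""
    return (1.0 + theta1 * t / theta2) ** (-theta2)


def simulate(n1, n2, theta1, theta2, c, seed=0):
    """Return a list of triplets (X_i, delta_i, Z_i), control group first.

    Censoring times are log-normal with underlying normal mean c and sd 0.5.
    """
    rng = random.Random(seed)
    data = []
    for i in range(n1 + n2):
        z = 1 if i >= n1 else 0
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids division by zero
        t = quantile_treatment(u, theta1, theta2) if z else quantile_control(u)
        cens = rng.lognormvariate(c, 0.5)
        data.append((min(t, cens), int(t <= cens), z))
    return data
```

For model I one would call, for example, `simulate(80, 80, 0.9, 1.2, c)` and tune `c` by trial until the observed censoring proportion matches the target rate.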
From Table 1, both tests in general have the correct size. When the censoring is light
or moderate, the tests are generally conservative. On the other hand, when the censoring
is heavy, both tests may have inflated size for small sample size. This may be expected as
there is very little information available. For example, with $n_1=n_2=80$ and the censoring at
75 per cent, only about 40 of the total 160 data points are not censored. As the sample size
increases, the size generally improves. The multiples used in the $\varepsilon_i$ variables could possibly be chosen
to control the size even for small sample sizes with heavy censoring, but that may come
at the expense of making the tests more conservative at light or moderate censoring levels.
From Table 2, under light or moderate censoring, generally the martingale residual-based
test has some advantage for model III while the contrast-based test for model IV. However,
we see almost a reversal in performance when the censoring is heavy. These behaviours indi-
cate that the two tests may possibly be powerful under different classes of alternatives. Also,
both tests have reduced power when there is heavy censoring. One possible reason is that, due
Table 1. Empirical size of the lack-of-fit tests for model (2), at various sample sizes and censoring levels.
$\beta=(\log(0.9),\log(1.2))^T$ for model I and $\beta=(\log(1.2),\log(0.8))^T$ for model II. The results were based on
1000 repetitions. For each repetition, the critical values of the tests were calculated from 1000 realizations
of relevant conditional distributions

                       Censoring rate, Model I              Censoring rate, Model II
Test          10%      30%      50%      75%        10%      30%      50%      75%
n1 = n2 = 40
Residual      0.0210   0.0340   0.0310   0.0460     0.0290   0.0380   0.0330   0.0690
Contrast      0.0220   0.0270   0.0410   0.0470     0.0300   0.0310   0.0370   0.0710
n1 = n2 = 80
Residual      0.0170   0.0170   0.0300   0.0450     0.0260   0.0330   0.0290   0.0730
Contrast      0.0190   0.0180   0.0320   0.0430     0.0220   0.0200   0.0220   0.0560
n1 = n2 = 160
Residual      0.0310   0.0380   0.0360   0.0520     0.0510   0.0400   0.0470   0.0540
Contrast      0.0260   0.0290   0.0420   0.0380     0.0290   0.0240   0.0300   0.0400
Table 2. Empirical power of the lack-of-fit tests for model (2), at various sample sizes and censoring levels,
at model III with a U-shaped hazard ratio, and model IV with an upside-down U-shaped hazard ratio. The
results were based on 1000 repetitions. For each repetition, the critical values of the tests were calculated
from 1000 realizations of relevant conditional distributions

                       Censoring rate, Model III            Censoring rate, Model IV
Test          10%      30%      50%      75%        10%      30%      50%      75%
n1 = n2 = 40
Residual      0.2810   0.3520   0.2330   0.2290     0.2630   0.2610   0.2780   0.1680
Contrast      0.1160   0.1590   0.2260   0.2880     0.5410   0.4530   0.3240   0.1580
n1 = n2 = 80
Residual      0.4390   0.5090   0.2660   0.3440     0.6220   0.6230   0.6260   0.2090
Contrast      0.2190   0.2720   0.2560   0.3450     0.9100   0.8820   0.7010   0.1600
n1 = n2 = 160
Residual      0.7530   0.7920   0.3010   0.3830     0.9320   0.9530   0.9390   0.3650
Contrast      0.7600   0.6410   0.2690   0.3900     0.9960   0.9940   0.9590   0.3210
to heavy censoring, the hazard ratio in the data range is likely monotone instead of U-shaped
or upside down U-shaped, resulting in the power reduction. For example, for the last case
with $n_1=n_2=160$, the average of $\max_iX_i$ from the simulated data sets, at the four censoring
levels, is 3.89, 2.31, 1.33 and 0.56 for model III, and 4.65, 2.87, 1.76 and 0.82 for model IV.
Note that the U-shape or upside down U-shape is realized over a range containing [0, 1.5].
With heavy censoring, often there may not be enough data in that range. This also gives a
possible reason for the power being higher at 75 per cent censoring than at 50 per cent censoring
under model III. At 75 per cent censoring, given the average 0.56 of $\max_iX_i$, the available
data are mostly in the [0, 0.5] range where the hazard ratio is constant. This simple situation
possibly yields a better result compared with the situation at 50 per cent censoring, where the
available data are in a range over which the hazard ratio has a rapid descent from 3 to 0.5.
3.2. The Women’s Health Initiative trial
The Women’s Health Initiative (WHI) randomized controlled trial of combined post-
menopausal hormone therapy reported an elevated coronary heart disease risk and overall
unfavourable health benefits versus risks over a 5.6-year study period (Writing Group for the
Women’s Health Initiative Investigators, 2002; Manson et al., 2003). Here we look at the time
to coronary heart disease in the WHI clinical trial, which included 16,608 postmenopausal
women initially in the age range of 50–79 with a uterus. The trial has two arms. The placebo
arm has sample size 8102, and the estrogen plus progestin arm has sample size 8506. About
98 per cent of the data are censored, primarily by the trial monitoring time. For this data set,
there was strong evidence that the hazards were non-proportional. In Prentice et al. (2005),
a time axis partition was used and the relative risk estimate was obtained separately over the
intervals 0–2, 2–5 and >5 years. Fitting model (2) to this data set with the placebo group
being the control group, we get $\hat\beta=(0.636,-3.601)^T$. The martingale residual-based test has
the p-value of 0.438, and the contrast-based test has the p-value of 0.427. Thus both tests
indicate a good fit of the model. Figure 1 gives the plots of the non-parametric survival curve
$\hat S_T$ and the model-based survival curve $\tilde S_T$ for the treatment group, or the estrogen plus progestin
group. We see that the model-based and non-parametric survival curves are very close
to each other.
Now if we switch the group label and use the estrogen plus progestin group as the con-
trol group, then both tests result in p-values less than 0.01, discrediting the model. Figure 2
Fig. 1. Estimated survival curves for the WHI data when fitting model (2) using the placebo group as
the control group: solid line – Kaplan–Meier curve for the estrogen plus progestin group; dashed line
– model-based estimator for the estrogen plus progestin group. [Plot: survival (vertical axis, labelled
'% Survival', 0.965 to 1.005) against days (horizontal axis, 0 to 3500).]
Fig. 2. Estimated survival curves for the WHI data when fitting model (2) using estrogen plus progestin
group as the control group: solid line – Kaplan–Meier curve for the placebo group; dashed line –
model-based estimator for the placebo group. [Plot: survival (vertical axis, labelled '% Survival',
0.965 to 1.005) against days (horizontal axis, 0 to 3500).]
gives the plots of $\hat S_T$ and $\tilde S_T$ for the placebo arm, which is the ‘treatment group’ under the
assumed model. We see that in this way, compared with the model using the placebo arm as
the control group, there is some noticeable gap between the model-based survival curve and
non-parametric survival curve in the middle of the data range. One possible reason may be
that the data yield a Kaplan–Meier curve for the placebo group which behaves very differ-
ently near the end of the data range, with jumps considerably larger than at early or middle
time points. In comparison, for the estrogen plus progestin group, as displayed in Fig. 1, the
Kaplan–Meier curve behaves more or less linearly throughout, with jumps less dramatic near
the end of the data range.
3.3. The gastrointestinal tumour study
Next, we look at an example where the hazards are highly non-proportional in that both the
hazard functions and the survivor functions cross. The Gastrointestinal Tumour Study Group
(1982) reported the results of a trial that compared chemotherapy with combined chemo-
therapy and radiation therapy, in the treatment of locally unresectable gastric cancer. There
were 45 patients on each treatment. Two observations in the chemotherapy group and six
in the combination group were censored. Kaplan–Meier plots of the two estimated survival
curves cross at around 1000 days (see p. 10 of Yang & Prentice, 2005). If we let the chemotherapy
group be the control group and fit model (2) to the data, then we have $\hat\beta=(1.714,-0.981)^T$.
The martingale residual-based test has the p-value of 0.104, and the contrast-based
test has the p-value of 0.595. Thus the martingale residual-based test signals some degree of
lack of fit, while the contrast-based test indicates a good fit. Figure 3 gives the plots of $\hat S_T$
and $\tilde S_T$ for the treatment group, with combined chemotherapy and radiation therapy. We see
that the model-based survival curve and the non-parametric survival curve are mostly close
to each other, except for a very small region around 230 days, where both the model-based
survival curve and the non-parametric survival curve descend rapidly and have their largest
discrepancy. This discrepancy affects the martingale residual-based test more than the contrast-based
test. Plots of $\sum_{i>n_1}\int_0^t\phi(s)\,d\hat M_i(s)/\sqrt n$ and a few realizations of the process $\hat U(t)$
given the data, shown in appendix S2, suggest that the low p-value of the martingale residual-based
test is mainly caused by behaviours of the processes on that small region.
Now if we let the combined chemotherapy and radiation therapy group be the control
group and fit model (2) to the data, then the martingale residual-based test has the p-value
of 0.634, while the contrast-based test has the p-value of 0.354. Figure 4 gives the plots of
$\hat S_T$ and $\tilde S_T$ for the chemotherapy group, which is the ‘treatment group’ under the assumed
model. We see that, compared with the scenario in Fig. 3 before, the difference between the
model-based survival curve and non-parametric survival curve is more pronounced over most
of the data range. Here the model fit seems worse compared with that in Fig. 3, but the
Fig. 3. Estimated survival curves for the gastric data when fitting model (2) using the chemo group
as the control group: solid line – Kaplan–Meier curve for the combination group; dashed line –
model-based estimator for the combination group. [Plot: survival (vertical axis, labelled '% Survival',
0.1 to 1) against days (horizontal axis, 0 to 3000).]
Fig. 4. Estimated survival curves for the gastric data when fitting model (2) using the combination group
as the control group: solid line – Kaplan–Meier curve for the chemo group; dashed line – model-based
estimator for the chemo group. [Plot: survival (vertical axis, labelled '% Survival', 0 to 1) against
days (horizontal axis, 0 to 3000).]
p-value of the martingale residual-based test is less extreme. Plots of $\sum_{i>n_1}\int_0^t\phi(s)\,d\hat M_i(s)/\sqrt n$
and a few realizations of the process $\hat U(t)$ given the data suggest that, for $t$ up to about 700
days, the variance process of $\hat U(t)$ is generally larger than for the case in Fig. 3, helping to
avoid a low p-value for the martingale residual-based test. These behaviours may plausibly
be due to the high non-proportionality of the two groups combined with a moderate sample
size.
4. Discussion
We have investigated Kolmogorov–Smirnov type tests, which use the supremum norm of the relevant
processes. Alternatively, tests can be considered using the integrated absolute value
of the processes. Such integrated-type test statistics are related to the Cramér–von Mises
tests. Also, contrast between the Nelson–Aalen estimator of the cumulative hazard and the
model-based estimator can be used instead of the contrast between the Kaplan–Meier esti-
mator and the model-based estimator. In some preliminary simulation studies, those tests did
not provide improvement over the tests we focused on here, and thus are omitted.
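To illustrate the alternatives mentioned in this paragraph, the sketch below computes the Nelson–Aalen estimate of the cumulative hazard and then summarises a contrast process in two ways: by its supremum and by the integral of its absolute value. The function names, the toy data and the linear stand-in for a model-based cumulative hazard are assumptions made for illustration only; they are not the statistics or data used in the paper.

```python
import numpy as np

def nelson_aalen(times, events):
    """Return the distinct event times and the Nelson-Aalen cumulative hazard at each."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)   # True = event, False = censored
    event_times = np.unique(times[events])
    # Increment at each event time: (number of events) / (number at risk).
    increments = np.array([
        np.sum((times == t) & events) / np.sum(times >= t)
        for t in event_times
    ])
    return event_times, np.cumsum(increments)

def sup_statistic(process):
    """Supremum-type functional: largest absolute value on the grid."""
    return float(np.max(np.abs(process)))

def integrated_statistic(grid, process):
    """Integrated-type functional: integral of |process| over the grid,
    treating the process as a right-continuous step function."""
    grid = np.asarray(grid, dtype=float)
    process = np.asarray(process, dtype=float)
    return float(np.sum(np.abs(process[:-1]) * np.diff(grid)))

if __name__ == "__main__":
    # Hypothetical toy data: follow-up times and event indicators.
    t, cum_hazard = nelson_aalen([1, 2, 2, 3, 4, 5], [1, 1, 0, 1, 0, 1])
    # Hypothetical stand-in for a model-based cumulative hazard.
    contrast = cum_hazard - 0.3 * t
    print(sup_statistic(contrast), integrated_statistic(t, contrast))
```

The supremum reacts to the single largest deviation, whereas the integrated version accumulates deviations over the whole follow-up period, so the two can rank the same departures differently.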
The model of Yang & Prentice (2005) implies that the hazard ratio is monotone. While this
covers many practical situations, other patterns of the hazard ratio are possible. When the
tests indicate a lack of fit for the model of Yang & Prentice (2005), it is possible to remedy
the situation by considering larger classes of semiparametric models to incorporate an even
wider range of hazard ratio patterns. Also, in addition to the two-sample case considered
here, adjustment for covariates may be considered.
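The monotonicity of the hazard ratio can be checked directly. The short derivation below assumes the two-sample form of the model given in Yang & Prentice (2005), with $\theta_1, \theta_2 > 0$ the short-term and long-term hazard ratios and $S_C$ the survivor function of the control group. The helper function $h$ is introduced here for illustration only, and the second limit holds only if $S_C(t) \to 0$.

```latex
\begin{align*}
\frac{\lambda_T(t)}{\lambda_C(t)}
  &= \frac{\theta_1\theta_2}{\theta_1 + (\theta_2-\theta_1)\,S_C(t)}
   = h\bigl(S_C(t)\bigr),
  \qquad
  h(u) = \frac{\theta_1\theta_2}{\theta_1 + (\theta_2-\theta_1)\,u},
  \quad u \in [0,1], \\[4pt]
h(1) &= \theta_1 \quad (t = 0,\ S_C = 1),
  \qquad
h(0) = \theta_2 \quad (t \to \infty,\ S_C \to 0), \\[4pt]
h'(u) &= -\,\frac{\theta_1\theta_2\,(\theta_2-\theta_1)}
               {\{\theta_1 + (\theta_2-\theta_1)\,u\}^{2}} .
\end{align*}
```

The denominator equals $\theta_1(1-u) + \theta_2 u > 0$, so $h'$ has the constant sign of $\theta_1 - \theta_2$. Because $S_C(t)$ is non-increasing in $t$, the hazard ratio is non-decreasing in $t$ when $\theta_2 > \theta_1$, non-increasing when $\theta_2 < \theta_1$, and constant when $\theta_1 = \theta_2$, which is the proportional hazards case. A hazard ratio that first rises and then falls therefore lies outside this model.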
Acknowledgements
The authors thank the editors and reviewers for their constructive comments and suggestions
that have led to an improved manuscript. The research of Yichuan Zhao was partially
supported by NSA grant #H98230-12-1-0209.
Supporting Information
Additional Supporting Information may be found in the online version of this article:
Appendix S1. Proofs of asymptotic results.
Appendix S2. Additional plots for the gastrointestinal tumor study.
References
Aalen, O. O. (1975). Statistical inference for a family of counting processes. PhD thesis, University of
California, Berkeley.
Bennett, S. (1983). Analysis of survival data by the proportional odds model. Statist. Med. 2,
273–277.
Bickel, P. J., Klaassen, C. A. J., Ritov, Y. & Wellner, J. A. (1993). Efficient and adaptive estimation for
semiparametric models. Johns Hopkins University Press, Baltimore.
Chen, Y. Q. & Cheng, S. (2006). Linear life expectancy regression with censored data. Biometrika 93,
303–313.
Chen, K., Jin, Z. & Ying, Z. (2002). Semiparametric analysis of transformation models with censored
data. Biometrika 89, 659–668.
Cheng, S. C., Wei, L. J. & Ying, Z. (1995). Analysis of transformation models with censored data.
Biometrika 82, 835–845.
Cox, D. R. (1972). Regression models and life-tables (with discussion). J. Roy. Statist. Soc. B 34,
187–220.
Gastrointestinal Tumor Study Group: Schein, P. D., Bruckner, H. W., Douglass, H. O., Mayer, R. et al.
(1982). A comparison of combination chemotherapy and combined modality therapy for locally
advanced gastric carcinoma. Cancer 49, 1771–1777.
Kalbfleisch, J. D. & Prentice, R. L. (2002). The statistical analysis of failure time data, 2nd edn. Wiley,
Hoboken, NJ.
Kaplan, E. & Meier, P. (1958). Nonparametric estimation from incomplete observations. J. Amer. Statist.
Assoc. 53, 457–481.
Lin, D. Y., Wei, L. J. & Ying, Z. (1993). Checking the Cox model with cumulative sums of martingale-
based residuals. Biometrika 80, 557–572.
Manson, J. E., Hsia, J., Johnson, K. C., Rossouw, J. E., Assaf, A. R., Lasser, N. L., Trevisan, M., Black,
H. R., Heckbert, S. R., Detrano, R., Strickland, O. L., Wong, N. D., Crouse, J. R., Stein, E. & Cushman,
M., for the Women’s Health Initiative Investigators (2003). Estrogen plus progestin and the risk
of coronary heart disease. New Engl. J. Med. 349, 523–534.
Murphy, S. A., Rossini, A. J. & Van der Vaart, A. W. (1997). Maximum likelihood estimation in the
proportional odds model. J. Amer. Statist. Assoc. 92, 968–976.
Nelson, W. (1969). Hazard plotting for incomplete failure data. J. Qual. Technol. 1, 27–52.
Prentice, R. L., Langer, R., Stefanick, M. L., Howard, B. V., Pettinger, M., Anderson, G., Barad,
D., Curb, J. D., Kotchen, J., Kuller, L., Limacher, M. & Wactawski-Wende, J., for the Women’s
Health Initiative Investigators (2005). Combined postmenopausal hormone therapy and cardiovascular
disease: toward resolving the discrepancy between observational studies and the women’s health
initiative clinical trial. Am. J. Epi. 162, 404–414.
Therneau, T. M., Grambsch, P. M. & Fleming, T. R. (1990). Martingale-based residuals for survival
models. Biometrika 77, 147–160.
Tsodikov, A. (2002). Semi-parametric models of long- and short-term survival: an application to the
analysis of breast cancer survival in Utah by age and state. Statist. Med. 21, 895–920.
Writing Group for the Women’s Health Initiative Investigators (2002). Risks and benefits of estrogen
plus progestin in healthy postmenopausal women: principal results from the women’s health initiative
randomized controlled trial. J. Amer. Med. Assoc. 288, 321–333.
Yang, S. & Prentice, R. L. (1999). Semiparametric inference in the proportional odds regression model.
J. Amer. Statist. Assoc. 94, 125–136.
Yang, S. & Prentice, R. L. (2005). Semiparametric analysis of short-term and long-term hazard ratios
with two-sample survival data. Biometrika 92, 1–17.
Ying, Z. (1993). A large sample study of rank estimation for censored regression data. Ann. Statist. 21,
76–99.
Zeng, D. & Lin, D. Y. (2007). Maximum likelihood estimation in semiparametric regression models with
censored data. J. Roy. Statist. Soc. B 69, 507–564.
Received January 2010, in final form March 2012
Song Yang, PhD, Office of Biostatistics Research, Division of Cardiovascular Sciences, National Heart,
Lung and Blood Institute, NIH, DHHS, 6701 Rockledge Dr. MSC 7913, Bethesda, MD 20892, USA.
E-mail: yangso@nhlbi.nih.gov