Biostatistics & Epidemiology
ISSN: (Print) (Online) Journal homepage: https://www.tandfonline.com/loi/tbep20
Propensity score-based adjustment for covariate
effects on classification accuracy of bio-marker
using ROC curve
Muntaha Mushfiquee & M. Shafiqur Rahman
To cite this article: Muntaha Mushfiquee & M. Shafiqur Rahman (2022): Propensity score-
based adjustment for covariate effects on classification accuracy of bio-marker using ROC curve,
Biostatistics & Epidemiology, DOI: 10.1080/24709360.2022.2131994
To link to this article: https://doi.org/10.1080/24709360.2022.2131994
Published online: 24 Oct 2022.
Institute of Statistical Research and Training, University of Dhaka, Dhaka, Bangladesh
ABSTRACT
The potential performance of a bio-marker in classifying the diseased from the healthy population may be affected by baseline covariates (X) that are associated with both the bio-marker (Y) and the disease status (D). Some existing approaches can adjust for the effect of only a single covariate at a time. However, several potential covariates are often available in practice, for which simultaneous adjustment in the ROC curve is essential. This study proposes a propensity score (PS) based adjustment for the effects of several covariates in the ROC curve. The PS is first derived from a linear transformation of several covariates, and the PS-adjusted (and PS-specific) ROC curve is then estimated using the existing non-parametric induced ROC regression framework. The method is illustrated for both continuous and binary bio-markers. The simulation study suggests that the PS-based adjustment performs well, providing a consistent estimate of the true ROC curve and showing robustness to mis-specification of the propensity score model as well as to a non-linear function of covariates. Further, an application of the method is provided to evaluate the effectiveness of the body-mass-index in classifying patients with hypertension or diabetes after adjusting for potential covariates such as age, sex, education, and socio-economic status.
ARTICLE HISTORY
Received 10 March 2022
Accepted 28 September 2022
KEYWORDS
Induced ROC regression;
non-parametric method;
body-mass-index;
classification
1. Introduction
In medicine, bio-markers play a vital role in diagnosing the risk of developing a disease, the absence or presence of a disease, and disease progression [1,2]. For example, body temperature is a well-known bio-marker for fever; cholesterol values are considered a bio-marker and risk indicator for coronary and vascular disease; and blood pressure is used to determine the risk of stroke [3,4]. Therefore, one of the recent advances in bio-medical research is the discovery of novel bio-markers and quantification of the risk offered by the marker [5–7]. Furthermore, controlled clinical trials have sometimes been conducted to determine whether altering the concentrations of a new bio-marker leads to an improvement in patient prognosis. For example, the inflammatory marker CRP has been extensively studied in cardiovascular medicine to determine its potential to quantify the risk of developing cardiovascular disease [8–10]. The usefulness of new markers depends on
CONTACT M. Shafiqur Rahman shafiq@isrt.ac.bd Institute of Statistical Research and Training, University of
Dhaka, Dhaka 1000, Bangladesh
© 2022 International Biometric Society – Chinese Region
how much they add extra predictive value in correctly classifying the diseased subjects
from the non-diseased subjects [5,6].
Given the importance of bio-markers in disease classification, it is essential to evaluate their classification accuracy: the ability to provide the correct diagnosis given a subject's true disease status (discrimination). The receiver operating characteristic (ROC) curve, a graphical plot of sensitivity (true positive rate) against 1-specificity (false positive rate) evaluated at different cutoff values of the diagnostic marker, is commonly used to evaluate the binary classification accuracy of the marker [11–13]. The area under the curve (AUC) is used as a numerical summary measure of the classification accuracy, which is interpreted as the probability that, for a randomly selected pair of subjects, the subject with disease has a larger marker value than the subject without disease. For a marker whose larger values indicate disease, it ranges between 0.5 and 1, with a value of 0.5 indicating random classification by the marker, like a coin toss, and 1 indicating perfect classification.
The potential performance of a bio-marker may be strongly influenced by both patient and disease characteristics or by features of the specimen collection or test settings. As discussed by Janes and Pepe [14,15], a covariate can be associated with both the bio-marker and the disease outcome and act as a confounder or effect modifier. Hence it can affect the classification accuracy in two different ways: (i) it can confound the performance of the bio-marker in disease classification but not affect the discriminatory power (it is not an effect modifier); (ii) it can affect the classification accuracy (discriminatory power) of the marker at its different levels (it is an effect modifier). In the first case, the classification accuracy (discriminatory power) of the marker is the same across different levels (groups) of the covariate values but different from that for the pooled data, because the distributions of the bio-marker for cases and controls vary across different levels of the covariates. Hence adjustment for the covariate effect on the ROC curve is necessary; otherwise, the traditional pooled ROC curve may provide misleading conclusions about the performance of the marker. The adjusted ROC (aROC) is analogous to the adjusted odds-ratio in association studies in epidemiology and statistics [15]. In the second case, the classification accuracy differs for different levels of the covariates, for which the conditional or covariate-specific ROC curve (for a specific value of the covariate) can be considered. However, for a continuous covariate and in small-sample situations, the covariate-specific ROC curve cannot be estimated with precision. In this situation, to provide a summary of the classification accuracy, the covariate-adjusted ROC curve is usually estimated as a weighted average of covariate-specific ROC curves, with weights corresponding to the proportion of cases in each covariate group [16,17].
To assess possible covariate effects on the ROC curve, a considerable number of approaches have been discussed in the literature. Among them, a few studies attempted to include all covariates in a multivariable model and then produce a ROC curve for the linear predictor $\eta_i = \hat{\beta} X_i$ derived from the model [13,18]. However, the ROC curve derived from the linear predictor of a multivariable model is not the ROC curve of the bio-marker Y adjusted for covariate values X1 and X2; rather, it captures the combined ability of both the bio-marker and the covariates to discriminate between cases and controls. This combination can perform well even if Y is a poor classifier and the X's discriminate well [14,16]. Further, covariate adjustment by comparing nested models (models with and without the bio-marker Y) may produce misleading conclusions if the bio-marker and the covariates are correlated. This approach shows how much accuracy improves with the addition of Y to X, i.e. the incremental value, which is different from the covariate-adjusted performance of a bio-marker [14].
To adjust for the covariate effects on the ROC curve, two different regression approaches ('induced' and 'direct' methodology) have been suggested in the literature. In the induced regression [19,20], separate regression models for the bio-marker (diagnostic results) in the diseased and healthy populations are specified, and the covariate effects on the associated ROC curve are then computed by deriving an induced form of the ROC curve. In the direct methodology [21,22], by contrast, a regression model for the conditional ROC curve with the effects of the covariates is directly specified, instead of modelling the bio-marker values. Both approaches are based on a generalized linear model (GLM) framework to characterize the covariate effects on the ROC curve. However, using a parametric model to evaluate the covariate effects, particularly the continuous effects of continuous covariates, on the ROC curve may provide misleading conclusions if the effects are incorrectly specified and/or the distributional assumption for the bio-marker is mis-specified. To address these problems, Rodríguez–Alvarez et al. [17] proposed a non-parametric approach for the induced-regression methodology, which only requires the assumption that the effect of a continuous covariate follows a smooth function. However, the non-parametric approach of Rodríguez–Alvarez et al. [17] suffers from not being able to address more than one covariate at a time, because of the computational complexity of the distribution function in the non-parametric regression setting. In practice, a mixture of several continuous and categorical covariates is commonly available, for which a methodological extension in ROC analysis is required to adjust for the effects of several covariates simultaneously.
The issues of covariate adjustment in ROC analysis are similar to the adjustment for the effects of baseline covariates (confounders) in estimating the marginal effects of a treatment/exposure, which is commonly addressed by performing propensity score (PS) analysis [23–25]. Some recent studies applied the propensity score in ROC analysis [26,27]. Of them, the study by Han et al. [26] applied the covariate balancing propensity score when estimating the AUC for a bio-marker, on the premise that balancing in bio-marker prevalence may provide a greater AUC value, i.e. better discriminatory power of the bio-marker. Galadima and McClish [27] used the PS in estimating the conditional AUC after controlling for confounders, where the AUC is used as the measure of effect of a dichotomous exposure when the outcome is continuous and is interpreted as the probability that a randomly selected non-exposed subject has a better response than a randomly selected exposed subject; this is, however, not the same as the AUC metric generally estimated for evaluating the classification accuracy of a bio-marker. Moreover, neither of them serves the need for covariate adjustment in ROC analysis when a set of covariates confounds the performance of the bio-marker in disease classification without affecting the discriminatory power. With this background, this paper proposes a propensity score (PS) based adjustment for the covariate effects in ROC analysis, where a set of both categorical and continuous covariates can be adjusted for simultaneously [28]. This can be performed by calculating the PS from a linear transformation between the bio-marker and the set of covariates through a multivariable regression model and using it as a single covariate, instead of all covariates, in the existing non-parametric ROC regression setup. The process is straightforward, and the resulting ROC estimators (both PS-specific and PS-adjusted) are shown to have an intuitive interpretation and the statistical properties of a good estimator. Details of the proposed approach with motivation are discussed in Section 3.
This paper is organized as follows. Section 2 describes the existing non-parametric method of covariate adjustment in the ROC curve, and Section 3 describes the proposed PS-based adjustment. An extensive simulation study is discussed in Section 4 and an application of the methods is given in Section 5. Section 6 ends the paper with some discussion of the main findings and concluding remarks.
2. ROC curve and covariate adjustment
2.1. Overview on ROC curve
Let $Y$ be a random variable representing the values of a continuous bio-marker. The diagnosis according to any cutoff value $c$ is positive if $Y \ge c$ and negative if $Y < c$. Let $D_1$ and $D_0$ denote the diseased and non-diseased populations, respectively. The true positive rate $\mathrm{TPR}(c)$ and the false positive rate $\mathrm{FPR}(c)$ at any cutoff value $c$ are defined as:
\[ \mathrm{TPR}(c) = P(Y \ge c \mid D_1) = 1 - F_{D_1}(c), \]
\[ \mathrm{FPR}(c) = P(Y \ge c \mid D_0) = 1 - F_{D_0}(c), \]
where $F_{D_1}$ and $F_{D_0}$ are the distribution functions of $Y_{D_1}$ and $Y_{D_0}$, respectively. In a theoretical ROC curve, the true positive rate (sensitivity) is plotted as a function of the false positive rate (1-specificity) for different cut-off points [29]. Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. A test with perfect discrimination (no overlap in the two distributions) has a ROC curve that passes through the upper left corner (100% sensitivity, 100% specificity) [30]. Therefore, the closer the ROC curve is to the upper left corner, the higher the overall accuracy of the test [17]. Let $p$ denote the false positive rate obtained as the cutoff $c$ varies over $(-\infty, \infty)$ [31]. The ROC curve can be defined as
\[ \mathrm{ROC}(p) = 1 - F_{D_1}\left[F_{D_0}^{-1}(1-p)\right], \quad p \in (0, 1). \]
The most commonly used value to summarize the accuracy is the area under the ROC curve,
\[ \mathrm{AUC} = \int_0^1 \mathrm{ROC}(p)\,dp = P(Y_{D_1} > Y_{D_0}). \]
This can be interpreted as the probability that, in a randomly selected pair of healthy and diseased individuals, the diagnostic marker value is higher for the diseased subject. Values of AUC close to 1.0 indicate that the marker has high diagnostic accuracy [32,33].
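The definitions in this subsection can be checked numerically. The following is a minimal sketch, not the estimator used later in the paper: it computes the empirical ROC points for the rule "positive if Y ≥ c" and the AUC in two ways, by the trapezoidal rule and as the share of (diseased, healthy) pairs in which the diseased value is larger. The marker values and the function names are hypothetical and chosen only for illustration.

```python
def empirical_roc(diseased, healthy):
    """Return (fpr, tpr) lists for the rule 'positive if y >= c'."""
    # Sweep the cutoff from the largest to the smallest observed value.
    cutoffs = sorted(set(diseased) | set(healthy), reverse=True)
    fpr, tpr = [0.0], [0.0]  # corner where no subject tests positive
    for c in cutoffs:
        tpr.append(sum(y >= c for y in diseased) / len(diseased))
        fpr.append(sum(y >= c for y in healthy) / len(healthy))
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the piecewise-linear ROC curve."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
        for k in range(1, len(fpr))
    )


def auc_pairwise(diseased, healthy):
    """Estimate P(Y_D1 > Y_D0); tied pairs count one half."""
    wins = 0.0
    for yd in diseased:
        for yh in healthy:
            if yd > yh:
                wins += 1.0
            elif yd == yh:
                wins += 0.5
    return wins / (len(diseased) * len(healthy))


# Hypothetical marker values, for illustration only.
diseased = [3.1, 4.0, 5.2, 6.3]
healthy = [1.0, 2.2, 2.9, 3.5, 4.1]

fpr, tpr = empirical_roc(diseased, healthy)
auc_curve = auc_trapezoid(fpr, tpr)
auc_pairs = auc_pairwise(diseased, healthy)
print(auc_curve, auc_pairs)
```

The two AUC values agree, which illustrates the identity between the area under the curve and the pairwise probability stated above.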
2.2. Covariate-adjusted ROC curve for single covariate
Let us consider a continuous bio-marker $Y$ and a continuous covariate $X$. Then, using notation similar to that of Janes and Pepe [15], the TPR and FPR for $Y \mid X$ at any cutoff $c$ are defined as
\[ \mathrm{TPR}_X(c) = 1 - F_{D_1}(c \mid X) = \Pr(Y_{D_1} > c \mid X), \tag{1} \]
\[ \mathrm{FPR}_X(c) = 1 - F_{D_0}(c \mid X) = \Pr(Y_{D_0} > c \mid X). \tag{2} \]
Then the covariate-specific ROC curve can be defined as
\[ \mathrm{ROC}_X(p) = 1 - F_{D_1(X)}\left[F_{D_0(X)}^{-1}(1-p)\right]. \tag{3} \]
According to Janes and Pepe [16], the covariate-adjusted ROC curve can be defined as
\[ \mathrm{aROC}_X(p) = \int \mathrm{ROC}_X(p)\,dF_{X_{D_1}}(X). \tag{4} \]
Then the AUC can be obtained as
\[ \mathrm{aAUC}_X = \int_0^1 \mathrm{aROC}_X(p)\,dp. \]
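For a covariate with a few discrete levels, the integral in Equation (4) reduces to a weighted average of stratum-specific ROC values, with weights equal to the share of cases in each stratum. The sketch below illustrates this under that simplifying assumption; the data, the stratum labels and the function names are hypothetical. It also shows how the pooled ROC value can differ from the adjusted one when the covariate shifts the marker in both groups.

```python
def control_threshold(controls, p):
    """Cutoff c such that at most a share p of controls has y > c."""
    ordered = sorted(controls)
    n = len(ordered)
    k = int(p * n + 1e-12)  # number of controls allowed above the cutoff
    return ordered[n - k - 1] if k < n else float("-inf")


def stratum_roc(cases, controls, p):
    """Empirical ROC(p): share of cases above the control threshold."""
    c = control_threshold(controls, p)
    return sum(y > c for y in cases) / len(cases)


def adjusted_roc(strata, p):
    """Case-weighted average of the stratum-specific ROC values."""
    n_cases = sum(len(cases) for cases, _ in strata.values())
    return sum(
        len(cases) / n_cases * stratum_roc(cases, controls, p)
        for cases, controls in strata.values()
    )


# Hypothetical data: stratum -> (case values, control values).
# The covariate shifts the marker upward in BOTH groups and cases are
# more common in the high stratum, so it acts as a confounder.
strata = {
    "x=0": ([2.5, 4.5], [1.0, 2.0, 3.0, 4.0]),
    "x=1": ([10.5, 11.5, 12.5, 13.5], [11.0, 12.0, 13.0, 14.0]),
}
pooled_cases = [y for cases, _ in strata.values() for y in cases]
pooled_controls = [y for _, controls in strata.values() for y in controls]

adjusted_value = adjusted_roc(strata, 0.5)
pooled_value = stratum_roc(pooled_cases, pooled_controls, 0.5)
print(adjusted_value, pooled_value)
```

In this toy example the pooled value at p = 0.5 is larger than the adjusted value, because part of the apparent separation comes from the covariate rather than from the marker.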
2.3. Non-parametric estimation of covariate-adjusted ROC curve
2.3.1. Continuous bio-marker
Assume that $Y_{D_1}$ and $Y_{D_0}$ are the bio-marker values for diseased and healthy subjects, respectively. As described by Rodríguez–Alvarez et al. [17], a non-parametric regression model is assumed for the bio-marker values along with a continuous covariate $X$ for the diseased and non-diseased populations:
\[ Y_{D_1} = \mu_1(X) + \sigma_1(X)\,\varepsilon_1, \tag{5} \]
\[ Y_{D_0} = \mu_0(X) + \sigma_0(X)\,\varepsilon_0, \tag{6} \]
where $\mu_1(X) = \beta_{01} + \beta_{11}X$ and $\mu_0(X) = \beta_{00} + \beta_{10}X$ are the regression functions, and $\sigma_1^2(X)$ and $\sigma_0^2(X)$ are the variance functions. The errors $\varepsilon_1$ and $\varepsilon_0$ are assumed to be independent of the covariate $X$, with zero mean and unit variance. Based on Equations (3), (5), and (6), and after some algebra discussed in Appendix A of the paper by Rodríguez–Alvarez et al. [17], the covariate-specific ROC curve from this induced regression has the following form:
\[ \mathrm{ROC}_{X=x}(p) = F_{D_1(X)}\left( \frac{\mu_0(X) - \mu_1(X)}{\sigma_1(X)} + \frac{\sigma_0(X)}{\sigma_1(X)}\, F_{D_0(X)}^{-1}(p) \right). \tag{7} \]
The above ROC curve can be estimated using the following procedure. Let $\{(x_i^{D_1}, y_i^{D_1})\}_{i=1}^{n_{D_1}}$ and $\{(x_j^{D_0}, y_j^{D_0})\}_{j=1}^{n_{D_0}}$ be two independent random samples drawn from the diseased and healthy populations, respectively. Then $\mu_1(X)$ and $\mu_0(X)$ can be estimated as
\[ \hat{\mu}_1(X) = \psi\left(x, \{(x_i^{D_1}, y_i^{D_1})\}_{i=1}^{n_{D_1}}, h_1, q_1\right), \]
\[ \hat{\mu}_0(X) = \psi\left(x, \{(x_j^{D_0}, y_j^{D_0})\}_{j=1}^{n_{D_0}}, h_0, q_0\right), \]
where $\psi$ is the local polynomial kernel estimator, $h_1$ and $h_0$ are bandwidths or smoothing parameters, and $q_1$ and $q_0$ are the orders of the polynomials for the diseased and healthy populations, respectively. The corresponding variance functions can be estimated as
\[ \hat{\sigma}_1^2(X) = \psi\left(x, \{(x_i^{D_1}, z_i^{D_1})\}_{i=1}^{n_{D_1}}, g_1, r_1\right), \]
\[ \hat{\sigma}_0^2(X) = \psi\left(x, \{(x_j^{D_0}, z_j^{D_0})\}_{j=1}^{n_{D_0}}, g_0, r_0\right), \]
where $z_i^{D_1} = [y_i^{D_1} - \hat{\mu}_1(x_i^{D_1})]^2$ and $z_j^{D_0} = [y_j^{D_0} - \hat{\mu}_0(x_j^{D_0})]^2$, $\psi$ is the local polynomial kernel estimator, $g_1$ and $g_0$ are bandwidths, and $r_1$ and $r_0$ are the orders of the polynomials for the corresponding diseased and healthy populations, respectively [34].
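To make the two-step estimation concrete, the following is a minimal sketch of a local polynomial kernel estimator of order 0 (local constant) and order 1 (local linear), applied first to the marker values and then to the squared residuals. The Gaussian kernel, the fixed bandwidths, the data and the function names are assumptions made for illustration; this is not the implementation used in the paper.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def local_poly(x0, xs, ys, bandwidth, order):
    """Local polynomial kernel estimate at x0 (order 0 or 1 only)."""
    w = [gaussian_kernel((x - x0) / bandwidth) for x in xs]
    s0 = sum(w)
    t0 = sum(wi * y for wi, y in zip(w, ys))
    if order == 0:
        return t0 / s0  # kernel-weighted mean (local constant fit)
    if order != 1:
        raise ValueError("this sketch supports order 0 or 1 only")
    # Closed-form intercept of the kernel-weighted least-squares line.
    s1 = sum(wi * (x - x0) for wi, x in zip(w, xs))
    s2 = sum(wi * (x - x0) ** 2 for wi, x in zip(w, xs))
    t1 = sum(wi * (x - x0) * y for wi, x, y in zip(w, xs, ys))
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1)


# Hypothetical sample from one population (covariate x, marker y).
xs = [float(i) for i in range(10)]
ys = [1.0 + 2.0 * x + (0.3 if i % 2 else -0.3) for i, x in enumerate(xs)]

h_mean, g_var = 1.5, 1.5  # bandwidths fixed by hand for illustration

# Step 1: mean function with a local linear fit (order 1).
mu_hat = [local_poly(x, xs, ys, h_mean, 1) for x in xs]
# Step 2: variance function with a local constant fit (order 0)
# applied to the squared residuals from step 1.
z = [(y - m) ** 2 for y, m in zip(ys, mu_hat)]
var_hat = [local_poly(x, xs, z, g_var, 0) for x in xs]
print(mu_hat[4], var_hat[4])
```

A local linear fit reproduces an exactly linear relationship without bias, which is one common reason for preferring order 1 for the mean function.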
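To implement the estimation procedure, one needs to choose the orders of the polynomials and the bandwidths. As suggested by Rodríguez–Alvarez et al. [17], we choose a polynomial of order 1 ($q_1 = q_0 = 1$) for the mean function and a polynomial of order 0 ($r_0 = r_1 = 0$) for the variance function. The bandwidths were selected automatically by cross-validation. To speed up the cross-validation process, the binning-type acceleration proposed by Fan and Marron [35] was used to obtain binning approximations of $\hat{\mu}_1$, $\hat{\mu}_0$, $\hat{\sigma}_1^2$, and $\hat{\sigma}_0^2$. For more details on bandwidth selection, see Rodríguez–Alvarez et al. [17]. Using the above estimators, the TPR and FPR can be estimated as
\[ \hat{F}_{D_1}(z) = \frac{1}{n_{D_1}} \sum_{i=1}^{n_{D_1}} I\left( \frac{y_i^{D_1} - \hat{\mu}_1(X_i^{D_1})}{\hat{\sigma}_1(X_i^{D_1})} \ge z \right), \]
\[ \hat{F}_{D_0}(z) = \frac{1}{n_{D_0}} \sum_{j=1}^{n_{D_0}} I\left( \frac{y_j^{D_0} - \hat{\mu}_0(X_j^{D_0})}{\hat{\sigma}_0(X_j^{D_0})} \ge z \right). \]
Finally, the estimated covariate-specific ROC curve can be obtained as
\[ \widehat{\mathrm{ROC}}_{X=x}(p) = \hat{F}_{D_1(X)}\left( \frac{\hat{\mu}_0(X) - \hat{\mu}_1(X)}{\hat{\sigma}_1(X)} + \frac{\hat{\sigma}_0(X)}{\hat{\sigma}_1(X)}\, \hat{F}_{D_0(X)}^{-1}(p) \right), \tag{8} \]
where $\hat{F}_{D_0(X)}^{-1}(p) = \sup\{z;\ \hat{F}_{D_0}(z) \ge p\}$. Then the covariate-adjusted ROC curve defined in Equation (4) can be estimated as
\[ \widehat{\mathrm{aROC}}_X(p) = \int \widehat{\mathrm{ROC}}_X(p)\,d\hat{F}_{D_1 X}(X) = \frac{1}{n_{D_1}} \sum_{i=1}^{n_{D_1}} I\left( \frac{y_i^{D_1} - \hat{\mu}_1(X_i^{D_1})}{\hat{\sigma}_1(X_i^{D_1})} > \hat{F}_{D_0(X)}^{-1}(p) \right). \tag{9} \]
Then the estimated covariate-adjusted AUC can be obtained as
\[ \widehat{\mathrm{aAUC}}_X = \int_0^1 \widehat{\mathrm{aROC}}_X(p)\,dp. \tag{10} \]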
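The plug-in form of the covariate-specific ROC curve in Equation (8) can be sketched as follows. The sketch takes the fitted means and standard deviations at a covariate value x as given numbers and uses the empirical functions of the standardized residuals defined with "≥ z" in the text (that is, residual survival functions) together with the stated sup-type inverse. All numerical values and function names are hypothetical; the residuals would in practice come from the fitted mean and variance functions.

```python
import math


def survival(residuals, z):
    """Empirical share of standardized residuals that are >= z."""
    return sum(e >= z for e in residuals) / len(residuals)


def survival_inverse(residuals, p):
    """sup{z : survival(residuals, z) >= p}, for 0 < p < 1."""
    ordered = sorted(residuals)
    n = len(ordered)
    needed = math.ceil(p * n - 1e-12)  # residuals required at or above z
    return ordered[n - needed]


def covariate_specific_roc(p, mu1, mu0, sigma1, sigma0, res1, res0):
    """Plug-in ROC value at FPR p, in the spirit of Equation (8)."""
    z = (mu0 - mu1) / sigma1 + (sigma0 / sigma1) * survival_inverse(res0, p)
    return survival(res1, z)


# Hypothetical standardized residuals for diseased and healthy subjects.
res1 = [-1.0, -0.3, 0.2, 0.9, 1.5]
res0 = [-1.2, -0.4, 0.1, 0.6, 1.3]

# Hypothetical fitted values at one covariate value x.
roc_shifted = covariate_specific_roc(0.2, 30.0, 28.0, 1.0, 1.0, res1, res0)
roc_no_shift = covariate_specific_roc(0.2, 28.0, 28.0, 1.0, 1.0, res1, res0)
print(roc_shifted, roc_no_shift)
```

When the diseased mean exceeds the healthy mean at x, the argument of the diseased residual function moves to the left and the ROC value rises; with identical means and scales the value stays close to p.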
2.3.2. Binary bio-marker
Assume that the bio-marker, $Y$, is binary (positive or negative result based on a threshold value) and takes the value 1 if and only if an unobservable continuous measurement $Y^*$ exceeds a threshold $\theta$; otherwise it takes the value 0. The random variable $Y^*$ is termed the latent response. We can write the probability of a positive response as $\pi_i = \Pr[Y_i = 1] = \Pr[Y^* > \theta]$. To identify the latent variable model, the threshold $\theta$ is often set to zero and $Y^*$ is standardized to have standard deviation one. Under the latent variable modelling framework, the non-parametric regression for the binary bio-marker along with a continuous covariate $X$ can be written as
\[ Y^*_{D_1} = \nu_1(X) + \tau_1(X)\,U_1, \tag{11} \]
\[ Y^*_{D_0} = \nu_0(X) + \tau_0(X)\,U_0, \tag{12} \]
where $\nu_1(X) = \gamma_{01} + \gamma_{11}X$ and $\nu_0(X) = \gamma_{00} + \gamma_{10}X$ are regression functions, $\tau_1(X)$ and $\tau_0(X)$ are variance functions, and $U_1$ and $U_0$ are error terms that are normally distributed with mean 0 and variance 1. However, as the scale of the latent variable is not identified, to interpret the regression coefficients in units of the standard deviation of the latent variable, we take $\tau_1(X) = \tau_0(X) = 1$. The latent variable models can be converted into probit models as follows:
\[ \pi_1 = \Pr[Y^*_{D_1} > 0] = \Pr[U_1 > -\nu_1] = 1 - \Phi(-\nu_1) = \Phi(\nu_1), \]
\[ \pi_0 = \Pr[Y^*_{D_0} > 0] = \Pr[U_0 > -\nu_0] = 1 - \Phi(-\nu_0) = \Phi(\nu_0), \]
where $\Phi(\cdot)$ is the standard normal cumulative distribution function. $\nu_1$ and $\nu_0$ can be estimated either parametrically, by maximum likelihood, or non-parametrically, using kernel estimation [36]. Here, we estimated both $\nu_1$ and $\nu_0$ following the non-parametric estimation described in Section 2.3.1. Then, following Equation (8), the ROC curve can be estimated for the binary bio-marker as
\[ \widehat{\mathrm{ROC}}^{b}_{X=x}(p) = \hat{F}_{D_1(X)}\left( \hat{\nu}_0(X) - \hat{\nu}_1(X) + \hat{F}_{D_0(X)}^{-1}(p) \right), \]
where the TPR and FPR can be estimated with the observed binary response $y$ as
\[ \hat{F}_{D_1}(z) = \frac{1}{n_{D_1}} \sum_{i=1}^{n_{D_1}} I\left( y_i^{D_1} - \hat{\nu}_1(X_i^{D_1}) \ge z_i \right), \]
\[ \hat{F}_{D_0}(z) = \frac{1}{n_{D_0}} \sum_{j=1}^{n_{D_0}} I\left( y_j^{D_0} - \hat{\nu}_0(X_j^{D_0}) \ge z_j \right), \]
with $z_i^{D_1} = [y_i^{D_1} - \hat{\nu}_1(x_i^{D_1})]^2$ and $z_j^{D_0} = [y_j^{D_0} - \hat{\nu}_0(x_j^{D_0})]^2$. Finally, the covariate-adjusted ROC curve for the binary bio-marker, denoted by $\widehat{\mathrm{aROC}}^{b}_X(p)$, can be estimated as
\[ \widehat{\mathrm{aROC}}^{b}_X(p) = \int \widehat{\mathrm{ROC}}^{b}_X(p)\,d\hat{F}_{D_1(X)}(X) = \frac{1}{n_{D_1}} \sum_{i=1}^{n_{D_1}} I\left( y_i^{D_1} - \hat{\nu}_1(X_i^{D_1}) > \hat{F}_{D_0(X)}^{-1}(p) \right), \]
where $\hat{F}_{D_0(X)}^{-1}(p) = \sup\{z;\ \hat{F}_{D_0}(z) \ge p\}$.
3. Propensity score-based adjustment for covariate effects in ROC analysis
The non-parametric induced regression discussed in the previous section can adjust for the effect of only a single covariate at a time; however, multiple covariates often need to be adjusted for in ROC analysis. In this paper, an adjustment for several covariates in estimating the ROC curve is made using the propensity score, which is derived from a linear combination of several covariates. Propensity score-based adjustment for confounding effects is very popular in epidemiological research. We were motivated by the studies of Janes and Pepe [14,16], who suggested that the adjustment for covariate effects in ROC analysis is analogous to the adjustment for confounders in the odds ratio in epidemiological studies. We therefore apply this approach to the non-parametric ROC regression analysis to adjust for several covariates. The procedure, with mathematical justification, is described in the following sub-sections, starting with an introduction to the propensity score.
3.1. Propensity score
The propensity score (PS), the probability of being treated (Z = 1) given a set of baseline covariates (X), is used for estimating the causal effect of a treatment/exposure in epidemiological observational studies where random assignment is not feasible [23,37]. The PS is used in such a way that the resulting treatment (Z = 1) and control (Z = 0) groups have covariate values similar to those created through random assignment [38]. This is called covariate balancing between the treatment and control groups; statistically, it is termed conditional independence of treatment and covariates given the PS and is denoted by Z ⊥ X | PS. Several PS methods are available in the literature, including 'matching', 'stratification', 'weighting' and 'covariate adjustment'. Of these, 'PS-covariate adjustment' is a popular one, which uses the PS as a covariate in the outcome model to estimate the effect of the treatment/exposure on the outcome. The resulting estimate can be interpreted as the effect of treatment on the outcome compared with the untreated group, with the same PS (i.e. the same covariate values) for both groups.
The generalized propensity score (GPS) is widely used for estimating the causal effect of a continuous treatment or exposure on the outcome [28,39,40]. Therefore, if the bio-marker $Y$ is continuous (here the notation for treatment $Z$ is replaced by bio-marker $Y$), the GPS can be used to estimate the PS. Let $r(y, x)$ be the conditional density of $y$ given the covariates $x$, which has the following form:
\[ r(y, x) = f_{Y|X}(y \mid x) = N(\beta X, \sigma^2). \]
Then the generalized propensity score (gPS) can be obtained as
\[ gPS = \hat{r}(y, x) = \frac{1}{\sqrt{2\pi\hat{\sigma}^2}} \exp\left( -\frac{1}{2\hat{\sigma}^2} (Y_i - \hat{\beta}X)^2 \right), \tag{13} \]
where $\hat{\beta}$ and $\hat{\sigma}$ are the estimates (by maximum likelihood or ordinary least squares) of the regression coefficient and the standard deviation of the random error, respectively, in the propensity score model.
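Equation (13) can be evaluated in two steps: fit a linear model of the bio-marker on the covariates, then plug the fitted values and the residual variance into the normal density. The sketch below assumes an intercept in the model, uses the maximum likelihood variance estimate (division by n), and solves the normal equations with a small Gauss-Jordan routine. The data, covariate meanings and function names are hypothetical.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(covariates, y):
    """Least-squares fit of y on an intercept plus the covariates."""
    design = [[1.0] + [float(v) for v in row] for row in covariates]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, d)) for d in design]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    sigma2 = sum(e * e for e in resid) / len(y)  # ML estimate (divides by n)
    return beta, fitted, resid, sigma2


def generalized_ps(y, fitted, sigma2):
    """Normal density of each y at its fitted mean, as in Equation (13)."""
    norm = math.sqrt(2.0 * math.pi * sigma2)
    return [
        math.exp(-((yi - fi) ** 2) / (2.0 * sigma2)) / norm
        for yi, fi in zip(y, fitted)
    ]


# Hypothetical data: (age, sex indicator) and a continuous marker.
covariates = [(35.0, 0), (42.0, 1), (50.0, 0), (58.0, 1),
              (61.0, 1), (47.0, 0), (39.0, 1), (66.0, 0)]
marker = [24.1, 26.0, 25.2, 28.3, 27.9, 26.4, 24.8, 27.5]

beta_hat, fitted, resid, sigma2_hat = fit_ols(covariates, marker)
gps = generalized_ps(marker, fitted, sigma2_hat)
print(beta_hat, sigma2_hat, gps[:3])
```

Note that the gPS computed this way depends on the observed marker value as well as on the covariates, since it is the fitted density evaluated at that value.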
If the bio-marker $Y$ is binary, the PS can be estimated using logistic regression including a set of covariates $X$ as:
\[ bPS = P(Y = 1 \mid X) = [1 + \exp(-\beta X)]^{-1}. \tag{14} \]
The PS estimated from the above models can be considered an example of a single-index, in that the information on the effects of several covariates on the response is completely captured through the linear predictor, $\beta X$.
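For the binary case, the single-index property can be illustrated directly: the bPS in Equation (14) is a strictly increasing function of the linear predictor, so ranking subjects by either quantity gives the same order and the linear predictor can be recovered from the bPS with the logit. The coefficients and covariate values below are hypothetical and are not estimates from the paper; in practice the coefficients would come from a fitted logistic regression.

```python
import math


def expit(eta):
    """Inverse logit, the right-hand side of Equation (14)."""
    return 1.0 / (1.0 + math.exp(-eta))


def logit(prob):
    """Inverse of expit; maps a probability back to the linear predictor."""
    return math.log(prob / (1.0 - prob))


# Hypothetical coefficients (intercept, age, sex), for illustration only.
beta = [-4.0, 0.06, 0.8]
covariates = [(35.0, 0), (50.0, 1), (62.0, 0), (44.0, 1)]

linear_predictor = [beta[0] + beta[1] * x1 + beta[2] * x2 for x1, x2 in covariates]
bps = [expit(eta) for eta in linear_predictor]

# Ranking subjects by the linear predictor or by bPS gives the same order.
order_by_eta = sorted(range(len(covariates)), key=lambda i: linear_predictor[i])
order_by_ps = sorted(range(len(covariates)), key=lambda i: bps[i])
print(bps, order_by_eta == order_by_ps)
```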
3.2. Using propensity score in ROC regression
The PSs obtained from the above models (Equations (13)–(14)) are the result of a linear combination of several covariates, and hence the PS can be considered a dimension reduction technique for a covariate set [41]. In dimension reduction regression [42,43], the main goal is to reduce the dimension of $X$ without losing information on $Y \mid X$. This leads to the assumption that, without loss of information, the conditional distribution $F(Y \mid X)$ can be indexed by a (or a set of) linear combination(s) of $X$. That is, the distribution of $Y \mid X$ is the same as that of $Y \mid PS$ (equivalently $Y \mid \beta X$) for all values of $X$, and mathematically we can write $F(Y \mid X) = F(Y \mid PS)$. It implies that the covariate matrix of order $p \times n$ can be replaced by the predictor matrix of order $1 \times n$. In other words, according to the conditional independence theory given by Dawid [44], the above dimension reduction can be explained as the conditional independence of $Y$ and $X$ given PS (or $\beta X$). Based on this background, we propose to use the PS in ROC analysis instead of the several covariates $X$. Now, using PS (used here as a general notation for both gPS and bPS) instead of $X$, the TPR and FPR for $Y \mid PS$ can be defined as
\[ \mathrm{TPR}_{PS}(c) = 1 - F_{D_1(PS)}(c) = \Pr(Y_{D_1} > c \mid PS), \tag{15} \]
\[ \mathrm{FPR}_{PS}(c) = 1 - F_{D_0(PS)}(c) = \Pr(Y_{D_0} > c \mid PS). \tag{16} \]
Then the PS-specific ROC curve can be defined as
\[ \mathrm{ROC}_{PS}(p) = 1 - F_{D_1(PS)}\left[F_{D_0(PS)}^{-1}(1-p)\right]. \tag{17} \]
The PS-adjusted ROC curve can be defined as
\[ \mathrm{aROC}_{PS}(p) = \int \mathrm{ROC}_{PS}(p)\,dF_{PS_{D_1}}(PS). \tag{18} \]
Then the AUC can be obtained as
\[ \mathrm{aAUC}_{PS} = \int_0^1 \mathrm{aROC}_{PS}(p)\,dp. \]
One can use the linear predictor $\beta X$ of the PS model instead of the PS because a one-to-one correspondence holds between PS and $\beta X$. Using PS or $\beta X$ as a covariate achieves the adjustment for several covariates ($X$), because the PS or $\beta X$ serves as a representative of the several covariates by means of their linear combination. Now, given $F(Y \mid X) = F(Y \mid PS)$ for all values of $X$, we can state that the components in Equations (1)–(3) conditioning on $X$ are equal to those in Equations (15)–(17) conditioning on PS, respectively, i.e. $\mathrm{TPR}_X(c) = \mathrm{TPR}_{PS}(c)$, $\mathrm{FPR}_X(c) = \mathrm{FPR}_{PS}(c)$ and $\mathrm{ROC}_X(p) = \mathrm{ROC}_{PS}(p)$. These results imply equality of Equation (4) and Equation (18), i.e. $\mathrm{aROC}_{PS}(p) = \mathrm{aROC}_X(p)$. The corresponding PS-adjusted ROC curve, $\mathrm{aROC}_{PS}$, and area under the curve, $\mathrm{aAUC}_{PS}$, have similar interpretations to the $X$-adjusted curve and area, $\mathrm{aROC}_X$ and $\mathrm{aAUC}_X$, but refer to adjustment for the effect of PS rather than $X$. In other words, as discussed by Janes and Pepe [14], the adjustment for the covariate $X$ in ROC analysis ensures the appropriate distribution of the bio-marker between diseased and healthy subjects, which is not possible without adjustment for $X$ because it confounds the true distribution of the bio-marker between the groups. It can be argued that the same adjustment in the distribution of the diseased and healthy bio-marker will appear when $X$ is replaced by PS, because a one-to-one correspondence holds between $X$ and PS according to the PS model described above. However, the main advantage of the PS approach is that it can accommodate the adjustment for several covariates by means of their linear combination connected to the PS, which is difficult in the existing approach.
The PS-adjusted ROC curve defined in Equation (18) can be obtained by replacing the single covariate $X$ by the estimated propensity score (PS) in the non-parametric induced regression Equations (5)–(6). That is, for the continuous bio-marker $Y$, the non-parametric regression Equations (5)–(6) take the following form:
\[ Y_{D_1} = \mu_1(gPS) + \sigma_1(gPS)\,\varepsilon_1, \]
\[ Y_{D_0} = \mu_0(gPS) + \sigma_0(gPS)\,\varepsilon_0. \]
For the binary bio-marker $Y$, the PS-adjusted versions of the non-parametric regression Equations (11)–(12) become:
\[ Y^*_{D_1} = \nu_1(bPS) + \tau_1(bPS)\,U_1, \]
\[ Y^*_{D_0} = \nu_0(bPS) + \tau_0(bPS)\,U_0. \]
The PS-specific (Equation (17)) and PS-adjusted (Equation (18)) ROC curves for both binary and continuous bio-markers can be estimated following the non-parametric induced regression procedure described in Sections 2.3.1–2.3.2, but replacing the single covariate $X$ by the PS. Therefore, the estimated PS-adjusted ROC curve and AUC have a similar interpretation to those obtained by adjusting for a single covariate. For example, a value of 0.75 for the estimated PS-adjusted AUC ($\mathrm{aAUC}_{PS}$) indicates that, for a randomly selected pair of subjects (diseased vs healthy), there is a 75% chance of the diseased subject having a greater bio-marker value than the healthy counterpart, given that both subjects have the same PS value, i.e. the same value for each of the covariates.
Again, following Rodríguez–Alvarez et al. [17], we propose to use the bootstrap variance for constructing confidence intervals and for testing the effect of PS on the PS-specific ROC curve with hypothesis $H_0: \mathrm{ROC}_{PS}(p) = \mathrm{ROC}(p)$. For more details about the bootstrap variance estimation procedure in the non-parametric induced ROC regression, see Section 3 of the paper by Rodríguez–Alvarez et al. [17].
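As a generic illustration of bootstrap-based interval estimation for an accuracy summary, the sketch below resamples diseased and healthy subjects separately and forms a percentile interval for the unadjusted AUC. This is an assumption-laden stand-in, not the procedure of Rodríguez–Alvarez et al. [17]: in the PS setting the statistic would be the PS-adjusted estimate, and re-estimating the PS within each resample would be a natural (assumed) choice. The data, seed and function names are hypothetical.

```python
import random


def auc_pairwise(diseased, healthy):
    """Estimate P(Y_D1 > Y_D0); tied pairs count one half."""
    wins = 0.0
    for yd in diseased:
        for yh in healthy:
            wins += 1.0 if yd > yh else (0.5 if yd == yh else 0.0)
    return wins / (len(diseased) * len(healthy))


def bootstrap_auc_interval(diseased, healthy, n_boot=500, level=0.95, seed=1):
    """Percentile bootstrap interval, resampling within each group."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        d_star = rng.choices(diseased, k=len(diseased))
        h_star = rng.choices(healthy, k=len(healthy))
        stats.append(auc_pairwise(d_star, h_star))
    stats.sort()
    tail = (1.0 - level) / 2.0
    lower = stats[round(tail * (n_boot - 1))]
    upper = stats[round((1.0 - tail) * (n_boot - 1))]
    return lower, upper


# Hypothetical marker values, for illustration only.
diseased = [27.9, 31.2, 29.4, 33.0, 30.1, 28.7]
healthy = [24.3, 26.8, 25.1, 29.0, 23.7, 27.2, 25.9, 26.1]

point_estimate = auc_pairwise(diseased, healthy)
interval = bootstrap_auc_interval(diseased, healthy)
print(point_estimate, interval)
```

Fixing the seed makes the interval reproducible; resampling within groups keeps the numbers of diseased and healthy subjects fixed across bootstrap replicates.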
4. Simulation study
A simulation study was conducted to evaluate the performance of the PS-based covariate adjustment in ROC analysis. Separate simulation series were conducted for continuous and binary bio-markers. In each simulation series, we first generated two independent covariates, one continuous ($X_1$) and the other dichotomous ($X_2$), separately for the diseased ($D_1$) and healthy ($D_0$) populations, as
\[ X_{1D_1} \sim N(\mu_{D_1}, \sigma^2_{D_1}); \quad X_{1D_0} \sim N(\mu_{D_0}, \sigma^2_{D_0}), \]
\[ X_{2D_1} \sim \mathrm{Bin}(n, \pi_{D_1}); \quad X_{2D_0} \sim \mathrm{Bin}(n, \pi_{D_0}). \]
We considered $\mu_{D_1} = 51.54$, $\mu_{D_0} = 38.19$, $\sigma^2_{D_1} = 224.13$, $\sigma^2_{D_0} = 182.33$, $\pi_{D_1} = 0.67$, $\pi_{D_0} = 0.42$. Different values of the parameters associated with the covariate distributions for the diseased and healthy populations imply an imbalanced covariate distribution between the populations, for which adjustment for the covariate effects is necessary. Then the continuous bio-marker values for the diseased and healthy populations were generated from the following models:
\[ Y_{D_1} = \beta_{01} + \beta_{11} X_{1D_1} + \beta_{21} X_{2D_1} + \varepsilon_{D_1}, \tag{19} \]
\[ Y_{D_0} = \beta_{00} + \beta_{10} X_{1D_0} + \beta_{20} X_{2D_0} + \varepsilon_{D_0}, \tag{20} \]
where $\varepsilon_{D_1} \sim N(0, 1)$ and $\varepsilon_{D_0} \sim N(0, 1)$. The values of the parameters were set to $\beta_{01} = 29.99$, $\beta_{11} = -0.013$, $\beta_{21} = 2.06$, $\beta_{00} = 21.50$, $\beta_{10} = 0.121$, $\beta_{20} = -0.77$. Different values of the parameters of the models for diseased and healthy subjects imply an association of covariates $X_1$ and $X_2$ with both the bio-marker and the disease.
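Similarly, the binary bio-marker values were generated separately for the diseased and healthy populations as
\[ Y^{b}_{D_1} = \begin{cases} 1, & \text{if } Y^*_{D_1} > 0, \\ 0, & \text{otherwise,} \end{cases} \qquad Y^{b}_{D_0} = \begin{cases} 1, & \text{if } Y^*_{D_0} > 0, \\ 0, & \text{otherwise,} \end{cases} \]
where the latent variables $Y^*_{D_1}$ and $Y^*_{D_0}$ are generated from the following models:
\[ Y^*_{D_1}(x) = \gamma_{01} + \gamma_{11} X_{1D_1} + \gamma_{21} X_{2D_1} + U_{D_1}, \]
\[ Y^*_{D_0}(x) = \gamma_{00} + \gamma_{10} X_{1D_0} + \gamma_{20} X_{2D_0} + U_{D_0}, \]
where $U_{D_1} \sim N(0, 1)$ and $U_{D_0} \sim N(0, 1)$. We considered $\gamma_{01} = -1.5$, $\gamma_{11} = 0.123$, $\gamma_{21} = 0.772$, $\gamma_{00} = -2.46$, $\gamma_{10} = 0.013$, $\gamma_{20} = 2.06$. As described earlier, different values of the parameters of the two models imply an association of the covariates with both the bio-marker and the disease outcome.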
For both simulation series, we considered different scenarios by varying the sample size over 150, 300, 500 and 1000, with 30% of the subjects diseased and the rest healthy. This implies that P(D) = 0.30. Further simulation scenarios were created by considering X1 and X2 correlated. For each simulation scenario, we estimated the propensity score (using Equation (13) for the continuous marker and Equation (14) for the binary marker) from a linear combination of X1 and X2 and used it in the non-parametric ROC regression described in Section 2.3 to estimate the PS-adjusted ROC curve. The average over the 500 estimated curves was compared with the true ROC curve (obtained by integrating the covariate-specific ROC curves over the covariate distribution among the cases) to evaluate the performance of the PS-based adjustment in the ROC curve. The R package 'npROCRegression' was used to estimate the non-parametric PS-based adjusted ROC curve.
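The data-generating step for the continuous bio-marker (Equations (19)–(20)) can be sketched as follows, using the parameter values stated above. The sketch reads the dichotomous covariate as one Bernoulli draw per subject, which is an interpretation of the Bin(n, π) notation, and the seed, dictionary layout and function names are hypothetical. It covers only one replicate of the independent-covariate scenario, not the PS estimation or the ROC regression.

```python
import math
import random

# Parameter values as stated in the text for the continuous bio-marker.
PARAMS = {
    "diseased": {"mu": 51.54, "var": 224.13, "pi": 0.67,
                 "beta": (29.99, -0.013, 2.06)},
    "healthy": {"mu": 38.19, "var": 182.33, "pi": 0.42,
                "beta": (21.50, 0.121, -0.77)},
}


def simulate_group(n, mu, var, pi, beta, rng):
    """Draw n subjects as tuples (x1, x2, y) from one population."""
    rows = []
    for _ in range(n):
        x1 = rng.gauss(mu, math.sqrt(var))   # continuous covariate
        x2 = 1 if rng.random() < pi else 0   # Bernoulli covariate (assumed)
        y = beta[0] + beta[1] * x1 + beta[2] * x2 + rng.gauss(0.0, 1.0)
        rows.append((x1, x2, y))
    return rows


def simulate_dataset(n_total, prevalence=0.30, seed=2022):
    """One simulated data set with the stated share of diseased subjects."""
    rng = random.Random(seed)
    n_diseased = round(prevalence * n_total)
    return {
        "diseased": simulate_group(n_diseased, rng=rng, **PARAMS["diseased"]),
        "healthy": simulate_group(n_total - n_diseased, rng=rng,
                                  **PARAMS["healthy"]),
    }


data = simulate_dataset(300)
mean_y_diseased = sum(r[2] for r in data["diseased"]) / len(data["diseased"])
mean_y_healthy = sum(r[2] for r in data["healthy"]) / len(data["healthy"])
print(len(data["diseased"]), len(data["healthy"]), mean_y_diseased, mean_y_healthy)
```

Repeating this step with different seeds, estimating the PS in each replicate, and averaging the resulting PS-adjusted ROC curves would mirror the replication scheme described above.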
Figure 1 shows the whole sampling distribution of the PS-adjusted ROC curves for the continuous bio-marker over 500 replications and their mean ROC curve with the superimposed true ROC curve. The results revealed that the propensity score-adjusted ROC curve is estimated reasonably well (very close to the true curve). The simulation standard error (or mean squared error) decreased with increasing sample size, indicating a consistent estimate of the curve. Similar results were observed for the binary bio-marker (Figure 2). Simulations with correlated covariates also suggested similar results (results not shown).
Figure 1. Estimated ROC curve (dashed line) from the average over 500 replicated curves with a superimposed true ROC curve (solid line) for a continuous bio-marker linearly related with covariates. FPR refers to false positive rate and PS to propensity score, n.h to sample size in the healthy population and n.d in the diseased population. Maximum Monte Carlo Error = 0.0295.
Figure 2. PS-adjusted ROC curve (dashed line) from the average over 500 replicated curves with a
superimposed true ROC curve (solid line) for a binary bio-marker. FPR refers to false positive rate
and PS to propensity score, n.h to the sample size in the healthy population and n.d to that in the
diseased population. Maximum Monte Carlo Error = 0.0367.

We performed another simulation study to investigate whether mis-specification of the
propensity score model affects the PS-adjusted ROC results. For this, two more covariates,
$X_{3D_1} \sim N(0, 1)$, $X_{3D_0} \sim N(0, 1)$ and $X_{4D_1} \sim \mathrm{Ben}(n, 0.7)$,
$X_{4D_0} \sim \mathrm{Ben}(n, 0.3)$, were independently generated in addition to the covariates
$X_1$ and $X_2$ generated in the earlier simulation, for a total of 300 subjects (keeping 30%
diseased). The corresponding bio-marker values, $Y_{D_1}$ and $Y_{D_0}$, were generated for the
same subjects as previously, but with $X_3$ added to the previous true models (Equations (19)–(20)).
The coefficients $(\beta_{31}, \beta_{30})$ associated with $X_3$ in the extended true models were
0.5 and 0.2, respectively. We then fitted four different models. One is the correctly specified
model, $E(Y \mid X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$, and the other three are
different forms of mis-specified models: Model 1, $E(Y \mid X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$,
where $X_3$ is missing; Model 2, $E(Y \mid X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_4$,
where $X_3$ is missing and $X_4$ was treated as noise; and Model 3,
$E(Y \mid X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4$, where $X_4$
was considered as noise. The results revealed that the effect of mis-specification of the PS
model on the estimation of the true ROC curve is negligible (Figure 3), and this holds for all
three forms of mis-specification.
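To make the four candidate models concrete, the following sketch fits each of them by ordinary least squares and extracts the fitted linear score. Several things here are assumptions rather than details from the text: the distributions of $X_1$ and $X_2$, the intercept shift and the $X_1$, $X_2$ coefficients of the data-generating model (Equations (19)–(20) are not reproduced here), the reading of $X_4$ as a binary covariate with success probability 0.7 or 0.3, and the choice to fit one pooled regression. The helper names `ols_fit` and `linear_score` are hypothetical.

```python
import random
from statistics import pvariance

def ols_fit(y, columns):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0, *vals] for vals in zip(*columns)]
    p = len(rows[0])
    # Augmented system [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(p):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [v - f * w for v, w in zip(a[r], a[c])]
    return [a[i][p] / a[i][i] for i in range(p)]

def linear_score(beta, columns):
    """Fitted linear predictor beta0 + beta' x for every subject."""
    return [beta[0] + sum(b * v for b, v in zip(beta[1:], vals))
            for vals in zip(*columns)]

rng = random.Random(2022)
n, n_d = 300, 90                                  # 300 subjects, 30% diseased (from the text)
d = [1] * n_d + [0] * (n - n_d)
x1 = [rng.gauss(0, 1) for _ in d]                 # assumed distribution
x2 = [float(rng.random() < 0.5) for _ in d]       # assumed distribution
x3 = [rng.gauss(0, 1) for _ in d]                 # N(0, 1) in both groups (from the text)
x4 = [float(rng.random() < (0.7 if di else 0.3)) for di in d]  # assumed binary reading
# Placeholder true model: only the X3 coefficients (0.5 diseased, 0.2 healthy) come from the text.
y = [1.0 * di + 0.5 * a + 0.3 * b + (0.5 if di else 0.2) * c + rng.gauss(0, 1)
     for di, a, b, c in zip(d, x1, x2, x3)]

candidates = {
    "correct": [x1, x2, x3],
    "model 1": [x1, x2],           # X3 missing
    "model 2": [x1, x2, x4],       # X3 missing, X4 included
    "model 3": [x1, x2, x3, x4],   # X4 added to the correct model
}
scores = {name: linear_score(ols_fit(y, cols), cols) for name, cols in candidates.items()}
# Variance of each fitted score: a quick view of how much marker variation it carries.
spread = {name: pvariance(s) for name, s in scores.items()}
```

Each entry of `scores` could then replace the covariate in a single-covariate ROC regression, which is the step the simulation repeats for the four models.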
We further performed a simulation study to investigate the robustness of the non-parametric
induced regression method to a non-linear function of covariates, that is, to the case where the
bio-marker in both the healthy and diseased populations is non-linearly related with the
covariates. For this we extended the models in Equations (19)–(20) by adding the covariates
$X_{3D_1}$ and $X_{3D_0}$ together with their quadratic terms. The true coefficient values for the
linear and quadratic terms of the new covariates were set to 0.5, 0.2, 0.05, and 0.02,
respectively. The results in Figure 4 show that the method appeared to be robust to a
non-linear relationship between the bio-marker and the covariates and provided consistent
estimates with increasing sample size.
Figure 3. PS-adjusted ROC curve (dashed line) from the average over 500 replicated curves with a
superimposed true ROC curve (solid line) for different mis-specified models. FPR refers to false
positive rate and PS to propensity score. Maximum Monte Carlo Error = 0.0287.

5. Application
Body mass index (BMI) is an influential risk factor for both hypertension and diabetes,
which lead to many cardiovascular diseases such as stroke, cardiac disease, etc. [45,46].
Considering BMI as the bio-marker, the methods are applied to evaluate BMI in classifying
the subjects with hypertension (and, separately, diabetes) from the healthy subjects
aged over 35 years in the presence of several covariates such as age, gender, education,
socio-economic status, etc.
5.1. Data and summary measure
Data used to illustrate the method were extracted from the nationally representative
2011 Bangladesh Demographic and Health Survey (BDHS), which was conducted
through a collaborative effort of the National Institute of Population Research and Training
(NIPORT), Mitra and Associates and ICF International under the worldwide
Demographic and Health Surveys program. Details of the survey methodology can be found in the
survey report [47]. The bio-marker measurements for both blood pressure and fasting glucose
level were collected from all household members aged 35 years and above in every
third selected household in each primary sampling unit selected throughout the
country. For measuring blood pressure, the survey used the WHO-recommended LIFE
SOURCE UA-76 Plus Blood Pressure Monitor model. Trained technicians took three
measurements of both systolic and diastolic blood pressure at 10-minute intervals,
and the average of the second and third measurements was
used to report respondents' blood pressure. In addition to these measurements, respondents
were asked whether they had been told by a doctor or nurse that they have high
blood pressure and were therefore taking prescribed medication to lower it. A subject
is classified as hypertensive (elevated blood pressure) if his/her systolic blood pressure
(SBP) value was equal to or greater than 140 mmHg, or his/her diastolic blood pressure (DBP)
value was equal to or greater than 90 mmHg, or he/she was taking medication to lower blood
pressure.

Figure 4. Estimated ROC curve (dashed line) with a superimposed true ROC curve (solid line) for a
continuous bio-marker non-linearly related with covariates. FPR refers to false positive rate and
PS to propensity score, n.h to the sample size in the healthy population and n.d to that in the
diseased population. Maximum Monte Carlo Error = 0.0427.
For measuring fasting plasma glucose level, the respondents were requested in advance
to fast overnight until a capillary blood sample (whole blood obtained from the middle
or ring finger) was taken in the morning using the HemoCue 201+ blood glucose analyzer.
The respondents were further asked whether they had been told by a doctor or nurse that
they had diabetes prior to the survey and were taking medication for it.
A subject with a fasting plasma glucose level equal to or higher than 7.0 mmol/L, or who was
taking medication, is classified as a patient with diabetes. For more details about the survey
and these measurements, see elsewhere [45–47]. Finally, bio-marker information related to
hypertension and diabetes was collected from a total of 7328 (male 3744, female
3584) and 7593 (male 3753, female 3840) subjects, respectively, who gave their consent
to provide the bio-marker information.
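The two case definitions above translate directly into code. The sketch below is a minimal illustration; the function and argument names are hypothetical, and it assumes the three blood pressure readings are supplied in the order they were taken.

```python
from statistics import mean
from typing import Sequence

def is_hypertensive(sbp: Sequence[float], dbp: Sequence[float],
                    on_bp_medication: bool) -> bool:
    """Apply the hypertension rule to three systolic and three diastolic readings."""
    if len(sbp) != 3 or len(dbp) != 3:
        raise ValueError("expected three systolic and three diastolic readings")
    # The first reading is discarded; the second and third are averaged.
    mean_sbp = mean(sbp[1:])
    mean_dbp = mean(dbp[1:])
    return mean_sbp >= 140 or mean_dbp >= 90 or on_bp_medication

def is_diabetic(fasting_glucose_mmol_l: float, on_diabetes_medication: bool) -> bool:
    """Apply the diabetes rule: fasting plasma glucose >= 7.0 mmol/L or on medication."""
    return fasting_glucose_mmol_l >= 7.0 or on_diabetes_medication
```

For instance, readings of 150, 138 and 140 mmHg systolic with diastolic values below 90 and no medication average to 139 mmHg over the last two readings, so the subject is not classified as hypertensive despite the high first reading.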
Table 1. Summary statistics for the selected variables.

Variables                        n      %/Mean   95% CI
Hypertension
  Hypertension                   1865   25.52    [24.53, 26.54]
  No hypertension                5442   74.47    [73.46, 75.46]
Diabetes
  Diabetic                        850   11.63    [10.92, 12.39]
  Non-diabetic                   6457   88.37    [87.61, 89.08]
Age (years, mean)                7307   51.05    [50.76, 51.34]
Gender
  Male                           3616   49.49    [48.34, 50.63]
  Female                         3691   50.52    [49.36, 51.66]
BMI
  BMI < 25                       6213   85.03    [84.19, 85.82]
  BMI ≥ 25                       1094   14.97    [14.17, 15.80]
Type of place of residence
  Urban                          2402   32.87    [31.80, 33.95]
  Rural                          4905   67.13    [66.04, 68.19]
Wealth index
  Poorest                        1292   17.68    [16.82, 18.57]
  Poorer                         1313   17.97    [17.10, 18.86]
  Middle                         1418   19.41    [18.51, 20.32]
  Richer                         1528   20.91    [19.99, 21.85]
  Richest                        1756   24.03    [23.06, 25.02]
Educational level
  No education, preschool        3248   44.94    [43.80, 46.09]
  Primary                        2025   27.71    [26.69, 28.75]
  Secondary                      1372   18.78    [17.89, 19.69]
  Higher                          626    8.57    [7.94, 9.23]
A set of individual-level covariates was available, including age (years), sex (male, female),
education (no education, primary, secondary, higher), area of residence (urban, rural), and
household's socio-economic status (poorest, poorer, middle, richer, richest). The socio-economic
status was determined by dividing into five equal parts a wealth index calculated using principal
component analysis of assets owned by the household, with the bottom 20% as the poorest
and the top 20% as the richest. The status of both hypertension (yes, no) and diabetes (yes,
no) were considered as two outcome variables and BMI (a composite index of height and
weight) as the bio-marker. However, the status of diabetes was considered as a covariate when we
evaluated the performance of BMI in classifying hypertensive patients, and vice versa.
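The wealth-index construction described above can be illustrated with a small sketch: standardise the asset indicators, take the first principal component as the index, and cut it into five equal parts. This is a simplified stand-in and not the survey's actual procedure; the function name, the use of power iteration to find the first component, and the unweighted quintile cut are assumptions, and the code expects every asset column to vary across households.

```python
from bisect import bisect_right
from math import sqrt
from statistics import mean, pstdev, quantiles

def wealth_quintiles(rows, n_iter=200):
    """Return labels 0 (poorest) to 4 (richest) from an n x k list of asset values."""
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]  # assumes no constant column (sd > 0)
    z = [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]
    k, n = len(cols), len(rows)
    # Correlation matrix of the standardised assets.
    corr = [[sum(zr[i] * zr[j] for zr in z) / n for j in range(k)] for i in range(k)]
    # Power iteration for the leading eigenvector (first principal component).
    # Starting from a vector of ones keeps the orientation "more assets, higher index"
    # when the assets are positively correlated.
    w = [1.0] * k
    for _ in range(n_iter):
        w = [sum(c * v for c, v in zip(row, w)) for row in corr]
        norm = sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    index = [sum(a * b for a, b in zip(zr, w)) for zr in z]
    cuts = quantiles(index, n=5, method="inclusive")  # 20th, 40th, 60th, 80th percentiles
    return [bisect_right(cuts, v) for v in index]
```

The returned labels map onto the categories poorest, poorer, middle, richer and richest in increasing order of the index.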
Table 1 presents the summary statistics of all variables included in the analysis. It can be
observed that about 25% and 12% of the subjects are hypertensive and diabetic,
respectively. The average age of all subjects is 51.05 years, with an almost equal number of
males and females. Of the total sample, 1094 (14.97%) subjects have a BMI of 25 or above, 32.87%
live in urban areas and 67.13% in rural areas, and the majority have no education or only primary
education (approximately 44.94% and 27.71%, respectively).
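As a worked check on the proportions in Table 1, a 95% interval for a single proportion can be computed from the count and the sample size. The text does not say which interval was used, so the Wilson score interval below is an assumption chosen for illustration, and it ignores any survey weighting or clustering.

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.959964):
    """Return (proportion, lower, upper) for a Wilson score interval."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return p, centre - half_width, centre + half_width

# Hypertension row of Table 1: 1865 hypertensive subjects out of 7307.
proportion, lower, upper = wilson_interval(1865, 7307)
```

Multiplying the three returned values by 100 puts them on the percentage scale used in the table, where they can be compared with the reported row.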
5.2. Performance of BMI in classifying patients with diabetes and hypertension
Literature and some exploratory analyses (results not shown) suggest a possible association
of all such covariates (age, gender, area of residence and socio-economic status) with
both BMI and the diseases, diabetes and hypertension [45,46,48]. Here we adjusted for
different combinations of covariates in the ROC curve through PS adjustment
to evaluate the performance of BMI in classifying hypertensive or diabetic
patients. Separate analyses were performed for diabetes and hypertension. When the status
of diabetes was considered as the outcome, the status of hypertension was considered as
a covariate, and vice versa.
Figures 5 and 6 show the results for diabetes and hypertension, respectively.
In each figure, part (a) depicts the covariate-adjusted ROC curves for different
combinations of covariates, part (b) depicts the 95% bootstrap-based confidence interval
for the true ROC curve adjusted for all possible covariates, and part (c) illustrates
the AUC of the corresponding PS-specific ROC curves. For both diseases, a small
variation can be observed in the ROC curves when adjustment is made for different
combinations of covariates (Figures 5(a) and 6(a)). Of the different combinations, the greatest
classification accuracy for both hypertension and diabetes was observed when all possible
covariates were adjusted for simultaneously. Moreover, a significant (p < 0.01) variation can
be observed in the PS-specific (covariate-specific) ROC curves and their areas (AUC) when
adjustment is made for all possible covariates (Figures 5(c) and 6(c)), which suggests that the
performance of BMI in classifying the subjects with disease from their healthier counterparts
is affected by the PS values, i.e. by the covariates. For example, greater discriminating ability
can be observed for the lower PS values, i.e. for greater $\hat{\beta}X$, reflecting female
subjects with higher age and education and better economic condition. The predictive performance
of BMI appeared to be similar for both diseases (hypertension and diabetes). Moreover, similar
results can be observed when BMI was considered as a binary bio-marker, with BMI < 25
and ≥ 25 indicating normal and overweight, respectively (results not shown).

Figure 5. Estimate of the propensity score-based adjusted ROC curve for assessing BMI to classify
diabetes patients. (a) Adjustment for different combinations of covariates; (b) Adjustment for all
possible covariates; (c) PS-specific AUC when adjustment is made for all covariates.

Figure 6. Estimate of the propensity score-based adjusted ROC curve for assessing BMI to classify
hypertensive patients. (a) Adjustment for different combinations of covariates; (b) Adjustment for
all possible covariates; (c) PS-specific AUC when adjustment is made for all covariates.
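The idea of an AUC that varies with the PS can be illustrated with a cruder device than the kernel-based induced regression used in the paper: split the subjects into strata of the score and compute a Mann-Whitney AUC within each stratum. The sketch below does exactly that. It is a named substitute for the paper's estimator, not a reimplementation of it, and the function names and the default of five strata are assumptions.

```python
from bisect import bisect_right
from statistics import mean, quantiles

def auc_mann_whitney(y_healthy, y_diseased):
    """AUC = P(Y_D > Y_H) + 0.5 * P(Y_D = Y_H), computed over all pairs."""
    wins = 0.0
    for yd in y_diseased:
        for yh in y_healthy:
            wins += 1.0 if yd > yh else 0.5 if yd == yh else 0.0
    return wins / (len(y_diseased) * len(y_healthy))

def ps_specific_auc(marker, disease, score, n_bins=5):
    """Return (mean score, AUC) for each quantile stratum of the score."""
    edges = quantiles(score, n=n_bins, method="inclusive")  # n_bins - 1 cut points
    strata = [bisect_right(edges, s) for s in score]
    results = []
    for b in range(n_bins):
        y_d = [m for m, d, g in zip(marker, disease, strata) if g == b and d == 1]
        y_h = [m for m, d, g in zip(marker, disease, strata) if g == b and d == 0]
        s_b = [s for s, g in zip(score, strata) if g == b]
        # A stratum without both groups has no defined AUC.
        auc = auc_mann_whitney(y_h, y_d) if y_d and y_h else float("nan")
        results.append((mean(s_b) if s_b else float("nan"), auc))
    return results
```

Plotting the second element of each pair against the first gives a coarse counterpart of panel (c) in Figures 5 and 6, with the caveat that stratification replaces the smoothing over the score.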
6. Discussion
The importance of a new bio-marker in bio-medical research depends on how accurately the
marker classifies the subjects with disease from the healthy subjects. The ROC curve is a
popular method for evaluating the classification accuracy of a bio-marker. However, differences
in patient characteristics (baseline covariates) between the diseased and healthy populations
may make it difficult to evaluate the classification accuracy of a bio-marker, and hence it is
necessary to adjust for covariate effects in the ROC analysis. While the existing methods are
unable to adjust for more than one covariate at a time [17], this study provided a solution
by using the propensity score to adjust for several covariates simultaneously. More specifically,
the propensity score, derived from a linear combination of a set of covariates, was used
in the existing non-parametric induced regression estimator for the ROC curve in place
of that set of covariates. The PS is used here as a dimension reduction technique
that reduces several covariates to a single score by means of their linear combination, which is
similar to the dimension reduction regression discussed in the literature [42,43]. Hence, using
the PS directly, or the linear predictor ($\hat{\beta}x$) from which the PS was derived, instead
of several covariates does not change the interpretation of the ROC curve. For example, the
PS-adjusted ROC curve can be interpreted as the classification accuracy of the bio-marker adjusting
for the effects of the PS, i.e. the effects of several covariates. In addition, this approach can be
used to generate covariate-specific ROC curves from the PS-specific ROC curves because
a one-to-one correspondence holds between the PS and the covariates.
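For reference, the substitution of the PS for the covariate set can be written out in the general induced-regression form. The notation below is an assumption made for this sketch: it follows the usual location-scale formulation of the approach in [17] and may differ from the symbols used in Section 2.3. Here $D$ and $\bar{D}$ index the diseased and healthy populations, $G_D$ and $G_{\bar{D}}$ are the distribution functions of the regression errors, and $H_D$ is the distribution of the PS among the diseased.

```latex
\begin{align}
PS &= \hat{\beta}^{\top} X, \\
Y_{D} &= \mu_{D}(PS) + \sigma_{D}(PS)\,\varepsilon_{D}, \qquad
Y_{\bar{D}} = \mu_{\bar{D}}(PS) + \sigma_{\bar{D}}(PS)\,\varepsilon_{\bar{D}}, \\
ROC_{PS}(p) &= 1 - G_{D}\!\left(
  \frac{\mu_{\bar{D}}(PS) - \mu_{D}(PS)}{\sigma_{D}(PS)}
  + \frac{\sigma_{\bar{D}}(PS)}{\sigma_{D}(PS)}\, G_{\bar{D}}^{-1}(1 - p)
\right), \\
aROC_{PS}(p) &= \int ROC_{s}(p)\, dH_{D}(s).
\end{align}
```

The third line follows from setting the threshold at the $(1 - p)$ quantile of the healthy marker distribution given the score, and the last line averages the PS-specific curves over the score distribution among the cases, as in the construction of the true curve in the simulation study.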
The simulation study suggests that the PS-adjusted ROC curve is consistent
with increasing sample size. Even the effect of mis-specification of the propensity score
model is negligible. Furthermore, the method was applied to evaluate the performance of
BMI in classifying hypertensive and diabetic patients from their healthier counterparts,
in the presence of several patient-specific covariates such as age, gender, education,
socio-economic status and area of residence. The results support the simulation findings and
have a meaningful interpretation. The findings of the study suggest that the PS-based approach
can be used to adjust for the effect of several baseline covariates that confound the
discriminatory power of the bio-marker.
The PS-adjusted ROC curve, $aROC_{PS}$, derived using non-parametric induced regression
possesses the same properties as the standard ROC estimator, $aROC_X$, from the same
methodology considering a single covariate as described by Rodríguez-Alvarez et al. [17].
The main advantage of the proposed approach is that it can adjust for the effect
of several covariates in ROC analysis at a time, without loss of generality, by means of a
single score (PS) obtained from a linear transformation of the covariate set. In this paper,
we discussed the regression approach as a dimension reduction technique through modeling
bio-markers with a set of covariates, which relies on the normality and linearity
assumptions. Further research may be required to investigate the consequences of
different degrees and forms of mis-specification (non-normality, non-linearity, etc.) of the
propensity score model and to extend the methods to overcome such situations. In this paper, we
used the bootstrap variance for making inference; further research may be possible
to derive an analytical variance with a robust variance estimation technique that carries the
variation in the propensity score model into the ROC estimator. In addition, the proposed
method has the limitation that obtaining the final estimate of the ROC
curve and its area is very time-consuming, and therefore extensive simulation considering diverse
scenarios such as correlated covariates, non-linear functions in the propensity score model, etc.
was not possible.
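The bootstrap inference mentioned above can be sketched generically. The version below resamples subjects within the healthy and diseased groups, recomputes a user-supplied statistic on each resample, and returns a 95% percentile interval together with the bootstrap variance. It is a minimal sketch under assumptions: the function names are hypothetical, the percentile interval is one of several possible choices, and to reflect variation in the propensity score model the supplied `statistic` would need to refit that model inside each resample, which the simple AUC used here does not do.

```python
import random
from bisect import bisect_left, bisect_right
from statistics import quantiles, variance

def auc(y_healthy, y_diseased):
    """Mann-Whitney AUC using sorted healthy values (ties count one half)."""
    h = sorted(y_healthy)
    total = 0.0
    for yd in y_diseased:
        below = bisect_left(h, yd)
        ties = bisect_right(h, yd) - below
        total += below + 0.5 * ties
    return total / (len(h) * len(y_diseased))

def stratified_bootstrap(healthy, diseased, statistic, n_boot=1000, seed=0):
    """Return (lower, upper, variance) from a 95% percentile bootstrap."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_boot):
        # Resample within each group so the numbers of cases and controls stay fixed.
        boot_h = rng.choices(healthy, k=len(healthy))
        boot_d = rng.choices(diseased, k=len(diseased))
        reps.append(statistic(boot_h, boot_d))
    # With n=40 the first and last cut points are the 2.5% and 97.5% quantiles.
    cuts = quantiles(reps, n=40, method="inclusive")
    return cuts[0], cuts[-1], variance(reps)
```

A call such as `stratified_bootstrap(y_h, y_d, auc)` gives an interval for the unadjusted AUC; replacing `auc` with a function that runs the whole PS-adjusted pipeline on the resampled subjects multiplies the cost by the number of replicates, which is consistent with the computing-time limitation noted above.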
Acknowledgments
The authors acknowledge the authority of MEASURE DHS for making the data used here available in
the public domain. In addition, the authors acknowledge the proceedings of the 62nd ISI World
Statistics Congress, Kuala Lumpur, Malaysia, 2019, published by the same authors, because some
parts of this manuscript overlap with those conference proceedings.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Ethics approval and consent to participate
As the dataset is freely available in the public domain and may be used in research
publications, the ethics approval and consent statement have been approved by the authority
that made the data available for public use.
Data availability statement
The dataset used in this study can be downloaded freely from a public domain at
https://dhsprogram.com/data/ under the authority of the DHS program.
Notes on contributors
Muntaha Mushfiquee completed a BS (honours) and an MS in Applied Statistics at the Institute of
Statistical Research and Training, University of Dhaka, and a second MSc degree in Statistics at
the Memorial University of Newfoundland, Canada. She is now working as a Health Data Analyst at BC
PHSA, Canada.
M. Shafiqur Rahman has a PhD in Medical Statistics from University College London and is currently
working as a Professor of Applied Statistics at the Institute of Statistical Research and Training,
University of Dhaka. His main research areas include causal inference, developing and validating
risk prediction models, and mixed effect models, focusing on the development of new statistical
methods for analyzing data in public health and medical research. So far he has published a number
of research papers in peer-reviewed journals and scientific reports and has guided a number of
undergraduate and postgraduate students in preparing their projects and theses. In addition to
teaching and research, he is also involved in statistical consultancy services for various
national and international organizations. He has been serving as an associate editor of a number
of peer-reviewed journals. He is a member of the Bangladesh Statistical Association, the
International Biometric Society, the International Statistical Institute and the International
Society of Clinical Biostatistics, and an educational ambassador of the American Statistical
Association from Bangladesh.
ORCID
M. Shafiqur Rahman http://orcid.org/0000-0001-5256-7453
References
[1] Mayeux R. Biomarkers: potential uses and limitations. NeuroRx. 1(2): 182–188.
[2] Strimbu K, Tavel JA. What are biomarkers?. Curr Opin HIV AIDS. 2010;5(6):463–466.
[3] Craig-Schapiro R, Fagan AM, Holtzman DM. Biomarkers of Alzheimer's disease. Neurobiol
Dis. 2009;35(2):128–140.
[4] Zheng Y, Cai T, Jin Y, et al. Evaluating prognostic accuracy of biomarkers under competing risk.
Biometrics. 2012;68(2):388–396.
[5] Chen W, Samuelson FW, Gallas BD, et al. On the assessment of the added value of new
predictive biomarkers. BMC Med Res Methodol. 2013;13:98.
[6] Moons KGM, de Groot JAH, Linnet K, et al. Quantifying the added value of a diagnostic test
or marker. Clin Chem. 2012;58(10):1408–1417.
[7] Linnet K, Bossuyt PMM, Moons KGM, et al. Quantifying the accuracy of a diagnostic test or
marker. Clin Chem. 2012;58(9):1292–1301.
[8] Ridker PM, Buring JE, Rifai N, et al. Development and validation of improved algorithms for
the assessment of global cardiovascular risk in women. J Am Med Assoc. 2007;297:611–619.
[9] Ridker PM, Paynter NP, Rifai N, et al. C-reactive protein and parental history improve
global cardiovascular risk prediction: the Reynolds Risk Score for men. Circulation.
2008;118:2243–2251.
[10] Wilson PWF, Pencina M, Jacques P, et al. C-reactive protein and reclassification of
cardiovascular risk in the Framingham Heart Study. Circ Cardiovasc Qual Outcomes. 2008;1(2):92–97.
[11] Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating character-
istic (ROC) curve. Radiology. 1982;143:29–36.
[12] Birgit G. Analysis of biomarker data: logs, odds ratios, and receiver operating characteristic
curves. Curr Opin HIV AIDS. 2010;5(6):473–479.
[13] Pepe MS. An interpretation for the ROC curve and inference using GLM procedures. Biometrics.
2000;56:352–359.
[14] Janes H, Pepe MS. Adjusting for covariates in studies of diagnostic, screening, or prognostic
markers: an old concept in a new setting. Am J Epidemiol. 2008;168(1):89–97.
[15] Janes H, Pepe MS. Adjusting for covariate effects on classification accuracy using the
covariate-adjusted receiver operating characteristic curve. Biometrika. 2009;96(2):371–382.
[16] Janes H, Pepe MS. Adjusting for covariate effects on classification accuracy using the
covariate-adjusted ROC curve. UW Biostat Work Pap Ser. 2009;96(2):371–382.
[17] Rodríguez-Alvarez MX, Roca-Pardiñas J, Cadarso-Suárez C. ROC curve and covariates:
extending induced methodology to the non-parametric framework. Stat Comput.
2011;21(4):483–499.
[18] Harrell Jr FE, Califf RM, Pryor DB, et al. Evaluating the yield of medical tests. J Am Med Assoc.
1982;247:2543–2546.
[19] Pepe MS. Three approaches to regression analysis of receiver operating characteristic curves
for continuous test results. Biometrics. 1998;54(1):124–135.
[20] Faraggi D. Adjusting receiver operating characteristic curves and related indices for covariates.
J R Stat Soc Ser D Stat. 2003;52(2):179–192.
[21] Cai T, Pepe MS. Semiparametric receiver operating characteristic analysis to evaluate
biomarkers for disease. J Am Stat Assoc. 2002;97(460):1099–1107.
[22] Alonzo TA, Pepe MS. Distribution-free ROC analysis using binary regression techniques.
Biostatistics. 2002;3(3):421–432.
[23] Austin PC. An introduction to propensity score methods for reducing the effects of
confounding in observational studies. Multivariate Behav Res. 2011;46(3):399–424.
[24] D’Agostino RB. Propensity score methods for bias reduction in the comparison of a treatment
to a non-randomized control group. Stat Med. 1998;17:2265–2281.
[25] Weitzen S, Lapane KL, Toledano AY, et al. Principles for modeling propensity scores in medical
research: a systematic literature review. Pharmacoepidemiol Drug Saf. 2004;13:841–853.
[26] Han S, Andrei AC, Tsui KW, et al. ROC analysis using covariate balancing propensity scores
with an application to biochemical predictors for thyroid cancer. Communications in Statistics
Part B: Simulation and Computation. 2022;51(1):374–390.
[27] Galadima HI, McClish DK. Controlling for confounding via propensity score methods
can result in biased estimation of the conditional AUC: a simulation study. Pharm Stat.
2019;18(5):568–582.
[28] McCaffrey DF, Griffin BA, Almirall D, et al. A tutorial on propensity score estimation
for multiple treatments using generalized boosted model. Stat Med. 2013;30–32(19):3388–3414.
[29] Goodenough DJ, Rossmann K, Lusted LB. Radiographic applications of receiver operating
characteristic (ROC) curves. Radiology. 2003;229(1):3–8.
[30] Goncalves L, Subtil A, Oliveira MR. ROC curve estimation: an overview. Revstat Stat J.
2014;12(1):1–20.
[31] Metz CE. Basic principles of ROC analysis. Semin Nucl Med. 1978;8(4):283–298.
[32] Faraggi D, Reiser B. Estimation of the area under the ROC curve. Stat Med. 2002;30(20):
3093–3106.
[33] Metz CE. Basic principles of ROC analysis. Semin Nucl Med. 1978;8:283–298.
[34] González-Manteiga W, Pardo-Fernández JC, Keilegom IV. ROC curves in non-parametric
location-scale regression models. Scand J Stat. 2011;38:169–184.
[35] Fan J, Marron JS. Fast implementations of nonparametric curve estimators. J Comput Graph
Stat. 1994;3(1):35–56.
[36] Müller H-G, Schmitt T. Kernel and probit estimates in quantal bioassay. J Am Stat Assoc.
1988;83(403):750–759. doi:10.1080/01621459.1988.10478658
[37] Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for
causal effects. Biometrika. 1983 Apr;70(1):41–55. doi:10.1093/biomet/70.1.41
[38] Olmos A, Govindasamy P. A practical guide for using propensity score weighting in R.
2015;20(13).
[39] Zhu Y, Coffman DL, Ghosh D. A boosting algorithm for estimating generalized propensity
score with continuous treatment. J Causal Inference. 2015;3(1):25–40.
[40] Austin PC. Assessing covariate balance when using the generalized propensity score with
quantitative or continuous exposures. Stat Methods Med Res. 2019;28(5):1365–1377. PMID:
29415624.
[41] Ghosh D. Propensity score modelling in observational studies using dimension reduction
methods. Stat Probab Lett. 2011;81(7):813–820. ISSN 0167-7152. Statistics in Biological and
Medical Sciences.
[42] Cook RD, Lee H. Dimension reduction in binary response regression. J Am Stat Assoc.
1999;94(448):1187–1200.
[43] Li KC. Sliced inverse regression for dimension reduction. J Am Stat Assoc. 1991;86(414):
316–327.
[44] Dawid A. Conditional independence in statistical theory. J R Stat Soc Series B Stat Methodol.
1979;41(1):1–31. ISSN 00359246.
[45] Akter T, Sarker EB, Rahman MS. A tutorial on GEE with applications to diabetes and
hypertension data from a complex survey. J Biomed Anal. 2018;1(1):37–50.
[46] Roy PK, Khan MHR, Akter T, et al. Exploring socio-demographic- and geographical-variations
in prevalence of diabetes and hypertension in Bangladesh: Bayesian spatial analysis of national
health survey data. Spat Spatiotemporal Epidemiol. 2019;29:71–83.
[47] NIPORT, Mitra-Associates, and Macro International. Bangladesh Demographic and Health
Survey 2011. Technical report, National Institute of Population Research and Training
(NIPORT); Dhaka, Bangladesh, and Calverton, Maryland, USA, 2011.
[48] Dua S, Bhuker M, Sharma P, et al. Body mass index relates to blood pressure among adults.
North Am J Med Sci. 2014;6(2):89–95.