
Estimating transformations for evaluating diagnostic tests with covariate adjustment

Abstract

Receiver operating characteristic analysis is one of the most popular approaches for evaluating and comparing the accuracy of medical diagnostic tests. Although various methodologies have been developed for estimating receiver operating characteristic curves and their associated summary indices, there is no consensus on a single framework that can provide consistent statistical inference while handling the complexities associated with medical data. Such complexities might include non-normal data, covariates that influence the diagnostic potential of a test, ordinal biomarkers or censored data due to instrument detection limits. We propose a regression model for the transformed test results which exploits the invariance of receiver operating characteristic curves to monotonic transformations and accommodates these features. Simulation studies show that the estimates based on transformation models are unbiased and yield coverage at nominal levels. The methodology is applied to a cross-sectional study of metabolic syndrome where we investigate the covariate-specific performance of weight-to-height ratio as a non-invasive diagnostic test. Software implementations for all the methods described in the article are provided in the tram add-on package to the R system for statistical computing and graphics.
Original Research Article
Ainesh Sewak and Torsten Hothorn
Statistical Methods in Medical Research
2023, Vol. 32(7) 1403–1419
© The Author(s) 2023
DOI: 10.1177/09622802231176030
journals.sagepub.com/home/smm
Keywords
Transformation model, receiver operating characteristic curve, area under the receiver operating characteristic curve,
diagnostic test, distribution regression, ordinal outcome, censoring, Youden index, overlapping coefficient, limit of
detection
1 Introduction
Estimating receiver operating characteristic (ROC) curves for evaluating the performance of medical diagnostic tests has been a main focus of the statistical literature over the last decades.1,2 Diagnostic tests screen for the presence or absence of a disease. Characterizing their accuracy is essential to ensure appropriate prevention, treatment and monitoring of diseases. ROC curves are a valuable tool in determining the diagnostic potential of a test and continue to be extensively applied in biomedical studies as new tests or biomarkers are developed in radiology, oncology, genetics, and other related fields. More applications can be expected as technology advances, and analyzing the resulting data requires a computationally straightforward approach that provides accurate and consistent statistical inference.
Previous research has focused on extending statistical methodology for ROC curve estimation to address issues such as adjustment for covariates,3,4 incorporating censoring due to instrument detection limits5,6 and robustness to model misspecification.7 In addition, a wide variety of parametric and nonparametric methods have been proposed within frequentist and Bayesian paradigms (see Inácio et al.8 for a recent review). However, there is no consensus on an analytic approach that can handle all these issues simultaneously.

Institut für Epidemiologie, Biostatistik und Prävention, Universität Zürich, Zürich, Switzerland
Corresponding author: Torsten Hothorn, Institut für Epidemiologie, Biostatistik und Prävention, Universität Zürich, Hirschengraben 84, CH-8001 Zürich, Switzerland.
Email: Torsten.Hothorn@R-project.org

An attractive feature of the ROC curve, which has rarely been exploited for its estimation, is that it is invariant to monotonic transformations of the test results. Although transformations have been used to bring continuous test results into a form that approximately satisfies the assumptions of a suitable parametric model,4 estimation of a transformation function has been limited to the Box-Cox power transformation family.9,10 For rank-based methods, the transformation function can be left unspecified, but in all cases, a restriction to normality has previously been imposed on the model for the ROC curve.11
In this article, we present a new unifying methodological framework for estimating ROC curves and their associated summary indices by modeling the relationship between the transformed test results and potential covariates. We employ transformation models to jointly estimate the transformation function and regression parameters.12,13 This approach specifies a parametric model for the ROC curve but remains distribution-free because we do not impose any strong assumptions about the transformation function. Using the estimated parameters, we show how to evaluate covariate effects on the discriminatory performance of diagnostic tests. Unlike nonparametric methods, which are flexible but difficult to interpret and implement, transformation models excel on both fronts. R implementations of all methods discussed in this article are available, along with a set of supporting examples.
1.1 Notation and preliminaries
Let the random variable $Y$ denote the continuous result of a diagnostic test and let $D$ denote the disease status, with $D = 1$ if a subject is diseased and $D = 0$ if nondiseased. We denote quantities conditional on the disease status using subscripts. For example, $Y_1$ and $Y_0$ are the test results in the diseased and nondiseased populations, with cumulative distribution functions (CDFs) $F_1$ and $F_0$ and densities $f_1$ and $f_0$, respectively. Suppose that a subject is diagnosed as diseased when their test result exceeds a threshold value $c$. By convention, we assume that larger values of the test result are more indicative of the disease. The probabilities of correctly identifying a diseased and a nondiseased subject are the sensitivity, $\mathbb{P}(Y_1 > c) = 1 - F_1(c)$, and the specificity, $\mathbb{P}(Y_0 \le c) = F_0(c)$, respectively. The set of pairs $(1 - \text{specificity}, \text{sensitivity})$ over all $c$ produces the ROC curve. By setting $p = 1 - F_0(c)$, an equivalent representation of the ROC curve is
$$\mathrm{ROC}(p) = 1 - F_1\left(F_0^{-1}(1 - p)\right)$$
Summary indices of the ROC curve quantify the degree of separation between the distributions of $Y_1$ and $Y_0$. The most widely used index is the area under the ROC curve (AUC), defined by
$$\mathrm{AUC} = \mathbb{P}(Y_1 > Y_0) = \int_0^1 \mathrm{ROC}(p)\, dp$$
The AUC represents the probability that the test result of a randomly selected diseased subject exceeds that of a randomly selected nondiseased subject and is directly related to the Mann–Whitney–Wilcoxon (MWW) U-statistic.14 Alternative indices include the Youden index,15 $J$, which combines sensitivity and specificity over all possible thresholds to provide the maximum potential effectiveness of a diagnostic test, given by
$$J = \max_{c}\,\{F_0(c) - F_1(c)\}$$
The Youden index is equivalent to the Smirnov (or two-sample Kolmogorov-Smirnov) test statistic16 and can be represented as half the $L_1$ distance between the two densities or as the complement of the overlapping coefficient (OVL)17–20:
$$J = \frac{1}{2}\int |f_0(y) - f_1(y)|\, dy = 1 - \int \min[f_1(y), f_0(y)]\, dy = 1 - \mathrm{OVL}$$
Additionally, the threshold corresponding to $J$, at which the sum of sensitivity and specificity is maximized, denoted $c^*$, is often used in clinical practice as the optimal classification threshold to screen subjects.
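As a minimal sketch of these definitions (not the estimators proposed in this article), the empirical AUC can be computed as the Mann–Whitney proportion of concordant pairs, and the empirical Youden index as the maximum of $F_0(c) - F_1(c)$ over observed thresholds. The function names and the toy data are hypothetical and serve only as an illustration.

```python
from bisect import bisect_right


def empirical_auc(y0, y1):
    """Mann-Whitney estimate of P(Y1 > Y0); ties count as 1/2."""
    wins = 0.0
    for a in y1:
        for b in y0:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(y0) * len(y1))


def empirical_youden(y0, y1):
    """Return (J, c*) maximizing F0(c) - F1(c) over the observed values."""
    s0, s1 = sorted(y0), sorted(y1)
    best_j, best_c = 0.0, None
    for c in sorted(set(s0) | set(s1)):
        f0 = bisect_right(s0, c) / len(s0)  # specificity, P(Y0 <= c)
        f1 = bisect_right(s1, c) / len(s1)  # 1 - sensitivity, P(Y1 <= c)
        if f0 - f1 > best_j:
            best_j, best_c = f0 - f1, c
    return best_j, best_c


# Toy test results (made up for illustration only).
nondiseased = [0.2, 0.5, 0.9, 1.1, 1.4]
diseased = [1.2, 1.3, 1.7, 2.0, 2.4]
auc = empirical_auc(nondiseased, diseased)
j, c_star = empirical_youden(nondiseased, diseased)
```

With these toy data, 23 of the 25 diseased/nondiseased pairs are concordant, so the empirical AUC is 0.92, and the largest gap between the two empirical CDFs occurs at the threshold 1.1.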
Covariates may impact the level and the accuracy of a diagnostic test. To appropriately understand the accuracy of the test in subpopulations, we can use covariate-specific or conditional ROC curves.21 Let $X$ denote a vector of covariates that are hypothesized to have an impact on the accuracy of the test. The conditional CDF in the diseased population is given by $F_1(y \mid x) = \mathbb{P}(Y_1 \le y \mid X = x)$, and analogously for the nondiseased population. The covariate-specific ROC curve can be written as
$$\mathrm{ROC}(p \mid x) = 1 - F_1\left(F_0^{-1}(1 - p \mid x) \mid x\right) \quad (1)$$
with its counterpart conditional summary indices, $\mathrm{AUC}(x)$ and $J(x)$, defined accordingly. The covariate-specific ROC curve can be generated by modeling the conditional distribution of the test results, known as the induced or indirect methodology.3
1.2 Overview
The article proceeds as follows. In Section 2, we propose a transformation modeling framework for parameterizing ROC curves from which we derive closed-form expressions for the associated AUC and Youden summary indices. We discuss maximum likelihood estimation procedures for our model and the corresponding inference. In Section 3, we assess the empirical performance of our methods using simulated data. We apply our approach to a cross-sectional study for detection of metabolic syndrome in Section 4 and conclude the article with a discussion.
2 Methods
2.1 Transformation model
The ROC curve is a composition of distribution functions and thus is invariant to strictly monotonically increasing transformations of $Y$. We propose a model for the conditional distribution of the transformed test result given the disease status and covariates. This transformation is obtained from the data and leads to a distribution-free framework to parameterize the covariate-specific ROC curve and its summary indices.
Suppose there exists a strictly monotonically increasing function $h$ such that the relationship between the transformed test result and the covariates follows a shift-scale model
$$h(Y) = \mu_d(x) + \sigma_d(x) Z$$
where $D = d$ specifies the disease indicator ($D = 0$ for nondiseased and $D = 1$ for diseased), $X = x$ is a fixed set of covariates, $\mu_d(x)$ is the shift term, $\sigma_d(x)$ is the scale term, and $Z$ is a latent random variable with an a priori known absolutely continuous log-concave CDF, $F_Z$. Given that $D$ and $X$ are fixed, the conditional CDF of $Y$ is
$$\mathbb{P}(Y \le y \mid D = d, X = x) = F_d(y \mid x) = F_Z\left(\frac{h(y) - \mu_d(x)}{\sigma_d(x)}\right) \quad (2)$$
Equation (2) represents a general class of models called transformation models.13,22 The transformation function $h$ uniquely characterizes the distribution of $Y$, similar to the density or distribution function. Plugging this conditional CDF of $Y$ into equation (1), $h$ cancels out and the covariate-specific ROC curve is given by
$$\mathrm{ROC}(p \mid x) = 1 - F_Z\left(\zeta(x)\, F_Z^{-1}(1 - p) - \delta(x)\right) \quad (3)$$
where
$$\delta(x) = \frac{\mu_1(x) - \mu_0(x)}{\sigma_1(x)} \quad \text{and} \quad \zeta(x) = \frac{\sigma_0(x)}{\sigma_1(x)}$$
Thus, the ROC curve is completely determined by the shift and scale terms of the model.
The binormal23 and bilogistic24 ROC curves can be obtained by setting $F_Z$ in equation (3) to the standard normal distribution function, $\mathrm{probit}^{-1} = \Phi$, or the standard logistic distribution function, $\mathrm{logit}^{-1}(x) = \mathrm{expit}(x) = (1 + \exp(-x))^{-1}$, respectively. Similarly, the proportional hazard25 and reverse proportional hazard26 alternatives for the ROC curve also fall within the purview of our transformation model, with $F_Z$ specified as $\mathrm{cloglog}^{-1}(x) = 1 - \exp(-\exp(x))$ (minimum extreme value distribution function) and $\mathrm{loglog}^{-1}(x) = \exp(-\exp(-x))$ (maximum extreme value distribution function), respectively.
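The four choices of $F_Z$ and the induced ROC curve of equation (3) can be written down directly. The following is a minimal sketch using only the Python standard library; the names `LINKS` and `roc` are hypothetical and do not correspond to the interface of the tram package.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

# (F_Z, F_Z^{-1}) pairs for the four choices of F_Z named in the text.
LINKS = {
    "probit": (_STD_NORMAL.cdf, _STD_NORMAL.inv_cdf),
    "logit": (lambda z: 1.0 / (1.0 + math.exp(-z)),
              lambda u: math.log(u / (1.0 - u))),
    "cloglog": (lambda z: 1.0 - math.exp(-math.exp(z)),
                lambda u: math.log(-math.log(1.0 - u))),
    "loglog": (lambda z: math.exp(-math.exp(-z)),
               lambda u: -math.log(-math.log(u))),
}


def roc(p, delta, link="probit", zeta=1.0):
    """ROC(p) = 1 - F_Z(zeta * F_Z^{-1}(1 - p) - delta) for 0 < p < 1."""
    cdf, quantile = LINKS[link]
    return 1.0 - cdf(zeta * quantile(1.0 - p) - delta)
```

For the extreme value choices with unit scale ratio, this reduces to the power-type curves $\mathrm{ROC}(p) = p^{\exp(-\delta)}$ (cloglog) and $\mathrm{ROC}(p) = 1 - (1 - p)^{\exp(\delta)}$ (loglog), which gives a quick numerical check of the implementation.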
However, to the best of our knowledge, the only literature where the transformation function $h$ is included in the model formulation of the ROC curve is Zou,27 who jointly models the shift term and the parameters of a Box-Cox power transformation function. A key point of this article is that we explicitly estimate $h$ jointly with $\mu(x)$ from the observed data and are not restricted to the normality imposed by power transformation families. Thus, the methods we propose allow for proper propagation of uncertainty from the estimated transformation function $\hat{h}$ into the estimates of the shift and scale terms of the model.
The ROC curve in equation (3) follows a parametric model depending on $F_Z$, but is distribution-free in the sense of Alonzo and Pepe,28 because no assumptions are made about the transformation $h$ and consequently about the distribution of the test results. The approach of modeling the test results as a function of the disease status and covariates was originally proposed in the latent variable ordinal regression setting by Tosteson and Begg21 and extended by Pepe3 to modeling covariate effects directly on the ROC curve.
Tosteson and Begg21 pointed out that to ensure concavity of the induced ROC curve, the scale term must be omitted, that is, $\sigma_d(x) = 1$ for $d \in \{0, 1\}$. The ROC curve is termed proper if it is concave or, equivalently, if the derivative of the ROC curve is a monotonically decreasing function.29 A concave ROC curve is desirable as it yields the maximal sensitivity for a given value of specificity.30 Because the decision criterion for classifying subjects is optimal when the ROC curve is concave, we focus in the remainder of this work on the model involving only the shift term. Hence, the effect of covariates on the ROC curve is contained in the difference between the shift terms for diseased and nondiseased subjects, $\delta(x) = \mu_1(x) - \mu_0(x)$. For a relaxation of this assumption, see Siegfried et al.,31 who additionally estimate the scale functions through regression models.
2.1.1 Two-sample case
We first consider the case of two samples without covariates. Let the shift term take the form $\mu_d(x) = \delta d$. The CDF of the test results in the nondiseased population is given by $F_0(y) = F_Z(h(y))$ and in the diseased population by $F_1(y) = F_Z(h(y) - \delta)$. Using equation (3), the induced ROC curve can be expressed as
$$\mathrm{ROC}(p) = 1 - F_Z\left(h\left(h^{-1}\left(F_Z^{-1}(1 - p)\right)\right) - \delta\right) = 1 - F_Z\left(F_Z^{-1}(1 - p) - \delta\right) \quad (4)$$
The model assumption implies that a monotone function $h$ exists that transforms both $Y_1$ and $Y_0$ into the same distribution, $Z \sim F_Z$, separated by a shift parameter, $\delta$. The induced ROC curve from this model does not assume a particular distribution of the test result; rather, it quantifies the difference between the test result distributions on the scale of a user-defined $F_Z$. In this sense, the difference between the test result distributions is described by $\delta$. Each choice of $F_Z$ leads to a different interpretation of $\delta$. For example, when $F_Z$ is selected to be the standard normal distribution function, $\delta$ is essentially Cohen's $d$, the standardized difference in means of the transformed test results comparing the diseased and nondiseased groups, $\mathbb{E}[h(Y_1) - h(Y_0)]$. Similarly, when $F_Z$ is the standard logistic distribution function, $\exp(\delta)$ is the ratio of the odds of having a positive test result comparing the diseased and nondiseased groups. Closed-form expressions can be derived for summary indices of the ROC curve by solving the appropriate integrals. The expressions for the AUC, $J$, the optimal threshold $c^*$, and the sensitivity and specificity at $c^*$ are given for some choices of $F_Z$ in Table 1.
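The closed-form AUC and Youden index for the two-sample shift model can be checked numerically against equation (4). The sketch below is an illustration under the model $F_d(y) = F_Z(h(y) - \delta d)$; the function names are hypothetical, and the extreme value expressions are written in a form derived from the ROC curves $p^{\exp(-\delta)}$ and $1 - (1 - p)^{\exp(\delta)}$.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def auc(delta, link):
    """Closed-form AUC as a function of the shift parameter delta."""
    if link == "probit":
        return PHI(delta / math.sqrt(2.0))
    if link == "logit":
        if abs(delta) < 1e-6:  # limiting value at delta = 0
            return 0.5
        e = math.exp(delta)
        return e * (e - 1.0 - delta) / (e - 1.0) ** 2
    if link in ("cloglog", "loglog"):
        return expit(delta)
    raise ValueError(f"unknown link: {link}")


def youden(delta, link):
    """Closed-form Youden index J for delta >= 0."""
    if link == "probit":
        return 2.0 * PHI(delta / 2.0) - 1.0
    if link == "logit":
        return 2.0 * expit(delta / 2.0) - 1.0
    if link in ("cloglog", "loglog"):
        if abs(delta) < 1e-6:
            return 0.0
        k = delta / math.expm1(delta)  # delta / (exp(delta) - 1)
        return math.exp(-k) - math.exp(-k * math.exp(delta))
    raise ValueError(f"unknown link: {link}")
```

For instance, integrating the bilogistic ROC curve with $\delta = 1$ by a midpoint rule should reproduce `auc(1.0, "logit")`, approximately 0.661, and maximizing $p^{\exp(-1)} - p$ over a fine grid should reproduce `youden(1.0, "cloglog")`.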
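Table 1. Closed-form expressions for the area under the receiver operating characteristic curve (AUC), Youden index ($J$), optimal classification threshold ($c^*$), and sensitivity (Sens) and specificity (Spec) at $c^*$ in terms of the shift parameter $\delta$ in the linear transformation model given by $F_d(y) = F_Z(h(y) - \delta d)$. Columns correspond to the choice of $F_Z$.

| Index | $\mathrm{probit}^{-1}$ | $\mathrm{logit}^{-1}$ | $\mathrm{cloglog}^{-1}$ | $\mathrm{loglog}^{-1}$ |
| --- | --- | --- | --- | --- |
| AUC | $\Phi\left(\frac{\delta}{\sqrt{2}}\right)$ | $\frac{\exp(\delta)(\exp(\delta) - 1 - \delta)}{(\exp(\delta) - 1)^2}$ for $\delta \neq 0$; $\frac{1}{2}$ for $\delta = 0$ | $\mathrm{expit}(\delta)$ | $\mathrm{expit}(\delta)$ |
| $J$ | $2\Phi\left(\frac{\delta}{2}\right) - 1$ | $2\,\mathrm{expit}\left(\frac{\delta}{2}\right) - 1$ | $\exp\left(\frac{-\delta}{e^{\delta} - 1}\right) - \exp\left(\frac{\delta}{e^{-\delta} - 1}\right)$ | $\exp\left(\frac{-\delta}{e^{\delta} - 1}\right) - \exp\left(\frac{\delta}{e^{-\delta} - 1}\right)$ |
| $c^*$ | $h^{-1}\left(\frac{\delta}{2}\right)$ | $h^{-1}\left(\frac{\delta}{2}\right)$ | $h^{-1}\left(\log\left(\frac{\delta}{1 - e^{-\delta}}\right)\right)$ | $h^{-1}\left(\log\left(\frac{e^{\delta} - 1}{\delta}\right)\right)$ |
| Sens($c^*$) | $\Phi\left(\frac{\delta}{2}\right)$ | $\mathrm{expit}\left(\frac{\delta}{2}\right)$ | $\exp\left(\frac{-\delta}{e^{\delta} - 1}\right)$ | $1 - \exp\left(\frac{\delta}{e^{-\delta} - 1}\right)$ |
| Spec($c^*$) | $\Phi\left(\frac{\delta}{2}\right)$ | $\mathrm{expit}\left(\frac{\delta}{2}\right)$ | $1 - \exp\left(\frac{\delta}{e^{-\delta} - 1}\right)$ | $\exp\left(\frac{-\delta}{e^{\delta} - 1}\right)$ |

2.1.2 Conditional ROC curve
The accuracy of a diagnostic test may be influenced by a set of covariates $X$. To evaluate their effect on the ROC curve and its summary indices, we assume a linear transformation model with a shift term that takes the form
$$\mu_d(x) = \delta d + x^\top \boldsymbol{\xi} + d\, x^\top \boldsymbol{\gamma} \quad (5)$$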
where $\boldsymbol{\xi}, \boldsymbol{\gamma} \in \mathbb{R}^P$ are the coefficient vectors for the covariates and the interaction term, respectively. Under this model, the resulting covariate-specific ROC curve is
$$\mathrm{ROC}(p \mid x) = 1 - F_Z\left(F_Z^{-1}(1 - p) - (\delta + x^\top \boldsymbol{\gamma})\right)$$
where the covariate effect on the ROC curve is given by the difference in shift terms between diseased and nondiseased subjects, $\delta(x) = \delta + x^\top \boldsymbol{\gamma}$. Similarly, the covariate-specific AUC is given by
$$\mathrm{AUC}(x) = \mathbb{P}(Y_0 < Y_1 \mid X = x) = a(\delta(x)) = a\left(\delta + x^\top \boldsymbol{\gamma}\right) \quad (6)$$
where $a: \mathbb{R} \to [0, 1]$ is the AUC function from the first row of Table 1 for different choices of $F_Z$. The expressions for $J$, $c^*$, sensitivity and specificity can analogously be adjusted to account for covariates, with $\delta$ replaced by $\delta + x^\top \boldsymbol{\gamma}$ in Table 1. In the case of a single continuous covariate $X = x$, the interpretation of the interaction coefficient is as follows. For each possible specificity value, a unit increase in $x$ results in a $\gamma$-unit increase in the ROC curve (or an increase in the sensitivity) on the scale of $F_Z$. If $\gamma$ is positive, an increase in $x$ corresponds to an increase in the ROC curve, indicating that a test is better able to discriminate the two populations for larger values of $x$, and vice versa. Note that the ROC curve varies with the covariate only in the presence of an interaction between $d$ and $x$. For $\gamma = 0$, the covariate affects the distribution of the test results from the diseased and nondiseased populations, but not the ROC curve. That is, for all levels of $x$, the difference between the transformed distributions $h(Y_1)$ and $h(Y_0)$ is given by $\delta$ and the ROC curve is unchanged. Analogous interpretations hold when we are interested in modeling a set of covariates $X$, which could possibly include categorical covariates.
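To illustrate equation (6), the sketch below evaluates the covariate-specific AUC for $F_Z = \mathrm{probit}^{-1}$, where $a(\delta) = \Phi(\delta / \sqrt{2})$. The coefficient values and the centered covariate are hypothetical and are not estimates from this article.

```python
import math
from statistics import NormalDist


def covariate_auc_probit(x, delta, gamma):
    """AUC(x) = Phi((delta + x'gamma) / sqrt(2)) under F_Z = probit^{-1}."""
    linear_predictor = delta + sum(xi * gi for xi, gi in zip(x, gamma))
    return NormalDist().cdf(linear_predictor / math.sqrt(2.0))


# Hypothetical coefficients, for illustration only.
delta_hat, gamma_hat = 1.0, [0.02]
for age in (30, 50, 70):
    # The covariate is centered at 50 so that delta_hat refers to age 50.
    print(age, round(covariate_auc_probit([age - 50], delta_hat, gamma_hat), 3))
```

With a positive interaction coefficient the AUC increases with the covariate, and with $\gamma = 0$ it is constant in $x$, mirroring the interpretation given above.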
Standard regression techniques have also been proposed as an alternative to assess the effect of covariates on summary indices rather than deriving the induced ROC curve. For example, Dodd et al.32 model the partial AUC as a regression function of covariates. Our model equivalently results in a regression model for the AUC, where $\delta + x^\top \boldsymbol{\gamma}$ is in the form of a usual linear predictor and $a$ is a monotonically increasing inverse link function which defines the scale for the regression coefficients. As will be shown in Section 2.2, an advantage of our approach is that we do not rely on less efficient binary regression techniques and directly estimate the regression parameters of the transformation model using maximum likelihood estimation. In Supplemental Material Section A, we show that our method is additionally related to the probabilistic index model (PIM) of Thas et al.33,34
We can also consider more general and potentially nonlinear formulations of the shift and scale terms in our framework. For the special case of $F_Z = \mathrm{probit}^{-1}$, the AUC from a shift-scale transformation model31 is given by
$$\mathbb{P}(Y_0 < Y_1 \mid X_0 = x_0, X_1 = x_1) = \Phi\left(\frac{\mu_1(x_1) - \mu_0(x_0)}{\sqrt{\sigma_0(x_0)^2 + \sigma_1(x_1)^2}}\right)$$
where $X_0$ and $X_1$ are the corresponding (potentially different) sets of covariates in the nondiseased and diseased populations, respectively. When the scale term depends only on a single set of covariates, $\sigma_0(x_0) = \sigma_1(x_1) = \sigma(x)$, despite varying sets of covariates in the shift terms, all the expressions from Table 1 hold with $\delta$ replaced by $\frac{\mu_1(x_1) - \mu_0(x_0)}{\sigma(x)}$. However, such closed-form expressions cannot be derived for other choices of $F_Z$ when the scale term depends on the disease indicator or on different sets of covariates. In such cases, AUCs and other summary indices can be derived using numerical techniques on the induced ROC curve.
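The probit shift-scale expression is simple enough to evaluate directly. In the following sketch (hypothetical function name, scalar shift and scale values supplied by the user), equal scales reduce the result to the Table 1 form $\Phi(\delta / \sqrt{2})$ with $\delta = (\mu_1 - \mu_0) / \sigma$.

```python
import math
from statistics import NormalDist


def auc_shift_scale_probit(mu0, mu1, sigma0, sigma1):
    """AUC = Phi((mu1 - mu0) / sqrt(sigma0^2 + sigma1^2)) for F_Z = probit^{-1}."""
    return NormalDist().cdf((mu1 - mu0) / math.hypot(sigma0, sigma1))


# Equal scales: identical to Phi(delta / sqrt(2)) with delta = (mu1 - mu0) / sigma.
equal_scale = auc_shift_scale_probit(0.0, 1.5, 2.0, 2.0)
```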
2.2 Estimation
In this section, we propose estimation methods for a transformation model with univariate test results. We provide an
explicit parameterization of the transformation function and the shift term. We then maximize the likelihood contributions
for a potentially exact continuous, right-, left-, or interval-censored datum to jointly estimate the model parameters. This
enables us to fully determine the ROC curve and its summary indices as well as handle test results which are ordinal or
impacted by instrument detection limits.
2.2.1 Parameterization
We parameterize the transformation function as
$$h(y \mid \boldsymbol{\vartheta}) = b(y)^\top \boldsymbol{\vartheta} = \sum_{m=0}^{M} \vartheta_m b_m(y) \quad \text{for } y \in \mathbb{R} \quad (7)$$
where $b(y) = (b_0(y), \ldots, b_M(y))^\top$ is a vector of $M + 1$ basis functions with coefficients $\boldsymbol{\vartheta} \in \mathbb{R}^{M+1}$. Polynomials in Bernstein form offer a computationally attractive choice of basis that provides a flexible way of estimating the underlying transformation function. The Bernstein basis polynomial of order $M$ is defined on the interval $[l, u]$ as
$$b_m(y) = \binom{M}{m} \tilde{y}^m (1 - \tilde{y})^{M - m}, \quad m = 0, \ldots, M \quad (8)$$
where $\tilde{y} = \frac{y - l}{u - l} \in [0, 1]$. The restriction $\vartheta_m \le \vartheta_{m+1}$ for $m = 0, \ldots, M - 1$ guarantees the monotonicity of $h$. Observe that the transformation function is linear in the parameters that define it and any nonlinearity of the test results is modeled by the basis functions. If the order $M$ is chosen to be sufficiently large, Bernstein polynomials can uniformly approximate any real-valued continuous function on an interval.35
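Equations (7) and (8) translate directly into code. The following minimal sketch (hypothetical function names, not the tram implementation) evaluates the Bernstein basis and the resulting transformation, and checks the monotonicity constraint on the coefficients.

```python
import math


def bernstein_basis(y, order, lower, upper):
    """Return [b_0(y), ..., b_M(y)] for M = order on the interval [lower, upper]."""
    t = (y - lower) / (upper - lower)  # rescale y to [0, 1]
    return [math.comb(order, m) * t**m * (1.0 - t) ** (order - m)
            for m in range(order + 1)]


def transformation(y, theta, lower, upper):
    """h(y | theta) = sum_m theta_m * b_m(y), where theta has M + 1 entries."""
    basis = bernstein_basis(y, len(theta) - 1, lower, upper)
    return sum(c * b for c, b in zip(theta, basis))


def is_monotone(theta):
    """Check theta_m <= theta_{m+1}, which makes h non-decreasing."""
    return all(a <= b for a, b in zip(theta, theta[1:]))
```

Two properties are useful for sanity checks: the basis functions sum to one at every $y$ in $[l, u]$, and $h(l) = \vartheta_0$, $h(u) = \vartheta_M$, so the first and last coefficients fix the range of the transformation on the interval.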
2.2.2 Likelihood
Denote the complete parameter vector as $\boldsymbol{\theta} = (\boldsymbol{\beta}^\top, \boldsymbol{\vartheta}^\top)^\top$, where $\boldsymbol{\beta} = (\delta, \boldsymbol{\xi}^\top, \boldsymbol{\gamma}^\top)^\top \in \mathbb{R}^{2P+1}$ is the vector of regression coefficients parameterizing the function $\mu_d$ from Section 2.1 and $\boldsymbol{\vartheta} \in \mathbb{R}^{M+1}$ are the basis coefficients. We follow the maximum likelihood approach proposed by Hothorn et al.13 to jointly estimate $\boldsymbol{\beta}$ and $\boldsymbol{\vartheta}$. The advantages of embedding the model in the likelihood framework are as follows. (i) All forms of random censoring (right, left, and interval) as well as truncation can directly be incorporated into likelihood contributions defined in terms of the distribution function.36 Supplemental Material Section A details how ordinal biomarkers can be accommodated in the proposed modeling framework using interval-censored likelihood contributions. (ii) If the given model is correctly specified, under regularity conditions, the maximum likelihood estimator (MLE) is asymptotically the most efficient estimator. (iii) The MLE is asymptotically normally distributed, with a variance that can be computed from the inverse of the Fisher information matrix. This can be used to generate confidence intervals (CIs) for the estimated parameters. (iv) The MLE is equivariant, which implies invariance of the score test (or the Lagrange multiplier test) to reparameterizations.37,38 Specifically, we will show in Section 2.3.1 that, by inverting the score test, our method produces confidence bands for the ROC curve and appropriate score intervals for its summary indices.
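The likelihood contribution of a single observation $O = (Y, D, X)$, where $Y \in (\underline{y}, \bar{y}] = \{y \in \mathbb{R} : \underline{y} < y \le \bar{y}\}$, is given by
$$L(\boldsymbol{\theta} \mid O) = \begin{cases} f_Z\left(h(y \mid \boldsymbol{\vartheta}) - \mu_d(x \mid \boldsymbol{\beta})\right) h'(y \mid \boldsymbol{\vartheta}) & y \in \mathbb{R} \text{ (exact continuous)} \\ 1 - F_Z\left(h(\underline{y} \mid \boldsymbol{\vartheta}) - \mu_d(x \mid \boldsymbol{\beta})\right) & y \in (\underline{y}, \infty) \text{ (right censored)} \\ F_Z\left(h(\bar{y} \mid \boldsymbol{\vartheta}) - \mu_d(x \mid \boldsymbol{\beta})\right) & y \in (-\infty, \bar{y}) \text{ (left censored)} \\ F_Z\left(h(\bar{y} \mid \boldsymbol{\vartheta}) - \mu_d(x \mid \boldsymbol{\beta})\right) - F_Z\left(h(\underline{y} \mid \boldsymbol{\vartheta}) - \mu_d(x \mid \boldsymbol{\beta})\right) & y \in (\underline{y}, \bar{y}] \text{ (interval censored)} \end{cases}$$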
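where $f_Z$ is the density function of $Z$ and $h'(y \mid \boldsymbol{\vartheta})$ is the first derivative of the transformation function with respect to $y$. Given a sample of $N$ independent and identically distributed observations $O_i$ for $i = 1, \ldots, N$, the log-likelihood is given by $\ell(\boldsymbol{\theta}) = \sum_{i=1}^{N} \log(L_i(\boldsymbol{\theta}))$, where $L_i$ is the likelihood contribution of observation $i$. The (unconditional) maximum likelihood estimate of $\boldsymbol{\theta}$ is the solution to the optimization problem
$$\hat{\boldsymbol{\theta}} = (\hat{\boldsymbol{\beta}}^\top, \hat{\boldsymbol{\vartheta}}^\top)^\top = \arg\max_{\boldsymbol{\beta}, \boldsymbol{\vartheta}} \ell(\boldsymbol{\beta}, \boldsymbol{\vartheta})$$
subject to the monotonicity constraint $\vartheta_m \le \vartheta_{m+1}$ for $m = 0, \ldots, M - 1$. The resulting ROC curve only depends on $\boldsymbol{\beta}$, which is decoupled from the parameters $\boldsymbol{\vartheta}$ needed to model the transformation function. The score function is defined as the first derivative of the log-likelihood function with respect to each of the parameters and is given by
$$S(\boldsymbol{\theta}) = \begin{pmatrix} \frac{\partial \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\beta}} \\ \frac{\partial \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\vartheta}} \end{pmatrix} = \begin{pmatrix} S_{\boldsymbol{\beta}}(\boldsymbol{\theta}) \\ S_{\boldsymbol{\vartheta}}(\boldsymbol{\theta}) \end{pmatrix}$$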
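We perform constrained optimization using the likelihood and score contributions to determine the maximum likelihood estimates for $\boldsymbol{\beta}$ and $\boldsymbol{\vartheta}$ (for computational details, see Hothorn39). The asymptotic variance of the MLE can further be estimated by the inverse of the expected Fisher information matrix. The Fisher information is the variance-covariance matrix of the score function and is defined as
$$I(\boldsymbol{\theta}) = -\mathbb{E}\begin{pmatrix} \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\beta}\, \partial \boldsymbol{\beta}^\top} & \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\beta}\, \partial \boldsymbol{\vartheta}^\top} \\ \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\vartheta}\, \partial \boldsymbol{\beta}^\top} & \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\vartheta}\, \partial \boldsymbol{\vartheta}^\top} \end{pmatrix} = \begin{pmatrix} I_{\boldsymbol{\beta},\boldsymbol{\beta}}(\boldsymbol{\theta}) & I_{\boldsymbol{\beta},\boldsymbol{\vartheta}}(\boldsymbol{\theta}) \\ I_{\boldsymbol{\beta},\boldsymbol{\vartheta}}(\boldsymbol{\theta})^\top & I_{\boldsymbol{\vartheta},\boldsymbol{\vartheta}}(\boldsymbol{\theta}) \end{pmatrix}$$
The matrix is partitioned such that the submatrix $I_{\boldsymbol{\beta},\boldsymbol{\beta}}(\boldsymbol{\theta})$ corresponds to the parameters related to the disease indicator and covariates.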
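The four types of likelihood contribution can be sketched in a few lines. The example below assumes $F_Z = \mathrm{probit}^{-1}$ and takes the transformation $h$ and its derivative as user-supplied functions; the function name, the identity transformation, the shift value, and the data are all hypothetical and only illustrate the structure of the log-likelihood.

```python
import math
from statistics import NormalDist

Z = NormalDist()  # F_Z = probit^{-1}; other choices of F_Z work the same way


def log_lik_contribution(lower, upper, shift, h, h_prime):
    """Log-likelihood of one observation known to lie in (lower, upper].

    lower == upper     : exact continuous observation
    upper == math.inf  : right censored at lower
    lower == -math.inf : left censored at upper
    otherwise          : interval censored
    """
    if lower == upper:
        return math.log(Z.pdf(h(lower) - shift)) + math.log(h_prime(lower))
    cdf_upper = 1.0 if upper == math.inf else Z.cdf(h(upper) - shift)
    cdf_lower = 0.0 if lower == -math.inf else Z.cdf(h(lower) - shift)
    return math.log(cdf_upper - cdf_lower)


# Hypothetical data: two exact values and one value below a detection limit of 0.5,
# evaluated with h(y) = y (an assumption for illustration; the article estimates h).
observations = [(1.2, 1.2), (0.9, 0.9), (-math.inf, 0.5)]
total = sum(log_lik_contribution(lo, up, 0.3, lambda y: y, lambda y: 1.0)
            for lo, up in observations)
```

Summing such contributions over all observations gives the log-likelihood that is maximized subject to the monotonicity constraint on the basis coefficients.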
2.2.3 Limit of detection
Instrument precision can affect the evaluation of diagnostic biomarkers. For example, when biomarker levels are at or below the limit of detection (LOD) $y_{\mathrm{LOD}}$, the observed value lies in an interval $(-\infty, y_{\mathrm{LOD}})$ and the resulting measurement is left censored. Often a replacement value is substituted for such measurements. Alternatively, only biomarker values above the LOD are used for the ROC analysis. It has been shown that these approaches lead to biased estimation.40 Various adjustments to ROC curves and their summary indices have been proposed to handle such censored measurements.6,41,42 However, these methods typically do not account for covariates. Our framework naturally accounts for such observations in the likelihood function for left censored test results. Similarly, the right censored likelihood accounts for measurements which are affected by an upper limit of detection. Thus, our method provides a smooth covariate-specific ROC curve for all values of specificity with estimates and inference appropriately incorporating the observed information.
2.3 Confidence intervals
In the following section, we present three methods to calculate confidence bands for the ROC curve and CIs for its summary indices. Since these quantities are functions $G: \mathbb{R}^{2P+1} \to \mathbb{R}$ of the regression parameters $\boldsymbol{\beta}$ in the model, appropriate methods are needed to maintain nominal coverage of a CI for $G(\boldsymbol{\beta})$. The methods discussed include inverting the score test, the multivariate delta method and simulation from the asymptotic distribution of the estimate. The methods are ordered by their degree of theoretical justification. We start with score intervals, which are invariant to parameter transformations but become computationally expensive when dealing with a large set of parameters. We then discuss estimating the variance using the delta method and conclude with a simple simulation method which is versatile without being computationally demanding.
2.3.1 Score intervals
In the two-sample univariate case where $\delta$ defines the ROC curve, as in equation (4), we can construct score intervals for $\delta$. Unlike the Wald and other commonly used intervals, score intervals are especially desirable as they are invariant to transformations of the parameters. A score CI for $G(\delta)$ (e.g. the AUC $a(\delta)$) provides the same level of coverage as would a score CI for $\delta$. In turn, under a correctly specified model, a score CI for $\delta$ allows the construction of accurately covered uniform confidence bands for the ROC curve as well as intervals for its summary indices such as the AUC and the Youden index.
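We first generate score CIs for $\delta$ by inverting the score test. In this case, the null hypothesis is given by $H_0: \delta = \delta_0$, where $\delta_0$ is a specific value of the parameter of interest. Under $H_0$, the restricted (conditional) MLE for $\boldsymbol{\vartheta}$ can be obtained by
$$\hat{\boldsymbol{\vartheta}}(\delta_0) = \arg\max_{\boldsymbol{\vartheta}} \ell(\delta_0, \boldsymbol{\vartheta})$$
or as a solution of the $M + 1$ score equations $S_{\boldsymbol{\vartheta}}(\delta_0, \boldsymbol{\vartheta}) = 0$. Note that this estimate is a function of $\delta_0$. Letting $\tilde{\boldsymbol{\theta}} = (\delta_0, \hat{\boldsymbol{\vartheta}}(\delta_0)^\top)^\top$, the quadratic (Rao) score statistic simplifies to
$$R(\delta_0) = S(\tilde{\boldsymbol{\theta}})^\top I^{-1}(\tilde{\boldsymbol{\theta}})\, S(\tilde{\boldsymbol{\theta}}) = (S_\delta(\tilde{\boldsymbol{\theta}}), 0)\, I^{-1}(\tilde{\boldsymbol{\theta}})\, (S_\delta(\tilde{\boldsymbol{\theta}}), 0)^\top = S_\delta(\tilde{\boldsymbol{\theta}})\, A_{\delta,\delta}(\tilde{\boldsymbol{\theta}})\, S_\delta(\tilde{\boldsymbol{\theta}})$$
where $A_{\delta,\delta}(\tilde{\boldsymbol{\theta}})$ denotes the entry of the inverse Fisher information matrix corresponding to $\delta$ and is given by the inverse of the Schur complement, $\left(I_{\delta,\delta}(\boldsymbol{\theta}) - I_{\delta,\boldsymbol{\vartheta}}(\boldsymbol{\theta})\, I_{\boldsymbol{\vartheta},\boldsymbol{\vartheta}}^{-1}(\boldsymbol{\theta})\, I_{\delta,\boldsymbol{\vartheta}}(\boldsymbol{\theta})^\top\right)^{-1}$. Under $H_0$, $R(\delta_0)$ converges asymptotically to a chi-squared distribution with 1 degree of freedom, $R(\delta_0) \xrightarrow{D} \chi^2_1$. This result is explained by Rao.43 Inverting the score statistic by enumerating values of $\delta_0$ allows for the construction of $(1 - \alpha)$ score CIs for $\delta$ defined as
$$\{\delta_0 \mid R(\delta_0) < \chi^2_1(1 - \alpha)\}$$
where $\chi^2_1(1 - \alpha)$ is the $(1 - \alpha)$ quantile of the chi-squared distribution with 1 degree of freedom. Equivalently, we can use the square root of the score statistic (taken with the sign of $S_\delta(\tilde{\boldsymbol{\theta}})$) to construct a $(1 - \alpha)$ score interval using quantiles of the standard normal distribution, $\{\delta_0 \mid \Phi^{-1}(\alpha/2) < \sqrt{R(\delta_0)} \le \Phi^{-1}(1 - \alpha/2)\}$. Finally, we apply the function $G$ to both the lower and upper limits of the interval to construct score confidence bands for the ROC curve or score CIs for its summary indices.
The score statistic is given by $R(\delta_0) = S_\delta(\tilde{\boldsymbol{\theta}})^2 A_{\delta,\delta}(\tilde{\boldsymbol{\theta}})$. Testing whether there is a significant difference between the nondiseased and diseased populations corresponds to the hypothesis test $H_0: \delta = 0$. This is computationally efficient because only the distribution of $R(0)$ needs to be computed. However, computing score CIs for more than one parameter requires updating the restricted MLEs $\hat{\boldsymbol{\vartheta}}(\delta_0)$ for an enumeration of $\delta_0$ values. This becomes computationally intractable when enumerating a higher-dimensional grid of parameters.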
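The mechanics of inverting a score test by enumeration can be illustrated in a deliberately simplified setting. The sketch below assumes that $h$ is known and equal to the identity and that $F_Z = \mathrm{probit}^{-1}$, so there are no nuisance parameters to re-estimate for each $\delta_0$; this is a strong simplification of the procedure described above, in which $\hat{\boldsymbol{\vartheta}}(\delta_0)$ is updated at every grid point. Under these assumptions the score is $\sum_i (y_{1i} - \delta_0)$ and the Fisher information is $n_1$. All function names are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()


def score_interval(y1, alpha=0.05, grid_step=0.001, half_width=5.0):
    """Score CI for delta by grid enumeration (h known, no nuisance parameters)."""
    n1 = len(y1)
    total = sum(y1)
    critical = STD.inv_cdf(1.0 - alpha / 2.0) ** 2  # (1 - alpha) quantile of chi^2_1
    center = total / n1
    steps = int(round(half_width / grid_step))
    accepted = []
    for k in range(-steps, steps + 1):
        delta0 = center + k * grid_step
        score = total - n1 * delta0   # S_delta evaluated at delta0
        rao = score * score / n1      # R(delta0) = S^2 * A, with A = 1 / n1
        if rao < critical:
            accepted.append(delta0)
    return accepted[0], accepted[-1]


def auc_interval(y1, alpha=0.05):
    """Map the limits for delta through a(delta) = Phi(delta / sqrt(2))."""
    lo, hi = score_interval(y1, alpha)
    return STD.cdf(lo / math.sqrt(2.0)), STD.cdf(hi / math.sqrt(2.0))
```

Because $a$ is monotone, transforming the two limits preserves the coverage of the interval for $\delta$, which is the invariance property that motivates score intervals in the first place.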
2.3.2 Delta method
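Since the MLE satisfies
$$\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) \xrightarrow{D} N_{P+1}\left(0, A_{\boldsymbol{\beta},\boldsymbol{\beta}}(\boldsymbol{\theta})\right)$$
then by the multivariate delta method, $G(\hat{\boldsymbol{\beta}})$ also follows an asymptotic normal distribution with
$$\mathbb{V}(G(\hat{\boldsymbol{\beta}})) = \frac{1}{n} \nabla G(\boldsymbol{\beta})^\top A_{\boldsymbol{\beta},\boldsymbol{\beta}}(\boldsymbol{\theta})\, \nabla G(\boldsymbol{\beta})$$
where $\nabla G(\boldsymbol{\beta})$ is the gradient of $G$ evaluated at $\boldsymbol{\beta}$ and the block $A_{\boldsymbol{\beta},\boldsymbol{\beta}}(\boldsymbol{\theta})$ of the inverse Fisher information matrix is given by the inverse of the Schur complement, $\left(I_{\boldsymbol{\beta},\boldsymbol{\beta}}(\boldsymbol{\theta}) - I_{\boldsymbol{\beta},\boldsymbol{\vartheta}}(\boldsymbol{\theta})\, I_{\boldsymbol{\vartheta},\boldsymbol{\vartheta}}^{-1}(\boldsymbol{\theta})\, I_{\boldsymbol{\beta},\boldsymbol{\vartheta}}(\boldsymbol{\theta})^\top\right)^{-1}$. For example, when the shift term takes the linear form as in equation (6) and $G$ defines the AUC function for $F_Z = \mathrm{probit}^{-1}$, the entries of $\nabla G(\boldsymbol{\beta})$ are given by
$$\frac{\partial G(\boldsymbol{\beta})}{\partial \delta} = \frac{1}{\sqrt{2}} C, \quad \frac{\partial G(\boldsymbol{\beta})}{\partial \xi_i} = 0 \quad \text{and} \quad \frac{\partial G(\boldsymbol{\beta})}{\partial \gamma_i} = \frac{x_i}{\sqrt{2}} C$$
where $C = \phi\left(\frac{\delta + x^\top \boldsymbol{\gamma}}{\sqrt{2}}\right)$, $\phi$ is the density of the standard normal distribution and $i$ indexes the $P$ covariates. In general, the gradient can be estimated by calculating such derivatives and evaluating the resulting function at the MLE. Similarly, the variance-covariance matrix of the estimated parameters $A_{\boldsymbol{\beta},\boldsymbol{\beta}}(\boldsymbol{\theta})$ can be computed by inverting the numerically evaluated Hessian matrix. Thus, a $(1 - \alpha)$ level CI for $G(\boldsymbol{\beta})$ is given by
$$G(\hat{\boldsymbol{\beta}}) \pm \Phi^{-1}(\alpha/2) \sqrt{\mathbb{V}(G(\hat{\boldsymbol{\beta}}))}$$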
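The delta-method interval for the probit AUC follows directly from the gradient given above. The sketch below is an illustration with a hypothetical function name; it assumes the user supplies the estimated covariance matrix of $(\delta, \gamma_1, \ldots, \gamma_P)$, already divided by $n$, and omits the $\boldsymbol{\xi}$ entries because their gradient is zero.

```python
import math
from statistics import NormalDist

STD = NormalDist()


def auc_delta_ci(delta, gamma, x, cov, alpha=0.05):
    """Delta-method CI for AUC(x) = Phi((delta + x'gamma) / sqrt(2)).

    cov: estimated covariance matrix of (delta, gamma_1, ..., gamma_P),
    given as a list of rows. Returns (lower, estimate, upper).
    """
    eta = delta + sum(xi * gi for xi, gi in zip(x, gamma))
    c = STD.pdf(eta / math.sqrt(2.0))
    # Gradient entries: C / sqrt(2) for delta and x_i * C / sqrt(2) for gamma_i.
    grad = [c / math.sqrt(2.0)] + [xi * c / math.sqrt(2.0) for xi in x]
    size = len(grad)
    var = sum(grad[i] * cov[i][j] * grad[j]
              for i in range(size) for j in range(size))
    estimate = STD.cdf(eta / math.sqrt(2.0))
    half = STD.inv_cdf(1.0 - alpha / 2.0) * math.sqrt(var)
    return estimate - half, estimate, estimate + half
```

Unlike a score interval, this interval is symmetric around the estimate on the AUC scale, so for AUC values close to 1 its upper limit is not guaranteed to stay within $[0, 1]$.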
2.3.3 Simulated intervals
When the function Ghas complex derivatives, as would be the case for nonlinear shift terms 𝜇d(x)or when calculating
optimal thresholds cwhere Gincludes the inverse of the transformation function, constructing CIs using the delta method
becomes infeasible. For these cases, we apply a simple simulation-based algorithm which utilizes the asymptotic normality
of the MLE to calculate CIs for the ROC curve and its summary indices, which are functions of the parameters of interest.
The steps of the algorithm to construct (1𝛼)level CIs for G(
𝜷)can be summarized as follows:
Sewak and Hothorn 1411
Ta b l e 2 . Overview of the different methods used in the simulation study.
ROC AUC Youden index
Reference R package Estimate CB Estimate CI Estimate CI
Hothorn39 tram ✓✓✓✓
Harrell Jr45,46 rms ✓✓
Thas et al.33 pim ✓✓
Therneau47 survival ✓✓
Robin et al.48 pROC ✓✓✓✓
Fay49 asht ✓✓
Konietschke et al.50 nparcomp ✓✓
Khan and Brandenburger51 ROC it ✓✓✓✓
Feng et al.52 auRoc ✓✓
Perez-Jaume et al.53 ThresholdROC ✓✓
Ridout and Linkie54 overlap ✓✓
Franco-Pereira et al.55 -✓✓
Pèrez Fernàndez et al.56 nsROC ✓✓✓✓
ROC: receiver operating characteristic; AUC: area under the ROC curve; CI: confidence interval; CB: confidence band.
References to the original publication along with R software details are given. The () indicates if a method computes the specific metric. The metrics
included estimates for the ROC curve, AUC, and Youden index as well as corresponding CBs or CIs.
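1. Generate $B$ independent samples from the asymptotic multivariate normal distribution of the parameter estimates, $N_{P+1}\left(\hat{\boldsymbol{\beta}}, \frac{1}{n} A_{\boldsymbol{\beta},\boldsymbol{\beta}}(\hat{\boldsymbol{\Theta}})\right)$, and denote them as $\tilde{\boldsymbol{\beta}}_1, \dots, \tilde{\boldsymbol{\beta}}_B$.
2. For each sample $b = 1, \dots, B$, calculate the function of interest $G(\tilde{\boldsymbol{\beta}}_b)$.
3. Construct the CI $\left(Q_{G(\tilde{\boldsymbol{\beta}})}(\alpha/2), Q_{G(\tilde{\boldsymbol{\beta}})}(1-\alpha/2)\right)$, where $Q_{G(\tilde{\boldsymbol{\beta}})}$ is the empirical quantile function of the sample $G(\tilde{\boldsymbol{\beta}}_1), \dots, G(\tilde{\boldsymbol{\beta}}_B)$.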
A similar algorithm is presented by Mandel,44 who discusses its asymptotic validity and presents several examples showing that
its empirical coverage adheres to nominal levels, with results similar to the delta method.
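The three steps of the simulation-based algorithm can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation in the tram package: the parameter vector has two hypothetical components (a shift and one covariate interaction), the covariance matrix is a made-up value assumed to already include the factor $1/n$, and $G$ is taken to be a covariate-specific AUC under a probit link. All function names are illustrative.

```python
import random
from math import ceil, sqrt
from statistics import NormalDist


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a small positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def empirical_quantile(sorted_values, prob):
    """Inverse of the empirical distribution function at probability prob."""
    index = ceil(prob * len(sorted_values)) - 1
    return sorted_values[min(len(sorted_values) - 1, max(0, index))]


def simulated_ci(g, beta_hat, cov, alpha=0.05, n_draws=5000, seed=1):
    rng = random.Random(seed)
    factor = cholesky(cov)
    values = []
    for _ in range(n_draws):
        # Step 1: draw from N(beta_hat, cov) via the Cholesky factor.
        z = [rng.gauss(0.0, 1.0) for _ in beta_hat]
        draw = [b + sum(factor[i][k] * z[k] for k in range(i + 1))
                for i, b in enumerate(beta_hat)]
        # Step 2: evaluate the function of interest at the draw.
        values.append(g(draw))
    # Step 3: empirical quantiles of the simulated values.
    values.sort()
    return (empirical_quantile(values, alpha / 2.0),
            empirical_quantile(values, 1.0 - alpha / 2.0))


def auc_at_x(beta, x=1.0):
    """Probit-link AUC at covariate value x: Phi((delta + gamma * x) / sqrt(2))."""
    delta, gamma = beta
    return NormalDist().cdf((delta + gamma * x) / sqrt(2.0))


# Hypothetical estimate and covariance matrix (assumed to include the 1/n factor).
beta_hat = [1.5, -0.3]
cov = [[0.0004, -0.0001], [-0.0001, 0.0009]]
lower, upper = simulated_ci(auc_at_x, beta_hat, cov)
print(f"AUC = {auc_at_x(beta_hat):.3f}, simulated 95% CI = ({lower:.3f}, {upper:.3f})")
```

No derivative of $G$ is needed, which is the reason this approach remains usable for optimal thresholds or nonlinear shift terms.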
3 Empirical evaluation
We conducted a simulation study to evaluate the performance of our estimators in the two-sample setting. We chose this
setting to be able to compare various estimators commonly used in practice. The software details of all the methods used
alongside their respective features and references are summarized in Table 2.
We considered a data generating process (DGP) such that nondiseased test results followed a standard normal distribution
$F_0(y) = \Phi(y)$ and the diseased test results followed a distribution with the CDF $F_1(y) = F_{Z_{\mathrm{DGP}}}(F^{-1}_{Z_{\mathrm{DGP}}}(\Phi(y)) - \delta)$. To obtain different
shapes of the ROC curve, we considered three choices of $F_{Z_{\mathrm{DGP}}} \in \{\mathrm{probit}^{-1}, \mathrm{logit}^{-1}, \mathrm{cloglog}^{-1}\}$ and varied $\delta$ such that
AUC $\in \{0.5, 0.65, 0.8, 0.95\}$ or $J \in \{0, 0.25, 0.5, 0.8\}$, leading to a variety of configurations. Under this simulation
parameterization, the true ROC curves followed the form of equation (5) and the true summary indices could be calculated
as a function of $\delta$ from Table 1.
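The data generating process can be sketched as follows. If $Z \sim F_{Z_{\mathrm{DGP}}}$, then $Y_1 = \Phi^{-1}(F_{Z_{\mathrm{DGP}}}(Z + \delta))$ has the CDF $F_1$ given above, so diseased results can be drawn by inversion. The sketch below is an illustration only: the closed-form AUC expressions are standard results for these three shift models and are assumed here to agree with Table 1, which is not reproduced in this section. The bisection search for $\delta$ and all function names are assumptions made for this example.

```python
import random
from math import exp, log, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
EPS = 1e-12  # keeps probabilities strictly inside (0, 1)


def clamp(prob):
    return min(max(prob, EPS), 1.0 - EPS)


# (F_Z, F_Z^{-1}) for the three inverse link functions of the DGP.
LINKS = {
    "probit": (STD_NORMAL.cdf, STD_NORMAL.inv_cdf),
    "logit": (lambda z: 1.0 / (1.0 + exp(-z)), lambda u: log(u / (1.0 - u))),
    "cloglog": (lambda z: 1.0 - exp(-exp(z)), lambda u: log(-log(1.0 - u))),
}


def true_auc(link, delta):
    """Closed-form AUC as a function of the shift delta (assumed forms)."""
    if link == "probit":
        return STD_NORMAL.cdf(delta / sqrt(2.0))
    if link == "cloglog":
        return 1.0 / (1.0 + exp(-delta))
    if abs(delta) < 1e-5:  # logit: series expansion near delta = 0
        return 0.5 + delta / 6.0
    theta = exp(delta)
    return theta * (theta - 1.0 - delta) / (theta - 1.0) ** 2


def delta_for_auc(link, target, upper=20.0, iterations=100):
    """Bisection for the shift that yields a target AUC (AUC increases in delta)."""
    lower = 0.0
    for _ in range(iterations):
        mid = 0.5 * (lower + upper)
        if true_auc(link, mid) < target:
            lower = mid
        else:
            upper = mid
    return 0.5 * (lower + upper)


def sample_pair(link, delta, rng):
    cdf, quantile = LINKS[link]
    y0 = rng.gauss(0.0, 1.0)                        # nondiseased: standard normal
    z = quantile(clamp(rng.random()))               # Z ~ F_Z by inversion
    y1 = STD_NORMAL.inv_cdf(clamp(cdf(z + delta)))  # diseased: Phi^{-1}(F_Z(Z + delta))
    return y0, y1


rng = random.Random(2023)
for link in LINKS:
    delta = delta_for_auc(link, 0.8)
    pairs = [sample_pair(link, delta, rng) for _ in range(20000)]
    empirical = sum(y1 > y0 for y0, y1 in pairs) / len(pairs)
    print(f"{link:8s} delta = {delta:.3f}  empirical AUC = {empirical:.3f}")
```

The proportion of pairs with $Y_1 > Y_0$ serves as a quick check that the sampled data attain the targeted AUC.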
The conventional binormal model corresponded to $F_{Z_{\mathrm{DGP}}} = \mathrm{probit}^{-1}$ and induced proper binormal ROC curves.
This was the only configuration where the test results for both groups were normally distributed. We included this
configuration to ascertain the loss of power associated with our estimators when the standard binormal assumption held. For
other choices of $F_{Z_{\mathrm{DGP}}}$ with AUC $> 0.5$, the resulting distributions of the diseased test results were non-normal, with
variances and higher moments differing between the two groups. Specifically, the configuration of $F_{Z_{\mathrm{DGP}}} = \mathrm{logit}^{-1}$ led to
light-tailed distributions for the diseased test results, while $F_{Z_{\mathrm{DGP}}} = \mathrm{cloglog}^{-1}$ led to skewed, heavy-tailed distributions. The
corresponding density functions for the data generating model with selected AUC values are given in Figure 1.
Figure 1. Density functions for the model used to generate the data for the simulations. The nondiseased test results followed a
standard normal distribution corresponding to an AUC $= 0.5$. The diseased test results varied with three choices of $F_{Z_{\mathrm{DGP}}}$: $\mathrm{probit}^{-1}$,
$\mathrm{logit}^{-1}$, and $\mathrm{cloglog}^{-1}$, each of which had an AUC of 0.5, 0.65, 0.8, and 0.95. DGP: data generating process; AUC: area under the
receiver operating characteristic curve.
For 10,000 replications of each configuration, we generated balanced data sets with sample sizes $N_0 = N_1 \in \{25, 50, 100\}$.
The transformation models discussed in Section 2 were fitted to the simulated data sets assuming a
parameterization of the transformation function given by a Bernstein basis polynomial of order $M = 6$ (see Hothorn39
for a discussion on suitable choices for $M$). The true data-generating model had a nonlinear transformation function
$h = F^{-1}_{Z_{\mathrm{DGP}}} \circ \Phi$. Our model estimation procedure aimed to approximate this function alongside the shift parameter $\delta$. The
functions implementing transformation models for different choices of $F_Z$ are available from the tram add-on package.39
Note that the function $F_{Z_{\mathrm{DGP}}}$ is for the data generating process in the simulation study and is distinct from $F_Z$, the inverse
link function used in the model. When $F_{Z_{\mathrm{DGP}}} = F_Z$, the model is correctly specified for the DGP.
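To make the Bernstein parameterization of the transformation function concrete, the following sketch evaluates a polynomial in Bernstein form of order $M = 6$ on a bounded interval. It is a generic illustration and not the parameterization code of the tram or mlt packages, whose scaling conventions may differ. The coefficient vector and the interval $[-4, 4]$ are hypothetical. Nondecreasing coefficients are a sufficient condition for a monotone $h$, which is what a transformation function requires.

```python
from math import comb


def bernstein_basis(y, order, lower, upper):
    """Values of the order + 1 Bernstein basis polynomials at y in [lower, upper]."""
    t = (y - lower) / (upper - lower)
    return [comb(order, m) * t ** m * (1.0 - t) ** (order - m)
            for m in range(order + 1)]


def transformation(y, theta, lower, upper):
    """h(y) = sum_m theta_m * b_m(y) for a coefficient vector theta."""
    order = len(theta) - 1
    basis = bernstein_basis(y, order, lower, upper)
    return sum(coef * value for coef, value in zip(theta, basis))


# Hypothetical nondecreasing coefficients (M = 6 gives seven of them).
theta = [-3.0, -1.5, -0.8, 0.0, 0.7, 1.6, 3.0]
for y in (-4.0, -2.0, 0.0, 2.0, 4.0):
    print(f"h({y:+.1f}) = {transformation(y, theta, -4.0, 4.0):+.3f}")
```

The basis functions sum to one at every point, so $h$ interpolates the first and last coefficients at the interval boundaries.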
Figure 2 displays the distribution of bias for the AUC estimates using the proposed methods under the various simulation
configurations. We found that all three methods had minimal bias for an AUC $= 0.5$, where the test results were unable
to distinguish between the two groups. The models with $F_Z \in \{\mathrm{probit}^{-1}, \mathrm{logit}^{-1}\}$ yielded approximately unbiased AUC
estimates in all cases, even when they were misspecified for the true data generating process. However, estimates based on
the proportional hazards model $F_Z = \mathrm{cloglog}^{-1}$ were biased for all data generating processes other than the one for which it was correctly
specified.
We compared our approaches to a set of alternative methods (see Table 2) for computing CIs for the AUC and Youden
index. We detail the empirical coverage and average width of the CIs for the AUC in Supplemental Figures S1 and S2,
respectively. Estimates based on transformation models (R packages tram, orm, pim) yielded coverage close to the nominal
level and significantly outperformed the other methods when the model was correctly specified for the true data generating
process. All other methods generally performed close to nominal levels for low to medium AUC values (0.5–0.8), but broke
down for higher AUC values. In addition, the score CIs from the transformation model with $F_Z = \mathrm{logit}^{-1}$ were accurate
even when it was misspecified for the true data generating process. However, methods which used $F_Z = \mathrm{cloglog}^{-1}$ gave
CIs which were shorter in length (overconfident).
Analogously, Supplemental Figures S3 and S4 detail the coverage and length of the CIs for the Youden index. The
methods which were based on the overlap coefficient failed to cover the configuration where $J = 0$ because their lower limits
were never below 0. Our methods estimated CIs for $\delta$ which naturally accounted for this scenario. The transformation
model with $F_Z = \mathrm{logit}^{-1}$ provided coverage at nominal levels for all simulation configurations with a relatively small CI
width. The approach of Franco-Pereira et al.55 (FP) was also accurate under model misspecification but was more involved.
Namely, it consisted of estimating Box-Cox transformation parameters under a binormal framework with bootstrap
variance, all carried out on the logit scale and then back-transformed. In a setting with covariates, censoring, or $J = 0$, this
methodology would be limited.
Figure 2. Distribution of bias from the simulation study for estimation of the AUC. The DGP for nondiseased results was
$F_0(y) = \Phi(y)$ and for diseased results $F_1(y) = F_{Z_{\mathrm{DGP}}}(F^{-1}_{Z_{\mathrm{DGP}}}(\Phi(y)) - \delta)$. We varied $F_{Z_{\mathrm{DGP}}} \in \{\mathrm{probit}^{-1}, \mathrm{logit}^{-1}, \mathrm{cloglog}^{-1}\}$, AUC and
sample size. The proposed methods also varied by the same inverse link functions. An alignment of colors in the column (DGP) and
the fill of the box plot is indicative that the method is correctly specified for the DGP. DGP: data generating process; AUC: area
under the receiver operating characteristic curve.
Supplemental Figures S5 and S6 show the coverage and area of the confidence bands for the ROC curve. All
the approaches based on transformation models covered the configuration with AUC $= 0.5$ accurately. However, the
other approaches did not yield coverage close to nominal levels in this configuration, with the exception of
Martínez-Camblor et al.,57 whose confidence bands had a significantly larger area, indicating lower power. For all other configurations,
only transformation models which were correctly specified for the true data-generating model provided accurate results.
In addition to the simulations described above, we considered three other scenarios to evaluate the robustness of our
proposed methods to model misspecification. The details for each of these scenarios are given in Supplemental
Materials Section B. In terms of the AUC, we noticed that our models were generally robust to misspecification, but could break
down in certain cases. However, the proportional hazards model with $F_Z = \mathrm{cloglog}^{-1}$ resulted in poor performance under
misspecified configurations, indicating that it should be used with caution.
4 Application
The prevalence of obesity has increased consistently in most countries over the past decade, and this trend is a serious
global health concern.58 Obesity contributes directly to increased risk of cardiovascular disease (CVD) and its risk factors,
including type 2 diabetes, hypertension, and dyslipidemia.59,60 Metabolic syndrome (MetS) refers to the joint presence of
several cardiovascular risk factors and is characterized by insulin resistance.61 The National Cholesterol Education Program
Adult Treatment Panel III (NCEP-ATP III) criteria provide the most widely used definition of MetS, but they require laboratory
analysis of a blood sample. This has led to the search for non-invasive techniques which allow reliable and early detection
of MetS.
Waist-to-height ratio (WHtR) is a well-known anthropometric index used to predict visceral obesity. Visceral obesity
is an independent risk factor for development of MetS by means of the increased production of free fatty acids whose
presence obstructs insulin activity.62 This suggests that higher values of WHtR, reflecting obesity and CVD risk factors,
are more indicative of incident MetS. Several studies have found that WHtR is highly predictive of MetS.63–65 However,
as waist circumference changes with age and gender,66 it is also important to study whether or not the performance of
WHtR in diagnosing MetS is impacted by these variables. Evaluation of WHtR as a predictor of MetS after adjusting for
covariates is necessary so that more tailored interventions can be initiated to improve outcomes.
We illustrate our methods using data from a cross-sectional study designed to validate the use of WHtR and
systolic blood pressure (SBP) as markers for early detection of MetS in a working population from the Balearic Islands
(Spain). Detailed descriptions of the study methodology and population characteristics are reported in Romero-Saldaña et
al.67 Briefly, data on 60 799 workers were collected during their periodic occupational health assessments between 2012 and 2016.
Presence of MetS was determined by the NCEP-ATP III criteria and the sample included 5487 workers with MetS.
4.1 Two-sample analysis
We first examined the unconditional performance of WHtR as a marker to diagnose MetS, denoted $Y$ and $D$, respectively.
We fitted a linear transformation model with corresponding ROC curve of the form in equation (5), where $\delta$ is the shift
parameter, for various choices of the inverse link function $F_Z$. Associated inference for the AUC and $J$ was calculated using
the closed-form expressions from Table 1. The resulting estimates are presented in Table 3. The AUCs were consistently
bounded away from 0.5, indicating a good capacity of WHtR to discriminate between workers with and without MetS. This
can also be seen from the estimated ROC curve plotted in Figure 3, which lies well above the diagonal line, as well as from the
modeled densities, which have a small degree of overlap. The CIs and uniform confidence bands were narrow due to
the large sample size.
Table 3. Estimates and 95% score confidence intervals of the shift parameter, AUC and $J$ in the two-sample linear transformation
model for WHtR as a marker of MetS.
$F_Z$                      $\delta$                AUC                     $J$
$\mathrm{probit}^{-1}$     1.492 (1.462, 1.521)    0.854 (0.849, 0.859)    0.544 (0.535, 0.553)
$\mathrm{logit}^{-1}$      2.785 (2.730, 2.841)    0.871 (0.866, 0.875)    0.602 (0.593, 0.611)
$\mathrm{cloglog}^{-1}$    1.186 (1.157, 1.215)    0.766 (0.761, 0.771)    0.412 (0.403, 0.421)
$\mathrm{loglog}^{-1}$     1.425 (1.397, 1.453)    0.806 (0.802, 0.810)    0.484 (0.475, 0.492)
AUC: area under the receiver operating characteristic curve; MetS: metabolic syndrome; WHtR: waist-to-height ratio.
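The mapping from the shift parameter to the summary indices can be checked numerically. The sketch below uses standard closed forms for shift models with these four inverse links. They are assumed here to coincide with the expressions in Table 1, which is not reproduced in this section, and they return the point estimates of Table 3 to three decimals. Function names are illustrative, and the logit AUC formula requires $\delta \neq 0$.

```python
from math import exp, sqrt, tanh
from statistics import NormalDist

PHI = NormalDist().cdf


def auc_from_shift(link, delta):
    """AUC as a function of the shift parameter (assumed closed forms)."""
    if link == "probit":
        return PHI(delta / sqrt(2.0))
    if link == "logit":  # requires delta != 0
        theta = exp(delta)
        return theta * (theta - 1.0 - delta) / (theta - 1.0) ** 2
    if link in ("cloglog", "loglog"):
        return 1.0 / (1.0 + exp(-delta))
    raise ValueError(f"unknown link: {link}")


def youden_from_shift(link, delta):
    """Youden index as a function of the shift parameter (assumed closed forms)."""
    if link == "probit":
        return 2.0 * PHI(delta / 2.0) - 1.0
    if link == "logit":
        return tanh(delta / 4.0)
    if link in ("cloglog", "loglog"):  # requires delta != 0
        theta = exp(delta)
        return theta ** (-1.0 / (theta - 1.0)) - theta ** (-theta / (theta - 1.0))
    raise ValueError(f"unknown link: {link}")


# Point estimates (delta, AUC, J) as reported in Table 3.
table3 = {
    "probit": (1.492, 0.854, 0.544),
    "logit": (2.785, 0.871, 0.602),
    "cloglog": (1.186, 0.766, 0.412),
    "loglog": (1.425, 0.806, 0.484),
}
for link, (delta, auc, youden) in table3.items():
    print(f"{link:8s} AUC {auc_from_shift(link, delta):.3f} (reported {auc:.3f})  "
          f"J {youden_from_shift(link, delta):.3f} (reported {youden:.3f})")
```

Because both indices are monotone in $\delta$, applying the same functions to the limits of a CI for $\delta$ gives the corresponding limits for the AUC and $J$.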
Figure 3. Estimates from the linear transformation model with a single shift parameter, $h(Y) = \delta d + Z$, where $Z$ follows a
standard logistic distribution. (A) Density functions of WHtR for the workers who were diagnosed with MetS (dotted line) and those
who were not (solid line). (B) ROC curve for WHtR as a marker of MetS; 95% uniform score confidence bands are represented
by gray shaded areas. MetS: metabolic syndrome; WHtR: waist-to-height ratio.
4.2 Conditional ROC analysis
Next, we investigated if the discriminatory ability of WHtR in separating workers with and without MetS varies with
covariates. We considered a transformation model that included the main effects of covariates plus interaction terms with
the disease indicator, which leads to the ROC curve given by
$$\mathrm{ROC}(p \mid \boldsymbol{x}) = 1 - F_Z\left(F_Z^{-1}(1-p) - (\delta + \boldsymbol{\gamma}^\top \boldsymbol{x})\right)$$
where the covariates $\boldsymbol{x}$ included age, gender, and tobacco consumption. The choice of $F_Z = \mathrm{logit}^{-1}$ was made using repeated
holdout validation. We describe this model selection procedure in Supplemental Material Section C and show the results
for different model choices and parameterizations.
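The covariate-specific ROC curve above can be evaluated directly once the shift and interaction coefficients are available. The following sketch does so for the logit inverse link. The coefficient values and the covariate coding (age in years, male indicator, smoker indicator) are hypothetical and are not the estimates from the fitted model. The AUC is approximated with the trapezoidal rule rather than taken from Table 1.

```python
from math import exp, log


def expit(z):
    return 1.0 / (1.0 + exp(-z))


def logit(u):
    return log(u / (1.0 - u))


def conditional_roc(p, delta, gamma, x):
    """ROC(p | x) = 1 - F_Z(F_Z^{-1}(1 - p) - (delta + gamma' x)), F_Z = expit."""
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    shift = delta + sum(g * v for g, v in zip(gamma, x))
    return 1.0 - expit(logit(1.0 - p) - shift)


def conditional_auc(delta, gamma, x, grid_size=2000):
    """Trapezoidal approximation of the area under ROC(p | x)."""
    points = [conditional_roc(i / grid_size, delta, gamma, x)
              for i in range(grid_size + 1)]
    return (sum(points) - 0.5 * (points[0] + points[-1])) / grid_size


# Hypothetical coefficients: shift, then interactions with
# age (years), male indicator, and smoker indicator.
delta = 3.0
gamma = (-0.02, -0.5, 0.1)
female_40 = (40.0, 0.0, 0.0)
male_40 = (40.0, 1.0, 0.0)

for label, x in (("female, 40", female_40), ("male, 40", male_40)):
    print(f"{label:11s} ROC(0.1 | x) = {conditional_roc(0.1, delta, gamma, x):.3f}  "
          f"AUC = {conditional_auc(delta, gamma, x):.3f}")
```

All covariate information enters through the scalar $\delta + \boldsymbol{\gamma}^\top \boldsymbol{x}$, so covariate-specific summary indices follow from the two-sample formulas with this quantity in place of $\delta$.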
Figure 4 displays the covariate-specific ROC curves fitted to these data. The performance of WHtR appeared to be better
for females compared to males and decreased with age. The effect of smoking, although significant in the model, did not
seem to substantially alter the ROC curves when the other covariates were kept fixed. To inspect the covariate effect further,
we calculated the age- and gender-specific AUCs and Youden indices from the model. Figure 5 clearly shows that the
discriminatory capability of WHtR in distinguishing workers with MetS was consistently better for females and decreased
with age.
Figure 4. Estimated covariate-specific ROC curves for WHtR as a marker of MetS for female (solid line) and male workers (dashed
line). Vertical panels represent a specific age (30, 40, 50) and horizontal panels smoking status. ROC: receiver operating
characteristic; MetS: metabolic syndrome; WHtR: waist-to-height ratio.
Figure 5. Age-based AUC and Youden indices where WHtR is used as a marker to detect MetS for non-smoking female (solid line)
and male (dashed line) workers. 95% Wald pointwise confidence bands are represented by gray shaded areas. AUC: area under the
receiver operating characteristic curve; MetS: metabolic syndrome; WHtR: waist-to-height ratio.
5 Discussion
This article presents a new modeling framework for ROC analysis that can be used to characterize the accuracy of medical
diagnostic tests. Our model is based on estimating an unknown transformation function for the test results and yields a
distribution-free yet model-based estimator for the ROC curve. Covariates that influence the diagnostic accuracy of tests
can naturally be accommodated as regression parameters into the model and covariate-specific summary indices such as
the AUC and Youden index are easily computed using closed-form expressions.
Our proposed approach has several features which distinguish it from contemporary methods of ROC analysis. Firstly,
we employ maximum likelihood to jointly estimate all parameters defining the transformation function and regression
coefficients. This implies the variation in the estimated transformation parameters is accounted for and appropriately propagated
to inference for the ROC curve. In turn, asymptotic efficiency is guaranteed for our estimators and we avoid reliance on
resampling procedures for the construction of CIs. Secondly, transformation models focus on estimating the conditional
distribution function whose evaluation directly provides the likelihood contributions for interval-, right-, and left-censored data
that commonly arise due to instrument detection limits. Thirdly, no strong assumptions are made regarding the
transformation function, which results in a highly flexible model that retains interpretability of the regression coefficients. Lastly,
software implementations for all the methods described in this article are available in the tram R add-on package (see
Supplemental Material for example code), thus enabling a unified framework for ROC analysis.
In our simulation study, interestingly, we found that a model with $F_Z = \mathrm{logit}^{-1}$ provided accurate results even when
it was misspecified for the true data generating process. This model also behaves very similarly to the semiparametric
cumulative probability model,68 both of which estimate a log-odds ratio $\delta$. The equivalence of the transformation model's
odds ratio to the MWW test statistic has been well studied.69 The MWW statistic has a bounded influence function and is
robust to contaminations of the specified model.70 Due to their equivalence, we hypothesize that the transformation model
with $F_Z = \mathrm{logit}^{-1}$ is also endowed with the same robustness properties as the MWW and thus can be chosen when no a
priori model is known.
One aspect that warrants further investigation is model selection, specifically with regard to the choice of $F_Z$. One
strategy would be to define $F_Z$ tailored to a specific interpretation of the parameters $\delta$, $\boldsymbol{\beta}$, and $\boldsymbol{\gamma}$, for example, as log-odds
ratios with $F_Z = \mathrm{logit}^{-1}$ or $F_Z = \mathrm{cloglog}^{-1}$ for hazard ratios.34 A second option is to use some form of cross validation
in combination with model assessment via the probability integral transform (PIT) (as discussed in Supplemental
Materials Section C). Third, and in analogy to single index models, one could introduce parameters to $F_Z$ such that the shape
of the inverse link function is estimated along with all other model parameters (McLain and Ghosh71 discuss a family of
link functions including the complementary log-log and logit). Finally, we could completely relax the assumption that the
difference between the diseased and nondiseased distributions is described by a shift term. In this case, separate
transformation functions would be allowed in each of the two groups. Namely, consider a stratified model where the nondiseased
results follow a distribution with the CDF $F_0(y) = F_Z(h_0(y))$ and the diseased results a distribution with the CDF $F_1(y) = F_Z(h_1(y))$. Defining
a new transformation function $r = h_1 \circ h_0^{-1} \circ F_Z^{-1} : [0, 1] \to \mathbb{R}$, the smooth ROC curve with no shift assumptions is given
by $\mathrm{ROC}(p) = 1 - F_Z(r(1-p))$. This model has more flexibility but sacrifices the properness property desirable for
ROC curves. Furthermore, care has to be taken in defining the correct likelihood contributions for accurate inference of
this model as uncertainty enters from both transformation functions.
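The loss of properness in the stratified model can be illustrated with a small sketch. It assumes a probit inverse link and two hypothetical linear transformation functions, $h_0(y) = y$ and $h_1(y) = (y - 0.5)/3$, which correspond to a binormal model with unequal variances. For comparison, a pure shift $h_1(y) = y - 0.5$ is also evaluated. None of these choices come from the article, and the function names are illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def stratified_roc(p, h1, h0_inverse,
                   cdf=STD_NORMAL.cdf, quantile=STD_NORMAL.inv_cdf):
    """ROC(p) = 1 - F_Z(r(1 - p)) with r = h1 o h0^{-1} o F_Z^{-1}."""
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    return 1.0 - cdf(h1(h0_inverse(quantile(1.0 - p))))


def h0_inverse(z):
    """Inverse of the hypothetical nondiseased transformation h0(y) = y."""
    return z


def h1_scaled(y):
    """Hypothetical diseased transformation with a different scale."""
    return (y - 0.5) / 3.0


def h1_shift(y):
    """Diseased transformation that differs from h0 by a shift only."""
    return y - 0.5


grid = [i / 100 for i in range(1, 100)]
below_scaled = [p for p in grid if stratified_roc(p, h1_scaled, h0_inverse) < p]
below_shift = [p for p in grid if stratified_roc(p, h1_shift, h0_inverse) < p]
print("stratified model dips below the diagonal from p =", min(below_scaled))
print("shift model points below the diagonal:", len(below_shift))
```

In this example the stratified curve crosses the chance diagonal at high false positive rates, whereas the shift model stays above it over the whole grid.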
In future work, we plan to pursue various extensions of transformation models for ROC analysis to consider (1)
penalty terms for high-dimensional covariates,72 (2) mixed effects for clustered observations,73 and (3) covariate-dependent
transformation functions through forest-based machine learning methods.74
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work
was supported by the Swiss National Science Foundation, grant number 200021_184603.
ORCID iD
Torsten Hothorn https://orcid.org/0000-0001-8301-0471
Supplemental material
Supplemental materials for this article are available online.
References
1. Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford, UK: Oxford University Press, 2003.
2. Zou KH, Liu A, Bandos AI, et al. Statistical Evaluation of Diagnostic Performance: Topics in ROC Analysis. Boca Raton, FL, USA:
CRC Press, 2011.
3. Pepe MS. A regression modelling framework for receiver operating characteristic curves in medical diagnostic testing. Biometrika
1997; 84: 595–608.
4. Faraggi D. Adjusting receiver operating characteristic curves and related indices for covariates. J R Stat Soc: Ser D (The Statistician)
2003; 52: 179–192.
5. Perkins NJ, Schisterman EF and Vexler A. Receiver operating characteristic curve inference from a sample with a limit of detection.
Am J Epidemiol 2007; 165: 325–333.
6. Bantis LE, Yan Q, Tsimikas JV, et al. Estimation of smooth ROC curves for biomarkers with limits of detection. Stat Med 2017;
36: 3830–3843.
7. Inácio V, Lourenço VM, de Carvalho M, et al. Robust and flexible inference for the covariate-specific receiver operating
characteristic curve. Stat Med 2021; 40: 5779–5795.
8. Inácio V, Rodríguez-Álvarez MX and Gayoso-Diz P. Statistical evaluation of medical tests. Annu Rev Stat Appl 2021; 8: 41–67.
9. Zou KH, Tempany CM, Fielding JR, et al. Original smooth receiver operating characteristic curve estimation from continuous data:
statistical methods for analyzing the predictive value of spiral CT of ureteral stones. Acad Radiol 1998; 5: 680–687.
10. Zou KH, Hall W. Two transformation models for estimating an ROC curve derived from continuous data. J Appl Stat 2000; 27:
621–631.
11. Zou KH, Hall WJ and Shapiro DE. Smooth non-parametric receiver operating characteristic (ROC) curves for continuous diagnostic
tests. Stat Med 1997; 16: 2143–2156.
12. Hothorn T, Kneib T and Bühlmann P. Conditional transformation models. J R Stat Soc: Ser B (Statistical Methodology) 2014; 76:
3–27.
13. Hothorn T, Möst L and Bühlmann P. Most likely transformations. Scand J Stat 2018; 45: 110–134.
14. Bamber D. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J Math
Psychol 1975; 12: 387–415.
15. Youden WJ. Index for rating diagnostic tests. Cancer 1950; 3: 32–35.
16. Komaba A, Johno H and Nakamoto K. A novel statistical approach for two-sample testing based on the overlap coefficient, 2022.
https://arxiv.org/abs/2206.03166. arXiv:2206.03166 [math.ST].
17. Weitzman MS. Measures of Overlap of Income Distributions of White and Negro Families in the United States. 3. Washington, DC:
US Bureau of the Census, 1970.
18. Feller W. An Introduction to Probability Theory and Its Applications. New York, NY, USA: Wiley, 1991.
19. Schmid F, Schmidt A. Nonparametric estimation of the coefficient of overlapping—theory and empirical application. Comput Stat
Data Anal 2006; 50: 1583–1596.
20. Martínez-Camblor P. About the use of the overlap coefficient in the binary classification context. Commun Stat-Theor Method 2022;
1–11.
21. Tosteson ANA, Begg CB. A general regression methodology for ROC curve estimation. Med Decis Making 1988; 8: 204–215.
22. Bickel PJ, Doksum KA. An analysis of transformations revisited. J Am Stat Assoc 1981; 76: 296–311.
23. Dorfman DD, Alf Jr E. Maximum-likelihood estimation of parameters of signal-detection theory and determination of confidence
intervals-rating-method data. J Math Psychol 1969; 6: 487–496.
24. Ogilvie JC, Creelman CD. Maximum-likelihood estimation of receiver operating characteristic curve parameters. J Math Psychol
1968; 5: 377–391.
25. Gönen M, Heller G. Lehmann family of ROC curves. Med Decis Making 2010; 30: 509–517.
26. Khan RA. Resilience family of receiver operating characteristic curves. IEEE Trans Reliab 2022. DOI: 10.1109/TR.2022.3194710.
27. Zou KH. Analysis of Some Transformation Models for the Two-sample Problem With Special Reference to Receiver Operating
Characteristic Curves. PhD thesis, University of Rochester, 1997.
28. Alonzo TA, Pepe MS. Distribution-free ROC analysis using binary regression techniques. Biostatistics 2002; 3: 421–432.
29. Pan X, Metz CE. The “proper” binormal model: parametric receiver operating characteristic curve estimation with degenerate data.
Acad Radiol 1997; 4: 380–389.
30. McIntosh MW, Pepe MS. Combining several screening tests: optimality of the risk score. Biometrics 2002; 58: 657–664.
31. Siegfried S, Kook L and Hothorn T. Distribution-free location-scale regression. Am Stat 2022. DOI: 10.1080/00031305.2023.2203177.
32. Dodd LE, Pepe MS. Partial AUC estimation and regression. Biometrics 2003; 59: 614–623.
33. Thas O, Neve JD, Clement L, et al. Probabilistic index models. J R Stat Soc: Ser B (Statistical Methodology) 2012; 74: 623–671.
34. De Neve J, Thas O and Gerds TA. Semiparametric linear transformation models: effect measures, estimators, and applications. Stat
Med 2019; 38: 1484–1501.
35. Farouki RT. The Bernstein polynomial basis: a centennial retrospective. Comput Aided Geom Des 2012; 29: 379–419.
36. Lindsey JK. Parametric Statistical Inference. Oxford, UK: Oxford University Press, 1996.
37. Rao CR. Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation.
In Mathematical Proceedings of the Cambridge Philosophical Society, volume 44. Cambridge, UK: Cambridge University Press,
1948. pp. 50–57.
38. Dagenais MG, Dufour JM. Invariance, nonlinear models, and asymptotic tests. Economet: J Economet Soc 1991; 59: 1601–1615.
39. Hothorn T. Most likely transformations: the mlt package. J Stat Softw 2020; 92: 1–68.
40. Lynn HS. Maximum likelihood inference for left-censored HIV RNA data. Stat Med 2001; 20: 33–45.
41. Mumford SL, Schisterman EF, Vexler A, et al. Pooling biospecimens and limits of detection: effects on ROC curve analysis.
Biostatistics 2006; 7: 585–598.
42. Xiong C, Luo J, Agboola F, et al. A family of estimators to diagnostic accuracy when candidate tests are subject to detection
limits—application to diagnosing early stage Alzheimer’s disease. Stat Methods Med Res 2022; 31: 882–898.
43. Rao CR. Score test: historical review and recent developments. In Advances in Ranking and Selection, Multiple Comparisons, and
Reliability. Boston, MA, USA: Birkhäuser, 2005; pp. 8–20.
44. Mandel M. Simulation-based confidence intervals for functions with complicated derivatives. Am Stat 2013; 67: 76–81.
45. Harrell Jr FE. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis.
608. New York: Springer, 2001.
46. Harrell Jr FE. rms: Regression Modeling Strategies, 2022. https://CRAN.R-project.org/package=rms. R package version 6.3-0.
47. Therneau TM. survival: A Package for Survival Analysis in R, 2022. https://CRAN.R-project.org/package=survival. R package
version 3.3-1.
48. Robin X, Turck N, Hainard A, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC
Bioinform 2011; 12: 77.
49. Fay MP. asht: Applied Statistical Hypothesis Tests, 2022. https://CRAN.R-project.org/package=asht. R package version 0.9.7.
50. Konietschke F, Placzek M, Schaarschmidt F, et al. nparcomp: an R software package for nonparametric multiple comparisons and
simultaneous confidence intervals. J Stat Softw 2015; 64: 1–17. DOI: http://www.jstatsoft.org/v64/i09/.
51. Khan MRA, Brandenburger T. ROCit: Performance Assessment of Binary Classifier with Visualization, 2020.
https://CRAN.R-project.org/package=ROCit. R package version 2.1.1.
52. Feng D, Manevski D and Perme MP. auRoc: Various Methods to Estimate the AUC, 2020.
https://CRAN.R-project.org/package=auRoc. R package version 0.2-1.
53. Perez-Jaume S, Skaltsa K, Pallarès N, et al. ThresholdROC: optimum threshold estimation tools for continuous diagnostic tests in
R. J Stat Softw 2017; 82: 1–21.
54. Ridout M, Linkie M. Estimating overlap of daily activity patterns from camera trap data. J Agric Biol Environ Stat 2009; 14: 322–337.
55. Franco-Pereira AM, Nakas CT, Reiser B, et al. Inference on the overlap coefficient: the binormal approach and alternatives. Stat
Methods Med Res 2021; 30: 2672–2684.
56. Pérez Fernández S, Martínez Camblor P, Filzmoser P, et al. nsROC: an R package for non-standard ROC curve analysis. R J 2018;
10: 55–77.
57. Martínez-Camblor P, Pérez-Fernández S and Corral N. Efficient nonparametric confidence bands for receiver operating-
characteristic curves. Stat Methods Med Res 2018; 27: 1892–1908.
58. Abarca-Gómez L, Abdeen ZA, Hamid ZA, et al. Worldwide trends in body-mass index, underweight, overweight, and obesity from
1975 to 2016: a pooled analysis of 2416 population-based measurement studies in 128.9 million children, adolescents, and adults.
Lancet 2017; 390: 2627–2642.
59. Zalesin KC, Franklin BA, Miller WM, et al. Impact of obesity on cardiovascular disease. Endocrinol Metab Clin North Am 2008;
37: 663–684.
60. Grundy SM. Obesity, metabolic syndrome, and cardiovascular disease. J Clin Endocr Metab 2004; 89: 2595–2600.
61. Eckel RH, Alberti KGMM, Grundy SM, et al. The metabolic syndrome. Lancet 2010; 375: 181–183.
62. Bosello O, Zamboni M. Visceral obesity and metabolic syndrome. Obes Rev 2000; 1: 47–56.
63. Shao J, Yu L, Shen X, et al. Waist-to-height ratio, an optimal predictor for obesity and metabolic syndrome in Chinese adults. J Nutr
Health Aging 2010; 14: 782–785.
64. Romero-Saldaña M, Fuentes-Jiménez FJ, Vaquero-Abellán M, et al. New non-invasive method for early detection of metabolic
syndrome in the working population. Eur J Cardiovasc Nurs 2016; 15: 549–558.
65. Suliga E, Ciesla E, Głuszek-Osuch M, et al. The usefulness of anthropometric indices to identify the risk of metabolic syndrome.
Nutrients 2019; 11: 2598.
66. Stevens J, Katz EG and Huxley RR. Associations between gender, age and waist circumference. Eur J Clin Nutr 2010; 64: 6–15.
67. Romero-Saldaña M, Tauler P, Vaquero-Abellán M, et al. Validation of a non-invasive method for the early detection of metabolic
syndrome: a diagnostic accuracy test in a working population. BMJ Open 2018; 8: e020476.
68. Tian Y, Hothorn T, Li C, et al. An empirical comparison of two novel transformation models. Stat Med 2020; 39: 562–576.
69. Wang Y, Tian L. The equivalence between Mann-Whitney Wilcoxon test and score test based on the proportional odds model for
ordinal responses. In 4th International Conference on Industrial Economics System and Industrial Security Engineering (IEIS).
Kyoto, Japan: IEEE, pp. 1–5.
70. Hampel FR. The influence curve and its role in robust estimation. J Am Stat Assoc 1974; 69: 383–393.
71. McLain AC, Ghosh SK. Efficient sieve maximum likelihood estimation of time-transformation models. J Stat Theory Pract 2013;
7: 285–303.
72. Kook L, Hothorn T. Regularized transformation models: the tramnet package. R J 2021; 13: 581–594.
73. Tamási B, Hothorn T. tramME: mixed-effects transformation models using template model builder. R J 2021; 13: 581–594.
74. Hothorn T, Zeileis A. Predictive distribution modeling using transformation forests. J Comput Graph Stat 2021; 30: 1181–1196.
Article
A new semiparametric model of the receiver operating characteristic (ROC) curve based on the resilience family or proportional reversed hazard family is proposed as an alternative to existing models. The resulting ROC curve and its summary indices (such as area under the curve and Youden index) have simple analytic forms. The partial likelihood method is applied to estimate the ROC curve. Moreover, estimation methodologies for the resilience family of ROC curves have been developed based on area under the curve estimators exploiting Mann–Whitney statistics and the Rojo approach. A simulation study has been carried out to assess the performance of all considered estimators. Real data from the American National Health and Nutrition Examination Survey have been analyzed in detail based on the proposed model and the usual binormal model prevalent in the literature. Real data in the context of brain injury-related biomarkers are also analyzed in order to compare our model with the Lehmann family of ROC curves. Finally, we show that the proposed model may be applicable in the misspecification scenario through the Duchenne muscular dystrophy data.
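The simple analytic forms mentioned above can be illustrated with a short Python sketch. It assumes the usual proportional reversed hazard convention, in which the diseased distribution function is the nondiseased one raised to a power theta > 1, so that ROC(t) = 1 - (1 - t)^theta. The parameterization, the function names, and the numerical checks are assumptions made for illustration and are not taken from the article.

```python
def resilience_roc(t, theta):
    """ROC curve at false positive rate t, assuming F_D(x) = F_0(x) ** theta."""
    return 1.0 - (1.0 - t) ** theta


def resilience_auc(theta):
    """Closed-form AUC: the integral of 1 - (1 - t)^theta over [0, 1]."""
    return theta / (theta + 1.0)


def resilience_youden(theta):
    """Closed-form Youden index max_t {ROC(t) - t}, valid for theta > 1."""
    if theta <= 1.0:
        raise ValueError("this sketch assumes theta > 1")
    u = theta ** (-1.0 / (theta - 1.0))  # optimal value of 1 - t
    return u - u ** theta


def numeric_auc(theta, n=100_000):
    """Midpoint-rule approximation of the AUC, used to check the closed form."""
    h = 1.0 / n
    return sum(resilience_roc((i + 0.5) * h, theta) for i in range(n)) * h


def numeric_youden(theta, n=100_000):
    """Grid-search approximation of the Youden index."""
    return max(resilience_roc(i / n, theta) - i / n for i in range(n + 1))


if __name__ == "__main__":
    theta = 2.0
    print(resilience_auc(theta), numeric_auc(theta))
    print(resilience_youden(theta), numeric_youden(theta))
```

For theta = 2 the closed forms give an AUC of 2/3 and a Youden index of 0.25, and the numerical approximations agree with both.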
Article
Linear transformation models constitute a general family of parametric regression models for discrete and continuous responses. To accommodate correlated responses, the model is extended by incorporating mixed effects. This article presents the R package tramME, which builds on existing implementations of transformation models (mlt and tram packages) as well as Laplace approximation and automatic differentiation (using the TMB package), to calculate estimates and perform likelihood inference in mixed-effects transformation models. The resulting framework can be readily applied to a wide range of regression problems with grouped data structures.
Article
The overlap coefficient (OVL) measures the common area between two or more density functions. It has been used for measuring the similarity between distributions in different research fields including astronomy, economics and sociology, among others. Recently, different authors have studied the use of the OVL coefficient in the binary classification problem. They argue that, in particular settings, it could provide a better accuracy measure than other established indices. We prove here that the OVL coefficient does not provide additional information to the Youden index and that the potential advantages previously reported are based on the assumption that the classification rules underlying any classification process always assign more probability of being positive to the larger values of the marker. In particular, we prove that, for a fixed continuous marker, the OVL coefficient is equivalent to the Youden index associated with the optimal classification rules based on this marker. We illustrate the problem by studying the capacity of the white blood cell count to identify the type of disease in patients having either acute viral meningitis or acute bacterial meningitis.
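The link between the two indices can be checked numerically in the simplest setting. The Python sketch below assumes two unit-variance normal distributions whose means differ by a hypothetical shift delta >= 0; in that case a single threshold is the optimal rule and OVL + J = 1. The function names, integration bounds, and grid sizes are illustrative choices. With unequal variances a single threshold is no longer optimal, so this check does not carry over unchanged.

```python
import math


def norm_pdf(x, mu=0.0):
    """Density of a normal distribution with mean mu and unit variance."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def norm_cdf(x, mu=0.0):
    """Distribution function of a normal with mean mu and unit variance."""
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0)))


def overlap_coefficient(delta, n=50_000):
    """Midpoint-rule integral of min(f0, f1) for N(0, 1) versus N(delta, 1)."""
    lo, hi = -12.0, 12.0 + delta
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += min(norm_pdf(x), norm_pdf(x, delta))
    return total * h


def youden_index(delta, n=50_000):
    """Grid search over thresholds c for specificity + sensitivity - 1."""
    lo, hi = -12.0, 12.0 + delta
    h = (hi - lo) / n
    return max(norm_cdf(lo + i * h) - norm_cdf(lo + i * h, delta)
               for i in range(n + 1))


if __name__ == "__main__":
    delta = 1.5  # hypothetical mean shift between the two groups
    ovl = overlap_coefficient(delta)
    j = youden_index(delta)
    print(ovl, j, ovl + j)
```

In this equal-variance case the closed forms are OVL = 2Φ(-delta/2) and J = 2Φ(delta/2) - 1, whose sum is exactly 1.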
Article
In disease diagnosis, individuals are usually assumed to be one of two basic types, healthy or diseased, typically based on an established gold standard. Candidate markers for diagnosing a disease are often much cheaper and less invasive than the gold standard but must be evaluated against the gold standard for their sensitivity and specificity to accurately diagnose the disease. When candidate diagnostic markers are fully measured, receiver operating characteristic curves have been the standard approach for assessing diagnostic accuracy. However, full measurements of diagnostic markers may not be available above or below certain limits due to various practical and technical limitations. For example, in the diagnosis of Alzheimer disease using cerebrospinal fluid biomarkers, the Roche Elecsys® immunoassays have a measuring range for multiple cerebrospinal fluid molecular concentrations. Many cognitive tests used in diagnosing dementia due to Alzheimer disease are also subject to detection limits, often referred to as floor and ceiling effects in the neuropsychological literature. We propose a new statistical methodology for estimating the diagnostic accuracy when a diagnostic marker is subject to detection limits by dividing the entire study sample into two sub-samples by a threshold of the diagnostic marker. We then propose a family of estimators of the area under the receiver operating characteristic curve by combining a conditional nonparametric estimator and another conditional semi-parametric estimator derived from Cox's proportional hazards model. We derive the variance of the proposed estimators, assess their performance as a function of possible thresholds through an extensive simulation study, and recommend the optimum thresholds. Finally, we apply the proposed methodology to assess the ability of several cerebrospinal fluid biomarkers and cognitive tests in diagnosing early stage Alzheimer disease dementia.
Article
The tramnet package implements regularized linear transformation models by combining the flexible class of transformation models from tram with constrained convex optimization implemented in CVXR. Regularized transformation models unify many existing and novel regularized regression models under one theoretical and computational framework. Regularization strategies implemented for transformation models in tramnet include the Lasso, ridge regression, and the elastic net and follow the parameterization in glmnet. Several functionalities for optimizing the hyperparameters, including model-based optimization based on the mlrMBO package, are implemented. A multitude of S3 methods is deployed for visualization, handling, and simulation purposes. This work aims at illustrating all facets of tramnet in realistic settings and comparing regularized transformation models with existing implementations of similar models.
Article
The overlap coefficient (OVL) measures the similarity between two distributions through the overlapping area of their distribution functions. Given its intuitive description and ease of visual representation by the straightforward depiction of the amount of overlap between the two corresponding histograms based on samples of measurements from each one of the two distributions, the development of accurate methods for confidence interval construction can be useful for applied researchers. The overlap coefficient has received scant attention in the literature since it lacks readily available software for its implementation, while inferential procedures that can cover the whole range of distributional scenarios for the two underlying distributions are missing. Such methods, both parametric and non-parametric, are developed in this article, and R code is provided for their implementation. Parametric approaches based on the binormal model show better performance and are appropriate for use in a wide range of distributional scenarios. Methods are assessed through a large simulation study and are illustrated using a dataset from a study on human immunodeficiency virus-related cognitive function assessment.