Estimating transformations for evaluating diagnostic tests with covariate adjustment

June 2023
Statistical Methods in Medical Research 32(7):9622802231176030

DOI:10.1177/09622802231176030

License
CC BY-NC 4.0

Authors:

Torsten Hothorn

Ludwig-Maximilians-University of Munich

Density functions for the model used to generate the data for the simulations. The nondiseased test results followed a standard normal distribution corresponding to an AUC = 0.5. The diseased test results varied with three choices of $F_Z$ for the DGP: $\text{probit}^{-1}$, $\text{logit}^{-1}$, and $\text{cloglog}^{-1}$, each of which had an AUC of 0.5, 0.65, 0.8, and 0.95. DGP: data generating process; AUC: area under the receiver operating characteristic curve.

Original Research Article

Estimating transformations for evaluating diagnostic tests with covariate adjustment

Ainesh Sewak and Torsten Hothorn

Statistical Methods in Medical Research

2023, Vol. 32(7) 1403–1419

Abstract

Receiver operating characteristic analysis is one of the most popular approaches for evaluating and comparing the accuracy of medical diagnostic tests. Although various methodologies have been developed for estimating receiver operating characteristic curves and their associated summary indices, there is no consensus on a single framework that can provide consistent statistical inference while handling the complexities associated with medical data. Such complexities might include non-normal data, covariates that influence the diagnostic potential of a test, ordinal biomarkers or censored data due to instrument detection limits. We propose a regression model for the transformed test results which exploits the invariance of receiver operating characteristic curves to monotonic transformations and accommodates these features. Simulation studies show that the estimates based on transformation models are unbiased and yield coverage at nominal levels. The methodology is applied to a cross-sectional study of metabolic syndrome where we investigate the covariate-specific performance of weight-to-height ratio as a non-invasive diagnostic test. Software implementations for all the methods described in the article are provided in the tram add-on package to the R system for statistical computing and graphics.

Keywords

Transformation model, receiver operating characteristic curve, area under the receiver operating characteristic curve, diagnostic test, distribution regression, ordinal outcome, censoring, Youden index, overlapping coefficient, limit of detection

1 Introduction

Estimating receiver operating characteristic (ROC) curves for evaluating the performance of medical diagnostic tests has been a main focus of statistical literature over the last decades [1,2]. Diagnostic tests screen for the presence or absence of a disease. Characterizing their accuracy is essential to ensure appropriate prevention, treatment and monitoring of diseases. ROC curves are a valuable tool in determining the diagnostic potential of a test and continue to be extensively applied in biomedical studies as new tests or biomarkers are developed in radiology, oncology, genetics, and other related fields. Increasingly more applications can be expected due to advancements in technology, and analyzing the resulting data requires a computationally straightforward approach to provide accurate and consistent statistical inference.

Previous research has focused on extending statistical methodology for ROC curve estimation to address issues such as adjustment for covariates [3,4], incorporating censoring due to instrument detection limits [5,6] and robustness to model misspecification [7]. In addition, a wide variety of parametric and nonparametric methods have been proposed within frequentist and Bayesian paradigms (see Inácio et al. [8] for a recent review). However, there is no consensus on an analytic approach that can handle all these issues simultaneously.

Institut für Epidemiologie, Biostatistik und Prävention, Universität Zürich, Zürich, Switzerland

Corresponding author:

Torsten Hothorn, Institut für Epidemiologie, Biostatistik und Prävention, Universität Zürich, Hirschengraben 84, CH-8001 Zürich, Switzerland.

Email: Torsten.Hothorn@R-project.org

An attractive feature of the ROC curve, which has rarely been exploited for its estimation, is that it remains invariant to monotonic transformations of the test results. Although transformations have been used to bring continuous test results into a form that approximately satisfies the assumptions of a suitable parametric model [4], estimation of a transformation function has been limited to the Box-Cox power transformation family [9,10]. For rank-based methods, the transformation function can be left unspecified, but in all cases, a restriction to normality has been previously imposed on the model for the ROC curve [11].

In this article, we present a new unifying methodological framework for estimating ROC curves and their associated summary indices by modeling the relationship between the transformed test results and potential covariates. We employ transformation models to jointly estimate the transformation function and regression parameters [12,13]. This approach specifies a parametric model for the ROC curve but remains distribution-free because we do not impose any strong assumptions about the transformation function. Using the estimated parameters, we show how to evaluate covariate effects on the discriminatory performance of diagnostic tests. Unlike nonparametric methods, which are flexible but difficult to interpret and implement, transformation models excel on both fronts. R implementations of all methods discussed in this article are available, along with a set of supporting examples.

1.1 Notation and preliminaries

Let the random variable $Y$ denote the continuous result of a diagnostic test and let $D$ denote the disease status, with $D = 1$ if a subject is diseased and $D = 0$ if nondiseased. We denote quantities conditional on the disease status using subscripts. For example, $Y_1$ and $Y_0$ are the test results in the diseased and nondiseased populations with cumulative distribution functions (CDFs) given by $F_1$ and $F_0$ and densities $f_1$ and $f_0$, respectively. Suppose that the subject is diagnosed as diseased when their test result exceeds a threshold value, $c$. By convention, we assume that larger values of the test result are more indicative of the disease. The probabilities of correctly identifying a diseased and a nondiseased subject are defined as sensitivity, $\mathbb{P}(Y_1 > c) = 1 - F_1(c)$, and specificity, $\mathbb{P}(Y_0 \leq c) = F_0(c)$, respectively. The set of pairs (1 − specificity, sensitivity) for all $c \in \mathbb{R}$ produces the ROC curve. By setting $p = 1 - F_0(c)$, an equivalent representation of the ROC curve is

$$\mathrm{ROC}(p) = 1 - F_1\left(F_0^{-1}(1 - p)\right)$$

Summary indices of the ROC curve quantify the degree of separation between the distributions of $Y_1$ and $Y_0$. The most widely used index is the area under the ROC curve (AUC) defined by

$$\mathrm{AUC} = \mathbb{P}(Y_1 > Y_0) = \int_0^1 \mathrm{ROC}(p)\,dp$$

The AUC represents the probability that the test result of a randomly selected diseased subject exceeds that of a randomly selected nondiseased subject and is directly related to the Mann–Whitney–Wilcoxon U-statistic (MWW) [14]. Alternative indices include the Youden index [15], $J$, which combines sensitivity and specificity over all possible thresholds to provide the maximum potential effectiveness of a diagnostic test, given by

$$J = \max_{c \in \mathbb{R}} \left[F_0(c) - F_1(c)\right]$$

The Youden index is equivalent to the Smirnov (or the two-sample Kolmogorov-Smirnov) test statistic [16] and can be represented as half the $L_1$ distance between the two densities or as the complement of the overlapping coefficient (OVL) [17–20]:

$$J = \frac{1}{2}\int \left|f_0(y) - f_1(y)\right| dy = 1 - \int \min\left[f_1(y), f_0(y)\right] dy = 1 - \mathrm{OVL}$$

Additionally, the threshold corresponding to $J$, where the sum of sensitivity and specificity is maximized, denoted as $c^*$, is often used in clinical practice as the optimal classification threshold to screen subjects.
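As a rough illustration of these definitions, the following Python sketch computes nonparametric plug-in versions of the AUC and the Youden index from two samples. The simulated data, the sample sizes and the function names are illustrative assumptions and are not taken from the article or from the tram package. The last lines check the invariance property numerically: a strictly increasing transformation of the test results leaves both indices unchanged.

```python
import math
import random


def empirical_auc(diseased, nondiseased):
    """Mann-Whitney estimate of P(Y1 > Y0); ties count one half."""
    wins = 0.0
    for y1 in diseased:
        for y0 in nondiseased:
            if y1 > y0:
                wins += 1.0
            elif y1 == y0:
                wins += 0.5
    return wins / (len(diseased) * len(nondiseased))


def empirical_youden(diseased, nondiseased):
    """Maximise F0(c) - F1(c) over observed thresholds c (positive if Y > c)."""
    best_j, best_c = 0.0, None
    for c in sorted(set(diseased) | set(nondiseased)):
        f0 = sum(y <= c for y in nondiseased) / len(nondiseased)  # specificity
        f1 = sum(y <= c for y in diseased) / len(diseased)        # 1 - sensitivity
        if f0 - f1 > best_j:
            best_j, best_c = f0 - f1, c
    return best_j, best_c


# Hypothetical data: nondiseased ~ N(0, 1), diseased ~ N(1, 1).
rng = random.Random(1)
y0 = [rng.gauss(0.0, 1.0) for _ in range(200)]
y1 = [rng.gauss(1.0, 1.0) for _ in range(200)]

auc = empirical_auc(y1, y0)
j, c_star = empirical_youden(y1, y0)

# exp() is strictly increasing, so ranks and therefore both indices are preserved.
auc_t = empirical_auc([math.exp(v) for v in y1], [math.exp(v) for v in y0])
j_t, _ = empirical_youden([math.exp(v) for v in y1], [math.exp(v) for v in y0])
```

The double loop is quadratic in the sample size, which is acceptable for a sketch but not for large data sets; a rank-based computation would be the usual replacement.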

Covariates may impact the level and the accuracy of a diagnostic test. In order to appropriately understand the accuracy of the test in subpopulations, we can use covariate-specific or conditional ROC curves [21]. Let $X$ denote a vector of covariates that are hypothesized to have an impact on the accuracy of the test. The conditional CDF in the diseased population is given by $F_1(y \mid x) = \mathbb{P}(Y_1 \leq y \mid X = x)$ and analogously for the nondiseased population. The covariate-specific ROC curve can be written as

$$\mathrm{ROC}(p \mid x) = 1 - F_1\left(F_0^{-1}(1 - p \mid x) \mid x\right) \tag{1}$$

with its counterpart conditional summary indices, $\mathrm{AUC}(x)$ and $J(x)$, defined accordingly. The covariate-specific ROC curve can be generated by modeling the conditional distribution of the test results, known as the induced or indirect methodology [3].

1.2 Overview

The article proceeds as follows. In Section 2, we propose a transformation modeling framework for parameterizing ROC curves from which we derive closed-form expressions for associated AUC and Youden summary indices. We discuss maximum likelihood estimation procedures for our model and corresponding inference. In Section 3, we assess the empirical performance of our methods using simulated data. We apply our approach to a cross-sectional study for detection of metabolic syndrome in Section 4 and conclude the article with a discussion.

2 Methods

2.1 Transformation model

The ROC curve is a composition of distribution functions and thus is invariant to strictly monotonically increasing transformations of $Y$. We propose a model for the conditional distribution of the transformed test result given the disease status and covariates. This transformation is obtained from the data and leads to a distribution-free framework to parameterize the covariate-specific ROC curve and its summary indices.

Suppose there exists a strictly monotonically increasing function $h$ such that the relationship between the transformed test result and the covariates follows a shift-scale model

$$h(Y) = \mu_d(x) + \sigma_d(x) Z$$

where $D = d$ specifies the disease indicator ($D = 0$ for nondiseased and $D = 1$ for diseased), $X = x$ is a fixed set of covariates, $\mu_d(x)$ is the shift term, $\sigma_d(x)$ is the scale term, and $Z \in \mathbb{R}$ is a latent random variable with an a priori known absolutely continuous log-concave CDF, $F_Z$. Given that $D$ and $X$ are fixed, the conditional CDF for $Y$ is

$$\mathbb{P}(Y \leq y \mid D = d, X = x) = F_d(y \mid x) = F_Z\left(\frac{h(y) - \mu_d(x)}{\sigma_d(x)}\right) \tag{2}$$

Equation (2) represents a general class of models called transformation models [22,13]. The transformation function $h$ uniquely characterizes the distribution of $Y$, similar to the density or distribution function. Plugging this conditional CDF of $Y$ into equation (1), $h$ cancels out and the covariate-specific ROC curve is given by

$$\mathrm{ROC}(p \mid x) = 1 - F_Z\left(\zeta(x) F_Z^{-1}(1 - p) - \delta(x)\right) \tag{3}$$

where

$$\delta(x) = \frac{\mu_1(x) - \mu_0(x)}{\sigma_1(x)} \quad \text{and} \quad \zeta(x) = \frac{\sigma_0(x)}{\sigma_1(x)}$$

Thus, the ROC curve is completely determined by the shift and scale terms of the model.

The binormal [23] and bilogistic [24] ROC curves can be obtained by setting $F_Z$ to the standard normal distribution function, $\text{probit}^{-1} = \Phi$, or the standard logistic distribution function, $\text{logit}^{-1}(x) = \text{expit}(x) = (1 + \exp(-x))^{-1}$, in equation (3), respectively. Similarly, the proportional hazard [25] and reverse proportional hazard [26] alternatives for the ROC curve also fall within the purview of our transformation model with $F_Z$ specified as $\text{cloglog}^{-1}(x) = 1 - \exp(-\exp(x))$ (minimum extreme value distribution function) and $\text{loglog}^{-1}(x) = \exp(-\exp(-x))$ (maximum extreme value distribution function), respectively.
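The ROC curve in equation (3) can be evaluated directly once $F_Z$ and its inverse are available. The sketch below does this in plain Python for the four choices of $F_Z$ named above; the dictionary `FAMILIES`, the function name `roc` and the numeric values used to call it are illustrative assumptions, not part of the tram package.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

# (F_Z, F_Z^{-1}) pairs for the four distribution functions named in the text.
FAMILIES = {
    "probit": (_STD_NORMAL.cdf, _STD_NORMAL.inv_cdf),
    "logit": (lambda z: 1.0 / (1.0 + math.exp(-z)),
              lambda u: math.log(u / (1.0 - u))),
    "cloglog": (lambda z: 1.0 - math.exp(-math.exp(z)),
                lambda u: math.log(-math.log(1.0 - u))),
    "loglog": (lambda z: math.exp(-math.exp(-z)),
               lambda u: -math.log(-math.log(u))),
}


def roc(p, delta, zeta=1.0, family="probit"):
    """Equation (3): ROC(p) = 1 - F_Z(zeta * F_Z^{-1}(1 - p) - delta), for 0 < p < 1."""
    cdf, quantile = FAMILIES[family]
    return 1.0 - cdf(zeta * quantile(1.0 - p) - delta)


# Sensitivity at 80% specificity (p = 0.2) for a hypothetical shift delta = 1.
sensitivities = {name: roc(0.2, 1.0, family=name) for name in FAMILIES}
```

With `zeta=1.0` the curve is the shift-only model; for the two extreme value choices it reduces to the power forms $p^{\exp(-\delta)}$ and $1 - (1 - p)^{\exp(\delta)}$, which is a convenient way to check an implementation.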

However, to the best of our knowledge, the only literature where the transformation function $h$ is included in the model formulation of the ROC curve is Zou [27], who jointly models the shift term and the parameters of a Box-Cox power transformation function. A key point of this article is that we explicitly estimate $h$ jointly with $\mu(x)$ from the observed data and are not restricted to normality imposed by power transformation families. Thus, the methods we propose allow for proper propagation of uncertainty from the estimated transformation function $\hat{h}$ into the estimates of the shift and scale terms of the model.

The ROC curve in equation (3) follows a parametric model depending on $F_Z$, but is distribution-free in the sense of Alonzo and Pepe [28], because no assumptions are made about the transformation $h$ and consequently about the distribution of the test results. The approach to model the test results as a function of the disease status and covariates was originally proposed in the latent variable ordinal regression setting by Tosteson and Begg [21] and extended by Pepe [3] to modeling covariate effects directly on the ROC curve.

Tosteson and Begg [21] pointed out that to ensure concavity of the induced ROC curve, the scale term must be omitted, that is, $\sigma_d(x) = 1$ for $d \in \{0, 1\}$. The ROC curve is termed proper if it is concave or, equivalently, if the derivative of the ROC curve is a monotonically decreasing function [29]. A concave ROC curve is desirable as it yields the maximal sensitivity for a given value of specificity [30]. Because the decision criterion for classifying subjects is optimal when the ROC curve is concave, we focus in the remainder of this work on the model involving only the shift term. Hence, the effect of covariates on the ROC curve is contained in the difference between the shift terms for diseased and nondiseased subjects, $\delta(x) = \mu_1(x) - \mu_0(x)$. For a relaxation of this assumption, see Siegfried et al. [31], who additionally estimate the scale functions through regression models.

2.1.1 Two-sample case

We first consider the case of two samples without covariates. Let the shift term take the form $\mu_d(x) = \delta d$. The CDF of the test results in the nondiseased population is given by $F_0(y) = F_Z(h(y))$ and in the diseased population by $F_1(y) = F_Z(h(y) - \delta)$. Using equation (3), the induced ROC curve can be expressed as

$$\mathrm{ROC}(p) = 1 - F_Z\left(h\left(h^{-1}\left(F_Z^{-1}(1 - p)\right)\right) - \delta\right) = 1 - F_Z\left(F_Z^{-1}(1 - p) - \delta\right) \tag{4}$$

The model assumption implies that a monotone function $h$ exists to transform both $Y_1$ and $Y_0$ into the same distribution, $Z \sim F_Z$, separated by a shift parameter, $\delta$. The induced ROC curve from this model does not assume a particular distribution of the test result; rather, it quantifies the difference between the test result distributions on the scale of a user-defined $F_Z$. In this sense, the difference between the test result distributions is described by $\delta$. Each choice of $F_Z$ leads to a different interpretation of $\delta$. For example, when $F_Z$ is selected to be the standard normal distribution function, $\delta$ is essentially Cohen's d, the standardized difference in means of the transformed test results comparing the diseased and nondiseased groups, $\mathbb{E}[h(Y_1) - h(Y_0)]$. Similarly, when $F_Z$ is the standard logistic distribution function, $\exp(\delta)$ is the ratio of odds of having a positive test result comparing diseased and nondiseased groups. Closed-form expressions can be derived for summary indices of the ROC curve by solving the appropriate integrals. The expressions of AUC, $J$, the optimal threshold $c^*$, sensitivity and specificity at $c^*$ are given for some choices of $F_Z$ in Table 1.
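The closed-form AUC and Youden index expressions of Table 1 can be coded in a few lines and checked against brute-force numerics. The sketch below is such a check under stated assumptions: the function names are illustrative, the numerical cross-check is written for $F_Z = \text{logit}^{-1}$ only, and the integration and grid settings are arbitrary choices rather than anything prescribed by the article.

```python
import math
from statistics import NormalDist

_N = NormalDist()


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def auc_closed_form(delta, family):
    """AUC as a function of the shift parameter delta (first row of Table 1)."""
    if family == "probit":
        return _N.cdf(delta / math.sqrt(2.0))
    if family == "logit":
        if delta == 0.0:
            return 0.5
        e = math.exp(delta)
        return e * (e - 1.0 - delta) / (e - 1.0) ** 2
    if family in ("cloglog", "loglog"):
        return expit(delta)
    raise ValueError(f"unknown family: {family}")


def youden_closed_form(delta, family):
    """Youden index J as a function of delta (second row of Table 1)."""
    if family == "probit":
        return 1.0 - 2.0 * _N.cdf(-delta / 2.0)
    if family == "logit":
        return 1.0 - 2.0 * expit(-delta / 2.0)
    if family in ("cloglog", "loglog"):
        if delta == 0.0:
            return 0.0
        return (math.exp(-delta / math.expm1(delta))
                - math.exp(delta / math.expm1(-delta)))
    raise ValueError(f"unknown family: {family}")


def auc_numeric_logit(delta, n=20000):
    """Midpoint-rule integral of ROC(p) = 1 - expit(logit(1 - p) - delta)."""
    total = 0.0
    for k in range(n):
        p = (k + 0.5) / n
        total += 1.0 - expit(math.log((1.0 - p) / p) - delta)
    return total / n


def youden_numeric_logit(delta, step=0.001, half_width=10.0):
    """Grid search of max [F_0 - F_1] on the transformed scale t = h(c)."""
    n = int(round(2.0 * half_width / step))
    return max(expit(-half_width + k * step) - expit(-half_width + k * step - delta)
               for k in range(n + 1))


delta = 1.0  # hypothetical shift parameter
auc_gap = abs(auc_closed_form(delta, "logit") - auc_numeric_logit(delta))
youden_gap = abs(youden_closed_form(delta, "logit") - youden_numeric_logit(delta))
```

Both gaps should be close to zero; the logit AUC expression loses precision for very small nonzero `delta` because of cancellation, so a series expansion would be the safer choice in production code.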

2.1.2 Conditional ROC curve

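The accuracy of a diagnostic test may be influenced by a set of covariates $X$. To evaluate their effect on the ROC curve and its summary indices, we assume a linear transformation model with a shift term that takes the form

$$\mu_d(x) = \delta d + x^\top \boldsymbol{\xi} + d\, x^\top \boldsymbol{\gamma} \tag{5}$$

where $\boldsymbol{\xi}, \boldsymbol{\gamma} \in \mathbb{R}^P$ are the coefficient vectors for the covariates and interaction term, respectively. Under this model, the resulting covariate-specific ROC curve is

$$\mathrm{ROC}(p \mid x) = 1 - F_Z\left(F_Z^{-1}(1 - p) - (\delta + x^\top \boldsymbol{\gamma})\right)$$

where the covariate effect on the ROC curve is given by the difference in shift terms between diseased and nondiseased subjects, $\delta(x) = \delta + x^\top \boldsymbol{\gamma}$. Similarly, the covariate-specific AUC is given by

$$\mathrm{AUC}(x) = \mathbb{P}(Y_0 < Y_1 \mid X = x) = a(\delta(x)) = a\left(\delta + x^\top \boldsymbol{\gamma}\right) \tag{6}$$

where $a: \mathbb{R} \to [0, 1]$ is the AUC function from the first row of Table 1 for different choices of $F_Z$. The expressions for $J$, $c^*$, sensitivity and specificity can analogously be adjusted to account for covariates, with $\delta$ replaced by $\delta + x^\top \boldsymbol{\gamma}$ in Table 1. In the case of a single continuous covariate $X = x \in \mathbb{R}$, the interpretation of the interaction coefficient is as follows. For each possible specificity value, a unit increase in $x$ results in a $\gamma$-unit increase in the ROC curve (or an increase in the sensitivity) on the scale of $F_Z$. If $\gamma$ is positive, an increase in $x$ corresponds to an increase in the ROC curve, indicating that a test is better able to discriminate the two populations for larger values of $x$, and vice versa. Note that the ROC curve varies with the covariate contingent upon the presence of an interaction between $d$ and $x$. For $\gamma = 0$, the covariate affects the distribution of the test results from the diseased and nondiseased population, but not the ROC curve. That is, for all levels of $x$, the difference between the transformed distributions $h(Y_1)$ and $h(Y_0)$ is given by $\delta$ and the ROC curve is unchanged. Analogous interpretations hold when we are interested in modeling a set of covariates $X$, which could possibly include categorical covariates.

Table 1. Closed-form expressions for the area under the receiver operating characteristic curve (AUC), Youden index ($J$), optimal classification threshold ($c^*$), sensitivity (Sens), and specificity (Spec) at $c^*$ in terms of the shift parameter $\delta$ in the linear transformation model given by $F_d(y) = F_Z(h(y) - \delta d)$.

| Index | $\text{probit}^{-1}$ | $\text{logit}^{-1}$ | $\text{cloglog}^{-1}$ | $\text{loglog}^{-1}$ |
|---|---|---|---|---|
| AUC | $\Phi\left(\frac{\delta}{\sqrt{2}}\right)$ | $\frac{\exp(\delta)(\exp(\delta) - 1 - \delta)}{(\exp(\delta) - 1)^2}$ if $\delta \neq 0$; $1/2$ if $\delta = 0$ | $\text{expit}(\delta)$ | $\text{expit}(\delta)$ |
| $J$ | $1 - 2\Phi\left(\frac{-\delta}{2}\right)$ | $1 - 2\,\text{expit}\left(\frac{-\delta}{2}\right)$ | $\exp\left(\frac{-\delta}{e^{\delta} - 1}\right) - \exp\left(\frac{\delta}{e^{-\delta} - 1}\right)$ | $\exp\left(\frac{-\delta}{e^{\delta} - 1}\right) - \exp\left(\frac{\delta}{e^{-\delta} - 1}\right)$ |
| $c^*$ | $h^{-1}\left(\frac{\delta}{2}\right)$ | $h^{-1}\left(\frac{\delta}{2}\right)$ | $h^{-1}\left(\log\left(\frac{\delta}{1 - e^{-\delta}}\right)\right)$ | $h^{-1}\left(\log\left(\frac{e^{\delta} - 1}{\delta}\right)\right)$ |
| Sens($c^*$) | $\Phi\left(\frac{\delta}{2}\right)$ | $\text{expit}\left(\frac{\delta}{2}\right)$ | $\exp\left(\frac{-\delta}{e^{\delta} - 1}\right)$ | $1 - \exp\left(\frac{\delta}{e^{-\delta} - 1}\right)$ |
| Spec($c^*$) | $\Phi\left(\frac{\delta}{2}\right)$ | $\text{expit}\left(\frac{\delta}{2}\right)$ | $1 - \exp\left(\frac{\delta}{e^{-\delta} - 1}\right)$ | $\exp\left(\frac{-\delta}{e^{\delta} - 1}\right)$ |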
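The role of the interaction coefficient in the covariate-specific AUC of equation (6) can be made concrete with a short sketch. The coefficient values below are hypothetical and the function names are illustrative; the link function `auc_link` simply restates the AUC row of Table 1.

```python
import math
from statistics import NormalDist


def auc_link(eta, family="probit"):
    """AUC function a(.) from the first row of Table 1."""
    if family == "probit":
        return NormalDist().cdf(eta / math.sqrt(2.0))
    if family == "logit":
        if eta == 0.0:
            return 0.5
        e = math.exp(eta)
        return e * (e - 1.0 - eta) / (e - 1.0) ** 2
    if family in ("cloglog", "loglog"):
        return 1.0 / (1.0 + math.exp(-eta))
    raise ValueError(f"unknown family: {family}")


def covariate_specific_auc(x, delta, gamma, family="probit"):
    """Equation (6): AUC(x) = a(delta + x' gamma); xi does not enter."""
    eta = delta + sum(x_j * g_j for x_j, g_j in zip(x, gamma))
    return auc_link(eta, family)


# Hypothetical estimates for a single continuous covariate.
delta_hat, gamma_hat = 0.8, [0.3]
aucs = [covariate_specific_auc([x], delta_hat, gamma_hat) for x in (-1.0, 0.0, 1.0)]

# With gamma = 0 the covariate shifts both groups equally and the AUC is constant.
flat = [covariate_specific_auc([x], delta_hat, [0.0]) for x in (-1.0, 0.0, 1.0)]
```

Because the main-effect coefficients $\boldsymbol{\xi}$ cancel in the difference of shift terms, they are deliberately absent from the function signature.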

Standard regression techniques have also been proposed as an alternative to assess the effect of covariates on summary indices rather than deriving the induced ROC curve. For example, Dodd et al. [32] model the partial AUC as a regression function of covariates. Our model equivalently results in a regression model for the AUC, where $\delta + x^\top \boldsymbol{\gamma}$ is in the form of a usual linear predictor and $a$ is a monotonically increasing inverse link function which defines the scale for the regression coefficients. As will be shown in Section 2.2, an advantage of our approach is that we do not rely on less efficient binary regression techniques and directly estimate the regression parameters of the transformation model using maximum likelihood estimation. In Supplemental Material Section A, we show that our method is additionally related to the probabilistic index model (PIM) of Thas et al. [33,34].

We can also consider more general and potentially nonlinear formulations of the shift and scale terms in our framework. For the special case of $F_Z = \text{probit}^{-1}$, the AUC from a shift-scale transformation model [31] is given by

$$\mathbb{P}(Y_0 < Y_1 \mid X_0 = x_0, X_1 = x_1) = \Phi\left(\frac{\mu_1(x_1) - \mu_0(x_0)}{\sqrt{\sigma_0(x_0)^2 + \sigma_1(x_1)^2}}\right)$$

where $X_0$ and $X_1$ are the corresponding (potentially different) sets of covariates in the nondiseased and diseased populations, respectively. When the scale term depends only on a single set of covariates, $\sigma_0(x_0) = \sigma_1(x_1) = \sigma(x)$, despite varying sets of covariates in the shift terms, all the expressions from Table 1 hold with $\delta$ replaced by $\frac{\mu_1(x_1) - \mu_0(x_0)}{\sigma(x)}$. However, such closed-form expressions cannot be derived for other choices of $F_Z$ when the scale term depends on the disease indicator or on different sets of covariates. In such cases, AUCs and other summary indices can be derived using numerical techniques on the induced ROC curve.

2.2 Estimation

In this section, we propose estimation methods for a transformation model with univariate test results. We provide an explicit parameterization of the transformation function and the shift term. We then maximize the likelihood contributions for a potentially exact continuous, right-, left-, or interval-censored datum to jointly estimate the model parameters. This enables us to fully determine the ROC curve and its summary indices as well as handle test results which are ordinal or impacted by instrument detection limits.

2.2.1 Parameterization

We parameterize the transformation function as

$$h(y \mid \boldsymbol{\vartheta}) = \boldsymbol{b}(y)^\top \boldsymbol{\vartheta} = \sum_{m=0}^{M} \vartheta_m b_m(y) \quad \text{for } y \in \mathbb{R} \tag{7}$$

where $\boldsymbol{b}(y) = (b_0(y), \dots, b_M(y))^\top$ is a vector of $M + 1$ basis functions with coefficients $\boldsymbol{\vartheta} \in \mathbb{R}^{M+1}$. Polynomials in Bernstein form offer a computationally attractive choice of basis that provides a flexible way of estimating the underlying transformation function. The Bernstein basis polynomial of order $M$ is defined on the interval $[l, u]$ as

$$b_m(y) = \binom{M}{m} \tilde{y}^m (1 - \tilde{y})^{M - m}, \quad m = 0, \dots, M \tag{8}$$

where $\tilde{y} = \frac{y - l}{u - l} \in [0, 1]$. The restriction $\vartheta_m \leq \vartheta_{m+1}$ for $m = 0, \dots, M - 1$ guarantees the monotonicity of $h$. Observe that the transformation function is linear in the parameters that define it and any nonlinearity of the test results is modeled by the basis functions. If the order $M$ is chosen to be sufficiently large, Bernstein polynomials can uniformly approximate any real-valued continuous function on an interval [35].
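Equations (7) and (8) translate almost literally into code. The following sketch evaluates the Bernstein basis and the resulting transformation function for a hypothetical, nondecreasing coefficient vector; the function names, the support `[0, 10]` and the coefficient values are illustrative assumptions.

```python
import math


def bernstein_basis(y, order, lower, upper):
    """Return [b_0(y), ..., b_M(y)] from equation (8) on the interval [lower, upper]."""
    t = (y - lower) / (upper - lower)  # rescale y to [0, 1]
    return [math.comb(order, m) * t ** m * (1.0 - t) ** (order - m)
            for m in range(order + 1)]


def transformation(y, theta, lower, upper):
    """Equation (7): h(y | theta) = sum_m theta_m * b_m(y)."""
    basis = bernstein_basis(y, len(theta) - 1, lower, upper)
    return sum(coef * b for coef, b in zip(theta, basis))


# Hypothetical nondecreasing coefficients (order M = 5) on the support [0, 10].
theta = [-2.0, -0.5, 0.0, 0.4, 1.5, 3.0]
grid = [10.0 * k / 100 for k in range(101)]
h_values = [transformation(y, theta, 0.0, 10.0) for y in grid]
```

Because the basis functions are nonnegative and sum to one, `h_values` starts at `theta[0]`, ends at `theta[-1]` and is nondecreasing whenever the coefficients are, which is the monotonicity constraint stated above.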

2.2.2 Likelihood

Denote the complete parameter vector as $\boldsymbol{\theta} = (\boldsymbol{\beta}^\top, \boldsymbol{\vartheta}^\top)^\top$, where $\boldsymbol{\beta} = (\delta, \boldsymbol{\xi}^\top, \boldsymbol{\gamma}^\top)^\top \in \mathbb{R}^{2P+1}$ is the vector of regression coefficients parameterizing the function $\mu_d$ from Section 2.1 and $\boldsymbol{\vartheta} \in \mathbb{R}^{M+1}$ are the basis coefficients. We follow the maximum likelihood approach proposed by Hothorn et al. [13] to jointly estimate $\boldsymbol{\beta}$ and $\boldsymbol{\vartheta}$. The advantages of embedding the model in the likelihood framework are as follows. (i) All forms of random censoring (right, left, and interval) as well as truncation can directly be incorporated into likelihood contributions defined in terms of the distribution function [36]. Supplemental Material Section A details how ordinal biomarkers can be accommodated in the proposed modeling framework using interval-censored likelihood contributions. (ii) If the given model is correctly specified, under regularity conditions, the maximum likelihood estimator (MLE) will be asymptotically the most efficient estimator. (iii) The MLE is asymptotically normally distributed and has a sample variance that can be computed from the inverse of the Fisher information matrix. This can be used to generate confidence intervals (CIs) for the estimated parameters. (iv) The MLE is equivariant, which implies invariance of the score test (or the Lagrange multiplier test) to reparameterizations [37,38]. Specifically, we will show in Section 2.3.1 that, by inverting the score test, our method produces confidence bands for the ROC curve and appropriate score intervals for its summary indices.

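The likelihood contribution of a single observation $O = (Y, D, X)$, where $Y \in (\underline{y}, \bar{y}] = \{y \in \mathbb{R} : \underline{y} < y \leq \bar{y}\}$, is given by

$$L(\boldsymbol{\theta} \mid O) = \begin{cases} f_Z\left(h(y \mid \boldsymbol{\vartheta}) - \mu_d(x \mid \boldsymbol{\beta})\right) h'(y \mid \boldsymbol{\vartheta}) & y \in \mathbb{R} \quad \text{"exact continuous"} \\ 1 - F_Z\left(h(\underline{y} \mid \boldsymbol{\vartheta}) - \mu_d(x \mid \boldsymbol{\beta})\right) & y \in (\underline{y}, \infty) \quad \text{"right censored"} \\ F_Z\left(h(\bar{y} \mid \boldsymbol{\vartheta}) - \mu_d(x \mid \boldsymbol{\beta})\right) & y \in (-\infty, \bar{y}) \quad \text{"left censored"} \\ F_Z\left(h(\bar{y} \mid \boldsymbol{\vartheta}) - \mu_d(x \mid \boldsymbol{\beta})\right) - F_Z\left(h(\underline{y} \mid \boldsymbol{\vartheta}) - \mu_d(x \mid \boldsymbol{\beta})\right) & y \in (\underline{y}, \bar{y}] \quad \text{"interval censored"} \end{cases}$$

where $f_Z$ is the density function of $Z$ and $h'(y \mid \boldsymbol{\vartheta})$ is the first derivative of the transformation function with respect to $y$. Given a sample of $N$ independent and identically distributed observations $O_i$ for $i = 1, \dots, N$, the log-likelihood is given by $\ell(\boldsymbol{\theta}) = \sum_{i=1}^{N} \log(L_i(\boldsymbol{\theta}))$, where $L_i$ is the likelihood contribution of observation $i$. The (unconditional) maximum likelihood estimate of $\boldsymbol{\theta}$ is the solution to the optimization problem

$$\hat{\boldsymbol{\theta}} = (\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{\vartheta}}) = \underset{\boldsymbol{\beta}, \boldsymbol{\vartheta}}{\arg\max}\; \ell(\boldsymbol{\beta}, \boldsymbol{\vartheta})$$

subject to the monotonicity constraint $\vartheta_m \leq \vartheta_{m+1}$ for $m = 0, \dots, M - 1$. The resulting ROC curve depends only on $\boldsymbol{\beta}$, which is decoupled from the parameters $\boldsymbol{\vartheta}$ needed to model the transformation function. The score function is defined as the first derivative of the log-likelihood function with respect to each of the parameters and is given by

$$S(\boldsymbol{\theta}) = \begin{pmatrix} \dfrac{\partial \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\beta}} \\[2ex] \dfrac{\partial \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\vartheta}} \end{pmatrix} = \begin{pmatrix} S_{\boldsymbol{\beta}}(\boldsymbol{\theta}) \\ S_{\boldsymbol{\vartheta}}(\boldsymbol{\theta}) \end{pmatrix}$$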

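We perform constrained optimization using the likelihood and score contributions to determine the maximum likelihood estimates for $\boldsymbol{\beta}$ and $\boldsymbol{\vartheta}$ (for computational details, see Hothorn [39]). The asymptotic variance of the MLE can further be estimated from the expected Fisher information matrix, which is the variance-covariance matrix of the score function and is defined as

$$I(\boldsymbol{\theta}) = -\mathbb{E}\begin{pmatrix} \dfrac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\beta}\, \partial \boldsymbol{\beta}^\top} & \dfrac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\beta}\, \partial \boldsymbol{\vartheta}^\top} \\[2ex] \dfrac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\vartheta}\, \partial \boldsymbol{\beta}^\top} & \dfrac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\vartheta}\, \partial \boldsymbol{\vartheta}^\top} \end{pmatrix} = \begin{pmatrix} I_{\boldsymbol{\beta},\boldsymbol{\beta}}(\boldsymbol{\theta}) & I_{\boldsymbol{\beta},\boldsymbol{\vartheta}}(\boldsymbol{\theta}) \\ I_{\boldsymbol{\beta},\boldsymbol{\vartheta}}(\boldsymbol{\theta})^\top & I_{\boldsymbol{\vartheta},\boldsymbol{\vartheta}}(\boldsymbol{\theta}) \end{pmatrix}$$

The matrix is partitioned such that the submatrix $I_{\boldsymbol{\beta},\boldsymbol{\beta}}(\boldsymbol{\theta})$ corresponds to the parameters related to the disease indicator and covariates.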

2.2.3 Limit of detection

Instrument precision can affect the evaluation of diagnostic biomarkers. For example, when biomarker levels are at or below the limit of detection (LOD) $y_{\text{LOD}}$, the observed value lies in an interval $(-\infty, y_{\text{LOD}})$ and the resulting measurement is left censored. Often a replacement value is substituted for such measurements. Alternatively, only biomarker values above the LOD are used for the ROC analysis. It has been shown that these approaches lead to biased estimation [40]. Various adjustments to ROC curves and their summary indices have been proposed to handle such censored measurements [41,6,42]. However, these methods typically do not account for covariates. Our framework naturally accounts for such observations in the likelihood function for left censored test results. Similarly, the right censored likelihood accounts for measurements which are affected by an upper limit of detection. Thus, our method provides a smooth covariate-specific ROC curve for all values of specificity with estimates and inference appropriately incorporating the observed information.

2.3 Conﬁdence intervals

In the following section, we present three methods to calculate confidence bands for the ROC curve and CIs for its summary indices. Since these quantities are functions $G: \mathbb{R}^{2P+1} \to \mathbb{R}$ of the regression parameters $\boldsymbol{\beta}$ in the model, appropriate methods are needed to maintain nominal coverage for a CI for $G(\boldsymbol{\beta})$. The methods discussed include inverting the score test, the multivariate delta method and simulation from the asymptotic distribution of the estimate. The methods are ordered by their degree of theoretical justification. We start with score intervals, which are invariant to parameter transformations but become computationally expensive when dealing with a large set of parameters. We then discuss estimating the variance using the delta method and conclude with a simple simulation method which is versatile without being computationally demanding.

2.3.1 Score intervals

In the two-sample univariate case where $\delta$ defines the ROC curve, as in equation (4), we can construct score intervals for $\delta$. Unlike the Wald and other commonly used intervals, score intervals are especially desirable as they are invariant to transformations of the parameters. A score CI for $G(\delta)$ (e.g. the AUC $a(\delta)$) provides the same level of coverage as would a score CI for $\delta$. In turn, under a correctly specified model, a score CI for $\delta$ allows the construction of uniform confidence bands with accurate coverage for the ROC curve as well as intervals for its summary indices such as the AUC and the Youden index.

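We first generate score CIs for $\delta$ by inverting the score test. In this case, the null hypothesis is given by $H_0: \delta = \delta_0$, where $\delta_0$ is a specific value of the parameter of interest. Under $H_0$, the restricted (conditional) MLE for $\boldsymbol{\vartheta}$ can be obtained as

$$\tilde{\boldsymbol{\vartheta}}(\delta_0) = \underset{\boldsymbol{\vartheta}}{\arg\max}\; \ell(\delta_0, \boldsymbol{\vartheta})$$

or as a solution of the $M + 1$ score equations $S_{\boldsymbol{\vartheta}}(\delta_0, \boldsymbol{\vartheta}) = 0$. Note that this estimate is a function of $\delta_0$. Letting $\tilde{\boldsymbol{\theta}} = (\delta_0, \tilde{\boldsymbol{\vartheta}}(\delta_0))$, the quadratic (Rao) score statistic simplifies to

$$R(\delta_0) = S(\tilde{\boldsymbol{\theta}})^\top I^{-1}(\tilde{\boldsymbol{\theta}}) S(\tilde{\boldsymbol{\theta}}) = \left(S_\delta(\tilde{\boldsymbol{\theta}})^\top, \boldsymbol{0}^\top\right) I^{-1}(\tilde{\boldsymbol{\theta}}) \left(S_\delta(\tilde{\boldsymbol{\theta}})^\top, \boldsymbol{0}^\top\right)^\top = S_\delta(\tilde{\boldsymbol{\theta}})^\top A_{\delta,\delta}(\tilde{\boldsymbol{\theta}}) S_\delta(\tilde{\boldsymbol{\theta}})$$

where $A_{\delta,\delta}(\tilde{\boldsymbol{\theta}})$ denotes the submatrix corresponding to $\delta$ of the inverse Fisher information matrix and is given by the inverse of the Schur complement $I_{\delta,\delta}(\boldsymbol{\theta}) - I_{\delta,\boldsymbol{\vartheta}}(\boldsymbol{\theta}) I_{\boldsymbol{\vartheta},\boldsymbol{\vartheta}}^{-1}(\boldsymbol{\theta}) I_{\delta,\boldsymbol{\vartheta}}(\boldsymbol{\theta})^\top$. Under $H_0$, $R(\delta_0)$ converges asymptotically to a chi-square distribution with 1 degree of freedom, $R(\delta_0) \xrightarrow{D} \chi^2_1$. This result is explained by Rao [43]. Inverting the score statistic by enumerating values of $\delta_0$ allows for the construction of $(1 - \alpha)$ score CIs for $\delta$ defined as

$$\left\{\delta_0 \in \mathbb{R} \mid R(\delta_0) < \chi^2_1(1 - \alpha)\right\}$$

where $\chi^2_1(1 - \alpha)$ is the $(1 - \alpha)$ quantile of the chi-squared distribution with 1 degree of freedom. Equivalently, we can use the square root of the score statistic to construct a $(1 - \alpha)$ score interval using quantiles of the standard normal distribution, $\left\{\delta_0 \in \mathbb{R} \mid \Phi^{-1}(\alpha/2) < \sqrt{R(\delta_0)} \leq \Phi^{-1}(1 - \alpha/2)\right\}$. Finally, we apply the function $G$ to both the lower and upper limits of the interval to construct score confidence bands for the ROC curve or score CIs for its summary indices.

The score statistic is given by $R(\delta_0) = S_\delta(\tilde{\boldsymbol{\theta}})^2 A_{\delta,\delta}(\tilde{\boldsymbol{\theta}})$. Testing if there is a significant difference between the nondiseased and diseased populations corresponds to the hypothesis test $H_0: \delta = 0$. This is computationally efficient because only the distribution of $R(0)$ needs to be computed. However, computing score CIs for more than one parameter requires updating the restricted MLEs $\tilde{\boldsymbol{\vartheta}}(\delta_0)$ for an enumeration of $\delta_0$ values. This becomes computationally intractable when enumerating a higher-dimensional grid of parameters.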

2.3.2 Delta method

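Since the MLE satisfies

$$\sqrt{n}\left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \xrightarrow{D} N_{P+1}\left(\boldsymbol{0}, A_{\boldsymbol{\beta},\boldsymbol{\beta}}(\boldsymbol{\theta})\right)$$

it follows by the multivariate delta method that $G(\hat{\boldsymbol{\beta}})$ also follows a normal distribution with

$$\mathbb{V}\left(G(\hat{\boldsymbol{\beta}})\right) = \frac{1}{n} \nabla G(\boldsymbol{\beta})^\top A_{\boldsymbol{\beta},\boldsymbol{\beta}}(\boldsymbol{\theta}) \nabla G(\boldsymbol{\beta})$$

where $\nabla G(\boldsymbol{\beta})$ is the gradient of $G$ evaluated at $\boldsymbol{\beta}$ and the inverse Fisher information matrix $A_{\boldsymbol{\beta},\boldsymbol{\beta}}(\boldsymbol{\theta})$ is given by the inverse of the Schur complement $I_{\boldsymbol{\beta},\boldsymbol{\beta}}(\boldsymbol{\theta}) - I_{\boldsymbol{\beta},\boldsymbol{\vartheta}}(\boldsymbol{\theta}) I_{\boldsymbol{\vartheta},\boldsymbol{\vartheta}}^{-1}(\boldsymbol{\theta}) I_{\boldsymbol{\beta},\boldsymbol{\vartheta}}(\boldsymbol{\theta})^\top$. For example, when the shift term takes the linear form as in equation (5) and $G$ defines the AUC function for $F_Z = \text{probit}^{-1}$, the entries of $\nabla G(\boldsymbol{\beta})$ are given by

$$\frac{\partial G(\boldsymbol{\beta})}{\partial \delta} = \frac{1}{\sqrt{2}} C, \quad \frac{\partial G(\boldsymbol{\beta})}{\partial \xi_i} = 0 \quad \text{and} \quad \frac{\partial G(\boldsymbol{\beta})}{\partial \gamma_i} = \frac{x_i}{\sqrt{2}} C$$

where $C = \phi\left(\frac{\delta + x^\top \boldsymbol{\gamma}}{\sqrt{2}}\right)$, $\phi$ is the density of the standard normal distribution and $i$ indexes the $P$ covariates. In general, the gradient can be estimated by calculating such derivatives and evaluating the resulting function at the MLE. Similarly, the variance-covariance matrix of the estimated parameters $A_{\boldsymbol{\beta},\boldsymbol{\beta}}(\boldsymbol{\theta})$ can be computed by inverting the numerically evaluated Hessian matrix. Thus, a $(1 - \alpha)$ level CI for $G(\boldsymbol{\beta})$ is given by

$$G(\hat{\boldsymbol{\beta}}) \pm \Phi^{-1}(\alpha/2) \sqrt{\mathbb{V}\left(G(\hat{\boldsymbol{\beta}})\right)}$$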

2.3.3 Simulated intervals

When the function $G$ has complex derivatives, as would be the case for nonlinear shift terms $\mu_d(\boldsymbol{x})$ or when calculating optimal thresholds $c^*$ where $G$ includes the inverse of the transformation function, constructing CIs using the delta method becomes infeasible. For these cases, we apply a simple simulation-based algorithm which utilizes the asymptotic normality of the MLE to calculate CIs for the ROC curve and its summary indices, which are functions of the parameters of interest. The steps of the algorithm to construct $(1-\alpha)$ level CIs for $G(\hat{\boldsymbol{\beta}})$ can be summarized as follows:


Table 2. Overview of the different methods used in the simulation study.

Columns: Reference; R package; ROC (Estimate, CB); AUC (Estimate, CI); Youden index (Estimate, CI)

Hothorn39                    tram           ✓✓✓✓✓ ✓
Harrell Jr45,46              rms            ✓✓
Thas et al.33                pim            ✓✓
Therneau47                   survival       ✓✓
Robin et al.48               pROC           ✓✓✓✓✓
Fay49                        asht           ✓✓
Konietschke et al.50         nparcomp       ✓✓
Khan and Brandenburger51     ROCit          ✓✓✓✓✓
Feng et al.52                auRoc          ✓✓
Perez-Jaume et al.53         ThresholdROC   ✓✓
Ridout and Linkie54          overlap        ✓✓
Franco-Pereira et al.55      -              ✓✓
Pérez Fernández et al.56     nsROC          ✓✓✓✓✓

ROC: receiver operating characteristic; AUC: area under the ROC curve; CI: confidence interval; CB: confidence band. References to the original publication along with R software details are given. A check mark (✓) indicates that a method computes the specific metric. The metrics included estimates for the ROC curve, AUC, and Youden index as well as corresponding CBs or CIs.

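1. Generate $B$ independent samples from the asymptotic multivariate normal distribution of the parameter estimates, $N_{P+1}\big(\hat{\boldsymbol{\beta}}, \tfrac{1}{n} A_{\boldsymbol{\beta},\boldsymbol{\beta}}(\hat{\boldsymbol{\theta}})\big)$, and denote them as $\hat{\boldsymbol{\beta}}^*_1, \dots, \hat{\boldsymbol{\beta}}^*_B$.

2. For each sample $b = 1, \dots, B$, calculate the function of interest $G(\hat{\boldsymbol{\beta}}^*_b)$.

3. Construct the CI $\big(Q_{G(\hat{\boldsymbol{\beta}}^*)}(\alpha/2),\; Q_{G(\hat{\boldsymbol{\beta}}^*)}(1-\alpha/2)\big)$, where $Q_{G(\hat{\boldsymbol{\beta}}^*)}$ is the empirical quantile function of the sample $G(\hat{\boldsymbol{\beta}}^*_1), \dots, G(\hat{\boldsymbol{\beta}}^*_B)$.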

A similar algorithm is presented by Mandel,44 who discusses its asymptotic validity and presents several examples showing that its empirical coverage adheres to nominal levels, with results similar to the delta method.
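The three steps of the simulation-based interval can be sketched in a few lines of Python for the one-parameter case. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the function of interest is the Youden index of the probit shift model, $J = 2\Phi(\delta/2) - 1$, and the standard error of 0.015 is back-calculated from the width of the score interval for $\delta$ in Table 3 rather than obtained from a fitted model. A multi-parameter version would draw from a multivariate normal distribution instead of a univariate one.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def youden_probit(delta):
    """Youden index of the two-sample shift model with probit link."""
    return 2 * STD_NORMAL.cdf(delta / 2) - 1


def simulated_ci(g, estimate, std_error, n_draws=10000, alpha=0.05, seed=1):
    """Percentile CI for g(parameter) based on draws from N(estimate, std_error^2)."""
    rng = random.Random(seed)
    # Steps 1 and 2: sample parameters and evaluate the function of interest.
    values = sorted(g(rng.gauss(estimate, std_error)) for _ in range(n_draws))
    # Step 3: empirical alpha/2 and 1 - alpha/2 quantiles.
    lower_index = int(n_draws * alpha / 2)
    upper_index = int(n_draws * (1 - alpha / 2)) - 1
    return values[lower_index], values[upper_index]


# delta_hat from Table 3 (probit row); std_error = 0.015 is an assumed value.
lower, upper = simulated_ci(youden_probit, 1.492, 0.015)
print(f"J = {youden_probit(1.492):.3f}, 95% CI ({lower:.3f}, {upper:.3f})")
```

No derivatives of $G$ are needed, which is the practical advantage over the delta method when $G$ is complicated.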

3 Empirical evaluation

We conducted a simulation study to evaluate the performance of our estimators in the two-sample setting. We chose this

setting to be able to compare various estimators commonly used in practice. The software details of all the methods used

alongside their respective features and references are summarized in Table 2.

We considered a data generating process (DGP) such that nondiseased test results followed a standard normal distribution $F_0(y) = \Phi(y)$ and the diseased test results a distribution with the CDF $F_1(y) = F_{Z_{\mathrm{DGP}}}\big(F^{-1}_{Z_{\mathrm{DGP}}}(\Phi(y)) - \delta\big)$. To obtain different shapes of the ROC curve, we considered three choices of $F_{Z_{\mathrm{DGP}}} \in \{\mathrm{probit}^{-1}, \mathrm{logit}^{-1}, \mathrm{cloglog}^{-1}\}$ and varied $\delta$ such that the AUC $\in \{0.5, 0.65, 0.8, 0.95\}$ or that $J \in \{0, 0.25, 0.5, 0.8\}$, leading to a variety of configurations. Under this simulation parameterization, the true ROC curves followed the form of equation (5) and the true summary indices could be calculated as a function of $\delta$ from Table 1.
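The data generating process can be sketched as follows. Since $F_1(y) = F_Z(h(y) - \delta)$ with $h = F_Z^{-1} \circ \Phi$, a diseased result can be drawn as $Y_1 = \Phi^{-1}(F_Z(Z + \delta))$ with $Z \sim F_Z$. The Python code below is a minimal illustration, not the simulation code used in the study: all function names are hypothetical, and the clamping of probabilities away from 0 and 1 is a numerical safeguard added here.

```python
import math
import random
from bisect import bisect_left
from statistics import NormalDist

STD_NORMAL = NormalDist()

# (CDF F_Z, quantile function F_Z^{-1}) for the three inverse link functions.
LINKS = {
    "probit": (STD_NORMAL.cdf, STD_NORMAL.inv_cdf),
    "logit": (lambda z: 1.0 / (1.0 + math.exp(-z)),
              lambda p: math.log(p / (1.0 - p))),
    "cloglog": (lambda z: -math.expm1(-math.exp(z)),
                lambda p: math.log(-math.log1p(-p))),
}


def _clamp(p, eps=1e-12):
    """Keep probabilities strictly inside (0, 1) so quantile functions stay finite."""
    return min(max(p, eps), 1.0 - eps)


def draw_diseased(delta, link, rng):
    """Draw Y1 = Phi^{-1}(F_Z(Z + delta)) with Z ~ F_Z."""
    cdf, quantile = LINKS[link]
    z = quantile(_clamp(rng.random()))
    return STD_NORMAL.inv_cdf(_clamp(cdf(z + delta)))


def simulate(delta, link, n, seed=1):
    """Balanced sample: n standard normal nondiseased and n diseased results."""
    rng = random.Random(seed)
    nondiseased = [rng.gauss(0.0, 1.0) for _ in range(n)]
    diseased = [draw_diseased(delta, link, rng) for _ in range(n)]
    return nondiseased, diseased


def empirical_auc(nondiseased, diseased):
    """Mann-Whitney estimate of P(Y1 > Y0); ties are ignored for continuous data."""
    ordered = sorted(nondiseased)
    wins = sum(bisect_left(ordered, y) for y in diseased)
    return wins / (len(ordered) * len(diseased))


nondiseased, diseased = simulate(delta=1.0, link="probit", n=4000)
print(round(empirical_auc(nondiseased, diseased), 3))
```

For the probit link the diseased results are simply $N(\delta, 1)$, so the empirical AUC should be close to $\Phi(\delta/\sqrt{2})$; this provides a quick sanity check of the sampler.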

The conventional binormal model corresponded to $F_{Z_{\mathrm{DGP}}} = \mathrm{probit}^{-1} = \Phi$ and induced proper binormal ROC curves. This was the only configuration where the test results for both groups were normally distributed. We included this configuration to ascertain the loss of power associated with our estimators when the standard binormal assumption held. For other choices of $F_{Z_{\mathrm{DGP}}}$ with AUC > 0.5, the resulting distributions of the diseased test results were non-normal, with variances and higher moments differing between the two groups. Specifically, the configuration of $F_{Z_{\mathrm{DGP}}} = \mathrm{logit}^{-1}$ led to light-tailed distributions for the diseased test results, while $F_{Z_{\mathrm{DGP}}} = \mathrm{cloglog}^{-1}$ led to skewed, heavy-tailed distributions. The corresponding density functions for the data generating model with selected AUC values are given in Figure 1.

Figure 1. Density functions for the model used to generate the data for the simulations. The nondiseased test results followed a standard normal distribution corresponding to an AUC = 0.5. The diseased test results varied with three choices of $F_{Z_{\mathrm{DGP}}}$: $\mathrm{probit}^{-1}$, $\mathrm{logit}^{-1}$, and $\mathrm{cloglog}^{-1}$, each of which had an AUC of 0.5, 0.65, 0.8, and 0.95. DGP: data generating process; AUC: area under the receiver operating characteristic curve.

For 10,000 replications of each configuration, we generated balanced data sets with sample sizes $N_0 = N_1 \in \{25, 50, 100\}$. The transformation models discussed in Section 2 were fitted to the simulated data sets assuming a parameterization of the transformation function given by a Bernstein basis polynomial of order $M = 6$ (see Hothorn39 for a discussion on suitable choices for $M$). The true data-generating model had a nonlinear transformation function $h = F^{-1}_{Z_{\mathrm{DGP}}} \circ \Phi$. Our model estimation procedure aimed to approximate this function alongside the shift parameter $\delta$. The functions implementing transformation models for different choices of $F_Z$ are available from the tram add-on package.39 Note that the function $F_{Z_{\mathrm{DGP}}}$ is for the data generating process in the simulation study and is distinct from $F_Z$, the inverse link function used in the model. When $F_{Z_{\mathrm{DGP}}} = F_Z$, the model is correctly specified for the DGP.
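To make the Bernstein parameterization concrete, the following minimal Python sketch evaluates a transformation function of order $M = 6$ as a linear combination of Bernstein basis polynomials on an interval. The function names, the support $[-4, 4]$ and the coefficient values are hypothetical choices for illustration only; nondecreasing coefficients are used because they are a sufficient condition for a nondecreasing Bernstein polynomial.

```python
from math import comb


def bernstein_basis(y, order, lower, upper):
    """Values of the order + 1 Bernstein basis polynomials at y, rescaled to [lower, upper]."""
    t = (y - lower) / (upper - lower)
    return [comb(order, m) * t**m * (1 - t) ** (order - m) for m in range(order + 1)]


def transformation(y, coefficients, lower, upper):
    """h(y) = sum_m coefficients[m] * b_m(y), a Bernstein polynomial of order len(coefficients) - 1."""
    basis = bernstein_basis(y, len(coefficients) - 1, lower, upper)
    return sum(c * b for c, b in zip(coefficients, basis))


# Hypothetical nondecreasing coefficients for order M = 6 (seven coefficients).
theta = [-2.0, -1.2, -0.5, 0.0, 0.6, 1.3, 2.1]
grid = [-4.0 + 0.5 * k for k in range(17)]
values = [transformation(y, theta, -4.0, 4.0) for y in grid]
print([round(v, 3) for v in values])
```

At the boundaries the polynomial equals the first and last coefficient, and the basis functions sum to one at every point of the support.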

Figure 2 displays the distribution of bias for the AUC estimates using the proposed methods under the various simulation configurations. We found that all three methods had minimal bias for an AUC = 0.5, where the test results were unable to distinguish between the two groups. The models with $F_Z \in \{\mathrm{probit}^{-1}, \mathrm{logit}^{-1}\}$ yielded approximately unbiased AUC estimates in all cases, even when they were misspecified for the true data generating process. However, estimates based on the proportional hazards model, $F_Z = \mathrm{cloglog}^{-1}$, were biased for data generating processes other than the one for which it was correctly specified.

We compared our approaches to a set of alternative methods (see Table 2) for computing CIs for the AUC and Youden index. We detail the empirical coverage and average width of the CIs for the AUC in Supplemental Figures S1 and S2, respectively. Estimates based on transformation models (R packages tram, rms [function orm], pim) yielded coverage close to the nominal level and significantly outperformed the other methods when the model was correctly specified for the true data generating process. All other methods generally performed close to nominal levels for low to medium AUC values (0.5–0.8), but broke down for higher AUC values. In addition, the score CIs from the transformation model with $F_Z = \mathrm{logit}^{-1}$ were accurate even when it was misspecified for the true data generating process. However, methods which used $F_Z = \mathrm{cloglog}^{-1}$ gave CIs which were shorter in length (overconfident).

Analogously, Supplemental Figures S3 and S4 detail the coverage and length of the CIs for the Youden index. The methods which were based on the overlap coefficient failed to cover the configuration where $J = 0$ because their lower limits were never below 0. Our methods estimated CIs for $\delta \in \mathbb{R}$, which naturally accounted for this scenario. The transformation model with $F_Z = \mathrm{logit}^{-1}$ provided coverage at nominal levels for all simulation configurations with a relatively small CI width. The approach of Franco-Pereira et al.55 (FP) was also accurate under model misspecification but was more involved. Namely, it consisted of estimating Box-Cox transformation parameters under a binormal framework with bootstrap variance, all carried out on the logit scale and then back-transformed. In a setting with covariates, censoring or with $J = 0$ this methodology would be limited.

Figure 2. Distribution of bias from the simulation study for estimation of the AUC. The DGP for nondiseased results was $F_0(y) = \Phi(y)$ and for diseased results $F_1(y) = F_{Z_{\mathrm{DGP}}}\big(F^{-1}_{Z_{\mathrm{DGP}}}(\Phi(y)) - \delta\big)$. We varied $F_{Z_{\mathrm{DGP}}} \in \{\mathrm{probit}^{-1}, \mathrm{logit}^{-1}, \mathrm{cloglog}^{-1}\}$, AUC and sample size. The proposed methods also varied by the same inverse link functions. An alignment of colors in the column (DGP) and the fill of the box plot indicates that the method is correctly specified for the DGP. DGP: data generating process; AUC: area under the receiver operating characteristic curve.

Supplemental Figures S5 and S6 show the coverage and area of the confidence bands for the ROC curve. All the approaches based on transformation models covered the configuration with AUC = 0.5 accurately. However, the other approaches did not yield coverage close to nominal levels in this configuration with the exception of Martínez-Camblor et al.,57 whose confidence bands had a significantly larger area, indicating lower power. For all other configurations, only transformation models which were correctly specified for the true data-generating model provided accurate results.

In addition to the simulations described above, we considered three other scenarios to evaluate the robustness of our proposed methods to model misspecification. The details for each of these scenarios are given in Supplemental Materials Section B. In terms of the AUC, we noticed that our models are generally robust to misspecification, but can break down in certain cases. However, the proportional hazards model with $F_Z = \mathrm{cloglog}^{-1}$ resulted in poor performance under misspecified configurations, indicating that it should be used with caution.

4 Application

The prevalence of obesity has increased consistently in most countries in recent decades and this trend is a serious global health concern.58 Obesity contributes directly to increased risk of cardiovascular disease (CVD) and its risk factors, including type 2 diabetes, hypertension, and dyslipidemia.59,60 Metabolic syndrome (MetS) refers to the joint presence of several cardiovascular risk factors and is characterized by insulin resistance.61 The National Cholesterol Education Program Adult Treatment Panel III (NCEP-ATP III) criteria are the most widely used definition for MetS, but they require laboratory analysis of a blood sample. This has led to the search for non-invasive techniques which allow reliable and early detection of MetS.

Waist-to-height ratio (WHtR) is a well-known anthropometric index used to predict visceral obesity. Visceral obesity is an independent risk factor for development of MetS by means of the increased production of free fatty acids whose presence obstructs insulin activity.62 This suggests that higher values of WHtR, reflecting obesity and CVD risk factors, are more indicative of incident MetS. Several studies have found that WHtR is highly predictive of MetS.63–65 However, as waist circumference changes with age and gender,66 it is also important to study whether or not the performance of WHtR at diagnosing MetS is impacted by these variables. Evaluation of WHtR as a predictor of MetS after adjusting for covariates is necessary so that more tailored interventions can be initiated to improve outcomes.

We illustrate our methods using data from a cross-sectional study designed to validate the use of WHtR and systolic blood pressure (SBP) as markers for early detection of MetS in a working population from the Balearic Islands (Spain). Detailed descriptions of the study methodology and population characteristics are reported in Romero-Saldaña et al.67 Briefly, data on 60 799 workers were collected during their periodic work health assessments between 2012 and 2016. Presence of MetS was determined by the NCEP-ATP III criteria and the sample included 5487 workers with MetS.

4.1 Two-sample analysis

We first examined the unconditional performance of WHtR as a marker to diagnose MetS, denoted $Y$ and $D$, respectively. We fitted a linear transformation model with corresponding ROC curve of the form in equation (5), where $\delta$ is the shift parameter, for various choices of the inverse link function $F_Z$. Associated inference of the AUC and $J$ was calculated using the closed-form expressions from Table 1. The resulting estimates are presented in Table 3. The AUCs were consistently bounded away from 0.5, indicating a good capacity of WHtR to discriminate between workers with and without MetS. This can also be seen from the estimated ROC curve plotted in Figure 3, which lies well above the diagonal line, as well as the modeled densities, which have a small degree of overlap. The CIs and uniform confidence bands were quite narrow due to the large sample size.

Table 3. Estimates and 95% score confidence intervals of the shift parameter, AUC and J in the two-sample linear transformation model for WHtR as a marker of MetS.

F_Z          δ                      AUC                    J
probit^-1    1.492 (1.462, 1.521)   0.854 (0.849, 0.859)   0.544 (0.535, 0.553)
logit^-1     2.785 (2.730, 2.841)   0.871 (0.866, 0.875)   0.602 (0.593, 0.611)
cloglog^-1   1.186 (1.157, 1.215)   0.766 (0.761, 0.771)   0.412 (0.403, 0.421)
loglog^-1    1.425 (1.397, 1.453)   0.806 (0.802, 0.810)   0.484 (0.475, 0.492)

AUC: area under the receiver operating characteristic curve; MetS: metabolic syndrome; WHtR: waist-to-height ratio.
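The mapping from the shift parameter to the summary indices can be checked numerically. The following Python sketch uses closed-form expressions for the AUC and Youden index of a two-sample shift model under each link; these formulas are derived here from the shift model $F_1(y) = F_Z(h(y) - \delta)$ and are assumed to match those in Table 1, which is not reproduced in this section. The function name and the dictionary of table values are hypothetical helpers. For the shift estimates in Table 3, the formulas agree with the reported AUC and $J$ to within rounding.

```python
from math import exp, sqrt, tanh
from statistics import NormalDist

STD_NORMAL = NormalDist()


def auc_and_youden(delta, link):
    """Closed-form AUC and Youden index of a two-sample shift model (delta != 0)."""
    if link == "probit":
        return STD_NORMAL.cdf(delta / sqrt(2)), 2 * STD_NORMAL.cdf(delta / 2) - 1
    if link == "logit":
        e = exp(delta)
        return e * (e - 1 - delta) / (e - 1) ** 2, tanh(delta / 4)
    if link in ("cloglog", "loglog"):
        # Both extreme value links give the same AUC and Youden index for a given delta.
        e = exp(delta)
        youden = exp(-delta / (e - 1)) - exp(-delta * e / (e - 1))
        return 1 / (1 + exp(-delta)), youden
    raise ValueError(f"unknown link: {link}")


# Point estimates copied from Table 3: link -> (delta, AUC, J).
TABLE_3 = {
    "probit": (1.492, 0.854, 0.544),
    "logit": (2.785, 0.871, 0.602),
    "cloglog": (1.186, 0.766, 0.412),
    "loglog": (1.425, 0.806, 0.484),
}

for link, (delta, auc_reported, j_reported) in TABLE_3.items():
    auc, youden = auc_and_youden(delta, link)
    print(f"{link:8s} AUC {auc:.4f} (reported {auc_reported}), J {youden:.4f} (reported {j_reported})")
```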


Figure 3. Estimates from the linear transformation model with a single shift parameter, $h(Y) = \delta d + Z$, where $Z$ follows a standard logistic distribution. (A) Density functions of WHtR for the workers who were diagnosed with MetS (dotted line) and those who were not (solid line). (B) ROC curve for WHtR as a marker of MetS; 95% uniform score confidence bands are represented by gray shaded areas. MetS: metabolic syndrome; WHtR: waist-to-height ratio.

4.2 Conditional ROC analysis

Next, we investigated if the discriminatory ability of WHtR in separating workers with and without MetS varies with covariates. We considered a transformation model that included the main effects of covariates plus interaction terms with the disease indicator, which leads to the ROC curve given by

$$\mathrm{ROC}(p \mid \boldsymbol{x}) = 1 - F_Z\big(F_Z^{-1}(1-p) - (\delta + \boldsymbol{\gamma}^\top \boldsymbol{x})\big),$$

where the covariates $\boldsymbol{x}$ included age, gender, and tobacco consumption. The choice of $F_Z = \mathrm{logit}^{-1}$ was made using repeated holdout validation. We describe this model selection procedure in Supplemental Material Section C and show the results for different model choices and parameterizations.
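The covariate-specific ROC curve can be evaluated directly from this expression. The Python sketch below does so for the logit link; the function name and all coefficient and covariate values are hypothetical and chosen only for illustration, not estimates from the MetS data. For the logit link the curve simplifies to $p\,e^{\eta} / (1 - p + p\,e^{\eta})$ with $\eta = \delta + \boldsymbol{\gamma}^\top \boldsymbol{x}$, which gives a convenient check.

```python
from math import exp, log


def roc_conditional_logit(p, delta, gamma, x):
    """ROC(p | x) = 1 - F_Z(F_Z^{-1}(1 - p) - (delta + gamma'x)) for F_Z = logit^{-1}, 0 < p < 1."""
    shift = delta + sum(g * v for g, v in zip(gamma, x))
    quantile = log((1 - p) / p)  # F_Z^{-1}(1 - p) for the standard logistic distribution
    return 1 - 1 / (1 + exp(-(quantile - shift)))


# Hypothetical coefficients: baseline shift and effects for (age in decades, male indicator).
delta = 3.0
gamma = [-0.2, -0.4]

for label, x in [("female, age 30", [3.0, 0.0]), ("male, age 50", [5.0, 1.0])]:
    curve = [round(roc_conditional_logit(p, delta, gamma, x), 3) for p in (0.05, 0.1, 0.2, 0.5)]
    print(label, curve)
```

A larger value of $\delta + \boldsymbol{\gamma}^\top \boldsymbol{x}$ moves the whole curve toward the upper left corner, so the sign of each entry of $\boldsymbol{\gamma}$ indicates whether the corresponding covariate improves or reduces discrimination.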

Figure 4 displays the covariate-specific ROC curves fitted to these data. The performance of WHtR appeared to be better for females compared to males and decreased with age. The effect of smoking, although significant in the model, did not seem to substantially alter the ROC curves when the other covariates were kept fixed. To inspect the covariate effects further, we calculated the age- and gender-specific AUCs and Youden indices from the model. Figure 5 clearly shows that the ability of WHtR to distinguish workers with MetS was consistently better for females and decreased with age.

5 Discussion

This article presents a new modeling framework for ROC analysis that can be used to characterize the accuracy of medical diagnostic tests. Our model is based on estimating an unknown transformation function for the test results and yields a distribution-free yet model-based estimator for the ROC curve. Covariates that influence the diagnostic accuracy of tests can naturally be accommodated as regression parameters into the model and covariate-specific summary indices such as the AUC and Youden index are easily computed using closed-form expressions.

Our proposed approach has several features which distinguish it from contemporary methods of ROC analysis. Firstly, we employ maximum likelihood to jointly estimate all parameters defining the transformation function and regression coefficients. This implies the variation in the estimated transformation parameters is accounted for and appropriately propagated to inference for the ROC curve. In turn, asymptotic efficiency is guaranteed for our estimators and we avoid reliance on resampling procedures for the construction of CIs. Secondly, transformation models focus on estimating the conditional distribution function whose evaluation directly provides the likelihood contributions for interval-, right-, and left-censored data that commonly arise due to instrument detection limits. Thirdly, no strong assumptions are made regarding the transformation function, which results in a highly flexible model that retains interpretability of the regression coefficients. Lastly, software implementations for all the methods described in this article are available in the tram R add-on package (see Supplemental Material for example code), thus enabling a unified framework for ROC analysis.

Figure 4. Estimated covariate-specific ROC curves for WHtR as a marker of MetS for female (solid line) and male workers (dashed line). Vertical panels represent a specific age (30, 40, 50) and horizontal panels smoking status. ROC: receiver operating characteristic; MetS: metabolic syndrome; WHtR: waist-to-height ratio.

Figure 5. Age-based AUC and Youden indices where WHtR is used as a marker to detect MetS for non-smoking female (solid line) and male (dashed line) workers. 95% Wald pointwise confidence bands are represented by gray shaded areas. AUC: area under the receiver operating characteristic curve; MetS: metabolic syndrome; WHtR: waist-to-height ratio.

In our simulation study, interestingly, we found that a model with $F_Z = \mathrm{logit}^{-1}$ provided accurate results even when it was misspecified for the true data generating process. This model also behaves very similarly to the semiparametric cumulative probability model,68 both of which estimate a log-odds ratio $\delta$. The equivalence of the transformation model's odds ratio to the MWW test statistic has been well studied.69 The MWW statistic has a bounded influence function and is robust to contaminations of the specified model.70 Due to their equivalence, we hypothesize that the transformation model with $F_Z = \mathrm{logit}^{-1}$ is also endowed with the same robustness properties as the MWW and thus can be chosen when no a priori model is known.

One aspect that warrants further investigation is model selection, specifically with regard to the choice of $F_Z$. One strategy would be to define $F_Z$ tailored to a specific interpretation of the parameters $\delta$, $\boldsymbol{\beta}$, and $\boldsymbol{\gamma}$, for example, as log-odds ratios with $F_Z = \mathrm{logit}^{-1}$ or $F_Z = \mathrm{cloglog}^{-1}$ for hazard ratios.34 A second option is to use some form of cross validation in combination with model assessment via the probability integral transform (PIT) (as discussed in Supplemental Materials Section C). Third, and in analogy to single index models, one could introduce parameters to $F_Z$ such that the shape of the inverse link function is estimated along with all other model parameters (McLain and Ghosh71 discuss a family of link functions including the complementary log-log and logit). Finally, we could completely relax the assumption that the difference between the diseased and nondiseased distributions is described by a shift term. In this case, separate transformation functions would be allowed in each of the two groups. Namely, consider a stratified model where the nondiseased results follow a distribution with the CDF $F_0(y) = F_Z(h_0(y))$ and the diseased with the CDF $F_1(y) = F_Z(h_1(y))$. Defining a new transformation function $r = h_1 \circ h_0^{-1} \circ F_Z^{-1}\colon [0, 1] \to \mathbb{R}$, the smooth ROC curve with no shift assumptions is given by $\mathrm{ROC}(p) = 1 - F_Z(r(1-p))$. This model has more flexibility but sacrifices the properness property desirable for the ROC curves. Furthermore, care has to be taken in defining the correct likelihood contributions for accurate inference of this model as uncertainty enters from both transformation functions.
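The loss of properness in the stratified model can be seen in a small example. The Python sketch below assumes, purely for illustration, $F_Z = \Phi$ with $h_0(y) = y$ and $h_1(y) = (y - \mu)/\sigma$, which is the classical binormal model with unequal variances; the function name and the values of $\mu$ and $\sigma$ are hypothetical. In this case $r(u) = (\Phi^{-1}(u) - \mu)/\sigma$, and for $\sigma \neq 1$ the resulting ROC curve crosses the diagonal.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def roc_stratified_probit(p, mu, sigma):
    """ROC(p) = 1 - Phi(r(1 - p)) with r = h1 o h0^{-1} o Phi^{-1}, h0(y) = y, h1(y) = (y - mu) / sigma."""
    r = (STD_NORMAL.inv_cdf(1 - p) - mu) / sigma
    return 1 - STD_NORMAL.cdf(r)


# With sigma = 1 the stratified model reduces to the shift model with delta = mu.
print(round(roc_stratified_probit(0.5, mu=0.5, sigma=1.0), 3))

# With sigma = 2 the curve lies above the diagonal at p = 0.5 but below it at p = 0.99.
for p in (0.5, 0.99):
    print(p, round(roc_stratified_probit(p, mu=0.5, sigma=2.0), 3))
```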

In future work, we plan to pursue various extensions of transformation models for ROC analysis to consider (1)

penalty terms for high-dimensional covariates,72 (2) mixed effects for clustered observations,73 and (3) covariate-dependent

transformation functions through forest-based machine learning methods.74

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Swiss National Science Foundation, grant number 200021_184603.

ORCID iD

Torsten Hothorn https://orcid.org/0000-0001-8301-0471

Supplemental material

Supplemental materials for this article are available online.

References

1. Pepe MS. The Statistical Evaluation of Medical Tests for Classiﬁcation and Prediction. Oxford, UK: Oxford University Press, 2003.

2. Zou KH, Liu A, Bandos AI, et al. Statistical Evaluation of Diagnostic Performance: Topics in ROC Analysis. Boca Raton, FL, USA:

CRC Press, 2011.

3. Pepe MS. A regression modelling framework for receiver operating characteristic curves in medical diagnostic testing. Biometrika

1997; 84: 595–608.

4. Faraggi D. Adjusting receiver operating characteristic curves and related indices for covariates. J R Stat Soc: Ser D (The Statistician)

2003; 52: 179–192.

5. Perkins NJ, Schisterman EF and Vexler A. Receiver operating characteristic curve inference from a sample with a limit of detection.

Am J Epidemiol 2007; 165: 325–333.

6. Bantis LE, Yan Q, Tsimikas JV, et al. Estimation of smooth ROC curves for biomarkers with limits of detection. Stat Med 2017;

36: 3830–3843.

7. Inácio V, Lourenço VM, de Carvalho M, et al. Robust and ﬂexible inference for the covariate-speciﬁc receiver operating

characteristic curve. Stat Med 2021; 40: 5779–5795.

8. Inácio V, Rodríguez-Álvarez MX and Gayoso-Diz P. Statistical evaluation of medical tests. Annu Rev Stat Appl 2021; 8: 41–67.


9. Zou KH, Tempany CM, Fielding JR, et al. Original smooth receiver operating characteristic curve estimation from continuous data:

statistical methods for analyzing the predictive value of spiral CT of ureteral stones. Acad Radiol 1998; 5: 680–687.

10. Zou KH, Hall W. Two transformation models for estimating an ROC curve derived from continuous data. J Appl Stat 2000; 27:

621–631.

11. Zou KH, Hall WJ and Shapiro DE. Smooth non-parametric receiver operating characteristic (ROC) curves for continuous diagnostic

tests. Stat Med 1997; 16: 2143–2156.

12. Hothorn T, Kneib T and Bühlmann P. Conditional transformation models. J R Stat Soc: Ser B (Statistical Methodology) 2014; 76:

3–27.

13. Hothorn T, Möst L and Bühlmann P. Most likely transformations. Scand J Stat 2018; 45: 110–134.

14. Bamber D. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J Math Psychol 1975; 12: 387–415.

15. Youden WJ. Index for rating diagnostic tests. Cancer 1950; 3: 32–35.

16. Komaba A, Johno H and Nakamoto K. A novel statistical approach for two-sample testing based on the overlap coefﬁcient, 2022.

https://arxiv.org/abs/2206.03166. arXiv:2206.03166 [math.ST].

17. Weitzman MS. Measures of Overlap of Income Distributions of White and Negro Families in the United States.3. Washington, DC:

US Bureau of the Census, 1970. Washington, D.C.

18. Feller W. An Introduction to Probability Theory and Its Applications. New York, NY, USA: Wiley, 1991.

19. Schmid F, Schmidt A. Nonparametric estimation of the coefﬁcient of overlapping—theory and empirical application. Comput Stat

Data Anal 2006; 50: 1583–1596.

20. Martínez-Camblor P. About the use of the overlap coefﬁcient in the binary classiﬁcation context. Commun Stat-Theor Method 2022;

1–11.

21. Tosteson ANA, Begg CB. A general regression methodology for ROC curve estimation. Med Decis Making 1988; 8: 204–215.

22. Bickel PJ, Doksum KA. An analysis of transformations revisited. J Am Stat Assoc 1981; 76: 296–311.

23. Dorfman DD, Alf Jr E. Maximum-likelihood estimation of parameters of signal-detection theory and determination of conﬁdence

intervals-rating-method data. J Math Psychol 1969; 6: 487–496.

24. Ogilvie JC, Creelman CD. Maximum-likelihood estimation of receiver operating characteristic curve parameters. J Math Psychol

1968; 5: 377–391.

25. Gönen M, Heller G. Lehmann family of ROC curves. Med Decis Making 2010; 30: 509–517.

26. Khan RA. Resilience family of receiver operating characteristic curves. IEEE Trans Reliab 2022. DOI: 10.1109/TR.2022.3194710.

27. Zou KH. Analysis of Some Transformation Models for the Two-sample Problem With Special Reference to Receiver Operating

Characteristic Curves. PhD thesis, University of Rochester, 1997.

28. Alonzo TA, Pepe MS. Distribution-free ROC analysis using binary regression techniques. Biostatistics 2002; 3: 421–432.

29. Pan X, Metz CE. The “proper” binormal model: parametric receiver operating characteristic curve estimation with degenerate data.

Acad Radiol 1997; 4: 380–389.

30. McIntosh MW, Pepe MS. Combining several screening tests: optimality of the risk score. Biometrics 2002; 58: 657–664.

31. Siegfried S, Kook L and Hothorn T. Distribution-free location-scale regression. Am Stat 2022. DOI: 10.1080/00031305.2023.2203177.

32. Dodd LE, Pepe MS. Partial AUC estimation and regression. Biometrics 2003; 59: 614–623.

33. Thas O, Neve JD, Clement L, et al. Probabilistic index models. J R Stat Soc: Ser B (Statistical Methodology) 2012; 74: 623–671.

34. De Neve J, Thas O and Gerds TA. Semiparametric linear transformation models: effect measures, estimators, and applications. Stat

Med 2019; 38: 1484–1501.

35. Farouki RT. The Bernstein polynomial basis: a centennial retrospective. Comput Aided Geom Des 2012; 29: 379–419.

36. Lindsey JK. Parametric Statistical Inference. Oxford, UK: Oxford University Press, 1996.

37. Rao CR. Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation.

In Mathematical Proceedings of the Cambridge Philosophical Society, volume 44. Cambridge, UK: Cambridge University Press,

1948. pp. 50–57.

38. Dagenais MG, Dufour JM. Invariance, nonlinear models, and asymptotic tests. Economet: J Economet Soc 1991; 59: 1601–1615.

39. Hothorn T. Most likely transformations: the mlt package. J Stat Softw 2020; 92: 1–68.

40. Lynn HS. Maximum likelihood inference for left-censored HIV RNA data. Stat Med 2001; 20: 33–45.

41. Mumford SL, Schisterman EF, Vexler A, et al. Pooling biospecimens and limits of detection: effects on ROC curve analysis.

Biostatistics 2006; 7: 585–598.

42. Xiong C, Luo J, Agboola F, et al. A family of estimators to diagnostic accuracy when candidate tests are subject to detection

limits—application to diagnosing early stage Alzheimer’s disease. Stat Methods Med Res 2022; 31: 882–898.

43. Rao CR. Score test: historical review and recent developments. In Advances in Ranking and Selection, Multiple Comparisons, and

Reliability. Boston, MA, USA: Birkhäuser, 2005; pp. 8–20.

44. Mandel M. Simulation-based conﬁdence intervals for functions with complicated derivatives. Am Stat 2013; 67: 76–81.

45. Harrell Jr FE. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis.

608. New York: Springer, 2001.

46. Harrell Jr FE. rms: Regression Modeling Strategies, 2022. https://CRAN.R-project.org/package=rms. R package version 6.3-0.

47. Therneau TM. survival: A Package for Survival Analysis in R, 2022. https://CRAN.R-project.org/package=survival. R package version 3.3-1.

48. Robin X, Turck N, Hainard A, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinform 2011; 12: 77.

49. Fay MP. asht: Applied Statistical Hypothesis Tests, 2022. https://CRAN.R-project.org/package=asht. R package version 0.9.7.

50. Konietschke F, Placzek M, Schaarschmidt F, et al. nparcomp: an R software package for nonparametric multiple comparisons and simultaneous confidence intervals. J Stat Softw 2015; 64: 1–17. DOI: http://www.jstatsoft.org/v64/i09/.

51. Khan MRA, Brandenburger T. ROCit: Performance Assessment of Binary Classifier with Visualization, 2020. https://CRAN.R-project.org/package=ROCit. R package version 2.1.1.

52. Feng D, Manevski D and Perme MP. auRoc: Various Methods to Estimate the AUC, 2020. https://CRAN.R-project.org/package=auRoc. R package version 0.2-1.

53. Perez-Jaume S, Skaltsa K, Pallarès N, et al. ThresholdROC: optimum threshold estimation tools for continuous diagnostic tests in R. J Stat Softw 2017; 82: 1–21.

54. Ridout M, Linkie M. Estimating overlap of daily activity patterns from camera trap data. J Agric Biol Environ Stat 2009; 14: 322–337.

55. Franco-Pereira AM, Nakas CT, Reiser B, et al. Inference on the overlap coefficient: the binormal approach and alternatives. Stat Methods Med Res 2021; 30: 2672–2684.

56. Pérez Fernández S, Martínez Camblor P, Filzmoser P, et al. nsROC: an R package for non-standard ROC curve analysis. R J 2018; 10: 55–77.

57. Martínez-Camblor P, Pérez-Fernández S and Corral N. Efficient nonparametric confidence bands for receiver operating-characteristic curves. Stat Methods Med Res 2018; 27: 1892–1908.

58. Abarca-Gómez L, Abdeen ZA, Hamid ZA, et al. Worldwide trends in body-mass index, underweight, overweight, and obesity from

1975 to 2016: a pooled analysis of 2416 population-based measurement studies in 128.9 million children, adolescents, and adults.

Lancet 2017; 390: 2627–2642.

59. Zalesin KC, Franklin BA, Miller WM, et al. Impact of obesity on cardiovascular disease. Endocrinol Metab Clin North Am 2008;

37: 663–684.

60. Grundy SM. Obesity, metabolic syndrome, and cardiovascular disease. J Clin Endocr Metab 2004; 89: 2595–2600.

61. Eckel RH, Alberti KGMM, Grundy SM, et al. The metabolic syndrome. Lancet 2010; 375: 181–183.

62. Bosello O, Zamboni M. Visceral obesity and metabolic syndrome. Obes Rev 2000; 1: 47–56.

63. Shao J, Yu L, Shen X, et al. Waist-to-height ratio, an optimal predictor for obesity and metabolic syndrome in Chinese adults. J Nutr Health Aging 2010; 14: 782–785.

64. Romero-Saldaña M, Fuentes-Jiménez FJ, Vaquero-Abellán M, et al. New non-invasive method for early detection of metabolic

syndrome in the working population. Eur J Cardiovasc Nurs 2016; 15: 549–558.

65. Suliga E, Ciesla E, Głuszek-Osuch M, et al. The usefulness of anthropometric indices to identify the risk of metabolic syndrome.

Nutrients 2019; 11: 2598.

66. Stevens J, Katz EG and Huxley RR. Associations between gender, age and waist circumference. Eur J Clin Nutr 2010; 64: 6–15.

67. Romero-Saldaña M, Tauler P, Vaquero-Abellán M, et al. Validation of a non-invasive method for the early detection of metabolic

syndrome: a diagnostic accuracy test in a working population. BMJ Open 2018; 8: e020476.

68. Tian Y, Hothorn T, Li C, et al. An empirical comparison of two novel transformation models. Stat Med 2020; 39: 562–576.

69. Wang Y, Tian L. The equivalence between Mann-Whitney Wilcoxon test and score test based on the proportional odds model for

ordinal responses. In 4th International Conference on Industrial Economics System and Industrial Security Engineering (IEIS).

Kyoto, Japan: IEEE, pp. 1–5.

70. Hampel FR. The influence curve and its role in robust estimation. J Am Stat Assoc 1974; 69: 383–393.

71. McLain AC, Ghosh SK. Efﬁcient sieve maximum likelihood estimation of time-transformation models. J Stat Theory Pract 2013;

7: 285–303.

72. Kook L, Hothorn T. Regularized transformation models: the tramnet package. R J 2021; 13: 581–594.

73. Tamási B, Hothorn T. tramME: mixed-effects transformation models using template model builder. R J 2021; 13: 581–594.

74. Hothorn T, Zeileis A. Predictive distribution modeling using transformation forests. J Comput Graph Stat 2021; 30: 1181–1196.

Distribution-Free Location-Scale Regression

Article

Apr 2023

We introduce a generalized additive model for location, scale, and shape (GAMLSS) next of kin aiming at distribution-free and parsimonious regression modelling for arbitrary outcomes. We replace the strict parametric distribution formulating such a model by a transformation function, which in turn is estimated from data. Doing so not only makes the model distribution-free but also allows to limit the number of linear or smooth model terms to a pair of location-scale predictor functions. We derive the likelihood for continuous, discrete, and randomly censored observations, along with corresponding score functions. A plethora of existing algorithms is leveraged for model estimation, including constrained maximum-likelihood, the original GAMLSS algorithm, and transformation trees. Parameter interpretability in the resulting models is closely connected to model selection. We propose the application of a novel best subset selection procedure to achieve especially simple ways of interpretation. All techniques are motivated and illustrated by a collection of applications from different domains, including crossing and partial proportional hazards, complex count regression, non-linear ordinal regression, and growth curves. All analyses are reproducible with the help of the tram add-on package to the R system for statistical computing and graphics.

Robust and flexible inference for the covariate‐specific receiver operating characteristic curve

Article

Sep 2021

Diagnostic tests are of critical importance in health care and medical research. Motivated by the impact that atypical and outlying test outcomes might have on the assessment of the discriminatory ability of a diagnostic test, we develop a robust and flexible model for conducting inference about the covariate-specific receiver operating characteristic (ROC) curve that safeguards against outlying test results while also accommodating possible nonlinear effects of the covariates. Specifically, we postulate a location-scale regression model for the test outcomes in both the diseased and nondiseased populations, combining additive regression B-splines and M-estimation for the regression function, while the distribution of the error term is estimated via a weighted empirical distribution function of the standardized residuals. The results of the simulation study show that our approach successfully recovers the true covariate-specific area under the ROC curve in a variety of conceivable test outcome contamination scenarios. Our method is applied to a dataset derived from a prostate cancer study where we seek to assess the ability of the Prostate Health Index to discriminate between men with and without Gleason 7 or above prostate cancer, and if and how such discriminatory capacity changes with age.

The Statistical Evaluation of Medical Tests for Classification and Prediction

Book

Oct 2023

Margaret Sullivan Pepe

The use of clinical and laboratory information to detect conditions and predict patient outcomes is a mainstay of medical practice. Modern biotechnology offers increasing potential to develop sophisticated tests for these purposes. This book describes the statistical concepts and techniques for evaluating the accuracy of medical tests. Worked examples include applications to cancer biomarker studies, prospective disease screening studies, diagnostic radiology studies and audiology testing studies. The statistical methodology can be broadly applied for evaluating classifiers and to problems beyond medical settings. Several measures for quantifying test accuracy are described including the Receiver Operating Characteristic Curve. Pepe presents statistical procedures for the estimation and comparison of those measures among tests. Regression frameworks for assessing factors that influence test accuracy and for comparing tests while adjusting for such factors are presented. The sequence of research steps involved in the development of a test is considered in some detail. Sample size calculations and other issues pertinent to study design are described for tests at various phases of development. In addition, the impacts of missing data and imperfect reference data are addressed. These problems often occur in practice, and modern statistical procedures for dealing with them are discussed. Additional topics that are covered include: meta-analysis for summarizing the results of multiple studies of a test; the evaluation of markers for predicting event time data; and procedures for combining the results of multiple tests to improve classification. This book should be of interest to quantitative researchers and practicing statisticians. The book also covers the theoretical foundations for statistical inference and should therefore be of interest to academic statisticians including those involved in statistical methodological research in this field.

Parametric Statistical Inference

Book

Oct 2023

J K Lindsey

Inference involves drawing conclusions about some general phenomenon from limited empirical observations in the face of random variability. In a scientific context, the general must include the completely unforeseen if all possibilities are to be considered. Many of the statistical models most used to describe such phenomena belong to one of a small number of families--the exponential, transformation, and stable families. In the past 25 years, the likelihood function has been recognized as the fundamental element of approach to drawing scientific conclusions. This book brings together for the first time these two components of statistics as the central themes of statistical inference. Chapters focus on model building, approximations, and examples. There are also appendices on the elements of measure theory, probability theory, and numerical methods. The discussions are appropriate for students of mathematical statistics.

Resilience Family of Receiver Operating Characteristic Curves

Article

Jan 2022

Ruhul Khan

A new semiparametric model of the receiver operating characteristic (ROC) curve based on the resilience family or proportional reversed hazard family is proposed, which is an alternative to the existing models. The resulting ROC curve and its summary indices (such as area under the curve and Youden index) have simple analytic forms. The partial likelihood method is applied to estimate the ROC curve. Moreover, the estimation methodologies of the resilience family of the ROC curve have been developed based on area under the curve estimators exploiting Mann–Whitney statistics and the Rojo approach. A simulation study has been carried out to assess the performance of all considered estimators. Real data from the American National Health and Nutrition Examination Survey has been analyzed in detail based on the proposed model and the usual binormal model prevalent in the literature. Real data in the context of brain injury-related biomarkers are also analyzed in order to compare our model with the Lehmann family of the ROC curves. Finally, we show that the proposed model may be applicable in the misspecification scenario through the Duchenne muscular dystrophy data.

tramME: Mixed-Effects Transformation Models Using Template Model Builder

Article

Jan 2022

Linear transformation models constitute a general family of parametric regression models for discrete and continuous responses. To accommodate correlated responses, the model is extended by incorporating mixed effects. This article presents the R package tramME, which builds on existing implementations of transformation models (mlt and tram packages) as well as Laplace approximation and automatic differentiation (using the TMB package), to calculate estimates and perform likelihood inference in mixed-effects transformation models. The resulting framework can be readily applied to a wide range of regression problems with grouped data structures.

About the use of the overlap coefficient in the binary classification context

Article

Feb 2022

Pablo Martínez-Camblor

The overlap coefficient (OVL) measures the common area between two or more density functions. It has been used for measuring the similarity between distributions in different research fields including astronomy, economics and sociology, among others. Recently, different authors have studied the use of the OVL coefficient in the binary classification problem. They argue that, in particular settings, it could provide a better accuracy measure than other established indices. We prove here that the OVL coefficient does not provide additional information to the Youden index and that the potential advantages previously reported are based on the assumption that the classification rules underlying any classification process always assign more probability of being positive to the larger values of the marker. Particularly, we prove that, for a fixed continuous marker, the OVL coefficient is equivalent to the Youden index associated with the optimal classification rules based on this marker. We illustrate the problem by studying the capacity of the white blood cell count to identify the type of disease in patients having either acute viral meningitis or acute bacterial meningitis.

A family of estimators to diagnostic accuracy when candidate tests are subject to detection limits—Application to diagnosing early stage Alzheimer disease

Article

Jan 2022

In disease diagnosis, individuals are usually assumed to be one of the two basic types, healthy or diseased, as typically based on an established gold standard. Candidate markers for diagnosing a disease often are much cheaper and less invasive than the gold standard but must be evaluated against the gold standard for their sensitivity and specificity to accurately diagnose the disease. When candidate diagnostic markers are fully measured, receiver operating characteristic curves have been the standard approaches for assessing diagnostic accuracy. However, full measurements of diagnostic markers may not be available above or below certain limits due to various practical and technical limitations. For example, in the diagnosis of Alzheimer disease using cerebrospinal fluid biomarkers, the Roche Elecsys® immunoassays have a measuring range for multiple cerebrospinal fluid molecular concentrations. Many cognitive tests used in diagnosing dementia due to Alzheimer disease are also subject to detection limits, often referred to as the floor and ceiling effects in the neuropsychological literature. We propose a new statistical methodology for estimating the diagnostic accuracy when a diagnostic marker is subject to detection limits by dividing the entire study sample into two sub-samples by a threshold of the diagnostic marker. We then propose a family of estimators to the area under the receiver operating characteristic curve by combining a conditional nonparametric estimator and another conditional semi-parametric estimator derived from Cox's proportional hazards model. We derive the variance to the proposed estimators, and further, assess the performance of the proposed estimators as a function of possible thresholds through an extensive simulation study, and recommend the optimum thresholds. Finally, we apply the proposed methodology to assess the ability of several cerebrospinal fluid biomarkers and cognitive tests in diagnosing early stage Alzheimer disease dementia.

Regularized Transformation Models: The tramnet Package

Article

Jan 2021

The tramnet package implements regularized linear transformation models by combining the flexible class of transformation models from tram with constrained convex optimization implemented in CVXR. Regularized transformation models unify many existing and novel regularized regression models under one theoretical and computational framework. Regularization strategies implemented for transformation models in tramnet include the Lasso, ridge regression, and the elastic net and follow the parameterization in glmnet. Several functionalities for optimizing the hyperparameters, including model-based optimization based on the mlrMBO package, are implemented. A multitude of S3 methods is deployed for visualization, handling, and simulation purposes. This work aims at illustrating all facets of tramnet in realistic settings and comparing regularized transformation models with existing implementations of similar models.

Inference on the overlap coefficient: The binormal approach and alternatives

Article

Oct 2021

The overlap coefficient (OVL) measures the similarity between two distributions through the overlapping area of their distribution functions. Given its intuitive description and ease of visual representation by the straightforward depiction of the amount of overlap between the two corresponding histograms based on samples of measurements from each one of the two distributions, the development of accurate methods for confidence interval construction can be useful for applied researchers. The overlap coefficient has received scant attention in the literature since it lacks readily available software for its implementation, while inferential procedures that can cover the whole range of distributional scenarios for the two underlying distributions are missing. Such methods, both parametric and non-parametric, are developed in this article, while R code is provided for their implementation. Parametric approaches based on the binormal model show better performance and are appropriate for use in a wide range of distributional scenarios. Methods are assessed through a large simulation study and are illustrated using a dataset from a study on human immunodeficiency virus-related cognitive function assessment.
