Journal of Biopharmaceutical Statistics, 16: 429–441, 2006
Copyright © Taylor & Francis Group, LLC
ISSN: 1054-3406 print/1520-5711 online
DOI: 10.1080/10543400600719251
STATISTICAL CONSIDERATIONS
FOR NONINFERIORITY/EQUIVALENCE TRIALS
IN VACCINE DEVELOPMENT
W. W. B. Wang, D. V. Mehrotra, I. S. F. Chan, and J. F. Heyse
Clinical Biostatistics, Merck Research Laboratories, North Wales,
Pennsylvania, USA
Noninferiority/equivalence designs are often used in vaccine clinical trials. The goal of
these designs is to demonstrate that a new vaccine, or new formulation or regimen of
an existing vaccine, is similar in terms of effectiveness to the existing vaccine, while
offering such advantages as easier manufacturing, easier administration, lower cost, or
improved safety profile. These noninferiority/equivalence designs are particularly useful
in four common types of immunogenicity trials: vaccine bridging trials, combination
vaccine trials, vaccine concomitant use trials, and vaccine consistency lot trials. In this
paper, we give an overview of the key statistical issues and recent developments for
noninferiority/equivalence vaccine trials. Specifically, we cover the following topics: (i)
selection of study endpoints; (ii) formulation of the null and alternative hypotheses;
(iii) determination of the noninferiority/equivalence margin; (iv) selection of efficient
statistical methods for the statistical analysis of noninferiority/equivalence vaccine
trials, with particular emphasis on adjustment for stratification factors and missing pre-
or post-vaccination data; and (v) the calculation of sample size and power.
Key Words: Equivalence; Minimum risk weights; Missing data; Noninferiority; Stratified analysis;
Vaccine clinical trial.
1. INTRODUCTION
The usual goal of vaccination is to simulate an in vivo pathogen-specific
exposure that triggers the host’s immune system to generate a pool of effector
and memory B or T cells that will protect against potential real exposures in the
future. The simulation is accomplished via inoculation of the host by a vaccine that
contains either a live, attenuated version of the pathogen, or a DNA plasmid or
viral vector that encodes relevant genes of the pathogen to help elicit a cell-mediated
immune response, and so on.
The immunogenicity of a new vaccine is studied in the early stages of
development to assess whether the vaccine can induce quantifiable levels of
pathogen-specific immune responses. In later stages, the goal is to assess whether
the immune marker used to quantify vaccine immunogenicity qualifies as a correlate
Received and Accepted March 13, 2006
Address correspondence to William W.B. Wang, Clinical Biostatistics, Merck Research
Laboratories, UG-1CD, P.O. Box 1000, North Wales, Pennsylvania 19454 USA; E-mail:
William_Wang@Merck.Com
of (or surrogate for) disease protection. Once a correlate for disease protection has
been established, immunogenicity trials with noninferiority/equivalence designs are
widely used as economic and time-efficient alternatives to large efficacy trials in
evaluating the effectiveness of a new or reformulated vaccine as compared to already
licensed vaccines (Chan et al., 2003; Horne, 1995; Mehrotra, in press).
Noninferiority/equivalence designs are particularly useful in four common
types of vaccine immunogenicity trials: vaccine bridging trials, combination vaccine
trials, vaccine concomitant use trials, and vaccine consistency lot trials. Each of
these trials serves a different purpose. For example, vaccine bridging trials are
used because manufacturing processes or storage conditions may be changed after
vaccine licensure to improve production yields and vaccine stability/shelf life.
Vaccine bridging trials are often required to demonstrate that such changes do
not have an adverse impact on vaccine effectiveness, by ruling out a clinically
significant difference in immune responses between the modified vaccine and the
current vaccine.
Combination vaccine trials are typically used to rule out clinically significant
differences in immune responses between a combined vaccine and separate but
simultaneously administered vaccines (Blackwelder, 1995; FDA, 1997; Horne et al.,
2001). A combination vaccine is intended to prevent multiple diseases or to prevent
one disease caused by different strains or serotypes of the same organism while
reducing the number of injections required (Chan et al., 2003). Similarly, since
the concomitant administration of multiple vaccines can reduce the number of
vaccination visits, vaccine concomitant use trials are used to rule out clinically
significant differences in immune responses between the concomitant administration
of two or more vaccines and the separate administration of each vaccine.
Finally, since vaccines are biological products that are not as stable and
well characterized as chemical drug products, vaccine consistency lot trials are
required to study multiple (typically, three) lots of vaccines made from the same
manufacturing process (called consistency lots). The goal is to rule out a clinically
significant difference in immunogenicity in either direction between any two pairs
of consistency lots.
The goal of this article is to provide an overview of key statistical
issues and recent developments for noninferiority/equivalence vaccine trials.
Some of the issues and methodologies that we discuss are also applicable to
drug noninferiority/equivalence trials. However, there are noteworthy differences
between drug and vaccine trials; where appropriate, we will draw attention to
such differences. The rest of the paper is organized as follows. In Section 2, we
describe typical immunogenicity endpoints and noninferiority/equivalence margins
used in practice, and discuss the formulation of the null and alternative hypotheses.
In Section 3, we discuss statistical analysis methods, including novel approaches
for dealing with stratification and missing data either at pre- or post-vaccination.
We discuss sample size and power considerations in Section 4, and offer some
concluding remarks in Section 5.
2. ENDPOINTS, HYPOTHESES, AND MARGIN SELECTION
The selection of immunogenicity endpoints, the formulation of study
hypotheses, and the choice of the noninferiority/equivalence margin should be
based on sound clinical, regulatory, and statistical judgments. A commonly used
immunologic endpoint is the immune response rate, defined as the percentage of
subjects who achieve a certain level of immune response after vaccination (yes/no).
If a particular level of immune response (either based on an absolute cutoff or a
fold rise from baseline) has been shown to be correlated with disease protection,
the percentage of participants achieving this “protective level” after vaccination is
usually considered the primary endpoint for immunogenicity analyses. However,
this definition is susceptible to potential issues in assay stability and consistency
across subjects, study sites, and assay-performing laboratories. Accordingly, another
commonly used immunologic endpoint is the geometric mean concentration (GMC)
of immune response after vaccination or the geometric mean fold rise (GMFR)
from pre-vaccination to post-vaccination. In the absence of a reasonable protective
level or a good correlate of protection, the GMC or GMFR is used as the primary
immunogenicity endpoint. In addition, as noted by Plikaytis and Carlone (2004),
the selection of endpoints can depend on the type of vaccine, such as T-cell
independent versus T-cell dependent, and the kinetic curve of immune responses
after vaccination.
The primary objective of a vaccine noninferiority/equivalence trial is to
demonstrate that a new or modified vaccine is noninferior or equivalent to the
current vaccine by ruling out a prespecified clinically relevant difference in the
immune response. Accordingly, in a noninferiority immunogenicity trial without
stratification, the null and alternative hypotheses are generally set up as
H0: δ ≤ −δ0  versus  H1: δ > −δ0   (1)

In this hypothesis setup, depending on the endpoint being used, δ = PT − PC or δ = log(GMCT) − log(GMCC), where PT and PC are the population immune response rates, and GMCT and GMCC are the population geometric means of the immune response (either in terms of concentration or in terms of fold rise) after vaccination, for the new vaccine and the control vaccine groups, respectively. δ0 is a prespecified small positive quantity defining the noninferiority margin. The nominal significance level for this one-sided test (usually one-sided α = 0.025) is generally set at half of the conventional significance level for a two-sided test for a difference in proportions (Schuirmann, 1987). This approach has been adopted in regulatory environments, as suggested in the International Conference on Harmonization E9 (ICH E9) guidelines (1999).
In vaccine equivalence trials, such as consistency-lot studies, the objective is to
show that the two vaccines are similar by ruling out a clinically significant difference
in either direction. Such studies are typically designed as a two-sided (at the α = 0.05 level) equivalence trial with respect to both immune response rates and GMCs. For such trials, the hypothesis in Eq. (1) is replaced with H0: δ ≤ −δ0 or δ ≥ δ0 versus H1: −δ0 < δ < δ0, and the approach of two one-sided tests (Schuirmann, 1987) is recommended by the ICH E9 guidelines (1999).
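As a sketch of the two one-sided tests procedure on the log scale, the following Python function checks both one-sided z-tests and the equivalent (1 − 2α) confidence-interval criterion. All numeric inputs in the example are hypothetical, and the standard error is supplied directly rather than estimated from trial data:

```python
import math
from statistics import NormalDist

def tost_log_gmc(diff_log_means, se, delta0, alpha=0.05):
    """Two one-sided tests (TOST) for equivalence of log-GMCs.

    H0: delta <= -delta0 or delta >= delta0 is rejected (equivalence is
    concluded) only if BOTH one-sided z-tests reject at level alpha."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    z_lower = (diff_log_means + delta0) / se   # tests H0: delta <= -delta0
    z_upper = (delta0 - diff_log_means) / se   # tests H0: delta >= delta0
    equivalent = (z_lower >= z_crit) and (z_upper >= z_crit)
    # Equivalent criterion: the two-sided (1 - 2*alpha) CI for delta
    # lies entirely inside (-delta0, delta0).
    half_width = z_crit * se
    ci = (diff_log_means - half_width, diff_log_means + half_width)
    return equivalent, ci

# Hypothetical example: observed GMC ratio 1.05, a 1.5-fold margin
# (delta0 = log 1.5), and SE of the log-difference equal to 0.08.
eq, ci = tost_log_gmc(math.log(1.05), 0.08, math.log(1.5), alpha=0.05)
```

With these inputs both one-sided tests reject, so equivalence would be concluded; inflating the standard error to 0.30 makes the lower test fail.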
With respect to the noninferiority margin, the choice of δ0 should justify that
the new vaccine preserves a large proportion of the effectiveness of the control
vaccine relative to a placebo. Although a noninferiority margin that preserves 50%
of the treatment effect has been proposed for the evaluation of drug treatments
(Ebbutt and Frith, 1999; Temple, 1996), there is a general perception that a
narrower margin should be used in preventive vaccine trials because vaccines
are given to potentially millions of healthy individuals for prophylaxis. The
noninferiority or equivalence margin for vaccine immunogenicity should depend
on the level of the correlation between immune responses and the vaccine efficacy,
the variability of immunogenicity responses, the class of vaccine being tested, and
the relative importance of the immunogenicity endpoints in the given trial. For
example, in studies of the hepatitis A vaccine VAQTA®, the immune response rates in terms of seroconversion in the VAQTA® vaccine groups are usually greater than 90% (with 0% in the placebo group) and are highly correlated with protection. A margin that preserves only half of that response rate would clearly be inadequate; instead, a δ0 of 10 percentage points has been used as the noninferiority margin (Frey et al., 1999). For comparing GMCs, a δ0 of log(0.67) or log(0.5) (corresponding to a 1.5- or 2-fold difference) has been used. The noninferiority margin should be discussed proactively
between the trial sponsor and the regulatory agencies. Temple (1996), Ebbutt and
Frith (1999), and both the ICH E9 (1999) and ICH E10 guidelines (2000) and the
EMEA guideline (2005) include some more general discussions on the choice of
noninferiority and equivalence margins.
3. STATISTICAL METHODS FOR VACCINE
NONINFERIORITY/EQUIVALENCE TRIALS
The analysis of noninferiority/equivalence trials is generally based on the dual use of hypothesis testing and test-based confidence intervals; both p-values and confidence intervals are reported in most trials. Operationally, rejection of the null hypothesis in Eq. (1) at the one-sided α level is equivalent to the lower bound of the two-sided (1 − 2α) confidence interval for δ being greater than −δ0.
Asymptotic statistical tests of hypothesis in Eq. (1) and corresponding test-
based confidence intervals for two treatment groups with a dichotomous endpoint
have been extensively discussed in the literature in assessing a vaccine effect based
on the difference of two immune response rates. Many authors have proposed
Z-type test statistics with different standard error estimates (Blackwelder, 1982;
Miettinen and Nurminen, 1985). The more commonly used and better performing of these is the Z-type statistic, shown below, proposed by Miettinen and Nurminen (1985), Farrington and Manning (1990), and Chan and Zhang (1999):

ZD = (P̂T − P̂C + δ0) / σ̃0   (2)

where

σ̃0 = [(1/NT) P̃T(1 − P̃T) + (1/NC) P̃C(1 − P̃C)]^{1/2}   (3)

and P̃T, P̃C are the constrained maximum likelihood estimates (MLEs) of (PT, PC) under the null hypothesis given in Eq. (1), based on the observed responses P̂T, P̂C. (Detailed expressions for P̃T, P̃C are presented by Miettinen and Nurminen (1985), as well as by Farrington and Manning (1990).) Noninferiority is established
(H0 is rejected) at the one-sided significance level α if ZD ≥ Zα, where Zα is the 100(1 − α) percentile of the standard normal distribution.
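The constrained MLEs have a closed-form (cubic) solution given by Miettinen and Nurminen (1985) and Farrington and Manning (1990); as an illustrative sketch only, the Python code below instead finds them by a fine grid search over the constraint pT = pC − δ0 and then forms ZD as in Eqs. (2)–(3). The data in the usage example are hypothetical:

```python
import math

def constrained_mle(xT, nT, xC, nC, delta0, grid=20000):
    """Grid-search approximation to the constrained MLEs (pT~, pC~) that
    maximize the binomial log-likelihood subject to pT - pC = -delta0
    (the boundary of the null in Eq. (1)). A stand-in for the
    closed-form cubic solution."""
    best, best_ll = None, -math.inf
    lo, hi = max(delta0, 0.0), min(1.0, 1.0 + delta0)
    for k in range(1, grid):
        pC = lo + (hi - lo) * k / grid
        pT = pC - delta0
        if not (0.0 < pT < 1.0 and 0.0 < pC < 1.0):
            continue
        ll = (xT * math.log(pT) + (nT - xT) * math.log(1 - pT)
              + xC * math.log(pC) + (nC - xC) * math.log(1 - pC))
        if ll > best_ll:
            best, best_ll = (pT, pC), ll
    return best

def z_noninferiority(xT, nT, xC, nC, delta0):
    """Z_D of Eq. (2) with the null-restricted variance of Eq. (3)."""
    pT_hat, pC_hat = xT / nT, xC / nC
    pT_t, pC_t = constrained_mle(xT, nT, xC, nC, delta0)
    var0 = pT_t * (1 - pT_t) / nT + pC_t * (1 - pC_t) / nC
    return (pT_hat - pC_hat + delta0) / math.sqrt(var0)
```

For example, with 92/100 responders in each group and δ0 = 0.10, ZD is about 2.37, which exceeds Z0.025 = 1.96, so noninferiority would be concluded at the one-sided 0.025 level.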
The method in the preceding paragraph works well with large
sample sizes. With “small” sample sizes (say n<50 per group), exact (rather than
asymptotic theory-based) tests and corresponding confidence intervals (Agresti and
Min, 2001; Chan, 1998, 2002, 2003; Chan and Zhang, 1999; Mehrotra et al., 2003)
are preferable. However, as noted by Mehrotra et al. (2003), the test statistic for
exact inference should be carefully chosen to avoid a loss in power. For example,
the only option in the StatXact-4 software package for calculating exact confidence
intervals used the numerator of ZDas the test statistic; the power based on this
unstandardized statistic can be substantially lower than the power based on ZD.
Fortunately, StatXact-5 includes other options for calculating exact unconditional
test-based confidence intervals, including Chan and Zhang’s (1999) method based
on inverting two one-sided tests using the ZDstatistic, and the Agresti and Min’s
(2001) method of inverting one two-sided test. It should be noted that with relatively
large sample sizes (say, n>200 per group), the exact and corresponding asymptotic
methods usually yield very similar results, so use of the former is not necessary.
Of note, since noninferiority trials focus on one-sided (α level) hypothesis tests, we recommend that the corresponding exact confidence intervals (usually at the two-sided 1 − 2α level) be obtained by inverting two one-sided tests to ensure consistency of inference, in the sense that rejection of the noninferiority null hypothesis in Eq. (1) (p-value ≤ α) is equivalent to the lower bound of the two-sided (1 − 2α) CI on the difference being greater than −δ0 (Chan, 2003; Chan and Mehrotra, 2003). The confidence interval obtained by inverting one two-sided test (Agresti and Min, 2001), although generally narrower than the confidence interval obtained by inverting two one-sided tests, does not guarantee control of the error rate on each side at the α level; hence, it may produce results that are inconsistent with the one-sided noninferiority hypothesis test (Chan, 2003).
In assessing a vaccine effect based on a fold difference in GMCs between
the vaccine and control groups, the statistical testing for hypothesis in Eq. (1)
and the corresponding confidence interval estimation are often performed by using
an analysis of variance (ANOVA), an analysis of covariance (ANCOVA), or a
linear mixed-effects model that includes the natural log of immune responses as the
dependent variable, and the treatment group, baseline, and stratification factors (if
any) as the explanatory variables.
Statistical Methods for Stratified Trials
If it is known a priori that certain prognostic factors, such as the subject’s age,
gender, or pre-vaccination immune status, will influence the response to vaccination,
then the strategy of pre-stratification by such factors is often used in clinical trials
to facilitate unbiased and more efficient comparisons of treatment groups. For a
stratified trial, as noted by Mehrotra (2002), it is important to be clear about what
hypothesis is being tested; see also Gail et al. (1996) and Ganju and Mehrotra
(2003). For a noninferiority vaccine trial, the hypothesis that is usually of interest is
H0: δ = Σᵢ fᵢδᵢ ≤ −δ0  versus  H1: δ = Σᵢ fᵢδᵢ > −δ0   (4)

where the sums run over the s strata, δᵢ is the true difference in response rates (or means) for stratum i, and fᵢ is the fraction of subjects in the target population that belong to stratum i (Σᵢ fᵢ = 1).
A test of the hypothesis in Eq. (4) is typically conducted using the following statistic:

Zw = (δ̂w + δ0) / [V0(δ̂w)]^{1/2}   (5)

In Eq. (5), δ̂w = Σᵢ wᵢδ̂ᵢ = Σᵢ wᵢ(P̂iT − P̂iC), and wᵢ is the weight assigned to the ith stratum (Σᵢ wᵢ = 1). There are two options for the denominator in Eq. (5). The first option is to use the null variance V0(δ̂w) = Σᵢ wᵢ² V0(δ̂ᵢ), where V0(δ̂ᵢ) = (1/NiT) P̃iT(1 − P̃iT) + (1/NiC) P̃iC(1 − P̃iC), with P̃iT, P̃iC as before. The second option is to replace V0(δ̂ᵢ) with the observed variance (Blackwelder, 1995; Dunnett and Gent, 1977), p̂iC(1 − p̂iC)/niC + p̂iT(1 − p̂iT)/niT. In either case, the null hypothesis is rejected at the one-sided α level if Zw > Zα.
For the weights, wᵢ, two popular choices are the Cochran-Mantel-Haenszel (CMH) weights and the precision or inverse-variance (INVAR) weights (Mehrotra and Railkar, 2000). The CMH weights are proportional to the harmonic means of the observed stratum-specific sample sizes, while the INVAR weights are proportional to the reciprocals of the observed variances of the stratum-specific differences. The CMH weights are optimal (in terms of power) if the true odds ratios, pᵢT(1 − pᵢC)/[pᵢC(1 − pᵢT)], are constant across strata, while the INVAR weights are optimal if the δᵢ are constant. Unfortunately, in practice, we rarely know whether it is the true stratum-specific odds ratios or the δᵢ that are “closer” to being constant across strata, so the choice between CMH and INVAR is essentially a gamble. To help minimize the potential loss in power that might be incurred by gambling unfavorably, Mehrotra and Railkar (2000) proposed a “minimum risk” (MR) weighting strategy, which yields an estimate of δ that has the smallest asymptotic mean squared error, and offers power advantages as well.
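A minimal sketch of the stratified statistic in Eq. (5) with the observed-variance denominator, supporting CMH and INVAR weights (the MR weights, whose exact form is given in Mehrotra and Railkar (2000), are omitted here). The stratum counts in the example are hypothetical:

```python
import math

def stratified_z(strata, delta0, weights="cmh"):
    """Stratified noninferiority Z of Eq. (5), observed-variance version.
    Each stratum is a tuple (xT, nT, xC, nC). CMH weights are
    proportional to harmonic means of the per-stratum sample sizes;
    INVAR weights to reciprocals of the observed variances."""
    if weights not in ("cmh", "invar"):
        raise ValueError("weights must be 'cmh' or 'invar'")
    diffs, variances, w = [], [], []
    for xT, nT, xC, nC in strata:
        pT, pC = xT / nT, xC / nC
        diffs.append(pT - pC)
        variances.append(pT * (1 - pT) / nT + pC * (1 - pC) / nC)
        if weights == "cmh":
            w.append(nT * nC / (nT + nC))   # harmonic-mean weight
    if weights == "invar":
        w = [1.0 / v for v in variances]
    total = sum(w)
    w = [wi / total for wi in w]            # normalize so sum(w) = 1
    d_w = sum(wi * di for wi, di in zip(w, diffs))
    v_w = sum(wi ** 2 * vi for wi, vi in zip(w, variances))
    return (d_w + delta0) / math.sqrt(v_w)
```

For two hypothetical strata, `stratified_z([(45, 50, 46, 50), (88, 100, 90, 100)], 0.10)` gives a Z value above 1.96 under either weighting.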
The MR weighting strategy can be particularly useful for stratified
noninferiority vaccine trials. To illustrate, Table 1 contains simulation results
comparing the null variance (MN) method with CMH weights, and the observed
variance method with CMH or MR weights for a stratified analysis of a difference
in two proportions. In more general settings, it has been shown that using the null
variance is typically better than using the observed variance in terms of power,
Type I error rate control, and confidence interval coverage (Dunnett and Gent,
1977; Miettinen and Nurminen, 1985). However, for stratified noninferiority trials
involving vaccines with high response rates (>90%), the results in Table 1 suggest
that using the observed variance along with MR rather than CMH weights can yield
notable gains in power. Intuitively, this is caused by the relatively large separation
of P̃iT and P̃iC under the constraint P̃iT − P̃iC = −δ0, which in turn leads to a
relatively large estimated null variance as compared with the observed variance. Of
note, the same result holds for unstratified trials, or for stratified trials in which
stratification is ignored at the time of analysis and an unstratified analysis is used;
the latter is not recommended because it can result in a loss in power (Mehrotra
and Railkar, 2000; Mehrotra, 2001).
Table 1 Comparison of using null variances with CMH weights versus using observed variances with either CMH or MR weights

      Stratum 1      Stratum 2                           Methods
   p1T    p1C     p2T    p2C      δ      δ0   N/trt   CMH_OBS  CMH_MN  MR_OBS

Empirical Type I error rates (%) (nominal = 2.5%, 1-sided)
   .35    .50     .50    .65   −0.15   0.15    130      2.5     2.7     2.4
   .60    .70     .80    .90   −0.10   0.10    200      2.1     2.3     2.4
   .85    .90     .90    .95   −0.05   0.05    350      2.2     2.2     2.3

Empirical power (%)
   .50    .50     .80    .80    0      0.15    130     79.1    80.1    80.2
   .70    .70     .90    .90    0      0.10    200     79.0    78.8    83.3
   .90    .90     .95    .95    0      0.05    350     75.2    74.0    77.9

(f1 = 0.3, f2 = 0.7); 10,000 simulations;
CMH_OBS = CMH weights with observed variance;
CMH_MN = CMH weights with null variance (MN method);
MR_OBS = minimum risk weights with observed variance.
Another important issue in stratified data analysis for noninferiority trials is
the test of treatment by stratum interaction. Gail and Simon (1985) used the terms
quantitative and qualitative interactions and proposed likelihood-based testing
procedures. Pan and Wolfe (1997) generalized their methodology to interactions
with clinical significance, that is, clinically meaningful interaction. For quantitative
interaction with binary responses, Mehrotra (2002) proposed a new test which, in a comprehensive simulation study by Mehrotra and Chan (2000), controlled the Type I error rate and was generally at least as powerful as several other published
methods. Wiens and Heyse (2003) presented and compared five different analysis
strategies to test for qualitative treatment-by-stratum interaction in noninferiority
trials.
Statistical Methods in Handling Missing Immunogenicity Data
Most vaccine regimens include a sequence of one or more “priming”
inoculations followed by a “booster” shot later, if necessary. Blood samples
are collected at one or more time points after each inoculation and assayed
for immune activity. Missing data and losses to follow-up do occur in vaccine
noninferiority/equivalence (and superiority) trials. This situation is similar to the
incomplete longitudinal data problem for drug trials. However, there are two key
differences. First, while the missing data resulting from dropouts in vaccine trials
are typically missing completely at random (MCAR), they are more likely to be
either missing at random (MAR) or non-ignorably missing (NM) for drug trials.
The reason is that patients often drop out of drug trials because they are not responding favorably to their assigned treatment (e.g., high blood pressure that is not declining); this mechanism is generally not applicable to vaccine trials. The second
key difference is that the ability to predict or impute the missing data at, say,
the post-boost visit of interest may be better for vaccine trials compared with
drug trials. This happens because subjects in vaccine trials are inherently less
heterogeneous than patients in drug trials. Moreover, for some (but not all) vaccines,
the post-prime responses are highly correlated with (and hence predictive of) the
post-boost responses.
A common source of missing data in vaccine immunogenicity trials is
mishaps or errors in blood sample storage, sample handling, or assay testing for
immunogenicity measurements. In such trials, missing data can occur at baseline
(pre-vaccination) as well as at post-vaccination. This type of missing data can complicate
the data analysis if the baseline immune status is to be used as a covariate, as, for
example, in an ANCOVA model for continuous responses with the baseline value
as the covariate. A simple and common way to tackle this missing immunogenicity
data problem is to use a “complete case” (CC) analysis which excludes subjects with
missing pre- or post-vaccination data. Although this approach is unbiased under
MCAR, it is inefficient because it fails to utilize the information of the excluded
subjects. A better alternative is to use principled methods for longitudinal data
analysis such as “restricted maximum likelihood” (REML), “generalized estimating
equations” (GEE), or “multiple imputation” (MI), all of which are readily available
in standard software. The gains in efficiency of the latter approaches over the
complete case analysis can be significant when the amount of missing data is
large. For example, there could be a substantial amount of missing post-boost
immunogenicity data in the case of an interim analysis of an ongoing trial when the
later enrolled subjects have received priming inoculations but not yet been boosted.
If the post-prime information from these subjects is not included in the interim
analysis, a large amount of available information is ignored. As a result, it is preferable to use an analytical approach that incorporates all of the data.
Li et al. (in press) proposed a propensity score-based multiple imputation (MI)
method to tackle missing data in longitudinal clinical trials with binary responses,
with particular emphasis on vaccine immunogenicity trials. They noted three key
results. First, if data are missing completely at random, MI can be notably more
efficient than the CC and GEE methods. Second, with small samples, GEE often
fails because of “convergence problems,” but MI is free of those problems. Finally,
if the data are missing at random, MI generally yields results with negligible bias,
while the CC and GEE methods yield results with moderate to large bias.
Wang et al. (2003) applied a longitudinal regression approach with a
“restricted” linear model structure (Liang and Zeger, 2000) to analyze vaccine
immunogenicity data; the restriction is that the population means for all treatment
groups are identical at baseline. This model allows for the comparison of
postvaccination immune responses between two vaccination groups, adjusting for
pre-vaccination immunogenicity levels in the presence of incomplete data (missing
either at pre- or post-vaccination). Under the assumption of equal population
means at baseline, which is justified in randomized trials, this longitudinal model
provides an unbiased estimate of the treatment effect while increasing the power for
the noninferiority/equivalence comparison as compared with the “complete case”
analysis. In the case of no missing data, estimates of treatment differences from this
longitudinal model are identical to those from the traditional analysis of covariance
(ANCOVA) model.
4. POWER CALCULATION FOR VACCINE
NONINFERIORITY/EQUIVALENCE TRIALS
There are many methods for calculating sample size and power for vaccine
noninferiority/equivalence trials. Here, we briefly describe the methods based on the
commonly used asymptotic Z-test for noninferiority immunogenicity trials.
For bridging studies that are designed to test the noninferiority hypothesis
in Eq. (1) with respect to immune response rates, a commonly used method is the
sample size formula proposed by Farrington and Manning (1990), which is based
on the Z-type test statistic in Eq. (2). To have 1 − β power to claim noninferiority at the one-sided α level, the approximate sample size for testing the hypothesis in Eq. (1) is

NT = (Zα σ̄0 + Zβ σ1)² / (PT − PC + δ0)²  and  NC = uNT   (6)

where u is the prespecified sample size ratio between the control and the test vaccine groups, PC and PT are the expected immune response rates for the control and test vaccine groups in the planned study, respectively, σ1² = PT(1 − PT) + PC(1 − PC)/u (i.e., NT times the variance in Eq. (3) with the constrained maximum likelihood estimates replaced by PT and PC), and σ̄0² = P̃T(1 − P̃T) + P̃C(1 − P̃C)/u is the corresponding value of σ̃0² obtained when (P̃T, P̃C) are calculated taking (P̂T, P̂C) = (PT, PC). Similarly, the power of a one-sided α level ZD test based on the normal approximation is given by

1 − β = Φ( (1/σ1)[ −Zα σ̄0 + √NT (PT − PC + δ0) ] )   (7)

where Φ is the standard normal cumulative distribution function.
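Eq. (6) can be sketched in a few lines of Python. As in the earlier sketch, the constrained rates are obtained by a grid search rather than the closed-form solution of Farrington and Manning (1990), so the result should be treated as an approximation; the inputs in the example are hypothetical:

```python
import math
from statistics import NormalDist

def fm_sample_size(pT, pC, delta0, alpha=0.025, power=0.9, u=1.0):
    """Approximate sample size for the noninferiority hypothesis in
    Eq. (1), following Eq. (6); NC = u * NT. The null-restricted rates
    are found numerically on the boundary pT~ = pC~ - delta0."""
    z_a = NormalDist().inv_cdf(1 - alpha)
    z_b = NormalDist().inv_cdf(power)
    best, best_ll = None, -math.inf
    lo, hi = max(delta0, 0.0), min(1.0, 1.0 + delta0)
    for k in range(1, 20000):
        qC = lo + (hi - lo) * k / 20000
        qT = qC - delta0
        if not (0.0 < qT < 1.0):
            continue
        # expected log-likelihood, weighting the control arm by u
        ll = (pT * math.log(qT) + (1 - pT) * math.log(1 - qT)
              + u * (pC * math.log(qC) + (1 - pC) * math.log(1 - qC)))
        if ll > best_ll:
            best, best_ll = (qT, qC), ll
    qT, qC = best
    var0 = qT * (1 - qT) + qC * (1 - qC) / u   # sigma_bar_0 squared
    var1 = pT * (1 - pT) + pC * (1 - pC) / u   # sigma_1 squared
    nT = (z_a * math.sqrt(var0) + z_b * math.sqrt(var1)) ** 2 \
         / (pT - pC + delta0) ** 2
    return math.ceil(nT), math.ceil(u * math.ceil(nT))
```

For example, with PT = PC = 0.90, δ0 = 0.10, one-sided α = 0.025, and 90% power, the approximation gives roughly 200 subjects per group; halving the margin roughly quadruples the sample size.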
For bridging studies that are designed to test the noninferiority hypothesis in Eq. (1) with respect to GMCs, the sample size calculation is similar to that for testing the conventional hypothesis of no difference under the log-normal assumption. Let σ be the standard deviation of the log-transformed immune responses. In order to achieve 1 − β power for testing the hypothesis in Eq. (1) at the one-sided α level, the sample size for the test vaccine group is

NT = (1 + 1/u)(Zα + Zβ)² σ² / [log(RGMC) − δ0]²  and  NC = uNT   (8)

where u is as specified earlier, δ0 is the noninferiority margin on the log scale, and RGMC = GMCT/GMCC is the expected ratio of GMCs between the new and control vaccine groups under the alternative hypothesis. The power of a one-sided α level test based on the normal approximation is given by

1 − β = Φ( −Zα + (1/σ)√(NT/(1 + 1/u)) [log(RGMC) − δ0] )   (9)
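Eq. (8) is a closed form and is straightforward to compute; a small sketch with hypothetical inputs:

```python
import math
from statistics import NormalDist

def gmc_sample_size(r_gmc, delta0, sigma, alpha=0.025, power=0.9, u=1.0):
    """Per-group sample size per Eq. (8) for a noninferiority comparison
    of GMCs on the log scale. sigma is the SD of the log-transformed
    titers; delta0 is the margin on the log scale (e.g., log(0.67) for
    a 1.5-fold margin); r_gmc is the assumed true GMC ratio."""
    z_a = NormalDist().inv_cdf(1 - alpha)
    z_b = NormalDist().inv_cdf(power)
    nT = (1 + 1 / u) * (z_a + z_b) ** 2 * sigma ** 2 \
         / (math.log(r_gmc) - delta0) ** 2
    return math.ceil(nT), math.ceil(u * math.ceil(nT))
```

For instance, assuming equal true GMCs (RGMC = 1), a 1.5-fold margin (δ0 = log 0.67), σ = 0.8, one-sided α = 0.025, and 90% power, `gmc_sample_size(1.0, math.log(0.67), 0.8)` yields 84 subjects per group.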
In order to demonstrate the stability of the vaccine manufacturing process,
consistency lot studies are often required by regulatory authorities on vaccine
products to show equivalence among three clinical lots. In such trials, there are
multiple pairwise comparisons and no closed-form formula exists for sample size
calculation, but simulations can be used to adequately plan the sample size. For
power estimation, for example, for hypothesis testing with respect to immune
response rates, Wiens et al. (1996) recommended performing simulations under the
setup that the true immune response rates for the (typically) three lots are P1, P1, and P1 + δ0/2, instead of all being equal. This setup gives a conservative estimate of power when the alternative hypothesis is true (Wiens et al., 1996).
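A simulation along these lines can be sketched as follows, assuming Wald confidence intervals for each pairwise difference and the conservative lot-rate setup above (the actual test used in a given trial may differ; all inputs are hypothetical):

```python
import math
import random
from statistics import NormalDist

def lot_consistency_power(p1, delta0, n, alpha=0.025, sims=2000, seed=1):
    """Simulated power for pairwise equivalence of three consistency
    lots, with true rates (p1, p1, p1 + delta0/2) as a conservative
    setup. Equivalence for a pair is concluded when the two-sided
    (1 - 2*alpha) Wald CI for the rate difference lies inside
    (-delta0, delta0); alpha is the level of each one-sided test."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha)
    rates = (p1, p1, p1 + delta0 / 2)
    wins = 0
    for _ in range(sims):
        # observed response rates for the three lots
        phat = [sum(rng.random() < p for _ in range(n)) / n for p in rates]
        ok = True
        for i in range(3):
            for j in range(i + 1, 3):
                d = phat[i] - phat[j]
                se = math.sqrt(phat[i] * (1 - phat[i]) / n
                               + phat[j] * (1 - phat[j]) / n)
                if not (-delta0 < d - z * se and d + z * se < delta0):
                    ok = False
        wins += ok
    return wins / sims
```

Running the simulation at increasing per-lot sample sizes shows the expected monotone gain in power; the per-comparison losses compound because all three pairwise CIs must fall inside the margin.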
In a combination or multivalent vaccine study, the hypotheses regarding each
component or each serotype are often considered to be co-primary. In planning
such a study, the sample size for each primary hypothesis can be calculated using
the formulas in Eq. (6), Eq. (8), or any other related sample size formulas for
the particular hypothesis in question. To address multiplicity concerns, a popular
approach is to plan the power for each primary hypothesis large enough such that
the overall power is acceptable under the assumption of independence among the
multiple hypotheses. For example, for a multivalent vaccine with k components, one can plan the power for the hypothesis test on each component to be at least 1 − β/k, so that the overall power is guaranteed to be no less than 1 − β. This approach controls the overall Type I error very stringently (much less than the nominal α level if each null is true), but it is rather conservative in that insignificance for just one component implies a failure to reject the overall null hypothesis. This
approach is also conservative (and provides a lower bound for power) because
the correlation between the components is usually positive. Based on extensive
simulations, Kong et al. (2004) pointed out that trial designs with high power
(>80%) under the assumption of independence have only a modest increase in
power when the correlations between the components are taken into consideration.
At this point, there is no commonly accepted testing strategy to overcome the
conservatism of this popular approach, and further research is needed.
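The power-allocation rule above can be verified with a two-line calculation; the numbers below are illustrative:

```python
def overall_power_lower_bound(k, beta=0.10):
    """If each of k independent co-primary hypotheses is powered at
    1 - beta/k, the overall power is (1 - beta/k)**k, which is always
    at least 1 - beta (a Bonferroni-type bound on the Type II error)."""
    per_component = 1 - beta / k
    return per_component, per_component ** k

# e.g., k = 4 components, target overall power 90%:
per_comp, overall = overall_power_lower_bound(k=4, beta=0.10)
# each component powered at 97.5% gives overall power of about 90.4%
```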
5. CONCLUDING REMARKS
Vaccine noninferiority/equivalence trials involve unique statistical issues when
compared with drug trials or other types of vaccine trials (e.g., vaccine efficacy
trials). Unlike drug trials, these vaccine noninferiority trials generally aim to assess
the biological effect of a vaccine on a healthy and inherently less heterogeneous study population. Adherence in these trials is usually high (completion of the vaccination
series if more than one dose is administered) because there tend to be few drop-outs
due to adverse experiences. A per-protocol approach is often considered to be the
primary approach for vaccine noninferiority trials, rather than the intention-to-treat
approach that is used in drug studies.
In contrast with other types of vaccine trials, the conclusions drawn from
noninferiority/equivalence vaccine trials are highly dependent on the identification
of appropriate correlates of protection and on the measurement of immune
responses by properly validated immunogenicity assays. In this paper, we have
focused on parametric inferential methods and related issues for establishing the
equivalence/noninferiority of immune response rates and geometric means of
immune responses. In addition to these parametric analyses, the comparability of
immune responses to vaccinations is commonly illustrated by graphical displays
of the reverse cumulative distribution curves of Pr(X ≥ x). These curves give the percentages of participants with immune responses greater than or equal to varying fixed levels x. Nonparametric methods have also been proposed by Stine
and Heyse (2001) to estimate the overlap or proportion of a similar response in
distributions, which can be used to measure the similarity between two distributions
of immune responses. In addition, it should be noted that, even though this paper
has focused on the most common types of vaccine noninferiority/equivalence trials
where the primary endpoints are immunogenicity related, the statistical techniques
described here can be used for efficacy and safety related endpoints as well
(Chan et al., 2003; Chan, 2003).
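As a minimal sketch of how a reverse cumulative distribution curve is tabulated (the antibody titer values below are hypothetical, and Python with NumPy is assumed), the curve is simply the empirical Pr(X ≥ x) evaluated over a grid of response levels:

```python
import numpy as np

def reverse_cdf(titers, grid):
    """Empirical Pr(X >= x): the fraction of participants whose immune
    response is at or above each level x in the grid."""
    titers = np.asarray(titers, dtype=float)
    return np.array([np.mean(titers >= x) for x in grid])

# hypothetical antibody titers for two vaccine groups
new_vaccine = [64, 128, 128, 256, 512, 512, 1024, 2048]
control = [32, 64, 128, 256, 256, 512, 1024, 1024]

grid = [32, 64, 128, 256, 512, 1024, 2048]
for x, a, b in zip(grid, reverse_cdf(new_vaccine, grid), reverse_cdf(control, grid)):
    print(f"titer >= {x:5d}: new {a:.3f}  control {b:.3f}")
```

Plotting the two curves on the same axes gives the usual visual comparison; curves lying close together across the range of x suggest similar response distributions.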
From a statistical research perspective, vaccine noninferiority/equivalence
trials are a fertile area of investigation. Each of the four aforementioned common
types of noninferiority/equivalence immunogenicity trials offers unique statistical
challenges. Further statistical research is needed in establishing immunogenicity
markers, selecting immunogenicity endpoints, choosing noninferiority margins,
and finding optimal statistical strategies (e.g., possible applications of Bayesian
techniques) to establish noninferiority/equivalence. Due to the emerging need for
multivalent combination vaccines to simplify vaccination schedules, the multiplicity
issue should also be a priority area for future research. The use of composite
endpoints or gate-keeping strategies may be useful in this area. The ultimate goal is
to have a readily available statistical toolkit that can be used to minimize the risk
of abandoning potentially beneficial vaccines or licensing inferior ones.
ACKNOWLEDGMENTS
We thank the referees for their thoughtful comments that led to an improved
manuscript.
REFERENCES
Agresti, A., Min, Y. (2001). On small-sample confidence intervals for parameters in discrete
distributions. Biometrics 57:963–971.
Blackwelder, W. C. (1982). Proving the null hypothesis in clinical trials. Controlled Clinical
Trials 3:345–353.
Blackwelder, W. C. (1995). Similarity/equivalence trials for combination vaccines. In: Williams, J. C., Goldenthal, K. L., Burns, D. L., Lewis, B. P., eds. Combined Vaccines and Simultaneous Administration. Annals of the New York Academy of Sciences 754:321–328.
Chan, I. S. F. (1998). Exact tests of equivalence and efficacy with a non-zero lower bound
for comparative studies. Statistics in Medicine 17:1403–1413.
Chan, I. S. F. (2002). Power and sample size determination for noninferiority trials using an
exact method. Journal of Biopharmaceutical Statistics 12(4):457–469.
Chan, I. S. F. (2003). Proving noninferiority or equivalence of two treatments with
dichotomous endpoints using exact methods. Statistical Methods in Medical Research
12:37–58.
Chan, I. S. F., Mehrotra, D. V. (2003). Confidence intervals and hypothesis testing.
Encyclopedia of Biopharmaceutical Statistics, 2nd ed. New York: Marcel Dekker,
pp. 231–234.
Chan, I. S. F., Wang, W. W. B., Heyse, J. F. (2003). Vaccine clinical trials. In: Chow, S. C., ed. Encyclopedia of Biopharmaceutical Statistics, 2nd ed. New York: Marcel Dekker, pp. 1005–1022.
Chan, I. S. F., Zhang, Z. X. (1999). Test-based exact confidence intervals for the difference
of two binomial proportions. Biometrics 55:1202–1209.
Dunnett, C. W., Gent, M. (1977). Significance testing to establish equivalence between
treatments with special reference to data in the form of 2 ×2 tables. Biometrics
33:593–602.
Ebbutt, A. F., Frith, L. (1998). Practical issues in equivalence trials. Statistics in Medicine 17:1691–1701.
European Medicines Agency (EMEA). (2005). Committee for Medicinal Products for Human Use (CHMP), Guideline on the choice of noninferiority margin. July.
Farrington, C. P., Manning, G. (1990). Test statistics and sample size formulae for
comparative binomial trials with null hypothesis of non-zero risk difference or non-
unity relative risk. Statistics in Medicine 9:1447–1454.
FDA. (1997). Guidance for industry for the evaluation of combination vaccines for
preventable diseases: production, testing and clinical studies. Center for Biologics
Evaluation and Research, Food and Drug Administration.
Frey et al. (1999). Interference of antibody production to hepatitis B surface antigen in a
combination hepatitis A/hepatitis B vaccine. Journal of Infectious Diseases 180(6):2018–2022.
Gail, M., Simon, R. (1985). Testing for qualitative interactions between treatment effects
and patient subsets. Biometrics 41:361–372.
Gail, M. H., Mark, S. D., Carroll, R. J., Green, S. B., Pee, D. (1996). On design
considerations and randomization-based inference for community intervention trials.
Statistics in Medicine 15:1069–1092.
Ganju, J., Mehrotra, D. V. (2003). Stratified experiments re-examined with emphasis on
multicenter clinical trials. Controlled Clinical Trials 24:167–181.
Horne, A. D. (1995). The statistical analysis of immunogenicity data in vaccine trials: a
review of methodologies and issues. Annals of the New York Academy of Sciences 754:329–346.
Horne, A. D., Lachenbruch, P. A., Getson, P. R., Hsu, H. (2001). Analysis of studies
to evaluate immune response to combination vaccines. Clinical Infectious Diseases
33(Suppl. 4):S306–S311.
ICH E9 Expert Working Group. (1999). Statistical principles for clinical trials: ICH
harmonized tripartite guidelines. Statistics in Medicine 18:1905–1942.
ICH E10 Guideline. (2000). Choice of control group and related issues in clinical trials.
International Conference on Harmonization (ICH), July.
Kong, L., Kohberger, R. C., Koch, G. G. (2004). Type I error and power in
noninferiority/equivalence trials with correlated endpoints: an example from vaccine
development trials. Journal of Biopharmaceutical Statistics 14:893–907.
Li, X., Mehrotra, D. V., Barnard, J. (in press). Analysis of incomplete longitudinal binary
data using multiple imputation. Statistics in Medicine.
Liang, K., Zeger, S. (2000). Longitudinal data analysis of continuous and discrete response
for pre-post designs. Sankhya 62:134–138.
Mehrotra, D. V. (2001). Stratification issues with binary endpoints. Drug Information Journal
35(4):1343–1350.
Mehrotra, D. V. (2002). Stratified comparative clinical trials: analysis and interpretation issues. Proceedings of the International Biometric Conference.
Mehrotra, D. V. (in press). Vaccine clinical trials: a statistical primer. Journal of Biopharmaceutical Statistics.
Mehrotra, D. V., Chan, I. S. F. (2000). Testing for treatment by stratum interaction in
stratified comparative binomial trials. Joint Statistical Meetings, August.
Mehrotra, D. V., Chan, I. S. F., Berger, R. L. (2003). A cautionary note on
exact unconditional inference for a difference between two independent binomial
proportions. Biometrics 59:441–450.
Mehrotra, D. V., Railkar, R. (2000). Minimum risk weights for comparing treatments in
stratified binomial trials. Statistics in Medicine 19:811–825.
Miettinen, O., Nurminen, M. (1985). Comparative analysis of two rates. Statistics in Medicine
4:213–226.
Pan, G. H., Wolfe, D. A. (1997). Test for qualitative interaction of clinical significance.
Statistics in Medicine 16:1645–1652.
Plikaytis, B. D., Carlone, P. M. (2004). Statistical considerations for vaccine immunogenicity
trials: Part 1 and Part 2. Vaccine 23(13):1596–1614.
Schuirmann, D. J. (1987). A comparison of the two one-sided tests procedure and the
power approach for assessing the equivalence of average bioavailability. Journal of
Pharmacokinetics and Biopharmaceutics 15:657–680.
Stine, R. A., Heyse, J. F. (2001). Nonparametric measures of overlap. Statistics in Medicine 20:215–236.
Temple, R. (1996). Problems in interpreting active control equivalence trials. Accountability
in Research 4:267–275.
Wang, W. W. B., Li, D., Liu, F., Chan, I. S. F. (2003). Analysis of immune responses in vaccine clinical trials with a pre-post design. Joint Statistical Meetings, San Francisco,
August.
Wiens, B. L., Heyse, J. F. (2003). Testing for interaction in studies of noninferiority. Journal
of Biopharmaceutical Statistics 13:103–115.
Wiens, B. L., Heyse, J. F., Matthews, H. (1996). Similarity of three treatments, with
application to vaccine development. Proceedings of the Biopharmaceutical Section, Joint Statistical Meetings, pp. 203–206.