When to use agreement versus reliability measures

Reproducibility concerns the degree to which repeated measurements provide similar results. Agreement parameters assess how close the results of the repeated measurements are, by estimating the measurement error in repeated measurements. Reliability parameters assess whether study objects, often persons, can be distinguished from each other, despite measurement errors. In that case, the measurement error is related to the variability between persons. Consequently, reliability parameters are highly dependent on the heterogeneity of the study sample, while the agreement parameters, based on measurement error, are more a pure characteristic of the measurement instrument. Using an example of an interrater study, in which different physical therapists measure the range of motion of the arm in patients with shoulder complaints, the differences and relationships between reliability and agreement parameters for continuous variables are illustrated. If the research question concerns the distinction of persons, reliability parameters are the most appropriate. But if the aim is to measure change in health status, which is often the case in clinical practice, parameters of agreement are preferred.
1. Introduction
Outcome measures in medical sciences may concern the
assessment of radiographs and other imaging techniques,
biopsy readings, the results of laboratory tests, the findings
of physical examinations, or the scores on questionnaires
collecting information, for example, on functional limita-
tions, pain coping styles, and quality of life. An essential
requirement of all outcome measures is that they are valid
and reproducible or reliable [1,2].
Reproducibility concerns the degree to which repeated
measurements in stable study objects, often persons, pro-
vide similar results. Repeated measurements may differ be-
cause of biologic variation in persons, because even stable
characteristics often show small day-to-day differences, or
follow a circadian rhythm. Other sources of variation may
originate from the measurement instrument itself, or the
circumstances under which the measurements take place.
For instance, some instruments may be temperature depen-
dent, or the mood of a respondent may influence the
answers on a questionnaire. Measurements based on assess-
ments made by clinicians may be influenced by intrarater or
interrater variation.
This article first presents an example of an interrater
study, then describes the concepts underlying various
reproducibility parameters, which can be divided into
reliability and agreement parameters. The primary aim of
this article is to demonstrate the relationship and the impor-
tant difference between parameters of reliability and agree-
ment, and to provide recommendations for their use in
medical sciences.
2. An example
In an interrater study on the range of motion of a painful
shoulder, different reproducibility parameters were used to
present the results [3]. To assess the limitations in passive
glenohumeral abduction movement, the range of motion
of the arm was measured with a digital inclinometer, and
expressed in degrees. Two physical therapists (PT_A and PT_B)
measured the range of motion of the affected and
the nonaffected shoulder in 155 patients with shoulder
complaints. Table 1 presents the results in terms of means
and standard deviations, percentages of agreement within
5° and 10°, limits of agreement, and intraclass correlation
coefficients (ICC) [3].
The first two lines in Table 1 present the means and
standard deviations of the scores assessed by PT_A and PT_B
for the affected and nonaffected shoulder. The
standard deviations (SD) show the variability in the results
between the patients, in which the heterogeneity of the
study sample is reflected with regard to the characteristic
under study. The third line presents the mean differences
(Mean_diff) between the scores of PT_A and PT_B and the
SDs of these differences (SD_diff). The fourth and fifth lines
present agreement parameters by reporting the percentages
of patients for whom the scores of PT_A and PT_B differed by
less than 5° and 10°, respectively. For 43% of the patients
the scores of PT_A and PT_B were within the 5° range, and
for 72% they were within the 10° range, for both the
affected and the nonaffected shoulder. The limits of
agreement, calculated according to the Bland and Altman
method [4], are also similar for both shoulders:
−18.80° to 20.40° for the affected shoulder and
−17.88° to 19.68° for the nonaffected shoulder. The last
line shows the ICC, which is a parameter of reliability.
There is quite a difference in the value of the ICC for
the two shoulders: 0.83 for the affected shoulder and
0.28 for the nonaffected shoulder. For interpretation, the
physical therapists would be quite satisfied with an
agreement percentage of 72% of the patients within the 10°
range, while the ICC (>0.7 is generally considered as
good [5]) shows a satisfactory value for the affected
shoulder, but not for the nonaffected shoulder. The explanation
for these apparently contradictory results can be found in
the conceptual difference between the two types of parameters.
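As a rough check on the agreement figures above, the Bland and Altman limits can be recomputed from the mean difference and the SD of the differences. The Python sketch below assumes the conventional multiplier of 1.96; the function names and the four pairs of scores are hypothetical and serve only to show how a percentage within a threshold might be counted (differences equal to the threshold are counted as within it, which is an assumption).

```python
def limits_of_agreement(mean_diff, sd_diff, z=1.96):
    """Return (lower, upper) limits: mean_diff -/+ z * sd_diff."""
    half_width = z * sd_diff
    return mean_diff - half_width, mean_diff + half_width


def percent_within(scores_a, scores_b, threshold):
    """Percentage of paired scores whose absolute difference is <= threshold."""
    pairs = list(zip(scores_a, scores_b))
    hits = sum(1 for a, b in pairs if abs(a - b) <= threshold)
    return 100.0 * hits / len(pairs)


# Summary values reported for PT_A versus PT_B (mean difference, SD of differences).
affected = limits_of_agreement(0.80, 10.00)
nonaffected = limits_of_agreement(0.90, 9.58)
print("affected:    %.2f to %.2f" % affected)      # -18.80 to 20.40
print("nonaffected: %.2f to %.2f" % nonaffected)   # -17.88 to 19.68

# Hypothetical paired scores in degrees, not data from the study.
demo_a = [70, 65, 80, 55]
demo_b = [68, 72, 79, 66]
print(percent_within(demo_a, demo_b, 5), percent_within(demo_a, demo_b, 10))
```

Both quantities depend only on the differences between the two raters, not on how much the patients differ from each other.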
3. Conceptual difference between agreement
and reliability parameters
In the literature, agreement and reliability parameters
are often used interchangeably, although some authors have
pointed out the differences [6,7].
Agreement and reliability parameters focus on two dif-
ferent questions:
1. "How good is the agreement between repeated
measurements?" This concerns the measurement error,
and assesses exactly how close the scores for repeated
measurements are.
2. "How reliable is the measurement?" In other words,
how well can patients be distinguished from each
other, despite measurement errors? In this case, the
measurement error is related to the variability
between study objects.
As an umbrella term for the concepts of agreement and
reliability we use the term "reproducibility" [7], because
both concepts concern the question of whether measure-
ment results are reproducible in test–retest situations. The
repetitions may concern different measurement moments,
different conditions, different raters, or the same rater at
different times.
Figure 1 visualizes the distinction between agreement
and reliability. The weight of three persons is measured
on 5 different days. The five measurements per person show
some variation. The SD of the values of the repeated mea-
surements of one person represents the agreement, and an-
swers question 1 above. For reliability this measurement
error is related to the variability between persons, and tells
us how well they can be distinguished from each other. If
the values of two persons are far apart, the measurement
error will not affect discrimination of these persons, but
if the values of two persons are close, the measurement
error will affect the ability to discriminate and the
reliability will be substantially lower.
A reliability parameter (e.g., the ICC) has as typical
basic formula:
\[ \text{reliability} = \frac{\text{variability between study objects}}{\text{variability between study objects} + \text{measurement error}} \]
The reliability parameter relates the measurement error
to the variability between study objects, in our case persons.
Table 1
Reproducibility of the measurement of glenohumeral abduction
of the shoulder

Parameters                     Affected shoulder     Nonaffected shoulder
PT_A: Mean (SD)                69.49° (17.60°)       79.78° (7.60°)
PT_B: Mean (SD)                68.69° (16.25°)       78.88° (8.38°)
Mean_diff (SD_diff)            0.80° (10.00°)        0.90° (9.58°)
PT_A vs. PT_B: % within 5°     43%                   43%
PT_A vs. PT_B: % within 10°    72%                   72%
Limits of agreement            −18.80° to 20.40°     −17.88° to 19.68°
ICC                            0.83                  0.28

If the measurement error is small, compared to the variabil-
ity between persons, the reliability parameter approaches 1.
This means that the discrimination of the persons is hardly
affected by measurement error, and thus the reliability is
high (as for two persons with widely separated values in Fig. 1). If the measurement error
is large, compared to the variability between persons, the
ICC value becomes smaller. For example, if the measure-
ment error equals the variability between persons, the
ICC becomes 0.5. This means that the discrimination will
be affected by the measurement error (as for two persons
with similar values in Fig. 1). The ICC is a ratio ranging in value
between 0 (representing a totally unreliable measurement)
and 1 (implying perfect reliability).
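The behaviour of the basic formula can be made concrete with a few lines of Python. This is only an illustrative sketch: the function name and the variance values are assumptions chosen for the example, not values from the study.

```python
def reliability(between_variance, error_variance):
    """Basic reliability ratio: between / (between + error)."""
    if between_variance < 0 or error_variance < 0:
        raise ValueError("variances must be non-negative")
    total = between_variance + error_variance
    if total == 0:
        raise ValueError("total variance must be positive")
    return between_variance / total


# Illustrative (hypothetical) variances.
small_error = reliability(100.0, 1.0)    # error small relative to persons
equal_error = reliability(25.0, 25.0)    # error equals between-person variance
large_error = reliability(1.0, 100.0)    # error dominates

print(round(small_error, 3), equal_error, round(large_error, 3))
```

With equal variances the ratio is exactly 0.5, as stated in the text; keeping the error fixed and shrinking the between-person variance lowers the ratio even though the instrument itself has not changed.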
We now turn back to our example. The percentage of
scores of both PTs within 5° and 10° is a parameter of
agreement: it estimates the measurement error. Note that
it remains unclear whether this measurement error origi-
nates from the PTs, for example, because of a difference
in the way they use the inclinometer, from variation within
the patients who differ in their range of motion at the two
moments of measurements, or from a combination of PTs
and patients, for instance because of variation in the way
the patients are motivated by the PTs. However, this mea-
surement error is totally independent of the variability be-
tween persons. In our example, the agreement of the
scores of the two PTs is approximately the same for the
affected and the nonaffected shoulder. The ICC, which is a
reliability parameter, relates this measurement error to the
variability between persons in the population sample under
study. The higher value of the ICC for the affected shoulder
is explained by the higher variability between persons in
glenohumeral abduction of the affected shoulder, compared
to the nonaffected shoulder. This can be seen from the much
larger standard deviation in the measurements of the affected
shoulder, compared to the standard deviation for the non-
affected shoulder (first two lines in Table 1): patients all have
maximum movement ability in the arm on the nonaffected
side, but they differ considerably in the range of motion of
the arm on the affected side. As the variability in scores for
the affected shoulders is greater, these can more easily be
distinguished, despite the same magnitude of measurement
error. This clearly illustrates the difference between agree-
ment and reliability parameters.
4. Agreement parameters are neglected in medical sciences
In the 1980s Guyatt et al. [8] clearly emphasized the dis-
tinction between reliability and agreement parameters.
They explained that reliability parameters are required for
instruments that are used for discriminative purposes and
agreement parameters are required for those that are used
for evaluative purposes. With a hypothetical example they
eloquently demonstrated that discriminative instruments
require a high level of reliability: that is, the measurement
error should be small in comparison to the variability be-
tween the persons that the instrument needs to distinguish.
Thus, if the differences between persons are large, a certain
amount of measurement error is acceptable. For an evalua-
tive measurement instrument the variability between per-
sons in the population sample does not matter at all; only
the measurement error is important. This measurement er-
ror should be smaller than the improvements or deteriora-
tions that one wants to detect. If the measurement error is
large, then small changes cannot be distinguished from
measurement error. The smaller the measurement error,
the smaller the changes that can be detected beyond
measurement error.
In medical sciences measurement instruments are often
used to evaluate changes over time, either with or without
interventions. Nevertheless, many researchers still prefer
reliability parameters over agreement parameters. In two
recent clinimetric reviews [9,10] we assessed the quality
of measurement instruments in terms of reproducibility.
We registered whether agreement and reliability parameters
were assessed. The measurement instruments were ques-
tionnaires to assess shoulder disability [9] or quality of life
(QoL) in visually impaired persons [10]. These instruments
are mainly used to evaluate the effects of interventions or
monitor changes over time. Thus, these are typically evalu-
ative measurement instruments. In the review of QoL of vi-
sually impaired persons [10] we found 31 questionnaires.
For 16 questionnaires a reliability parameter was reported,
but a parameter for agreement was presented for only seven
questionnaires. For all 16 shoulder disability questionnaires
a parameter of reliability was presented, but an additional
parameter of agreement was presented for only six
questionnaires [9]. Apparently, agreement parameters have
not yet taken root in medical sciences.
Streiner and Norman [1] argue that there is no need for
a special parameter for measurement error, because it can
be derived from the ICC formula. However, this is only true
if all the components of the ICC formula are presented.
Usually only the ICC value is provided, without even men-
tioning which ICC formula has been used [11], that is, with
or without inclusion of the systematic difference between
measurements. Even more important, authors who present
only reliability parameters and no parameters of agreement
usually draw the wrong conclusions: they rely solely on
reliability parameters when they should have relied on
parameters of agreement when evaluation is at issue.
5. Relationship between the agreement
and reliability parameters
The relationship between parameters of agreement and
reliability can best be illustrated by elaborating on the
variances that are involved in the ICC formulas. Therefore,
we first need to explain the meaning of the variance
components [12]. Variance ($\sigma^2$) is the statistical term that is
used to indicate variability.
The variance in observed scores can be subdivided into
the variance in the objects under study, in our example
the persons ($\sigma^2_p$), the variance in observers (the two
different PTs) ($\sigma^2_{pt}$), and the interaction between persons and
PTs. We will call this latter term the residual variance
($\sigma^2_{\mathrm{residual}}$). The variance in persons ($\sigma^2_p$) represents the
variability between persons, and $\sigma^2_{pt}$ represents the variance
due to systematic differences between PT_A and PT_B. The
measurement error [error variance ($\sigma^2_{\mathrm{error}}$)] consists of
either $\sigma^2_{\mathrm{residual}}$ alone or ($\sigma^2_{pt} + \sigma^2_{\mathrm{residual}}$), depending on whether
one wants to ignore or to take into account systematic differences
between the measurements (in our example PTs A and B).
These systematic differences are usually considered to be
part of the measurement error, because in practice, the
measurements are performed by different PTs, and one is
interested in the real values of the differences between
the repeated measurements. However, if one is only
interested in the ranking of patients, the systematic differences
between the PTs are not important, and the error variance
contains only $\sigma^2_{\mathrm{residual}}$.
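The ICCs, which relate the measurement error to the
variability between persons, are represented by the following
formulas, for ICC_agreement and ICC_consistency:
\[ \mathrm{ICC}_{\mathrm{agreement}} = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pt} + \sigma^2_{\mathrm{residual}}} \]
\[ \mathrm{ICC}_{\mathrm{consistency}} = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{\mathrm{residual}}} \]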
We realize that the specifications "agreement" and
"consistency" for the type of ICC are highly confusing,
because both terms have other meanings in the field of
reproducibility. For example, consistency is sometimes used
as a synonym of reproducibility. As this terminology of ICC
types is used in general handbooks and landmark papers
[11,12], we do not want to deviate from it. ICC_agreement has
the extra term $\sigma^2_{pt}$ in the denominator to take the systematic
difference between the PTs into account; ICC_consistency
ignores systematic differences. Both types of ICC are
dependent on the heterogeneity of the population sample with
respect to the characteristic under study. We want to stress
that both ICC_agreement and ICC_consistency are reliability
parameters (and not agreement parameters), although the label
"agreement" suggests otherwise.
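The measurement error is represented by the
standard error of measurement (SEM) and equals
the square root of the error variance: $\mathrm{SEM} = \sqrt{\sigma^2_{\mathrm{error}}}$.
This means that
\[ \mathrm{SEM}_{\mathrm{agreement}} = \sqrt{\sigma^2_{pt} + \sigma^2_{\mathrm{residual}}} \]
\[ \mathrm{SEM}_{\mathrm{consistency}} = \sqrt{\sigma^2_{\mathrm{residual}}} \]
The SEM is a suitable parameter of agreement.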
6. Illustration of ICC and SEM calculations
in the example
Table 2 presents the values of the variance components
for the affected and nonaffected shoulder. The variance
components are estimated with SPSS (version 10.1), with
the range of motion values as dependent variable and
persons and PTs as random factors, using the restricted
maximum likelihood method. From these variance components,
the above-mentioned SEMs can be calculated. For the
affected shoulder:
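\[ \mathrm{SEM}_{\mathrm{agreement}\,AB} = \sqrt{\sigma^2_{pt\,AB} + \sigma^2_{\mathrm{residual}}} = \sqrt{0.00 + 49.98} = 7.07 \]
\[ \mathrm{SEM}_{\mathrm{consistency}\,AB} = \sqrt{\sigma^2_{\mathrm{residual}}} = \sqrt{49.98} = 7.07 \]
ICC_agreement for the affected shoulder can be
calculated as follows:
\[ \mathrm{ICC}_{\mathrm{agreement}\,AB} = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pt\,AB} + \sigma^2_{\mathrm{residual}}} = \frac{236.93}{236.93 + 0.00 + 49.98} = 0.83 \]
And for the nonaffected shoulder:
\[ \mathrm{ICC}_{\mathrm{agreement}\,AB} = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pt\,AB} + \sigma^2_{\mathrm{residual}}} = \frac{18.08}{18.08 + 0.11 + 45.91} = 0.28 \]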
In this example, ICC_agreement and ICC_consistency have
roughly the same value: as the systematic differences
between PT_A and PT_B were small, $\sigma^2_{pt}$ is almost 0.
Therefore, we introduce a hypothetical physical therapist C (PT_C)
who scores the range of motion of every patient 5° lower
than PT_B. We added the (hypothetical) scores of PT_C to our
dataset and recalculated the variance components (Table 2).
Table 2
Values of the variance components for the affected and nonaffected
shoulder

Variance components              Affected shoulder    Nonaffected shoulder
$\sigma^2_p$                     236.93               18.08
$\sigma^2_{pt\,AB}$              0.00                 0.11
$\sigma^2_{pt\,AC}$              16.50                17.09
$\sigma^2_{\mathrm{residual}}$   49.98                45.91

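The components in Table 2 were estimated with restricted maximum likelihood in SPSS. For readers who want to see where such numbers come from, the Python sketch below uses a different and simpler technique: method-of-moments estimates from two-way ANOVA mean squares, which applies only to a complete design in which every person is scored by every rater. It is not the REML procedure used for the article. The function name, the truncation of negative estimates at zero, and the four-person data set are assumptions made for illustration.

```python
from statistics import mean


def variance_components(scores):
    """Estimate (var_p, var_pt, var_residual) from a persons x raters table.

    scores: list of rows, one row per person, one score per rater.
    Uses two-way ANOVA mean squares (method of moments), not REML.
    """
    n = len(scores)        # number of persons
    k = len(scores[0])     # number of raters
    grand = mean(v for row in scores for v in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(row[j] for row in scores) for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    var_residual = ms_error
    # Negative estimates are truncated at zero (an assumption of this sketch).
    var_p = max((ms_rows - ms_error) / k, 0.0)
    var_pt = max((ms_cols - ms_error) / n, 0.0)
    return var_p, var_pt, var_residual


# Hypothetical scores of four persons by two raters (not study data).
demo_scores = [[10, 12], [20, 21], [30, 33], [40, 40]]
print(variance_components(demo_scores))
```

Subtracting a constant from one rater's scores changes only the rater component in this sketch, which mirrors the way the hypothetical PT_C affects the rater variance in Table 2 while the person and residual variances stay the same.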
From now on we proceed with the example of PT_A and PT_C
because this will better illustrate the influence of systematic
differences on the parameters for agreement and reliability.
We will present only the calculations for the affected
shoulder in the text. The results for both shoulders are presented
in Table 3.
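\[ \mathrm{ICC}_{\mathrm{agreement}\,AC} = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pt\,AC} + \sigma^2_{\mathrm{residual}}} = \frac{236.93}{236.93 + 16.50 + 49.98} = 0.78 \]
\[ \mathrm{ICC}_{\mathrm{consistency}\,AC} = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{\mathrm{residual}}} = \frac{236.93}{236.93 + 49.98} = 0.83 \]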
Note that ICC_consistency is not influenced by the
systematic differences, but ICC_agreement becomes smaller,
because for this parameter the systematic differences between
PT_A and PT_C ($\sigma^2_{pt\,AC}$) are included in the measurement
error.
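The ICC values for both pairs of raters can be checked directly against the variance components of Table 2. The following Python sketch does this for both shoulders; the dictionary layout and the function name are hypothetical conveniences, and the numbers are the published variance components.

```python
# Variance components as reported in Table 2.
TABLE_2 = {
    "affected": {"p": 236.93, "pt_AB": 0.00, "pt_AC": 16.50, "residual": 49.98},
    "nonaffected": {"p": 18.08, "pt_AB": 0.11, "pt_AC": 17.09, "residual": 45.91},
}


def icc(var_p, var_residual, var_pt=0.0):
    """ICC_agreement when var_pt is supplied, ICC_consistency when it is omitted."""
    return var_p / (var_p + var_pt + var_residual)


results = {}
for shoulder, v in TABLE_2.items():
    results[shoulder] = {
        "agreement_AB": icc(v["p"], v["residual"], v["pt_AB"]),
        "agreement_AC": icc(v["p"], v["residual"], v["pt_AC"]),
        "consistency": icc(v["p"], v["residual"]),
    }

for shoulder, values in results.items():
    print(shoulder, {name: round(value, 2) for name, value in values.items()})
```

Rounded to two decimals this gives 0.83, 0.78 and 0.83 for the affected shoulder and 0.28, 0.22 and 0.28 for the nonaffected shoulder: only the agreement type of ICC reacts to the systematic 5° difference.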
7. Three ways to obtain SEM values
To facilitate and encourage the use of agreement param-
eters we will demonstrate how agreement parameters can
be derived from the ICC formula, or can be calculated in
other ways.
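1. SEM values can easily be derived from the ICC
formula, if all variance components are presented.
In that case, the reader can calculate the ICC of
his/her own choice. SEM is calculated as $\sqrt{\sigma^2_{\mathrm{error}}}$,
which equals $\sqrt{\sigma^2_{pt} + \sigma^2_{\mathrm{residual}}}$ if one wishes to
take the systematic differences between the PTs
into account; otherwise, it equals $\sqrt{\sigma^2_{\mathrm{residual}}}$.
\[ \mathrm{SEM}_{\mathrm{agreement}\,AC} = \sqrt{\sigma^2_{\mathrm{error}}} = \sqrt{\sigma^2_{pt\,AC} + \sigma^2_{\mathrm{residual}}} = \sqrt{16.50 + 49.98} = 8.15 \]
\[ \mathrm{SEM}_{\mathrm{consistency}\,AC} = \sqrt{\sigma^2_{\mathrm{error}}} = \sqrt{\sigma^2_{\mathrm{residual}}} = \sqrt{49.98} = 7.07 \]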
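2. The ICC formula can be transformed to
$\mathrm{SEM} = \sigma\sqrt{1 - \mathrm{ICC}}$ [1], in which $\sigma$ represents the
square root of the total variance (i.e., of the denominator of the
reliability formula). Only SEM_consistency can be reproduced this
way, by imputing the pooled SD of the first and second
assessment for $\sigma$, and using ICC_consistency:
\[ \mathrm{SEM}_{\mathrm{consistency}\,AC} = \sigma\sqrt{1 - \mathrm{ICC}_{\mathrm{consistency}\,AC}} = 16.94 \times \sqrt{1 - 0.826} = 7.07 \]
Note that SEM_agreement cannot be obtained in this
way, because systematic errors are not represented
in the pooled SD. Using ICC_agreement and the pooled
SD of the first and second assessment for $\sigma$ would yield:
\[ \mathrm{SEM}_{\mathrm{agreement}\,AC} = \sigma\sqrt{1 - \mathrm{ICC}_{\mathrm{agreement}\,AC}} = 16.94 \times \sqrt{1 - 0.781} = 7.93 \]
which is lower than the value of 8.15 calculated from the variance components.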
The formula $\mathrm{SEM} = \sigma\sqrt{1 - \mathrm{ICC}}$ is often used if
information on the individual variance components is lacking.
The ICC calculated in one study is then applied to a
population sample for which the standard deviation (the square
root of the total variance) is known. In this way, only a rough
indication of the SEM can be obtained, because the ICC is heavily
dependent on the heterogeneity of the characteristic under
study in the population sample, and is thus, in theory, only
applicable for a population with a similar heterogeneity.
3. The value of SEM can also be derived by dividing the
SD of the differences between two measurements
(SD_diff) by $\sqrt{2}$. The factor $\sqrt{2}$ is included
because it concerns the difference between two
measurements and errors occur in both measurements.
Note that the SEM obtained by this formula
is again SEM_consistency, because systematic error is
not included in the SD. Thus, SEM_agreement cannot
be calculated in this way either.
\[ \mathrm{SEM}_{\mathrm{consistency}\,AC} = \mathrm{SD}_{\mathrm{diff}\,AC} / \sqrt{2} = 10.00 / \sqrt{2} = 7.07 \]
SD_diff AB gives the same result, because PT_B and PT_C
differed only by a systematic value.
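The three routes to the SEM can be compared side by side for PT_A and PT_C on the affected shoulder. The Python sketch below takes its inputs from Tables 2 and 3; the variable names are hypothetical, and the pooled SD is computed as the square root of the mean of the two squared SDs, which is an assumption about how the pooling was done.

```python
import math

# Inputs for the affected shoulder, PT_A versus PT_C (Tables 2 and 3).
VAR_P, VAR_PT_AC, VAR_RESIDUAL = 236.93, 16.50, 49.98
SD_A, SD_C = 17.60, 16.25
SD_DIFF_AC = 10.00

# Way 1: directly from the variance components.
sem_agreement = math.sqrt(VAR_PT_AC + VAR_RESIDUAL)
sem_consistency = math.sqrt(VAR_RESIDUAL)

# Way 2: sigma * sqrt(1 - ICC), with the pooled SD standing in for sigma.
pooled_sd = math.sqrt((SD_A ** 2 + SD_C ** 2) / 2)
icc_consistency = VAR_P / (VAR_P + VAR_RESIDUAL)
icc_agreement = VAR_P / (VAR_P + VAR_PT_AC + VAR_RESIDUAL)
sem_from_icc_consistency = pooled_sd * math.sqrt(1 - icc_consistency)
sem_from_icc_agreement = pooled_sd * math.sqrt(1 - icc_agreement)

# Way 3: SD of the differences divided by sqrt(2).
sem_from_sd_diff = SD_DIFF_AC / math.sqrt(2)

for name, value in [
    ("way 1, agreement", sem_agreement),
    ("way 1, consistency", sem_consistency),
    ("way 2, using ICC_consistency", sem_from_icc_consistency),
    ("way 2, using ICC_agreement", sem_from_icc_agreement),
    ("way 3, SD_diff / sqrt(2)", sem_from_sd_diff),
]:
    print(f"{name}: {value:.2f}")
```

Ways 2 and 3 both land on about 7.07 when the consistency version is used, whereas feeding ICC_agreement into way 2 gives about 7.93 instead of the 8.15 obtained from the variance components, because the pooled SD does not contain the systematic difference.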
8. Typical parameters for agreement and reliability
Table 3
Reproducibility of measurement of PT_A and PT_C

Parameters                     Affected shoulder     Nonaffected shoulder
PT_A: Mean (SD)                69.49° (17.60°)       79.78° (7.60°)
PT_C: Mean (SD)                63.69° (16.25°)       73.88° (8.38°)
Mean_diff AC (SD_diff AC)      5.80° (10.0°)         5.90° (9.58°)
ICC_agreement AC               0.78                  0.22
ICC_consistency AC             0.83                  0.28
Limits of agreement            −13.80° to 25.40°     −12.88° to 24.68°

For repeated measurements on a continuous scale, as in
our example, an ICC is the most appropriate reliability
parameter. An extensive overview of the various ICC
formulas is provided by McGraw and Wong [11].
In our example, agreement was expressed as the percentage
of observations lying between predefined values (Table
1). Presentation in this way makes sense in clinical practice,
because every PT knows what 5° and 10° mean. This
measure was chosen because it can easily be interpreted
by PTs [3]. However, the SEM is usually the basic
parameter of agreement for measurements on a continuous scale.
A method proposed by Bland and Altman [4], which
assesses the limits of agreement, is frequently used. These
limits of agreement can be directly derived from
SEM_consistency, because $\mathrm{SD}_{\mathrm{diff}} = \sqrt{2} \times \mathrm{SEM}_{\mathrm{consistency}}$.
9. Clinical interpretation
Agreement parameters are expressed on the actual scale
of measurement, whereas reliability parameters are
dimensionless values between 0 and 1. This is an important
advantage for clinical interpretation. If weights are mea-
sured in kilograms, the dimension of the SEM is kilograms.
For example, if we know that a weighing scale has
a SEM of 300 g, we know that we can use it to monitor
adult body weight because changes of less than 1 kilogram
are not important. The smallest detectable change (SDC)
is based on this measurement error, and is defined as
$1.96 \times \sqrt{2} \times \mathrm{SEM}$. With an SEM of 300 g, the SDC is
$1.96 \times \sqrt{2} \times 300$ g = 832 g. Obviously, one cannot use this
scale to weigh babies or to weigh flour in the kitchen, be-
cause in these instances changes of less than 800 g are very
important. The measurement error alone provides useful
information when there is a clear conception of the
differences that are important.
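The weighing-scale arithmetic is easy to reproduce. The short Python sketch below computes the smallest detectable change from an SEM; the function name is hypothetical, and the multiplier 1.96 is the one given in the definition above.

```python
import math


def smallest_detectable_change(sem, z=1.96):
    """SDC = z * sqrt(2) * SEM, expressed in the same unit as the SEM."""
    return z * math.sqrt(2) * sem


# Weighing scale with an SEM of 300 g.
sdc_scale = smallest_detectable_change(300)
print(round(sdc_scale))  # 832 (grams)
```

Because the result is in the unit of the instrument, it can be compared directly with the change one considers important: about 832 g is acceptable when changes below 1 kg do not matter, and clearly too coarse when changes of a few hundred grams do.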
The situation is different in the case of unfamiliarity
with scores. For example, if a new multi-item questionnaire
is used to measure functional status on a scale from 0 to
50, an orthopaedic surgeon may want to know what a value
of 14 points and an SEM of 2 points means, because she
has no idea how many points of change represent
clinically relevant change. If an ICC is presented, she will know
whether the instrument is able to discriminate between pa-
tients in the sample, but she still does not know whether
the instrument is suitable for monitoring the functional sta-
tus of her patients over time. This requires more informa-
tion about the interpretation of scores. By assessing the
scores of groups of mildly, moderately, and severely dis-
abled patients a feeling for the meaning of scores will
arise. Comparisons with other instruments will provide
further insight into the meaning of values on the new
measurement instrument. The assessments of minimally
important changes in various measurements will also
contribute to insight with regard to which (changes in)
scores are clinically relevant [13,14]. Only this informa-
tion makes it possible to assess whether the agreement pa-
rameter of a measurement instrument is sufficient to detect
clinically relevant changes.
10. Conclusion
In this article we have shown the important difference
between the parameters of reliability and agreement and
their relationship. Agreement parameters will be more sta-
ble over different population samples than reliability pa-
rameters, as we observed in our shoulder example, in
which the SEM was quite similar for the affected and the
nonaffected shoulder. Reliability parameters are highly de-
pendent on the variation in the population sample, and are
only generalizable to samples with a similar variation. Re-
liability is clearly a characteristic of the performance of an
instrument in a certain population sample. Agreement is
more a characteristic of the measurement instrument itself.
Agreement parameters are preferable in all situations in
which the instrument will be used for evaluation purposes,
which is often the case in medical research. Researchers
and readers should be eager to apply and interpret the
parameters of agreement and reliability correctly.
