When to use agreement versus reliability measures

Henrica C.W. de Vet a,*, Caroline B. Terwee a, Dirk L. Knol a,b, Lex M. Bouter a

a Institute for Research in Extramural Medicine, VU University Medical Center, Van der Boechorststraat 7, Amsterdam 1081 BT, The Netherlands
b Department of Clinical Epidemiology and Biostatistics, VU University Medical Center, Amsterdam, The Netherlands

Accepted 25 October 2005
Abstract

Background: Reproducibility concerns the degree to which repeated measurements provide similar results. Agreement parameters assess how close the results of the repeated measurements are, by estimating the measurement error in repeated measurements. Reliability parameters assess whether study objects, often persons, can be distinguished from each other, despite measurement errors. In that case, the measurement error is related to the variability between persons. Consequently, reliability parameters are highly dependent on the heterogeneity of the study sample, while the agreement parameters, based on measurement error, are more a pure characteristic of the measurement instrument.

Methods and Results: Using an example of an interrater study, in which different physical therapists measure the range of motion of the arm in patients with shoulder complaints, the differences and relationships between reliability and agreement parameters for continuous variables are illustrated.

Conclusion: If the research question concerns the distinction of persons, reliability parameters are the most appropriate. But if the aim is to measure change in health status, which is often the case in clinical practice, parameters of agreement are preferred. © 2006 Elsevier Inc. All rights reserved.

Keywords: Agreement; Measurement error; Measurement instruments; Reliability; Repeated measurements; Reproducibility
1. Introduction

Outcome measures in medical sciences may concern the assessment of radiographs and other imaging techniques, biopsy readings, the results of laboratory tests, the findings of physical examinations, or the scores on questionnaires collecting information, for example, on functional limitations, pain coping styles, and quality of life. An essential requirement of all outcome measures is that they are valid and reproducible or reliable [1,2].
Reproducibility concerns the degree to which repeated measurements in stable study objects, often persons, provide similar results. Repeated measurements may differ because of biologic variation in persons, because even stable characteristics often show small day-to-day differences, or follow a circadian rhythm. Other sources of variation may originate from the measurement instrument itself, or the circumstances under which the measurements take place. For instance, some instruments may be temperature dependent, or the mood of a respondent may influence the answers on a questionnaire. Measurements based on assessments made by clinicians may be influenced by intrarater or interrater variation.
This article first presents an example of an interrater study, then describes the concepts underlying various reproducibility parameters, which can be divided into reliability and agreement parameters. The primary aim of this article is to demonstrate the relationship and the important difference between parameters of reliability and agreement, and to provide recommendations for their use in medical sciences.
2. An example

In an interrater study on the range of motion of a painful shoulder, different reproducibility parameters were used to present the results [3]. To assess the limitations in passive glenohumeral abduction movement, the range of motion of the arm was measured with a digital inclinometer, and expressed in degrees.
Two physical therapists (PT_A and PT_B) measured the range of motion of the affected and the nonaffected shoulder in 155 patients with shoulder complaints. Table 1 presents the results in terms of means and standard deviations, percentages of agreement within 5° and 10°, limits of agreement, and intraclass correlation coefficients (ICC) [3].
The first two lines in Table 1 present the means and standard deviations of the scores assessed by PT_A and PT_B for the affected and nonaffected shoulder. The standard deviations (SD) show the variability in the results between the patients, reflecting the heterogeneity of the study sample with regard to the characteristic under study. The third line presents the mean differences (Mean_diff) between the scores of PT_A and PT_B, and the SDs of these differences (SD_diff). The fourth and fifth lines present agreement parameters by reporting the percentages of patients for whom the scores of PT_A and PT_B differed less than 5° and 10°, respectively. For 43% of the patients the scores of PT_A and PT_B were within the 5° range, and for 72% they were within the 10° range, for both the affected and the nonaffected shoulder. The limits of agreement, calculated according to the Bland and Altman method [4], are also similar for both shoulders: −18.80° to 20.40° for the affected shoulder and −17.88° to 19.68° for the nonaffected shoulder. The last line shows the ICC, which is a parameter of reliability. There is quite a difference in the value of the ICC for the two shoulders: 0.83 for the affected shoulder and 0.28 for the nonaffected shoulder. For interpretation, the physical therapists would be quite satisfied with an agreement percentage of 72% of the patients within the 10° range, while the ICC (>0.7 is generally considered good [5]) shows a satisfactory value for the affected shoulder, but not for the nonaffected shoulder. The explanation for these apparently contradictory results can be found in the conceptual difference between the two types of parameters.
3. Conceptual difference between agreement and reliability parameters

In the literature, agreement and reliability parameters are often used interchangeably, although some authors have pointed out the differences [6,7].

Agreement and reliability parameters focus on two different questions:

1. "How good is the agreement between repeated measurements?" This concerns the measurement error, and assesses exactly how close the scores for repeated measurements are.
2. "How reliable is the measurement?" In other words, how well can patients be distinguished from each other, despite measurement errors? In this case, the measurement error is related to the variability between study objects.
As an umbrella term for the concepts of agreement and reliability we use the term "reproducibility" [7], because both concepts concern the question of whether measurement results are reproducible in test–retest situations. The repetitions may concern different measurement moments, different conditions, different raters, or the same rater at different times.
Figure 1 visualizes the distinction between agreement and reliability. The weight of three persons is measured on 5 different days. The five measurements per person show some variation. The SD of the values of the repeated measurements of one person represents the agreement, and answers question 1 above. For reliability this measurement error is related to the variability between persons, and tells us how well they can be distinguished from each other. If the values of persons are distant (as for persons ● and ■), the measurement error will not affect discrimination of the persons, but if the values of persons are close (as for persons ■ and ▲) the measurement error will affect the ability to discriminate, and the reliability will be substantially lower.
A reliability parameter (e.g., the ICC) has as its typical basic formula:

reliability = (variability between study objects) / (variability between study objects + measurement error)

The reliability parameter relates the measurement error to the variability between study objects, in our case persons.
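To make this ratio concrete, the following sketch (our illustration with hypothetical numbers, not the authors' data) mimics the situation of Fig. 1: the same measurement error leaves the reliability near 1 when the persons' true weights lie far apart, but lowers it substantially when they lie close together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true weights (kg): three persons far apart vs. close together.
scenarios = {"far apart (like persons ● and ■)": [62.0, 75.0, 88.0],
             "close together (like persons ■ and ▲)": [74.0, 75.0, 76.0]}

for label, true_weights in scenarios.items():
    # Five repeated measurements per person, measurement error SD = 0.8 kg.
    scores = np.array([w + rng.normal(0.0, 0.8, size=5) for w in true_weights])
    error_var = scores.var(axis=1, ddof=1).mean()                  # within persons
    between_var = scores.mean(axis=1).var(ddof=1) - error_var / 5  # between persons
    reliability = between_var / (between_var + error_var)
    print(f"{label}: reliability \u2248 {reliability:.2f}")
```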
Table 1
Reproducibility of the measurement of glenohumeral abduction of the shoulder

Parameters                      Affected shoulder    Nonaffected shoulder
PT_A: Mean (SD)                 69.49° (17.60°)      79.78° (7.60°)
PT_B: Mean (SD)                 68.69° (16.25°)      78.88° (8.38°)
Mean_diff_AB (SD_diff_AB)       0.80° (10.00°)       0.90° (9.58°)
PT_A vs. PT_B: % within 5°      43%                  43%
PT_A vs. PT_B: % within 10°     72%                  72%
Limits of agreement_AB          −18.80° to 20.40°    −17.88° to 19.68°
ICC_agreement_AB                0.83                 0.28

Abbreviations: ICC: intraclass correlation coefficient; Mean_diff_AB: mean of the differences between PT_A and PT_B; PT_A: physical therapist A; PT_B: physical therapist B; SD: standard deviation; SD_diff_AB: standard deviation of the differences between PT_A and PT_B.
Fig. 1. Five repeated measurements of the body weights of three persons (●, ■, and ▲). [Figure not reproduced; the horizontal axis shows body weight in kilograms, roughly 60–80 kg.]
If the measurement error is small, compared to the variability between persons, the reliability parameter approaches 1. This means that the discrimination of the persons is hardly affected by measurement error, and thus the reliability is high (persons ● and ■ in Fig. 1). If the measurement error is large, compared to the variability between persons, the ICC value becomes smaller. For example, if the measurement error equals the variability between persons, the ICC becomes 0.5. This means that the discrimination will be affected by the measurement error (e.g., persons ■ and ▲ in Fig. 1). The ICC is a ratio, ranging in value between 0 (representing a totally unreliable measurement) and 1 (implying perfect reliability).
We now turn back to our example. The percentage of scores of both PTs within 5° and 10° is a parameter of agreement: it estimates the measurement error. Note that it remains unclear whether this measurement error originates from the PTs, for example, because of a difference in the way they use the inclinometer; from variation within the patients, who differ in their range of motion at the two moments of measurement; or from a combination of PTs and patients, for instance because of variation in the way the patients are motivated by the PTs. However, this measurement error is totally independent of the variability between persons. In our example, the agreement of the scores of the two PTs is approximately the same for the affected and the nonaffected shoulder. The ICC, which is a reliability parameter, relates this measurement error to the variability between persons in the population sample under study. The higher value of the ICC for the affected shoulder is explained by the higher variability between persons in glenohumeral abduction of the affected shoulder, compared to the nonaffected shoulder. This can be seen from the much larger standard deviation in the measurements of the affected shoulder, compared to the standard deviation for the nonaffected shoulder (first two lines in Table 1): patients all have maximum movement ability in the arm on the nonaffected side, but they differ considerably in the range of motion of the arm on the affected side. As the variability in scores for the affected shoulders is greater, these can more easily be distinguished, despite the same magnitude of measurement error. This clearly illustrates the difference between agreement and reliability parameters.
4. Agreement parameters are neglected in medical sciences

In the 1980s Guyatt et al. [8] clearly emphasized the distinction between reliability and agreement parameters. They explained that reliability parameters are required for instruments that are used for discriminative purposes and agreement parameters are required for those that are used for evaluative purposes. With a hypothetic example they eloquently demonstrated that discriminative instruments require a high level of reliability: that is, the measurement error should be small in comparison to the variability between the persons that the instrument needs to distinguish. Thus, if the differences between persons are large, a certain amount of measurement error is acceptable. For an evaluative measurement instrument the variability between persons in the population sample does not matter at all; only the measurement error is important. This measurement error should be smaller than the improvements or deteriorations that one wants to detect. If the measurement error is large, then small changes cannot be distinguished from measurement error. The smaller the measurement error, the smaller the changes that can be detected beyond measurement error.
In medical sciences measurement instruments are often used to evaluate changes over time, either with or without interventions. Nevertheless, many researchers still prefer reliability parameters over agreement parameters. In two recent clinimetric reviews [9,10] we assessed the quality of measurement instruments in terms of reproducibility, and registered whether agreement and reliability parameters were assessed. The measurement instruments were questionnaires to assess shoulder disability [9] or quality of life (QoL) in visually impaired persons [10]. These instruments are mainly used to evaluate the effects of interventions or to monitor changes over time; thus, they are typically evaluative measurement instruments. In the review of QoL of visually impaired persons [10] we found 31 questionnaires. For 16 questionnaires a reliability parameter was reported, but a parameter of agreement was presented for only seven questionnaires. For all 16 shoulder disability questionnaires a parameter of reliability was presented, but an additional parameter of agreement was presented for only six questionnaires [9]. Apparently, agreement parameters have not yet struck root in medical sciences.
Streiner and Norman [1] argue that there is no need for a special parameter for measurement error, because it can be derived from the ICC formula. However, this is only true if all the components of the ICC formula are presented. Usually only the ICC value is provided, without even mentioning which ICC formula has been used [11], that is, with or without inclusion of the systematic difference between measurements. Even more important, authors who present only reliability parameters and no parameters of agreement usually draw the wrong conclusions: they rely solely on reliability parameters when they should have relied on parameters of agreement, because evaluation is at issue.
5. Relationship between the agreement and reliability parameters

The relationship between parameters of agreement and reliability can best be illustrated by elaborating on the variances that are involved in the ICC formulas. Therefore, we first need to explain the meaning of the variance components [12]. Variance (σ²) is the statistical term that is used to indicate variability.
The variance in observed scores can be subdivided into the variance in the objects under study, in our example the persons (σ²_p), the variance in observers, that is, the two different PTs (σ²_pt), and the interaction between persons and PTs. We will call this latter term the residual variance (σ²_residual).* The variance in persons (σ²_p) represents the variability between persons, and σ²_pt represents the variance due to systematic differences between PT_A and PT_B. The measurement error [error variance (σ²_error)] consists of either σ²_residual or (σ²_pt + σ²_residual), depending on whether or not one wants to take into account systematic differences between the measurements (in our example, between PT_A and PT_B). These systematic differences are usually considered to be part of the measurement error, because in practice the measurements are performed by different PTs, and one is interested in the real values of the differences between the repeated measurements. However, if one is only interested in the ranking of patients, the systematic differences between the PTs are not important, and the error variance contains only σ²_residual [12].
The ICCs, which relate the measurement error to the variability between persons, are represented by the following formulas, for ICC_agreement and ICC_consistency [11], respectively:

ICC_agreement = σ²_p / (σ²_p + σ²_pt + σ²_residual)

ICC_consistency = σ²_p / (σ²_p + σ²_residual)
We realize that these specifications "agreement" and "consistency" for the type of ICC are highly confusing, because both terms have other meanings in the field of reproducibility. For example, consistency is sometimes used as a synonym of reproducibility. As this terminology of ICC types is used in general handbooks and landmark papers [11,12], we do not want to deviate from it. ICC_agreement has the extra term σ²_pt in the denominator to take the systematic difference between the PTs into account; ICC_consistency ignores systematic differences. Both types of ICC are dependent on the heterogeneity of the population sample with respect to the characteristic under study. We want to stress that both ICC_agreement and ICC_consistency are reliability parameters (and not agreement parameters), although the term ICC_agreement suggests otherwise.
The measurement error is represented by the standard error of measurement (SEM), which equals the square root of the error variance: SEM = √(σ²_error). This means that SEM_agreement = √(σ²_pt + σ²_residual) and SEM_consistency = √(σ²_residual). The SEM is a suitable parameter of agreement.
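For readers who want to apply these formulas to their own data, the following is a minimal Python sketch (our illustration, not part of the original article). It estimates the variance components for a complete persons × raters table using the classical two-way ANOVA mean squares, which for balanced data coincide with the restricted-maximum-likelihood estimates used later in this article, provided the estimates are nonnegative.

```python
import numpy as np

def reproducibility_parameters(scores):
    """ICCs and SEMs for an n persons x k raters array (one score per cell),
    via ANOVA estimates of the variance components."""
    n, k = scores.shape
    grand = scores.mean()
    ss_p = k * ((scores.mean(axis=1) - grand) ** 2).sum()   # persons
    ss_r = n * ((scores.mean(axis=0) - grand) ** 2).sum()   # raters (PTs)
    ss_e = ((scores - grand) ** 2).sum() - ss_p - ss_r      # residual
    ms_p = ss_p / (n - 1)
    ms_r = ss_r / (k - 1)
    ms_e = ss_e / ((n - 1) * (k - 1))
    var_resid = ms_e                          # sigma^2_residual
    var_p = max((ms_p - ms_e) / k, 0.0)       # sigma^2_p
    var_rater = max((ms_r - ms_e) / n, 0.0)   # sigma^2_pt
    return {
        "ICC_agreement": var_p / (var_p + var_rater + var_resid),
        "ICC_consistency": var_p / (var_p + var_resid),
        "SEM_agreement": float(np.sqrt(var_rater + var_resid)),
        "SEM_consistency": float(np.sqrt(var_resid)),
    }
```

With two raters (k = 2), as in the shoulder example, this reproduces the ICC and SEM formulas above exactly.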
6. Illustration of ICC and SEM calculations in the example

Table 2 presents the values of the variance components for the affected and nonaffected shoulder. The variance components are estimated with SPSS (version 10.1), with the range of motion values as dependent variable and persons and PTs as random factors, using the restricted maximum likelihood method. From these variance components, the above-mentioned SEMs can be calculated. For the affected shoulder:

SEM_agreement_AB = √(σ²_pt_AB + σ²_residual) = √(0 + 49.98) = 7.07

SEM_consistency_AB = √(σ²_residual) = √49.98 = 7.07
The ICC_agreement_AB for the affected shoulder can be calculated as follows:

ICC_agreement_AB = σ²_p / (σ²_p + σ²_pt_AB + σ²_residual) = 236.93 / (236.93 + 0.00 + 49.98) = 0.83

And for the nonaffected shoulder:

ICC_agreement_AB = σ²_p / (σ²_p + σ²_pt_AB + σ²_residual) = 18.08 / (18.08 + 0.11 + 45.91) = 0.28
In this example, ICC_agreement and ICC_consistency have roughly the same value, because the systematic differences between PT_A and PT_B were small: σ²_pt_AB is almost 0. Therefore, we introduce a hypothetic physical therapist C (PT_C), who scores the range of motion of every patient 5° lower than PT_B. We added the (hypothetic) scores of PT_C to our dataset and recalculated the variance components (Table 2).
Table 2
Values of the variance components for the affected and nonaffected shoulder

Variance components    Affected shoulder    Nonaffected shoulder
σ²_p                   236.93               18.08
σ²_pt_AB               0.00                 0.11
σ²_pt_AC               16.50                17.09
σ²_residual            49.98                45.91

σ²_pt_AB represents error due to systematic differences between PT_A and PT_B; σ²_pt_AC represents error due to systematic differences between PT_A and PT_C.
* σ²_residual is sometimes expressed as σ²_p×pt, and represents the interaction between PTs and persons. As explained in an earlier paragraph, this variance component cannot be disentangled.
From now on we proceed with the example of PT_A and PT_C, because this better illustrates the influence of systematic differences on the parameters of agreement and reliability. We will present only the calculations for the affected shoulder in the text. The results for both shoulders are presented in Table 3.

ICC_agreement_AC = σ²_p / (σ²_p + σ²_pt_AC + σ²_residual) = 236.93 / (236.93 + 16.50 + 49.98) = 0.78

ICC_consistency_AC = σ²_p / (σ²_p + σ²_residual) = 236.93 / (236.93 + 49.98) = 0.83
Note that ICC_consistency is not influenced by the systematic differences, but ICC_agreement_AC becomes smaller, because for this parameter the systematic differences between PT_A and PT_C (σ²_pt_AC) are included in the measurement error.
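As a quick check, the Table 2 components reproduce these published values; a minimal sketch:

```python
# Variance components from Table 2 (affected shoulder, PT_A vs. PT_C).
var_p, var_pt_ac, var_resid = 236.93, 16.50, 49.98

icc_agreement_ac = var_p / (var_p + var_pt_ac + var_resid)
icc_consistency_ac = var_p / (var_p + var_resid)
print(round(icc_agreement_ac, 2), round(icc_consistency_ac, 2))  # 0.78 0.83
```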
7. Three ways to obtain SEM values

To facilitate and encourage the use of agreement parameters we will demonstrate how agreement parameters can be derived from the ICC formula, or can be calculated in other ways.

1. SEM values can easily be derived from the ICC formula, if all variance components are presented. In that case, the reader can calculate the ICC of his/her own choice. The SEM is calculated as √(σ²_error), which equals √(σ²_pt + σ²_residual) if one wishes to take the systematic differences between the PTs into account; otherwise, it equals √(σ²_residual).

SEM_agreement_AC = √(σ²_error) = √(σ²_pt_AC + σ²_residual) = √(16.50 + 49.98) = 8.15

SEM_consistency_AC = √(σ²_error) = √(σ²_residual) = √49.98 = 7.07
2. The ICC formula can be transformed to SEM = s√(1 − ICC) [1], in which s represents the total standard deviation (i.e., the square root of the denominator of the reliability formula). Only SEM_consistency can be reproduced this way, by imputing the pooled SD of the first and second assessment for s, and using ICC_consistency:

SEM_consistency_AC = s√(1 − ICC_consistency_AC) = 16.94 × √(1 − 0.826) = 7.07

Note that SEM_agreement_AC cannot be obtained in this way, because systematic errors are not represented in the pooled SD. Using ICC_agreement and the pooled SD of the first and second assessment for s would yield:

SEM_agreement_AC = s√(1 − ICC_agreement_AC) = 16.94 × √(1 − 0.781) = 7.93 ≠ 8.15
The formula SEM = s√(1 − ICC) is often used if information on the individual variance components is lacking. The ICC calculated in one study is then applied to a population sample for which the standard deviation is known. In this way, only a rough indication of the SEM can be obtained, because the ICC is heavily dependent on the heterogeneity of the characteristic under study in the population sample, and is thus, in theory, only applicable to a population with a similar heterogeneity.
3. The value of the SEM can also be derived by dividing the SD of the differences between two measurements (SD_diff) by √2. The factor √2 is included because it concerns the difference between two measurements, and errors occur in both measurements. Note that the SEM obtained by this formula is again SEM_consistency, because systematic error is not included in the SD_diff; thus, SEM_agreement_AC cannot be calculated in this way either.

SEM_consistency_AC = SD_diff_AC/√2 = 10.00/√2 = 7.07

SEM_consistency_AB gives the same result, because PT_B and PT_C only differed by a systematic value. The numerical agreement of the three routes is verified in the sketch below.
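The following sketch (our addition, using the published values for the affected shoulder) checks that the three routes give the same SEM_consistency, and that route 2 fails for SEM_agreement:

```python
import math

var_pt_ac, var_resid = 16.50, 49.98  # variance components (Table 2)
pooled_sd = 16.94                    # pooled SD of the two assessments
icc_consistency_ac = 0.826           # unrounded ICC_consistency_AC
icc_agreement_ac = 0.781             # unrounded ICC_agreement_AC
sd_diff_ac = 10.00                   # SD of the paired differences

# Route 1: from the variance components.
print(round(math.sqrt(var_pt_ac + var_resid), 2))  # 8.15 (SEM_agreement_AC)
print(round(math.sqrt(var_resid), 2))              # 7.07 (SEM_consistency_AC)

# Route 2: SEM = s * sqrt(1 - ICC); valid for SEM_consistency only.
print(round(pooled_sd * math.sqrt(1 - icc_consistency_ac), 2))  # 7.07
print(round(pooled_sd * math.sqrt(1 - icc_agreement_ac), 2))    # 7.93, not 8.15

# Route 3: from the SD of the differences; again SEM_consistency.
print(round(sd_diff_ac / math.sqrt(2), 2))                      # 7.07
```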
8. Typical parameters for agreement and reliability

For repeated measurements on a continuous scale, as in our example, an ICC is the most appropriate reliability parameter. An extensive overview of the various ICC formulas is provided by McGraw and Wong [11].

Table 3
Reproducibility of measurement of PT_A and PT_C

Parameters                               Affected shoulder    Nonaffected shoulder
PT_A: Mean (SD)                          69.49° (17.60°)      79.78° (7.60°)
PT_C: Mean (SD)                          63.69° (16.25°)      73.88° (8.38°)
PT_A − PT_C: Mean_diff_AC (SD_diff_AC)   5.80° (10.00°)       5.90° (9.58°)
ICC_agreement_AC                         0.78                 0.22
ICC_consistency_AC                       0.83                 0.28
SEM_agreement_AC                         8.15°                7.94°
SEM_consistency_AC                       7.07°                6.78°
Limits of agreement_AC                   −13.80° to 25.40°    −12.88° to 24.68°

Abbreviations: ICC: intraclass correlation coefficient; Mean_diff_AC: mean of the differences between PT_A and PT_C; PT_A: physical therapist A; PT_C: physical therapist C; SD: standard deviation; SD_diff_AC: standard deviation of the differences between PT_A and PT_C.
In our example, agreement was expressed as the percentage of observations lying between predefined values (Table 1). Presentation in this way makes sense in clinical practice, because every PT knows what 5° and 10° mean; this measure was chosen because it can easily be interpreted by PTs [3]. However, the SEM is usually the basic parameter of agreement for measurements on a continuous scale. The method proposed by Bland and Altman [4], which assesses the limits of agreement, is also frequently used. These limits of agreement can be directly derived from the SEM, because SD_diff = √2 × SEM_consistency.
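A minimal sketch of the Bland–Altman computation (our illustration; the function and variable names are ours), which also shows the link with SEM_consistency:

```python
import numpy as np

def bland_altman_limits(x1, x2):
    """95% limits of agreement for two paired series of measurements."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    mean_diff = diff.mean()
    sd_diff = diff.std(ddof=1)       # sd_diff = sqrt(2) * SEM_consistency
    return mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff
```

With Mean_diff_AC = 5.80° and SD_diff_AC = 10.00°, this gives 5.80 ± 1.96 × 10.00, that is, −13.80° to 25.40°, the limits reported in Table 3.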
9. Clinical interpretation

Agreement parameters are expressed on the actual scale of measurement, and not, like reliability parameters, as a dimensionless value between 0 and 1. This is an important advantage for clinical interpretation. If weights are measured in kilograms, the dimension of the SEM is kilograms. For example, if we know that a weighing scale has an SEM of 300 g, we know that we can use it to monitor adult body weight, because changes of less than 1 kilogram are not important. The smallest detectable change (SDC) is based on this measurement error, and is defined as 1.96 × √2 × SEM (the term smallest detectable difference, SDD, is also used for this purpose). With an SEM of 300 g, the SDC is 1.96 × √2 × 300 g = 832 g. Obviously, one cannot use this scale to weigh babies or to weigh flour in the kitchen, because in these instances changes of less than 800 g are very important. The measurement error alone provides useful information when there is a clear conception of the differences that are important.
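As a quick check of the weighing-scale arithmetic (our sketch):

```python
import math

def smallest_detectable_change(sem):
    """SDC = 1.96 * sqrt(2) * SEM: the smallest change that exceeds
    measurement error with 95% confidence."""
    return 1.96 * math.sqrt(2) * sem

print(round(smallest_detectable_change(300)))  # 832 (grams, for SEM = 300 g)
```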
The situation is different in the case of unfamiliarity with scores. For example, if a new multi-item questionnaire is used to measure functional status on a scale from 0 to 50, an orthopaedic surgeon may want to know what a value of 14 points and an SEM of 2 points mean, because she has no idea how many points of change represent clinically relevant change. By presenting an ICC she will know whether the instrument is able to discriminate between patients in the sample, but she still does not know whether the instrument is suitable for monitoring the functional status of her patients over time. This requires more information about the interpretation of scores. By assessing the scores of groups of mildly, moderately, and severely disabled patients a feeling for the meaning of scores will arise. Comparisons with other instruments will provide further insight into the meaning of values on the new measurement instrument. The assessment of minimally important changes in various measurements will also contribute to insight with regard to which (changes in) scores are clinically relevant [13,14]. Only this information makes it possible to assess whether the agreement parameter of a measurement instrument is sufficient to detect clinically relevant changes.
10. Conclusion

In this article we have shown the important difference between the parameters of reliability and agreement, and their relationship. Agreement parameters will be more stable over different population samples than reliability parameters, as we observed in our shoulder example, in which the SEM was quite similar for the affected and the nonaffected shoulder. Reliability parameters are highly dependent on the variation in the population sample, and are only generalizable to samples with a similar variation. Reliability is clearly a characteristic of the performance of an instrument in a certain population sample. Agreement is more a characteristic of the measurement instrument itself. Agreement parameters are preferable in all situations in which the instrument will be used for evaluation purposes, which is often the case in medical research. Researchers and readers should be eager to apply and interpret the parameters of agreement and reliability correctly.
References

[1] Streiner DL, Norman GR. Health measurement scales: a practical guide to their development and use. 3rd ed. New York: Oxford University Press; 2003.
[2] McDowell I, Newell C. Measuring health: a guide to rating scales and questionnaires. 2nd ed. New York: Oxford University Press; 1996.
[3] De Winter AF, Heemskerk MAMB, Terwee CB, Jans MP, Van Schaardenburg D, Scholten RJPM, Bouter LM. Inter-observer reproducibility of range of motion in patients with shoulder pain using a digital inclinometer. BMC Musculoskelet Disord 2004;5:18.
[4] Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;i:307–10.
[5] Nunnally JC, Bernstein IH. Psychometric theory. 3rd ed. New York: McGraw-Hill; 1994.
[6] Stratford PW, Goldsmith CH. Use of the standard error as a reliability index of interest: an applied example using elbow flexor strength data. Phys Ther 1997;77:745–50.
[7] De Vet HCW. Observer reliability and agreement. In: Armitage P, Colton T, editors. Encyclopedia of biostatistics, Vol. 4. Chichester: John Wiley & Sons; 1998. p. 3123–8.
[8] Guyatt G, Walter S, Norman G. Measuring change over time: assessing the usefulness of evaluative instruments. J Chronic Dis 1987;40:171–8.
[9] Bot SD, Terwee CB, Van der Windt DA, Bouter LM, Dekker J, De Vet HC. Clinimetric evaluation of shoulder disability questionnaires: a systematic review of the literature. Ann Rheum Dis 2004;63:335–41.
[10] De Boer MR, Moll AC, De Vet HC, Terwee CB, Volker-Dieben HJ, Van Rens GH. Psychometric properties of vision-related quality of life questionnaires: a systematic review. Ophthal Physiol Opt 2004;24:257–73.
[11] McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychol Methods 1996;1:30–46.
[12] Shavelson RJ, Webb NM. Generalizability theory: a primer. London: Sage Publications; 1991.
[13] Crosby RD, Kolotkin RL, Williams GR. Defining clinically meaningful change in health-related quality of life. J Clin Epidemiol 2003;56:395–407.
[14] Testa MA. Interpretation of quality-of-life outcomes: issues that affect magnitude and meaning. Med Care 2000;38:II-166–74.