When to use agreement versus reliability measures

Henrica C.W. de Vet a,*, Caroline B. Terwee a, Dirk L. Knol a,b, Lex M. Bouter a

a Institute for Research in Extramural Medicine, VU University Medical Center, Van der Boechorststraat 7, 1081 BT Amsterdam, The Netherlands
b Department of Clinical Epidemiology and Biostatistics, VU University Medical Center, Amsterdam, The Netherlands

Accepted 25 October 2005
Abstract

Background: Reproducibility concerns the degree to which repeated measurements provide similar results. Agreement parameters assess how close the results of the repeated measurements are, by estimating the measurement error in repeated measurements. Reliability parameters assess whether study objects, often persons, can be distinguished from each other, despite measurement errors. In that case, the measurement error is related to the variability between persons. Consequently, reliability parameters are highly dependent on the heterogeneity of the study sample, while the agreement parameters, based on measurement error, are more a pure characteristic of the measurement instrument.

Methods and Results: Using an example of an interrater study, in which different physical therapists measure the range of motion of the arm in patients with shoulder complaints, the differences and relationships between reliability and agreement parameters for continuous variables are illustrated.

Conclusion: If the research question concerns the distinction of persons, reliability parameters are the most appropriate. But if the aim is to measure change in health status, which is often the case in clinical practice, parameters of agreement are preferred. © 2006 Elsevier Inc. All rights reserved.

Keywords: Agreement; Measurement error; Measurement instruments; Reliability; Repeated measurements; Reproducibility
1. Introduction
Outcome measures in medical sciences may concern the
assessment of radiographs and other imaging techniques,
biopsy readings, the results of laboratory tests, the findings
of physical examinations, or the scores on questionnaires
collecting information, for example, on functional limita-
tions, pain coping styles, and quality of life. An essential
requirement of all outcome measures is that they are valid
and reproducible or reliable [1,2].
Reproducibility concerns the degree to which repeated
measurements in stable study objects, often persons, pro-
vide similar results. Repeated measurements may differ be-
cause of biologic variation in persons, because even stable
characteristics often show small day-to-day differences, or
follow a circadian rhythm. Other sources of variation may
originate from the measurement instrument itself, or the
circumstances under which the measurements take place.
For instance, some instruments may be temperature depen-
dent, or the mood of a respondent may influence the
answers on a questionnaire. Measurements based on assess-
ments made by clinicians may be influenced by intrarater or
interrater variation.
This article first presents an example of an interrater
study, then describes the concepts underlying various repro-
ducibility parameters, which can be distinguished in reli-
ability and agreement parameters. The primary aim of
this article is to demonstrate the relationship and the impor-
tant difference between parameters of reliability and agree-
ment, and to provide recommendations for their use in
medical sciences.
2. An example
In an interrater study on the range of motion of a painful shoulder, different reproducibility parameters were used to present the results [3]. To assess the limitations in passive glenohumeral abduction movement, the range of motion of the arm was measured with a digital inclinometer, and expressed in degrees. Two physical therapists (PT_A and PT_B) measured the range of motion of the affected and the nonaffected shoulder in 155 patients with shoulder complaints. Table 1 presents the results in terms of means and standard deviations, percentages of agreement within 5° and 10°, limits of agreement, and intraclass correlation coefficients (ICC) [3].

* Corresponding author. Tel.: +31 20 444 8176; fax: +31 20 444 6775. E-mail address: hcw.devet@vumc.nl (H.C.W. de Vet).

Journal of Clinical Epidemiology 59 (2006) 1033–1039. doi:10.1016/j.jclinepi.2005.10.015
The first two lines in Table 1 present the means and standard deviations of the scores assessed by PT_A and PT_B for the affected and nonaffected shoulder. The standard deviations (SD) show the variability in the results between the patients, in which the heterogeneity of the study sample is reflected with regard to the characteristic under study. The third line presents the mean differences (Mean_diff) between the scores of PT_A and PT_B, and the SDs of these differences (SD_diff). The fourth and fifth lines present agreement parameters by reporting the percentages of patients for whom the scores of PT_A and PT_B differed less than 5° and 10°, respectively. For 43% of the patients the scores of PT_A and PT_B were within the 5° range, and for 72% they were within the 10° range, for both the affected and the nonaffected shoulder. The limits of agreement, calculated according to the Bland and Altman method [4], are also about similar for both shoulders: −18.80° to 20.40° for the affected shoulder and −17.88° to 19.68° for the nonaffected shoulder. The last line shows the ICC, which is a parameter of reliability. There is quite a difference in the value of the ICC for the two shoulders: 0.83 for the affected shoulder and 0.28 for the nonaffected shoulder. For interpretation, the physical therapists would be quite satisfied with an agreement percentage of 72% of the patients within the 10° range, while the ICC (>0.7 is generally considered good [5]) shows a satisfactory value for the affected shoulder, but not for the nonaffected shoulder. The explanation for these apparently contradictory results can be found in the conceptual difference between the two types of parameters.
3. Conceptual difference between agreement
and reliability parameters
In the literature, agreement and reliability parameters
are often used interchangeably, although some authors have
pointed out the differences [6,7].
Agreement and reliability parameters focus on two dif-
ferent questions:
1. ‘‘How good is the agreement between repeated mea-
surements?’’ This concerns the measurement error,
and assesses exactly how close the scores for repeated
measurements are.
2. ‘‘How reliable is the measurement?’’ In other words,
how well can patients be distinguished from each
other, despite measurement errors. In this case, the
measurement error is related to the variability be-
tween study objects.
As an umbrella term for the concepts of agreement and
reliability we use the term ‘‘reproducibility’’ [7], because
both concepts concern the question of whether measure-
ment results are reproducible in test–retest situations. The
repetitions may concern different measurement moments,
different conditions, different raters, or the same rater at
different times.
Figure 1 visualizes the distinction between agreement and reliability. The weight of three persons is measured on 5 different days. The five measurements per person show some variation. The SD of the values of the repeated measurements of one person represents the agreement, and answers question 1 above. For reliability this measurement error is related to the variability between persons, and tells us how well they can be distinguished from each other. If the values of persons are distant (as for persons ● and ■), the measurement error will not affect discrimination of the persons, but if the values of persons are close (as for persons ■ and ▲) the measurement error will affect the ability to discriminate, and the reliability will be substantially lower.
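The situation in Fig. 1 can be sketched numerically. The following Python sketch uses invented weights (not the data behind the figure) and a simple plug-in one-way ICC, not the ANOVA-based ICC used later in the article: the within-person spread is identical for every person, yet the ICC is higher for the distant pair of persons than for the close pair.

```python
import statistics

def one_way_icc(scores_by_person):
    """Between-person variance / (between-person + within-person variance)."""
    means = [statistics.mean(s) for s in scores_by_person]
    within = statistics.mean(statistics.variance(s) for s in scores_by_person)
    between = statistics.variance(means)
    return between / (between + within)

# Five repeated weighings per person (kg); identical measurement error for all
distant = [[60.1, 59.9, 60.0, 60.2, 59.8],   # person 1
           [70.1, 69.9, 70.0, 70.2, 69.8]]   # person 2, far from person 1
close   = [[70.1, 69.9, 70.0, 70.2, 69.8],   # person 2
           [71.1, 70.9, 71.0, 71.2, 70.8]]   # person 3, close to person 2

print(one_way_icc(distant))  # close to 1: error barely hinders discrimination
print(one_way_icc(close))    # lower: same error, smaller between-person spread
```

The agreement (within-person SD) is the same in both pairs; only the between-person spread, and therefore the reliability, differs.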
A reliability parameter (e.g., the ICC) has as typical basic formula:

reliability = variability between study objects / (variability between study objects + measurement error)

The reliability parameter relates the measurement error to the variability between study objects, in our case persons.
Table 1
Reproducibility of the measurement of glenohumeral abduction of the shoulder

Parameters                        Affected shoulder    Nonaffected shoulder
PT_A: Mean (SD)                   69.49° (17.60°)      79.78° (7.60°)
PT_B: Mean (SD)                   68.69° (16.25°)      78.88° (8.38°)
Mean_diff_AB (SD_diff_AB)         0.80° (10.00°)       0.90° (9.58°)
PT_A vs. PT_B: % within 5°        43%                  43%
PT_A vs. PT_B: % within 10°       72%                  72%
Limits of agreement_AB            −18.80° to 20.40°    −17.88° to 19.68°
ICC_agreement_AB                  0.83                 0.28

Abbreviations: ICC: intraclass correlation coefficient; Mean_diff_AB: mean of the differences between PT_A and PT_B; PT_A: physical therapist A; PT_B: physical therapist B; SD: standard deviation; SD_diff_AB: standard deviation of the differences between PT_A and PT_B.
Fig. 1. Five repeated measurements of the body weights of three persons (●, ■, and ▲). [Horizontal axis: body weight in kilograms, 60–80.]
If the measurement error is small, compared to the variability between persons, the reliability parameter approaches 1. This means that the discrimination of the persons is hardly affected by measurement error, and thus the reliability is high (persons ● and ■ in Fig. 1). If the measurement error is large, compared to the variability between persons, the ICC value becomes smaller. For example, if the measurement error equals the variability between persons, the ICC becomes 0.5. This means that the discrimination will be affected by the measurement error (e.g., persons ■ and ▲ in Fig. 1). The ICC is a ratio ranging in value between 0 (representing a totally unreliable measurement) and 1 (implying perfect reliability).
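This ratio, and the special case where the error variance equals the between-person variance, can be checked in a few lines (a minimal sketch with arbitrary variance values):

```python
def reliability(var_between, var_error):
    """Reliability as the ratio of between-object variance to total variance."""
    return var_between / (var_between + var_error)

print(reliability(100.0, 1.0))    # error small relative to spread: close to 1
print(reliability(50.0, 50.0))    # error equals between-person variance: exactly 0.5
print(reliability(1.0, 100.0))    # error dominates: close to 0
```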
We now turn back to our example. The percentage of scores of both PTs within 5° and 10° is a parameter of agreement: it estimates the measurement error. Note that
it remains unclear whether this measurement error origi-
nates from the PTs, for example, because of a difference
in the way they use the inclinometer, from variation within
the patients who differ in their range of motion at the two
moments of measurements, or from a combination of PTs
and patients, for instance because of variation in the way
the patients are motivated by the PTs. However, this mea-
surement error is totally independent of the variability be-
tween persons. In our example, the agreement of the
scores of the two PTs is approximately the same for the
affected and the nonaffected shoulder. The ICC, which is a
reliability parameter, relates this measurement error to the
variability between persons in the population sample under
study. The higher value of the ICC for the affected shoulder
is explained by the higher variability between persons in
glenohumeral abduction of the affected shoulder, compared
to the nonaffected shoulder. This can be seen from the much
larger standard deviation in the measurements of the affected
shoulder, compared to the standard deviation for the non-
affected shoulder (first two lines in Table 1): patients all have
maximum movement ability in the arm on the nonaffected
side, but they differ considerably in the range of motion of
the arm on the affected side. As the variability in scores for
the affected shoulders is greater, these can more easily be
distinguished, despite the same magnitude of measurement
error. This clearly illustrates the difference between agree-
ment and reliability parameters.
4. Agreement parameters are neglected in medical
sciences
In the 1980s Guyatt et al. [8] clearly emphasized the dis-
tinction between reliability and agreement parameters.
They explained that reliability parameters are required for
instruments that are used for discriminative purposes and
agreement parameters are required for those that are used
for evaluative purposes. With a hypothetic example they el-
oquently demonstrated that discriminative instruments re-
quire a high level of reliability: that is, the measurement
error should be small in comparison to the variability be-
tween the persons that the instrument needs to distinguish.
Thus, if the differences between persons are large, a certain
amount of measurement error is acceptable. For an evalua-
tive measurement instrument the variability between per-
sons in the population sample does not matter at all; only
the measurement error is important. This measurement er-
ror should be smaller than the improvements or deteriora-
tions that one wants to detect. If the measurement error is
large, then small changes cannot be distinguished from
measurement error. The smaller the measurement error,
the smaller the changes that can be detected beyond
measurement error.
In medical sciences measurement instruments are often
used to evaluate changes over time, either with or without
interventions. Nevertheless, many researchers still prefer
reliability parameters over agreement parameters. In two
recent clinimetric reviews [9,10] we assessed the quality
of measurement instruments in terms of reproducibility.
We registered whether agreement and reliability parameters
were assessed. The measurement instruments were ques-
tionnaires to assess shoulder disability [9] or quality of life
(QoL) in visually impaired persons [10]. These instruments
are mainly used to evaluate the effects of interventions or
monitor changes over time. Thus, these are typically evalu-
ative measurement instruments. In the review of QoL of vi-
sually impaired persons [10] we found 31 questionnaires.
For 16 questionnaires a reliability parameter was reported,
but a parameter for agreement was presented for only seven
questionnaires. For all 16 shoulder disability questionnaires
a parameter of reliability was presented, but an additional parameter of agreement was presented for only six questionnaires [9]. Apparently, agreement parameters have
not yet struck root in medical sciences.
Streiner and Norman [1] argue that there is no need for
a special parameter for measurement error, because it can
be derived from the ICC formula. However, this is only true
if all the components of the ICC formula are presented.
Usually only the ICC value is provided, without even men-
tioning which ICC formula has been used [11], that is, with
or without inclusion of the systematic difference between
measurements. Even more important, authors who present
only reliability parameters and no parameters of agreement
usually draw the wrong conclusions: they rely solely on
reliability parameters when they should have relied on
parameters of agreement when evaluation is at issue.
5. Relationship between the agreement
and reliability parameters
The relationship between parameters of agreement and
reliability can best be illustrated by elaborating on the
variances that are involved in the ICC formulas. Therefore,
we first need to explain the meaning of the variance
components [12]. Variance (σ²) is the statistical term that is used to indicate variability.

The variance in observed scores can be subdivided into the variance in the objects under study, in our example the persons (σ²_p), the variance in observers (the two different PTs) (σ²_pt), and the interaction between persons and PTs. We will call this latter term the residual variance (σ²_residual).* The variance in persons (σ²_p) represents the variability between persons, and σ²_pt represents the variance due to systematic differences between PT_A and PT_B. The measurement error [error variance (σ²_error)] consists of either σ²_residual or of (σ²_pt + σ²_residual), depending on whether or not one wants to take into account systematic differences between the measurements (in our example PTs A and B). These systematic differences are usually considered to be part of the measurement error, because in practice, the measurements are performed by different PTs, and one is interested in the real values of the differences between the repeated measurements. However, if one is only interested in the ranking of patients, the systematic differences between the PTs are not important, and the error variance contains only σ²_residual [12].
The ICCs, which relate the measurement error to the variability between persons, are represented by the following formulas, for ICC_agreement and ICC_consistency [11], respectively:

ICC_agreement = σ²_p / (σ²_p + σ²_pt + σ²_residual)

ICC_consistency = σ²_p / (σ²_p + σ²_residual)

We realize that the specifications ''agreement'' and ''consistency'' for the type of ICC are highly confusing, because both terms have other meanings in the field of reproducibility. For example, consistency is sometimes used as a synonym of reproducibility. As this terminology of ICC types is used in general handbooks and landmark papers [11,12], we do not want to deviate from it. ICC_agreement has the extra term σ²_pt in the denominator to take the systematic difference between the PTs into account; ICC_consistency ignores systematic differences. Both types of ICCs are dependent on the heterogeneity of the population sample with respect to the characteristic under study. We want to stress that both ICC_agreement and ICC_consistency are reliability parameters (and not agreement parameters), although the term ICC_agreement suggests otherwise.

The measurement error is represented by the standard error of measurement (SEM) and equals the square root of the error variance: SEM = √(σ²_error). This means that SEM_agreement = √(σ²_pt + σ²_residual) and SEM_consistency = √(σ²_residual). The SEM is a suitable parameter of agreement.
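The ICC and SEM formulas above translate directly into code. A minimal sketch, with the variance components of the shoulder example (Table 2) used as a check:

```python
import math

def icc_agreement(var_p, var_pt, var_res):
    # systematic rater differences count as measurement error
    return var_p / (var_p + var_pt + var_res)

def icc_consistency(var_p, var_res):
    # systematic rater differences are ignored
    return var_p / (var_p + var_res)

def sem_agreement(var_pt, var_res):
    return math.sqrt(var_pt + var_res)

def sem_consistency(var_res):
    return math.sqrt(var_res)

# Affected shoulder, PT_A vs. PT_B: var_p = 236.93, var_pt = 0.00, var_res = 49.98
print(round(icc_agreement(236.93, 0.00, 49.98), 2))  # 0.83
print(round(sem_consistency(49.98), 2))              # 7.07
```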
6. Illustration of ICC and SEM calculations in the example

Table 2 presents the values of the variance components for the affected and nonaffected shoulder. The variance components are estimated with SPSS (version 10.1), with the range of motion values as dependent variable and persons and PTs as random factors, using the restricted maximum likelihood method. From these variance components, the above-mentioned SEMs can be calculated. For the affected shoulder:

SEM_agreement_AB = √(σ²_pt_AB + σ²_residual) = √(0 + 49.98) = 7.07

SEM_consistency_AB = √(σ²_residual) = √49.98 = 7.07

The ICC_agreement_AB for the affected shoulder can be calculated as follows:

ICC_agreement_AB = σ²_p / (σ²_p + σ²_pt_AB + σ²_residual) = 236.93 / (236.93 + 0.00 + 49.98) = 0.83

And for the nonaffected shoulder:

ICC_agreement_AB = σ²_p / (σ²_p + σ²_pt_AB + σ²_residual) = 18.08 / (18.08 + 0.11 + 45.91) = 0.28
In this example, ICC_agreement and ICC_consistency have roughly the same value, as the systematic differences between PT_A and PT_B were small: σ²_pt_AB is almost 0. Therefore, we introduce a hypothetic physical therapist C (PT_C), who scores the range of motion of every patient 5° lower than PT_B. We added the (hypothetic) scores of PT_C to our dataset and recalculated the variance components (Table 2).
Table 2
Values of the variance components for the affected and nonaffected shoulder

Variance components    Affected shoulder    Nonaffected shoulder
σ²_p                   236.93               18.08
σ²_pt_AB               0.00                 0.11
σ²_pt_AC               16.50                17.09
σ²_residual            49.98                45.91

σ²_pt_AB represents error due to systematic differences between PT_A and PT_B. σ²_pt_AC represents error due to systematic differences between PT_A and PT_C.
* σ²_residual is sometimes expressed as σ²_p×pt, and represents the interaction between PTs and persons. As explained in an earlier paragraph, this variance component cannot be disentangled from the random error.
From now on we proceed with the example of PT_A and PT_C, because this will better illustrate the influence of systematic differences on the parameters for agreement and reliability. We will present only the calculations for the affected shoulder in the text. The results for both shoulders are presented in Table 3.

ICC_agreement_AC = σ²_p / (σ²_p + σ²_pt_AC + σ²_residual) = 236.93 / (236.93 + 16.50 + 49.98) = 0.78

ICC_consistency_AC = σ²_p / (σ²_p + σ²_residual) = 236.93 / (236.93 + 49.98) = 0.83
Note that ICC_consistency is not influenced by the systematic differences, but ICC_agreement_AC becomes smaller, because for this parameter the systematic differences between PT_A and PT_C (σ²_pt_AC) are included in the measurement error.
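The variance components in Table 2 were estimated with REML in SPSS. For a balanced persons × raters table with one score per cell, a method-of-moments (two-way ANOVA) estimate yields the same kind of decomposition and can be sketched in plain Python. The tiny data table below is invented for illustration (three persons, two raters, the second rater scoring systematically 2 units higher), not the study data:

```python
def variance_components(table):
    """Two-way ANOVA estimates of person, rater, and residual variance.

    table: list of rows, one per person; columns are raters (balanced design)."""
    n, k = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n * k)
    person_means = [sum(row) / k for row in table]
    rater_means = [sum(table[i][j] for i in range(n)) / n for j in range(k)]
    ss_p = k * sum((m - grand) ** 2 for m in person_means)
    ss_r = n * sum((m - grand) ** 2 for m in rater_means)
    ss_tot = sum((x - grand) ** 2 for row in table for x in row)
    ss_res = ss_tot - ss_p - ss_r
    ms_p = ss_p / (n - 1)
    ms_r = ss_r / (k - 1)
    ms_res = ss_res / ((n - 1) * (k - 1))
    var_res = ms_res
    var_p = max((ms_p - ms_res) / k, 0.0)   # truncate negative estimates at 0
    var_r = max((ms_r - ms_res) / n, 0.0)
    return var_p, var_r, var_res

# Rater 2 scores every person exactly 2 units higher: pure systematic difference
print(variance_components([[0, 2], [10, 12], [20, 22]]))
```

For balanced data without negative estimates, these moment estimates coincide with what REML produces; with unbalanced or missing data a mixed-model routine is needed instead.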
7. Three ways to obtain SEM values

To facilitate and encourage the use of agreement parameters we will demonstrate how agreement parameters can be derived from the ICC formula, or can be calculated in other ways.

1. SEM values can easily be derived from the ICC formula, if all variance components are presented. In that case, the reader can calculate the ICC of his/her own choice. SEM is calculated as √(σ²_error), which equals √(σ²_pt + σ²_residual) if one wishes to take the systematic differences between the PTs into account; otherwise, it equals √(σ²_residual).

SEM_agreement_AC = √(σ²_error) = √(σ²_pt_AC + σ²_residual) = √(16.50 + 49.98) = 8.15

SEM_consistency_AC = √(σ²_error) = √(σ²_residual) = √49.98 = 7.07
2. The ICC formula can be transformed to SEM = s√(1 − ICC) [1], in which s represents the total standard deviation (i.e., the square root of the denominator of the reliability formula). Only SEM_consistency can be reproduced this way, by imputing the pooled SD of the first and second assessment for s, and using ICC_consistency:

SEM_consistency_AC = s√(1 − ICC_consistency_AC) = 16.94 × √(1 − 0.826) = 7.07

Note that SEM_agreement_AC cannot be obtained in this way, because systematic errors are not represented in the pooled SD. Using ICC_agreement and the pooled SD of the first and second assessment for s would yield:

s√(1 − ICC_agreement_AC) = 16.94 × √(1 − 0.781) = 7.93 ≠ 8.15

The formula SEM = s√(1 − ICC) is often used if information on the individual variance components is lacking. The ICC calculated in one study is then applied to a population sample for which the standard deviation is known. In this way, only a rough indication of the SEM can be obtained, because the ICC is heavily dependent on the heterogeneity of the characteristic under study in the population sample, and is thus, in theory, only applicable for a population with a similar heterogeneity.
3. The value of the SEM can also be derived by dividing the SD of the differences between two measurements (SD_diff) by √2. The factor √2 is included because it concerns the difference between two measurements, and errors occur in both measurements. Note that the SEM obtained by this formula is again SEM_consistency, because systematic error is not included in SD_diff. Thus, SEM_agreement_AC cannot be calculated in this way either.

SEM_consistency_AC = SD_diff_AC / √2 = 10.00 / √2 = 7.07

SEM_consistency_AB gives the same result, because PT_B and PT_C only differed by a systematic value.
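With the numbers of the example (Tables 2 and 3), the three routes to SEM_consistency agree, while route 2 with ICC_agreement fails to reproduce SEM_agreement; a sketch:

```python
import math

var_p, var_pt_AC, var_res = 236.93, 16.50, 49.98  # affected shoulder, Table 2

# Way 1: directly from the variance components
sem1 = math.sqrt(var_res)

# Way 2: SEM = s * sqrt(1 - ICC_consistency), with s the total SD
# (here taken as sqrt(var_p + var_res), matching the pooled SD of 16.94)
icc_cons = var_p / (var_p + var_res)
sem2 = math.sqrt(var_p + var_res) * math.sqrt(1 - icc_cons)

# Way 3: SD of the paired differences divided by sqrt(2)
sd_diff = 10.00                                   # SD_diff_AC, Table 3
sem3 = sd_diff / math.sqrt(2)

print(round(sem1, 2), round(sem2, 2), round(sem3, 2))  # all 7.07
```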
8. Typical parameters for agreement and reliability
For repeated measurements on a continuous scale, as in
our example, an ICC is the most appropriate reliability
parameter. An extensive overview of the various ICC formulas is provided by McGraw and Wong [11].

Table 3
Reproducibility of measurement of PT_A and PT_C

Parameters                                Affected shoulder    Nonaffected shoulder
PT_A: Mean (SD)                           69.49° (17.60°)      79.78° (7.60°)
PT_C: Mean (SD)                           63.69° (16.25°)      73.88° (8.38°)
PT_A − PT_C: Mean_diff_AC (SD_diff_AC)    5.80° (10.00°)       5.90° (9.58°)
ICC_agreement_AC                          0.78                 0.22
ICC_consistency_AC                        0.83                 0.28
SEM_agreement_AC                          8.15°                7.94°
SEM_consistency_AC                        7.07°                6.78°
Limits of agreement_AC                    −13.80° to 25.40°    −12.88° to 24.68°

Abbreviations: ICC: intraclass correlation coefficient; Mean_diff_AC: mean of the differences between PT_A and PT_C; PT_A: physical therapist A; PT_C: physical therapist C; SD: standard deviation; SD_diff_AC: standard deviation of the differences between PT_A and PT_C.
In our example, agreement was expressed as the percentage of observations lying between predefined values (Table 1). Presentation in this way makes sense in clinical practice, because every PT knows what 5° and 10° means. This measure was chosen because it can easily be interpreted by PTs [3]. However, the SEM is usually the basic parameter of agreement for measurements on a continuous scale. A method proposed by Bland and Altman [4], which assesses the limits of agreement, is also frequently used. These limits of agreement can be directly derived from SD_diff = √2 × SEM_consistency.
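The Bland–Altman limits follow from the mean and SD of the paired differences. A sketch, checked against the PT_A vs. PT_B values for the affected shoulder in Table 1 (mean difference 0.80°, SD 10.00°):

```python
def limits_of_agreement(mean_diff, sd_diff):
    """95% limits of agreement: mean difference +/- 1.96 x SD of the differences."""
    half_width = 1.96 * sd_diff
    return mean_diff - half_width, mean_diff + half_width

lower, upper = limits_of_agreement(0.80, 10.00)
print(round(lower, 2), round(upper, 2))  # -18.8 20.4, as in Table 1
```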
9. Clinical interpretation

Agreement parameters are expressed on the actual scale of measurement, and not, like reliability parameters, as a dimensionless value between 0 and 1. This is an important advantage for clinical interpretation. If weights are measured in kilograms, the dimension of the SEM is kilograms. For example, if we know that a weighing scale has a SEM of 300 g, we know that we can use it to monitor adult body weight, because changes of less than 1 kilogram are not important. The smallest detectable change (SDC) is based on this measurement error, and is defined as 1.96 × √2 × SEM.* With an SEM of 300 g, the SDC is 1.96 × √2 × 300 g = 832 g. Obviously, one cannot use this scale to weigh babies or to weigh flour in the kitchen, because in these instances changes of less than 800 g are very important. The measurement error alone provides useful information when there is a clear conception of the differences that are important.
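The weighing-scale arithmetic above can be written out as a small sketch:

```python
import math

def smallest_detectable_change(sem):
    """SDC = 1.96 * sqrt(2) * SEM: the smallest change exceeding measurement error."""
    return 1.96 * math.sqrt(2) * sem

print(round(smallest_detectable_change(300)))  # 832 (grams), as in the text
```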
The situation is different in the case of unfamiliarity
with scores. For example, if a new multi-item questionnaire
is used to measure functional status on a scale from 0 to
50, an orthopaedic surgeon may want to know what a value
of 14 points and an SEM of 2 points means, because she
has no idea how many points of change represent clini-
cally relevant change. By presenting an ICC she will know
whether the instrument is able to discriminate between pa-
tients in the sample, but she still does not know whether
the instrument is suitable for monitoring the functional sta-
tus of her patients over time. This requires more informa-
tion about the interpretation of scores. By assessing the
scores of groups of mildly, moderately, and severely dis-
abled patients a feeling for the meaning of scores will
arise. Comparisons with other instruments will provide
further insight into the meaning of values on the new
measurement instrument. The assessments of minimally
important changes in various measurements will also
contribute to insight with regard to which (changes in)
scores are clinically relevant [13,14]. Only this informa-
tion makes it possible to assess whether the agreement pa-
rameter of a measurement instrument is sufficient to detect
clinically relevant changes.
10. Conclusion
In this article we have shown the important difference
between the parameters of reliability and agreement and
their relationship. Agreement parameters will be more sta-
ble over different population samples than reliability pa-
rameters, as we observed in our shoulder example, in
which the SEM was quite similar for the affected and the
nonaffected shoulder. Reliability parameters are highly de-
pendent on the variation in the population sample, and are
only generalizable to samples with a similar variation. Re-
liability is clearly a characteristic of the performance of an
instrument in a certain population sample. Agreement is
more a characteristic of the measurement instrument itself.
Agreement parameters are preferable in all situations in
which the instrument will be used for evaluation purposes,
which is often the case in medical research. Researchers and readers should take care to apply and interpret the parameters of agreement and reliability correctly.
References
[1] Streiner DL, Norman GR. Health Measurement Scales. A practical
guide to their development and use. 3rd ed. New York: Oxford Uni-
versity Press Inc.; 2003.
[2] McDowell I, Newell C. Measuring health. A guide to rating scales
and questionnaires. 2nd ed. New York: Oxford University Press
Inc.; 1996.
[3] De Winter AF, Heemskerk MAMB, Terwee CB, Jans MP, Van
Schaardenburg D, Scholten RJPM, Bouter LM. Inter-observer repro-
ducibility of range of motion in patients with shoulder pain using
a digital inclinometer. BMC Musculoskel Disord 2004;5:18.
[4] Bland JM, Altman DG. Statistical methods for assessing agreement
between two methods of clinical measurements. Lancet 1986;i:
307–10.
[5] Nunnally JC, Bernstein IH. Psychometric theory. 3rd ed. New York:
McGraw-Hill Inc.; 1994.
[6] Stratford PW, Goldsmith CH. Use of the standard error as a reliability
index of interest: an applied example using elbow flexor strength
data. Phys Ther 1997;77:745–50.
[7] De Vet HCW. Observer reliability and agreement. In: Armitage P,
Colton T, editors, Encyclopedia biostatistica, Vol 4. Chichester: John
Wiley & Sons, Ltd.; 1998. p. 3123–8.
[8] Guyatt G, Walter S, Norman G. Measuring change over time: assess-
ing the usefulness of evaluative instruments. J Chronic Dis 1987;40:
171–8.
[9] Bot SD, Terwee CB, Van der Windt DA, Bouter LM, Dekker J, De
Vet HC. Clinimetric evaluation of shoulder disability questionnaires:
a systematic review of the literature. Ann Rheum Dis 2004;63:
335–41.
[10] De Boer MR, Moll AC, De Vet HC, Terwee CB, Volker-Dieben HJ, Van Rens GH. Psychometric properties of vision-related quality of life questionnaires: a systematic review. Ophthal Physiol Opt 2004;24:257–73.

* The term 'smallest' detectable difference (SDD) is also used for this purpose.
[11] McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychol Methods 1996;1:30–46.
[12] Shavelson RJ, Webb NM. Generalizability theory. A primer. London: Sage Publications; 1991.
[13] Crosby RD, Kolotkin RL, Williams GR. Defining clinically meaning-
ful change in health-related quality of life. J Clin Epidemiol 2003;56:
395–407.
[14] Testa MA. Interpretation of quality-of-life outcomes. Issues that
affect magnitude and meaning. Med Care 2000;38:II-166–74.