Meta-analysis of Cohen's kappa


Cohen’s κ is the most important and most widely accepted measure of inter-rater reliability when the outcome of interest is measured on a nominal scale. The estimates of Cohen’s κ usually vary from one study to another due to differences in study settings, test properties, rater characteristics and subject characteristics. This study proposes a formal statistical framework for meta-analysis of Cohen’s κ to describe the typical inter-rater reliability estimate across multiple studies, to quantify between-study variation and to evaluate the contribution of moderators to heterogeneity. To demonstrate the application of the proposed statistical framework, a meta-analysis of Cohen’s κ is conducted for pressure ulcer classification systems. Implications and directions for future research are discussed.
Shuyan Sun
Keywords: Cohen's κ · Inter-rater reliability · Meta-analysis · Generalizability
In classical test theory proposed by Spearman (1904), an observed score X is expressed as the
true score T plus a random error of measurement e, i.e., X = T + e. Reliability is defined as the
squared correlation between observed scores and true scores (Lord and Novick 1968). It
indicates the extent to which scores produced by a particular measurement procedure are
consistent and reproducible (Thorndike 2005). Reliability is an unobserved property of scores
obtained from a sample on a particular test, not an inherent property of the test (Thompson
2002; Thompson and Vacha-Hasse 2000; Vacha-Hasse 1998; Vacha-Hasse et al. 2002).
Therefore, it is never appropriate to claim a test is reliable or unreliable in a research article.
Instead, researchers should state the scores are reliable or unreliable. Reliability estimates
usually vary from one study to another due to differences in study characteristics including
study settings, test properties, and subject characteristics. A test that yields reliable scores for
one group of subjects in this setting may fail to yield reliable scores for a different group of
subjects in another setting. Hence, understanding the generalizability of score reliability and
the factors affecting score reliability becomes an important methodological issue.
Researchers in education and psychology have been applying meta-analytic techniques
to reliability coefficients to investigate the generalizability of score reliability across
multiple studies (e.g., Capraro et al. 2001; Caruso 2000; Helms 1999; Henson et al. 2001;
Huynh et al. 2009; Miller et al. 2007; Rohner and Khaleque 2003; Vacha-Hasse 1998;
Viswesvaran and Ones 2000; Yin and Fan 2000). This methodology was proposed and
labeled as reliability generalization by Vacha-Hasse (1998). As an extension of validity
generalization (Hunter and Schmidt 1990; Schmidt and Hunter 1977), reliability gener-
alization can be used (a) to describe the typical reliability estimate of a given test across
different studies, (b) to describe the variability of reliability estimates across different
studies, and (c) to identify study characteristics that can explain the variability of reliability
estimates and to accumulate psychometric knowledge regarding study characteristics
(Vacha-Hasse 1998). Score reliability is affected by heterogeneity of subjects being tested,
and reliability estimates always change when the test is administered to a different sample
(Guilford and Fruchter 1978). Effect sizes in applied research studies are inherently
attenuated by unreliable scores (Baugh 2002; Henson 2001; Thompson 1994) and con-
sequently the statistical power of detecting a meaningful difference is lowered. Therefore,
an investigation of the generalizability of score reliability has very important implications
for psychometricians, statisticians and applied researchers who wish to better understand
score reliability and its influences on the appropriateness of test use, effect size and
statistical power (Vacha-Hasse et al. 2002).
Inter-rater reliability refers to the consistency of ratings given by different raters to the
same subject. It quantifies the extent to which raters agree on the relative ratings given to
subjects and serves as a measure of quality and accuracy of the rating process (Linacre
1989). Inter-rater reliability estimates usually vary across studies due to test properties (e.g.,
items, scaling and operational definitions), rater characteristics (e.g., knowledge, experi-
ence, qualification and training), study settings, and subject characteristics (Kraemer 1979;
Shrout 1998; Suen 1988). Substantial measurement errors from raters and rating procedures
will attenuate measurement precision and statistical power, and make meaningful treatment
effects more difficult to detect (Fleiss and Shrout 1977). When the outcome of interest is
measured on a nominal scale, Cohen’s κ is considered the most important (von Eye 2006)
and most widely accepted measure of inter-rater reliability (Brennan and Silman 1992; Sim
and Wright 2005; Zwick 1988), especially in the medical literature (Kraemer et al. 2004;
Viera and Garrett 2005). The frequent application of Cohen’s κ allows the possibility of
conducting meta-analyses to examine generalizability of inter-rater reliability across multiple
studies. This study first discusses the statistical framework for meta-analysis of
Cohen’s κ and then demonstrates its application by a meta-analysis for pressure ulcer
classification systems, a set of diagnostic tools in nursing and medical research.
1 Statistical framework for meta-analysis of Cohen’s κ
1.1 Assumptions of Cohen’s κ
The basic feature of Cohen’s κ is to consider two raters as alternative forms of a test, and
their ratings are analogous to the scores obtained from the test. Well known as a chance-corrected
measure of inter-rater reliability, Cohen’s κ determines whether the degree of
agreement between two raters is higher than would be expected by chance (Cohen 1960). It
assumes that (a) the subjects being rated are independent of each other, (b) the categories
of ratings are independent, mutually exclusive and collectively exhaustive, and (c) two
146 Health Serv Outcomes Res Method (2011) 11:145–163
raters operate independently. In addition to the three statistical assumptions, Cohen’s κ
further assumes that the “correctness” of ratings cannot be determined in a typical situation,
and raters are deemed equally competent to make the judgment on a priori grounds
(Cohen 1960, p. 38). Under these two practical assumptions, no restriction is placed on the
distribution of ratings over categories for the raters. In other words, Cohen’s κ allows the
marginal distributions of raters to differ (Banerjee et al. 1999). Marginal distribution is
defined as the set of underlying probabilities with which each rater uses the categories.
Block and Kraemer (1989) argued that by allowing marginal distributions to differ,
Cohen’s κ measures the association between two sets of ratings rather than the agreement
between two raters. A well-defined measure of agreement describes how well one rater’s
rating agrees with what another rater would have reported and indicates the generalizability
of a rating beyond the specific rater. When the true interest is agreement between raters
instead of mere association between ratings, the marginal distributions should not differ
too much. Therefore, it is very important to impose the assumption of homogeneity of
marginal distributions on Cohen’s κ (Blackman and Koval 2000; Block and Kraemer 1989;
Brennan and Prediger 1981; Zwick 1988).
1.2 Sampling distribution of Cohen’s κ
The formula for computing Cohen’s κ is expressed in Eq. 1.
$$\kappa = \frac{p_0 - p_e}{1 - p_e} \qquad (1)$$
where $p_0$ is percent agreement, defined as the proportion of subjects on which the raters
agree, and $p_e$ is chance agreement, defined as the proportion of agreement that would be
expected by chance. The calculation of Cohen’s κ is very simple and can be done from a
contingency table using a basic calculator. The upper limit of κ is +1.00, occurring when
and only when the two raters agree perfectly, which also requires the two raters to have exactly
the same marginal distribution. The lower limit of κ falls between 0 and −1.00, depending
on the dispersion of raters’ marginal distributions (Cohen 1960). A κ value of 0 indicates
that the agreement is merely due to chance. Negative values of κ arise when the observed
agreement is lower than would be expected by chance (Brennan and
Silman 1992). Although the meaning of κ with value 0 or 1 is quite clear, the interpretation
of intermediate values is less evident. The benchmarks for interpreting Cohen’s κ proposed
by Landis and Koch (1977a) are widely cited in the literature, with 0.8 to 1.0 indicating
almost perfect agreement, 0.6 to 0.8 as substantial, 0.4 to 0.6 as moderate, 0.2 to 0.4 as
fair, zero to 0.2 as slight and zero or lower as poor. Slightly different interpretations can be
found in Fleiss (1981) and Altman (1991). Stemler and Tsai (2008) proposed to use .50 as
the minimal Cohen’s κ for an acceptable level of inter-rater reliability.
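As a rough illustration of the Landis and Koch (1977a) benchmarks listed above, the following Python sketch maps a κ value to its verbal label. The function name `landis_koch_label` is hypothetical, and treating each upper bound as inclusive is an assumption of this sketch, because the ranges as stated share their end points.

```python
def landis_koch_label(kappa: float) -> str:
    """Return the Landis and Koch (1977a) verbal label for a kappa value."""
    if kappa > 1.0:
        raise ValueError("kappa cannot exceed 1")
    if kappa <= 0.0:
        return "poor"
    # Assumption of this sketch: each upper bound is inclusive.
    benchmarks = [
        (0.2, "slight"),
        (0.4, "fair"),
        (0.6, "moderate"),
        (0.8, "substantial"),
        (1.0, "almost perfect"),
    ]
    for upper, label in benchmarks:
        if kappa <= upper:
            return label
    return "almost perfect"


# The overall estimate reported later in this article is 0.53.
print(landis_koch_label(0.53))
```

Such labels are only a reporting convenience; the cut points themselves are conventions rather than statistical results.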
The mean and variance of κ were derived by Everitt (1968) as in Eqs. 2 and 3.
$$E(\kappa) = \frac{1}{1 - p_e}\left[E(p_0) - p_e\right] \qquad (2)$$
$$\mathrm{Var}(\kappa) = \frac{1}{(1 - p_e)^2}\,\mathrm{Var}(p_0) \qquad (3)$$
The exact variance of $p_0$ is very tedious to calculate. When $p_0$ is assumed to be
binomially distributed, the approximate variance of κ is simplified as in Eq. 4 (Cohen
1960; Everitt 1968).
$$\mathrm{Var}(\kappa) = \frac{1}{(1 - p_e)^2}\cdot\frac{p_0(1 - p_0)}{n} = \frac{p_0(1 - p_0)}{n(1 - p_e)^2} \qquad (4)$$
The sampling distribution of κ appears to be markedly asymmetric when n is small
(Blackman and Koval 2000; Block and Kraemer 1989; Koval and Blackman 1996). With a
large enough n, the sampling distribution of κ is approximately normal so that confidence
intervals (CI) and significance tests can be easily constructed using standard normal distribution
quantiles. For instance, a 95% CI of Cohen’s κ can be constructed from
$\hat{\kappa} \pm 1.96\sqrt{\mathrm{Var}(\hat{\kappa})}$. The calculations of Cohen’s κ, its variance and 95% CI are demonstrated in
Appendix A.
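The calculation described in this section can be sketched in a few lines of Python. This is not the code from Appendix A; it is a minimal illustration that assumes the usual definitions of percent agreement $p_0$ (sum of the diagonal proportions) and chance agreement $p_e$ (sum of products of the marginal proportions), and the binomial approximation Var(κ) = p0(1 − p0) / (n(1 − pe)²). The function name `cohen_kappa` and the 2 × 2 table are hypothetical.

```python
import math


def cohen_kappa(table):
    """Cohen's kappa, its approximate SE and a 95% CI from a square table.

    table[i][j] is the number of subjects placed in category i by the
    first rater and in category j by the second rater.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    # Percent agreement: proportion of subjects on the diagonal.
    p0 = sum(table[i][i] for i in range(k)) / n
    # Marginal proportions of each rater.
    row_marg = [sum(table[i]) / n for i in range(k)]
    col_marg = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    # Chance agreement: sum of products of matching marginal proportions.
    pe = sum(r * c for r, c in zip(row_marg, col_marg))
    kappa = (p0 - pe) / (1 - pe)
    # Approximate variance under a binomial assumption for p0 (assumed form).
    var = p0 * (1 - p0) / (n * (1 - pe) ** 2)
    se = math.sqrt(var)
    return kappa, se, (kappa - 1.96 * se, kappa + 1.96 * se)


# Hypothetical ratings of 50 subjects by two raters into two categories.
kappa, se, ci = cohen_kappa([[20, 5], [10, 15]])
print(round(kappa, 3), round(se, 3), tuple(round(x, 3) for x in ci))
```

The normal-theory interval is only trustworthy for reasonably large n, as noted above.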
It is worth noting that the variance estimator in Eq. 4 is derived under the null
hypothesis that the agreement between two raters is merely due to chance. It usually
overestimates the true variance and results in conservative CIs and significance tests (Fleiss
et al. 1969). In inter-rater reliability studies, a non-zero κ is usually of interest and requires
a non-null variance estimator, which has been derived by Fleiss, Cohen and Everitt (1969).
However, the non-null variance estimator involves marginal distributions of raters, which
are usually not reported in primary studies. For this reason, the non-null variance estimator
cannot be used in meta-analysis unless it is explicitly reported in primary studies.
1.3 Weighted mean Cohen’s κ
Following the tradition of weighted mean effect sizes in meta-analysis, the weighted mean
Cohen’s κ ($\bar{\kappa}_{\cdot}$) can be derived by calculating the variance-weighted average as in Eq. 5
$$\bar{\kappa}_{\cdot} = \frac{\sum_{i=1}^{m} w_i \kappa_i}{\sum_{i=1}^{m} w_i} \qquad (5)$$
where $\kappa_i$ is the estimate obtained from study i, i = 1, …, m, m is the number of primary
studies, and $w_i$ is the reciprocal of the variance of $\kappa_i$ that can be obtained from Eq. 4.
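The variance-weighted average can be illustrated with a short Python sketch. The function name `weighted_mean_kappa` and the two study values are hypothetical; the only assumption is that each weight is the reciprocal of the corresponding variance, as stated above.

```python
def weighted_mean_kappa(kappas, variances):
    """Inverse-variance weighted mean of study-level kappa estimates."""
    if len(kappas) != len(variances):
        raise ValueError("kappas and variances must have the same length")
    # Each study is weighted by the reciprocal of its sampling variance.
    weights = [1.0 / v for v in variances]
    return sum(w * k for w, k in zip(weights, kappas)) / sum(weights)


# Two hypothetical studies: the more precise one (variance 0.01)
# pulls the mean toward its estimate of 0.40.
print(round(weighted_mean_kappa([0.40, 0.60], [0.01, 0.04]), 2))
```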
1.4 Homogeneity test of Cohen’s κ
Whether Cohen’s κ estimates obtained from primary studies are homogeneous or not can be
tested by a chi-square goodness-of-fit test, i.e., the Q statistic in Eq. 6, as in traditional
meta-analysis of effect sizes.
$$Q = \sum_{i=1}^{m} w_i\left(\kappa_i - \bar{\kappa}_{\cdot}\right)^2 \qquad (6)$$
Again, the weight $w_i$ is equal to the reciprocal of the variance of $\kappa_i$. This homogeneity test
implies that κ estimates with larger variances are weighted less in the calculation of Q.
Under the null hypothesis, Q follows a $\chi^2$ distribution with df = m − 1 (Hedges 1982a, b).
The statistical assumption associated with this test is that Cohen’s κ estimates are obtained
from independent samples that are large enough to be asymptotically normal.
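A minimal Python sketch of the homogeneity statistic follows. It assumes Q is the weighted sum of squared deviations of the study estimates from the inverse-variance weighted mean; the function name `q_statistic` and the data are hypothetical, and the p-value is left out to keep the example free of external libraries.

```python
def q_statistic(kappas, variances):
    """Return (Q, df) for the homogeneity test of kappa estimates."""
    weights = [1.0 / v for v in variances]
    mean = sum(w * k for w, k in zip(weights, kappas)) / sum(weights)
    # Weighted squared deviations from the weighted mean.
    q = sum(w * (k - mean) ** 2 for w, k in zip(weights, kappas))
    return q, len(kappas) - 1


q, df = q_statistic([0.40, 0.60], [0.01, 0.04])
# Compare Q with a chi-square critical value for df degrees of freedom,
# for example 3.841 at the 0.05 level when df = 1.
print(round(q, 3), df)
```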
1.5 Fitting fixed-effects models
A fixed-effects model in meta-analysis assumes that observed effects across studies are
homogeneous, i.e., except for sampling errors, observed effects would be a constant across
studies. A fixed-effects model without moderators can be used to estimate the common
effect among all observed effects in primary studies. The fixed-effects model for observed
$\kappa_i$ can be expressed as in Eq. 7.
$$\kappa_i = \theta + e_i \qquad (7)$$
where θ is the population common effect, i.e., the common inter-rater reliability estimate
across m studies, and $e_i$ represents the random sampling error. Under weighted least-squares
estimation, the population common effect θ can be estimated by $\bar{\kappa}_{\cdot}$ in Eq. 5 with variance
in Eq. 8.
$$\mathrm{Var}(\bar{\kappa}_{\cdot}) = \frac{1}{\sum_{i=1}^{m} w_i} \qquad (8)$$
Under the null hypothesis that population common inter-rater reliability is 0,
$z = \bar{\kappa}_{\cdot} / \sqrt{\mathrm{Var}(\bar{\kappa}_{\cdot})}$ follows the standard normal distribution. Accordingly, a 95% CI for population average
inter-rater reliability can be easily constructed from
$\bar{\kappa}_{\cdot} \pm 1.96\sqrt{\mathrm{Var}(\bar{\kappa}_{\cdot})}$. A fixed-effects
model assumes that the variations of observed outcomes could be fully explained by study
characteristics, and inferences about population effect can be made conditionally on the
characteristics of primary studies included in the meta-analysis (Hedges and Vevea 1998).
Therefore, it is a common practice to include study characteristics (e.g., rater characteristics
and study setting) as moderators in a fixed-effects model to explore how they affect
the variability of Cohen’s κ estimates across studies.
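The fixed-effects summary without moderators can be sketched as follows. The sketch assumes inverse-variance weights, so that the variance of the weighted mean is the reciprocal of the sum of the weights; the function name `fixed_effect_summary` and the input values are hypothetical.

```python
import math


def fixed_effect_summary(kappas, variances):
    """Common kappa estimate, its SE, z statistic and 95% CI."""
    weights = [1.0 / v for v in variances]
    total_w = sum(weights)
    mean = sum(w * k for w, k in zip(weights, kappas)) / total_w
    # Variance of an inverse-variance weighted mean is 1 / sum of weights.
    se = math.sqrt(1.0 / total_w)
    z = mean / se  # test of the null hypothesis that the common kappa is 0
    ci = (mean - 1.96 * se, mean + 1.96 * se)
    return mean, se, z, ci


mean, se, z, ci = fixed_effect_summary([0.40, 0.60], [0.01, 0.04])
print(round(mean, 3), round(se, 3), round(z, 2), [round(x, 3) for x in ci])
```

The same arithmetic reproduces the interval reported later for the fixed-effects model: 0.65 ± 1.96 × 0.009 gives approximately 0.63 to 0.67.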
1.6 Fitting random-effects models
A random-effects model in meta-analysis assumes that observed effects across studies are a
random variable instead of a fixed constant. The variability among observed effects is not
only a result of random sampling errors, but also caused by random variability at the study
level (Hedges 1983; Hedges and Vevea 1998; Raudenbush 2009). The random-effects
model for observed $\kappa_i$ can be expressed as in Eq. 9.
$$\kappa_i = \theta_{\cdot} + \mu_i + e_i \qquad (9)$$
where $\theta_{\cdot}$ is the population average effect, i.e., the overall estimate of inter-rater reliability,
$\mu_i$ is the between-study variation that is normally distributed with mean 0 and variance $\tau^2$,
and $e_i$ represents the random sampling error. Under the random-effects model, the population
average effect $\bar{\kappa}_{\cdot}$ can be estimated from Eqs. 10 and 11 and its variance can be
estimated from Eq. 12.
$$w_i^{*} = \frac{1}{\mathrm{var}(\kappa_i) + \hat{\tau}^2} \qquad (10)$$
$$\bar{\kappa}_{\cdot} = \frac{\sum_{i=1}^{m} w_i^{*} \kappa_i}{\sum_{i=1}^{m} w_i^{*}} \qquad (11)$$
$$\mathrm{Var}(\bar{\kappa}_{\cdot}) = \frac{1}{\sum_{i=1}^{m} w_i^{*}} \qquad (12)$$
Under the null hypothesis that population average inter-rater reliability is 0,
$z = \bar{\kappa}_{\cdot} / \sqrt{\mathrm{Var}(\bar{\kappa}_{\cdot})}$ follows the standard normal distribution. A 95% CI for population average inter-rater
reliability can be readily constructed from
$\bar{\kappa}_{\cdot} \pm 1.96\sqrt{\mathrm{Var}(\bar{\kappa}_{\cdot})}$.
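A random-effects summary can be sketched in Python as below. The text above does not say how the between-study variance is estimated, so this sketch swaps in the DerSimonian-Laird moment estimator as an assumption; it is not necessarily the estimator used for the analyses in this article. The function name `random_effects_dl` and the data are hypothetical.

```python
import math


def random_effects_dl(kappas, variances):
    """Random-effects mean, SE and tau^2 (DerSimonian-Laird, assumed)."""
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed_mean = sum(wi * k for wi, k in zip(w, kappas)) / sum_w
    q = sum(wi * (k - fixed_mean) ** 2 for wi, k in zip(w, kappas))
    df = len(kappas) - 1
    # Moment estimator of the between-study variance, truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)
    # Random-effects weights add tau^2 to each sampling variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    mean = sum(wi * k for wi, k in zip(w_star, kappas)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return mean, se, tau2


# Three hypothetical, equally precise but widely scattered estimates.
mean, se, tau2 = random_effects_dl([0.2, 0.5, 0.8], [0.01, 0.01, 0.01])
print(round(mean, 3), round(se, 3), round(tau2, 3))
```

Because the between-study variance enters every weight, heterogeneous estimates yield a wider interval than the fixed-effects model would give for the same data.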
1.7 Fitting mixed-effects models
Exploring the source of heterogeneity by investigating moderator effects on the outcomes
is considered one of the most important and useful aspects of meta-analysis (Thompson
1994). Including moderators in a random-effects model to explain the heterogeneity results
in a mixed-effects model as expressed in Eq. 13.
$$\kappa_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \mu_i + e_i \qquad (13)$$
The variance of $\mu_i$ represents the amount of residual heterogeneity, i.e., the variability of
Cohen’s κ estimates across studies that cannot be accounted for by the moderators $X_{i1}, \ldots, X_{ip}$
included in the mixed-effects model. The mixed-effects model assumes that each
Cohen’s κ estimate is a linear function of moderator effects and residual heterogeneity. An
estimate of overall effect obtained from a random-effects model becomes meaningless and
can even be misleading when significant moderator effects are present in a mixed-effects
model. Hence, an investigation of moderator effects should always be an integral part of
meta-analysis. In addition to usual moderators in meta-analysis including study settings
and subject characteristics, rater characteristics are particularly important moderators that
may explain the heterogeneity among observed Cohen’s κ estimates and should be
included in a mixed-effects model.
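For the special case of a single categorical moderator, the mixed-effects model can be illustrated without matrix algebra. When the moderator is dummy coded with one indicator per group and no intercept, the weighted least-squares coefficient for each group reduces to the weighted mean of that group's estimates, with weights 1 / (variance + residual heterogeneity). The sketch below assumes the residual heterogeneity has already been estimated elsewhere and is passed in as `tau2_res`; the function name, the argument names and the data are hypothetical.

```python
import math
from collections import defaultdict


def group_estimates(kappas, variances, groups, tau2_res):
    """Per-group mixed-effects estimates for one categorical moderator.

    tau2_res is the residual between-study variance, assumed to be
    estimated beforehand and shared by all groups.
    """
    sums = defaultdict(lambda: [0.0, 0.0])  # group -> [sum w*k, sum w]
    for k, v, g in zip(kappas, variances, groups):
        w = 1.0 / (v + tau2_res)
        sums[g][0] += w * k
        sums[g][1] += w
    # Each coefficient is a weighted mean; its SE is sqrt(1 / sum of weights).
    return {g: (wk / w, math.sqrt(1.0 / w)) for g, (wk, w) in sums.items()}


# Hypothetical data: two estimates from untrained and two from trained raters.
result = group_estimates(
    kappas=[0.2, 0.3, 0.6, 0.7],
    variances=[0.01, 0.01, 0.01, 0.01],
    groups=["untrained", "untrained", "trained", "trained"],
    tau2_res=0.01,
)
for group, (beta, se) in result.items():
    print(group, round(beta, 3), round(se, 3))
```

With continuous or multiple moderators the general weighted least-squares solution is needed, which is what dedicated meta-analysis software provides.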
2 An application to pressure ulcer classification systems
2.1 Background
Pressure ulcers (PU) are very serious health problems (Allman 1997) associated with great
pain and distress for patients and extensive health care costs (Graves et al. 2005). PU
classification systems are commonly used to classify skin sites into different categories that
indicate the severity of the condition. Though PU classification systems aim at providing
consistent assessment to promote accurate communication, precise documentation, and
appropriate treatment decision (Stotts 2001), several widely used PU classification systems
have been criticized for low inter-rater reliability estimates reported in published studies
and their usefulness is being questioned (Russell 2002). A meta-analysis of Cohen’s κ is a
useful tool to examine the generalizability of inter-rater reliability estimates across studies,
to determine the factors affecting inter-rater reliability, and to inform PU assessment in
future research and practice.
2.2 Data sources
Kottner et al. (2009) conducted a systematic review of inter-rater reliability for PU classification
systems and included 24 primary studies in the final data synthesis. The 24
studies were retrieved and served as the pool of potential studies for the present meta-analysis.
To be eligible for final data synthesis, the potential studies had to meet four
criteria: (1) the language is English; (2) Cohen’s κ was reported as the measure of inter-rater
reliability, or sufficient information was provided to calculate Cohen’s κ; (3) standard
errors of Cohen’s κ estimates were reported, or sufficient information was provided to
estimate the standard errors; and (4) Cohen’s κ is an appropriate measure of inter-rater
reliability for the rating procedure. The selection process (see Fig. 1) resulted in six studies
for this meta-analysis. The variables extracted from the six studies are authors, study
location, publication year, PU classification system, number of categories, rating procedure,
rater characteristics, number of raters, total number of skin sites, skin site characteristics,
percent agreement ($p_0$) and Cohen’s κ estimate. Note that five studies contained
multiple Cohen’s κ estimates obtained from independent samples. A total of fifteen
Cohen’s κ estimates were identified for final data synthesis.
2.3 Results
As shown in Table 1, seven different PU classification systems were used in the six
primary studies. Although the seven classification systems have very much in common,
they differ in operational definitions of grade 1 PU. Normal skin was involved in three
studies. The prevalence of PU was not reported in five studies, indicating possible heter-
ogeneity of subjects across studies. Five studies used real skin sites for PU assessment
while only one study used images of skin sites. The characteristics of raters were heterogeneous
in terms of training and experience in PU and tissue viability assessment.
Sample sizes (i.e., numbers of skin sites) varied from 35 to 2,396.
Because of the heterogeneity of study characteristics and the inclusion of multiple PU
classification systems, a random-effects model was fitted using the R package metafor
(Viechtbauer 2010). The metafor package provides flexible and comprehensive functions
for fitting various models in general-purpose meta-analysis. The code for all analyses
conducted in this meta-analysis is provided in Appendix B. The results obtained from the
random-effects model are summarized in Table 2. The test for heterogeneity is significant,
Q = 653.50, df = 14, P < .001, suggesting that considerable heterogeneity exists
among true inter-rater reliabilities across studies. The amount of heterogeneity ($\tau^2$) is
estimated to be 0.06. The $I^2$ statistic suggests that 97.58% of the total variability in Cohen’s κ
Fig. 1 Selection process of included studies for the meta-analysis for pressure ulcer classification systems
[Flowchart, in order: 24 studies meeting selection criteria after quality assessment in Kottner et al. (2009);
11 studies reporting $p_0$ for inter-rater reliability, for which Cohen’s κ cannot be calculated; 3 studies
reported multirater κ and 3 studies reported mean κ; 7 studies reporting Cohen’s κ as the inter-rater
reliability measure; 1 study not providing the information to estimate the standard error of Cohen’s κ;
6 studies meeting the criteria for final data synthesis, with 15 independent estimates of Cohen’s κ obtained]
Table 1 Study characteristics of fifteen Cohen’s κ estimates included in the meta-analysis

| Study/Location | Measure/# of categories | Skin | Normal skin involved | Raters | Setting | Skin site characteristics | n | κ ($p_0$) |
|---|---|---|---|---|---|---|---|---|
| Bours et al. (1999)/ | EPUAU/5 | Real | Yes | Trained staff nurses and one | Hospital | PU prevalence rate was 10.1%; 4.1% were Stage I ulcers | 674 | 0.97 (1.00) |
| Bours et al. (1999)/ | EPUAU/5 | Real | Yes | Trained staff nurses and one | | PU prevalence rate was 83.6%; 60.7% were Stage I ulcers | 344 | 0.81 (0.94) |
| Bours et al. (1999)/ | EPUAU/5 | Real | Yes | Trained primary nurses and one wound care specialist | | PU prevalence rate was 12.7%; 5.4% were Stage I ulcers | 1348 | 0.49 (0.98) |
| Buntinx et al. | Shea/5 | Real | No | Nurses and physicians with chronic wound experience, without special training in | Hospital | Pressure sores, leg ulcers caused by arterial insufficiency, venous leg ulcers, and amputation | 81 | 0.42 (0.67) |
| Healey (1995)/UK | Stirling 2-digit/15 | Image | Not reported | Nurses | Not | Caucasian skin images with various skin problems | 330 | 0.22 (0.59) |
| Healey (1995)/UK | Torrance/5 | Image | Not reported | Nurses | Not | Caucasian skin images with various skin problems | 330 | 0.29 (0.60) |
| Healey (1995)/UK | Stirling 1-digit/5 | Image | Not reported | Nurses | Not | Caucasian skin images with various skin problems | 330 | 0.15 (0.39) |
| Healey (1995)/UK | Surrey/4 | Image | Not reported | Nurses | Not | Caucasian skin images with various skin problems | 309 | 0.37 (0.67) |
| Nixon et al. | Adapted EPUAP, | Real | Yes | Clinical research nurse team leader; trained and experienced clinical research | Hospital | Skin sites of adult patients | 107 | 0.97 (0.98) |
| Nixon et al. (2005)/ | Adapted EPUAP, | Real | Yes | Trained and experienced clinical research nurses and trained ward nurses | Hospital | Skin sites of adult patients | 2396 | 0.63 (0.79) |
| Pedley (2004)/UK | EPUAU/5 | Real | Yes | Trained registered nurses experienced in tissue | Hospital | Pressure ulcers of differing severity and a number of pressure points free from pressure damage | 35 | 0.31 (0.49) |
| Pedley (2004)/UK | Stirling 1-digit/5 | Real | Yes | Trained registered nurses experienced in tissue | Hospital | Pressure ulcers of differing severity and a number of pressure points free from pressure damage | 35 | 0.37 (0.54) |
| Pedley (2004)/UK | Stirling 2-digit/15 | Real | Yes | Trained registered nurses experienced in tissue | Hospital | Pressure ulcers of differing severity and a number of pressure points free from pressure damage | 35 | 0.48 (0.54) |
| Vanderwee et al. | Blanchable and | Real | No | Researcher and trained nurses | Hospital | Geriatric patients, erythema at the heels, hips, and sacrum | 503 | 0.69 (0.92) |
| Vanderwee et al. | Blanchable and | Real | No | Researcher and trained nurses | Hospital | Geriatric patients, erythema at the heels, hips, and sacrum | 503 | 0.72 (0.92) |
estimates can be attributed to heterogeneity among the true inter-rater reliabilities. Figure 2
clearly shows the heterogeneity of Cohen’s κ estimates across studies. The overall estimate
of Cohen’s κ is 0.53 (95% CI: 0.39–0.66), which is a moderate inter-rater reliability
according to Landis and Koch (1977a).
When subjectivity is involved in the rating process, raters’ prior experience and training
become particularly important factors that may affect the variability of inter-rater reliability
estimates. Based on the descriptions of rater characteristics in the six primary
studies, the raters were categorized into two groups, raters with special training in PU
assessment versus raters without special training in PU assessment. A mixed-effects model
was fitted to explore the effect of rater characteristics on the variability of Cohen’s κ
estimates and the results are summarized in Table 3. The estimated amount of residual
Table 2 Total heterogeneity and population average estimate from a random-effects model
$\tau^2$ (estimated total amount of heterogeneity): 0.06 with 95% CI [0.03, 0.16]
$I^2$ (% of total variability due to heterogeneity): 97.58%

| | Q | df | P |
|---|---|---|---|
| Test for heterogeneity | 653.50 | 14 | <.001 |

| | Estimate | SE | z | p | 95% CI |
|---|---|---|---|---|---|
| Population average estimate | 0.53 | 0.07 | 7.75 | <.001 | [0.39, 0.66] |

Fig. 2 A forest plot of Cohen’s κ estimates and the overall estimate from the random-effects model
heterogeneity is 0.03, suggesting that about 50% of the total amount of heterogeneity
(which is 0.06 estimated from the random-effects model) can be accounted for by raters’
training. For the group of raters without special training, b = 0.28 (95% CI: 0.12–0.45),
SE = 0.08, P < 0.001. For the group of raters with special training, b = 0.66 (95% CI:
0.54–0.78), SE = 0.06, P < 0.001. The difference between the two groups is statistically
significant as suggested by the test of moderators, Q = 13.62, df = 1, P < 0.001. However,
the test of residual heterogeneity is still significant, Q = 258.20, df = 13, P < 0.001,
indicating that moderators not considered in the model also affect the variability of inter-rater
reliability estimates across studies. Additional moderators, including whether normal
skin was involved and how PU was assessed (based on real skin or skin image), were added
to the model but no significant effects were detected.
The funnel plot of Cohen’s κ estimates against their estimated standard errors from the
random-effects model (Fig. 3) suggests that possible publication bias exists. The trim and
fill method (Duval and Tweedie 2000a,b) was used to adjust for publication bias. The trim
and fill method is a nonparametric data augmentation technique that estimates the number
of studies missing from a meta-analysis due to suppression of the most extreme obser-
vations on one side of the funnel plot. The method then augments observed data under the
fixed- or random-effects model to make the funnel plot more symmetric. Because it is a
way of formalizing the use of a funnel plot and the results can be easily understood
visually, the trim and fill method is now the most popular method for adjusting for pub-
lication bias (Borenstein 2005). Under the random-effects model, the trim and fill method
was applied to the PU data and the estimated number of missing studies on either side is
zero. In other words, the symmetry of the funnel plot cannot be improved by data aug-
mentation. The mechanisms of publication bias and incomplete data reporting are usually
very complicated and may vary with dataset and subject area in meta-analysis (Sutton
2009). The trim and fill method did not work for the present meta-analysis probably
because the mechanism of publication bias is not due to the suppression of extreme
Cohen’s κ estimates as assumed by the trim and fill method. The existence of publication
bias may be partially explained by the fact that eleven primary studies did not report
Cohen’s κ as the measure of inter-rater reliability and were excluded from the meta-analysis.
The inclusion of those studies in the final data synthesis may lead to a different
conclusion about publication bias (Fig. 4).
The trim and fill method was also applied to the PU data under the fixed-effects model to
further demonstrate its rationale of adjusting for publication bias. The results from the fixed-
effects model and the trim-and-filled model were summarized in Table 4. Under the
Table 3 Results from a mixed-effects model with rater characteristic as a moderator
Estimated residual amount of heterogeneity: 0.03 with 95% CI [0.01, 0.10]
% of total variability due to moderator: 52.87%

| | Q | df | P |
|---|---|---|---|
| Test for residual heterogeneity | 258.20 | 13 | <.001 |
| Test for moderators | 13.62 | 1 | <.001 |

| | b | SE | z | P | 95% CI |
|---|---|---|---|---|---|
| Raters without training | 0.28 | 0.08 | 3.47 | <.001 | [0.12, 0.45] |
| Raters with training | 0.66 | 0.06 | 11.00 | <.001 | [0.54, 0.78] |
fixed-effects model, the common Cohen’s κ estimate is 0.65 (95% CI: 0.63–0.67). After
adjusting for publication bias under the fixed-effects model, six missing studies were estimated
on the right side of the funnel plot and the adjusted common Cohen’s κ estimate is 0.74 (95%
CI: 0.72–0.76). However, as shown in the trim-and-filled funnel plot (Fig. 5), the Cohen’s κ
estimates for the six missing studies are larger than 1 and hence substantively meaningless.
This may be viewed as a limitation of the trim and fill method that needs further methodological
investigation. Again, the application of the trim and fill method under the fixed-effects model is
for demonstration purposes only. The results should not be used to inform future practice.
2.4 Conclusion
The results from the meta-analysis of fifteen Cohen’s κ estimates show that (1) the overall
inter-rater reliability estimated from a random-effects model is .53, indicating a moderate
level of agreement between raters, (2) significant heterogeneity of Cohen’s κ estimates exists
between studies, and (3) raters with special training in PU assessment tend to produce more
reliable ratings than raters without training, suggesting the importance of rater training. In
order to obtain as many published studies as possible, this meta-analysis used very broad
criteria to select studies and included Cohen’s κ estimates generated from seven different PU
classification systems. No comparison was made between classification systems because of
the small number of Cohen’s κ estimates for each classification system. Therefore, it is very
difficult to decide which PU classification system should be used in daily practice. When a
large number of inter-rater reliability studies have accumulated in the future, it will be necessary to
include PU classification systems as a moderator in the mixed-effects model so that the effect
of test properties on inter-rater reliability estimates can be understood.
Fig. 3 A funnel plot of Cohen’s κ estimates against standard error estimates from the random-effects model
3 Discussion
The present study proposed a formal statistical framework to specifically combine Cohen’s
κ estimates across multiple studies and extended traditional meta-analysis of effect sizes to
inter-rater reliability. The proposed framework relies on the sampling distribution of
Cohen’s κ and traditional meta-analytic models to describe the typical inter-rater reliability
Fig. 4 A forest plot of Cohen’s κ estimates with separate estimates for the moderator (raters without
training in PU assessment vs. raters with training in PU assessment)

Table 4 Comparison of results from the fixed-effects model and the trim-and-filled fixed-effects model

| | Fixed-effects model | Trim-and-filled fixed-effects model |
|---|---|---|
| Test for heterogeneity | Q = 653.50, df = 14, P < .001 | Q = 1257.15, df = 20, P < .001 |
| Common estimate | 0.65 | 0.74 |
| Estimated standard error | 0.009 | 0.009 |
| 95% CI for common estimate | [0.63, 0.67] | [0.72, 0.76] |
in multiple studies and to quantify between-study variation, and thus it is more rigorous
and informative than narrative reviews and systematic reviews. It allows researchers to
evaluate how important study characteristics affect the variability of inter-rater reliability
estimates across studies and to accumulate psychometric knowledge about the test being
used. The findings from a meta-analysis of Cohen’s κ will help test developers, test
users and methodologists better understand inter-rater reliability and develop effective
strategies to improve inter-rater reliability.
Even the most skillful chef cannot cook a meal without basic ingredients, so the quality
of a meta-analysis largely depends on the supplies in primary studies. A successful meta-analysis
of Cohen’s κ requires that the test of interest has been frequently used and
Cohen’s κ is frequently reported as a measure of inter-rater reliability. About 80% of the
primary studies in the literature of PU assessment did not report any inter-rater reliability
coefficient (Kottner et al. 2009). The consequence of underreporting is that population
parameter estimates in meta-analysis will be biased. Therefore, raising the awareness of
reporting inter-rater reliability estimates is an urgent mission. It was also noted that $p_0$ was
more frequently reported as the measure of inter-rater reliability than Cohen’s κ. $p_0$
does not account for chance agreement between two raters and is a positively biased
measure of the true systematic tendency of the two raters agreeing with each other (Fleiss
1981). In contrast, Cohen’s κ adjusts for chance agreement and hence should be the
preferred measure of inter-rater reliability. Moreover, standard error is a measure of
estimation precision and a key element to synthesize estimates across different studies in
meta-analysis. Unfortunately, standard errors are generally not reported for Cohen’s κ in
Fig. 5 A funnel plot of Cohen’s jestimates against standard error estimates after adjusting for publication
bias with the trim and fill method (the dots on the right of the vertical line are the missing observations
estimated by the trim and fill method)
158 Health Serv Outcomes Res Method (2011) 11:145–163
published studies. If p
is reported for Cohen’s jestimate in a primary study, the standard
error can be estimated from Eqs. 1and 4; otherwise, the study has to be excluded from final
data synthesis. Researchers are strongly recommended to explicitly report Cohen’s jwith
estimated standard error to facilitate future meta-analysis.
Last but not least, the appropriate use of Cohen’s κ in primary studies is essential for
meta-analysis. By assumption, Cohen’s κ is an appropriate measure of inter-rater reliability
when all of the subjects are rated by two equally competent raters and the correctness of
ratings usually cannot be determined. When these assumptions are violated, Cohen’s κ
cannot be used to indicate the consistency of the ratings between two raters. For instance,
Hart et al. (2006) reported Cohen’s κ between hospital nurses and PU experts. Ratings
made by experts were considered correct classifications and the reliability between nurses
and experts was considered rater-to-standard reliability. Whether rater-to-standard
reliability is a valid concept and how to assess it are separate issues, but Cohen’s κ was
misused in that study, and so the study cannot be included in a meta-analysis. In
fact, Cohen’s κ has been extended to estimate inter-rater reliability for many complicated
rating scenarios (e.g., Berry and Mielke 1998; Cohen 1968, 1972; Davies and Fleiss 1982;
Fleiss 1971; Gross 1986; Janson and Olsson 2001, 2004; Kraemer 1980; Kraemer et al.
2004; Landis and Koch 1977b; Vanbelle and Albert 2009). Unfortunately, those extended measures
are rarely, if ever, applied in empirical research. The misuse of Cohen’s κ and
the large number of extended measures of inter-rater reliability call for clear and
easy-to-follow tutorials on conducting inter-rater reliability studies and choosing the appropriate
measure of inter-rater reliability for different rating scenarios.
Appendix A
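An example illustrating how to calculate Cohen’s κ, its variance and confidence intervals
using real data from Nixon et al. (2005)

                             Ward nurse
Clinical research nurse      No pressure ulcer    Pressure ulcer    Total
No pressure ulcer            (a)
Pressure ulcer               (c)
Total                        (g

$$p_0 = \frac{2175 + 144}{2396} \approx 0.97$$

$$p_c = \frac{2210 \times 2217 + 186 \times 179}{2396^2} \approx 0.86$$

$$\hat{\kappa} = \frac{p_0 - p_c}{1 - p_c} = \frac{0.97 - 0.86}{1 - 0.86} \approx 0.79$$

$$\mathrm{Var}(\hat{\kappa}) = \frac{1}{n} \cdot \frac{p_0 (1 - p_0)}{(1 - p_c)^2} = \frac{1}{2396} \times \frac{0.97 \times (1 - 0.97)}{(1 - 0.86)^2} \approx 6.20 \times 10^{-4}$$

$$\hat{\kappa} \pm 1.96 \sqrt{\mathrm{Var}(\hat{\kappa})} = 0.79 \pm 1.96 \times 0.02$$

Therefore, the 95% CI for Cohen’s κ is [0.75, 0.83].

Note that the calculation of $\mathrm{Var}(\hat{\kappa})$ depends on $p_c$, which is usually not
reported in research articles. When $p_0$ and $\hat{\kappa}$ are reported, $p_c$ can be derived by

$$p_c = \frac{p_0 - \hat{\kappa}}{1 - \hat{\kappa}}$$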
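The derivation of chance agreement and variance from reported values can be sketched in base R as
follows. This is a minimal sketch, assuming the large-sample variance
$p_0(1 - p_0) / [n(1 - p_c)^2]$, which matches the worked arithmetic in this appendix; the helper
name `kappa_variance()` and its interface are hypothetical. The inputs are the rounded values from
the example (0.97, 0.79 and n = 2396).

```r
# kappa_variance() is an illustrative helper; the name and interface are assumptions.
kappa_variance <- function(p0, kappa, n) {
  stopifnot(p0 >= 0, p0 <= 1, kappa < 1, n > 0)
  pc <- (p0 - kappa) / (1 - kappa)            # chance agreement implied by p0 and kappa
  v  <- p0 * (1 - p0) / (n * (1 - pc)^2)      # large-sample variance of kappa
  se <- sqrt(v)
  list(pc = pc, var = v, se = se,
       ci = kappa + c(-1, 1) * qnorm(0.975) * se)
}

out <- kappa_variance(p0 = 0.97, kappa = 0.79, n = 2396)
print(out)
```

Because the implied chance agreement is not rounded to two decimals here, the resulting variance
can differ slightly from a hand calculation that uses rounded intermediate values.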
Appendix B
R code for all analyses conducted in the meta-analysis of Cohen’s κ for PU classification
#Install the package metafor
#Derive vi, the variance for each kappa estimate
#p0i and ki are the percentage agreement and Cohen’s kappa estimate for study i
# pci is the chance agreement derived from p0i and ki (as shown in Appendix A)
#Fit the random-effects model using rma() and get confidence intervals for parameter
#Get the forest plot
forest(res,at=c(-0.2,0,0.2,0.4,0.6,0.8,1,1.2), slab = paste(PUdata$Study))
text(-1.25, 17, "Study", pos = 4)
text(1.5, 17, "Cohen's Kappa [95% CI]", pos = 4)
#Get the funnel plot to check publication bias
funnel(res, main = "Random-Effects Model")
#Sort the data matrix by the moderator variable rater for the mixed-effects model
#Fit the mixed-effects model using rma() and get confidence intervals for parameter
#Get the forest plot for mixed-effect model and add group estimates to the bottom of the
forest(data$ki,data$vi, at=c(-0.2,0,0.2,0.4,0.6,0.8,1,1.2),ylim=c(-3,18),slab = paste
preds <- predict(mix, newmods = c(0, 1))
addpoly(preds$pred, sei = preds$se, mlab = c("Raters without training", "Raters with
text(1.2, 0.4, "Raters with training")
text(1.2, 11, "Raters without training")
text(-1.25, 17, "Study", pos = 4)
text(1.5, 17, "Cohen's Kappa [95% CI]", pos = 4)
# Use trim and fill method to adjust for publication bias under the random-effects model
rtf <- trimfill(re)
# Use trim and fill method to adjust for publication bias under the fixed-effects model
ftf <- trimfill(fe)
#Get the funnel plot with augmented data
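Several of the steps named in the comments above can be sketched end to end as follows. This is
not the original analysis code: the data frame is hypothetical (the values are illustrative and
are not the PU studies), the column `ni` for sample size and the 0/1 coding of `rater` are
assumptions, and the random-effects fit uses the default estimator of `rma()` because the listing
does not show which one was used. The model fitting is skipped if the metafor package is not
installed.

```r
# install.packages("metafor")  # run once if the package is not installed

# Hypothetical data; values are illustrative only.
PUdata <- data.frame(
  Study = paste("Study", 1:6),
  ni    = c(120, 80, 200, 150, 60, 300),       # assumed column: number of rated subjects
  p0i   = c(0.80, 0.75, 0.90, 0.85, 0.70, 0.92),
  ki    = c(0.55, 0.45, 0.78, 0.66, 0.38, 0.81),
  rater = c(0, 0, 1, 1, 0, 1)                  # assumed coding: 0 = no training, 1 = training
)

# Derive chance agreement and the variance of each kappa estimate
PUdata$pci <- (PUdata$p0i - PUdata$ki) / (1 - PUdata$ki)
PUdata$vi  <- PUdata$p0i * (1 - PUdata$p0i) / (PUdata$ni * (1 - PUdata$pci)^2)

if (requireNamespace("metafor", quietly = TRUE)) {
  library(metafor)
  res <- rma(yi = ki, vi = vi, data = PUdata)                  # random-effects model
  print(confint(res))                                          # CIs for heterogeneity parameters
  mix <- rma(yi = ki, vi = vi, mods = ~ rater, data = PUdata)  # mixed-effects model
  fe  <- rma(yi = ki, vi = vi, data = PUdata, method = "FE")   # fixed-effects model
  ftf <- trimfill(fe)                                          # trim and fill adjustment
  funnel(ftf)                                                  # funnel plot with augmented data
}
```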
