Meta-analysis of Cohen’s kappa
Shuyan Sun
Received: 3 September 2010 / Revised: 25 October 2011 / Accepted: 29 October 2011 /
Published online: 11 November 2011
Springer Science+Business Media, LLC 2011
Abstract Cohen's κ is the most important and most widely accepted measure of inter-rater reliability when the outcome of interest is measured on a nominal scale. The estimates of Cohen's κ usually vary from one study to another due to differences in study settings, test properties, rater characteristics and subject characteristics. This study proposes a formal statistical framework for meta-analysis of Cohen's κ to describe the typical inter-rater reliability estimate across multiple studies, to quantify between-study variation and to evaluate the contribution of moderators to heterogeneity. To demonstrate the application of the proposed statistical framework, a meta-analysis of Cohen's κ is conducted for pressure ulcer classification systems. Implications and directions for future research are discussed.
Keywords: Cohen's κ · Inter-rater reliability · Meta-analysis · Generalizability
In classical test theory proposed by Spearman (1904), an observed score X is expressed as the true score T plus a random error of measurement e, i.e., X = T + e. Reliability is defined as the
squared correlation between observed scores and true scores (Lord and Novick 1968). It
indicates the extent to which scores produced by a particular measurement procedure are
consistent and reproducible (Thorndike 2005). Reliability is an unobserved property of scores
obtained from a sample on a particular test, not an inherent property of the test (Thompson
2002; Thompson and Vacha-Hasse 2000; Vacha-Hasse 1998; Vacha-Hasse et al. 2002).
Therefore, it is never appropriate to claim a test is reliable or unreliable in a research article.
Instead, researchers should state the scores are reliable or unreliable. Reliability estimates
usually vary from one study to another due to differences in study characteristics including
study settings, test properties, and subject characteristics. A test that yields reliable scores for one group of subjects in one setting may fail to yield reliable scores for a different group of
subjects in another setting. Hence, understanding the generalizability of score reliability and
the factors affecting score reliability becomes an important methodological issue.
S. Sun
School of Education, University of Cincinnati, 2600 Clifton Ave., Dyer Hall 475, P.O. Box 210049,
Cincinnati, OH 45221, USA
e-mail: sunsn@mail.uc.edu
Health Serv Outcomes Res Method (2011) 11:145–163
DOI 10.1007/s10742-011-0077-3
Researchers in education and psychology have been applying meta-analytic techniques
to reliability coefficients to investigate the generalizability of score reliability across
multiple studies (e.g., Capraro et al. 2001; Caruso 2000; Helms 1999; Henson et al. 2001;
Huynh et al. 2009; Miller et al. 2007; Rohner and Khaleque 2003; Vacha-Hasse 1998;
Viswesvaran and Ones 2000; Yin and Fan 2000). This methodology was proposed and
labeled as reliability generalization by Vacha-Hasse (1998). As an extension of validity
generalization (Hunter and Schmidt 1990; Schmidt and Hunter 1977), reliability gener-
alization can be used (a) to describe the typical reliability estimate of a given test across
different studies, (b) to describe the variability of reliability estimates across different
studies, and (c) to identify study characteristics that can explain the variability of reliability
estimates and to accumulate psychometric knowledge regarding study characteristics
(Vacha-Hasse 1998). Score reliability is affected by heterogeneity of subjects being tested,
and reliability estimates always change when the test is administered to a different sample
(Guilford and Fruchter 1978). Effect sizes in applied research studies are inherently
attenuated by unreliable scores (Baugh 2002; Henson 2001; Thompson 1994) and con-
sequently the statistical power of detecting a meaningful difference is lowered. Therefore,
an investigation of the generalizability of score reliability has very important implications
for psychometricians, statisticians and applied researchers who wish to better understand
score reliability and its influences on the appropriateness of test use, effect size and
statistical power (Vacha-Hasse et al. 2002).
Inter-rater reliability refers to the consistency of ratings given by different raters to the
same subject. It quantifies the extent to which raters agree on the relative ratings given to
subjects and serves as a measure of quality and accuracy of the rating process (Linacre
1989). Inter-rater reliability estimates usually vary across studies due to test properties (e.g.,
items, scaling and operational definitions), rater characteristics (e.g., knowledge, experi-
ence, qualification and training), study settings, and subject characteristics (Kraemer 1979;
Shrout 1998; Suen 1988). Substantial measurement errors from raters and rating procedures
will attenuate measurement precision and statistical power, and make meaningful treatment
effects more difficult to detect (Fleiss and Shrout 1977). When the outcome of interest is
measured on a nominal scale, Cohen's κ is considered the most important (von Eye 2006) and most widely accepted measure of inter-rater reliability (Brennan and Silman 1992; Sim and Wright 2005; Zwick 1988), especially in the medical literature (Kraemer et al. 2004; Viera and Garrett 2005). The frequent application of Cohen's κ allows the possibility of conducting meta-analyses to examine the generalizability of inter-rater reliability across multiple studies. This study first discusses the statistical framework for meta-analysis of Cohen's κ and then demonstrates its application by a meta-analysis for pressure ulcer classification systems, a set of diagnostic tools in nursing and medical research.
1 Statistical framework for meta-analysis of Cohen's κ
1.1 Assumptions of Cohen's κ
The basic feature of Cohen's κ is to consider two raters as alternative forms of a test, and their ratings are analogous to the scores obtained from the test. Well known as a chance-corrected measure of inter-rater reliability, Cohen's κ determines whether the degree of agreement between two raters is higher than would be expected by chance (Cohen 1960). It assumes that (a) the subjects being rated are independent of each other, (b) the categories of ratings are independent, mutually exclusive and collectively exhaustive, and (c) the two
raters operate independently. In addition to the three statistical assumptions, Cohen's κ further assumes that the "correctness" of ratings cannot be determined in a typical situation, and that the raters are deemed equally competent to make the judgment on a priori grounds (Cohen 1960, p. 38). Under these two practical assumptions, no restriction is placed on the distribution of ratings over categories for the raters. In other words, Cohen's κ allows the marginal distributions of the raters to differ (Banerjee et al. 1999). A marginal distribution is defined as the set of underlying probabilities with which each rater uses the categories. Block and Kraemer (1989) argued that, by allowing marginal distributions to differ, Cohen's κ measures the association between two sets of ratings rather than the agreement between two raters. A well-defined measure of agreement describes how well one rater's rating agrees with what another rater would have reported and indicates the generalizability of a rating beyond the specific rater. When the true interest is agreement between raters rather than mere association between ratings, the marginal distributions should not differ too greatly. Therefore, it is very important to impose the assumption of homogeneity of marginal distributions on Cohen's κ (Blackman and Koval 2000; Block and Kraemer 1989; Brennan and Prediger 1981; Zwick 1988).
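When raw ratings are available, the marginal homogeneity assumption discussed above can be checked directly. The R sketch below is not part of the original article; it borrows the 2 × 2 counts from Appendix A and uses McNemar's test as one simple check of marginal homogeneity for two raters and two categories.

# Sketch only: checking marginal homogeneity for two raters (2 x 2 counts from Appendix A).
tab <- matrix(c(2175, 35, 42, 144), nrow = 2, byrow = TRUE,
              dimnames = list(rater1 = c("no PU", "PU"), rater2 = c("no PU", "PU")))
rowSums(tab) / sum(tab)   # rater 1's marginal distribution
colSums(tab) / sum(tab)   # rater 2's marginal distribution
mcnemar.test(tab)         # McNemar test of marginal homogeneity in the 2 x 2 case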
1.2 Sampling distribution of Cohen's κ
The formula for computing Cohen's κ is expressed in Eq. 1:

$$\kappa = \frac{p_0 - p_c}{1 - p_c} \qquad (1)$$
where $p_0$ is the percent agreement, defined as the proportion of subjects on which the raters agree, and $p_c$ is the chance agreement, defined as the proportion of agreement that would be expected by chance. The calculation of Cohen's κ is very simple and can be done from a contingency table using a basic calculator. The upper limit of κ is +1.00, occurring when and only when the two raters agree perfectly, in other words, when the two raters have exactly the same marginal distribution. The lower limit of κ falls between 0 and −1.00, depending on the dispersion of the raters' marginal distributions (Cohen 1960). A κ value of 0 indicates that the agreement is merely due to chance. Negative values of κ can be meaningfully interpreted as agreement lower than would be expected by chance alone (Brennan and Silman 1992). Although the meaning of κ with value 0 or 1 is quite clear, the interpretation of intermediate values is less evident. The benchmarks for interpreting Cohen's κ proposed by Landis and Koch (1977a) are of high profile in the literature, with 0.8 to 1.0 indicating almost perfect agreement, 0.6 to 0.8 as substantial, 0.4 to 0.6 as moderate, 0.2 to 0.4 as fair, zero to 0.2 as slight and below zero as poor. Slightly different interpretations can be found in Fleiss (1981) and Altman (1991). Stemler and Tsai (2008) proposed to use .50 as the minimal Cohen's κ for an acceptable level of inter-rater reliability.
The mean and variance of κ were derived by Everitt (1968) as in Eqs. 2 and 3:

$$E(\kappa) = \frac{1}{1 - p_c}\{E(p_0) - p_c\} \qquad (2)$$

$$\operatorname{Var}(\kappa) = \frac{1}{(1 - p_c)^2}\operatorname{Var}(p_0) \qquad (3)$$

The exact variance of $p_0$ is very tedious to calculate. When $p_0$ is assumed to be binomially distributed, the approximate variance of κ simplifies as in Eq. 4 (Cohen 1960; Everitt 1968).
$$\operatorname{Var}(\kappa) = \frac{1}{(1 - p_c)^2}\operatorname{Var}(p_0) = \frac{p_0(1 - p_0)}{n(1 - p_c)^2} \qquad (4)$$
The sampling distribution of κ appears to be very non-symmetric when n is small (Blackman and Koval 2000; Block and Kraemer 1989; Koval and Blackman 1996). With a large enough n, the sampling distribution of κ is approximately normal, so that confidence intervals (CI) and significance tests can be easily done using standard normal distribution quantiles. For instance, a 95% CI for Cohen's κ can be constructed as $\hat{\kappa} \pm 1.96\sqrt{\operatorname{Var}(\hat{\kappa})}$. The calculations of Cohen's κ, its variance and 95% CI are demonstrated in Appendix A.
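The calculations in Eqs. 1 and 4 are easy to script. The R sketch below is mine rather than the article's (the function name kappa_ci is illustrative); it returns κ, the Eq. 4 variance and a normal-approximation CI from $p_0$, $p_c$ and n.

# Sketch only: Cohen's kappa (Eq. 1), its approximate variance (Eq. 4) and a
# normal-approximation confidence interval.
kappa_ci <- function(p0, pc, n, conf = 0.95) {
  kappa <- (p0 - pc) / (1 - pc)                 # Eq. 1
  v     <- p0 * (1 - p0) / (n * (1 - pc)^2)     # Eq. 4
  z     <- qnorm(1 - (1 - conf) / 2)
  c(kappa = kappa, var = v,
    lower = kappa - z * sqrt(v), upper = kappa + z * sqrt(v))
}
kappa_ci(p0 = 0.97, pc = 0.86, n = 2396)   # the Appendix A example (Nixon et al. 2005)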
It is worth noting that the variance estimator in Eq. 4 is derived under the null hypothesis that the agreement between two raters is merely due to chance. It usually overestimates the true variance and results in conservative CIs and significance tests (Fleiss et al. 1969). In inter-rater reliability studies, a non-zero κ is usually of interest and requires a non-null variance estimator, which has been derived by Fleiss, Cohen and Everitt (1969). However, the non-null variance estimator involves the marginal distributions of the raters, which are usually not reported in primary studies. For this reason, the non-null variance estimator cannot be used in meta-analysis unless it is explicitly reported in primary studies.
1.3 Weighted mean Cohen's κ
Following the tradition of weighted mean effect sizes in meta-analysis, the weighted mean Cohen's κ ($\bar{\kappa}$) can be derived by calculating the variance-weighted average as in Eq. 5:

$$\bar{\kappa} = \frac{\sum_{i=1}^{m} w_i \kappa_i}{\sum_{i=1}^{m} w_i} \qquad (5)$$

where $\kappa_i$ is the estimate obtained from study $i$, $i = 1, \ldots, m$, $m$ is the number of primary studies, and $w_i$ is the reciprocal of the variance of $\kappa_i$ that can be obtained from Eq. 4.
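A minimal R sketch of Eq. 5 follows; it is not from the article, and ki and vi are hypothetical κ estimates with their Eq. 4 variances.

# Sketch only: inverse-variance weighted mean of m kappa estimates (Eq. 5).
ki <- c(0.97, 0.81, 0.49, 0.42)          # hypothetical kappa estimates
vi <- c(0.0001, 0.0005, 0.0002, 0.0030)  # hypothetical Eq. 4 variances
wi <- 1 / vi                             # weights = reciprocals of the variances
k_bar <- sum(wi * ki) / sum(wi)
k_bar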
1.4 Homogeneity test of Cohen's κ
Whether the Cohen's κ estimates obtained from primary studies are homogeneous or not can be tested by a chi-square goodness-of-fit test, i.e., the Q statistic in Eq. 6, as in traditional meta-analysis of effect sizes:

$$Q = \sum_{i=1}^{m} \frac{(\kappa_i - \bar{\kappa})^2}{\operatorname{Var}(\kappa_i)} = \sum_{i=1}^{m} w_i (\kappa_i - \bar{\kappa})^2 \qquad (6)$$

Again, the weight $w_i$ is equal to the reciprocal of the variance of $\kappa_i$. This homogeneity test implies that κ estimates with larger variances are weighted less in the calculation of Q. Under the null hypothesis, Q follows a $\chi^2$ distribution with $df = m - 1$ (Hedges 1982a, b). The statistical assumption associated with this test is that the Cohen's κ estimates are obtained from independent samples that are large enough to be asymptotically normal.
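Continuing with hypothetical data as above (not the article's), the Q statistic of Eq. 6 and its chi-square p-value take only a few lines of R:

# Sketch only: homogeneity test of Eq. 6 for hypothetical kappa estimates.
ki <- c(0.97, 0.81, 0.49, 0.42)          # same hypothetical estimates as above
vi <- c(0.0001, 0.0005, 0.0002, 0.0030)
wi <- 1 / vi
k_bar <- sum(wi * ki) / sum(wi)          # Eq. 5
Q <- sum(wi * (ki - k_bar)^2)            # Eq. 6
m <- length(ki)
c(Q = Q, df = m - 1, p = pchisq(Q, df = m - 1, lower.tail = FALSE))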
1.5 Fitting fixed-effects models
A fixed-effects model in meta-analysis assumes that the observed effects across studies are homogeneous, i.e., except for sampling errors, the observed effects would be a constant across studies. A fixed-effects model without moderators can be used to estimate the common
effect among all observed effects in primary studies. The fixed-effects model for an observed $\kappa_i$ can be expressed as in Eq. 7:

$$\kappa_i = \theta + e_i \qquad (7)$$

where $\theta$ is the population common effect, i.e., the common inter-rater reliability estimate across the $m$ studies, and $e_i$ represents the random sampling error. Under weighted least-squares estimation, the population common effect $\theta$ can be estimated by $\bar{\kappa}$ in Eq. 5, with variance as in Eq. 8:

$$\operatorname{Var}(\bar{\kappa}) = \frac{1}{\sum_{i=1}^{m} w_i} \qquad (8)$$

Under the null hypothesis that the population common inter-rater reliability is 0, $\bar{\kappa}/\sqrt{\operatorname{Var}(\bar{\kappa})}$ follows the standard normal distribution. Accordingly, a 95% CI for the population common inter-rater reliability can be easily constructed as $\bar{\kappa} \pm 1.96\sqrt{\operatorname{Var}(\bar{\kappa})}$. A fixed-effects model assumes that the variation in observed outcomes could be fully explained by study characteristics, and inferences about the population effect are made conditional on the characteristics of the primary studies included in the meta-analysis (Hedges and Vevea 1998). Therefore, it is common practice to include study characteristics (e.g., rater characteristics and study setting) as moderators in a fixed-effects model to explore how they affect the variability of Cohen's κ estimates across studies.
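In practice such models are rarely fitted by hand. The sketch below (mine, not the article's) uses the metafor package, which also appears later in the application, to fit the fixed-effects model of Eq. 7 to the same hypothetical estimates.

# Sketch only: fixed-effects model (Eq. 7) via metafor for hypothetical data.
library(metafor)
ki <- c(0.97, 0.81, 0.49, 0.42)          # hypothetical kappa estimates
vi <- c(0.0001, 0.0005, 0.0002, 0.0030)  # hypothetical Eq. 4 variances
fe <- rma(yi = ki, vi = vi, method = "FE")   # common effect, its SE and 95% CI
summary(fe)
# Study-level moderators can be added, e.g. rma(yi = ki, vi = vi, mods = ~ x, method = "FE").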
1.6 Fitting random-effects models
A random-effects model in meta-analysis assumes that the observed effects across studies are a random variable instead of a fixed constant. The variability among observed effects is not only a result of random sampling errors, but is also caused by random variability at the study level (Hedges 1983; Hedges and Vevea 1998; Raudenbush 2009). The random-effects model for an observed $\kappa_i$ can be expressed as in Eq. 9:

$$\kappa_i = \theta_{\cdot} + \mu_i + e_i \qquad (9)$$

where $\theta_{\cdot}$ is the population average effect, i.e., the overall estimate of inter-rater reliability, $\mu_i$ is the between-study variation that is normally distributed with mean 0 and variance $\tau^2$, and $e_i$ represents the random sampling error. Under the random-effects model, the population average effect $\bar{\kappa}$ can be estimated from Eqs. 10 and 11 and its variance can be estimated from Eq. 12:

$$\bar{\kappa} = \frac{\sum_{i=1}^{m} w_i^* \kappa_i}{\sum_{i=1}^{m} w_i^*} \qquad (10)$$

$$w_i^* = \left[\operatorname{Var}(\kappa_i) + \frac{1}{m-1}\sum_{i=1}^{m}\left(\kappa_i - \frac{1}{m}\sum_{i=1}^{m}\kappa_i\right)^2 - \frac{1}{m}\sum_{i=1}^{m}\operatorname{Var}(\kappa_i)\right]^{-1} \qquad (11)$$

$$\operatorname{Var}(\bar{\kappa}) = \frac{1}{\sum_{i=1}^{m} w_i^*} \qquad (12)$$

Under the null hypothesis that the population average inter-rater reliability is 0, $\bar{\kappa}/\sqrt{\operatorname{Var}(\bar{\kappa})}$ follows the standard normal distribution. A 95% CI for the population average inter-rater reliability can be readily constructed as $\bar{\kappa} \pm 1.96\sqrt{\operatorname{Var}(\bar{\kappa})}$.
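The random-effects model of Eq. 9 can be fitted the same way. In the sketch below (again hypothetical data, not the article's analysis), method = "HE" requests a Hedges-type moment estimator of τ² that appears to correspond to the weights in Eq. 11; the REML estimator used in the application (the metafor default) is an alternative.

# Sketch only: random-effects model (Eq. 9) via metafor for hypothetical data.
library(metafor)
ki <- c(0.97, 0.81, 0.49, 0.42)
vi <- c(0.0001, 0.0005, 0.0002, 0.0030)
re <- rma(yi = ki, vi = vi, method = "HE")   # Hedges moment estimator of tau^2
summary(re)                                  # reports tau^2, Q, I^2 and the 95% CI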
1.7 Fitting mixed-effects models
Exploring the sources of heterogeneity by investigating moderator effects on the outcomes is considered one of the most important and useful aspects of meta-analysis (Thompson 1994). Including moderators in a random-effects model to explain the heterogeneity results in a mixed-effects model, as expressed in Eq. 13:

$$\kappa_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \mu_i + e_i \qquad (13)$$

The variance of $\mu_i$ represents the amount of residual heterogeneity, i.e., the variability of Cohen's κ estimates across studies that cannot be accounted for by the moderators $X_{i1}$ to $X_{ip}$ included in the mixed-effects model. The mixed-effects model assumes that each Cohen's κ estimate is a linear function of moderator effects and residual heterogeneity. An estimate of the overall effect obtained from a random-effects model becomes meaningless and can even be misleading when significant moderator effects are present in a mixed-effects model. Hence, an investigation of moderator effects should always be an integral part of a meta-analysis. In addition to the usual moderators in meta-analysis, including study settings and subject characteristics, rater characteristics are particularly important moderators that may explain the heterogeneity among observed Cohen's κ estimates and should be included in a mixed-effects model.
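A mixed-effects fit simply adds the moderators of Eq. 13 through the mods argument; the sketch below is illustrative, with a hypothetical rater-training indicator.

# Sketch only: mixed-effects model (Eq. 13) with one hypothetical moderator.
library(metafor)
ki <- c(0.97, 0.81, 0.49, 0.42)
vi <- c(0.0001, 0.0005, 0.0002, 0.0030)
training <- c(0, 1, 1, 0)                   # assumed coding: 1 = specially trained raters
me <- rma(yi = ki, vi = vi, mods = ~ factor(training), method = "REML")
summary(me)   # residual tau^2 = heterogeneity not explained by the moderator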
2 An application to pressure ulcer classification systems
2.1 Background
Pressure ulcers (PU) are very serious health problems (Allman 1997) associated with great
pain and distress for patients and extensive health care costs (Graves et al. 2005). PU
classification systems are commonly used to classify skin sites into different categories that
indicate the severity of the condition. Though PU classification systems aim at providing
consistent assessment to promote accurate communication, precise documentation, and
appropriate treatment decisions (Stotts 2001), several widely used PU classification systems
have been criticized for low inter-rater reliability estimates reported in published studies
and their usefulness is being questioned (Russell 2002). A meta-analysis of Cohen's κ is a
useful tool to examine the generalizability of inter-rater reliability estimates across studies,
to determine the factors affecting inter-rater reliability, and to inform PU assessment in
future research and practice.
2.2 Data sources
Kottner et al. (2009) conducted a systematic review of inter-rater reliability for PU classification systems and included 24 primary studies in the final data synthesis. The 24 studies were retrieved and served as the pool of potential studies for the present meta-analysis. To be eligible for final data synthesis, the potential studies had to meet four criteria: (1) the language is English; (2) Cohen's κ was reported as the measure of inter-rater reliability, or sufficient information was provided to calculate Cohen's κ; (3) standard errors of the Cohen's κ estimates were reported, or sufficient information was provided to estimate the standard errors; and (4) Cohen's κ is an appropriate measure of inter-rater reliability for the rating procedure. The selection process (see Fig. 1) resulted in six studies
for this meta-analysis. The variables extracted from the six studies are authors, study location, publication year, PU classification system, number of categories, rating procedure, rater characteristics, number of raters, total number of skin sites, skin site characteristics, percent agreement ($p_0$) and Cohen's κ estimate. Note that five studies contained multiple Cohen's κ estimates obtained from independent samples. A total of fifteen Cohen's κ estimates were identified for final data synthesis.
2.3 Results
As shown in Table 1, seven different PU classification systems were used in the six
primary studies. Although the seven classification systems have very much in common,
they differ in operational definitions of grade 1 PU. Normal skin was involved in three
studies. The prevalence of PU was not reported in five studies, indicating possible heter-
ogeneity of subjects across studies. Five studies used real skin sites for PU assessment
while only one study used images of skin sites. The characteristics of raters were heter-
ogeneous in terms of training and experiences in PU and tissue viability assessment.
Sample sizes (i.e., numbers of skin sites) varied from 35 to 2,396.
Because of the heterogeneity of study characteristics and the inclusion of multiple PU classification systems, a random-effects model was fitted using the R package metafor (Viechtbauer 2010). The metafor package provides flexible and comprehensive functions for fitting various models in general-purpose meta-analysis. The code for all analyses conducted in this meta-analysis is provided in Appendix B. The results obtained from the random-effects model are summarized in Table 2. The test for heterogeneity is significant, Q = 653.50, df = 14, P < .001, suggesting that considerable heterogeneity exists among the true inter-rater reliabilities across studies. The amount of heterogeneity (τ²) is estimated to be 0.06. The I² statistic suggests that 97.58% of the total variability in the Cohen's κ
Fig. 1 Selection process of included studies for the meta-analysis for pressure ulcer classification systems. Of the 24 studies meeting the selection criteria after quality assessment in Kottner et al. (2009), 11 reported only $p_0$ so that Cohen's κ could not be calculated, 3 reported multi-rater κ and 3 reported mean κ, leaving 7 studies reporting Cohen's κ as the inter-rater reliability measure; 1 of these did not provide the information needed to estimate the standard error of Cohen's κ, leaving 6 studies and 15 independent Cohen's κ estimates for the final data synthesis.
Table 1 Study characteristics of the fifteen Cohen's κ estimates included in the meta-analysis

Study/Location | Measure/# of categories | Skin | Normal skin involved? | Raters | Setting | Skin site characteristics | n | κ (p_0)
Bours et al. (1999)/Netherlands | EPUAU/5 | Real | Yes | Trained staff nurses and one researcher | Hospital | PU prevalence rate was 10.1%; 4.1% were Stage I ulcers | 674 | 0.97 (1.00)
Bours et al. (1999)/Netherlands | EPUAU/5 | Real | Yes | Trained staff nurses and one researcher | Nursing home | PU prevalence rate was 83.6%; 60.7% were Stage I ulcers | 344 | 0.81 (0.94)
Bours et al. (1999)/Netherlands | EPUAU/5 | Real | Yes | Trained primary nurses and one wound care specialist | Home health care | PU prevalence rate was 12.7%; 5.4% were Stage I ulcers | 1348 | 0.49 (0.98)
Buntinx et al. (1986)/Belgium | Shea/5 | Real | No | Nurses and physicians with chronic wound experience, without special training in assessment | Hospital | Pressure sores, leg ulcers caused by arterial insufficiency, venous leg ulcers, and amputation wound | 81 | 0.42 (0.67)
Healey (1995)/UK | Stirling 2-digit/15 | Image | Not reported | Nurses | Not reported | Caucasian skin images with various skin problems | 330 | 0.22 (0.59)
Healey (1995)/UK | Torrance/5 | Image | Not reported | Nurses | Not reported | Caucasian skin images with various skin problems | 330 | 0.29 (0.60)
Healey (1995)/UK | Stirling 1-digit/5 | Image | Not reported | Nurses | Not reported | Caucasian skin images with various skin problems | 330 | 0.15 (0.39)
Healey (1995)/UK | Surrey/4 | Image | Not reported | Nurses | Not reported | Caucasian skin images with various skin problems | 309 | 0.37 (0.67)
Nixon et al. (2005)/UK | Adapted EPUAP, NPUAP/7 | Real | Yes | Clinical research nurse team leader; trained and experienced clinical research nurses | Hospital | Skin sites of adult patients | 107 | 0.97 (0.98)
Nixon et al. (2005)/UK | Adapted EPUAP, NPUAP/7 | Real | Yes | Trained and experienced clinical research nurses and trained ward nurses | Hospital | Skin sites of adult patients | 2396 | 0.63 (0.79)
Pedley (2004)/UK | EPUAU/5 | Real | Yes | Trained registered nurses experienced in tissue viability | Hospital | Pressure ulcers of differing severity and a number of pressure points free from pressure damage | 35 | 0.31 (0.49)
Pedley (2004)/UK | Stirling 1-digit/5 | Real | Yes | Trained registered nurses experienced in tissue viability | Hospital | Pressure ulcers of differing severity and a number of pressure points free from pressure damage | 35 | 0.37 (0.54)
Pedley (2004)/UK | Stirling 2-digit/15 | Real | Yes | Trained registered nurses experienced in tissue viability | Hospital | Pressure ulcers of differing severity and a number of pressure points free from pressure damage | 35 | 0.48 (0.54)
Vanderwee et al. (2006)/Belgium | Blanchable and non-blanchable erythema/2 | Real | No | Researcher and trained nurses | Hospital | Geriatric patients, erythema at the heels, hips, and sacrum | 503 | 0.69 (0.92)
Vanderwee et al. (2006)/Belgium | Blanchable and non-blanchable erythema/2 | Real | No | Researcher and trained nurses | Hospital | Geriatric patients, erythema at the heels, hips, and sacrum | 503 | 0.72 (0.92)
estimates can be attributed to heterogeneity among the true inter-rater reliabilities. Figure 2 clearly shows the heterogeneity of the Cohen's κ estimates across studies. The overall estimate of Cohen's κ is 0.53 (95% CI: 0.39–0.66), which indicates moderate inter-rater reliability according to Landis and Koch (1977a).
When subjectivity is involved in the rating process, raters' prior experience and training become particularly important factors that may affect the variability of inter-rater reliability estimates. Based on the descriptions of rater characteristics in the six primary studies, the raters were categorized into two groups: raters with special training in PU assessment versus raters without special training in PU assessment. A mixed-effects model was fitted to explore the effect of rater characteristics on the variability of the Cohen's κ estimates, and the results are summarized in Table 3. The estimated amount of residual
Table 2 Total heterogeneity and population average estimate from a random-effects model

Estimated total amount of heterogeneity (τ²): 0.06, 95% CI [0.03, 0.16]
I² (% of total variability due to heterogeneity): 97.58%
Test for heterogeneity: Q = 653.50, df = 14, P < .001
Population average estimate: 0.53, SE = 0.07, z = 7.75, p < .001, 95% CI [0.39, 0.66]
Fig. 2 A forest plot of the Cohen's κ estimates and the overall estimate from the random-effects model
heterogeneity is 0.03, suggesting that about 50% of the total amount of heterogeneity (0.06, as estimated from the random-effects model) can be accounted for by raters' training. For the group of raters without special training, β = 0.28 (95% CI: 0.12–0.45), SE = 0.08, P < 0.001. For the group of raters with special training, β = 0.66 (95% CI: 0.54–0.78), SE = 0.06, P < 0.001. The difference between the two groups is statistically significant, as suggested by the test of moderators, Q = 13.62, df = 1, P < 0.001. However, the test of residual heterogeneity is still significant, Q = 258.20, df = 13, P < 0.001, indicating that moderators not considered in the model also affect the variability of inter-rater reliability estimates across studies. Additional moderators, including whether normal skin was involved and how PU was assessed (real skin or skin images), were added to the model, but no significant effects were detected.
The funnel plot of the Cohen's κ estimates against their estimated standard errors from the random-effects model (Fig. 3) suggests that possible publication bias exists. The trim and fill method (Duval and Tweedie 2000a, b) was used to adjust for publication bias. The trim and fill method is a nonparametric data augmentation technique that estimates the number of studies missing from a meta-analysis due to suppression of the most extreme observations on one side of the funnel plot. The method then augments the observed data under the fixed- or random-effects model to make the funnel plot more symmetric. Because it is a way of formalizing the use of a funnel plot and the results can be easily understood visually, the trim and fill method is now the most popular method for adjusting for publication bias (Borenstein 2005). Under the random-effects model, the trim and fill method was applied to the PU data and the estimated number of missing studies on either side is zero. In other words, the symmetry of the funnel plot cannot be improved by data augmentation. The mechanisms of publication bias and incomplete data reporting are usually very complicated and may vary with dataset and subject area in meta-analysis (Sutton 2009). The trim and fill method did not work for the present meta-analysis probably because the mechanism of publication bias is not due to the suppression of extreme Cohen's κ estimates, as assumed by the trim and fill method. The existence of publication bias may be partially explained by the fact that eleven primary studies did not report Cohen's κ as the measure of inter-rater reliability and were excluded from the meta-analysis. The inclusion of those studies in the final data synthesis might lead to a different conclusion about publication bias (Fig. 4).
The trim and fill method was also applied to the PU data under the fixed-effects model to further demonstrate its rationale for adjusting for publication bias. The results from the fixed-effects model and the trim-and-filled model are summarized in Table 4. Under the
Table 3 Results from a mixed-effects model with rater characteristic as a moderator

Estimated residual amount of heterogeneity: 0.03, 95% CI [0.01, 0.10]
% of total variability due to the moderator: 52.87%
Test for residual heterogeneity: Q = 258.20, df = 13, P < .001
Test for moderators: Q = 13.62, df = 1, P < .001
Raters without training: β = 0.28, SE = 0.08, z = 3.47, P < .001, 95% CI [0.12, 0.45]
Raters with training: β = 0.66, SE = 0.06, z = 11.00, P < .001, 95% CI [0.54, 0.78]
fixed-effects model, the common Cohen's κ estimate is 0.65 (95% CI: 0.63–0.67). After adjusting for publication bias under the fixed-effects model, six missing studies were estimated on the right side of the funnel plot and the adjusted common Cohen's κ estimate is 0.73 (95% CI: 0.72–0.76). However, as shown in the trimmed-and-filled funnel plot (Fig. 5), the Cohen's κ estimates for the six missing studies are larger than 1 and hence substantively meaningless. This may be viewed as a limitation of the trim and fill method that needs further methodological investigation. Again, the application of the trim and fill method under the fixed-effects model is for demonstration purposes only. The results should not be used to inform future practice.
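For reference, the trim and fill analysis reported above reduces to a few metafor calls, sketched here under the assumption that PUdata is the data frame of fifteen κ estimates (ki) and variances (vi) used in Appendix B.

# Sketch only: trim and fill under the random- and fixed-effects models.
library(metafor)
re  <- rma(ki, vi, data = PUdata, method = "REML")
fe  <- rma(ki, vi, data = PUdata, method = "FE")
trimfill(re)           # estimated number of missing studies under the RE model
ftf <- trimfill(fe)
funnel(ftf)            # funnel plot augmented with the imputed studies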
2.4 Conclusion
The results from the meta-analysis of fifteen Cohen's κ estimates show that (1) the overall inter-rater reliability estimated from a random-effects model is .53, indicating a moderate level of agreement between raters, (2) significant heterogeneity of Cohen's κ estimates exists between studies, and (3) raters with special training in PU assessment tend to produce more reliable ratings than raters without training, suggesting the importance of rater training. In order to obtain as many published studies as possible, this meta-analysis used very broad criteria to select studies and included Cohen's κ estimates generated from seven different PU classification systems. No comparison was made between classification systems because of the small number of Cohen's κ estimates for each classification system. Therefore, it is very difficult to decide which PU classification system should be used in daily practice. When a large number of inter-rater reliability studies have accumulated in the future, it will be necessary to include the PU classification system as a moderator in the mixed-effects model so that the effect of test properties on inter-rater reliability estimates can be understood.
Fig. 3 A funnel plot of the Cohen's κ estimates against standard error estimates from the random-effects model
3 Discussion
The present study proposed a formal statistical framework specifically for combining Cohen's κ estimates across multiple studies and extended traditional meta-analysis of effect sizes to inter-rater reliability. The proposed framework relies on the sampling distribution of Cohen's κ and traditional meta-analytic models to describe the typical inter-rater reliability
Fig. 4 A forest plot of the Cohen's κ estimates with separate estimates for the moderator (raters without training in PU assessment vs. raters with training in PU assessment)
Table 4 Comparison of results from the fixed-effects model and the trim-and-filled fixed-effects model

 | Fixed-effects model | Trim-and-filled fixed-effects model
Test for heterogeneity | Q = 653.50, df = 14, P < .001 | Q = 1257.15, df = 20, P < .001
Common estimate | 0.65 | 0.74
Estimated standard error | 0.009 | 0.009
95% CI for common estimate | [0.63, 0.67] | [0.72, 0.76]
in multiple studies and to quantify between-study variation, and it is thus more rigorous and informative than narrative reviews and systematic reviews. It allows researchers to evaluate how important study characteristics affect the variability of inter-rater reliability estimates across studies and to accumulate psychometric knowledge about the test being used. The findings from a meta-analysis of Cohen's κ will help test developers, test users and methodologists better understand inter-rater reliability and develop effective strategies to improve inter-rater reliability.
Just as even the most skillful chef cannot cook a meal without basic ingredients, the quality of a meta-analysis largely depends on what the primary studies supply. A successful meta-analysis of Cohen's κ requires that the test of interest has been frequently used and that Cohen's κ has frequently been reported as a measure of inter-rater reliability. About 80% of the primary studies in the literature on PU assessment did not report any inter-rater reliability coefficient (Kottner et al. 2009). The consequence of underreporting is that population parameter estimates in meta-analysis will be biased. Therefore, raising awareness of reporting inter-rater reliability estimates is an urgent mission. It was also noted that $p_0$ is more frequently reported as the measure of inter-rater reliability than Cohen's κ. $p_0$ does not account for chance agreement between two raters and is a positively biased measure of the true systematic tendency of the two raters to agree with each other (Fleiss 1981). In contrast, Cohen's κ adjusts for chance agreement and hence should be the preferred measure of inter-rater reliability. Moreover, the standard error is a measure of estimation precision and a key element for synthesizing estimates across different studies in meta-analysis. Unfortunately, standard errors are generally not reported for Cohen's κ in
Fig. 5 A funnel plot of the Cohen's κ estimates against standard error estimates after adjusting for publication bias with the trim and fill method (the dots to the right of the vertical line are the missing observations estimated by the trim and fill method)
published studies. If $p_0$ is reported alongside a Cohen's κ estimate in a primary study, the standard error can be estimated from Eqs. 1 and 4; otherwise, the study has to be excluded from the final data synthesis. Researchers are strongly recommended to explicitly report Cohen's κ with its estimated standard error to facilitate future meta-analyses.
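The back-calculation described here is easy to automate. The sketch below is mine (the function name is illustrative); it recovers $p_c$ from a reported ($p_0$, $\hat{\kappa}$) pair via Eq. 1 and returns the Eq. 4 standard error.

# Sketch only: approximate standard error of a reported kappa, given p0 and n.
se_from_p0_kappa <- function(p0, kappa, n) {
  pc <- (p0 - kappa) / (1 - kappa)          # invert Eq. 1
  sqrt(p0 * (1 - p0) / (n * (1 - pc)^2))    # square root of Eq. 4
}
se_from_p0_kappa(p0 = 0.97, kappa = 0.79, n = 2396)  # about 0.024, cf. Appendix A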
Last but not least, the appropriate use of Cohen's κ in primary studies is essential for meta-analysis. By assumption, Cohen's κ is an appropriate measure of inter-rater reliability when all of the subjects are rated by two equally competent raters and the correctness of the ratings usually cannot be determined. When these assumptions are violated, Cohen's κ cannot be used to indicate the consistency of the ratings between two raters. For instance, Hart et al. (2006) reported Cohen's κ between hospital nurses and PU experts. Ratings made by experts were considered correct classifications, and the reliability between nurses and experts was considered rater-to-standard reliability. Whether rater-to-standard reliability is a valid concept and how to assess it are completely different issues, but apparently Cohen's κ was misused and the study cannot be included in the meta-analysis. In fact, Cohen's κ has been extended to estimate inter-rater reliability for many complicated rating scenarios (e.g., Berry and Mielke 1998; Cohen 1968, 1972; Davies and Fleiss 1982; Fleiss 1971; Gross 1986; Janson and Olsson 2001, 2004; Kraemer 1980; Kraemer et al. 2004; Landis and Koch 1977b; Vanbelle and Albert 2009). Sadly, those extended measures are rarely, if ever, applied in empirical research. The misuse of Cohen's κ and the large number of extended measures of inter-rater reliability call for clear and easy-to-follow tutorials on conducting inter-rater reliability studies and choosing the appropriate measure of inter-rater reliability for different rating scenarios.
Appendix A
An example illustrating how to calculate Cohen's κ, its variance and confidence interval using real data from Nixon et al. (2005).

Ratings by the clinical research nurse (rows) and the ward nurse (columns):

 | No pressure ulcer | Pressure ulcer | Total
No pressure ulcer | (a) 2175 | (b) 35 | (f_1) 2210
Pressure ulcer | (c) 42 | (d) 144 | (f_2) 186
Total | (g_1) 2217 | (g_2) 179 | (n) 2396

$$p_0 = \frac{a + d}{n} = \frac{2175 + 144}{2396} \approx 0.97$$

$$p_c = \frac{\frac{f_1 g_1}{n} + \frac{f_2 g_2}{n}}{n} = \frac{\frac{2210 \times 2217}{2396} + \frac{186 \times 179}{2396}}{2396} \approx 0.86$$

$$\hat{\kappa} = \frac{p_0 - p_c}{1 - p_c} = \frac{0.97 - 0.86}{1 - 0.86} \approx 0.79$$
$$\operatorname{Var}(\hat{\kappa}) = \frac{1}{(1 - p_c)^2}\,\frac{p_0(1 - p_0)}{n} = \frac{1}{(1 - 0.86)^2}\,\frac{0.97(1 - 0.97)}{2396} \approx 6.20 \times 10^{-4}$$

$\hat{\kappa} \pm 1.96\sqrt{\operatorname{Var}(\hat{\kappa})} \approx 0.79 \pm 1.96 \times 0.02$. Therefore, the 95% CI for Cohen's κ is [0.75, 0.83].

Note that the calculation of $\operatorname{Var}(\hat{\kappa})$ depends on $p_c$, which is usually not reported in research articles. When $p_0$ and $\hat{\kappa}$ are reported, $p_c$ can be derived as $p_c = \frac{p_0 - \hat{\kappa}}{1 - \hat{\kappa}}$.
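The Appendix A arithmetic can be verified in a few lines of R (a sketch, not part of the article). Because the appendix rounds $p_0$ and $p_c$ to two decimals, unrounded proportions give a slightly smaller point estimate (about 0.77).

# Sketch only: kappa, Eq. 4 variance and 95% CI from the Nixon et al. (2005) counts.
tab <- matrix(c(2175, 35, 42, 144), nrow = 2, byrow = TRUE)
n  <- sum(tab)
p0 <- sum(diag(tab)) / n                         # observed agreement
pc <- sum(rowSums(tab) * colSums(tab)) / n^2     # chance agreement
kappa <- (p0 - pc) / (1 - pc)                    # Eq. 1
v     <- p0 * (1 - p0) / (n * (1 - pc)^2)        # Eq. 4
c(kappa = kappa, lower = kappa - 1.96 * sqrt(v), upper = kappa + 1.96 * sqrt(v))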
Appendix B
R code for all analyses conducted in the meta-analysis of Cohen's κ for PU classification systems
# Load the metafor package (install it first with install.packages("metafor") if needed)
library(metafor)

# Derive vi, the variance for each kappa estimate.
# p0i and ki are the percent agreement and Cohen's kappa estimate for study i;
# pci is the chance agreement derived from p0i and ki (as shown in Appendix A)
pci <- (p0i - ki) / (1 - ki)
vi  <- p0i * (1 - p0i) / ((1 - pci)^2 * ni)

# Fit the random-effects model using rma() and get confidence intervals for parameter estimates
res <- rma(ki, vi, data = PUdata)
confint(res)

# Get the forest plot
forest(res, at = c(-0.2, 0, 0.2, 0.4, 0.6, 0.8, 1, 1.2), slab = paste(PUdata$Study))
op <- par(cex = 1, font = 2)
text(-1.25, 17, "Study", pos = 4)
text(1.5, 17, "Cohen's Kappa [95% CI]", pos = 4)

# Get the funnel plot to check publication bias
funnel(res, main = "Random-Effects Model")

# Sort the data matrix by the moderator variable rater for the mixed-effects model
data <- PUdata[order(PUdata$rater), ]

# Fit the mixed-effects model using rma() and get confidence intervals for parameter estimates
mix <- rma(ki, vi, mods = ~ factor(rater), data = data)
confint(mix)

# Get the forest plot for the mixed-effects model and add group estimates to the bottom of the plot
forest(data$ki, data$vi, at = c(-0.2, 0, 0.2, 0.4, 0.6, 0.8, 1, 1.2), ylim = c(-3, 18),
       slab = paste(data$Study))
preds <- predict(mix, newmods = c(0, 1))
op <- par(cex = 1.1, font = 1)
addpoly(preds$pred, sei = preds$se, mlab = c("Raters without training", "Raters with training"))
abline(h = 0)
abline(h = 10.5)
text(1.2, 0.4, "Raters with training")
text(1.2, 11, "Raters without training")
op <- par(cex = 1.1, font = 2)
text(-1.25, 17, "Study", pos = 4)
text(1.5, 17, "Cohen's Kappa [95% CI]", pos = 4)

# Use the trim and fill method to adjust for publication bias under the random-effects model
re  <- rma(ki, vi, data = PUdata, method = "REML")
rtf <- trimfill(re)

# Use the trim and fill method to adjust for publication bias under the fixed-effects model
fe  <- rma(ki, vi, data = PUdata, method = "FE")
ftf <- trimfill(fe)

# Get the funnel plot with augmented data
funnel(ftf)
abline(v = 1)
References
Allman, R.M.: Pressure ulcer prevalence, incidence, risk factors, and impact. Clin. Geriatr. Med. 13,
421–436 (1997)
Altman, D.G.: Practical Statistics for Medical Research. Chapman and Hall, London (1991)
Baugh, F.: Correcting effect sizes for score reliability: a reminder that measurement and substantive issues
are linked inextricably. Educ. Psychol. Meas. 62, 254–263 (2002)
Banerjee, M., Capozzoli, M., McSweeny, L., Sinha, D.: Beyond kappa: a review of interrater agreement
measures. Can. J. Stat. 27, 3–23 (1999)
Berry, K.J., Mielke, P.W.: A generalization of Cohen’s kappa agreement measure to interval measurement
and multiple raters. Educ. Psychol. Meas. 48, 921–933 (1998)
Blackman, N.J.-M., Koval, J.J.: Interval estimation for Cohen’s kappa as a measure of agreement. Stat. Med.
19, 723–741 (2000)
Block, D.A., Kraemer, H.C.: 2 × 2 kappa coefficients: measures of agreement or association. Biometrics 45, 269–287 (1989)
Borenstein, M.: Software for publication bias. In: Rothstein, H.R., Sutton, A.J., Borenstein, M. (eds.)
Publication Bias in Meta-Analysis—Prevention, Assessment and Adjustments, pp. 193–220. Wiley,
Chichester (2005)
Bours, G., Halfens, R., Lubbers, M., Haalboom, J.: The development of a National Registration Form to
measure the prevalence of pressure ulcers in the Netherlands. Ostomy Wound Manage. 45, 28–40
(1999)
Brennan, R.L., Prediger, D.J.: Coefficient kappa: some uses, misuses, and alternatives. Educ. Psychol. Meas.
41, 687–699 (1981)
Brennan, R.L., Silman, A.: Statistical methods for assessing observer variability in clinical measures. Br.
Med. J. 304, 1491–1494 (1992)
Buntinx, F., Beckers, H., De Keyser, G., Flour, M., Nissen, G., Raskin, T., De Vet, H.: Inter-observer
variation in the assessment of skin ulceration. J. Wound Care 5, 166–170 (1986)
Capraro, M.M., Capraro, R.M., Henson, R.K.: Measurement error of scores on the Mathematics Anxiety
Rating Scale across studies. Educ. Psychol. Meas. 61, 373–386 (2001)
Caruso, J.C.: Reliability generalization of the NEO personality scales. Educ. Psychol. Meas. 60, 236–254
(2000)
Cohen, J.: A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20, 37–46 (1960)
Cohen, J.: Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial
credit. Psychol. Bull. 70, 220–231 (1968)
Cohen, J.: Weighted Chi square: an extension of the kappa method. Educ. Psychol. Meas. 32, 61–74 (1972)
Davies, M., Fleiss, J.L.: Measurement agreement for multinomial data. Biometrics 38, 1047–1051 (1982)
Health Serv Outcomes Res Method (2011) 11:145–163 161
123
Duval, S., Tweedie, R.: A nonparametric "trim and fill" method of assessing publication bias in meta-analysis. J. Am. Stat. Assoc. 95(449), 89–98 (2000a)
Duval, S., Tweedie, R.: Trim and fill: a simple funnel plot based method of testing and adjusting for
publication bias in meta-analysis. Biometrics 56, 455–463 (2000b)
Everitt, B.S.: Moments of the statistics kappa and weighted kappa. Br. J. Math. Stat. Psychol. 21, 97–103
(1968)
Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychol. Bull. 76, 378–382 (1971)
Fleiss, J.L.: Statistical Methods for Rates and Proportions, 2nd edn. Wiley, New York (1981)
Fleiss, J.L., Cohen, J., Everitt, B.S.: Large sample standard errors of kappa and weighted kappa. Psychol.
Bull. 72, 323–327 (1969)
Fleiss, J.L., Shrout, P.E.: The effects of measurement errors on some multivariate procedures. Am. J. Public
Health 67, 1188–1191 (1977)
Graves, N., Birrell, F.A., Whitby, M.: Modeling the economic losses from pressure ulcers among hospi-
talized patients in Australia. Wound. Rep. Reg. 13, 462–467 (2005)
Gross, S.T.: The kappa coefficient of agreement for multiple observers when the number of subjects is small.
Biometrics 42, 883–893 (1986)
Guilford, J.P., Fruchter, B.: Fundamental Statistics in Psychology and Education, 6th edn. McGraw-Hill,
New York (1978)
Hart, S., Bergquist, S., Gajewski, B., Dunton, N.: Reliability testing of the national database of nursing
quality indicators pressure ulcer indicator. J. Nurs. Care Qual. 21, 256–265 (2006)
Healey, F.: The reliability and utility of pressure sore grading scales. J. Tissue Viability 5, 111–114 (1995)
Hedges, L.V.: Fitting categorical models to effect sizes from a series of experiments. J. Educ. Stat. 7,
119–137 (1982a)
Hedges, L.V.: Fitting continuous models to effect sizes from a series of experiments. J. Educ. Stat. 7,
245–270 (1982b)
Hedges, L.V.: A random effects model for effect sizes. Psychol. Bull. 93, 388–395 (1983)
Hedges, L.V., Vevea, J.L.: Fixed and random effects models in meta-analysis. Psychol. Methods 3, 486–504
(1998)
Helms, J.E.: Another meta-analysis of the White Racial Identity Attitude Scale’s Cronbach alphas: impli-
cations for validity. Meas. Eval. Couns. Dev. 32, 122–137 (1999)
Henson, R.K.: Understanding internal consistency reliability estimates: a conceptual primer on coefficient
alpha. Meas. Eval. Couns. Dev. 34, 177–189 (2001)
Henson, R.K., Kogan, L.R., Vacha-Haase, T.: A reliability generalization study of the Teacher Efficacy
Scale and related instruments. Educ. Psychol. Meas. 61, 404–420 (2001)
Hunter, J.E., Schmidt, F.L.: Methods of Meta-Analysis: Correcting Error and Bias in Research Findings.
Sage, Newbury Park (1990)
Huynh, Q., Howell, R.T., Benet-Martinez, V.: Reliability of bidimensional acculturation scores: a meta-
analysis. J. Cross Cult. Psychol. 40, 256–274 (2009)
Janson, H., Olsson, U.: A measure of agreement for interval or nominal multivariate observations. Educ.
Psychol. Meas. 61, 277–289 (2001)
Janson, H., Olsson, U.: A measure of agreement for interval or nominal multivariate observations by
different sets of judges. Educ. Psychol. Meas. 64, 62–70 (2004)
Kottner, J., Raeder, K., Halfens, R., Dassen, T.: A systematic review of inter-rater reliability of pressure
ulcers classification systems. J. Clin. Nurs. 18, 315–336 (2009)
Koval, J.J., Blackman, N.J.-M.: Estimators of kappa-exact small sample properties. J. Stat. Comput. Simulat.
55, 513–536 (1996)
Kraemer, H.C.: Ramifications of a population model for κ as a coefficient of reliability. Psychometrika 44, 461–472 (1979)
Kraemer, H.C.: Extension of the kappa coefficient. Biometrics 36, 207–216 (1980)
Kraemer, H.C., Vyjeyanthi, S.P., Noda, A.: Kappa coefficients in medical research. In: D’Agostino, R.B.
(ed.) Tutorials in Biostatistics Volume 1: Statistical Methods in Clinical Studies, pp. 85–105. Wiley,
New York (2004)
Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33,
159–174 (1977a)
Landis, J.R., Koch, G.G.: An application of hierarchical kappa-type statistics in the assessment of majority
agreement among multiple observers. Biometrics 33, 363–374 (1977b)
Linacre, J.M.: Many-Facet Rasch Measurement. MESA Press, Chicago (1989)
Lord, F.M., Novick, M.R.: Statistical Theories of Mental Test Scores. Addison-Wesley, Reading (1968)
162 Health Serv Outcomes Res Method (2011) 11:145–163
123
Miller, C.S., Shields, A.L., Campfield, D., Wallace, K.A., Weiss, R.D.: Substance use scales of the Min-
nesota Multiphasic Personality Inventory: an exploration of score reliability via meta-analysis. Educ.
Psychol. Meas. 67, 1052–1065 (2007)
Nixon, J., Thorpe, H., Barrow, H., Phillips, A., Nelson, E.A., Mason, S.A., Cullum, N.: Reliability of
pressure ulcer classification and diagnosis. J. Adv. Nurs. 50, 613–623 (2005)
Pedley, G.E.: Comparison of pressure ulcer grading scales: a study of clinical utility and inter-rater reli-
ability. Int. J. Nurs. Stud. 41, 129–140 (2004)
Raudenbush, S.W.: Analyzing effect sizes: random-effects models. In: Cooper, H.M., Hedges, L.V.,
Valentine, J.C. (eds.) The Handbook of Research Synthesis and Meta-Analysis, 2nd edn, pp. 295–315.
Russel Sage Foundation, New York (2009)
Rohner, R.P., Khaleque, A.: Reliability and validity of the Parental Control Scale: a meta-analysis of cross-
cultural and intracultural studies. J. Cross Cult. Psychol. 34, 643–649 (2003)
Russell, L.: Pressure ulcer classification: the systems and the pitfalls. Br. J. Nurs. 11, S49–S59 (2002)
Schmidt, F.L., Hunter, J.E.: Development of a general solution to the problem of validity generalization.
J. Appl. Psychol. 62, 529–540 (1977)
Shrout, P.E.: Measurement reliability and agreement in psychiatry. Stat. Meth. Med. Res. 7, 301–317 (1998)
Sim, J., Wright, C.C.: The Kappa statistic in reliability studies: use, interpretation and sample size
requirements. Phys. Ther. 85, 257–268 (2005)
Spearman, C.E.: The proof and measurement of association between two things. Am. J. Psychol. 15, 72–101
(1904)
Stemler, S.E., Tsai, J.: Best practices in interrater reliability: three common approaches. In: Osborne, J.W.
(ed.) Best Practices in Quantitative Methods, pp. 29–49. Sage, Thousand Oaks (2008)
Stotts, N.A.: Assessing a patient with a pressure ulcer. In: Morison, M.J. (ed.) The Prevention and Treatment
of Pressure Ulcers, pp. 99–115. Mosby, London (2001)
Suen, H.K.: Agreement, reliability, accuracy and validity: toward a clarification. Behav. Assess. 10,
343–366 (1988)
Sutton, A.J.: Publication bias. In: Cooper, H.M., Hedges, L.V., Valentine, J.C. (eds.) The handbook of
research synthesis and meta-analysis, 2nd edn, pp. 435–452. Russel Sage Foundation, New York
(2009)
Thompson, B.: Guidelines for authors. Educ. Psychol. Meas. 54, 837–847 (1994a)
Thompson, B.: Score Reliability: Contemporary Thinking on Reliability Issues. Sage, Thousand Oaks
(2002)
Thompson, B., Vacha-Hasse, T.: Psychometrics is datametrics: the test is not reliable. Educ. Psychol. Meas.
60, 174–195 (2000)
Thompson, S.G.: Why sources of heterogeneity in meta-analysis should be investigated. Br. Med. J. 309,
1351–1355 (1994b)
Thorndike, R.M.: Measurement and Evaluation in Psychology and Education. Pearson Merrill Prentice Hall,
Upper Saddle River (2005)
Vacha-Hasse, T.: Reliability generalization: exploring variance in measurement error affecting score reli-
ability across studies. Educ. Psychol. Meas. 58, 6–20 (1998)
Vacha-Hasse, T., Henson, R.K., Caruso, J.C.: Reliability generalization: moving toward improved under-
standing and use of score reliability. Educ. Psychol. Meas. 62, 562–569 (2002)
Vanbelle, S., Albert, A.: Agreement between two independent groups of raters. Psychometrika 74, 477–492
(2009)
Vanderwee, K., Grypdonck, M., De Bacquer, D., Defloor, T.: The reliability of two observation methods of
nonblanchable erythema, Grade 1 pressure ulcer. Appl. Nurs. Res. 19, 156–162 (2006)
Viechtbauer, W.: Conducting meta-analyses in Rwith the metafor package. J. Stat. Softw. 36, 1–48 (2010)
Viera, A.J., Garrett, J.M.: Understanding interobserver agreement: the Kappa statistic. Fam. Med. 37,
360–363 (2005)
Viswesvaran, C., Ones, D.S.: Measurement error in "Big Five Factors" personality assessment: reliability generalization across studies and measures. Educ. Psychol. Meas. 60, 224–235 (2000)
von Eye, A.: An alternative to Cohen's κ. Eur. Psychol. 11, 12–24 (2006)
Yin, P., Fan, X.: Assessing the reliability of Beck Depression Inventory scores: reliability generalization
across studies. Educ. Psychol. Meas. 60, 201–223 (2000)
Zwick, R.: Another look at interrater agreement. Psychol. Bull. 103, 374–378 (1988)
... The level of interrater agreement will be assessed using Cohen kappa inter-rater reliability. [16] The articles and abstracts from the search will be evaluated for relevance and categorized into 1 of 3 groups: not pertinent, pertinent, or possibly pertinent. The pertinent full text will be thoroughly assessed in order to identify the studies applicable to this review. ...
... BCM and PM will independently assess the risk of bias using the 4 domains of the Downs and Black checklist, namely, reporting bias (10 items), external validity (3 items), internal validity (6 items), and selection bias (7 items). The results will be graded as excellent (25-26), good (20)(21)(22)(23)(24), moderate (14)(15)(16)(17)(18)(19), poor (11)(12)(13), and extremely bad (<10). [17] Additionally, BCM, PM and NCM will evaluate the included studies, and independent reviewers (PSN and AK) will arbitrate any disagreements. ...
Article
Full-text available
Background The incidence and prevalence of prediabetes has become a global concern. The risk factors of prediabetes, such as insulin resistance, adiposity, lipotoxicity and obesity, in conjunction with the alteration of the renin-angiotensin-aldosterone system (RAAS), have been positively correlated with the high morbidity and mortality rate. Thus, this systematic review seeks to establish the relationship between the risk factors of prediabetes, namely insulin resistance adiposity, lipotoxicity, obesity and the RAAS. Therefore, a synthesis of these risk factors, their clinical indicators and the RAAS components will be compiled in order to establish the association between the RAAS alteration and obesity in prediabetic patients. Methods This protocol for a systematic review was developed in compliance with the Preferred Reporting Items for Systematic Review and Meta-Analysis Protocols (PRISMA-P) standards. This will be accomplished by searching clinical Medical Subject Headings categories in MEDLINE with full texts, EMBASE, Web of Science, PubMed, Cochrane Library, Academic Search Complete, ICTRP and ClinicalTrial.gov. Reviewers will examine all of the findings and select the studies that meet the qualifying criteria. To check for bias, the Downs and Black Checklist will be used, followed by a Review Manager v5. A Forrest plot will be used for the meta-analysis and sensitivity analysis. Furthermore, the strength of the evidence will be assessed utilizing the Grading of Recommendations Assessment, Development, and Evaluation procedure (GRADE). The protocol has been registered with PROSPERO CRD42022320252. This systematic review and meta-analysis will include published randomized clinical trials, observational studies and case-control studies from the years 2000 to 2022.
... Level of agreement for correlation coefficient analysis [ 12 ]. ...
... This process helped us refine the concept, make improvements, and facilitate the adoption of the scoring tool. Finally, Cohen's Kappa was used for reliability to show how high the agreement was between two raters [47]; the value of 0.74 was "substantial" for this tool. ...
Article
Full-text available
Nursing students can access massive amounts of online health data to drive cutting-edge evidence-based practice in clinical placement, to bridge the theory–practice gap. This activity requires investigation to identify the strategies nursing students apply to evaluate online health information. Online Think-Aloud sessions enabled 14 participants to express their cognitive processes in navigating various educational resources, including online journals and databases, and determining the reliability of sources, indicating their strategies for information-seeking, which helped to create this scoring system. Easy access and user convenience were clearly the instrumental factors in this behavior, which has troubling implications for the lack of use of higher-quality resources (e.g., from peer-reviewed academic journals). The identified challenges encountered during resource access included limited skills in the critical evaluation of information credibility and reliability, signaling a requirement for improved information literacy skills. Participants acknowledged the importance of evidence-based, high-quality information, but faced numerous barriers, such as restricted access to professional and specialty databases, and a lack of academic skills training. This paper develops and critiques a Performative Tool for assessing the process of seeking health information using an online Think-Aloud method, and explores factors and strategies contributing to evidence-based health information access and utilization in clinical practice, aiming to provide insight into individuals’ information-seeking behaviors in online health contexts.
... Second, to establish the reliability of the coding scheme (a priori and rubric scoring), we went through rounds of inter-rater reliability, where different members of the research team coded at least 10% of the artifacts (Hallgren, 2012) to test the validity of the codes. We calculated Cohen's Kappa (Sun, 2011) values for the nominal codes and percent positive agreement (Campbell et al., 2013) for the ordinal rubric codes. Through rounds of coding and discussion, we narrowed the definitions and use of the codes until we reached acceptable values (at least 80% initial agreement). ...
Article
Full-text available
Preservice elementary teachers enter their science methods courses with a range of prior experience with science practice. Those prior experiences likely inform much of their science pedagogy and goals. In this study, the authors examine how a cohort of preservice elementary teachers engaged in science practice as they learned content in a physics course. Drawing on course documents, video records, and artifacts from in-class lab work and interviews with nine participants, the authors used an asset-based, mixed methods approach. The authors developed rubrics to assess the level of sophistication the participants used while engaging in science practice on a scale of 1 (pre-novice) to 4 (experienced). They used descriptive statistics and ANOVAs to interpret participants' performance, in addition to grounded-theory open coding of interviews to determine participants' level of prior experience with science practice. The findings suggest that these preservice teachers primarily engaged in science practices at a novice level. In general, their sophistication scores on the rubric aligned with their prior experience. The findings suggest that while one content course steeped in science practice was not enough to significantly change preservice teachers' engagement, it can provide a needed starting place, and that developing these skills likely takes time. The findings have implications for both teacher educators and researchers who hope to increase the use of science practice as a method of learning science content.
... Clearly, GA-XGBoost and GA-RF achieved high accuracies and Cohen's kappa values upon GA-driven feature reduction. The Cohen's kappa values of both models ranged from 0.47 to 0.62 against the training (Leave-20%-Out) and testing sets (Table 4, default ACs definitions), indicating moderate to substantial reliability [56,57]. Nonetheless, the two models fell short of almost-perfect reliability (i.e., κ from 0.81 to 1.0) [40]. ...
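A minimal sketch, with hypothetical labels and predictions, of how a classifier's test-set kappa can be computed and mapped onto the Landis and Koch verbal bands cited in the excerpt (0.41-0.60 moderate, 0.61-0.80 substantial, 0.81-1.00 almost perfect); none of this reproduces the cited models or data.

```python
# Minimal sketch (hypothetical labels/predictions): test-set Cohen's kappa with a Landis & Koch reading.
from sklearn.metrics import cohen_kappa_score

y_test = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0]   # observed class labels (hypothetical)
y_pred = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0]   # model predictions (hypothetical)

kappa = cohen_kappa_score(y_test, y_pred)

def landis_koch(k):
    """Conventional verbal labels for kappa (Landis & Koch, 1977)."""
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    return next(label for upper, label in bands if k <= upper) if k > 0 else "poor"

print(f"kappa = {kappa:.2f} ({landis_koch(kappa)})")
```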
Article
Full-text available
Activity cliffs (ACs) are pairs of structurally similar molecules with significantly different affinities for a biotarget, posing a challenge in computer-assisted drug discovery. This study focuses on protein kinases, significant therapeutic targets, with some exhibiting ACs while others do not despite numerous inhibitors. The hypothesis that the presence of ACs is dependent on the target protein and its complete structural context is explored. Machine learning models were developed to link protein properties to ACs, revealing specific tripeptide sequences and overall protein properties as critical factors in ACs occurrence. The study highlights the importance of considering the entire protein matrix rather than just the binding site in understanding ACs. This research provides valuable insights for drug discovery and design, paving the way for addressing ACs-related challenges in modern computational approaches.
... The estimation model used in this study is a random-effects model (Setiawan et al., 2022; Tamur et al., 2020). Publication bias was checked through funnel plot analysis and Rosenthal's Fail Safe N test (Chamdani et al., 2022; Diah et al., 2022; Sun, 2015). If the Fail Safe N value satisfies N/(5k + 10) > 1, the studies in the meta-analysis are considered resistant to publication bias (Mullen, 2001). ...
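The tolerance rule quoted in the excerpt is easy to check directly; the sketch below does so with invented numbers for the fail-safe N and the number of primary studies.

```python
# Minimal sketch (hypothetical numbers): fail-safe N checked against the 5k + 10 tolerance level.
def failsafe_ratio(n_fs, k):
    """Return N_fs / (5k + 10); values > 1 are read as robustness to publication bias."""
    return n_fs / (5 * k + 10)

k_studies = 15      # number of primary studies in the meta-analysis (hypothetical)
n_failsafe = 250    # fail-safe N reported by the software (hypothetical)

ratio = failsafe_ratio(n_failsafe, k_studies)
print(f"N_fs / (5k + 10) = {ratio:.2f} -> {'robust' if ratio > 1 else 'not robust'} to publication bias")
```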
Article
Full-text available
This study aims to determine the effect of STEM-based guided inquiry models on students' creative thinking skills in science learning. This type of research is a meta-analysis. The study analyzed 15 primary studies published in 2018-2023 that met the inclusion criteria. Data sources were searched through the Google Scholar database, ERIC, Taylor & Francis, ScienceDirect and ProQuest. Data were analyzed with the help of the JASP application, version 0.16.3. The results show that the overall effect size is 0.99 (95% CI [0.79, 1.19]), which falls in the high category. These findings show that the application of STEM-based guided inquiry learning models affects students' 21st-century thinking skills. In addition, these findings provide important information on STEM-based guided inquiry learning in schools.
Article
Urdu, characterized by its intricate morphological structure and linguistic nuances, presents distinct challenges in computational sentiment analysis. Addressing these, we introduce "UrduAspectNet", a dedicated model tailored for Aspect-Based Sentiment Analysis (ABSA) in Urdu. Central to our approach is a rigorous preprocessing phase. Leveraging the Stanza library, we extract Part-of-Speech (POS) tags and lemmas, ensuring Urdu's linguistic intricacies are aptly represented. To probe the effectiveness of different embeddings, we trained our model using both mBERT and XLM-R embeddings, comparing their performances to identify the most effective representation for Urdu ABSA. Recognizing the nuanced inter-relationships between words, especially in Urdu's flexible syntactic constructs, our model incorporates a dual Graph Convolutional Network (GCN) layer. Addressing the challenge of the absence of a dedicated Urdu ABSA dataset, we curated our own, collecting 4,603 news headlines from various domains, such as politics, entertainment, business, and sports. These headlines, sourced from diverse news platforms, not only identify prevalent aspects but also pinpoint their sentiment polarities, categorized as positive, negative, or neutral. Despite the inherent complexities of Urdu, such as its colloquial expressions and idioms, "UrduAspectNet" showcases remarkable efficacy. Initial comparisons between mBERT and XLM-R embeddings integrated with the dual GCN provide valuable insights into their respective strengths in the context of Urdu ABSA. With broad applications spanning media analytics, business insights, and socio-cultural analysis, "UrduAspectNet" is positioned as a pivotal benchmark in Urdu ABSA research.
Book
Full-text available
The theory and application of Many-Facet Rasch Measurement (MFRM) to judged (rated or rank-ordered) performances, with a description of the estimation of MFRM measures, focusing on missing data.
Article
Full-text available
Most currently used measures of interrater agreement for the nominal case incorporate a correction for chance agreement. The definition of chance agreement, however, is not the same for all coefficients. Three chance-corrected coefficients are Cohen’s (1960) κ; Scott’s (1955) π; and the S index of Bennett, Alpert, and Goldstein (1954), which has reappeared in many guises. For all three measures, independence between raters is assumed in deriving the proportion of agreement expected by chance. Scott’s π involves a further assumption of homogeneous rater marginals, and the S coefficient requires the assumption of uniform marginal distributions for both raters. Because of these disparate formulations, κ, π, and S can lead to different conclusions about rater agreement. Consideration of the properties of these measures leads to the recommendation that marginal homogeneity be assessed as a first step in the analysis of rater agreement. If marginal homogeneity can be assumed, π can be used as an index of agreement.
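The contrast among the three coefficients comes down to how chance agreement is defined; a minimal sketch with a hypothetical 2x2 cross-classification makes the difference explicit. The counts below are invented for illustration.

```python
# Minimal sketch (hypothetical 2x2 table): kappa, Scott's pi, and the S index differ only in chance agreement.
import numpy as np

# Cross-classification of two raters' nominal judgments (rows: rater 1, cols: rater 2), hypothetical counts.
table = np.array([[40, 10],
                  [ 5, 45]], dtype=float)
n = table.sum()
p = table / n
po = np.trace(p)                           # observed agreement

row, col = p.sum(axis=1), p.sum(axis=0)
pe_kappa = np.sum(row * col)               # independence, each rater's own marginals (Cohen)
pe_pi = np.sum(((row + col) / 2) ** 2)     # independence plus homogeneous marginals (Scott)
pe_s = 1.0 / p.shape[0]                    # independence plus uniform marginals (Bennett et al.)

for name, pe in [("kappa", pe_kappa), ("pi", pe_pi), ("S", pe_s)]:
    print(f"{name}: {(po - pe) / (1 - pe):.3f}")
```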
Article
Cohen's kappa statistic is a very well known measure of agreement between two raters with respect to a dichotomous outcome. Several expressions for its asymptotic variance have been derived, and the normal approximation to its distribution has been used to construct confidence intervals. However, information on the accuracy of these normal-approximation confidence intervals is not comprehensive. Under the common correlation model for dichotomous data, we evaluate 95 per cent lower confidence bounds constructed using four asymptotic variance expressions. Exact computation, rather than simulation, is employed. Specific conditions under which the use of asymptotic variance formulae is reasonable are determined.
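As a rough companion to the abstract above, the sketch below computes kappa for a hypothetical 2x2 table together with one of the simpler asymptotic variance approximations (usually attributed to Cohen, 1960) and the resulting normal-approximation interval and lower bound; the cited study compares several variance formulae, none of which is reproduced exactly here.

```python
# Minimal sketch (hypothetical 2x2 counts): kappa with a simple asymptotic variance and a Wald-type CI.
import numpy as np
from scipy import stats

table = np.array([[30, 10],
                  [ 6, 54]], dtype=float)   # hypothetical two-rater dichotomous ratings
n = table.sum()
p = table / n
po = np.trace(p)
pe = np.sum(p.sum(axis=1) * p.sum(axis=0))

kappa = (po - pe) / (1 - pe)
se = np.sqrt(po * (1 - po) / (n * (1 - pe) ** 2))   # simple large-sample standard error

z = stats.norm.ppf(0.975)
print(f"kappa = {kappa:.3f}, 95% CI = ({kappa - z * se:.3f}, {kappa + z * se:.3f})")
print(f"95% lower confidence bound = {kappa - stats.norm.ppf(0.95) * se:.3f}")  # one-sided bound
```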
Article
This volume considers the problem of quantitatively summarizing results from a stream of studies, each testing a common hypothesis. In the simplest case, each study yields a single estimate of the impact of some intervention. Such an estimate will deviate from the true effect size as a function of random error because each study uses a finite sample size. What is distinctive about this chapter is that the true effect size itself is regarded as a random variable taking on different values in different studies, based on the belief that differences between the studies generate differences in the true effect sizes. This approach is useful in quantifying the heterogeneity of effects across studies, incorporating such variation into confidence intervals, testing the adequacy of models that explain this variation, and producing accurate estimates of effect size in individual studies. After discussing the conceptual rationale for the random effects model, this chapter provides a general strategy for answering a series of questions that commonly arise in research synthesis:
1. Does a stream of research produce heterogeneous results? That is, do the true effect sizes vary?
2. If so, how large is this variation?
3. How can we make valid inferences about the average effect size when the true effect sizes vary?
4. Why do study effects vary? Specifically, do observable differences between studies in their target populations, measurement approaches, definitions of the treatment, or historical contexts systematically predict the effect sizes?
5. How effective are such models in accounting for effect size variation? Specifically, how much variation in the true effect sizes does each model explain?
6. Given that the effect sizes do indeed vary, what is the best estimate of the effect in each study?
I illustrate how to address these questions by re-analyzing data from a series of experiments on teacher expectancy effects on pupils' cognitive skill. My aim is to illustrate, in a comparatively simple setting, to a broad audience with a minimal background in applied statistics, the conceptual framework that guides analyses using random effects models and the practical steps typically needed to implement that framework. Although the conceptual framework guiding the analysis is straightforward, a number of technical issues must be addressed satisfactorily to ensure the validity of the inferences. To review these issues and recent progress in solving them requires a somewhat more technical presentation. Appendix 16A considers alternative approaches to estimation theory, and appendix 16B considers alternative approaches to uncertainty estimation, that is, the estimation of standard errors, confidence intervals, and hypothesis tests. These appendices together provide re-analyses of the illustrative data under alternative approaches, knowledge of which is essential to those who give technical advice to analysts.
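One common way to operationalize the first three questions above is DerSimonian-Laird random-effects pooling; the sketch below, with invented effect sizes and within-study variances, computes Cochran's Q, the between-study variance tau-squared, and the random-effects pooled estimate. It is a generic illustration under those assumptions, not the chapter's own analysis of the teacher-expectancy data.

```python
# Minimal sketch (hypothetical effect sizes): DerSimonian-Laird random-effects pooling.
import numpy as np

y = np.array([0.12, 0.30, 0.05, 0.45, 0.22])       # study effect sizes (hypothetical)
v = np.array([0.010, 0.015, 0.008, 0.020, 0.012])  # within-study variances (hypothetical)
k = len(y)

# Fixed-effect weights give Cochran's Q, the usual heterogeneity statistic.
w = 1 / v
y_fixed = np.sum(w * y) / np.sum(w)
Q = np.sum(w * (y - y_fixed) ** 2)

# DerSimonian-Laird estimate of the between-study variance tau^2 (truncated at zero).
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - (k - 1)) / c)

# Random-effects weights, pooled estimate, and its standard error.
w_re = 1 / (v + tau2)
y_re = np.sum(w_re * y) / np.sum(w_re)
se_re = np.sqrt(1 / np.sum(w_re))

print(f"Q = {Q:.2f}, tau^2 = {tau2:.4f}, pooled effect = {y_re:.3f} (SE {se_re:.3f})")
```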