ArticlePDF Available

Meta-analysis of Cohen’s kappa

December 2011
Health Services and Outcomes Research Methodology 11(3-4)

December 2011
11(3-4)

Authors:

University of Maryland, Baltimore County

Cohen’s κ is the most important and most widely accepted measure of inter-rater reliability when the outcome of interest is measured on a nominal scale. The estimates of Cohen’s κ usually vary from one study to another due to differences in study settings, test properties, rater characteristics and subject characteristics. This study proposes a formal statistical framework for meta-analysis of Cohen’s κ to describe the typical inter-rater reliability estimate across multiple studies, to quantify between-study variation and to evaluate the contribution of moderators to heterogeneity. To demonstrate the application of the proposed statistical framework, a meta-analysis of Cohen’s κ is conducted for pressure ulcer classification systems. Implications and directions for future research are discussed.

Total heterogeneity and population average estimate from a random-effects model

…

A funnel plot of Cohen's j estimates against standard error estimates from the random-effects model

…

Results from a mixed-effects model with rater characteristic as a moderator

…

A forest plot of Cohen's j estimates with separate estimates for the moderator (raters without training in PU assessment vs. raters with training in PU assessment)

…

Comparison of results from the fixed-effects model and the trim-and-filled fixed-effects model

…

Figures - uploaded by Shuyan Sun

Content may be subject to copyright.

Content uploaded by Shuyan Sun

Content may be subject to copyright.

Meta-analysis of Cohen’s kappa

Shuyan Sun

Received: 3 September 2010 / Revised: 25 October 2011 / Accepted: 29 October 2011 /

Published online: 11 November 2011

Springer Science+Business Media, LLC 2011

Abstract Cohen’s jis the most important and most widely accepted measure of inter-

rater reliability when the outcome of interest is measured on a nominal scale. The estimates

of Cohen’s jusually vary from one study to another due to differences in study settings,

test properties, rater characteristics and subject characteristics. This study proposes a

formal statistical framework for meta-analysis of Cohen’s jto describe the typical inter-

rater reliability estimate across multiple studies, to quantify between-study variation and to

evaluate the contribution of moderators to heterogeneity. To demonstrate the application of

the proposed statistical framework, a meta-analysis of Cohen’s jis conducted for pressure

ulcer classiﬁcation systems. Implications and directions for future research are discussed.

Keywords Cohen’s jInter-rater reliability Meta-analysis Generalizability

In classical test theory proposed by Spearman (1904), an observed score Xis expressed as the

true score Tplus a random error of measurement e, i.e., X=T?e. Reliability is deﬁned as the

squared correlation between observed scores and true scores (Lord and Novick 1968). It

indicates the extent to which scores produced by a particular measurement procedure are

consistent and reproducible (Thorndike 2005). Reliability is an unobserved property of scores

obtained from a sample on a particular test, not an inherent property of the test (Thompson

2002; Thompson and Vacha-Hasse 2000; Vacha-Hasse 1998; Vacha-Hasse et al. 2002).

Therefore, it is never appropriate to claim a test is reliable or unreliable in a research article.

Instead, researchers should state the scores are reliable or unreliable. Reliability estimates

usually vary from one study to another due to differences in study characteristics including

study settings, test properties, and subject characteristics. A test that yields reliable scores for

one group of subjects in this setting may fail to yield reliable scores for a different group of

subjects in another setting. Hence, understanding the generalizability of score reliability and

the factors affecting score reliability becomes an important methodological issue.

S. Sun (&)

School of Education, University of Cincinnati, 2600 Clifton Ave., Dyer Hall 475, P.O. Box 210049,

Cincinnati, OH 45221, USA

e-mail: sunsn@mail.uc.edu

123

Health Serv Outcomes Res Method (2011) 11:145–163

DOI 10.1007/s10742-011-0077-3

Researchers in education and psychology have been applying meta-analytic techniques

to reliability coefﬁcients to investigate the generalizability of score reliability across

multiple studies (e.g., Capraro et al. 2001; Caruso 2000; Helms 1999; Henson et al. 2001;

Huynh et al. 2009; Miller et al. 2007; Rohner and Khaleque 2003; Vacha-Hasse 1998;

Viswesvaran and Ones 2000; Yin and Fan 2000). This methodology was proposed and

labeled as reliability generalization by Vacha-Hasse (1998). As an extension of validity

generalization (Hunter and Schmidt 1990; Schmidt and Hunter 1977), reliability gener-

alization can be used (a) to describe the typical reliability estimate of a given test across

different studies, (b) to describe the variability of reliability estimates across different

studies, and (c) to identify study characteristics that can explain the variability of reliability

estimates and to accumulate psychometric knowledge regarding study characteristics

(Vacha-Hasse 1998). Score reliability is affected by heterogeneity of subjects being tested,

and reliability estimates always change when the test is administered to a different sample

(Guilford and Fruchter 1978). Effect sizes in applied research studies are inherently

attenuated by unreliable scores (Baugh 2002; Henson 2001; Thompson 1994) and con-

sequently the statistical power of detecting a meaningful difference is lowered. Therefore,

an investigation of the generalizability of score reliability has very important implications

for psychometricians, statisticians and applied researchers who wish to better understand

score reliability and its inﬂuences on the appropriateness of test use, effect size and

statistical power (Vacha-Hasse et al. 2002).

Inter-rater reliability refers to the consistency of ratings given by different raters to the

same subject. It quantiﬁes the extent to which raters agree on the relative ratings given to

subjects and serves as a measure of quality and accuracy of the rating process (Linacre

1989). Inter-rater reliability estimates usually vary across studies due to test properties (e.g.,

items, scaling and operational deﬁnitions), rater characteristics (e.g., knowledge, experi-

ence, qualiﬁcation and training), study settings, and subject characteristics (Kraemer 1979;

Shrout 1998; Suen 1988). Substantial measurement errors from raters and rating procedures

will attenuate measurement precision and statistical power, and make meaningful treatment

effects more difﬁcult to detect (Fleiss and Shrout 1977). When the outcome of interest is

measured on a nominal scale, Cohen’s jis considered the most important (von Eye 2006)

and most widely accepted measure of inter-rater reliability (Brennan and Silman 1992; Sim

and Wright 2005; Zwick 1988), especially in the medical literature (Kraemer et al. 2004;

Viera and Garrett 2005). The frequent application of Cohen’s jallows the possibility of

conducting meta-analyses to examine generalizability of inter-rater reliability across mul-

tiple studies. This study ﬁrst discusses the statistical framework for meta-analysis of

Cohen’s jand then demonstrates its application by a meta-analysis for pressure ulcer

classiﬁcation systems, a set of diagnostic tools in nursing and medical research.

1 Statistical framework for meta-analysis of Cohen’s j

1.1 Assumptions of Cohen’s j

The basic feature of Cohen’s jis to consider two raters as alternative forms of a test, and

their ratings are analogous to the scores obtained from the test. Well known as a chance-

corrected measure of inter-rater reliability, Cohen’s jdetermines whether the degree of

agreement between two raters is higher than would be expected by chance (Cohen 1960). It

assumes that (a) the subjects being rated are independent of each other, (b) the categories

of ratings are independent, mutually exclusive and collectively exhaustive, and (c) two

146 Health Serv Outcomes Res Method (2011) 11:145–163

123

raters operate independently. In addition to the three statistical assumptions, Cohen’s j

further assumes that the ‘‘correctness’’ of ratings cannot be determined in a typical situ-

ation, and raters are deemed equally competent to make the judgment on a prior ground

(Cohen, 1960, p.38). Under these two practical assumptions, no restriction is placed on the

distribution of ratings over categories for the raters. In other words, Cohen’s jallows the

marginal distributions of raters to differ (Banerjee et al. 1999). Marginal distribution is

deﬁned as the set of underlying probabilities with which each rater uses the categories.

Block and Kraemer (1989) argued that by allowing marginal distributions to differ,

Cohen’s jmeasures the association between two sets of ratings rather than the agreement

between two raters. A well-deﬁned measure of agreement describes how well one rater’s

rating agrees with what another rater would have reported and indicates the generalizability

of a rating beyond the speciﬁc rater. When the true interest is agreement between raters

instead of mere association between ratings, the marginal distributions should not be too

disperse. Therefore, it is very important to impose the assumption of homogeneity of

marginal distributions on Cohen’s j(Blackman and Koval 2000; Block and Kraemer 1989;

Brennan and Prediger 1981; Zwick 1988).

1.2 Sampling distribution of Cohen’s j

The formula for computing Cohen’s jis expressed in Eq. 1.

j¼p0pc

1pcð1Þ

where p

is percent agreement, deﬁned as the proportion of subjects on which the raters

agree, and p

is chance agreement, deﬁned as the proportion of agreement that would be

expected by chance. The calculation of Cohen’s jis very simple and can be done from a

contingency table using a basic calculator. The upper limit of jis ?1.00, occurring when

and only when the two raters agree perfectly, in other words, the two raters have exactly

the same marginal distribution. The lower limit of jfalls between 0 and -1.00, depending

on the dispersion of raters’ marginal distributions (Cohen 1960). A jvalue of 0 indicates

that the agreement is merely due to chance. Negative values of jcan be meaningfully

interpreted as the level of agreement that would be expected by chance only (Brennan and

Silman 1992). Although the meaning of jwith value 0 or 1 is quite clear, the interpretation

of intermediate values is less evident. The benchmarks for interpreting Cohen’s jproposed

by Landis and Koch (1977a) are of high proﬁle in the literature, with 0.8 to 1.0 indicating

almost perfect agreement, 0.6 to 0.8 as substantial, 0.4 to 0.6 as moderate, 0.2 to 0.4 as

fair, zero to 0.2 as slight and zero or lower as poor. Slightly different interpretations can be

found in Fleiss (1981) and Altman (1991). Stemler and Tsai (2008) proposed to use .50 as

the minimal Cohen’s jfor an acceptable level of inter-rater reliability.

The mean and variance of jwere derived by Everitt (1968) as in Eqs. 2and 3.

EðjÞ¼ 1

1pcfEðp0Þpcg¼0ð2Þ

VarðjÞ¼ 1

1pc

Varðp0Þð3Þ

The exact variance of p

is very tedious to calculate. When p

is assumed to be

binomially distributed, the approximate variance of jis simpliﬁed as in Eq. 4(Cohen

1960; Everitt 1968).

Health Serv Outcomes Res Method (2011) 11:145–163 147

123

VarðjÞ¼ 1

1pc

Varðp0Þ¼p0ð1p0Þ

nð1pcÞ2ð4Þ

The sampling distribution of jappears to be very non-symmetric when nis small

(Blackman and Koval 2000; Block and Kraemer 1989; Koval and Blackman 1996). With a

large enough n, the sampling distribution of jis approximately normal so that conﬁdence

intervals (CI) and signiﬁcance tests can be easily done using standard normal distribu-

tion quantiles. For instance, 95% CI of Cohen’s jcan be constructed from ^

j1:96

ﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ

varðjÞ

p. The calculations of Cohen’s j, its variance and 95% CI are demonstrated in

Appendix A.

It is worth noting that the variance estimator in Eq. 4is derived under the null

hypothesis that the agreement between two raters is merely due to chance. It usually

overestimates the true variance and results in conservative CIs and signiﬁcance tests (Fleiss

et al. 1969). In inter-rater reliability studies, a non-zero jis usually of interest and requires

a non-null variance estimator which has been derived by Fleiss, Cohen and Everitt (1969).

However, the non-null variance estimator involves marginal distributions of raters, which

are usually not reported in primary studies. For this reason, the non-null variance estimator

cannot be used in meta-analysis unless it is explicitly reported in primary studies.

1.3 Weighted mean Cohen’s j

Following the tradition of weighted mean effect sizes in meta-analysis, the weighted mean

Cohen’s j(

j:) can be derived by calculating the variance-weighted average as in Eq. 5



j:¼Pm

i¼1wiji

i¼1wið5Þ

where j

is the estimate obtained from study i,i=1,…,m,mis the number of primary

studies, and w

is the reciprocal of the variance of j

that can be obtained from Eq. 4.

1.4 Homogeneity test of Cohen’s j

Whether Cohen’s jestimates obtained from primary studies are homogenous or not can be

tested by Chi-square goodness of ﬁt test, i.e., the Qstatistic in Eq. 6as in traditional meta-

analysis of effect sizes.

Q¼X

i¼1

ðji

j:Þ2

varðjiÞ¼X

i¼1

wiðji

j:Þ2ð6Þ

Again, the weight w

is equal to the reciprocal of variance of j

. This homogeneity test

implies that jestimates with larger variances are weighted less in the calculation of Q.

Under the null hypothesis, Qfollows v

distribution with df =m-1 (Hedges 1982a,b).

The statistical assumption associated with this test is that Cohen’s jestimates are obtained

from independent samples that are large enough to be asymptotically normal.

1.5 Fitting ﬁxed-effects models

A ﬁxed-effects model in meta-analysis assumes that observed effects across studies are

homogenous, i.e., except for sampling errors, observed effects would be a constant across

studies. A ﬁxed-effects model without moderators can be used to estimate the common

148 Health Serv Outcomes Res Method (2011) 11:145–163

123

effect among all observed effects in primary studies. The ﬁxed-effects model for observed

can be expressed as in Eq. 7.

ji¼hþeið7Þ

where his the population common effect, i.e., the common inter-rater reliability estimate

across mstudies, and e

represents the random sampling error. Under weighted least-square

estimation, the population common effect hcan be estimated by 

j:in Eq. 5with variance

in Eq. 8.

varð

j:Þ¼ 1

i¼1wið8Þ

Under the null hypothesis that population common inter-rater reliability is 0, 

ﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ

varðj:Þ

follows the standard normal distribution. Accordingly, 95% CI for population average

inter-rater reliability can be easily constructed from 

j:1:96 ﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ

varð

j:Þ

p. A ﬁxed-effects

model assumes that the variations of observed outcomes could be fully explained by study

characteristics, and inferences about population effect can be made conditionally on the

characteristics of primary studies included in the meta-analysis (Hedges and Vevea 1998).

Therefore, it is a common practice to include study characteristics (e.g., rater character-

istics and study setting) as moderators in a ﬁxed-effects model to explore how they affect

the variability of Cohen’s jestimates across studies.

1.6 Fitting random-effects models

A random-effects model in meta-analysis assumes that observed effects across studies are a

random variable instead of a ﬁxed constant. The variability among observed effects is not

only a result of random sampling errors, but also caused by random variability at the study

level (Hedges 1983; Hedges and Vevea 1998; Raudenbush 2009). The random-effects

model for observed j

can be expressed in Eq. 9.

ji¼h:þliþeið9Þ

where h. is the population average effect, i.e., the overall estimate of inter-rater reliability,

is the between-study variation that is normally distributed with mean 0 and variance s

and e

represents the random sampling error. Under the random-effects model, the popu-

lation average effect 

j:can be estimated from Eqs. 10 and 11 and its variance can be

estimated from Eq. 12.



j:¼Pm

i¼1w

iji

i¼1w

ið10Þ

w

i¼1

varðkiÞþ 1

m1Pm

i¼1ji1

mPm

i¼1ji



21

mPm

i¼1varðkiÞð11Þ

varð

j:Þ¼ 1

i¼1w

ið12Þ

Under the null hypothesis that population average inter-rater reliability is 0, 

j:

ﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ

varðj:Þ

follows the standard normal distribution. 95% CI for population average inter-rater

reliability can be readily constructed from 

j:1:96 ﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ

varð

j:Þ

Health Serv Outcomes Res Method (2011) 11:145–163 149

123

1.7 Fitting mixed-effects models

Exploring the source of heterogeneity by investigating moderator effects on the outcomes

is considered one of the most important and useful aspects of meta-analysis (Thompson

1994). Including moderators in a random-effects model to explain the heterogeneity results

in a mixed-effects model as expressed in Eq. 13.

ji¼b0þb1Xi1þþbpXip þliþeið13Þ

The variance of l

represents the amount of residual heterogeneity, i.e., the variability of

Cohen’s jestimates across studies that cannot be accounted for by the moderators X

included in the mixed-effects model. The mixed-effects model assumes that each

Cohen’s jestimate is a linear function of moderator effects and residual heterogeneity. An

estimate of overall effect obtained from a random-effects model becomes meaningless and

can even be misleading when signiﬁcant moderator effects are present in a mixed-effects

model. Hence, an investigation of moderator effects should always be an integral part of

meta-analysis. In addition to usual moderators in meta-analysis including study settings

and subject characteristics, rater characteristics are particularly important moderators that

may explain the heterogeneity among observed Cohen’s jestimates and should be

included in a mixed-effects model.

2 An application to pressure ulcer classiﬁcation systems

2.1 Background

Pressure ulcers (PU) are very serious health problems (Allman 1997) associated with great

pain and distress for patients and extensive health care costs (Graves et al. 2005). PU

classiﬁcation systems are commonly used to classify skin sites into different categories that

indicate the severity of the condition. Though PU classiﬁcation systems aim at providing

consistent assessment to promote accurate communication, precise documentation, and

appropriate treatment decision (Stotts 2001), several widely used PU classiﬁcation systems

have been criticized for low inter-rater reliability estimates reported in published studies

and their usefulness is being questioned (Russell 2002). A meta-analysis of Cohen’s jis a

useful tool to examine the generalizability of inter-rater reliability estimates across studies,

to determine the factors affecting inter-rater reliability, and to inform PU assessment in

future research and practice.

2.2 Data sources

Kottner et al. (2009) conducted a systematic review of inter-rater reliability for PU clas-

siﬁcation systems and included 24 primary studies in the ﬁnal data synthesis. The 24

studies were retrieved and served as the pool of potential studies for the present meta-

analysis. To be eligible for ﬁnal data synthesis, the potential studies have to meet four

criteria: (1) the language is English; (2) Cohen’s jwas reported as the measure of inter-

rater reliability, or sufﬁcient information was provided to calculate Cohen’s j; (3) standard

errors of Cohen’s jestimates were reported, or sufﬁcient information was provided to

estimate the standard errors; and (4) Cohen’s jis an appropriate measure of inter-rater

reliability for the rating procedure. The selection process (see Fig. 1) resulted in six studies

150 Health Serv Outcomes Res Method (2011) 11:145–163

123

for this meta-analysis. The variables extracted from the six studies are authors, study

location, publication year, PU classiﬁcation system, number of categories, rating proce-

dure, rater characteristics, number of raters, total number of skin sites, skin site charac-

teristics, percent agreement (p

) and Cohen’s jestimate. Note that ﬁve studies contained

multiple Cohen’s jestimates obtained from independent samples. A total number of ﬁfteen

Cohen’s jestimates were identiﬁed for ﬁnal data synthesis.

2.3 Results

As shown in Table 1, seven different PU classiﬁcation systems were used in the six

primary studies. Although the seven classiﬁcation systems have very much in common,

they differ in operational deﬁnitions of grade 1 PU. Normal skin was involved in three

studies. The prevalence of PU was not reported in ﬁve studies, indicating possible heter-

ogeneity of subjects across studies. Five studies used real skin sites for PU assessment

while only one study used images of skin sites. The characteristics of raters were heter-

ogeneous in terms of training and experiences in PU and tissue viability assessment.

Sample sizes (i.e., numbers of skin sites) varied from 35 to 2,396.

Because of the heterogeneity of study characteristics and the inclusion of multiple PU

classiﬁcation systems, a random-effects model was ﬁtted using Rpackage metafor

(Viechtbauer 2010). The metafor package provides ﬂexible and comprehensive functions

for ﬁtting various models in general-purpose meta-analysis. The codes for all analyses

conducted in this meta-analysis are provided in Appendix B. The results obtained from the

random-effects model were summarized in Table 2. The test for heterogeneity is signiﬁ-

cant, Q=653.50, df =14, P\.001, suggesting that considerable heterogeneity exists

among true inter-rater reliability across studies. The amount of heterogeneity (s

) is esti-

mated to be 0.06. The I

statistic suggests that 97.58% of the total variability in Cohen’s j

24 studies meeting selection criteria after

quality assessment in Kottner et al (2009)

11 studies reporting p0 for inter-

rater reliability and Cohen’s

cannot be calculated

3 studies reported multirater and

3 studies reported mean

7 studies reporting Cohen’s as the inter-

rater reliability measure

1 study not providing the information to

estimate standard error of Cohen’s

6 studies meeting the criteria for final data

synthesis; 15 independent estimates of

Cohen’s obtained

Fig. 1 Selection process of included studies for the meta-analysis for pressure ulcer classiﬁcation systems

Health Serv Outcomes Res Method (2011) 11:145–163 151

123

Table 1 Study characteristics of ﬁfteen Cohen’s jestimates included in the meta-analysis

Study/Location Measure/# of

categories

Skin Normal skin

involved?

Raters Setting Skin site characteristics nj(p

)

Pedley (2004)/UK EPUAU/5 Real Yes Trained registered nurses

experienced in tissue

viability

Hospital Pressure ulcers of differing

severity and a number of

pressure points free from

pressure damage

35 0.31 (0.49)

Pedley (2004)/UK Stirling 1-digit/5 Real Yes Trained registered nurses

experienced in tissue

viability

Hospital Pressure ulcers of differing

severity and a number of

pressure points free from

pressure damage

35 0.37 (0.54)

Pedley (2004)/UK Stirling 2-digit/15 Real Yes Trained registered nurses

experienced in tissue

viability

Hospital Pressure ulcers of differing

severity and a number of

pressure points free from

pressure damage

35 0.48 (0.54)

Vanderwee et al.

(2006)/Belgium

Blanchable and

non-blanchable

erythema/2

Real No Researcher and trained nurses Hospital Geriatric patients, erythema

at the heels, hips, and sacrum

503 0.69 (0.92)

Vanderwee et al.

(2006)/Belgium

Blanchable and

non-blanchable

erythema/2

Real No Researcher and trained nurses Hospital Geriatric patients, erythema

at the heels, hips, and sacrum

503 0.72 (0.92)

Health Serv Outcomes Res Method (2011) 11:145–163 153

123

estimates can be attributed to heterogeneity among the true inter-rater reliabilities. Figure 2

clearly shows the heterogeneity of Cohen’s jestimates across studies. The overall estimate

of Cohen’s jis 0.53 (95% CI: 0.39–0.66), which is a moderate inter-rater reliability

according to Landis and Koch (1977a).

When subjectivity is involved in the rating process, raters’ prior experience and training

becomes a particularly important factor that may affect the variability of inter-rater reli-

ability estimates. Based on the descriptions of rater characteristics in the six primary

studies, the raters were categorized into two groups, raters with special training in PU

assessment versus raters without special training in PU assessment. A mixed-effects model

was ﬁtted to explore the effect of rater characteristics on the variability of Cohen’s j

estimates and the results were summarized in Table 3. The estimated amount of residual

Table 2 Total heterogeneity and population average estimate from a random-effects model

Estimated total amount of heterogeneity 0.06 with 95% CI [0.03, 0.16]

(% of total variability due to heterogeneity) 97.58%

QdfP

Test for heterogeneity 653.50 14 \.001

Estimate SE z p 95% CI

Population average estimate 0.53 0.07 7.75 \.001 0.39 0.66

Fig. 2 A forest plot of Cohen’s jestimates and the overall estimate from the random-effects model

154 Health Serv Outcomes Res Method (2011) 11:145–163

123

heterogeneity is 0.03, suggesting that about 50% of the total amount of heterogeneity

(which is 0.06 estimated from the random-effects model) can be accounted for by raters’

training. For the group of raters without special training, b=0.28 (95% CI: 0.12–0.45),

SE =0.08, P\0.001. For the group of raters with special training, b=0.66 (95% CI:

0.54–0.78), SE =0.06, P\0.001. The difference between two groups are statistically

signiﬁcant as suggested by the test of moderators, Q=13.62, df =1, P\0.001. How-

ever, the test of residual heterogeneity is still signiﬁcant, Q=258.20, df =13, P\0.001,

indicating that moderators not considered in the model also affect the variability of inter-

rater reliability estimates across studies. Additional moderators including whether normal

skin was involved and how PU was assessed (based on real skin or skin image) were added

to the model but no signiﬁcant effects were detected.

The funnel plot of Cohen’s jestimates against their estimated standard errors from the

random-effects model (Fig. 3) suggests that possible publication bias exists. The trim and

ﬁll method (Duval and Tweedie 2000a,b) was used to adjust for publication bias. The trim

and ﬁll method is a nonparametric data augmentation technique that estimates the number

of studies missing from a meta-analysis due to suppression of the most extreme obser-

vations on one side of the funnel plot. The method then augments observed data under the

ﬁxed- or random-effects model to make the funnel plot more symmetric. Because it is a

way of formalizing the use of a funnel plot and the results can be easily understood

visually, the trim and ﬁll method is now the most popular method for adjusting for pub-

lication bias (Borenstein 2005). Under the random-effects model, the trim and ﬁll method

was applied to the PU data and the estimated number of missing studies on either side is

zero. In other words, the symmetry of the funnel plot cannot be improved by data aug-

mentation. The mechanisms of publication bias and incomplete data reporting are usually

very complicated and may vary with dataset and subject area in meta-analysis (Sutton

2009). The trim and ﬁll method did not work for the present meta-analysis probably

because the mechanism of publication bias is not due to the suppression of extreme

Cohen’s jestimates as assumed by the trim and ﬁll method. The existence of publication

bias may be partially explained by the fact that eleven primary studies did not report

Cohen’s jas the measure of inter-rater reliability and were excluded from the meta-

analysis. The inclusion of those studies in the ﬁnal data synthesis may lead to a different

conclusion about publication bias (Fig. 4).

The trim and ﬁll method was also applied to the PU data under the ﬁxed-effects model to

further demonstrate its rationale of adjusting for publication bias. The results from the ﬁxed-

effects model and the trim-and-ﬁlled model were summarized in Table 4. Under the

Table 3 Results from a mixed-effects model with rater characteristic as a moderator

Estimated residual amount of heterogeneity 0.03 with 95% CI [0.01, 0.10]

% of total variability due to moderator 52.87%

QdfP

Test for residual heterogeneity 258.20 13 \.001

Test for moderators 13.62 1 \.001

bSEz P 95% CI

Raters without training 0.28 0.08 3.47 \.001 0.12 0.45

Raters with training 0.66 0.06 11.00 \.001 0.54 0.78

Health Serv Outcomes Res Method (2011) 11:145–163 155

123

ﬁxed-effects model, the common Cohen’s jestimate is 0.65 (95% CI: 0.63–0.67). After

adjusting for publication bias under the ﬁxed-effects model, six missing studies were estimated

on the right side of the funnel plot and the adjusted common Cohen’s jestimate is 0.73 (95%

CI: 0.72–0.76). However, as shown in the trim and ﬁlled funnel plot (Fig. 5), Cohen’s j

estimates for the six missing studies are larger than 1 and hence substantively meaningless.

This may be viewed as a limitation of the trim and ﬁll method thatneeds further methodological

investigation. Again, the application of the trim and ﬁll method under the ﬁxed-effects model is

for demonstration purpose only. The results should not be used to inform future practice.

2.4 Conclusion

The results from the meta-analysis of ﬁfteen Cohen’s jestimates shows that (1) the overall

inter-rater reliability estimated from a random-effects model is .53, indicating a moderate

level of agreement between raters, (2) signiﬁcant heterogeneity of Cohen’s jestimates exist

between studies, and (3) raters with special training in PU assessment tend to produce more

reliable ratings than raters without training, suggesting the importance of rater training. In

order to obtain as many published studies as possible, this meta-analysis used very broad

criteria to select studies and included Cohen’s jestimates generated from seven different PU

classiﬁcation systems. No comparison was made between classiﬁcation systems because of

the small number of Cohen’s jestimates for each classiﬁcation system. Therefore, it is very

difﬁcult to decide which PU classiﬁcation system should be used in daily practice. When a

large number of inter-rater reliability studies are accumulated in future, it is necessary to

include PU classiﬁcation systems as a moderator in the mixed-effects model so that the effect

of test properties on inter-rater reliability estimates can be understood.

Fig. 3 A funnel plot of Cohen’s jestimates against standard error estimates from the random-effects model

156 Health Serv Outcomes Res Method (2011) 11:145–163

123

3 Discussion

The present study proposed a formal statistical framework to speciﬁcally combine Cohen’s

jestimates across multiple studies and extended traditional meta-analysis of effect sizes to

inter-rater reliability. The proposed framework relies on the sampling distribution of

Cohen’s jand traditional meta-analytic models to describe the typical inter-rater reliability

Fig. 4 A forest plot of Cohen’s jestimates with separate estimates for the moderator (raters without

training in PU assessment vs. raters with training in PU assessment)

Table 4 Comparison of results from the ﬁxed-effects model and the trim-and-ﬁlled ﬁxed-effects model

Fixed-effects model Trim-and-ﬁlled ﬁxed-effects model

Test for heterogeneity Q=653.50, df =14, P\.001 Q=1257.15, df =20, P\.001

Common estimate 0.65 0.74

Estimated standard error 0.009 0.009

95% CI for common estimate [0.63, 0.67] [0.72, 0.76]

Health Serv Outcomes Res Method (2011) 11:145–163 157

123

in multiple studies and to quantify between-study variation, and thus it is more rigorous

and informative than narrative reviews and systematic reviews. It allows researchers to

evaluate how important study characteristics affect the variability of inter-rater reliability

estimates across studies and to accumulate psychometric knowledge about the test being

used. The ﬁndings from a meta-analysis of Cohen’s jwill facilitate test developers, test

users and methodologists to better understand inter-rater reliability and develop effective

strategies to improve inter-rater reliability.

Even the most skillful chef cannot cook a meal without basic ingredients, so the quality

of a meta-analysis largely depends on the supplies in primary studies. A successful meta-

analysis of Cohen’s jrequires that the test of interest has been frequently used and

Cohen’s jis frequently reported as a measure of inter-rater reliability. About 80% of the

primary studies in the literature of PU assessment did not report any inter-rater reliability

coefﬁcient (Kottner et al. 2009). The consequence of underreporting is that population

parameter estimates in meta-analysis will be biased. Therefore, raising the awareness of

reporting inter-rater reliability estimates is an urgent mission. It was also noted that p

more frequently reported as the measurement of inter-rater reliability than Cohen’s j.p

does not account for chance agreement between two raters and is a positively biased

measure of the true systematic tendency of the two raters agreeing with each other (Fleiss

1981). On the contrary, Cohen’s jadjusts for chance agreement and hence should be the

preferred measure of inter-rater reliability. Moreover, standard error is a measure of

estimation precision and a key element to synthesize estimates across different studies in

meta-analysis. Unfortunately, standard errors are generally not reported for Cohen’s jin

Fig. 5 A funnel plot of Cohen’s jestimates against standard error estimates after adjusting for publication

bias with the trim and ﬁll method (the dots on the right of the vertical line are the missing observations

estimated by the trim and ﬁll method)

158 Health Serv Outcomes Res Method (2011) 11:145–163

123

published studies. If p

is reported for Cohen’s jestimate in a primary study, the standard

error can be estimated from Eqs. 1and 4; otherwise, the study has to be excluded from ﬁnal

data synthesis. Researchers are strongly recommended to explicitly report Cohen’s jwith

estimated standard error to facilitate future meta-analysis.

Last but not least, the appropriate use of Cohen’s jin primary studies is essential for

meta-analysis. By assumption, Cohen’s jis an appropriate measure of inter-rater reliability

when all of the subjects are rated by two equally competent raters and the correctness of

ratings usually cannot be determined. When these assumptions are violated, Cohen’s j

cannot be used to indicate the consistency of the ratings between two raters. For instance,

Hart et al. (2006) reported Cohen’s jbetween hospital nurses and PU experts. Ratings

made by experts were considered correct classiﬁcations and the reliability between nurses

and experts was considered rater-to-standards reliability. Whether rater-to-standard

reliability is a valid concept and how to assess it are completely different issues, but

apparently Cohen’s jwas misused and the study cannot be included in meta-analysis. In

fact, Cohen’s jhas been extended to estimate inter-rater reliability for many complicated

rating scenarios (e.g., Berry and Mielke 1998; Cohen 1968,1972; Davies and Fleiss 1982;

Fleiss 1971; Gross 1986; Janson and Olsson 2001,2004; Kraemer 1980; Kraemer et al.

2004; Landis and Koch 1977b; Vanbelle and Albert 2009). Sadly, those extended measures

are infrequently, if not never, applied in empirical research. The misuse of Cohen’s jand

the large number of extended measures of inter-rater reliability calls for clear and easy-

to-follow tutorials on conducting inter-rater reliability studies and choosing the appropriate

measure of inter-rater reliability for different rating scenarios.

Appendix A

An example illustrating how to calculate Cohen’s j, its variance and conﬁdence intervals

using real data from Nixon et al. (2005)

Ward nurse

No pressure ulcer Pressure ulcer Total

Clinical research nurse

No pressure ulcer (a)

2175

(b)

)

2210

Pressure ulcer (c)

(d)

144

)

186

Total (g

)

2217

)

179

(n)

2396

p0¼aþd

n¼2175 þ144

2396 0:97

pc¼

f1g1

nþf2g2

n¼

22102217

2396 þ186179

2396

2396 0:86

j¼p0pc

1pc¼0:97 0:86

10:86 0:79

Health Serv Outcomes Res Method (2011) 11:145–163 159

123

Varð^

jÞ¼ 1

ð1pcÞ2

p0ð1p0Þ

n¼1

ð10:86Þ2

0:97ð10:97Þ

2396 6:20 104

j1:96 ﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ

varðjÞ

p0:79 1:96 0:02. Therefore, 95% CI for Cohen’s jis [0.75,

0.83].

Note that, the calculation of Var(^

j) depends on p

which is usually not reported in

research articles. When p

and ^

jare reported, p

can be derived by pc¼p0^

1^

Appendix B

Rcodes for all analyses conducted in the meta-analysis of Cohen’s jfor PU classiﬁcation

systems

#Install the package metafor

library(metafor)

#Derive vi, the variance for each kappa estimate

#p0i and ki are the percentage agreement and Cohen’s kappa estimate for study i

# pci is the chance agreement derived from p0i and ki (as shown in Appendix A)

pci=(p0i-ki)/(1-ki)

vi=p0i*(1-p0i)/((1-pci)^2*ni)

#Fit the random-effects model using rma() and get conﬁdence intervals for parameter

estimates

res\-rma(ki,vi,data=PUdata)

conﬁnt(res)

#Get the forest plot

forest(res,at=c(-0.2,0,0.2,0.4,0.6,0.8,1,1.2), slab = paste(PUdata$Study))

op\-par(cex=1,font=2)

text(-1.25,17, ‘‘Study’’,pos=4)

text(1.5,17, ‘‘Cohen’s Kappa [95% CI]’’,pos=4)

#Get the funnel plot to check publication bias

funnel(res, main = ‘‘Random-Effects Model’’)

#Sort the data matrix by the moderator variable rater for the mixed-effects model

data\-PUdata[order(PUdata$rater),]

#Fit the mixed-effects model using rma() and get conﬁdence intervals for parameter

estimates

mix\-rma(ki,vi,mods=*factor(rater),data=data)

conﬁnt(mix)

#Get the forest plot for mixed-effect model and add group estimates to the bottom of the

plot

forest(data$ki,data$vi, at=c(-0.2,0,0.2,0.4,0.6,0.8,1,1.2),ylim=c(-3,18),slab = paste

(data$Study))

preds \- predict(mix, newmods = c(0, 1))

op\-par(cex=1.1,font=1)

addpoly(preds$pred, sei = preds$se, mlab = c(‘‘Raters without training’’, ‘‘Raters with

training’’))

abline(h=0)

160 Health Serv Outcomes Res Method (2011) 11:145–163

123

abline(h=10.5)

text(1.2,0.4,’’Raters with training’’)

text(1.2,11,’’Raters without training’’)

op\-par(cex=1.1,font=2)

text(-1.25,17, ‘‘Study’’,pos=4)

text(1.5,17, ‘‘Cohen’s Kappa [95% CI]’’,pos=4)

# Use trim and ﬁll method to adjust for publication bias under the random-effects model

re\-rma(ki,vi,data=PUdata,method=‘‘REML’’)

rtf \- trimﬁll(re)

# Use trim and ﬁll method to adjust for publication bias under the ﬁxed-effects model

fe\-rma(ki,vi,data=PUdata,method=‘‘FE’’)

ftf \- trimﬁll(fe)

#Get the funnel plot with augmented data

funnel(ftf)

abline(v=1)

References

Allman, R.M.: Pressure ulcer prevalence, incidence, risk factors, and impact. Clin. Geriatr. Med. 13,

421–436 (1997)

Altman, D.G.: Practical Statistics for Medical Students. Chapman and Hall, London (1991)

Baugh, F.: Correcting effect sizes for score reliability: a reminder that measurement and substantive issues

are linked inextricably. Educ. Psychol. Meas. 62, 254–263 (2002)

Banerjee, M., Capozzoli, M., McSweeny, L., Sinha, D.: Beyond kappa: a review of interrater agreement

measures. Can. J. Stat. 27, 3–23 (1999)

Berry, K.J., Mielke, P.W.: A generalization of Cohen’s kappa agreement measure to interval measurement

and multiple raters. Educ. Psychol. Meas. 48, 921–933 (1998)

Blackman, N.J.-M., Koval, J.J.: Interval estimation for Cohen’s kappa as a measure of agreement. Stat. Med.

19, 723–741 (2000)

Block, D.A., Kraemer, H.C.: 292 kappa coefﬁcients: measures of agreement or association. Biometrics 45,

269–287 (1989)

Borenstein, M.: Software for publication bias. In: Rothstein, H.R., Sutton, A.J., Borenstein, M. (eds.)

Publication Bias in Meta-Analysis—Prevention, Assessment and Adjustments, pp. 193–220. Wiley,

Chichester (2005)

Bours, G., Halfens, R., Lubbers, M., Haalboom, J.: The development of a National Registration Form to

measure the prevalence of pressure ulcers in the Netherlands. Ostomy Wound Manage. 45, 28–40

(1999)

Brennan, R.L., Prediger, D.J.: Coefﬁcient kappa: some uses, misuses, and alternatives. Educ. Psychol. Meas.

41, 687–699 (1981)

Brennan, R.L., Silman, A.: Statistical methods for assessing observer variability in clinical measures. Br.

Med. J. 304, 1491–1494 (1992)

Buntinx, F., Beckers, H., De Keyser, G., Flour, M., Nissen, G., Raskin, T., De Vet, H.: Inter-observer

variation in the assessment of skin ulceration. J. Wound Care 5, 166–170 (1986)

Capraro, M.M., Capraro, R.M., Henson, R.K.: Measurement error of scores on the Mathematics Anxiety

Rating Scale across studies. Educ. Psychol. Meas. 61, 373–386 (2001)

Caruso, J.C.: Reliability generalization of the NEO personality scales. Educ. Psychol. Meas. 60, 236–254

(2000)

Cohen, J.: A coefﬁcient of agreement for nominal scales. Educ. Psychol. Meas. 20, 37–46 (1960)

Cohen, J.: Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial

credit. Psychol. Bull. 70, 220–231 (1968)

Cohen, J.: Weighted Chi square: an extension of the kappa method. Educ. Psychol. Meas. 32, 61–74 (1972)

Davies, M., Fleiss, J.L.: Measurement agreement for multinomial data. Biometrics 38, 1047–1051 (1982)

Health Serv Outcomes Res Method (2011) 11:145–163 161

123

Duval, S., Tweedie, R.: A nonparametric ‘‘trim and ﬁll’’ method of assessing publication bias in meta-

analysis. J. Am. Stat. Assoc. 95(449), 89–98 (2000a)

Duval, S., Tweedie, R.: Trim and ﬁll: a simple funnel plot based method of testing and adjusting for

publication bias in meta-analysis. Biometrics 56, 455–463 (2000b)

Everitt, B.S.: Moments of the statistics kappa and weighted kappa. Br. J. Math. Stat. Psychol. 21, 97–103

(1968)

Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychol. Bull. 76, 378–382 (1971)

Fleiss, J.L.: Statistical Methods for Rates and Proportions, 2nd edn. Wiley, New York (1981)

Fleiss, J.L., Cohen, J., Everitt, B.S.: Large sample standard errors of kappa and weighted kappa. Psychol.

Bull. 72, 323–327 (1969)

Fleiss, J.L., Shrout, P.E.: The effects of measurement errors on some multivariate procedures. Am. J. Public

Health 67, 1188–1191 (1977)

Graves, N., Birrell, F.A., Whitby, M.: Modeling the economic losses from pressure ulcers among hospi-

talized patients in Australia. Wound. Rep. Reg. 13, 462–467 (2005)

Gross, S.T.: The kappa coefﬁcient of agreement for multiple observers when the number of subjects is small.

Biometrics 42, 883–893 (1986)

Guilford, J.P., Fruchter, B.: Fundamental Statistics in Psychology and Education, 6th edn. McGraw-Hill,

New York (1978)

Hart, S., Bergquist, S., Gajewski, B., Dunton, N.: Reliability testing of the national database of nursing

quality indicators pressure ulcer indicator. J. Nurs. Care Qual. 21, 256–265 (2006)

Healey, F.: The reliability and utility of pressure sore grading scales. J. Tissue Viability 5, 111–114 (1995)

Hedges, L.V.: Fitting categorical models to effect sizes from a series of experiments. J. Educ. Stat. 7,

119–137 (1982a)

Hedges, L.V.: Fitting continuous models to effect sizes from a series of experiments. J. Educ. Stat. 7,

245–270 (1982b)

Hedges, L.V.: A random effects model for effect sizes. Psychol. Bull. 93, 388–395 (1983)

Hedges, L.V., Vevea, J.L.: Fixed and random effects models in meta-analysis. Psychol. Methods 3, 486–504

(1998)

Helms, J.E.: Another meta-analysis of the White Racial Identity Attitude Scale’s Cronbach alphas: impli-

cations for validity. Meas. Eval. Couns. Dev. 32, 122–137 (1999)

Henson, R.K.: Understanding internal consistency reliability estimates: a conceptual primer on coefﬁcient

alpha. Meas. Eval. Couns. Dev. 34, 177–189 (2001)

Henson, R.K., Kogan, L.R., Vacha-Haase, T.: A reliability generalization study of the Teacher Efﬁcacy

Scale and related instruments. Educ. Psychol. Meas. 61, 404–420 (2001)

Hunter, J.E., Schmidt, F.L.: Methods of Meta-Analysis: Correcting Error and Bias in Research Findings.

Sage, Newbury Park (1990)

Huynh, Q., Howell, R.T., Benet-Martinez, V.: Reliability of bidimensional acculturation scores: a meta-

analysis. J. Cross Cult. Psychol. 40, 256–274 (2009)

Janson, H., Olsson, U.: A measure of agreement for interval or nominal multivariate observations. Educ.

Psychol. Meas. 61, 277–289 (2001)

Janson, H., Olsson, U.: A measure of agreement for interval or nominal multivariate observations by

different sets of judges. Educ. Psychol. Meas. 64, 62–70 (2004)

Kottner, J., Raeder, K., Halfens, R., Dassen, T.: A systematic review of inter-rater reliability of pressure

ulcers classiﬁcation systems. J. Clin. Nurs. 18, 315–336 (2009)

Koval, J.J., Blackman, N.J.-M.: Estimators of kappa-exact small sample properties. J. Stat. Comput. Simulat.

55, 513–536 (1996)

Kraemer, H.C.: Ramiﬁcations of a population model for jas a coefﬁcient o reliability. Psychometrika 44,

461–472 (1979)

Kraemer, H.C.: Extension of the kappa coefﬁcient. Biometrics 36, 207–216 (1980)

Kraemer, H.C., Vyjeyanthi, S.P., Noda, A.: Kappa coefﬁcients in medical research. In: D’Agostino, R.B.

(ed.) Tutorials in Biostatistics Volume 1: Statistical Methods in Clinical Studies, pp. 85–105. Wiley,

New York (2004)

Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33,

159–174 (1977a)

Landis, J.R., Koch, G.G.: An application of hierarchical kappa-type statistics in the assessment of majority

agreement among multiple observers. Biometrics 33, 363–374 (1977b)

Linacre, J.M.: Many-Facet Rasch Measurement. MESA Press, Chicago (1989)

Lord, F.M., Novick, M.R.: Statistical Theories of Mental Test Scores. Addison-Wesley, Reading (1968)

162 Health Serv Outcomes Res Method (2011) 11:145–163

123

Miller, C.S., Shields, A.L., Campﬁeld, D., Wallace, K.A., Weiss, R.D.: Substance use scales of the Min-

nesota Multiphasic Personality Inventory: an exploration of score reliability via meta-analysis. Educ.

Psychol. Meas. 67, 1052–1065 (2007)

Nixon, J., Thorpe, H., Barrow, H., Phillips, A., Nelson, E.A., Mason, S.A., Cullum, N.: Reliability of

pressure ulcer classiﬁcation and diagnosis. J. Adv. Nurs. 50, 613–623 (2005)

Pedley, G.E.: Comparison of pressure ulcer grading scales: a study of clinical utility and inter-rater reli-

ability. Int. J. Nurs. Stud. 41, 129–140 (2004)

Raudenbush, S.W.: Analyzing effect sizes: random-effects models. In: Cooper, H.M., Hedges, L.V.,

Valentine, J.C. (eds.) The Handbook of Research Synthesis and Meta-Analysis, 2nd edn, pp. 295–315.

Russel Sage Foundation, New York (2009)

Rohner, R.P., Khaleque, A.: Reliability and validity of the Parental Control Scale: a meta-analysis of cross-

cultural and intracultural studies. J. Cross Cult. Psychol. 34, 643–649 (2003)

Russell, L.: Pressure ulcer classiﬁcation: the systems and the pitfalls. Br. J. Nurs. 11, S49–S59 (2002)

Schmidt, F.L., Hunter, J.E.: Development of a general solution to the problem of validity generalization.

J. Appl. Psychol. 62, 529–540 (1977)

Shrout, P.E.: Measurement reliability and agreement in psychiatry. Stat. Meth. Med. Res. 7, 301–317 (1998)

Sim, J., Wright, C.C.: The Kappa statistic in reliability studies: use, interpretation and sample size

requirements. Phys. Ther. 85, 257–268 (2005)

Spearman, C.E.: The proof and measurement of association between two things. Am. J. Psychol. 15, 72–101

(1904)

Stemler, S.E., Tsai, J.: Best practices in interrater reliability: three common approaches. In: Osborne, J.W.

(ed.) Best Practices in Quantitative Methods, pp. 29–49. Sage, Thousand Oaks (2008)

Stotts, N.A.: Assessing a patient with a pressure ulcer. In: Morison, M.J. (ed.) The Prevention and Treatment

of Pressure Ulcers, pp. 99–115. Mosby, London (2001)

Suen, H.K.: Agreement, reliability, accuracy and validity: toward a clariﬁcation. Behav. Assess. 10,

343–366 (1988)

Sutton, A.J.: Publication bias. In: Cooper, H.M., Hedges, L.V., Valentine, J.C. (eds.) The handbook of

research synthesis and meta-analysis, 2nd edn, pp. 435–452. Russel Sage Foundation, New York

(2009)

Thompson, B.: Guidelines for authors. Educ. Psychol. Meas. 54, 837–847 (1994a)

Thompson, B.: Score Reliability: Contemporary Thinking on Reliability Issues. Sage, Thousand Oaks

(2002)

Thompson, B., Vacha-Hasse, T.: Psychometrics is datametrics: the test is not reliable. Educ. Psychol. Meas.

60, 174–195 (2000)

Thompson, S.G.: Why sources of heterogeneity in meta-analysis should be investigated. Br. Med. J. 309,

1351–1355 (1994b)

Thorndike, R.M.: Measurement and Evaluation in Psychology and Education. Pearson Merrill Prentice Hall,

Upper Saddle River (2005)

Vacha-Hasse, T.: Reliability generalization: exploring variance in measurement error affecting score reli-

ability across studies. Educ. Psychol. Meas. 58, 6–20 (1998)

Vacha-Hasse, T., Henson, R.K., Caruso, J.C.: Reliability generalization: moving toward improved under-

standing and use of score reliability. Educ. Psychol. Meas. 62, 562–569 (2002)

Vanbelle, S., Albert, A.: Agreement between two independent groups of raters. Psychometrika 74, 477–492

(2009)

Vanderwee, K., Grypdonck, M., De Bacquer, D., Deﬂoor, T.: The reliability of two observation methods of

nonblanchable erythema, Grade 1 pressure ulcer. Appl. Nurs. Res. 19, 156–162 (2006)

Viechtbauer, W.: Conducting meta-analyses in Rwith the metafor package. J. Stat. Softw. 36, 1–48 (2010)

Viera, A.J., Garrett, J.M.: Understanding interobserver agreement: the Kappa statistic. Fam. Med. 37,

360–363 (2005)

Viswesvaran, C., Ones, D.S.: Measurement error in ‘‘Big Five Factors’’ personality assessment: reliability

generalization across studies and measures. Educ. Psychol. Meas. 60, 224–235 (2000)

von Eye, A.: An alternative to Cohen’s j. Eur. Psychol. 11, 12–24 (2006)

Yin, P., Fan, X.: Assessing the reliability of Beck Depression Inventory scores: reliability generalization

across studies. Educ. Psychol. Meas. 60, 201–223 (2000)

Zwick, R.: Another look at interrater agreement. Psychol. Bull. 3, 374–378 (1988)

Health Serv Outcomes Res Method (2011) 11:145–163 163

123

The relationship between adipose tissue RAAS activity and the risk factors of prediabetes in different ethnicities: A protocol for a systematic review and meta-analysis

Article

Full-text available

May 2024
MEDICINE

Background The incidence and prevalence of prediabetes has become a global concern. The risk factors of prediabetes, such as insulin resistance, adiposity, lipotoxicity and obesity, in conjunction with the alteration of the renin-angiotensin-aldosterone system (RAAS), have been positively correlated with the high morbidity and mortality rate. Thus, this systematic review seeks to establish the relationship between the risk factors of prediabetes, namely insulin resistance adiposity, lipotoxicity, obesity and the RAAS. Therefore, a synthesis of these risk factors, their clinical indicators and the RAAS components will be compiled in order to establish the association between the RAAS alteration and obesity in prediabetic patients. Methods This protocol for a systematic review was developed in compliance with the Preferred Reporting Items for Systematic Review and Meta-Analysis Protocols (PRISMA-P) standards. This will be accomplished by searching clinical Medical Subject Headings categories in MEDLINE with full texts, EMBASE, Web of Science, PubMed, Cochrane Library, Academic Search Complete, ICTRP and ClinicalTrial.gov. Reviewers will examine all of the findings and select the studies that meet the qualifying criteria. To check for bias, the Downs and Black Checklist will be used, followed by a Review Manager v5. A Forrest plot will be used for the meta-analysis and sensitivity analysis. Furthermore, the strength of the evidence will be assessed utilizing the Grading of Recommendations Assessment, Development, and Evaluation procedure (GRADE). The protocol has been registered with PROSPERO CRD42022320252. This systematic review and meta-analysis will include published randomized clinical trials, observational studies and case-control studies from the years 2000 to 2022.

Development of potential dysgraphia handwriting dataset

Article

Full-text available

May 2024

Developing a Tool for Assessing the Process of Seeking Health Information: Online Think-Aloud Method

Article

Full-text available

May 2024

Nursing students can access massive amounts of online health data to drive cutting-edge evidence-based practice in clinical placement, to bridge the theory–practice gap. This activity requires investigation to identify the strategies nursing students apply to evaluate online health information. Online Think-Aloud sessions enabled 14 participants to express their cognitive processes in navigating various educational resources, including online journals and databases, and determining the reliability of sources, indicating their strategies for information-seeking, which helped to create this scoring system. Easy access and user convenience were clearly the instrumental factors in this behavior, which has troubling implications for the lack of use of higher-quality resources (e.g., from peer-reviewed academic journals). The identified challenges encountered during resource access included limited skills in the critical evaluation of information credibility and reliability, signaling a requirement for improved information literacy skills. Participants acknowledged the importance of evidence-based, high-quality information, but faced numerous barriers, such as restricted access to professional and specialty databases, and a lack of academic skills training. This paper develops and critiques a Performative Tool for assessing the process of seeking health information using an online Think-Aloud method, and explores factors and strategies contributing to evidence-based health information access and utilization in clinical practice, aiming to provide insight into individuals’ information-seeking behaviors in online health contexts.

Characterizing engagement in the science practices: A study of preservice elementary teachers

Article

Full-text available

May 2024
SCI EDUC

Preservice elementary teachers enter their science methods courses with a range of prior experience with science practice. Those prior experiences likely inform much of their science pedagogy and goals. In this study, the authors examine how a cohort of preservice elementary teachers engaged in science practice as they learned content in a physics course. Drawing on course documents, videorecords, and artifacts from in-class lab work and interviews with nine participants, the authors used an asset-based, mixed methods approach. The authors developed rubrics to assess the level of sophistication the participants used while engaging in science practice on a scale of 1 (pre-novice) to 4 (experienced). They used descriptive statistics and ANOVA's to interpret the performance of the participants in addition to grounded theory open coding of interviews to determine the participants' level of prior experience with science practice. The findings suggest that these preservice teachers primarily engaged in science practices at a novice level. In general, their sophistication scores on the rubric aligned with their prior experience. The findings suggest that while one content course steeped in science practice was not enough to significantly change preservice teachers' engagement, it can provide a needed starting place and that it likely takes time to develop these skills. The findings have implications for both teacher educators and researchers who hope to increase the use of science practice as a method of learning science content.

Protein characteristics substantially influence the propensity of activity cliffs among kinase inhibitors

Article

Full-text available

Apr 2024

Activity cliffs (ACs) are pairs of structurally similar molecules with significantly different affinities for a biotarget, posing a challenge in computer-assisted drug discovery. This study focuses on protein kinases, significant therapeutic targets, with some exhibiting ACs while others do not despite numerous inhibitors. The hypothesis that the presence of ACs is dependent on the target protein and its complete structural context is explored. Machine learning models were developed to link protein properties to ACs, revealing specific tripeptide sequences and overall protein properties as critical factors in ACs occurrence. The study highlights the importance of considering the entire protein matrix rather than just the binding site in understanding ACs. This research provides valuable insights for drug discovery and design, paving the way for addressing ACs-related challenges in modern computational approaches.

The Influence of the Stem-Based Guided Inquiry Model on Students’ Creative Thinking Skills in Science Learning: A Meta-Analysis Study

Article

Full-text available

Mar 2024

This study aims to determine the effect of STEM-based guided inquiry models on students' Creative Thinking Skills in science learning. This type of research is a meta-analysis. The study analyzed 15 primary studies published in 2018-2023 that had met the inclusion criteria. Search data sources through the Google scholar database; ERIC, Taylor of Francis, ScienceDirect and ProQuest. Data analysis with the help of the JSAP application verse 0.16.3. These results conclude that the overall value of effect size is 0.99 (95% CI [ 0.79; 1,19]) high category. These findings show that the application of STEM-based inquiry-based learning models affects students' 21st century thinking skills. In addition, these findings provide important information on STEM-based guided inquiry learning in schools.

Added utility of temperature zone information in remote sensing-based large scale crop mapping

Article

Jun 2024

VR-Grasp: A Human Grasp Taxonomy for Virtual Reality

Article

May 2024

Negative Screening AI in Ankle Stress Radiography to Reduce Workload

Article

May 2024
ACAD RADIOL

UrduAspectNet: Fusing Transformers and Dual GCN for Urdu Aspect-Based Sentiment Detection

Article

May 2024

Urdu, characterized by its intricate morphological structure and linguistic nuances, presents distinct challenges in computational sentiment analysis. Addressing these, we introduce ”UrduAspectNet” – a dedicated model tailored for Aspect-Based Sentiment Analysis (ABSA) in Urdu. Central to our approach is a rigorous preprocessing phase. Leveraging the Stanza library, we extract Part-of-Speech (POS) tags and lemmas, ensuring Urdu’s linguistic intricacies are aptly represented. To probe the effectiveness of different embeddings, we trained our model using both mBERT and XLM-R embeddings, comparing their performances to identify the most effective representation for Urdu ABSA. Recognizing the nuanced inter-relationships between words, especially in Urdu’s flexible syntactic constructs, our model incorporates a dual Graph Convolutional Network (GCN) layer.Addressing the challenge of the absence of a dedicated Urdu ABSA dataset, we curated our own, collecting over 4,603 news headlines from various domains, such as politics, entertainment, business, and sports. These headlines, sourced from diverse news platforms, not only identify prevalent aspects but also pinpoints their sentiment polarities, categorized as positive, negative, or neutral. Despite the inherent complexities of Urdu, such as its colloquial expressions and idioms, ”UrduAspectNet” showcases remarkable efficacy. Initial comparisons between mBERT and XLM-R embeddings integrated with dual GCN provide valuable insights into their respective strengths in the context of Urdu ABSA. With broad applications spanning media analytics, business insights, and socio-cultural analysis, ”UrduAspectNet” is positioned as a pivotal benchmark in Urdu ABSA research.

MANY-FACET RASCH MEASUREMENT

Book

Full-text available

Jun 1994

John Michael Linacre

The theory and application of Many-Facet Rasch Measurement to judged (rated or rank-ordered) performances, and description of the estimation of the MFRM Rasch measures focusing on missing data.

Another Look at Interrater Agreement

Article

Full-text available

May 1988

Rebecca Zwick

Most currently used measures of interrater agreement for the nominal case incorporate a correction for chance agreement. The definition of chance agreement, however, is not the same for all coefficients. Three chance-corrected coefficients are Cohen’s (1960) κ; Scott’s (1955) π; and the S index of Bennett, Alpert, and Goldstein (1954), which has reappeared in many guises. For all three measures, independence between raters is assumed in deriving the proportion of agreement expected by chance. Scott’s π involves a further assumption of homogeneous rater marginals, and the S coefficient requires the assumption of uniform marginal distributions for both raters. Because of these disparate formulations, κ, π, and S can lead to different conclusions about rater agreement. Consideration of the properties of these measures leads to the recommendation that marginal homogeneity be assessed as a first step in the analysis of rater agreement. If marginal homogeneity can be assumed, π can be used as an index of agreement.

Best Practices in Interrater Reliability Three Common Approaches

Chapter

Jan 2008

A Coefficient of Agreement for Nominal Scales

Article

Apr 1960
EDUC PSYCHOL MEAS

Jacob Cohen

Practical Statistics for Medical Research

Book

Nov 1990

Douglas G. Altman

Measurement and Evaluation in Psychology and Education.

Article

Dec 1961

Interval estimation for Cohen's kappa as a measure of agreement

Article

Mar 2000

Cohen's kappa statistic is a very well known measure of agreement between two raters with respect to a dichotomous outcome. Several expressions for its asymptotic variance have been derived and the normal approximation to its distribution has been used to construct confidence intervals. However, information on the accuracy of these normal-approximation confidence intervals is not comprehensive. Under the common correlation model for dichotomous data, we evaluate 95 per cent lower confidence bounds constructed using four asymptotic variance expressions. Exact computation, rather than simulation is employed. Specific conditions under which the use of asymptotic variance formulae is reasonable are determined. Copyright (C) 2000 John Wiley & Sons, Ltd.

Guidelines for authors

Article

Jan 1994
EDUC PSYCHOL MEAS

A G Thompson

Analyzing effect sizes: Random-effects models

Article

Jan 2009

Stephen W. Raudenbush

This volume considers the problem of quantitatively summarizing results from a stream of studies, each testing a common hypothesis. In the simplest case, each study yields a single estimate of the impact of some intervention. Such an estimate will deviate from the true effect size as a function of random error because each study uses a finite sample size. What is distinctive about this chapter is that the true effect size itself is regarded as a random variable taking on different values in different studies, based on the belief that differences between the studies generate differences in the true effect sizes. This approach is useful in quantifying the heterogeneity of effects across studies, incorporating such variation into confidence intervals, testing the adequacy of models that explain this variation, and producing accurate estimates of effect size in individual studies. After discussing the conceptual rationale for the random effects model, this chapter provides a general strategy for answering a series of questions that commonly arise in research synthesis: 1. Does a stream of research produce heterogeneous results? That is, do the true effect sizes vary? 2. If so, how large is this variation? 3. How can we make valid inferences about the average effect size when the true effect sizes vary? 4. Why do study effects vary? Specifically do observable differences between studies in their target populations, measurement approaches, definitions of the treatment, or historical contexts systematically predict the effect sizes? 5. How effective are such models in accounting for effect size variation? Specifically, how much variation in the true effect sizes does each model explain? 6. Given that the effect sizes do indeed vary, what is the best estimate of the effect in each study? I illustrate how to address these questions by re-analyzing data from a series of experiments on teacher expectancy effects on pupil's cognitive skill. My aim is to illustrate, in a comparatively simple setting, to a broad audience with a minimal background in applied statistics, the conceptual framework that guides analyses using random effects models and the practical steps typically needed to implement that framework. Although the conceptual framework guiding the analysis is straightforward, a number of technical issues must be addressed satisfactorily to ensure the validity the inferences. To review these issues and recent progress in solving them requires a somewhat more technical presentation. Appendix 16A considers alternative approaches to estimation theory, and appendix 16B considers alternative approaches to uncertainty estimation, that is, the estimation of standard errors, confidence intervals, and hypothesis tests. These appendices together provide re-analyses of the illustrative data under alternative approaches, knowledge of which is essential to those who give technical advice to analysts.

Fundamental Statistics In Psychology and Education

Article

Jan 1974
J Roy Stat Soc

Meta-analysis of Cohen’s kappa

Abstract and Figures

Recommended publications

Inter-rater reliability of the scale for assessment of negative symptoms in schizophrenia

Inter-rater reliability and aspects of validity of the parent-infant relationship global assessment...

Intra-rater Reliability of Oral Proficiency Ratings

Kappa Coefficient