Article

The Measurement Of Observer Agreement For Categorical Data


Abstract

This paper presents a general statistical methodology for the analysis of multivariate categorical data arising from observer reliability studies. The procedure essentially involves the construction of functions of the observed proportions which are directed at the extent to which the observers agree among themselves and the construction of test statistics for hypotheses involving these functions. Tests for interobserver bias are presented in terms of first-order marginal homogeneity and measures of interobserver agreement are developed as generalized kappa-type statistics. These procedures are illustrated with a clinical diagnosis example from the epidemiological literature.
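As a point of reference for the kappa-type statistics mentioned in the abstract, the following is a minimal sketch of the simplest member of that family: unweighted Cohen's kappa for two observers. It is not the generalized procedure developed in the paper. The function name `cohen_kappa` and the example counts are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def cohen_kappa(table: Sequence[Sequence[float]]) -> float:
    """Unweighted Cohen's kappa for a square two-observer contingency table.

    table[i][j] is the number of subjects that observer A placed in
    category i and observer B placed in category j.
    """
    k = len(table)
    if any(len(row) != k for row in table):
        raise ValueError("contingency table must be square")
    n = sum(sum(row) for row in table)
    if n == 0:
        raise ValueError("contingency table is empty")

    # Observed agreement: proportion of subjects on the main diagonal.
    p_o = sum(table[i][i] for i in range(k)) / n

    # Marginal proportions for each observer.
    row_marg = [sum(row) / n for row in table]
    col_marg = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    # Agreement expected by chance if the observers rated independently.
    p_e = sum(r * c for r, c in zip(row_marg, col_marg))
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")

    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical 2x2 table: 50 subjects, two diagnostic categories.
example_table = [[20, 5], [10, 15]]
print(round(cohen_kappa(example_table), 3))
```

In this hypothetical table the observed agreement is 0.70 and the chance agreement is 0.50, so kappa is 0.40.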


... Rodríguez-Lomba et al. attempt to illustrate the subjectivity present in the detection of different dermoscopic features [9]. To this end, they calculate the concordance between five dermatologists in detecting twenty-two features and obtain a Kappa-Fleiss agreement coefficient ranging from 0.04 to 0.46, indicating slight or fair agreement [9,10]. This lack of agreement has prompted researchers to avoid determining the GT or reference standard [11] from a single observer. ...
... Cohen-Kappa value was 0.9075, with a standard deviation equal to 0.0555. Cohen-Kappa concordance level according to Landis et al. [10] is: kappa >0.8 means almost perfect agreement; >0.6 means substantial agreement; >0.4 means moderate agreement; >0.2 means fair agreement; >0 means slight agreement; <0 means no agreement. Thus, the average value obtained indicates almost perfect agreement, and, as can be observed in the standard deviation, almost perfect agreement was also obtained for every pair of raters. ...
... Secondly, Fleiss Kappa was calculated to measure the concordance among all dermatologists, which was 0.9079. According to Landis et al [10] these values of Kappa statistics can be described as almost perfect agreement. ...
Preprint
Full-text available
Background: The existence of different basal cell carcinoma (BCC) clinical criteria cannot be objectively validated. An adequate ground-truth is needed to train an artificial intelligence (AI) tool that explains the BCC diagnosis by providing its dermoscopic features. Objectives: To determine the consensus among dermatologists on dermoscopic criteria of 204 BCC. To analyze the performance of an AI tool when the ground-truth is inferred. Methods: A single center, diagnostic and prospective study was conducted to analyze the agreement in dermoscopic criteria by four dermatologists and then derive a reference standard. 1434 dermoscopic images have been used, that were taken by a primary health physician, sent via teledermatology, and diagnosed by a dermatologist. They were randomly selected from the teledermatology platform (2019-2021). 204 of them were tested with an AI tool; the remainder trained it. The performance of the AI tool trained using the ground-truth of one dermatologist versus the ground-truth statistically inferred from the consensus of four dermatologists was analyzed using McNemar's test and Hamming distance. Results: Dermatologists achieve perfect agreement in the diagnosis of BCC (Fleiss-Kappa=0.9079), and a high correlation with the biopsy (PPV=0.9670). However, there is low agreement in detecting some dermoscopic criteria. Statistical differences were found in the performance of the AI tool trained using the ground-truth of one dermatologist versus the ground-truth statistically inferred from the consensus of four dermatologists. Conclusions: Care should be taken when training an AI tool to determine the BCC patterns present in a lesion. Ground-truth should be established from multiple dermatologists.
... The quality of the selected CG was assessed using the AGREE Instrument (Appraisal of Guidelines for Research & Evaluation), 2nd edition. AGREE is an assessment tool developed from reviews of more than 100 selected guidelines independently evaluated by more than 200 reviewers from different countries [24][25][26][27]. It is used as part of a protocol for quality assessment of CG to improve healthcare by WHO and several technology assessment agencies around the world 3,25-27. ...
... Kappa coefficients of moderate agreement (Kappa>0.4) were considered preferable for this type of study 3,[27][28]. For the agreement analysis, the raters jointly decided that assessment scores of 1 and 2 would be considered "low," scores of 3 to 5 would be "intermediate" and scores of 6 and 7 "high". ...
... Cohen's kappa ranges from -1 to 1. According to Landis and Koch (Landis and Koch, 1977), values less than 0 indicate that the classifier is useless, while values ranging between 0 and 0.20 qualify its usefulness as slight, those between 0.21 and 0.40 as fair, those between 0.41 and 0.60 as moderate, those between 0.61 and 0.80 as substantial, and those between 0.81 and 1 as almost perfect. We also use the fidelity metric to assess how well the explanations provided by the methods that interpret RF approximate the predictions of the RF model. ...
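The benchmark bands quoted in these excerpts can be written as a small lookup. This is a sketch only: the function name `landis_koch_label` is hypothetical, and the handling of values that fall between the published two-decimal bands (for example 0.205) is an assumption, here resolved by treating each upper bound as inclusive.

```python
def landis_koch_label(kappa: float) -> str:
    """Map a kappa value to the Landis and Koch (1977) descriptive label.

    Assumption: each band's upper bound is inclusive, so a value such as
    0.205 (which sits between the published bands) is labelled "fair".
    """
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [-1, 1]")
    if kappa < 0.0:
        return "poor"
    # (inclusive upper bound, label), checked in ascending order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"  # unreachable for valid input; kept for safety


# Values reported in the excerpts on this page.
for value in (0.42, 0.706, 0.9079):
    print(value, landis_koch_label(value))
```

These labels are conventions rather than statistical tests, so a label should be read together with the sample size and a confidence interval where one is reported.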
... If we consider the sum of wins and ties, the differences are more important in multiclass classification than in binary classification. Based on the results of the Kappa measure, and according to Landis and Koch classification (Landis and Koch, 1977), the usefulness of the classification processed by RF, PreForest-ORE, Forest-ORE, STEL, Forest-ORE+STEL, and RPART methods, on covered instances is deemed substantial, whereas the usefulness of the classification processed by the other methods is considered moderate. Table 14 reports the fidelity metric measured for the methods that are intended to approximate the RF model. ...
Preprint
Random Forest (RF) is well-known as an efficient ensemble learning method in terms of predictive performance. It is also considered a "black box" because of its hundreds of deep decision trees. This lack of interpretability can be a real drawback for acceptance of RF models in several real-world applications, especially those affecting people's lives, such as in healthcare, security, and law. In this work, we present Forest-ORE, a method that makes RF interpretable via an optimized rule ensemble (ORE) for local and global interpretation. Unlike other rule-based approaches aiming at interpreting the RF model, this method simultaneously considers several parameters that influence the choice of an interpretable rule ensemble. Existing methods often prioritize predictive performance over interpretability coverage and do not provide information about existing overlaps or interactions between rules. Forest-ORE uses a mixed-integer optimization program to build an ORE that considers the trade-off between predictive performance, interpretability coverage, and model size (size of the rule ensemble, rule lengths, and rule overlaps). In addition to providing an ORE competitive in predictive performance with RF, this method enriches the ORE through other rules that afford complementary information. It also enables monitoring of the rule selection process and delivers various metrics that can be used to generate a graphical representation of the final model. This framework is illustrated through an example, and its robustness is assessed through 36 benchmark datasets. A comparative analysis of well-known methods shows that Forest-ORE provides an excellent trade-off between predictive performance, interpretability coverage, and model size.
... There is an almost perfect agreement in all staging items of the score for both limbs with the lower limit of the 95% confidence interval >80%. 19 Intraobserver agreement for the final score is also excellent, with ICC ≥ 0.90. ...
... The reliability is overall strong (lower limit of the 95% confidence interval >60%) or almost perfect (lower limit of the 95% confidence interval >80%), ranging from 0.77 to 0.95. 19 Interobserver agreement for the final score is also excellent, with ICC ≥ 0.90. ...
... All examiners went through a training process in which they screened abstracts and titles for a set of 150 articles that were pre-selected by the first author to contain studies that were both easy and more difficult to rate as relevant or irrelevant. We estimated the inter-examiner accuracy, yielding a Cohen's κ of 0.40, corresponding to a fair to moderate agreement [33]. This first batch allowed us to identify points of misunderstanding and better explain the criteria for selecting an article. ...
... This second batch allowed us to clarify the final difficult points before splitting the abstracts between examiners so that each examiner read and rated ~1100 abstracts. During this second phase, inter-examiner accuracy was examined again on 20 articles per reviewer, yielding a κ of 0.73, which is considered a substantial agreement [33]. The filtering process led to a selection of 1539 studies from the original search that fitted the scope of this overview. ...
Article
Full-text available
Context-dependent dispersal allows organisms to seek and settle in habitats improving their fitness. Despite the importance of species interactions in determining fitness, a quantitative synthesis of how they affect dispersal is lacking. We present a meta-analysis asking (i) whether the interaction experienced and/or perceived by a focal species (detrimental interaction with predators, competitors, parasites or beneficial interaction with resources, hosts, mutualists) affects its dispersal; and (ii) how the species' ecological and biological background affects the direction and strength of this interaction-dependent dispersal. After a systematic search focusing on actively dispersing species, we extracted 397 effect sizes from 118 empirical studies encompassing 221 species pairs; arthropods were best represented, followed by vertebrates, protists and others. Detrimental species interactions increased the focal species’ dispersal (adjusted effect: 0.33 [0.06, 0.60]), while beneficial interactions decreased it (−0.55 [−0.92, −0.17]). The effect depended on the dispersal phase, with detrimental interactors having opposite impacts on emigration and transience. Interaction-dependent dispersal was negatively related to species’ interaction strength, and depended on the global community composition, with cues of presence having stronger effects than the presence of the interactor and the ecological complexity of the community. Our work demonstrates the importance of interspecific interactions on dispersal plasticity, with consequences for metacommunity dynamics. This article is part of the theme issue ‘Diversity-dependence of dispersal: interspecific interactions determine spatial dynamics’.
... Three judges independently rated whether each word should be categorized as generally having a positive and/or negative emotion, after which a reconciliation process was conducted to resolve conflicting decisions. Initial Fleiss' Kappa (Fleiss, 1971) for interrater agreement was 0.54 (moderate agreement) and the final was 0.95, indicating almost perfect agreement (Landis and Koch, 1977). The main changes following the reconciliation process was (1) the addition of words with low polarity/confidence e.g., the word ‫אבל‬ 'aval' (but) was added in the second phase to the negative list; ...
... Then, the lists were updated by the following set of rules: (a) a word remained in the category list if two out of three judges agreed it should be included, (b) a word was deleted from the category list if at least two of the three judges agreed it should be excluded. Fleiss' Kappa (Fleiss, 1971) for interrater agreement was 0.95, indicating almost perfect agreement (Landis and Koch, 1977). Based on these lexicons, we calculated the number of positive and negative emotion words within each session text. ...
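For the multi-rater case referred to in these excerpts, Fleiss' kappa (Fleiss, 1971) can be sketched as follows. The function name `fleiss_kappa` and the example ratings are hypothetical. The sketch assumes every item is rated by the same number of raters.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a fixed number of raters per item.

    counts[i][j] is the number of raters who assigned item i to category j.
    """
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("no items supplied")
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_cats = len(counts[0])

    # Per-item agreement: share of rater pairs that agree on the item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Overall proportion of all ratings that fall in each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")

    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical data: 4 words, 3 judges, categories (positive, negative).
ratings = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(round(fleiss_kappa(ratings), 3))
```

In this hypothetical example two items have unanimous ratings and two are split two-to-one, which gives a mean pairwise agreement of 2/3 against a chance agreement of 1/2.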
... Therefore, a Pearson correlation coefficient was used to assess the relationship between the CHO-KLAT and the PedsQL. 15 The PedsQL is a generic measure of HRQoL and contains 23 questions about health-activity, feelings, and school. 16 Turkish version of PedsQL was validated. ...
... In the study of Landis and Koch, ICC scores between 0.60 and 0.80 have been considered substantial reliability coefficients. 15 In the other reliability assessment, an ICC score between 0.75 and 0.90 is defined as "good" compliance, above 0.90 "excellent." 22 In the first CHO-KLAT study, the ICCs were 0.74 for child-reported and 0.83 for proxy-reported. ...
Article
Full-text available
Objective: The evaluation of health-related quality of life (HRQoL) is encouraged to assess the multidimensional impact of treatments and disease and to improve care in boys with hemophilia (BwH). However, validated HRQoL tools for BwH are not yet available in Turkish. The purpose of this study was to assess the validity and reliability of the Canadian Hemophilia Outcomes-Kids’ Life Assessment Tool (CHO-KLAT), version 2.0, which is a valid multilingual tool, in Turkish. Methods: The procedure included 4 steps: linguistic translation, content validity, validity evaluation with the Pediatric Quality of Life (PedsQL), and finally test-retest analysis for reliability assessment. The participants were questioned for the type and severity of hemophilia, medical treatment, and inhibitor status. Results: The primary Turkish version of the CHO-KLAT evolved with the cooperation of the Canadian and Turkish teams. Content validity was performed with 9 experts and the latest version of the Turkish CHO-KLAT was produced. This multicenter study was conducted with 53 BwH aged 4-18 for validity assessment and 52 BwH for test-retest reliability. The mean age of BwH was 11.6 (standard deviation (SD): 4.2). The means of CHO-KLAT and PedsQL were 64.1 (SD: 4.2) and 66.7 (SD: 15.3). As a result of the validity evaluation, a strong correlation was found between CHO-KLAT and PedsQL (r = 0.603; P < .001). The intraclass correlation coefficient was 0.887 in the test-retest reliability. Conclusion: The Turkish version of CHO-KLAT 2.0 was validated. It is now available to be used in clinical studies for HRQoL assessment of Turkish BwH. Keywords: Hemophilia, children, quality of life, validity and reliability
... An Analysis of Audiences, Purposes and Challenges' (Mewburn & Thomson, 2013) at this stage of screening because it seemed as if they could potentially pertain to educators' invisible labour. We then calculated interrater reliability scores and found our categorisation had 86.2% agreement and a Cohen's kappa of 0.706, or 'substantial' agreement (Landis & Koch, 1977). We discussed any coding discrepancies until we came to consensus. ...
... We also filtered out Smeltzer et al. (2015) at this stage because the abstract emphasised work-life balance but did not suggest any connection to invisible labour. We again calculated interrater reliability scores and found our categorisation had 96.1% agreement and a Cohen's kappa of 0.754, again considered 'substantial' agreement (Landis & Koch, 1977). We again discussed any coding discrepancies until we came to consensus. ...
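The two screening rounds quoted above report both raw percent agreement and Cohen's kappa, which makes it possible to back out the chance agreement each pair of figures implies. The sketch below rearranges the definition kappa = (p_o - p_e) / (1 - p_e) to solve for p_e. The function name `implied_chance_agreement` is hypothetical, and the calculation assumes each reported pair of figures was computed from the same two-rater table.

```python
def implied_chance_agreement(p_o: float, kappa: float) -> float:
    """Chance agreement p_e implied by observed agreement p_o and kappa.

    Rearranged from kappa = (p_o - p_e) / (1 - p_e).
    """
    if kappa >= 1.0:
        raise ValueError("p_e cannot be recovered when kappa equals 1")
    return (p_o - kappa) / (1.0 - kappa)


# Figures reported in the excerpts: (percent agreement, Cohen's kappa).
rounds = {"first screening": (0.862, 0.706), "second screening": (0.961, 0.754)}
for name, (p_o, kappa) in rounds.items():
    print(name, round(implied_chance_agreement(p_o, kappa), 2))
```

Under these assumptions the implied chance agreement is roughly 0.53 in the first round and roughly 0.84 in the second. This would explain why a ten-point gain in raw agreement moved kappa only slightly: when one category dominates, most agreement is already expected by chance.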
Article
Full-text available
The hidden or overlooked nature of many of educators' professional activities complicates the already difficult task of supporting educators' labour—in both K‐12 and higher education settings. These efforts can be understood as types of invisible labour. Following PRISMA standards, we conducted a systematic literature review to answer a single research question: How have scholars framed educators' professional activities in terms of invisible labour? This systematic review searched 10 educational databases and identified 16 peer‐reviewed journal articles spanning 2011–2021. From thematic analysis of these studies, we developed a model of five types of invisibility that intersect and mask educators' professional efforts: background, care, precarious, identity and remote labour. The review also showed several overall themes related to educators' invisible labour, which we discuss in connection to the literature: effort is often semivisible, invisibility is subjective, effort by marginalised educators is often overlooked, labour in unexpected places often means effort is overlooked, and there are layers of factors masking effort. We then discuss implications for practice, starting with five invisible labour questions to prompt reflection, then how to apply invisible labour as an improvement lens for identifying needs, allocating resources, analysing jobs and tasks, and evaluating performance.
... Inter-rater agreement, assessed by Cohen's Kappa coefficient during the article selection phase, demonstrated an "almost perfect agreement" between reviewers (kappa = 0.99). The methodological quality and risk of bias was assessed using the JBI Critical Appraisal Checklist for Analytical Cross-Sectional Studies (11). The JBI analysis is described in Table 2. ...
... Remarkably, malignant and benign SGTs presented a lower percentage of marked cells expressed for hMSH3, with their lowest expression represented by 4.27 ± 5.35% and 7.30 ± 3.41%, respectively (13). Amaral-Silva et al. (11) reported that malignant tumors showed an underexpression with a total mean of 6.25 ± 6.95% in cells marked for the activity of hMSH3 biomarker. Malignant SGTs exhibited a lower total mean than benign SGTs, suggesting a higher lack of hMSH3 expression. ...
... All records were systematically screened using EPPI-Reviewer software (Version: 6.15.0.0) (Thomas et al., 2023). The provided standard coding scheme was adapted to meet all eligibility (Cohen, 1960;Landis & Koch, 1977;McHugh, 2012). (Sachdev et al., 2014) in line with the fifth version of the Diagnostic and Statistical Manual of Mental Disorders (American Psychiatric Association, 2013)) of (a) learning and memory, (b) complex attention, (c) executive function, and (d) visuospatial skills, and (6) findings of each study relating to cognitive performance. ...
... In case of disagreements, EdB served as referee. Inter-rater agreement was again assessed and interpreted based on Cohen's kappa (Cohen, 1960;Landis & Koch, 1977;McHugh, 2012). ...
Article
Full-text available
BACKGROUND: Exergame-based training is currently considered a more promising training approach than conventional physical and/or cognitive training. OBJECTIVES: This study aimed to provide quantitative evidence on dose-response relationships of specific exercise and training variables (training components) of exergame-based training on cognitive functioning in middle-aged to older adults (MOA). METHODS: We conducted a systematic review with meta-analysis including randomized controlled trials comparing the effects of exergame-based training to inactive control interventions on cognitive performance in MOA. RESULTS: The systematic literature search identified 22,928 records of which 31 studies were included. The effectiveness of exergame-based training was significantly moderated by the following training components: body position for global cognitive functioning, the type of motor-cognitive training, training location, and training administration for complex attention, and exercise intensity for executive functions. CONCLUSION: The effectiveness of exergame-based training was moderated by several training components that have in common that they enhance the ecological validity of the training (e.g., stepping movements in a standing position). Therefore, it seems paramount that future research focuses on developing innovative novel exergame-based training concepts that incorporate these (and other) training components to enhance their ecological validity and transferability to clinical practice. We provide specific evidence-based recommendations for the application of our research findings in research and practical settings and identified and discussed several areas of interest for future research. PROSPERO registration number CRD42023418593; prospectively registered, date of registration: 1 May 2023
... to k = .91, which is a substantial to almost perfect level of agreement (Landis & Koch, 1977). ...
... The coders practiced labeling by completing four videos of the collected data, of which feedback was discussed after completion of each video. This practice phase was followed by the coding of a fifth video for which coder agreement was substantial (Landis & Koch, 1977), ranging from k = .70 to k = .75. ...
Article
Full-text available
The secure base phenomenon was ascribed to changes in exploration observed during Ainsworth’s Strange Situation Procedure (SSP), related to the quality of the attachment relationship. However, infant temperament was not taken into consideration. The current study aims to replicate Ainsworth’s findings regarding infant exploration and attachment quality during the SSP and extend the findings by examining the role of infant temperament. One hundred thirty-two mother-infant dyads participated in the SSP when infants were 12 months old. Video recordings were coded for attachment quality and for duration of locomotion, duration of engagement with toys, and quality of engagement with toys. Temperamental activity level and fear were assessed with the Infant Behavior Questionnaire. Results showed that—irrespective of infant temperament—infants with insecure-resistant attachment relationships engaged less with toys compared to infants with secure or insecure avoidant relationships, and these differences were amplified during separation from the mother. Duration of engagement with toys was thus a robust indicator of attachment-related infant exploratory behavior. Duration of locomotion increased in response to separation from the mother and decreased after reunion. This likely reflects a mix of exploratory and proximity seeking behavior, and was more affected by controlling for temperamental fear. For quality of engagement with toys, no associations with attachment and temperament were found. During the SSP, the manifestation of the secure base phenomenon depended on the combination of the type of exploratory behaviors and the quality of the attachment relationship, but also on infant temperament.
... Frequencies and percentages of IG achievement levels after the second observation are reported by the observees and the observers. A concordance analysis, Cohen's Kappa (Landis & Koch, 1977), is performed to estimate the agreement between the evaluations of the observees and the observers, thus considering the assessments of the observees as accurate reflections of reality. ...
... The validity and reliability of the analysis were also ensured through expert judgment, obtaining an 86% agreement level (Cohen's Kappa), which can be considered excellent (Landis & Koch, 1977). Cases of disagreement were discussed until a consensus was reached. ...
Article
This paper investigates whether Reciprocal Peer Observation is an effective practice for promoting Teacher Professional Development. It focuses on analysing the Improvement Goals transfer processes stemming from teachers' own educational approach, which teachers identify during Reciprocal Peer Observation. A total of 230 teachers, paired together, conducted a second classroom observation, focused on a specific Improvement Goals to assess the extent of their transfer. The findings indicate that Improvement Goals transfer to classroom practice occurs predominantly. The study analyses predictive and facilitating factors that contribute to this process. The results reveal that collaborative culture and collective agency are predictive factors for transfer. Similarly, personal factors arising from reflection and awareness of one's own practices, alongside the support of the partner, could promote the identified processes of improvement. In conclusion, Reciprocal Peer Observation can be regarded as a highly effective practice for identifying Improvement Goals and transferring them to the classroom, benefiting Teacher Professional Development.
... A Cohen's kappa coefficient was calculated, showing moderate agreement (K=0.42). 18 Data collection process The two authors (JS and RP) responsible for the data collection process developed a data extraction sheet and pilot tested it on two included reports. The extraction sheet was then sent to four authors (AH, ES, KS, EHS) for any adjustment. ...
Article
Full-text available
Objective The purpose of this study was to review the current literature regarding the non-operative treatment of isolated medial collateral ligament (MCL) injuries. Design Systematic review, registered in the Open Science Framework ( https://doi.org/10.17605/OSF.IO/E9CP4 ). Data sources The Embase, MEDLINE and PEDro databases were searched; last search was performed on December 2023. Eligibility criteria Peer-reviewed original reports from studies that included information about individuals who sustained an isolated MCL injury with non-surgical treatment as an intervention, or reports comparing surgical with non-surgical treatment were eligible for inclusion. Included reports were synthesised qualitatively. Risk of bias was assessed with the Risk of Bias Assessment tool for Non-randomized Studies. Certainty of evidence was determined using the Grading of Recommendations Assessment Development and Evaluation. Results A total of 26 reports (1912 patients) were included, of which 18 were published before the year 2000 and 8 after. No differences in non-operative treatment were reported between grade I and II injuries, where immediate weight bearing and ambulation were tolerated, and rehabilitation comprised different types of strengthening exercises with poorly reported details. Some reports used immobilisation with a brace as a treatment method, while others did not use any equipment. The use of a brace and duration of use was inconsistently reported. Conclusion There is substantial heterogeneity and lack of detail regarding the non-operative treatment of isolated MCL injuries. This should prompt researchers and clinicians to produce high-quality evidence studies on the promising non-operative treatment of isolated MCL injuries to aid in decision-making and guide rehabilitation after MCL injury. Level of evidence Level I, systematic review.
... The thematic analysis of the diary entries resulted in 1350 first-order codes, allocated to 25 second-order themes across the following four aggregated dimensions: technical and methodological functionality (n = 55), creative processes (n = 102), social and environmental affordances (n = 130), and affective states (n = 24). The peer-review process yielded a significant level of interrater agreement (κ = 0.68), indicating a strong consensus and consistency among the raters, and confirming the reliability of the coding scheme (Landis and Koch, 1977). ...
Conference Paper
Full-text available
For creative collaboration, hybrid models promise to combine the freedom and flexibility of remote work and the inspiring interactions of in-person collaboration. Yet understanding of how design principles for virtual and physical settings are transferable to hybrid collaboration remains limited. Functionality, social-environmental affordances, and affective states are detrimental to creativity. Yet the question remains: how do these drivers need to be designed in hybrid collaboration to support the creative process? This qualitative, exploratory diary study examines creative hybrid collaboration among 20 participants of a hybrid design thinking course. The results suggest that, from a user-centered perspective, the support of creative processes in hybrid collaboration requires adequate functionalities and technical infrastructure. Further, navigating team dynamics between online and offline participants is especially challenging due to contrasting perceptions and requires social presence, psychological safety, inclusion, and rapport-building. Recommendations are given for the practical establishment of hybrid teams to foster creativity and collaboration.
... For the agreement analysis, the data were evaluated using the two-way random method and absolute agreement was observed in the same evaluator at two different times (intra-examiner agreement), and between evaluators for each examination (inter-examiner agreement) [25]. The intraclass correlation coefficient values for four of the five observers showed high agreement between observers, indicating that the measurements were strong and reliable. ...
Article
Full-text available
This study aimed to evaluate the reliability of an age estimation method based on the pulp/tooth area ratio by assessing intra- and inter-examiner agreement across five observers at different intervals. Using the same X-ray device and technical parameters, 96 digital periapical X-ray images of upper and lower canines were obtained from 28 deceased people in Central America, whose age at death ranged from 19 to 49 years. Excellent and good agreement of results were achieved, and there were no statistically significant differences. The R2 value for upper teeth (54.0%) was higher than the R2 value for lower teeth (45.7%). The highest intraclass correlation coefficient value was 0.995 (0.993-0.997) and the lowest 0.798 (0.545-0.895). Inter-examiner agreement was high with values of 0.975 (0.965-0.983) and 0.927 (0.879-0.955). This method is adequate for assessing age in missing and unidentified people, including victims of mass disasters.
... As illustrated in Through diagnostic modeling, we identified two potential categories, the first containing 104 patients and the second containing 95 patients. The overall agreement between the clinical diagnostic groups defined by Model (i) and the two latent categories was 82%, with a kappa coefficient of 0.71 (95% CI 0.63, 0.79), depicting significant consistency [47]. This demonstrates the appropriateness of the clinical diagnostic reference standard. ...
Article
Full-text available
Early diagnosis of Bell's palsy is crucial for effective patient management in primary care settings. This study aimed to develop a simplified diagnostic tool to enhance the accuracy of identifying Bell's palsy among patients with facial muscle weakness. Data from 240 patients were analyzed using seven potential clinical evaluation indicators. Two diagnostic benchmarks were established: one based on clinical assessment and the other incorporating magnetic resonance imaging (MRI) findings. A multivariate logistic regression model was developed based on these benchmarks, resulting in the construction of a predictive tool evaluated through latent class models. Both models retained four key clinical indicators: absence of forehead wrinkles, accumulation of food and saliva inside the mouth on the affected side, presence of vesicular rash in the ear or pharynx, and lack of pain or symptoms associated with tick exposure, rash, or joint pain. The first model demonstrated excellent discriminative ability (area under the curve [AUC] = 0.96, 95% confidence interval [CI] 0.94 - 0.99) and calibration (P < 0.001), while the second model also showed good performance (AUC = 0.88, 95% CI 0.83 - 0.92) and calibration (P = 0.005). Bootstrap validation indicated no significant overfitting. The latent class defined by the first model significantly aligned with the clinical diagnosis group, while the second model showed lower consistency.
... In this study, fuzzy set approach was set by accuracy level (modified from Landis and Koch (1977) for forest type) (Table 2), as follows: -Forest type value higher than 50% represents perfect (P) accuracy between the classified map and the ground reference data. ...
Article
Full-text available
The forest map remains essential for investigating plant ecology and biodiversity patterns. This study proposed methods for mapping forest types based on ecological niche modeling and then used a fuzzy error matrix for accuracy assessment. The upper Ping basin of northern Thailand was selected as the study area. The modeled data included forest inventory, topographic, climatic, soil, and geological data. Ecological niche factor analysis was used to model and produce the best habitat suitability index of each forest type, which were then combined using hierarchically generated coding. As a result, eight classes of forest types were generated: dry dipterocarp forest (7,373.94 km², 32.81%), evergreen ecotone or transition area (3,666.97 km², 16.32%), mixed deciduous forest (3,440.79 km², 15.31%), deciduous ecotone or transition area (3,225.58 km², 14.35%), deciduous and evergreen forest (2,027.12 km², 9.02%), coniferous forest (CF; 365.28 km², 1.63%), moist and dry evergreen forest (290.08 km², 1.29%), and hill evergreen forest (270.56 km², 1.21%). Four variables were found to be critical in forest type distribution: elevation, mean annual temperature, annual maximum temperatures and annual minimum temperatures. To assess map accuracy, a fuzzy error matrix, which allows the recognition of ambiguous classes and does not ignore variation in the interpretation of the reference data at class boundaries, was used (75.89% overall accuracy).
... "substantial", 0.81-1.00 "almost perfect" [54]. ...
Article
Full-text available
Background Artificial intelligence (AI) chatbots are emerging educational tools for students in healthcare science. However, assessing their accuracy is essential prior to adoption in educational settings. This study aimed to assess the accuracy of predicting the correct answers from three AI chatbots (ChatGPT-4, Microsoft Copilot and Google Gemini) in the Italian entrance standardized examination test of healthcare science degrees (CINECA test). Secondarily, we assessed the narrative coherence of the AI chatbots’ responses (i.e., text output) based on three qualitative metrics: the logical rationale behind the chosen answer, the presence of information internal to the question, and presence of information external to the question. Methods An observational cross-sectional design was performed in September of 2023. Accuracy of the three chatbots was evaluated for the CINECA test, where questions were formatted using a multiple-choice structure with a single best answer. The outcome is binary (correct or incorrect). Chi-squared test and a post hoc analysis with Bonferroni correction assessed differences among chatbots performance in accuracy. A p-value of < 0.05 was considered statistically significant. A sensitivity analysis was performed, excluding answers that were not applicable (e.g., images). Narrative coherence was analyzed by absolute and relative frequencies of correct answers and errors. Results Overall, of the 820 CINECA multiple-choice questions inputted into all chatbots, 20 questions were not imported in ChatGPT-4 (n = 808) and Google Gemini (n = 808) due to technical limitations. We found statistically significant differences in the ChatGPT-4 vs Google Gemini and Microsoft Copilot vs Google Gemini comparisons (p-value < 0.001). The narrative coherence of AI chatbots revealed “Logical reasoning” as the prevalent correct answer (n = 622, 81.5%) and “Logical error” as the prevalent incorrect answer (n = 40, 88.9%). 
Conclusions Our main findings reveal that: (A) AI chatbots performed well; (B) ChatGPT-4 and Microsoft Copilot performed better than Google Gemini; and (C) their narrative coherence is primarily logical. Although AI chatbots showed promising accuracy in predicting the correct answer in the Italian entrance university standardized examination test, we encourage candidates to cautiously incorporate this new technology to supplement their learning rather than a primary resource. Trial registration Not required.
... Fisher's exact test was used for univariate analysis including categorical variables. The result of the kappa coefficient measures of reliability were classified as very good (0.61-0.8) or reached near-perfect agreement (0.81-1.0) according to the criteria previously reported [15]. Statistical significance was set at P < 0.05. ...
Article
Full-text available
Purpose Spinopelvic sagittal alignment is crucial for assessing balance and determining treatment efficacy in patients with adult spinal deformity (ASD). Only a limited number of reports have addressed spinopelvic parameters and lumbosacral transitional vertebrae (LSTV). Our primary objective was to study spinopelvic sagittal parameter changes in patients with LSTV. A secondary objective was to investigate clinical symptoms and quality of life (QOL) in patients with LSTV. Methods In this study, we investigated 371 participants who had undergone medical check-ups for the spine. LSTV was evaluated using Castellvi’s classification, and patients were divided into LSTV+ (type II-IV, L5 vertebra articulated or fused with the sacrum) and LSTV- groups. After propensity score matching for demographic data, we analyzed spinopelvic parameters, sacroiliac joint degeneration, clinical symptoms, and QOL for these two participant groups. Oswestry Disability Index (ODI) scores and EQ-5D (EuroQol 5 dimensions) indices were compared between the two groups. Results Forty-four patients each were analyzed in the LSTV + and LSTV- groups. The LSTV + group had significantly greater pelvic incidence (52.1 ± 11.2 vs. 47.8 ± 10.0 degrees, P = 0.031) and shorter pelvic thickness (10.2 ± 0.9 vs. 10.7 ± 0.8 cm, P = 0.018) compared to the LSTV- group. The “Sitting” domain of ODI (1.1 ± 0.9 vs. 0.6 ± 0.7, P = 0.011) and “Pain/Discomfort” domain of EQ-5D (2.0 ± 0.8 vs. 1.6 ± 0.7, P = 0.005) were larger in the LSTV + group. Conclusion There was a robust association between LSTV and pelvic sagittal parameters. Clinical symptoms also differed between the two groups in some domains. Surgeons should be aware of the relationship between LSTV assessment, radiographic parameters and clinical symptoms. Level of evidence 3.
... The area ratio is 0.840, and the prediction accuracy is 84.0 %. Based on the classification by Landis and Koch [35], the prediction accuracy of this model is fairly satisfactory. ...
... AUC values higher than 0.7 indicate that the models have good performance (Dudík et al., 2004). TSS ranges between −1 and 1, with values higher than 0.4 indicating that the models have fair quality (Landis & Koch, 1977). Here, we consider AUC > 0.8 and TSS > 0.5 as thresholds to measure the accuracy of the models; results below these values will be excluded from the final ensemble. ...
Article
Full-text available
Myotis originated during the Oligocene in Eurasia and has become one of the most diverse bat genera, with over 140 species. In the case of neotropical Myotis, there is a high degree of phenotypic conservatism. This means that the taxonomic and geographic limits of several species are not well understood, which constrains detailed studies on their ecology and evolution and how to effectively protect these species. Similar to other organisms, bats may respond to climate change by moving to different areas, adapting to new conditions, or going extinct. Ecological niche models have become established as an efficient and widely used method for interpolating (and sometimes extrapolating) species' distributions and offer an effective tool for identifying species conservation requirements and forecasting how global environmental changes may affect species distribution. How species respond to climate change is a key point for understanding their vulnerability and designing effective conservation strategies in the future. Thus, here, we assessed the impacts of climate change on the past and future distributions of two phylogenetically related species, Myotis ruber and Myotis keaysi. The results showed that the species are influenced by changes in temperature, and for M. ruber, precipitation also becomes important. Furthermore, M. ruber appears to have been more flexible to decreases in temperature that occurred in the past, which allowed it to expand its areas of environmental suitability, unlike M. keaysi, which decreased and concentrated these areas. However, despite a drastic decrease in the spatial area of environmental suitability of these species in the future, there are areas of potential climate stability that have been maintained since the Pleistocene, indicating where conservation efforts need to be concentrated in the future.
... After the mean values were calculated, they were compared using the K Index (KI) in order to determine interobserver reproducibility. The Landis and Koch [56] method was then used to interpret the K values (0 to 0.2 = slight agreement; 0.21 to 0.40 = fair agreement; 0.41 to 0.60 = moderate agreement; 0.61 to 0.80 = strong or substantial agreement; 0.81 to 0.99 = very strong or almost perfect agreement; 1.0 = perfect agreement). A very satisfactory KI value (0.89) was obtained in this study. ...
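The interpretation bands quoted in the excerpt above can be written as a small lookup. The following is a minimal sketch, not code from any of the cited studies: the function name `landis_koch_label` is hypothetical, the "poor" label for negative values follows the original Landis and Koch scale rather than this excerpt, and values falling between printed band edges (for example 0.205) are assigned to the lower band as an assumption.

```python
def landis_koch_label(kappa: float) -> str:
    """Map a kappa value to the verbal label quoted in the excerpt.

    Assumptions: negative values are labelled "poor", and each band is
    closed at its upper edge (0.20, 0.40, 0.60, 0.80).
    """
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [-1, 1]")
    if kappa < 0:
        return "poor"
    if kappa == 1.0:
        return "perfect"
    # (upper bound, label) pairs, checked in ascending order
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


print(landis_koch_label(0.89))  # the value reported in the excerpt
```

With these bands, the value of 0.89 reported in the excerpt falls in the "almost perfect" range.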
Article
Full-text available
This study aimed to investigate, for the first time, the potential role of the gigantocellular nucleus, a component of the reticular formation, in the pathogenetic mechanism of Sudden Infant Death Syndrome (SIDS), an event frequently ascribed to failure to arouse from sleep. This research was motivated by previous experimental studies demonstrating the gigantocellular nucleus involvement in regulating the sleep-wake cycle. We analyzed the brains of 48 infants who died suddenly within the first 7 months of life, including 28 SIDS cases and 20 controls. All brains underwent a thorough histological and immunohistochemical examination, focusing specifically on the gigantocellular nucleus. This examination aimed to characterize its developmental cytoarchitecture and tyrosine hydroxylase expression, with particular attention to potential associations with SIDS risk factors. In 68% of SIDS cases, but never in controls, we observed hypoplasia of the pontine portion of the gigantocellular nucleus. Alterations in the catecholaminergic system were present in 61% of SIDS cases but only in 10% of controls. A strong correlation was observed between these findings and maternal smoking in SIDS cases when compared with controls. In conclusion we believe that this study sheds new light on the pathogenetic processes underlying SIDS, particularly in cases associated with maternal smoking during pregnancy.
... to 0.80 for excellent accuracy, and 0.81 to 1 for virtually perfect accuracy (Landis and Koch 1977). ...
Article
Full-text available
Vegetation fires have major impacts on the ecosystem and present a significant threat to human life. In this study, vegetation fires consist of forest fires, cropland fires, and other vegetation fires. Currently, there is a limited amount of research on the long-term prediction of vegetation fires in Pakistan. The exact effect of every factor on the frequency of vegetation fires remains unclear when using standard analysis. This research utilized the high proficiency of machine learning algorithms to combine data from several sources, including the MODIS Global Fire Atlas dataset, topographic, climatic conditions, and different vegetation types acquired between 2001 and 2022. We tested many algorithms and ultimately chose four models for formal data processing. Their selection was based on their performance metrics, such as accuracy, computational efficiency, and preliminary test results. The four models (logistic regression, random forest, support vector machine, and eXtreme Gradient Boosting) were used to identify and select the nine key factors of forest and cropland fires and, in the case of other vegetation, seven key factors that cause a fire in Pakistan. The findings indicated that the vegetation fire prediction models achieved prediction accuracies ranging from 78.7 to 87.5% for forest fires, 70.4 to 84.0% for cropland fires, and 66.6 to 83.1% for other vegetation. Additionally, the area under the curve (AUC) values ranged from 83.6 to 93.4% in forest fires, 72.6 to 90.6% in cropland fires, and 74.2 to 90.7% in other vegetation. The random forest model had the highest accuracy rate of 87.5% in forest fires, 84.0% in cropland fires, and 83.1% in other vegetation and also the highest AUC value of 93.4% in forest fires, 90.6% in cropland fires, and 90.7% in other vegetation, making it the best-performing model.
The models provided predictive insights into specific conditions and regional susceptibilities to fire occurrences, adding significant value beyond the initial MODIS detection data. The maps generated to analyze Pakistan’s vegetation fire risk showed the geographical distribution of areas with high, moderate, and low vegetation fire risks, highlighting predictive risk assessments rather than historical fire detections.
... Cronbach's alpha coefficient was used to assess the internal consistency of the scales [33], with higher values indicating better consistency. Specifically, a Cronbach's alpha coefficient above 0.800 indicates good internal reliability [74]. ...
Article
Full-text available
Rural landscapes are acknowledged for their potential to restore human health due to natural characteristics. However, modern rural development has degraded these environments, thereby diminishing the restorative potential of rural landscapes. Few studies have systematically analyzed the impact of naturalness, landscape types, and landscape elements on restorativeness using both subjective and objective measurements. This study investigated the restorative effects of various rural landscapes in Guangzhou, employing electroencephalography and eye-tracking technologies to record physiological responses and using the Restorative Components Scale and the Perceived Restorativeness and Naturalness Scale to evaluate psychological responses. The results indicated the following: (1) There was a significant positive correlation between perceived naturalness and restorativeness, surpassing the impact of actual naturalness. (2) Different landscape types had varying impacts on restorativeness at the same level of perceived naturalness. Natural forest landscapes, artificial forest landscapes, and settlement landscapes exhibited the most substantial restorative effects among the natural, semi-natural, and artificial landscapes examined, respectively. (3) Restorative properties varied across landscape elements: trees and water significantly enhanced restorativeness, whereas constructed elements reduced it. Findings from this study can provide support for policymakers to make informed decisions regarding the selection and arrangement of rural landscape types and elements to enhance mental health and well-being.
... The datasets for RDTs, microscopy, and qPCR results were merged based on the participant's ID using the Pandas Python package [58]. Generalized linear models (GLM) were used to evaluate the association between Python using the stats model package [59]. Additionally, to evaluate the performance of RDTs and microscopy in fine-scale malaria stratifications compared to qPCR, their agreement was tested using Kappa statistic [60], and the resulting Kappa values [61]. In addition, the positive predictive value (PPV) for RDTs and microscopy was computed, using qPCR results as the reference, per village, as (proportion of positive ...
Preprint
Full-text available
Introduction: Malaria-endemic countries are increasingly adopting data-driven risk stratification, often at district or higher regional levels, to guide their intervention strategies. The data typically comes from population-level surveys collected by rapid diagnostic tests (RDTs), which unfortunately perform poorly in low transmission settings. Here, we conducted a high-resolution survey of Plasmodium falciparum prevalence rate (PfPR) in two Tanzanian districts and compared the fine-scale strata obtained using data from RDTs, microscopy and quantitative polymerase chain reaction (qPCR) assays. Methods: A cross-sectional survey was conducted in 35 villages in Ulanga and Kilombero districts, south-eastern Tanzania between 2022 and 2023. We screened 7,628 individuals using RDTs (SD-BIOLINE) and microscopy, with two thirds of the samples further analyzed by qPCR. The data was used to categorize each district and village as having very low (PfPR<1%), low (1%≤PfPR<5%), moderate (5%≤PfPR<30%), or high (PfPR≥30%) parasite prevalence. A generalized linear model was used to analyse infection risk factors. Other metrics, including positive predictive value (PPV), sensitivity, specificity, parasite densities, and Kappa statistics were computed for RDTs or microscopy using qPCR as reference. Results: Significant fine-scale variations in malaria risk were observed within and between districts, with village prevalence ranging from 0% to >50%. Prevalence varied by testing method: Kilombero was low risk by RDTs (PfPR=3%) and microscopy (PfPR=2%) but moderate by qPCR (PfPR=9%); Ulanga was high risk by RDTs (PfPR=39%) and qPCR (PfPR=54%) but moderate by microscopy (PfPR=26%). RDTs and microscopy classified majority of the 35 villages as very low to low risk (18 - 21 villages). In contrast, qPCR classified most villages as moderate to high risk (29 villages). 
Using qPCR as the reference, PPV for RDTs and microscopy ranged from <20% in very low transmission villages to >80% in moderate to high transmission villages. Sensitivity was 62% for RDTs and 41% for microscopy; specificity was 93% and 96%, respectively. Kappa values were 0.58 for RDTs and 0.42 for microscopy. School-age children (5-15years) had higher malaria prevalence and parasite densities than adults (P<0.001). High-prevalence villages also had higher parasite densities (Spearman r=0.77, P<0.001 for qPCR; r=0.55, P=0.003 for microscopy). Conclusion: This study highlights significant fine-scale variability in malaria risk within and between districts and emphasizes the variable performance of the testing methods when stratifying risk. While RDTs and microscopy were effective in high-transmission areas, they performed poorly in low-transmission settings; and classified most villages as very low or low risk. In contrast, qPCR classified most villages as moderate or high risk. While we cannot conclude on which public health decisions would be subject to change because of these differences, the findings suggest the need for improved testing approaches that are operationally feasible and sufficiently sensitive, to enable precise mapping and effective targeting of malaria in such local contexts. Moreover, public health authorities should recognize the strengths and limitations of their available data when planning local stratification or making decisions.
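The accuracy metrics named in this abstract (sensitivity, specificity, and PPV against a reference test) all derive from a 2x2 table of test results versus reference results. The sketch below shows the standard definitions only; the function name `diagnostic_metrics` and the counts in the usage line are hypothetical and are not data from the study.

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard accuracy metrics for an index test against a reference.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives and true negatives (reference = e.g. qPCR).
    """
    if min(tp, fp, fn, tn) < 0:
        raise ValueError("counts must be non-negative")
    if tp + fn == 0 or tn + fp == 0 or tp + fp == 0:
        raise ValueError("a required denominator is zero")
    return {
        # share of reference-positive samples the index test detects
        "sensitivity": tp / (tp + fn),
        # share of reference-negative samples the index test clears
        "specificity": tn / (tn + fp),
        # share of index-test positives confirmed by the reference
        "ppv": tp / (tp + fp),
    }


# Hypothetical counts, for illustration only
metrics = diagnostic_metrics(tp=40, fp=10, fn=20, tn=130)
print({name: round(value, 3) for name, value in metrics.items()})
```

Because PPV depends on how many true positives exist in the sample, it tends to fall as prevalence falls even when sensitivity and specificity stay fixed, which is consistent with the low PPV the abstract reports for very low transmission villages.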
... For the statistical evaluation of exercise risk, the number of times an exercise was rated as a risk to a patient was counted. In addition, the agreement between physiotherapists in the assessment of exercise risk per patient example was assessed using Fleiss-Kappa (κπ) for more than three raters or Cohens-Kappa (κd) for two raters, and percentage agreement [36]. If there was a large deviation between the percentage agreement and the calculated kappa coefficient, the Brennan & Prediger's (κq) agreement coefficient was used [37]. ...
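A large gap between percentage agreement and kappa, as mentioned in the excerpt above, typically appears when one category dominates. The following sketch for two raters uses the textbook definitions of Cohen's kappa and the Brennan-Prediger coefficient; the function name `two_rater_agreement` and the ratings are hypothetical, and this is not the analysis code of the cited study.

```python
from collections import Counter


def two_rater_agreement(r1: list, r2: list, n_categories: int) -> dict:
    """Percent agreement, Cohen's kappa and Brennan-Prediger for two raters."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(r1)
    # observed agreement: share of items rated identically
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    # Cohen: chance agreement from each rater's own marginal proportions
    c1, c2 = Counter(r1), Counter(r2)
    p_e_cohen = sum((c1[k] / n) * (c2[k] / n) for k in set(c1) | set(c2))
    # Brennan-Prediger: chance agreement fixed at 1 / number of categories
    p_e_bp = 1 / n_categories
    if p_e_cohen == 1:
        raise ValueError("Cohen's kappa is undefined when chance agreement is 1")
    return {
        "percent_agreement": p_o,
        "cohen_kappa": (p_o - p_e_cohen) / (1 - p_e_cohen),
        "brennan_prediger": (p_o - p_e_bp) / (1 - p_e_bp),
    }


# Hypothetical ratings: "no" risk dominates, raters disagree on two items
rater_a = ["no"] * 9 + ["risk"]
rater_b = ["no"] * 8 + ["risk", "no"]
print(two_rater_agreement(rater_a, rater_b, n_categories=2))
```

In this made-up example the raters agree on 80% of items, yet Cohen's kappa is slightly negative because the skewed marginals put chance agreement at 0.82, while the Brennan-Prediger coefficient, which assumes a uniform chance level, is 0.6.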
Article
Full-text available
Musculoskeletal disorders (MSDs) impact people globally, cause occupational illness and reduce productivity. Exercise therapy is the gold standard treatment for MSDs and can be provided by physiotherapists and/or also via mobile apps. Apart from the obvious differences between physiotherapists and mobile apps regarding communication, empathy and physical touch, mobile apps potentially offer less personalized exercises. The use of artificial intelligence (AI) may overcome this issue by processing different pain parameters, comorbidities and patient-specific lifestyle factors and thereby enabling individually adapted exercise therapy. The aim of this study is to investigate the risks of AI-recommended strength, mobility and release exercises for people with MSDs, using physiotherapist risk assessment and retrospective consideration of patient feedback on risk and non-risk exercises. 80 patients with various MSDs received exercise recommendations from the AI-system. Physiotherapists rated exercises as risk or non-risk, based on patient information, e.g. pain intensity (NRS), pain quality, pain location, work type. The analysis of physiotherapists’ agreement was based on the frequencies of mentioned risk, the percentage distribution and the Fleiss- or Cohens-Kappa. After completion of the exercises, the patients provided feedback for each exercise on an 11-point Likert scale; for example, the feedback question for release exercises was “How did the stretch feel to you?” with the answer options ranging from “painful (0 points)” to “not noticeable (10 points)”. The statistical analysis was carried out separately for the three types of exercises. For this, an independent t-test was performed. 20 physiotherapists assessed 80 patient examples, receiving a total of 944 exercises. In a three-way agreement of the physiotherapists, 0.08% of the exercises were judged as having a potential risk of increasing patients' pain.
The evaluation showed 90.5% agreement, that exercises had no risk. Exercises that were considered by physiotherapists to be potentially risky for patients also received lower feedback ratings from patients. For the ‘release’ exercise type, risk exercises received lower feedback, indicating that the patient felt more pain (risk: 4.65 (1.88), non-risk: 5.56 (1.88)). The study shows that AI can recommend almost risk-free exercises for patients with MSDs, which is an effective way to create individualized exercise plans without putting patients at risk for higher pain intensity or discomfort. In addition, the study shows significant agreement between physiotherapists in the risk assessment of AI-recommended exercises and highlights the importance of considering individual patient perspectives for treatment planning. The extent to which other aspects of face-to-face physiotherapy, such as communication and education, provide additional benefits beyond the individualization of exercises compared to AI and app-based exercises should be further investigated. Trial registration: 30.12.2021 via OSF Registries, https://doi.org/10.17605/OSF.IO/YCNJQ.
... Test-retest reliability was determined using the intraclass correlation coefficient (ICC) for continuous variables and kappa coefficients for categorical variables. Interpreting the results, values from 0.01 to 0.2 signify slight agreement, 0.21 to 0.40 denote fair agreement, 0.41 to 0.60 express moderate agreement, 0.61 to 0.80 communicate substantial agreement, and 0.81 to 1.0 indicate almost perfect agreement [28,29]. The coefficient of internal consistency for multi-item scales was assessed using Cronbach's alpha [30]. ...
Article
Full-text available
Background Physical activity is essential for physical, mental, and cognitive health. Providing evidence to develop better public health policies to encourage increased physical activity is crucial. Therefore, we developed an in-depth survey as part of the Korea Youth Risk Behavior Survey to assess the current status and determinants of physical activity among Korean adolescents. Methods We developed an initial version of the questionnaire based on a review of validated questionnaires, recent trends and emerging issues related to adolescent physical activity, and the national public health agenda pertaining to health promotion. Content validity was confirmed by a panel of 10 experts. Face validity was confirmed through focus group interviews with 12 first-year middle school students. The test-retest reliability of the questionnaire was evaluated by administering it twice, approximately two weeks apart, to a sample of 360 middle and high school students. Additionally, the frequency or average number of responses was analyzed in a sample of 600 students who participated in the initial test-retest reliability evaluation of the questionnaire developed in this study. Results Through item pool generation and content and face validity test, the final 15 questionnaire items were developed across five themes: levels of physical activity, school sports club activities, transportation-related physical activity, physical activity-promoting environments, and factors mediating physical activity. The test-retest reliability ranged from fair to substantial. Results from the newly developed survey reveal that only a minority of adolescents engage in sufficient physical activity, with only 17.2% and 21.5% participating in vigorous and moderate-intensity activities, respectively, for at least five days per week. Among school-based activities, 44.3% of students do not participate in school sports clubs due to reasons including absence of clubs and disinterest in exercise. 
The major motivators for physical activity are personal enjoyment and health benefits, whereas preferences for other leisure activities and academic pressures are the predominant barriers. Conclusions This study developed valid and reliable in-depth survey items to assess physical activity among Korean youths. It will hopefully enhance our understanding of adolescent physical activity, offering essential preliminary evidence to inform the development of public health strategies aimed at promoting adolescent health.
... Any disagreements between reviewers were adjudicated by a third reviewer (PDR or LLD). The level of agreement between reviewers at full-text review was described using the kappa statistic [21]. ...
Article
Full-text available
Purpose This study describes chemotherapy-induced nausea and vomiting (CINV) control rates in pediatric and adult patients who did or did not receive guideline-consistent CINV prophylaxis. Methods We conducted a systematic literature review of studies published in 2000 or later that evaluated CINV control in patients receiving guideline-consistent vs. guideline-inconsistent CINV prophylaxis and reported at least one CINV-related patient outcome. Studies were excluded if the guideline evaluated was not publicly available or not developed by a professional organization. Over-prophylaxis was defined as antiemetic use recommended for a higher level of chemotherapy emetogenicity than a patient was receiving. Results We identified 7060 citations and retrieved 141 publications for full-text evaluation. Of these, 21 publications (14 prospective and seven retrospective studies) evaluating guidelines developed by six organizations were included. The terms used to describe CINV endpoints and definition of guideline-consistent CINV prophylaxis varied among studies. Included studies either did not address over-prophylaxis in their definition of guideline-consistent CINV prophylaxis (48%; 10/21) or defined it as guideline-inconsistent (38%; 8/21) or guideline-consistent (3/21; 14%). Eleven included studies (52%; 11/21) reported a clinically meaningful improvement in at least one CINV endpoint in patients receiving guideline-consistent CINV prophylaxis. Ten reported a statistically significant improvement. Conclusions This evidence supports the use of guideline-consistent prophylaxis to optimize CINV control. Institutions caring for patients with cancer should systematically adapt CINV CPGs for local implementation and routinely evaluate CINV outcomes.
... Agreement between dichotomised reported and observed weakness was tested using weighted Cohen's Kappa. The strength of agreement was interpreted using conventional divisions [19]. ...
Article
Full-text available
Purpose To establish the prevalence and agreement between reported and observed leg weakness in people with sciatica. To establish which factors mediate any identified difference between reported and observed leg weakness in people with sciatica. Methods Records of 68 people with a clinical diagnosis of sciatica from the spinal service of a secondary care NHS hospital in England, UK, were reviewed. Primary outcome measures were the sciatica bothersome index for reported leg weakness and the Medical Research Council scale for observed weakness. Agreement was established with Cohen’s Kappa and intraclass correlation coefficient. Potential factors that may mediate a difference between reported and observed weakness included leg pain, sciatica bothersome index sensory subscale, age, hospital anxiety and depression subscale for anxiety. Results 85% of patients reported weakness but only 34% had observed weakness. Cohen’s Kappa (0.066, 95% CI −0.53, 0.186; p = 0.317) and ICC (0.213, 95% CI −0.26, 0.428; p = 0.040) both showed poor agreement between reported and observed weakness. The difference between reported and observed measures of weakness was mediated by the severity of leg pain (b = 0.281, p = 0.024) and age (b = 0.253, p = 0.042). Conclusion There is a high prevalence of reported leg weakness in people with sciatica, which is not reflected in observed clinical measures of weakness. Differences between reported and observed weakness may be driven by the severity of leg pain and age. Further work needs to establish whether other objective measures can detect patient reported weakness.
... Approximately 60% of the think-aloud-tasks were double coded by two coders. After discussing discrepancies and coding again, Cohen's kappa rose to a good level at κ = 0.77 (Landis and Koch 1977; Altmann, 1990). ...
Article
Full-text available
Tree-thinking is a fundamental skill set for understanding evolutionary theory and, thus, part of biological and scientific literacy. Research on this topic is mostly directed towards tree-reading—the umbrella-term for all skills enabling a person to gather and infer information from a given tree. Tree-building or phylogenetic inference as the second complementary sub-skill-set, encompassing all skills which enable a person to build a phylogenetic tree from given data, is not understood as well. To understand this topic we conducted think-aloud-tasks with tree-building experts and conducted supplementary guided interviews with them. We used school-like character tables, as they are common in high schools for the experts to build trees and audio-recorded their speech while building the trees. Analyzing the transcripts of the tasks we could find a basic methodology for building trees and define a set of backbone-skills of tree-building. Those are based on an iterative cycle going through phases of organizing information, searching and setting taxa/characters, organizing and checking oneself. All experts used simple guidelines, either deploying maximum parsimony to arrive at a solution or relying heavily on their previous knowledge. From that, we were able to utilize our result to formulate a guideline and helpful suggestions especially for beginners and novices in the field of tree-building to develop a better understanding of this topic.
... Both the first and second authors were involved in developing the coding scheme for the design process practices, and they implemented the Making-Process-Rug methodology in practice in our previous studies and, therefore, had a good understanding of it. Inter-rater reliability (IRR) was calculated utilizing the standard error Cohen's kappa (Cohen, 1960), which was 0.865 with a low standard error of 0.063, indicating almost perfect agreement (Landis & Koch, 1977). ...
Article
Full-text available
This study analyzed collaborative invention projects by teams of lower-secondary (13–14-year-old) Finnish students. In invention projects, student teams design and make materially embodied collaborative inventions using traditional and digital fabrication technologies. This investigation focused on the student teams’ knowledge creation processes by examining how they applied maker practices (i.e., design process, computer engineering, product design, and science practices) in their co-invention projects and the effects of teacher and peer support. In our investigations, we relied on video data and on-site observations, utilizing and further developing visual data analysis methods. Our findings assist in expanding the scope of computer-supported collaborative learning (CSCL) research toward sociomaterially mediated knowledge creation, revealing the open-ended, nonlinear, and self-organized flow of the co-invention projects that take place around digital devices. Our findings demonstrate the practice-based, knowledge-creating nature of these processes, where computer engineering, product design, and science are deeply entangled with design practices. Furthermore, embodied design practices of sketching, practical experimenting, and working with concrete materials were found to be of the essence to inspire and deepen knowledge creation and advancement of epistemic objects. Our findings also reveal how teachers and peer tutor students can support knowledge creation through co-invention.
... Cohen's Kappa resulted in κ = 0.68, which can be considered as substantial or good agreement (κ = 0.60-0.79) according to Landis and Koch [46] and Altman [47]. ...
Article
Full-text available
This qualitative study aims to analyse the personal qualification, attitudes and the pedagogical concepts of German teachers as experts in their profession regarding basic life support (BLS) education in secondary schools. Thirteen (n = 13) secondary school teachers participated in semi-structured expert interviews lasting 20 to 60 min regarding BLS student education. Interviews followed guiding questions addressing (1) personal experience, (2) teacher qualification for BLS and (3) implementation factors (e.g., personal, material and organisational). Audio-recorded interviews were analysed by content analysis, generating a coding system. School teachers provided a heterogeneous view on implementation-related processes in BLS education. Many teachers were educated in first aid and acknowledged its importance, but had no experience in teaching BLS. They want to be assured of their competence in teaching BLS and need tailored training, materials, pedagogical information and incorporation of BLS into the curriculum. The management of time constraints, unwilling colleagues, and young students being overwhelmed were also commonly mentioned considerations. In conclusion, teachers reported being willing to teach BLS, but a stepwise implementation framework incorporating practice-oriented qualification and educational goals is missing.
... In the formulas, u is the standard normal quantile, and Se(K) is the standard deviation of K. Landis and Koch divided the size of the Kappa coefficient into six bands, each representing the degree of consistency strength. When K < 0, the consistency is extremely poor; 0.0 to 0.2, the consistency is very weak; 0.21 to 0.40, the consistency is weak; 0.41 to 0.60, medium consistency; 0.61 to 0.80, high degree of consistency; 0.81 to 1.00, almost perfect consistency [24]. ...
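The six bands quoted in this excerpt can be written as a small lookup. The sketch below is illustrative only: the function name `landis_koch_band` is hypothetical, the labels are the English terms used by Landis and Koch (poor, slight, fair, moderate, substantial, almost perfect), and the handling of values that fall between printed boundaries (for example between 0.20 and 0.21) is an assumption, since the published bands are stated to two decimals.

```python
def landis_koch_band(kappa: float) -> str:
    """Return the Landis and Koch (1977) descriptive label for a kappa value.

    Assumption: each band includes its upper bound, so a value such as
    0.205 is reported in the band above 0.20.
    """
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [-1, 1]")
    if kappa < 0.0:
        return "poor"
    # Upper bounds of the five non-negative bands, in ascending order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


if __name__ == "__main__":
    for value in (-0.10, 0.31, 0.57, 0.68, 0.865):
        print(value, landis_koch_band(value))
```

The labels are descriptive conventions rather than statistical thresholds, so a lookup like this is best reported alongside the kappa value and its standard error.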
Preprint
Based on the characteristics of computer servers, this study focuses on the reproducibility of waste computer servers from the aspect of performance. First, Grey Relation Analysis is introduced to reflect the performance stability of computer servers after remanufacturing. Subsequently, the performance indicators of the remanufactured computer servers are evaluated through the steps of homogenization, dimensionless quantization, and weight calculation. After testing the performance of the computer servers, the remanufacturing performance index of each computer server is calculated. Finally, the product's remanufacturability score is compared with that of the relevant professionals. The results show that the evaluation method proposed in this study can effectively assess the remanufacturing performance of used computer servers and provide new methods for the reuse of used computer servers.
... The inter-rater agreement between two evaluators is measured by Cohen's Kappa (Cohen 1960). Following the established categorization scheme proposed by Landis and Koch (1977), the values of the kappa coefficient are categorized into specific ranges, denoting the degree of agreement. Specifically, kappa scores falling within the intervals of 0.21 to 0.40, 0.41 to 0.60, and 0.61 to 0.80 are interpreted as indicative of fair, moderate, and substantial agreement, respectively. ...
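For readers unfamiliar with the statistic named in this excerpt, the following is a minimal sketch of Cohen's kappa for two raters, computed as (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e the agreement expected by chance from the raters' marginal frequencies. The function name `cohen_kappa` and the example labels are hypothetical and are not taken from any of the cited studies.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal proportions,
    # summed over categories.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    a = ["yes", "yes", "no", "no"]
    b = ["yes", "no", "no", "no"]
    # p_o = 3/4 and p_e = (2*1 + 2*3) / 16 = 1/2, so kappa = 0.5
    print(cohen_kappa(a, b))
```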
Article
In agile requirements engineering, Generating Acceptance Criteria (GAC) to elaborate user stories plays a pivotal role in the sprint planning phase, which provides a reference for delivering functional solutions. GAC requires extensive collaboration and human involvement. However, the lack of labeled datasets tailored for User Story attached with Acceptance Criteria (US-AC) poses significant challenges for supervised learning techniques attempting to automate this process. Recent advancements in Large Language Models (LLMs) have showcased their remarkable text-generation capabilities, bypassing the need for supervised fine-tuning. Consequently, LLMs offer the potential to overcome the above challenge. Motivated by this, we propose SimAC, a framework leveraging LLMs to simulate agile collaboration, with three distinct role groups: requirement analyst, quality analyst, and others. Initiated by role-based prompts, LLMs act in these roles sequentially, following a create-update-update paradigm in GAC. Owing to the unavailability of ground truths, we invited practitioners to build a gold standard serving as a benchmark to evaluate the completeness and validity of auto-generated US-AC against human-crafted ones. Additionally, we invited eight experienced agile practitioners to evaluate the quality of US-AC using the INVEST framework. The results demonstrate consistent improvements across all tested LLMs, including the LLaMA and GPT-3.5 series. Notably, SimAC significantly enhances the ability of gpt-3.5-turbo in GAC, achieving improvements of 29.48% in completeness and 15.56% in validity, along with the highest INVEST satisfaction score of 3.21/4. Furthermore, this study also provides case studies to illustrate SimAC’s effectiveness and limitations, shedding light on the potential of LLMs in automated agile requirements engineering.
... All the authors evaluated bone union and evaluated it again 1 month after the initial evaluation. Interobserver and intraobserver reliability was evaluated using intraclass correlation coefficient (ICC) [15,16]. In cases of disagreement between authors, consensus was reached through discussion. ...
Article
Teriparatide is an anabolic drug sometimes administered to patients who have atypical femoral fracture (AFF). However, whether teriparatide has beneficial effects on bone healing remains uncertain. The present study aimed to analyze the association between teriparatide and bone healing in complete AFF. A total of 59 consecutive cases (58 patients) who underwent intramedullary nailing for complete AFF were categorized based on postoperative use of teriparatide into the non-teriparatide (non-TPTD, n = 34) and teriparatide groups (TPTD, n = 25). Time-to-bone union was evaluated and compared between the two groups. Additionally, multiple regression analysis was performed to evaluate factors affecting time-to-bone union. All participants were women, with a mean age of 77.6 years (range: 62–92). No significant difference in time-to-bone union was found between the non-TPTD and TPTD groups (5.5 months vs. 5.8 months, p = 0.359). Two patients in the non-TPTD group underwent reoperation (p = 0.503) due to failure caused by inadequate fixation, and both achieved bone healing after additional fixation with blocking screws. Multiple regression analysis revealed that the anterior gap of the fracture site postoperatively was a factor affecting time-to-bone union (p = 0.014). The beneficial effect of teriparatide on bone healing in complete AFF could not be confirmed. Additional randomized controlled trials are required. Nonetheless, appropriate techniques, including efforts to reduce the gap on the tensile side during the surgery, are important for reliable bone healing.
... indicates a high level of consistency [24]. All analyses were performed in Stata (College Station, TX) version 15 software. ...
Article
Objective As a large and populous country, China releases a high number of diagnostic criteria. However, the published diagnostic criteria have not yet been systematically analyzed. Therefore, the aim of this study is to investigate the characteristics, development methods, reporting quality, and evidence basis of diagnostic criteria published in China. Methods We searched five databases for diagnostic criteria from their inception until July 31, 2023. All diagnostic criteria were screened through abstract and full‐text reading, and included if satisfying the prespecified criteria. Two researchers independently extracted data on the characteristics, development methods, reporting quality, and evidence basis of diagnostic criteria. Results A total of 143 diagnostic criteria were included. In terms of development methods, the proportions of diagnostic criteria that involved a systematic literature search (n = 2; 1.4%; 95% confidence interval (CI), 0.4% to 5.0%), adoption of formal consensus methods (n = 4; 2.8%; 95% CI, 1.1% to 7.0%), and criteria validation (n = 9; 6.3%; 95% CI, 3.3% to 11.5%) were relatively low. Regarding reporting quality, the average compliance with the ACCORD checklist was 5.1%; none of the diagnostic criteria reported on registration, expert inclusion criteria, expert recruitment process, or consensus results. A majority (58.7%; 95% CI, 50.6% to 66.5%) of criteria did not cite any research, and only one (0.7%; 95% CI, 0.1% to 3.9%) criterion was derived from a systematic review. Moreover, only 16.1% (95% CI, 11.0% to 23.0%) of diagnostic criteria used evidence from the Chinese population. Conclusion The diagnostic criteria developed in China exhibit serious flaws, particularly in evidence retrieval, formation of expert panels, consensus methods, and validation. Additionally, only a few diagnostic criteria used a systematic synthesis of the evidence or evidence from China. There is an urgent need to enhance the methodology for developing diagnostic criteria.
... These categorical crosstable data were also compared using Cohen's kappa analysis. Kappa values were interpreted as follows: almost perfect (0.8-1.0), substantial (0.6-0.8), moderate (0.4-0.6), fair (0.2-0.4), and poor (<0.2) [7]. Second, class assessment was compared using gamma analysis. The closer the gamma index is to 1, the stronger the association. ...
Article
Purpose The AdvanSure™ AlloScreen assay is an advanced multiplex test that allows for simultaneous detection of specific IgE (sIgE) against multiple allergens. For precise identification of causative allergens in allergic patients, we compared this new multiplex sIgE assay with the ImmunoCAP assay, which is currently the gold-standard method for sIgE detection. Materials and Methods Serum samples from 218 Korean allergic disease patients were used to compare the ImmunoCAP and AlloScreen assays with respect to the following 13 allergens: Dermatophagoides pteronyssinus, Dermatophagoides farinae, cat and dog dander, Alternaria, birch, oak, ragweed, mugwort, rye grass, and food allergens (egg white, cow's milk, peanuts). Results A total of 957 paired tests using the 13 allergens were compared. The total agreement ratio ranged from 0.74 (oak) to 0.97 (Alternaria). With respect to class association analyses, the gamma index ranged from 0.819 (rye grass) to 0.990 (Alternaria). The intra-class correlation coefficients for house dust mites, cat and dog dander, Alternaria, birch, ragweed, egg white, cow's milk, and peanut sIgE titers were >0.8. Conclusion The AlloScreen and ImmunoCAP assays exhibited similar diagnostic performance. However, due to methodological differences between the two systems, careful interpretation of their results is needed in clinical applications.
Article
Automated medical image analysis systems often require large amounts of training data with high quality labels, which are difficult and time consuming to generate. This paper introduces Radiology Object in COntext version 2 (ROCOv2), a multimodal dataset consisting of radiological images and associated medical concepts and captions extracted from the PMC Open Access subset. It is an updated version of the ROCO dataset published in 2018, and adds 35,705 new images added to PMC since 2018. It further provides manually curated concepts for imaging modalities with additional anatomical and directional concepts for X-rays. The dataset consists of 79,789 images and has been used, with minor modifications, in the concept detection and caption prediction tasks of ImageCLEFmedical Caption 2023. The dataset is suitable for training image annotation models based on image-caption pairs, or for multi-label image classification using Unified Medical Language System (UMLS) concepts provided with each image. In addition, it can serve for pre-training of medical domain models, and evaluation of deep learning models for multi-task learning.
Article
Accurately monitoring one’s learning processes during self-regulated learning depends on using the right cues, one of which could be perceived mental effort. A meta-analysis by Baars et al. (2020) found a negative association between mental effort and monitoring judgments (r = -.35), suggesting that the amount of mental effort experienced during a learning task is usually negatively correlated with learners’ perception of learning. However, it is unclear how monitoring judgments and perceptions of mental effort relate to learning outcomes. To examine if perceived mental effort is a diagnostic cue for learning outcomes, and whether monitoring judgments mediate this relationship, we employed a meta-analytic structural equation model. Results indicated a negative, moderate association between perceived mental effort and monitoring judgments (β = -.19), a positive, large association between monitoring judgments and learning outcomes (β = .29), and a negative, moderate indirect association between perceived mental effort and learning outcomes (β = -.05), which was mediated by monitoring judgments. Our subgroup analysis did not reveal any significant differences across moderators potentially due to the limited number of studies included per moderator category. Findings suggest that when learners perceive higher levels of mental effort, they exhibit lower learning (confidence) judgments, which relates to lower actual learning outcomes. Thus, learners seem to use perceived mental effort as a cue to judge their learning while perceived mental effort only indirectly relates to actual learning outcomes.
Preprint
Introduction: Malaria-endemic countries are increasingly adopting data-driven risk stratification, often at district or higher regional levels, to guide their intervention strategies. The data typically comes from population-level surveys collected by rapid diagnostic tests (RDTs), which unfortunately perform poorly in low transmission settings. Here, we conducted a high-resolution survey of Plasmodium falciparum prevalence rate (PfPR) in two Tanzanian districts and compared the fine-scale strata obtained using data from RDTs, microscopy and quantitative polymerase chain reaction (qPCR) assays. Methods: A cross-sectional survey was conducted in 35 villages in Ulanga and Kilombero districts, south-eastern Tanzania between 2022 and 2023. We screened 7,628 individuals using RDTs (SD-BIOLINE) and microscopy, with two thirds of the samples further analyzed by qPCR. The data was used to categorize each district and village as having very low (PfPR<1%), low (1%≤PfPR<5%), moderate (5%≤PfPR<30%), or high (PfPR≥30%) parasite prevalence. A generalized linear model was used to analyse infection risk factors. Other metrics, including positive predictive value (PPV), sensitivity, specificity, parasite densities, and Kappa statistics were computed for RDTs or microscopy using qPCR as reference. Results: Significant fine-scale variations in malaria risk were observed within and between districts, with village prevalence ranging from 0% to >50%. Prevalence varied by testing method: Kilombero was low risk by RDTs (PfPR=3%) and microscopy (PfPR=2%) but moderate by qPCR (PfPR=9%); Ulanga was high risk by RDTs (PfPR=39%) and qPCR (PfPR=54%) but moderate by microscopy (PfPR=26%). RDTs and microscopy classified the majority of the 35 villages as very low to low risk (18 - 21 villages). In contrast, qPCR classified most villages as moderate to high risk (29 villages). Using qPCR as the reference, PPV for RDTs and microscopy ranged from <20% in very low transmission villages to >80% in moderate to high transmission villages. Sensitivity was 62% for RDTs and 41% for microscopy; specificity was 93% and 96%, respectively. Kappa values were 0.58 for RDTs and 0.42 for microscopy. School-age children (5-15 years) had higher malaria prevalence and parasite densities than adults (P<0.001). High-prevalence villages also had higher parasite densities (Spearman r=0.77, P<0.001 for qPCR; r=0.55, P=0.003 for microscopy). Conclusion: This study highlights significant fine-scale variability in malaria risk within and between districts and emphasizes the variable performance of the testing methods when stratifying risk. While RDTs and microscopy were effective in high-transmission areas, they performed poorly in low-transmission settings and classified most villages as very low or low risk. In contrast, qPCR classified most villages as moderate or high risk. While we cannot conclude on which public health decisions would be subject to change because of these differences, the findings suggest the need for improved testing approaches that are operationally feasible and sufficiently sensitive, to enable precise mapping and effective targeting of malaria in such local contexts. Moreover, public health authorities should recognize the strengths and limitations of their available data when planning local stratification or making decisions.
Article
The objective of this study was to examine the effect of playing position (defender, midfielder, forward), match location (home vs. away) and match result (win, loss, draw) on technical actions, using Principal Component Analysis (PCA). Data were recorded from 27 official matches played by a professional team. The variables used were (i) offensive: passes (PAS), dribbles (REG) and shots (TIR), and (ii) defensive: interceptions (INT), aerial duels (DAE) and clearances (DES). The results showed significant differences in all technical demands according to playing position. Defenders performed the fewest REG and TIR and the most DAE and DES. Midfielders recorded the most PAS and the fewest DAE. Forwards executed the fewest PAS, INT and DES and recorded the most REG and TIR. When the influence of match location was analyzed, significant differences were found only in INT, with higher values in away matches. The team presented higher values in PAS, TIR, INT and DAE when it won. More DES were performed in defeats than in wins or draws. We recommend that coaches design training to emphasize DES and DAE with defenders (DEF), PAS and TIR with midfielders (CEN), and TIR and REG with forwards (DEL). In a highly complex sport such as football, we suggest that coaches treat DAE, INT, PAS and TIR as technical actions to develop in training, given their relationship with and influence on winning competitive matches. Key words: soccer, match analysis, technique, principal component analysis, performance indicators
Article
Introducing new technology that makes bridge inspection efficient, low-cost, and labor-saving is an urgent issue, since the number of bridge inspections is increasing every year. In recent years, we have applied the Convolutional Neural Network (CNN), a machine learning method that has attracted attention for its use in the field of civil engineering. CNNs are considered a highly effective means of supporting bridge inspection. In this study, we developed a CNN-based learning model that serves as a corrosion detector for steel girder bridges. Our learning models were trained using photographs from road bridge inspections conducted by Fukushima Prefecture. The corrosion detector derived from our learning models was validated using test data from photographs taken in a ground survey of an in-service road bridge in Inawashiro, Fukushima.
Article
Aims Diagnostic separation of diandric triploid gestation, i.e., partial mole, from digynic triploid gestation is clinically relevant, as the former may progress to postmolar gestational trophoblastic neoplasia. The aim of the study was to investigate whether abnormal histology combined with ploidy analysis‐based triploidy is sufficient to accurately diagnose partial mole. Methods and Results A genotype–phenotype correlation study was undertaken to reappraise histological parameters among 20 diandric triploid gestations and 22 digynic triploid gestations of comparable patient age, gestational weeks, and clinical presentations. Two villous populations, irregular villous contours, pseudoinclusions, and syncytiotrophoblast knuckles were common in both groups. Villous size ≥2.5 mm, cistern formation, trophoblastic hyperplasia, and syncytiotrophoblast lacunae were significantly more common in the partial hydatidiform mole. Cistern formation had the highest positive predictive value (PPV) (93%) and highest specificity (96%) for diandric triploid gestation, although the sensitivity was 70%. Cistern formation combined with villous size ≥2.5 mm or trophoblast hyperplasia or syncytiotrophoblast lacunae had 100% specificity and PPV, but a marginal sensitivity of 60%–65%. A moderate interobserver agreement (Kappa = 0.57, Gwet's AC1 = 0.59) was achieved among four observers who assigned diagnosis of diandric triploid gestation or digynic triploidy solely based on histology. Conclusions None of the histological parameters are unique to either diandric triploid gestation or digynic triploid gestation. Cistern formation is the most powerful discriminator, with 93% PPV and 70% sensitivity for diandric triploid gestation. While cistern formation combined with either trophoblastic hyperplasia or villous size ≥2.5 mm or syncytiotrophoblast lacunae has 100% PPV and specificity for diandric triploid gestation, the sensitivity is only 60% to 65%. Therefore, in the presence of triploidy, histological assessment is unable to precisely classify 35% to 40% of diandric triploid gestations or partial moles.
Preprint
Q fever (QF) and Rift Valley fever (RVF) are endemic zoonotic diseases in African countries, causing significant health and economic burdens. Accurate prevalence estimates, crucial for disease control, rely on robust diagnostic tests. While enzyme-linked immunosorbent assays (ELISA) are not the gold standard, they offer rapid, cost-effective, and practical alternatives. However, varying results from different tests and laboratories can complicate comparing epidemiological studies. This study aimed to assess the agreement of test results for QF and RVF in humans and livestock across different laboratory conditions and, for humans, different types of diagnostic tests. We measured inter-laboratory agreement using concordance, Cohen's kappa, and prevalence and bias-adjusted kappa (PABAK) on 91 human and 102 livestock samples collected from rural regions in Chad. The samples were tested using ELISA in Chad, and indirect immunofluorescence assay (IFA) (for human QF and RVF) and ELISA (for livestock QF and RVF) in Switzerland and Germany. Additionally, we examined demographic factors influencing test agreement, including district, setting (village vs. camp), sex, age, and livestock species of the sampled individuals. The inter-laboratory agreement ranged from fair to moderate. For humans, QF concordance was 62.5%, Cohen's kappa was 0.31, RVF concordance was 81.1%, and Cohen's kappa was 0.52. For livestock, QF concordance was 92.3%, Cohen's kappa was 0.59, RVF concordance was 94.0%, and Cohen's kappa was 0.59. Multivariable analysis revealed that QF test agreement is significantly higher in younger humans and people living in villages compared to camps and tends to be higher in livestock from Danamadji compared to Yao, and in small ruminants compared to cattle. Additionally, RVF agreement was found to be higher in younger humans. 
Our findings emphasize the need to consider sample conditions, test performance, and influencing factors when conducting and interpreting epidemiological seroprevalence studies.
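The prevalence and bias-adjusted kappa (PABAK) mentioned in this abstract depends only on the observed agreement p_o and the number of categories k: PABAK = (k * p_o - 1) / (k - 1), which reduces to 2 * p_o - 1 for a positive/negative test. The sketch below is a minimal illustration; the function name `pabak` is hypothetical, and the printed numbers are simply the formula applied to the concordance percentages quoted above under the assumption of a binary outcome. They are not PABAK values reported by the study.

```python
def pabak(observed_agreement: float, n_categories: int = 2) -> float:
    """Prevalence and bias-adjusted kappa from the observed agreement.

    observed_agreement is the proportion of samples on which the two
    tests (or laboratories) agree; n_categories is the number of
    possible outcomes (2 for a positive/negative result).
    """
    if not 0.0 <= observed_agreement <= 1.0:
        raise ValueError("observed_agreement must be a proportion in [0, 1]")
    if n_categories < 2:
        raise ValueError("at least two categories are required")
    k = n_categories
    return (k * observed_agreement - 1.0) / (k - 1.0)


if __name__ == "__main__":
    # Concordance values quoted in the abstract, used only as inputs
    # for illustration (binary outcome assumed).
    for label, p_o in [("human QF", 0.625), ("human RVF", 0.811),
                       ("livestock QF", 0.923), ("livestock RVF", 0.940)]:
        print(label, round(pabak(p_o), 3))
```

Because PABAK ignores the marginal distributions, it can sit well above Cohen's kappa when one outcome is rare, which is one reason the two are often reported together.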
Article
This study aims to determine the tone and subject matter of the campaign speeches of the candidates competing in the 2nd round of the 2023 Turkish Presidential election. Content analyses in the field of political communication examine which types of messages—positive, negative, or defensive—politicians frequently use in their campaigns and which policies/issues they focus on. In this study, the speeches made by the candidates Recep Tayyip Erdoğan and Kemal Kılıçdaroğlu between May 14 and May 28 in the 2nd round of the 2023 election were analyzed using qualitative content analysis. The results indicate that Erdoğan's campaign predominantly employed a positive tone, while Kılıçdaroğlu's campaign was characterized by a negative tone. Both candidates preferred to conduct their campaigns through praise, attack, and defense categories based on topics rather than on individuals and values. The primary focus in both candidates' campaign speeches was the election process and alliances. After the election process and alliances, it was determined that the candidates focused on different topics. Erdoğan emphasized terrorism, while Kılıçdaroğlu focused on corruption and degeneration. Other implications and findings have been discussed.
Article
Notes that various procedures are available for measuring agreement among 2 or more observers who classify responses among nominal categories, but that different problem situations require different measures. The general model of a contingency table with fixed margins is used to suggest (a) a measure of level of agreement among several observers when compared internally, (b) a conditional measurement of agreement for several observers compared internally, (c) a test for the joint agreement of several observers when compared with a standard, and (d) a statistic for evaluating the pattern of agreement between 2 observers. Illustrations are presented for each situation, and results of a Monte Carlo study of the behavior of the pattern agreement statistic are discussed.
Article
Two statistics, kappa and weighted kappa, are available for measuring agreement between 2 raters on a nominal scale. Formulas for the standard errors of these 2 statistics are in error in the direction of overestimation, so that their use results in conservative significance tests and confidence intervals. Valid formulas for the approximate large-sample variances are given, and their calculation is illustrated using a numerical example.
Article
Introduced the statistic kappa to measure nominal scale agreement between a fixed pair of raters. Kappa was generalized to the case where each of a sample of 30 patients was rated on a nominal scale by the same number of psychiatrist raters (n = 6), but where the raters rating one subject were not necessarily the same as those rating another. Large sample standard errors were derived.
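The many-rater generalization described in this abstract is commonly known as Fleiss' kappa. A minimal sketch follows, assuming the input is a table in which `counts[i][j]` is the number of raters who assigned subject i to category j and every subject is rated by the same number of raters. The function name `fleiss_kappa` and the toy table are hypothetical and do not reproduce the 30-patient data set.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a subjects-by-categories table of rater counts."""
    n_subjects = len(counts)
    if n_subjects == 0:
        raise ValueError("at least one subject is required")
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every subject needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Per-subject agreement: share of rater pairs that agree on the subject.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects

    # Chance agreement from the overall proportion of ratings per category.
    total = n_subjects * n_raters
    p_j = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Three subjects, two raters each, two categories.
    table = [[2, 0], [0, 2], [1, 1]]
    # p_bar = 2/3 and p_e = 1/2, so kappa = 1/3
    print(fleiss_kappa(table))
```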
Article
A previously described coefficient of agreement for nominal scales, kappa, treats all disagreements equally. A generalization to weighted kappa (Kw) is presented. The Kw provides for the incorporation of ratio-scaled degrees of disagreement (or agreement) to each of the cells of the k × k table of joint nominal scale assignments such that disagreements of varying gravity (or agreements of varying degree) are weighted accordingly. Although providing for partial credit, Kw is fully chance corrected. Its sampling characteristics and procedures for hypothesis testing and setting confidence limits are given. Under certain conditions, Kw equals product-moment r. The use of unequal weights for symmetrical cells makes Kw suitable as a measure of validity.
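A minimal sketch of weighted kappa in its disagreement-weight form, Kw = 1 - (sum of weighted observed proportions) / (sum of weighted chance-expected proportions), may help make the abstract concrete. The function name `weighted_kappa`, the default linear weights |i - j|, and the example table are assumptions chosen for illustration; the abstract itself leaves the choice of weights to the investigator.

```python
def weighted_kappa(table, weights=None):
    """Weighted kappa from a k x k table of joint rating counts.

    table[i][j] counts items placed in category i by rater A and in
    category j by rater B. weights[i][j] is the disagreement weight for
    that cell (0 on the diagonal); linear weights |i - j| are assumed
    when none are supplied.
    """
    k = len(table)
    if k < 2 or any(len(row) != k for row in table):
        raise ValueError("table must be square with at least two categories")
    if weights is None:
        weights = [[abs(i - j) for j in range(k)] for i in range(k)]
    n = sum(sum(row) for row in table)
    if n == 0:
        raise ValueError("table must contain at least one rated item")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]

    # Weighted disagreement actually observed.
    observed = sum(
        weights[i][j] * table[i][j] for i in range(k) for j in range(k)
    ) / n
    # Weighted disagreement expected by chance from the marginals.
    expected = sum(
        weights[i][j] * row_totals[i] * col_totals[j]
        for i in range(k) for j in range(k)
    ) / (n * n)
    if expected == 0.0:
        raise ValueError("weighted kappa is undefined when expected disagreement is 0")
    return 1.0 - observed / expected


if __name__ == "__main__":
    table = [[1, 1, 0],
             [0, 1, 0],
             [0, 0, 1]]
    print(weighted_kappa(table))  # linear weights
    # With 0/1 weights every disagreement counts equally, which
    # recovers unweighted kappa.
    unit = [[0 if i == j else 1 for j in range(3)] for i in range(3)]
    print(weighted_kappa(table, unit))
```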
Article
The statistical analysis of data in multi-dimensional contingency tables is discussed in terms of appropriate underlying probability models. Emphasis is placed on the distinction between 'factors' (such as treatments or blocks) which have fixed marginal totals and 'responses' (such as category of performance) which have random marginal totals. Hence, four principal cases arise: (i) the 'multi-response, no factor' tables, (ii) the 'multi-response, uni-factor' tables, (iii) the 'multi-response, multi-factor' tables, (iv) the 'uni-response, multi-factor' tables. For situations (i) and (ii), the concept of 'no interaction' is related to questions regarding the pattern of association among responses. However, for situation (iv), it is related to how factors combine (e.g., additively) to determine the response distribution. Finally, for situation (iii), both types of questions arise. For each of the different types of tables, the problem of formulating appropriate hypotheses of 'no interaction' is considered. The corresponding test statistics are based upon a general and computationally simple criterion of Wald [1943]. The suggested methods are illustrated with several numerical examples.
Article
This paper is concerned with contingency tables which are analogous to the well-known mixed model in analysis of variance. The corresponding experimental situation involves exposing each of n subjects to each of the d levels of a given factor and classifying the d responses into one of r categories. The resulting data are represented in an $r \times r \times \cdots \times r$ contingency table of d dimensions. The hypothesis of principal interest is equality of the one-dimensional marginal distributions. Alternatively, if the r categories may be quantitatively scaled, then attention is directed at the hypothesis of equality of the mean scores over the d first order marginals. Test statistics are developed in terms of minimum Neyman $\chi^2$ or equivalently weighted least squares analysis of underlying linear models. As such, they bear a strong resemblance to the Hotelling $T^2$ procedures used with continuous data in mixed models. Several numerical examples are given to illustrate the use of the various methods discussed.
Article
When populations are cross-classified with respect to two or more classifications or polytomies, questions often arise about the degree of association existing between the several polytomies. Most of the traditional measures or indices of association are based upon the standard chi-square statistic or on an assumption of underlying joint normality. In this paper a number of alternative measures are considered, almost all based upon a probabilistic model for activity to which the cross-classification may typically lead. Only the case in which the population is completely known is considered, so no question of sampling or measurement error appears. We hope, however, to publish before long some approximate distributions for sample estimators of the measures we propose, and approximate tests of hypotheses. Our major theme is that the measures of association used by an empirical investigator should not be blindly chosen because of tradition and convention only, although these factors may properly be given some weight, but should be constructed in a manner having operational meaning within the context of the particular problem.
Article
The estimates of Koch [1967a] have the undesirable property that they may change in value if the same constant is added to each of the observations. In this paper, an alternative procedure based on the same general principles is developed and applied to a variety of models. As before, the estimators obtained are unbiased and consistent. They are also reasonably easy to compute. Finally, in the case of balanced experiments, they coincide with those obtained from the analysis of variance. On the other hand, their structure is more complex than that of the estimators considered in the previous paper. In particular, the derivation of their covariance matrix is much more complicated, and hence no attempt has been made here to study its properties.
Article
A general method of estimation of variance components in random-effects models of the nested and/or classification type is considered. If a given parameter is estimable with respect to some particular experimental design (i.e., an unbiased estimate of the parameter may be obtained from the experiment), then the suggested estimator may be readily computed with only the aid of a desk calculator. The estimates are always unbiased and consistent (with respect to the structure of the experimental design); in the case of balanced experiments, they coincide with those obtained from the analysis of variance.
Article
A very important and yet widely misunderstood concept or problem in science and technology is that of precision and accuracy of measurement. It is therefore necessary to define the terms precision and accuracy (or imprecision and inaccuracy) clearly and analytically if possible. Also, we need to establish and develop appropriate statistical tests of significance for these measures, since generally a relatively small number of measurements will be made or taken in most investigations. In this paper a discussion is given of some of the pertinent literature for estimating variances in errors of measurement, or the “imprecisions” of measurement, when two or three instruments are used to take the same observations on a series of items or characteristics. Also, present techniques for comparing the imprecision of measurement of one instrument with that of a second instrument through the use of statistical tests of significance are reviewed, as well as procedures for detecting the significance of the difference in biases or levels of measurement of two instruments. Finally, we indicate methods of extending present theory to the case of three measuring instruments, for which rather sensitive statistical tests of significance are developed for dealing with the precision and accuracy problem. An example for the three-instrument case is given to illustrate the suggested methodology of analysis.
Article
The statistical analysis of multi-dimensional contingency tables is discussed from the point of view of the associated underlying model. Different formulations of hypotheses of ‘no interaction’ are considered. The corresponding test statistics are based on a general and computationally simple criterion originally due to Wald [1943]. The suggested methods are illustrated with several numerical examples.
Article
This paper deals with the theory of a proposed method for the statistical study of measuring processes. The practical aspects of the method, including computational details, are discussed in a companion paper published in the ASTM Bulletin. In the present article a theoretical framework is proposed for the mathematical expression of the sources of variation in measuring methods and a suitable method of statistical analysis is described. Particular attention is given, both here and in the companion paper, to interlaboratory studies of test methods. An illustration based on data taken from the chemical literature is appended.
Article
A generalization is given to the multivariate case of the linear model usually employed in the determination of accuracy of observations. Likelihood ratio tests are derived for testing hypotheses concerning systematic differences among observers, and a criterion is suggested for evaluating the magnitude of errors of measurement.
Article
Department of Biostatistics, School of Public Health, University of North Carolina, Chapel Hill, North Carolina 27514, U.S.A.
Article
Determining the extent of association between 2 ordinal variables is a recurrent problem in psychological research. Several statistics are available and include rho, gamma, and tau. The basic problem with these techniques is that they measure order rather than extent of agreement. As a consequence, 2 quite different sets of ordinal data will produce the same statistical results, providing only that the ordering of each set of rankings is a constant. A new statistic, which can be expressed as a simple percentage of agreement, is proposed as an alternative method, and applied to a hypothetical research problem. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Establishes the property that if $v_{ij} = (i-j)^2$ ($v_{ij}$ denotes the disagreement weight in the weighted kappa formula) and if the variables can be scaled 1 and 2, then irrespective of the marginal distributions, weighted kappa is identical with the intraclass correlation coefficient in which the mean difference between the raters is included as a component of variability. A discussion of this property is presented along with an example.
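As a rough illustration of the weighted kappa statistic this abstract refers to, the sketch below computes it from a square table of counts. It assumes squared-difference disagreement weights $v_{ij} = (i-j)^2$, which is how the abstract's weight definition is read here, and the usual form $\kappa_w = 1 - \sum v_{ij} p_{ij} / \sum v_{ij} p_{i\cdot} p_{\cdot j}$. The function name `weighted_kappa` and the example counts are hypothetical and do not come from the paper.

```python
def weighted_kappa(table):
    """Weighted kappa with squared-difference disagreement weights.

    `table` is a square list of lists of counts: rows are rater A's
    categories, columns are rater B's categories (an assumed layout).
    """
    k = len(table)
    n = sum(sum(row) for row in table)

    # Marginal proportions for each rater.
    row_p = [sum(table[i]) / n for i in range(k)]
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    # Observed and chance-expected weighted disagreement.
    observed = sum((i - j) ** 2 * table[i][j] / n
                   for i in range(k) for j in range(k))
    expected = sum((i - j) ** 2 * row_p[i] * col_p[j]
                   for i in range(k) for j in range(k))

    return 1 - observed / expected


# Hypothetical counts for two raters using three ordered categories.
example = [[20, 5, 0],
           [4, 30, 6],
           [1, 4, 30]]
print(weighted_kappa(example))
```

With only two categories every off-diagonal weight equals 1, so this quantity reduces to the unweighted kappa.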
Article
This paper reviews research situations in medicine, epidemiology and psychiatry, in psychological measurement and testing, and in sample surveys in which the observer (rater or interviewer) can be an important source of measurement error. Moreover, most of the statistical literature in observer variability is surveyed with attention given to a notational unification of the various models proposed. In the continuous data case, the usual analysis of variance (ANOVA) components of variance models are presented with an emphasis on the intraclass correlation coefficient as a measure of reliability. Other modified ANOVA models, response error models in sample surveys, and related multivariate extensions are also discussed. For the categorical data case, special attention is given to measures of agreement and tests of hypotheses when the data consist of dichotomous responses. In addition, similarities between the dichotomous and continuous cases are illustrated in terms of intraclass correlation coefficients. Finally, measures of agreement, such as kappa and weighted kappa, are discussed in the context of nominal and ordinal data. A proposed unifying framework for the categorical data case is given in the form of concluding remarks.
Article
This paper is concerned with the analysis of multivariate categorical data which are obtained from repeated measurement experiments. An expository discussion of pertinent hypotheses for such situations is given, and appropriate test statistics are developed through the application of weighted least squares methods. Special consideration is given to computational problems associated with the manipulation of large tables including the treatment of empty cells. Three applications of the methodology are provided.
Article
This paper presents a general statistical methodology for the analysis of multivariate categorical data involving agreement among more than two observers. Since these situations give rise to very large contingency tables in which most of the observed cell frequencies are zero, procedures based on indicator variables of the raw data for individual subjects are used to generate first-order margins and main diagonal sums from the conceptual multidimensional contingency table. From these quantities, estimates are generated to reflect the strength of an internal majority decision on each subject. Moreover, a subset of observers who demonstrate a high level of interobserver agreement can be identified by using pairwise agreement statistics between each observer and the internal majority standard opinion on each subject. These procedures are all illustrated within the context of a clinical diagnosis example involving seven pathologists.
Article
GENCAT is a computer program which implements an extremely general methodology for the analysis of multivariate categorical data. This approach essentially involves the construction of test statistics for hypotheses involving functions of the observed proportions which are directed at the relationships under investigation and the estimation of corresponding model parameters via weighted least squares computations. Any compounded function of the observed proportions which can be formulated as a sequence of the following transformations of the data vector (linear, logarithmic, exponential, or the addition of a vector of constants) can be analyzed within this general framework. This algorithm produces minimum modified chi-square statistics which are obtained by partitioning the sums of squares as in ANOVA. The input data can be any of the following: (a) frequencies from a multidimensional contingency table; (b) a vector of functions with its estimated covariance matrix; or (c) raw data in the form of integer-valued variables associated with each subject. The input format is completely flexible for the data as well as for the matrices.
Article
At least a dozen indexes have been proposed for measuring agreement between two judges on a categorical scale. Using the binary (positive-negative) case as a model, this paper presents and critically evaluates some of these proposed measures. The importance of correcting for chance-expected agreement is emphasized, and identities with intraclass correlation coefficients are pointed out.
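The chance correction this abstract emphasizes can be illustrated with the kappa coefficient, $\kappa = (p_o - p_e)/(1 - p_e)$, where $p_o$ is the observed proportion of agreement and $p_e$ is the agreement expected from the marginal proportions alone. The following is a minimal sketch under those standard definitions. The function name `cohen_kappa` and the counts are hypothetical, not taken from the paper.

```python
def cohen_kappa(table):
    """Chance-corrected agreement for two judges.

    `table` is a square list of lists of counts: rows are judge A's
    ratings, columns are judge B's ratings (an assumed layout).
    """
    k = len(table)
    n = sum(sum(row) for row in table)

    # Observed agreement: proportion of subjects on the main diagonal.
    p_o = sum(table[i][i] for i in range(k)) / n

    # Chance-expected agreement from the two sets of marginal proportions.
    row_p = [sum(table[i]) / n for i in range(k)]
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_e = sum(row_p[i] * col_p[i] for i in range(k))

    return (p_o - p_e) / (1 - p_e)


# Hypothetical binary (positive/negative) ratings of 100 subjects.
binary_counts = [[40, 10],
                 [10, 40]]
print(cohen_kappa(binary_counts))  # raw agreement is 0.8, kappa is about 0.6
```

In this example the two judges agree on 80% of subjects, but half of that agreement would be expected by chance given the balanced margins, which is why the corrected index is noticeably lower than the raw percentage.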
Article
The minimum modified chi-square method of analyzing contingency tables is extended to compounded logarithmic, exponential and linear functions. These compounded functions allow one to consider the following practical situations in terms of a general technique: 1) patterns of association in square contingency tables, as related to functions of diagonal totals, 2) rank correlation coefficients, 3) "ridits", 4) partial association. The derivation of this procedure and examples showing its application are presented.
Article
For epidemiologic and comparative pathologic studies of cerebral atherosclerosis, assessment of reliability of measurements is necessary. Such a study is described along with the measurement method used. The development of the methodology for assessing reliability of data is presented. Within- and between-coder variability is estimated. For the biometrician, the salient feature is that several methods of determining reliability might have to be explored and tried before arriving at a method which is useful and acceptable to the clinician or clinical pathologist.
Article
This paper illustrates tests for some suitable hypotheses in analysis of contingency tables when some characters are quantitative. For a two-dimensional table tests are given for the hypothesis of homogeneity of mean scores, the hypothesis of linearity of regression of mean scores, and also for testing significance of regression of mean scores on the level of the other character. For a three-dimensional table some similar procedures are offered. It is briefly pointed out how such test criteria can be derived in a systematic manner by an application of a certain generalized least squares technique.
Article
Assume there are $n_i$ $(i = 1, 2, \ldots, s)$ samples from $s$ multinomial distributions, each having $r$ categories of response. Then define any $u$ functions of the unknown true cell probabilities $\{\pi_{ij} : i = 1, 2, \ldots, s;\ j = 1, 2, \ldots, r\}$, where $\sum_{j=1}^{r} \pi_{ij} = 1$, that have derivatives of order up to the second with respect to $\pi_{ij}$ and for which the matrix of first derivatives is of rank $u$. A general noniterative procedure is described for fitting these functions to a linear model, for testing the goodness-of-fit of the model, and for testing hypotheses about the parameters in the linear model. The special cases of linear functions and logarithmic functions of the $\pi_{ij}$ are developed in detail, and some examples of how the general approach can be used to analyze various types of categorical data are presented.
The Analysis of Variance
  • H Scheffé
Scheffé, H. [1959]. The Analysis of Variance. Wiley, New York.
Studies on multiple sclerosis in Winnipeg, Manitoba and New Orleans, Louisiana
  • K B Westlund
  • L T Kurland
Westlund, K. B. and Kurland, L. T. [1953]. Studies on multiple sclerosis in Winnipeg, Manitoba and New Orleans, Louisiana. American Journal of Hygiene 57, 380-396.
A general methodology for the measurement of observer agreement when the data are categorical
  • J R Landis
Landis, J. R. [1975]. A general methodology for the measurement of observer agreement when the data are categorical. Ph.D. Dissertation, University of North Carolina, Institute of Statistics Mimeo Series No. 1022.
Statistical Theory in Research
  • R L Anderson
  • T A Bancroft
Anderson, R. L. and Bancroft, T. A. [1952]. Statistical Theory in Research. McGraw Hill, New York.
A note on the equivalence of two test criteria for hypotheses in categorical data
  • V P Bhapkar
Bhapkar, V. P. [1966]. A note on the equivalence of two test criteria for hypotheses in categorical data. Journal of the American Statistical Association 61, 228-235.