Article

A Coefficient of Agreement For Nominal Scales

Authors: Jacob Cohen
... Additionally, aspect (Asp) and slope (Slop) were two other important factors that showed a significant decrease in gain when excluded [28,29]. ...
... We also assessed MaxEnt using various statistical parameters, such as AUC, K, TSS, and NMI, to confirm its reliability and statistical support. Model output with a high AUC value close to 1 (0.975 ± 0.019) and good scores on the other classification accuracy tests shows that the bioclimatic variables used to run the model were chosen correctly to make a good prediction (Table 1 and Supplementary Figure S1), in accordance with earlier studies [26,28,29]. Importantly, when working with such models, the spatial scale selected for niche prediction also depends upon the size and areal extent of the distribution range, which should be precisely chosen and surveyed thoroughly [34]. ...
Article
Full-text available
Thamnocalamus spathiflorus is a shrubby woody bamboo thriving in the alpine and sub-alpine regions of the northwestern Himalayas. The present investigation was conducted to map the potential distribution of Th. spathiflorus in the western Himalayas for current and future climate scenarios using Ecological Niche Modelling (ENM). In total, 125 geo-coordinates were collected for the species presence from the Himachal Pradesh (HP) and Uttarakhand (UK) states of India and modelled to predict the current distribution using the Maximum Entropy (MaxEnt) model, along with 13 bioclimatic variables selected after a multi-collinearity test. Model output was supported with a significant value of the Area Under the "Receiver Operating Characteristics" Curve (AUC = 0.975 ± 0.019) and other confusion matrix-derived accuracy measures. The variables precipitation seasonality (Bio 15), precipitation (Prec), annual temperature range (Bio 7), and altitude (Alt) showed the highest percentage contribution (72.2%) and permutation importance (60.9%) in predicting the habitat suitability of Th. spathiflorus. The actual (1 km² buffer zone) and predicted estimates of species cover were ~136 km² and ~982 km², respectively. The predicted range extended from Chamba (HP) in the north to Pithoragarh (UK) in the southeast, and further into Nepal. Furthermore, distribution modelling under future climate change scenarios (RCP 8.5) for the years 2050 and 2070 showed an eastern centroidal shift with a slight decline in the species area of ~16 km² and ~46 km², respectively. This investigation employed the Model for Interdisciplinary Research on Climate (MIROC6) under the shared socio-economic pathway SSP245 for cross-validation purposes. The model was used to determine the habitat suitability and potential distribution of Th. spathiflorus in relation to the current distribution and RCP 8.5 future scenarios for the years 2021-2040 and 2061-2080, respectively. It showed a significant decline in the distribution area of the species between the years 2030 and 2070. Overall, this is a pioneering study revealing eco-distribution prediction modelling for this important high-altitude bamboo species.
... The data was gathered purely from widely accessible accounts on the Facebook platform. 2. We manually and rigorously annotate the dataset into vulgar and non-vulgar categories and validate the data annotation process using Cohen's Kappa statistic (Cohen, 1960). 3. Finally, we compare Machine Learning (ML)-based and Deep Learning (DL)-based approaches for identifying vulgar remarks in the Chittagonian dialect on social media content. ...
... After annotating the data, we examined the inter-rater agreement. As a result, using Cohen's Kappa (Cohen, 1960), we obtained an average agreement value of 0.91, indicating very strong agreement among annotators. ...
Conference Paper
Full-text available
The negative effects of online bullying and harassment are increasing with Internet popularity, especially in social media. One solution is using natural language processing (NLP) and machine learning (ML) methods for the automatic detection of harmful remarks, but these methods are limited in low-resource languages like the Chittagonian dialect of Bangla. This study focuses on detecting vulgar remarks in social media using supervised ML and deep learning algorithms. Logistic Regression achieved promising accuracy (0.91), while a simple RNN with Word2vec and fastText had lower accuracy (0.84-0.90), highlighting the issue that NN algorithms require more data.
... All records were systematically screened using EPPI-Reviewer software (Version: 6.15.0.0) (Thomas et al., 2023). The provided standard coding scheme was adapted to meet all eligibility (Cohen, 1960; Landis & Koch, 1977; McHugh, 2012). (Sachdev et al., 2014) in line with the fifth version of the Diagnostic and Statistical Manual of Mental Disorders (American Psychiatric Association, 2013)) of (a) learning and memory, (b) complex attention, (c) executive function, and (d) visuospatial skills, and (6) findings of each study relating to cognitive performance. ...
... In case of disagreements, EdB served as referee. Inter-rater agreement was again assessed and interpreted based on Cohen's kappa (Cohen, 1960; Landis & Koch, 1977; McHugh, 2012). ...
Article
Full-text available
BACKGROUND: Exergame-based training is currently considered a more promising training approach than conventional physical and/or cognitive training. OBJECTIVES: This study aimed to provide quantitative evidence on dose-response relationships of specific exercise and training variables (training components) of exergame-based training on cognitive functioning in middle-aged to older adults (MOA). METHODS: We conducted a systematic review with meta-analysis including randomized controlled trials comparing the effects of exergame-based training to inactive control interventions on cognitive performance in MOA. RESULTS: The systematic literature search identified 22,928 records of which 31 studies were included. The effectiveness of exergame-based training was significantly moderated by the following training components: body position for global cognitive functioning, the type of motor-cognitive training, training location, and training administration for complex attention, and exercise intensity for executive functions. CONCLUSION: The effectiveness of exergame-based training was moderated by several training components that have in common that they enhance the ecological validity of the training (e.g., stepping movements in a standing position). Therefore, it seems paramount that future research focuses on developing innovative novel exergame-based training concepts that incorporate these (and other) training components to enhance their ecological validity and transferability to clinical practice. We provide specific evidence-based recommendations for the application of our research findings in research and practical settings and identified and discussed several areas of interest for future research. PROSPERO registration number CRD42023418593; prospectively registered, date of registration: 1 May 2023
... When we want to determine the reliability of measurement instruments that produce qualitative results (e.g., projective tests), it is common to use a reliability coefficient called inter-judge agreement (Cohen, 1960). This coefficient allows us to determine the consistency of the assessments made by two judges, using a scoring grid, of examinees' written or verbal productions (as is the case for certain responses in psychological measurement instruments, for example the Wechsler intelligence batteries, WAIS or WISC; Wechsler, 1939). ...
... This coefficient allows us to determine the consistency of the assessments made by two judges, using a scoring grid, of examinees' written or verbal productions (as is the case for certain responses in psychological measurement instruments, for example the Wechsler intelligence batteries, WAIS or WISC; Wechsler, 1939). If this task is carried out by more than two judges, a different agreement coefficient has to be used, namely Cohen's K coefficient (Cohen, 1960). Note, however, that this is a reliability calculation that focuses particularly on assessing the quality of the scoring grid. ...
Chapter
CADERNO DE LABORATÓRIO is a periodical publication of LAPSO-Laboratório de Psicologia, Iscte-Instituto Universitário de Lisboa, in collaboration with CIS_Iscte. A very significant part of the knowledge we accumulate over our careers as researchers is developed (or consolidated) when we collaborate with our colleagues. Often, the information we are looking for to decide on the best methodology to apply, or to solve a coding problem, comes up in a conversation at the laboratory door or over a coffee (or even something stronger). Our main goal with CADERNO DE LABORATÓRIO is to systematize this knowledge into a guide of laboratory practices and resources that supports Psychology students and researchers at different levels. Specifically, the chapters have the potential to be included in the bibliography of course units related to research methodologies and/or academic skills across the three cycles of Psychology studies.
... where Pr(a) represents the actual observed agreement, and Pr(e) represents chance agreement. Cohen (1960) suggested that the Kappa result should be interpreted as follows: values less than or equal to 0 indicate no agreement, 0.01-0.20 indicate no agreement to slight agreement, 0.21-0.40 indicate fair agreement, 0.41-0.60 ...
... indicate substantial agreement, and 0.81-1.00 indicate almost perfect agreement (Cohen, 1960; McHugh, 2012). The weighted mean F1 accuracy of the results was evaluated for all types of classification algorithms. ...
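Written out, the chance-corrected statistic these excerpts interpret is Cohen's (1960) kappa, computed from the observed agreement Pr(a) and the chance agreement Pr(e):

```latex
\kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}
```

For example, with Pr(a) = 0.85 and Pr(e) = 0.50, kappa = (0.85 - 0.50) / (1 - 0.50) = 0.70, which falls in the 0.61-0.80 band of the scale quoted above.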
Article
Full-text available
Remote sensing technology and the Earth system data it can obtain can provide great support for the monitoring and management of protected areas. These data can provide the ecological indicators of a place. It is very important to understand the situation concerning the natural land elements of a protected area and to stop unacceptable actions in time. This paper presents an analysis of the natural elements of the land use/land cover (LULC) in the landscapes of protected areas. Freely available Sentinel-2A (S2A) multispectral data were used to classify the LULC and monitor the situation of protected areas. The research object was Trakai Historical National Park, which is an authentic landscape in Lithuania. First, the Sentinel-2A image was processed and classified using the random forest algorithm by the special Lithuanian remote monitoring data collection, processing, use and storage system of the Environmental Protection Agency Lithuania. Next, the LULC model was statistically analysed using Quantum Geographic Information System (QGIS) software. The authors recommend automating these processes. The results show that in the period from 2021 to 2022, the farmland areas (cultivated meadows, decay areas, winter cereals, intensive cultivated crops and natural meadows) in Trakai Historical National Park decreased by 9.2%. Meanwhile, the forest, water and wetland areas increased by 9.6%, which makes it possible to conclude that these changes are beneficial for the ecosystems in this area.
... Discrepancies were discussed in research team meetings to reach consensus. For the QRS, a second author coded 10% of the transcripts to assess intercoder agreement [54]. Having strong intercoder reliability (ICR) is important for the trustworthiness of the study findings [55]. ...
Article
Full-text available
Background This systematic review examined the evidence on effectiveness and acceptability of cognitive behavioral therapy (CBT) interventions in improving quality of life (QoL) and psychological well-being of unaccompanied minors (UM). Methods PubMed, Scopus, Embase, ProQuest, PsycInfo, PsycArticles, and Open Dissertations databases were used to identify quantitative and qualitative studies. The Effective Public Health Practice Project (EPHPP) and Critical Appraisal Skills Programme (CASP) tools were used for quality assessment. Narrative synthesis and qualitative research synthesis were carried out to collate the findings. Results 18 studies were included. Two studies examined QoL, and five studies examined acceptability of interventions. Most quantitative studies (n = 10) were appraised as methodologically weak. Trauma-Focused CBT appears to have the most evidence demonstrating effectiveness in ameliorating symptoms of post-traumatic stress disorder, depression, and anxiety. Promising findings (i.e., increased mindfulness and psychological flexibility) were observed for third wave interventions but further replication is required. Conclusions The literature is tainted by under-powered studies that lack blinding and follow-up assessments. Female UM remain largely underrepresented. This review calls for a substantial increase in high-quality quantitative and qualitative research focused on improving QoL and examining acceptability, rather than merely aiming for psychological symptom reduction in UM, to enhance overall well-being and functionality. The research protocol was registered in PROSPERO (registration number: CRD42021293881).
... A value of 0.5 indicates a poor (random) model and 1.0 a perfect model; hence, the better the classifier, the closer the model is to the top left corner of the AUC plot [65]. Cohen's kappa expresses the level of agreement between two raters on a binary classification [66]. Kappa values above 0.80 are considered good agreement, while values of zero or lower indicate no agreement. ...
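As a minimal sketch of the two-rater, binary agreement described in this excerpt (the rater labels below are invented for illustration), Cohen's kappa can be computed with scikit-learn:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary labels from two raters (1 = flooded, 0 = not flooded)
rater_a = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
rater_b = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]

# Chance-corrected agreement; values above 0.80 would be read as good agreement
print(cohen_kappa_score(rater_a, rater_b))
```

Here the raters agree on 9 of 10 cases (observed agreement 0.9) while chance agreement from the marginals is 0.5, giving kappa = 0.8.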
Article
Full-text available
Ground hazards are a significant problem in the global economy, costing millions of dollars in damage each year. Railroad tracks are vulnerable to ground hazards like flooding since they traverse multiple terrains with complex environmental factors and diverse human developments. Traditionally, flood-hazard assessments are generated using models like the Hydrological Engineering Center–River Analysis System (HEC-RAS). However, these maps are typically created for design flood events (10, 50, 100, 500 years) and are not available for any specific storm event, as they are not designed for individual flood predictions. Remotely sensed methods, on the other hand, offer precise flood extents only during the flooding, which means the actual flood extents cannot be determined beforehand. Railroad agencies need daily flood extent maps before rainfall events to manage and plan for the parts of the railroad network that will be impacted during each rainfall event. A new approach would involve using traditional flood-modeling layers and remotely sensed flood model outputs such as flood maps created using the Google Earth Engine. These new approaches will use machine-learning tools in flood prediction and extent mapping. This new approach will allow for determining the extent of flood for each rainfall event on a daily basis using rainfall forecast; therefore, flooding extents will be modeled before the actual flood, allowing railroad managers to plan for flood events pre-emptively. Two approaches were used: support vector machines and deep neural networks. Both methods were fine-tuned using grid-search cross-validation; the deep neural network model was chosen as the best model since it was computationally less expensive in training the model and had fewer type II errors or false negatives, which were the priorities for the flood modeling and would be suitable for developing the automated system for the entire railway corridor. The best deep neural network was then deployed and used to assess the extent of flooding for two floods in 2020 and 2022. The results indicate that the model accurately approximates the actual flooding extent and can predict flooding on a daily temporal basis using rainfall forecasts.
... Macro precision (known as a measure of exactness) and macro recall (known as a measure of completeness) indicate how well the classifiers perform on each class. Cohen's Kappa (Cohen, 1960) is a statistical measure used to compare multi-class and imbalanced class data. It is known as a measure of reliability. ...
Preprint
Random Forest (RF) is well-known as an efficient ensemble learning method in terms of predictive performance. It is also considered a "black box" because of its hundreds of deep decision trees. This lack of interpretability can be a real drawback for acceptance of RF models in several real-world applications, especially those affecting people's lives, such as in healthcare, security, and law. In this work, we present Forest-ORE, a method that makes RF interpretable via an optimized rule ensemble (ORE) for local and global interpretation. Unlike other rule-based approaches aiming at interpreting the RF model, this method simultaneously considers several parameters that influence the choice of an interpretable rule ensemble. Existing methods often prioritize predictive performance over interpretability coverage and do not provide information about existing overlaps or interactions between rules. Forest-ORE uses a mixed-integer optimization program to build an ORE that considers the trade-off between predictive performance, interpretability coverage, and model size (size of the rule ensemble, rule lengths, and rule overlaps). In addition to providing an ORE competitive in predictive performance with RF, this method enriches the ORE through other rules that afford complementary information. It also enables monitoring of the rule selection process and delivers various metrics that can be used to generate a graphical representation of the final model. This framework is illustrated through an example, and its robustness is assessed through 36 benchmark datasets. A comparative analysis of well-known methods shows that Forest-ORE provides an excellent trade-off between predictive performance, interpretability coverage, and model size.
... While reviewing these codes together, researchers made any revisions necessary to better clarify operational definitions of each code (see Appendix C). (3) Researchers independently re-coded the transcripts [66,67] in totality. Each participant's interview response was assigned a single code; the inter-rater reliability calculated using Cohen's kappa [68] was 100%. Due to the original study not being based on understanding user acceptance, some interview questions did not apply to the coding framework, and thus were excluded from calculations. ...
Article
Full-text available
Loneliness is increasingly common, especially among older adults. Technology like mobile telepresence robots can help people feel less lonely. However, such technology has challenges, and even if people use it in the short term, they may not accept it in the long term. Prior work shows that it can take up to six months for people to fully accept technology. This study focuses on exploring the nuances and fluidity of acceptance phases. This paper reports a case study of four older adult participants living with a mobile telepresence robot for seven months. In monthly interviews, we explore their progress through the acceptance phases. Results reveal the complexity and fluidity of the acceptance phases. We discuss what this means for technology acceptance. In this paper, we also make coding guidelines for interviews on acceptance phases more concrete. We take early steps in moving toward a more standard interview and coding method to improve our understanding of acceptance phases and how to help potential users progress through them.
... Research participants were classified in group 2 if there appeared to be reasonable balance between choices and decisions of their own making, and those that were primarily determined by external factors. This interpretive exercise was conducted by two of the co-authors (WR and GJvdW), independently of each other, and the extent of agreement between the two raters was determined by calculating the kappa coefficient (Cohen, 1960). Discrepancies in classification between the two raters were resolved by discussion. ...
Thesis
Full-text available
The aim of this dissertation was to find out what we can learn when we use the capability approach to look at how people are doing. One situation in which it proves difficult to establish how well someone is doing is when a person has severe hearing loss and wears hearing aids or a CI. That is because the devices are not always clearly visible, and most devices (fortunately) work so well that hearing comes close to that of normal-hearing people. However, this does not mean that there cannot be a major impact on a person's life. Language development may be delayed, a person may feel isolated, or may find it harder to get a job. The aim and responsibility of care is to safeguard the capability of these people. In this dissertation, children and (young) adults with hearing aids and CIs are the people whose capability we try to measure.
... We report our meta-evaluation results of these different design choices in Table 1, showing the agreement (Cohen Kappa score (Cohen, 1960)) of these evaluators with human annotations (on our test set detailed in §3.2), and the approximate time cost per evaluation pass on the SORRY-Bench dataset. Safety evaluators with a higher agreement and a lower time cost are considered better. ...
Preprint
Evaluating aligned large language models' (LLMs) ability to recognize and reject unsafe user requests is crucial for safe, policy-compliant deployments. Existing evaluation efforts, however, face three limitations that we address with SORRY-Bench, our proposed benchmark. First, existing methods often use coarse-grained taxonomies of unsafe topics, and are over-representing some fine-grained topics. For example, among the ten existing datasets that we evaluated, tests for refusals of self-harm instructions are over 3x less represented than tests for fraudulent activities. SORRY-Bench improves on this by using a fine-grained taxonomy of 45 potentially unsafe topics, and 450 class-balanced unsafe instructions, compiled through human-in-the-loop methods. Second, linguistic characteristics and formatting of prompts are often overlooked, like different languages, dialects, and more -- which are only implicitly considered in many evaluations. We supplement SORRY-Bench with 20 diverse linguistic augmentations to systematically examine these effects. Third, existing evaluations rely on large LLMs (e.g., GPT-4) for evaluation, which can be computationally expensive. We investigate design choices for creating a fast, accurate automated safety evaluator. By collecting 7K+ human annotations and conducting a meta-evaluation of diverse LLM-as-a-judge designs, we show that fine-tuned 7B LLMs can achieve accuracy comparable to GPT-4 scale LLMs, with lower computational cost. Putting these together, we evaluate over 40 proprietary and open-source LLMs on SORRY-Bench, analyzing their distinctive refusal behaviors. We hope our effort provides a building block for systematic evaluations of LLMs' safety refusal capabilities, in a balanced, granular, and efficient manner.
... Both the first and second authors were involved in developing the coding scheme for the design process practices, and they implemented the Making-Process-Rug methodology in practice in our previous studies and, therefore, had a good understanding of it. Inter-rater reliability (IRR) was calculated utilizing Cohen's kappa (Cohen, 1960), which was 0.865 with a low standard error of 0.063, indicating almost perfect agreement (Landis & Koch, 1977). ...
Article
Full-text available
This study analyzed collaborative invention projects by teams of lower-secondary (13–14-year-old) Finnish students. In invention projects, student teams design and make materially embodied collaborative inventions using traditional and digital fabrication technologies. This investigation focused on the student teams’ knowledge creation processes by examining how they applied maker practices (i.e., design process, computer engineering, product design, and science practices) in their co-invention projects and the effects of teacher and peer support. In our investigations, we relied on video data and on-site observations, utilizing and further developing visual data analysis methods. Our findings assist in expanding the scope of computer-supported collaborative learning (CSCL) research toward sociomaterially mediated knowledge creation, revealing the open-ended, nonlinear, and self-organized flow of the co-invention projects that take place around digital devices. Our findings demonstrate the practice-based, knowledge-creating nature of these processes, where computer engineering, product design, and science are deeply entangled with design practices. Furthermore, embodied design practices of sketching, practical experimenting, and working with concrete materials were found to be of the essence to inspire and deepen knowledge creation and advancement of epistemic objects. Our findings also reveal how teachers and peer tutor students can support knowledge creation through co-invention.
... A second rater coded 30 syllabi independently to assess intercoder reliability according to a shared codebook previously developed. Cohen's kappa values were calculated to assess inter-rater agreement, and the kappa value was good (k-Cohen = 0.75), showing an acceptable level of agreement [66]. ...
Article
Full-text available
Family involvement and participation in education (FIPE) profoundly impacts the quality of students’ academic and social development. Initial teacher education contribution in fostering attitudes, skills, and strategies for effective FIPE is therefore unquestionable. We aimed to find out to what extent Portuguese pre-service teachers are prepared to engage families. A document analysis was conducted to establish explicit information regarding FIPE within initial teacher education syllabi. Out of 621 syllabi across 36 master’s courses from 25 institutions, only 98 included some information on FIPE. A mere 12 syllabi, from seven institutions, exclusively addressed family–school relationships. Our study covered over 87% of the master’s courses and syllabi, exposing inconsistencies in their educational aims, content, and recommended literature. These findings highlight discrepancies within the initial teacher education syllabi and underscore the need for the enhanced training of pre-service teachers in FIPE. It is crucial to promote more in-depth and explicit syllabi to promote effective family engagement and enrich initial teacher education programs.
... In line with quality criteria for qualitative research [43], Cohen's Kappa [44,45] was calculated for 23.1% of the interviews, containing 120 codes (22.2%). The analysed interviews were independently rated by two coders with experience in conducting semi-structured interviews. ...
Article
Full-text available
This qualitative study aims to analyse the personal qualification, attitudes and the pedagogical concepts of German teachers as experts in their profession regarding basic life support (BLS) education in secondary schools. Thirteen (n = 13) secondary school teachers participated in semi-structured expert interviews lasting between 20 and 60 min regarding BLS student education. Interviews were semi-structured with guiding questions addressing (1) personal experience, (2) teacher qualification for BLS and (3) implementation factors (e.g., personal, material and organisational). Audio-recorded interviews were analysed by content analysis, generating a coding system. School teachers provided a heterogeneous view on implementation-related processes in BLS education. Many teachers were educated in first aid and acknowledged its importance, but had no experience in teaching BLS. They want to be sure they are competent to teach BLS and need tailored training, materials, pedagogical information and incorporation into the curriculum. Also, the management of time constraints, unwilling colleagues, and young students being overwhelmed were commonly mentioned considerations. In conclusion, teachers reported being willing to teach BLS, but a stepwise implementation framework incorporating practice-oriented qualification and educational goals is missing.
... To evaluate model accuracy, we randomly selected 70% of occurrence data for use as a training data set and the remaining 30% as a validation data set; each model was run 100 times for cross-validation. The area under the curve (AUC) of the receiver operating characteristic (ROC) (Hanley and McNeil, 1982), the true skill statistic (TSS) (Allouche et al., 2006), and Cohen's kappa (Kappa) (Cohen, 1960) were used as indices to evaluate model performance. The closer the measured values of TSS, Kappa, and AUC are to 1, the more reliable the prediction results are (Pearce and Ferrier, 2000). ...
Article
Full-text available
In May 2020, a bottom-trawl survey in the southern Bohai Sea collected the portunid crab Charybdis bimaculata, a species formerly found in the northern Yellow Sea. In subsequent surveys, C. bimaculata was found to be abundant and likely to occupy habitats and niches of native species. To study the suitability of habitat in the southern Bohai Sea for this crab, nine trawl surveys were conducted between 2020 and 2022 to monitor its dispersal. Using Biomod2 software and combining species occurrence and environmental data, a distribution model for C. bimaculata in the southern Bohai Sea is developed. We analyze relationships between this and other crustacean species by comparing niche widths and their overlap. A random forest model outperforms eight others, and has the highest evaluation indices among single algorithm species-distribution models. The evaluation index of an ensemble model is significantly higher than those of single algorithm models, indicating its greater accuracy and robustness. We report suitable habitat for C. bimaculata to occur mainly in central and northeastern Laizhou Bay, and for this habitat suitability to shift over years from the middle to northeastern waters. Niche width showed a negative trend from 2020 to 2022, and is greater in May than August for each year. Niche overlaps between C. bimaculata and other major crustaceans in the southern Bohai Sea exist. We consider that increased sea surface temperature caused by climate change enabled invasion of C. bimaculata from northern Yellow Sea waters into the southern Bohai Sea, where it can overwinter and complete its life cycle. These results provide a scientific basis upon which monitoring of C. bimaculata in the Bohai Sea can be strengthened to better cope with its invasion and any negative impact on local biodiversity.
... Interrater reliability was assessed using the Kappa index for categorical and binary variables whereas one-way random effects variance intraclass correlation coefficients (ICC 12 ) were used for continuous variables. To determine the raters' level of agreement, the authors followed seminal statistical literature that indicates that values between 0.41 and 0.60 fall within the moderate level of acceptability, values between 0.61 and 0.80 are treated as "substantial" agreement, and values between 0.81 and 1.00 indicate almost perfect agreement (Cohen, 1960). More recent work has favored increasing the thresholds of "weak" interrater agreement to values between 0.40 and 0.60 with values at and above .61 ...
Article
Full-text available
Implementation of threat assessment (TA) and management efforts on college campuses is often focused on threatening and concerning behavior originating from insider persons of concern (POCs), such as students, faculty, and staff, with little emphasis on outsider (i.e., those unaffiliated with the university) activity. However, universities often respond to a range of concerning behaviors, which may include acts such as harassment, stalking, and assault from both sources. The present study addressed this gap by examining factors differentiating insider and outsider POCs' concerning behavior. The results revealed that most threats to the university originated from aggrieved individuals affiliated with the campus while a minority stemmed from outsiders. Utilizing their privileged knowledge of the target and the setting, insider POCs tended to physically approach their targets to redress their grievances through disruptive and harmful behaviors that could pose a physical risk. Outsider POCs, on the other hand, were either former romantic partners or strangers who engaged in more frequent electronic harassment serving to intimidate their targets and make them fear for their safety. The series of behaviors differentiating insider and outsider POCs highlights the importance of considering the relationships between POCs and institutions to better determine threat assessment and management strategies. The findings support the inclusion of insider threat principles in building upon TA efforts to address the range of concerning behaviors stemming from those within the institution. Practices may entail programs that encourage early reporting for all concerning behavior and closer coordination with law enforcement, particularly for outsider POCs.
... To further analyze the consistency and agreement, we compute the Cohen Kappa value (Cohen, 1960) between GPT-4V scores with the majority-human scores. As evaluated, we observe a Cohen Kappa value of 0.648, representing a substantial agreement between human annotators and GPT-4V. ...
Preprint
Recent studies show that image and video generation models can be prompted to reproduce copyrighted content from their training data, raising serious legal concerns around copyright infringement. Copyrighted characters, in particular, pose a difficult challenge for image generation services, with at least one lawsuit already awarding damages based on the generation of these characters. Yet, little research has empirically examined this issue. We conduct a systematic evaluation to fill this gap. First, we build CopyCat, an evaluation suite consisting of diverse copyrighted characters and a novel evaluation pipeline. Our evaluation considers both the detection of similarity to copyrighted characters and generated image's consistency with user input. Our evaluation systematically shows that both image and video generation models can still generate characters even if characters' names are not explicitly mentioned in the prompt, sometimes with only two generic keywords (e.g., prompting with "videogame, plumber" consistently generates Nintendo's Mario character). We then introduce techniques to semi-automatically identify such keywords or descriptions that trigger character generation. Using our evaluation suite, we study runtime mitigation strategies, including both existing methods and new strategies we propose. Our findings reveal that commonly employed strategies, such as prompt rewriting in the DALL-E system, are not sufficient as standalone guardrails. These strategies must be coupled with other approaches, like negative prompting, to effectively reduce the unintended generation of copyrighted characters. Our work provides empirical grounding to the discussion of copyright mitigation strategies and offers actionable insights for model deployers actively implementing them.
... As for the evaluation of the proposed segmentation network, overall accuracy (OA) [35] and kappa coefficient (KC) [36] were used to evaluate the overall performance. We used user's accuracy (UA) [37], producer's accuracy (PA) [37], and F1 score [38] to evaluate the ability to classify different land cover types. ...
Article
Full-text available
Generating high-resolution land cover maps using relatively lower-resolution remote sensing images is of great importance for subtle analysis. However, the domain gap between real lower-resolution and synthetic images has not been permanently resolved. Furthermore, super-resolution information is not fully exploited in semantic segmentation models. By solving the aforementioned issues, a deeply fused super resolution guided semantic segmentation network using 30 m Landsat images is proposed. A large-scale dataset comprising 10 m Sentinel-2, 30 m Landsat-8 images, and 10 m European Space Agency (ESA) Land Cover Product is introduced, facilitating model training and evaluation across diverse real-world scenarios. The proposed Deeply Fused Super Resolution Guided Semantic Segmentation Network (DFSRSSN) combines a Super Resolution Module (SRResNet) and a Semantic Segmentation Module (CRFFNet). SRResNet enhances spatial resolution, while CRFFNet leverages super-resolution information for finer-grained land cover classification. Experimental results demonstrate the superior performance of the proposed method in five different testing datasets, achieving 68.17–83.29% and 39.55–75.92% for overall accuracy and kappa, respectively. When compared to ResUnet with up-sampling block, increases of 2.16–34.27% and 8.32–43.97% were observed for overall accuracy and kappa, respectively. Moreover, we proposed a relative drop rate of accuracy metrics to evaluate the transferability. The model exhibits improved spatial transferability, demonstrating its effectiveness in generating accurate land cover maps for different cities. Multi-temporal analysis reveals the potential of the proposed method for studying land cover and land use changes over time. In addition, a comparison of the state-of-the-art full semantic segmentation models indicates that spatial details are fully exploited and presented in semantic segmentation results by the proposed method.
... Cohen's kappa coefficient (k) was then calculated to obtain the probability of agreement between two coders. This probability is defined as the agreement statistic between two researchers corrected for chance (Cohen, 1960). As the result obtained was close to 1 (k = 0.86), the goodness of fit of this statistic was considered. ...
Article
Full-text available
In recent years, new technologies have made it possible to reproduce cultural content through new social media tools, thus ensuring the development of cultural heritage on a global scale, but museums have not always seen the introduction of these media in their strategies in a positive way. This article focuses on the analysis of public engagement with the collections of the five most reputable museums in Europe through the visual social media platform, Instagram. The study explores public engagement through a mixed-methods approach, with data mining using the Fan Page Karma monitoring tool. The findings show the value of active listening and interaction with user-generated content as a key component of reputation and image, reflecting the importance of two-way communication. The research may also be useful in the future to help improve strategies in the digital ecosystems of museum institutions.
... Interrater reliability for the reading of each imaging was measured using Cohen κ coefficient. 15 Except for the primary outcome (1-tailed test), all statistical tests were 2-tailed, and P values less than .05 were considered statistically significant. ...
Article
Full-text available
Importance Whether F18-choline (FCH) positron emission tomographic (PET)/computed tomographic (CT) scan can replace Tc99m-sestaMIBI (MIBI) single-photon emission (SPE)CT/CT as a first-line imaging technique for preoperative localization of parathyroid adenomas (PTA) in patients with primary hyperparathyroidism (PHPT) is unclear. Objective To compare first-line FCH PET/CT vs MIBI SPECT/CT for optimal care in patients with PHPT needing parathyroidectomy and to compare the proportions of patients in whom the first-line imaging method resulted in successful minimally invasive parathyroidectomy (MIP) and normalization of calcemia 1 month after surgery. Design, Setting, and Participants A French multicenter randomized open diagnostic intervention phase 3 trial was conducted. Patients were enrolled from November 2019 to May 2022 and participated up to 6 months after surgery. The study included adults with PHPT and an indication for surgical treatment. Patients with previous parathyroid surgery or multiple endocrine neoplasia type 1 (MEN1) were ineligible. Interventions Patients were assigned in a 1:1 ratio to receive first-line FCH PET/CT (FCH1) or MIBI SPECT/CT (MIBI1). In the event of negative or inconclusive first-line imaging, they received second-line FCH PET/CT (FCH2) after MIBI1 or MIBI SPECT/CT (MIBI2) after FCH1. All patients underwent surgery under general anesthesia within 12 weeks following the last imaging. Clinical and biologic (serum calcemia and parathyroid hormone levels) assessments were performed 1 and 6 months after surgery. Main Outcomes and Measures The primary outcome was a true-positive first-line imaging-guided MIP combined with uncorrected serum calcium levels of 2.55 mmol/l or less 1 month after surgery, corresponding to the local upper limit of normality. Results Overall, 57 patients received FCH1 (n = 29) or MIBI1 (n = 28). The mean (SD) age of patients was 62.8 (12.5) years with 15 male (26%) and 42 female (74%) patients. Baseline patient characteristics were similar between groups. Normocalcemia at 1 month after positive first-line imaging-guided MIP was observed in 23 of 27 patients (85%) in the FCH1 group and 14 of 25 patients (56%) in the MIBI1 group. Sensitivity was 82% (95% CI, 62%-93%) and 63% (95% CI, 42%-80%) for FCH1 and MIBI1, respectively. Follow-up at 6 months with biochemical measures was available in 43 patients, confirming that all patients with normocalcemia at 1 month after surgery still had it at 6 months. No adverse events related to imaging and 4 adverse events related to surgery were reported. Conclusions This randomized clinical trial found that first-line FCH PET/CT is a suitable and safe replacement for MIBI SPECT/CT. FCH PET/CT leads more patients with PHPT to correct imaging-guided MIP and normocalcemia than MIBI SPECT/CT thanks to its superior sensitivity. Trial Registration ClinicalTrials.gov Identifier: NCT04040946
... During training, we performed hyperparameter tuning with 4-fold cross-validation. Grid search was used to find the optimal settings, with tuning focused on maximizing Cohen's kappa [21]. Table 1 reports the time window where the model performed best. ...
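A minimal sketch of such a kappa-driven grid search with 4-fold cross-validation is shown below; the excerpt does not name the classifier or the parameter grid, so the estimator and grid here are placeholder assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Placeholder estimator and grid; only the 4-fold CV and the kappa-based
# scoring mirror the procedure described in the excerpt.
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    scoring=make_scorer(cohen_kappa_score),  # tune for Cohen's kappa
    cv=4,                                    # 4-fold cross-validation
)
# search.fit(X_windows, y_load)  # hypothetical per-window features and cognitive-load labels
```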
Preprint
Full-text available
Learning analytics has begun to use physiological signals because these have been linked with learners' cognitive and affective states. These signals, when interpreted through machine learning techniques, offer a nuanced understanding of the temporal dynamics of student learning experiences and processes. However, there is a lack of clear guidance on the optimal time window to use for analyzing physiological signals within predictive models. We conducted an empirical investigation of different time windows (ranging from 60 to 210 seconds) when analysing multichannel physiological sensor data for predicting cognitive load. Our results demonstrate a preference for longer time windows, with optimal window length typically exceeding 90 seconds. These findings challenge the conventional focus on immediate physiological responses, suggesting that a broader temporal scope could provide a more comprehensive understanding of cognitive processes. In addition, the variation in which time windows best supported prediction across classifiers underscores the complexity of integrating physiological measures. Our findings provide new insights for developing educational technologies that more accurately reflect and respond to the dynamic nature of learner cognitive load in complex learning environments.
... Agreement between tests was evaluated relying on Cohen's kappa statistics (Watson & Petrie, 2010). This method requires nominal variables with only two mutually exclusive categories, and returns as output: (i) a coefficient on the level of agreement (the k coefficient) (Cohen, 1960); (ii) the p value for statistical significance; (iii) a square contingency table reporting the frequency distribution of the different categories. Values on the main diagonal of the contingency table report the number of times the two tests provide the same categorical outcome for the same child, while off-diagonal elements report the number of times the tests disagree. ...
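A minimal sketch of that calculation from a 2 x 2 contingency table follows; the counts are hypothetical, and the excerpt's full output also includes a p value, which is omitted here.

```python
import numpy as np

# Hypothetical 2x2 contingency table for two tests with mutually exclusive
# categories; rows = test A outcome, columns = test B outcome.
# Diagonal cells count the children both tests classify the same way.
table = np.array([[12, 3],
                  [4, 20]])

n = table.sum()
p_observed = np.trace(table) / n                              # agreement on the main diagonal
p_expected = (table.sum(axis=1) @ table.sum(axis=0)) / n**2   # chance agreement from the margins
kappa = (p_observed - p_expected) / (1 - p_expected)
print(round(kappa, 2))
```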
Article
Full-text available
A growing number of primary school students experience difficulties with grapho-motor skills involved in handwriting, which impact both form and content of their texts. Therefore, it is important to assess and monitor handwriting skills in primary school via standardized tests and detect specific grapho-motor parameters (GMPs) which impact handwriting legibility. Multiple standardized tools are available to assess grapho-motor skills in primary school, yet little is known on between-test agreement, on impact of specific GMPs on children’s overall performance and on which GMPs may be specifically hard to tackle for children that are starting to consolidate their handwriting skills. These data would be extremely relevant for clinicians, therapists and educators, who have to choose among different assessment tools as well as design tailored intervention strategies to reach adequate performance on different GMPs in cases of poor handwriting. To gain better understanding of currently available standardized tools, we compared overall performance of 39 Italian primary school children (19 second graders and 20 third graders) experiencing difficulties with handwriting on three standardized tests for grapho-motor skills assessment and explored the impact of individual GMPs on child performance. Results showed some agreement between tests considering all children in our sample, but no agreement in second grade and only limited agreement in third grade. Data also allowed highlighting significant correlations between some GMP scores and children’s overall performance in our sample. Finally, children in our sample appeared to experience specific difficulties with some GMPs, such as letter joins and alignment.
... Fleiss's kappa (64) assessed the degree of agreement over and above what would be expected by chance. This variant on the more familiar Cohen's kappa (65) is used in cases of more than two raters. While there are no generally accepted guidelines for a desirable level of either form of kappa, some healthcare researchers have proposed values from 0.41-0.60 as "moderate," 0.61-0.80 as "good," and 0.81-1.00 ...
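For the more-than-two-raters case described here, Fleiss' kappa can be computed with statsmodels; the ratings below are invented for illustration.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings: rows = subjects, columns = raters, values = category labels.
# Fleiss' kappa extends chance-corrected agreement to more than two raters.
ratings = np.array([
    [0, 0, 0],
    [1, 1, 0],
    [2, 2, 2],
    [1, 1, 1],
    [0, 1, 0],
])
table, _ = aggregate_raters(ratings)         # subjects x categories count table
print(fleiss_kappa(table, method="fleiss"))
```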
Article
Full-text available
Purpose The present study examines how the coronavirus disease 2019 (COVID-19) experience affected values and priorities. Methods This cross-sectional study collected data between January and April 2023, from 1,197 individuals who are chronically ill or part of a general population sample. Using open-ended prompts and closed-ended questions, we investigated individuals’ perceptions about COVID-19-induced changes in what quality of life means to them, what and who are important, life focus, and changes in norms and stressors. Data analyses included content and psychometric analysis, leading to latent profile analysis (LPA) to characterize distinct groups, and analysis of variance and chi-squared to compare profile groups’ demographic characteristics. Results About 75% of the study sample noted changes in values and/or priorities, particularly in the greater prominence of family and friends. LPA yielded a four-profile model that fit the data well. Profile 1 (Index group; 64% of the sample) had relatively average scores on all indicators. Profile 2 (COVID-Specific Health & Resignation to Isolation Attributable to COVID-19; 5%) represented COVID-19-specific preventive health behaviors along with noting the requisite isolation and disengagement entailed in the social distancing necessary for COVID-19 prevention. Profile 3 (High Stress, Low Trust; 25%) represented high multi-domain stress, with the most elevated scores both on focusing on being true to themselves and perceiving people to be increasingly uncivil. Profile 4 (Active in the World, Low Trust; 6%) was focused on returning to work and finding greater meaning in their activities. These groups differed on race, marital status, difficulty paying bills, employment status, number of times they reported having had COVID-19, number of COVID-19 boosters received, whether they had Long COVID, age, BMI, and number of comorbidities. Conclusion Three years after the beginning of the worldwide COVID-19 pandemic, its subjective impact is notable on most study participants’ conceptualization of quality of life, priorities, perspectives on social norms, and perceived stressors. The four profile groups reflected distinct ways of dealing with the long-term effects of COVID-19.
... However, summarizing the agreement as a single or a few numbers makes the result easier to interpret, especially when there are many categories. Map-comparison measures based on such contingency tables, such as the broadly used Cohen's kappa (κ) (Cohen, 1960) or the more recent quantity-and-allocation agreement (Pontius Jr & Millones, 2011), both consider the percentage of pixels of the map attributed to the same category in two maps and take into account the likelihood of agreement occurring by chance. ...
Article
Full-text available
Biomes are large‐scale ecosystems occupying large spaces. The biome concept should theoretically facilitate scientific synthesis of global‐scale studies of the past, present, and future biosphere. However, there is neither a consensus biome map nor universally accepted definition of terrestrial biomes, making joint interpretation and comparison of biome‐related studies difficult. “Desert,” “rainforest,” “tundra,” “grassland,” or “savanna,” while widely used terms in common language, have multiple definitions and no universally accepted spatial distribution. Fit‐for‐purpose classification schemes are necessary, so multiple biome‐mapping methods should for now co‐exist. In this review, we compare biome‐mapping methods, first conceptually, then quantitatively. To facilitate the description of the diversity of approaches, we group the extant diversity of past, present, and future global‐scale biome‐mapping methods into three main families that differ by the feature captured, the mapping technique, and the nature of observation used: (1) compilation biome maps from expert elicitation, (2) functional biome maps from vegetation physiognomy, and (3) simulated biome maps from vegetation modeling. We design a protocol to measure and quantify spatially the pairwise agreement between biome maps. We then illustrate the use of such a protocol with a real‐world application by investigating the potential ecological drivers of disagreement between four broadly used, modern global biome maps. In this example, we quantify that the strongest disagreement among biome maps generally occurs in landscapes altered by human activities and moderately covered by vegetation. Such disagreements are sources of bias when combining several biome classifications. When aiming to produce realistic biome maps, biases could be minimized by promoting schemes using observations rather than predictions, while simultaneously considering the effect of humans and other ecosystem engineers in the definition. Throughout this review, we provide comparison and decision tools to navigate the diversity of approaches to encourage a more effective use of the biome concept.
... 24 Secondary analyses were performed in subgroups based on gender (women, men) and underlying cirrhosis etiology (viral, non-viral, post-sustained virological response [SVR], active HCV). Lastly, we calculated the concordance and Cohen's concordance coefficient 25 between the GALAD and HES V2.0 scores among patients who developed HCC. All analyses were conducted using R version 4.0.3. ...
Article
Full-text available
Background The original hepatocellular carcinoma early detection screening (HES) score, which combines alpha-fetoprotein (AFP) with age, alanine aminotransferase, and platelets, has better performance than AFP alone for early HCC detection. We have developed HES V2.0 by adding AFP-L3 and des-gamma-carboxy prothrombin to the score and compared its performance to GALAD and ASAP scores among patients with cirrhosis. Methods We conducted a prospective-specimen collection, retrospective-blinded-evaluation phase 3 biomarker cohort study in patients with cirrhosis enrolled in imaging and AFP surveillance. True-positive rate (TPR)/sensitivity and false-positive rate for any or early HCC were calculated for GALAD, ASAP, and HES V2.0 scores within 6, 12, and 24 months of HCC diagnosis. We calculated the AUROC curve and estimated TPR based on an optimal threshold at a fixed false-positive rate of 10%. Results We analyzed 2331 patients, of whom 125 developed HCC (71% in the early stages). For any HCC, HES V2.0 had higher TPR than GALAD overall (+7.2%), at 6 months (+3.6%), at 12 months (+7.2%), and 24 months (+13.0%) before HCC diagnosis. HES V2.0 had higher TPR than ASAP for all time points (+5.9% to +12.0%). For early HCC, HES V2.0 had higher sensitivity/TPR than GALAD overall (+6.7%), at 12 months (+6.3%), and 24 months (+14.6%) but not at 6 months (+0.0%) and higher than ASAP for all time points (+13.4% to +18.0%). Conclusions In a prospective cohort study, HES V2.0 had a significantly higher performance for identifying new HCC, including early stage, than GALAD or ASAP.
Article
Full-text available
Introduction Motor vehicular trauma, bite wounds, high-rise syndrome, and trauma of unknown origin are common reasons cats present to the emergency service. In small animals, thoracic injuries are often associated with trauma. The objective of this retrospective study was to evaluate limits of agreement (LOA) between thoracic point-of-care ultrasound (thoracic POCUS) and thoracic radiography (TXR), and to correlate thoracic POCUS findings to animal trauma triage (ATT) scores and subscores in a population of cats suffering from recent trauma. Methods Cats that had thoracic POCUS and TXR performed within 24 h of admission for suspected/witnessed trauma were retrospectively included. Thoracic POCUS and TXR findings were assessed as “positive” or “negative” based on the presence or absence of injuries. Cats positive on thoracic POCUS and TXR were assigned 1 to 5 tentative diagnoses: pulmonary contusions/hemorrhage, pneumothorax, pleural effusion, pericardial effusion, and diaphragmatic hernia. When available ATT scores were calculated. To express LOA between the two imaging modalities a kappa coefficient and 95% CI were calculated. Interpretation of kappa was based on Cohen values. Results One hundred and eleven cats were included. 83/111 (74.4%) cats were assessed as positive based on thoracic POCUS and/or TXR. Pulmonary contusion was the most frequent diagnosis. The LOA between thoracic POCUS and TXR were moderate for all combined injuries, moderate for pulmonary contusions/hemorrhage, pneumothorax, diaphragmatic hernia, and fair for pleural effusion. Cats with positive thoracic POCUS had significantly higher median ATT scores and respiratory subscores compared to negative thoracic POCUS cats. Discussion The frequency of detecting intrathoracic lesions in cats was similar between thoracic POCUS and TXR with fair to moderate LOA, suggesting thoracic POCUS is useful in cats suffering from trauma. Thoracic POCUS may be more beneficial in cats with higher ATT scores, particularly the respiratory score.
Article
Background The methods previously proposed in the literature to assess patients with rotator cuff related shoulder pain, based on special orthopedic tests to precisely identify the structure causing the shoulder symptoms, have been recently challenged. This opens the possibility of a different way of physical examination. Objective To analyze the differences in shoulder range of motion, strength and thoracic kyphosis between rotator cuff related shoulder pain patients and an asymptomatic group. Method The protocol of the present research was registered in the International Prospective Register of Systematic Reviews (PROSPERO) (registration number CRD42021258924). A database search of observational studies was conducted in MEDLINE, EMBASE, WOS and CINAHL until July 2023 for studies that assessed shoulder or neck neuro-musculoskeletal non-invasive physical examination compared to an asymptomatic group. Two investigators assessed eligibility and study quality. The Newcastle Ottawa Scale was used to evaluate the methodological quality. Results Eight studies (N = 604) were selected for the quantitative analysis. Meta-analysis showed statistical differences with large effect for shoulder flexion (I² = 91.7%, p < 0.01, HG = −1.30), external rotation (I² = 83.2%, p < 0.01, HG = −1.16) and internal rotation range of motion (I² = 0%, p < 0.01, HG = −1.32). Regarding shoulder strength, only internal rotation strength showed statistical differences with small effect (I² = 42.8%, p < 0.05, HG = −0.3). Conclusions There is moderate to strong evidence that patients with rotator cuff related shoulder pain present less shoulder flexion, internal and external rotation range of motion and less internal rotation strength than asymptomatic individuals.
Article
Full-text available
Objective To conduct a meta-analytic review of psychosocial predictors of doping intention, doping use and inadvertent doping in sport and exercise settings. Design Systematic review and meta-analysis. Data sources Scopus, Medline, Embase, PsycINFO, CINAHL Plus, ProQuest Dissertations/Theses and Open Grey. Eligibility criteria Studies (of any design) that measured the outcome variables of doping intention, doping use and/or inadvertent doping and at least one psychosocial determinant of those three variables. Results We included studies from 25 experiments (N=13 586) and 186 observational samples (N=309 130). Experimental groups reported lower doping intentions ( g =−0.21, 95% CI (−0.31 to −0.12)) and doping use ( g =−0.08, 95% CI (−0.14 to −0.03)), but not inadvertent doping ( g =−0.70, 95% CI (−1.95 to 0.55)), relative to comparators. For observational studies, protective factors were inversely associated with doping intentions ( z =−0.28, 95% CI −0.31 to −0.24), doping use ( z =−0.09, 95% CI −0.13 to −0.05) and inadvertent doping ( z =−0.19, 95% CI −0.32 to −0.06). Risk factors were positively associated with doping intentions ( z =0.29, 95% CI 0.26 to 0.32) and use ( z =0.17, 95% CI 0.15 to 0.19), but not inadvertent doping ( z =0.08, 95% CI −0.06 to 0.22). Risk factors for both doping intentions and use included prodoping norms and attitudes, supplement use, body dissatisfaction and ill-being. Protective factors for both doping intentions and use included self-efficacy and positive morality. Conclusion This study identified several protective and risk factors for doping intention and use that may be viable intervention targets for antidoping programmes. Protective factors were negatively associated with inadvertent doping; however, the empirical volume is too limited to draw firm conclusions.
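The experimental comparisons above are summarised with Hedges' g, a bias-corrected standardised mean difference. As a reference, here is a minimal Python sketch of the usual computation from group means, SDs, and sample sizes; all input values below are invented illustration numbers, not data from the review.

```python
import math

def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Hedges' g: bias-corrected standardised mean difference between two groups."""
    # Pooled standard deviation
    sp = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / sp                       # Cohen's d
    j = 1 - 3 / (4 * (n1 + n2) - 9)          # small-sample correction factor
    return j * d

# Hypothetical example: intervention group vs comparator on a doping-intention scale
print(round(hedges_g(m1=2.1, sd1=1.0, n1=120, m2=2.4, sd2=1.1, n2=115), 2))
```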
Article
Full-text available
Automated medical image analysis systems often require large amounts of training data with high-quality labels, which are difficult and time-consuming to generate. This paper introduces Radiology Object in COntext version 2 (ROCOv2), a multimodal dataset consisting of radiological images and associated medical concepts and captions extracted from the PMC Open Access subset. It is an updated version of the ROCO dataset published in 2018, and adds 35,705 new images that have been added to PMC since 2018. It further provides manually curated concepts for imaging modalities, with additional anatomical and directional concepts for X-rays. The dataset consists of 79,789 images and has been used, with minor modifications, in the concept detection and caption prediction tasks of ImageCLEFmedical Caption 2023. The dataset is suitable for training image annotation models based on image-caption pairs, or for multi-label image classification using the Unified Medical Language System (UMLS) concepts provided with each image. In addition, it can serve for pre-training medical domain models and for evaluating deep learning models in multi-task learning.
Article
Introduction: During the COVID-19 pandemic, access to dental treatment by persons deprived of their liberty (PPL) was affected by dentist-patient proximity and the risk of aerosol generation during dental procedures and treatments. The risks of infection for oral health personnel are considered high, mainly from cross-infection between patients. Objectives: To differentiate between true and false dental consultation emergencies during the SARS-CoV-2 outbreak for better and more effective screening of inmates of the Social Rehabilitation Center (CERESO) of San Francisco Kobén (Campeche, Mexico). Material and method: An observational, cross-sectional, descriptive, and prospective study was designed for a sample of 100 inmates of the CERESO San Francisco Kobén. Data were collected in the prison’s dental office, and the participants signed a letter of informed consent to be voluntarily included in the study during the SARS-CoV-2 outbreak. The questionnaire “Assessment of a true Dental Emergency”, previously validated for the Mexican population, was applied; the personnel were standardized, and an intra-examiner and inter-examiner reliability of k = 0.98 was obtained. To prepare the database and analyze the information collected, the Statistical Package for Social Science v. 21 (SPSS v.21) was used. Results: When evaluating emergencies at the dental clinic, 84% were determined according to the instrument to be false emergencies and 16% to be true emergencies. Discussion: In the population of the CERESO of San Francisco Kobén, the figures for medical-dental care show that inmates face a proportionally low dental morbidity-mortality.
Preprint
Statute retrieval aims to find relevant statutory articles for specific queries. This process is the basis of a wide range of legal applications such as legal advice, automated judicial decisions, legal document drafting, etc. Existing statute retrieval benchmarks focus on formal and professional queries from sources like bar exams and legal case documents, thereby neglecting non-professional queries from the general public, which often lack precise legal terminology and references. To address this gap, we introduce the STAtute Retrieval Dataset (STARD), a Chinese dataset comprising 1,543 query cases collected from real-world legal consultations and 55,348 candidate statutory articles. Unlike existing statute retrieval datasets, which primarily focus on professional legal queries, STARD captures the complexity and diversity of real queries from the general public. Through a comprehensive evaluation of various retrieval baselines, we reveal that existing retrieval approaches all fall short of these real queries issued by non-professional users. The best method only achieves a Recall@100 of 0.907, suggesting the necessity for further exploration and additional research in this area. All the codes and datasets are available at: https://github.com/oneal2000/STARD/tree/main
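Recall@100, as reported above, is the fraction of relevant statutory articles that appear among a system's top 100 retrieved candidates, averaged over queries. A minimal sketch of that metric follows; the query runs and statute IDs are hypothetical, not STARD data.

```python
def recall_at_k(retrieved, relevant, k=100):
    """Fraction of relevant items found in the top-k retrieved list for one query."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)

def mean_recall_at_k(runs, k=100):
    """Average Recall@k over all queries; `runs` is a list of (retrieved, relevant) pairs."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)

# Hypothetical example with two queries over statute IDs
runs = [(["s12", "s7", "s90"], ["s7"]),           # the single relevant article is retrieved
        (["s3", "s44", "s18"], ["s99", "s18"])]   # one of two relevant articles is retrieved
print(mean_recall_at_k(runs, k=100))  # 0.75
```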
Article
The aim of this study is to investigate whether the problems in the instructional materials of the middle school mathematics applications course are suitable as mathematical modelling problems. Document analysis, a qualitative research method, was used. A total of 149 problems from the fifth-, sixth-, seventh-, and eighth-grade textbooks used as instructional materials for the mathematics applications course were examined. Deductive and inductive methods were used together in the data analysis, and it was found that more than half of the problems in these materials involve realistic situations. However, only a small proportion of these problems satisfy the authenticity criterion required of modelling problems. Similarly, only a few problems meet the openness criteria. When modelling sub-competencies are examined, it is notable that most of the problems require a mathematical solution. The great majority of the problems do not involve the simplification, interpretation, and validation stages. Interest in mathematical modelling has increased with the renewed mathematics applications curriculum, but this has not been reflected in the instructional materials. It is recommended that the instructional materials be revised in line with the criteria considered.
Article
Full-text available
In today’s world, population growth and rapid urbanization lead to significant changes in land use, often resulting in the conversion or destruction of natural landscapes. This highlights the pressing need to monitor these changes to ensure the sustainability of ecosystems. However, acquiring accurate environmental data presents several challenges, and issues like data imbalance add further complexity to classification efforts. This study proposes an innovative approach that combines classification and regression using a Long Short-Term Memory (LSTM) neural network, integrated with Cellular Automata (CA). To tackle the challenge of imbalanced samples, a class weight function is introduced to the LSTM model. During training, this function assigns higher weights to samples from the minority class and lower weights to those from the majority class. The results demonstrate the effectiveness of this model, achieving an accuracy of 91.5%, precision of 94%, recall of 91.5%, F1-score of 92%, and Kappa of 73.5%. Compared to a Markov model used for comprehensive evaluation, the proposed model shows significant improvement, with a 25.5% increase in accuracy, a 27% increase in precision, and a 28.5% increase in Kappa. This underscores the capability of the proposed model to accurately predict land use changes in the studied area.
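The class-weight idea described above (up-weighting minority-class samples in the training loss) can be sketched independently of the paper's exact LSTM-CA architecture. Below is a minimal Python illustration using inverse-frequency ("balanced") weights, which is one common choice and only an assumption here; the land-use labels are invented, not the study's data.

```python
import numpy as np

def balanced_class_weights(labels):
    """Inverse-frequency class weights: w_c = n_samples / (n_classes * n_c)."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Hypothetical pixel labels: 0 = built-up (minority), 1 = vegetation, 2 = bare land
y = np.array([0] * 50 + [1] * 400 + [2] * 550)
print(balanced_class_weights(y))
# roughly {0: 6.67, 1: 0.83, 2: 0.61}: the minority class gets the largest weight
```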
Article
Background/Aims Self-reported questionnaires on health status after randomized trials can be time-consuming, costly, and potentially unreliable. Administrative data sets may provide cost-effective, less biased information, but it is uncertain how administrative and self-reported data compare to identify chronic conditions in a New Zealand cohort. This study aimed to determine whether record linkage could replace self-reported questionnaires to identify chronic conditions that were the outcomes of interest for trial follow-up. Methods Participants in 50-year follow-up of a randomized trial were asked to complete a questionnaire and to consent to accessing administrative data. The proportion of participants with diabetes, pre-diabetes, hyperlipidaemia, hypertension, mental health disorders, and asthma was calculated using each data source and agreement between data sources assessed. Results Participants were aged 49 years (SD = 1, n = 424, 50% male). Agreement between questionnaire and administrative data was slight for pre-diabetes (kappa = 0.10), fair for hyperlipidaemia (kappa = 0.27), substantial for diabetes (kappa = 0.65), and moderate for other conditions (all kappa >0.42). Administrative data alone identified two to three times more cases than the questionnaire for all outcomes except hypertension and mental health disorders, where the questionnaire alone identified one to two times more cases than administrative data. Combining all sources increased case detection for all outcomes. Conclusions A combination of questionnaire, pharmaceutical, and laboratory data with expert panel review were required to identify participants with chronic conditions of interest in this follow-up of a clinical trial.
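The verbal labels above (slight, fair, moderate, substantial) follow the conventional kappa interpretation bands commonly attributed to Landis and Koch. The sketch below maps a kappa value to those widely cited cut-points; the exact band edges are an assumption insofar as the study may have applied slightly different thresholds.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the commonly used Landis-and-Koch-style verbal labels."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"

# Values reported in the study above
for name, k in [("pre-diabetes", 0.10), ("hyperlipidaemia", 0.27), ("diabetes", 0.65)]:
    print(f"{name}: kappa = {k:.2f} -> {interpret_kappa(k)}")
```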
Article
We previously developed a computer-assisted image analysis algorithm to detect and quantify the microscopic features of rodent progressive cardiomyopathy (PCM) in rat heart histologic sections and validated the results with a panel of five veterinary toxicologic pathologists using a multinomial logistic model. In this study, we assessed both the inter-rater and intra-rater agreement of the pathologists and compared pathologists’ ratings to the artificial intelligence (AI)-predicted scores. Pathologists and the AI algorithm were presented with 500 slides of rodent heart. They quantified the amount of cardiomyopathy in each slide. A total of 200 of these slides were novel to this study, whereas 100 slides were intentionally selected for repetition from the previous study. After a washout period of more than six months, the repeated slides were examined to assess intra-rater agreement among pathologists. We found the intra-rater agreement to be substantial, with weighted Cohen’s kappa values ranging from k = 0.64 to 0.80. Intra-rater variability is not a concern for the deterministic AI. The inter-rater agreement across pathologists was moderate (Cohen’s kappa k = 0.56). These results demonstrate the utility of AI algorithms as a tool for pathologists to increase sensitivity and specificity for the histopathologic assessment of the heart in toxicology studies.
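Weighted kappa, used above for ordinal cardiomyopathy scores, penalises large disagreements more heavily than near-misses. A minimal sketch with scikit-learn's `cohen_kappa_score` follows; the two rating vectors are invented for illustration, and linear weights are an assumption, since the paper's exact weighting scheme is not stated in the abstract.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ordinal severity grades (0-4) for the same slides, read twice
first_read  = [0, 1, 1, 2, 3, 4, 2, 0, 1, 3]
second_read = [0, 1, 2, 2, 3, 3, 2, 0, 1, 4]

unweighted = cohen_kappa_score(first_read, second_read)
weighted   = cohen_kappa_score(first_read, second_read, weights="linear")

# With ordinal data, near-miss disagreements (e.g. 3 vs 4) are penalised less under
# linear weighting, so the weighted value is typically at least as high as the unweighted one.
print(f"unweighted kappa = {unweighted:.2f}, linear-weighted kappa = {weighted:.2f}")
```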
Article
Full-text available
In agile requirements engineering, Generating Acceptance Criteria (GAC) to elaborate user stories plays a pivotal role in the sprint planning phase, providing a reference for delivering functional solutions. GAC requires extensive collaboration and human involvement. However, the lack of labeled datasets tailored for User Stories with attached Acceptance Criteria (US-AC) poses significant challenges for supervised learning techniques attempting to automate this process. Recent advancements in Large Language Models (LLMs) have showcased their remarkable text-generation capabilities, bypassing the need for supervised fine-tuning. Consequently, LLMs offer the potential to overcome this challenge. Motivated by this, we propose SimAC, a framework leveraging LLMs to simulate agile collaboration, with three distinct role groups: requirement analyst, quality analyst, and others. Initiated by role-based prompts, LLMs act in these roles sequentially, following a create-update-update paradigm in GAC. Owing to the unavailability of ground truths, we invited practitioners to build a gold standard serving as a benchmark to evaluate the completeness and validity of auto-generated US-AC against human-crafted ones. Additionally, we invited eight experienced agile practitioners to evaluate the quality of US-AC using the INVEST framework. The results demonstrate consistent improvements across all tested LLMs, including the LLaMA and GPT-3.5 series. Notably, SimAC significantly enhances the ability of gpt-3.5-turbo in GAC, achieving improvements of 29.48% in completeness and 15.56% in validity, along with the highest INVEST satisfaction score of 3.21/4. Furthermore, this study provides case studies to illustrate SimAC’s effectiveness and limitations, shedding light on the potential of LLMs in automated agile requirements engineering.
Article
Full-text available
Application of standardised and automated assessments of head computed tomography (CT) for neuroprognostication after out-of-hospital cardiac arrest. Prospective, international, multicentre, observational study within the Targeted Hypothermia versus Targeted Normothermia after out-of-hospital cardiac arrest (TTM2) trial. Routine CTs from adult unconscious patients obtained > 48 h and ≤ 7 days post-arrest were assessed qualitatively and quantitatively by seven international raters blinded to clinical information using a pre-published protocol. Grey–white-matter ratio (GWR) was calculated from four (GWR-4) and eight (GWR-8) regions of interest manually placed at the basal ganglia level. Additionally, GWR was obtained using an automated atlas-based approach. Prognostic accuracies for the prediction of poor functional outcome (modified Rankin Scale 4–6) were calculated for the qualitative assessment and for the pre-defined GWR cutoff < 1.10. 140 unconscious patients were included; median age was 68 years (interquartile range [IQR] 59–76), 76% were male, and 75% had poor outcome. Standardised qualitative assessment and all GWR models predicted poor outcome with 100% specificity (95% confidence interval [CI] 90–100). Median sensitivity was 37% for the standardised qualitative assessment, 39% for GWR-8, 30% for GWR-4, and 41% for automated GWR. GWR-8 was superior to GWR-4 regarding prognostic accuracy and intra- and interrater agreement. Overall prognostic accuracy for automated GWR (area under the curve [AUC] 0.84, 95% CI 0.77–0.91) did not significantly differ from that of manually obtained GWR. Standardised qualitative and quantitative assessments of CT are reliable and feasible methods to predict poor functional outcome after cardiac arrest. Automated GWR has the potential to make CT quantification for neuroprognostication accessible to all centres treating cardiac arrest patients.
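In general terms, the grey-white-matter ratio referred to above is the mean CT attenuation (Hounsfield units) of grey-matter regions of interest divided by that of white-matter regions. The sketch below illustrates that calculation and the pre-defined cutoff with invented HU values, not TTM2 data, and the ROI names are only examples.

```python
import numpy as np

def grey_white_ratio(grey_rois_hu, white_rois_hu):
    """GWR = mean attenuation of grey-matter ROIs / mean attenuation of white-matter ROIs."""
    return np.mean(grey_rois_hu) / np.mean(white_rois_hu)

# Hypothetical mean HU values for four grey-matter and four white-matter ROIs (a GWR-8-style model)
grey_hu  = [34.1, 33.5, 35.0, 34.4]   # e.g. caudate and putamen, both sides (illustrative)
white_hu = [28.8, 29.2, 28.5, 29.0]   # e.g. internal capsule and corpus callosum (illustrative)

gwr = grey_white_ratio(grey_hu, white_hu)
print(f"GWR = {gwr:.2f}, below cutoff (< 1.10): {gwr < 1.10}")
```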
Article
Full-text available
Background Children with affective dysregulation (AD) show an excessive reactivity to emotionally positive or negative stimuli, typically manifesting in chronic irritability, severe temper tantrums, and sudden mood swings. AD shows a large overlap with externalizing and internalizing disorders. Given its transdiagnostic nature, AD cannot be reliably and validly captured only by diagnostic categories such as disruptive mood dysregulation disorder (DMDD). Therefore, this study aimed to evaluate two semi-structured clinical interviews—one for parents and one for children. Methods Both interviews were developed based on existing measures that capture particular aspects of AD. We analyzed internal consistencies and interrater agreement to evaluate their reliability. Furthermore, we analyzed factor loadings in an exploratory factor analysis, differences in interview scores between children with and without co-occurring internalizing and externalizing disorders, and associations with other measures of AD and of AD-related constructs. The evaluation was performed in a screened community sample of children aged 8–12 years ( n = 445). Interrater reliability was additionally analyzed in an outpatient sample of children aged 8–12 years ( n = 27). Results Overall, internal consistency was acceptable to good. In both samples, we found moderate to excellent interrater reliability on a dimensional level. Interrater agreement for the dichotomous diagnosis DMDD was substantial to perfect. In the exploratory factor analysis, almost all factor loadings were acceptable. Children with a diagnosis of disruptive disorder, attention-deficit/hyperactivity disorder, or any disorder (disruptive disorder, attention-deficit/hyperactivity disorder, and depressive disorder) showed higher scores on the DADYS interviews than children without these disorders. The correlation analyses revealed the strongest associations with other measures of AD and measures of AD-specific functional impairment. Moreover, we found moderate to very large associations with internalizing and externalizing symptoms and moderate to large associations with emotion regulation strategies and health-related quality of life. Conclusions The analyses of internal consistency and interrater agreement support the reliability of both clinical interviews. Furthermore, exploratory factor analysis, discriminant analyses, and correlation analyses support the interviews’ factorial, discriminant, concurrent, convergent, and divergent validity. The interviews might thus contribute to the reliable and valid identification of children with AD and the assessment of treatment responses. Trial registration ADOPT Online: German Clinical Trials Register (DRKS) DRKS00014963. Registered 27 June 2018.
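Internal consistency, as reported above, is typically quantified with Cronbach's alpha; the abstract does not name the exact coefficient, so alpha is an assumption here. A minimal Python sketch of the standard formula follows, with an invented item-score matrix rather than the DADYS data.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for an (n_respondents x n_items) score matrix."""
    x = np.asarray(item_scores, dtype=float)
    n_items = x.shape[1]
    item_vars = x.var(axis=0, ddof=1)        # variance of each item across respondents
    total_var = x.sum(axis=1).var(ddof=1)    # variance of the total (sum) score
    return n_items / (n_items - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical interview ratings: 5 children x 4 items scored 0-3
scores = [[1, 2, 1, 2],
          [0, 1, 0, 1],
          [3, 3, 2, 3],
          [2, 2, 2, 2],
          [1, 0, 1, 1]]
print(round(cronbach_alpha(scores), 2))
```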
Conference Paper
Full-text available
Recognizing the significance of a qualified workforce, research on skills and careers has become a focal point in education, economics, and placement. In this work we treat skills needs, the nature of a job, and the attractiveness of a career as dynamic variables that depend on many aspects such as geography, time, vocation, and aspiration. The purpose of this paper is to identify current trends and issues in research on career and technical education. The term career and technical education (CTE) is viewed from a broad perspective that includes workforce education, technical education, technical colleges, and community colleges. The results should allow researchers, practitioners, and policy makers to identify immediate and emerging research needs in career and technical education. Queries were constructed based on 546 students' data summaries available as training data (i.e., resumes). Performance was measured on a test dataset of various completed documents (questionnaires).
Article
Context: Tendon injuries are common disorders in both workers and athletes, potentially impacting performance in both groups. This is why the search for effective treatments continues. Objective(s): The objective of this study was to analyze whether the ultrasound-guided percutaneous needle electrolysis technique may be considered a procedure to reduce pain caused by tendinosis. Evidence Acquisition: The search strategy included PubMed, SCOPUS, CINAHL, the Physiotherapy Evidence Database, SciELO, and ScienceDirect up to February 25, 2024. Randomized clinical trials that assessed pain caused by tendinosis using the Visual Analog Scale and Numeric Rating Scale were included. The studies were evaluated for quality using the Cochrane Risk of Bias 2 tool, and the strength of evidence was assessed with GRADEpro GDT. Evidence Synthesis: Out of the 534 studies found, 8 were included in the review. A random-effects meta-analysis of standardized mean differences (SMD) was conducted. Ultrasound-guided percutaneous needle electrolysis proved to be effective in reducing pain caused by tendinosis in the overall outcome (SMD = −0.97; 95% CI, −1.26 to −0.68; I2 = 58%; low certainty of evidence) and in the short-term (SMD = −0.83; 95% CI, −1.29 to −0.38; I2 = 65%; low certainty of evidence), midterm (SMD = −1.28; 95% CI, −1.65 to −0.91; I2 = 0%; moderate certainty of evidence), and long-term (SMD = −0.94; 95% CI, −1.62 to −0.26; I2 = 71%; low certainty of evidence) subgroups. Conclusion(s): The application of the ultrasound-guided percutaneous needle electrolysis technique for reducing pain caused by tendinosis appears to be effective. However, due to the heterogeneity found (only partially explained), more studies are needed to define the appropriate dosimetry, the specific populations that may benefit most from the technique, and possible adverse events.
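The random-effects pooling of standardised mean differences summarised above can be sketched with the DerSimonian-Laird estimator, one common approach and only an assumption here, since the review does not state its exact estimator in the abstract. The per-study effects and variances below are invented.

```python
import numpy as np

def random_effects_pool(effects, variances):
    """DerSimonian-Laird random-effects pooled estimate, 95% CI, and I^2 (one common approach)."""
    y = np.asarray(effects, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v                                   # fixed-effect (inverse-variance) weights
    y_fixed = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - y_fixed) ** 2)            # Cochran's Q
    df = len(y) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)                 # between-study variance
    w_star = 1.0 / (v + tau2)                     # random-effects weights
    pooled = np.sum(w_star * y) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se), i2

# Hypothetical per-study SMDs and their variances
smd = [-1.1, -0.7, -1.3, -0.8]
var = [0.10, 0.08, 0.15, 0.12]
pooled, ci, i2 = random_effects_pool(smd, var)
print(f"pooled SMD = {pooled:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f}), I2 = {i2:.0f}%")
```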
Article
Aim This paper aims to assess the complexity, quality and outcome of endodontic treatment provided in Managed Clinical Networks (MCNs) in England to understand if we are “getting it right first time” (GIRFT). Methods In a convenience sample of endodontic treatments provided between May 2011 and April 2017, the complexity of teeth treated, the quality of the treatment procedure, the radiographic appearance of root fillings, and clinical and radiographic healing were retrospectively assessed using records taken as part of treatment. Trained, calibrated examiners independently scored radiographs using previously published scoring systems. Results 646 teeth were followed up for a mean of 24.7 months (standard deviation [SD] 17.08). The average age of the patients treated was 46.7 years (SD 15.38), with 48.3% being male. Of the teeth treated, 70.4% were of complexity level 3. 88.2% of teeth were asymptomatic, and 80% demonstrated complete radiographic healing. Procedural errors inhibited achieving the correct working length and taper, with more voids within root canal fillings. When patency filing was reported as being carried out, complete radiographic healing was more likely. Conclusions It is possible to collate outcome data in the NHS system, especially if there is provision for ongoing follow-up and time allocated for the collection of data. Endodontic treatment provided within primary and secondary care settings is of high quality, with outcomes being better when single operators carry out high volumes of endodontic treatment.
Article
Full-text available
The problems of psychophysics primarily involve scale construction. Each of the 5 types of scaling problems is outlined, i.e., the determination of nominal scales, ordinal scales, interval scales, logarithmic interval scales, and ratio scales, along with a description of the methodology appropriate to each problem. The applications of psychophysical methodology to problems of practical utility are briefly described.
Article
When populations are cross-classified with respect to two or more classifications or polytomies, questions often arise about the degree of association existing between the several polytomies. Most of the traditional measures or indices of association are based upon the standard chi-square statistic or on an assumption of underlying joint normality. In this paper a number of alternative measures are considered, almost all based upon a probabilistic model for activity to which the cross-classification may typically lead. Only the case in which the population is completely known is considered, so no question of sampling or measurement error appears. We hope, however, to publish before long some approximate distributions for sample estimators of the measures we propose, and approximate tests of hypotheses. Our major theme is that the measures of association used by an empirical investigator should not be blindly chosen because of tradition and convention only, although these factors may properly be given some weight, but should be constructed in a manner having operational meaning within the context of the particular problem.
Article
The author presents a discussion of the significance and role of mathematics and mathematical models in scientific investigation and especially in relation to psychological measurement. The major divisions of his discussion are entitled the mathematical model, numerals and measurements, psychophysics and psychophysical methods, probability, and measures and indicants. 67-item bibliography. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Guttman, L. "An Outline of the Statistical Theory of Prediction." In P. Horst (Ed.), The Prediction of Personal Adjustment. New York: Social Science Research Council, 1941.
Scott, W. A. "Reliability of Content Analysis: The Case of Nominal Scale Coding." Public Opinion Quarterly, XIX (1955), 321–325.
Stevens, S. S. "Mathematics, Measurement and Psychophysics." In S. S. Stevens (Ed.), Handbook of Experimental Psychology. New York: John Wiley & Sons, 1951.
Stevens, S. S. "Problems and Methods of Psychophysics." Psychological Bulletin, LV (1958), 177–196.