METRICS: Establishing a Preliminary Checklist to
Standardize Design and Reporting of Artificial
Intelligence-Based Studies in Healthcare
Malik Sallam, Muna Barakat, Mohammed Sallam
Submitted to: JMIR Medical Informatics
on: November 19, 2023
Disclaimer: © The authors. All rights reserved. This is a privileged document currently under peer-review/community
review. Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for
review purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this
stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.
(JMIR Preprints 19/11/2023:54704)
DOI: https://doi.org/10.2196/preprints.54704
Original Manuscript
METRICS: Establishing a Preliminary Checklist to Standardize
Design and Reporting of Artificial Intelligence-Based Studies in
Healthcare
Original Paper
Malik Sallam1,2,3*, Muna Barakat4,5, Mohammed Sallam6
1. Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The
University of Jordan, Amman, Jordan
2. Department of Clinical Laboratories and Forensic Medicine, Jordan University Hospital, Amman,
Jordan
3. Department of Translational Medicine, Faculty of Medicine, Lund University, Malmö, Sweden
4. Department of Clinical Pharmacy and Therapeutics, Faculty of Pharmacy, Applied Science Private
University, Amman, Jordan
5. MEU Research Unit, Middle East University, Amman, Jordan
6. Department of Pharmacy, Mediclinic Parkview Hospital, Mediclinic Middle East, Dubai, United
Arab Emirates
* Corresponding author: Malik Sallam, ORCID: 0000-0002-0165-9670
Address: Department of Clinical Laboratories and Forensic Medicine, Jordan University Hospital,
Queen Rania Al-Abdullah Street-Aljubeiha/P.O. Box: 13046, Amman, Jordan
Tel +962791845186
Fax +96265353388
Email: malik.sallam@ju.edu.jo
Abstract
Background: Adherence to evidence-based practice is indispensable in healthcare. Recently, the
utility of artificial intelligence (AI)-based models in healthcare has been evaluated extensively.
However, the lack of consensus guidelines for the design and reporting of findings in these studies poses
challenges to the interpretation and synthesis of evidence.
Objectives: To propose a preliminary framework forming the basis of comprehensive guidelines to
standardize reporting of AI-based studies in healthcare education and practice.
Methods: A systematic literature review was conducted on Scopus, PubMed, and Google Scholar.
The published records with "ChatGPT", "Bing", or "Bard" in the title were retrieved. Careful
examination of the methodologies employed in the included records was conducted to identify the
common pertinent themes and gaps in reporting. Panel discussion followed to establish a unified and
thorough checklist for reporting. Testing of the finalized checklist on the included records was done
by two independent raters with Cohen’s κ as the method to evaluate the inter-rater reliability.
Results: The final dataset that formed the basis for pertinent theme identification and analysis
comprised a total of 34 records. The finalized checklist included nine pertinent themes collectively
referred to as METRICS: (1) Model used and its exact settings; (2) Evaluation approach for the
generated content; (3) Timing of testing the model; (4) Transparency of the data source; (5) Range of
tested topics; (6) Randomization of selecting the queries; (7) Individual factors in selecting the
queries and inter-rater reliability; (8) Count of queries executed to test the model; (9) Specificity of
the prompts and language used. The overall mean METRICS score was 3.0±0.58. The inter-rater reliability of the
METRICS scoring was acceptable, with Cohen's κ ranging from 0.558 to 0.962 (P<.001 for the nine
tested items). Classified per item, the highest average METRICS score was recorded for the "Model"
item, followed by the "Specificity of the prompts and language used" item, while the lowest scores were
recorded for the "Randomization of selecting the queries" item, classified as sub-optimal, and the
"Individual factors in selecting the queries and inter-rater reliability" item, classified as satisfactory.
Conclusions: The findings highlighted the need for standardized reporting algorithms for AI-based
studies in healthcare, based on the variability observed in methodologies and reporting. The proposed
METRICS checklist could be a helpful preliminary step toward establishing a universally accepted
approach to standardize the reporting of AI-based studies in healthcare, a swiftly evolving research topic.
Keywords: Guidelines; evaluation; meaningful analytics; large language models; decision support
Introduction
The integration of artificial intelligence (AI) models into healthcare education and practice holds
promising perspectives with numerous possibilities for continuous improvement [1-5]. Examples of
AI-based conversational models that are characterized by ease-of-use and perceived usefulness
include: ChatGPT by OpenAI, Bing by Microsoft, and Bard by Google [6-8].
The huge potential of these AI-based models in healthcare can be illustrated as follows. First, the AI-
based models can facilitate the streamlining of clinical workflow, with subsequent improvement in
efficiency in terms of reduced time for delivering care and reduced costs [1, 9-11]. Second, AI-based
models can enhance the area of personalized medicine with a huge potential to achieve refined
prediction of disease risks and disease outcomes [1, 12, 13]. Third, the usefulness of AI-based
models can be manifested in improved health literacy among lay individuals through providing
easily accessible and understandable health information [1, 14].
Despite the aforementioned advantages of employing AI-based models in healthcare, several valid
concerns have been raised that should be considered carefully due to their serious consequences [1, 4, 15].
For example, the lack of clarity on how these AI-based models are trained raises ethical concerns
besides the inherent bias in the generated content based on the modality of training to develop and
update such models [16, 17]. Importantly, the generation of inaccurate or misleading content, which
might appear scientifically plausible to non-experts (referred to as "hallucinations"), could have a
profound negative impact in healthcare settings [1, 18, 19]. Furthermore, the integration of AI-based
models in healthcare could raise complex medico-legal and accountability questions, compounded by
the issues of data privacy and cybersecurity risks [1, 4, 20, 21].
Similarly, the AI-based models can be paradigm-shifting in acquiring information and in healthcare
education [1, 22-24]. However, careful consideration of the best policies and practices to incorporate
the AI-based models in healthcare education is needed [25]. This involves the urgent need to address
the issues of inaccuracies, possible academic dishonesty, decline in critical thinking
development, and deterioration of practical training skills [1].
Recently, a remarkable number of studies investigated the applicability and disadvantages of the
prominent AI-based conversational models such as ChatGPT, Microsoft Bing, and Google Bard in
various healthcare and educational settings [1, 2, 4, 26-32]. However, synthesizing evidence from
such studies can be challenging for several reasons. Variations in the methodologies implemented
across studies, as well as in reporting standards, could hinder efforts to compare and
contrast the results, contributing to the complexity of this domain. This variability arises from several
factors including testing different AI-based models with different settings, variability in the prompts
used to generate the content, different approaches for evaluating the generated content, varying range
of topics tested and possible bias in selecting the subjects for testing, the number and expertise of
individual raters of the quality of content, and the number of queries executed, among others [33-35].
Therefore, it is important to develop an approach that can help achieve standardized
reporting practices for studies that evaluate the content generated by AI-based models, particularly in
healthcare. Such standardization is crucial for enabling precise comparisons and credible
synthesis of findings across different studies. Thus, we aimed to propose a preliminary framework
(checklist) to establish proper guidelines for design and reporting of the findings of AI-based studies
that address healthcare-related topics.
Methods
Study Design
The study was based on a literature review to highlight the key methodological aspects of the studies
that investigated three AI-based models (ChatGPT, Bing, and Bard) in healthcare education and
practice. The literature review followed a systematic search guided by the Preferred Reporting
Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, to identify relevant
literature indexed in each database up to 11 November 2023 [36]. The databases used for this
literature search were Scopus, PubMed/MEDLINE, and Google Scholar.
This study did not involve human subjects; therefore, it was exempt from the requirement for ethical approval.
Systematic Search of Studies
The Scopus string query was (TITLE-ABS-KEY("artificial intelligence" OR "AI") AND TITLE-
ABS-KEY ("healthcare" OR "health care") AND TITLE-ABS-KEY ("education" OR "practice"))
AND PUBYEAR > 2022 AND DOCTYPE (ar OR re) AND (LIMIT-TO (PUBSTAGE , "final"))
AND (LIMIT-TO (SRCTYPE , "j")) AND (LIMIT-TO (LANGUAGE , "English")). The Scopus
search yielded a total of 843 documents.
The PubMed advanced search tool was used as follows: ("artificial intelligence"[Title/Abstract] OR
"AI"[Title/Abstract]) AND ("healthcare"[Title/Abstract] OR "health care"[Title/Abstract]) AND
("education"[Title/Abstract] OR "practice"[Title/Abstract]) AND ("2022/12/01"[Date - Publication] :
"2023/11/11"[Date - Publication]). The PubMed search yielded a total of 564 records.
In Google Scholar, the Publish or Perish software (version 8) was used with the title words "artificial intelligence" OR "AI" AND "healthcare" OR "health care" AND "education" OR "practice" and the years 2022–2023, with a maximum of 999 records retrieved [37].
Criteria for Record Inclusion
The records from the three databases were merged using EndNote 20.2.1 software. This was
followed by removal of duplicate records and removal of preprints using the search function: ANY
FIELD preprint OR ANY FIELD rxiv OR ANY FIELD SSRN OR ANY FIELD Researchgate OR
ANY FIELD researchsquare. The retrieved records were eligible for the final
screening step given the following inclusion criteria: (1) original article; (2) English record; (3)
published (peer-reviewed); and (4) assessment in healthcare practice or healthcare education.
Finally, the imported references were subjected to the search function in EndNote as follows: Title
contains ChatGPT OR Title contains Bing OR Title contains Bard.
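The deduplication and screening steps were performed in EndNote; the sketch below is only an equivalent illustration of the same logic in Python, assuming the merged records are available as dictionaries with hypothetical "title" and "source" fields.

# Illustrative equivalent of the EndNote screening logic described above,
# assuming merged records exported as dictionaries with hypothetical
# "title" and "source" fields; EndNote itself was used in the actual study.
PREPRINT_MARKERS = ("preprint", "rxiv", "ssrn", "researchgate", "researchsquare")
TITLE_KEYWORDS = ("chatgpt", "bing", "bard")

def is_preprint(record: dict) -> bool:
    text = " ".join(str(value).lower() for value in record.values())
    return any(marker in text for marker in PREPRINT_MARKERS)

def screen(records: list) -> list:
    seen_titles, screened = set(), []
    for record in records:
        title = record.get("title", "").strip().lower()
        if title in seen_titles:  # duplicate removal
            continue
        if is_preprint(record):  # preprint exclusion
            continue
        if not any(keyword in title for keyword in TITLE_KEYWORDS):
            continue  # title must mention ChatGPT, Bing, or Bard
        seen_titles.add(title)
        screened.append(record)
    return screened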
Development of the Initial Checklist Items
The methodologies and results of the included records were analyzed, followed by a literature review
of commonly used reporting and quality guidelines: the Strengthening the Reporting of
Observational Studies in Epidemiology (STROBE) Statement checklist for cross-sectional studies
and the Critical Appraisal Skills Programme (CASP) Qualitative Studies Checklist [38, 39]. This was
followed by an independent content review process to identify essential themes and best practices in
AI-based healthcare studies. A collaborative discussion among the authors followed to refine these
themes into specific parts relevant to the study objectives, emphasizing aspects that could impact
study quality and reproducibility.
Establishing the Final Checklist Criteria
Careful examination of the included records resulted in compiling three independent lists of
pertinent themes, herein defined as themes that were critical or recurring in the reporting of the results
of AI-based studies. Thorough discussion among the authors followed to reach a consensus on the
pertinent themes. Recurring themes were defined as those found in the methods of at least three
separate records. Critical aspects were defined as those agreed by the three authors to impact the
conclusions of the included record.
Thus, the final pertinent themes were selected based on their author-perceived significance for the
quality and reproducibility of findings. A final list of nine themes was agreed upon by the authors as
follows: (1) the "Model" of the AI-based tool(s) used in the included record and the explicit mention
of the exact settings employed for each tool; (2) the "Evaluation" approach used to assess the quality of
the AI-generated content, in terms of objectivity (to reach unbiased findings) versus subjectivity;
(3) the exact "Timing" of AI-based model testing and its duration; (4) the "Transparency" of the data
sources used to generate queries for AI-model testing, including permission to use copyrighted
content; (5) the "Range" of topics tested (single topic, multiple related topics, or various unrelated
topics, as well as the breadth of inter-topic and intra-topic queries tested); (6) the degree of
"Randomization" of the topics selected for testing, to account for potential bias; (7) the "Individual"
subjective role in evaluating the content and the possibility of inter-rater reliability
concordance/discordance; (8) the number ("Count") of queries executed on each AI model, entailing
the sample size of queries tested; and (9) the "Specificity" of the prompts used on each AI-based
model, including the exact phrasing of each prompt and the presence of feedback and learning loops,
and the specificity of the language(s) used in testing, besides any other cultural issues. Thus, the
final checklist was termed METRICS.
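For readers who wish to use the checklist as a structured reporting template, a minimal machine-readable listing of the nine themes is sketched below; the keys and wording paraphrase the items above and do not constitute an official schema.

# Illustrative only: a simple machine-readable listing of the nine METRICS themes.
# The wording paraphrases the checklist above; this is not an official schema.
METRICS_ITEMS = {
    "Model": "AI model(s) used and the exact settings employed for each tool",
    "Evaluation": "Approach used to assess the generated content (objective vs subjective)",
    "Timing": "Exact timing of testing the model and its duration",
    "Transparency": "Transparency of the data sources, including permission for copyrighted content",
    "Range": "Range of tested topics (inter-topic and intra-topic breadth)",
    "Randomization": "Randomization of selecting the queries to limit selection bias",
    "Individual": "Individual factors in query selection/evaluation and inter-rater reliability",
    "Count": "Count of queries executed (sample size)",
    "Specificity": "Specificity of the prompts and language(s), including feedback and learning loops",
}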
Scoring the METRICS Items and Classification of the METRICS Score
Scoring of the included records was performed by two raters (the first and senior authors)
independently, with each METRICS item scored using a 5-point Likert scale as follows: excellent=5,
very good=4, good=3, satisfactory=2, and suboptimal=1. For items that were deemed not
applicable (e.g., individual factors for studies that employed an objective method of evaluation), no
score was given. The scores of the two raters were then averaged. The average METRICS score was
calculated as the sum of the average scores of the applicable items divided by (10 minus the number
of items deemed not applicable).
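A hedged sketch of this scoring procedure is given below: the per-item scores of the two raters are averaged, not-applicable items are recorded as None, and the overall score follows the formula as stated above (sum of the applicable item averages divided by 10 minus the number of not-applicable items). Function and variable names are illustrative.

from statistics import mean

# Hedged sketch of the METRICS scoring described above; ratings use the 5-point
# scale (suboptimal=1 ... excellent=5) and None marks a not-applicable item.
def average_item_scores(rater1: dict, rater2: dict) -> dict:
    """Average the two raters' scores per item, keeping None where not applicable."""
    return {
        item: None if rater1[item] is None or rater2[item] is None
        else mean([rater1[item], rater2[item]])
        for item in rater1
    }

def metrics_score(item_averages: dict) -> float:
    # Denominator follows the formula as stated in the text:
    # 10 minus the number of items deemed not applicable.
    applicable = [score for score in item_averages.values() if score is not None]
    not_applicable = len(item_averages) - len(applicable)
    return sum(applicable) / (10 - not_applicable)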
The subjective assessment by the two raters was based on predefined criteria as a general guide.
For example, if the exact date(s) of the model queries was/were mentioned, the "Timing" item was
scored as excellent. The count of queries was agreed to be categorized as excellent if it exceeded
500, while a single case or no mention of the count was considered suboptimal. For the prompt
attributes, scores were assigned based on the availability of the exact prompts, explicit mention of the
language used, and details of prompting. The evaluation method was agreed to be rated higher for
objective assessments with full details and lower for subjective assessments. Explicit mention
of the method of inter-rater reliability testing was agreed to be scored higher for the "Individual"
item. "Transparency" was assessed based on the comprehensiveness of the data source, with full
database disclosure and permission to use the data agreed to be given an excellent score.
"Randomization" was agreed to be scored lowest in the absence of such detail and highest for
explicit, detailed descriptions.
Finally, we decided to highlight the records with the highest average score for each METRICS item.
This approach was taken to refrain from providing examples of the other quality categories, thereby
avoiding premature conclusions regarding the quality of the included studies given the preliminary,
pilot nature of the METRICS tool.
Statistical and Data Analysis
For our statistical analyses, we utilized IBM SPSS Statistics for Windows, Version 26.0 (Armonk,
NY, USA: IBM Corp).
The average METRICS scores were classified into distinct categories of approximately equal intervals: Excellent
(4.21–5.00), Very Good (3.41–4.20), Good (2.61–3.40), Satisfactory (1.81–2.60), and Sub-optimal
(1.00–1.80).
Cohen's κ was used to assess the inter-rater reliability between the two independent raters. The
κ values were categorized as follows: less than 0.20 (poor agreement), 0.21–0.40 (fair
agreement), 0.41–0.60 (good agreement), 0.61–0.80 (very good agreement), and 0.81–1.00 (excellent
agreement).
The level of statistical significance was set at P<.05.
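Although the analysis was performed in SPSS, an equivalent computation of Cohen's κ can be sketched in Python as below; the two score lists are hypothetical illustrative data, and the category boundaries follow the classification stated above.

from sklearn.metrics import cohen_kappa_score

# Hedged Python equivalent of the SPSS inter-rater analysis described above;
# the two lists are hypothetical per-record item scores from the two raters.
def agreement_category(kappa: float) -> str:
    if kappa < 0.21:
        return "poor agreement"
    if kappa <= 0.40:
        return "fair agreement"
    if kappa <= 0.60:
        return "good agreement"
    if kappa <= 0.80:
        return "very good agreement"
    return "excellent agreement"

rater1_scores = [5, 4, 3, 5, 2, 1, 4, 3]  # illustrative data only
rater2_scores = [5, 4, 3, 4, 2, 1, 4, 2]

kappa = cohen_kappa_score(rater1_scores, rater2_scores)
print(f"Cohen's kappa = {kappa:.3f} ({agreement_category(kappa)})")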
Results
Description of the Included Studies
A total of 34 studies were included in the final analysis aiming to establish the METRICS criteria
(Figure 1).
Figure 1. The approach for selecting the articles, using the PRISMA guidelines as a guide.
The most common source of the records was Cureus journal with 9 records out of the 34 (26.5%),
followed by BMJ Neurology Open with 2 records out of the 34 (5.9%). The remaining 23 records
were published in 23 different journals.
Evaluation of the Included Records Based on the MED-METRICS Items
The METRICS checklist was divided into three parts: Model attributes, Evaluation approach, and
features of the Data used, forming the acronym MED-METRICS.
The full details of the model attributes of the included studies are presented in Table 1.
Table 1. Model attributes description of the included studies.
Authors Model Timing Count Specificity of prompt and
language
Al-Ashwal et
al. [40]
ChatGPT-3.5,
ChatGPT-4, Bing,
and Bard with
unclear settings
One month, May
2023 255 drug-drug pairs Exact prompts were provided
in English
Alfertshofer
et al. [41]
ChatGPT with
unclear settings Not provided 1800 questions
The exact prompt was used for
each question. The questions
were taken from the US, the
UK, Italian, French, Spanish,
and Indian exams. A new
session for each question
Ali et al. [42]
ChatGPT Feb 9
free version with
unclear settings
Not provided 50 items
Provided fully in the
supplementary file of the article
in English
Aljindan et
al. [43]
ChatGPT-4 with
unclear settings Not provided 220 questions
Initial prompting that involved
role playing as a medical
professional. The language was
English
Altamimi et
al. [44] ChatGPT-3.5 Single day not
otherwise specified 9 questions Exact prompts were provided
in English
Baglivo et al.
[45]
Bing, ChatGPT,
Chatsonic, Bard,
and YouChat with
full details of the
mode, LLM model
including plugins
Exact dates were
provided for each
model 12, 13, and
14 April 2023 and
13 July 2023
15 questions Italian
Biswas et al.
[46]
ChatGPT-3.5 with
unclear settings
Exact date
provided: 16 March
2023
11 questions
Exact prompts were provided
in English. A new session for
each question
Chen et al.
[47]
ChatGPT-4 with
unclear settings Not provided 560 questions Exact prompts were provided
in English
Deiana et al.
[48]
ChatGPT-3.5 and
ChatGPT-4
versions with
unclear settings
Not provided 11 questions
The exact prompts were not
explicitly provided. English.
with a new session for each
question. Up to three iterations
were allowed for incorrect
responses
Fuchs et al.
[49]
ChatGPT-3 and
ChatGPT-4 with
unclear settings
Exact dates
provided: 19
February 2023 and
25 March 2023
60 questions
Dental medicine questions were
translated from German to
English while the other
questions were already present
in English. Exact prompts were
provided in English with
prompting in two groups for
the same questions: one group
was primed with instructions
while the second was not
primed. 20 trials were
conducted per group, and chat
history was cleared after each
trial
Ghosh & Bir
[50]
ChatGPT (version
March 14, 2023)
with unclear
settings
14 March and 16
March 2023 200 questions
The exact prompts and
language were not explicitly
provided, the first response was
taken as final, and the option of
"regenerate response" was not
used
Giannos [51]
ChatGPT-3 and
ChatGPT-4 with
unclear settings
Not provided 69 questions Not provided explicitly
Gobira et al.
[52]
ChatGPT-4 with
unclear settings Not provided 125 questions Portuguese
Grewal et al.
[53]
ChatGPT-4 with
unclear settings
The first week of
May 2023 Not clear
Exact prompts were provided
in English, one follow-up
prompt was used for
enhancement of some prompts
Guerra et al.
[54]
ChatGPT-4 with
unclear settings Not provided 591 questions
The exact prompt was provided
while the language was not
provided explicitly
Hamed et al.
[55]
ChatGPT-4 with
unclear settings Not provided Not clear
The exact prompts and
language were not explicitly
provided. Different prompts
were tried to identify the most
suitable
Hoch et al. [56]
ChatGPT (May 3rd version) with unclear settings
5 May 2023 and 7 May 2023
2576 questions
The exact prompt was provided while the language was not provided explicitly
Juhi et al.
[57]
ChatGPT with
unclear settings
20 February 2023
to 5 March 2023 40 drug-drug pairs Exact prompts were provided
in English
Kuang et al.
[58]
ChatGPT (version
unclear) with unclear
settings
Not provided Not clear The exact prompts were not
explicitly provided. English
Kumari et al.
[59]
ChatGPT-3.5, Bard,
and Bing with
unclear settings
30 July 2023 50 questions The exact prompts were not
explicitly provided. English
Kung et al.
[60]
ChatGPT-3.5 and
ChatGPT-4 with
unclear settings
July 2023 215 questions Not clear
Lai et al. [61]
ChatGPT-4 May 24
Version 3.5, with
unclear settings
Not provided 200 questions
The exact prompts and
language were not explicitly
provided. Three attempts to
answer the complete set of
questions over 3 weeks (once
per week), with a new session
for each question
Lyu et al.
[62]
ChatGPT with
unclear settings Mid-February 2023 Not clear Exact prompts were provided
in English
Moise et al.
[63]
ChatGPT-3.5 with
unclear settings Not provided 23 questions
Exact prompts were provided
in English with a new session
for each question
Oca et al.
[64]
ChatGPT, Bing,
and Bard with
unclear settings
11 April 2023 20 queries for each
model
The exact prompt was provided
in English
Oztermeli &
Oztermeli
[65]
ChatGPT-3.5 with
unclear settings Not provided 1177 questions
The exact prompts were not
explicitly provided. Turkish,
with a new session for each
question
Pugliese et al.
[66]
ChatGPT, with
unclear settings 25 March 2023 15 questions
Exact prompts were provided
in English with a new session
for each question
Sallam et al.
[67]
ChatGPT (default
model), with
unclear settings
25 February 2023 Not provided Exact prompts were provided
in English
Seth et al.
[68]
ChatGPT-3.5, Bard,
and Bing AI Not provided 6 questions Exact prompts were provided
in English
Suthar et al.
[69]
ChatGPT-4 with
unclear settings Not provided 140 cases The exact prompts were not
explicitly provided. English
Walker et al.
[70]
ChatGPT-4 with
unclear settings Not provided 5 cases
The exact prompts were not
explicitly provided. English,
with a new session for each
question
H. Wang et
al. [71]
ChatGPT-3.5 and
ChatGPT-4, with
unclear settings
14 February 2023
for ChatGPT-3.5
and 14-16 May
2023 for ChatGPT-
4
300 questions
Chinese and English. The exact
prompts were provided. The
prompts were enhanced though
role play
Y. M. Wang
et al. [72]
ChatGPT-3.5, with
unclear settings 5-10 March 2023 Not clear
Chinese (Mandarin) and
English. Examples of prompts
were provided
Zhou et al.
[73]
ChatGPT-3.5, with
unclear settings 24-25 April 2023
Single case and
multiple poll
questions
Exact prompts were provided
in English
ChatGPT was tested in all the records included (34/34, 100%), followed by Google Bard (5/34,
14.7%), and Bing Chat (5/34, 14.7%). The exact dates of AI-based model queries were explicitly
mentioned in 13 out of 34 records (38.2%). The count of cases/questions that were tested in the
studies ranged from a single case to 2576 questions. The majority of the studies (23/34, 67.6%)
tested the AI models using queries in English.
The full details of the evaluation approach for the AI-generated content in the included studies
are presented in Table 2.
Table 2. Classification based on the evaluation approach of the AI generated content.
Authors Evaluation of performance Individual role and inter-rater reliability
Al-Ashwal et
al. [40]
Objective via two different clinical reference
tools Not applicable
Alfertshofer
et al. [41]
Objective based on the key answers with the
questions screened independently by four
investigators
Not applicable
Ali et al. [42]
Objective for multiple-choice questions and
true/false questions, and subjective for short-
answer and essay questions
Assessment by two assessors independently
with intraclass correlation coefficient for
agreement
Aljindan et al.
[43]
Objective based on key answers and
historical performance metrics Not applicable
Altamimi et
al. [44] Subjective
Not clear. Assessment for accuracy,
informativeness, and accessibility by clinical
toxicologists and emergency medicine
physicians
Baglivo et al.
[45]
Objective based on key answers and
comparison to 5th year medical students’
performance
Not applicable
Biswas et al.
[46]
Subjective by a five-member team of
optometry teaching and expert staff with over
100 years of clinical and academic
experience between them. Independent
evaluation on a 5-point Likert scale ranging
from very poor to very good
The median scores across raters for each
response were studied. The score represented
raters consensus, while the score variance
represented disagreements between the raters
Chen et al.
[47] Objective based on key answers Not applicable
Deiana et al.
[48]
Subjective based on qualitative assessment of
correctness, clarity, and exhaustiveness. Each
response was rated using a 4-point Likert
scale scoring from strongly disagree to
strongly agree
Independent assessment by two raters with
experience in vaccination and health
communication topics
Fuchs et al.
[49] Objective based on key answers Not applicable
Ghosh & Bir
[50]
Objective based on key answers. Subjectivity
by raters’ assessment
Scoring was done by two assessors on a scale
of zero to five, with zero being incorrect and
five being fully correct, based on a pre-
selected answer key.
Giannos [51] Objective based on key answers Not applicable
Gobira et al.
[52]
Objective based on key answers with an
element of subjectivity through classifying
the responses as adequate, inadequate, or
indeterminate
Two raters independently scored the
accuracy. After individual evaluations, the
raters performed a third assessment to reach a
consensus on the questions with differing
results
Grewal et al.
[53] Not clear Not clear
Guerra et al. [54]
Subjective through comparison to the results of a previous study on the average performance of users and a cohort of medical students and neurosurgery residents
Not applicable
Hamed et al.
[55] Subjective Not clear
Hoch et al.
[56] Objective based on key answers Not applicable
Juhi et al.
[57]
Subjective and the use of Stockley's Drug
Interactions Pocket Companion 2015 as a
reference key
Two raters reached a consensus
for categorizing the output
Kuang et al.
[58] Subjective Not clear
Kumari et al.
[59]
Subjective. Content validity was checked by
two experts of curriculum design
Three independent raters scored content
based on their correctness with an accuracy
score ranging from 1 to 5
Kung et al.
[60] Objective based on key answers Not applicable
Lai et al. [61] Objective based on key answers Not applicable
Lyu et al. [62] Subjective
two experienced radiologists (with 21 and 8
years of experience) evaluated the quality of
the ChatGPT responses
Moise et al.
[63]
Subjective through comparison with the latest
American Academy of Otolaryngology–Head
and Neck Surgery Foundation Clinical
Practice Guideline: Tympanostomy Tubes in
Children (Update)
Two independent raters evaluated the output.
The inter-rater reliability was assessed using
Cohen’s Kappa test. To confirm the
consensus, responses were reviewed by the
senior author
Oca et al. [64] Not clear Not clear
Oztermeli &
Oztermeli
[65]
Objective based on key answers Not applicable
Pugliese et al.
[66]
Subjective using the Likert scale for
accuracy, completeness, and
comprehensiveness
Multi-rater: 10 key opinion leaders in NAFLD
and 1 nonphysician with expertise in patient
advocacy in liver disease, each independently
rating the AI-content
Sallam et al.
[67]
Subjective based on correctness, clarity, and
conciseness Fleiss multi-rater Kappa
Seth et al.
[68]
Subjective through comparison with the
current healthcare guidelines for rhinoplasty,
evaluation by the panel of plastic surgeons
through a Likert scale to assess readability
complexity of the text and the education level
required for understanding and modified
DISCERN a score
Not clear
Suthar et al.
[69]
Subjective by three fellowship-trained
neuroradiologists, utilizing a five-point Likert
scale, where a score of 1 indicated "highly improbable"
and a score of 5 denoted "highly
probable."
Not applicable
Walker et al.
[70]
Modified EQIP b Tool with comparison with
UK National Institute for Health and Care
Excellence guidelines for gallstone disease,
pancreatitis, liver cirrhosis/portal
hypertension, and the European Association
for Study of the Liver guidelines
All answers were assessed by 2 authors
independently and in case of a contradictory
result, resolution was achieved by consensus.
The process was repeated 3 times per EQIP
item. Wrong or out of context answers,
known as "AI hallucinations," were recorded
H. Wang et al.
[71] Subjective Unclear
Y. M. Wang
et al. [72] Objective based on key answers Not applicable
Zhou et al.
[73] Subjective Unclear
a DISCERN: xx; b EQIP: xx.
Objective evaluation of the AI-model generated content was observed in 15 out of the 34 included
records (44.1%).
The full details of the features of the data used to generate the queries tested on the AI models,
including the range of topics and randomization, are presented in Table 3.
Table 3. Classification of the included studies based on the features of the data used to generate the AI queries.
Authors Transparency Range Random
Al-Ashwal et
al. [40]
Full description using two tools
for assessment Micromedex, a
subscription-based drug-drug
interaction screening tool, and
Drugs.com, a free database
Narrow. Drug-drug interaction
prediction
Non-random,
purposeful selection
of the drugs by a
clinical pharmacist,
five drugs paired
with the top 51
prescribed drugs
Alfertshofer et
al. [41]
Full description using the question
bank AMBOSS with official
permission for the use of the
AMBOSS USMLE step 2CK
practice question bank for research
purposes was granted by
AMBOSS
Broad Randomly extracted
Ali et al. [42]
Developed by the researchers and
reviewed by a panel of
experienced academics for
accuracy, clarity of language,
relevance and agreement on
correct answers. Evaluation of
face validity, accuracy and
suitability for undergraduate
dental students
Narrow inter-subject (dentistry).
Broad intra-subject in restorative
dentistry, periodontics, fixed
prosthodontics, removable
prosthodontics, endodontics,
pedodontics, orthodontics,
preventive dentistry, oral surgery
and oral medicine
Not clear
Aljindan et al.
[43]
The Saudi Medical Licensing
Exam questions extracted from the
subscription CanadaQBank
website
Broad in Medicine, with 30% of
the questions from Medicine,
25% from Obstetrics and
Gynecology, 25% from
Pediatrics, and the remaining 20%
from Surgery
Randomized
through four
researchers to ensure
comprehensive
coverage of
questions and
eliminate potential
bias in question
selection
Altamimi et al.
[44]
Snakebite management
information guidelines derived
from the World Health
Organization, Centers for Disease
Control and Prevention, and the
clinical literature
Narrow Not clear
Baglivo et al.
[45]
The Italian National Medical
Residency test
Narrow. Vaccination-related
questions from the Italian National
Medical Residency Test
Not clear
Biswas et al.
[46]
Constructed based on the
frequently asked questions on
myopia webpage of the
Narrow involving nine categories:
one each for disease summary,
cause, symptom, onset,
Not clear
Association of British Dispensing
Opticians and the College of
Optometrists
prevention, complication, natural
history of untreated myopia,
prognosis and three on treatments
Chen et al.
[47]
BoardVitals which is an online
question bank accredited by the
Accreditation Council for
Continuing Medical Education
Neurology-based. Broad intra-
subject: basic neuroscience;
behavioral, cognitive, psychiatry;
cerebrovascular; child neurology;
congenital; cranial nerves; critical
care; demyelinating disorders;
epilepsy and seizures; ethics;
genetic; headache;
imaging/diagnostic studies;
movement disorders; neuro-
ophthalmology; neuro-otology;
neuroinfectious disease;
neurologic complications of
systemic disease; neuromuscular;
neurotoxicology, nutrition,
metabolic; oncology; pain;
pharmacology; pregnancy; sleep;
and trauma
Not clear
Deiana et al.
[48]
Questions concerning vaccine
myths and misconceptions by the
World Health Organization
Narrow on vaccine myths and
misconceptions Not clear
Fuchs et al.
[49]
The digital platform self-
assessment questions tailored for
dental and medical students at the
University of Bern’s Institute for
Medical Education
Broad with multiple choice
questions designed for dental
students preparing for the Swiss
Federal Licensing Examination in
Dental Medicine and allergists and
immunologists preparing for the
European Examination in Allergy
and Clinical Immunology
Not clear
Ghosh & Bir
[50]
Department question bank, which
is a compilation of first and
second semester questions from
various medical universities across
India
Biochemistry
Random without
details of
randomization
Giannos [51] Specialty Certificate Examination
Neurology Web Questions bank Neurology and neuroscience Not clear
Gobira et al.
[52]
The National Brazilian
Examination for Revalidation of
Medical Diplomas Issued by
Foreign Higher Education
Institutions (Revalida)
Preventive Medicine, Gynecology
and Obstetrics, Surgery, Internal
Medicine, and Pediatrics
Not clear
Grewal et al.
[53] Not clear Radiology Not clear
Guerra et al.
[54]
Questions released by the
Congress of Neurological
Surgeons in the self-assessment
neurosurgery exam
Neurosurgery across seven
subspecialties: tumor,
cerebrovascular, trauma, spine,
functional, pediatrics, and
pain/nerve
Not clear
Hamed et al.
[55]
Guidelines adapted from Diabetes
Canada Clinical Practice
Guidelines Expert Committee, the
Royal Australian College of
General Practitioners, Australian
Diabetes Society position
statement, and the Joint British
Diabetes Societies
Management of diabetic
ketoacidosis Not clear
Hoch et al.
[56]
Question database of an online
learning platform funded by the
German Society of Oto-Rhino-
Laryngology, Head and Neck
Surgery
Otolaryngology with a range of 15
distinct otolaryngology
subspecialties including
allergology; audiology; ear, nose,
and throat; tumors, face and neck,
inner ear and skull base, larynx,
middle ear, oral cavity and
pharynx, nose and sinuses,
phoniatrics, salivary glands, sleep
medicine, vestibular system, and
legal aspects
Not clear
Juhi et al. [57]
A list of drug-drug interactions
from previously published
research
Narrow on drug-drug interaction Not clear
Kuang et al.
[58] Not clear Neurosurgery Not clear
Kumari et al.
[59]
Designed by experts in
hematology-related cases
Hematology with the following
intra-subject aspects: case
solving, laboratory calculations,
disease interpretations, and other
relevant aspects of hematology
Not clear
Kung et al.
[60]
Orthopaedic In-Training
Examination Orthopedics Not clear
Lai et al. [61]
The United Kingdom Medical
Licensing Assessment which is a
newly derived national
undergraduate medical exit
examination
Broad in medicine with the
following aspects: acute and
emergency, cancer, cardiovascular,
child health, clinical hematology,
ear nose and throat, endocrine and
metabolic, gastrointestinal
including liver, general practice
and primary healthcare, genetics
and genomics, infection, medical
ethics and law, medicine of older
adult, mental health,
musculoskeletal, neuroscience
obstetrics and gynecology,
ophthalmology, palliative and end
of life care, perioperative
medicine and anesthesia, renal and
urology, respiratory, sexual health,
social and population health,
surgery
Not clear
Lyu et al. [62]
Chest CT and brain MRI screening
reports collected from the Atrium
Health Wake Forest Baptist
clinical database
Radiology Not clear
Moise et al.
[63]
Statements published in the latest
American Academy of
Otolaryngology–Head and Neck
Surgery Foundation Clinical
Practice Guideline:
Tympanostomy Tubes in Children
(Update)
Narrow Otolaryngology Not clear
Oca et al. [64] Not clear
Narrow involving solely queries
on accurate recommendation of
close ophthalmologists
Not clear
Oztermeli &
Oztermeli [65]
Turkish medical specialty exam,
prepared by the Student Selection
and Placement Center
Broad: Basic sciences including
Anatomy, Physiology–Histology
Embryology, Biochemistry,
Not clear
Microbiology, Pathology, and
Pharmacology; Clinical including
Internal Medicine, Pediatrics,
General Surgery, Obstetrics and
Gynecology, Neurology,
Neurosurgery, Psychiatry, Public
Health, Dermatology, Radiology,
Nuclear Medicine,
Otolaryngology, Ophthalmology,
Orthopedics, Physical Medicine
and Rehabilitation, Urology,
Pediatric Surgery, Cardiovascular
Surgery, Thoracic Surgery, Plastic
Surgery, Anesthesiology and
Reanimation, and Emergency
Medicine
Pugliese et al.
[66]
Expert selection of 15 questions
commonly asked by non-alcoholic
fatty liver disease patients
Narrow involving non-alcoholic
fatty liver disease aspects Not clear
Sallam et al.
[67]
Panel discussion of experts in
healthcare education
Broad on healthcare education,
medical, dental, pharmacy, and
public health
Not clear
Seth et al. [68]
Devised by three fellows of the
Royal Australasian College of
Surgeons with experience in
performing rhinoplasty and
expertise in facial reconstructive
surgery.
Narrow involving technical
aspects of rhinoplasty Not clear
Suthar et al.
[69]
Quizzes from the Case of the
Month feature of the American
Journal of Neuroradiology
Narrow involving radiology Not clear
Walker et al.
[70]
Devised based on the Global Burden
of Disease tool
Narrow involving benign and
malignant hepato-pancreatico-
biliary–related conditions
Not clear
H. Wang et al.
[71]
Medical Exam Help. Ten inpatient
and ten outpatient medical records
to form a collection of Chinese
medical records after
desensitization
Clinical medicine, basic medicine,
medical humanities, and relevant
laws
Not clear
Y. M. Wang et
al. [72]
The Taiwanese Senior
Professional and Technical
Examinations for Pharmacists
downloaded from the Ministry of
Examination website
Broad involving pharmacology
and pharmaceutical chemistry,
pharmaceutical analysis and
pharmacognosy (including
Chinese medicine), and
pharmaceutics and
biopharmaceutics; dispensing
pharmacy and clinical pharmacy,
therapeutics, and pharmacy
administration and pharmacy law
Not clear
Zhou et al.
[73]
A single clinical case from
OrthoBullets, a global clinical
collaboration platform for
orthopedic surgeons. Written
permission to use their clinical
case report
Very narrow involving a single
orthopedic case Not clear
The randomization process was explicitly mentioned in only 4 out of the 34 included
studies (8.8%). Six of the 34 records (17.6%) involved broad multidisciplinary medical exam
questions. Two of the 34 studies (5.9%) explicitly mentioned permission to use the data.
Examples of Optimal Reporting of Each Criterion within the METRICS
Checklist
The records with the highest scores for each METRICS item, as determined by the averaged
subjective inter-rater assessments, are shown in Table 4.
Table 4. Included records that scored the highest METRICS per item.
Item Issues considered in each item Excellent/very good reporting example(s)
#1 Model What is the model of the AI tool used
for generating content, and what are
the exact settings for each tool?
Baglivo et al. [45]: Bing, ChatGPT, Chatsonic,
Bard, and YouChat with full details of the mode,
LLM model including plugins
#2 Evaluation What is the exact approach used to
evaluate the AI-generated content and
is it objective or subjective evaluation?
Al-Ashwal et al. [40]: Objective via two
different clinical reference tools; Alfertshofer et
al. [41]: Objective based on the key answers with
the questions screened independently by four
investigators; Ali et al. [42]: Objective for
multiple-choice questions and true/false
questions, and subjective for short-answer and
essay questions; Aljindan et al. [43]: Objective
based on key answers and historical performance
metrics; and Baglivo et al. [45]: Objective based
on key answers and comparison to 5th year
medical students’ performance
#3a Timing When exactly was the AI model tested,
and what were the duration and
timing of testing?
Baglivo et al. [45]; Biswas et al. [46]; Fuchs et
al. [49]; Ghosh & Bir [50]; Hoch et al. [56]; Juhi
et al. [57]; Kumari et al. [59]; Kung et al. [60];
Oca et al. [64]; Pugliese et al. [66]; Sallam et al.
[67]; H. Wang et al. [71]; and Zhou et al. [73]
#3b Transparency How transparent are the data sources
used to generate queries for the AI
model?
Alfertshofer et al. [41]
#4a Range What is the range of topics tested, and
are they inter-subject or intra-subject
with variability in different subjects?
Ali et al. [42]; Chen et al. [47]; Hoch et al. [56];
and Y. M. Wang et al. [72]
#4b
Randomization
Was the process of selecting the topics
to be tested on the AI-model
randomized?
Alfertshofer et al. [41]; and Aljindan et al. [43]
#5 Individual Is there any individual subjective
involvement in AI content evaluation?
If so, did the authors describe the
details in full?
Ali et al. [42]; and Moise et al. [63]
#6 Count What is the count of queries executed
(sample size)?
Alfertshofer et al. [41]; Chen et al. [47]; Guerra
et al. [54]; Hoch et al. [56]; and Oztermeli &
Oztermeli [65]
#7 Specificity of
prompt/language
How specific are the exact prompts
used? Were those exact prompts
provided fully? Did the authors
consider the feedback and learning
loops? How specific are the language
and cultural issues considered in the AI
model?
Alfertshofer et al. [41]; Biswas et al. [46]; Fuchs
et al. [49]; Grewal et al. [53]; H. Wang et al.
[71]; Moise et al. [63]; and Pugliese et al. [66]
Inter-rater Assessment of the Included Records based on METRICS
Scores
The overall mean METRICS score was 3.0±0.58. Per item, the inter-rater reliability (Cohen's κ) ranged from
0.558 to 0.962 (P<.001 for the nine tested items), indicating good to excellent agreement (Table 5).
Table 5. The inter-rater reliability per METRICS item.
METRICS Item | Mean±SD | Quality | Cohen's κ | Asymptotic Standard Error | Approximate T | P value
1 Model | 3.72±0.58 | Very good | .820 | .090 | 6.044 | <.001
2 Timing | 2.1.93 | Good | .853 | .076 | 6.565 | <.001
3 Count | 3.04±1.32 | Good | .962 | .037 | 10.675 | <.001
4 Specificity of prompt/language | 3.44±1.25 | Very good | .765 | .086 | 8.083 | <.001
5 Evaluation | 3.31±1.16 | Good | .885 | .063 | 9.668 | <.001
6 Individual | 2.1.42 | Satisfactory | .865 | .087 | 6.860 | <.001
7 Transparency | 3.24±1.01 | Good | .558 | .112 | 5.375 | <.001
8 Range | 3.24±1.07 | Good | .836 | .076 | 8.102 | <.001
9 Randomization | 1.31±0.87 | Sub-optimal | .728 | .135 | 5.987 | <.001
METRICS score | 3.01±0.58 | Good | .381 | .086 | 10.093 | <.001
Classified per item, the highest average METRICS score was recorded for the "Model" item, followed by
the "Specificity of prompt/language" item, while the lowest scores were recorded for the "Randomization"
item (sub-optimal) and the "Individual" item (satisfactory) (Table 5).
Discussion
Principal Results
The interpretation and synthesis of credible scientific evidence based on the studies that evaluated
the commonly used AI-based conversational models (e.g., ChatGPT, Bing, and Bard) can be
challenging. This is related to the discernible variability in the methods used for evaluation of such
models as well as the varying styles of reporting. Such variability is understandable considering the
emerging nature of this research field, with less than a year of published reports as of the time of writing.
Therefore, a standardized framework to guide the design of such studies and to delineate the best
reporting practices can be beneficial, since rigorous methodology and clear reporting of the findings
are key attributes of science to reach reliable conclusions with real-world implications.
In this study, a preliminary checklist referred to as METRICS was formulated, which can help
researchers aspiring to test the performance of AI-based models in various aspects of healthcare
education and practice. It is crucial to explicitly state that the proposed METRICS checklist in this
study cannot be claimed to be comprehensive or flawless; nevertheless, this checklist could form a
solid base for future and much needed efforts aiming to standardize reporting of the AI-based studies
in healthcare.
The principal finding of this study was the establishment of nine key themes that are recommended
to be considered in the design, testing, and reporting of AI-based models in research, particularly in
the healthcare domain. These features were the AI model used, the evaluation approach, the timing of testing
and transparency, the range of topics tested and randomization of queries, individual factors in the design
and assessment, the count of queries, and the specificity of the prompts and languages used. The relevance of
these themes in the design and reporting of AI-model content testing can be illustrated as follows.
First, the variability in AI model types used to conduct the queries and variability in settings pose
significant challenges for cross-study comparisons. The significant impact of the AI model on its
resultant output is related to the distinct architectures and capabilities of various AI models with
expected variability in performance and quality of AI-generated content [74]. Additionally, various
options to configure the models further affect the AI-generated content. Consequently, it is important
to consider these variations when evaluating research using different AI models [75-79]. These issues
can be illustrated clearly by the included records in this study that conducted head-to-head comparisons of at least two models. For example, Al-Ashwal et al. showed that Bing had the highest accuracy and specificity in predicting drug-drug interactions, outperforming Bard, ChatGPT-3.5, and ChatGPT-4 [40]. Another example by Baglivo et al. showed not only inter-model variability but also intra-model variability in performance in the domain of public health [45]. Additionally, in the context of providing information on rhinoplasty, Seth et al. demonstrated this inter-model variability in performance, with Bard content being the most comprehensible, followed by ChatGPT and Bing [68].
Second, the continuous updating of AI models introduces significant temporal variability, which
would influence the comparability of studies conducted at different times. Updates to AI models result in enhancements in capabilities and performance [80]. Consequently, this temporal variability
can lead to inconsistencies in synthesizing evidence, as the same model may demonstrate different
outputs over time. Therefore, when analyzing or comparing studies involving AI models, it is crucial
to consider the specific version and state of the model at the time of each study to accurately interpret
and compare results. In this context, it is important to conduct future longitudinal studies to discern
the exact effect of changes in performance of the commonly used AI models over time.
Third, the count of queries in evaluating an AI model was identified among the pertinent themes of
assessment. This appears understandable since studies employing a larger number of queries can
provide a more comprehensive evaluation of the tested model. An extensive number of queries can reveal minor weaknesses, despite the difficulty of establishing what constitutes an extensive number of queries or the minimum number of queries needed to reveal the real performance of the AI model
in a particular topic. In this study, the number of queries varied from a single case to more than 2500 questions, showing the need for standardization and for establishing a clear guide on the number of queries deemed suitable [56, 73].
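Although the included records do not establish a universal minimum, one hedged way to reason about query counts is the standard sample-size calculation for estimating a proportion (here, the model's accuracy) within a chosen margin of error; the figures below are illustrative assumptions rather than recommendations derived from this review.

```python
# Illustrative sketch: approximate number of queries needed to estimate an AI
# model's accuracy (a proportion) within a chosen margin of error, using the
# normal-approximation formula n = z^2 * p * (1 - p) / E^2.
# The confidence level, assumed accuracy, and margin of error are assumptions.
import math

z = 1.96                 # z-score for a 95% confidence level
p = 0.5                  # assumed accuracy; 0.5 is the most conservative choice
margin_of_error = 0.05   # desired half-width of the confidence interval (±5%)

n = math.ceil(z**2 * p * (1 - p) / margin_of_error**2)
print(f"Approximate number of queries needed: {n}")  # about 385 under these assumptions
```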
Fourth, a key theme identified in this study was the nature and language of the prompts used to
conduct the studies. The exact prompting approach, as well as the presence of cultural and linguistic biases, appears to be a critical factor that can influence the quality of content generated by AI-based models [81]. Slight differences in wording or context in the prompt used to generate the AI content
can lead to recognizable differences in the content generated [34, 82, 83]. Additionally, the feedback
mechanisms and learning loops allowing AI-based models to learn from interactions can change the
model performance for the same query, which might not be consistently accounted for in all studies.
These minor variations in prompts across different studies can also complicate the synthesis of evidence, highlighting the need to standardize this aspect. Additionally, as highlighted above, AI-based models may exhibit biases based on their training data, affecting performance across various cultural or linguistic contexts [84-86]. Thus, studies conducted in different regions or involving various languages might yield varying results. In this study, we found that a majority of the included records tested AI-based models using the English language, highlighting the need for more studies in other languages to reveal possible variability in performance based on language. Comparative studies involving multiple languages can reveal such inconsistencies, as exemplified by the study by Wang et al. [72]. In that study, which assessed ChatGPT performance on the Taiwanese pharmacist licensing examination, performance on the English test was better than on the Chinese test across all tested subjects [72]. Another comprehensive study by Alfertshofer et
al. that assessed the performance of ChatGPT on six different national medical licensing exams
highlighted the variability in performance per country/language [41].
The fifth important theme highlighted in the current study was the approach of evaluating the AI-
generated content. Variable methods of assessment can introduce discernible methodological variability. Specifically, objective assessment ensures consistency, whereas subjective assessment, even by experts, can vary despite the professional judgment and deep understanding provided by such expert opinion [87].
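As a hedged illustration of how objective assessment supports consistency, the sketch below scores hypothetical model answers against a predefined answer key; the questions, key, and model answers are placeholders rather than items from any included record.

```python
# Minimal sketch of an objective evaluation: scoring AI-generated answers to
# multiple-choice questions against a predefined answer key.
# The questions, key, and model answers below are hypothetical placeholders.
answer_key = {"Q1": "B", "Q2": "D", "Q3": "A", "Q4": "C"}
model_answers = {"Q1": "B", "Q2": "C", "Q3": "A", "Q4": "C"}

correct = sum(model_answers[q] == answer_key[q] for q in answer_key)
accuracy = correct / len(answer_key)
print(f"Accuracy: {correct}/{len(answer_key)} ({accuracy:.0%})")  # 3/4 (75%) in this example
```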
Similarly, the number and expertise of the evaluators/raters involved in constructing and evaluating the AI-based studies were identified as a pertinent theme in this study [88, 89]. Variations in rater
numbers across studies can lead to inconsistencies in synthesized evidence [66, 67, 90]. Additionally,
the method used to establish agreement (e.g., kappa statistics, consensus meetings) might differ in
various studies, affecting the comparability of results.
Finally, the data-pertinent issues were identified as key themes in this study. This involves the need
for full transparency regarding the sources of data used to create the queries (e.g., question banks,
credible national and international guidelines, clinical reports, etc.) [91, 92]. Additionally, ethical
considerations, such as consent to use copyrighted material and consent/anonymization of the
clinical data, should be carefully stated in AI-based model evaluation studies. An important aspect that appeared sub-optimal in the majority of included records was randomization to reduce or eliminate potential bias in query selection. Thus, this important issue should be addressed in future studies to allow unbiased evaluation of AI-based model content. Another important aspect is the
need to carefully select the topics to be tested, which can belong to a broad domain (e.g., medical
examination) or a narrow domain (e.g., a particular specialty) [93-95]. A comprehensive description
of topics is essential to reveal subtle differences in AI performance across various domains. Biased
query coverage per topic may result in unreliable conclusions regarding the AI model performance.
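As one possible way to operationalize the randomization theme discussed above, the sketch below draws a reproducible random sample of queries from a larger question bank; the bank contents, its size, and the sample size are hypothetical assumptions, not elements of any included study.

```python
# Minimal sketch of unbiased query selection: drawing a reproducible random
# sample of questions from a larger question bank before testing an AI model.
# The question bank and the sample size are hypothetical placeholders.
import random

question_bank = [f"Question {i}" for i in range(1, 501)]  # e.g., 500 candidate queries

random.seed(42)  # a fixed, reported seed makes the selection reproducible
selected_queries = random.sample(question_bank, k=100)  # 100 queries drawn without replacement

print(selected_queries[:5])  # preview the first few selected queries
```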
Limitations
It is crucial to explicitly mention the need for careful interpretation of the findings based on the
following limitations. First, the search process did not include the broad term "artificial intelligence", which may have inadvertently resulted in missing relevant references. Additionally, the reliance on
including published English records, indexed in Scopus, PubMed, or Google Scholar, could raise
concerns about potential selection bias and the exclusion of relevant studies. However, it is important to consider these limitations in light of the context of our study, which represents a preliminary report that needs to be validated by future comprehensive and exhaustive studies. Second, it is important to
acknowledge that a few pertinent themes could have been overlooked despite our attempt to achieve
a thorough analysis given the limited number of authors. Additionally, the subjective nature of
pertinent theme selection should be considered as another important caveat in this study. This shortcoming extended to the raters' subjective assessment in assigning the METRICS scores. Moreover, the equal weight given to each item of the checklist in the METRICS score might
not be a suitable approach given the possibility of varying importance of each component. Third, the
focus on a few specific AI-based conversational models (i.e., ChatGPT, Bing, and Bard) can
potentially overlook the nuanced aspects of other AI models. Nevertheless, our approach was
justified by the popularity and widespread use of these particular AI-based models. Lastly, we fully
and unequivocally acknowledge that the METRICS checklist is preliminary and needs further
verification to ensure its valid applicability.
Future Perspectives
The METRICS checklist proposed in this study could be a helpful step towards
establishing useful guidelines to design and report the findings of AI-based
studies. The integration of AI models in healthcare education and practice
necessitates a collaborative approach involving healthcare professionals,
researchers, and AI developers. Synthesis of evidence with critical appraisal of
the quality of each element in the METRICS checklist is recommended for continuous enhancement of the AI output, which would result in successful implementation of AI models in healthcare while mitigating the possible concerns. Regular multidisciplinary efforts and iterative revisions are
recommended to ensure that the METRICS checklist properly reflects its original
intended purpose of improving the quality of study design and results reporting
in this swiftly evolving research field. Future studies would benefit from
expanding the scope of literature review and data inclusion, incorporating a
wider range of databases, languages, and AI models. This is crucial for reaching the ultimate aim of standardizing the design and reporting of AI-based studies.
Conclusions
The newly devised METRICS checklist may represent a key initial step to motivate the
standardization of reporting of AI-based studies in healthcare education and practice. Additionally, the establishment of this checklist can motivate collaborative efforts to develop universally accepted
reporting guidelines for AI-based studies. In turn, this can enhance the comparability and reliability
of evidence synthesis from these studies. The METRICS checklist, as presented by the findings of this study, can help to elucidate the strengths and limitations of AI model-generated content, guiding the future development and application of these models. The standardization offered by the METRICS checklist can be important to ensure the reporting of reliable and replicable results. Subsequently, this can help in exploiting the promising potential of AI-based models in healthcare while avoiding their possible pitfalls. The METRICS checklist could mark significant progress in this evolving research field; nevertheless, there remains considerable room for its refinement through revisions and updates to verify its validity.
Acknowledgements
None.
Conflicts of Interest
None declared.
Abbreviations
AI: Artificial intelligence
CASP: Critical Appraisal Skills Programme
METRICS: Model, Evaluation, Timing/Transparency, Range/Randomization, Individual, Count, and Specificity of prompts and language
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
STROBE: Strengthening the Reporting of Observational Studies in Epidemiology
References
1. Sallam M. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic
Review on the Promising Perspectives and Valid Concerns. Healthcare (Basel). 2023 Mar
19;11(6):887. PMID: 36981544. doi: 10.3390/healthcare11060887.
2. Garg RK, Urs VL, Agarwal AA, Chaudhary SK, Paliwal V, Kar SK. Exploring the role of
ChatGPT in patient care (diagnosis and treatment) and medical research: A systematic review.
Health Promot Perspect. 2023;13(3):183-91. PMID: 37808939. doi: 10.34172/hpp.2023.22.
3. Alam F, Lim MA, Zulkipli IN. Integrating AI in medical education: embracing ethical
usage and critical understanding. Front Med (Lausanne). 2023;10:1279707. PMID: 37901398.
doi: 10.3389/fmed.2023.1279707.
4. Jianning L, Amin D, Jens K, Jan E. ChatGPT in Healthcare: A Taxonomy and Systematic
Review. medRxiv. 2023:2023.03.30.23287899. doi: 10.1101/2023.03.30.23287899.
5. Mesko B. The ChatGPT (Generative Artificial Intelligence) Revolution Has Made Artificial
Intelligence Approachable for Medical Professionals. J Med Internet Res. 2023 Jun
22;25:e48392. PMID: 37347508. doi: 10.2196/48392.
6. Rudolph J, Tan S, Tan S. War of the chatbots: Bard, Bing Chat, ChatGPT, Ernie and beyond.
The new AI gold rush and its impact on higher education. Journal of Applied Learning and
Teaching. 2023;6(1):364-89. doi: 10.37074/jalt.2023.6.1.23.
7. Choudhury A, Shamszare H. Investigating the Impact of User Trust on the Adoption and
Use of ChatGPT: Survey Analysis. J Med Internet Res. 2023 Jun 14;25:e47184. PMID: 37314848.
doi: 10.2196/47184.
8. Shahsavar Y, Choudhury A. User Intentions to Use ChatGPT for Self-Diagnosis and
Health-Related Purposes: Cross-sectional Survey Study. JMIR Hum Factors. 2023 May
17;10:e47564. PMID: 37195756. doi: 10.2196/47564.
9. Bajwa J, Munir U, Nori A, Williams B. Artificial intelligence in healthcare: transforming
the practice of medicine. Future Healthc J. 2021 Jul;8(2):e188-e94. PMID: 34286183. doi:
10.7861/fhj.2021-0095.
10. Rao A, Pang M, Kim J, Kamineni M, Lie W, Prasad AK, et al. Assessing the Utility of
ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study. J Med
Internet Res. 2023 Aug 22;25:e48659. PMID: 37606976. doi: 10.2196/48659.
11. Giannos P, Delardas O. Performance of ChatGPT on UK Standardized Admission Tests:
Insights From the BMAT, TMUA, LNAT, and TSA Examinations. JMIR Med Educ. 2023 Apr
26;9:e47737. PMID: 37099373. doi: 10.2196/47737.
12. Alowais SA, Alghamdi SS, Alsuhebany N, Alqahtani T, Alshaya AI, Almohareb SN, et al.
Revolutionizing healthcare: the role of artificial intelligence in clinical practice. BMC Medical
Education. 2023 2023/09/22;23(1):689. doi: 10.1186/s12909-023-04698-z.
13. Miao H, Li C, Wang J. A Future of Smarter Digital Health Empowered by Generative
Pretrained Transformer. J Med Internet Res. 2023 Sep 26;25:e49963. PMID: 37751243. doi:
10.2196/49963.
14. Liu T, Xiao X. A Framework of AI-Based Approaches to Improving eHealth Literacy and
Combating Infodemic. Front Public Health. 2021;9:755808. PMID: 34917575. doi:
10.3389/fpubh.2021.755808.
15. Hsu HY, Hsu KC, Hou SY, Wu CL, Hsieh YW, Cheng YD. Examining Real-World Medication
Consultations and Drug-Herb Interactions: ChatGPT Performance Evaluation. JMIR Med Educ.
2023 Aug 21;9:e48433. PMID: 37561097. doi: 10.2196/48433.
16. Chen Z. Ethics and discrimination in artificial intelligence-enabled recruitment practices.
Humanities and Social Sciences Communications. 2023 2023/09/13;10(1):567. doi:
10.1057/s41599-023-02079-x.
17. Wang C, Liu S, Yang H, Guo J, Wu Y, Liu J. Ethical Considerations of Using ChatGPT in
Health Care. J Med Internet Res. 2023 Aug 11;25:e48009. PMID: 37566454. doi:
10.2196/48009.
18. Emsley R. ChatGPT: these are not hallucinations - they're fabrications and falsifications.
Schizophrenia. 2023 2023/08/19;9(1):52. doi: 10.1038/s41537-023-00379-4.
19. Gödde D, Nöhl S, Wolf C, Rupert Y, Rimkus L, Ehlers J, et al. A SWOT (Strengths,
Weaknesses, Opportunities, and Threats) Analysis of ChatGPT in the Medical Literature:
Concise Review. J Med Internet Res. 2023 Nov 16;25:e49368. PMID: 37865883. doi:
10.2196/49368.
20. Murdoch B. Privacy and artificial intelligence: challenges for protecting health
information in a new era. BMC Medical Ethics. 2021 2021/09/15;22(1):122. doi:
10.1186/s12910-021-00687-3.
21. Mijwil M, Aljanabi M, Ali AH. Chatgpt: Exploring the role of cybersecurity in the
protection of medical information. Mesopotamian journal of cybersecurity. 2023;2023:18-21.
doi: 10.58496/MJCS/2023/004.
22. Sun L, Yin C, Xu Q, Zhao W. Artificial intelligence for healthcare and medical education: a
systematic review. Am J Transl Res. 2023;15(7):4820-8. PMID: 37560249.
23. Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How Does ChatGPT
Perform on the United States Medical Licensing Examination? The Implications of Large
Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023 Feb
8;9:e45312. PMID: 36753318. doi: 10.2196/45312.
24. Sallam M, Salim NA, Al-Tammemi AB, Barakat M, Fayyad D, Hallit S, et al. ChatGPT
Output Regarding Compulsory Vaccination and COVID-19 Vaccine Conspiracy: A Descriptive
Study at the Outset of a Paradigm Shift in Online Search for Information. Cureus. 2023
Feb;15(2):e35029. PMID: 36819954. doi: 10.7759/cureus.35029.
25. Scherr R, Halaseh FF, Spina A, Andalib S, Rivera R. ChatGPT Interactive Medical
Simulations for Early Clinical Education: Case Study. JMIR Med Educ. 2023 Nov 10;9:e49877.
PMID: 37948112. doi: 10.2196/49877.
26. Rushabh D, Kanhai A, Pavan K, Simar B, Sophie C, Howard PF. Utilizing Large Language
Models to Simplify Radiology Reports: a comparative analysis of ChatGPT3.5, ChatGPT4.0,
Google Bard, and Microsoft Bing. medRxiv. 2023:2023.06.04.23290786. doi:
10.1101/2023.06.04.23290786.
27. Ray PP. ChatGPT: A comprehensive review on background, applications, key challenges,
bias, ethics, limitations and future scope. Internet of Things and Cyber-Physical Systems. 2023
2023/01/01/;3:121-54. doi: 10.1016/j.iotcps.2023.04.003.
28. Qarajeh A, Tangpanithandee S, Thongprayoon C, Suppadungsuk S, Krisanapan P,
Aiumtrakul N, et al. AI-Powered Renal Diet Support: Performance of ChatGPT, Bard AI, and Bing
Chat. Clin Pract. 2023 Sep 26;13(5):1160-72. PMID: 37887080. doi:
10.3390/clinpract13050104.
29. Zúñiga Salazar G, Zúñiga D, Vindel CL, Yoong AM, Hincapie S, Zúñiga AB, et al. Efficacy of
AI Chats to Determine an Emergency: A Comparison Between OpenAI's ChatGPT, Google Bard,
and Microsoft Bing AI Chat. Cureus. 2023 Sep;15(9):e45473. PMID: 37727841. doi:
10.7759/cureus.45473.
30. Fijačko N, Prosen G, Abella BS, Metličar Š, Štiglic G. Can novel multimodal chatbots such
as Bing Chat Enterprise, ChatGPT-4 Pro, and Google Bard correctly interpret electrocardiogram
images? Resuscitation. 2023 Oct 24;193:110009. PMID: 37884222. doi:
10.1016/j.resuscitation.2023.110009.
31. Aiumtrakul N, Thongprayoon C, Suppadungsuk S, Krisanapan P, Miao J, Qureshi F, et al.
Navigating the Landscape of Personalized Medicine: The Relevance of ChatGPT, BingChat, and
Bard AI in Nephrology Literature Searches. J Pers Med. 2023 Sep 30;13(10):1457. PMID:
37888068. doi: 10.3390/jpm13101457.
32. Stephens LD, Jacobs JW, Adkins BD, Booth GS. Battle of the (Chat)Bots: Comparing Large
Language Models to Practice Guidelines for Transfusion-Associated Graft-Versus-Host Disease
Prevention. Transfus Med Rev. 2023 Jul;37(3):150753. PMID: 37704461. doi:
10.1016/j.tmrv.2023.150753.
33. Eysenbach G. The Role of ChatGPT, Generative Language Models, and Artificial
Intelligence in Medical Education: A Conversation With ChatGPT and a Call for Papers. JMIR
Med Educ. 2023 Mar 6;9:e46885. PMID: 36863937. doi: 10.2196/46885.
34. Meskó B. Prompt Engineering as an Important Emerging Skill for Medical Professionals:
Tutorial. J Med Internet Res. 2023 Oct 4;25:e50638. PMID: 37792434. doi: 10.2196/50638.
35. Hristidis V, Ruggiano N, Brown EL, Ganta SRR, Stewart S. ChatGPT vs Google for Queries
Related to Dementia and Other Cognitive Decline: Comparison of Results. J Med Internet Res.
2023 Jul 25;25:e48966. PMID: 37490317. doi: 10.2196/48966.
36. Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The
PRISMA 2020 statement: an updated guideline for reporting systematic reviews. Bmj. 2021
Mar 29;372:n71. PMID: 33782057. doi: 10.1136/bmj.n71.
37. Harzing A-W. The publish or perish book [electronic resource]: Your guide to effective
and responsible citation analysis. 1st ed. Melbourne, Australia: Tarma Software Research Pty Limited; 2010. ISBN: 0980848512.
38. von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP.
Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement:
guidelines for reporting observational studies. Bmj. 2007 Oct 20;335(7624):806-8. PMID:
17947786. doi: 10.1136/bmj.39335.541782.AD.
39. Critical Appraisal Skills Programme. CASP Qualitative Studies Checklist. 2022 [cited 10 November 2023]; Available from: https://casp-uk.net/casp-tools-checklists/.
40. Al-Ashwal FY, Zawiah M, Gharaibeh L, Abu-Farha R, Bitar AN. Evaluating the Sensitivity,
Specificity, and Accuracy of ChatGPT-3.5, ChatGPT-4, Bing AI, and Bard Against Conventional
Drug-Drug Interactions Clinical Tools. Drug Healthc Patient Saf. 2023;15:137-47. PMID:
37750052. doi: 10.2147/dhps.S425858.
41. Alfertshofer M, Hoch CC, Funk PF, Hollmann K, Wollenberg B, Knoedler S, et al. Sailing
the Seven Seas: A Multinational Comparison of ChatGPT's Performance on Medical Licensing
Examinations. Ann Biomed Eng. 2023 Aug 8;Online ahead of print. PMID: 37553555. doi:
10.1007/s10439-023-03338-3.
42. Ali K, Barhom N, Tamimi F, Duggal M. ChatGPT-A double-edged sword for healthcare
education? Implications for assessments of dental students. Eur J Dent Educ. 2023 Aug 7;Online
ahead of print. PMID: 37550893. doi: 10.1111/eje.12937.
43. Aljindan FK, Al Qurashi AA, Albalawi IAS, Alanazi AMM, Aljuhani HAM, Falah Almutairi F,
et al. ChatGPT Conquers the Saudi Medical Licensing Exam: Exploring the Accuracy of Artificial
Intelligence in Medical Knowledge Assessment and Implications for Modern Medical Education.
Cureus. 2023 Sep;15(9):e45043. PMID: 37829968. doi: 10.7759/cureus.45043.
44. Altamimi I, Altamimi A, Alhumimidi AS, Altamimi A, Temsah MH. Snakebite Advice and
Counseling From Artificial Intelligence: An Acute Venomous Snakebite Consultation With
ChatGPT. Cureus. 2023 Jun;15(6):e40351. PMID: 37456381. doi: 10.7759/cureus.40351.
45. Baglivo F, De Angelis L, Casigliani V, Arzilli G, Privitera GP, Rizzo C. Exploring the Possible
Use of AI Chatbots in Public Health Education: Feasibility Study. JMIR Med Educ. 2023 Nov
1;9:e51421. PMID: 37910155. doi: 10.2196/51421.
46. Biswas S, Logan NS, Davies LN, Sheppard AL, Wolffsohn JS. Assessing the utility of
ChatGPT as an artificial intelligence-based large language model for information to answer
questions on myopia. Ophthalmic Physiol Opt. 2023 Nov;43(6):1562-70. PMID: 37476960. doi:
10.1111/opo.13207.
47. Chen TC, Multala E, Kearns P, Delashaw J, Dumont A, Maraganore D, et al. Assessment of
ChatGPT's performance on neurology written board examination questions. BMJ Neurol Open.
2023;5(2):e000530. PMID: 37936648. doi: 10.1136/bmjno-2023-000530.
48. Deiana G, Dettori M, Arghittu A, Azara A, Gabutti G, Castiglia P. Artificial Intelligence and
Public Health: Evaluating ChatGPT Responses to Vaccination Myths and Misconceptions.
Vaccines (Basel). 2023 Jul 7;11(7). PMID: 37515033. doi: 10.3390/vaccines11071217.
49. Fuchs A, Trachsel T, Weiger R, Eggmann F. ChatGPT's performance in dentistry and
allergy-immunology assessments: a comparative study. Swiss Dent J. 2023 Oct 6;134(5). PMID:
37799027.
50. Ghosh A, Bir A. Evaluating ChatGPT's Ability to Solve Higher-Order Questions on the
Competency-Based Medical Education Curriculum in Medical Biochemistry. Cureus. 2023
Apr;15(4):e37023. PMID: 37143631. doi: 10.7759/cureus.37023.
51. Giannos P. Evaluating the limits of AI in medical specialisation: ChatGPT's performance
on the UK Neurology Specialty Certificate Examination. BMJ Neurol Open. 2023;5(1):e000451.
PMID: 37337531. doi: 10.1136/bmjno-2023-000451.
52. Gobira M, Nakayama LF, Moreira R, Andrade E, Regatieri CVS, Belfort R, Jr. Performance
of ChatGPT-4 in answering questions from the Brazilian National Examination for Medical
Degree Revalidation. Rev Assoc Med Bras (1992). 2023;69(10):e20230848. PMID: 37792871.
doi: 10.1590/1806-9282.20230848.
53. Grewal H, Dhillon G, Monga V, Sharma P, Buddhavarapu VS, Sidhu G, et al. Radiology Gets
Chatty: The ChatGPT Saga Unfolds. Cureus. 2023 Jun;15(6):e40135. PMID: 37425598. doi:
10.7759/cureus.40135.
54. Guerra GA, Hofmann H, Sobhani S, Hofmann G, Gomez D, Soroudi D, et al. GPT-4 Artificial
Intelligence Model Outperforms ChatGPT, Medical Students, and Neurosurgery Residents on
Neurosurgery Written Board-Like Questions. World Neurosurg. 2023 Aug 18. PMID: 37597659.
doi: 10.1016/j.wneu.2023.08.042.
55. Hamed E, Eid A, Alberry M. Exploring ChatGPT's Potential in Facilitating Adaptation of
Clinical Guidelines: A Case Study of Diabetic Ketoacidosis Guidelines. Cureus. 2023
May;15(5):e38784. PMID: 37303347. doi: 10.7759/cureus.38784.
56. Hoch CC, Wollenberg B, Lüers JC, Knoedler S, Knoedler L, Frank K, et al. ChatGPT's quiz
skills in different otolaryngology subspecialties: an analysis of 2576 single-choice and multiple-
choice board certification preparation questions. Eur Arch Otorhinolaryngol. 2023
Sep;280(9):4271-8. PMID: 37285018. doi: 10.1007/s00405-023-08051-4.
57. Juhi A, Pipil N, Santra S, Mondal S, Behera JK, Mondal H. The Capability of ChatGPT in
Predicting and Explaining Common Drug-Drug Interactions. Cureus. 2023 Mar;15(3):e36272.
PMID: 37073184. doi: 10.7759/cureus.36272.
58. Kuang YR, Zou MX, Niu HQ, Zheng BY, Zhang TL, Zheng BW. ChatGPT encounters multiple
opportunities and challenges in neurosurgery. Int J Surg. 2023 Oct 1;109(10):2886-91. PMID:
37352529. doi: 10.1097/js9.0000000000000571.
59. Kumari A, Kumari A, Singh A, Singh SK, Juhi A, Dhanvijay AKD, et al. Large Language
Models in Hematology Case Solving: A Comparative Study of ChatGPT-3.5, Google Bard, and
Microsoft Bing. Cureus. 2023 Aug;15(8):e43861. PMID: 37736448. doi: 10.7759/cureus.43861.
60. Kung JE, Marshall C, Gauthier C, Gonzalez TA, Jackson JB, 3rd. Evaluating ChatGPT
Performance on the Orthopaedic In-Training Examination. JB JS Open Access. 2023 Jul-
Sep;8(3). PMID: 37693092. doi: 10.2106/jbjs.Oa.23.00056.
61. Lai UH, Wu KS, Hsu TY, Kan JKC. Evaluating the performance of ChatGPT-4 on the United
Kingdom Medical Licensing Assessment. Front Med (Lausanne). 2023;10:1240915. PMID:
37795422. doi: 10.3389/fmed.2023.1240915.
62. Lyu Q, Tan J, Zapadka ME, Ponnatapura J, Niu C, Myers KJ, et al. Translating radiology
reports into plain language using ChatGPT and GPT-4 with prompt learning: results,
limitations, and potential. Vis Comput Ind Biomed Art. 2023 May 18;6(1):9. PMID: 37198498.
doi: 10.1186/s42492-023-00136-5.
63. Moise A, Centomo-Bozzo A, Orishchak O, Alnoury MK, Daniel SJ. Can ChatGPT Guide
Parents on Tympanostomy Tube Insertion? Children (Basel). 2023 Sep 30;10(10):1634. PMID:
37892297. doi: 10.3390/children10101634.
64. Oca MC, Meller L, Wilson K, Parikh AO, McCoy A, Chang J, et al. Bias and Inaccuracy in AI
Chatbot Ophthalmologist Recommendations. Cureus. 2023 Sep;15(9):e45911. PMID:
37885556. doi: 10.7759/cureus.45911.
65. Oztermeli AD, Oztermeli A. ChatGPT performance in the medical specialty exam: An
observational study. Medicine (Baltimore). 2023 Aug 11;102(32):e34673. PMID: 37565917.
doi: 10.1097/md.0000000000034673.
66. Pugliese N, Wai-Sun Wong V, Schattenberg JM, Romero-Gomez M, Sebastiani G, Aghemo
A. Accuracy, Reliability, and Comprehensiveness of ChatGPT-Generated Medical Responses for
Patients With Nonalcoholic Fatty Liver Disease. Clin Gastroenterol Hepatol. 2023 Sep 15. PMID:
37716618. doi: 10.1016/j.cgh.2023.08.033.
67. Sallam M, Salim NA, Barakat M, Al-Tammemi AB. ChatGPT applications in medical,
dental, pharmacy, and public health education: A descriptive study highlighting the advantages
and limitations. Narra J. 2023;3(1):e103. doi: 10.52225/narra.v3i1.103.
68. Seth I, Lim B, Xie Y, Cevik J, Rozen WM, Ross RJ, et al. Comparing the Efficacy of Large
Language Models ChatGPT, BARD, and Bing AI in Providing Information on Rhinoplasty: An
Observational Study. Aesthet Surg J Open Forum. 2023;5:ojad084. PMID: 37795257. doi:
10.1093/asjof/ojad084.
69. Suthar PP, Kounsal A, Chhetri L, Saini D, Dua SG. Artificial Intelligence (AI) in Radiology:
A Deep Dive Into ChatGPT 4.0's Accuracy with the American Journal of Neuroradiology's
(AJNR) "Case of the Month". Cureus. 2023 Aug;15(8):e43958. PMID: 37746411. doi:
10.7759/cureus.43958.
70. Walker HL, Ghani S, Kuemmerli C, Nebiker CA, Müller BP, Raptis DA, et al. Reliability of
Medical Information Provided by ChatGPT: Assessment Against Clinical Guidelines and Patient
Information Quality Instrument. J Med Internet Res. 2023 Jun 30;25:e47479. PMID: 37389908.
doi: 10.2196/47479.
71. Wang H, Wu W, Dou Z, He L, Yang L. Performance and exploration of ChatGPT in medical
examination, records and education in Chinese: Pave the way for medical AI. Int J Med Inform.
2023 Sep;177:105173. PMID: 37549499. doi: 10.1016/j.ijmedinf.2023.105173.
72. Wang YM, Shen HW, Chen TJ. Performance of ChatGPT on the pharmacist licensing
examination in Taiwan. J Chin Med Assoc. 2023 Jul 1;86(7):653-8. PMID: 37227901. doi:
10.1097/jcma.0000000000000942.
73. Zhou Y, Moon C, Szatkowski J, Moore D, Stevens J. Evaluating ChatGPT responses in the
context of a 53-year-old male with a femoral neck fracture: a qualitative analysis. Eur J Orthop
Surg Traumatol. 2023 Sep 30. PMID: 37776392. doi: 10.1007/s00590-023-03742-4.
74. Malhotra R, Singh P. Recent advances in deep learning models: a systematic literature
review. Multimedia Tools and Applications. 2023 2023/04/25. doi: 10.1007/s11042-023-
15295-z.
75. Hirosawa T, Kawamura R, Harada Y, Mizuta K, Tokumasu K, Kaji Y, et al. ChatGPT-
Generated Differential Diagnosis Lists for Complex Case-Derived Clinical Vignettes: Diagnostic
Accuracy Evaluation. JMIR Med Inform. 2023 Oct 9;11:e48808. PMID: 37812468. doi:
10.2196/48808.
76. Levkovich I, Elyoseph Z. Suicide Risk Assessments Through the Eyes of ChatGPT-3.5
Versus ChatGPT-4: Vignette Study. JMIR Ment Health. 2023 Sep 20;10:e51232. PMID:
37728984. doi: 10.2196/51232.
77. Flores-Cohaila JA, Gara-Vicente A, Vizcarra-Jiménez SF, De la Cruz-Galán JP, Gutiérrez-
Arratia JD, Quiroga Torres BG, et al. Performance of ChatGPT on the Peruvian National
Licensing Medical Examination: Cross-Sectional Study. JMIR Med Educ. 2023 Sep 28;9:e48039.
PMID: 37768724. doi: 10.2196/48039.
78. Fraser H, Crossland D, Bacher I, Ranney M, Madsen T, Hilliard R. Comparison of
Diagnostic and Triage Accuracy of Ada Health and WebMD Symptom Checkers, ChatGPT, and
Physicians for Patients in an Emergency Department: Clinical Data Analysis Study. JMIR
Mhealth Uhealth. 2023 Oct 3;11:e49995. PMID: 37788063. doi: 10.2196/49995.
79. Huang RS, Lu KJQ, Meaney C, Kemppainen J, Punnett A, Leung FH. Assessment of
Resident and AI Chatbot Performance on the University of Toronto Family Medicine Residency
Progress Test: Comparative Study. JMIR Med Educ. 2023 Sep 19;9:e50514. PMID: 37725411.
doi: 10.2196/50514.
80. Alzubaidi L, Zhang J, Humaidi AJ, Al-Dujaili A, Duan Y, Al-Shamma O, et al. Review of deep
learning: concepts, CNN architectures, challenges, applications, future directions. Journal of Big
Data. 2021 2021/03/31;8(1):53. doi: 10.1186/s40537-021-00444-8.
81. Ali S, Abuhmed T, El-Sappagh S, Muhammad K, Alonso-Moral JM, Confalonieri R, et al.
Explainable Artificial Intelligence (XAI): What we know and what is left to attain Trustworthy
Artificial Intelligence. Information Fusion. 2023 2023/11/01/;99:101805. doi:
10.1016/j.inffus.2023.101805.
82. Giray L. Prompt Engineering with ChatGPT: A Guide for Academic Writers. Ann Biomed
Eng. 2023 Dec;51(12):2629-33. PMID: 37284994. doi: 10.1007/s10439-023-03272-4.
83. Khlaif ZN, Mousa A, Hattab MK, Itmazi J, Hassan AA, Sanmugam M, et al. The Potential
and Concerns of Using AI in Scientific Research: ChatGPT Performance Evaluation. JMIR Med
Educ. 2023 Sep 14;9:e47049. PMID: 37707884. doi: 10.2196/47049.
84. Varsha PS. How can we manage biases in artificial intelligence systems - A systematic
literature review. International Journal of Information Management Data Insights. 2023
2023/04/01/;3(1):100165. doi: 10.1016/j.jjimei.2023.100165.
85. Yanagita Y, Yokokawa D, Uchida S, Tawara J, Ikusaka M. Accuracy of ChatGPT on Medical
Questions in the National Medical Licensing Examination in Japan: Evaluation Study. JMIR Form
Res. 2023 Oct 13;7:e48023. PMID: 37831496. doi: 10.2196/48023.
86. Taira K, Itaya T, Hanada A. Performance of the Large Language Model ChatGPT on the
National Nurse Examinations in Japan: Evaluation Study. JMIR Nurs. 2023 Jun 27;6:e47305.
PMID: 37368470. doi: 10.2196/47305.
87. Sallam M, Barakat M, Sallam M. CLEAR: Pilot Testing of a Tool to Standardize Assessment
of the Quality of Health Information Generated by Artificial Intelligence-Based Models.
Preprints.org. 2023. doi: 10.20944/preprints202311.1171.v1.
88. Sezgin E, Chekeni F, Lee J, Keim S. Clinical Accuracy of Large Language Models and
Google Search Responses to Postpartum Depression Questions: Cross-Sectional Study. J Med
Internet Res. 2023 Sep 11;25:e49240. PMID: 37695668. doi: 10.2196/49240.
89. Wilhelm TI, Roos J, Kaczmarczyk R. Large Language Models for Therapy
Recommendations Across 3 Clinical Specialties: Comparative Study. J Med Internet Res. 2023
Oct 30;25:e49324. PMID: 37902826. doi: 10.2196/49324.
90. Ferreira AL, Chu B, Grant-Kels JM, Ogunleye T, Lipoff JB. Evaluation of ChatGPT
Dermatology Responses to Common Patient Queries. JMIR Dermatol. 2023 Nov 17;6:e49280.
PMID: 37976093. doi: 10.2196/49280.
91. Thirunavukarasu AJ, Hassan R, Mahmood S, Sanghera R, Barzangi K, El Mukashfi M, et al.
Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge
Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care. JMIR
Med Educ. 2023 Apr 21;9:e46599. PMID: 37083633. doi: 10.2196/46599.
92. Lakdawala N, Channa L, Gronbeck C, Lakdawala N, Weston G, Sloan B, et al. Assessing the
Accuracy and Comprehensiveness of ChatGPT in Offering Clinical Guidance for Atopic
Dermatitis and Acne Vulgaris. JMIR Dermatol. 2023 Nov 14;6:e50409. PMID: 37962920. doi:
10.2196/50409.
93. Borchert RJ, Hickman CR, Pepys J, Sadler TJ. Performance of ChatGPT on the Situational
Judgement Test-A Professional Dilemmas-Based Examination for Doctors in the United
Kingdom. JMIR Med Educ. 2023 Aug 7;9:e48978. PMID: 37548997. doi: 10.2196/48978.
94. Sun H, Zhang K, Lan W, Gu Q, Jiang G, Yang X, et al. An AI Dietitian for Type 2 Diabetes
Mellitus Management Based on Large Language and Image Recognition Models: Preclinical
Concept Validation Study. J Med Internet Res. 2023 Nov 9;25:e51300. PMID: 37943581. doi:
10.2196/51300.
95. Ettman CK, Galea S. The Potential Influence of AI on Population Mental Health. JMIR
Ment Health. 2023 Nov 16;10:e49936. PMID: 37971803. doi: 10.2196/49936.
The integration of artificial intelligence (AI) into everyday life has galvanized a global conversation on the possibilities and perils of AI on human health. In particular, there is a growing need to anticipate and address the potential impact of widely accessible, enhanced, and conversational AI on mental health. We propose 3 considerations to frame how AI may influence population mental health: through the advancement of mental health care; by altering social and economic contexts; and through the policies that shape the adoption, use, and potential abuse of AI-enhanced tools.