Performance Comparison of Large Language Models on VNHSGE English
Dataset: OpenAI ChatGPT, Microsoft Bing Chat, and Google Bard
Xuan-Quy DAO
School of Engineering
Eastern International University
Binh Duong, Vietnam
quy.dao@eiu.edu.vn
Abstract
This paper presents a performance comparison of three large language models (LLMs), namely OpenAI ChatGPT, Microsoft Bing Chat (BingChat), and Google Bard, on the VNHSGE English dataset. The performance of BingChat, Bard, and ChatGPT (GPT-3.5) is 92.4%, 86%, and 79.2%, respectively. The results show that BingChat is better than ChatGPT and Bard. Therefore, BingChat and Bard can replace ChatGPT while ChatGPT is not yet officially available in Vietnam. The results also indicate that BingChat, Bard, and ChatGPT outperform Vietnamese students in English language proficiency. The findings of this study contribute to the understanding of the potential of LLMs in English language education. The remarkable performance of ChatGPT, BingChat, and Bard demonstrates their potential as effective tools for teaching and learning English at the high school level.
1 Introduction
Artificial Intelligence (AI) has revolutionized the field of education by transforming the ways of learning and teaching. One of the most prominent advancements in this domain is the development of large language models (LLMs), such as ChatGPT¹, BingChat², and Bard³ (developed by OpenAI, Microsoft, and Google, respectively). The development and continuous improvement of these LLMs have paved the way for AI to be integrated into various educational applications and domains (OpenAI, 2023). These models have advanced conversational abilities, closely resembling human-like interactions. This capability holds great promise for educational settings, including the utilization of virtual assistants, chatbots, and online learning support systems (Kasneci et al., 2023).

¹ https://chat.openai.com
² https://www.bing.com/chat
³ https://bard.google.com
Although these models offer tremendous potential for personalized and interactive learning experiences, creating new opportunities for educational innovation and transforming the way we teach and learn, the development and deployment of AI in education require careful consideration of the effectiveness of these AI-powered educational tools in different educational and social contexts (Kung et al., 2023; Thorp, 2023).
Similar to other parts of the world, LLMs hold great potential in the field of education in Vietnam. However, it is essential to conduct thorough research and evaluation to comprehensively understand their capabilities and limitations in the specific context of Vietnamese education. ChatGPT scored well on the Vietnamese High School Graduation Examination (VNHSGE) (Dao et al., 2023a) and obtained a good score on an English test case (Dao et al., 2023b). However, it is worth noting that ChatGPT has not been officially deployed in Vietnam. Meanwhile, Vietnamese students have access to BingChat and Bard, two other LLMs. Thus, it is crucial to evaluate the capabilities of BingChat and Bard within the context of Vietnamese education.
We focus on evaluating the performance of BingChat and Bard on the VNHSGE English dataset, and comparing them to ChatGPT. The objective of this evaluation is to determine whether BingChat and Bard can serve as potential alternatives to ChatGPT in English education at the high school level. Thus, we ask the following research questions:
Research Question 1 (RS1): What is the performance of ChatGPT, BingChat, and Bard on the VNHSGE English dataset at the high school level in Vietnam?

Research Question 2 (RS2): How do the LLMs perform in comparison to Vietnamese students in English language proficiency at the high school level in Vietnam?

Research Question 3 (RS3): What potential do LLMs hold for English language teaching and learning at the high school level in the context of Vietnam?

arXiv:2307.02288v3 [cs.CL] 20 Jul 2023
The main contributions of this paper are a comprehensive evaluation of the performance of LLMs, a comparison with human performance on the same tasks, and an identification of the potential applications of LLMs in the context of English language education at the high school level in Vietnam. These findings can inform educators, researchers, and policymakers in making informed decisions regarding the integration of LLMs into the English language curriculum and instructional practices.
2 Related works
2.1 Large Language Models
AI is now capable of understanding and engaging in human-like communication, thanks to recent advancements in LLMs. These breakthroughs have opened up new opportunities for their application in the field of education. LLMs have demonstrated immense potential in areas such as education, content development, and language translation. The two primary architectures of LLMs are BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer). In 2018, Google introduced BERT (Devlin et al., 2018), which has excelled in various natural language processing (NLP) tasks. Developed by OpenAI (Alec et al., 2018), the GPT algorithm was trained on extensive unlabeled text datasets. Building on Google's research, Facebook's RoBERTa (Liu et al., 2019) was introduced, and in 2019, Google released T5 (Raffel et al., 2020). The year 2020 witnessed the creation of GPT-3 by OpenAI (Brown et al., 2020), showcasing exceptional performance in a wide range of NLP tasks. Recently, OpenAI developed GPT-4 (OpenAI, 2023), a large multimodal model capable of processing both text and image inputs and producing text outputs. GPT-4 has demonstrated human-level performance on many professional and academic benchmarks, although its performance may not match that of humans in all contexts. These advancements highlight the progress made in LLMs and their potential impact in various domains.
2.2 Evaluation of LLMs on English
ChatGPT scored an average of 7.18 on the national high school exam in the Netherlands, which is equivalent to the average score of all students participating in the exams in the Netherlands (de Winter, 2023). In other research, ChatGPT outperformed Vietnamese students by scoring 7.92 on the national high school graduation exam in Vietnam (Dao et al., 2023b). Additionally, ChatGPT/GPT-4 has been shown to be as good as commercial translation products at translation (Jiao et al., 2023), as well as having the ability to perform multilingual tasks (Bang et al., 2023).
3 Methods
3.1 Dataset
The VNHSGE English dataset (Dao et al., 2023c) consists of 2500 multiple-choice questions and comprises a diverse range of exam questions extracted from high school examinations, covering various topics and assessing different linguistic abilities. The dataset includes questions related to pronunciation and stress, vocabulary and grammar, communication skills, reading fill-in-the-blank, and reading comprehension. The evaluation dataset comprises 250 multiple-choice questions corresponding to 5 exams from the academic years 2019, 2020, 2021, 2022, and 2023.
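For concreteness, the evaluation subset can be represented as five 50-question exams. The sketch below uses an illustrative in-memory layout; the `Question` fields are assumptions for illustration, not the dataset's published schema:

```python
# Sketch of the 250-question evaluation subset: five exams (2019-2023),
# each with 50 multiple-choice questions. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class Question:
    text: str      # question stem, including any passage
    choices: dict  # {"A": ..., "B": ..., "C": ..., "D": ...}
    answer: str    # ground-truth letter

exams = {year: [] for year in (2019, 2020, 2021, 2022, 2023)}
exams[2019].append(Question(
    text="He's American, ...?",
    choices={"A": "won't he", "B": "didn't he",
             "C": "doesn't he", "D": "isn't he"},
    answer="D",
))

# With the full data loaded, each exam holds 50 questions, so this totals 250.
total = sum(len(questions) for questions in exams.values())
```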
3.2 Prompt
In this study, zero-shot learning was employed, where the LLMs were tasked with answering questions directly without any prior data or example questions. For the VNHSGE English dataset D, let {Q, S} denote pairs where Q is the question and S is the ground truth solution. Additionally, let P denote the context of words. The LLM's answer A is determined by

A = f(P, Q)    (1)

where f is the LLM, which takes into account the context P and the question Q. The context P in this case is a specific structure that guides the LLM's response. It instructs the LLM to provide the answer in the following format: {Choice: "A" or "B" or "C" or "D"; Explanation: Explain the answer; The question is: [the actual question]}. By following this structure, the LLM generates its answer A, which can be evaluated and compared to the ground truth solution S.

Figure 1 illustrates the process of prompting the LLMs and retrieving the results. In the case of multiple-choice questions from the VNHSGE English dataset, the questions are formatted to align with the expected answer format. The questions are then sent to the LLMs' API.
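The prompting step can be sketched as follows: `format_prompt` builds the input (P, Q) of Eq. (1), and `parse_choice` extracts the chosen letter from a response that follows the requested structure. This is an illustrative sketch, not the authors' code; no real chat API is called and both function names are assumptions:

```python
# The fixed context P used to wrap every question (taken from Figure 1).
CONTEXT = (
    'I want you to answer the question in the following structure:\n'
    'Choice: "A" or "B" or "C" or "D"\n'
    'Explanation: Explain the answer\n'
    'The question is:'
)

def format_prompt(question: str) -> str:
    """Prepend the fixed context P to a question Q, i.e. build f's input (P, Q)."""
    return f"{CONTEXT}\n{question}"

def parse_choice(response: str) -> str:
    """Extract the chosen letter from a response following the structure."""
    for line in response.splitlines():
        if line.strip().lower().startswith("choice"):
            for letter in "ABCD":
                if letter in line.split(":", 1)[1]:
                    return letter
    return ""  # model did not follow the requested format

# Example: a well-formed model response
print(parse_choice('Choice: "D"\nExplanation: Tag question.'))  # prints D
```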
3.3 Grading
To evaluate the performance of the LLMs in answering questions, we compared each LLM's response to the ground truth. Given the question Q, the corresponding ground truth solution S, and the LLM's answer A, the grade is

G = g(Q, S, A)    (2)

The evaluation process resulted in a binary grading system, where the LLM's answer was classified as correct or incorrect.

The answers of the LLMs may differ because they are trained on different datasets. To capture the best and worst cases among their answers, we introduced two variables: LLM_B represents the case with the highest graded answer among the LLMs, while LLM_W represents the case with the lowest graded answer:

LLM_B = max(G_i),  LLM_W = min(G_i)    (3)

where G_i represents the grading corresponding to ChatGPT, BingChat, and Bard. These variables allow us to analyze the upper and lower bounds of the performance of the LLMs in terms of grading their answers.
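A minimal sketch of the binary grading in Eq. (2) and the best/worst aggregation in Eq. (3), assuming each answer has already been reduced to a single letter (function names are illustrative, not the authors' code):

```python
def grade(answer: str, ground_truth: str) -> int:
    """Binary grading g(Q, S, A): 1 if the chosen letter matches S, else 0."""
    return int(answer.strip().upper() == ground_truth.strip().upper())

def best_and_worst(grades_per_model: dict) -> tuple:
    """Per-question LLM_B = max(G_i) and LLM_W = min(G_i) over the models."""
    grades = list(grades_per_model.values())
    return max(grades), min(grades)

# One question where BingChat and Bard are right but ChatGPT is wrong:
g = {"ChatGPT": grade("A", "D"),
     "BingChat": grade("D", "D"),
     "Bard": grade("D", "D")}
llm_b, llm_w = best_and_worst(g)  # (1, 0)
```

Averaging `llm_b` and `llm_w` over all questions yields the upper and lower performance bounds reported in Table 1.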
4 Results
4.1 Answers of LLMs
In this section, we present the answers provided by ChatGPT, BingChat, and Bard on the VNHSGE English dataset. We provide illustrative examples for different types of questions, including pronunciation and stress, grammar and vocabulary, communication, reading fill-in-the-blank, and reading comprehension. For detailed information about the answers provided by the LLMs, please refer to Appendix A.
4.2 Performance of LLMs
To answer RS1, we evaluate the performance of ChatGPT, BingChat, and Bard on the VNHSGE English dataset. The findings provide insights into how accurate and effective these LLMs are in answering questions from the dataset. The contribution of this analysis is a comparative assessment of the performance of the three models, which can guide educators and researchers in selecting the most suitable LLM for English language tasks in the context of high school education.
4.2.1 Question order
Figure 2 illustrates the accuracy of the LLMs' answers according to the order of the questions. The results show that the LLMs do not provide accurate answers for the first four questions, which are related to phonetics. This may be because models like ChatGPT, BingChat, and Bard were not sufficiently trained in the field of phonetics. The LLMs perform well on the remaining portions of the exam, proving that they have received adequate training in English vocabulary, grammar, and reading comprehension.
4.2.2 Performance evaluation
Table 1 presents the performance of the LLMs. BingChat demonstrates better results compared to ChatGPT and Bard. The accuracy of LLM_B is 97.2%, demonstrating the significant potential of LLMs in English education at the high school level.

Figure 3 illustrates the stability of the LLMs across the years 2019-2023. The performance indices LLM_B, ChatGPT, and Bard demonstrate relatively stable results, while BingChat and LLM_W exhibit variations between academic years. The consistent performance of the LLMs over the years indicates their ability to maintain stable performance. This demonstrates that LLMs are trained on a large amount of data and can be valuable tools in education.
Table 1: Performance (%)

          2019  2020  2021  2022  2023   AVG
ChatGPT     76    86    76    80    78  79.2
BingChat    92    96    86    94    94  92.4
Bard        82    94    82    86    86  86.0
LLM_W       66    82    68    74    70  72.0
LLM_B       96   100    94    96   100  97.2
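The AVG column of Table 1 is the arithmetic mean of the five yearly accuracies, which can be checked directly:

```python
# Yearly accuracies (%) for 2019-2023, from Table 1.
scores = {
    "ChatGPT":  [76, 86, 76, 80, 78],
    "BingChat": [92, 96, 86, 94, 94],
    "Bard":     [82, 94, 82, 86, 86],
}

# Arithmetic mean over the five exams reproduces the AVG column.
averages = {model: sum(v) / len(v) for model, v in scores.items()}
# {'ChatGPT': 79.2, 'BingChat': 92.4, 'Bard': 86.0}
```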
Figure 1: Formatted question and LLMs response. A multiple-choice question is wrapped in the context prompt ("I want you to answer the question in the following structure: Choice: "A" or "B" or "C" or "D"; Explanation: Explain the answer; The question is: ...") to form a new question, which is sent to the large language models, and the response is collected.
Figure 2: Correctness average of ChatGPT, BingChat, and Bard in question order.
Figure 3: Stability of LLMs performance.
4.3 LLMs and Vietnamese students
4.3.1 LLMs perform better in English than
Vietnamese students
To answer RS2, we examine the performance of the LLMs in comparison to Vietnamese students' English language skills. Our aim is to determine whether LLMs possess abilities comparable to those of humans, although this comparison is challenging due to the dissimilar settings. By conducting this comparison, we can evaluate whether LLMs can serve as effective study aids for Vietnamese students in various subject areas.
The score distribution of the students' exams serves as an indicator for evaluating their performance. The Vietnam Ministry of Education and Training releases score distribution charts for each subject every year, which help assess the candidates' competence and determine the difficulty level of the exams. By gathering the score distributions (2019, 2020, 2021, 2022), we can compare the performance of LLMs with that of Vietnamese students. To facilitate this comparison, we compare the scores of the LLMs to the average score (AVS) and the most frequently achieved score (MVS) of Vietnamese students. For instance, in 2023, the AVS and MVS were 5.84 and 3.8, respectively (see Appendix B for the comparison in the Vietnamese score spectrum). Table 2 shows the scores of the LLMs on a 10-point scale. Figure 4 shows the effectiveness of the LLMs compared with Vietnamese students.
Table 2: Scores

          2019  2020  2021  2022  2023   AVG
ChatGPT    7.6   8.6   7.6   8.0   7.8  7.92
BingChat   9.2   9.6   8.6   9.4   9.4  9.24
Bard       8.2   9.4   8.2   8.6   8.6  8.60
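Each exam has 50 questions, so on the 10-point scale each question is worth 0.2 points and an accuracy percentage maps to accuracy / 10 (e.g. ChatGPT's 79.2% becomes 7.92). A quick check of this conversion (illustrative code, not the authors'):

```python
def to_ten_point(accuracy_percent: float) -> float:
    """Map exam accuracy (%) to the 10-point scale: 50 questions x 0.2 points."""
    return accuracy_percent / 10

print(round(to_ten_point(79.2), 2))  # prints 7.92 -- ChatGPT's average
print(round(to_ten_point(92.4), 2))  # prints 9.24 -- BingChat's average
```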
The results indicate that LLMs exhibit a higher level of English proficiency compared to Vietnamese students. One key advantage of LLMs is their ability to access vast amounts of information and data, including extensive language resources and pre-training on large corpora. This enables them to have a broader knowledge base and a deeper understanding of the English language.
Figure 4: Comparison of the performance of LLMs and Vietnamese students.
Consequently, LLMs demonstrate superior performance in answering questions and providing accurate responses compared to Vietnamese students.
4.3.2 Potential of LLMs in helping
Vietnamese students learn English
To answer RS3, we explore the potential of LLMs for English language teaching and learning at the high school level in Vietnam. By evaluating the performance of the LLMs on the VNHSGE English dataset, the research highlights the capabilities of these models in assisting with language instruction and learning. The findings contribute to identifying the specific areas where LLMs can be leveraged, such as vocabulary acquisition, grammar comprehension, and reading comprehension. The findings of the research can inform the development of educational strategies and materials that incorporate LLMs to enhance English language education in Vietnam.
LLMs can indeed provide valuable assistance to Vietnamese students in learning English. Here are some ways LLMs can be helpful:

• Language practice: LLMs can act as conversational partners, allowing Vietnamese students to practice their English communication skills. Students can engage in dialogue with the model, asking questions, discussing topics, and receiving responses that simulate real conversations.

• Personalized learning: LLMs have the potential to adapt to individual students' needs and provide personalized learning experiences. By analyzing students' performance and identifying areas of improvement, LLMs can generate customized exercises, quizzes, or study materials tailored to each student's specific requirements.

• Language comprehension: LLMs can process and understand English text, including articles, books, and educational resources. Vietnamese students can use LLMs to improve their reading comprehension by getting instant definitions, explanations, and examples of unfamiliar words or phrases.

• Writing support: LLMs can assist students in improving their English writing skills. Students can use LLMs to generate suggestions for sentence structure, grammar, and vocabulary, and even receive feedback on their writing. This can help students enhance their written expression and produce more accurate and coherent English texts.

• Language resources: LLMs can serve as a vast repository of information, providing access to various English learning materials, such as grammar rules, idiomatic expressions, vocabulary lists, and sample essays. Students can leverage these resources to enhance their understanding of English language concepts and improve their overall language proficiency.
5 Conclusion
This study compared the performance of three LLMs (OpenAI ChatGPT, Microsoft Bing Chat, and Google Bard) on the VNHSGE English dataset. The results showed that these LLMs have the potential to be used in education, as they demonstrated high accuracy in answering multiple-choice questions and were consistent in their performance across different years of the dataset.

Additionally, the LLMs were significantly superior to Vietnamese students in English language proficiency, suggesting that they have the potential to be valuable tools for facilitating English language learning among Vietnamese students. By leveraging the strengths of language models, educators and students can benefit from personalized and adaptive learning experiences that meet individual needs and enhance overall language proficiency. This highlights the potential of language models to revolutionize English language education and empower Vietnamese students to excel in their language-learning endeavors.
References

Radford Alec, Narasimhan Karthik, Salimans Tim, and Sutskever Ilya. 2018. Improving language understanding with unsupervised learning. Citado, 17:1-12.

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901.

Xuan-Quy Dao, Ngoc-Bich Le, Xuan-Dung Phan, and Bac-Bien Ngo. 2023a. Can ChatGPT pass the Vietnamese national high school graduation examination? arXiv preprint arXiv:2306.09170.

Xuan-Quy Dao, Ngoc-Bich Le, Xuan-Dung Phan, and Bac-Bien Ngo. 2023b. An evaluation of ChatGPT's proficiency in English language testing of the Vietnamese national high school graduation examination. Available at SSRN 4473369.

Xuan-Quy Dao, Ngoc-Bich Le, The-Duy Vo, Xuan-Dung Phan, Bac-Bien Ngo, Van-Tien Nguyen, Thi-My-Thanh Nguyen, and Hong-Phuoc Nguyen. 2023c. VNHSGE: Vietnamese high school graduation examination dataset for large language models. arXiv preprint arXiv:2305.12199.

Joost CF de Winter. 2023. Can ChatGPT pass high school exams on English language comprehension? ResearchGate. Preprint.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is ChatGPT a good translator? A preliminary study. arXiv preprint arXiv:2301.08745.

Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. 2023. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103:102274.

Tiffany H Kung, Morgan Cheatham, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, Maria Madriaga, Rimel Aggabao, Giezel Diaz-Candido, James Maningo, et al. 2023. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health, 2(2):e0000198.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485-5551.

H Holden Thorp. 2023. ChatGPT is fun, but not an author.
A Answers of LLMs
This section illustrates the answers of LLMs on the
VNHSGE English dataset.
A.1 Pronunciation and stress question
Question: Mark the letter A, B, C, or D on your answer sheet to indicate the word whose underlined part differs from the other three in pronunciation in each of the following questions.
A. maintained
B. promoted
C. required
D. argued
Solution: Pronunciation of the ending "-ed". Explanation: A. maintained /meɪnˈteɪnd/, B. promoted /prəˈməʊtɪd/, C. required /rɪˈkwaɪə(r)d/, D. argued /ˈɑːɡjuːd/. The ending "-ed" is pronounced /t/ when the verb ends in /tʃ/, /θ/, /ʃ/, /s/, /k/, /p/, or /f/; it is pronounced /ɪd/ when the verb ends in /t/ or /d/; and it is pronounced /d/ in all other cases. In option B the underlined part is pronounced /ɪd/, while in the other three it is pronounced /d/. Choose B.
ChatGPT: Question 1: D. argued

BingChat: For question 1, the word whose underlined part differs from the other three in pronunciation is D. argued.

Bard: The answer is **A. maintain**. The other three words have the "ed" pronounced as /d/, but "maintain" has the "ed" pronounced as /t/.
A.2 Grammar and vocabulary questions
Question: Mark the letter A, B, C, or D on your answer sheet to indicate the correct answer to each of the following questions.

Question: He's American, ......?
A. won't he
B. didn't he
C. doesn't he
D. isn't he

Solution: Tag question. Explain: The verb in the main clause is "is", the subject is "He". => Negative question tag "isn't he". => He's American, isn't he? Choose D.

ChatGPT: D. isn't he

BingChat: For question 5, the correct answer is D. isn't he.

Bard: The correct answer is D. The question is a tag question, which is a type of question that is used to confirm information. The tag question "isn't he?" is used to confirm that the person being referred to is American. The other options are not correct because they do not make sense in the context of the question.
Question: Mark the letter A, B, C, or D on your answer sheet to indicate the underlined part that needs correction in each of the following questions.

Question: It is essential to provide high school students with adequate career orientation, so they can make more informative decisions about their future major.
A. essential
B. orientation
C. informative
D. major

Solution: Advanced vocabulary. Explain: informative (adj): providing a lot of information, improving knowledge; informed (adj): having a lot of knowledge or information about something. Edit: "informative" => "informed": It is essential to provide high school students with adequate career orientation, so they can make more informed decisions about their future major. Choose C.

ChatGPT: It is essential to provide high school students with adequate career orientation, so they can make more informed decisions about their future major. No correction needed.

BingChat: C. informative

Bard: The correct answer is (D). The word major should be capitalized, as it is referring to a specific academic field of study. The other parts of the sentence are correct.
A.3 Communication question
Question: Mark the letter A, B, C, or D on your answer sheet to indicate the option that best completes each of the following exchanges.

Question 21: Jack and David are talking about taking a gap year. Jack: "I think taking a gap year is a waste of time." David: "....... It gives gap-year takers a lot of valuable experiences."
A. I agree with you
B. I don't quite agree
C. It's right
D. My pleasure

Solution: Explanation: A. I agree with you => not suitable meaning; B. I don't quite agree; C. It's right => doesn't match the meaning; D. My pleasure => inappropriate meaning. Choose B.

ChatGPT: Jack and David are talking about taking a gap year. Jack: "I think taking a gap year is a waste of time." David: "I don't quite agree. It gives gap-year takers a lot of valuable experiences." B. "I don't quite agree."

BingChat: the option that best completes the exchange is B. I don't quite agree.

Bard: The correct answer is **B. I don't quite agree**. David disagrees with Jack's opinion that taking a gap year is a waste of time. He believes that gap-year takers can gain a lot of valuable experiences, such as travel, work experience, and personal growth.
A.4 Reading Fill-in-the-Blank question
Question: Read the following passage and mark the letter A, B, C, or D on your answer sheet to indicate the correct word or phrase that best fits each of the numbered blanks from 26 to 30.

THE BEST STUDENTS

According to most professors, mature students are ideal students because they are hardworking and become actively involved in all aspects of the learning process. The majority of mature students have a poor educational (26)...... but they manage to do exceptionally well at tertiary level.

Question 26:
A. development
B. background
C. basement
D. institution

Solution: The majority of mature students have a poor educational (26) background but they manage to do exceptionally well at tertiary level. Choose B.

ChatGPT: B. background

BingChat: the correct word that best fits the blank is B. background.

Bard: The answer must be something that describes the educational history of mature students. Of the above choices, only background describes the educational history of mature students. So the answer is (B). Here are the other choices and why they are not the answer: * development: This does not describe the educational history of mature students. * basement: This is not a word that is typically used to describe educational history. * institution: This is a place where education takes place, but it does not describe the educational history of mature students.
A.5 Reading comprehension question
Question: Read the following passage and mark the letter A, B, C, or D on your answer sheet to indicate the correct answer to each of the questions.

Marcel Bich, a French manufacturer of traditional ink pens, was the man turning the ballpoint pen into an item that today almost anyone can afford. Bich was shocked at the poor quality of the ballpoint pens that were available, and also at their high cost. However, he recognised that the ballpoint was a firmly established invention, and he decided to design a cheap pen that worked well and would be commercially successful.

Bich went to the Biro brothers and asked them if he could use the design of their original invention in one of his own pens. In return, he offered to pay them every time he sold a pen. Then, for two years, Bich studied the detailed construction of every ballpoint pen that was being sold, often working with a microscope.

By 1950, he was ready to introduce his new wonder: a plastic pen with a clear barrel that wrote smoothly, did not leak and only cost a few cents. He called it the 'Bic Cristal'. The ballpoint pen had finally become a practical writing instrument. The public liked it immediately, and today it is as common as the pencil. In Britain, they are still called Biros, and many Bic models also say 'Biro' on the side of the pen, to remind people of their original inventors.

Bich became extremely wealthy thanks to his invention, which had worldwide appeal. Over the next 60 years his company, Societe Bic, opened factories all over the world and expanded its range of inexpensive products. Today, Bic is as famous for its lighters and razors as it is for its pens, and you can even buy a Bic mobile phone.

Question: The word "which" in paragraph 4 refers to .......
A. factories
B. company
C. invention
D. range

Solution: Explain: The word "which" in paragraph 4 refers to ..... => "which" replaces the noun that comes before it, "invention". Information: Bich became extremely wealthy thanks to his invention, which had worldwide appeal. Choose C.

ChatGPT: C. invention.

BingChat: the word "which" in paragraph 4 refers to C. invention.

Bard: The word "which" in paragraph 4 refers to **C. invention**. In paragraph 4, it says: Bich became extremely wealthy thanks to his invention, which had worldwide appeal. Over the next 60 years his company, Societe Bic, opened factories all over the world and expanded its range of inexpensive products. The word "which" refers to the invention that Bich created. This invention was the ballpoint pen, which became a practical writing instrument that was affordable and widely used.
B Vietnamese score spectrum
Figures 5-9 show the English score spectrum of Vietnamese students in 2019-2023.
Figure 5: English score spectrum of Vietnamese students in 2019.
Figure 6: English score spectrum of Vietnamese students in 2020.
[Bar chart: number of students per 0.2-point English score bin (0–10); series: ChatGPT, BingChat, Bard, Vietnamese students.]
Figure 7: English score spectrum of Vietnamese students in 2021.
[Bar chart: number of students per 0.2-point English score bin (0–10); series: ChatGPT, BingChat, Bard, Vietnamese students.]
Figure 8: English score spectrum of Vietnamese students in 2022.
[Bar chart: number of students per 0.2-point English score bin (0–10); series: ChatGPT, BingChat, Bard, Vietnamese students.]
Figure 9: English score spectrum of Vietnamese students in 2023.