Performance Comparison of Large Language Models on VNHSGE English
Dataset: OpenAI ChatGPT, Microsoft Bing Chat, and Google Bard
Xuan-Quy DAO
School of Engineering
Eastern International University
Binh Duong, Vietnam
quy.dao@eiu.edu.vn
Abstract
This paper presents a performance comparison of three large language models (LLMs), namely OpenAI ChatGPT, Microsoft Bing Chat (BingChat), and Google Bard, on the VNHSGE English dataset. The performance of BingChat, Bard, and ChatGPT (GPT-3.5) is 92.4%, 86%, and 79.2%, respectively. The results show that BingChat is better than ChatGPT and Bard. Therefore, BingChat and Bard can replace ChatGPT while ChatGPT is not yet officially available in Vietnam. The results also indicate that BingChat, Bard, and ChatGPT outperform Vietnamese students in English language proficiency. The findings of this study contribute to the understanding of the potential of LLMs in English language education. The remarkable performance of ChatGPT, BingChat, and Bard demonstrates their potential as effective tools for teaching and learning English at the high school level.
1 Introduction
Artificial Intelligence (AI) has revolutionized the field of education by transforming the ways of learning and teaching. One of the most prominent advancements in this domain is the development of large language models (LLMs), such as ChatGPT¹, BingChat², and Bard³ (developed by OpenAI, Microsoft, and Google, respectively). The development and continuous improvement of these LLMs have paved the way for AI to be integrated into various educational applications and domains (OpenAI, 2023). These models have advanced conversational abilities, closely resembling human-like interactions. This capability holds great promise for educational settings, including the utilization of virtual assistants, chatbots, and online learning support systems (Kasneci et al., 2023).

¹ https://chat.openai.com
² https://www.bing.com/chat
³ https://bard.google.com
Although these models offer tremendous potential for personalized and interactive learning experiences, creating new opportunities for educational innovation and transforming the way we teach and learn, the development and deployment of AI in education require careful consideration of the effectiveness of these AI-powered educational tools in different educational and social contexts (Kung et al., 2023; Thorp, 2023).
Similar to other parts of the world, LLMs hold great potential in the field of education in Vietnam. However, it is essential to conduct thorough research and evaluation to comprehensively understand their capabilities and limitations in the specific context of Vietnamese education. ChatGPT scored well on the Vietnamese High School Graduation Examination (VNHSGE) (Dao et al., 2023a) and obtained a good score on an English test case (Dao et al., 2023b). However, it is worth noting that ChatGPT has not been officially deployed in Vietnam. Meanwhile, Vietnamese students have access to BingChat and Bard, two other LLMs. Thus, it is crucial to evaluate the capabilities of BingChat and Bard within the context of Vietnamese education.
We focus on evaluating the performance of BingChat and Bard on the VNHSGE English dataset, and comparing them to ChatGPT. The objective of this evaluation is to determine whether BingChat and Bard can serve as potential alternatives to ChatGPT in English education at the high school level. Thus, we ask the following research questions:
Research Question 1 (RS1): What is the performance of ChatGPT, BingChat, and Bard on the VNHSGE English dataset at the high school level in Vietnam?

Research Question 2 (RS2): How do the LLMs perform in comparison to Vietnamese students in English language proficiency at the high school level in Vietnam?

Research Question 3 (RS3): What potential do LLMs hold for English language teaching and learning at the high school level in the context of Vietnam?

arXiv:2307.02288v3 [cs.CL] 20 Jul 2023
The main contributions of this paper are a comprehensive evaluation of the performance of LLMs, a comparison with human performance on the same tasks, and an identification of the potential applications of LLMs in the context of English language education at the high school level in Vietnam. These findings can inform educators, researchers, and policymakers in making informed decisions regarding the integration of LLMs into the English language curriculum and instructional practices.
2 Related works
2.1 Large Language Models
AI is now capable of understanding and engaging in human-like communication, thanks to recent advancements in LLMs. These breakthroughs have opened up new opportunities for their application in the field of education. LLMs have demonstrated immense potential in areas such as education, content development, and language translation. The two primary architectures of LLMs are BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer). In 2018, Google introduced BERT (Devlin et al., 2018), which has excelled in various natural language processing (NLP) tasks. Developed by OpenAI (Alec et al., 2018), the GPT algorithm was trained on extensive unlabeled text datasets. Building on Google's research, Facebook's RoBERTa (Liu et al., 2019) was introduced, and in 2019, Google released T5 (Raffel et al., 2020). The year 2020 witnessed the creation of GPT-3 by OpenAI (Brown et al., 2020), showcasing exceptional performance in a wide range of NLP tasks. Recently, OpenAI developed GPT-4 (OpenAI, 2023), a large multimodal model capable of processing both text and image inputs and producing text outputs. GPT-4 has demonstrated human-level performance on many professional and academic benchmarks, although its performance may not match that of humans in all contexts. These advancements highlight the progress made in LLMs and their potential impact in various domains.
2.2 Evaluation of LLMs on English
ChatGPT scored an average of 7.18 on the national high school exam in the Netherlands, which is equivalent to the average score of all students participating in the exams in the Netherlands (de Winter, 2023). In other research, ChatGPT outperformed Vietnamese students by scoring 7.92 on the national high school graduation exam in Vietnam (Dao et al., 2023b). Additionally, ChatGPT/GPT-4 has been shown to be as good as commercial translation products at translation (Jiao et al., 2023), as well as having the ability to perform multilingual tasks (Bang et al., 2023).
3 Methods
3.1 Dataset
The VNHSGE English dataset (Dao et al., 2023c) consists of 2500 multiple-choice questions and comprises a diverse range of exam questions extracted from high school examinations, covering various topics and assessing different linguistic abilities. The dataset includes questions related to pronunciation and stress, vocabulary and grammar, communication skills, reading fill-in-the-blank, and reading comprehension. The evaluation dataset comprises 250 multiple-choice questions corresponding to 5 exams from the academic years 2019, 2020, 2021, 2022, and 2023.
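For concreteness, the evaluation subset can be represented as five 50-question exams. The sketch below uses an illustrative in-memory layout; the `Question` fields are assumptions for illustration, not the dataset's published schema:

```python
# Sketch of the 250-question evaluation subset: five exams (2019-2023),
# each with 50 multiple-choice questions. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class Question:
    text: str      # question stem, including any passage
    choices: dict  # {"A": ..., "B": ..., "C": ..., "D": ...}
    answer: str    # ground-truth letter

exams = {year: [] for year in (2019, 2020, 2021, 2022, 2023)}
exams[2019].append(Question(
    text="He's American, ...?",
    choices={"A": "won't he", "B": "didn't he",
             "C": "doesn't he", "D": "isn't he"},
    answer="D",
))

# With the full data loaded, each exam holds 50 questions, so this totals 250.
total = sum(len(questions) for questions in exams.values())
```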
3.2 Prompt
In this study, zero-shot learning was employed, where the LLMs were tasked with answering questions directly without any prior data or example questions. For the VNHSGE English dataset D, let {Q, S} denote pairs where Q is the question and S is the ground truth solution. Additionally, let P denote the context of words. The LLM's answer A is determined by

A = f(P, Q)    (1)

where f is the LLM, which takes into account the context P and the question Q. The context P in this case is a specific structure that guides the LLM's response. It instructs the LLM to provide the answer in the following format: {Choice: "A" or "B" or "C" or "D"; Explanation: Explain the answer; The question is: [the actual question]}. By following this structure, the LLM generates its answer A, which can be evaluated and compared to the ground truth solution S.

Figure 1 illustrates the process of prompting the LLMs and retrieving the results. In the case of multiple-choice questions from the VNHSGE English dataset, the questions are formatted to align with the expected answer format. The questions are then sent to the LLMs' API.
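The prompting step can be sketched as follows: `format_prompt` builds the input (P, Q) of Eq. (1), and `parse_choice` extracts the chosen letter from a response that follows the requested structure. This is an illustrative sketch, not the authors' code; no real chat API is called and both function names are assumptions:

```python
# The fixed context P used to wrap every question (taken from Figure 1).
CONTEXT = (
    'I want you to answer the question in the following structure:\n'
    'Choice: "A" or "B" or "C" or "D"\n'
    'Explanation: Explain the answer\n'
    'The question is:'
)

def format_prompt(question: str) -> str:
    """Prepend the fixed context P to a question Q, i.e. build f's input (P, Q)."""
    return f"{CONTEXT}\n{question}"

def parse_choice(response: str) -> str:
    """Extract the chosen letter from a response following the structure."""
    for line in response.splitlines():
        if line.strip().lower().startswith("choice"):
            for letter in "ABCD":
                if letter in line.split(":", 1)[1]:
                    return letter
    return ""  # model did not follow the requested format

# Example: a well-formed model response
print(parse_choice('Choice: "D"\nExplanation: Tag question.'))  # prints D
```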
3.3 Grading
To evaluate the performance of the LLMs in answering questions, we compared each LLM's response to the ground truth. Given the question Q, the corresponding ground truth solution S, and the LLM's answer A, the grade is

G = g(Q, S, A)    (2)

The evaluation process resulted in a binary grading system, where the LLM's answer was classified as correct or incorrect.

The answers of the LLMs may differ because they are trained on different datasets. To capture the best and worst cases among their answers, we introduced two variables: LLM_B represents the case with the highest graded answer among the LLMs, while LLM_W represents the case with the lowest graded answer:

LLM_B = max(G_i),  LLM_W = min(G_i)    (3)

where G_i represents the grading corresponding to ChatGPT, BingChat, and Bard. These variables allow us to analyze the upper and lower bounds of the performance of the LLMs in terms of grading their answers.
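A minimal sketch of the binary grading in Eq. (2) and the best/worst aggregation in Eq. (3), assuming each answer has already been reduced to a single letter (function names are illustrative, not the authors' code):

```python
def grade(answer: str, ground_truth: str) -> int:
    """Binary grading g(Q, S, A): 1 if the chosen letter matches S, else 0."""
    return int(answer.strip().upper() == ground_truth.strip().upper())

def best_and_worst(grades_per_model: dict) -> tuple:
    """Per-question LLM_B = max(G_i) and LLM_W = min(G_i) over the models."""
    grades = list(grades_per_model.values())
    return max(grades), min(grades)

# One question where BingChat and Bard are right but ChatGPT is wrong:
g = {"ChatGPT": grade("A", "D"),
     "BingChat": grade("D", "D"),
     "Bard": grade("D", "D")}
llm_b, llm_w = best_and_worst(g)  # (1, 0)
```

Averaging `llm_b` and `llm_w` over all questions yields the upper and lower performance bounds reported in Table 1.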
4 Results
4.1 Answers of LLMs
In this section, we present the answers provided by ChatGPT, BingChat, and Bard on the VNHSGE English dataset. We provide illustrative examples for different types of questions, including pronunciation and stress, grammar and vocabulary, communication, reading fill-in-the-blank, and reading comprehension. For detailed information about the answers provided by the LLMs, please refer to Appendix A.
4.2 Performance of LLMs
To answer RS1, we evaluate the performance of ChatGPT, BingChat, and Bard on the VNHSGE English dataset. The findings provide insights into how accurate and effective these LLMs are in answering questions from the dataset. The contribution of this analysis is a comparative assessment of the performance of the three models, which can guide educators and researchers in selecting the most suitable LLM for English language tasks in the context of high school education.
4.2.1 Question order
Figure 2 illustrates the accuracy of the LLMs' answers according to the order of the questions. The results show that the LLMs do not provide accurate answers for the first four questions, which are related to phonetics. This may be because models like ChatGPT, BingChat, and Bard were not sufficiently trained in the field of phonetics. The LLMs perform well on the remaining portions of the exam, proving that they have received adequate training in English vocabulary, grammar, and reading comprehension.
4.2.2 Performance evaluation
Table 1 presents the performance of the LLMs. BingChat demonstrates better results compared to ChatGPT and Bard. The accuracy of LLM_B is 97.2%, demonstrating the significant potential of LLMs in English education at the high school level.

Figure 3 illustrates the stability of the LLMs across the years 2019-2023. The performance indices LLM_B, ChatGPT, and Bard demonstrate relatively stable results, while BingChat and LLM_W exhibit variations between academic years. The consistent performance of the LLMs over the years indicates their ability to maintain stable performance. This demonstrates that LLMs are trained on a large amount of data and can be valuable tools in education.
Table 1: Performance (%)

          2019  2020  2021  2022  2023   AVG
ChatGPT     76    86    76    80    78  79.2
BingChat    92    96    86    94    94  92.4
Bard        82    94    82    86    86  86.0
LLM_W       66    82    68    74    70  72.0
LLM_B       96   100    94    96   100  97.2
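The AVG column of Table 1 is the arithmetic mean of the five yearly accuracies, which can be checked directly:

```python
# Yearly accuracies (%) for 2019-2023, from Table 1.
scores = {
    "ChatGPT":  [76, 86, 76, 80, 78],
    "BingChat": [92, 96, 86, 94, 94],
    "Bard":     [82, 94, 82, 86, 86],
}

# Arithmetic mean over the five exams reproduces the AVG column.
averages = {model: sum(v) / len(v) for model, v in scores.items()}
# {'ChatGPT': 79.2, 'BingChat': 92.4, 'Bard': 86.0}
```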
Figure 1: Formatted question and LLMs response. A multiple-choice question is wrapped in the context prompt ("I want you to answer the question in the following structure: Choice: "A" or "B" or "C" or "D"; Explanation: Explain the answer; The question is: ...") to form a new question, which is sent to the large language models, and the response is collected.
Figure 2: Correctness average of ChatGPT, BingChat, and Bard in question order.
Figure 3: Stability of LLMs performance.
4.3 LLMs and Vietnamese students
4.3.1 LLMs perform better in English than
Vietnamese students
To answer RS2, we examine the performance of the LLMs in comparison to Vietnamese students' English language skills. Our aim is to determine whether LLMs possess abilities comparable to those of humans, although this comparison is challenging due to the dissimilar settings. By conducting this comparison, we can evaluate whether LLMs can serve as effective study aids for Vietnamese students in various subject areas.
The score distribution of the students' exams serves as an indicator for evaluating their performance. The Vietnam Ministry of Education and Training releases score distribution charts for each subject every year, which help assess the candidates' competence and determine the difficulty level of the exams. By gathering the score distributions (2019, 2020, 2021, 2022), we can compare the performance of LLMs with that of Vietnamese students. To facilitate this comparison, we compare the scores of the LLMs to the average score (AVS) and the most frequently achieved score (MVS) of Vietnamese students. For instance, in 2023, the AVS and MVS were 5.84 and 3.8, respectively (see Appendix B for the comparison in the Vietnamese score spectrum). Table 2 shows the scores of the LLMs on a 10-point scale. Figure 4 shows the effectiveness of the LLMs compared with Vietnamese students.
Table 2: Scores

          2019  2020  2021  2022  2023   AVG
ChatGPT    7.6   8.6   7.6   8.0   7.8  7.92
BingChat   9.2   9.6   8.6   9.4   9.4  9.24
Bard       8.2   9.4   8.2   8.6   8.6  8.60
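Each exam has 50 questions, so on the 10-point scale each question is worth 0.2 points and an accuracy percentage maps to accuracy / 10 (e.g. ChatGPT's 79.2% becomes 7.92). A quick check of this conversion (illustrative code, not the authors'):

```python
def to_ten_point(accuracy_percent: float) -> float:
    """Map exam accuracy (%) to the 10-point scale: 50 questions x 0.2 points."""
    return accuracy_percent / 10

print(round(to_ten_point(79.2), 2))  # prints 7.92 -- ChatGPT's average
print(round(to_ten_point(92.4), 2))  # prints 9.24 -- BingChat's average
```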
The results indicate that LLMs exhibit a higher level of English proficiency compared to Vietnamese students. One key advantage of LLMs is their ability to access vast amounts of information and data, including extensive language resources and pre-training on large corpora. This enables them to have a broader knowledge base and a deeper understanding of the English language.
Figure 4: Comparison of the performance of LLMs and Vietnamese students.
Consequently, LLMs demonstrate superior performance in answering questions and providing accurate responses compared to Vietnamese students.
4.3.2 Potential of LLMs in helping
Vietnamese students learn English
To answer RS3, we explore the potential of LLMs for English language teaching and learning at the high school level in Vietnam. By evaluating the performance of the LLMs on the VNHSGE English dataset, the research highlights the capabilities of these models in assisting with language instruction and learning. The findings contribute to identifying the specific areas where LLMs can be leveraged, such as vocabulary acquisition, grammar comprehension, and reading comprehension. The findings of the research can inform the development of educational strategies and materials that incorporate LLMs to enhance English language education in Vietnam.
LLMs can indeed provide valuable assistance to Vietnamese students in learning English. Here are some ways LLMs can be helpful:

• Language practice: LLMs can act as conversational partners, allowing Vietnamese students to practice their English communication skills. Students can engage in dialogue with the model, asking questions, discussing topics, and receiving responses that simulate real conversations.

• Personalized learning: LLMs have the potential to adapt to individual students' needs and provide personalized learning experiences. By analyzing students' performance and identifying areas of improvement, LLMs can generate customized exercises, quizzes, or study materials tailored to each student's specific requirements.

• Language comprehension: LLMs can process and understand English text, including articles, books, and educational resources. Vietnamese students can use LLMs to improve their reading comprehension by getting instant definitions, explanations, and examples of unfamiliar words or phrases.

• Writing support: LLMs can assist students in improving their English writing skills. Students can use LLMs to generate suggestions for sentence structure, grammar, and vocabulary, and even receive feedback on their writing. This can help students enhance their written expression and produce more accurate and coherent English texts.

• Language resources: LLMs can serve as a vast repository of information, providing access to various English learning materials, such as grammar rules, idiomatic expressions, vocabulary lists, and sample essays. Students can leverage these resources to enhance their understanding of English language concepts and improve their overall language proficiency.
5 Conclusion
This study compared the performance of three LLMs (OpenAI ChatGPT, Microsoft Bing Chat, and Google Bard) on the VNHSGE English dataset. The results showed that these LLMs have the potential to be used in education, as they demonstrated high accuracy in answering multiple-choice questions and were consistent in their performance across different years of the dataset.

Additionally, the LLMs were significantly superior to Vietnamese students in English language proficiency, suggesting that they have the potential to be valuable tools for facilitating English language learning among Vietnamese students. By leveraging the strengths of language models, educators and students can benefit from personalized and adaptive learning experiences that meet individual needs and enhance overall language proficiency. This highlights the potential of language models to revolutionize English language education and empower Vietnamese students to excel in their language-learning endeavors.
References

Radford Alec, Narasimhan Karthik, Salimans Tim, and Sutskever Ilya. 2018. Improving language understanding with unsupervised learning. Citado, 17:1-12.

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901.

Xuan-Quy Dao, Ngoc-Bich Le, Xuan-Dung Phan, and Bac-Bien Ngo. 2023a. Can ChatGPT pass the Vietnamese national high school graduation examination? arXiv preprint arXiv:2306.09170.

Xuan-Quy Dao, Ngoc-Bich Le, Xuan-Dung Phan, and Bac-Bien Ngo. 2023b. An evaluation of ChatGPT's proficiency in English language testing of the Vietnamese national high school graduation examination. Available at SSRN 4473369.

Xuan-Quy Dao, Ngoc-Bich Le, The-Duy Vo, Xuan-Dung Phan, Bac-Bien Ngo, Van-Tien Nguyen, Thi-My-Thanh Nguyen, and Hong-Phuoc Nguyen. 2023c. VNHSGE: Vietnamese high school graduation examination dataset for large language models. arXiv preprint arXiv:2305.12199.

Joost CF de Winter. 2023. Can ChatGPT pass high school exams on English language comprehension? ResearchGate. Preprint.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is ChatGPT a good translator? A preliminary study. arXiv preprint arXiv:2301.08745.

Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. 2023. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103:102274.

Tiffany H Kung, Morgan Cheatham, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, Maria Madriaga, Rimel Aggabao, Giezel Diaz-Candido, James Maningo, et al. 2023. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health, 2(2):e0000198.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485-5551.

H Holden Thorp. 2023. ChatGPT is fun, but not an author.
A Answers of LLMs
This section illustrates the answers of LLMs on the
VNHSGE English dataset.
A.1 Pronunciation and stress question
Question: Mark the letter A, B, C, or D on your answer sheet to indicate the word whose underlined part differs from the other three in pronunciation in each of the following questions.
A. maintained
B. promoted
C. required
D. argued
Solution: Pronunciation of the ending "-ed". Explanation: A. maintained /meɪnˈteɪnd/, B. promoted /prəˈməʊtɪd/, C. required /rɪˈkwaɪə(r)d/, D. argued /ˈɑːɡjuːd/. The ending "-ed" is pronounced /t/ when the verb ends in /tʃ/, /θ/, /ʃ/, /s/, /k/, /p/, or /f/; it is pronounced /ɪd/ when the verb ends in /t/ or /d/; and it is pronounced /d/ in all other cases. In option B the underlined part is pronounced /ɪd/, while in the other three it is pronounced /d/. Choose B.
ChatGPT: Question 1: D. argued

BingChat: For question 1, the word whose underlined part differs from the other three in pronunciation is D. argued.

Bard: The answer is **A. maintain**. The other three words have the "ed" pronounced as /d/, but "maintain" has the "ed" pronounced as /t/.
A.2 Grammar and vocabulary questions
Question: Mark the letter A, B, C, or D on your answer sheet to indicate the correct answer to each of the following questions.

Question: He's American, ......?
A. won't he
B. didn't he
C. doesn't he
D. isn't he

Solution: Tag question. Explain: The verb in the main clause is "is", the subject is "He". => Negative question tag "isn't he". => He's American, isn't he? Choose D.

ChatGPT: D. isn't he

BingChat: For question 5, the correct answer is D. isn't he.

Bard: The correct answer is D. The question is a tag question, which is a type of question that is used to confirm information. The tag question "isn't he?" is used to confirm that the person being referred to is American. The other options are not correct because they do not make sense in the context of the question.
Question: Mark the letter A, B, C, or D on your answer sheet to indicate the underlined part that needs correction in each of the following questions.

Question: It is essential to provide high school students with adequate career orientation, so they can make more informative decisions about their future major.
A. essential
B. orientation
C. informative
D. major

Solution: Advanced vocabulary. Explain: informative (adj): providing a lot of information, improving knowledge; informed (adj): having a lot of knowledge or information about something. Edit: "informative" => "informed": It is essential to provide high school students with adequate career orientation, so they can make more informed decisions about their future major. Choose C.

ChatGPT: It is essential to provide high school students with adequate career orientation, so they can make more informed decisions about their future major. No correction needed.

BingChat: C. informative

Bard: The correct answer is (D). The word major should be capitalized, as it is referring to a specific academic field of study. The other parts of the sentence are correct.
A.3 Communication question
Question: Mark the letter A, B, C, or D on your answer sheet to indicate the option that best completes each of the following exchanges.

Question 21: Jack and David are talking about taking a gap year. Jack: "I think taking a gap year is a waste of time." David: "....... It gives gap-year takers a lot of valuable experiences."
A. I agree with you
B. I don't quite agree
C. It's right
D. My pleasure

Solution: Explanation: A. I agree with you => not suitable meaning; B. I don't quite agree; C. It's right => doesn't match the meaning; D. My pleasure => inappropriate meaning. Choose B.

ChatGPT: Jack and David are talking about taking a gap year. Jack: "I think taking a gap year is a waste of time." David: "I don't quite agree. It gives gap-year takers a lot of valuable experiences." B. "I don't quite agree."

BingChat: the option that best completes the exchange is B. I don't quite agree.

Bard: The correct answer is **B. I don't quite agree**. David disagrees with Jack's opinion that taking a gap year is a waste of time. He believes that gap-year takers can gain a lot of valuable experiences, such as travel, work experience, and personal growth.
A.4 Reading Fill-in-the-Blank question
Question: Read the following passage and mark the letter A, B, C, or D on your answer sheet to indicate the correct word or phrase that best fits each of the numbered blanks from 26 to 30.

THE BEST STUDENTS

According to most professors, mature students are ideal students because they are hardworking and become actively involved in all aspects of the learning process. The majority of mature students have a poor educational (26)...... but they manage to do exceptionally well at tertiary level.

Question 26:
A. development
B. background
C. basement
D. institution

Solution: The majority of mature students have a poor educational (26) background but they manage to do exceptionally well at tertiary level. Choose B.

ChatGPT: B. background

BingChat: the correct word that best fits the blank is B. background.

Bard: The answer must be something that describes the educational history of mature students. Of the above choices, only background describes the educational history of mature students. So the answer is (B). Here are the other choices and why they are not the answer: * development: This does not describe the educational history of mature students. * basement: This is not a word that is typically used to describe educational history. * institution: This is a place where education takes place, but it does not describe the educational history of mature students.
A.5 Reading comprehension question
Question: Read the following passage and mark the letter A, B, C, or D on your answer sheet to indicate the correct answer to each of the questions.

Marcel Bich, a French manufacturer of traditional ink pens, was the man turning the ballpoint pen into an item that today almost anyone can afford. Bich was shocked at the poor quality of the ballpoint pens that were available, and also at their high cost. However, he recognised that the ballpoint was a firmly established invention, and he decided to design a cheap pen that worked well and would be commercially successful.

Bich went to the Biro brothers and asked them if he could use the design of their original invention in one of his own pens. In return, he offered to pay them every time he sold a pen. Then, for two years, Bich studied the detailed construction of every ballpoint pen that was being sold, often working with a microscope.

By 1950, he was ready to introduce his new wonder: a plastic pen with a clear barrel that wrote smoothly, did not leak and only cost a few cents. He called it the 'Bic Cristal'. The ballpoint pen had finally become a practical writing instrument. The public liked it immediately, and today it is as common as the pencil. In Britain, they are still called Biros, and many Bic models also say 'Biro' on the side of the pen, to remind people of their original inventors.

Bich became extremely wealthy thanks to his invention, which had worldwide appeal. Over the next 60 years his company, Societe Bic, opened factories all over the world and expanded its range of inexpensive products. Today, Bic is as famous for its lighters and razors as it is for its pens, and you can even buy a Bic mobile phone.

Question: The word "which" in paragraph 4 refers to .......
A. factories
B. company
C. invention
D. range

Solution: Explain: The word "which" in paragraph 4 refers to ..... => "which" replaces the noun that comes before it, "invention". Information: Bich became extremely wealthy thanks to his invention, which had worldwide appeal. Choose C.

ChatGPT: C. invention.

BingChat: the word "which" in paragraph 4 refers to C. invention.

Bard: The word "which" in paragraph 4 refers to **C. invention**. In paragraph 4, it says: Bich became extremely wealthy thanks to his invention, which had worldwide appeal. Over the next 60 years his company, Societe Bic, opened factories all over the world and expanded its range of inexpensive products. The word "which" refers to the invention that Bich created. This invention was the ballpoint pen, which became a practical writing instrument that was affordable and widely used.
B Vietnamese score spectrum
Figures 5-9 show the English score spectrum of Vietnamese students in 2019-2023.
Figure 5: English score spectrum of Vietnamese students in 2019.
Figure 6: English score spectrum of Vietnamese students in 2020.
[Bar chart: number of students per 0.2-point English score bin (0–10); series: ChatGPT, BingChat, Bard, Vietnamese students.]
Figure 7: English score spectrum of Vietnamese students in 2021.
[Bar chart: number of students per 0.2-point English score bin (0–10); series: ChatGPT, BingChat, Bard, Vietnamese students.]
Figure 8: English score spectrum of Vietnamese students in 2022.
[Bar chart: number of students per 0.2-point English score bin (0–10); series: ChatGPT, BingChat, Bard, Vietnamese students.]
Figure 9: English score spectrum of Vietnamese students in 2023.