Education and Information Technologies
https://doi.org/10.1007/s10639-024-12720-0
Classification ofreflective writing: Acomparative analysis
withshallow machine learning andpre-trained language
models
ChengmingZhang1 · FlorianHofmann1· LeaPlößl1· MichaelaGläser‑Zikuda1
Received: 15 November 2023 / Accepted: 14 April 2024
© The Author(s) 2024
Abstract
Reflective practice holds critical importance, for example, in higher education and
teacher education, yet promoting students’ reflective skills has been a persistent
challenge. The emergence of revolutionary artificial intelligence technologies, nota-
bly in machine learning and large language models, heralds potential breakthroughs
in this domain. The current research on analyzing reflective writing hinges on sen-
tence-level classification. Such an approach, however, may fall short of providing a
holistic grasp of written reflection. Therefore, this study employs shallow machine
learning algorithms and pre-trained language models, namely BERT, RoBERTa,
BigBird, and Longformer, with the intention of enhancing the document-level clas-
sification accuracy of reflective writings. A dataset of 1,043 reflective writings was
collected in a teacher education program at a German university (M = 251.38 words,
SD = 143.08 words). Our findings indicated that BigBird and Longformer models
significantly outperformed BERT and RoBERTa, achieving classification accura-
cies of 76.26% and 77.22%, respectively, with less than 60% accuracy observed in
shallow machine learning models. The outcomes of this study contribute to refining
document-level classification of reflective writings and have implications for aug-
menting automated feedback mechanisms in teacher education.
Keywords Reflective writing · Pre-trained language model · Shallow machine learning · AI feedback · Teacher education
1 Introduction
The dawn of the Artificial Intelligence (AI) era has brought about transformative
educational changes. From profiling and prediction to automated assessment and
personalized learning, the increasing use of AI applications is evidence of its bur-
geoning influence (Zawacki-Richter etal., 2019; Zhai etal., 2021). One particularly
impactful instance of AI’s utility is the deployment of Large Language Models
(LLMs) like ChatGPT. This powerful tool can support students in problem-solving,
personalized guidance, feedback provision, and other areas. Interestingly, the ben-
efits of AI feedback have been empirically quantified. For instance, a meta-analysis
by Cai etal. (2023) revealed a moderate positive impact of feedback on academic
achievement in technology-rich learning environments compared to traditional
feedback-absent environments. These findings hint at AI’s potential to address long-
standing educational challenges, especially those related to providing timely, per-
sonalized feedback (Russell & Korthagen, 2013). In light of these developments, the
fusion of machine learning (ML) and natural language processing (NLP) technolo-
gies may present an efficient way to meet students’ diverse learning needs.
In the rapidly evolving educational technology landscape, a significant break-
through has occurred in using AI for assessment, particularly concerning reflective
writing. Reflection forms an essential bridge between theoretical knowledge and
practical application (Korthagen & Vasalos, 2005). Despite the deployment of vari-
ous approaches aiming to bolster reflective practice, the endeavor has encountered
constraints, primarily due to the multifaceted nature of reflective writing, which
poses a challenge for evaluation (Körkkö etal., 2016; Poldner etal., 2014; Ullmann,
2019). Recent research indicates that shallow ML and pre-trained language mod-
els are particularly effective in assessing reflective writing, marking a significant
advancement in this field (Nehyba & Štefánik, 2023; Solopova etal., 2023; Wulff
etal., 2023). Moreover, AI’s role in assessing student reflections is expanding, serv-
ing both as an instrument for summative assessment (Barthakur etal., 2022) and as
a means for generating formative assessment metrics (Jung etal., 2022).
However, two significant challenges remain in leveraging AI to provide feedback
on reflective writings. The first issue concerns the quality of the reflective writing
data utilized for training in existing studies. In many studies, datasets were collected in a non-standardized way, resulting in data of limited quality.
This compromise in data quality can adversely affect the performance metrics of
ML-based classifiers, as noted by Gupta etal. (2021), resulting in reduced reliability
and a higher risk of fallacious conclusions. The second challenge is that most cur-
rent research segments the reflective writings into sentences, classifying them on the
sentence level. This is particularly problematic in reflective writing, where there is
an intrinsic and cohesive link across the narrative (Moon, 2013). An overemphasis
on individual sentences risks missing vital information embedded in the larger con-
text of the text. Therefore, a pivot towards a document-level evaluation paradigm is
beneficial and necessary for capturing the full scope of reflective thought. Address-
ing these challenges requires an emphasis on collecting high-quality datasets and
classifying them on a document level to improve the effectiveness of AI in giving
feedback on reflective writing.
This research aims to classify reflective writing by leveraging shallow ML and
pre-trained language models. In contrast to preceding studies, this research employed
a document-level classification approach to annotate reflective writings. The meth-
odology employed a range of ML models as well as pre-trained language models
like BERT (Devlin etal., 2018) and RoBERTa (Liu etal., 2019a, 2019b), supple-
mented by advanced models such as BigBird (Zaheer etal., 2020) and Longformer
(Beltagy etal., 2020). These advanced language models were employed in the clas-
sification technique to address the challenges of processing long text sequences.
2 Literature review
2.1 Theoretical framework for assessing reflective writing
In the literature on reflection, John Dewey’s “How We Think” (1933) stands out as a landmark work,
which has significantly been complemented by Donald Schön’s influential “The
Reflective Practitioner” (1983) and “Educating the Reflective Practitioner” (1987).
Schön (1983, 1987) has categorized reflection into two types: reflection-in-action,
which occurs during the act, and reflection-on-action, which takes place after the
event. The notion of reflection-on-action was used in this study. In subsequent devel-
opments, the definition of reflection remains ambiguous. Some argued that reflec-
tion should include affective aspects (Boud etal., 2013), while others contended that
it is a problem-solving process (Kember, 1999). Additionally, proposals suggested
that reflection is a form of metacognition (Flavell, 1979) and an element of self-
regulated learning (Zimmerman, 2002).
Given reflection’s inherent complexity and multidimensional nature, scholars have recognized the necessity to develop various models (Boyd & Fales, 1983; Gibbs, 1988; Hatton & Smith, 1995; Kember, 1999; Kolb, 1984; Mezirow,
1991). These theoretical models were developed around two core dimensions: depth
and breadth. Depth pertains to the level of reflection achieved, while breadth refers
to various elements involved in reflective writing (e.g., Ullmann, 2019). More spe-
cifically, the depth model assesses reflection holistically, considering it an integral
whole. For instance, Jung etal. (2022) evaluated reflectivity among 369 dental stu-
dents through 1500 reflective writings, categorizing reflections as non-reflective,
shallow, or deep levels. Similarly, Liu etal., (2019a, 2019b) implemented a binary
classification system to examine 301 reflective statements from pharmacy students,
effectively differentiating between reflective and non-reflective responses. Provid-
ing feedback on different reflection levels facilitates a clear understanding of students’ current situation and the objectives they need to meet. For educators, identify-
ing students at varying levels of reflection enables tailoring specific improvement
strategies. In contrast, breadth models embrace a multi-faceted, process-oriented
approach, offering a granular analysis of reflection by dissecting its components and
examining their interactions. For example, Cui etal. (2019) conducted an extensive
review of existing literature and analyzed the reflective writings of 27 dental medi-
cine students over four years. Their study was anchored in a framework compris-
ing six reflective categories: description, analysis, feelings, perspective, evaluation,
and outcome. Moreover, Ullmann (2019) centered his research on eight categories
commonly found in models for assessing reflective writing: reflection, experience,
feeling, belief, difficulty, perspective, learning, and intention. His analysis encom-
passed 76 student essays, totaling 5080 sentences, primarily from health, business,
and engineering students in their second and third years. These studies revealed the
richness of reflective practice and provided insights into how educators can bet-
ter organize reflective practice and assessment to capture the full range of student
reflections.
Current research on reflective writing typically employs one of these two basic
models or a hybrid. Depth models classify reflections into levels, which assist stu-
dents in determining and benchmarking the quality of their reflective practice. They
also provide educators and institutions with a streamlined metric for evaluating the
caliber of teaching and learning. Nevertheless, their singular focus can curtail a
comprehensive understanding and development of reflective capabilities. In con-
trast, breadth models offer multifaceted feedback, encompassing various elements
and processes of reflection, thereby offering a more expansive understanding of its
dynamics. This model reveals the nuanced interplay within reflective thought and its
broader implications. However, the potential for information overload is a drawback
of the breadth approach, as the profusion of data points might lead to conflicts or
overlaps, possibly obscuring areas needing improvement.
2.2 ML and NLP for assessing reflective writing
In the realm of analyzing reflective writings, the current research employs a diverse
array of methodologies, including dictionary-based (Cui et al., 2019; Springer &
Yinger, 2019), rule-based (Chong etal., 2020; Gibson etal., 2016) and increasingly,
ML-based techniques (Cutumisu & Guo, 2019; Fan et al., 2017; Ullmann, 2019;
Wulff etal., 2023). There has been a notable shift towards adopting ML and NLP
strategies, with the field progressively embracing advanced technologies such as deep
learning (DL) and pre-trained language models. These sophisticated models offer
potent capabilities for accurately identifying complex patterns within extensive data-
sets of reflective texts, significantly improving the detail and accuracy of analyses.
A review of the literature reveals a predominant focus on deploying shallow
ML algorithms for classifying reflective writing. Within this category, algorithms
such as Random Forest (Jung & Wise, 2020; Kovanović etal., 2018; Ullmann,
2019), Naïve Bayes (Cheng, 2017; Hu, 2017; Liu etal., 2017), and Support Vector
Machine (Carpenter etal., 2020) have been recognized for their exceptional per-
formance. As for language representation techniques, the majority have relied on
foundational methods such as Bag-of-Words (BoW) (Hu, 2017; Ullmann, 2019),
Linguistic Inquiry and Word Count 2015 (LIWC2015) (Jung & Wise, 2020), and
Term Frequency-Inverse Document Frequency (TF-IDF) (Liu etal., 2017). How-
ever, the field is gradually advancing towards more sophisticated models, with
recent forays into using Global Vectors for Word Representation (GloVe) and
Embeddings from Language Models (ELMo) (Carpenter et al., 2020), signal-
ing a shift towards capturing deeper semantic meanings and contextual nuances
within the reflective texts. Shallow ML has comparative advantages in terms of
simplicity, interpretability, and computational efficiency (Janiesch et al., 2021)
when targeting the classification of reflective writing. For example, interpretabil-
ity in model decision-making refers to the transparency of how a model processes
inputs to produce outputs (e.g., Carvalho etal., 2019). Incorporating linguistic
metrics, such as those from LIWC2015, into the classification of reflective writ-
ing enhances this interpretability (e.g., Cui etal., 2019; Savicki & Price, 2015;
Springer & Yinger, 2019). LIWC2015 assigns words to various psycholinguistic
categories, encompassing psychological, linguistic, and affective dimensions—
such as affect (encompassing both positive and negative emotions), cognitive
processes (including insight, causation, negation, and others), and social dynam-
ics (Pennebaker etal., 2015). These categories are grounded in established psy-
chological and linguistic theories, offering a theoretically informed framework
that elucidates the model’s methodology in identifying and classifying reflec-
tive writing. For instance, the study by Zhang etal. (2023a) demonstrated that
words about cognitive and emotional aspects predict reflection levels, showcasing
the practical application and relevance of these categories in research. However,
the limitations of shallow ML are apparent, notably its heavy reliance on fea-
ture engineering. This approach encounters difficulties in capturing deep seman-
tic relationships and subtle contextual nuances within textual data, particularly
in instances of reflective writing that encompass complex emotional expressions,
metaphors, or advanced cognitive processes (e.g., Dewey, 1933). Moreover, data
from reflective writing often exhibits imbalance, with specific categories signifi-
cantly outnumbering others (Körkkö etal., 2016; Poldner etal., 2014). Shallow
ML models may struggle to address this imbalance effectively, leading to biased
models toward the majority class.
Although DL and pre-trained language models have not yet achieved widespread
adoption in the field, the examples of their application demonstrate a significant per-
formance advantage over shallow ML models. For instance, Nehyba and Štefánik
(2023) demonstrated that a neural classifier (XLM-RoBERTa) surpassed the capa-
bilities of a shallow classifier (Random Forest) in categorizing reflective writing
within higher education. Additionally, Wulff etal. (2023) reported that the BERT
model outperformed both DL Architectures and previously employed shallow ML
algorithms (Wulff etal., 2021) in classifying segments of reflective writing by pre-
service teachers. Pre-trained language models primarily utilize deep learning neural
network architectures, such as the Transformer model, which learns a generalized
language representation through pre-training on extensive textual datasets. These
models adeptly capture a broad spectrum of linguistic features, encompassing lexi-
cal, syntactic, and semantic information. This capability allows them to quickly and
efficiently adapt to specific tasks through fine-tuning. Researchers have illustrated
that natural language exhibits long-range dependencies (Ebeling & Neiman, 1995;
Zanette, 2014), a characteristic that renders Transformer architectures particularly
effective for addressing related tasks. Therefore, pre-trained language models can
capture language’s deeper semantics and contextual dependencies, providing richer and more fine-grained textual representations, which are essential for under-
standing the nuanced emotions and complex thought processes in reflective writ-
ing. However, pre-trained language models provide excellent performance; their
black-box” nature makes the model decision-making process challenging to inter-
pret, which may be a problem in application scenarios requiring high transparency
and interpretability (Kraus etal., 2020). Research has demonstrated that implement-
ing non-transparent AI systems in teacher education may provoke AI anxiety among
users (Hopcan et al., 2023) and decrease their acceptance of AI technology (e.g.,
Zhang etal., 2023b).
In sum, most current research on the classification of reflective writing has pre-
dominantly employed shallow ML techniques. However, there has been an emer-
gence of research utilizing pre-trained language models as well. In addition, the
classification of reflective writings primarily occurs at the sentence level. In reflec-
tive writing, deep reflection involves complex thought processes across many sen-
tences and even whole documents, and it may be difficult to identify true deep
reflection based on sentence-level analysis alone. Although sentence-level classifica-
tion is more accessible, exploring and developing effective document-level analysis
methods for assessing reflective writing is essential. Consequently, we advocate for
increased research on the document-level classification of reflective writing. Simul-
taneously, enhancing the model’s explanatory power and transparency is crucial with-
out compromising its accuracy.
2.3 The current study
In current research on reflective writing classification, the general approach is to
subdivide longer writings into individual sentences and to label these sentences
one by one. Subsequently, these labeled sentences are used as a training dataset for
classification algorithms. While this sentence-based approach simplifies the train-
ing process of the algorithm, it may miss contextual connections crucial in reflec-
tive writing. To address this shortfall, our study introduces a document-level clas-
sification approach. This approach seeks to preserve and analyze the text’s integrity,
enabling a more holistic evaluation that considers the narrative arc, thematic coher-
ence, and the interconnectivity of reflective thoughts. By doing so, our research will
improve the accuracy of AI feedback.
Regarding the modeling aspect, besides the commonly discussed decision trees
(Barthakur etal., 2022), random forests (Kovanović etal., 2018; Liu etal., 2019a,
2019b), and support vector machines (Ullmann, 2019), we incorporated additional
algorithms: Ridge Classifier, SGD Classifier, XGB Classifier, and Gradient Boost-
ing Classifier. To the best of our knowledge, these algorithms have yet to be widely
applied to the classification task of reflective writing. Therefore, our study aims to
fill this research gap and provide a foundation and reference for future related work.
Firstly, the Ridge Classifier can effectively deal with the problem of multicollinearity
among features and enhance the generalization ability of the model by introducing
the L2 regularization term (Hoerl & Kennard, 1970). Secondly, the SGD Classifier
is suitable for large-scale and high-dimensional data processing and can effectively
improve computational efficiency through iterative optimization (Robbins & Monro,
1951). Next, the XGB Classifier, as an advanced gradient boosting algorithm, espe-
cially performs very well when dealing with complex nonlinear data structures
(Chen & Guestrin, 2016). Finally, the Gradient Boosting Classifier enhances the
prediction accuracy by gradually correcting the errors of the previous model, which
is especially effective for unbalanced datasets (Friedman, 2001). These algorithms
may have an essential role for the classification of complex reflective writing. As
for feature engineering, we employed several methods, including BoW, TF-IDF, and
LIWC2015-based approaches.
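As an illustration of how these four additional classifiers could be set up, the following minimal sketch uses scikit-learn and the separate XGBoost package; the parameter values shown are illustrative placeholders, and the grids actually searched are reported later in Table 3.

```python
# Hedged sketch: instantiating the four less common classifiers discussed above.
# Values are placeholders, not the tuned settings used in the study.
from sklearn.linear_model import RidgeClassifier, SGDClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

candidate_models = {
    # L2-regularized linear classifier; alpha controls the regularization strength
    "ridge": RidgeClassifier(alpha=1.0),
    # linear model fitted by stochastic gradient descent; scales to high-dimensional text features
    "sgd": SGDClassifier(loss="hinge", penalty="l2", alpha=1e-4),
    # gradient-boosted trees (XGBoost implementation) for complex nonlinear structure
    "xgb": XGBClassifier(n_estimators=500, max_depth=5, learning_rate=0.1),
    # scikit-learn gradient boosting, fitting successive trees to the previous model's errors
    "gbc": GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3),
}
```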
Furthermore, we leveraged pre-trained language models such as BERT, RoB-
ERTa, Longformer, and BigBird. These state-of-the-art models, pre-trained on
extensive corpora, offer valuable contextual embeddings and enhance the under-
standing of text semantics. It is worth noting that BERT, a widely adopted pre-
trained language model, has gained recognition for its excellent performance (Dev-
lin etal., 2018). However, to address the processing of long-distance dependencies
in text, there are improved versions known as Longformer and BigBird. Longformer
utilizes a sparse-attention model (Beltagy et al., 2020), while BigBird combines
global and local attention mechanisms to handle long sequences efficiently (Zaheer
etal., 2020). Utilizing these two models enables more efficient analysis of lengthy
text sequences. Based on that, the study has the following research questions:
RQ 1: To what extent do shallow machine learning models employing diverse
language representations demonstrate effectiveness in classifying the reflective writing of pre-service teachers?
RQ 2: What is the performance of pre-trained language models when employed
to classify reflective writing of pre-service teachers?
RQ 3: How do shallow machine learning models compare to pre-trained language
models in terms of their effectiveness in classifying reflective writing of pre-ser-
vice teachers?
3 Methods
3.1 Research design and data collection
This study was conducted within a German university teacher education program,
focusing on two modules in a lecture: pedagogical diagnostics and classroom man-
agement. Both modules were taught through self-study, with the instructor provid-
ing study materials such as recorded lectures, slides, recommended readings, case
study assignments, and tasks for reflective writing. The student’s reflective writ-
ings were collected in a digital portfolio format (Gläser-Zikuda, 2015) throughout
these two modules. The module dedicated to pedagogical diagnostics extended over
a period of three weeks, whereas the module focusing on classroom management
was concluded within a single week. Following the completion of their respective
modules, students were obliged to submit their reflective writings within a fortnight. To
support pre-service teachers in their reflection, structured prompts based on Nar-
ciss’s (2006) framework were developed for this study. Four structured prompts
were incorporated, namely knowledge on task constraints (KTC), knowledge about
concepts (KC), knowledge on how to proceed I (KH I), and knowledge on how to
proceed II (KH II). Pre-service teachers were allowed to choose the prompt that best
suited their individual needs. The data for this study were collected from pre-service
teachers’ reflective writing during the winter semester 2021/22, summer semester
2022, and winter semester 2022/23. Data was gathered in an anonymous manner
consistent with the university’s policy on data protection. A total of N = 1043 reflec-
tive writings were obtained, with an average word count of 251.38 (SD = 143.08) for
the pre-service teachers’ reflective writings (see Fig. 1).

Fig. 1 Distribution of reflective writing words among pre-service teachers
3.2 Data annotation with qualitative content analysis
First and foremost, before training with ML algorithms, the reflective writing in
this study underwent annotation. To accomplish this, a qualitative content analy-
sis (Gläser-Zikuda et al., 2020) was employed, based on Hatton and Smith’s (1995) theory and Fütterer’s (2019) adapted coding scheme for classification (see
Table 1). The coding framework consisted of four levels: descriptive writing,
descriptive reflection, dialogic reflection, and critical reflection. Firstly, the entire
reflective writing was read to form an initial impression, aligning with a theo-
retically predefined category system ranging from level 0 to 3; this preliminary
rating was recorded. After this, meaning segments within the text were analyzed
using the category system following a structured analytical procedure. These seg-
ments could comprise single or multiple sentences linked by a common theme.
Rather than evaluating individual sentences within a meaning segment, they were
counted and employed as a multiplier for the segment’s level. For instance, if a
meaning segment included three sentences and was assessed at level two, then
level two was counted three times (3 × 2). The level assigned to each meaning
segment and the count of sentences it contained was documented post-coding.
The text was ultimately rated based on the most frequently occurring level, which
had to constitute at least 60% of the text. In cases where no single level met the
60% threshold, the next lower level was selected (e.g., Text 1: 10 sentences: 3 × 0,
2 × 1, 5 × 2 results in level 1; Text 2: 10 sentences: 5 × 0, 5 × 2 results in level 0,
as only 50% of the sentences are above level 0 and no elements of level 1 were
detected). This rating was then cross-referenced with the initial impression and subsequently reviewed by a second coder to test intercoder reliability. Because coding reflective writing is time-consuming, 300 reflective writings (28.76% of the total) were randomly selected for a second round of coding by three different experts, yielding Cohen’s kappa coefficients of 0.67, 0.66, and 0.73, respectively, values that are considered acceptable (McHugh, 2012).
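To make the rating rule concrete, the following minimal sketch, assuming the coder's output for one text is available as (level, sentence count) pairs, reproduces the two worked examples above; it illustrates the rule itself, not the tooling actually used in the study.

```python
# Hedged sketch of the document-level rating rule described above.
from collections import Counter

def rate_document(segments, threshold=0.60):
    """segments: list of (level, n_sentences) pairs assigned by the coder."""
    counts = Counter()
    for level, n_sentences in segments:
        counts[level] += n_sentences                      # sentence count acts as a multiplier
    total = sum(counts.values())
    # the most frequent level wins if it covers at least 60% of the sentences
    best = max(counts, key=lambda lvl: (counts[lvl], lvl))
    if counts[best] / total >= threshold:
        return best
    # otherwise step down to the next lower level that actually occurs in the text
    for lower in range(best - 1, -1, -1):
        if counts.get(lower, 0) > 0:
            return lower
    return 0

print(rate_document([(0, 3), (1, 2), (2, 5)]))   # Text 1 above -> level 1
print(rate_document([(0, 5), (2, 5)]))           # Text 2 above -> level 0
```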
According to the findings, the number of reflective writings per reflection
level was as follows: 261 for the level of descriptive writing, 545 for the level of
descriptive reflection, 209 for the level of dialogic reflection, and 28 for the high-
est level for critical reflection. The sample size of the critical reflection category (n = 28) was very small compared to the other categories. This resulted in a weaker
generalization of the model to this category, affecting the accuracy and reliability
of the overall model. While SMOTE oversampling has demonstrated its effective-
ness in addressing sample imbalance in numerous studies (Chawla etal., 2002;
Jung & Wise, 2020; Kovanović etal., 2018), we found it unsuitable for our long
text data upon manual evaluation. Due to the complexity and depth of reflective
writing, SMOTE-generated samples did not accurately reflect the characteristics of an authentic critical reflection sample.
Table 1 Guidelines for coding reflective writing (cf. Fütterer, 2019; Zhang etal., 2023a; translated by the
authors)
Category Description
0:Descriptive Writing A situation (action, behavior…) is described. No efforts to classify or explain
exist. Because reflective processes were defined as metacognitive, a mere
description does not represent a reflective process
1:Descriptive Reflection Situations are either justified (personal judgment, perspective), or feelings,
optional perspectives, or influential variables are reported, but without con-
necting them or considering their contextual embedding. Personal assump-
tions are presented
2:Dialogic Reflection Different perspectives, influencing factors, and justifications for situations are
identified. Perspectives are weighed in an intra-personal dialogue. For this
to happen, subjective theories and beliefs must become conscious. Compet-
ing perspectives are weighed up, leading to judgment
3:Critical Reflection It is recognized that both situations and the identified perspectives, influential
factors, and rationales are embedded in and influenced by a broader context
(including historical, social, and political). Values and norms of the profession’s goals are also challenged, and institutional expectations are included
Consequently, we decided to omit the
final category (namely, critical reflection) from the classification process.
3.3 Text pre-processing and feature engineering
In ML, text pre-processing and feature engineering play indispensable roles. Text
pre-processing encompasses a variety of techniques and stages designed to cleanse,
transform, and standardize raw text data. The primary aim of this process is to aug-
ment the accuracy of subsequent classification tasks. Meanwhile, feature engineer-
ing is a critical process involving transforming, extracting, and generating meaning-
ful features from the pre-processed data. It aids in mitigating the risk of overfitting
and bolsters the model’s explanatory power. The flow chart for the classification of
reflective writing is shown in Fig. 2.

Fig. 2 Flowchart for classifying reflective writings (modified from Tan et al., 2021, p. 548)
3.3.1 Text pre-processing
In this study, we employed a standard pipeline for reflective writing. The pipe-
line consisted of the following procedures: tokenization, stop word removal, and
lemmatization. Tokenization involved segmenting the text into individual tokens.
Stop word removal was performed to eliminate common and non-discriminatory words
that do not contribute significantly to the meaning of the text. This included typi-
cal German stop words (e.g., “die” (the), “sind” (are), and “wo” (where)), as well
as specific non-discriminatory words identified in Atzeni etal.’s (2022) work such
as “Aufgabe” (task), “pädagogische” (educational), “Bearbeitung” (processing), and
others. Additionally, this section removed unnecessary numerical digits, punctuation
marks, special characters, or any other symbols from the text. Lemmatization was
applied to transform words to their root form, reducing inflectional variations and
allowing for better analysis and understanding of the text.
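The following minimal sketch illustrates such a pipeline using spaCy's German model as one possible implementation; the study does not name a specific NLP library, so the model name and the small set of domain stop words shown here are illustrative assumptions.

```python
# Hedged sketch of the pre-processing pipeline described above (one possible implementation).
import spacy

nlp = spacy.load("de_core_news_sm")  # assumes the small German spaCy model is installed

# task-specific, non-discriminatory words removed in addition to standard German stop words
DOMAIN_STOP_WORDS = {"aufgabe", "pädagogische", "pädagogisch", "bearbeitung"}

def preprocess(text: str) -> list[str]:
    doc = nlp(text)
    tokens = []
    for token in doc:
        if not token.is_alpha:          # drop digits, punctuation, and special characters
            continue
        if token.is_stop:               # drop standard German stop words ("die", "sind", "wo", ...)
            continue
        lemma = token.lemma_.lower()    # lemmatize to the root form
        if lemma in DOMAIN_STOP_WORDS or token.text.lower() in DOMAIN_STOP_WORDS:
            continue
        tokens.append(lemma)
    return tokens

print(preprocess("Die Bearbeitung der pädagogischen Aufgabe war schwierig."))
```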
3.3.2 Feature engineering
Three feature extraction methods were employed in this study, namely, BoW, TF-
IDF, and LIWC2015 (Pennebaker et al., 2015). To extract features for BoW and
TF-IDF, the CountVectorizer and TfidfVectorizer functions from Scikit-learn in Python were utilized. Furthermore, the LIWC2015 has undergone validation processes and performs satisfactorily in classifying reflections in higher education settings (Jung & Wise, 2020). In this study, 87 out of 93 linguistic features available in the
LIWC2015 were utilized. Using the SelectKBest method (top 10) and a correla-
tion coefficient threshold greater than 0.1 or less than -0.1 (considered significant),
nineteen noteworthy features were extracted (refer to Table 2). Subsequently, the
extracted feature values were normalized using the Z-score method.
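A minimal sketch of these steps with scikit-learn follows; the texts, labels, and the liwc_scores matrix are placeholders, since the LIWC2015 scores are produced by the LIWC software outside of Python, and the study additionally retains features whose correlation with the reflection level exceeds |0.1|.

```python
# Hedged sketch of the feature engineering steps named above (placeholder data throughout).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

texts = ["reflexion unterricht situation", "analyse perspektive ursache",
         "beschreibung ablauf stunde", "theorie praxis beispiel"]   # placeholder documents
labels = [1, 2, 0, 1]                                               # placeholder reflection levels

bow_features = CountVectorizer().fit_transform(texts)       # Bag-of-Words counts
tfidf_features = TfidfVectorizer().fit_transform(texts)     # TF-IDF weights

rng = np.random.default_rng(2023)
liwc_scores = rng.random((len(texts), 87))                  # placeholder for the 87 LIWC2015 scores

# keep the top-10 features by univariate test, then z-score normalize the selected values
selected = SelectKBest(f_classif, k=10).fit_transform(liwc_scores, labels)
liwc_scaled = StandardScaler().fit_transform(selected)
```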
3.4 Classification algorithm selection and optimization
The objective of the study was to compare the performance of pre-trained language
models with that of other shallow ML approaches. In this study, to ensure the repro-
ducibility of the experimental process and the consistency of the results, we uni-
formly used 2023 as the seed of the random number generator in all shallow ML and pre-trained language model experiments. This section begins by providing an introduction to the selected shallow ML algorithms, followed by an overview of pre-
trained language models.
3.4.1 Shallow machine learning
Prior research has yielded substantial results using Decision Tree, Support Vector
Machines, and Random Forest, which are prevalent methods in the classification of
reflective writing (Nehyba & Štefánik, 2023; Ullmann, 2019; Wulff etal., 2021).
Leveraging this existing knowledge, our study strives to augment these established
algorithms and introduce new models to address the complexities inherent in diverse
datasets. In our analysis, we implemented seven shallow ML algorithms: Decision
Tree, Support Vector Machine, Random Forest, Ridge Classifier, SGD Classifier,
XGB Classifier, and Gradient Boosting Classifier. To ensure dependable and con-
sistent outcomes, we trained all algorithms on a training set comprising 80% of the
data. We tested them on a separate set that accounted for the remaining 20% of the
data. A five-fold cross-validation was further implemented to validate the results.
In-depth details regarding the parameters specific to each algorithm are available
in Table3. We evaluated the performance of each algorithm based on its accuracy.
True Positives (TP) are cases the model correctly predicts as the positive class, and True Negatives (TN) are cases it correctly predicts as the negative class; False Positives (FP) and False Negatives (FN) are cases incorrectly predicted as positive and negative, respectively. Accuracy is then computed as

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{1}$$
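For illustration, the following minimal sketch reproduces this protocol with scikit-learn, using synthetic stand-in features instead of the study's data and a subset of the Support Vector Machine grid from Table 3.

```python
# Hedged sketch: 80/20 train-test split, grid search with five-fold cross-validation,
# and accuracy on the held-out test set. X and y are synthetic stand-ins here.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, n_features=50, n_informative=20,
                           n_classes=3, random_state=2023)   # placeholder feature matrix and labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=2023, stratify=y)

grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 100, 1000], "gamma": [0.01, 0.1, 1], "kernel": ["rbf", "linear"]},
    cv=5,                      # five-fold cross-validation on the training portion
    scoring="accuracy",
)
grid.fit(X_train, y_train)
print("Held-out accuracy:", accuracy_score(y_test, grid.predict(X_test)))
```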
3.4.2 Pre-trained language models
AI has seen a substantial surge in using pre-trained language models in recent years.
These models, trained on vast amounts of unlabeled textual data, are adept at under-
standing semantic word representations and their contextual relationships.
Table 2 Results of linguistic features extracted from LIWC2015 (mean, standard deviation, and correlation coefficient with reflection level; *** p < .001; ** p < .01; * p < .05)

Feature Overall Level 1 Level 2 Level 3 Correlation
WC 246.16 (136.47) 147.93 (55.91) 247.24 (111.59) 366.02 (166.65) 0.54***
Authentic 90.07 (17.29) 87.94 (20.75) 90.72 (16.46) 91.02 (14.26) 0.06*
WPS 21.57 (15.74) 23.75 (25.34) 20.73 (11.57) 21.02 (6.89) -0.06*
other 0.55 (0.71) 0.48 (0.92) 0.54 (0.63) 0.63 (0.56) 0.07*
shehe 0.54 (0.70) 0.47 (0.92) 0.53 (0.63) 0.62 (0.55) 0.07*
adverb 5.35 (1.90) 5.02 (2.32) 5.40 (1.73) 5.61(1.67) 0.11***
negate 0.98 (0.96) 0.91 (1.34) 0.95 (0.80) 1.12 (0.78) 0.07*
verb 18.31 (3.00) 18.44 (3.39) 18.45 (2.95) 17.77 (2.56) -0.07*
posemo 4.43 (1.81) 4.59 (2.08) 4.43 (1.81) 4.22 (1.40) -0.07*
social 6.02 (2.24) 5.82 (2.57) 5.89 (2.10) 6.62 (2.02) 0.12***
female 0.11 (0.28) 0.09 (0.29) 0.09 (0.26) 0.17 (0.34) 0.08**
cogproc 23.15 (4.14) 22.83 (4.77) 23.07 (4.02) 23.74 (3.54) 0.07*
discrep 2.41 (1.38) 2.17 (1.58) 2.42 (1.32) 2.69 (1.23) 0.13***
differ 4.30 (1.97) 4.02 (2.35) 4.25 (1.84) 4.78 (1.70) 0.13***
achiev 6.56 (2.02) 6.68 (2.26) 6.65 (2.05) 6.17 (1.54) -0.08*
power 1.57 (1.16) 1.53 (1.33) 1.52 (1.10) 1.76 (1.03) 0.06*
focusfuture 0.68 (0.73) 0.58 (0.79) 0.69 (0.76) 0.76 (0.56) 0.09**
Analytic 57.91 (29.10) 58.52 (31.50) 56.68 (27.94) 60.38 (28.91) 0.02
Clout 26.07 (14.36) 27.09 (17.31) 25.52 (13.69) 26.25 (11.75) -0.02
A notable category within these models comprises those developed on the Transformer architecture. These Transformer-based pre-trained language models utilize self-attention
mechanisms and multi-layered neural networks to encode input text. This process
results in the generation of word vector representations that effectively capture the
relevance of context. Pre-trained language models present broad application possi-
bilities across diverse NLP tasks. By fine-tuning these models to conform to specific
task demands, one can noticeably enhance their performance on the targeted tasks.
In our study, we apply a classification approach to reflective writing using four pre-
trained language models. Two of these models, BERT and RoBERTa, are common
pre-trained language models. The other two, Longformer and BigBird, are specifically
designed for handling long texts. BERT and its derivatives have proven highly effec-
tive in classifying reflective texts.
Table 3 An overview of the parameters for the various algorithms utilized for training
Algorithms Parameters
Decision Tree criterion: [entropy]
splitter: [random]
max_depth: np.arange (30,50,1)
min_samples_split: [2,3,4,5,6,7,8,9]
max_features: [None]
Support Vector Machine C: [0.1,1,100,1000]
gamma: [0.01, 0.1, 1]
kernel: [rbf,poly,sigmoid,linear]
degree: [1,2,3,4,5,6]
Random Forest n_estimators: [100, 200, 500]
max_depth: [None, 5, 10, 20]
min_samples_split: [2, 5, 10]
min_samples_leaf: [1, 2, 4]
max_features: [auto, sqrt, log2]
Ridge Classifier alpha: [0.1, 1.0, 10.0]
solver: [auto, svd, cholesky, lsqr, sparse_cg, sag, saga]
normalize: [True, False]
max_iter: [1000, 5000, 10000]
SGD Classifier loss: [hinge, log, modified_huber, squared_hinge, perceptron]
penalty: [l1, l2, elasticnet]
alpha: [0.0001, 0.001, 0.01]
learning_rate: [constant, optimal, invscaling]
eta0: [0.01, 0.1, 1]
XGB Classifier n_estimators: [100, 500, 1000]
max_depth: [3, 5, 7]
learning_rate: [0.01, 0.1, 1]
subsample: [0.5, 0.7, 1]
colsample_bytree: [0.5, 0.7, 1]
Gradient Boosting Classifier n_estimators: [50, 100, 200]
learning_rate: [0.01, 0.1, 1]
max_depth: [3, 5, 10]
min_samples_split: [2, 5, 10]
min_samples_leaf: [1, 2, 4]
max_features: [auto, sqrt, log2]
However, the constraint of a 512-token limit on the embedding length in BERT poses challenges when dealing with lengthy text datasets.
This limitation often results in subpar performance. Common workarounds for this issue
include text truncation, adjustments to the attention mechanism, and sentence-wise pro-
cessing. Despite their usage, these strategies present certain drawbacks. Truncation can
lead to losing vital information, while attention mechanism and sentence-wise process-
ing modifications can significantly increase computational complexity. A “head and tail”
strategy was employed for BERT and RoBERTa in the experiments: the first 255 tokens (head) and the last 255 tokens (tail) of each writing were retained, so that the truncated input remained within 510 tokens. In contrast, Longformer and BigBird support a maximum length of 4096 tokens, which covers the entire input of our dataset. The
dataset designated for use with the pre-trained language models was partitioned: 70%
was allocated for training, 10% for validation, and 20% for testing purposes. We utilized
several models available through the Hugging Face platform (https://huggingface.co/).
These included: bert-base-german-cased from Deepset (Chan etal., 2020), princetyagi/
roberta-base-wechsel-german-finetuned-germanquad as detailed in Minixhofer et al.
(2021), allenai/longformer-base-4096 as described in Beltagy etal. (2020), and google/
bigbird-roberta-base as presented in Zaheer etal. (2020). The hyperparameters employed
for these two categories of pre-trained language models are detailed in Table4.
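As a sketch of this setup, the following code shows how one of the listed checkpoints could be fine-tuned with the Hugging Face Trainer using the Table 4 hyperparameters; the texts and labels are placeholders for the actual 70/10/20 split, and the head_tail_truncate helper illustrates the head-and-tail strategy described above rather than the authors' exact implementation.

```python
# Hedged sketch: fine-tuning a long-text checkpoint with the Hugging Face Trainer.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def head_tail_truncate(token_ids, head_len=255, tail_len=255):
    """Head-and-tail strategy for the 512-token models (BERT, RoBERTa):
    keep the first 255 and last 255 tokens of an over-long writing."""
    if len(token_ids) <= head_len + tail_len:
        return token_ids
    return token_ids[:head_len] + token_ids[-tail_len:]

checkpoint = "allenai/longformer-base-4096"   # long-text model; no head-tail truncation needed
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

train_texts, train_labels = ["..."], [0]      # placeholders for the training split
dev_texts, dev_labels = ["..."], [0]          # placeholders for the validation split

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=4096)

train_ds = Dataset.from_dict({"text": train_texts, "label": train_labels}).map(tokenize, batched=True)
dev_ds = Dataset.from_dict({"text": dev_texts, "label": dev_labels}).map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="reflection-classifier",
    learning_rate=2e-5,                       # Table 4
    per_device_train_batch_size=32,           # Table 4 (Longformer/BigBird)
    num_train_epochs=4,                       # Table 4
    seed=2023,                                # seed used throughout the study
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  eval_dataset=dev_ds, tokenizer=tokenizer)   # AdamW is the Trainer default
trainer.train()
print(trainer.evaluate())
```

For BERT and RoBERTa, the same pipeline would use max_length=512, a batch size of 16, and head-and-tail truncation of over-long writings before encoding.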
3.5 Technical implementation
The study on shallow ML was conducted using Python 3.10 and Scikit-learn version
1.2.1. Our development environment was Jupyter Notebook version 6.4.12. Compu-
tations were carried out on a machine featuring an AMD Ryzen 5 4500U processor,
Radeon Graphics operating at 2.38GHz, and equipped with 16.0GB of RAM. The
pre-trained language models were executed on a system with an NVIDIA GeForce
GTX 1650 Ti graphics card (6 GByte). The classification tasks were facilitated
by Pytorch framework version 2.0.1. Coding activities were conducted within the
PyCharm programming environment, version 2022.2.
Table 4 Hyperparameter settings

Hyperparameters BERT and RoBERTa Longformer and BigBird
Max_seq_length 512 4096
Learning_rate 2e-5 2e-5
Batch_size 16 32
Epochs 4 4
Optimizer AdamW AdamW
4 Results
4.1 Results of shallow machine learning
The average accuracy achieved by the shallow ML algorithms is typically under
60%. Among the combinations that performed best, integrating the Gradient Boost-
ing Classifier with LIWC2015 is particularly notable, achieving an accuracy rate of
61.97%. Furthermore, the BoW technique yields the highest performance, record-
ing an accuracy of 60.28% when employed with Support Vector Machines. The TF-
IDF method attains optimal results when applied to the XGB classifier, garnering
an accuracy of 61.69%. In summary, when it comes to different feature engineering
techniques, LIWC2015 generally outperforms both BoW and TF-IDF across vari-
ous algorithms. A comparative analysis of the accuracy achieved by various feature
extraction techniques and algorithms is visually represented in Fig. 3.

Fig. 3 Accuracy of shallow machine learning with various feature engineering
4.2 Results of pre-trained language models
The pre-trained language models designed for handling long texts, namely Long-
former and BigBird, delivered the most impressive performance with accuracy
rates of 77.22% and 76.26% respectively. Furthermore, BERT and RoBERTa, the
two other pre-trained language models used in our study, achieved accuracy rates
of 73.28% and 74.25% respectively. These accuracy levels mark a considerable
improvement, a rise of 12% to 16%, in comparison to the shallow ML models.
Table5 shows all results of pre-trained language models.
5 Discussion
The primary objective of our research was to classify pre-service teachers’ reflec-
tive writings at the document level by employing AI algorithms. For this purpose,
we applied a qualitative content analysis for the annotation of the reflective writings
to classify reflection levels. Then, we utilized a range of shallow ML algorithms
enhanced by pre-trained language models such as BERT, RoBERTa, BigBird, and
Longformer. Our empirical analyses led to several significant findings. The results
underscore the superiority of pre-trained language models over shallow ML algo-
rithms in classifying reflective writings, with BigBird and Longformer demonstrat-
ing notably higher accuracy at the document level. Nonetheless, our research also
identified certain biases between AI evaluations and human teachers’ evaluations.
These discrepancies are further explored in the subsequent sections.
Firstly, the superior efficacy of pre-trained language models over shallow ML
models in analyzing reflective writing is consistently corroborated by other research.
For example, Nehyba and Štefánik (2023) illustrated that in the classification of
reflective writing within higher education contexts, the accuracy of a deep neural
model (XLM-RoBERTa) varied from 92.68% to 97.56%, surpassing the performance
of a shallow ML algorithm (Random Forest), which ranged from 80.49% to 82.93%.
Furthermore, Wulff etal. (2023), in their research of reflective writing classification,
discovered that the feedforward neural network (FFNN) exhibited the lowest perfor-
mance, achieving weighted F1 mean scores of 0.62 and 0.64, respectively. This was
closely followed by long short-term memory neural networks (LSTM) with weighted
F1 mean values of 0.72. In contrast, the fine-tuned BERT model results significantly
surpassed other models, achieving a weighted F1 mean score of 0.82. Despite the var-
iations in research contexts, datasets, and theoretical frameworks across these studies,
the consistently high performance of pre-trained language models in analyzing reflec-
tive writing is apparent. These results highlight the robustness and versatility of pre-
trained language models in capturing the complex nature of reflective content.
Moreover, our study further highlights the outstanding performance of the BigBird
and Longformer models in classifying reflective writing at the document level. To
our knowledge, there is currently no research specifically focused on the classification
of reflective writing using these models (BigBird and Longformer). However, these
approaches have been extensively applied in other domains. For instance, in the medi-
cal field, Li etal. (2023) demonstrated that Clinical-Longformer and Clinical-BigBird
significantly outperform ClinicalBERT and other models designed for short sequences
across all evaluated downstream tasks.
Table 5 Results of pre-trained language models (accuracy on the development and test sets)

Models Dev Dataset Test Dataset
BERT 68.25% 73.28%
RoBERTa 70.26% 74.25%
Longformer 73.32% 77.22%
BigBird 74.69% 76.26%
Compared to shallow ML algorithms and pre-trained language models optimized for short sequences, these pre-trained language
models designed to handle long sequences exhibit superior performance across various
tasks. For example, shallow ML models often face challenges when dealing with long-
sequence inputs due to factors such as high-dimensional feature spaces, retaining con-
textual relevance over lengthier texts, and the absence of sequential information. BERT
and RoBERTa models address some of these challenges; however, because their self-attention operates over a capped input window, they can struggle to capture longer dependencies within the text (Devlin et al., 2018; Liu et al., 2019a, 2019b). The self-attention mechanism’s demand for
memory grows quadratically as sequence lengths extend, making training times unfeasi-
bly long and rapidly surpassing the memory limits of current Graphics Processing Units
(GPUs). Therefore, models depending on full self-attention, like BERT and RoBERTa,
usually set a cap on the input sequence length at 512 tokens to manage these constraints.
However, BigBird and Longformer models, which incorporate global attention mecha-
nisms and extended local attention spans, more effectively capture textual context (Belt-
agy etal., 2020; Zaheer et al., 2020). The utilization of pre-trained language models
adept at processing long sequences is crucial for the classification of reflective writing.
As Rosé etal. (2008) have pointed out, the choice of segmentation method significantly
affects classification performance. Numerous scholars argue that reflections often extend
across multiple sentences and cannot be adequately captured through sentence-level
classification alone (Moon, 2013). This perspective is supported by Wulff etal. (2023),
as well as Nehyba and Štefánik (2023), who underscore the limitations of sentence-level
segmentation in capturing long-range dependencies and addressing the complexities of
classification challenges, respectively. Reflection may span several sentences or be inter-
woven with various reflective categories within complex sentences. In addition to the
technical aspects, it has been proposed that incorporating strong theoretical frameworks
such as discourse theory can improve segmentation practices (Stede, 2016).
Lastly, it is essential to acknowledge the bias of the pre-trained language model for
the analysis of reflective writing when compared to the teacher’s assessment. The cases
of RW1-KM and RW2-KM (translated into English by the authors) exemplify instances
where the AI (Longformer) underestimated and overestimated the reflection level, respectively.
These two examples showed that teachers can better understand students’ intentions,
background knowledge, and implicit meanings in reflective writing. For example, RW1-
KM delves into the particulars of a classroom discipline issue, presenting an investigation
of the event’s specifics. This includes an analysis of the underlying motivations behind
the student’s behavior and the coping strategies employed. Such a reflection exhibits a
profound comprehension of the circumstances and a personalized approach, aligning
with the characteristics of dialogic reflection. However, the AI’s assessment classified this
reflection as descriptive, likely due to the extensive use of descriptive language within
the reflection’s narrative. In RW2-KM, the student showcased an ability to objectively
recount events and delve into their underlying causes, employing Becker’s theory to eluci-
date the classroom scenario. This analysis penetrates beyond surface-level observations to
provide a theoretical interpretation of the teaching and learning context, an approach that
typifies descriptive reflection. Nonetheless, the AI’s evaluation misidentified this reflec-
tion as dialogic, likely influenced by the extensive use of language related to third-party
actions and the text’s complex linguistic expressions. From the analyses of the two reflec-
tions described above, it is possible to observe differences in the classification of types
of reflections between manual and AI assessments. The potential limitations of AI in
understanding the depth and multidimensionality of reflections remind us of the caution
that should be taken in using AI in educational assessment, especially when evaluating
complex thinking and reflective processes. This also underscores the irreplaceability of manual assessment in evaluating complex cognitive processes.
“After reading through the situation, I realized that the class disruption was
on the students’ part. However, the situation did not explicitly mention whether
the teacher saw who threw the sponge in her face. Therefore, I was unsure
whether the teacher should address someone personally or the whole class
when dealing with the conflict. Furthermore, it was not easy for me to tell from
the situation whether the pupils threw the objects around the classroom out of
boredom or whether it was a targeted attack on the teacher. Accordingly, the
characteristic of aggressive behavior did not appear in my analysis of the situ-
ation. It was also unclear to me how practically orientated I should argue, so
my answer contained more of a theoretical aspect. In order to put myself more
in the situation, it would be useful to know which year group and which type
of school was involved. Measures such as a note for parents in the homework
booklet would not have much effect in the ninth or tenth year of secondary
school, as many parents are not interested in their children’s school affairs.
Unfortunately, many parents do not speak German. I was able to gain such
insights during my orientation internship.” (Human Assessment: Dialogic
Reflection; AI Assessment: Descriptive Reflection) [RW1-KM]
“The comparison with the feedback variants shows a high degree of agreement
with the proposed solutions. The task processing is based centrally on Becker’s
analysis, as is the feedback. The processing describes what has happened, clas-
sifies the degree of conflict, and then looks for possible causes. Just as Becker
describes and proposes the model solution. In the possible response, the process-
ing focuses more on the student’s well-being than on the feedback. In both, the
teacher’s misconduct is looked at. In processing, a conversation is sought with
the students, whereas feedback focuses more on disciplinary measures. However,
it should be noted that the option of silent and individual work is good for show-
ing students their unacceptable behavior. The treatment also classifies the situa-
tion as a central conflict. From a personal point of view, the end of the situation
described seems extreme, and such behavior on the part of the students is only
conceivable in individual cases. Nevertheless, it is challenging to develop case
studies that do justice to real-life situations. However, always starting from an
extreme situation can be a good way of preparing the trainee teachers so that the
real situation does not take them by surprise. However, more than individual case
studies is required as an all-encompassing preparation.” (Human Assessment:
Descriptive Writing, AI Assessment: Dialogic Reflection) [RW2-KM]
5.1 Limitations
This study faces several limitations that warrant acknowledgment and consideration.
First, regarding data limitations, we encountered challenges in addressing category
imbalance. Despite employing rigorous sampling techniques to mitigate this issue,
the “critical reflection” category was excluded because of its very small sample size. Although this decision facilitated a concentrated examination of categories
possessing adequate data, it concurrently limited the possibility of comprehensively investigating the role and impact of critical reflection within educational practices.
Secondly, regarding the evaluation aspect, our study utilized the reflective level
framework proposed by Hatton and Smith in 1995. This framework offers significant
benefits; however, it might not fully encompass the breadth of reflective outcomes.
Students’ reflections encompass a wide range of subjects and levels of cognitive
depth, suggesting that a singular framework may not adequately capture this diver-
sity. Consequently, the assessment criteria used in this study could restrict the thor-
oughness of our evaluation. Lastly, regarding the transparency and interpretability of
the algorithms, our study employed sophisticated ML models, including pre-trained
language models. Although pre-trained language models, such as BERT, demonstrate superior performance compared to shallow ML models, they often act as “black boxes,” offering limited insights into their decision-making processes. This
opacity can be a significant shortcoming when the goal is to understand the nuanced
aspects of reflective writing, a process inherently rich in context and subjectivity.
5.2 Implications and future work
The implications of this study surpass the boundaries of teacher education, resonat-
ing across a wide array of disciplines and professional fields where reflective writing
and automated feedback play crucial roles. Enhancing students’ reflective skills pre-
sents a significant challenge across all areas of professional education (Jung etal.,
2022; Körkkö etal., 2016). By leveraging a broader array of ML models, includ-
ing the pre-trained language models assessed in this study, students can gain access
to more precise and prompt feedback. This enhancement has the potential to sig-
nificantly improve the quality of their reflective writing, which, in turn, can posi-
tively impact pedagogical practices. For educators, the adoption of automated sys-
tems offering highly accurate feedback presents benefits. It significantly reduces the
workload involved in manually assessing reflective writings, allowing educators to
allocate more time and energy to other vital teaching activities, including curricu-
lum development, in-class engagement, and personal support for students.
For future research directions, it is essential to enhance the feedback mechanisms for reflective writing in education by addressing several aspects. First, the enhancement of feedback content must include the assessment of students' self-regulation. As the definition of reflection makes clear, reflection involves not only cognitive dimensions but also affective, psychological, and other aspects (Boud et al., 2013; Kember, 1999). While existing research has predomi-
nantly focused on task-level feedback (e.g., Hattie & Timperley, 2007), there
has been a significant oversight in addressing self-regulated learning assess-
ments. These assessments are crucial for evaluating students’ metacognitive and
self-directed learning processes during reflective writing activities. By integrat-
ing indicators of self-regulation, educators can develop a more comprehensive
understanding of students’ capabilities to monitor their learning progress, set
objectives, and modify their cognitive approaches. Second, from a technological standpoint, generative AI technologies, such as Retrieval-Augmented Generation (RAG) (Lewis et al., 2020), offer substantial potential for enhancing the qual-
ity of feedback. RAG technology, which synergizes information retrieval and text
generation capabilities, offers significant advantages in delivering personalized
feedback for reflective writing. This method generates targeted and comprehen-
sive feedback by retrieving information pertinent to a student’s specific assign-
ment or query and then integrating it with the generative capabilities of neural
networks. Furthermore, training LLMs on literature specific to various didactics
(e.g., constructivism, behaviorism, cognitivism) can equip these models with a
deeper understanding of these educational theories. This training allows the feed-
back generated to align more closely with the principles of the relevant didac-
tics, thereby enhancing its educational effectiveness. Lastly, empirical validation
of the effectiveness of these proposed assessment and feedback mechanisms is
essential. Future research should employ experimental designs that implement
these enhanced strategies with actual student populations. Conducting such
empirical studies is crucial to determine whether these innovations genuinely
support the development of reflective skills among students.
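To make the retrieval-augmented feedback idea above more concrete, the following minimal sketch retrieves the didactic passages most similar to a student's reflection with a simple TF-IDF retriever and assembles them into a feedback prompt. The miniature knowledge base, the prompt wording, and the final generation step (left abstract here) are illustrative assumptions, not the pipeline proposed in this article.

```python
# Minimal retrieval-augmented feedback sketch (illustrative assumptions throughout).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini knowledge base of didactic guidance.
knowledge_base = [
    "Descriptive reflection reports events without analysing alternative explanations.",
    "Dialogic reflection weighs alternative viewpoints and possible courses of action.",
    "Feedback should point to the next attainable level of reflection, not only errors.",
]

student_reflection = "I described what happened in the lesson but did not consider other options."

vectorizer = TfidfVectorizer().fit(knowledge_base + [student_reflection])
kb_vectors = vectorizer.transform(knowledge_base)
query_vector = vectorizer.transform([student_reflection])

# Retrieve the two passages most relevant to the student's text.
scores = cosine_similarity(query_vector, kb_vectors)[0]
top_passages = [knowledge_base[i] for i in scores.argsort()[::-1][:2]]

prompt = (
    "Context:\n- " + "\n- ".join(top_passages)
    + f"\n\nStudent reflection:\n{student_reflection}\n\n"
    + "Task: Give brief, encouraging feedback that helps the student reach a deeper level of reflection."
)
print(prompt)  # this prompt would then be passed to a generative LLM of choice
```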
6 Conclusion
Reflective writing is a critical element of teacher education and is crucial for
promoting professional development. However, the subjectivity and complexity of reflective writing make providing feedback a substantial challenge, one that previous research has tackled using ML and NLP techniques. These existing methods fre-
quently encounter two significant obstacles: a reliance on sentence-level classi-
fication and the employment of low-quality training data. These approaches fail
to capture the nuanced depth and multifaceted complexities of reflective writing.
To overcome these challenges, the current study proposes two main innovations.
First, we suggest the collection of reflective writings through e-portfolios, ena-
bling a more structured and ongoing assessment of reflective practices. Second,
we propose a document-level approach to the classification of reflective writing. By considering the entire text as a single unit of analysis, we
aim to provide a more holistic and comprehensive understanding of the reflec-
tive process. Our study contrasts shallow ML algorithms with pre-trained lan-
guage models in classifying reflective writing at the document level. Our empiri-
cal results demonstrate that pre-trained language models consistently surpass the
performance of shallow ML algorithms, particularly highlighting the effective-
ness of BigBird and Longformer in processing extended text sequences. This
observation not only underscores the technical superiority of advanced language
models but also contributes valuable insights into developing efficacious assess-
ment strategies for reflective writing.
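For readers who wish to reproduce a document-level setup of the kind contrasted here, the sketch below fine-tunes a Longformer checkpoint for sequence classification with the Hugging Face Trainer. The checkpoint (an English base model, whereas this study worked with German texts), the three-label assumption, and the hyperparameters are placeholders rather than the exact configuration used in this study.

```python
# Sketch of document-level classification with a long-sequence transformer.
# Checkpoint, label count, and hyperparameters are illustrative assumptions.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

texts = ["first reflective writing ...", "second reflective writing ..."]  # hypothetical documents
labels = [0, 1]                                                            # hypothetical reflection levels

checkpoint = "allenai/longformer-base-4096"  # English base model; a German-adapted variant would be needed in practice
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

def tokenize(batch):
    # Whole documents as single inputs; long-sequence models avoid aggressive truncation.
    return tokenizer(batch["text"], truncation=True, max_length=1024, padding="max_length")

dataset = Dataset.from_dict({"text": texts, "label": labels}).map(tokenize, batched=True)

args = TrainingArguments(output_dir="longformer-reflection", num_train_epochs=3,
                         per_device_train_batch_size=2, learning_rate=2e-5)
trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()
```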
Appendix
See Table6.
Table 6 Overview of shallow machine learning parameter settings
Best parameters and accuracy are listed per feature set, in the order BoW | TF-IDF | LIWC2015.

Decision Tree (accuracy: 51.83% | 52.54% | 55.77%)
  criterion: entropy | entropy | entropy
  max_depth: 49 | 42 | 38
  max_features: None | None | None
  min_samples_split: 4 | 4 | 4
  splitter: random | random | random

Support Vector Machine (accuracy: 60.28% | 56.06% | 59.15%)
  C: 100 | 100 | 100
  degree: 1 | 2 | 1
  gamma: 0.01 | 1 | 0.1
  kernel: rbf | poly | poly

SGD Classifier (accuracy: 54.93% | 55.35% | 60.28%)
  alpha: 0.01 | 0.0001 | 0.001
  eta0: 0.1 | 0.01 | 0.1
  learning_rate: constant | constant | optimal
  loss: log | log | log
  penalty: l1 | l1 | l1

Ridge Classifier (accuracy: 56.90% | 54.79% | 59.01%)
  alpha: 10.0 | 10.0 | 0.1
  max_iter: 1000 | 1000 | 1000
  normalize: True | False | True
  solver: auto | auto | saga

Random Forest (accuracy: 57.74% | 56.90% | 60.70%)
  max_depth: None | None | 20
  max_features: auto | sqrt | log2
  min_samples_leaf: 1 | 1 | 2
  min_samples_split: 2 | 2 | 5
  n_estimators: 200 | 200 | 200

Gradient Boosting Classifier (accuracy: 59.86% | 60.00% | 61.97%)
  learning_rate: 0.1 | 0.1 | 1
  max_depth: 5 | 3 | 10
  max_features: log2 | sqrt | auto
  min_samples_leaf: 2 | 1 | 1
  min_samples_split: 10 | 10 | 5
  n_estimators: 200 | 200 | 50

XGB Classifier (accuracy: 57.46% | 61.69% | 60.85%)
  colsample_bytree: 0.7 | 0.7 | 0.5
  learning_rate: 0.1 | 0.1 | 0.01
  max_depth: 5 | 5 | 5
  n_estimators: 500 | 100 | 100
  subsample: 0.7 | 0.5 | 1
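The parameter values in Table 6 are the kind of result produced by an exhaustive grid search. The sketch below reproduces such a search for one of the listed models, a support vector machine over TF-IDF features; the toy corpus, labels, and grid values are illustrative assumptions rather than the exact search space behind the table.

```python
# Illustrative grid search of the kind summarized in Table 6 (SVM over TF-IDF features).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

docs = [
    "reflective writing one ...", "reflective writing two ...",
    "reflective writing three ...", "reflective writing four ...",
    "reflective writing five ...", "reflective writing six ...",
]  # hypothetical corpus
labels = [0, 0, 1, 1, 2, 2]  # hypothetical reflection levels

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svc", SVC())])
param_grid = {
    "svc__C": [1, 10, 100],
    "svc__kernel": ["rbf", "poly"],
    "svc__degree": [1, 2],
    "svc__gamma": [0.01, 0.1, 1],
}

# cv kept small for the toy corpus; a real run would use more folds and data.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy", n_jobs=-1)
search.fit(docs, labels)
print(search.best_params_, round(search.best_score_, 4))
```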
Acknowledgements We wish to express our appreciation to Jessica Schießl and Meltem Doganay for
their contribution to the secondary coding of sections of the reflective writing.
Authors’ contributions MGZ and CZ conceived the study. FH and LP carried out the data collection. CZ
conducted data analysis and prepared the manuscript draft. Both MGZ and FH supervised the study and
revised the manuscript draft. All authors made significant contributions to the final manuscript.
Funding Open Access funding enabled and organized by Projekt DEAL. This work was supported by the
Federal Ministry of Education and Research (Germany) [obtained by Prof. Dr. Michaela Gläser-Zikuda,
grant number 16DHB4019]. The authors express their gratitude.
Data availability Due to the presence of personally identifiable information within the dataset, it is not
publicly shareable in accordance with privacy protection laws and ethical guidelines of the universities
Erlangen-Nürnberg and Berlin involved in this research.
Declarations
Conflict of interest None.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License,
which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long
as you give appropriate credit to the original author(s) and the source, provide a link to the Creative
Commons licence, and indicate if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line
to the material. If material is not included in the article’s Creative Commons licence and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permis-
sion directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/
licenses/by/4.0/.
References
Atzeni, D., Bacciu, D., Mazzei, D., & Prencipe, G. (2022). A Systematic Review of Wi-Fi and Machine
Learning Integration with Topic Modeling Techniques. Sensors,22(13), 4925. https:// doi. org/ 10.
3390/ s2213 4925
Barthakur, A., Joksimovic, S., Kovanovic, V., Mello, R. F., Taylor, M., Richey, M., & Pardo, A. (2022).
Understanding Depth of Reflective Writing in Workplace Learning Assessments Using Machine
Learning Classification. IEEE Transactions on Learning Technologies,15(5), 567–578. https:// doi.
org/ 10. 1109/ TLT. 2022. 31625 46
Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The long-document transformer.
arXiv:2004.05150. Retrieved June 22, 2023, from https:// doi. org/ 10. 48550/ arXiv. 2004. 05150
Boud, D., Keogh, R., & Walker, D. (Eds.). (2013). Reflection: Turning experience into learning.
Routledge.
Boyd, E. M., & Fales, A. W. (1983). Reflective learning: Key to learning from experience. Journal of
Humanistic Psychology,23(2), 99–117. https:// doi. org/ 10. 1177/ 00221 67883 232011
Cai, Z., Gui, Y., Mao, P., Wang, Z., Hao, X., Fan, X., & Tai, R. H. (2023). The effect of feedback on
academic achievement in technology-rich learning environments (TREs): A meta-analytic review.
Educational Research Review, 100521. https:// doi. org/ 10. 1016/j. edurev. 2023. 100521
Carpenter, D., Geden, M., Rowe, J., Azevedo, R., & Lester, J. (2020). Automated analysis of middle
school students’ written reflections during game-based learning. In Artificial Intelligence in Edu-
cation: 21st International Conference, AIED 2020, Ifrane, Morocco, July 6–10, 2020, Pro-
ceedings, Part I 21 (pp. 67–78). Springer International Publishing. https:// doi. org/ 10. 1007/
978-3- 030- 52237-7_6
Carvalho, D. V., Pereira, E. M., & Cardoso, J. S. (2019). Machine learning interpretability: A survey on
methods and metrics. Electronics,8(8), 832. https:// doi. org/ 10. 3390/ elect ronic s8080 832
Chan, B., Schweter, S., & Möller, T. (2020). German’s next language model. arXiv:2010.10906.
Retrieved May 22, 2023, from https:// doi. org/ 10. 48550/ arXiv. 2010. 10906
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority
over-sampling technique. Journal of Artificial Intelligence Research,16, 321–357. https:// doi. org/ 10.
1613/ jair. 953
Chen, T., & Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd
Acm Sigkdd International Conference on Knowledge Discovery and Data Mining (pp. 785–794).
https:// doi. org/ 10. 1145/ 29396 72. 29397 85
Cheng, G. (2017). Towards an automatic classification system for supporting the development of critical
reflective skills in L2 learning. Australasian Journal of Educational Technology, 33(4). https:// doi.
org/ 10. 14742/ ajet. 3029
Chong, C., Sheikh, U. U., Samah, N. A., & Shaameri, A. Z. (2020). Analysis on reflective writing using
natural language processing and sentiment analysis. In IOP Conference Series: Materials Science
and Engineering (Vol. 884, No. 1, p. 012069). IOP Publishing. https:// doi. org/ 10. 1088/ 1757- 899X/
884/1/ 012069
Cui, Y., Wise, A. F., & Allen, K. L. (2019). Developing reflection analytics for health professions educa-
tion: A multi-dimensional framework to align critical concepts with data features. Computers in
Human Behavior,100, 305–324. https:// doi. org/ 10. 1016/j. chb. 2019. 02. 019
Cutumisu, M., & Guo, Q. (2019). Using topic modeling to extract pre-service teachers understandings of
computational thinking from their coding reflections. IEEE Transactions on Education,62(4), 325–
332. https:// doi. org/ 10. 1109/ TE. 2019. 29252 53
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. arXiv:1810.04805. Retrieved May 22, 2023, from https:// doi.
org/ 10. 48550/ arXiv. 1810. 04805
Dewey, J. (1933). How we think: A restatement of the relation of reflective thinking to the educative pro-
cess. D.C. Heath and Company.
Ebeling, W., & Neiman, A. (1995). Long-range correlations between letters and sentences in texts. Phys-
ica A: Statistical Mechanics and its Applications,215(3), 233–241. https:// doi. org/ 10. 1016/ 0378-
4371(95) 00025-3
Fan, X., Luo, W., Menekse, M., Litman, D., & Wang, J. (2017). Scaling reflection prompts in large class-
rooms via mobile interfaces and natural language processing. In Proceedings of the 22nd Interna-
tional Conference on Intelligent User Interfaces (pp. 363–374). https:// doi. org/ 10. 1145/ 30251 71.
30252 04
Flavell, J. H. (1979). Metacognition and cognitive monitoring: A new area of cognitive–developmental
inquiry. American Psychologist,34(10), 906. https:// doi. org/ 10. 1037/ 0003- 066X. 34. 10. 906
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statis-
tics,29, 1189–1232.
Fütterer, T. (2019). Professional Development Portfolios im Vorbereitungsdienst. Die Wirksamkeit Von
Lernumgebungen Auf Die Qualität Der Portfolioarbeit. Wiesbaden: Springer VS. https:// doi. org/ 10.
1007/ 978-3- 658- 24064-6
Gibbs, G. (1988). Learning by doing: A guide to teaching and learning methods. Oxford University Press.
Gibson, A., Kitto, K., & Bruza, P. (2016). Towards the discovery of learner metacognition from reflective
writing. Journal of Learning Analytics,3(2), 22–36. https:// doi. org/ 10. 18608/ jla. 2016. 32.3
Gläser-Zikuda, M. (2015). ePortfolios in Higher Education. In M. Spector (Ed.), Encyclopedia of Educa-
tional Technology (pp. 275–277). SAGE.
Gläser-Zikuda, M., Hagenauer, G., & Stephan, M. (2020). The potential of qualitative content analysis
for empirical educational research. In Forum Qualitative Sozialforschung/Forum: Qualitative Social
Research (Vol. 21, No. 1, p. 20). DEU. https:// doi. org/ 10. 17169/ fqs- 21.1. 3443.
Gupta, N., Mujumdar, S., Patel, H., Masuda, S., Panwar, N., Bandyopadhyay, S., ... & Munigala, V.
(2021). Data quality for machine learning tasks. In Proceedings of the 27th ACM SIGKDD Confer-
ence on Knowledge Discovery & Data Mining (pp. 4040–4041). https:// doi. org/ 10. 1145/ 34475 48.
34708 17
Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research,77(1),
81–112. https:// doi. org/ 10. 3102/ 00346 54302 98487
Hatton, N., & Smith, D. (1995). Reflection in teacher education: Towards definition and implementation.
Teaching and Teacher Education,11(1), 33–49. https:// doi. org/ 10. 1016/ 0742- 051X(94) 00012-U
Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems.
Technometrics, 12(1), 55–67.https:// doi. org/ 10. 1080/ 00401 706. 1970. 10488 634
Hopcan, S., Türkmen, G., & Polat, E. (2023). Exploring the artificial intelligence anxiety and machine
learning attitudes of teacher candidates. Education and Information Technologies, 1–21. https://doi.
org/10.1007/s10639-023-12086-9
Hu, X. (2017). Automated recognition of thinking orders in secondary school student writings. Learning:
Research and Practice,3(1), 30–41. https:// doi. org/ 10. 1080/ 23735 082. 2017. 12842 53
Janiesch, C., Zschech, P., & Heinrich, K. (2021). Machine learning and deep learning. Electronic Mar-
kets,31(3), 685–695. https:// doi. org/ 10. 1007/ s12525- 021- 00475-2
Jung, Y., & Wise, A. F. (2020). How and how well do students reflect? Multi-dimensional automated
reflection assessment in health professions education. In Proceedings of the Tenth International
Conference on Learning Analytics & Knowledge (pp. 595–604). https:// doi. org/ 10. 1145/ 33754 62.
33755 28
Jung, Y., Wise, A. F., & Allen, K. L. (2022). Using theory-informed data science methods to trace the
quality of dental student reflections over time. Advances in Health Sciences Education : Theory and
Practice,27(1), 23–48. https:// doi. org/ 10. 1007/ s10459- 021- 10067-6
Kember, D. (1999). Determining the level of reflective thinking from students’ written journals using a
coding scheme based on the work of Mezirow. International Journal of Lifelong Education, 18(1),
18–30. https:// doi. org/ 10. 1080/ 02601 37992 93928
Kolb, D. A. (1984). Experiential learning: Experience as the source of learning and development. Pren-
tice Hall.
Körkkö, M., Kyrö-Ämmälä, O., & Turunen, T. (2016). Professional development through reflection in
teacher education. Teaching and Teacher Education,55, 198–206. https:// doi. org/ 10. 1016/j. tate.
2016. 01. 014
Korthagen, F., & Vasalos, A. (2005). Levels in reflection: Core reflection as a means to enhance profes-
sional growth. Teachers and Teaching,11(1), 47–71. https:// doi. org/ 10. 1080/ 13540 60042 00033 7093
Kovanović, V., Joksimović, S., Mirriahi, N., Blaine, E., Gašević, D., Siemens, G., & Dawson, S. (2018).
Understand students self-reflections through learning analytics. In Proceedings of the 8th Interna-
tional Conference on Learning Analytics and Knowledge (pp. 389–398). https:// doi. org/ 10. 1145/
31703 58. 31703 74
Kraus, M., Feuerriegel, S., & Oztekin, A. (2020). Deep learning in business analytics and operations
research: Models, applications and managerial implications. European Journal of Operational
Research,281(3), 628–641. https:// doi. org/ 10. 1016/j. ejor. 2019. 09. 018
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... & Kiela, D. (2020). Retrieval-
augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Process-
ing Systems, 33, 9459–9474.
Li, Y., Wehbe, R. M., Ahmad, F. S., Wang, H., & Luo, Y. (2023). A comparative study of pretrained lan-
guage models for long clinical text. Journal of the American Medical Informatics Association,30(2),
340–347. https:// doi. org/ 10. 1093/ jamia/ ocac2 25
Liu, M., Shum, S. B., Mantzourani, E., & Lucas, C. (2019a). Evaluating Machine Learning Approaches
to Classify Pharmacy Students Reflective Statements. In S. Isotani, E. Millán, A. Ogan, P. Hast-
ings, B. McLaren, & R. Luckin (Eds.), Lecture Notes in Computer Science. Artificial Intelligence
in Education (Vol. 11625, pp. 220–230). Springer International Publishing. https:// doi. org/ 10. 1007/
978-3- 030- 23204-7_ 19
Liu, Q., Zhang, S., Wang, Q., & Chen, W. (2017). Mining online discussion data for understanding teach-
ers reflective thinking. IEEE Transactions on Learning Technologies,11(2), 243–254. https:// doi.
org/ 10. 1109/ TLT. 2017. 27081 15
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019b). Roberta: A robustly
optimized bert pretraining approach. arXiv:1907.11692. Retrieved May 22, 2023 from https:// doi.
org/ 10. 48550/ arXiv. 1907. 11692
McHugh, M. L. (2012). Interrater reliability: the kappa statistic. Biochemia medica, 22(3), 276–282.
Retrieved May 22, 2023, from https:// hrcak. srce. hr/ 89395
Mezirow, J. (1991). Transformative dimensions of adult learning. Jossey-Bass.
Minixhofer, B., Paischer, F., & Rekabsaz, N. (2021). WECHSEL: Effective initialization of sub-
word embeddings for cross-lingual transfer of monolingual language models. arXiv:2112.06598.
Retrieved May 22, 2023, from https:// doi. org/ 10. 48550/ arXiv. 2112. 06598
Moon, J. A. (2013). Reflection in learning and professional development: Theory and practice.
Routledge.
Narciss, S. (2006). Informatives tutorielles Feedback: Entwicklungs-und Evaluationsprinzipien auf der
Basis instruktionspsychologischer Erkenntnisse. Waxmann.
Nehyba, J., & Štefánik, M. (2023). Applications of deep language models for reflective writ-
ings. Education and Information Technologies,28(3), 2961–2999. https:// doi. org/ 10. 1007/
s10639- 022- 11254-7
Pennebaker, J. W., Boyd, R. L., Jordan, K., & Blackburn, K. (2015). The development and psycho-
metric properties of LIWC2015. Retrieved June 10,2023, from http:// hdl. handle. net/ 2152/ 31333
Poldner, E., van der Schaaf, M., Simons, P.R.-J., van Tartwijk, J., & Wijngaards, G. (2014). Assessing
student teachers reflective writing through quantitative content analysis. European Journal of
Teacher Education,37(3), 348–373. https:// doi. org/ 10. 1080/ 02619 768. 2014. 892479
Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical
Statistics,22, 400–407.
Rosé, C., Wang, Y. C., Cui, Y., Arguello, J., Stegmann, K., Weinberger, A., & Fischer, F. (2008). Ana-
lyzing collaborative learning processes automatically: Exploiting the advances of computational
linguistics in computer-supported collaborative learning. International Journal of Computer-
Supported Collaborative Learning,3, 237–271. https:// doi. org/ 10. 1007/ s11412- 007- 9034-0
Russell, T., & Korthagen, F. (Eds.). (2013). Teachers who teach teachers: Reflections on teacher edu-
cation. Routledge.
Savicki, V., & Price, M. V. (2015). Student Reflective Writing: Cognition and Affect Before, During,
and After Study Abroad. Journal of College Student Development,56(6), 587–601. https:// doi.
org/ 10. 1353/ csd. 2015. 0063
Schön, D. A. (1983). The reflective practitioner. Jossey-Bass.
Schön, D. A. (1987). Educating the reflective practitioner: Toward a new design for teaching and
learning in the professions. Jossey-Bass.
Solopova, V., Rostom, E., Cremer, F., Gruszczynski, A., Witte, S., Zhang, C., ... & Landgraf, T.
(2023). PapagAI: Automated Feedback for Reflective Essays. In German Conference on Arti-
ficial Intelligence (Künstliche Intelligenz) (pp. 198–206). Cham: Springer Nature Switzerland.
https:// doi. org/ 10. 1007/ 978-3- 031- 42608-7_ 16
Springer, D. G., & Yinger, O. S. (2019). Linguistic Indicators of Reflective Practice Among Music
Education Majors. Journal of Music Teacher Education,28(2), 56–69. https:// doi. org/ 10. 1177/
10570 83718 786739
Stede, M. (Ed.). (2016). Handbuch Textannotation: Potsdamer Kommentarkorpus 2.0 (Vol. 8). Uni-
versitätsverlag Potsdam.
Tan, L., Lu, J., & Jiang, H. (2021). Tomato Leaf Diseases Classification Based on Leaf Images: A
Comparison between Classical Machine Learning and Deep Learning Methods. AgriEngineer-
ing,3(3), 542–558. https:// doi. org/ 10. 3390/ agrie ngine ering 30300 35
Ullmann, T. D. (2019). Automated Analysis of Reflection in Writing: Validating Machine Learning
Approaches. International Journal of Artificial Intelligence in Education,29(2), 217–257. https://
doi. org/ 10. 1007/ s40593- 019- 00174-2
Wulff, P., Buschhüter, D., Westphal, A., Nowak, A., Becker, L., Robalino, H., Stede, M., & Borowski,
A. (2021). Computer-Based Classification of Preservice Physics Teachers Written Reflec-
tions. Journal of Science Education and Technology,30(1), 1–15. https:// doi. org/ 10. 1007/
s10956- 020- 09865
Wulff, P., Mientus, L., Nowak, A., & Borowski, A. (2023). Utilizing a pretrained language model (BERT)
to classify preservice physics teachers’ written reflections. International Journal of Artificial Intel-
ligence in Education,33(3), 439–466. https:// doi. org/ 10. 1007/ s40593- 023- 00330-9
Zaheer, M., Guruganesh, G., Dubey, K. A., Ainslie, J., Alberti, C., Ontanon, S., ... & Ahmed, A. (2020).
Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems,
33, 17283–17297.
Zanette, D. H. (2014). Statistical patterns in written language. arXiv:1412.3336. Retrieved May 2, 2023,
from https:// doi. org/ 10. 48550/ arXiv. 1412. 3336
Zawacki-Richter, O., Marín, V. I., Bond, M., & Gouverneur, F. (2019). Systematic review of research
on artificial intelligence applications in higher education–where are the educators? International
Journal of Educational Technology in Higher Education,16(1), 1–27. https:// doi. org/ 10. 1186/
s41239- 019- 0171-0
Zhai, X., Chu, X., Chai, C. S., Jong, M. S. Y., Istenic, A., Spector, M., ... & Li, Y. (2021). A Review of
Artificial Intelligence (AI) in Education from 2010 to 2020. Complexity, 2021, 1–18. https:// doi. org/
10. 1155/ 2021/ 88125 42
Zhang, C., Schießl, J., Plößl, L., Hofmann, F., & Gläser-Zikuda, M. (2023a). Evaluating Reflective
Writing in Pre-Service Teachers: The Potential of a Mixed-Methods Approach. Education Sci-
ences,13(12), 1213. https:// doi. org/ 10. 3390/ educs ci131 21213
Zhang, C., Schießl, J., Plößl, L., Hofmann, F., & Gläser-Zikuda, M. (2023b). Acceptance of artificial
intelligence among pre-service teachers: A multigroup analysis. International Journal of Educa-
tional Technology in Higher Education,20(1), 49. https:// doi. org/ 10. 1186/ s41239- 023- 00420-7
Zimmerman, B. J. (2002). Becoming a self-regulated learner: An overview. Theory into Practice,41(2),
64–70. https:// doi. org/ 10. 1207/ s1543 0421t ip4102_2
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.
Authors and Affiliations
Chengming Zhang1 · Florian Hofmann1 · Lea Plößl1 · Michaela Gläser-Zikuda1
* Chengming Zhang
chengming.zhang@fau.de
Florian Hofmann
florian.hofmann@fau.de
Lea Plößl
lea.ploessl@fau.de
Michaela Gläser-Zikuda
michaela.glaeser-zikuda@fau.de
1 Department ofEducation, University ofErlangen–Nürnberg, Regensburger Street 160,
90478Nuremberg, Germany