Citation: Orji, E.Z.; Haydar, A.; Erşan, İ.; Mwambe, O.O. Advancing OCR Accuracy in Image-to-LaTeX Conversion—A Critical and Creative Exploration. Appl. Sci. 2023, 13, 12503. https://doi.org/10.3390/app132212503
Academic Editor: Douglas O'Shaughnessy
Received: 13 September 2023
Revised: 1 November 2023
Accepted: 7 November 2023
Published: 20 November 2023
Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Review

Advancing OCR Accuracy in Image-to-LaTeX Conversion—A Critical and Creative Exploration

Everistus Zeluwa Orji 1,*, Ali Haydar 1, İbrahim Erşan 1 and Othmar Othmar Mwambe 2

1 Department of Computer Engineering, Girne American University, Mersin-10, Karaman 99320, Turkey; ahaydar@gau.edu.tr (A.H.); ibrahimersan@gau.edu.tr (İ.E.)
2 Computer Studies Department, Dar es Salaam Institute of Technology (DIT), Dar es Salaam P.O. Box 2958, Tanzania; othmar.mwambe@dit.ac.tz
* Correspondence: orjizeluwa@gmail.com
Abstract:
This paper comprehensively assesses the application of active learning strategies to enhance
natural language processing-based optical character recognition (OCR) models for image-to-LaTeX
conversion. It addresses the existing limitations of OCR models and proposes innovative practices to
strengthen their accuracy. Key components of this study include the augmentation of training data
with LaTeX syntax constraints, the integration of active learning strategies, and the employment of
active learning feedback loops. This paper first examines the current weaknesses of OCR models with
a particular focus on symbol recognition, complex equation handling, and noise moderation. These
limitations serve as a framework against which the subsequent research methodologies are assessed.
Augmenting the training data with LaTeX syntax constraints is a crucial strategy for improving model
precision. Incorporating symbol relationships, wherein contextual information is considered during
recognition, further enriches the error correction. This paper critically examines the application of
active learning strategies. The active learning feedback loop leads to progressive improvements in
accuracy. This article underlines the importance of uncertainty and diversity sampling in sample
selection, ensuring that the dynamic learning process remains efficient and effective. Appropriate
evaluation metrics and ensemble techniques are used to improve the operational learning effectiveness
of the OCR model. These techniques allow the model to adapt and perform more effectively in diverse
application domains, further extending its utility.
Keywords:
optical character recognition (OCR); LaTeX; active learning strategies; image-to-LaTeX
conversion; natural language processing (NLP)
1. Introduction
The digital age has transformed how we interact with written content, with optical character recognition (OCR) technology serving as a linchpin in this transformation [1]. OCR enables the conversion of printed or handwritten text into machine-readable formats, thus ushering in an era of enhanced accessibility and utility for textual data. However, the accurate recognition and conversion of mathematical expressions into LaTeX format remains a challenge that looms large, and it is within this complex and critical arena that we find the motivation and contributions of this study. The rationale for undertaking this research is underpinned by a profound recognition of the crucial role played by mathematical expression recognition within the broader OCR landscape. Mathematical notation, characterized by its intricate symbols and complex structures, has long been a vexing challenge for OCR systems [2]. The precise recognition and correct conversion of mathematical expressions require understanding the symbols themselves and a deep grasp of the semantics, syntax, and intricate relationships interweaving these symbols [1]. However, prevailing OCR methods, while formidable, often fall short of capturing these subtleties, resulting in conversions that do not meet the stringent accuracy requirements [3]. The crux of the challenge lies
in the visual complexity of mathematical symbols, where symbols that bear a striking resemblance can possess distinct semantic meanings [4]. The perennial noise or distortion in input images adds a layer of complexity, directly impeding the OCR system's ability to recognize and interpret mathematical expressions accurately.
Recognizing these multifaceted challenges propels our quest for innovative solutions at the intersection of OCR and natural language processing (NLP) to enhance the accuracy of mathematical expression recognition and conversion. Hence, this extensive review study is aimed at exploring various existing NLP techniques that attempt to enhance OCR accuracy in image-to-LaTeX conversions. This study also analyzes the limitations of existing approaches and recommends future directions. In turn, this study exposes existing research gaps and paves the way for innovative NLP integration techniques in OCR. In order to meet these research goals, this review study has gone through several stages (see Figure 1): a thorough literature review that sets out the problem statement and addresses the research question of how NLP techniques can be integrated into OCR to improve its accuracy when converting images to LaTeX, an analytical screening of the techniques introduced by various research articles, and the recommendation of future directions.
Figure 1. Research method.
The remainder of this paper is structured as follows: the background information of
this study and the related works will be stated in Section 2; the deep learning strategies
for OCR in the image-to-LaTeX conversion will be stated in Section 3; the preprocessing
techniques for image enhancement will be presented in Section 4; the limitations of current
OCR models will be presented in Section 5; augmenting OCR training data with LaTeX syntax constraints will be presented in Section 6; binarization and thresholding techniques
will be presented in Section 7; leveraging symbol relationships for OCR error correction
will be described in Section 8; post-processing techniques for error correction in OCR for
image-to-LaTeX conversion will be presented in Section 9; post-processing strategies to
leverage the redundancy inherent in mathematical notation will be presented in Section 10;
active learning strategies for incorporating the OCR model will be described in Section 11;
evaluation metrics for OCR accuracy in image-to-LaTeX conversion will be presented in
Section 12; and finally, Section 13 will conclude this study and provide recommendations
for future directions.
2. Background Information
Previous OCR approaches have struggled to handle the complexities and intricacies of mathematical notation, causing suboptimal conversions. Scholars have turned to natural language processing (NLP) techniques to respond to these challenges and enhance OCR accuracy in image-to-LaTeX conversion. Integrating NLP strategies into OCR models can potentially improve mathematical expression recognition and conversion by leveraging semantic information and including linguistic context. Applying deep learning architectures like recurrent neural networks, convolutional neural networks, and transformer models can enable OCR systems to capture subtle patterns and relationships within mathematical expressions [5].
Advancements in pretraining techniques, like bidirectional encoder representations from transformers, can also capture the contextual embeddings that help accurately recognize mathematical symbols. This paper's main objective is to critically examine the challenges associated with OCR accuracy in image-to-LaTeX conversion and propose creative solutions to enhance the performance of OCR models. This report explores innovative techniques that leverage NLP methods to address the limitations of current OCR systems. The main limitation of current OCR models in accurately recognizing and converting mathematical expressions is handling complex equations characterized by several symbols and intricate structural arrangements [2]. Identifying and interpreting such equations is critical for accurate conversion to LaTeX format. Various mathematical symbols also tend to be challenging due to their visual similarity, making it difficult for OCR models to differentiate between similar-looking symbols accurately [4]. Another challenge arises from noise or distortion in the input images that negatively affects OCR performance (see Table 1). OCR models can better interpret symbols based on their surrounding context and infer their intended meaning by encoding contextual information [6]. For instance, understanding whether a symbol represents an operator, a variable, or a function is essential for accurate conversion to LaTeX.
Table 1. Summary of challenges and techniques in OCR approaches to image-to-LaTeX conversion.

OCR Approach | Performance | Challenges and Techniques
NLP Integration [7] | Potential | Complex equations and intricate structural arrangements; visual similarity of mathematical symbols; noise or distortion in input images; leveraging semantic information and linguistic context through NLP techniques.
Data Augmentation [8] | Improved | Ensuring adherence to LaTeX syntax rules; reducing generation of syntactically incorrect LaTeX code; incorporating LaTeX syntax constraints during data augmentation.
Dependency Capture [9] | Enhanced | Capturing dependencies and relationships between symbols (subscripts, superscripts, fraction components); improving accuracy by correcting recognition errors and ensuring the integrity of the converted LaTeX representation.
Active Learning [10] | Efficient | Mitigating dependency on large labeled datasets; intelligently selecting informative and challenging examples; involving human annotators in training through active learning strategies (query-by-committee, uncertainty sampling, adaptive sampling).
Incorporating LaTeX syntax constraints during data augmentation guides OCR models to learn more accurate and compliant conversions, ensuring that the output adheres to LaTeX syntax rules. Studies can reduce the likelihood of generating syntactically incorrect LaTeX code during conversion by enforcing these constraints [5]. The constraint-based augmentation strategy improves accuracy and reliability in OCR outputs [11]. Additionally, mathematical expressions often exhibit dependencies and relationships between symbols, such as subscripts, superscripts, or fraction components. We can enhance the accuracy of OCR outputs by correcting recognition errors and ensuring the integrity of the converted
LaTeX representation by capturing these dependencies and incorporating them into the OCR model. Traditional OCR approaches require large labeled datasets for training, which are expensive and time-consuming to create [12]. Active learning addresses this challenge by intelligently selecting samples for manual annotation to mitigate the dependency on large labeled datasets. The OCR model can focus on learning from the most informative and challenging examples, leading to more efficient and effective model improvement by actively involving human annotators in training [4]. Active learning strategies like query-by-committee, uncertainty sampling, and adaptive sampling can be used to select samples that maximize the model's learning potential.
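To make the sampling step concrete, the sketch below shows a least-confidence selection rule, one simple instance of uncertainty sampling; the softmax scores, the budget, and the toy numbers are illustrative assumptions rather than part of any cited system.

```python
import numpy as np

def least_confidence_selection(probabilities, budget):
    """Pick the unlabeled samples whose top softmax score is lowest (most uncertain)."""
    top_prob = probabilities.max(axis=1)     # model confidence per sample
    ranked = np.argsort(top_prob)            # least confident first
    return ranked[:budget]

# Toy pool of 5 unlabeled symbol crops scored over 3 candidate classes
probs = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.35, 0.25],
                  [0.60, 0.30, 0.10],
                  [0.34, 0.33, 0.33],
                  [0.80, 0.10, 0.10]])
print(least_confidence_selection(probs, budget=2))   # [3 1] -> send these for annotation
```

A query-by-committee variant would replace the single probability array with the disagreement among several models, but the selection loop stays the same.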
3. Deep Learning Strategies for OCR in Image-to-LaTeX Conversion
Deep learning strategies have transformed the field of optical character recognition (OCR) and enhanced the accuracy of image-to-LaTeX conversion. Leveraging neural network architectures like recurrent neural networks and convolutional neural networks enables researchers to tackle the complex challenges of character recognition and equation parsing [13,14]. The learning models rely on large-scale annotated datasets to learn and generalize. However, developing high-quality datasets that include various styles, fonts, and mathematical symbols poses significant challenges. It is critical to address data collection challenges to ensure representative and unbiased training sets. Studies have tried creating benchmark datasets specific to image-to-LaTeX conversion, like the CROHME dataset containing handwritten mathematical expressions [15–18]. The datasets enhance the evaluation and training of OCR models and serve as a foundation for advancing the field [1]. CNNs and RNNs may struggle with out-of-domain and rare symbols encountered in image-to-LaTeX conversion, leading to low accuracy and errors in the converted LaTeX result.
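As a concrete illustration of the encoder-decoder pattern these architectures follow, the minimal PyTorch sketch below pairs a small CNN encoder with a GRU decoder that predicts LaTeX tokens; the layer sizes, vocabulary size, and pooling choice are assumptions made for brevity and do not reproduce any specific published model.

```python
import torch
import torch.nn as nn

class Im2LatexSketch(nn.Module):
    """Toy CNN encoder + GRU decoder for image-to-LaTeX token prediction."""
    def __init__(self, vocab_size=500, hidden=256):
        super().__init__()
        # CNN encoder: grayscale formula image -> visual feature maps
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.project = nn.Linear(128, hidden)      # map pooled CNN features to decoder state
        self.out = nn.Linear(hidden, vocab_size)   # predict the next LaTeX token

    def forward(self, images, token_ids):
        feats = self.encoder(images)                # (B, 128, H/4, W/4)
        context = feats.mean(dim=(2, 3))            # global average pooling
        h0 = self.project(context).unsqueeze(0)     # initial decoder hidden state
        dec_out, _ = self.decoder(self.embed(token_ids), h0)
        return self.out(dec_out)                    # (B, T, vocab_size) logits

# Toy forward pass: one 64x256 formula image and a 10-token LaTeX prefix
model = Im2LatexSketch()
logits = model(torch.randn(1, 1, 64, 256), torch.randint(0, 500, (1, 10)))
print(logits.shape)   # torch.Size([1, 10, 500])
```

Published systems typically add attention over the feature map instead of global pooling, which is one reason they handle two-dimensional layouts such as fractions better than this sketch would.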
Addressing this challenge requires domain-specific knowledge and designing models that can handle the intricacies of mathematical notation. For instance, studies have examined integrating mathematical grammar rules into OCR models to enable the recognition process and enhance accuracy [19]. These models can achieve more reliable conversions by incorporating mathematical semantics and structure. Understanding how the models make predictions is vital for identifying and rectifying OCR issues. However, the inherent complexity of deep learning architectures deters their interpretability [12]. Studies have developed methods for visualizing the attention and feature activations in CNNs and RNNs. However, further advancement is essential to ensure transparent and reliable OCR systems. Explainable techniques like saliency analysis and attention maps can provide insight into the decision-making process of OCR models and help identify potential sources of errors [20]. The computational requirements of deep learning models also pose a challenge, since training and deploying complex neural networks demands substantial computational resources. This challenge hinders the scalability and accessibility of OCR systems in resource-constrained environments.
Assessing techniques for model compression and hardware acceleration may mitigate these difficulties and make OCR solutions more practical. Studies have recommended lightweight OCR models that attain comparable accuracy to larger ones while requiring fewer computational resources, enhancing their deployment on low-power devices [12]. OCR errors at the character recognition stage may also propagate during the conversion process, leading to substantial inaccuracies in the final LaTeX output. Creating error correction techniques and post-processing mechanisms is vital for mitigating the impact of OCR errors and ensuring high-quality conversions. Recent studies have examined applying language models and contextual information to enhance error correction in OCR outputs [21]. Leveraging contextual clues and syntactic analysis enables these strategies to identify and rectify OCR errors, thus fostering the accuracy of the converted LaTeX representations.
4. Preprocessing Techniques for Image Enhancement
Preprocessing techniques for image enhancement in OCR for image-to-LaTeX conversion are crucial for improving the accuracy of the recognition process [22,23]. These techniques address challenges related to image quality, noise, contrast, skew, and multimodal features. Recent advancements in this field have shown promising results, but critical aspects still need to be considered. One aspect is noise reduction and image denoising techniques. The noise in input images can significantly impact OCR accuracy [4,24]. Researchers have proposed various denoising algorithms, such as median filtering, Gaussian filtering, and wavelet-based methods, to reduce noise and artifacts. Additionally, recent studies have introduced advanced denoising algorithms based on deep learning approaches, leveraging convolutional autoencoders and generative adversarial networks (GANs) [25]. These methods have demonstrated improved OCR accuracy by effectively suppressing noise patterns and preserving the legibility of characters and symbols.
Another crucial preprocessing step is contrast enhancement. Enhancing image contrast can significantly improve the readability of characters and symbols, especially in low-quality or poorly illuminated images (see Figure 2). Histogram equalization techniques, such as adaptive histogram equalization (AHE) and contrast-limited adaptive histogram equalization (CLAHE), have been widely used. Recent research has explored integrating deep learning models, such as U-Net and Pix2Pix networks, for adaptive contrast enhancement [25]. These approaches have demonstrated their effectiveness in handling varying illumination conditions and improving OCR performance. Binarization and thresholding techniques are also critical in OCR preprocessing. Binarization converts grayscale or color images into binary representations, separating foreground characters from the background [25]. Various thresholding techniques, including global thresholding, local adaptive thresholding, and hybrid methods, have been proposed to address different image characteristics and challenges. Recent advancements have introduced deep learning-based binarization methods that utilize convolutional neural networks to learn optimal thresholding strategies. These approaches have shown promising results in handling complex backgrounds and improving the segmentation of characters, leading to improved OCR accuracy.
Figure 2. Visual comparison of image preprocessing techniques (noise reduction (a), contrast enhancement (b), binarization (c), and skew correction (d)) for OCR.
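The classical preprocessing chain illustrated in Figure 2 can be approximated with standard OpenCV operations; the sketch below applies median filtering, CLAHE, and local adaptive thresholding in sequence. The parameter values and the file name are illustrative assumptions, and a real pipeline would tune them per document type.

```python
import cv2

def preprocess_for_ocr(path):
    """Denoise, enhance contrast, and binarize a formula image (cf. Figure 2)."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # (a) Noise reduction: median filter suppresses salt-and-pepper noise
    denoised = cv2.medianBlur(gray, 3)

    # (b) Contrast enhancement: CLAHE handles uneven illumination
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(denoised)

    # (c) Binarization: local adaptive thresholding separates strokes from background
    binary = cv2.adaptiveThreshold(enhanced, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 31, 15)
    return binary

# binary = preprocess_for_ocr("formula_scan.png")  # placeholder filename
```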
Skew detection and correction techniques are essential for aligning images and ensuring accurate character recognition. Skewed or rotated images can negatively impact OCR accuracy. Recent research has explored the use of deep learning models, such as convolutional neural networks and recurrent neural networks, for automatic skew detection and correction [25]. These models leverage learned features and geometric transformations to estimate and rectify image skew. These techniques improve OCR accuracy by effectively aligning the images, mainly when dealing with skewed documents. Multimodal fusion and feature enhancement techniques have also gained attention in OCR preprocessing [1]. OCR models can benefit from complementary features during preprocessing by fusing multiple modalities, such as color, texture, and shape information. Recent studies have investigated the fusion of these modalities to enhance the discriminative power of OCR models. Furthermore, attention mechanisms and contextual information have been explored to guide feature enhancement [25–27]. These approaches enable OCR systems to focus on informative regions while suppressing noise or irrelevant details, ultimately improving recognition accuracy.
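As a baseline against which the learned skew estimators cited above can be compared, a classical projection-profile search is often sufficient; the sketch below is an assumption-laden stand-in, not the deep learning methods discussed in the text: it rotates a binarized image over a small angle range and keeps the angle that maximizes row-profile variance.

```python
import numpy as np
from scipy import ndimage

def estimate_skew(binary, angles=np.arange(-10, 10.5, 0.5)):
    """Pick the rotation angle that gives the sharpest horizontal text-line profile."""
    best_angle, best_score = 0.0, -1.0
    for angle in angles:
        rotated = ndimage.rotate(binary, angle, reshape=False, order=0)
        profile = rotated.sum(axis=1)          # ink per row
        score = profile.var()                  # well-aligned lines -> peaky profile
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

def deskew(binary):
    """Rotate the image by the estimated skew angle."""
    return ndimage.rotate(binary, estimate_skew(binary), reshape=False, order=0)
```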
While recent advancements in preprocessing techniques for image enhancement have shown promising results, there are still challenges to be addressed. Finding the right balance between noise suppression and the preservation of fine details is crucial. Adapting to varying image quality and the efficient handling of input image types, such as handwritten or scanned documents, also require further research. Moreover, the computational complexity of some deep learning-based techniques may limit their practicality, especially in resource-constrained environments. Future research should focus on developing robust preprocessing methods that are adaptive, efficient, and capable of handling diverse real-world scenarios to enhance OCR accuracy in image-to-LaTeX conversion [21].
5. Analyzing the Limitations of Current OCR Models
Optical character recognition (OCR) has improved the digitization of documents by enhancing the conversion of printed and handwritten text into machine-readable formats. However, the accurate recognition and conversion of mathematical expressions from images to the LaTeX format remain challenging. The main limitation of current OCR models is handling complex equations accurately. Mathematical expressions involve variables, many symbols, operators, and nested structures, making them inherently difficult to interpret and convert accurately. OCR models should comprehend the hierarchical relationships between symbols and the intended mathematical operations to produce faithful LaTeX representations [4]. Complicated equations challenge OCR models to capture the exact arrangement of symbols and matrices and to recognize fractions, subscripts, superscripts, and parentheses. Mathematical expressions include advanced concepts like Greek symbols, integrals, and summations that further complicate recognition. Handling complex equations enables OCR models to provide reliable and accurate conversions to the LaTeX format [2].
OCR models also face challenges in accurately recognizing various mathematical symbols. Mathematical notation includes multiple characters, including numerals, operators, alphabets, Greek letters, and mathematical functions [28]. Most of these symbols show visual similarities, making it difficult for OCR models to differentiate between similar-looking characters accurately [1]. Differentiating between the symbol "0" and the letter "o", or distinguishing between the variable "x" and the multiplication operator "×", is a challenge for OCR models. Moreover, accurately recognizing and differentiating between similar-looking symbols, like "cos" and "sin" or "α" and "a", requires OCR models to possess a robust symbol recognition capability that can handle variations in font sizes, styles, and orientations [5]. To overcome this limitation, OCR models may benefit from incorporating contextual information and leveraging the semantic relationships between symbols. OCR models can also make more informed decisions regarding symbol recognition and improve the accuracy of the conversion process by considering the surrounding context and the syntax of mathematical expressions [2].
Distortion in the input images also presents a significant challenge for OCR models. Images captured in real-world scenarios are characterized by various types of noise, such as pixelation, blurring, poor lighting, and artifacts introduced during the scanning process. Documents may have handwritten annotations, erasures, or smudges that further complicate the recognition process. These distortions affect the clarity and legibility of mathematical expressions and cause errors in recognition and subsequent LaTeX conversion [25]. OCR models must be robust and resilient to such distortions to ensure accurate recognition and conversion. OCR models can benefit from preprocessing techniques for image enhancement and noise reduction. These techniques involve denoising, filtering, and contrast adjustment to improve the legibility of the input images before OCR processing [29]. Integrating noise-robust recognition algorithms and data augmentation techniques that simulate various types of noise can also improve the OCR model's ability to effectively handle noisy or distorted images. Mathematical expressions contain intricate correlations between symbols, such as subscripts, superscripts, and fraction components. OCR models can correct recognition errors and maintain the structural integrity of the converted LaTeX representation by modeling these relationships [30].
OCR models can be improved using advanced techniques that capture the hierarchical structure and semantic relationships within mathematical expressions to overcome their limitations when handling complex equations. OCR systems can understand the intricate arrangements of symbols and capture the nuances of mathematical notation by incorporating deep learning models such as convolutional neural networks and recurrent neural networks. These models can learn to recognize and interpret matrices, fractions, subscripts, and nested parentheses more accurately, improving conversion quality. Identifying various mathematical symbols requires OCR models to have comprehensive symbol recognition capabilities [4]. Older OCR models struggle with differentiating similar-looking characters, leading to conversion errors. Leveraging semantic information and contextual embedding will enable the models to understand symbols better based on their surrounding context and syntactic patterns. Contextual information can help differentiate between visually similar characters and improve symbol recognition accuracy. Addressing noise and distortion challenges requires noise-robust recognition algorithms and robust preprocessing techniques. OCR models can be trained using augmented datasets that simulate various types of noise to make them more resilient to real-world image imperfections. Combining strong preprocessing and noise-robust recognition algorithms can allow the models to handle challenging image conditions better and produce accurate conversions.
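A minimal noise-augmentation routine of the kind suggested above might look as follows; the Gaussian spread and speckle rate are arbitrary illustrative values, not settings reported in the cited studies.

```python
import numpy as np

def augment_with_noise(image, seed=0):
    """Simulate scanner noise on a grayscale image with values in 0..255."""
    rng = np.random.default_rng(seed)
    noisy = image.astype(np.float32)
    noisy += rng.normal(0.0, 10.0, size=image.shape)   # Gaussian sensor noise
    mask = rng.random(image.shape)                      # salt-and-pepper speckles (~1%)
    noisy[mask < 0.005] = 0
    noisy[mask > 0.995] = 255
    return np.clip(noisy, 0, 255).astype(np.uint8)

# augmented = [augment_with_noise(img, seed=s) for s in range(5)]  # several noisy variants
```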
Incorporating domain-specific knowledge into OCR models can improve the models' performance. Mathematical expressions adhere to mathematical convention and syntax rules. OCR models can learn to generate LaTeX outputs conforming to syntactic rules by integrating LaTeX syntax constraints during training. The constraint-based data augmentation strategy enables the model to provide valid and compliant LaTeX code and mitigates the likelihood of producing syntactically incorrect conversions [1]. Active learning strategies can be employed to iteratively improve the model's performance and enhance OCR accuracy further. Traditional OCR approaches often rely on large labeled datasets for training, which is time-consuming and expensive. Active learning allows the model to actively select informative samples for manual annotation, reducing its dependency on extensive labeled datasets [29]. Active learning helps improve the model's performance with fewer labeled examples, resulting in more efficient and effective training by selecting challenging or uncertain samples for the model.
Evaluation metrics such as recall, precision, and F1 score can be applied to assess the performance of OCR models in symbol recognition. These metrics quantify the OCR model's completeness, accuracy, and overall performance in recognizing mathematical symbols [25]. Iterative improvements can be made to the OCR model based on the results. The process may involve fine-tuning the deep learning architectures, refining contextual information utilization, and adjusting training data augmentation strategies. By continuously evaluating and refining the OCR model, recognition accuracy can be enhanced,
leading to more reliable and precise image-to-LaTeX conversions. The visual similarities, contextual complexities, and font variations are obstacles to symbol recognition. However, utilizing semantic relationships, incorporating contextual information, augmenting training data, and incorporating domain-specific knowledge can enable OCR models to overcome these challenges. Improving symbol recognition accuracy in OCR models enables a more accurate and reliable conversion of mathematical expressions from images to LaTeX format. This advancement has significant implications for various fields, including academic research, scientific publishing, and document digitization. Enabling the efficient representation of mathematical content can help OCR models contribute to disseminating scientific knowledge and enhance the accessibility of mathematical information [31]. Continuously advancing and refining recognition capabilities can allow the models to better serve the needs of researchers, professionals, and educators who rely on accurate and efficient image-to-LaTeX conversion.
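For token-level symbol recognition, these metrics can be computed directly from the predicted and reference LaTeX token sequences; the micro-averaged sketch below treats each sequence as a multiset of tokens and uses a made-up example pair.

```python
from collections import Counter

def symbol_prf(predicted, reference):
    """Micro-averaged precision, recall, and F1 over recognized LaTeX tokens."""
    pred, ref = Counter(predicted), Counter(reference)
    true_pos = sum((pred & ref).values())            # tokens present in both (multiset min)
    precision = true_pos / max(sum(pred.values()), 1)
    recall = true_pos / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

# One token substituted out of seven -> P = R = F1 of roughly 0.857
print(symbol_prf([r"\frac", "{", "x", "}", "{", "2", "}"],
                 [r"\frac", "{", "x", "}", "{", "y", "}"]))
```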
The semantic relationships between symbols provide additional indications for accurate recognition. Mathematical expressions show relationships such as fractions or superscript components [30]. OCR models can correct recognition errors and maintain the structural integrity of the converted LaTeX representation by modeling these relationships. For instance, recognizing a fraction requires identifying the denominator and numerator symbols and their relative positions. OCR models can enhance symbol recognition accuracy by leveraging these semantic relationships [25]. Transformer models have shown promising results in symbol recognition tasks. Convolutional neural networks are effective in capturing local visual features. OCR models can leverage their strengths to improve symbol recognition accuracy by combining these architectures. OCR models can benefit from augmented training datasets encompassing variations in symbol appearance, font styles, sizes, and orientations [1]. Data augmentation techniques like noise injections and random rotations can help the model learn to handle variations commonly encountered in real-world scenarios (see Table 2). Augmenting the training data with various symbol instances enables the OCR model to generalize and recognize symbols accurately. Domain-specific knowledge can enhance symbol recognition in OCR models [29]. OCR models can be trained to identify the appropriate usage of symbols in mathematical expressions, such as differentiating between using π as a constant or as a variable.
Table 2. Summary of challenges and techniques in optical character recognition (OCR) for the conversion of mathematical expressions.

Challenges | Techniques
Handling complex equations [25]. | Incorporating deep learning models (e.g., convolutional neural networks, recurrent neural networks) to capture hierarchical structures and semantic relationships within mathematical expressions.
Recognizing various mathematical symbols [32]. | Leveraging semantic information and contextual embedding; integrating strong symbol recognition capabilities.
Dealing with noise and distortion in input images [33]. | Employing noise-robust recognition algorithms and robust preprocessing techniques; training OCR models using augmented datasets.
Incorporating domain-specific knowledge [34]. | Integrating LaTeX syntax constraints during training; utilizing active learning strategies; refining contextual information utilization.
Evaluating OCR model performance [35,36]. | Applying evaluation metrics (recall, precision, F1 score); refining OCR models through fine-tuning; adjusting training data augmentation strategies.
Enhancing symbol recognition accuracy [37]. | Utilizing semantic relationships; combining transformer models and convolutional neural networks; augmenting training datasets with variations in symbol appearance.
Handling variations in symbol usage and appearance [38]. | Augmenting training data with variations in symbol instances; incorporating domain-specific knowledge to identify appropriate symbol usage.
6. Augmenting OCR Training Data with LaTeX Syntax Constraints
Augmenting OCR training data with LaTeX syntax constraints is a critical approach that improves OCR models' performance in image-to-LaTeX conversion. OCR models can generate LaTeX representations that adhere to the correct semantics, structure, and syntax of mathematical expressions by including the concept of LaTeX syntax rules and applying them during the training process. The main advantage of augmenting OCR training data with LaTeX syntax constraints is the improvement in conversion accuracy. LaTeX syntax provides a well-defined and standardized framework for representing mathematical notation. The models can learn to recognize and correct common errors made during the OCR and conversion processes by training OCR models with augmented data that includes the correct LaTeX syntax [31]. For example, OCR models can be trained to locate missing LaTeX delimiters like brackets, parentheses, or curly braces and rectify them accordingly. This ability to correct errors leads to the enhanced accuracy and validity of the converted LaTeX code. Augmenting OCR training data with LaTeX syntax constraints also enables the preservation of structural integrity. LaTeX syntax rules define the hierarchical correlation between mathematical symbols, subexpressions, and operators. Incorporating this knowledge during training can help OCR models understand mathematical expressions' structure and organization [31]. The information allows the models to generate LaTeX code that accurately represents the structural integrity of the original mathematical content. For example, OCR models can learn to correctly handle superscripts, fractions, subscripts, and nested parentheses.
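A lightweight way to enforce one such constraint outside the model is a delimiter-balance check over the generated string; the sketch below appends missing closers and flags unmatched ones. It is only a post hoc illustration of the constraint idea, not the training-time augmentation the section describes.

```python
PAIRS = {"{": "}", "[": "]", "(": ")"}

def balance_delimiters(latex):
    """Append missing closing delimiters and report closers that have no opener."""
    stack, unmatched = [], []
    for position, ch in enumerate(latex):
        if ch in PAIRS:
            stack.append(ch)
        elif ch in PAIRS.values():
            if stack and PAIRS[stack[-1]] == ch:
                stack.pop()
            else:
                unmatched.append((position, ch))   # closer with no matching opener
    repaired = latex + "".join(PAIRS[ch] for ch in reversed(stack))
    return repaired, unmatched

print(balance_delimiters(r"\frac{a}{b"))   # ('\\frac{a}{b}', [])
```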
Augmenting OCR training data with LaTeX syntax constraints also enables a semantic understanding of mathematical expressions. LaTeX syntax provides semantic concepts that give information about the meaning and interpretation of the mathematical content. For instance, applying specific LaTeX commands or environments indicates the type of mathematical concept represented, like equations, matrices, or mathematical functions [30]. Training OCR models with augmented data containing these semantic concepts enables the model to understand mathematical content better [39]. This understanding allows them to generate more meaningful LaTeX code that reflects the intended semantics of the original mathematical expressions. Integrating LaTeX syntax constraints into OCR training data improves consistency in the LaTeX representation of mathematical expressions. Following the LaTeX syntax rules ensures that the generated LaTeX code is consistent and coherent when dealing with mathematical notations with multiple hierarchy levels [29]. Augmenting OCR training data with LaTeX syntax constraints enables OCR models to produce LaTeX code that adheres to mathematical expressions' expected structure and semantics, leading to a more coherent and readable output.
OCR models trained with augmented data incorporating LaTeX syntax constraints also show improved compatibility with existing LaTeX tools, workflows, and libraries. OCR models produce outputs seamlessly integrated with the LaTeX environment by generating LaTeX code conforming to the syntax constraints. This compatibility enhances further manipulation, processing, and rendering of the converted LaTeX code, enabling users to leverage existing LaTeX tools for tasks such as typesetting, rendering to PDF or other formats, and preparing educational materials [40]. Augmenting OCR training data with LaTeX syntax constraints also enables a reduction in post-processing work. OCR models can minimize the need for extensive manual correction and post-processing of the converted content by generating LaTeX code that adheres to syntax rules [41]. This saves time and effort for users who rely on accurate and reliable image-to-LaTeX conversion. The reduction in post-processing requirements allows users to focus on other essential activities such as validation, content analysis, or further correction of the converted LaTeX code. LaTeX is broadly applied in academic and scientific domains for typesetting mathematical content. OCR models produce outputs that seamlessly integrate into existing LaTeX workflows, tools, and publishing pipelines by generating LaTeX code that conforms to the constraints of syntax [30]. The compatibility improves the usability and practicality of the converted LaTeX code and enables users to leverage the full potential of LaTeX for further processing.
In summary, augmenting OCR training data with LaTeX syntax constraints provides advantages for enhancing the performance and reliability of OCR models in image-to-LaTeX conversion. It enables consistency, accuracy, and semantic understanding while mitigating post-processing work and ensuring compatibility with existing LaTeX workflows and tools [42]. OCR models can generate LaTeX representations that capture mathematical expressions' semantics, structure, and syntax by incorporating the knowledge of LaTeX syntax rules.
7. Binarization and Thresholding Techniques
The performance of learning models relies on large-scale annotated datasets. However, developing high-quality datasets with diverse styles and mathematical symbols poses a severe challenge due to the scarcity of datasets designed for image-to-LaTeX conversion. Scholars have tried to gather datasets representing the difficulties encountered in the conversion process [43]. The Competition on Recognition of Online Handwritten Mathematical Expressions (CROHME) dataset incorporates handwritten mathematical expressions that enable the evaluation and comparison of OCR models in the context of image-to-LaTeX conversion [44]. The MathML and LaTeX dataset (MALL) is also concerned with mathematical expressions expressed in MathML and LaTeX formats, enhancing the training and assessment of OCR systems for this specific task [45,46]. These benchmark datasets create a foundation for strengthening the field by enabling standardized assessment and fostering the development of more accurate OCR models [19]. The quality and representativeness of the training data are vital, since any form of bias present in the training data may affect the accuracy of OCR systems.
Training data mainly composed of a specific font or style makes it difficult for the OCR model to accurately recognize symbols and characters from other styles and fonts. Addressing such challenges requires careful data collection and annotation plans to ensure a balanced and diverse representation of styles, fonts, and mathematical symbols. Collaborative strategies among academic institutions and scholars can play a critical role in addressing this challenge by ensuring diversity, sharing datasets, and fostering a more comprehensive understanding of the intricacies of mathematical notation [20]. In the context of training data, the scarcity of data for rare symbols is also a critical area to be addressed. Deep learning models perform better on frequently occurring characters and styles in the training data. Scholars have examined synthetic data generation and augmentation techniques [12]. Data augmentation includes applying diverse transformations like scaling, rotation, and adding noise to augment the training data and expose the model to broad variations. Synthetic data generation is the development of artificial images with rare notations to supplement the training data. These techniques enhance OCR models' generalization capability and improve their accuracy when working with less common symbols.
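One inexpensive way to synthesize such images is to render LaTeX strings off-screen; the sketch below uses matplotlib's mathtext renderer (which supports only a subset of LaTeX), and the formulas and file names are placeholders rather than part of any benchmark.

```python
import matplotlib
matplotlib.use("Agg")                      # render off-screen, no display needed
import matplotlib.pyplot as plt

def render_formula(latex, path, dpi=150):
    """Render a LaTeX string to a PNG so (image, label) training pairs can be synthesized."""
    fig = plt.figure(figsize=(3, 1))
    fig.text(0.5, 0.5, f"${latex}$", ha="center", va="center", fontsize=20)
    fig.savefig(path, dpi=dpi, bbox_inches="tight")
    plt.close(fig)

# Synthesize examples for rarer notation such as summations and Greek letters
for i, formula in enumerate([r"\sum_{k=1}^{n} k^2", r"\alpha + \beta^{2}", r"\frac{\pi}{4}"]):
    render_formula(formula, f"synthetic_{i}.png")
```

Varying fonts, sizes, and the noise routine sketched earlier over such rendered images gives a cheap source of labeled variation for rare symbols.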
The training data's diversity and size are other essential elements that affect OCR accuracy. The learning models need large-scale training data to learn the images' underlying variations and patterns. Insufficient training data leads to overfitting, in which the model fails to generalize well to unseen examples. The diversity of the training data is also critical to ensuring generalization to different styles, fonts, and writing styles [42]. Gathering a comprehensive and diverse dataset is a non-trivial task, since it requires the consideration of variations in handwriting, mathematical domains, and notation styles. Studies should expand and diversify the available training data to improve the accuracy and reliability of OCR systems for image-to-LaTeX conversion [47]. Training data biased toward specific cultural or regional preferences leads to inaccurate OCR results. For instance, OCR models trained on datasets that mainly encompass Western mathematical notation may struggle when faced with symbols or notations used in non-Western languages and mathematical systems [31]. It is essential to incorporate diverse cultural perspectives and collaborations with experts from various regions to ensure the comprehensive coverage of mathematical notations and symbols.
8. Leveraging Symbol Relationships for OCR Error Correction
Leveraging symbol relationships for OCR error correction enhances the accuracy and reliability of image-to-LaTeX conversion. Mathematical notation contains interconnected symbols and operators that convey specific relationships and meanings. OCR models can detect and rectify errors during recognition and conversion by understanding and analyzing these symbol relationships [48]. OCR systems face errors when dealing with complex mathematical equations and symbols. OCR models can identify and correct these errors by considering the context and relationships between characters. For instance, examining the relationships between adjacent symbols can enable OCR models to detect and rectify errors like missing symbols [31]. This strategy helps ensure that the converted LaTeX code accurately represents the original mathematical expression [49].
Symbol relationships also play an important role in error correction related to the positioning of superscripts and subscripts. OCR errors may lead to incorrectly positioned subscripts and superscripts, affecting the meaning of mathematical expressions [29]. OCR models can analyze symbols' relative heights and alignments to determine the correct positioning of subscripts and superscripts by leveraging symbol relationships. This assessment enables the models to mitigate errors and accurately represent mathematical notation. Complex equations with brackets or nested parentheses also challenge OCR systems [13]. Errors may occur when recognizing opening and closing symbols, leading to missing or imbalanced delimiters. OCR models can analyze the relationships between opening and closing symbols to detect and rectify such errors by leveraging symbol relationships. For example, when a closing parenthesis is missing, the OCR model can identify the corresponding opening parenthesis and insert the missing closing symbol. This approach ensures the correct representation of the structural integrity of mathematical expressions.
Leveraging symbol relationships also enhances the correction of errors related to the misinterpretation of mathematical operators. OCR errors may occur when operators are interpreted incorrectly, leading to incorrect mathematical representations. OCR models can analyze the context and identify the correct operator based on its relationship with adjacent symbols by considering the relationships between symbols. This enables the models to correct errors and ensure the accurate representation of mathematical operations. It is crucial to effectively train OCR models with data about symbol relationships to leverage symbol relationships for OCR error correction. OCR models learn to recognize individual symbols and understand their relationships during training [48]. This contextual understanding enables the models to assess symbol sequences, apply error correction strategies, and identify potential errors based on symbol relationships. OCR models develop a deeper understanding of mathematical notation and can make informed decisions about error correction by incorporating symbol relationships into the training process [50].
Contextually understanding symbol relationships allows OCR models to make informed decisions about error correction, leading to more reliable and accurate conversions. OCR models ensure the structural integrity and semantic accuracy of the converted LaTeX code by rectifying errors related to subscripts, superscripts, delimiters, and operators. The primary technique used in leveraging symbol relationships is the application of context windows [51]. OCR models analyze a window of symbols surrounding the symbol in question to determine its correct identity and position. The models can make informed decisions about error correction by considering the neighboring symbols and their relationships. For instance, when a symbol is interpreted as a division operator instead of a fraction bar, the OCR model can examine the symbols before and after the fraction bar to identify the correct interpretation based on the context [13]. Utilizing information about the mathematical domain and the relationships between the symbols also enables OCR models to enhance error correction accuracy [52].
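The context-window idea can be illustrated with a toy rule-based pass over recognized tokens; the confusion pairs and the digit-context rules below are illustrative assumptions, whereas a deployed system would learn such rules from data.

```python
def correct_with_context(tokens, window=1):
    """Re-label ambiguous tokens based on the tokens in a small surrounding window."""
    corrected = list(tokens)
    for i, tok in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        digit_neighbours = any(t.isdigit() for t in left + right)
        if tok == "o" and digit_neighbours:
            corrected[i] = "0"            # a lone 'o' between digits is almost surely zero
        elif tok == "x" and digit_neighbours:
            corrected[i] = r"\times"      # 'x' between numbers reads as multiplication
    return corrected

print(correct_with_context(["2", "x", "3"]))   # ['2', '\\times', '3']
print(correct_with_context(["1", "o", "2"]))   # ['1', '0', '2']
```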
Machine learning algorithms can be trained to model symbol relationships for error correction. These models learn to recognize symbols and the correlations between them by analyzing large annotated datasets. These models can capture the dependencies and patterns in mathematical notation by incorporating symbol relationships into the training
process to make accurate predictions and corrections during the OCR and conversion processes. Graph-based approaches can be used to leverage symbol relationships in mathematical expressions, in the form of graphs, to identify and correct errors [52]. For instance, a disconnected node in the graph can indicate a missing symbol and help restore the right relationship between symbols. The spatial arrangement of symbols within a mathematical expression may also provide important information about their relationships [13]. OCR models can examine symbols' relative positions, alignments, and distances to infer their roles and relationships. For instance, when two symbols are vertically aligned, they are likely to be related as a numerator and denominator in a fraction [43]. Leveraging symbol relationships can also be complemented with statistical concepts like probabilistic models. The models estimate the likelihood of certain symbol relationships based on the statistical properties of mathematical notation. OCR models can make informed decisions and prioritize the most likely corrections by incorporating statistical information into the error correction process.
In summary, leveraging symbol relationships for OCR error correction requires using semantic information, context-based machine-learning algorithms, statistical techniques, graph-based approaches, and spatial relationships. These approaches enable OCR models to examine the relationships between symbols and make accurate predictions and corrections during recognition and conversion [53]. These tools enable OCR systems to achieve higher accuracy, improve the structural integrity of mathematical expressions, and ensure the semantic fidelity of the converted LaTeX code.
9. Post-Processing Techniques for Error Correction in OCR for Image-to-LaTeX Conversion
Post-processing techniques improve the accuracy of optical character recognition (OCR) systems for image-to-LaTeX conversion by addressing errors that occur during the recognition process. OCR models are not infallible, and mistakes propagate and accumulate throughout the conversion process [11]. The creation of robust error correction strategies is essential to ensuring high-quality conversions. One critical aspect of post-processing is error detection. Various error detection methodologies include statistical analysis, linguistic analysis, and pattern matching. A study compared the OCR output with statistical models to identify discrepancies [22]. Statistical techniques can identify potential errors based on their deviation from the expected patterns by analyzing the frequency and distribution of symbols. Pattern-matching strategies apply regular expressions to identify mistakes in the OCR output. These patterns can capture inconsistencies in the OCR results, like misrecognized symbols. Linguistic analysis leverages language models and grammar rules to identify semantic errors. It can identify mistakes that violate grammatical or mathematical rules by analyzing the OCR output in the context of the surrounding text or equations [28]. By combining these approaches, error detection algorithms can identify potential errors and flag them for further correction.
Error correction techniques come into play once errors are detected, rectifying the OCR output and enhancing the accuracy of the final LaTeX representation. Language models provide contextual information and semantic understanding to correct mistakes in OCR outputs [43]. The models consider equations and the surrounding words to identify and rectify errors. If the OCR output contains a misspelled or unrecognized word, the language model proposes alternative words based on the context to improve the accuracy of the converted LaTeX representation [54]. Language models can give valuable suggestions for error correction by leveraging the statistical properties of language and the context in which OCR errors occur. The integration of contextual information is also used to correct mistakes [55]. Contextual analysis examines the relationships between equations, symbols, and mathematical expressions to identify and correct errors. OCR errors that disrupt the overall coherence of the mathematical expressions are detected and rectified by considering the syntactic and structural context. If an OCR error results in an equation that violates mathematical rules, the contextual analysis identifies the discrepancy and proposes corrections that maintain the integrity of the equation.
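A minimal sketch of context-based correction is shown below: candidate replacements for a flagged token are re-scored, and the highest-scoring option is kept. The scoring function here is a toy stand-in assumption; an actual system would score candidates with an n-gram or neural language model trained on LaTeX corpora.

```python
import re

# Toy stand-in for a LaTeX-aware language model: reward balanced braces and
# penalize commands outside a small assumed vocabulary.
VOCAB = {r"\frac", r"\sqrt", r"\sum", r"\int", r"\alpha", r"\beta", r"\cdot"}

def plausibility(expression: str) -> float:
    unknown = sum(cmd not in VOCAB for cmd in re.findall(r"\\[A-Za-z]+", expression))
    balanced = expression.count("{") == expression.count("}")
    return (1.0 if balanced else 0.0) - unknown

def correct(expression: str, flagged: str, candidates: list[str]) -> str:
    """Keep the candidate replacement (or the original) with the highest score."""
    options = [expression] + [expression.replace(flagged, c) for c in candidates]
    return max(options, key=plausibility)

print(correct(r"\sgrt{x + 1}", r"\sgrt", [r"\sqrt"]))   # -> \sqrt{x + 1}
```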
10. Post-Processing Strategies Leverage the Redundancy Inherent in Mathematical Notation
Mathematical expressions have various representations that convey similar meanings. The redundancy can take the form of alternative notations, equivalent forms of equations, or mathematical transformations [56]. Assessing redundancy enables error correction algorithms to find the most probable corrected version of the OCR output and improve the accuracy of the converted LaTeX representation [28]. Error correction techniques ensure the correctness and consistency of the converted LaTeX representation by utilizing mathematical equivalences and transformations. Leveraging external knowledge bases and resources improves error correction in OCR for image-to-LaTeX conversion [25]. These resources include domain-specific databases, mathematical ontologies, and mathematical libraries. Errors can be identified and corrected based on known correct representations by comparing the OCR output against these resources. Domain-specific rules can be included in the error correction process to foster the accuracy and consistency of the converted LaTeX representation. For instance, if the OCR output contains a mathematical symbol that is incompatible with the mathematical domain or context, the error correction mechanism can propose appropriate replacements based on the domain-specific rules [39]. Error correction techniques can improve the accuracy and reliability of the OCR system for image-to-LaTeX conversion by integrating external knowledge and domain-specific information.
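A toy illustration of domain-specific rules is sketched below: OCR tokens are checked against a small, hand-written table of symbols allowed in a given mathematical domain, and replacements are proposed from an assumed confusion map. A practical system would draw this knowledge from mathematical ontologies or domain databases rather than hard-coded dictionaries.

```python
# Assumed, hand-written domain knowledge: allowed symbols per domain and
# common OCR confusions with domain-appropriate replacements.
DOMAIN_SYMBOLS = {
    "probability": {r"\mathbb{P}", r"\mathbb{E}", r"\sigma", r"\mu"},
    "calculus": {r"\int", r"\partial", r"\lim", r"\infty"},
}
CONFUSION_MAP = {r"\Sigma": r"\sigma", r"\imath": r"\int"}

def propose_replacements(tokens, domain):
    """Suggest replacements for LaTeX tokens outside the domain vocabulary."""
    allowed = DOMAIN_SYMBOLS[domain]
    suggestions = {}
    for tok in tokens:
        if tok.startswith("\\") and tok not in allowed and tok in CONFUSION_MAP:
            suggestions[tok] = CONFUSION_MAP[tok]
    return suggestions

print(propose_replacements([r"\mathbb{E}", r"\Sigma", "X"], "probability"))
# {'\\Sigma': '\\sigma'}
```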
The accuracy of error correction techniques depends on the quality of the OCR output. The correction process becomes more challenging if the OCR system produces many errors. Therefore, it is vital to continuously enhance the underlying OCR algorithms to mitigate recognition errors [22]. Improvements in preprocessing techniques such as noise reduction, image enhancement, and segmentation lead to better OCR results and consequently enhance the effectiveness of error correction in image-to-LaTeX conversion. The complexity of mathematical notation poses challenges for error correction in OCR for image-to-LaTeX conversion [47]. Mathematical expressions involve intricate symbols and notations specific to different domains (see Table 2). Accurate recognition and correction of these elements require specialized algorithms and techniques tailored to the complexities of mathematical notation [4]. Developing domain-specific models and algorithms and collaborating with mathematicians and domain experts leads to more accurate error correction in mathematical OCR.
11. Incorporating Active Learning Strategies for OCR Model Improvement
The main advantage of incorporating active learning is its efficiency in the annotation process. Traditional OCR training requires annotating a large dataset with ground truth labels, which is expensive and time-consuming [57]. Active learning reduces these challenges by prioritizing the most informative samples for annotation [2]. By selectively choosing samples for which the model is uncertain or likely to make errors, active learning reduces the annotation effort while ensuring effective model training. This efficient use of annotation resources saves time and reduces the costs associated with the training phase. It can also improve OCR model performance by allowing the model to learn from its mistakes and refine its understanding of symbol recognition and conversion. This iterative process allows the model to concentrate on critical areas for improvement, leading to more accurate and reliable image-to-LaTeX conversions [39]. By actively targeting challenging samples for annotation, the model becomes more robust and capable of handling the various symbols, font styles, and mathematical structures encountered in real-world scenarios [43].
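A schematic version of such a select-annotate-retrain cycle is sketched below. The `model` interface and the `annotate` callable are hypothetical placeholders for the OCR training routine, its confidence estimate, and the human annotation step, not a specific implementation from the cited works.

```python
def active_learning_loop(model, labeled, unlabeled, annotate, rounds=5, batch_size=50):
    """Schematic active learning cycle for an image-to-LaTeX OCR model.

    `model` is assumed to expose train(samples) and predict_confidence(image);
    `annotate` is the human labeling step. Both stand in for whatever training
    framework and annotation tooling a concrete system uses.
    """
    for _ in range(rounds):
        model.train(labeled)                              # fit on current labels
        # Rank unlabeled images by the model's confidence in its own output.
        scored = sorted(unlabeled, key=model.predict_confidence)
        queries = scored[:batch_size]                     # least confident first
        labeled.extend(annotate(queries))                 # experts provide LaTeX
        unlabeled = [img for img in unlabeled if img not in queries]
    return model
```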
The feedback loop created between the model and the annotation process leads to a progressive cycle of learning and refinement. As the OCR model encounters new data and challenging samples during image-to-LaTeX conversion, it flags those samples for annotation. Experts then annotate the samples, and the newly labeled data are used to update the model [50]. The updated model is equipped with additional knowledge that is then applied to the conversion process to enhance performance. This iterative cycle enables the OCR model to adapt and improve continuously. Active learning also enhances the generalization and adaptability of OCR models [43]. The models learn to handle various symbol variations, mathematical structures, font styles, and noise patterns by incorporating diverse samples through strategies like diversity sampling. This generalization capability enables the model to handle different handwriting styles, unique mathematical notations, and variations in expression formats [58]. Active learning also fosters collaboration between the OCR system and human experts. The ability of the model to flag challenging samples for annotation enables human experts to contribute their expertise and domain knowledge [59]. Experts improve the OCR model's performance by annotating these samples. This collaboration enhances the quality of the training dataset since human experts can validate and correct errors made by the model.
A critical aspect to consider is the choice of the active learning query strategy. Query strategies determine how the model chooses samples for annotation; common strategies include query-by-committee, uncertainty sampling, and diversity sampling [60]. Uncertainty sampling selects samples with low confidence scores, indicating the model's uncertainty about their correct labels. Diversity sampling targets a diverse range of samples to ensure comprehensive training [61]. Query-by-committee involves training multiple models with slightly different initializations or architectures and selecting samples on which these models disagree, thus targeting areas of uncertainty. The appropriate query strategy depends on the OCR system's specific requirements and the nature of the data.
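Minimal sketches of the three query strategies are given below, assuming each unlabeled sample comes with a model confidence score, a feature vector, and per-committee-member predictions. The scoring choices (least confidence, greedy farthest-point diversity, and disagreement counting) are only one of several common variants.

```python
import numpy as np

def uncertainty_sampling(confidences, k):
    """Indices of the k samples the model is least confident about."""
    return np.argsort(confidences)[:k]

def diversity_sampling(features, k):
    """Greedy farthest-point selection over feature vectors (one common variant)."""
    chosen = [0]                                   # seed with an arbitrary sample
    while len(chosen) < k:
        dists = np.min(
            np.linalg.norm(features[:, None] - features[chosen][None], axis=-1), axis=1)
        chosen.append(int(np.argmax(dists)))       # farthest from current picks
    return chosen

def query_by_committee(committee_predictions, k):
    """Indices of the k samples the committee disagrees on most.

    committee_predictions: one list of predictions per committee member.
    """
    disagreement = [len(set(preds)) for preds in zip(*committee_predictions)]
    return np.argsort(disagreement)[::-1][:k]

# Toy usage with five confidence scores and a three-member committee.
print(uncertainty_sampling(np.array([0.9, 0.2, 0.7, 0.4, 0.95]), k=2))        # [1 3]
print(query_by_committee([["a", "b", "a"], ["a", "c", "a"], ["a", "c", "b"]], k=1))
# [2] (ties broken arbitrarily)
```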
It is also critical to consider the balance between exploration and exploitation. Exploration involves selecting samples that the model has not seen before, enabling it to learn from diverse examples, while exploitation focuses on selecting samples expected to provide the most significant improvement in the model's performance. Striking the right balance between the two ensures that the OCR model continues to learn and improve while maximizing its accuracy on challenging samples [11]. Active learning may also benefit from ensemble techniques. These techniques combine multiple OCR models, each trained on different subsets of the training data or with different architectures, to make predictions. Ensemble models often provide more robust and accurate predictions by aggregating the knowledge and insights of multiple models. In the context of active learning, ensemble models can be used to improve the reliability of sample selection [43]. By considering the agreement or disagreement among ensemble members on uncertain samples, the active learning strategy can make more informed decisions about which samples to annotate, further enhancing the model's performance.
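One way to quantify ensemble disagreement, sketched below, is the vote entropy of the members' predicted LaTeX strings; samples with the highest entropy are the strongest annotation candidates. This is a generic illustration rather than the specific mechanism used by any of the cited systems.

```python
import math
from collections import Counter

def vote_entropy(member_predictions):
    """Entropy of the ensemble's votes for one sample (higher = more disagreement)."""
    counts = Counter(member_predictions)
    total = len(member_predictions)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Three ensemble members predict LaTeX for two samples.
sample_a = [r"\frac{a}{b}", r"\frac{a}{b}", r"\frac{a}{b}"]   # full agreement
sample_b = [r"\sum_{i}x_i", r"\sum_i x_i", r"\frac{a}{b}"]    # disagreement

print(vote_entropy(sample_a))  # 0.0
print(vote_entropy(sample_b))  # ~1.58 -> prioritize this sample for annotation
```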
The choice of evaluation metrics is important when incorporating active learning into OCR models. Standard metrics, such as error or accuracy rates, might not be sufficient to capture the nuances of OCR performance [62]. Metrics that consider the complexity of mathematical expressions can provide a more comprehensive assessment of the OCR system's performance. Applying appropriate evaluation metrics can optimize the active learning process by focusing on challenging samples that directly impact the overall quality of the image-to-LaTeX conversion. Domain adaptation techniques can also be utilized to improve the efficiency of active learning in OCR [63]. Due to domain differences, OCR models trained on synthetic data might struggle to perform well on real-world documents. By leveraging domain adaptation methods, the model can be fine-tuned on a small amount of real-world data, making it more capable of handling the specific challenges and variations present in real-world OCR scenarios [64]. Incorporating domain adaptation into the active learning pipeline ensures that the samples selected for annotation align with the target domain, resulting in improved performance and accuracy. Active learning for OCR is a dynamic area, and various techniques and strategies are continuously explored [59]. Research efforts focus on developing more sophisticated and efficient sample selection approaches, investigating the integration of active learning with other methods, and leveraging advanced machine learning algorithms. These developments aim to further enhance the capabilities of OCR models and optimize the dynamic learning process for image-to-LaTeX conversion.
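A highly simplified sketch of the fine-tuning step is given below in PyTorch-style code; the model interface, dataset wrapper, and hyperparameters are assumptions for illustration and would in practice come from the synthetic-data pretraining pipeline.

```python
import torch
from torch.utils.data import DataLoader

def finetune_on_real_data(pretrained_model, real_world_dataset, epochs=3, lr=1e-5):
    """Fine-tune a model pretrained on synthetic data using a small real-world set.

    pretrained_model is assumed to return a loss when called with (images, targets),
    as many sequence-generation training loops do; the small learning rate keeps
    the adapted weights close to the synthetic-data solution.
    """
    loader = DataLoader(real_world_dataset, batch_size=8, shuffle=True)
    optimizer = torch.optim.AdamW(pretrained_model.parameters(), lr=lr)
    pretrained_model.train()
    for _ in range(epochs):
        for images, latex_targets in loader:
            optimizer.zero_grad()
            loss = pretrained_model(images, latex_targets)  # assumed interface
            loss.backward()
            optimizer.step()
    return pretrained_model
```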
In summary, incorporating active learning strategies into OCR models for image-to-LaTeX conversion has numerous advantages [65]. The efficiency of the annotation process is enhanced by selectively choosing the most informative samples for annotation, while also mitigating the annotation effort and related costs. The performance of the model is enhanced through iterative learning and refinement. The continuous feedback loop ensures that the model adapts and improves over time, keeping pace with dynamic challenges [62]. Active learning also enhances the model's adaptability and generalization by incorporating diverse samples. The collaboration between the OCR system and human experts enriches the training dataset and improves the model's performance.
12. Evaluation Metrics for OCR Accuracy in Image-to-LaTeX Conversion
Evaluation metrics are important for examining the accuracy and performance of optical character recognition (OCR) systems in the context of image-to-LaTeX conversion. The metrics provide quantitative measures that enable researchers to compare different OCR algorithms, track progress in the field, and identify areas for improvement [66]. Analyzing the accuracy of OCR in image-to-LaTeX conversion presents unique challenges due to the complex nature of mathematical notation and the need for accurate representation in the LaTeX format [44]. A central aspect of evaluating OCR accuracy is comparing the OCR output with ground truth references [15]. The latter represent the correct LaTeX representation of the mathematical expressions contained in the images. Comparing the OCR output against these references enables the calculation of metrics that measure the similarity between the OCR result and the ground truth [49]. Various strategies can be used for the comparison, such as semantic analysis, string-matching algorithms, and structural similarity measures. These techniques enable the quantification of the accuracy of the OCR system in terms of symbol recognition, overall fidelity, and equation structure [29].
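The string-matching comparison mentioned above can be sketched with a normalized Levenshtein distance over LaTeX tokens, as below; the tokenizer and the normalization choice are simplifying assumptions rather than a standard from the cited works.

```python
import re

def tokenize(latex: str):
    """Split LaTeX into commands, braces, and single characters (simplified)."""
    return re.findall(r"\\[A-Za-z]+|\S", latex)

def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev_row = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr_row = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr_row.append(min(prev_row[j] + 1,        # deletion
                                curr_row[j - 1] + 1,    # insertion
                                prev_row[j - 1] + cost))  # substitution
        prev_row = curr_row
    return prev_row[-1]

def latex_similarity(ocr_output: str, ground_truth: str) -> float:
    """Token-level similarity in [0, 1]: 1.0 means an exact token match."""
    a, b = tokenize(ocr_output), tokenize(ground_truth)
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

print(latex_similarity(r"\frac{a}{b}", r"\frac{a}{c}"))  # ~0.857 (one token differs)
```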
Symbol-level evaluation metrics examine the accuracy of OCR systems in recognizing individual symbols in mathematical expressions. These metrics include recall, precision, and the F1 score, which are commonly used in pattern recognition tasks, and they enable researchers to assess how accurately OCR systems identify and recognize mathematical symbols [67]. Equation-level evaluation metrics examine the accuracy of OCR systems in capturing the structure and syntax of mathematical expressions [68]. These metrics assess the arrangement of symbols and adherence to mathematical rules. The most commonly applied metric is equation-level accuracy, which measures the proportion of correctly recognized and structured equations. The structural similarity metric quantifies the similarity between the OCR output and the ground truth regarding the hierarchical structure and relationships between symbols [69]. These metrics provide information about the OCR system's ability to preserve the structural integrity of mathematical expressions during the conversion process.
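The symbol-level and equation-level metrics can be computed as sketched below; counting symbols with multisets so that repeated tokens are handled is one reasonable convention among several, not the only one.

```python
from collections import Counter

def symbol_prf(predicted_tokens, reference_tokens):
    """Symbol-level precision, recall, and F1 over multisets of tokens."""
    pred, ref = Counter(predicted_tokens), Counter(reference_tokens)
    true_pos = sum((pred & ref).values())          # overlap of the two multisets
    precision = true_pos / max(sum(pred.values()), 1)
    recall = true_pos / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

def equation_accuracy(predictions, references):
    """Proportion of equations whose predicted LaTeX matches the reference exactly."""
    exact = sum(p == r for p, r in zip(predictions, references))
    return exact / max(len(references), 1)

print(symbol_prf(["\\frac", "{", "a", "}", "{", "c", "}"],
                 ["\\frac", "{", "a", "}", "{", "b", "}"]))   # all ~0.86 here
print(equation_accuracy([r"x^2", r"\frac{a}{b}"], [r"x^2", r"\frac{a}{c}"]))  # 0.5
```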
Semantic evaluation metrics examine the accuracy of OCR systems in capturing the semantic meaning of mathematical expressions. Mathematical notation permits various equivalent representations, and preserving semantic equivalence is important for correct conversion to LaTeX [54]. Semantic evaluation metrics measure the similarity between the semantic models of the OCR output and the ground truth. Strategies such as semantic matching, parsing, and embedding may be used to measure semantic similarity and assess the OCR system's accuracy in capturing the intended meaning of mathematical expressions. Domain-specific evaluation metrics are critical for determining OCR accuracy in specialized mathematical domains [25]. Various mathematical disciplines may have specific symbols, notations, or conventions that must be correctly recognized and represented in LaTeX [70]. Evaluating OCR systems in these domains requires domain-specific evaluation metrics that capture the complexities and intricacies of the notation. Collaborating with mathematicians, domain experts, and educators is vital for defining and developing these metrics and ensuring that OCR systems meet the requirements of specific mathematical domains [43]. The evaluation of OCR accuracy in image-to-LaTeX conversion should also take into account the efficiency and computational complexity of the OCR systems. Large-scale document processing requires OCR systems to provide accurate results within acceptable time frames. Evaluation metrics that include processing speed, scalability, and resource utilization can comprehensively assess the OCR systems' performance in practical scenarios.
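One concrete way to approximate the semantic comparison, sketched below, is to parse both LaTeX strings into symbolic expressions and test whether their difference simplifies to zero. This relies on SymPy's LaTeX parser (which requires the antlr4 Python runtime) and only covers expressions that parser understands; it is an illustration of the idea, not a metric used in the cited studies.

```python
from sympy import simplify
from sympy.parsing.latex import parse_latex  # needs the antlr4 runtime installed

def semantically_equivalent(ocr_latex: str, reference_latex: str) -> bool:
    """True if both LaTeX strings parse to mathematically equal expressions."""
    try:
        diff = simplify(parse_latex(ocr_latex) - parse_latex(reference_latex))
        return diff == 0
    except Exception:
        # Parsing failures are treated as "not verified" rather than "not equal".
        return False

# "\frac{2x}{2}" and "x" differ as strings but are semantically the same.
print(semantically_equivalent(r"\frac{2x}{2}", r"x"))        # True
print(semantically_equivalent(r"x^{2}", r"x \cdot x"))       # True
print(semantically_equivalent(r"x + 1", r"x - 1"))           # False
```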
13. Conclusions, Limitations, and Recommendations
This study examines optical character recognition (OCR) for image-to-LaTeX conversion, mainly focusing on the transformative potential of active learning strategies. It presents a holistic approach to advancing OCR accuracy, addressing the limitations of current OCR models, introducing innovative techniques, and emphasizing the critical role of context-aware processing, active learning, and domain adaptation in achieving this goal. The limitations of existing OCR models are multifaceted, encompassing challenges related to recognizing mathematical symbols, handling complex equations, and effectively managing noise in the input data [71]. OCR models have struggled to cope with the intricacies of mathematical notation and the diverse typographical conventions associated with LaTeX documents. Symbol recognition has been a persistent challenge due to variations in writing styles and the complex interplay of symbols within equations.
This research first introduces the concept of augmenting training data with LaTeX syntax constraints to address these limitations. This innovative approach entails constraining the OCR model's predictions to adhere to LaTeX syntax rules during training. Integrating LaTeX-specific constraints enhances the model's understanding of mathematical expressions, enabling it to discern structure and semantics accurately. Consequently, the OCR system produces representations that align with LaTeX syntax, resulting in higher accuracy and reliability in image-to-LaTeX conversions. Symbol relationships within mathematical expressions also assume critical significance in our exploration [2]. OCR models gain the ability to rectify errors and enhance the fidelity of LaTeX conversions by considering the contextual information and interdependencies of the symbols. This context-aware processing approach represents a crucial step towards overcoming the challenges associated with symbol recognition and understanding. It underscores the importance of context in OCR, providing a roadmap for further advancements in symbol recognition, interpretation, and conversion accuracy. Active learning introduces a dynamic element to the OCR process by enabling the model to selectively choose informative samples for annotation. This strategic selection enhances the model's performance by focusing on challenging areas and refining its understanding of symbol recognition and conversion. The active learning feedback loop, combined with ensemble techniques and appropriate evaluation metrics, creates a progressive learning and refinement cycle, allowing OCR models to adapt and improve over time. This iterative process significantly enhances accuracy and reliability, making it an invaluable tool in the OCR toolkit.
This study also emphasizes the importance of uncertainty and diversity sampling
in active learning. These strategies ensure that the dynamic learning process remains
efficient and effective, carefully balancing exploration and exploitation. Ensemble models
strengthen OCR predictions’ robustness and accuracy through their ability to aggregate
knowledge from multiple models. Ensemble-based sample selection plays a pivotal role
in the effectiveness of active learning strategies. Domain adaptation techniques emerge
as a crucial aspect of our research since they allow OCR models trained on synthetic
data to be fine-tuned on real-world data, aligning them to the specific challenges and
variations encountered in practical OCR scenarios. Methods such as unsupervised or semi-
supervised learning are instrumental in enhancing the transferability and effectiveness of
active learning strategies when faced with real-world complexities [13].
This study offers a comprehensive framework for advancing OCR accuracy in image-
to-LaTeX conversion. Through active learning and domain adaptation, this research paves
the way for more accurate and versatile OCR systems by tackling the limitations of current
OCR models, proposing innovative methodologies, and highlighting the importance of
context-aware processing. The implications of this work extend far and wide, from improv-
ing scientific documentation and mathematical education to enhancing accessibility for
visually impaired individuals. To address the domain adaptation challenge, this paper underscores the significance of incorporating techniques that allow OCR models trained on synthetic data to perform well on real-world documents. By leveraging domain adaptation methods such as unsupervised or semi-supervised learning, OCR models can be fine-tuned on a small amount of real-world data, aligning them with the specific challenges and variations encountered in practical OCR scenarios. Including domain adaptation considerations enhances the transferability and effectiveness of active learning strategies in real-world applications.
Despite its significant contributions, it is crucial to recognize various limitations that have influenced the course and application of this study. Firstly, the dynamic technological landscape poses a constant challenge. This study has been conducted within a rapidly
evolving field where new algorithms, hardware, and software emerge daily. Continuous
updates and adaptations are needed to keep pace with the ever-evolving landscape of
technology and maintain the cutting-edge applicability of our methods. Secondly, this
study might not cover every possible detail and difficulty OCR professionals face. The
domain of mathematical expressions and LaTeX texts is broad and complex, and although
our approaches show promise, they might not cover every particular situation. Changes in
LaTeX conventions and variations in mathematical notations may present new difficulties
that require specific considerations.
Furthermore, even with the best attempts to use domain adaptation techniques to close
the gap between synthetic and real-world data, real-world OCR applications might still pose
unique challenges that require additional improvements and customized methods. The
variety in real-world documents, including differences in writing styles and the subtleties of
symbol usage, can pose difficulties beyond the purview of our investigation. The subjective
character of human interpretation and assessment may result in variations in the perceived
accuracy of OCR outcomes. As human decisions and subjectivity can affect the impact of
our innovations, the human element in accuracy assessment suggests that judgments of the
system’s performance may differ. These limitations give future researchers an important
direction and a firm base to build on as they improve and advance our work.
Author Contributions: E.Z.O., A.H., İ.E. and O.O.M. have equally contributed to the conceptualization, methodology, writing—original draft preparation, writing—review and editing, and writing and revision of this paper, while supervision was performed by A.H. and İ.E. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Data are contained within the article.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Drobac, S.; Lindén, K. Optical character recognition with neural networks and post-correction with finite state methods. Int. J. Doc. Anal. Recognit. 2020, 23, 279–295. [CrossRef]
2. Garkal, A.; Pal, A.; Singh, K.P. HMER-Image to LaTeX: A Variational Dropout Approach. In Proceedings of the 2021 5th Conference on Information and Communication Technology (CICT), Kurnool, India, 10–12 December 2021. [CrossRef]
3. Deng, Y.; Yu, Y.; Yao, J.; Sun, C. An Attention Based Image to Latex Markup Decoder. In Proceedings of the 2017 Chinese Automation Congress (CAC), Jinan, China, 20–22 October 2017; pp. 7199–7203. [CrossRef]
4. Kayal, P.; Anand, M.; Desai, H.; Singh, M. Tables to LaTeX: Structure and content extraction from scientific tables. Int. J. Doc. Anal. Recognit. 2022, 26, 121–130. [CrossRef]
5. Bitterman, D.S.; Goldner, E.; Finan, S.; Harris, D.; Durbin, E.B.; Hochheiser, H.; Warner, J.L.; Mak, R.H.; Miller, T.; Savova, G.K. An End-to-End Natural Language Processing System for Automatically Extracting Radiation Therapy Events From Clinical Texts. Int. J. Radiat. Oncol. 2023, 117, 262–273. [CrossRef]
6. Heo, T.S.; Kim, Y.S.; Choi, J.M.; Jeong, Y.S.; Seo, S.Y.; Lee, J.H.; Jeon, J.P.; Kim, C. Prediction of Stroke Outcome Using Natural Language Processing-Based Machine Learning of Radiology Report of Brain MRI. J. Pers. Med. 2020, 10, 286. [CrossRef] [PubMed]
7. Rokde, C.N.; Kshirsagar, D.M. NLP challenges for machine translation from english to indian languages. Int. J. Comput. Sci. Inform. 2020, 4, 5. [CrossRef]
8. Wei, J.; Wang, Q.; Song, X.; Zhao, Z. The Status and Challenges of Image Data Augmentation Algorithms. J. Phys. Conf. Ser. 2023, 2456, 012041. [CrossRef]
9. Ritz, F.; Phan, T.; Sedlmeier, A.; Altmann, P.; Wieghardt, J.; Schmid, R.; Sauer, H.; Klein, C.; Linnhoff-Popien, C.; Gabor, T. Capturing Dependencies Within Machine Learning via a Formal Process Model. Lect. Notes Comput. Sci. 2022, 13703, 249–265. [CrossRef]
10. Vodovozov, V.; Raud, Z.; Petlenkov, E. Challenges of Active Learning in a View of Integrated Engineering Education. Educ. Sci. 2021, 11, 43. [CrossRef]
11. Jin, J.A. The Evolution of Visual Spectacle: A Virtual-Reality Exhibition at the Charles B. Wang Center. Ars Orient. 2020, 50, 20220203. [CrossRef]
12. Jaderberg, M.; Simonyan, K.; Vedaldi, A.; Zisserman, A. Deep Structured Output Learning for Unconstrained Text Recognition. arXiv 2014, arXiv:1412.5903.
13. Yang, J.; Drake, T.; Damianou, A.; Maarek, Y. Leveraging Crowdsourcing Data for Deep Active Learning an Application: Learning Intents in Alexa. In Proceedings of the 2018 World Wide Web Conference, Lyon, France, 23–27 April 2018; pp. 23–32. [CrossRef]
14. Najam, R.; Faizullah, S. Analysis of Recent Deep Learning Techniques for Arabic Handwritten-Text OCR and Post-OCR Correction. Appl. Sci. 2023, 13, 7568. [CrossRef]
15. Beyerer, J.; Puente León, F.; Frese, C. Preprocessing and Image Enhancement. In Machine Vision: Automated Visual Inspection: Theory, Practice and Applications; Springer: Berlin/Heidelberg, Germany, 2016; pp. 465–519. [CrossRef]
16. Mouchère, H.; Viard-Gaudin, C.; Kim, D.H.; Kim, J.H.; Garain, U. CROHME2011: Competition on Recognition of Online Handwritten Mathematical Expressions. Available online: https://hal.science/hal-00615216/file/CROHME_CRC511.pdf (accessed on 22 July 2023).
17. Mouchère, H.; Viard-Gaudin, C.; Kim, D.H.; Kim, J.H.; Garain, U. ICFHR 2012-Competition on Recognition of On-line Mathematical Expressions (CROHME 2012). Available online: http://www.isical.ac.in/~crohme (accessed on 22 July 2023).
18. Mouchère, H.; Viard-Gaudin, C.; Zanibbi, R.; Garain, U.; Kim, D.H.; Kim, J.H. ICDAR 2013 CROHME: Third International Competition on Recognition of Online Handwritten Mathematical Expressions. 2013. Available online: www.isical.ac.in/ (accessed on 22 July 2023).
19. Deng, Y.; Kanervisto, A.; Ling, J.; Rush, A.M. Image-to-Markup Generation with Coarse-to-Fine Attention. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016.
20. Sivaramakrishnan, A.; Kumar, M.V. Pre-Processing and Image Enhancement Techniques. IJARCCE 2020, 9, 107–113. [CrossRef]
21. Wang, Z.; Liu, J.C. PDF2LaTeX: A Deep Learning System to Convert Mathematical Documents from PDF to LaTeX. In Proceedings of the ACM Symposium on Document Engineering 2020, New York, NY, USA, 29 September–1 October 2020. [CrossRef]
22. Saddami, K.; Munadi, K.; Away, Y.; Arnia, F. Effective and fast binarization method for combined degradation on ancient documents. Heliyon 2019, 5, e02613. [CrossRef] [PubMed]
23. Lim, C.C.; Ling, A.H.W.; Chong, Y.F.; Mashor, M.Y.; Alshantti, K.; Aziz, M.E. Comparative Analysis of Image Processing Techniques for Enhanced MRI Image Quality: 3D Reconstruction and Segmentation Using 3D U-Net Architecture. Diagnostics 2023, 13, 2377. [CrossRef]
24. Shopon, M.; Diptu, N.A.; Mohammed, N. End-to-End Optical Character Recognition Using Sythetic Dataset Generator for Noisy Conditions. In Proceedings of the International Joint Conference on Computational Intelligence: IJCCI 2019, Dhaka, Bangladesh, 25–26 October 2019; pp. 515–527. [CrossRef]
25. Zhou, M.; Cai, M.; Li, G.; Li, M. An End-to-End Formula Recognition Method Integrated Attention Mechanism. Mathematics 2022, 11, 177. [CrossRef]
26. Huang, Z.; Ma, Y.; Wang, R.; Li, W.; Dai, Y. A Model for EEG-Based Emotion Recognition: CNN-Bi-LSTM with Attention Mechanism. Electronics 2023, 12, 3188. [CrossRef]
27. Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
28. Hino, H. Active Learning: Problem Settings and Recent Developments. 2020. Available online: https://arxiv.org/abs/2012.04225v2 (accessed on 13 June 2023).
29. Liu, Y.; Li, Z.; Li, H.; Yu, W.; Huang, M.; Peng, D.; Liu, M.; Chen, M.; Li, C.; Liu, C.-L.; et al. On the Hidden Mystery of OCR in Large Multimodal Models. 2023. Available online: https://arxiv.org/abs/2305.07895v3 (accessed on 13 June 2023).
30. Wang, X.; Liu, Q.; Gui, T.; Zhang, Q.; Zou, Y.; Zhou, X.; Ye, J.; Zhang, Y.; Zheng, R.; Pang, Z.; et al. TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, Online, 1–6 August 2021; pp. 347–355. [CrossRef]
31. Zhang, J.; Sang, J.; Xu, K.; Wu, S.; Zhao, X.; Sun, Y.; Hu, Y.; Yu, J. Robust CAPTCHAs Towards Malicious OCR. IEEE Trans. Multimedia 2021, 23, 2575–2587. [CrossRef]
32. Kukreja, V.; Sakshi. Machine learning models for mathematical symbol recognition: A stem to stern literature analysis. Multimedia Tools Appl. 2022, 81, 28651–28687. [CrossRef]
33. Ogwok, D.; Ehlers, E.M. Detecting, Contextualizing and Computing Basic Mathematical Equations from Noisy Images using Machine Learning. In Proceedings of the 2020 3rd International Conference on Computational Intelligence and Intelligent Systems, Tokyo, Japan, 13–15 November 2020; pp. 8–14. [CrossRef]
34. Lu, M.; Fang, Y.; Yan, F.; Li, M. Incorporating Domain Knowledge into Natural Language Inference on Clinical Texts. IEEE Access 2019, 7, 57623–57632. [CrossRef]
35. Karpinski, R.; Lohani, D.; Belaid, A. Metrics for Complete Evaluation of OCR Performance. 2018. Available online: https://inria.hal.science/hal-01981731 (accessed on 27 October 2023).
36. Neudecker, C.; Baierer, K.; Gerber, M.; Clausner, C.; Antonacopoulos, A.; Pletschacher, S. A Survey of OCR Evaluation Tools and Metrics. In Proceedings of the 6th International Workshop on Historical Document Imaging and Processing, Lausanne, Switzerland, 5–6 September 2021; pp. 13–18. [CrossRef]
37. Bin, O.K.; Hooi, Y.K.; Kadir, S.J.A.; Fujita, H.; Rosli, L.H. Enhanced Symbol Recognition based on Advanced Data Augmentation for Engineering Diagrams. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 537–546. [CrossRef]
38. Patil, S.; Varadarajan, V.; Mahadevkar, S.; Athawade, R.; Maheshwari, L.; Kumbhare, S.; Garg, Y.; Dharrao, D.; Kamat, P.; Kotecha, K. Enhancing Optical Character Recognition on Images with Mixed Text Using Semantic Segmentation. J. Sens. Actuator Networks 2022, 11, 63. [CrossRef]
39. Tang, L.A.; Korona-Bailey, J.; Zaras, D.; Roberts, A.; Mukhopadhyay, S.; Espy, S.; Walsh, C.G. Using Natural Language Processing to Predict Fatal Drug Overdose from Autopsy Narrative Text: Algorithm Development and Validation Study. JMIR Public Health Surveill. 2023, 9, e45246. [CrossRef] [PubMed]
40. Bilbeisi, G.; Ahmed, S.; Majumdar, R. DeepEquaL: Deep Learning Based Mathematical Equation to Latex Generation. In Proceedings of the Neural Information Processing: 27th International Conference, ICONIP 2020, Bangkok, Thailand, 18–22 November 2020; Volume 1333, pp. 324–332. [CrossRef]
41. Kaluarachchi, T.; Wickramasinghe, M. A systematic literature review on automatic website generation. J. Comput. Lang. 2023, 75, 101202. [CrossRef]
42. Maharana, K.; Mondal, S.; Nemade, B. A review: Data pre-processing and data augmentation techniques. Glob. Transit. Proc. 2022, 3, 91–99. [CrossRef]
43. Springmann, U.; Fink, F.; Schulz, K.U. Automatic Quality Evaluation and (Semi-) Automatic Improvement of OCR Models for Historical Printings. 2016. Available online: https://arxiv.org/abs/1606.05157v2 (accessed on 13 June 2023).
44. Shidaganti, G.; Salil, S.; Anand, P.; Jadhav, V. Robotic Process Automation with AI and OCR to Improve Business Process: Review. In Proceedings of the 2021 Second International Conference on Electronics and Sustainable Communication Systems (ICESC), Coimbatore, India, 4–6 August 2021; pp. 1612–1618. [CrossRef]
45. Scharpf, P.; Schubotz, M.; Cohl, H.S.; Breitinger, C.; Gipp, B. Discovery and Recognition of Formula Concepts using Machine Learning. 2023. Available online: https://arxiv.org/abs/2303.01994v2 (accessed on 23 July 2023).
46. Gipp, B.; Greiner-Petter, A.; Schubotz, M.; Meuschke, N. Methods and Tools to Advance the Retrieval of Mathematical Knowledge from Digital Libraries for Search-, Recommendation-, and Assistance-Systems. arXiv 2023, arXiv:2305.07335.
47. Pandey, S.; Pandey, S.K.; Miller, L. Measuring Innovativeness of Public Organizations: Using Natural Language Processing Techniques in Computer-Aided Textual Analysis. Int. Public Manag. J. 2016, 20, 78–107. [CrossRef]
48. Wang, J.; Sun, Y.; Wang, S. Image to Latex with DenseNet Encoder and Joint Attention. Procedia Comput. Sci. 2019, 147, 374–380. [CrossRef]
49. Chu, J.S.V.; Pyo, B.; Parth, V.; Hussein, A.; Wang, P. Key–Value Pair Identification from Tables Using Multimodal Learning. Int. J. Pattern Recognit. Artif. Intell. 2023, 37, 2352009. [CrossRef]
50. Hirlekar, V.V.; Kumar, A. Natural Language Processing based Online Fake News Detection Challenges—A Detailed Review. In Proceedings of the 2020 5th International Conference on Communication and Electronics Systems (ICCES), Coimbatore, India, 10–12 June 2020; pp. 748–754. [CrossRef]
51. Borovikov, E. A Survey of Modern Optical Character Recognition Techniques. 2014. Available online: https://arxiv.org/abs/1412.4183v1 (accessed on 20 July 2023).
52. Sandnes, F.E. Lost in OCR-Translation: Pixel-based Text Reflow to the Rescue: Magnification of Archival Raster Image Documents in the Browser without Horizontal Scrolling. In Proceedings of the 15th International Conference on PErvasive Technologies Related to Assistive Environments, New York, NY, USA, 29 June–1 July 2022; pp. 500–506. [CrossRef]
53. Shruthi, J.; Swamy, S. A prior case study of natural language processing on different domain. Int. J. Electr. Comput. Eng. 2020, 10, 4928–4936. [CrossRef]
54. Crema, C.; Attardi, G.; Sartiano, D.; Redolfi, A. Natural language processing in clinical neuroscience and psychiatry: A review. Front. Psychiatry 2022, 13, 946387. [CrossRef] [PubMed]
55. Mehta, N.; Braun, P.X.; Gendelman, I.; Alibhai, A.Y.; Arya, M.; Duker, J.S.; Waheed, N.K. Repeatability of binarization thresholding methods for optical coherence tomography angiography image quantification. Sci. Rep. 2020, 10, 15368. [CrossRef] [PubMed]
56. Zhang, Z.; Zhang, Z.; Di Caprio, F.; Gu, G.X. Machine learning for accelerating the design process of double-double composite structures. Compos. Struct. 2022, 285, 115233. [CrossRef]
57. Li, M.; Zhao, P.; Zhang, Y.; Niu, S.; Wu, Q.; Tan, M. Structure-Aware Mathematical Expression Recognition with Sequence-Level Modeling. In Proceedings of the 29th ACM International Conference on Multimedia, New York, NY, USA, 20–24 October 2021; pp. 5038–5046. [CrossRef]
58. Dalal, J.; Daiya, S. Image Processing Based Optical Character Recognition Using Matlab. Int. J. Eng. Sci. Res. Technol. 2018, 30, 406–411. [CrossRef]
59. Edwards, K.M. Accelerating the Design Process Through Natural Language Processing-based Idea Filtering. 2022. Available online: https://dspace.mit.edu/handle/1721.1/147338 (accessed on 20 July 2023).
60. Jiang, K.; Lu, X. Natural Language Processing and Its Applications in Machine Translation: A Diachronic Review. In Proceedings of the 2020 IEEE 3rd International Conference of Safe Production and Informatization (IICSPI), Chongqing, China, 28–30 November 2020; pp. 210–214. [CrossRef]
61. Ling, X.; Gao, M.; Wang, D. Intelligent Document Processing Based on RPA and Machine Learning. In Proceedings of the 2020 Chinese Automation Congress (CAC), Shanghai, China, 6–8 November 2020; pp. 1349–1353. [CrossRef]
62. Wu, J.-W.; Yin, F.; Zhang, Y.-M.; Zhang, X.-Y.; Liu, C.-L. Image-to-markup generation via paired adversarial learning. Lect. Notes Comput. Sci. 2019, 11051, 18–34. [CrossRef]
63. Moon, N.N.; Salehin, I.; Parvin, M.; Hasan, M.; Talha, I.M.; Debnath, S.C.; Nur, F.N.; Saifuzzaman, M. Natural language processing based advanced method of unnecessary video detection. Int. J. Electr. Comput. Eng. 2021, 11, 5411–5419. [CrossRef]
64. Leaman, R.; Khare, R.; Lu, Z. Challenges in clinical natural language processing for automated disorder normalization. J. Biomed. Inform. 2015, 57, 28–37. [CrossRef]
65. Dong, L.-F.; Liu, H.-C.; Zhang, X.-M. Synthetic Data Generation and Shuffled Multi-Round Training Based Offline Handwritten Mathematical Expression Recognition. J. Comput. Sci. Technol. 2022, 37, 1427–1443. [CrossRef]
66. Della Porta, M.G.; Travaglino, E.; Boveri, E.; Ponzoni, M.; Malcovati, L.; Papaemmanuil, E.; Rigolin, G.M.; Pascutto, C.; Croci, G.; Gianelli, U.; et al. Minimal morphological criteria for defining bone marrow dysplasia: A basis for clinical implementation of WHO classification of myelodysplastic syndromes. Leukemia 2014, 29, 66–75. [CrossRef]
67. Jing, Y. Research on the Application of Artificial Intelligence Natural Language Processing Technology in Japanese Teaching. J. Phys. Conf. Ser. 2020, 1682, 012081. [CrossRef]
68. Joshi, D.S.; Risodkar, Y.R. Deep Learning Based Gujarati Handwritten Character Recognition. In Proceedings of the 2018 International Conference on Advances in Communication and Computing Technology (ICACCT), Sangamner, India, 8–9 February 2018; pp. 563–566. [CrossRef]
69. Ma, S.; Chen, C.; Khalajzadeh, H.; Grundy, J. Latexify Math: Mathematical Formula Markup Revision to Assist Collaborative Editing in Math Q&A Sites. Proc. ACM Human–Comput. Interact. 2021, 5, 403. [CrossRef]
70. Ling, J.; Rush, A. Coarse-to-Fine Attention Models for Document Summarization. In Proceedings of the Workshop on New Frontiers in Summarization, Copenhagen, Denmark, 7 September 2017; pp. 33–42. [CrossRef]
71. Névéol, A.; Zweigenbaum, P. Expanding the Diversity of Texts and Applications: Findings from the Section on Clinical Natural Language Processing of the International Medical Informatics Association Yearbook. Yearb. Med. Inform. 2018, 27, 193–198. [CrossRef] [PubMed]
Disclaimer/Publisher’s Note:
The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.