Can we build language-independent OCR
using LSTM networks?
Adnan Ul-Hasan
Technical University of Kaiserslautern
67663 Kaiserslautern, Germany
adnan@cs.uni-kl.de
Thomas M. Breuel
Technical University of Kaiserslautern
67663 Kaiserslautern, Germany
tmb@cs.uni-kl.de
ABSTRACT
Language models or recognition dictionaries are usually considered an essential step in OCR. However, using a language model complicates the training of OCR systems, and it also narrows the range of texts to which an OCR system can be applied. Recent results have shown that Long Short-Term Memory (LSTM) based OCR yields low error rates even without language modeling. In this paper, we explore to what extent LSTM models can be used for multilingual OCR without language models. To do this, we measure the cross-language performance of LSTM models trained on different languages. LSTM models show good promise for language-independent OCR: the recognition errors are very low (around 1%) without any language model or dictionary correction.
Keywords
MOCR, LSTM Networks, RNN
1. INTRODUCTION
Multilingual OCR (MOCR) is of interest for many reasons: digitizing historical books containing two or more scripts, bilingual books, dictionaries, and books with line-by-line translation are a few reasons to want reliable multilingual OCR systems. However, MOCR also presents several unique challenges, as Popat pointed out in the context of the Google Books project1. Some of these challenges are:
• Multiple scripts/languages on a page (multi-script identification).
• Multiple languages in the same or similar fonts, like Arabic-Persian or English-German.
• The same language in multiple scripts, like Urdu in Nastaleeq and Naskh scripts.
• Archaic and reformed orthographies, e.g. 18th-century English, Fraktur (historical German), etc.
1 http://en.wikipedia.org/wiki/Google_Books
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. Request permissions from Permissions@acm.org.
MOCR ’13, August 24 2013, Washington, DC, USA
Copyright 2013 ACM 978-1-4503-2114-3/13/08 ...$15.00.
http://dx.doi.org/10.1145/2505377.2505394
There have been efforts to adapt existing OCR systems to other languages. The open-source OCR system Tesseract [2] is one such example. Its basic classification is based on hierarchical shape classification, where the character set is first reduced to a few candidates and, at the last stage, the test sample is matched against the representatives of this short list. Although Tesseract can be used for a variety of languages (due to the support available for many languages), it cannot be used as an all-in-one solution in situations where multiple scripts occur together.
The usual approach to the multilingual OCR problem is to combine two or more separate classifiers [3], as it is believed that reasonable OCR output even for a single script cannot be obtained without sophisticated post-processing steps such as language modelling, use of a dictionary to correct OCR errors, font adaptation, etc. Natarajan et al. [4] proposed an HMM-based script-independent multilingual OCR system in which the feature extraction, training and recognition components are all language-independent; however, they use language-specific word lexica and language models for recognition. To the best of our knowledge, no OCR method has been proposed that achieves very low error rates without the aforementioned sophisticated post-processing techniques. Recent experiments on English and German text using LSTM networks [5], however, have shown that reliable OCR results can be obtained without such techniques.
Our hypothesis for multilingual OCR is that if a single model can be obtained, at least for a family of scripts (e.g. Latin, Arabic, Indic), it can then be used to recognize all scripts of that family, thereby avoiding the effort of combining multiple classifiers. Since LSTM networks achieve very low error rates without a language-modelling post-processing step, they are a natural candidate for multilingual OCR.
In this paper, we report the results of applying LSTM networks to the multilingual OCR problem. The basic aim is to benchmark the extent to which LSTM networks rely on language modelling to predict the correct labels, and whether we can do without language modelling and other post-processing steps. Additionally, we want to see how well LSTM networks use context to recognize a particular character. Specifically, we trained LSTM networks on English, German, French and a mixed set of these three languages, and tested each network on each language. The LSTM-based models achieve very high recognition accuracy without the aid of language modelling, and they show good promise for multilingual OCR tasks.
Figure 1: Some sample images from our database. There are 96 variations of fonts in common use; e.g. for Times true-type fonts, the normal, italic, bold and bold-italic variations were included. Also note that these images were degraded to reflect scanning artefacts.
The paper is organized as follows: the preprocessing step is described in the next section, Section 3 describes the configuration of the LSTM networks used in our experiments, and Section 4 gives the details of the experimental evaluation. Section 5 concludes the paper with a discussion of the current work and future directions.
2. PREPROCESSING
Scale and the relative position of a character are important features for distinguishing characters in Latin script (and some other scripts), so text-line normalization is an essential step in applying 1D LSTM networks to OCR. In this work, we used the normalization approach introduced in [5], namely text-line normalization based on a trainable, shape-based model. A token dictionary, created from a collection of text lines, contains information about the x-height, baseline (geometric features) and shape of individual characters. These models are then used to normalize any text-line image.
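The trainable, shape-based normalization of [5] is more involved than plain rescaling, but its geometric effect can be sketched with a simple fixed-height rescaler (the target height of 32 matches the parameters in Section 4.2; the function name and the nearest-neighbour scheme are illustrative choices, not the actual implementation):

```python
def normalize_height(image, target_h=32):
    # image: list of rows (each a list of pixel values).
    # Nearest-neighbour rescaling to a fixed height, preserving aspect ratio.
    # The trainable, shape-based model of [5] additionally centres the
    # baseline and x-height; this sketch only fixes the vertical scale.
    h = len(image)
    w = len(image[0])
    scale = target_h / float(h)
    new_w = max(1, int(round(w * scale)))
    out = []
    for y in range(target_h):
        src_y = min(h - 1, int(y / scale))
        row = []
        for x in range(new_w):
            src_x = min(w - 1, int(x / scale))
            row.append(image[src_y][src_x])
        out.append(row)
    return out

line = [[0] * 100 for _ in range(48)]  # a 48-pixel-high dummy text line
norm = normalize_height(line)
print(len(norm), len(norm[0]))  # -> 32 67
```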
3. LSTM NETWORKS
Recurrent Neural Networks (RNNs) have shown great promise in recent times due to the Long Short-Term Memory (LSTM) architecture [6], [7]. The LSTM architecture differs significantly from earlier architectures such as Elman networks [8] and echo-state networks [9], and appears to overcome many of the limitations and problems of those earlier designs.
Traditional RNNs, though good at context-aware processing [10], have not shown competitive performance on OCR and speech recognition tasks. Their shortcomings are attributed mainly to the vanishing-gradient problem [11, 12]. The Long Short-Term Memory [6] architecture was designed to overcome this problem: it is a highly non-linear recurrent network with multiplicative "gates" and additive feedback. Graves et al. [7] introduced the bidirectional LSTM architecture for accessing context in both the forward and backward directions; both layers are connected to a single output layer. To avoid the need for segmented training data, Graves et al. [13] used a forward-backward algorithm to align transcripts with the output of the neural network. The interested reader is referred to the above-mentioned references for further details regarding LSTM and RNN architectures.
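As a concrete illustration of the "multiplicative gates and additive feedback" mentioned above, the following is a minimal single-cell LSTM step in plain Python (the scalar weights are arbitrary example values; real networks use weight matrices and many cells):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    # One step of a single LSTM memory cell (scalar weights for clarity).
    # The gates are multiplicative; the cell-state update is additive,
    # which is what lets gradients survive over long input sequences.
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate value
    c = f * c_prev + i * g   # additive feedback through the cell state
    h = o * math.tanh(c)     # gated output
    return h, c

w = {k: 0.5 for k in ("wi", "ui", "bi", "wf", "uf", "bf",
                      "wo", "uo", "bo", "wg", "ug", "bg")}
h, c = 0.0, 0.0
for x in (1.0, 0.5, -0.5):   # a toy 1D input sequence, like line-image columns
    h, c = lstm_step(x, h, c, w)
print(round(h, 4))
```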
For recognition, we used a 1D bidirectional LSTM architecture, as described in [7]. We found that the 1D architecture outperforms its 2D or higher-dimensional siblings for printed OCR tasks. For all experiments reported in this paper, we used a modified version of the LSTM library described in [14]. That library provides 1D and multidimensional LSTM networks, together with ground-truth alignment using a forward-backward algorithm ("CTC", connectionist temporal classification [13]). The library also provides a heuristic decoding mechanism to map the frame-wise network output onto a sequence of symbols. We have reimplemented LSTM networks and forward-backward alignment from scratch and reproduced these results (our implementation uses a slightly different decoding mechanism). This implementation has been released in open-source form [15] (ocropus version 0.7).
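The decoding step that maps frame-wise outputs to a symbol sequence can be illustrated by the standard greedy "best path" scheme: take the most likely label per frame, collapse consecutive repeats, and drop the blank label. (The library's actual heuristic differs in its details; the alphabet and scores below are toy values.)

```python
def ctc_greedy_decode(frames, alphabet, blank=0):
    # frames: per-frame score vectors from the network's output layer.
    # Greedy ("best path") decoding: argmax label per frame, collapse
    # consecutive repeats, then drop the blank label.
    best = [max(range(len(f)), key=f.__getitem__) for f in frames]
    out = []
    prev = None
    for label in best:
        if label != prev and label != blank:
            out.append(alphabet[label])
        prev = label
    return "".join(out)

# Toy output over the alphabet {blank, 'c', 'a', 't'}.
frames = [
    [0.1, 0.8, 0.05, 0.05],   # 'c'
    [0.1, 0.8, 0.05, 0.05],   # 'c' (repeat, collapsed)
    [0.9, 0.03, 0.03, 0.04],  # blank separates repeated labels
    [0.1, 0.1, 0.7, 0.1],     # 'a'
    [0.1, 0.1, 0.1, 0.7],     # 't'
]
print(ctc_greedy_decode(frames, [None, "c", "a", "t"]))  # -> cat
```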
During training, randomly chosen input text-line images are presented as 1D sequences to the forward-propagation step through the LSTM cells, and the forward-backward alignment of the output is then performed. Errors are back-propagated to update the weights, and the process is repeated for the next randomly selected text-line image. Note that raw pixel values are used as the only features; no other sophisticated features were extracted from the text-line images. The features implicit in the 1D sequence are the baselines and x-heights of the individual characters.
4. EXPERIMENTAL EVALUATION
The aim of our experiments was to evaluate LSTM performance on multilingual OCR without the aid of language modelling or other language-specific assistance. To explore the cross-language performance of LSTM networks, a number of experiments were performed: we trained four separate LSTM networks, on English, German, French and a mixed set of all three languages. For testing, we have a total of 16 permutations, since each LSTM network was tested on
Table 1: Statistics on the number of text-line images in the English, French, German and mixed-script datasets.

Language       Total     Training   Test
English        85,350    81,600     4,750
French         85,350    81,600     4,750
German         114,749   110,400    4,349
Mixed-script   85,350    81,600     4,750
Table 2: Results of applying LSTM networks to multilingual OCR (character error rates). These results support our hypothesis that a single LSTM model trained on a mixture of scripts (from a single script family) can be used to recognize the text of the individual family members. Note that the error rates for the network trained on German tested on French, and for the network trained on English tested on French and German, were obtained by ignoring words containing special characters (umlauts and accented letters), to correctly gauge the effect of the language models of a particular language. LSTM networks trained on individual languages can also be used to recognize other scripts, but they show some language dependence. All these results were achieved without the aid of any language model.
Script \ Model   English (%)  German (%)  French (%)  Mixed (%)
English             0.5          1.22        4.1         1.06
German              2.04         0.85        4.7         1.2
French              1.8          1.4         1.1         1.05
Mixed-script        1.7          1.1         2.9         1.1
the respective script and on the other three scripts, e.g. the LSTM network trained on German was tested on German, French, English and mixed-script data. These results are summarized in Table 2, and some sample outputs are presented in Table 3. As the error measure, we used the ratio of insertions, deletions and substitutions relative to the ground truth; accuracy was measured at the character level.
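This error measure is the character error rate, computed from the Levenshtein distance between the network output and the ground-truth transcription. A minimal sketch (the function names are our own):

```python
def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance: counts the
    # insertions, deletions and substitutions needed to turn hyp into ref.
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def character_error_rate(ref, hyp):
    # Error rate at the character level, relative to ground-truth length.
    return edit_distance(ref, hyp) / float(len(ref))

# One substitution in an 11-character line:
print(round(character_error_rate("hello world", "hell0 world"), 4))  # -> 0.0909
```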
4.1 Database
A separate synthetic database for each language was developed using OCRopus [16] (ocropus-linegen). This utility requires a set of UTF-8 encoded text files and a set of true-type fonts; with these two inputs, one can artificially generate any number of text-line images. The utility also provides control over scanning artefacts such as distortion, jitter and other degradations. Separate corpora of text-line images in German, English and French were generated in commonly used fonts (including bold, italic and bold-italic variants) from freely available literature. These images were degraded using the degradation models of [17] to reflect scanning artefacts; there are four degradation parameters, namely elastic elongation, jitter, sensitivity and threshold. Sample text-line images from our database are shown in Figure 1. Each database is further divided into training and test sets; statistics on the number of text-line images for each of the four scripts are given in Table 1.
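As a rough illustration of such degradations, the following sketch applies per-row jitter, per-pixel sensitivity noise and a binarization threshold to a synthetic line image. Elastic elongation is omitted, and while the parameter names follow the paper, the formulas are simplified stand-ins for the actual models of [17]:

```python
import random

def degrade(image, jitter=1, sensitivity=0.05, threshold=0.5, seed=0):
    # Simplified stand-in for Baird-style degradation models [17]:
    # per-row horizontal jitter, per-pixel sensitivity (sensor) noise,
    # and a final binarization threshold.
    rng = random.Random(seed)
    out = []
    for row in image:
        shift = rng.randint(-jitter, jitter)               # jitter: small offsets
        shifted = row[-shift:] + row[:-shift] if shift else row[:]
        noisy = [p + rng.gauss(0.0, sensitivity) for p in shifted]  # sensor noise
        out.append([1 if p > threshold else 0 for p in noisy])      # threshold
    return out

# A clean synthetic "stroke": a black bar on a white line image.
clean = [[1 if 20 <= x < 80 else 0 for x in range(100)] for _ in range(32)]
dirty = degrade(clean)
print(len(dirty), len(dirty[0]))  # -> 32 100
```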
4.2 Parameters
The text lines were normalized to a height of 32 pixels in the preprocessing step. Both the left-to-right and the right-to-left LSTM layers contain 100 LSTM memory blocks. The learning rate was set to 1e-4 and the momentum to 0.9. Training was carried out for one million steps (roughly corresponding to 100 epochs, given the size of the training set). Training errors were reported every 10,000 training steps and plotted, and the network corresponding to the minimum training error was used for the test-set evaluation.
4.3 Results
Since English has no umlauts (German) or accented letters (French), words containing such special characters were omitted from the recognition results when testing the LSTM model trained on German on French, and the model trained on English on French and German. The reason was to be able to correctly gauge the effect of not using language models: had those words not been removed, the resulting error would also contain a proportion of errors caused by characters that were absent from the training alphabet. By removing the words with special characters, the true performance of an LSTM network trained on a language with a smaller alphabet can be evaluated on a language with a larger alphabet. It should be noted that these results were obtained without the aid of any post-processing step, such as language modelling or the use of dictionaries to correct OCR errors.
The LSTM model trained on mixed data obtained similar recognition results (around 1% recognition error) when applied to English, German and French individually. The other results indicate a small language dependence, in that LSTM models trained on a single language yielded lower error rates when tested on the same script than when evaluated on other scripts.
To gauge the magnitude of the effect of language modelling, we compared our results with the Tesseract open-source OCR system [18]. We applied the latest available models (as of the submission date) for English, French and German to the same test data. The Tesseract system achieved higher error rates than the LSTM-based models. Tesseract's model for English yielded 1.33%, 5.02%, 5.09% and 4.82% recognition error when applied to English, French, German and mixed data respectively. The model for French yielded 2.06%, 2.7%, 3.5% and 2.96% recognition error when applied to English, French, German and mixed data respectively, while the model for German yielded 1.85%, 2.9%, 6.63% and 4.36% recognition error on the same four test sets. These results show that the absence of language modelling, or the application of a mismatched language model, affects recognition. Since no Tesseract model for mixed data is available, the effect of evaluating such a model on the individual scripts could not be computed.
5. DISCUSSION AND CONCLUSIONS
The results presented in this paper show that LSTM networks can be used for multilingual OCR. LSTM networks do not learn a particular language model internally (nor do we need such a model as a post-processing step). They show great promise in learning the various shapes of a character across different fonts and degradations (as evident from our highly versatile data).

Table 3: Sample outputs from the four LSTM networks trained on English, German, French and mixed data. An LSTM network trained on a specific language is unable to recognize the special characters of other languages, as they were not part of its training; it is therefore necessary to exclude these errors from the final error score. Thus we can train an LSTM model on the mixed data of a script family and use it to recognize the individual languages of this family with very low recognition error. [The table shows three sample text-line images, each followed by the recognition output of the English, German, French and mixed-data models.]

The language dependence is observable, but its effects are small compared to other state-of-the-art OCR systems, where the absence of a language model leads to markedly worse results. To gauge the language dependence more precisely, one could train LSTM networks on data generated randomly from n-gram statistics and test those models on natural languages. We are currently working in this direction, and the results will be reported elsewhere.
In the following, we analyse the errors made by our LSTM networks when applied to other scripts. The top 5 confusions for each case are tabulated in Table 4. The case of applying an LSTM network to the language on which it was trained is not discussed here, as it is not relevant to the cross-language performance of LSTM networks.
Most of the errors made by the LSTM network trained on mixed data are non-recognitions (deletions) of certain characters such as l, t, r and i. These errors may be reduced by better training.
Looking at the first column of Table 4 (applying the LSTM network trained on English to the other three scripts), most errors are due to confusions between characters of similar shape, such as I with l (and vice versa), Z with 2, and c with e. Two confusions, namely Z with A and Z with L, are interesting because there is apparently no shape similarity between them. One possible explanation is that Z is the least frequent letter in English2, so there may be too few Zs in the training samples, resulting in its poor recognition. Two other noticeable errors (also present in the other models) are the unrecognised space and ' (i.e. these characters were deleted).
2 http://en.wikipedia.org/wiki/Letter_frequency
For the LSTM network trained on German (second column), most of the top errors are due to the inability to recognize a particular letter. The top errors when applying the LSTM network trained on French to other scripts are confusions of w/W with v/V. An interesting observation, and a possible explanation of this behaviour, is that the relative frequency of w is very low in French (see footnote); in other words, 'w' can be considered a special character with respect to French. This is thus a language-dependent issue, arising when the French model is applied to German and English, that is not observable with the mixed-data model.
This work can be extended in several directions. First, more European languages, such as Italian, Spanish and Dutch, may be included in the current set-up to train an all-in-one LSTM network for these languages. Secondly, other script families, especially Nabataean and Indic scripts, can be tested to further validate our hypothesis empirically.
6. REFERENCES
[1] A. C. Popat, "Multilingual OCR Challenges in Google Books," 2012. [Online]. Available: http://dri.ie/sites/default/files/files/popat_multilingual_ocr_challenges-handout.pdf
[2] R. Smith, D. Antonova, and D. S. Lee, “Adapting the
Tesseract Open Source OCR Engine for Multilingual
OCR,” in Int. Workshop on Multilingual OCR, Jul.
2009.
[3] M. A. Obaida, M. J. Hossain, M. Begum, and M. S.
Alam, “Multilingual OCR (MOCR): An Approach to
Classify Words to Languages,” Int’l Journal of
Computer Applications, vol. 32, no. 1, pp. 46–53, Oct.
2011.
Table 4: Top confusions when applying the LSTM models across languages. The confusions of an LSTM model on the language for which it was trained are not listed, as they are not relevant to the cross-language analysis. An empty left-hand side of "←" denotes the garbage class, i.e. the character was not recognized at all. When the LSTM network trained on English was applied to other scripts, the resulting top errors are similar: shape confusions between characters. Non-recognition of "space" and "'" are other noticeable errors. For the network trained on German, most errors are due to deletion of characters. Confusions of w/W with v/V are the top confusions when the LSTM network trained on French was applied to other scripts.
Script \ Model   English                     German                      French                       Mixed
English          -                           ←space, ←c, ←t, ←0, v←y     v←w, vv←w, ←space, ←w, l←I   ←space, ←t, ←0, l←I, ←l
German           l←I, L←Z, A←Z, c←e, 2←Z     -                           v←w, û←ü, V←W, ←space, vv←w  ←space, ←t, ←l, ←i, ←r
French           ←0, ←space, I←l, t←l, I←!   ←space, ←0, e←, ←c, ←l      -                            ←space, ←i, e←é, ←l, ←0
Mixed-script     ←0, l←I, I←l, ←space, t←l   ←space, ←0, g←q, e←, T←l0   v←w, ô←ö, â←ä, V←W, û←ü      -
[4] P. Natarajan, Z. Lu, R. M. Schwartz, I. Bazzi, and
J. Makhoul, “Multilingual Machine Printed OCR,”
IJPRAI, vol. 15, no. 1, pp. 43–63, 2001.
[5] T. M. Breuel, A. Ul-Hasan, M. A. Al-Azawi, and
F. Shafait, “High Performance OCR for English and
Fraktur using LSTM Networks,” in Int. Conf. on
Document Analysis and Recognition, Aug. 2013.
[6] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[7] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber, "A Novel Connectionist System for Unconstrained Handwriting Recognition," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 855–868, May 2009.
[8] J. L. Elman, “Finding Structure in Time.” Cognitive
Science, vol. 14, no. 2, pp. 179–211, 1990.
[9] H. Jaeger, “Tutorial on Training Recurrent Neural
Networks, Covering BPTT, RTRL, EKF and the
‘Echo State Network’ approach,” Sankt Augustin,
Tech. Rep., 2002.
[10] A. W. Senior, “Off-line Cursive Handwriting
Recognition using Recurrent Neural Networks,” Ph.D.
dissertation, England, 1994.
[11] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies," in A Field Guide to Dynamical Recurrent Neural Networks, S. C. Kremer and J. F. Kolen, Eds. IEEE Press, 2001.
[12] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. on Neural Networks, vol. 5, no. 2, pp. 157–166, Mar. 1994.
[13] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks," in ICML, Pennsylvania, USA, 2006, pp. 369–376.
[14] A. Graves, “RNNLIB: A recurrent neural network
library for sequence learning problems.” [Online].
Available: http://sourceforge.net/projects/rnnl
[15] “OCRopus - Open Source Document Analysis and
OCR system.” [Online]. Available:
https://code.google.com/p/ocropus
[16] T. M. Breuel, “The OCRopus open source OCR
system,” in DRR XV, vol. 6815, Jan. 2008, p. 68150F.
[17] H. S. Baird, "Document Image Defect Models," in Structured Document Image Analysis, H. S. Baird, H. Bunke, and K. Yamamoto, Eds. New York: Springer-Verlag, 1992.
[18] R. Smith, “An Overview of the Tesseract OCR
Engine,” in ICDAR, 2007, pp. 629–633.